# **Exploratory Data Analysis of Bank Customer Churn**

After noticing an increase in the number of customers leaving, the bank collected data over a 6-month period to evaluate the problem. 10,000 customers were selected at random across three countries: France, Germany and Spain.

This is a classification problem: we have to predict which customers are more likely to leave the bank given how long they have been customers, whether they are active members, their age, gender, credit score, estimated salary, number of products, and whether they hold a credit card.

In this kernel we explore visually the possible reasons for churning.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

### Plotly
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.figure_factory as ff
# plotly.plotly moved to the separate chart_studio package in plotly 4.
# It is not used below, so the import is left out.
from plotly import tools
init_notebook_mode(connected=True)


# Altair
import altair as alt

### Removes warnings that occasionally show up
import warnings
warnings.filterwarnings('ignore')

In [None]:
import json  # need it for json.dumps
from IPython.display import HTML

# Create the correct URLs for require.js to find the Javascript libraries
vega_url = 'https://cdn.jsdelivr.net/npm/vega@' + alt.SCHEMA_VERSION
vega_lib_url = 'https://cdn.jsdelivr.net/npm/vega-lib'
vega_lite_url = 'https://cdn.jsdelivr.net/npm/vega-lite@' + alt.SCHEMA_VERSION
vega_embed_url = 'https://cdn.jsdelivr.net/npm/vega-embed@3'
noext = "?noext"

altair_paths = {
    'vega': vega_url + noext,
    'vega-lib': vega_lib_url + noext,
    'vega-lite': vega_lite_url + noext,
    'vega-embed': vega_embed_url + noext
}

workaround = """
requirejs.config({{
    baseUrl: 'https://cdn.jsdelivr.net/npm/',
    paths: {paths}
}});
"""

# Define the function for rendering
def add_autoincrement(render_func):
    # Keep track of unique <div/> IDs
    cache = {}
    def wrapped(chart, id="vega-chart", autoincrement=True):
        """Render an altair chart directly via javascript.
        
        This is a workaround for functioning export to HTML.
        (It probably messes up other ways to export.) It will
        cache and autoincrement the ID suffixed with a
        number (e.g. vega-chart-1) so you don't have to deal
        with that.
        """
        if autoincrement:
            if id in cache:
                counter = 1 + cache[id]
                cache[id] = counter
            else:
                cache[id] = 0
            actual_id = id if cache[id] == 0 else id + '-' + str(cache[id])
        else:
            if id not in cache:
                cache[id] = 0
            actual_id = id
        return render_func(chart, id=actual_id)
    # Cache will stay defined and keep track of the unique div Ids
    return wrapped


@add_autoincrement
def render_alt(chart, id="vega-chart"):
    # This below is the javascript to make the chart directly using vegaEmbed
    chart_str = """
    <div id="{id}"></div><script>
    require(["vega-embed"], function(vegaEmbed) {{
        const spec = {chart};     
        vegaEmbed("#{id}", spec, {{defaultStyle: true}}).catch(console.warn);
    }});
    </script>
    """
    return HTML(
        chart_str.format(
            id=id,
            chart=json.dumps(chart) if isinstance(chart, dict) else chart.to_json(indent=None)
        )
    )

HTML("".join((
    "<script>",
    workaround.format(paths=json.dumps(altair_paths)),
    "</script>"
)))

In [None]:
train = pd.read_csv("../input/Churn_Modelling.csv")

In [None]:
train.head()

A quick look at the dataset tells us that the columns RowNumber, CustomerId and Surname are identifiers and should not have any impact on a customer leaving the bank. Of the remaining variables, Geography, Gender, NumOfProducts, HasCrCard and IsActiveMember are categorical, and CreditScore, Age, Tenure, Balance and EstimatedSalary are numerical.
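
The split into identifier, categorical and numerical columns can be sketched on a toy frame. The values and the `max_levels` threshold below are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame reusing some of the dataset's column names; all values are invented.
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4, 5, 6],
    "CustomerId": [101, 102, 103, 104, 105, 106],
    "Surname": ["A", "B", "C", "D", "E", "F"],
    "Geography": ["France", "Spain", "Germany", "France", "Spain", "Germany"],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [42, 41, 39, 50, 33, 61],
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82, 113755.78],
    "Exited": [1, 0, 1, 0, 0, 1],
})

# Identifier columns carry no information about churn, so drop them with the target.
id_cols = ["RowNumber", "CustomerId", "Surname"]
features = toy.drop(columns=id_cols + ["Exited"])

# Hypothetical rule: a column with few distinct values is treated as categorical.
max_levels = 3
categorical = [c for c in features.columns if features[c].nunique() <= max_levels]
numerical = [c for c in features.columns if c not in categorical]
print(categorical, numerical)
```

A cardinality threshold is only a heuristic; a column such as NumOfProducts is numeric in storage but behaves like a category.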

In [None]:
exited_1 = train.query('Exited==1')
exited_0 = train.query('Exited==0')

By default, Altair refuses to render datasets with more than 5,000 rows, so we split the data into two dataframes based on the Exited column.
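
Whether a split keeps every part under the row limit depends on the class balance. The sketch below checks this on a made-up 10,000-row frame with 20% churn; the constant name and the 80/20 ratio are assumptions for illustration.

```python
import pandas as pd

ALTAIR_DEFAULT_MAX_ROWS = 5000  # Altair's default limit on rows per chart

# Invented stand-in for the real data: 10,000 rows, one in five has Exited == 1.
toy = pd.DataFrame({"Exited": [1, 0, 0, 0, 0] * 2000})

# Size of each part after splitting on the target column.
sizes = {int(label): len(part) for label, part in toy.groupby("Exited")}

# True where the part fits under the default limit.
fits = {label: n <= ALTAIR_DEFAULT_MAX_ROWS for label, n in sizes.items()}
print(sizes, fits)
```

With an imbalance like this, the larger part still exceeds the limit, so a split alone may not be enough and the row limit itself has to be raised.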

In [None]:
france_1 = exited_1[exited_1['Geography']=='France']
france_0 = exited_0[exited_0['Geography']=='France']

germany_1 = exited_1[exited_1['Geography'] ==  'Germany']
germany_0 = exited_0[exited_0['Geography'] == 'Germany']

spain_1 = exited_1[exited_1['Geography'] ==  'Spain']
spain_0 = exited_0[exited_0['Geography'] == 'Spain']

#print(france_1.RowNumber.count(),france_0.RowNumber.count(),
#      germany_1.RowNumber.count(),germany_0.RowNumber.count(),
#      spain_1.RowNumber.count(),spain_0.RowNumber.count())

In [None]:
fig = {
    'data': [
        {
            'labels': ['Churn','Not Churn'],
            'values': [france_1.RowNumber.count(),france_0.RowNumber.count()],
            'type': 'pie',
            'name': 'France',
            'marker': {'colors': ['rgb(14, 111, 175)','rgb(255, 161, 0)']},
            'domain': {'x': [0, .48],
                       'y': [.51, 1]},
            'hoverinfo':'label+percent+name',
            'textinfo':'percent',
            'title':'France'
        },
         {
            'labels': ['Churn','Not Churn'],
            'values': [germany_1.RowNumber.count(),germany_0.RowNumber.count()],
            'marker': {'colors':['rgb(14, 111, 175)','rgb(255, 161, 0)']},
            'type': 'pie',
            'name': 'Germany',
            'domain': {'x': [.40, .60],
                       'y': [.51, 1]},
            'hoverinfo':'label+percent+name',
            'textinfo':'percent',
            'title':'Germany'        },
        {
            'labels': ['Churn','Not Churn'],
            'values': [spain_1.RowNumber.count(),spain_0.RowNumber.count()],
            'marker': {'colors':['rgb(14, 111, 175)','rgb(255, 161, 0)']},
            'type': 'pie',
            'name': 'Spain',
            'domain': {'x': [.52, 1],
                       'y': [.51, 1]},
            'hoverinfo':'label+percent+name',
            'textinfo':'percent',
            'title':'Spain'

        }
       
       
    ],
    'layout': {'title': 'Country-Wise Churn Rate',
               'showlegend': True}
}

iplot(fig, filename='basic_pie_chart')

These pie charts show the country-wise churn rate for the three countries. Germany clearly has the highest churn rate of the three, while France and Spain have almost the same churn rate.
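
The same rates can be read off numerically: since Exited is 0 or 1, its mean within a group is that group's churn rate. A minimal sketch on invented data:

```python
import pandas as pd

# Invented data: four customers per country.
toy = pd.DataFrame({
    "Geography": ["France"] * 4 + ["Germany"] * 4 + ["Spain"] * 4,
    "Exited": [1, 0, 0, 0,  1, 1, 0, 0,  0, 0, 0, 1],
})

# Mean of a 0/1 column per group = share of churned customers in that group.
churn_rate = toy.groupby("Geography")["Exited"].mean()
print(churn_rate)
```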

In [None]:
france_male_1 =france_1[france_1['Gender']== 'Male' ]
france_female_1 =france_1[france_1['Gender']== 'Female' ]

germany_male_1 = germany_1[germany_1['Gender']== 'Male' ]
germany_female_1 = germany_1[germany_1['Gender']== 'Female' ]

spain_male_1 = spain_1[spain_1['Gender']== 'Male' ]
spain_female_1 = spain_1[spain_1['Gender']== 'Female' ]


In [None]:
trace1 = go.Bar(
    x=['France', 'Germany', 'Spain'],
    y=[france_male_1.RowNumber.count(), germany_male_1.RowNumber.count(), spain_male_1.RowNumber.count()],
    name='Male'
)
trace2 = go.Bar(
    x=['France', 'Germany', 'Spain'],
    y=[france_female_1.RowNumber.count(), germany_female_1.RowNumber.count(), spain_female_1.RowNumber.count()],
    name='Female'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='grouped-bar')

The bar graph above shows the number of churned customers by gender for each country. In general, more female than male customers left the bank. Note that the bars are counts, not rates, so the comparison also depends on how many male and female customers each country has. This pattern may point to a problem at the bank itself. Since such a cause is intangible, we can at best speculate that it has something to do with the services offered.
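
Counts and rates can tell different stories. In the invented example below, more women than men churn in absolute terms, yet the churn rate is the same for both groups because there are more women overall.

```python
import pandas as pd

# Invented data: six female and four male customers.
toy = pd.DataFrame({
    "Gender": ["Female"] * 6 + ["Male"] * 4,
    "Exited": [1, 1, 1, 0, 0, 0,  1, 1, 0, 0],
})

# Churned count and group size per gender, then the rate derived from both.
summary = toy.groupby("Gender")["Exited"].agg(churned="sum", total="count")
summary["rate"] = summary["churned"] / summary["total"]
print(summary)
```

Running the same aggregation on the real data, grouped by Geography and Gender, would show whether the gap in the bar chart survives once group sizes are taken into account.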

In [None]:
target_col = ["Exited"]
cat_cols   = train.nunique()[train.nunique() < 6].keys().tolist()
cat_cols   = [x for x in cat_cols if x not in target_col]
num_cols   = [x for x in train.columns if x not in cat_cols + target_col]


In [None]:
from altair import pipe, limit_rows, to_values
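# Use the data passed in by Altair, not the global `train`, so that charts
# built from other dataframes (e.g. exited_1) are rendered with their own rows.
t = lambda data: pipe(data, limit_rows(max_rows=10000), to_values)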
alt.data_transformers.register('custom', t)
alt.data_transformers.enable('custom')

In [None]:
interval = alt.selection_interval()

points = alt.Chart(train).mark_point().encode(
  x='Age',
  y='Balance',
  color=alt.condition(interval, 'Geography', alt.value('lightgray'))
).properties(
  selection=interval
)

histogram = alt.Chart(train).mark_bar().encode(
  x='count()',
  y='Geography',
  color='Geography'
).transform_filter(interval)

render_alt(points & histogram)


The scatterplot above shows the country-wise variation of balance with age. From the graph it is evident that there is no visible correlation between balance and age.
We see that the bank has the most customers in France, followed by Germany and then Spain. We also see that only France and Spain have accounts with a zero balance. This is ironic, as Germany has the highest churn rate. It argues against the hypothesis that customers who don't use their account, i.e. who have a zero balance, are more likely to leave the bank.
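
A visual impression of "no correlation" can be backed by a number. The sketch below computes the Pearson coefficient with pandas on invented values chosen so that the coefficient is zero.

```python
import pandas as pd

# Invented values: balance rises and then falls with age.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Balance": [100.0, 200.0, 200.0, 100.0],
})

# Pearson correlation between the two columns.
r = toy["Age"].corr(toy["Balance"])
print(r)
```

A coefficient near zero only rules out a linear trend. The toy values have an obvious rise-and-fall shape and still give zero, so the scatterplot remains worth looking at.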


In [None]:
#function  for histogram for customer churn types
def histogram(column) :
    trace1 = go.Histogram(x  = exited_1[column],
                          histnorm= "percent",
                          name = "Exited",
                          
                          marker = dict(line = dict(width = .5,
                                                    color = "black"
                                                    ), color = '#dd3b3b'
                                        ),
                         opacity = .9 
                         ) 
    
    trace2 = go.Histogram(x  = exited_0[column],
                          histnorm = "percent",
                          name = "Not Exited",
                         
                          marker = dict(line = dict(width = .5,
                                              color = "black"
                                             ), color = '#336dcc'
                                 ),
                          opacity = .9
                         )
    
    data = [trace1,trace2]
    layout = go.Layout(dict(title =column + " Distribution",
                           
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                             title = column,
                                             
                                             zerolinewidth=1,
                                             ticklen=5,
                                             gridwidth=2
                                            ),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                             title = "Percent",
                                             zerolinewidth=1,
                                             ticklen=5,
                                             gridwidth=2
                                            ),
                           )
                      )
    fig  = go.Figure(data=data,layout=layout)
    
    iplot(fig)

In [None]:
histogram('Age')

The histogram above shows the age distribution of the customers who stayed and of those who left. Because each histogram is normalized to percent, the bars are shares within each group rather than raw counts. The share of customers who stayed decreases as age increases, and in the critical 30 to 40 age range the share of customers who stayed is higher than the share who left. After 40, more customers are leaving the bank. One possible reason is that competitors are offering them better incentives.
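
What percent normalization does can be sketched with NumPy: each group's bin counts are divided by that group's own total. The ages and the 10-year bins below are invented for illustration.

```python
import numpy as np

# Invented ages for the two groups.
ages_exited = np.array([35, 42, 45, 48, 51, 55, 58, 62])
ages_stayed = np.array([25, 28, 31, 33, 35, 37, 38, 41, 44, 52])
bins = np.arange(20, 71, 10)  # edges 20, 30, ..., 70

def percent_hist(values, bins):
    """Bin counts expressed as a percentage of the group's own size."""
    counts, _ = np.histogram(values, bins=bins)
    return 100 * counts / counts.sum()

pct_exited = percent_hist(ages_exited, bins)
pct_stayed = percent_hist(ages_stayed, bins)
print(pct_exited, pct_stayed)
```

Both arrays sum to 100, so the two groups can be compared bin by bin even though they differ in size.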


In [None]:
histogram('Balance')

This interactive graph shows the balance distribution for the customers who left and for those who stayed. The balance of the exited customers roughly follows a normal distribution, except for customers with a zero balance, who make up 24.5% of all churned customers. The same holds for the customers who stayed with the bank. This supports our earlier finding that it is not only customers with a zero balance who leave the bank.
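
The share of zero-balance customers within each group can be computed directly. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented data: four churned and four retained customers.
toy = pd.DataFrame({
    "Balance": [0.0, 0.0, 120000.0, 95000.0, 0.0, 80000.0, 110000.0, 130000.0],
    "Exited": [1, 0, 1, 1, 0, 0, 0, 1],
})

# Flag zero balances, then average the flag within each Exited group.
zero_share = toy["Balance"].eq(0).groupby(toy["Exited"]).mean()
print(zero_share)
```

Comparing the two shares on the real data shows whether a zero balance is more common among churned customers or among those who stayed.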

In [None]:
histogram('CreditScore')

Next we wanted to see whether there is any pattern between credit score and customers leaving the bank. To explore this, we created an interactive plot of the credit score distribution for the customers who left and for those who did not. Neither group shows a particular pattern, except that customers with a credit score below 400 tend to leave the bank.

In [None]:
def gen_boxplot(df):
    """Return one box trace per column of df."""
    return [go.Box(name=feature, y=df[feature]) for feature in df]

# num_cols[6:] selects the last two numerical columns: Balance and EstimatedSalary
new_df = train[num_cols[6:]]
data = gen_boxplot(new_df)
iplot(data)

The two interactive box plots show the quantile distribution for balance and estimated salary. The plots were created to evaluate the behaviour of outliers. 
As we can see there are no outliers for the given two variables.
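
A box plot usually marks a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. Assuming that rule, the check can be written out as below; the function name and the sample values are invented.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Invented samples: one with an extreme value, one without.
with_outlier = pd.Series([10, 12, 11, 13, 12, 11, 100])
without_outlier = pd.Series([1, 2, 3, 4, 5])

found = iqr_outliers(with_outlier).tolist()
none_found = iqr_outliers(without_outlier).tolist()
print(found, none_found)
```

An empty result for Balance and EstimatedSalary on the real data would agree with what the box plots show.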

In [None]:
brush = alt.selection(type='interval', encodings=['x'])

bars =alt.Chart(train).mark_bar().encode(
    alt.X('Balance', bin=True),
    alt.Y('count()'),
    alt.Color('Geography'),
    opacity=alt.condition(brush, alt.OpacityValue(1), alt.OpacityValue(0.7))
).add_selection(
    brush
)


render_alt(alt.layer(bars,  data=train))

This interactive chart shows the distribution of balance for the three countries.
The balance variable was binned with an interval of 50,000.
In the first bin (0 to 50,000), Germany has an almost negligible number of records. In the bins covering 100,000 to 200,000, on the other hand, Germany has the most records, followed by France and Spain. This suggests that the bank's customers in Germany hold higher balances. But why, then, does Germany have the highest churn rate? It could be that other banks in Germany offer better products than this bank. Since we don't have any data on this, the best we can do is speculate.
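
The binned counts behind such a chart can be tabulated with pandas. The sketch assumes a fixed bin width of 50,000 as stated in the text, whereas Altair's `bin=True` picks its step automatically; all rows are invented.

```python
import pandas as pd

BIN_WIDTH = 50000  # bin width taken from the text

# Invented balances for a handful of customers.
toy = pd.DataFrame({
    "Geography": ["France", "France", "Spain", "Spain",
                  "Germany", "Germany", "Germany"],
    "Balance": [0.0, 120000.0, 0.0, 60000.0, 110000.0, 130000.0, 160000.0],
})

# Lower edge of the bin each balance falls into.
toy["BinStart"] = (toy["Balance"] // BIN_WIDTH * BIN_WIDTH).astype(int)

# Rows: bin start, columns: country, cells: number of customers.
table = pd.crosstab(toy["BinStart"], toy["Geography"])
print(table)
```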

In [None]:
brush = alt.selection(type='interval', encodings=['x'])

bars = alt.Chart().mark_bar().encode(
    x=alt.X('Age'),
    y=alt.Y('mean(EstimatedSalary)',scale=alt.Scale(domain=(40000, 200000))),
    #x='CreditScore',
    #y='mean(EstimatedSalary)',
    #color='Geography',
    opacity=alt.condition(brush, alt.OpacityValue(1), alt.OpacityValue(0.7))
).add_selection(
    brush
)

line = alt.Chart().mark_rule(color='firebrick').encode(
    y='mean(EstimatedSalary)',
    size=alt.SizeValue(3)
).transform_filter(
    brush
)

render_alt(alt.layer(bars, line, data=train))

It is generally assumed that a person's salary increases with age. To check this assumption, we plotted estimated salary against age. The plot does not support it, as there is no clear rise in salary. One odd observation is that the mean estimated salary exceeds 180,000 after age 80. This could be because these customers are generating income from a late-blooming business.
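
A mean per age is only as reliable as the number of customers behind it. The sketch below reports the group size next to the mean; the rows are invented, and whether the oldest ages in the real data are this sparse is an assumption to check, not a finding.

```python
import pandas as pd

# Invented data: many customers at common ages, a single one at age 82.
toy = pd.DataFrame({
    "Age": [30, 30, 30, 30, 45, 45, 45, 82],
    "EstimatedSalary": [50000, 150000, 90000, 110000,
                        80000, 120000, 100000, 190000],
})

# Mean salary and number of customers for every age.
by_age = toy.groupby("Age")["EstimatedSalary"].agg(["mean", "count"])
print(by_age)
```

If the high means at advanced ages rest on only a few customers, sampling noise is a simpler explanation than a change in income.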

In [None]:
target = "Geography"
feature = "CreditScore"

fig = ff.create_distplot(
    [train[train[target] == y][feature].values for y in train[target].unique()], 
    train[target].unique(), 
    show_hist=False,
    show_rug=False,
)

for d in fig['data']:
    d.update({'fill' : 'tozeroy'})

layout = go.Layout(
    title   = "Country-wise Credit Behaviour",
    xaxis   = dict(title = "Credit"),
    yaxis   = dict(title = "Density"),
)

fig["layout"] = layout
iplot(fig)

The interactive density plot shows the distribution of credit score by country. The distributions for the three countries are almost identical. Compared with a histogram, a density plot has the advantage of giving a smoother view of the distribution.
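
The smoothing comes from kernel density estimation: every observation contributes a small bell curve, and the curves are averaged. The sketch below is a plain NumPy version with a Gaussian kernel. It is not necessarily how `create_distplot` computes its curve, and the bandwidth of 25 and the three scores are arbitrary assumptions.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

scores = np.array([600.0, 650.0, 700.0])  # invented credit scores
grid = np.arange(400.0, 901.0, 1.0)       # evaluation points, step 1
density = gaussian_kde_1d(scores, grid, bandwidth=25.0)

# Riemann sum with step 1: a density should integrate to about 1.
area = density.sum() * 1.0
peak = grid[density.argmax()]
print(area, peak)
```

A larger bandwidth gives a smoother curve but can hide real structure; a smaller one follows the data more closely and looks noisier.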

In [None]:
interval = alt.selection_interval()

base = alt.Chart(exited_1).mark_point().encode(
  y='Age',
  color=alt.condition(interval, 'Geography', alt.value('lightgray'))
).properties(
  selection=interval
)
render_alt(base.encode(x='EstimatedSalary') | base.encode(x='Balance'))


The graph above shows how estimated salary and balance vary with age for the customers who left. For the customers with a zero balance, we see that the estimated salary is scattered across the whole range. From this we can speculate that there is some problem with the services offered by the bank, and that this could be the reason why people are leaving it.

## Conclusion

Even though we came to the conclusion that there is some problem with the services offered by the bank, we cannot say it for sure. This is because there are some variables that are missing which could help us in determining the exact reason as to why customers are leaving this bank. The variables that could help us in predicting the reason for churning are:
1. Type of Account
2. Account maintenance fees (if any)
3. Card maintenance fees (if any)
4. Maximum transferable amount per day
5. Rewards offered

Having these variables would shed some light on the reasons for customer churn.