Churn Assignment Part B - Data Visualisation

In [1569]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('churn_data.csv')
df.head()


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0



You are tasked with doing some exploratory data analysis, which is the first step in building a model to predict churn. A customer has ‘churned’ when they have stopped using a service or subscription. Predicting when customers will churn is a key tool in marketing that companies use to help retain their customers. The column related to whether a customer has churned is called ‘Exited’ in this dataset. To find the churn rate for a group, you need to take the average churn across that group.

In reality, the process of exploring the data is very large. In this assignment we will look at a subset of the total plots you would need to complete this.

First you should look at the differences in churn rates, split by the different categorical variables. Produce the appropriate visualisation to compare the **average churn rate**, split by:



## Churn Rate vs Average Churn rate Analysis and Visualization

***
## Geography
**Answer**: *The churn rate for a country is the number of customers who exited divided by the total number of customers in that country, multiplied by 100.*
*Because we want the rate by geography, we first group by `Geography`. `Exited` is 1 for customers who left, so summing that column gives the number of exited customers, and dividing by the group count gives the share, which we express as a percentage. We then plot a bar chart with proper labels and a title, add a horizontal line for the overall average, and analyse the result.*
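Because `Exited` is a 0/1 flag, dividing the group sum by the group count should be the same as taking the group mean. A minimal sketch of that equivalence, using a small made-up frame (the `toy` rows are illustrative assumptions, not rows from `churn_data.csv`):

```python
import pandas as pd

# Toy data for illustration only; not rows from churn_data.csv.
toy = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "Spain"],
    "Exited":    [1, 0, 1, 1, 0, 0],
})

grouped = toy.groupby("Geography")["Exited"]

# Explicit form: exited customers / all customers, per group.
by_ratio = grouped.sum() / grouped.count()

# Equivalent shortcut: the mean of a 0/1 column is the share of 1s.
by_mean = grouped.mean()

# Express the rate as a percentage, rounded as in the notebook.
churn_pct = (by_mean * 100).round(2)
print(churn_pct)
```

With the real data, the same shortcut would be `df.groupby('Geography')['Exited'].mean()`.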

In [1571]:
# i. Geography
churn_rate_geo=((df.groupby('Geography')['Exited'].sum()/df.groupby('Geography')['Exited'].count())*100).round(2).reset_index()
churn_rate_geo

Unnamed: 0,Geography,Exited
0,France,16.15
1,Germany,32.44
2,Spain,16.67


In [1572]:
total_average_churn_rate =((df['Exited'].sum())/(df['Exited'].count())*100).round(2)
total_average_churn_rate

20.37

*Now I want to use a bar chart to plot the churn rate in categories and a line in the same chart for the total average churn rate to have a base for comparison.*

In [1573]:
total_average_df=pd.DataFrame({'Geography':['Total Average'],'Exited':[total_average_churn_rate]})
churn_rate_geo= pd.concat([churn_rate_geo,total_average_df],ignore_index=True)
churn_rate_geo.head()

Unnamed: 0,Geography,Exited
0,France,16.15
1,Germany,32.44
2,Spain,16.67
3,Total Average,20.37


In [1574]:
#plotly Express bar plot
fig=px.bar(churn_rate_geo.iloc[:3],x='Geography',y='Exited',
           labels={'Exited':'Churn rate (%)','Geography':'Countries'},
           color='Geography',
           color_discrete_map ={'France':'navy','Germany':'teal','Spain':'lime'},
           text='Exited',
)
fig.update_traces(textfont=dict(color='black',family='Arial',size=12),textposition='outside')

fig.add_hline(y = total_average_churn_rate, line=dict(
            color='darkred',dash='longdashdot',width=2)
        )
fig.update_layout(title=dict(text="<b>Churn Rate by Geography with Total Average Churn Line</b>",
                             x=0.5,y=0.95),
                             legend=dict(title_text="<b>Geography</b>"),
                             margin=dict(l=50,r=50,t=50,b=50))
fig.add_annotation(x=(len(churn_rate_geo['Geography']))-1,
                   y=total_average_churn_rate,
                   text=f'Total Average Churn Rate(Threshold line):{total_average_churn_rate:.2f}%',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=4,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   
                   borderwidth=1,
                   height=30,
                   ax=100,
                   ay=-50
                   )
fig.show()


As we can see, Germany (32.44%) has roughly twice the churn rate of Spain (16.67%) and France (16.15%), and it is well above the overall average churn rate of 20.37%. Customers in France and Spain are less likely to exit than the average customer, while customers in Germany are more likely to exit.
Possible causes include stronger competition or better offers in the German market, cultural factors, specific product features, bugs or service issues, problems with staff or technical services in Germany, bad word of mouth, or a portal that does not support the German language. The German branch may also be new and not yet well positioned in the market. These are hypotheses only; we should look into the details to find the actual cause of the high churn rate in Germany.
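The "roughly twice" comparison can be checked with a few lines of arithmetic. This is only a sketch: the rates are copied by hand from the output table above, and the variable names are illustrative.

```python
# Churn rates (%) copied from the output table above.
rates = {"France": 16.15, "Germany": 32.44, "Spain": 16.67}
overall = 20.37

# Gap between each country and the overall average, in percentage points.
for country, rate in rates.items():
    diff = rate - overall
    print(f"{country}: {rate:.2f}% ({diff:+.2f} points vs overall)")

# How many times higher Germany's rate is than each of the other two.
ratio_vs_france = rates["Germany"] / rates["France"]
ratio_vs_spain = rates["Germany"] / rates["Spain"]
print(f"Germany / France = {ratio_vs_france:.2f}")
print(f"Germany / Spain  = {ratio_vs_spain:.2f}")
```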

***
## Gender

*Churn rate by gender can be plotted as either a pie chart or a bar chart. I am going to use a bar chart, because it carries more information once the threshold line is added:*

In [1575]:
# ii. Gender

churn_rate_gender=((df.groupby('Gender')['Exited'].sum()/df.groupby('Gender')['Exited'].count())*100).round(2).reset_index()
churn_rate_gender


Unnamed: 0,Gender,Exited
0,Female,25.07
1,Male,16.46


In [1576]:
total_average_churn_rate =((df['Exited'].sum())/(df['Exited'].count())*100).round(2)
total_average_churn_rate

20.37

In [1577]:
total_average_df_gender=pd.DataFrame({'Gender':['Total Average'],'Exited':[total_average_churn_rate]})
churn_rate_gender= pd.concat([churn_rate_gender,total_average_df_gender],ignore_index=True)
churn_rate_gender.head()


Unnamed: 0,Gender,Exited
0,Female,25.07
1,Male,16.46
2,Total Average,20.37


In [1578]:
#plotly Express bar plot
fig=px.bar(churn_rate_gender.iloc[:2],x='Gender',y='Exited',
           labels={'Exited':'Churn rate (%)'},
           color='Gender',
           color_discrete_map ={'Male':'navy','Female':'pink'},
           text='Exited',
)
fig.update_traces(textfont=dict(color='black',family='Arial',size=12),textposition='outside')

fig.add_hline(y = total_average_churn_rate, line=dict(
            color='darkred',dash='longdashdot',width=2)
        )
fig.update_layout(title=dict(text="<b>Churn Rate by Gender with Total Average Churn Rate Line</b>",
                             x=0.5,y=0.95),
                             legend=dict(title_text="<b>Gender</b>"),
                             margin=dict(l=50,r=50,t=50,b=50))
fig.add_annotation(x=len(churn_rate_gender['Gender'])-1,
                   y=total_average_churn_rate ,
                   text=f'Total Average Churn Rate(Threshold line):{total_average_churn_rate:.2f}%',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=4,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   
                   borderwidth=1,
                   height=30,
                   ax=100,
                   ay=-50)
                
fig.show()

*As we can see from the bar chart, the churn rate for female customers is 25.07%, which is higher than the average churn rate, while the churn rate for male customers is 16.46%, which is lower than the average. This suggests that female customers are, on average, more likely to exit than male customers. Further investigation may be needed to understand the reasons behind this difference. For example, exploring factors such as product preferences, customer satisfaction and marketing strategies specific to each gender could provide additional insight into the observed variation in churn rate.*

***
## Tenure

In [867]:
# iii. Tenure

*For tenure, I want to see how the likelihood of exiting changes as customers stay longer with the bank. Tenure can be treated as categorical or numerical, but its values are ordered (a customer with 3 years of tenure has been with the bank longer than one with only a year), so a line chart that shows the trend is more meaningful.*

In [868]:
churn_rate_tenure=((df.groupby('Tenure')['Exited'].sum()/df.groupby('Tenure')['Exited'].count())*100).round(2).reset_index()
churn_rate_tenure

Unnamed: 0,Tenure,Exited
0,0,23.0
1,1,22.42
2,2,19.18
3,3,21.11
4,4,20.53
5,5,20.65
6,6,20.27
7,7,17.22
8,8,19.22
9,9,21.65


In [869]:
total_average_churn_rate =((df['Exited'].sum())/(df['Exited'].count())*100).round(2)
total_average_churn_rate

20.37

In [870]:

total_average_df=pd.DataFrame({'Tenure':['Total Average'],'Exited':[total_average_churn_rate]})
churn_rate_tenure= pd.concat([churn_rate_tenure,total_average_df],ignore_index=True)
churn_rate_tenure.head(15)

Unnamed: 0,Tenure,Exited
0,0,23.0
1,1,22.42
2,2,19.18
3,3,21.11
4,4,20.53
5,5,20.65
6,6,20.27
7,7,17.22
8,8,19.22
9,9,21.65


In [1589]:
#plotly Express line plots   
fig=px.line(churn_rate_tenure.iloc[:11],x='Tenure',y='Exited',   
           labels={'Exited':'Churn rate (%)','Tenure':'Tenure (years)'},
           markers=True,
           line_shape='linear',
)
# attributes for appearance of the text, x-axis and line
fig.update_traces(textfont=dict(color='black',family='Arial',size=12),textposition='top center')
fig.update_traces(line=dict(color='navy',width=2))
fig.update_xaxes(dtick=1)
# add the horizontal line for average churn rate to compare with 
fig.add_hline(y = total_average_churn_rate, line=dict(
            color='darkred',dash='longdashdot',width=2)
        )
#Title and layout attributes
fig.update_layout(title=dict(text="<b>Churn Rate by Tenure with Total Average Churn Rate Line</b> ",
                             x=0.5,y=0.95),
                             legend=dict(title_text="<b>Tenure</b>"),
                             margin=dict(l=50,r=50,t=50,b=50))
# The annotation for the average horizontal line 
fig.add_annotation(x=len(churn_rate_tenure['Tenure']),
                   y=total_average_churn_rate,
                   text=f'Total Average Churn Rate(Threshold line):{total_average_churn_rate:.2f}%',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=4,
                   arrowwidth=2,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   
                   borderwidth=1,
                   height=30,
                   ax=10,
                   ay=-40)
                
fig.show()

Although tenure can be treated as categorical, the tenure years are ordered and we want to observe the trend, so I chose a line plot. From the plot, we can see that:
1. Customers who have just signed up are the most likely to exit: the highest churn rates occur within the first year (23% at tenure 0 and 22.42% at tenure 1). This is plausible, as new customers do not yet know the service well.
2. Customers with one to two years of tenure are less likely to leave: the churn rate drops sharply from 22.42% to 19.18%, below the total average churn rate of 20.37%. This could reflect a marketing strategy, a promotion, an added feature, service enhancements or a campaign.
3. From the second to the third year, the tendency to exit rises relatively quickly, from 19.18% to 21.11%. This may be due to services that oblige customers to stay for at least two years, or to something else. We need more detail, such as where these customers were based, what their credit scores were, whether they were active members, and how competition influences churn.
4. Customers with 3 to 4 years of tenure are less likely to exit than the previous group: the churn rate falls from 21.11% to 20.53%, close to the average churn rate of 20.37%.
5. Customers who have been with the bank for 4 to 5 years are slightly more likely to exit, but the change is negligible (from 20.53% to 20.65%).
6. The line then declines for customers with 5 to 6 years of tenure, from 20.65% to 20.27%, just below the average churn rate of 20.37%.
7. Interestingly, the tendency to exit for customers with 6 to 7 years of tenure falls sharply, from 20.27% to 17.22%, the lowest churn rate on the chart and well below the total average of 20.37%. There may have been a promotion or marketing strategy to keep loyal customers, or extra services such as loans or credit offered to them.
8. Between 7 and 8 years of tenure, the churn rate rises again, from 17.22% to 19.22%. One possible explanation is that these customers are nearing the end of the customer lifecycle.
9. Between 8 and 9 years of tenure, the tendency to leave rises sharply, from 19.22% to 21.65%, which is above the total average churn rate.
10. For customers with 10 years of tenure, the churn rate falls from 21.65% to 20.61%, which is still slightly higher than the total average churn rate of 20.37%.

In general, we can summarize our analysis as follows:
- The churn rate decreases sharply for customers with tenure between 0 and 2 years. (**from 23% to 19.18%, a drop of about 3.8 percentage points**)
- Customers with 2 to 3 years of tenure are more likely to leave than the previous tenure group. (**from 19.18% to 21.11%, which is above the average, an increase of about 1.9 percentage points**)
- Customers with 3 to 6 years of tenure become gradually less likely to leave, and the churn rate ends up below the total average. (**from 21.11% in year 3 to 20.27% in year 6, a decrease of 0.84 percentage points that takes it below the total average churn rate of 20.37%**)
- A sharp drop then occurs for customers with 6 to 7 years of tenure, reaching the lowest churn rate on the chart, 17.22%. (**from 20.27% in year 6 to 17.22% in year 7, a decrease of 3.05 percentage points, the second-largest single-year drop after the 3.24-point fall between years 1 and 2**) This needs to be investigated, as it is unusual compared with neighbouring tenures. We should look into other aspects: marketing strategy, feature or campaign launches, and promotions for loyal customers.
- Interestingly, after that fall the churn rate rises continuously for customers with 7 to 9 years of tenure. This creates a deep valley in the line chart, and both the fall and the rise should be investigated. (**from 17.22% to 21.65%, an increase of 4.43 percentage points that takes it above the average of 20.37%**) The rise may occur because customers are nearing the end of the customer lifecycle, or the preceding fall may be due to a marketing strategy, a promotion or a specific feature. More investigation is needed to find the cause and correlated factors.
- The last tenure group is customers who have been with the bank for 9 to 10 years, where the churn rate decreases slightly and moves close to the total average line. (**from 21.65% to 20.61%, a decrease of 1.04 percentage points; the churn rate is still above the total average of 20.37%**)

I think we should also consider other aspects, such as whether customers are active, and whether the customers who remained provide more value than the customers who exited would have provided had they stayed.
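The year-on-year changes quoted in the summary can be recomputed rather than read off the chart. A minimal sketch, assuming the churn rates for tenure 0 to 9 from the table above and the tenure-10 value (20.61%) quoted in the written analysis; the names `churn` and `yoy` are illustrative:

```python
import pandas as pd

# Churn rate (%) by tenure: years 0-9 from the output table,
# year 10 (20.61) from the written analysis.
churn = pd.Series(
    [23.0, 22.42, 19.18, 21.11, 20.53, 20.65, 20.27, 17.22, 19.22, 21.65, 20.61],
    index=pd.RangeIndex(0, 11, name="Tenure"),
    name="churn_pct",
)

# Change from the previous tenure year, in percentage points.
yoy = churn.diff().round(2)

largest_drop_year = yoy.idxmin()   # tenure year with the biggest fall
largest_rise_year = yoy.idxmax()   # tenure year with the biggest rise
print(yoy)
print("Largest fall ends at tenure", largest_drop_year)
print("Largest rise ends at tenure", largest_rise_year)
```

Working in percentage points (differences of rates) rather than percent avoids confusing a change of 3 points with a relative change of 3%.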

For maximum marks, make sure plots are correctly labelled.

***

## 2. Distribution

2. We would also like to know how the data is distributed. Some models require features to be
normally distributed, and highly skewed variables can affect summary statistics if left
unchecked. Produce the appropriate visualisation for the distribution of:

**Geography**

*Geography is categorical, so a bar chart is appropriate. A histogram is a form of bar chart usually used for continuous data such as age and credit score. Plotly allows histograms of categorical data as well, but a bar chart or pie chart is sufficient here.*

In [1590]:
geo_dist=df['Geography'].value_counts().reset_index()
geo_dist

Unnamed: 0,Geography,count
0,France,5014
1,Germany,2509
2,Spain,2477


In [1591]:
# i. Geography

fig=px.bar(geo_dist,x='Geography',y='count',
           labels={'Geography':'Country(Geography)','count':'Number of Customers'},
           color='Geography',
           color_discrete_map ={'France':'navy','Germany':'teal','Spain':'lime'},
           text='count',
)
fig.update_traces(textfont=dict(color='black',family='Arial',size=12),textposition='outside')
fig.update_layout(title=dict(text="<b>Distribution of Customers by Country(Geography)</b>",
                             x=0.5,y=0.95),
                             legend=dict(title_text="<b>Country(Geography)</b>"),
                             margin=dict(l=50,r=50,t=50,b=50))
fig.show()


*We can see from the bar chart that France has the largest share of the bank's customer base in this dataset, with 5014 customers, about twice as many as Germany (2509) and Spain (2477). This should be taken into consideration when analysing other variables. Policies, customer behaviour, customer preferences and competition differ between countries and should be taken into account. We should investigate whether there is any trend among the countries and why France has the largest customer base. What is driving the distribution? Is France the oldest and central branch?*
*We have already looked at the churn rate: France, which has the largest customer base, and Spain have similar churn rates of around 16%, but Germany, with 2509 customers, has about twice the churn rate of each. This needs further investigation. It may be that the branch in France was established long ago while Spain is fairly new. The German market may be difficult to enter, or customers there may tend to prefer other banks. We should gather more information to draw correct insights.*
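Expressing the counts as shares of the customer base makes the "about twice" comparison explicit. A small sketch using the counts from the output above (with the real frame, `df['Geography'].value_counts(normalize=True)` would give the same proportions):

```python
# Customer counts copied from the value_counts() output above.
counts = {"France": 5014, "Germany": 2509, "Spain": 2477}
total = sum(counts.values())

# Share of the customer base per country, in percent.
shares = {country: round(100 * n / total, 2) for country, n in counts.items()}

for country, share in shares.items():
    print(f"{country}: {counts[country]} customers ({share}%)")
```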
***
## Age

In [1592]:

# ii. Age
mean_age=df['Age'].mean().round()
mean_age=int(mean_age)
mean_age

39

In [1593]:
df['Age'].max()

92

In [1594]:
df['Age'].min()

18

In [1595]:
median_age=df['Age'].median()
median_age=int(median_age)
median_age

37

In [1596]:
mode_age=df['Age'].mode().iloc[0]
mode_age

37

In [1597]:
q25_age=df['Age'].quantile(0.25)
q25_age=int(q25_age)
q25_age

32

In [1598]:
q75_age=df['Age'].quantile(0.75)
q75_age=int(q75_age)
q75_age

44

In [1618]:

fig=px.histogram(
           df,x='Age',
           nbins=40,
           range_x=[17,df['Age'].max()+5],
           
)
fig.update_traces(textposition='outside',texttemplate='%{y}',text=df['Age'])
fig.update_traces(marker=dict(color=['navy' if i% 2==0 else 'lightblue' for i in range(len(df))]))
fig.update_layout(title=dict(text="<b>Distribution of Customers by Age</b>",
                             x=0.5,y=0.95),
                             margin=dict(l=0,r=0,t=50,b=2),
                             yaxis=dict(title='<b>Number of Customers</b>', showgrid=True),
                             xaxis=dict(title='<b>Age</b>',
                             tickmode='linear',
                             dtick=1,
                             showgrid=True,
                             zeroline=False,


                ))
fig.add_shape( go.layout.Shape( type='line',
                x0=mean_age,
                x1=mean_age,
                y0=0,
                y1=900,
                line=dict(color='White',dash='longdashdot',width=2),
                             )
)
fig.add_annotation(
                   x=mean_age,
                   y= 830 ,
                   text=f'Mean Age:{mean_age}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=50,
                   ay=-50
)
fig.add_shape( go.layout.Shape( type='line',
                x0=median_age,
                x1=median_age,
                y0=0,
                y1=934,
                line=dict(color='black',dash='longdashdot',width=2),
                             )
)
fig.add_annotation(
                   x=median_age,
                   y= 930 ,
                   text=f'Median Age:{median_age}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=50,
                   ay=-70
)
fig.add_shape( go.layout.Shape( type='line',
                x0=q25_age,
                x1=q25_age,
                y0=0,
                y1=860,
                line=dict(color='black',dash='longdashdot',width=2),
                             )
)
fig.add_annotation(
                   x=q25_age,
                   y= 830 ,
                   text=f'First Quartile (25%):{q25_age}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='right',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   arrowwidth=2,
                   borderwidth=1,
                   height=30,
                   ax=-15,
                   ay=-30,
)

fig.add_shape( go.layout.Shape( type='line',
                x0=q75_age,
                x1=q75_age,
                y0=0,
                y1=486,
                line=dict(color='black',dash='longdashdot',width=2),
                             )
)
fig.add_annotation(
                   x=q75_age,
                   y= 470 ,
                   text=f'Third Quartile (75%):{q75_age}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='right',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   arrowwidth=2,
                   borderwidth=1,
                   height=30,
                   ax=200,
                   ay=-10,
)

fig.show()

The histogram shows the distribution, quartiles, mean and median of the bank customers' ages. We can summarize our observations as below:
1. The central tendency of the age distribution is in the late 30s: the mean is 39, and the median and mode are both 37.
2. The interquartile range (44-32=12) indicates moderate variability in the middle 50% of the data.
3. The distribution is slightly right-skewed (the mean is above the median), so the majority of customers belong to the younger age groups.
4. The highest count is observed in the 36 to 37 age group.
5. The 25th percentile is 32, indicating that a quarter of our customers are 32 years old or younger.
6. The 75th percentile is 44, indicating that three quarters of our customers are 44 years old or younger.
7. There are relatively few individuals in the older age groups, and the distribution tails off for ages beyond the 60s. (In the previous assignment I cut off this age group after carefully analysing other features relative to them.)
8. The customer base peaks in the late 30s.
9. Understanding the age distribution will be crucial for targeted marketing, product development and any analysis related to demographics.
10. Age-related trends and preferences should be taken into account when we do churn analysis.
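The visual impression of right skew can be backed by numbers: a mean above the median and a positive sample skewness both point to a longer right tail. A minimal sketch, assuming a hypothetical helper `describe_shape` and toy ages that are illustrative only (with the real data, `df['Age']` would be passed instead):

```python
import pandas as pd

def describe_shape(values: pd.Series) -> dict:
    """Return central-tendency, spread and skewness measures for a numeric column."""
    q1, q3 = values.quantile([0.25, 0.75])
    return {
        "mean": values.mean(),
        "median": values.median(),
        "iqr": q3 - q1,
        "skew": values.skew(),   # > 0 suggests a longer right tail
    }

# Toy ages for illustration only; not taken from churn_data.csv.
toy_ages = pd.Series([25, 30, 32, 35, 37, 37, 40, 44, 58, 75])
summary = describe_shape(toy_ages)
print(summary)
```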

***
## Credit Score

In [882]:
# iii. Credit Score
df.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [883]:
mean_cr=df['CreditScore'].mean().round()
mean_cr=int(mean_cr)
mean_cr

651

In [884]:
median_cr=df['CreditScore'].median()
median_cr=int(median_cr)
median_cr


652

In [885]:
df['CreditScore'].max()

850

In [886]:
df['CreditScore'].min()

350

In [887]:
mode_cr=df['CreditScore'].mode().iloc[0]
mode_cr

850

In [1619]:
df[df['CreditScore']==850].count()

RowNumber          233
CustomerId         233
Surname            233
CreditScore        233
Geography          233
Gender             233
Age                233
Tenure             233
Balance            233
NumOfProducts      233
HasCrCard          233
IsActiveMember     233
EstimatedSalary    233
Exited             233
dtype: int64

In [889]:
q25_cr=df['CreditScore'].quantile(0.25)
q25_cr=int(q25_cr)
q25_cr

584

In [890]:
q75_cr=df['CreditScore'].quantile(0.75)
q75_cr=int(q75_cr)
q75_cr

718

In [1620]:

fig=px.histogram(
           df,x='CreditScore',
           nbins=30,
           range_x=[300,df['CreditScore'].max()+20],
           
)
fig.update_traces(textposition='outside',texttemplate='%{y}',text=df['CreditScore'])
fig.update_traces(marker=dict(color=['navy' if i% 2==0 else 'lightblue' for i in range(len(df))]))
fig.update_layout(title=dict(text="<b>Distribution of Customers by Credit Score</b>",
                             x=0.5,y=0.95),
                             margin=dict(l=50,r=50,t=50,b=50),
                             yaxis=dict(title='<b>Number of Customers</b>'),
                             xaxis=dict(title='<b>Credit Score</b>',
                             tickmode='linear',
                             dtick=20,
                             showgrid=True,
                             zeroline=True,
                             showline=True,
                          
                ))

fig.add_shape( go.layout.Shape( type='line',
                x0=mean_cr,
                x1=mean_cr,
                y0=0,
                y1=796,
                line=dict(color='black',dash='longdashdot',width=3),
                             )
)
fig.add_annotation(
                   x=mean_cr,
                   y= 720,
                   text=f'Mean Credit Score:{mean_cr}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   arrowwidth=2,
                   borderwidth=1,
                   height=30,
                   ax=-200,
                   ay=-60,
)
fig.add_shape( go.layout.Shape( type='line',
                x0=median_cr,
                x1=median_cr,
                y0=0,
                y1=796,
                line=dict(color='darkred',width=3),
                             )
)
fig.add_annotation(
                   x=median_cr,
                   y= 670 ,
                   text=f'Median Credit Score:{median_cr}',
                   showarrow=True,
                   arrowcolor='darkred',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   
                   borderwidth=1,
                   height=30,
                   ax=100,
                   ay=-50,
)

fig.add_shape( go.layout.Shape( type='line',
                x0=q25_cr,
                x1=q25_cr,
                y0=0,
                y1=672,
                line=dict(color='white',dash='longdashdot',width=3),
                             )
)
fig.add_annotation(
                   x=q25_cr,
                   y= 660 ,
                   text=f'First Quartile (25%):{q25_cr}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='right',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   
                   borderwidth=1,
                   height=30,
                   ax=-15,
                   ay=-30,
)

fig.add_shape( go.layout.Shape( type='line',
                x0=q75_cr,
                x1=q75_cr,
                y0=0,
                y1=712,
                line=dict(color='white',dash='longdashdot',width=3),
                             )
)
fig.add_annotation(
                   x=q75_cr,
                   y= 650 ,
                   text=f'Third Quartile (75%):{q75_cr}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='right',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=200,
                   ay=-10,
)

fig.show()

The histogram shows the distribution, quartiles, mean and median of the bank customers' credit scores. Our observations can be summarized as follows:

1. The mean (651) and median (652) are very close, indicating a relatively symmetrical distribution. The mode is 850, the highest possible score, which is held by 233 customers.

2. The interquartile range is 134 (718 - 584), which means that the middle 50% of our customers have credit scores within this 134-point range.

3. The first quartile shows that 25% of our customers have a credit score of 584 or lower.

4. The third quartile shows that 75% of our customers have a credit score of 718 or lower.
5. The distribution is relatively symmetric, with a peak around the higher credit scores.
6. For churn rate analysis, higher credit scores generally indicate lower credit risk, which could correlate with lower churn rates. (This is only a common-sense hypothesis and still has to be tested.)
7. It would be necessary to analyze the churn rates across different credit score ranges to understand the correlation.

As the distribution is roughly symmetric, it appears suitable for further analysis. Credit score can be an important factor in predicting customer behaviour, and understanding its distribution helps in making data-driven decisions.
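
The summary statistics discussed above can be reproduced with a few pandas calls. The following is a minimal sketch on a hypothetical toy series (`scores` is made-up data, not the churn dataset), showing how the mean, median, quartiles, interquartile range and mode relate to each other:

```python
import pandas as pd

# Hypothetical toy credit scores for illustration only (not the real churn data)
scores = pd.Series([500, 584, 620, 652, 700, 718, 850, 850])

mean_score = scores.mean()                    # arithmetic mean
median_score = scores.median()                # 50th percentile
q25, q75 = scores.quantile([0.25, 0.75])      # first and third quartiles
iqr = q75 - q25                               # spread of the middle 50% of values
mode_score = scores.mode().iloc[0]            # most frequent value

# A mean close to the median is a quick (informal) hint of symmetry
gap = abs(mean_score - median_score)

print(f"mean={mean_score:.2f}, median={median_score}, IQR={iqr}, mode={mode_score}, gap={gap:.2f}")
```

On the real data, the same calls applied to `df['CreditScore']` would give the values quoted in the observations.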

***
## Subplots

3: Combine all of the above visualisations into a subplot (hint: Subplot takes figures created in graph objects, so you may need to recreate some visualisations). For maximum marks, make sure that you correctly label each figure in the subplot.

In [1623]:
# Titles for the subplots (the figure itself is created by make_subplots below)
subplots_titles=['<b>Churn Rate by Geography with Total Average Churn Line</b>',
                    '<b>Churn Rate by Gender with Total Average Churn Rate Line</b>',
                    '<b>Churn Rate by Tenure with Total Average Churn Rate Line</b>',
                    '<b>Distribution of Customers by Country(Geography)</b>',
                    '<b>Distribution of Customers by Age</b>',
                    '<b>Distribution of Customers by Credit Score</b>'
        ]
# create subplots for the above six plots with a title for our set of plots
fig=make_subplots(
    rows=3,
    cols=3,
    subplot_titles= subplots_titles,
    shared_xaxes=False, 
    shared_yaxes = False,
    horizontal_spacing= 0.1,
    vertical_spacing=0.1,
    row_heights=[1,1,1.1],
    column_widths=[1,1,1.62]
    )
# Update the main title with respective font and position
fig.update_layout(title=dict(text='<b>Exploratory Visualization of Churn Data: Distribution and Churn Rate Analysis</b>', 
                               x=0.78,y=0.99),
                               title_font=dict(size=20),
                               height=1500,
                               width=1500,
                  )



# update font size for subplots
fig.update_annotations(font_size=13,
                        borderwidth=2,
                        height=10,
                        )
##  First Plot
# Create the first subplot, which is a bar chart of churn rate by Geography
# First compute the churn rate per country as the data for the plot
churn_rate_geo=((df.groupby('Geography')['Exited'].sum()/df.groupby('Geography')['Exited'].count())*100).round(2).reset_index()
total_average_churn_rate =((df['Exited'].sum())/(df['Exited'].count())*100).round(2) # total average churn rate for line plot to have a reference 
##  Second Plot
# Create the second subplot, which is a bar chart of churn rate by Gender
# First compute the churn rate per gender as the data for the plot
churn_rate_gender=((df.groupby('Gender')['Exited'].sum()/df.groupby('Gender')['Exited'].count())*100).round(2).reset_index()
##  Third Plot
# Create the third subplot, which is a line chart of churn rate by Tenure
# First compute the churn rate per tenure value as the data for the plot
churn_rate_tenure=((df.groupby('Tenure')['Exited'].sum()/df.groupby('Tenure')['Exited'].count())*100).round(2).reset_index()
##  Fourth Plot
# Fourth figure, which is a bar chart of the customer distribution by geography
# First get the customer counts per country for the plot
geo_dist=df['Geography'].value_counts().reset_index()
##  Fifth plot
# Fifth figure, which is a histogram of the customer distribution by Age
# The mean, median, first and third quartiles are important for a distribution, so they are computed here to be drawn on top of the histogram
mean_age=df['Age'].mean().round() # mean of age among the customers
mean_age=int(mean_age)

median_age=df['Age'].median() # median of age among the customers
median_age=int(median_age)

q25_age=df['Age'].quantile(0.25) # 25% of customers belong to the age group less than this number
q25_age=int(q25_age)

q75_age=df['Age'].quantile(0.75) # 75% of customers belong to the age group less than this number
q75_age=int(q75_age)
##  sixth plot
# Sixth figure, which is a histogram of the customer distribution by Credit Score
# The mean, median, first and third quartiles are important for a distribution, so they are computed here to be drawn on top of the histogram
mean_cr=df['CreditScore'].mean().round() # Credit Score Mean
mean_cr=int(mean_cr)
median_cr=df['CreditScore'].median()     # Credit Score Median
median_cr=int(median_cr)
q25_cr=df['CreditScore'].quantile(0.25)  # First Quantile Range
q25_cr=int(q25_cr)
q75_cr=df['CreditScore'].quantile(0.75)  # Third Quantile Range
q75_cr=int(q75_cr)

###################################################################################################################
churn_geo_bar_color=['navy','teal','lime']
churn_gender_bar_color=['pink','navy']
churn_geo_bar_1x1=go.Bar(x=churn_rate_geo['Geography'], # first plot
                        y= churn_rate_geo['Exited'],
                        text=churn_rate_geo['Exited'],
                        textposition='outside',
                        marker_color=churn_geo_bar_color,
                         )
churn_gender_bar_1x2=go.Bar(x=churn_rate_gender['Gender'], # second plot
                        y= churn_rate_gender['Exited'],
                        text=churn_rate_gender['Exited'],
                        textposition='outside',
                        marker_color=churn_gender_bar_color,
                         )
churn_tenure_line_1x3=go.Scatter(x=churn_rate_tenure['Tenure'], # Third plot
                        y= churn_rate_tenure['Exited'],
                        mode='lines',
                        line=dict(color='navy',width=2)
                         )
Cust_distribution_geo_bar_2x1=go.Bar(x=geo_dist['Geography'], # Fourth plot
                        y= geo_dist['count'],
                        text=geo_dist['count'],
                        textposition='outside',
                        marker_color=churn_geo_bar_color
                         )
Cust_distribution_age_hist_2x2=go.Histogram(x=df['Age'],  # Fifth plot
                        text=df['Age'],
                        nbinsx=40,
                        textposition='outside',
                        marker=dict(color=['navy' if i% 2==0 else 'lightblue' for i in range(len(df))]),
                        texttemplate='%{y}'
                        )
Cust_distribution_creditscore_hist_3x1=go.Histogram(x=df['CreditScore'],  # Sixth plot
                        text=df['CreditScore'],
                        nbinsx=30,
                        textposition='outside',
                        marker=dict(color=['navy' if i% 2==0 else 'lightblue' for i in range(len(df))]),
                        texttemplate='%{y}'
                        )
# Title and Layout updates of the subplots

fig.update_layout(
    annotations=[
        dict(
            text="<b>Churn Rate by Geography with Total Average Churn Line</b>", # First figure title
            x=0.12,
            y=1.017,
            borderwidth=2,
            height=20,
            xref='paper',
            yref='paper',
            xanchor='center',
            yanchor='middle',
        ),
        dict(
            text="<b>Churn Rate by Gender with Total Average Churn Rate Line</b>", #Second figure title
            x=0.46,
            y=1.017,
            borderwidth=2,
            height=20,
            xref='paper',
            yref='paper',
            xanchor='center',
            yanchor='middle',
        ),
          dict(
            text="<b>Churn Rate by Tenure with Total Average Churn Rate Line</b>", # Third figure title
            x=0.84,
            y=1.017,
            borderwidth=2,
            height=20,
            xref='paper',
            yref='paper',
            xanchor='center',
            yanchor='middle',
        ),
                dict(
            text="<b>Distribution of Customers by Country(Geography)</b>", # Fourth Figure Title
            x=0.12,
            y=0.66,
            borderwidth=2,
            height=20,
            xref='paper',
            yref='paper',
            xanchor='center',
            yanchor='middle',
        ),
      dict(
            text="<b>Distribution of Customers by Age</b>", # Fifth Figure Title
            x=0.55,
            y=0.66,
            borderwidth=2,
            height=20,
            xref='paper',
            yref='paper',
            xanchor='center',
            yanchor='middle',
        ),
           dict(
            text="<b>Distribution of Customers by Credit Score</b>", # Sixth Figure Title
            x=0.35,
            y=0.3,
            borderwidth=2,
            height=20,
            xref='paper',
            yref='paper',
            xanchor='center',
            yanchor='middle',
        ),
    ],
    showlegend=False,
)
#######

# First Figure 
fig.add_trace(churn_geo_bar_1x1,row=1,col=1)
fig.add_hline(y = total_average_churn_rate, line=dict(
            color='darkred',dash='longdashdot',width=2),
            row=1,col=1   # restrict the reference line to the first subplot
        )
fig.add_annotation(x=(len(churn_rate_geo['Geography']))-1,
                   y=total_average_churn_rate,
                   text=f'Total Ave. Churn %:{total_average_churn_rate:.2f}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=4,
                   xref='x',
                   yref='y',
                   font=dict(size=12,color='black'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=5,
                   ay=-40,
                   row=1,col=1
                   )
# Second figure and its attributes
fig.add_trace(churn_gender_bar_1x2,row=1,col=2)


fig.add_hline(y = total_average_churn_rate, line=dict(
            color='darkred',dash='longdashdot',width=2),
            row=1,col=2   # restrict the reference line to the second subplot
        )
fig.add_annotation(x=(len(churn_rate_gender['Gender']))-1,
                   y=total_average_churn_rate,
                   text=f'Total Ave. Churn %:{total_average_churn_rate:.2f}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=4,
                   xref='x',
                   yref='y',
                   font=dict(size=12,color='black'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=5,
                   ay=-60,
                   row=1,col=2
                   )
# Third figure and its attributes
fig.add_trace(churn_tenure_line_1x3,row=1,col=3)
fig.add_hline(y = total_average_churn_rate, line=dict(
            color='darkred',dash='longdashdot',width=2),
            row=1,col=3   # restrict the reference line to the third subplot
        )
fig.add_annotation(x=(len(churn_rate_tenure['Tenure']))-1,
                   y=total_average_churn_rate,
                   text=f'Total Ave. Churn %:{total_average_churn_rate:.2f}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=4,
                   xref='x',
                   yref='y',
                   font=dict(size=12,color='black'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=10,
                   ay=22.5,
                   row=1,col=3
                   )
# The fourth plot and its attributes
fig.add_trace(Cust_distribution_geo_bar_2x1,row=2,col=1)

#The fifth plot and its attributes
fig.add_trace(Cust_distribution_age_hist_2x2,row=2,col=2)
fig.add_shape( type='line',
              x0=mean_age,
              x1=mean_age,                       # add vertical lines for mean
              y0=0, 
              y1=900,
              line=dict(color='white',dash='longdash',width=2),
              row=2,
              col=2
         )          

fig.add_annotation(                              # add annotation for the mean line
                   x=mean_age,
                   y= 830 ,
                   text=f'Mean Age:{mean_age}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=50,
                   ay=-50,
                   row=2,
                   col=2
)
fig.add_shape(  type='line',                                       #add vertical line for median
                x0=median_age,
                x1=median_age,
                y0=0,
                y1=934,
                line=dict(color='black',dash='longdashdot',width=2),
                row=2,
                col=2
                  )
fig.add_annotation(                                                #add annotation for median line
                   x=median_age,
                   y= 930 ,
                   text=f'Median Age:{median_age}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=50,
                   ay=-70,
                   col=2,
                   row=2
)
fig.add_shape(  type='line',                                        # add first quantile line 
                x0=q25_age,
                x1=q25_age,
                y0=0,
                y1=860,
                line=dict(color='black',dash='longdashdot',width=2),
                col=2,
                row=2       
              )
fig.add_annotation(                                               # add annotation for the first quantile
                   x=q25_age,
                   y= 830 ,
                   text=f'First Quantile (25%):{q25_age}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='right',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   arrowwidth=2,
                   borderwidth=1,
                   height=30,
                   ax=-15,
                   ay=-30,
                   row=2,
                   col=2
)

fig.add_shape(  type='line',                                         # add the third quantile line
                x0=q75_age,
                x1=q75_age,
                y0=0,
                y1=486,
                line=dict(color='black',dash='longdashdot',width=2),
                row=2,
                col=2
                             )
fig.add_annotation(                                                  # add annotation for the third quantile
                   x=q75_age,
                   y= 470 ,
                   text=f'Third Quantile (75%):{q75_age}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='right',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   arrowwidth=2,
                   borderwidth=1,
                   height=30,
                   ax=200,
                   ay=-10,
                   row=2,
                   col=2
)
######
#Sixth Plot
fig.add_trace(Cust_distribution_creditscore_hist_3x1,row=3,col=1)
fig.add_shape( type='line',
              x0=mean_cr,
              x1=mean_cr,                       # add vertical lines for mean
              y0=0, 
              y1=796,
              line=dict(color='black',dash='longdashdot',width=2),
              row=3,
              col=1
         )          

fig.add_annotation(                              # add annotation for the mean line
                   x=mean_cr,
                   y= 750 ,
                   text=f'Mean Credit Score:{mean_cr}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=50,
                   ay=-50,
                   row=3,
                   col=1
)
fig.add_shape(  type='line',                                       #add vertical line for median
                x0=median_cr,
                x1=median_cr,
                y0=0,
                y1=796,
                line=dict(color='darkred',width=3),
                row=3,
                col=1
                  )
fig.add_annotation(                                                #add annotation for median line
                   x=median_cr,
                   y= 670,
                   text=f'Median Credit Score:{median_cr}',
                   showarrow=True,
                   arrowcolor='darkred',
                   arrowwidth=2,
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=100,
                   ay=-50,
                   col=1,
                   row=3
)
fig.add_shape(  type='line',                                        # add first quantile line 
                x0=q25_cr,
                x1=q25_cr,
                y0=0,
                y1=672,
                line=dict(color='white',dash='longdashdot',width=3),
                col=1,
                row=3       
              )
fig.add_annotation(                                               # add annotation for the first quantile
                   x=q25_cr,
                    y= 660 ,
                   text=f'First Quantile (25%):{q25_cr}',
                   showarrow=True,
                   arrowwidth=2,
                   arrowcolor='black',
                   arrowhead=3,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='right',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=-15,
                   ay=-30,
                   row=3,
                   col=1
)

fig.add_shape(  type='line',                                         # add the third quantile line
                x0=q75_cr,
                x1=q75_cr,
                y0=0,
                y1=712,
                line=dict(color='white',dash='longdashdot',width=3),
                row=3,
                col=1
                             )
fig.add_annotation(                                                  # add annotation for the third quantile
                   x=q75_cr,
                   y= 650,
                   text=f'Third Quantile (75%):{q75_cr}',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=3,
                   arrowwidth=2,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='bold'),
                   align='left',
                   xanchor='right',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=30,
                   ax=200,
                   ay=-10,
                   row=3,col=1
)


######
                             
# x and y axis labels

fig.update_xaxes(title_text='Countries',row=1,col=1)                # first plot
fig.update_yaxes(title_text='Churn Rate(%)',row=1,col=1)
fig.update_xaxes(title_text='Gender',row=1,col=2)                   # second plot
fig.update_yaxes(title_text='Churn Rate(%)',row=1,col=2)
fig.update_xaxes(title_text='Tenure (years)',row=1,col=3)           # third plot
fig.update_yaxes(title_text='Churn Rate(%)',row=1,col=3)
fig.update_xaxes(title_text='Country(Geography)',row=2,col=1)       # fourth plot
fig.update_yaxes(title_text='Number of Customers',row=2,col=1)
fig.update_xaxes(title_text='Age', tickmode='linear',dtick=1,showgrid=True,zeroline=False,showline=True, #fifth plot
                 range=[17,df['Age'].max()+5],domain=[0.3,1],row=2,col=2)
fig.update_yaxes(title_text='Number of Customers',row=2,col=2)
fig.update_xaxes(title_text='Credit Score', tickmode='linear',dtick=20,showgrid=True,zeroline=False,showline=True, #sixth plot
                 range=[300,df['CreditScore'].max()+20],domain=[0.1,0.7],row=3,col=1)
fig.update_yaxes(title_text='Number of Customers',row=3,col=1)
fig.show()

***

## Correlation Analysis with target feature (Exited)

4: We now want to see how each variable is correlated with the target. Create a correlation matrix which shows how each variable is correlated with each other variable using df.corr(). Then, select the column related to our target (the exited column). Create a bar chart to visualise the correlation between each feature and the target.

*Before answering this, we first need to encode the categorical variables Geography and Gender using a one-hot encoder.*
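
As a quick illustration of what one-hot encoding produces, here is a minimal sketch on a hypothetical three-row frame (`toy` is made-up data). It uses `pd.get_dummies`, which is a different tool from the scikit-learn `OneHotEncoder` used in the cells below, but yields the same kind of 0/1 indicator columns:

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the churn dataset)
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female'],
    'Geography': ['France', 'Spain', 'Germany'],
    'Exited': [1, 0, 1],
})

# Each category becomes its own 0/1 column; the original text columns are dropped
encoded = pd.get_dummies(toy, columns=['Gender', 'Geography'], dtype=int)

print(encoded)
```

With the default settings the new columns are prefixed with the source column name (for example `Gender_Female`), whereas the notebook's encoder cells name them after the category alone.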

In [1624]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [1625]:
# Before we start to see the correlation between features and target first we should encode the categorical variables using one-hot encoder
my_encoder= OneHotEncoder(sparse_output=False)
gender = my_encoder.fit_transform(
    df[["Gender"]]
)
my_encoder.categories_[0]

array(['Female', 'Male'], dtype=object)

In [1626]:
df_encoded = pd.DataFrame(
    gender,
    columns=my_encoder.categories_[0],
    index=df.index
).astype(int)

In [1627]:
df = pd.concat( #adding the encoded columns to our churn dataset
    [
        df,
        df_encoded
    ],
    axis=1
).drop(["Gender"], axis=1)

In [1631]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Female,Male
0,1,15634602,Hargrave,619,France,42,2,0.0,1,1,1,101348.88,1,1,0
1,2,15647311,Hill,608,Spain,41,1,83807.86,1,0,1,112542.58,0,1,0
2,3,15619304,Onio,502,France,42,8,159660.8,3,1,0,113931.57,1,1,0
3,4,15701354,Boni,699,France,39,1,0.0,2,0,0,93826.63,0,1,0
4,5,15737888,Mitchell,850,Spain,43,2,125510.82,1,1,1,79084.1,0,1,0


In [1632]:
Geography= my_encoder.fit_transform(
    df[["Geography"]]
)

In [1633]:
my_encoder.categories_[0]

array(['France', 'Germany', 'Spain'], dtype=object)

In [1634]:
df_encoded_Geography= pd.DataFrame(
    Geography,
    columns=my_encoder.categories_[0],
    index=df.index
).astype(int)

In [1635]:
df= pd.concat(
    [
        df,
        df_encoded_Geography
    ],
    axis=1
).drop(["Geography"], axis=1)

In [1636]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Female,Male,France,Germany,Spain
0,1,15634602,Hargrave,619,42,2,0.0,1,1,1,101348.88,1,1,0,1,0,0
1,2,15647311,Hill,608,41,1,83807.86,1,0,1,112542.58,0,1,0,0,0,1
2,3,15619304,Onio,502,42,8,159660.8,3,1,0,113931.57,1,1,0,1,0,0
3,4,15701354,Boni,699,39,1,0.0,2,0,0,93826.63,0,1,0,1,0,0
4,5,15737888,Mitchell,850,43,2,125510.82,1,1,1,79084.1,0,1,0,0,0,1


In [1637]:
df.drop('RowNumber',axis=1)
 

Unnamed: 0,CustomerId,Surname,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Female,Male,France,Germany,Spain
0,15634602,Hargrave,619,42,2,0.00,1,1,1,101348.88,1,1,0,1,0,0
1,15647311,Hill,608,41,1,83807.86,1,0,1,112542.58,0,1,0,0,0,1
2,15619304,Onio,502,42,8,159660.80,3,1,0,113931.57,1,1,0,1,0,0
3,15701354,Boni,699,39,1,0.00,2,0,0,93826.63,0,1,0,1,0,0
4,15737888,Mitchell,850,43,2,125510.82,1,1,1,79084.10,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,15606229,Obijiaku,771,39,5,0.00,2,1,0,96270.64,0,0,1,1,0,0
9996,15569892,Johnstone,516,35,10,57369.61,1,1,1,101699.77,0,0,1,1,0,0
9997,15584532,Liu,709,36,7,0.00,1,0,1,42085.58,1,1,0,1,0,0
9998,15682355,Sabbatini,772,42,3,75075.31,2,1,0,92888.52,1,0,1,0,1,0


In [1638]:
selected_columns=[ #columns to include in our correlation matrix
    'CreditScore',
    'Age',
    'Tenure',
    'Balance',
    'NumOfProducts',
    'HasCrCard',
    'IsActiveMember',
    'EstimatedSalary',
    'Exited',
    'Female',
    'Male',
    'France',
    'Germany',
    'Spain'
    ]
df_selected=df[selected_columns]
correlation_matrix=df_selected.corr() # correlation matrix

In [1639]:
correlation_matrix.mean()

CreditScore        0.072035
Age                0.095825
Tenure             0.070035
Balance            0.061387
NumOfProducts      0.047847
HasCrCard          0.068883
IsActiveMember     0.064460
EstimatedSalary    0.072673
Exited             0.084260
Female             0.008051
Male              -0.008051
France            -0.038526
Germany            0.050285
Spain             -0.005878
dtype: float64

In [1640]:
correlation_matrix.describe()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Female,Male,France,Germany,Spain
count,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0
mean,0.072035,0.095825,0.070035,0.061387,0.047847,0.068883,0.06446,0.072673,0.08426,0.008051,-0.008051,-0.038526,0.050285,-0.005878
std,0.267342,0.272801,0.268011,0.314788,0.286287,0.268204,0.274165,0.267065,0.289619,0.393563,0.393563,0.363045,0.351576,0.335475
min,-0.027094,-0.039208,-0.028362,-0.30418,-0.30418,-0.014858,-0.156128,-0.011421,-0.156128,-1.0,-1.0,-0.580359,-0.580359,-0.575418
25%,-0.003688,-0.01129,-0.01169,-0.014207,-0.018999,-0.011274,-0.018331,-0.007021,-0.051455,-0.014071,-0.023936,-0.088518,-0.017969,-0.043722
50%,0.001849,-0.002825,0.000137,-0.001908,0.006111,-0.005612,-0.003384,0.0032,-0.010569,-0.001455,0.001455,-0.005052,0.007917,-0.004084
75%,0.006086,0.04225,0.012029,0.024431,0.013142,0.00512,0.021091,0.011647,0.115528,0.023936,0.014071,0.002158,0.04133,0.007974
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [1641]:
# Correlation of each feature with the target (Exited), sorted from the highest to the lowest value
target_correlations=correlation_matrix['Exited'].sort_values(ascending=False)
# The correlation of Exited with itself is always 1, so it is dropped
target_correlations.drop('Exited',axis=0,inplace=True)

In [1642]:
target_correlations=target_correlations.reset_index().rename(columns={'index':'Feature','Exited':'Correlation'})
target_correlations

Unnamed: 0,Feature,Correlation
0,Age,0.285323
1,Germany,0.173488
2,Balance,0.118533
3,Female,0.106512
4,EstimatedSalary,0.012097
5,HasCrCard,-0.007138
6,Tenure,-0.014001
7,CreditScore,-0.027094
8,NumOfProducts,-0.04782
9,Spain,-0.052667


In [1652]:

fig=px.bar(target_correlations,x='Feature',
           y='Correlation',
           color='Correlation',
           color_continuous_scale=px.colors.sequential.Cividis,
           labels={'Correlation':'Correlation Coefficient','Feature':'Features'}     
)
fig.update_layout(title=dict(text="<b>Correlation of features to Customer churn (Exited)</b>",
                             x=0.4,y=0.95),
)

fig.show()


4.1. Order the bars so that the feature with the highest correlation is the first bar.

To rank features by the strength of their correlation, we should compare the absolute values of the correlation coefficients.
- 1 indicates a perfect positive relationship: every feature has a correlation of 1 with itself
- -1 indicates a perfect negative relationship: for example, Exited would have a perfect negative correlation with a "not exited" indicator
- 0 means no linear correlation
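
Ordering by strength while keeping the sign can also be done in one step. The sketch below is an alternative to the merge-based approach used in the following cells, assuming pandas 1.1 or later for the `key` argument; the four values are rounded from the correlation table in this notebook:

```python
import pandas as pd

# A few signed correlations with Exited, rounded from the table in this notebook
corr = pd.Series({
    'CreditScore': -0.027,
    'Balance': 0.119,
    'IsActiveMember': -0.156,
    'Age': 0.285,
})

# Sort by absolute value (strength) but keep the original signed values
ordered = corr.sort_values(key=abs, ascending=False)

print(ordered)
```

The result keeps negative coefficients negative, so a bar chart built from `ordered` shows both the ranking by strength and the direction of each relationship.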

In [1654]:
# Absolute correlation shows the strength of correlation of features without considering whether they are inversely related or positively related
target_sorted=target_correlations.set_index('Feature').abs()
target_sorted.sort_values(by='Correlation',ascending=False).reset_index()

Unnamed: 0,Feature,Correlation
0,Age,0.285323
1,Germany,0.173488
2,IsActiveMember,0.156128
3,Balance,0.118533
4,Female,0.106512
5,Male,0.106512
6,France,0.104955
7,Spain,0.052667
8,NumOfProducts,0.04782
9,CreditScore,0.027094


In [1655]:
merged_abs=pd.merge(target_sorted,target_correlations,how='left',on='Feature',suffixes=('_sort_abs','_original'))

In [1656]:
sort_highest_correlated=merged_abs.sort_values('Correlation_sort_abs',ascending=False)
sort_highest_correlated.reset_index(drop=True)

Unnamed: 0,Feature,Correlation_sort_abs,Correlation_original
0,Age,0.285323,0.285323
1,Germany,0.173488,0.173488
2,IsActiveMember,0.156128,-0.156128
3,Balance,0.118533,0.118533
4,Female,0.106512,0.106512
5,Male,0.106512,-0.106512
6,France,0.104955,-0.104955
7,Spain,0.052667,-0.052667
8,NumOfProducts,0.04782,-0.04782
9,CreditScore,0.027094,-0.027094


In [1657]:
# sort based on absolute value by keeping the original value
fig=px.bar(sort_highest_correlated,x='Feature',
           y='Correlation_original',
           color='Correlation_sort_abs',
           color_continuous_scale=px.colors.sequential.Cividis,
           labels={'Correlation_original':'Correlation Coefficient','Feature':'Features','Correlation_sort_abs':'Strength of Correlation Hue'}, 
           category_orders={'Feature':sort_highest_correlated['Feature'].tolist()}  
  
)
fig.update_layout(title=dict(text="<b>Correlation of features to Customer Churn(Exited)</b>",
                             x=0.4,y=0.95),
                             legend_title_text='Correlation strength of features to exit hue',

                             )
fig.show()

4.2. Add the correlation value to the top of each bar

In [1659]:
# add the correlation values on top of each bar
fig.update_traces(
    text=sort_highest_correlated['Correlation_original'].round(3),
    textposition='outside',
    textfont=dict(family='Arial',size=11,color='black')
     
)
fig.show()

4.3. Add a line to the figure which shows the average correlation (hint: This will require adding an extra trace).

In [1660]:
# for the mean line we need the average of all the original (signed) correlations
corr_ave=sort_highest_correlated['Correlation_original'].mean()   
# As the bars are sorted by absolute value (strongest correlation first),
# we also need the mean of the absolute correlation values, to be drawn on both the positive and negative side
corr_ave_abs=sort_highest_correlated['Correlation_sort_abs'].mean() 
corr_ave_abs_negative=-corr_ave_abs
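The difference between the two averages can be checked on a small example. The sketch below is only an illustration: it uses four of the signed correlations from the printed table rather than the full set, so its means are not the values plotted in the figure. The variable names are chosen for this example.

```python
import pandas as pd

# Four signed correlations copied from the printed table (a subset, for illustration)
corr = pd.Series({'Age': 0.285323, 'IsActiveMember': -0.156128,
                  'Female': 0.106512, 'Male': -0.106512})

signed_mean = corr.mean()      # opposite signs partly cancel each other
abs_mean = corr.abs().mean()   # measures typical strength, ignoring direction

print(f'signed mean:   {signed_mean:.3f}')
print(f'absolute mean: {abs_mean:.3f}')
```

Female and Male cancel exactly in the signed mean, which is why the signed average sits close to zero while the absolute average is a better reference for the strength of a correlation.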


In [1662]:
sort_highest_correlated['corr_ave']=corr_ave
sort_highest_correlated['corr_ave_abs']=corr_ave_abs
sort_highest_correlated['corr_ave_abs_negative']=corr_ave_abs_negative


In [1664]:
# Correlation average line
fig.add_traces(
             go.Scatter(x=sort_highest_correlated['Feature'],
                        y=sort_highest_correlated['corr_ave'],
                        mode='lines',
                        line=dict(color='darkred',dash='longdashdot',width=2),

))
# correlation average line annotation
fig.add_annotation(x=(len(sort_highest_correlated['Feature'])-11),
                   y=corr_ave,
                   # 'bold' is not a font family; bold text comes from the <b> tag
                   text=f'<b>Average correlation with Exited: {corr_ave:.3f}</b>',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=4,
                   arrowwidth=2,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='Arial'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=20,
                   ax=60,
                   ay=-20
                   )
# correlation absolute average line
fig.add_traces(
             go.Scatter(x=sort_highest_correlated['Feature'],
                        y=sort_highest_correlated['corr_ave_abs'],
                        mode='lines',
                        line=dict(color='darkred',dash='longdashdot',width=2),

))
# correlation absolute average annotation
fig.add_annotation(x=(len(sort_highest_correlated['Feature'])-11),
                   y=corr_ave_abs,
                   # 'bold' is not a font family; bold text comes from the <b> tag
                   text=f'<b>Average absolute correlation with Exited: {corr_ave_abs:.3f}</b>',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=4,
                   arrowwidth=2,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='Arial'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=20,
                   ax=60,
                   ay=-55
                   )
# Mirror of the absolute-average line below zero, so negative correlations are compared against the same threshold
fig.add_traces(
             go.Scatter(x=sort_highest_correlated['Feature'],
                        y=sort_highest_correlated['corr_ave_abs_negative'],
                        mode='lines',
                        line=dict(color='darkred',dash='longdashdot',width=2),

))

fig.add_annotation(x=(len(sort_highest_correlated['Feature'])-11),
                   y=corr_ave_abs_negative,
                   # 'bold' is not a font family; bold text comes from the <b> tag
                   text=f'<b>Negative of the average absolute correlation with Exited: {corr_ave_abs_negative:.3f}</b>',
                   showarrow=True,
                   arrowcolor='black',
                   arrowhead=4,
                   arrowwidth=2,
                   xref='x',
                   yref='y',
                   font=dict(size=15,color='black',family='Arial'),
                   align='left',
                   xanchor='left',
                   bgcolor='lightyellow',
                   bordercolor='black',
                   borderwidth=1,
                   height=20,
                   ax=60,
                   ay=40
                   )
fig.update_layout(title_font=dict(size=20),
                  height=500,
                  width=1000,
                  showlegend=False,
                  )
fig.show()
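The three annotations above differ only in their `y` value, label and vertical arrow offset, so the shared styling could be collected in one place. The sketch below is one possible way to do that; `mean_line_annotation` is a hypothetical helper that is not part of the notebook or of Plotly, and the values passed to it here are placeholders rather than results from the churn data.

```python
def mean_line_annotation(x, y, label, ay):
    """Return keyword arguments for one mean-line annotation (hypothetical helper)."""
    return dict(
        x=x, y=y,
        text=f'{label}: {y:.3f}',
        showarrow=True, arrowcolor='black', arrowhead=4, arrowwidth=2,
        xref='x', yref='y',
        font=dict(size=15, color='black'),
        align='left', xanchor='left',
        bgcolor='lightyellow', bordercolor='black', borderwidth=1,
        height=20, ax=60, ay=ay,
    )

# Placeholder values for illustration only; in the notebook y would be
# corr_ave, corr_ave_abs or corr_ave_abs_negative.
example = mean_line_annotation(x=2, y=0.25, label='Average correlation', ay=-20)
print(example['text'])
```

In the notebook, each annotation could then be added with a call such as `fig.add_annotation(**mean_line_annotation(2, corr_ave, 'Average correlation with Exited', -20))`, which keeps the three labels visually consistent.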

             


                   


The bar chart shows how each feature's correlation with Exited compares with the other features, with the average correlation, and with the average absolute correlation. The strength of a correlation is given by its absolute value, regardless of sign, so I find the absolute-average lines more useful than the signed average for spotting the features most correlated with Exited at a glance. The correlation coefficient measures how strong the linear relationship is between each feature and the target (Exited).
- Age has the strongest correlation of all the features (correlation coefficient = 0.285). As age goes up, customers' tendency to leave increases.
- The next most correlated feature is Germany (correlation coefficient = 0.173): customers based in Germany are more likely to exit.
- The third strongest is whether the customer is an active member (correlation coefficient = -0.156). Customers who are not active are more likely to exit.
- The next strongest is Balance (correlation coefficient = 0.119). As a customer's balance increases, the tendency to leave the service increases.
- The fifth is whether the customer is female (correlation coefficient = 0.107): female customers tend to leave the service more often than male customers. Male is the mirror image of Female (correlation coefficient = -0.107) and carries the same information: male customers are less likely to exit.
- The next feature is France (correlation coefficient = -0.105), which has a negative relationship with churn: customers from France are less likely to leave.
- All seven features above have absolute correlations larger than the average lines, so they are the features most correlated with churn.
- The remaining six features are only weakly correlated with churn: their coefficients are close to 0 and fall below both the average and the absolute-average correlations. Spain (correlation coefficient = -0.053), NumOfProducts (-0.048), CreditScore (-0.027), Tenure (-0.014), EstimatedSalary (0.012) and HasCrCard (-0.007) are therefore not determining factors for predicting churn.
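The split between features above and below the absolute-average line can also be computed rather than read off the chart. The following is a minimal sketch under stated assumptions: `split_by_abs_mean` is a hypothetical helper, the column names follow the table used in this notebook, and the toy values are made up for illustration and are not taken from the churn data.

```python
import pandas as pd

def split_by_abs_mean(table, feature_col='Feature', corr_col='Correlation_original'):
    """Split features into those above and those at or below the mean |correlation|."""
    strength = table[corr_col].abs()
    threshold = strength.mean()
    above = table.loc[strength > threshold, feature_col].tolist()
    below = table.loc[strength <= threshold, feature_col].tolist()
    return threshold, above, below

# Made-up toy values, not taken from the churn data
toy = pd.DataFrame({'Feature': ['a', 'b', 'c', 'd'],
                    'Correlation_original': [0.4, -0.3, 0.05, -0.05]})
threshold, above, below = split_by_abs_mean(toy)
print(above, below)
```

Applied to the notebook's table, this would be called as `split_by_abs_mean(sort_highest_correlated)`. Because the comparison uses absolute values, negatively correlated features such as IsActiveMember are treated by strength rather than by sign.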


Please save this notebook as a PDF or notebook file containing your finished plots and submit them on the website by 8th December.
