# **Week 2: Challenge Activity**

In this challenge activity, you will use the logistic regression model ([from Python - Statsmodels](https://www.statsmodels.org/stable/index.html)) to predict the churn rate of customers in a multinational telecom company. The churn rate is the percentage of subscribers to a service who discontinue their subscriptions within a given time period. For a company to expand its clientele, its growth rate, as measured by the number of new customers, must exceed its churn rate.
<br>
The logistic regression model can predict how likely it is that each current customer will leave the company in the near future, and hence estimate the company's churn rate.
<br>
The data is at the individual level and has features covering geography, personal information (age / gender) and credit-rating-based features of the customers. The target is a boolean that tells us whether a particular customer churned (“exited”) from the company (flag=1).
<br>
<br>
**Please download the data CSV file from [Task 1 in Challenge Activity](https://canvas.hull.ac.uk/courses/75587/files/6259234/download?download_frd=1) and save it to your Google Drive.**



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Import the necessary libraries
# We are using Logistic Regression in Python with the statsmodels package (statsmodels.formula.api).

import pandas as pd
from statsmodels.formula.api import logit
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from sklearn.preprocessing import quantile_transform
import numpy as np
import seaborn as sns
import ipywidgets as widgets
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [3]:
# Connect to your Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
# Load the data from your Google Drive.
# Set the index columns: 'RowNumber', 'CustomerId', 'Surname'.
df = pd.read_csv("/content/gdrive/MyDrive/MSc/Ethics/Churn_Modelling.csv", # Change this to the path where you saved the data.
                 index_col=["RowNumber","CustomerId","Surname"])
df.head()

RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10000 entries, (np.int64(1), np.int64(15634602), 'Hargrave') to (np.int64(10000), np.int64(15628319), 'Walker')
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  int64  
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 1.6+ MB


In [56]:
# Divide columns based on data types.
categorical_cols = ["Geography", "Gender", "Tenure", "NumOfProducts", "HasCrCard",]
non_categorical_cols = ["CreditScore", "Age", "Balance", "EstimatedSalary"]

# Create a dropdown list with the numerical columns.
w = widgets.Dropdown(
    options=non_categorical_cols,
    value="CreditScore",
    description="Task:",
)
display(w)

Dropdown(description='Task:', options=('CreditScore', 'Age', 'Balance', 'EstimatedSalary'), value='CreditScore…

In [7]:
# Store the column selected in the dropdown list above
non_categorical_col = w.value

# Set background colour
layout = go.Layout(plot_bgcolor='#F0E9E6')

# Make a figure
# See https://plotly.com/python/graph-objects/ for more information
fig = go.Figure(layout=layout)

# Add boxplots to the figure
# Binary flag 1 if the customer closed account with bank and 0 if the customer
# is retained.
fig.add_trace(go.Box(y=df.loc[df["Exited"]==1,non_categorical_col],\
                     marker_color = 'indianred',
                     name="Churn"))
fig.add_trace(go.Box(y=df.loc[df["Exited"]==0,non_categorical_col],\
                     marker_color = 'lightseagreen',
                     name="Non Churn"))
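# Set title and labels
# The boxes are grouped by churn status on the x-axis; the y-axis shows the
# values of the selected column.
fig.update_layout(
                   title='Continuous Regressor to Target',
                   xaxis_title='Customer group',
                   yaxis_title=f"{non_categorical_col}",
                   xaxis_showgrid=False,
                   yaxis_showgrid=False
)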

fig.show()

In [8]:
# Divide the explanatory variable columns into categorical and numerical data
categorical_cols = ["Geography", "Gender",  "HasCrCard",]
non_categorical_cols = ["Age"]

# Create a formula
# Target variable: Exited
# ~: separates the left-hand side of the model from the right-hand side
# C(categorical column name): Categorical --> Dummy coding
# -1: removes the intercept
# See https://www.statsmodels.org/stable/examples/notebooks/generated/formulas.html for further information
formula = "Exited"+"~"+"+".join(non_categorical_cols)+"+"+\
          "+".join([f"C({each_categorical_col})" for each_categorical_col\
                    in categorical_cols]) + "-" + "1"
print(f'Formula: {formula}')

Formula: Exited~Age+C(Geography)+C(Gender)+C(HasCrCard)-1


In [9]:
# Create logistic regression and fit it using the formula above.
logitfit = logit(formula = str(formula),
                 data = df).fit()

# Check summary
logitfit.summary()

Optimization terminated successfully.
         Current function value: 0.449401
         Iterations 6


Dep. Variable:,Exited,No. Observations:,10000.0
Model:,Logit,Df Residuals:,9994.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 20 Oct 2025",Pseudo R-squ.:,0.111
Time:,08:03:42,Log-Likelihood:,-4494.0
converged:,True,LL-Null:,-5054.9
Covariance Type:,nonrobust,LLR p-value:,2.577e-240

,coef,std err,z,P>|z|,[0.025,0.975]
C(Geography)[France],-3.9351,0.120,-32.888,0.000,-4.170,-3.701
C(Geography)[Germany],-3.0214,0.119,-25.368,0.000,-3.255,-2.788
C(Geography)[Spain],-3.9133,0.127,-30.898,0.000,-4.162,-3.665
C(Gender)[T.Male],-0.5299,0.053,-9.974,0.000,-0.634,-0.426
C(HasCrCard)[T.1],-0.0320,0.058,-0.552,0.581,-0.145,0.082
Age,0.0633,0.002,26.207,0.000,0.059,0.068
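
The coefficients in the summary above are on the log-odds scale, which is hard to read directly. A minimal sketch, with the values copied by hand from the table rather than read from `logitfit`, converts three of them to odds ratios. Because the formula ends in `-1`, the three `C(Geography)` terms act as per-country intercepts, so they are left out here.

```python
import math

# Coefficients copied from the summary table above (log-odds scale).
coefs = {
    "Age": 0.0633,
    "C(Gender)[T.Male]": -0.5299,
    "C(HasCrCard)[T.1]": -0.0320,
}

# exp(coef) is the multiplicative change in the odds of churn
# for a one-unit increase in the variable, holding the others fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Read this way, each additional year of age multiplies the odds of churn by roughly 1.065, while being male multiplies them by roughly 0.59 relative to the female baseline.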


In [10]:
# Predict the probability of Churn
df["proba"] = logitfit.predict(df)
df["predicted"] = 0 # initially set all to 0

# A threshold of 0.3 helps compensate for the class imbalance
df.loc[df["proba"]>0.3,"predicted"] = 1

In [11]:
df.head()

RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,proba,predicted
1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1,0.212698,0
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,0.211107,0
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,0.212698,0
4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0.187452,0
5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0.227296,0


In [43]:
from sklearn.metrics import confusion_matrix, classification_report

# Compare predictions to actual
print(confusion_matrix(df["Exited"], df["predicted"]))
print(classification_report(df["Exited"], df["predicted"]))

# See distribution
print(df["Exited"].value_counts())
print(df["predicted"].value_counts())
print(f"Percentage of customers who Exited: {df['Exited'].mean():.3f}")
print(f"Percentage of customers predicted to exit: {df['predicted'].mean():.3f}")
print(f"Average predicted probability: {df['proba'].mean():.3f}")

[[6883 1080]
 [1154  883]]
              precision    recall  f1-score   support

           0       0.86      0.86      0.86      7963
           1       0.45      0.43      0.44      2037

    accuracy                           0.78     10000
   macro avg       0.65      0.65      0.65     10000
weighted avg       0.77      0.78      0.78     10000

Exited
0    7963
1    2037
Name: count, dtype: int64
predicted
0    8037
1    1963
Name: count, dtype: int64
Percentage of customers who Exited: 0.204
Percentage of customers predicted to exit: 0.196
Average predicted probability: 0.204


In [12]:
# Create a dropdown list with categorical columns.
w_cat = widgets.Dropdown(
    options=categorical_cols,
    value="Geography",
    description="Task:",
)
display(w_cat)

Dropdown(description='Task:', options=('Geography', 'Gender', 'HasCrCard'), value='Geography')

In [13]:
# Calculate crosstab with the selected category and Exited. More about crosstab: https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
ct = pd.crosstab(df[w_cat.value], df["Exited"])
ct.columns = ["No Churn","Churn"] # Name the no-churn and churn columns
ct =ct.reset_index()
ct

,Geography,No Churn,Churn
0,France,4204,810
1,Germany,1695,814
2,Spain,2064,413
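
Raw counts are hard to compare across countries of different sizes, so it can help to turn the crosstab into churn rates. The sketch below uses the counts copied by hand from the table above; the dictionary names are illustrative and not part of the notebook.

```python
# (no churn, churn) counts per country, copied from the crosstab above.
counts = {
    "France": (4204, 810),
    "Germany": (1695, 814),
    "Spain": (2064, 413),
}

# Observed churn rate = churned customers / all customers in that country.
churn_rate = {
    country: churned / (stayed + churned)
    for country, (stayed, churned) in counts.items()
}

for country, rate in churn_rate.items():
    print(f"{country}: {rate:.1%}")
```

Germany's observed churn rate is about twice that of France or Spain, which is consistent with Germany having the least negative `C(Geography)` coefficient in the fitted model.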


In [14]:
# Extract the logit coefficients and sort them by absolute value.
logit_coeffs = logitfit.summary2().tables[1]
logit_coeffs = logit_coeffs.reindex(logit_coeffs["Coef."].abs().sort_values().index)

In [15]:
logit_coeffs

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
C(HasCrCard)[T.1],-0.031968,0.057911,-0.55202,0.5809346,-0.145472,0.081536
Age,0.063294,0.002415,26.207172,2.201518e-151,0.058561,0.068028
C(Gender)[T.Male],-0.529893,0.05313,-9.973583,1.989203e-23,-0.634025,-0.425761
C(Geography)[Germany],-3.021432,0.119103,-25.368235,5.655239e-142,-3.25487,-2.787995
C(Geography)[Spain],-3.913336,0.126652,-30.898339,1.257440e-209,-4.161569,-3.665103
C(Geography)[France],-3.935134,0.119654,-32.887697,3.295416e-237,-4.169651,-3.700617
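
The `z` and confidence-interval columns can be reproduced from `Coef.` and `Std.Err.` alone. The following sketch does this for `Age`, assuming the usual normal-approximation (Wald) interval and using the rounded values copied by hand from the table, so the last digits may differ slightly from the output above.

```python
from statistics import NormalDist

# Values for Age copied from the coefficient table above.
coef, std_err = 0.063294, 0.002415

# Two-sided 95% critical value of the standard normal (about 1.96).
z_crit = NormalDist().inv_cdf(0.975)

z_stat = coef / std_err            # Wald z statistic
lower = coef - z_crit * std_err    # lower bound, the [0.025 column
upper = coef + z_crit * std_err    # upper bound, the 0.975] column

print(f"z = {z_stat:.2f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

An interval that excludes zero, as it does for `Age`, corresponds to a small p-value; the interval for `C(HasCrCard)[T.1]` straddles zero, which matches its p-value of 0.58.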


In [16]:
# Visualise the logistic regression coefficients (sorted by absolute value).
# The crosstab `ct` is unrelated to the coefficients, so it is not passed here.
fig = px.bar(x=logit_coeffs['Coef.'],
             y=logit_coeffs['Coef.'].index,
             orientation="h",
             color_discrete_sequence=['lightseagreen'])

fig.update_layout(
                   plot_bgcolor='#F0E9E6',
                   title='Feature Importances',
                   xaxis_title='Coefficient Importance',
                   yaxis_title='Features',
                   xaxis_showgrid=False,
                   yaxis_showgrid=False
)

fig.show()

In [17]:
layout = go.Layout(plot_bgcolor='#F0E9E6')
fig = go.Figure(layout=layout)

# Add scatter plot.
fig.add_trace(
    go.Scatter(
    x=logit_coeffs['Coef.'],
    y=logit_coeffs['Coef.'].index,
    line=dict(color='#42C4F7', width=2),
    mode='markers',

    error_x=dict(
            type='data',
            symmetric=False,
            array=logit_coeffs['0.975]'] - logit_coeffs['Coef.'], # Upper CI
            arrayminus=logit_coeffs['Coef.'] - logit_coeffs['[0.025'], # Lower CI
            color='#8793c4')
        )
    )


fig.update_layout(
                   title='Regression Meta Analysis',
                   xaxis_title='Weight Estimates',
                   yaxis_title='Variable',
                   xaxis_showgrid=False,
                   yaxis_showgrid=False
)

fig.show()

In [18]:
# One hot encoding Geography
dummy_encoded_df = pd.get_dummies(df[non_categorical_cols+categorical_cols],\
               columns=["Geography"])

dummy_encoded_df.head(5)

RowNumber,CustomerId,Surname,Age,Gender,HasCrCard,Geography_France,Geography_Germany,Geography_Spain
1,15634602,Hargrave,42,Female,1,True,False,False
2,15647311,Hill,41,Female,0,False,False,True
3,15619304,Onio,42,Female,1,True,False,False
4,15701354,Boni,39,Female,0,True,False,False
5,15737888,Mitchell,43,Female,1,False,False,True


In [19]:
# Reorder the columns to match the row order of logit_coeffs.
dummy_encoded_df = dummy_encoded_df[["HasCrCard", "Age", "Gender",
                                      "Geography_Germany", "Geography_Spain",
                                      "Geography_France",
                                      ]]

# Change Gender to 0 or 1
dummy_encoded_df["Gender"] = dummy_encoded_df["Gender"].map({
    "Female":0, "Male":1
})

dummy_encoded_df.head()

RowNumber,CustomerId,Surname,HasCrCard,Age,Gender,Geography_Germany,Geography_Spain,Geography_France
1,15634602,Hargrave,1,42,0,False,False,True
2,15647311,Hill,0,41,0,False,True,False
3,15619304,Onio,1,42,0,False,False,True
4,15701354,Boni,0,39,0,False,False,True
5,15737888,Mitchell,1,43,0,False,True,False


In [20]:
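# Compute the effect (coefficient x feature value) of each feature.
# Note: the multiplication is positional, so the column order of
# dummy_encoded_df must match the row order of logit_coeffs.
effects = dummy_encoded_df * logit_coeffs['Coef.'].to_numpy()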
effects.head()

RowNumber,CustomerId,Surname,HasCrCard,Age,Gender,Geography_Germany,Geography_Spain,Geography_France
1,15634602,Hargrave,-0.031968,2.658365,-0.0,-0.0,-0.0,-3.935134
2,15647311,Hill,-0.0,2.595071,-0.0,-0.0,-3.913336,-0.0
3,15619304,Onio,-0.031968,2.658365,-0.0,-0.0,-0.0,-3.935134
4,15701354,Boni,-0.0,2.468482,-0.0,-0.0,-0.0,-3.935134
5,15737888,Mitchell,-0.031968,2.72166,-0.0,-0.0,-3.913336,-0.0
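
The effects above are computed by position, which only works because the columns were reordered to match the sorted coefficients. A more defensive alternative, sketched here in plain Python for the first customer, looks each coefficient up by column name. The mapping `coef_by_column` is a hypothetical helper written by hand from the coefficient table; it is not something statsmodels provides.

```python
# Hypothetical name-based mapping from encoded column to fitted coefficient.
coef_by_column = {
    "HasCrCard": -0.031968,
    "Age": 0.063294,
    "Gender": -0.529893,            # 1 = Male, 0 = Female (baseline)
    "Geography_Germany": -3.021432,
    "Geography_Spain": -3.913336,
    "Geography_France": -3.935134,
}

# Encoded features of the first customer (Hargrave) from the table above.
row = {
    "HasCrCard": 1, "Age": 42, "Gender": 0,
    "Geography_Germany": False, "Geography_Spain": False,
    "Geography_France": True,
}

# Effect = feature value x coefficient, matched by name rather than position.
effects_row = {col: float(value) * coef_by_column[col]
               for col, value in row.items()}

for col, effect in effects_row.items():
    print(f"{col}: {effect:+.3f}")
```

With a name-based lookup, reordering the columns of the encoded data frame can no longer silently pair a feature with the wrong coefficient.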


In [21]:
effects.columns

Index(['HasCrCard', 'Age', 'Gender', 'Geography_Germany', 'Geography_Spain',
       'Geography_France'],
      dtype='object')

In [22]:
layout = go.Layout(plot_bgcolor='#F0E9E6')
fig = go.Figure(layout=layout)

for each_col in effects.columns:
    fig.add_trace(go.Box(x=effects[each_col],\
                     marker_color = 'lightseagreen',
                     name=each_col))

fig.update_layout(
                   title='Effect Plot',
                   xaxis_title="Effects",
                   yaxis_title='Features',
                   xaxis_showgrid=False,
                   yaxis_showgrid=False
)

fig.show()

In [23]:
(df["proba"]>0.5).head(30)

RowNumber,CustomerId,Surname,proba
1,15634602,Hargrave,False
2,15647311,Hill,False
3,15619304,Onio,False
4,15701354,Boni,False
5,15737888,Mitchell,False
6,15574012,Chu,False
7,15592531,Bartlett,False
8,15656148,Obinna,False
9,15792365,He,False
10,15592389,H?,False


In [24]:
# Display a single observation
SET_INDEX_DF = 16 # Select a customer

local_data = dummy_encoded_df.iloc[SET_INDEX_DF,:]
local_effects =  effects.iloc[SET_INDEX_DF,:]
local_data,local_effects

(HasCrCard                1
 Age                     58
 Gender                   1
 Geography_Germany     True
 Geography_Spain      False
 Geography_France     False
 Name: (17, 15737452, Romeo), dtype: object,
 HasCrCard           -0.031968
 Age                  3.671076
 Gender              -0.529893
 Geography_Germany   -3.021432
 Geography_Spain     -0.000000
 Geography_France    -0.000000
 Name: (17, 15737452, Romeo), dtype: float64)

In [25]:
# Display with pandas
dummy_encoded_df.iloc[SET_INDEX_DF:SET_INDEX_DF+1,:].head()

RowNumber,CustomerId,Surname,HasCrCard,Age,Gender,Geography_Germany,Geography_Spain,Geography_France
17,15737452,Romeo,1,58,1,True,False,False


In [26]:
layout = go.Layout(plot_bgcolor='#F0E9E6')
fig = go.Figure(layout=layout)

for each_col in effects.columns:
    fig.add_trace(go.Box(x=effects[each_col],\
                     marker_color = 'lightseagreen',
                     name=each_col))


fig.add_trace(go.Scatter(
    x=local_effects.to_numpy(),
    y=local_effects.index,
    hovertext=local_data.to_numpy(),
    hoverinfo="text",
    marker=dict(
        color="red"
    ),
    mode="markers",
    marker_symbol="square-x",
    showlegend=False
))



fig.update_layout(
                   title=f'Local Effects of Data Point {SET_INDEX_DF}, Predicted Probability = {df.iloc[SET_INDEX_DF,:]["proba"]:.3f}',
                   xaxis_title="Effects",
                   yaxis_title='Features',
                   xaxis_showgrid=False,
                   yaxis_showgrid=False
)

fig.show()

#### Task 1:  Write code for a customer whose predicted probability of churn is less than 0.5. Create the box plot and provide an interpretation.

In [27]:
(df["proba"]<0.5).head(30)

RowNumber,CustomerId,Surname,proba
1,15634602,Hargrave,True
2,15647311,Hill,True
3,15619304,Onio,True
4,15701354,Boni,True
5,15737888,Mitchell,True
6,15574012,Chu,True
7,15592531,Bartlett,True
8,15656148,Obinna,True
9,15792365,He,True
10,15592389,H?,True


In [37]:
# Select a 32 year old female customer from Spain who has a credit card
selected_customer = df[(df['Age'] == 32) & (df['Gender'] == 'Female') & (df['Geography'] == 'Spain') & (df['HasCrCard'] == 1)]

selected_customer

RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,proba,predicted
22,15597945,Dellucci,636,Spain,Female,32,8,0.0,2,1,0,138555.46,0,0.127875,0
151,15650237,Morgan,754,Spain,Female,32,7,0.0,2,1,0,89520.75,0,0.127875,0
197,15635905,Moran,616,Spain,Female,32,6,0.0,2,1,1,43001.46,0,0.127875,0
353,15777352,Ikedinachukwu,568,Spain,Female,32,7,169399.6,1,1,0,61936.22,0,0.127875,0
467,15663252,Olisanugo,850,Spain,Female,32,9,0.0,2,1,1,18924.92,0,0.127875,0
577,15761986,Obialo,439,Spain,Female,32,3,138901.61,1,1,0,75685.97,0,0.127875,0
674,15745621,Wertheim,640,Spain,Female,32,6,118879.35,2,1,1,19131.71,0,0.127875,0
913,15566091,Thomsen,545,Spain,Female,32,4,0.0,1,1,0,94739.2,0,0.127875,0
1682,15746749,Fleming,681,Spain,Female,32,3,0.0,2,1,1,59679.9,0,0.127875,0
1910,15773605,Iadanza,670,Spain,Female,32,3,0.0,2,1,0,46175.7,0,0.127875,0


In [38]:
# Display a single observation
SET_INDEX_DF =  21

local_data = dummy_encoded_df.iloc[SET_INDEX_DF,:]
local_effects =  effects.iloc[SET_INDEX_DF,:]
local_data,local_effects

(HasCrCard                1
 Age                     32
 Gender                   0
 Geography_Germany    False
 Geography_Spain       True
 Geography_France     False
 Name: (22, 15597945, Dellucci), dtype: object,
 HasCrCard           -0.031968
 Age                  2.025421
 Gender              -0.000000
 Geography_Germany   -0.000000
 Geography_Spain     -3.913336
 Geography_France    -0.000000
 Name: (22, 15597945, Dellucci), dtype: float64)

In [39]:
# Display with pandas
df.iloc[SET_INDEX_DF:SET_INDEX_DF+1,:].head()

RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,proba,predicted
22,15597945,Dellucci,636,Spain,Female,32,8,0.0,2,1,0,138555.46,0,0.127875,0


In [40]:
layout = go.Layout(plot_bgcolor='#F0E9E6')
fig = go.Figure(layout=layout)

for each_col in effects.columns:
    fig.add_trace(go.Box(x=effects[each_col],\
                     marker_color = 'lightseagreen',
                     name=each_col))


fig.add_trace(go.Scatter(
    x=local_effects.to_numpy(),
    y=local_effects.index,
    hovertext=local_data.to_numpy(),
    hoverinfo="text",
    marker=dict(
        color="red"
    ),
    mode="markers",
    marker_symbol="square-x",
    showlegend=False
))



fig.update_layout(
                   title=f'Local Effects of Data Point {SET_INDEX_DF}, Predicted Probability = {df.iloc[SET_INDEX_DF,:]["proba"]:.3f}',
                   xaxis_title="Effects",
                   yaxis_title='Features',
                   xaxis_showgrid=False,
                   yaxis_showgrid=False
)

fig.show()

#### Task 2:  Consider both the data point and its interpretation. What further interpretation can you derive in terms of how representative they are of the overall data distributions?

This customer is a 32-year-old female from Spain with a credit card. Her age contributes +2.03 to the log odds, raising the likelihood of churn. Being female contributes 0, because female is the baseline level for gender. Being from Spain contributes -3.91, lowering the likelihood of churn. Having a credit card contributes -0.03; this feature is not significant according to its p-value of 0.58, but it is still included in the churn probability calculation. The overall log odds of -1.920, the sum of the individual contributions, leads me to a probability of 0.12787 (applying the sigmoid function to convert). As this is less than the threshold of 0.3 set in the *df.loc[df["proba"] > 0.3, "predicted"] = 1* statement, she is predicted not to churn, which in this case matches her actual Exited status.
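
The arithmetic in the paragraph above can be checked with a few lines of Python. This is a sketch using the effect values copied by hand from the Task 1 output for this customer, not a call into the fitted model:

```python
import math

def sigmoid(z):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Local effects for customer (22, 15597945, Dellucci), copied from the output above.
local_effects = {
    "HasCrCard": -0.031968,
    "Age": 2.025421,
    "Gender": 0.0,
    "Geography_Spain": -3.913336,
}

log_odds = sum(local_effects.values())
probability = sigmoid(log_odds)
predicted = int(probability > 0.3)   # same 0.3 threshold as the notebook

print(f"log odds = {log_odds:.3f}, probability = {probability:.5f}, "
      f"predicted = {predicted}")
```

The result agrees with the `proba` value of 0.127875 stored for this customer, which confirms that the effects plot is just a decomposition of the model's linear predictor.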

In terms of how representative she is, her age is at the lower end of the distribution; although it pushes the overall probability of churn higher, it is not enough to offset the negative log odds from the other features. Her gender is female, the baseline value set at 0 log odds, so gender adds no pull towards not churning (a male customer would receive -0.53). Coming from Spain provides a negative contribution almost as strong as France's and stronger than Germany's, so she has less chance of churning than a similar person from Germany. The narrow range of the box for HasCrCard in the plot tells us that having a credit card or not hardly changes whether a customer is 'representative' of the set.

Given the large negative contribution from every country, the insignificance of the credit card and the binary impact of gender, we could derive a very simple classification rule from this information. A probability threshold of 0.3 corresponds to total log odds of about -0.85. For example, a female from Spain (country contribution -3.91) needs roughly 3.07 more from the other features to be predicted as churn, which equates to an age of roughly 49. The table below confirms this: the 49-year-old females from Spain with a credit card are all predicted to churn, with a probability of just over 0.30.
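
The break-even age in this rule can be derived rather than found by trial. A minimal sketch, assuming a female customer from Spain with a credit card and using coefficients copied by hand from the table:

```python
import math

threshold = 0.3
age_coef = 0.063294
spain = -3.913336
has_card = -0.031968   # set to 0.0 for a customer without a credit card

# Log odds at which the predicted probability equals the threshold.
target_log_odds = math.log(threshold / (1 - threshold))

# Contribution that age must supply to reach the threshold.
needed_from_age = target_log_odds - (spain + has_card)

# Smallest whole age whose contribution reaches it.
min_age = math.ceil(needed_from_age / age_coef)

print(f"target log odds = {target_log_odds:.3f}, minimum age = {min_age}")
```

Including the small credit-card term gives the same answer of 49, because the term shifts the required age by only about half a year.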

In [52]:
# Select a 49 year old female customer from Spain who has a credit card
selected_customer = df[(df['Age'] == 49) & (df['Gender'] == 'Female') & (df['Geography'] == 'Spain') & (df['HasCrCard'] == 1)]

selected_customer

RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,proba,predicted
574,15607312,Ch'ang,648,Spain,Female,49,10,0.0,2,1,1,159835.78,1,0.300719,1
1052,15715003,Ko,625,Spain,Female,49,2,80816.45,1,1,1,20018.79,0,0.300719,1
1571,15607133,Shih,717,Spain,Female,49,1,110864.38,2,1,1,124532.9,1,0.300719,1
1895,15783398,Rizzo,535,Spain,Female,49,7,115309.75,1,1,0,111421.77,0,0.300719,1
4126,15633378,Davidson,692,Spain,Female,49,9,0.0,2,1,0,178342.63,0,0.300719,1
6204,15790763,Trujillo,599,Spain,Female,49,2,0.0,2,1,0,111190.53,0,0.300719,1
6405,15745399,Marino,649,Spain,Female,49,2,0.0,1,1,0,84863.85,1,0.300719,1
8599,15624424,Palerma,678,Spain,Female,49,1,0.0,2,1,1,102472.9,0,0.300719,1
9136,15664432,Chao,727,Spain,Female,49,7,96296.78,1,1,0,190457.87,1,0.300719,1


In [54]:
# Display a single observation
SET_INDEX_DF =  573

local_data = dummy_encoded_df.iloc[SET_INDEX_DF,:]
local_effects =  effects.iloc[SET_INDEX_DF,:]
local_data,local_effects

(HasCrCard                1
 Age                     49
 Gender                   0
 Geography_Germany    False
 Geography_Spain       True
 Geography_France     False
 Name: (574, 15607312, Ch'ang), dtype: object,
 HasCrCard           -0.031968
 Age                  3.101426
 Gender              -0.000000
 Geography_Germany   -0.000000
 Geography_Spain     -3.913336
 Geography_France    -0.000000
 Name: (574, 15607312, Ch'ang), dtype: float64)

#### Task 3:  Please examine the taxonomy of interpretable methods you have studied this week as shown in the image below. Reflect on the interpretability techniques utilised in this challenge exercise. Then, identify and list the branches of the taxonomy to which these techniques correspond.

The interpretability is intrinsic, as the coefficients used for interpretation are part of the model itself. The exercise can draw on model-specific tools (such as calculating odds ratios) as well as model-agnostic tools (which work on any model). We have explained both local (the probability of churn for individual cases) and global (the overall distribution) aspects of the model.