## Capstone Project 1
## Statistical Data Analysis
### Federico Di Martino

#### Performing Statistical Data Analysis on Bank Customer Churn Data

#### Answering the following questions
- Are there variables that are particularly significant in terms of explaining the answer to your project question?
- Are there significant differences between subgroups in your data that may be relevant to your project aim?
- Are there strong correlations between pairs of independent variables or between an independent and a dependent variable?
- What are the most appropriate tests to use to analyze these relationships?



### Preliminary Actions

In [1]:
## Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## Import data
churn_data = pd.read_csv("Churn_Modelling.csv", index_col = 0)

## Print head to show structure of data
print(churn_data.head())

           CustomerId   Surname  CreditScore Geography  Gender  Age  Tenure  \
RowNumber                                                                     
1            15634602  Hargrave          619    France  Female   42       2   
2            15647311      Hill          608     Spain  Female   41       1   
3            15619304      Onio          502    France  Female   42       8   
4            15701354      Boni          699    France  Female   39       1   
5            15737888  Mitchell          850     Spain  Female   43       2   

             Balance  NumOfProducts  HasCrCard  IsActiveMember  \
RowNumber                                                        
1               0.00              1          1               1   
2           83807.86              1          0               1   
3          159660.80              3          1               0   
4               0.00              2          0               0   
5          125510.82              1          1    

In [2]:
## Generate the correlation matrix in the same way as for the data story segment

## For corr() to work, all values need to be numeric
## Work on a copy so that churn_data itself is not modified
heatmap_data = churn_data.copy()

## Reshape data so that the Geography column becomes three binary columns
heatmap_data['IsFrance'] = (heatmap_data['Geography'] == 'France').astype(int)
heatmap_data['IsSpain'] = (heatmap_data['Geography'] == 'Spain').astype(int)
heatmap_data['IsGermany'] = (heatmap_data['Geography'] == 'Germany').astype(int)

## Change Gender column such that Female -> 1, Male -> 0
heatmap_data['Gender'] = heatmap_data['Gender'].map({'Female': 1, 'Male': 0})

## Drop columns that will not be used
heatmap_data = heatmap_data.drop(['CustomerId', 'Surname', 'Geography'], axis = 'columns')

## Calculate correlations (Pearson is the default method)
corr = heatmap_data.corr()

## Visualise correlation matrix
## Styler.set_precision is deprecated since pandas 1.3; format(precision=2) replaces it
corr.style.background_gradient(cmap='coolwarm', axis = None).format(precision=2)



| | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | IsFrance | IsSpain | IsGermany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 1.0 | 0.0029 | -0.004 | 0.00084 | 0.0063 | 0.012 | -0.0055 | 0.026 | -0.0014 | -0.027 | -0.0089 | 0.0048 | 0.0055 |
| Gender | 0.0029 | 1.0 | 0.028 | -0.015 | -0.012 | 0.022 | -0.0058 | -0.023 | 0.0081 | 0.11 | -0.0068 | -0.017 | 0.025 |
| Age | -0.004 | 0.028 | 1.0 | -0.01 | 0.028 | -0.031 | -0.012 | 0.085 | -0.0072 | 0.29 | -0.039 | -0.0017 | 0.047 |
| Tenure | 0.00084 | -0.015 | -0.01 | 1.0 | -0.012 | 0.013 | 0.023 | -0.028 | 0.0078 | -0.014 | -0.0028 | 0.0039 | -0.00057 |
| Balance | 0.0063 | -0.012 | 0.028 | -0.012 | 1.0 | -0.3 | -0.015 | -0.01 | 0.013 | 0.12 | -0.23 | -0.13 | 0.4 |
| NumOfProducts | 0.012 | 0.022 | -0.031 | 0.013 | -0.3 | 1.0 | 0.0032 | 0.0096 | 0.014 | -0.048 | 0.0012 | 0.009 | -0.01 |
| HasCrCard | -0.0055 | -0.0058 | -0.012 | 0.023 | -0.015 | 0.0032 | 1.0 | -0.012 | -0.0099 | -0.0071 | 0.0025 | -0.013 | 0.011 |
| IsActiveMember | 0.026 | -0.023 | 0.085 | -0.028 | -0.01 | 0.0096 | -0.012 | 1.0 | -0.011 | -0.16 | 0.0033 | 0.017 | -0.02 |
| EstimatedSalary | -0.0014 | 0.0081 | -0.0072 | 0.0078 | 0.013 | 0.014 | -0.0099 | -0.011 | 1.0 | 0.012 | -0.0033 | -0.0065 | 0.01 |
| Exited | -0.027 | 0.11 | 0.29 | -0.014 | 0.12 | -0.048 | -0.0071 | -0.16 | 0.012 | 1.0 | -0.1 | -0.053 | 0.17 |


### Are there variables that are particularly significant in terms of explaining the answer to your project question?

The project aims to predict the likelihood of a customer churning (Exited = 1 in the data), so the variables of most interest are those that correlate most strongly with Exited. None of the correlations is strong, but Age (0.29), IsGermany (0.17), IsActiveMember (-0.16), Balance (0.12) and Gender (0.11, with Female coded as 1) stand out and should be investigated first.
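
The ranking described above can be reproduced by sorting the Exited column of the correlation matrix by absolute value. The sketch below is a minimal illustration and runs on a small synthetic table, not the bank data; the helper name `rank_by_target_correlation` and the generated values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for heatmap_data (hypothetical values, not the bank data)
rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 80, size=n)
is_active = rng.integers(0, 2, size=n)
# Assumption for the demo: churn is more likely for older and inactive customers
churn_prob = 0.05 + 0.006 * (age - 18) + 0.15 * (1 - is_active)
demo = pd.DataFrame({
    "Age": age,
    "IsActiveMember": is_active,
    "Tenure": rng.integers(0, 11, size=n),
    "Exited": (rng.random(n) < churn_prob).astype(int),
})

def rank_by_target_correlation(df, target="Exited"):
    """Return each column's correlation with `target`, largest |r| first."""
    corr_with_target = df.corr()[target].drop(target)
    return corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)

ranked = rank_by_target_correlation(demo)
print(ranked)
```

On the real data the same function could be called as `rank_by_target_correlation(heatmap_data)`, assuming the frame has already been made fully numeric.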

### Are there significant differences between subgroups in your data that may be relevant to your project aim?
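Customer gender, activity status and country (Geography) each split the data into distinct subgroups. All three are among the variables most correlated with Exited above: churn is more common among female customers (Gender, 0.11), inactive members (IsActiveMember, -0.16) and customers in Germany (IsGermany, 0.17).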
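
One way to check whether such subgroup differences are more than noise is to compare churn rates per group and run a chi-square test of independence on the contingency table. This is a sketch of that idea, not the analysis performed in this notebook; the counts below are hypothetical and only illustrate the mechanics, and it assumes SciPy is installed.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts for illustration only (not taken from the bank data)
demo = pd.DataFrame({
    "Geography": ["France"] * 50 + ["Germany"] * 50,
    "Exited": [1] * 8 + [0] * 42 + [1] * 16 + [0] * 34,
})

# Churn rate per subgroup: the mean of a 0/1 column is a proportion
churn_rate = demo.groupby("Geography")["Exited"].mean()

# Contingency table of subgroup vs. outcome, then chi-square test of independence
table = pd.crosstab(demo["Geography"], demo["Exited"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(churn_rate)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```

A small p-value would suggest that churn and subgroup membership are not independent; the same pattern applies to Gender or IsActiveMember by swapping the grouping column.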

### Are there strong correlations between pairs of independent variables or between an independent and a dependent variable?
There are no strong correlations (coefficient magnitude > 0.8) between any pair of variables. Apart from those involving Exited described above, the strongest are between Balance and NumOfProducts (-0.30) and between Balance and the geography indicators (IsGermany 0.40, IsFrance -0.23, IsSpain -0.13).
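
Scanning a correlation matrix by eye gets harder as columns are added, so the strongest pairs can also be extracted programmatically. The sketch below keeps only the upper triangle so each pair is counted once; the helper name `strongest_pairs` is hypothetical, and the small matrix is an excerpt whose values are copied from the output above.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr, top=3):
    """Return the `top` off-diagonal pairs of `corr` with the largest |r|."""
    # Upper triangle only (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# Excerpt of the correlation matrix (values copied from the printed output)
cols = ["Balance", "NumOfProducts", "Exited", "IsGermany"]
corr_excerpt = pd.DataFrame(
    [
        [1.0, -0.3, 0.12, 0.4],
        [-0.3, 1.0, -0.048, -0.01],
        [0.12, -0.048, 1.0, 0.17],
        [0.4, -0.01, 0.17, 1.0],
    ],
    index=cols,
    columns=cols,
)

top_pairs = strongest_pairs(corr_excerpt)
print(top_pairs)
```

Applied to the full `corr` frame, the same call would list the pairs discussed above without relying on the colour gradient.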

### What are the most appropriate tests to use to analyze these relationships?
The relationships were measured with the Pearson product-moment correlation coefficient, which is the default method of DataFrame.corr().
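
For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it by hand and compares the result with pandas; the function name `pearson_r` and the toy values are assumptions for illustration. When one variable is binary, as Exited is, this same quantity is also known as the point-biserial correlation.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson product-moment correlation computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical toy values: a numeric column and a 0/1 outcome
age = pd.Series([25, 32, 41, 47, 53, 60, 38, 29])
exited = pd.Series([0, 0, 1, 0, 1, 1, 0, 0])

r_manual = pearson_r(age, exited)
r_pandas = age.corr(exited)  # method='pearson' is the default
print(round(r_manual, 4), round(r_pandas, 4))
```

The coefficient alone describes strength and direction; a significance test on it (for example `scipy.stats.pearsonr`, which also returns a p-value) would be a natural next step.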