
# Practical Assignment 1


### [Bank Customer Churn Prediction](https://www.kaggle.com/kmalit/bank-customer-churn-prediction)
by Keldine Malit, 2018.

The objective of this notebook is to present an extensive analysis of a Customer Churn Dataset and to predict the customer churn rate. The first part of this notebook will focus on an exploration of the dataset and the other part will demonstrate different machine learning models for the prediction of churn.



## Problem definition

The aim of this study is to identify and visualize which factors contribute to customer churn, and then assess whether a model could be designed to do the following:
1. Classify if a customer is going to churn or not, 
2. Choose a model, based on model performance, that attaches a probability to the churn prediction, making it easier for customer service to target the low-hanging fruit in their efforts to prevent churn.

**We will use this notebook as the framework for your assignment. There are portions of the notebook that will require no editing, however you will be required to edit certain portions in order to complete the assignment.**
______

## Goal of Assignment
_______
Your assignment is to identify and perform the following:

**Question 1:** Which of the features is the most influential in ensuring an accurate classification? Provide a *detailed analysis* in the form of an EDA to justify your answer, and ensure that the sub-questions are answered appropriately.

**Question 2:** Upon removing the feature you identified in Question 1, which classification model is able to perform best? Provide a *detailed performance analysis* to justify your answer.
_______

### Load Libraries

In [None]:
# data wrangling 
import numpy as np
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.options.display.max_rows = None
pd.options.display.max_columns = None

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# QUESTION 1

### Acquire data

The Python Pandas package helps us work with our datasets. We start by loading the dataset into a Pandas DataFrame.

In [None]:
# Read the data frame
df = pd.read_csv('Churn_Modelling.csv', delimiter=',')

### Describe the data

Identify the relevant features of the dataset:

In [None]:
print(df.columns.values)

The dataset has 10,000 rows with 14 attributes.

____

**1.1 Which features are categorical?**

These values classify the samples into sets of similar samples. Within categorical features, are the values nominal, ordinal, or binary? Among other things, this helps us select the appropriate plots for visualization.



**1.2 Which features are numerical?**

These values change from sample to sample. Within numerical features, are the values discrete, continuous, or time-series based? Among other things, this helps us select the appropriate plots for visualization.

______


In [None]:
# preview the data
df.head()
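
For questions 1.1 and 1.2, one possible first pass is to split the columns by dtype and by number of distinct values. The sketch below runs on a small made-up frame; the `toy` data, the helper `split_feature_types`, and the `max_levels` threshold are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration only.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 50],
    "Balance": [0.0, 83807.86, 159660.8, 12000.5],
    "HasCrCard": [1, 0, 1, 1],
    "Exited": [1, 0, 1, 0],
})

def split_feature_types(frame, max_levels=2):
    """Return (categorical, numerical) lists of column names.

    Non-numeric columns are treated as categorical, and so are numeric
    columns with at most `max_levels` distinct values (e.g. 0/1 flags).
    """
    categorical, numerical = [], []
    for col in frame.columns:
        is_numeric = pd.api.types.is_numeric_dtype(frame[col])
        if not is_numeric or frame[col].nunique() <= max_levels:
            categorical.append(col)
        else:
            numerical.append(col)
    return categorical, numerical

categorical, numerical = split_feature_types(toy)
print("categorical:", categorical)
print("numerical:  ", numerical)
```

A rule like this is only a starting point: a numeric column with few levels (such as a product count) may be better treated as ordinal, which the threshold cannot decide for you.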

_____
**1.3  Which features are mixed data types?**

These are features that hold numerical and alphanumeric data within the same column. They are candidates for correction.



**1.4 Which features may contain errors or typos?**

This is harder to review for a large dataset, however reviewing a few samples from a smaller dataset may just tell us outright, which features may require correcting.
_______


In [None]:
df.tail()
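
For question 1.3, a rough way to spot mixed-type columns is to count the distinct Python types stored in each column. This is a minimal sketch on made-up data (the `toy` frame and its `"three"` entry are hypothetical); it is not the only way to perform the check.

```python
import pandas as pd

# Made-up frame in which one column mixes numbers and text (illustration only).
toy = pd.DataFrame({
    "Tenure": [2, "three", 8, 1],
    "Age": [42, 41, 39, 50],
})

# Count the distinct Python types found in each column.
types_per_column = toy.apply(lambda col: col.map(type).nunique())

# Columns holding more than one type are candidates for correction.
mixed_columns = types_per_column[types_per_column > 1].index.tolist()
print(mixed_columns)
```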

______
**1.5 Which features contain blank, null or empty values?**

These will require correcting.



**1.6 What are the data types for various features?**

This helps us to know where conversions are needed.
______


In [None]:
df.info()

____
**1.7 What is the distribution of numerical feature values across the samples?**

This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.
_____


In [None]:
df.describe()

_____
**1.8 What is the distribution of categorical features?**
____

In [None]:
df.describe(include=['O'])

## Data Visualisation

Now we can continue confirming some of our assumptions using visualizations for analyzing the data.

______________
**1.9 Reassess the visualizations below, and add or remove as you see fit to provide a proper visual exploration.**
_________

### Correlating numerical features

Let us start by understanding correlations between numerical features and our solution goal (Exited).



In [None]:
g = sns.FacetGrid(df, col='Exited')
g.map(plt.hist, 'Age', bins=20)
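
Histograms show the shape of the relationship; a numeric summary can complement them. As a sketch, the correlation of each numerical feature with the target can be ranked by absolute value. The `toy` frame below is made up for illustration, and Pearson correlation with a 0/1 target is only one of several reasonable measures.

```python
import pandas as pd

# Made-up data standing in for df (illustration only).
toy = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 60, 29, 55],
    "Balance": [0.0, 500.0, 1200.0, 0.0, 800.0, 300.0, 950.0, 100.0],
    "Exited": [0, 0, 1, 1, 0, 1, 0, 1],
})

# Keep numeric columns only, then correlate every feature with the target.
numeric = toy.select_dtypes(include="number")
corr_with_target = (
    numeric.corr()["Exited"]
    .drop("Exited")                        # drop the target's self-correlation
    .sort_values(key=abs, ascending=False)  # strongest relationship first
)
print(corr_with_target)
```

Correlation only captures linear association, so a feature with a low value here may still matter to a non-linear model.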

### Correlating numerical and ordinal features

We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.


In [None]:
# Age distribution split by churn status (columns) and activity flag (rows).
# Recent seaborn versions use `height` in place of the old `size` argument.
grid = sns.FacetGrid(df, col='Exited', row='IsActiveMember', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

### Correlating categorical features

Now we can correlate categorical features with our solution goal.


In [None]:
# Churn rate by activity flag and gender, one row per country.
grid = sns.FacetGrid(df, row='Geography', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'IsActiveMember', 'Exited', 'Gender', palette='deep')
grid.add_legend()

In [None]:
# We first review the relation between 'Exited' and the categorical variables
fig, axarr = plt.subplots(2, 2, figsize=(20, 12))
sns.countplot(x='Geography', hue = 'Exited',data = df, ax=axarr[0][0])
sns.countplot(x='Gender', hue = 'Exited',data = df, ax=axarr[0][1])
sns.countplot(x='HasCrCard', hue = 'Exited',data = df, ax=axarr[1][0])
sns.countplot(x='IsActiveMember', hue = 'Exited',data = df, ax=axarr[1][1])
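
Count plots can be hard to compare when the groups differ in size. Because `Exited` is coded 0/1, its mean within a group is that group's churn rate. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up data standing in for df (illustration only).
toy = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "Spain"],
    "Exited":    [0, 1, 1, 1, 0, 0],
})

# Mean of a 0/1 target per group = share of customers who churned.
churn_rate = toy.groupby("Geography")["Exited"].mean()
print(churn_rate)
```

The same pattern should carry over to `Gender`, `HasCrCard`, and `IsActiveMember` by changing the grouping column.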

### Correlating categorical and numerical features

We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Geography (Categorical non-numeric), Gender (Categorical non-numeric), Age (Numeric continuous), with Exited (Categorical numeric).


In [None]:
# Mean age by gender, split by country (rows) and churn status (columns).
grid = sns.FacetGrid(df, row='Geography', col='Exited', height=2.2, aspect=1.6)
grid.map(sns.barplot, 'Gender', 'Age', alpha=.5, ci=None)
grid.add_legend()

In [None]:
# Relations based on the continuous data attributes
fig, axarr = plt.subplots(3, 2, figsize=(20, 12))
sns.boxplot(y='CreditScore',x = 'Exited', hue = 'Exited',data = df, ax=axarr[0][0])
sns.boxplot(y='Age',x = 'Exited', hue = 'Exited',data = df , ax=axarr[0][1])
sns.boxplot(y='Tenure',x = 'Exited', hue = 'Exited',data = df, ax=axarr[1][0])
sns.boxplot(y='Balance',x = 'Exited', hue = 'Exited',data = df, ax=axarr[1][1])
sns.boxplot(y='NumOfProducts',x = 'Exited', hue = 'Exited',data = df, ax=axarr[2][0])
sns.boxplot(y='EstimatedSalary',x = 'Exited', hue = 'Exited',data = df, ax=axarr[2][1])

## Wrangle data

At this point you should have collected several assumptions and decisions regarding the dataset and the solution requirements. So far we have not had to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for correcting, creating, and completing goals.

### Correcting by dropping features

This is a good starting goal to execute. By dropping features we are dealing with fewer data points, which speeds up our notebook and eases the analysis.

Based on our assumptions and decisions we want to drop the features: RowNumber, CustomerId, and Surname.



In [None]:
# Check columns list and missing values
df.isnull().sum()

In [None]:
# Get unique count for each variable
df.nunique()

From the above, we will not require the first two attributes, as they are specific to a customer. The surname is borderline, since using it would amount to profiling, so we exclude it as well.

In [None]:
# Drop the columns as explained above
df = df.drop(["RowNumber", "CustomerId", "Surname"], axis = 1)

In [None]:
# Review the top rows of what is left of the data frame
df.head()

In [None]:
# Check variable data types
df.dtypes

### Creating new features from existing ones
We seek to add features that are likely to have an impact on the probability of churning. 

In [None]:
df['BalanceSalaryRatio'] = df.Balance/df.EstimatedSalary
sns.boxplot(y='BalanceSalaryRatio',x = 'Exited', hue = 'Exited',data = df)
plt.ylim(-1, 5)

In [None]:
# Given that tenure is a 'function' of age, we introduce a variable aiming to standardize tenure over age:
df['TenureByAge'] = df.Tenure/(df.Age)
sns.boxplot(y='TenureByAge',x = 'Exited', hue = 'Exited',data = df)
plt.ylim(-1, 1)
plt.show()

In [None]:
# Lastly, we introduce a variable that captures credit score relative to age,
# to take credit behaviour over the customer's adult life into account.
df['CreditScoreGivenAge'] = df.CreditScore/(df.Age)

In [None]:
# Resulting Data Frame
df.head()

In [None]:
# Arrange columns by data type for easier manipulation
continuous_vars = ['CreditScore',  'Age', 'Tenure', 'Balance','NumOfProducts', 'EstimatedSalary', 'BalanceSalaryRatio',
                   'TenureByAge','CreditScoreGivenAge']
cat_vars = ['HasCrCard', 'IsActiveMember','Geography', 'Gender']
df = df[['Exited'] + continuous_vars + cat_vars]
df.head()

### Converting categorical feature to numeric

In [None]:
# For the binary (0/1) variables, we change 0 to -1 so that the models can
# capture a negative relation where the attribute is inapplicable.
df.loc[df.HasCrCard == 0, 'HasCrCard'] = -1
df.loc[df.IsActiveMember == 0, 'IsActiveMember'] = -1
df.head()

In [None]:
# One hot encode the categorical variables
lst = ['Geography', 'Gender']
remove = list()
for i in lst:
    # np.str and np.object were removed from NumPy; test the dtype directly instead
    if df[i].dtype == object or pd.api.types.is_string_dtype(df[i]):
        for j in df[i].unique():
            df[i+'_'+j] = np.where(df[i] == j,1,-1)
        remove.append(i)
df = df.drop(remove, axis=1)
df.head()
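
The loop above can also be expressed with `pd.get_dummies`, which is a swapped-in alternative rather than the notebook's own method. It produces 0/1 indicators, so the sketch maps them to the -1/1 coding used here. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Made-up data standing in for df (illustration only).
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Male"],
    "Age": [42, 41, 39],
})

# One indicator column per category level, coded 0/1.
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], dtype=int)

# Map 0/1 to -1/1 to match the coding used in this notebook.
dummy_cols = [c for c in encoded.columns if c not in toy.columns]
encoded[dummy_cols] = encoded[dummy_cols] * 2 - 1
print(encoded)
```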

In [None]:
# minMax scaling the continuous variables
minVec = df[continuous_vars].min().copy()
maxVec = df[continuous_vars].max().copy()
df[continuous_vars] = (df[continuous_vars]-minVec)/(maxVec-minVec)
df.head()
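
The cell above scales using the minimum and maximum of the full data frame, before any train/test split. One alternative, sketched here with scikit-learn's `MinMaxScaler`, is to learn the range on the training rows only and reuse it for the test rows. The arrays are made-up values, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values standing in for one continuous column, e.g. Age.
train_values = np.array([[20.0], [40.0], [60.0]])
test_values = np.array([[30.0], [70.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_values)  # learns min=20 and max=60
test_scaled = scaler.transform(test_values)        # reuses the training range

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values outside the training range land outside [0, 1] (here 70 maps to 1.25), which is expected when the scaler has not seen the test data.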

## Feature Importance

______
**Question 1.10:** Which of the features is the most influential in ensuring an accurate classification? Provide a detailed analysis in the form of an EDA to justify your answer. Below is some code you may use as part of your analysis of the data to assist with the argument you need to develop.

________
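
As one possible starting point for Question 1.10, a random forest exposes impurity-based feature importances. The sketch below uses synthetic data in which the label depends on `Age` only, so the expected ranking is known in advance; the column names and the threshold of 50 are assumptions for illustration, and this is not presented as the intended solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (illustration only).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Age": rng.uniform(18, 80, n),
    "Balance": rng.uniform(0, 1, n),
    "Tenure": rng.integers(0, 11, n),
})
# In this made-up data the label depends on Age only.
toy_target = (toy["Age"] > 50).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(toy, toy_target)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances can favour features with many distinct values, so it may be worth cross-checking with `sklearn.inspection.permutation_importance` and with the plots from the EDA.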


# QUESTION 2

## Model and Predict

Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirement to narrow these down to a select few models which we can evaluate. Our problem is a binary classification problem: we want to identify the relationship between the output (churned or not) and the other variables or features (Age, Tenure, Balance...). We are also performing a category of machine learning called supervised learning, as we are training our model on a dataset with known labels. With these two criteria, supervised learning plus classification, we can narrow down our choice of models to a few. These include:

- Logistic Regression
- KNN or k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Perceptron
- Artificial neural network
- RVM or Relevance Vector Machine

________

**Question 2:**  Upon removing the feature you identified in Question 1, as well as any additional features you think are not relevant after your analysis, which classification model is able to perform best? Provide a detailed performance analysis to justify your answer. 

*Idea:* If you are able to, perform hyperparameter tuning where appropriate.
____
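
Hyperparameter tuning could, for example, be done with a cross-validated grid search. The sketch below tunes `n_neighbors` for k-NN on synthetic data from `make_classification`; the data, the grid values, and the choice of 5 folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (illustration only).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Candidate values to try; each is scored with 5-fold cross-validation.
param_grid = {"n_neighbors": [3, 5, 9, 15]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Running the search on the training split only keeps the test split untouched for the final comparison.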


Let's set up our training and testing data sets.


In [None]:
# Split Train, test data
df_train = df.sample(frac=0.8,random_state=200)
df_test = df.drop(df_train.index)
print(len(df_train))
print(len(df_test))

In [None]:
y_train = df_train["Exited"]
X_train = df_train.drop(["Exited"], axis = 1)
y_test = df_test["Exited"]
X_test = df_test.drop(["Exited"], axis = 1)
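
The split above samples rows at random, so the share of churners can differ between the two parts. An alternative is a stratified split with scikit-learn's `train_test_split`, sketched here on made-up data with a 20% churn share (the `toy` frame is hypothetical).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 customers, 20% of whom churned (illustration only).
toy = pd.DataFrame({
    "Age": range(100),
    "Exited": [1] * 20 + [0] * 80,
})

X_toy = toy.drop(columns="Exited")
y_toy = toy["Exited"]

# stratify=y_toy keeps the churn share the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=200
)
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```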

Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (the target) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. 

In [None]:
# Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, y_train) * 100, 2)
acc_log
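
To make the role of the logistic function concrete, the sketch below recomputes the churn probability by hand from the fitted coefficients and compares it with `predict_proba`. It uses synthetic data from `make_classification` as a stand-in for the training set, so the numbers themselves carry no meaning for the bank data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X_train / y_train (illustration only).
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_toy, y_toy)

# Linear score z = w . x + b, squashed by the logistic function 1 / (1 + e^-z).
z = X_toy @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
sklearn_proba = model.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

These per-customer probabilities are what the second objective in the problem definition asks for: they allow customers to be ranked by churn risk rather than only labelled.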

Next we model using Support Vector Machines, which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of **two categories**, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary classifier. Note that `SVC()` uses an RBF kernel by default, so its decision boundary is not linear in the original features; `LinearSVC`, used further down, is the linear variant.

In [None]:
# Support Vector Machines

svc = SVC()
svc.fit(X_train, y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, y_train) * 100, 2)
acc_svc

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). 

In [None]:
# K-nearest Neighbour

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, y_train) * 100, 2)
acc_knn

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. 

In [None]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, y_train) * 100, 2)
acc_gaussian

The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

In [None]:
# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, y_train) * 100, 2)
acc_perceptron

In [None]:
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, y_train) * 100, 2)
acc_linear_svc

In [None]:
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, y_train) * 100, 2)
acc_sgd

This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. 

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)
acc_decision_tree

The next model Random Forests is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. 

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
acc_random_forest

### Model evaluation

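We can now rank all the models to choose the best one for our problem. Note that every score computed above is accuracy on the training set (`X_train`, `y_train`), so the ranking favours models that can memorise the training data, such as decision trees and random forests; the predictions on `X_test` stored in `Y_pred` have not been scored.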

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
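
For the performance analysis in Question 2, scores on held-out data are usually more informative than training scores. The sketch below contrasts the two for a decision tree on synthetic, imbalanced data and adds ROC AUC, which uses the predicted probabilities. The data and the metric choice are assumptions for illustration, not results for the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (illustration only).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# An unpruned tree can fit the training rows perfectly.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))
test_auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"test ROC AUC:   {test_auc:.3f}")
```

With an imbalanced target, accuracy alone can look good for a model that rarely predicts churn, so precision, recall, or ROC AUC on the test split are worth reporting alongside it.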