# Predicting Customer Churn at a Bank

Every month, BankCo loses thousands of customers to its competitors. Customers who leave the bank are known as "churned customers", which is an undesirable situation for BankCo. In this exercise, I will help BankCo predict which customers are likely to churn in the future, so that the bank can take measures based on the predictions.

# I. Framing the problem

The task is supervised learning because we have labeled training examples (`Exited`). Moreover, it is a classification problem, since the goal is to classify which customers may churn. More specifically, it is a **binary classification problem**, since the model predicts whether a client will churn or not. Finally, there is no continuous data flow, no particular need to adapt to rapidly changing data, and the dataset is small enough to fit in memory; thus, **batch learning** is enough for this case.

In [None]:
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from pandas.plotting import scatter_matrix

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

## Import Data

In [None]:
PATH = os.getcwd()  # current working directory as a plain string; default data path below
print(os.listdir('../input'))

In [None]:
os_path = '../input/predicting-churn-for-bank-customers'
os_path

In [None]:
#Reading the data as a csv file and saving it as a pandas dataframe
def load_churn_data(data_path=PATH):
    csv_path = os.path.join(data_path, "Churn_Modelling.csv")
    return pd.read_csv(csv_path)

# II. Data preprocessing

In [None]:
churn_df = load_churn_data(data_path=os_path)

#printing the number of columns 
print("Number of columns: ", len(churn_df.columns))

# First 5 rows of the dataset
churn_df.head()

Each row represents one customer. As the output of `head()` and the column count show, there are 14 attributes:
*  RowNumber
*  CustomerId
*  Surname
*  **CreditScore**: a number that quantifies how likely an individual is to pay off their debt.
*  Geography
*  Gender
*  Age
*  **Tenure**: how long a customer has stayed with the bank.
*  Balance
*  NumOfProducts
*  HasCrCard
*  IsActiveMember
*  EstimatedSalary
*  Exited

*The `info()` method is particularly useful for getting a description of the dataset, for instance the type of each attribute and its number of non-null values.*

In [None]:
# Listing the features and their data type
churn_df.info()

The `info()` method shows that there are 10000 instances in the dataset and that every attribute has 10000 non-null values, meaning that no customer is missing any feature.

Eleven of the attributes are numerical. `Surname`, `Geography`, and `Gender` are of object type; since the data comes from a CSV file, they must be text attributes.

Looking at the first 5 rows:
*  Surname is the family name of each customer.
*  Geography and Gender must be categorical attributes.

Let's use the `value_counts()` method to find out how many customers belong to each category of these two attributes (Geography and Gender).


In [None]:
# Viewing how many datapoints in each category of Geography attribute
churn_df["Geography"].value_counts()


As a result, customers come from one of three countries (`France`, `Germany`, and `Spain`).

In [None]:
# Viewing how many datapoints in each category of Gender attribute
churn_df["Gender"].value_counts()


The Gender attribute is self-explanatory; there are more male than female customers.

*Let's take a look at the other attributes using the `describe()` method*, which shows a summary of the numerical attributes.

In [None]:
# Descriptive statistics
churn_df.describe()

Values like `count`, `mean`, `max`, and `min` are easy to understand. `std` is the standard deviation, which measures how spread out the values of an attribute are around its mean.

25%, 50%, and 75% are the percentiles (quartiles): the values below which 25%, 50%, and 75% of the observations fall. For instance, at least 25% of the customers are not active members, do not have a credit card, do not have a balance, have a low tenure, and are 32 years old or younger.
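
To make the percentile rows concrete, here is a minimal sketch using only the standard library. The ages are made-up values for illustration (not taken from the dataset), and `method="inclusive"` is an assumption: it interpolates linearly between sorted values, which should agree with what `describe()` reports by default.

```python
from statistics import quantiles

# Hypothetical ages, for illustration only (not from Churn_Modelling.csv).
ages = [18, 25, 32, 40, 51, 60, 72]

# n=4 splits the data into quartiles and returns the three cut points.
q25, q50, q75 = quantiles(ages, n=4, method="inclusive")
print("25%:", q25, "50%:", q50, "75%:", q75)

# Share of customers whose age is at or below the 25th percentile.
share_below_q25 = sum(age <= q25 for age in ages) / len(ages)
print("Share at or below the 25th percentile:", share_below_q25)
```

With so few values the share is only approximately one quarter; on 10000 rows it gets much closer to 25%.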

We can also use histograms to understand the data graphically.

In [None]:
%matplotlib inline
#Creating histogram for numerical attributes
churn_df.drop(columns=['CustomerId', 'RowNumber']).hist(bins=50, figsize=(20,15))
plt.show()

### Target variable

*   The target, `Exited` (whether the customer churned or not), is binary.
*   Exited = 1 if the client has churned
*   Exited = 0 if the client has not churned

From the histogram, we can see that there is a **class imbalance**, because most customers did not churn.
This is confirmed by using `value_counts()`.

In [None]:
# Counting how many customers stayed (0) and churned (1) in the target variable
numbers = churn_df["Exited"].value_counts()

fig, ax = plt.subplots(figsize=(10, 8))
# Creating a pie chart of the customers who stayed and who churned.
# Labels and colors are looked up from the index of `numbers`, so each slice gets
# the right name regardless of the order in which value_counts() returns the classes.
class_names = {0: 'Stayed with the Bank', 1: 'Churned'}
class_colors = {0: 'green', 1: 'red'}
ax.pie(numbers, labels=[class_names[c] for c in numbers.index],
       colors=[class_colors[c] for c in numbers.index], autopct='%2.2f%%', startangle=90)
plt.title("Comparison of customers churned and stayed with bank", size=18)
plt.show()

The pie chart above confirms the class imbalance in the target variable: 79.63% of the customers did not churn and 20.37% churned.

#### Creating a Test Set

The reason for creating a test set at this point is to avoid what is called **data snooping**. Looking at the test set while exploring the data and choosing an algorithm is not good practice, as patterns in the test set might trick us into choosing a particular model.

Several methods can be used to split the data into a train set and a test set, and it is often easy. We could simply pick 20% of the data at random for testing and keep the other 80% for training, but without a fixed random seed every run of the program would generate a different test set, which is what we need to avoid.

As our dataset is not very large, purely random sampling risks producing a skewed test set (if the dataset were much larger, purely random sampling would be fine). A method called stratified sampling lets us choose a test set that is representative of the overall data, which is good for properly testing the performance of the model on unseen data.

Instead of using a purely random split like:

`X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.3, random_state=42)`

we can use stratified sampling.

Let's use Tenure for stratified sampling. This helps make sure that the test set represents the whole dataset with respect to tenure. We then use it to split the data into a train set and a test set.

1. We look at its histogram.
2. We treat it as a categorical attribute: although it is numerical, it takes only a few distinct integer values, so each value can serve as a stratum.
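
The idea behind stratified sampling can be sketched without any library. This is a simplified illustration, not how `StratifiedShuffleSplit` is implemented (its allocation of samples to each stratum may differ); `stratified_split` and the tenure values are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each label keeps roughly the same share."""
    rng = random.Random(seed)  # fixed seed: the same split on every run
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical tenure column: 50, 100 and 50 customers with tenure 0, 1 and 2.
tenure = [0] * 50 + [1] * 100 + [2] * 50
train_idx, test_idx = stratified_split(tenure)
test_counts = Counter(tenure[i] for i in test_idx)
print(test_counts)  # each tenure value contributes 20% of its customers
```

Because sampling happens inside each tenure group, the test set reproduces the 1:2:1 proportions of the full data.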


In [None]:
%matplotlib inline
#Creating histogram for numerical attribute (Tenure)
churn_df["Tenure"].hist(bins=30, figsize=(10,7))
plt.show()

Tenure values fall between 0 and 10.

In [None]:
# Provides train/test indices to split data in train/test sets.
split_cond = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
#splitting the data to get train and test set according to tenure category
for i, j in split_cond.split(churn_df, churn_df["Tenure"]):
    strat_train_data = churn_df.loc[i]
    strat_test_data = churn_df.loc[j]
    
Y_strat_train_data = strat_train_data["Exited"]
Y_strat_test_data = strat_test_data["Exited"]


## III. Exploratory Data Analysis and Data Visualization

The goal of this section is to use data analysis techniques and data visualizations to explore the relationship between the features and the target variable, `Exited`.

In [None]:
churn = strat_train_data.copy()#creating a copy of data in order to play with it
churn.head()

Let's start with a pairplot, which lets us compare the attributes with the target variable. Specifically, it helps visualize the distributions of several independent variables for each class of the target variable.

In [None]:
sns.set(style="ticks")
#Creating a pairplot using seaborn library
sns.pairplot(churn[['CreditScore','Age','EstimatedSalary','Balance','Exited']], hue="Exited",height = 4.0)
plt.show()

The classes overlap across several attributes, which suggests that the relationships between the variables and the target are largely `non-linear`. This also gives insight into which algorithm to consider when choosing the model: **`a non-linear model`** would probably be a better choice.

#### Looking for correlation

We can start by looking at how each attribute correlates with the dependent variable, `Exited`.

In [None]:
# correlation among all numerical variables (text columns are excluded, since corr() needs numbers)
corr_matrix = churn.select_dtypes(include="number").corr()
# correlation between the target variable and the other attributes
corr_matrix["Exited"].sort_values(ascending=False)

The correlation coefficient (which measures linear correlation) ranges from -1 to 1. When it is close to 1, there is a strong positive correlation between the two variables; when it is close to -1, there is a strong negative correlation; values close to 0 indicate no linear correlation.

It looks like none of the variables is strongly correlated with the target, but some variables (`Age`, `Balance`) look promising.
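
As a reminder of what the coefficient measures, here is a small hand-rolled version of the Pearson correlation on toy data. The helper `pearson` and the numbers are illustrative assumptions, not part of the analysis; the last case shows that a coefficient of 0 only rules out a linear relationship.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson(x, [2, 4, 6, 8, 10])   # perfectly increasing line
r_neg = pearson(x, [10, 8, 6, 4, 2])   # perfectly decreasing line
# y = x**2 is fully determined by x, yet has no linear trend on this range.
r_curve = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_pos, r_neg, r_curve)
```

This is one reason a low linear correlation with `Exited` does not by itself make a feature useless for a non-linear model.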

Another way to check for correlation is to use pandas' `scatter_matrix`.

In [None]:
attributes = ["Exited", "Age", "Balance"]
#correlation visualization among most correlated features to target variable.
scatter_matrix(churn[attributes], figsize=(12, 8))

Balance and Age seem to be correlated, even though the correlation is not high. There is a possibility of extracting a new feature from the two.

In [None]:
sns.set(style="darkgrid")  # setting the dark background

# A list of all categorical attributes to plot
categorical_attrs = ["HasCrCard", "Gender", "Geography", "IsActiveMember"]

# creating one subplot per categorical attribute
figure, axes = plt.subplots(1, len(categorical_attrs), figsize=(20, 10))
for ax, attr in zip(axes, categorical_attrs):
    # countplot of each categorical attribute, split by the target variable
    sns.countplot(x=attr, hue="Exited", data=churn, ax=ax)
plt.show()

*  Customers from France are the most frequent, which explains why France also has the most customers who did not churn.
*  Customers with credit cards who did not churn outnumber those without credit cards, but surprisingly `those with credit cards churned more than those without credit cards`.
*  The dataset has more male than female customers. Despite that, female customers churned more.
*  Active members, as we would expect, churn less.

Finally, let's see if there are some outliers so that we can remove them.

In [None]:
# Using boxplots to detect outliers (on the training copy, so that the test set stays unseen)
boxplot = churn.boxplot(column=['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary'], figsize=(12,8))

There are no considerable outliers.

# IV. Feature Engineering

Let's now try possible *feature extraction* and *feature selection*.

We can start by removing identifier variables (`'RowNumber', 'CustomerId', 'Surname'`), which are irrelevant for prediction.

Then we use sklearn's polynomial features to combine several features, and remove highly correlated features so that we can feed the best features to our machine learning model.

In [None]:
#performing feature selection by removing irrelevant features
y_train_data = churn['Exited']
y_test_data = strat_test_data['Exited']

X = churn.drop(columns = ['RowNumber', 'CustomerId', 'Surname'])#Train data

X_test_data = strat_test_data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])#test data


Before digging into feature extraction, note that two variables which appear to be important are still categorical.

#### Handling Text and Categorical Attributes

In [None]:
# Training set
# Geography
encoder_geo = LabelEncoder()  # creating label encoder object
X["Geography"] = encoder_geo.fit_transform(X["Geography"])  # converting Geography to numerical codes

# Gender
encoder_gender = LabelEncoder()  # creating a label encoder object
X["Gender"] = encoder_gender.fit_transform(X["Gender"])  # converting Gender to numerical codes
X.shape

In [None]:
# Test set
# Reuse the encoders fitted on the training set so that each category
# gets the same numerical code in both sets.
# Geography
X_test_data["Geography"] = encoder_geo.transform(X_test_data["Geography"])

# Gender
X_test_data["Gender"] = encoder_gender.transform(X_test_data["Gender"])

X_test_data.shape
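
One detail worth keeping in mind with integer encoding is that the category-to-code mapping should be learned once, from the training data, and then reused. The sketch below imitates the idea with a plain dictionary; `fit_label_mapping` and the sample rows are hypothetical, and the sorted ordering is an assumption meant to mirror how `LabelEncoder` orders its classes.

```python
def fit_label_mapping(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

train_geo = ["France", "Spain", "Germany", "France"]
test_geo = ["Spain", "Germany"]  # hypothetical test rows that contain no "France"

# Mapping learned on the training data and reused for the test data.
mapping = fit_label_mapping(train_geo)
test_codes = [mapping[g] for g in test_geo]

# Mapping learned again on the test data alone: the codes silently shift.
refit_mapping = fit_label_mapping(test_geo)
refit_codes = [refit_mapping[g] for g in test_geo]

print(mapping)      # France -> 0, Germany -> 1, Spain -> 2
print(test_codes)   # Spain keeps code 2, Germany keeps code 1
print(refit_codes)  # Spain becomes 1, Germany becomes 0
```

A model trained with the first mapping would misread the refitted codes, which is why a single fitted encoder is shared between both sets.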

#### Feature extraction

1. One possibility is to combine Balance and EstimatedSalary: how does a customer's balance compare to his/her salary?

2. Does Balance have something to do with Age? Maybe older people tend to save more money.

In [None]:
# combining balance and estimated salary (balance-to-salary ratio)
X['Balance_estimatedsalary_ratio'] = X.Balance/X.EstimatedSalary
X_test_data['Balance_estimatedsalary_ratio'] = X_test_data.Balance/X_test_data.EstimatedSalary

# combining balance and age (balance-to-age ratio); the column name must match in both sets
X['Balance_age_ratio'] = X.Balance/X.Age
X_test_data['Balance_age_ratio'] = X_test_data.Balance/X_test_data.Age


##### checking the relevance of new features

In [None]:
#correlation among independent variables and dependent variable
corr_matrix_ = X.corr()
corr_matrix_["Exited"].sort_values(ascending=False)


**The newly created features have an improved correlation with the target variable, which is what I wanted.**

In [None]:
#dropping target variable
X = X.drop(columns=["Exited"])
X_test_data = X_test_data.drop(columns=["Exited"])

3. Generating polynomial and interaction features

Following the scikit-learn documentation for `PolynomialFeatures`, I will generate polynomial and interaction features:

Generate a new feature matrix consisting of all polynomial combinations
of the features with degree less than or equal to the specified degree.
For example, if an input sample is two dimensional and of the form
[a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
With `interaction_only=True` and `include_bias=False`, as used in the next cell, only the original features and the products of distinct features are kept: [a, b, ab].
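
The expansion can be reproduced by hand for a single row. This is a sketch of the interaction-only case without the bias column; `interaction_features` is a hypothetical helper, and the output ordering is intended to mirror the one scikit-learn uses, which is an assumption here.

```python
from itertools import combinations
from math import prod

def interaction_features(row, degree=2):
    """Original values followed by products of distinct features up to `degree`."""
    features = list(row)
    for k in range(2, degree + 1):
        for combo in combinations(row, k):
            features.append(prod(combo))
    return features

# Row [a, b, c] = [2, 3, 5] expands to [a, b, c, a*b, a*c, b*c].
expanded = interaction_features([2, 3, 5])
print(expanded)
```

With three inputs this adds three pairwise products; in general, n inputs give n + n(n-1)/2 columns at degree 2, so the feature count grows quickly.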

In [None]:
#Performing feature extraction by combining other features
poly_features = PolynomialFeatures(2, interaction_only=True, include_bias=False)

#creating new attributes for training set
new_attrib = poly_features.fit_transform(X)

#creating a dataframe for new attributes
#(get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
new_attrib = pd.DataFrame(new_attrib, columns=poly_features.get_feature_names_out(X.columns))
new_attrib.shape

In [None]:
# Test set
# creating new attributes for the test set (PolynomialFeatures learns nothing from
# the data values, so calling fit_transform here does not leak information)
new_attrib_test = poly_features.fit_transform(X_test_data)

# creating a dataframe for new attributes
new_attrib_test = pd.DataFrame(new_attrib_test, columns=poly_features.get_feature_names_out(X_test_data.columns))
new_attrib_test.shape

#### Removing correlated attributes

In [None]:
# Train set
# Creating correlation matrix
corr_matrix2 = new_attrib.corr()

# Keep only the upper triangle of the correlation matrix (each pair once, diagonal excluded)
higher_values = corr_matrix2.where(np.triu(np.ones(corr_matrix2.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.7
attr_to_drop = [column for column in higher_values.columns if any(abs(higher_values[column]) > 0.7)]

#dropping several attributes with higher correlation to avoid collinearity issue
X_train_new = new_attrib.drop(columns = attr_to_drop)#dropping attributes with higher correlation
X_train_new.shape

In [None]:
# Test set
# Drop the same attributes that were dropped from the train set, so that both sets
# keep identical columns (errors="ignore" skips any name absent from the test frame)
X_test_new = new_attrib_test.drop(columns=attr_to_drop, errors="ignore")
X_test_new.shape

### Transformation pipeline

#### Feature scaling

Feature scaling is done with standardization, which subtracts the mean value of each feature and divides by its standard deviation.
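
A worked example may help here. The sketch below standardizes a toy feature with the standard library only; the numbers are made up, and using the population standard deviation (`pstdev`, dividing by n) is an assumption intended to match the convention of `StandardScaler`. The mean and standard deviation are computed on the training values and then reused for the test values.

```python
from statistics import mean, pstdev

def standardize(values, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mu) / sigma for v in values]

# Hypothetical values of one feature in the train and test sets.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [3.0, 11.0]

mu = mean(train)       # statistics learned from the training set only
sigma = pstdev(train)  # population standard deviation

train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)  # same mu and sigma as for train
print(mu, sigma)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1, while the test values are simply expressed on that same scale.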

In [None]:
#creating a pipeline for several transformations for training set
pipe = Pipeline([('std_scaler', StandardScaler())])

#Using pipeline object to call fit_transform
X_transformed = pipe.fit_transform(X)

In [None]:
# Reusing the pipeline fitted on the training set, so that the test set is scaled
# with the training mean and standard deviation
test_transformed = pipe.transform(X_test_data)

# V. Classification Model Development


The goal of this section is to develop a model capable of predicting which clients are likely to churn.  

Before selecting a baseline model, I decided to choose an error metric, which helps identify whether a model is performing well or poorly.

Going back to the purpose of this exercise, which is to predict customer churn (whether a customer will churn in the future or stay with the bank), the **error metric** will help me determine whether the model accurately classifies customers as churning or not.

Since this is a classification problem, **false positives and false negatives** are both misclassifications that help quantify the model performance.

Also, the class imbalance tells me that **precision and recall** are a good choice of error metric, *instead of accuracy, because a classifier might predict that no customer churns and still have a high accuracy*.
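
The point about accuracy can be checked with a few lines of arithmetic. The labels below are hypothetical and only approximate the roughly 80/20 split seen earlier; the "model" is a deliberately trivial one that always predicts that the customer stays.

```python
# Hypothetical labels: 1 = churned, 0 = stayed (about 80% / 20%).
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100  # trivial baseline: nobody is predicted to churn

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print("Accuracy:", accuracy)  # high, although the model is useless
print("Recall:", recall)      # no churner is ever found
```

Precision is not even defined for this baseline, since it never predicts the positive class.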

The table below explains several outcomes and their error types: 

| Exited (actual) | Prediction | Outcome        |
| :------------- | :----------: | -----------:   |
|  0             | 1            | False positive |
|  1             | 1            | True positive  |
|  0             | 0            | True Negative  |
|  1             | 0            | False Negative |

In my opinion, the bank would want to minimize both false positives and false negatives. A false negative is the riskier of the two, because we would be predicting that a customer **will not** churn when the customer actually will churn.
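
The four outcomes in the table translate directly into code. This is an illustrative sketch on made-up labels; `outcome` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def outcome(actual, predicted):
    """Name the outcome for one customer (1 = churned, 0 = stayed)."""
    if predicted == 1:
        return "TP" if actual == 1 else "FP"
    return "FN" if actual == 1 else "TN"

# Hypothetical true labels and predictions for eight customers.
y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

counts = Counter(outcome(a, p) for a, p in zip(y_true, y_pred))
precision = counts["TP"] / (counts["TP"] + counts["FP"])  # predicted churners that are real
recall = counts["TP"] / (counts["TP"] + counts["FN"])     # real churners that were found

print(dict(counts))
print("Precision:", precision, "Recall:", recall)
```

Here two of the four real churners are missed (false negatives), which is the kind of error the bank would most want to reduce.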

## Creating a baseline model

In [None]:
#data to be used for model training testing after all data preprocessing and EDA
X_train = X_transformed.copy()#data for training the model (features)
X_test = test_transformed.copy()#data for test the model performance (independent variables)
y_train = y_train_data.copy()#data for training (target)
y_test = y_test_data.copy()#data for test the model performance (target)

print("Train data: ", X_train.shape)
print("Test data: ", X_test.shape)
print("Train for y: ", y_train.shape)
print("Test for y ", y_test.shape)


As I stated during EDA, the relationships in the data are non-linear, which indicates that instead of spending time on linear models like logistic regression or other linear classifiers, I can directly try non-linear models such as random forest.

For evaluation, I will use K-fold cross-validation, which randomly splits the train set into K subsets (called folds), then trains and evaluates the model K times.

Each time, it picks a different fold for evaluation and trains the model on the remaining K-1 folds (9 folds when K = 10).
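
The mechanics of the folds can be sketched as follows. This simplified version splits the samples in order, without shuffling or stratification, so it only illustrates the bookkeeping; `k_fold_indices` is a hypothetical helper rather than the scikit-learn implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=20, k=10))
for train_idx, val_idx in folds[:2]:
    print("validate on", val_idx, "train on", len(train_idx), "samples")
```

Averaging the K validation scores gives a more stable performance estimate than a single train/validation split.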


### V. 1. Random forest

In [None]:
# #Splitting the data into test and train sets
# X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.3, random_state=42)

# Creating a random forest classifier: class_weight="balanced" to mitigate the
# class imbalance, and random_state=1 so that the results (predictions) are reproducible
#rf_classifier=RandomForestClassifier(n_estimators=100)
rf_classifier = RandomForestClassifier(class_weight="balanced", random_state=1)

rf_classifier.fit(X_train, y_train)
#cross validation
predictions_results = cross_val_predict(rf_classifier, X_train, y_train, cv=10)

# keep the index of y_train so that element-wise comparisons with y_train line up
predicted_rf = pd.Series(predictions_results, index=y_train.index)

#predicted

#### cross validation score

In [None]:
#cross validation accuracy
accuracies = cross_val_score(estimator = rf_classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
sum(accuracies)/10

In [None]:
# Error metrics
# Compare as plain arrays so that predictions and labels are matched by position
actual = np.asarray(y_train)
predicted = np.asarray(predicted_rf)

# False positives.
fp = int(((predicted == 1) & (actual == 0)).sum())

# True positives.
tp = int(((predicted == 1) & (actual == 1)).sum())

# False negatives.
fn = int(((predicted == 0) & (actual == 1)).sum())

# True negatives.
tn = int(((predicted == 0) & (actual == 0)).sum())

# Rates
tpr = tp / (tp + fn)        # true positive rate (recall)
precision = tp / (tp + fp)  # precision
fpr = fp / (fp + tn)        # false positive rate

print("False positive: ",fp)
print("True positive: ", tp)
print("False negative: ", fn)
print("True negative: ", tn)

### Precision and recall

I am more interested in the model predicting the 1's correctly, because the bank is more interested in knowing which customers will churn.

By default, the precision and recall functions from sklearn report the results for class 1, which is what I want.

To view all statistics, I will use the `classification_report` function.

In [None]:
#printing precision, recall and f1-score on train set
print("Precision score: ", precision_score(y_train, predicted_rf))
print("Recall score: ", recall_score(y_train, predicted_rf))
print("F1 score: ", f1_score(y_train, predicted_rf))

A precision of 0.78 means that, of all the customers that the model predicts will churn, 78% actually churned.

A recall of 0.42 means that, of all the customers in the dataset who churned, the model correctly identified 42%.

A precision of 78% is not that bad considering the class imbalance we have.
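
The f1-score printed alongside these two numbers is their harmonic mean. As a quick check of the arithmetic, using the rounded precision and recall quoted above (so the value reported by `f1_score` may differ slightly in the last digits):

```python
precision = 0.78  # rounded value quoted in the text
recall = 0.42     # rounded value quoted in the text

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print("F1 score:", round(f1, 3))
```

The harmonic mean stays close to the smaller of the two values, so the low recall pulls the f1-score well below the precision.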

### Confusion matrix

In [None]:
confusion_matrix(y_train, predicted_rf)
#y_train

In [None]:
#printing precision, recall and f1-score for both classes using train set
print(classification_report(y_train, predicted_rf))

In [None]:
#printing roc score using train sample
roc_auc_score(y_train, predicted_rf)

### Testing the Random forest model for unseen data

In [None]:
#predicting which customers are likely to churn, using the test set
y_pred_rf = rf_classifier.predict(X_test)

print("Precision score: ", precision_score(y_test, y_pred_rf))
print("Recall score: ", recall_score(y_test, y_pred_rf))
print("F1 score: ", f1_score(y_test, y_pred_rf))


In [None]:
#printing precision, recall and f1-score for both classes using test set
print(classification_report(y_test, y_pred_rf))
#printing roc score using test samples
roc_auc_score(y_test, y_pred_rf)

Again, because the bank is more interested in knowing which customers will churn, I am more interested in the model predicting the **1's correctly**.


### V. 2. SVM

In [None]:
svm_classifier = SVC(kernel='poly', degree=8, probability=True)#Building SVM classifier
svm_classifier.fit(X_train, y_train)#fitting the model

y_pred_svm = svm_classifier.predict(X_train)#predicting


In [None]:
print(classification_report(y_train, y_pred_svm))

### V. 3. XGBoost

In [None]:
#building a model using XGBoost algorithm
XGB_model = xgb.XGBClassifier(objective="binary:logistic",eta= 0.002,subsample=0.5,max_depth=6, n_estimators=1000)
XGB_model.fit(X_train, y_train)#fitting the model using training data

In [None]:
#Predicting on train data
y_pred_XGB = XGB_model.predict(X_train)
y_pred_XGB

In [None]:
#cross validation accuracy
accuracies = cross_val_score(estimator = XGB_model, X = X_train, y = y_train, cv = 10, n_jobs = -1)
sum(accuracies)/10

In [None]:
#printing precision, recall and f1-score
print("Precision score: ", precision_score(y_train, y_pred_XGB))
print("Recall score: ", recall_score(y_train, y_pred_XGB))
print("F1 score: ", f1_score(y_train, y_pred_XGB))

In [None]:
#making predictions using XGB classifier
y_pred_XGB_t = XGB_model.predict(X_test)

In [None]:
#printing precision, recall and f1-score for both classes using train data
print(classification_report(y_train, y_pred_XGB))

#printing roc score using test samples
roc_auc_score(y_test, y_pred_XGB_t)

### Parameter Tuning for XGBoost model

### Tuning max_depth and min_child_weight


In [None]:
#tuning hyperparameters for XGBoost model
XGB_tuned_model = xgb.XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=5, min_child_weight=1, objective= 'binary:logistic', 
    scale_pos_weight=1)

#building a dictionary with several parameters
param_tuning = {'max_depth': [3,5,7,9],'min_child_weight':[1,3,5]} 

#### Using GridSearch to fine tune hyperparameters

In [None]:
#building a gridsearch (the iid argument was removed from GridSearchCV in scikit-learn 0.24)
gsearch_instance = GridSearchCV(estimator = XGB_tuned_model, param_grid = param_tuning, scoring='accuracy', n_jobs=-1, cv=10)

#fit with all sets of parameters 
gsearch_instance.fit(X_train, y_train)

#best parameters after gridsearch hyperparameter tuning
gsearch_instance.best_params_, gsearch_instance.best_score_
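
To get a sense of how much work the grid search does, the candidate combinations can be enumerated by hand. This sketch only mimics how a parameter grid expands into candidates; `candidates` and `n_fits` are names introduced here for illustration.

```python
from itertools import product

# Same grid as in the tuning cell above.
param_tuning = {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 3, 5]}

# Every combination of one value per parameter is one candidate model.
names = list(param_tuning)
candidates = [dict(zip(names, values)) for values in product(*param_tuning.values())]

n_folds = 10
n_fits = len(candidates) * n_folds  # each candidate is trained once per fold

print(len(candidates), "candidates,", n_fits, "model fits")
print(candidates[0])
```

Each extra parameter multiplies the number of candidates, which is why the grid is kept small; with the default `refit=True`, one more fit on the whole training set follows for the best candidate.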

In [None]:
#Building a model based on tuned parameters 
XGB_tuned_model_result = xgb.XGBClassifier(learning_rate =0.1, n_estimators=140, max_depth=3, min_child_weight=1, objective= 'binary:logistic', 
    scale_pos_weight=1)

##fitting the model using train data
XGB_tuned_model_result.fit(X_train, y_train)

#making predictions
y_pred_XGB_tuned = XGB_tuned_model_result.predict(X_train)

In [None]:
#printing precision, recall and f1-score for both classes using train data
print(classification_report(y_train, y_pred_XGB_tuned))

## Model selection

XGBoost seems to perform better than random forest and SVM, with a precision of 86% and a recall of 46% on the test set.

The tuned model has 83% precision and 48% recall, which is a better balance between precision and recall.

#### Let's use ROC (Receiver operating characteristics) curve

In [None]:
def roc_curve_p(f, t, l=None):
    '''
    Plot the ROC curve from false positive and true positive rates.

    Input: f: false positive rate
           t: true positive rate
           l: label
    '''
    plt.plot(f, t, linewidth=2, label=l)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')

In [None]:
#getting scores for the random forest classifier
y_probas_forest = cross_val_predict(rf_classifier, X_train, y_train, cv=10, method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]

#getting scores for XGboost classifier
y_probas_xgb = cross_val_predict(XGB_tuned_model_result, X_train, y_train, cv=10, method="predict_proba")
y_scores_xgb = y_probas_xgb[:, 1]

#getting scores for SVM classifier
y_probas_svm = cross_val_predict(svm_classifier, X_train, y_train, cv=10, method="predict_proba")
y_scores_svm = y_probas_svm[:, 1]

#using roc_curve function to get false positives and true positives for random forest classifier
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_train, y_scores_forest)
##using roc_curve function to get false positives and true positives for SVM classifier
fpr_svm, tpr_svm, thresholds_svm = roc_curve(y_train, y_scores_svm)
#using roc_curve function to get false positives and true positives for XGboost classifier
fpr_xgb, tpr_xgb, thresholds_xgb = roc_curve(y_train, y_scores_xgb)

#plotting roc curve for random forest model
plt.plot(fpr_rf, tpr_rf, "b:", label="Random Forest")
#plotting roc curve for SVM classifier
plt.plot(fpr_svm, tpr_svm, "r:", label="SVM")
#plotting roc curve for XGboost classifier
roc_curve_p(fpr_xgb, tpr_xgb, "XGboost")
plt.legend(loc="lower right")
plt.show()
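
For intuition about what the plotted curves contain, an ROC curve can be built by hand: lower the decision threshold step by step and record the false positive and true positive rates at each step. The sketch below does this on four made-up scores; `roc_points` and `auc` are hypothetical helpers, not the scikit-learn functions, and ties or edge cases are handled only in the simplest way.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical labels and predicted churn probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print("AUC:", area)
```

A curve that hugs the top-left corner (area close to 1) ranks churners above non-churners almost perfectly, while the dashed diagonal corresponds to random guessing (area 0.5).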

### Testing the chosen model on unseen data to provide a conclusion

In [None]:
#making predictions using XGB classifier
y_pred_XGB_test = XGB_tuned_model_result.predict(X_test)

In [None]:
#printing precision, recall and f1-score for both classes
print(classification_report(y_test, y_pred_XGB_test))

## VI. Conclusion

The purpose of the exercise was to predict which customers are likely to churn in the future. Based on several factors such as the ROC curve, the precision-recall trade-off, and the f1-score, I have chosen the XGBoost model. Its performance on the test set is as follows:

* A precision of 0.79 means that, of all the customers that the model predicts will churn, 79% actually churned.

* A recall of 0.44 means that, of all the customers in the dataset who churned, the model correctly identified 44%.

To improve on these results, several options might be considered, such as balancing the target variable using over-sampling methods (**as can be seen below**). That said, the model already performs reasonably well on the test set.

Collecting new data to get better features is another option.

### Balancing the dataset using SMOTE

In [None]:
#creating an object of SMOTE to perform over-sampling of minority class
sm = SMOTE(random_state = 1)

#resampling the dataset (fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

#fitting the model using resampled data
XGB_tuned_model_result.fit(X_train_res, y_train_res)

#making predictions using the model 
predictions_res= XGB_tuned_model_result.predict(X_test)

#printing the classification report
print(classification_report(y_test, predictions_res))
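
The core idea of SMOTE can be sketched in a few lines. This is a simplified illustration and not the imbalanced-learn implementation: it assumes the neighbor has already been chosen, whereas SMOTE picks it among the k nearest minority-class neighbors. `synthesize`, `churner_a`, and `churner_b` are hypothetical names and values.

```python
import random

def synthesize(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(1)
churner_a = [1.0, 2.0]  # hypothetical scaled features of one churned customer
churner_b = [3.0, 6.0]  # hypothetical nearby churned customer

synthetic = synthesize(churner_a, churner_b, rng)
print(synthetic)  # lies between churner_a and churner_b
```

Because the new points are interpolated from training samples only, the test set stays untouched and still reflects the original class balance.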