# Q3. Bank Churn Classification Problem
## Dataset Description:
Banking is one of those traditional industries that has gone through a steady transformation over the past few decades. Yet, many banks today with a sizeable customer base are hoping to gain a competitive edge but have not tapped into the vast amounts of data they have, especially in solving one of the most acknowledged problems – customer churn (i.e., a customer leaving the bank). It is advantageous to banks to know what leads a client to leave the bank. Banks often use the customer churn rate as one of their key business metrics because the cost of retaining existing customers is far less than acquiring new ones, and meanwhile increasing customer retention can greatly increase profits.  
Churn prevention allows companies to develop programs, such as loyalty and retention programs, to keep as many customers as possible. The attributes of the dataset we will be working with are listed below.
 

- RowNumber (continuous): corresponds to the record (row) number and has no effect on the output.
- CustomerId (categorical): contains random values and has no effect on a customer leaving the bank.
- Surname (categorical): the surname of a customer has no impact on their decision to leave the bank.
- CreditScore (continuous): can influence customer churn, since a customer with a higher credit score is less likely to leave the bank.
- Geography (categorical): a customer’s location can affect their decision to leave the bank.
- Gender (categorical): it’s interesting to explore whether gender plays a role in a customer leaving the bank.
- Age (continuous): this is certainly relevant, since older customers are less likely to leave their bank than younger ones.
- Tenure (continuous): refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
- Balance (continuous): also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
- NumOfProducts (continuous): refers to the number of products that a customer has purchased through the bank.
- HasCrCard (categorical): denotes whether a customer has a credit card. This column is also relevant since people with a credit card are less likely to leave the bank.
- IsActiveMember (categorical): active customers are less likely to leave the bank.
- EstimatedSalary (continuous): as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.
- Exited (categorical): whether or not the customer left the bank. (Target variable)

# Import all the necessary libraries

```python
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Classifier libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Scaling & splitting libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# Sampling library
from imblearn.over_sampling import SMOTE

# Evaluation libraries
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix  # used in the Evaluation section
```

### b. Data Loading / Preprocessing
### i. Loading
1. Load the data file `BankChurn.csv` as a pandas dataframe using the `pd.read_csv()` function, which returns a dataframe, and store it in a variable named `df`.

2. The resulting dataframe should have the shape (10000,14) indicating that there are 10000 instances and 14 columns. 

3. This dataframe currently has 13 features: RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary. Use the `pandas.DataFrame.drop` function to drop the RowNumber, CustomerId and Surname columns.

4. Using the ‘pandas.isnull()’ function check if there are any missing values in the dataframe and report this value (i.e., the number of missing values per column of the dataframe).

5. Your task is to use feature columns to predict the target column (which is categorical in our case). This can be cast as a classification problem. 

6. Create a dataframe X of features (by dropping the ‘Exited’ column from the original dataframe). Create a Pandas Series object of targets Y (by only considering the ‘Exited’ column from the original dataframe). Moving forward, we will be working with X and Y.
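The loading steps above can be sketched as follows. This is only an illustration: the three rows of CSV text are made-up stand-ins for `BankChurn.csv` (same column names, invented values), read through `io.StringIO` so the snippet runs without the real file. With the actual data you would pass the file path to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for BankChurn.csv (invented values).
csv_text = """RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,101,Smith,619,France,Female,42,2,0.0,1,1,1,101348.88,1
2,102,Jones,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,103,Brown,502,France,Male,39,8,159660.8,3,1,0,113931.57,1
"""

# With the real file this would be: df = pd.read_csv("BankChurn.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # rows x columns of the loaded data

# Drop the identifier-like columns that carry no predictive signal.
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])

# Number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Features (everything except the target) and the target Series.
X = df.drop(columns=["Exited"])
Y = df["Exited"]
```

Passing `columns=[...]` to `drop` avoids having to specify `axis=1`, and `drop` returns a new dataframe, so `df` itself still contains the `Exited` column after `X` is created.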

### ii. Data Visualisation
1. Visualize the distribution of the ‘Age’ and ‘CreditScore’ columns using the `matplotlib.pyplot.hist` function as two separate plots. Label the x-axis and the y-axis, give each plot a title, and use 7 bins.

- What are the respective mean values of these two features (use the pandas.DataFrame.mean() function)?
- What is the respective standard deviation of these two features (use the pandas.DataFrame.std() function)? 

2. Only for this question use the dataframe consisting of the target variable (initialized as ‘df’). Using matplotlib visualize the number of males and females in each country who are active members and not active members. (Visualize this using a barchart. You will need to use the ‘Gender’, ‘Geography’ and ‘IsActiveMember’ features for this question). Visualize these graphs on two separate plots with respect to their active status. To create a barchart using matplotlib use the ‘matplotlib.pyplot.bar()’ function. Also label the x-axis, y-axis and give the plots a title. 

- How many males are from France and are active members?
- How many females are from Spain and are active members?
- How many males are from France or Germany who are not active members?

3. Using the target variable in Y plot a bar chart showing the distribution of the ‘Exited’ column (To create a barchart using matplotlib use the ‘matplotlib.pyplot.bar()’ function). 

- What can be said about this distribution, keeping in mind that it represents the target variable? Will it have an impact on the results of the classification model?

4. So far you should have successfully been able to load, preprocess and visualize your data. Now, use the ‘pd.get_dummies()’ function to convert categorical data into dummy variables (‘Gender’ and ‘Geography’).
**(Perform this only on X)**. 

- What is the shape of X?
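As a rough illustration of the visualisation steps in this section, the sketch below uses a synthetic dataframe (`df_demo`, a hypothetical name with randomly generated values) rather than the real data. It draws the two histograms with `bins=7`, computes the mean and standard deviation, counts customers per active status, country and gender for the bar charts, and one-hot encodes the two categorical columns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (random values, dataset column names).
rng = np.random.default_rng(0)
n = 200
df_demo = pd.DataFrame({
    "Age": rng.integers(18, 90, size=n),
    "CreditScore": rng.integers(350, 851, size=n),
    "Geography": rng.choice(["France", "Germany", "Spain"], size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "IsActiveMember": rng.integers(0, 2, size=n),
})

# One histogram per feature, each with seven equal-width bins.
for column in ["Age", "CreditScore"]:
    plt.figure()
    plt.hist(df_demo[column], bins=7)
    plt.xlabel(column)
    plt.ylabel("Number of customers")
    plt.title(f"Distribution of {column}")

means = df_demo[["Age", "CreditScore"]].mean()
stds = df_demo[["Age", "CreditScore"]].std()
print(means)
print(stds)

# Count customers for each (active status, country, gender) combination.
counts = (
    df_demo.groupby(["IsActiveMember", "Geography", "Gender"])
    .size()
    .reset_index(name="count")
)

# One bar chart for active members and one for non-active members.
for status, title in [(1, "Active members"), (0, "Non-active members")]:
    subset = counts[counts["IsActiveMember"] == status]
    labels = subset["Geography"] + " / " + subset["Gender"]
    plt.figure()
    plt.bar(labels, subset["count"])
    plt.xlabel("Country / gender")
    plt.ylabel("Number of customers")
    plt.title(title)

# One-hot encode the two categorical columns.
X_encoded = pd.get_dummies(df_demo, columns=["Gender", "Geography"])
print(X_encoded.shape)
```

In a notebook the figures render automatically at the end of the cell; in a script you would call `plt.show()`. The `counts` table also gives the numbers asked for in the questions directly, without reading them off the chart.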

### iii. Data Splitting
1. Split the data into training and test sets using the sklearn `train_test_split()` function in an **80:20** ratio. The result of your data split should be X_train, X_test, y_train, y_test (respectively, your training features, testing features, training targets and testing targets).

### iv. Data Scaling
1. Employ the ‘MinMaxScaler’  function on the continuous attributes in X_train. Employ the ‘fit_transform()’ function of the scaler to retrieve the new (scaled) version of the training data (i.e., fit_transform() should be run on `X_train`). Store the result in X_train again. 


2. Scale the X_test data using the scaler you have just fit, this time using the `transform()` function. Note: store the scaled values back into X_test.  At the end of this step, you must have X_train, X_test, scaled according to the MinMaxScaler.
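The split-then-scale pattern can be sketched as below, again on synthetic data. The names `X_demo`, `y_demo` and `continuous_cols`, and the choice of `random_state=42`, are assumptions made for this illustration. The point to notice is that the scaler learns its minimum and maximum from the training set only and then reuses them on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features and target (random values).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "CreditScore": rng.uniform(350, 850, size=100),
    "Balance": rng.uniform(0, 200000, size=100),
    "HasCrCard": rng.integers(0, 2, size=100),
})
y_demo = pd.Series(rng.integers(0, 2, size=100), name="Exited")

# 80:20 split; random_state is fixed only to make the sketch repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Work on copies so the original frame is left untouched.
X_train = X_train.copy()
X_test = X_test.copy()

continuous_cols = ["CreditScore", "Balance"]
scaler = MinMaxScaler()

# Fit on the training data only, then apply the same scaling to the test data.
X_train[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])
X_test[continuous_cols] = scaler.transform(X_test[continuous_cols])
```

Because the test set is scaled with the training minimum and maximum, scaled test values can fall slightly outside the range 0 to 1. That is expected and is not a sign of a bug.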

### c. Modelling
### i. Modeling (Model Instantiation / Training) using Logistic Regression classifier 
1. Employ the Logistic Regression classifier from sklearn and instantiate the model. Label this model as ‘model_1_lr’

2. Once instantiated, `fit()` the model using the scaled X_train, y_train data.

3. Employ the `predict()` function to obtain predictions on X_test and store this in a variable labeled as ‘y_pred_lr’.

4. Employ the `accuracy_score()` function, using the `y_pred_lr` and `y_test` variables as the function's parameters, and print the accuracy of the Logistic Regression model.
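A minimal sketch of the instantiate, fit, predict and score sequence is shown below. It uses `make_classification` to generate an imbalanced toy dataset, which is an assumption made so the snippet is self-contained. It is not the bank data, and the accuracy it prints says nothing about the real task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data (roughly 80% class 0, 20% class 1).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model_1_lr = LogisticRegression()       # default hyperparameters
model_1_lr.fit(X_train, y_train)        # learn from the training split
y_pred_lr = model_1_lr.predict(X_test)  # predicted class labels

# accuracy_score takes the true labels first, then the predictions.
acc_lr = accuracy_score(y_test, y_pred_lr)
print(acc_lr)
```

The same four calls apply unchanged to `SVC` in the next part, since scikit-learn estimators share the `fit()` / `predict()` interface.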

### ii. Modeling (Model Instantiation / Training) using Support Vector Machine Classifier 

1. Employ the Support Vector Machine (SVM) classifier from sklearn and instantiate the model. Label this model as ‘model_2_svm’

2. Once instantiated, ‘fit()’ the model using the scaled X_train, y_train data.

3. Employ the ‘predict()’ function to obtain predictions on X_test and store this in a variable labeled as ‘y_pred_svm’. 

4. Employ the `accuracy_score()` function (`sklearn.metrics.accuracy_score()`), using the `y_pred_svm` and `y_test` variables as the function's parameters, and print the accuracy of the SVM model.

### iii. Modeling Logistic Regression Classifier on a balanced dataset 
1. Employ Synthetic Minority Oversampling (SMOTE) on X_train and y_train. To use SMOTE you will have to install the imbalanced-learn library. This can be done by executing either the `pip install -U imbalanced-learn` command or, for the Anaconda Cloud platform, the `conda install -c conda-forge imbalanced-learn` command. (For more information see: https://imbalanced-learn.org/stable/install.html).  
Import the `SMOTE` class from `imblearn.over_sampling`. Use its `fit_resample()` method on X_train and y_train with the default parameters. Store the results in X_train_smote, y_train_smote. Be careful to employ SMOTE ONLY on the training data and not on the full dataset, because that can cause inadvertent “data leakage” (please see: https://arxiv.org/pdf/2107.00079.pdf for details).

2. Employ a new Logistic Regression classifier from sklearn and instantiate the model. Label this model as ‘model_3_smote_lr’

3. Once instantiated, ‘fit()’ the model using the balanced X_train_smote, y_train_smote data.

4. Employ the ‘predict()’ function to obtain predictions on X_test and store this in a variable labeled as ‘y_pred_smote_lr’. 

5. Employ the `accuracy_score()` function, using the `y_pred_smote_lr` and `y_test` variables as the function's parameters, and print the accuracy of the new Logistic Regression model.

- What is your initial observation of the accuracy of model_3_smote_lr vs. accuracy of model_1_lr? What could be the reasoning for (any possible) change in accuracy?

### iv. Modeling SVM on a balanced dataset
1. Employ Synthetic Minority Oversampling on X_train and y_train. Import the `SMOTE` class from `imblearn.over_sampling`. Use its `fit_resample()` method on X_train and y_train. Store the results in X_train_smote, y_train_smote.

- At the end of this step, your new training set i.e., (X_train_smote , y_train_smote) should have the same number of instances for each of the two classes.

2. Employ a new SVM classifier from sklearn and instantiate the model. Label this model as ‘model_4_smote_svm’

3. Once instantiated, ‘fit()’ the model using the balanced X_train_smote, y_train_smote data.

4. Employ the ‘predict()’ function to obtain predictions on X_test and store this in a variable labeled as ‘y_pred_smote_svm’.  

5. Employ the `accuracy_score()` function (`sklearn.metrics.accuracy_score()`), using the `y_pred_smote_svm` and `y_test` variables as the function's parameters, and print the accuracy of the new SVM model.

- What is your initial observation of the accuracy of model_4_smote_svm vs. accuracy of model_2_svm? What could be the reasoning for (any possible) change in accuracy? 

### v. Modeling Grid Search Parameter Selection for SVM
1. We will now be reverting to our X_train and y_train data. Initialize a variable labeled as ‘param_grid’ storing the following: {"gamma": [0.001, 0.01, 0.1], "C": [1,10,100,1000,10000]}.

2. Employ the `GridSearchCV` class and initialize it with the following parameters: estimator = SVC(), param_grid = param_grid, cv = 5, verbose = 1, scoring = 'accuracy'.

3. Once instantiated, `fit()` the model using the X_train, y_train data.

4. Print the best parameters using the **‘best_params_’** attribute and print the mean cross-validated score of the best estimator (hint: use the ‘best_score_’ attribute).

5. Employ the `score()` function, using the `X_test` and `y_test` variables as the function's parameters, and print the accuracy of the new grid search SVM model.
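The grid search steps can be sketched as below with the parameter grid given above. The small synthetic dataset is an assumption chosen so the 75 fits (15 parameter combinations times 5 folds) finish quickly. On the real data the search will take noticeably longer, especially for large values of `C`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Small toy dataset so the search runs quickly.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

param_grid = {"gamma": [0.001, 0.01, 0.1], "C": [1, 10, 100, 1000, 10000]}

grid = GridSearchCV(
    estimator=SVC(),      # RBF kernel by default, so gamma is used
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation on the training split
    verbose=1,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best (gamma, C) combination
print(grid.best_score_)             # mean cross-validated accuracy of that combination
print(grid.score(X_test, y_test))   # accuracy of the refit best model on the test split
```

With the default `refit=True`, `GridSearchCV` retrains the best combination on the whole training split, which is why `grid.score()` and `grid.predict()` can be called directly afterwards.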

### d. Evaluation
1. (2 points) Calculate F1 Score, Precision, Recall, Accuracy (All on the test set X_test, y_test) .

- Employ the `classification_report()` function from sklearn.metrics to report the precision, recall, F1 score and accuracy for each class for the first **four models (parts c.i – c.iv).**

2. Visualize a confusion matrix for the first four models 

- Employ the `confusion_matrix()` function from sklearn.metrics to report the confusion matrix results.
- Report the False Negative and False Positive values for model_1_lr.

3. Report the best F1 score of the grid search implemented in the fifth model **(part c.v)**. Also report the best parameters from the grid search on the training set. 
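For items 1 and 2 above, the sketch below shows how `classification_report()` and `confusion_matrix()` are called and how the false positive and false negative counts can be read from the matrix. The label lists `y_true` and `y_pred` are hand-made examples, not model output. In the assignment they would be `y_test` and one of the prediction arrays such as `y_pred_lr`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-made example labels (1 = customer exited, 0 = customer stayed).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# Per-class precision, recall and F1, plus overall accuracy.
print(classification_report(y_true, y_pred))

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("False positives:", fp)
print("False negatives:", fn)
```

For churn, a false negative is a customer who left but was predicted to stay, which is usually the more costly mistake, so recall on the positive class deserves as much attention as accuracy.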