You are hired by an international bank with millions of customers spread across Europe, mainly in three countries: Spain, France and Germany. In the last six months the bank detected that the number of people leaving the bank started to increase, so they decided to take measures.

The bank decided to take a small sample of 10,000 of their customers and retrieve some information.
For six months they followed the behaviour of these 10,000 customers and analysed which stayed and which left the bank.

Therefore, they want you to develop a model that can measure the probability of a customer leaving the bank.

### Start by importing numpy and pandas

In [1]:
import numpy as np
import pandas as pd

## Part 1 - Data Preprocessing

**Import the dataset Bank_customers.csv**

**Perform all required data preprocessing steps, until you have your train set and your test set.**

In [2]:
df_customers = pd.read_csv("Bank_customers.csv")
df_customers

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [3]:
df_customers.info() #Show Non-null Counts and Types 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [4]:
# Drop the columns that carry no predictive information
drop_columns = ["RowNumber","CustomerId", "Surname"]
df_customers.drop(columns = drop_columns, inplace=True)
df_customers

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [5]:
# Convert the categorical variables to numeric codes
df_customers["Geography"] = df_customers["Geography"].map({"France":1, "Spain":2, "Germany":3})
df_customers["Gender"] = df_customers["Gender"].map({"Female":1, "Male":2})
df_customers

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,1,1,42,2,0.00,1,1,1,101348.88,1
1,608,2,1,41,1,83807.86,1,0,1,112542.58,0
2,502,1,1,42,8,159660.80,3,1,0,113931.57,1
3,699,1,1,39,1,0.00,2,0,0,93826.63,0
4,850,2,1,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,1,2,39,5,0.00,2,1,0,96270.64,0
9996,516,1,2,35,10,57369.61,1,1,1,101699.77,0
9997,709,1,1,36,7,0.00,1,0,1,42085.58,1
9998,772,3,2,42,3,75075.31,2,1,0,92888.52,1


In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(df_customers.drop(columns=["Exited"]),df_customers["Exited"] , test_size=0.3)

print(f"Shape of train set : {X_train.shape}")
print(f"Shape of test set : {X_test.shape}")

Shape of train set : (7000, 10)
Shape of test set : (3000, 10)


### Perform feature scaling

Use the StandardScaler class from sklearn.preprocessing. You can read more about it here https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Use fit_transform on the training set and transform on the test set, so that the test set is scaled with the statistics learned from the training set.


In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [8]:
print(f"X_train standard deviations : {X_train.std(axis=0).round(2)}")
print(f"X_train means : {X_train.mean(axis=0).round(2)}")
print(" ")
print(f"X_test standard deviations : {X_test.std(axis=0).round(2)}")
print(f"X_test means : {X_test.mean(axis=0).round(2)}")

X_train standard deviations : [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
X_train means : [-0. -0.  0. -0.  0.  0.  0.  0.  0.  0.]
 
X_test standard deviations : [0.99 1.01 1.   1.   0.98 1.01 1.   1.01 1.   1.  ]
X_test means : [-0.01  0.02 -0.02  0.01 -0.02  0.05 -0.02 -0.01  0.01 -0.02]
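
The test-set means and standard deviations above are close to, but not exactly, 0 and 1. That is expected: the scaler learns its statistics from the training set only and reuses them on the test set. Below is a minimal NumPy sketch of the same arithmetic. The arrays `train` and `test` are made up for illustration and are not the bank data.

```python
import numpy as np

# Made-up data standing in for X_train / X_test (illustrative values only)
train = np.array([[600.0, 30.0], [700.0, 40.0], [800.0, 50.0]])
test = np.array([[650.0, 35.0], [800.0, 55.0]])

# fit: learn the per-column mean and (population) standard deviation on the training set
mean = train.mean(axis=0)
std = train.std(axis=0)

# transform: apply the *training* statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # zero up to floating-point error
print(test_scaled.mean(axis=0))   # not zero: the test set has its own distribution
```

Fitting the scaler on the test set as well would leak information about the test data into the preprocessing step.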


## Part 2 - Now let's create the ANN!

### Importing the Keras libraries and packages
You are going to use the Sequential model. You can read more about it here https://keras.io/models/sequential/

In [9]:
import keras
from keras.models import Sequential
from keras.layers import Dense

Import **Dropout** to be used when adding layers. The technique comes from the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting".

Dropout consists in randomly setting a fraction `rate` of the input units to 0 at each update during training, which helps prevent overfitting.

Arguments:

rate: float between 0 and 1. Fraction of the input units to drop.
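
As a rough illustration of what happens during training, here is a NumPy sketch of dropout with `rate = 0.1`. The `activations` array is made up. The rescaling by `1 / (1 - rate)` mirrors what the Keras Dropout layer does during training, and the layer does nothing at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
rate = 0.1                      # fraction of units to drop

activations = np.ones((1000, 6))  # made-up layer outputs

# Training time: keep each unit with probability 1 - rate, zero it otherwise,
# and scale the survivors by 1 / (1 - rate) so the expected value is unchanged
keep_mask = rng.random(activations.shape) >= rate
dropped = activations * keep_mask / (1.0 - rate)

print(f"fraction zeroed: {1.0 - keep_mask.mean():.3f}")
print(f"mean activation after dropout: {dropped.mean():.3f}")
```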

In [10]:
from keras.layers import Dropout

### Start by declaring a sequential model named classifier

In [11]:
classifier = keras.Sequential()

### Add the input layer and the first hidden layer using the .add() method

Use the **Dense** layer class, which takes the following arguments:

**units**: dimensionality of the output space.

**activation**: Activation function to use (relu for hidden layers, and sigmoid for output layer). 

**kernel_initializer**: Initializer for the kernel weights matrix.

In addition, add the argument **input_dim**: dimension of the input layer, to be passed for the first hidden layer.
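
To make these arguments concrete, the sketch below reproduces in NumPy what a Dense layer with `units = 6`, `activation = 'relu'` and `input_dim = 10` computes. The weight range and the input batch are assumptions chosen for illustration; Keras picks its own initial values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_units = 10, 6  # input_dim = 10, units = 6

# Kernel (weights) and bias; the small uniform range is an assumption for illustration
kernel = rng.uniform(-0.05, 0.05, size=(n_inputs, n_units))
bias = np.zeros(n_units)

def dense_relu(x):
    """Forward pass of one fully connected layer followed by ReLU."""
    return np.maximum(0.0, x @ kernel + bias)

batch = rng.normal(size=(4, n_inputs))  # four made-up, already-scaled customers
hidden = dense_relu(batch)

n_params = kernel.size + bias.size
print(hidden.shape)  # (4, 6)
print(n_params)      # 66
```

The count of 10 × 6 weights plus 6 biases matches the 66 parameters that `classifier.summary()` reports for the first layer.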

In [12]:
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 10))

**Apply Dropout to the outputs of the first hidden layer for regularization, with a rate of 0.1**

In [13]:
classifier.add(Dropout(0.1))  # the input shape is inferred from the previous layer

### Add the second hidden layer made of 6 units

In [14]:
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

### Add the output layer 

In [15]:
classifier.add(Dense(units = 1, activation = 'sigmoid'))

In [22]:
classifier.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 6)                 66        
                                                                 
 dropout (Dropout)           (None, 6)                 0         
                                                                 
 dense_1 (Dense)             (None, 6)                 42        
                                                                 
 dense_2 (Dense)             (None, 1)                 7         
                                                                 
Total params: 115
Trainable params: 115
Non-trainable params: 0
_________________________________________________________________


### Compiling the ANN

Before training a model, you need to configure the learning process, which is done via the **compile** method. 

It receives three arguments: an **optimizer** (use 'adam'), a **loss** function (use 'binary_crossentropy'), and a list of **metrics** (use 'accuracy').
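
The loss is the quantity the optimizer minimizes. As a hedged illustration of what 'binary_crossentropy' measures, here is a plain-Python version of the formula. The function name, the clipping constant `eps` and the sample values are assumptions made for this sketch, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a list of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up labels (1 = left the bank) and predicted probabilities
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
hesitant = [0.6, 0.4, 0.5, 0.5]

print(binary_crossentropy(labels, confident))  # lower loss
print(binary_crossentropy(labels, hesitant))   # higher loss
```

Predictions that are both correct and confident give a smaller loss than correct but hesitant ones, which is why this loss suits a sigmoid output.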

### Fitting the ANN to the Training set

Keras models are trained on NumPy arrays of input data and labels. For training a model, you will typically use the **fit** method.

Arguments: 

Input data, Target data,

**batch_size**: Number of samples per gradient update 

**epochs**: Number of passes over the entire training set

You are going to train your model using a batch size of 10, and 100 epochs 
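
To get a feel for the cost of these settings, the short calculation below counts the gradient updates implied by a batch size of 10 and 100 epochs on the 7,000-row training set from Part 1.

```python
import math

n_train = 7000    # rows in the training set (see the shapes printed in Part 1)
batch_size = 10
epochs = 100

# One gradient update per batch; the last batch may be smaller, hence the ceiling
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 700
print(total_updates)      # 70000
```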

## Part 3 - Making predictions and evaluating the model



### Predicting the Test set results

#### Use the .predict(input data) method. It generates output predictions for the input samples.

#### Since the output of a sigmoid function lies between 0 and 1, choose a threshold (use 0.5): assign a value of 1 to predictions higher than 0.5, and 0 otherwise.
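
A minimal sketch of the thresholding step, using made-up probabilities in place of the real output of `predict`:

```python
import numpy as np

# Made-up sigmoid outputs with the (n_samples, 1) shape that predict returns
y_prob = np.array([[0.12], [0.50], [0.73], [0.49], [0.91]])

# Strictly greater than the threshold -> 1, otherwise 0
threshold = 0.5
y_pred = (y_prob > threshold).astype(int).ravel()

print(y_pred)  # [0 0 1 0 1]
```

Note that a probability of exactly 0.5 is assigned to class 0 under this rule.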

### Confusion Matrix
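
A confusion matrix counts how often each true class is predicted as each class. The sketch below builds one by hand with NumPy on made-up labels. In the notebook, `sklearn.metrics.confusion_matrix(Y_test, y_pred)` gives the same layout (rows are true classes, columns are predicted classes), assuming `y_pred` holds your thresholded predictions.

```python
import numpy as np

# Made-up true labels and thresholded predictions (1 = customer left)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

print(cm)
print(accuracy)  # 0.75
```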

### Predicting a single new observation

Predict whether the customer with the following information will leave the bank:
Geography : France  -  Credit score : 600  -  Gender : Male  -  Age : 40  -  Tenure : 3  -  Balance : 60000  -  Number of products : 2  -  Has credit card : Yes  -  Is an active member : Yes  -  Estimated salary : 50000

What is the output value of your classifier? Is the customer going to leave the bank?
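
The new observation has to go through the same preprocessing as the training data: same column order, same encodings, same scaler. Below is a sketch of how the row could be assembled. The names `feature_order`, `customer` and `new_customer` are introduced here for illustration only.

```python
import numpy as np

# Column order of the training features after dropping and encoding (see Part 1)
feature_order = ["CreditScore", "Geography", "Gender", "Age", "Tenure", "Balance",
                 "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary"]

# Same encodings as the preprocessing cell: France -> 1, Male -> 2, Yes -> 1
customer = {"CreditScore": 600, "Geography": 1, "Gender": 2, "Age": 40, "Tenure": 3,
            "Balance": 60000, "NumOfProducts": 2, "HasCrCard": 1, "IsActiveMember": 1,
            "EstimatedSalary": 50000}

# One row, ten columns: predict expects a 2-D array even for a single sample
new_customer = np.array([[customer[name] for name in feature_order]], dtype=float)
print(new_customer.shape)  # (1, 10)

# In the notebook, the next steps would be (assuming the fitted scaler and trained classifier):
# probability = classifier.predict(scaler.transform(new_customer))
# will_leave = probability > 0.5
```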

## Part 4 - Evaluating the ANN

You can use Sequential Keras models as part of your Scikit-Learn workflow via the wrappers found in the keras library. More info can be found here https://keras.io/scikit-learn-api/

KerasClassifier Arguments :
    
build_fn should construct, compile and return a Keras model, which will then be used to fit/predict. 

sk_params: model parameters & fitting parameters

In [17]:
from keras.wrappers.scikit_learn import KerasClassifier

In [18]:
from sklearn.model_selection import cross_val_score

#### create the function that builds the architecture of the ANN

In [19]:
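def build_classifier():
    layers = [Dense(6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 10), Dropout(0.1),
              Dense(6, kernel_initializer = 'uniform', activation = 'relu'), Dense(1, activation = 'sigmoid')]
    classifier = Sequential(layers)  # classifier is local to this function, so each CV fold gets a fresh model
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier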

#### Create an instance of this class named kclassifier that takes the previous function as its build_fn argument, along with the batch_size and the number of epochs

In [20]:
kclassifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)

#### Use the cross_val_score function. Define an array that will contain the k-accuracies of the k-fold CV, name it accuracies.
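
To see what the k accuracies correspond to, here is a NumPy-only sketch of how k-fold cross-validation partitions the training set. The value `k = 10` is an assumption, since the exercise leaves k open, and the sketch splits indices in order, whereas scikit-learn's splitters may stratify the folds.

```python
import numpy as np

n_samples = 7000  # size of the training set
k = 10            # number of folds (an assumed value; the exercise leaves k open)

indices = np.arange(n_samples)
folds = np.array_split(indices, k)  # k nearly equal, non-overlapping validation folds

for i, val_idx in enumerate(folds):
    # Train on everything outside the current fold, validate on the fold itself
    train_idx = np.setdiff1d(indices, val_idx)
    if i == 0:
        print(len(train_idx), len(val_idx))  # 6300 700
```

Each fold serves as the validation set exactly once, so the model is trained k times and produces k accuracy values.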


In [None]:
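accuracies = cross_val_score(estimator = kclassifier, X = X_train, y = Y_train, cv = 10)  # cv = 10 is an assumed choice of k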

#### Calculate the mean and the variance of your accuracies

In [None]:
mean = accuracies.mean()

In [None]:
variance = accuracies.var()

## Part 5 - Tuning the ANN (something for you to work on at home, if you want)

In [None]:
#from sklearn.model_selection import GridSearchCV