# ARTIFICIAL NEURAL NETWORK

### Types of Machine Learning
![towardsdatascience.com](https://miro.medium.com/max/602/0*-068ud_-o3ajwq_z.jpg)
https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f

### Types of Supervised Machine Learning
![towardsdatascience.com](https://github.com/IALeMans/Meetup_ai-basics_2019-2/raw/master/classification_regression.png)
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/machine_learning.html

### AI Project Cycle

![AI project cycle](https://github.com/IALeMans/Meetup_ai-basics_2019-2/raw/master/cycle.png)

## Problem Definition
Bank churn modelling: our goal is to predict whether a customer is at risk of leaving the bank (the `Exited` column).

# Part 1 - Data Preprocessing

### Let's make a supervised classification pipeline

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle

import seaborn as sns

## Importing the dataset

In [2]:
dataset = pd.read_csv('Churn_Modelling.csv')

# features: columns 3 to 12 (CreditScore through EstimatedSalary)
X = dataset.iloc[:, 3:13].values

# target: column 13 (Exited)
y = dataset.iloc[:, 13].values

In [3]:
dataset.head()

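| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |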


In [4]:
dataset.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [None]:
# seaborn dataviz
df_viz = dataset.iloc[:,3:]
sns.pairplot(df_viz)

Things to look at in the pair plot:

* the deformed `Balance` distribution
* `Age` vs `Exited`
* 5 products vs `Exited`

In [5]:
# first 5 rows of the features
# X is a NumPy array, not a DataFrame
X[:5]

array([[619, 'France', 'Female', 42, 2, 0.0, 1, 1, 1, 101348.88],
       [608, 'Spain', 'Female', 41, 1, 83807.86, 1, 0, 1, 112542.58],
       [502, 'France', 'Female', 42, 8, 159660.8, 3, 1, 0, 113931.57],
       [699, 'France', 'Female', 39, 1, 0.0, 2, 0, 0, 93826.63],
       [850, 'Spain', 'Female', 43, 2, 125510.82, 1, 1, 1, 79084.1]],
      dtype=object)

In [6]:
# target first 5 rows
y[:5]

array([1, 0, 1, 0, 0])

## Encoding categorical data
The problem: how do we deal with string values such as 'France' or 'Spain'?

In [7]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# encode column 1: Geography (country)
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])

# encode column 2: Gender
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

In [18]:
X[:, 1].shape

(10000,)

In [8]:
X[:5]

array([[619, 0, 0, 42, 2, 0.0, 1, 1, 1, 101348.88],
       [608, 2, 0, 41, 1, 83807.86, 1, 0, 1, 112542.58],
       [502, 0, 0, 42, 8, 159660.8, 3, 1, 0, 113931.57],
       [699, 0, 0, 39, 1, 0.0, 2, 0, 0, 93826.63],
       [850, 2, 0, 43, 2, 125510.82, 1, 1, 1, 79084.1]], dtype=object)
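
The integer codes come from `LabelEncoder` sorting the distinct labels and using each label's position as its code. Here is a minimal sketch on a toy column. `'Germany'` is an assumed third category (it does not appear in the rows shown here), which would explain why Spain is encoded as 2 rather than 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only; 'Germany' is an assumed category.
countries = np.array(['France', 'Spain', 'Germany', 'France'])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted labels; the index of a label is its integer code.
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings.
print(encoder.inverse_transform(codes))
```

The codes are just positions in a sorted list, so they carry no real meaning as numbers.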

In [9]:
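# The problem: label encoding implies an order. Is Spain (2) "higher" than France (0)?
# One-hot encoding column 1 removes that artificial order.
# Note: `categorical_features` was removed in scikit-learn 0.22, so this cell needs an older version.
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
# drop the first dummy column to avoid the dummy variable trap
X = X[:, 1:]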

Note: in recent scikit-learn versions, `OneHotEncoder` accepts string categories directly, so the `LabelEncoder` step before it is no longer required.
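
On current scikit-learn versions, the same one-hot step can be sketched with `ColumnTransformer`. This is an illustrative alternative, not the cell used in this notebook. The toy DataFrame, its values and the name `column_transformer` are assumptions made for the example, and `'Germany'` is an assumed category.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only; Gender is assumed to be label-encoded already.
toy = pd.DataFrame({
    'CreditScore': [619, 608, 502],
    'Geography': ['France', 'Spain', 'Germany'],
    'Gender': [0, 0, 1],
})

# One-hot encode Geography, drop the first dummy column,
# and pass the remaining columns through unchanged.
column_transformer = ColumnTransformer(
    transformers=[('geo', OneHotEncoder(drop='first'), ['Geography'])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
X_encoded = column_transformer.fit_transform(toy)

# Column order: Geography dummies (Germany, Spain), then CreditScore, Gender.
print(X_encoded)
```

As with the original cell, the encoded columns come first and the untouched columns follow in their original order.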


In [10]:
# encoded 'Geography' (x2), 'CreditScore', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary'
X[0]

array([0.0000000e+00, 0.0000000e+00, 6.1900000e+02, 0.0000000e+00,
       4.2000000e+01, 2.0000000e+00, 0.0000000e+00, 1.0000000e+00,
       1.0000000e+00, 1.0000000e+00, 1.0134888e+05])

## One more dataviz
Nice, but the seaborn pair plot left out the categorical data. Let's go back to pandas to encode it first.

In [None]:
dataset.head()

In [None]:
df_viz = pd.get_dummies(dataset.iloc[:,3:], prefix=['Geo', 'Gender'])
df_viz.head()
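
To see what `pd.get_dummies` does with the `prefix` list, here is a small sketch on a toy DataFrame (the values are made up for illustration). Each prefix is matched, in order, to one of the text columns being encoded.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain'],
    'Gender': ['Female', 'Male'],
    'Age': [42, 41],
})

# Text columns are replaced by one indicator column per category;
# numeric columns such as Age are kept as they are.
dummies = pd.get_dummies(toy, prefix=['Geo', 'Gender'])
print(dummies.columns.tolist())
```

The prefix list must have one entry per encoded column, which is why two prefixes are given for `Geography` and `Gender`.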

In [None]:
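g = sns.PairGrid(df_viz)
g.map(plt.scatter)
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
g.map_diag(sns.histplot, kde=True)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)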

## Splitting the dataset into the Training set and Test set

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [12]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(8000, 11)
(2000, 11)
(8000,)
(2000,)


## Feature Scaling

In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
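
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. Here is a minimal sketch on toy arrays (the `*_toy` names and values are assumptions for illustration) showing that `transform` computes `(x - mean) / std` with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_toy)  # learns mean and std
X_test_scaled = toy_scaler.transform(X_test_toy)        # reuses them

# Manual version using the training mean and (population) standard deviation.
manual = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)

print(X_test_scaled, manual)
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.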

In [14]:
X_train[:5]

array([[-0.5698444 ,  1.74309049,  0.16958176, -1.09168714, -0.46460796,
         0.00666099, -1.21571749,  0.8095029 ,  0.64259497, -1.03227043,
         1.10643166],
       [ 1.75486502, -0.57369368, -2.30455945,  0.91601335,  0.30102557,
        -1.37744033, -0.00631193, -0.92159124,  0.64259497,  0.9687384 ,
        -0.74866447],
       [-0.5698444 , -0.57369368, -1.19119591, -1.09168714, -0.94312892,
        -1.031415  ,  0.57993469, -0.92159124,  0.64259497, -1.03227043,
         1.48533467],
       [-0.5698444 ,  1.74309049,  0.03556578,  0.91601335,  0.10961719,
         0.00666099,  0.47312769, -0.92159124,  0.64259497, -1.03227043,
         1.27652776],
       [-0.5698444 ,  1.74309049,  2.05611444, -1.09168714,  1.73658844,
         1.04473698,  0.8101927 ,  0.8095029 ,  0.64259497,  0.9687384 ,
         0.55837842]])

## Pickling the datasets

In [15]:
datasets = [X_train, X_test, y_train, y_test]

In [16]:
# the with block closes the file even if pickling fails
with open('datasets.p', 'wb') as f:
    for obj in datasets:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

## Pickling the transformers

In [17]:
transformers = [labelencoder_X_1, labelencoder_X_2, onehotencoder, scaler]
# the with block closes the file even if pickling fails
with open('transformers.p', 'wb') as f:
    for obj in transformers:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
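
Because several objects are dumped one after another into the same file, reading them back takes one `pickle.load` call per object, in the same order. Here is a minimal sketch using a temporary file and placeholder objects (both are assumptions for illustration, not the notebook's real data).

```python
import os
import pickle
import tempfile

# Placeholder objects standing in for the arrays or transformers.
objects = [[1, 2, 3], {'a': 1}, 'third']
path = os.path.join(tempfile.mkdtemp(), 'demo.p')

# Write the objects sequentially into one file.
with open(path, 'wb') as f:
    for obj in objects:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

# Read them back: one load per dump, in the same order.
with open(path, 'rb') as f:
    restored = [pickle.load(f) for _ in range(len(objects))]

print(restored)
```

The same pattern should apply to `datasets.p` (four loads) and `transformers.p` (four loads) in a later notebook.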

# Part 2 - Now let's make the ANN!
Jump to the next notebook :D