### **Why is churn analysis important?**  
Churn analytics is the process of measuring the rate at which customers stop using a product, site, or service.  
It answers the question “Are we losing customers?”  
  
<img src="https://www.appier.com/hubfs/Imported_Blog_Media/GettyImages-1030850238-01.jpg" width="700" height="350"/>
  
Acquiring new customers is much more expensive than retaining existing ones.  
That's why we focus on retaining existing customers.

In [2]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

url = '../Datasets/Churn_Modelling.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


**Some notes on the dataset:**  
- *"RowNumber"* and *"CustomerId"* are only identifiers and shouldn't be included.  
- *"Surname"* shouldn't matter to our neural network.  
  
Our neural network should not learn from the surname or similar identifying fields.  
*"Exited"* is what we want to predict.
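
Because *"Exited"* marks the customers who left, the churn rate described at the top of this notebook is simply the mean of that column. Here is a minimal sketch on a hypothetical five-row frame named `toy` (its values are copied from the `Exited` column of the preview above); on the full data the same calls would be made on `df`:

```python
import pandas as pd

# Hypothetical toy frame; values copied from the five-row preview above
toy = pd.DataFrame({"Exited": [1, 0, 1, 0, 0]})

churn_rate = toy["Exited"].mean()            # share of customers who left
class_counts = toy["Exited"].value_counts()  # rows per class (0 = stayed, 1 = left)

print(f"Churn rate: {churn_rate:.0%}")
print(class_counts)
```

Checking the class counts early is useful because a strongly imbalanced target affects how accuracy should be interpreted later.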

**Select only the needed columns:**

In [3]:
X = df.iloc[:,3:13].values     # features: CreditScore through EstimatedSalary
y = df.loc[:,'Exited'].values  # target: 1 if the customer left, else 0

print(X,'\n',y)

[[619 'France' 'Female' ... 1 1 101348.88]
 [608 'Spain' 'Female' ... 0 1 112542.58]
 [502 'France' 'Female' ... 1 0 113931.57]
 ...
 [709 'France' 'Female' ... 0 1 42085.58]
 [772 'Germany' 'Male' ... 1 0 92888.52]
 [792 'France' 'Female' ... 1 0 38190.78]] 
 [1 0 1 ... 1 1 0]
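
Positional slicing such as `iloc[:, 3:13]` silently depends on the column order of the CSV. As an alternative sketch, the same features can be selected by name. The two-row frame `toy` and the list `id_columns` below are hypothetical stand-ins that mirror the layout of the dataset:

```python
import pandas as pd

# Hypothetical two-row frame with the same column layout as the dataset
toy = pd.DataFrame({
    "RowNumber": [1, 2], "CustomerId": [15634602, 15647311],
    "Surname": ["Hargrave", "Hill"], "CreditScore": [619, 608],
    "Geography": ["France", "Spain"], "Gender": ["Female", "Female"],
    "Age": [42, 41], "Tenure": [2, 1], "Balance": [0.0, 83807.86],
    "NumOfProducts": [1, 1], "HasCrCard": [1, 0], "IsActiveMember": [1, 1],
    "EstimatedSalary": [101348.88, 112542.58], "Exited": [1, 0],
})

id_columns = ["RowNumber", "CustomerId", "Surname"]      # identifiers to exclude
features = toy.drop(columns=id_columns + ["Exited"])     # everything else is a feature
target = toy["Exited"]

# The name-based selection matches the positional slice used above
same_as_slice = list(features.columns) == list(toy.iloc[:, 3:13].columns)
print(features.columns.tolist())
print(same_as_slice)
```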


**Encoding**  
Categorical data such as *"Geography"* needs to be encoded as numbers. The columns that need to be encoded are:  
- *"Geography"*  
- *"Gender"*

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:,1] = le.fit_transform(X[:,1]) #Geography

le2 = LabelEncoder()
X[:,2] = le2.fit_transform(X[:,2]) #Gender

X #All numeric

array([[619, 0, 0, ..., 1, 1, 101348.88],
       [608, 2, 0, ..., 0, 1, 112542.58],
       [502, 0, 0, ..., 1, 0, 113931.57],
       ...,
       [709, 0, 0, ..., 0, 1, 42085.58],
       [772, 1, 1, ..., 1, 0, 92888.52],
       [792, 0, 0, ..., 1, 0, 38190.78]], dtype=object)

We have just applied label encoding, but there is a problem.  
As you can see, France is encoded as 0, Germany as 1, and Spain as 2.  
  
We know that these values cannot be meaningfully compared, but the machine cannot know this.  
The model would treat them as ordered, as if Germany were greater than France.  
We should apply *"One Hot Encoding"* to fix this problem.
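
To make the difference concrete, here is a small sketch on a hypothetical four-element array (`countries`, `le_demo`, and `ohe_demo` are illustrative names, not part of the notebook). It shows that `LabelEncoder` numbers the classes in sorted order, while `OneHotEncoder` produces one 0/1 column per class with no implied ranking:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["France", "Spain", "Germany", "France"])

# Label encoding: classes are sorted, then numbered 0, 1, 2, ...
le_demo = LabelEncoder()
codes = le_demo.fit_transform(countries)
print(le_demo.classes_)  # the position of each class is its integer code
print(codes)

# One-hot encoding: one 0/1 column per country, no implied order
ohe_demo = OneHotEncoder(dtype=float)
one_hot = ohe_demo.fit_transform(countries.reshape(-1, 1)).toarray()
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no country is numerically "greater" than another.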

In [5]:
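from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# One-hot encode column 1 (Geography); pass the other columns through unchanged
ohe = ColumnTransformer([('ohe', OneHotEncoder(dtype=float), [1])], remainder='passthrough')

X = ohe.fit_transform(X)
X = X[:,1:] # Drop the first dummy column (France); Germany and Spain remain
X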

array([[0.0, 0.0, 619, ..., 1, 1, 101348.88],
       [0.0, 1.0, 608, ..., 0, 1, 112542.58],
       [0.0, 0.0, 502, ..., 1, 0, 113931.57],
       ...,
       [0.0, 0.0, 709, ..., 0, 1, 42085.58],
       [1.0, 0.0, 772, ..., 1, 0, 92888.52],
       [0.0, 0.0, 792, ..., 1, 0, 38190.78]], dtype=object)

#### **What is normalization, and why should we normalize the data?**  
Normalization, also called statistical normalization, is a technique used in statistical data processing.
  
Its purpose is to bring features onto a common scale when their ranges differ greatly.  
  
Normalization is not a must for all datasets, but it is recommended when the dataset's variables have different ranges.  
  
- Normalization helps improve the performance and reliability of a machine learning model.
- Normalization benefits machine learning algorithms that rely on Euclidean distance.
- Normalization is also appropriate when using linear models and interpreting their coefficients.
- It helps gradient descent converge faster.
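
The `StandardScaler` used in the next cell performs one specific kind of normalization, usually called standardization: each column is transformed as `z = (x - mean) / std`, so it ends up with zero mean and unit variance. The sketch below checks this on a small hypothetical matrix (`toy` is made up for illustration, loosely shaped like a credit score and a balance column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns with very different ranges
toy = np.array([[619.0, 0.0],
                [608.0, 83807.86],
                [502.0, 159660.80],
                [699.0, 0.0]])

# z = (x - mean) / std, computed per column with the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))  # do the two computations agree?
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns live on the same scale even though the raw values differ by several orders of magnitude.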

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X = sc.fit_transform(X)
X

array([[-0.57873591, -0.57380915, -0.32622142, ...,  0.64609167,
         0.97024255,  0.02188649],
       [-0.57873591,  1.74273971, -0.44003595, ..., -1.54776799,
         0.97024255,  0.21653375],
       [-0.57873591, -0.57380915, -1.53679418, ...,  0.64609167,
        -1.03067011,  0.2406869 ],
       ...,
       [-0.57873591, -0.57380915,  0.60498839, ..., -1.54776799,
         0.97024255, -1.00864308],
       [ 1.72790383, -0.57380915,  1.25683526, ...,  0.64609167,
        -1.03067011, -0.12523071],
       [-0.57873591, -0.57380915,  1.46377078, ...,  0.64609167,
        -1.03067011, -1.07636976]])

**Create the train and test sets to finish this notebook:**

In [7]:
X = pd.DataFrame(X)
y = pd.DataFrame(y)

X.columns = ['Germany', 'Spain', 'CreditScore','Gender', 'Age', 'Tenure','Balance',
            'NumOfProducts', 'HasCrCard','IsActiveMember', 'EstimatedSalary']

y.columns = ['Exited']

from sklearn.model_selection import train_test_split
# Hold out 33% of the rows for testing; the remaining 67% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=0)

X_train.to_csv('../Datasets/X_train.csv', index=False)
X_test.to_csv('../Datasets/X_test.csv', index=False)
y_train.to_csv('../Datasets/y_train.csv', index=False)
y_test.to_csv('../Datasets/y_test.csv', index=False)
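
In this notebook the scaler is fit on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on hypothetical random data (`X_demo`, `y_demo`, and the split ratio are assumptions made for illustration), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # hypothetical features
y_demo = rng.integers(0, 2, size=100)                      # hypothetical labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the test set stays unseen during preprocessing, at the cost of test columns that are only approximately zero-mean.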