# Project Objective

- Develop a classification model that predicts whether a customer will churn.

- Project steps:
    - Data Collection
    - Descriptive Data Analysis
    - Exploratory Data Analysis
    - Data Cleaning
    - Feature Engineering
    - Data Modeling
    - Model Training
    - Model Evaluation
    - MLflow Logging
    - Project ROI Calculation
    - API Development

| Column             | Description                                                               |
|--------------------|---------------------------------------------------------------------------|
| `RowNumber`        | Row number (index only)                                                   |
| `CustomerId`       | Unique customer ID                                                        |
| `Surname`          | Customer surname                                                          |
| `CreditScore`      | Credit score (higher means a better credit profile)                       |
| `Geography`        | Customer's country (`France`, `Spain`, `Germany`)                         |
| `Gender`           | Gender (`Male`/`Female`)                                                  |
| `Age`              | Customer age                                                              |
| `Tenure`           | Number of years the customer has been with the bank                       |
| `Balance`          | Bank account balance                                                      |
| `NumOfProducts`    | Number of products held (cards, investments, etc.)                        |
| `HasCrCard`        | Has a credit card? (`1` = yes, `0` = no)                                  |
| `IsActiveMember`   | Is an active customer? (`1` = yes, `0` = no)                              |
| `EstimatedSalary`  | Estimated salary                                                          |
| `Exited`           | **Target**: did the customer leave the bank? (`1` = yes, `0` = no)        |


# Imports

In [60]:
import numpy as np
import pandas as pd


from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [38]:
pd.set_option('display.float_format', '{:.2f}'.format)


# Data Load

In [39]:
df = pd.read_csv('../data/rclientes.csv')
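
Before describing the data, it can help to confirm that the loaded file matches the column table at the top of the notebook. A minimal sketch, assuming the 14 column names listed in that table; the helper `missing_columns` is hypothetical and not part of the original notebook.

```python
# Column names taken from the column table at the top of the notebook.
EXPECTED_COLUMNS = [
    "RowNumber", "CustomerId", "Surname", "CreditScore", "Geography",
    "Gender", "Age", "Tenure", "Balance", "NumOfProducts",
    "HasCrCard", "IsActiveMember", "EstimatedSalary", "Exited",
]


def missing_columns(columns):
    """Return the expected columns absent from `columns` (hypothetical helper)."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Illustration with a deliberately incomplete list of names.
print(missing_columns(["CustomerId", "Age", "Exited"]))
```

With the notebook's DataFrame, the call would be `missing_columns(df.columns)`; an empty list means every documented column is present.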

# 1.0 Data Description

In [40]:
df1 = df.copy()

## 1.1 Data Dimensions

In [41]:
print('Number of rows: {}'.format(df1.shape[0]))
print('Number of columns: {}'.format(df1.shape[1]))

Number of rows: 10000
Number of columns: 14


## 1.2 Data Types

In [42]:
df1.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

## 1.3 Check NA

In [43]:
df1.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

## 1.4 Descriptive Statistics

In [44]:
num_attributes = df1.select_dtypes( include=['int64', 'float64'] )

In [45]:
# Central tendency - mean, median
ct1 = pd.DataFrame( num_attributes.apply( np.mean ) ).T
ct2 = pd.DataFrame( num_attributes.apply( np.median ) ).T

# Dispersion - std, min, max, range, skew, kurtosis
# Note: np.std uses ddof=0 by default (population standard deviation)
d1 = pd.DataFrame( num_attributes.apply( np.std ) ).T
d2 = pd.DataFrame( num_attributes.apply( np.min ) ).T
d3 = pd.DataFrame( num_attributes.apply( np.max ) ).T
d4 = pd.DataFrame( num_attributes.apply( lambda x: x.max() - x.min() ) ).T
d5 = pd.DataFrame( num_attributes.apply( lambda x: x.skew() ) ).T
d6 = pd.DataFrame( num_attributes.apply( lambda x: x.kurtosis() ) ).T

# Concatenate everything into a single summary table (one row per attribute)
m = pd.concat( [d2, d3, d4, ct1, ct2, d1, d5, d6] ).T.reset_index()
m.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
m

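|   | attributes      | min         | max         | range     | mean        | median      | std      | skew  | kurtosis |
|---|-----------------|-------------|-------------|-----------|-------------|-------------|----------|-------|----------|
| 0 | RowNumber       | 1.0         | 10000.0     | 9999.0    | 5000.5      | 5000.5      | 2886.75  | 0.0   | -1.2     |
| 1 | CustomerId      | 15565701.0  | 15815690.0  | 249989.0  | 15690940.57 | 15690738.0  | 71932.59 | 0.0   | -1.2     |
| 2 | CreditScore     | 350.0       | 850.0       | 500.0     | 650.53      | 652.0       | 96.65    | -0.07 | -0.43    |
| 3 | Age             | 18.0        | 92.0        | 74.0      | 38.92       | 37.0        | 10.49    | 1.01  | 1.4      |
| 4 | Tenure          | 0.0         | 10.0        | 10.0      | 5.01        | 5.0         | 2.89     | 0.01  | -1.17    |
| 5 | Balance         | 0.0         | 250898.09   | 250898.09 | 76485.89    | 97198.54    | 62394.29 | -0.14 | -1.49    |
| 6 | NumOfProducts   | 1.0         | 4.0         | 3.0       | 1.53        | 1.0         | 0.58     | 0.75  | 0.58     |
| 7 | HasCrCard       | 0.0         | 1.0         | 1.0       | 0.71        | 1.0         | 0.46     | -0.9  | -1.19    |
| 8 | IsActiveMember  | 0.0         | 1.0         | 1.0       | 0.52        | 1.0         | 0.5      | -0.06 | -2.0     |
| 9 | EstimatedSalary | 11.58       | 199992.48   | 199980.9  | 100090.24   | 100193.91   | 57507.62 | 0.0   | -1.18    |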
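
A similar summary can probably be produced more compactly with `DataFrame.agg`. The sketch below runs on a small made-up frame rather than the real data. Note that the pandas `"std"` aggregation uses `ddof=1` by default, so on the real data it would differ slightly from the `np.std` values reported in this section.

```python
import pandas as pd

# Toy data for illustration only; not taken from the dataset.
toy = pd.DataFrame({"Age": [18, 30, 42, 60], "Tenure": [0, 2, 5, 10]})

# One row per statistic, then transpose to one row per attribute.
summary = toy.agg(["min", "max", "mean", "median", "std", "skew", "kurt"]).T
summary["range"] = summary["max"] - summary["min"]
summary = summary.rename_axis("attributes").reset_index()

print(summary)
```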


In [46]:
df1['Exited'].value_counts()

Exited
0    7963
1    2037
Name: count, dtype: int64
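
These counts show an imbalanced target. As a quick worked check using only the two counts printed in this cell's output, the churn rate and the accuracy of a trivial "always predict 0" classifier on the full dataset are:

```python
# Counts copied from the value_counts() output of the preceding cell.
stayed, churned = 7963, 2037
total = stayed + churned

churn_rate = churned / total          # share of customers with Exited == 1
majority_baseline = stayed / total    # accuracy of always predicting Exited == 0

print(f"Churn rate: {churn_rate:.2%}")
print(f"Majority-class accuracy: {majority_baseline:.2%}")
```

On the DataFrame itself, `df1['Exited'].value_counts(normalize=True)` gives the same proportions. Any model's accuracy is best read against that majority-class figure.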

# 2.0 Feature Engineering

In [47]:
df2 = df1.copy()

# 3.0 Data Cleaning

In [48]:
df3 = df2.copy()

In [49]:
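# Drop the row identifiers and the non-numeric columns
# (Geography and Gender are removed here, not encoded)
cols_to_drop = ['RowNumber', 'Surname', 'Geography', 'Gender', 'CustomerId']

df3 = df3.drop(columns=cols_to_drop)
df3.head()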

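|   | CreditScore | Age | Tenure | Balance   | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|-------------|-----|--------|-----------|---------------|-----------|----------------|-----------------|--------|
| 0 | 619         | 42  | 2      | 0.0       | 1             | 1         | 1              | 101348.88       | 1      |
| 1 | 608         | 41  | 1      | 83807.86  | 1             | 0         | 1              | 112542.58       | 0      |
| 2 | 502         | 42  | 8      | 159660.8  | 3             | 1         | 0              | 113931.57       | 1      |
| 3 | 699         | 39  | 1      | 0.0       | 2             | 0         | 0              | 93826.63        | 0      |
| 4 | 850         | 43  | 2      | 125510.82 | 1             | 1         | 1              | 79084.1         | 0      |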
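
`Geography` and `Gender` are dropped in this step rather than encoded. If they were to be kept as features, one option (an assumption, not what this notebook does) is one-hot encoding with `pd.get_dummies`. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows using the category values listed in the column table.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 43],
})

# drop_first=True drops one level per column to avoid redundant dummies.
encoded = pd.get_dummies(
    toy, columns=["Geography", "Gender"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
```

Tree-based models such as the random forest used later do not strictly need `drop_first=True`; it is shown here only because it keeps the feature count smaller.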


# 4.0 Exploratory Data Analysis

In [50]:
df4 = df3.copy()

# 5.0 Data Preparation

In [51]:
df5 = df4.copy()

# 6.0 Machine Learning Model

In [52]:
df6 = df5.copy()

In [53]:
X = df6.drop('Exited', axis=1)
y = df6['Exited']

In [54]:
# test_size is not set, so scikit-learn uses its default of 25% for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
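
Because only about 20% of rows have `Exited == 1`, a stratified split may be worth considering so that both partitions keep the same class ratio. This is a variation on the split used here, not what the notebook ran, so the metrics reported later would not necessarily be reproduced. A sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 20% positives (not the real dataset).
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 200 + [0] * 800)

# stratify=y_toy keeps the positive rate equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```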

In [55]:
X_test

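|      | CreditScore | Age | Tenure | Balance   | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|------|-------------|-----|--------|-----------|---------------|-----------|----------------|-----------------|
| 6252 | 596         | 32  | 3      | 96709.07  | 2             | 0         | 0              | 41788.37        |
| 4684 | 623         | 43  | 1      | 0.00      | 2             | 1         | 1              | 146379.30       |
| 1731 | 601         | 44  | 4      | 0.00      | 2             | 1         | 0              | 58561.31        |
| 4742 | 506         | 59  | 8      | 119152.10 | 2             | 1         | 1              | 170679.74       |
| 4521 | 560         | 27  | 7      | 124995.98 | 1             | 1         | 1              | 114669.79       |
| ...  | ...         | ... | ...    | ...       | ...           | ...       | ...            | ...             |
| 4862 | 645         | 55  | 1      | 133676.65 | 1             | 0         | 1              | 17095.49        |
| 7025 | 569         | 51  | 3      | 0.00      | 3             | 1         | 0              | 75084.96        |
| 7647 | 768         | 25  | 0      | 78396.08  | 1             | 1         | 1              | 8316.19         |
| 7161 | 690         | 36  | 6      | 110480.48 | 1             | 0         | 0              | 81292.33        |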


## 6.1 Random Forest

In [57]:
rf = RandomForestClassifier(random_state=42)

rf.fit(X_train,y_train)

y_pred = rf.predict(X_test)
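
`predict` returns the class with the highest averaged probability, which for a binary target amounts to a 0.5 cutoff. If catching churners matters more than avoiding false alarms, one option is to threshold `predict_proba` manually. The sketch below uses synthetic data from `make_classification` and a 0.3 threshold chosen only for illustration; neither comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (roughly 80/20).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Probability of the positive class (column 1 corresponds to model.classes_[1]).
proba = model.predict_proba(X_te)[:, 1]

threshold = 0.3  # hypothetical value; in practice tuned on a validation set
y_default = (proba > 0.5).astype(int)
y_lowered = (proba >= threshold).astype(int)

print("Recall at 0.5:", recall_score(y_te, y_default))
print("Recall at 0.3:", recall_score(y_te, y_lowered))
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision, so the choice depends on the relative cost of a missed churner versus an unnecessary retention action.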

In [61]:
precision = precision_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
acc = accuracy_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)

In [62]:
print('Precision: ', precision)
print('Recall: ', recall)
print('Acc: ', acc)
print('F1: ', f1)

Precision:  0.7422680412371134
Recall:  0.4346076458752515
Acc:  0.8576
F1:  0.5482233502538071
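
F1 is the harmonic mean of precision and recall, which can be checked against the printed values. The gap between precision (about 0.74) and recall (about 0.43) indicates that the model misses more than half of the churners in the test set.

```python
# Values copied from the printed metrics.
precision = 0.7422680412371134
recall = 0.4346076458752515

# F1 = 2 * P * R / (P + R)
f1 = 2 * precision * recall / (precision + recall)

print(f"F1: {f1:.4f}")
```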
