# Beta Bank 

* Beta Bank customers are leaving: little by little, slipping away every month. The bankers found that it is cheaper to retain existing customers than to attract new ones.
* We need to predict whether a customer will leave the bank soon. You have data on customers' past behavior and on contract terminations with the bank.
* Build a model with the highest possible F1 score. To pass review, you need an F1 score of at least 0.59 on the test set.
* In addition, measure the AUC-ROC metric and compare it with the F1 score.
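
To make the two metrics named above concrete, here is a minimal sketch on hand-made labels and scores (hypothetical values, not taken from the Beta Bank data). F1 is computed from hard predictions at an assumed 0.5 cutoff, while AUC-ROC is computed from the scores themselves.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Toy ground truth and model scores (hypothetical, for illustration only)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.45, 0.7, 0.2]

# F1 needs class labels, so the scores are cut at 0.5
y_pred = [int(s >= 0.5) for s in y_score]
f1 = f1_score(y_true, y_pred)

# AUC-ROC uses the raw scores and does not depend on any cutoff
auc = roc_auc_score(y_true, y_score)

print('F1:', f1)
print('AUC-ROC:', auc)
```

Because AUC-ROC ignores the cutoff, it can be high even when F1 at the default cutoff is modest.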

**Features**
* RowNumber: data row index
* CustomerId: unique customer identifier
* Surname: surname
* CreditScore: credit score
* Geography: country of residence
* Gender: gender
* Age: age
* Tenure: how long the customer has been with the bank
* Balance: account balance
* NumOfProducts: number of banking products used by the customer
* HasCrCard: customer has a credit card (1 = yes; 0 = no)
* IsActiveMember: active customer (1 = yes; 0 = no)
* EstimatedSalary: estimated salary

**Target**

* Exited: the customer has left (1 = yes; 0 = no)

## Initialization & Visualization
* Import the libraries and inspect the data frame

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle


In [2]:
df=pd.read_csv('/datasets/Churn.csv')
df

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


* The Tenure column has missing values (909 of 10,000 rows)

In [4]:
df[df['Tenure'].isna()]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.00,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.00,1,0,0,84509.57,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9944,9945,15703923,Cameron,744,Germany,Male,41,,190409.34,2,1,1,138361.48,0
9956,9957,15707861,Nucci,520,France,Female,46,,85216.61,1,1,0,117369.52,1
9964,9965,15642785,Douglas,479,France,Male,34,,117593.48,2,0,0,113308.29,0
9985,9986,15586914,Nepean,659,France,Male,36,,123841.49,2,1,0,96833.00,0


* The missing values do not appear to be correlated with any other column

In [5]:
print(df['Tenure'].value_counts())
print()
print('Mean:', df['Tenure'].mean())
print()
print('Median:', df['Tenure'].median())

1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: Tenure, dtype: int64

Mean: 4.997690023099769

Median: 5.0


* I will fill the missing values with the median (5) and convert the column to an integer dtype, since it holds no fractional values

In [6]:
df['Tenure'] = df['Tenure'].fillna(5)
df['Tenure'] = df['Tenure'].astype(int)
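
The cell above hard-codes the value 5. A minimal sketch of the same imputation with the median computed from the column, shown on a small hypothetical frame (`toy` and its values are assumptions, not the bank data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Tenure column, with two missing entries
toy = pd.DataFrame({'Tenure': [2.0, np.nan, 8.0, 1.0, np.nan, 5.0, 4.0]})

# Compute the median from the observed values instead of typing it in
median_tenure = toy['Tenure'].median()

# Fill the gaps, then cast to int now that no NaN remains
toy['Tenure'] = toy['Tenure'].fillna(median_tenure).astype(int)

print(toy['Tenure'].tolist())
```

In a stricter setup the median would be computed on the training split only, so that no information from the validation or test rows leaks into the fill value.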

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


* Everything looks correct, so I will proceed with the project

## One Hot Encoding

In [8]:
df = df.drop(['Surname'], axis=1)  # The surname does not affect the outcome, so it is dropped here to simplify the code
df = pd.get_dummies(df)
df


Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,1,15634602,619,42,2,0.00,1,1,1,101348.88,1,1,0,0,1,0
1,2,15647311,608,41,1,83807.86,1,0,1,112542.58,0,0,0,1,1,0
2,3,15619304,502,42,8,159660.80,3,1,0,113931.57,1,1,0,0,1,0
3,4,15701354,699,39,1,0.00,2,0,0,93826.63,0,1,0,0,1,0
4,5,15737888,850,43,2,125510.82,1,1,1,79084.10,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,771,39,5,0.00,2,1,0,96270.64,0,1,0,0,0,1
9996,9997,15569892,516,35,10,57369.61,1,1,1,101699.77,0,1,0,0,0,1
9997,9998,15584532,709,36,7,0.00,1,0,1,42085.58,1,1,0,0,1,0
9998,9999,15682355,772,42,3,75075.31,2,1,0,92888.52,1,0,1,0,0,1
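
The encoding above keeps one column per category, so `Gender_Female` and `Gender_Male` always sum to 1. A possible variant, not used in this notebook, is `drop_first=True`, which drops one redundant column per categorical feature; this is often preferred for linear models such as logistic regression. A sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical rows with the same categorical columns as the bank data
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Age': [42, 41, 39, 35],
})

# drop_first=True removes the first category of each encoded column
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
```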


## Data Split

In [9]:
df_train, df_meio = train_test_split(df, test_size=0.4, random_state=12345)
# df_train holds 60% of the data and df_meio holds 40%
df_valid, df_test = train_test_split(df_meio, test_size=0.5, random_state=12345)
# df_meio is split in half, so df_valid and df_test each hold 20% of the original df
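
The split above is purely random, so the share of `Exited == 1` can differ slightly between the three sets. A hedged alternative, not what this notebook does, is to pass `stratify` so each set keeps the same class share. The sketch below uses a hypothetical 100-row frame with 20% positives:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 positives and 80 negatives
toy = pd.DataFrame({'x': range(100), 'Exited': [1] * 20 + [0] * 80})

# 60% train, 40% held out, preserving the class share
train, rest = train_test_split(
    toy, test_size=0.4, random_state=12345, stratify=toy['Exited']
)
# Held-out part split in half: 20% validation, 20% test
valid, test = train_test_split(
    rest, test_size=0.5, random_state=12345, stratify=rest['Exited']
)

for name, part in [('train', train), ('valid', valid), ('test', test)]:
    print(name, len(part), part['Exited'].mean())
```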



In [10]:
df_merged = pd.concat([df_train, df_valid], axis=0)
df_merged

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
7479,7480,15671987,567,35,8,153137.74,1,1,0,88659.07,0,0,0,1,0,1
3411,3412,15815628,711,37,8,113899.92,1,0,0,80215.20,0,1,0,0,1,0
6027,6028,15799494,850,44,3,140393.65,2,0,1,186285.52,0,0,1,0,0,1
1247,1248,15711288,512,24,6,0.00,2,1,0,37654.31,0,1,0,0,0,1
3716,3717,15699492,665,27,2,147435.96,1,0,0,187508.06,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6480,6481,15693704,679,24,6,114948.76,2,0,1,135768.25,0,1,0,0,1,0
8194,8195,15620746,632,42,4,126115.60,1,1,0,100998.50,0,1,0,0,0,1
1279,1280,15634896,521,39,6,0.00,2,0,1,27375.15,0,1,0,0,1,0
6532,6533,15732299,756,67,4,0.00,3,1,1,93081.87,0,1,0,0,0,1


## Creating the Features and Target

In [11]:
features_train = df_train.drop(['Exited','RowNumber','CustomerId'], axis=1)
target_train = df_train['Exited']

features_test = df_test.drop(['Exited','RowNumber','CustomerId'], axis=1)
target_test = df_test['Exited'] 

features_valid = df_valid.drop(['Exited','RowNumber','CustomerId'], axis=1)
target_valid = df_valid['Exited'] 

features_train_valid = df_merged.drop(['Exited','RowNumber','CustomerId'], axis=1)
target_train_valid = df_merged['Exited'] 
# created for fitting the model when it is evaluated on the test set
print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_train_valid.shape)
print(target_train_valid.shape)

(6000, 13)
(6000,)
(2000, 13)
(2000,)
(2000, 13)
(2000,)
(8000, 13)
(8000,)


## Scaling

In [12]:
numeric=['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
scaler = StandardScaler()
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])
features_test.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
7041,-2.226392,-0.088482,-1.10169,-1.233163,0.830152,1,0,0.647083,1,0,0,0,1
5709,-0.08712,0.006422,1.446272,-1.233163,-0.89156,1,0,-1.65841,1,0,0,1,0
7117,-0.917905,-0.752805,-0.009707,0.722307,-0.89156,1,1,-1.369334,0,0,1,0,1
7775,-0.253277,0.101325,1.810266,-1.233163,0.830152,1,0,0.075086,0,0,1,0,1
8735,0.785204,-0.847708,1.810266,0.615625,-0.89156,0,1,-1.070919,1,0,0,0,1
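
Note that the cell above transforms `features_train`, `features_valid` and `features_test`, but not `features_train_valid`, so models fitted on the merged set see unscaled values while the test set is scaled. A minimal sketch of applying one fitted scaler to every set, on small hypothetical frames (the `scale` helper and the toy values are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric = ['Age', 'Balance']

# Hypothetical train and validation frames
train = pd.DataFrame({'Age': [30.0, 40.0, 50.0],
                      'Balance': [0.0, 100.0, 200.0],
                      'HasCrCard': [1, 0, 1]})
valid = pd.DataFrame({'Age': [40.0, 60.0],
                      'Balance': [100.0, 300.0],
                      'HasCrCard': [0, 1]})
train_valid = pd.concat([train, valid], axis=0, ignore_index=True)

# Fit on the training rows only, then reuse the same scaler everywhere
scaler = StandardScaler().fit(train[numeric])


def scale(frame):
    """Return a copy of frame with the numeric columns transformed."""
    out = frame.copy()
    out[numeric] = scaler.transform(out[numeric])
    return out


train_s = scale(train)
valid_s = scale(valid)
train_valid_s = scale(train_valid)

print(train_valid_s)
```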


## Upsampling and Downsampling

In [13]:
features_zeros = features_train[target_train == 0]
features_ones = features_train[target_train == 1]
target_zeros = target_train[target_train==0]
target_ones = target_train[target_train==1]



print(features_zeros.shape)
print(features_ones.shape)
print(target_zeros.shape)
print(target_ones.shape)

(4804, 13)
(1196, 13)
(4804,)
(1196,)


In [14]:
repeat = 10
features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

print(features_upsampled.shape)
print(target_upsampled.shape)

(16764, 13)
(16764,)


In [15]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    arg1 = pd.concat([features_zeros] + [features_ones] * repeat)
    arg2 = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        arg1, arg2, random_state=12345
    )

    return features_upsampled, target_upsampled, arg1, arg2


features_upsampled, target_upsampled, arg_1, arg_2 = upsample(
    features_train, target_train, 10
)

print(features_upsampled.shape)
print(target_upsampled.shape)

(16764, 13)
(16764,)


In [16]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones]) 
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones]) 
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345
    )
    return features_downsampled, target_downsampled


features_downsampled, target_downsampled = downsample(
    features_train, target_train, 0.1
)

print(features_downsampled.shape)
print(target_downsampled.shape)


(1676, 13)
(1676,)


In [17]:
print(target_train.value_counts(normalize=True))
print(target_test.value_counts(normalize=True))
print(target_valid.value_counts(normalize=True))
print(target_downsampled.value_counts(normalize=True))
target_upsampled.value_counts(normalize=True)

0    0.800667
1    0.199333
Name: Exited, dtype: float64
0    0.7885
1    0.2115
Name: Exited, dtype: float64
0    0.791
1    0.209
Name: Exited, dtype: float64
1    0.713604
0    0.286396
Name: Exited, dtype: float64


1    0.713434
0    0.286566
Name: Exited, dtype: float64
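
With `repeat=10` and `fraction=0.1`, both resampled sets end up with about 71% positives, which overshoots an even balance. A sketch of deriving the repeat factor from the class counts instead (the `balancing_repeat` and `upsample_balanced` helpers are hypothetical, and the toy data is an assumption):

```python
import pandas as pd
from sklearn.utils import shuffle


def balancing_repeat(n_zeros, n_ones):
    """Integer number of copies of class 1 that brings it closest to class 0."""
    return max(1, int(round(float(n_zeros) / float(n_ones))))


def upsample_balanced(features, target, random_state=12345):
    """Repeat the class-1 rows so both classes have roughly equal size."""
    repeat = balancing_repeat((target == 0).sum(), (target == 1).sum())
    features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
    target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)
    features_up, target_up = shuffle(features_up, target_up, random_state=random_state)
    return features_up, target_up


# Class counts printed for the training set in this notebook
print(balancing_repeat(4804, 1196))

# Hypothetical data: 80 negatives and 20 positives
toy_features = pd.DataFrame({'x': range(100)})
toy_target = pd.Series([0] * 80 + [1] * 20)
features_bal, target_bal = upsample_balanced(toy_features, toy_target)
print(target_bal.value_counts(normalize=True))
```

For the training counts above (4804 vs. 1196) this gives a factor of 4, and the matching downsampling fraction would be about 0.25.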

## DecisionTreeClassifier

### Accuracy Comparison & Hyperparameter Selection

In [18]:
for depth in range(1, 11):
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        model.fit(features_train,target_train)
        
        train_predictions = model.predict(features_train)
        predictions_valid = model.predict(features_valid)
        
        modelb = DecisionTreeClassifier(random_state=12345, max_depth=depth, class_weight='balanced')
        modelb.fit(features_train,target_train)
        
        train_predictionsb = modelb.predict(features_train)
        predictions_validb = modelb.predict(features_valid)
        
        print( 'depth', depth, ": ")
        print('train:',accuracy_score(target_train, train_predictions))
        print('valid:',accuracy_score(target_valid, predictions_valid))
        print('trainB:',accuracy_score(target_train, train_predictionsb))
        print('validB:',accuracy_score(target_valid, predictions_validb))
        print()

depth 1 : 
train: 0.8006666666666666
valid: 0.791
trainB: 0.7483333333333333
validB: 0.7545

depth 2 : 
train: 0.8353333333333334
valid: 0.824
trainB: 0.7556666666666667
validB: 0.765

depth 3 : 
train: 0.8463333333333334
valid: 0.838
trainB: 0.7556666666666667
validB: 0.765

depth 4 : 
train: 0.856
valid: 0.852
trainB: 0.7066666666666667
validB: 0.711

depth 5 : 
train: 0.8645
valid: 0.853
trainB: 0.8071666666666667
validB: 0.8105

depth 6 : 
train: 0.8723333333333333
valid: 0.858
trainB: 0.7785
validB: 0.7665

depth 7 : 
train: 0.8813333333333333
valid: 0.85
trainB: 0.8168333333333333
validB: 0.7735

depth 8 : 
train: 0.8881666666666667
valid: 0.8475
trainB: 0.8265
validB: 0.7615

depth 9 : 
train: 0.899
valid: 0.847
trainB: 0.8485
validB: 0.769

depth 10 : 
train: 0.9126666666666666
valid: 0.8385
trainB: 0.8496666666666667
validB: 0.747



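* In terms of accuracy, the class_weight='balanced' hyperparameter only made the results worse.
* The chosen depth was 6, which gives the highest validation accuracy (0.858).
* Since the training and validation accuracies are close at this depth (0.872 vs. 0.858), there is no sign of *overfitting*.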
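
Accuracy should be read with the class imbalance in mind. As a sanity check, the sketch below scores a constant model that always predicts 0 on labels rebuilt from the validation class shares printed earlier (2000 rows, 79.1% zeros); the label array is a reconstruction, not the real validation target.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 2000 validation rows with 79.1% zeros -> 1582 zeros and 418 ones
y_valid = np.array([0] * 1582 + [1] * 418)

# Constant model: never predicts that a customer leaves
constant_pred = np.zeros_like(y_valid)

baseline_acc = accuracy_score(y_valid, constant_pred)
baseline_f1 = f1_score(y_valid, constant_pred, zero_division=0)

print('baseline accuracy:', baseline_acc)
print('baseline F1:', baseline_f1)
```

The baseline accuracy equals the depth-1 validation accuracy in the table above (0.791), while its F1 is 0, which is why F1 is the more informative metric for this task.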

### Testing the Best Hyperparameters

In [19]:
model = DecisionTreeClassifier(random_state=12345, max_depth=6)
model.fit(features_train_valid,target_train_valid)
        
predictions_test = model.predict(features_test)
        
        
        

print(accuracy_score(target_test, predictions_test))
        


0.7945


### F1_score

In [20]:
model = DecisionTreeClassifier(random_state=12345, max_depth=6)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

print(f1_score(target_valid,predicted_valid))

0.5683890577507599


In [21]:
modelb = DecisionTreeClassifier(random_state=12345, max_depth=6, class_weight='balanced')
modelb.fit(features_train, target_train)
predicted_valid = modelb.predict(features_valid)

print(f1_score(target_valid,predicted_valid))

0.5581835383159887


* Still not enough

In [22]:
model1b = DecisionTreeClassifier(random_state=12345, max_depth=6, class_weight='balanced')
model1b.fit(features_upsampled, target_upsampled)
predicted_valid = model1b.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.5609065155807366


In [23]:
model1 = DecisionTreeClassifier(random_state=12345, max_depth=6)
model1.fit(features_upsampled, target_upsampled)
predicted_valid = model1.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.4647619047619047


In [24]:
model2b = DecisionTreeClassifier(random_state=12345, max_depth=6, class_weight='balanced')
model2b.fit(features_downsampled, target_downsampled)
predicted_valid = model2b.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.5502439024390243


In [25]:
model2 = DecisionTreeClassifier(random_state=12345, max_depth=6)
model2.fit(features_downsampled, target_downsampled)
predicted_valid = model2.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.48


**Conclusion**

* The F1 score did not reach 0.59 in any configuration; the best value was 0.568, without class weighting or resampling
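
One option not tried in this notebook is moving the decision threshold instead of resampling. The sketch below is an illustration on synthetic data from `make_classification` (an assumption, not the bank data): it scans a grid of thresholds over `predict_proba` and keeps the one with the best validation F1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 80% / 20%), for illustration only
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

model = DecisionTreeClassifier(random_state=12345, max_depth=6)
model.fit(X_train, y_train)
proba_one = model.predict_proba(X_valid)[:, 1]

# Candidate thresholds 0.10, 0.15, ..., 0.90 (rounded so 0.5 is exact)
thresholds = np.round(np.arange(0.10, 0.91, 0.05), 2)
scores = [
    f1_score(y_valid, (proba_one >= t).astype(int), zero_division=0)
    for t in thresholds
]

best_idx = int(np.argmax(scores))
best_threshold, best_f1 = thresholds[best_idx], scores[best_idx]
default_f1 = f1_score(y_valid, (proba_one >= 0.5).astype(int), zero_division=0)

print('F1 at 0.5:', default_f1)
print('best threshold:', best_threshold, 'F1:', best_f1)
```

The threshold should be chosen on the validation set only and then applied unchanged to the test set.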

### AUC-ROC

In [26]:
model = DecisionTreeClassifier(random_state=12345, max_depth=5)
model.fit(features_train, target_train)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]


auc_roc=roc_auc_score(target_valid, probabilities_one_valid)
print(auc_roc)

0.8227003550711051


In [27]:
model = DecisionTreeClassifier(random_state=12345, max_depth=5, class_weight='balanced')
model.fit(features_train, target_train)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]


auc_roc=roc_auc_score(target_valid, probabilities_one_valid)
print(auc_roc)

0.8310244134068074


**Conclusion**

* The result is better than a random model, whose AUC-ROC would be 0.5
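
To illustrate the 0.5 reference point, the sketch below compares scores that carry no information with scores that do, on synthetic labels (all values here are generated for illustration and are unrelated to the bank data).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(12345)

# Synthetic binary labels
y_true = rng.integers(0, 2, size=5000)

# Scores unrelated to the labels: a "random model"
random_scores = rng.random(5000)

# Scores shifted upward for class 1: an informative model
informative_scores = y_true * 0.5 + rng.random(5000)

auc_random = roc_auc_score(y_true, random_scores)
auc_informative = roc_auc_score(y_true, informative_scores)

print('random:', auc_random)
print('informative:', auc_informative)
```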

### Model Conclusion

* The model reached its best accuracy without class_weight='balanced' and with max_depth=6, peaking at 85.8% on the validation set.
* Accuracy on the test set was 79.45%.
* The F1 score did not reach the minimum value in any configuration.
* AUC-ROC performed well, clearly better than a random model, and was higher than the F1 score (0.82 vs. 0.57). Note that AUC-ROC was measured with max_depth=5 rather than the max_depth=6 used for the other metrics.

**Final Conclusion: Model Rejected**

**Reason: F1 score**

## RandomForestClassifier

### Accuracy Comparison & Hyperparameter Selection

In [28]:
for depth in range(1, 11):
        model = RandomForestClassifier(random_state=12345, max_depth=depth)
        model.fit(features_train,target_train)
        
        train_predictions = model.predict(features_train)
        predictions_valid = model.predict(features_valid)
        
        modelb = RandomForestClassifier(random_state=12345, max_depth=depth, class_weight='balanced')
        modelb.fit(features_train,target_train)
        
        train_predictionsb = modelb.predict(features_train)
        predictions_validb = modelb.predict(features_valid)
        
        print( 'depth', depth, ": ")
        print('train:',accuracy_score(target_train, train_predictions))
        print('valid:',accuracy_score(target_valid, predictions_valid))
        print('trainB:',accuracy_score(target_train, train_predictionsb))
        print('validB:',accuracy_score(target_valid, predictions_validb))
        print()

depth 1 : 
train: 0.8006666666666666
valid: 0.791
trainB: 0.746
validB: 0.7545

depth 2 : 
train: 0.8105
valid: 0.8005
trainB: 0.7655
validB: 0.763

depth 3 : 
train: 0.8306666666666667
valid: 0.8175
trainB: 0.7795
validB: 0.7665

depth 4 : 
train: 0.8466666666666667
valid: 0.83
trainB: 0.803
validB: 0.786

depth 5 : 
train: 0.8526666666666667
valid: 0.833
trainB: 0.8195
validB: 0.8015

depth 6 : 
train: 0.8678333333333333
valid: 0.851
trainB: 0.8321666666666667
validB: 0.8075

depth 7 : 
train: 0.8776666666666667
valid: 0.858
trainB: 0.8528333333333333
validB: 0.8155

depth 8 : 
train: 0.8875
valid: 0.8615
trainB: 0.8733333333333333
validB: 0.8205

depth 9 : 
train: 0.8963333333333333
valid: 0.861
trainB: 0.898
validB: 0.834

depth 10 : 
train: 0.9076666666666666
valid: 0.86
trainB: 0.9208333333333333
validB: 0.8385



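* In terms of accuracy, the class_weight='balanced' hyperparameter only made the results worse.
* The chosen depth was 8, which gives the highest validation accuracy (0.8615).
* Since the training and validation accuracies are close at this depth (0.8875 vs. 0.8615), there is no sign of *overfitting*.

### Testing the Best Hyperparameters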



In [29]:
model = RandomForestClassifier(random_state=12345, max_depth=8)
model.fit(features_train_valid,target_train_valid)
        
predictions_test = model.predict(features_test)
        
        
        
print(accuracy_score(target_test, predictions_test))
        

0.7005


### F1_score

In [30]:
model2 = RandomForestClassifier(random_state=12345, max_depth=8)
model2.fit(features_train, target_train)
predicted_valid = model2.predict(features_valid)

print(f1_score(target_valid,predicted_valid))

0.5610142630744849


In [31]:
model2b = RandomForestClassifier(random_state=12345, max_depth=8, class_weight='balanced')
model2b.fit(features_train, target_train)
predicted_valid = model2b.predict(features_valid)
    
print(f1_score(target_valid,predicted_valid))

0.6110509209100757


* The second F1 result is sufficient, but I will run more tests to try to increase it.

In [32]:
model1_2b = RandomForestClassifier(random_state=12345, max_depth=8, class_weight='balanced')
model1_2b.fit(features_upsampled, target_upsampled)
predicted_valid = model1_2b.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.6164801627670397


In [33]:
model1_2 = RandomForestClassifier(random_state=12345, max_depth=8)
model1_2.fit(features_upsampled, target_upsampled)
predicted_valid = model1_2.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.505890505890506


In [34]:
model2_2b = RandomForestClassifier(random_state=12345, max_depth=8, class_weight='balanced')
model2_2b.fit(features_downsampled, target_downsampled)
predicted_valid = model2_2b.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.5349019607843137


In [35]:
model2_2 = RandomForestClassifier(random_state=12345, max_depth=8)
model2_2.fit(features_downsampled, target_downsampled)
predicted_valid = model2_2.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.4555288461538462


**Conclusion**

* The F1 score exceeded the required minimum, reaching 0.611 with class_weight='balanced' and no upsampling or downsampling. Upsampling combined with class_weight='balanced' scored marginally higher (0.616).

### AUC-ROC


In [36]:
model = RandomForestClassifier(random_state=12345, max_depth=8, class_weight='balanced')
model.fit(features_train, target_train)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]


auc_roc=roc_auc_score(target_valid, probabilities_one_valid)
print(auc_roc)

0.8479742195391939


In [37]:
model = RandomForestClassifier(random_state=12345, max_depth=8)
model.fit(features_train, target_train)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]


auc_roc=roc_auc_score(target_valid, probabilities_one_valid)
print(auc_roc)

0.8486804299566293


**Conclusion**

* The model is much better than a random model

### Model Conclusion

* The model reached its best accuracy without class_weight='balanced' and with max_depth=8, peaking at 86.15% on the validation set.
* Accuracy on the test set was 70.05%.
* The F1 score exceeded the minimum of 0.59. Without resampling, its best version reached 0.61 with class_weight='balanced' and max_depth=8.
* AUC-ROC performed well, better than the previous model and much better than a random model. It was higher than the F1 score (0.85 vs. 0.61).

**Final Conclusion: Model Approved**

**Model: RandomForestClassifier**

**Validation accuracy: 86.15%**

**Test accuracy: 70.05%**

**F1 score: 61.1%**

**AUC-ROC: 84.80%**

## LogisticRegression

### Accuracy Comparison & Hyperparameter Selection

In [38]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train,target_train) 
predictions = model.predict(features_train) 
predictions_valid = model.predict(features_valid)
print(accuracy_score(target_train, predictions))
print(accuracy_score(target_valid, predictions_valid))

0.8185
0.802


In [39]:
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model.fit(features_train,target_train) 
predictions = model.predict(features_train) 
predictions_valid = model.predict(features_valid)
print(accuracy_score(target_train, predictions))
print(accuracy_score(target_valid, predictions_valid))

0.7176666666666667
0.702


* Training and validation accuracies are practically the same, so there is no sign of *overfitting*
* In terms of accuracy, the best model is the one without class_weight='balanced'


### Testing the Best Hyperparameters

In [40]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train_valid,target_train_valid) 
predictions = model.predict(features_test) 

print(accuracy_score(target_test, predictions))

0.675


### F1_score

In [41]:
model3 = LogisticRegression(random_state=12345, solver='liblinear')
model3.fit(features_train, target_train)
predicted_valid = model3.predict(features_valid)

print(f1_score(target_valid,predicted_valid))

0.33108108108108103


In [42]:
model3b = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model3b.fit(features_train, target_train)
predicted_valid = model3b.predict(features_valid)

print(f1_score(target_valid,predicted_valid))

0.4888507718696398


* Not enough; I will run further tests to try to increase it

In [43]:
model3_1b = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model3_1b.fit(features_upsampled, target_upsampled)
predicted_valid = model3_1b.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.4888507718696398


In [44]:
model3_1 = LogisticRegression(random_state=12345, solver='liblinear')
model3_1.fit(features_upsampled, target_upsampled)
predicted_valid = model3_1.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.41965764770844843


In [45]:
model3_2b = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model3_2b.fit(features_downsampled, target_downsampled)
predicted_valid = model3_2b.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.4800678541136556


In [46]:
model3_2 = LogisticRegression(random_state=12345, solver='liblinear')
model3_2.fit(features_downsampled, target_downsampled)
predicted_valid = model3_2.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.43010752688172044


**Conclusion**

* The F1 score did not exceed the required minimum; the best value was 0.489, with class_weight='balanced'

### AUC-ROC

In [47]:
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model.fit(features_train, target_train)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]


auc_roc=roc_auc_score(target_valid, probabilities_one_valid)
print(auc_roc)

0.7636629788469564


In [48]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)

probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]


auc_roc=roc_auc_score(target_valid, probabilities_one_valid)
print(auc_roc)

0.7586922858231662


**Conclusion**

* The model is better than a random model with or without the class_weight='balanced' hyperparameter

### Model Conclusion

* The model reached its best accuracy without class_weight='balanced', peaking at 80.2% on the validation set.
* Accuracy on the test set was 67.5%.
* The F1 score did not exceed the minimum of 0.59.
* AUC-ROC performed well with or without class_weight='balanced'. The AUC-ROC value was higher than the F1 score.

**Final Conclusion: Model Rejected**

**Reason: F1 score**


## Overall Conclusion

* Since the models are judged directly on the metric of interest, F1, with a minimum of 0.59, the first model (DecisionTreeClassifier) and the third (LogisticRegression) were rejected.

* The best model was therefore RandomForestClassifier.

* I will run one more test to find the best RandomForestClassifier hyperparameters based on the F1 score alone.

* The hyperparameters chosen without taking the F1 score into account were: RandomForestClassifier(random_state=12345, max_depth=8)


In [49]:
for depth in range(1, 11):
    model2b = RandomForestClassifier(random_state=12345, max_depth=depth, class_weight='balanced')
    model2b.fit(features_train, target_train)
    predicted_valid = model2b.predict(features_valid)
    
    print( 'depth', depth, ": ")
    print(f1_score(target_valid,predicted_valid))

depth 1 : 
0.5415499533146592
depth 2 : 
0.5627306273062731
depth 3 : 
0.5663881151346332
depth 4 : 
0.5884615384615385
depth 5 : 
0.6049751243781095
depth 6 : 
0.6099290780141844
depth 7 : 
0.6144200626959248
depth 8 : 
0.6110509209100757
depth 9 : 
0.6244343891402715
depth 10 : 
0.6159334126040428
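
Rather than reading the best depth off the printout, the scores can be kept in a dictionary and the maximum picked programmatically. The sketch below reuses the F1 values printed above; the `f1_by_depth` name is an assumption.

```python
# Validation F1 per max_depth, copied from the output above
f1_by_depth = {
    1: 0.5415499533146592,
    2: 0.5627306273062731,
    3: 0.5663881151346332,
    4: 0.5884615384615385,
    5: 0.6049751243781095,
    6: 0.6099290780141844,
    7: 0.6144200626959248,
    8: 0.6110509209100757,
    9: 0.6244343891402715,
    10: 0.6159334126040428,
}

# Depth with the highest validation F1
best_depth = max(f1_by_depth, key=f1_by_depth.get)

print('best depth:', best_depth, 'F1:', f1_by_depth[best_depth])
```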


### Final Overall Conclusion

* The best model is: RandomForestClassifier(random_state=12345, max_depth=9, class_weight='balanced'), with a validation F1 score of 0.624
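
The task requires the F1 threshold to be met on the test set, and that final check is not shown in this notebook. The sketch below outlines how it could be done; it runs on synthetic data from `make_classification`, and the `evaluate_on_test` helper is hypothetical, so the numbers it prints say nothing about the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_on_test(model, X_fit, y_fit, X_test, y_test):
    """Fit the model, then report F1 and AUC-ROC on held-out data."""
    model.fit(X_fit, y_fit)
    predicted = model.predict(X_test)
    proba_one = model.predict_proba(X_test)[:, 1]
    return {
        'f1': f1_score(y_test, predicted),
        'auc_roc': roc_auc_score(y_test, proba_one),
    }


# Synthetic imbalanced data standing in for train+valid and test
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=12345
)
X_fit, X_test, y_fit, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y
)

final_model = RandomForestClassifier(
    random_state=12345, max_depth=9, class_weight='balanced'
)
report = evaluate_on_test(final_model, X_fit, y_fit, X_test, y_test)
print(report)
```

With the real data, `X_fit` and `X_test` would be the scaled training (or training plus validation) features and the scaled test features.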

