# Alura Challenge - Week 03

## Objectives of this work

•	Check whether the target variable is balanced;

•	Apply encoding to the data;

•	Build two or more Machine Learning models;

•	Evaluate each model using ML metrics;

•	Choose the best model;

•	Optimize the best model;

•	Determine which type of balancing works best for this data.

## Checking the balance of the target variable Churn

In [None]:
proporcao_churn

0    73.46
1    26.54
Name: Churn, dtype: float64

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>As observed earlier, the target variable Churn splits into approximately 73.5% 'No' (0) and 26.5% 'Yes' (1) in our dataset. The classes are therefore imbalanced, which can bias our predictive model toward the majority class. To mitigate this effect, we will apply data rebalancing techniques.</p>
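<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>The cell that produced proporcao_churn is not shown in this section. As a minimal sketch, assuming the proportions come from a pandas Series of 0/1 labels, they could be computed as below. The toy Series and the name proporcao_exemplo are hypothetical and stand in for the real Churn column.</p>

```python
import pandas as pd

# Hypothetical stand-in for the Churn column: 0 = 'No', 1 = 'Yes'.
churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Churn")

# Share of each class, in percent, rounded to two decimal places.
proporcao_exemplo = (churn.value_counts(normalize=True) * 100).round(2)
print(proporcao_exemplo)
```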

## Handling the imbalance of the target variable Churn

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>Two well-established techniques can be used to handle the imbalance of our target variable: undersampling (reducing the number of 'No' records) and oversampling (increasing the number of 'Yes' records). Each technique has its own advantages and drawbacks. In this work we evaluate both and compare the final results. After balancing, we normalize the numerical features, since they are on different scales from one another and from the remaining features.</p>
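<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>To make the two ideas concrete, the sketch below balances a toy label vector by random undersampling and by random oversampling, using NumPy only. This is a simplified stand-in: the notebook itself uses NearMiss and SMOTE, which choose or create rows more carefully. All names (y_toy, idx_under, idx_over) are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(53)

# Hypothetical imbalanced labels: 6 'No' (0) and 2 'Yes' (1).
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
idx_nao = np.flatnonzero(y_toy == 0)
idx_sim = np.flatnonzero(y_toy == 1)

# Undersampling: keep only as many 'No' rows as there are 'Yes' rows.
idx_under = np.concatenate(
    [rng.choice(idx_nao, size=len(idx_sim), replace=False), idx_sim])

# Oversampling: draw 'Yes' rows with replacement until both classes match.
idx_over = np.concatenate(
    [idx_nao, rng.choice(idx_sim, size=len(idx_nao), replace=True)])

print(np.bincount(y_toy[idx_under]))  # class counts after undersampling
print(np.bincount(y_toy[idx_over]))   # class counts after oversampling
```

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>Undersampling discards information from the majority class, while oversampling by repetition only duplicates existing minority rows. That trade-off is the reason both approaches are compared in this work.</p>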

## Defining the model variables

In [None]:
y = df_churn['Churn']
X = df_churn[['Idoso', 'Dependentes', 'Meses_Contrato', 'Internet', 'Fatura_Online', 'Gasto_Mensal', 'Gasto_Total', 'Parceiro',
                      'Mensal', 'Bianual', 'Cartao_credito', 'Boleto_eletronico', 'Boleto_correios', 'Transf_banco', 'Anual']]

In [None]:
features_numericas = ['Meses_Contrato', 'Gasto_Total', 'Gasto_Mensal']

## Building Machine Learning models to predict Churn

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>We are now ready to build our predictive models from the previously selected features and the balanced data. We chose five widely used classification models: Logistic Regression, Random Forest Classifier, Decision Tree Classifier, Gradient Boosting Classifier and Support Vector Classifier. First, however, we start with the baseline model DummyClassifier. Note: the originally categorical variables of our dataset were already encoded numerically as 0 and 1, so no further encoding of the dataframe is needed. We use the cross_validate function to cross-validate the models and reduce the effect of randomness, together with a Pipeline that chains balancing, normalization and model training. This is also good practice for reducing so-called 'data leakage' and the resulting bias in the models, because every step is fitted on the training folds only.</p>
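<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>The sketch below illustrates why a Pipeline inside cross_validate limits data leakage: the scaler is re-fitted on the training part of each fold and never sees the test part. It uses synthetic data from make_classification rather than the churn dataset, and it omits the resampling step, which in the notebook requires the Pipeline class from imbalanced-learn. All names ending in _toy are hypothetical.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the churn data (not the real dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   weights=[0.73, 0.27], random_state=53)

# The scaler lives inside the pipeline, so it is fitted on each training fold only.
pipeline_toy = Pipeline([('normalizacao', MinMaxScaler()),
                         ('modelo', LogisticRegression(max_iter=200, random_state=53))])

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=53)
metricas = ['accuracy', 'precision', 'recall', 'f1']
resultado_toy = cross_validate(pipeline_toy, X_toy, y_toy, cv=cv_toy, scoring=metricas)

# One score per fold and per metric; report the mean of each.
for metrica in metricas:
    print(metrica, round(resultado_toy[f'test_{metrica}'].mean(), 3))
```

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>Passing a list of metric names to scoring is one alternative to collecting predictions in global lists: cross_validate then returns one test_ entry per metric.</p>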

In [None]:
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# min_samples_split must be at least 2 when given as an integer
modelos = {'Logistic Regression': LogisticRegression(random_state = 53, max_iter = 200, solver = 'lbfgs'),
           'Random Forest Classifier': RandomForestClassifier(random_state = 53, max_depth = 15, n_estimators = 100),
           'Decision Tree Classifier': DecisionTreeClassifier(random_state = 53, max_depth = 6, criterion = 'gini'),
           'Gradient Boosting Classifier': GradientBoostingClassifier(n_estimators = 100, max_depth = 3, min_samples_split = 2,
                                                              learning_rate = 0.1, random_state = 53),
           'Support Vector Classifier': svm.SVC(kernel = 'rbf', random_state = 53)}

### DummyClassifier

In [None]:
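from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, make_scorer
from sklearn.model_selection import cross_validate

X_dummy = X.copy()
y_dummy = y.copy()

# Accumulate the true and predicted labels of every test fold
original = []
predito = []

def classification_report_accuracy_score(y_real, y_pred):
    # Scorer with a side effect: stores the fold's labels, then returns its accuracy
    original.extend(y_real)
    predito.extend(y_pred)
    return accuracy_score(y_real, y_pred)

# Baseline: predicts at random, following the class proportions of the training data
modelo_dummy = DummyClassifier(strategy = 'stratified', random_state = 53)

resultado_dummy = cross_validate(modelo_dummy, X_dummy, y_dummy, cv = 10, return_train_score = False,
                  scoring = make_scorer(classification_report_accuracy_score))
report_dummy = (classification_report(original, predito))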

### Undersampling

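<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>For undersampling we use the NearMiss method, which selects majority-class samples by their average distance to minority-class samples. With version = 2, as used below, it keeps the majority samples whose average distance to the farthest minority samples is smallest.</p>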
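<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>As a rough illustration of the selection rule behind NearMiss version 2, the NumPy sketch below keeps the majority rows whose average distance to the k farthest minority rows is smallest. It is a simplified reading of the rule, not the imbalanced-learn implementation; the function near_miss_2_sketch and the toy arrays are hypothetical.</p>

```python
import numpy as np

def near_miss_2_sketch(X_maj, X_min, k=3):
    """Return indices of majority rows kept by a simplified NearMiss-2 rule."""
    # Pairwise Euclidean distances: rows = majority samples, columns = minority samples.
    dist = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Average distance from each majority sample to its k farthest minority samples.
    media_distantes = np.sort(dist, axis=1)[:, -k:].mean(axis=1)
    # Keep as many majority rows as there are minority rows, smallest averages first.
    return np.argsort(media_distantes)[:len(X_min)]

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_maj = np.array([[0.5, 0.5], [5.0, 5.0], [0.2, 0.1],
                  [9.0, 9.0], [1.0, 1.0], [4.0, 0.0]])

mantidos = near_miss_2_sketch(X_maj, X_min, k=2)
print(sorted(mantidos.tolist()))  # [0, 2, 4]: the three rows closest to the minority cloud
```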

In [None]:
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import NearMiss
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

X_undersampling = X.copy()
y_undersampling = y.copy()

original = []
predito = []

def classification_report_accuracy_score(y_real, y_pred):
    original.extend(y_real)
    predito.extend(y_pred)
    return accuracy_score(y_real, y_pred)

resultado_undersampling = {}
report_undersampling = {}

for nome, modelo in modelos.items():

    undersampling = NearMiss(version = 2)
    scaler = MinMaxScaler()
    modelo_tipo = modelo

    # imblearn's Pipeline applies the sampler only while fitting, never to the test fold
    pipeline = Pipeline([('balanceamento', undersampling), ('normalizacao', scaler), ('modelo', modelo_tipo)])

    # random_state makes the shuffled folds reproducible
    cv = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 53)
    resultado_undersampling[nome] = cross_validate(pipeline, X_undersampling, y_undersampling, cv = cv,
                                    return_train_score = False, scoring = make_scorer(classification_report_accuracy_score))
    report_undersampling[nome] = (classification_report(original, predito))
    # Reset the accumulators before evaluating the next model
    original = []
    predito = []

### Oversampling

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>For oversampling we use the SMOTE method, which creates synthetic samples for the minority class.</p>
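<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>The core idea of SMOTE is interpolation: a synthetic row is placed on the segment between a minority sample and one of its nearest minority neighbours. The NumPy sketch below shows that step only. It is a simplified illustration, not the imbalanced-learn implementation, and the function smote_sketch and the toy array are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(53)

def smote_sketch(X_min, n_novos, k=2):
    """Create n_novos synthetic rows by interpolating between minority neighbours."""
    sinteticos = []
    for _ in range(n_novos):
        i = rng.integers(len(X_min))                 # pick a minority sample at random
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        vizinhos = np.argsort(dist)[1:k + 1]         # skip position 0, the sample itself
        j = rng.choice(vizinhos)                     # pick one of its k nearest neighbours
        lam = rng.random()                           # interpolation weight in [0, 1)
        sinteticos.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(sinteticos)

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
novos = smote_sketch(X_min, n_novos=5)
print(novos.shape)
```

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>Because every synthetic row is a convex combination of two existing rows, it always falls inside the region already covered by the minority class.</p>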

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

X_oversampling = X.copy()
y_oversampling = y.copy()

original = []
predito = []

def classification_report_accuracy_score(y_real, y_pred):
    original.extend(y_real)
    predito.extend(y_pred)
    return accuracy_score(y_real, y_pred)

resultado_oversampling = {}
report_oversampling = {}

for nome, modelo in modelos.items():

    oversampling = SMOTE(random_state = 53, k_neighbors = 5)
    scaler = MinMaxScaler()
    modelo_tipo = modelo

    # imblearn's Pipeline applies the sampler only while fitting, never to the test fold
    pipeline = Pipeline([('balanceamento', oversampling), ('normalizacao', scaler), ('modelo', modelo_tipo)])

    # random_state makes the shuffled folds reproducible
    cv = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 53)
    resultado_oversampling[nome] = cross_validate(pipeline, X_oversampling, y_oversampling, cv = cv,
                                   return_train_score = False, scoring = make_scorer(classification_report_accuracy_score))
    report_oversampling[nome] = (classification_report(original, predito))
    # Reset the accumulators before evaluating the next model
    original = []
    predito = []

## Evaluating the models

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>With the models instantiated and trained, we can move on to evaluating them. We use the classification_report function from scikit-learn, which provides important metrics such as precision (of all users the model predicted as positive, how many really were positive), recall (of all users who really were positive, how many the model predicted as positive) and f1-score (the harmonic mean of precision and recall). We also compute an approximate 95% interval (mean ± 2 standard deviations) for the accuracy on the test folds. We start with our baseline model, DummyClassifier:</p>
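<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>The three definitions above can be checked by hand on a small example. The labels below are hypothetical and are not taken from the churn dataset; the variable names are illustrative only.</p>

```python
# Hypothetical true labels and predictions (1 = churn, 0 = no churn).
y_real = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pares = list(zip(y_real, y_pred))
tp = sum(r == 1 and p == 1 for r, p in pares)  # true positives
fp = sum(r == 0 and p == 1 for r, p in pares)  # false positives
fn = sum(r == 1 and p == 0 for r, p in pares)  # false negatives

precisao = tp / (tp + fp)    # of the predicted positives, how many are real
revocacao = tp / (tp + fn)   # of the real positives, how many were found
f1 = 2 * precisao * revocacao / (precisao + revocacao)  # harmonic mean

print(precisao, revocacao, f1)
```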

### DummyClassifier

In [None]:
print(report_dummy)

              precision    recall  f1-score   support

           0       0.73      0.78      0.76      5174
           1       0.26      0.22      0.24      1869

    accuracy                           0.63      7043
   macro avg       0.50      0.50      0.50      7043
weighted avg       0.61      0.63      0.62      7043



In [None]:
media = resultado_dummy['test_score'].mean()
desvio_padrao = resultado_dummy['test_score'].std()
print("Intervalo Accuracy teste: [%.2f, %.2f]" % ((media - 2 * desvio_padrao)*100, (media + 2 * desvio_padrao) * 100))

Intervalo Accuracy teste: [60.92, 65.11]


<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>Here the need for balancing becomes clear: the DummyClassifier reached an accuracy of 0.63, but the metrics for Churn 'Yes' (1), with precision 0.26 and recall 0.22, show a clear bias of the model toward 'No', in line with the class proportion computed earlier.</p>

### Undersampling

In [None]:
for nome, report in report_undersampling.items():
    print(f'{nome}:\n')
    print(report)

Logistic Regression:

              precision    recall  f1-score   support

           0       0.89      0.50      0.64      5174
           1       0.37      0.82      0.51      1869

    accuracy                           0.58      7043
   macro avg       0.63      0.66      0.57      7043
weighted avg       0.75      0.58      0.60      7043

Random Forest Classifier:

              precision    recall  f1-score   support

           0       0.84      0.34      0.49      5174
           1       0.31      0.82      0.45      1869

    accuracy                           0.47      7043
   macro avg       0.57      0.58      0.47      7043
weighted avg       0.70      0.47      0.48      7043

Decision Tree Classifier:

              precision    recall  f1-score   support

           0       0.82      0.34      0.48      5174
           1       0.30      0.79      0.44      1869

    accuracy                           0.46      7043
   macro avg       0.56      0.57      0.46      704

In [None]:
lista_valores = []
for i in resultado_undersampling.values():
    media = i['test_score'].mean()
    desvio_padrao = i['test_score'].std()
    lista_valores.append("Intervalo Accuracy teste: [%.2f, %.2f]" % ((media - 2 * desvio_padrao)*100, 
                                                             (media + 2 * desvio_padrao) * 100))
lista_nomes = []
for i in resultado_undersampling.keys():
    lista_nomes.append(i)    

merged_lista = tuple(zip(lista_valores, lista_nomes))

for i in merged_lista:
    print(i)

('Intervalo Accuracy teste: [56.14, 60.66]', 'Logistic Regression')
('Intervalo Accuracy teste: [44.04, 49.56]', 'Random Forest Classifier')
('Intervalo Accuracy teste: [41.94, 50.35]', 'Decision Tree Classifier')
('Intervalo Accuracy teste: [43.03, 49.99]', 'Gradient Boosting Classifier')
('Intervalo Accuracy teste: [53.53, 59.29]', 'Support Vector Classifier')


### Oversampling

In [None]:
for nome, report in report_oversampling.items():
    print(f'{nome}:\n')
    print(report)

Logistic Regression:

              precision    recall  f1-score   support

           0       0.87      0.81      0.84      5174
           1       0.56      0.68      0.61      1869

    accuracy                           0.77      7043
   macro avg       0.72      0.74      0.73      7043
weighted avg       0.79      0.77      0.78      7043

Random Forest Classifier:

              precision    recall  f1-score   support

           0       0.86      0.83      0.84      5174
           1       0.56      0.62      0.59      1869

    accuracy                           0.77      7043
   macro avg       0.71      0.72      0.72      7043
weighted avg       0.78      0.77      0.77      7043

Decision Tree Classifier:

              precision    recall  f1-score   support

           0       0.88      0.75      0.81      5174
           1       0.51      0.72      0.60      1869

    accuracy                           0.74      7043
   macro avg       0.70      0.74      0.71      704

In [None]:
lista_valores = []
for i in resultado_oversampling.values():
    media = i['test_score'].mean()
    desvio_padrao = i['test_score'].std()
    lista_valores.append("Intervalo Accuracy teste: [%.2f, %.2f]" % ((media - 2 * desvio_padrao)*100, 
                                                             (media + 2 * desvio_padrao) * 100))
lista_nomes = []
for i in resultado_oversampling.keys():
    lista_nomes.append(i)    

merged_lista = tuple(zip(lista_valores, lista_nomes))

for i in merged_lista:
    print(i)

('Intervalo Accuracy teste: [74.71, 79.71]', 'Logistic Regression')
('Intervalo Accuracy teste: [74.80, 79.39]', 'Random Forest Classifier')
('Intervalo Accuracy teste: [70.97, 77.94]', 'Decision Tree Classifier')
('Intervalo Accuracy teste: [74.87, 80.54]', 'Gradient Boosting Classifier')
('Intervalo Accuracy teste: [73.44, 79.99]', 'Support Vector Classifier')


## Choosing the best model and the most suitable type of balancing

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>For our main problem, reducing the churn rate of Alura Voz customers, we want to improve the identification of customers who are likely to leave the service, so that commercial retention strategies can be tried on them. Given the results presented above, the Gradient Boosting Classifier combined with oversampling showed the best overall score across precision and recall. This is therefore the model chosen for our dataset.</p>

## Optimizing the best model

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>We can also try to optimize the hyperparameters of the chosen model and evaluate its performance. For this we use GridSearchCV from scikit-learn, which runs cross-validation for every combination of the chosen parameter values.</p>
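<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>GridSearchCV also accepts a Pipeline, in which case each hyperparameter is addressed as the step name, two underscores, and the parameter name. The sketch below shows this on synthetic data with a deliberately small grid. It is an illustration under stated assumptions, not the search run in this notebook: the data, the names ending in _toy, the reduced grid and the choice of recall as the scoring metric are all hypothetical. A resampling step such as SMOTE could be added in the same way with the Pipeline class from imbalanced-learn.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the churn data (not the real dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=6,
                                   weights=[0.73, 0.27], random_state=53)

pipeline_toy = Pipeline([('normalizacao', MinMaxScaler()),
                         ('modelo', GradientBoostingClassifier(random_state=53))])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
grade_toy = {'modelo__n_estimators': [20, 40],
             'modelo__max_depth': [2, 3]}

busca_toy = GridSearchCV(pipeline_toy, grade_toy, scoring='recall',
                         cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=53))
busca_toy.fit(X_toy, y_toy)

print(busca_toy.best_params_)
```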

In [None]:
modelo_escolhido = GradientBoostingClassifier()

In [None]:
hiperparametros = {'n_estimators': [100, 300, 500, 600],
                   'max_depth': [3, 4, 5, 6, 7],
                   'min_samples_split': [2, 3, 4, 5],
                   'learning_rate': [0.1, 1]}

In [None]:
from sklearn.model_selection import GridSearchCV

otimiz = GridSearchCV(modelo_escolhido, hiperparametros, cv = StratifiedKFold(n_splits = 10, shuffle = True),
                      verbose = 3, n_jobs = 1)
otimiz.fit(X, y)

Fitting 10 folds for each of 160 candidates, totalling 1600 fits
[CV 1/10] END learning_rate=0.1, max_depth=3, min_samples_split=2, n_estimators=100;, score=0.774 total time=   0.5s
[CV 2/10] END learning_rate=0.1, max_depth=3, min_samples_split=2, n_estimators=100;, score=0.783 total time=   0.5s
[CV 3/10] END learning_rate=0.1, max_depth=3, min_samples_split=2, n_estimators=100;, score=0.803 total time=   0.6s
[CV 4/10] END learning_rate=0.1, max_depth=3, min_samples_split=2, n_estimators=100;, score=0.803 total time=   0.6s
[CV 5/10] END learning_rate=0.1, max_depth=3, min_samples_split=2, n_estimators=100;, score=0.788 total time=   0.6s
[CV 6/10] END learning_rate=0.1, max_depth=3, min_samples_split=2, n_estimators=100;, score=0.803 total time=   0.6s
[CV 7/10] END learning_rate=0.1, max_depth=3, min_samples_split=2, n_estimators=100;, score=0.815 total time=   0.5s
[CV 8/10] END learning_rate=0.1, max_depth=3, min_samples_split=2, n_estimators=100;, score=0.784 total time=   0.5s

[CV 1/10] END learning_rate=0.1, max_depth=3, min_samples_split=3, n_estimators=600;, score=0.748 total time=   3.8s
[CV 2/10] END learning_rate=0.1, max_depth=3, min_samples_split=3, n_estimators=600;, score=0.750 total time=   3.7s
[CV 3/10] END learning_rate=0.1, max_depth=3, min_samples_split=3, n_estimators=600;, score=0.796 total time=   3.7s
[CV 4/10] END learning_rate=0.1, max_depth=3, min_samples_split=3, n_estimators=600;, score=0.803 total time=   3.9s
[CV 5/10] END learning_rate=0.1, max_depth=3, min_samples_split=3, n_estimators=600;, score=0.795 total time=   6.1s
[CV 6/10] END learning_rate=0.1, max_depth=3, min_samples_split=3, n_estimators=600;, score=0.793 total time=   6.2s
[CV 7/10] END learning_rate=0.1, max_depth=3, min_samples_split=3, n_estimators=600;, score=0.776 total time=   4.8s
[CV 8/10] END learning_rate=0.1, max_depth=3, min_samples_split=3, n_estimators=600;, score=0.784 total time=   3.8s
[CV 9/10] END learning_rate=0.1, max_depth=3, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=3, min_samples_split=5, n_estimators=500;, score=0.746 total time=   2.9s
[CV 2/10] END learning_rate=0.1, max_depth=3, min_samples_split=5, n_estimators=500;, score=0.756 total time=   3.0s
[CV 3/10] END learning_rate=0.1, max_depth=3, min_samples_split=5, n_estimators=500;, score=0.799 total time=   3.1s
[CV 4/10] END learning_rate=0.1, max_depth=3, min_samples_split=5, n_estimators=500;, score=0.803 total time=   3.0s
[CV 5/10] END learning_rate=0.1, max_depth=3, min_samples_split=5, n_estimators=500;, score=0.797 total time=   3.0s
[CV 6/10] END learning_rate=0.1, max_depth=3, min_samples_split=5, n_estimators=500;, score=0.800 total time=   2.9s
[CV 7/10] END learning_rate=0.1, max_depth=3, min_samples_split=5, n_estimators=500;, score=0.787 total time=   2.9s
[CV 8/10] END learning_rate=0.1, max_depth=3, min_samples_split=5, n_estimators=500;, score=0.783 total time=   3.0s
[CV 9/10] END learning_rate=0.1, max_depth=3, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=4, min_samples_split=3, n_estimators=300;, score=0.766 total time=   2.4s
[CV 2/10] END learning_rate=0.1, max_depth=4, min_samples_split=3, n_estimators=300;, score=0.776 total time=   2.4s
[CV 3/10] END learning_rate=0.1, max_depth=4, min_samples_split=3, n_estimators=300;, score=0.786 total time=   2.4s
[CV 4/10] END learning_rate=0.1, max_depth=4, min_samples_split=3, n_estimators=300;, score=0.794 total time=   2.4s
[CV 5/10] END learning_rate=0.1, max_depth=4, min_samples_split=3, n_estimators=300;, score=0.804 total time=   2.4s
[CV 6/10] END learning_rate=0.1, max_depth=4, min_samples_split=3, n_estimators=300;, score=0.788 total time=   2.4s
[CV 7/10] END learning_rate=0.1, max_depth=4, min_samples_split=3, n_estimators=300;, score=0.791 total time=   2.5s
[CV 8/10] END learning_rate=0.1, max_depth=4, min_samples_split=3, n_estimators=300;, score=0.783 total time=   2.5s
[CV 9/10] END learning_rate=0.1, max_depth=4, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=4, min_samples_split=5, n_estimators=100;, score=0.766 total time=   0.7s
[CV 2/10] END learning_rate=0.1, max_depth=4, min_samples_split=5, n_estimators=100;, score=0.779 total time=   0.8s
[CV 3/10] END learning_rate=0.1, max_depth=4, min_samples_split=5, n_estimators=100;, score=0.810 total time=   0.7s
[CV 4/10] END learning_rate=0.1, max_depth=4, min_samples_split=5, n_estimators=100;, score=0.803 total time=   0.7s
[CV 5/10] END learning_rate=0.1, max_depth=4, min_samples_split=5, n_estimators=100;, score=0.801 total time=   0.8s
[CV 6/10] END learning_rate=0.1, max_depth=4, min_samples_split=5, n_estimators=100;, score=0.798 total time=   0.7s
[CV 7/10] END learning_rate=0.1, max_depth=4, min_samples_split=5, n_estimators=100;, score=0.805 total time=   0.8s
[CV 8/10] END learning_rate=0.1, max_depth=4, min_samples_split=5, n_estimators=100;, score=0.793 total time=   0.8s
[CV 9/10] END learning_rate=0.1, max_depth=4, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=5, min_samples_split=2, n_estimators=600;, score=0.755 total time=   6.1s
[CV 2/10] END learning_rate=0.1, max_depth=5, min_samples_split=2, n_estimators=600;, score=0.766 total time=   5.9s
[CV 3/10] END learning_rate=0.1, max_depth=5, min_samples_split=2, n_estimators=600;, score=0.770 total time=   6.2s
[CV 4/10] END learning_rate=0.1, max_depth=5, min_samples_split=2, n_estimators=600;, score=0.793 total time=   6.0s
[CV 5/10] END learning_rate=0.1, max_depth=5, min_samples_split=2, n_estimators=600;, score=0.780 total time=   6.3s
[CV 6/10] END learning_rate=0.1, max_depth=5, min_samples_split=2, n_estimators=600;, score=0.787 total time=   6.0s
[CV 7/10] END learning_rate=0.1, max_depth=5, min_samples_split=2, n_estimators=600;, score=0.764 total time=   6.0s
[CV 8/10] END learning_rate=0.1, max_depth=5, min_samples_split=2, n_estimators=600;, score=0.763 total time=   6.0s
[CV 9/10] END learning_rate=0.1, max_depth=5, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=5, min_samples_split=4, n_estimators=500;, score=0.757 total time=   4.8s
[CV 2/10] END learning_rate=0.1, max_depth=5, min_samples_split=4, n_estimators=500;, score=0.755 total time=   4.9s
[CV 3/10] END learning_rate=0.1, max_depth=5, min_samples_split=4, n_estimators=500;, score=0.779 total time=   5.1s
[CV 4/10] END learning_rate=0.1, max_depth=5, min_samples_split=4, n_estimators=500;, score=0.803 total time=   4.9s
[CV 5/10] END learning_rate=0.1, max_depth=5, min_samples_split=4, n_estimators=500;, score=0.780 total time=   4.8s
[CV 6/10] END learning_rate=0.1, max_depth=5, min_samples_split=4, n_estimators=500;, score=0.795 total time=   4.8s
[CV 7/10] END learning_rate=0.1, max_depth=5, min_samples_split=4, n_estimators=500;, score=0.770 total time=   4.9s
[CV 8/10] END learning_rate=0.1, max_depth=5, min_samples_split=4, n_estimators=500;, score=0.774 total time=   4.7s
[CV 9/10] END learning_rate=0.1, max_depth=5, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=6, min_samples_split=2, n_estimators=300;, score=0.765 total time=   3.4s
[CV 2/10] END learning_rate=0.1, max_depth=6, min_samples_split=2, n_estimators=300;, score=0.772 total time=   3.6s
[CV 3/10] END learning_rate=0.1, max_depth=6, min_samples_split=2, n_estimators=300;, score=0.774 total time=   3.8s
[CV 4/10] END learning_rate=0.1, max_depth=6, min_samples_split=2, n_estimators=300;, score=0.798 total time=   3.7s
[CV 5/10] END learning_rate=0.1, max_depth=6, min_samples_split=2, n_estimators=300;, score=0.774 total time=   3.6s
[CV 6/10] END learning_rate=0.1, max_depth=6, min_samples_split=2, n_estimators=300;, score=0.790 total time=   3.6s
[CV 7/10] END learning_rate=0.1, max_depth=6, min_samples_split=2, n_estimators=300;, score=0.786 total time=   3.6s
[CV 8/10] END learning_rate=0.1, max_depth=6, min_samples_split=2, n_estimators=300;, score=0.778 total time=   3.8s
[CV 9/10] END learning_rate=0.1, max_depth=6, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=6, min_samples_split=4, n_estimators=100;, score=0.772 total time=   1.2s
[CV 2/10] END learning_rate=0.1, max_depth=6, min_samples_split=4, n_estimators=100;, score=0.766 total time=   1.1s
[CV 3/10] END learning_rate=0.1, max_depth=6, min_samples_split=4, n_estimators=100;, score=0.796 total time=   1.2s
[CV 4/10] END learning_rate=0.1, max_depth=6, min_samples_split=4, n_estimators=100;, score=0.793 total time=   1.2s
[CV 5/10] END learning_rate=0.1, max_depth=6, min_samples_split=4, n_estimators=100;, score=0.788 total time=   1.2s
[CV 6/10] END learning_rate=0.1, max_depth=6, min_samples_split=4, n_estimators=100;, score=0.795 total time=   1.1s
[CV 7/10] END learning_rate=0.1, max_depth=6, min_samples_split=4, n_estimators=100;, score=0.803 total time=   1.2s
[CV 8/10] END learning_rate=0.1, max_depth=6, min_samples_split=4, n_estimators=100;, score=0.791 total time=   1.2s
[CV 9/10] END learning_rate=0.1, max_depth=6, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=6, min_samples_split=5, n_estimators=600;, score=0.759 total time=   7.0s
[CV 2/10] END learning_rate=0.1, max_depth=6, min_samples_split=5, n_estimators=600;, score=0.762 total time=   7.2s
[CV 3/10] END learning_rate=0.1, max_depth=6, min_samples_split=5, n_estimators=600;, score=0.757 total time=   7.1s
[CV 4/10] END learning_rate=0.1, max_depth=6, min_samples_split=5, n_estimators=600;, score=0.786 total time=   7.0s
[CV 5/10] END learning_rate=0.1, max_depth=6, min_samples_split=5, n_estimators=600;, score=0.766 total time=   7.0s
[CV 6/10] END learning_rate=0.1, max_depth=6, min_samples_split=5, n_estimators=600;, score=0.784 total time=   7.1s
[CV 7/10] END learning_rate=0.1, max_depth=6, min_samples_split=5, n_estimators=600;, score=0.783 total time=   7.0s
[CV 8/10] END learning_rate=0.1, max_depth=6, min_samples_split=5, n_estimators=600;, score=0.774 total time=   7.0s
[CV 9/10] END learning_rate=0.1, max_depth=6, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=7, min_samples_split=3, n_estimators=500;, score=0.767 total time=   7.5s
[CV 2/10] END learning_rate=0.1, max_depth=7, min_samples_split=3, n_estimators=500;, score=0.757 total time=   7.3s
[CV 3/10] END learning_rate=0.1, max_depth=7, min_samples_split=3, n_estimators=500;, score=0.776 total time=   7.2s
[CV 4/10] END learning_rate=0.1, max_depth=7, min_samples_split=3, n_estimators=500;, score=0.810 total time=   7.2s
[CV 5/10] END learning_rate=0.1, max_depth=7, min_samples_split=3, n_estimators=500;, score=0.780 total time=   7.2s
[CV 6/10] END learning_rate=0.1, max_depth=7, min_samples_split=3, n_estimators=500;, score=0.771 total time=   7.1s
[CV 7/10] END learning_rate=0.1, max_depth=7, min_samples_split=3, n_estimators=500;, score=0.774 total time=   7.2s
[CV 8/10] END learning_rate=0.1, max_depth=7, min_samples_split=3, n_estimators=500;, score=0.777 total time=   7.0s
[CV 9/10] END learning_rate=0.1, max_depth=7, min_samples_split=

[CV 1/10] END learning_rate=0.1, max_depth=7, min_samples_split=5, n_estimators=300;, score=0.760 total time=   4.2s
[CV 2/10] END learning_rate=0.1, max_depth=7, min_samples_split=5, n_estimators=300;, score=0.756 total time=   4.0s
[CV 3/10] END learning_rate=0.1, max_depth=7, min_samples_split=5, n_estimators=300;, score=0.776 total time=   4.1s
[CV 4/10] END learning_rate=0.1, max_depth=7, min_samples_split=5, n_estimators=300;, score=0.795 total time=   4.1s
[CV 5/10] END learning_rate=0.1, max_depth=7, min_samples_split=5, n_estimators=300;, score=0.780 total time=   4.3s
[CV 6/10] END learning_rate=0.1, max_depth=7, min_samples_split=5, n_estimators=300;, score=0.784 total time=   4.3s
[CV 7/10] END learning_rate=0.1, max_depth=7, min_samples_split=5, n_estimators=300;, score=0.786 total time=   4.2s
[CV 8/10] END learning_rate=0.1, max_depth=7, min_samples_split=5, n_estimators=300;, score=0.773 total time=   4.2s
[CV 9/10] END learning_rate=0.1, max_depth=7, min_samples_split=

[CV 2/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=100;, score=0.732 total time=   0.5s
[CV 3/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=100;, score=0.735 total time=   0.6s
[CV 4/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=100;, score=0.800 total time=   0.5s
[CV 5/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=100;, score=0.757 total time=   0.5s
[CV 6/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=100;, score=0.774 total time=   0.5s
[CV 7/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=100;, score=0.774 total time=   0.5s
[CV 8/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=100;, score=0.749 total time=   0.5s
[CV 9/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=100;, score=0.773 total time=   0.5s
[CV 10/10] END learning_rate=1, max_depth=3, min_samples_split=3, n_estimators=1

[CV 4/10] END learning_rate=1, max_depth=3, min_samples_split=4, n_estimators=600;, score=0.780 total time=   3.7s
[CV 5/10] END learning_rate=1, max_depth=3, min_samples_split=4, n_estimators=600;, score=0.737 total time=   3.6s
[CV 6/10] END learning_rate=1, max_depth=3, min_samples_split=4, n_estimators=600;, score=0.767 total time=   3.7s
[CV 7/10] END learning_rate=1, max_depth=3, min_samples_split=4, n_estimators=600;, score=0.750 total time=   3.8s
[CV 8/10] END learning_rate=1, max_depth=3, min_samples_split=4, n_estimators=600;, score=0.744 total time=   3.6s
[CV 9/10] END learning_rate=1, max_depth=3, min_samples_split=4, n_estimators=600;, score=0.739 total time=   3.5s
[CV 10/10] END learning_rate=1, max_depth=3, min_samples_split=4, n_estimators=600;, score=0.761 total time=   3.5s
[CV 1/10] END learning_rate=1, max_depth=3, min_samples_split=5, n_estimators=100;, score=0.725 total time=   0.5s
[CV 2/10] END learning_rate=1, max_depth=3, min_samples_split=5, n_estimators=1

[CV 6/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=500;, score=0.750 total time=   4.0s
[CV 7/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=500;, score=0.737 total time=   3.9s
[CV 8/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=500;, score=0.675 total time=   3.9s
[CV 9/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=500;, score=0.727 total time=   3.8s
[CV 10/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=500;, score=0.783 total time=   3.8s
[CV 1/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=600;, score=0.749 total time=   4.6s
[CV 2/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=600;, score=0.725 total time=   4.7s
[CV 3/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=600;, score=0.769 total time=   4.8s
[CV 4/10] END learning_rate=1, max_depth=4, min_samples_split=2, n_estimators=6

[CV 8/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=300;, score=0.751 total time=   2.3s
[CV 9/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=300;, score=0.734 total time=   2.3s
[CV 10/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=300;, score=0.783 total time=   2.3s
[CV 1/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=500;, score=0.760 total time=   4.0s
[CV 2/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=500;, score=0.740 total time=   3.8s
[CV 3/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=500;, score=0.738 total time=   4.1s
[CV 4/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=500;, score=0.773 total time=   4.0s
[CV 5/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=500;, score=0.753 total time=   3.8s
[CV 6/10] END learning_rate=1, max_depth=4, min_samples_split=4, n_estimators=5

[CV 10/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=100;, score=0.768 total time=   0.9s
[CV 1/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=300;, score=0.738 total time=   2.8s
[CV 2/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=300;, score=0.729 total time=   2.8s
[CV 3/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=300;, score=0.750 total time=   2.8s
[CV 4/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=300;, score=0.770 total time=   2.8s
[CV 5/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=300;, score=0.746 total time=   2.9s
[CV 6/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=300;, score=0.763 total time=   2.8s
[CV 7/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=300;, score=0.747 total time=   2.9s
[CV 8/10] END learning_rate=1, max_depth=5, min_samples_split=2, n_estimators=3

[CV 2/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=100;, score=0.725 total time=   0.8s
[CV 3/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=100;, score=0.745 total time=   0.9s
[CV 4/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=100;, score=0.776 total time=   0.8s
[CV 5/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=100;, score=0.756 total time=   0.9s
[CV 6/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=100;, score=0.770 total time=   0.9s
[CV 7/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=100;, score=0.751 total time=   0.8s
[CV 8/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=100;, score=0.737 total time=   0.9s
[CV 9/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=100;, score=0.710 total time=   0.9s
[CV 10/10] END learning_rate=1, max_depth=5, min_samples_split=4, n_estimators=1

[CV 4/10] END learning_rate=1, max_depth=5, min_samples_split=5, n_estimators=600;, score=0.766 total time=   5.8s
[CV 5/10] END learning_rate=1, max_depth=5, min_samples_split=5, n_estimators=600;, score=0.733 total time=   6.0s
[CV 6/10] END learning_rate=1, max_depth=5, min_samples_split=5, n_estimators=600;, score=0.759 total time=   6.2s
[CV 7/10] END learning_rate=1, max_depth=5, min_samples_split=5, n_estimators=600;, score=0.734 total time=   5.8s
[CV 8/10] END learning_rate=1, max_depth=5, min_samples_split=5, n_estimators=600;, score=0.720 total time=   5.8s
[CV 9/10] END learning_rate=1, max_depth=5, min_samples_split=5, n_estimators=600;, score=0.723 total time=   5.9s
[CV 10/10] END learning_rate=1, max_depth=5, min_samples_split=5, n_estimators=600;, score=0.776 total time=   5.8s
[CV 1/10] END learning_rate=1, max_depth=6, min_samples_split=2, n_estimators=100;, score=0.716 total time=   1.2s
[CV 2/10] END learning_rate=1, max_depth=6, min_samples_split=2, n_estimators=1

[CV 6/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=500;, score=0.756 total time=   5.8s
[CV 7/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=500;, score=0.757 total time=   5.9s
[CV 8/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=500;, score=0.756 total time=   5.8s
[CV 9/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=500;, score=0.751 total time=   5.9s
[CV 10/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=500;, score=0.780 total time=   5.7s
[CV 1/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=600;, score=0.739 total time=   7.1s
[CV 2/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=600;, score=0.730 total time=   7.3s
[CV 3/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=600;, score=0.752 total time=   7.1s
[CV 4/10] END learning_rate=1, max_depth=6, min_samples_split=3, n_estimators=6

[CV 8/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=300;, score=0.747 total time=   3.5s
[CV 9/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=300;, score=0.730 total time=   3.5s
[CV 10/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=300;, score=0.791 total time=   3.4s
[CV 1/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=500;, score=0.746 total time=   6.0s
[CV 2/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=500;, score=0.733 total time=   6.0s
[CV 3/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=500;, score=0.730 total time=   5.8s
[CV 4/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=500;, score=0.791 total time=   6.1s
[CV 5/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=500;, score=0.757 total time=   5.9s
[CV 6/10] END learning_rate=1, max_depth=6, min_samples_split=5, n_estimators=5

[CV 10/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=100;, score=0.763 total time=   1.4s
[CV 1/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=300;, score=0.746 total time=   4.1s
[CV 2/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=300;, score=0.753 total time=   4.2s
[CV 3/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=300;, score=0.753 total time=   4.1s
[CV 4/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=300;, score=0.793 total time=   4.1s
[CV 5/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=300;, score=0.753 total time=   4.0s
[CV 6/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=300;, score=0.756 total time=   3.8s
[CV 7/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=300;, score=0.739 total time=   4.1s
[CV 8/10] END learning_rate=1, max_depth=7, min_samples_split=3, n_estimators=3

[CV 2/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=100;, score=0.728 total time=   1.3s
[CV 3/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=100;, score=0.763 total time=   1.4s
[CV 4/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=100;, score=0.781 total time=   1.3s
[CV 5/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=100;, score=0.736 total time=   1.2s
[CV 6/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=100;, score=0.753 total time=   1.3s
[CV 7/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=100;, score=0.740 total time=   1.4s
[CV 8/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=100;, score=0.767 total time=   1.3s
[CV 9/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=100;, score=0.727 total time=   1.3s
[CV 10/10] END learning_rate=1, max_depth=7, min_samples_split=5, n_estimators=1

GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=True),
             estimator=GradientBoostingClassifier(), n_jobs=1,
             param_grid={'learning_rate': [0.1, 1],
                         'max_depth': [3, 4, 5, 6, 7],
                         'min_samples_split': [2, 3, 4, 5],
                         'n_estimators': [100, 300, 500, 600]},
             verbose=3)

In [None]:
best_par = otimiz.best_params_
print(best_par)

{'learning_rate': 0.1, 'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 100}
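
<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>The verbose log above is hard to scan, and a fitted search also exposes the same per-combination scores in its <code>cv_results_</code> attribute. The sketch below is an illustration only: it uses synthetic data from <code>make_classification</code> and a deliberately tiny grid, and the names <code>busca_demo</code>, <code>resultados</code> and <code>best_par_relatado</code> are hypothetical and not part of this notebook. It also checks the reported best parameters against the default values of <code>GradientBoostingClassifier</code>, since the selected combination appears to coincide with them.</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the churn data (assumption: NOT the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.73, 0.27], random_state=53)

# Deliberately tiny grid so the sketch runs in a few seconds.
param_grid_demo = {"learning_rate": [0.1, 1], "max_depth": [3, 5]}

busca_demo = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=53),
    param_grid=param_grid_demo,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=53),
    scoring="accuracy",
)
busca_demo.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
resultados = (
    pd.DataFrame(busca_demo.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(resultados)

# Compare the best parameters reported in this notebook with the library defaults.
best_par_relatado = {"learning_rate": 0.1, "max_depth": 3,
                     "min_samples_split": 2, "n_estimators": 100}
defaults = GradientBoostingClassifier().get_params()
coincide_com_default = all(defaults[k] == v for k, v in best_par_relatado.items())
print("Best parameters equal the defaults:", coincide_com_default)
```

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>If the check prints <code>True</code>, the grid search did not find anything better than the default configuration within the searched grid, which is worth keeping in mind when comparing results before and after optimization.</p>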


In [None]:
modelo_churn_otimizado = GradientBoostingClassifier(max_depth=best_par['max_depth'],
                                                    min_samples_split=best_par['min_samples_split'],
                                                    n_estimators=best_par['n_estimators'],
                                                    learning_rate=best_par['learning_rate'],
                                                    random_state=53)

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>With the optimized parameters for our chosen model in hand, we run a new cross-validation on freshly shuffled folds, so that the performance estimate is less affected by randomness and by possible data leakage from the hyperparameter optimization:</p>

In [None]:
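from sklearn.metrics import accuracy_score, classification_report, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Accumulate the true and predicted labels of every test fold
original = []
predito = []

def classification_report_accuracy_score(y_real, y_pred):
    # Custom scorer: store the fold's labels, then return its accuracy
    original.extend(y_real)
    predito.extend(y_pred)
    return accuracy_score(y_real, y_pred)

scores = cross_val_score(modelo_churn_otimizado, X, y, cv=StratifiedKFold(n_splits=10, shuffle=True),
                         scoring=make_scorer(classification_report_accuracy_score))
# Report computed over the pooled out-of-fold predictions
print(classification_report(original, predito))
media = scores.mean()
desvio_padrao = scores.std()
# Mean accuracy +/- 2 standard deviations across folds, in percent
print("Intervalo Accuracy teste: [%.2f, %.2f]" % ((media - 2 * desvio_padrao) * 100, (media + 2 * desvio_padrao) * 100))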

              precision    recall  f1-score   support

           0       0.84      0.90      0.87      5174
           1       0.65      0.51      0.57      1869

    accuracy                           0.80      7043
   macro avg       0.74      0.71      0.72      7043
weighted avg       0.79      0.80      0.79      7043

Intervalo Accuracy teste: [77.57, 81.82]
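
<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>The scorer above collects the labels through module-level lists as a side effect. A possible alternative, sketched here on synthetic data, is <code>cross_val_predict</code>, which returns one out-of-fold prediction per sample without a custom scorer. Every name ending in <code>_demo</code>, along with <code>acc_por_fold</code>, is hypothetical and not part of this notebook, and the numbers it prints are unrelated to the results reported above.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the churn data (assumption: NOT the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.73, 0.27], random_state=53)

modelo_demo = GradientBoostingClassifier(n_estimators=30, random_state=53)
# A fixed random_state makes every call to split() yield the same folds.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=53)

# Each sample is predicted by the model trained on the folds that exclude it.
y_pred_demo = cross_val_predict(modelo_demo, X_demo, y_demo, cv=cv_demo)
print(classification_report(y_demo, y_pred_demo))

# Per-fold accuracy recomputed from the same folds, then mean +/- 2 std.
acc_por_fold = np.array([
    accuracy_score(y_demo[idx_teste], y_pred_demo[idx_teste])
    for _, idx_teste in cv_demo.split(X_demo, y_demo)
])
limite_inf = (acc_por_fold.mean() - 2 * acc_por_fold.std()) * 100
limite_sup = (acc_por_fold.mean() + 2 * acc_por_fold.std()) * 100
print("Accuracy interval: [%.2f, %.2f]" % (limite_inf, limite_sup))
```

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>Fixing <code>random_state</code> in the splitter is a choice made for this sketch so that the folds are reproducible. The notebook's own call leaves it unset, so its interval will vary slightly between runs.</p>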


<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>We see a slight improvement in the prediction quality of our chosen model, most notably a narrower accuracy interval across the folds. Finally, we save the optimized model for later deployment.</p>

In [None]:
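# cross_val_score only fits clones, so fit the model on all the data before saving it
modelo_churn_otimizado.fit(X, y)
filename = 'modelo_churn.sav'
joblib.dump(modelo_churn_otimizado, filename)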

['modelo_churn.sav']
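
<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>For deployment, a file written by <code>joblib.dump</code> can be read back with <code>joblib.load</code>. The round trip below is a minimal sketch on synthetic data, assuming the estimator was fitted before being saved. The names <code>modelo_demo</code>, <code>modelo_carregado</code> and the temporary file path are hypothetical and not part of this notebook.</p>

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the churn data (assumption: NOT the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=53)

# The estimator must be fitted before it is saved, otherwise the loaded
# object cannot predict.
modelo_demo = GradientBoostingClassifier(n_estimators=20, random_state=53)
modelo_demo.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as pasta:
    caminho = os.path.join(pasta, "modelo_demo.sav")
    joblib.dump(modelo_demo, caminho)         # write the fitted model to disk
    modelo_carregado = joblib.load(caminho)   # read it back as a new object

# The loaded model should reproduce the predictions of the original one.
pred_iguais = bool((modelo_carregado.predict(X_demo) == modelo_demo.predict(X_demo)).all())
print("Same predictions after reload:", pred_iguais)
```

<p style='font-size: 16px; line-height: 2; margin: 10px 50px; text-align: justify;'>The scikit-learn documentation recommends loading such files with the same library version that created them, so recording that version alongside the saved model is a sensible precaution.</p>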