Customer churn is a critical problem for any company because it directly affects revenue, growth, and the sustainability of the business. Identifying in advance which customers are most likely to cancel enables more efficient and strategic preventive action.

This study uses a dataset from a telecommunications company that offers television, telephone, and internet services. Data analysis techniques and predictive models are applied to it in order to understand the factors associated with churn and to support customer-retention decisions.

In [1]:
import pandas as pd
import numpy as np

In [2]:
!pip install scikit-learn



In [3]:
!pip install kagglehub



In [4]:
import kagglehub

path = kagglehub.dataset_download("blastchar/telco-customer-churn")
base = pd.read_csv(path+'/WA_Fn-UseC_-Telco-Customer-Churn.csv')

Using Colab cache for faster access to the 'telco-customer-churn' dataset.


In [5]:
pd.set_option('display.max_columns', None)

base.head(10)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
5,9305-CDSKC,Female,0,No,No,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes
6,1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,Yes,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,No
7,6713-OKOMC,Female,0,No,No,10,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No
8,7892-POOKP,Female,0,Yes,No,28,Yes,Yes,Fiber optic,No,No,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes
9,6388-TABGU,Male,0,No,Yes,62,Yes,No,DSL,Yes,Yes,No,No,No,No,One year,No,Bank transfer (automatic),56.15,3487.95,No


In [6]:
base.isna().sum()

Unnamed: 0,0
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0


In [7]:
base.isnull().sum()

Unnamed: 0,0
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0


In [8]:
np.unique(base.dtypes, return_counts=True)

(array([dtype('int64'), dtype('float64'), dtype('O')], dtype=object),
 array([ 2,  1, 18]))
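The `isna()` counts above report no missing values, yet the preprocessing cell further down replaces `' '` in the numeric columns, which suggests that blanks are stored as strings rather than as NaN. Below is a minimal sketch of how such values could be surfaced with `pd.to_numeric(errors='coerce')`. The toy frame is an assumption made for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one blank string hidden in an otherwise numeric column
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# isna() does not flag the blank, because ' ' is a valid (non-null) string
n_flagged_by_isna = int(toy["TotalCharges"].isna().sum())
print(n_flagged_by_isna)

# Coercing to numeric turns unparseable entries into NaN, making them countable
as_number = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_blank = int(as_number.isna().sum())
print(n_blank)
```

Whether the resulting NaN values should then be filled with 0, as the notebook does, or handled differently is a modelling decision that this sketch leaves open.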

For unordered categorical columns we will use OneHotEncoder, and for the numeric columns we will use StandardScaler.
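As a compact illustration of that plan, the sketch below applies both transformers in a single `ColumnTransformer`. It is not the approach used in the following cells, which go through `LabelEncoder` first. The toy frame and the variable names `toy`, `preprocess`, and `features` are assumptions for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one nominal and one numeric column
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1.0, 34.0, 62.0, 45.0],
})

preprocess = ColumnTransformer(
    transformers=[
        # One binary column per category; unseen categories become all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
        # Zero mean and unit variance for the numeric column
        ("scale", StandardScaler(), ["tenure"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

features = preprocess.fit_transform(toy)
print(features.shape)
```

Here the three contract categories become three columns and `tenure` stays as one scaled column, so the result has four columns.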

In [203]:
from sklearn.preprocessing import LabelEncoder,StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

colunas_scaler = ['TotalCharges','MonthlyCharges','tenure']

matriz_encoders = []

base_tratada = base.copy().drop(columns=['customerID'])

# Handle missing contract values: blank strings (' ') are replaced with 0
base_tratada['tenure'] = base_tratada['tenure'].replace(' ',0)
base_tratada['MonthlyCharges'] = base_tratada['MonthlyCharges'].replace(' ',0)
base_tratada['TotalCharges'] = base_tratada['TotalCharges'].replace(' ',0)

# Columns that will be scaled (the less common column type), plus lists to keep the fitted scalers and encoders for later access
colunas_scaler = ['TotalCharges','MonthlyCharges','tenure']

array_scalers = []
array_encoders = []

for each in base_tratada.columns:
  if each not in colunas_scaler:
    # Create the encoder
    label_encoder = LabelEncoder().fit(np.array(base[each]))

    # Assign the values transformed by the LabelEncoder
    base_tratada[each] = label_encoder.transform(np.array(base[each]))
    # Make sure they are stored as integers
    base_tratada[each] = base_tratada[each].astype(int)

    # Keep the encoders for later analysis in this notebook
    array_encoders.append([each, label_encoder])

  else:
    base_tratada[each] = base_tratada[each].astype(float)
    scaler = StandardScaler().fit(np.array(base_tratada[each]).reshape(-1,1))
    base_tratada[each] = scaler.transform(np.array(base_tratada[each]).reshape(-1,1))
    array_scalers.append([each, scaler])



In [204]:
# List of columns to drop: the scaled columns plus Churn
drop_cols = colunas_scaler.copy()
drop_cols.append('Churn')

# Values that will go through the encoder
base_ohe = base_tratada.drop(columns=drop_cols)
colunas_ohe_list = base_tratada.drop(columns=drop_cols).columns.tolist()

# Collect the positional indices of the columns to be transformed
for each in colunas_ohe_list:
  colunas_ohe_list[colunas_ohe_list.index(each)] = base_tratada.columns.get_loc(each)

# Create the transformer and fit it
oh_encoder = ColumnTransformer(transformers=[('OneHot',OneHotEncoder(),base_ohe.columns)], remainder='drop')

oh_encoder.fit(base_ohe)


In [207]:
base_ohe

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod
0,0,0,1,0,0,1,0,0,2,0,0,0,0,0,1,2
1,1,0,0,0,1,0,0,2,0,2,0,0,0,1,0,3
2,1,0,0,0,1,0,0,2,2,0,0,0,0,0,1,3
3,1,0,0,0,0,1,0,2,0,2,2,0,0,1,0,0
4,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,1,0,1,1,1,2,0,2,0,2,2,2,2,1,1,3
7039,0,0,1,1,1,2,1,0,2,2,0,2,2,1,1,1
7040,0,0,1,1,0,1,0,2,0,0,0,0,0,0,1,2
7041,1,1,1,0,1,2,1,0,0,0,0,0,0,0,1,3


In [290]:
# Apply the fitted transform; the resulting variable is our feature matrix
x_base = oh_encoder.transform(base_ohe)

In [292]:
x_base = np.append(x_base,base_tratada[colunas_scaler],axis=1)

In [296]:
x_base.shape

(7043, 46)
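The 46 columns can be checked by hand. Each one-hot encoded column contributes one output column per category, and the three scaled columns are appended unchanged. The short calculation below is a sketch that reads the category counts as (largest label + 1) from the `describe()` output of `base_tratada` shown in this notebook. The names `n_categories`, `n_onehot`, and `total` are introduced only for this example.

```python
# Number of distinct categories per one-hot encoded column,
# read as (max label + 1) from the describe() output of base_tratada
n_categories = {
    "gender": 2, "SeniorCitizen": 2, "Partner": 2, "Dependents": 2,
    "PhoneService": 2, "MultipleLines": 3, "InternetService": 3,
    "OnlineSecurity": 3, "OnlineBackup": 3, "DeviceProtection": 3,
    "TechSupport": 3, "StreamingTV": 3, "StreamingMovies": 3,
    "Contract": 3, "PaperlessBilling": 2, "PaymentMethod": 4,
}

n_onehot = sum(n_categories.values())   # columns produced by OneHotEncoder
n_numeric = 3                           # TotalCharges, MonthlyCharges, tenure
total = n_onehot + n_numeric
print(n_onehot, total)
```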

In [16]:
y_base = base_tratada['Churn']
y_base.shape

(7043,)

In [17]:
base_tratada.describe()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
count,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0,7043.0
mean,0.504756,0.162147,0.483033,0.299588,-2.421273e-17,0.903166,0.940508,0.872923,0.790004,0.906432,0.904444,0.797104,0.985376,0.992475,0.690473,0.592219,1.574329,-6.406285e-17,-3.7832390000000004e-17,0.26537
std,0.500013,0.368612,0.499748,0.45811,1.000071,0.295752,0.948554,0.737796,0.859848,0.880162,0.879949,0.861551,0.885002,0.885091,0.833755,0.491457,1.068104,1.000071,1.000071,0.441561
min,0.0,0.0,0.0,0.0,-1.318165,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.54586,-1.00578,0.0
25%,0.0,0.0,0.0,0.0,-0.9516817,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.9725399,-0.8299464,0.0
50%,1.0,0.0,0.0,0.0,-0.1372744,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,2.0,0.1857327,-0.3905282,0.0
75%,1.0,0.0,1.0,1.0,0.9214551,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,0.8338335,0.6648034,1.0
max,1.0,1.0,1.0,1.0,1.613701,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,1.794352,2.825806,1.0


In [18]:
base_tratada.corr()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
gender,1.0,-0.001874,-0.001808,0.010517,0.005106,-0.006488,-0.006739,-0.000863,-0.015017,-0.012057,0.000549,-0.006825,-0.006421,-0.008743,0.000126,-0.011754,0.017352,-0.014569,-8e-05,-0.008612
SeniorCitizen,-0.001874,1.0,0.016479,-0.211185,0.016567,0.008576,0.146185,-0.03231,-0.128221,-0.013632,-0.021398,-0.151268,0.030776,0.047266,-0.142554,0.15653,-0.038551,0.220173,0.103006,0.150889
Partner,-0.001808,0.016479,1.0,0.452676,0.379697,0.017706,0.14241,0.000891,0.150828,0.15313,0.16633,0.126733,0.137341,0.129574,0.294806,-0.014877,-0.154798,0.096848,0.317504,-0.150448
Dependents,0.010517,-0.211185,0.452676,1.0,0.159712,-0.001762,-0.024991,0.04459,0.152166,0.091015,0.080537,0.133524,0.046885,0.021321,0.243187,-0.111377,-0.040292,-0.11389,0.062078,-0.164221
tenure,0.005106,0.016567,0.379697,0.159712,1.0,0.008448,0.343032,-0.030359,0.325468,0.370876,0.371105,0.322942,0.289373,0.296866,0.671607,0.006152,-0.370436,0.2479,0.826178,-0.352229
PhoneService,-0.006488,0.008576,0.017706,-0.001762,0.008448,1.0,-0.020538,0.387436,-0.015198,0.024105,0.003727,-0.019158,0.055353,0.04387,0.002247,0.016505,-0.004184,0.247398,0.113214,0.011942
MultipleLines,-0.006739,0.146185,0.14241,-0.024991,0.343032,-0.020538,1.0,-0.109216,0.007141,0.117327,0.122318,0.011466,0.175059,0.180957,0.110842,0.165146,-0.176793,0.433576,0.452577,0.038037
InternetService,-0.000863,-0.03231,0.000891,0.04459,-0.030359,0.387436,-0.109216,1.0,-0.028416,0.036138,0.044944,-0.026047,0.107417,0.09835,0.099721,-0.138625,0.08614,-0.32326,-0.175755,-0.047291
OnlineSecurity,-0.015017,-0.128221,0.150828,0.152166,0.325468,-0.015198,0.007141,-0.028416,1.0,0.185126,0.175985,0.285028,0.044669,0.055954,0.374416,-0.157641,-0.096726,-0.053878,0.253224,-0.289309
OnlineBackup,-0.012057,-0.013632,0.15313,0.091015,0.370876,0.024105,0.117327,0.036138,0.185126,1.0,0.187757,0.195748,0.147186,0.136722,0.28098,-0.01337,-0.124847,0.119777,0.37441,-0.195525


In [19]:
array_corr = []

# Minimum absolute correlation considered relevant
relevancia = 0.35

# Store the correlation matrix once so it is not recomputed every time it is used inside the loop
base_corr = base_tratada.corr()

for col in base_corr.columns:
    for line in base_corr.index:
        if abs(base_corr.at[col,line]) > relevancia and abs(base_corr.at[col,line]) != 1.0:
            if [col, line, base_corr.at[col,line]] not in array_corr and\
                [line, col, base_corr.at[col,line]] not in array_corr:

                array_corr.append([col, line, base_corr.at[col,line]])

print('Number of relevant correlations found: ',len(array_corr))
print('--------')
print('--------')
for each in array_corr:
    print(each[0])
    print(each[1])
    print(each[2])
    print('--------')

Number of relevant correlations found:  23
--------
--------
Partner
Dependents
0.4526762829294659
--------
Partner
tenure
0.37969746116829356
--------
tenure
OnlineBackup
0.37087612301584916
--------
tenure
DeviceProtection
0.3711054358369816
--------
tenure
Contract
0.6716065492280595
--------
tenure
PaymentMethod
-0.3704361179501759
--------
tenure
TotalCharges
0.8261783979502471
--------
tenure
Churn
-0.3522286701130793
--------
PhoneService
InternetService
0.38743602203093397
--------
MultipleLines
MonthlyCharges
0.43357600985754013
--------
MultipleLines
TotalCharges
0.45257679157440833
--------
OnlineSecurity
Contract
0.37441553839452074
--------
OnlineBackup
TotalCharges
0.3744096123030767
--------
DeviceProtection
Contract
0.3502770893212923
--------
DeviceProtection
TotalCharges
0.38789726384622947
--------
TechSupport
Contract
0.42536667159313896
--------
StreamingTV
StreamingMovies
0.43477235280035037
--------
StreamingTV
TotalCharges
0.3914698664788375
--------
Stre
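The nested loop above can also be expressed with a mask over the upper triangle of the correlation matrix, which keeps each pair once and drops the diagonal. This is only a sketch on synthetic data: the frame `toy` and its columns `a`, `b`, `c` are assumptions, and unlike the loop it excludes the diagonal by position rather than by testing for a correlation of exactly 1.0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for base_tratada
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # strongly correlated with a
    "c": rng.normal(size=200),                 # independent noise
})

threshold = 0.35
corr = toy.corr()

# Strict upper triangle: each pair appears once and the diagonal is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Filter by absolute value and sort from most negative to most positive
relevant = pairs[pairs.abs() > threshold].sort_values()
print(relevant)
```

The result is a Series indexed by column pairs, so the strongest and weakest entries can be taken directly from its ends.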

Let's take the 3 highest and 3 lowest correlation values from this list and look at some plots.

In [20]:
import plotly.express as px

sorted_array_corr = sorted(array_corr, key= lambda x:x[2])

top1, top2, top3 = sorted_array_corr[-1],sorted_array_corr[-2],sorted_array_corr[-3]
bot1, bot2, bot3 = sorted_array_corr[0],sorted_array_corr[1],sorted_array_corr[2]

print("HIGHEST CORRELATIONS")
print('############################################################################################################')
top1_fig = px.scatter(base_corr,x=top1[0],y=top1[1], width=800, height=450)
top1_fig.show()
print('############################################################################################################')
top2_fig = px.scatter(base_corr,x=top2[0],y=top2[1], width=800, height=450)
top2_fig.show()
print('############################################################################################################')
top3_fig = px.scatter(base_corr,x=top3[0],y=top3[1], width=800, height=450)
top3_fig.show()

print('############################################################################################################')

print("LOWEST CORRELATIONS")
print('############################################################################################################')
bot1_fig = px.scatter(base_corr,x=bot1[0],y=bot1[1], width=800, height=450)
bot1_fig.show()
print('############################################################################################################')
bot2_fig = px.scatter(base_corr,x=bot2[0],y=bot2[1], width=800, height=450)
bot2_fig.show()
print('############################################################################################################')
bot3_fig = px.scatter(base_corr,x=bot3[0],y=bot3[1], width=800, height=450)
bot3_fig.show()

HIGHEST CORRELATIONS
############################################################################################################


############################################################################################################


############################################################################################################


############################################################################################################
LOWEST CORRELATIONS
############################################################################################################


############################################################################################################


############################################################################################################


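We can observe some linear relationships in the correlation plots, which may indicate that a linear algorithm using those columns could give a good prediction. The relationship is clearest between 'StreamingMovies' and 'StreamingTV', which are closely related in the business model. Note that these scatter plots are drawn from `base_corr`, so each point is one feature's pair of correlation coefficients rather than an individual customer.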

In [21]:

for each in base.columns:
  if each != 'Churn' and each != 'customerID':
    grafico = px.bar(base,x=base[each], y=base_tratada['Churn'])
    grafico.show()

With these charts, the strategy team can already extract some useful insights.

Let's run cross-validation to decide which model to use.

In [22]:
#x = base_tratada.iloc[:,0:-1]
#y = base_tratada.iloc[:,-1]

In [23]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_base,y_base, test_size = 0.1, random_state = 0, shuffle=True)

x_train.shape, x_test.shape

((6338, 46), (705, 46))
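Since only about 26.5% of the rows are churners (the mean of the encoded `Churn` column), a purely random split can leave the test set with a somewhat different class ratio. One option, shown here as a sketch on synthetic labels rather than on the notebook's data, is the `stratify` argument of `train_test_split`. The arrays `X_toy` and `y_toy` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: roughly 26.5% positives
rng = np.random.default_rng(0)
y_toy = (rng.random(2000) < 0.265).astype(int)
X_toy = rng.normal(size=(2000, 5))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy
)

# With stratify, the positive rate in both parts closely matches the full sample
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```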

In [24]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neural_network import MLPClassifier


from sklearn.model_selection import cross_val_score, KFold, GridSearchCV

from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning

# Setting required so that some algorithms do not raise an error
import os
os.environ['LOKY_MAX_CPU_COUNT'] = '2'

In [25]:
Dec_Tree = []
Rand_Fore = []
KMeans = []
KNN = []
Neural = []


for i in range(30):

    kfold = KFold(n_splits=10, shuffle=True, random_state=i)

    dec_tree_model = DecisionTreeClassifier()
    scores = cross_val_score(dec_tree_model, x_base, y_base, cv=kfold)
    Dec_Tree.append(scores.mean())


    rand_fore_model = RandomForestClassifier()
    scores = cross_val_score(rand_fore_model, x_base, y_base, cv=kfold)
    Rand_Fore.append(scores.mean())

    knn_model = KNeighborsClassifier()
    scores = cross_val_score(knn_model, x_base, y_base, cv=kfold)
    KNN.append(scores.mean())

    neural_model = MLPClassifier()
    scores = cross_val_score(neural_model, x_base, y_base, cv=kfold)
    Neural.append(scores.mean())



Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.





In [26]:
print(Dec_Tree)

[np.float64(0.7255442053513862), np.float64(0.7235519422952933), np.float64(0.7315066892327531), np.float64(0.725965103159252), np.float64(0.7263960751128304), np.float64(0.7204295615731786), np.float64(0.7315089055448099), np.float64(0.7340635074145713), np.float64(0.730086637653127), np.float64(0.7258333333333333), np.float64(0.7333502578981302), np.float64(0.7306548194713087), np.float64(0.7231280222437138), np.float64(0.7349167875564152), np.float64(0.7289490651192779), np.float64(0.7295152321083173), np.float64(0.7302300934880723), np.float64(0.7359076805286912), np.float64(0.7319314152159897), np.float64(0.7289502740167634), np.float64(0.7224161831076725), np.float64(0.7312268294648614), np.float64(0.7208558994197292), np.float64(0.7361945921985815), np.float64(0.7324941569954868), np.float64(0.7285227272727273), np.float64(0.7293737911025145), np.float64(0.7305095502901354), np.float64(0.7242587443584785), np.float64(0.7228493713733075)]


In [27]:
print("Decision Tree mean: ",np.array(Dec_Tree).mean())
print("Random Forest mean: ",np.array(Rand_Fore).mean())
print("KNN mean: ",np.array(KNN).mean())
print("Neural Network mean: ",np.array(Neural).mean())

Decision Tree mean:  0.728704048463357
Random Forest mean:  0.7879927600472814
KNN mean:  0.7666570760799485
Neural Network mean:  0.7820386847195357
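To put these accuracies in context, it helps to compare them with the accuracy of always predicting the majority class. The arithmetic below is a small sketch that uses only numbers already printed in this notebook: the mean of the encoded `Churn` column from `describe()` and the four cross-validated means above. The variable names are introduced for the example.

```python
# Mean of the encoded Churn column, from the describe() output above
churn_rate = 0.26537

# Accuracy of always predicting the majority class ("no churn")
baseline_accuracy = 1 - churn_rate

# Mean cross-validated accuracies printed above
cv_means = {
    "Decision Tree": 0.728704048463357,
    "Random Forest": 0.7879927600472814,
    "KNN": 0.7666570760799485,
    "Neural Network": 0.7820386847195357,
}

for name, acc in cv_means.items():
    print(f"{name}: {acc - baseline_accuracy:+.4f} vs. majority baseline")
```

With a baseline of about 0.7346, the untuned decision tree falls slightly below it, while the other three models sit between three and five percentage points above.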


Let's proceed with the neural network, random forest, and decision tree, and tune their hyperparameters. The cells below also tune KNN, SVC, and LinearSVC for comparison.
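A variation worth considering for the tuning step is to wrap preprocessing and model in a `Pipeline`, so that the scaler is refit inside every fold instead of once on the whole base. The sketch below uses synthetic data from `make_classification`, not the Telco data. The step names `scale` and `model`, the `f1` scoring choice, and the reduced grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a class imbalance similar to the churn rate
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.73, 0.27], random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "model__criterion": ["gini", "entropy"],
    "model__min_samples_leaf": [1, 2, 3],
}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=cv, scoring="f1")
grid.fit(X_toy, y_toy)

print(grid.best_params_, round(grid.best_score_, 3))
```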

In [28]:
params = {
    'criterion':['gini','entropy'],
    'min_samples_split':[2,4,6],
    'min_samples_leaf':[1,2,3],
    'random_state':[0]
}

grid = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=params, cv=10)
grid.fit(x_base, y_base)

print(grid.best_score_)
print(grid.best_params_)

print(grid.cv_results_)


0.7397453255963895
{'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 6, 'random_state': 0}
{'mean_fit_time': array([0.06969986, 0.05818753, 0.05600827, 0.05431483, 0.05727158,
       0.05298061, 0.05327582, 0.05200498, 0.05362272, 0.06778634,
       0.06623671, 0.0670954 , 0.06687465, 0.06344275, 0.06404665,
       0.06190248, 0.07366948, 0.08904431]), 'std_fit_time': array([0.01730094, 0.00265099, 0.00347624, 0.00143795, 0.0033307 ,
       0.00110071, 0.0024859 , 0.00144265, 0.0035196 , 0.00465044,
       0.00207461, 0.00466206, 0.00491987, 0.00186689, 0.00432356,
       0.00326379, 0.01272578, 0.00410427]), 'mean_score_time': array([0.00182021, 0.00169401, 0.00150874, 0.00149074, 0.00161033,
       0.00146205, 0.00156648, 0.00155964, 0.00152388, 0.0016408 ,
       0.00146387, 0.0016578 , 0.00156269, 0.00149205, 0.0016669 ,
       0.00161312, 0.00181725, 0.00206213]), 'std_score_time': array([3.84965115e-04, 2.72966842e-04, 1.14134112e-04, 2.47756255e-05,
       2.7346

In [29]:
params = {
    'n_estimators':[75,100,125],
    'criterion':['gini','entropy'],
    'min_samples_split':[2,4],
    'min_samples_leaf':[1,2,3],
    'random_state':[0]
}
grid = GridSearchCV(RandomForestClassifier(),param_grid=params, cv=10)
grid.fit(x_base, y_base)

print(grid.best_score_)
print(grid.best_params_)

print(grid.cv_results_)

0.8022124838813667
{'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100, 'random_state': 0}
{'mean_fit_time': array([0.74010499, 1.04085653, 1.33790565, 0.74813166, 0.94690506,
       1.20966682, 0.71446986, 0.91845903, 1.13567693, 0.62172849,
       0.91951082, 1.12280898, 0.58607202, 0.85199656, 1.0701525 ,
       0.58901002, 0.86374276, 1.05585129, 0.85837338, 1.04056518,
       1.43070335, 0.74303401, 1.02984788, 1.25897982, 0.75764027,
       0.90023816, 1.20198059, 0.76397147, 0.98853369, 1.2192353 ,
       0.63879092, 0.91886525, 1.11996791, 0.7092021 , 0.83955812,
       1.14527566]), 'std_fit_time': array([0.12126988, 0.14717411, 0.21436334, 0.11766699, 0.05920451,
       0.1774966 , 0.15670617, 0.1384328 , 0.14470089, 0.02035434,
       0.13069826, 0.14428194, 0.01515346, 0.1436693 , 0.15049869,
       0.02102742, 0.16268485, 0.16282754, 0.14512811, 0.06761275,
       0.2290478 , 0.01938294, 0.16932352, 0.17110041, 0.11891353,
       0.0

In [30]:
params = {
    'n_neighbors':[2,5,10,15],
    'weights':['uniform','distance'],
    'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size':[15,20,30,45,60],
    'p':[1.0,2.0],
    }
grid = GridSearchCV(KNeighborsClassifier(),param_grid=params, cv=10)
grid.fit(x_base, y_base)

print(grid.best_score_)
print(grid.best_params_)

print(grid.cv_results_)

0.787447412959381
{'algorithm': 'auto', 'leaf_size': 15, 'n_neighbors': 15, 'p': 2.0, 'weights': 'uniform'}
{'mean_fit_time': array([0.00312436, 0.00283525, 0.00295761, 0.00278287, 0.00249856,
       0.00259151, 0.0025084 , 0.00259159, 0.00316632, 0.00267322,
       0.00260057, 0.00239007, 0.00255098, 0.0026793 , 0.00255659,
       0.00266085, 0.00303304, 0.00297337, 0.002703  , 0.00252519,
       0.00244739, 0.00250955, 0.00247018, 0.00264657, 0.00298281,
       0.00252495, 0.00256422, 0.00241077, 0.00252531, 0.00244591,
       0.00255227, 0.00251353, 0.00317988, 0.00246012, 0.00246911,
       0.00251114, 0.00246439, 0.00247214, 0.00254555, 0.0026051 ,
       0.00300527, 0.00262117, 0.00261619, 0.00254388, 0.00258517,
       0.00271924, 0.00263851, 0.0028796 , 0.00307157, 0.00288823,
       0.00259831, 0.00265934, 0.00310218, 0.00254817, 0.00265608,
       0.0025219 , 0.003567  , 0.00254505, 0.00251126, 0.00260344,
       0.00305147, 0.00254478, 0.00258446, 0.00275691, 0.00304453,
   

In [31]:
params = {
    'hidden_layer_sizes':[(4,4,4)],
    'activation':['tanh','relu'],
    'solver':['lbfgs','sgd'],
    'learning_rate':['adaptive'],
    'max_iter':[1000],
    'random_state':[0],
    }
grid = GridSearchCV(MLPClassifier(),param_grid=params, cv=10)
grid.fit(x_base, y_base)

print(grid.best_score_)
print(grid.best_params_)

print(grid.cv_results_)


lbfgs failed to converge (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html



0.8009393133462283
{'activation': 'relu', 'hidden_layer_sizes': (4, 4, 4), 'learning_rate': 'adaptive', 'max_iter': 1000, 'random_state': 0, 'solver': 'lbfgs'}
{'mean_fit_time': array([24.21494734,  6.78560154, 10.79979355,  4.36270401]), 'std_fit_time': array([3.67454358, 3.67330634, 3.2115706 , 0.62379406]), 'mean_score_time': array([0.00735099, 0.00230496, 0.00369115, 0.00167885]), 'std_score_time': array([0.00816058, 0.00038158, 0.00174887, 0.00038473]), 'param_activation': masked_array(data=['tanh', 'tanh', 'relu', 'relu'],
             mask=[False, False, False, False],
       fill_value=np.str_('?'),
            dtype=object), 'param_hidden_layer_sizes': masked_array(data=[(4, 4, 4), (4, 4, 4), (4, 4, 4), (4, 4, 4)],
             mask=[False, False, False, False],
       fill_value=np.str_('?'),
            dtype=object), 'param_learning_rate': masked_array(data=['adaptive', 'adaptive', 'adaptive', 'adaptive'],
             mask=[False, False, False, False],
       fill_value=np





In [32]:
params = {
    'kernel':['rbf', 'sigmoid'],
    'gamma':['auto'],
    'C':[1.0,1.5,2.0],
    'tol':[0.005],
    'max_iter':[1000,1500,2000],
    'random_state':[0]
    }

grid = GridSearchCV(SVC(),param_grid=params, cv=10)
grid.fit(x_base, y_base)

print(grid.best_score_)
print(grid.best_params_)

print(grid.cv_results_)


Solver terminated early (max_iter=1000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.



0.8023547308188265
{'C': 1.5, 'gamma': 'auto', 'kernel': 'rbf', 'max_iter': 2000, 'random_state': 0, 'tol': 0.005}
{'mean_fit_time': array([0.98902991, 1.52600472, 1.56744959, 1.27969818, 1.96152706,
       2.20406487, 1.00904672, 1.41784902, 1.56205707, 1.2824815 ,
       1.95595741, 2.13527429, 0.9852999 , 1.38691385, 1.51283841,
       1.31116323, 1.84176407, 2.10953264]), 'std_fit_time': array([0.17871619, 0.23636376, 0.19609603, 0.1848608 , 0.21108889,
       0.39109875, 0.19676985, 0.18130662, 0.22647519, 0.19296661,
       0.38646144, 0.34305571, 0.18077338, 0.18788639, 0.24278936,
       0.21791461, 0.27149549, 0.31764384]), 'mean_score_time': array([0.14357257, 0.21556571, 0.21036785, 0.12134628, 0.18551822,
       0.18754969, 0.14013662, 0.20672622, 0.21316569, 0.12355447,
       0.17943916, 0.18111126, 0.15079985, 0.20026793, 0.21569386,
       0.12695048, 0.17334533, 0.1882401 ]), 'std_score_time': array([0.028566  , 0.04831795, 0.03853461, 0.01414398, 0.02890201,
       0.


Solver terminated early (max_iter=2000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.



In [33]:
# @title
params = {
    'penalty':['l2'],
    'loss':['hinge', 'squared_hinge'],
    'C':[1.0,1.3,1.7,2.0,2.5,3,3.5,4],
    'max_iter':[500,1000,1500,2000,3000],
    'random_state':[0]
    }

grid = GridSearchCV(LinearSVC(),param_grid=params, cv=10)
grid.fit(x_base, y_base)

print(grid.best_score_)
print(grid.best_params_)

print(grid.cv_results_)



Liblinear failed to converge, increase the number of iterations.



0.8029237185686654
{'C': 3, 'loss': 'squared_hinge', 'max_iter': 500, 'penalty': 'l2', 'random_state': 0}
{'mean_fit_time': array([0.04999721, 0.0526602 , 0.05919158, 0.05943923, 0.06423051,
       0.02621329, 0.02795181, 0.02609992, 0.02729278, 0.02876999,
       0.07366166, 0.09372036, 0.1130873 , 0.0796499 , 0.0750566 ,
       0.02674735, 0.02668152, 0.02793818, 0.0260571 , 0.02634475,
       0.06755335, 0.0780777 , 0.07850072, 0.08502123, 0.08980091,
       0.02691553, 0.02905769, 0.02680604, 0.02706554, 0.02623856,
       0.07758222, 0.08445363, 0.12646027, 0.1345083 , 0.10258424,
       0.02660658, 0.02719493, 0.02557161, 0.02797215, 0.02642844,
       0.0876894 , 0.09629157, 0.10388513, 0.10955267, 0.11904392,
       0.0262892 , 0.02782457, 0.02745516, 0.02616827, 0.02745395,
       0.11463332, 0.17962773, 0.13926513, 0.12584558, 0.13164554,
       0.02722313, 0.02744596, 0.02659898, 0.02635543, 0.02637956,
       0.10962307, 0.12416997, 0.13236718, 0.14007869, 0.21362386,
     

The best scores are separated by a very small margin: LinearSVC (0.8029), SVC with an RBF kernel (0.8024), and Random Forest (0.8022). We will use the RBF-kernel SVC.
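When the margins are this small, the spread of the scores across folds helps to judge whether the difference is meaningful. The sketch below reads `mean_test_score` and `std_test_score` for the best candidate from `cv_results_`. It runs on synthetic data, and the candidate grids and the `summary` dictionary are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data (not the Telco dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "SVC": (SVC(gamma="auto", random_state=0), {"C": [1.0, 1.5]}),
    "LinearSVC": (LinearSVC(max_iter=5000, random_state=0), {"C": [1.0, 3.0]}),
}

summary = {}
for name, (estimator, grid_params) in candidates.items():
    grid = GridSearchCV(estimator, param_grid=grid_params, cv=5).fit(X_toy, y_toy)
    best = grid.best_index_
    # Mean and standard deviation of the test score over the folds
    summary[name] = (
        grid.cv_results_["mean_test_score"][best],
        grid.cv_results_["std_test_score"][best],
    )

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

If the standard deviations are larger than the gap between the means, the ranking of the models should be read as a tie rather than a clear win.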

In [256]:
# Convert the lists to dicts for easier lookup by column name

dict_encoders = dict(array_encoders)
dict_scalers = dict(array_scalers)

In [257]:
# Make a new split for this model and treat the test results as real predictions

x = base.iloc[:,1:-1].copy()
y = base.iloc[:,-1].copy()

x['tenure'] = x['tenure'].replace(' ',0).astype(float)
x['MonthlyCharges'] = x['MonthlyCharges'].replace(' ',0).astype(float)
x['TotalCharges'] = x['TotalCharges'].replace(' ',0).astype(float)



In [258]:
for key, value in dict_encoders.items():
  encoder = value

  if key == 'Churn':
    y = encoder.transform(np.array(y)).astype(int)
  else:
    x[key] = encoder.transform(np.array(x[key])).astype(int)

In [259]:
for key, value in dict_scalers.items():
  encoder = value

  x[key] = encoder.transform(np.array(x[key]).reshape(-1,1))

In [260]:
x_scalers = x[colunas_scaler].copy()
x = oh_encoder.transform(x)
x = np.append(x,x_scalers,axis=1)
x

array([[ 1.        ,  0.        ,  1.        , ..., -0.99261052,
        -1.16032292, -1.27744458],
       [ 0.        ,  1.        ,  1.        , ..., -0.17216471,
        -0.25962894,  0.06632742],
       [ 0.        ,  1.        ,  1.        , ..., -0.9580659 ,
        -0.36266036, -1.23672422],
       ...,
       [ 1.        ,  0.        ,  1.        , ..., -0.85293201,
        -1.1686319 , -0.87024095],
       [ 0.        ,  1.        ,  0.        , ..., -0.87051315,
         0.32033821, -1.15528349],
       [ 0.        ,  1.        ,  1.        , ...,  2.01389665,
         1.35896134,  1.36937906]])

In [308]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.1, random_state = 0, shuffle=True)

In [309]:
y_train

array([1, 0, 0, ..., 1, 1, 0])

In [310]:
modelo = SVC(C=1.5, gamma='auto', kernel='rbf', max_iter=2000, random_state=0, tol=0.005)
modelo.fit(x_train, y_train)


Solver terminated early (max_iter=2000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
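The warning above means the solver stopped at the 2000-iteration cap before reaching the requested tolerance. A minimal sketch of how one might check this programmatically, using synthetic data instead of the churn features (`fit_and_count_warnings`, `X_demo` and `y_demo` are hypothetical names): scikit-learn reports the early stop as a `ConvergenceWarning`, and `max_iter=-1`, the `SVC` default, removes the cap.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVC

# Synthetic stand-in for the churn features (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)


def fit_and_count_warnings(max_iter):
    """Fit an SVC and count the ConvergenceWarnings raised during fit."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = SVC(C=1.5, gamma="auto", kernel="rbf",
                    max_iter=max_iter, tol=0.005, random_state=0)
        model.fit(X_demo, y_demo)
    n_warnings = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, n_warnings


# A deliberately tiny cap, to provoke the early stop.
capped_model, capped_warnings = fit_and_count_warnings(5)
# No cap: the solver runs until the tolerance is met.
free_model, free_warnings = fit_and_count_warnings(-1)

print(capped_warnings, free_warnings)
```

On the real data, refitting without the cap and comparing test accuracy would show whether the early stop affects the results reported here.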



In [311]:
from sklearn.metrics import accuracy_score, classification_report
from yellowbrick.classifier import ConfusionMatrix

y_pred = modelo.predict(x_test)

print('ACCURACY SCORE: ',accuracy_score(y_test, y_pred))
print("#######################################################################")
print(classification_report(y_test, y_pred))


ACCURACY SCORE:  0.8099290780141843
#######################################################################
              precision    recall  f1-score   support

           0       0.84      0.92      0.88       533
           1       0.65      0.47      0.55       172

    accuracy                           0.81       705
   macro avg       0.75      0.70      0.71       705
weighted avg       0.80      0.81      0.80       705
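To make the report easier to read, the class-1 (churn) metrics can be recomputed from confusion-matrix counts. The counts below are an assumption: the notebook does not print the matrix, so they were inferred as one set of integers consistent with the rounded figures and the supports (533 and 172) shown above.

```python
# Counts inferred from the rounded report (assumption, not notebook output).
tn, fp, fn, tp = 490, 43, 91, 81

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_churn = tp / (tp + fp)   # of the predicted churners, how many churned
recall_churn = tp / (tp + fn)      # of the real churners, how many were caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision_churn:.2f} recall={recall_churn:.2f} f1={f1_churn:.2f}")
```

Under these counts the model misses a little over half of the customers who actually churn, which the overall accuracy of 0.81 does not show on its own.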



Now let's make predictions on a random sample of the dataset, similar to the test set, but organized so that we can inspect it afterwards, as if it were data being ingested by a regular pipeline. So we will not train anything new; we only transform.
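As an aside, the "fit once, only transform afterwards" idea can also be expressed with a scikit-learn `Pipeline`. This is a sketch under stated assumptions, not the approach used in this notebook: the data frame values are toy values, only three columns are included, and the names `preprocess`, `churn_pipeline`, `train_df` and `new_batch` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy training data (illustrative values only).
train_df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year",
                 "Month-to-month", "One year", "Two year"],
    "tenure": [1, 34, 60, 2, 45, 72],
    "MonthlyCharges": [29.85, 56.95, 90.0, 70.7, 42.3, 110.0],
})
train_labels = [1, 0, 0, 1, 0, 0]

# One-hot encode the categorical column, scale the numeric ones.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ("numeric", StandardScaler(), ["tenure", "MonthlyCharges"]),
])

churn_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", SVC(kernel="rbf", gamma="auto")),
])

# fit() learns the encoders, the scalers and the model in one call.
churn_pipeline.fit(train_df, train_labels)

# predict() on new rows only transforms; nothing is refitted.
new_batch = pd.DataFrame({
    "Contract": ["Two year", "Month-to-month"],
    "tenure": [66, 3],
    "MonthlyCharges": [65.85, 75.0],
})
new_predictions = churn_pipeline.predict(new_batch)
print(new_predictions)
```

The design benefit is that the preprocessing and the model travel together as one object, so the manual encode/scale steps do not need to be repeated for each new batch.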

In [None]:
# @title
x_teste_final = base.iloc[:,1:-1].copy().sample(frac=0.1)
y_teste_final = base.iloc[x_teste_final.index, -1]

x_teste_final_tratado = x_teste_final.copy()

x_teste_final_tratado['tenure'] = x_teste_final_tratado['tenure'].replace(' ',0).astype(int)
x_teste_final_tratado['MonthlyCharges'] = x_teste_final_tratado['MonthlyCharges'].replace(' ',0).astype(float)
x_teste_final_tratado['TotalCharges'] = x_teste_final_tratado['TotalCharges'].replace(' ',0).astype(float)

# Counter used to walk through the list of encoders, since the column positions do not match the list indices
list_iterator = 0

for each in colunas_ohe_list:
  x_teste_final_tratado.iloc[:,each] = array_encoders[list_iterator][1].transform(x_teste_final_tratado.iloc[:,each]).astype(int)

  list_iterator += 1

# Encode the target; a LabelEncoder is not needed for this one
y_teste_final_tratado = y_teste_final.replace('No',0).replace('Yes',1).values

# Now apply the scaling
# First get the positions of the columns we are going to transform

colunas_scaler_number = []
for each in colunas_scaler:
  colunas_scaler_number.append(x_teste_final.columns.get_loc(each))

list_iterator = 0

for each in colunas_scaler_number:
  print(each)
  x_teste_final_tratado.iloc[:,each] = array_scalers[list_iterator][1].transform(np.array(x_teste_final_tratado.iloc[:,each]).reshape(-1,1))

  list_iterator += 1

# Finally, apply the one-hot encoding

x_teste_final_ohe = oh_encoder.transform(x_teste_final_tratado)

# Join the one-hot encoded columns with the scaled columns
x_teste_final_tratado = np.append(x_teste_final_ohe,x_teste_final_tratado[colunas_scaler],axis=1)

In [319]:
sample = base.iloc[:,:].copy().sample(frac=0.1)

x_teste_final = sample.iloc[:,1:-1].copy()
y_teste_final = sample.iloc[:, -1]

x_teste_final['tenure'] = x_teste_final['tenure'].replace(' ',0).astype(float)
x_teste_final['MonthlyCharges'] = x_teste_final['MonthlyCharges'].replace(' ',0).astype(float)
x_teste_final['TotalCharges'] = x_teste_final['TotalCharges'].replace(' ',0).astype(float)

x_teste_final_tratado = x_teste_final.copy()


for key, value in dict_encoders.items():
  encoder = value

  if key == 'Churn':
    y_teste_final = encoder.transform(np.array(y_teste_final)).astype(int)
  else:
    x_teste_final_tratado[key] = encoder.transform(np.array(x_teste_final_tratado[key])).astype(int)

y_teste_final_tratado = y_teste_final.copy()

for key, value in dict_scalers.items():
  encoder = value

  x_teste_final_tratado[key] = encoder.transform(np.array(x_teste_final_tratado[key]).reshape(-1,1))

x_teste_final_scalers = x_teste_final_tratado[colunas_scaler].copy()
x_teste_final_tratado = oh_encoder.transform(x_teste_final_tratado)
x_teste_final_tratado = np.append(x_teste_final_tratado,x_teste_final_scalers,axis=1)

In [320]:
y_pred = modelo.predict(x_teste_final_tratado)
accuracy_score(y_teste_final, y_pred)

0.8053977272727273

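A result very close to the test-set accuracy of about 0.81. Keep in mind that this sample is drawn from the whole dataset, so on average about 90% of its rows were also used to train the model.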

In [321]:
DataframeResultado = sample.copy()
DataframeResultado['Churn Previsto'] = y_pred

DataframeResultado['Churn Previsto'] = dict_encoders['Churn'].inverse_transform(DataframeResultado['Churn Previsto'])

DataframeResultado

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Churn Previsto
3860,9862-KJTYK,Male,0,No,Yes,19,No,No phone service,DSL,No,No,No,No,No,No,Month-to-month,Yes,Credit card (automatic),25.35,566.1,No,No
5109,7113-HIPFI,Male,0,Yes,Yes,66,Yes,No,DSL,Yes,Yes,Yes,Yes,No,No,Two year,No,Mailed check,65.85,4097.05,No,No
1578,1205-WNWPJ,Female,0,No,No,7,Yes,No,DSL,No,No,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,59.50,415.95,Yes,No
2584,3969-JQABI,Female,0,Yes,No,58,Yes,No,DSL,Yes,Yes,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),65.25,3791.6,No,No
6403,3258-ZKPAI,Male,0,Yes,Yes,72,Yes,Yes,Fiber optic,Yes,Yes,Yes,Yes,Yes,Yes,Two year,Yes,Bank transfer (automatic),116.60,8337.45,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6649,4020-KIUDI,Male,0,Yes,Yes,6,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Credit card (automatic),19.85,138.85,No,No
1634,9995-HOTOH,Male,0,Yes,Yes,63,No,No phone service,DSL,Yes,Yes,Yes,No,Yes,Yes,Two year,No,Electronic check,59.00,3707.6,No,No
6127,6198-PNNSZ,Female,0,Yes,No,56,Yes,Yes,Fiber optic,Yes,Yes,Yes,No,Yes,Yes,One year,No,Bank transfer (automatic),109.80,6109.65,No,No
4414,5372-FBKBN,Female,0,No,Yes,21,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,20.75,452.2,No,No


In the last two columns we can see the actual churn and the churn predicted by our model.

Now let's predict a few records that we will generate at random, just for fun.
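One possible way to generate such records is to draw each column independently from the distinct values observed in that column. The sketch below is an alternative to a row-by-row loop, shown on a tiny hypothetical frame (`demo_base`, `random_records` and `n_records` are assumed names); drawing per column keeps each column's original dtype.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the dataset.
demo_base = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

rng = np.random.default_rng(0)
n_records = 100

# For each column, draw n_records values uniformly from its distinct values.
random_records = pd.DataFrame({
    column: rng.choice(demo_base[column].drop_duplicates().to_numpy(), size=n_records)
    for column in demo_base.columns
})

print(random_records.head())
```

Since every column is drawn independently, the combinations can be ones that never occur in practice, and the `Churn` column carries no relationship to the features.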




In [371]:
predict_values = base.iloc[0:1,1:].copy().truncate(after=-1)
temp = pd.DataFrame(np.array(temporary_values).reshape(1,-1), columns=predict_values.columns)

In [372]:
temp

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Male,0,No,Yes,42,Yes,Yes,No,No internet service,Yes,No,No,No internet service,No internet service,One year,Yes,Credit card (automatic),87.7,1592.35,No


In [374]:
predict_values = pd.concat([predict_values, pd.DataFrame(temp)])

predict_values

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Male,0,No,Yes,42,Yes,Yes,No,No internet service,Yes,No,No,No internet service,No internet service,One year,Yes,Credit card (automatic),87.7,1592.35,No
0,Male,0,No,Yes,42,Yes,Yes,No,No internet service,Yes,No,No,No internet service,No internet service,One year,Yes,Credit card (automatic),87.7,1592.35,No


In [375]:
import random

# First, create the records
print(base.columns[1:])

predict_values = base.iloc[0:1,1:].copy().truncate(after=-1)

# On each pass of the loop, a temporary list collects one randomly chosen value per column; together they form a single record,
# which gives us random records

for i in range(100):
  temporary_values = []
  for each in base.columns[1:]:
    temporary_values.append(
        np.unique(base[each])[random.randint(0,len(np.unique(base[each]))-1)]
        )
  predict_values = pd.concat([predict_values, pd.DataFrame(np.array(temporary_values).reshape(1,-1), columns=predict_values.columns)])


Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
    gender SeniorCitizen Partner Dependents tenure PhoneService  \
0     Male             1      No        Yes     10           No   
0     Male             1      No         No      9          Yes   
0   Female             0     Yes         No     58           No   
0   Female             1      No        Yes     29           No   
0     Male             1     Yes        Yes     60           No   
..     ...           ...     ...        ...    ...          ...   
0   Female             1      No         No     36          Yes   
0   Female             0     Yes         No     57           No   
0   Female             0      

In [376]:
predict_values

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Male,1,No,Yes,10,No,No,Fiber optic,No internet service,Yes,No,Yes,No,No internet service,One year,No,Credit card (automatic),88.8,4103.9,No
0,Male,1,No,No,9,Yes,Yes,DSL,No internet service,No internet service,No,No,No internet service,No,Month-to-month,Yes,Bank transfer (automatic),60.15,107.25,Yes
0,Female,0,Yes,No,58,No,No phone service,Fiber optic,No internet service,No internet service,No,No,No,Yes,One year,Yes,Credit card (automatic),29.95,1327.4,Yes
0,Female,1,No,Yes,29,No,No,Fiber optic,Yes,No internet service,No internet service,Yes,Yes,Yes,Month-to-month,No,Bank transfer (automatic),115.05,665.45,Yes
0,Male,1,Yes,Yes,60,No,No phone service,Fiber optic,No internet service,Yes,No,No,No,No internet service,Month-to-month,Yes,Bank transfer (automatic),50.1,543.8,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,Female,1,No,No,36,Yes,Yes,DSL,No internet service,No,No internet service,No internet service,Yes,No internet service,Two year,No,Electronic check,67.65,78.45,Yes
0,Female,0,Yes,No,57,No,No,No,Yes,Yes,No,No,Yes,No,Two year,Yes,Bank transfer (automatic),103.2,1250.1,No
0,Female,0,No,Yes,1,No,No phone service,No,Yes,No,No internet service,No,No internet service,Yes,One year,No,Electronic check,84.4,2627.35,No
0,Male,0,Yes,No,38,No,No phone service,No,No internet service,Yes,Yes,No internet service,Yes,Yes,One year,No,Mailed check,107.6,308.05,No


In [377]:
sample = predict_values.iloc[:,:].copy()

x_teste_final = sample.iloc[:,0:-1].copy()
y_teste_final = sample.iloc[:, -1]

x_teste_final['tenure'] = x_teste_final['tenure'].replace(' ',0).astype(float)
x_teste_final['MonthlyCharges'] = x_teste_final['MonthlyCharges'].replace(' ',0).astype(float)
x_teste_final['TotalCharges'] = x_teste_final['TotalCharges'].replace(' ',0).astype(float)

x_teste_final_tratado = x_teste_final.copy()


for key, value in dict_encoders.items():
  encoder = value

  if key == 'Churn':
    y_teste_final = encoder.transform(np.array(y_teste_final)).astype(int)
  else:
    x_teste_final_tratado[key] = encoder.transform(np.array(x_teste_final_tratado[key])).astype(int)

y_teste_final_tratado = y_teste_final.copy()

for key, value in dict_scalers.items():
  encoder = value

  x_teste_final_tratado[key] = encoder.transform(np.array(x_teste_final_tratado[key]).reshape(-1,1))

x_teste_final_scalers = x_teste_final_tratado[colunas_scaler].copy()
x_teste_final_tratado = oh_encoder.transform(x_teste_final_tratado)
x_teste_final_tratado = np.append(x_teste_final_tratado,x_teste_final_scalers,axis=1)

In [378]:
y_pred = modelo.predict(x_teste_final_tratado)

accuracy_score(y_teste_final, y_pred)

0.53

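With this set of fully random values we reached an accuracy of just over 50% (0.53). Note that the Churn label was drawn at random too, independently of the features, so about 50% is what any classifier would be expected to score here. The figure works as a sanity check that the pipeline accepts arbitrary records, and should be read with caution as a statement about the model's robustness.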
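To put the 0.53 in context, the sketch below simulates accuracy on 100 records whose labels are drawn uniformly at random, independently of any prediction. It is a stand-alone simulation with hypothetical names, not a rerun of the model; for 100 such records the accuracy has mean 0.5 and standard deviation 0.05.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_records = 2000, 100

# Labels drawn uniformly from {0, 1}, as when Churn is picked at random.
random_labels = rng.integers(0, 2, size=(n_trials, n_records))

# Any fixed prediction rule works here; we always predict "no churn" (0).
fixed_predictions = np.zeros(n_records, dtype=int)

# Accuracy of that rule in each simulated batch of 100 records.
accuracies = (random_labels == fixed_predictions).mean(axis=1)

mean_accuracy = accuracies.mean()
share_at_least_53 = (accuracies >= 0.53).mean()

print(f"mean accuracy: {mean_accuracy:.3f}")
print(f"share of batches with accuracy >= 0.53: {share_at_least_53:.2f}")
```

A score of 0.53 is within one standard deviation of 0.5, so a sizeable share of purely random batches reaches it.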