# Case 6 - Classification on a Churn Dataset
- Run the complete CRISP-DM process
- Understand the data
- Clean and format the data appropriately
- Find a model
- Tune that model and plot the results

In [None]:
!pip install ydata-profiling

In [63]:
import pandas as pd
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

In [35]:
churn = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Classificação/churn_data.xlsx')
churn.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Getting to Know the Data - Data Understanding

In [36]:
ProfileReport(df=churn,title="Profiling Report")




## Preparing the Data - Data Preparation

In [37]:
# The customerID column does not seem relevant to the analysis, so it is dropped
churn = churn.drop(columns=['customerID'])

In [38]:
churn.head(3)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes


All columns have the correct data types, as expected.

In [39]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7032 non-null   object 
 1   SeniorCitizen     7032 non-null   int64  
 2   Partner           7032 non-null   object 
 3   Dependents        7032 non-null   object 
 4   tenure            7032 non-null   int64  
 5   PhoneService      7032 non-null   object 
 6   MultipleLines     7032 non-null   object 
 7   InternetService   7032 non-null   object 
 8   OnlineSecurity    7032 non-null   object 
 9   OnlineBackup      7032 non-null   object 
 10  DeviceProtection  7032 non-null   object 
 11  TechSupport       7032 non-null   object 
 12  StreamingTV       7032 non-null   object 
 13  StreamingMovies   7032 non-null   object 
 14  Contract          7032 non-null   object 
 15  PaperlessBilling  7032 non-null   object 
 16  PaymentMethod     7032 non-null   object 
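
The dtype inspection above can also be done programmatically. The sketch below uses a small hypothetical frame (the values are invented, not taken from the churn data) to show how a numeric column that was read as text could be detected and coerced with `pd.to_numeric`. It is only an illustration of the check, not a step the notebook needs.

```python
import pandas as pd

# Hypothetical toy frame: TotalCharges was read as text and has one blank entry.
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# List the columns that pandas does not treat as numeric.
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# errors="coerce" turns entries that cannot be parsed (here the blank) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isna().sum())
print(non_numeric, toy.dtypes["TotalCharges"], n_missing)
```

Any NaN values produced this way would then be handled by a step such as `dropna`.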


In [40]:
# Drop any rows that contain null values
churn.dropna(inplace=True)

In [41]:
# Removing duplicates

# Identify duplicated rows
churn[churn.duplicated(keep=False)]

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
22,Male,0,No,No,1,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,20.15,20.15,Yes
100,Male,0,No,No,1,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,20.2,20.2,No
541,Female,0,No,No,1,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,19.55,19.55,No
645,Male,0,No,No,1,Yes,No,DSL,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,45.7,45.7,Yes
661,Male,0,No,No,1,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,20.05,20.05,No
689,Male,0,No,No,1,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,Yes,Mailed check,20.45,20.45,No
961,Male,0,No,No,1,Yes,No,DSL,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,45.7,45.7,Yes
973,Male,0,No,No,1,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,69.9,69.9,Yes
1239,Male,0,No,No,1,Yes,No,DSL,No,No,No,No,No,No,Month-to-month,No,Electronic check,45.3,45.3,Yes
1334,Male,0,No,No,1,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,20.15,20.15,Yes


In [42]:
churn.drop_duplicates(inplace=True)
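
In the notebook, the row count goes from 7032 to 7010 after this step, so 22 repeated rows are dropped. The sketch below, on a hypothetical three-row frame, shows the difference between `duplicated(keep=False)`, used above for inspection, and the default behaviour that `drop_duplicates` relies on.

```python
import pandas as pd

# Hypothetical toy rows; the first and third are identical.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "tenure": [1, 1, 1],
    "MonthlyCharges": [45.7, 19.55, 45.7],
})

# keep=False flags every member of a duplicated group (useful for inspection).
all_copies = toy[toy.duplicated(keep=False)]

# The default keep="first" flags only the later repeats, i.e. the rows that get dropped.
n_to_drop = int(toy.duplicated().sum())

deduped = toy.drop_duplicates()
print(len(all_copies), n_to_drop, len(deduped))
```

Because `customerID` was removed earlier, rows that match on every remaining column may still belong to different customers, so whether to drop them is a judgment call.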

In [43]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7010 entries, 0 to 7031
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7010 non-null   object 
 1   SeniorCitizen     7010 non-null   int64  
 2   Partner           7010 non-null   object 
 3   Dependents        7010 non-null   object 
 4   tenure            7010 non-null   int64  
 5   PhoneService      7010 non-null   object 
 6   MultipleLines     7010 non-null   object 
 7   InternetService   7010 non-null   object 
 8   OnlineSecurity    7010 non-null   object 
 9   OnlineBackup      7010 non-null   object 
 10  DeviceProtection  7010 non-null   object 
 11  TechSupport       7010 non-null   object 
 12  StreamingTV       7010 non-null   object 
 13  StreamingMovies   7010 non-null   object 
 14  Contract          7010 non-null   object 
 15  PaperlessBilling  7010 non-null   object 
 16  PaymentMethod     7010 non-null   object 


### Adjusting the columns and splitting the dataset

In [44]:
# Split the data into features (x) and target (y)
x = churn.drop(columns=['Churn'])

y = churn[['Churn']].copy()

In [45]:
# Encode the target as integers and one-hot encode the categorical features
le = LabelEncoder()

le.fit(y.Churn)
y.Churn = le.transform(y.Churn)


x = pd.get_dummies(data=x)

In [46]:
x.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,29.85,29.85,1,0,0,1,1,0,...,0,1,0,0,0,1,0,0,1,0
1,0,34,56.95,1889.5,0,1,1,0,1,0,...,0,0,1,0,1,0,0,0,0,1
2,0,2,53.85,108.15,0,1,1,0,1,0,...,0,1,0,0,0,1,0,0,0,1
3,0,45,42.3,1840.75,0,1,1,0,1,0,...,0,0,1,0,1,0,1,0,0,0
4,0,2,70.7,151.65,1,0,1,0,1,0,...,0,1,0,0,0,1,0,0,1,0


In [47]:
y.head()

Unnamed: 0,Churn
0,0
1,0
2,1
3,0
4,1
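
The two encoding steps above can be illustrated on a hypothetical toy frame. The sketch shows the label mapping learned by `LabelEncoder` and, as an optional variation that the notebook does not use, how `drop_first=True` in `pd.get_dummies` removes one redundant column per feature (for example, `gender_Male` already determines `gender_Female`).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with the same kind of columns as the churn dataset.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "Churn": ["No", "No", "Yes"],
})

le = LabelEncoder()
y_toy = le.fit_transform(toy["Churn"])

# classes_ is sorted, so position 0 is "No" and position 1 is "Yes".
mapping = {str(label): code for code, label in enumerate(le.classes_)}

features = toy.drop(columns=["Churn"])
full = pd.get_dummies(features)                      # one column per category
reduced = pd.get_dummies(features, drop_first=True)  # one column fewer per feature

print(mapping, full.shape[1], reduced.shape[1])
```

Dropping the first level mainly matters for linear models without regularization; with the default L2 penalty used later, either encoding should work.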


### Splitting the data into training and test sets

In [48]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
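
The split above is random and unseeded, so the metrics will vary between runs. A possible variation, sketched here on hypothetical imbalanced toy data, passes `stratify` to keep the class ratio equal in both parts and `random_state` to make the split repeatable. The seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 75 negatives and 25 positives.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 75 + [1] * 25, name="Churn")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,    # keep the positive rate the same in train and test
    random_state=42,   # arbitrary seed, chosen only for repeatability
)

print(y_tr.mean(), y_te.mean())
```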

In [54]:
# Standardize the data (the scaler is fit on the training set only)
scaler = StandardScaler()

scaler.fit(x_train)

In [56]:
x_train_scaled = scaler.transform(X=x_train)
x_test_scaled = scaler.transform(X=x_test)

x_train = pd.DataFrame(data=x_train_scaled)
x_test = pd.DataFrame(data=x_test_scaled)

In [58]:
x_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,35,36,37,38,39,40,41,42,43,44
0,2.254424,-0.517013,0.289133,-0.369281,-0.997151,0.997151,-1.026382,1.026382,0.652598,-0.652598,...,-0.798057,0.914484,-0.523142,-0.562797,-0.830155,0.830155,-0.530323,-0.522589,1.400332,-0.543828
1,-0.443572,0.914026,1.289914,1.486541,1.002857,-1.002857,-1.026382,1.026382,-1.532338,1.532338,...,1.253044,-1.093513,1.911529,-0.562797,-0.830155,0.830155,1.885642,-0.522589,-0.714116,-0.543828
2,-0.443572,-0.721447,-1.515937,-0.866136,-0.997151,0.997151,-1.026382,1.026382,0.652598,-0.652598,...,-0.798057,0.914484,-0.523142,-0.562797,-0.830155,0.830155,-0.530323,1.913551,-0.714116,-0.543828
3,-0.443572,-1.089428,-1.532589,-0.968238,-0.997151,0.997151,0.974296,-0.974296,0.652598,-0.652598,...,-0.798057,0.914484,-0.523142,-0.562797,1.204595,-1.204595,-0.530323,-0.522589,-0.714116,1.838818
4,2.254424,-0.762334,0.512269,-0.5487,1.002857,-1.002857,0.974296,-0.974296,0.652598,-0.652598,...,-0.798057,0.914484,-0.523142,-0.562797,-0.830155,0.830155,1.885642,-0.522589,-0.714116,-0.543828
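
As the output above shows, wrapping the scaled array in a new `DataFrame` replaces the feature names with integer positions. A minimal sketch of one way to keep the names and the original index, using hypothetical toy values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy train/test frames with non-default indices.
train = pd.DataFrame(
    {"tenure": [1.0, 34.0, 2.0, 45.0], "MonthlyCharges": [29.85, 56.95, 53.85, 42.30]},
    index=[10, 11, 12, 13],
)
test = pd.DataFrame({"tenure": [2.0], "MonthlyCharges": [70.70]}, index=[20])

# Fit on the training data only, then apply the same transformation to both parts.
scaler = StandardScaler().fit(train)

train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled.round(3))
```

Keeping the column names makes it easier to interpret model coefficients later.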


## Data Modeling - Modeling

Several different models will be tried in this step, aiming for progressively better predictions.
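
One way to compare several candidates is to score each with cross-validation. The sketch below uses synthetic data from `make_classification` rather than the churn data. The decision tree and random forest are assumptions chosen for illustration, since this section only shows logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in for the real training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Hypothetical candidate list; any scikit-learn classifier could be added here.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean F1 over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```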

### Logistic Regression

In [59]:
from sklearn.linear_model import LogisticRegression


In [61]:
lr = LogisticRegression(penalty='l2', C=1.0)
# Pass y as a 1-D array to avoid scikit-learn's DataConversionWarning
lr.fit(x_train, y_train.values.ravel())
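
The values `penalty='l2'` and `C=1.0` used above correspond to scikit-learn's usual defaults, so the model has not been tuned yet. A minimal sketch of how the regularization strength `C` could be searched with `GridSearchCV`, using synthetic data and an arbitrary grid (both are assumptions, not the notebook's actual tuning step):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Arbitrary grid of inverse regularization strengths (smaller C = stronger penalty).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1",  # optimize F1 rather than accuracy
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refit on all the data with the best parameters, and `search.cv_results_` contains the per-candidate scores that could be plotted.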


In [65]:
y_pred = lr.predict(x_test)

print(f'Accuracy: {accuracy_score(y_test, y_pred)*100}%')
print(f'F1_score: {f1_score(y_test, y_pred)*100}%')

Accuracy: 79.45791726105563%
F1_score: 58.5014409221902%
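
The gap between accuracy (about 79%) and F1 (about 59%) is consistent with an imbalanced target: by default `f1_score` measures only the positive class, while accuracy also rewards every correctly predicted non-churner. The sketch below uses hypothetical labels (not the notebook's predictions) to show how both numbers follow from the confusion matrix.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels: 8 non-churners (0) and 4 churners (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = accuracy_score(y_true, y_hat)    # (tp + tn) / total
precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)
f1 = f1_score(y_true, y_hat)                # 2 * tp / (2 * tp + fp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here half of the churners are missed, which lowers recall and F1 while accuracy stays comparatively high.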
