# Lab | Cross Validation

For this lab, we will build a model for a customer churn binary classification problem. You will be using the `files_for_lab/Customer-Churn.csv` file.



### Instructions

1. Apply SMOTE for upsampling the data

    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.



2. Apply TomekLinks for downsampling

    - It is important to remember that it does not make the two classes equal; it only removes points from the majority class that are close to points in the minority class.
    - Use logistic regression to fit the model and compute the accuracy of the model.
    - Use decision tree classifier to fit the model and compute the accuracy of the model.
    - Compare the accuracies of the two models.
    - You can also apply this algorithm one more time and check how the imbalance between the two classes changed from the last time.
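
The TomekLinks step can be pictured with a small sketch. A Tomek link is a pair of points from opposite classes that are each other's nearest neighbour; the downsampling drops the majority-class member of each pair. The helper `remove_tomek_majority` and the random data below are made up for illustration, and this is a simplified NumPy version rather than imblearn's implementation.

```python
import numpy as np

def remove_tomek_majority(X, y, majority_label=0):
    """Drop majority-class points that belong to a Tomek link (hypothetical helper)."""
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    idx = np.arange(len(X))
    mutual = nearest[nearest] == idx      # i and its neighbour pick each other
    opposite = y != y[nearest]            # and they carry different labels
    in_link = mutual & opposite
    keep = ~(in_link & (y == majority_label))
    return X[keep], y[keep]

# Made-up, overlapping two-class data: 200 majority points, 40 minority points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(40, 2))])
y = np.array([0] * 200 + [1] * 40)

X1, y1 = remove_tomek_majority(X, y)      # first pass
X2, y2 = remove_tomek_majority(X1, y1)    # second pass on the cleaned data
for name, labels in [("original", y), ("first pass", y1), ("second pass", y2)]:
    print(name, np.bincount(labels))
```

Each pass can remove at most one majority point per minority point, which is why the classes stay unequal, and a second pass can find new links only because the neighbourhoods changed after the first one.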



# My Code

In [36]:
# Import the libraries we will need

# data handling
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import datetime

# plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# statistics
import math
from scipy.stats import norm
from scipy import stats  # for Box-Cox, among others
from scipy.stats import skew

# preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# models and evaluation
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

import statsmodels.api as sm
from statsmodels.formula.api import ols

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks



In [20]:
# Load the data
data = pd.read_csv('Customer-Churn.txt')

In [21]:
data.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [22]:
# Apply naming good practices to the column names through a function.
# First, put a '_' between a lowercase letter and the uppercase letter that follows it.
# Second, convert all the text to lowercase.
# Third, if there is a space between words, replace it with a '_'.

# Import the library.
import re

# Define the function
def formatear_nombre_columna(nombre_columna):

    nombre_columna = re.sub(r'([a-z])([A-Z])', r'\1_\2', nombre_columna)  # Insert an underscore before each uppercase letter that follows a lowercase one
    nombre_columna = nombre_columna.lower()  # Convert everything to lowercase.
    nombre_columna = nombre_columna.replace(" ", "_")  # Replace spaces with underscores.

    return nombre_columna

# Apply the function to the column names.
data.columns = [formatear_nombre_columna(col) for col in data.columns]

# Check the result.
print(data.columns)

Index(['gender', 'senior_citizen', 'partner', 'dependents', 'tenure',
       'phone_service', 'online_security', 'online_backup',
       'device_protection', 'tech_support', 'streaming_tv', 'streaming_movies',
       'contract', 'monthly_charges', 'total_charges', 'churn'],
      dtype='object')
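
As a quick check of what the renaming rule does and does not handle, the sketch below repeats the same three steps in a standalone helper (`to_snake_case` is a name chosen here for illustration) and runs it on a few names. `"Total Charges"` and `"TVShow"` are made-up inputs that do not appear in the dataset.

```python
import re

def to_snake_case(name):
    # Same three steps as formatear_nombre_columna: underscore, lowercase, spaces.
    name = re.sub(r'([a-z])([A-Z])', r'\1_\2', name)
    return name.lower().replace(" ", "_")

samples = ["SeniorCitizen", "StreamingTV", "MonthlyCharges", "gender",
           "Total Charges", "TVShow"]
converted = {s: to_snake_case(s) for s in samples}
for original, new in converted.items():
    print(f"{original!r} -> {new!r}")
```

Because the pattern only matches a lowercase letter followed by an uppercase one, a run of capitals such as `TVShow` is not split; that does not matter for the columns in this dataset.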


In [23]:
data.head()

Unnamed: 0,gender,senior_citizen,partner,dependents,tenure,phone_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,monthly_charges,total_charges,churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             7043 non-null   object 
 1   senior_citizen     7043 non-null   int64  
 2   partner            7043 non-null   object 
 3   dependents         7043 non-null   object 
 4   tenure             7043 non-null   int64  
 5   phone_service      7043 non-null   object 
 6   online_security    7043 non-null   object 
 7   online_backup      7043 non-null   object 
 8   device_protection  7043 non-null   object 
 9   tech_support       7043 non-null   object 
 10  streaming_tv       7043 non-null   object 
 11  streaming_movies   7043 non-null   object 
 12  contract           7043 non-null   object 
 13  monthly_charges    7043 non-null   float64
 14  total_charges      7043 non-null   object 
 15  churn              7043 non-null   object 
dtypes: float64(1), int64(2),

In [25]:
# Convert the `total_charges` column from object type to numeric type.
data['total_charges'] = pd.to_numeric(data['total_charges'], errors='coerce')
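
To see why nulls can appear after this conversion, here is a minimal sketch of `errors='coerce'` on a made-up Series. It assumes the unparseable entries are blank strings, which the notebook does not show explicitly.

```python
import pandas as pd

# Made-up values; " " stands in for a blank entry that cannot be parsed as a number.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

numeric = pd.to_numeric(raw, errors="coerce")   # unparseable entries become NaN
print(numeric)
print("missing after coercion:", numeric.isna().sum())
```

With `errors='coerce'` the conversion never raises, so it is worth counting the nulls right afterwards, as the next cell does.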

In [26]:
# Check whether the DataFrame has any nulls.
data.isnull().sum()

gender                0
senior_citizen        0
partner               0
dependents            0
tenure                0
phone_service         0
online_security       0
online_backup         0
device_protection     0
tech_support          0
streaming_tv          0
streaming_movies      0
contract              0
monthly_charges       0
total_charges        11
churn                 0
dtype: int64

In [27]:
# Impute the nulls with the median.
data['total_charges'] = data['total_charges'].fillna(data['total_charges'].median())

In [28]:
data.head()

Unnamed: 0,gender,senior_citizen,partner,dependents,tenure,phone_service,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,contract,monthly_charges,total_charges,churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [29]:
# Check again.
data.isnull().sum()

gender               0
senior_citizen       0
partner              0
dependents           0
tenure               0
phone_service        0
online_security      0
online_backup        0
device_protection    0
tech_support         0
streaming_tv         0
streaming_movies     0
contract             0
monthly_charges      0
total_charges        0
churn                0
dtype: int64

## Data preparation.

I will convert the categorical variables to numeric ones and select the target variable.

In [31]:
# Convert the 'churn' column to numeric values.
data['churn'] = data['churn'].apply(lambda x: 1 if x == 'Yes' else 0)

# Convert the categorical variables to dummy variables.
data_dummies = pd.get_dummies(data, drop_first=True)

# Separate the features from the target variable.
X = data_dummies.drop('churn', axis=1)
y = data_dummies['churn']
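
The effect of `drop_first=True` is easier to see on a small frame. The sketch below uses a made-up toy DataFrame: each categorical column loses its first category (in sorted order), which becomes the implicit reference level, and numeric columns pass through unchanged.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "contract": ["Month-to-month", "One year", "Two year", "One year"],
    "partner": ["Yes", "No", "No", "Yes"],
    "tenure": [1, 34, 60, 45],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A three-level column therefore yields two dummy columns and a two-level column yields one, which avoids perfectly collinear features in the logistic regression.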

## Split the data into "Train" and "Test" sets.

In [32]:
# Split the data into X_train, X_test, y_train and y_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
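
The `stratify=y` argument keeps the class proportions the same in both splits. In the outputs further down, the test set has 374 positives out of 1409 and the training set has 1495 out of 5634, about 26.5% in both. A minimal sketch of the same behaviour on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with a 73/27 class mix.
y_toy = np.array([0] * 730 + [1] * 270)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without stratification the two rates would differ from split to split, which makes accuracy comparisons on a small test set noisier.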

## Applying SMOTE to oversample the minority class.

In [33]:
# Apply SMOTE.
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Check the class balance after using SMOTE.
print(y_train_smote.value_counts())

churn
0    4139
1    4139
Name: count, dtype: int64
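
The balanced counts above come from synthetic minority points. The core idea of SMOTE is to pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The sketch below illustrates that interpolation in plain NumPy; `smote_like_samples` and the random data are invented for this example, and it is a simplification rather than imblearn's implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]              # k nearest minority neighbours

    base = rng.integers(0, len(X_min), size=n_new)            # random minority points
    pick = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Made-up data: 100 majority points and 30 minority points.
rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(100, 2))
X_minority = rng.normal(2.0, 1.0, size=(30, 2))

synthetic = smote_like_samples(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, synthetic])
print(len(X_majority), len(X_minority_balanced))
```

Because every synthetic point lies between two real minority points, no new information is created; the resampling only changes how much weight the minority region gets during training.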


## Training and evaluating the logistic regression model.

In [34]:
# Train the logistic regression model.
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_smote, y_train_smote)

# Prediction and evaluation.
y_pred_log_reg = log_reg.predict(X_test)

print("Accuracy of the logistic regression model with SMOTE:", accuracy_score(y_test, y_pred_log_reg))
print(classification_report(y_test, y_pred_log_reg))

Accuracy of the logistic regression model with SMOTE: 0.71611071682044
              precision    recall  f1-score   support

           0       0.88      0.71      0.79      1035
           1       0.48      0.73      0.58       374

    accuracy                           0.72      1409
   macro avg       0.68      0.72      0.68      1409
weighted avg       0.77      0.72      0.73      1409
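
The columns of the report can be traced back to the confusion matrix. The sketch below recomputes precision, recall, F1 and accuracy by hand for a small set of made-up labels (not the notebook's predictions) and can be compared against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

For a churn problem the recall on class 1 is often the number to watch, since it measures how many churning customers the model actually catches.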



## Training and evaluating the "Decision Tree" model.

In [37]:
# Train the Decision Tree model.
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train_smote, y_train_smote)

# Prediction and evaluation.
y_pred_tree_clf = tree_clf.predict(X_test)

print("Accuracy of the Decision Tree model with SMOTE:", accuracy_score(y_test, y_pred_tree_clf))
print(classification_report(y_test, y_pred_tree_clf))

Accuracy of the Decision Tree model with SMOTE: 0.6990773598296665
              precision    recall  f1-score   support

           0       0.82      0.76      0.79      1035
           1       0.44      0.53      0.48       374

    accuracy                           0.70      1409
   macro avg       0.63      0.65      0.64      1409
weighted avg       0.72      0.70      0.71      1409



## Applying TomekLinks to downsample the majority class.

I will apply TomekLinks to reduce the majority class in the original training set.

In [38]:
# Apply TomekLinks.
tomek = TomekLinks()
X_train_tomek, y_train_tomek = tomek.fit_resample(X_train, y_train)

# Check the class balance after using TomekLinks.
print(y_train_tomek.value_counts())

churn
0    3677
1    1495
Name: count, dtype: int64


## Training and evaluating the logistic regression model with TomekLinks.

In [39]:
# Train the logistic regression model.
log_reg_tomek = LogisticRegression(random_state=42)
log_reg_tomek.fit(X_train_tomek, y_train_tomek)

# Prediction and evaluation.
y_pred_log_reg_tomek = log_reg_tomek.predict(X_test)

print("Accuracy of the logistic regression model with TomekLinks:", accuracy_score(y_test, y_pred_log_reg_tomek))
print(classification_report(y_test, y_pred_log_reg_tomek))

Accuracy of the logistic regression model with TomekLinks: 0.7927608232789212
              precision    recall  f1-score   support

           0       0.87      0.85      0.86      1035
           1       0.60      0.64      0.62       374

    accuracy                           0.79      1409
   macro avg       0.74      0.74      0.74      1409
weighted avg       0.80      0.79      0.79      1409



## Training and evaluating the Decision Tree model with TomekLinks.

In [40]:
# Train the Decision Tree model.
tree_clf_tomek = DecisionTreeClassifier(random_state=42)
tree_clf_tomek.fit(X_train_tomek, y_train_tomek)

# Prediction and evaluation.
y_pred_tree_clf_tomek = tree_clf_tomek.predict(X_test)

print("Accuracy of the Decision Tree model with TomekLinks:", accuracy_score(y_test, y_pred_tree_clf_tomek))
print(classification_report(y_test, y_pred_tree_clf_tomek))

Accuracy of the Decision Tree model with TomekLinks: 0.7267565649396736
              precision    recall  f1-score   support

           0       0.83      0.79      0.81      1035
           1       0.49      0.55      0.52       374

    accuracy                           0.73      1409
   macro avg       0.66      0.67      0.66      1409
weighted avg       0.74      0.73      0.73      1409



## Comparison of the results of the different models.

In [42]:
# Comparison of the accuracies of all the models.

print("Accuracy comparison:")

print("Logistic regression with SMOTE:", accuracy_score(y_test, y_pred_log_reg))
print("Decision Tree with SMOTE:", accuracy_score(y_test, y_pred_tree_clf))

print("Logistic regression with TomekLinks:", accuracy_score(y_test, y_pred_log_reg_tomek))
print("Decision Tree with TomekLinks:", accuracy_score(y_test, y_pred_tree_clf_tomek))

Accuracy comparison:
Logistic regression with SMOTE: 0.71611071682044
Decision Tree with SMOTE: 0.6990773598296665
Logistic regression with TomekLinks: 0.7927608232789212
Decision Tree with TomekLinks: 0.7267565649396736
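
One way to read these numbers is against the accuracy of always predicting the majority class. The sketch below computes that baseline from the class supports in the reports (1035 and 374) and compares it with the four accuracies printed above; the values are copied from the outputs, nothing is re-trained.

```python
# Test-set supports and accuracies copied from the notebook outputs.
support = {0: 1035, 1: 374}
accuracies = {
    "Logistic regression + SMOTE": 0.71611071682044,
    "Decision tree + SMOTE": 0.6990773598296665,
    "Logistic regression + TomekLinks": 0.7927608232789212,
    "Decision tree + TomekLinks": 0.7267565649396736,
}

# Accuracy obtained by always predicting the majority class on this test set.
baseline = max(support.values()) / sum(support.values())
print(f"majority-class baseline: {baseline:.4f}")

for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f} ({acc - baseline:+.4f} vs baseline)")
```

On accuracy alone, only the logistic regression trained on the TomekLinks data beats this baseline, which suggests looking at the class 1 recall and F1 in the reports as well when comparing the resampling strategies.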
