Load the dataset and explore the variables.   
We will try to predict the variable Churn using a logistic regression on the variables tenure, SeniorCitizen, and MonthlyCharges.   
Extract the target variable.   
Extract the independent variables and scale them.   
Build the logistic regression model.   
Evaluate the model.   
Even a simple model will give us more than 70% accuracy. Why?   
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique based on nearest neighbors that adds new points between existing points. Apply imblearn.over_sampling.SMOTE to the dataset. Build and evaluate the logistic regression model. Is there any improvement?   
Tomek links are pairs of very close instances that belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which makes classification easier. Apply imblearn.under_sampling.TomekLinks to the dataset. Build and evaluate the logistic regression model. Is there any improvement?   

In [56]:
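import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # used for the train/test splits below
from sklearn.preprocessing import StandardScaler  # used to scale the features below
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
import warnings
warnings.filterwarnings('ignore')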

In [57]:
# Load the dataset and explore the variables.

data = pd.read_csv('customer_churn.csv')
data.head(5)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [94]:
data.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [95]:
data.Churn.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [96]:
# Extract the target variable.
target = data["Churn"]

In [97]:
# Extract the independent variables and scale them
independent_vars = data[["tenure", "SeniorCitizen", "MonthlyCharges"]]

scaler = StandardScaler()
independent_vars_scaled = scaler.fit_transform(independent_vars)
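
To make the scaling step concrete, here is a minimal sketch of the standardization that StandardScaler applies to each column: subtract the column mean and divide by the population standard deviation. The list tenure_toy is a hypothetical five-value sample (the tenure values visible in data.head above), not the full column, so the scaled numbers will differ from the notebook's.

```python
import statistics

# Hypothetical toy column: the five tenure values shown in data.head().
tenure_toy = [1, 34, 2, 45, 2]

# Standardization: z = (x - mean) / std, using the population standard
# deviation (ddof=0), which is what StandardScaler uses.
mean = statistics.fmean(tenure_toy)
std = statistics.pstdev(tenure_toy)
tenure_scaled = [(x - mean) / std for x in tenure_toy]

print("mean:", mean, "std:", round(std, 3))
print("scaled:", [round(z, 3) for z in tenure_scaled])
```

After this transformation the column has mean 0 and unit variance, which keeps features measured on different scales (months of tenure versus monthly charges) comparable for the logistic regression.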

In [114]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score, recall_score

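# Encode the target variable (LabelEncoder sorts the labels, so "No" -> 0 and "Yes" -> 1)
le = LabelEncoder()
target_encoded = le.fit_transform(target)

# Split the scaled features and the encoded target into train (75%) and test (25%) sets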
X_train, X_test, y_train, y_test = train_test_split(independent_vars_scaled, target_encoded, test_size=0.25, random_state=42)

In [117]:
print("Size of X_train:", X_train.shape)
print("Size of X_test:", X_test.shape)
print("Size of y_train:", y_train.shape)
print("Size of y_test:", y_test.shape)

Size of X_train: (5282, 3)
Size of X_test: (1761, 3)
Size of y_train: (5282,)
Size of y_test: (1761,)


In [107]:
# Build the logistic regression model
logreg_model = LogisticRegression(random_state=0, solver='lbfgs')
logreg_model.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [108]:
# Evaluate the model
y_pred = logreg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8076650106458482


In [101]:
# Even a simple model will give us more than 70% accuracy. Why?
# The class imbalance in the target variable "Churn" inflates the accuracy of the model. There are considerably more "No" instances (5174) than "Yes" instances (1869), so a model that always predicts "No" would already reach 5174 / 7043, roughly 73% accuracy.
# When the classes are imbalanced, a simple model such as logistic regression can be biased toward the majority class.
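
The majority-class baseline can be checked with a few lines of arithmetic. This is a small sketch that uses only the class counts printed by data.Churn.value_counts() above; the variable names are illustrative.

```python
# Class counts taken from the data.Churn.value_counts() output above.
counts = {"No": 5174, "Yes": 1869}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting {majority_label!r} gives accuracy {baseline_accuracy:.3f}")
```

Any model should be judged against this baseline rather than against 50%, which is why accuracy alone is a weak metric for this dataset.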


In [105]:
from sklearn.metrics import precision_score, recall_score

# ...

# Evaluate the model
y_pred_test = logreg_model.predict(X_test)

accuracy_test = accuracy_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test)
recall_test = recall_score(y_test, y_pred_test)

print("The accuracy in the TEST set is: {:.2f}".format(accuracy_test))
print("The precision in the TEST set is: {:.2f}".format(precision_test))
print("The recall in the TEST set is: {:.2f}".format(recall_test))

The accuracy in the TEST set is: 0.81
The precision in the TEST set is: 0.70
The recall in the TEST set is: 0.49
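
A recall of 0.49 means the model finds only about half of the customers who actually churn, even though accuracy looks good. The sketch below shows how precision and recall come from the counts of true positives, false positives, and false negatives. The helper precision_recall and the toy label lists are hypothetical illustrations, not the notebook's data.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy labels: 4 churners (1) and 6 non-churners (0).
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision_toy, recall_toy = precision_recall(y_true_toy, y_pred_toy)
accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

print(f"accuracy={accuracy_toy:.2f} precision={precision_toy:.2f} recall={recall_toy:.2f}")
```

In this toy case 7 of 10 predictions are correct, yet half of the positives are missed, which is the same pattern the test-set metrics above show.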


In [103]:
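# Apply SMOTE to the dataset and build/evaluate the logistic regression model.
# Note: SMOTE is applied to the full dataset before the split, so the test set is
# roughly balanced and contains synthetic points. This accuracy is therefore not
# directly comparable to the baseline measured on the original test split.
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(independent_vars_scaled, target)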

X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

logreg_model_smote = LogisticRegression()
logreg_model_smote.fit(X_train_smote, y_train_smote)

y_pred_smote = logreg_model_smote.predict(X_test_smote)
accuracy_smote = accuracy_score(y_test_smote, y_pred_smote)
print("Accuracy with SMOTE:", accuracy_smote)

Accuracy with SMOTE: 0.740096618357488
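
To illustrate what "adds new points between existing points" means, here is a simplified sketch of the interpolation idea behind SMOTE. It is not imblearn's implementation: the function smote_like_samples, its parameters, and the toy points are assumptions made for illustration only.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class points on the corners of the unit square.
minority_toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority_toy, n_new=5)
print(new_points)
```

Every synthetic point lies on a segment joining two real minority points, so the oversampled class fills in its own region rather than duplicating existing rows.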


In [104]:
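# Apply TomekLinks to the dataset and build/evaluate the logistic regression model.
# Note: as in the SMOTE cell, resampling is applied before the split, so the
# test set is resampled too.
tomek = TomekLinks()
X_resampled_tomek, y_resampled_tomek = tomek.fit_resample(independent_vars_scaled, target)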

X_train_tomek, X_test_tomek, y_train_tomek, y_test_tomek = train_test_split(X_resampled_tomek, y_resampled_tomek, test_size=0.2, random_state=42)

logreg_model_tomek = LogisticRegression()
logreg_model_tomek.fit(X_train_tomek, y_train_tomek)

y_pred_tomek = logreg_model_tomek.predict(X_test_tomek)
accuracy_tomek = accuracy_score(y_test_tomek, y_pred_tomek)
print("Accuracy with TomekLinks:", accuracy_tomek)

Accuracy with TomekLinks: 0.8082191780821918
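
The Tomek link definition can also be sketched directly: two points form a link when they are each other's nearest neighbor and carry different labels. The code below is a brute-force illustration under that definition, not imblearn's implementation; find_tomek_links and the one-dimensional toy data are hypothetical.

```python
import math


def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and belong to different classes."""
    def nearest(i):
        others = (j for j in range(len(points)) if j != i)
        return min(others, key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


# Hypothetical one-feature dataset; "No" is the majority class.
points_toy = [(0.0,), (0.1,), (1.0,), (1.2,), (3.0,)]
labels_toy = ["No", "Yes", "No", "No", "Yes"]

links = find_tomek_links(points_toy, labels_toy)
majority = max(set(labels_toy), key=labels_toy.count)

# Drop only the majority-class member of each link.
to_drop = {i for pair in links for i in pair if labels_toy[i] == majority}
kept = [i for i in range(len(points_toy)) if i not in to_drop]

print("Tomek links:", links)
print("Kept indices:", kept)
```

Only the majority point at 0.0 is removed here, because it sits right next to a minority point. The pair at 1.0 and 1.2 is also mutually nearest but shares a label, so it is not a Tomek link. Because only such borderline points are removed, the class proportions barely change, which is consistent with the TomekLinks accuracy staying close to the baseline.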
