# Project description

Beta Bank's customers are leaving, little by little, every month. The bankers have found that it is cheaper to retain existing customers than to attract new ones.

We need to predict whether a customer will leave the bank soon. You have data on customers' past behavior and on the termination of their contracts with the bank.

Build a model with the highest possible F1 score. To pass the review, you need an F1 score of at least 0.59. Check F1 on the test set.

In addition, measure the AUC-ROC metric and compare it with the F1 score.
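As a reminder of what the target metric measures, here is a minimal sketch that computes F1 as the harmonic mean of precision and recall from confusion-matrix counts. The helper name `f1_from_counts` and the counts are hypothetical, chosen only for illustration. AUC-ROC, by contrast, is computed from ranked scores rather than from a single set of hard predictions, so the two metrics can disagree.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 60 true positives, 20 false positives, 40 false negatives
example_f1 = f1_from_counts(tp=60, fp=20, fn=40)
print(f"F1 = {example_f1:.4f}")
```

With these counts, precision is 0.75 and recall is 0.60, so F1 falls between the two and closer to the smaller one.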

# Data description

The data is in the file /datasets/Churn.csv. Download the dataset.

## Features

RowNumber: index of the data row

CustomerId: unique customer identifier

Surname: surname

CreditScore: credit score

Geography: country of residence

Gender: gender

Age: age

Tenure: period during which a customer's fixed-term deposit has matured (years)

Balance: account balance

NumOfProducts: number of banking products used by the customer

HasCrCard: the customer has a credit card (1 - yes; 0 - no)

IsActiveMember: customer activity (1 - yes; 0 - no)

EstimatedSalary: estimated salary

# Project instructions

Download and prepare the data. Explain the procedure.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.utils import resample

In [2]:
data = pd.read_csv('/datasets/Churn.csv')

In [3]:
print(data.info())
display(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
None


| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [4]:
print(data.isna().sum())

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64


Drop the columns that carry no predictive information (row index, customer ID, and surname)

In [5]:
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

Convert the categorical variables to numeric ones with one-hot encoding

In [6]:
data = pd.get_dummies(data, drop_first=True)
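To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The rows in `toy` are invented for illustration and are not taken from Churn.csv. For each categorical column, the first category in sorted order is dropped and becomes the implicit baseline, which avoids a redundant, perfectly collinear dummy column.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [42, 41, 39, 43],
})

# Numeric columns pass through; each object column becomes k-1 dummy columns
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Here `France` and `Female` are the dropped baselines, so a row with zeros in both Geography dummies represents a customer from France.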

Separate the features from the target

In [7]:
X = data.drop('Exited', axis=1)
y = data['Exited']

Create the training and test sets (80% / 20%)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
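One possible refinement, which is not what the cell above does, is to stratify the split on the target and to hold out a separate validation set for choosing hyperparameters, so that the test set is only used for the final check. The sketch below uses synthetic data and hypothetical variable names with a `_demo` suffix; the 60/20/20 proportions are an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the real features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.2).astype(int)

# First carve out 20% as the test set, keeping the class ratio
X_tmp, X_test_demo, y_tmp, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
# Then split the remaining 80% into 60% train and 20% validation
X_train_demo, X_valid_demo, y_train_demo, y_valid_demo = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42
)

print(len(X_train_demo), len(X_valid_demo), len(X_test_demo))
```

With `stratify`, the share of positive cases is nearly the same in every subset, which matters when only about one customer in five belongs to the positive class.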

Fill the missing values (only Tenure has any) with the median computed on the training set

In [9]:
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

Standardize the features (zero mean, unit variance), fitting the scaler on the training set only

In [10]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
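The imputation and scaling steps can also be chained in a scikit-learn `Pipeline`, which makes it harder to fit a transformer on test data by accident. This is a sketch under assumptions: the step names `impute` and `scale` and the tiny arrays are hypothetical, and the notebook itself applies the two transformers separately.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as in the notebook, chained in order
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Tiny illustrative arrays with one missing value in each set
train_demo = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0], [5.0, 40.0]])
test_demo = np.array([[np.nan, 25.0]])

# Statistics (median, mean, std) are learned from the training data only
train_prepared = prep.fit_transform(train_demo)
test_prepared = prep.transform(test_demo)
print(test_prepared)
```

The missing test value is replaced by the training median (3.0), which equals the training mean after imputation, so it maps to zero after scaling.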

Train the decision tree

Make predictions

Evaluate performance

Try different depths for the tree model

In [11]:
depths = [None, 5, 10, 15]
num_estimators = 100
class_weights = 'balanced'

for depth in depths:
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    print(f"Profundidad del árbol: {depth}")
    print(classification_report(y_test, y_pred))

Profundidad del árbol: None
              precision    recall  f1-score   support

           0       0.88      0.85      0.86      1607
           1       0.45      0.51      0.48       393

    accuracy                           0.78      2000
   macro avg       0.67      0.68      0.67      2000
weighted avg       0.79      0.78      0.79      2000

Profundidad del árbol: 5
              precision    recall  f1-score   support

           0       0.87      0.97      0.92      1607
           1       0.77      0.40      0.52       393

    accuracy                           0.86      2000
   macro avg       0.82      0.68      0.72      2000
weighted avg       0.85      0.86      0.84      2000

Profundidad del árbol: 10
              precision    recall  f1-score   support

           0       0.89      0.92      0.90      1607
           1       0.61      0.51      0.56       393

    accuracy                           0.84      2000
   macro avg       0.75      0.72      0.73      

# With the original data

## Random forest model

Train and evaluate it in the same way as the decision tree model

In [12]:
for depth in depths:
    model = RandomForestClassifier(random_state=12345, n_estimators=num_estimators, max_depth=depth)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    print(f"Profundidad del árbol: {depth}")
    print(classification_report(y_test, y_pred))

Profundidad del árbol: None
              precision    recall  f1-score   support

           0       0.88      0.96      0.92      1607
           1       0.76      0.48      0.59       393

    accuracy                           0.87      2000
   macro avg       0.82      0.72      0.76      2000
weighted avg       0.86      0.87      0.86      2000

Profundidad del árbol: 5
              precision    recall  f1-score   support

           0       0.86      0.98      0.91      1607
           1       0.80      0.33      0.47       393

    accuracy                           0.85      2000
   macro avg       0.83      0.66      0.69      2000
weighted avg       0.85      0.85      0.83      2000

Profundidad del árbol: 10
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      1607
           1       0.76      0.46      0.57       393

    accuracy                           0.87      2000
   macro avg       0.82      0.71      0.75      

## Logistic regression model

Train and evaluate it in the same way as the decision tree model

In [13]:
model = LogisticRegression(random_state=12345)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Logistic Regression Model")
print(classification_report(y_test, y_pred))


Logistic Regression Model
              precision    recall  f1-score   support

           0       0.83      0.96      0.89      1607
           1       0.56      0.20      0.30       393

    accuracy                           0.81      2000
   macro avg       0.69      0.58      0.59      2000
weighted avg       0.78      0.81      0.77      2000



# Balancing the models with class weights

Repeat the steps above using the models' class_weight option
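To see what `class_weight='balanced'` does, the sketch below computes the weights scikit-learn would assign on a toy target with an 80/20 class split. The array `y_toy` is invented for illustration. The documented formula is `n_samples / (n_classes * np.bincount(y))`, so the rarer class receives the larger weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: 80 negatives and 20 positives (illustrative only)
y_toy = np.array([0] * 80 + [1] * 20)

# Weights as computed by scikit-learn for class_weight='balanced'
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# The same values from the formula n_samples / (n_classes * bincount)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(dict(zip([0, 1], weights)))
```

Each positive example therefore counts four times as much as a negative one in the loss, which pushes the model toward higher recall on the minority class.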

In [14]:
model = RandomForestClassifier(random_state=12345, class_weight=class_weights)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Random Forest Model with Class Weighting")
print(classification_report(y_test, y_pred))


Random Forest Model with Class Weighting
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      1607
           1       0.78      0.47      0.58       393

    accuracy                           0.87      2000
   macro avg       0.83      0.72      0.75      2000
weighted avg       0.86      0.87      0.86      2000



In [15]:
model = LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1, 10, 100], cv=5, class_weight=class_weights)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Logistic Regression Model with Class Weighting")
print(classification_report(y_test, y_pred))


Logistic Regression Model with Class Weighting
              precision    recall  f1-score   support

           0       0.91      0.72      0.80      1607
           1       0.38      0.72      0.50       393

    accuracy                           0.72      2000
   macro avg       0.65      0.72      0.65      2000
weighted avg       0.81      0.72      0.74      2000



# Upsampling the data

You can use the following function, which returns the features and target with upsampling applied:

In [16]:
def upsample(features, target, repeat):
    # Split features and target by class
    features_zeros = pd.DataFrame(features[target == 0])
    features_ones = pd.DataFrame(features[target == 1])
    target_zeros = pd.DataFrame(target[target == 0])
    target_ones = pd.DataFrame(target[target == 1])

    # Repeat the positive class int(repeat) times.
    # int() truncates, so a fractional repeat such as 1/10 becomes 0.
    features_upsampled = pd.concat([features_zeros] + [features_ones] * int(repeat))
    target_upsampled = pd.concat([target_zeros] + [target_ones] * int(repeat))

    # resample draws a same-size bootstrap sample (with replacement),
    # which also puts the rows in random order
    features_upsampled, target_upsampled = resample(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled


X_train_upsampled, y_train_upsampled = upsample(X_train, y_train, 10)

logistic_model_upsampled = LogisticRegression(random_state=12345)
logistic_model_upsampled.fit(X_train_upsampled, y_train_upsampled)

y_pred_logistic_upsampled = logistic_model_upsampled.predict(X_test)

f1_logistic_upsampled = f1_score(y_test, y_pred_logistic_upsampled)
roc_auc_logistic_upsampled = roc_auc_score(y_test, y_pred_logistic_upsampled)

print("Regresión Logística con upsampling:")
print(f"F1 Score: {f1_logistic_upsampled:.4f}")
print(f"AUC-ROC Score: {roc_auc_logistic_upsampled:.4f}")



Regresión Logística con upsampling:
F1 Score: 0.3993
AUC-ROC Score: 0.6315
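Note that the cell above passes hard 0/1 predictions to `roc_auc_score`. AUC-ROC is normally computed from continuous scores, for example `model.predict_proba(X_test)[:, 1]`, because hard labels collapse the ROC curve to a single operating point. The sketch below illustrates the difference on hand-made numbers; `y_true` and `scores` are invented and do not come from the notebook's models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.45, 0.40, 0.90])

# Hard labels at the default 0.5 threshold, as model.predict would return
hard_labels = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)
auc_from_scores = roc_auc_score(y_true, scores)

print(f"AUC from hard labels: {auc_from_labels:.3f}")
print(f"AUC from scores:      {auc_from_scores:.3f}")
```

The two values differ (0.750 versus 0.875 here) because the score-based version credits the model for ranking most positives above most negatives, even when one positive falls below the 0.5 threshold.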




In [17]:
random_forest_model_upsampled = RandomForestClassifier(random_state=12345, n_estimators=100)
random_forest_model_upsampled.fit(X_train_upsampled, y_train_upsampled)

y_pred_rf_upsampled = random_forest_model_upsampled.predict(X_test)

f1_rf_upsampled = f1_score(y_test, y_pred_rf_upsampled)
roc_auc_rf_upsampled = roc_auc_score(y_test, y_pred_rf_upsampled)

print("Random Forest con upsampling:")
print(f"F1 Score: {f1_rf_upsampled:.4f}")
print(f"AUC-ROC Score: {roc_auc_rf_upsampled:.4f}")



Random Forest con upsampling:
F1 Score: 0.6000
AUC-ROC Score: 0.7550
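A possible variant of the upsampling helper is sketched below, under assumptions. It uses `sklearn.utils.shuffle` to reorder rows without drawing a bootstrap sample, and keeps the target as a 1-D Series, which is the shape scikit-learn estimators expect. The name `upsample_shuffled` and the toy data are hypothetical. The classification reports above show roughly four negatives per positive in the test set (1607 versus 393), so if the training set has a similar ratio, a repeat factor near 4 would bring the classes close to parity, whereas 10 makes the positive class the majority.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle


def upsample_shuffled(features, target, repeat):
    """Repeat positive-class rows `repeat` times, then shuffle the rows."""
    features = pd.DataFrame(features).reset_index(drop=True)
    target = pd.Series(np.asarray(target)).reset_index(drop=True)
    pos = target == 1

    features_up = pd.concat([features[~pos]] + [features[pos]] * int(repeat))
    target_up = pd.concat([target[~pos]] + [target[pos]] * int(repeat))

    # shuffle permutes both objects with the same order, without replacement
    return shuffle(features_up, target_up, random_state=12345)


# Toy data: 8 negatives and 2 positives (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = pd.Series([0] * 8 + [1] * 2)

X_up, y_up = upsample_shuffled(X_toy, y_toy, repeat=4)
print(y_up.value_counts().to_dict())
```

Because no rows are drawn with replacement, every original negative example is kept exactly once and every positive example appears exactly `repeat` times.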


In [18]:
X_train_downsampled, y_train_downsampled = upsample(X_train, y_train, 1/10)

random_forest_model_downsampled = RandomForestClassifier(random_state=12345, n_estimators=100)
random_forest_model_downsampled.fit(X_train_downsampled, y_train_downsampled)

y_pred_rf_downsampled = random_forest_model_downsampled.predict(X_test)

f1_rf_downsampled = f1_score(y_test, y_pred_rf_downsampled)
roc_auc_rf_downsampled = roc_auc_score(y_test, y_pred_rf_downsampled)

print("Random Forest con downsampling:")
print(f"F1 Score: {f1_rf_downsampled:.4f}")
print(f"AUC-ROC Score: {roc_auc_rf_downsampled:.4f}")



Random Forest con downsampling:
F1 Score: 0.0000
AUC-ROC Score: 0.5000
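The F1 of 0.0000 and AUC-ROC of 0.5000 above are consistent with how `upsample` handles a fractional factor: `int(1/10)` evaluates to 0, so the positive class is repeated zero times and the forest is trained on negatives only. A separate downsampling helper avoids this. The sketch below is one way to write it, under assumptions: the name `downsample`, the `fraction` parameter, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle


def downsample(features, target, fraction, random_state=12345):
    """Keep all positives and a random `fraction` of the negatives."""
    features = pd.DataFrame(features).reset_index(drop=True)
    target = pd.Series(np.asarray(target)).reset_index(drop=True)

    # Sample a fraction of the majority (negative) class without replacement
    neg_idx = target[target == 0].sample(
        frac=fraction, random_state=random_state
    ).index
    pos_idx = target[target == 1].index
    keep = neg_idx.append(pos_idx)

    return shuffle(features.loc[keep], target.loc[keep], random_state=random_state)


# Toy data: 80 negatives and 20 positives (illustrative only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = pd.Series([0] * 80 + [1] * 20)

X_down, y_down = downsample(X_toy, y_toy, fraction=0.25)
print(y_down.value_counts().to_dict())
```

On the notebook's data this would be called on `X_train` and `y_train` only, leaving the test set untouched. A fraction near 0.25 is a starting guess given the roughly 4:1 class ratio seen in the reports, not a tuned value.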
