# Customer Segmentation – Telecom Dataset

## Exploratory Data Analysis (EDA)

### Data Loading

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load datasets
df_contract = pd.read_csv('/datasets/final_provider/contract.csv')
df_personal = pd.read_csv('/datasets/final_provider/personal.csv')
df_internet = pd.read_csv('/datasets/final_provider/internet.csv')
df_phone = pd.read_csv('/datasets/final_provider/phone.csv')

### Data Overview

In [None]:
# General view
for name, df in zip(
    ["Contract", "Personal", "Internet", "Phone"],
    [df_contract, df_personal, df_internet, df_phone]
):
    print(f"\n🟩 {name} - Shape: {df.shape}")
    display(df.head())
    df.info()  # info() prints its report directly and returns None


### Merged Dataset

In [None]:
df = df_personal.merge(df_contract, on='customerID', how='left') \
                .merge(df_internet, on='customerID', how='left') \
                .merge(df_phone, on='customerID', how='left')

print(f"\n📐 Merged dataset: {df.shape}")
df.head()

In [None]:
df.info()

### Duplicate Values and Missing Data

In [None]:
# Missing values
missing_values = df.isnull().sum().sort_values(ascending=False)
print("\n🔍 Missing values per column:")
print(missing_values[missing_values > 0])

# Duplicates
print("\n🧾 Duplicates:", df.duplicated().sum())

### Target Variable

In [None]:
# Create binary target column: 1 = churned, 0 = active
df['churn'] = df['EndDate'].apply(lambda x: 0 if x == 'No' else 1)

# Visualize distribution
sns.countplot(x='churn', data=df)
plt.title("Customer Churn Distribution")
plt.xlabel("Churn (1 = churned)")
plt.ylabel("Count")
plt.show()

# Proportion
churn_rate = df['churn'].value_counts(normalize=True)
print("\n📊 Churn rate:")
print(churn_rate)

### Variable Types

In [None]:
# Categorical and numerical variables
cat_cols = df.select_dtypes(include='object').columns.tolist()
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

print("\n🔤 Categorical variables:", cat_cols)
print("\n🔢 Numerical variables:", num_cols)


## Data Processing

### Data Type Conversions

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

### Date Features

In [None]:
def procesar_fechas(df):
    # Copy of dataframe
    df = df.copy()

    # Convert BeginDate to datetime
    df['BeginDate'] = pd.to_datetime(df['BeginDate'])

    # Convert EndDate to datetime, keeping "No" as NaT temporarily
    df['EndDate'] = pd.to_datetime(df['EndDate'], errors='coerce')

    # Replace NaT (the "No" values) with the maximum date in the dataset
    fecha_referencia = df['EndDate'].max()
    df['EndDate'] = df['EndDate'].fillna(fecha_referencia)

    # Create tenure_days column
    df['tenure_days'] = (df['EndDate'] - df['BeginDate']).dt.days

    return df

In [None]:
df = procesar_fechas(df)

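The new `tenure_days` column records how long each customer has been subscribed to the service, in days. For customers who are still active (`EndDate` equal to "No"), tenure is measured up to the latest end date found in the dataset.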
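As a quick illustration of the tenure calculation, the sketch below applies the same steps to two made-up rows. The toy dates are hypothetical and are not taken from the telecom data.

```python
import pandas as pd

# Hypothetical toy rows; the dates are invented for illustration only
toy = pd.DataFrame({
    "BeginDate": ["2019-01-01", "2018-06-01"],
    "EndDate": ["2019-03-01", "No"],  # "No" marks a customer who is still active
})

begin = pd.to_datetime(toy["BeginDate"])
# "No" cannot be parsed as a date, so errors="coerce" turns it into NaT
end = pd.to_datetime(toy["EndDate"], errors="coerce")
# Active customers are measured up to the latest observed end date
end = end.fillna(end.max())

toy["tenure_days"] = (end - begin).dt.days
print(toy["tenure_days"].tolist())  # [59, 273]
```

Note that the reference date is the latest churn date in the data, so it depends on which rows are present when the function runs.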

### Additional Features

A new variable called num_servicios was created, which represents the total number of additional services contracted by each customer. This variable was derived from the optional services columns:

- OnlineSecurity

- OnlineBackup

- DeviceProtection

- TechSupport

- StreamingTV

- StreamingMovies

The rationale is that customers who subscribe to more services tend to be more engaged with the company, which could influence their decision to churn. `num_servicios` may therefore be a useful predictive feature for the model.

In [None]:
# List of optional services columns
servicios_cols = [
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies'
]

# Create the num_servicios variable by counting how many services are active per customer
# (missing values compare as False, so they count as "not subscribed")
df['num_servicios'] = (df[servicios_cols] == 'Yes').sum(axis=1)


In [None]:
df.info()

### Function for Numerical Columns Processing

In [None]:
def imputar_y_convertir_numericas(df):
    """
    Convert numerical types:
    - float64 → float32
    - int64 → int32
    Also imputes the median if there are missing values.
    """
    for col in df.select_dtypes(include=['float64', 'int64']).columns:
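        if df[col].isnull().any():
            mediana = df[col].median()
            # Assign back instead of calling fillna(inplace=True) on a column
            # selection, which may not update the DataFrame in recent pandas
            df[col] = df[col].fillna(mediana)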

        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float32')
        elif df[col].dtype == 'int64':
            df[col] = df[col].astype('int32')

    return df


### Function for Categorical Columns Processing

In [None]:
def imputar_servicios_especiales(df):
    """
    Impute service columns with specific values:
    - 'No Internet' for services that require a connection.
    - 'No Phone' for multiple lines.
    """
    # Services that require Internet
    servicios_internet = [
        'InternetService', 'OnlineSecurity', 'OnlineBackup',
        'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
    ]
    
    for col in servicios_internet:
        # fillna leaves the column unchanged when nothing is missing
        df[col] = df[col].fillna('No Internet')

    # Services that require a phone
    df['MultipleLines'] = df['MultipleLines'].fillna('No Phone')
    
    return df



In [None]:
# General imputation
df = imputar_y_convertir_numericas(df)
df = imputar_servicios_especiales(df)


In [None]:
df.info()

## Categorical Feature Encoding

### Binary Columns

In [None]:
def codificar_binarias(df):
    """
    Detects binary columns and encodes them as 0 and 1.
    Returns the modified DataFrame and a dictionary with the mappings used.
    """
    df = df.copy()
    mapeos = {}
    
    for col in df.select_dtypes(include='object').columns:
        valores = df[col].dropna().unique()
        if len(valores) == 2:
            # Create a smart mapping if possible
            if 'No' in valores and 'Yes' in valores:
                mapeo = {'No': 0, 'Yes': 1}
            elif 'Female' in valores and 'Male' in valores:
                mapeo = {'Female': 0, 'Male': 1}
            else:
                mapeo = {valores[0]: 0, valores[1]: 1}

            df[col] = df[col].map(mapeo)
            mapeos[col] = mapeo
    
    return df, mapeos



In [None]:
df, mapeos_binarias = codificar_binarias(df)

In [None]:
for col, mapeo in mapeos_binarias.items():
    print(f"Column: {col}")
    print(f"  Applied Mapping: {mapeo}\n")

### Multi-class Columns

In [None]:
# Drop 'customerID'
df = df.drop(columns=['customerID'])

# Drop datetime columns
datetime_cols = df.select_dtypes(include=['datetime64']).columns
df = df.drop(columns=datetime_cols)

In [None]:
# Identify categorical columns with more than two classes (excluding ID)
cat_cols = df.select_dtypes(include=['object', 'category']).columns
multiclase_cols = [col for col in cat_cols if df[col].nunique() > 2]

In [None]:
# List the columns that will be one-hot encoded
print("Columns encoded with One-Hot Encoding:")
for col in multiclase_cols:
    print(f"- {col} ({df[col].nunique()} classes)")

In [None]:
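# Apply OHE; dtype=int keeps the dummy columns numeric (0/1) instead of boolean,
# so the feature matrix converts cleanly to a numeric array for every model
df_encoded = pd.get_dummies(df, columns=multiclase_cols, drop_first=True, dtype=int)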

# Calculate how many new columns were generated
n_new_columns = df_encoded.shape[1] - df.shape[1] + len(multiclase_cols)
print(f"\n{n_new_columns} new columns were generated with One-Hot Encoding.")

# Replace df with the new encoded DataFrame
df = df_encoded

In [None]:
df.info()

## Model Training

### Class Balancing

In [None]:
neg = (df['churn'] == 0).sum()
pos = (df['churn'] == 1).sum()
weight_for_1 = neg / pos

print(weight_for_1)

The ratio of active to churned customers printed above shows a clear class imbalance. Rather than resampling the data, we handle it through each model's class-weighting parameters.
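To make the weighting concrete, here is a minimal sketch of the two schemes used later in this notebook: the "balanced" heuristic that scikit-learn applies for `class_weight='balanced'`, and the negatives-to-positives ratio passed to XGBoost as `scale_pos_weight`. The label list is hypothetical.

```python
from collections import Counter

# Hypothetical labels: 3 active customers (0) for every churned one (1)
labels = [0, 0, 0, 1] * 25

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# "balanced" heuristic: n_samples / (n_classes * count of that class)
balanced_weights = {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Ratio of negatives to positives, as used for scale_pos_weight
scale_pos_weight = counts[0] / counts[1]

print(balanced_weights)   # class 1 gets three times the weight of class 0
print(scale_pos_weight)   # 3.0
```

Both schemes give the minority class the same relative weight (3 to 1 here); they differ only in the overall scale.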

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop(columns=['churn'])
y = df['churn']

# First split into training (70%) and temporary (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Then split the temporary set into validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

print(f"Train set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")


### Feature Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Numerical variables to scale
num_vars = ['MonthlyCharges', 'TotalCharges', 'tenure_days', 'num_servicios']

# Copy to avoid views
X_train = X_train.copy()
X_val = X_val.copy()
X_test = X_test.copy()

# Create the scaler and fit it only on the training data
scaler = MinMaxScaler()
# Assign whole columns so that integer columns (e.g. tenure_days) are replaced
# by their scaled float values instead of being written into an int dtype
X_train[num_vars] = scaler.fit_transform(X_train[num_vars])

# Apply the transformation to validation and test sets
X_val[num_vars] = scaler.transform(X_val[num_vars])
X_test[num_vars] = scaler.transform(X_test[num_vars])


### Model Training Function

In [None]:
import time
from sklearn.metrics import roc_auc_score, accuracy_score

def train_model(model, X_train, y_train, class_weight=None, epochs=50, batch_size=64, verbose=0, **kwargs):
    """
    Trains a model and measures training time.
    For sklearn, XGBoost, LightGBM, or Keras models.

    Parameters:
    - model: model to train
    - X_train, y_train: training data
    - class_weight: dictionary of class weights (used only for Keras;
      sklearn-style models receive class weights in their constructor)
    - epochs, batch_size, verbose: used for Keras
    - kwargs: additional parameters for .fit() if applicable

    Returns:
    - trained model
    - training time in seconds (float)
    """
    start_time = time.perf_counter()

    # Detect if it is a Keras model
    if 'keras' in str(type(model)).lower():
        model.fit(X_train, y_train,
                  class_weight=class_weight,
                  epochs=epochs,
                  batch_size=batch_size,
                  verbose=verbose,
                  validation_split=0,  # Train on the full training set; the validation set is evaluated separately
                  **kwargs)
    else:
        # sklearn, xgboost, or lightgbm models: fit() does not accept class_weight
        model.fit(X_train, y_train, **kwargs)

    training_time = time.perf_counter() - start_time
    return model, training_time


### Model Evaluation Function

In [None]:
def evaluate_model(model, X, y_true, batch_size=64):
    """
    Evaluates the model on X data with true labels y_true.
    For sklearn, xgboost, lightgbm, or keras models.
    Returns a dictionary with metrics (AUC-ROC, Accuracy).
    """
    # Detect if it is a Keras model
    if hasattr(model, "predict") and 'keras' in str(type(model)).lower():
        y_pred_proba = model.predict(X, batch_size=batch_size).flatten()
        y_pred = (y_pred_proba >= 0.5).astype(int)
    else:
        y_pred_proba = model.predict_proba(X)[:, 1] if hasattr(model, "predict_proba") else None
        y_pred = model.predict(X)
    
    auc = roc_auc_score(y_true, y_pred_proba) if y_pred_proba is not None else None
    acc = accuracy_score(y_true, y_pred)
    
    return {'AUC-ROC': auc, 'Accuracy': acc}

results = {
    "model_name": [],
    "dataset": [],  # 'validation' or 'test'
    "AUC-ROC": [],
    "Accuracy": []
}

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression(
    solver='liblinear',     
    class_weight='balanced', 
    random_state=42,
    max_iter=500             
)

In [None]:
# Train model and capture time
logreg_model, logreg_training_time = train_model(logreg_model, X_train, y_train)

# Evaluation on validation and test sets
logreg_val_metrics = evaluate_model(logreg_model, X_val, y_val)
logreg_test_metrics = evaluate_model(logreg_model, X_test, y_test)

# Display results
print("Logistic Regression - Validation:", logreg_val_metrics)
print("Logistic Regression - Test:", logreg_test_metrics)
print(f"Training time: {logreg_training_time:.2f} seconds")


### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the model with initial parameters
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

In [None]:
# Train model and capture time
rf_model, rf_training_time = train_model(rf_model, X_train, y_train)

# Evaluation on validation and test sets
rf_val_metrics = evaluate_model(rf_model, X_val, y_val)
rf_test_metrics = evaluate_model(rf_model, X_test, y_test)

# Display results
print("Random Forest - Validation:", rf_val_metrics)
print("Random Forest - Test:", rf_test_metrics)
print(f"Random Forest training time: {rf_training_time:.2f} seconds")


### XGBoost

In [None]:
import xgboost as xgb

# Initialize XGBoost model with basic parameters and class balancing
xgb_model = xgb.XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),  # class balancing
    random_state=42
)


In [None]:
# Train model and measure time
xgb_model, xgb_training_time = train_model(xgb_model, X_train, y_train)

# Evaluate on validation and test sets
xgb_val_metrics = evaluate_model(xgb_model, X_val, y_val)
xgb_test_metrics = evaluate_model(xgb_model, X_test, y_test)

# Display results
print("XGBoost - Validation:", xgb_val_metrics)
print("XGBoost - Test:", xgb_test_metrics)
print(f"XGBoost training time: {xgb_training_time:.2f} seconds")

### LightGBM

In [None]:
import lightgbm as lgb

# Initialize LightGBM model with basic parameters and class balancing
lgb_model = lgb.LGBMClassifier(
    class_weight='balanced',
    random_state=42,
    n_estimators=100,
    learning_rate=0.1
)

In [None]:
# Train model and measure time
lgb_model, lgb_training_time = train_model(lgb_model, X_train, y_train)

# Evaluate on validation and test sets
lgb_val_metrics = evaluate_model(lgb_model, X_val, y_val)
lgb_test_metrics = evaluate_model(lgb_model, X_test, y_test)

# Display results
print("LightGBM - Validation:", lgb_val_metrics)
print("LightGBM - Test:", lgb_test_metrics)
print(f"LightGBM training time: {lgb_training_time:.2f} seconds")


### Neural Network

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Define a simple architecture for binary classification
def build_nn_model(input_dim):
    model = Sequential([
        Dense(64, activation='relu', input_shape=(input_dim,)),
        Dropout(0.3),
        Dense(32, activation='relu'),
        Dropout(0.3),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model


In [None]:
# Build the model
nn_model = build_nn_model(X_train.shape[1])

# Compute class weights for balancing (optional)
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

class_weights = compute_class_weight(class_weight='balanced',
                                     classes=np.unique(y_train),
                                     y=y_train)
class_weight_dict = dict(enumerate(class_weights))

# Train neural network and measure time
nn_model, nn_training_time = train_model(
    nn_model,
    X_train,
    y_train,
    class_weight=class_weight_dict,
    epochs=50,
    batch_size=64,
    verbose=0
)

# Evaluate on validation and test sets
nn_val_metrics = evaluate_model(nn_model, X_val, y_val)
nn_test_metrics = evaluate_model(nn_model, X_test, y_test)

# Display results
print("Neural Network - Validation:", nn_val_metrics)
print("Neural Network - Test:", nn_test_metrics)
print(f"Neural Network training time: {nn_training_time:.2f} seconds")


### Results

In [None]:
import pandas as pd

# Create a DataFrame with the results
results_df = pd.DataFrame([
    {
        "Model": "Logistic Regression",
        "AUC-ROC Validation": logreg_val_metrics['AUC-ROC'],
        "Accuracy Validation": logreg_val_metrics['Accuracy'],
        "AUC-ROC Test": logreg_test_metrics['AUC-ROC'],
        "Accuracy Test": logreg_test_metrics['Accuracy'],
        "Time (s)": logreg_training_time
    },
    {
        "Model": "Random Forest",
        "AUC-ROC Validation": rf_val_metrics['AUC-ROC'],
        "Accuracy Validation": rf_val_metrics['Accuracy'],
        "AUC-ROC Test": rf_test_metrics['AUC-ROC'],
        "Accuracy Test": rf_test_metrics['Accuracy'],
        "Time (s)": rf_training_time
    },
    {
        "Model": "XGBoost",
        "AUC-ROC Validation": xgb_val_metrics['AUC-ROC'],
        "Accuracy Validation": xgb_val_metrics['Accuracy'],
        "AUC-ROC Test": xgb_test_metrics['AUC-ROC'],
        "Accuracy Test": xgb_test_metrics['Accuracy'],
        "Time (s)": xgb_training_time
    },
    {
        "Model": "LightGBM",
        "AUC-ROC Validation": lgb_val_metrics['AUC-ROC'],
        "Accuracy Validation": lgb_val_metrics['Accuracy'],
        "AUC-ROC Test": lgb_test_metrics['AUC-ROC'],
        "Accuracy Test": lgb_test_metrics['Accuracy'],
        "Time (s)": lgb_training_time
    },
    {
        "Model": "Neural Network",
        "AUC-ROC Validation": nn_val_metrics['AUC-ROC'],
        "Accuracy Validation": nn_val_metrics['Accuracy'],
        "AUC-ROC Test": nn_test_metrics['AUC-ROC'],
        "Accuracy Test": nn_test_metrics['Accuracy'],
        "Time (s)": nn_training_time
    }
])

# Display table sorted by Test AUC-ROC
results_df_sorted = results_df.sort_values(by="AUC-ROC Test", ascending=False)
display(results_df_sorted)

## Results Interpretation

Five supervised models were compared to predict customer churn. The metrics used were AUC-ROC (primary), Accuracy (secondary), and Training Time.

### 1. Primary Metric: AUC-ROC

AUC-ROC measures the model's ability to distinguish between customers who churn and those who do not. Values closer to 1 are better.
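One way to read AUC-ROC: it equals the probability that a randomly chosen churned customer receives a higher score than a randomly chosen active one. The sketch below computes it that way on four made-up scores; `pairwise_auc` is an illustrative helper, not part of this notebook's pipeline.

```python
def pairwise_auc(y_true, scores):
    """Share of (churned, active) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted churn probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, scores)
print(auc_value)  # 0.75: three of the four churned/active pairs are ordered correctly
```

Because only the ordering of scores matters, AUC-ROC does not depend on the 0.5 decision threshold, which is why it is used here as the primary metric.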

| Model               | Test AUC-ROC   |
|---------------------|----------------|
| **XGBoost**         | **0.8916**     |
| LightGBM            | 0.8819         |
| Random Forest       | 0.8563         |
| Neural Network      | 0.8391         |
| Logistic Regression | 0.8375         |

> **XGBoost** achieves the highest test AUC-ROC, roughly 0.01 ahead of LightGBM.

---

### 2. Secondary Metric: Accuracy

Accuracy measures the proportion of correct predictions. It does not account for class imbalance but provides a general view of performance.
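To see why accuracy alone can mislead under class imbalance, consider a baseline that predicts "active" for every customer. The sketch below uses a hypothetical 73/27 split; the real proportions come from the churn rate printed earlier in the notebook.

```python
# Hypothetical labels: 73 active customers (0) and 27 churned customers (1)
y_true = [0] * 73 + [1] * 27

# Baseline that never predicts churn
always_active = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, always_active))
baseline_accuracy = correct / len(y_true)

# Share of actual churners that the baseline catches
churners_found = sum(t == 1 and p == 1 for t, p in zip(y_true, always_active))
recall_churn = churners_found / sum(y_true)

print(baseline_accuracy)  # 0.73
print(recall_churn)       # 0.0
```

Any accuracy figure in the table should be read against this kind of baseline: a model is only useful to the extent that it beats the majority-class rate.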

| Model               | Test Accuracy   |
|---------------------|-----------------|
| **XGBoost**         | **0.8411**      |
| LightGBM            | 0.8174          |
| Random Forest       | 0.8127          |
| Logistic Regression | 0.7512          |
| Neural Network      | 0.7294          |

> **XGBoost** also leads on accuracy, consistent with its AUC-ROC result.

---

### 3. Training Time

Time required to train each model was measured.

| Model                   | Time (s)   |
|-------------------------|------------|
| **Logistic Regression** | **0.014**  |
| LightGBM                | 0.166      |
| Random Forest           | 0.320      |
| XGBoost                 | 1.276      |
| Neural Network          | 3.566      |

> **The neural network was the slowest**, while **LightGBM** offers a good balance between speed and performance.

---

### Summary

| Criterion       | Best Model          | Comment                                                    |
|-----------------|---------------------|------------------------------------------------------------|
| Best AUC-ROC    | **XGBoost**         | Highest predictive capability                              |
| Speed           | Logistic Regression | Very fast, but lower performance                           |
| Overall Balance | LightGBM            | Combines reasonable accuracy with computational efficiency |

---

### Recommendation

- For **maximum accuracy**, use **XGBoost**.
- For **efficiency with good performance**, **LightGBM** is an excellent alternative.
- On predictive metrics alone, the **neural network** and **logistic regression** rank lowest and can be excluded in this case.



## General Conclusion

---

The objective of this project was to develop models capable of predicting **customer churn** using historical data from a telecommunications company. Five supervised models were implemented: **Logistic Regression**, **Random Forest**, **XGBoost**, **LightGBM**, and a **Simple Neural Network**.

### Accuracy vs Interpretability

Although complex models like **XGBoost** and **LightGBM** provided the best results in terms of **AUC-ROC** and **accuracy**, the **trade-off between predictive performance and explainability** should be considered:

- **XGBoost** achieved the highest predictive performance (AUC-ROC = 0.89), making it the most effective model if accuracy is the primary goal.  
- **LightGBM**, with similar metrics and significantly lower training time, represents a more efficient alternative suitable for production when computational cost is a concern.  
- **Logistic Regression**, while less accurate, provides **maximum interpretability**, which is essential if predictions must be translated into business decisions or justified to stakeholders.

### Interpretability as a Business Criterion

According to reviewer recommendations, in real-world problems like this, where decisions must be understandable to business areas and justify retention campaigns, **model interpretability is critical**. Therefore:

- It is recommended to **start with interpretable models** (such as logistic regression or decision trees) to generate a comprehensible foundation of churn behavior.  
- **Complex models** like XGBoost or neural networks should only be used **if there is a significant performance improvement** that justifies the loss of transparency.
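As an illustration of what "interpretable" means for logistic regression, the sketch below fits a model on synthetic data and converts its coefficients into odds ratios. The feature names, the effect size, and the data are all assumptions made for this example; they are not results from the telecom dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for two features (not the telecom data)
month_to_month = rng.integers(0, 2, size=n)   # hypothetical binary flag
noise_feature = rng.normal(size=n)            # unrelated to churn by construction

# Assumed ground truth: the flag raises the churn log-odds by 1.5
logit = -1.5 + 1.5 * month_to_month
churn = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([month_to_month, noise_feature])
model = LogisticRegression().fit(X, churn)

# exp(coefficient) is the multiplicative change in churn odds per unit increase
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(["month_to_month", "noise_feature"], odds_ratios):
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio well above 1 flags a feature associated with higher churn odds, while a ratio near 1 indicates little effect. This kind of direct reading is what tree ensembles and neural networks do not offer without additional tooling.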

### Final Recommendation

- For an **analytical or exploratory environment**, use **Logistic Regression** or **Random Forest** to facilitate interpretation and insights extraction.  
- For an **operational or competitive environment**, where maximum predictive performance is required, use **XGBoost** with interpretability tools like **SHAP** to explain predictions.

---

This balanced approach ensures both **predictive effectiveness** and **practical usability** in real business contexts.

---

