# **TelecomX Challenge – Part 2**
Notebook for data analysis, preprocessing, and predictive modeling.

This document covers:
- Exploratory data analysis (EDA)
- Cleaning and preprocessing
- Encoding of categorical variables
- Training of Machine Learning models
- Evaluation and metrics
- Model interpretability
- Export of results and models

**Author:** Magno Gabriel Huaromo Montañez

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_selection import VarianceThreshold
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')

## **1. Data loading**

In [None]:
PATH = 'datos_challenge.csv'  # Place the file in the same directory as the notebook
df = pd.read_csv(PATH)
print('Shape:', df.shape)
df.head()

## **2. Exploratory data analysis (EDA)**

In [None]:
# General information
df.info()

# Statistical summary of the numeric variables
df.describe()

In [None]:
# Missing values per column, most affected first
df.isnull().sum().sort_values(ascending=False)
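
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a report on hypothetical toy data; the column names and values are illustrative assumptions, not taken from the challenge file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the challenge dataset.
toy = pd.DataFrame({
    'tenure': [1, 12, np.nan, 40],
    'contract': ['monthly', None, None, 'yearly'],
    'charges': [29.9, 56.1, 70.0, 99.5],
})

# Count and percentage of missing values per column, most affected first.
null_report = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean() * 100,
}).sort_values('n_missing', ascending=False)

print(null_report)
```

The same two expressions can be applied to `df` directly once the file is loaded.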

## **3. Identifying the target variable**

In [None]:
possible_targets = [c for c in df.columns if c.lower() in ['churn','cancel','cancelled','canceled']]
if not possible_targets:
    raise ValueError('No target column detected')
target_col = possible_targets[0]
print('Target variable:', target_col)
df[target_col].value_counts(normalize=True)
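
The name-matching logic above can be tried in isolation. This is a small sketch on a hypothetical toy frame; the helper `find_target_column` and the toy columns are illustrative names, not part of the notebook.

```python
import pandas as pd

TARGET_NAMES = ('churn', 'cancel', 'cancelled', 'canceled')

def find_target_column(frame, names=TARGET_NAMES):
    """Return the first column whose lowercased, stripped name is in `names`."""
    matches = [c for c in frame.columns if str(c).strip().lower() in names]
    if not matches:
        raise ValueError('No target column detected')
    return matches[0]

# Hypothetical toy data: one churner out of four customers.
toy = pd.DataFrame({'CustomerID': [1, 2, 3, 4], 'Churn': ['Yes', 'No', 'No', 'No']})
target = find_target_column(toy)
print(target)
print(toy[target].value_counts(normalize=True))
```

The normalized counts show the class balance, which is what motivates the stratified split and `class_weight='balanced'` used later.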

## **4. Cleaning and preprocessing**

In [None]:
# Drop identifier columns, which carry no predictive signal
id_candidates = [c for c in df.columns if c.lower() in ['customerid','id','clientid','clienteid','custid']]
if id_candidates:
    df.drop(columns=id_candidates, inplace=True)

# Handle missing values (the target is excluded from both feature lists)
cat_cols = df.select_dtypes(include=['object','category']).columns.tolist()
cat_cols = [c for c in cat_cols if c != target_col]
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
num_cols = [c for c in num_cols if c != target_col]

# Assign the result back: fillna(inplace=True) on df[c] may not update df in recent pandas versions
for c in num_cols:
    if df[c].isnull().any():
        df[c] = df[c].fillna(df[c].median())
for c in cat_cols:
    if df[c].isnull().any():
        df[c] = df[c].fillna('Unknown')

# Map the target to 0/1
mapping = {}
for val in df[target_col].unique():
    if str(val).strip().lower() in ['yes','si','true','1','churn']:
        mapping[val] = 1
    else:
        mapping[val] = 0
df[target_col] = df[target_col].map(mapping).astype(int)
df.head()
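
To see the imputation and the target mapping on a small scale, here is a hedged sketch on hypothetical toy data. It uses a vectorized variant of the label mapping and assumes that every non-numeric feature is categorical; both choices are illustrative, not the notebook's exact method.

```python
import numpy as np
import pandas as pd

# Same labels the cell above treats as churn.
POSITIVE_LABELS = {'yes', 'si', 'true', '1', 'churn'}

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    'tenure': [1.0, np.nan, 5.0, 9.0],
    'plan': ['a', None, 'b', 'a'],
    'Churn': ['Yes', 'No', ' si ', 'no'],
})
target = 'Churn'

features = toy.columns.drop(target)
num = toy[features].select_dtypes(include='number').columns
cat = features.difference(num)  # assumption: every non-numeric feature is categorical

clean = toy.copy()
clean[num] = clean[num].fillna(clean[num].median())  # median per numeric column
clean[cat] = clean[cat].fillna('Unknown')            # explicit placeholder category

# Case-insensitive, whitespace-tolerant mapping of the labels to 0/1.
normalized = clean[target].astype(str).str.strip().str.lower()
clean[target] = normalized.isin(POSITIVE_LABELS).astype(int)
print(clean)
```

Note that any label outside `POSITIVE_LABELS`, including a missing one, ends up as 0, so it is worth checking `value_counts()` on the raw target first.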

## **5. Collapsing rare categories**

In [None]:
threshold = 0.01
for c in cat_cols:
    vc = df[c].value_counts(normalize=True)
    rares = vc[vc < threshold].index.tolist()
    if rares:
        df[c] = df[c].where(~df[c].isin(rares), 'Other')
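
The effect of the threshold is easiest to see on a single column. The sketch below wraps the same idea in a hypothetical helper, `collapse_rare`, and applies it to made-up data with a 2% threshold.

```python
import pandas as pd

def collapse_rare(series, threshold=0.01, other='Other'):
    """Replace categories whose relative frequency is below `threshold` with `other`."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    return series.where(~series.isin(rare), other)

# Hypothetical toy column: two categories appear in only 1% of rows each.
colors = pd.Series(['red'] * 60 + ['blue'] * 38 + ['teal', 'pink'])
collapsed = collapse_rare(colors, threshold=0.02)
print(collapsed.value_counts())
```

Collapsing before one-hot encoding keeps the number of dummy columns down and avoids columns that are almost entirely zero.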

## **6. One-Hot encoding**

In [None]:
X = df.drop(columns=[target_col])
y = df[target_col]
X_encoded = pd.get_dummies(X, columns=cat_cols, drop_first=True)
print('Shape after encoding:', X_encoded.shape)

In [None]:
selector = VarianceThreshold(threshold=0.0001)
selector.fit(X_encoded)
mask = selector.get_support()
X_sel = X_encoded.loc[:, mask]
print('Shape after removing near-constant features:', X_sel.shape)
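
The two cells above can be illustrated together on a tiny frame. This is a sketch on hypothetical data; `country_code` is an invented constant column included only so the variance filter has something to remove.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data; 'country_code' never changes, so it carries no information.
toy = pd.DataFrame({
    'contract': ['monthly', 'yearly', 'monthly', 'two_year'],
    'charges': [20.0, 55.5, 31.0, 80.0],
    'country_code': [1, 1, 1, 1],
})

# drop_first=True removes the first level ('monthly') to avoid a redundant column.
encoded = pd.get_dummies(toy, columns=['contract'], drop_first=True, dtype=int)

# Keep only the features whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=0.0001)
selector.fit(encoded)
kept = encoded.columns[selector.get_support()]
print(list(kept))
```

With a threshold of 0.0001 the filter drops not only strictly constant columns but also dummies that are 1 in only a handful of rows of a large dataset.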

## **7. Train/test split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.3, random_state=42, stratify=y)
print('Training set shape:', X_train.shape)
print('Test set shape:', X_test.shape)
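
The `stratify=y` argument keeps the class proportions equal in both partitions, which matters when churners are a minority. A minimal sketch with made-up labels (20% positives) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20 positives out of 100 rows.
X_toy = pd.DataFrame({'x': np.arange(100)})
y_toy = pd.Series([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Both partitions keep the 20% positive rate of the full dataset.
print(y_tr.mean(), y_te.mean())
```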

## **8. Model training**

In [None]:
# Feature scaling for logistic regression (the random forest uses the unscaled data)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression(max_iter=1000, class_weight='balanced')
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1, class_weight='balanced')

log_reg.fit(X_train_scaled, y_train)
rf.fit(X_train, y_train)
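
An alternative to scaling by hand is to bundle the scaler and the classifier in a scikit-learn `Pipeline`, so the scaler is always fitted on training data only. The sketch below uses synthetic data from `make_classification` as a stand-in for the challenge dataset; it is an optional variant, not the approach used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (assumption for illustration).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# fit() scales with training statistics; predict() reuses them on new data.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight='balanced'),
)
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)
print(preds[:10])
```

A pipeline also simplifies export later, since a single object holds both the preprocessing and the model.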

## **9. Model evaluation**

In [None]:
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_rf = rf.predict(X_test)

print('--- Logistic Regression ---')
print(confusion_matrix(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

print('--- Random Forest ---')
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
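
The reports above depend on the default 0.5 decision threshold. Threshold-independent scores such as ROC AUC and average precision can complement them; the following is a sketch on synthetic data, offered as an optional addition rather than part of the original evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption for illustration).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

model = RandomForestClassifier(n_estimators=50, random_state=42, class_weight='balanced')
model.fit(X_tr, y_tr)

# Probability of the positive class for each test row.
scores = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores)
ap = average_precision_score(y_te, scores)
print(f'ROC AUC: {auc:.3f}  Average precision: {ap:.3f}')
```

In the notebook, the same calls would take `rf.predict_proba(X_test)[:, 1]` and `log_reg.predict_proba(X_test_scaled)[:, 1]` together with `y_test`.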

## **10. Model interpretation**

In [None]:
rf_importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
rf_importances.head(10).plot(kind='barh', figsize=(8,5))
plt.title('Top 10 most important features - Random Forest')
plt.show()
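
The logistic regression can be read in a similar way: because it is trained on standardized features, the magnitudes of its coefficients are roughly comparable, and the sign indicates the direction of the effect. Here is a minimal sketch on synthetic data with hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data; feature names are hypothetical.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
feature_names = [f'feature_{i}' for i in range(X_syn.shape[1])]

X_scaled = StandardScaler().fit_transform(X_syn)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_syn)

# Rank coefficients by absolute value while keeping their sign.
coefs = pd.Series(clf.coef_[0], index=feature_names)
top = coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(5)
print(top)
```

In the notebook, the equivalent would be `pd.Series(log_reg.coef_[0], index=X_train.columns)`. Correlated features share weight between them, so these rankings are a guide rather than a causal statement.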