# 🤖 TelecomX, Part 2: Churn Prediction Modeling

This notebook implements the requested **Machine Learning** pipeline:

1. Loads the **cleaned CSV** from Part 1 (`data/processed/telecomx_clean.csv`);
2. Removes columns with no predictive value (IDs and similar);
3. Encodes **categorical** features (One-Hot Encoding) and prepares **numeric** ones;
4. Checks the **churn proportion** and (optionally) applies **class balancing** (SMOTE);
5. Assesses the **need for normalization** (per model);
6. Computes **correlations** and builds **targeted plots** (tenure × churn, total spend × churn);
7. Splits into **train/test** (70/30 or 80/20);
8. Trains at least **two models** (e.g. Logistic Regression and Random Forest);
9. Compares **metrics** (Accuracy, Precision, Recall, F1, Confusion Matrix, ROC-AUC);
10. Interprets the models through **coefficients** (LR) and **feature importances** (RF);
11. Generates an **HTML report** with results and insights.

> Compatible with Google Colab.

## 📦 Installation and Imports

In [None]:
# !pip install -q -r /content/requirements.txt || true

import os, re, json, warnings, datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, RocCurveDisplay,
                             classification_report)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['axes.grid'] = True


## 1) 📂 Loading the cleaned CSV

In [None]:
DATA_PROCESSED = os.path.join(os.getcwd().replace('/notebooks',''), "data", "processed")
csv_path = os.path.join(DATA_PROCESSED, "telecomx_clean.csv")
print("Expected CSV:", csv_path)
df = pd.read_csv(csv_path)
print("Shape:", df.shape)
display(df.head(3))


## 🔎 Identifying the target (churn) and basic cleaning

In [None]:
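TARGET_ALIASES = {'churn', 'evasao', 'evasão', 'status_evasao', 'evaded', 'target'}

def find_target_column(frame):
    """Return the churn column: a known alias first, else the first binary column."""
    for c in frame.columns:
        canon = c.lower().replace(' ', '_').replace('-', '_')
        if canon in TARGET_ALIASES:
            return c
    # Fallback heuristic: it may pick an unrelated binary column (e.g. gender),
    # so check the printed name before continuing.
    for c in frame.columns:
        if frame[c].dropna().nunique() == 2:
            return c
    return None

target_col = find_target_column(df)
print("Detected target:", target_col)

def to01(s):
    """Map yes/no-style labels (Portuguese or English) to 1/0; anything else becomes NaN."""
    label_map = {'sim': 1, 's': 1, 'yes': 1, 'y': 1, 'true': 1, '1': 1,
                 'nao': 0, 'não': 0, 'n': 0, 'no': 0, 'false': 0, '0': 0}
    def f(v):
        if pd.isna(v):
            return np.nan
        # Accept Python and NumPy booleans/numbers that are exactly 0 or 1.
        if isinstance(v, (bool, int, float, np.bool_, np.integer, np.floating)) and v in (0, 1):
            return int(v)
        if isinstance(v, str):
            return label_map.get(v.strip().lower(), np.nan)
        return np.nan
    return s.apply(f)

if target_col is None:
    raise RuntimeError("Could not identify the churn target column. Set target_col manually.")

# Drop rows whose target could not be mapped, then keep y aligned with df.
y = to01(df[target_col])
df = df.loc[~y.isna()].copy()
y = y.loc[df.index].astype(int)
print("Target distribution:")
print(y.value_counts(normalize=True).rename('proportion'))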


## 2) 🧹 Removing non-predictive columns (IDs etc.)

In [None]:
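# 'account' is left out on purpose: it would also flag feature columns
# whose names merely start with an account prefix.
ID_TOKENS = {'id', 'uuid', 'customerid', 'clienteid', 'cpf', 'cnpj', 'rg'}

def is_identifier(col, series):
    """Flag identifier-like columns by name token or by near-unique text values."""
    # Compare whole name tokens: substring checks would also flag
    # 'charges' (contains 'rg') or 'paid' (contains 'id').
    tokens = set(re.split(r'[^a-z0-9]+', col.lower()))
    if tokens & ID_TOKENS:
        return True
    # Near-unique values suggest an ID only for non-numeric columns; continuous
    # numeric features (e.g. total charges) are legitimately near-unique.
    if not pd.api.types.is_numeric_dtype(series):
        return series.nunique(dropna=True) / max(1, len(series)) > 0.98
    return False

id_like = [c for c in df.columns if c != target_col and is_identifier(c, df[c])]
print("Columns removed as identifiers:", id_like)
X = df.drop(columns=[target_col] + id_like, errors='ignore')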

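The identifier check is name-based, so it helps to see how substring matching and whole-token matching differ on typical column names. The sketch below is illustrative only: the column names and the `ID_HINTS` set are hypothetical and not taken from the dataset.

```python
import re

# Hypothetical hints and column names, chosen only to contrast the two strategies.
ID_HINTS = {"id", "cpf", "rg"}
columns = ["customer_id", "total_charges", "paid_on_time", "cpf"]

def substring_match(name):
    """True if any hint appears anywhere inside the lowercase name."""
    return any(hint in name.lower() for hint in ID_HINTS)

def token_match(name):
    """True if any hint equals a whole token of the name (split on non-alphanumerics)."""
    return bool(set(re.split(r"[^a-z0-9]+", name.lower())) & ID_HINTS)

by_substring = [c for c in columns if substring_match(c)]
by_token = [c for c in columns if token_match(c)]
print("substring:", by_substring)
print("token:    ", by_token)
```

Substring matching flags `total_charges` (it contains `rg`) and `paid_on_time` (it contains `id`), so it is worth reviewing the printed list of removed columns before training.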

## 3) 🔡 One-Hot Encoding and variable types

In [None]:
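# Split by dtype; select_dtypes also covers string/category/bool columns,
# which a plain `dtype == 'object'` test can miss.
num_cols = X.select_dtypes(include='number').columns.tolist()
cat_cols = X.select_dtypes(exclude='number').columns.tolist()
print("Categorical:", len(cat_cols), "| Numeric:", len(num_cols))

# scikit-learn >= 1.2 names this argument `sparse_output`; older releases use `sparse`.
try:
    onehot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
except TypeError:
    onehot = OneHotEncoder(handle_unknown='ignore', sparse=False)

# Scaled variant for models sensitive to feature scale (Logistic Regression).
preprocess_scaled = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', onehot, cat_cols)
], remainder='drop')

# Unscaled variant for tree-based models (Random Forest).
preprocess_no_scale = ColumnTransformer([
    ('num', 'passthrough', num_cols),
    ('cat', onehot, cat_cols)
], remainder='drop')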

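With `handle_unknown='ignore'`, a category that appears only at prediction time is encoded as all zeros instead of raising an error. The pure-Python sketch below mimics that behavior on a hypothetical `contract` column; it illustrates the idea and is not scikit-learn's implementation.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories seen during training."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value; a category unseen in training yields an all-zero row."""
    return [1 if value == cat else 0 for cat in categories]

# Hypothetical training values for a 'contract' column.
train_values = ["monthly", "yearly", "monthly", "two_year"]
categories = fit_categories(train_values)

encoded_known = one_hot("yearly", categories)
encoded_unknown = one_hot("weekly", categories)  # never seen in training

print(categories)
print(encoded_known, encoded_unknown)
```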

## 4) 📏 Churn proportion (class imbalance)

In [None]:
counts = y.value_counts()
props = y.value_counts(normalize=True)
print("Counts:\n", counts)
print("\nProportion:\n", props)

ax = counts.sort_index().plot(kind='bar')
ax.set_title("Target distribution (0 = Retained, 1 = Churn)")
ax.set_xlabel("Class"); ax.set_ylabel("Count")
plt.tight_layout(); plt.show()

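The class proportion also fixes the accuracy of a trivial model that always predicts the majority class, which is a useful floor when reading the metrics later. A minimal sketch with made-up labels (the 75/25 split is an assumption for illustration, not the dataset's actual proportion):

```python
from collections import Counter

# Made-up labels: 75 retained customers (0) and 25 churners (1).
labels = [0] * 75 + [1] * 25

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A model that always predicts the majority class reaches this accuracy...
baseline_accuracy = majority_count / len(labels)
# ...while identifying none of the churners when the majority class is 0.
baseline_recall = 0.0 if majority_class == 0 else 1.0

print(f"baseline accuracy={baseline_accuracy:.2f}, churn recall={baseline_recall:.2f}")
```

This is why recall and F1 on the churn class tend to be more informative than accuracy alone when classes are imbalanced.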

## 5) ⚖️ (Optional) Class balancing (SMOTE)

In [None]:
# SMOTE is applied later as a pipeline step, so only the training data is resampled.
USE_SMOTE = True  # set to False to train on the original class distribution
print("SMOTE enabled:", USE_SMOTE)

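SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The sketch below shows only that interpolation step on hypothetical 2-D points; the neighbor search and the other details of imbalanced-learn's `SMOTE` are left out.

```python
import random

def interpolate(point, neighbor, rng):
    """Return a synthetic sample on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Hypothetical minority-class samples (two numeric features each).
point = [1.0, 4.0]
neighbor = [3.0, 8.0]

synthetic = [interpolate(point, neighbor, rng) for _ in range(5)]
for sample in synthetic:
    print([round(v, 3) for v in sample])
```

Interpolation can yield fractional values on one-hot columns. imbalanced-learn also provides `SMOTENC` for mixed numeric and categorical data, which may be worth evaluating here.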

## 6–7) 📈 Normalization (when needed) and 🔗 Correlation

In [None]:
# Numeric features only (categorical columns are handled by the one-hot encoder).
num_only = X.select_dtypes(include='number').copy()
corr = num_only.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr.values, vmin=-1, vmax=1, cmap='coolwarm')
fig.colorbar(im, ax=ax)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
ax.grid(False)
ax.set_title("Correlation between numeric variables")
plt.tight_layout(); plt.show()

# Pearson correlation of each numeric feature with the 0/1 target.
# corrwith aligns on the index and skips missing values; sorting by absolute
# value keeps strong negative correlations in the top 10.
corr_y = num_only.corrwith(y.astype(float)).sort_values(key=abs, ascending=False)
print("\nCorrelation with churn (numeric features, strongest first):")
display(corr_y.dropna().head(10))

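On the normalization side, models fitted by regularized gradient-based optimization, such as Logistic Regression, usually benefit from standardized inputs, while tree ensembles such as Random Forest are insensitive to monotonic rescaling of a feature. The sketch below shows the z-score transform that `StandardScaler` applies (mean 0, population standard deviation 1) on hypothetical tenure values.

```python
from statistics import fmean, pstdev

# Hypothetical tenure values in months.
tenure = [1.0, 12.0, 24.0, 48.0, 72.0]

mean = fmean(tenure)
std = pstdev(tenure)  # population standard deviation (ddof=0), as StandardScaler uses

# z-score: subtract the mean, divide by the standard deviation.
z_scores = [(v - mean) / std for v in tenure]
print([round(z, 3) for z in z_scores])
```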

## 8) 🔍 Targeted Analyses (examples)

In [None]:
def find_col_like(patterns, pool):
    """Return the first column whose lowercase name contains a pattern (patterns in priority order)."""
    for p in patterns:
        for c in pool:
            if p in c.lower():
                return c
    return None

tenure_col = find_col_like(['tenure','tempo','meses','months'], X.columns)
charges_col = find_col_like(['totalcharges','total_charges','total','gasto','charges'], X.columns)

# One boxplot per detected feature, split by churn class.
for col in (tenure_col, charges_col):
    if col is None:
        continue
    df_plot = pd.DataFrame({col: df[col], 'churn': y})
    df_plot.boxplot(by='churn', column=col)
    plt.title(f"{col} by churn"); plt.suptitle("")
    plt.xlabel("Churn"); plt.ylabel(col)
    plt.tight_layout(); plt.show()


## 9) ✂️ Train/Test Split

In [None]:
TEST_SIZE = 0.3
RANDOM_STATE = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
                                                    stratify=y, random_state=RANDOM_STATE)
print("Train:", X_train.shape, "| Test:", X_test.shape)

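`stratify=y` keeps the churn proportion the same in both partitions. A simplified pure-Python sketch of the idea on made-up labels follows; it splits each class separately and is not the algorithm `train_test_split` actually uses.

```python
import random

def stratified_split(labels, test_size, seed):
    """Split indices so each class sends a `test_size` share of its members to the test set."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

def churn_rate(idx, labels):
    """Share of positive (churn) labels among the given indices."""
    return sum(labels[i] for i in idx) / len(idx)

# Made-up labels: 70 retained (0) and 30 churned (1).
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)

print(len(train_idx), len(test_idx))
print(round(churn_rate(train_idx, labels), 2), round(churn_rate(test_idx, labels), 2))
```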

## 10–11) 🧪 Models and Evaluation

In [None]:
def evaluate_model(name, y_true, y_pred, y_proba=None):
    """Print metrics, plot the confusion matrix and ROC curve, and return the metrics as a dict."""
    metrics = {
        'Acuracia': accuracy_score(y_true, y_pred),
        'Precisao': precision_score(y_true, y_pred, zero_division=0),
        'Recall': recall_score(y_true, y_pred, zero_division=0),
        'F1': f1_score(y_true, y_pred, zero_division=0),
        'ROC_AUC': roc_auc_score(y_true, y_proba) if y_proba is not None else np.nan,
    }
    print(f"\n{name}: metrics")
    print(" | ".join(f"{k}: {v:.4f}" for k, v in metrics.items()))
    print("\nClassification report:\n", classification_report(y_true, y_pred, zero_division=0))
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
    plt.title(f"Confusion matrix: {name}")
    plt.tight_layout(); plt.show()
    if y_proba is not None:
        RocCurveDisplay.from_predictions(y_true, y_proba)
        plt.title(f"ROC: {name} (AUC={metrics['ROC_AUC']:.4f})")
        plt.tight_layout(); plt.show()
    return metrics

def build_pipeline(preprocessor, clf):
    """Preprocess, optionally oversample with SMOTE, then classify.

    The imblearn pipeline runs samplers only during fit, so test data is never resampled.
    """
    steps = [('preprocess', preprocessor)]
    if USE_SMOTE:
        steps.append(('smote', SMOTE(random_state=RANDOM_STATE)))
    steps.append(('clf', clf))
    return ImbPipeline(steps=steps)

# Logistic Regression uses scaled inputs; Random Forest does not need scaling.
pipe_lr = build_pipeline(preprocess_scaled, LogisticRegression(max_iter=1000))
pipe_rf = build_pipeline(preprocess_no_scale, RandomForestClassifier(
    n_estimators=300, random_state=RANDOM_STATE, n_jobs=-1
))

results = []
for name, pipe in [("LogisticRegression", pipe_lr), ("RandomForest", pipe_rf)]:
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    y_proba = pipe.predict_proba(X_test)[:, 1]  # probability of the churn class
    results.append({'Modelo': name, **evaluate_model(name, y_test, y_pred, y_proba)})

res_df = pd.DataFrame(results)
print("\nModel comparison:")
display(res_df.sort_values('F1', ascending=False).reset_index(drop=True))

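Accuracy, precision, recall, and F1 all derive from the confusion matrix counts. A worked example with hypothetical counts (not results from this notebook):

```python
# Hypothetical confusion-matrix counts, with churn (1) as the positive class.
tp, fp, fn, tn = 60, 40, 20, 180

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of the customers flagged as churn, the share that churned
recall = tp / (tp + fn)     # of the customers who churned, the share that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For retention work, recall on the churn class is often the metric to watch, since a missed churner usually costs more than an unnecessary retention offer; that trade-off is a business assumption to validate.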

## 12) 🧭 Feature Importance

In [None]:
def get_feature_names(preprocessor):
    """Collect output feature names from the fitted ColumnTransformer, in output order."""
    output = []
    for name, trans, cols in preprocessor.transformers_:
        if name == 'num':
            # Scaling or passthrough keeps one output column per numeric input.
            output.extend(cols)
        elif name == 'cat' and len(cols) > 0:
            # One output column per category of each categorical input.
            output.extend(trans.get_feature_names_out(cols).tolist())
    return output

# LR: coefficients on standardized numeric features and one-hot columns
try:
    pre_lr = pipe_lr.named_steps['preprocess']
    feat_names_lr = get_feature_names(pre_lr)
    coefs = pipe_lr.named_steps['clf'].coef_.ravel()
    imp_lr = pd.Series(coefs, index=feat_names_lr).sort_values(key=abs, ascending=False).head(20)
    display(imp_lr.to_frame('coef'))
    ax = imp_lr.iloc[::-1].plot(kind='barh')
    ax.set_title("Top coefficients (by |value|): Logistic Regression")
    plt.tight_layout(); plt.show()
except Exception as e:
    print("Importance (LR) unavailable:", e)

# RF: impurity-based feature importances
try:
    pre_rf = pipe_rf.named_steps['preprocess']
    feat_names_rf = get_feature_names(pre_rf)
    fi = pipe_rf.named_steps['clf'].feature_importances_
    imp_rf = pd.Series(fi, index=feat_names_rf).sort_values(ascending=False).head(20)
    display(imp_rf.to_frame('importance'))
    ax = imp_rf.iloc[::-1].plot(kind='barh')
    ax.set_title("Top importances: Random Forest")
    plt.tight_layout(); plt.show()
except Exception as e:
    print("Importance (RF) unavailable:", e)

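A Logistic Regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio. Because the numeric features were standardized, that ratio refers to a one-standard-deviation increase. The coefficient values below are hypothetical, not fitted values from this notebook:

```python
import math

# Hypothetical coefficients on standardized features.
coefficients = {"tenure": -0.80, "monthly_charges": 0.45}

# exp(coef) is the multiplicative change in churn odds per +1 standard deviation.
odds_ratios = {feature: math.exp(coef) for feature, coef in coefficients.items()}

for feature, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{feature}: +1 std {direction} churn odds by a factor of {ratio:.2f}")
```

For one-hot columns the same reading applies to the category being present versus absent.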

## 13–14) 📝 Report and Artifacts

In [None]:
from pathlib import Path
FIG_DIR = os.path.join(os.getcwd().replace('/notebooks',''), "reports", "figures")
Path(FIG_DIR).mkdir(parents=True, exist_ok=True)

ax = res_df.set_index('Modelo')[['Acuracia','Precisao','Recall','F1']].plot(kind='bar')
ax.set_title("Model comparison")
plt.xticks(rotation=0)
plt.tight_layout()
fig_path = os.path.join(FIG_DIR, "comparativo_modelos.png")
plt.savefig(fig_path); plt.show()
print("Figure saved:", fig_path)

def build_html_report():
    """Build the HTML report from the notebook settings and the metrics table."""
    train_pct = round((1 - TEST_SIZE) * 100)
    smote_status = "enabled (SMOTE)" if USE_SMOTE else "disabled"
    # The report is saved in reports/, so the figure path is relative to that folder.
    return f"""<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/>
<title>Report: Churn Modeling</title>
<style>
body{{font-family:Arial,Helvetica,sans-serif;max-width:980px;margin:24px auto;line-height:1.5}}
h1,h2{{margin-top:24px}}
img{{max-width:100%;height:auto;border:1px solid #ddd;border-radius:6px;padding:4px;margin:8px 0}}
table{{border-collapse:collapse;width:100%}}
th,td{{border:1px solid #ddd;padding:6px 8px;text-align:center}}
thead{{background:#f3f3f3}}
hr{{border:none;border-top:1px solid #ddd;margin:24px 0}}
</style></head><body>
<h1>Report: Churn Modeling (TelecomX)</h1>
<p><strong>Date:</strong> {datetime.date.today().isoformat()}</p>
<h2>Pipeline</h2>
<ul>
  <li>Preprocessing with One-Hot Encoding for categorical features and scaling (when needed).</li>
  <li>Train/test split: {train_pct}% / {100 - train_pct}%.</li>
  <li>Models: Logistic Regression (with normalization) and Random Forest (without normalization).</li>
  <li>Class balancing: {smote_status}</li>
</ul>
<h2>Metrics</h2>
{res_df.to_html(index=False)}
<img src="figures/comparativo_modelos.png" alt="Model comparison"/>
<h2>Feature Importance</h2>
<p>Coefficients (LR) and importances (RF) were computed and can be viewed in the previous notebook cells.</p>
<h2>Insights</h2>
<ul>
  <li>Tenure and cost variables tend to be among the main churn drivers.</li>
  <li>Higher-risk segments should receive proactive retention actions.</li>
</ul>
<hr/><p style="font-size:12px;color:#666">Automatically generated report.</p>
</body></html>
"""

html = build_html_report()
report_path = os.path.join(os.getcwd().replace('/notebooks',''), "reports", "Relatorio_Modelagem_Churn.html")
with open(report_path, "w", encoding="utf-8") as f:
    f.write(html)

print("Report saved to:", report_path)
