

**1. Frame and Analyze the Problem**

Goal: Predict which customers are most likely to cancel the service.

Challenges:

Imbalanced dataset (churners are the minority class).

Many categorical attributes (contract type, payment method, add-on services).

Need for metrics beyond accuracy.
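
To illustrate why accuracy alone can mislead on an imbalanced target, the sketch below scores a trivial "model" that always predicts the majority class. The 73/27 class split is a hypothetical example, not a figure measured from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 73 customers stayed (0) and 27 churned (1).
y_true = np.array([0] * 73 + [1] * 27)

# A trivial baseline that predicts "no churn" for everyone.
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when no positives are predicted.
baseline_recall = recall_score(y_true, y_majority, zero_division=0)
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)

print(f"accuracy={baseline_accuracy:.2f} "
      f"recall={baseline_recall:.2f} f1={baseline_f1:.2f}")
```

The baseline looks respectable on accuracy yet finds no churners at all, which is why recall and F1 on the churn class are worth tracking.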



**2. Get the Data**

Suggested source: Telco Customer Churn Dataset (Kaggle).

Example file: Churn.csv.

In [None]:
import pandas as pd

df = pd.read_csv("Churn.csv")
df.head()


**3. Explore the Data**

In [None]:
# info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()
print(df['Churn'].value_counts(normalize=True))

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x="Churn", data=df)
plt.title("Customer Distribution (Churn)")
plt.show()


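**4. Clean the Data**

In [None]:
# Drop the identifier column, which is irrelevant for prediction
df = df.drop(["customerID"], axis=1)

# Fix TotalCharges: coerce non-numeric entries to NaN, then impute the median
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
# Assign the result back; fillna(inplace=True) on a selected column is
# deprecated in recent pandas and may not update df
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# Encode the target as 0/1
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})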
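
As a small illustration of the TotalCharges fix, the sketch below applies the same two calls to a hypothetical four-value series in which one entry is a blank string. The sample values are made up for the example.

```python
import pandas as pd

# Hypothetical sample: the second entry is a blank string, not a number.
charges = pd.Series(["29.85", " ", "108.15", "1889.5"])

# errors="coerce" turns entries that cannot be parsed into NaN.
numeric = pd.to_numeric(charges, errors="coerce")
n_missing = int(numeric.isna().sum())

# Impute the missing entry with the median of the parsed values.
filled = numeric.fillna(numeric.median())

print(n_missing)
print(filled.tolist())
```

Checking the count of coerced values before imputing is a cheap way to confirm that only a handful of rows are affected.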


**5. Split the Dataset into Feature and Target Arrays**

In [None]:
X = df.drop("Churn", axis=1)
y = df["Churn"]


**6. Preprocessing Techniques**

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

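# Numeric columns are scaled; every other column is treated as categorical.
# Selecting by "number" avoids depending on the exact int/float/string dtypes.
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns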

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])


**7. Split the Data into Training and Test Sets**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
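
To show what `stratify=y` contributes, the sketch below splits a hypothetical target with 20% positives and compares the positive rate in each partition. The arrays are synthetic stand-ins, not the churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 800 negatives and 200 positives.
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both partitions keep the overall positive rate.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Without stratification the test set's churn rate could drift from the training rate by chance, which makes metrics on the minority class noisier.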


**8. Define Several Models and Train Them**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
    "XGBoost": XGBClassifier(eval_metric="logloss")
}

results = {}
for name, model in models.items():
    pipe = Pipeline(steps=[("preprocessor", preprocessor),
                           ("classifier", model)])
    pipe.fit(X_train, y_train)
    # Pipeline.score() returns accuracy here; the report in section 9 adds precision, recall, and F1
    acc = pipe.score(X_test, y_test)
    results[name] = acc

results
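
Because the dataset is imbalanced, ranking models on accuracy alone may hide weak recall on churners. One possible alternative, sketched here under stated assumptions, is to compare candidates with several cross-validated metrics. The data comes from `make_classification` as a synthetic stand-in, and the names `candidates`, `summary`, and `pipe_cv` are illustrative rather than part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption): about 75% / 25% classes.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "recall", "f1", "roc_auc"]

summary = {}
for name, model in candidates.items():
    pipe_cv = Pipeline([("scale", StandardScaler()), ("classifier", model)])
    scores = cross_validate(pipe_cv, X_syn, y_syn, cv=cv, scoring=scoring)
    # Average each metric over the five folds.
    summary[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, metrics in summary.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

In the notebook itself, the same loop could wrap the existing `preprocessor` and `models` instead of the synthetic pieces used here.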


**9. Validate the Model**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Pick the model with the highest test accuracy from section 8
best_model = max(results, key=results.get)
print(f"Best model: {best_model} - Accuracy: {results[best_model]:.2f}")

pipe = Pipeline(steps=[("preprocessor", preprocessor),
                       ("classifier", models[best_model])])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.show()
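
As a reading aid for the confusion matrix, the sketch below unpacks its four cells and derives recall and precision by hand. The ten labels and predictions are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for ten customers (1 = churn).
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall_demo = tp / (tp + fn)      # share of actual churners that were caught
precision_demo = tp / (tp + fp)   # share of churn predictions that were right

print(tn, fp, fn, tp)
print(round(recall_demo, 3), round(precision_demo, 3))
```

For churn, false negatives (the `fn` cell) are customers who leave without being flagged, so recall is often the number to watch first.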


**10. Save the Solution**

In [None]:
import joblib

joblib.dump(pipe, "modelo_churn.pkl")
print("Model saved successfully!")
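
A saved pipeline is only useful if it can be restored and queried later. The sketch below shows one way that round trip might look, using a small synthetic pipeline and a temporary file so that it runs on its own. `demo_pipe`, `demo_model.pkl`, and the synthetic data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic pipeline standing in for the trained churn pipeline.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
demo_pipe = Pipeline([("scale", StandardScaler()),
                      ("classifier", LogisticRegression(max_iter=1000))])
demo_pipe.fit(X_syn, y_syn)

# Save to a temporary location, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(demo_pipe.predict(X_syn),
                                  restored.predict(X_syn))

# predict_proba gives the probability of the positive class in column 1.
churn_probability = restored.predict_proba(X_syn[:3])[:, 1]
print(same_predictions, churn_probability.round(3))
```

Because the preprocessing steps are stored inside the pipeline, the restored object accepts raw feature rows directly. Loading generally works best with the same scikit-learn version that was used for saving.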
