# Customer Churn - Telecom
---

### CRISP-DM Methodology
This project follows the CRISP-DM (*Cross-Industry Standard Process for Data Mining*) framework applied to **Customer Retention & Churn Prediction**:
| **Stage** | **Objective** | **Methodological Execution** |
| :--- | :--- | :--- |
| **1. Business Understanding** | Mitigate revenue loss by identifying at-risk customers. | • **Target Definition**: Binary Classification (Churn: Yes/No).<br>• **KPIs**: Maximize **Lift** in retention campaigns & Revenue Saved vs. Cost. |
| **2. Data Understanding** | Detect patterns of friction and dissatisfaction. | • **EDA**: Distribution analysis (Detect Imbalance).<br>• **Hypothesis Testing**: Correlation Matrix & Independence Tests (Chi-Square). |
| **3. Data Preparation** | Construct a robust dataset for parametric modeling. | • **Scaling**: Standardization (Z-score) for coefficient comparability.<br>• **Encoding**: One-Hot Encoding for nominal variables.<br>• **Splitting**: Stratified Train/Test Split to preserve class ratio. |
| **4. Modeling** | Estimate churn probability. | • **Algorithms**: Logistic Regression, Linear SVM, KNN, Decision Tree, Random Forest, XGBoost, LightGBM.<br>• **Inference**: Analyze **Odds Ratios** to determine feature elasticity. |
| **5. Evaluation** | Assess model reliability and financial impact. | • **Discrimination**: AUC-ROC & F1-Score (Threshold Tuning).<br>• **Calibration**: Probability Calibration Curve (Reliability Diagram). |
| **6. Deployment** | Integrate insights into the CRM lifecycle. | • **Deliverable**: "High-Risk" Customer List for Marketing Squad.<br>• **Artifact**: Serialize model (`joblib`) for batch inference. |

---
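
The Modeling stage above mentions reading **odds ratios** from the logistic regression. As a minimal sketch of that idea, the block below fits a model on synthetic data and exponentiates the coefficients. The feature names `tenure_z` and `equip` and the simulated effects are assumptions for illustration only; they are not results from this project's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(33)
n = 2000

# Hypothetical, synthetic features (not the project's data)
tenure_z = rng.normal(size=n)        # standardized tenure
equip = rng.integers(0, 2, size=n)   # binary flag

# Simulated churn: longer tenure lowers the log-odds, the flag raises them
logit = -0.5 - 1.0 * tenure_z + 0.8 * equip
churn = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X_demo = pd.DataFrame({"tenure_z": tenure_z, "equip": equip})
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X_demo, churn)

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=X_demo.columns)
print(odds_ratios.round(2))
```

An odds ratio below 1 means the feature reduces the odds of churn per unit increase, and a value above 1 means it increases them. With standardized inputs, one unit corresponds to one standard deviation.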

### Installs:

In [0]:
%%capture
%pip install -r '../requirements.txt'
# Command to restart the kernel and update the installed libraries
%restart_python

### Imports:

In [0]:
# Data Analysis and Visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Data Modeling / Linear Models / Metrics / Model Persistence
from sklearn.model_selection import train_test_split, cross_val_score, KFold, cross_validate
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report, ConfusionMatrixDisplay,
    mean_absolute_error, root_mean_squared_error, r2_score,
)
import joblib

In [0]:
# SRC/ Functions Utils:
import sys
sys.path.append('../src')
from visualization import GraphicsData
from utils import EDATest, optimize_dtypes

### Dev objects

In [0]:
# ============================================================================= #
# >>> Module of classes for feature engineering on the telecom churn dataset.   #
# ============================================================================= #

# ======================================================== #
# Imports:                                                 #
# ======================================================== #
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


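class FeatureEngineer(
    BaseEstimator, 
    TransformerMixin
):

    """Stateless feature engineering for the telecom churn dataset."""

    # Initialize Class
    def __init__(
        self,
    ):
        """No hyperparameters are required."""

    # Def Fit
    def fit(
        self, 
        X, 
        y = None
    ):
        """Nothing to learn; return self for scikit-learn API compatibility."""
        return self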

    def transform(
        self, 
        X
    ):
        
        # Check Data set 
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)

        X = X.copy()

        # -------- 1) Total wallet share --------
        mon_cols = ['longmon', 'tollmon', 'equipmon', 'cardmon','wiremon']
        existing_mon = [c for c in  mon_cols if c in X.columns]
        X['total_spend'] = X[existing_mon].sum(axis = 1).astype('float32') if existing_mon else 0.0

        # -------- 2) Affordability and expenses/income --------

        if 'income' in X.columns:
            income_scale = 1000.0
            denom_income = X['income'].astype('float64') * income_scale + 1.0
        else: 
            denom_income = 1.0

        X['affordability_idx'] = (
            X['total_spend'].astype('float64') / denom_income
        ).astype('float32')

        for col in ['longmon', 'equipmon', 'cardmon', 'wiremon']:
            if col in X.columns:
                X[f'{col}_inc'] = (
                    X[col].astype('float64') / denom_income
                ).astype('float32')
            
            else:
                X[f'{col}_inc'] = 0.0
        
        # -------- 3) Risk (toxicity + education) --------
        toxic_list = ['internet', 'wireless', 'equip', 'voice', 'pager']
        existing_toxic = [c for c in toxic_list if c in X.columns]
        X['toxic_score'] = X[existing_toxic].sum(axis = 1).astype('float32') if existing_toxic else 0.0

        if 'ed' in X.columns:
            X['toxic_ed'] = (
                X['toxic_score'].astype('float64') * X['ed'].astype('int64')
            ).astype('float32')
        else:
            X['toxic_ed'] = 0.0

        # -------- 4) Tenure in years (tenure is in months) --------
        if "tenure" in X.columns:
            X["tenure_years"] = (
                X["tenure"].astype("float64") / 12.0
            ).astype("float32")
        else:
            X["tenure_years"] = 0.0

        # -------- 5) Usage behavior --------
        # tenure_longmon / tenure_cardmon
        if "longmon" in X.columns:
            X["tenure_longmon"] = (
                X["tenure_years"].astype("float64")
                * X["longmon"].astype("float64")
            ).astype("float32")
        else:
            X["tenure_longmon"] = 0.0

        if "cardmon" in X.columns:
            X["tenure_cardmon"] = (
                X["tenure_years"].astype("float64")
                * X["cardmon"].astype("float64")
            ).astype("float32")
        else:
            X["tenure_cardmon"] = 0.0

        # age_longmon / age_cardmon (age is already in years, so no /12)
        if "age" in X.columns and "longmon" in X.columns:
            X["age_longmon"] = (
                X["age"].astype("float64") * X["longmon"].astype("float64")
            ).astype("float32")
        else:
            X["age_longmon"] = 0.0

        if "age" in X.columns and "cardmon" in X.columns:
            X["age_cardmon"] = (
                X["age"].astype("float64") * X["cardmon"].astype("float64")
            ).astype("float32")
        else:
            X["age_cardmon"] = 0.0

        # -------- 6) Stability --------
        if "age" in X.columns:
            X["stability_age"] = (
                X["tenure_years"].astype("float64")
                * (X["age"].astype("float64") - 18.0)
            ).astype("float32")
        else:
            X["stability_age"] = 0.0

        if "address" in X.columns:
            X["stability_address"] = (
                X["tenure_years"].astype("float64")
                * X["address"].astype("float64")
            ).astype("float32")
        else:
            X["stability_address"] = 0.0

        if "employ" in X.columns:
            X["stability_employ"] = (
                X["tenure_years"].astype("float64")
                * X["employ"].astype("float64")
            ).astype("float32")
        else:
            X["stability_employ"] = 0.0

        # -------- 7) Good score --------
        good_cols = ["callcard", "confer", "callwait"]
        existing_good = [c for c in good_cols if c in X.columns]
        X["good_score"] = X[existing_good].sum(axis=1) if existing_good else 0.0

        # -------- 8) Final cleanup (inf / NaN -> 0), matching the project style --------
        X = X.replace([np.inf, -np.inf], np.nan).fillna(0)

        return X

In [0]:

class FeatureEngineerTelecom(BaseEstimator, TransformerMixin):
    """
    Feature engineering for the telecom dataset.

    - Cost aggregations (total_spend)
    - Spend / income ratios (affordability_idx, longmon_inc, etc.)
    - Risk (toxic_score, toxic_ed)
    - Usage behavior (tenure/age x spend)
    - Stability (tenure x profile)
    - Good score
    """

    def __init__(self, income_in_thousands: bool = True):
        """
        Parameters
        ----------
        income_in_thousands : bool, default=True
            If True, assumes the 'income' column is in thousands (20 = 20,000)
            and multiplies it by 1000 in the denominator of the ratios.
        """
        self.income_in_thousands = income_in_thousands

    def fit(self, X, y=None):
        # Nothing to learn; kept only for scikit-learn API compatibility
        return self

    def transform(self, X):
        # Ensure a DataFrame (in case an array is passed)
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)

        X = X.copy()

        # -------- 1) Total wallet share --------
        mon_cols = ["longmon", "tollmon", "equipmon", "cardmon", "wiremon"]
        existing_mon = [c for c in mon_cols if c in X.columns]
        X["total_spend"] = X[existing_mon].sum(axis=1) if existing_mon else 0.0

        # -------- 2) Affordability and spend/income ratios --------
        if "income" in X.columns:
            income_scale = 1000.0 if self.income_in_thousands else 1.0
            denom_income = X["income"].astype("float64") * income_scale + 1.0
        else:
            denom_income = 1.0

        X["affordability_idx"] = (
            X["total_spend"].astype("float64") / denom_income
        ).astype("float32")

        for col in ["longmon", "equipmon", "cardmon", "wiremon"]:
            if col in X.columns:
                X[f"{col}_inc"] = (
                    X[col].astype("float64") / denom_income
                ).astype("float32")
            else:
                X[f"{col}_inc"] = 0.0

        # -------- 3) Risk (toxicity + education) --------
        toxic_list = ["internet", "wireless", "equip", "voice", "pager"]
        existing_toxic = [c for c in toxic_list if c in X.columns]
        X["toxic_score"] = X[existing_toxic].sum(axis=1) if existing_toxic else 0.0

        if "ed" in X.columns:
            X["toxic_ed"] = (
                X["toxic_score"].astype("float64") * X["ed"].astype("int64")
            ).astype("float32")
        else:
            X["toxic_ed"] = 0.0

        # -------- 4) Tenure in years (tenure is in months) --------
        if "tenure" in X.columns:
            X["tenure_years"] = (
                X["tenure"].astype("float64") / 12.0
            ).astype("float32")
        else:
            X["tenure_years"] = 0.0

        # -------- 5) Usage behavior --------
        # tenure_longmon / tenure_cardmon
        if "longmon" in X.columns:
            X["tenure_longmon"] = (
                X["tenure_years"].astype("float64")
                * X["longmon"].astype("float64")
            ).astype("float32")
        else:
            X["tenure_longmon"] = 0.0

        if "cardmon" in X.columns:
            X["tenure_cardmon"] = (
                X["tenure_years"].astype("float64")
                * X["cardmon"].astype("float64")
            ).astype("float32")
        else:
            X["tenure_cardmon"] = 0.0

        # age_longmon / age_cardmon (age is already in years, so no /12)
        if "age" in X.columns and "longmon" in X.columns:
            X["age_longmon"] = (
                X["age"].astype("float64") * X["longmon"].astype("float64")
            ).astype("float32")
        else:
            X["age_longmon"] = 0.0

        if "age" in X.columns and "cardmon" in X.columns:
            X["age_cardmon"] = (
                X["age"].astype("float64") * X["cardmon"].astype("float64")
            ).astype("float32")
        else:
            X["age_cardmon"] = 0.0

        # -------- 6) Stability --------
        if "age" in X.columns:
            X["stability_age"] = (
                X["tenure_years"].astype("float64")
                * (X["age"].astype("float64") - 18.0)
            ).astype("float32")
        else:
            X["stability_age"] = 0.0

        if "address" in X.columns:
            X["stability_address"] = (
                X["tenure_years"].astype("float64")
                * X["address"].astype("float64")
            ).astype("float32")
        else:
            X["stability_address"] = 0.0

        if "employ" in X.columns:
            X["stability_employ"] = (
                X["tenure_years"].astype("float64")
                * X["employ"].astype("float64")
            ).astype("float32")
        else:
            X["stability_employ"] = 0.0

        # -------- 7) Good score --------
        good_cols = ["callcard", "confer", "callwait"]
        existing_good = [c for c in good_cols if c in X.columns]
        X["good_score"] = X[existing_good].sum(axis=1) if existing_good else 0.0

        # -------- 8) Final cleanup (inf / NaN -> 0), matching the project style --------
        X = X.replace([np.inf, -np.inf], np.nan).fillna(0)

        return X


### Load the data

In [0]:
df = pd.read_csv('../data/ChurnData.csv')

### Verify successful load with some randomly selected records


In [0]:
df.sample(9)

In [0]:
df.head()

In [0]:
df.info()


### 3. Data Preparation

#### Adjusting the variable types with their respective characteristics. 
---
- This dataset contains both binary and ordinal variables; I adjust their types so that the analyses do not compute statistically invalid aggregations (for example, averaging category codes).
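
As a rough illustration of why the type adjustment matters, the sketch below uses plain pandas on a made-up sample. It does not reproduce the project's `optimize_dtypes` helper, whose internals are not shown here. The column names mirror the dataset, but the values and the chosen category levels are assumptions.

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration
sample = pd.DataFrame({"ed": [1, 3, 2, 5, 4], "equip": [0.0, 1.0, 1.0, 0.0, 1.0]})

# Ordinal variable: an ordered categorical keeps the ranking but blocks arithmetic
sample["ed"] = pd.Categorical(sample["ed"], categories=[1, 2, 3, 4, 5], ordered=True)
# Binary flag: a compact integer type is enough
sample["equip"] = sample["equip"].astype("int8")

try:
    sample["ed"].mean()
    mean_allowed = True
except TypeError:
    # pandas refuses to average category codes
    mean_allowed = False

print(sample.dtypes)
print("mean allowed on 'ed':", mean_allowed)
print("highest education level:", sample["ed"].max())
```

Order-based operations such as `max()` or comparisons still work on the ordered categorical, while a mean of the codes is rejected.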

In [0]:
df = optimize_dtypes(df)

print(f'New dtypes of variables:')
df.info()

print(f'Visual sample:')
df.head()

In [0]:
GraphicsData(data = df).plot_target_analysis(target_col='churn', colors=['#1abc9c', '#ff6b6b'])

##### Train and Test Data Split
---

- Before starting the data preprocessing and modeling process, the dataset will be split into **Training** and **Test** sets.

The main goal is to prevent data leakage, ensuring that all statistical information, outlier handling, transformation decisions, and feature engineering strategies are derived exclusively from the training data.

> This approach preserves the integrity of the validation process and ensures that performance metrics reflect the model’s true generalization capability, rather than contamination from information in the test set.

- The `stratify` parameter will be applied in the `train_test_split` procedure.

Since churn prediction is a binary classification problem, keeping the original class proportion in both subsets is statistically recommended and, in practice, helps avoid evaluation bias.

> Stratified sampling preserves the prior distribution of the target variable, reducing distortions in class balance that could bias model training, decision-threshold calibration, and metrics such as Recall, Precision, and ROC-AUC.

---
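
The two ideas above (stratification and deriving statistics from the training rows only) can be sketched on synthetic data. The `demo` frame, its `income` values, and the 29% churn rate are made up for illustration; only `train_test_split` and `StandardScaler` are real scikit-learn APIs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(33)

# Synthetic stand-in for the churn table (values are made up)
demo = pd.DataFrame({
    "income": rng.lognormal(mean=4.0, sigma=0.6, size=200),
    "churn": np.r_[np.ones(58, dtype=int), np.zeros(142, dtype=int)],
})

train_demo, test_demo = train_test_split(
    demo, test_size=0.2, stratify=demo["churn"], shuffle=True, random_state=33
)

# Statistics are learned from the training rows only, then reused on the test rows
scaler = StandardScaler().fit(train_demo[["income"]])
test_income_scaled = scaler.transform(test_demo[["income"]])

print(f"Original churn rate: {demo['churn'].mean():.2%}")
print(f"Train churn rate:    {train_demo['churn'].mean():.2%}")
print(f"Test churn rate:     {test_demo['churn'].mean():.2%}")
```

Because the scaler never sees the test rows during `fit`, the scaled test values need not have exactly zero mean, which is the expected behavior when leakage is avoided.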


In [0]:
train_set, test_set = train_test_split(df, test_size = 0.2, stratify = df['churn'], shuffle = True, random_state = 33)

In [0]:
# Checking the proportions of the target variable
print(f'Shape of training: {train_set.shape}')
print(f'Shape of test: {test_set.shape}')

print('\n--- Churn Rate (Stratify Validation) ---')
print(f"Original: {df['churn'].mean():.2%}")
print(f"Train:    {train_set['churn'].mean():.2%}")
print(f"Test:     {test_set['churn'].mean():.2%}")

In [0]:
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, *, income_in_thousands: bool = True) -> pd.DataFrame:
    X = df.copy()

    # 1) total_spend
    mon_cols = ["longmon", "tollmon", "equipmon", "cardmon", "wiremon"]
    X["total_spend"] = X[[c for c in mon_cols if c in X.columns]].sum(axis=1)

    # 2) income in the denominator (income is in thousands)
    income_scale = 1000.0 if income_in_thousands else 1.0
    denom_income = (X["income"].astype("float64") * income_scale + 1.0) if "income" in X.columns else 1.0

    X["affordability_idx"] = (X["total_spend"].astype("float64") / denom_income).astype("float32")

    for col in ["longmon", "equipmon", "cardmon", "wiremon"]:
        X[f"{col}_inc"] = (X[col].astype("float64") / denom_income).astype("float32") if col in X.columns else 0.0

    # 3) toxicity
    toxic_list = ["internet", "wireless", "equip", "voice", "pager"]
    X["toxic_score"] = X[[c for c in toxic_list if c in X.columns]].sum(axis=1)
    X["toxic_ed"] = (X["toxic_score"].astype("float64") * X["ed"].astype("int64")).astype("float32") if "ed" in X.columns else 0.0

    # 4) tenure in years (tenure is in months)
    X["tenure_years"] = (X["tenure"].astype("float64") / 12.0).astype("float32") if "tenure" in X.columns else 0.0

    # 5) usage interactions (age is already in years, so no /12)
    if "longmon" in X.columns:
        X["tenure_longmon"] = (X["tenure_years"].astype("float64") * X["longmon"].astype("float64")).astype("float32")
        X["age_longmon"]    = (X["age"].astype("float64") * X["longmon"].astype("float64")).astype("float32") if "age" in X.columns else 0.0
    else:
        X["tenure_longmon"] = 0.0
        X["age_longmon"] = 0.0

    if "cardmon" in X.columns:
        X["tenure_cardmon"] = (X["tenure_years"].astype("float64") * X["cardmon"].astype("float64")).astype("float32")
        X["age_cardmon"]    = (X["age"].astype("float64") * X["cardmon"].astype("float64")).astype("float32") if "age" in X.columns else 0.0
    else:
        X["tenure_cardmon"] = 0.0
        X["age_cardmon"] = 0.0

    # 6) stability (tenure_years x profile variables)
    X["stability_age"] = (X["tenure_years"].astype("float64") * (X["age"].astype("float64") - 18.0)).astype("float32") if "age" in X.columns else 0.0
    X["stability_address"] = (X["tenure_years"].astype("float64") * X["address"].astype("float64")).astype("float32") if "address" in X.columns else 0.0
    X["stability_employ"]  = (X["tenure_years"].astype("float64") * X["employ"].astype("float64")).astype("float32") if "employ" in X.columns else 0.0

    # 7) good_score
    X["good_score"] = X[[c for c in ["callcard", "confer", "callwait"] if c in X.columns]].sum(axis=1)

    # final cleanup (same approach as the project module: handle inf/NaN)
    X = X.replace([np.inf, -np.inf], np.nan).fillna(0)

    return X


In [0]:
import pandas as pd

from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

In [0]:
# Globally configure transformers to return a DataFrame from transform/fit_transform
set_config(transform_output="pandas")

# Example columns
num_cols = ["income", "tenure"]
cat_cols = ["ed", "custcat"]

# Pipelines per data type
num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

# Apply the transformations to each subset of columns
preprocess = ColumnTransformer(
    transformers=[
        ("num", num_pipe, num_cols),
        ("cat", cat_pipe, cat_cols),
    ],
    remainder="drop",
    verbose_feature_names_out=False,   # <- removes prefixes such as "num__"
).set_output(transform="pandas")


In [0]:
train_set_p = preprocess.fit_transform(train_set)
train_set_p

#### Selecting variables for training and test data

In [0]:
X_train = train_set[['ENGINESIZE','FUELCONSUMPTION_COMB']]
y_train = train_set['CO2EMISSIONS']

print(f'The shape of X_train is: {X_train.shape}')
print(f'\nThe shape of y_train is: {y_train.shape}')

In [0]:
X_test = test_set[['ENGINESIZE','FUELCONSUMPTION_COMB']]
y_test = test_set['CO2EMISSIONS']

print(f'The shape of X_test is: {X_test.shape}')
print(f'\nThe shape of y_test is: {y_test.shape}')

#### Preprocessing

#### Note:
---
- For the preprocessing stage, I opted for `PowerTransformer` with the Yeo-Johnson method. This parametric power transformation aims to **stabilize the variance** and bring the distribution of the predictors closer to a Normal (Gaussian) distribution. It mitigates the positive skewness (**long tail**) and corrects the **heteroscedasticity** identified in the `ENGINESIZE` variable, which helps satisfy the statistical assumptions of the linear model.
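
To make the effect of the transformation concrete, here is a small sketch on a synthetic right-skewed variable. The log-normal sample is an assumption standing in for a long-tailed predictor; it is not the notebook's data.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(33)

# Synthetic right-skewed predictor (long right tail)
x = rng.lognormal(mean=1.0, sigma=0.5, size=(1000, 1))

# standardize=True (the default) also rescales to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson")
x_t = pt.fit_transform(x)

skew_before = skew(x[:, 0])
skew_after = skew(x_t[:, 0])

print(f"Skewness before: {skew_before:.3f}")
print(f"Skewness after:  {skew_after:.3f}")
print(f"Fitted lambda:   {pt.lambdas_[0]:.3f}")
```

The fitted `lambdas_` are estimated by maximum likelihood on the data passed to `fit`, so the transformer should be fitted on the training set and only applied to the test set.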

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# --- 1. Variable Definition (Feature Selection) ---

# A. Variables to EXCLUDE (redundant/derived)
drop_cols = [
    'longten', 'tollten', 'cardten', # Totals (price x time interaction)
    'lninc', 'loglong', 'logtoll', 'logturn', # Logs (mathematical transforms)
    'callwait', 'confer', 'ebill', # Optional: drop to simplify the model (keep if considered relevant)
    'churn' # The target never goes into X
]

# B. Numeric variables (need Z-score scaling)
numeric_features = [
    'tenure', 'age', 'address', 'income', 'employ', 
    'longmon', 'tollmon', 'equipmon', 'cardmon', 'wiremon'
]

# C. Binary variables (already encoded as 0/1)
binary_features = [
    'equip', 'callcard', 'wireless', 'pager', 'internet', 'voice'
]

# D. Categorical/ordinal variables (need dummies)
# 'custcat' and 'ed' are numbers that represent categories
categorical_features = ['ed', 'custcat'] 

# --- 2. X and y Separation ---

# Ensures that only the columns remaining after the manual filter are used
selected_features = numeric_features + binary_features + categorical_features

X = df[selected_features]
y = df['churn'] # The isolated target

# Stratified split (keeps the churn proportion in train and test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# --- 3. Building the Robust Pipeline ---

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric: standardization
        ('num', StandardScaler(), numeric_features),
        
        # Categorical: one-hot (drop='first' removes collinearity)
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features),
        
        # Binary: pass through unchanged
        ('bin', 'passthrough', binary_features)
    ],
    verbose_feature_names_out=False # Keeps names clean (e.g. 'ed_2' instead of 'cat__ed_2')
)

# Final pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(
        solver='liblinear', # Works well for small/medium datasets
        penalty='l1',       # Lasso: helps zero out useless coefficients (automatic feature selection)
        C=1.0,              # Inverse of regularization strength
        random_state=42
    ))
])

# --- 4. Training ---
model_pipeline.fit(X_train, y_train)

# Visual feedback
print("Pipeline trained successfully.")
print(f"Total input features: {X.shape[1]}")
print(f"Total coefficients generated (after One-Hot): {len(model_pipeline.named_steps['classifier'].coef_[0])}")

In [0]:
X_scaler = PowerTransformer(method = 'yeo-johnson')
y_scaler = PowerTransformer(method = 'yeo-johnson')

# Train features
X_train_preprocessed  = X_scaler.fit_transform(X_train)

# Test features
X_test_preprocessed = X_scaler.transform(X_test)

# Train target
y_train_preprocessed  = y_scaler.fit_transform(y_train.values.reshape(-1, 1))

# Test target
y_test_preprocessed = y_scaler.transform(y_test.values.reshape(-1, 1))

#### Train Data Preprocessed

In [0]:
pd.DataFrame(X_train_preprocessed, columns = X_scaler.get_feature_names_out(X_train.columns)).head()

#### Test Data Preprocessed

In [0]:
pd.DataFrame(X_test_preprocessed, columns = X_scaler.get_feature_names_out(X_test.columns)).head()

### 4. Modeling:

#### Cross-Validation

In [0]:
# Create model and K-Fold
model = LinearRegression()
kfold = KFold(n_splits = 5, shuffle = True, random_state = 33)

# Create Cross-Validation
cv_results = cross_validate(
    model, 
    X_train_preprocessed,
    y_train_preprocessed,
    cv = kfold,
    scoring = 'r2',
    return_estimator = True
)

# Extraction of coefficients
coefs_list = []
for estimator in cv_results['estimator']:
    coefs_list.append(estimator.coef_.flatten())

# Converts to a NumPy array for easier statistical analysis (Shape: [5, n_features])
coefs_array = np.array(coefs_list)

# Stability Calculation (Audit)
coefs_mean = np.mean(coefs_array, axis = 0)
coefs_std = np.std(coefs_array, axis = 0)

# Metrics
print('--- Performance Metrics ---')
print(f"Mean R²: {np.mean(cv_results['test_score']):.4f}")
print(f"Std R²: {np.std(cv_results['test_score']):.4f}")

print(f'\n--- Stability Metrics (Coefficients) ---')
feature_names = ['ENGINESIZE', 'FUELCONSUMPTION_COMB']
df_stability = pd.DataFrame(
    {
        'Feature': feature_names,
        'Mean Coef': coefs_mean,
        'Std Coef': coefs_std,
        'CV (%)': (coefs_std / np.abs(coefs_mean)) * 100
    }
)

print(df_stability)

#### Final Training

In [0]:
model.fit(X_train_preprocessed, y_train_preprocessed)

print(f'Coefficients: {model.coef_[0]}')
print(f'Intercept: {model.intercept_}')

#### Test Model

In [0]:
y_pred = model.predict(X_test_preprocessed)

#### Metrics:

In [0]:
# 1. Bring the test Y and the predicted Y back to the "Real World"
# The inverse_transform requires a 2D array, hence the reshape
y_pred_real = y_scaler.inverse_transform(y_pred.reshape(-1, 1))
y_test_real = y_scaler.inverse_transform(y_test_preprocessed)

# 2. Calculate metrics on the REAL scale (grams of CO2 per km)
mae_real = mean_absolute_error(y_test_real, y_pred_real)
rmse_real = root_mean_squared_error(y_test_real, y_pred_real)
r2_real = r2_score(y_test_real, y_pred_real)

print(f'--- Business Metrics (Original Scale) ---')
print(f"MAE Real: {mae_real:.2f} g/km")
print(f"RMSE Real: {rmse_real:.2f} g/km")
print(f"R2 Real: {r2_real:.4f}")

print(f"\n--- Statistical Metrics (Yeo-Johnson Scale) ---")
print(f"R2 Transformed: {r2_score(y_test_preprocessed, y_pred):.4f}")

#### Key Observations:
---

- 1. **Cross-Validation:** Cross-validation showed that the model is stable across folds. The standard deviation of only **0.0170** between the k-folds indicates that the performance (average R² of **0.899**) is consistent, which reduces the risk of sampling bias.
---

- 2. **Generalization Test:** On the test data, the model achieved a **Transformed R² of 0.91** (and **0.88** on the Real Scale). The transformed score is slightly higher than the cross-validated training score (**0.90**), so there is no sign of *overfitting*: the model captured the underlying relationship in the data rather than memorizing noise.
---
- 3. **Stability (CV%):** The coefficient of variation of the coefficients across folds is **4.05%** for `ENGINESIZE` and **1.92%** for `FUELCONSUMPTION_COMB`, well below the 20% threshold, which indicates very stable estimates. Despite the **multicollinearity** between the predictors (0.82 correlation), the weights did not fluctuate: Fuel Consumption (Mean Coef ~0.71) is the dominant factor, while Engine Size (Mean Coef ~0.27) remains a stable secondary predictor.
---
#### Insight:
---
- The agreement between the training **R² (0.90)** and the transformed test **R² (0.91)** supports the **Yeo-Johnson** strategy. The slight decrease to **0.88** on the "Real Scale" is expected because the inverse transformation is non-linear and rescales the residuals, but it is the honest accuracy figure for the business (MAE ~13 g/km). In practice, the transformation linearized the relationship well enough for a linear model to fit it.

### 5. Evaluation:

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def plot_lift_and_gains(y_true, y_proba):
    """
    Builds the lift table and the gains/lift charts.
    
    Args:
        y_true: Array with the actual values (0 or 1)
        y_proba: Array with the predicted probabilities (predict_proba)
    """
    # 1. Create the auxiliary DataFrame
    df = pd.DataFrame({'y_true': y_true, 'y_proba': y_proba})
    
    # 2. Sort by probability (ranking)
    df = df.sort_values('y_proba', ascending=False)
    
    # 3. Create the deciles (qcut splits into equal-sized groups)
    df['decile'] = pd.qcut(df['y_proba'].rank(method='first'), 10, labels=False)
    df['decile'] = 10 - df['decile'] # Invert so that 1 is the top-risk decile
    
    # 4. Aggregation per decile
    lift_table = df.groupby('decile')['y_true'].agg(['count', 'sum', 'mean']).reset_index()
    lift_table.columns = ['Decile', 'Total_Customers', 'Real_Churners', 'Churn_Rate']
    
    # 5. Lift calculation
    global_churn_rate = df['y_true'].mean()
    lift_table['Lift'] = lift_table['Churn_Rate'] / global_churn_rate
    
    # Cumulative gains (how much of the total churn was captured?)
    lift_table['Cumulative_Churners'] = lift_table['Real_Churners'].cumsum()
    lift_table['Total_Churners_In_Base'] = df['y_true'].sum()
    lift_table['Gain_Percentage'] = lift_table['Cumulative_Churners'] / lift_table['Total_Churners_In_Base']
    
    # --- VISUALIZATION ---
    fig, ax1 = plt.subplots(figsize=(10, 6))
    
    # Bar chart (lift per decile)
    sns.barplot(x='Decile', y='Lift', data=lift_table, color='skyblue', alpha=0.7, ax=ax1)
    ax1.axhline(1, color='red', linestyle='--', label='Baseline (Random)')
    ax1.set_ylabel('Lift (x times better than random)')
    ax1.set_title('Lift Analysis & Cumulative Gains', fontsize=14, fontweight='bold')
    
    # Line chart (cumulative gain) - secondary axis
    ax2 = ax1.twinx()
    sns.lineplot(x=lift_table.index, y=lift_table['Gain_Percentage'], color='green', marker='o', ax=ax2, label='% Churn Captured')
    ax2.set_ylabel('% of Total Churners Captured')
    ax2.set_ylim(0, 1.1)
    
    # Highlight the "Lift in Top Decile" KPI
    top_decile_gain = lift_table.loc[0, 'Gain_Percentage']
    plt.text(0, top_decile_gain, f'{top_decile_gain:.0%} Captured', 
             bbox=dict(facecolor='yellow', alpha=0.5))
    
    plt.show()
    
    return lift_table

# Example call in the notebook:
# y_proba = model.predict_proba(X_test)[:, 1] # Probability of class 1
# lift_df = plot_lift_and_gains(y_test, y_proba)
# display(lift_df)
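
The decile logic of the lift table can be checked without any plotting. The sketch below is a simplified, assumption-laden version on synthetic scores: the 30% churn rate and the score distribution are made up, and the column names differ from the function above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(33)
n = 1000

# Synthetic labels and scores: churners tend to receive higher probabilities
y_true = rng.binomial(1, 0.3, size=n)
y_proba = np.clip(0.3 + 0.25 * (y_true - 0.3) + rng.normal(0, 0.15, size=n), 0, 1)

scored = pd.DataFrame({"y_true": y_true, "y_proba": y_proba})

# Decile 1 holds the 10% of customers with the highest scores
ranks = scored["y_proba"].rank(method="first", ascending=False)
scored["decile"] = pd.qcut(ranks, 10, labels=False) + 1

table = scored.groupby("decile")["y_true"].agg(
    customers="count", churners="sum", churn_rate="mean"
)
# Lift: churn rate in the decile relative to the overall churn rate
table["lift"] = table["churn_rate"] / scored["y_true"].mean()
# Cumulative gain: share of all churners captured up to this decile
table["cum_gain"] = table["churners"].cumsum() / table["churners"].sum()

print(table.round(3))
```

A lift above 1 in the first deciles means that contacting those customers reaches more churners than a random selection of the same size would.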

In [0]:
# Flatten arrays to ensure 1D dimension.
y_pred_real = y_scaler.inverse_transform(y_pred.reshape(-1, 1)).flatten()
y_test_real = y_scaler.inverse_transform(y_test_preprocessed).flatten()

# Residual calculation
residuals = y_test_real - y_pred_real

# --- GRAPH A: Residuals vs Predicted ---

# Subplots
fig, ax = plt.subplots(1, 2, figsize = (21, 7))

sns.scatterplot(
    x = y_pred_real,
    y = residuals,
    ax = ax[0],
    alpha = 0.6,
    color = 'steelblue',
    edgecolor = 'black',
    s = 70
)
# Reference Line (Zero Error)
ax[0].axhline(y = 0, color = 'crimson', linestyle = '--', linewidth = 2, label = 'Zero Error' )

std_resid = np.std(residuals)
ax[0].axhline(y = std_resid * 2, color = 'gray', linestyle = ':', alpha = 0.5, label = '+/- 2 Std Dev')
ax[0].axhline(y = -std_resid * 2, color = 'gray', linestyle = ':', alpha = 0.5 )

ax[0].set_title('A. Homoscedasticity: Residuals vs. Predicted Values', fontsize = 14, fontweight = 'bold')
ax[0].set_xlabel('Predicted Emission (g/km)', fontsize = 12)
ax[0].set_ylabel('Error (Real - Predicted)', fontsize = 12)
ax[0].legend()
ax[0].grid(True, alpha = 0.3)


# --- GRAPH B: Distribution of Residuals (The Normality Test)
sns.histplot(
    residuals, 
    kde = True, 
    ax = ax[1],
    color = 'darkslategray',
    edgecolor = 'black',
   
)

mean_resid = np.mean(residuals)
ax[1].axvline(mean_resid, color = 'gold', linestyle = '-', linewidth = 3, label = f'Mean Error: {mean_resid:.2f}')

ax[1].set_title('B. Normality: Distribution of Errors', fontsize = 14, fontweight = 'bold')
ax[1].set_xlabel('Magnitude of Error (g/km)', fontsize = 12)
ax[1].set_ylabel('Frequency', fontsize = 12)
ax[1].legend()
ax[1].grid(True, alpha = 0.3)

plt.tight_layout()
plt.show()

#### Key Observations:
---

#### 1. Technical Performance

  - **Explanatory Power (Real R² Score): 0.8844**

    - The model explains **88.4% of the variability** in CO2 emissions on the real-world scale (g/km). In the Yeo-Johnson-transformed space the fit is higher (**0.91**), which suggests that the power transformation captured the non-linear behavior of the data. This is a clear improvement over the simple univariate model (~0.80).

  - **Margin of Error (MAE): 13.03**

    - The **Mean Absolute Error** indicates that, on average, predictions deviate from the actual value by **13.03 g/km**. With emissions ranging up to 450 g/km, this is a small error (approx. 5% relative error), which makes the model usable for carbon footprint estimation.

  - **Sensitivity to Large Errors (RMSE vs MAE):**

    - The **RMSE (20.87)** stays close to the MAE (13.03). The gap of ~7.8 g/km indicates that large errors exist (likely high-performance sports cars or heavy vehicles) but do not dominate the total error. The **Yeo-Johnson transformation** appears to have mitigated the extreme penalties that usually distort linear models.
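
The three metrics above can be reproduced in a few lines of plain Python. This is only an illustrative sketch: the helper names (`mae`, `rmse`, `r2`) and the four toy values are assumptions, not numbers from this notebook, where the inputs would be `y_test_real` and `y_pred_real`.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but large errors weigh more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values (hypothetical, in g/km); the last prediction misses by 40 g/km
y_true_demo = [100.0, 200.0, 300.0, 400.0]
y_pred_demo = [110.0, 190.0, 310.0, 360.0]

print(f"MAE:  {mae(y_true_demo, y_pred_demo):.2f}")
print(f"RMSE: {rmse(y_true_demo, y_pred_demo):.2f}")
print(f"R2:   {r2(y_true_demo, y_pred_demo):.4f}")
```

Because the squared term amplifies the single 40 g/km miss, the RMSE ends up above the MAE. The wider that gap, the more the total error is concentrated in a few large misses.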
---

#### 2. Model Interpretation

  - The model applies a **power transformation** (Yeo-Johnson) before the linear fit rather than fitting a simple straight line to the raw values:

  - **Feature Dominance (Standardized Coefficients):**
    - **Fuel Consumption (Coefficient ~0.71):** This is the dominant driver. The stability analysis showed a variation of only **1.92%** (CV) for this weight, making it the most stable predictor.
    - **Engine Size (Coefficient ~0.27):** This acts as a secondary adjustment factor. Even with a 0.82 correlation to fuel consumption, the model isolated its unique contribution with high stability (CV ~4.05%).

  - **The "Curved" Surface:**
    - Unlike a rigid linear equation, the model projects a **curved surface** in real units: the effect of each feature on emissions changes as engines get bigger. Physically, this is consistent with the diminishing returns of combustion efficiency in larger engines, and it gives a more realistic simulation than a single linear slope.
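
To make the "curved surface" idea concrete, here is a minimal pure-Python sketch of the Yeo-Johnson transform and its inverse. The function names and the value `LAMBDA_DEMO = 0.5` are assumptions for illustration; the lambdas actually fitted in this notebook are not shown here. scikit-learn's `PowerTransformer` also standardizes its output by default, a step this sketch omits.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value y with power parameter lam."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)
    return -((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam)

def yeo_johnson_inverse(t, lam):
    """Map a transformed value t back to the original scale."""
    if t >= 0:
        if abs(lam) < 1e-12:
            return math.expm1(t)
        return (lam * t + 1.0) ** (1.0 / lam) - 1.0
    if abs(lam - 2.0) < 1e-12:
        return 1.0 - math.exp(-t)
    return 1.0 - (1.0 - (2.0 - lam) * t) ** (1.0 / (2.0 - lam))

LAMBDA_DEMO = 0.5  # hypothetical value, not the lambda fitted in the notebook

# Round trip: transforming and inverting recovers the original value
original = 250.0
recovered = yeo_johnson_inverse(yeo_johnson(original, LAMBDA_DEMO), LAMBDA_DEMO)

# Equal steps in the transformed space map to unequal steps in real units
real_values = [yeo_johnson_inverse(t, LAMBDA_DEMO) for t in (10.0, 11.0, 12.0)]
steps = [b - a for a, b in zip(real_values, real_values[1:])]

print(f"Round trip: {original} -> {recovered:.6f}")
print(f"Real-scale values: {real_values}")
print(f"Step sizes: {steps}")
```

A model that is linear in the transformed space moves in equal increments there, but the inverse transform stretches those increments unevenly. That is why the fitted surface bends when plotted in g/km.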
---

#### 3. Conclusion

- The Multiple Regression Model with Power Transformation is the best-performing model evaluated for this dataset.

- **Strengths:**

  - **Robustness:** The coefficient stability (CV < 5%) indicates that the estimates are not destabilized by multicollinearity, despite the 0.82 correlation between the two features.
  - **Physical Coherence:** The residuals follow an approximately Gaussian distribution (Mean Bias ~0.81 g/km), which suggests that little systematic signal remains unexplained.

- **Limitations:**

  - **Interpretability:** Because the model operates in a transformed space, we cannot say "1 liter adds X grams" directly. We must use the inverse transformation to get real values.
  - **Scope:** The slight residual spread at the high end (>350 g/km) suggests that for extreme heavy-duty vehicles, a separate model or additional features (like vehicle weight) might be required.
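
One way to work around the interpretability limitation is to measure effects numerically on the real scale: predict at a baseline, increase one feature, and compare the two predictions. The sketch below assumes a hypothetical helper `marginal_effect` and a toy curved function `toy_predict` that merely stands in for the fitted pipeline; its coefficients are invented for illustration.

```python
def marginal_effect(predict, baseline, feature, step=1.0):
    """Change in the prediction when `feature` grows by `step`, others held fixed."""
    bumped = dict(baseline)
    bumped[feature] += step
    return predict(**bumped) - predict(**baseline)

def toy_predict(engine_size, fuel_consumption):
    """Hypothetical curved stand-in for the real model; NOT the fitted pipeline."""
    return (0.2 * engine_size + 0.5 * fuel_consumption + 10.0) ** 2 - 1.0

# Two hypothetical baselines: engine size in L, consumption in L/100 km
small_car = {"engine_size": 1.6, "fuel_consumption": 6.0}
large_car = {"engine_size": 5.0, "fuel_consumption": 14.0}

effect_small = marginal_effect(toy_predict, small_car, "fuel_consumption")
effect_large = marginal_effect(toy_predict, large_car, "fuel_consumption")

print(f"+1 L/100 km on the small car: {effect_small:.2f}")
print(f"+1 L/100 km on the large car: {effect_large:.2f}")
```

The effect of one extra liter depends on the baseline, so the answer to "1 liter adds X grams" is a local quantity, not a single constant. In this notebook, the `predict_emission` function from the deployment section could be passed as `predict`, since it accepts the same two keyword arguments.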

### 6. Deployment:
---

In [0]:
# DEPLOY PACKAGE: We save everything in a dictionary to ensure integrity.
production_bundle = {
    'model': model, 
    'pt_X': X_scaler, 
    'pt_y': y_scaler,
}

# Save everything in a single joblib file
joblib.dump(production_bundle, './artifacts/co2_pipeline_v2.pkl')

# Confirmation message
print('✅ Complete pipeline saved successfully!')

In [0]:
def predict_emission(engine_size, fuel_consumption):
    """
    Predict CO2 emissions (g/km) from engine size (L) and combined fuel
    consumption (L/100 km), with Yeo-Johnson pre- and post-processing.
    """
    # 1. Loading (load the complete bundle)
    bundle = joblib.load('./artifacts/co2_pipeline_v2.pkl')
    model_loaded = bundle['model']
    pt_X_loaded = bundle['pt_X']
    pt_y_loaded = bundle['pt_y']

    # 2. Input Data Engineering
    input_data = pd.DataFrame(
        [[engine_size, fuel_consumption]],
        columns = ['ENGINESIZE', 'FUELCONSUMPTION_COMB']
    )

    # 3. Pre-processing
    input_transformed = pt_X_loaded.transform(input_data)

    # 4. Prediction
    prediction_transformed = model_loaded.predict(input_transformed)

    # 5. Post-processing (convert back to g/km of CO2)
    prediction_real = pt_y_loaded.inverse_transform(prediction_transformed.reshape(-1, 1))

    result_g_km = prediction_real[0][0]

    return result_g_km

# --- FINAL TEST (User Acceptance Test) ---

# Scenario: 2.0L car getting 8.5 L/100km
engine = 2.0
consumption = 8.5

try:
    prediction = predict_emission(engine, consumption)

    print('--- INFERENCE REPORT ---')
    print(f'Engine: {engine} L')
    print(f'Consumption: {consumption} L/100km')
    print(f'Predicted Emission: {prediction:.2f} g/km')

except Exception as e:
    print(f'Deployment error: {e}')
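
`predict_emission` reloads the bundle from disk on every call. That is fine for a single test but adds overhead in batch inference. One possible refinement, sketched here under assumptions, caches the loaded bundle so the file is read only once. The helper `load_bundle` is hypothetical, and the standard-library `pickle` stands in for `joblib` only so the example runs without extra dependencies; in the notebook the function body would call `joblib.load` instead.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def load_bundle(path):
    """Read the serialized bundle once per path; later calls reuse the cached object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Demo with a placeholder bundle (strings instead of the real model and transformers)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = str(Path(tmp_dir) / "demo_bundle.pkl")
    with open(demo_path, "wb") as fh:
        pickle.dump({"model": "placeholder", "pt_X": "placeholder", "pt_y": "placeholder"}, fh)

    first = load_bundle(demo_path)   # reads the file
    second = load_bundle(demo_path)  # served from the cache
    same_object = first is second

print(f"Bundle reused from cache: {same_object}")
```

The trade-off is that a cached bundle is not refreshed if the file changes on disk. Calling `load_bundle.cache_clear()` forces a reload.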