# Preprocessing and Baselines

### Notebook Goals
In this notebook, we prepare the data for modeling and train baseline models to establish a benchmark for churn predictions.

### Load Data and Initial Cleanup

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/telco_churn.csv")

In [2]:
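# Encode the target as 0/1 and drop the identifier, which carries no predictive signal.
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
df.drop(columns=["customerID"], inplace=True)

# Non-numeric TotalCharges entries become NaN, then all NaNs are filled with 0.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.fillna(0)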
df["Churn"] = df["Churn"].astype("int")
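The cleanup cell relies on `errors="coerce"` turning unparseable strings into `NaN` before they are filled with 0. A minimal sketch of that behavior on a hypothetical toy series follows; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values: two valid numbers, one blank string, one junk token.
raw = pd.Series(["29.85", " ", "1889.5", "n/a"])

# errors="coerce" converts anything that cannot be parsed into NaN.
coerced = pd.to_numeric(raw, errors="coerce")

# Counting the NaNs first shows how many rows the fill will affect.
n_missing = int(coerced.isna().sum())

# Same fill strategy as the notebook: replace NaN with 0.
filled = coerced.fillna(0)

print("Values coerced to NaN:", n_missing)
print(filled.tolist())
```

Counting the coerced values before filling them is a cheap way to confirm that the fill touches only a handful of rows rather than masking a larger parsing problem.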

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod     7043 non-null   object 


### Train/Test Split
We split the dataset into training and test sets using stratified sampling so that both sets preserve the original churn distribution. This keeps performance estimates reliable on an imbalanced target variable.

In [4]:
y = df["Churn"]
X = df.drop(columns=["Churn"])

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
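One way to confirm that stratification preserved the class balance is to compare the positive rate in each split. The sketch below uses a hypothetical toy frame with a 30% positive rate in place of the real data; the names `toy`, `X_toy`, and `y_toy` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30 positives, standing in for the churn target.
toy = pd.DataFrame({"feature": range(100), "Churn": [1] * 30 + [0] * 70})

X_toy = toy.drop(columns=["Churn"])
y_toy = toy["Churn"]

# Same split settings as the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both rates should sit close to the overall 30%.
train_rate = y_tr.mean()
test_rate = y_te.mean()

print(f"Train positive rate: {train_rate:.3f}")
print(f"Test positive rate:  {test_rate:.3f}")
```

The same comparison can be run on `y_train` and `y_test` from the notebook by calling `.mean()` on each.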

### Feature Encoding Strategy
Categorical values are encoded using one-hot encoding to avoid imposing ordinal relationships. Numerical values are scaled using standardization. All preprocessing steps are fit exclusively on the training data and then applied to the test data to prevent leakage.

In [9]:
categorical_features = list(X_train.select_dtypes(include=["object"]).columns)
numerical_features = list(X_train.select_dtypes(include=["int64", "float64"]).columns)

# SeniorCitizen is a 0/1 flag stored as int64, so treat it as categorical.
numerical_features.remove("SeniorCitizen")
categorical_features.append("SeniorCitizen")

print("Categorical columns:", categorical_features)
print("Numerical columns:", numerical_features)

Categorical columns: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'SeniorCitizen']
Numerical columns: ['tenure', 'MonthlyCharges', 'TotalCharges']


### Preprocessing Pipeline

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [11]:
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(
    handle_unknown="ignore",
    sparse_output=False
)

In [12]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

### Baseline Models