# 03 Preprocessing

This notebook performs feature encoding, scaling, and dataset splitting for model training.

Import Libraries

In [1]:
import os

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

Load Cleaned Data (from Notebook 01)

In [2]:
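# Load cleaned data by executing Notebook 01 in this namespace.
# The cells below rely on it defining the DataFrame `data`.
%run ./01_data_loading_and_cleaning.ipynb
print("Data loaded for preprocessing.")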

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-n

## 1. Encode Target Variable (Churn)

In [3]:
# LabelEncoder assigns integer codes in sorted label order, so "No" -> 0 and "Yes" -> 1
le = LabelEncoder()
data['Churn'] = le.fit_transform(data['Churn'])

print("Churn encoded as 0 = No, 1 = Yes")

Churn encoded as 0 = No, 1 = Yes
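
The printed mapping is hard-coded in the message, so it can be worth confirming it against the fitted encoder. The following is a minimal sketch on a toy series (the values are hypothetical stand-ins for the `Churn` column, not the real data); it relies on `LabelEncoder.classes_` holding the labels in sorted order, with each label's position being its integer code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Churn column (hypothetical values)
churn = pd.Series(["No", "Yes", "No", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(churn)

# classes_ is sorted; the position of each label is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(encoded.tolist())
```

In the notebook itself, the same dictionary comprehension could be run on the `le` object from the cell above.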


## 2. One-Hot Encode Categorical Features

In [4]:
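# Select every object-typed column for one-hot encoding.
# Note: this selection also picks up customerID, a per-customer identifier,
# so encoding it adds roughly one dummy column per customer.
categorical_cols = data.select_dtypes(include="object").columns.tolist()

print("Categorical columns being encoded:", categorical_cols)

# drop_first=True drops one level per feature to avoid redundant columns
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

print("One-hot encoding completed.")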

Categorical columns being encoded: ['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
One-hot encoding completed.
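
Because `customerID` appears in the list of encoded columns, the encoded frame gains about one dummy column per customer, which is consistent with the 7,072 feature columns reported in the next section (assuming every ID is unique). A minimal sketch of the effect, and of dropping the identifier before encoding, on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "gender": ["Female", "Male", "Male", "Female"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 2, 45],
})

# Encoding the ID column creates one dummy per unique ID (minus one with drop_first)
with_id = pd.get_dummies(
    toy, columns=["customerID", "gender", "Contract"], drop_first=True
)

# Dropping the identifier first keeps only informative dummies
features = toy.drop(columns=["customerID"])
without_id = pd.get_dummies(
    features, columns=["gender", "Contract"], drop_first=True
)

print(with_id.shape, without_id.shape)  # (4, 7) (4, 4)
```

Whether the identifier should be dropped here or in Notebook 01 is a design choice; the sketch only illustrates the column-count difference.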


## 3. Scale Numerical Features

In [5]:
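scaler = StandardScaler()
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# The scaler is fitted on the full dataset here, i.e. before the train-test split,
# so the scaling statistics include rows that later land in the test set.
data[numeric_cols] = scaler.fit_transform(data[numeric_cols])

print("Numeric features scaled with StandardScaler.")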


Numeric features scaled with StandardScaler.
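
The cell above fits the scaler on all rows before the split. A common alternative, shown here only as a sketch on synthetic data (the values and the reduced column list are assumptions for illustration), is to split first, fit `StandardScaler` on the training rows, and reuse those statistics for the test rows:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of the numeric columns (hypothetical values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=100).astype(float),
    "MonthlyCharges": rng.uniform(20, 120, size=100),
})
numeric_cols = ["tenure", "MonthlyCharges"]

train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
train[numeric_cols] = scaler.fit_transform(train[numeric_cols])
# Apply the training statistics to the test rows without refitting
test[numeric_cols] = scaler.transform(test[numeric_cols])

print(train[numeric_cols].mean().round(6).tolist())
```

With this ordering the training columns have mean close to 0 and unit variance, while the test columns are only approximately standardized, which is expected.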


## 4. Create Feature Matrix (X) and Target Vector (y)

In [6]:
X = data.drop('Churn', axis=1)
y = data['Churn']

print("Feature and target separation complete.")
print(f"X shape: {X.shape}, y shape: {y.shape}")


Feature and target separation complete.
X shape: (7043, 7072), y shape: (7043,)


## 5. Train-Test Split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Dataset split into train and test sets.")
print("Training data:", X_train.shape, "Testing data:", X_test.shape)

Dataset split into train and test sets.
Training data: (5634, 7072) Testing data: (1409, 7072)
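
To illustrate what `stratify=y` does, the sketch below splits a toy target with a 30% positive rate (hypothetical data, not the churn dataset) and compares the positive rate in each split. With stratification the two rates should match the overall rate as closely as the split sizes allow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 30 positives, 70 negatives (hypothetical)
y = pd.Series([1] * 30 + [0] * 70)
X = pd.DataFrame({"feature": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of positives in each split
print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```

In the notebook, the same comparison could be made with `y_train.mean()` and `y_test.mean()` on the encoded `Churn` target.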


## 6. Save Processed Datasets


In [10]:
os.makedirs("processed_data", exist_ok=True)

X_train.to_csv("processed_data/X_train.csv", index=False)
X_test.to_csv("processed_data/X_test.csv", index=False)
y_train.to_csv("processed_data/y_train.csv", index=False)
y_test.to_csv("processed_data/y_test.csv", index=False)

print("Processed datasets saved to the processed_data/ folder.")


Processed datasets saved to the processed_data/ folder.
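
Since the files are written with `index=False`, the features and the target can only be matched by row position when they are read back. A minimal round-trip sketch, using a temporary directory and toy frames (both assumptions made for illustration) instead of the real `processed_data/` files:

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the saved splits (hypothetical values)
X_train = pd.DataFrame({"tenure": [0.1, -0.5], "gender_Male": [True, False]})
y_train = pd.Series([0, 1], name="Churn")

with tempfile.TemporaryDirectory() as out_dir:
    X_train.to_csv(os.path.join(out_dir, "X_train.csv"), index=False)
    y_train.to_csv(os.path.join(out_dir, "y_train.csv"), index=False)

    X_loaded = pd.read_csv(os.path.join(out_dir, "X_train.csv"))
    # A Series is written as a one-column CSV with the Series name as header,
    # so select that column to get a Series back
    y_loaded = pd.read_csv(os.path.join(out_dir, "y_train.csv"))["Churn"]

# Rows align by position only, so the lengths must agree
print(X_loaded.shape, y_loaded.tolist())
```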


## 7. Dataset Statistics Summary

In [11]:
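# Note: numeric_cols were already standardized in Section 3, so these statistics
# describe the scaled values (means close to 0), not the original units.
dataset_stats = pd.DataFrame({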
    'Feature': numeric_cols,
    'Mean': [data[col].mean() for col in numeric_cols],
    'Median': [data[col].median() for col in numeric_cols],
    'Min': [data[col].min() for col in numeric_cols],
    'Max': [data[col].max() for col in numeric_cols]
})

print(dataset_stats)

os.makedirs("results", exist_ok=True)
dataset_stats.to_csv("results/dataset_statistics.csv", index=False)

print("Dataset statistics saved.")


          Feature          Mean    Median       Min       Max
0          tenure -2.421273e-17 -0.137274 -1.318165  1.613701
1  MonthlyCharges -6.406285e-17  0.185733 -1.545860  1.794352
2    TotalCharges -1.488074e-17 -0.390463 -0.999120  2.826743
Dataset statistics saved.
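
The means in this table are on the order of 1e-17 because the columns were standardized before the statistics were computed. If a summary in the original units is wanted, one option is to build the same table before scaling. The sketch below does this with `DataFrame.agg` on a toy frame; the values are hypothetical and the column selection is reduced for brevity.

```python
import pandas as pd

# Toy unscaled values (hypothetical), standing in for the raw numeric columns
raw = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# agg returns one row per statistic; transpose to get one row per feature
raw_stats = (
    raw.agg(["mean", "median", "min", "max"])
    .T.rename_axis("Feature")
    .reset_index()
)

print(raw_stats)
```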


## 8. Preprocessing Summary

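- **Label Encoding** applied to the target variable `Churn` (0 = No, 1 = Yes).
- **One-Hot Encoding** (`drop_first=True`) applied to every `object`-typed column, including `customerID`, which is why X has 7,072 feature columns.
- **StandardScaler** fitted on the full dataset, before the split, and applied to the numeric columns (`tenure`, `MonthlyCharges`, `TotalCharges`).
- **Train-test split** performed (80% training, 20% testing) with stratification on `Churn` and `random_state=42`.
- **Outputs** saved: the four splits as CSV files in `processed_data/`, and summary statistics of the scaled numeric columns in `results/dataset_statistics.csv`.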
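
The steps summarized above can also be expressed with scikit-learn's `ColumnTransformer`, which fits the encoder and the scaler on the training split only. This is a sketch under stated assumptions, not the notebook's method: the toy frame, its values, and the reduced column lists are hypothetical, and `OneHotEncoder(drop="first")` is swapped in for `pd.get_dummies(..., drop_first=True)`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical values; column names mirror the dataset
toy = pd.DataFrame({
    "gender": ["Female", "Male"] * 10,
    "Contract": ["Month-to-month", "One year", "Two year", "One year"] * 5,
    "tenure": [float(i) for i in range(20)],
    "Churn": [0, 1, 0, 0] * 5,
})
X = toy.drop(columns="Churn")
y = toy["Churn"]

# Split first so that encoder and scaler statistics come from training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(drop="first"), ["gender", "Contract"]),
    ("num", StandardScaler(), ["tenure"]),
])

X_train_t = preprocess.fit_transform(X_train)  # fit on the training split
X_test_t = preprocess.transform(X_test)        # reuse the fitted transformers

print(X_train_t.shape, X_test_t.shape)
```

One design consideration: a fitted transformer object can be reused on new data at prediction time, whereas `pd.get_dummies` has to be re-run and its columns re-aligned by hand.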