# 🔨 **FEATURE ENGINEERING AND PREPROCESSING**

In this notebook, we’ll transform the cleaned data from Notebook 02/03 into model-ready features. Specifically, we will:

1. Convert categorical variables into numeric form (e.g., one-hot or label encoding).  
2. Create new features (e.g., tenure buckets, interaction terms).  
3. Scale or normalize numeric columns if needed.  

By the end, `X` will contain all engineered features and `y` will be our binary target (`Churn` = 0/1), ready for model training.

## 1. Separating Features and Target

We begin feature engineering by separating the target variable (`Churn`) from the rest of the dataset. 
We'll store the features in `X` and the target in `y`.

Since machine learning models require numerical inputs, we also convert `Churn` from `"Yes"/"No"` to `1` and `0`.

In [1]:
# Load the cleaned dataset
import pandas as pd
df = pd.read_csv(r"C:\Users\ADMIN\Documents\GitHub\Customer-churn-prediction\data\customer_churn_cleaned.csv")

# Work on a copy so the original df stays untouched
alt_df = df.copy()

# Convert target column to 0/1
alt_df['Churn'] = alt_df['Churn'].map({'Yes': 1, 'No': 0})

# Separate features and target
X = alt_df.drop('Churn', axis=1)
y = alt_df['Churn']
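
One caveat with `Series.map`: any label that is missing from the dictionary silently becomes `NaN`. The snippet below is a minimal sketch of a guard against that, run on a small made-up frame (`toy` and `target_map` are illustrative names, not part of the notebook's data).

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned churn table
toy = pd.DataFrame({"Churn": ["Yes", "No", "No", "Yes"]})

target_map = {"Yes": 1, "No": 0}
encoded = toy["Churn"].map(target_map)

# map() yields NaN for labels not in the dictionary, so fail early if any appear
unmapped = toy.loc[encoded.isna(), "Churn"].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unexpected Churn labels: {list(unmapped)}")

encoded = encoded.astype(int)
print(encoded.tolist())  # [1, 0, 0, 1]
```

If the cleaned file ever contained a stray label such as `"yes"`, this check would raise instead of passing `NaN` targets on to the split.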

## 2. Drop Irrelevant Columns

The `customerID` column is just a unique identifier and does not carry predictive value. We’ll remove it from our features.


In [2]:
# Drop customerID
X = X.drop('customerID', axis=1)

# Verify it’s gone
X.columns


Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges'],
      dtype='object')

## 3. Encode Binary Categorical Columns

There are several columns with only two values like 'Yes'/'No' or 'Male'/'Female'. We'll convert these to 0 and 1 so that machine learning models can understand them.

Below are some of the columns we’ll convert:
- `gender`: Male -> 0, Female -> 1
- `Partner`, `Dependents`, `PhoneService`, etc.: No -> 0, Yes -> 1

In [3]:
# 3.1 Define and map all binary columns
binary_cols = [
    'gender', 'Partner', 'Dependents', 'PhoneService',
    'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV',
    'StreamingMovies', 'PaperlessBilling', 'SeniorCitizen'
]

# Create a mapping dictionary
binary_map = {
    'Yes': 1,
    'No': 0,
    'Male': 0,
    'Female': 1
}

# Apply the mapping
X[binary_cols] = X[binary_cols].replace(binary_map)

# Verify the conversion
X[binary_cols].head()




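|   | gender | Partner | Dependents | PhoneService | MultipleLines | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | PaperlessBilling | SeniorCitizen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |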
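
`DataFrame.replace` leaves any value that is not in the mapping untouched, so an unexpected label would survive as a string inside an otherwise numeric column. A stricter variant, sketched here on made-up data, maps each non-numeric column with `Series.map` and raises if anything is left unmapped. The frame `toy` is hypothetical; `binary_map` mirrors the dictionary used above.

```python
import pandas as pd

# Hypothetical sample with two text columns and one already-numeric flag
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Partner": ["Yes", "No", "No"],
    "SeniorCitizen": [0, 0, 1],
})

binary_map = {"Yes": 1, "No": 0, "Male": 0, "Female": 1}

for col in ["gender", "Partner", "SeniorCitizen"]:
    # Columns that are already numeric (e.g. SeniorCitizen) are left alone
    if pd.api.types.is_numeric_dtype(toy[col]):
        continue
    mapped = toy[col].map(binary_map)
    if mapped.isna().any():
        raise ValueError(f"Unmapped labels in column {col!r}")
    toy[col] = mapped.astype(int)

print(toy)
```

Every column in `toy` ends up numeric, and a typo in the raw labels would stop the cell rather than slip through.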


In [4]:
multi_cat_cols = []
num_cols = []

for col in X.columns:
    if X[col].dtype == 'object':
        # Any remaining object columns must have 3+ categories
        multi_cat_cols.append(col)
    else:
        # Anything not object is numeric at this point,
        # which still includes the 0/1-encoded binary columns
        num_cols.append(col)

# Print the lists to verify
print("Binary columns (0/1 encoded):\n", binary_cols)
print("\nMulti-category columns (to one-hot):\n", multi_cat_cols)
print("\nNumerical columns:\n", num_cols)

Binary columns (0/1 encoded):
 ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'SeniorCitizen']

Multi-category columns (to one-hot):
 ['InternetService', 'Contract', 'PaymentMethod']

Numerical columns:
 ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'MonthlyCharges', 'TotalCharges']


In [5]:
# Re-classify columns so that binary columns no longer appear in num_cols

# (binary_cols was defined in the previous cell)
multi_cat_cols = []
num_cols = []

for col in X.columns:
    if col in binary_cols:
        # Skip binary columns entirely
        continue
    elif X[col].dtype == 'object':
        # Remaining object columns are truly multi-category
        multi_cat_cols.append(col)
    else:
        # All other columns (not in binary_cols and not object) are continuous numeric
        num_cols.append(col)

# Print the final lists to verify
print("Binary columns (0/1 encoded):\n", binary_cols)
print("\nMulti-category columns (to one-hot):\n", multi_cat_cols)
print("\nNumerical columns:\n", num_cols)


Binary columns (0/1 encoded):
 ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'SeniorCitizen']

Multi-category columns (to one-hot):
 ['InternetService', 'Contract', 'PaymentMethod']

Numerical columns:
 ['tenure', 'MonthlyCharges', 'TotalCharges']
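
The three groups can also be derived from the data itself rather than from a hand-maintained list. The sketch below, on a made-up frame, treats non-numeric columns as multi-category, numeric columns whose values are all 0 or 1 as binary, and everything else as continuous. This rule is an assumption for illustration: a continuous column that happened to hold only 0s and 1s would be labelled binary.

```python
import pandas as pd

# Hypothetical frame with one column of each kind
toy = pd.DataFrame({
    "Partner": [1, 0, 1, 0],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 2, 45],
})

binary, multi_cat, numeric = [], [], []
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        multi_cat.append(col)          # text columns get one-hot encoded later
    elif set(toy[col].unique()) <= {0, 1}:
        binary.append(col)             # numeric, but only 0/1 values
    else:
        numeric.append(col)            # continuous numeric

print(binary, multi_cat, numeric)
# ['Partner'] ['Contract'] ['tenure']
```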


## 4. One-Hot Encoding of Categorical Features


Next, we’ll transform each multi-category column into multiple dummy (0/1) columns. This includes:

- `InternetService` (DSL, Fiber optic, No)
- `Contract` (Month-to-month, One year, Two year)
- `PaymentMethod` (Bank transfer (automatic), Credit card (automatic), Electronic check, Mailed check)

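We’ll use `pd.get_dummies(..., drop_first=True)`, which drops the first category of each variable. The dropped category becomes the baseline, and the remaining dummy columns are no longer perfectly collinear (the "dummy variable trap").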


In [6]:
# One-hot encode multi-category features
X = pd.get_dummies(X, columns=multi_cat_cols, drop_first=True)

# Quick check of the updated DataFrame
print("New shape after one-hot encoding:", X.shape)
X.head()


New shape after one-hot encoding: (7032, 23)


|   | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | OnlineSecurity | OnlineBackup | DeviceProtection | ... | PaperlessBilling | MonthlyCharges | TotalCharges | InternetService_Fiber optic | InternetService_No | Contract_One year | Contract_Two year | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 29.85 | 29.85 | False | False | False | False | False | True | False |
| 1 | 0 | 0 | 0 | 0 | 34 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 56.95 | 1889.5 | False | False | True | False | False | False | True |
| 2 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 53.85 | 108.15 | False | False | False | False | False | False | True |
| 3 | 0 | 0 | 0 | 0 | 45 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 42.3 | 1840.75 | False | False | True | False | False | False | False |
| 4 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 70.7 | 151.65 | True | False | False | False | False | True | False |
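
To see what `drop_first=True` does on a single column, here is a small sketch on made-up data. It also passes `dtype=int`, an optional `get_dummies` argument that produces 0/1 integers instead of the `True`/`False` values shown above; whether to use it here is a matter of preference, since most estimators accept boolean columns as well.

```python
import pandas as pd

# Hypothetical three-row sample covering every Contract category
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})

dummies = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)

print(dummies.columns.tolist())
# ['Contract_One year', 'Contract_Two year']
print(dummies.iloc[0].tolist())
# [0, 0]  -> the dropped baseline, Month-to-month
```

A row with zeros in both dummy columns therefore encodes the baseline category, which is the alphabetically first one.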


## 5. Standardizing Numerical Features

- `tenure`, `MonthlyCharges`, and `TotalCharges` have very different scales and ranges.
- Some machine learning models, such as regularized Logistic Regression, are sensitive to these differences, which can hurt performance.
- To address this, we’ll standardize the numerical columns with `StandardScaler`, which rescales each column to a mean of 0 and a standard deviation of 1, so that no feature dominates purely because of its units.
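
As a quick illustration of the transformation, standardization computes `z = (x - mean) / std` per column, where `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below checks this on a few made-up tenure values. It also shows why `describe()`, which reports the sample standard deviation (`ddof=1`), gives a value slightly above 1 for a column scaled over 7032 rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical tenure values
toy = pd.DataFrame({"tenure": [1.0, 34.0, 2.0, 45.0, 2.0]})

# Manual z-score with the population standard deviation (ddof=0)
manual = (toy["tenure"] - toy["tenure"].mean()) / toy["tenure"].std(ddof=0)

scaled = StandardScaler().fit_transform(toy[["tenure"]]).ravel()
print(np.allclose(manual.to_numpy(), scaled))  # True

# pandas' default std uses ddof=1, so a scaled column reports sqrt(n / (n - 1))
n = 7032
print(round(float(np.sqrt(n / (n - 1))), 6))  # 1.000071
```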

In [7]:
from sklearn.preprocessing import StandardScaler

# Define the numerical columns to scale
scale_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X[scale_cols] = scaler.fit_transform(X[scale_cols])

# Final check
X[scale_cols].describe()


|       | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| count | 7032.0 | 7032.0 | 7032.0 |
| mean | -1.126643e-16 | 6.062651e-17 | -1.119064e-16 |
| std | 1.000071 | 1.000071 | 1.000071 |
| min | -1.280248 | -1.547283 | -0.9990692 |
| 25% | -0.9542963 | -0.9709769 | -0.8302488 |
| 50% | -0.1394171 | 0.184544 | -0.3908151 |
| 75% | 0.9199259 | 0.8331482 | 0.6668271 |
| max | 1.612573 | 1.793381 | 2.824261 |
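
Note that the cell above fits the scaler on the full dataset, so rows that later land in the validation and test sets contribute to the mean and standard deviation. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering on synthetic data; `toy_X`, `toy_y`, and `toy_scaler` are illustrative names and the values are randomly generated, not taken from the churn dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features and a binary target
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=200).astype(float),
    "MonthlyCharges": rng.uniform(18, 120, size=200),
})
toy_y = pd.Series(rng.integers(0, 2, size=200), name="Churn")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.30, stratify=toy_y, random_state=42
)

toy_scaler = StandardScaler()
# Learn mean/std from the training rows only...
X_tr_scaled = pd.DataFrame(
    toy_scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# ...and reuse those statistics for the held-out rows
X_te_scaled = pd.DataFrame(
    toy_scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_scaled.mean().round(6).tolist())
print(X_te_scaled.mean().round(3).tolist())
```

With this ordering the training columns average to zero, while the held-out columns generally do not, which is expected because they were scaled with the training statistics.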


## 6. Train / Validation / Test Split


To properly tune and evaluate our models, we’ll split the data into:
- **Training set** (70%) – used for fitting and cross‐validation tuning.  
- **Validation set** (15%) – used for comparing models and selecting hyperparameters.  
- **Test set** (15%) – used only once for final performance reporting, to simulate “brand‐new” data.

We fix `random_state=42` so that our split is reproducible.

In [8]:
from sklearn.model_selection import train_test_split

# Starting from X and y (features already encoded and scaled)
# 1) Split off 30% as temp (will become validation + test), 70% remains train
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# 2) Split that 30% into two equal halves: 15% val, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42
)

# 3) Confirm the sizes
print("Training set size :", X_train.shape, y_train.shape)
print("Validation set size :", X_val.shape, y_val.shape)
print("Test set size       :", X_test.shape, y_test.shape)


Training set size : (4922, 23) (4922,)
Validation set size : (1055, 23) (1055,)
Test set size       : (1055, 23) (1055,)
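
Because both calls pass `stratify`, each split should keep roughly the same churn rate as the full dataset. One way to confirm this is to compare the positive-class share across the three parts. The sketch below does so on a made-up target with a 30% positive rate; the toy sizes and names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 rows, 60 of them positive (30%)
toy_X = pd.DataFrame({"feature": range(200)})
toy_y = pd.Series([1] * 60 + [0] * 140, name="Churn")

# Same two-step 70/15/15 split as above
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    toy_X, toy_y, test_size=0.30, stratify=toy_y, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

# Report the size and positive-class share of each part
for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name:<5} n={len(part):>3} churn_rate={part.mean():.3f}")
```

On the real data, the same loop over `y_train`, `y_val`, and `y_test` should print three similar rates.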


In [10]:
# Save each split as a CSV
from pathlib import Path

data_dir = Path(r"C:\Users\ADMIN\Documents\GitHub\Customer-churn-prediction\data")

X_train.to_csv(data_dir / "X_train.csv", index=False)
X_val.to_csv(data_dir / "X_val.csv", index=False)
X_test.to_csv(data_dir / "X_test.csv", index=False)

y_train.to_frame(name="Churn").to_csv(data_dir / "y_train.csv", index=False)
y_val.to_frame(name="Churn").to_csv(data_dir / "y_val.csv", index=False)
y_test.to_frame(name="Churn").to_csv(data_dir / "y_test.csv", index=False)
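
Since the files are written with `index=False`, the shuffled row labels from the split are discarded, and each `X_*` / `y_*` pair stays aligned by row position only. The sketch below shows the round trip on made-up data, written to a temporary directory instead of the project's data folder (an assumption made so the example can run anywhere).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical split with shuffled index labels, as train_test_split would leave them
toy_X = pd.DataFrame({"tenure": [0.5, -1.2, 0.3]}, index=[10, 4, 7])
toy_y = pd.Series([1, 0, 0], index=[10, 4, 7], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    toy_X.to_csv(data_dir / "X_train.csv", index=False)
    toy_y.to_frame(name="Churn").to_csv(data_dir / "y_train.csv", index=False)

    # Reload inside the context, while the temporary files still exist
    X_back = pd.read_csv(data_dir / "X_train.csv")
    y_back = pd.read_csv(data_dir / "y_train.csv")["Churn"]

print(X_back.index.tolist())  # [0, 1, 2]
print(y_back.tolist())        # [1, 0, 0]
```

After reloading, features and targets should be kept in the same order and never sorted or filtered independently.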
