# 🔨 **FEATURE ENGINEERING AND PREPROCESSING**

In this notebook, we'll transform the cleaned data from Notebook 02/03 into model-ready features. Specifically, we will:

1. Convert categorical variables into numeric form (e.g., one-hot or label encoding).  
2. Create new features (e.g., tenure buckets, interaction terms).  
3. Scale or normalize numeric columns if needed.  

By the end, `X` will contain all engineered features and `y` will be our binary target (`Churn` = 0/1), ready for model training.

## 1. Separating Features and Target

We begin feature engineering by separating the target variable (`Churn`) from the rest of the dataset. 
We'll store the features in `X` and the target in `y`.

Since machine learning models require numerical inputs, we also convert `Churn` from `"Yes"/"No"` to `1` and `0`.

In [9]:
# Load the cleaned dataset
import pandas as pd
# Raw string so the backslashes are not read as (invalid) escape sequences
df = pd.read_csv(r"..\data\customer_churn_cleaned.csv")

# Work on a copy so the original df stays untouched
alt_df = df.copy()

# Convert target column to 0/1
alt_df['Churn'] = alt_df['Churn'].map({'Yes': 1, 'No': 0})

# Separate features and target
X = alt_df.drop('Churn', axis=1)
y = alt_df['Churn']
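
One thing worth checking at this step: `Series.map` returns `NaN` for any label that is not a key in the mapping, so a stray value such as `"yes "` would silently become a missing target. Below is a minimal sketch of that check on a made-up toy Series (the values are hypothetical, not taken from our dataset):

```python
import pandas as pd

# Toy stand-in for the Churn column; the last label is deliberately malformed
churn = pd.Series(["Yes", "No", "No", "yes "])

encoded = churn.map({"Yes": 1, "No": 0})

# Labels that are not keys in the mapping come back as NaN
unmapped = churn[encoded.isna()]
n_unmapped = len(unmapped)
print("Unmapped labels:", n_unmapped, unmapped.tolist())
```

If the count is not zero on the real column, the labels should be cleaned before separating `X` and `y`.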


## 2. Drop columns we will not use in our modeling process

In [10]:
# Drop customerID and gender
X = X.drop(['customerID','gender'], axis=1)

# Verify the columns are not in our feature dataset
X.columns


Index(['SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService',
       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges'],
      dtype='object')

## 3. Encode Binary Categorical Columns

Several of the remaining columns have only two values (`'Yes'`/`'No'`). We'll convert these to 1 and 0 so that machine learning models can use them. (`gender` was also a two-value column, but we dropped it in the previous step.)

Below are some of the columns we'll convert:
- `Partner`, `Dependents`, `PhoneService`, etc.: No -> 0, Yes -> 1
- `SeniorCitizen` is already encoded as 0/1, so it is left out of the list.

In [11]:
# Define and map all binary columns
binary_cols = [
    'Partner', 'Dependents', 'PhoneService',
    'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV',
    'StreamingMovies', 'PaperlessBilling'
]

# Create a mapping dictionary
binary_map = {
    'Yes': 1,
    'No': 0,
}

# Apply the mapping
X[binary_cols] = X[binary_cols].replace(binary_map)

# Verify the conversion
X[binary_cols].head()



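|   | Partner | Dependents | PhoneService | MultipleLines | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | PaperlessBilling |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |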
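
If we want the encoding step to fail loudly on unexpected labels and to end with an explicit integer dtype, one option is to validate first and then apply `Series.map` per column. This is an alternative sketch on a toy frame (the data is made up), not the cell we ran above:

```python
import pandas as pd

# Toy frame with two of the binary columns (values are made up)
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "PaperlessBilling": ["No", "No", "Yes"],
})
binary_map = {"Yes": 1, "No": 0}

# Fail early if a column holds anything other than the expected labels
for col in toy.columns:
    unexpected = set(toy[col].unique()) - set(binary_map)
    if unexpected:
        raise ValueError(f"{col} has unexpected values: {unexpected}")

# Map each column and make the integer dtype explicit
encoded = toy.apply(lambda s: s.map(binary_map)).astype("int64")
print(encoded.dtypes)
```

The trade-off is that `map` turns unknown labels into `NaN` while `replace` leaves them as they are, which is why the validation loop comes first.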


In [12]:
# Re-classify columns to avoid overlap

# (binary_cols was defined in the previous cell)
multi_cat_cols = []
num_cols = []

for col in X.columns:
    if col in binary_cols:
        # Skip binary columns entirely
        continue
    elif X[col].dtype == 'object':
        # Remaining object columns are truly multi-category
        multi_cat_cols.append(col)
    else:
        # All other columns (not in binary_cols and not object) are numeric.
        # Note: this includes SeniorCitizen, which is a 0/1 flag rather than a continuous value.
        num_cols.append(col)

# Print the final lists to verify
print("Binary columns (0/1 encoded):\n", binary_cols)
print("\nMulti-category columns (to one-hot):\n", multi_cat_cols)
print("\nNumerical columns:\n", num_cols)


Binary columns (0/1 encoded):
 ['Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling']

Multi-category columns (to one-hot):
 ['InternetService', 'Contract', 'PaymentMethod']

Numerical columns:
 ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
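
Notice that `SeniorCitizen` lands in the numerical list even though it is a 0/1 flag. If we wanted to separate such flags from truly continuous columns, a rough sketch could look like the following. The toy frame and the names `indicator_cols` and `continuous_cols` are assumptions for illustration:

```python
import pandas as pd

# Toy frame mimicking three of our numeric columns (values are made up)
toy = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0, 0],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

numeric = toy.select_dtypes(include="number").columns.tolist()

# A numeric column whose values are all 0 or 1 is treated as an indicator
indicator_cols = [c for c in numeric if set(toy[c].dropna().unique()) <= {0, 1}]
continuous_cols = [c for c in numeric if c not in indicator_cols]

print("Indicator columns :", indicator_cols)
print("Continuous columns:", continuous_cols)
```

This distinction matters later, because only the continuous columns are candidates for scaling.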


## 4. One-Hot Encoding of Categorical Features


Next, we'll transform each multi-category column into multiple dummy (0/1) columns. This includes:

- `InternetService` (DSL, Fiber optic, No)
- `Contract` (Month-to-month, One year, Two year)
- `PaymentMethod` (Bank transfer (automatic), Credit card (automatic), Electronic check, Mailed check)

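We'll use `pd.get_dummies(..., drop_first=True)`, which drops one category per variable. The dropped category becomes the baseline, and the remaining dummy columns are no longer perfectly collinear with each other.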
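
To see what `drop_first=True` changes, here is a small sketch on a toy `Contract` column (the four rows are made up). Without it, the dummy columns of one variable always sum to 1 in every row, which is the redundancy we want to remove:

```python
import pandas as pd

# Toy column with the three Contract categories (rows are made up)
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

full = pd.get_dummies(toy, columns=["Contract"])
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())

# With all categories kept, each row sums to exactly 1
print((full.sum(axis=1) == 1).all())
```

The first category in sorted order (`Month-to-month` here) is the one that gets dropped, so a row with zeros in both remaining columns stands for that baseline.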


In [None]:
# Check cardinality of the categorical features
cat_cols = X.select_dtypes(include='object').columns
for col in cat_cols:
    n_unique = X[col].nunique()
    print(f"{col}: {n_unique} unique values")


InternetService: 3 unique values
Contract: 3 unique values
PaymentMethod: 4 unique values


In [14]:
# One-hot encode multi-category features
X = pd.get_dummies(X, columns=multi_cat_cols, drop_first=True)

# Quick check of the updated DataFrame
print("New shape after one-hot encoding:", X.shape)
X.head()


New shape after one-hot encoding: (7032, 22)


|   | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | ... | PaperlessBilling | MonthlyCharges | TotalCharges | InternetService_Fiber optic | InternetService_No | Contract_One year | Contract_Two year | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 29.85 | 29.85 | False | False | False | False | False | True | False |
| 1 | 0 | 0 | 0 | 34 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 56.95 | 1889.5 | False | False | True | False | False | False | True |
| 2 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 53.85 | 108.15 | False | False | False | False | False | False | True |
| 3 | 0 | 0 | 0 | 45 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 42.3 | 1840.75 | False | False | True | False | False | False | False |
| 4 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 70.7 | 151.65 | True | False | False | False | False | True | False |


## 5. Train / Validation / Test Split


To properly tune and evaluate our models, we'll split the data into:
- **Training set** (70%): used for fitting and cross-validation tuning.  
- **Validation set** (15%): used for comparing models and selecting hyperparameters.  
- **Test set** (15%): used only once for final performance reporting, to simulate "brand-new" data.

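We fix `random_state=42` so that our split is reproducible, and we stratify on the target so that each split keeps roughly the same churn rate as the full dataset.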

In [15]:
from sklearn.model_selection import train_test_split

# Starting from X and y (features already encoded)
# 1) Split off 30% as temp (will become validation + test), 70% remains train
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# 2) Split that 30% into two equal halves: 15% val, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42
)

# 3) Confirm the sizes
print("Training set size :", X_train.shape, y_train.shape)
print("Validation set size :", X_val.shape, y_val.shape)
print("Test set size       :", X_test.shape, y_test.shape)


Training set size : (4922, 22) (4922,)
Validation set size : (1055, 22) (1055,)
Test set size       : (1055, 22) (1055,)
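
The sizes confirm the 70/15/15 proportions, but it is also worth confirming that stratification kept the class balance. The sketch below repeats the same two-step split on synthetic data with a made-up 27% positive rate (an assumption for illustration, not our measured churn rate) and prints the positive rate in each part:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 27% positives (made-up proportion)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series([1] * 270 + [0] * 730, name="Churn")

# Same two-step split as in the notebook: 70% train, then 15% / 15%
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

# The positive rate should be close to 0.27 in every part
for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name:5s} n={len(part):4d} positive rate={part.mean():.3f}")
```

On our real data, the same loop over `y_train`, `y_val`, and `y_test` would show whether the churn rate is consistent across the three sets.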


## 6. Standardizing Numerical Features

- The `tenure`, `MonthlyCharges`, and `TotalCharges` columns have different scales and ranges.
- Some machine learning models, such as logistic regression, are sensitive to these differences, which can hurt their performance.
- To fix this, we'll standardize these columns using `StandardScaler`, which rescales each one to a mean of 0 and a standard deviation of 1 on the training set. We fit the scaler on the training set only and then apply it to the validation and test sets, so no information from those sets leaks into preprocessing. This puts the three columns on a comparable scale, so none of them dominates simply because of its units.
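
Concretely, standardization computes z = (x - mean) / std for each column, using the mean and standard deviation of the training data. A minimal sketch on toy numbers (the arrays below are made up) shows that `StandardScaler` matches this formula, and that validation rows are scaled with the training statistics rather than their own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy tenure-like values (made up); one column, so shape is (n_samples, 1)
train = np.array([[1.0], [34.0], [2.0], [45.0]])
val = np.array([[10.0], [60.0]])

scaler = StandardScaler().fit(train)

# Manual z-scores; np.std uses the population formula (ddof=0) by default
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)
manual_train = (train - train_mean) / train_std
manual_val = (val - train_mean) / train_std  # training statistics, not val's own

print(np.allclose(scaler.transform(train), manual_train))
print(np.allclose(scaler.transform(val), manual_val))
```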

In [16]:
from sklearn.preprocessing import StandardScaler

# Columns to scale
scale_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Initialize and fit *only* on train
scaler = StandardScaler().fit(X_train[scale_cols])

#  Transform train, val, test
X_train[scale_cols] = scaler.transform(X_train[scale_cols])
X_val[scale_cols]   = scaler.transform(X_val[scale_cols])
X_test[scale_cols]  = scaler.transform(X_test[scale_cols])

#  Quick sanity check
print("Train stats:\n", X_train[scale_cols].describe().loc[['mean','std']])
print("Val   stats:\n", X_val[scale_cols].describe().loc[['mean','std']])
print("Test  stats:\n", X_test[scale_cols].describe().loc[['mean','std']])

Train stats:
             tenure  MonthlyCharges  TotalCharges
mean  9.960879e-17    1.999394e-16  3.175933e-17
std   1.000102e+00    1.000102e+00  1.000102e+00


Val   stats:
         tenure  MonthlyCharges  TotalCharges
mean -0.027460       -0.031823     -0.042709
std   0.986229        0.981209      0.954764
Test  stats:
         tenure  MonthlyCharges  TotalCharges
mean  0.008898        0.021688      0.017461
std   1.003170        0.981061      1.011844
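
The training `std` prints as 1.000102 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). The ratio between the two is sqrt(n / (n - 1)), which for our n = 4922 training rows is about 1.000102. A quick check on a synthetic column (the random data is an assumption for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 4922  # number of training rows reported above
rng = np.random.default_rng(0)
col = pd.DataFrame({"x": rng.normal(50, 10, size=n)})  # synthetic column

scaled = pd.DataFrame(StandardScaler().fit_transform(col), columns=["x"])

pop_std = scaled["x"].std(ddof=0)     # what StandardScaler normalizes to 1
sample_std = scaled["x"].std(ddof=1)  # what describe() reports
expected = np.sqrt(n / (n - 1))

print(pop_std, sample_std, expected)
```

The validation and test statistics are not exactly 0 and 1 either, which is expected: they were transformed with the training mean and standard deviation, not their own.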


In [None]:
# Save each split as a CSV
X_train.to_csv(r"..\data\X_train.csv", index=False)
X_val.to_csv(r"..\data\X_val.csv", index=False)
X_test.to_csv(r"..\data\X_test.csv", index=False)

y_train.to_frame(name="Churn").to_csv(r"..\data\y_train.csv", index=False)
y_val.to_frame(name="Churn").to_csv(r"..\data\y_val.csv", index=False)
y_test.to_frame(name="Churn").to_csv(r"..\data\y_test.csv", index=False)
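
The backslash paths above are Windows-style. If the notebook also needs to run on other systems, `pathlib` is one option. The sketch below writes a toy split to a temporary directory (a stand-in for our `data` folder, and the toy frames are made up) and reads it back to confirm that the round trip preserves values and dtypes:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy split (made up): one scaled numeric column and one boolean dummy column
X_toy = pd.DataFrame({"tenure": [0.5, -1.25], "Contract_One year": [True, False]})
y_toy = pd.Series([1, 0], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)  # in the notebook this would be Path("..") / "data"

    X_toy.to_csv(data_dir / "X_train.csv", index=False)
    y_toy.to_frame(name="Churn").to_csv(data_dir / "y_train.csv", index=False)

    # Read the files back while the temporary directory still exists
    X_back = pd.read_csv(data_dir / "X_train.csv")
    y_back = pd.read_csv(data_dir / "y_train.csv")["Churn"]

print(X_back.equals(X_toy), y_back.tolist())
```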
