# Feature Engineering
This notebook:
- Loads the clean churn dataset
- Cleans categorical features
- Performs train-test splitting with stratification
- Saves the train-test splits

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Load clean dataset

In [2]:
df = pd.read_csv('../data/clean_telco_churn.csv')

## Clean categorical features

In [3]:
# Print the unique values of every string-typed (object) column
for column in df.columns:
    if df[column].dtype == 'object':
        print(f'{column}: {df[column].unique()}')

gender: ['Female' 'Male']
Partner: ['Yes' 'No']
Dependents: ['No' 'Yes']
PhoneService: ['No' 'Yes']
MultipleLines: ['No phone service' 'No' 'Yes']
InternetService: ['DSL' 'Fiber optic' 'No']
OnlineSecurity: ['No' 'Yes' 'No internet service']
OnlineBackup: ['Yes' 'No' 'No internet service']
DeviceProtection: ['No' 'Yes' 'No internet service']
TechSupport: ['No' 'Yes' 'No internet service']
StreamingTV: ['No' 'Yes' 'No internet service']
StreamingMovies: ['No' 'Yes' 'No internet service']
Contract: ['Month-to-month' 'One year' 'Two year']
PaperlessBilling: ['Yes' 'No']
PaymentMethod: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
Churn: ['No' 'Yes']


### Encode target variable

In [4]:
# Assign the result back to the column; an inplace replace on df['Churn'] acts on a copy
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})


### Replace service labels

In [5]:
# Collapse both "no service" labels into a plain "No"
df = df.replace({"No phone service": "No", "No internet service": "No"})
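
A quick way to confirm the relabeling is to list the values that remain in the affected columns. The sketch below uses a small toy frame (hypothetical data; only the column names mirror the dataset) rather than the real `df`:

```python
import pandas as pd

# Toy frame for illustration only; not the churn dataset
toy = pd.DataFrame({
    "MultipleLines": ["No phone service", "No", "Yes"],
    "OnlineSecurity": ["No", "Yes", "No internet service"],
})

# Same replacement as above, applied to the toy frame
toy = toy.replace({"No phone service": "No", "No internet service": "No"})

# Collect the remaining labels per column; only 'No' and 'Yes' should be left
remaining = {col: sorted(toy[col].unique()) for col in toy.columns}
print(remaining)
```

Running the same dictionary comprehension on `df` for the service columns should show only `'No'` and `'Yes'`, assuming no other labels are present in the data.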

## Train-test split

In [6]:
X = df.drop(columns=['Churn'])
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
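
To illustrate what `stratify=y` does, the following sketch splits a toy target with a known class balance and compares the positive rate in each part. The toy data and the `toy_` names are assumptions made for the example; the same comparison could be run on `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 80 negatives and 20 positives
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([0] * 80 + [1] * 20, name="Churn")

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, both parts keep the original share of positives
train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print(f"train positive rate: {train_rate:.2f}, test positive rate: {test_rate:.2f}")
```

Without `stratify`, the two rates can drift apart by chance, which matters most when the positive class is small.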

## Save train-test split

In [7]:
X_train.to_csv('../data/X_train.csv', index=False)
X_test.to_csv('../data/X_test.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False)
y_test.to_csv('../data/y_test.csv', index=False)
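
When these files are loaded in a later notebook, the target comes back as a one-column DataFrame rather than a Series. A minimal sketch of the round trip, using a temporary directory and a toy series in place of the real `y_train`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for y_train; the real series comes from the split above
toy_y = pd.Series([0, 1, 0, 0], name="Churn")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "y_train.csv"
    toy_y.to_csv(path, index=False)

    # read_csv returns a one-column DataFrame; squeeze converts it back to a Series
    reloaded = pd.read_csv(path).squeeze("columns")

print(type(reloaded).__name__, reloaded.name)
```

Because every file is written with `index=False`, the original row labels are dropped and the reloaded features and targets line up by position only.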