#**Data Preprocessing**
Data preprocessing prepares the dataset for modelling. This section handles missing values, encodes categorical variables, splits the dataset, and corrects class imbalance.



#**Handle Missing Values**
- First, check for missing data and handle it with an appropriate approach, such as imputation or removal.

- To fill missing values, I use the median for numerical columns and the mode for categorical columns.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv('customer_churn_dataset.csv')

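# Count missing values per column before imputation.
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Fill numerical columns with their column medians.
numerical_columns = ['Age', 'MonthlyCharges', 'TotalCharges', 'Tenure']
df[numerical_columns] = df[numerical_columns].fillna(df[numerical_columns].median())

# Fill categorical columns with their most frequent value (mode).
# 'Churn' is the target; it is listed last so it can be sliced off before encoding.
categorical_columns = ['Gender', 'ContractType', 'TechSupport', 'InternetService', 'PaperlessBilling', 'PaymentMethod', 'Churn']
df[categorical_columns] = df[categorical_columns].apply(lambda x: x.fillna(x.mode()[0]))

# Confirm that no missing values remain.
print("Missing values after handling:\n", df.isnull().sum())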


Missing values in each column:
 CustomerID          0
Age                 0
Gender              0
ContractType        0
MonthlyCharges      0
TotalCharges        0
TechSupport         0
InternetService     0
Tenure              0
PaperlessBilling    0
PaymentMethod       0
Churn               0
dtype: int64
Missing values after handling:
 CustomerID          0
Age                 0
Gender              0
ContractType        0
MonthlyCharges      0
TotalCharges        0
TechSupport         0
InternetService     0
Tenure              0
PaperlessBilling    0
PaymentMethod       0
Churn               0
dtype: int64
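
Because this dataset contains no missing values, the cell above leaves it unchanged. To show what the median and mode fills actually do, here is a minimal sketch on a toy frame; the values are hypothetical and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical data).
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 50.0],
    "ContractType": ["One year", None, "One year", "Two year"],
})

# Median fill for the numerical column: the median of [20, 30, 50] is 30.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Mode fill for the categorical column: "One year" is the most frequent value.
toy["ContractType"] = toy["ContractType"].fillna(toy["ContractType"].mode()[0])

print(toy)
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for columns such as charges.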


#**Encode Categorical Features**
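- Convert categorical variables into a numerical format using one-hot encoding.
- This step creates binary columns for each category and, with `drop_first=True`, drops the first level of each feature as the baseline. The target variable `Churn` is excluded because it is a label rather than a feature; it is mapped to 0/1 separately.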
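
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame with three contract types. The rows are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Toy frame: one categorical feature plus the target (hypothetical rows).
toy = pd.DataFrame({
    "ContractType": ["Month-to-month", "One year", "Two year"],
    "Churn": ["No", "Yes", "No"],
})

# Encode only the feature column; the target is left untouched.
encoded = pd.get_dummies(toy, columns=["ContractType"], drop_first=True)

# The first level in sorted order ("Month-to-month") becomes the implicit baseline.
print(encoded.columns.tolist())
# ['Churn', 'ContractType_One year', 'ContractType_Two year']
```

A month-to-month customer is then represented by both dummy columns being false, which avoids a redundant column.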

In [None]:
# Encode every categorical feature except the last list entry ('Churn', the target).
df_encoded = pd.get_dummies(df, columns=categorical_columns[:-1], drop_first=True)

df_encoded.head()


Unnamed: 0,CustomerID,Age,MonthlyCharges,TotalCharges,Tenure,Churn,Gender_Male,ContractType_One year,ContractType_Two year,TechSupport_Yes,InternetService_Fiber optic,InternetService_No,PaperlessBilling_Yes,PaymentMethod_Credit card,PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,CUST_249,58,50.95,3617.45,71,No,False,False,False,False,False,True,True,False,False,False
1,CUST_1890,22,47.01,235.05,5,Yes,False,True,False,True,True,False,True,False,True,False
2,CUST_1614,62,22.58,0.0,0,No,False,False,False,True,True,False,True,False,False,False
3,CUST_2869,42,55.08,110.16,2,No,True,False,False,False,True,False,False,False,False,False
4,CUST_3770,66,43.12,474.32,11,Yes,True,True,False,True,False,False,True,False,False,False


#**Split the Dataset into Training, Validation, and Testing Sets**
- Divide the dataset into training, validation, and testing sets for model building and evaluation.
- I use a 70-15-15 split for training, validation, and testing: 30% is held out first and then halved. Both splits are stratified on the target so that each subset keeps the same churn rate.
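
The two-step arithmetic (0.3 held out, then 0.5 of that, giving 0.15 each) can be checked on synthetic data. This is a minimal sketch assuming 1,000 rows with a 20% positive rate; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 rows, 20% positives (hypothetical).
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 200 + [0] * 800)

# Step 1: hold out 30% of the rows.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Step 2: halve the held-out part into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Stratifying the second split on `y_temp` (not `y`) matters, because the second call only sees the held-out rows.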

In [None]:
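from sklearn.model_selection import train_test_split

# One-hot encode the categorical features (the target 'Churn' is handled separately below).
categorical_columns = ['Gender', 'ContractType', 'TechSupport', 'InternetService', 'PaperlessBilling', 'PaymentMethod']
df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Map the target to 1 (churned) / 0 (retained).
df_encoded['Churn'] = df_encoded['Churn'].map({'Yes': 1, 'No': 0})

# Separate features and target. 'CustomerID' is an identifier, not a predictor,
# so it is dropped here to keep it out of all three subsets.
X = df_encoded.drop(['Churn', 'CustomerID'], axis=1)
y = df_encoded['Churn']

# Split 70% training / 30% temporary, then halve the temporary part
# into 15% validation and 15% testing. Stratify to preserve the churn rate.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)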

print("Training set distribution:\n", y_train.value_counts(normalize=True))
print("Validation set distribution:\n", y_val.value_counts(normalize=True))
print("Testing set distribution:\n", y_test.value_counts(normalize=True))


Training set distribution:
 Churn
0    0.794857
1    0.205143
Name: proportion, dtype: float64
Validation set distribution:
 Churn
0    0.794667
1    0.205333
Name: proportion, dtype: float64
Testing set distribution:
 Churn
0    0.794667
1    0.205333
Name: proportion, dtype: float64


#**Handle Imbalanced Data**
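- Handle the class imbalance (about 20% of customers churn) using SMOTE, which oversamples the minority class by generating synthetic examples. It is applied to the training set only, so the validation and testing sets keep the original class distribution.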
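
To give an intuition for how SMOTE creates synthetic examples, the sketch below interpolates between a minority sample and one of its nearest minority neighbours. It is a simplified illustration of the idea, not imbalanced-learn's implementation; the function name `synthesize`, its parameters, and the sample points are all hypothetical.

```python
import numpy as np

# Hypothetical minority-class samples in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])

def synthesize(samples, n_new, k=2, seed=42):
    """Create n_new points, each on the segment between a sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(samples))
        # Euclidean distance from sample i to every minority sample.
        dists = np.linalg.norm(samples - samples[i], axis=1)
        # Position 0 of the sort is the sample itself (distance 0), so skip it.
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        # Random position along the segment between the two samples.
        gap = rng.random()
        new_points.append(samples[i] + gap * (samples[j] - samples[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region the minority class already occupies.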

In [None]:
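# Drop the 'CustomerID' identifier if it is still present (errors='ignore' makes this safe to re-run).
X_train = X_train.drop('CustomerID', axis=1, errors='ignore')

# Oversample the minority class in the training set only.
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

print("Balanced training set distribution:\n", y_train_balanced.value_counts(normalize=True))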


Balanced training set distribution:
 Churn
0    0.5
1    0.5
Name: proportion, dtype: float64


In [None]:
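# Save the imputed dataset. Note: this writes `df`, not the one-hot-encoded `df_encoded`.
df.to_csv('preprocessed_customer_churn_dataset.csv', index=False)

# Download the file to the local machine (only works inside Google Colab).
from google.colab import files
files.download('preprocessed_customer_churn_dataset.csv')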



#**Feature Engineering**

#**Create Additional Features Based on Domain Knowledge and EDA Insights**

To augment the dataset, I will create new features that may help the model capture patterns in the data. Based on domain knowledge of customer churn in the telecom business and insights from EDA, I can derive new features such as:

- `AverageMonthlyCharges`: TotalCharges divided by Tenure (a Tenure of 0 is treated as 1 to avoid division by zero).
- `ContractTypeEncoded`: Numeric encoding for contract type duration (e.g., Month-to-month = 1, One year = 2, Two year = 3).
- `IsSenior`: Binary feature indicating whether the customer's age is above 60.
- `HasTechSupport`: Binary feature indicating whether the customer has opted for technical support.

These features add more context and can potentially improve the model's ability to predict churn.
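
As a quick check of the zero-tenure handling and the age threshold, here is a minimal sketch on two hypothetical customers. It uses the same `replace(0, 1)` idea as the cell below.

```python
import pandas as pd

# Two hypothetical customers: one established, one brand new and over 60.
toy = pd.DataFrame({
    "TotalCharges": [600.0, 0.0],
    "Tenure": [12, 0],
    "Age": [45, 67],
})

# Replace a zero tenure with 1 so the division never hits zero.
toy["AverageMonthlyCharges"] = toy["TotalCharges"] / toy["Tenure"].replace(0, 1)

# Flag customers older than 60.
toy["IsSenior"] = (toy["Age"] > 60).astype(int)

print(toy)
```

The first customer gets an average of 50.0 (600 / 12), and the new customer gets 0.0 instead of a division error or `NaN`.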

In [None]:
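# Reload the raw dataset so the engineered features start from the original columns.
df = pd.read_csv('customer_churn_dataset.csv')

# Average charge per month of tenure; a zero tenure is replaced by 1 to avoid division by zero.
df['AverageMonthlyCharges'] = df['TotalCharges'] / df['Tenure'].replace(0, 1)

# Ordinal encoding of contract length (unmapped values would become NaN).
contract_mapping = {'Month-to-month': 1, 'One year': 2, 'Two year': 3}
df['ContractTypeEncoded'] = df['ContractType'].map(contract_mapping)

# 1 if the customer is older than 60, else 0.
df['IsSenior'] = np.where(df['Age'] > 60, 1, 0)

# 1 if the customer has technical support, else 0.
df['HasTechSupport'] = df['TechSupport'].map({'Yes': 1, 'No': 0})

df.head()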


Unnamed: 0,CustomerID,Age,Gender,ContractType,MonthlyCharges,TotalCharges,TechSupport,InternetService,Tenure,PaperlessBilling,PaymentMethod,Churn,AverageMonthlyCharges,ContractTypeEncoded,IsSenior,HasTechSupport
0,CUST_249,58,Female,Month-to-month,50.95,3617.45,No,No,71,Yes,Bank transfer,No,50.95,1,0,0
1,CUST_1890,22,Female,One year,47.01,235.05,Yes,Fiber optic,5,Yes,Electronic check,Yes,47.01,2,0,1
2,CUST_1614,62,Female,Month-to-month,22.58,0.0,Yes,Fiber optic,0,Yes,Bank transfer,No,0.0,1,1,1
3,CUST_2869,42,Male,Month-to-month,55.08,110.16,No,Fiber optic,2,No,Bank transfer,No,55.08,1,0,0
4,CUST_3770,66,Male,One year,43.12,474.32,Yes,DSL,11,Yes,Bank transfer,Yes,43.12,2,1,1


#**Explore Feature Interactions and Transformations**
Explore interactions between features and apply transformations that could help the model capture complex patterns. The techniques used here are:

- **Feature Interaction**: Combine features that are likely to have a joint effect on churn into interaction terms (`Contract_TechSupport` and `AvgCharges_Tenure`).
- **Log Transformation**: Apply a log transformation to features with skewed distributions, here `MonthlyCharges` and `TotalCharges`, to reduce skewness.
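
The effect of the log transformation on skewness can be illustrated on synthetic data. This is a minimal sketch assuming a right-skewed (lognormal) series standing in for a charges column; the data is generated, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a charges column (hypothetical).
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# log1p computes log(1 + x), so a charge of 0 maps to 0 instead of -inf.
log_charges = np.log1p(charges)

print("Skewness before:", charges.skew())
print("Skewness after: ", log_charges.skew())
```

`np.log1p` is used rather than `np.log` because `TotalCharges` can be exactly 0 for new customers.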

In [None]:
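# Interaction: contract length weighted by whether the customer has tech support.
df['Contract_TechSupport'] = df['ContractTypeEncoded'] * df['HasTechSupport']

# Interaction: average monthly charge times tenure.
# Note: for Tenure > 0 this reproduces TotalCharges, so it adds little new information.
df['AvgCharges_Tenure'] = df['AverageMonthlyCharges'] * df['Tenure']

# log1p = log(1 + x), which is defined for charges of 0.
df['LogMonthlyCharges'] = np.log1p(df['MonthlyCharges'])
df['LogTotalCharges'] = np.log1p(df['TotalCharges'])

df.head()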


Unnamed: 0,CustomerID,Age,Gender,ContractType,MonthlyCharges,TotalCharges,TechSupport,InternetService,Tenure,PaperlessBilling,PaymentMethod,Churn,AverageMonthlyCharges,ContractTypeEncoded,IsSenior,HasTechSupport,Contract_TechSupport,AvgCharges_Tenure,LogMonthlyCharges,LogTotalCharges
0,CUST_249,58,Female,Month-to-month,50.95,3617.45,No,No,71,Yes,Bank transfer,No,50.95,1,0,0,0,3617.45,3.950282,8.193801
1,CUST_1890,22,Female,One year,47.01,235.05,Yes,Fiber optic,5,Yes,Electronic check,Yes,47.01,2,0,1,2,235.05,3.871409,5.464044
2,CUST_1614,62,Female,Month-to-month,22.58,0.0,Yes,Fiber optic,0,Yes,Bank transfer,No,0.0,1,1,1,1,0.0,3.160399,0.0
3,CUST_2869,42,Male,Month-to-month,55.08,110.16,No,Fiber optic,2,No,Bank transfer,No,55.08,1,0,0,0,110.16,4.026779,4.710971
4,CUST_3770,66,Male,One year,43.12,474.32,Yes,DSL,11,Yes,Bank transfer,Yes,43.12,2,1,1,2,474.32,3.786913,6.163988


#**Save the Updated Dataset**

In [None]:
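# Save the dataset with the engineered features (overwrites the earlier file of the same name).
df.to_csv('preprocessed_customer_churn_dataset.csv', index=False)

# Download the file to the local machine (only works inside Google Colab).
from google.colab import files
files.download('preprocessed_customer_churn_dataset.csv')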

