### Notebook order

This notebook follows the EDA notebook, which investigates what the data cleaning process needs to do.

You can find `EDA.ipynb` at: `Milestone1_DataCollection_EDA_DataCleaning\notebooks\EDA.ipynb`

In [1]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from verstack import NaNImputer

import warnings
warnings.filterwarnings("ignore")

### Load data

In [2]:
train = pd.read_csv("https://raw.githubusercontent.com/MohamedMostafa259/Customer-Churn-Prediction-and-Analysis/main/Data/train.csv")
train.sample(5, random_state=42)

| | customer_id | Name | age | gender | security_no | region_category | membership_category | joining_date | joined_through_referral | referral_id | ... | avg_time_spent | avg_transaction_value | avg_frequency_login_days | points_in_wallet | used_special_discount | offer_application_preference | past_complaint | complaint_status | feedback | churn_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9550 | fffe43004900440033003200320035003600 | Tobias Liebold | 24 | F | I4AYTC2 | City | Premium Membership | 2015-04-22 | No | xxxxxxxx | ... | 101.5 | 32593.2 | 15.0 | 801.18 | Yes | No | No | Not Applicable | Products always in Stock | 1 |
| 7112 | fffe43004900440032003200350035003400 | Patrick Kizer | 53 | F | WV0LB6W | Town | Silver Membership | 2016-01-19 | No | xxxxxxxx | ... | 324.61 | 39155.49 | 21.0 | NaN | No | Yes | No | Not Applicable | No reason specified | 3 |
| 9545 | fffe43004900440031003000380038003300 | Annamaria Freese | 53 | F | 94O1F22 | Town | No Membership | 2016-02-07 | Yes | CID19334 | ... | 47.71 | 35434.17 | 12.0 | 675.17 | Yes | No | No | Not Applicable | Poor Product Quality | 5 |
| 10261 | fffe43004900440034003200300031003800 | Gilda Lundy | 61 | M | 74WFG9K | NaN | Gold Membership | 2017-10-24 | No | xxxxxxxx | ... | 451.66 | 30621.93 | 7.0 | 755.93 | Yes | Yes | Yes | Solved | Poor Product Quality | 3 |
| 9876 | fffe43004900440034003100380030003300 | Angla Alameda | 46 | F | 249HVEX | Town | Premium Membership | 2016-06-11 | No | xxxxxxxx | ... | 266.68 | 50462.15 | Error | 806.67 | Yes | Yes | Yes | Solved | Products always in Stock | 1 |


### Create a copy for cleaning

In [3]:
train_copy = train.copy()

#### Set unknown categories to `NaN`

In [4]:
train_copy = train_copy.replace(['?', 'Error'], np.nan)
train_copy['avg_frequency_login_days'] = train_copy['avg_frequency_login_days'].astype(float)
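
To see the effect of this cell in isolation, here is a minimal sketch on a made-up frame. The `toy` data is an assumption for illustration only and is not taken from `train.csv`; only the placeholder strings `'?'` and `'Error'` and the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): placeholder strings mixed with real entries
toy = pd.DataFrame({
    "avg_frequency_login_days": ["15.0", "Error", "7.0"],
    "medium_of_operation": ["Desktop", "?", "Smartphone"],
})

# Replace the placeholder strings with NaN, as in the cell above
toy = toy.replace(["?", "Error"], np.nan)

# With the placeholders gone, the numeric column can be cast to float
toy["avg_frequency_login_days"] = toy["avg_frequency_login_days"].astype(float)

print(toy.dtypes)
print(toy.isna().sum())
```

The cast has to come after the replacement: a column that still contains the string `'Error'` cannot be converted to `float`.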

#### Handling incorrect negative values

In [5]:
nonnegative_cols = ['days_since_last_login', 'avg_time_spent', 'avg_frequency_login_days', 'points_in_wallet']
for col in nonnegative_cols:
	train_copy.loc[train_copy[col] < 0, col] = np.nan  
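
The same masking can also be written with `Series.mask`, which some readers find easier to scan than the `.loc` assignment. The sketch below uses made-up values (the `toy` frame is an assumption, not real data) and is meant as an equivalent alternative, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): a few impossible negative entries
toy = pd.DataFrame({
    "days_since_last_login": [3.0, -999.0, 12.0],
    "avg_time_spent": [-50.2, 101.5, 324.61],
})

for col in ["days_since_last_login", "avg_time_spent"]:
    # Series.mask replaces values where the condition is True (with NaN by default)
    toy[col] = toy[col].mask(toy[col] < 0)

print(toy)
```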

#### Dropping rows with NaNs in the target variable

In [6]:
train_copy.loc[train_copy['churn_risk_score'] == -1, 'churn_risk_score'] = np.nan
train_copy.dropna(subset=['churn_risk_score'], inplace=True)

In [7]:
train_copy.describe()

Unnamed: 0,age,days_since_last_login,avg_time_spent,avg_transaction_value,avg_frequency_login_days,points_in_wallet,churn_risk_score
count,35829.0,33885.0,34170.0,35829.0,31751.0,32354.0,35829.0
mean,37.120266,12.751424,292.491498,29304.272306,16.523181,690.389359,3.608278
std,15.86536,5.574802,331.518007,19484.565419,8.374922,186.791727,1.176426
min,10.0,1.0,1.837399,800.46,0.009208,6.432208,1.0
25%,23.0,9.0,71.39,14194.65,10.0,617.1525,3.0
50%,37.0,13.0,173.99,27584.53,16.0,698.42,4.0
75%,51.0,17.0,370.9075,40874.01,23.0,764.3075,5.0
max,64.0,26.0,3235.578521,99914.05,73.061995,2069.069761,5.0


### Drop unnecessary columns


In [8]:
cols_to_drop = ['customer_id', 'Name', 'security_no', 'referral_id']
train_copy = train_copy.drop(columns=cols_to_drop)

#### Collect categorical columns in a list for later use

In [9]:
# date_cols = [('column_name', 'date_format'), ...]
date_cols = [('joining_date', '%Y-%m-%d'), ('last_visit_time', '%H:%M:%S')]

# date_cols holds (name, format) tuples, so the set difference must use the names only
date_col_names = {col for col, _ in date_cols}
cat_cols = list(set(train_copy.select_dtypes(include='object').columns) - date_col_names)
# last_visit_time → categories: morning & evening, ...
cat_cols

['region_category',
 'medium_of_operation',
 'internet_option',
 'offer_application_preference',
 'gender',
 'joined_through_referral',
 'past_complaint',
 'complaint_status',
 'used_special_discount',
 'feedback',
 'membership_category',
 'preferred_offer_types']
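
The `(column, format)` pairs in `date_cols` are kept for later use. A minimal sketch of how they might be consumed is shown here; the `toy` frame, the time values, and the derived column names `joining_year` and `last_visit_hour` are assumptions for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd

date_cols = [("joining_date", "%Y-%m-%d"), ("last_visit_time", "%H:%M:%S")]

# Toy data (hypothetical rows) in the same string formats as the dataset
toy = pd.DataFrame({
    "joining_date": ["2015-04-22", "2016-01-19"],
    "last_visit_time": ["07:15:30", "19:42:05"],
})

parsed = toy.copy()
for col, fmt in date_cols:
    # An explicit format avoids ambiguous parsing and is usually faster
    parsed[col] = pd.to_datetime(parsed[col], format=fmt)

# Example derived features (hypothetical names)
parsed["joining_year"] = parsed["joining_date"].dt.year
parsed["last_visit_hour"] = parsed["last_visit_time"].dt.hour

print(parsed[["joining_year", "last_visit_hour"]])
```

An hour column like this could then be binned into the morning/evening categories mentioned in the comment above.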

### Identify missing values

In [10]:
train_copy.isna().sum()

age                                0
gender                             0
region_category                 5263
membership_category                0
joining_date                       0
joined_through_referral         5292
preferred_offer_types            276
medium_of_operation             5230
internet_option                    0
last_visit_time                    0
days_since_last_login           1944
avg_time_spent                  1659
avg_transaction_value              0
avg_frequency_login_days        4078
points_in_wallet                3475
used_special_discount              0
offer_application_preference       0
past_complaint                     0
complaint_status                   0
feedback                           0
churn_risk_score                   0
dtype: int64

#### Splitting the cleaned train.csv file into train & validation splits (to avoid data leakage during imputation)

Initially, the dataset only contained two files: `train.csv` and `test.csv`. However, the `test.csv` file didn't include any target labels, and since the HackerEarth competition had already ended, I couldn't use it to evaluate my model's performance.

**Solution:**  
To address this, I will manually split the original `train.csv` into two separate sets: a new `train_split.csv` and a `validation_split.csv`, using the code below:

In [11]:
train_split, validation_split = train_test_split(train_copy, test_size=0.2, random_state=42, stratify=train_copy['churn_risk_score'])

train_copy.to_csv('train_cleaned.csv', index=False)
train_split.to_csv('train_split_cleaned.csv', index=False)
validation_split.to_csv('validation_split_cleaned.csv', index=False)
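
As a sanity check on what `stratify` does, the sketch below splits a made-up, imbalanced frame and compares class proportions across the parts. The `toy` data and its class counts are assumptions chosen so the proportions divide evenly; on the real data the proportions would match only approximately.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical): imbalanced target with three classes
toy = pd.DataFrame({
    "feature": range(100),
    "churn_risk_score": [1] * 10 + [3] * 30 + [5] * 60,
})

train_part, valid_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["churn_risk_score"]
)

def class_share(df):
    # Fraction of rows per target class, in a stable class order
    return df["churn_risk_score"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "full": class_share(toy),
    "train": class_share(train_part),
    "valid": class_share(valid_part),
})
print(comparison)
```

Without `stratify`, a rare class could end up over- or under-represented in the validation split purely by chance.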

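##### **Approach to handling missing values**

I will use `verstack.NaNImputer` because its imputation is model-based: as the log further down shows, it treats each column with missing values as a binary, multiclass, or regression prediction problem.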
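
To make the idea of model-based imputation concrete, here is a minimal sketch of the general technique for a single numeric column. It is not how `verstack.NaNImputer` is implemented: scikit-learn's `RandomForestRegressor` is swapped in as an assumption, and the `toy` data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data (synthetic): y depends on x, and some y values are missing
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
toy["y"] = 3 * toy["x"] + rng.normal(0, 0.5, size=200)
missing_idx = toy.sample(20, random_state=0).index
toy.loc[missing_idx, "y"] = np.nan

# Train on rows where y is known, then predict y where it is missing
known = toy["y"].notna()
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(toy.loc[known, ["x"]], toy.loc[known, "y"])

imputed = toy.copy()
imputed.loc[~known, "y"] = model.predict(toy.loc[~known, ["x"]])

print(imputed["y"].isna().sum())
```

Compared with filling in a mean or median, this uses the relationship between columns, at the cost of training one model per column with missing values.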

## Building the whole cleaning pipeline

### DataCleaner Transformer

This transformer:

-	drops unwanted columns

-	replaces unknown categories (e.g., '?') with `np.nan`

-	sets incorrect negative values to `np.nan`

In [12]:
class DataCleaner(BaseEstimator, TransformerMixin):
	def __init__(self, cols_to_drop, nonnegative_cols):
		self.cols_to_drop = cols_to_drop
		self.nonnegative_cols = nonnegative_cols
   
	def fit(self, X, y=None):
		return self

	# X is pd.DataFrame
	def transform(self, X):
		X_copy = X.copy()
		X_copy.drop(columns=self.cols_to_drop, errors='ignore', inplace=True)	
			
		X_copy.replace(['?', 'Error'], np.nan, inplace=True)
		X_copy['avg_frequency_login_days'] = X_copy['avg_frequency_login_days'].astype(float)

		for col in self.nonnegative_cols:
			X_copy.loc[X_copy[col] < 0, col] = np.nan  

		return X_copy

#### Wrapping `verstack.NaNImputer` in a custom transformer for compatibility with scikit-learn's API

In [13]:
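# inspired by the Adapter design pattern ;)
class NaNImputerWrapper(BaseEstimator, TransformerMixin):
    def __init__(self, train_sample_size=30_000, verbose=True):
        self.train_sample_size = train_sample_size
        self.verbose = verbose
        # keyword arguments keep the call robust to changes in parameter order
        self.imputer = NaNImputer(train_sample_size=self.train_sample_size, verbose=self.verbose)

    def fit(self, X, y=None):
        # nothing is learned here: all of the work happens inside impute()
        return self

    def transform(self, X):
        # impute() fills the NaNs of whichever DataFrame it receives
        return self.imputer.impute(X)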
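
The adapter idea can be tried without installing `verstack`. In the sketch below, `MedianFiller` is a hypothetical stand-in for any third-party object that exposes only an `impute` method, and `ImputeAdapter` is an illustrative name; neither is part of this notebook or of any library.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MedianFiller:
    """Hypothetical third-party imputer exposing only an `impute` method."""

    def impute(self, df):
        return df.fillna(df.median(numeric_only=True))


class ImputeAdapter(BaseEstimator, TransformerMixin):
    """Adapts an object with `.impute(df)` to scikit-learn's fit/transform API."""

    def __init__(self, imputer):
        self.imputer = imputer

    def fit(self, X, y=None):
        # Nothing is learned here; the wrapped object does its work in impute()
        return self

    def transform(self, X):
        return self.imputer.impute(X)


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
pipe = Pipeline([("imputer", ImputeAdapter(MedianFiller()))])
result = pipe.fit_transform(toy)
print(result)
```

Because `fit` is a no-op in this pattern, the wrapped object recomputes its fill values on every frame passed to `transform`, which is why the notebook keeps the validation split out of the imputation step.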

### Integrating transformers into a pipeline

In [None]:
def get_X_train_y_train(df, y='churn_risk_score'):
	X_train = df.drop(columns=[y])
	y_train = df[y]
	return X_train, y_train

In [19]:
# imputing train_copy for advanced analysis and building dashboards
X_train, y_train = get_X_train_y_train(train_copy)

# imputing train_split for model development (we avoid imputing the validation set here to avoid data leakage)
X_train_split, y_train_split = get_X_train_y_train(train_split)

In [15]:
cleaning_pipeline = Pipeline([
    ('dataCleaner', DataCleaner(cols_to_drop, nonnegative_cols)), 
    ('imputer', NaNImputerWrapper(train_sample_size=train_split.shape[0]))
])

X_train_cleaned_imputed = cleaning_pipeline.fit_transform(X_train)
X_train_split_cleaned_imputed = cleaning_pipeline.fit_transform(X_train_split)


 * Initiating NaNImputer.impute
     . Dataset dimensions:
     .. rows:         35829
     .. columns:      20
     .. mb in memory: 5.74
     .. NaN cols num: 8

   - Drop hopeless NaN cols

   - Processing whole data for imputation
     . Processed 10 cols; 8 to go

   - Imputing single core 8 cols
     . Imputed (multiclass) - 5263     NaN in region_category
     . Imputed (  binary  ) - 5292     NaN in joined_through_referral
     . Imputed (multiclass) - 276      NaN in preferred_offer_types
     . Imputed (multiclass) - 5230     NaN in medium_of_operation
     . Imputed (multiclass) - 1944     NaN in days_since_last_login
     . Imputed (regression) - 1659     NaN in avg_time_spent
     . Imputed (regression) - 4078     NaN in avg_frequency_login_days
     . Imputed (regression) - 3475     NaN in points_in_wallet

   - Missing values after imputation: 0

Time elapsed for impute execution: 55.04399 seconds

 * Initiating NaNImputer.impute
     . Dataset dimensions:
     .. rows:

### Save cleaned data

In [22]:
def save_df(X, y, name, extension='.csv'):
	df_cleaned_imputed = pd.concat([X, y], axis=1)
	df_cleaned_imputed.to_csv(name + extension, index=False)

In [23]:
save_df(X_train_cleaned_imputed, y_train, 'train_cleaned_imputed')

save_df(X_train_split_cleaned_imputed, y_train_split, 'train_split_cleaned_imputed')

### Save `cleaning_pipeline`

In [24]:
import joblib
joblib.dump(cleaning_pipeline, 'cleaning_pipeline.joblib')

['cleaning_pipeline.joblib']
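
A saved pipeline can be restored later with `joblib.load`. The sketch below shows the round trip on a stand-in pipeline: `toy_pipeline`, the `SimpleImputer` step, and the data values are assumptions used so the example runs on its own, and they are not the notebook's `cleaning_pipeline`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Stand-in pipeline (hypothetical): a median imputer fitted on toy data
toy_train = pd.DataFrame({"points_in_wallet": [600.0, np.nan, 800.0]})
toy_pipeline = Pipeline([("imputer", SimpleImputer(strategy="median"))])
toy_pipeline.fit(toy_train)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline reuses the median learned from toy_train
toy_new = pd.DataFrame({"points_in_wallet": [np.nan, 650.0]})
restored_output = restored.transform(toy_new)
print(restored_output)
```

When loading `cleaning_pipeline.joblib` in another notebook or script, the custom classes `DataCleaner` and `NaNImputerWrapper` must be defined or importable in that session, because the file stores references to those classes and not their source code.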