### Notebook order

This notebook follows the EDA notebook, which investigates what the data cleaning process needs to do.

You can find `EDA.ipynb` at: `Milestone1_DataCollection_EDA_DataCleaning\notebooks\EDA.ipynb`

In [None]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from verstack import NaNImputer

import warnings
warnings.filterwarnings("ignore")

In [None]:
import xgboost
print(xgboost.__version__)

### Load data

In [None]:
train = pd.read_csv("https://raw.githubusercontent.com/MohamedMostafa259/Customer-Churn-Prediction-and-Analysis/main/Data/train.csv")
train.sample(5, random_state=42)

### Create a copy for cleaning

In [None]:
train_copy = train.copy()

#### Set unknown categories to `NaN`

In [None]:
train_copy = train_copy.replace(['?', 'Error'], np.nan)
train_copy['avg_frequency_login_days'] = train_copy['avg_frequency_login_days'].astype(float)
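
To make this step concrete, here is a minimal sketch on a made-up frame. The values and the `some_category` column are invented for illustration; only `avg_frequency_login_days` is a real column name. Once the placeholder strings become `NaN`, the column can be cast to `float`.

```python
import numpy as np
import pandas as pd

# Toy data (invented): 'Error' and '?' stand in for unknown values.
toy = pd.DataFrame({
    "avg_frequency_login_days": ["17.0", "Error", "6.5"],
    "some_category": ["A", "?", "B"],  # hypothetical column name
})

# Same two steps as the cell above: placeholders -> NaN, then cast to float.
toy = toy.replace(["?", "Error"], np.nan)
toy["avg_frequency_login_days"] = toy["avg_frequency_login_days"].astype(float)

n_missing = toy.isna().sum()
print(toy.dtypes)
print(n_missing)
```

Without the `replace` call, the cast would fail because `'Error'` cannot be parsed as a number.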

#### Handling invalid negative values

In [None]:
nonnegative_cols = ['days_since_last_login', 'avg_time_spent', 'avg_frequency_login_days', 'points_in_wallet']
for col in nonnegative_cols:
	train_copy.loc[train_copy[col] < 0, col] = np.nan  
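
An equivalent way to write this step is `Series.mask`, sketched below on invented values. It returns a new column instead of assigning in place, and it upcasts integer columns to float by itself, whereas assigning `NaN` into an integer column through `.loc` can trigger a dtype warning in recent pandas versions.

```python
import pandas as pd

# Toy data (invented): one impossible negative value per column.
toy = pd.DataFrame({
    "days_since_last_login": [12, -999, 4],       # integer column
    "points_in_wallet": [705.1, -31.4, 500.0],    # float column
})
nonnegative_cols = ["days_since_last_login", "points_in_wallet"]

for col in nonnegative_cols:
    # mask() replaces entries where the condition holds with NaN
    toy[col] = toy[col].mask(toy[col] < 0)

n_missing = toy[nonnegative_cols].isna().sum().to_dict()
print(toy)
print(n_missing)
```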

#### Dropping rows with NaNs in the target variable

In [None]:
train_copy.loc[train_copy['churn_risk_score'] == -1, 'churn_risk_score'] = np.nan
train_copy.dropna(subset=['churn_risk_score'], inplace=True)

In [None]:
train_copy.describe()

### Drop unnecessary columns

In [None]:
cols_to_drop = ['customer_id', 'Name', 'security_no', 'referral_id']
train_copy = train_copy.drop(columns=cols_to_drop)

#### Collect categorical columns in a list for later use

In [None]:
# date_cols = [('column_name', 'date_format'), ...]
date_cols = [('joining_date', '%Y-%m-%d'), ('last_visit_time', '%H:%M:%S')]
# date_cols holds (name, format) tuples, so compare on the column names only
date_col_names = {name for name, _ in date_cols}

cat_cols = sorted(set(train_copy.select_dtypes(include='object').columns) - date_col_names)
# last_visit_time → categories: morning & evening, ...
cat_cols

### Identify missing values

In [None]:
train_copy.isna().sum()

#### Splitting the cleaned train.csv file into train & validation splits (to avoid data leakage during imputation)

Initially, the dataset only contained two files: `train.csv` and `test.csv`. However, the `test.csv` file didn't include any target labels, and since the HackerEarth competition had already ended, I couldn't use it to evaluate my model's performance.

**Solution:**  
To address this, I will manually split the original `train.csv` into two separate sets: a new `train_split.csv` and a `validation_split.csv`, using the code below:

In [None]:
train_split, validation_split = train_test_split(train_copy, test_size=0.2, random_state=42, stratify=train_copy['churn_risk_score'])

train_copy.to_csv('train_cleaned.csv', index=False)
train_split.to_csv('train_split_cleaned.csv', index=False)
validation_split.to_csv('validation_split_cleaned.csv', index=False)
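
As a quick illustration of what `stratify` buys here, the sketch below splits an invented, imbalanced target and compares class shares across the full data, the train split, and the validation split. The toy frame and its 60/20/20 class mix are assumptions for demonstration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (invented) with an imbalanced target.
toy = pd.DataFrame({
    "feature": range(100),
    "churn_risk_score": [1] * 60 + [2] * 20 + [3] * 20,
})

tr, va = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["churn_risk_score"]
)

# Share of each class in the full data and in both splits.
full_share = toy["churn_risk_score"].value_counts(normalize=True).sort_index()
train_share = tr["churn_risk_score"].value_counts(normalize=True).sort_index()
val_share = va["churn_risk_score"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_share, "train": train_share, "validation": val_share}))
```

The same comparison can be run on `train_split` and `validation_split` to confirm that both keep the class mix of `train_copy`.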

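##### **Approach to handling missing values**

I will use `verstack.NaNImputer` because it performs model-based imputation: each column's missing values are predicted from the other columns rather than filled with a constant such as the mean or mode.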
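
The general idea behind model-based imputation can be sketched in a few lines. This is not how `verstack` is implemented; it is a simplified illustration that assumes a single predictor, a plain `LinearRegression`, and invented values: fit a model on the rows where the column is known, then predict the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (invented): points_in_wallet is exactly 10 * avg_time_spent.
toy = pd.DataFrame({
    "avg_time_spent": [10.0, 20.0, 30.0, 40.0],
    "points_in_wallet": [100.0, 200.0, np.nan, 400.0],
})

known = toy["points_in_wallet"].notna()

# Train on rows where the target column is present ...
model = LinearRegression().fit(
    toy.loc[known, ["avg_time_spent"]], toy.loc[known, "points_in_wallet"]
)
# ... and predict the rows where it is missing.
toy.loc[~known, "points_in_wallet"] = model.predict(toy.loc[~known, ["avg_time_spent"]])

imputed_value = toy.loc[2, "points_in_wallet"]
print(toy)
```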

## Building the whole cleaning pipeline

### DataCleaner Transformer

This transformer:

-	drops unwanted columns

-	replaces unknown categories (e.g., '?') with `np.nan`

-	sets invalid negative values to `np.nan`

In [None]:
class DataCleaner(BaseEstimator, TransformerMixin):
	def __init__(self, cols_to_drop, nonnegative_cols):
		self.cols_to_drop = cols_to_drop
		self.nonnegative_cols = nonnegative_cols
   
	def fit(self, X, y=None):
		return self

	# X is pd.DataFrame
	def transform(self, X):
		X_copy = X.copy()
		X_copy.drop(columns=self.cols_to_drop, errors='ignore', inplace=True)	
			
		X_copy.replace(['?', 'Error'], np.nan, inplace=True)
		
		if 'avg_frequency_login_days' in X_copy.columns:
			X_copy['avg_frequency_login_days'] = X_copy['avg_frequency_login_days'].astype(float)

		for col in self.nonnegative_cols:
			if col in X_copy.columns:
				X_copy.loc[X_copy[col] < 0, col] = np.nan

		return X_copy

#### Wrapping `verstack.NaNImputer` in a custom transformer for compatibility with scikit-learn's API

In [None]:
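# inspired by the Adapter design pattern ;)
class NaNImputerWrapper(BaseEstimator, TransformerMixin):
	def __init__(self, train_sample_size=30_000, verbose=True):
		self.train_sample_size = train_sample_size
		self.verbose = verbose
		# keyword arguments keep the call independent of NaNImputer's parameter order
		self.imputer = NaNImputer(train_sample_size=self.train_sample_size, verbose=self.verbose)

	def fit(self, X, y=None):
		# NaNImputer exposes no separate fit step, so there is nothing to learn here
		return self

	def transform(self, X):
		# impute() trains its models on X and returns X with the gaps filled
		return self.imputer.impute(X)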
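
The adapter idea can also be tried without `verstack` installed. The sketch below is an illustration under assumptions, not the wrapper used in this notebook: `StandInImputer` is a hypothetical class that, like the wrapped imputer, only exposes `impute()`, and it simply fills numeric gaps with column medians. One design variation is shown as well: the adapter stores only its hyperparameter in `__init__` and builds the wrapped object inside `transform`, which follows scikit-learn's convention that `__init__` does nothing but store its arguments.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline


class StandInImputer:
    """Hypothetical stand-in for a third-party imputer exposing only impute()."""

    def __init__(self, verbose=False):
        self.verbose = verbose

    def impute(self, df):
        # Placeholder strategy: fill numeric gaps with the column median.
        return df.fillna(df.median(numeric_only=True))


class ImputerAdapter(TransformerMixin, BaseEstimator):
    """Adapts an object with impute() to scikit-learn's fit/transform interface."""

    def __init__(self, verbose=False):
        # Store only the hyperparameter; the wrapped object is built on demand.
        self.verbose = verbose

    def fit(self, X, y=None):
        return self  # nothing is learned ahead of time

    def transform(self, X):
        return StandInImputer(verbose=self.verbose).impute(X)


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
pipe = Pipeline([("imputer", ImputerAdapter())])
filled = pipe.fit_transform(toy)
cloned = clone(pipe)  # cloning rebuilds each step from its constructor parameters
print(filled)
```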

### Integrating transformers into a pipeline

In [None]:
def get_X_train_y_train(df, y='churn_risk_score'):
	X_train = df.drop(columns=[y])
	y_train = df[y]
	return X_train, y_train

In [None]:
# imputing train_copy for advanced analysis and building dashboards
X_train, y_train = get_X_train_y_train(train_copy)

# imputing train_split for model development (we avoid imputing the validation set here to avoid data leakage)
X_train_split, y_train_split = get_X_train_y_train(train_split)

In [None]:
cleaning_pipeline = Pipeline([
	('dataCleaner', DataCleaner(cols_to_drop, nonnegative_cols)), 
	('imputer', NaNImputerWrapper(train_sample_size=train_split.shape[0]))
])

X_train_cleaned_imputed = cleaning_pipeline.fit_transform(X_train)
X_train_split_cleaned_imputed = cleaning_pipeline.fit_transform(X_train_split)

### Save cleaned data

In [None]:
def save_df(X, y, name, extension='.csv'):
	df_cleaned_imputed = pd.concat([X, y], axis=1)
	df_cleaned_imputed.to_csv(name + extension, index=False)

In [None]:
save_df(X_train_cleaned_imputed, y_train, 'train_cleaned_imputed')

save_df(X_train_split_cleaned_imputed, y_train_split, 'train_split_cleaned_imputed')

### Save `cleaning_pipeline`

In [None]:
import joblib
joblib.dump(cleaning_pipeline, 'cleaning_pipeline.joblib')
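
Loading the saved pipeline later is a single `joblib.load` call. The round trip below is a sketch that uses a stand-in pipeline built from scikit-learn's own `SimpleImputer` and a temporary file (both are assumptions for illustration), so that it runs anywhere. For the real `cleaning_pipeline`, the classes `DataCleaner` and `NaNImputerWrapper` must be defined or importable in the session that calls `joblib.load`, because pickled objects reference their classes by name rather than storing the class code.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Stand-in pipeline (not the notebook's cleaning_pipeline), fitted on invented data.
X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
pipe = Pipeline([("imputer", SimpleImputer(strategy="median"))]).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should transform data exactly like the original.
same_output = np.array_equal(pipe.transform(X), restored.transform(X))
print(same_output)
```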