# Data Preprocessing

This notebook prepares the data for modeling. The steps are:
1. Loading the data and dropping identifier columns.
2. Separating features from the target and splitting them into stratified training and test sets.
3. One-hot encoding categorical variables.
4. Scaling numerical features.
5. Handling class imbalance in the training set with SMOTE.
6. Saving the preprocessed data and the fitted preprocessor.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import joblib

## 1. Load and Clean Data

In [None]:
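# Load the raw customer data
df = pd.read_csv('BankChurners.csv')

# Drop identifier columns that carry no predictive signal.
# drop() raises a KeyError if any of these columns is missing from the CSV.
df_cleaned = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
df_cleaned.head()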

## 2. Define Features and Target

In [None]:
X = df_cleaned.drop('Exited', axis=1)
y = df_cleaned['Exited']

## 3. Identify Categorical and Numerical Features

In [None]:
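# Split columns by dtype: strings/categoricals are one-hot encoded, integers/floats are scaled.
# Columns of any other dtype (e.g. bool) match neither filter.
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns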

print(f'Categorical Features: {list(categorical_features)}')
print(f'Numerical Features: {list(numerical_features)}')
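
As a small illustration of how dtype-based selection behaves, the sketch below uses a hypothetical toy frame (the column names and values are made up for the example, not taken from the dataset). It shows that a boolean column matches neither filter, so it would be handled only by the `remainder` setting of a `ColumnTransformer`.

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only
toy = pd.DataFrame({
    "Geography": pd.Series(["France", "Spain", "Germany"], dtype="object"),
    "Gender": pd.Series(["Female", "Male", "Female"], dtype="category"),
    "Age": [42, 35, 51],
    "Balance": [0.0, 1250.5, 980.25],
    "IsActiveMember": [True, False, True],
})

toy_categorical = toy.select_dtypes(include=["object", "category"]).columns.tolist()
toy_numerical = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Columns that matched neither dtype filter
leftover = [c for c in toy.columns if c not in toy_categorical + toy_numerical]

print("categorical:", toy_categorical)
print("numerical:", toy_numerical)
print("leftover:", leftover)
```

Checking for leftover columns like this is a cheap way to confirm that every feature is routed to the transformer you expect.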

## 4. Create Preprocessing Pipeline

In [None]:
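# Build a ColumnTransformer that routes each column group to its own transformer:
# - StandardScaler standardizes numerical features to zero mean and unit variance
# - OneHotEncoder expands each categorical feature into one indicator column per category;
#   handle_unknown='ignore' encodes categories unseen during fitting as all zeros
# remainder='passthrough' keeps any column not listed above unchanged.
# sparse_threshold=0 forces a dense NumPy array, so the result can be written with np.save later.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough',
    sparse_threshold=0)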

## 5. Split Data into Training and Test sets

In [None]:
# Stratify by y so the train and test sets keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
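
The effect of `stratify` can be checked on a small synthetic example. The sketch below assumes hypothetical labels with a 20% positive rate and confirms that both splits keep that rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Share of positives in each split
print(toy_y_train.mean(), toy_y_test.mean())
```

Without `stratify`, a small test set can end up with noticeably more or fewer positives than the full data, which makes test metrics harder to interpret.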

## 6. Apply Preprocessing to Training and Test Data

In [None]:
# Fit the preprocessor on the training data and transform it
X_train_processed = preprocessor.fit_transform(X_train)

# Transform the test data using the already fitted preprocessor
X_test_processed = preprocessor.transform(X_test)
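
The reason for fitting only on the training data is that the test set is then transformed with statistics and categories learned from the training set alone. The following sketch uses hypothetical values to show both consequences: test values are scaled with the training mean and standard deviation, and a category that was never seen during fitting is encoded as a row of zeros when `handle_unknown='ignore'` is set.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical numerical feature
train_ages = np.array([[20.0], [40.0], [60.0]])
test_ages = np.array([[40.0], [80.0]])

scaler = StandardScaler().fit(train_ages)    # learns mean and std from training data only
scaled_test = scaler.transform(test_ages)    # reuses those training statistics

# Hypothetical categorical feature; "Germany" never appears in training
train_geo = np.array([["France"], ["Spain"]], dtype=object)
test_geo = np.array([["Germany"], ["Spain"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore").fit(train_geo)
encoded_test = encoder.transform(test_geo).toarray()

print(scaled_test.ravel())
print(encoded_test)
```

Calling `fit_transform` on the test set instead would leak test-set statistics into preprocessing and could produce a different set of one-hot columns than the model was trained on.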

## 7. Handle Class Imbalance using SMOTE

In [None]:
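print('Original training target distribution:')
print(y_train.value_counts(normalize=True))

# Oversample the minority class in the training set only; the test set keeps
# its original distribution so evaluation reflects real-world class balance.
# SMOTE creates synthetic rows by interpolating between neighbouring minority
# samples, so one-hot columns in synthetic rows may take fractional values.
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)

print('\nNew training target distribution after SMOTE:')
print(pd.Series(y_train_resampled).value_counts(normalize=True))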
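
For intuition about what SMOTE does, the sketch below reproduces its core interpolation step in plain NumPy. It is a deliberate simplification, not imbalanced-learn's implementation: it always interpolates toward the single nearest neighbour, whereas SMOTE picks at random among several nearest neighbours. The points and the `synthesize` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])


def synthesize(points, n_new, rng):
    """Create n_new points, each on the segment between a random point and its nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distance from point i to every point; exclude i itself
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf
        j = np.argmin(dists)
        lam = rng.random()  # interpolation weight in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)


synthetic = synthesize(minority, 5, rng)
print(synthetic)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies rather than duplicating rows exactly.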

## 8. Save the Preprocessed Data and Preprocessor

In [None]:
# Save the processed data arrays for the next notebook
np.save('X_train_resampled.npy', X_train_resampled)
np.save('y_train_resampled.npy', y_train_resampled)
np.save('X_test_processed.npy', X_test_processed)
np.save('y_test.npy', y_test.to_numpy())

# Save the preprocessor object for future use (e.g., in deployment)
joblib.dump(preprocessor, 'preprocessor.joblib')

print('Preprocessed data and preprocessor saved successfully.')
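
A later notebook can read these artifacts back with `np.load` and `joblib.load`. The round trip is sketched below with hypothetical stand-ins (a small array and a fitted `StandardScaler` written to a temporary directory) so that it runs without the real files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names inside a throwaway directory
    array_path = os.path.join(tmp_dir, "toy_array.npy")
    scaler_path = os.path.join(tmp_dir, "toy_scaler.joblib")

    toy_array = np.array([[1.0, 2.0], [3.0, 4.0]])
    toy_scaler = StandardScaler().fit(toy_array)

    # Save, then load back
    np.save(array_path, toy_array)
    joblib.dump(toy_scaler, scaler_path)

    loaded_array = np.load(array_path)
    loaded_scaler = joblib.load(scaler_path)

# The loaded objects should match what was saved
arrays_match = np.array_equal(toy_array, loaded_array)
means_match = np.allclose(toy_scaler.mean_, loaded_scaler.mean_)
print(arrays_match, means_match)
```

When loading a joblib file, it is generally safest to use the same scikit-learn version that created it, since pickled estimators are not guaranteed to be compatible across versions.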