# Customer Churn Prediction – Data Preprocessing

This notebook walks through the preprocessing pipeline for the Customer Churn dataset.  
We will:
- Load and clean the dataset
- Encode categorical variables
- Scale numeric features
- Handle class imbalance with SMOTE
- Split into train and test sets

All steps are implemented as functions in `src/preprocess.py`. We call those functions here so that experiments and production run the same preprocessing code.


```python
import sys
import os
import pandas as pd

# Get the root project directory (change the path accordingly)
project_root = os.path.abspath("..")  # if your notebook is in /notebooks and src is in the parent
sys.path.append(project_root)

# Now import functions from src
from src.preprocess import load_data, clean_data, encode_features, scale_features, split_data
```


We import all the preprocessing functions from `src/preprocess.py` to replicate the exact pipeline used during model training.


# Preprocessing Steps

The sections below cover each preprocessing step required for modeling customer churn.

### Objectives:
- Load raw data
- Clean and handle missing values
- Encode categorical variables
- Scale numerical features
- Handle class imbalance using SMOTE
- Split data into training and testing sets


## Step 1: Load the Data

We begin by importing the raw dataset and checking for missing values or data type issues.
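As a rough illustration of this check, the sketch below reads a small CSV with pandas and reports missing values and inferred data types. It does not show how `load_data` is implemented; the in-memory CSV and its column names are hypothetical stand-ins, since the real file path and schema are not shown here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw CSV file (the real path is not shown in this notebook)
raw_csv = io.StringIO(
    "customerID,gender,tenure,TotalCharges,Churn\n"
    "0001,Female,1,29.85,No\n"
    "0002,Male,34,1889.5,No\n"
    "0003,Male,2,,Yes\n"
)

# Read the ID column as text so leading zeros are preserved
df = pd.read_csv(raw_csv, dtype={"customerID": str})

# Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Inspect inferred dtypes to spot numeric columns that were parsed as text
print(df.dtypes)
```

Columns that report a text dtype but should be numeric are candidates for conversion in the cleaning step.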


## Step 2: Clean the Data

- Drop columns that carry no predictive signal (e.g., customerID)
- Convert string columns to appropriate types (e.g., numeric values stored as text)
- Handle missing or incorrect values
- Drop duplicate rows, if any
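The cleaning steps listed above could look roughly like the following sketch. It is not the body of `clean_data`: the sample frame is hypothetical, and filling missing values with the column median is an assumed strategy chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample with an ID column, a numeric column stored as text,
# blank entries, and two rows that are identical apart from the ID
df = pd.DataFrame({
    "customerID": ["0001", "0002", "0003", "0004"],
    "tenure": [1, 34, 2, 2],
    "TotalCharges": ["29.85", "1889.5", " ", " "],
    "Churn": ["No", "No", "Yes", "Yes"],
})

# Drop the identifier column
cleaned = df.drop(columns=["customerID"])

# Convert text to numbers; unparseable entries such as " " become NaN
cleaned["TotalCharges"] = pd.to_numeric(cleaned["TotalCharges"], errors="coerce")

# Assumed strategy: fill missing values with the column median
cleaned["TotalCharges"] = cleaned["TotalCharges"].fillna(cleaned["TotalCharges"].median())

# Remove exact duplicate rows and rebuild the index
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Note that dropping the identifier first means two distinct customers with identical attributes are treated as duplicates; whether that is desirable depends on the dataset.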


## Step 3: Encode Categorical Features

We apply:
- Label encoding to binary or ordinal columns
- A 0/1 mapping to the target column `Churn`
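A minimal sketch of both encodings, assuming scikit-learn's `LabelEncoder` and a target stored as "Yes"/"No" strings. The column names `gender` and `Partner` and the "Yes"/"No" values are hypothetical, and this is not necessarily how `encode_features` works internally.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with two binary features and the target
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Partner": ["Yes", "No", "Yes"],
    "Churn": ["No", "No", "Yes"],
})

encoded = df.copy()
encoders = {}  # keep the fitted encoders so the same mapping can be reused later

for col in ["gender", "Partner"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(encoded[col])
    encoders[col] = enc

# Assumed target values: "No" becomes 0, "Yes" becomes 1
encoded["Churn"] = encoded["Churn"].map({"No": 0, "Yes": 1})
print(encoded)
```

`LabelEncoder` assigns integers in sorted order of the class labels, so an ordinal column whose natural order differs from alphabetical order would need an explicit mapping instead.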


## Step 4: Scale Numerical Features

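Numerical columns are standardized with `StandardScaler`, which rescales each feature to zero mean and unit variance so that features measured on different scales contribute comparably to scale-sensitive models.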
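The sketch below shows standardization with `StandardScaler` on hypothetical columns (`tenure`, `MonthlyCharges`). It assumes a train/test split already exists and fits the scaler on the training rows only, a common way to keep test-set statistics out of the fitted parameters; `scale_features` may be organized differently.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features for an existing train/test split
train = pd.DataFrame({
    "tenure": [1.0, 10.0, 20.0, 30.0],
    "MonthlyCharges": [20.0, 50.0, 80.0, 110.0],
})
test = pd.DataFrame({"tenure": [5.0], "MonthlyCharges": [65.0]})
num_cols = ["tenure", "MonthlyCharges"]

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn mean and standard deviation from the training rows only
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])

# Reuse the training statistics for the test rows
test_scaled[num_cols] = scaler.transform(test[num_cols])

print(train_scaled)
print(test_scaled)
```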


## Step 5: Handle Class Imbalance

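We use SMOTE (Synthetic Minority Over-sampling Technique) to balance the training dataset and improve classification of the minority class (Churn = 1). SMOTE generates synthetic minority samples by interpolating between existing minority examples and their nearest neighbors. Because it operates on the training set, oversampling should happen only after the split described in Step 6, so the test set keeps its original class distribution.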


## Step 6: Train-Test Split

We split the data 80/20 into training and test sets, stratifying on the target so that both sets preserve the original class distribution.
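An 80/20 stratified split can be sketched with scikit-learn's `train_test_split`. The arrays and `random_state` below are hypothetical, and `split_data` may wrap this call with different arguments.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20% of them in the positive class
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both resulting sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train positives:", int(y_train.sum()), "of", len(y_train))
print("test positives: ", int(y_test.sum()), "of", len(y_test))
```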


## Step 7: Save Preprocessed Data

The training and testing sets are saved to CSV files for use in modeling and evaluation.
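Saving could look roughly like the sketch below, which writes a small frame with `DataFrame.to_csv` and reads it back to confirm the round trip. The temporary directory and the file name `train.csv` are hypothetical; the actual output paths are not specified in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical preprocessed training set
train_df = pd.DataFrame({
    "tenure": [0.5, -1.25, 0.75],
    "Churn": [0, 1, 0],
})

# Hypothetical output location; a temporary directory keeps the example self-contained
out_dir = Path(tempfile.mkdtemp())
train_path = out_dir / "train.csv"

# index=False avoids writing the row index as an extra column
train_df.to_csv(train_path, index=False)

# Read the file back to confirm that shape and columns survived
reloaded = pd.read_csv(train_path)
print(reloaded)
```

The test set can be written the same way to a second file.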


✅ Data preprocessing pipeline completed successfully.
