# Data Preparation

This notebook runs the full data preparation pipeline. The first cell covers two steps:
1. **Load, Clean, and Select Features**: Imports the dataset, cleans it, and selects relevant features using the `utils` package.
2. **Stratified Split**: Splits the processed data into training and test sets while preserving the class distribution of the target variable (`Churn`).

In [1]:
import sys
import pandas as pd
# Add utils to path
sys.path.append('..')
from utils import preprocessing as prep

# ---------------------------------------------------------
# 1. Load, Clean, and Select Features
# ---------------------------------------------------------
df = prep.load_data('../data/dataset.csv')
df_clean = prep.clean_data(df)

# Use our new function to remove TotalCharges
df_ready = prep.select_features(df_clean)

print(f"Current Shape: {df_ready.shape}")

# ---------------------------------------------------------
# 2. Stratified Split
# ---------------------------------------------------------
# Use our new function to split the data
X_train, X_test, y_train, y_test = prep.split_stratified_data(df_ready, target='Churn')

# Verify the Churn Rate (Just to be sure)
print("\nVerifying Churn Rate Preservation:")
print(f"Train Churn Rate: {y_train.value_counts(normalize=True)['Yes']:.2%}")
print(f"Test Churn Rate:  {y_test.value_counts(normalize=True)['Yes']:.2%}")

Duplicate rows found: 0
Missing values in TotalCharges: 11
customerID dropped.
Feature Selection: Dropped 'TotalCharges' (Multicollinearity).
Current Shape: (7043, 19)
Data Split Successfully (Test Size: 0.2)
 - Train Shape: (5634, 18)
 - Test Shape:  (1409, 18)

Verifying Churn Rate Preservation:
Train Churn Rate: 26.54%
Test Churn Rate:  26.54%
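
The internals of `prep.split_stratified_data` are not shown in this notebook. A minimal sketch of how such a helper could work, assuming it wraps scikit-learn's `train_test_split` with `stratify`, is given below. The function name `split_stratified_sketch`, the `random_state` value, and the toy data are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_stratified_sketch(df, target, test_size=0.2, random_state=42):
    """Hypothetical stand-in for prep.split_stratified_data (assumed behavior)."""
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class proportions of y (nearly) identical in both splits
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )


# Toy frame with a 25% positive rate (made-up data, for illustration only)
toy = pd.DataFrame({
    "tenure": range(100),
    "Churn": ["Yes"] * 25 + ["No"] * 75,
})
X_tr, X_te, y_tr, y_te = split_stratified_sketch(toy, target="Churn")

train_rate = y_tr.value_counts(normalize=True)["Yes"]
test_rate = y_te.value_counts(normalize=True)["Yes"]
print(f"Train: {train_rate:.2%} | Test: {test_rate:.2%}")
```

Without `stratify`, a random split of an imbalanced target can leave the test set with a noticeably different churn rate, which makes test metrics harder to compare with training metrics.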


## 1. Target Encoding

We convert the target variable `Churn` from text labels into integers so that loss functions and evaluation metrics can be computed on it directly.
* **Function:** `encode_binary_target`
* **Logic:** Maps the positive class ('Yes') to `1` and the negative class ('No') to `0`.

In [2]:
# We use our generic function to map Yes->1 and No->0
y_train_enc = prep.encode_binary_target(y_train, pos_label='Yes', neg_label='No')
y_test_enc = prep.encode_binary_target(y_test, pos_label='Yes', neg_label='No')

# Verify the result
print("\nTarget Encoded Successfully.")
print(f"Train Class Distribution:\n{y_train_enc.value_counts()}")

Target Encoding: Mapping 'Yes' -> 1 and 'No' -> 0
Target Encoding: Mapping 'Yes' -> 1 and 'No' -> 0

Target Encoded Successfully.
Train Class Distribution:
Churn
0    4139
1    1495
Name: count, dtype: int64
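
The mapping logic described above can be sketched with `Series.map`. This is an assumption about how `prep.encode_binary_target` might be written, not its actual source; the name `encode_binary_target_sketch` and the explicit check for unexpected labels are hypothetical additions.

```python
import pandas as pd


def encode_binary_target_sketch(y, pos_label="Yes", neg_label="No"):
    """Hypothetical stand-in for prep.encode_binary_target (assumed behavior)."""
    mapping = {pos_label: 1, neg_label: 0}
    encoded = y.map(mapping)
    # Labels outside the mapping become NaN; fail loudly instead of passing them on
    if encoded.isna().any():
        unexpected = sorted(y[encoded.isna()].astype(str).unique())
        raise ValueError(f"Unexpected target labels: {unexpected}")
    return encoded.astype(int)


# Made-up labels, for illustration only
y_toy = pd.Series(["Yes", "No", "No", "Yes"], name="Churn")
y_toy_enc = encode_binary_target_sketch(y_toy)
print(y_toy_enc.tolist())
```

Raising on unmapped labels is a design choice in this sketch: a silent `NaN` in the target would otherwise surface much later as a confusing error during model fitting.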


## 2. Feature Engineering (The Pipeline)

We build a **Feature Transformer** to process the input variables (`X`). It puts every feature on a comparable numeric scale and applies identical rules to the training and test sets.

**The Strategy:**
1.  **Numerical Columns (e.g., `tenure`, `MonthlyCharges`):**
    * **Action:** Apply `StandardScaler`.
    * **Reason:** Centers each feature at 0 and rescales it to a standard deviation of 1. This prevents features with large ranges (like `MonthlyCharges`) from dominating the model weights.

2.  **Categorical Columns (e.g., `InternetService`):**
    * **Action:** Apply `OneHotEncoder`.
    * **Reason:** Converts text categories into binary columns (0s and 1s). We use `drop='first'` to avoid multicollinearity (the "dummy variable trap").

**Important:** We fit the transformer *only* on the Training data to prevent data leakage.
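
To see why one category is dropped, the toy example below compares a full one-hot encoding with a reduced one. It uses `pandas.get_dummies` rather than `OneHotEncoder` purely for brevity, and the four sample values are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series(["DSL", "Fiber optic", "No", "DSL"], name="InternetService")

full = pd.get_dummies(s, dtype=int)                      # one column per category
reduced = pd.get_dummies(s, drop_first=True, dtype=int)  # first category dropped

# With all categories kept, every row sums to 1, duplicating an intercept column
row_sums = full.sum(axis=1).tolist()

# Compare matrix rank with the number of columns once an intercept is added
intercept = np.ones((len(s), 1))
design_full = np.hstack([intercept, full.to_numpy()])
design_reduced = np.hstack([intercept, reduced.to_numpy()])
rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(f"Full:    {design_full.shape[1]} columns, rank {rank_full}")
print(f"Reduced: {design_reduced.shape[1]} columns, rank {rank_reduced}")
```

The full design matrix has more columns than its rank, so one column is a linear combination of the others. Dropping the first category removes that redundancy without losing information, because the dropped category is implied when all remaining dummies are 0.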

In [3]:
# 1. Build the Transformer (Using our new function)
# This identifies which columns are numbers vs. text automatically
feature_transformer = prep.create_feature_transformer(X_train)

# 2. Fit and Transform Train
# The transformer learns the means (for scaling) and categories (for encoding) from Train
X_train_enc = feature_transformer.fit_transform(X_train)

# 3. Transform Test
# We apply the EXACT same rules to Test (we do not re-fit!)
X_test_enc = feature_transformer.transform(X_test)

# 4. Convert to DataFrames for inspection
# We recover the new column names generated by the OneHotEncoder
feature_names = feature_transformer.get_feature_names_out()

X_train_processed = pd.DataFrame(X_train_enc, columns=feature_names, index=X_train.index)
X_test_processed = pd.DataFrame(X_test_enc, columns=feature_names, index=X_test.index)

print("\nFeature Engineering Complete.")
print(f"Original Shape: {X_train.shape}")
print(f"Processed Shape: {X_train_processed.shape}")

Transformer Configuration:
 - Scaling 3 numerical cols: ['SeniorCitizen', 'tenure', 'MonthlyCharges']
 - Encoding 15 categorical cols: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

Feature Engineering Complete.
Original Shape: (5634, 18)
Processed Shape: (5634, 29)


## 3. Saving Processed Data

We save the processed datasets and the `feature_transformer` object.
* **Datasets:** Saved as CSVs for the Modeling Notebook.
* **Transformer:** Saved as a `.pkl` file. This is critical for the final application, allowing us to process new raw customer data exactly the same way as our training data.

In [4]:
prep.save_processed_data(
    X_train_processed, 
    X_test_processed, 
    y_train_enc, 
    y_test_enc, 
    feature_transformer
)

Data saved to ../data/processed/
Transformer saved to ../data/processed/feature_transformer.pkl
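
How `prep.save_processed_data` serializes the transformer is not shown. The round trip below is a minimal sketch that assumes `joblib` is used for the `.pkl` file (the standard `pickle` module would work similarly). It uses a temporary directory and a small `StandardScaler` fitted on made-up values instead of the real transformer.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up training values, for illustration only
X_fit = pd.DataFrame({"MonthlyCharges": [20.0, 50.0, 80.0]})
scaler = StandardScaler().fit(X_fit)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_transformer.pkl"
    joblib.dump(scaler, path)      # persist the fitted object
    restored = joblib.load(path)   # reload it, e.g. inside the final application

# The restored object applies the statistics learned at fit time to new raw data
new_customer = pd.DataFrame({"MonthlyCharges": [50.0]})
original_out = scaler.transform(new_customer)
restored_out = restored.transform(new_customer)
same = bool((original_out == restored_out).all())
print(f"Identical output after reload: {same}")
```

Pickled scikit-learn objects are tied to the library version that created them, so the application loading the file should pin the same scikit-learn version used in this notebook.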
