# Correct Pipeline Setup: Preventing Data Leakage

This notebook demonstrates the **correct** way to preprocess data for machine learning to avoid **Data Leakage**.

### What is Data Leakage?
Data leakage occurs when information from outside the training dataset (like the test set) is used to create the model. It tends to make evaluation scores look better than the model's real performance on unseen data.
Common causes include:
1. **Imputing missing values** using statistics (mean/median) calculated on the entire dataset.
2. **Scaling/Normalizing** data using statistics (min/max or mean/standard deviation) calculated on the entire dataset.
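
To make the first cause concrete, here is a minimal sketch in plain Python. The values and the `impute` helper are made up for illustration and are not part of the housing dataset. It compares the fill value obtained from all rows with the one obtained from the training rows only.

```python
from statistics import median

# Toy feature with one missing value (None); values are invented for illustration.
train = [1.0, 2.0, None, 3.0]
test = [100.0, 200.0]


def impute(values, fill):
    """Replace missing entries (None) with the given fill value."""
    return [fill if v is None else v for v in values]


# Leaky: the median is computed over train AND test rows, so test values
# influence what gets written into the training set.
observed_all = [v for v in train + test if v is not None]
leaky_fill = median(observed_all)

# Correct: the median comes from the training rows only.
observed_train = [v for v in train if v is not None]
clean_fill = median(observed_train)

leaky_train = impute(train, leaky_fill)
clean_train = impute(train, clean_fill)

print("Fill value from all rows:     ", leaky_fill)
print("Fill value from training rows:", clean_fill)
```

The two fill values differ because the large test values shift the median, which is exactly the kind of information the training step should not see.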

### The Fix: Pipelines
We will use `sklearn.pipeline.Pipeline` and `sklearn.compose.ColumnTransformer` to ensure that all transformations are fit **only** on the training data and then applied, unchanged, to the test data at inference time.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

## 1. Load Raw Data
We load the original raw dataset, not the pre-processed one.

In [9]:
# Load raw data
df = pd.read_csv('data/dataset.csv')

# Separate Features and Target
target_col = 'median_house_value'
X = df.drop(columns=[target_col])
y = df[target_col]

print("Input Shape:", X.shape)
X.head()

Input Shape: (20640, 9)


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | NEAR BAY |


## 2. Train-Test Split
Critically, this must happen **before** any preprocessing step that learns from the data, so that test rows never contribute to the fitted statistics.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")

Training samples: 16512
Test samples: 4128


## 3. Define Preprocessing Pipeline
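We treat numeric and categorical columns differently. Numeric columns are imputed with the median and then standardized. Categorical columns are filled with the constant `'missing'` and then one-hot encoded. With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as all zeros instead of raising an error.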

In [11]:
# Select numerical and categorical columns
num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

print("Numeric Columns:", num_cols)
print("Categorical Columns:", cat_cols)

# Preprocessing for numerical data: Impute Median -> Standard Scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data: Impute 'missing' -> OneHotEncode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ])

Numeric Columns: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
Categorical Columns: ['ocean_proximity']
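
One way to check what such a preprocessor learns is to fit it and inspect the fitted steps. The sketch below uses two small hypothetical frames (`toy_train`, `toy_test`) with made-up columns `rooms` and `proximity` in place of the housing data. It mirrors the structure above rather than reusing the notebook's objects.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames standing in for X_train / X_test.
toy_train = pd.DataFrame({
    "rooms": [2.0, 4.0, np.nan, 6.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
toy_test = pd.DataFrame({
    "rooms": [100.0, np.nan],
    "proximity": ["ISLAND", "INLAND"],  # "ISLAND" never appears in training
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["rooms"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["proximity"]),
])

# Fit on the training frame only; the test frame is not seen here.
toy_preprocessor.fit(toy_train)

# Inspect the statistics learned by the numeric branch.
num_steps = toy_preprocessor.named_transformers_["num"].named_steps
learned_median = num_steps["imputer"].statistics_[0]  # median of 2, 4, 6
learned_mean = num_steps["scaler"].mean_[0]           # mean of 2, 4, 4, 6

# Transform the test frame with the parameters learned above.
transformed_test = toy_preprocessor.transform(toy_test)

print("Imputer median:", learned_median)
print("Scaler mean:", learned_mean)
print("Transformed test shape:", transformed_test.shape)
```

The learned median and mean both come from the training values alone. The test value of 100.0 has no effect on them, and the unseen category `"ISLAND"` is encoded as zeros in both one-hot columns.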


## 4. Build and Fit the Full Pipeline
We combine the preprocessor with the estimator (Linear Regression).

In [12]:
# Create the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', LinearRegression())])

# Fit the pipeline on Training Data ONLY
pipeline.fit(X_train, y_train)

print("Model trained successfully using Pipeline!")

Model trained successfully using Pipeline!


## 5. Evaluation
We evaluate on the test set. Note that `pipeline.predict(X_test)` automatically transforms `X_test` using the imputers, scaler, and encoder fitted on `X_train`, so no test-set statistics enter the preprocessing.

In [13]:
# Predict on Test Set
y_pred = pipeline.predict(X_test)

# Calculate Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("=== Test Set Evaluation ===")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R2 Score: {r2:.4f}")

=== Test Set Evaluation ===
RMSE: 70059.1933
MAE: 50670.4892
R2 Score: 0.6254
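
For reference, the three metrics reported above follow the standard definitions below. Here $y_i$ are the values in `y_test`, $\hat{y}_i$ the values in `y_pred`, $n$ the number of test samples, and $\bar{y}$ the mean of `y_test`. These symbols are introduced only for this summary.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE and MAE are in the same units as the target (`median_house_value`), while $R^2$ is unitless.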


## 6. Cross-Validation (Bonus)
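Pipelines make cross-validation easier and safer. Because the whole pipeline is passed to `cross_val_score`, the preprocessing is refit on the training portion of each fold, so the validation portion never influences the imputation or scaling statistics. The `neg_mean_squared_error` scorer returns negated MSE values, which is why the sign is flipped before taking the square root.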

In [14]:
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores)

print(f"Cross-Validation RMSE Scores: {cv_rmse}")
print(f"Average CV RMSE: {cv_rmse.mean():.4f}")

Cross-Validation RMSE Scores: [68721.65876664 67485.36849682 67641.76081141 67893.04481481
 71370.84352754]
Average CV RMSE: 68622.5353
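
The claim that preprocessing is refit per fold can be checked directly. The sketch below uses synthetic data and a simplified scaler-plus-regression pipeline (both are assumptions for illustration, not the notebook's data or full pipeline). It uses `cross_validate` with `return_estimator=True` to retrieve the pipeline fitted on each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data; not the housing dataset.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=50)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# A fixed random_state makes folds.split() yield the same folds on every call.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    toy_pipeline, X_toy, y_toy, cv=folds,
    scoring="neg_mean_squared_error", return_estimator=True,
)

# For every fold, compare the scaler's learned mean with the mean of
# that fold's training rows only.
fold_means_match = [
    np.allclose(fitted.named_steps["scaler"].mean_,
                X_toy[train_idx].mean(axis=0))
    for (train_idx, _), fitted in zip(folds.split(X_toy), results["estimator"])
]

fold_rmse = np.sqrt(-results["test_score"])
print("Scaler refit on each fold's training rows:", all(fold_means_match))
print("Per-fold RMSE:", fold_rmse)
```

If scaling had been applied to the full dataset before calling the cross-validation helper, every fold would have shared one set of scaler statistics that included its own validation rows.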
