# Data Preprocessing - Housing Price Prediction

This notebook demonstrates data cleaning and transformation for the Kaggle House Prices dataset as part of the Housing Price Prediction Project. It covers:
- Loading the raw dataset (`/Users/junshao/bootcamp_Jun_Shao/project/data/raw/train.csv`).
- Applying the cleaning functions from `src/cleaning.py`: filling missing values, dropping duplicates, normalizing numeric columns, and encoding categorical columns.
- Saving the preprocessed dataset to `data/processed/preprocessed_train.csv` under the project root.
- Documenting the assumptions made during cleaning.

The goal is to prepare the dataset for modeling by handling missing values, removing duplicates, scaling numeric features, and encoding categorical ones.

## Setup and Imports

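**Explanation**:
- Import libraries: `pandas` for data handling, `os` and `sys` for path handling, and `dotenv` for loading environment variables.
- Add the project root (the parent of the notebook's working directory) to `sys.path` so that `src.cleaning` can be imported.
- Load the data directory paths from environment variables and print them to verify that they are set.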

In [1]:
import pandas as pd
import os
import sys
from dotenv import load_dotenv

# Add project root to sys.path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

from src.cleaning import fill_missing, drop_duplicates, normalize_data, encode_categorical

# Load environment variables
load_dotenv()
DATA_DIR_RAW = os.getenv('DATA_DIR_RAW')
DATA_DIR_PROCESSED = os.getenv('DATA_DIR_PROCESSED')

# Verify environment variables
print(f'DATA_DIR_RAW: {DATA_DIR_RAW}')
print(f'DATA_DIR_PROCESSED: {DATA_DIR_PROCESSED}')

DATA_DIR_RAW: data/raw
DATA_DIR_PROCESSED: data/processed
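
The two variables printed above are relative directories, while the cells that follow use hard-coded absolute paths. As a minimal sketch of one alternative, the helper below builds a file path from an environment variable with a fallback directory. The function name `resolve_data_path`, its parameters, and the fallback values are hypothetical and are not part of `src/cleaning.py`.

```python
import os
from pathlib import Path


def resolve_data_path(env_var: str, default_dir: str, filename: str, root: Path) -> Path:
    """Build a file path from an environment variable (hypothetical helper)."""
    # Fall back to default_dir when the variable is unset or empty.
    data_dir = os.getenv(env_var) or default_dir
    # Joining onto root keeps relative values such as 'data/raw' anchored to the project.
    return (root / data_dir / filename).resolve()


# Assumption: the notebook runs one level below the project root.
project_root = Path.cwd().parent
raw_file = resolve_data_path("DATA_DIR_RAW", "data/raw", "train.csv", project_root)
print(raw_file)
```

With a path built this way, the notebook would not depend on one user's home directory, provided the `.env` file is present on each machine.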


## Load Raw Data

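**Explanation**:
- Load the Kaggle training dataset from a hard-coded absolute path.
- Display the shape, a preview, and the missing-value counts for five columns that contain missing values (`LotFrontage`, `Alley`, `PoolQC`, `Fence`, `MiscFeature`) to understand the data.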

In [2]:
# Load raw dataset
raw_file = '/Users/junshao/bootcamp_Jun_Shao/project/data/raw/train.csv'
df = pd.read_csv(raw_file)

# Display initial state
print('Original Data Shape:', df.shape)
print('Original Data Preview:')
print(df.head())
print('\nMissing Values Before Cleaning:')
print(df[['LotFrontage', 'Alley', 'PoolQC', 'Fence', 'MiscFeature']].isna().sum())

Original Data Shape: (1460, 81)
Original Data Preview:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  Sal

## Apply Preprocessing Functions

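**Explanation**:
- Fill missing values (median for numeric columns, the string 'None' for categorical columns).
- Drop duplicate rows.
- Normalize the numeric columns `SalePrice`, `LotArea`, `GrLivArea`, and `OverallQual` with Min-Max scaling.
- One-hot encode the categorical columns `MSSubClass`, `MSZoning`, and `Neighborhood`.
- Display the state of the data after each step.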
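
The source of `src/cleaning.py` is not shown in this notebook, so the following is only a sketch of what the first two steps might look like, based on the description above. The names `fill_missing_sketch` and `drop_duplicates_sketch` and the toy data are assumptions made for illustration; the real functions may differ.

```python
import numpy as np
import pandas as pd


def fill_missing_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Median-fill numeric columns; fill all other columns with the string 'None'."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    other_cols = out.columns.difference(numeric_cols)
    # fillna with a Series of medians matches each median to its column by name.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    out[other_cols] = out[other_cols].fillna("None")
    return out


def drop_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Remove fully identical rows and renumber the index."""
    return df.drop_duplicates().reset_index(drop=True)


# Toy data (not taken from train.csv): one missing value per column, one repeated row.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 65.0],
    "Alley": ["Pave", None, None, "Pave"],
})
clean = drop_duplicates_sketch(fill_missing_sketch(toy))
print(clean)
```

On the toy frame, the missing `LotFrontage` is replaced by the column median and the repeated row is dropped.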

In [3]:
# Fill missing values
df_clean = fill_missing(df)
print('Missing Values After Filling:')
print(df_clean[['LotFrontage', 'Alley', 'PoolQC', 'Fence', 'MiscFeature']].isna().sum())

# Drop duplicates
df_clean = drop_duplicates(df_clean)
print('\nData Shape After Dropping Duplicates:', df_clean.shape)

# Normalize numeric columns
numeric_cols = ['SalePrice', 'LotArea', 'GrLivArea', 'OverallQual']
df_clean = normalize_data(df_clean, numeric_cols)
print('\nData After Normalization (Preview):')
print(df_clean[numeric_cols].head())

# Encode categorical columns
categorical_cols = ['MSSubClass', 'MSZoning', 'Neighborhood']
df_preprocessed = encode_categorical(df_clean, categorical_cols)
print('\nData After Encoding (Shape):', df_preprocessed.shape)
print('Data After Encoding (Preview):')
print(df_preprocessed.head())

Missing Values After Filling:
LotFrontage    0
Alley          0
PoolQC         0
Fence          0
MiscFeature    0
dtype: int64

Data Shape After Dropping Duplicates: (1460, 81)

Data After Normalization (Preview):
   SalePrice   LotArea  GrLivArea  OverallQual
0   0.241078  0.033420   0.259231     0.666667
1   0.203583  0.038795   0.174830     0.555556
2   0.261908  0.046507   0.273549     0.666667
3   0.145952  0.038561   0.260550     0.666667
4   0.298709  0.060576   0.351168     0.777778

Data After Encoding (Shape): (1460, 123)
Data After Encoding (Preview):
   Id  LotFrontage   LotArea Street Alley LotShape LandContour Utilities  \
0   1         65.0  0.033420   Pave  None      Reg         Lvl    AllPub   
1   2         80.0  0.038795   Pave  None      Reg         Lvl    AllPub   
2   3         68.0  0.046507   Pave  None      IR1         Lvl    AllPub   
3   4         60.0  0.038561   Pave  None      IR1         Lvl    AllPub   
4   5         84.0  0.060576   Pave  None      IR1

## Save Preprocessed Dataset

**Explanation**:
- Save the preprocessed DataFrame to `data/processed/preprocessed_train.csv` under the project root, without the index column.

In [4]:
# Save preprocessed data
preprocessed_file = '/Users/junshao/bootcamp_Jun_Shao/project/data/processed/preprocessed_train.csv'
df_preprocessed.to_csv(preprocessed_file, index=False)
print(f'Saved preprocessed data to {preprocessed_file}')

Saved preprocessed data to /Users/junshao/bootcamp_Jun_Shao/project/data/processed/preprocessed_train.csv
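
An optional sanity check after saving is to reload the file and confirm that it still has the same shape and columns. The sketch below does this against a temporary directory with toy data; the helper name `save_and_verify` is hypothetical and is not part of the project code.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(df: pd.DataFrame, path: Path) -> bool:
    """Write df to CSV, reload it, and compare shape and column names (hypothetical helper)."""
    path = Path(path)
    # Create the target directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)


# Toy data in a temporary directory, so no real project file is touched.
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"Id": [1, 2], "SalePrice": [0.24, 0.20]})
    ok = save_and_verify(toy, Path(tmp) / "processed" / "preprocessed_train.csv")
    print(ok)
```

This check compares only shape and column names. Depending on the pandas version, `read_csv` may parse the literal string 'None' as a missing value by default, so it is worth inspecting the filled categorical columns after reloading.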


## Assumptions and Rationale

**Assumptions**:
- Missing numeric values (e.g., `LotFrontage`) are assumed to be missing at random, so imputation is reasonable; the median is used rather than the mean because these features are skewed.
- Missing categorical values are assumed to mean that the feature is absent and are filled with 'None' (e.g., no basement for `BsmtQual`), as per `data_description.txt`.
- Duplicate rows are errors and can be removed without significant loss. None were found here: the shape remained (1460, 81) after the drop.
- Min-Max normalization is appropriate for the numeric features and should improve model performance, assuming no extreme outliers remain after cleaning.
- One-hot encoding is suitable for the categorical features, assuming they are nominal and the increase in dimensionality is manageable (here, from 81 to 123 columns).

**Rationale**:
- Median imputation is robust to the skewness of numeric features (e.g., `LotArea`).
- Filling categorical gaps with 'None' matches the dataset's semantics and avoids the bias of imputing a category that was never observed.
- Min-Max normalization scales features to [0, 1], which can help scale-sensitive models converge.
- One-hot encoding converts categorical data into numeric indicator columns that machine learning models can consume, at the cost of more columns.
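
Min-Max scaling maps each value `x` in a column to `(x - min) / (max - min)`. The sketch below applies that formula per column. It is an illustration of the technique, not the actual `normalize_data` from `src/cleaning.py`; the function name `min_max_sketch` and the toy values are assumptions.

```python
import pandas as pd


def min_max_sketch(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Scale each listed column to [0, 1] using its own minimum and maximum."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].min(), out[col].max()
        # Guard against constant columns, where hi - lo would be zero.
        out[col] = 0.0 if hi == lo else (out[col] - lo) / (hi - lo)
    return out


# Toy values chosen for illustration, not read from train.csv.
toy = pd.DataFrame({"OverallQual": [1, 7, 10], "LotArea": [5000, 8450, 20000]})
scaled = min_max_sketch(toy, ["OverallQual", "LotArea"])
print(scaled)
```

For the toy `OverallQual` column, the value 7 maps to (7 - 1) / (10 - 1), or about 0.667.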

**Tradeoffs**:
- Imputation vs. deletion: imputation retains all rows but may introduce bias; deletion avoids fabricated values but loses information.
- One-hot vs. label encoding: one-hot avoids imposing an artificial order but increases dimensionality; label encoding keeps one column per feature but implies an order.
- Normalization vs. standardization: Min-Max scaling is simple and bounded to [0, 1] but sensitive to outliers.
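
The encoding tradeoff can be made concrete with a small comparison. This sketch uses `pandas.get_dummies` for one-hot encoding and categorical codes for label encoding on toy data. Whether `encode_categorical` in `src/cleaning.py` uses `get_dummies` is an assumption; the notebook shows only its output shape.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "FV", "RL"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# One-hot: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy, columns=["MSZoning"])

# Label encoding: a single integer column, which implies an order (FV < RL < RM).
label = toy.copy()
label["MSZoning"] = label["MSZoning"].astype("category").cat.codes

print(one_hot.shape, label.shape)
```

The one-hot frame gains a column for every distinct category, while the label-encoded frame keeps its original width. That is the dimensionality cost weighed above.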