## 3. Data Preparation 

### 3.1 Objectives of Data Preparation

The goal of this phase is to transform the raw dataset into a **model-ready structure** while preserving domain meaning. Based on the Data Understanding phase, particular care is taken to:

* Preserve **structural missingness** (absence of features)
* Avoid misleading imputations
* Prepare variables for regression modeling

### 3.2 Using the Data Processing Module

We'll use the structured `DataProcessor` class from `src.data_processing` to ensure consistent and reproducible data preparation.

#### 3.2.1 Initialize Data Processor and Load Data

In [1]:
import sys
sys.path.append('../src')
from data_processing import DataProcessor
from config import DATA_PATHS

# Initialize the data processor
processor = DataProcessor()

# Load raw data
df = processor.load_data()
print(f"Data shape: {df.shape}")
print("Missing values summary:")
missing_summary = processor.get_missing_value_summary(df)
for col, perc in missing_summary.items():
    print(f"{col}: {perc}%")

Data shape: (1460, 81)
Missing values summary:
PoolQC: 99.52%
MiscFeature: 96.3%
Alley: 93.77%
Fence: 80.75%
MasVnrType: 59.73%
FireplaceQu: 47.26%
LotFrontage: 17.74%
GarageQual: 5.55%
GarageFinish: 5.55%
GarageType: 5.55%
GarageYrBlt: 5.55%
GarageCond: 5.55%
BsmtFinType2: 2.6%
BsmtExposure: 2.6%
BsmtCond: 2.53%
BsmtQual: 2.53%
BsmtFinType1: 2.53%
MasVnrArea: 0.55%
Electrical: 0.07%
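
The internals of `get_missing_value_summary` are not shown in this notebook. As a rough sketch of how such a summary could be computed with plain pandas, the hypothetical helper below (the name `missing_value_summary` and the toy frame are assumptions made for illustration) returns the percentage of missing values per column, highest first, and skips complete columns:

```python
import numpy as np
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> dict:
    """Return {column: percent missing}, sorted from most to least missing."""
    # isna().mean() is the share of missing rows per column
    pct = df.isna().mean().mul(100).round(2)
    # Keep only columns that actually contain missing values
    pct = pct[pct > 0].sort_values(ascending=False)
    return pct.to_dict()


# Toy data, not the real dataset
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_value_summary(toy)
for col, perc in summary.items():
    print(f"{col}: {perc}%")
```

Returning a dictionary keeps the helper compatible with the `for col, perc in ... .items()` loop used in the cell above.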


#### 3.2.2 Handle Missing Values Using DataProcessor

The `handle_missing_values` method automatically applies domain-aware missing value treatment:

In [2]:
# Apply missing value handling
df_processed = processor.handle_missing_values(df)

# Verify no problematic missing values remain
remaining_missing = processor.get_missing_value_summary(df_processed)
print(f"Remaining missing values: {len(remaining_missing)}")
for col, perc in remaining_missing.items():
    print(f"{col}: {perc}%")

Remaining missing values: 0


#### 3.2.3 Correct Data Types

The `correct_data_types` method ensures proper variable semantics:

In [3]:
# Apply data type corrections
df_corrected = processor.correct_data_types(df_processed)

# Verify MSSubClass is now categorical
print(f"MSSubClass dtype: {df_corrected['MSSubClass'].dtype}")

MSSubClass dtype: object
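
`MSSubClass` stores dwelling-type codes, so the numbers carry no order or magnitude. A minimal sketch of the kind of cast that `correct_data_types` presumably performs is shown below; the helper name and the toy values are assumptions, not the module's actual code:

```python
import pandas as pd


def cast_codes_to_labels(df: pd.DataFrame, columns=("MSSubClass",)) -> pd.DataFrame:
    """Convert numeric code columns to string labels on a copy of the frame."""
    out = df.copy()
    for col in columns:
        # String labels stop a regression from treating the codes as quantities
        out[col] = out[col].astype(str)
    return out


# Toy data, not the real dataset
toy = pd.DataFrame({"MSSubClass": [20, 60, 120], "SalePrice": [208500, 181500, 223500]})
fixed = cast_codes_to_labels(toy)
print(fixed["MSSubClass"].tolist())
```

Working on a copy leaves the input frame unchanged, which makes the step safe to re-run in a notebook.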


### 3.3 Complete Data Preparation Pipeline

#### 3.3.1 Automated Data Preparation

The `prepare_data` method combines all steps into a single pipeline:

In [4]:
# Run the complete data preparation pipeline
df_final = processor.prepare_data()

# Validate data quality
is_valid = processor.validate_data_quality(df_final)
print(f"Data quality validation passed: {is_valid}")
print(f"Final data shape: {df_final.shape}")

Data quality validation passed!
Data quality validation passed: True
Final data shape: (1460, 81)


#### 3.3.2 Manual Step-by-Step Processing (for educational purposes)

The cell below runs the individual steps that the `prepare_data` method automates:

In [5]:
# Step 1: Load data
df_manual = processor.load_data()

# Step 2: Handle missing values
df_manual = processor.handle_missing_values(df_manual)

# Step 3: Correct data types
df_manual = processor.correct_data_types(df_manual)

# Step 4: Validate
print(f"Manual processing - Data quality: {processor.validate_data_quality(df_manual)}")

Data quality validation passed!
Manual processing - Data quality: True


### 3.4 Understanding the Missing Value Strategy

The DataProcessor applies a domain-aware missing value strategy:

#### 3.4.1 Structural Missing Values (Informative)

These represent the absence of physical features:
- Basement features → 'NoBasement'
- Garage features → 'NoGarage'
- Fireplace, pool, and fence features → 'NoFireplace', 'NoPool', 'NoFence'
- Alley access → 'NoAlley'
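
The sentinel labels listed above can be applied with a per-column fill. The sketch below is one possible way to do it, not the `DataProcessor` implementation: the `STRUCTURAL_FILL` mapping and the helper name are assumptions, and columns the text does not mention explicitly (for example `GarageYrBlt` and `MiscFeature`) are deliberately left out.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from column name to the label that encodes "feature absent"
STRUCTURAL_FILL = {
    "BsmtQual": "NoBasement",
    "BsmtCond": "NoBasement",
    "BsmtExposure": "NoBasement",
    "BsmtFinType1": "NoBasement",
    "BsmtFinType2": "NoBasement",
    "GarageType": "NoGarage",
    "GarageFinish": "NoGarage",
    "GarageQual": "NoGarage",
    "GarageCond": "NoGarage",
    "FireplaceQu": "NoFireplace",
    "PoolQC": "NoPool",
    "Fence": "NoFence",
    "Alley": "NoAlley",
}


def fill_structural_missing(df: pd.DataFrame, mapping=STRUCTURAL_FILL) -> pd.DataFrame:
    """Replace NaN with an explicit 'absent' label in the mapped columns only."""
    # Ignore mapped columns that are not present in this frame
    present = {col: label for col, label in mapping.items() if col in df.columns}
    # fillna with a dict fills each listed column with its own value
    return df.fillna(value=present)


# Toy data, not the real dataset
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan],
    "PoolQC": [np.nan, "Ex"],
    "LotFrontage": [np.nan, 70.0],
})
filled = fill_structural_missing(toy)
print(filled)
```

Columns outside the mapping, such as `LotFrontage` here, keep their missing values so that a different rule can handle them later.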

#### 3.4.2 Semi-Structural Missing Values

- Masonry veneer type → 'None' (with area = 0)

#### 3.4.3 True Missing Values

- Electrical system → mode imputation (single missing value)
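
The masonry veneer rule from 3.4.2 and the mode imputation above are both simple column fills. A minimal sketch, assuming a hypothetical helper name and toy values, could look like this:

```python
import numpy as np
import pandas as pd


def fill_masonry_and_electrical(df: pd.DataFrame) -> pd.DataFrame:
    """Apply constant fills for masonry veneer and a mode fill for Electrical."""
    out = df.copy()
    # No veneer recorded: label the type 'None' and set the area to 0
    out["MasVnrType"] = out["MasVnrType"].fillna("None")
    out["MasVnrArea"] = out["MasVnrArea"].fillna(0)
    # mode() can return several values on ties; take the first one
    most_common = out["Electrical"].mode(dropna=True).iloc[0]
    out["Electrical"] = out["Electrical"].fillna(most_common)
    return out


# Toy data, not the real dataset
toy = pd.DataFrame({
    "MasVnrType": ["BrkFace", np.nan, "Stone", np.nan],
    "MasVnrArea": [196.0, np.nan, 120.0, 0.0],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})
filled = fill_masonry_and_electrical(toy)
print(filled)
```

Mode imputation is defensible here only because a single value is missing; with many gaps it would distort the category frequencies.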

#### 3.4.4 Spatial Missing Values

- LotFrontage → neighborhood-based median imputation
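
Neighborhood-based median imputation can be expressed with a grouped transform. The sketch below assumes the grouping column is called `Neighborhood` and adds a global-median fallback for neighborhoods with no observed frontage; both the helper name and the fallback are assumptions rather than documented behavior of `DataProcessor`.

```python
import numpy as np
import pandas as pd


def impute_lot_frontage(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing LotFrontage with the median of the same neighborhood."""
    out = df.copy()
    # transform returns one median per row, aligned with the original index
    neighborhood_median = out.groupby("Neighborhood")["LotFrontage"].transform("median")
    out["LotFrontage"] = out["LotFrontage"].fillna(neighborhood_median)
    # Assumed fallback: use the overall median if a neighborhood has no values
    out["LotFrontage"] = out["LotFrontage"].fillna(out["LotFrontage"].median())
    return out


# Toy data, not the real dataset
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"],
    "LotFrontage": [60.0, 80.0, np.nan, 50.0, np.nan],
})
imputed = impute_lot_frontage(toy)
print(imputed)
```

When the same rule is later applied to a held-out set, the medians would usually be taken from the training data so that no information leaks from the evaluation rows.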

### 3.5 Data Quality Validation

The `validate_data_quality` method ensures data meets modeling requirements:

In [6]:
# The validation was already performed above
# Here is what it checks:
print("Quality checks performed:")
print("1. No problematic missing values remain")
print("2. MSSubClass is properly converted to categorical")
print("3. All structural missingness is properly encoded")

Quality checks performed:
1. No problematic missing values remain
2. MSSubClass is properly converted to categorical
3. All structural missingness is properly encoded
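
The cell above only prints a description of the checks. As an illustration of how the first two could be enforced in code, the hypothetical `validate_prepared` function below returns a list of problems instead of a single boolean. It is a sketch under that assumption, not the body of `validate_data_quality`; in this simplified form the third check is covered by the first, because an unencoded structural absence would still show up as a missing value.

```python
import numpy as np
import pandas as pd


def validate_prepared(df: pd.DataFrame, code_columns=("MSSubClass",)) -> list:
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    # Check 1: no column may still contain missing values
    with_missing = df.columns[df.isna().any()].tolist()
    if with_missing:
        problems.append(f"missing values in: {with_missing}")
    # Check 2: code columns must no longer be numeric
    for col in code_columns:
        if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col} is still numeric")
    return problems


# Toy data, not the real dataset
good = pd.DataFrame({"MSSubClass": ["20", "60"], "Alley": ["NoAlley", "Grvl"]})
bad = pd.DataFrame({"MSSubClass": [20, 60], "Alley": [np.nan, "Grvl"]})

print(validate_prepared(good))
print(validate_prepared(bad))
```

Returning the individual problems makes a failed validation easier to debug than a bare `False`.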


### 3.6 Final Data Readiness Check

In [7]:
# Final verification
final_missing = processor.get_missing_value_summary(df_final)
print(f"Final missing value count: {len(final_missing)}")
if final_missing:
    for col, perc in final_missing.items():
        print(f"{col}: {perc}%")
else:
    print("No missing values remain - data is ready for feature engineering!")

Final missing value count: 0
No missing values remain - data is ready for feature engineering!


### 3.7 Data Type Consistency

The DataProcessor ensures proper data type handling:

In [8]:
# Verify key data types
print(f"MSSubClass (should be object): {df_final['MSSubClass'].dtype}")
print(f"SalePrice (should be numeric): {df_final['SalePrice'].dtype}")

# Show categorical vs numerical counts
cat_cols = df_final.select_dtypes(include='object').columns
num_cols = df_final.select_dtypes(include=['int64', 'float64']).columns
print(f"\nCategorical columns: {len(cat_cols)}")
print(f"Numerical columns: {len(num_cols)}")

MSSubClass (should be object): object
SalePrice (should be numeric): int64

Categorical columns: 44
Numerical columns: 37


### 3.8 Outlier Considerations

Certain numerical variables are known to be right-skewed:

* `LotArea`
* `GrLivArea`
* `TotalBsmtSF`

No rows are removed at this stage. Instead:

* Outliers are flagged for later transformation (e.g., log-scaling)
* Extreme observations are retained due to real-world plausibility
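
To make the flag-then-transform idea concrete, the sketch below measures sample skewness and adds log-scaled copies of the flagged columns without dropping any rows. The helper names, the threshold of 1.0, and the `_log` suffix are assumptions chosen for illustration; `np.log1p` is used because area columns such as `TotalBsmtSF` can be zero.

```python
import numpy as np
import pandas as pd


def flag_skewed(df: pd.DataFrame, columns, threshold: float = 1.0) -> list:
    """Return the columns whose sample skewness exceeds the threshold."""
    skewness = df[list(columns)].skew()
    return skewness[skewness > threshold].index.tolist()


def add_log_columns(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Add a log1p-transformed copy of each column; original columns and rows are kept."""
    out = df.copy()
    for col in columns:
        out[f"{col}_log"] = np.log1p(out[col])
    return out


# Toy data, not the real dataset: one very large lot, evenly spread living areas
toy = pd.DataFrame({
    "LotArea": [8000, 9000, 9500, 10000, 215000],
    "GrLivArea": [1200, 1400, 1500, 1600, 1800],
})
skewed = flag_skewed(toy, ["LotArea", "GrLivArea"])
transformed = add_log_columns(toy, skewed)
print(skewed)
print(transformed.columns.tolist())
```

Keeping the raw column next to the transformed one leaves the choice between them to the feature engineering phase.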

### 3.9 Output of Data Preparation Phase

After preparation using the DataProcessor:

* All missing values are **intentionally handled** using domain knowledge
* Structural absences are **explicitly encoded**
* Data types are consistent with variable semantics
* Dataset is ready for **feature engineering and regression modeling**

The processed data has been automatically saved to: `../data/processed/train_prepared.csv`

In [9]:
# Data is already saved by the prepare_data() method
# Let's verify it was saved correctly
import pandas as pd
saved_data = pd.read_csv(DATA_PATHS['prepared_train'])
print(f"Saved data shape: {saved_data.shape}")
# Note: this compares shapes only, not cell values or dtypes
print(f"Saved data matches final data: {df_final.shape == saved_data.shape}")

Saved data shape: (1460, 81)
Saved data matches final data: True
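
A shape comparison does not show whether dtypes survive the CSV round trip. CSV files carry no type information, so a column of string codes such as `MSSubClass` is inferred as an integer again on reload unless a dtype is passed, and, depending on the pandas version, a literal label such as `None` may be parsed as missing. The sketch below demonstrates this on a small in-memory frame (the values are made up) and shows one way to reload with explicit settings:

```python
import io

import pandas as pd

# Toy stand-in for the prepared data
prepared = pd.DataFrame({
    "MSSubClass": ["20", "60"],
    "MasVnrType": ["None", "BrkFace"],
    "SalePrice": [208500, 181500],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
prepared.to_csv(buffer, index=False)

# Default reload: types are inferred from the text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Explicit reload: keep the codes as strings and do not treat labels as NaN
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"MSSubClass": str}, keep_default_na=False)

print(plain["MSSubClass"].dtype)
print(typed["MSSubClass"].tolist())
print(typed["MasVnrType"].tolist())
```

Note that `keep_default_na=False` is only appropriate here because the prepared data is expected to contain no missing values; a binary format that stores dtypes would avoid the issue altogether.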


### 3.10 Transition to Next CRISP-DM Phase

The dataset is now suitable for:

* Feature encoding (ordinal and one-hot)
* Feature scaling
* Regression model development and evaluation

#### Key Benefits of Using the DataProcessor Module

1. **Reproducibility**: Same processing can be applied to test data
2. **Consistency**: Domain-aware handling of missing values
3. **Validation**: Built-in quality checks
4. **Automation**: Complete pipeline in one method call
5. **Maintainability**: Centralized logic for easy updates