# 🏠 House Prices - Advanced Regression Techniques

**Learning Project**: Predicting house sale prices using Neural Networks  
**Kaggle Competition**: [House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

---

## 📚 What You'll Learn

This project builds on your Digit Recognizer experience but introduces **regression** instead of classification:

| Concept | Classification (Digit Recognizer) | Regression (House Prices) |
|---------|-----------------------------------|---------------------------|
| **Output** | Class label (0-9) | Continuous value (price) |
| **Output Layer** | 10 neurons (raw logits; `CrossEntropyLoss` applies softmax internally) | 1 neuron (no activation) |
| **Loss Function** | CrossEntropyLoss | MSELoss / L1Loss |
| **Metrics** | Accuracy | RMSE, MAE, R² |
| **Prediction** | `torch.max(output, 1)` | `output.squeeze()` |
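
To make the last rows of the table concrete, here is a minimal sketch on random tensors. The batch size of 4 and the 8 input features are arbitrary assumptions, not values from the dataset. It suggests that the two heads differ mainly in output width, and that a regression target should have the same shape as the model output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)  # toy batch: 4 samples, 8 features (arbitrary)

# Classification head: one logit per class; CrossEntropyLoss applies log-softmax itself.
clf_head = nn.Linear(8, 10)
logits = clf_head(x)                     # shape (4, 10)
labels = torch.tensor([3, 1, 7, 0])      # integer class labels
clf_loss = nn.CrossEntropyLoss()(logits, labels)
predicted_class = logits.argmax(dim=1)   # same indices as torch.max(logits, 1)[1]

# Regression head: a single raw output per sample, no activation.
reg_head = nn.Linear(8, 1)
output = reg_head(x)                     # shape (4, 1)
targets = torch.randn(4, 1)              # float targets, same shape as output
reg_loss = nn.MSELoss()(output, targets)
predicted_value = output.squeeze(1)      # shape (4,), one number per sample

print(logits.shape, predicted_class.shape)
print(output.shape, predicted_value.shape)
```

Note that the loss is computed before squeezing, while output and targets still share the shape `(4, 1)`.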

---

## 📋 Project Structure

This notebook is divided into **7 phases**. Each phase contains:
- 📖 **Explanation** of concepts
- 🎯 **Learning objectives** for that phase
- ✅ **TODO blocks** where you'll write code
- 💡 **Hints** to guide you (not complete solutions!)

Work through each phase step by step. Ask for help if you get stuck!

---

# Phase 1: Environment Setup ✅

## 🎯 Learning Objectives
- Import necessary libraries for data science and deep learning
- Check PyTorch installation and GPU availability
- Load the dataset from CSV files
- Understand the data structure

## 📖 Key Concepts

**Libraries we'll use:**
- `pandas` - Data manipulation and analysis
- `numpy` - Numerical operations
- `matplotlib` & `seaborn` - Data visualization
- `torch` - Neural network framework
- `sklearn` - Traditional ML algorithms and preprocessing tools

---

In [1]:
# TODO 1.1: Import Libraries
# Import the following:
# - pandas as pd
# - numpy as np
# - matplotlib.pyplot as plt
# - seaborn as sns
# - torch (PyTorch)
# - torch.nn as nn
# - torch.optim for optimizers

# HINT: Use 'import X as Y' syntax for cleaner code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.optim



In [2]:
# TODO 1.2: Configure Display Settings
# Set up nice display settings:
# - Set seaborn style to 'darkgrid'
# - Set matplotlib figure size to (12, 6) by default
# - Set pandas display options to show all columns

# HINT: Use sns.set_style(), plt.rcParams['figure.figsize'], pd.set_option()

sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)
pd.set_option('display.max_columns', None)


In [3]:
# TODO 1.3: Check PyTorch Setup
# Print the following:
# - PyTorch version
# - CUDA availability (GPU support)
# - Device being used (cuda or cpu)

# HINT: torch.__version__, torch.cuda.is_available(), torch.device()

print(f"PyTorch version: {torch.__version__}")
cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")
device = torch.device("cuda" if cuda_available else "cpu")
print(f"Device being used: {device}")


PyTorch version: 2.8.0
CUDA available: False
Device being used: cpu


In [5]:
# TODO 1.4: Load the Data
# Load the training data from '../data/train.csv' into a DataFrame called 'train_df'
# Load the test data from '../data/test.csv' into a DataFrame called 'test_df'
# Display the first 5 rows of training data
# Print the shape of both datasets

# HINT: Use pd.read_csv(), .head(), .shape

# Your code here:
path_train = "../data/train.csv"
path_test = "../data/test.csv"

train_df = pd.read_csv(path_train)
test_df = pd.read_csv(path_test)

print("Shape of the train dataset:", train_df.shape)
print()
print("Shape of the test dataset:", test_df.shape)
print()
print("5 first rows of the training data:")
print(train_df.head())



Shape of the train dataset: (1460, 81)

Shape of the test dataset: (1459, 80)

5 first rows of the training data:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities LotConfig LandSlope Neighborhood Condition1  \
0         Lvl    AllPub    Inside       Gtl      CollgCr       Norm   
1         Lvl    AllPub       FR2       Gtl      Veenker      Feedr   
2         Lvl    AllPub    Inside       Gtl      CollgCr       Norm   
3         Lvl    AllPub    Corner       Gtl      Crawfor       Norm   
4         Lvl    AllPub       FR2       Gtl      NoRidge       Norm   

  Condition2 B

In [19]:
# TODO 1.5: Basic Dataset Information
# Display:
# - Column names and data types using .info()
# - Basic statistics using .describe()
# - Number of numerical vs categorical columns

# HINT: Use .info(), .describe(), .select_dtypes()

# Your code here:

# 1. Column names and data types
print("=" * 50)
print("DATASET INFORMATION")
print("=" * 50)
train_df.info()

print("\n" + "=" * 50)
print("BASIC STATISTICS")
print("=" * 50)
print(train_df.describe())  # print explicitly: a cell only auto-displays its last expression

print("\n" + "=" * 50)
print("DATA TYPES BREAKDOWN")
print("=" * 50)
print(f"Numerical columns (float): {len(train_df.select_dtypes(float).columns)}")
print(f"Numerical columns (int): {len(train_df.select_dtypes(int).columns)}")
print(f"Categorical columns (object): {len(train_df.select_dtypes(object).columns)}")
print(f"\nTotal columns: {len(train_df.columns)}")



DATASET INFORMATION
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   i

---

## ✅ Phase 1 Checklist - COMPLETE! 🎉
Before moving to Phase 2, make sure you've:
- [x] Imported all necessary libraries
- [x] Configured display settings
- [x] Checked PyTorch installation
- [x] Loaded both train and test datasets
- [x] Examined basic dataset information

**Phase 1 Status: ✅ COMPLETE**

---

# Phase 2: Exploratory Data Analysis (EDA) 🔍

## 🎯 Learning Objectives
- Understand the distribution of the target variable (SalePrice)
- Identify missing values in the dataset
- Analyze correlations between features and target
- Visualize key relationships
- Identify important features for modeling

## 📖 Key Concepts

**Why EDA matters:**
- Understanding your data prevents modeling mistakes
- Missing values need to be handled before training
- Feature correlations help with feature selection
- Outliers can hurt model performance

**Important Note:** The competition scores submissions by the **RMSE between the logarithm of the predicted price and the logarithm of the observed price**, so we'll train on a log-transformed target!
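
As a rough illustration of why the log scale matters, the sketch below uses synthetic log-normal "prices" (an assumption; this is not the Kaggle data) to compare skewness before and after `np.log1p`. The helper `rmse_log` is a hypothetical name introduced here; it uses `log1p` to match the rest of this notebook, which for prices of this size is practically the same as a plain log.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic, right-skewed "prices" (log-normal); NOT the Kaggle data.
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)))

skew_raw = prices.skew()
skew_log = np.log1p(prices).skew()
print(f"skew raw: {skew_raw:.2f}, skew log1p: {skew_log:.2f}")


def rmse_log(y_true, y_pred):
    """RMSE between log1p of the actual and log1p of the predicted values."""
    diff = np.log1p(np.asarray(y_pred, dtype=float)) - np.log1p(np.asarray(y_true, dtype=float))
    return float(np.sqrt(np.mean(diff ** 2)))


# A 10% overestimate costs about the same for a cheap and an expensive house.
cheap = rmse_log([100_000], [110_000])
expensive = rmse_log([500_000], [550_000])
print(f"cheap: {cheap:.4f}, expensive: {expensive:.4f}")
```

On the log scale, errors are relative rather than absolute, so expensive houses do not dominate the score.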

---

In [None]:
# TODO 2.1: Analyze the Target Variable
# Create visualizations for SalePrice:
# - Histogram with KDE
# - Box plot to identify outliers
# - Calculate and print basic statistics (mean, median, std, min, max)
# - Check if the distribution is skewed (hint: .skew())

# HINT: Use plt.subplot() to create multiple plots, sns.histplot(), sns.boxplot()

# Your code here:



In [None]:
# TODO 2.2: Visualize Log-Transformed Target
# Create a histogram of log(SalePrice) - this is what we'll actually predict!
# Compare the skewness before and after log transformation
# Use np.log1p() which is log(1+x) to handle any zeros safely

# HINT: np.log1p(train_df['SalePrice']).hist()

# Your code here:



In [None]:
# TODO 2.3: Missing Values Analysis
# Calculate the percentage of missing values for each column
# Display only columns with missing values, sorted by percentage (highest first)
# Create a visualization showing missing value percentages

# HINT: Use .isnull().sum(), calculate percentages, filter where > 0, sort_values()
# For visualization: sns.barplot() works well

# Your code here:



In [None]:
# TODO 2.4: Correlation Analysis
# Calculate correlation of all NUMERICAL features with SalePrice
# Display the top 10 most positively correlated features
# Display the top 5 most negatively correlated features
# Create a heatmap of correlations for the top 10 features

# HINT: Select numerical columns using .select_dtypes(include=[np.number])
# Use .corr() to get correlation matrix, then select 'SalePrice' column
# For heatmap: sns.heatmap() with annot=True

# Your code here:



In [None]:
# TODO 2.5: Visualize Key Relationships
# Create scatter plots for the top 3 most correlated features vs SalePrice
# Add trend lines to see the relationships clearly
# Identify any outliers that might need handling

# HINT: Use plt.subplot() to create 1x3 grid
# sns.regplot() shows scatter + trend line

# Your code here:



In [None]:
# TODO 2.6: Categorical Features Analysis
# Identify all categorical features
# For 2-3 interesting categorical features, create box plots showing SalePrice distribution by category
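# Examples: Neighborhood, HouseStyle, OverallQual
# (OverallQual is stored as an integer, so select_dtypes(include=['object']) won't list it,
#  but it is an ordinal rating and works well in a box plot)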

# HINT: Use .select_dtypes(include=['object']) for categorical features
# sns.boxplot() with x=categorical, y='SalePrice'

# Your code here:



---

## ✅ Phase 2 Checklist
Before moving to Phase 3, make sure you've:
- [ ] Analyzed SalePrice distribution and skewness
- [ ] Identified all columns with missing values
- [ ] Found the most correlated features with SalePrice
- [ ] Created visualizations for key relationships
- [ ] Identified potential outliers
- [ ] Analyzed categorical features

**Key Insights to Note:**
- Which features have the highest correlation with SalePrice?
- Which features have the most missing values?
- Is the target variable skewed? (It should be - consider log transformation!)

---

# Phase 3: Data Preprocessing 🔄

## 🎯 Learning Objectives
- Handle missing values with appropriate imputation strategies
- Separate numerical and categorical features
- Encode categorical variables for ML models
- Handle outliers
- Engineer new features from existing ones

## 📖 Key Concepts

**Missing Value Strategies:**
- Numerical: Mean, median, or specific value (e.g., 0 for missing garage size)
- Categorical: Mode or 'None' category
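- Drop if >50% missing (be careful: in this dataset a missing value often means the feature is absent, e.g. no pool or no alley, which is information rather than noise)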
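
The strategies above can be sketched on a toy example. The values below are hypothetical; only the column names follow the dataset. The fill statistics (median, mode) are computed on the training frame and reused for the test frame, which is one reasonable way to keep test information out of preprocessing.

```python
import numpy as np
import pandas as pd

# Toy frames with hypothetical values; column names follow the dataset.
train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "GarageArea": [400.0, 0.0, 500.0, 300.0],
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})
test = pd.DataFrame({
    "LotFrontage": [np.nan, 90.0],
    "GarageArea": [np.nan, 450.0],
    "PoolQC": [np.nan, "Ex"],
    "Electrical": ["SBrkr", np.nan],
})

# Statistics are computed on train only, then reused for test.
lot_median = train["LotFrontage"].median()
electrical_mode = train["Electrical"].mode()[0]

for df in (train, test):
    df["LotFrontage"] = df["LotFrontage"].fillna(lot_median)     # numerical: median
    df["GarageArea"] = df["GarageArea"].fillna(0)                # missing means "no garage"
    df["PoolQC"] = df["PoolQC"].fillna("None")                   # missing means "no pool"
    df["Electrical"] = df["Electrical"].fillna(electrical_mode)  # categorical: mode

print(test)
print("missing left:", int(train.isnull().sum().sum() + test.isnull().sum().sum()))
```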

**Feature Engineering:**
Creating new features can improve model performance significantly!
- TotalSF = 1stFlrSF + 2ndFlrSF + TotalBsmtSF
- TotalBath = FullBath + 0.5*HalfBath + BsmtFullBath + 0.5*BsmtHalfBath
- HouseAge = YrSold - YearBuilt
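
Here is a small sketch of these formulas on two hypothetical houses (the numbers are illustrative; only the column names come from the dataset). `TotalBath` follows the fuller definition from TODO 3.5, which also counts basement bathrooms.

```python
import pandas as pd

# Two hypothetical houses; column names follow the dataset.
houses = pd.DataFrame({
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
    "TotalBsmtSF": [856, 1262],
    "FullBath": [2, 2],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
    "YrSold": [2008, 2007],
    "YearBuilt": [2003, 1976],
})

# Each engineered feature is plain column arithmetic.
houses["TotalSF"] = houses["1stFlrSF"] + houses["2ndFlrSF"] + houses["TotalBsmtSF"]
houses["TotalBath"] = (
    houses["FullBath"] + 0.5 * houses["HalfBath"]
    + houses["BsmtFullBath"] + 0.5 * houses["BsmtHalfBath"]
)
houses["HouseAge"] = houses["YrSold"] - houses["YearBuilt"]

print(houses[["TotalSF", "TotalBath", "HouseAge"]])
```

The same assignments would be applied to both `train` and `test` so the two frames keep identical columns.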

---

In [None]:
# TODO 3.1: Create a Copy for Processing
# Create copies of train and test dataframes to preserve originals
# We'll call them 'train' and 'test'

# HINT: Use .copy() to avoid modifying original data

# Your code here:



In [None]:
# TODO 3.2: Save the Target Variable
# Extract the target variable (SalePrice) from training data
# Apply log transformation: y = np.log1p(SalePrice)
# Drop SalePrice from the training dataframe
# Store test IDs for later submission

# HINT: y_train = np.log1p(train['SalePrice'])
# test_ids = test['Id']
# Use .drop() to remove columns

# Your code here:



In [None]:
# TODO 3.3: Handle Missing Values - Numerical Features
# For numerical columns with missing values:
# - LotFrontage: Fill with median (computed on train, reused for test)
# - GarageYrBlt: Fill with YearBuilt (missing means no garage; YearBuilt is a neutral placeholder on the same scale)
# - Garage features (GarageCars, GarageArea): Fill with 0 (no garage)
# - Basement features (BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF): Fill with 0 (no basement)
# - MasVnrArea: Fill with 0 (no masonry veneer)

# HINT: Use .fillna() method
# Apply the same transformations to both train and test, using statistics computed on train!

# Your code here:



In [None]:
# TODO 3.4: Handle Missing Values - Categorical Features
# For categorical columns with missing values:
# - For features where missing means 'None' (e.g., PoolQC, Fence, Alley), fill with 'None'
# - For features where missing is random (e.g., Electrical), fill with mode (most common value)
# - Check data description to understand what missing means for each feature!

# HINT: Use .fillna('None') or .fillna(train[column].mode()[0])
# Features that likely mean 'None': PoolQC, MiscFeature, Alley, Fence, FireplaceQu,
#                                   GarageType, GarageFinish, GarageQual, GarageCond,
#                                   BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2,
#                                   MasVnrType

# Your code here:



In [None]:
# TODO 3.5: Feature Engineering
# Create new features that might be useful:
# - TotalSF: Total square footage (1stFlrSF + 2ndFlrSF + TotalBsmtSF)
# - TotalBath: Total bathrooms (FullBath + 0.5*HalfBath + BsmtFullBath + 0.5*BsmtHalfBath)
# - HouseAge: Age of house (YrSold - YearBuilt)
# - RemodAge: Years since remodeling (YrSold - YearRemodAdd)
# - TotalPorchSF: Total porch area (OpenPorchSF + EnclosedPorch + 3SsnPorch + ScreenPorch)

# HINT: Simply add columns with arithmetic operations
# train['TotalSF'] = train['1stFlrSF'] + train['2ndFlrSF'] + train['TotalBsmtSF']

# Your code here:



In [None]:
# TODO 3.6: Handle Outliers
# Based on EDA, remove extreme outliers that don't make sense
# Common outliers in this dataset:
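# - Houses with GrLivArea > 4000 but SalePrice < 300000 (huge house, low price - error?)
#   Note: SalePrice was dropped in TODO 3.2 and y_train is on the log scale,
#   so compare y_train against np.log1p(300000)
# - You can visualize: plt.scatter(train['GrLivArea'], y_train)

# HINT: Use boolean indexing to filter
# Remember to also filter y_train to match!
# Filter train only, never test (every test row needs a prediction)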

# Your code here:



In [None]:
# TODO 3.7: Encode Categorical Variables
# Use one-hot encoding (pd.get_dummies) to convert categorical variables to numbers
# Apply to both train and test datasets
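# Use drop_first=True to avoid multicollinearity
# (caveat: when train and test are encoded separately, the dropped baseline category
#  can differ if a column's first category is missing from one of them)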

# HINT: train = pd.get_dummies(train, drop_first=True)
# Make sure to apply to both train and test!
# After encoding, train and test might have different columns - we'll handle this in next TODO

# Your code here:



In [None]:
# TODO 3.8: Align Train and Test Columns
# After one-hot encoding, train and test might have different columns
# Align them to have the same columns (use .align() method)
# Fill any missing values with 0

# HINT: train, test = train.align(test, join='left', axis=1, fill_value=0)

# Your code here:



In [None]:
# TODO 3.9: Final Verification
# Print the shapes of train and test
# Check for any remaining missing values
# Print the number of features after preprocessing

# Your code here:



---

## ✅ Phase 3 Checklist
Before moving to Phase 4, make sure you've:
- [ ] Handled all missing values (both numerical and categorical)
- [ ] Created new engineered features
- [ ] Removed outliers
- [ ] One-hot encoded all categorical variables
- [ ] Aligned train and test columns
- [ ] Verified no missing values remain

---

# Phase 4: Feature Scaling & Selection ⚖️

## 🎯 Learning Objectives
- Understand why feature scaling is critical for neural networks
- Apply StandardScaler to normalize features
- Split data into training and validation sets
- Convert data to PyTorch tensors

## 📖 Key Concepts

**Why Scale Features?**
- Neural networks train faster with scaled features
- Features with large values can dominate the learning process
- Standardization: (x - mean) / std → mean=0, std=1

**Important:** 
- Fit scaler on training data only!
- Transform both train and validation using the same scaler
- Save the scaler for test predictions
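
The "fit on training data only" rule can be checked with a short sketch on synthetic features. The two columns and their scales are assumptions chosen to mimic a square-footage feature next to a room count.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (e.g. square feet vs. room counts).
X = np.column_stack([rng.normal(10_000, 3_000, 200), rng.normal(3, 1, 200)])
y = rng.normal(12, 0.4, 200)

X_train, X_val, y_train_split, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_val_scaled = scaler.transform(X_val)          # reuses the training mean/std

print("train mean:", X_train_scaled.mean(axis=0).round(3))
print("train std: ", X_train_scaled.std(axis=0).round(3))
print("val mean:  ", X_val_scaled.mean(axis=0).round(3))  # close to 0, but not exactly 0
```

The validation mean being slightly off zero is expected: the validation set is transformed with statistics it did not contribute to.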

---

In [None]:
# TODO 4.1: Import Scaling and Splitting Tools
# Import:
# - train_test_split from sklearn.model_selection
# - StandardScaler from sklearn.preprocessing

# Your code here:



In [None]:
# TODO 4.2: Train-Validation Split
# Split your training data into train and validation sets
# Use 80-20 split (test_size=0.2)
# Set random_state=42 for reproducibility
# Variables: X_train, X_val, y_train_split, y_val

# HINT: from sklearn.model_selection import train_test_split
# X_train, X_val, y_train_split, y_val = train_test_split(train, y_train, test_size=0.2, random_state=42)

# Your code here:



In [None]:
# TODO 4.3: Feature Scaling
# Create a StandardScaler instance
# Fit it on X_train only (never on validation or test!)
# Transform X_train, X_val, and test using the fitted scaler
# Store results in X_train_scaled, X_val_scaled, test_scaled

# HINT: 
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_val_scaled = scaler.transform(X_val)

# Your code here:



In [None]:
# TODO 4.4: Convert to PyTorch Tensors
# Convert all arrays to PyTorch tensors with dtype=torch.float32
# Variables needed:
# - X_train_tensor, X_val_tensor, test_tensor (features)
# - y_train_tensor, y_val_tensor (targets)

# HINT: torch.tensor(array, dtype=torch.float32)
# For y, convert the pandas Series to NumPy first, then reshape to (n, 1): y.to_numpy().reshape(-1, 1)

# Your code here:



In [None]:
# TODO 4.5: Verify Tensor Shapes
# Print the shapes of all tensors
# Print the number of features (this is your input size for the network!)
# Verify data types are float32

# Your code here:



---

## ✅ Phase 4 Checklist
Before moving to Phase 5, make sure you've:
- [ ] Split data into train and validation sets (80-20)
- [ ] Scaled features using StandardScaler
- [ ] Converted all data to PyTorch tensors (float32)
- [ ] Verified tensor shapes are correct
- [ ] Noted the number of input features for your network

---

# Phase 5: Neural Network for Regression 🏗️

## 🎯 Learning Objectives
- Design a neural network architecture for regression
- Understand the difference between classification and regression networks
- Implement dropout for regularization
- Choose appropriate loss function and optimizer

## 📖 Key Concepts

**Regression vs Classification Network:**

```python
# Classification (10 classes):
self.output = nn.Linear(64, 10)  # 10 neurons
# No activation - CrossEntropyLoss includes softmax

# Regression (continuous value):
self.output = nn.Linear(64, 1)   # 1 neuron
# No activation - we want raw continuous output
```

**Loss Functions for Regression:**
- MSELoss (L2): Penalizes large errors heavily
- L1Loss (MAE): More robust to outliers
- HuberLoss: Combination of both (quadratic for small errors, linear for large ones)
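
To see how differently these losses react to one bad prediction, here is a minimal sketch with hypothetical values: three small errors and one large one. `delta=1.0` is PyTorch's default threshold for `nn.HuberLoss` and is written out only for clarity.

```python
import torch
import torch.nn as nn

# Hypothetical predictions: three small errors and one large one (an "outlier").
targets = torch.tensor([12.0, 12.0, 12.0, 12.0])
preds = torch.tensor([12.1, 11.9, 12.1, 15.0])

mse = nn.MSELoss()(preds, targets)                 # squares every error
mae = nn.L1Loss()(preds, targets)                  # absolute value of every error
huber = nn.HuberLoss(delta=1.0)(preds, targets)    # quadratic below delta, linear above

print(f"MSE:   {mse.item():.4f}")
print(f"MAE:   {mae.item():.4f}")
print(f"Huber: {huber.item():.4f}")
```

The single error of 3.0 contributes 9.0 to the MSE sum but only 3.0 to MAE and 2.5 to Huber, which is why MSE is the most outlier-sensitive of the three.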

**Suggested Architecture:**
```
Input (n features) → 256 → ReLU → Dropout(0.2)
                   → 128 → ReLU → Dropout(0.2)
                   → 64  → ReLU
                   → 1   (output)
```
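
For a quick shape check of the suggested architecture, the sketch below wires the same layers with `nn.Sequential`. TODO 5.1 asks for an `nn.Module` subclass instead, so treat this as a way to sanity-check sizes rather than as the intended solution. `n_features = 20` is a placeholder; the real value depends on your preprocessing.

```python
import torch
import torch.nn as nn

n_features = 20  # placeholder; in the notebook this comes from X_train_tensor.shape[1]

model = nn.Sequential(
    nn.Linear(n_features, 256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),  # single raw output, no activation
)

# Count trainable parameters (weights and biases of the four Linear layers).
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_params}")

model.eval()  # disables dropout so the forward pass is deterministic
with torch.no_grad():
    out = model(torch.randn(5, n_features))
print(out.shape)  # one prediction per input row
```

Dropout layers have no parameters, so the count comes entirely from the `nn.Linear` layers.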

---

In [None]:
# TODO 5.1: Define the Neural Network Class
# Create a class called HousePricePredictor that inherits from nn.Module
# Architecture:
# - Input layer: takes n_features as input
# - Hidden layer 1: 256 neurons + ReLU + Dropout(0.2)
# - Hidden layer 2: 128 neurons + ReLU + Dropout(0.2)
# - Hidden layer 3: 64 neurons + ReLU
# - Output layer: 1 neuron (NO activation function!)

# HINT: Similar to Digit Recognizer but:
# - Output layer has 1 neuron instead of 10
# - No activation on output layer
# - Add dropout layers: nn.Dropout(0.2)

# Your code here:



In [None]:
# TODO 5.2: Initialize the Model
# Create an instance of your HousePricePredictor
# Pass the correct number of input features (from your tensors)
# Move model to the appropriate device (GPU if available)
# Print the model architecture

# HINT: 
# n_features = X_train_tensor.shape[1]
# model = HousePricePredictor(n_features).to(device)

# Your code here:



In [None]:
# TODO 5.3: Define Loss Function and Optimizer
# Loss function: Use MSELoss (Mean Squared Error) for regression
# Optimizer: Use Adam with a learning rate of 0.001 (the argument is named lr)

# HINT:
# criterion = nn.MSELoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Your code here:



In [None]:
# TODO 5.4: Count Parameters
# Calculate and print the total number of trainable parameters in your model
# This helps you understand model complexity

# HINT: sum(p.numel() for p in model.parameters() if p.requires_grad)

# Your code here:



---

## ✅ Phase 5 Checklist
Before moving to Phase 6, make sure you've:
- [ ] Defined HousePricePredictor class with correct architecture
- [ ] Output layer has 1 neuron (for regression)
- [ ] Added dropout layers for regularization
- [ ] Initialized model with correct input size
- [ ] Defined MSELoss criterion
- [ ] Initialized Adam optimizer
- [ ] Counted model parameters

---

# Phase 6: Training Pipeline 🚂

## 🎯 Learning Objectives
- Create DataLoaders for efficient batch training
- Implement training loop with validation
- Track regression metrics (MSE, RMSE, MAE, R²)
- Visualize training progress
- Save the best model

## 📖 Key Concepts

**Regression Metrics:**
- **MSE** (Mean Squared Error): Average of squared errors
- **RMSE** (Root MSE): Square root of MSE - same units as target
- **MAE** (Mean Absolute Error): Average of absolute errors
- **R²** (R-squared): How much variance is explained (1.0 is perfect)
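
The four metrics can be computed by hand in a few lines of NumPy. The targets and predictions below are hypothetical log-scale values; the `sklearn.metrics` functions imported in TODO 6.1 should give the same numbers.

```python
import numpy as np

# Hypothetical log-scale targets and predictions.
y_true = np.array([12.0, 11.5, 12.5, 13.0])
y_pred = np.array([12.1, 11.4, 12.3, 13.2])

errors = y_pred - y_true
mse = np.mean(errors ** 2)                       # average squared error
rmse = np.sqrt(mse)                              # back in the units of the target
mae = np.mean(np.abs(errors))                    # average absolute error
ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

Because the target in this project is log-transformed, RMSE and MAE here are in log units, not dollars.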

**Training Process:**
1. Forward pass: Get predictions
2. Calculate loss
3. Backward pass: Calculate gradients
4. Update weights
5. Validate and track metrics
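
The five steps above can be sketched on a synthetic problem. Everything here is an assumption made for illustration: the data is random, the tiny network is not the suggested architecture, and the best weights are kept in memory instead of being written to disk.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression problem: y is a noisy linear function of 5 features.
X = torch.randn(256, 5)
true_w = torch.tensor([[1.0], [-2.0], [0.5], [0.0], [3.0]])
y = X @ true_w + 0.1 * torch.randn(256, 1)

X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

val_rmse_history = []
best_rmse, best_state = float("inf"), None
for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()            # clear gradients from the previous step
        loss = criterion(model(xb), yb)  # steps 1-2: forward pass and loss
        loss.backward()                  # step 3: backward pass
        optimizer.step()                 # step 4: update weights

    model.eval()                         # step 5: validate without gradient tracking
    with torch.no_grad():
        val_rmse = torch.sqrt(criterion(model(X_val), y_val)).item()
    val_rmse_history.append(val_rmse)
    if val_rmse < best_rmse:             # keep a copy of the best weights so far
        best_rmse = val_rmse
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

print(f"first epoch RMSE: {val_rmse_history[0]:.3f}, best RMSE: {best_rmse:.3f}")
```

Outputs and targets both have shape `(batch, 1)` here, so the loss compares them element by element without any reshaping.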

---

In [None]:
# TODO 6.1: Import Additional Tools
# Import:
# - TensorDataset, DataLoader from torch.utils.data
# - mean_squared_error, mean_absolute_error, r2_score from sklearn.metrics

# Your code here:



In [None]:
# TODO 6.2: Create DataLoaders
# Create TensorDatasets for train and validation
# Create DataLoaders with batch_size=32
# Shuffle training data, don't shuffle validation

# HINT:
# train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
# train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Your code here:



In [None]:
# TODO 6.3: Implement Training Loop
# Create a training loop that:
# - Trains for a specified number of epochs (start with 100)
# - For each epoch:
#   * Train on batches
#   * Calculate training loss
#   * Validate on validation set
#   * Calculate validation metrics (MSE, RMSE, MAE, R²)
#   * Track losses and metrics
#   * Print progress every 10 epochs
#   * Save best model based on validation RMSE

# HINT: Similar to Digit Recognizer but:
# - Use MSE loss instead of CrossEntropyLoss
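# - No torch.max() for predictions - the model output is already the prediction
# - Targets were reshaped to (n, 1) in TODO 4.4, so give the loss outputs and targets of the
#   same shape; squeeze/flatten both only when passing them to sklearn metrics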
# - Calculate RMSE = sqrt(MSE)
# - For R²: use sklearn.metrics.r2_score(y_true, y_pred)

# Structure:
# 1. Initialize lists to track metrics
# 2. Set number of epochs
# 3. For each epoch:
#    a. Training phase (model.train())
#    b. Validation phase (model.eval())
#    c. Calculate and store metrics
#    d. Save best model

# Your code here:



In [None]:
# TODO 6.4: Visualize Training Progress
# Create plots showing:
# - Training and validation loss over epochs
# - Validation RMSE over epochs
# - Validation R² over epochs

# HINT: Use plt.subplot() to create multiple plots
# Plot both train and val losses on same plot for comparison

# Your code here:



In [None]:
# TODO 6.5: Evaluate Best Model
# Load the best saved model
# Evaluate on validation set
# Print final metrics:
# - Validation RMSE
# - Validation MAE
# - Validation R²

# HINT: model.load_state_dict(torch.load('best_model.pth'))

# Your code here:



In [None]:
# TODO 6.6: Prediction vs Actual Plot
# Create a scatter plot of predicted vs actual values on validation set
# Add a diagonal line showing perfect predictions
# This visualizes how well the model performs

# HINT:
# plt.scatter(y_val_actual, y_val_pred, alpha=0.5)
# plt.plot([min, max], [min, max], 'r--')  # diagonal line

# Your code here:



---

## ✅ Phase 6 Checklist
Before moving to Phase 7, make sure you've:
- [ ] Created DataLoaders for batching
- [ ] Implemented complete training loop
- [ ] Tracked training and validation losses
- [ ] Calculated regression metrics (RMSE, MAE, R²)
- [ ] Saved the best model
- [ ] Visualized training progress
- [ ] Created prediction vs actual plot

**Target Performance** (RMSE is measured on the log-transformed target, not in dollars):
- Validation RMSE < 0.15 (good)
- Validation RMSE < 0.13 (great!)
- R² > 0.85 (good fit)

---

# Phase 7: Evaluation & Submission 📊

## 🎯 Learning Objectives
- Generate predictions on test data
- Create Kaggle submission file
- Validate submission format
- Document and save model
- (Optional) Compare with traditional ML models

## 📖 Key Concepts

**Important Steps:**
1. Load best model
2. Predict on test set (already scaled)
3. **Reverse log transformation** (critical!)
4. Create submission.csv
5. Validate format

**Remember:** The model predicts `log1p(SalePrice)`, so we need to reverse it:
```python
predictions = np.expm1(log_predictions)  # exp(x) - 1
```
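
A small end-to-end sketch of the last steps, assuming three hypothetical log-scale predictions and illustrative Ids (neither comes from a trained model). It reverses the transformation and applies the kind of format checks TODO 7.4 asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical log-scale model outputs for three test rows; the Ids are illustrative.
log_predictions = np.array([11.8, 12.2, 12.6])
test_ids = pd.Series([1461, 1462, 1463], name="Id")

predictions_price = np.expm1(log_predictions)  # inverse of np.log1p

submission = pd.DataFrame({"Id": test_ids, "SalePrice": predictions_price})

# The same checks TODO 7.4 asks for, applied to the toy frame.
assert list(submission.columns) == ["Id", "SalePrice"]
assert submission["SalePrice"].notnull().all()
assert (submission["SalePrice"] > 0).all()
# Round trip: log1p(expm1(x)) recovers x up to floating-point error.
assert np.allclose(np.log1p(predictions_price), log_predictions)

print(submission)
```

Forgetting the `np.expm1` step would submit values around 12 instead of six-figure prices, which the positivity check alone would not catch, so inspecting the first rows is still worthwhile.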

---

In [None]:
# TODO 7.1: Load Best Model and Generate Test Predictions
# Load your best saved model
# Set model to eval mode
# Generate predictions on test set
# Remember to move test tensor to the same device as model

# HINT:
# model.load_state_dict(torch.load('best_model.pth'))
# model.eval()
# with torch.no_grad():
#     predictions = model(test_tensor.to(device))

# Your code here:



In [None]:
# TODO 7.2: Reverse Log Transformation
# Convert predictions from log scale back to actual prices
# Use np.expm1() which is the inverse of np.log1p()
# Convert to numpy array and flatten if needed

# HINT:
# predictions_log = predictions.cpu().numpy().flatten()
# predictions_price = np.expm1(predictions_log)

# Your code here:



In [None]:
# TODO 7.3: Create Submission File
# Create a DataFrame with columns: ['Id', 'SalePrice']
# Id should be from test_ids you saved earlier
# SalePrice should be your predictions (in original scale!)
# Save to '../submission.csv'

# HINT:
# submission = pd.DataFrame({
#     'Id': test_ids,
#     'SalePrice': predictions_price
# })
# submission.to_csv('../submission.csv', index=False)

# Your code here:



In [None]:
# TODO 7.4: Validate Submission Format
# Load the submission file and check:
# - Columns are ['Id', 'SalePrice']
# - Shape is (1459, 2)
# - No missing values
# - All prices are positive
# - Display first few rows

# Your code here:



In [None]:
# TODO 7.5: Save Model and Metadata
# Save:
# - Model state dict to '../trained_models/house_price_model.pth'
# - Model metadata (architecture, performance, date) to '../trained_models/model_metadata.json'
# - Scaler object using joblib to '../trained_models/scaler.pkl'

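# HINT:
# import os, json, joblib
# os.makedirs('../trained_models', exist_ok=True)  # torch.save does not create missing folders
# torch.save(model.state_dict(), '../trained_models/house_price_model.pth')
# joblib.dump(scaler, '../trained_models/scaler.pkl')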

# Your code here:



---

## 🎉 Optional: Compare with Traditional ML Models

Want to go further? Compare your neural network with traditional models!

---

In [None]:
# OPTIONAL TODO 7.6: Linear Regression Baseline
# Train a simple Linear Regression model for comparison
# Evaluate on validation set
# Compare RMSE with your neural network

# HINT:
# from sklearn.linear_model import LinearRegression
# lr = LinearRegression()
# lr.fit(X_train_scaled, y_train_split)
# predictions = lr.predict(X_val_scaled)

# Your code here:



In [None]:
# OPTIONAL TODO 7.7: Random Forest Model
# Train a Random Forest regressor
# Evaluate and compare with neural network

# HINT:
# from sklearn.ensemble import RandomForestRegressor
# rf = RandomForestRegressor(n_estimators=100, random_state=42)
# rf.fit(X_train_scaled, y_train_split)  # y_train_split is already 1-D, so no reshaping is needed

# Your code here:



In [None]:
# OPTIONAL TODO 7.8: Model Comparison Table
# Create a comparison table showing:
# Model | RMSE | MAE | R² | Training Time
# For: Neural Network, Linear Regression, Random Forest

# Your code here:



---

## ✅ Phase 7 Checklist
Before finalizing, make sure you've:
- [ ] Generated predictions on test set
- [ ] Reversed log transformation
- [ ] Created submission.csv
- [ ] Validated submission format
- [ ] Saved model and metadata
- [ ] (Optional) Compared with traditional ML models

---

## 🚀 Next Steps

1. **Submit to Kaggle:**
   - Go to [competition page](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)
   - Click "Submit Predictions"
   - Upload your `submission.csv`

2. **Update README.md:**
   - Add your final results
   - Include visualizations
   - Document your learning journey

3. **Create MODEL_CARD.md:**
   - Document architecture
   - Performance metrics
   - Training details

4. **Git Commit:**
   - Commit your completed notebook
   - Push to GitHub

---

## 🎓 What You've Learned

Congratulations! Through this project, you've learned:

✅ **Regression with Neural Networks**
- Output layer design for continuous predictions
- Appropriate loss functions (MSELoss)
- Regression metrics (RMSE, MAE, R²)

✅ **Feature Engineering**
- Creating new features from existing ones
- Feature scaling and normalization
- Handling mixed data types

✅ **Data Preprocessing**
- Missing value imputation strategies
- One-hot encoding categorical variables
- Outlier detection and handling

✅ **Model Evaluation**
- Proper train/validation split
- Tracking multiple metrics
- Comparing different model types

✅ **Production Skills**
- Saving models and scalers for deployment
- Creating submission files for competitions
- Documenting models with metadata

---

## 💪 Keep Learning!

Ready for more challenges?
- Try ensemble methods (combining multiple models)
- Experiment with feature selection techniques
- Learn about XGBoost and LightGBM
- Explore AutoML tools

Great job! 🎉