# House Prices Modeling - Advanced Regression Techniques

This notebook documents our machine learning modeling approach for the Kaggle House Prices competition. We start with a simple baseline model and progressively improve it.

## Table of Contents
1. [Baseline Model](#baseline)
2. [Model Performance Analysis](#performance)
3. [Feature Importance](#features)
4. [Model Comparison](#comparison)
5. [Final Submission](#submission)

---

## 1. Baseline Model Approach {#baseline}

### Strategy
Our baseline model follows a deliberately simple approach:

1. **Feature Selection**: Use only numerical features to avoid complex categorical encoding
2. **Missing Value Handling**: Fill missing values with the training-set median (robust to outliers), applied to both train and test data
3. **Model Choice**: Linear Regression for interpretability and speed
4. **No Feature Engineering**: Keep it simple for baseline comparison
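
The four steps above can also be expressed as a single scikit-learn pipeline. The following is a minimal sketch on synthetic data, not the code this notebook runs: `demo_X`, `demo_y`, and the two column values are made up for illustration. Bundling the imputer with the model means each cross-validation fold computes its medians from its own training split, whereas the cells below compute them once on the full training set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for numerical columns of train.csv (illustrative only)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, size=200),
    "LotFrontage": rng.normal(70, 20, size=200),
})
demo_y = 100 * demo_X["GrLivArea"] + 500 * demo_X["LotFrontage"] + rng.normal(0, 5000, size=200)

# Blank out 10% of one column so the imputer has something to fill
missing_rows = demo_X.sample(frac=0.1, random_state=0).index
demo_X.loc[missing_rows, "LotFrontage"] = np.nan

# Median imputation followed by ordinary least squares, as one estimator
baseline = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())

# 5-fold CV; the scorer returns negative RMSE, so flip the sign
demo_cv_rmse = -cross_val_score(baseline, demo_X, demo_y, cv=5,
                                scoring="neg_root_mean_squared_error")
print(f"CV RMSE per fold: {np.round(demo_cv_rmse, 0)}")

# Fit on all rows and predict; NaNs are imputed inside the pipeline
baseline.fit(demo_X, demo_y)
demo_pred = baseline.predict(demo_X)
```

The same `baseline` object could be passed wherever a plain estimator is expected, which keeps the imputation step from being forgotten at prediction time.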

### Why This Approach?
- **Simplicity**: Easy to implement and understand
- **Fast**: Quick to train and evaluate
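- **Interpretable**: Each coefficient is the estimated change in price per unit of its feature; magnitudes are only comparable across features measured on the same scale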
- **Baseline**: Establishes minimum performance threshold
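
One caveat on interpretability: raw coefficients depend on each feature's units, so ranking features by coefficient size can mislead. The sketch below uses made-up data (`area_sqft`, `quality`, and `price` are hypothetical, not columns from this dataset) to show the ranking flipping once features are standardized.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
area_sqft = rng.normal(1500, 500, size=n)             # large numeric scale
quality = rng.integers(1, 11, size=n).astype(float)   # small scale, values 1..10

# Hypothetical price: both features matter, in very different units
price = 100 * area_sqft + 10_000 * quality + rng.normal(0, 10_000, size=n)

X_raw = np.column_stack([area_sqft, quality])

# Raw coefficients: dollars per square foot vs dollars per quality point
raw_coef = LinearRegression().fit(X_raw, price).coef_

# Standardized coefficients: dollars per one standard deviation of each feature
X_std = StandardScaler().fit_transform(X_raw)
std_coef = LinearRegression().fit(X_std, price).coef_

print(f"Raw coefficients:          area={raw_coef[0]:,.0f}, quality={raw_coef[1]:,.0f}")
print(f"Standardized coefficients: area={std_coef[0]:,.0f}, quality={std_coef[1]:,.0f}")
```

On this synthetic data the raw coefficient for `quality` is far larger, yet `area_sqft` has the larger standardized coefficient, because a typical change in area moves the price more than a typical change in quality.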

### Imports and Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import cross_val_score
import warnings

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

print("Libraries imported successfully!")

### Data Loading and Preprocessing

In [None]:
# Load the data
print("Loading data...")
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print(f"Target variable: SalePrice (range: ${train_df['SalePrice'].min():,} - ${train_df['SalePrice'].max():,})")

### Feature Selection: Numerical Features Only

In [None]:
# Select only numerical features
numerical_features = train_df.select_dtypes(include=[np.number]).columns.tolist()
numerical_features.remove('Id')  # Remove Id column
numerical_features.remove('SalePrice')  # Remove target variable

print(f"Found {len(numerical_features)} numerical features:")
print("\nFeature Categories:")

# Group features by name substring for a quick overview (a feature can match more than one group, e.g. GarageArea)
size_features = [f for f in numerical_features if any(x in f.lower() for x in ['sf', 'area', 'frontage'])]
room_features = [f for f in numerical_features if any(x in f.lower() for x in ['bath', 'bedroom', 'kitchen', 'room', 'garage'])]
quality_features = [f for f in numerical_features if any(x in f.lower() for x in ['qual', 'cond', 'overall'])]
time_features = [f for f in numerical_features if any(x in f.lower() for x in ['year', 'yr', 'mo'])]
other_features = [f for f in numerical_features if f not in size_features + room_features + quality_features + time_features]

print(f"\n📏 Size/Area Features ({len(size_features)}): {size_features}")
print(f"\n🏠 Room/Bath Features ({len(room_features)}): {room_features}")
print(f"\n⭐ Quality Features ({len(quality_features)}): {quality_features}")
print(f"\n📅 Time Features ({len(time_features)}): {time_features}")
print(f"\n🔧 Other Features ({len(other_features)}): {other_features}")

### Missing Value Analysis and Imputation

In [None]:
# Extract features and target
X_train = train_df[numerical_features].copy()
y_train = train_df['SalePrice'].copy()
X_test = test_df[numerical_features].copy()

# Analyze missing values
print("Missing Value Analysis:")
train_missing = X_train.isnull().sum()
test_missing = X_test.isnull().sum()

missing_features = pd.DataFrame({
    'Feature': numerical_features,
    'Train_Missing': [train_missing[f] for f in numerical_features],
    'Test_Missing': [test_missing[f] for f in numerical_features],
    'Train_Pct': [train_missing[f]/len(X_train)*100 for f in numerical_features],
    'Test_Pct': [test_missing[f]/len(X_test)*100 for f in numerical_features]
})

# Show only features with missing values
missing_summary = missing_features[(missing_features['Train_Missing'] > 0) | (missing_features['Test_Missing'] > 0)]
print(f"\nFeatures with missing values: {len(missing_summary)}")
print(missing_summary.to_string(index=False))

# Fill missing values with median
print("\nFilling missing values with median...")
for feature in numerical_features:
    if X_train[feature].isnull().sum() > 0 or X_test[feature].isnull().sum() > 0:
        median_value = X_train[feature].median()
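        # Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
        X_train[feature] = X_train[feature].fillna(median_value)
        X_test[feature] = X_test[feature].fillna(median_value)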
        print(f"  {feature}: filled with {median_value}")

print(f"\nMissing values after imputation: {X_train.isnull().sum().sum()} (train), {X_test.isnull().sum().sum()} (test)")

## 2. Model Training and Performance Analysis {#performance}

### Linear Regression Baseline

In [None]:
# Train Linear Regression model
print("Training Linear Regression baseline model...")
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions
y_train_pred = lr_model.predict(X_train)
y_test_pred = lr_model.predict(X_test)

# Ensure no negative predictions
y_test_pred = np.maximum(y_test_pred, 0)

# Calculate metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

print(f"\n📊 Baseline Model Performance:")
print(f"  Training RMSE: ${train_rmse:,.2f}")
print(f"  Training MAE:  ${train_mae:,.2f}")
print(f"  Training R²:   {train_r2:.4f}")
print(f"\n🎯 Test Predictions:")
print(f"  Range: ${y_test_pred.min():,.0f} - ${y_test_pred.max():,.0f}")
print(f"  Mean:  ${y_test_pred.mean():,.0f}")

### Cross-Validation Analysis

In [None]:
# Perform cross-validation
print("Performing 5-fold cross-validation...")
cv_scores = cross_val_score(lr_model, X_train, y_train, cv=5, 
                           scoring='neg_mean_squared_error', n_jobs=-1)
cv_rmse_scores = np.sqrt(-cv_scores)

print(f"\n🔄 Cross-Validation Results:")
print(f"  CV RMSE: ${cv_rmse_scores.mean():,.2f} (±${cv_rmse_scores.std()*2:,.2f}, two standard deviations)")
print(f"  Individual folds: {[f'${score:,.0f}' for score in cv_rmse_scores]}")

# Check for overfitting: training error well below CV error is a warning sign
if train_rmse < cv_rmse_scores.mean() - cv_rmse_scores.std():
    print("  ⚠️  Model may be overfitting (training RMSE more than one std below CV RMSE)")
else:
    print("  ✅ Training RMSE is within one std of CV RMSE (no strong sign of overfitting)")
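
The RMSE values above are in dollars. To our knowledge, the competition leaderboard instead scores the RMSE between the logarithms of predicted and actual prices, so the two numbers are not directly comparable. Below is a minimal sketch of that metric; `log_rmse` and its `floor` parameter are hypothetical helpers introduced here, and the prices are made up. The floor matters because the baseline clips negative predictions to 0, and the logarithm of 0 is undefined.

```python
import numpy as np

def log_rmse(y_true, y_pred, floor=1.0):
    """RMSE between log(actual) and log(predicted) prices.

    `floor` is an assumption of this sketch: predictions are clipped to at
    least this value so the logarithm is defined for zero or negative output.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), floor, None)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))

# Made-up prices for illustration
demo_actual = np.array([100_000.0, 200_000.0, 400_000.0])
demo_off_by_10pct = demo_actual * 1.10

score_perfect = log_rmse(demo_actual, demo_actual)
score_10pct = log_rmse(demo_actual, demo_off_by_10pct)
print(f"Perfect predictions: {score_perfect:.4f}")
print(f"All 10% too high:    {score_10pct:.4f}")
```

Because the metric works on ratios, a 10% error on a cheap house costs the same as a 10% error on an expensive one, which dollar RMSE does not capture.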

## 3. Feature Importance Analysis {#features}

### Linear Regression Coefficients

In [None]:
# Analyze feature importance
feature_importance = pd.DataFrame({
    'Feature': numerical_features,
    'Coefficient': lr_model.coef_,
    'Abs_Coefficient': np.abs(lr_model.coef_)
})
feature_importance = feature_importance.sort_values('Abs_Coefficient', ascending=False)

print("🏆 Top 15 Features by Absolute Coefficient (unscaled, so magnitudes depend on feature units):")
top_features = feature_importance.head(15)
for i, (_, row) in enumerate(top_features.iterrows(), 1):
    direction = "📈" if row['Coefficient'] > 0 else "📉"
    print(f"{i:2d}. {row['Feature']:20} : {direction} ${row['Coefficient']:8,.0f}")

print(f"\nModel intercept: ${lr_model.intercept_:,.2f}")

### Feature Importance Visualization

In [None]:
# Visualize feature importance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Top 15 most important features
top_15 = feature_importance.head(15)
colors = ['green' if x > 0 else 'red' for x in top_15['Coefficient']]
bars1 = ax1.barh(range(len(top_15)), top_15['Coefficient'], color=colors, alpha=0.7)
ax1.set_yticks(range(len(top_15)))
ax1.set_yticklabels(top_15['Feature'])
ax1.set_xlabel('Coefficient Value')
ax1.set_title('Top 15 Feature Coefficients', fontweight='bold')
ax1.grid(axis='x', alpha=0.3)
ax1.axvline(x=0, color='black', linestyle='-', alpha=0.3)

# Feature correlation with target
correlations = []
for feature in numerical_features:
    corr = X_train[feature].corr(y_train)
    correlations.append(corr)

corr_df = pd.DataFrame({
    'Feature': numerical_features,
    'Correlation': correlations,
    'Abs_Correlation': np.abs(correlations)
})
corr_df = corr_df.sort_values('Abs_Correlation', ascending=False)

top_corr_15 = corr_df.head(15)
colors2 = ['green' if x > 0 else 'red' for x in top_corr_15['Correlation']]
bars2 = ax2.barh(range(len(top_corr_15)), top_corr_15['Correlation'], color=colors2, alpha=0.7)
ax2.set_yticks(range(len(top_corr_15)))
ax2.set_yticklabels(top_corr_15['Feature'])
ax2.set_xlabel('Correlation with SalePrice')
ax2.set_title('Top 15 Feature Correlations', fontweight='bold')
ax2.grid(axis='x', alpha=0.3)
ax2.axvline(x=0, color='black', linestyle='-', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Key Insights:")
# top_features is sorted by absolute value, so filter by sign before taking the first row
positive = top_features[top_features['Coefficient'] > 0]
negative = top_features[top_features['Coefficient'] < 0]
if not positive.empty:
    print(f"  • Largest positive coefficient: {positive.iloc[0]['Feature']} (+${positive.iloc[0]['Coefficient']:,.0f})")
if not negative.empty:
    print(f"  • Largest negative coefficient: {negative.iloc[0]['Feature']} (-${abs(negative.iloc[0]['Coefficient']):,.0f})")
print(f"  • Highest correlation: {top_corr_15.iloc[0]['Feature']} ({top_corr_15.iloc[0]['Correlation']:.3f})")

## 4. Model Comparison {#comparison}

Let's compare our baseline Linear Regression with a Random Forest model to see if we can improve performance.

In [None]:
# Train Random Forest for comparison
print("Training Random Forest model for comparison...")
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Make predictions
rf_train_pred = rf_model.predict(X_train)
rf_test_pred = rf_model.predict(X_test)

# Calculate metrics
rf_train_rmse = np.sqrt(mean_squared_error(y_train, rf_train_pred))
rf_train_r2 = r2_score(y_train, rf_train_pred)

# Cross-validation for Random Forest
rf_cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5, 
                              scoring='neg_mean_squared_error', n_jobs=-1)
rf_cv_rmse_scores = np.sqrt(-rf_cv_scores)

# Model comparison
comparison_df = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest'],
    'Train_RMSE': [train_rmse, rf_train_rmse],
    'Train_R2': [train_r2, rf_train_r2],
    'CV_RMSE_Mean': [cv_rmse_scores.mean(), rf_cv_rmse_scores.mean()],
    'CV_RMSE_Std': [cv_rmse_scores.std(), rf_cv_rmse_scores.std()]
})

print("\n🏁 Model Comparison:")
print(comparison_df.to_string(index=False, float_format='%.2f'))

# Determine best model (idxmin returns an index label, so look it up with .loc)
best_model_idx = comparison_df['CV_RMSE_Mean'].idxmin()
best_model_name = comparison_df.loc[best_model_idx, 'Model']
print(f"\n🏆 Best model based on CV RMSE: {best_model_name}")

# Use best model for final predictions
if best_model_name == 'Linear Regression':
    final_model = lr_model
    final_predictions = y_test_pred
else:
    final_model = rf_model
    final_predictions = rf_test_pred

print(f"\n✅ Using {best_model_name} for final submission")
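
As more candidate models are added, the comparison above can be folded into a small helper. This is a sketch under assumptions, not the notebook's method: `compare_models` is a hypothetical function, the data is synthetic, and it uses a shuffled `KFold` with a fixed seed rather than the unshuffled default that `cv=5` gives.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def compare_models(models, X, y, n_splits=5, seed=42):
    """Score each model on the same shuffled folds; return a table sorted by CV RMSE."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for name, model in models.items():
        # The scorer returns negative RMSE, so flip the sign
        scores = -cross_val_score(model, X, y, cv=folds,
                                  scoring="neg_root_mean_squared_error")
        rows.append({"Model": name,
                     "CV_RMSE_Mean": scores.mean(),
                     "CV_RMSE_Std": scores.std()})
    return pd.DataFrame(rows).sort_values("CV_RMSE_Mean").reset_index(drop=True)

# Synthetic, purely linear data (illustrative only)
rng = np.random.default_rng(42)
demo_X = rng.normal(size=(300, 4))
demo_y = demo_X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.1, size=300)

demo_table = compare_models(
    {"Linear Regression": LinearRegression(),
     "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42)},
    demo_X, demo_y,
)
print(demo_table.to_string(index=False))
```

On this deliberately linear toy data Linear Regression ranks first; on the real housing data the ordering may differ, which is the point of running the comparison.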

### Model Performance Visualization

In [None]:
# Visualize model performance
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Model Performance Analysis', fontsize=16, fontweight='bold')

# Actual vs Predicted - Linear Regression
axes[0, 0].scatter(y_train, y_train_pred, alpha=0.6, color='blue')
axes[0, 0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual SalePrice')
axes[0, 0].set_ylabel('Predicted SalePrice')
axes[0, 0].set_title(f'Linear Regression: Actual vs Predicted\nR² = {train_r2:.4f}')
axes[0, 0].grid(True, alpha=0.3)

# Residuals - Linear Regression
residuals = y_train - y_train_pred
axes[0, 1].scatter(y_train_pred, residuals, alpha=0.6, color='blue')
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('Predicted SalePrice')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Linear Regression: Residual Plot')
axes[0, 1].grid(True, alpha=0.3)

# Actual vs Predicted - Random Forest
axes[1, 0].scatter(y_train, rf_train_pred, alpha=0.6, color='green')
axes[1, 0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
axes[1, 0].set_xlabel('Actual SalePrice')
axes[1, 0].set_ylabel('Predicted SalePrice')
axes[1, 0].set_title(f'Random Forest: Actual vs Predicted\nR² = {rf_train_r2:.4f}')
axes[1, 0].grid(True, alpha=0.3)

# CV Scores Comparison
models = ['Linear Regression', 'Random Forest']
cv_means = [cv_rmse_scores.mean(), rf_cv_rmse_scores.mean()]
cv_stds = [cv_rmse_scores.std(), rf_cv_rmse_scores.std()]

axes[1, 1].bar(models, cv_means, yerr=cv_stds, capsize=5, 
               color=['blue', 'green'], alpha=0.7)
axes[1, 1].set_ylabel('CV RMSE')
axes[1, 1].set_title('Cross-Validation RMSE Comparison')
axes[1, 1].grid(True, alpha=0.3)

# Add value labels on bars
for i, (mean, std) in enumerate(zip(cv_means, cv_stds)):
    axes[1, 1].text(i, mean + std + 500, f'${mean:,.0f}', 
                   ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## 5. Final Submission {#submission}

### Create Kaggle Submission File

In [None]:
# Create submission file
print("Creating final submission file...")

submission = pd.DataFrame({
    'Id': test_df['Id'],
    'SalePrice': final_predictions
})

# Save submission
submission.to_csv('final_submission.csv', index=False)

print(f"\n📄 Submission file created: final_submission.csv")
print(f"   Shape: {submission.shape}")
print(f"   Model used: {best_model_name}")
print(f"\n🎯 Prediction Summary:")
print(f"   Min prediction: ${final_predictions.min():,.0f}")
print(f"   Max prediction: ${final_predictions.max():,.0f}")
print(f"   Mean prediction: ${final_predictions.mean():,.0f}")
print(f"   Median prediction: ${np.median(final_predictions):,.0f}")

print("\n📋 First 10 predictions:")
print(submission.head(10).to_string(index=False))

print("\n📋 Last 10 predictions:")
print(submission.tail(10).to_string(index=False))
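
Before uploading, it can help to check the file's shape and contents programmatically. The helper below is a hypothetical sketch (`check_submission` is not part of any library, and the Ids and prices are made up); it assumes the expected columns are `Id` and `SalePrice`, as in the frame built above.

```python
import numpy as np
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty list means none)."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems  # the remaining checks need both columns
    if len(submission) != len(expected_ids):
        problems.append(f"expected {len(expected_ids)} rows, got {len(submission)}")
    if submission["Id"].duplicated().any():
        problems.append("duplicate Id values")
    if set(submission["Id"]) != set(expected_ids):
        problems.append("Id values do not match the test set")
    if submission["SalePrice"].isna().any():
        problems.append("missing SalePrice values")
    if (submission["SalePrice"] <= 0).any():
        problems.append("non-positive SalePrice values")
    return problems

# Made-up Ids and prices for illustration
demo_ids = pd.Series([1461, 1462, 1463])
good = pd.DataFrame({"Id": demo_ids, "SalePrice": [120000.0, 150000.0, 180000.0]})
bad = pd.DataFrame({"Id": demo_ids, "SalePrice": [120000.0, np.nan, 0.0]})

good_problems = check_submission(good, demo_ids)
bad_problems = check_submission(bad, demo_ids)
print("good:", good_problems)
print("bad: ", bad_problems)
```

In this notebook the call would look like `check_submission(submission, test_df['Id'])`, with an empty list indicating that none of these checks failed.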

### Model Summary and Next Steps

In [None]:
print("\n" + "="*60)
print("🏠 HOUSE PRICES MODELING SUMMARY")
print("="*60)

print(f"\n📊 Dataset:")
print(f"   • Training samples: {len(train_df):,}")
print(f"   • Test samples: {len(test_df):,}")
print(f"   • Features used: {len(numerical_features)} numerical features")
print(f"   • Target variable: SalePrice (${train_df['SalePrice'].min():,} - ${train_df['SalePrice'].max():,})")

print(f"\n🤖 Model Performance:")
print(f"   • Best model: {best_model_name}")
best_cv_rmse = comparison_df.loc[comparison_df['Model'] == best_model_name, 'CV_RMSE_Mean'].iloc[0]
best_cv_std = comparison_df.loc[comparison_df['Model'] == best_model_name, 'CV_RMSE_Std'].iloc[0]
print(f"   • Cross-validation RMSE: ${best_cv_rmse:,.0f} (±${best_cv_std*2:,.0f})")
best_r2 = comparison_df.loc[comparison_df['Model'] == best_model_name, 'Train_R2'].iloc[0]
print(f"   • Training R²: {best_r2:.4f} ({best_r2*100:.1f}% of training-set variance explained)")

print(f"\n🔍 Key Features:")
print(f"   • Largest absolute coefficient: {top_features.iloc[0]['Feature']} (${top_features.iloc[0]['Coefficient']:+,.0f})")
print(f"   • Highest correlation: {top_corr_15.iloc[0]['Feature']} ({top_corr_15.iloc[0]['Correlation']:+.3f})")
print(f"   • Missing values handled: {len(missing_summary)} features imputed with median")

print(f"\n📈 Next Steps for Improvement:")
print(f"   1. Feature Engineering: Create interaction terms, polynomial features")
print(f"   2. Categorical Features: Encode and include categorical variables")
print(f"   3. Regularization: Try Ridge, Lasso, ElasticNet regression")
print(f"   4. Advanced Models: XGBoost, LightGBM, Neural Networks")
print(f"   5. Ensemble Methods: Combine multiple models")
print(f"   6. Outlier Treatment: Remove or transform outliers identified in EDA")
print(f"   7. Target Transformation: Log transform target variable")

print("\n✅ Baseline model complete! Ready for Kaggle submission.")
print("="*60)

---

## Conclusion

This notebook established our baseline modeling approach for the House Prices competition. We successfully:

1. ✅ **Created a simple baseline model** using only numerical features
2. ✅ **Measured baseline performance** of Linear Regression with training metrics and 5-fold cross-validation
3. ✅ **Identified key features** associated with house prices, using coefficients and correlations with `SalePrice`
4. ✅ **Compared Linear Regression with a Random Forest** and selected the model with the lower CV RMSE
5. ✅ **Generated a submission file** ready for Kaggle

The baseline establishes a foundation for future improvements through feature engineering, advanced modeling techniques, and ensemble methods.
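
Of the improvements listed in the summary, the log transform of the target is probably the simplest to try. A minimal sketch, assuming scikit-learn's `TransformedTargetRegressor` and synthetic data (`demo_X` and `demo_y` are made up, and the follow-up notebook may take a different route):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
demo_X = rng.normal(size=(400, 3))

# Synthetic prices that are linear in log space: always positive and right-skewed
log_price = 12 + demo_X @ np.array([0.3, 0.2, -0.1]) + rng.normal(0, 0.05, size=400)
demo_y = np.exp(log_price)

log_target_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions, mapping back to dollars
)
log_target_model.fit(demo_X, demo_y)
demo_pred = log_target_model.predict(demo_X)

print(f"Prediction range: ${demo_pred.min():,.0f} - ${demo_pred.max():,.0f}")
```

Because predictions are mapped back through `expm1`, they are always greater than -1 and in practice positive for price-scale targets, so the clipping step used in the baseline would rarely be needed.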

**Files created:**
- `final_submission.csv` - Ready for Kaggle submission
- This notebook documenting our modeling approach

**Next notebook:** `04_advanced_modeling.ipynb` - Feature engineering and advanced techniques