# Task 6: House Price Prediction

## Objective
Predict house prices using property features such as size, bedrooms, location, and other real estate characteristics.

## Dataset
House Price Prediction Dataset (available on Kaggle)

## Problem Statement
Real estate pricing is a critical application of machine learning in the finance and property sectors. Accurate price predictions help buyers, sellers, and investors make informed decisions. Using various property features like square footage, number of bedrooms, bathrooms, location, age of the house, and amenities, we can build regression models to predict market prices. This is a continuous value prediction (regression) problem where we estimate the sale price based on property characteristics.

---

## Step 1: Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("All libraries imported successfully!")

## Step 2: Load and Inspect the Dataset

**Note:** Download house_prices.csv from Kaggle, or use the synthetic sample dataset generated in the cell below.
Link: https://www.kaggle.com/datasets/c1714457c52b8a4f366991e48d4e3af4/housing-prices-dataset-with-latitude-longitude

In [None]:
# Create a sample house price dataset if needed
# Or load from CSV: df = pd.read_csv('house_prices.csv')

np.random.seed(42)

# Generate sample house price dataset
n_samples = 500

df = pd.DataFrame({
    'SquareFeet': np.random.uniform(800, 5000, n_samples),
    'Bedrooms': np.random.randint(1, 6, n_samples),
    'Bathrooms': np.random.uniform(1, 4, n_samples),
    'YearBuilt': np.random.randint(1950, 2023, n_samples),
    'Garage': np.random.randint(0, 4, n_samples),
    'Pool': np.random.randint(0, 2, n_samples),
    'Lot_Size': np.random.uniform(0.5, 2, n_samples),
})

# Generate prices based on features with some correlation
df['Price'] = (
    150 * df['SquareFeet'] +
    50000 * df['Bedrooms'] +
    30000 * df['Bathrooms'] +
    -2000 * (2023 - df['YearBuilt']) +  # Negative: older houses are worth less
    20000 * df['Garage'] +
    100000 * df['Pool'] +
    50000 * df['Lot_Size'] +
    np.random.normal(0, 50000, n_samples)  # Add noise
)

# Ensure prices are positive
df['Price'] = np.abs(df['Price'])

print(f"Dataset shape: {df.shape}")
print(f"\nFirst 5 rows:")
df.head()

## Step 3: Data Inspection and Cleaning

In [None]:
# Check data types and missing values
print("Data Types:")
print(df.dtypes)

print("\nMissing Values:")
print(df.isnull().sum())

print("\nDataset Shape:")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

In [None]:
# Display summary statistics
print("Summary Statistics:")
df.describe()

## Step 4: Exploratory Data Analysis (EDA)

In [None]:
# Distribution of target variable (Price)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df['Price'], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Distribution of House Prices', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Price ($)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].grid(True, alpha=0.3)

# Box plot
axes[1].boxplot(df['Price'])
axes[1].set_title('Box Plot of House Prices', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Price ($)', fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Price Statistics:")
print(f"  Mean: ${df['Price'].mean():,.2f}")
print(f"  Median: ${df['Price'].median():,.2f}")
print(f"  Std Dev: ${df['Price'].std():,.2f}")
print(f"  Min: ${df['Price'].min():,.2f}")
print(f"  Max: ${df['Price'].max():,.2f}")

In [None]:
# Correlation analysis
plt.figure(figsize=(12, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, fmt='.2f')
plt.title('Correlation Matrix - House Features and Price', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Relationship between features and price
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

features = ['SquareFeet', 'Bedrooms', 'Bathrooms', 'YearBuilt', 'Garage', 'Lot_Size']

for idx, feature in enumerate(features):
    axes[idx].scatter(df[feature], df['Price'], alpha=0.5, s=30, color='steelblue')
    axes[idx].set_xlabel(feature, fontsize=11)
    axes[idx].set_ylabel('Price ($)', fontsize=11)
    axes[idx].set_title(f'{feature} vs Price', fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 5: Data Preprocessing

In [None]:
# Check for missing values and handle them
print("Handling missing values...")
df_clean = df.dropna()  # Drop rows with missing values

print(f"Original dataset shape: {df.shape}")
print(f"Clean dataset shape: {df_clean.shape}")
print(f"Rows removed: {df.shape[0] - df_clean.shape[0]}")

In [None]:
# Separate features and target
X = df_clean.drop('Price', axis=1)
y = df_clean['Price']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns:")
print(X.columns.tolist())

In [None]:
# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
print(f"\nTraining target statistics:")
print(f"  Mean: ${y_train.mean():,.2f}")
print(f"  Std Dev: ${y_train.std():,.2f}")
print(f"\nTesting target statistics:")
print(f"  Mean: ${y_test.mean():,.2f}")
print(f"  Std Dev: ${y_test.std():,.2f}")

In [None]:
# Feature scaling using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled successfully!")
print(f"\nScaled training data shape: {X_train_scaled.shape}")
print(f"Scaled testing data shape: {X_test_scaled.shape}")
print(f"\nScaled data - Mean (should be ~0): {X_train_scaled.mean(axis=0)[:3]}")
print(f"Scaled data - Std (should be ~1): {X_train_scaled.std(axis=0)[:3]}")
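
The scaling cell above fits the scaler on the training split and only transforms the test split, which keeps test-set statistics out of training. One way to make that ordering harder to get wrong is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below is illustrative only: it uses a small hypothetical two-feature dataset (`X_demo`, `y_demo`) instead of the notebook's DataFrame, and all `*_demo` names are assumptions introduced for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-in data: two features on very different scales
X_demo = np.column_stack([
    rng.uniform(800, 5000, 200),  # square feet
    rng.integers(1, 6, 200),      # bedrooms
])
y_demo = 150 * X_demo[:, 0] + 50_000 * X_demo[:, 1] + rng.normal(0, 50_000, 200)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics when transforming the held-out rows
pipe_demo = make_pipeline(StandardScaler(), LinearRegression())
pipe_demo.fit(X_tr_demo, y_tr_demo)
r2_demo = pipe_demo.score(X_te_demo, y_te_demo)

print(f"Pipeline R² on held-out rows: {r2_demo:.3f}")
```

Tree-based models such as gradient boosting are largely insensitive to feature scaling, so the scaler mainly matters for comparing linear coefficients across features.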

## Step 6: Model Training - Linear Regression

In [None]:
# Train Linear Regression Model
print("Training Linear Regression Model...")
lin_reg_model = LinearRegression()
lin_reg_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lin = lin_reg_model.predict(X_test_scaled)

print("Linear Regression Model trained successfully!")

In [None]:
# Evaluate Linear Regression
mae_lin = mean_absolute_error(y_test, y_pred_lin)
rmse_lin = np.sqrt(mean_squared_error(y_test, y_pred_lin))
r2_lin = r2_score(y_test, y_pred_lin)

print("\n" + "="*50)
print("LINEAR REGRESSION MODEL EVALUATION")
print("="*50)
print(f"Mean Absolute Error (MAE): ${mae_lin:,.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse_lin:,.2f}")
print(f"R² Score: {r2_lin:.4f}")
print("="*50)
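
To make the three reported metrics concrete, the following sketch computes MAE, RMSE, and R² directly from their definitions with NumPy. The four prices are hypothetical values chosen for easy arithmetic, not results from the dataset, and `regression_metrics` is a helper name introduced only for this illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))          # average absolute miss
    rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large misses more heavily
    ss_res = np.sum(errors ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical prices, for illustration only
actual_demo = [300_000, 450_000, 500_000, 650_000]
predicted_demo = [320_000, 430_000, 510_000, 640_000]

mae_demo, rmse_demo, r2_demo = regression_metrics(actual_demo, predicted_demo)
print(f"MAE:  ${mae_demo:,.2f}")   # 15,000.00
print(f"RMSE: ${rmse_demo:,.2f}")  # 15,811.39
print(f"R²:   {r2_demo:.4f}")      # 0.9840
```

RMSE is never smaller than MAE; a large gap between the two suggests that a few predictions miss by much more than the rest.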

## Step 7: Model Training - Gradient Boosting Regressor

In [None]:
# Train Gradient Boosting Model
print("Training Gradient Boosting Regressor Model...")
gb_model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)
gb_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_gb = gb_model.predict(X_test_scaled)

print("Gradient Boosting Model trained successfully!")

In [None]:
# Evaluate Gradient Boosting
mae_gb = mean_absolute_error(y_test, y_pred_gb)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
r2_gb = r2_score(y_test, y_pred_gb)

print("\n" + "="*50)
print("GRADIENT BOOSTING MODEL EVALUATION")
print("="*50)
print(f"Mean Absolute Error (MAE): ${mae_gb:,.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse_gb:,.2f}")
print(f"R² Score: {r2_gb:.4f}")
print("="*50)

## Step 8: Model Comparison

In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Model': ['Linear Regression', 'Gradient Boosting'],
    'MAE': [mae_lin, mae_gb],
    'RMSE': [rmse_lin, rmse_gb],
    'R² Score': [r2_lin, r2_gb]
})

print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(comparison_df.to_string(index=False))
print("="*70)

# Determine best model
best_model = 'Gradient Boosting' if r2_gb > r2_lin else 'Linear Regression'
print(f"\nBest Model (based on R² Score): {best_model}")
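
The comparison above rests on a single 80-20 split, so a small R² gap between the models may come down to which rows landed in the test set. A hedged alternative is k-fold cross-validation, sketched below on a small hypothetical dataset (`X_cv`, `y_cv`) generated from a linear formula. The fold count and the reduced `n_estimators` are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: the target is a linear function of three features plus noise
X_cv = rng.uniform(0, 1, size=(300, 3))
y_cv = 3.0 * X_cv[:, 0] + 2.0 * X_cv[:, 1] - X_cv[:, 2] + rng.normal(0, 0.1, 300)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # One R² value per fold; the spread shows how split-dependent the score is
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="r2")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: R² = {scores.mean():.4f} ± {scores.std():.4f}")
```

When the target really is a linear combination of the features, as in the synthetic prices generated in Step 2, a linear model can be expected to do at least as well as a tree ensemble.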

## Step 9: Visualization - Actual vs Predicted Prices

In [None]:
# Plot Linear Regression predictions
plt.figure(figsize=(14, 6))

test_indices = np.arange(len(y_test))

plt.plot(test_indices, y_test.values, 'o-', label='Actual Prices', 
         markersize=6, linewidth=2, alpha=0.7)
plt.plot(test_indices, y_pred_lin, 's-', label='Linear Regression Predictions',
         markersize=6, linewidth=2, alpha=0.7)
plt.xlabel('Test Sample Index', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('Actual vs Linear Regression Predicted House Prices', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Plot Gradient Boosting predictions
plt.figure(figsize=(14, 6))

plt.plot(test_indices, y_test.values, 'o-', label='Actual Prices',
         markersize=6, linewidth=2, alpha=0.7)
plt.plot(test_indices, y_pred_gb, 's-', label='Gradient Boosting Predictions',
         markersize=6, linewidth=2, alpha=0.7)
plt.xlabel('Test Sample Index', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('Actual vs Gradient Boosting Predicted House Prices', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Step 10: Scatter Plot - Actual vs Predicted

In [None]:
# Create subplots for both models
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Linear Regression scatter plot
axes[0].scatter(y_test, y_pred_lin, alpha=0.5, s=50, color='steelblue')
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Price ($)', fontsize=11)
axes[0].set_ylabel('Predicted Price ($)', fontsize=11)
axes[0].set_title(f'Linear Regression\nR² = {r2_lin:.4f}', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Gradient Boosting scatter plot
axes[1].scatter(y_test, y_pred_gb, alpha=0.5, s=50, color='salmon')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual Price ($)', fontsize=11)
axes[1].set_ylabel('Predicted Price ($)', fontsize=11)
axes[1].set_title(f'Gradient Boosting\nR² = {r2_gb:.4f}', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 11: Feature Importance Analysis

In [None]:
# Feature importance from Gradient Boosting
feature_importance_gb = pd.DataFrame({
    'Feature': X.columns,
    'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance - Gradient Boosting:")
print(feature_importance_gb.to_string(index=False))

In [None]:
# Feature importance from Linear Regression (absolute coefficients)
feature_importance_lin = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': np.abs(lin_reg_model.coef_)
}).sort_values('Coefficient', ascending=False)

print("Feature Importance - Linear Regression (Absolute Coefficients):")
print(feature_importance_lin.to_string(index=False))
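
The two rankings above are measured in different ways: impurity-based importances for gradient boosting and absolute standardized coefficients for linear regression. Permutation importance is one model-agnostic alternative: it measures how much the held-out score drops when a single feature's values are shuffled. The sketch below runs on hypothetical data in which feature 0 dominates and feature 2 is pure noise; all `*_pi` names are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical data: feature 0 matters most, feature 1 a little, feature 2 not at all
X_pi = rng.normal(size=(400, 3))
y_pi = 5.0 * X_pi[:, 0] + 1.0 * X_pi[:, 1] + rng.normal(0, 0.5, 400)

X_tr_pi, X_te_pi, y_tr_pi, y_te_pi = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=42
)
gb_pi = GradientBoostingRegressor(n_estimators=50, random_state=42)
gb_pi.fit(X_tr_pi, y_tr_pi)

# Shuffle each feature 10 times on the held-out rows and record the drop in R²
result_pi = permutation_importance(
    gb_pi, X_te_pi, y_te_pi, n_repeats=10, random_state=42
)
ranking_pi = np.argsort(result_pi.importances_mean)[::-1]

for idx in ranking_pi:
    print(f"feature {idx}: {result_pi.importances_mean[idx]:.4f} "
          f"± {result_pi.importances_std[idx]:.4f}")
```

Because the same procedure works for any fitted estimator, it allows a like-for-like comparison between the linear and boosted models.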

In [None]:
# Plot feature importance comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Gradient Boosting feature importance
axes[0].barh(feature_importance_gb['Feature'], feature_importance_gb['Importance'], color='steelblue')
axes[0].set_xlabel('Importance Score', fontsize=11)
axes[0].set_title('Feature Importance - Gradient Boosting', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='x')
axes[0].invert_yaxis()

# Linear Regression feature importance
axes[1].barh(feature_importance_lin['Feature'], feature_importance_lin['Coefficient'], color='salmon')
axes[1].set_xlabel('Absolute Coefficient', fontsize=11)
axes[1].set_title('Feature Importance - Linear Regression', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='x')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

## Step 12: Residual Analysis

In [None]:
# Calculate residuals
residuals_lin = y_test.values - y_pred_lin
residuals_gb = y_test.values - y_pred_gb

# Create residual plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Linear Regression residual plot
axes[0, 0].scatter(y_pred_lin, residuals_lin, alpha=0.5, color='steelblue')
axes[0, 0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 0].set_xlabel('Predicted Price ($)', fontsize=11)
axes[0, 0].set_ylabel('Residuals ($)', fontsize=11)
axes[0, 0].set_title('Linear Regression - Residual Plot', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Gradient Boosting residual plot
axes[0, 1].scatter(y_pred_gb, residuals_gb, alpha=0.5, color='salmon')
axes[0, 1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 1].set_xlabel('Predicted Price ($)', fontsize=11)
axes[0, 1].set_ylabel('Residuals ($)', fontsize=11)
axes[0, 1].set_title('Gradient Boosting - Residual Plot', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Linear Regression histogram of residuals
axes[1, 0].hist(residuals_lin, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
axes[1, 0].set_xlabel('Residuals ($)', fontsize=11)
axes[1, 0].set_ylabel('Frequency', fontsize=11)
axes[1, 0].set_title('Linear Regression - Residuals Distribution', fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Gradient Boosting histogram of residuals
axes[1, 1].hist(residuals_gb, bins=30, edgecolor='black', alpha=0.7, color='salmon')
axes[1, 1].set_xlabel('Residuals ($)', fontsize=11)
axes[1, 1].set_ylabel('Frequency', fontsize=11)
axes[1, 1].set_title('Gradient Boosting - Residuals Distribution', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
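
The residual plots are read by eye; a few summary numbers can back up that reading. The sketch below is a minimal example under assumed inputs: `residual_summary` is a hypothetical helper, and the demo predictions and prices are simulated with well-behaved noise, so they are not the notebook's results.

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Summarize residuals (actual minus predicted) with a few numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    std = residuals.std(ddof=1)
    return {
        # A mean near 0 suggests no systematic over- or under-prediction
        "mean": residuals.mean(),
        "std": std,
        # A correlation near 0 suggests no linear trend left in the residuals
        "corr_with_pred": np.corrcoef(y_pred, residuals)[0, 1],
        # Roughly 0.68 is expected if residuals are approximately normal
        "share_within_1_std": np.mean(np.abs(residuals) <= std),
    }

# Simulated predictions and prices, for illustration only
rng = np.random.default_rng(7)
pred_demo = rng.uniform(200_000, 900_000, 1000)
actual_demo = pred_demo + rng.normal(0, 50_000, 1000)

summary_demo = residual_summary(actual_demo, pred_demo)
for key, value in summary_demo.items():
    print(f"{key}: {value:,.4f}")
```

Calling the helper as `residual_summary(y_test, y_pred_lin)` or `residual_summary(y_test, y_pred_gb)` would apply the same checks to the models trained above.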

## Step 13: Model Metrics Comparison

In [None]:
# Compare metrics across models
metrics_to_plot = ['MAE', 'RMSE', 'R² Score']
lin_values = [mae_lin, rmse_lin, r2_lin]
gb_values = [mae_gb, rmse_gb, r2_gb]

x = np.arange(len(metrics_to_plot))
width = 0.35

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Absolute values
axes[0].bar(x[:2] - width/2, lin_values[:2], width, label='Linear Regression', color='steelblue')
axes[0].bar(x[:2] + width/2, gb_values[:2], width, label='Gradient Boosting', color='salmon')
axes[0].set_xlabel('Metrics', fontsize=12)
axes[0].set_ylabel('Error Value ($)', fontsize=12)
axes[0].set_title('MAE and RMSE Comparison', fontsize=12, fontweight='bold')
axes[0].set_xticks(x[:2])
axes[0].set_xticklabels(metrics_to_plot[:2])
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3, axis='y')

# R² Score
axes[1].bar(['Linear Regression', 'Gradient Boosting'], [r2_lin, r2_gb], 
            color=['steelblue', 'salmon'], width=0.5)
axes[1].set_ylabel('R² Score', fontsize=12)
axes[1].set_title('R² Score Comparison', fontsize=12, fontweight='bold')
axes[1].set_ylim([0, 1])
axes[1].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, v in enumerate([r2_lin, r2_gb]):
    axes[1].text(i, v + 0.02, f'{v:.4f}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

## Step 14: Price Prediction Example

In [None]:
# Make predictions on specific examples
# Three example houses; the second is 3000 sqft, 4 bed, 2.5 bath, built 2010, 2-car garage, with pool, lot size 1.0

example_houses = pd.DataFrame({
    'SquareFeet': [2000, 3000, 1500],
    'Bedrooms': [3, 4, 2],
    'Bathrooms': [2, 2.5, 1.5],
    'YearBuilt': [2000, 2010, 1990],
    'Garage': [2, 2, 1],
    'Pool': [0, 1, 0],
    'Lot_Size': [0.5, 1.0, 0.25]
})

# Scale the examples
example_scaled = scaler.transform(example_houses)

# Predict using both models
pred_lin = lin_reg_model.predict(example_scaled)
pred_gb = gb_model.predict(example_scaled)

# Display results
results = pd.DataFrame({
    'SquareFeet': example_houses['SquareFeet'],
    'Bedrooms': example_houses['Bedrooms'],
    'Linear_Reg_Price': pred_lin,
    'GB_Predicted_Price': pred_gb,
    'Average_Price': (pred_lin + pred_gb) / 2
})

print("\n" + "="*80)
print("HOUSE PRICE PREDICTIONS FOR NEW EXAMPLES")
print("="*80)
print(results.to_string(index=False))
print("="*80)
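
One caveat for new examples: the sample generator draws Lot_Size from 0.5 to 2, yet the third example house uses 0.25. Tree-based models cannot extrapolate beyond the feature values seen in training, and linear models extrapolate without any warning. The sketch below shows one possible range check; `out_of_range_report` and the small demo frames are hypothetical, introduced only for illustration.

```python
import pandas as pd

def out_of_range_report(train_df, new_df):
    """Return (row, column, value) for entries of new_df outside train_df's min/max."""
    lower, upper = train_df.min(), train_df.max()
    flagged = []
    for col in new_df.columns:
        mask = (new_df[col] < lower[col]) | (new_df[col] > upper[col])
        for row in new_df.index[mask]:
            flagged.append((row, col, new_df.at[row, col]))
    return flagged

# Hypothetical training ranges and new examples, for illustration only
train_demo = pd.DataFrame({
    "SquareFeet": [800, 2500, 5000],
    "Lot_Size": [0.5, 1.2, 2.0],
})
new_demo = pd.DataFrame({
    "SquareFeet": [2000, 3000, 1500],
    "Lot_Size": [0.5, 1.0, 0.25],
})

flags_demo = out_of_range_report(train_demo, new_demo)
for row, col, value in flags_demo:
    print(f"Row {row}: {col} = {value} is outside the training range")
```

In the notebook, the same check could be run as `out_of_range_report(X_train, example_houses)` before trusting the predictions.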

## Step 15: Key Findings and Insights

In [None]:
print("\n" + "="*70)
print("KEY FINDINGS AND INSIGHTS")
print("="*70)

print(f"\n1. DATASET OVERVIEW:")
print(f"   - Total samples: {len(df_clean)}")
print(f"   - Training samples: {len(X_train)}")
print(f"   - Testing samples: {len(X_test)}")
print(f"   - Number of features: {X.shape[1]}")

print(f"\n2. PRICE STATISTICS:")
print(f"   - Average price: ${y.mean():,.2f}")
print(f"   - Median price: ${y.median():,.2f}")
print(f"   - Min price: ${y.min():,.2f}")
print(f"   - Max price: ${y.max():,.2f}")
print(f"   - Std deviation: ${y.std():,.2f}")

print(f"\n3. MODEL PERFORMANCE SUMMARY:")
print(f"   \n   LINEAR REGRESSION:")
print(f"   - MAE: ${mae_lin:,.2f}")
print(f"   - RMSE: ${rmse_lin:,.2f}")
print(f"   - R² Score: {r2_lin:.4f}")

print(f"   \n   GRADIENT BOOSTING:")
print(f"   - MAE: ${mae_gb:,.2f}")
print(f"   - RMSE: ${rmse_gb:,.2f}")
print(f"   - R² Score: {r2_gb:.4f}")

improvement_r2 = ((r2_gb - r2_lin) / r2_lin * 100) if r2_lin != 0 else 0
print(f"\n   Performance improvement (R² Score): {improvement_r2:.1f}%")

print(f"\n4. TOP 3 MOST IMPORTANT FEATURES:")
print(f"   \n   Gradient Boosting:")
for idx, row in feature_importance_gb.head(3).iterrows():
    print(f"   - {row['Feature']}: {row['Importance']:.4f}")

print(f"\n   Linear Regression:")
for idx, row in feature_importance_lin.head(3).iterrows():
    print(f"   - {row['Feature']}: {row['Coefficient']:.2f}")

print(f"\n5. RECOMMENDATION:")
print(f"   - Best Model: {best_model}")
print(f"   - Chosen for its higher R² score on the test set; check MAE and RMSE in section 3 as well.")

print(f"\n6. MODEL INTERPRETATION:")
avg_error_lin = mae_lin
avg_error_gb = mae_gb
print(f"   - Linear Regression typical error: ±${avg_error_lin:,.2f}")
print(f"   - Gradient Boosting typical error: ±${avg_error_gb:,.2f}")
print(f"   - Average house price: ${y.mean():,.2f}")
error_pct_lin = (avg_error_lin / y.mean()) * 100
error_pct_gb = (avg_error_gb / y.mean()) * 100
print(f"   - Linear Regression error percentage: {error_pct_lin:.1f}%")
print(f"   - Gradient Boosting error percentage: {error_pct_gb:.1f}%")

print(f"\n7. FEATURE CORRELATIONS WITH PRICE:")
price_corr = df_clean.corr()['Price'].drop('Price').sort_values(ascending=False)
for feature, corr in price_corr.head(5).items():
    print(f"   - {feature}: {corr:.4f}")

print(f"\n8. CONCLUSION:")
print(f"   - Both models perform reasonably well for house price prediction.")
print(f"   - {best_model} achieved the higher R² score on the test set.")
print(f"   - Square footage and number of bedrooms are key price drivers.")
print(f"   - The model can be used for property valuation and market analysis.")
print(f"   - Feature importance analysis helps identify key value factors.")

print("\n" + "="*70)

## Summary

In this task, we successfully:
1. ✅ Loaded and inspected the house price dataset
2. ✅ Performed comprehensive exploratory data analysis (EDA)
3. ✅ Cleaned the data and handled missing values
4. ✅ Preprocessed features using StandardScaler
5. ✅ Trained Linear Regression and Gradient Boosting models
6. ✅ Evaluated models using MAE, RMSE, and R² metrics
7. ✅ Compared model performance and selected the best one
8. ✅ Visualized actual vs predicted prices
9. ✅ Analyzed feature importance
10. ✅ Performed residual analysis
11. ✅ Made predictions on new examples

**Skills Demonstrated:**
- Regression modeling with multiple algorithms
- Data preprocessing and feature scaling
- Exploratory data analysis and visualization
- Model evaluation using multiple metrics
- Feature importance analysis
- Residual analysis and model diagnostics
- Real estate data understanding
- Price prediction and valuation
- Model comparison and selection