# Linear Regression: House Price Prediction

## Problem Overview

Linear regression is one of the fundamental supervised learning algorithms in machine learning. This notebook walks through an end-to-end implementation of linear regression for predicting house prices from features such as size, age, and location.

### Learning Objectives
- Understand the mathematical foundation of linear regression
- Implement data preprocessing and feature engineering
- Train and evaluate linear regression models
- Visualize model performance and residual analysis
- Handle real-world data challenges

### Mathematical Foundation

Linear regression assumes a linear relationship between input features and target variable:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

Where:
- $y$ is the target variable (house price)
- $\beta_0$ is the intercept
- $\beta_i$ are the coefficients for features $x_i$
- $\epsilon$ is the error term
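
As a minimal sketch of what fitting this equation means, the coefficients can be estimated by least squares on a design matrix whose first column is all ones (for $\beta_0$). The toy data, `true_beta`, and the noise level below are made-up assumptions for illustration, not part of the house price dataset.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 2 features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([3.0, 1.5, -2.0])  # [beta_0, beta_1, beta_2]
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept beta_0 is estimated with the rest.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the beta that minimizes the sum of squared errors.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(beta_hat, 2))
```

With low noise, `beta_hat` should land close to `true_beta`; scikit-learn's `LinearRegression` solves the same least-squares problem.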

## 1. Import Required Libraries

We start by importing all necessary libraries for data manipulation, modeling, and visualization.

In [None]:
# Core data manipulation and numerical computing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, Optional, List
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import make_regression, fetch_california_housing

# Statistical analysis
from scipy import stats
import statsmodels.api as sm

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print("Libraries imported successfully!")

## 2. Data Generation and Loading

We'll work with both synthetic and real datasets to understand different aspects of linear regression.

In [None]:
def generate_house_price_data(n_samples: int = 1000, noise_level: float = 0.1) -> pd.DataFrame:
    """Generate synthetic house price dataset with realistic features.
    
    Args:
        n_samples: Number of samples to generate
        noise_level: Amount of noise to add to the target variable
        
    Returns:
        DataFrame with house features and prices
    """
    np.random.seed(42)
    
    # Generate realistic house features
    house_size = np.random.normal(2000, 500, n_samples)  # Square feet
    house_size = np.clip(house_size, 500, 5000)  # Reasonable bounds
    
    bedrooms = np.random.poisson(3, n_samples) + 1  # 1-7 bedrooms typically
    bedrooms = np.clip(bedrooms, 1, 7)
    
    bathrooms = np.random.normal(2.5, 0.8, n_samples)  # 1-5 bathrooms
    bathrooms = np.clip(bathrooms, 1, 5)
    
    age = np.random.exponential(15, n_samples)  # House age in years
    age = np.clip(age, 0, 100)
    
    # Distance to city center (km)
    distance_to_center = np.random.gamma(2, 5, n_samples)
    distance_to_center = np.clip(distance_to_center, 1, 50)
    
    # School rating (0-10, from a scaled Beta(2, 2) distribution)
    school_rating = np.random.beta(2, 2, n_samples) * 10
    
    # Generate price based on realistic relationships
    base_price = (
        100 * house_size +  # $100 per sq ft
        15000 * bedrooms +   # $15k per bedroom
        20000 * bathrooms +  # $20k per bathroom
        -2000 * age +        # Depreciation
        -1000 * distance_to_center +  # Penalty per km from city center
        5000 * school_rating +  # School district premium
        100000  # Base price
    )
    
    # Add noise
    noise = np.random.normal(0, noise_level * np.mean(base_price), n_samples)
    price = base_price + noise
    
    # Ensure positive prices
    price = np.maximum(price, 50000)
    
    # Create DataFrame
    df = pd.DataFrame({
        'house_size_sqft': house_size,
        'bedrooms': bedrooms,
        'bathrooms': bathrooms,
        'age_years': age,
        'distance_to_center_km': distance_to_center,
        'school_rating': school_rating,
        'price': price
    })
    
    return df

# Generate synthetic dataset
df_synthetic = generate_house_price_data(n_samples=1000)
print(f"Generated synthetic dataset with {len(df_synthetic)} samples")
print(f"Features: {list(df_synthetic.columns[:-1])}")
print(f"Target: {df_synthetic.columns[-1]}")

# Display basic statistics
print("\nDataset Statistics:")
df_synthetic.describe()

## 3. Exploratory Data Analysis

Understanding our data is crucial before building models.

In [None]:
def perform_eda(df: pd.DataFrame) -> None:
    """Perform comprehensive exploratory data analysis.
    
    Args:
        df: Input DataFrame
    """
    print("=== EXPLORATORY DATA ANALYSIS ===")
    
    # Basic information
    print(f"\nDataset shape: {df.shape}")
    print(f"Missing values: {df.isnull().sum().sum()}")
    
    # Data types
    print("\nData types:")
    print(df.dtypes)
    
    # Target variable distribution
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Price distribution
    axes[0, 0].hist(df['price'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0, 0].set_title('Price Distribution')
    axes[0, 0].set_xlabel('Price ($)')
    axes[0, 0].set_ylabel('Frequency')
    
    # Log price distribution (often more normal)
    axes[0, 1].hist(np.log(df['price']), bins=30, alpha=0.7, color='lightgreen', edgecolor='black')
    axes[0, 1].set_title('Log Price Distribution')
    axes[0, 1].set_xlabel('Log Price')
    axes[0, 1].set_ylabel('Frequency')
    
    # Price vs house size (most important feature)
    axes[1, 0].scatter(df['house_size_sqft'], df['price'], alpha=0.6, color='coral')
    axes[1, 0].set_title('Price vs House Size')
    axes[1, 0].set_xlabel('House Size (sq ft)')
    axes[1, 0].set_ylabel('Price ($)')
    
    # Correlation heatmap
    corr_matrix = df.corr()
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
                square=True, ax=axes[1, 1])
    axes[1, 1].set_title('Feature Correlation Matrix')
    
    plt.tight_layout()
    plt.show()
    
    # Print correlation with target
    print("\nCorrelation with price:")
    price_corr = corr_matrix['price'].sort_values(ascending=False)
    for feature, corr in price_corr.items():
        if feature != 'price':
            print(f"{feature}: {corr:.3f}")

# Perform EDA
perform_eda(df_synthetic)

## 4. Data Preprocessing

Prepare the data for machine learning by separating features from the target, splitting into train/test sets, and scaling features. The synthetic data contains no missing values, so no imputation step is needed here.
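
When a dataset does contain missing values, imputation can be chained with scaling. The sketch below is one way to do it, assuming median imputation is acceptable; the arrays `X_train_demo` and `X_test_demo` are made-up illustrations and not part of the notebook's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test matrices with a few missing entries.
X_train_demo = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
X_test_demo = np.array([[np.nan, 20.0]])

preprocessor = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the training median
    ("scale", StandardScaler()),                   # zero mean, unit variance
])

# Fit on the training data only, then reuse the learned statistics on the test data.
X_train_ready = preprocessor.fit_transform(X_train_demo)
X_test_ready = preprocessor.transform(X_test_demo)
print(X_test_ready)
```

Fitting the imputer and scaler on the training split alone keeps test-set statistics out of the model, which is the same reason `preprocess_data` calls `fit_transform` on `X_train` and only `transform` on `X_test`.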

In [None]:
def preprocess_data(df: pd.DataFrame, target_col: str = 'price', 
                   test_size: float = 0.2, scale_features: bool = True) -> Tuple:
    """Preprocess data for machine learning.
    
    Args:
        df: Input DataFrame
        target_col: Name of target column
        test_size: Proportion for test set
        scale_features: Whether to standardize features
        
    Returns:
        Tuple of (X_train, X_test, y_train, y_test, scaler, feature_names)
    """
    # Separate features and target
    X = df.drop(columns=[target_col])
    y = df[target_col]
    
    feature_names = X.columns.tolist()
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    
    # Scale features if requested
    scaler = None
    if scale_features:
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
    
    print(f"Training set size: {X_train.shape[0]} samples")
    print(f"Test set size: {X_test.shape[0]} samples")
    print(f"Number of features: {X_train.shape[1]}")
    
    return X_train, X_test, y_train, y_test, scaler, feature_names

# Preprocess the data
X_train, X_test, y_train, y_test, scaler, feature_names = preprocess_data(df_synthetic)

print("\nFeature scaling statistics (training set):")
if scaler is not None:
    print(f"Feature means: {scaler.mean_}")
    print(f"Feature std: {scaler.scale_}")

## 5. Model Implementation and Training

Implement different variants of linear regression models.
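
To make the difference between the variants concrete, here is a hedged sketch of what Ridge computes: ordinary least squares with an added L2 penalty `alpha` on the coefficients, which has a closed-form solution. The toy data and the variable names (`X_demo`, `y_demo`, `w`, `b`) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical toy data with three features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)
alpha = 1.0

# Center the data so the intercept is left unpenalized, then solve
# (Xc^T Xc + alpha * I) w = Xc^T yc for the coefficient vector w.
x_mean, y_mean = X_demo.mean(axis=0), y_demo.mean()
Xc, yc = X_demo - x_mean, y_demo - y_mean
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X_demo.shape[1]), Xc.T @ yc)
b = y_mean - x_mean @ w  # recover the intercept from the means

# Compare against scikit-learn's Ridge with the same alpha.
sk_model = Ridge(alpha=alpha).fit(X_demo, y_demo)
print(np.allclose(w, sk_model.coef_), np.isclose(b, sk_model.intercept_))
```

Setting `alpha = 0` in the same formula gives plain least squares. Lasso has no closed form like this because its L1 penalty is not differentiable at zero, which is also what lets it set coefficients exactly to zero.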

In [None]:
def train_linear_models(X_train: np.ndarray, y_train: np.ndarray) -> dict:
    """Train multiple linear regression variants.
    
    Args:
        X_train: Training features
        y_train: Training targets
        
    Returns:
        Dictionary of trained models
    """
    models = {
        'Linear Regression': LinearRegression(),
        'Ridge Regression': Ridge(alpha=1.0),
        'Lasso Regression': Lasso(alpha=1.0),
    }
    
    trained_models = {}
    
    print("Training models...")
    for name, model in models.items():
        print(f"\nTraining {name}...")
        model.fit(X_train, y_train)
        trained_models[name] = model
        
        # Print model coefficients
        if hasattr(model, 'coef_'):
            print(f"Intercept: {model.intercept_:.2f}")
            print(f"Coefficients: {model.coef_}")
            
    return trained_models

# Train models
models = train_linear_models(X_train, y_train)

## 6. Model Evaluation

Evaluate model performance using multiple metrics.
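
The three metrics used in this section are short formulas. The sketch below spells them out in NumPy so it is clear what scikit-learn's `r2_score`, `mean_squared_error`, and `mean_absolute_error` report; the helper name `regression_metrics` and the three example prices are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(errors ** 2)),
        "mae": np.mean(np.abs(errors)),
    }

demo = regression_metrics([200_000, 300_000, 400_000], [210_000, 290_000, 420_000])
print(f"R²={demo['r2']:.3f}, RMSE={demo['rmse']:,.0f}, MAE={demo['mae']:,.0f}")
# prints: R²=0.970, RMSE=14,142, MAE=13,333
```

RMSE is larger than MAE here because squaring gives the single $20,000 miss more weight than the two $10,000 misses.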

In [None]:
def evaluate_regression_models(models: dict, X_train: np.ndarray, X_test: np.ndarray,
                             y_train: np.ndarray, y_test: np.ndarray) -> pd.DataFrame:
    """Evaluate regression models comprehensively.
    
    Args:
        models: Dictionary of trained models
        X_train, X_test: Training and test features
        y_train, y_test: Training and test targets
        
    Returns:
        DataFrame with evaluation metrics
    """
    results = []
    
    for name, model in models.items():
        # Make predictions
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)
        
        # Calculate metrics
        train_r2 = r2_score(y_train, train_pred)
        test_r2 = r2_score(y_test, test_pred)
        
        train_mse = mean_squared_error(y_train, train_pred)
        test_mse = mean_squared_error(y_test, test_pred)
        
        train_mae = mean_absolute_error(y_train, train_pred)
        test_mae = mean_absolute_error(y_test, test_pred)
        
        # Cross-validation score
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
        
        results.append({
            'Model': name,
            'Train R²': train_r2,
            'Test R²': test_r2,
            'Train RMSE': np.sqrt(train_mse),
            'Test RMSE': np.sqrt(test_mse),
            'Train MAE': train_mae,
            'Test MAE': test_mae,
            'CV R² Mean': cv_scores.mean(),
            'CV R² Std': cv_scores.std()
        })
    
    return pd.DataFrame(results)

# Evaluate models
evaluation_results = evaluate_regression_models(models, X_train, X_test, y_train, y_test)

print("Model Evaluation Results:")
print(evaluation_results.round(4))

# Find best model
best_model_name = evaluation_results.loc[evaluation_results['Test R²'].idxmax(), 'Model']
best_model = models[best_model_name]
print(f"\nBest performing model: {best_model_name}")

## 7. Model Interpretation and Feature Importance

Understand which features are most important for predictions.
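
Because the models are trained on standardized features, each coefficient reads as "change in price per one standard deviation of the feature". A sketch of converting such coefficients back to original units follows; the two-feature toy data and names like `coef_per_unit` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: price depends on size (sq ft) and age (years).
rng = np.random.default_rng(0)
size_sqft = rng.normal(2000, 500, 500)
age_years = rng.uniform(0, 50, 500)
X_raw = np.column_stack([size_sqft, age_years])
price = 100_000 + 100 * size_sqft - 2_000 * age_years + rng.normal(0, 5_000, 500)

scaler_demo = StandardScaler()
model_demo = LinearRegression().fit(scaler_demo.fit_transform(X_raw), price)

# Dividing by each feature's standard deviation gives dollars per original unit.
coef_per_unit = model_demo.coef_ / scaler_demo.scale_
# Undo the mean shift to get the intercept for unscaled inputs.
intercept_raw = model_demo.intercept_ - np.sum(coef_per_unit * scaler_demo.mean_)

print("Scaled coefficients:  ", np.round(model_demo.coef_, 0))
print("Per-unit coefficients:", np.round(coef_per_unit, 1))
```

The scaled coefficients are the right ones for ranking features against each other, while the per-unit values are easier to explain (roughly dollars per square foot and dollars per year of age in this toy setup).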

In [None]:
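def analyze_feature_importance(model, feature_names: List[str]) -> Optional[pd.DataFrame]:
    """Analyze and visualize feature importance.
    
    Args:
        model: Trained linear model
        feature_names: Names of features
        
    Returns:
        DataFrame of coefficients sorted by absolute value, or None if the
        model exposes no coefficients
    """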
    if hasattr(model, 'coef_'):
        coefficients = model.coef_
        
        # Create feature importance DataFrame
        importance_df = pd.DataFrame({
            'Feature': feature_names,
            'Coefficient': coefficients,
            'Abs_Coefficient': np.abs(coefficients)
        }).sort_values('Abs_Coefficient', ascending=False)
        
        print("Feature Importance (by coefficient magnitude):")
        print(importance_df)
        
        # Plot feature importance
        plt.figure(figsize=(12, 8))
        
        # Coefficient values
        plt.subplot(2, 1, 1)
        colors = ['red' if coef < 0 else 'green' for coef in importance_df['Coefficient']]
        plt.barh(importance_df['Feature'], importance_df['Coefficient'], color=colors, alpha=0.7)
        plt.title('Feature Coefficients (Positive = Increases Price, Negative = Decreases Price)')
        plt.xlabel('Coefficient Value')
        plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
        
        # Absolute coefficient values
        plt.subplot(2, 1, 2)
        plt.barh(importance_df['Feature'], importance_df['Abs_Coefficient'], 
                color='skyblue', alpha=0.7)
        plt.title('Feature Importance (Absolute Coefficient Values)')
        plt.xlabel('Absolute Coefficient Value')
        
        plt.tight_layout()
        plt.show()
        
        return importance_df
    else:
        print("Model does not have coefficients to analyze.")
        return None

# Analyze feature importance
feature_importance = analyze_feature_importance(best_model, feature_names)

## 8. Residual Analysis

Analyze model residuals to check assumptions and identify potential issues.
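
One of the checks in this section, the Durbin-Watson statistic, is simple enough to compute by hand: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below uses a hypothetical helper `durbin_watson_stat` and two simulated residual series to show how the value behaves.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(size=2000)          # no autocorrelation
trending = np.cumsum(rng.normal(size=2000))  # random walk: strong positive autocorrelation

print(f"Independent residuals: {durbin_watson_stat(independent):.2f}")
print(f"Trending residuals:    {durbin_watson_stat(trending):.2f}")
```

Values near 2 suggest no first-order autocorrelation, while values near 0 suggest strong positive autocorrelation. The statistic depends on the order of the residuals, so it is most informative when observations have a natural ordering such as time; on a randomly shuffled test set it says little.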

In [None]:
def perform_residual_analysis(model, X_test: np.ndarray, y_test: np.ndarray) -> None:
    """Perform comprehensive residual analysis.
    
    Args:
        model: Trained model
        X_test: Test features
        y_test: Test targets
    """
    # Make predictions
    y_pred = model.predict(X_test)
    residuals = y_test - y_pred
    
    # Create subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Predicted vs Actual
    axes[0, 0].scatter(y_test, y_pred, alpha=0.6, color='blue')
    axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                   'r--', lw=2, label='Perfect Prediction')
    axes[0, 0].set_xlabel('Actual Values')
    axes[0, 0].set_ylabel('Predicted Values')
    axes[0, 0].set_title('Predicted vs Actual Values')
    axes[0, 0].legend()
    
    # 2. Residuals vs Predicted
    axes[0, 1].scatter(y_pred, residuals, alpha=0.6, color='green')
    axes[0, 1].axhline(y=0, color='red', linestyle='--')
    axes[0, 1].set_xlabel('Predicted Values')
    axes[0, 1].set_ylabel('Residuals')
    axes[0, 1].set_title('Residuals vs Predicted Values')
    
    # 3. Residual distribution
    axes[1, 0].hist(residuals, bins=30, alpha=0.7, color='orange', edgecolor='black')
    axes[1, 0].set_xlabel('Residuals')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Residual Distribution')
    
    # 4. Q-Q plot for normality check
    stats.probplot(residuals, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot (Normality Check)')
    
    plt.tight_layout()
    plt.show()
    
    # Statistical tests
    print("Residual Analysis Summary:")
    print(f"Mean residual: {np.mean(residuals):.6f} (should be close to 0)")
    print(f"Std residual: {np.std(residuals):.2f}")
    
    # Shapiro-Wilk test for normality
    shapiro_stat, shapiro_p = stats.shapiro(residuals[:5000])  # Limit sample size
    print(f"Shapiro-Wilk test p-value: {shapiro_p:.6f}")
    if shapiro_p > 0.05:
        print("✓ Residuals appear to be normally distributed")
    else:
        print("✗ Residuals may not be normally distributed")
    
    # Durbin-Watson test for autocorrelation
    from statsmodels.stats.stattools import durbin_watson
    dw_stat = durbin_watson(residuals)
    print(f"Durbin-Watson statistic: {dw_stat:.3f} (2.0 indicates no autocorrelation)")

# Perform residual analysis
perform_residual_analysis(best_model, X_test, y_test)

## 9. Polynomial Regression Extension

Explore polynomial features to capture non-linear relationships.
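
Polynomial expansion grows the feature count quickly, which is why regularization matters here. As a sketch, the number of generated columns for `n` inputs and degree `d` (without the bias column) is `C(n + d, d) - 1`; the loop below checks that formula against `PolynomialFeatures`. The placeholder matrix `X_placeholder` is an assumption used only for its shape.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 6  # the synthetic dataset has six input features
X_placeholder = np.zeros((1, n_features))  # only the shape matters for counting

counts = {}
for degree in (1, 2, 3, 4):
    # Monomials of total degree <= d in n variables, minus the constant term
    by_formula = comb(n_features + degree, degree) - 1
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X_placeholder)
    counts[degree] = (by_formula, poly.n_output_features_)
    print(f"degree {degree}: {poly.n_output_features_} features")
```

For six inputs this gives 27 columns at degree 2 and 83 at degree 3, so a degree-3 model on 800 training rows already has roughly one parameter per ten samples.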

In [None]:
def train_polynomial_regression(X_train: np.ndarray, X_test: np.ndarray, 
                              y_train: np.ndarray, y_test: np.ndarray,
                              degrees: Tuple[int, ...] = (2, 3)) -> Tuple[dict, pd.DataFrame]:
    """Train polynomial regression models.
    
    Args:
        X_train, X_test: Training and test features
        y_train, y_test: Training and test targets
        degrees: Polynomial degrees to try
        
    Returns:
        Tuple of (dictionary of fitted models and transformers, results DataFrame)
    """
    poly_results = []
    poly_models = {}
    
    for degree in degrees:
        print(f"\nTraining Polynomial Regression (degree {degree})...")
        
        # Create polynomial features
        poly_features = PolynomialFeatures(degree=degree, include_bias=False)
        X_train_poly = poly_features.fit_transform(X_train)
        X_test_poly = poly_features.transform(X_test)
        
        print(f"Original features: {X_train.shape[1]}")
        print(f"Polynomial features: {X_train_poly.shape[1]}")
        
        # Train model with regularization to prevent overfitting
        model = Ridge(alpha=1.0)  # Use Ridge to handle high-dimensional polynomial features
        model.fit(X_train_poly, y_train)
        
        # Evaluate
        train_pred = model.predict(X_train_poly)
        test_pred = model.predict(X_test_poly)
        
        train_r2 = r2_score(y_train, train_pred)
        test_r2 = r2_score(y_test, test_pred)
        
        poly_results.append({
            'Degree': degree,
            'Train R²': train_r2,
            'Test R²': test_r2,
            'Train RMSE': np.sqrt(mean_squared_error(y_train, train_pred)),
            'Test RMSE': np.sqrt(mean_squared_error(y_test, test_pred)),
            'Features': X_train_poly.shape[1]
        })
        
        poly_models[f'Poly_{degree}'] = {
            'model': model,
            'poly_features': poly_features
        }
    
    # Display results
    poly_df = pd.DataFrame(poly_results)
    print("\nPolynomial Regression Results:")
    print(poly_df.round(4))
    
    # Plot complexity vs performance
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(poly_df['Degree'], poly_df['Train R²'], 'o-', label='Training R²', color='blue')
    plt.plot(poly_df['Degree'], poly_df['Test R²'], 'o-', label='Test R²', color='red')
    plt.xlabel('Polynomial Degree')
    plt.ylabel('R² Score')
    plt.title('Model Performance vs Polynomial Degree')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 2, 2)
    plt.plot(poly_df['Features'], poly_df['Test R²'], 'o-', color='green')
    plt.xlabel('Number of Features')
    plt.ylabel('Test R² Score')
    plt.title('Test Performance vs Model Complexity')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return poly_models, poly_df

# Train polynomial regression models
poly_models, poly_results_df = train_polynomial_regression(X_train, X_test, y_train, y_test)

## 10. Model Comparison and Selection

Compare all models and select the best one based on multiple criteria.

In [None]:
def compare_all_models() -> None:
    """Compare all trained models and provide recommendations."""
    print("=== MODEL COMPARISON SUMMARY ===")
    
    # Combine linear and polynomial results
    print("\n1. Linear Models:")
    print(evaluation_results[['Model', 'Test R²', 'Test RMSE', 'CV R² Mean']].round(4))
    
    print("\n2. Polynomial Models:")
    print(poly_results_df[['Degree', 'Test R²', 'Test RMSE', 'Features']].round(4))
    
    # Model selection criteria
    print("\n=== MODEL SELECTION CRITERIA ===")
    
    # Best linear model
    best_linear_idx = evaluation_results['Test R²'].idxmax()
    best_linear = evaluation_results.loc[best_linear_idx]  # idxmax returns an index label
    print(f"\nBest Linear Model: {best_linear['Model']}")
    print(f"  - Test R²: {best_linear['Test R²']:.4f}")
    print(f"  - Test RMSE: {best_linear['Test RMSE']:.2f}")
    print(f"  - CV R² Mean: {best_linear['CV R² Mean']:.4f}")
    
    # Best polynomial model
    best_poly_idx = poly_results_df['Test R²'].idxmax()
    best_poly = poly_results_df.loc[best_poly_idx]  # idxmax returns an index label
    # Selecting a row upcasts integer columns to float, so cast back for display
    print(f"\nBest Polynomial Model: Degree {int(best_poly['Degree'])}")
    print(f"  - Test R²: {best_poly['Test R²']:.4f}")
    print(f"  - Test RMSE: {best_poly['Test RMSE']:.2f}")
    print(f"  - Features: {int(best_poly['Features'])}")
    
    # Overall recommendation
    print("\n=== RECOMMENDATION ===")
    if best_linear['Test R²'] > best_poly['Test R²'] - 0.01:  # Small tolerance
        print(f"Recommended Model: {best_linear['Model']}")
        print("Reason: Simpler model with comparable performance (Occam's Razor)")
    else:
        print(f"Recommended Model: Polynomial Degree {int(best_poly['Degree'])}")
        print("Reason: Significantly better performance justifies complexity")
    
    # Visualization of all models
    plt.figure(figsize=(12, 6))
    
    # Prepare data for plotting
    all_models = []
    all_r2 = []
    all_rmse = []
    
    # Add linear models
    for _, row in evaluation_results.iterrows():
        all_models.append(row['Model'])
        all_r2.append(row['Test R²'])
        all_rmse.append(row['Test RMSE'])
    
    # Add polynomial models
    for _, row in poly_results_df.iterrows():
        all_models.append(f"Poly Deg {int(row['Degree'])}")  # iterrows yields floats
        all_r2.append(row['Test R²'])
        all_rmse.append(row['Test RMSE'])
    
    # Plot R² scores
    plt.subplot(1, 2, 1)
    colors = plt.cm.viridis(np.linspace(0, 1, len(all_models)))
    bars = plt.bar(range(len(all_models)), all_r2, color=colors, alpha=0.7)
    plt.xlabel('Models')
    plt.ylabel('Test R² Score')
    plt.title('Model Comparison: R² Scores')
    plt.xticks(range(len(all_models)), all_models, rotation=45, ha='right')
    
    # Highlight best model
    best_idx = np.argmax(all_r2)
    bars[best_idx].set_color('red')
    bars[best_idx].set_alpha(1.0)
    
    # Plot RMSE
    plt.subplot(1, 2, 2)
    bars = plt.bar(range(len(all_models)), all_rmse, color=colors, alpha=0.7)
    plt.xlabel('Models')
    plt.ylabel('Test RMSE')
    plt.title('Model Comparison: RMSE (Lower is Better)')
    plt.xticks(range(len(all_models)), all_models, rotation=45, ha='right')
    
    # Highlight best model (lowest RMSE)
    best_rmse_idx = np.argmin(all_rmse)
    bars[best_rmse_idx].set_color('red')
    bars[best_rmse_idx].set_alpha(1.0)
    
    plt.tight_layout()
    plt.show()

# Compare all models
compare_all_models()

## 11. Real-World Application: California Housing Dataset

Apply the same pipeline to a real dataset to check how well a linear model holds up outside synthetic data.

In [None]:
def apply_to_real_data() -> None:
    """Apply the best model to real California housing data."""
    print("=== REAL-WORLD VALIDATION ===")
    
    # Load California housing dataset
    california_data = fetch_california_housing()
    X_real = california_data.data
    y_real = california_data.target * 100000  # Convert to actual dollar values
    feature_names_real = california_data.feature_names
    
    print(f"California Housing Dataset:")
    print(f"  - Samples: {X_real.shape[0]}")
    print(f"  - Features: {X_real.shape[1]}")
    print(f"  - Feature names: {feature_names_real}")
    print(f"  - Target: Median house value ($)")
    
    # Split and scale data
    X_train_real, X_test_real, y_train_real, y_test_real = train_test_split(
        X_real, y_real, test_size=0.2, random_state=42
    )
    
    scaler_real = StandardScaler()
    X_train_real_scaled = scaler_real.fit_transform(X_train_real)
    X_test_real_scaled = scaler_real.transform(X_test_real)
    
    # Train our best model on real data
    best_model_real = LinearRegression()  # Plain least squares; substitute Ridge or Lasso if one of them scored best above
    best_model_real.fit(X_train_real_scaled, y_train_real)
    
    # Evaluate on real data
    y_pred_real = best_model_real.predict(X_test_real_scaled)
    
    r2_real = r2_score(y_test_real, y_pred_real)
    rmse_real = np.sqrt(mean_squared_error(y_test_real, y_pred_real))
    mae_real = mean_absolute_error(y_test_real, y_pred_real)
    
    print(f"\nReal Data Results:")
    print(f"  - R² Score: {r2_real:.4f}")
    print(f"  - RMSE: ${rmse_real:,.2f}")
    print(f"  - MAE: ${mae_real:,.2f}")
    
    # Feature importance on real data
    importance_real = pd.DataFrame({
        'Feature': feature_names_real,
        'Coefficient': best_model_real.coef_,
        'Abs_Coefficient': np.abs(best_model_real.coef_)
    }).sort_values('Abs_Coefficient', ascending=False)
    
    print("\nFeature Importance (Real Data):")
    print(importance_real)
    
    # Visualization
    plt.figure(figsize=(15, 10))
    
    # Predicted vs Actual
    plt.subplot(2, 2, 1)
    plt.scatter(y_test_real, y_pred_real, alpha=0.5, color='blue')
    plt.plot([y_test_real.min(), y_test_real.max()], 
             [y_test_real.min(), y_test_real.max()], 'r--', lw=2)
    plt.xlabel('Actual Price ($)')
    plt.ylabel('Predicted Price ($)')
    plt.title(f'Real Data: Predicted vs Actual (R² = {r2_real:.3f})')
    
    # Residuals
    residuals_real = y_test_real - y_pred_real
    plt.subplot(2, 2, 2)
    plt.scatter(y_pred_real, residuals_real, alpha=0.5, color='green')
    plt.axhline(y=0, color='red', linestyle='--')
    plt.xlabel('Predicted Price ($)')
    plt.ylabel('Residuals ($)')
    plt.title('Residuals vs Predicted')
    
    # Feature importance
    plt.subplot(2, 2, 3)
    colors = ['red' if coef < 0 else 'green' for coef in importance_real['Coefficient']]
    plt.barh(importance_real['Feature'], importance_real['Coefficient'], 
             color=colors, alpha=0.7)
    plt.title('Feature Coefficients (Real Data)')
    plt.xlabel('Coefficient Value')
    plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
    
    # Error distribution
    plt.subplot(2, 2, 4)
    plt.hist(residuals_real, bins=30, alpha=0.7, color='orange', edgecolor='black')
    plt.xlabel('Residuals ($)')
    plt.ylabel('Frequency')
    plt.title('Residual Distribution')
    
    plt.tight_layout()
    plt.show()

# Apply to real data
apply_to_real_data()

## 12. Summary and Key Takeaways

### What We Learned

1. **Linear Regression Fundamentals**: We trained linear regression models with scikit-learn and reviewed the underlying mathematical model and its assumptions.

2. **Data Preprocessing**: Proper preprocessing, including a train/test split and feature scaling fitted on the training set only, is crucial for reliable performance estimates.

3. **Model Variants**: We explored different variants:
   - **Linear Regression**: Basic least squares solution
   - **Ridge Regression**: L2 regularization to prevent overfitting
   - **Lasso Regression**: L1 regularization for feature selection
   - **Polynomial Regression**: Capturing non-linear relationships

4. **Evaluation Metrics**: Multiple metrics provide different insights:
   - **R²**: Proportion of variance explained
   - **RMSE**: Root mean squared error in original units
   - **MAE**: Mean absolute error, less sensitive to outliers than RMSE

5. **Model Interpretation**: Linear models are highly interpretable through coefficient analysis.

6. **Residual Analysis**: Checking model assumptions through residual patterns.

### Best Practices

1. **Always scale features** when using regularized models
2. **Use cross-validation** for robust performance estimation
3. **Analyze residuals** to validate model assumptions
4. **Consider regularization** to prevent overfitting
5. **Test on real data** to validate model generalization
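
Practices 1 and 2 combine naturally: wrapping the scaler and the model in one pipeline means each cross-validation fold refits the scaler on its own training portion, so no statistics leak from the held-out fold. The following is a minimal sketch on made-up data; `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features: size (sq ft), bedrooms, age (years).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[2000, 3, 15], scale=[500, 1, 10], size=(300, 3))
y_demo = (100 * X_demo[:, 0] + 15_000 * X_demo[:, 1]
          - 2_000 * X_demo[:, 2] + rng.normal(0, 20_000, 300))

# The scaler is refit inside every fold because it is part of the estimator.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
```

For linear models the difference from scaling once up front is usually small, but the pipeline form stays correct when preprocessing steps such as imputation or feature selection are added later.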

### When to Use Linear Regression

✅ **Good for:**
- Linear relationships between features and target
- Interpretability is important
- Baseline model for comparison
- Small to medium datasets
- When you need to understand feature importance

❌ **Not ideal for:**
- Highly non-linear relationships
- Very high-dimensional data without regularization
- When interpretability is not needed and accuracy is paramount

### Next Steps

1. Explore **ensemble methods** like Random Forest
2. Try **non-linear models** like SVM or Neural Networks
3. Implement **feature engineering** techniques
4. Learn about **advanced regularization** methods
5. Study **time series** regression for temporal data

## 13. Practice Exercises

### Exercise 1: Feature Engineering
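Create new features from existing ones (e.g., square feet per bedroom, age categories) and see how they affect model performance. Avoid features derived from the target, such as price per square foot, because they leak the answer into the inputs.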

### Exercise 2: Outlier Detection
Implement outlier detection and removal techniques and analyze their impact on model performance.

### Exercise 3: Different Datasets
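Apply the same pipeline to other regression datasets (e.g., the diabetes dataset from `load_diabetes`, automobile prices). Note that the Boston housing dataset was removed from scikit-learn in version 1.2.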

### Exercise 4: Hyperparameter Tuning
Use GridSearchCV to find optimal regularization parameters for Ridge and Lasso regression.

### Exercise 5: Advanced Metrics
Implement additional evaluation metrics like MAPE (Mean Absolute Percentage Error) and analyze when each metric is most appropriate.