# California Housing Price Prediction - Model Building

This notebook focuses on building and evaluating a Linear Regression model for predicting California housing prices. We'll cover:
1. Data preparation and preprocessing
2. Model training
3. Model evaluation
4. Results visualization

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

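# Set style for visualizations
# The 'seaborn' Matplotlib style name was deprecated in Matplotlib 3.6 and removed in
# later releases, so use seaborn's own theme API for a similar look.
sns.set_theme(style="darkgrid", palette="husl")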

## 1. Data Preparation
Let's load and preprocess our dataset for model training.

In [None]:
# Load the dataset
df = pd.read_csv('../data/california_housing.csv')

# Split features and target
X = df.drop('Median_House_Value', axis=1)
y = df['Median_House_Value']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
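
The scaler is fit on the training split only and then reused on the test split, so no test-set statistics leak into training. As a rough illustration of that fit/transform split, here is a minimal NumPy sketch; the toy arrays and the `_toy` variable names are hypothetical and not part of the housing dataset.

```python
import numpy as np

# Toy data (hypothetical values, not from the housing dataset).
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[4.0, 400.0]])

# "fit": learn the per-column mean and standard deviation from training rows only.
mean_ = X_train_toy.mean(axis=0)
std_ = X_train_toy.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": apply the same training statistics to both splits.
X_train_toy_scaled = (X_train_toy - mean_) / std_
X_test_toy_scaled = (X_test_toy - mean_) / std_

print("Scaled training means:", X_train_toy_scaled.mean(axis=0))
print("Scaled test row:", X_test_toy_scaled)
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test row generally does not, because it is expressed relative to the training distribution.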

## 2. Model Training
Now let's train our Linear Regression model.

In [None]:
# Train the Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Print model coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})
print("Model Coefficients:")
print(coefficients.sort_values(by='Coefficient', ascending=False))
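
Because the model is fit on standardized features, each coefficient is the change in the predicted value per one standard deviation of that feature, not per original unit. The sketch below shows the relationship on synthetic data using plain NumPy least squares, which solves the same ordinary least squares problem as `LinearRegression`. The data, the true slopes (3.0 and 0.5), and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noiseless data: y = 3.0*x1 + 0.5*x2 + 10, with x2 on a much larger scale.
X_toy = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)])
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + 10.0

# Standardize the features (same idea as StandardScaler).
mean_, std_ = X_toy.mean(axis=0), X_toy.std(axis=0)
X_toy_scaled = (X_toy - mean_) / std_

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([X_toy_scaled, np.ones(len(X_toy_scaled))])
solution, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
coef_scaled = solution[:2]

# Dividing by each feature's standard deviation recovers the per-unit slopes.
coef_original_units = coef_scaled / std_

print("Standardized coefficients:", coef_scaled)
print("Per-unit coefficients:", coef_original_units)
```

In this toy case the second feature has the smaller per-unit slope but the larger standardized coefficient, since it varies over a much wider range. With the fitted objects in this notebook, the analogous conversion would be `model.coef_ / scaler.scale_`.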

## 3. Model Evaluation
Let's evaluate our model's performance using various metrics.

In [None]:
# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Performance Metrics:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R-squared Score: {r2:.4f}")
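
To make the metrics concrete, the following sketch computes the same four quantities by hand with NumPy on a tiny, made-up example. The values in `y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the housing data.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true_toy = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_toy = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true_toy - y_pred_toy

mse_toy = np.mean(errors ** 2)       # mean of squared errors
rmse_toy = np.sqrt(mse_toy)          # back in the units of the target
mae_toy = np.mean(np.abs(errors))    # mean of absolute errors

# R-squared: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse_toy:.4f}")
print(f"RMSE: {rmse_toy:.4f}")
print(f"MAE:  {mae_toy:.4f}")
print(f"R2:   {r2_toy:.4f}")
```

RMSE (0.75) is larger than MAE (0.625) here because squaring gives the two errors of size 1 more weight than the error of size 0.5.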

## 4. Results Visualization
Let's create visualizations to better understand our model's performance.

In [None]:
# 1. Actual vs Predicted Values Plot
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual House Values')
plt.ylabel('Predicted House Values')
plt.title('Actual vs Predicted House Values')
plt.tight_layout()
plt.show()

# 2. Residuals Plot
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted House Values')
plt.ylabel('Residuals')
plt.title('Residuals Plot')
plt.tight_layout()
plt.show()

# 3. Feature Importance Plot
# DataFrame.plot creates its own figure, so pass figsize to it directly;
# a separate plt.figure() call here would leave an empty figure behind.
coefficients.sort_values(by='Coefficient', ascending=True).plot(
    x='Feature', y='Coefficient', kind='barh', figsize=(12, 6)
)
plt.title('Feature Importance (Model Coefficients)')
plt.xlabel('Standardized Coefficient')
plt.tight_layout()
plt.show()

## Model Summary
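How to read the results above:
1. The R-squared value is the proportion of variance in test-set house values that the model explains: 1.0 is a perfect fit, and 0 is no better than always predicting the mean.
2. RMSE and MAE are both expressed in the same units as house values. MAE is the average absolute error, while RMSE penalizes large errors more heavily.
3. The feature importance plot shows which features have the largest standardized coefficients. Because the features were scaled, the magnitudes are comparable, but coefficients of strongly correlated features can be unstable.
4. The residuals plot helps check the model's assumptions: residuals should scatter randomly around zero with roughly constant spread. Curvature or a funnel shape suggests nonlinearity or non-constant variance.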