# Day 01 - Project 1

## Linear Regression on California Housing Dataset

### Step 1: Import Required Libraries
We import NumPy and pandas for data handling, scikit-learn for the dataset, model, and metrics, and Matplotlib for plotting.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

### Step 2: Load the Dataset
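Fetch the California Housing dataset with `fetch_california_housing` from `sklearn.datasets`, then inspect its keys, shape, feature names, and description. On first use, scikit-learn downloads the data and caches it locally.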

In [None]:
california = fetch_california_housing()
print(california.keys())
print(california.data.shape)
print(california.feature_names)
print(california.DESCR)

### Step 3: Create a DataFrame
We load the features into a pandas DataFrame for easier exploration. The target `y` is the median house value per district, in units of $100,000.

In [None]:
X = pd.DataFrame(california.data, columns=california.feature_names)
y = california.target
X.head()

### Step 4: Split Data into Train and Test Sets
We hold out 20% of the data as a test set so the model is evaluated on samples it never saw during training. The split does not prevent overfitting, but it lets us detect it.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
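
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on synthetic stand-in data. The arrays `X_demo` and `y_demo` and the `Xa_*`/`Xb_*` names are assumptions made for illustration, not part of the housing dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows and 8 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = rng.normal(size=1000)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(Xa_train.shape, Xa_test.shape)  # (800, 8) (200, 8)

# Repeating the call with the same random_state reproduces the same split.
Xb_train, Xb_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(Xa_test, Xb_test))  # True
```

Fixing `random_state` keeps the reported metrics comparable from one run to the next.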

### Step 5: Create and Train the Linear Regression Model
We fit an ordinary least squares model on the training set only.

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
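
After fitting, the learned weights are available as `coef_` and `intercept_`. The sketch below uses synthetic data with known coefficients (`true_coef` and `true_intercept` are assumptions chosen for illustration) to show that `fit` recovers them approximately.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y is a known linear function of X plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y_demo = X_demo @ true_coef + true_intercept + rng.normal(scale=0.1, size=500)

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# The estimates should land close to the values used to generate the data.
print("coefficients:", demo_model.coef_)
print("intercept:", demo_model.intercept_)
```

On the housing model, `model.coef_` holds one weight per feature, so `pd.Series(model.coef_, index=X.columns)` is one way to label them.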

### Step 6: Predict on Test Set
Use the trained model to predict on the unseen test data.

In [None]:
y_pred = model.predict(X_test)
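
For a linear model, `predict` amounts to a weighted sum of the features plus the intercept. The following sketch checks this on synthetic data; `X_demo`, `X_new`, and `demo_model` are hypothetical names used only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): an exact linear relationship with 4 features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 3.0]) + 0.5

demo_model = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: features times coefficients, plus the intercept.
X_new = rng.normal(size=(5, 4))
manual_pred = X_new @ demo_model.coef_ + demo_model.intercept_

print(np.allclose(manual_pred, demo_model.predict(X_new)))  # True
```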

### Step 7: Evaluate the Model
We evaluate the model using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² score.

In [None]:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
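
To see what these metrics compute, here is a small worked example in plain NumPy. The four values in `y_true` and `y_hat` are made up for illustration and are not taken from the housing data.

```python
import numpy as np

# Made-up true values and predictions (assumption, for illustration only).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.5])

residuals = y_true - y_hat

# MSE: mean of the squared residuals.
mse_manual = np.mean(residuals ** 2)

# RMSE: square root of MSE, back in the unit of the target.
rmse_manual = np.sqrt(mse_manual)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.4f}")   # 0.1875
print(f"RMSE: {rmse_manual:.4f}")  # 0.4330
print(f"R²:   {r2_manual:.4f}")    # 0.8500
```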

### Step 8: Visualize Predictions
We plot the true and predicted house values for each test sample, using a different color for each series.

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(range(len(y_test)), y_test, color='blue', label='True Prices', alpha=0.6)
plt.scatter(range(len(y_pred)), y_pred, color='red', label='Predicted Prices', alpha=0.6)
plt.title('True vs Predicted House Prices')
plt.xlabel('Sample Index')
plt.ylabel('House Price ($100,000 units)')
plt.legend()
plt.show()
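
An alternative view, offered as a sketch rather than a replacement for the plot above, puts true values on the x-axis and predictions on the y-axis, so a perfect model would fall on the diagonal. The arrays `y_true_demo` and `y_pred_demo` are synthetic stand-ins (an assumption); in the notebook, `y_test` and `y_pred` would take their place.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins (assumption) for y_test and y_pred, in $100,000 units.
rng = np.random.default_rng(0)
y_true_demo = rng.uniform(0.5, 5.0, size=200)
y_pred_demo = y_true_demo + rng.normal(scale=0.5, size=200)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_true_demo, y_pred_demo, alpha=0.5, label='Test samples')

# Diagonal reference line: points on it are predicted exactly.
lims = [
    min(y_true_demo.min(), y_pred_demo.min()),
    max(y_true_demo.max(), y_pred_demo.max()),
]
ax.plot(lims, lims, color='black', linestyle='--', label='Perfect prediction')

ax.set_xlabel('True Price ($100,000 units)')
ax.set_ylabel('Predicted Price ($100,000 units)')
ax.set_title('Predicted vs True House Prices')
ax.legend()
plt.show()
```

The vertical distance of each point from the dashed line is that sample's prediction error.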

### 📚 Final Notes
- MSE is the average of squared errors (not a percentage), so it is expressed in squared target units.
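- RMSE is the square root of MSE, so it is in the same unit as the house prices (here, $100,000) and indicates the typical size of a prediction error. Because errors are squared before averaging, large errors count more heavily than they would in a plain average of absolute errors.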
- A lower RMSE means better model performance.
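- R² score is the proportion of the target's variance explained by the model (closer to 1 is better; it can be negative on test data when the model does worse than always predicting the mean).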