# **Regression Task: *Predicting California Housing Prices***

#### Predicting California housing prices with Linear Regression and a Random Forest Regressor, using the ***California Housing Prices dataset***

## **Step 1: Import Libraries**
#### *The basic libraries that are required for data analysis and machine learning.*

In [None]:
import pandas as pd              # for handling data
import numpy as np               # for numerical operations
import matplotlib.pyplot as plt  # for data visualization
import seaborn as sns
import sklearn                   # scikit-learn library (machine learning tools)
from sklearn.model_selection import train_test_split # split dataset into training and testing sets
from sklearn.linear_model import LinearRegression    # machine learning algorithm
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # evaluating model performance

## **Step 2: Load and Check Data**
#### *Load the dataset and look at the first few rows to understand the structure.*

In [None]:
price = pd.read_csv("/kaggle/input/california-housing-prices/housing.csv")

#print(price)

In [None]:
price.info() # column names, dtypes, and non-null counts of the dataset

In [None]:
price.head() #first 5 rows of the dataset

In [None]:
price.tail() #last 5 rows of the dataset

In [None]:
price.shape # number of rows and columns of the dataset

In [None]:
price.describe() #statistical summary of numerical columns
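
Before preprocessing, it can help to count missing values and list non-numeric columns explicitly. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the real `price` data) to show the idea; the same two calls could be applied to `price`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns with a non-numeric dtype need encoding before modelling
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```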

## **Step 3: Preprocessing Data**
#### *-Handle missing values*
#### *-Encode categorical column (ocean_proximity)*
#### *-Split features/target*

In [None]:
price.dropna(inplace=True)    # Drop rows with missing values

# Encode categorical feature using One-Hot Encoding
price = pd.get_dummies(price, drop_first=True)

# Define features X and target y
X = price.drop("median_house_value", axis=1)
y = price["median_house_value"]

In [None]:
X.shape

In [None]:
y.shape
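
To see what `pd.get_dummies(..., drop_first=True)` does to a categorical column, here is a minimal sketch on a made-up three-row frame (`toy` and its values are assumptions for illustration). The first category in sorted order is dropped, so it becomes the implicit baseline.

```python
import pandas as pd

# Made-up rows with one numeric and one categorical column
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# One-hot encode; drop_first=True removes the first category ("INLAND")
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros (or `False`) in both remaining dummy columns therefore represents the dropped category.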

## **Step 4: Split the data into Training and Testing Sets**
#### *The features (X) and target (y) defined in Step 3 are split into 80% training and 20% testing data.*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

## **Step 5: Feature Scaling**

In [None]:
# Standardize the features: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
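
One possible way to keep scaling and model fitting together is a scikit-learn pipeline, which fits the scaler on the training data only and reuses those statistics at prediction time. The sketch below runs on synthetic data (`X_demo`, `y_demo`, and the coefficients are assumptions, not the housing data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the housing features (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[3.0, 500.0], scale=[1.5, 200.0], size=(200, 2))
y_demo = 40_000 * X_demo[:, 0] + 50 * X_demo[:, 1] + rng.normal(0, 5_000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_tr only, then fits the regressor
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# At prediction time the training mean and std are reused for X_te
print("Test R²:", round(pipe.score(X_te, y_te), 3))
```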

## **Step 6: Train - Model 1: Linear Regression**
#### *We create the model and fit it (train it) on the training data.*

In [None]:
model_lin = LinearRegression()         # Initialize Linear Regression Model
model_lin.fit(X_train_scaled, y_train) # Fit model

## **Step 7: Make Predictions - Model 1: Linear Regression**
#### *Use the trained model to predict on the test set.*

In [None]:
y_pred_lin = model_lin.predict(X_test_scaled) # Predict test values

In [None]:
# y_pred_lin

## **Step 8: Evaluate Model - Model 1: Linear Regression**

In [None]:
# Evaluation metrics
mae_lin = mean_absolute_error(y_test, y_pred_lin)
mse_lin = mean_squared_error(y_test, y_pred_lin)
rmse_lin = np.sqrt(mse_lin)
r2_lin = r2_score(y_test, y_pred_lin)

print("Linear Model Performance:")
print("Mean Absolute Error:", round(mae_lin, 2))
print("Mean Squared Error:", round(mse_lin, 2))
print("RMSE:", round(rmse_lin, 2))
print("R² Score:", round(r2_lin, 3))
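
To make the four metrics concrete, the following sketch computes them by hand with NumPy on four made-up values and compares the results with the scikit-learn functions. The arrays `y_true` and `y_hat` are illustrative assumptions, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_hat                      # [-10, 10, -20, 20]
mae_manual = np.mean(np.abs(residuals))         # (10 + 10 + 20 + 20) / 4 = 15
mse_manual = np.mean(residuals ** 2)            # (100 + 100 + 400 + 400) / 4 = 250
rmse_manual = np.sqrt(mse_manual)               # square root of the MSE
ss_res = np.sum(residuals ** 2)                 # residual sum of squares = 1000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 50000
r2_manual = 1 - ss_res / ss_tot                 # 1 - 1000 / 50000 = 0.98

print(mae_manual, mse_manual, round(rmse_manual, 3), r2_manual)
```

MAE and RMSE are in the same units as the target (USD here), while R² is unitless.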

## **Step 9: Visualize - Model 1: *Linear Regression***

## -------- 9.1. *Actual vs Predicted Plot*

In [None]:
plt.figure(figsize=(7,6))
sns.scatterplot(x=y_test, y=y_pred_lin)
plt.xlabel("Actual House Value")
plt.ylabel("Predicted House Value")
plt.title("Linear Regression: Actual vs Predicted House Prices")
plt.grid(True)
plt.show()

## -------- 9.2. *Error Distribution*

In [None]:
errors_lin = y_test - y_pred_lin
plt.figure(figsize=(7,6))
sns.histplot(errors_lin, bins=50)
plt.xlabel("Prediction Error")
plt.ylabel("Frequency")
plt.title("Linear Regression: Error Distribution")
plt.show()

## -------- 9.3. *Actual vs Predicted Line Chart*
#### *Overlays actual and predicted values for the first 100 test samples*

In [None]:
plt.figure(figsize=(7,6))
plt.plot(y_test.values[:100], label="Actual")
plt.plot(y_pred_lin[:100], label="Predicted")
plt.title("Linear Regression: Actual vs Predicted (sample 100)")
plt.legend()
plt.show()

## **Step 10: Train - Model 2: Random Forest Regressor**

In [None]:
# Initialize the model
model_ran = RandomForestRegressor(random_state=42, n_estimators=200)

model_ran.fit(X_train, y_train) # Train the model
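
A fitted `RandomForestRegressor` exposes `feature_importances_`, which can hint at which inputs drive the predictions. The sketch below uses synthetic data whose target depends mostly on one column (the column names and coefficients are assumptions for illustration); the same pattern could be applied to `model_ran` and `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features (illustration only, not the housing data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "median_income": rng.uniform(1, 10, 300),
    "housing_median_age": rng.uniform(1, 50, 300),
    "noise_feature": rng.normal(size=300),
})
# Synthetic target driven mostly by median_income
y_demo = (50_000 * X_demo["median_income"]
          + 500 * X_demo["housing_median_age"]
          + rng.normal(0, 10_000, 300))

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```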

## **Step 11: Make Predictions - Model 2: *Random Forest Regressor***

In [None]:
y_pred_ran = model_ran.predict(X_test)

In [None]:
y_pred_ran

## **Step 12: Evaluate Model - Model 2: *Random Forest Regressor***

In [None]:
# Evaluation metrics for Random Forest
mae_ran = mean_absolute_error(y_test, y_pred_ran)
mse_ran = mean_squared_error(y_test, y_pred_ran)
rmse_ran = np.sqrt(mse_ran)
r2_ran = r2_score(y_test, y_pred_ran)

print("Random Forest Model Performance:")
print("Mean Absolute Error:", round(mae_ran, 2))
print("Mean Squared Error:", round(mse_ran, 2))
print("RMSE:", round(rmse_ran, 2))
print("R² Score:", round(r2_ran, 3))

## **Step 13: Visualize - Model 2: *Random Forest Regressor***

## -------- 13.1. *Actual vs Predicted Plot*

In [None]:
plt.figure(figsize=(7,6))
sns.scatterplot(x=y_test, y=y_pred_ran)
plt.xlabel("Actual House Value")
plt.ylabel("Predicted House Value")
plt.title("Random Forest Regressor: Actual vs Predicted House Prices")
plt.grid(True)
plt.show()

## -------- 13.2. *Error Distribution*

In [None]:
errors_ran = y_test - y_pred_ran
plt.figure(figsize=(7,6))
sns.histplot(errors_ran, bins=50)
plt.xlabel("Prediction Error")
plt.ylabel("Frequency")
plt.title("Random Forest Regressor: Error Distribution")
plt.show()

## -------- 13.3. *Actual vs Predicted Line Chart*

In [None]:
plt.figure(figsize=(7,6))
plt.plot(y_test.values[:100], label="Actual")
plt.plot(y_pred_ran[:100], label="Predicted")
plt.title("Random Forest Regressor: Actual vs Predicted (sample 100)")
plt.legend()
plt.show()

## **Step 14: Compare Model Performances**

In [None]:
comp_table = pd.DataFrame({
    "Linear Regression": [mae_lin, mse_lin, rmse_lin, r2_lin],
    "Random Forest": [mae_ran, mse_ran, rmse_ran, r2_ran]
}, index=["MAE", "MSE", "RMSE", "R² Score"])

In [None]:
comp_table
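
The table above rests on a single 80/20 split. As a possible complement, k-fold cross-validation averages the score over several splits. The sketch below runs on synthetic, deliberately linear data from `make_regression`, so its scores say nothing about the housing results; it only illustrates the `cross_val_score` call.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean R² over 5 folds for each model
cv_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```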

## **Summary**

In this exercise, I aimed to develop machine learning models to predict the **median house value in California** using the **California Housing Prices dataset**. This dataset comprises **20,640 records** with **9 input features**: eight numerical ones (median income, housing median age, total rooms, total bedrooms, population, households, latitude, and longitude) and one categorical one (ocean_proximity). The target variable, median_house_value, is the median house value of a district in USD, which makes this a regression task.

I began by loading the dataset into a pandas DataFrame and inspecting its structure using *.info()*, *.head()*, *.tail()*, *.shape*, and *.describe()*. This exploration showed that ocean_proximity is a categorical column, so not all features are numeric. In preprocessing, I dropped rows with missing values and one-hot encoded ocean_proximity, dropping the first category. I then separated the features (*X*) from the target variable (*y*) and split the dataset into training and testing sets using an 80:20 ratio to evaluate model performance on unseen data. For the Linear Regression model, I standardized the features with StandardScaler, fitting the scaler on the training set only.

For model implementation, I first applied **Linear Regression**, a foundational algorithm for predicting continuous variables. I trained the model on the standardized training set and generated predictions on the standardized test set. I evaluated the model using **Mean Absolute Error (MAE)**, **Mean Squared Error (MSE)**, **Root Mean Squared Error (RMSE)**, and **R² Score**. The Linear Regression model achieved an R² score of 0.649, meaning it explained about 65% of the variance in test-set house values, a moderate fit. MAE and RMSE measure the typical size of the prediction error in USD, with RMSE penalizing large errors more heavily.

To improve performance and capture potential non-linear relationships in the dataset, I implemented a **Random Forest Regressor** with 200 trees, an ensemble learning method that builds multiple decision trees and averages their predictions to reduce overfitting. I trained the Random Forest model on the same training split, using the unscaled features because tree-based models do not depend on feature scale, and generated predictions on the test set. Evaluation metrics showed that the Random Forest model outperformed Linear Regression, with a higher R² score and lower MAE and RMSE, which suggests it captures patterns in the data that a linear model misses.
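
The averaging step can be checked directly: the sketch below fits a small forest on synthetic data (an assumption for illustration, not the housing data) and compares the forest's prediction with the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem (illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

X_new = X_demo[:5]

# Predictions of each individual tree, stacked into shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the average over its trees
forest_prediction = forest.predict(X_new)
print(np.allclose(manual_average, forest_prediction))
```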

I also visualized the results using scatter plots of actual versus predicted values, histograms of the prediction errors, and line charts comparing actual and predicted values for the first 100 test samples. These visualizations confirmed that the Random Forest model provided predictions more closely aligned with the true house values, while Linear Regression underperformed for extreme values.

In conclusion, this exercise allowed me to practice key steps in a regression pipeline, including data inspection, preprocessing, model selection, training, evaluation, and interpretation. Both Linear Regression and Random Forest demonstrated the practical application of supervised learning, with Random Forest offering superior predictive performance. The visualizations and evaluation metrics helped me understand model behavior, quantify prediction accuracy, and interpret the results in a real-world context, reinforcing my understanding of regression modeling and machine learning best practices.