# Project: House Price Prediction using Machine Learning

## Objective
The goal of this project is to predict median house values based on various features such as location, number of rooms, and median income.

## Dataset
We use the **California Housing Dataset**, which is derived from the 1990 U.S. Census. Its target, `median_house_value`, is a continuous quantity, so this is a regression problem.

## Methodology
1. **Data Loading:** Fetching data from a remote repository.
2. **Data Preprocessing:** Handling missing values and categorical features.
3. **Model Selection:** Training a Random Forest Regressor; Linear Regression is imported as a candidate baseline.
4. **Evaluation:** Measuring performance using Root Mean Squared Error (RMSE) and the R^2 score.

## --- STEP 1: IMPORT LIBRARIES ---

In [None]:

# Importing necessary libraries for data handling and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Setting visualization style
sns.set_theme(style="whitegrid")

## --- STEP 2: LOAD DATASET ---

In [None]:

# We are using the California Housing dataset directly from a remote URL
url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv"

print("Loading dataset...")
df = pd.read_csv(url)

# Display the first 5 rows to verify data
print("Dataset loaded successfully!")
print(df.head())
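
Before preprocessing, it can help to check which columns contain missing values and which are non-numeric, since those are the two things Step 3 deals with. The following is a minimal sketch on a small made-up frame; the `toy` data is hypothetical, and in the notebook the same calls would be made on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not rows from the California Housing dataset
toy = pd.DataFrame({
    "total_rooms": [1000.0, 2500.0, 1800.0],
    "total_bedrooms": [200.0, np.nan, 350.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# List the columns that are not numeric and therefore need encoding
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(text_columns)
```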

## --- STEP 3: DATA PREPROCESSING ---

In [None]:


# 1. Handle Missing Values
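# Filling missing 'total_bedrooms' with the median value.
# Assigning the result back avoids the chained inplace fillna,
# which raises a FutureWarning in recent pandas versions.
median_bedrooms = df["total_bedrooms"].median()
df["total_bedrooms"] = df["total_bedrooms"].fillna(median_bedrooms)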

# 2. Convert Text Data to Numbers
# The 'ocean_proximity' column is text, so we convert it using One-Hot Encoding
df = pd.get_dummies(df, columns=["ocean_proximity"], drop_first=True)

# 3. Define Features (X) and Target (y)
# Target: median_house_value (Price)
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

print("Preprocessing complete.")
print(f"Features shape: {X.shape}")
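
To make the two preprocessing steps concrete, the following minimal sketch applies them to a small made-up DataFrame. The `toy` frame and its values are illustrative assumptions, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not rows from the California Housing dataset
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# The median ignores NaN, so the fill value is median(100, 200, 300)
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(toy["total_bedrooms"].median())

# drop_first=True drops the first category in sorted order ("INLAND"),
# which becomes the implicit baseline where all dummy columns are zero
toy_encoded = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Dropping one category avoids perfectly collinear dummy columns, which matters more for linear models than for tree-based ones.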

## --- STEP 4: TRAIN-TEST SPLIT ---

In [None]:

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Data Size: {X_train.shape[0]} samples")
print(f"Testing Data Size: {X_test.shape[0]} samples")
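
Note that Step 3 computes the median of `total_bedrooms` on the full dataset before the split, so test rows contribute to the fill value. One possible alternative, sketched here on made-up data, is to split first and learn the median from the training rows only. The `toy` frame and the `X_tr`/`X_te` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0, np.nan,
                       400.0, 250.0, 150.0, 350.0, 500.0],
    "median_house_value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

X_toy = toy.drop("median_house_value", axis=1)
y_toy = toy["median_house_value"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Copy so the assignments below do not touch the original frame
X_tr, X_te = X_tr.copy(), X_te.copy()

# Learn the fill value from the training rows only
train_median = X_tr["total_bedrooms"].median()

# Apply the same training-derived value to both splits
X_tr["total_bedrooms"] = X_tr["total_bedrooms"].fillna(train_median)
X_te["total_bedrooms"] = X_te["total_bedrooms"].fillna(train_median)

print(f"Fill value learned from training rows: {train_median}")
```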

## --- STEP 5: MODEL TRAINING ---

In [None]:

# We use Random Forest Regressor because it can capture non-linear relationships
# and feature interactions that a plain Linear Regression cannot
model = RandomForestRegressor(n_estimators=100, random_state=42)

print("Training the model (Please wait)...")
model.fit(X_train, y_train)

print("Training completed successfully!")
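
`LinearRegression` is imported in Step 1 but never fitted. The sketch below shows one way the two models could be compared on the same split. It uses synthetic, deliberately non-linear data so that it runs offline; the `X_demo`, `y_demo`, and `candidates` names are assumptions, and in the notebook `X_train`, `X_test`, `y_train`, and `y_test` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = 3 * np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each candidate on the same training split and score it on the same test split
rmse_by_model = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    preds = candidate.predict(X_te)
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(y_te, preds)))
    print(f"{name}: RMSE = {rmse_by_model[name]:.3f}")
```

On this synthetic target the forest has a clear advantage by construction; results on the housing data may differ and should be measured rather than assumed.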

## --- STEP 6: MODEL EVALUATION ---

In [None]:

# Making predictions on the test set
y_pred = model.predict(X_test)

# Calculate RMSE (Root Mean Squared Error) and the R^2 score
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model Performance Metrics:")
print(f"Root Mean Squared Error (RMSE): ${rmse:,.2f}")
print(f"R^2 Score (coefficient of determination): {r2:.2f}")
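
For readers who want to see what the two metrics compute, this sketch reimplements them in plain Python on a few made-up numbers. The function names `rmse` and `r_squared` are illustrative; the notebook itself relies on scikit-learn's `mean_squared_error` and `r2_score`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up prices in dollars, for illustration only
actual = [100_000, 200_000, 300_000, 400_000]
predicted = [110_000, 190_000, 320_000, 380_000]

print(f"RMSE: ${rmse(actual, predicted):,.2f}")
print(f"R^2:  {r_squared(actual, predicted):.2f}")
```

RMSE is expressed in the same unit as the target (dollars here). R^2 equals 1 for a perfect fit, 0 for a model that always predicts the mean, and can be negative for a worse model, so it should not be read as an accuracy percentage.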

## --- STEP 7: TEST PREDICTION ---

In [None]:

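# Taking the first row of the test set to compare Actual vs Predicted Price
sample_index = 0
# Double brackets keep the row as a one-row DataFrame, so the feature names
# seen during training are preserved and scikit-learn does not warn
sample_data = X_test.iloc[[sample_index]]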
actual_price = y_test.iloc[sample_index]

# Predict price
predicted_price = model.predict(sample_data)[0]

print(f"Actual Price: ${actual_price:,.2f}")
print(f"Predicted Price: ${predicted_price:,.2f}")
print(f"Difference: ${abs(actual_price - predicted_price):,.2f}")