***"Dataset: housing.csv"***

**Task 4 Requirements Recap**

- Split into train/test sets
- Fit a linear regression model
- Evaluate using RMSE or R²
- Tools: sklearn, pandas, numpy

**Import Libraries**

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


**Load Dataset**

In [2]:
df = pd.read_csv("housing.csv")
df.head()


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


**Handle Missing Values**

In [5]:
# Fill numeric columns with mean
num_cols = df.select_dtypes(include=['number']).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Fill categorical columns with mode
cat_cols = df.select_dtypes(include=['object']).columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
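The cell above computes the fill values from every row, including rows that later land in the test set. A minimal sketch of one alternative is shown here on a made-up frame: learn the mean and mode from the training rows only, then reuse them for the test rows. The column names `rooms` and `proximity` and the 4/2 row split are assumptions for illustration, not part of housing.csv.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from housing.csv)
toy = pd.DataFrame({
    "rooms": [2.0, np.nan, 4.0, 6.0, np.nan, 8.0],
    "proximity": ["INLAND", None, "INLAND", "NEAR BAY", "NEAR BAY", "INLAND"],
})

# Pretend the first four rows are the training set and the last two the test set
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the fill values from the training rows only
rooms_mean = train["rooms"].mean()             # mean of 2, 4, 6
proximity_mode = train["proximity"].mode()[0]  # most frequent training value

# Apply the same fill values to both parts
for part in (train, test):
    part["rooms"] = part["rooms"].fillna(rooms_mean)
    part["proximity"] = part["proximity"].fillna(proximity_mode)

print(train)
print(test)
```

In this sketch the missing `rooms` value in the test part is filled with 4.0, the training mean, so no information from the test rows feeds into the fill value.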


**Define Features and Target**

In [6]:
# Assuming 'median_house_value' is the target column
X = df.drop('median_house_value', axis=1)

# If categorical column 'ocean_proximity' exists, convert it
if 'ocean_proximity' in X.columns:
    X = pd.get_dummies(X, drop_first=True)

y = df['median_house_value']
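To see what `pd.get_dummies(..., drop_first=True)` does to a categorical column, here is a small sketch on a three-row toy frame (the values are illustrative only). Each category becomes its own indicator column, and the first category in sorted order is dropped so that it acts as the baseline.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values)
toy = pd.DataFrame({
    "median_income": [8.3, 3.8, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# Numeric columns pass through; the object column is one-hot encoded
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
# ['median_income', 'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN']
```

`INLAND` sorts first and is therefore the dropped category; a row with zeros (or `False`) in both remaining indicator columns is an `INLAND` row.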


**Train-Test Split**

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
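As a quick sanity check on what `test_size=0.2` and `random_state=42` do, the following sketch splits a ten-row toy array twice. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same seed
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)     # (8, 2) (2, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```

Fixing `random_state` makes the split repeatable, so the reported metrics do not change between runs.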


**Train Model**

In [8]:
model = LinearRegression()
model.fit(X_train, y_train)
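After fitting, the learned weights are available as `coef_` and `intercept_`. The sketch below illustrates this on synthetic data generated from a known linear rule, so the recovered values can be checked; `X_toy`, `y_toy`, and `toy_model` are hypothetical names used only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known rule: y = 3*x1 - 2*x2 + 5 (no noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# With noise-free data the fit recovers the generating weights
print(toy_model.coef_)       # approximately [ 3. -2.]
print(toy_model.intercept_)  # approximately 5.0
```

On the housing model, the same attributes on `model` can be paired with `X_train.columns` to see which features push the predicted price up or down.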


**Predictions & Evaluation**

In [9]:
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
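# Take the square root of the MSE; the `squared=False` argument is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))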

print(f"R² Score: {r2:.4f}")
print(f"RMSE: {rmse:.2f}")


R² Score: 0.6257
RMSE: 70031.42
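Both metrics can be reproduced by hand, which helps when reading the numbers above. The sketch below uses four made-up values (not housing data) and compares the manual formulas with the scikit-learn functions: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"{rmse_manual:.4f}")  # 0.7071
print(f"{r2_manual:.4f}")    # 0.9000

# The manual values agree with scikit-learn
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))))  # True
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))                       # True
```

RMSE is in the same units as the target, so for this notebook it reads directly as a dollar amount, while R² is unitless.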


**Summary – California Housing Price Prediction**
- Linear regression was used to predict median house values.  
- Missing numeric values were filled with the column mean, and missing categorical values with the column mode.  
- On the 20% test set, the model reached an R² of 0.6257, meaning it explains about 63% of the variance in median house value, with an RMSE of 70031.42, a typical prediction error of roughly $70,000.  