<a href="https://colab.research.google.com/github/Sid44444/001-fundamentals/blob/master/California_housing_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Section 1: Import Libraries ↓**

This code imports the tools used throughout the notebook: pandas for handling and analyzing data, and two scikit-learn components, train_test_split for splitting the dataset and LinearRegression for building the model. In short, it sets up the environment to prepare data, split it into training and testing sets, and train a linear regression model.



In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


**Section 2: Load Sample Data ↓**

This code loads the California housing dataset from scikit-learn as a pandas DataFrame using fetch_california_housing(as_frame=True). It then stores the data in df and displays the first ten rows with df.head(10).

In [None]:

from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)
df = data.frame
df.head(10)


index,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
5,4.0368,52.0,4.761658,1.103627,413.0,2.139896,37.85,-122.25,2.697
6,3.6591,52.0,4.931907,0.951362,1094.0,2.128405,37.84,-122.25,2.992
7,3.12,52.0,4.797527,1.061824,1157.0,1.788253,37.84,-122.25,2.414
8,2.0804,42.0,4.294118,1.117647,1206.0,2.026891,37.84,-122.26,2.267
9,3.6912,52.0,4.970588,0.990196,1551.0,2.172269,37.84,-122.25,2.611


**Section 3: Splitting the data set ↓**

This code selects four columns (MedInc, AveRooms, AveOccup, HouseAge) from the DataFrame as features and MedHouseVal as the target variable. It then splits the data into training and testing sets, using 90% for training and 10% for testing (test_size=0.1), with a fixed random state for reproducibility.

In [None]:

X = df[['MedInc', 'AveRooms', 'AveOccup', 'HouseAge']]  # Features
y = df['MedHouseVal']  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
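
To see what the split produces, here is a minimal sketch using stand-in arrays rather than the real DataFrame. The names X_demo and y_demo are hypothetical, and the row count of 20,640 is chosen to match the size of the California housing frame. It checks the resulting shapes and shows that reusing the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption, not the real dataset): 20,640 rows, 4 feature columns.
n_rows = 20640
X_demo = np.arange(n_rows * 4).reshape(n_rows, 4)
y_demo = np.arange(n_rows)

# Same arguments as the notebook's call: 10% held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)
print("train shape:", X_tr.shape)
print("test shape:", X_te.shape)

# Repeating the call with the same random_state yields an identical split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)
print("identical test rows:", np.array_equal(X_te, X_te2))
```

Changing random_state (or omitting it) would shuffle the rows differently, so the reported metrics could shift slightly from run to run.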


**Section 4: Train model ↓**

This code creates a LinearRegression model and trains it using the training data (X_train and y_train) so it can learn the relationship between the features and the target variable.

In [None]:

model = LinearRegression()
model.fit(X_train, y_train)
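
After fitting, a LinearRegression object exposes the learned weights as coef_ and the bias term as intercept_. The sketch below illustrates this on synthetic, noise-free data (the names X_demo, y_demo, and demo_model are hypothetical and not part of the notebook), where the true relationship is known in advance so the recovered values can be checked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2*x0 - 3*x1 + 1, no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 1.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# One coefficient per feature column, plus a single intercept.
print("coefficients:", demo_model.coef_)
print("intercept:", demo_model.intercept_)
```

In the notebook itself, the same attributes on model could be paired with X.columns to see how much each housing feature contributes to the prediction.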


**Section 5: Make predictions ↓**

This code uses the trained model to predict target values for the test dataset (X_test). It then prints the first five predicted values to give a quick look at the model’s output.


In [None]:

predictions = model.predict(X_test)
print("First 5 predictions:", predictions[:5])


First 5 predictions: [1.05863448 1.50579155 2.33778951 2.68199918 2.09540234]


**Section 6: Checking the mean squared error ↓**

This code imports the mean_squared_error function from scikit-learn and calculates the MSE between the actual test values (y_test) and the model’s predictions, then prints the result to evaluate prediction accuracy.





In [None]:

from sklearn.metrics import mean_squared_error
print("MSE:", mean_squared_error(y_test, predictions))


MSE: 0.6838831803628469
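
To make the metric concrete, the following sketch uses a few made-up values (not taken from the dataset) to show that mean_squared_error matches the mean of the squared differences computed by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only.
y_true_demo = np.array([3.0, 2.5, 4.0])
y_pred_demo = np.array([2.5, 3.0, 4.0])

# Manual MSE: average of the squared differences.
manual_mse = np.mean((y_true_demo - y_pred_demo) ** 2)
library_mse = mean_squared_error(y_true_demo, y_pred_demo)

print("manual MSE:", manual_mse)
print("library MSE:", library_mse)
```

Because the errors are squared before averaging, a single large miss raises the MSE more than several small ones.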


**Section 7: Explanation of the results**

The target in the California housing dataset, MedHouseVal, is the median house value expressed in units of \$100,000. With that in mind, the MSE of about 0.6839 reported above can be read as follows:

MSE is the average squared difference between predicted and actual values.

Because the target is in units of \$100,000, an MSE of 0.6839 corresponds to an average squared error of about 0.6839 × (100,000)^2 in squared dollars, which is hard to interpret directly.

Taking the square root gives the Root Mean Squared Error (RMSE), which is in the same units as the target:

$\sqrt{0.6839} \approx 0.827$

This means the model's typical prediction error is about \$82,700. Because errors are squared before averaging, RMSE weights large misses more heavily than a plain average of absolute errors would.

Context: Median house values in this dataset range from roughly \$15,000 to \$500,000, so an error of about \$82,700 is moderate: not perfect, but reasonable for a simple linear regression model using only four features.
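
The square-root step can be checked directly. This small sketch takes the MSE value printed in Section 6 and converts it to an RMSE in dollars, assuming the target unit of \$100,000 described above.

```python
import math

# MSE as printed by the notebook in Section 6.
mse = 0.6838831803628469

# RMSE is in the same units as the target (hundreds of thousands of dollars).
rmse = math.sqrt(mse)
rmse_dollars = rmse * 100_000

print(f"RMSE: {rmse:.3f} (units of $100,000)")
print(f"RMSE in dollars: ${rmse_dollars:,.0f}")
```

In the notebook, the same value could be obtained by applying math.sqrt to the result of mean_squared_error(y_test, predictions).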