TASK 3: Housing Price Prediction
This notebook covers the following steps:

1. Load and inspect the California Housing dataset.
2. Split the dataset into training and testing sets.
3. Scale the features using StandardScaler.
4. Train a RandomForestRegressor to predict median house values.
5. Evaluate the model using MSE and the R² score.
6. Predict prices for new data.
7. Analyze feature importance.

In [None]:
# Import libraries
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


In [None]:
# Load dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target


In [None]:
# Inspect dataset
print(df.head(10))


   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   
5  4.0368      52.0  4.761658   1.103627       413.0  2.139896     37.85   
6  3.6591      52.0  4.931907   0.951362      1094.0  2.128405     37.84   
7  3.1200      52.0  4.797527   1.061824      1157.0  1.788253     37.84   
8  2.0804      42.0  4.294118   1.117647      1206.0  2.026891     37.84   
9  3.6912      52.0  4.970588   0.990196      1551.0  2.172269     37.84   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  
2    -122.24        3.521  
3    -122.25        3.413  
4    -122.25        3.4

In [None]:
# info() prints its summary directly and returns None, so it needs no print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [None]:
print(df.describe())

             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude   MedHouseVal  
count  20640.000000  20640.000000  20640.000000  20640.000000  
mean       3.070655     35.631861   -119.569704      2.068558  
std       10.386050      2.135952      2.003532      1.153956  
min        0.692308     32.54000

In [None]:
# Split features and target
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']


In [None]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Feature scaling: fit the scaler on the training set only, then reuse it on the test set to avoid leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
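
The scaling and training steps above can also be bundled into a single estimator, which removes the risk of forgetting to scale new inputs before predicting. The following is a minimal sketch, not the notebook's method: it assumes synthetic data from make_regression instead of the housing data, and the names X_demo, y_demo, and pipeline are hypothetical. Tree-based models split on thresholds, so their predictions are generally unaffected by standardizing the features; the scaler is kept here only because the task asks for it.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption: 8 numeric features)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only and
# applies the same transformation automatically at predict time
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=42),
)
pipeline.fit(X_tr, y_tr)

demo_pred = pipeline.predict(X_te)
print(demo_pred.shape)
```

With this layout, a call to pipeline.predict on raw feature values would take the place of the separate scaler.transform and model.predict calls.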

In [None]:
# Predictions
y_pred = model.predict(X_test_scaled)


In [None]:
# Evaluate model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.3f}")
print(f"R^2 Score: {r2:.3f}")

Mean Squared Error: 0.255
R^2 Score: 0.805
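
Because MedHouseVal is expressed in units of $100,000, the MSE above is in squared units and is hard to read directly. Taking the square root gives an RMSE in the same units as the target. The sketch below only reuses the printed MSE of 0.255; the variable names are hypothetical.

```python
import math

# MSE as printed above, in squared units of $100,000
reported_mse = 0.255

# RMSE is in the same units as the target ($100,000)
rmse = math.sqrt(reported_mse)
rmse_dollars = rmse * 100_000

print(f"RMSE: {rmse:.3f} (roughly ${rmse_dollars:,.0f})")
```

Read this way, the typical prediction error is on the order of $50,000.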


Predict House Prices for New Data (Sample Prediction)

In [None]:
# Create a sample house (same order as feature names)
# Features:
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
#  'Population', 'AveOccup', 'Latitude', 'Longitude']

sample_house = pd.DataFrame([{
    'MedInc': 4.5,
    'HouseAge': 20,
    'AveRooms': 5.0,
    'AveBedrms': 1.0,
    'Population': 1500,
    'AveOccup': 3.0,
    'Latitude': 34.05,
    'Longitude': -118.25
}])

# Scale the sample
sample_scaled = scaler.transform(sample_house)

# Predict price (the target is in units of $100,000, hence the multiplication below)
predicted_price = model.predict(sample_scaled)

print(f"Predicted Median House Value: ${predicted_price[0]*100000:.2f}")


Predicted Median House Value: $202292.00


In [None]:
# Feature importance
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance_df)



Feature Importance:
      Feature  Importance
0      MedInc    0.524871
5    AveOccup    0.138443
6    Latitude    0.088936
7   Longitude    0.088629
1    HouseAge    0.054593
2    AveRooms    0.044272
4  Population    0.030650
3   AveBedrms    0.029606
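
The feature_importances_ values above are impurity-based and computed from the training data, so a second opinion can be useful. One option is permutation importance on held-out data, which measures how much the score drops when a feature's values are shuffled. This is a minimal sketch under stated assumptions: it uses synthetic data from make_regression, not the housing data, and the names X_demo, forest, and perm are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, of which only 2 carry signal (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and record the drop in R^2
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

# Report features from most to least important
for idx in perm.importances_mean.argsort()[::-1]:
    print(
        f"feature_{idx}: {perm.importances_mean[idx]:.3f} "
        f"+/- {perm.importances_std[idx]:.3f}"
    )
```

If both measures rank the same features at the top, that lends some support to the ranking; disagreement would suggest looking more closely at correlated features.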


In [None]:
# Multiple house predictions
new_houses = pd.DataFrame([
    {
        'MedInc': 6.0, 'HouseAge': 15, 'AveRooms': 6,
        'AveBedrms': 1.2, 'Population': 1200,
        'AveOccup': 2.8, 'Latitude': 37.77, 'Longitude': -122.41
    },
    {
        'MedInc': 2.5, 'HouseAge': 35, 'AveRooms': 4,
        'AveBedrms': 1.1, 'Population': 3000,
        'AveOccup': 4.2, 'Latitude': 32.71, 'Longitude': -117.16
    }
])

# Scale and predict
new_houses_scaled = scaler.transform(new_houses)
predictions = model.predict(new_houses_scaled)

for i, price in enumerate(predictions):
    print(f"House {i+1} Predicted Price: ${price*100000:.2f}")


House 1 Predicted Price: $302241.06
House 2 Predicted Price: $111780.00


In [None]:
# Compare actual vs predicted values for the first 10 test samples (units of $100,000)
comparison_df = pd.DataFrame({
    'Actual Price': y_test.values[:10],
    'Predicted Price': y_pred[:10]
})

print("\nActual vs Predicted Prices:")
print(comparison_df)



Actual vs Predicted Prices:
   Actual Price  Predicted Price
0       0.47700         0.509500
1       0.45800         0.741610
2       5.00001         4.923257
3       2.18600         2.529610
4       2.78000         2.273690
5       1.58700         1.645060
6       1.98200         2.376610
7       1.57500         1.670270
8       3.40000         2.772971
9       4.46600         4.913459
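
To put the table above in perspective, the per-row errors can be summarized as a mean absolute error. The sketch below copies the ten rows shown above by hand (it does not recompute them from the model) and uses hypothetical variable names; the result describes only these ten rows, not the whole test set.

```python
# First 10 actual and predicted values, copied from the table above
# (units of $100,000)
actual = [0.47700, 0.45800, 5.00001, 2.18600, 2.78000,
          1.58700, 1.98200, 1.57500, 3.40000, 4.46600]
predicted = [0.509500, 0.741610, 4.923257, 2.529610, 2.273690,
             1.645060, 2.376610, 1.670270, 2.772971, 4.913459]

# Absolute error for each row
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# Mean absolute error over these rows
mae = sum(abs_errors) / len(abs_errors)

# Index of the row with the largest absolute error
worst = max(range(len(abs_errors)), key=abs_errors.__getitem__)

print(f"MAE on these 10 rows: {mae:.3f} (roughly ${mae * 100_000:,.0f})")
print(f"Largest error: row {worst}, off by {abs_errors[worst]:.3f}")
```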
