<a href="https://colab.research.google.com/github/TheRealChichi/MLprojects/blob/main/California_House_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("camnugent/california-housing-prices")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/camnugent/california-housing-prices?dataset_version_number=1...


100%|██████████| 400k/400k [00:00<00:00, 13.2MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/camnugent/california-housing-prices/versions/1





In [None]:
import os
import pandas as pd

# Use the first CSV file found in the downloaded dataset directory
for file in os.listdir(path):
    if file.endswith(".csv"):
        dataset_path = os.path.join(path, file)
        break

df = pd.read_csv(dataset_path)
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [None]:
print("\nMissing values:\n", df.isnull().sum())


Missing values:
 longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64


In [None]:
# Fill the 207 missing total_bedrooms values with the column median
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())

In [None]:
print(df.isnull().sum())

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64


In [None]:
print("\nDataset statistics:\n", df.describe())


Dataset statistics:
           longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000   
mean    -119.569704     35.631861           28.639486   2635.763081   
std        2.003532      2.135952           12.585558   2181.615252   
min     -124.350000     32.540000            1.000000      2.000000   
25%     -121.800000     33.930000           18.000000   1447.750000   
50%     -118.490000     34.260000           29.000000   2127.000000   
75%     -118.010000     37.710000           37.000000   3148.000000   
max     -114.310000     41.950000           52.000000  39320.000000   

       total_bedrooms    population    households  median_income  \
count    20640.000000  20640.000000  20640.000000   20640.000000   
mean       536.838857   1425.476744    499.539680       3.870671   
std        419.391878   1132.462122    382.329753       1.899822   
min          1.000000      3.000000      1.000000       0.499900  

### Selecting Features

In [None]:
print(df.columns)

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')


In [None]:
# Define target variable (house price)
target = 'median_house_value'

# Define features (all columns except the target)
features = df.drop(columns = [target])

### Handling categorical features

Machine learning models generally require numerical input, so categorical columns must be converted to numbers.

The "ocean_proximity" column is categorical. We use one-hot encoding, which converts a categorical variable into a series of binary columns, one per category. With drop_first = True, the first category is dropped and serves as the baseline.

In [None]:
features = pd.get_dummies(features, columns = ['ocean_proximity'], drop_first = True)
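To make the effect of the encoding concrete, here is a minimal sketch on a three-row toy frame. The frame `toy` and its values are made up for illustration (only the category labels are borrowed from the dataset), and `dtype=int` is added so the dummy columns print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values)
toy = pd.DataFrame({
    "median_income": [8.3, 3.8, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# drop_first=True drops the alphabetically first category ("INLAND"),
# which becomes the implicit baseline: all of its dummy columns are 0
encoded = pd.get_dummies(
    toy, columns=["ocean_proximity"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
print(encoded)
```

Three categories yield two dummy columns here. The dropped category carries no information of its own, since it is implied whenever the remaining columns are all zero.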

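#### Feature Scaling (Optional)

Some models (e.g., SVR, K-Nearest Neighbors, and linear models trained with regularization or gradient descent) perform better when features are on a similar scale. Tree-based models such as Random Forest and gradient boosting are largely unaffected by it.

Why Scale?

- Prevents features with large ranges from dominating distance and kernel computations (e.g., total_rooms ranges from 2 to 39,320, while median_income averages about 3.87).
- Improves the convergence speed of gradient-based optimization algorithms.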

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
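The cell above fits the scaler on the full dataset, so statistics from rows that later land in the test set influence the transformation. A stricter variant splits first and fits the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data. The array `X` and its value ranges are assumptions chosen to resemble two differently scaled features, not the housing data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two features on very different scales
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(2, 39320, size=1000),  # roughly the range of total_rooms
    rng.uniform(0.5, 15, size=1000),   # an assumed range for an income-like feature
])

# Split first, so the test rows play no part in fitting the scaler
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0).round(3))  # approximately 0 per feature
print(X_train_scaled.std(axis=0).round(3))   # approximately 1 per feature
```

The test split is transformed with the training mean and standard deviation, so its own mean and variance will generally be close to, but not exactly, 0 and 1.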

### 2.4 Splitting the Dataset

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, df[target], test_size=0.2, random_state=42)


print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

Training samples: 16512
Testing samples: 4128


# Step 3: Model Building

Objective:
- Train multiple regression models on the training set.
- Evaluate their performance on the test set.
- Compare results to identify the best model for predicting house prices.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## Models We'll Use:

1. Linear Regression: Simple baseline that models linear relationships.
2. Random Forest Regressor: Ensemble method that averages many decision trees.
3. Gradient Boosting Regressor: Sequential ensemble in which each tree corrects the errors of the previous ones.
4. Support Vector Regressor (SVR): Kernel-based model that is sensitive to the scale of the features and the target.
5. K-Nearest Neighbors (KNN) Regressor: Predicts using the nearest data points.
6. XGBoost Regressor: Optimized, regularized implementation of gradient boosting.
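SVR's sensitivity to scale extends to the target: with default settings (`C=1.0`, `epsilon=0.1`), a target measured in hundreds of thousands can leave the model badly underfit. One possible remedy is to standardize the target as well, for instance with scikit-learn's `TransformedTargetRegressor`. The sketch below uses synthetic data from `make_regression`, with the target multiplied by 1000 as an assumption to mimic dollar-sized values. The names `raw_svr` and `scaled_svr` are hypothetical, and the outcome on the housing data may differ.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data; the target is scaled up to mimic large
# dollar-like values (an illustrative assumption, not the housing data)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
y = y * 1000
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default SVR fitted directly on the large-valued target
raw_svr = make_pipeline(StandardScaler(), SVR())
raw_svr.fit(X_train, y_train)
r2_raw = r2_score(y_test, raw_svr.predict(X_test))

# Same model, but the target is standardized during fit and predictions
# are mapped back to the original units automatically
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_train, y_train)
r2_scaled = r2_score(y_test, scaled_svr.predict(X_test))

print(f"R2 with raw target:    {r2_raw:.3f}")
print(f"R2 with scaled target: {r2_scaled:.3f}")
```

Tuning `C` and `epsilon` to the magnitude of the target is an alternative to transforming it. Either way, the point is that the defaults assume a target of roughly unit scale.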





### 3.2 Training the Models

In [None]:
# Create a dictionary of models, each associated with a different algorithm
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "XGBoost Regressor": XGBRegressor(random_state=42)
}

# Dictionary to store model performance
model_performance = {}

# Train and evaluate each model
for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Evaluate model performance
    mae = mean_absolute_error(y_test, y_pred)
    # mean_squared_error returns the MSE; take the square root to get the RMSE
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    r2 = r2_score(y_test, y_pred)

    # Store the model performance inside the dictionary
    model_performance[model_name] = {
        "MAE": mae,
        "RMSE": rmse,
        "R2": r2
    }

# Display the performance of each model, best (lowest RMSE) first
performance_df = pd.DataFrame(model_performance).T  # .T is a shortcut for .transpose()
performance_df.sort_values(by="RMSE", ascending=True, inplace=True)
print(performance_df)

                                   MAE      RMSE        R2
XGBoost Regressor         31645.725504   47309.1  0.829202
Random Forest             31642.757536   49034.6  0.816516
Gradient Boosting         38248.031950   55884.7  0.761670
K-Nearest Neighbors       40855.347432   61363.2  0.712651
Linear Regression         50670.738241   70060.5  0.625424
Support Vector Regressor  86961.276982  116858.6 -0.042112


**Summary:**
- Trained multiple models on the training set.
- Evaluated each model using MAE, RMSE, and R² Score.
- Ranked models based on performance metrics.
- Identified the best model for our use case (XGBoost Regressor in this run: lowest RMSE and highest R², with Random Forest close behind and marginally ahead on MAE).
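For reference, the three metrics can be computed by hand on a tiny example. The sketch below uses made-up values (`y_true` and `y_hat` are hypothetical) and cross-checks the results against scikit-learn. Note that `mean_squared_error` returns the squared error, so RMSE needs an explicit square root to be in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tiny made-up example: three true values and three predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

errors = y_hat - y_true                          # [-1, 0, 2]
mae = np.mean(np.abs(errors))                    # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)                       # (1 + 0 + 4) / 3, about 1.667
rmse = np.sqrt(mse)                              # about 1.291, in target units
ss_res = np.sum(errors ** 2)                     # residual sum of squares = 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 8
r2 = 1 - ss_res / ss_tot                         # 1 - 5/8 = 0.375

# Cross-check the manual values against scikit-learn
print(mae, mean_absolute_error(y_true, y_hat))
print(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2, r2_score(y_true, y_hat))
```

An R² below zero, as SVR shows in the table above, means the model's squared error exceeds that of always predicting the mean of the test targets.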
