# Project: Predicting house prices

- Date: July 7, 2025

- Data: The data used in this project come from the 1990 U.S. census. They are available in the `datasets` module of sklearn, and we use the function `fetch_california_housing` to download them.

- Description: The goal of this project is to build models able to predict house prices.


## Downloading the data

In [6]:
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
dataset = fetch_california_housing(as_frame=True)

# DataFrame of the features
X = dataset.data

# Prices
y = dataset.target

# Display of the data
print("===Data===\n",X)
print("===Target===\n",y)

===Data===
        MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   
...       ...       ...       ...        ...         ...       ...       ...   
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48   
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49   
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43   
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43   
20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37   

       Longitude  
0       

We can notice that the target values look too low for house prices. This is expected, because the prices were divided by $100,000. So 4.5 actually means $450,000.
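To make the scale explicit, here is a minimal sketch of the conversion back to dollars. The names `UNIT` and `to_dollars` are illustrative assumptions, not part of sklearn or of the dataset.

```python
# The target is expressed in units of $100,000.
UNIT = 100_000  # dollars per target unit (hypothetical name, for illustration)

def to_dollars(target_value):
    """Convert a target value (or an error expressed in target units) to dollars."""
    return target_value * UNIT

print(f"${to_dollars(4.5):,.0f}")  # prints $450,000
```

The same conversion applies to any error expressed in target units, such as a root mean squared error.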

In [7]:
print(dataset)

{'data':        MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   
...       ...       ...       ...        ...         ...       ...       ...   
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48   
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49   
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43   
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43   
20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37   

       Longitude  
0        -1

## Importing the modules
It is important to notice that there is a big difference between regression and classification.
- Regression: predicts numbers (real values) such as temperatures or prices.
- Classification: predicts categories or labels such as the type of a flower, a disease, or cat versus dog.

Since house prices are real values, we will use regressors only.
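As a quick illustration of the distinction, sklearn's `type_of_target` utility reports how it interprets a target. This is only a sketch: the values in `prices` and `species` are made up for the example and are not taken from the dataset.

```python
from sklearn.utils.multiclass import type_of_target

# Made-up targets, for illustration only
prices = [4.5, 1.25, 2.75]                       # real values -> regression
species = ["setosa", "virginica", "versicolor"]  # labels -> classification

print(type_of_target(prices))   # continuous
print(type_of_target(species))  # multiclass
```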

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split,KFold, StratifiedKFold
from sklearn.metrics import mean_squared_error, r2_score
from numpy import sqrt

In [10]:
# Splitting the data
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.8)
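
The split above draws a different random partition on each run, so the metrics reported later can change between executions. A minimal sketch of a repeatable split follows, assuming a fixed `random_state` is acceptable here. The value 42 is arbitrary, and `X_demo` / `y_demo` are small synthetic stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for X and y (assumption: not the real dataset)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state (42 is an arbitrary choice) makes the split repeatable
split_a = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)
split_b = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

# Both calls select exactly the same test rows
print(np.array_equal(split_a[1], split_b[1]))  # True
```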

# Training and evaluation

In [11]:
# Initialization of the models
reg_log=LinearRegression()  # note: a linear (not logistic) regression, despite the variable name
reg_tree=DecisionTreeRegressor()
reg_nn=KNeighborsRegressor()
reg_svr=SVR(kernel='rbf')

#Training the models
reg_log.fit(X_train,y_train)
reg_tree.fit(X_train,y_train)
reg_nn.fit(X_train,y_train)
reg_svr.fit(X_train,y_train)

#Predictions
y_pred_log  = reg_log.predict(X_test)
y_pred_tree = reg_tree.predict(X_test)
y_pred_nn   = reg_nn.predict(X_test)
y_pred_svr  = reg_svr.predict(X_test)

In [None]:
def printMetrics(method_title, y_predict, y_true):
    print(method_title) 
    print("\tRoot MSE       :", sqrt(mean_squared_error(y_true, y_predict)))
    print("\tR² score  :", r2_score(y_true, y_predict))

printMetrics("Decision Tree: ", y_pred_tree, y_test)
printMetrics("Linear regression: ", y_pred_log, y_test)
printMetrics("KNearest Neighbor: ", y_pred_nn, y_test)
printMetrics("SVR (kernel=rbf): ", y_pred_svr, y_test)


Decision Tree: 
	Root MSE       : 0.70364246988449
	R² score  : 0.6207378258250453
Linear regression: 
	Root MSE       : 0.707299774222593
	R² score  : 0.6167850172834648
KNearest Neighbor: 
	Root MSE       : 1.0571715536327777
	R² score  : 0.1438962904581068
SVR (kernel=rbf): 
	Root MSE       : 1.1524100957741938
	R² score  : -0.017301184608337294


# Interpretation
We can see that none of these models performs really well:

- The SVR model (with the default parameters):

    It has the largest error among the models chosen in this project, and its R² score is negative, which means it does worse than a model that always predicts the average. Its RMSE is about 1.15. Since the prices in the dataset were divided by 100,000, that is a typical difference of about $115,000 between the prediction and the real value of the house, which is too much. This model is clearly not good enough with these parameters. We may need to change some of them (such as C or gamma) to get a better model. Also, according to ChatGPT, the SVM is not the first choice for a regression like this one.

- The K-nearest neighbors model (with the default of 5 neighbors):

    It also has a large error. Its RMSE is about 1.06, a typical difference of about $106,000 from the real price, which is not acceptable. Its R² score is about 0.14, close to 0, so this model is almost as useless as the first one: it is only slightly better than always predicting the average.

- The linear regression model (`LinearRegression`, the `reg_log` model above):

    It has a high error, even if it is lower than for the first two models. Its RMSE of about 0.71 means a typical difference of about $71,000 from the real prices, which is still not acceptable. However, its R² score of about 0.62 is closer to 1: the model explains roughly 62% of the variance in the prices.

- The decision tree regressor:

    It still has a high error, like the linear regression model, with a typical difference of about $70,000 from the real prices. Its R² score of about 0.62 is among the best we have here, like the previous one: it also explains roughly 62% of the variance in the prices.
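
Besides the hyperparameters, one possible reason for the weak SVR and K-nearest neighbors scores is that both models rely on distances between samples, while the features here sit on very different scales (for example `Population` in the hundreds or thousands versus `AveBedrms` around 1). The sketch below is an assumption about how this could be explored, not something tested in this project: it standardizes the features inside a pipeline and searches a small, arbitrary grid of `C` and `gamma` values. It runs on synthetic stand-in data (`X_demo`, `y_demo`) so that it stays quick.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 1000 + rng.normal(0, 0.1, 200)

# Scale the features, then fit the SVR; the scaler is refit inside each CV fold
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Illustrative grid for C and gamma (values are arbitrary starting points)
param_grid = {"svr__C": [1, 10], "svr__gamma": ["scale", 0.1]}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

The `svr__` prefix comes from the step name that `make_pipeline` derives from the class name. The same pattern would apply to `KNeighborsRegressor` with its `n_neighbors` parameter.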


Since the decision tree was one of the best models among those tested, we are now going to try a model built from many decision trees: the random forest.

Let's see what we can have.

In [None]:
#We try another model here : RandomForest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


reg_for=RandomForestRegressor().fit(X_train,y_train)
y_pred_for=reg_for.predict(X_test)
printMetrics("Random Forest(default number of estimators) :",y_pred_for, y_test)


Random Forest :
	Root MSE       : 0.493505261144027
	R² score  : 0.8134397927603711


The random forest model (with the default parameters) takes more time to train but is the best of all the models built so far. Its R² score (about 0.81) is closer to 1 and its RMSE is lower: a typical difference of about $49,000 from the real price, which is noticeably better.

Let's see if we can improve the score by changing the parameter values.

In [16]:
import pandas as pd

param_grid = {'n_estimators': [10, 50, 100, 200]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# All results
results = pd.DataFrame(grid_search.cv_results_)

# Display
print(results[['param_n_estimators', 'mean_test_score', 'std_test_score']])

# We keep the best model
best_Rdfor=grid_search.best_estimator_

#Estimation
y_pred_best_Rdfor =best_Rdfor.predict(X_test)
printMetrics("Best Random Forest between estimator=[10,50,100,200] :", y_pred_best_Rdfor , y_test)

   param_n_estimators  mean_test_score  std_test_score
0                  10         0.778720        0.005599
1                  50         0.795469        0.002391
2                 100         0.799775        0.003571
3                 200         0.799646        0.003398
Best Random Forest between estimator=[10,50,100,200] :
	Root MSE       : 0.4948874852170567
	R² score  : 0.8123932826134437


We can notice that the score (`mean_test_score`, the mean R² over the 5 folds) increases with the number of estimators up to 100, then levels off.

The difference in score between 100 and 200 estimators is very small (200 is even marginally lower), so it seems better to keep 100 estimators instead of 200 to reduce the running time. We also notice that there is no significant difference between the best model here and the random forest with default parameters. This is expected: the default value of `n_estimators` is 100, which is also the value with the best mean score in the grid.
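
Which setting the search actually selected can be read from the fitted object instead of the results table. The sketch below shows the `best_params_` and `best_score_` attributes of `GridSearchCV`. It uses synthetic stand-in data and a reduced grid (both assumptions, chosen to keep the run short), so its numbers are unrelated to the results above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), small so the search runs quickly
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {"n_estimators": [10, 50]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

# best_params_ holds the winning setting; best_score_ is its mean CV score
# (R² by default for regressors, since no scoring argument is given)
print(search.best_params_)
print(search.best_score_)
```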

In [18]:
# Try the last model with different parameters to see what the result will be: done
# Also do the cross-validation: done
# Next project: see if I could work on unsupervised learning, for a change.

# Cross Validation

Now we want to use cross validation to re-evaluate our models with the cross_val_score function.

In [19]:
# TODO: try an explicit KFold splitter afterwards (StratifiedKFold only supports classification targets).
from sklearn.model_selection import cross_val_score
import numpy as np

# We put the models and their scores in dictionaries.
models={"Linear Regression :":reg_log, "Decision Tree :":reg_tree, "K-Nearest Neighbors :": reg_nn,"SVR :": reg_svr, "Random Forest :": reg_for}
scores={"Linear Regression :": 0, "Decision Tree :": 0, "K-Nearest Neighbors :": 0,"SVR :": 0, "Random Forest :": 0}

# We fill the scores
for key in models:
    scores[key]=cross_val_score(models[key],X,y,cv=5, scoring="r2")

# We display the mean score of each model
for key in models:
    print(key, np.mean(scores[key]))



Linear Regression : 0.553031114027957
Decision Tree : 0.3478390630692032
K-Nearest Neighbors : 0.002334523135833111
SVR : -0.1101188223291046
Random Forest : 0.6532313714333096


We can notice that with 5-fold cross-validation on the whole dataset, every model scores lower than on the single train/test split. The only models with a mean R² closer to 1 than to 0 are the linear regression (about 0.55) and the random forest (about 0.65).
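
One possible explanation for the drop, not verified in this notebook, is the way the folds are built. With an integer `cv`, `cross_val_score` uses an unshuffled `KFold` for regressors, so each fold is a contiguous block of rows, whereas `train_test_split` shuffles by default. If the rows of the dataset are ordered in some way, each block can look quite different from the rest. The sketch below illustrates the effect on synthetic data sorted by its single feature (`X_demo`, `y_demo` are assumptions, not the real data).

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): rows sorted by the single feature
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + rng.normal(0, 0.1, 200)

model = KNeighborsRegressor()

# cv=5 means contiguous, unshuffled folds for a regressor
plain = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

# Shuffling before splitting mixes all regions of the data into every fold
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = cross_val_score(model, X_demo, y_demo, cv=shuffled_cv, scoring="r2")

print("unshuffled mean R²:", plain.mean())
print("shuffled mean R²:  ", shuffled.mean())
```

Passing a `KFold` object with `shuffle=True` to the `cv` argument would be a direct way to check whether row ordering explains the scores above.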

## Conclusion

In conclusion, either the models used in this project (with their default settings) are not well suited to this regression task, which is a supervised learning problem, or the problem comes from the dataset itself.

According to ChatGPT, the 3 best models for a regression are:
- Gradient Boosting
- Random Forest Regressor (we used it and it was our best model among those tested)
- Lasso Regression

We'll be using them in another project.