<a href="https://colab.research.google.com/github/Avazjon-Isoboev/Housing-price-prediction-model/blob/main/Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import sklearn  # the scikit-learn library

In [2]:
# Specify the URL where the online dataset is hosted
URL = "https://github.com/ageron/handson-ml2/blob/master/datasets/housing/housing.csv?raw=true"
df = pd.read_csv(URL)

**We split the data into training and test sets**

In [3]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

X_train = train_set.drop("median_house_value", axis=1)
y = train_set["median_house_value"].copy()

X_num = X_train.drop("ocean_proximity", axis=1)
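
The effect of `test_size` and `random_state` can be checked on a small example. The sketch below uses a made-up ten-row DataFrame (the column names `feature` and `target` are illustrative, not part of the housing data) and shows that the same `random_state` yields the same split every time.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing DataFrame (values are made up for illustration).
toy = pd.DataFrame({"feature": np.arange(10), "target": np.arange(10) * 2.0})

# The same random_state puts the same rows in the test set on every run.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))           # sizes of the two parts
print(test_a.index.equals(test_b.index))   # identical test rows in both calls
```

Fixing `random_state` keeps the test set stable between runs, so results from different models stay comparable.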

## Let's build a pipeline

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin
# indices of the columns we need
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # our class is only a transformer, not an estimator
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:  # the bedrooms_per_room column is optional
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
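
The transformer addresses columns by integer position because, inside the pipeline, it receives a NumPy array from `SimpleImputer` rather than a DataFrame. As a minimal sketch of what `transform` computes, the same arithmetic can be run on a small made-up array (the two rows below are illustrative values, not rows from the dataset):

```python
import numpy as np

# Column positions follow the notebook: total_rooms=3, total_bedrooms=4,
# population=5, households=6. The values below are made up.
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

X = np.array([
    [-122.0, 37.0, 20.0, 1000.0, 200.0, 500.0, 250.0, 3.5],
    [-118.0, 34.0, 35.0, 600.0, 150.0, 900.0, 300.0, 2.0],
])

rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
population_per_household = X[:, population_ix] / X[:, households_ix]
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]

# np.c_ appends the 1-D ratio arrays as new columns on the right.
X_extra = np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]

print(X_extra.shape)     # 8 original columns plus 3 ratio columns
print(X_extra[:, -3:])   # the three added ratio columns
```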

#### For numeric columns

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = Pipeline([
          ('imputer', SimpleImputer(strategy='median')),
          ('attribs_adder', CombinedAttributesAdder(add_bedrooms_per_room = True)),
          ('std_scaler', StandardScaler())             
])

#### For text (categorical) columns

In [6]:
from sklearn.compose import ColumnTransformer

num_attribs = list(X_num)
cat_attribs = ['ocean_proximity']

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs)
])

Here is the final, complete pipeline (`full_pipeline`).

Calling its `.fit_transform()` method fits every step on the training data and returns the transformed data.

In [7]:
X_prepared = full_pipeline.fit_transform(X_train)
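
The prepared data has 16 columns: 8 numeric columns, 3 added ratio columns, and 5 one-hot columns for the 5 `ocean_proximity` categories. The sketch below reproduces the same idea on a made-up four-row DataFrame (the column names `rooms`, `income`, `proximity` and all values are assumptions for illustration; the custom attribute adder is left out to keep it short).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing numeric value and three categories.
toy = pd.DataFrame({
    "rooms": [4.0, np.nan, 6.0, 8.0],
    "income": [2.0, 3.0, 4.0, 5.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill the NaN with the median
    ("std_scaler", StandardScaler()),               # zero mean, unit variance
])

toy_pipeline = ColumnTransformer(
    [
        ("num", num_pipe, ["rooms", "income"]),
        ("cat", OneHotEncoder(), ["proximity"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

prepared = toy_pipeline.fit_transform(toy)
print(prepared.shape)    # 2 numeric columns plus 3 one-hot columns
print(prepared[:, 2:])   # one-hot part, categories in sorted order
```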

In [8]:
X_prepared[0:5,:]

array([[ 1.27258656, -1.3728112 ,  0.34849025,  0.22256942,  0.21122752,
         0.76827628,  0.32290591, -0.326196  , -0.17491646,  0.05137609,
        -0.2117846 ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [ 0.70916212, -0.87669601,  1.61811813,  0.34029326,  0.59309419,
        -0.09890135,  0.6720272 , -0.03584338, -0.40283542, -0.11736222,
         0.34218528,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.44760309, -0.46014647, -1.95271028, -0.34259695, -0.49522582,
        -0.44981806, -0.43046109,  0.14470145,  0.08821601, -0.03227969,
        -0.66165785,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [ 1.23269811, -1.38217186,  0.58654547, -0.56148971, -0.40930582,
        -0.00743434, -0.38058662, -1.01786438, -0.60001532,  0.07750687,
         0.78303162,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.10855122,  0.5320839 ,  1

The data is ready for ML.

Machine Learning:
Our goal is prediction, and several ML algorithms can be used for it.

#### Linear Regression
We create a new model from the `LinearRegression` class in `sklearn`.

In [10]:
from sklearn.linear_model import LinearRegression

LR_model = LinearRegression()

`LinearRegression` is an estimator. Estimators receive data and _learn_ to fit it through the `.fit()` method (this is the _learning_ in machine learning).

In [11]:
LR_model.fit(X_prepared, y)

LinearRegression()

Done! Training is complete. With just three lines of code, we have taught a computer to predict house prices.

How can we test the model? Let's feed a few rows from the housing dataset to the model and compare its predictions with the known values (labels).

In [13]:
# we randomly extract 5 rows
test_data = X_train.sample(5)
test_data

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 11848 | -120.74 | 39.82 | 9.0 | 1955.0 | 398.0 | 294.0 | 122.0 | 3.9583 | INLAND |
| 12960 | -121.29 | 38.71 | 32.0 | 1875.0 | 361.0 | 1027.0 | 343.0 | 3.5769 | INLAND |
| 1880 | -119.98 | 38.94 | 25.0 | 1339.0 | 328.0 | 503.0 | 219.0 | 1.9018 | INLAND |
| 18118 | -122.02 | 37.31 | 33.0 | 2563.0 | 434.0 | 1230.0 | 418.0 | 6.3197 | <1H OCEAN |
| 8340 | -118.32 | 33.94 | 37.0 | 2740.0 | 504.0 | 1468.0 | 479.0 | 4.5368 | <1H OCEAN |


In [14]:
# extract the prices corresponding to the rows above (we need to predict exactly these values)
test_label = y.loc[test_data.index]
test_label

11848    126500.0
12960    103800.0
1880     109700.0
18118    340100.0
8340     168800.0
Name: median_house_value, dtype: float64

We pass `test_data` through the pipeline to bring it into the form the model expects.

Note that this time we call `.transform()` rather than `.fit_transform()`, because the pipeline has already been fitted on the training data.
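
The difference between the two calls can be seen with a single `StandardScaler`. This is a minimal sketch on made-up numbers: `transform()` reuses the mean and standard deviation learned from the training rows, while a fresh `fit_transform()` would re-estimate them from the new rows and scale them differently.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # made-up training column
new = np.array([[2.0], [4.0]])           # made-up new rows

scaler = StandardScaler()
scaler.fit(train)  # learns the mean and standard deviation of the training data

# transform() reuses the stored training statistics ...
new_scaled = scaler.transform(new)

# ... whereas fit_transform() on a fresh scaler re-estimates them from the new rows.
refit_scaled = StandardScaler().fit_transform(new)

print(scaler.mean_)          # mean learned from the training data
print(new_scaled.ravel())    # new rows scaled with the training statistics
print(refit_scaled.ravel())  # new rows scaled with their own statistics
```

Only the first variant keeps new data on the same scale the model saw during training.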


In [15]:
test_data_prepared = full_pipeline.transform(test_data)
test_data_prepared

array([[-5.77240575e-01,  1.95490481e+00, -1.55595157e+00,
        -3.15925145e-01, -3.35319155e-01, -9.95981657e-01,
        -9.92205122e-01,  4.07227528e-02,  4.43569878e+00,
        -5.93454635e-02, -1.59972930e-01,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-8.51473708e-01,  1.43538796e+00,  2.69138504e-01,
        -3.52713843e-01, -4.23625823e-01, -3.51315122e-01,
        -4.12086284e-01, -1.59567520e-01,  1.30847606e-02,
        -8.87792407e-03, -3.50463305e-01,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-1.98300245e-01,  1.54303560e+00, -2.86323693e-01,
        -5.99198121e-01, -5.02385824e-01, -8.12168143e-01,
        -7.37582826e-01, -1.03923778e+00,  2.84388010e-01,
        -6.91078206e-02,  5.53525401e-01,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-1.21545587e+00,  7.80141492e

## Prediction

In [16]:
predicted_data = LR_model.predict(test_data_prepared)
predicted_data

array([120505.19492933, 130911.50784911,  34424.41818596, 336084.05952571,
       253231.02454541])

The values above are the model's predictions. Let's compare them with the real values:

In [23]:
pd.DataFrame({'Prediction':predicted_data, 'Real price': test_label})

| | Prediction | Real price |
|---|---|---|
| 11848 | 120505.194929 | 126500.0 |
| 12960 | 130911.507849 | 103800.0 |
| 1880 | 34424.418186 | 109700.0 |
| 18118 | 336084.059526 | 340100.0 |
| 8340 | 253231.024545 | 168800.0 |


### STEP 5. Let's evaluate the model

As you can see, the model's error is small for some rows and large for others.
But 5 rows are not enough to evaluate the accuracy of the model, and these rows were sampled from the training set itself. Let's test it on the test set we set aside earlier:

In [24]:
test_set

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20046 | -119.01 | 36.06 | 25.0 | 1505.0 | NaN | 1392.0 | 359.0 | 1.6812 | 47700.0 | INLAND |
| 3024 | -119.46 | 35.14 | 30.0 | 2943.0 | NaN | 1565.0 | 584.0 | 2.5313 | 45800.0 | INLAND |
| 15663 | -122.44 | 37.80 | 52.0 | 3830.0 | NaN | 1310.0 | 963.0 | 3.4801 | 500001.0 | NEAR BAY |
| 20484 | -118.72 | 34.28 | 17.0 | 3051.0 | NaN | 1705.0 | 495.0 | 5.7376 | 218600.0 | <1H OCEAN |
| 9814 | -121.93 | 36.62 | 34.0 | 2351.0 | NaN | 1063.0 | 428.0 | 3.7250 | 278000.0 | NEAR OCEAN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15362 | -117.22 | 33.36 | 16.0 | 3165.0 | 482.0 | 1351.0 | 452.0 | 4.6050 | 263300.0 | <1H OCEAN |
| 16623 | -120.83 | 35.36 | 28.0 | 4323.0 | 886.0 | 1650.0 | 705.0 | 2.7266 | 266800.0 | NEAR OCEAN |
| 18086 | -122.05 | 37.31 | 25.0 | 4111.0 | 538.0 | 1585.0 | 568.0 | 9.2298 | 500001.0 | <1H OCEAN |
| 2144 | -119.76 | 36.77 | 36.0 | 2507.0 | 466.0 | 1227.0 | 474.0 | 2.7850 | 72300.0 | INLAND |


In [25]:
X_test = test_set.drop('median_house_value', axis=1)
X_test

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 20046 | -119.01 | 36.06 | 25.0 | 1505.0 | NaN | 1392.0 | 359.0 | 1.6812 | INLAND |
| 3024 | -119.46 | 35.14 | 30.0 | 2943.0 | NaN | 1565.0 | 584.0 | 2.5313 | INLAND |
| 15663 | -122.44 | 37.80 | 52.0 | 3830.0 | NaN | 1310.0 | 963.0 | 3.4801 | NEAR BAY |
| 20484 | -118.72 | 34.28 | 17.0 | 3051.0 | NaN | 1705.0 | 495.0 | 5.7376 | <1H OCEAN |
| 9814 | -121.93 | 36.62 | 34.0 | 2351.0 | NaN | 1063.0 | 428.0 | 3.7250 | NEAR OCEAN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15362 | -117.22 | 33.36 | 16.0 | 3165.0 | 482.0 | 1351.0 | 452.0 | 4.6050 | <1H OCEAN |
| 16623 | -120.83 | 35.36 | 28.0 | 4323.0 | 886.0 | 1650.0 | 705.0 | 2.7266 | NEAR OCEAN |
| 18086 | -122.05 | 37.31 | 25.0 | 4111.0 | 538.0 | 1585.0 | 568.0 | 9.2298 | <1H OCEAN |
| 2144 | -119.76 | 36.77 | 36.0 | 2507.0 | 466.0 | 1227.0 | 474.0 | 2.7850 | INLAND |


Let's extract the label column (`median_house_value`).

In [26]:
y_test = test_set['median_house_value'].copy()
y_test

20046     47700.0
3024      45800.0
15663    500001.0
20484    218600.0
9814     278000.0
           ...   
15362    263300.0
16623    266800.0
18086    500001.0
2144      72300.0
3665     151500.0
Name: median_house_value, Length: 4128, dtype: float64

We also pass `X_test` through the pipeline:

In [27]:
X_test_prepared = full_pipeline.transform(X_test)

## Prediction

In [28]:
y_predicted = LR_model.predict(X_test_prepared)

We use the root mean squared error (RMSE), which we saw in the previous section, to compare the predictions with the real values:

In [29]:
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(y_test, y_predicted)
# compute the RMSE
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

72701.32600762138


So the RMSE came out to about $72,701. Not bad, but not good either: the model's price estimates are typically off by roughly $72,700.
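
RMSE is the square root of the mean of the squared errors, so large errors weigh more heavily than in a plain average of absolute errors. As a minimal sketch, here it is computed by hand with NumPy for the five training rows compared earlier (this is only an illustration on those five rows, not the test-set result):

```python
import numpy as np

# The five predictions and real prices from the comparison table shown earlier.
predicted = np.array([120505.194929, 130911.507849, 34424.418186,
                      336084.059526, 253231.024545])
real = np.array([126500.0, 103800.0, 109700.0, 340100.0, 168800.0])

errors = predicted - real
rmse = np.sqrt(np.mean(errors ** 2))  # root of the mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error, for comparison

print(rmse, mae)
```

The RMSE is noticeably larger than the mean absolute error here because two of the five rows have very large errors.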

There is no single, universal way to improve model accuracy. Things you can try:
- Finding better hyperparameters
- Choosing a better model (algorithm)
- Collecting more data, etc.
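
For the first option, one possible approach is a grid search over hyperparameters with `GridSearchCV`. The notebook itself does not do this; the sketch below is an illustration on synthetic data, and the parameter grid is an arbitrary assumption rather than a recommendation for the housing data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Hypothetical grid: every combination is evaluated with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)            # best combination found in the grid
print(np.sqrt(-search.best_score_))   # its cross-validated RMSE
```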

We will try another model now.

### DecisionTree

In [30]:
from sklearn.tree import DecisionTreeRegressor
Tree_model = DecisionTreeRegressor()
Tree_model.fit(X_prepared, y)

DecisionTreeRegressor()

Let's check the model:

In [31]:
y_predicted = Tree_model.predict(X_test_prepared)

In [33]:
tree_mse = mean_squared_error(y_test, y_predicted)
# compute the RMSE
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

72835.0766498973


The result is barely different from the linear regression result.

### RandomForest

In [34]:
from sklearn.ensemble import RandomForestRegressor
RF_model = RandomForestRegressor()
RF_model.fit(X_prepared, y)

RandomForestRegressor()

Let's check the model:

In [36]:
y_predicted = RF_model.predict(X_test_prepared)
rf_mse = mean_squared_error(y_test, y_predicted)
# compute the RMSE
rf_rmse = np.sqrt(rf_mse)
print(rf_rmse)

50249.614402988474


Much better than before: the RMSE dropped to about $50,250.

## Evaluation by Cross-Validation method

Above, we split the data into training and test sets to evaluate the models.
The disadvantage of this method is that the model is always trained on the same rows and tested on the same rows, so the estimate depends on that one particular split.

With cross-validation, we can divide the data into several parts and train and test the model several times using different parts.

For example, the figure below shows training and testing with the data divided into 5 parts (folds).

![](https://www.oreilly.com/library/view/machine-learning-quick/9781788830577/assets/b90b29ab-dfe7-4c11-9a2f-321e84f79495.png)

For cross-validation, we do not need to split the data into training and validation parts by hand; sklearn does it for us.
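
To make the splitting concrete, the sketch below uses `KFold` on ten made-up samples. For a regressor, passing an integer `cv` to `cross_val_score` uses this kind of unshuffled K-fold split; each sample lands in the test part exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples

kfold = KFold(n_splits=5)  # no shuffling, so the folds are contiguous blocks
splits = list(kfold.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```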

We can create a simple function to display the validation results:

In [37]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Std.dev:", scores.std())

#### Cross-validation

In [38]:
from sklearn.model_selection import cross_val_score

#### Linear Regression

In [39]:
scores = cross_val_score(LR_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
LR_rmse_scores = np.sqrt(-scores)

In [40]:
display_scores(LR_rmse_scores)

Scores: [65000.67382615 70960.56056304 67122.63935124 66089.63153865
 68402.54686442 65266.34735288 65218.78174481 68525.46981754
 72739.87555996 68957.34111906]
Mean: 67828.38677377408
Std.dev: 2468.091395065227
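
The minus sign in `np.sqrt(-scores)` is needed because sklearn scorers follow a "greater is better" convention, so the mean squared error is reported negated. A minimal sketch on synthetic data (the data and coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
rmse_scores = np.sqrt(-scores)  # flip the sign back, then take the root

print(scores)       # one negated MSE per fold
print(rmse_scores)  # one RMSE per fold
```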


#### Decision Tree

In [41]:
scores = cross_val_score(Tree_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
display_scores(tree_rmse_scores)

Scores: [65375.83339931 72023.16256547 69365.31842123 69966.16485146
 74405.72625964 68153.69281188 66068.80120441 69338.83191738
 69065.27285487 71139.24147622]
Mean: 69490.20457618593
Std.dev: 2533.051175721662


#### Random Forest

In [42]:
scores = cross_val_score(RF_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
RF_rmse_scores = np.sqrt(-scores)
display_scores(RF_rmse_scores)

Scores: [47009.50026838 51527.19530316 49734.31367723 51624.91118242
 52212.7528972  47024.84566841 47845.2808272  50545.21811819
 49500.38578603 49675.71723105]
Mean: 49670.01209592746
Std.dev: 1787.1790456268573


## Save the model

We should save the model we created for future use. In general, it is a good idea to store not only the model but also the other objects it depends on, for example the pipeline.

For this we can use Python's `pickle` module or the `joblib` library.

### Save using `pickle`

In [44]:
import pickle

filename = 'RF_model.pkl'  # we can give the file any name
with open(filename, 'wb') as file:
    pickle.dump(RF_model, file)

In [45]:
with open(filename, 'rb') as file:
    model = pickle.load(file)

Let's test the loaded model:

In [47]:
scores = cross_val_score(model, X_prepared, y, scoring="neg_mean_squared_error", cv=5)
RF_rmse_scores = np.sqrt(-scores)
display_scores(RF_rmse_scores)

Scores: [49435.39619195 51573.78187504 50577.50637735 49747.04217157
 50603.05396562]
Mean: 50387.356116305236
Std.dev: 749.2020302096875
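
Note that `cross_val_score` refits a fresh copy of the estimator on each fold, so it does not exercise the fitted state stored in the file. A more direct check is to compare predictions before and after the round trip. The sketch below does this in memory with `pickle.dumps` and `pickle.loads` on synthetic data (the data and the small `n_estimators` value are assumptions chosen to keep the example fast):

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Serialize to bytes and back, the in-memory equivalent of dump() and load().
restored = pickle.loads(pickle.dumps(model))

same = np.array_equal(model.predict(X), restored.predict(X))
print(same)  # whether the restored model predicts exactly the same values
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.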


### Save using `joblib`

In [48]:
import joblib

filename = 'RF_model.jbl' 
joblib.dump(RF_model, filename)

['RF_model.jbl']

In [49]:
model = joblib.load(filename)

In [50]:
scores = cross_val_score(model, X_prepared, y, scoring="neg_mean_squared_error", cv=5)
RF_rmse_scores = np.sqrt(-scores)
display_scores(RF_rmse_scores)

Scores: [49867.41621589 51452.98203645 50688.66735082 49640.44450664
 50137.01459504]
Mean: 50357.30494096657
Std.dev: 650.0333877592278


We also save the pipeline:

In [51]:
filename = 'pipeline.jbl'
joblib.dump(full_pipeline, filename)

['pipeline.jbl']
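
One alternative, not used in this notebook, is to chain preprocessing and model into a single `Pipeline` and save that one object, so both always travel together. The sketch below illustrates this on synthetic data; the step names and the file name `full_model.jbl` are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, noise-free linear data (made up for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 5.0

# Preprocessing and model chained into one object.
full_model = Pipeline([
    ("std_scaler", StandardScaler()),
    ("regressor", LinearRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "full_model.jbl")  # hypothetical file name
    joblib.dump(full_model, path)
    loaded = joblib.load(path)

# The loaded object accepts raw, unscaled rows directly.
X_new = rng.normal(size=(5, 3))
same = np.allclose(full_model.predict(X_new), loaded.predict(X_new))
print(same)
```

When loading `pipeline.jbl` in a new session, the `CombinedAttributesAdder` class must be defined or importable first, because the file stores only a reference to the class, not its code.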