### Predicting house prices with regression

> Regression problems normally answer questions like **How much...?** or **How many...?**. In this notebook we will use different **regression** algorithms to predict house prices from the well-known `load_boston` dataset in **sklearn**.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
import seaborn as sns

### Loading data.

In [7]:
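# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older release.
boston = datasets.load_boston()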
pd.DataFrame(boston.data).head()

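| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |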


In [9]:
print(boston)

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]]), 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 1

In [10]:
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [13]:
boston.target.shape, boston.data.shape

((506,), (506, 13))

### Splitting our data
> We are going to split our data into train and test sets.

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [23]:
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = .3, random_state =99)

In [24]:
X_train.shape, y_train.shape, X_test.shape

((354, 13), (354,), (152, 13))
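The shapes above follow directly from `test_size=.3` applied to 506 rows: the test size is rounded up to 152 rows, and the remaining 354 rows go to training. A minimal sketch that reproduces only the shapes, assuming placeholder arrays of zeros in place of the real Boston data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston data (values are not real).
X = np.zeros((506, 13))
y = np.zeros(506)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=99)
print(X_tr.shape, y_tr.shape, X_te.shape)
```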

### Creating different `Regression` models.

In [95]:
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

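#### Scaling our X data for train and test
> We are going to scale the features using the `StandardScaler`. The scaler is fitted on the train set only and then applied to both the train and test sets. The targets are left unscaled; `scaley` is created below but never used.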

In [35]:
scaleX = StandardScaler()
scaley = StandardScaler()

In [37]:
X_train_scaled = scaleX.fit_transform(X_train)
X_test_scaled = scaleX.transform(X_test)
X_train_scaled[0]

array([-0.35961633, -0.48960055, -0.48281904, -0.2511236 , -0.16357244,
       -0.64911303, -0.43813561,  0.32717674, -0.65180049, -0.65554224,
        1.11770735,  0.42946687, -0.59396908])
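As a rough check of what `StandardScaler` does, the sketch below compares it with a manual `(x - mean) / std` using statistics taken from the training rows only. The data here is random and purely illustrative, not the Boston data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative random data standing in for the train and test features.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler().fit(X_tr)

# Manual standardization with the *training* mean and population std (ddof=0).
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
same = np.allclose(scaler.transform(X_te), manual)
print(same)
```

Reusing the training statistics on the test set keeps information about the test rows out of the preprocessing step.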

In [43]:
y_test[:2]

array([35.4, 35.2])

#### 1. SGDRegressor
> SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (aka learning rate).

In [41]:
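# 'squared_loss' was renamed to 'squared_error' in scikit-learn 1.0; use that name on newer releases.
sgd_reg = SGDRegressor(loss='squared_loss', random_state=42, penalty=None)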
sgd_reg

SGDRegressor(penalty=None, random_state=42)

In [44]:
sgd_reg.fit(X_train_scaled, y_train)

SGDRegressor(penalty=None, random_state=42)

In [46]:
sgd_reg.predict(X_test_scaled[:5]), y_test[:5]

(array([34.09016066, 34.8356811 , 25.35752209, 25.31266708, 34.4594283 ]),
 array([35.4, 35.2, 24.8, 22.6, 34.9]))
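To make the "one sample at a time" idea concrete, here is a simplified NumPy sketch of SGD for a squared-error linear model with a decaying learning rate. It is not scikit-learn's implementation; the synthetic data, `true_w`, `true_b`, and the schedule constants `eta0` and `power_t` are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic, roughly standardized data with known coefficients (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

w, b = np.zeros(3), 0.0
eta0, power_t, t = 0.01, 0.25, 1

for epoch in range(50):
    for i in rng.permutation(len(X)):      # visit samples in random order
        error = X[i] @ w + b - y[i]        # derivative of 0.5 * error**2 w.r.t. the prediction
        eta = eta0 / t ** power_t          # learning rate shrinks as updates accumulate
        w -= eta * error * X[i]
        b -= eta * error
        t += 1

print(w, b)
```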

#### 2. SVR
> The implementation is based on libsvm. The fit time complexity is more than quadratic in the number of samples, which makes it hard to scale to datasets with more than a few tens of thousands of samples.

In [61]:
# degree is only used by the 'poly' kernel, so it is omitted for the linear kernel.
svr_reg = SVR(C=1, epsilon=.5, kernel="linear")
svr_reg

SVR(C=1, epsilon=0.5, kernel='linear')

In [62]:
svr_reg.fit(X_train_scaled, y_train)

SVR(C=1, epsilon=0.5, kernel='linear')

In [64]:
svr_reg.predict(X_test_scaled[:5]), y_test[:5]

(array([32.77539084, 34.47704835, 25.07823343, 24.12403952, 33.66906661]),
 array([35.4, 35.2, 24.8, 22.6, 34.9]))
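The `epsilon=.5` argument sets the width of the tube inside which SVR ignores errors. A small sketch of the epsilon-insensitive loss, applied to the five predictions printed above; the helper name `epsilon_insensitive` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon=0.5):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# The five test targets and SVR predictions shown above.
y_true = np.array([35.4, 35.2, 24.8, 22.6, 34.9])
y_pred = np.array([32.77539084, 34.47704835, 25.07823343, 24.12403952, 33.66906661])

losses = epsilon_insensitive(y_true, y_pred)
print(losses)
```

The third prediction is within 0.5 of its target, so it contributes no loss at all.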

#### 3. DecisionTreeRegressor
```
class sklearn.tree.DecisionTreeRegressor(*, criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, ccp_alpha=0.0)
```

In [66]:
dst_reg = DecisionTreeRegressor()

In [67]:
dst_reg.fit(X_train_scaled, y_train)

DecisionTreeRegressor()

In [71]:
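# Compared on training rows here (unlike the other models), so an exact match is expected from an unpruned tree.
dst_reg.predict(X_train_scaled[:5]), y_train[:5]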

(array([19.9, 11.8, 13.1, 20. ,  7.2]), array([19.9, 11.8, 13.1, 20. ,  7.2]))
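The exact match above is what an unrestricted tree tends to produce on the rows it was trained on. A minimal sketch on synthetic data (not the Boston data) comparing a fully grown tree with one limited by `max_depth`; the depth of 3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with continuous features, so every row is distinct.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = X[:, 0] * 10 + rng.normal(scale=1.0, size=200)

full = DecisionTreeRegressor(random_state=0).fit(X, y)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

full_train_r2 = full.score(X, y)        # grown until each leaf is pure
shallow_train_r2 = shallow.score(X, y)  # at most 8 leaves, cannot memorize 200 rows
print(full_train_r2, shallow_train_r2)
```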

#### 4. RandomForestRegressor
> A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the `max_samples` parameter if `bootstrap=True` (default), otherwise the whole dataset is used to build each tree.

In [84]:
randf_reg = RandomForestRegressor(n_estimators = 5, max_depth=5, random_state=45)


In [85]:
randf_reg.fit(X_train_scaled, y_train)

RandomForestRegressor(max_depth=5, n_estimators=5, random_state=45)

In [86]:
# Compared on training rows here, not test rows.
randf_reg.predict(X_train_scaled[:5]), y_train[:5]

(array([20.17370258, 11.04782471, 14.90296757, 21.33863727,  9.60354437]),
 array([19.9, 11.8, 13.1, 20. ,  7.2]))
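The "uses averaging" part can be checked directly: the forest's prediction should equal the mean of its individual trees' predictions. A short sketch on synthetic data (an assumption, not the Boston data), reusing the same hyperparameters as above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data for illustration.
rng = np.random.default_rng(45)
X = rng.normal(size=(150, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=150)

forest = RandomForestRegressor(n_estimators=5, max_depth=5, random_state=45).fit(X, y)

# One row of predictions per fitted tree, then average across trees.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
matches = np.allclose(per_tree.mean(axis=0), forest.predict(X[:5]))
print(matches)
```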

#### 5. ExtraTreesRegressor
> This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In [87]:
ext_reg = ExtraTreesRegressor(n_estimators=100, random_state=0)

In [88]:
ext_reg.fit(X_train_scaled, y_train)

ExtraTreesRegressor(random_state=0)

In [90]:
ext_reg.predict(X_test_scaled[:5]), y_test[:5]

(array([32.415, 40.858, 28.084, 22.642, 34.329]),
 array([35.4, 35.2, 24.8, 22.6, 34.9]))

#### 6. KNeighborsRegressor

> The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.

In [91]:
neigh_reg = KNeighborsRegressor(n_neighbors=2)

In [92]:
neigh_reg.fit(X_train_scaled, y_train)

KNeighborsRegressor(n_neighbors=2)

In [93]:
neigh_reg.predict(X_test_scaled[:5]), y_test[:5]

(array([28.  , 32.75, 25.3 , 19.55, 34.05]),
 array([35.4, 35.2, 24.8, 22.6, 34.9]))
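With the default uniform weights and Euclidean distance, "local interpolation" for `n_neighbors=2` amounts to averaging the targets of the two closest training rows. A sketch on random illustrative data that compares a manual computation with the estimator:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Random illustrative data: 50 training rows and a single query point.
rng = np.random.default_rng(1)
X_tr = rng.normal(size=(50, 3))
y_tr = rng.normal(size=50)
query = rng.normal(size=(1, 3))

knn = KNeighborsRegressor(n_neighbors=2).fit(X_tr, y_tr)

# Manual version: Euclidean distances, two nearest rows, mean of their targets.
dist = np.linalg.norm(X_tr - query, axis=1)
nearest = np.argsort(dist)[:2]
manual = y_tr[nearest].mean()

same = np.isclose(manual, knn.predict(query)[0])
print(same)
```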

#### 7. LinearRegression
> LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

In [102]:
l_reg = LinearRegression()
l_reg.fit(X_train_scaled, y_train)

LinearRegression()

In [104]:
l_reg.predict(X_test_scaled[:5]), y_test[:5]

(array([34.39031078, 34.72259889, 25.11192059, 24.05948598, 34.85344369]),
 array([35.4, 35.2, 24.8, 22.6, 34.9]))
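Minimizing the residual sum of squares is an ordinary least-squares problem, so the fitted coefficients should agree with a direct least-squares solve. A sketch on synthetic data with made-up coefficients, using `np.linalg.lstsq` on a design matrix with an added column of ones for the intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known (made-up) coefficients and intercept.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + rng.normal(scale=0.2, size=100)

model = LinearRegression().fit(X, y)

# Least-squares solve with an explicit intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

same = np.isclose(coef[0], model.intercept_) and np.allclose(coef[1:], model.coef_)
print(same)
```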

### Scoring Different models

In [106]:
svr_reg ## SVR
sgd_reg ## SGDRegressor
dst_reg ## DecisionTreeRegressor
l_reg ## LinearRegression
neigh_reg ## KNeighborsRegressor
randf_reg ## RandomForestRegressor
ext_reg   ## ExtraTreesRegressor

ExtraTreesRegressor(random_state=0)
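The per-model scoring can also be written as a single loop. The sketch below is one possible way to do that, assuming synthetic data from `make_regression` in place of the Boston data and wrapping each estimator in a pipeline with `StandardScaler`; the `models` and `scores` dictionaries are names introduced here for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=99)

models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=2),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=5, max_depth=5, random_state=45),
    "ExtraTreesRegressor": ExtraTreesRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, estimator in models.items():
    # The scaler is fitted on the training rows only, inside the pipeline.
    pipe = make_pipeline(StandardScaler(), estimator).fit(X_tr, y_tr)
    scores[name] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```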

#### SVR

In [107]:
svr_reg.score(X_train_scaled, y_train), svr_reg.score(X_test_scaled, y_test)

(0.7369778607349755, 0.6667837988608363)

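> `SVR` has an R² of about `0.74` on the train set and `0.67` on the test set. Note that `score` returns the coefficient of determination (R²) for regressors, not an accuracy.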
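For a regressor, `score` computes R² = 1 - SS_res / SS_tot. The sketch below works this out by hand on the five SVR predictions printed earlier and compares it with `sklearn.metrics.r2_score`. It covers only those five rows, so the value will differ from the full test-set score above.

```python
import numpy as np
from sklearn.metrics import r2_score

# The five test targets and SVR predictions shown earlier in the notebook.
y_true = np.array([35.4, 35.2, 24.8, 22.6, 34.9])
y_pred = np.array([32.77539084, 34.47704835, 25.07823343, 24.12403952, 33.66906661])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

same = np.isclose(r2_manual, r2_score(y_true, y_pred))
print(r2_manual, same)
```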

#### SGDRegressor

In [108]:
sgd_reg.score(X_train_scaled, y_train), sgd_reg.score(X_test_scaled, y_test)

(0.7624407538153944, 0.6695152132670618)

> `SGDRegressor` has an R² of about `0.76` on the train set and `0.67` on the test set.

#### DecisionTreeRegressor

In [109]:
dst_reg.score(X_train_scaled, y_train), dst_reg.score(X_test_scaled, y_test)

(1.0, 0.7514846521915527)

> `DecisionTreeRegressor` has an R² of `1.00` on the train set and about `0.75` on the test set. The perfect train score shows that the unpruned tree has memorized the training data.

#### LinearRegression

In [110]:
l_reg.score(X_train_scaled, y_train), l_reg.score(X_test_scaled, y_test)

(0.764467385910802, 0.6721726414424739)

> `LinearRegression` has an R² of about `0.76` on the train set and `0.67` on the test set.

#### KNeighborsRegressor

In [111]:
neigh_reg.score(X_train_scaled, y_train), neigh_reg.score(X_test_scaled, y_test)

(0.9396633008285875, 0.7092287520704416)

> `KNeighborsRegressor` has an R² of about `0.94` on the train set and `0.71` on the test set.

#### RandomForestRegressor

In [112]:
randf_reg.score(X_train_scaled, y_train), randf_reg.score(X_test_scaled, y_test)

(0.9269031662146822, 0.7741025482542851)

> `RandomForestRegressor` has an R² of about `0.93` on the train set and `0.77` on the test set.

#### ExtraTreesRegressor

In [113]:
ext_reg.score(X_train_scaled, y_train), ext_reg.score(X_test_scaled, y_test)

(1.0, 0.8325553079114401)

> `ExtraTreesRegressor` has an R² of `1.00` on the train set and about `0.83` on the test set.

### Conclusion

> On this train/test split, `ExtraTreesRegressor` gets the highest test score (about `0.83`) of the seven models compared, so it is the best of these candidates for predicting house prices on the Boston dataset.