# ___Boston Housing Linear Regression___
---

## Tasks
1- Read in the dataset using sklearn's load_boston() function (found in sklearn.datasets)  

2- Perform some basic exploratory data analysis to get a feel for the data. Graph some stuff!  

3- Create a correlation heatmap to check how strongly the predictor variables (features) are correlated with one another. (Remember: highly correlated predictors are a problem, because multicollinearity makes the coefficient estimates unstable.)  

4- Train the model on the training set. Use 75 percent of the data for training and 25 percent for testing.  

Hint: from sklearn.model_selection import train_test_split  

Hint: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)  

5- Make predictions on your test set (X_test) and see how well they compare to the actual targets (y_test) from the test set.  

6- Compute the mean squared error (MSE) and R-squared score of your model.  

Hint: from sklearn.metrics import r2_score  

Hint: from sklearn.metrics import mean_squared_error  
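
Task 3 has no worked cell in this notebook. A minimal sketch of the correlation check follows, assuming pandas and NumPy are available. It runs on a small synthetic DataFrame so that it does not depend on the Boston data; the column names `a`, `b`, `c`, the helper `correlated_pairs`, and the 0.8 threshold are illustrative assumptions, not part of the original notebook. The same calls should apply to the DataFrame built in section 1.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()  # pairwise Pearson correlation matrix
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic stand-in data: b is nearly a linear function of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = correlated_pairs(demo)
print(pairs)

# To draw the heatmap itself, one option (assuming seaborn is installed) is:
# import seaborn as sns; sns.heatmap(demo.corr(), annot=True, cmap="coolwarm")
```

Listing only the pairs above a threshold is a text-only complement to the heatmap: it makes the strongly correlated predictors easy to spot when there are many features.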

In [70]:
import pandas as pd
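# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; this import needs an older release
from sklearn.datasets import load_boston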
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline

## # 1
## Load Boston Data

In [27]:
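boston = load_boston()

# Raw feature matrix and its shape (samples, features)
print(boston.data)
print(boston.data.shape)

# Build a DataFrame with named feature columns and append the target as PRICE
bos = pd.DataFrame(boston.data, columns=boston.feature_names)
bos['PRICE'] = boston.target

print(bos.head())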

[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
(506, 13)
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  PRICE  
0     15.3  396.90   4.98   24.0  
1     17.8  396.90   9.14   21.6  
2 

## # 2  
## Exploratory data analysis

In [61]:
print("Dataset described: \n", bos.describe())
print('\n\n\n')
print("Dataset types: \n", bos.dtypes)
print('\n\n\n')
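# info() prints its summary directly and returns None, so call it instead of printing the method object
print("Dataset info: ")
bos.info()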

Dataset described: 
              CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000   
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000   

              AGE         DIS         RAD         TAX     PTRATIO           B  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean    68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   
std   

## # 4
## Build model

In [39]:
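def data_model(data, target):
    """Split data 75/25, fit a linear regression on the training part, and return the test split with the fitted model."""
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25, random_state=0)
    # The pipeline holds only the regressor; any scaling is applied to the data before this call
    pipeline = make_pipeline(LinearRegression())
    model = pipeline.fit(X_train, y_train)

    return (X_test, y_test, model)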
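
`data_model` wraps a bare `LinearRegression` in a pipeline, and the scaling in the next section is fitted on the full dataset before the split. A common alternative is to place the scaler inside the pipeline so that its statistics come from the training rows only. The sketch below illustrates this under stated assumptions: `data_model_scaled` is a hypothetical variant, not the notebook's function, and `make_regression` supplies synthetic data in place of the Boston set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def data_model_scaled(data, target, scaler=None):
    """Hypothetical variant of data_model: split first, then fit scaler and model on the training rows only."""
    X_train, X_test, y_train, y_test = train_test_split(
        data, target, test_size=0.25, random_state=0
    )
    steps = [scaler] if scaler is not None else []
    pipeline = make_pipeline(*steps, LinearRegression())
    model = pipeline.fit(X_train, y_train)
    return X_test, y_test, model


# Synthetic stand-in with the same shape as the Boston data (506 rows, 13 features).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

X_test, y_test, model = data_model_scaled(X, y, scaler=StandardScaler())
prediction = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, prediction))
print("R Squared:", r2_score(y_test, prediction))
```

Because the scaler is fitted inside `pipeline.fit`, the test rows never influence the scaling statistics, and `model.predict` applies the same transform automatically.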

## # 5 
## Testing Models

In [69]:
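# Note: each print below shows only the first sample (row 0), not a full distribution
print("Value distribution of features: ")
print(list(boston['data'][0]))

# Min-max scaling maps each feature to the range [0, 1]
min_max = MinMaxScaler()
boston_min_max = min_max.fit_transform(boston['data'])
print('\n')
print("Value distribution after min max: ")
print(list(boston_min_max[0]))

# Standardization rescales each feature to zero mean and unit variance
std = StandardScaler()
boston_std = std.fit_transform(boston['data'])
print('\n')
print("Value distribution after std: ")
print(list(boston_std[0]))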

Value distribution of features: 
[0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3, 396.9, 4.98]


Value distribution after min max: 
[0.0, 0.18, 0.06781524926686218, 0.0, 0.31481481481481477, 0.5775052692086607, 0.6416065911431514, 0.26920313906646415, 0.0, 0.20801526717557245, 0.2872340425531916, 0.9999999999999999, 0.08967991169977926]


Value distribution after std: 
[-0.4197819386460084, 0.2848298609673567, -1.2879094989577484, -0.2725985670699254, -0.14421743255530006, 0.4136718893017465, -0.1200134161980508, 0.1402136034929299, -0.9828428567665046, -0.6666082090210975, -1.4590003802772087, 0.44105193260704206, -1.075562304567866]
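
The two transforms compared above follow simple per-feature formulas: min-max scaling maps x to (x - min) / (max - min), and standardization maps x to (x - mean) / std, using the population standard deviation. The sketch below writes both out in plain Python on a small made-up column; the helper names and sample values are illustrative. The last lines reuse the ZN minimum (0) and maximum (100) from the `describe()` output to recover the 0.18 printed for the first row.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Map values linearly so that the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


column = [2.0, 4.0, 6.0, 8.0]  # made-up feature values
scaled = min_max_scale(column)
zscores = standardize(column)
print(scaled)
print(zscores)

# First-row ZN value (18.0) with the ZN min and max taken from describe()
zn_first_row = (18.0 - 0.0) / (100.0 - 0.0)
print(zn_first_row)
```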


## # 6
## Model Evaluation
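
The two metrics used in this section have short closed forms: MSE is the mean of the squared residuals, and R squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. A minimal plain-Python sketch follows, with made-up targets and predictions; the helper names `mse` and `r_squared` are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]  # made-up targets
y_pred = [2.5, 5.0, 7.5, 9.0]  # made-up predictions
print("MSE:", mse(y_true, y_pred))
print("R Squared:", r_squared(y_true, y_pred))
```

An R squared of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the targets.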

In [65]:
print("Base:")
X_test, y_test, model = data_model(boston['data'], boston['target'])
prediction = model.predict(X_test)
print("MSE: {}".format(mean_squared_error(y_test, prediction)))
print("R Squared: {}".format(r2_score(y_test, prediction)))
print('\n')


print("MinMax:")
X_test, y_test, model = data_model(boston_min_max, boston['target'])
prediction = model.predict(X_test)
print("MSE: {}".format(mean_squared_error(y_test, prediction)))
print("R Squared: {}".format(r2_score(y_test, prediction)))
print('\n')



print("Std:")
X_test, y_test, model = data_model(boston_std, boston['target'])
prediction = model.predict(X_test)
print("MSE: {}".format(mean_squared_error(y_test, prediction)))
print("R Squared: {}".format(r2_score(y_test, prediction)))

Base:
MSE: 29.78224509230252
R Squared: 0.635463843320211


MinMax:
MSE: 29.782245092302418
R Squared: 0.6354638433202122


Std:
MSE: 29.782245092302354
R Squared: 0.6354638433202131
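
The three runs differ only in the last few digits, which is expected: ordinary least squares with an intercept gives the same predictions under any invertible per-feature rescaling, because the fitted coefficients and intercept absorb the change of units. Scaling matters for regularized or distance-based models, not for plain linear regression. The sketch below checks this with NumPy's least-squares solver on synthetic data; the helper `ols_predict` and the data are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def ols_predict(X_train, y_train, X_test):
    """Fit least squares with an intercept column and predict on X_test."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


# Synthetic features on very different scales, plus noisy linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([2.0, -0.1, 300.0]) + rng.normal(size=200)
y_train = y[:150]

# Scale with statistics from the full array, as the notebook does.
lo, hi = X.min(axis=0), X.max(axis=0)
X_mm = (X - lo) / (hi - lo)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = ols_predict(X[:150], y_train, X[150:])
pred_mm = ols_predict(X_mm[:150], y_train, X_mm[150:])
pred_std = ols_predict(X_std[:150], y_train, X_std[150:])

# Predictions agree up to floating-point rounding.
print(np.allclose(pred_raw, pred_mm), np.allclose(pred_raw, pred_std))
```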
