# F Linear Regression BOSTON

Train a linear regression model on the BOSTON dataset to predict the median
value of houses. Evaluate it on the test set and show the code you used. Which
independent variables (columns) have the biggest influence?

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.metrics import mean_squared_error, r2_score

plt.rcParams['figure.figsize'] = [15, 8]

In [2]:
np.set_printoptions(suppress=True)

In [3]:
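boston = tf.keras.datasets.boston_housing.load_data()
X_train = boston[0][0]
Y_train = boston[0][1]

# The held-out test split is the second tuple, boston[1]. The recorded run used
# boston[0] here, so the outputs shown in this notebook are training-set scores.
X_test = boston[1][0]
Y_test = boston[1][1]

c = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", 'medv']
# Wrap both splits in DataFrames with named columns; the last name in c is the target
boston = pd.DataFrame({name: X_train[:, i] for i, name in enumerate(c[:-1])})
boston[c[-1]] = Y_train
boston_test = pd.DataFrame({name: X_test[:, i] for i, name in enumerate(c[:-1])})
boston_test[c[-1]] = Y_test
#boston.drop(boston.columns[[1,2,3,4,7,8,9,10,12,13]], axis=1, inplace=True)
boston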

|     | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.23247 | 0.0 | 8.14 | 0.0 | 0.5380 | 6.142 | 91.7 | 3.9769 | 4.0 | 307.0 | 21.0 | 396.90 | 18.72 | 15.2 |
| 1 | 0.02177 | 82.5 | 2.03 | 0.0 | 0.4150 | 7.610 | 15.7 | 6.2700 | 2.0 | 348.0 | 14.7 | 395.38 | 3.11 | 42.3 |
| 2 | 4.89822 | 0.0 | 18.10 | 0.0 | 0.6310 | 4.970 | 100.0 | 1.3325 | 24.0 | 666.0 | 20.2 | 375.52 | 3.26 | 50.0 |
| 3 | 0.03961 | 0.0 | 5.19 | 0.0 | 0.5150 | 6.037 | 34.5 | 5.9853 | 5.0 | 224.0 | 20.2 | 396.90 | 8.01 | 21.1 |
| 4 | 3.69311 | 0.0 | 18.10 | 0.0 | 0.7130 | 6.376 | 88.4 | 2.5671 | 24.0 | 666.0 | 20.2 | 391.43 | 14.65 | 17.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 399 | 0.21977 | 0.0 | 6.91 | 0.0 | 0.4480 | 5.602 | 62.0 | 6.0877 | 3.0 | 233.0 | 17.9 | 396.90 | 16.20 | 19.4 |
| 400 | 0.16211 | 20.0 | 6.96 | 0.0 | 0.4640 | 6.240 | 16.3 | 4.4290 | 3.0 | 223.0 | 18.6 | 396.90 | 6.59 | 25.2 |
| 401 | 0.03466 | 35.0 | 6.06 | 0.0 | 0.4379 | 6.031 | 23.3 | 6.6407 | 1.0 | 304.0 | 16.9 | 362.25 | 7.83 | 19.4 |
| 402 | 2.14918 | 0.0 | 19.58 | 0.0 | 0.8710 | 5.709 | 98.5 | 1.6232 | 5.0 | 403.0 | 14.7 | 261.95 | 15.79 | 19.4 |


In [4]:
from sklearn.linear_model import LinearRegression

In [5]:
regr = LinearRegression()
print(regr)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


In [12]:
regr.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [15]:
pred = regr.predict(X_test)

#print (pred)
#print (Y_test)

In [16]:
print('Coefficients: \n', regr.coef_)
# The mean squared error of the linear regression
print("Mean squared error: %.2f"
      % mean_squared_error(Y_test, pred))
# R² score (coefficient of determination), a performance indicator for the regression: 1 would be a perfect fit
print('Variance score: %.2f' % r2_score(Y_test, pred))

Coefficients: 
 [ -0.11999751   0.05700033   0.0039838    4.12698187 -20.50029633
   3.38024903   0.00756808  -1.71189793   0.33474754  -0.01177972
  -0.90231804   0.00871913  -0.55584251]
Mean squared error: 22.00
Variance score: 0.74
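
For reference, both scores printed above can be reproduced by hand. The sketch below is plain Python on made-up toy numbers (not the Boston data), and the helper names `mse` and `r2` are our own, chosen for illustration. It assumes the usual definitions: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; 1.0 is a perfect fit, 0.0 matches predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only (hypothetical, not taken from the dataset)
y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [22.0, 28.0, 41.0, 49.0]

print("MSE: %.2f" % mse(y_true, y_pred))
print("R^2: %.2f" % r2(y_true, y_pred))
```

Note that MSE is expressed in squared units of the target, while R² is unitless, which is why the two numbers are usually reported together.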


Our first idea was to compare the coefficients in order to analyse the influence of the different variables. However, we found the following explanation:

Regular regression coefficients describe the relationship between each predictor variable and the response. The coefficient value represents the mean change in the response given a one-unit increase in the predictor. Consequently, it’s easy to think that variables with larger coefficients are more important because they represent a larger change in the response.

However, the units vary between the different types of variables, which makes it impossible to compare them directly. For example, the meaning of a one-unit change is very different if you’re talking about temperature, weight, or chemical concentration.

This problem is further complicated by the fact that there are different units within each type of measurement. For example, weight can be measured in grams and kilograms. If you fit models for the same data set using grams in one model and kilograms in another, the coefficient for weight changes by a factor of a thousand even though the underlying fit of the model remains unchanged. The coefficient value changes greatly while the importance of the variable remains constant.

Takeaway: Larger coefficients don’t necessarily identify more important predictor variables.
(https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-identify-the-most-important-predictor-variables-in-regression-models)
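
The grams-versus-kilograms point can be checked with a few lines of plain Python. This is a minimal sketch on invented numbers: `fit_line` and `standardized_slope` are hypothetical helpers written for this example, and the data has nothing to do with the Boston set. It fits a one-variable least-squares line twice, once per unit, and then rescales each slope by the ratio of standard deviations, which is one common way to make coefficients comparable.

```python
from statistics import mean, pstdev


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def standardized_slope(x, y):
    """Slope in standard deviations of y per standard deviation of x."""
    slope, _ = fit_line(x, y)
    return slope * pstdev(x) / pstdev(y)


# Invented example data: the same weights in kilograms and in grams
weight_kg = [1.0, 2.0, 3.0, 4.0, 5.0]
weight_g = [w * 1000 for w in weight_kg]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

slope_kg, icpt_kg = fit_line(weight_kg, response)
slope_g, icpt_g = fit_line(weight_g, response)

print("slope per kg:", slope_kg)
print("slope per g: ", slope_g)
print("standardized (kg):", standardized_slope(weight_kg, response))
print("standardized (g): ", standardized_slope(weight_g, response))
```

The raw slope differs by a factor of 1000 between the two fits, while the standardized slope is the same in both, which makes it a fairer basis for comparison than the raw coefficient.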

### So we fit one single-variable regression per column and compare their MSEs to find the variable with the biggest influence

In [25]:
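# Fit one single-variable regression per feature column and score each one
for i in range(len(c) - 1):  # the last entry of c is the target, medv
    print(c[i],":")
    reg = LinearRegression()
    # sklearn expects a 2-D feature array, so reshape the column to (n_samples, 1)
    reg.fit(X_train[:, i].reshape(-1, 1), Y_train)
    pred = reg.predict(X_test[:, i].reshape(-1, 1))
    print('Coefficients : {}'.format(reg.coef_))
    print("MSE: {}".format(mean_squared_error(Y_test, pred)))
    print("\n")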

crim :
Coefficients : [-0.37725684]
MSE: 72.49923060928539


zn :
Coefficients : [0.14737305]
MSE: 72.3835733788123


indus :
Coefficients : [-0.6446655]
MSE: 65.38896540382329


chas :
Coefficients : [6.43943008]
MSE: 82.21505727944826


nox :
Coefficients : [-34.41973083]
MSE: 68.36363721633272


rm :
Coefficients : [8.84314508]
MSE: 45.3221194654976


age :
Coefficients : [-0.12004703]
MSE: 73.39948383390582


dis :
Coefficients : [1.15186574]
MSE: 79.16704438078895


rad :
Coefficients : [-0.39762194]
MSE: 72.68954082972854


tax :
Coefficients : [-0.02484192]
MSE: 67.58232478739882


ptratio :
Coefficients : [-2.06776036]
MSE: 63.972246486587544


black :
Coefficients : [0.03366193]
MSE: 74.61110841268861


lstat :
Coefficients : [-0.92782169]
MSE: 39.42905636385886




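A lower MSE means that the variable on its own predicts the median value better, which we take as a sign of bigger influence. Looking at the results, "lstat" has the lowest MSE (about 39.4) and therefore the biggest impact on property prices in this dataset, followed by "rm" (about 45.3).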
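
To make the ranking easier to read, the MSE values printed by the loop above can be sorted. The snippet below is a small sketch that copies those printed values by hand into a dictionary (rounded to two decimals); the names `mse_by_column` and `ranking` are ours.

```python
# MSE of each single-variable model, copied from the printed output and rounded
mse_by_column = {
    "crim": 72.50, "zn": 72.38, "indus": 65.39, "chas": 82.22, "nox": 68.36,
    "rm": 45.32, "age": 73.40, "dis": 79.17, "rad": 72.69, "tax": 67.58,
    "ptratio": 63.97, "black": 74.61, "lstat": 39.43,
}

# Sort ascending so that the column with the smallest error comes first
ranking = sorted(mse_by_column.items(), key=lambda item: item[1])

for name, value in ranking:
    print("{:8s} {:6.2f}".format(name, value))
```

Sorted this way, "lstat" and "rm" stand clearly apart from the rest, while the remaining columns all land between roughly 64 and 82.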