### Boston Housing Data

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per \$10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - median value of owner-occupied homes in \$1000s

In [59]:
# Let's read the file into a DataFrame and print it (pandas truncates long output)
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/lmanikon/lmanikon.github.io/master/teaching/datasets/bostonHousing.csv')
print(df)


        CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  \
0    0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296   
1    0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242   
2    0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242   
3    0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222   
4    0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222   
..       ...   ...    ...   ...    ...    ...   ...     ...  ...  ...   
506  0.98765   0.0  12.50     0  0.561  6.980  89.0  2.0980    3  320   
507  0.23456   0.0  12.50     0  0.561  6.980  76.0  2.6540    3  320   
508  0.44433   0.0  12.50     0  0.561  6.123  98.0  2.9870    3  320   
509  0.77763   0.0  12.70     0  0.561  6.222  34.0  2.5430    3  329   
510  0.65432   0.0  12.80     0  0.561  6.760  67.0  2.9870    3  345   

     PTRATIO       B  LSTAT  MEDV  
0       15.3  396.90   4.98  24.0  
1       17.8  396.90 

In [60]:
# Drop rows with at least one missing value
df = df.dropna()
print(df.shape)

(506, 14)
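
The printed frame above ends at index 510, while the shape after `dropna()` is (506, 14), which suggests a handful of rows contained missing values. Before dropping rows it can help to see where the gaps are. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the housing file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.4],
    "RM":   [6.5, 6.1, np.nan, 7.0],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that contain at least one missing value (the rows dropna() removes)
rows_with_missing = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna()
print(missing_per_column)
print(rows_with_missing, toy.shape, cleaned.shape)
```

Running the same two `isna()` checks on `df` before cell In [60] would show which columns are responsible for the dropped rows.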


In [61]:
# Split the data into X (feature set) and y (target column)
# MEDV is the regression target, a continuous value rather than a class label
X = df.drop(columns=['MEDV'])
y = df['MEDV']



In [62]:
# Standardize the features (zero mean, unit variance per column)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
X


array([[-0.420303  ,  0.28797489, -1.30024108, ..., -1.45389248,
         0.44614055, -1.00982322],
       [-0.41785992, -0.48448139, -0.60207324, ..., -0.31747751,
         0.44614055, -0.47756756],
       [-0.41786225, -0.48448139, -0.60207324, ..., -0.31747751,
         0.40149536, -1.13137198],
       ...,
       [-0.36932193, -0.48448139,  0.19436613, ...,  2.04626563,
        -0.1451065 ,  1.03987284],
       [-0.33052832, -0.48448139,  0.22370091, ...,  2.04626563,
        -0.1451065 ,  8.07690675],
       [-0.34488067, -0.48448139,  0.2383683 , ...,  2.04626563,
        -0.38643182,  4.11057855]])
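
To make the transformation concrete, here is a small sketch of what standardization computes: each column has its mean subtracted and is divided by its population standard deviation (`ddof=0`), which is the default behaviour of `StandardScaler`. The array `toy_X` is a hypothetical example, not the housing data.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 2 features on different scales
toy_X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

mean = toy_X.mean(axis=0)        # per-column mean
std = toy_X.std(axis=0)          # per-column population std (ddof=0)
toy_scaled = (toy_X - mean) / std

print(toy_scaled.mean(axis=0))   # approximately 0 for every column
print(toy_scaled.std(axis=0))    # approximately 1 for every column
```

PCA is sensitive to feature scale, so putting columns such as TAX and NOX on a common scale keeps the large-valued ones from dominating the components.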

In [63]:
# Perform PCA, keeping the first two principal components

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(
    data=principalComponents,
    columns=['principal component 1', 'principal component 2'],
)
principalDf


| | principal component 1 | principal component 2 |
|---|---|---|
| 0 | -2.076270 | 0.757225 |
| 1 | -1.463351 | 0.594415 |
| 2 | -2.046466 | 0.607857 |
| 3 | -2.586053 | 0.009885 |
| 4 | -2.446936 | 0.114024 |
| ... | ... | ... |
| 501 | 0.318521 | 0.657941 |
| 502 | 0.666336 | 0.167241 |
| 503 | 0.946284 | 0.204977 |
| 504 | 2.353962 | -1.026250 |
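
Reducing 13 features to two components discards some variance, and a fitted `PCA` object reports how much is kept through its `explained_variance_ratio_` attribute. The sketch below illustrates the check on synthetic data; `latent`, `mixing`, and `toy_X` are hypothetical names and values, so the printed ratios say nothing about the housing data itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 200 samples, 5 features driven by 2 latent factors
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
toy_X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

toy_scaled = StandardScaler().fit_transform(toy_X)
toy_pca = PCA(n_components=2).fit(toy_scaled)

# Fraction of the total variance captured by each retained component
ratios = toy_pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

In the notebook, inspecting `pca.explained_variance_ratio_` after cell In [63] gives the corresponding numbers; a low sum would indicate that two components leave out much of the structure in the features.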


In [64]:
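from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; no random_state is set, so the split (and the MSE reported later) changes between runs
X_train, X_test, y_train, y_test = train_test_split(principalDf, y, test_size=0.2)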
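
Because `train_test_split` shuffles the rows randomly, results are only repeatable when a seed is fixed through the `random_state` parameter. The following sketch shows the effect on hypothetical toy arrays; the seed value 42 is an arbitrary choice, not one used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Two calls with the same seed return identical partitions
split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same, split_a[0].shape, split_a[1].shape)
```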

In [65]:
# Fit a linear regression on the two principal components and report the test-set MSE

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linR = LinearRegression().fit(X_train, y_train)

y_pred = linR.predict(X_test)

print(mean_squared_error(y_test, y_pred))

28.467971098571404
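
In the cells above, the scaler and PCA are fitted on all rows before the split, so the test rows influence the transformation. One common alternative, sketched here under assumptions, is to split first and wrap the steps in a scikit-learn pipeline so that scaling and PCA are learned from the training rows only. The data comes from `make_regression` and is purely synthetic; the names `toy_X`, `toy_y`, and `model` are hypothetical, and the printed numbers are unrelated to the housing result.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression data standing in for the 13 housing features
toy_X, toy_y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)

# Split first, so the test rows stay unseen by every fitted step
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

# fit() learns the scaler, the PCA projection, and the regression from the training rows only
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(X_tr, y_tr)

mse = mean_squared_error(y_te, model.predict(X_te))
rmse = np.sqrt(mse)   # root of the MSE, in the same units as the target
print(mse, rmse)
```

Taking the square root also makes the error easier to read. For the MSE of 28.47 printed above, the RMSE is about 5.34, and since MEDV is expressed in \$1000s that corresponds to a typical error of roughly \$5,340 on that particular split.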
