## a. Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for?

The data should be partitioned so that, after training, the model can be evaluated on records it has not seen before. The training set is used to fit the model, i.e. to estimate its parameters (here, the regression coefficients). The validation set is used to assess how well the fitted model predicts unseen data, to compare candidate models, and to tune any hyperparameters.
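The idea can be illustrated with a minimal sketch on made-up data (not the Boston data). The 70/30 split, the seed, and the helper names `fit_line` and `rmse` are assumptions chosen for illustration; the point is only that the fit uses the training rows and the error is then measured on both partitions.

```python
import random

rng = random.Random(10)  # fixed seed so the split is reproducible

# Made-up data: y = 1 + 2x + noise (illustrative only)
xs = [rng.uniform(0, 10) for _ in range(100)]
data = [(x, 1 + 2 * x + rng.gauss(0, 1)) for x in xs]

# Shuffle, then hold out 30% of the rows for validation
rng.shuffle(data)
n_valid = int(0.3 * len(data))
valid, train = data[:n_valid], data[n_valid:]

def fit_line(points):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(points, intercept, slope):
    """Root mean squared prediction error on the given rows."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return (sum(squared) / len(points)) ** 0.5

# Parameters are estimated from the training rows only
intercept, slope = fit_line(train)

# The validation error estimates performance on unseen data
train_rmse = rmse(train, intercept, slope)
valid_rmse = rmse(valid, intercept, slope)
print(f"train RMSE = {train_rmse:.3f}, validation RMSE = {valid_rmse:.3f}")
```

A validation error much larger than the training error would suggest that the model is overfitting the training data.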

## b. Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model.

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
housing = pd.read_csv('3/BostonHousing.csv')

In [3]:
X = housing[['CRIM','CHAS','RM']]
y = housing['MEDV']

In [4]:
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size=.3, random_state=10)

In [5]:
regr = LinearRegression()
regr.fit(train_X, train_y)

LinearRegression()

In [6]:
regr.intercept_

-27.87615474886597

In [7]:
regr.coef_

array([-0.29594781,  2.22379595,  8.11607882])

**Equation: MEDV = -27.876 - 0.296 CRIM + 2.224 CHAS + 8.116 RM**
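The equation can also be assembled from the fitted values instead of being typed by hand. The sketch below hard-codes the intercept and coefficients printed above so that it runs on its own; `format_equation` is a hypothetical helper, not part of scikit-learn.

```python
def format_equation(target, intercept, coefs, names, digits=3):
    """Build a readable regression equation from an intercept and coefficients."""
    terms = [f"{intercept:.{digits}f}"]
    for coef, name in zip(coefs, names):
        sign = "-" if coef < 0 else "+"
        terms.append(f"{sign} {abs(coef):.{digits}f} {name}")
    return f"{target} = " + " ".join(terms)

# Values copied from the regr.intercept_ and regr.coef_ outputs above
equation = format_equation(
    "MEDV",
    -27.87615474886597,
    [-0.29594781, 2.22379595, 8.11607882],
    ["CRIM", "CHAS", "RM"],
)
print(equation)
```

In the notebook itself, `regr.intercept_`, `regr.coef_` and `train_X.columns` could be passed in place of the literal values.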

## c. Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6?

In [8]:
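# Pass a DataFrame with the training column names so the features line up by name
regr.predict(pd.DataFrame([[0.1, 0, 6]], columns=['CRIM', 'CHAS', 'RM']))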

array([20.79072341])

**MEDV = -27.876 - 0.296 × 0.1 + 2.224 × 0 + 8.116 × 6 ≈ 20.79, i.e. about $20,790, since MEDV is measured in thousands of dollars**
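The hand calculation can be checked with a few lines of plain Python. This is only a sketch: the coefficients are copied from the printed outputs above, so the result agrees with `regr.predict` up to the rounding of those printed values.

```python
# Intercept and coefficients as printed by regr.intercept_ and regr.coef_
intercept = -27.87615474886597
coef = {"CRIM": -0.29594781, "CHAS": 2.22379595, "RM": 8.11607882}

# Tract from the question: crime rate 0.1, not on the Charles River, 6 rooms
tract = {"CRIM": 0.1, "CHAS": 0, "RM": 6}

# Linear prediction: intercept plus the sum of coefficient * value
medv = intercept + sum(coef[name] * tract[name] for name in coef)
print(round(medv, 2))  # 20.79 (thousands of dollars)
```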

## d. Reduce the number of predictors:
### i. Which predictors are likely to be measuring the same thing among the 13 predictors? Discuss the relationships among INDUS, NOX, and TAX.

Intuitively, INDUS, NOX and TAX are likely to be positively related. In a well-developed city such as Boston, tracts with a higher proportion of non-retail (e.g. industrial) business tend to have more pollution (higher NOX) and higher tax rates (TAX), so the three variables partly measure the same underlying characteristic.
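The strength of such a relationship is usually measured with the Pearson correlation coefficient, which is also what pandas' `DataFrame.corr()` computes by default. The sketch below implements the formula directly on small made-up vectors (not the Boston data); `pearson_r` is a hypothetical helper name.

```python
def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up vectors for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # exact multiple of x
z = [5, 3, 4, 1, 2]    # tends to fall as x rises

r_xy = pearson_r(x, y)
r_xz = pearson_r(x, z)
print(round(r_xy, 3), round(r_xz, 3))
```

A value near +1 or -1 means the two variables carry largely the same information, which is the situation suspected for INDUS, NOX and TAX.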

### ii. Compute the correlation table for the 12 numerical predictors and search for highly correlated pairs. These have potential redundancy and can cause multicollinearity. Choose which ones to remove based on this table.


In [9]:
housing.dtypes

CRIM         float64
ZN           float64
INDUS        float64
CHAS           int64
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD            int64
TAX            int64
PTRATIO      float64
LSTAT        float64
MEDV         float64
CAT. MEDV      int64
dtype: object

In [10]:
corr = housing.iloc[: , :12].corr()
corr.style.background_gradient(cmap='coolwarm')

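| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.0 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.37967 | 0.625505 | 0.582764 | 0.289946 | 0.455621 |
| ZN | -0.200469 | 1.0 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | -0.412995 |
| INDUS | 0.406583 | -0.533828 | 1.0 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.72076 | 0.383248 | 0.6038 |
| CHAS | -0.055892 | -0.042697 | 0.062938 | 1.0 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | -0.053929 |
| NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.0 | -0.302188 | 0.73147 | -0.76923 | 0.611441 | 0.668023 | 0.188933 | 0.590879 |
| RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.0 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | -0.613808 |
| AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.73147 | -0.240265 | 1.0 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | 0.602339 |
| DIS | -0.37967 | 0.664408 | -0.708027 | -0.099176 | -0.76923 | 0.205246 | -0.747881 | 1.0 | -0.494588 | -0.534432 | -0.232471 | -0.496996 |
| RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.0 | 0.910228 | 0.464741 | 0.488676 |
| TAX | 0.582764 | -0.314563 | 0.72076 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.0 | 0.460853 | 0.543993 |
| PTRATIO | 0.289946 | -0.391679 | 0.383248 | -0.121515 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 0.464741 | 0.460853 | 1.0 | 0.374044 |
| LSTAT | 0.455621 | -0.412995 | 0.6038 | -0.053929 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.488676 | 0.543993 | 0.374044 | 1.0 |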


From the table above, the following variable pairs are highly correlated (absolute correlation above roughly 0.7):

- NOX and INDUS: correlation coefficient = 0.76365
- TAX and INDUS: correlation coefficient = 0.72076
- AGE and NOX: correlation coefficient = 0.73147
- DIS and NOX: correlation coefficient = -0.76923
- DIS and INDUS: correlation coefficient = -0.70803
- DIS and AGE: correlation coefficient = -0.74788
- TAX and RAD: correlation coefficient = 0.91023
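Pairs like these can also be found programmatically instead of by eye. The sketch below scans a small subset of the coefficients copied from the correlation table; the 0.7 cutoff is an assumption inferred from the pairs listed above, not a fixed rule, and `high_corr_pairs` is a hypothetical helper.

```python
# Subset of pairwise correlations copied from the table above
pair_corr = {
    ("INDUS", "NOX"): 0.763651,
    ("INDUS", "RAD"): 0.595129,
    ("INDUS", "TAX"): 0.72076,
    ("INDUS", "CHAS"): 0.062938,
    ("NOX", "RAD"): 0.611441,
    ("NOX", "TAX"): 0.668023,
    ("NOX", "CHAS"): 0.091203,
    ("RAD", "TAX"): 0.910228,
    ("RAD", "CHAS"): -0.007368,
    ("TAX", "CHAS"): -0.035587,
}

def high_corr_pairs(pair_corr, threshold=0.7):
    """Return (var_a, var_b, r) for pairs with |r| above the threshold, strongest first."""
    flagged = [(a, b, r) for (a, b), r in pair_corr.items() if abs(r) > threshold]
    return sorted(flagged, key=lambda item: abs(item[2]), reverse=True)

flagged = high_corr_pairs(pair_corr)
for a, b, r in flagged:
    print(f"{a} and {b}: r = {r:.5f}")
```

Using the absolute value matters because strong negative correlations, such as those involving DIS, signal redundancy just as much as positive ones.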
 
**Variables to be removed: TAX, INDUS, NOX, AGE**
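One way to sanity-check this choice is to confirm that every highly correlated pair loses at least one of its members. The sketch below does that with plain Python sets; the pair list is copied from the pairs identified above, and the variable names are illustrative.

```python
# Highly correlated pairs identified from the correlation table
high_pairs = [
    ("NOX", "INDUS"), ("TAX", "INDUS"), ("AGE", "NOX"), ("DIS", "NOX"),
    ("DIS", "INDUS"), ("DIS", "AGE"), ("TAX", "RAD"),
]

# Variables chosen for removal
removed = {"TAX", "INDUS", "NOX", "AGE"}

# A pair is still a problem only if neither of its members is removed
uncovered = [pair for pair in high_pairs if not (set(pair) & removed)]

# Variables from the flagged pairs that remain in the model
kept = sorted({name for pair in high_pairs for name in pair} - removed)

print("uncovered pairs:", uncovered)
print("kept from flagged pairs:", kept)
```

No pair is left uncovered, and DIS and RAD stay in the model even though each of them appears in at least one flagged pair.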

Correlation matrix after removing the four variables:

In [11]:
corr = housing[['CRIM', 'ZN', 'CHAS', 'RM', 'DIS', 'RAD', 'PTRATIO', 'LSTAT']].corr()
corr.style.background_gradient(cmap='coolwarm')

| | CRIM | ZN | CHAS | RM | DIS | RAD | PTRATIO | LSTAT |
|---|---|---|---|---|---|---|---|---|
| CRIM | 1.0 | -0.200469 | -0.055892 | -0.219247 | -0.37967 | 0.625505 | 0.289946 | 0.455621 |
| ZN | -0.200469 | 1.0 | -0.042697 | 0.311991 | 0.664408 | -0.311948 | -0.391679 | -0.412995 |
| CHAS | -0.055892 | -0.042697 | 1.0 | 0.091251 | -0.099176 | -0.007368 | -0.121515 | -0.053929 |
| RM | -0.219247 | 0.311991 | 0.091251 | 1.0 | 0.205246 | -0.209847 | -0.355501 | -0.613808 |
| DIS | -0.37967 | 0.664408 | -0.099176 | 0.205246 | 1.0 | -0.494588 | -0.232471 | -0.496996 |
| RAD | 0.625505 | -0.311948 | -0.007368 | -0.209847 | -0.494588 | 1.0 | 0.464741 | 0.488676 |
| PTRATIO | 0.289946 | -0.391679 | -0.121515 | -0.355501 | -0.232471 | 0.464741 | 1.0 | 0.374044 |
| LSTAT | 0.455621 | -0.412995 | -0.053929 | -0.613808 | -0.496996 | 0.488676 | 0.374044 | 1.0 |
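To confirm that the reduced set no longer contains a strongly correlated pair, the largest off-diagonal entry can be extracted from the matrix. The sketch below parses the values shown above from an inline CSV string so that it runs without the data file; the 0.7 cutoff is again an assumption, not something the exercise prescribes.

```python
import csv
import io

# Reduced correlation matrix, values copied from the output above
REDUCED_CORR_CSV = """\
,CRIM,ZN,CHAS,RM,DIS,RAD,PTRATIO,LSTAT
CRIM,1.0,-0.200469,-0.055892,-0.219247,-0.37967,0.625505,0.289946,0.455621
ZN,-0.200469,1.0,-0.042697,0.311991,0.664408,-0.311948,-0.391679,-0.412995
CHAS,-0.055892,-0.042697,1.0,0.091251,-0.099176,-0.007368,-0.121515,-0.053929
RM,-0.219247,0.311991,0.091251,1.0,0.205246,-0.209847,-0.355501,-0.613808
DIS,-0.37967,0.664408,-0.099176,0.205246,1.0,-0.494588,-0.232471,-0.496996
RAD,0.625505,-0.311948,-0.007368,-0.209847,-0.494588,1.0,0.464741,0.488676
PTRATIO,0.289946,-0.391679,-0.121515,-0.355501,-0.232471,0.464741,1.0,0.374044
LSTAT,0.455621,-0.412995,-0.053929,-0.613808,-0.496996,0.488676,0.374044,1.0
"""

rows = [row for row in csv.reader(io.StringIO(REDUCED_CORR_CSV)) if row]
names = rows[0][1:]
matrix = {row[0]: dict(zip(names, map(float, row[1:]))) for row in rows[1:]}

# Each unordered pair once (upper triangle), keyed by absolute correlation
pairs = [
    (abs(matrix[a][b]), a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
]
largest = max(pairs)
print(f"largest |r| = {largest[0]:.3f} between {largest[1]} and {largest[2]}")
```

The strongest remaining relationship is between ZN and DIS at about 0.66, which is below the cutoff used earlier.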
