# Multiple Linear Regression in Python
Use the following code to import the California housing prices dataset and scikit-learn's linear models in Python. The dataset is taken from https://www.kaggle.com/camnugent/california-housing-prices/version/1. I have removed the categorical variables and the rows with missing values to make it easier to run the models.
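The preprocessing described above might look roughly like the following sketch. The toy frame stands in for the raw Kaggle file, the column name `ocean_proximity` for the categorical variable is an assumption about that file, and the helper name `reduce_housing_data` is hypothetical; this is not necessarily how `reduced_data.csv` was actually produced.

```python
import numpy as np
import pandas as pd


def reduce_housing_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep numeric columns only, then drop rows with any missing value."""
    numeric_only = raw.select_dtypes(include="number")
    return numeric_only.dropna().reset_index(drop=True)


# Toy stand-in for the raw data (values are for illustration only).
raw = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0],
    "households": [126.0, 1138.0, 177.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],  # assumed column name
})

reduced = reduce_housing_data(raw)
print(reduced.shape)
```

Here the categorical column and the row containing `NaN` are both removed, leaving two rows and two columns.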

In [1]:
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error
import numpy as np
import pandas as pd
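# Silence pandas' SettingWithCopyWarning, which adding columns to the train/test splits later on can trigger
pd.options.mode.chained_assignment = None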


train_df = pd.read_csv("reduced_data.csv")
X = train_df.drop(['median_house_value'],axis=1)
Y = train_df['median_house_value']

Print the shape (number of rows and columns) of the feature matrix X, and print the first 5 rows.

In [2]:
print(X.shape)
print(X[:5])

(20433, 8)
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  
0       322.0       126.0         8.3252  
1      2401.0      1138.0         8.3014  
2       496.0       177.0         7.2574  
3       558.0       219.0         5.6431  
4       565.0       259.0         3.8462  


Using ordinary least squares, fit a multiple linear regression (MLR) model on all the feature variables using the entire dataset. Report the regression coefficient of each input feature and evaluate the model using mean absolute error (MAE). An example of ordinary least squares in Python is shown in Section 1.1.1 of http://scikit-learn.org/stable/modules/linear_model.html.

In [3]:
clf = linear_model.LinearRegression()
clf.fit(X, Y)

# print intercept and coefficients
print('Intercept: ', clf.intercept_)
print('Coef: ', clf.coef_)

predictions = clf.predict(X)

mae = mean_absolute_error(Y, predictions)
print('MAE: ', mae)

Intercept:  -3585395.747892478
Coef:  [-4.27301205e+04 -4.25097369e+04  1.15790031e+03 -8.24972507e+00
  1.13820707e+02 -3.83855780e+01  4.77013513e+01  4.02975217e+04]
MAE:  50799.6307289529
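The coefficient array above is printed without labels, so it may help to pair each value with its feature name. The sketch below does this on synthetic data (the names `X_demo`, `Y_demo`, and `model` are made up for the illustration and are not part of the notebook) and also checks that MAE is simply the mean of the absolute residuals.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for X and Y (not the housing data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
noise = rng.normal(scale=0.1, size=200)
Y_demo = 2.0 * X_demo["a"] - 1.0 * X_demo["b"] + 0.5 * X_demo["c"] + 3.0 + noise

model = linear_model.LinearRegression().fit(X_demo, Y_demo)

# Pair each coefficient with the column it belongs to.
coef_by_feature = pd.Series(model.coef_, index=X_demo.columns)
print(coef_by_feature)

# MAE is the mean of the absolute residuals.
pred = model.predict(X_demo)
mae_manual = np.mean(np.abs(Y_demo - pred))
mae_sklearn = mean_absolute_error(Y_demo, pred)
print(mae_manual, mae_sklearn)
```

On the fitted model from this notebook, the same labelling would be `pd.Series(clf.coef_, index=X.columns)`.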


Split the data into a training set and a test set using train_test_split with test_size = 0.30 and random_state = 11. Fit an MLR model on the training set. Evaluate the trained model on the training set and on the test set, and compare the two MAE values obtained.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.30, random_state=11)

clf_train = linear_model.LinearRegression()
clf_train.fit(X_train, Y_train)
predictions_train = clf_train.predict(X_train)
mae_train = mean_absolute_error(Y_train, predictions_train)
print('MAE train: ', mae_train)
predictions_test = clf_train.predict(X_test)
mae_test = mean_absolute_error(Y_test, predictions_test)
print('MAE test: ', mae_test)

MAE train:  50749.10314465295
MAE test:  50916.74299435109
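One simple way to compare the two values is to look at the gap between them relative to the training error. The sketch below only does arithmetic on the numbers printed above; the variable names `gap` and `relative_gap` are introduced here for illustration.

```python
# MAE values copied from the output above.
mae_train = 50749.10314465295
mae_test = 50916.74299435109

gap = mae_test - mae_train        # absolute difference
relative_gap = gap / mae_train    # as a fraction of the training MAE

print(f"gap: {gap:.2f}")
print(f"relative gap: {relative_gap:.2%}")
```

The test MAE is about 168 higher than the training MAE, a difference of roughly 0.33%, so the model performs almost identically on the held-out data.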


Calculate the Pearson correlation matrix of the independent variables in the training set. Report the variables whose correlation with the variable 'households' has magnitude greater than 0.9.

In [5]:
X_train.corr()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| longitude | 1.0 | -0.925627 | -0.111272 | 0.042788 | 0.069305 | 0.101596 | 0.056116 | -0.020466 |
| latitude | -0.925627 | 1.0 | 0.013098 | -0.034147 | -0.066424 | -0.1096 | -0.070537 | -0.074943 |
| housing_median_age | -0.111272 | 0.013098 | 1.0 | -0.356534 | -0.316644 | -0.294652 | -0.298702 | -0.115736 |
| total_rooms | 0.042788 | -0.034147 | -0.356534 | 1.0 | 0.927454 | 0.859323 | 0.916556 | 0.198486 |
| total_bedrooms | 0.069305 | -0.066424 | -0.316644 | 0.927454 | 1.0 | 0.880929 | 0.979547 | -0.013082 |
| population | 0.101596 | -0.1096 | -0.294652 | 0.859323 | 0.880929 | 1.0 | 0.910283 | -0.001523 |
| households | 0.056116 | -0.070537 | -0.298702 | 0.916556 | 0.979547 | 0.910283 | 1.0 | 0.008033 |
| median_income | -0.020466 | -0.074943 | -0.115736 | 0.198486 | -0.013082 | -0.001523 | 0.008033 | 1.0 |


Independent variables whose correlation with 'households' exceeds 0.9 in magnitude:
1. total_rooms (0.916556)
2. total_bedrooms (0.979547)
3. population (0.910283)
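The same selection can be made programmatically instead of by reading the matrix. The sketch below applies the 0.9 threshold to the 'households' correlations copied from the matrix above; the names `corr_households` and `high` are illustrative.

```python
import pandas as pd

# Correlations with 'households', copied from the matrix above.
corr_households = pd.Series({
    "longitude": 0.056116,
    "latitude": -0.070537,
    "housing_median_age": -0.298702,
    "total_rooms": 0.916556,
    "total_bedrooms": 0.979547,
    "population": 0.910283,
    "households": 1.0,
    "median_income": 0.008033,
})

# Exclude the variable itself, then keep entries with |r| > 0.9.
others = corr_households.drop("households")
high = others[others.abs() > 0.9]
print(high)
```

On the actual training frame, the starting series would be `X_train.corr()["households"]`.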

Add the following independent variables to both the training and test sets:
1. avg_bedrooms = total_bedrooms/households
2. avg_rooms = total_rooms/households
3. avg_population = population/households

Recalculate the correlation matrix.

In [6]:
X_train['avg_bedrooms'] = X_train['total_bedrooms']/X_train['households']
X_train['avg_rooms'] = X_train['total_rooms']/X_train['households']
X_train['avg_population'] = X_train['population']/X_train['households']

X_test['avg_bedrooms'] = X_test['total_bedrooms']/X_test['households']
X_test['avg_rooms'] = X_test['total_rooms']/X_test['households']
X_test['avg_population'] = X_test['population']/X_test['households']

X_train.corr()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | avg_bedrooms | avg_rooms | avg_population |
|---|---|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.0 | -0.925627 | -0.111272 | 0.042788 | 0.069305 | 0.101596 | 0.056116 | -0.020466 | 0.017548 | -0.027099 | 0.011811 |
| latitude | -0.925627 | 1.0 | 0.013098 | -0.034147 | -0.066424 | -0.1096 | -0.070537 | -0.074943 | 0.062059 | 0.104294 | -0.00233 |
| housing_median_age | -0.111272 | 0.013098 | 1.0 | -0.356534 | -0.316644 | -0.294652 | -0.298702 | -0.115736 | -0.077163 | -0.158539 | 0.012569 |
| total_rooms | 0.042788 | -0.034147 | -0.356534 | 1.0 | 0.927454 | 0.859323 | 0.916556 | 0.198486 | 0.03617 | 0.146227 | -0.031031 |
| total_bedrooms | 0.069305 | -0.066424 | -0.316644 | 0.927454 | 1.0 | 0.880929 | 0.979547 | -0.013082 | 0.054525 | 0.004237 | -0.036556 |
| population | 0.101596 | -0.1096 | -0.294652 | 0.859323 | 0.880929 | 1.0 | 0.910283 | -0.001523 | -0.062026 | -0.072951 | 0.077684 |
| households | 0.056116 | -0.070537 | -0.298702 | 0.916556 | 0.979547 | 0.910283 | 1.0 | 0.008033 | -0.05005 | -0.08195 | -0.034892 |
| median_income | -0.020466 | -0.074943 | -0.115736 | 0.198486 | -0.013082 | -0.001523 | 0.008033 | 1.0 | -0.059447 | 0.350785 | 0.000417 |
| avg_bedrooms | 0.017548 | 0.062059 | -0.077163 | 0.03617 | 0.054525 | -0.062026 | -0.05005 | -0.059447 | 1.0 | 0.833841 | -0.002194 |
| avg_rooms | -0.027099 | 0.104294 | -0.158539 | 0.146227 | 0.004237 | -0.072951 | -0.08195 | 0.350785 | 0.833841 | 1.0 | 0.003475 |


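The new variables have a low correlation with the 'households' variable: avg_bedrooms (-0.050), avg_rooms (-0.082), and avg_population (-0.035) all have magnitude below 0.1.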
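The three per-household ratios are built the same way for the training and test sets, so one option is to put them in a small helper that returns a new frame rather than modifying the split in place. The function name `add_household_averages` is hypothetical, and the sample rows are the first two rows of the feature matrix printed earlier.

```python
import pandas as pd


def add_household_averages(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with per-household average columns appended."""
    return df.assign(
        avg_bedrooms=df["total_bedrooms"] / df["households"],
        avg_rooms=df["total_rooms"] / df["households"],
        avg_population=df["population"] / df["households"],
    )


# First two rows of the feature matrix shown earlier (relevant columns only).
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

augmented = add_household_averages(sample)
print(augmented[["avg_bedrooms", "avg_rooms", "avg_population"]])
```

Because `assign` returns a new DataFrame, the input is left unchanged, and the same call could be applied to each split, for example `X_train = add_household_averages(X_train)`.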

Fit an MLR model on the augmented training data (with the additional independent variables) and report the MAE on the augmented training and test sets.

In [7]:
clf_train = linear_model.LinearRegression()
clf_train.fit(X_train, Y_train)
predictions_train = clf_train.predict(X_train)
mae_train = mean_absolute_error(Y_train, predictions_train)
print('MAE train: ', mae_train)
predictions_test = clf_train.predict(X_test)
mae_test = mean_absolute_error(Y_test, predictions_test)
print('MAE test: ', mae_test)

MAE train:  50474.268791457886
MAE test:  50783.974603927876
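To see what the extra features changed, the MAE values reported for the two models can be placed side by side. The sketch below only tabulates the numbers printed in this notebook; the column labels are chosen here for illustration.

```python
import pandas as pd

# MAE values copied from the two outputs in this notebook.
mae = pd.DataFrame(
    {
        "original features": [50749.10314465295, 50916.74299435109],
        "with averages": [50474.268791457886, 50783.974603927876],
    },
    index=["train", "test"],
)

# Absolute and percentage change after adding the per-household averages.
mae["change"] = mae["with averages"] - mae["original features"]
mae["change_pct"] = 100 * mae["change"] / mae["original features"]
print(mae.round(2))
```

Both errors fall slightly, by about 0.5% on the training set and about 0.3% on the test set, so on this split the added averages give only a small improvement.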
