# Problem 1 

First, import some common libraries:

In [None]:
# Import Numpy and Pandas and assign aliases "np" and "pd"
import numpy as np
import pandas as pd

import matplotlib as mpl 
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression


The Boston home price dataset can be loaded from the sklearn package:

In [None]:
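# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston)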

Let's see what's inside boston:

In [None]:
boston.keys()

comment: boston.keys() lists the entries stored inside the boston object. The "DESCR" entry holds a text description of the dataset, which the line below prints.

In [None]:
print(boston.DESCR)

comment: The lines below print the number of rows and columns of boston.data, followed by the name of each data column.

In [None]:
print("row, cols=", boston.data.shape)
print(boston.feature_names)

comment: A slice such as "boston.data[0:5,3]" accesses rows 0 through 4 (the end index is excluded) and returns their fourth column (index 3). Ranged indices use the ":" colon operator. For instance, the modified line "sum(boston.data[0:506,3])" gathers the CHAS variable and sums the entire column. The sum is 35, implying that 35 tracts are bounded by the Charles River.

In [None]:
print("row, cols=", boston.data.shape)
print(boston.feature_names)
sum(boston.data[0:506,3])

comment: The line below accesses the value in row 0, column index 4 of the boston dataset. Because indices start at 0, this is the fifth column (NOX).

In [None]:
boston.data[0, 4]
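
As a small illustration of the indexing rules above, the sketch below uses a toy array (made-up values, not the Boston data; the name `toy` is hypothetical). It shows that a slice `0:5` stops before row 5 and that column index 4 is the fifth column.

```python
import numpy as np

# Toy 6x5 array standing in for boston.data (values are simply 0..29)
toy = np.arange(30).reshape(6, 5)

first_five_col3 = toy[0:5, 3]      # rows 0-4 (row 5 is excluded), column index 3
whole_col3_sum = toy[:, 3].sum()   # ":" alone selects every row
single_value = toy[0, 4]           # row 0, column index 4 (the fifth column)

print(first_five_col3)
print(whole_col3_sum)
print(single_value)
```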

Transform array to pandas data frame: a data frame is great for holding datasets: each column represents a variable and each row an example.  Columns can be different types. 

In [None]:
bostonDF=pd.DataFrame(boston.data, columns=boston.feature_names)

comment: The line below prints out the first 5 rows and all columns of the pandas dataframe

In [None]:
bostonDF.head()

In [None]:
#comment: The line below adds the boston target values to bostonDF as a new column named 'medv'.
 #   The pandas DataFrame .describe() member function, called in a later cell, prints a table of summary statistics for each column

In [None]:
bostonDF['medv']=boston.target
print(boston.keys())

It's useful to see the range (min, max) of each variable when creating a prediction model: the model is unlikely to work well outside the observed data ranges.  

In [None]:
bostonDF.describe()
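
One way to act on the observed ranges is to check a new input against the training minimum and maximum before trusting a prediction. The sketch below is only an illustration: the data values are made up, and the helper `outside_training_range` is a hypothetical name, not part of pandas.

```python
import pandas as pd

# Hypothetical training data; column names mirror the notebook, values are made up
train = pd.DataFrame({"RM": [5.0, 6.2, 7.1, 6.8], "DIS": [1.5, 3.0, 4.2, 2.8]})

def outside_training_range(train_df, new_row):
    """Return the features whose new value lies outside [min, max] of train_df."""
    lo, hi = train_df.min(), train_df.max()
    return [c for c in train_df.columns if not lo[c] <= new_row[c] <= hi[c]]

print(outside_training_range(train, {"RM": 6.0, "DIS": 2.0}))  # both features in range
print(outside_training_range(train, {"RM": 9.5, "DIS": 2.0}))  # RM above observed max
```

A non-empty result means the model would be extrapolating on those features.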

comment: The info() member function prints out the datatype for all dataframe columns; every value in this set is a 64-bit floating point number

In [None]:
bostonDF.info()

The relationship between DIS (weighted distance to Boston employment centres) and median price doesn't look quite linear...

In [None]:
plt.scatter(bostonDF.DIS, bostonDF.medv)  
plt.title('DIS vs. Median Price')  
plt.xlabel('DIS')  
plt.ylabel('Median Price')  
plt.show()  

comment: The graph generated below plots the average number of rooms per dwelling (RM) against the median price, to check whether a higher RM goes with a higher median price.

In [None]:
plt.scatter(bostonDF.RM, bostonDF.medv)  
plt.title('RM vs. Median Price')  
plt.xlabel('RM')  
plt.ylabel('Median Price')  
plt.show()

comment: The lines below assign X to every row of the columns "RM" (average number of rooms per dwelling) and "DIS" (weighted distance to Boston employment centres).
    Additionally, y is assigned the "medv" column (median home value).

In [None]:
X=bostonDF.loc[:, ["RM", "DIS"]]
y=bostonDF.loc[:, "medv"]

comment: The LinearRegression() class from scikit-learn is used to create an estimator object, which is assigned to "lmfit". X and y are then passed to the fit() member function of "lmfit", which fits a linear regression model: X supplies the input variables for which beta coefficients are estimated, and y is the target the model learns to predict.

In [None]:
lmfit = LinearRegression()
lmfit.fit(X, y)
print(lmfit)
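
LinearRegression fits by ordinary least squares. As a rough sketch of what that means (not scikit-learn's internal code), the example below recovers known coefficients from synthetic, noise-free data using `np.linalg.lstsq`. The data and the "true" coefficient values are assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two features and a target (not the Boston data)
X = rng.uniform(0, 10, size=(50, 2))
true_intercept, true_coef = 3.0, np.array([2.0, -1.5])
y = true_intercept + X @ true_coef   # noise-free, so the fit should be exact

# Prepend a column of ones so the intercept is estimated along with the slopes
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, coef = beta[0], beta[1:]
print(intercept, coef)
```

Here `intercept` and `coef` play the same roles as `lmfit.intercept_` and `lmfit.coef_` in the notebook.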

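My model is $\widehat{medv}=\hat{\beta}_0+\hat{\beta}_1 \text{RM}+\hat{\beta}_2 \text{DIS}$, where the intercept $\hat{\beta}_0$ and the coefficients $\hat{\beta}_1, \hat{\beta}_2$ are the values printed by the cell below.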

In [None]:
print(lmfit.intercept_)
print(lmfit.coef_)

comment: The predicted median values ("hat_medv") are obtained by calling the .predict() function of the lmfit object on the input X, and are stored as a new column. From browsing the data, it is apparent that the "hat_medv" values do not match "medv" closely, likely because only "RM" and "DIS" are used as the independent variables.

In [None]:
bostonDF['hat_medv']=lmfit.predict(X)
print(bostonDF)

Differences between actual median home price and predicted home price for the training dataset:

In [None]:
bostonDF['error']=bostonDF['medv']-bostonDF['hat_medv']
print(bostonDF['error'])

A plot of prediction error vs. a variable is called a residual plot: if the points are evenly spread out above and below zero, then a linear model might be okay. If the points show a pattern, then linear regression may not be a good fit.

In [None]:
plt.scatter(bostonDF.RM, bostonDF.error)  
plt.title('RM vs. Error')  
plt.xlabel('RM')  
plt.ylabel('Error')  
plt.show()  

In [None]:
plt.scatter(bostonDF.DIS, bostonDF.error)  
plt.title('DIS vs. Error')  
plt.xlabel('DIS')  
plt.ylabel('Error')  
plt.show() 
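
To make "a pattern in the residuals" concrete, the sketch below fits a straight line to deliberately curved, made-up data (not the Boston data). The residuals average out to zero overall, yet they are positive at the edges and negative in the middle, which is the kind of structure a residual plot would reveal.

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y = x ** 2                                   # made-up curved relationship

slope, intercept = np.polyfit(x, y, deg=1)   # straight-line least-squares fit
residuals = y - (slope * x + intercept)

# Mean residual near the edges and near the centre of the x range
outer = residuals[np.abs(x) > 2].mean()
centre = residuals[np.abs(x) < 1].mean()
print(outer, centre)
```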

Mean squared error on the training dataset (a separate test dataset should really be used to assess the model's actual performance):

In [None]:
(bostonDF['error']**2).mean()
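
A minimal sketch of evaluating on held-out data is shown below, using synthetic data and a hand-rolled 80/20 split (scikit-learn's `train_test_split` does the same job). The data, the split ratio, and the helper names `add_intercept` and `mse` are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption: two features, linear signal plus unit-variance noise)
X = rng.uniform(0, 10, size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 1, size=200)

# Shuffle the row indices and hold out the last 20% as a test set
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]

def add_intercept(M):
    """Prepend a column of ones so the intercept is fitted with the slopes."""
    return np.column_stack([np.ones(len(M)), M])

# Fit on the training rows only
beta, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

def mse(M, target):
    """Mean squared error of the fitted model on the given rows."""
    return float(((target - add_intercept(M) @ beta) ** 2).mean())

train_mse = mse(X[train_idx], y[train_idx])
test_mse = mse(X[test_idx], y[test_idx])
print(train_mse, test_mse)
```

The test-set value is the more honest estimate, since those rows played no part in the fit.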

In [None]:
#comment: The line below takes the absolute value of each error, then takes the mean of the whole set (the mean absolute error).
# Since medv is recorded in thousands of dollars, this can be read as an unweighted average error of $5,650.29 in each tract's median home value.

In [None]:
(abs(bostonDF['error'])).mean()
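
The two error measures are in different units: the mean absolute error is in the units of medv (thousands of dollars), while the mean squared error is in squared units and needs a square root to be comparable. The sketch below works this through on a few made-up error values, which are not taken from the fitted model.

```python
import math

# Made-up errors in thousands of dollars (not taken from the fitted model)
errors = [3.0, -4.0, 1.0, -2.0]

mae = sum(abs(e) for e in errors) / len(errors)   # same units as medv
mse = sum(e ** 2 for e in errors) / len(errors)   # squared units
rmse = math.sqrt(mse)                             # back to medv units

print(f"MAE  = {mae:.2f} thousand dollars (${mae * 1000:,.0f})")
print(f"RMSE = {rmse:.2f} thousand dollars (${rmse * 1000:,.0f})")
```

The root mean squared error is never smaller than the mean absolute error, because squaring gives large errors more weight.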