# Lab 3: Data Science in Python

### 1 Data

In this lab you will work with the Boston house price dataset. The dataset is available through SciKit-learn. Import the dataset and print its description. Then create a pandas DataFrame containing all 14 attributes in the dataset.

In [4]:
from sklearn import datasets
import pandas as pd

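# Load the data set (load_boston was removed in scikit-learn 1.2,
# so this cell requires an older release)
boston = datasets.load_boston()
# Create a DataFrame of the 13 predictive features; the 14th attribute
# (the target, MEDV) is kept separately in boston.target
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
# Print the description
print(boston.DESCR)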

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

### 2 Implementing OLS Regression

Create a class `OLS`. The constructor should take a DataFrame `data` and a label name `response` for the response variable. Using NumPy, fit an Ordinary Least Squares regression (including an intercept) on the data. Implement the fit manually, using only simple matrix operations (inverse, transposition and multiplication), but you are encouraged to check your results against `scipy.linalg.lstsq` (or `numpy.linalg.lstsq`). The class should implement the instance methods `get_yhat` (return an ndarray representing $\hat{y}$), `get_residuals` (return an ndarray representing $y-\hat{y}$), `get_rmse` (return the root mean squared error) and `get_beta` (return the fitted $\beta$ vector as an ndarray).
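
For reference, the quantities named above can be written out as follows. This is a sketch under stated assumptions: $X$ is the design matrix with a leading column of ones for the intercept, $y$ is the response vector, and $N$ (a symbol introduced here) is the number of observations. The RMSE is taken with a $1/N$ normalisation, which is one common convention.

$$
\beta = (X^\top X)^{-1} X^\top y, \qquad \hat{y} = X\beta, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}
$$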

In [6]:
import numpy as np
import pandas as pd
from numpy.linalg import inv

class OLS:
    
    def __init__(self, dataFrame, response):
        # Coefficient labels: intercept first, then the feature names
        self.columnNames = np.insert(dataFrame.columns, 0, 'INT')
        
        # Note: in this implementation `response` is an array of target
        # values rather than a column label
        self.X = dataFrame.to_numpy()
        self.Y = np.asarray(response)
        
        # Add an intercept column of ones to the design matrix
        intercept = np.ones(shape=self.Y.shape)[..., None]
        self.X = np.concatenate((intercept, self.X), 1)
        
        # Calculate the coefficients from the normal equations: (X'X)^-1 X'y
        self.coeffs = inv(self.X.transpose().dot(self.X)).dot(self.X.transpose()).dot(self.Y)
        
        # Fitted values
        self.yhat = self.X.dot(self.coeffs)
        
        # Residuals
        self.residual = self.Y - self.yhat
        
        # Root mean squared error
        self.rmse = np.sqrt(np.mean(self.residual ** 2))
        
    def get_beta(self):
        # Returned as a labelled, rounded DataFrame for readable display
        results = pd.DataFrame({'coefficients': self.coeffs}, index=self.columnNames)
        return results.round(2)
    
    def get_yhat(self):
        return self.yhat
    
    def get_residuals(self):
        return self.residual
    
    def get_rmse(self):
        return self.rmse
    
    def cross_validate(self, n):
        pass
    

# Test code
obj = OLS(df_boston, boston.target)
obj.get_beta()  # get the beta coefficients

| | coefficients |
|---|---|
| INT | 36.49 |
| CRIM | -0.11 |
| ZN | 0.05 |
| INDUS | 0.02 |
| CHAS | 2.69 |
| NOX | -17.8 |
| RM | 3.8 |
| AGE | 0.0 |
| DIS | -1.48 |
| RAD | 0.31 |
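
The recommended check against `numpy.linalg.lstsq` can be sketched as below. This is a minimal illustration, not the lab solution: it uses synthetic data (an assumption, so that it runs without the Boston dataset), and the variable names are made up for the example.

```python
import numpy as np

# Synthetic data, used here only for illustration (not the Boston data)
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5, 3.0])  # intercept first

# Prepend a column of ones so the model includes an intercept
X = np.column_stack([np.ones(len(features)), features])
y = X @ true_beta + rng.normal(scale=0.1, size=len(features))

# Manual fit via the normal equations: (X'X)^-1 X'y
beta_manual = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference fit from NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The two estimates should agree up to floating-point error
agree = np.allclose(beta_manual, beta_lstsq)
print(agree)
```

On well-conditioned data the two estimates agree closely. With nearly collinear columns the explicit inverse becomes numerically fragile, which is one reason a dedicated solver is a useful cross-check.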


### 3 Cross Validation and plotting

Implement a function `cross_validate` in the OLS class which should take one parameter, `n`, and perform n-fold cross-validation on the data supplied at creation of an OLS object.
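
One possible shape for n-fold cross-validation is sketched below as standalone functions rather than as the `cross_validate` method itself. The helper names `fit_ols` and `cross_validate_ols`, the `seed` parameter and the synthetic data are all assumptions made for illustration. Shuffling before splitting is a design choice, not a requirement of the task.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients; X must already contain an intercept column."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

def cross_validate_ols(X, y, n, seed=0):
    """Return out-of-sample predictions from n-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))    # shuffle the rows before splitting
    folds = np.array_split(indices, n)   # n folds of nearly equal size
    yhat_cv = np.empty(len(y), dtype=float)
    for test_idx in folds:
        # Train on every row that is not in the current test fold
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        beta = fit_ols(X[train_mask], y[train_mask])
        # Predict only the held-out rows
        yhat_cv[test_idx] = X[test_idx] @ beta
    return yhat_cv

# Usage on synthetic data (not the Boston data)
rng = np.random.default_rng(1)
features = rng.normal(size=(60, 2))
X = np.column_stack([np.ones(60), features])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=60)

yhat_cv = cross_validate_ols(X, y, n=5)
rmse_cv = np.sqrt(np.mean((y - yhat_cv) ** 2))
```

Each observation is predicted exactly once, by a model that never saw it during fitting, so `yhat_cv` has the same length as `y` and can be compared directly with the in-sample $\hat{y}$.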

Use `matplotlib` and the results from above to plot $y$ against both the in-sample $\hat{y}$ and the out-of-sample cross-validated $\hat{y}$ in a single scatter plot. Plot the in-sample and out-of-sample predictions in different colors. The plot should have suitable labels on both axes and a legend explaining which color represents which $\hat{y}$.
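
A plot of this kind might be laid out as in the following sketch. The function name `plot_predictions`, the colors and the axis label text are assumptions, and the arrays are random placeholders standing in for the real $y$ and $\hat{y}$ values.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_predictions(y, yhat_in, yhat_cv):
    """Scatter observed y against in-sample and cross-validated predictions."""
    fig, ax = plt.subplots()
    ax.scatter(y, yhat_in, color="tab:blue", alpha=0.6, label="in-sample")
    ax.scatter(y, yhat_cv, color="tab:orange", alpha=0.6,
               label="out-of-sample (cross-validated)")
    ax.set_xlabel("observed y")
    ax.set_ylabel("predicted y")
    ax.legend()
    return fig, ax

# Placeholder values for illustration only
rng = np.random.default_rng(0)
y = rng.normal(size=40)
yhat_in = y + rng.normal(scale=0.1, size=40)
yhat_cv = y + rng.normal(scale=0.2, size=40)

fig, ax = plot_predictions(y, yhat_in, yhat_cv)
```

In a notebook the figure is displayed after the cell runs. In a script, call `plt.show()` or `fig.savefig(...)` afterwards.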

### 4 SciKit-learn

Explore SciKit-learn and select 3 regression algorithms you would like to try. Also choose some form of shrinkage, regularization or feature selection, unless it is inherent in the model (as in, for example, LASSO or kernel SVM). Experiment with the hyperparameters or, better, apply a hyperparameter tuning algorithm (SciKit-learn provides classes for this, for example `GridSearchCV`). Apply the selected algorithms to the Boston data set, using cross-validation to generate out-of-sample predictions (also implemented in SciKit-learn, for example `cross_val_predict`), and plot the results.
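
As a starting point, the sketch below combines hyperparameter tuning with out-of-sample prediction for one model. It is an illustration under assumptions: ridge regression is just one possible choice, the `alpha` grid is arbitrary, and the data come from `make_regression` rather than the Boston data set. The task itself asks for three algorithms.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption for illustration)
X, y = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

# Standardize the features, then fit ridge regression (L2 shrinkage)
pipeline = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names each step after its lowercased class name
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # for tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # for prediction

search = GridSearchCV(pipeline, param_grid, cv=inner_cv,
                      scoring="neg_mean_squared_error")

# Each outer fold re-runs the grid search on its training part only,
# so every prediction comes from a model that never saw that row
yhat_cv = cross_val_predict(search, X, y, cv=outer_cv)
```

Putting the scaler inside the pipeline means it is re-fitted on each training fold, which avoids leaking information from the held-out rows into the preprocessing step.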