## Tutorial 2: Machine Learning and Programming in Python

This tutorial covers the main elements of linear regression, regularization techniques and cross-validation by predicting housing prices in Boston. To implement these techniques, we use the "Boston Housing Dataset", which is available from several sources such as GitHub, Kaggle and older releases of scikit-learn. The data originate from Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.

The data contains the following variables:

    CRIM - per capita crime rate by town
    ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
    INDUS - proportion of non-retail business acres per town.
    CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
    NOX - nitric oxides concentration (parts per 10 million)
    RM - average number of rooms per dwelling
    AGE - proportion of owner-occupied units built prior to 1940
    DIS - weighted distances to five Boston employment centres
    RAD - index of accessibility to radial highways
    TAX - full-value property-tax rate per $10,000
    PTRATIO - pupil-teacher ratio by town
    B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    LSTAT - % lower status of the population
    MEDV - Median value of owner-occupied homes in $1000's


In [9]:
import numpy  as np
import pandas as pd
import matplotlib.pyplot as plt #For upcoming plots

Before starting the analysis, one has to prepare the data. This is done via the following code:

In [10]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Load the dataset and set the column names
boston_housing = pd.read_csv('boston_house_prices.csv', skiprows=1, names=column_names)

#Drop the first row, which contains the variable names rather than data
boston_housing = boston_housing.drop(index=0)

#Converting all columns to numeric
boston_housing = boston_housing.apply(pd.to_numeric, errors='coerce')

#Check for NaNs
print("Any NaNs?",boston_housing.isna().any().any())

#Note that the data is already standardized

Any NaNs? False


### Task 1)

a) Get a first insight into the data.

   - How many observations are present?
   - Which types do the variables have?
   - Print out a summary statistic including the min, max, quantiles and the median

b) Create the training and test sets. The training set should contain 70% of the overall data. The remaining 30% form the test set.
   - Use MEDV as your dependent variable
   - All remaining variables are used as the independent variables (Remember: We want to predict the median value of owner-occupied homes)

(Hint: Set a "random_state = 10" in order to reproduce your training and test set.)

c) Train and test a linear regression.
   - Show your results in a suitable table
   - How can your model be validated? Compare your validation metrics between the training and test set.
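
For parts a) and b), the following is a minimal sketch of one possible approach. It runs on a small synthetic frame called `demo` (a hypothetical stand-in, not the real data); in the notebook the same calls would be applied to `boston_housing`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: two predictors only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, size=100),
    "LSTAT": rng.uniform(2, 35, size=100),
})
demo["MEDV"] = 5 * demo["RM"] - 0.5 * demo["LSTAT"] + rng.normal(0, 2, size=100)

# a) First look: number of observations, column types, summary statistics
print("Observations:", demo.shape[0])
print(demo.dtypes)
print(demo.describe())  # min, max and quartiles; the 50% row is the median

# b) MEDV is the dependent variable, all other columns are predictors
X = demo.drop(columns="MEDV")
y = demo["MEDV"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which is what the hint above asks for.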


In [11]:
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error, r2_score 
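
For part c), one way to compare training and test performance is to collect the metrics in a small table. The sketch below uses synthetic data with 13 predictors as a stand-in (an assumption; the names `true_coef` and `results` are illustrative) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 13 predictors, only the first three matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
true_coef = np.zeros(13)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Fit ordinary least squares on the training set only
ols = LinearRegression().fit(X_train, y_train)

# Evaluate on both sets and collect the metrics in one table
results = pd.DataFrame(
    {
        "MSE": [
            mean_squared_error(y_train, ols.predict(X_train)),
            mean_squared_error(y_test, ols.predict(X_test)),
        ],
        "R2": [
            r2_score(y_train, ols.predict(X_train)),
            r2_score(y_test, ols.predict(X_test)),
        ],
    },
    index=["train", "test"],
)
print(results)
```

A test error that is much larger than the training error would point to overfitting.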




d) Include the ridge penalization term.
   - Implement a ridge regression that includes alpha = 1.
   - Does the model improve in comparison to task c? Explain.
   - Try out different tuning parameters. What happens when alpha increases or decreases? 


(Hint: Use a large grid for the hyperparameter values. E.g. from 0 to 100.)

In [12]:
from sklearn.linear_model import Ridge
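
To see what the ridge penalty does, it can help to track the size of the coefficient vector over a grid of alpha values. The following sketch uses synthetic stand-in data (an assumption) and an illustrative grid; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 13 predictors, only the first three matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
true_coef = np.zeros(13)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Unpenalized baseline for comparison
ols = LinearRegression().fit(X_train, y_train)
ols_norm = np.linalg.norm(ols.coef_)

# Fit one ridge model per alpha and record coefficient norm and test MSE
alphas = [0.01, 1, 10, 100]
norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    norm = np.linalg.norm(ridge.coef_)
    mse = mean_squared_error(y_test, ridge.predict(X_test))
    norms.append(norm)
    print(f"alpha={alpha:>6}: ||coef||={norm:.3f}, test MSE={mse:.3f}")
```

The coefficient norm shrinks as alpha grows, while a very small alpha gives nearly the least-squares fit.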



e) Include the lasso penalization term.
   - Implement a lasso regression that includes alpha = 1.
   - Does the model improve in comparison to task c? Explain.
   - Try out different tuning parameters. What happens when alpha increases or decreases?

   (Hint: Use a large grid for the hyperparameter values. E.g. from 0 to 100.)  

In [13]:
from sklearn.linear_model import Lasso
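
Unlike ridge, the lasso penalty can set coefficients exactly to zero. A minimal sketch of how one might observe this, again on synthetic stand-in data (an assumption) with an illustrative grid of alphas:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 13 predictors, only the first three matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
true_coef = np.zeros(13)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Count how many coefficients survive for each penalty strength
alphas = [0.01, 0.1, 1, 100]
nonzero_counts = {}
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    mse = mean_squared_error(y_test, lasso.predict(X_test))
    print(f"alpha={alpha:>6}: non-zero coefs={nonzero_counts[alpha]}, test MSE={mse:.3f}")
```

With a large enough alpha every coefficient is zero and the model only predicts the training mean.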



f) Include the elastic net penalization term.
   - Implement an elastic net model that includes alpha = 1 and an l1 ratio of 0.5.
   - Does the model improve in comparison to task c? Explain.
   - Try out different tuning parameters. What happens when alpha increases or decreases?

   (Hint: Use a large grid for the hyperparameter values. E.g. from 0 to 100.)   

In [14]:
from sklearn.linear_model import ElasticNet
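
The elastic net mixes both penalties; in scikit-learn, `l1_ratio` controls the share of the L1 part. The sketch below fits the requested setting (alpha = 1, l1_ratio = 0.5) next to an unpenalized baseline, using synthetic stand-in data (an assumption).

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 13 predictors, only the first three matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
true_coef = np.zeros(13)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Baseline and elastic net with equal weight on the L1 and L2 parts
ols = LinearRegression().fit(X_train, y_train)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_train, y_train)

mse_ols = mean_squared_error(y_test, ols.predict(X_test))
mse_enet = mean_squared_error(y_test, enet.predict(X_test))

print("OLS test MSE:        ", round(mse_ols, 3))
print("Elastic net test MSE:", round(mse_enet, 3))
print("Sum |coef| OLS / elastic net:",
      round(np.abs(ols.coef_).sum(), 3), "/", round(np.abs(enet.coef_).sum(), 3))
```

Looping over several values of `alpha` and `l1_ratio` in the same way shows how the fit moves between ridge-like and lasso-like behaviour.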



g) Now include cross-validation to find the optimal tuning parameter for your lasso model.
  - Did your lasso model improve?
  - What is the problem with using cross-validation?

In [15]:
from sklearn.model_selection import GridSearchCV
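
A possible way to tune alpha is a grid search with 5-fold cross-validation on the training set. The grid, the number of folds and the scoring metric below are illustrative assumptions, and the data are a synthetic stand-in.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 13 predictors, only the first three matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
true_coef = np.zeros(13)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Candidate penalties on a log scale; strictly positive to avoid alpha = 0
param_grid = {"alpha": np.logspace(-3, 2, 20)}

search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (less negative) is better
    cv=5,
)
search.fit(X_train, y_train)  # the test set is never used for tuning

best_alpha = search.best_params_["alpha"]
test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
print("Best alpha:", best_alpha)
print("Test MSE with best alpha:", round(test_mse, 3))
```

Each grid point is fitted once per fold, so the cost grows with both the grid size and the number of folds.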



h) Now include cross-validation to find the optimal tuning parameter and the penalty ratio for your elastic net model.
  - Did your elastic net model improve?
  - What is the resulting ratio of l1 to l2 penalty? Interpret your results.

In [16]:
from sklearn.model_selection import GridSearchCV
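
For the elastic net, the same search can run over two parameters at once. The sketch below is one possible setup, with an illustrative grid for `alpha` and `l1_ratio` and synthetic stand-in data (both assumptions).

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 13 predictors, only the first three matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
true_coef = np.zeros(13)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Every combination of alpha and l1_ratio is evaluated by 5-fold CV
param_grid = {
    "alpha": np.logspace(-3, 1, 10),
    "l1_ratio": [0.1, 0.5, 0.9, 1.0],  # 1.0 corresponds to a pure lasso penalty
}

search = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
print("Best parameters:", search.best_params_)
print("Test MSE with best parameters:", round(test_mse, 3))
```

A selected `l1_ratio` close to 1 would suggest that a sparse, lasso-like solution suits the data, while a value close to 0 points toward ridge-like shrinkage.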