# LSE ST451: Bayesian Machine Learning
## Author: Kostas Kalogeropoulos

## Week 2: Bayesian Linear Regression part I

Topics covered 
 - Working with Pandas data frames
 - Working with 'for' loops in Python
 - Fitting linear regression models
 - Polynomial curve fitting
 - Introduction to training and test error concepts
 - Ridge regression
 

We begin by loading the necessary libraries. We will use **numpy** and **matplotlib**, which we saw in week 1, and gain familiarity with **pandas** for handling data. We will also get started with **sklearn** for linear and ridge regression and for splitting data into train and test samples, and we will get our hands on a real dataset on Boston house prices.

In [None]:
import numpy as np             
import pandas as pd           #Python Data Analysis Library: handles data in a user-friendly way
#import random
import matplotlib.pyplot as plt #for plots
%matplotlib inline
from sklearn import linear_model # A very popular Python library for Machine Learning
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split #needed for assessing prediction
from sklearn import datasets ## imports datasets from scikit-learn

### Linear Regression / Polynomial Curve fitting

We start with the polynomial curve fitting exercise seen in the lecture slides. The *truth* is the function $\sin(2\pi x)$ for $x \in [0,1]$. We observe 10 points, equispaced in $[0,1]$ (array x), with independent $N(0,0.3^2)$ measurement error (array y). The array xg contains a grid of 100 points in $[0,1]$.

In [None]:
#Polynomial fitting exercise
np.random.seed(1)
x = np.linspace(0, 1, 10)
xg = np.linspace(0,1,100)
f = np.sin(2*np.pi*xg)
y = np.sin(2*np.pi*x)+0.3*np.random.randn(10)
plt.plot(xg,f,label='true process')
plt.plot(x,y,'o',label='data')
plt.legend()

To start working with pandas data frames, we put x and y in a pandas data frame called 'data'.

In [None]:
#put x,y into the pandas dataframe called data 
data = pd.DataFrame(np.column_stack([y,x]),columns=['y','x']) #columns, then column names
data

We can add the variable 'x2' containing the values of x$^2$

In [None]:
data['x2']=data['x']**2
data

We also need to add the variables x$^3$ to x$^9$ so that we can fit polynomials up to the 9th degree. This can be done very quickly using a 'for loop'. See below how to use a 'for loop' in an example where string and numerical variables are being used.

In [None]:
#Expand the data including powers of x up to 9
for i in range(3,10):  #executes the following indented commands for i varying from 3 to 9
    colname = 'x%d'%i # %d is a placeholder filled with the value of i, giving 'x3', 'x4', ...
    data[colname] = data['x']**i #raise to the power of i
    # the body of the for loop ends at the first line that is not indented.
data
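
The `%d` formatting used above is one of several ways to build a string from a number. As a minimal sketch, the same loop can be written with an f-string; the small frame `demo` is a hypothetical stand-in for `data`, introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for `data`
demo = pd.DataFrame({'x': np.linspace(0, 1, 5)})

# Same idea as the loop above, written with an f-string
for i in range(2, 5):              # i takes the values 2, 3, 4
    demo[f'x{i}'] = demo['x']**i   # column 'x2' holds x squared, and so on

print(list(demo.columns))
```

Both styles produce the same column names, so the choice is a matter of taste.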

Now we will fit a linear regression model with x and x$^2$, in other words a polynomial of the 2nd degree. The output of the sklearn class 'LinearRegression' is not particularly user friendly, so we will summarise the results in a pandas data frame. We will look at the **training mean squared error, which is the average squared distance between the predicted y's from the model and the actual y's**. We will also look at the regression coefficients.

In [None]:
predictors = ['x','x2']
linreg = LinearRegression(fit_intercept=True)
linreg.fit(data[predictors],data['y'])
y_pred = linreg.predict(data[predictors])

#code for plot
plt.plot(data['x'],y_pred, label='OLS fit')
plt.plot(data['x'],data['y'],'o',label='data')
plt.plot(xg,f,label='true process')
plt.title('Plot for power: 2')
plt.legend()

#output to return
mse = np.mean((y_pred-data['y'])**2) #mean squared error: average squared residual
results = [mse]
results.extend([linreg.intercept_])
results.extend(linreg.coef_) 
results = pd.DataFrame([results],columns = ['MSE','intercept','x','x2'])
print(results)
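
As a cross-check on the quantities reported above, the same quadratic fit can be sketched with plain NumPy. This assumes the data are generated exactly as earlier in the notebook; the names `X`, `beta` and `y_hat` are introduced here only for illustration. The training MSE is the mean of the squared residuals, with no square root.

```python
import numpy as np

# Re-create the ten noisy observations (same recipe as earlier in the notebook)
np.random.seed(1)
x = np.linspace(0, 1, 10)
y = np.sin(2*np.pi*x) + 0.3*np.random.randn(10)

# Design matrix with columns 1, x, x^2
X = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares coefficients in the order (intercept, x, x2)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Training MSE: mean of the squared residuals
y_hat = X @ beta
mse = np.mean((y - y_hat)**2)
print(beta, mse)
```

The coefficients should agree with `linreg.intercept_` and `linreg.coef_` up to numerical error, since both solve the same least-squares problem.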

### Activity 1

Fit polynomials of 3rd and 9th degrees to the previous data and provide their coefficients and their training mean squared error. 

We will now introduce a denser version of x and its powers (based on the array xg) in order to compare the model fit with the true process over the entire interval $[0,1]$.

In [None]:
xg0 = np.ones(100)
grid = pd.DataFrame(np.column_stack([xg0,xg]),columns=['x0','x'])
for i in range(2,10):  
    colname = 'x%d'%i 
    grid[colname] = grid['x']**i 
grid.head()

The following code, in addition to fitting the polynomial to the data, also evaluates the fitted curve over the entire interval $[0,1]$.

Below we see the case of a 9th-degree polynomial, but you can repeat this for any order between 1 and 9 by setting 'npower' to the desired order.

In [None]:
npower=9
grid_predictors = ['x0','x']
grid_predictors.extend(['x%d'%i for i in range(2,npower+1)])
X = grid[grid_predictors]
predictors = ['x']
predictors.extend(['x%d'%i for i in range(2,npower+1)])
linreg = LinearRegression(fit_intercept=True)
linreg.fit(data[predictors],data['y'])
beta = [linreg.intercept_]
beta.extend(linreg.coef_)
fitted = np.dot(X,beta)

plt.plot(grid['x'],fitted, label='OLS fit')
plt.plot(data['x'],data['y'],'o',label='data')
plt.plot(xg,f,label='true process')
plt.title('Plot for power: %d'%npower)
plt.legend()
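
The plot can be complemented with numbers. The sketch below, which assumes the same simulated data as above, compares the training MSE with the mean squared distance between the fitted curve and the true function on the grid. The helper `poly_errors` is hypothetical and not part of the original notebook. With 10 observations, a 9th-degree polynomial has 10 coefficients and can pass through every data point, so its training MSE is essentially zero.

```python
import numpy as np

np.random.seed(1)
x = np.linspace(0, 1, 10)
y = np.sin(2*np.pi*x) + 0.3*np.random.randn(10)
xg = np.linspace(0, 1, 100)
f = np.sin(2*np.pi*xg)

def poly_errors(degree):
    """Hypothetical helper: least-squares fit, returns (training MSE, grid MSE vs truth)."""
    X = np.vander(x, degree + 1, increasing=True)    # columns 1, x, ..., x^degree
    Xg = np.vander(xg, degree + 1, increasing=True)  # same powers on the dense grid
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_mse = np.mean((X @ beta - y)**2)
    grid_mse = np.mean((Xg @ beta - f)**2)
    return train_mse, grid_mse

for degree in (3, 9):
    train_mse, grid_mse = poly_errors(degree)
    print(degree, train_mse, grid_mse)
```

A small training MSE therefore says little on its own about how close the fit is to the true process away from the observed points.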

### Ridge Regression

We now turn to Ridge regression. We will use all powers of x up to 9, but we now also include a penalty term on the squared values of the coefficients. The amount of penalisation is controlled by the parameter 'lam'.

Below is some code that fits a Ridge regression model and provides (as before) the estimates of the coefficients and the training MSE.

In [None]:
npower = 9
predictors = ['x']
predictors.extend(['x%d'%i for i in range(2,npower+1)])
lam=np.exp(-10)
ridgereg = Ridge(alpha=lam,fit_intercept=True)
ridgereg.fit(data[predictors],data['y'])
y_pred = ridgereg.predict(data[predictors])

#plot
plt.plot(data['x'],y_pred, label='Ridge fit')
plt.plot(data['x'],data['y'],'o',label='data')
plt.plot(xg,f,label='true process')
plt.title('Plot for ridge regression')
plt.legend()

#output to return
mse = np.mean((y_pred-data['y'])**2) #mean squared error: average squared residual
results = [mse]
results.extend([ridgereg.intercept_])
results.extend(ridgereg.coef_)
col = ['MSE','intercept'] + ['x%d'%i for i in range(1,npower+1)]
results = pd.DataFrame([results],columns = col)
results
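
To see what the penalty does, the ridge estimate can also be sketched in closed form with NumPy. This assumes the objective is the residual sum of squares plus `lam` times the sum of squared coefficients, with the intercept left unpenalised, which is how scikit-learn documents `Ridge`. Centring the columns and the response is one way to keep the intercept out of the penalty. The variable names below are introduced only for illustration.

```python
import numpy as np

np.random.seed(1)
x = np.linspace(0, 1, 10)
y = np.sin(2*np.pi*x) + 0.3*np.random.randn(10)

X = np.vander(x, 10, increasing=True)[:, 1:]   # columns x, x^2, ..., x^9
lam = np.exp(-10)

# Centre predictors and response so the intercept is not penalised
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# Ridge coefficients: solve (Xc'Xc + lam*I) beta = Xc'yc
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam*np.eye(X.shape[1]), Xc.T @ yc)
intercept = y_mean - x_mean @ beta_ridge

# Unpenalised least squares on the same centred data, for comparison
beta_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```

The ridge coefficient vector has a smaller norm than the unpenalised one, which is the shrinkage effect that the penalty is designed to produce.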

### Test Error - Out of sample performance

In the previous exercise we visually compared the model fit over the interval $[0,1]$ with the true function. This is very useful but cannot be done in practice, as the true function is unknown.

We will therefore repeat the exercise under a more realistic setting: 
- Generate 100 noisy observations from the function $\sin(2\pi x)$ as before.
- **Randomly select 10** of those. These will form the **train sample** that will be used to fit the models and obtain the regression coefficients. 
- Set **aside** the remaining 90 observations. These will form the **test sample** that will be used to assess the predictions of each model that was estimated on the train sample. 
- Contrasting those predictions to the real data will allow us to estimate the **test error** and assess the out of sample performance of each model.
- Monitor both the training and test errors for each model.


We start by simulating 100 noisy observations.

In [None]:
#Simulate more data
np.random.seed(4)
x = np.linspace(0,1,100)
f = np.sin(2*np.pi*x)
y = np.sin(2*np.pi*x)+0.3*np.random.randn(100)
plt.plot(x,f,label='true process')
plt.plot(x,y,'o',label='data')
plt.legend()

In [None]:
data = pd.DataFrame(np.column_stack([y,x]),columns=['y','x']) #columns, then column names
npower = 9
for i in range(2,10):  
    colname = 'x%d'%i 
    data[colname] = data['x']**i 
data.head()

Next, we randomly split the data into train and test samples and inspect them.

In [None]:
npower = 9  #feel free to try different numbers for npower
predictors = ['x']
predictors.extend(['x%d'%i for i in range(2,npower+1)])

# Split up your data
trainX, testX, trainy, testy = train_test_split(data[predictors],data['y'],
                                                test_size=0.9, random_state=1)
trainX

In [None]:
testX
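
Conceptually, a random split like the one above amounts to shuffling the row indices and cutting the shuffled list in two. The following is a minimal NumPy sketch of that idea; the seed is hypothetical and it will not reproduce the exact rows chosen by `train_test_split` with `random_state=1`.

```python
import numpy as np

rng = np.random.default_rng(1)   # hypothetical seed, unrelated to random_state=1 above
n = 100
perm = rng.permutation(n)        # random ordering of the row indices 0..99

train_idx = perm[:10]            # first 10 shuffled indices form the train sample
test_idx = perm[10:]             # remaining 90 indices form the test sample

print(len(train_idx), len(test_idx))
```

With a pandas data frame, the corresponding rows could then be selected with `data.iloc[train_idx]` and `data.iloc[test_idx]`.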

Next we will 
 1. fit the ridge regression model with a 9th-order polynomial on the train dataset
 2. calculate the train error as before
 3. obtain its predictions for the test dataset
 4. compare the predictions of the previous step against the test data

In [None]:
npower = 9
predictors = ['x']
predictors.extend(['x%d'%i for i in range(2,npower+1)])

#Step 1
lam=np.exp(-2.5)
ridgereg = Ridge(alpha=lam,fit_intercept=True)
ridgereg.fit(trainX,trainy)

#Step 2
y_pred_train = ridgereg.predict(trainX)
train_mse = np.mean((y_pred_train-trainy)**2) #average squared residual on the train sample

#Steps 3 and 4
y_pred_test = ridgereg.predict(testX)
test_mse = np.mean((y_pred_test-testy)**2) #average squared prediction error on the test sample

#output to return
results = [train_mse,test_mse]
results.extend([ridgereg.intercept_])
results.extend(ridgereg.coef_)
col = ['trainMSE','testMSE','intercept'] + ['x%d'%i for i in range(1,npower+1)]
results = pd.DataFrame([results],columns = col)
results

Similarly, for a linear regression model based on a polynomial of order 'npower' we will
 1. fit the linear regression model with the 'npower'-th order polynomial on the train dataset
 2. calculate the train error as before
 3. obtain its predictions for the test dataset
 4. compare the predictions of the previous step against the test data

In [None]:
# Step 1
npower = 5
for i in range(2,10):  
    colname = 'x%d'%i 
    data[colname] = data['x']**i 
predictors = ['x']
predictors.extend(['x%d'%i for i in range(2,npower+1)])
linreg = LinearRegression(fit_intercept=True)
linreg.fit(trainX[predictors],trainy)

# Step 2
y_pred_train = linreg.predict(trainX[predictors])
train_mse = np.mean((y_pred_train-trainy)**2) #average squared residual on the train sample

# Steps 3 and 4
y_pred_test = linreg.predict(testX[predictors])
test_mse = np.mean((y_pred_test-testy)**2) #average squared prediction error on the test sample

# output to return
results = [train_mse,test_mse]
results.extend([linreg.intercept_])
results.extend(linreg.coef_)
col = ['trainMSE','testMSE','intercept'] + ['x%d'%i for i in range(1,npower+1)]
results = pd.DataFrame([results],columns = col)
results
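
The two cells above repeat the same four steps, so they can be wrapped in a small function. The sketch below is one possible way to do this; `train_test_mse` is a hypothetical helper, not part of the original notebook, and the cubic features are chosen only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

def train_test_mse(model, trainX, trainy, testX, testy):
    """Hypothetical helper: fit on the train sample, return (train MSE, test MSE)."""
    model.fit(trainX, trainy)
    train_mse = np.mean((model.predict(trainX) - trainy)**2)
    test_mse = np.mean((model.predict(testX) - testy)**2)
    return train_mse, test_mse

# Small simulated example, same recipe as above but with cubic features only
np.random.seed(4)
x = np.linspace(0, 1, 100)
y = np.sin(2*np.pi*x) + 0.3*np.random.randn(100)
X = np.column_stack([x, x**2, x**3])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, random_state=1)

for name, model in [('OLS', LinearRegression()), ('Ridge', Ridge(alpha=np.exp(-2.5)))]:
    print(name, train_test_mse(model, X_train, y_train, X_test, y_test))
```

Any object with `fit` and `predict` methods can be passed as `model`, which makes it easy to try different polynomial orders or penalty values.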

### Activity 2

1. Find the order of polynomial that produces the smallest test MSE. 
2. Can you beat that with Ridge Regression? In other words can you find penalty 'lam' such that the Ridge regression model produces an even smaller test MSE?

### Real data example

Finally, we will look at a real data example: the Boston house prices dataset.

In [None]:
boston = datasets.load_boston() ## loads Boston dataset from datasets library (removed in scikit-learn 1.2, so this needs an older version)
print(boston.DESCR)

In [None]:
bos = pd.DataFrame(boston.data, columns=boston.feature_names)
bos['PRICE'] = boston.target
bos.head()

In [None]:
bos.describe()

### Activity 3

1. Split the data into train and test samples, such that the test size is 30\%
2. Fit a linear regression and a ridge regression model on the train data
3. Obtain the predictions from both models on the test data and compare
4. Repeat for a test size of 95\%

You will first need to split the data into the X matrix and the Y vector with the following code:

In [None]:
X = bos.drop('PRICE', axis = 1)
Y = bos['PRICE']