# Linear Regression Lab

This workbook accompanies the PDF on Moodle. I will go through a simple linear regression example, then ask you to build several models of your own. I will build the model by referring to the PDF.

Step 1: import the things we almost always want

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd


These are not always needed, but I like this plotting style for this workbook

In [None]:
import matplotlib
matplotlib.style.use('ggplot')

Let's take this set

X has values 5,7,9,11,13,15
y has 11,14,20,24,29,31

and we want to build a model
$\hat{y} = w_0 + w_1x$

In [None]:
X = np.array([5, 7, 9, 11, 13, 15])
y = np.array([11, 14, 20, 24, 29, 31])

Let's plot it to see if a linear model makes sense for this

In [None]:
plt.scatter(X,y)
plt.show()

We can check the correlation coefficient (if you don't know what this is, don't worry about it)

In [None]:
np.corrcoef(X, y)

0.99322, very high correlation!
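
`np.corrcoef` returns a whole matrix rather than a single number, so it can help to pull out the entry we care about. The sketch below repeats the same six points under new names (`x_demo` and `y_demo` are illustrative names, chosen so the lab's `X` and `y` are left alone) and reads the off-diagonal entry.

```python
import numpy as np

# Same numbers as above, under new names so the lab's X and y are untouched
x_demo = np.array([5, 7, 9, 11, 13, 15])
y_demo = np.array([11, 14, 20, 24, 29, 31])

corr_matrix = np.corrcoef(x_demo, y_demo)  # 2x2 symmetric matrix
r = corr_matrix[0, 1]                      # correlation between x and y

print(corr_matrix.shape)
print(round(r, 5))
```

The diagonal entries are the correlation of each variable with itself, which is always 1.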

## Referring to section 2 Data Structure of the PDF let's look at X and y in more detail

In [None]:
y.shape

In [None]:
y.ndim

y is clearly a 1d array, as expected - good

In [None]:
X.shape

In [None]:
X.ndim

X is also a 1d array, which is not good. sklearn expects X to be a 2d array, with one row per sample and one column per feature

In [None]:
X

It's written as one row, but really we need 6 rows with one entry in each row. Let's reshape the array

In [None]:
X = X.reshape(-1,1)

In [None]:
X.shape

In [None]:
X.ndim

2d array. Good. Let's look at it

In [None]:
X

6 rows now
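
As a small aside, here is a sketch of what `reshape(-1,1)` is doing, using a separate array (`x_flat` is an illustrative name) so that the `X` above is not changed. It also shows `np.newaxis`, an alternative spelling that gives the same column shape.

```python
import numpy as np

x_flat = np.array([5, 7, 9, 11, 13, 15])  # shape (6,), one dimension

x_col = x_flat.reshape(-1, 1)      # -1 lets NumPy work out the number of rows
x_col_alt = x_flat[:, np.newaxis]  # alternative spelling, same result

print(x_flat.shape, x_col.shape)
print(np.array_equal(x_col, x_col_alt))
```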

## Section 3, build the model

In [1]:
from sklearn.linear_model import LinearRegression

In [2]:
LinearRegression?

Init signature:
LinearRegression(
    *,
    fit_intercept=True,
    copy_X=True,
    n_jobs=None,
    positive=False,
)
Docstring:
Ordinary least squares Linear Regression.

LinearRegression fits a linear model with coefficients w = (w1, ..., wp)
to minimize the residual sum of squares between the observed targets in
the dataset, and the targets predicted by the linear approximation.

Parameters
----------
fit_intercept : bool, default=True
    Whether to calculate the intercept for this model. If set
    to False, no intercept will be used in calculations
    (i.e. data is expected to be centered).

copy_X : bool, default=True
    If True, X will be copie

Create the model. With sklearn you first initialise an empty model by calling its constructor without any data. Here `LinearRegression` is the constructor

In [None]:
model = LinearRegression()

Now "fit" the model using X and y. Put a ? after the method name (`model.fit?`) to see its documentation. `.fit` expects X to be a 2d array (a matrix), which is why we reshaped it above. If you pass the original 1 dimensional array instead, pay attention to the error message you get

In [None]:
model.fit(X,y)
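
To see that error message without disturbing the model above, here is a sketch that uses its own throwaway names (`x_flat`, `y_demo` and `demo_model` are illustrative). It assumes sklearn rejects a 1d `X` with a `ValueError`, then fits again with the reshaped array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_flat = np.array([5, 7, 9, 11, 13, 15])   # 1d on purpose
y_demo = np.array([11, 14, 20, 24, 29, 31])

demo_model = LinearRegression()
try:
    demo_model.fit(x_flat, y_demo)          # 1d X: expected to fail
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("fit raised ValueError:", str(err).splitlines()[0])

# The same call succeeds once X has shape (n_samples, 1)
demo_model.fit(x_flat.reshape(-1, 1), y_demo)
print(demo_model.coef_, demo_model.intercept_)
```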

## Section 4 Inferences

In [None]:
model.coef_

In [None]:
model.intercept_

These are the parameters

$\hat{y} = 0.2142857142857153 + 2.12857143 x$

is the model
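
For one predictor, these parameters can also be worked out by hand from the usual least squares formulas: the slope is the sum of the products of the deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. A sketch with NumPy only (all names here are illustrative):

```python
import numpy as np

x_demo = np.array([5, 7, 9, 11, 13, 15])
y_demo = np.array([11, 14, 20, 24, 29, 31])

x_bar, y_bar = x_demo.mean(), y_demo.mean()

# slope = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
w1_manual = np.sum((x_demo - x_bar) * (y_demo - y_bar)) / np.sum((x_demo - x_bar) ** 2)
# intercept: the fitted line passes through (x_bar, y_bar)
w0_manual = y_bar - w1_manual * x_bar

print(w0_manual, w1_manual)
```

These should match `model.intercept_` and `model.coef_` up to floating point rounding.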

In [None]:
w0 = model.intercept_
w1 = model.coef_[0]

In [None]:
predictions = w0 + w1 * X.ravel()  # flatten X so predictions has the same shape as y: (6,)

In [None]:
predictions

Using the built-in .predict

In [None]:
pred = model.predict(X)

In [None]:
pred

In [None]:
np.allclose(pred, predictions)  # compare floats with a tolerance rather than ==

They're the same.

## Section 5 Evaluation

In [None]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2

In [None]:
mse(y,pred)

In [None]:
r2(y,pred)

Very close to 1!

Root Mean Squared Error

In [None]:
rmse = np.sqrt(mse(y,pred))

In [None]:
rmse
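
It can help to see what these three metrics compute. The sketch below rebuilds them from their definitions with NumPy only, using the parameters reported earlier (the names `w0_demo`, `w1_demo`, `y_hat` and so on are illustrative and separate from the lab's variables).

```python
import numpy as np

x_demo = np.array([5, 7, 9, 11, 13, 15])
y_demo = np.array([11, 14, 20, 24, 29, 31])

# Parameters copied from the fitted model reported earlier
w0_demo, w1_demo = 0.2142857142857153, 2.12857143
y_hat = w0_demo + w1_demo * x_demo

residuals = y_demo - y_hat
mse_manual = np.mean(residuals ** 2)              # mean of squared residuals
rmse_manual = np.sqrt(mse_manual)                 # back in the units of y

ss_res = np.sum(residuals ** 2)                   # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)    # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
```

For simple linear regression with an intercept, r2 is the square of the correlation coefficient found earlier.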

In [None]:
plt.plot(X,y,'o')
plt.plot(X,pred)
plt.show()

This shows the line of best fit

In [None]:
model.score(X,y)

This agrees with the r2 from above

# Work for you now

We assume that the value of a car goes down with age. Can we make a model that predicts the value of a particular type of car just by looking at its age?

This is a collection of data for a particular make and model of car. The person collecting the data recorded the age of a car and its value. 

You should notice there are cars that are 5 years old yet have different values. This is because there is variance that our model is not capturing, but maybe it can still show some interesting things

In [None]:
X=np.array([5,4,6,5,5,5,6,6,2,7,7])
y=np.array([85,103,70,82,89,98,66,95,169,70,48])

Go build and evaluate a model for this

## Simple linear regression with automobile data
We will now use sklearn to predict automobile mileage per gallon (mpg) and evaluate these predictions. We first load the data and split them into a training set and a testing set.

In [None]:
#load mtcars
dfcars=pd.read_csv("mtcars.csv")
dfcars=dfcars.rename(columns={"Unnamed: 0":"name"})
dfcars.head()
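
One common way to do the split is sklearn's `train_test_split`. The sketch below runs on a small made-up table (`toy_cars`, whose names, columns and values are hypothetical and are not taken from `mtcars.csv`) so that it works without the file; the 25% test size and the `random_state` are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for dfcars (values are NOT from mtcars.csv)
toy_cars = pd.DataFrame({
    "name": [f"car_{i}" for i in range(8)],
    "mpg": [21.0, 22.5, 18.0, 30.5, 15.0, 26.0, 19.5, 24.0],
    "wt": [2.6, 2.3, 3.4, 1.8, 3.8, 2.1, 3.2, 2.4],
})

# Hold out 25% of the rows for testing; random_state makes the split repeatable
train_toy, test_toy = train_test_split(toy_cars, test_size=0.25, random_state=42)

print(train_toy.shape, test_toy.shape)
```

The same call should work on `dfcars` once it is loaded, giving a training frame to fit on and a testing frame to evaluate on.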

We need to choose the variables that we think will be good predictors for the dependent variable `mpg`. 

>**EXERCISE:**  Pick one variable to use as a predictor for simple linear regression.  Create a markdown cell below and discuss your reasons.  You may want to justify this with some visualizations.  Is there a second variable you'd like to use as well, say for multiple linear regression with two predictors?

In [None]:
#your code (if any) here

> **EXERCISE:** With sklearn fit the training data using simple linear regression. 

> Plot the data and the prediction.  

>Print out the mean squared error for the set

In [None]:
#your code here
#define  predictor and response for set




In [None]:
#your code here
# create linear regression object with sklearn


#your code here
# train the model and make predictions


#your code here
#print out coefficients



In [None]:
# your code here
# Plot outputs


## Multiple linear regression with automobile data

> **EXERCISE:** With either sklearn or statsmodels, fit the training data using multiple linear regression with two predictors.  Use the model to make mpg predictions.

>How do these mean squared errors compare to those from the simple linear regression?

>Time permitting, repeat the training and testing with three predictors and calculate the mean squared errors.  How do these compare to the errors from the one and two predictor models?

In [None]:
#your code here


# Diabetes Dataset

Now we are going to do something similar with the diabetes dataset built into sklearn

In [None]:
from sklearn import datasets
diabetes = datasets.load_diabetes()
print(diabetes.DESCR)

In [None]:
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
data.boxplot(rot=90)
data.head()

Above I put the data into a pandas dataframe and then tried to visualise it. You will notice all of the data is in a similar range

### Normalised data
Each column has been mean-centred and scaled so that it has a mean of 0 and a sum of squares of 1 (so the standard deviation is about 0.048, not 1). Rescaling features like this is often loosely called normalising the data, and it is a common step that I'll get into later on.

The obvious one you'll notice is the sex variable. It takes only two values, 0.050680 and -0.044642, whereas we would usually encode two options as 0 and 1. The values were changed by this normalisation
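
A quick way to see what scaling was applied is to compute a few summary statistics per column. This sketch loads the data again under separate names (`diabetes_demo`, `data_demo` and `summary` are illustrative) so the lab's variables are untouched.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

diabetes_demo = datasets.load_diabetes()
data_demo = pd.DataFrame(diabetes_demo.data, columns=diabetes_demo.feature_names)

# One row per feature: its mean, standard deviation and sum of squares
summary = pd.DataFrame({
    "mean": data_demo.mean(),
    "std": data_demo.std(),
    "sum_of_squares": (data_demo ** 2).sum(),
})
print(summary.round(4))

# The sex column only ever takes two distinct values
print(sorted(data_demo["sex"].unique()))
```

Compare the printed columns with the description in `diabetes.DESCR` to see how the features were scaled.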

In [None]:
X = data 
y = diabetes.target

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X,y)

In [None]:
r_squared = lr.score(X,y)

Let's compute a slightly different r2, the adjusted r2, which penalises the number of predictors

In [None]:
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
adjusted_r_squared
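
The same formula can be wrapped in a small helper so it is easy to reuse for models with different numbers of predictors. This is a sketch: `adjusted_r2` is a name made up here (it is not an sklearn function), and the r2 values passed in are invented purely to show the effect.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Invented numbers: the same R^2 achieved with 10 predictors, then with 9
with_ten = adjusted_r2(0.50, n_samples=100, n_features=10)
with_nine = adjusted_r2(0.50, n_samples=100, n_features=9)

print(with_ten, with_nine)
```

With the same r2, the model that uses fewer predictors gets the higher adjusted r2, which is the penalty at work.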

In [None]:
lr.coef_

In [None]:
lr.intercept_

In [None]:
coef=pd.Series(lr.coef_ , index=diabetes.feature_names)
coef.plot(kind='bar', color = list('rgbkymc'))
plt.ylabel('Coefficient')

Some of those coefficients are very large, and it looks like age does not contribute as much as the others.

Very large coefficients can often be problematic, so we'll have to think about this again later.

Maybe removing age will give us a better model.

We don't really have enough knowledge to figure it out yet. From past work I've done, you would get a better model without age, s3, s4 and s6. Maybe we'll look into that later

In [None]:
X = data.drop(["age"], axis=1)

In [None]:
lr = LinearRegression()
lr.fit(X,y)

In [None]:
lr.score(X,y)

In [None]:
r_squared = lr.score(X,y)

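The r2 score is essentially the same, even though your lecturer said the model should be better. On the training data, r2 cannot go up when a predictor is removed, so it is not the best way to compare models with different numbers of predictors. The adjusted r2 accounts for the number of predictors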

In [None]:
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
adjusted_r_squared

Slightly better than the model with age included.

There are lots of other metrics that could be used, and some could be better than others. Typically we look at several of them to make our determination

Would Root Mean Squared Error be useful?