# Lab09: Introduction to multiple regression

### Learning Outcomes
In this tutorial, we will learn how to do multiple regression. We will also use some of our previous skills (bootstrapping) to build distributions for the regression coefficients.
### Data set 
Again, I will be using the __World Happiness dataset__ from Kaggle:<br>
https://www.kaggle.com/unsdsn/world-happiness <br>
You will need to modify and combine these steps into several functions for the Assignment.<br>

<font color = 'red'> To do the assignment, use the lecture notes and the tutorial notebook!

### Preliminaries
Set up the environment by importing pandas, numpy, matplotlib, and scipy.optimize. This is already done in the first code cell below. Ensure that you have fully mastered and understood HW7 before starting the assignment. <br>


In [1]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.optimize as so
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv('2019.csv')
df.head()

| Unnamed: 0 | OverallRank | Country | Score | GDPpercapita | SocialSupport | HealthyLifeExpectancy | FreedomToMakeLifeChoices | Generosity | PerceptionsOfCorruption |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finland | 7.769 | 1.34 | 1.587 | 0.986 | 0.596 | 0.153 | 0.393 |
| 1 | 2 | Denmark | 7.6 | 1.383 | 1.573 | 0.996 | 0.592 | 0.252 | 0.41 |
| 2 | 3 | Norway | 7.554 | 1.488 | 1.582 | 1.028 | 0.603 | 0.271 | 0.341 |
| 3 | 4 | Iceland | 7.494 | 1.38 | 1.624 | 1.026 | 0.591 | 0.354 | 0.118 |
| 4 | 5 | Netherlands | 7.488 | 1.396 | 1.522 | 0.999 | 0.557 | 0.322 | 0.298 |


## 1. Multiple Regression
In multiple regression modelling, you model the response variable using __multiple__ explanatory variables. All you need to do is implement the following formula:<br>

\begin{align}
\ y & = \beta_0\ + \beta_1 x_1\ + \beta_2 x_2 \ + \beta_3 x_3\ + ... + \beta_i x_i\\
\end{align}

To see what you need to implement, look at the same formula with the intercept written explicitly as a coefficient that multiplies 1:

\begin{align}
\ y & = \beta_0 \times\ 1\ + \beta_1 x_1\ + \beta_2 x_2 \ + \beta_3 x_3\ + ... + \beta_i x_i\\
\end{align}

    
__Take a look at the formula! What do you need to implement it in python?__
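One possible answer, as a minimal sketch: because the intercept multiplies 1, the whole formula is a matrix-vector product once a column of ones is placed in front of the explanatory variables. The numbers and the names `X`, `b`, and `X1` below are made up for illustration and are not part of the assignment.

```python
import numpy as np

# Hypothetical toy data: 4 observations, 2 explanatory variables
X = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.0],
              [1.0, 1.0]])
b = np.array([1.5, 2.0, 3.0])  # [beta_0, beta_1, beta_2]

# Prepend a column of ones so that beta_0 multiplies 1 in every row
X1 = np.column_stack([np.ones(len(X)), X])
yp_matrix = X1 @ b

# The same prediction written term by term, as in the formula
yp_terms = b[0] * 1 + b[1] * X[:, 0] + b[2] * X[:, 1]

print(yp_matrix)
print(np.allclose(yp_matrix, yp_terms))
```

The loop-based function in section 1.1 computes the same quantity one term at a time.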

In the following example, I will be implementing a multiple regression with two explanatory variables.

<font color = 'red'>__Make sure you make the necessary changes so that your code implements a multiple regression model with any number of explanatory variables__

### 1.1 Multiple regression prediction function
I will be showing you the steps you need to implement within your function.

Let's say we want to model __Happiness score__ using __Healthy life expectancy__ and __Social support__.<br>
There will be two explanatory variables and ONE intercept (b0).

The following example is using a _hypothetical_ model with a _hypothetical parameter array_. You will need to incorporate these steps into a function!

<font color = 'red'>__This is just one way of implementing the function. You can come up with your own methods to implement this multiple regression prediction function__
    
    ** Use print statements along the way for debugging.

In [9]:
b_toy = [1.5, 2, 3]   # hypothetical parameters: [intercept, slope 1, slope 2]
print(type(b_toy))

xnames = ['HealthyLifeExpectancy', 'SocialSupport']
print(type(xnames))

# Start from the intercept: b0 * 1 for every observation
yp = np.ones(len(df.index)) * b_toy[0]
print(type(yp))
print(yp)

def multiRegPredict(b, D, xnames):
    # b[0] is the intercept; b[i+1] is the slope for xnames[i]
    yp = np.ones(len(D.index)) * b[0]
    # Add one term per explanatory variable (works for any number of them)
    for i in range(len(xnames)):
        yp = yp + D[xnames[i]] * b[i+1]
    return yp

multiRegPredict(b_toy, df, xnames)



<class 'list'>
<class 'list'>
<class 'numpy.ndarray'>
[1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5]

0      8.233
1      8.211
2      8.302
3      8.424
4      8.064
       ...  
152    5.153
153    3.773
154    1.710
155    3.815
156    8.661
Length: 157, dtype: float64
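A quick way to check that a prediction function really works for any number of explanatory variables is to run it on a tiny table where the answer can be computed by hand. The sketch below assumes a made-up DataFrame `toy` with hypothetical columns `x1`, `x2`, `x3` and a hypothetical parameter list `b3`; it also compares the loop against a vectorized version.

```python
import numpy as np
import pandas as pd

def multiRegPredict(b, D, xnames):
    # Start from the intercept, then add one term per explanatory variable
    yp = np.ones(len(D.index)) * b[0]
    for i in range(len(xnames)):
        yp = yp + D[xnames[i]] * b[i + 1]
    return yp

# Hypothetical toy data with three explanatory variables
toy = pd.DataFrame({'x1': [1.0, 2.0, 3.0],
                    'x2': [0.0, 1.0, 0.0],
                    'x3': [2.0, 2.0, 2.0]})
names3 = ['x1', 'x2', 'x3']
b3 = [1.0, 2.0, 3.0, 0.5]  # [intercept, slope for x1, x2, x3]

yp_loop = multiRegPredict(b3, toy, names3)

# Vectorized alternative: intercept plus a matrix-vector product
b_arr = np.asarray(b3)
yp_vec = b_arr[0] + toy[names3].to_numpy() @ b_arr[1:]

print(yp_loop.to_numpy())
print(yp_vec)
```

By hand, the first row gives 1 + 2*1 + 3*0 + 0.5*2 = 4, which both versions should reproduce.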

### 1.2 The LOSS function
Here, I will show you the steps for implementing the RSS loss function. The function should return the RSS value __AND__ the derivative array. Use print statements at each step of the way; they will help you debug the code!

__Again, keep in mind that I am showing you the steps for a multiple regression with two explanatory variables__

#### 1.2.1 Calculate the RSS value

In [None]:
y = df['Score']
print(y)

# Predictions from the toy model in section 1.1
yp = multiRegPredict(b_toy, df, xnames)
print(type(yp))


# 1. Calculate the residuals
res = y - yp
print(type(res))

# 2. Square the residuals
res2 = res**2

# 3. Calculate the sum of the squared residuals
RSS = sum(res2)
print("The RSS is %f" %RSS)

#### 1.2.2 Build up the derivative array
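For reference, the derivatives come from differentiating the RSS with respect to each coefficient. Here $r_n = y_n - \hat{y}_n$ is the residual of observation $n$ and $x_{nj}$ is the value of the $j$-th explanatory variable for that observation; this index notation is introduced only for this derivation.

\begin{align}
RSS & = \sum_{n} \left( y_n - \beta_0 - \sum_{j} \beta_j x_{nj} \right)^2 \\
\frac{\partial RSS}{\partial \beta_0} & = -2 \sum_{n} r_n \\
\frac{\partial RSS}{\partial \beta_j} & = -2 \sum_{n} x_{nj} \, r_n
\end{align}

These are the expressions used for deriv[0] and deriv[i+1] in the steps below.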

In [None]:
# 1. Initialize the derivative array
deriv = np.zeros(3) # 3 for this specific case; use len(xnames) + 1 in general

# 2. Compute the derivative for the intercept
deriv[0] = -2 * sum(res)

# 3. Use a for loop to compute the derivatives for the slopes
for i in range(len(xnames)):
    # Select the corresponding predictor
    xi = df[xnames[i]]
    deriv[i+1] = -2 * np.sum(xi*res)

print(deriv)

def multiRegLossRSS(b, D, y, xnames):
    # compute yp using multiRegPredict
    yp = multiRegPredict(b, D, xnames)
    res = ...
    rss = ...
    grad = ...
    return (rss, grad)

RSS, grad = multiRegLossRSS([1.5, 2, 3], df, df['Score'], ['HealthyLifeExpectancy', 'SocialSupport'])
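Once the loss function returns both the RSS and the derivative array, a finite-difference check is a handy way to confirm that the two agree. The sketch below is one possible check, not the assignment solution: `rss_and_grad` is a hypothetical helper written for this example, and the toy data and parameters are made up.

```python
import numpy as np
import pandas as pd

def rss_and_grad(b, D, y, xnames):
    # Hypothetical helper: RSS and its analytic derivatives for a linear model
    b = np.asarray(b, dtype=float)
    X = D[xnames].to_numpy()
    res = np.asarray(y) - (b[0] + X @ b[1:])
    rss = np.sum(res ** 2)
    grad = np.zeros(len(b))
    grad[0] = -2 * np.sum(res)        # derivative for the intercept
    grad[1:] = -2 * X.T @ res         # derivatives for the slopes
    return rss, grad

# Made-up toy data and parameters
toy = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                    'x2': [0.5, 0.1, 0.9, 0.3]})
y_toy = np.array([2.0, 3.5, 5.0, 7.5])
b0 = np.array([0.2, 1.0, -0.5])
names = ['x1', 'x2']

rss, grad = rss_and_grad(b0, toy, y_toy, names)

# Central finite differences: nudge one parameter at a time
eps = 1e-6
num_grad = np.zeros_like(b0)
for j in range(len(b0)):
    step = np.zeros_like(b0)
    step[j] = eps
    up = rss_and_grad(b0 + step, toy, y_toy, names)[0]
    down = rss_and_grad(b0 - step, toy, y_toy, names)[0]
    num_grad[j] = (up - down) / (2 * eps)

print(grad)
print(num_grad)
```

If the two printed arrays differ by more than rounding error, the derivative code most likely has a sign or indexing mistake.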

## 3. Leave-one-out cross-validation
All the steps performed in the previous assignment apply here.

As an example, let's say I want to use the first half of the dataframe for training and the second half for testing. Then the code will be:

In [None]:
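# train vs test: first half for training, second half for testing (no shared rows)
train_df = df.iloc[:79]
test_df = df.iloc[79:]

# fitfcn is the fitting function you write for the assignment (it is not defined in this tutorial)
def leaveOneOutCV(D, y, xname, fitfcn, predictfcn=multiRegPredict):
    N = len(y)
    yp = np.zeros(N)
    ind = np.arange(N)

    # 1. Loop over the observations, leaving out one at a time
    for i in range(N):
        # 2. Compute R2, b on the remaining observations using your function that optimizes the loss
        R2, b = fitfcn(D[ind!=i], y[ind!=i], ...)
        # 3. Predict the value you removed
        yp[i] = predictfcn(b, D[ind==i], xname)
    return yp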
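To see the whole leave-one-out loop run end to end, here is a minimal self-contained sketch. It makes several assumptions that are not part of the assignment: the fit function `fit_lstsq` is a stand-in based on `np.linalg.lstsq` rather than your loss-optimizing function, `predict` and `loo_cv` are hypothetical names, and the data are simulated.

```python
import numpy as np
import pandas as pd

def predict(b, D, xnames):
    # Intercept plus one term per explanatory variable
    return b[0] + D[xnames].to_numpy() @ np.asarray(b[1:])

def fit_lstsq(D, y, xnames):
    # Stand-in fit (assumption): ordinary least squares via np.linalg.lstsq
    X = np.column_stack([np.ones(len(D)), D[xnames].to_numpy()])
    yv = np.asarray(y)
    b, *_ = np.linalg.lstsq(X, yv, rcond=None)
    res = yv - X @ b
    R2 = 1 - np.sum(res ** 2) / np.sum((yv - yv.mean()) ** 2)
    return R2, b

def loo_cv(D, y, xnames, fitfcn=fit_lstsq, predictfcn=predict):
    yv = np.asarray(y)
    N = len(yv)
    yp = np.zeros(N)
    ind = np.arange(N)
    for i in range(N):
        keep = ind != i
        # Fit without observation i, then predict only observation i
        _, b = fitfcn(D[keep], yv[keep], xnames)
        yp[i] = predictfcn(b, D[~keep], xnames)[0]
    R2cv = 1 - np.sum((yv - yp) ** 2) / np.sum((yv - yv.mean()) ** 2)
    return R2cv, yp

# Simulated data: y depends linearly on two predictors plus a little noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x1': rng.normal(size=30), 'x2': rng.normal(size=30)})
y_toy = 1.0 + 2.0 * toy['x1'] - 1.0 * toy['x2'] + rng.normal(scale=0.1, size=30)

R2_fit, _ = fit_lstsq(toy, y_toy, ['x1', 'x2'])
R2cv, yp_cv = loo_cv(toy, y_toy, ['x1', 'x2'])
print(R2_fit, R2cv)
```

The cross-validated R2 should not exceed the fitted R2, because each prediction is made for an observation the model did not see.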



## 4. Bootstrapping for regression 
The steps here are the same as in the bootstrap function you defined in the previous assignments, except that you now use bootstrapping to get the distributions of the regression parameters. Each parameter is a statistic computed in every iteration, and the output of the bootstrapping function is a __numpy array__ with the distribution of one parameter in each column.

Just like before, I will show you the steps for the special case of 2 explanatory variables. You will need to put these steps together to build your bootstrapping function.

We will be using smf (the usual alias for `statsmodels.formula.api`) for fitting the model! The input to your bootstrap function, as stated in the question, is the explanatory-variables part of the formula, that is, whatever comes after the '~'. So it is a good idea to fit the model once with smf (for example `smf.ols(...).fit()`) and read the number of parameters from the fitted result.

<font color = 'red'>__Again, this is just one way to do this and you are free to come up with new methods__

### 4.1 Fit a model right away JUST to get the number of parameters
Keep in mind that in the assignment, you are supposed to write a function that does the fitting. That function is not defined in this notebook.

In [None]:
R2, b = fitfcn(D, y, ...)

numParam = len(b)

### 4.2 Create an array that represents the indices
Random sampling with replacement will be done using this array.

In [10]:
N = len(df.index)
ind = np.arange(N)
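To make "sampling with replacement" concrete, the sketch below resamples a small index array. It uses NumPy's `default_rng` generator with a fixed seed for reproducibility, which is a choice made for this example; `np.random.choice(ind, N)` samples with replacement in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, an assumption for reproducibility

N_toy = 10
ind_toy = np.arange(N_toy)

# Draw N indices with replacement: some rows repeat, others are left out
sample_toy = rng.choice(ind_toy, N_toy)

print(sample_toy)
print("distinct rows drawn:", np.unique(sample_toy).size, "of", N_toy)
```

Using such a sample to index the rows of the dataframe gives one bootstrap replicate of the data.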

### 4.3 Initialize the array that will hold your stats (here: parameter estimates)
Let's say, just like in your previous bootstrap function, you want to do numIter iterations. The distribution for each parameter will be placed in its own column of the array that we initialize here!

In [11]:
numIter = 1000
# numParam must already be defined by step 4.1; otherwise this line raises a NameError
stats = np.zeros((numIter, numParam))

### 4.4 Iterate over the following steps (put them in a for loop, just like your previous bootstrap function)
<font color = 'red'>Again, you might need to modify these steps to work with the function you define for the assignment

In [None]:
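# Resample the row indices with replacement
sample = np.random.choice(ind, N)

# Refit the model on the resampled rows
R2, b = fitfcn(...)

# Store the parameter estimates of iteration i in row i
stats[i, :] = b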
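Putting steps 4.1 to 4.4 together could look like the sketch below. It is only one possible arrangement and rests on assumptions: `fit_lstsq` is a stand-in fit based on `np.linalg.lstsq` (not smf and not your assignment function), `bootstrap_params` is a hypothetical name, and the data are simulated.

```python
import numpy as np
import pandas as pd

def fit_lstsq(D, y, xnames):
    # Stand-in fit (assumption): ordinary least squares via np.linalg.lstsq
    X = np.column_stack([np.ones(len(D)), D[xnames].to_numpy()])
    yv = np.asarray(y)
    b, *_ = np.linalg.lstsq(X, yv, rcond=None)
    res = yv - X @ b
    R2 = 1 - np.sum(res ** 2) / np.sum((yv - yv.mean()) ** 2)
    return R2, b

def bootstrap_params(D, y, xnames, fitfcn=fit_lstsq, numIter=200, seed=0):
    rng = np.random.default_rng(seed)
    yv = np.asarray(y)
    # 4.1 Fit once just to learn the number of parameters
    numParam = len(fitfcn(D, yv, xnames)[1])
    # 4.2 Index array used for resampling
    N = len(D.index)
    ind = np.arange(N)
    # 4.3 One row per iteration, one column per parameter
    stats = np.zeros((numIter, numParam))
    # 4.4 Resample, refit, store
    for i in range(numIter):
        sample = rng.choice(ind, N)  # with replacement
        _, b = fitfcn(D.iloc[sample], yv[sample], xnames)
        stats[i, :] = b
    return stats

# Simulated data with known coefficients [1, 2, -1]
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x1': rng.normal(size=50), 'x2': rng.normal(size=50)})
y_toy = 1.0 + 2.0 * toy['x1'] - 1.0 * toy['x2'] + rng.normal(scale=0.2, size=50)

boot = bootstrap_params(toy, y_toy, ['x1', 'x2'])
print(boot.shape)
print(boot.mean(axis=0))
```

Each column of the result is the bootstrap distribution of one parameter, with the intercept in column 0.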

### 4.5 Use your confidence-interval function to build a confidence interval for each parameter separately
Remember, I coded my bootstrap so that the distribution for each parameter is in one column. For example, the distribution for the intercept, which is the first parameter, is stats[:, 0].
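As an illustration of working column by column, the sketch below computes percentile intervals with `np.percentile`. This is a stand-in for your own confidence-interval function, and the array `stats_demo` is filled with made-up random numbers rather than real bootstrap output.

```python
import numpy as np

# Made-up "bootstrap" array: 1000 iterations, 3 parameters (one per column)
rng = np.random.default_rng(0)
stats_demo = np.column_stack([rng.normal(1.0, 0.10, 1000),
                              rng.normal(2.0, 0.05, 1000),
                              rng.normal(-1.0, 0.20, 1000)])

# 95% percentile interval for every column at once (axis=0 works down the rows)
lower, upper = np.percentile(stats_demo, [2.5, 97.5], axis=0)

for j in range(stats_demo.shape[1]):
    print("parameter %d: [%.3f, %.3f]" % (j, lower[j], upper[j]))
```

The same intervals can be obtained one parameter at a time by passing a single column such as `stats_demo[:, 0]`.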