# Week 1: Simple Regression

## Training Exercise 1.3

### Questions:

Dataset **TrainExer13** contains the winning times $W$ of the Olympic 100-meter finals (for men) from 1948 to 2004.  
To simplify the computations, the calendar years 1948-2004 are recoded as game numbers $G = 1, \dots, 15$.

A simple regression model for the trend in winning times is $W = a + bG + \varepsilon$.  

1. Compute $a$ and $b$, and determine the values of $R^2$ and $s$ (the standard deviation of the residuals).  
2. Are you confident in the predictive ability of this model? Motivate your answer.  
3. What prediction do you get for 2008, 2012, and 2016? Compare your predictions with the actual winning times.

### Get Data

In [1]:
import pandas as pd
from math import sqrt

In [2]:
olympic_finals = pd.read_csv("./data/TrainExer13.txt", sep="\t")
olympic_finals.head()

Unnamed: 0,Game,Year,Winning time men
0,1,1948,10.3
1,2,1952,10.4
2,3,1956,10.5
3,4,1960,10.2
4,5,1964,10.0


### Solving for $b$  

$b$ is simply the slope of the fitted line, given by

$b = \frac{\mathrm{mean}(W \cdot G) - \mathrm{mean}(W)\,\mathrm{mean}(G)}{\mathrm{mean}(G^2) - \mathrm{mean}(G)^2}$

In [3]:
Y = olympic_finals["Winning time men"]
X = olympic_finals.Game

In [4]:
b = ((X*Y).mean() - X.mean()*Y.mean()) / ((X**2).mean() - (X.mean())**2)
print(round(b,3))

-0.038
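
As a cross-check on the formula, the sketch below recomputes the slope on only the five rows printed by `head()` above, so its value differs from the full-sample $-0.038$. It compares the mean-based formula with `numpy.polyfit`. The arrays `G` and `W` and the helper `slope_from_means` are illustrative names, not part of the original notebook.

```python
import numpy as np

# Only the five rows shown by head() above (1948-1964), not the full dataset.
G = np.array([1, 2, 3, 4, 5], dtype=float)
W = np.array([10.3, 10.4, 10.5, 10.2, 10.0])

def slope_from_means(x, y):
    """Slope b = (mean(xy) - mean(x)mean(y)) / (mean(x^2) - mean(x)^2)."""
    return ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)

b_means = slope_from_means(G, W)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
b_polyfit, a_polyfit = np.polyfit(G, W, deg=1)

print(round(b_means, 3), round(b_polyfit, 3))
```

Both routes give a slope of about $-0.08$ on this five-row subset, which suggests the mean-based expression is the ordinary least-squares slope written in a different form.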


### Solving for $a$  

The intercept is $a = \bar{y} - b\bar{x}$, which in this example reads:

$a = \bar{W} - b\bar{G}$

In [5]:
w_bar = olympic_finals["Winning time men"].mean()
g_bar = olympic_finals.Game.mean()

In [6]:
a = w_bar - b*g_bar
print(round(a, 3))

10.386
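
With both coefficients available, question 3 can be approached by extrapolating the fitted line. The sketch below is a rough illustration that uses the rounded values printed above ($a = 10.386$, $b = -0.038$), so the last digit may differ slightly from a computation with unrounded coefficients. The helper names and the year-to-game mapping (one game every four years, with 1948 as game 1) are assumptions inferred from the table, not code from the original notebook.

```python
# Rounded coefficients as printed in the cells above.
a_rounded = 10.386
b_rounded = -0.038

def game_number(year):
    """Map an Olympic year to the game index G (1948 -> 1, 2004 -> 15)."""
    return (year - 1944) // 4

def predict_winning_time(year, a=a_rounded, b=b_rounded):
    """Fitted winning time W = a + b*G for the given Olympic year."""
    return a + b * game_number(year)

predictions = {year: round(predict_winning_time(year), 3) for year in (2008, 2012, 2016)}
print(predictions)
```

This gives roughly 9.778, 9.740, and 9.702 seconds for 2008, 2012, and 2016. For comparison, the actual winning times (from the public record, not from the dataset) were 9.69, 9.63, and 9.81 seconds, so the linear trend predicts times that are too slow for 2008 and 2012 and too fast for 2016.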


### R-squared

$R^2 = 1 - \frac{\sum_{i=1}^{15}{e_i^2}}{\sum_{i=1}^{15}{(w_i - \bar{w})^2}}$, where $e_i = w_i - a - b g_i$ are the residuals.

In [7]:
# residuals e_i = W_i - a - b*G_i, stored in a new "error" column
olympic_finals["error"] = olympic_finals["Winning time men"] - a - b*olympic_finals["Game"]
olympic_finals

Unnamed: 0,Game,Year,Winning time men,error
0,1,1948,10.3,-0.048
1,2,1952,10.4,0.09
2,3,1956,10.5,0.228
3,4,1960,10.2,-0.034
4,5,1964,10.0,-0.196
5,6,1968,9.95,-0.208
6,7,1972,10.14,0.02
7,8,1976,10.06,-0.022
8,9,1980,10.25,0.206
9,10,1984,9.99,-0.016


In [8]:
sum_sq_error = (olympic_finals.error ** 2).sum()
sum_sq_W_diff_mean = ((olympic_finals["Winning time men"] - w_bar)**2).sum()

In [9]:
R_sq = 1- (sum_sq_error / sum_sq_W_diff_mean)
round(R_sq, 3)

0.673
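
One way to sanity-check an $R^2$ computation is to use the fact that, in a simple regression with an intercept, $R^2$ equals the squared sample correlation between the regressor and the dependent variable. The sketch below checks this on the ten rows visible in the table above only, so its value will not match the full-sample 0.673. The function name `r_squared` and the arrays `G` and `W` are illustrative and not part of the original notebook.

```python
import numpy as np

# The ten rows visible in the table above (1948-1984), not the full dataset.
G = np.arange(1, 11, dtype=float)
W = np.array([10.3, 10.4, 10.5, 10.2, 10.0, 9.95, 10.14, 10.06, 10.25, 9.99])

def r_squared(x, y):
    """R^2 = 1 - SSE/SST for the least-squares line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    sse = np.sum(residuals ** 2)          # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1 - sse / sst

r2 = r_squared(G, W)
corr = np.corrcoef(G, W)[0, 1]  # sample correlation between G and W

# In simple regression with an intercept, R^2 equals the squared correlation.
print(round(r2, 3), round(corr ** 2, 3))
```

The two printed numbers should agree up to floating-point error.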

### Standard deviation $s$

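$s = \sqrt{\frac{1}{13}\sum_{i=1}^{15}{e_i^2}}$

The divisor is $n - 2 = 13$ because two parameters, $a$ and $b$, are estimated from the $n = 15$ observations.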

In [10]:
s = sqrt(1/13 * sum_sq_error)
round(s, 3)

0.123
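
The hard-coded `1/13` can be generalised so that the degrees of freedom follow from the data. The sketch below is one possible helper, assuming the residuals come from a model with `n_params` estimated coefficients (2 for the intercept and slope here). The name `residual_std` and the toy residuals are hypothetical and chosen only to keep the arithmetic easy to verify by hand.

```python
from math import sqrt

def residual_std(residuals, n_params=2):
    """s = sqrt(SSE / (n - n_params)); n_params=2 covers intercept and slope."""
    n = len(residuals)
    if n <= n_params:
        raise ValueError("need more observations than estimated parameters")
    sse = sum(e ** 2 for e in residuals)
    return sqrt(sse / (n - n_params))

# Toy residuals chosen for easy arithmetic, not taken from the dataset.
toy_residuals = [1.0, -1.0, 1.0, -1.0]
print(residual_std(toy_residuals))  # sqrt(4 / 2)
```

Applied to the notebook's own residuals, `residual_std(olympic_finals.error)` should reproduce the value of `s` computed above.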