# Statistical Data Management Session 12: Linear Regression (chapter 11 in McClave & Sincich)

## 1. Additive in Batteries

**Solve this exercise on paper, using Python for calculations and t-values.**

An engineer found that by including small amounts of a compound in rechargeable batteries for portable computers, she could extend their lifetimes. She experimented with different amounts of the additive and obtained the following data:

| Amount of Additive | Life (hours) |
|:---:| :---:|
|0|2|
|1|4|
|2|3|
|3|7|
|4|9|
 
1. Obtain the least squares fit of a straight line relating battery life to the amount of additive.
2. What is the sum of squared errors?
3. Test whether you should reject the null hypothesis that the slope is 0. Use a significance level of 0.10. What are the assumptions?
4. Give the coefficient of correlation.

$$
\begin{array}{|c|c|c|c|c|c|}
\hline \boldsymbol{i} & \boldsymbol{x}_{\boldsymbol{i}} & \boldsymbol{y}_{\boldsymbol{i}} & \boldsymbol{x}_{\boldsymbol{i}}^2 & \boldsymbol{y}_i^2 & \boldsymbol{x}_{\boldsymbol{i}} \boldsymbol{y}_{\boldsymbol{i}} \\
\hline 1 & 0 & 2 & 0 & 4 & 0 \\
\hline 2 & 1 & 4 & 1 & 16 & 4 \\
\hline 3 & 2 & 3 & 4 & 9 & 6 \\
\hline 4 & 3 & 7 & 9 & 49 & 21 \\
\hline 5 & 4 & 9 & 16 & 81 & 36 \\
\hline \text { Sum } & \sum_1^5 x_i=10 & \sum_1^5 \boldsymbol{y}_i=25 & \sum_1^5 x_i^2=30 & \sum_1^5 \boldsymbol{y}_i^2=159 & \sum_1^5 \boldsymbol{x}_{\boldsymbol{i}} \boldsymbol{y}_{\boldsymbol{i}}=67 \\
\hline
\end{array}
$$

 
1. $\hat{y}=\hat{\beta_0}+\hat{\beta_1}x$ $(n=5)$

    $\hat{\beta_1} = \frac{SS_{xy}}{SS_{xx}}$.
    
    $SS_{xy} = \Sigma_{i=1}^5 (x_i-\bar{x})(y_i-\bar{y})$
    
    $\qquad = \Sigma x_iy_i - \frac{\Sigma x_i \Sigma y_i}{n} = \cdots = 17$.
    
    $SS_{xx} = \Sigma x_i^2 - \frac{(\Sigma x_i )^2}{n} = \cdots = 10$.
    
    $\hat{\beta_1} = 1.7$.
    
    $\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x} = \cdots = 1.6$.
    
    $\hat{y} = 1.6+1.7x$.
    
    
    
2. $SSE = SS_{yy} - \hat{\beta_1} SS_{xy}$

    $SS_{yy} = \Sigma y_i^2 - \frac{(\Sigma y_i )^2}{n} = \cdots = 34$.
    
    $SSE = \cdots = 5.1$.
    
    This is the sum of squared differences between the predicted values $\hat{y_i} = 1.6+1.7x_i$ (from the fitted regression line) and the actual $y_i$ values:
    
$$
\begin{array}{|c|c|c|c|c|c|}
\hline \boldsymbol{i} & \boldsymbol{x}_{\boldsymbol{i}} & \hat{\boldsymbol{y}}_{\boldsymbol{i}} & \boldsymbol{y}_{\boldsymbol{i}} & \hat{\boldsymbol{y}}_{\boldsymbol{i}} - \boldsymbol{y}_{\boldsymbol{i}} & (\hat{\boldsymbol{y}}_{\boldsymbol{i}} - \boldsymbol{y}_{\boldsymbol{i}})^{\boldsymbol{2}} \\
\hline 1 & 0 & 1.6 & 2 & -0.4 & 0.16 \\
\hline 2 & 1 & 3.3 & 4 & -0.7 & 0.49 \\
\hline 3 & 2 & 5.0 & 3 & 2 & 4 \\
\hline 4 & 3 & 6.7 & 7 & -0.3 & 0.09 \\
\hline 5 & 4 & 8.4  & 9 &  -0.6  & 0.36 \\
\hline
\end{array}
$$    

$\qquad SSE = \Sigma_{i=1}^5 (\hat{y_i}-y_i)^2 = 5.1$
    

3. $H_0: \beta_1 = 0$

    $H_a: \beta_1 \neq 0 \Rightarrow$ two-tailed test.
    
    Test statistic $t = \frac{\hat{\beta_1}-0}{s/\sqrt{SS_{xx}}}$.
    
    $\alpha = 0.10 \Rightarrow \alpha/2 = 0.05,df=5-2=3$
    
    $t_{\alpha/2} = 2.353$. This is the critical t-value (see below).
    
    $s = \sqrt{\frac{SSE}{n-2}} = \cdots = 1.30.$
    
    $t = \cdots = 4.12$.
    
    As the test statistic exceeds the critical t-value, we reject $H_0$ at significance level $\alpha = 0.10$.
    
    The model says that $y = \beta_0+\beta_1x+\epsilon$.
    
    The assumptions on $\epsilon$ are:

   * $\epsilon$ is normally distributed;
   * $E(\epsilon) = 0$;
   * $\sigma^2(\epsilon)$ is a constant $\forall x$;
   * $\epsilon$ values are independent.
        
        
        
4. $r = \frac{SS_{xy}}{\sqrt{SS_{xx}SS_{yy}}} = \cdots = 0.92$.

    As $r>0$ this indicates that an increase in $x$ is associated with an increase in $y$.
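
The hand calculation above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the variable names (`ss_xy`, `beta1`, `t_stat`, ...) are our own choice. Everything is computed without intermediate rounding, so the last digit may differ slightly from a calculation on paper that rounds $s$ first.

```python
import math

# Data from the table in the exercise
x = [0, 1, 2, 3, 4]   # amount of additive
y = [2, 4, 3, 7, 9]   # life (hours)
n = len(x)

# Column sums, as in the summary table
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi**2 for xi in x)
sum_yy = sum(yi**2 for yi in y)
sum_xy = sum(xi*yi for xi, yi in zip(x, y))

# Sums of squares (shortcut formulas)
ss_xy = sum_xy - sum_x*sum_y/n
ss_xx = sum_xx - sum_x**2/n
ss_yy = sum_yy - sum_y**2/n

# 1. Least squares estimates
beta1 = ss_xy/ss_xx
beta0 = sum_y/n - beta1*sum_x/n

# 2. SSE via the shortcut formula and, as a check, directly from the residuals
sse = ss_yy - beta1*ss_xy
sse_check = sum((yi - (beta0 + beta1*xi))**2 for xi, yi in zip(x, y))

# 3. Test statistic for H0: beta1 = 0
s = math.sqrt(sse/(n - 2))
t_stat = beta1/(s/math.sqrt(ss_xx))

# 4. Coefficient of correlation
r = ss_xy/math.sqrt(ss_xx*ss_yy)

print("beta1:", round(beta1, 2), "beta0:", round(beta0, 2))
print("SSE:", round(sse, 2), "(from residuals:", round(sse_check, 2), ")")
print("s:", round(s, 2), "t:", round(t_stat, 2), "r:", round(r, 2))
```

The value of `t_stat` is then compared with the critical value $t_{\alpha/2}$ for $df = 3$.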

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
%matplotlib inline

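# Student's t-distribution with df = n - 2 = 3 degrees of freedom
t_distr = sts.t(3)
# critical value t_{alpha/2} for a two-tailed test with alpha = 0.10
t_alpha_2 = t_distr.ppf(1 - 0.1/2)
print(t_alpha_2)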

## 2. Arts vs. Health in US Metropolitan Areas

For this exercise, we use the dataset provided in the ``ratings.csv`` file in the ``shared`` folder (converted from the built-in MATLAB dataset ``cities.mat``). The dataset contains several quality-of-life ratings for U.S. metropolitan areas; in this exercise we concentrate on "arts" and "health", both of which take continuous values.

1. Draw a scatter plot of "arts" vs. "health". Use ``plt.xlim(<min>,<max>)`` and ``plt.ylim(<min>,<max>)`` (before ``plt.show()`` and ``plt.close()``) to alter the scatterplot's x and y ranges. Based on this plot, predict the outcome of the linear regression.
2. Perform a linear regression. Look this up: the ``sts`` module has a function that does this in one line of code!
3. Interpret the coefficient of correlation (r-value) and p-value. Would you conclude that a better score for arts causes better health?

In [None]:
df = pd.read_csv("../../shared/ratings.csv")
df = df[["arts", "health"]]
print(df)

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(df["arts"],df["health"], marker='.')
plt.xlim(0,10000)
plt.ylim(0,4000)
plt.show()
plt.close()

Based on this plot, you would predict a positive r-value.

In [None]:
slope,intercept,r,p,stderr = sts.linregress(df["arts"],df["health"])
print(slope,r,p)

For details, see the documentation: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html

The r-value is indeed positive, and the p-value (for a test of $H_0$: slope = 0 against $H_a$: slope $\neq$ 0) is extremely small, so you would conclude that there is a statistically significant correlation.

**Caution!** The "standard error of the estimate" (`stderr` in our code) is **not** the same as the $s$ we used. It is, however, the same as $s_{\hat{\beta_1}} = \frac{s}{\sqrt{SS_{xx}}}$, the standard error of the slope. This goes to show that when using "out-of-the-box" methods, one has to be critical of which algorithm they implement and check this in the documentation!
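
The difference between $s$ and the `stderr` returned by `linregress` can be checked on a small example. The sketch below reuses the battery data from exercise 1 (not the ratings data) and assumes a SciPy version in which the result of `linregress` exposes the attributes `slope`, `intercept` and `stderr`; the other variable names are our own.

```python
import math
from scipy import stats as sts

# Battery data from exercise 1
x = [0, 1, 2, 3, 4]
y = [2, 4, 3, 7, 9]
n = len(x)

res = sts.linregress(x, y)

# Manual computation of s and of the standard error of the slope
xbar = sum(x)/n
ss_xx = sum((xi - xbar)**2 for xi in x)
sse = sum((yi - (res.intercept + res.slope*xi))**2 for xi, yi in zip(x, y))
s = math.sqrt(sse/(n - 2))        # estimated standard deviation of epsilon
s_beta1 = s/math.sqrt(ss_xx)      # standard error of the slope estimate

print("s:", round(s, 4))
print("s / sqrt(SSxx):", round(s_beta1, 4))
print("linregress stderr:", round(res.stderr, 4))
```

The last two printed values should agree, while `s` itself is clearly larger.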

So, for the sake of completeness, let's reproduce the `linregress` output manually.

In [None]:
n = len(df["arts"])

SSxx = (n-1)*df["arts"].var()
SSyy = (n-1)*df["health"].var()
print('SSxx:', SSxx, 'SSyy:', SSyy)

xbar = df["arts"].mean()
ybar = df["health"].mean()

SSxy = np.sum((df["arts"]-xbar)*(df["health"]-ybar)) # array x - scalar xbar => array of all values x[i]-xbar 

beta1Hat = SSxy/SSxx
print('beta1Hat:', beta1Hat)

SSE = SSyy - beta1Hat*SSxy
print('SSE:', SSE)

s = np.sqrt(SSE/(n-2))
print('s:', s)

s_beta1Hat = s/np.sqrt(SSxx)

tval = beta1Hat/s_beta1Hat
print('t value:', tval)

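t = sts.t(n-2)
# two-sided p-value: sf(|t|) = 1 - cdf(|t|) stays accurate for very small p and also works for a negative slope
p_value = 2*t.sf(abs(tval))
print('p value:', p_value)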

r = SSxy / np.sqrt(SSxx * SSyy)
print('r:', r)

plt.figure(figsize=(10,6))
plt.scatter(df["arts"],df["health"], marker='.')
plt.xlim(0,10000)
plt.ylim(0,4000)
x = np.linspace(0,10000,10000)
y = slope*x + intercept
plt.plot(x,y,color='orange')
plt.show()
plt.close()

3. You **cannot** conclude that a higher arts rating **causes** a better health rating, only that the two are **correlated**.
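
To illustrate why a significant correlation does not by itself establish causation, here is a minimal simulated sketch. It is not based on the ratings data: a hidden variable `z` (hypothetical, think of something like the size of a city) drives both `x` and `y`, while `x` has no direct effect on `y`. The helper `pearson_r` and all parameter values are our own assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Coefficient of correlation r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    xbar, ybar = sum(x)/n, sum(y)/n
    ss_xy = sum((a - xbar)*(b - ybar) for a, b in zip(x, y))
    ss_xx = sum((a - xbar)**2 for a in x)
    ss_yy = sum((b - ybar)**2 for b in y)
    return ss_xy/math.sqrt(ss_xx*ss_yy)

rng = random.Random(0)   # fixed seed so the run is reproducible
n = 2000

# Hidden common factor that influences both variables
z = [rng.gauss(0, 1) for _ in range(n)]
# x and y each depend on z plus independent noise, but not on each other
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)
# After removing the contribution of z, only independent noise is left
r_without_z = pearson_r([a - c for a, c in zip(x, z)],
                        [b - c for b, c in zip(y, z)])

print("r(x, y):", round(r_xy, 2))
print("r after removing z:", round(r_without_z, 2))
```

In this simulation `x` and `y` are strongly correlated only because they share the factor `z`; once `z` is removed, the correlation is close to zero.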

## 3. Courses vs. Salary

Is there a correlation between the number of courses IEE employees take and their salary? 

1. Open the IEE dataset we used in session 4-5 in MySQL Workbench. Write the query to generate the correct dataset, containing the salary of employees and a count of the number of courses they subscribed to. Caution: there are employees who didn't take any courses and they should be included as well!
2. Perform the regression (again, this should be only one line of code!) and draw a conclusion at significance level $\alpha=0.05$.

In [None]:
conn = sqlite3.connect("../../shared/IEE_en.sqlite")
query = """
SELECT employee_nr, salary, COUNT(student) AS courses
FROM employee
LEFT JOIN subscription ON employee.employee_nr = subscription.student
GROUP BY employee_nr
"""
df = pd.read_sql_query(query, conn)
print(df)
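
Why the query needs a `LEFT JOIN` together with `COUNT(student)` can be seen on a toy example. The sketch below builds a small in-memory SQLite database; the table contents and the `course` column are hypothetical and only mimic the column names used in the query above, they are not taken from the IEE dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (employee_nr INTEGER PRIMARY KEY, salary INTEGER);
CREATE TABLE subscription (student INTEGER, course TEXT);
INSERT INTO employee VALUES (1, 3000), (2, 3500), (3, 2800);
-- employee 3 did not subscribe to any course
INSERT INTO subscription VALUES (1, 'SQL'), (1, 'Python'), (2, 'SQL');
""")

def run(join_type, count_expr):
    """Run the courses-per-employee query with a given join and count expression."""
    query = f"""
    SELECT employee_nr, salary, {count_expr} AS courses
    FROM employee
    {join_type} subscription ON employee.employee_nr = subscription.student
    GROUP BY employee_nr
    ORDER BY employee_nr
    """
    return conn.execute(query).fetchall()

left_join = run("LEFT JOIN", "COUNT(student)")    # keeps employee 3 with 0 courses
inner_join = run("JOIN", "COUNT(student)")        # drops employee 3 entirely
count_star = run("LEFT JOIN", "COUNT(*)")         # counts the NULL row: 1 course for employee 3

print("LEFT JOIN + COUNT(student):", left_join)
print("JOIN + COUNT(student):     ", inner_join)
print("LEFT JOIN + COUNT(*):      ", count_star)
conn.close()
```

`COUNT(student)` ignores the `NULL` produced by the `LEFT JOIN` for employees without subscriptions, so they are reported with 0 courses instead of being dropped or counted once.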

In [None]:
slope,intercept,r,p,stderr = sts.linregress(df["courses"],df["salary"])
print(slope,r,p)

plt.figure(figsize=(10,6))
plt.scatter(df["courses"],df["salary"])
x = np.linspace(0,4,1000)
y = slope*x + intercept
plt.plot(x,y,color='orange')
plt.show()
plt.close()

As $p > \alpha$, we cannot reject $H_0$: there is no significant correlation between the number of courses an employee took and their salary. Note that the sample shows a *positive correlation*, but this correlation is not *significant*!
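
That a correlation can be positive without being significant is easy to reproduce with a small sample. The sketch below uses made-up data (five hypothetical points, not the IEE data) and the critical-value approach from exercise 1 instead of a p-value; the critical value 3.182 is the table value $t_{0.025}$ for $df = 3$, which could also be obtained with `sts.t(3).ppf(0.975)`.

```python
import math

# Hypothetical small sample with a weak upward trend
x = [0, 1, 2, 3, 4]
y = [2, 1, 4, 3, 3]
n = len(x)

xbar, ybar = sum(x)/n, sum(y)/n
ss_xy = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y))
ss_xx = sum((xi - xbar)**2 for xi in x)
ss_yy = sum((yi - ybar)**2 for yi in y)

beta1 = ss_xy/ss_xx
r = ss_xy/math.sqrt(ss_xx*ss_yy)

# Test statistic for H0: beta1 = 0
sse = ss_yy - beta1*ss_xy
s = math.sqrt(sse/(n - 2))
t_stat = beta1/(s/math.sqrt(ss_xx))

t_crit = 3.182   # t-table value for alpha/2 = 0.025 and df = n - 2 = 3
significant = abs(t_stat) > t_crit

print("r:", round(r, 2))
print("t:", round(t_stat, 2), "critical value:", t_crit)
print("reject H0 at alpha = 0.05:", significant)
```

Here `r` is positive, but the test statistic stays well below the critical value, so $H_0$ is not rejected.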