# EBA3500 Home exam

***Note:*** This exam is also available as a Jupyter notebook [here](https://gist.githubusercontent.com/JonasMoss/970bd43684b4d615ab83812fcc3560d6/raw/2c1a69f0196ac01bc8c0ec94a3c58c3879d32adb/eba-3500-fall-2021-home-exam.ipynb). 

1. You need to be selective in the output you show. Only show output that supports
your argument. To hide output of a cell, you may use a semi-colon ";":


In [1]:
list(range(100000));

2. Make your plots look nice. Add appropriate axis labels, legends and so on.


3. *"Brevity is the soul of wit."* Strive not to write too much. We prefer pithy to lengthy expositions.



4. ***All subexercises are equally weighted***. That is, the weight of Task I (a) is the same as that of Task II (b), etc. There are $20$ subtasks in total, giving $5$ points each.

## Task I: Applied regression

We will use the following data set.

In [2]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
modechoice = sm.datasets.modechoice.load_pandas().data
modechoice.head()

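|   | individual | mode | choice | ttme | invc | invt  | gc   | hinc | psize |
| - | ---------- | ---- | ------ | ---- | ---- | ----- | ---- | ---- | ----- |
| 0 | 1.0        | 1.0  | 0.0    | 69.0 | 59.0 | 100.0 | 70.0 | 35.0 | 1.0   |
| 1 | 1.0        | 2.0  | 0.0    | 34.0 | 31.0 | 372.0 | 71.0 | 35.0 | 1.0   |
| 2 | 1.0        | 3.0  | 0.0    | 35.0 | 25.0 | 417.0 | 70.0 | 35.0 | 1.0   |
| 3 | 1.0        | 4.0  | 1.0    | 0.0  | 10.0 | 180.0 | 30.0 | 35.0 | 1.0   |
| 4 | 2.0        | 1.0  | 0.0    | 64.0 | 58.0 | 68.0  | 68.0 | 30.0 | 2.0   |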


Our goal is to predict the variable `ttme` from the other variables in the data set. See [this page](https://www.statsmodels.org/dev/datasets/generated/modechoice.html) for documentation of `modechoice`.

### (a) Using the data (ii)

Take a look at the documentation provided in the link above. Some of the categorical variables are coded as numeric in this data set. Modify `modechoice` to make them categorical. (***Hint:*** Use Google to figure out how to change the type of a data frame column, called a `Series`, to categorical.)
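
As a minimal sketch of the kind of type change meant here, the snippet below converts one numerically coded column to a categorical one with `Series.astype("category")`. The data frame `toy` is a hypothetical stand-in, not the `modechoice` data, and which columns should be converted is left to the exercise.

```python
import pandas as pd

# Hypothetical toy frame; the column names only mimic `modechoice`.
toy = pd.DataFrame({"mode": [1.0, 2.0, 3.0, 4.0], "ttme": [69.0, 34.0, 35.0, 0.0]})

# Convert the numerically coded column to a categorical one.
toy["mode"] = toy["mode"].astype("category")

# `mode` is now of dtype category, `ttme` is still float64.
print(toy.dtypes)
```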

### (b) Taking an initial look at the data

Use an appropriate plotting function to take a look at the data. Moreover, take
a look at some data summaries such as the correlation matrix and category plots,
if applicable.

### (c) Evaluating models
Fit at least five regression models with response `ttme` and make an informed choice among them. 
Do **not** show the entire output of the models you look at; it suffices to show the formula you use and the information you base your decisions on. Remember that you can and should look at interactions! (***Hint:*** You have learned about ANOVA tables, *F*-tests, and the adjusted $R^2$, but you may also think about what makes the most sense to include, the principle of parsimony, and so on.)

### (d) Making predictions (i)

In this exercise and the next, we're going to use the model

In [26]:
results = smf.ols("ttme ~ mode + choice + invc + invt", data=modechoice).fit()

Predict the values of `ttme` when `mode`, `choice`, `invc`, and `invt` are as in the following table. Put the values in a tuple and print it.

| mode  | choice | invc   | invt     | 
| ----- | ------ | ------ | -------- |
| '1.0'  | '0.0'    |	70    |	90       |
| '2.0'   | '0.0'    |	30    |	500	     |
| '4.0'   | '1.0'    |  24    | 0        |
 
Do all of these predictions make sense? Think about what `ttme` is.
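
The mechanics of predicting from a table of new covariate values can be sketched as follows. The data `toy`, the formula `y ~ x + g`, and the new rows are all hypothetical; the point is only that the new data frame must use the same column names and the same coding of categorical levels as the data used for fitting.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical noise-free toy data: y = 1 + 2*x + 3*[g == "b"].
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "g": ["a", "b", "a", "b", "a", "b"],
})
toy["y"] = 1 + 2 * toy["x"] + 3 * (toy["g"] == "b")

fit = smf.ols("y ~ x + g", data=toy).fit()

# New rows, with the same column names and level labels as in `toy`.
new = pd.DataFrame({"x": [1.0, 2.0], "g": ["b", "a"]})
predictions = tuple(fit.predict(new))
print(predictions)
```

Because the toy data contain no noise, the two predictions are approximately 6 and 5.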

### (e) Making predictions (ii)

1. Predict the values of `ttme` given the observed values of `mode`, `choice`, `invc`, and `invt`. You do not need to print the values, but you need to demonstrate that you know how to do it. Plot the observed values (`x = ttme`) against the predicted values `y = predicted`. Do you see a pattern? 

2. What should the plot look like, if the model were able to predict ttme perfectly?

### (f) Missing covariates


#### (i)
Suppose we have chosen to use the model in (d). Your uncle Bob comes along with a vector of new data. He wants you to predict the value of `ttme`, but he has forgotten if `mode` equals `1.0` or `2.0`. He knows the values of all the other covariates that you need to know. How would you modify the model in (d) so it's able to predict the value of `ttme` given his data? You may assume Uncle Bob's vector is the first of the table in the previous exercise, but with `mode` equal to either `1.0` or `2.0`, if you wish to compute something explicitly. (**Hint:** You may want to use unions here. 
However, there are reasonable solutions to this exercise that do not use unions.)

#### (ii)
The following week, Bob visits again. This time he remembered `mode`, but he has forgotten `choice` and `invc`. Modify the model in (d) so that it works in this scenario as well.

## Task II: Algorithms for regression

In this task, you will make a program doing *forward* regression. The point of forward regression is to iteratively fit regression models, adding one covariate at a time, starting with the smallest model.

### (a) Extract response and covariates from a formula
Make a function `extract(formula)` that returns a tuple containing `response` and `covariates`. Here `response` should be the name of the response in the formula, and `covariates` should be a list containing all covariates in the formula. Be sure to remove whitespace too! For instance, `extract("y ~ x + z")` should return `("y", ["x", "z"])`. (***Hint:*** See the mock exam, Task II.)
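
One possible sketch of the string handling involved, using only `str.split` and `str.strip`. It assumes the formula contains exactly one `~` and that terms are separated by `+`; other approaches are equally valid.

```python
def extract(formula):
    """Split a formula string into (response, list of covariates)."""
    response, rhs = formula.split("~")
    covariates = [term.strip() for term in rhs.split("+")]
    return response.strip(), covariates


print(extract("y ~ x + z"))  # ('y', ['x', 'z'])
```

Note that this sketch only strips whitespace around each term, so an interaction term such as `a * b` keeps its inner spaces.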

### (b) Adding a covariate to a formula
Make a function `add_covariate(formula, covariate)` that adds a `covariate` to the covariate part of a `formula`. For instance, `add_covariate("y  ~  x + z ", " u")` should return `"y~x+z+u"`.
(***Hint:*** You need Python's tools for handling strings. See the mock home exam.)
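
A minimal sketch, assuming all whitespace in both arguments may simply be dropped: calling `str.split()` without arguments splits on any run of whitespace, and joining the pieces back together removes it.

```python
def add_covariate(formula, covariate):
    """Append `covariate` to `formula`, removing all whitespace."""
    compact_formula = "".join(formula.split())
    compact_covariate = "".join(covariate.split())
    return compact_formula + "+" + compact_covariate


print(add_covariate("y  ~  x + z ", " u"))  # y~x+z+u
```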

### (c) Find all extensions of a model by one covariate
Make a function `extend_model(formula, covariates)` that returns a list of all extensions of `formula` by one element of `covariates`, ignoring duplicates. That is, if `covariates` includes a covariate that's already in `formula`, ignore it; don't add a copy of it to `formula`. (***Hint:*** Can you use any of the other functions you have made in this exercise? You might want to use the set difference operator at some stage too.)

In [None]:
# Your function should do the following:
extend_model("y ~ x+ u + w", ["v", "z"]) 
# ['y~x+u+w+z', 'y~x+u+w+v']
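
The set difference operator mentioned in the hint can be illustrated on its own. The lists `current` and `candidates` are hypothetical; the sketch only shows how covariates that are already present get filtered out.

```python
current = ["x", "u", "w"]        # covariates already in the formula
candidates = ["v", "z", "x"]     # "x" is a duplicate and should be ignored

# Set difference keeps only the candidates not already present.
# Sets are unordered, so sort to get a reproducible list.
new_terms = sorted(set(candidates) - set(current))
print(new_terms)  # ['v', 'z']
```

Since sets are unordered, the order of the extended formulas in the expected output above may differ from run to run unless the result is sorted.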

### (d) Choosing the best covariate to add
Let `covariates` be a list of covariates and `formula` a formula. Make a function `improve_model(formula, covariates, data)` that returns a new formula. The new formula should be `formula` with the best single covariate in `covariates` added to the model `ols(formula, data)`, in terms of $R^2$.

For example, consider the `modechoice` data in the previous task, and the formula
`ttme ~ mode + choice + invc + invt + gc`. There are two remaining covariates that aren't in the formula, namely
`hinc` and `psize`. Then


In [36]:
# Adding hinc
smf.ols("ttme ~ mode + choice + invc + invt + gc + hinc", data = modechoice).fit().rsquared

0.8055537175020088

In [35]:
# Adding psize
smf.ols("ttme ~ mode + choice + invc + invt + gc + psize", data = modechoice).fit().rsquared

0.805735767844473

Since the $R^2$ of adding `psize` dominates the $R^2$ of adding `hinc`, `improve_model("ttme ~ mode + choice + invc + invt + gc", ["hinc", "psize"], modechoice)` should return the formula `"ttme ~ mode + choice + invc + invt + gc + psize"`.
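
The comparison above amounts to taking an argmax over candidates. A small sketch, using the two $R^2$ values shown in the cells above (the dictionary `rsquared` is just an illustrative container):

```python
# R^2 values taken from the two fits shown above.
rsquared = {"hinc": 0.8055537175020088, "psize": 0.805735767844473}

# Pick the candidate whose R^2 is largest.
best = max(rsquared, key=rsquared.get)
print(best)  # psize
```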

Verify that your program works by testing it on:

In [None]:
formula = "ttme ~ mode + choice + invc"
covariates = ["hinc", "psize", "invt", "gc"]
improve_model(formula, covariates, modechoice)
# 'ttme~mode+choice+invc+psize'

### (e) Select the $k<p$ most significant predictors (i)

Make a function `forward_k` that iteratively adds the best predictor, according to $R^2$ value, for `k` steps, and returns the formula of the best regression model. It should start with the model using the formula `formula`. At $k=1$, it should add the best covariate. At the second step, it should use the formula obtained from step 1, then add the best predictor. At the third step, it should add the best predictor to the formula from step 2, and so on.

The argument `covariates` contains the covariates we're allowed to choose from.

(***Hint:*** Use the functions you have already defined in the previous three exercises. Google "forwards selection" if you struggle with understanding the exercises. The form of the function is really similar to the corresponding "backwards selection" function in the mock exam.)
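
The greedy structure described above can be sketched independently of regression. The function name `forward_steps` and the scoring function `toy_score` are hypothetical; in the exercise, the score would be the $R^2$ of the fitted model and the selected terms would be kept as a formula string.

```python
def forward_steps(selected, candidates, k, score):
    """Greedily add, up to k times, the candidate c maximizing score(selected + [c])."""
    selected = list(selected)
    remaining = [c for c in candidates if c not in selected]
    for _ in range(k):
        if not remaining:
            break
        best = max(remaining, key=lambda c: score(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical stand-in for R^2: a made-up additive gain per covariate.
gains = {"a": 0.10, "b": 0.30, "c": 0.20}


def toy_score(terms):
    return sum(gains.get(term, 0.0) for term in terms)


print(forward_steps(["base"], ["a", "b", "c"], 2, toy_score))  # ['base', 'b', 'c']
```

Each step reuses the result of the previous step, which is why running one step twice gives the same result as running two steps at once.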

In [60]:
# Here are some examples for you to use.
formula = "ttme ~ mode + choice + invc"
covariates = ["hinc", "psize", "invt", "gc"]
forward_k(formula, covariates, 1, modechoice)
# 'ttme~mode+choice+invc+psize'
forward_k(formula, covariates, 2, modechoice)
# 'ttme~mode+choice+invc+psize+gc'
forward_k(forward_k(formula, covariates, 1, modechoice), covariates, 1, modechoice)
# 'ttme~mode+choice+invc+psize+gc'

'ttme~mode+choice+invc+psize+gc'

### (f) Select the $k<p$ most significant predictors (ii)
Modify the function `forward_k` to take a `max_formula` argument instead of a `covariates` argument, and call the new function `forward_k2`. The `max_formula` contains all the allowed covariates on its right-hand side. For instance,

In [None]:
formula = "ttme ~ mode + choice + invc"
covariates = ["hinc", "psize", "invt", "gc"]
forward_k(formula, covariates, 2, modechoice)

Corresponds to

In [67]:
formula = "ttme ~ mode + choice + invc"
max_formula = "ttme ~ mode + choice + invc + hinc + psize + invt + gc"
forward_k2(formula, max_formula, 2, modechoice)

'ttme~mode+choice+invc+psize+gc'

### (g) Select covariates as long as the increment in $R^2$ is larger than $\delta$.

Modify `forward_k` so that it runs as long as the difference between the $R^2$ of the new formula and $R^2$ of the old formula is greater than `delta`. Call this new function `forward_delta`. (***Hint:*** You might wish to use `old_rsq = 0` as a default argument in your function.)

### (h) Application 

Use the forwards regression algorithm on the data in Task I to find a "best" model for `ttme` using the covariates `["individual", "mode", "choice", "invc", "hinc", "psize", "invt", "gc", "choice * psize", "mode * choice"]`. Use `forward_delta` with `delta = 0.01` if you managed to create it. If not, use `forward_k` with `k=4`. Comment on your results.

## Task III: Simulations
We'll take a closer look at the one-sample $t$-test in this exercise. Let 
$$H_0: \textrm{the true mean equals }\mu.$$
Then the $t$-statistic
$$t = \sqrt{n}\frac{\overline{x} - \mu}{s},$$
is $t$-distributed with $n-1$ degrees of freedom, provided that the $X_i$  variables are independent and normally distributed with mean $\mu$ and any standard deviation $\sigma$.
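
As a small worked example of the formula, the sketch below computes $t$ for a hypothetical sample `x` and null value `mu`. Here $\overline{x} = 3$ and $s^2 = 2.5$, so $t = \sqrt{5}\,(3 - 2)/\sqrt{2.5} = \sqrt{2}$.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical sample
mu = 2.0                                  # hypothesized mean under H0

n = len(x)
s = np.std(x, ddof=1)                     # sample standard deviation
t = np.sqrt(n) * (np.mean(x) - mu) / s
print(t)
```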


### (a) Find the *t*-test in `scipy` and apply it.
`scipy` has a function that runs the one-sample *t*-test. Find it and apply it to the `volume` data from the data set `nile` from [statsmodels](https://www.statsmodels.org/dev/datasets/generated/nile.html), with $H_0: \mu = 1000$. Report the *p*-value for the two-sided alternative. 

### (b) Roll your own *t*-test function
Make your own variant of the *t*-test function. It should take two arguments `x` and `popmean`, and calculate the *p*-value for the two-sided alternative hypothesis. Verify that your calculation matches that of the previous exercise.

(***Hints***: 
1. You can find the cumulative distribution (CDF) of the *t*-distribution in e.g. `scipy`. 
2. The CDF of a random variable $X$ is the function $F(x) = P(X\leq x)$. 
3. Now use that the $p$-value equals $p = 2P(T\geq |t|)$, where $t$ is the $t$ statistic and $T$ is *t*-distributed with $n-1$ degrees of freedom, called `df` in the documentation of `scipy`.
4. In order to use `numpy` to calculate $s$, use `np.std(x, ddof = 1)`.)
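
The hints can be combined roughly as follows. This is a sketch on a hypothetical sample, not on the `nile` data; it compares the hand-computed two-sided *p*-value with the one returned by `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical sample
popmean = 2.0

n = len(x)
t = np.sqrt(n) * (np.mean(x) - popmean) / np.std(x, ddof=1)

# p = 2 * P(T >= |t|) = 2 * (1 - F(|t|)), with F the CDF of the t-distribution.
p_manual = 2 * (1 - stats.t.cdf(abs(t), df=n - 1))

# Reference value from scipy's one-sample t-test.
p_scipy = stats.ttest_1samp(x, popmean).pvalue
print(p_manual, p_scipy)
```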

### (c) The non-central *t*-distribution
We'll now have a look at the power of the *t*-test against some alternatives. The *power of a test* is defined as the probability of rejecting the null hypothesis when it is false, i.e., when the true distribution does not match the null hypothesis. To make this precise, we must fix the significance level $\alpha$. Moreover, we have to make assumptions about what the true distribution is.

The non-central *t*-distribution is the distribution of $T$ when the true distribution is normal, but with $\mu\neq0$. Find the non-central *t*-distribution in `scipy.stats`. Plot its density function (pdf) in the same window for the values $\theta = -2, -1, 0, 1, 2$ of its *non-centrality parameter*. Let the degrees of freedom be 5. 
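
A minimal sketch of evaluating the density, assuming `scipy.stats.nct` is the distribution meant here; its shape parameters are the degrees of freedom `df` and the non-centrality parameter `nc`. Only two values of $\theta$ are evaluated and the plotting itself is left out.

```python
import numpy as np
from scipy import stats

grid = np.linspace(-6, 6, 201)  # symmetric grid of evaluation points
df = 5

# Density of the non-central t for two non-centrality parameters.
densities = {theta: stats.nct.pdf(grid, df, theta) for theta in (-1, 1)}

# The density for -theta is the mirror image of the density for theta.
print(np.allclose(densities[1], densities[-1][::-1]))
```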


### (d) Power of the test (i)
Let $H_0:\mu = 0$, but suppose the true distribution is normal with mean $\mu \neq 0$ and standard deviation $\sigma$. We can calculate the power exactly using the non-central $t$-distribution. 
1. Consult the Wikipedia page on the non-central *t*-distribution, more specifically the section [Use in power analysis](https://en.wikipedia.org/wiki/Noncentral_t-distribution#Use_in_power_analysis). Use this information to make a function `power(mu, sigma, n, alpha = 0.05)` that calculates the probability of rejecting the null hypothesis for any sample size `n` and true parameters `mu`, `sigma`, when the cutoff for significance is `alpha`. 
2. Calculate the power of the *t*-test when $\mu = 0.1$ and $n=22$, $\sigma = 2$. What does this mean? 

(***Hint:*** The capital $F$ on the wikipedia page refers to the cumulative distribution function (CDF) of a random variable, in this case the non-central *t*. You can find the quantiles of the *t*-distribution using the *t*-distribution in `scipy.stats`; this is the `ppf` function.)

In [127]:
### You may test your function on these values:
power(0, 1, 10) # 0.05
power(1, 1, 10) # 0.8030968422370941

0.8030968422370941
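
One way the calculation could look, as a sketch that assumes a two-sided test of $H_0:\mu = 0$ and the non-centrality parameter $\theta = \sqrt{n}\mu/\sigma$. The test rejects when $T$ falls above the upper critical value or below the lower one, and both probabilities are read off the CDF of `scipy.stats.nct`.

```python
import numpy as np
from scipy import stats


def power(mu, sigma, n, alpha=0.05):
    """Power of the two-sided one-sample t-test of H0: mu = 0."""
    df = n - 1
    theta = np.sqrt(n) * mu / sigma          # non-centrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    upper = 1 - stats.nct.cdf(t_crit, df, theta)  # P(T > t_crit)
    lower = stats.nct.cdf(-t_crit, df, theta)     # P(T < -t_crit)
    return upper + lower


print(power(0, 1, 10))
print(power(1, 1, 10))
```

With `mu = 0` the null hypothesis is true, so the rejection probability equals `alpha`.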

### (e) Power of the test (ii)
If the true distribution isn't normal, we cannot use the non-central *t*-distribution, and we have to simulate instead. Make a function that approximates the power of the *t*-test for any random variable generator and cutoff level $\alpha$. (***Hint:*** Here `generator` is a function that generates random variables and takes one argument, e.g. `generator = lambda n: rng.normal(0, 1, n)`. To solve this exercise, you need to calculate the value of the *p*-value of the *t*-test and check if it's smaller than `alpha`.) 
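
A Monte Carlo sketch of the idea: draw many samples, run the *t*-test on each, and report the fraction of rejections. The function name `simulated_power`, the argument `popmean`, and the default `n_reps = 2000` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def simulated_power(generator, n, popmean=0.0, alpha=0.05, n_reps=2000):
    """Approximate the rejection probability of the two-sided t-test by simulation."""
    rejections = 0
    for _ in range(n_reps):
        sample = generator(n)
        if stats.ttest_1samp(sample, popmean).pvalue < alpha:
            rejections += 1
    return rejections / n_reps


rng = np.random.default_rng(313)
# Standard normal data with H0: mu = 0 true, so the result should be near alpha.
estimate = simulated_power(lambda n: rng.normal(0, 1, n), n=20)
print(estimate)
```

The estimate carries simulation error of order $\sqrt{p(1-p)/\texttt{n\_reps}}$, so more repetitions give a smoother power curve at the cost of longer run time.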

### (f) Power of the test (iii)
Investigate the power of the test for three cases. Let `n` be `3,9,15,21,...,99`. Plot the results in the same window, and comment. Moreover, make a new `rng` with seed `313`. The distributions you should investigate are:

1. The true distribution is exponential with $\lambda = 1$,
2. The true distribution is uniformly distributed,
3. The true distribution is Laplace-distributed with location $\mu = 0.1$ and $\lambda = 1$.

(***Hint:*** Use the `Numpy` documentation. The simulations might take a while to complete.)
