# EBA3500 Mock home exam

This mock exam is part of the curriculum, and you should try to do it before
the real home exam. The real home exam will be similar to this one, and you will
benefit from taking this mock home exam seriously! 

**"Brevity is the soul of wit."** There will be some rules for how you're allowed
to write the final exam. Importantly, you must not write too much, or provide 
too much output. Give us what is relevant. I will make this clearer later. 

You need to be selective in the output you show. Only show output that supports
your argument. To hide output of a cell, you may use a semi-colon ";":


In [None]:
list(range(100000));


**Note:** This mock exam has not been double-checked for errors and typos. If
you come across any, please e-mail me.

## Task I: Applied regression

We will use the following data set.

In [None]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
dat = sm.datasets.get_rdataset("Guerry", "HistData").data

Our goal is to predict the variable `Lottery` from the other variables in the data set.

In [None]:
dat

### (a) Using the data

Notice the following about the categorical variable `Department`:

In [None]:
len(set(dat.Department))

It would be silly to use this variable as a covariate in a regression model. Why? Are there any other covariates in the data frame with the same problem? Modify `dat` so that it no longer contains covariates that are useless in this sense.

### (b) Taking an initial look at the data

Use an appropriate plotting function to take a look at the data. Moreover, take
a look at some data summaries such as the correlation matrix and category plots,
if applicable.

### (c) Transforms
Some of the covariates in this data set may benefit from being transformed. Try out the logarithmic and quadratic transforms on the numeric columns of the data frame (except for `Lottery`), and compare the resulting pairwise correlations to what you obtained without the transforms.
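
The effect a transform can have on a correlation can be sketched on synthetic data. The example below does not use the Guerry data: the variables `x` and `y`, the helper `correlation`, and the data-generating process are assumptions made purely for illustration. Note that the logarithm requires strictly positive values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data (not the Guerry data): y depends on log(x), not on x itself.
x = rng.uniform(1.0, 1000.0, size=200)
y = 2.0 * np.log(x) + rng.normal(0.0, 0.5, size=200)


def correlation(a, b):
    """Pearson correlation between two vectors (hypothetical helper)."""
    return np.corrcoef(a, b)[0, 1]


r_raw = correlation(x, y)          # no transform
r_log = correlation(np.log(x), y)  # logarithmic transform
r_square = correlation(x**2, y)    # quadratic transform
print(round(r_raw, 2), round(r_log, 2), round(r_square, 2))
```

In this constructed case the logarithmic transform gives the strongest linear relationship, because that is how the data were generated. Real data need not behave this way.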

### (d) Evaluating models
Fit at least five regression models and make an informed choice among them.
Do **not** show the entire output of the models you look at. (**Hint:** You have
learned about ANOVA tables, *F*-tests, and the adjusted $R^2$. But you may
also think about what makes sense, the principle of parsimony, and so on.)

### (e) Making predictions

In this exercise, we're going to use the model

In [None]:
results = smf.ols('Lottery ~ Literacy + Donations + Infants + Suicides + MainCity', data=dat).fit()

#### (i) Make specific predictions

Predict the values of `Lottery` when `Literacy`, `Donations`, `Infants`, `Suicides`, and `MainCity` are as in the following table:

| Literacy | Donations | Infants | Suicides | MainCity |
| - | - | - | - | - |
| 20 | 4000 | 10000 | 10000 | 2:Med |
| 50 | 7777 | 1 | 1 | 2:Med |
| 0 | 0 | 0 | 0 | 3:Lg |
 

#### (ii) All predictions

Predict the values of `Lottery` given the observed values of `Literacy`, `Donations`, `Infants`, `Suicides`, and `MainCity`. Plot the observed values against the predicted values. Do you see a pattern?

### (f) Missing covariates (1)
Suppose we have chosen to use the model in (e). Your uncle Bob comes along with a vector of new data. He wants you to predict the value of `Lottery`, but he has forgotten whether `MainCity` equals `2:Med` or `3:Lg`. He knows everything else that you need to know. How would you modify the model in (e) so that it is able to predict the value of `Lottery` given his data? If you wish to compute something, you may assume Uncle Bob's vector is the first row of the table in the previous exercise, but with `MainCity` equal to either `2:Med` or `3:Lg`. (**Hint:** You may want to use unions here. However, there are reasonable solutions to this exercise that do not use unions.)

### (g) Missing covariates (2)
The following week, Bob visits again. This time he remembers `MainCity`, but he has forgotten `Donations` and `Suicides`. Modify the model in (e) so that it works in this scenario as well.

## Task II: Algorithms for regression

In this task, you will make a program doing *backward* regression. See e.g. [this page](https://quantifyinghealth.com/stepwise-selection/) for more info. (Scroll down to backwards stepwise regression.) The point of backward regression is to iteratively fit regression models, removing one covariate at a time, starting out with the biggest model.

### (a) Find the least significant set of predictors
Figure out how to extract the name of the predictor associated with the biggest *p*-value in
an ANOVA table, i.e., `sm.stats.anova_lm(fit, typ=1)`. Make a function `largest_p_value` that takes a fitted `statsmodels.regression.linear_model.RegressionResultsWrapper` object and returns the name of the covariate / group of covariates with the largest *p*-value.

(***Hint:*** `sm.stats.anova_lm` returns a data frame. You might want to use `numpy.argmax`.)


In [None]:
frame = sm.stats.anova_lm(results, typ=1)["PR(>F)"] 
frame

Here `largest_p_value(results)` will return `"Infants"`, the name of the covariate with the largest *p*-value.
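
The idea of going from a position to a name can be sketched on a small hand-made series. The *p*-values and row labels below are hypothetical and only mimic the shape of a `"PR(>F)"` column, where the residual row has no *p*-value (`NaN`). Because of that row, this sketch uses `numpy.nanargmax`, which ignores `NaN`, rather than `numpy.argmax`.

```python
import numpy as np
import pandas as pd

# Hypothetical p-values, shaped like the "PR(>F)" column of an ANOVA table.
# The last row mimics the residual row, which has no p-value.
p_values = pd.Series(
    [0.01, 0.40, 0.03, np.nan],
    index=["x1", "x2", "x3", "Residual"],
)

# np.nanargmax ignores NaN and returns the integer position of the maximum.
position = np.nanargmax(p_values.to_numpy())

# The index of the series translates the position into a row label.
name = p_values.index[position]
print(name)  # x2
```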

### (b) Removing covariate from formula
Make a function `remove_covariate(formula, covariate)` that removes `covariate` from `formula`. For instance, `remove_covariate("y ~ x + z", "z")` should return `"y ~ x"`.
(***Hint:*** You will need Python's tools for handling strings. The commands below should suffice.)

In [None]:
formula = "y ~ x +   z" 
formula = formula.replace(" ", "") # Strips whitespace
response, covariates = formula.split("~") # Divides the string into two!
covariates = covariates.split("+") # Splits the covariates into a vector of covariates.

# Do you need to do something with `covariates` in order to solve the exercise?

# Now we merge the strings together again.
response + "~" + "+".join(covariates)


### (c) Select the $k<n$ most significant predictors

Make a function that iteratively removes the least significant predictor `k` times. 

In [None]:
def backward(formula, k, data):
    """ Returns the fitted regression model of the backwards regression model
        after k steps. k must be smaller than n, the number of observations. We assume
        that k is smaller than, or equal to, the number of covariates in the formula."""

    # If k == 0, we just fit an ordinary regression model.
    if k == 0:
        return smf.ols(formula, data=data).fit()
    n = data.shape[0]  # Number of observations (rows) in the data frame.
    assert n > k, "k must be smaller than n"

    new_formula = ...  # Your code here.
    # If k > 0, we can run the algorithm once more, but with a new formula and a new k!
    return backward(new_formula, k - 1, data)


### (d) Select predictors as long as the largest *p* is larger than `limit`

Modify the preceding function so that it runs as long as the largest *p*-value is larger than `limit`.

In [None]:
def backward(formula, limit, data):
    """ Returns the fitted regression model of the backwards regression model 
        which runs as long as the largest *p*-value is larger than `limit`"""
    pass

### (e) Application 

Use the backwards regression algorithm on the data in Task I to find the "best" model for `Lottery`.

## Task III: Simulations

We will take a look at the [Jarque-Bera test](https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test). This tests whether a univariate data set matches the normal distribution. More specifically, it tests whether the sample skewness and kurtosis appear to match those of the normal distribution. We haven't covered skewness and kurtosis in this class, but feel free to look them up on Wikipedia. Letting $\mu = EX$, their population values are defined as
$$\textrm{Skewness} = \frac{E(X-\mu)^3}{\textrm{Var}(X)^{3/2}}$$
and
$$\textrm{Kurtosis} = \frac{E(X-\mu)^4}{\textrm{Var}(X)^{2}}.$$
Roughly speaking, the skewness measures how skewed a distribution is, while the kurtosis measures its "tailedness". 
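
As a reference point, here is a short worked case using standard facts about the normal distribution (they are not derived in this text). If $X \sim N(\mu, \sigma^2)$, then
$$E(X-\mu)^3 = 0, \qquad E(X-\mu)^4 = 3\sigma^4,$$
so that
$$\textrm{Skewness} = 0, \qquad \textrm{Kurtosis} = \frac{3\sigma^4}{(\sigma^2)^2} = 3.$$
This is why the Jarque-Bera statistic below compares the sample skewness with $0$ and the sample kurtosis with $3$.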

### (a) Implement the Jarque-Bera test, part I.
Define a Python function that takes `n,S,K` as arguments and outputs $\frac{n}{6}(S^2+\frac{1}{4}(K-3)^2)$. 
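
A quick hand calculation can serve as a check of such a function. The numbers are made up for illustration: with $n = 60$, $S = 0.5$, and $K = 4$,
$$\frac{60}{6}\left(0.5^2+\frac{1}{4}(4-3)^2\right) = 10\,(0.25 + 0.25) = 5.$$
With $S = 0$ and $K = 3$ the value is $0$ for any $n$.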

### (b) Implement the Jarque-Bera test, part II
The sample skewness and kurtosis are defined as
$$
S=\frac{\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\overline{x})^{3}}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}\right)^{3/2}},
$$

$$
K=\frac{\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\overline{x})^{4}}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}\right)^{2}}.
$$

Implement these as two functions `skewness` and `kurtosis` taking one argument `x` each.

### (c) Implement the Jarque-Bera test, part III
Make the `jarque_bera` function, taking `x` as an argument, that implements the Jarque-Bera test. To do this, use that the Jarque-Bera test statistic is defined as $\frac{n}{6}(S^2+\frac{1}{4}(K-3)^2)$, where $n$ is the sample size, $S$ is the sample skewness, and $K$ is the sample kurtosis.


### (d) Simulate from a normal distribution
Make a function that simulates `n` observations from a normal distribution with mean `mu` and standard deviation `sigma`, then calculates the Jarque-Bera test statistic on these values. Do this `n_reps` times, and return the results as a NumPy vector.
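
The general pattern of repeating a simulation and collecting one number per repetition can be sketched as follows. This is a minimal sketch, not the requested function: the name `simulate_statistic`, its `statistic` and `seed` arguments, and the use of the sample mean as the statistic are assumptions chosen for illustration.

```python
import numpy as np


def simulate_statistic(statistic, n, mu, sigma, n_reps, seed=None):
    """Apply `statistic` to n_reps independent normal samples of size n."""
    rng = np.random.default_rng(seed)
    values = np.empty(n_reps)
    for i in range(n_reps):
        # Draw a fresh sample in every repetition.
        sample = rng.normal(mu, sigma, size=n)
        values[i] = statistic(sample)
    return values


# Illustration with the sample mean as the statistic.
means = simulate_statistic(np.mean, n=50, mu=1.0, sigma=2.0, n_reps=1000, seed=0)
print(means.shape)  # (1000,)
```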

In [None]:
def jarque_bera_normal(n, mu, sigma, n_reps):
    pass

### (e) Simulate and plot
Using `n = 100` and `n_reps = 10**5`, call `jarque_bera_normal` with your choice of `mu` and `sigma`. Make a histogram of the values. Moreover, according to the Wikipedia article on the [Jarque-Bera test](https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test), the Jarque-Bera test statistic should be approximately $\chi^2$-distributed with $2$ degrees of freedom. To verify this, add a line plot of the density of the $\chi^2$-distribution with $2$ degrees of freedom to the histogram. Comment on how well the histogram and the line match.
(***Hint:*** To plot the $\chi^2$-distribution you must consult the NumPy or SciPy documentation, e.g. `scipy.stats.chi2`.)
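
Comparing a histogram with a density requires both to be on the same scale. The sketch below illustrates this numerically, using draws from `rng.chisquare` as a stand-in for simulated test statistics (an assumption for illustration; it does not simulate the Jarque-Bera test). It uses the fact that the $\chi^2$ density with $2$ degrees of freedom is $\frac{1}{2}e^{-x/2}$ for $x \ge 0$; `scipy.stats.chi2.pdf(x, df=2)` evaluates the same curve.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
draws = rng.chisquare(df=2, size=100_000)  # stand-in for simulated statistics

# Histogram heights on the density scale, so they are comparable to a pdf.
edges = np.linspace(0.0, 10.0, 21)
heights, _ = np.histogram(draws, bins=edges, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Density of the chi-squared distribution with 2 degrees of freedom.
density = 0.5 * np.exp(-centers / 2)

# Largest vertical gap between histogram and density over the bins.
max_gap = np.max(np.abs(heights - density))
print(max_gap)
```

With matplotlib, `plt.hist(draws, bins=edges, density=True)` followed by `plt.plot(centers, density)` would draw the same comparison as a picture.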

### (f) Connection to *p*-values
Consider the null hypothesis
$$H_0: \textrm{The true distribution is normal.}$$
Since the Jarque-Bera test statistic is approximately $\chi^2$-distributed with $2$ degrees of freedom under $H_0$, we can calculate its *p*-value using `scipy.stats.chi2`. Explain *how* you would do this and *why* the result is a *p*-value.

### (g) Power of the test, part (i)
The definition of a *p*-value only mentions the null hypothesis, but in order for a test to be useful it must have **power** against some reasonable alternatives. This means that, for a fixed significance level $\alpha$, the test has a high probability of detecting that $H_0$ isn't true when it is in fact false.
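
The notion of power can be sketched with a simpler test than Jarque-Bera. The example below is an illustration under stated assumptions: it uses a two-sided *z*-test of "the mean is $0$" for normal data with known standard deviation $1$ at level $0.05$ (critical value $1.96$), and the function name `rejection_rate` is hypothetical.

```python
import numpy as np


def rejection_rate(true_mean, n=25, n_reps=10_000, seed=0):
    """Share of simulated samples in which a two-sided z-test rejects mean = 0.

    Assumes normal data with known standard deviation 1 and level 0.05."""
    rng = np.random.default_rng(seed)
    # One row per simulated data set.
    samples = rng.normal(true_mean, 1.0, size=(n_reps, n))
    z = np.sqrt(n) * samples.mean(axis=1)
    return np.mean(np.abs(z) > 1.96)


print(rejection_rate(0.0))  # close to the level 0.05, since H0 is true
print(rejection_rate(0.5))  # estimated power against the alternative mean = 0.5
```

The rejection rate under the null hypothesis stays near $\alpha$, while the rejection rate under the alternative is the estimated power.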

Make a function `simulate_jarque_bera` that takes three arguments: `n`, `n_reps`, and `random`. The `random` argument should be a random generator taking one `size` argument (e.g. `lambda size: rng.normal(mu, sigma, size)` or `lambda size: rng.exponential(scale, size)`). It should simulate the Jarque-Bera test statistic as we did in Exercise (d), but with the supplied distribution `random` instead of the normal distribution.
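
Passing a random generator as an argument can be sketched as follows. The helper `draw_and_summarise` and the parameter values are assumptions for illustration; `rng` is taken to be a NumPy `Generator` created with `np.random.default_rng`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Each generator takes a single `size` argument, as described above.
normal = lambda size: rng.normal(0.0, 1.0, size)
exponential = lambda size: rng.exponential(2.0, size)  # scale = 2


def draw_and_summarise(random, n):
    """Draw n observations from `random` and return their sample mean."""
    return random(n).mean()


print(draw_and_summarise(normal, 1000))
print(draw_and_summarise(exponential, 1000))
```

The function never needs to know which distribution it is given. It only calls `random(n)`.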

### (h) Power of the test, part (ii)
Make a function `power_jarque_bera(n, n_reps, random, alpha = 0.05)`. The first three arguments are the same as the previous exercise, and `alpha` is a significance level. It should return the approximate probability that the `Jarque-Bera` test will be significant at the `alpha` level when the true distribution is `random`.

### (i) Power of the test, part (iii)
Use the `power_jarque_bera(n, n_reps, random, alpha = 0.05)` function to calculate the power of the Jarque-Bera test for $10$ different choices of `random`, and put them into a table. Please make some comments too. (***Hint***: Look at the numpy documentation and find some reasonable distributions to simulate from!)