# Exam in EBA 3500
All subtasks are equally weighted. There are 15 subtasks in total, giving slightly more than 
10 minutes for each. 

1. Don't spend too much time on any given exercise - come back to it later if you have 
   enough time left!
2. You don't have to make pretty legends for plots.
3. Download the [`.ipynb` from GitHub](https://gist.github.com/JonasMoss/97c450121cd75fa298d94835164264da) if you need it!

Good luck!

## Task 1: Variable slopes
In this task we will use the `fair` dataset from `statsmodels`. Please consult the [documentation](https://www.statsmodels.org/dev/datasets/generated/fair.html) for more information about the variables.

In [10]:
import statsmodels.formula.api as smf
import statsmodels.api as sm
fair = sm.datasets.fair.load_pandas().data
fair.head()

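| | rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 0.111111 |
| 1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 3.230769 |
| 2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 1.4 |
| 3 | 4.0 | 37.0 | 16.5 | 4.0 | 3.0 | 16.0 | 5.0 | 5.0 | 0.727273 |
| 4 | 5.0 | 27.0 | 9.0 | 1.0 | 1.0 | 14.0 | 3.0 | 4.0 | 4.666666 |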


### (a) Fix the data set
Some of the columns in the data set have the incorrect type. In fact, there are only `float64` types in this data set. 
1. Demonstrate that all the columns in this data set are `float64` using an appropriate method.
2. Change the numerical covariates that **should** be categorical (or ordinal) into categorical. Print the types of the columns after changing them.
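
As a minimal sketch of these two steps, the snippet below works on a small made-up data frame rather than on `fair`; the column names `rating` and `x` are hypothetical. It inspects the column types with `DataFrame.dtypes` and converts one column to an ordered categorical with `astype`.

```python
import pandas as pd

# Hypothetical toy data; `rating` is an ordinal score stored as a float.
toy = pd.DataFrame({"rating": [1.0, 3.0, 2.0, 3.0], "x": [0.5, 1.2, 0.3, 2.1]})

# Every column starts out as float64.
print(toy.dtypes)

# Convert the ordinal column to an ordered categorical type.
rating_type = pd.CategoricalDtype(categories=[1.0, 2.0, 3.0], ordered=True)
toy["rating"] = toy["rating"].astype(rating_type)

# `rating` is now `category`; `x` is still float64.
print(toy.dtypes)
```

Whether a given column should be ordered or unordered is a modelling decision; the sketch only shows the mechanics.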

### (b) Several intercepts
Fit a suitable linear model with response `affairs` and covariates `rate_marriage` and `age`.
Show its output. This can be regarded as a model with several intercepts. Why?

### (c) Categorical covariates (i)
The coefficient `rate_marriage[T.3.0]` in the model used in (b) is not significant. Modify the model in (b) to run with `rate_marriage[T.3.0]` removed. Is the model fit improved?

### (d) Categorical covariates (ii)
Would you recommend removing `rate_marriage[T.3.0]` from the model? Why, or why not?

### (e) Coding of `rate_marriage`
Explain what a coding of an ordinal variable is. Judging from the regression output, do you think there is a reasonable coding for `rate_marriage`? (***Hint:*** Plot the estimated coefficients `rate_marriage[T.2.0]` through `rate_marriage[T.5.0]`.)


### (f) Several intercepts and several slopes
Fit a suitable model with response `affairs` and covariates `rate_marriage` and `age`. But this time, make sure that there is a separate `age` slope for each value of `rate_marriage`. Show the output of the model.
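
To illustrate the difference between a shared slope and group-specific slopes in the `statsmodels` formula syntax, here is a hedged sketch on simulated data. The variables `group`, `x` and `y` are hypothetical and are not columns of `fair`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated toy data with a three-level categorical covariate (hypothetical names).
rng = np.random.default_rng(seed=1)
n = 300
toy = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n),
    "x": rng.normal(size=n),
})
toy["y"] = 1.0 + 0.5 * toy["x"] + rng.normal(size=n)

# `+` gives one intercept per group and a single slope shared by all groups.
fit_intercepts = smf.ols("y ~ C(group) + x", data=toy).fit()

# `*` adds the C(group):x interaction: one intercept and one slope per group.
fit_slopes = smf.ols("y ~ C(group) * x", data=toy).fit()

print(fit_intercepts.params)
print(fit_slopes.params)
```

With three groups, the first model has four coefficients and the second has six. If a column already has a categorical type, wrapping it in `C()` should not be necessary.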


### (g) Choice of model
Which model do you prefer? The one from (b) or the one from (f)? Give a short explanation why.

## Task 2: Binary regression
In this task we will use the `modechoice` dataset from `statsmodels`. Please consult the [documentation](https://www.statsmodels.org/dev/datasets/generated/modechoice.html) for more information about the variables.

In [17]:
import statsmodels.api as sm
modechoice = sm.datasets.modechoice.load_pandas().data
modechoice.head()

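| | individual | mode | choice | ttme | invc | invt | gc | hinc | psize |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 69.0 | 59.0 | 100.0 | 70.0 | 35.0 | 1.0 |
| 1 | 1.0 | 2.0 | 0.0 | 34.0 | 31.0 | 372.0 | 71.0 | 35.0 | 1.0 |
| 2 | 1.0 | 3.0 | 0.0 | 35.0 | 25.0 | 417.0 | 70.0 | 35.0 | 1.0 |
| 3 | 1.0 | 4.0 | 1.0 | 0.0 | 10.0 | 180.0 | 30.0 | 35.0 | 1.0 |
| 4 | 2.0 | 1.0 | 0.0 | 64.0 | 58.0 | 68.0 | 68.0 | 30.0 | 2.0 |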


### (a) First look at the data
1. Create a plot that gives a rough overview of the data. (***Hint:*** Make a "pairplot" of the data.)
2. Make a correlation plot. We care mostly about `choice`. Do any of the correlations with `choice` stand out? (***Hint:*** To make the correlation plot easier to read, use e.g. [this](https://stackoverflow.com/a/50703596).)
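
As a rough sketch of the numbers behind a correlation plot, the snippet below computes a correlation matrix with `DataFrame.corr` and ranks the correlations with one column of interest. The data and the column names `a`, `b` and `target` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Simulated toy data; `target` is a binary column that depends on `a` only.
rng = np.random.default_rng(seed=2)
n = 200
toy = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})
toy["target"] = (toy["a"] + rng.normal(size=n) > 0).astype(float)

# Pairwise Pearson correlations between all columns.
corr = toy.corr()

# Correlations with `target`, largest absolute value first.
target_corr = corr["target"].drop("target").sort_values(key=abs, ascending=False)
print(target_corr)
```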

### (b) Link functions
In this task we will run regressions with response `choice`, which is binary. Recall that 
binary regression uses **link functions**.
1. Explain what link functions are, and why they are needed.
2. Give three examples of link functions. What are the most popular link functions?
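
To make the idea concrete, the sketch below implements the inverses of three links commonly offered by GLM software (logit, probit and complementary log-log) using only the standard library. The function names are our own and do not come from `statsmodels`. Each inverse link maps a linear predictor on the real line to a probability in $(0, 1)$.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit link (the logistic function)."""
    return 1 / (1 + math.exp(-eta))

def inv_probit(eta):
    """Inverse probit link: the standard normal CDF, written via erf."""
    return 0.5 * (1 + math.erf(eta / math.sqrt(2)))

def inv_cloglog(eta):
    """Inverse complementary log-log link."""
    return 1 - math.exp(-math.exp(eta))

# Compare the three inverse links at a few values of the linear predictor.
for eta in (-2.0, 0.0, 2.0):
    print(eta, inv_logit(eta), inv_probit(eta), inv_cloglog(eta))
```

The logit and probit inverses are symmetric around $0$, where both equal $0.5$; the complementary log-log inverse is not.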

### (c) Confidence interval
Using a method from `statsmodels`, report the $95\%$ confidence intervals for the coefficients in the model `choice ~ ttme + invt + gc`. 
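
A minimal sketch of how confidence intervals can be read off a fitted binary regression, assuming a logistic model and simulated data; the columns `x` and `y` are hypothetical. The `conf_int` method of the fitted results object returns one row per coefficient, with the lower and upper limits as columns.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated binary response from a logistic model (toy data, hypothetical names).
rng = np.random.default_rng(seed=3)
n = 500
toy = pd.DataFrame({"x": rng.normal(size=n)})
prob = 1 / (1 + np.exp(-(0.5 + toy["x"])))
toy["y"] = rng.binomial(1, prob)

# Fit a logistic regression; disp=0 silences the optimizer output.
fit = smf.logit("y ~ x", data=toy).fit(disp=0)

# alpha=0.05 corresponds to 95% confidence intervals.
ci = fit.conf_int(alpha=0.05)
print(ci)
```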

### (d) Making predictions
Use the built-in functions of `statsmodels` to predict the value of `choice` for 
the model `choice ~ ttme + invt + gc` (do this for all the values in the data frame). 
Plot the observed values of `mode` on the $x$-axis against the predicted values on the
$y$-axis.


## Task 3: The central limit theorem

### (a) Stating the central limit theorem
State the central limit theorem. Give two reasons why we care about it.

### (b) A counterexample to the central limit theorem?
Let $X_i$, $i = 1, \ldots, n$, be $n$ independent observations from a [Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution) with shape parameter $\alpha = 2$ and scale $x_m = 1$, a distribution with expected value $2$. Alice claims that $\sqrt{n}(\overline{X_n} - 2)$ does not converge to a normal distribution. Explain why this would not contradict the central limit theorem.
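
As a worked check of the moments involved, using the Pareto density $f(x) = \alpha x_m^{\alpha} / x^{\alpha + 1}$ for $x \geq x_m$ (with $\alpha = 2$ and $x_m = 1$ this is $f(x) = 2x^{-3}$ for $x \geq 1$):

$$
E[X] = \int_1^\infty x \cdot 2x^{-3} \, dx = 2 \int_1^\infty x^{-2} \, dx = 2,
\qquad
E[X^2] = \int_1^\infty x^2 \cdot 2x^{-3} \, dx = 2 \int_1^\infty x^{-1} \, dx = \infty.
$$

The first moment matches the expected value stated above, while the second moment is infinite, so the variance is not finite.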

### (c) Distribution of the means
Using the setup from the previous exercise, simulate and plot the distribution of $\sqrt{n}(\overline{X_n} - 2)$ when $n = 1000$. Does it
look normal? (***Hint:*** Use the [NumPy implementation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.pareto.html#numpy.random.Generator.pareto) to generate samples: `rng.pareto(a, size) + 1`; you must add $1$ to make the scale correct. Compare the histogram to the "best-fitting" normal density, using the standard deviation of your simulated `np.sqrt(n) * (means - 2)`. Remember that `from scipy.stats import norm` imports the normal distribution, whose density is `norm.pdf`.)
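
The simulation described in the hint could be set up roughly as follows. This is a sketch only: the number of replications `n_reps` and the seed are arbitrary choices of ours, and the plotting step is left out.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=313)
n = 1000       # sample size from the exercise
n_reps = 2000  # number of simulated means (an arbitrary choice)

# Each row is one sample of size n; adding 1 shifts the draws to scale x_m = 1.
samples = rng.pareto(2, size=(n_reps, n)) + 1
means = samples.mean(axis=1)
z = np.sqrt(n) * (means - 2)

# "Best-fitting" normal density, using the standard deviation of the simulated values.
sd = z.std()
grid = np.linspace(z.min(), z.max(), 200)
density = norm.pdf(grid, loc=z.mean(), scale=sd)
```

A histogram of `z` drawn with `density=True` can then be overlaid with the curve `(grid, density)` to compare the two shapes.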


### (d) Understanding when the central limit theorem is of limited use
When will, in practice, the central limit theorem not be very useful? Give a general description and three examples. (***Hint:*** Look at the Wikipedia page for the [Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution); also think about [Black Swans](https://en.wikipedia.org/wiki/The_Black_Swan:_The_Impact_of_the_Highly_Improbable).)
