# EBA3500 Exercise 10: Ordinal, categorical, and numeric covariates

## Difficulty classifications
There are times when you may become confused by an exercise because it looks too easy. And some of the exercises in this course are easy. (For some!)  
* 🐇: Should be very easy for everyone.
* 🐖: Should be very easy for some, but harder for others.
* 🦢: Should demand some work to finish.
* 🐅: A "challenge" exercise that isn't strictly part of the curriculum. Can't guarantee a real challenge though. 😉

# Input data

In [2]:
import pandas as pd
url = "https://stats.idre.ucla.edu/stat/data/ologit.dta"
data_student = pd.read_stata(url)
data_student.head()

| | apply | pared | public | gpa |
|---|---|---|---|---|
| 0 | very likely | 0 | 0 | 3.26 |
| 1 | somewhat likely | 1 | 0 | 3.21 |
| 2 | unlikely | 1 | 1 | 3.94 |
| 3 | somewhat likely | 0 | 0 | 2.81 |
| 4 | somewhat likely | 0 | 0 | 2.53 |


## Exercise 1: Coding of variables


We will work with coding of ordinal variables. You may need to consult the documentation of [categorical data types.](https://pandas.pydata.org/docs/user_guide/categorical.html)
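
As a quick orientation, here is a minimal sketch of an ordered categorical series in Pandas. The toy values (`"low"`, `"medium"`, `"high"`) are made up for illustration and are not part of the course data. Note that `.cat.codes` is zero-based.

```python
import pandas as pd

# Toy ordered categorical series (hypothetical values, not the course data).
s = pd.Series(
    pd.Categorical(
        ["medium", "low", "high", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)

is_ordered = s.cat.ordered           # True, since we passed ordered=True
levels = list(s.cat.categories)      # categories in their declared order
positions = s.cat.codes.tolist()     # zero-based position of each value
```
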

### (a) (🐇) Understanding codings
Suppose we have the ordinal values `"a" < "b" < "c" < "d"`. Let `x` be a numeric vector and let
`{'a' : x[0], 'b': x[1], 'c': x[2], 'd': x[3]}` be a dictionary associating the first categorical value with the first element of `x`, the second categorical value with the second element of `x`, and so on.
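
To make the dictionary concrete, the sketch below builds it for one particular `x` and applies it to a few observations. The names `levels`, `code_of`, and `observations` are hypothetical and only serve the illustration.

```python
# Ordinal values in increasing order, and one candidate numeric vector x.
levels = ["a", "b", "c", "d"]
x = [0, 1, 2, 3]

# Associate the k-th categorical value with the k-th element of x.
code_of = dict(zip(levels, x))

# Apply the dictionary to some (made-up) observations.
observations = ["b", "d", "a", "a", "c"]
coded = [code_of[value] for value in observations]
```
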

Which of these vectors are codings of the ordinal values above?
1. `[0,1,2,3]`,
2. `[0, 0, 1,2]`,
3. `[-1,1,2,3]`,
4. `[1,2,4,3]`,
5. `[1,2,3]`.

### (b) (🐖) A simple coding function
Make a function that takes a Pandas series `s` with the ordered categorical data type as an argument. It should return the series obtained by replacing every occurrence of the $i$th category of `s` with `i`, i.e., linearly code `s`. (*Hint:* Is this already done in the lecture notes?)

### (c) (🦢) A coding function
Make a function that takes a Pandas series `s` with the ordered categorical data type and a function `f` as arguments. It should return the series obtained by mapping every occurrence of the $i$th category of `s` to `f(i)`.

For instance, `data_student["apply"]` contains the ordered values `'unlikely' < 'somewhat likely' < 'very likely'`. Every `'unlikely'` in `s` will be mapped to `f(1)`, `'somewhat likely'` to `f(2)`, and `'very likely'` to `f(3)`. 

Notice the default argument, `f = lambda i: i`, which corresponds to the usual linear coding.

In [53]:
def coding(s, f = lambda i: i):
    ...  # fill in

Verify your function using:

In [None]:
coding(data_student["apply"], lambda x: x**2).array
# [9, 4, 1, 4, 4, 1, 4, 4, 1, 4 ...

import numpy as np
coding(data_student["apply"], np.log).array
# [1.0986122886681098, 0.6931471805599453, 0.0, 0.6931471805599453, 0.6931471805599453,
# 0.0, 0.6931471805599453, 0.6931471805599453, 0.0, 0.6931471805599453, ...

### (d) (🐖) Applications (i)
Try out the following codings on the `data_student` data set and the global warming data set (from the first part of the lecture).
1. The linear coding, `f(i) = i`,
2. The quadratic coding `f(i) = i ** 2`,
3. The function that maps `1` to `1` but `i` to `i+1` when $i>1$.
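
The three codings listed above can be written as plain Python functions. This is only a sketch; the function names `linear`, `quadratic`, and `shifted` are our own and do not come from the lecture.

```python
def linear(i):
    """Linear coding: f(i) = i."""
    return i


def quadratic(i):
    """Quadratic coding: f(i) = i ** 2."""
    return i ** 2


def shifted(i):
    """Maps 1 to 1 and every i > 1 to i + 1."""
    return i if i == 1 else i + 1


# Evaluate each coding on the ranks of a three-level ordinal variable.
ranks = [1, 2, 3]
table = {f.__name__: [f(i) for i in ranks] for f in (linear, quadratic, shifted)}
```
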

### (e) (🦢) Another coding function
Sometimes your coding function will depend on all the values of `s`. For instance, one could use a signed squared distance from the mean, i.e. `f = lambda i: np.sign(i - np.mean(coding(s))) * (i - np.mean(coding(s))) ** 2`. Modify the `coding` function to support functions `f` taking either one argument `i` or two arguments `i, s`.
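
To see what the signed squared distance does, here is a small worked sketch on a made-up vector of linear codes (not the course data). It works directly on a Numpy array and does not use `coding`.

```python
import numpy as np

# Hypothetical linear codes for five observations; their mean is 2.0.
codes = np.array([3, 2, 1, 2, 2])

# Distance from the mean, squared, with the sign of the distance kept.
centered = codes - codes.mean()
signed_sq = np.sign(centered) * centered ** 2
```

Codes below the mean get a negative value and codes above the mean get a positive value, so the ordering of the categories is preserved.
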

You may need to check the signature of `f` to do this. The signature of a function is a list of the arguments it takes. To check the signature of a Python function, use:

In [38]:
from inspect import signature
signature(f)

<Signature (x)>

Use `dir` and the documentation of `signature` to find out how to work with the returned object.
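
One possible way to use the signature, sketched under the assumption that counting parameters is enough for your purposes: the `parameters` attribute of a `Signature` object is a mapping from parameter names to parameter objects. The helper name `n_params` is hypothetical. Some callables may not expose a signature, in which case `signature` raises an exception; the sketch then falls back to assuming one argument.

```python
from inspect import signature


def n_params(f):
    """Return the number of parameters of f, assuming 1 if it cannot be inspected."""
    try:
        return len(signature(f).parameters)
    except (TypeError, ValueError):
        # Assumption: callables without an inspectable signature take one argument.
        return 1


one = n_params(lambda i: i)
two = n_params(lambda i, s: i)
```
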

In [None]:
import numpy as np
s = data_student["apply"]
coding(data_student["apply"], lambda i, s: np.sign(i - np.mean(coding(s))) * (i - np.mean(coding(s))) ** 2)
# 2.1025, 0.2025, -0.3025,  0.2025, 0.2025, ...

### (f) Applications (ii)
Try out your new coding function using:
1.  The signed squared distance from the mean, `lambda i, s: np.sign(i - np.mean(coding(s))) * (i - np.mean(coding(s))) ** 2`.
2.  Define `f(i, s)` as below.
3.  The normal quantile function applied to the result in (2). (Use Scipy for this, i.e., `stats.norm.ppf`.)
4.  The Laplace quantile function applied to the result in (2). (Use Scipy for this.)
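
As a reminder of how quantile functions are called in Scipy, here is a minimal sketch using `stats.norm.ppf` and `stats.laplace.ppf` on a few made-up probabilities. The vector `p` is hypothetical and is not the result from (2).

```python
from scipy import stats

# Made-up probabilities strictly between 0 and 1.
p = [0.25, 0.5, 0.75]

# Quantile functions (inverse CDFs) of the standard normal and Laplace distributions.
normal_q = stats.norm.ppf(p)
laplace_q = stats.laplace.ppf(p)
```

Both distributions are symmetric around zero, so the median `0.5` is mapped to `0` in both cases.
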

Which coding do you prefer?

### (g) (🐅) Making `f` more efficient [Intermediate Python programming]
The function `f(i, s)` above computes the vectors `total` and `values` every time it is called, which is unnecessary and can use a lot of computational resources on large inputs. There is an easy way to avoid this, namely letting it be a function of `i` only and precomputing the values of `total`. But that causes cluttered code and is a bad option.

The challenge of this exercise is to modify the function `f(i, s)` so that it only computes `total` and `values` once. You are **not** allowed to store `values` or `total` in the environment calling `coding`, however, as this would be cheating.

This is possible, but you probably have to use classes. In particular, you need a *callable object*. That's an object you can call using `()`. You can make an object callable by defining the `__call__` method; see the code below.

In addition, try to use pure Numpy, i.e., avoid using `Counter`.

***Note***: This kind of "trick" is important, and can sometimes save hours of computing time, but is underused and even unknown by many statisticians. Stand tall in the grass, be resplendent in the mud. Learn to use callable objects! 
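
As a generic illustration of a callable object that keeps state between calls (unrelated to the `F` skeleton in this exercise), consider the sketch below. The class name `RunningMean` and its attributes are hypothetical.

```python
class RunningMean:
    """Callable object that remembers everything it has been called with so far."""

    def __init__(self):
        self._sum = 0.0
        self._count = 0

    def __call__(self, x):
        # State stored on the object survives between calls.
        self._sum += x
        self._count += 1
        return self._sum / self._count


running_mean = RunningMean()
first = running_mean(2.0)   # mean of [2.0]
second = running_mean(4.0)  # mean of [2.0, 4.0]
```

The object is used exactly like a function, but the state lives inside the object rather than in the calling environment.
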

In [None]:
class F:
  values = None
  
  def __call__(self, i, s):
    # What do you do here?
    return self.values[i]

f = F()

## Exercise 2: Varying intercepts and slopes

### (a) (🦢) Plot the varying intercept model
Make a function that plots a varying intercept model. It should be similar to `sns.lmplot`.

In [None]:
import seaborn as sns
sns.lmplot(x ='negemot', y ='govact', data = glbwarm, hue ='partyid')

This function uses the slopes and intercepts from the regression:

In [None]:
import statsmodels.formula.api as smf
smf.ols("govact ~ negemot * C(partyid)", data = glbwarm).fit()

However, you should make a function where the slope is constant across levels. In this case, you would want to use the single common slope from the regression model:

In [None]:
smf.ols("govact ~ negemot + C(partyid)", data = glbwarm).fit()
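
To see what the additive model gives you, here is a minimal sketch on synthetic, noise-free data (not the `glbwarm` data). The column names `x`, `g`, and `y` are made up. The fitted parameters contain one common slope and, through the intercept and the level offsets, one intercept per level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y = intercept_g + 2 * x, with intercept 1.0 for "a" and 4.0 for "b".
x = np.tile(np.arange(5.0), 2)
g = np.repeat(["a", "b"], 5)
y = np.where(g == "a", 1.0, 4.0) + 2.0 * x
toy = pd.DataFrame({"x": x, "g": g, "y": y})

fit = smf.ols("y ~ x + C(g)", data=toy).fit()
params = fit.params

# One slope shared by all levels.
slope = params["x"]

# The baseline level uses the intercept; other levels add their offset.
intercepts = {
    "a": params["Intercept"],
    "b": params["Intercept"] + params["C(g)[T.b]"],
}
```

Each level then corresponds to a line with its own intercept and the shared slope, which is what a varying intercept plot needs to draw.
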