# Multilevel Logistic Regression Models

## [Interactive demo: hierarchical models](https://calpolystat3.shinyapps.io/Hierarchical_Models/)

![this](week-3-img/multilevel-logistic-specification.png)

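**The logit function is the natural log of the odds.** For a binary dependent variable y measured on person i within cluster j, it is the log of the probability that y*ij* = 1 divided by 1 minus that probability: logit[P(y*ij* = 1)] = ln[P(y*ij* = 1) / (1 - P(y*ij* = 1))].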
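
A minimal sketch of the logit and its inverse, using only the Python standard library. The function names `logit` and `inv_logit` are illustrative choices, not taken from the lecture.

```python
import math

def logit(p):
    """Log odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Probabilities below 0.5 give negative log odds, above 0.5 positive
for p in (0.1, 0.5, 0.8):
    print(p, round(logit(p), 3), round(inv_logit(logit(p)), 3))
```

The inverse is what turns a fitted log-odds value back into a predicted probability.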

We have the fixed effects: the fixed, unknown parameters that we want to estimate to describe the relationships of the predictors with the log odds of the dependent variable being equal to 1. We also have random effects, which capture the dependencies among observations within the same higher-level cluster, denoted here by j.

This is an example of a random coefficient model where we have:
* u*0j*: a random effect that allows each cluster j to have a unique intercept
* u*1j*: a random effect that allows each cluster j to have a unique relationship of x with the log odds of y = 1

**The multilevel specification of the model will be:**

**Level 1:** 

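* logit[P(y*ij* = 1)] = B*0j* + B*1j*x*1ij*
* There is no separate level-1 residual term: the variability of the binary outcome is implied by its Bernoulli distribution.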

**Level 2:**

* B*0j* = B*0* + u*0j* 
* B*1j* = B*1* + u*1j*
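
To make the two levels concrete, here is a hedged sketch that simulates data from this kind of random coefficient model with the standard library only. All numeric values (`B0`, `B1`, `SD_U0`, `SD_U1`, cluster counts) are hypothetical, and the two random effects are drawn independently, i.e. with an assumed covariance of 0, purely to keep the example short.

```python
import math
import random

random.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical values chosen only for illustration
B0, B1 = -0.5, 0.8           # fixed effects
SD_U0, SD_U1 = 0.6, 0.3      # standard deviations of the random effects
N_CLUSTERS, N_PER_CLUSTER = 20, 30

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

rows = []  # each row is (cluster j, x_ij, y_ij)
for j in range(N_CLUSTERS):
    # Level 2: cluster-specific intercept and slope
    u0j = random.gauss(0.0, SD_U0)
    u1j = random.gauss(0.0, SD_U1)  # drawn independently of u0j (assumption)
    b0j = B0 + u0j
    b1j = B1 + u1j
    for _ in range(N_PER_CLUSTER):
        x = random.gauss(0.0, 1.0)
        # Level 1: log odds for this person, then a Bernoulli draw for y
        p = inv_logit(b0j + b1j * x)
        y = 1 if random.random() < p else 0
        rows.append((j, x, y))

print(len(rows), sum(y for _, _, y in rows) / len(rows))
```

Each cluster gets its own intercept and slope once, and every person in that cluster shares them, which is what induces the within-cluster dependence.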

### Assumptions
We make the same **distributional assumptions** about the random cluster effects. They:
* Are normally distributed
* Have a mean vector of 0: Mean of each random effect is 0
* Have unique variances and covariances

**Recall:** We fit multilevel models because we have an **explicit interest** in estimating the variance of the random cluster effects. Part of our research question involves estimating the amount of **between-cluster variance**, in this case in the log odds of the dependent variable being equal to 1.

![this](week-3-img/why-motivation-is-required.png)
**It takes longer to fit these models computationally.**

## Estimating Model Parameters

![this](week-3-img/estimating-parameters.png)

## Testing the model parameters

![this](week-3-img/testing-model-parameters.png)

## Revisiting the NHANES Example

* We fit a logistic regression to model the probability of ever smoking 100 cigarettes as a function of selected predictors.

* We assumed that all the NHANES observations are independent of each other.
    * **THIS IS NOT TRUE!**
    * The reason is the study design used for NHANES.
    * In NHANES, multistage probability sampling was used, with several stages of random selection of sampling clusters (or geographic areas).
        * So the observations on this indicator of ever smoking 100 cigarettes in your lifetime come from these randomly sampled clusters as part of the NHANES study design.
        * Because we have many people nested within the same clusters in that sample design, their observations on this indicator may in fact be correlated with each other for one reason or another.
        * **So we can't make the assumption that all NHANES observations are independent of one another**

* If the smoking observations are correlated within areas, the **standard errors** in our "naive" logistic regression analysis are **likely understated.**
    * Understated means that the estimated coefficients, which describe the relationships of these predictors with the probability of ever smoking 100 cigarettes (the mean of that binary dependent variable), will have standard errors that are too small.
    * So the estimated sampling variability of these estimates is smaller than it should be.
    * It should be larger because the observations on the dependent variable are correlated within areas, and that increases the sampling variance of our estimates.
    * We need to make sure that our model accounts for that aspect of the study design.
    * Including random cluster effects is one possible way to account for the fact that observations are correlated within areas.

* In addition to the modelling aspect, where we account for that between-cluster variance, we must have an **explicit interest** in estimating the variance between sampling clusters in terms of the probability of smoking.
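
The understated standard errors can be illustrated with a small simulation. This is a simplified sketch, not the NHANES analysis: it uses made-up data with hypothetical settings (`B0`, `SD_U0`, 50 equal-sized clusters) and compares two standard errors for an overall proportion rather than for regression coefficients. The "naive" version treats every observation as independent; the cluster-based version treats the clusters as the sampling units.

```python
import math
import random
import statistics

random.seed(2)  # arbitrary seed so the sketch is reproducible

N_CLUSTERS, N_PER_CLUSTER = 50, 40
B0, SD_U0 = 0.0, 1.0  # hypothetical intercept and random-intercept SD

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

cluster_means = []
all_y = []
for _ in range(N_CLUSTERS):
    # Each cluster has its own underlying probability of y = 1
    p_j = inv_logit(B0 + random.gauss(0.0, SD_U0))
    ys = [1 if random.random() < p_j else 0 for _ in range(N_PER_CLUSTER)]
    cluster_means.append(sum(ys) / N_PER_CLUSTER)
    all_y.extend(ys)

n = len(all_y)
p_hat = sum(all_y) / n

# Naive SE: pretends all n observations are independent
naive_se = math.sqrt(p_hat * (1.0 - p_hat) / n)

# Cluster-based SE: uses the variability of the cluster means
# (appropriate here because all clusters have the same size)
cluster_se = statistics.stdev(cluster_means) / math.sqrt(N_CLUSTERS)

print(round(naive_se, 4), round(cluster_se, 4))
```

With this much between-cluster variability, the cluster-based standard error comes out larger than the naive one, which is the same pattern described above for the regression coefficients.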

![this](week-3-img/bet-clust-var-smoking.png)

These bars bounce around a whole lot across the different sampling clusters. That's a visualization of the between cluster variance in the mean of the dependent variable of interest that we want to estimate when we fit multilevel models.

## Fitting the multilevel Logistic Regression model

**In this model we are going to consider the random effects of the randomly sampled clusters**
* This means that the intercepts of the model that we are fitting are allowed to randomly vary across the sampling clusters.
* For this example, we are not considering random slopes. So, we assume that the coefficients for all of our predictor variables are constant across the sampling clusters. We are only interested in estimating the variability in the intercepts, which allows each cluster to have a different smoking proportion.

**What we end up with:**
* **Very similar inferences** regarding which predictors are significant
* **Slight changes** in the estimated fixed effects, but for the most part we see the same coefficients describing the relationships of the predictors with ever smoking 100 cigarettes.
* **The Key difference:** The standard errors of estimates are now much **larger.**

**After fitting the model:**
* The estimated **variance** of random cluster **intercepts = 0.046** 
* **Significant** based on a likelihood ratio test (p-value < 0.01)
    * We reject the null hypothesis that the variance of the random cluster intercepts is 0
    * We have strong evidence that there is between cluster variability
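
To get a feel for the size of a variance of 0.046 on the log-odds scale, the sketch below converts it to a standard deviation, an odds multiplier, and an intraclass correlation. The ICC here uses the common latent-variable approximation, which takes the level-1 variance of a logistic model to be π²/3; that formula is an addition for illustration and is not something the lecture reports.

```python
import math

var_u0 = 0.046  # estimated variance of the random cluster intercepts (from the text)

# Standard deviation of the cluster intercepts on the log-odds scale
sd_u0 = math.sqrt(var_u0)

# Odds multiplier for a cluster whose intercept is one SD above average
odds_ratio_1sd = math.exp(sd_u0)

# Latent-variable ICC (assumption: level-1 variance fixed at pi^2 / 3)
icc = var_u0 / (var_u0 + math.pi ** 2 / 3)

print(round(sd_u0, 3), round(odds_ratio_1sd, 3), round(icc, 4))
```

A statistically significant variance component can still correspond to a small share of the total variation, so it helps to look at both the test and the magnitude.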

**Even after adjusting for predictors, randomly sampled clusters still vary in terms of smoking prevalence**
<br>
We are capturing this via the random effects.

## Model Diagnostics

![this](week-3-img/model-diagnostics-logistic.png)

Another key consideration is centering continuous predictor variables so that the intercept is interpretable. If we center the continuous predictors at their means, the intercept represents the expected log odds of smoking when those predictors are set to their means.
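
Centering itself is a one-line transformation. The sketch below uses a hypothetical list of ages; the variable names and values are made up for illustration.

```python
import statistics

ages = [23, 35, 47, 51, 64]  # hypothetical continuous predictor

mean_age = statistics.mean(ages)
age_centered = [a - mean_age for a in ages]

# After centering, a value of 0 corresponds to a person of average age,
# so the model intercept refers to that person rather than to age 0
print(mean_age, age_centered)
```

Centering changes only the intercept's meaning; the slope for the centered predictor is the same as for the original one.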

### EBLUPs for random intercepts

![this](week-3-img/eblups-random-intercept-logistic.png)

## Conclusion

![this](week-3-img/conclusion-log.png)

# 1
Suppose that you examine the variability among randomly sampled higher-level clusters of observations (e.g., schools) in the proportions on a binary variable of interest, and you find visual evidence of significant variability in the proportions. You then fit a logistic regression model to these data in Python, but forget to fit a multilevel model including random effects of the clusters. How would this affect your analysis?


1. Nothing would change; random effects do not affect our estimates of interest.


2. The standard errors of the estimated fixed effects would be too high, but the fixed effects would be identical.


3. The standard errors of the estimated fixed effects would be too low, but the fixed effects would be identical.


4. None of the above.

Correct answer: 4 (None of the above). By omitting explicit random effects in the logistic regression model, the standard errors of our estimated fixed effects would likely be too low, and the estimates of the fixed effects would likely be incorrect (because we are failing to explicitly adjust for the random effects of the higher-level clusters). Omitting random effects when they are important is the same type of model specification error as omitting the fixed effect of an important predictor variable.