# Marginal Logistic Regression Models

**Observations on our binary dependent variables are correlated within the clusters introduced by the study design.**

### Recall before starting (cuz I'll probably forget)
* With marginal models, there is no explicit interest in making inferences about between-cluster variance in the coefficients of interest in a given model.
* GEE is a general technique for estimating marginal regression models and accounting for the within-cluster dependencies in a dependent variable introduced by cluster sampling or repeated measurements.

### Our primary interest: 
Population-average marginal relationships between predictor variables of interest and a binary dependent variable
* Overall marginal relationship across all the clusters that may have been introduced by the study design.

## Marginal Models for Binary Outcomes
* GEE models were specifically designed to **readily accommodate non-normal outcome variables measured longitudinally**
    * That is, cases where there are repeated measurements over time on the same subject being studied.
    
    * We have reason to believe that those repeated measurements are correlated. 
* The mean structure in the estimating equation is defined by a given type of generalized linear model.
    * In marginal linear regression models we saw the mean was defined by the linear combination of the predictor variables and the regression coefficients.
        * That's how we modelled the mean of the normally distributed dependent variable.
### For Logistic Regression
For a binary DV, the mean of the DV is the probability that the DV is equal to 1:

![logistic-eqn](week-3-img/logistic-eqn-1.png)

**We are just talking about the proportion or probability.**
<br>
-> This is the expected value of Y conditional on the values of the predictor variables (x*ti*). <br>
-> In logistic regression we use the logit link, so the mean is given by the inverse logit function: the exponential of the linear combination of predictor variables and regression parameters, divided by 1 plus the exponential of that same linear combination. The inverse logit gives the probability that the dependent variable is equal to 1 under the specified model. <br>
This is where the beta parameters enter the estimating equation. We substitute this expression for mu into the equation and then solve the score function for the values of those parameters.
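The inverse logit described above can be sketched in a few lines of Python. The coefficients and covariate values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not estimates from the NHANES model.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def predicted_probability(coefs, x):
    """Probability that the DV equals 1 for one observation.

    coefs: regression parameters (intercept first)
    x:     predictor values, with a leading 1.0 for the intercept
    """
    eta = sum(b * v for b, v in zip(coefs, x))  # linear combination
    return inverse_logit(eta)

# Hypothetical values, for illustration only.
beta = [-1.0, 0.5]   # intercept and one slope
x_row = [1.0, 2.0]   # 1.0 for the intercept, then the predictor value

p = predicted_probability(beta, x_row)
print(round(p, 3))  # linear predictor is -1 + 0.5*2 = 0, so p = 0.5
```

A linear predictor of 0 corresponds to a probability of 0.5; positive values push the probability above 0.5 and negative values push it below.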

![logistic-eqn](week-3-img/about-marginal-logistic-models.png)

**Our job here is to choose a working correlation structure for the GEE, since the model has already specified the mean and the variance of the dependent variable.**
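To make the roles of the pieces concrete, here is a minimal sketch of how a working covariance matrix for one cluster could be assembled from the binomial variance function mu(1 - mu) and an exchangeable working correlation. The fitted probabilities are hypothetical; the correlation value 0.01 is the exchangeable estimate reported later in these notes. The function names are made up for this illustration and do not come from any library.

```python
import math

def exchangeable_corr(n, alpha):
    """n x n working correlation: 1 on the diagonal, alpha everywhere else."""
    return [[1.0 if j == k else alpha for k in range(n)] for j in range(n)]

def working_covariance(mu, alpha):
    """Working covariance for one cluster: sd_j * R[j][k] * sd_k,
    where sd_j = sqrt(mu_j * (1 - mu_j)) comes from the binomial variance."""
    sd = [math.sqrt(m * (1.0 - m)) for m in mu]
    corr = exchangeable_corr(len(mu), alpha)
    n = len(mu)
    return [[sd[j] * corr[j][k] * sd[k] for k in range(n)] for j in range(n)]

# Hypothetical fitted probabilities for three observations in one cluster.
mu = [0.5, 0.2, 0.8]
V = working_covariance(mu, alpha=0.01)

for row in V:
    print([round(v, 4) for v in row])
```

Setting `alpha=0.0` gives the independence structure: the off-diagonal entries vanish and only the binomial variances remain on the diagonal.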

## Revisiting the Smoking example from NHANES

![logistic-eqn](week-3-img/revisit-smoking-nhanes-1.png)

![logistic-eqn](week-3-img/revisit-smoking-nhanes-2.png)

**The asterisks mark the estimates that are significantly different from 0.**
<br>
* Being male substantially increases the probability of having smoked over 100 cigarettes.
* Older people have a higher probability of having smoked over 100 cigarettes.
* Larger household sizes: lower probability of having smoked over 100 cigarettes.
* Family income-to-poverty ratio: negative relationship. As it goes up, the probability of having smoked over 100 cigarettes goes down.

Very similar inferences from both approaches. Even the standard errors are very similar. <br>
We did not include the variance component in this table in the GEE approach.

### Model Diagnostics
* Our nuisance estimate of the within-cluster correlation, assuming an exchangeable correlation structure within sampling clusters, was only 0.01
* Corresponding QIC = 6284.53

![logistic-eqn](week-3-img/model-diagnostics-marginal-logistic-1.png)

* When we fit the independence model, where we assume 0 correlation within the same sampling cluster, the QIC is lower (6284.05) than under the exchangeable structure.
    * This provided evidence in favour of assuming that observations within the same NHANES sampling cluster **do not** in fact have a **strong correlation.**
    
![logistic-eqn](week-3-img/model-diagnostics-marginal-logistic-2.png)
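The QIC comparison is simple arithmetic: lower is better, and here the gap is under half a point. A small sketch using the two QIC values quoted in these notes (the dictionary keys are just labels chosen for this example):

```python
# QIC values reported in the notes for the two working correlation structures.
qic = {"exchangeable": 6284.53, "independence": 6284.05}

# The structure with the smaller QIC is preferred.
best = min(qic, key=qic.get)
gap = qic["exchangeable"] - qic["independence"]

print(best, round(gap, 2))  # independence 0.48
```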

### Conclusion from the example

1. **Marginally:** when we look at the overall population average, that is, the marginal relationships across the NHANES sampling clusters, we find essentially the **same estimated fixed effects** as we did when we fit the **multilevel model.**
    * So whether we condition on the random effects, as in the multilevel case, or look at the overall population-average marginal relationships, we reach very similar conclusions about the importance of the different predictor variables in predicting lifetime smoking behavior.
    
2. Accounting for the dependency rather than assuming independence of observations within each cluster did **not** seem to improve the model fit in this case. 
    * There was some evidence of between-cluster variance in the multilevel modelling approach, but the within-cluster correlation did not prove to be substantial in this case.
    
3. **Remember:** We still interpret the estimated fixed effects **marginally** across clusters. We are not conditioning on what's happening in a given cluster; we are describing what happens across all the NHANES sampling clusters when the values of the predictor variables change.
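One way to see why the conditional (multilevel) and marginal (GEE) estimates agree so closely here is a well-known approximation for random-intercept logistic models: the marginal coefficient is roughly the conditional coefficient divided by sqrt(1 + c² · σ²), where σ² is the random-intercept variance and c = 16·√3 / (15·π). The sketch below is only an illustration of that approximation. The coefficient and variance values are hypothetical, not NHANES estimates, and the function name is made up for this example.

```python
import math

# Constant from the logistic-normal approximation: c = 16*sqrt(3) / (15*pi).
C_SQ = (16.0 * math.sqrt(3.0) / (15.0 * math.pi)) ** 2

def approx_marginal_coef(beta_conditional, random_intercept_var):
    """Approximate population-average coefficient implied by a
    cluster-specific coefficient and a random-intercept variance."""
    return beta_conditional / math.sqrt(1.0 + C_SQ * random_intercept_var)

# Hypothetical conditional coefficient and a range of hypothetical variances.
beta_c = 1.0
for var in (0.0, 0.05, 1.0):
    print(var, round(approx_marginal_coef(beta_c, var), 3))
```

When the between-cluster variance is close to zero, the attenuation factor is close to 1, so the two kinds of estimates nearly coincide. With larger variance, the marginal coefficient shrinks toward zero relative to the conditional one.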

# 1 
In the marginal logistic regression model fitted to the NHANES lifetime smoking indicator using GEE, the estimated fixed effect of household size was -0.08, and the test for this fixed effect parameter suggested that this parameter was significant. What is the correct interpretation of this estimated fixed effect parameter in this marginal model?


1. In the average NHANES cluster, for every one-person increase in household size, the expected log-odds of lifetime smoking decrease by 0.08.

2. In the average NHANES cluster, for every one-person increase in household size, the expected probability of lifetime smoking decreases by 0.08.

3. Across all NHANES clusters, for every one-person increase in household size, the expected log-odds of lifetime smoking decrease by 0.08.

4. Across all NHANES clusters, for every one-person increase in household size, the expected probability of lifetime smoking decreases by 0.08.

**Correct answer: option 3.** We are fitting a marginal model, so our inference is averaged across the entire population of NHANES clusters (not specific to a given NHANES cluster, which would be the case when using a multilevel modeling approach). Also, when fitting a logistic regression model, remember that we are modeling the log-odds of an event occurring, and we are not modeling the probability directly.
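Because the coefficient is on the log-odds scale, exponentiating it gives an odds ratio, which is often easier to read. A short sketch using the -0.08 estimate from the question:

```python
import math

beta_household_size = -0.08  # marginal log-odds coefficient from the question

# Exponentiating a log-odds coefficient gives an odds ratio.
odds_ratio = math.exp(beta_household_size)

# Percent change in the odds for a one-person increase in household size.
pct_change = (odds_ratio - 1.0) * 100.0

print(round(odds_ratio, 3))  # 0.923
print(round(pct_change, 1))  # -7.7
```

So, averaged across all NHANES clusters, each additional household member multiplies the odds of lifetime smoking by about 0.92. This is a statement about odds, not about a fixed change in probability.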

# Quiz week 3

1. You are interested in predicting the probability that an NCAA men's basketball team wins their first round game in the annual NCAA men's basketball tournament, where potential predictors of the binary indicator of winning the first game include a variety of team-level variables measured for each of the 64 teams competing in the first round. There is only one observation per team, and the dependent variable is a binary indicator (1, 0) of whether the team won their first round game.

    * Logistic Regression Model
    
2. You are interested in estimating the relationship between gender (the IV) and a binary indicator of ever having experienced a major depressive disorder (the DV), where both variables were collected from a large national sample that involved area cluster sampling. You also wish to estimate between-cluster variance in the probability of having experienced a major depressive disorder, and explain this variance with the fixed effects of cluster-level covariates.
    
    * Multilevel Logistic regression Model with random cluster effects

3. You want to fit a model that enables the prediction of a continuous measure of birth weight for all of the newborns at a single large hospital. The data arise from a simple random sample of 500 births, and the predictors include information collected from both the mother and the father.
    
    * Linear Regression Model
    
4. After publishing a research paper describing the results from the model fitted for Question #3, you are contacted by 20 other large hospitals, and they wish to contribute to the estimation of a model for predicting birth weight. The team agrees that estimation of the variance in expected birth weight between hospitals and explanation of that variance with hospital-level covariates is a key objective. What type of model would you fit?

    * Multilevel linear regression model with random hospital effects
    
5. You wish to fit a model to a "forced choice" binary dependent variable measuring political party preference (if you had to pick a political party, which would you select: Democratic or Republican?), and examine the relationship of parental political attitudes with the preference of the respondents. Based on the study design, there are multiple respondents measured from each of several neighborhoods, and respondents within the same neighborhood may have shared political views, but you aren't interested in explicitly estimating between-neighborhood variance. You only wish to estimate the overall relationship of interest in the larger population, and account for possible within-neighborhood correlation in the DV.
    
    * Marginal Logistic model, fitted using GEE

