# Multi-level modelling assignment

- The assignment
- Loading packages and reading in data
- Part A
- Part B

## The assignment

#### Multilevel coursework assignment:

Use “coursework assignment data.ws” in MLwiN to answer the following questions. The data come from a random subsample of the Understanding Society wave 1 survey (for more details, see https://www.understandingsociety.ac.uk/documentation). The assignment has two parts (Part A and Part B).

#### The variable names include:

region: Government Office Region

hid: household ID

pid: person ID

age: ranging from 16-96 years

sclfsato: overall satisfaction with life ranging from 1 (completely dissatisfied) to 7 (completely satisfied)

nhood_mistrust: neighbourhood mistrust scale ranging from 1 (strong trust in neighbourhood) to 5 (strong mistrust in neighbourhood)

urban: rural (0) vs urban (1) neighbourhood

female: man (0) vs woman (1)

hhtenure: housing tenure: owner/mortgaged (1); local authority/housing association rent (2); private rent (3)

hiqual3: highest qualification: Degree or equivalent (1); school level qualification (2); No qualification (3)

worry_crime: Worry about being affected by crime: No (0) vs Yes (1)

Cons, denom: constant terms (1) 

#### Part A: Dependent variable: nhood_mistrust (60 marks)

Fit a multilevel model with nhood_mistrust as the dependent variable. 

- Q1. Provide some summary statistics about the levels in the data. How many units are there at each level (overall N of each level), and how many units are there within levels (N within each level). - 3 marks

- Q2. Specify the generalised equations of a three level multilevel model with random slopes at levels 2 and 3. State the assumptions of the model. - 2 marks

- Q3. Start from the single level null model and add in the household and then region levels. 

A. Display the model coefficients in a table. - 3 marks

B. Does the addition of the household level improve the fit of the model? - 2 marks

C. What evidence is there for an improvement in fit? - 5 marks

- Q4: Calculate the VPC at the household and regional levels for the 3 level variance components model. - 3 marks

- Q5. Add the following explanatory variables to the model (random intercepts model only; no random slopes or coefficients): age, sclfsato, urban, female, hhtenure, hiqual3. Take out any non-significant associations. 

A. Display the model coefficients in a table. - 2 marks

B. Interpret the coefficients. - 5 marks

- Q6. Add in random slopes for age, at the household and regional level

A. Does this improve the fit of the model?  - 3 marks

B. Take out from the model any non-significant random slope(s) for age

C. Interpret the random slope(s) for age, with the help of graphs. - 10 marks

- Q7. Examine potential interaction effects between age and the other explanatory variables, and interpret the interactions using graphs. Is the random slope of age explained by these interaction terms? - 10 marks

- Q8. After fitting all the explanatory variables, is a three level model still appropriate? - 2 marks

- Q9. Display and interpret the parameters in your final model in a table. - 5 marks

- Q10. Examine residuals at all three levels- are the assumptions of the regression model met? - 5 marks



#### Part B: Dependent variable: worry_crime (20 marks)

Fit a multilevel logistic model with worry_crime as the dependent variable. 

- Q11. Start from the single level null model and add in the household and then GOR levels. Display the model coefficients for the 3 level variance components model  in a table. - 5 marks

- Q12. Is a three level model appropriate? Is a 2 level model appropriate? Is a single level model appropriate? If a 2 level model is appropriate, which two level model (individuals within HH or individuals within GOR)? State your reasons for choosing between a 2 vs 3 level model. - 5 marks

- Q13. Having decided whether a multilevel or single level model is appropriate, add in the explanatory variables from your final model in Part A. Compare the models using a Table. Are the associations with worry_crime similar to the associations with nhood_mistrust?  - 5 marks

- Q14. Take out any non-significant explanatory variables from the model. Interpret the coefficients in your final model in a table. - 5 marks

#### NB: 

An additional 20 marks will be assigned for the presentation of tables (10) and figures (10).

The results of this analysis should be presented in a written report of not more than 3000 words. There is no minimum number of words but each question should be answered comprehensively (Not just YES or NO answers, but giving the justification and evidence behind the answers). There is no limit on the number of tables and figures for the report.

Submission deadline: 3pm 14 May 2020
Please submit using Turnitin.
Please enter your ID number on your essay. 
Save your submission file as Student ID number_submissiontitle.doc (e.g. 1234567_Essay.doc). Enter your Student ID number in the Turnitin title field at the time of submission.

## Loading packages and reading in data

- Loading packages

In [1]:
library(tidyverse)
library(lme4)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: Matrix

Attaching package: ‘Matrix’

The following object is masked from ‘package:tidyr’:

    expand



In [2]:
mydata <- read.table(file = "coursework.txt", sep = "," , header = TRUE)

## Part A

- Q1. Provide some summary statistics about the levels in the data. How many units are there at each level (overall N of each level), and how many units are there within levels (N within each level).

In [3]:
nrow(mydata)

In [4]:
summary(mydata$region)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   4.000   7.000   6.759   9.000  12.000 

In [5]:
summary(mydata$hid)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1    1105    2213    2211    3318    4429 

There are 12 Government Office Regions (level 3), 4429 households (level 2), and 4655 individuals (level 1). The table below gives, for each region, the number of households, the number of individuals, and the mean number of individuals per household; the output shown covers regions 1 to 10 only.

In [8]:
mydata %>%
    group_by(region) %>% 
    summarise(Number_of_households = n_distinct(hid),
             Number_of_individuals = n(),
             Mean_number_of_individuals_per_household = n()/n_distinct(hid))

| region | Number_of_households | Number_of_individuals | Mean_number_of_individuals_per_household |
|---|---|---|---|
| 1 | 142 | 148 | 1.042254 |
| 2 | 483 | 506 | 1.047619 |
| 3 | 333 | 353 | 1.06006 |
| 4 | 333 | 356 | 1.069069 |
| 5 | 309 | 330 | 1.067961 |
| 6 | 350 | 364 | 1.04 |
| 7 | 531 | 554 | 1.043315 |
| 8 | 512 | 542 | 1.058594 |
| 9 | 360 | 374 | 1.038889 |
| 10 | 333 | 353 | 1.06006 |
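
The `summary()` calls on `region` and `hid` report the range of the ID codes rather than a count of units, so reading the totals off the maximum only works if the codes run consecutively from 1. A minimal sketch of a direct count is given below. It uses a small made-up data frame (`toy`, hypothetical values) in place of `mydata`, with the same column names `region`, `hid` and `pid`.

```r
# Toy stand-in for mydata (hypothetical values, same column names).
toy <- data.frame(
  region = c(1, 1, 1, 2, 2, 2, 2),
  hid    = c(1, 1, 2, 3, 4, 4, 4),
  pid    = 1:7
)

# Overall N at each level: count distinct IDs instead of reading off the maximum.
n_level <- c(
  regions     = length(unique(toy$region)),
  households  = length(unique(toy$hid)),
  individuals = nrow(toy)
)
n_level

# N within levels: individuals per household, then households per region.
hh_size <- table(toy$hid)
table(hh_size)  # how many households have 1, 2, 3, ... members
hh_per_region <- tapply(toy$hid, toy$region, function(x) length(unique(x)))
hh_per_region
```

Replacing `toy` with `mydata` should give the same kind of summary for the real sample, assuming household IDs are unique across regions.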


- Q2. Specify the generalised equations of a three level multilevel model with random slopes at levels 2 and 3. State the assumptions of the model.
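
One common way of writing such a model is sketched below for a single explanatory variable $x$ (age in this assignment), with individuals $i$ nested in households $j$ nested in regions $k$. The symbols ($\beta$, $u$, $v$, $e$, $\Omega$) follow the usual MLwiN-style notation and are an assumption, since the assignment does not fix the notation.

```latex
\begin{aligned}
y_{ijk}      &= \beta_{0jk} + \beta_{1jk}\, x_{ijk} + e_{ijk} \\
\beta_{0jk}  &= \beta_0 + v_{0k} + u_{0jk} \\
\beta_{1jk}  &= \beta_1 + v_{1k} + u_{1jk} \\[6pt]
\begin{pmatrix} v_{0k} \\ v_{1k} \end{pmatrix}
  &\sim N\!\left(0,\ \Omega_v\right), \qquad
  \Omega_v = \begin{pmatrix} \sigma^2_{v0} & \sigma_{v01} \\ \sigma_{v01} & \sigma^2_{v1} \end{pmatrix} \\
\begin{pmatrix} u_{0jk} \\ u_{1jk} \end{pmatrix}
  &\sim N\!\left(0,\ \Omega_u\right), \qquad
  \Omega_u = \begin{pmatrix} \sigma^2_{u0} & \sigma_{u01} \\ \sigma_{u01} & \sigma^2_{u1} \end{pmatrix} \\
e_{ijk} &\sim N\!\left(0,\ \sigma^2_e\right)
\end{aligned}
```

Under this formulation the residuals at each level are assumed to be normally distributed with mean zero and constant variance, to be independent of the residuals at the other levels, and to be uncorrelated with the explanatory variables.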

- Q3. Start from the single level null model and add in the household and then region levels. 

A. Display the model coefficients in a table.

B. Does the addition of the household level improve the fit of the model?

C. What evidence is there for an improvement in fit?
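
The sequence of null models and the likelihood ratio tests between them could be sketched in `lme4` as follows. This is an illustration under stated assumptions, not the required MLwiN workflow: the data frame `sim` is simulated (hypothetical) with two individuals per household, unlike the real sample, and the helper `lrt()` is introduced here for illustration only.

```r
library(lme4)

# Simulated stand-in for mydata (hypothetical): 12 regions, 25 households
# per region, 2 individuals per household, household IDs unique overall.
set.seed(1)
sim <- expand.grid(person = 1:2, hh = 1:25, region = 1:12)
sim$hid <- (sim$region - 1) * 25 + sim$hh
v <- rnorm(12, sd = 0.2)   # region effects
u <- rnorm(300, sd = 0.5)  # household effects
sim$nhood_mistrust <- 2.5 + v[sim$region] + u[sim$hid] +
  rnorm(nrow(sim), sd = 0.7)

# Fit all models by maximum likelihood so the deviances are comparable.
m1 <- lm(nhood_mistrust ~ 1, data = sim)                   # single level
m2 <- lmer(nhood_mistrust ~ 1 + (1 | hid),
           data = sim, REML = FALSE)                       # + households
m3 <- lmer(nhood_mistrust ~ 1 + (1 | region) + (1 | hid),
           data = sim, REML = FALSE)                       # + regions

# Likelihood ratio test: change in -2 log-likelihood, 1 df per added variance.
lrt <- function(small, big, df = 1) {
  stat <- 2 * (as.numeric(logLik(big)) - as.numeric(logLik(small)))
  c(statistic = stat,
    p_value = pchisq(stat, df = df, lower.tail = FALSE))
}
lrt_household <- lrt(m1, m2)
lrt_region    <- lrt(m2, m3)
lrt_household
lrt_region
```

Because a variance cannot be negative, the null hypothesis lies on the boundary of the parameter space, so the chi-squared p-value with 1 degree of freedom is conservative; halving it is a commonly used adjustment.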

- Q4: Calculate the VPC at the household and regional levels for the 3 level variance components model.
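
The VPC at a given level is that level's variance divided by the total variance. A minimal sketch of the arithmetic is shown below; the function name `vpc()` and the three input values are hypothetical placeholders, not estimates from the coursework data.

```r
# Variance partition coefficients for a three-level variance components model.
# The inputs used in the example call are hypothetical placeholders.
vpc <- function(var_region, var_household, var_individual) {
  total <- var_region + var_household + var_individual
  c(region    = var_region / total,
    household = var_household / total)
}

example_vpc <- vpc(var_region = 0.02, var_household = 0.18, var_individual = 0.60)
example_vpc
```

With a model fitted in `lme4`, the variance components can be read from `as.data.frame(VarCorr(model))$vcov`. Note that some texts report the household-level figure as the share of variance lying at or above the household level (region plus household), so it is worth stating which definition is used.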

- Q5. Add the following explanatory variables to the model (random intercepts model only; no random slopes or coefficients): age, sclfsato, urban, female, hhtenure, hiqual3. Take out any non-significant associations. 

A. Display the model coefficients in a table.

B. Interpret the coefficients.

- Q6. Add in random slopes for age, at the household and regional level

A. Does this improve the fit of the model?

B. Take out from the model any non-significant random slope(s) for age.

C. Interpret the random slope(s) for age, with the help of graphs.
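
A random slope for age at the regional level might be fitted and plotted in `lme4` along the following lines. This is a sketch on simulated data (`sim`, hypothetical), with age centred at 50 as an arbitrary illustrative choice. The household-level slope is left out of the sketch on purpose: with about 1.05 individuals per household in this sample, such a slope is likely to be unidentifiable, and `lmer()` may refuse to fit it.

```r
library(lme4)

# Simulated stand-in for mydata (hypothetical values).
set.seed(2)
sim <- expand.grid(person = 1:2, hh = 1:25, region = 1:12)
sim$hid   <- (sim$region - 1) * 25 + sim$hh
sim$age_c <- round(runif(nrow(sim), 16, 96)) - 50  # age centred at 50
v0 <- rnorm(12, sd = 0.2)    # region intercept residuals
v1 <- rnorm(12, sd = 0.01)   # region age-slope residuals
u0 <- rnorm(300, sd = 0.4)   # household intercept residuals
sim$nhood_mistrust <- 2.5 + v0[sim$region] + u0[sim$hid] +
  (-0.01 + v1[sim$region]) * sim$age_c + rnorm(nrow(sim), sd = 0.7)

# Random intercepts only, then a random age slope at the regional level.
ri <- lmer(nhood_mistrust ~ age_c + (1 | region) + (1 | hid),
           data = sim, REML = FALSE)
rs <- lmer(nhood_mistrust ~ age_c + (1 + age_c | region) + (1 | hid),
           data = sim, REML = FALSE)

# The slope adds two parameters: slope variance and intercept-slope covariance.
anova(ri, rs)

# Region-specific lines: fixed effects plus each region's residuals.
region_coef <- coef(rs)$region
plot(NULL, xlim = range(sim$age_c), ylim = c(1, 4),
     xlab = "age - 50", ylab = "predicted nhood_mistrust")
for (k in seq_len(nrow(region_coef))) {
  abline(a = region_coef[k, "(Intercept)"], b = region_coef[k, "age_c"])
}
```

Lines that fan out or cross indicate that the association between age and mistrust differs across regions; roughly parallel lines suggest the slope variance is negligible.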

- Q7. Examine potential interaction effects between age and the other explanatory variables, and interpret the interactions using graphs. Is the random slope of age explained by these interaction terms?

- Q8. After fitting all the explanatory variables, is a three level model still appropriate?

- Q9. Display and interpret the parameters in your final model in a table.

- Q10. Examine residuals at all three levels- are the assumptions of the regression model met?

## Part B: Dependent variable: worry_crime

Fit a multilevel logistic model with worry_crime as the dependent variable.

- Q11. Start from the single level null model and add in the household and then GOR levels. Display the model coefficients for the 3 level variance components model  in a table.

- Q12. Is a three level model appropriate? Is a 2 level model appropriate? Is a single level model appropriate? If a 2 level model is appropriate, which two level model (individuals within HH or individuals within GOR)? State your reasons for choosing between a 2 vs 3 level model.
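
For a logistic model there is no estimated level 1 variance, so one widely used approximation (the latent variable method) fixes it at $\pi^2/3 \approx 3.29$ when computing the VPC. The sketch below shows that arithmetic only; the function name `vpc_logit()` and the inputs are hypothetical placeholders, and other VPC definitions for binary outcomes exist.

```r
# Latent-variable VPC for a three-level random-intercept logistic model.
# The level 1 variance is fixed at pi^2 / 3 on the logit scale.
# The inputs used in the example call are hypothetical placeholders.
vpc_logit <- function(var_region, var_household) {
  var_level1 <- pi^2 / 3
  total <- var_region + var_household + var_level1
  c(region    = var_region / total,
    household = var_household / total)
}

example_vpc_logit <- vpc_logit(var_region = 0.05, var_household = 0.80)
example_vpc_logit
```

A VPC close to zero at a level is one piece of evidence that the level could be dropped; it would normally be read alongside a test of the corresponding variance.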

- Q13. Having decided whether a multilevel or single level model is appropriate, add in the explanatory variables from your final model in Part A. Compare the models using a Table. Are the associations with worry_crime similar to the associations with nhood_mistrust?

- Q14. Take out any non-significant explanatory variables from the model. Interpret the coefficients in your final model in a table.
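
Coefficients from a logistic model are on the log-odds scale, so they are usually exponentiated into odds ratios for interpretation. The sketch below does this with a Wald 95% confidence interval; the function name `odds_ratio()` and the example estimate and standard error are hypothetical, not results from the coursework data.

```r
# Convert a logit coefficient and its standard error into an odds ratio
# with a Wald confidence interval. Example inputs are hypothetical.
odds_ratio <- function(estimate, std_error, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(odds_ratio = exp(estimate),
    lower      = exp(estimate - z * std_error),
    upper      = exp(estimate + z * std_error))
}

example_or <- odds_ratio(estimate = 0.40, std_error = 0.10)
example_or
```

For a model fitted with `glmer()`, the estimates and standard errors can be taken from the `Estimate` and `Std. Error` columns of `coef(summary(model))`. An interval that excludes 1 corresponds to a coefficient that differs from zero at the chosen level.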