<p align="right"><i>Foundations of Quantitative Research Methods - 2021/22 academic session</i></p>

# Statistical Summaries

Welcome to the second part of the module *Foundations of Quantitative Research Methods*. We are going to focus on the core ways we can analyse and interpret quantitative data. We are going to use a real, large-scale social survey called the **British Social Attitudes Survey, 2019**. In particular, we're going to focus on a set of survey questions relating to poverty and welfare.

### Aims

This lesson - **Statistical Summaries** - has two aims:
1. Demonstrate how to analyse categorical and numeric variables one at a time.
2. Cultivate your computational skills through the use of the statistical programming language *R*. For example, in this notebook there are a number of opportunities for you to amend or write R syntax (code).

### Lesson details

* **Level**: Introductory, for individuals with no prior knowledge or experience of quantitative data analysis.
* **Duration**: 45-60 minutes.
* **Pre-requisites**: None.
* **Programming language**: R.
* **Learning outcomes**:
	1. Understand how to use R for conducting data analysis.
	2. Understand how to select and apply common data analysis techniques for categorical and numeric variables.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Analysing data - a refresher*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut `Shift + Enter`).

Try it for yourself:

In [None]:
name <- readline(prompt="Enter name: ")
print(paste0("Hi ", name, ", enjoy learning more about R and exploring data!")) # paste0() avoids a stray space before the comma

Notebooks are sequential, meaning code should be executed in order (top to bottom). For example, the following code won't work:

In [None]:
x * 5

As the error message suggests, there is no object (variable) called `x`, therefore we cannot do any calculations with it. 

Let's try a sequential approach:

In [None]:
x <- 10 # create an object called 'x' and give it the value '10'

In [None]:
x * 5 # multiply 'x' by 5

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

### Learner input

Throughout the lessons there are times when you need to do the following activities:
* **TASK:** A coding task for you to complete (e.g. analyse different variables).
* **QUESTION:** A question regarding your interpretation of some code or a technique (e.g. what is the piece of code doing?).
* **EXERCISE:** A data analysis challenge for you to complete.

## Analysing data - a refresher

Once we have collected and explored our data, we can turn to the interesting part: analysis. Our ultimate aim is to answer the research question accurately: to do so we need to produce summaries of our key variables. In our case we need to analyse variables relating to attitudes to poverty and welfare in Britain.

How do we know which summaries to choose? We need to look at the **level of measurement** of each variable: are we dealing with categorical (e.g., marital status) or numeric (e.g., income) variables? We covered level of measurement earlier in the module but here is a quick refresher: 

![Differentiation of numeric and categorical variables](./images/lvl_msr_diagr.png)

Source: [https://maczokni.github.io/MSCD_labs/week2.html#univariate-analysis](https://maczokni.github.io/MSCD_labs/week2.html#univariate-analysis)

**Numeric variables** measure the amount or magnitude of some characteristic | attribute | outcome. For example, how much income a person receives from their part-time job; how many people are classed as homeless in Scotland in a given year.

**Categorical variables** measure the presence of some characteristic | attribute | outcome. For example, a person's country of birth; marital status. These are examples of *nominal* categorical variables. However there is another type of categorical variable that also captures the rank or ordering of the categories. For example, a student's degree classification; social class; agreement with a statement ("strongly agree", "agree", "disagree"). These are examples of *ordinal* categorical variables.
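
As a rough sketch of how these levels of measurement might be represented in *R*, the example below builds one numeric, one nominal and one ordinal variable. The object names and values are hypothetical and are not taken from the survey data.

```r
# Hypothetical toy data for four people (not from the survey)

# Numeric: weekly income from a part-time job
income <- c(250, 400, 0, 1200)

# Nominal categorical: country of birth (categories have no natural order)
country <- factor(c("Scotland", "Wales", "England", "Scotland"))

# Ordinal categorical: degree classification (categories are ranked)
degree <- factor(c("2:1", "First", "2:2", "2:1"),
                 levels = c("Third", "2:2", "2:1", "First"),
                 ordered = TRUE)

is.numeric(income)     # TRUE
is.ordered(country)    # FALSE: a plain factor stores categories without a ranking
is.ordered(degree)     # TRUE
degree[1] < degree[2]  # TRUE: "2:1" ranks below "First"
```

Storing an ordinal variable as an ordered factor means *R* knows the ranking of the categories, so comparisons such as "lower than" become meaningful.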

The analytical techniques we use depend on whether we are dealing with a numeric or categorical variable. However the aims of our analysis are similar:
1. We want to summarise the **central tendency** of the values of a variable
2. We want to summarise the **distribution** of the values of a variable

### Terminology

Let's refresh our memory of key terms when conducting quantitative data analysis:
* **Case** - the entity we are analysing e.g., people, countries, companies, animals, events.
* **Variable** - a characteristic that can vary in value among cases in a sample or population. Usually represented as a column in a dataset.
* **Observation** - a set of measurements of the variables for a case. Usually represented as a row in a dataset.
* **Respondent** - the case who responds to the survey - for obvious reasons, almost always a person.

These terms will become clearer as we conduct our analyses.

### Central tendency

Central tendency conveys what the *typical observation* looks like for a variable (Fogarty, 2019). There are a number of different measures of central tendency but the three most recognisable are:
* **Mode** - the most common value of a variable.
* **Median** - the middle value in a variable's distribution, where an equal number of observations lie above and below this value.
* **Mean** - the average value of a variable. 

We'll clarify what is meant by each of these measures when we encounter some examples in the *British Social Attitudes* data.
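
In the meantime, here is a minimal sketch of the three measures using a small, made-up set of values (the `siblings` object is hypothetical and not part of the survey data). Base *R* has no built-in function for the statistical mode, so the sketch counts the values with `table()` and picks the most frequent one.

```r
# Hypothetical toy data: number of siblings reported by nine respondents
siblings <- c(0, 1, 1, 2, 2, 2, 3, 4, 12)

mean_siblings <- mean(siblings)      # add up the values, divide by how many there are
median_siblings <- median(siblings)  # middle value once the data are sorted

# Mode: count how often each value occurs, then take the most frequent value
counts <- table(siblings)
mode_siblings <- as.numeric(names(counts)[which.max(counts)])

mean_siblings    # 3
median_siblings  # 2
mode_siblings    # 2
```

Notice that the single large value (12) pulls the mean above the median and the mode.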

### Distribution

The pattern of variation in the values of a variable is called a *distribution*. Observing the full distribution of values is important but there are also a number of summary statistics that describe interesting features of a distribution. These are called **measures of dispersion** and some of the most commonly encountered are:
* **Minimum** - the lowest value observed
* **Maximum** - the highest value observed
* **Range** - the difference between the minimum and maximum
* **Standard deviation** - roughly, the typical distance of a value from the mean

Many of these really only apply to numeric variables.
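
The following sketch shows how these measures could be calculated in *R* for a made-up numeric variable (the `ages` object is hypothetical). Note that `range()` returns the minimum and maximum together, so `diff()` is used to get the range as a single number.

```r
# Hypothetical toy data: ages of five respondents
ages <- c(20, 25, 30, 35, 40)

min(ages)                      # lowest value: 20
max(ages)                      # highest value: 40
range(ages)                    # minimum and maximum together: 20 40
age_range <- diff(range(ages)) # maximum minus minimum: 20
age_sd <- sd(ages)             # sample standard deviation (divides by n - 1)

age_range
age_sd
```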

## Doing data analysis

In this lesson we focus on analysing *one* variable at a time. We may be interested in many variables in our study but we produce summaries of them separately, rather than trying to analyse them jointly (like in a correlation analysis).

One variable, sounds simple right? Well yes, as long as:
* You have variables that measure the concept you are interested in.
* You recognise the data type (level of measurement) of each variable.

### Importing data

The first step is to import the *British Social Attitudes* data for analysis.

In [None]:
bsa2019 <- read.table("./data/bsa2019_poverty_open.tab", header = TRUE, na.strings = "NA", sep = "\t") # tab-separated file with a header row
head(bsa2019) # view the first six observations

Hmm, not easy to interpret what these variables mean. Let's get a list of variable names and see which ones relate to poverty and welfare.

In [None]:
names(bsa2019)

After looking up the [codebook](./codebook/8850_bsa_open_poverty_2019_codebook.pdf), there are two variables definitely of interest:
* `NatFrEst` is a measure of how many welfare claimants out of 100 a respondent thinks provide fraudulent information: *Out of every 100 people receiving benefits in Britain, how many have broken the law by giving false information to support their claim?* This is a numeric variable, specifically a count variable.
* `incdiffs` is a measure of how strongly respondents agree or disagree about the level of income inequality in Britain: *Differences in income in GB are too large?* The responses are a set of categories ranging from "Strongly agree" to "Strongly disagree", therefore this is a categorical variable, specifically an ordinal variable.

Let's focus on summarising the values of each of these variables.

### Labelling values

You may have noticed that it is difficult / impossible to know what the values of a variable mean without the [codebook](./codebook/8850_bsa_open_poverty_2019_codebook.pdf). This isn't much of a problem with numeric variables, but is when dealing with categorical variables.

We can make things easier for ourselves by attaching labels to specific values. 

Consider the income inequality variable (`incdiffs`):

In [None]:
table(bsa2019$incdiffs)

It is difficult to infer what categories the values represent. For example, does the value "1" represent "Strongly agree"? What could "-1" represent? Instead of repeatedly looking up the codebook, let's use *R* to attach labels to the values:

In [None]:
bsa2019$incdiffs <- factor(bsa2019$incdiffs,
                           levels = c(1, 2, 3, 4, 5),
                           labels = c("Strongly agree", "Agree", "Neither agree nor disagree",
                                      "Disagree", "Strongly disagree"))

Now we can see the labels when we view the values of the variable:

In [None]:
table(bsa2019$incdiffs)

For the purposes of analysis we have ignored values of "-1", "8" and "9" as these represent responses we are not interested in (e.g., people who skipped this question, responded with "don't know").
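
To see why those values disappear, here is a small sketch using made-up responses (the `responses` object is hypothetical). Any value that is not listed in the `levels` argument of `factor()` is converted to a missing value (`NA`), and `table()` leaves missing values out unless asked to include them.

```r
# Hypothetical toy responses using the same coding scheme as `incdiffs`
responses <- c(1, 2, 2, -1, 5, 8, 9, 3)

labelled <- factor(responses,
                   levels = c(1, 2, 3, 4, 5),
                   labels = c("Strongly agree", "Agree", "Neither agree nor disagree",
                              "Disagree", "Strongly disagree"))

labelled                          # the codes -1, 8 and 9 now appear as <NA>
n_missing <- sum(is.na(labelled)) # number of responses set to missing: 3
n_missing

table(labelled)                   # missing values are left out by default
table(labelled, useNA = "ifany")  # include missing values as their own column
```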

### Summarising categorical variables

One of the core ways of analysing a categorical variable is to examine the frequency with which each category occurs. That is, we look at the variable's *frequency distribution*. Recall that a *distribution* is a summary of the variation in a variable: that is, observations have different values for a variable and these values form a pattern.

In [None]:
table(bsa2019$incdiffs)

The `table()` command is pretty basic but does give us some useful information. For instance we learn that over 1,300 respondents at the very least agree that income inequality is too high in Britain.

**QUESTION:** What is the mode for this variable?

It would be useful to know the percentage of observations in each category. *R* doesn't make this as easy as we would like but it can be achieved like so:

In [None]:
round(prop.table(table(bsa2019$incdiffs)) * 100, 0)

Now we can see that a strong majority (81%) of respondents think income inequality is too high in Britain.
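
A combined figure like this comes from adding together the percentages of adjacent categories. The sketch below illustrates the idea with made-up responses (the `answers` object and its three categories are hypothetical); `cumsum()` is one possible way to add up the categories in order.

```r
# Hypothetical toy data: ten responses to an agree / disagree question
answers <- factor(c("Agree", "Agree", "Agree", "Agree", "Agree", "Agree",
                    "Neutral", "Neutral", "Disagree", "Disagree"),
                  levels = c("Agree", "Neutral", "Disagree"))

counts <- table(answers)                               # frequency of each category
percentages <- round(prop.table(counts) * 100, 0)      # share of each category, as a percentage
percentages                                            # 60, 20, 20

# Cumulative percentages add up the categories from left to right
cumulative <- cumsum(percentages)
cumulative                                             # 60, 80, 100
```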

Finally, summarising the distribution of a categorical variable is often best done through a graph or visualisation. A common graph type for categorical variables is a **bar chart**.

In [None]:
barplot(table(bsa2019$incdiffs))

We can clearly see the pattern in the distribution using a bar chart: respondents are more likely to agree or strongly agree.

(We won't focus on improving the visualisation in this module - you'll have to take *Data Analysis for the Social Sciences* in Year 9 if you want to learn more...)

**TASK:** Use frequency tables and bar charts to summarise the `RSex` variable. Look up the codebook and label the variable's values. (The answer to this task is at the end of the notebook)

In [None]:
# INSERT CODE HERE

### Summarising numeric variables

One of the core ways of analysing a numeric variable is to calculate measures of central tendency, in particular the **mean** and **median**:

In [None]:
mean(bsa2019$NatFrEst)

In [None]:
median(bsa2019$NatFrEst)

Hmmm, the mean doesn't look right... How can it take this value when the question specifically asked respondents to pick a number between 0 and 100?

The codebook has the answer: respondents who said "don't know" or didn't answer are given the values "998" and "999". Therefore these values are skewing the mean towards them. Let's tell *R* to ignore these values when analysing this variable:

In [None]:
bsa2019$NatFrEst[bsa2019$NatFrEst > 100] <- NA # set "998" and "999" as missing values

In [None]:
mean(bsa2019$NatFrEst, na.rm = TRUE) # need to tell R to ignore missing values

In [None]:
median(bsa2019$NatFrEst, na.rm = TRUE)

That's better. 

Here we see the mean and median values of the welfare fraud estimate variable. The mean represents the average value of this variable and is calculated by adding up all the values and dividing by the number of observations for this variable.

The median represents the middle value: 50% of respondents estimated a number lower than this value, 50% estimated a number higher than this value. 
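
To make these definitions concrete, the sketch below calculates both measures "by hand" for a made-up set of estimates and compares the results to the built-in functions. The `estimates` object is hypothetical, and the manual median shown here only works for an odd number of values.

```r
# Hypothetical toy data: fraud estimates (out of 100) from seven respondents
estimates <- c(5, 10, 10, 20, 30, 50, 85)

# Mean: add up all the values and divide by the number of observations
manual_mean <- sum(estimates) / length(estimates)

# Median: sort the values and take the one in the middle
sorted <- sort(estimates)
manual_median <- sorted[(length(sorted) + 1) / 2] # position 4 of 7

manual_mean == mean(estimates)     # TRUE
manual_median == median(estimates) # TRUE
```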

These measures are useful for analysing typical or representative values of a variable. However it is also important we consider the distribution of values, so we can assess just *how* useful the mean and median are.

First we can view a summary of the **range** of values of this variable:

In [None]:
summary(bsa2019$NatFrEst)
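
The output of `summary()` lists the minimum, first quartile, median, mean, third quartile and maximum, plus a count of missing values if there are any. As a sketch of how to read it, the example below uses a made-up vector (`toy_estimates` is hypothetical) and picks out individual elements by name.

```r
# Hypothetical toy data, including one missing value
toy_estimates <- c(0, 5, 10, 10, 20, 50, NA)

s <- summary(toy_estimates)
s                 # Min., 1st Qu., Median, Mean, 3rd Qu., Max. and NA's

s["Min."]         # lowest value: 0
s["Max."]         # highest value: 50
s["NA's"]         # number of missing values: 1

# The range as a single number: maximum minus minimum
toy_range <- as.numeric(s["Max."]) - as.numeric(s["Min."])
toy_range         # 50
```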

A common graph type for numeric variables is a **histogram**.

In [None]:
hist(bsa2019$NatFrEst, breaks = seq(0, 100, by=10)) # each bar represents a range of values (10)

We can clearly see the pattern in the distribution using a histogram: many respondents think there are 10 or fewer benefit claimants out of 100 who are engaging in fraud. However there are a significant number of respondents who think welfare fraud is very common (notice the "bump" in the middle of the distribution).

This type of distribution, where the values are bunched to the left, can be described as being **positively skewed**. That is, there are a small number of "extreme" or high values.
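
One common sign of positive skew is that the mean is larger than the median, because the few high values pull the mean upwards. The sketch below shows this with made-up numbers (the `skewed` object is hypothetical and is not the survey data).

```r
# Hypothetical toy data: most values are low, a few are very high
skewed <- c(1, 2, 2, 3, 3, 3, 4, 5, 40, 60)

skewed_mean <- mean(skewed)     # pulled upwards by the two extreme values
skewed_median <- median(skewed) # stays close to where most values sit

skewed_mean
skewed_median
skewed_mean > skewed_median     # TRUE for this positively skewed example
```

In situations like this the median is often a better guide to the typical value than the mean.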

When we have a small range of values for a numeric variable, then a bar chart may also be appropriate:

In [None]:
barplot(table(bsa2019$NatFrEst))

Now we can see the most common (mode) response was "50", i.e., these respondents think half of benefit claimants are engaging in fraud to attain welfare.

**TASK:** Use summary, mean, median and hist commands to summarise the `welfare2` variable. Look up the codebook to see if any values need to be set as missing. (The answer to this task is at the end of the notebook)

In [None]:
# INSERT CODE HERE

## Conclusion

In this lesson we encountered a range of techniques for summarising categorical and numeric variables one at a time.

In the next lesson we focus on summarising the joint distribution of two or more variables, a technique known as *bivariate*, *multivariate* or *correlation* analysis.

## Solutions to tasks

### Use frequency tables and bar charts to summarise the `RSex` variable. Look up the codebook and label the variable's values.

In [None]:
table(bsa2019$RSex)

In [None]:
bsa2019$RSex <- factor(bsa2019$RSex, levels = c(1,2), labels = c("Male", "Female"))

In [None]:
table(bsa2019$RSex)

In [None]:
round(prop.table(table(bsa2019$RSex)) * 100, 0)

In [None]:
barplot(table(bsa2019$RSex))

#### Interpretation

The `RSex` variable is a binary measure of a respondent's sex. There are two categories: Male and Female.

We clearly see that more females than males completed the survey.

### Use summary, mean, median and hist commands to summarise the `welfare2` variable. Look up the codebook to see if any values need to be set as missing.

In [None]:
bsa2019$welfare2[bsa2019$welfare2 > 5 | bsa2019$welfare2 < 1] <- NA

In [None]:
summary(bsa2019$welfare2)

In [None]:
mean(bsa2019$welfare2, na.rm = TRUE)

In [None]:
median(bsa2019$welfare2, na.rm = TRUE)

In [None]:
hist(bsa2019$welfare2, breaks = seq(1, 5, by=1))

#### Interpretation

The `welfare2` variable summarises a respondent's attitude to welfare. A value of "1" indicates somebody who feels positively about welfare in Britain, a value of "5" represents somebody with negative attitudes to welfare. See pages 21-22 of the original study's User Guide for information on how this variable was created: http://doc.ukdataservice.ac.uk/doc/8450/mrdoc/pdf/8450_bsa_2017_user_guide_final.pdf

We can see most people are in the middle, though with a noticeable lean towards more positive feelings to welfare.