<p align="right"><i>Data Analysis for the Social Sciences - Part II - 2022-11-21</i></p>

# Week 11 - Inferential Analysis

Welcome to Part II of Data Analysis for the Social Sciences. In this lab session we will learn how to make inferences from our descriptive analyses.

We will use real data from the *National Survey of Sexual Attitudes and Lifestyles, 2010-2012 (Natsal-3)*, specifically the open dataset available from the UK Data Service: https://doi.org/10.5255/UKDA-SN-8786-1

**Please note for Assessment 2 you are required to use the larger, richer version of *Natsal-3*, which is available on Aula.**

### Aims

This lesson - **Inferential Analysis** - has two aims:
1. Demonstrate how to calculate and communicate measures of uncertainty relating to your quantitative findings.
2. Cultivate your computational skills through the use of the statistical programming language *R*. For example, there are a number of opportunities for you to amend or write R syntax (code).

### Lesson details

* **Level**: Introductory, for individuals with minimal prior knowledge or experience of quantitative data analysis.
* **Duration**: 45-60 minutes.
* **Pre-requisites**: Completed [*Univariate Data Analysis*](./dass-week-9-univariate-analysis-2022-11-07.ipynb), [*Bivariate Data Analysis*](./dass-week-10-bivariate-analysis-2022-11-14.ipynb) and [*Multivariate Data Analysis*](./dass-week-11-multivariate-analysis-2022-11-14.ipynb) lessons.
* **Programming language**: R.
* **Learning outcomes**:
	1. Understand how to use R for conducting inferential analyses.
	2. Understand how to select and apply common data analysis techniques for categorical and numeric variables.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Introduction to Inferential Analysis*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `[]`.**

To execute a cell, click or double-click the cell and press the `Play` button next to the cell or select the `Run` button on the top toolbar (*Runtime > Run the focused cell*); you can also use the keyboard shortcuts `Shift + Enter` or `Ctrl + Enter`.

Try it for yourself:

In [None]:
name <- readline(prompt="Enter name: ")
print(paste("Hi,", name, "enjoy learning more about R and inferential analysis!"))

Notebooks are sequential, meaning code should be executed in order (top to bottom). For example, the following code won't work:

In [None]:
x * 5

As the error message suggests, there is no object (variable) called `x`, therefore we cannot do any calculations with it. 

Let's try a sequential approach:

In [None]:
x <- 10 # create an object called 'x' and give it the value '10'

In [None]:
x * 5 # multiply 'x' by 5

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

### Learner input

Throughout the lessons there are times when you need to do the following activities:
* **TASK:** A coding task for you to complete (e.g. analyse different variables).
* **QUESTION:** A question regarding your interpretation of some code or a technique (e.g. what is the piece of code doing?).
* **EXERCISE:** A data analysis challenge for you to complete.

## Introduction to Inferential Analysis

In the [**previous lessons**](https://github.com/DiarmuidM/data-analysis-for-the-social-sciences-2022/blob/main/lessons), we learned how to conduct a range of univariate, bivariate and multivariate analyses. The results of these analyses often appear very precise:
* The exact proportion of respondents who claim that religion is 'Very important' to them (14%).
* The average conservative attitude score for different age groups (16-24 year olds = -0.14817995).

However the ability to produce such exact summaries ignores the great deal of **uncertainty** associated with any piece of quantitative data analysis. For example, how confident are we that 14% of all British adults would claim that religion is very important to them? If it isn't 14%, then how much larger or smaller is the 'true' percentage? 

Remember, our estimate of 14% comes from the **sample** of people who happened to participate in the *Natsal* wave 3 survey (2010-12). Is this group perfectly representative of the wider **population** from which it was drawn (16-74 year olds in Britain)?

The short answer is this:

<p><center><i>Samples will always differ from the population they were drawn from due to random chance</i></center></p>

Thus in this lesson we will focus on expressing the inherent uncertainty in our quantitative analyses. There are a variety of measures of uncertainty but all are united by trying to answer the following question:
* Can we generalise our result to the wider population from which we drew our sample?

That is, can we make inferences about units of analysis that were **not included** in our sample i.e., those in the wider population?

## Making Inferences

### Importing data

The first step is to import the *Natsal-3* data for analysis.

In [None]:
natsal <- read.table("https://raw.githubusercontent.com/DiarmuidM/data-analysis-for-the-social-sciences-2022/main/lessons/data/natsal_3_teaching_open.tab", header=TRUE, na="NA", sep="\t")
head(natsal) # view the first six observations

### Data cleaning

There are a number of important steps that need to be executed before proceeding with the analysis:
* Handling missing values
* Labelling values of categorical variables

We cover these techniques in a separate notebook: [Data Cleaning](./dass-natsal-data-cleaning-2022-09-28.ipynb) 

**Please note that you will be expected to perform these tasks for your own analysis.**

In [None]:
natsal$religimp <- factor(natsal$religimp, levels = c(1,2,3,4), labels = c("Very important", "Fairly important", 
                                                                            "Not very important", "Not important at all"))

In [None]:
natsal$agrp <- factor(natsal$agrp, levels = c(1,2,3,4,5,6), labels = c("16-24", "25-34", "35-44", "45-54", "55-64", "65-74"))

In [None]:
natsal$rsex <- factor(natsal$rsex, levels = c(1,2), labels = c("Male", "Female"))

In [None]:
natsal$ethnicgrp <- factor(natsal$ethnicgrp, levels = c(1,2), labels = c("White", "Non-white"))

In [None]:
natsal$dage1ch[natsal$dage1ch==-1 | natsal$dage1ch==99] <- NA # convert "-1" and "99" to missing
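The recoding step above can also be written with the `%in%` operator, which is easier to extend when a variable has several missing-value codes. The following is a minimal sketch using a small made-up vector (`ages` is hypothetical and not taken from *Natsal*):

```r
# Hypothetical ages at birth of first child; -1 and 99 stand for missing-value codes
ages <- c(23, -1, 31, 99, 27, NA)

# Replace every value that matches one of the codes with NA
ages[ages %in% c(-1, 99)] <- NA

ages             # view the recoded values
sum(is.na(ages)) # count how many values are now missing
```

It is good practice to check the number of missing values after recoding, so you know how many observations remain for the analysis.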

### Univariate analysis

Univariate analysis produces statistical summaries of numeric and categorical variables e.g., average attitude to sex; proportion of respondents in the 16-24 age group etc. However these single numbers give an inflated sense of accuracy and precision. Thus we need some way of expressing the **range** of plausible values for a given statistical summary. You may have heard of this range referred to as **confidence intervals**, **margins of error**, **polling error** etc.

#### Numeric variables

Let's look at an example for our measure of attitude to sex (`attconservative`).

In [None]:
summary(natsal$attconservative)

We see that the average attitude is slightly liberal (less than zero). However this finding was generated by using data on our current sample of 16-74 year olds in Britain. What if we had a different sample of respondents? Would we also expect the average score for this variable to be '-0.01728'? And if we had a different sample again?

Thanks to sampling theory, we can produce a range or interval of plausible values for average (mean) attitude (or any mean of a numeric variable).

In [None]:
t.test(natsal$attconservative)$conf.int # calculate 95% confidence intervals for the mean of `attconservative` (missing values are dropped automatically)

The one-line summary is this: the mean attitude score is very probably between '-0.05066760' and '0.01610798' in the population of British adults from which the sample was drawn. The figure of '-0.01727981' remains our best estimate but we now acknowledge that the score in the population could actually be between '-0.05066760' and '0.01610798'.
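To see where such an interval comes from, we can calculate one by hand and compare it with the result of `t.test()`. This is a sketch only: the `scores` vector is simulated (it is not the *Natsal* data) and the seed is arbitrary.

```r
set.seed(42)                            # arbitrary seed so the example is reproducible
scores <- rnorm(500, mean = 0, sd = 1)  # simulated scores, NOT the Natsal data
scores[sample(500, 20)] <- NA           # add some missing values

x <- scores[!is.na(scores)]             # keep the non-missing values
n <- length(x)                          # number of valid observations
se <- sd(x) / sqrt(n)                   # standard error of the mean
t_crit <- qt(0.975, df = n - 1)         # critical value for a 95% interval

manual_ci <- c(mean(x) - t_crit * se, mean(x) + t_crit * se)

manual_ci                # interval calculated by hand
t.test(scores)$conf.int  # interval calculated by t.test()
```

The two intervals should agree. Notice that the standard error shrinks as the sample size grows, which is why large surveys tend to produce narrow intervals.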

This can be tricky to get your head around, so let's look at another example: age at birth of first child (`dage1ch`).

In [None]:
mean(natsal$dage1ch, na.rm = TRUE)

In [None]:
t.test(natsal$dage1ch)$conf.int # calculate 95% confidence intervals for the mean of `dage1ch` (missing values are dropped automatically)

The mean age at which our respondents have their first child is '24.8' years old, but we acknowledge that the mean age in the population could actually be between '24.6' and '25.1'. Therefore the range of the confidence interval is very tight around the mean, providing confidence that '24.8' is a good estimate of the average age at which adults in Britain have their first child.

Imagine our mean was still '24.8' but the confidence interval ranged from '19.5' to '31.2'. Would the mean still look like a good estimate of the average age at which adults in Britain have their first child?

Calculating a confidence interval does not invalidate your estimate of the mean of a numeric variable; it simply provides some caution when making claims about a **population** based on a **sample**.

#### Categorical variables

We can also calculate a range of uncertainty for categories of a categorical variable.

In [None]:
round(prop.table(table(natsal$religimp)) * 100, 0)

We observe that 15% of respondents claim that religion is very important to them, 25% that it is fairly important etc. Are these good estimates of how the population of adults in Britain feel about religion?

In [None]:
install.packages("DescTools") # install the necessary package - only needs to be done once

In [None]:
library(DescTools) # import the package containing the `MultinomCI` command

In [None]:
MultinomCI(table(natsal$religimp)) # 95% confidence interval is the default

We are 95% confident that the true proportion of adults in Britain who claim religion is very important is between 13% and 16% (lwr.ci = lower end of the interval, upr.ci = upper end of the interval).
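For intuition, a simple confidence interval for a proportion can be calculated by hand using the normal approximation. This is a sketch with made-up counts (the `counts` vector is hypothetical, not the *Natsal* figures). Note that `MultinomCI()` uses a method designed for several categories at once, so its limits will usually differ slightly from this simpler calculation.

```r
# Hypothetical counts for a four-category variable (NOT the Natsal figures)
counts <- c(very = 150, fairly = 250, not_very = 300, not_at_all = 300)

n <- sum(counts)             # total number of respondents
p <- counts / n              # sample proportion in each category
se <- sqrt(p * (1 - p) / n)  # standard error of each proportion
z <- qnorm(0.975)            # critical value for a 95% interval (about 1.96)

ci <- cbind(est = p, lwr.ci = p - z * se, upr.ci = p + z * se)
round(ci * 100, 1)           # express as percentages
```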

**TASK**: Calculate confidence intervals for the `snearly` variable. Look at the [codebook](./codebook/8786_natsal_open_codebook_feb21.pdf) to see what this variable measures.

In [None]:
# INSERT CODE HERE

#### A note on confidence

You'll have seen reference to '95% confidence interval' and rightly wondered what it meant.

Let's imagine that the *Natsal* survey was completed by 100 different samples of respondents (obviously this would be wildly expensive, impractical and unnecessary). The sampling procedure is the same (i.e., random sampling) and the sample sizes are the same (i.e., 15,000). And for each sample we calculate the mean of some numeric variable - let's say age at birth of first child.

In such a scenario, a 95% confidence interval represents the following:
* For around 95 of the 100 samples, the confidence interval calculated from that sample contains the 'true' population mean.
* For around 5 of the 100 samples, the confidence interval misses the 'true' population mean.

Each sample produces its own mean and its own interval; the range between '24.6' and '25.1' is simply the interval we obtained from our sample.

The intractable problem is that we have no way of telling whether the **actual** sample we observed (i.e., the people who completed the *Natsal* survey) is one of the 95 or one of the 5.

All we know is that the procedure produces an interval containing the population mean around 95 times out of 100.
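We cannot repeat the *Natsal* survey 100 times, but we can simulate the idea. The sketch below invents a population with a known mean (all of the values are hypothetical), draws 100 random samples from it, calculates a 95% confidence interval for each sample, and counts how many of those intervals contain the known mean.

```r
set.seed(2022)     # arbitrary seed so the example is reproducible

true_mean <- 25    # hypothetical population mean
true_sd <- 5       # hypothetical population standard deviation
n_samples <- 100   # number of repeated surveys
n <- 1000          # respondents in each survey

covers <- replicate(n_samples, {
  s <- rnorm(n, mean = true_mean, sd = true_sd)  # draw one random sample
  ci <- t.test(s)$conf.int                       # 95% confidence interval for that sample
  ci[1] <= true_mean && true_mean <= ci[2]       # does the interval contain the true mean?
})

sum(covers) # number of the 100 intervals that contain the true mean
```

Try changing the seed and re-running the cell: the count will vary from run to run but should stay close to 95.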

### Bivariate Analysis

A key aspect of bivariate analysis is producing a **measure of association** that summarises the strength / direction of a relationship between two variables. In this instance our uncertainty is not related to the exact value of a measure of association, **but in how confident we are that the relationship is present in the population from which the sample was drawn**.

For example, we find a moderate association between ethnicity and importance of religion:

In [None]:
CramerV(natsal$religimp, natsal$ethnicgrp)

How confident are we that this association exists in the population of British adults and not just in our sample? We can answer this question by calculating what are known as measures of **statistical significance**. 

A common measure of statistical significance is a **p-value**. This can be interpreted as a proportion, ranging from 0 to 1. In contrast to measures of association, we are interested in values close to 0: a p-value tells us how likely we would be to observe an association at least as strong as the one in our sample if there were **NO** association in the population from which the sample was drawn. Put another way:
* a p-value < 0.05 indicates that an association this strong would rarely arise by chance alone, which we take as evidence that the association exists in the wider population and not just in the sample.
* an association with a p-value < 0.05 is therefore said to be **statistically significant**. 

Therefore a p-value &mdash; and other measures of statistical significance &mdash; provides a summary of our confidence in the **generalisability** of the association / pattern we observe in the data.

#### Categorical vs Categorical

Let's return to the association between importance of religion and respondent's ethnicity:

In [None]:
CramerV(natsal$religimp, natsal$ethnicgrp)

In [None]:
#options(scipen = 999) # suppress display of scientific notation

chisq.test(natsal$religimp, natsal$ethnicgrp)

We observe that the p-value is well below the 0.05 threshold, therefore we conclude that the association is statistically significant. That is, the association is very likely present in the population from which the sample was drawn.

Even weak associations are likely to generate small p-values if the sample is large enough.

In [None]:
chisq.test(natsal$religimp, natsal$rsex)
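The link between sample size and p-values can be illustrated with two made-up tables that have identical percentages but very different numbers of respondents. This is a sketch only: the counts are hypothetical, and `cramers_v()` is a small helper written for this example rather than the `CramerV()` function from `DescTools`.

```r
# Hypothetical 2x2 table of counts with a very weak association (NOT the Natsal data)
small <- matrix(c(26, 24, 24, 26), nrow = 2,
                dimnames = list(group = c("A", "B"), answer = c("Yes", "No")))
large <- small * 100 # identical percentages, 100 times as many respondents

# Helper for this example: Cramer's V calculated from the chi-squared statistic
cramers_v <- function(tab) {
  chi2 <- chisq.test(tab, correct = FALSE)$statistic
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

v_small <- cramers_v(small)
v_large <- cramers_v(large)
p_small <- chisq.test(small)$p.value
p_large <- chisq.test(large)$p.value

c(v_small = v_small, v_large = v_large) # strength of association is the same
c(p_small = p_small, p_large = p_large) # but the p-values are very different
```

The strength of the association is identical in both tables; only the amount of evidence (the number of respondents) has changed.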

What if we had a smaller sample of respondents?

In [None]:
natsal_samp <- natsal[sample(nrow(natsal), 100), ] # randomly sample 100 observations from the dataset

In [None]:
CramerV(natsal_samp$religimp, natsal_samp$rsex) # same pair of variables as the significance test in the next cell

In [None]:
chisq.test(natsal_samp$religimp, natsal_samp$rsex)

**QUESTION:** Why is the p-value no longer below the 0.05 threshold?

#### Categorical vs Numeric

Recall that the appropriate summary statistic for a bivariate analysis of one categorical and one numeric variable is:
* *Eta squared*

This tells us the strength of the association but not the direction. *Eta squared* coefficient ranges from 0 to 1, with higher values representing stronger associations.

In [None]:
install.packages("lsr") # install the necessary package - only needs to be done once

In [None]:
library(lsr) # import the package containing the `etaSquared()` command

In [None]:
model <- aov(attconservative ~ agrp, data = natsal)
etaSquared(model)

We can recover the p-value for this association by summarising the results of the `aov(attconservative ~ agrp, data = natsal)` command:

In [None]:
summary(model)

In this instance we are looking at the `Pr(>F)` statistic, which is another way of describing a p-value.

Again this is considerably lower than the '0.05' threshold and thus we would conclude that the very weak association is very likely to be present in the population from which the sample was drawn.
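Both numbers can be recovered directly from the table that `summary()` produces for an `aov()` model: *Eta squared* is the between-group sum of squares divided by the total sum of squares. The following sketch uses simulated data (the `toy` data frame and its values are hypothetical) so that it can be run on its own.

```r
set.seed(1) # arbitrary seed so the example is reproducible

# Simulated data: three groups whose means differ slightly (NOT the Natsal data)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 50)),
  score = rnorm(150, mean = rep(c(0, 0.2, 0.4), each = 50))
)

toy_model <- aov(score ~ group, data = toy)
tab <- summary(toy_model)[[1]] # the ANOVA table

ss <- tab[["Sum Sq"]]          # sums of squares: between groups, then residual
eta_sq <- ss[1] / sum(ss)      # between-group share of the total variation
p_value <- tab[["Pr(>F)"]][1]  # p-value for the group variable

c(eta_squared = eta_sq, p_value = p_value)
```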

#### Numeric vs Numeric

Recall that the appropriate measure of association for two numeric variables is:
* *Pearson's correlation coefficient (r)*

Similar to other measures of association, it tells us the strength and direction of the association between two variables. The coefficient ranges between -1 and 1, with negative values representing negative associations, and positive values positive associations. Values closer to -1 or 1 indicate stronger associations than those closer to 0.

In [None]:
plot(natsal$dage1ch, natsal$attconservative) # X variable (axis) is listed first, Y variable (axis) second

A visual inspection of the joint distribution does not reveal any obvious pattern or relationship: as the age at which an individual had their first child increases, there does not seem to be any obvious change in attitude to sex. We can confirm our interpretation by calculating an appropriate measure of association (*Pearson's correlation coefficient (r)*).

In [None]:
cor(natsal$dage1ch, natsal$attconservative, use = "complete.obs")

To produce the p-value calculated for this association, we use the `cor.test()` command:

In [None]:
cor.test(natsal$dage1ch, natsal$attconservative) # observations with missing values are dropped automatically

And once more, the p-value is below the '0.05' threshold and we conclude that the weak association is very likely present in the population from which the sample of respondents was drawn.
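The result of `cor.test()` can be stored as an object, which lets us pull out the individual pieces (the coefficient, its p-value and its confidence interval) rather than reading them from the printed output. A minimal sketch with simulated data (the variables `x` and `y` are hypothetical):

```r
set.seed(7) # arbitrary seed so the example is reproducible

n <- 5000
x <- rnorm(n)
y <- 0.05 * x + rnorm(n) # simulated data with a deliberately weak relationship
x[sample(n, 100)] <- NA  # add some missing values

result <- cor.test(x, y) # pairs with a missing value are dropped automatically

result$estimate # Pearson's r
result$p.value  # p-value for the association
result$conf.int # 95% confidence interval for r
```

Reporting the confidence interval alongside the p-value helps readers judge how weak or strong the association could plausibly be in the population.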

#### A note on statistical significance

Statistically significant **does not** mean a finding is important or of practical significance. The term comes from an older use of English (significant = signals). Therefore, statistically significant signals that a finding may be important and worth investigating further (MacInnes, 2019).

To claim that a finding is of practical significance, we look at the **magnitude** of a statistic:
* Whether an association is strong or not
* Whether a proportion for one group is considerably different to a proportion for another group (e.g., differences between males and females in terms of importance of religion)
* And so on

Put simply:
> Statistical significance tells us what we can infer about a target population from what we find in a sample. (MacInnes, 2019: 10)

## Conclusion

In this lesson we encountered a range of techniques for expressing the uncertainty inherent in our quantitative analyses.

In Week 12, we bring all our learning together to write a report based on a piece of quantitative data analysis.