<p align="right"><i>Data Analysis for the Social Sciences - Part II - 2021-11-29</i></p>

# Quantitative Data Analysis

Welcome to Part II of Data Analysis for the Social Sciences. In this stream - Quantitative Data Analysis - we will conduct a range of statistical analyses in order to answer the following research question:

<p><center><i>Is religion associated with differences in sexual attitudes and behaviours among British people?</i></center></p>

### Aims

This lesson - **Expressing Uncertainty** - has two aims:
1. Demonstrate how to calculate and communicate measures of uncertainty relating to your quantitative findings.
2. Cultivate your computational skills through the use of the statistical programming language *R*. For example, there are a number of opportunities for you to amend or write R syntax (code).

### Lesson details

* **Level**: Introductory, for individuals with some prior knowledge or experience of quantitative data analysis.
* **Duration**: 30-45 minutes.
* **Pre-requisites**: None.
* **Programming language**: R.
* **Learning outcomes**:
	1. Understand how to use R for quantitative data analysis.
	2. Understand how to select and apply common measures of uncertainty relating to quantitative findings.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Introduction to uncertainty*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut `Shift + Enter`).

Try it for yourself:

In [1]:
name <- readline(prompt="Enter name: ")
print(paste("Hi,", name, "enjoy learning more about R and measures of uncertainty!"))

Enter name: Diarmuid McDonnell
[1] "Hi, Diarmuid McDonnell enjoy learning more about R and measures of uncertainty!"


Notebooks are sequential, meaning code should be executed in order (top to bottom). For example, the following code won't work:

In [106]:
x * 5

As the error message suggests, there is no object (variable) called `x`, therefore we cannot do any calculations with it. 

Let's try a sequential approach:

In [145]:
x <- 10 # create an object called 'x' and give it the value '10'

In [146]:
x * 5 # multiply 'x' by 5

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

### Learner input

Throughout the lessons there are times when you need to do the following activities:
* **TASK:** A coding task for you to complete (e.g. analyse different variables).
* **QUESTION:** A question regarding your interpretation of some code or a technique (e.g. what is the piece of code doing?).
* **EXERCISE:** A data analysis challenge for you to complete.

## Introduction to uncertainty

In the [**previous lessons**](https://github.com/DiarmuidM/data-analysis-for-the-social-sciences-2021/blob/main/lessons), we learned how to conduct a range of univariate, bivariate and multivariate analyses. The results of these analyses often appear quite precise and accurate:
* The exact proportion of respondents who claim that religion is 'Very important' to them (14%).
* The average conservative attitude score for different age groups (16-24 year olds = -0.14817995).

However, the ability to produce such exact summaries masks the great deal of **uncertainty** associated with any piece of quantitative data analysis. For example, how confident are we that 14% of all British adults would claim that religion is very important to them? If it isn't 14%, then how much larger or smaller is the 'true' percentage? 

Remember, our estimate of 14% comes from the **sample** of people who happened to participate in the *Natsal* wave 3 survey (2010-12). Is this group perfectly representative of the wider **population** from which it was drawn (16-74 year olds in Britain)?

The lecture and materials from Week 3 (*Principles of Quantitative Data Analysis I*) cover this in detail, but the short answer is this:

<p><center><i>Samples will always differ from their population due to random chance</i></center></p>

Thus in this lesson we will focus on expressing the inherent uncertainty in our quantitative analyses. There are a variety of measures of uncertainty but all are united by trying to answer the following question:
* Can we generalise our result to the wider population from which we drew our sample?

## Uncertainty in action

### Preliminaries

Let's import the *Natsal* dataset and label the values of some of our key variables:

In [2]:
natsal <- read.table("./data/natsal_3_teaching_open.tab", header = TRUE, na.strings = "NA", sep = "\t")

In [3]:
natsal$agrp <- factor(natsal$agrp, levels = c(1,2,3,4,5,6), labels = c("16-24", "25-34", "35-44", 
                                                                       "45-54", "55-64", "65-74"))

In [4]:
natsal$religimp <- factor(natsal$religimp, levels = c(1,2,3,4,9), labels = c("Very important", "Fairly important", 
                                                                             "Not very important", "Not important at all", "Not answered"))

In [5]:
natsal$rsex <- factor(natsal$rsex, levels = c(1,2), labels = c("Male", "Female"))

In [6]:
natsal$ethnicgrp <- factor(natsal$ethnicgrp, levels = c(1,2,9), labels = c("White", "Non-white", "Not answered"))
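
The labelling cells above all follow the same pattern: `factor()` takes the numeric codes stored in the data and attaches a text label to each one. The following is a minimal sketch of that pattern using a made-up vector of codes (`codes` and `sex` are illustrative names, not *Natsal* variables). Note that any code not listed in `levels` is converted to a missing value:

```r
# Hypothetical numeric codes, standing in for a survey variable such as `rsex`
codes <- c(1, 2, 2, 1, 9, 2)

# Attach labels to the codes; the code 9 is not listed in `levels`, so it becomes NA
sex <- factor(codes, levels = c(1, 2), labels = c("Male", "Female"))

levels(sex)                  # the labels now attached to the variable
table(sex, useNA = "ifany")  # counts per label, including any missing values
```

Running `table()` after labelling is a quick way of checking that every code has been given the label you intended.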

### Uncertainty in univariate analysis

Univariate analysis produces statistical summaries of numeric and categorical variables e.g., average attitude to sex; proportion of respondents in the 16-24 age group etc. However these single numbers give an inflated sense of accuracy and precision. Thus we need some way of expressing the **range** of plausible values for a given statistical summary. You may have heard of this range referred to as **confidence intervals**, **margins of error**, **polling error** etc.

#### Numeric variables

Let's look at an example for our measure of attitude to sex (`attconservative`).

In [7]:
summary(natsal$attconservative)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
-3.88938 -0.67900 -0.02177 -0.01728  0.71396  2.59516      299 

We see that the average attitude is slightly liberal (less than zero). However this finding was generated by using data on our current sample of randomly selected respondents. What if we had a different sample of respondents? Would we also expect the average score for this variable to be '-0.01728'? And if we had a different sample again?

Thanks to sampling theory, we can produce a range or interval of plausible values for average attitude (or any average of a numeric variable).

In [8]:
t.test(natsal$attconservative) # calculate the 95% confidence interval for the mean of `attconservative` (missing values are dropped automatically)


	One Sample t-test

data:  natsal$attconservative
t = -1.0147, df = 3499, p-value = 0.3103
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.05066760  0.01610798
sample estimates:
  mean of x 
-0.01727981 


We get lots of information returned but all we need for our interpretation is the following:
> 95 percent confidence interval:
> -0.05066760     0.01610798

The one-line summary is this: the mean attitude score is very probably between '-0.05066760' and '0.01610798' in the population of British adults from which the sample was drawn. The figure of '-0.01727981' remains our best estimate but we now acknowledge that the score in the population could actually be between '-0.05066760' and '0.01610798'.
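
If you are curious where these two numbers come from, the following is a minimal sketch of the calculation using simulated scores rather than the *Natsal* data (the object names `scores`, `ci_manual` and `ci_ttest` are made up for illustration). The interval is the sample mean plus or minus a critical value from the t distribution multiplied by the standard error of the mean:

```r
set.seed(2021)                           # make the simulated data reproducible
scores <- rnorm(200, mean = 0, sd = 1)   # hypothetical attitude scores

n      <- length(scores)                 # sample size
xbar   <- mean(scores)                   # sample mean (our best estimate)
se     <- sd(scores) / sqrt(n)           # standard error of the mean
t_crit <- qt(0.975, df = n - 1)          # critical value for a 95% interval

ci_manual <- c(xbar - t_crit * se, xbar + t_crit * se)  # calculated by hand
ci_ttest  <- t.test(scores)$conf.int                    # calculated by t.test()

ci_manual
ci_ttest
```

Both approaches should return the same lower and upper limits, which shows that the interval depends on only three things: the sample mean, the spread of the variable and the size of the sample.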

This can be tricky to get your head around, so let's look at another example: age at birth of first child (`dage1ch`).

In [9]:
dage1ch_valid <- subset(natsal, dage1ch > -1 & dage1ch < 99) # drop invalid values of this variable
summary(dage1ch_valid$dage1ch)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.00   21.00   24.00   24.84   28.00   40.00 

In [10]:
t.test(dage1ch_valid$dage1ch) # calculate the 95% confidence interval for the mean of `dage1ch` (missing values are dropped automatically)


	One Sample t-test

data:  dage1ch_valid$dage1ch
t = 206.53, df = 1924, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 24.60567 25.07745
sample estimates:
mean of x 
 24.84156 


The mean age at which our respondents have their first child is '24.8' years old, but we acknowledge that the mean age in the population could actually be between '24.6' and '25.1'. Therefore the range of the confidence interval is very tight around the mean, providing confidence that '24.8' is a good estimate of the average age at which adults in Britain have their first child.

Imagine our mean was still '24.8' but the confidence interval ranged from '19.5' to '31.2'. Now the mean does not look like a good estimate of the average age at which adults in Britain have their first child.

Calculating a confidence interval does not invalidate your estimate of the mean of a numeric variable; it simply provides some caution when making claims about a **population** based on a **sample**.
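
One reason an interval can be tight or wide is the size of the sample. The sketch below illustrates this with simulated ages (not the *Natsal* data): two hypothetical samples are drawn from the same population, one of 20 people and one of 2,000 people, and the width of each 95% confidence interval is compared. The population mean of 24.8 and standard deviation of 5 are assumptions chosen purely for illustration:

```r
set.seed(42)  # make the simulated data reproducible

# Two hypothetical samples drawn from the same population
small_sample <- rnorm(20,   mean = 24.8, sd = 5)
large_sample <- rnorm(2000, mean = 24.8, sd = 5)

ci_small <- t.test(small_sample)$conf.int
ci_large <- t.test(large_sample)$conf.int

width_small <- ci_small[2] - ci_small[1]  # upper limit minus lower limit
width_large <- ci_large[2] - ci_large[1]

width_small
width_large
```

The interval from the larger sample should be considerably narrower, reflecting the greater amount of information we have about the population.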

#### Categorical variables

We can also calculate a range of uncertainty for categories of a categorical variable.

In [12]:
prop.table(table(natsal$religimp))


      Very important     Fairly important   Not very important 
         0.144774941          0.246117399          0.290076336 
Not important at all         Not answered 
         0.314293235          0.004738089 

(Multiply the proportions by 100 to convert them to percentages)
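
As a small sketch of that conversion, using a made-up vector of answers rather than the *Natsal* data (`answers`, `props` and `percentages` are illustrative names):

```r
# Hypothetical survey answers
answers <- c("Yes", "No", "No", "Yes", "No")

props       <- prop.table(table(answers))  # proportions, which sum to 1
percentages <- round(100 * props, 1)       # percentages, rounded to one decimal place

percentages
```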

We observe that 14% of respondents claim that religion is very important to them, 25% that it is fairly important etc. Are these good estimates of how the population of adults in Britain feel about religion?

In [19]:
library(DescTools) # specific package for calculating confidence intervals for categorical variables

MultinomCI(table(natsal$religimp)) # 95% confidence interval is the default

| | est | lwr.ci | upr.ci |
|---|---|---|---|
| Very important | 0.144774941 | 0.1276652 | 0.1620909 |
| Fairly important | 0.246117399 | 0.2290076 | 0.26343336 |
| Not very important | 0.290076336 | 0.2729666 | 0.3073923 |
| Not important at all | 0.314293235 | 0.2971835 | 0.3316092 |
| Not answered | 0.004738089 | 0.0 | 0.02205405 |


We are 95% confident that the true proportion of adults in Britain who claim religion is very important is between 13% and 16% (lwr.ci = lower end of the interval, upr.ci = upper end of the interval).
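
To see the logic behind an interval for a proportion, here is a minimal sketch for a single category using base R only. The counts are hypothetical (145 out of 1,000 respondents) and are not taken from the *Natsal* data. The results will generally not match `MultinomCI()` exactly, because that command calculates intervals for all categories of a variable jointly, whereas this sketch treats one category on its own:

```r
# Hypothetical counts: 145 of 1,000 respondents chose "Very important"
n_very  <- 145
n_total <- 1000

p_hat <- n_very / n_total                   # sample proportion
se    <- sqrt(p_hat * (1 - p_hat) / n_total) # standard error of a proportion
z     <- qnorm(0.975)                       # critical value for a 95% interval

ci_wald <- c(p_hat - z * se, p_hat + z * se) # simple interval calculated by hand
ci_prop <- prop.test(n_very, n_total)$conf.int # interval calculated by prop.test()

ci_wald
ci_prop
```

The two intervals differ slightly because `prop.test()` uses a different formula to the simple one calculated by hand, but both tell a similar story about the range of plausible values.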

**TASK**: Calculate confidence intervals for the `snearly` variable. Look at the [codebook](./codebook/8786_natsal_open_codebook_feb21.pdf) to see what this variable measures.

In [22]:
# INSERT CODE HERE

#### A note on confidence

You'll have seen reference to '95% confidence interval' and rightly wondered what it meant.

Let's imagine that the *Natsal* survey was completed by 100 different samples of respondents (obviously this would be wildly expensive, impractical and unnecessary). The sampling procedure is the same (i.e., random sampling) and the sample sizes are the same (i.e., 15,000). And for each sample we calculate the mean of some numeric variable - let's say age at birth of first child.

In such a scenario, a 95% confidence interval represents the following:
* For roughly 95 of the 100 samples, the confidence interval calculated from that sample (e.g., '24.6' to '25.1' for ours) contains the true population mean.
* For roughly 5 of the 100 samples, the confidence interval calculated from that sample misses the true population mean.

The intractable problem is that we have no way of telling whether the **actual** sample we observed (i.e., the people who completed the *Natsal* survey) is one of the 95 or one of the 5. 

All we know is that 95 times out of 100 an interval calculated in this way contains the population mean.
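
We cannot repeat the *Natsal* survey many times, but we can simulate the idea. The sketch below is an illustration under stated assumptions, not a re-analysis of the survey: it invents a population with a known mean age of 25 (and a standard deviation of 6), draws 1,000 random samples of 500 people each, calculates a 95% confidence interval for each sample, and counts how often the interval contains the known population mean:

```r
set.seed(123)        # make the simulation reproducible

true_mean   <- 25    # assumed population mean (known only because we invented it)
n_samples   <- 1000  # number of hypothetical surveys
sample_size <- 500   # respondents per hypothetical survey

# For each sample: calculate the 95% interval and check whether it contains the true mean
covers <- replicate(n_samples, {
  s  <- rnorm(sample_size, mean = true_mean, sd = 6)
  ci <- t.test(s)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covers)  # proportion of intervals that contain the true mean
coverage
```

The proportion should be close to 0.95. In a real survey we only ever see one of these samples, and we never learn whether its interval is one of those that captured the population mean.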

### Uncertainty in bivariate analysis

A key aspect of bivariate analysis is producing a **measure of association** that summarises the strength / direction of a relationship between two variables. In this instance our uncertainty does not relate to the exact figure of a measure of association, but to how confident we are that the relationship is present in the population from which the sample was drawn.

For example, we found a weak association between sex and importance of religion:

In [23]:
CramerV(natsal$religimp, natsal$rsex)

How confident are we that this association exists in the population of British adults and not just in our sample? We can answer this question by calculating what are known as measures of **statistical significance**. 

A common measure of statistical significance is a **p-value**. This can be interpreted as a probability, ranging from 0 to 1: the chance of observing an association at least as strong as the one in our sample if there were no association in the population from which the sample was drawn. In contrast to measures of association, we are interested in values close to 0, as these indicate that our result is unlikely to be a product of random chance alone. Put another way:
* a p-value < 0.05 indicates that an association this strong would rarely appear in a sample if it were absent from the wider population, which we take as evidence that the association is found in the population and not just in the sample.
* an association with a p-value < 0.05 is therefore said to be **statistically significant**. 

Therefore a p-value &mdash; and other measures of statistical significance &mdash; provides a summary of our confidence in the **generalisability** of the association / pattern we observe in the data.

#### Categorical vs Categorical

Let's return to the association between importance of religion and respondent's sex:

In [24]:
CramerV(natsal$religimp, natsal$rsex)

In [26]:
options(scipen = 999) # suppress display of scientific notation

chisq.test(natsal$religimp, natsal$rsex)


	Pearson's Chi-squared test

data:  natsal$religimp and natsal$rsex
X-squared = 46.43, df = 4, p-value = 0.000000002004


We observe that the p-value is well below the 0.05 threshold, therefore we conclude that the association is statistically significant. That is, the association is very likely present in the population from which the sample was drawn.
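
Rather than reading the p-value off the printed output, you can store the result of `chisq.test()` and extract the pieces you need. The following is a minimal sketch using a hypothetical table of counts (the numbers and the object names `counts` and `result` are made up for illustration):

```r
# Hypothetical counts: respondent's sex by answer to a three-category question
counts <- matrix(c(30, 50, 70,
                   50, 50, 50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex    = c("Male", "Female"),
                                 answer = c("Agree", "Neither", "Disagree")))

result <- chisq.test(counts)  # store the test instead of printing it

result$statistic       # the X-squared value
result$p.value         # the p-value on its own
result$p.value < 0.05  # TRUE if the association is statistically significant
```

Storing the result is useful when you want to report the p-value in a table or compare it to a threshold in later code.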

Let's look at one more example, this time for two ordinal categorical variables:

In [30]:
library(DescTools) # import the package containing the `GoodmanKruskalGamma()` command

GoodmanKruskalGamma(natsal$religimp, natsal$agrp)

In [33]:
chisq.test(natsal$religimp, natsal$agrp)

"Chi-squared approximation may be incorrect"



	Pearson's Chi-squared test

data:  natsal$religimp and natsal$agrp
X-squared = 153.37, df = 20, p-value < 0.00000000000000022
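
A note on the warning `Chi-squared approximation may be incorrect`: `chisq.test()` issues it when some cells of the table have a small expected count (below 5), which in our case is most likely due to the sparse 'Not answered' category. The sketch below uses a hypothetical table with one sparse column to show how to inspect the expected counts, and one possible alternative, which is to ask `chisq.test()` to calculate the p-value by simulation (the numbers and object names are made up for illustration):

```r
# Hypothetical counts with a sparse third column
sparse <- matrix(c(40, 35, 2,
                   38, 42, 3),
                 nrow = 2, byrow = TRUE)

result <- suppressWarnings(chisq.test(sparse))  # hide the warning for this demonstration

result$expected           # expected count in each cell if there were no association
any(result$expected < 5)  # TRUE if any cell triggers the warning

set.seed(1)               # the simulated p-value involves random numbers
sim <- chisq.test(sparse, simulate.p.value = TRUE, B = 2000)
sim$p.value
```

Another common response is to drop or combine sparse categories before running the test, provided that makes substantive sense for your research question.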


**QUESTION**: Is the association between importance of religion and age group statistically significant?

#### Categorical vs Numeric

Recall that the appropriate summary statistic for a bivariate analysis of one categorical and one numeric variable is:
* *Eta squared*

This tells us the strength of the association but not the direction (we need to infer this from summary tables of the group means). The *Eta squared* coefficient ranges from 0 to 1, with higher values representing stronger associations.

In [34]:
library(lsr)

model <- aov(attconservative ~ agrp, data = natsal)
etaSquared(model)

| | eta.sq | eta.sq.part |
|---|---|---|
| agrp | 0.04232259 | 0.04232259 |


We can recover the p-value for this association by summarising the results of the `aov(attconservative ~ agrp, data = natsal)` command:

In [39]:
summary(model)

              Df Sum Sq Mean Sq F value              Pr(>F)    
agrp           5    150  30.060   30.88 <0.0000000000000002 ***
Residuals   3494   3401   0.973                                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
299 observations deleted due to missingness

In this instance we are looking at the `Pr(>F)` statistic, which is another way of describing a p-value.

Again this is considerably lower than the '0.05' threshold and thus we would conclude that the very weak association is very likely to be present in the population from which the sample was drawn.
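
The `Sum Sq` column in the table above also shows where *Eta squared* comes from: it is the sum of squares for the grouping variable divided by the total sum of squares (150 / (150 + 3401) is roughly 0.042). The following is a minimal sketch of that calculation, and of extracting the p-value, using simulated data rather than *Natsal* (the data frame `toy` and its variables are made up for illustration):

```r
set.seed(7)  # make the simulated data reproducible

# Hypothetical data: attitude scores for three age groups of 50 people each
group <- factor(rep(c("16-24", "25-34", "35-44"), each = 50))
score <- rnorm(150, mean = rep(c(-0.3, 0, 0.3), each = 50), sd = 1)
toy   <- data.frame(group, score)

toy_model   <- aov(score ~ group, data = toy)
anova_table <- summary(toy_model)[[1]]     # the ANOVA table as a data frame

ss      <- anova_table[["Sum Sq"]]         # sums of squares: group, then residuals
eta_sq  <- ss[1] / sum(ss)                 # Eta squared calculated by hand
p_value <- anova_table[["Pr(>F)"]][1]      # p-value for the grouping variable

eta_sq
p_value
```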

#### Numeric vs Numeric

Recall that the appropriate measure of association for two numeric variables is:
* *Pearson's correlation coefficient (r)*

Similar to other measures of association, it tells us the strength and direction of the association between two variables. The coefficient ranges between -1 and 1, with negative values representing negative associations, and positive values positive associations. Values closer to -1 or 1 indicate stronger associations than those closer to 0.

In [40]:
country_ages <- read.csv("./data/median-age-our-world-in-data.csv", header = TRUE, na.strings = "NA")
head(country_ages)

| | Entity | Code | Year | Age |
|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` |
| 1 | Afghanistan | AFG | 1950 | 19.4 |
| 2 | Afghanistan | AFG | 1955 | 19.2 |
| 3 | Afghanistan | AFG | 1960 | 18.8 |
| 4 | Afghanistan | AFG | 1965 | 18.4 |
| 5 | Afghanistan | AFG | 1970 | 17.9 |
| 6 | Afghanistan | AFG | 1975 | 17.4 |


In [44]:
cor(country_ages$Year, country_ages$Age, use = "complete.obs")

No surprise there: the correlation coefficient indicates a strong, positive association between time and median age.

To calculate the p-value for this association, we use the `cor.test()` command:

In [45]:
cor.test(country_ages$Year, country_ages$Age) # incomplete pairs of observations are dropped automatically


	Pearson's product-moment correlation

data:  country_ages$Year and country_ages$Age
t = 92.967, df = 7469, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7217246 0.7427545
sample estimates:
      cor 
0.7324142 


And once more, the p-value is below the '0.05' threshold and we conclude that the strong association between median age and year is very likely present in the population from which the sample of countries and years was drawn.
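
Notice that `cor.test()` reports a confidence interval for the correlation coefficient as well as a p-value. The sketch below shows how to extract each of these from a stored result. It uses simulated data rather than the median age dataset, so the variables `year` and `age` and the numbers used to generate them are assumptions made purely for illustration:

```r
set.seed(99)  # make the simulated data reproducible

# Hypothetical data: median age rising slowly over time, with random noise
year <- rep(seq(1950, 2015, by = 5), times = 10)
age  <- 20 + 0.1 * (year - 1950) + rnorm(length(year), sd = 3)

result <- cor.test(year, age)  # store the test instead of printing it

result$estimate  # the correlation coefficient (r)
result$conf.int  # 95% confidence interval for r
result$p.value   # the p-value
```

Reporting the interval alongside the coefficient communicates both the strength of the association and the uncertainty around it.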

#### A note on statistical significance

Statistically significant **does not** mean a finding is important or of practical significance. The term comes from an older use of English (significant = signals). Therefore, statistically significant signals that a finding may be important and worth investigating further (MacInnes, 2019).

To claim that a finding is of practical significance, we look at the **magnitude** of a statistic:
* Whether an association is strong or not
* Whether a proportion for one group is considerably different to a proportion for another group (e.g., differences between men and women in terms of importance of religion)
* And so on
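
The difference between statistical and practical significance is easiest to see with a very large sample. The sketch below simulates two variables that are almost unrelated (the sample size of 100,000 and the strength of the relationship are assumptions chosen for illustration) and then tests the correlation between them:

```r
set.seed(2019)  # make the simulated data reproducible

n <- 100000                # a very large hypothetical sample
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)   # y is only very faintly related to x

result <- cor.test(x, y)

result$estimate  # the correlation coefficient: very close to zero
result$p.value   # the p-value: nonetheless below 0.05
```

With this much data even a negligible association is statistically significant, which is why the magnitude of a statistic should always be reported and interpreted alongside its p-value.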

Put simply:
> Statistical significance tells us what we can infer about a target population from what we find in a sample. (MacInnes, 2019: 10)

## Conclusion

In this lesson we encountered a range of techniques for expressing the uncertainty inherent in our quantitative analyses.

In Week 12, we bring all of our learning together to write a report based on a piece of quantitative data analysis.