<p align="right"><i>Data Analysis for the Social Sciences - Part II - 2021-11-15</i></p>

# Quantitative Data Analysis

Welcome to Part II of Data Analysis for the Social Sciences. In this stream - Quantitative Data Analysis - we will conduct a range of statistical analyses in order to answer the following research question:

<p><center><i>Is religion associated with differences in sexual attitudes and behaviours among British people?</i></center></p>

### Aims

This lesson - **Univariate Analysis** - has two aims:
1. Demonstrate how to analyse categorical and numeric variables individually.
2. Cultivate your computational skills through the use of the statistical programming language *R*. For example, there are a number of opportunities for you to amend or write R syntax (code).

### Lesson details

* **Level**: Introductory, for individuals with no prior knowledge or experience of quantitative data analysis.
* **Duration**: 30-45 minutes.
* **Pre-requisites**: None.
* **Programming language**: R.
* **Learning outcomes**:
	1. Understand how to use R for conducting univariate data analysis.
	2. Understand how to select and apply common data analysis techniques for categorical and numeric variables.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Exploring Data*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut `Shift + Enter`).

Try it for yourself:

In [None]:
name <- readline(prompt="Enter name: ")
print(paste("Hi,", name, "enjoy learning more about R and exploring data!"))

Notebooks are sequential, meaning code should be executed in order (top to bottom). For example, the following code won't work:

In [2]:
x * 5

ERROR: Error in eval(expr, envir, enclos): object 'x' not found


As the error message suggests, there is no object (variable) called `x`, therefore we cannot do any calculations with it. 

Let's try a sequential approach:

In [4]:
x <- 10 # create an object called 'x' and give it the value '10'

In [5]:
x * 5 # multiply 'x' by 5

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

### Learner input

Throughout the lessons there are times when you need to do the following activities:
* **TASK:** A coding task for you to complete (e.g. analyse different variables).
* **QUESTION:** A question regarding your interpretation of some code or a technique (e.g. what is the piece of code doing?).
* **EXERCISE:** A data analysis challenge for you to complete.

## Analysing Data

Once we have collected and explored our data, we can turn to the interesting part: analysis. Our ultimate aim is to answer our research question accurately: to do so we need to produce summaries of our key variables. In our case we need to analyse variables relating to religious beliefs and sexual attitudes/behaviours. 

How do we know which summaries to choose? We need to look at the **level of measurement** of each variable: are we dealing with categorical (e.g., marital status) or numeric (e.g., income) variables? We covered level of measurement in Week 2 but here is a quick refresher: 

![Differentiation of numeric and categorical variables](./images/lvl_msr_diagr.png)

Source: [https://maczokni.github.io/MSCD_labs/week2.html#univariate-analysis](https://maczokni.github.io/MSCD_labs/week2.html#univariate-analysis)

**Numeric variables** measure the amount or magnitude of some phenomenon. For example, how much income a person receives from their part-time job; how many people are classed as homeless in Scotland in a given year.

**Categorical variables** measure the presence of some phenomenon. For example, a person's country of birth; marital status. These are examples of *nominal* categorical variables. However there is another type of categorical variable that also captures the rank or ordering of the categories. For example, a student's degree classification; social class; agreement with a statement ("strongly agree", "agree", "disagree"). These are examples of *ordinal* categorical variables.
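
As a rough illustration of how these three levels of measurement can be represented in R, here is a minimal sketch. The values and object names (`income`, `country`, `agreement`) are invented for illustration and do not come from the Natsal-3 data.

```r
# Toy data, invented purely for illustration (not from Natsal-3)
income  <- c(120.50, 80.00, 200.25)                    # numeric: amounts we can do arithmetic on
country <- factor(c("Scotland", "Wales", "Scotland"))  # nominal: categories with no order
agreement <- factor(
  c("agree", "strongly agree", "disagree"),
  levels  = c("disagree", "agree", "strongly agree"),
  ordered = TRUE                                       # ordinal: categories with a rank order
)

mean(income)                 # a mean is meaningful for a numeric variable
levels(country)              # nominal categories (R lists them alphabetically)
agreement[1] < agreement[2]  # rank comparisons are only defined for ordered factors
```

The design point is that the level of measurement is something we tell R about: the same column of numbers can be stored as numeric, as a factor, or as an ordered factor, and R adapts its summaries accordingly.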

### What influences what? Independent and dependent variables

Before we examine specific variables, it will help to clarify *how* we think religion and sexual attitudes/behaviours are linked. We can do this by specifying the *direction* of the relationship: that is, does religion influence sexual attitudes/behaviours, or is it the other way around (or both)?

For our study we are claiming that people's sexual attitudes/behaviours can be explained or predicted by their religious beliefs. Therefore the values of the sexual attitudes/behaviours variable **depend** on the values of the religious beliefs variable: sexual attitudes/behaviours is our *dependent* variable and religious beliefs is our *independent* variable.

### Univariate analysis

As you may have guessed, *uni*variate analysis simply involves the analysis of *one* variable at a time. We may be interested in many variables in our study but we produce summaries of them separately, rather than trying to analyse them jointly (like in a correlation analysis).

One variable, sounds simple right? Well yes, as long as:
* You have variables that measure the concept you are interested in.
* You recognise the data type (level of measurement) of each variable.

### Importing data

The first step is to import the *Natsal-3* data for analysis.

In [3]:
natsal <- read.table("./data/natsal_3_teaching_open.tab", header = TRUE, na.strings = "NA", sep = "\t")
head(natsal) # view the first six observations

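| | agrp | rsex | ethnicgrpr | sexidr | rnssecgp_6 | adj_imd_quintile | rwcasual | snnolov | snpres | snold | snsexdrv | snmedia | snearly | attconservative | dage1ch | disabil2 | depscr | religimp | relstatr | total_wt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 2 | 1 | 1 | 1 | 4 | 3 | 1 | 4 | 2 | 2 | 2 | 2 | 1 | 1.3665446 | -1 | 1 | 0 | 2 | 7 | 0.4605145 |
| 2 | 3 | 2 | 1 | 1 | 5 | 3 | 5 | 2 | 3 | 4 | 4 | 4 | 3 | -2.3896493 | -1 | 1 | 0 | 4 | 2 | 1.9239441 |
| 3 | 1 | 2 | 1 | 1 | 8 | 4 | 3 | 2 | 4 | 4 | 2 | 2 | 2 | -0.7515043 | -1 | 1 | 0 | 3 | 3 | 1.0208558 |
| 4 | 4 | 2 | 2 | 2 | 5 | 5 | 1 | 5 | 4 | 8 | 3 | 1 | 1 | NA | 28 | 1 | 1 | 1 | NA | 0.8074293 |
| 5 | 2 | 1 | 2 | 2 | 1 | 5 | 1 | 5 | 8 | 8 | 4 | 8 | 1 | NA | 23 | 1 | 0 | 1 | 1 | 1.0390816 |
| 6 | 1 | 2 | 2 | 1 | 8 | 4 | 1 | 5 | 1 | 4 | 2 | 2 | 2 | 0.994498 | -1 | 1 | -1 | 1 | 7 | 0.5381795 |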


Let's get a list of variable names and see which ones relate to our two concepts (*religious beliefs* and *sexual attitudes/behaviours*).

In [4]:
names(natsal)

After looking up the [codebook](./codebook/8786_natsal_open_codebook_feb21.pdf), there are two variables definitely of interest:
* `attconservative` is a measure of how conservative a respondent's attitudes to sex are: higher scores indicate more conservative attitudes. Therefore this is a numeric variable, specifically an interval variable.
* `religimp` is a measure of the current importance of religious beliefs to a respondent: responses range from "Very important" to "Not important at all". Therefore this is a categorical variable, specifically an ordinal variable.

Let's focus on summarising the values of each of these variables.

### Summarising categorical variables

One of the core ways of analysing a categorical variable is to examine the frequency with which each category occurs. That is, we look at the variable's *frequency distribution*. Recall that a *distribution* is a summary of the variation in a variable: that is, observations have different values for a variable and these differences form a pattern.

In [10]:
table(natsal$agrp)


   1    2    3    4    5    6 
 960 1027  533  491  404  384 

In [7]:
table(natsal$religimp)


   1    2    3    4    9 
 550  935 1102 1194   18 
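
Counts are often easier to compare once they are expressed as proportions or percentages. The following is a minimal sketch of one way to do this: it rebuilds a vector of codes from the `religimp` counts printed above so that it runs without the data file. The object names (`religimp_codes`, `freq`, `prop`) are invented for illustration.

```r
# Rebuild a vector of codes from the counts printed above, so the sketch
# runs without the data file
religimp_codes <- rep(c(1, 2, 3, 4, 9), times = c(550, 935, 1102, 1194, 18))

freq <- table(religimp_codes)   # absolute frequencies (counts)
prop <- prop.table(freq)        # relative frequencies (proportions that sum to 1)
round(100 * prop, 1)            # the same information expressed as percentages
```

With the real data you could pass `table(natsal$religimp)` to `prop.table()` in the same way.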

Hmm, it would be useful if we didn't have to remember which category each number refers to. A first step is to tell R that this variable is categorical by converting it to a factor, R's data type for categorical variables (the meaning of each code is given in the codebook):

In [6]:
natsal$religimp <- as.factor(natsal$religimp)

In [8]:
class(natsal$religimp)
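
Converting to a factor does not by itself attach text labels to the codes. A minimal sketch of how labels could be attached is shown here, using a toy vector rather than the real data. The labels for codes 1 and 4 follow the description of `religimp` given earlier; the labels for codes 2, 3 and 9 are assumptions and should be checked against the codebook before use.

```r
# Toy vector of codes standing in for natsal$religimp
codes_example <- c(1, 4, 2, 3, 9, 4)

# Labels for codes 2, 3 and 9 are ASSUMED here; check them against the codebook
religimp_labelled <- factor(
  codes_example,
  levels = c(1, 2, 3, 4, 9),
  labels = c("Very important", "Fairly important", "Not very important",
             "Not important at all", "Not answered")
)

table(religimp_labelled)  # counts are now reported by label rather than by code
```

Storing the result in a new object, rather than overwriting the original codes, keeps the raw values available if a label turns out to be wrong.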

In [None]:
tail(natsal)

**QUESTION:** What do you think the `head()` and `tail()` commands do?

We can view a list of all of the variables in the dataset:

In [None]:
names(natsal)

And get a sense of each variable's contents like so:

In [None]:
str(natsal)

We can see that each variable contains numeric values, however this does not mean there are no categorical variables in the dataset. Slightly confusing, I know: but remember that all quantitative data must be represented numerically, otherwise we have no way of counting or performing calculations. Therefore we need to know what the numbers **represent**. Let's do this using some obvious examples:
* `rsex` is a binary measure of whether a respondent is male or female. As is common in quantitative social science, the male category is coded "1" and the female category "2". Therefore this is a categorical variable, specifically a nominal variable.
* `agrp` is a measure of which age group a respondent belongs to e.g., 16-24, 25-34. The youngest age group is coded "1" and subsequent groups are coded sequentially. Therefore this is a categorical variable, specifically an ordinal variable.
* `attconservative` is a measure of how conservative a respondent's attitudes to sex are: higher scores indicate more conservative attitudes. Therefore this is a numeric variable, specifically an interval variable.

How do we know what the numeric values represent? Helpfully, the UK Data Service has provided a description of each variable and its contents (known as a *codebook*): [Natsal Open Data Codebook](./8786_natsal_open_codebook_feb21.pdf)

#### Summarising variables

Once we have an understanding of the data at a macro level (i.e., number of observations and variables, variable names and types), we can start to explore specific variables in more detail. Let's do this for three variables relevant to our research question:
* `rsex`
* `agrp`
* `attconservative`

In [None]:
summary(natsal$rsex)

In [None]:
summary(natsal$agrp)

In [None]:
summary(natsal$attconservative)

The `summary()` command produces a range of summary statistics for numeric variables, including the mean, median and some measures of the range of values (e.g., minimum and maximum). Note how we refer to the variables by listing the dataset first (`natsal$rsex`). This is because we may have multiple datasets loaded into R at one time and there may be variables in different datasets with the same name.

As we learned earlier however, some of our variables are categorical and thus are not well described using these summary statistics: for example, what does it mean to say a respondent's average sex is 1.596? We know this is a binary measure of whether somebody is male or female and thus it would be better to know how many respondents fall into each category.
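
To make the contrast concrete, here is a minimal sketch using a handful of invented codes in place of `natsal$rsex`. The object names (`rsex_codes`, `rsex_factor`) are hypothetical; the coding (1 = male, 2 = female) is the one described above.

```r
# Toy codes standing in for natsal$rsex (1 = male, 2 = female, as described above)
rsex_codes <- c(1, 2, 2, 1, 2)

mean(rsex_codes)      # arithmetic works on the codes, but the result has no substantive meaning

# Treating the variable as categorical gives counts per category instead
rsex_factor <- factor(rsex_codes, levels = c(1, 2), labels = c("Male", "Female"))
summary(rsex_factor)  # for a factor, summary() reports the count in each category
table(rsex_factor)    # table() gives the same counts
```

Once a variable is stored as a factor, `summary()` switches from means and quartiles to category counts, which is usually what we want for a categorical variable.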

In Week 9 we will learn how to select and apply the correct summary statistics.

**TASK:** Produce a summary of the `ethnicgrpr` (ethnic group) and `dage1ch` (age at having first child) variables. Can you learn anything useful about the values of these variables using the `summary()` command? Consult the codebook if you need help understanding the values of these variables.

Use the cell below to insert your code.

In [1]:
# INSERT CODE HERE

## Analysing Data

In Weeks 9-12 we will learn how to conduct a range of insightful, accurate analyses of the *Natsal* dataset. To whet your appetite, here is a mini-analysis of the following research question:

<p><center><i>Is a person's attitude towards sex associated with the age at which they have their first child?</i></center></p>

### Import the data

In [None]:
natsal <- read.table("./data/natsal_3_teaching_open.tab", header = TRUE, na.strings = "NA", sep = "\t")

### View the data

In [None]:
head(natsal)

### Data cleaning

As per our research question, we are only interested in respondents who have had at least one child. Therefore we need to drop respondents for whom we have no information on their age at birth of first child.

In [None]:
summary(natsal$dage1ch)

Having consulted the codebook, we know that the value "-1" indicates that this question was not relevant to a respondent, and "99" that a respondent did not answer the question even though it was relevant. Therefore let's drop observations where `dage1ch` equals "-1" or "99".

In [None]:
natsal_analysis <- subset(natsal, dage1ch > -1 & dage1ch < 99)

In [None]:
nrow(natsal_analysis)

In [None]:
nrow(natsal)

Note how we created a new dataset after we dropped the observations (`natsal_analysis`). It is always good practice to leave the original / raw dataset unaltered.
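
The behaviour of `subset()` with this condition can be checked on a small invented data frame. This is only a sketch: `toy` and `toy_analysis` are hypothetical names, and the values are made up. It also shows something that is easy to miss, namely that rows where `dage1ch` is missing (`NA`) are dropped as well.

```r
# Toy data frame, invented for illustration; -1 and 99 are the codes described above
toy <- data.frame(
  id      = 1:6,
  dage1ch = c(-1, 23, 99, 31, NA, 18)
)

# Same condition as in the lesson: keep ages strictly between -1 and 99
toy_analysis <- subset(toy, dage1ch > -1 & dage1ch < 99)

nrow(toy)             # the original data frame is untouched
nrow(toy_analysis)    # rows coded -1 or 99 are gone, and so is the NA row
toy_analysis$dage1ch  # the remaining ages
```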

Great, now we can continue with the analysis.

### Univariate analysis

Every analysis begins with a summary of individual variables.

In [None]:
summary(natsal_analysis$dage1ch)
hist(natsal_analysis$dage1ch)

We see that the average age at birth of first child is 24 years old, and that some respondents have their first child relatively young (15) or old (40), though most are between 18 and 26 according to the histogram.

In [None]:
summary(natsal_analysis$attconservative)
hist(natsal_analysis$attconservative)

Positive values of this variable indicate conservative attitudes to sex, negative values liberal attitudes. The average respondent is slightly conservative in their attitudes to sex, though most respondents have a score slightly below or above 0.

### Bivariate analysis

The next step in our analysis is to see if our two variables are associated or not. That is, does knowing somebody's attitude towards sex help us predict what age they had their first child at? Or vice versa: does knowing the age at which somebody had their first child tell you something about their attitude to sex?

We have two numeric (interval) variables, therefore we can calculate the Pearson correlation coefficient (r) to measure the strength and direction of the association. However, the first step is to visualise the joint distribution of the two variables.

In [None]:
plot(natsal_analysis$attconservative, natsal_analysis$dage1ch) # scatterplot: attitude on the x axis, age on the y axis

There does not appear to be any association between attitude to sex and age at birth of first child: there is no diagonal pattern in the joint distribution e.g., as sexual attitude becomes more conservative (move along the x / horizontal axis), age at birth of first child does not increase (move up the y / vertical axis).

We can confirm this using the Pearson correlation coefficient, which is close to 0 in value:

In [None]:
cor(natsal_analysis$attconservative, natsal_analysis$dage1ch, use = "complete.obs")
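
For readers who want to see what `cor()` is calculating, here is a minimal sketch on invented values (not Natsal-3 data). It computes Pearson's r as the covariance of the two variables divided by the product of their standard deviations, compares the result with `cor()`, and then illustrates what `use = "complete.obs"` does when a value is missing. All object names are hypothetical.

```r
# Toy values, invented for illustration (not Natsal-3 data)
attitude <- c(-1.2, 0.3, 0.8, -0.5, 1.1)
age      <- c(24, 31, 22, 27, 29)

# Pearson's r: the covariance of the two variables divided by the
# product of their standard deviations
r_by_hand <- cov(attitude, age) / (sd(attitude) * sd(age))
r_builtin <- cor(attitude, age)   # Pearson is the default method

all.equal(r_by_hand, r_builtin)   # the two calculations agree

# With use = "complete.obs", any pair containing a missing value is dropped first
attitude_na <- c(attitude, NA)
age_na      <- c(age, 35)
r_complete  <- cor(attitude_na, age_na, use = "complete.obs")
```

Because the extra pair contains an `NA`, it is excluded, so `r_complete` matches the coefficient from the five complete pairs.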

### Multivariate analysis

There may be no apparent association between attitude to sex and age at birth of first child, however is this true for males and females alike? What about different ethnic groups?

Multivariate analysis allows us to see if the patterns we find are consistent across values of other variables.

#### By sex

In [None]:
natsal_males <- subset(natsal_analysis, rsex==1) # create a subset of observations containing only male respondents
cor(natsal_males$attconservative, natsal_males$dage1ch, use = "complete.obs")

In [None]:
natsal_females <- subset(natsal_analysis, rsex==2) # create a subset of observations containing only female respondents
cor(natsal_females$attconservative, natsal_females$dage1ch, use = "complete.obs")

The correlation coefficient is effectively the same for males and females, suggesting that the association between attitude to sex and age at birth of first child does not vary by natal sex.

#### By ethnic group

In [None]:
natsal_white <- subset(natsal_analysis, ethnicgrpr==1) # create a subset of observations containing only white respondents
cor(natsal_white$attconservative, natsal_white$dage1ch, use = "complete.obs")

In [None]:
natsal_nonwhite <- subset(natsal_analysis, ethnicgrpr==2) # create a subset of observations containing only non-white respondents
cor(natsal_nonwhite$attconservative, natsal_nonwhite$dage1ch, use = "complete.obs")

The correlation coefficient does vary by ethnic group: though the association remains small, for white respondents there is a negative association, and for non-whites there is a positive association. For non-white respondents, this can be interpreted as follows: more conservative sexual attitudes are correlated with being older when first child is born.
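
Creating one subset per group works well for two categories, but becomes repetitive with more. One possible alternative, sketched here on an invented data frame (`toy_groups`, not Natsal-3 data), is to split the rows by group and apply the same calculation to each piece. The variable names mirror those in the dataset, but the values are made up.

```r
# Toy data frame, invented for illustration (not Natsal-3 data)
toy_groups <- data.frame(
  rsex            = c(1, 1, 1, 1, 2, 2, 2, 2),
  attconservative = c(-1.0, 0.2, 0.9, NA, -0.4, 0.5, 1.3, -0.8),
  dage1ch         = c(22, 27, 31, 25, 30, 24, 21, 28)
)

# Split the rows by group, then compute the correlation within each group
groups <- split(toy_groups, toy_groups$rsex)
r_by_group <- sapply(groups, function(d) {
  cor(d$attconservative, d$dage1ch, use = "complete.obs")
})
r_by_group  # one coefficient per value of rsex
```

The same pattern should work for any grouping variable, such as `ethnicgrpr`, by changing the second argument to `split()`.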

## Conclusion