<p align="right"><i>Data Analysis for the Social Sciences - Part I - 2023-09-26</i></p>

# Week 3 - Doing Quantitative Data Analysis

Welcome to Part I of Data Analysis for the Social Sciences. In this lab session we will introduce you to the fundamental steps in quantitative data analysis. 

We will use real data from the *British Social Attitudes Survey, 2017* and we will conduct a range of simple statistical analyses of some variables relating to respondent age, sex at birth, and climate change beliefs.

### Aims

This lesson - **Doing Quantitative Data Analysis** - has two aims:
1. Demonstrate how to import and explore a secondary dataset.
2. Cultivate your computational skills through the use of the statistical programming language *R*. For example, there are a number of opportunities for you to amend or write R syntax (code).

### Lesson details

* **Level**: Introductory, for individuals with no prior knowledge or experience of quantitative data analysis.
* **Duration**: 30-45 minutes.
* **Pre-requisites**: None.
* **Programming language**: R.
* **Learning outcomes**:
	1. Understand how to use R for conducting common data exploration tasks.
	2. Understand how to describe and explore a secondary dataset containing quantitative data.

## Guide to using this notebook

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> put it:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Exploring Data*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `[]`.**

To execute a cell, click or double-click the cell and press the `Play` button next to the cell or select the `Run` button on the top toolbar (*Runtime > Run the focused cell*); you can also use the keyboard shortcuts `Shift + Enter` or `Ctrl + Enter`.

Try it for yourself:

In [None]:
name <- readline(prompt="Enter name: ")
print(paste("Hi,", name, "enjoy learning more about R and exploring data!"))

Notebooks are sequential, meaning code should be executed in order (top to bottom). For example, the following code won't work:

In [None]:
x * 5

As the error message suggests, there is no object (variable) called `x`, therefore we cannot do any calculations with it. 

Let's try a sequential approach:

In [None]:
x <- 10 # create an object called 'x' and give it the value '10'

In [None]:
x * 5 # multiply 'x' by 5

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

### Learner input

Throughout the lessons there are times when you need to do the following activities:
* **TASK:** A coding task for you to complete (e.g. create new variables).
* **QUESTION:** A question regarding your interpretation of some code or a technique (e.g. what is the piece of code doing?).
* **EXERCISE:** A data analysis challenge for you to complete.

## Exploring Data

Data exploration is an important first step in the quantitative data analysis process. It involves a mix of functional and analytical tasks that together give you a keen sense of the data. For example, it is important to know how many variables are relevant for our analysis, how many observations are in the sample, whether there is missing data for some of our variables, and whether the dataset "looks right" or there were problems downloading and importing the data.

### Secondary data

For this lesson we will use the open access version of the *British Social Attitudes Survey, 2017, Environment and Politics* dataset: [available here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8849)

This provides a limited number of variables and observations from the original, larger study.

For Assessment 2, you can choose to analyse one of two datasets:
* British Social Attitudes Survey, 2017, Environment and Politics
* British Social Attitudes Survey, 2019, Poverty and Welfare

### Importing data

Importing data - also known as loading or reading in data - is the natural first step in a piece of quantitative data analysis. It requires two pieces of information:
1. The location of the dataset on your machine / on the internet
2. The file type of the dataset

The first refers to the specific folder / directory on your machine where the dataset resides (e.g., "C:\Users\mcdonndz-local\data\"); the second to the type of file (e.g., MS Excel (.xlsx)).

Don't worry if the file management aspects are unfamiliar / daunting just now: we will get plenty of practice importing and saving files to different locations on our machine.

OK, let's import a dataset into *R* for the first time!

In [None]:
bsa <- read.csv("https://raw.githubusercontent.com/DiarmuidM/data-analysis-for-the-social-sciences-2023/main/lessons/data/bsa2017_open_enviropol.csv", header=TRUE, na="NA")
head(bsa) # view the first six observations

There were no error messages, so our *R* command must have worked (it did). Here is what the above command does:
* imports a file using the `read.csv()` command
* recognises that the first line of the file contains the variable names (`header=TRUE`)
* recognises missing values for variables (`na="NA"`)

Finally, we need an object name for the imported dataset so that we can reference it in future commands (`bsa`). The `<-` is the assignment operator i.e., import the dataset and assign it to the object `bsa`. 
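To see what the header and missing-value options do in isolation, here is a minimal sketch that reads a tiny, made-up dataset typed directly into the cell. The `csv_text` and `toy` objects and their values are hypothetical and are not part of the survey. The sketch spells the missing-value argument out in full as `na.strings`; the shorter `na` used above should resolve to the same argument through R's partial argument matching.

```r
# A tiny, made-up CSV typed in as text (hypothetical values, not survey data)
csv_text <- "id,age,sex
1,34,2
2,NA,1
3,58,2"

# Read the text as if it were a file: the first line holds the variable names,
# and the string "NA" is treated as a missing value
toy <- read.csv(text = csv_text, header = TRUE, na.strings = "NA")

names(toy)      # variable names taken from the first line
is.na(toy$age)  # TRUE wherever the age value is missing
```

Reading from text rather than a file or URL keeps the example self-contained; the same arguments apply when the first argument is a file location.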

#### Viewing the dataset

One of the most common ways of exploring a dataset is by visually inspecting its contents. We can do so by referencing the name we gave the dataset:

In [None]:
bsa

Doing so returns some useful output:
* A description of the dataset (`data.frame` containing 3988 observations and 25 variables).
* A truncated list of the observations in the dataset - by default *R* and other languages do not display the full dataset.

In [None]:
head(bsa)

In [None]:
tail(bsa)

**QUESTION:** What do you think the `head()` and `tail()` commands do?

We can view a list of all of the variables in the dataset:

In [None]:
names(bsa)

And get a sense of each variable's contents like so:

In [None]:
str(bsa)

We can see that each variable contains numeric values; however, this does not mean there are no categorical variables in the dataset. Slightly confusing, I know: but remember that all quantitative data must be represented numerically, otherwise we have no way of counting or performing calculations. Therefore we need to know what the numbers **represent**. Let's do this using some obvious examples:
* `Rsex` is a binary measure of whether a respondent is male or female. As is common in quantitative social science, the male category is coded "1" and the female category "2". Therefore this is a categorical variable, specifically a nominal variable.
* `RAgeCat` is a measure of which age group a respondent belongs to e.g., 18-24, 25-34. The youngest age group is coded "1" and subsequent groups are coded sequentially. Therefore this is a categorical variable, specifically an ordinal variable.
* `CCBELIEV` is a measure of whether a respondent agrees that climate change is real and/or caused by human action. Respondents could choose one of three options, therefore this is a categorical variable: it could be argued to be nominal or ordinal depending on your view - we will treat it as nominal.

How do we know what the numeric values represent? Helpfully, the UK Data Service has provided a description of each variable and its contents (known as a *codebook*): [Open Data Codebook](https://github.com/DiarmuidM/data-analysis-for-the-social-sciences-2023/blob/main/lessons/codebooks/8849_bsa_open_enviropol_2017_codebook.pdf)

#### Summarising variables

Once we have an understanding of the data at a macro level (i.e., number of observations and variables, variable names and types), we can start to explore specific variables in more detail. Let's do this for three variables of interest:
* `Rsex`
* `RAgeCat`
* `CCBELIEV`

In [None]:
summary(bsa$Rsex)

In [None]:
summary(bsa$RAgeCat)

In [None]:
summary(bsa$CCBELIEV)

The `summary()` command produces a range of summary statistics for numeric variables, including the mean, median and some measures of the range of values (e.g., minimum and maximum). Note how we refer to the variables by listing the dataset first (`bsa$Rsex`). This is because we may have multiple datasets loaded into R at one time and there may be variables in different datasets with the same name.
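As a small illustration of why the dataset name comes first, the sketch below creates two made-up datasets that both contain a variable called `Rsex`. The objects `survey_a` and `survey_b` and their values are hypothetical.

```r
# Two hypothetical datasets that share a variable name
survey_a <- data.frame(Rsex = c(1, 2, 2))
survey_b <- data.frame(Rsex = c(1, 1, 2, 1))

# The dataset name before the `$` tells R which `Rsex` we mean
table(survey_a$Rsex)
table(survey_b$Rsex)
```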

As we learned earlier however, some of our variables are categorical and thus are not well described using these summary statistics: for example, what does it mean to say a respondent's average sex is 1.547? We know this is a binary measure of whether somebody is male or female and thus it would be better to know how many respondents fall into each category.
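Counting respondents per category can be sketched with `table()`, shown here on a short made-up vector of codes (the `sex_codes` values are hypothetical and are not taken from the survey):

```r
# Hypothetical codes: 1 = male, 2 = female
sex_codes <- c(1, 2, 2, 1, 2, 2)

mean(sex_codes)  # a number, but not a meaningful one for a categorical variable

sex_counts <- table(sex_codes)  # how many respondents fall into each category
sex_counts

round(prop.table(sex_counts) * 100, 1)  # the same information as percentages
```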

In Part II of the module we will learn how to select and apply the correct statistical summaries.

**TASK:** Produce a summary of the `Married` (marital status) and `leftrigh` (how left / right leaning a respondent is) variables. Can you learn anything useful about the values of these variables using the `summary()` command? Consult the codebook if you need help understanding the values of these variables.

Use the cell below to insert your code.

In [None]:
# INSERT CODE HERE

## Analysing Data

In Part II we will learn how to conduct a range of insightful, accurate analyses of the *bsa* dataset. To whet your appetite, here is a mini-analysis of the following research question:

<p><center><i>Are climate change beliefs associated with sex and age among British people?</i></center></p>

### Import the data

In [None]:
bsa <- read.csv("https://raw.githubusercontent.com/DiarmuidM/data-analysis-for-the-social-sciences-2023/main/lessons/data/bsa2017_open_enviropol.csv", header=TRUE, na="NA")
head(bsa) # view the first six observations

### View the data

In [None]:
head(bsa)

### Data cleaning

We are only interested in respondents who provided a valid answer to the climate change question. For example, some respondents may have refused to answer, some do not have a belief or opinion, some were not asked as part of the sampling design etc. Therefore we need to perform some simple data cleaning.

In [None]:
table(bsa$CCBELIEV)

Having consulted the codebook, we know that the value "-1" indicates that this question was not relevant to a respondent, and "8" that a respondent answered "Don't know". Therefore let's tell *R* to ignore these values when summarising / analysing the `CCBELIEV` variable.

In [None]:
bsa$CCBELIEV[bsa$CCBELIEV==-1 | bsa$CCBELIEV==8] <- NA # convert "-1" and "8" to missing

In [None]:
table(bsa$CCBELIEV)

In [None]:
summary(bsa$CCBELIEV)

Note there are now 1,048 respondents with missing values for this variable i.e., we ignore "-1" and "8" when producing statistical summaries.
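To make the effect of the recoding easier to see, here is a minimal sketch on a short made-up vector, using the same pattern as above. The `beliefs` values are hypothetical.

```r
# Hypothetical codes; -1 and 8 are not valid answers
beliefs <- c(1, 3, -1, 2, 8, 3, 3)

# Same recoding pattern as above: convert "-1" and "8" to missing
beliefs[beliefs == -1 | beliefs == 8] <- NA

table(beliefs)                   # missing values are left out by default
table(beliefs, useNA = "ifany")  # ask for missing values to be counted too
sum(is.na(beliefs))              # number of missing values
```

The `useNA = "ifany"` option is a handy check that the number of missing values matches what you expect after cleaning.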

Let's quickly do the same for the age variable.

In [None]:
bsa$RAgeCat[bsa$RAgeCat==-1 | bsa$RAgeCat==8] <- NA # convert "-1" and "8" to missing

### Univariate analysis

Every analysis begins with a summary of individual variables.

In [None]:
table(bsa$Rsex)
barplot(table(bsa$Rsex))

We see that there are more females than males in the sample.

In [None]:
table(bsa$RAgeCat)
barplot(table(bsa$RAgeCat))

The age distribution is skewed towards the oldest category (65+).

In [None]:
table(bsa$CCBELIEV)
barplot(table(bsa$CCBELIEV))

The vast majority of respondents believe that climate change is taking place and is, at least partly, a result of human actions.

### Bivariate analysis

The next step in our analysis is to see if our two variables are associated or not. That is, does knowing somebody's age help us predict what they believe about climate change? Or vice versa: does knowing somebody's climate change beliefs tell you anything about what age they are?

The first step is to examine the joint distribution of the two variables using a crosstabulation.

In [None]:
table(bsa$RAgeCat, bsa$CCBELIEV) 

Hmm, it's a little bit tricky to interpret due to the lack of labels and percentages. Let's tidy up the variables before producing the crosstabulation.
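Tidying up here means converting the numeric codes into labelled categories with `factor()`. As a minimal sketch of how `factor()` behaves, consider a short made-up vector of codes (the `codes` values, including the stray "9", are hypothetical):

```r
# Hypothetical codes; 9 is not one of the listed levels
codes <- c(1, 2, 2, 1, 9)

# Attach a label to each listed code
sex_factor <- factor(codes, levels = c(1, 2), labels = c("Male", "Female"))

sex_factor         # any code not listed in `levels` becomes NA
table(sex_factor)  # counts are now shown against labels, not numbers
```

Because unlisted codes silently become missing, it is worth checking the codebook before choosing the `levels`.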

In [None]:
bsa$Rsex <- factor(bsa$Rsex, levels = c(1,2), labels = c("Male", "Female"))

In [None]:
bsa$RAgeCat <- factor(bsa$RAgeCat, levels = c(1,2,3,4,5,6,7), labels = c("18-24", "25-34", "35-44", "45-54", "55-59", "60-64", "65+"))

In [None]:
bsa$CCBELIEV <- factor(bsa$CCBELIEV, levels = c(1,2,3), labels = c("Dont believe", "Believe but no human cause", "Believe and human cause"))

In [None]:
cc_age_table <- table(bsa$RAgeCat, bsa$CCBELIEV) # store the results of the `table()` command in an object called 'cc_age_table'

In [None]:
round(prop.table(cc_age_table, 1)* 100, 0) # display row percentages

There does not appear to be any association between respondent age and climate change beliefs: the vast majority of people in each age group believe climate change is real and is at least partly caused by human action. 

We can confirm this using an appropriate measure of association:

In [None]:
# if the package is not installed yet, run install.packages("DescTools") first
library(DescTools) # load in package with measure of association function
CramerV(bsa$RAgeCat, bsa$CCBELIEV)
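
For those curious about what a measure like Cramér's V captures, the sketch below calculates it by hand in base R for a small made-up table. It uses the standard formula: the square root of the chi-squared statistic divided by the sample size times (k - 1), where k is the smaller of the number of rows and columns. The counts are hypothetical, and `CramerV()` may offer options (such as a bias correction) that are not shown here.

```r
# Hypothetical counts: rows are age groups, columns are beliefs
toy_counts <- matrix(c(30, 10,
                       20, 20,
                       10, 30),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(age = c("Young", "Middle", "Old"),
                                     belief = c("Agree", "Disagree")))
toy_table <- as.table(toy_counts)

# Pearson chi-squared statistic for the table
chi_sq <- as.numeric(chisq.test(toy_table, correct = FALSE)$statistic)

n <- sum(toy_table)       # total number of observations
k <- min(dim(toy_table))  # the smaller of the number of rows and columns

cramers_v <- sqrt(chi_sq / (n * (k - 1)))
cramers_v  # 0 indicates no association, 1 a perfect association
```

In this made-up table the pattern of agreement changes markedly across the age groups, so the value sits well above zero (roughly 0.41).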

**TASK:** Produce a bivariate analysis of the `Rsex` (sex at birth) and `CCBELIEV` variables.

Use the cell(s) below to insert your code.

In [None]:
# INSERT CODE HERE

In [None]:
# INSERT CODE HERE

### Multivariate analysis

There may be no apparent association between a respondent's age and their climate change beliefs, but is this true for males and females alike?

Multivariate analysis allows us to see if the patterns we find are consistent across values of other variables.

#### By sex

In [None]:
sac_table <- table(bsa$Rsex, bsa$RAgeCat, bsa$CCBELIEV) # create crosstabulation - variable 1 is the control variable,
# variable 2 is the X (predictor) variable, variable 3 is the Y (outcome) variable.

sac_prop_table <- addmargins(prop.table(sac_table, c(1,2)), 3) # calculate proportions
sac_perc_table <- round(ftable(sac_prop_table) * 100, 0) # convert proportions to percentages
sac_perc_table # display table
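
The second argument of `prop.table()` controls which totals the proportions are based on. As a minimal sketch on a made-up two-way table (the `age_group` and `belief` values are hypothetical): `1` gives row percentages, `2` gives column percentages. In the three-way table above, `c(1,2)` should work in the same spirit, so that percentages add up to 100 within each combination of sex and age group.

```r
# Hypothetical data: seven respondents in two age groups
age_group <- c("Young", "Young", "Young", "Old", "Old", "Old", "Old")
belief    <- c("Yes", "Yes", "No", "Yes", "No", "No", "No")

toy_table <- table(age_group, belief)
toy_table

row_perc <- round(prop.table(toy_table, 1) * 100, 0)  # each row sums to 100
col_perc <- round(prop.table(toy_table, 2) * 100, 0)  # each column sums to 100

row_perc
col_perc
```

Row percentages answer "within each age group, what share hold each belief?", which is usually the comparison we want when age is the predictor.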

In [None]:
bsa_males <- subset(bsa, Rsex=="Male") # create a subset of observations containing only male respondents
CramerV(bsa_males$RAgeCat, bsa_males$CCBELIEV)

In [None]:
bsa_females <- subset(bsa, Rsex=="Female") # create a subset of observations containing only female respondents
CramerV(bsa_females$RAgeCat, bsa_females$CCBELIEV)

The measure of association is slightly different across sexes: there is a stronger association between age and climate change beliefs for males. But given there are more females in the sample, the overall measure of association is closer to that in the female cohort.

## Conclusion

Hopefully this notebook has given you a sense of what quantitative data analysis entails:
* Importing datasets
* Exploring observations
* Summarising variables
* Writing syntax

In Part II of the module we delve into each of these steps in detail.