<p align="right"><i>Data Analysis for the Social Sciences - Part I - 2023-09-19</i></p>

# Week 2 - Introduction to Data and R

Welcome to Part I of Data Analysis for the Social Sciences. In this lab session we will introduce you to quantitative data and the R programming language. 

We will use real data on registered charities in Scotland provided by the Scottish Charity Regulator. We will attempt to analyse these data before turning our attention to how we can use a programming language to perform efficient, transparent and effective data analyses.

### Aims

This lesson - **Introduction to Data and R** - has two aims:
1. Understand the structure and contents of a dataset containing numeric information.
2. Cultivate your computational skills through the use of the statistical programming language *R*. For example, there are a number of opportunities for you to amend or write R syntax (code).

### Lesson details

* **Level**: Introductory, for individuals with no prior knowledge or experience of quantitative data analysis.
* **Duration**: 30-45 minutes.
* **Pre-requisites**: None.
* **Programming language**: R.
* **Learning outcomes**:
	1. Understand how to use R for conducting common data exploration tasks.
	2. Understand how to describe and explore a secondary dataset containing quantitative data.

## Guide to using this notebook

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Understanding Data*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `[]`.**

To execute a cell, click or double-click the cell and press the `Play` button next to the cell, or select the `Run` button on the top toolbar (*Runtime > Run the focused cell*). You can also use the keyboard shortcuts `Shift + Enter` or `Ctrl + Enter`.

Try it for yourself:

In [3]:
name <- readline(prompt="Enter name: ")
print(paste("Hi", name,", enjoy learning more about R and exploring data!"))

Enter name:  DDDD


[1] "Hi DDDD , enjoy learning more about R and exploring data!"


Notebooks are sequential, meaning code should be executed in order (top to bottom). For example, the following code won't work:

In [None]:
x * 5

As the error message suggests, there is no object (variable) called `x`, therefore we cannot do any calculations with it. 

Let's try a sequential approach:

In [None]:
x <- 10 # create an object called 'x' and give it the value '10'

In [None]:
x * 5 # multiply 'x' by 5

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

### Learner input

Throughout the lessons there are times when you need to do the following activities:
* **TASK:** A coding task for you to complete (e.g. create new variables).
* **QUESTION:** A question regarding your interpretation of some code or a technique (e.g. what is the piece of code doing?).
* **EXERCISE:** A data analysis challenge for you to complete.

## Understanding Data

Data exploration is an important first step in the quantitative data analysis process. It involves a mix of functional and analytical tasks that together give you a keen sense of the data. For example, it is important to know how many variables are relevant for our analysis, how many observations are in the sample, whether there are missing data for some of our variables, and whether the dataset "looks right" or there were problems downloading and importing the data.
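
The checks described above can be sketched in a few lines of R. This is a minimal illustration on a made-up dataset, not the charity register: the object `toy` and its columns `charity_id` and `income` are hypothetical and exist only for this example.

```r
# Hypothetical toy dataset: three charities, one with a missing income value
toy <- data.frame(
  charity_id = c("C001", "C002", "C003"),
  income = c(12000, NA, 560000),
  stringsAsFactors = FALSE
)

n_obs <- nrow(toy)                # number of observations (rows)
n_vars <- ncol(toy)               # number of variables (columns)
n_missing <- colSums(is.na(toy))  # count of missing values in each variable

print(n_obs)
print(n_vars)
print(n_missing)
str(toy)  # variable types and first values, to check the data "look right"
```

The same four functions work unchanged on a dataset with thousands of rows, which is where checking by eye stops being practical.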

### Secondary data

For this session we will use a secondary dataset (i.e., one we did not collect ourselves) containing information on currently registered Scottish charities. The dataset can be found here: [Scottish Charity Register](https://raw.githubusercontent.com/DiarmuidM/data-analysis-for-the-social-sciences-2023/main/lessons/data/CharityExport-13-Sep-2023.csv)

### Examining data

Open the dataset in a spreadsheet application (e.g., MS Excel) and answer the following questions:
1. How many charities are currently registered in Scotland?
2. How many variables does the dataset contain?
3. Are there any duplicate cases (i.e., charities) in the dataset?
4. Are there any variables that are not suitable for quantitative data analysis (i.e., are they qualitative variables)? If so which ones?
5. Which charity is the oldest in Scotland?
6. Which local authority has the most charities?
7. How many charities have a parent organisation?
8. What is the minimum, maximum and median income?
9. What is the issue with the `Activities` variable?
10. How did you find the process of answering these questions (easy | frustrating | slow)?

You can choose whatever approach / application you like for examining the dataset:
* MS Excel
* [OpenRefine](https://openrefine.org/)
* [Tableau](https://www.tableau.com/en-gb/trial/tableau-software)

But I recommend MS Excel as it is the easiest to use in our labs.

You can type your answers in the box below or into MS Word, a text file, notepad etc.

* A1.
* A2. 
* A3.
* A4. 
* A5. 
* A6. 
* A7. 
* A8. 
* A9. 
* A10. 

## Using Software

Analysing quantitative data manually (e.g., just by scrolling through observations and eyeing values) is near impossible for any dataset containing more than a few observations and variables. Therefore we need to employ an appropriate software application to assist us. For the above task MS Excel was probably good enough for answering most or all of the questions. However, I think you'll agree that adopting such an approach can be:
* Cumbersome and inefficient
* Inaccurate (lots of opportunity for human error)
* Opaque (only you know the steps you took to arrive at the answer)
* Difficult to replicate (repeating an analysis - by you or a colleague - is very difficult and slow without an accurate account of the steps you took)

Therefore modern social science data analysis is typically conducted using software or programming languages that allow you to quickly and accurately perform a wide range of analyses in a way that is replicable and transparent.

To prove this, let's answer the questions above using the R programming language.

In [None]:
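creg <- read.csv("https://raw.githubusercontent.com/DiarmuidM/data-analysis-for-the-social-sciences-2023/main/lessons/data/CharityExport-13-Sep-2023.csv") # import the register from the URL given under 'Secondary data'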
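
Importing a CSV file can also be rehearsed without a network connection. The sketch below is only an illustration of how `read.csv()` turns comma-separated text into a data frame. The names `csv_text` and `creg_demo`, the column names and the three rows are all invented, and the real export may be structured differently.

```r
# Hypothetical CSV content standing in for the real export;
# the column names and values are invented for illustration.
csv_text <- "charity_number,charity_name,income
SC000001,Example Trust,12000
SC000002,Sample Fund,560000
SC000003,Demo Society,NA"

# read.csv() accepts a file path, a URL or (as here) literal text
creg_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

names(creg_demo)  # variable names
head(creg_demo)   # first few observations
str(creg_demo)    # type of each variable
```

Inspecting the result with `names()`, `head()` and `str()` straight after importing is a quick way to confirm that the file was read as expected.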

### How many charities are currently registered in Scotland?
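
If each row of the dataset is one charity, the number of rows answers this question, provided there are no duplicate cases. A minimal sketch, assuming a hypothetical stand-in data frame `creg_demo` with an invented identifier column `charity_number`; in the lab you would apply the same functions to the imported register and its actual identifier variable.

```r
# Hypothetical stand-in for the register; the last row is a deliberate duplicate
creg_demo <- data.frame(
  charity_number = c("SC000001", "SC000002", "SC000003", "SC000003"),
  income = c(12000, 560000, 8000, 8000)
)

n_rows <- nrow(creg_demo)                                  # rows in the dataset
n_charities <- length(unique(creg_demo$charity_number))    # distinct identifiers
n_duplicates <- sum(duplicated(creg_demo$charity_number))  # repeated identifiers

print(c(rows = n_rows, charities = n_charities, duplicates = n_duplicates))
```

When `n_duplicates` is zero, the row count and the number of charities are the same.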

## Conclusion

Hopefully this notebook has given you a sense of what quantitative data analysis entails:
* Importing datasets
* Exploring observations
* Summarising variables
* Writing syntax
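
As a small illustration of the "summarising variables" step, the sketch below computes a minimum, maximum and median. The `income` values are hypothetical and are not taken from the charity register.

```r
# Hypothetical incomes (in pounds) for five charities; NA marks a missing value
income <- c(12000, 560000, NA, 8000, 43000)

# na.rm = TRUE drops missing values before calculating each statistic
income_min <- min(income, na.rm = TRUE)
income_max <- max(income, na.rm = TRUE)
income_median <- median(income, na.rm = TRUE)

print(c(min = income_min, max = income_max, median = income_median))
summary(income)  # several summary statistics, plus a count of NAs, in one call
```

Because the steps are written down as syntax, anyone can rerun them and arrive at the same figures.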