# Introduction to R for Social Science

New structure for the Python intro:

- The R language
- What are objects/variables?
    - Exercise
- What are functions? / Introduction to functions
- Packages and importing packages
- Loading datasets
    - Exercise: Load a package and load data
- Getting an overview of data with descriptive measures
- DataFrames and vectors
- Subsetting data (base R)
- Subsetting data (dplyr)
  - Exercise: creating a subset
- New variables and recoding (base R)
- New variables and recoding (dplyr)
  - Exercise with a new variable; bonus exercises are kept
- Classes in a dataframe
  - Exercise with type conversion
- Visualization (either at the end or after loading data)
  - Exercise?
- Saving data

# Introduction
## What is R and why use it?

R is a free software environment with its own programming language. 

It is especially suited for statistical analysis and graphical output.

R's popularity as a data science tool, combined with it being open source, has led to a wide range of applications.

R can work with a large variety of data formats and is (with a few add-ons) compatible with data from other software solutions (Excel, SPSS, SAS, Stata).

## A Short Demo

(demo on RStudio Server)

## Content of the R introduction

- The RStudio environment
- The R language
- Constructing variables and objects
- Functions and libraries/packages
- Working with dataframes (table structure in R)
- Simple data handling in R
- Simple visualizations in R (`ggplot2` package)

The introduction will combine presenting R and R code in Jupyter Notebook while demonstrating in RStudio. You are encouraged to work and write in RStudio during the workshop.

*Please write along as we go through the different examples.*

## The RStudio environment

During the workshop (and the other R workshops in CALDISS), we will be working with RStudio.

RStudio is an IDE (Integrated Development Environment) for R that makes for a nicer workspace.

<https://www.rstudio.com/products/rstudio/download/>

# The R language
R has its own programming language. You work in R by writing lines of code in that language (writing commands), which R then interprets (running commands).

R (and RStudio) has a limited graphical user interface, meaning that almost all functionality (statistics, plots, simulations etc.) must be executed using code in the R language.

## R as a calculator
So what does it mean that R interprets our code?
It means that you tell R to do something by writing a command and R will do that (if R can understand you).

R, for example, understands mathematical expressions:
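A few expressions of the kind meant here, as a minimal sketch. The specific numbers are only illustrative and are not taken from the workshop demo:

```r
2 + 3         # addition: 5
10 / 4        # division: 2.5
2^3           # exponentiation: 8
(5 + 3) * 2   # parentheses control the order of operations: 16
7 %% 3        # remainder after division: 1
```

Each line is a command. R evaluates it and prints the result in the console.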

# The R Language: Objects and Functions

R works by storing values and information in "objects". These objects can then be used in various commands like calculating a statistical model, saving a file, creating a graph and so on. To simplify a bit: An object is some kind of stored information and a function is something that can manipulate that stored information (which then creates a new object). 

Most of R can be boiled down to these 3 basic steps:

1. Assign values to an object
2. Make sure R interprets the object correctly (its class)
3. Perform some operation or manipulation on the object using a function

Translated to data analysis, the steps would (in general terms) look as follows:

1. Load our dataset: `data <- read.csv("my_datafile.csv")`
2. Check that the variables are the correct class: `class(data$age)`
3. Perform some kind of analysis: `mean(data$age)`

The gap between these steps, of course, varies greatly.
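The three steps can be sketched end to end as follows. This is only an illustration: the file contents (three made-up rows with `name` and `age`) are an assumption, written to a temporary file so that the example runs without `my_datafile.csv` being present.

```r
# Hypothetical data written to a temporary file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "anna,34", "bo,51", "carl,29"), csv_path)

# 1. Load the dataset into an object
data <- read.csv(csv_path)

# 2. Check that the variable has the expected class
class(data$age)   # "integer"

# 3. Perform some kind of analysis
mean(data$age)    # 38
```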

## Objects

A lot of writing in R is about defining objects: names used to call up stored information.

Objects can be a lot of things: 
- a word
- a number
- a series of numbers
- a dataset 
- a URL
- a formula
- a result 
- a filepath
- a series of datasets
- and so on...

When an object is defined, it is available in the current working space (or environment).

This makes it possible to store and work with a variety of information simultaneously.

### Defining objects
Objects are defined using the `<-` operator (RStudio shortcut: `Alt` + `-`):

In [11]:
year <- 1964

In [2]:
year

Once defined, the object can be used like any other numeric value.

In [6]:
year + 10

Using `' '` or `" "` denotes that the input should be read as text. *This also applies to numbers!*

In [3]:
name <-  "keenan"

In [4]:
name

In [8]:
year_now  <- '2021'

In [9]:
year_now

Notice that numbers stored as text will be enclosed in quotes. Numbers stored as text cannot immediately be used as numbers:

In [10]:
year_now - 5

ERROR: Error in year_now - 5: non-numeric argument to binary operator


This error happens because R differentiates between objects by assigning them to a specific *class*. The class determines what can be done with the object.
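The class of an object can be inspected with `class()`. A minimal sketch, redefining the two objects from above so that it runs on its own:

```r
year <- 1964
year_now <- "2021"

class(year)            # "numeric"
class(year_now)        # "character"

# is.numeric() asks directly whether arithmetic is possible
is.numeric(year)       # TRUE
is.numeric(year_now)   # FALSE
```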

### Naming objects
Objects can be named almost anything but a good rule of thumb is to use names that are indicative of what the object contains.

#### Restrictions for naming objects
- Most special characters are not allowed: `/`, `?`, `*`, `+` and so on (most characters mean something to R and will be read as an expression)
- Avoid names that already exist in R (your object will take the place of the existing function/object in the environment)

#### Good naming conventions 
- Separate words with `_`: `my_object`, `room_number`

or:

- Capitalize each word except the first: `myObject`, `roomNumber`
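A short sketch of the naming rules in practice. The helper `make.names()` is a base R function that is not part of the original material; it is used here only to show how R would repair an invalid name.

```r
# Both conventions produce valid names; pick one and stick with it
room_number <- 12   # words separated with _
roomNumber <- 12    # each word after the first capitalized

# Names containing a space or an operator are not valid:
# room number <- 12   # syntax error
# room-number <- 12   # read as "room minus number"

# make.names() shows how R would turn an invalid name into a valid one
make.names("room number")   # "room.number"
```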

## Functions

Functions are commands that transform an object in some way and return an output.

The input to a function is called an "argument". The number of arguments varies between functions.

Functions have the basic syntax: `function(arg1, arg2, arg3)`.

Some arguments are required while others are optional.

In [15]:
name <- 'kilmister'
toupper(name) #Returns the object in upper-case

Most functions take the object as the first input, but not all do.

In [20]:
gsub("e", "a", name) #Replace all e's with a's
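Arguments can also be given by name, which makes the order irrelevant and the call easier to read. A small sketch using the argument names of `gsub()` (`pattern`, `replacement`, `x`):

```r
name <- "kilmister"

# Positional arguments: the order matters
by_position <- gsub("e", "a", name)

# Named arguments: the order no longer matters
by_name <- gsub(x = name, replacement = "a", pattern = "e")

by_position                       # "kilmistar"
identical(by_position, by_name)   # TRUE
```

Running `?gsub` in the console opens the help page listing a function's arguments.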

### Functions and their outputs

Note that functions almost *never* change the object itself. When you call a function, you are asking R for a specific output, not to change anything.

To keep the output of a function, you have to store it in an object.

In [21]:
name # Unchanged even though it was used in functions

In [22]:
name <- gsub("e", "a", name) # Store object with characters "e" swapped with "a" - replacing the object

In [23]:
name

# EXERCISE 1: OBJECTS AND FUNCTIONS

1. Define the following objects:

    - `name1`: `"araya"`
    - `name2`: `"townsend"`
    - `year1`: `1961` (without quotation marks)
    - `year2`: `"1972"` (with quotation marks)

2. Try calculating the ages for `year1` and `year2` (current year - year-object). 

3. Use the function `toupper()` to convert `name1` to upper-case.

# EXERCISE 1: DEFINING OBJECTS
*What happens?*

In [27]:
name1 <- "araya"
name2 <- "townsend"
year1 <- 1961
year2 <- "1972"

In [28]:
age1 <- 2021 - year1 # Works fine

age1

In [30]:
age2 <- 2021 - year2 # Produces an error

ERROR: Error in 2021 - year2: non-numeric argument to binary operator
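One way to make the calculation work is to convert the text to a number first. This is a sketch using the base R function `as.numeric()`; the object name `year2_num` is introduced here only for illustration.

```r
year2 <- "1972"

# Convert the text to a number and store the result in a new object
year2_num <- as.numeric(year2)
class(year2_num)   # "numeric"

age2 <- 2021 - year2_num
age2               # 49
```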


In [31]:
name1 <- toupper(name1)

name1

# R Libraries - Packages 

Because R is open source, many developers are constantly adding new functions to R.
These new functions are distributed as *R packages* that can be installed into your R library and loaded when needed.

All the commands you have been using so far have been part of the `base` package (ships with R). 

Packages are installed using (name of package *with* quotes!): 

`install.packages('packagename')` 

The functions from the package are loaded into the environment using (name of package *without* quotes!):
    
`library(packagename)` 

Information for installed packages can be found using (name of package *with* quotes!):

`library(help = 'packagename')` 
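A common pattern is to install a package only if it is missing and then load it. The sketch below uses `tools` purely as a stand-in, because it ships with R and therefore nothing is actually downloaded. In practice you would substitute the package you need.

```r
# "tools" is a stand-in package name chosen because it ships with R
pkg <- "tools"

# requireNamespace() returns TRUE if the package is installed
is_installed <- requireNamespace(pkg, quietly = TRUE)
if (!is_installed) {
  install.packages(pkg)   # name of package with quotes (here stored in pkg)
}

library(tools)            # name of package without quotes

# Functions from the package are now available
file_ext("my_datafile.csv")   # "csv"
```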

## Importing data with `readr`

`readr` is a package for reading various data files into R.

Base R does have functions for doing this (such as `read.csv()`), but `readr` is typically faster.

`readr` is part of a collection of packages called `tidyverse`: <https://www.tidyverse.org/>

In [32]:
library(readr)

data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

"Missing column names filled in: 'X1' [1]"
Parsed with column specification:
cols(
  X1 = col_double(),
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)

