
# Week 3 Live Session

**Programming Quiz 1**

The first assignment for the course will go live on Friday at 5pm (UK time). This will take the form of a timed Moodle quiz.

You will have until Friday 20th October at 5pm (UK time) to complete the quiz. Once you begin, you will have **two hours** to finish. If you begin less than two hours before the 5pm deadline, you will only have until 5pm.

The questions for the quiz will consist of a mixture of coding tasks, based on learning material from Weeks 1-3. When answering the questions, you should provide the code required to answer the questions (so be sure to have RStudio open when you start the quiz!). You do **not** have to provide the output from the code in your answers.

**For any data manipulation questions, please use base ("classic") R commands only, and not commands from tidyverse (covered in Week 4)**.

There will be a demo assignment made available for you to practice before the main quiz. This will help you familiarise yourself with the Moodle quiz functionality. You can attempt the practice quiz as many times as you wish.

Once you complete the assessed quiz, be sure to press the "submit" button at the end. Any unsubmitted attempts will be automatically submitted when the timer runs out (which you can view in the top right of the screen).

If you have any technical issues during the assessed quiz, please submit a Good Cause claim with any supporting evidence (e.g. screenshots of the issues that happened). If this claim is accepted, you will be entitled to a fresh reattempt during the reassessment window.

# Week 3 Outline

In Week 3, we cover basic data management in R. We will look at:



*   Data frames
*   Manipulating data frames
*   Merging data sets
*   Importing and exporting data from R



# Data Frames

Last week, we looked at matrices. We could store data in a matrix, but there is an important constraint: all the entries in a matrix must be of the same type. If we mix types, R coerces every entry to a common type (character, in the example below).

In [None]:
matrix_data <- cbind(c(1,2,3),c("cat","dog","mouse"))
matrix_data
matrix_data[,1]
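
The coercion can be checked directly. The following is a minimal sketch that repeats the `cbind()` call above and inspects the type of the first column; the names `first_col_type` and `first_col_numeric` are introduced here purely for illustration.

```r
# Same construction as above: mixing numbers and strings in one matrix
matrix_data <- cbind(c(1, 2, 3), c("cat", "dog", "mouse"))

# Every entry has been coerced to character, including the numbers
first_col_type <- typeof(matrix_data[, 1])
first_col_type

# The numbers have to be converted back before doing arithmetic
first_col_numeric <- as.numeric(matrix_data[, 1])
sum(first_col_numeric)
```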

We could potentially use a list to store data (as this allows different types of variable), but lists do not enforce that each variable has to have the same number of observations.

In [None]:
list_data <- list(c(1,2,3),c("cat","dog"))
list_data

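Data frames avoid both of these issues: columns can hold different types, but every column must have the same number of rows. We can create a data frame in R by using the function `data.frame()`.

Data frames can be indexed like both matrices and lists, so we can use square brackets or the `$` operator to access elements.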

In [None]:
frame_data <- data.frame(numbers=c(1,2,3),pets=c("cat","dog","mouse"))
frame_data

frame_data$pets
frame_data[,1]

| numbers | pets |
|---|---|
| `<dbl>` | `<chr>` |
| 1 | cat |
| 2 | dog |
| 3 | mouse |
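
As a rough check of the two properties that distinguish data frames from matrices and lists, the sketch below looks at the class of each column and then tries to build a data frame from columns of unequal length. The names `column_types` and `result`, and the error message text, are made up for this illustration.

```r
frame_data <- data.frame(numbers = c(1, 2, 3),
                         pets = c("cat", "dog", "mouse"))

# Each column keeps its own type
column_types <- sapply(frame_data, class)
column_types

# Columns of different lengths are rejected (3 is not a multiple of 2)
result <- tryCatch(
  data.frame(numbers = c(1, 2, 3), pets = c("cat", "dog")),
  error = function(e) "error: columns differ in length"
)
result
```

Note that `data.frame()` will recycle a shorter vector when its length divides the longer one exactly, so a length mismatch is only caught when recycling is impossible.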


# Data Manipulation

We can add new columns to an existing data frame. This can be done by using either `cbind()`, square bracket notation `data[]` or by using `$`.

If we want to use existing variables to create a new variable, it is often most convenient to use the `transform()` function, which lets us refer to columns by name without repeating the data frame name.

In [None]:
# Let's use the CIA factbook data from Task 2
load(url(paste("https://github.com/UofGAnalyticsData/R/blob",
               "/main/Week%203/cia.RData?raw=true",sep="")))

cia$GDPpercapita <- cia$GDP/cia$Population
#cia$GDPpercapita

# Let's use transform instead
# reload data
load(url(paste("https://github.com/UofGAnalyticsData/R/blob",
               "/main/Week%203/cia.RData?raw=true",sep="")))
cia <- transform(cia, GDPpercapita=GDP/Population)
cia$GDPpercapita
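
For comparison, here is a small sketch of all the approaches side by side. It uses a made-up data frame `toy` rather than the `cia` data, and the column names `PerCapita1` to `PerCapita4` are hypothetical.

```r
# Hypothetical toy data, standing in for the cia data frame
toy <- data.frame(GDP = c(100, 200, 300), Population = c(10, 20, 60))

# 1. Using $
toy$PerCapita1 <- toy$GDP / toy$Population

# 2. Using square brackets
toy[, "PerCapita2"] <- toy[, "GDP"] / toy[, "Population"]

# 3. Using cbind()
toy <- cbind(toy, PerCapita3 = toy$GDP / toy$Population)

# 4. Using transform(), which lets us refer to columns by name
toy <- transform(toy, PerCapita4 = GDP / Population)

toy
```

All four new columns hold the same values; the choice between them is mostly a matter of readability.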

We can also drop columns from a data frame, either using `data[,-x]`, `data$var <- NULL` or `data[[x]] <- NULL`.

In [None]:
colnames(cia)
cia$GDPpercapita <- NULL
colnames(cia)
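
The three ways of dropping columns differ slightly in behaviour. A minimal sketch on a made-up data frame `toy` (not part of the course data):

```r
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Negative index: returns a copy without column 1 (toy itself is unchanged)
without_a <- toy[, -1]
colnames(without_a)

# Assigning NULL removes the column from toy itself
toy$b <- NULL
toy[["c"]] <- NULL
colnames(toy)
```

The negative-index form needs its result assigned to a name to have any lasting effect, whereas assigning `NULL` modifies the data frame directly.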

We can subset data frames in a similar way to matrices, using a logical condition inside square brackets. Alternatively, we can use the `subset()` function, which lets us refer to columns by name.

In [None]:
cia_asia <- cia[cia$Continent=="Asia",]
head(cia_asia)

cia_asia <- subset(cia, Continent=="Asia")
head(cia_asia)

| | Country | Continent | Population | Life | GDP | MilitaryExpenditure |
|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | Afghanistan | Asia | 33609937 | 44.64 | 12850000000 | 244150000 |
| 13 | Armenia | Asia | 2967004 | 72.68 | 12070000000 | 784550000 |
| 15 | Ashmore and Cartier Islands | Asia | NA | NA | NA | NA |
| 18 | Azerbaijan | Asia | 8238672 | 66.66 | 53260000000 | 1384760000 |
| 21 | Bangladesh | Asia | 156050883 | 60.25 | 83040000000 | 1245600000 |
| 28 | Bhutan | Asia | 691141 | 66.13 | 1368000000 | 13680000 |


| | Country | Continent | Population | Life | GDP | MilitaryExpenditure |
|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | Afghanistan | Asia | 33609937 | 44.64 | 12850000000 | 244150000 |
| 13 | Armenia | Asia | 2967004 | 72.68 | 12070000000 | 784550000 |
| 15 | Ashmore and Cartier Islands | Asia | NA | NA | NA | NA |
| 18 | Azerbaijan | Asia | 8238672 | 66.66 | 53260000000 | 1384760000 |
| 21 | Bangladesh | Asia | 156050883 | 60.25 | 83040000000 | 1245600000 |
| 28 | Bhutan | Asia | 691141 | 66.13 | 1368000000 | 13680000 |
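
The two approaches give the same result here, but they differ when the condition evaluates to `NA`. The sketch below uses a made-up data frame `toy` with a missing continent to show the difference, and also illustrates the `select` argument of `subset()`.

```r
toy <- data.frame(Country = c("A", "B", "C", "D"),
                  Continent = c("Asia", "Europe", NA, "Asia"),
                  Life = c(70, 80, 65, 60),
                  stringsAsFactors = FALSE)

# Square brackets return a row of NAs wherever the condition is NA
by_brackets <- toy[toy$Continent == "Asia", ]
nrow(by_brackets)

# subset() drops rows where the condition is NA
by_subset <- subset(toy, Continent == "Asia")
nrow(by_subset)

# subset() can also combine conditions and pick columns with select
asia_long_life <- subset(toy, Continent == "Asia" & Life > 65,
                         select = c(Country, Life))
asia_long_life
```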


We can also sort the rows of a data frame by using the `order()` function, which returns the row indices in sorted order. Here we sort by GDP, from smallest to largest.

In [None]:
permut <- order(cia$GDP)
cia <- cia[permut,]
head(cia)

| | Country | Continent | Population | Life | GDP | MilitaryExpenditure |
|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 170 | Niue | Other | 1398 | NA | 10010000 | NA |
| 236 | Tuvalu | Other | 12373 | 69.29 | 14940000 | NA |
| 123 | Kiribati | Other | 112850 | 63.22 | 71000000 | NA |
| 78 | Falkland Islands (Islas Malvinas) | Central/South America | 3140 | NA | 105100000 | NA |
| 8 | Anguilla | Central/South America | 14436 | 80.65 | 108900000 | NA |
| 200 | Sao Tome and Principe | Africa | 212679 | 68.32 | 160000000 | 1280000 |
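
`order()` can also sort in descending order or by several keys at once. A short sketch on a made-up data frame `toy` (the values are invented for illustration):

```r
toy <- data.frame(Country = c("A", "B", "C", "D"),
                  Continent = c("Asia", "Europe", "Asia", "Europe"),
                  GDP = c(30, 50, 40, 20),
                  stringsAsFactors = FALSE)

# order() returns the row indices that would sort the vector
order(toy$GDP)

# Largest GDP first
by_gdp_desc <- toy[order(toy$GDP, decreasing = TRUE), ]
by_gdp_desc$Country

# Sort by Continent, then by descending GDP within each continent
by_both <- toy[order(toy$Continent, -toy$GDP), ]
by_both$Country
```

By default `order()` places missing values last, which is worth remembering when sorting columns such as `GDP` that contain `NA`.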


# Merging data sets

Quite often, studies split data sets across multiple data frames. For analysis, it is often a lot easier to combine these data into one data frame for modelling purposes. (Next week, we will look at more powerful functions in `dplyr` for merging data).

We can merge multiple data sets in R by using the `merge()` function.

When merging, it is best practice to specify which columns we wish to merge by, using the `by` argument.

If the columns in both data frames have different names, you can specify these by using the `by.x` and `by.y` arguments.

In [None]:
# We shall use the data from Task 3 in the notes - patients and weights
load(url(paste("https://github.com/UofGAnalyticsData/R/blob/",
"main/Week%203/patients_weights.RData?raw=true",sep="")))

head(patients)
head(weights)

# Can see we have a matching column in PatientID

weights.patients <- merge(patients, weights, by="PatientID")
head(weights.patients)

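| | PatientID | Gender | Age | Smoke |
|---|---|---|---|---|
| | `<int>` | `<fct>` | `<int>` | `<fct>` |
| 1 | 1 | male | 33 | no |
| 2 | 2 | female | 32 | no |
| 3 | 3 | male | 67 | ex |
| 4 | 4 | male | 36 | current |
| 5 | 5 | female | 47 | current |


| | PatientID | Week | Weight |
|---|---|---|---|
| | `<int>` | `<int>` | `<int>` |
| 1 | 1 | 1 | 72 |
| 2 | 1 | 2 | 74 |
| 3 | 1 | 3 | 71 |
| 4 | 2 | 1 | 54 |
| 5 | 2 | 3 | 54 |
| 6 | 3 | 1 | 96 |


| | PatientID | Gender | Age | Smoke | Week | Weight |
|---|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<int>` | `<fct>` | `<int>` | `<int>` |
| 1 | 1 | male | 33 | no | 1 | 72 |
| 2 | 1 | male | 33 | no | 2 | 74 |
| 3 | 1 | male | 33 | no | 3 | 71 |
| 4 | 2 | female | 32 | no | 1 | 54 |
| 5 | 2 | female | 32 | no | 3 | 54 |
| 6 | 3 | male | 67 | ex | 1 | 96 |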
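
The `by.x` and `by.y` arguments were not needed for the patients data, so here is a minimal sketch of how they might be used. The data frames `people` and `visits` and their values are invented for this example; the `all = TRUE` argument is included to show what happens to rows without a match.

```r
# Hypothetical tables whose key columns have different names
people <- data.frame(PatientID = c(1, 2, 3),
                     Age = c(33, 32, 67))
visits <- data.frame(ID = c(1, 1, 2, 4),
                     Weight = c(72, 74, 54, 80))

# Default behaviour: only IDs present in both tables are kept
inner <- merge(people, visits, by.x = "PatientID", by.y = "ID")
nrow(inner)

# all = TRUE keeps unmatched rows from both sides, filling gaps with NA
outer <- merge(people, visits, by.x = "PatientID", by.y = "ID", all = TRUE)
outer
```

The key column in the result takes its name from `by.x`. To keep unmatched rows from only one side, `all.x = TRUE` or `all.y = TRUE` can be used instead of `all = TRUE`.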


# Importing and Exporting Data

**When reading and saving files into R, be sure to set the correct working directory first! (*Session > Set Working Directory > Choose directory...*)**

R has its own internal binary format called `RData`. To save any object(s) in this format, we can use the `save()` function.

If you want to save all objects in your workspace, you can use `save.image()`.

`RData` files can be loaded in by using the `load()` function.

In [None]:
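# data.file stands for any object that already exists in your workspace
save(data.file, file="MyFile.RData")
# load() restores the object under its original name
load("MyFile.RData")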
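
To see the full round trip, the sketch below saves a small made-up object to a temporary file, removes it from the workspace, and loads it back. The use of `tempfile()` and the contents of `data.file` are assumptions made for illustration only.

```r
# A small object to save (hypothetical example data)
data.file <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Use a temporary file so nothing is written to the working directory
path <- tempfile(fileext = ".RData")
save(data.file, file = path)

# Remove the object, then restore it from disk
rm(data.file)
loaded_names <- load(path)   # load() returns the names of the restored objects
loaded_names
exists("data.file")
```

Because `load()` restores objects under their original names, it will overwrite any existing object in the workspace that has the same name.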

Most data you will come across is stored in a tabular format, such as a spreadsheet. The most common functions you will use to read in these types of data are `read.table()` and `read.csv()`.

Before reading a file into R, it is always good practice to inspect the file first to determine its structure. Things to look for are:



*   Are there column names?
*   How are entries separated? (the delimiter: is it `,`, `-`, or a space?)
*   How are missing values stored? (`NA`, `-`, `NULL`, or an empty field?)
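
One way to carry out this inspection from within R is to print the first few raw lines with `readLines()` before choosing the arguments. The sketch below writes a tiny made-up file to a temporary location first; the file contents, the `;` delimiter and the name `small_cars` are all assumptions for the sake of the example.

```r
# Write a small hypothetical file: ';' as delimiter and '*' for missing values
path <- tempfile(fileext = ".txt")
writeLines(c("Model;MPG;Horsepower",
             "Camaro;19;160",
             "Achieva;*;155"), path)

# Look at the first few raw lines before choosing the arguments
readLines(path, n = 2)

# Header row present, ';' as separator, '*' marks missing values
small_cars <- read.table(path, header = TRUE, sep = ";", na.strings = "*")
small_cars
```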



In [None]:
#https://github.com/UofGAnalyticsData/R/raw/main/Week%203/cars.csv

cars <- read.csv("https://github.com/UofGAnalyticsData/R/raw/main/Week%203/cars.csv", na.strings="*")
head(cars)

cars <- read.table("https://github.com/UofGAnalyticsData/R/raw/main/Week%203/cars.csv", sep= ",", header=TRUE,  na.strings="*")
head(cars)

| | Manufacturer | Model | MPG | Displacement | Horsepower |
|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<int>` |
| 1 | Chevrolet | Camaro | 19 | 3.4 | 160 |
| 2 | Oldsmobile | Achieva | NA | 2.3 | 155 |
| 3 | Dodge | Spirit | 22 | 2.5 | 100 |
| 4 | Chevrolet | Astro | NA | 4.3 | 165 |
| 5 | Chevrolet | Corsica | 25 | 2.2 | 110 |
| 6 | Volkswagen | Corrado | 18 | 2.8 | 178 |


| | Manufacturer | Model | MPG | Displacement | Horsepower |
|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<int>` |
| 1 | Chevrolet | Camaro | 19 | 3.4 | 160 |
| 2 | Oldsmobile | Achieva | NA | 2.3 | 155 |
| 3 | Dodge | Spirit | 22 | 2.5 | 100 |
| 4 | Chevrolet | Astro | NA | 4.3 | 165 |
| 5 | Chevrolet | Corsica | 25 | 2.2 | 110 |
| 6 | Volkswagen | Corrado | 18 | 2.8 | 178 |


We can also export any data we have manipulated or created in R using the `write.table()` or `write.csv()` functions.

In [None]:
write.csv(cars, "cars.csv", row.names=FALSE)
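
The `row.names=FALSE` argument matters when the file is read back in. The following sketch, using a made-up data frame `toy` and temporary files, compares the column names obtained with and without it.

```r
toy <- data.frame(Model = c("Camaro", "Spirit"), MPG = c(19, 22))

path_with <- tempfile(fileext = ".csv")
path_without <- tempfile(fileext = ".csv")

write.csv(toy, path_with)                        # default: row names are written
write.csv(toy, path_without, row.names = FALSE)

# The default adds an extra first column holding the row names
cols_with <- colnames(read.csv(path_with))
cols_without <- colnames(read.csv(path_without))
cols_with
cols_without
```

Writing without row names is usually what we want unless the row names carry information of their own.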

# Other data types in R

Many other file formats can be read into R (too many to list them all!). Some of the most popular types though are `.xml`, SAS XPORT data, and SPSS files.

We can also import SQL databases directly into R using the `DBI` and `RSQLite` libraries.

JSON data can also be read into R using the `jsonlite` package. R can also interact with public API services easily using this format.

# Live session Task

**Task**

In this task you will use the data frame `houseprices`, which you can download using:

In [None]:
load(url(paste("https://github.com/UofGAnalyticsData/R/raw",
"/main/Week%203/houseprices.RData",sep="")))

The data frame `houseprices` contains data on property sales in and around Glasgow between July 1st and December 31st 2014. It contains the following columns:



*   Day - Day of the month of the transaction
*   Month - Month of the transaction (integer)
*   Address - Address of the property
*   Lon - Longitude of the property (degrees)
*   Lat - Latitude of the property (degrees)
*   Price - Price of the property

Provide the R code which can be used to answer the following questions

What was the average house price in October 2014?







In [None]:
mean(subset(houseprices, Month==10)$Price)

How many transactions occurred between November 15th and December 15th (inclusive)?

In [None]:
nrow(subset(houseprices, (Month==11 & Day>=15)|(Month==12 & Day<=15)))

Which house sold for the lowest and which sold for the highest price?

In [None]:
houseprices[which.min(houseprices$Price),]
houseprices[which.max(houseprices$Price),]

| | Day | Month | Address | Lon | Lat | Price |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 727 | 2 | 7 | 24 Ancroft Street, Glasgow, Glasgow City G20 7HU, UK | -4.267542 | 55.87717 | 1 |


| | Day | Month | Address | Lon | Lat | Price |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 9378 | 24 | 12 | 155 Saint Vincent Street, Glasgow, Glasgow City G2 5NW, UK | -4.261058 | 55.86182 | 27800000 |


Use `cut()` to create a new column called `PriceGroup` that takes the values low (Price $\leq$ 100,000), medium (100,000 < Price $\leq$ 250,000) and high (Price > 250,000).

In [None]:
houseprices <- transform(houseprices, PriceGroup= cut(Price, breaks=c(0,100000, 250000, Inf), labels=c("low","medium","high")))
head(houseprices)

| | Day | Month | Address | Lon | Lat | Price | PriceGroup |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 1 | 7 | 13 Hollowglen Road, Glasgow, Glasgow City G32 0ND, UK | -4.160496 | 55.85507 | 72500 | low |
| 2 | 1 | 7 | 7 Clarence Street, Clydebank, West Dunbartonshire G81 2HU, UK | -4.400987 | 55.90835 | 137995 | medium |
| 3 | 1 | 7 | 16 Clarence Street, Clydebank, West Dunbartonshire G81 2HU, UK | -4.400781 | 55.90897 | 148995 | medium |
| 4 | 1 | 7 | 10E Park Court, Clydebank, West Dunbartonshire G81 4PH, UK | -4.428328 | 55.91704 | 25000 | low |
| 5 | 1 | 7 | 30 Ashvale Crescent, Glasgow, Glasgow City G21 1NE, UK | -4.234017 | 55.88356 | 65000 | low |
| 6 | 1 | 7 | 212 Menzies Road, Glasgow, Glasgow City G21 3NE, UK | -4.215196 | 55.88965 | 80000 | low |
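
To check how `cut()` treats values that lie exactly on a break, here is a small sketch on a made-up vector `prices` (the values are chosen only to sit on and around the boundaries):

```r
prices <- c(50000, 100000, 100001, 250000, 250001)

# Default right = TRUE gives intervals (0, 100000], (100000, 250000], (250000, Inf]
groups <- cut(prices, breaks = c(0, 100000, 250000, Inf),
              labels = c("low", "medium", "high"))
as.character(groups)

# Count how many values fall in each group
table(groups)
```

With these breaks a price of exactly 0 would fall outside the first interval and become `NA`; adding `include.lowest = TRUE` would include it in the "low" group.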


Create a new variable `Dist2Uni` which contains the distance to the University in kilometres.

What was the average price of properties which are within 1km of the University?

*Hint*: Consider two locations with longitudes $\lambda_1$ and $\lambda_2$ and latitudes $\phi_1$ and $\phi_2$ (expressed in radians)

Define

$$\Delta \lambda = \lambda_2 - \lambda_1$$
$$\Delta \phi = \phi_2 - \phi_1$$
$$\alpha = \text{sin}\left( \frac{\Delta \phi}{2} \right)^2 + \text{cos}(\phi_1)\text{cos}(\phi_2) \text{sin}\left( \frac{\Delta \lambda}{2} \right)^2$$
$$d = 12742\text{tan}^{-1} \left(\frac{\sqrt{\alpha}}{\sqrt{1-\alpha}} \right)$$

then $d$ gives the distance between the two locations in kilometres.

The longitude and latitude of the university are:
$\lambda = -4.2886^{\circ}$ and $\phi = 55.8711^{\circ}$

$\text{tan}^{-1}\left(\frac{a}{b}\right)$ can be calculated in R using `atan2(a,b)`.

In [None]:
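# Convert the property coordinates from degrees to radians
lambda1 <- houseprices$Lon / 180 * pi
phi1 <- houseprices$Lat / 180 * pi
# Location of the University, also in radians
lambda2 <- -4.2886 / 180 * pi
phi2 <-  55.8711  / 180 * pi
delta.lambda <- lambda2-lambda1
delta.phi <- phi2-phi1
# Distance formula from the hint (the haversine formula)
alpha <- sin(delta.phi/2)^2 + cos(phi1)*cos(phi2)*sin(delta.lambda/2)^2
d <- 12742 * atan2(sqrt(alpha),sqrt(1-alpha))
houseprices <- transform(houseprices, Dist2Uni=d)

# Average price of properties within 1km of the University
mean(subset(houseprices, Dist2Uni <= 1)$Price)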
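
The same calculation could be wrapped in a function so that it can be sanity-checked on known inputs. This is only a sketch: the function name `haversine_km` and its argument names are made up here, and it takes coordinates in degrees. The constant 12742 is the one from the hint, which corresponds to the Earth's mean diameter in kilometres.

```r
# Great-circle distance in km between two points given in degrees
haversine_km <- function(lon1, lat1, lon2, lat2) {
  to_rad <- pi / 180
  delta_lambda <- (lon2 - lon1) * to_rad
  delta_phi <- (lat2 - lat1) * to_rad
  alpha <- sin(delta_phi / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(delta_lambda / 2)^2
  12742 * atan2(sqrt(alpha), sqrt(1 - alpha))
}

# Distance from the University to itself should be zero
same_point <- haversine_km(-4.2886, 55.8711, -4.2886, 55.8711)
same_point

# One degree of latitude should be roughly 111 km
one_degree <- haversine_km(0, 0, 0, 1)
one_degree
```

Because the function is vectorised, it could be applied to whole columns, for example `haversine_km(houseprices$Lon, houseprices$Lat, -4.2886, 55.8711)`.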