# API-201 ABC REVIEW SESSION #5
**October 14, 2022**

## Recommended workflow:
1. Download original (raw) data files and __don't modify them directly__.
2. Prepare data for analysis by creating cleaned versions of your data.
3. Upload cleaned data to Google Colab.
4. Create your Jupyter notebook on Colab.

## Structuring your data in Excel


1. First row should be variable names.
    + Variable names should start with a letter and can contain numbers, letters, underscores, and periods.
2. Each other row should contain one observation that corresponds to your unit of analysis.
3. Put all data in one Excel file, possibly across several sheets.
4. Don't rely on Excel formatting (such as cell colors or bold text) to convey information, because R can't read it.
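
The naming rule in step 1 can be checked in R before you commit to a set of column headers. The sketch below is only an illustration: the names in `candidate_names` are made up for this example, and the pattern encodes the rule exactly as stated above (start with a letter, then only letters, numbers, underscores, and periods).

```r
# Hypothetical column headers, chosen only to illustrate the rule
candidate_names <- c("country_name", "dollar_price_jan2020", "2020 price", "price ($)")

# Starts with a letter, then only letters, numbers, underscores, or periods
follows_rule <- grepl("^[A-Za-z][A-Za-z0-9_.]*$", candidate_names)

# Show each name next to whether it follows the rule
data.frame(name = candidate_names, follows_rule = follows_rule)
```

Names that fail the check (here, the one starting with a digit and the one containing spaces and symbols) are easiest to fix in the cleaned Excel file itself.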

## Uploading data to Colab

1. Create a new notebook on Colab.
2. Connect to an R runtime on Colab.
3. Click the file browser logo on the left.
4. Click the button to "Upload to session storage." Upload your data file to Colab.

__Whenever you disconnect from Colab's R runtime, your data will be removed from Colab. You'll have to re-upload it when you next start an R session on Colab.__
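
Because uploaded files disappear when the runtime disconnects, it can help to confirm the file is present before trying to load it. This is a minimal sketch, assuming the upload lands in R's working directory (which is what the file name used later in this notebook implies); `data_file` is just a variable name chosen for this example.

```r
# File name used in this notebook; change it if your file is named differently
data_file <- "bigmac_analysis.xlsx"

# Show what is currently in the working directory
print(list.files())

# Report whether the data file is available in this session
if (file.exists(data_file)) {
  message("Found ", data_file, ". Ready to load.")
} else {
  message(data_file, " not found. Re-upload it using the file browser.")
}
```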


## Loading data into R



Once you've uploaded the data file to Colab, that file is accessible to R. We want to tell R to read that file into its memory as a data frame.

Until now, we've been giving you all the code to load datasets into R in a setup cell. That cell always contained `library(tidyverse)` to load tidyverse functions like `filter`, `mutate`, etc. 

Occasionally it also contained `library(readxl)`, which loads the function `read_excel`. You can use `read_excel` to read Excel files into R. Run the cell below so you can use `read_excel`.



In [1]:
# SETUP - Run this first!
library(tidyverse) # imports tidyverse functions you've been using
library(readxl) # imports read_excel()

── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.1     ✔ forcats 0.5.2

── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



Now we are going to load data from the first sheet of `bigmac_analysis.xlsx`. `read_excel` requires a `path` argument, which specifies the name of the Excel file. You can also provide a `sheet` argument which indicates the name or number of the sheet within the Excel file. To load the first sheet, specify `sheet = 1`. You could instead omit the `sheet` argument in this case because `read_excel` reads the first sheet of the file by default.

In the cell below, we load the first sheet and name the dataset `bigmac_wide`. 

In [2]:
# Load data - specify you want to load the first sheet (same as default)
bigmac_wide <- read_excel(path = "bigmac_analysis.xlsx", sheet = 1)
head(bigmac_wide)

country_name,country_code,dollar_price_jan2020,dollar_price_jan2021,dollar_price_jan2022
<chr>,<chr>,<dbl>,<dbl>,<dbl>
United Arab Emirates,ARE,4.015627,4.015627,4.628306
Argentina,ARG,2.846887,3.748231,4.285041
Australia,AUS,4.451145,4.98474,4.50912
Azerbaijan,AZE,2.328323,2.324897,2.648617
Bahrain,BHR,3.713528,3.97878,3.97878
Brazil,BRA,4.804558,3.978491,4.312618
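
If you want to convince yourself that omitting `sheet` gives the same result as `sheet = 1`, a quick comparison like the one below may help. It is a sketch under one assumption: it uses `datasets.xlsx`, an example workbook that ships with the `readxl` package, as a stand-in so the code runs without any upload. In your own notebook you would use `"bigmac_analysis.xlsx"` as the `path`.

```r
library(readxl)

# Stand-in workbook bundled with readxl (an assumption for this sketch)
example_path <- readxl_example("datasets.xlsx")

# Read once without `sheet` and once with sheet = 1
default_sheet <- read_excel(path = example_path)
first_sheet <- read_excel(path = example_path, sheet = 1)

# TRUE when both calls returned the same data
isTRUE(all.equal(default_sheet, first_sheet))
```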


To load the second sheet of the file, you can specify `sheet = 2`. 

In the cell below, we load the second sheet and name the dataset `bigmac_long`. 

In [3]:
# Load data - specify you want to load the second sheet
bigmac_long <- read_excel(path = "bigmac_analysis.xlsx", sheet = 2)
head(bigmac_long)

country_name,country_code,year,dollar_price
<chr>,<chr>,<dbl>,<dbl>
United Arab Emirates,ARE,2022,4.628306
Argentina,ARG,2022,4.285041
Australia,AUS,2022,4.50912
Azerbaijan,AZE,2022,2.648617
Bahrain,BHR,2022,3.97878
Brazil,BRA,2022,4.312618


Alternatively, we can specify the name of the sheet rather than its number. If you are frequently reordering the sheets in your Excel file, you may prefer to load data this way.

In the cell below, we load the sheet named `long` and again name the dataset `bigmac_long`.

In [4]:
# Load data - specify you want to load sheet named "long"
bigmac_long <- read_excel(path = "bigmac_analysis.xlsx", sheet = "long")
head(bigmac_long)

country_name,country_code,year,dollar_price
<chr>,<chr>,<dbl>,<dbl>
United Arab Emirates,ARE,2022,4.628306
Argentina,ARG,2022,4.285041
Australia,AUS,2022,4.50912
Azerbaijan,AZE,2022,2.648617
Bahrain,BHR,2022,3.97878
Brazil,BRA,2022,4.312618
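
When loading by name, you need to know exactly how the sheets are spelled. The `readxl` function `excel_sheets` lists the sheet names in a file. The sketch below again uses the `datasets.xlsx` workbook bundled with `readxl` as a stand-in (an assumption so the example runs on its own); for your own data, pass `"bigmac_analysis.xlsx"` instead.

```r
library(readxl)

# Stand-in workbook bundled with readxl (an assumption for this sketch)
example_path <- readxl_example("datasets.xlsx")

# List the sheet names, in the order they appear in the file
sheet_names <- excel_sheets(example_path)
print(sheet_names)

# Load a sheet by name, using a name taken from the list above
named_sheet <- read_excel(path = example_path, sheet = sheet_names[1])
head(named_sheet)
```

Copying a name from this list into the `sheet` argument avoids typos, and the code keeps working even if the sheets are later reordered.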
