<a href="https://colab.research.google.com/github/ShoSato-047/R_review/blob/main/Introduction_to_R_in_Google_Colab_R_syntax.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook I introduce you to R syntax and data objects.  You can run this code in base R or RStudio; it doesn't need to be within a Colab environment!  Colab is the *environment*, not the *code*.  

# Using R as a calculator
At its most basic, you can use R to do anything a calculator can do.

In [None]:
2+3

In [None]:
log(3)

In [None]:
280*3

In [None]:
pi/2

# Boolean output

If an expression asks a logical question which can evaluate to either TRUE or FALSE, the output is called *Boolean*.  Below are examples of Boolean expressions:

In [None]:
# Interpreted: "is 5 greater than 2?"
5>2

In [None]:
# Interpreted: "is 2 greater than 5?"
2>5

In [None]:
# Interpreted: "is 5 equivalent to 5?"
5==5

In [None]:
# Interpreted: "is 5 equivalent to 2?"
5==2

In [None]:
# Interpreted: "is 2 greater than 2?"
2>2

In [None]:
# Interpreted: "is 2 greater than OR EQUAL TO 2?"
2>=2

In [None]:
# "Is 5 not equal to 2?"
5!=2

In [None]:
10/2 < 3

Note that when testing for equivalence, double equal-signs `==` must be used.  This is to avoid ambiguity with *object assignments*.
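
As a minimal sketch of the difference (the names `x` and `y` are made up for illustration, and assignment itself is covered in the Objects section), a single equal sign stores a value while a double equal sign asks a question:

```r
# A single equal sign (or the arrow `<-`) assigns: it stores a value under a name.
x = 5
y <- 5

# A double equal sign compares: it returns TRUE or FALSE and changes nothing.
x == 5
x == 2
x == y
```
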

# Objects

`R` is an *object-oriented* language.  This means it relies on assigning values to objects and then operating on those objects.

What is an object?  An object is any sort of data structure that is stored in memory.  Once an object exists, you can refer to it by name to perform operations on it.

For example, if I just type the code below, I am executing "5" once:

In [None]:
5

But if I assign the number 5 to an *object*, I can refer back to it!

In [None]:
my_number = 5

In [None]:
my_number

`R` now knows that `my_number` is the actual number 5.  I can now do math using the *object* `my_number`:

In [None]:
my_number + 2

In [None]:
my_number/2

In [None]:
log(my_number)

# Data types

It is important to be aware of the different data types in `R`.  Some functions operate differently depending on the type of data we are working with.  

Use the `class()` function to display the data type:

In [None]:
class(my_number)

Consider these other objects:

In [None]:
object1 <- TRUE
object2 <- 'winona'

## Task

* Run the `class()` function on the two objects above.  What are their data types?
* Try adding 3 to each of the 2 objects above.  What happens?

# Vectors

*Vectors* are important data types in `R`.  Vectors store multiple items of the same data type.  

To create a vector, use the `c()` function:

In [None]:
myvec <- c(10,9,8,7,6,5,4,3,2,1,0)

Many functions are built to work with vectors:

In [None]:
class(myvec)

In [None]:
mean(myvec)

In [None]:
length(myvec)

The power of vectors is that operations work on an entire vector.  For example:

In [None]:
myvec + 2

In [None]:
myvec/3

In [None]:
myvec > 5

## Task


Vectors are required to contain elements that are all the same data type, as this task demonstrates.




Run the code below.  What's happening?  

In [None]:
an_example <- c(5, 1, 3, 'winona', TRUE, FALSE)

In [None]:
an_example

In [None]:
class(an_example)

In [None]:
another_example <- c(5, 1, 3, TRUE, FALSE)

In [None]:
another_example

In [None]:
class(another_example)

In [None]:
class(TRUE)

In [None]:
class(FALSE)

## Subsetting

Square brackets `[]` can be used to subset elements of vectors.  The colon symbol `:` can be used to subset a range of indices.

In [None]:
an_example[4]

In [None]:
an_example[3:6]

Boolean vectors of the same length as the original vector can also be used to subset.  This is especially useful for subsetting vectors according to some condition, as the next example illustrates:

In [None]:
myvec

In [None]:
wanted <- myvec > 6

In [None]:
wanted

In [None]:
myvec[wanted]
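
The two steps above (build a Boolean vector, then subset with it) can also be collapsed into one.  The sketch below uses a made-up vector `scores`, which is not part of the notebook's data, to show that both approaches give the same result:

```r
# Hypothetical vector, used only for this illustration
scores <- c(10, 9, 8, 7, 6, 5)

# Two steps: build the Boolean vector, then use it to subset
keep <- scores > 6
two_step <- scores[keep]

# One step: put the condition directly inside the square brackets
one_step <- scores[scores > 6]

two_step
one_step
```
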

### Task

Consider the vector below:

In [None]:
some_letters <- c('g','d','w','e','b','z','m','a','z','v','l','q','o','s','o')

I want to subset the 3rd, 8th, 9th, 13th, and 15th letters.  We'll go about this as follows:

1. Create a vector `indices` that contains the numbers 3, 8, 9, 13, and 15.
2. Use this vector inside square brackets `[]` to subset the desired letters.

Identify the result as some local "character"!

# Data frames

One of the most powerful types of data structures in `R` is the *data frame*.  One way to think of a data frame is as a combination of vectors, each column a vector of the same data type.  However, the columns might be different data types from each other.  
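
To make the "combination of vectors" idea concrete, here is a small sketch that builds a data frame by hand.  The vectors `name`, `age`, and `smoker` and the data frame `kids` are invented for illustration; they are not part of the `fev` data used below.

```r
# Three hypothetical vectors of equal length, each holding a single data type
name   <- c('Ana', 'Ben', 'Cal')
age    <- c(9, 13, 15)
smoker <- c(FALSE, FALSE, TRUE)

# Combine them into a data frame: each vector becomes a column
kids <- data.frame(name, age, smoker, stringsAsFactors = FALSE)
kids

# Each column keeps its own data type
sapply(kids, class)

# Rows and columns
dim(kids)
```
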


The `fev` data set contains forced expiratory volume (FEV) and other metrics on children and teens:

In [None]:
fev <- read.csv('https://www.dropbox.com/scl/fi/ijsx30c3bk5fzymp6ijh5/FEVdata.csv?rlkey=qp9armn7h46klefbgr3iopdzr&dl=1')

In [None]:
class(fev)

The `head()` command shows the first 6 rows of the data frame.  Note the different data types in each column:

In [None]:
head(fev)

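| | SUBJID | AGE | FEV | HEIGHT | SEX | SMOKE |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 1 | 301 | 9 | 1.708 | 57.0 | F | No |
| 2 | 451 | 8 | 1.724 | 67.5 | F | No |
| 3 | 501 | 7 | 1.72 | 54.5 | F | No |
| 4 | 642 | 9 | 1.558 | 53.0 | M | No |
| 5 | 901 | 9 | 1.895 | 57.0 | M | No |
| 6 | 1701 | 8 | 2.336 | 61.0 | F | No |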


The `dim()` command shows how many rows and columns there are:

In [None]:
dim(fev)

The dollar sign `$` can be used to extract a column.  In the code below, we extract the children's heights and put them in their own vector:

In [None]:
heights <- fev$HEIGHT

In [None]:
mean(heights)

In [None]:
head(heights)

In [None]:
length(heights)

## Task: counting teens

How many kids in the `fev` data set are teens (13 or older)?  To find this out:

* Create a vector `age` containing only the children's ages.
* Create a Boolean vector `is_teen` that is `TRUE` for each child aged 13 or older and `FALSE` otherwise.
* Use `is_teen` and `[]` to subset only the teens into their own vector, `teen_ages`.
* Use `length()` to determine the length of the `teen_ages` vector, and hence the number of teens in the data set.
* Note that `R` thinks of `TRUE` as 1 and `FALSE` as 0.  We could have just as easily counted the number of teens by running `sum()` on the `is_teen` vector.  Use this approach to verify the number of teens in the data set.
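
Because `TRUE` counts as 1 and `FALSE` as 0, summing a Boolean vector counts its `TRUE` values.  The sketch below shows this on a made-up vector `toy_ages` (hypothetical values, not taken from the `fev` data), so you can check the idea before applying it to the task:

```r
# Hypothetical ages, used only for this illustration
toy_ages <- c(9, 14, 12, 13, 16)

# Boolean vector: TRUE where the age is 13 or older
toy_is_teen <- toy_ages >= 13
toy_is_teen

# Approach 1: subset, then count the elements that remain
count_by_length <- length(toy_ages[toy_is_teen])

# Approach 2: add up the TRUE values directly
count_by_sum <- sum(toy_is_teen)

count_by_length
count_by_sum
```
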

# The `dplyr` package

The `dplyr` package is one of the foremost tools for working with data frames.  It is known for its data verbs and its pipe `%>%` notation for sequencing data wrangling tasks. The package must be loaded with each new `R` session:

In [None]:
library(dplyr)

Some of the most important `dplyr` verbs are:

* `mutate()`: create a new column that is a transformation of an existing column
* `filter()`: subset rows by some criteria
* `summarize()`: create a summary of some column
* `group_by()`: create grouped mutates or summaries; used most frequently in conjunction with `summarize()` and `mutate()`.

The syntax of a `dplyr` command chain takes the following form:

```
(dataframe
  %>% task1
  %>% task2
  %>% task3
)
```
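
One way to read a chain is that each `%>%` passes the result on its left into the function on its right as the first argument.  The sketch below assumes `dplyr` is installed and uses a made-up data frame `toy`; it compares a piped chain with the equivalent nested function calls.  The verbs `mutate()` and `filter()` are explained in the sections that follow.

```r
library(dplyr)

# Hypothetical one-column data frame, used only for this illustration
toy <- data.frame(x = c(1, 2, 3, 4))

# Piped version: read top to bottom, one task per line
piped <- (toy
  %>% mutate(y = x * 2)
  %>% filter(y > 4)
)

# Nested version: the same tasks, read from the inside out
nested <- filter(mutate(toy, y = x * 2), y > 4)

piped
identical(piped, nested)
```

The piped form is usually easier to read because the tasks appear in the order they are carried out.
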

## `mutate()`

`mutate()` will either transform (replace) an existing column, or create a new column.

In the example below, the `mutate()` function is used to convert `HEIGHT` from inches to feet (transform/replace), and to create a new `is_teen` variable:

In [None]:
(fev
  %>% mutate(HEIGHT = HEIGHT / 12, is_teen = ifelse(AGE >=13, 1, 0))
  %>% head()
)

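| | SUBJID | AGE | FEV | HEIGHT | SEX | SMOKE | is_teen |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 1 | 301 | 9 | 1.708 | 4.75 | F | No | 0 |
| 2 | 451 | 8 | 1.724 | 5.625 | F | No | 0 |
| 3 | 501 | 7 | 1.72 | 4.541667 | F | No | 0 |
| 4 | 642 | 9 | 1.558 | 4.416667 | M | No | 0 |
| 5 | 901 | 9 | 1.895 | 4.75 | M | No | 0 |
| 6 | 1701 | 8 | 2.336 | 5.083333 | F | No | 0 |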


In [None]:
head(fev)

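| | SUBJID | AGE | FEV | HEIGHT | SEX | SMOKE |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 1 | 301 | 9 | 1.708 | 57.0 | F | No |
| 2 | 451 | 8 | 1.724 | 67.5 | F | No |
| 3 | 501 | 7 | 1.72 | 54.5 | F | No |
| 4 | 642 | 9 | 1.558 | 53.0 | M | No |
| 5 | 901 | 9 | 1.895 | 57.0 | M | No |
| 6 | 1701 | 8 | 2.336 | 61.0 | F | No |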


This code has 2 tasks:

* A `mutate()` task;
* A `head()` task.

The `mutate()` comes first, followed by `head()`.  

If we want to save this transformed data frame, we need to assign it to a new object:

In [None]:
fev_transformed <- (fev
  %>% mutate(HEIGHT = HEIGHT / 12, is_teen = ifelse(AGE >=13, 1, 0))
  %>% head()
)

In [None]:
fev_transformed

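| | SUBJID | AGE | FEV | HEIGHT | SEX | SMOKE | is_teen |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 1 | 301 | 9 | 1.708 | 4.75 | F | No | 0 |
| 2 | 451 | 8 | 1.724 | 5.625 | F | No | 0 |
| 3 | 501 | 7 | 1.72 | 4.541667 | F | No | 0 |
| 4 | 642 | 9 | 1.558 | 4.416667 | M | No | 0 |
| 5 | 901 | 9 | 1.895 | 4.75 | M | No | 0 |
| 6 | 1701 | 8 | 2.336 | 5.083333 | F | No | 0 |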


...but this only has the first 6 rows!!!  How can we modify the code above to save the *entire* transformed data frame?

In [None]:
fev_transformed <- (fev
  %>% mutate(HEIGHT = HEIGHT / 12, is_teen = ifelse(AGE >=13, 1, 0))
  #%>% head()
)

In [None]:
fev_transformed

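| SUBJID | AGE | FEV | HEIGHT | SEX | SMOKE | is_teen |
|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 301 | 9 | 1.708 | 4.750000 | F | No | 0 |
| 451 | 8 | 1.724 | 5.625000 | F | No | 0 |
| 501 | 7 | 1.720 | 4.541667 | F | No | 0 |
| 642 | 9 | 1.558 | 4.416667 | M | No | 0 |
| 901 | 9 | 1.895 | 4.750000 | M | No | 0 |
| 1701 | 8 | 2.336 | 5.083333 | F | No | 0 |
| 1752 | 6 | 1.919 | 4.833333 | F | No | 0 |
| 1753 | 6 | 1.415 | 4.666667 | F | No | 0 |
| 1901 | 8 | 1.987 | 4.875000 | F | No | 0 |
| 1951 | 9 | 1.942 | 5.000000 | F | No | 0 |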


## `filter()`

`filter()` filters the data frame to only the rows that meet some condition.  The code below filters to only females at least 5 feet (60 inches) tall:

In [None]:
(fev
  %>% filter(SEX=='F')
  %>% filter(HEIGHT >= 60)
)

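| SUBJID | AGE | FEV | HEIGHT | SEX | SMOKE |
|---|---|---|---|---|---|
| `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 451 | 8 | 1.724 | 67.5 | F | No |
| 1701 | 8 | 2.336 | 61.0 | F | No |
| 1951 | 9 | 1.942 | 60.0 | F | No |
| 5601 | 9 | 2.988 | 65.0 | F | No |
| 6101 | 8 | 2.980 | 60.0 | F | No |
| 6801 | 9 | 2.100 | 60.0 | F | No |
| 7251 | 8 | 2.673 | 60.0 | F | No |
| 9501 | 9 | 3.135 | 60.0 | F | No |
| 11501 | 9 | 2.797 | 61.5 | F | No |
| 13351 | 9 | 3.016 | 62.5 | F | No |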
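
The two chained `filter()` calls above keep only the rows that pass both conditions.  As a sketch of an alternative, `filter()` also accepts several conditions separated by commas, which are combined with "and".  The data frame `toy` below is made up for illustration, and the example assumes `dplyr` is installed:

```r
library(dplyr)

# Hypothetical data frame with the same column names as fev
toy <- data.frame(
  SEX    = c('F', 'F', 'M', 'M'),
  HEIGHT = c(58, 62, 65, 59)
)

# Two filter() steps, one condition each
chained <- (toy
  %>% filter(SEX == 'F')
  %>% filter(HEIGHT >= 60)
)

# One filter() step with both conditions
combined <- (toy
  %>% filter(SEX == 'F', HEIGHT >= 60)
)

combined
identical(chained, combined)
```
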


## Grouped summaries with `summarize()`

`summarize()` aggregates the data, creating summary statistics of groups of rows.  The code below finds the average height and FEV and number of rows (children) in the data frame:

In [None]:
(fev
  %>% summarize(mean(FEV), mean(HEIGHT), n())
)

| mean(FEV) | mean(HEIGHT) | n() |
|---|---|---|
| `<dbl>` | `<dbl>` | `<int>` |
| 2.63678 | 61.14358 | 654 |


Using `group_by()` before the `summarize()` will provide summaries by the grouping variable levels.  The code below finds the average FEV by smoking status, as well as the number of people in each smoking category:

In [None]:
(fev
  %>% group_by(SMOKE)
  %>% summarize(avg_FEV = mean(FEV), count = n())
)

| SMOKE | avg_FEV | count |
|---|---|---|
| `<chr>` | `<dbl>` | `<int>` |
| No | 2.566143 | 589 |
| Yes | 3.276862 | 65 |


A shortcut to grouped summaries uses the `.by` argument of `summarize()` (available in `dplyr` 1.1.0 and later), doing away with `group_by()`:

In [None]:
(fev
  %>% summarize(avg_FEV = mean(FEV), count = n(), .by = SMOKE)
)

| SMOKE | avg_FEV | count |
|---|---|---|
| `<chr>` | `<dbl>` | `<int>` |
| No | 2.566143 | 589 |
| Yes | 3.276862 | 65 |


## Task

Above we used a rather roundabout way with vector creation to count the number of teens in the data set.  Use one chain of `dplyr` verbs to find the number of teens and the number of non-teens in the data set.

## Task

Use one chain of `dplyr` verbs to find and compare the average FEV of people who are at least 5 feet (60 inches) tall to people who are less than 5 feet tall.  