# Let's recap:

R lets us assign single values to variables:

In [2]:
a <- 445
b <- 67
c <- 2

It also lets us perform operations with variables or explicit values:

In [3]:
5 + 89
445 + 67
a + b
a + c
a + b + c
b/c
a*c

It also lets us combine multiple values into a single object, which we call a **vector**, with the **c()** function:

In [4]:
vec <- c(5, 89, a, b, c)

Why did R let us combine raw values and variables? Because all 5 are numeric. Try executing this and think about how **vec0** is different from **vec**:

In [5]:
vec0 <- c(5, 89, "a", "b", "c")

In [6]:
vec0

That's right - the quotes ("") tell us that all **vec0** values are treated as **strings**, or chr, whereas the elements in **vec** are **numbers**.
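
One way to confirm the difference is to ask R for the type of each vector with **class()**. The sketch below re-creates both vectors so it runs on its own. Note that even 5 and 89 end up as strings in **vec0**, because a vector can only hold one type of value:

```r
a <- 445
b <- 67
c <- 2

vec  <- c(5, 89, a, b, c)        # numbers and numeric variables
vec0 <- c(5, 89, "a", "b", "c")  # numbers mixed with quoted text

class(vec)    # "numeric"
class(vec0)   # "character"

vec0[1]       # "5": the number was converted to a string
```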

This is not all R is capable of. Sometimes we want to bundle together not just single values into a vector but several vectors into a...

# Dataframe


Dataframes (df) are R's core data structure. Pretty much all datasets we work with from now on will be saved into R as dataframes. You can think of a dataframe **loosely** as an Excel spreadsheet, with **rows** and **columns**. Rows in a df are called **observations** and columns - **variables**. For starters, let's try replicating this mini dataset in R:

<img src="https://ist387.s3.us-east-2.amazonaws.com/images/2021-02-02+00_48_23-PercentAgreement-TestData+-+Excel.png">

As the screenshot shows, there are two columns - **"name"** and **"age"**, and **4 rows**. We can think of this dataset as a collection of **2 vectors** - one for each column or variable. So we can start by declaring each vector, much like we've done before:

In [7]:
name <- c("Beth", "Raj", "Ken", "Jordyn")
age <- c(46, 51, 13, 2)

Let's check if both vectors look okay by **printing** them:

In [8]:
name
age

So far so good. The only thing we need to figure out is how to bundle them together. Luckily, R has a function for that - **data.frame()** - we just need to think of a name for this new df and tell R which vectors it consists of:

In [9]:
data <- data.frame(name, age)

We named the df **data** and assigned to it the 2 vectors, **name** and **age**. Let's check our work:

In [12]:
data

| name | age |
|---|---|
| `<chr>` | `<dbl>` |
| Beth | 46 |
| Raj | 51 |
| Ken | 13 |
| Jordyn | 2 |
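
Printing works well for a tiny df like this one. As a minimal sketch, a few other built-in functions give a quick summary of its shape without showing every row. The vectors are re-created here so the block runs on its own:

```r
name <- c("Beth", "Raj", "Ken", "Jordyn")
age <- c(46, 51, 13, 2)
data <- data.frame(name, age)

nrow(data)   # number of rows (observations): 4
ncol(data)   # number of columns (variables): 2
names(data)  # column names: "name" "age"
str(data)    # compact overview: each column with its type and first values
```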


Looks good! We are now ready to start analyzing our data. For starters, let's find out what the average age is. To do this, we will use the **mean()** function we are already familiar with. But we can't feed it the entire df. That's not going to work:

In [13]:
mean(data)

“argument is not numeric or logical: returning NA”


It doesn't work because **mean()** expects a single numeric (or logical) vector, and a whole df is not one, especially since our **name** column is chr, a bunch of text that R can't average. Instead, we need to zero in on the **age** column specifically. We do this in R by pointing the function to the specific column within the df we want to evaluate with the **$** sign:

In [14]:
mean(data$age)

Now we're talking! Using the same logic, we can find the min, max, range, and length of this column, just like we did with individual vectors last week:

In [15]:
min(data$age)
max(data$age)
range(data$age)
length(data$age)
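
These functions work because **data$age** is itself an ordinary numeric vector. Here is a small sketch to check this, with the df rebuilt so the block is self-contained (the name **ages** is only used here for illustration):

```r
name <- c("Beth", "Raj", "Ken", "Jordyn")
age <- c(46, 51, 13, 2)
data <- data.frame(name, age)

ages <- data$age     # pull the age column out of the df
is.vector(ages)      # TRUE
class(ages)          # "numeric"
mean(ages)           # 28, same as mean(data$age)
```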

We can even save some of these values for later use by assigning them to variables:

In [16]:
minAge <- min(data$age)
maxAge <- max(data$age)

Try out some mathematical operations with our new vars:

In [17]:
minAge + maxAge

In [18]:
minAge * maxAge

Remember **subsetting**? We can try it on entire dfs as well, not just individual vectors. For example, the following code will only retain the records of people in our data younger than 20:

In [20]:
youngDf <- data[data$age < 20,]

youngDf

| | name | age |
|---|---|---|
| | `<chr>` | `<dbl>` |
| 3 | Ken | 13 |
| 4 | Jordyn | 2 |


Let's see how this works: we start out, as usual, by thinking of a name for our new object - **youngDf**. Then, we tell R we want to copy over the contents of the **data** df but only retain those records for which the value in the **age** column is less than 20. Even though it doesn't look like a big deal, the comma (**,**) after **20** is **really important**. It acts as a **separator** between the **rows** and **columns** of a df. To the **left** of the comma is the **row condition** and to the **right** is the **column condition**. In this case, we ask R to give us only the rows where age < 20, and **ALL** the columns. Remember - if you leave the **column** condition after the comma empty, you are telling R you want all columns, which is exactly what we did.
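
To see the two sides of the comma at work, here is a small sketch that varies the row and column parts separately. The df is rebuilt so the block runs on its own, and the object names other than **youngDf** are made up for illustration:

```r
name <- c("Beth", "Raj", "Ken", "Jordyn")
age <- c(46, 51, 13, 2)
data <- data.frame(name, age)

# Row condition only: rows where age < 20, all columns
youngDf <- data[data$age < 20, ]

# Row and column condition: same rows, but only the name column
youngNames <- data[data$age < 20, "name"]   # "Ken" "Jordyn"

# Column condition only: all rows, only the age column
allAges <- data[, "age"]                    # 46 51 13 2
```

Note that when a single column is selected this way, R hands it back as a plain vector rather than a df.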