## Initiating a data frame
Data frames are the default data structure when you read in a CSV file using read.csv(). By default (header = TRUE), this function uses the first row of the file as the column names.
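
As a minimal sketch of the header behavior, the example below reads a small CSV supplied inline through the `text` argument rather than from a file. The names `csv_text`, `with_header`, and `no_header` and the CSV content are hypothetical, chosen only for illustration.

```r
# Hypothetical CSV content, supplied inline instead of from a file
csv_text <- "id,label\n1,a\n2,b"

# Default (header = TRUE): the first row becomes the column names
with_header <- read.csv(text = csv_text)
colnames(with_header)

# header = FALSE: the first row is treated as data,
# and R names the columns V1, V2, ...
no_header <- read.csv(text = csv_text, header = FALSE)
colnames(no_header)
```

With header = FALSE the resulting data frame has one more row, because the original header line is kept as data.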

We can also create a data frame from our own data using the data.frame() function:

In [98]:
# Create data frame, dat
dat <- data.frame(col1 = 1:4,
                  col2 = c("a", "b", "c", "d"),
                  col3 = 1)  # a single value is repeated for every row


In [99]:
dat

col1,col2,col3
<int>,<chr>,<dbl>
1,a,1
2,b,1
3,c,1
4,d,1


In [100]:
# colnames() to get the column names
colnames(dat)

## Subsetting a data frame
### Individual rows, columns, and cells
Data frames are structured as rows and columns:
* the syntax is dat[x,y], where x is a row index and y is a column index
* dat[x,y] calls the cell in the x-th row and y-th column
* dat[,y] calls the entire y-th column
* dat[x,] calls the entire x-th row

Columns can be called by column name: dat$col1.

In [101]:
# Call the 4th row
dat[4,]

,col1,col2,col3
,<int>,<chr>,<dbl>
4,4,d,1


In [102]:
# Call the 2nd column
dat[,2]

In [103]:
# Call the entire column by name
dat$col1
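
The indexing forms above differ in what they return, which the following sketch tries to make explicit. It uses a hypothetical stand-in data frame, `demo`, so that it does not touch `dat`; the other variable names are likewise illustrative.

```r
# Hypothetical stand-in for the data frame built above
demo <- data.frame(col1 = 1:4, col2 = c("a", "b", "c", "d"), col3 = 1)

cell     <- demo[2, 3]                 # single cell: row 2, column 3
col_vec  <- demo[, 2]                  # one column, simplified to a vector
col_df   <- demo[, 2, drop = FALSE]    # one column, kept as a data frame
same_col <- demo[["col2"]]             # equivalent to demo$col2

is.data.frame(col_vec)
is.data.frame(col_df)
```

Selecting a single column with dat[,y] simplifies the result to a vector by default; passing drop = FALSE keeps it as a one-column data frame.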

### Multiple rows or columns
With the dat[x,y] syntax, either x or y can be a vector of indices, which calls multiple rows or columns at once. We can also pass a vector of column names to select multiple columns, although this becomes tedious when calling many columns.

In [104]:
# Retrieve 1st through 3rd rows
dat[1:3,]

,col1,col2,col3
,<int>,<chr>,<dbl>
1,1,a,1
2,2,b,1
3,3,c,1


In [105]:
# Retrieve 1st through 3rd columns
dat[,1:3]

col1,col2,col3
<int>,<chr>,<dbl>
1,a,1
2,b,1
3,c,1
4,d,1


In [106]:
# Retrieve 1st through 3rd columns using names
dat[, c("col1", "col2", "col3")]

col1,col2,col3
<int>,<chr>,<dbl>
1,a,1
2,b,1
3,c,1
4,d,1
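
When typing out many column names becomes tedious, one possible shortcut is to build the name vector programmatically. The sketch below assumes a hypothetical data frame `demo` with an extra column named `other`; it is one way to do this, not the only one.

```r
# Hypothetical data frame with one column that does not follow the "col" naming
demo <- data.frame(col1 = 1:4, col2 = c("a", "b", "c", "d"),
                   col3 = 1, other = TRUE)

# Select every column whose name starts with "col"
col_like <- demo[, grep("^col", colnames(demo)), drop = FALSE]

# Select every column except those named in a vector
keep <- setdiff(colnames(demo), c("col2", "other"))
without_some <- demo[, keep, drop = FALSE]

colnames(col_like)
colnames(without_some)
```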


### Conditional selection
We can use the which() function to call specific rows that meet some logical criteria.

In [107]:
# Retrieve rows that have col1 > 2
dat[which(dat[,1] > 2),]

,col1,col2,col3
,<int>,<chr>,<dbl>
3,3,c,1
4,4,d,1


In [108]:
# Retrieve rows that have col2 == "b" or "d"
dat[which(dat[,2] == "b" | dat[,2] == "d"),]

,col1,col2,col3
,<int>,<chr>,<dbl>
2,2,b,1
4,4,d,1
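
The same selections can also be written without which(), since a logical vector can index rows directly. The sketch below uses a hypothetical stand-in data frame, `demo`, and shows %in% as a compact alternative to chaining == with |.

```r
# Hypothetical stand-in for the data frame built above
demo <- data.frame(col1 = 1:4, col2 = c("a", "b", "c", "d"), col3 = 1)

# A logical vector indexes rows directly
big <- demo[demo$col1 > 2, ]

# %in% tests membership in a set of values
b_or_d <- demo[demo$col2 %in% c("b", "d"), ]

big
b_or_d
```

One difference worth knowing: which() silently drops rows where the condition is NA, whereas a bare logical index returns a row of NAs for each of them.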


## Adding data to a data frame
### Adding columns
To add a column, name the new column with the $ operator and assign its contents. The contents of the new column must be in one of the following shapes:

* a single value that will fill the whole column
* a vector whose length is the same as the number of rows in the data frame
* a vector whose length is a factor of the number of rows in the data frame. This vector will repeat through the column.


In [109]:
# Single value
dat$col4 <- "b"
dat

col1,col2,col3,col4
<int>,<chr>,<dbl>,<chr>
1,a,1,b
2,b,1,b
3,c,1,b
4,d,1,b


In [110]:
# Vector whose length equals the number of rows
dat$col5 <- c("e", "f", "g", "h")
dat

col1,col2,col3,col4,col5
<int>,<chr>,<dbl>,<chr>,<chr>
1,a,1,b,e
2,b,1,b,f
3,c,1,b,g
4,d,1,b,h


In [111]:
# Vector whose length is a factor of the number of rows
dat$col6 <- c(1,2)
dat

col1,col2,col3,col4,col5,col6
<int>,<chr>,<dbl>,<chr>,<chr>,<dbl>
1,a,1,b,e,1
2,b,1,b,f,2
3,c,1,b,g,1
4,d,1,b,h,2
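
To illustrate the length rule, the sketch below shows one assignment that recycles cleanly and one that fails. The data frame `demo` and the column names `ok` and `bad` are hypothetical; tryCatch() is used only so the example keeps running after the error.

```r
# Hypothetical four-row data frame
demo <- data.frame(col1 = 1:4)

# Length 2 divides 4 rows evenly, so the values repeat: 1, 2, 1, 2
demo$ok <- c(1, 2)

# Length 3 does not divide 4 rows, so the assignment raises an error
result <- tryCatch({
  demo$bad <- c(1, 2, 3)
  "assigned"
}, error = function(e) "error")

result
colnames(demo)
```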


### Adding rows
Append rows to a data frame using rbind(data_frame, new_rows). Similar to adding columns to a data frame, the contents must be in one of the following shapes:

* a single value that will fill across the entire row
* a vector whose length is the same as the number of columns in the data frame
* a vector whose length is a factor of the number of columns in the data frame. This vector will repeat itself through the row.

In [112]:
# Single value that will fill across the entire row
dat <- rbind(dat, 1)
dat

col1,col2,col3,col4,col5,col6
<dbl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
1,a,1,b,e,1
2,b,1,b,f,2
3,c,1,b,g,1
4,d,1,b,h,2
1,1,1,1,1,1


In [113]:
# Vector whose length equals the number of columns
dat <- rbind(dat, c(5, "a", 3, "e", 1, "i"))
dat

col1,col2,col3,col4,col5,col6
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,a,1,b,e,1
2,b,1,b,f,2
3,c,1,b,g,1
4,d,1,b,h,2
1,1,1,1,1,1
5,a,3,e,1,i


In [114]:
# Vector whose length is a factor of the number of columns
dat <- rbind(dat, c("x", "y"))
dat

col1,col2,col3,col4,col5,col6
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,a,1,b,e,1
2,b,1,b,f,2
3,c,1,b,g,1
4,d,1,b,h,2
1,1,1,1,1,1
5,a,3,e,1,i
x,y,x,y,x,y
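
The column types in the outputs above change to <chr> because a vector passed to rbind() holds a single type, so mixing numbers and strings turns every value into a string. A possible way to keep the column types is to append a one-row data frame instead. The sketch below uses a hypothetical two-column data frame, `demo`.

```r
# Hypothetical two-column data frame
demo <- data.frame(col1 = 1:2, col2 = c("a", "b"))

# c(3, "c") is a character vector, so col1 is coerced to character
as_vector <- rbind(demo, c(3, "c"))
class(as_vector$col1)

# A one-row data frame with matching column names keeps col1 as integer
as_frame <- rbind(demo, data.frame(col1 = 3L, col2 = "c"))
class(as_frame$col1)
```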


## Modifying specific values
We can modify existing data the same way we add data to a data frame: call the row(s), column(s), or cell(s) we want to change and assign a new value.

In [115]:
# Change the whole 3rd column to equal "!"
dat[,3] <- "!"
dat

col1,col2,col3,col4,col5,col6
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,a,!,b,e,1
2,b,!,b,f,2
3,c,!,b,g,1
4,d,!,b,h,2
1,1,!,1,1,1
5,a,!,e,1,i
x,y,!,y,x,y


In [116]:
# Change the 4th and 5th rows to equal "?"
dat[4:5,] <- "?"
dat

col1,col2,col3,col4,col5,col6
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,a,!,b,e,1
2,b,!,b,f,2
3,c,!,b,g,1
?,?,?,?,?,?
?,?,?,?,?,?
5,a,!,e,1,i
x,y,!,y,x,y


In [117]:
# Change the bottom right value to equal "%"
dat$col6[7] <- "%"

# or 
dat[7,6] <- "%"
dat

col1,col2,col3,col4,col5,col6
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,a,!,b,e,1
2,b,!,b,f,2
3,c,!,b,g,1
?,?,?,?,?,?
?,?,?,?,?,?
5,a,!,e,1,i
x,y,!,y,x,%


In [118]:
# Change any rows that have "a" in col2 to "+"
dat[which(dat[,2] == "a"),] <- "+"
dat

col1,col2,col3,col4,col5,col6
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
+,+,+,+,+,+
2,b,!,b,f,2
3,c,!,b,g,1
?,?,?,?,?,?
?,?,?,?,?,?
+,+,+,+,+,+
x,y,!,y,x,%
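
The conditional example above overwrites every column in the matching rows. To change only one column for the rows that meet a condition, combine the row condition with a column selector, as in this sketch on a hypothetical data frame `demo`.

```r
# Hypothetical stand-in data frame
demo <- data.frame(col1 = 1:4, col2 = c("a", "b", "c", "d"))

# Overwrite col2 only, and only in rows where col1 > 2
demo$col2[demo$col1 > 2] <- "z"

# Equivalent bracket form: rows by condition, column by name
demo[demo$col1 == 1, "col2"] <- "first"

demo
```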


## Removing data
There are a few methods we can use to remove rows or columns from data sets. The basic idea is that we are reassigning the data frame using a selected subset of the data.

### Specify rows and columns to keep

In [119]:
# only keeps rows 1 through 3.
dat <- dat[1:3,]
dat

,col1,col2,col3,col4,col5,col6
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,+,+,+,+,+,+
2,2,b,!,b,f,2
3,3,c,!,b,g,1


In [120]:
# only keeps columns 2 and 4. Alternatively, dat <- dat[,c(2,4)]
dat <- dat[,c("col2", "col4")]
dat

,col2,col4
,<chr>,<chr>
1,+,+
2,b,b
3,c,b


In [121]:
# removes rows 1, 3, and 6 (dat has only 3 rows here, so the -6 is ignored)
dat <- dat[-c(1,3,6),]
dat

,col2,col4
,<chr>,<chr>
2,b,b


In [122]:
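# removes columns 2 through 5.
# Note: dat has only two columns left at this point, so this leaves a single
# column, which R simplifies to a vector unless drop = FALSE is given.
dat <- dat[,-c(2:5)]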
dat

In [123]:
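# keeps only rows with "!" in col3
# Note: this fails here because dat is no longer a data frame with a col3
# column; run it on a data frame that still contains col3.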
dat <- dat[which(dat$col3 == "!"),]

ERROR: Error in dat$col3: $ operator is invalid for atomic vectors


In [124]:
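# does the same, but by REMOVING the rows without a "!" in col3
# Note: this fails here for the same reason; dat$col3 is not available
# once dat has been reduced to an atomic vector.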
dat <- dat[-which(dat$col3 != "!"),] 

ERROR: Error in dat$col3: $ operator is invalid for atomic vectors
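
The two errors above arise because `dat` had already been reduced to an atomic vector by the time those cells ran. As a sketch of how the same removals behave on an intact data frame, the example below rebuilds a small hypothetical data frame, `demo`, with its own `col3` values; all names and values are assumptions for illustration.

```r
# Hypothetical data frame that still has a col3 column
demo <- data.frame(col1 = 1:4,
                   col2 = c("a", "b", "c", "d"),
                   col3 = c("!", "?", "!", "?"))

# Keep rows by condition; a logical index is safe even when nothing matches
kept <- demo[demo$col3 == "!", ]

# Pitfall: when no row meets the condition, which() returns integer(0),
# and a negated empty index selects no rows at all
none_to_drop <- demo[-which(demo$col3 == "#"), ]
nrow(none_to_drop)

# Removing columns down to one keeps a data frame only with drop = FALSE
one_col <- demo[, -c(2:3), drop = FALSE]

# Assigning NULL removes a column by name
demo$col3 <- NULL
colnames(demo)
```

Because of the empty-index pitfall, filtering with a logical condition is usually the more predictable choice than removing rows with -which().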
