# Overview of Data Frame Operations

Data frames are the workhorse of R, so in this lecture we will build a "cheat sheet" of common data frame operations. This is a critical lecture, so be sure you understand all of the material. We will cover:

- Creating Data Frames
- Importing and Exporting Data
- Getting Information about Data Frames
- Referencing Cells
- Referencing Rows
- Referencing Columns
- Adding Rows
- Adding Columns
- Setting Column Names
- Selecting Multiple Rows
- Selecting Multiple Columns
- Dealing with Missing Data

## Creating Data Frames

Let's first create an empty data frame, then build one from two vectors:

In [1]:
empty <- data.frame() # empty data frame

# now create a numeric vector of 10 random values
c1 <- runif(10)

# use the built-in vector 'letters' to make the next vector
c2 <- letters[1:10]

# build the data frame and name the columns
df <- data.frame("numbers" = c1, "letters" = c2)
df

numbers,letters
<dbl>,<fct>
0.39059836,a
0.08741041,b
0.02793079,c
0.82038038,d
0.99909567,e
0.58412249,f
0.26511445,g
0.78301879,h
0.64013416,i
0.90595798,j
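
The `<fct>` tag in the output above means the letters column was stored as a factor, which was the default behaviour of data.frame() before R 4.0.0. From R 4.0.0 on, strings stay character unless you ask otherwise. Here is a minimal sketch that makes the choice explicit; the names `df_chr` and `df_fct` are made up for illustration:

```r
# Character columns: the default since R 4.0.0, stated explicitly here
df_chr <- data.frame(numbers = runif(3), letters = letters[1:3],
                     stringsAsFactors = FALSE)

# Factor columns: the default before R 4.0.0
df_fct <- data.frame(numbers = runif(3), letters = letters[1:3],
                     stringsAsFactors = TRUE)

class(df_chr$letters)
class(df_fct$letters)
```

Setting `stringsAsFactors` yourself keeps the result the same across R versions.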


## Importing and Exporting Data

We will go over this in more detail later. The calls are commented out because the files do not exist yet.

In [2]:
# d2 <- read.csv('some_file.csv')

# For an Excel file, load the readxl package
# library(readxl)

# Read a sheet using read_excel()
# df <- read_excel('some_file.xlsx', sheet = 'Sheet1')

# Output to CSV
# write.csv(df, file = 'some_file.csv')
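
To try the CSV calls without creating files in your working directory, here is a small round-trip sketch. It assumes the built-in mtcars data set and a temporary file; `tmp` and `cars_back` are illustrative names:

```r
# Write to a temporary file so nothing in the working directory is touched
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, file = tmp, row.names = FALSE)  # skip the row-name column

# Read the file back into a new data frame
cars_back <- read.csv(tmp)
dim(cars_back)  # same shape as mtcars

unlink(tmp)  # clean up the temporary file
```

Note that `row.names = FALSE` drops the car names, because in mtcars they are stored as row names rather than as a column.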

## Getting Information about Data Frames

In [3]:
# Count the numbers of rows and columns
nrow(df)
ncol(df)

In [4]:
# Column and Row Names
colnames(df)
rownames(df) # returns the row labels ("1" to "10"), not the values stored in the rows
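
A few more base R functions are handy for a first look at a data frame. This is a short sketch on the built-in mtcars data set rather than on df:

```r
dim(mtcars)          # number of rows and columns in one call
str(mtcars)          # type and first few values of every column
summary(mtcars$mpg)  # quick numeric summary of a single column
```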

## Referencing Cells

As a rule of thumb, use two sets of brackets, [[row, col]], for a single cell and a single set of brackets, [rows, cols], for multiple cells. Here's an example:

In [5]:
vec <- df[[5, 2]] # get cell by [[row,col]] num

newdf <- df[1:5, 1:2] # get multiple cells in new df

df[[2, 'numbers']] <- 99999 # reassign a single cell

newdf
df

,numbers,letters
,<dbl>,<fct>
1,0.39059836,a
2,0.08741041,b
3,0.02793079,c
4,0.82038038,d
5,0.99909567,e


numbers,letters
<dbl>,<fct>
0.3905984,a
99999.0,b
0.02793079,c
0.8203804,d
0.9990957,e
0.5841225,f
0.2651145,g
0.7830188,h
0.6401342,i
0.905958,j
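
The difference between the two bracket styles is easiest to see from the type of the result. Here is a minimal sketch on a toy data frame; `cells`, `one_value` and `block` are illustrative names:

```r
cells <- data.frame(x = c(10, 20, 30), y = c("a", "b", "c"))

one_value <- cells[[2, "x"]]  # double brackets: the bare value
block     <- cells[1:2, 1:2]  # single brackets with ranges: a data frame

class(one_value)
class(block)
```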


## Referencing Rows
Usually you'll want to use the [row, ] notation, which returns the row as a one-row data frame:

In [6]:
rowdf <- df[1,]
rowdf

,numbers,letters
,<dbl>,<fct>
1,0.3905984,a


If you want to get a row as a vector, use the following notation

In [7]:
vrow <- as.numeric(as.vector(df[1,]))
vrow # the letter 'a' is coerced to a number: its integer code if the column is a factor, NA if it is character
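
When every column is numeric, unlist() is a simpler way to get a row as a vector, and it keeps the column names. This sketch uses the built-in mtcars data set; `first_row` is an illustrative name:

```r
first_row <- unlist(mtcars[1, ])  # named numeric vector, one entry per column

first_row["mpg"]       # look up a value by column name
is.numeric(first_row)
```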

## Referencing Columns
Most column references will return a vector


In [8]:
cars <- mtcars
head(cars)

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [9]:
colv1 <- cars$mpg # returns a vector
colv1

colv2 <- cars[, 'mpg'] # returns vector
colv2

colv3 <- cars[, 1] # the column index can be an integer position or a name string
colv3

colv4 <- cars[['mpg']] # returns a vector
colv4

In [10]:
# Ways of Returning Data Frames
mpgdf <- cars['mpg'] # returns 1 col df
head(mpgdf)

mpgdf2 <- cars[1] # returns 1 col df
head(mpgdf2)

,mpg
,<dbl>
Mazda RX4,21.0
Mazda RX4 Wag,21.0
Datsun 710,22.8
Hornet 4 Drive,21.4
Hornet Sportabout,18.7
Valiant,18.1


,mpg
,<dbl>
Mazda RX4,21.0
Mazda RX4 Wag,21.0
Datsun 710,22.8
Hornet 4 Drive,21.4
Hornet Sportabout,18.7
Valiant,18.1
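
The [rows, cols] form collapses a single column to a vector by default. If you want that form to return a data frame instead, the `drop` argument controls it, as this minimal sketch shows (`as_vector` and `as_frame` are illustrative names):

```r
as_vector <- mtcars[, "mpg"]                # one column collapses to a vector
as_frame  <- mtcars[, "mpg", drop = FALSE]  # drop = FALSE keeps a data frame

is.data.frame(as_vector)
is.data.frame(as_frame)
```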


## Adding Rows

In [11]:
# rbind() works on two data frames, so build the new row as a data frame first
df2 <- data.frame("numbers" = 2000, "letters" = 'new row')
df2

numbers,letters
<dbl>,<fct>
2000,new row


Use rbind to bind the new row!

In [12]:
dfnew <- rbind(df,df2)
dfnew

numbers,letters
<dbl>,<fct>
0.3905984,a
99999.0,b
0.02793079,c
0.8203804,d
0.9990957,e
0.5841225,f
0.2651145,g
0.7830188,h
0.6401342,i
0.905958,j
2000.0,new row
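
For data frames, rbind() lines columns up by name rather than by position, so the new row needs the same column names as the existing data. Here is a minimal sketch on toy data; `base_rows`, `extra_row` and `combined` are illustrative names:

```r
base_rows <- data.frame(numbers = c(1, 2), letters = c("a", "b"))

# Same column names in a different order: rbind still matches them by name
extra_row <- data.frame(letters = "c", numbers = 3)

combined <- rbind(base_rows, extra_row)
combined
```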


## Adding Columns

In [13]:
df$newcol <- rep(NA, nrow(df)) # NA column
df

numbers,letters,newcol
<dbl>,<fct>,<lgl>
0.3905984,a,
99999.0,b,
0.02793079,c,
0.8203804,d,
0.9990957,e,
0.5841225,f,
0.2651145,g,
0.7830188,h,
0.6401342,i,
0.905958,j,


In [14]:
df[, 'copy.of.letters'] <- df$letters # copy a col
df

numbers,letters,newcol,copy.of.letters
<dbl>,<fct>,<lgl>,<fct>
0.3905984,a,,a
99999.0,b,,b
0.02793079,c,,c
0.8203804,d,,d
0.9990957,e,,e
0.5841225,f,,f
0.2651145,g,,g
0.7830188,h,,h
0.6401342,i,,i
0.905958,j,,j


We can also compute a new column from existing columns:

In [15]:
df[['numbers x numbers']] <- df$numbers * df$numbers
df

numbers,letters,newcol,copy.of.letters,numbers x numbers
<dbl>,<fct>,<lgl>,<fct>,<dbl>
0.3905984,a,,a,0.1525671
99999.0,b,,b,9999800000.0
0.02793079,c,,c,0.0007801291
0.8203804,d,,d,0.673024
0.9990957,e,,e,0.9981922
0.5841225,f,,f,0.3411991
0.2651145,g,,g,0.07028567
0.7830188,h,,h,0.6131184
0.6401342,i,,i,0.4097717
0.905958,j,,j,0.8207599


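We can also use cbind() to bind a new column onto the data frame. Note that the new column takes its name from the expression passed in (here df$letters):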

In [16]:
df <- cbind(df, df$letters)
df

numbers,letters,newcol,copy.of.letters,numbers x numbers,df$letters
<dbl>,<fct>,<lgl>,<fct>,<dbl>,<fct>
0.3905984,a,,a,0.1525671,a
99999.0,b,,b,9999800000.0,b
0.02793079,c,,c,0.0007801291,c
0.8203804,d,,d,0.673024,d
0.9990957,e,,e,0.9981922,e
0.5841225,f,,f,0.3411991,f
0.2651145,g,,g,0.07028567,g
0.7830188,h,,h,0.6131184,h
0.6401342,i,,i,0.4097717,i
0.905958,j,,j,0.8207599,j
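
Passing the new column to cbind() as a named argument avoids the awkward `df$letters` column name seen above. Here is a minimal sketch on a toy data frame; `small` and `letters.copy` are illustrative names:

```r
small <- data.frame(numbers = c(1, 2, 3), letters = c("a", "b", "c"))

# Naming the argument gives the new column a usable name
small <- cbind(small, letters.copy = small$letters)
colnames(small)
```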


## Setting Column Names

In [17]:
# Rename second column
colnames(df)[2] <- 'SECOND COLUMN NEW NAME'
df

# Rename all at once with a vector
# (only five names are given for six columns, so the sixth name becomes NA)
colnames(df) <- c('col.name.1', 'col.name.2', 'newcol', 'copy.of.col2', 'col1.times.2')
df

numbers,SECOND COLUMN NEW NAME,newcol,copy.of.letters,numbers x numbers,df$letters
<dbl>,<fct>,<lgl>,<fct>,<dbl>,<fct>
0.3905984,a,,a,0.1525671,a
99999.0,b,,b,9999800000.0,b
0.02793079,c,,c,0.0007801291,c
0.8203804,d,,d,0.673024,d
0.9990957,e,,e,0.9981922,e
0.5841225,f,,f,0.3411991,f
0.2651145,g,,g,0.07028567,g
0.7830188,h,,h,0.6131184,h
0.6401342,i,,i,0.4097717,i
0.905958,j,,j,0.8207599,j


col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,NA
<dbl>,<fct>,<lgl>,<fct>,<dbl>,<fct>
0.3905984,a,,a,0.1525671,a
99999.0,b,,b,9999800000.0,b
0.02793079,c,,c,0.0007801291,c
0.8203804,d,,d,0.673024,d
0.9990957,e,,e,0.9981922,e
0.5841225,f,,f,0.3411991,f
0.2651145,g,,g,0.07028567,g
0.7830188,h,,h,0.6131184,h
0.6401342,i,,i,0.4097717,i
0.905958,j,,j,0.8207599,j
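
Renaming by position is easy to get wrong when columns move, and a name vector of the wrong length leaves columns named NA, as in the output above. Here is a sketch of two safer habits on a toy data frame; `demo` and `new_names` are illustrative names:

```r
demo <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Rename one column by its current name instead of its position
names(demo)[names(demo) == "b"] <- "b.renamed"

# Guard a full rename: the vector must hold one name per column
new_names <- c("first", "second", "third")
stopifnot(length(new_names) == ncol(demo))
names(demo) <- new_names
names(demo)
```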


## Selecting Multiple Rows

In [18]:
first.ten.rows <- df[1:10, ] # Same as head(df, 10)
first.ten.rows

,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,NA
,<dbl>,<fct>,<lgl>,<fct>,<dbl>,<fct>
1,0.3905984,a,,a,0.1525671,a
2,99999.0,b,,b,9999800000.0,b
3,0.02793079,c,,c,0.0007801291,c
4,0.8203804,d,,d,0.673024,d
5,0.9990957,e,,e,0.9981922,e
6,0.5841225,f,,f,0.3411991,f
7,0.2651145,g,,g,0.07028567,g
8,0.7830188,h,,h,0.6131184,h
9,0.6401342,i,,i,0.4097717,i
10,0.905958,j,,j,0.8207599,j


In [19]:
everything.but.row.two <- df[-2, ]
everything.but.row.two

,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,NA
,<dbl>,<fct>,<lgl>,<fct>,<dbl>,<fct>
1,0.39059836,a,,a,0.1525670777,a
3,0.02793079,c,,c,0.0007801291,c
4,0.82038038,d,,d,0.67302397,d
5,0.99909567,e,,e,0.9981921577,e
6,0.58412249,f,,f,0.3411990799,f
7,0.26511445,g,,g,0.0702856734,g
8,0.78301879,h,,h,0.613118421,h
9,0.64013416,i,,i,0.4097717365,i
10,0.90595798,j,,j,0.8207598701,j


In [20]:
# Conditional Selection
sub1 <- df[ (df$col.name.1 > 8 & df$col1.times.2 > 10), ]
sub1

sub2 <- subset(df, col.name.1 > 8 & col1.times.2 > 10)
sub2

,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,NA
,<dbl>,<fct>,<lgl>,<fct>,<dbl>,<fct>
2,99999,b,,b,9999800001,b


,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,NA
,<dbl>,<fct>,<lgl>,<fct>,<dbl>,<fct>
2,99999,b,,b,9999800001,b
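
The two forms of conditional selection give the same result here, but they differ when the condition evaluates to NA. Bracket indexing returns a row of NAs for each NA in the condition, while subset() treats NA as FALSE. Here is a minimal sketch on toy data; `scores` and the result names are illustrative:

```r
scores <- data.frame(id = 1:4, value = c(5, NA, 12, 20))

with_brackets <- scores[scores$value > 10, ]         # NA condition gives an all-NA row
with_which    <- scores[which(scores$value > 10), ]  # which() drops the NA
with_subset   <- subset(scores, value > 10)          # subset() also drops it

nrow(with_brackets)
nrow(with_subset)
```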


## Selecting Multiple Columns

In [21]:
df[, c(1, 2, 3)] #Grab cols 1 2 3

col.name.1,col.name.2,newcol
<dbl>,<fct>,<lgl>
0.3905984,a,
99999.0,b,
0.02793079,c,
0.8203804,d,
0.9990957,e,
0.5841225,f,
0.2651145,g,
0.7830188,h,
0.6401342,i,
0.905958,j,


In [22]:
df[, c('col.name.1', 'col1.times.2')] # by name

col.name.1,col1.times.2
<dbl>,<dbl>
0.3905984,0.1525671
99999.0,9999800000.0
0.02793079,0.0007801291
0.8203804,0.673024
0.9990957,0.9981922
0.5841225,0.3411991
0.2651145,0.07028567
0.7830188,0.6131184
0.6401342,0.4097717
0.905958,0.8207599


In [23]:
df[, -1] # keep all but first column

col.name.2,newcol,copy.of.col2,col1.times.2,NA
<fct>,<lgl>,<fct>,<dbl>,<fct>
a,,a,0.1525671,a
b,,b,9999800000.0,b
c,,c,0.0007801291,c
d,,d,0.673024,d
e,,e,0.9981922,e
f,,f,0.3411991,f
g,,g,0.07028567,g
h,,h,0.6131184,h
i,,i,0.4097717,i
j,,j,0.8207599,j


In [24]:
df[, -c(1, 3)] # drop cols 1 and 3

col.name.2,copy.of.col2,col1.times.2,NA
<fct>,<fct>,<dbl>,<fct>
a,a,0.1525671,a
b,b,9999800000.0,b
c,c,0.0007801291,c
d,d,0.673024,d
e,e,0.9981922,e
f,f,0.3411991,f
g,g,0.07028567,g
h,h,0.6131184,h
i,i,0.4097717,i
j,j,0.8207599,j
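
Negative indexing only works with column positions, not with column names. To drop columns by name, one option is to build the list of names to keep. This sketch uses the built-in mtcars data set; `keep` and `without_two` are illustrative names:

```r
# Every column name except the two we want to drop
keep <- setdiff(names(mtcars), c("mpg", "cyl"))

without_two <- mtcars[, keep]
names(without_two)
```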


## Dealing with Missing Data
Dealing with missing data is a very important skill to know when working with data frames!

In [25]:
any(is.na(mtcars)) # detect anywhere in df

In [26]:
# We have an all-NA column (newcol) in df
any(is.na(df$newcol)) # detect anywhere in a single column
df

col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,NA
<dbl>,<fct>,<lgl>,<fct>,<dbl>,<fct>
0.3905984,a,,a,0.1525671,a
99999.0,b,,b,9999800000.0,b
0.02793079,c,,c,0.0007801291,c
0.8203804,d,,d,0.673024,d
0.9990957,e,,e,0.9981922,e
0.5841225,f,,f,0.3411991,f
0.2651145,g,,g,0.07028567,g
0.7830188,h,,h,0.6131184,h
0.6401342,i,,i,0.4097717,i
0.905958,j,,j,0.8207599,j


In [27]:
# delete rows with missing data in a selected column (here col.name.1)
df <- df[!is.na(df$col.name.1), ]

In [28]:
# replace NAs with something else
df[is.na(df)] <- 0 # works on whole df

In [29]:
df$newcol[is.na(df$newcol)] <- 999 # for a selected column (here newcol)
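
Base R also has helpers for counting missing values and for dropping incomplete rows. Here is a minimal sketch on toy data; `gaps`, `full_rows` and `same_rows` are illustrative names:

```r
gaps <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

colSums(is.na(gaps))                       # NA count per column
full_rows <- gaps[complete.cases(gaps), ]  # keep rows with no NA at all
same_rows <- na.omit(gaps)                 # shortcut for the same filtering

nrow(full_rows)
```

Counting NAs per column first is a cheap way to decide whether dropping rows or replacing values is the better fit for your data.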