# Overview of Data Frame Operations

Data frames are the workhorse of R, so in this lecture we will build a "cheat sheet" of common data frame operations in R. This is a critical lecture, so be sure you understand each of the following topics:

- Creating Data Frames
- Importing and Exporting Data
- Getting Information about Data Frames
- Referencing Cells
- Referencing Rows
- Referencing Columns
- Adding Rows
- Adding Columns
- Setting Column Names
- Selecting Multiple Rows
- Selecting Multiple Columns
- Dealing with Missing Data

## Creating Data Frames

Let's first create an empty data frame, then build one from two vectors.

In [1]:
empty <- data.frame() # empty data frame

# let's now create a vector of names 
names <-  c("Sai","Jacob","Elroy","Zoey","Bob","Mike","Andrew","James","Oleg","Valentin")

# Let's have test scores be the second vector
test <- c(45,87,80,90,98,30,89,100,95,100)

# let's make our data frame and name the columns
df <- data.frame("Students" = names, "First.Test" = test) 
df

Students,First.Test
<fct>,<dbl>
Sai,45
Jacob,87
Elroy,80
Zoey,90
Bob,98
Mike,30
Andrew,89
James,100
Oleg,95
Valentin,100
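
The <fct> type in the output above means the Students column was stored as a factor. As a minimal sketch (the name scores and its values are made up for illustration), passing stringsAsFactors = FALSE keeps the names as plain character strings, and str() reports each column's type. R 4.0.0 and later already default to character columns, so the argument mainly matters on older versions.

```r
# Hypothetical example data, separate from the lecture's df
scores <- data.frame(
  Students = c("Sai", "Jacob", "Elroy"),
  First.Test = c(45, 87, 80),
  stringsAsFactors = FALSE  # keep names as character, not factor
)

str(scores)             # compact summary of column names and types
class(scores$Students)  # "character"
```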


## Importing and Exporting Data

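We will cover importing and exporting in more detail later. The calls below are commented out because some_file.csv and some_file.xlsx are placeholder file names.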

In [2]:
#d2 <- read.csv('some_file.csv')

# For an Excel file,
# load the readxl package
#library(readxl)

# Read a worksheet using read_excel()
#df <- read_excel('some_file.xlsx', sheet = 'Sheet1')

# Output to CSV
#write.csv(df,file='some_file.csv')
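
Since the calls above are commented out, here is a small round trip that can actually be run. It is only a sketch: the data frame scores and the temporary file path are assumptions made for illustration, not files used in this lecture.

```r
# Hypothetical data and a temporary file path, used only for illustration
scores <- data.frame(Students = c("Sai", "Jacob"), First.Test = c(45, 87))
path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing the 1, 2, ... index as an extra column
write.csv(scores, file = path, row.names = FALSE)

# Read the file back into a new data frame
scores2 <- read.csv(path)
scores2
```

Without row.names = FALSE, the file gains an unnamed first column holding the row index, which then comes back as an extra column when the file is read again.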

## Getting Information about Data Frames

In [3]:
# Count the numbers of rows and columns
nrow(df)
ncol(df)

In [4]:
# Column and Row Names
colnames(df)
rownames(df) # returns the row index labels ("1", "2", ...), not student names or test scores
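
A few other base R functions are handy for a first look at a data frame. The sketch below uses the built-in mtcars data set so that it runs on its own; the variable name info is just for illustration.

```r
info <- dim(mtcars)   # rows and columns in one call
info                  # 32 11

str(mtcars)           # column names, types, and the first few values
summary(mtcars$mpg)   # min, quartiles, mean, and max of one column
head(mtcars, 3)       # first three rows
```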

## Referencing Cells

As a rule of thumb, use double brackets to get a single cell and single brackets to get multiple cells. Here's an example:

In [5]:
vec <- df[[5, 2]] # get cell by [[row,col]] num
vec
newdf <- df[1:5, 1:2] # get multiple cells in new df
newdf

df[[2, 'First.Test']] <- 100 # reassign a single cell. Jacob retakes the test and gets a 100

newdf
df

,Students,First.Test
,<fct>,<dbl>
1,Sai,45
2,Jacob,87
3,Elroy,80
4,Zoey,90
5,Bob,98


,Students,First.Test
,<fct>,<dbl>
1,Sai,45
2,Jacob,87
3,Elroy,80
4,Zoey,90
5,Bob,98


Students,First.Test
<fct>,<dbl>
Sai,45
Jacob,100
Elroy,80
Zoey,90
Bob,98
Mike,30
Andrew,89
James,100
Oleg,95
Valentin,100
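
Notice in the output above that newdf still shows 87 for Jacob after df was updated. The sketch below, using a small made-up data frame called scores, illustrates why: single brackets return a new data frame that is a copy, so later changes to the original do not affect it.

```r
# Hypothetical example data
scores <- data.frame(Students = c("Sai", "Jacob", "Elroy"),
                     First.Test = c(45, 87, 80))

one.cell <- scores[[2, "First.Test"]]  # double brackets: a bare value
sub.df <- scores[1:2, 1:2]             # single brackets: a data frame

scores[[2, "First.Test"]] <- 100       # update the original

one.cell                   # 87, captured before the update
sub.df[2, "First.Test"]    # 87, sub.df is an independent copy
scores[2, "First.Test"]    # 100
```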


## Referencing Rows
Usually you'll want to use the [row, ] notation, which returns a one-row data frame

In [6]:
rowdf <- df[1,]
rowdf

,Students,First.Test
,<fct>,<dbl>
1,Sai,45


If you want to get a row as a vector, use the following notation

In [7]:
vrow <- as.numeric(as.vector(df[1,]))
vrow # Caution: non-numeric values such as the student name do not survive as text after as.numeric()
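
If only the numeric values of a row are needed, one alternative is to select the numeric columns first and then flatten the row with unlist(). This is a sketch on made-up data; the names scores and numeric.cols are assumptions for illustration.

```r
# Hypothetical example data
scores <- data.frame(Students = c("Sai", "Jacob"),
                     First.Test = c(45, 87),
                     Second.Test = c(40, 82))

numeric.cols <- sapply(scores, is.numeric)  # TRUE for numeric columns only
vrow <- unlist(scores[1, numeric.cols])     # named numeric vector for row 1
vrow
```

The result keeps the column names as vector names, so vrow["First.Test"] still works.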

## Referencing Columns
Most column references will return a vector


In [8]:
cars <- mtcars
head(cars)

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [9]:
colv1 <- cars$mpg # returns a vector
colv1

colv2 <- cars[, 'mpg'] # returns vector
colv2

colv3 <- cars[, 1] # the column index can be an integer position or a name string
colv3

colv4 <- cars[['mpg']] # returns a vector
colv4

In [10]:
# Ways of Returning Data Frames
mpgdf <- cars['mpg'] # returns 1 col df
head(mpgdf)

mpgdf2 <- cars[1] # returns 1 col df
head(mpgdf2)

,mpg
,<dbl>
Mazda RX4,21.0
Mazda RX4 Wag,21.0
Datsun 710,22.8
Hornet 4 Drive,21.4
Hornet Sportabout,18.7
Valiant,18.1


,mpg
,<dbl>
Mazda RX4,21.0
Mazda RX4 Wag,21.0
Datsun 710,22.8
Hornet 4 Drive,21.4
Hornet Sportabout,18.7
Valiant,18.1
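
The [row, col] form can also return a data frame if you pass drop = FALSE. A short sketch on the built-in mtcars data (the names v and d are just for illustration):

```r
v <- mtcars[, "mpg"]                # single column is simplified to a vector
d <- mtcars[, "mpg", drop = FALSE]  # drop = FALSE keeps a one-column data frame

is.data.frame(v)  # FALSE
is.data.frame(d)  # TRUE
```

This is useful in code where the number of selected columns varies, since the result type stays the same whether one column or several are selected.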


## Adding Rows

In [11]:
# rbind() expects both arguments to be data frames with matching column names
Ryans.Test <- data.frame("Students" = "Ryan", "First.Test" = 80)
Ryans.Test

Students,First.Test
<fct>,<dbl>
Ryan,80


Use rbind to bind the new row!

In [12]:
dfnew <- rbind(df,Ryans.Test)
dfnew

Students,First.Test
<fct>,<dbl>
Sai,45
Jacob,100
Elroy,80
Zoey,90
Bob,98
Mike,30
Andrew,89
James,100
Oleg,95
Valentin,100
Ryan,80
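
rbind() matches columns by name, so the new row must use the same column names as the existing data frame. The sketch below uses made-up data (scores, new.row, and bad.row are hypothetical names) and wraps the failing call in tryCatch() so the script keeps running.

```r
# Hypothetical example data
scores <- data.frame(Students = c("Sai", "Jacob"), First.Test = c(45, 87),
                     stringsAsFactors = FALSE)
new.row <- data.frame(Students = "Ryan", First.Test = 80,
                      stringsAsFactors = FALSE)

combined <- rbind(scores, new.row)  # column names match, so this works
nrow(combined)                      # 3

# A row with different column names cannot be bound
bad.row <- data.frame(Name = "Ryan", Score = 80)
result <- tryCatch(rbind(scores, bad.row),
                   error = function(e) "names do not match")
result
```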


## Adding Columns

In [13]:
dfnew$Second.Test <- rep(NA, nrow(dfnew)) # NA column
dfnew

Students,First.Test,Second.Test
<fct>,<dbl>,<lgl>
Sai,45,
Jacob,100,
Elroy,80,
Zoey,90,
Bob,98,
Mike,30,
Andrew,89,
James,100,
Oleg,95,
Valentin,100,


In [14]:
dfnew[, 'Second.Test'] <- dfnew$First.Test - 5 # We can copy and adjust a column. 
dfnew

Students,First.Test,Second.Test
<fct>,<dbl>,<dbl>
Sai,45,40
Jacob,100,95
Elroy,80,75
Zoey,90,85
Bob,98,93
Mike,30,25
Andrew,89,84
James,100,95
Oleg,95,90
Valentin,100,95


We can also use equations

In [15]:
dfnew[['First.Quarter.Average']] <- (dfnew$First.Test + dfnew$Second.Test)/2
dfnew

Students,First.Test,Second.Test,First.Quarter.Average
<fct>,<dbl>,<dbl>,<dbl>
Sai,45,40,42.5
Jacob,100,95,97.5
Elroy,80,75,77.5
Zoey,90,85,87.5
Bob,98,93,95.5
Mike,30,25,27.5
Andrew,89,84,86.5
James,100,95,97.5
Oleg,95,90,92.5
Valentin,100,95,97.5


We can also assign an existing vector to a new column. cbind() is another way to do the same thing.

In [16]:
Third.Test <- c(60,91,100,80,89,95,60,99,100,87,81)

dfnew$Third.Test <- rep(NA, nrow(dfnew)) # Makes a new column

dfnew[, 'Third.Test'] <- Third.Test
dfnew

Students,First.Test,Second.Test,First.Quarter.Average,Third.Test
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Sai,45,40,42.5,60
Jacob,100,95,97.5,91
Elroy,80,75,77.5,100
Zoey,90,85,87.5,80
Bob,98,93,95.5,89
Mike,30,25,27.5,95
Andrew,89,84,86.5,60
James,100,95,97.5,99
Oleg,95,90,92.5,100
Valentin,100,95,97.5,87
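
For comparison, here is a sketch of the cbind() approach on a small made-up data frame (scores and the Bonus column are hypothetical). The vector must have one value per row.

```r
# Hypothetical example data
scores <- data.frame(Students = c("Sai", "Jacob", "Elroy"),
                     First.Test = c(45, 87, 80))
Third.Test <- c(60, 91, 100)

scores <- cbind(scores, Third.Test)          # column takes the vector's name
scores <- cbind(scores, Bonus = c(1, 2, 3))  # or name the column explicitly
colnames(scores)
```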


## Setting Column Names

In [17]:
# Rename a single column by position
colnames(dfnew)[3] <- 'Final.Score' # Renames the 3rd column to Final.Score
dfnew

# Rename all at once with a vector
colnames(dfnew) <- c('Names of Students', 'First.Exam', 'Second.Exam', 'Third.Exam' ,'Fourth.Exam')
dfnew

Students,First.Test,Final.Score,First.Quarter.Average,Third.Test
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Sai,45,40,42.5,60
Jacob,100,95,97.5,91
Elroy,80,75,77.5,100
Zoey,90,85,87.5,80
Bob,98,93,95.5,89
Mike,30,25,27.5,95
Andrew,89,84,86.5,60
James,100,95,97.5,99
Oleg,95,90,92.5,100
Valentin,100,95,97.5,87


Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Sai,45,40,42.5,60
Jacob,100,95,97.5,91
Elroy,80,75,77.5,100
Zoey,90,85,87.5,80
Bob,98,93,95.5,89
Mike,30,25,27.5,95
Andrew,89,84,86.5,60
James,100,95,97.5,99
Oleg,95,90,92.5,100
Valentin,100,95,97.5,87
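
Renaming by position breaks if the column order changes, so it can be safer to rename by matching the old name. Also note that a name containing spaces, such as 'Names of Students', needs backticks when used with $. A sketch on made-up data (scores is a hypothetical name):

```r
# Hypothetical example data
scores <- data.frame(Students = c("Sai", "Jacob"), First.Test = c(45, 87))

# Rename by matching the old name instead of using a position
colnames(scores)[colnames(scores) == "First.Test"] <- "First.Exam"

# A name with spaces is allowed, but needs backticks with $
colnames(scores)[1] <- "Names of Students"
scores$`Names of Students`
scores[["Names of Students"]]  # double brackets work without backticks
```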


## Selecting Multiple Rows

In [18]:
first.four.rows <- dfnew[1:4, ] # Same as head(dfnew, 4)
first.four.rows

,Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
1,Sai,45,40,42.5,60
2,Jacob,100,95,97.5,91
3,Elroy,80,75,77.5,100
4,Zoey,90,85,87.5,80


In [19]:
everything.but.row.two <- dfnew[-2, ] # We can remove Jacob from the list 
everything.but.row.two

,Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
1,Sai,45,40,42.5,60
3,Elroy,80,75,77.5,100
4,Zoey,90,85,87.5,80
5,Bob,98,93,95.5,89
6,Mike,30,25,27.5,95
7,Andrew,89,84,86.5,60
8,James,100,95,97.5,99
9,Oleg,95,90,92.5,100
10,Valentin,100,95,97.5,87
11,Ryan,80,75,77.5,81


In [20]:
# Conditional Selection
students.passing <- dfnew[dfnew$First.Exam > 65, ] # to also require the second exam, use: dfnew$First.Exam > 65 & dfnew$Second.Exam > 65
students.passing

students.failing <- subset(dfnew, (First.Exam + Second.Exam + Third.Exam + Fourth.Exam) / 4 < 65 )
students.failing

honor.students <- subset(dfnew, (First.Exam + Second.Exam + Third.Exam + Fourth.Exam) / 4 > 90 )
honor.students

,Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
2,Jacob,100,95,97.5,91
3,Elroy,80,75,77.5,100
4,Zoey,90,85,87.5,80
5,Bob,98,93,95.5,89
7,Andrew,89,84,86.5,60
8,James,100,95,97.5,99
9,Oleg,95,90,92.5,100
10,Valentin,100,95,97.5,87
11,Ryan,80,75,77.5,81


,Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
1,Sai,45,40,42.5,60
6,Mike,30,25,27.5,95


,Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
2,Jacob,100,95,97.5,91
5,Bob,98,93,95.5,89
8,James,100,95,97.5,99
9,Oleg,95,90,92.5,100
10,Valentin,100,95,97.5,87
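
Conditions can be combined with & (and) or | (or), and rowMeans() saves typing out the average by hand. The following is a sketch on made-up data; scores, exam.cols, and the Average column are hypothetical names.

```r
# Hypothetical example data
scores <- data.frame(Students = c("Sai", "Jacob", "Mike"),
                     First.Exam = c(45, 100, 70),
                     Second.Exam = c(40, 95, 50))

# Both conditions must hold for a row to be kept
passed.both <- scores[scores$First.Exam > 65 & scores$Second.Exam > 65, ]

# Average across the exam columns, then filter on it
exam.cols <- c("First.Exam", "Second.Exam")
scores$Average <- rowMeans(scores[, exam.cols])
honor <- subset(scores, Average > 90)

passed.both
honor
```

Keep the comma before the closing bracket in the first form. Without it, R treats the logical vector as a column selector rather than a row selector.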


## Selecting Multiple Columns

In [21]:
dfnew[, c(1, 2, 4)] # grab columns 1, 2, and 4

Names of Students,First.Exam,Third.Exam
<fct>,<dbl>,<dbl>
Sai,45,42.5
Jacob,100,97.5
Elroy,80,77.5
Zoey,90,87.5
Bob,98,95.5
Mike,30,27.5
Andrew,89,86.5
James,100,97.5
Oleg,95,92.5
Valentin,100,97.5


In [22]:
dfnew[, c("Names of Students", 'Third.Exam')] # by name

Names of Students,Third.Exam
<fct>,<dbl>
Sai,42.5
Jacob,97.5
Elroy,77.5
Zoey,87.5
Bob,95.5
Mike,27.5
Andrew,86.5
James,97.5
Oleg,92.5
Valentin,97.5


In [23]:
dfnew[, -1] # keep all but first column

First.Exam,Second.Exam,Third.Exam,Fourth.Exam
<dbl>,<dbl>,<dbl>,<dbl>
45,40,42.5,60
100,95,97.5,91
80,75,77.5,100
90,85,87.5,80
98,93,95.5,89
30,25,27.5,95
89,84,86.5,60
100,95,97.5,99
95,90,92.5,100
100,95,97.5,87


In [24]:
dfnew[, -c(1, 3)] # drop cols 1 and 3

First.Exam,Third.Exam,Fourth.Exam
<dbl>,<dbl>,<dbl>
45,42.5,60
100,97.5,91
80,77.5,100
90,87.5,80
98,95.5,89
30,27.5,95
89,86.5,60
100,97.5,99
95,92.5,100
100,97.5,87
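
Negative indices only work with column positions, not names. To drop columns by name, one option is to build the list of names to keep. A sketch on the built-in mtcars data (the names keep, trimmed, and trimmed2 are just for illustration):

```r
# Keep every column except the ones named here
keep <- setdiff(colnames(mtcars), c("mpg", "cyl"))
trimmed <- mtcars[, keep]

# Equivalent form using a logical vector
trimmed2 <- mtcars[, !(colnames(mtcars) %in% c("mpg", "cyl"))]

colnames(trimmed)
```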


## Dealing with Missing Data
Handling missing data is an essential skill when working with data frames!

In [25]:
any(is.na(mtcars)) # detect anywhere in df

In [26]:
# Let's create a missing value in dfnew
dfnew[[6, 'Second.Exam']] <- NA # reassign a single cell. Mike doesn't have a score for the second exam
any(is.na(dfnew)) # checks to see if there are any NA values
dfnew

Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Sai,45,40.0,42.5,60
Jacob,100,95.0,97.5,91
Elroy,80,75.0,77.5,100
Zoey,90,85.0,87.5,80
Bob,98,93.0,95.5,89
Mike,30,,27.5,95
Andrew,89,84.0,86.5,60
James,100,95.0,97.5,99
Oleg,95,90.0,92.5,100
Valentin,100,95.0,97.5,87


In [31]:
# We can drop the rows that contain missing data
# (stored in a new variable so that dfnew keeps Mike's row for the next steps)
dfcomplete <- dfnew[!is.na(dfnew$Second.Exam), ] # Good bye, Mike!
dfcomplete

,Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
1,Sai,45,40,42.5,60
2,Jacob,100,95,97.5,91
3,Elroy,80,75,77.5,100
4,Zoey,90,85,87.5,80
5,Bob,98,93,95.5,89
7,Andrew,89,84,86.5,60
8,James,100,95,97.5,99
9,Oleg,95,90,92.5,100
10,Valentin,100,95,97.5,87
11,Ryan,80,75,77.5,81


In [28]:
# Or we can replace NAs with something else
dfnew[is.na(dfnew)] <- 0 # Replaces all NA values with 0's
dfnew

Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Sai,45,40,42.5,60
Jacob,100,95,97.5,91
Elroy,80,75,77.5,100
Zoey,90,85,87.5,80
Bob,98,93,95.5,89
Mike,30,0,27.5,95
Andrew,89,84,86.5,60
James,100,95,97.5,99
Oleg,95,90,92.5,100
Valentin,100,95,97.5,87


In [30]:
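# Or we can fill in the average value of that column

dfnew[[6, 'Second.Exam']] <- NA # make Mike's score missing again
dfnew

# na.rm = TRUE is required: without it, mean() returns NA when any value is missing
dfnew$Second.Exam[is.na(dfnew$Second.Exam)] <- mean(dfnew$Second.Exam, na.rm = TRUE)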
dfnew

Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Sai,45,40.0,42.5,60
Jacob,100,95.0,97.5,91
Elroy,80,75.0,77.5,100
Zoey,90,85.0,87.5,80
Bob,98,93.0,95.5,89
Mike,30,,27.5,95
Andrew,89,84.0,86.5,60
James,100,95.0,97.5,99
Oleg,95,90.0,92.5,100
Valentin,100,95.0,97.5,87


Names of Students,First.Exam,Second.Exam,Third.Exam,Fourth.Exam
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Sai,45,40.0,42.5,60
Jacob,100,95.0,97.5,91
Elroy,80,75.0,77.5,100
Zoey,90,85.0,87.5,80
Bob,98,93.0,95.5,89
Mike,30,82.7,27.5,95
Andrew,89,84.0,86.5,60
James,100,95.0,97.5,99
Oleg,95,90.0,92.5,100
Valentin,100,95.0,97.5,87
