# Data Frames

Data frames are one of the most important data types in R. They can be thought of as matrices whose columns may have different types; in fact, a data frame is just a list of equal-length vectors, one per column. <br>
Usually, a data frame holds a set of observations (from an experiment, for example), where the rows are the observations and the columns are the observed variables.
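
The list-of-columns view can be checked directly on the built-in iris data set. This is a minimal sketch; the names col_types and col_lengths are only illustrative.

```r
# a data frame is stored as a list with one element per column
is.list(iris)
is.data.frame(iris)

# each column keeps its own type
col_types <- sapply(iris, class)
col_types

# all columns have the same length, which is the number of rows
col_lengths <- lengths(iris)
col_lengths
```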


In [None]:
# iris is one of R's pre-loaded data frames; let's print its first lines
head(iris)


# print the last lines
tail(iris)



## Structure of a data frame

In [None]:
# print the structure (the type of each column)
str(iris)

# make a pairs plot: each column is plotted against the others (the last column, Species, is dropped). This plot is very useful to get a general idea of the data frame
pairs(iris[ , -ncol(iris)])
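
Besides str, a few base R helpers report the size and the column names of a data frame. A short sketch on iris:

```r
# number of rows and number of columns, as a vector of length two
dim(iris)

# the same information, one value at a time
nrow(iris)
ncol(iris)

# column names
names(iris)
```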



## Manually create a data frame

You can also create data frames manually from vectors of the same length.




In [None]:
players <- c("Alice", "Bob", "John", "Richard")
sex <- factor(c("F", "M", "M", "M"))
age <- c(20, 26, 25, 65)

df <- data.frame(Players = players, Sex = sex, Age = age)
df
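
Once a data frame exists, columns and rows can be added to it. The sketch below is self-contained; the extra player "Eve" and her age are hypothetical values used only for illustration.

```r
players <- c("Alice", "Bob", "John", "Richard")
age <- c(20, 26, 25, 65)
df <- data.frame(Players = players, Age = age)

# add a new column computed from an existing one
df$Senior <- df$Age > 60

# append a new row; the appended data frame must have the same column names
df <- rbind(df, data.frame(Players = "Eve", Age = 31, Senior = FALSE))
df
```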

## Subsetting data frames
Data frames can be subset in the same way as matrices, as we saw before.
Single columns can be selected with the dollar operator. Be aware that selecting with the dollar operator returns a vector, not a data frame.

In [None]:
"Second and third row"
iris[2:3,]

"One element"
iris[2,4]

"One full column" 
iris$Sepal.Length
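
Whether a single-column selection comes back as a vector or as a data frame depends on how it is written. The following sketch compares the common forms; the variable names are only illustrative.

```r
# the dollar operator returns a plain vector
v <- iris$Sepal.Length
class(v)

# matrix-style selection of one column also drops to a vector...
v2 <- iris[, "Sepal.Length"]
class(v2)

# ...unless drop = FALSE is given, which keeps a one-column data frame
d1 <- iris[, "Sepal.Length", drop = FALSE]
class(d1)

# list-style selection (no comma) always returns a data frame
d2 <- iris["Sepal.Length"]
class(d2)
```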


## The subset function

A useful function to subset data frames is the "subset" function, which allows you to select part of a data frame based on some conditions.

In [None]:

"Select only the virginica species"
head( subset(iris, Species == "virginica") )

"Select only virginica with sepal width between 3 and 4"
head( subset(iris, Sepal.Width > 3 & Sepal.Width < 4 & Species == "virginica") )  # OR is |
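
The subset function can also pick columns through its select argument, and the same result can be obtained with plain logical indexing. A minimal sketch, where the names a and b are only illustrative:

```r
# rows chosen by a condition, columns chosen with select
a <- subset(iris, Species == "virginica" & Sepal.Width > 3,
            select = c(Sepal.Width, Species))

# equivalent selection with logical indexing: columns must be written as iris$...
b <- iris[iris$Species == "virginica" & iris$Sepal.Width > 3,
          c("Sepal.Width", "Species")]

head(a)
all.equal(a, b)
```

Inside subset the column names can be used directly, which keeps long conditions readable; the indexing form is more verbose but behaves more predictably inside functions.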


## Sorting data frames

To sort a data frame by one column, apply the "order" function to that column and use the result to index the rows.

In [None]:
"Sort by sepal width"
head( iris[order(iris$Sepal.Width), ] )

"Sort by sepal width, decreasing order"
head( iris[order(iris$Sepal.Width, decreasing = TRUE), ] )
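
The order function accepts several columns, using the later ones to break ties in the earlier ones. The sketch below sorts by species and, within each species, by decreasing sepal width; negating a column to reverse its order only works for numeric columns.

```r
# sort by Species (increasing), then by Sepal.Width (decreasing)
sorted <- iris[order(iris$Species, -iris$Sepal.Width), ]
head(sorted)
```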



## Wide to long format

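The data frame iris that we saw before is in "wide" format: each measured variable has its own column. In "long" format there is one row per measurement, with one column naming the variable and another holding its value. Many functions in R expect data to be in long format rather than wide format (especially the ggplot functions that we will see later on in the course). <br>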
For more examples, see https://www.datacamp.com/community/tutorials/long-wide-data-R

In [None]:
library(reshape2)

head(iris)

# add a row identifier so that each observation can be rebuilt later
iris_id <- cbind(id = seq_len(nrow(iris)), iris)

# turn iris into long format
m <- melt(iris_id, id.vars = c("id", "Species"))
head(m, 20)

# turn long format back to wide format. dcast requires a formula (rows ~ columns) describing the layout of the result
d <- dcast(m, id + Species ~ variable, value.var = "value")
head(d)
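
To make the difference between the two formats concrete, here is a tiny sketch that uses base R's reshape function instead of reshape2, so it runs without extra packages. The data frame and its values are hypothetical.

```r
# a small wide data frame: one row per subject, one column per variable
wide <- data.frame(id = 1:2, height = c(170, 180), weight = c(60, 80))
wide

# long format: one row per (subject, variable) pair
long <- reshape(wide, direction = "long",
                varying = c("height", "weight"),  # columns to stack
                v.names = "value",                # column that will hold the values
                timevar = "variable",             # column that will hold the variable name
                times = c("height", "weight"),
                idvar = "id")
long
```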


## Grouping operations

Grouping operations are useful to calculate summary statistics for each group. They are similar to the tapply function that we saw before. <br> <br>


For more powerful grouping tools, check the dplyr package, which is part of the tidyverse: https://dplyr.tidyverse.org/


In [None]:
"*********** Summary of each column ***********"
summary(iris)


"*********** How many observations per species are there? ***********"
table(iris$Species)


"*********** Column means ***********"
colMeans(iris[, -ncol(iris)])


"*********** Mean of each numeric column by Species ***********"
aggregate(x = iris[, -ncol(iris)], by = list(iris$Species), FUN = mean) # FUN can be any function, including one you wrote yourself
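
The link with tapply can be seen by computing the same group means in two ways. This sketch also uses the formula interface of aggregate, which is an alternative to the x/by form; the variable names are only illustrative.

```r
# mean sepal length per species with tapply: returns a named array
means_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)
means_tapply

# the same summary with aggregate and a formula (value ~ group): returns a data frame
means_formula <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
means_formula
```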



## Reading data frames from files

R provides quick functions to read data frames from files: read.table, read.csv and read.delim. read.csv and read.delim are just wrappers for read.table with different defaults (for example, the field separator and header = TRUE); have a look at the documentation to see what they actually do.


In [None]:
df <- read.table("Data/USArrests.csv", sep=",")
head(df)

str(df)

# You can even read the file directly from its URL without downloading it first
df <- read.table("https://raw.githubusercontent.com/CBGP-UPM-INIA-PUBLIC/Introduction_R/master/Data/USArrests.csv", sep=",")
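
The wrapper relationship can be sketched without any file by passing the data through the text argument. The three lines of CSV below are a hand-typed excerpt used only for illustration; they are not read from Data/USArrests.csv.

```r
csv_text <- "State,Murder,Assault\nAlabama,13.2,236\nAlaska,10.0,263"

# read.table needs the separator and the header stated explicitly
a <- read.table(text = csv_text, sep = ",", header = TRUE)

# read.csv sets sep = "," and header = TRUE by default
b <- read.csv(text = csv_text)

str(a)
all.equal(a, b)
```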


## Writing data frames to files

Use write.table to write a data frame to a file. 

In [None]:
players <- c("Alice", "Bob", "John", "Richard")
sex <- factor(c("F", "M", "M", "M"))
age <- c(20, 26, 25, 65)

df <- data.frame(Players = players, Sex = sex, Age = age)

write.table(df, "Data/players.tsv", sep="\t")

# quote = FALSE removes the quotes around strings
#write.table(df, "Data/players.tsv", sep="\t", quote = FALSE)

# row.names = FALSE also removes the row names
#write.table(df, "Data/players.tsv", sep="\t", quote = FALSE, row.names = FALSE)

# Want to read an Excel file? Check the readxl package
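
A quick way to check what write.table produced is to write to a temporary file and read it back. This is a minimal sketch that uses tempfile so that nothing is written into the Data folder; the variable names are only illustrative.

```r
df <- data.frame(Players = c("Alice", "Bob"), Age = c(20, 26))

# write a tab-separated file without quotes and without row names
path <- tempfile(fileext = ".tsv")
write.table(df, path, sep = "\t", quote = FALSE, row.names = FALSE)

# inspect the raw lines of the file
file_lines <- readLines(path)
file_lines

# read.delim expects a tab-separated file with a header line
back <- read.delim(path)
back
```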

