2.4 Data Frames
A data frame has the variables of a dataset as columns and the observations as rows.
Working with large datasets is not uncommon in data analysis. When you work with (extremely) large datasets and data frames, your first task as a data analyst is to develop a clear understanding of their structure and main elements. Therefore, it is often useful to show only a small part of the entire dataset.
The function head() enables you to show the first observations of a data frame. Similarly, the function tail() prints out the last observations in your dataset.
head(mtcars)
tail(mtcars)
Another method that is often used to get a rapid overview of your data is the function str(), which shows you the structure of your dataset. For a data frame it tells you:
- The total number of observations (e.g. 32 car types)
- The total number of variables (e.g. 11 car features)
- A full list of the variable names (e.g. mpg, cyl, …)
- The data type of each variable (e.g. num)
- The first observations
Applying the str() function will often be the first thing that you do when receiving a new dataset or data frame. It is a great way to get more insight into your dataset before diving into the real analysis.
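The items that str() reports can also be retrieved one at a time. The following is a minimal sketch using the built-in mtcars dataset from the examples above. The n argument of head() and tail() and the helper calls dim(), names() and sapply() are standard base R, but they are additions that the text itself does not cover.

```r
# mtcars ships with base R (datasets package), so nothing needs to be loaded

head(mtcars, n = 3)     # first 3 observations instead of the default 6
tail(mtcars, n = 3)     # last 3 observations

dim(mtcars)             # number of observations and number of variables
names(mtcars)           # full list of the variable names
sapply(mtcars, class)   # data type of each variable

str(mtcars)             # all of the above in one compact overview
```

Calling str() first and reaching for the individual functions only when you need one value in your code (for example nrow(mtcars) in a calculation) is usually enough.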
As a first goal, you want to construct a data frame that describes the main characteristics of the eight planets in our solar system. According to your good friend Buzz, the main features of a planet are its name, its type, its diameter, its rotation, and whether it has rings:
# Definition of vectors
name <- c("Mercury", "Venus", "Earth",
"Mars", "Jupiter", "Saturn",
"Uranus", "Neptune")
type <- c("Terrestrial planet",
"Terrestrial planet",
"Terrestrial planet",
"Terrestrial planet", "Gas giant",
"Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532,
11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03,
0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
# Check the structure of planets_df
str(planets_df)
'data.frame': 8 obs. of 5 variables:
$ name : chr "Mercury" "Venus" "Earth" "Mars" ...
$ type : chr "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
$ diameter: num 0.382 0.949 1 0.532 11.209 ...
$ rotation: num 58.64 -243.02 1 1.03 0.41 ...
$ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively.
Sometimes you want to select all elements of a row or column. For example, my_df[1, ] selects all elements of the first row.
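As a concrete illustration of row and column selection, here is a small sketch on the built-in mtcars dataset, which stands in for the placeholder my_df. The drop = FALSE argument in the last line is standard base R but is not covered in the text; it is included only to show how to keep a single column as a data frame.

```r
# mtcars ships with base R, so this block runs on its own

mtcars[1, 2]               # single value: row 1, column 2 (the cyl variable)
mtcars[1:3, 2:4]           # rows 1 to 3 and columns 2 to 4 (cyl, disp, hp)
mtcars[1, ]                # all columns of the first row, still a data frame
mtcars[, 1]                # all rows of the first column, returned as a vector
mtcars[, 1, drop = FALSE]  # same column, kept as a one-column data frame
```

Leaving the position before or after the comma empty is what selects everything along that dimension.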
my_df[1,2] # selects the value at the first row and second column in my_df.
my_df[1:3,2:4] # selects rows 1, 2, 3 and columns 2, 3, 4 in my_df.
my_df[1, ] # selects row 1 completely
A possible disadvantage of selecting by position is that you have to know (or look up) the column number of a variable such as type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:
planets_df[1:5, "diameter"]
You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:
planets_df[,3]
planets_df[,"diameter"]
However, there is a shortcut. If your columns have names, you can use the $ sign:
planets_df$diameter
# planets_df is pre-loaded in your workspace
# Select the rings variable from planets_df
rings_vector <- planets_df$rings
# Print out rings_vector
rings_vector
planets_df[rings_vector, "name"]
[1] "Jupiter" "Saturn" "Uranus" "Neptune"
# Note the comma and the empty space for selecting all columns
planets_df[rings_vector, ]
     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE
You can get the same result with the function subset():
subset(my_df, subset = some_condition)
The first argument of subset() specifies the dataset for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.
# Same result as the previous selection:
subset(planets_df, subset = rings)
     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE
subset(planets_df, diameter < 1)
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
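To tie the two approaches together, the sketch below rebuilds planets_df so that it runs on its own, then selects the same rows once with a logical vector inside square brackets and once with subset(). The variable names small_by_index, small_by_subset and ringed_forward are made up for this illustration, and the select argument of subset() is standard base R that the text does not cover.

```r
# Rebuild the planets data frame from this section so the block is self-contained
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  type     = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE  # keeps text columns as character on older R versions
)

# The same condition, written with square brackets and with subset()
small_by_index  <- planets_df[planets_df$diameter < 1, ]
small_by_subset <- subset(planets_df, diameter < 1)
small_by_index$name    # "Mercury" "Venus" "Mars"
small_by_subset$name   # "Mercury" "Venus" "Mars"

# Conditions can be combined with &, and select keeps only the named columns
ringed_forward <- subset(planets_df, rings & rotation > 0,
                         select = c(name, rotation))
ringed_forward
```

Inside subset() the column names can be used directly, which avoids repeating planets_df$ in every condition.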