DATA 607 Tidyverse Assignment

1. Purpose of this project

The purpose of this project is to create a programming sample vignette that demonstrates how to use one or more of the capabilities of selected tidyverse packages, using a dataset selected from either Kaggle or FiveThirtyEight.

2. About Tidyverse

The tidyverse is a collection of packages designed for data science applications. Its core currently consists of the following packages:

ggplot2 - for creating graphics
dplyr - for data manipulation
tidyr - to get data into a consistent and ‘tidy’ format
readr - to read rectangular data
purrr - tools for working with functions and vectors
tibble - a modern take on the data frame
stringr - for working with strings
forcats - tools to solve common problems with factors

For more information, please visit the Tidyverse website.

3. Dataset

The objective of this exercise is to compare the defense spending of the US, the European Union and India over the years. As such, the dataset selected for this vignette details defense spending by country from 1960 through 2018. This data was sourced from Kaggle and can be found here. It was first published by the World Bank, and a more extensive dataset with related information can be found on the World Bank website.

4. Readr (read_csv)

Since this data is rectangular and stored in CSV format, we are going to demonstrate the capabilities of the read_csv() function from the readr package.

As detailed in the documentation for read_csv(), the function is a special case of read_delim() for reading a common type of flat file: comma-separated values. The syntax and a limited selection of arguments are as follows:

read_csv(file, col_names = TRUE, col_types = NULL, skip = 0, n_max = Inf, guess_max = min(1000, n_max), skip_empty_rows = TRUE)

where:

file - path to a file, a connection, or literal data
col_names - If TRUE, the first row of the input will be used as the column names. If FALSE, column names will be generated automatically: X1, X2, X3 etc.
col_types - If NULL, all column types will be guessed from the first 1000 rows of the input. If it is a column specification created by cols(), it must contain one column specification for each column.
skip - Number of lines to skip before reading data.
n_max - Maximum number of records to read.
guess_max - Maximum number of records to use for guessing column types.
skip_empty_rows - Should blank rows be ignored?
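The arguments above can be tried on a small file before pointing read_csv() at the real data. The following is a minimal sketch: the file contents, the column names (Name, Code and the year columns) and the values are hypothetical stand-ins, not taken from the actual dataset.

```r
library(readr)

# Write a tiny, made-up wide-format CSV to a temporary file.
# Column names and values are illustrative assumptions only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Code,2016,2017,2018",
  "Country A,AAA,100,110,120",
  "Country B,BBB,200,,220"
), csv_path)

# Read it back, stating the column types explicitly instead of
# letting readr guess them from the first rows.
sample_wide <- read_csv(
  csv_path,
  col_names = TRUE,
  col_types = cols(
    Name = col_character(),
    Code = col_character(),
    .default = col_double()
  )
)

print(sample_wide)
```

The empty field in the second data row is read as NA, and the year columns keep their original names, so they must be wrapped in backticks or quoted when referred to later.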

5. Tidyr (pivot_longer)

According to the tidyverse website, the goal of tidyr is to create ‘tidy data’ where:

Every column is a variable.
Every row is an observation.
Every cell is a single value.

When reviewing the dataframe created above, we note that the original dataset is in a ‘wide’ format. In order to work effectively with this dataset we have to transform it into long form. This is where the pivoting functions of the tidyr package come into play. The specific pivoting function we will be using is pivot_longer(), which supersedes gather(). Note that at the time of writing pivot_longer() was still in development, so this exercise used the development version of the tidyverse packages installed with devtools; the function has since been released as part of tidyr 1.0.0.

As detailed in the documentation for pivot_longer(), the function ‘lengthens’ data, increasing the number of rows and decreasing the number of columns. The syntax and a limited selection of arguments are as follows:

pivot_longer(data, cols, names_to = "name", values_to = "value", values_drop_na = FALSE)

where:

data - data frame to pivot.
cols - columns to pivot into longer format
names_to - string specifying the name of the column to create
values_to - name of the column to create from the data stored in cell values
values_drop_na - If TRUE, will drop rows that contain only NAs in the values_to column.
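As a rough illustration of these arguments, the sketch below pivots a small made-up wide table into long form. The column names (Name, Code, Year, Expenditure) and the values are assumptions chosen for the example, not the actual dataset.

```r
library(tibble)
library(tidyr)

# Hypothetical wide-format data: one column per year.
sample_wide <- tibble(
  Name = c("Country A", "Country B"),
  Code = c("AAA", "BBB"),
  `2016` = c(100, 200),
  `2017` = c(110, NA),
  `2018` = c(120, 220)
)

# Pivot every column except Name and Code into Year/Expenditure pairs,
# dropping the rows whose value is NA.
sample_long <- pivot_longer(
  sample_wide,
  cols = -c(Name, Code),
  names_to = "Year",
  values_to = "Expenditure",
  values_drop_na = TRUE
)

print(sample_long)
```

The Year column produced this way is of type character, because it is built from column names; it can be converted with as.integer() if a numeric axis is needed later.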

6. dplyr (mutate, filter)

The dplyr package is described as a “grammar of data manipulation”. Its five main ‘verbs’ for manipulating data are:

mutate() - adds new variables that are functions of existing variables
select() - picks variables based on their names.
filter() - picks cases based on their values.
summarise() - reduces multiple values down to a single summary.
arrange() - changes the ordering of the rows.

For the purposes of this exercise we are going to use the mutate() and filter() functions:

mutate() - Currently the expenditure column is in dollars. For convenience we will create a new column that expresses expenditure in billions, using the mutate() function in dplyr. This function lets us add new variables to the dataframe while preserving the old ones. The syntax for mutate is:

mutate(.data, new_column = calculation)
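A minimal sketch of the conversion to billions is shown here. The column names (Expenditure, Expenditure_Billions) and the figures are hypothetical and only serve to show the shape of the call.

```r
library(tibble)
library(dplyr)

# Made-up long-format data with expenditure in dollars.
sample_long <- tibble(
  Code = c("AAA", "AAA", "BBB"),
  Year = c(2017, 2018, 2018),
  Expenditure = c(1.1e11, 1.2e11, 2.2e11)
)

# Add a column in billions of dollars; the original column is kept.
with_billions <- mutate(sample_long, Expenditure_Billions = Expenditure / 1e9)

print(with_billions)
```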

filter() - As stated earlier, we selected this data to compare the defense expenditures of the USA, the EU and India over the years. We will use the filter() function to extract the data for just these three entities from the larger dataframe. The filter function syntax is as follows:

filter(.data, logical condition, .preserve = FALSE)

where:

.data - data table
logical condition - Multiple conditions are combined with &. Only rows where the condition evaluates to TRUE are kept.
.preserve - when FALSE, the grouping structure is recalculated based on the resulting data, otherwise it is kept as is.
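The sketch below shows one way the three entities could be picked out. It assumes the country code column is called Code and that the entities are identified by the World Bank style codes USA, EUU and IND; both are assumptions that should be checked against the actual file. The values are made up.

```r
library(tibble)
library(dplyr)

# Hypothetical long-format data; the figures are not real.
sample_long <- tibble(
  Code = c("USA", "EUU", "IND", "CHN", "USA"),
  Year = c(2018, 2018, 2018, 2018, 2017),
  Expenditure_Billions = c(5, 4, 3, 2, 1)
)

# Two conditions separated by a comma are combined with &:
# keep the three entities of interest, and only rows from 2018 onwards.
selected <- filter(
  sample_long,
  Code %in% c("USA", "EUU", "IND"),
  Year >= 2018
)

print(selected)
```

Using %in% keeps the condition short when several codes are wanted; chaining Code == "USA" | Code == "EUU" | Code == "IND" would give the same rows.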

7. ggplot2 (geom_line)

Finally, we are going to plot the three expenditures on the same graph using the ggplot2 package. The steps to create a line chart using ggplot2 are as follows:

Step 1:

ggplot() initializes a ggplot object. It declares the input data frame for a graphic and sets the aesthetics intended to be common to all subsequent layers. The syntax for creating a ggplot object is ggplot(data = NULL, mapping = aes(), ...)

where:

data - dataset to use for plotting
mapping - list of aesthetic mappings to use for plotting

Step 2:

geom_line() connects the observations in the data frame, ordered by the x variable. Note that to plot several lines we have to tell ggplot2 how the rows are grouped, by naming the column whose discrete values determine the groups. In this example the column which contains country codes (‘Code’) determines the grouping, since each line is for a single value of ‘Code’.

Step 3:

Other arguments determine the overall look of the visualization.
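The three steps can be sketched together as follows. The data frame, its column names (Code, Year, Expenditure_Billions) and its values are hypothetical, and the labels and theme are just one possible choice for Step 3.

```r
library(tibble)
library(ggplot2)

# Made-up data for three entities over three years (not real figures).
plot_data <- tibble(
  Code = rep(c("USA", "EUU", "IND"), each = 3),
  Year = rep(2016:2018, times = 3),
  Expenditure_Billions = c(10, 11, 12, 6, 7, 8, 1, 2, 3)
)

# Step 1: initialize the plot with the data and the shared aesthetics.
# Step 2: draw one line per value of Code.
# Step 3: set labels and a theme for the overall look.
p <- ggplot(
  plot_data,
  aes(x = Year, y = Expenditure_Billions, group = Code, colour = Code)
) +
  geom_line() +
  labs(
    title = "Defense expenditure by entity",
    x = "Year",
    y = "Expenditure (billions)"
  ) +
  theme_minimal()

print(p)
```

Mapping Code to both group and colour gives each entity its own line and its own legend entry; mapping colour alone to a discrete column would also imply the grouping.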
