Exploratory Data Analysis with R #17

Closed
mgalland opened this Issue Jan 29, 2018 · 0 comments

Exploratory Data Analysis with R by Marc Galland

Level: novice

Lesson type: hands-on (practical session)
Lesson is here

Prerequisites: You need to know how to start R and RStudio. You will be guided through the rest of the practical.

Tech or materials needed: Bring your own laptop. We will install the tidyverse and nycflights13 libraries together.

Time to complete: one hour.

Summary/Context/Objectives

This lesson will help you process and explore the flights dataset from the nycflights13 package. This example dataset is already in the tidy format (one observation per row, one variable per column). We will explore a few useful functions to get basic statistics on the dataset and make exploratory plots. These are the first steps in the Research Data Life Cycle (see the scheme below).

The Research Data Life Cycle

Lesson steps

  1. Install the necessary tidyverse and nycflights13 R libraries.
  2. Load the flights dataset that we will work with.
  3. Explore the flights dataset to show and understand the different variables.
  4. Filter the flights dataset with the filter function to keep only flights departing from John F. Kennedy International Airport (JFK) and bound for Los Angeles International Airport.
  5. Plot a distribution of the flight delays.
  6. Plot the number of flights operated per flight company.
  7. Calculate the mean and standard deviation (SD) with a grouping variable (aircraft company).
  8. Relate the variable dep_delay to arr_delay.
  9. Have a first insight into regression.
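
The steps above could look roughly like the sketch below. It is not the lesson's actual script: it assumes the airport codes "JFK" and "LAX" in the origin and dest columns, assumes that the "flight company" and "aircraft company" correspond to the carrier column of flights, and picks the plot types, the bin width, and all object names (jfk_lax, delay_stats, and so on) purely for illustration.

```r
# Step 1: install once if needed (left commented so the script does not reinstall).
# install.packages(c("tidyverse", "nycflights13"))

library(tidyverse)     # attaches dplyr and ggplot2, among others
library(nycflights13)  # provides the flights dataset

# Steps 2-3: load the dataset and look at its variables.
data("flights", package = "nycflights13")
glimpse(flights)

# Step 4: keep only JFK -> LAX flights (airport codes are an assumption).
jfk_lax <- flights %>%
  filter(origin == "JFK", dest == "LAX")

# Step 5: distribution of departure delays in minutes (bin width chosen arbitrarily).
delay_hist <- ggplot(jfk_lax, aes(x = dep_delay)) +
  geom_histogram(binwidth = 10, na.rm = TRUE)

# Step 6: number of flights per carrier (assumed to be the "flight company").
carrier_bars <- ggplot(jfk_lax, aes(x = carrier)) +
  geom_bar()

# Step 7: mean and SD of the departure delay, grouped by carrier.
delay_stats <- jfk_lax %>%
  group_by(carrier) %>%
  summarise(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    sd_dep_delay   = sd(dep_delay, na.rm = TRUE),
    n_flights      = n()
  )

# Step 8: relate dep_delay to arr_delay with a scatter plot and a fitted line.
delay_scatter <- ggplot(jfk_lax, aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.3, na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# Step 9: a first simple linear regression of arrival delay on departure delay.
delay_fit <- lm(arr_delay ~ dep_delay, data = jfk_lax)
summary(delay_fit)
```

Typing the name of a plot object (for example delay_hist) at the console draws it. Cancelled flights have missing delays, which is why the summaries use na.rm = TRUE.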

Glossary:

  • Data frame: the equivalent of an Excel spreadsheet. More formally, a list of vectors of possibly different types (character, integer, numeric, etc.) that all have the same length (number of rows). In addition, a data frame generally has a names attribute labeling the variables and a row.names attribute labeling the cases.
  • Tibble: the core tidyverse data structure is a tibble; this is a modern take on the data frame. You can find an extensive and practical definition here.
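
To make the two glossary entries concrete, here is a small sketch that builds a data frame by hand and converts it to a tibble. The column names and values are made up for illustration and are not taken from the flights dataset.

```r
library(tibble)

# A data frame is a list of equal-length columns that may differ in type.
# The values below are hypothetical.
df <- data.frame(
  carrier    = c("AA", "DL", "UA"),   # character
  flights    = c(3L, 5L, 2L),         # integer
  mean_delay = c(4.5, 10.2, -1.0),    # numeric (double)
  stringsAsFactors = FALSE            # keep text as character in older R versions
)

is.list(df)        # a data frame is a list under the hood
names(df)          # the names attribute labels the variables
row.names(df)      # the row.names attribute labels the cases
sapply(df, class)  # each column keeps its own type

# A tibble is still a data frame, with stricter behaviour and tidier printing.
tb <- as_tibble(df)
class(tb)
```

Because a tibble inherits from data.frame, functions written for data frames generally accept it as well.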

Additional Resources & further exploration
