
# ScienceParkStudyGroup / studyGroup Public

forked from mozillascience/studyGroup


# Exploratory Data Analysis with R #17

Closed
opened this issue Jan 29, 2018 · 0 comments

## Exploratory Data Analysis with R by Marc Galland

Level: novice

Lesson type: hands-on (practical session)
Lesson is here

Prerequisites: You need to know how to start R and RStudio. You will be guided through the rest of the practical.

Tech or materials needed: Bring your own laptop. We will install the `tidyverse` and `nycflights13` libraries together.

Time to complete: one hour.

## Summary/Context/Objectives

This lesson will help you process and explore the `flights` dataset from the `nycflights13` package. This example dataset is already in the tidy format (one observation per row, one variable per column). We will explore a few useful functions to get basic statistics on the dataset and make exploratory plots. These are the first steps in the Research Data Life Cycle (see the scheme below).

### Lesson steps

1. Install the necessary `tidyverse` and `nycflights13` R libraries.
2. Load the `flights` dataset that we will work with.
3. Explore the `flights` dataset to show and understand the different variables.
4. Filter the `flights` dataset using the `filter` function to keep only flights that depart from John F. Kennedy International Airport (JFK) and are bound for Los Angeles International Airport (LAX).
5. Plot a distribution of the flight delays.
6. Plot the number of flights operated per airline (the `carrier` variable).
7. Calculate the mean and standard deviation (SD) with a grouping variable (the airline).
8. Relate the variable `dep_delay` to `arr_delay`.
9. Have a first insight into regression.
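
The steps above could be sketched roughly as follows. This is only an illustration, not the lesson's actual script: the choice of `dep_delay` as "the delay" in steps 5 and 7, the histogram bin width, the linear model in step 9, and all object names (`jfk_lax`, `p_delay`, `p_carrier`, `delay_by_carrier`, `p_scatter`, `fit`) are assumptions made for this example.

```r
# Step 1 (run once, needs internet access):
# install.packages(c("tidyverse", "nycflights13"))

library(tidyverse)     # attaches dplyr, ggplot2, tibble and friends
library(nycflights13)  # provides the `flights` dataset

# Steps 2-3: the dataset is available once the package is attached
glimpse(flights)             # one line per variable, with type and first values
summary(flights$dep_delay)   # basic statistics on departure delay (minutes)

# Step 4: keep only flights leaving JFK for Los Angeles (LAX)
jfk_lax <- flights %>%
  filter(origin == "JFK", dest == "LAX")

# Step 5: distribution of departure delays (assumed variable: dep_delay)
p_delay <- ggplot(jfk_lax, aes(x = dep_delay)) +
  geom_histogram(binwidth = 10, na.rm = TRUE) +
  labs(x = "Departure delay (minutes)", y = "Number of flights")

# Step 6: number of flights per airline
p_carrier <- ggplot(jfk_lax, aes(x = carrier)) +
  geom_bar() +
  labs(x = "Airline (carrier code)", y = "Number of flights")

# Step 7: mean and SD of the departure delay, grouped by airline
delay_by_carrier <- jfk_lax %>%
  group_by(carrier) %>%
  summarise(
    n_flights      = n(),
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),  # NA = cancelled flights
    sd_dep_delay   = sd(dep_delay, na.rm = TRUE)
  )

# Step 8: relate departure delay to arrival delay
p_scatter <- ggplot(jfk_lax, aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.3, na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# Step 9: a first look at regression with a simple linear model
fit <- lm(arr_delay ~ dep_delay, data = jfk_lax)
summary(fit)
```

The plots are stored as objects rather than drawn immediately; typing `p_delay`, `p_carrier` or `p_scatter` in the RStudio console displays them, and printing `delay_by_carrier` shows the per-airline summary table.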

Glossary:

• Data frame: the equivalent of an Excel spreadsheet. More formally, a list of vectors of possibly different data types (character, integer, numeric, etc.) that all have the same length (the number of rows). In addition, a data frame generally has a `names` attribute labeling the variables and a `row.names` attribute labeling the cases.
• Tibble: the core tidyverse data structure is a tibble; this is a modern take on the data frame. You can find an extensive and practical definition here.
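
To make the two glossary entries concrete, here is a small sketch comparing a base R data frame with a tibble. The values are made-up toy data (not taken from `flights`), and the object names `df` and `tb` are chosen for this example only.

```r
library(tibble)

# A base R data frame: a list of equal-length columns of different types
df <- data.frame(
  carrier    = c("AA", "DL", "UA"),   # character
  flights    = c(3L, 5L, 2L),         # integer
  mean_delay = c(4.5, 1.2, 7.8),      # numeric (double)
  stringsAsFactors = FALSE
)

is.list(df)        # a data frame really is a list under the hood
names(df)          # the names attribute labels the variables (columns)
row.names(df)      # the row.names attribute labels the cases (rows)
sapply(df, class)  # each column keeps its own type

# The same data as a tibble: still a data frame, with stricter behavior
tb <- as_tibble(df)
class(tb)
is.data.frame(tb)

# Selecting a single column with [ , ] shows one practical difference:
# the data frame simplifies to a plain vector, the tibble stays a tibble
class(df[, "carrier"])
class(tb[, "carrier"])
```

Printing `tb` also shows the column types under the column names, which is handy when exploring an unfamiliar dataset.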

## Additional Resources & further exploration
