## Exploratory Data Analysis with R by Marc Galland

Level: novice

Lesson type: hands-on (practical session)
Lesson is here

Prerequisites: You need to know how to start R and RStudio. You will be guided through the rest of the practical.

Tech or materials needed: Bring your own laptop. We will install the `tidyverse` and `nycflights13` packages together.
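The installation can be sketched as follows. This is a minimal example, assuming you have an internet connection and a CRAN mirror configured (RStudio sets one by default). The helper variables `pkgs` and `missing_pkgs` are illustrative names, not part of the lesson material.

```r
# Packages named in this lesson
pkgs <- c("tidyverse", "nycflights13")

# Find the packages that are not yet installed on this machine
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only what is missing (assumes a CRAN mirror is configured)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Load both packages for the current R session
library(tidyverse)
library(nycflights13)
```

Checking first with `requireNamespace()` avoids reinstalling packages that are already present, which saves time if you rerun the script.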

Time to complete: one hour.

## Summary/Context/Objectives

This lesson will help you process and explore the `flights` dataset from the `nycflights13` package. This example dataset is already in the tidy format (one observation per row and one variable per column). We will use a few functions to compute basic statistics on the dataset and to make exploratory plots. These are the first steps in the Research Data Life Cycle (see the scheme below).
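A first look at the dataset might go like this. It is a sketch rather than the exact code used in the practical, and the choice of `dim()`, `glimpse()` and `summary()` is an assumption about which functions are most useful at this stage.

```r
library(tidyverse)
library(nycflights13)

# Number of rows (one per flight) and columns (one per variable)
dim(flights)

# Compact overview: column names, types and the first few values
glimpse(flights)

# Basic statistics (min, quartiles, mean, max, NA count) for the two delay columns
flights %>%
  select(dep_delay, arr_delay) %>%
  summary()
```

The `NA` counts reported by `summary()` are worth noting early, because missing delays have to be handled explicitly when computing means later on.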

### Lesson steps

1. Install the necessary `tidyverse` and `nycflights13` R packages.
2. Load the `flights` dataset that we will work with.
3. Explore the `flights` dataset to show and understand the different variables.
4. Filter the `flights` dataset using the `filter` function to keep only flights that depart from John F. Kennedy International Airport (JFK) and are bound for Los Angeles International Airport (LAX).
5. Plot a distribution of the flight delays.
6. Plot the number of flights operated per airline.
7. Calculate the mean and standard deviation (SD) using a grouping variable (the airline).
8. Relate the variable `dep_delay` to `arr_delay`.
9. Have a first insight into regression.
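
Steps 4 to 9 can be sketched as below. This is one possible way through the steps, not necessarily the code used in the practical. It assumes that "flight delays" in steps 5 and 7 means the departure delay (`dep_delay`), and the object names (`jfk_lax`, `delay_stats`, `fit`, and the `p_*` plots) as well as the bin width are illustrative choices.

```r
library(tidyverse)
library(nycflights13)

# Step 4: keep only flights from JFK to Los Angeles (airport code "LAX")
jfk_lax <- flights %>%
  filter(origin == "JFK", dest == "LAX")

# Step 5: distribution of departure delays in minutes
# (assumption: "delay" means dep_delay; flights with a missing delay are dropped)
p_delay <- ggplot(jfk_lax, aes(x = dep_delay)) +
  geom_histogram(binwidth = 10, na.rm = TRUE)

# Step 6: number of flights per airline (the `carrier` column)
p_carrier <- ggplot(jfk_lax, aes(x = carrier)) +
  geom_bar()

# Step 7: mean and SD of the departure delay, grouped by airline
delay_stats <- jfk_lax %>%
  group_by(carrier) %>%
  summarise(
    n_flights      = n(),
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
    sd_dep_delay   = sd(dep_delay, na.rm = TRUE)
  )

# Step 8: relate departure delay to arrival delay with a scatter plot
p_scatter <- ggplot(jfk_lax, aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.3, na.rm = TRUE)

# Step 9: simple linear regression of arrival delay on departure delay
# (lm() drops rows with missing values by default)
fit <- lm(arr_delay ~ dep_delay, data = jfk_lax)
summary(fit)
```

Print `p_delay`, `p_carrier` or `p_scatter` to display the corresponding plot, and print `delay_stats` to see the per-airline table.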

Glossary:

• Data frame: the equivalent of an Excel spreadsheet. More formally, a list of vectors that may hold different data types (character, integer, numeric, etc.) but all have the same length (number of rows). In addition, a data frame generally has a `names` attribute labeling the variables and a `row.names` attribute labeling the cases.
• Tibble: the core tidyverse data structure is a tibble; this is a modern take on the data frame. You can find an extensive and practical definition here.
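
The two glossary entries can be illustrated with a small example. The values below are made up for illustration and do not come from the `flights` dataset.

```r
library(tibble)

# A base R data frame: a list of equal-length columns with different types
df <- data.frame(
  carrier    = c("AA", "DL", "UA"),  # airline codes
  n_flights  = c(3L, 5L, 2L),        # integer (made-up counts)
  mean_delay = c(4.5, 1.2, 7.8)      # numeric (made-up values)
)

is.list(df)     # a data frame is a list under the hood
names(df)       # the names attribute labels the variables
row.names(df)   # the row.names attribute labels the cases

# Convert the data frame to a tibble
tb <- as_tibble(df)
class(tb)

# One practical difference: selecting a single column with [ , ]
class(df[, "carrier"])  # the data frame drops down to a plain vector
class(tb[, "carrier"])  # the tibble stays a tibble
```

Tibbles also print more compactly, showing only the first rows along with each column's type.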

## Additional Resources & further exploration

