# **R Essentials**

In this self-study, you will analyze flight arrival delays using the `flights` dataset from the `nycflights13` package, which includes all flights that departed from New York City airports in 2013. This tutorial will guide you through loading the necessary packages and data, displaying the dataset, and extracting key features such as the sample size and the number of variables. For a more comprehensive introduction to R that extends beyond this tutorial, please refer to this resource.

## Loading Packages

In R, packages are collections of functions, data, and code that extend the capabilities of R. They are like add-ons that provide additional tools and features.

Before you can use a package, you need to install it on your local computer. However, for this self-study, we have already installed the necessary packages for you on our JupyterHub cloud.
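If you later work on your own computer instead of the JupyterHub cloud, a check like the following may be useful. It is a minimal sketch using only base R; the helper name `missing_packages` is made up for this illustration and is not part of any package. The actual installation line is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: returns the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The two packages this tutorial relies on
to_install <- missing_packages(c("tidyverse", "nycflights13"))

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # Uncomment on your own computer to install from CRAN:
  # install.packages(to_install)
} else {
  message("All required packages are installed.")
}
```

Installing a package is needed only once per computer, whereas `library()` has to be called in every new R session.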

For this tutorial, we will load two essential packages:

1. `tidyverse`: This is a group of R packages designed to make working with data easier. It includes tools for data cleaning, visualization, and analysis.
2. `nycflights13`: This package contains the `flights` dataset, which provides information about flights leaving New York City airports in 2013. This dataset is useful for practicing data analysis and visualization techniques in R.

By loading these packages, we ensure we have everything we need to start our data analysis.

In [1]:
############################################################################################

# Load the tidyverse package, a collection of R packages for data science
library(tidyverse)

# Load the nycflights13 package, which includes the flights dataset
library(nycflights13)

print("packages loaded")

############################################################################################

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


[1] "packages loaded"


## Description of Key Variables in the Dataset

The `flights` dataset from the `nycflights13` package contains detailed information about flights departing from New York City airports (EWR, JFK, and LGA) in 2013. Below are descriptions of the key variables used in this and the following tutorial:

- **dep_delay**: Departure delay in minutes. Negative values indicate flights that departed early.
- **arr_delay**: Arrival delay in minutes. Negative values indicate flights that arrived early.
- **carrier**: The carrier (airline) code.
- **origin**: The airport of origin (EWR, JFK, LGA).
- **dest**: The destination airport.
- **air_time**: The amount of time the flight was in the air, in minutes.
- **distance**: The distance between the origin and destination airports, in miles.
- **month**: The month of the flight (1-12).
- **day**: The day of the month (1-31).

## Initial Overview of the Dataset

When starting data analysis with the `flights` dataset from the `nycflights13` package, it's important to understand the data's structure and content. This section will guide you through the initial steps to explore the dataset.

### View Dataset Structure

To get a quick look at the dataset, you can use the `head()` function. This function shows the first six rows of the dataset, including the names of the columns and the types of data they contain. For example:

In [2]:
############################################################################################

# Displaying the first six rows of the dataset 'flights'
head(flights)

############################################################################################

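| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dttm>` |
| 2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | 1416 | 5 | 29 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1089 | 5 | 40 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | 1576 | 5 | 45 | 2013-01-01 05:00:00 |
| 2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | N668DN | LGA | ATL | 116 | 762 | 6 | 0 | 2013-01-01 06:00:00 |
| 2013 | 1 | 1 | 554 | 558 | -4 | 740 | 728 | 12 | UA | 1696 | N39463 | EWR | ORD | 150 | 719 | 5 | 58 | 2013-01-01 05:00:00 |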


This `head()` call displays the first six rows of the `flights` dataset, giving you an idea of its structure and the types of information it includes. For instance, the column `dep_time` holds the departure time of each flight. It is stored as an integer in HHMM format, so 517 means 5:17 AM.
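If six rows are too many or too few, `head()` accepts an optional argument `n`. The sketch below illustrates this on a small hand-made data frame; `mini_flights` is a hypothetical name, and its values are copied from a few columns of the six rows shown above so that the example runs without any package. The same calls should work on the full `flights` dataset.

```r
# Hypothetical mini dataset: five columns of the six rows displayed above
mini_flights <- data.frame(
  carrier   = c("UA", "UA", "AA", "B6", "DL", "UA"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA", "EWR"),
  dest      = c("IAH", "IAH", "MIA", "BQN", "ATL", "ORD"),
  dep_delay = c(2, 4, 2, -1, -6, -4),
  arr_delay = c(11, 20, 33, -18, -25, 12)
)

# Show only the first three rows instead of the default six
first_three <- head(mini_flights, n = 3)
print(first_three)

# A negative n returns everything except the last n rows
all_but_last_two <- head(mini_flights, n = -2)
print(all_but_last_two)
```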

**Exercise:** Look at the last six rows of the `flights` dataset.

Fill in the `???` in the code below to complete the exercise.

In [3]:
############################################################################################

# Displaying the last six rows of the dataset
tail(???)

############################################################################################



### Dataset Summary

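Understanding the descriptive statistics of your dataset is an important first step in data analysis. The `summary()` function in R provides a quick overview of each column in the dataset. For numeric columns, it reports the minimum, first quartile, median, mean, third quartile, and maximum, plus the number of missing values (`NA's`) if there are any. For factor columns, it reports counts per category. For character columns such as `carrier`, it reports only the length and type.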

To get a summary of the entire flights dataset, use this command:

In [None]:
############################################################################################

# Descriptive statistics for all variables in the dataset
summary(flights)

############################################################################################

This `summary()` command will show a detailed overview of each column, helping you spot any unusual values and understand how the data is spread out.

### Extracting Single Variables from the Dataset

In R, you can extract a single column from a dataset using the `$` operator. This is helpful when you want to focus on one specific variable in your data.

**Example:**  
To look at the `dep_delay` (departure delay) column from the flights dataset, use this command: `flights$dep_delay`. Here, `flights` is the dataset, and `dep_delay` is the column name. The `$` operator tells R to return just this one column as a vector.
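To see what `$` actually returns, here is a small sketch on a hand-made data frame. The name `mini_flights` is hypothetical; its values are taken from the first six rows of `flights` displayed earlier. The extracted column is an ordinary vector, so you can index and filter it like any other vector.

```r
# Hypothetical mini dataset with values from the first six rows of flights
mini_flights <- data.frame(
  carrier   = c("UA", "UA", "AA", "B6", "DL", "UA"),
  dep_delay = c(2, 4, 2, -1, -6, -4)
)

# Extract one column as a vector
delays <- mini_flights$dep_delay
print(delays)
print(class(delays))    # "numeric"
print(length(delays))   # 6, one value per row

# Double square brackets with the column name in quotes do the same thing
same <- identical(delays, mini_flights[["dep_delay"]])
print(same)             # TRUE

# Keep only the negative delays, i.e. flights that departed early
early <- delays[delays < 0]
print(early)            # -1 -6 -4
```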

### Generating a Summary for a Single Variable

After you extract a single column, you can use the `summary()` function to get more details about that column.

**Example:** 
To get a summary of the `dep_delay` column, use this command:

In [None]:
############################################################################################

# Displaying descriptive statistics of the variable 'dep_delay'
summary(flights$dep_delay)

############################################################################################

This command shows:

* *Min:* The smallest value is -43 minutes, meaning one flight left 43 minutes early.
* *1st Qu.:* 25% of flights had a delay of -5 minutes or less.
* *Median:* The middle delay value is -2 minutes, so at least half of the flights with a recorded delay departed early.
* *Mean:* The average delay is about 12.64 minutes.
* *3rd Qu.:* 75% of flights had a delay of 11 minutes or less.
* *Max:* The longest delay is 1301 minutes.
* *NA's:* There are 8255 missing values in the `dep_delay` column.

This `summary()` command helps you understand how departure delays vary and spot any patterns in the data.
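Each number reported by `summary()` can also be computed with its own function. The sketch below uses a short made-up vector (the name `delays` and its values are purely illustrative) to show how missing values affect these functions and why `na.rm = TRUE` is often needed.

```r
# Hypothetical delays in minutes, with one missing value
delays <- c(2, 4, NA, -1, -6, -4)

# Without na.rm = TRUE, a single NA makes the result NA
print(mean(delays))                  # NA

# With na.rm = TRUE, missing values are dropped before computing
avg <- mean(delays, na.rm = TRUE)
med <- median(delays, na.rm = TRUE)
print(avg)                           # -1
print(med)                           # -1

# Smallest and largest observed values
print(min(delays, na.rm = TRUE))     # -6
print(max(delays, na.rm = TRUE))     # 4

# First and third quartile
print(quantile(delays, probs = c(0.25, 0.75), na.rm = TRUE))

# Number of missing values
n_missing <- sum(is.na(delays))
print(n_missing)                     # 1
```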

**Exercise:** Compute descriptive statistics of the variable `distance`. What is the median distance?

Fill in the `???` in the code below to complete the exercise.

In [6]:
############################################################################################

# Displaying descriptive statistics of the variable 'distance'
summary(???$???)

############################################################################################



**Answer:** 872 miles

## Summarizing Data by Groups

To compute summary statistics separately for each group, combine the `summarize()` function with `group_by()`. Both come from the `dplyr` package, which is loaded as part of `tidyverse`. For example, to find the average departure delay for each airline:

In [None]:
############################################################################################

# Calculating the average departure delay for each airline
summarize(group_by(flights, carrier), avg_dep_delay = mean(dep_delay, na.rm = TRUE))

############################################################################################

This command groups the data by the `carrier` column and then calculates the average departure delay (`dep_delay`) for each airline, ignoring missing values (`na.rm = TRUE`). The result is a table with one row per carrier and a column named `avg_dep_delay`. Because the result is not assigned to a name, it is printed but not stored.
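The same grouped calculation is often written with the pipe operator `%>%`, which passes the result of one step into the next and can be easier to read than nested function calls. The following is a sketch under a few assumptions: `mini_flights` is a hypothetical data frame with values copied from the first six rows of `flights`, the result name `avg_by_carrier` is made up, and the final sorting step with `arrange()` is an optional addition.

```r
library(dplyr)

# Hypothetical mini dataset with values from the first six rows of flights
mini_flights <- data.frame(
  carrier   = c("UA", "UA", "AA", "B6", "DL", "UA"),
  dep_delay = c(2, 4, 2, -1, -6, -4)
)

# Group by airline, average the delay, and sort from largest to smallest.
# Assigning the result to a name stores it for later use.
avg_by_carrier <- mini_flights %>%
  group_by(carrier) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_dep_delay))

print(avg_by_carrier)
```

Replacing `mini_flights` with `flights` should give the average departure delay for every carrier in the full dataset.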

## Number of Rows and Columns

Understanding the size of your dataset is crucial for effective data analysis and manipulation. The dimensions of the dataset represent the sample size (number of rows) and the number of variables (number of columns).

To find out the dimensions of your dataset, you can use the `dim()` function:

In [9]:
############################################################################################

# Dimensions of the dataset
dim(flights)

############################################################################################

This command will return two numbers:
- The first number is the number of rows, representing the sample size or the total number of observations.
- The second number is the number of columns, representing the number of variables or different types of information collected in the dataset.

Alternatively, you can use the `nrow()` and `ncol()` functions to get these numbers separately:

In [10]:
############################################################################################

nrow(flights)  # Gives the number of rows (sample size)
ncol(flights)  # Gives the number of columns (variables)

############################################################################################


Understanding these dimensions helps you grasp the scope of your dataset, influencing how you approach data cleaning, visualization, and analysis tasks.
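Because `dim()` returns a vector of two numbers, you can pick out either one with square brackets. The sketch below shows this on a small hand-made data frame (`mini_flights` is a hypothetical name, with values from the first six rows of `flights`) and confirms that the two entries match `nrow()` and `ncol()`.

```r
# Hypothetical mini dataset with values from the first six rows of flights
mini_flights <- data.frame(
  carrier   = c("UA", "UA", "AA", "B6", "DL", "UA"),
  dep_delay = c(2, 4, 2, -1, -6, -4),
  arr_delay = c(11, 20, 33, -18, -25, 12)
)

# dim() returns an integer vector: rows first, columns second
size <- dim(mini_flights)
print(size)

# Pick out each entry by position
n_rows <- size[1]
n_cols <- size[2]
cat("Observations:", n_rows, "\n")
cat("Variables:", n_cols, "\n")

# Both entries agree with nrow() and ncol()
print(n_rows == nrow(mini_flights))
print(n_cols == ncol(mini_flights))
```

For the full `flights` dataset, `dim(flights)` should report 336776 rows and 19 columns.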

### Column Names

Listing all column names can help you quickly identify which variables are available for analysis. You can use the `colnames()` function to list all the column names in your dataset:

In [11]:
############################################################################################

# Column names of dataset
colnames(flights)

############################################################################################

This command will display a list of all the column names in the `flights` dataset, allowing you to see the different types of information available for analysis. Knowing the column names is useful when you want to reference specific variables in your data manipulation and analysis tasks.
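Column names are also handy for checking whether a variable exists before you use it, and for selecting several columns at once. The example below is a minimal sketch on a hypothetical data frame `mini_flights`, with values from the first six rows of `flights`; the other variable names are made up for illustration.

```r
# Hypothetical mini dataset with values from the first six rows of flights
mini_flights <- data.frame(
  carrier   = c("UA", "UA", "AA", "B6", "DL", "UA"),
  dep_delay = c(2, 4, 2, -1, -6, -4),
  arr_delay = c(11, 20, 33, -18, -25, 12)
)

available <- colnames(mini_flights)
print(available)

# Check whether a column exists
print("dep_delay" %in% available)   # TRUE
print("delay" %in% available)       # FALSE

# Find out which of the columns you want are not in the dataset
wanted <- c("carrier", "dep_delay", "air_time")
not_found <- setdiff(wanted, available)
print(not_found)                    # "air_time"

# Select several columns by name
delays_only <- mini_flights[, c("dep_delay", "arr_delay")]
print(colnames(delays_only))
```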

## Help Files

Using `help` commands in R, we can access detailed documentation on different packages and commands. This documentation provides further information about the functions and tools included in the package or command, either directly or through web links.

For example, to access the help file for the `summary` function, you can use the following command in R:

In [12]:
############################################################################################

# Getting additional support
help(summary)

############################################################################################

`summary {base}` (R Documentation)

| Argument | Description |
|---|---|
| `object` | an object for which a summary is desired. |
| `x` | a result of the default method of `summary()`. |
| `maxsum` | integer, indicating how many levels should be shown for factors. |
| `digits` | integer, used for number formatting with `signif()` (for `summary.default`) or `format()` (for `summary.data.frame`). In `summary.default`, if not specified (i.e., `missing(.)`), `signif()` will not be called anymore (since R >= 3.4.0, where the default has been changed to only round in the print and format methods). |
| `quantile.type` | integer code used in `quantile(*, type=quantile.type)` for the default method. |
| `...` | additional arguments affecting the summary produced. |


This command will open the help file, where you can find descriptions of the function's features, examples of how to use it, and links to additional resources. This is a valuable tool for understanding and effectively using the `summary` function in your data analysis.
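Besides `help()`, base R offers a few related commands that may be useful. The sketch below lists some of them; how the output is displayed depends on where you run R (for example, Jupyter, RStudio, or a terminal).

```r
# Two equivalent ways to open the help page for summary()
help("summary")
?summary

# Show only the argument list of a function, without the full help page
args(summary)

# Run the examples from the end of a help page
example("summary")

# Search the installed help pages for a keyword
# when you do not know the exact function name
help.search("quantile")
```

The shortcut `??quantile` performs the same keyword search as `help.search("quantile")`.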

## It's Okay Not to Know Everything Immediately

- **Learning Curve**: It's completely normal for some functions or concepts in R to feel advanced or challenging at first. R programming, like any other skill, has a learning curve.
- **Repeated Lookup is Normal**: Even experienced R programmers look up functions and syntax regularly. The key to becoming proficient in programming isn't memorizing every command but knowing how to find the information you need.
- **Effective Googling**: Learning to program effectively is as much about developing good Googling skills as it is about understanding the syntax. Knowing what keywords to search for, reading documentation, and finding solutions on forums like Stack Overflow are invaluable skills.
- **Developing Intuition**: Over time, you'll develop an intuition for programming. This includes understanding which functions to use, how to troubleshoot errors, and how to navigate through documentation or seek help online.