# R for Data Science Essentials

In this activity, we analyze flight arrival delays with the nycflights13 dataset, which covers all flights that departed from New York City airports in 2013. The exploration emphasizes data transformation and visualization, using techniques that parallel those in a foundational text (hint: see Chapters 3 and 5 for a deeper dive). We aim to identify factors that influence flight delays, focusing on the destination airport, the day of the week, and the time of arrival as potential predictors. For further study, a comprehensive guide can be accessed [here](https://r4ds.had.co.nz).

##  Loading Essential Packages


We begin by loading the `tidyverse` package, which is a collection of R packages specifically designed for data science tasks. This package includes several useful functions and tools for data manipulation, visualization, and analysis.

Next, we load the `nycflights13` package. This package contains the `flights` dataset, which provides information about flights departing from New York City airports in 2013. This dataset is commonly used for practicing data analysis and visualization techniques in R.

By loading these packages, we ensure that we have access to the functions and datasets needed for the analysis.

In [1]:
# Load the tidyverse package, a collection of R packages for data science
library(tidyverse)

# Load the nycflights13 package, which includes the flights dataset
library(nycflights13)




── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


R's help operators give access to documentation for datasets such as `flights` and for tidyverse functions. `?topic` opens the help page for a specific topic, while `??term` searches the installed help files for matching entries. This documentation describes dataset variables, package components, and their functionality, all of which are essential for data manipulation, visualization, and analysis.

With essential packages like `tidyverse` and `nycflights13` loaded, we're ready to explore, analyze, and derive insights from our data efficiently in our JupyterHub environment. This setup enables us to proceed with our data science tasks effectively, leveraging the comprehensive tools available to us.
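
The Conflicts note printed when the tidyverse is attached means that `filter()` and `lag()` now refer to the dplyr versions rather than the ones in `stats`. As a minimal sketch (the toy data frame `small` and the variable names are illustrative, not part of the original notebook), the `package::function()` form makes the choice explicit:

```r
library(dplyr)

# A tiny illustrative data frame (hypothetical, not from nycflights13)
small <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(small, x > 3)
kept

# stats::filter() is a different function: it applies a linear filter
# to a numeric series (here, a two-point moving average)
smoothed <- stats::filter(1:5, rep(1 / 2, 2))
smoothed
```

Writing the namespace explicitly is optional in this notebook, since the dplyr version is the one we want throughout.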

In [8]:
# Search the installed help files for topics matching "mutate" in the dplyr package,
# which is one of the core packages of the tidyverse.
??dplyr::mutate


R Information

Help files with alias or concept or title matching ‘mutate’ using fuzzy
matching:


dplyr::mutate           Create, modify, and delete columns
  Aliases: mutate, mutate.data.frame
dplyr::mutate_all       Mutate multiple columns
  Aliases: mutate_all, mutate_if, mutate_at
dplyr::mutate-joins     Mutating joins
  Aliases: mutate-joins
dplyr::se-deprecated    Deprecated SE versions of main verbs.
  Aliases: mutate_
dplyr::summarise_each   Summarise and mutate multiple columns.
  Aliases: mutate_each, mutate_each_


Type '?PKG::FOO' to inspect entries 'PKG::FOO', or 'TYPE?PKG::FOO' for
entries like 'PKG::FOO-TYPE'.
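
As the last lines of the output suggest, a single `?` opens one specific help page instead of listing search results. The following is a small sketch of the main variants, assuming both packages are installed and attached as above:

```r
library(dplyr)
library(nycflights13)

# `?` opens the help page for one known topic
?mutate

# help() is the function form; `package` restricts the lookup to one package
help("flights", package = "nycflights13")

# help.search() is the function form of `??`
help.search("mutate", package = "dplyr")
```

The `flights` help page is worth opening early, since it documents what each column of the dataset means.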




## Initial Overview of the flights Dataset
When beginning data analysis with the flights dataset from the nycflights13 package, an initial overview is essential to understand the data's structure and content. This section will guide you through this initial exploration, highlighting techniques to gain insights into the dataset.

### View Dataset Structure
Use `glimpse()` to get a compact display of the dataset's structure, including columns and their data types. This function provides a quick overview, showing you each column's name, data type (e.g., integer, double, character), and some example values.

In [7]:
glimpse(flights)


Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6",
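
To complement `glimpse()`, the column types can also be tallied programmatically. This is a rough sketch using base R only; the name `col_types` is an illustrative choice:

```r
library(nycflights13)

# Take the first class of each column; time_hour carries two classes
# (POSIXct and POSIXt), so we keep only the first for a tidy tally
col_types <- sapply(flights, function(col) class(col)[1])

# Count how many columns there are of each type
table(col_types)
```

A tally like this makes it easy to see at a glance how many numeric versus character variables are available.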

### Dataset Summary
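The `summary()` function generates summary statistics for each column: minimum, quartiles, median, mean, and maximum for numeric variables, plus a count of `NA` values where any exist. For character columns such as `carrier` it reports only length, class, and mode; frequency counts appear only for factor columns, which `flights` does not contain. This summary helps identify potential anomalies (such as extreme values) and gives a first sense of each variable's distribution.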

In [8]:
summary(flights)


      year          month             day           dep_time    sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
 Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
                                                 NA's   :8255                 
   dep_delay          arr_time    sched_arr_time   arr_delay       
 Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
 Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
 Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
 3rd Qu.:  11.00   3rd Qu.:1
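
Because the full summary is wide, it can help to restrict it to the columns of interest. The sketch below focuses on the two delay columns and adds upper-tail quantiles as one possible way to look for extreme values; the chosen probabilities are arbitrary:

```r
library(dplyr)
library(nycflights13)

# Summary statistics for the delay columns only
flights %>%
  select(dep_delay, arr_delay) %>%
  summary()

# Upper-tail quantiles of arrival delay; na.rm = TRUE drops missing values
quantile(flights$arr_delay, probs = c(0.5, 0.9, 0.99), na.rm = TRUE)
```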

### Number of Rows and Columns
Understanding the size of your dataset can influence how you handle data analysis and manipulation.


In [9]:
dim(flights)
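
`dim()` returns the row and column counts together. When only one of the two is needed, `nrow()` and `ncol()` are convenient alternatives, as in this short sketch (the variable names are illustrative):

```r
library(nycflights13)

n_rows <- nrow(flights)  # number of flights (rows)
n_cols <- ncol(flights)  # number of variables (columns)

cat("flights has", n_rows, "rows and", n_cols, "columns\n")
```

These values should match the `Rows` and `Columns` lines reported by `glimpse()` earlier.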


### First and Last Rows
Viewing the first few and last few rows can give you a concrete sense of the data's actual entries.

In [10]:
head(flights)
tail(flights)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,9,30,,1842,,,2019,,EV,5274,N740EV,LGA,BNA,,764,18,42,2013-09-30 18:00:00
2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30 14:00:00
2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198,22,0,2013-09-30 22:00:00
2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764,12,10,2013-09-30 12:00:00
2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419,11,59,2013-09-30 11:00:00
2013,9,30,,840,,,1020,,MQ,3531,N839MQ,LGA,RDU,,431,8,40,2013-09-30 08:00:00
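
The first and last rows are not necessarily representative: the last six rows shown above, for example, all have missing departure and arrival times. As a sketch of two small variations, `head()` accepts a row count, and `slice_sample()` from dplyr draws random rows; the seed value is arbitrary and only makes the sample reproducible:

```r
library(dplyr)
library(nycflights13)

# Show a specific number of rows from the top
first_three <- head(flights, n = 3)
first_three

# Draw five rows at random for a less position-dependent look at the data
set.seed(2013)
random_five <- slice_sample(flights, n = 5)
random_five
```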


### Column Names
Listing all column names can help in quickly identifying which variables are available for analysis.

In [11]:
colnames(flights)
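
Column names can also be filtered by pattern, which is handy when looking for related variables. A minimal sketch using dplyr's selection helpers follows; `delay_cols` and `time_cols` are illustrative names:

```r
library(dplyr)
library(nycflights13)

# Columns whose names end in "delay"
delay_cols <- flights %>% select(ends_with("delay")) %>% colnames()
delay_cols

# Columns whose names contain "time"
time_cols <- flights %>% select(contains("time")) %>% colnames()
time_cols
```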


### Unique Values in Columns
Knowing how many unique values a column has can be particularly useful for categorical data.

In [12]:
flights %>% summarise_all(n_distinct)


year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,12,31,1319,1021,528,1412,1163,578,16,3844,4044,3,105,510,214,20,60,6936
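
`summarise_all()` still works but has been superseded in recent dplyr releases by `across()`. The sketch below shows an equivalent call and then uses `count()` to list the values of one low-cardinality column. Note that `n_distinct()` counts `NA` as a distinct value by default.

```r
library(dplyr)
library(nycflights13)

# Same result as summarise_all(n_distinct), written with across()
distinct_counts <- flights %>%
  summarise(across(everything(), n_distinct))
distinct_counts

# For a categorical column with few values, list each value and its frequency
flights %>% count(origin, sort = TRUE)
```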


### Checking for missing values

One of the slightly more advanced techniques in data exploration involves checking for missing values across your dataset. The function:


```r
sapply(flights, function(x) sum(is.na(x)))
```

is used to count the number of missing (`NA`) values in each column of the `flights` dataset. While this command might seem complex at first glance, it's a powerful tool for data cleaning and preparation. Here's a quick breakdown:

- `sapply()` is a function that applies another function over a list or vector and simplifies the output.
- `flights` is our dataset.
- `function(x) sum(is.na(x))` is an anonymous function that counts the missing values in each column.

In [15]:
sapply(flights, function(x) sum(is.na(x)))
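
An equivalent and slightly shorter formulation uses `colSums()` on the logical matrix returned by `is.na()`. As a sketch (the name `na_counts` is illustrative), it is also easy to keep only the columns that actually contain missing values:

```r
library(nycflights13)

# is.na(flights) is a logical matrix; column sums count the TRUE entries
na_counts <- colSums(is.na(flights))

# Show only columns with at least one missing value, largest first
sort(na_counts[na_counts > 0], decreasing = TRUE)
```

Either form gives the same counts; which one to use is mostly a matter of taste.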


### It's Okay Not to Know Everything Immediately

- **Learning Curve**: It's completely normal for some functions or concepts in R (like the one above) to feel advanced or challenging at first. R programming, like any other skill, has a learning curve.
- **Repeated Lookup is Normal**: Even experienced R programmers look up functions and syntax regularly. The key to becoming proficient in programming isn't memorizing every command but knowing how to find the information you need.
- **Effective Googling**: Learning to program effectively is as much about developing good Googling skills as it is about understanding the syntax. Knowing what keywords to search for, reading documentation, and finding solutions on forums like Stack Overflow are invaluable skills.
- **Developing Intuition**: Over time, you'll develop an intuition for programming. This includes understanding which functions to use, how to troubleshoot errors, and how to navigate through documentation or seek help online.
