# Lab 6:  EDA using a massive airline dataset

For this lab, you'll need to access a 500MB+ dataset at kaggle.com.  First, register at kaggle.com, log in, and download the dataset at https://www.kaggle.com/usdot/flight-delays.  The dataset consists of three separate files, only one of which is huge.

Next, visit https://drive.google.com/drive/my-drive and find a place to upload this dataset.  For instance, using the "New" button you can create a new folder where you'll upload the three files.  To allow the code below to access the files I uploaded, I created a folder called "data" inside the "Colab Notebooks" folder.

## Download and upload archive.zip

After I clicked the "download" button at the kaggle.com site above and logged into kaggle, I saved the `archive.zip` file on my computer.  I did not unzip this file on my computer; instead, I uploaded `archive.zip` directly into my `drive/My Drive/Colab Notebooks/data/` folder using the "New" button followed by "File Upload".  At that point, I had to figure out how to use the zip extractor within google drive, then move all three of the .csv files into the `data` folder I had created.

This isn't the only way to handle this step.  You might choose instead to unzip `archive.zip` on your own computer before uploading the three resulting .csv files to your google drive space.  Frankly, I think that's probably less trouble.  Either way is fine, as long as you wind up with `airlines.csv`, `airports.csv`, and `flights.csv` in a google drive folder you can access.

In [None]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

After your three .csv files are in place, you have to make sure that your Jupyter notebook can access them from the google.colab environment.  The code above does this by mounting the drive when you run it.  You can also get Colab to insert this code block automatically: click the folder icon along the left side of the screen, then click the "mount drive" icon that appears at the top of the left margin.

## Change working directory

Next, we can change the working directory (folder) in which the Jupyter notebook looks for files.  First, let's see what the current working directory is using pwd (print working directory).

In [None]:
pwd

We can use cd (change directory) to change the working directory to a different folder.  A space in a folder name has to be entered as `\ ` (backslash space).

In [None]:
cd /content/drive/MyDrive/Colab\ Notebooks/data

Finally, we can verify that the three .csv files are in the new working directory using ls (list).

In [None]:
ls
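
If `ls` does not show all three files, it can help to check for them by name. The sketch below is one way to do that in plain Python. The helper name `missing_files` is hypothetical (it is not part of Colab or the datascience library), and it assumes the three file names listed earlier in this lab.

```python
import os

# File names this lab expects to find in the working directory.
EXPECTED_FILES = ['airlines.csv', 'airports.csv', 'flights.csv']

def missing_files(folder='.', expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    present = set(os.listdir(folder))
    return [name for name in expected if name not in present]

# An empty list means all three files are in place.
print(missing_files())
```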

## Optional step:  Upgrade datascience library

You may recall that we've occasionally had to fix some of the Python code used in the textbook.  This was particularly true in Section 8.5, where the maps were produced.  The reason the fixes were needed is that the default datascience library used by the colab Jupyter notebooks is out of date.  If you want, you can update it as follows and then the textbook code should work without modification even in Section 8.5.

In [None]:
!pip install --upgrade datascience

## Load the datascience library and other resources

Once all the data files are in place, we can get to the Python code.  The first step, as usual, is to load the necessary Python resources:

In [None]:
# Load needed python resources
from datascience import *
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np
np.set_printoptions(threshold=50)


Next, we'll read the largest of the three files as a `Table` object.

In [None]:
# This code only works if flights.csv is in the current working directory (see above)
flights = Table.read_table('flights.csv')

Let's check out the columns available in the flights dataset:

In [None]:
flights.labels

Next, select a systematic sample of rows from the flights Table object using code similar to the Chapter 10 intro.  You'll need to figure out an appropriate value of 'gap' based on the lab instructions.

In [None]:
gap = **ENTER AN APPROPRIATE INTEGER VALUE OF gap HERE**
start = np.random.choice(np.arange(gap))
mySample = flights.take(np.arange(start, flights.num_rows, gap))
print(mySample.num_rows)
mySample
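
To see how `gap` and `start` interact, here is a toy version that uses 20 row indices in place of the flights table. The numbers 20 and 5 are made up for illustration and are not the values the lab asks for.

```python
import numpy as np

num_rows = 20   # stand-in for flights.num_rows (illustrative only)
gap = 5         # illustrative only; not the value the lab asks for

# Random starting row between 0 and gap - 1, as in the cell above.
start = np.random.choice(np.arange(gap))

# Every gap-th row index, beginning at start.
row_indices = np.arange(start, num_rows, gap)
print(start, row_indices)
```

Whatever `start` turns out to be, this picks 20 / 5 = 4 rows, so the sample size is roughly the number of rows divided by `gap`.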

Load the airports dataset containing airport names and latitude/longitude coordinates:


In [None]:
airports = Table.read_table('airports.csv')

## Try to understand the next code block, especially the join method

The next cell chains several Table methods (from the datascience module) to produce a cleaned-up version of the systematic sample, again called mySample.  It then prints the number of rows and displays the first 10 rows:

1.   Select the sample rows according to 'gap' and 'start'.
2.   Then join every ORIGIN_AIRPORT in the sample with the corresponding columns from 'airports' based on matching IATA_CODE.
3.   Then select just the columns we need from the result.
4.   Then relabel some of the columns.


In [None]:
mySample = (flights.take(np.arange(start, flights.num_rows, gap))
                   .join('ORIGIN_AIRPORT', airports, 'IATA_CODE')
                   .select('MONTH', 'DAY', 'ORIGIN_AIRPORT', 
                           'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 
                           'DEPARTURE_DELAY', 'AIRPORT',
                           'LATITUDE', 'LONGITUDE')
                   .relabeled('ORIGIN_AIRPORT', 'ORIGIN')
                   .relabeled('DESTINATION_AIRPORT', 'DESTINATION')
                   .relabeled('DEPARTURE_DELAY', 'DELAY')
                   .relabeled('AIRPORT', 'ORIGIN_NAME')
            )
print(mySample.num_rows)
mySample
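
The `join` step can be pictured as a lookup: for each flight, find the airport row whose `IATA_CODE` equals the flight's `ORIGIN_AIRPORT` and attach its columns. The plain-Python sketch below mimics that idea on made-up rows. It is not how the datascience library implements `join`, and the code 'XXX' is a deliberately unmatched placeholder.

```python
# Made-up miniature versions of the two tables (illustrative values only).
flights_rows = [
    {'ORIGIN_AIRPORT': 'SEA', 'DEPARTURE_DELAY': -3},
    {'ORIGIN_AIRPORT': 'LAX', 'DEPARTURE_DELAY': 12},
    {'ORIGIN_AIRPORT': 'XXX', 'DEPARTURE_DELAY': 5},   # no matching airport
]
airports_rows = [
    {'IATA_CODE': 'LAX', 'LATITUDE': 33.94},
    {'IATA_CODE': 'SEA', 'LATITUDE': 47.45},
]

# Index the airports by code so each flight can look up its origin.
airport_by_code = {row['IATA_CODE']: row for row in airports_rows}

joined = []
for flight in flights_rows:
    airport = airport_by_code.get(flight['ORIGIN_AIRPORT'])
    if airport is None:
        continue  # unmatched flights are dropped
    # Combine the flight's columns with the airport's, minus the duplicate key.
    extra = {k: v for k, v in airport.items() if k != 'IATA_CODE'}
    joined.append({**flight, **extra})

print(len(joined))
```

Like `Table.join`, the sketch keeps only flights whose origin code has a match in the airports data. That is one reason the row count of the joined table can be smaller than the row count of the original sample.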

Create a new column with the approximate day of year.  There are more accurate ways to do this, but approximating each month by 30 days will work for this purpose:

In [None]:
mySample = mySample.with_column(
             'APPROX_DAY_OF_YEAR', 
             30*(mySample.column('MONTH')-1) + mySample.column('DAY'))
mySample
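
For comparison, the exact day of the year can be computed with Python's standard `datetime` module. This sketch assumes the flights are from 2015, which is not a leap year; check the year in your own copy of the data before relying on it. The two function names are hypothetical helpers, not part of the datascience library.

```python
from datetime import date

def exact_day_of_year(month, day, year=2015):
    """Exact day of the year (1 = January 1) for a calendar date."""
    return date(year, month, day).timetuple().tm_yday

def approx_day_of_year(month, day):
    """The 30-days-per-month approximation used in the cell above."""
    return 30 * (month - 1) + day

# Compare the two for a few dates.
for month, day in [(1, 15), (3, 1), (12, 31)]:
    print(month, day, approx_day_of_year(month, day), exact_day_of_year(month, day))
```

The approximation drifts by a few days over the year (December 31 comes out as 361 instead of 365), which should not matter much for a rough look at seasonal patterns.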

Recall that a scatterplot depicts the relationship between two quantitative measurement columns.  Create a scatterplot using the 'LATITUDE' and 'DELAY' columns.

In [None]:
mySample.scatter('LATITUDE', 'DELAY')

## Extra code
The code below is not strictly necessary for the lab assignment.  It is included to illustrate some potentially interesting directions you could take your own investigation:

In [None]:
# Create a histogram of LATITUDE
mySample.hist('LATITUDE')

In [None]:
# Create a new column that splits the airports into high vs. low latitude
# based on a cutoff you define:
LatCut = **ENTER A VALUE THAT MAKES SENSE HERE**
mySample = mySample.with_column(
              'HIGH_LAT', mySample.column('LATITUDE') > LatCut)
mySample

In [None]:
# Count how many flights in the sample left from "high latitude" vs.
# "low latitude" airports (each row is a flight, not an airport)
mySample.group('HIGH_LAT')

In [None]:
# Take the means for the high- and low-latitude airports.
# We're using numpy's 'nanmean' function so that the
# nan (not a number) values are ignored
mySample.group('HIGH_LAT', np.nanmean)
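
The reason for using `np.nanmean` instead of `np.mean` is that a single missing value turns an ordinary mean into nan. Here is a small check with made-up delay values:

```python
import numpy as np

# Made-up delays in minutes; nan marks a missing value.
delays = np.array([10.0, -5.0, np.nan, 25.0])

plain_mean = np.mean(delays)         # nan, because one value is missing
nan_aware_mean = np.nanmean(delays)  # average of the three non-missing values
print(plain_mean, nan_aware_mean)
```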

In [None]:
# Use the previous idea to find the mean difference automatically
Observed_mean_difference = np.diff(mySample.group('HIGH_LAT', np.nanmean)
                                           .column('DELAY nanmean'))[0]
Observed_mean_difference
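
`np.diff` subtracts each element from the one after it, so the sign of the result depends on row order. `group` should list the `False` row before the `True` row (you can confirm this in the output of the previous cell), which makes the statistic above the high-latitude mean minus the low-latitude mean. A toy check with made-up means:

```python
import numpy as np

# Made-up group means, in the order group() lists them: False row, then True row.
low_lat_mean = 8.0    # HIGH_LAT == False
high_lat_mean = 11.5  # HIGH_LAT == True

# Second element minus first element: high minus low.
difference = np.diff(np.array([low_lat_mean, high_lat_mean]))[0]
print(difference)
```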

In [None]:
# Define a function that will reshuffle the DELAY values and then return
# the mean difference statistic for the shuffled table.
# This simulates from the null hypothesis distribution of the 
# mean difference statistic.
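def simulated_mean_difference_under_null():
    # Shuffle the rows, keep only the shuffled DELAY column, then
    # reattach the original (unshuffled) HIGH_LAT labels.
    shuffled = (mySample.sample(with_replacement=False)
                        .select('DELAY')
                        .with_column('HIGH_LAT', mySample.column('HIGH_LAT')))
    return np.diff(shuffled.group('HIGH_LAT', np.nanmean)
                           .column('DELAY nanmean'))[0]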

In [None]:
# Simulate 5000 draws from the null hypothesis distribution of the 
# mean difference (and return the result as a numpy array--not the same 
# as a datascience Table)
H0_means = make_array()
for i in np.arange(5000):
    H0_means = np.append(H0_means, simulated_mean_difference_under_null())

In [None]:
# Create a table with the 5000 H0 (null hypothesis) values and then 
# create a histogram.
# Also add the observed value of the sample statistic as a red dot 
# along the x-axis.
Table().with_column(
    'Simulated Mean Difference', H0_means
).hist(bins = np.arange(-12.5, 12.5, 1))
plots.scatter(Observed_mean_difference, 0, color='red', s=30);
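
One way to summarize the histogram numerically is an empirical p-value: the fraction of simulated differences that are at least as far from zero as the observed one. The sketch below uses made-up numbers in place of `H0_means` and `Observed_mean_difference`. The two-sided comparison is an assumption made for illustration; a one-sided version may suit your question better.

```python
import numpy as np

# Made-up stand-ins for H0_means and Observed_mean_difference.
simulated_differences = np.array([-2.0, -0.5, 0.0, 0.4, 1.0, 1.5, -1.2, 3.1])
observed_difference = 1.5

# Fraction of simulated values at least as extreme as the observed value.
p_value = np.mean(np.abs(simulated_differences) >= abs(observed_difference))
print(p_value)
```

With the real arrays, a small fraction would suggest that a difference as large as the observed one rarely arises from shuffling alone.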