# CA Data Science Takehome Problem

### For this problem, you are provided a data set recording various details about US domestic flights. The data is for all domestic flights during August in three consecutive years. Please explore the data however you prefer, and try to identify anything interesting, such as correlations, patterns, or interesting outliers. When you are done, prepare a write-up or annotate your notebook to show what you discovered. Be prepared to present your findings to the team if invited to continue to an on-site interview.

### This is an open-ended problem, which we expect will take approximately three hours to complete. This starter notebook will load the airline data into a data frame for you to use, but feel free to use any additional libraries or outside data that you would like. The accompanying "Column_Descriptions.csv" file explains what each of the columns in the data frame means.

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np

import requests

import seaborn as sns
import matplotlib.pyplot as plt

import xml.etree.ElementTree as ET

pd.set_option('display.max_columns', None)

## Loading the CSVs

In [None]:
r = requests.get('http://ca-data-science-interview.s3.amazonaws.com')
r.raise_for_status()  # fail early if the bucket listing cannot be fetched
xml = r.text

csvs = []
root = ET.fromstring(xml)
    
for content in root.findall('{http://s3.amazonaws.com/doc/2006-03-01/}Contents'):
    key = content.find('{http://s3.amazonaws.com/doc/2006-03-01/}Key').text
    csvs.append(key)
    
for name in csvs:
    csv_file = 'http://ca-data-science-interview.s3.amazonaws.com/' + name
    print("-- fetching %s" % csv_file)
    r = requests.get(csv_file, stream=True)
    r.raise_for_status()  # stop rather than write an error page to disk
    with open(name, "wb") as f:
        # Stream the download to disk in 1 KiB chunks
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

    
# Combine the downloaded files into one frame with a fresh 0..n-1 index
df = pd.concat([pd.read_csv(csv, low_memory=False) for csv in csvs], ignore_index=True)
# Every line in the CSVs ends with an extra comma, which pandas reads as an unnamed column.
df = df.drop("Unnamed: 37", axis=1)
df.head()
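
The listing step above reads an S3 `ListBucketResult` document whose elements all live in the `http://s3.amazonaws.com/doc/2006-03-01/` namespace, which is why every tag is spelled with the full `{...}` prefix. As a minimal offline sketch, the same extraction can be written with a namespace map. The sample XML, the file names in it, and the `list_keys` helper are made up for illustration and are not part of the starter notebook.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map so tags can be written as "s3:Key" instead of "{uri}Key"
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_keys(listing_xml):
    """Return the text of every <Key> found under a <Contents> element."""
    root = ET.fromstring(listing_xml)
    return [
        content.find("s3:Key", S3_NS).text
        for content in root.findall("s3:Contents", S3_NS)
    ]


# Hypothetical listing with two objects (not the real bucket contents)
SAMPLE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>flights_a.csv</Key></Contents>
  <Contents><Key>flights_b.csv</Key></Contents>
</ListBucketResult>"""

print(list_keys(SAMPLE))
```

The namespace map keeps the tag names readable and puts the URI in one place if the lookup needs to be extended to other fields such as `Size`.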

## A basic view of the data

In [None]:
# Plot of Departure Time versus Arrival Time (24-Hour Clock)

sns.set(style="white", color_codes=True)
grid = sns.JointGrid(x='DEP_TIME', y='ARR_TIME', data=df, space=0, height=10, ratio=50)
grid.plot_joint(plt.scatter, color="g")
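
One caveat for the plot above: if `DEP_TIME` and `ARR_TIME` are stored as `hhmm` numbers (an assumption here; "Column_Descriptions.csv" should confirm it), the axis values are not evenly spaced in time, since 1059 is followed directly by 1100. A minimal sketch of a conversion to minutes after midnight follows. The helper `hhmm_to_minutes`, the `DEP_MINUTES` column, and the toy frame are hypothetical and stand in for the real data.

```python
import numpy as np
import pandas as pd


def hhmm_to_minutes(series):
    """Convert hhmm-encoded clock times (e.g. 1345) to minutes after midnight."""
    values = pd.to_numeric(series, errors="coerce")  # unparseable entries become NaN
    minutes = (values // 100) * 60 + (values % 100)
    # Assumption: a value of 2400 denotes midnight, so wrap it back to 0.
    return minutes % (24 * 60)


# Toy data, not taken from the flight files
toy = pd.DataFrame({"DEP_TIME": [5, 1059, 1100, 2400, np.nan]})
toy["DEP_MINUTES"] = hhmm_to_minutes(toy["DEP_TIME"])
print(toy)
```

Missing times stay missing after the conversion, so cancelled or diverted flights would still need to be handled separately before plotting.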