# CA Data Science Takehome Problem

### For this problem, you are provided a data set recording various details about US domestic flights. Please explore the data however you prefer, and try to identify anything interesting, such as correlations, patterns, or strange outliers. When you are done, prepare a write-up or annotate your notebook to show what you discovered. Be prepared to present your findings to the team if invited to continue to an on-site interview.

### This is an open-ended problem, which we expect will take approximately three hours to complete. This starter notebook has loaded the airline data into a data frame for you to use, but feel free to use any additional libraries or outside data that you would like. The accompanying "Column_Descriptions.csv" file explains what each of the columns in the data frame means.

In [None]:
import pandas as pd
import numpy as np
import glob
import csv
import requests
import matplotlib.pyplot as plt
import xml.etree.ElementTree as ET
from collections import Counter

pd.set_option('display.max_columns', 500)

## Loading the CSVs

In [None]:
r = requests.get('http://ca-data-science-interview.s3.amazonaws.com')
r.raise_for_status()  # fail early if the bucket listing cannot be fetched
xml = r.text

csvs = []
root = ET.fromstring(xml)
    
for content in root.findall('{http://s3.amazonaws.com/doc/2006-03-01/}Contents'):
    key = content.find('{http://s3.amazonaws.com/doc/2006-03-01/}Key').text
    csvs.append(key)
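
# Download each file into the working directory in 1 KB chunks.
# The loop variable is named csv_name so it does not shadow the imported csv module.
for csv_name in csvs:
    r = requests.get(f'http://ca-data-science-interview.s3.amazonaws.com/{csv_name}', stream=True)
    with open(csv_name, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
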
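# ignore_index=True gives the combined frame one continuous index instead of repeating each file's row labels.
df = pd.concat([pd.read_csv(name, low_memory=False) for name in csvs], ignore_index=True)
# Every line in the CSVs ends with an extra comma, which pandas reads as an unnamed column; drop it.
df = df.drop("Unnamed: 37", axis=1)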
df.head()
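
The key-listing step in the cell above can also be written as a small helper that keeps only `.csv` keys. This is a minimal sketch, not part of the original starter code: `list_csv_keys` is a hypothetical name, the sample XML is made up for illustration, and the namespace is assumed to be the same one the cell above already uses.

```python
import xml.etree.ElementTree as ET

# Namespace of the S3 bucket-listing document (same string as in the cell above).
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"

def list_csv_keys(listing_xml):
    """Return the object keys ending in .csv from a bucket-listing XML string."""
    root = ET.fromstring(listing_xml)
    keys = []
    for content in root.findall(f"{S3_NS}Contents"):
        key_node = content.find(f"{S3_NS}Key")
        # Skip entries without a key, and anything that is not a CSV file.
        if key_node is not None and key_node.text and key_node.text.lower().endswith(".csv"):
            keys.append(key_node.text)
    return keys

# Made-up listing for illustration only; the real bucket contents may differ.
sample = (
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    "<Contents><Key>example_a.csv</Key></Contents>"
    "<Contents><Key>notes.txt</Key></Contents>"
    "</ListBucketResult>"
)
print(list_csv_keys(sample))
```

In the notebook, the helper would be called as `list_csv_keys(r.text)` on the listing response.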

## Some basic views of the data

In [None]:
# Number of flights by day of the week

counts = Counter(df['DAY_OF_WEEK'])
plt.bar(list(counts.keys()), list(counts.values()))
plt.xlabel('DAY_OF_WEEK')
plt.ylabel('Number of flights')
plt.show()
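
The same per-day tally can be computed with pandas alone, which also returns the counts sorted by day code. This is a sketch under assumptions: `flights_per_day` is a hypothetical helper name and the toy frame is made up. What each `DAY_OF_WEEK` code means should be checked against "Column_Descriptions.csv".

```python
import pandas as pd

def flights_per_day(frame):
    """Count rows per DAY_OF_WEEK value, ordered by the day code."""
    return frame["DAY_OF_WEEK"].value_counts().sort_index()

# Tiny made-up frame for illustration; in the notebook, pass df instead.
toy = pd.DataFrame({"DAY_OF_WEEK": [1, 1, 2, 7, 7, 7]})
per_day = flights_per_day(toy)
print(per_day)
```

The resulting Series can be plotted directly with `per_day.plot(kind="bar")`.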

In [None]:
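# Total number of flights by airport

def graph_counts(counts, num=10):
    """Draw a bar chart of the `num` most common keys in a Counter."""
    top = counts.most_common(num)  # compute once, then split into labels and heights
    x = [airport for airport, _ in top]
    y = [n for _, n in top]
    plt.bar(x, y)
    plt.show()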

dest_counts = Counter(list(df['DEST']))
orig_counts = Counter(list(df['ORIGIN']))
total_counts = Counter()
total_counts.update(dest_counts)
total_counts.update(orig_counts)

graph_counts(dest_counts)
graph_counts(orig_counts)
graph_counts(total_counts)
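
The combined tally built above with two `update` calls can also be expressed as Counter addition. This is a minimal sketch: `top_airports` is a hypothetical helper and the airport codes are placeholders, not values taken from the data.

```python
from collections import Counter

def top_airports(origins, destinations, num=10):
    """Return the `num` airports with the most combined departures and arrivals."""
    # Adding two Counters sums the counts key by key.
    total = Counter(origins) + Counter(destinations)
    return total.most_common(num)

# Placeholder codes for illustration; in the notebook, pass df['ORIGIN'] and df['DEST'].
toy_origins = ["AAA", "BBB", "AAA"]
toy_dests = ["BBB", "CCC", "BBB"]
print(top_airports(toy_origins, toy_dests, num=2))
```

Note that `+` drops keys whose summed count is zero or negative. That does not matter here because every count is positive.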