# Abstract

## Overview

In February 2019, tens of thousands of domestic flights carried passengers around the country every single day. Tens of thousands of aircraft are carefully tracked, monitored, organized, and directed, and hundreds of thousands or even millions of passengers count on those planes to get them where they're going. It's an incredible system, and most of the time it works. But when it doesn't, it hurts. Flight cancellations are extremely expensive, costing airlines $1 billion per year. While cancellations due to weather may be inevitable, a significant portion of cancellations are due to circumstances under _our_ control. Modern air traffic control has been a hundred years in the making and has certainly worked to minimize this issue already, but can we do better?

## Question

Using data on American commercial flights, can we predict when a cancellation is likely to occur? As a bonus, can we predict _why_ the cancellation will occur?


*This is your space to describe your intentions for the project, before writing a single line of code. What are you studying? What are you hoping to build? If you can't explain that clearly before you start digging into the data, you're going to have a hard time planning where to go with this.*

# Obtain the Data

*Describe your data sources here and explain why they are relevant to the problem you are trying to solve.*

*Your code should download the data and save it in data/raw. If you've got the data from an offline source, describe where it came from and what the files look like. Don't do anything to the raw data files just yet; that comes in the next step.*

## Data Sources

My dataset will be based primarily on the "[Marketing Carrier On-Time Performance](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_URL=)" report released by the Bureau of Transportation Statistics. This database contains information on nearly every flight conducted by a significant U.S. carrier dating back to January 2018, which amounts to approximately 8 million observations. I expect to supplement this dataset with additional features, such as the weather forecasts preceding each flight and statistics on the model of plane used for each flight.

*After completing this step, be sure to edit `references/data_dictionary` to include descriptions of where you obtained your data and what information it contains.*

In [83]:
## %%writefile ../src/data/make_dataset.py

# Imports
from io import BytesIO
import os
import urllib.request
from zipfile import ZipFile


# Helper functions


def get_lookup_tables():
    # Lookup tables published by BTS, each available at lookup_base + name:
    #   L_UNIQUE_CARRIERS: reporting carrier lookup table
    #   L_AIRLINE_ID: reporting airline lookup table
    #   L_AIRPORT_ID: airport ID lookup table
    #   L_CITY_MARKET_ID: city market ID lookup table
    #   L_AIRPORT: airport lookup table (SEA -> Seattle-Tacoma International)
    lookups = [
        'L_UNIQUE_CARRIERS',
        'L_AIRLINE_ID',
        'L_AIRPORT_ID',
        'L_CITY_MARKET_ID',
        'L_AIRPORT'
    ]
    lookup_base = 'https://www.transtats.bts.gov/Download_Lookup.asp?Lookup='
    for table in lookups:
        # NOTE: download_lookup is not defined yet in this module
        download_lookup(lookup_base + table)


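def get_weather_forecasts(latitude, longitude, time):
    # Weather forecast API docs:
    # https://darksky.net/dev/docs#time-machine-request
    # Example request:
    # https://api.darksky.net/forecast/0123456789abcdef9876543210fedcba/
    # 42.3601,-71.0589,255657600?exclude=currently,flags
    #
    # The API key is read from the environment instead of being hard-coded.
    # The variable name DARKSKY_API_KEY is an assumption.
    key = os.environ['DARKSKY_API_KEY']
    request_url = (
        f'https://api.darksky.net/forecast/{key}/{latitude},{longitude},{time}'
    )
    # TODO: send the request and store the response
    return request_url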


def download_dataset(url, path, filename, overwrite='ask'):
    """
    Downloads zip file from specified url and extracts csv file to raw data 
    directory
    Input: 
        url: string of url from which to retrieve data
        path: string of directory path to store file in
        filename: string of desired filename
        overwrite: parameter for whether or not to overwrite existing files, if
            found. If 'y', any existing file with filename in path will be
            overwritten. If 'n', function will do nothing. If 'ask', function
            will prompt user to decide whether or not to replace file.
    Output: dataset stored in raw data directory
    """
    filepath = os.path.join(path, filename)
    if os.path.isfile(filepath):
        if overwrite == 'ask':
            overwrite = input(f'{filename} already exists. Update? y/n: ')
        if overwrite.lower() != 'y':
            return

    print(f'Beginning download of {filename}...')
    try:
        # Read the whole archive into memory, then extract the csv it contains
        with urllib.request.urlopen(url) as response:
            zip_bytes = BytesIO(response.read())
        with ZipFile(zip_bytes) as my_zip_file:
            for f in my_zip_file.namelist():
                if f.endswith('.csv'):
                    with open(filepath, 'wb') as output:
                        output.write(my_zip_file.read(f))
        print(f'Successfully wrote {filename} to {path}')

    except urllib.error.HTTPError:
        print(f'Failed to download {filename}')
        return


def get_flight_data_url(year, month):
    '''
    Generate URL to download pre-zipped csv of flight data for a given month as
    provided by the Bureau of Transportation Statistics
    Input: Year in format YYYY (int), Month in format of (M)M, i.e. 3, or 11
    Output: download URL as a string
    '''
    base_url = 'http://transtats.bts.gov/PREZIP/'
    tail = 'On_Time_Reporting_Carrier_On_Time_Performance_1987_present_'
    slug = f'{year}_{month}.zip'
    return base_url + tail + slug
    

def get_flight_data(start, end, path):
    '''
    Downloads the monthly flight data tables described at
    https://www.transtats.bts.gov/Fields.asp for every month of each year
    from start up to, but not including, end
    Input: start and end years in format YYYY (int), path (str) of directory
        to store the files in
    '''
    # Download all BTS datasets
    for year in range(start, end):
        for month in range(1, 13):
            filename = f'flight_data_{year}-{month}.csv'
            url = get_flight_data_url(year, month)
            download_dataset(url, path, filename)


def run():
    """
    Executes a set of helper functions that download data from one or more sources
    and saves those datasets to the data/raw directory.
    """
    path = '../data/raw/'
    get_flight_data(2003, 2020, path)
    # download_dataset_1(url)
    # download_dataset_2(url)
    # save_dataset_1('data/raw', filename)
    # save_dataset_2('data/raw', filename)
    pass
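
`get_lookup_tables` calls `download_lookup`, which is not defined anywhere in this notebook. The following is a minimal sketch of what such a helper might look like, not the original implementation. The names `lookup_filename` and `download_lookup`, the `<table>.csv` naming scheme, and the `opener` parameter (included only so the function can be exercised without network access) are all assumptions.

```python
import os
import urllib.request


def lookup_filename(url):
    """Derive a local filename such as 'L_AIRPORT.csv' from a lookup URL."""
    table = url.split('Lookup=')[-1]
    return f'{table}.csv'


def download_lookup(url, path='../data/raw/', opener=urllib.request.urlopen):
    """
    Hypothetical helper: fetch one lookup table and write it, unmodified,
    into path. Returns the path of the file that was written.
    """
    os.makedirs(path, exist_ok=True)
    filepath = os.path.join(path, lookup_filename(url))
    with opener(url) as response, open(filepath, 'wb') as output:
        output.write(response.read())
    return filepath
```

Writing the response bytes untouched keeps with the rule of leaving raw files unmodified until the scrubbing step.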

In [None]:
run()

Beginning download of flight_data_2003-1.csv...
Successfully wrote flight_data_2003-1.csv to ../data/raw/
Beginning download of flight_data_2003-2.csv...
Successfully wrote flight_data_2003-2.csv to ../data/raw/
Beginning download of flight_data_2003-3.csv...
Successfully wrote flight_data_2003-3.csv to ../data/raw/
Beginning download of flight_data_2003-4.csv...
Successfully wrote flight_data_2003-4.csv to ../data/raw/
Beginning download of flight_data_2003-5.csv...
Successfully wrote flight_data_2003-5.csv to ../data/raw/
Beginning download of flight_data_2003-6.csv...
Successfully wrote flight_data_2003-6.csv to ../data/raw/
Beginning download of flight_data_2003-7.csv...
Successfully wrote flight_data_2003-7.csv to ../data/raw/
Beginning download of flight_data_2003-8.csv...
Successfully wrote flight_data_2003-8.csv to ../data/raw/
Beginning download of flight_data_2003-9.csv...
Successfully wrote flight_data_2003-9.csv to ../data/raw/
Beginning download of flight_data_2003-10.csv...
[...]

Successfully wrote flight_data_2009-11.csv to ../data/raw/
Beginning download of flight_data_2009-12.csv...
Successfully wrote flight_data_2009-12.csv to ../data/raw/
Beginning download of flight_data_2010-1.csv...
Successfully wrote flight_data_2010-1.csv to ../data/raw/
Beginning download of flight_data_2010-2.csv...
Successfully wrote flight_data_2010-2.csv to ../data/raw/
Beginning download of flight_data_2010-3.csv...
Successfully wrote flight_data_2010-3.csv to ../data/raw/
Beginning download of flight_data_2010-4.csv...
Successfully wrote flight_data_2010-4.csv to ../data/raw/
Beginning download of flight_data_2010-5.csv...
Successfully wrote flight_data_2010-5.csv to ../data/raw/
Beginning download of flight_data_2010-6.csv...
Successfully wrote flight_data_2010-6.csv to ../data/raw/
Beginning download of flight_data_2010-7.csv...
Successfully wrote flight_data_2010-7.csv to ../data/raw/
Beginning download of flight_data_2010-8.csv...
Successfully wrote flight_data_2010-8.csv to ../data/raw/
[...]

In [19]:
df = pd.read_csv('../data/raw/On_Time_Marketing_Carrier_On_Time_Performance_(Beginning_January_2018)_2019_2.csv')


In [20]:
df.shape

(582966, 120)

In [25]:
print(list(df.columns))

['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate', 'Marketing_Airline_Network', 'Operated_or_Branded_Code_Share_Partners', 'DOT_ID_Marketing_Airline', 'IATA_Code_Marketing_Airline', 'Flight_Number_Marketing_Airline', 'Originally_Scheduled_Code_Share_Airline', 'DOT_ID_Originally_Scheduled_Code_Share_Airline', 'IATA_Code_Originally_Scheduled_Code_Share_Airline', 'Flight_Num_Originally_Scheduled_Code_Share_Airline', 'Operating_Airline ', 'DOT_ID_Operating_Airline', 'IATA_Code_Operating_Airline', 'Tail_Number', 'Flight_Number_Operating_Airline', 'OriginAirportID', 'OriginAirportSeqID', 'OriginCityMarketID', 'Origin', 'OriginCityName', 'OriginState', 'OriginStateFips', 'OriginStateName', 'OriginWac', 'DestAirportID', 'DestAirportSeqID', 'DestCityMarketID', 'Dest', 'DestCityName', 'DestState', 'DestStateFips', 'DestStateName', 'DestWac', 'CRSDepTime', 'DepTime', 'DepDelay', 'DepDelayMinutes', 'DepDel15', 'DepartureDelayGroups', 'DepTimeBlk', 'TaxiOut', 'WheelsOff', 'Wheels

In [34]:
df['CancellationCode'].value_counts()

B    12231
A     3631
C     2488
D        2
Name: CancellationCode, dtype: int64
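
The single-letter codes come from the BTS cancellation lookup: A is carrier, B is weather, C is National Air System, and D is security. Attaching those labels makes the counts above easier to relate to the "why" part of the question. The sketch below is one way to do that; `CANCELLATION_REASONS` and `summarize_cancellations` are hypothetical names, and the example Series is toy data rather than the real column.

```python
import pandas as pd

# Code-to-reason mapping from the BTS cancellation lookup table
CANCELLATION_REASONS = {
    'A': 'Carrier',
    'B': 'Weather',
    'C': 'National Air System',
    'D': 'Security',
}


def summarize_cancellations(codes):
    """Count cancellations per code and attach the reason and share of total."""
    counts = codes.value_counts()  # missing codes (non-cancelled flights) are dropped
    return pd.DataFrame({
        'reason': [CANCELLATION_REASONS.get(c, 'Unknown') for c in counts.index],
        'count': counts.values,
        'share': counts.values / counts.sum(),
    }, index=counts.index)


# Toy example; with the notebook's data: summarize_cancellations(df['CancellationCode'])
example = pd.Series(['B', 'B', 'A', None])
print(summarize_cancellations(example))
```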

In [23]:
df.head()

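| | Year | Quarter | Month | DayofMonth | DayOfWeek | FlightDate | Marketing_Airline_Network | Operated_or_Branded_Code_Share_Partners | DOT_ID_Marketing_Airline | IATA_Code_Marketing_Airline | ... | Div5Airport | Div5AirportID | Div5AirportSeqID | Div5WheelsOn | Div5TotalGTime | Div5LongestGTime | Div5WheelsOff | Div5TailNum | Duplicate | Unnamed: 119 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019 | 1 | 2 | 1 | 5 | 2019-02-01 | AA | AA_CODESHARE | 19805 | AA | ... | | | | | | | | | N | |
| 1 | 2019 | 1 | 2 | 2 | 6 | 2019-02-02 | AA | AA_CODESHARE | 19805 | AA | ... | | | | | | | | | N | |
| 2 | 2019 | 1 | 2 | 3 | 7 | 2019-02-03 | AA | AA_CODESHARE | 19805 | AA | ... | | | | | | | | | N | |
| 3 | 2019 | 1 | 2 | 4 | 1 | 2019-02-04 | AA | AA_CODESHARE | 19805 | AA | ... | | | | | | | | | N | |
| 4 | 2019 | 1 | 2 | 5 | 2 | 2019-02-05 | AA | AA_CODESHARE | 19805 | AA | ... | | | | | | | | | N | |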


In [42]:
df['DayofMonth'].value_counts().mean()

20820.214285714286

In [49]:
df.groupby('DayofMonth')['Cancelled'].value_counts()

DayofMonth  Cancelled
1           0.0          21364
            1.0            524
2           0.0          16021
            1.0            376
3           0.0          18080
            1.0            386
4           0.0          21409
            1.0            410
5           0.0          19852
            1.0            566
6           0.0          20474
            1.0            437
7           0.0          21159
            1.0            651
8           0.0          21491
            1.0            405
9           0.0          16432
            1.0            309
10          0.0          19780
            1.0            560
11          0.0          20785
            1.0           1044
12          0.0          18289
            1.0           2171
13          0.0          20413
            1.0            699
14          0.0          21912
            1.0            388
15          0.0          22217
            1.0            254
16          0.0          17289
            1.0  
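
Raw counts are hard to compare across days because daily traffic varies, so a rate is easier to read. From the output above, 2,171 of 20,460 flights (about 10.6%) were cancelled on the 12th, against 254 of 22,471 (about 1.1%) on the 15th. Since `Cancelled` is coded 0.0/1.0, its mean within a group is that group's cancellation rate. A minimal sketch, with `cancellation_rate_by` as a hypothetical helper name and toy data standing in for the real table:

```python
import pandas as pd


def cancellation_rate_by(df, column):
    """Share of cancelled flights within each value of column, highest first."""
    return df.groupby(column)['Cancelled'].mean().sort_values(ascending=False)


# Toy data standing in for the February 2019 table
sample = pd.DataFrame({
    'DayofMonth': [1, 1, 1, 1, 2, 2],
    'Cancelled': [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
})
print(cancellation_rate_by(sample, 'DayofMonth'))
```

The same helper would work for other grouping columns from the dataset, such as `Origin` or `IATA_Code_Marketing_Airline`.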

# Scrub the Data

*Look through the raw data files and see what you will need to do to them in order to have a workable data set. If your source data is already well-formatted, you may want to ask yourself why it hasn't already been analyzed and what other people may have overlooked when they were working on it. Are there other data sources that might give you more insights on some of the data you have here?*

*The end goal of this step is to produce a [design matrix](https://en.wikipedia.org/wiki/Design_matrix), containing one column for every variable that you are modeling, including a column for the outputs, and one row for every observation in your data set. It needs to be in a format that won't cause any problems as you visualize and model your data.*
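
As a rough illustration of that target format, the sketch below turns a handful of raw columns into a design matrix: numeric columns pass through, categorical columns are one-hot encoded, and the `Cancelled` output goes in the last column. The column names exist in the raw data, but the particular selection, the constant names, and `build_design_matrix` are assumptions for illustration, not the final feature list. Only columns known before departure are used, since fields like `DepDelay` would not be available at prediction time.

```python
import pandas as pd

# Illustrative selection of columns known before a flight departs
FEATURES = ['Month', 'DayOfWeek', 'CRSDepTime',
            'IATA_Code_Marketing_Airline', 'Origin', 'Dest']
CATEGORICAL = ['IATA_Code_Marketing_Airline', 'Origin', 'Dest']
TARGET = 'Cancelled'


def build_design_matrix(df):
    """One row per flight, one column per modeled variable, target last."""
    matrix = pd.get_dummies(df[FEATURES], columns=CATEGORICAL)
    matrix[TARGET] = df[TARGET].astype(int)
    return matrix


# Toy rows standing in for the raw flight table
sample = pd.DataFrame({
    'Month': [2, 2, 2],
    'DayOfWeek': [5, 6, 7],
    'CRSDepTime': [600, 1230, 1815],
    'IATA_Code_Marketing_Airline': ['AA', 'AS', 'AA'],
    'Origin': ['SEA', 'SEA', 'BOS'],
    'Dest': ['BOS', 'LAX', 'SEA'],
    'Cancelled': [0.0, 1.0, 0.0],
})
print(build_design_matrix(sample).columns.tolist())
```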

## Features
The following is a list of features I'd like to have in the design matrix:



In [3]:
## %%writefile ../src/features/build_features.py

# Imports
from os import listdir
import sqlite3
import sys

from psycopg2 import connect
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
from sqlalchemy import create_engine
import numpy as np
import pandas as pd


# Helper functions
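def run_query(q, db='mlb.db'):
    # Opens a connection to the SQLite database db to run a query, q, and
    # returns a pandas dataframe.
    # NOTE: the default 'mlb.db' does not match the 'flights' default used by
    # run_command; pass the intended database path explicitly.
    with sqlite3.connect(db) as conn:
        return pd.read_sql(q, conn)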

    
def show_tables():
    # Returns a list of all tables and views in our database
    q = """
            SELECT 
                name, 
                type 
            FROM sqlite_master 
            WHERE type IN ('table', 'view');
        """
    return run_query(q)


def run_command(q, db='flights'):
    # opens a connection to database to run a command with no output
    with sqlite3.connect(db) as conn:
        conn.execute('PRAGMA foreign_keys = ON;')
        conn.isolation_level = None
        conn.execute(q)

        
def check_existence(query, item, cursor):
    '''
    Executes a query and checks if item is in returned results
    Input: 
        query (str), a SQL query returning list of items to look within
        item (str), the name of the item to check if exists
        cursor, a psycopg2 cursor object
    Output: boolean, True if item exists
    '''
    cursor.execute(query)
    items = [i[0] for i in cursor.fetchall()]
    exists = item in items
    return exists

    
def load_table(file, path, params, overwrite='ask'):
    ''' 
    Loads a csv into a SQL table
    Input:
        file (str), filename of csv to load
        path (str), relative directory to find file in
        params (dict), parameters for connecting to psql, including user, host,
            and port
        overwrite (str): parameter for whether or not to overwrite existing 
            files, if found. If 'y', any existing file with filename in path 
            will be overwritten. If 'n', function will do nothing. If 'ask', 
            function will prompt user to decide whether or not to replace file.
    '''
    file_path = path + file
    # e.g. 'flight_data_2003-1.csv' -> 'flights_2003_1'; hyphens are not valid
    # in unquoted PostgreSQL identifiers
    table_name = f"flights_{file[12:-4].replace('-', '_')}"

    q = """
    SELECT tablename 
    FROM pg_catalog.pg_tables 
    WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';
    """
    try:
        with connect(**params) as conn:
            conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
            print(f"Connecting to database {params['dbname']}")
            cur = conn.cursor()
            if check_existence(q, table_name, cur):
                if overwrite == 'ask':
                    overwrite = input(
                        f'{table_name} already exists. Update? y/n: ')
                if overwrite.lower() != 'y':
                    return
                # Truncate the table first
                cur.execute(f'TRUNCATE {table_name} CASCADE;')
                print(f'Truncated {table_name}')
            # NOTE: COPY loads into an existing table, so table_name must
            # already have been created with columns matching the csv
            with open(file_path, 'r') as f:
                c = f'COPY {table_name} FROM STDIN WITH CSV HEADER'
                cur.copy_expert(c, f)
                print(f'Loaded data into {table_name}')
    except Exception as e:
        print(f'Error: {str(e)}')
        sys.exit(1)

        
def create_db(dbname, params):
    '''
    Connects to psql as default user and creates new database if it doesn't
    already exist
    Input:
        dbname (string), name of new database
        params (dict), parameters for connecting to psql, including user, host,
        and port
    Output: database created in psql
    '''
    q = 'SELECT datname FROM pg_database;'
    temp_params = params.copy()
    temp_params['dbname'] = 'postgres'
    with connect(**temp_params) as conn:
        conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
        cur = conn.cursor()
        cur.execute(q)
        databases = [db[0] for db in cur.fetchall()]
        db_exists = dbname in databases
        if not db_exists:
            cur.execute(f'CREATE DATABASE {dbname}')
            print(f'Created database {dbname}')
    pass
        

def build_raw_database():
    '''
    Constructs database from raw data CSVs previously downloaded
    '''
    path = '../data/raw/'
    params = {
        'user': 'scottbutters',
        'host': '127.0.0.1',
        'port': 5432,
        'dbname': 'raw_flight_data'
    }
    
    # Create db if DNE yet
    create_db(dbname=params['dbname'], params=params)
    
    # Collect list of csvs and load into tables
    files = [f for f in listdir(path) if '.csv' in f]
    for file in files:
        load_table(file, path, params)
    
def run():
    """
    Executes a set of helper functions that read files from data/raw, clean them,
    and convert the data into a design matrix that is ready for modeling.
    """
    build_raw_database()
    # The remaining steps are not implemented yet:
    # build_interim_database()
    # data = clean_data()
    # create_

    # clean_dataset_1('data/raw', filename)
    # clean_dataset_2('data/raw', filename)
    # save_cleaned_data_1('data/interim', filename)
    # save_cleaned_data_2('data/interim', filename)
    # build_features()
    # save_features('data/processed')

*Before moving on to exploratory analysis, write down some notes about challenges encountered while working with this data that might be helpful for anyone else (including yourself) who may work through this later on.*
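
One challenge worth recording: `load_table` issues `COPY ... FROM STDIN`, and PostgreSQL's `COPY` loads rows into an existing table, so every monthly table has to be created before its csv can be loaded. The sketch below generates a `CREATE TABLE` statement from a csv header line, storing every column as `TEXT` so that typing can be deferred to the cleaning step. The function name `create_table_sql`, the all-`TEXT` choice, and the `unnamed_<position>` placeholder are assumptions. The placeholder is there because the `Unnamed: 119` column in the `df.head()` output suggests the raw headers end with a trailing comma.

```python
import csv
import io


def create_table_sql(table_name, header_line):
    """Build a CREATE TABLE statement with one TEXT column per csv header field."""
    names = next(csv.reader(io.StringIO(header_line)))
    columns = []
    for position, name in enumerate(names):
        # A trailing comma in the header yields an empty name; use a placeholder
        name = name.strip() or f'unnamed_{position}'
        columns.append(f'"{name}" TEXT')
    column_list = ',\n    '.join(columns)
    return f'CREATE TABLE IF NOT EXISTS "{table_name}" (\n    {column_list}\n);'


# Shortened header for illustration; a real header has 120 fields
print(create_table_sql('flights_2019_2', 'Year,Quarter,Cancelled,'))
```

Double-quoting the identifiers keeps mixed-case names such as `DayofMonth` intact, at the cost of making them case-sensitive in later queries.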

# Explore the Data

*Before you start exploring the data, write out your thought process about what you're looking for and what you expect to find. Take a minute to confirm that your plan actually makes sense.*

*Calculate summary statistics and plot some charts to give you an idea what types of useful relationships might be in your dataset. Use these insights to go back and download additional data or engineer new features if necessary. Not now though... remember we're still just trying to finish the MVP!*

In [None]:
## %%writefile ../src/visualization/visualize.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from data/processed,
    calculates descriptive statistics for the population, and plots charts
    that visualize interesting relationships between features.
    """
    # data = load_features('data/processed')
    # describe_features(data, 'reports/')
    # generate_charts(data, 'reports/figures/')
    pass


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

*What did you learn? What relationships do you think will be most helpful as you build your model?*

# Model the Data

*Describe the algorithm or algorithms that you plan to use to train with your data. How do these algorithms work? Why are they good choices for this data and problem space?*

In [None]:
## %%writefile ../src/models/train_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from data/processed,
    split the data into training and test sets, fit a model on the training
    set, and save the trained model to models/.
    """
    # data = load_features('data/processed/')
    # train, test = train_test_split(data)
    # save_train_test(train, test, 'data/processed/')
    # model = build_model()
    # model.fit(train)
    # save_model(model, 'models/')
    pass


In [None]:
## %%writefile ../src/models/predict_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that load the test data from
    data/processed and the trained model from models/, generate predictions,
    evaluate them against the true labels, and save the metrics to reports/.
    """
    # test_X, test_y = load_test_data('data/processed')
    # trained_model = load_model('models/')
    # predictions = trained_model.predict(test_X)
    # metrics = evaluate(test_y, predictions)
    # save_metrics('reports/')
    pass



_Write down any thoughts you may have about working with these algorithms on this data. What other ideas do you want to try out as you iterate on this pipeline?_
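
One thing to keep in mind when evaluating any algorithm on this data is class imbalance. The February 2019 file records 18,352 cancellation codes among 582,966 flights, so, assuming every cancelled flight carries a code, only about 3.1% of flights were cancelled. A model that never predicts a cancellation would therefore be about 96.9% accurate while being useless, which is why precision and recall matter more here than accuracy. The sketch below shows this with plain Python; `evaluate` borrows its name from the commented outline in `predict_model.py`, but its body is an assumption.

```python
def evaluate(actual, predicted):
    """Accuracy, precision, and recall for binary labels (1 = cancelled)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        'accuracy': correct / len(pairs),
        # Report 0.0 when a ratio is undefined (no predicted or actual positives)
        'precision': true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        'recall': true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Majority-class baseline on a scaled-down sample: 3 cancelled flights in 100
actual = [1] * 3 + [0] * 97
always_on_time = [0] * 100
print(evaluate(actual, always_on_time))
```

On this toy sample the baseline reaches 0.97 accuracy with 0.0 precision and 0.0 recall, so any real model should be judged on whether it improves recall without giving up too much precision.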

# Interpret the Model

_Write up the things you learned, and how well your model performed. Be sure to address the model's strengths and weaknesses. What types of data does it handle well? What types of observations tend to give it a hard time? What future work might you, or someone reading this, want to do, building on the lessons learned and tools developed in this project?_