In [10]:
import pandas as pd
pd.read_sql_query('SELECT * FROM ny_flights LIMIT 5', conn)

Unnamed: 0,flight_date,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,airline,tail_number,flight_number,origin,dest,air_time,distance,cancelled,diverted
0,2021-01-01,655,700,-5,747,817,-30,9E,N301PQ,4632,SYR,JFK,38,209,0,0
1,2021-01-02,651,700,-9,847,817,30,9E,N296PQ,4632,SYR,JFK,50,209,0,0
2,2021-01-03,710,700,10,832,817,15,9E,N918XJ,4632,SYR,JFK,45,209,0,0
3,2021-01-04,700,700,0,816,817,-1,9E,N919XJ,4632,SYR,JFK,46,209,0,0
4,2021-01-07,656,700,-4,757,835,-38,9E,N340CA,4632,SYR,JFK,45,209,0,0


That was easier than expected, and even better: `pd.read_sql_query` replaces the separate execute and fetchall steps. Now that we have the data in a dataframe, we can use all the techniques we have already learned to explore and clean it. We don't have to do that right now, but it will become very useful later.  
You did a fantastic job and deserve to be called a badass Python hacker!
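
To make the difference between the two approaches concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the PostgreSQL server, and the table `ny_flights_demo` and its three rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# Stand-in for the PostgreSQL connection: an in-memory SQLite database (assumption)
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE ny_flights_demo "
    "(flight_date TEXT, origin TEXT, dest TEXT, dep_delay INTEGER)"
)
conn_demo.executemany(
    "INSERT INTO ny_flights_demo VALUES (?, ?, ?, ?)",
    [("2021-01-01", "SYR", "JFK", -5),
     ("2021-01-02", "SYR", "JFK", -9),
     ("2021-01-03", "SYR", "JFK", 10)],
)

# Two-step approach: execute the query, then fetch a list of plain tuples
rows = conn_demo.execute("SELECT * FROM ny_flights_demo LIMIT 2").fetchall()

# pandas runs the query and also picks up the column names
df_demo = pd.read_sql_query("SELECT * FROM ny_flights_demo LIMIT 2", conn_demo)

print(rows)
print(df_demo)
```

The list of tuples carries no column names, while the dataframe does, which is what makes the pandas route convenient for further exploration.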

In [1]:
# Import all necessary libraries
import pandas as pd
import numpy as np
from config import config
from sqlalchemy import exc #SQLAlchemy provides a nice “Pythonic” way of interacting with databases.
from sqlalchemy import event

In [14]:
import psycopg2
from configdef import config

def connect():
    """ Connect to the PostgreSQL database server """
    conn = None
    try:
        # read connection parameters
        params = config()

        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        conn = psycopg2.connect(**params)

        # create a cursor
        cur = conn.cursor()

        # execute a statement
        print('PostgreSQL database version:')
        cur.execute('SELECT version()')

        # display the PostgreSQL database server version
        db_version = cur.fetchone()
        print(db_version)

        # close the communication with the PostgreSQL server
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
            print('Database connection closed.')

# Run the connection test
connect()



Connecting to the PostgreSQL database...
PostgreSQL database version:
('PostgreSQL 13.3 on x86_64-pc-linux-gnu, compiled by gcc, a 68c5366192 p 6520304dc1, 64-bit',)
Database connection closed.


In [26]:
import psycopg2
conn = psycopg2.connect(
    host="db-postgresql-fra1-70962-do-user-8861194-0.b.db.ondigitalocean.com",
    port="25060",
    database="defaultdb",
    user="dauser",
    password="eknotwxxqpe9bmue")

def get_data(query):
    cur = conn.cursor()
    cur.execute(query)
    rows = cur.fetchall()
    print(rows)

get_data('Select version()')

[('PostgreSQL 13.3 on x86_64-pc-linux-gnu, compiled by gcc, a 68c5366192 p 6520304dc1, 64-bit',)]


We will go through this workflow step by step before we combine all the steps into one main function that runs them all at once.  
For our purpose, we will use public **flight data**. 

# 1. Set up a connection to a SQL database

We start by connecting to an existing SQL database, so that we can check what is already there and later send data to a table in it.

In [2]:
# Establish db connection

# Get connection details from configdef file into a dictionary
params = config(section='postgres')

# Use SQLAlchemy to create a connection to the database, which is contained within the engine object
engine = pg_engine_connection(**params)

# Cleans up unnecessary database connections
engine.dispose()

Postgres Database connection successful
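
The `config` helper itself is not shown in this notebook. As a rough idea of how such a helper could work, here is a sketch built on Python's standard `configparser`. The function name `read_db_params`, the ini text and all values in it are hypothetical placeholders, not the actual helper or real credentials.

```python
from configparser import ConfigParser

# Hypothetical ini contents; real credentials belong in a file
# that is kept out of version control.
EXAMPLE_INI = """
[postgres]
host = localhost
port = 5432
database = example_db
user = example_user
password = example_password
"""

def read_db_params(ini_text, section="postgres"):
    """Return the key/value pairs of one ini section as a dictionary."""
    parser = ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(section):
        raise KeyError(f"Section {section!r} not found")
    return dict(parser.items(section))

params_demo = read_db_params(EXAMPLE_INI)
print(params_demo["host"], params_demo["port"])
```

A file-based version would call `parser.read(filename)` instead of `parser.read_string`. Because the result is a dictionary, it can be unpacked with `**params` into a connection function, as done above.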


Now, you can query data from a table in the database you are connected to.  
We are working with the table called ny_flights, which contains data from 2019 and 2020 of the three New York airports (JFK, LGA and EWR).  
Get an overview of this table by querying the top three rows.

```engine.execute('sql query')``` returns a result object:

In [12]:
engine.execute('select * from ny_flights limit 3')

<sqlalchemy.engine.result.ResultProxy at 0x7ff90ae0c1f0>

Using ```.fetchall()``` you can access the data you asked for in your query:

In [4]:
result = engine.execute('select * from ny_flights limit 3').fetchall()
display(result)

[(datetime.date(2021, 1, 1), 655, 700, -5, 747, 817, -30, '9E', 'N301PQ', 4632, 'SYR', 'JFK', 38, 209, 0, 0),
 (datetime.date(2021, 1, 2), 651, 700, -9, 847, 817, 30, '9E', 'N296PQ', 4632, 'SYR', 'JFK', 50, 209, 0, 0),
 (datetime.date(2021, 1, 3), 710, 700, 10, 832, 817, 15, '9E', 'N918XJ', 4632, 'SYR', 'JFK', 45, 209, 0, 0)]

You see three rows, each representing a single flight, with one piece of information per column.

# 2. Download csv file

In the following, you are going to download a csv file containing additional flight data from [this website](https://transtats.bts.gov).    
You can specify which data you want to download.  
We want you to choose a specific month in a specific year for your download.  
To avoid everybody downloading the same data, first check which months are already available in the ny_flights table:

In [13]:
# First, check which months are already in the database.
months = engine.execute('select distinct(year, month) from ny_flights').fetchall()
months

[('(2018,9)',),
 ('(2019,1)',),
 ('(2019,2)',),
 ('(2019,3)',),
 ('(2019,4)',),
 ('(2019,5)',),
 ('(2019,6)',),
 ('(2019,7)',),
 ('(2019,8)',),
 ('(2019,9)',),
 ('(2019,10)',),
 ('(2019,11)',),
 ('(2019,12)',),
 ('(2020,1)',),
 ('(2020,2)',),
 ('(2020,3)',),
 ('(2020,4)',),
 ('(2020,5)',),
 ('(2020,6)',),
 ('(2020,7)',),
 ('(2020,8)',),
 ('(2020,9)',),
 ('(2020,10)',),
 ('(2020,11)',),
 ('(2020,12)',)]
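
Each row above comes back as a single composite string such as `'(2018,9)'`. If you prefer to check your chosen month programmatically rather than by eye, a small sketch could look like this. The names `months_demo`, `parse_year_month` and `chosen` are introduced here for illustration only.

```python
# A few rows in the same shape as the query result above
months_demo = [('(2018,9)',), ('(2019,1)',), ('(2020,12)',)]

def parse_year_month(row):
    """Turn a row like ('(2018,9)',) into the tuple (2018, 9)."""
    year_text, month_text = row[0].strip('()').split(',')
    return int(year_text), int(month_text)

# Set of (year, month) pairs that are already present
available = {parse_year_month(row) for row in months_demo}

chosen = (2021, 1)  # the month you would like to download
print(chosen in available)
```

If the check prints `True`, the month is already in the table and another one should be picked.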

Now, choose a month/year that is not yet in the database.  
With the following commands, you will download a csv file of public flight data from [this website](https://transtats.bts.gov) containing the data for your chosen month/year.    
The file will be stored in a data folder.

In [15]:
import requests
from zipfile import ZipFile

years = [2021] # list of years you want to look at, specify one year
months = [1] # list of months you want to look at, specify one month
# Here: January 2021, which is not in the list above yet
path = 'data/' # Specifies path for saving file

# Loop through months
for year in years:
    for month in months:
        # Get the file from the website https://transtats.bts.gov
        zip_file = f'On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{year}_{month}.zip'
        csv_file = f'On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_{year}_{month}.csv'
        url = (f'https://transtats.bts.gov/PREZIP/{zip_file}')
        arg = f' -P {path} --no-check-certificate'
        # Download the file with the wget command-line tool (the leading ! runs a shell command from the notebook); -P sets the folder where the file is stored
        !wget {url}{arg} 

--2021-07-15 11:32:42--  https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2021_1.zip
Resolving transtats.bts.gov... 204.68.194.70
Connecting to transtats.bts.gov|204.68.194.70|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 17937103 (17M) [application/x-zip-compressed]
Saving to: 'data/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2021_1.zip.2'


2021-07-15 11:33:05 (801 KB/s) - 'data/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2021_1.zip.2' saved [17937103/17937103]



In [6]:
# Unzip your file
with ZipFile(path+zip_file, 'r') as zip_ref:
    zip_ref.extractall(path)
    
# In case this does not work for you, try:
# 'import zipfile' together with 'with zipfile.ZipFile(path+zip_file, 'r')'

In [19]:
# Read in your data
df = pd.read_csv(path+csv_file, low_memory = False)
df.head()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,...,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum,Unnamed: 109
0,2021,1,1,2,6,2021-01-02,9E,20363,9E,N337PQ,...,,,,,,,,,,
1,2021,1,1,3,7,2021-01-03,9E,20363,9E,N607LR,...,,,,,,,,,,
2,2021,1,1,4,1,2021-01-04,9E,20363,9E,N602LR,...,,,,,,,,,,
3,2021,1,1,7,4,2021-01-07,9E,20363,9E,N295PQ,...,,,,,,,,,,
4,2021,1,1,8,5,2021-01-08,9E,20363,9E,N324PQ,...,,,,,,,,,,


In [10]:
df.shape


# 3. Prepare the csv file for further processing

In the next step, we clean and prepare our dataset.

a) Since the dataset has a lot of columns, we define which ones to keep.

In [21]:
# Columns from downloaded file that are to be kept
columns_to_keep = ['FlightDate',
                   'DepTime',
                   'CRSDepTime',
                   'DepDelay',
                   'ArrTime',
                   'CRSArrTime',
                   'ArrDelay',
                   'Reporting_Airline',
                   'Tail_Number',
                   'Flight_Number_Reporting_Airline',
                   'Origin',
                   'Dest',
                   'AirTime',
                   'Distance',
                   'Cancelled',
                   'Diverted']

In [5]:
# The columns in the DB are named differently than in the source csv files. Let's get the names from the DB
table_name_sql = "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'ny_flights' ORDER BY ordinal_position"
c_names = engine.execute(table_name_sql).fetchall()
print(c_names)

[('flight_date',),
 ('dep_time',),
 ('sched_dep_time',),
 ('dep_delay',),
 ('arr_time',),
 ('sched_arr_time',),
 ('arr_delay',),
 ('airline',),
 ('tail_number',),
 ('flight_number',),
 ('origin',),
 ('dest',),
 ('air_time',),
 ('distance',),
 ('cancelled',),
 ('diverted',)]

In [7]:
# we can clean up the results into a clean list
new_column_names = []
for name in c_names:
    new_column_names.append(name[0])
print(new_column_names)

['flight_date',
 'dep_time',
 'sched_dep_time',
 'dep_delay',
 'arr_time',
 'sched_arr_time',
 'arr_delay',
 'airline',
 'tail_number',
 'flight_number',
 'origin',
 'dest',
 'air_time',
 'distance',
 'cancelled',
 'diverted']

b) With the next function, we make our csv data ready to be uploaded to SQL.  
We only keep the columns specified above and convert the data types.

In [8]:
def clean_airline_df(df):
    '''
    Transforms a df made from a BTS csv file into a df that is ready to be uploaded to SQL.
    Keeps only columns_to_keep and renames them to new_column_names.
    '''

    # Build dataframe including only the columns you want to keep
    # (.copy() makes the result independent of the original df)
    df_airline = df.loc[:,columns_to_keep].copy()
     
    # Clean data types and NULLs
    df_airline['FlightDate']= pd.to_datetime(df_airline['FlightDate'], yearfirst=True)
    df_airline['CRSArrTime']= pd.to_numeric(df_airline['CRSArrTime'], downcast='integer', errors='coerce')
    df_airline['Cancelled']= pd.to_numeric(df_airline['Cancelled'], downcast='integer')
    df_airline['Diverted']= pd.to_numeric(df_airline['Diverted'], downcast='integer')
    
    # Rename columns
    df_airline.columns = new_column_names
    
    return df_airline
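
To see what the type conversions in this function do, here is a small sketch on a made-up three-column frame. The values in `raw` are invented for illustration; only the pandas calls mirror the function above.

```python
import pandas as pd

# Toy input with one unparseable time value (illustrative data)
raw = pd.DataFrame({
    "FlightDate": ["2021-01-02", "2021-01-03"],
    "CRSArrTime": ["939", "not a time"],
    "Cancelled": [0.0, 1.0],
})

# Strings become proper timestamps
dates = pd.to_datetime(raw["FlightDate"], yearfirst=True)

# errors='coerce' turns values that cannot be parsed into NaN instead of raising
arr = pd.to_numeric(raw["CRSArrTime"], downcast="integer", errors="coerce")

# Whole-number floats without missing values can be downcast to a small integer type
cancelled = pd.to_numeric(raw["Cancelled"], downcast="integer")

print(dates.dtype, arr.dtype, cancelled.dtype)
```

A column that contains NaN after coercion cannot be stored as a plain integer column, which is one reason why some numeric columns show up with a trailing `.0`.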

In [9]:
# Call function and check resulting dataframe
df_clean = clean_airline_df(df)
df_clean.head()


If you only want to look at specific airports, it is a good idea to filter for them in advance.  
This function does the filtering. 

In [36]:
# Specify the airports you are interested in and put them as a list in the function.
def select_airport(df, airports):
    ''' Helper function for filtering airline df for a subset of airports'''
    df_out = df.loc[(df.origin.isin(airports)) | (df.dest.isin(airports))]
    return df_out

In [26]:
# Execute function, not filtering for any specific airports
airports=[]
if len(airports) > 0:
    df_airline = select_airport(df_clean, airports)
else:
    df_airline = df_clean
    
df_airline.head()

Unnamed: 0,flight_date,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,airline,tail_number,flight_number,origin,dest,air_time,distance,cancelled,diverted
0,2021-01-02,733.0,730,3.0,927.0,939,-12.0,9E,N337PQ,4628,CVG,BOS,89.0,752.0,0,0
1,2021-01-03,727.0,730,-3.0,924.0,939,-15.0,9E,N607LR,4628,CVG,BOS,97.0,752.0,0,0
2,2021-01-04,737.0,730,7.0,938.0,939,-1.0,9E,N602LR,4628,CVG,BOS,103.0,752.0,0,0
3,2021-01-07,1710.0,1715,-5.0,1911.0,1912,-1.0,9E,N295PQ,4628,CVG,BOS,104.0,752.0,0,0
4,2021-01-08,1711.0,1715,-4.0,1926.0,1912,14.0,9E,N324PQ,4628,CVG,BOS,106.0,752.0,0,0


# 4. Push the prepared data to a table in the database

In [27]:
# Specify which table within your database you want to push your data to. For our case, use the ny_flights table.
# If the specified table doesn't exist yet, it will be created
# With 'append', your data will be appended to the already existing data within the table ny_flights.
# This will take a minute or two...

table = 'ny_flights'

# Sends your data to the specified table via the established connection
df_airline.to_sql(table, engine, index=False, if_exists="append", 
    method='multi', chunksize=5000)
print(f'done uploading {year}-{month}')

done uploading 2021-1
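
One consequence of `if_exists="append"` is worth seeing once: running the upload twice adds the rows twice. The sketch below illustrates this under the assumption of an in-memory SQLite database as a stand-in for PostgreSQL. The table `flights_demo`, the helper `row_count` and the three rows are made up.

```python
import sqlite3
import pandas as pd

# Stand-in database and a tiny dataframe (assumptions for illustration)
conn_upload = sqlite3.connect(":memory:")
df_upload = pd.DataFrame({
    "flight_date": ["2021-01-02", "2021-01-03", "2021-01-04"],
    "origin": ["CVG", "CVG", "CVG"],
    "dest": ["BOS", "BOS", "BOS"],
})

def row_count(connection, table_name):
    """Count the rows of a table (table_name is trusted here, not user input)."""
    return connection.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]

# First upload: the table does not exist yet, so it is created
df_upload.to_sql("flights_demo", conn_upload, index=False,
                 if_exists="append", chunksize=2)
first = row_count(conn_upload, "flights_demo")

# Second upload of the same data: rows are appended again
df_upload.to_sql("flights_demo", conn_upload, index=False,
                 if_exists="append", chunksize=2)
second = row_count(conn_upload, "flights_demo")

print(first, second)
```

This is why it pays to check which months are already in the table before uploading.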


# 5. Use SQL to query data from the database

### Tasks
Having sent the data to the database, it is always good to check the results.

1. Check the top 5 rows of the ny_flights table (as we did above)

2. Check if your data has been added by selecting your month and year.

3. See how many rows you have sent to the database by selecting your month and year in combination with count and group by.
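
If you want a starting point for tasks 2 and 3, the following sketch shows the general shape of such queries. It assumes an in-memory SQLite table `ny_flights_demo` with made-up rows, and `strftime` is SQLite syntax. On PostgreSQL you would more likely reach for something like `date_trunc('month', flight_date)`, so treat the SQL as a pattern rather than a ready-made answer.

```python
import sqlite3

# Stand-in table with three illustrative rows
conn_check = sqlite3.connect(":memory:")
conn_check.execute(
    "CREATE TABLE ny_flights_demo (flight_date TEXT, origin TEXT, dest TEXT)"
)
conn_check.executemany(
    "INSERT INTO ny_flights_demo VALUES (?, ?, ?)",
    [("2020-12-30", "JFK", "BOS"),
     ("2021-01-02", "CVG", "BOS"),
     ("2021-01-03", "CVG", "BOS")],
)

# Task 2 pattern: select the rows of one month via a date range
january = conn_check.execute(
    "SELECT * FROM ny_flights_demo "
    "WHERE flight_date >= '2021-01-01' AND flight_date < '2021-02-01'"
).fetchall()

# Task 3 pattern: count rows per month with COUNT and GROUP BY
counts = conn_check.execute(
    "SELECT strftime('%Y-%m', flight_date) AS year_month, COUNT(*) "
    "FROM ny_flights_demo GROUP BY year_month ORDER BY year_month"
).fetchall()

print(january)
print(counts)
```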

# 6. Wrap-up

The following function puts everything we did above together.  

In [41]:
def airline_csv_to_sql(years=[2020], months=[2], path ='data/', table='flights', airports=[]):
    '''Downloads and unzips the flight data from BTS, cleans it and uploads it to the given SQL table'''
    
    # Establish db connection
    params = config(section='postgresql')
    engine = pg_engine_connection(**params)
    engine.dispose()

    # Loop through months
    for year in years:
        for month in months:
            # Get the file
            zip_file = f'On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{year}_{month}.zip'
            csv_file = f'On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_{year}_{month}.csv'
            url = (f'https://transtats.bts.gov/PREZIP/{zip_file}')
            arg = f' -P {path} --no-check-certificate'
            !wget {url}{arg} 

            # unzip
            with ZipFile(path+zip_file, 'r') as zip_ref:
                zip_ref.extractall(path)
            
            # prepare df
            df_airline = pd.read_csv(path+csv_file, low_memory=False)
            df_airline = clean_airline_df(df_airline)
            
            # Select specific airports 
            if len(airports) > 0:
                df_airline = select_airport(df_airline, airports)

            # to SQL
            print(f'starting uploading {df_airline.shape[0]} rows from {year}-{month}')
            df_airline.to_sql(table, engine, index=False, if_exists="append", 
                  method='multi', chunksize=5000)
            print(f'done uploading {year}-{month}')