<img src="images/Carriers.jpg" width="800">

## How to work with this notebook
- Read the text in markdown cells 
- In codeblocks with blanks (____), replace the blank with required code 
- In codeblocks with comments and no code, write the required code underneath

## What we are going to do in this notebook
We want to get data about carriers from the internet, clean this data, combine it with data extracted from our database, then finally upload it to a table in the database. 
At the same time we will continue to work with our sql_functions module to get the hang of writing and importing functions.

# Using external data sources and PostgreSQL - Carriers
In our lecture we found out where to find data about carriers. Now we want to get this data into Python, and we want to do it without downloading the file to our machine through a web browser. Instead, we will read the CSV directly from the web into Python.


## Importing carrier data from the internet into Python
We will use pandas to read the file we found on the internet directly. Let's find out which function can help us.
First, import pandas.

In [37]:
# Import pandas


Think back to the previous module: We used the Python library Pandas to import dataframes stored on our local drive.   
Find a function in the Pandas package that reads a CSV file into Python.  
You can let the IDE help you: type `pd.read` and see what the autocomplete suggests.

In [40]:
# type the name of the function that reads csv


In [41]:
carriers

Unnamed: 0,Code,Description
0,02Q,Titan Airways
1,04Q,Tradewind Aviation
2,05Q,"Comlux Aviation, AG"
3,06Q,Master Top Linhas Aereas Ltd.
4,07Q,Flair Airlines Ltd.
...,...,...
1560,ZW,Air Wisconsin Airlines Corp
1561,ZX,Air Georgian
1562,ZX (1),Airbc Ltd.
1563,ZY,Atlantic Gulf Airlines


In [None]:
# Set column names to 'carrier' and 'name'
carrier_columns = [______ , ______]

In [2]:
# solution
carrier_columns = ['carrier' , 'name']

In [3]:
#solution
import pandas as pd

Next, use the Pandas read csv function with the following url to import the data into Python:   
https://raw.githubusercontent.com/dannguyen/bts-transstats-t100-domestic-demo/master/data/lookup-tables/L_UNIQUE_CARRIERS.csv


In [None]:
# Import carriers data from the web using pandas
carriers = __.______(___________, 
                     names=_____, # Sets the column names to the values in carrier_columns
                     skiprows = 1) # Skips the row with column names


In [35]:
# SOLUTION # Import carriers data from the web using pandas
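url = 'https://raw.githubusercontent.com/dannguyen/bts-transstats-t100-domestic-demo/master/data/lookup-tables/L_UNIQUE_CARRIERS.csv'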
carriers = pd.read_csv(url, 
                     names=carrier_columns, # Sets the column names to the values in carrier_columns
                     skiprows = 1) # Skips the row with column names


In [None]:
# Print the carriers dataframe


In [5]:
# SOLUTION
display(carriers)

Unnamed: 0,carrier,name
0,02Q,Titan Airways
1,04Q,Tradewind Aviation
2,05Q,"Comlux Aviation, AG"
3,06Q,Master Top Linhas Aereas Ltd.
4,07Q,Flair Airlines Ltd.
...,...,...
1560,ZW,Air Wisconsin Airlines Corp
1561,ZX,Air Georgian
1562,ZX (1),Airbc Ltd.
1563,ZY,Atlantic Gulf Airlines


Congratulations! With just a few lines of code you loaded a csv file from the web into Python and changed the column names on the fly. Good job!
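
To see what `names` and `skiprows` do in isolation, here is a minimal sketch that reads a small in-memory CSV instead of the URL. The three rows are copied from the output above; using `StringIO` as a stand-in for the web file is an assumption made so the example runs offline.

```python
from io import StringIO

import pandas as pd

# Three rows copied from the carriers output above, with the original header.
# StringIO stands in for the remote file (assumption, so this runs offline).
sample_csv = StringIO(
    'Code,Description\n'
    '02Q,Titan Airways\n'
    '04Q,Tradewind Aviation\n'
    '05Q,"Comlux Aviation, AG"\n'
)

sample = pd.read_csv(
    sample_csv,
    names=['carrier', 'name'],  # use our own column names instead of the header
    skiprows=1,                 # skip the original header row so it is not read as data
)

print(sample.columns.tolist())
print(len(sample))
```

Without `skiprows=1`, the original header line would be read as an ordinary data row, because passing `names` tells pandas that the file has no header of its own.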

## Inserting carrier data into the database in Python
Now that we have the data in Python, all we have to do is write it into our SQL database. Before we do that we should make sure that our data is clean. Let's run some basic summary statistics before we move on.

In [None]:
# Print carriers info


It seems we have one NA/NULL value in the carrier column, since there are 1564 non-null values out of 1565 total values.  
We can confirm this by running the isnull() and sum() functions together.

In [None]:
# Count NULL values


Clearly, there is one NA/NULL value in our table. Alright, let's find out which carrier is missing its code value.

In [None]:
# Find missing carrier code


Found it! North American Airlines is missing its carrier code value. A quick internet search reveals that the airline ceased operations in 2014 and that in such cases carrier codes can be reassigned to other carriers. This is good to know, but for now we don't have to worry about this missing value and can leave it as is. Let's move on and add the table to our SQL database.
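
One thing worth checking before accepting a value as truly missing: by default, `pd.read_csv` treats the literal string `NA` as a missing value, so a code made of the two letters NA would show up as NaN. The sketch below uses a small made-up CSV (the rows, including the `NA` code, are assumptions for illustration and are not taken from the real file) to show the null checks from this section and how `keep_default_na=False` changes the result.

```python
from io import StringIO

import pandas as pd

# Made-up sample data (assumption): one row uses the two-letter code "NA"
csv_text = (
    'Code,Description\n'
    'AA,American Airlines Inc.\n'
    'NA,North American Airlines\n'
    'DL,Delta Air Lines Inc.\n'
)

# Default parsing: the string "NA" is converted to a missing value
default_parse = pd.read_csv(StringIO(csv_text),
                            names=['carrier', 'name'],
                            skiprows=1)
print(default_parse.isnull().sum())                       # null count per column
print(default_parse[default_parse['carrier'].isnull()])   # rows with a missing code

# keep_default_na=False keeps "NA" as an ordinary string
literal_parse = pd.read_csv(StringIO(csv_text),
                            names=['carrier', 'name'],
                            skiprows=1,
                            keep_default_na=False)
print(literal_parse.isnull().sum())
```

Note that `keep_default_na=False` also stops empty fields from being read as missing, so it is a trade-off rather than a free fix.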

In order to do this, we need to connect to our database, retrieve the data from Python, create a new table in our database and write the data into it.


Before this, we will add a new function to our sql_functions module that we built in the previous repo. If you haven't done the steps in the README yet, check it out now... 


Now, from your own sql module, import the function that retrieves data from the Postgres database:

In [None]:
# Import get_data function from the sql module
from ___ import ___


In [15]:
# solution
from sql_function2 import get_dataframe

Now that we have imported the function, let's check if the connection is working.  
Complete the code below and retrieve the first 5 rows from the flights table.

In [None]:
# Print top 5 rows of the flights table

In [16]:
#SOLUTION
sql = 'select * from hh_analytics_22_2.flights limit 5'
get_dataframe(sql)

Unnamed: 0,flight_date,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,airline,tail_number,flight_number,origin,dest,air_time,actual_elapsed_time,distance,cancelled,diverted
0,2021-01-04,942.0,945,-3.0,1218.0,1231,-13.0,OO,N103SY,5254,DEN,DSM,79.0,96.0,589,0,0
1,2021-01-04,1453.0,1500,-7.0,1628.0,1649,-21.0,OO,N471CA,5257,LNK,ORD,79.0,95.0,466,0,0
2,2021-01-04,904.0,905,-1.0,1141.0,1133,8.0,OO,N149SY,5258,ORD,OKC,107.0,157.0,693,0,0
3,2021-01-04,1820.0,1825,-5.0,1931.0,1942,-11.0,OO,N205SY,5259,SFO,ACV,47.0,71.0,250,0,0
4,2021-01-04,658.0,700,-2.0,827.0,834,-7.0,OO,N200SY,5260,ATL,IAH,112.0,149.0,689,0,0


Now back to our task: get a list of the carriers (or airlines) in the flights table.

In [17]:
# Carriers in flights table
sql = 'select distinct airline from hh_analytics_22_1.flights'
airlines = get_dataframe(sql)

In [21]:
airlines.airline

0     AA
1     F9
2     B6
3     G4
4     DL
5     UA
6     WN
7     OH
8     MQ
9     NK
10    HA
11    9E
12    QX
13    OO
14    YV
15    AS
16    YX
Name: airline, dtype: object

Great, everything seems to be working as expected. Next, let's add the carriers data to our database. For this, we are going to use the <ins>to_sql()</ins> function from pandas. While the to_sql() function has multiple arguments, the important ones are:

- name: name of SQL table
- con: sqlalchemy.engine or sqlite3.Connection

Feel free to check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) of the to_sql() function.  

We're going to set the parameter name to the name of the table we want to create and for the con argument we can simply use the engine from sqlalchemy. 
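
As a minimal sketch of how `name` and `con` fit together, the example below writes a tiny dataframe with `to_sql()` and reads the row count back. It uses an in-memory SQLite connection from the standard library as a stand-in for the PostgreSQL engine, and the table name `carriers_demo` is hypothetical; both are assumptions so that the example runs without a database server.

```python
import sqlite3

import pandas as pd

# Small sample dataframe (rows copied from the carriers output above)
demo = pd.DataFrame({
    'carrier': ['02Q', '04Q', '05Q'],
    'name': ['Titan Airways', 'Tradewind Aviation', 'Comlux Aviation, AG'],
})

# In-memory SQLite connection (assumption: stands in for the PostgreSQL engine)
con = sqlite3.connect(':memory:')

demo.to_sql(name='carriers_demo',  # hypothetical table name
            con=con,               # engine or connection to write to
            if_exists='replace',   # drop the table first if it already exists
            index=False)           # do not write the dataframe index as a column

# Read the row count back to confirm that the write worked
row_count = pd.read_sql_query('select count(*) as n from carriers_demo', con)
print(row_count)

con.close()
```

With PostgreSQL the call looks the same, except that `con` is the sqlalchemy engine and the `schema` argument selects the target schema.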

To make our lives easier and to expand what our sql_functions module can do, we will build a function that returns the engine without us having to fill in the config details each time.

In [36]:
# Optional check: show the carriers whose code also appears in the flights table
carriers.loc[carriers["carrier"].isin(airlines["airline"])]

Unnamed: 0,carrier,name
125,9E,Endeavor Air Inc.
141,AA,American Airlines Inc.
263,AS,Alaska Airlines Inc.
303,B6,JetBlue Airways
483,DL,Delta Air Lines Inc.
562,F9,Frontier Airlines Inc.
606,G4,Allegiant Air
665,HA,Hawaiian Airlines Inc.
900,MQ,American Eagle Airlines Inc.
957,NK,Spirit Air Lines





We want to add a function to sql_functions that returns the engine. Please copy the code block below into the sql_functions file in this folder.

```python
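# function to create sqlalchemy engine for writing data to a database
import sqlalchemy  # needed for create_engine; skip if the module already imports it

def get_engine():
    # Read the connection details from the config via the existing helper
    sql_config = get_sql_config()
    # The URL selects the PostgreSQL dialect; its user/pass/host parts are
    # placeholders, the real values are expected to come from connect_args
    engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                        connect_args=sql_config
                        )
    return engine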
```

In [None]:
# Import get_engine from sql_functions.py


# create a variable called engine using the get_engine function


In [None]:
# SOLUTION (without using sql_functions)

# Import get_engine from sql_functions.py
from sql_function2 import get_engine


# create a variable called engine using the get_engine function
engine = get_engine()

Good job! Next, set the table_name variable. Its value will be the name of the table that will be written to the PostgreSQL database. The schema is passed separately when we write the table.

In [None]:
# IMPORTANT: Set the table_name variable to 'carriers_' + your initials/group number
# Example: 'carriers_pw' for Philipp Wendt or 'carriers_1' for group 1
table_name = '__________'

In [None]:
#SOLUTION
table_name = 'car_ae'

Good job! It's time for the final step: writing the carriers data into the database.  
Complete the code below and write the carriers data to the database using the to_sql() function.

In [None]:
# We need psycopg2 so that we can catch database errors raised during the import
import psycopg2

In [None]:
# Write records stored in a dataframe to SQL database
if engine is not None:
    try:
        carriers.to_sql(name=________, # Name of SQL table
                        con=_________, # Engine or connection
                        schema='________', # your class schema
                        if_exists='replace', # Drop the table before inserting new values 
                        index=False, # Do not write the DataFrame index as a column
                        chunksize=5000, # Specify the number of rows in each batch to be written at a time
                        method='multi') # Pass multiple values in a single INSERT clause
        print(f"The {table_name} table was imported successfully.")
    # Error handling
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        engine = None
else:
    print('No engine')

In [None]:
# SOLUTION Write records stored in a dataframe to SQL database
if engine is not None:
    try:
        carriers.to_sql(name=table_name, # Name of SQL table
                        con=engine, # Engine or connection
                        schema='hh_analytics_22_1', # your class schema
                        if_exists='replace', # Drop the table before inserting new values 
                        index=False, # Do not write the DataFrame index as a column
                        chunksize=5000, # Specify the number of rows in each batch to be written at a time
                        method='multi') # Pass multiple values in a single INSERT clause
        print(f"The {table_name} table was imported successfully.")
    # Error handling
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        engine = None
else:
    print('No engine')

Did you get a message saying that your table was imported successfully? If yes, then you should be able to query the table from the database. Let's do that to make sure we did everything right!

In [None]:
# Check that the engine still exists (it is set to None if the import failed)
print(engine)

In [None]:
pd.read_sql_query(f'select count(*) from hh_analytics_22_1.{table_name}', engine)

It worked, awesome! To summarise, we added an external data source to our PostgreSQL database which extends our existing data and allows us to run even better analyses. What's also great is that we didn't need to write a lot of complicated Python code.  

Congratulations on completing this notebook, you can be proud of yourself!  
Take a break and then move on to the next notebook where you will apply everything you've learned here and extend the existing data even further!