<img src="images/Carriers.jpg" width="800">

## How to work with this notebook
- Read the text in markdown cells 
- In codeblocks with blanks (____), replace the blank with the required code
- In codeblocks with comments and no code, write the required code underneath

## What we are going to do in this notebook
We want to get data about carriers from the internet, clean this data, combine it with data extracted from our database, then finally upload it to a table in the database. 
At the same time we will continue to work with our sql_functions module to get the hang of writing and importing functions.

## Using external data sources and PostgreSQL - Carriers
In our lecture we found out where to find data about carriers. Now we want to get this data into Python without first downloading it to our machine with a web browser, so we will read the csv directly from the web into Python.


## Importing carrier data from the internet into Python
We will use pandas to read the file we found on the internet directly. Let's find out which function can help us.
First, import pandas.

In [None]:
## Import pandas
import pandas as pd

Think back to the previous module: We used the Python library Pandas to import dataframes stored on our local drive.   
Find a function in the Pandas package that enables you to read a csv file into a dataframe.  
You can let the IDE help you: type `pd.read` and see what the autocomplete suggests.

In [None]:
# type the name of the function that reads csv


Hopefully you found `read_csv`. To import the csv file we need to pass the location of the file and, since we want to define the column names ourselves, the `names` argument. Let's make a list containing the column names we want to use:

In [None]:
# Set column names to 'carrier' and 'name'
carrier_columns = ['carrier' , 'name']

Next, use the pandas `read_csv` function with the following url to import the data into Python:  
https://raw.githubusercontent.com/dannguyen/bts-transstats-t100-domestic-demo/master/data/lookup-tables/L_UNIQUE_CARRIERS.csv


In [None]:
# Import carriers data from the web using pandas
carriers = pd.read_csv('https://raw.githubusercontent.com/dannguyen/bts-transstats-t100-domestic-demo/master/data/lookup-tables/L_UNIQUE_CARRIERS.csv', 
                     names=carrier_columns, # Sets the column names to the values in carrier_columns
                     skiprows = 1) # Skips the row with column names

In [None]:
# Print the carriers dataframe
display(carriers)

Congratulations! With just a few lines of code you loaded a csv file from the web into Python and changed the column names on the fly. Good job!
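
To see what `names` and `skiprows` do without needing a network connection, here is a minimal sketch that reads a small made-up csv from a string. The sample rows and the variable names `csv_text`, `renamed` and `header_kept` are assumptions for illustration only; they are not part of the real carriers file.

```python
import io

import pandas as pd

# Made-up sample that mimics a lookup file with its own header row
csv_text = "Code,Description\nAA,American Airlines Inc.\nDL,Delta Air Lines Inc.\n"

# names= sets our own column names, skiprows=1 drops the original header row
renamed = pd.read_csv(io.StringIO(csv_text), names=["carrier", "name"], skiprows=1)

# Without skiprows the original header row would be read as a data row
header_kept = pd.read_csv(io.StringIO(csv_text), names=["carrier", "name"])

print(renamed)
print(len(header_kept))  # one row more than renamed
```

This is why the cell above passes both arguments: `names` alone would leave the original header sitting in the first row of the data.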

## Inserting carrier data into the database in Python
Now that we have the data in Python, all we have to do is write it into our SQL database. Before we do that we should make sure that our data is clean. Let's run some basic summary statistics before we move on.

In [None]:
# Print carriers info
carriers.info()

It seems we have one NA/NULL value in the carrier column, since there are 1564 non-null values out of 1565 total values.  
We can confirm this by chaining the `isnull()` and `sum()` methods.

In [None]:
# Count NULL values
carriers.isnull().sum()

Clearly, there is one NA/NULL value in our table. Alright, let's find out which carrier is missing its code value.

In [None]:
# Find missing carrier code
carriers[carriers.isnull().any(axis=1)]

Found it! North American Airlines is missing its carrier code value. A quick search on the internet reveals that the airline ceased operations in 2014 and that in cases like this carrier codes can be reassigned to other carriers. This is good to know, but for now we don't have to worry about this missing value and can leave it as is. Let's move on with our task of adding the table to our SQL database.
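
One thing worth checking if you are curious (this is an assumption, not something we verify in this notebook): by default pandas treats certain strings, including `NA`, as missing values when reading a csv. A carrier whose code is literally `NA` would therefore show up as NaN. The sketch below uses a made-up two-row sample to show the effect of the `keep_default_na` argument of `read_csv`.

```python
import io

import pandas as pd

# Made-up sample: the second carrier code is the literal string "NA"
csv_text = "Code,Description\nAA,American Airlines Inc.\nNA,Example Carrier\n"
columns = ["carrier", "name"]

# Default behaviour: the string "NA" is converted to a missing value
default_read = pd.read_csv(io.StringIO(csv_text), names=columns, skiprows=1)

# keep_default_na=False keeps "NA" as an ordinary string
literal_read = pd.read_csv(io.StringIO(csv_text), names=columns, skiprows=1,
                           keep_default_na=False)

print(default_read["carrier"].isnull().sum())
print(literal_read["carrier"].tolist())
```

Note that with `keep_default_na=False` truly empty fields are read as empty strings instead of NaN, so this is a trade-off and not a free fix.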

In order to do this, we need to connect to our database, create a new table in our database and write the data into it.

Before this, we will add a new function to our sql_functions module that we built in the previous repo. If you haven't done the steps in the README yet, check it out now... 

Now, from your own sql module, import your function that retrieves data from the Postgres database into a dataframe:

In [None]:
# Import get_dataframe function from the sql module
from sql_functions import get_dataframe

Now that the function is imported, let's check that the connection to the database works.  
Complete the code below and retrieve the first 5 rows from the flights table.

In [None]:
# Print top 5 rows of the flights table using get_dataframe
sql = 'select * from muc_analytics_22_2.flights limit 5'
get_dataframe(sql)

So we see that our function works and we have a connection. Our next step is to take the carrier data we downloaded and send it to the Postgres database using an engine.


If you are feeling confident you can take this EXTRA CREDIT challenge:
Get a list of carrier codes (i.e. airlines) in the flights table and compare this against the downloaded list. Do you have data for all airlines in the flights table?

In [None]:
# EXTRA CREDIT part 1 of 2
# Extract the unique airline codes from the flights table in your schema
sql = 'select distinct airline from muc_analytics_22_2.flights'
airlines = get_dataframe(sql)
airlines

In [None]:
# EXTRA CREDIT part 2 of 2
# Check that all the airline codes in the airlines df match the carrier codes in the carriers df
# .all() returns True only if every airline code is found in the carriers df
airlines["airline"].isin(carriers["carrier"]).all()
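
If the check shows that some codes do not match, you will probably want to know which ones. A minimal sketch, using two small made-up dataframes (`airlines_demo` and `carriers_demo` are hypothetical stand-ins for the real `airlines` and `carriers` dataframes):

```python
import pandas as pd

# Made-up stand-ins for the airlines and carriers dataframes
airlines_demo = pd.DataFrame({"airline": ["AA", "DL", "ZZ"]})
carriers_demo = pd.DataFrame({
    "carrier": ["AA", "DL", "UA"],
    "name": ["American Airlines Inc.", "Delta Air Lines Inc.", "United Air Lines Inc."],
})

# Boolean series: True where the airline code exists in the carriers lookup
is_known = airlines_demo["airline"].isin(carriers_demo["carrier"])

# Negate the mask with ~ to keep only the codes without a match
missing_codes = airlines_demo.loc[~is_known, "airline"].tolist()
all_known = bool(is_known.all())

print(missing_codes)
print(all_known)
```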

Great, everything seems to be working as expected. Next, let's add the expanded carriers data to our database. For this, we are going to use the `to_sql()` function from pandas. While the `to_sql()` function has multiple arguments, the important ones are:

- `name`: name of SQL table
- `con`: sqlalchemy.engine or sqlite3.Connection

Feel free to check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) of the to_sql() function.  

We're going to set the parameter name to the name of the table we want to create and for the con argument we can simply use the engine from sqlalchemy. 
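
Since `con` also accepts a `sqlite3.Connection`, you can try out `to_sql()` on a throwaway in-memory SQLite database before touching the course database. This is only a sketch: the dataframe `demo` and the table name `carriers_demo` are made up, and SQLite stands in for PostgreSQL here.

```python
import sqlite3

import pandas as pd

# Made-up dataframe with the same columns as our carriers data
demo = pd.DataFrame({
    "carrier": ["AA", "DL"],
    "name": ["American Airlines Inc.", "Delta Air Lines Inc."],
})

# An in-memory SQLite database that disappears when the connection closes
conn = sqlite3.connect(":memory:")

# name is the table to create, con is the connection to write through
demo.to_sql(name="carriers_demo", con=conn, index=False)

# Read the table back to confirm that the rows arrived
round_trip = pd.read_sql_query("select * from carriers_demo", conn)
conn.close()

print(round_trip)
```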

To make our lives easier and to expand what our sql_functions module can do, we will build a function that gives us the engine back without us having to fill in the config details each time.



We want to add a function to sql_functions that returns the engine. So please copy the code block below into the `sql_functions.py` file in this folder.

```python
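# function to create sqlalchemy engine for writing data to a database
import sqlalchemy  # only needed if sql_functions.py does not import it already

def get_engine():
    sql_config = get_sql_config()
    # The URL is only a placeholder; the values in sql_config, passed via
    # connect_args, are expected to override it when the connection is opened
    engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                                      connect_args=sql_config)
    return engine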
```

In [None]:
# Import get_engine from sql_functions.py. Restart your kernel and rerun the notebook first, because the module changed after we first imported it.
from sql_functions import get_engine

# create a variable called engine using the get_engine function
engine = get_engine()
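
If you would rather not restart the kernel, the standard library offers `importlib.reload`, which re-executes a module that has changed on disk. The sketch below demonstrates this on a throwaway module written to a temporary folder; the module name `demo_sql_functions` and the function `get_label` are made up for this illustration. Keep in mind that names pulled in with `from module import name` still point at the old version after a reload and have to be imported again.

```python
import importlib
import pathlib
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo free of cached bytecode files

with tempfile.TemporaryDirectory() as tmp_dir:
    module_file = pathlib.Path(tmp_dir) / "demo_sql_functions.py"

    # First version of the throwaway module
    module_file.write_text("def get_label():\n    return 'first version'\n")
    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()

    import demo_sql_functions
    before = demo_sql_functions.get_label()

    # Edit the module on disk, then reload it instead of restarting
    module_file.write_text("def get_label():\n    return 'second version, edited'\n")
    importlib.reload(demo_sql_functions)
    after = demo_sql_functions.get_label()

    sys.path.remove(tmp_dir)

print(before)
print(after)
```

Restarting the kernel remains the safest option, since it guarantees that no stale objects are left over.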

Good job! Next, set the `schema` variable and the `table_name` variable. These values will be used to choose the location and the name of the table that will be written to the PostgreSQL database.

In [None]:
# Set the schema to your course name and the table_name variable to 'carriers_' + your initials/group number

schema = 'muc_analytics_22_2' # your course schema name, example: 'hh_analytics_22_1'
table_name = 'carriers_solution' # Example: 'carriers_pw' for Philipp Wendt

Good job! It's time for the final step: writing the carriers data into the database.  
Complete the code below and write carriers data to the database using the `to_sql()` function.

In [None]:
# We need psycopg2 so that we can catch database errors below
import psycopg2

In [None]:
# Write records stored in a dataframe to SQL database
if engine is not None:
    try:
        carriers.to_sql(name=table_name, # Name of SQL table
                        con=engine, # Engine or connection
                        schema=schema, # your class schema
                        if_exists='replace', # Drop the table before inserting new values 
                        index=False, # Do not write the DataFrame index as a column
                        chunksize=5000, # Specify the number of rows in each batch to be written at a time
                        method='multi') # Pass multiple values in a single INSERT clause
        print(f"The {table_name} table was imported successfully.")
    # Error handling
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        engine = None
else:
    print('No engine')
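
The `if_exists` argument decides what happens when the table is already there. Here is a small sketch of the difference between `'replace'` and `'append'`, again on an in-memory SQLite database with made-up data. The helper `count_rows` and the table name `carriers_demo` are hypothetical and only serve this example.

```python
import sqlite3

import pandas as pd

demo = pd.DataFrame({
    "carrier": ["AA", "DL"],
    "name": ["American Airlines Inc.", "Delta Air Lines Inc."],
})


def count_rows(conn, table):
    """Return the number of rows in the given table."""
    result = pd.read_sql_query(f"select count(*) as n from {table}", conn)
    return int(result["n"].iloc[0])


conn = sqlite3.connect(":memory:")

# Writing twice with 'replace' drops and recreates the table each time
demo.to_sql(name="carriers_demo", con=conn, if_exists="replace", index=False)
demo.to_sql(name="carriers_demo", con=conn, if_exists="replace", index=False)
rows_after_replace = count_rows(conn, "carriers_demo")

# 'append' adds the rows to whatever is already in the table
demo.to_sql(name="carriers_demo", con=conn, if_exists="append", index=False)
rows_after_append = count_rows(conn, "carriers_demo")

conn.close()
print(rows_after_replace, rows_after_append)
```

With `'replace'` you can rerun the cell above as often as you like without duplicating rows, which is why we use it in this notebook.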

Did you get a message like 'The carriers_solution table was imported successfully.' (with your own table name)? If yes, then you should be able to query the table from the database. Let's do that to make sure we did everything right!

In [None]:
# Query the newly created table to count the rows
pd.read_sql_query(f'select count(*) from {schema}.{table_name}', engine)

It worked, awesome! To summarise, we added an external data source to our PostgreSQL database which extends our existing data and allows us to run even better analyses. What's also great is that we didn't need to write a lot of complicated Python code.  

Congratulations on completing this notebook, you can be proud of yourself!  
Take a break and then move on to the next notebook where you will apply everything you've learned here and extend the existing data even further!