<img src="images/Carriers.jpg" width="800">

# Inserting external data sources into PostgreSQL - Carriers
Now that we know where to find data about carriers, we can choose between two ways of getting this data into Python.

1. Download the csv file directly from the website to your local drive and import it into Python
2. Read the csv directly from the web into Python

Since we are hackers we're obviously going with the second option!

## Importing carrier data from the web into Python
First, set the carrier_columns variable so that it contains the two strings 'carrier' and 'name'.

In [1]:
# Set column names to 'carrier' and 'name'
carrier_columns = ['carrier' , 'name']

Now think back to last week: We used the Python library Pandas to import dataframes stored on our local drive.   
Find a function in the Pandas package that lets you read a csv file from the web into Python.  
Once you have found the right function import the Pandas library.

In [2]:
# Import pandas
import pandas as pd

Next, use the Pandas function with the [link](https://raw.githubusercontent.com/dannguyen/bts-transstats-t100-domestic-demo/master/data/lookup-tables/L_UNIQUE_CARRIERS.csv) to the csv file to import the data into Python.

In [4]:
# Import carriers data from the web
carriers = pd.read_csv('https://raw.githubusercontent.com/dannguyen/bts-transstats-t100-domestic-demo/master/data/lookup-tables/L_UNIQUE_CARRIERS.csv', 
                     names=carrier_columns, # Sets the column names to the values in carrier_columns
                     skiprows = 1) # Skips the row with column names

# Print the carriers dataframe
carriers

Unnamed: 0,carrier,name
0,02Q,Titan Airways
1,04Q,Tradewind Aviation
2,05Q,"Comlux Aviation, AG"
3,06Q,Master Top Linhas Aereas Ltd.
4,07Q,Flair Airlines Ltd.
...,...,...
1560,ZW,Air Wisconsin Airlines Corp
1561,ZX,Air Georgian
1562,ZX (1),Airbc Ltd.
1563,ZY,Atlantic Gulf Airlines


Congratulations! With just a few lines of code you loaded a csv file from the web into Python and changed the column names on the fly. Good job!
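To see what the names and skiprows arguments do without touching the network, here is a minimal sketch that reads a small in-memory CSV. The sample rows and the header names Code and Description are made up for illustration and only mimic the layout of the lookup file.

```python
import io
import pandas as pd

# Made-up sample that mimics the lookup file: one header row, two columns
sample_csv = io.StringIO(
    'Code,Description\n'
    '02Q,Titan Airways\n'
    '05Q,"Comlux Aviation, AG"\n'
)

sample = pd.read_csv(
    sample_csv,
    names=['carrier', 'name'],  # use our own column names
    skiprows=1,                 # skip the original header row
)

print(sample.columns.tolist())
print(len(sample))
```

Passing header=0 together with names should have the same effect as skiprows=1 here, since it tells pandas to replace the existing header row.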

## Inserting carrier data into the database in Python
Now that we have the data in Python, all we have to do is write it into our SQL database. Before we do that we should make sure that our data is clean. Let's run some basic summary statistics before we move on.

In [5]:
# Print carriers info
carriers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1565 entries, 0 to 1564
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   carrier  1564 non-null   object
 1   name     1565 non-null   object
dtypes: object(2)
memory usage: 24.6+ KB


It seems we have one NA/NULL value in the carrier column, since there are only 1564 non-null values among 1565 entries.  
We can confirm this by chaining the isnull() and sum() methods.

In [6]:
# Count NULL values
carriers.isnull().sum()

carrier    1
name       0
dtype: int64

Clearly, there is one NA/NULL value in our table. Alright, let's find out which carrier is missing its code value.

In [8]:
# Find missing carrier code
carriers[carriers.isnull().any(axis=1)]

Unnamed: 0,carrier,name
926,,North American Airlines


Found it! North American Airlines is missing its carrier code. A quick search reveals that the airline ceased operations in 2014 and that, in such cases, carrier codes can be reassigned to other carriers. There is also a more technical explanation worth keeping in mind: by default, pandas' read_csv() treats the string 'NA' as a missing value, so a carrier whose code is literally NA would show up as NULL here. This is good to know, but for now we don't have to worry about this missing value and can leave it as is. Let's move on and add the table to our SQL database.
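One possible explanation for a single missing code is pandas' default handling of the string 'NA'. The sketch below uses a made-up three-row sample (not the real file) and assumes the carrier's code is the literal text NA; it compares the default parsing with keep_default_na=False.

```python
import io
import pandas as pd

# Made-up sample; the second row assumes a carrier code that is literally "NA"
raw = (
    'Code,Description\n'
    'MQ,Envoy Air\n'
    'NA,North American Airlines\n'
    'ZW,Air Wisconsin Airlines Corp\n'
)

# Default behaviour: the string "NA" is read as a missing value
default_parse = pd.read_csv(io.StringIO(raw), names=['carrier', 'name'], skiprows=1)

# Only treat empty fields as missing, so "NA" stays a regular string
strict_parse = pd.read_csv(
    io.StringIO(raw),
    names=['carrier', 'name'],
    skiprows=1,
    keep_default_na=False,
    na_values=[''],
)

print(default_parse['carrier'].isnull().sum())
print(strict_parse['carrier'].isnull().sum())
```

Whether this applies to the real lookup file is something to verify against the raw csv before changing the import.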

In order to do this, we need to connect to our database, export the data from Python, create a new table in our database and write the data into it.
In the 'Internal Data Sourcing' module we already learned how to create a connection to our database. Just like last time, we have prepared a sql.py file where you will find a connection object called conn.  
Fill in the credentials and import the conn object into this Jupyter notebook below.

In [1]:
# Import conn from sql.py
from sql import conn

Now that we have the connection imported, let's check if the connection is working.  
Complete the code below and retrieve the first 5 rows from the flights table.

In [5]:
# Print top 5 rows of the flights table
cur = conn.cursor()
cur.execute('SELECT * FROM flights LIMIT 5')
cur.fetchall()

[(datetime.datetime(2021, 1, 23, 0, 0),
  2238.0,
  2159,
  39.0,
  2229.0,
  2200,
  29.0,
  '9E',
  'N330PQ',
  5044,
  'ATL',
  'HSV',
  33.0,
  151,
  0,
  0),
 (datetime.datetime(2021, 1, 24, 0, 0),
  2154.0,
  2159,
  -5.0,
  2146.0,
  2200,
  -14.0,
  '9E',
  'N308PQ',
  5044,
  'ATL',
  'HSV',
  37.0,
  151,
  0,
  0),
 (datetime.datetime(2021, 1, 25, 0, 0),
  2213.0,
  2159,
  14.0,
  2208.0,
  2200,
  8.0,
  '9E',
  'N917XJ',
  5044,
  'ATL',
  'HSV',
  36.0,
  151,
  0,
  0),
 (datetime.datetime(2021, 1, 26, 0, 0),
  2149.0,
  2159,
  -10.0,
  2141.0,
  2200,
  -19.0,
  '9E',
  'N678CA',
  5044,
  'ATL',
  'HSV',
  31.0,
  151,
  0,
  0),
 (datetime.datetime(2021, 1, 27, 0, 0),
  2155.0,
  2159,
  -4.0,
  2148.0,
  2200,
  -12.0,
  '9E',
  'N310PQ',
  5044,
  'ATL',
  'HSV',
  35.0,
  151,
  0,
  0)]

Next, calculate the total number of rows in the flights table.

In [7]:
# Total number of rows in flights table
cur.execute('SELECT COUNT(*) FROM flights')
cur.fetchall()

[(361428,)]
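
The cursor pattern used above (execute, then fetch) is the same for any DB-API driver. As a minimal sketch that runs without a PostgreSQL server, the example below uses Python's built-in sqlite3 module as a stand-in; the miniature flights table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the course's PostgreSQL server
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()

# Hypothetical miniature flights table with made-up rows
demo_cur.execute('CREATE TABLE flights (airline TEXT, origin TEXT, dest TEXT)')
demo_cur.executemany(
    'INSERT INTO flights VALUES (?, ?, ?)',
    [('9E', 'ATL', 'HSV'), ('9E', 'ATL', 'HSV'), ('ZW', 'ORD', 'MKE')],
)
demo_conn.commit()

# Same two checks as above: peek at a few rows, then count all rows
demo_cur.execute('SELECT * FROM flights LIMIT 2')
first_rows = demo_cur.fetchall()

demo_cur.execute('SELECT COUNT(*) FROM flights')
row_count = demo_cur.fetchone()[0]  # fetchone() returns a single tuple

print(first_rows)
print(row_count)
demo_conn.close()
```

Note that sqlite3 uses ? as its parameter placeholder, while psycopg2 uses %s.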

Great, everything seems to be working as expected. Next, let's add the carriers data to our database. For this, we are going to use the <ins>to_sql()</ins> function from pandas. While the to_sql() function has multiple arguments, the important ones are

- name: name of SQL table
- con: sqlalchemy.engine or sqlite3.Connection

Feel free to check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) of the to_sql() function.  

We're going to set the parameter name to the name of the table we want to create and for the con argument we can simply use the connection object conn we imported from the sql.py file, right? Well, no.  
The conn object is a plain psycopg2 connection, which is fine for querying, but to_sql() expects a sqlalchemy engine (or connection) when writing to PostgreSQL. In order to write our carriers data to our PostgreSQL database, we therefore need to create and use a sqlalchemy engine as the connection. Fortunately, creating it is easy and only requires two lines of code. First we need to import the sqlalchemy package and then create the engine with the <ins>create_engine()</ins> function. We have added the function to the sql.py file so open the file and check it out.  
Add the credentials to the host, port, database etc. variables in sql.py, then complete the code below and import the engine from sql.py into this Jupyter notebook.

In [10]:
# Import engine from sql.py
from sql import engine

# Print engine
engine

Engine(postgres+psycopg2://sinapietrowski:***@data-analytics-course.c8g8r1deus2v.eu-central-1.rds.amazonaws.com:5432/postgres)
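
The contents of sql.py are not shown in this notebook, so the following is only a sketch of what such an engine definition might look like. All credentials are placeholders, and the helper name make_engine is hypothetical. It uses the dialect name postgresql, because newer SQLAlchemy versions (1.4 and later) no longer accept the older postgres spelling seen in the output above.

```python
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

# Placeholder credentials: replace with the values for your own database
url = URL.create(
    drivername='postgresql+psycopg2',
    username='my_user',
    password='my_password',
    host='localhost',
    port=5432,
    database='postgres',
)


def make_engine(db_url):
    """Create the engine; this requires the psycopg2 driver to be installed."""
    return create_engine(db_url)


# Show the connection URL with the password masked
print(url.render_as_string(hide_password=True))
```

Calling make_engine(url) would return an engine that can be passed as the con argument of to_sql().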

Good job! Next, set the table_name variable. Its value will be the name of the table that will be written to the PostgreSQL database.

In [11]:
# IMPORTANT: Set the table_name variable to 'carriers_' + your initials/group number
# Example: carriers_pw for Philipp Wendt / carriers_1 for group 1
table_name = 'carriers_sp'

Good job! It's time for the final step: writing the carriers data into the database.  
Complete the code below and write the carriers data to the database using the to_sql() function.

In [13]:
import psycopg2

In [15]:
# Write records stored in a dataframe to SQL database
if engine is not None:
    try:
        carriers.to_sql(name=table_name, # Name of SQL table
                        con=engine, # Engine or connection
                        if_exists='replace', # Drop the table before inserting new values
                        index=False, # Do not write the DataFrame index as a column
                        chunksize=5000, # Specify the number of rows in each batch to be written at a time
                        method='multi') # Pass multiple values in a single INSERT clause
        print(f"The {table_name} table was imported successfully.")
    # Error handling
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        engine = None
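
If you want to try the write-then-read round trip without a PostgreSQL server, here is a minimal sketch that uses an in-memory SQLite database instead, since to_sql() also accepts a sqlite3.Connection. The table name carriers_demo and the three rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# Made-up rows standing in for the carriers dataframe
demo_carriers = pd.DataFrame({
    'carrier': ['02Q', '04Q', 'ZW'],
    'name': ['Titan Airways', 'Tradewind Aviation', 'Air Wisconsin Airlines Corp'],
})

# In-memory SQLite database as a stand-in for PostgreSQL
demo_conn = sqlite3.connect(':memory:')

demo_carriers.to_sql(
    name='carriers_demo',   # hypothetical table name
    con=demo_conn,
    if_exists='replace',    # drop the table first if it already exists
    index=False,            # do not write the DataFrame index as a column
)

# Read the table back to confirm the write
read_back = pd.read_sql_query('SELECT * FROM carriers_demo', demo_conn)
print(read_back)
demo_conn.close()
```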

Did you get 'The carriers_sp table was imported successfully.' (with your own table name in place of carriers_sp)? If yes, then you should be able to query the table from the database. Let's do that to make sure we did everything right!

In [17]:
# Query the newly created table
cur.execute('SELECT * FROM carriers_sp LIMIT 5')

UndefinedTable: relation "carriers_sp" does not exist
LINE 1: SELECT * FROM carriers_sp LIMIT 5
                      ^


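If the query runs without an error, it worked, awesome! If you instead get an UndefinedTable error like the one shown above, the table could not be found: check that the previous cell printed the success message, that engine is not None, and that the table name in the query matches your table_name. After a failed query you may also need to call conn.rollback() before the cursor accepts new statements. To summarise, we added an external data source to our PostgreSQL database which extends our existing data and allows us to run even better analyses. What's also great is that we didn't need to write a lot of complicated Python code.  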

Congratulations on completing this notebook, you can be proud of yourself!  
Take a break and then move on to the next notebook where you will apply everything you've learned here and extend the existing data even further!