<img src="images/Carriers.jpg" width="800">

## How to work with this notebook
- Read the text in markdown cells 
- In codeblocks with blanks (____), replace the blank with required code 
- In codeblocks with comments and no code, write the required code underneath

## What we are going to do in this notebook
We want to get data about carriers from the internet, clean this data, combine it with data extracted from our database, then finally upload it to a table in the database. 
At the same time we will continue to work with our sql_functions module to get the hang of writing and importing functions.

## Using external data sources and PostgreSQL - Carriers
In our lecture we found out where to find data about carriers. Now we want to get this data into Python without first downloading it to our machine with a web browser, so we will read the csv directly from the web into Python.


## Importing carrier data from the internet into Python
We will use pandas to read the file we found on the internet directly. Let's find out which function can help us.
First, import pandas.

In [1]:
# Import pandas
import pandas as pd

Think back to the previous module: We used the Python library Pandas to import dataframes stored on our local drive.   
Find a function in the Pandas package that enables you to read a csv file into a dataframe.  
You can let the IDE help you by typing in pd.read and then seeing what the autocomplete suggests.

In [2]:
# type the name of the function that reads csv
pd.read_csv

<function pandas.io.parsers.readers.read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', *, sep: 'str | None | lib.NoDefault' = <no_default>, delimiter: 'str | None | lib.NoDefault' = None, header: "int | Sequence[int] | None | Literal['infer']" = 'infer', names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>, index_col: 'IndexLabel | Literal[False] | None' = None, usecols=None, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace: 'bool' = False, skiprows=None, skipfooter: 'int' = 0, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, skip_blank_lines: 'bool' = True, parse_dates: 'bool | Sequence[Hashable] | None' = None, infer_datetime_format: 'bool | lib.NoDefault' = <no_default>, keep_date_col: 'bool' = False, date_parser=<no_default>, date_format: 'str | None' = None, dayfirst: '

Hopefully you found read_csv. To import the csv file, we need to pass the location of the file and, since we want to define the column names ourselves, the `names` argument. Let's make a list containing the column names we want to use:

In [3]:
# Set column names to 'carrier' and 'name'
carrier_columns = ['carrier' , 'name']

Next, use the Pandas read csv function with the following url to import the data into Python:   
https://raw.githubusercontent.com/dannguyen/bts-transstats-t100-domestic-demo/master/data/lookup-tables/L_UNIQUE_CARRIERS.csv


In [4]:
# Import carriers data from the web using pandas
carriers = pd.read_csv('https://raw.githubusercontent.com/dannguyen/bts-transstats-t100-domestic-demo/master/data/lookup-tables/L_UNIQUE_CARRIERS.csv', # the location on the internet of the file we want to read
                     names=carrier_columns, # Sets the column names to the values in carrier_columns
                     skiprows = 1) # Skips the row with column names


In [5]:
# Print the carriers dataframe
carriers

Unnamed: 0,carrier,name
0,02Q,Titan Airways
1,04Q,Tradewind Aviation
2,05Q,"Comlux Aviation, AG"
3,06Q,Master Top Linhas Aereas Ltd.
4,07Q,Flair Airlines Ltd.
...,...,...
1560,ZW,Air Wisconsin Airlines Corp
1561,ZX,Air Georgian
1562,ZX (1),Airbc Ltd.
1563,ZY,Atlantic Gulf Airlines


Congratulations! With just a few lines of code you loaded a csv file from the web into Python and changed the column names on the fly. Good job!
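As a small aside, the combination of `names` and `skiprows=1` can also be written as `names` and `header=0`, which tells pandas that the first row is a header that should be replaced. The sketch below uses a tiny in-memory CSV as a stand-in for the online file (the header names `Code` and `Description` and the variable names are assumptions for illustration), so it runs without a network connection.

```python
import io
import pandas as pd

# Tiny stand-in for the online file; header names are assumed for illustration
sample_csv = "Code,Description\n02Q,Titan Airways\n04Q,Tradewind Aviation\n"
demo_columns = ['carrier', 'name']

# Variant used in this notebook: skip the header row, then supply our own names
with_skiprows = pd.read_csv(io.StringIO(sample_csv), names=demo_columns, skiprows=1)

# Alternative: declare row 0 as the header and let our names replace it
with_header = pd.read_csv(io.StringIO(sample_csv), names=demo_columns, header=0)

# Both variants should give the same dataframe
print(with_skiprows.equals(with_header))
```

Either variant is fine here; `skiprows=1` simply makes it explicit that the original header line is thrown away.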

## Inserting carrier data into the database in Python
Now that we have the data in Python, all we have to do is write it into our SQL database. Before we do that we should make sure that our data is clean. Let's run some basic summary statistics before we move on.

In [6]:
# Print carriers info
carriers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1565 entries, 0 to 1564
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   carrier  1564 non-null   object
 1   name     1565 non-null   object
dtypes: object(2)
memory usage: 24.6+ KB


It seems we have one NA/NULL value in the carrier column, since there are 1564 non-null values out of 1565 total values.  
We can confirm this by running the isnull() and sum() functions together.

In [7]:
# Count NULL values
carriers.isnull().sum()

carrier    1
name       0
dtype: int64

Clearly, there is one NA/NULL value in our table. Alright, let's find out which carrier is missing its code value.

In [8]:
# Find missing carrier code
carriers[carriers['carrier'].isna()]

Unnamed: 0,carrier,name
926,,North American Airlines


Found it! North American Airlines is missing its carrier code value. A quick internet search reveals that the airline ceased operations in 2014 and that in such cases carrier codes can be reassigned to other carriers. This is good to know, but for now we don't have to worry about this missing value and can leave it as is. Let's move on with our task of adding the table to our SQL database.
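One side note before moving on: a possible explanation worth checking is that the code in the source file is the literal two-letter string `NA`, which pandas treats as a missing value by default. This is an assumption about the file, not something verified in this notebook. The sketch below uses a small made-up CSV (the header names and `demo_` variables are illustrative) to show how `keep_default_na=False` keeps such a code as text.

```python
import io
import pandas as pd

# Illustrative stand-in for the online file; the second row uses the literal code "NA"
sample_csv = "Code,Description\nZW,Air Wisconsin Airlines Corp\nNA,North American Airlines\n"
demo_columns = ['carrier', 'name']

# Default behaviour: the string "NA" is one of pandas' default missing-value markers
default_read = pd.read_csv(io.StringIO(sample_csv), names=demo_columns, skiprows=1)
print(default_read['carrier'].isna().sum())

# keep_default_na=False switches off the default markers, so "NA" stays a string
literal_read = pd.read_csv(io.StringIO(sample_csv), names=demo_columns, skiprows=1,
                           keep_default_na=False)
print(literal_read['carrier'].tolist())
```

Note that with `keep_default_na=False` empty fields are also read as empty strings rather than as missing values, so this switch is a trade-off rather than a universal fix.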

To add the table to the database, we need to:

- connect to our database
- create a new table in our database
- write the data into it

*Before that:* 

If you haven't done the steps in the README yet, please copy the `sql_functions.py` and the `.env` files from the internal-data-sourcing repository into this repo. We will add a new function to our sql_functions module that we built in the previous repo.

Now, from your own sql module, import the function that retrieves data from the postgres database into a dataframe:

In [9]:
# Import get_dataframe function from the sql module
from sql_functions import get_dataframe
schema = 'hh_analytics_24_1'

Now that we have established a connection, let's check if the connection is working.  
Complete the code below and retrieve the first 7 rows from the flights table.

In [10]:
# Print the first 7 rows of the flights table using get_dataframe
get_dataframe(f'SELECT * FROM {schema}.flights LIMIT 7;')

Unnamed: 0,flight_date,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,airline,tail_number,flight_number,origin,dest,air_time,actual_elapsed_time,distance,cancelled,diverted
0,2021-01-03,727.0,730,-3.0,924.0,939,-15.0,9E,N607LR,4628,CVG,BOS,97.0,117.0,752,0,0
1,2021-01-04,737.0,730,7.0,938.0,939,-1.0,9E,N602LR,4628,CVG,BOS,103.0,121.0,752,0,0
2,2021-01-07,1710.0,1715,-5.0,1911.0,1912,-1.0,9E,N295PQ,4628,CVG,BOS,104.0,121.0,752,0,0
3,2021-01-08,1711.0,1715,-4.0,1926.0,1912,14.0,9E,N324PQ,4628,CVG,BOS,106.0,135.0,752,0,0
4,2021-01-10,1709.0,1715,-6.0,1900.0,1912,-12.0,9E,N297PQ,4628,CVG,BOS,94.0,111.0,752,0,0
5,2021-01-11,1713.0,1715,-2.0,1859.0,1912,-13.0,9E,N902XJ,4628,CVG,BOS,94.0,106.0,752,0,0
6,2021-01-14,1725.0,1715,10.0,1921.0,1912,9.0,9E,N337PQ,4628,CVG,BOS,100.0,116.0,752,0,0


So we see that our connection to the database works. Our next step is to take the carrier data we downloaded and send it to the postgres database using an engine.


If you are feeling confident you can take this EXTRA CREDIT challenge:
Get a list of carrier codes (i.e. airlines) in the flights table and compare this against the downloaded list. Do you have data for all airlines in the flights table?

In [11]:
# EXTRA CREDIT part 1 of 2
# Extract the unique airline codes from the flights table in your schema
sql = f'SELECT DISTINCT airline FROM {schema}.flights'
airlines = get_dataframe(sql)

In [12]:
airlines

Unnamed: 0,airline
0,AA
1,F9
2,B6
3,G4
4,DL
5,UA
6,WN
7,OH
8,MQ
9,NK


In [13]:
# EXTRA CREDIT part 2 of 2
# Check that all the airline codes in the airlines df match the carrier codes in the carriers df
matching_codes = airlines['airline'].isin(carriers['carrier']).all()
matching_codes

True
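If the check above had returned False, the next question would be which codes are missing. A minimal sketch of that follow-up, using two small made-up dataframes (`demo_airlines`, `demo_carriers` and the code `XX` are invented for illustration):

```python
import pandas as pd

# Made-up example frames; 'XX' is a deliberately unknown code
demo_airlines = pd.DataFrame({'airline': ['AA', 'DL', 'XX']})
demo_carriers = pd.DataFrame({'carrier': ['AA', 'DL', 'UA'],
                              'name': ['Example Carrier A', 'Example Carrier B', 'Example Carrier C']})

# Boolean mask: True where the airline code has no match in the carriers lookup
is_missing = ~demo_airlines['airline'].isin(demo_carriers['carrier'])

# List the airline codes without a match
missing_codes = demo_airlines.loc[is_missing, 'airline'].tolist()
print(missing_codes)
```

The same mask applied to the real `airlines` and `carriers` dataframes would return an empty list, since all codes matched.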

Great, everything seems to be working as expected. Next, let's add the expanded carriers data to our database. For this, we are going to use the `to_sql()` function from pandas. While the to_sql() function has multiple arguments, the important ones are:

- `name`: name of SQL table
- `con`: sqlalchemy.engine or sqlite3.Connection

Feel free to check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) of the to_sql() function.  

We're going to set the parameter name to the name of the table we want to create and for the con argument we can simply use the engine from sqlalchemy. 
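To get a feeling for `name`, `con` and the `index` argument without touching the course database, here is a minimal sketch that writes to an in-memory SQLite database through a `sqlite3.Connection`. The table names and the `demo_carriers` dataframe are made up for illustration, and SQLite is only a stand-in for PostgreSQL.

```python
import sqlite3
import pandas as pd

demo_carriers = pd.DataFrame({'carrier': ['02Q', '04Q'],
                              'name': ['Titan Airways', 'Tradewind Aviation']})

# In-memory SQLite database as a stand-in for the PostgreSQL engine
con = sqlite3.connect(':memory:')

# name is the target table, con is the connection (or engine) to write through
demo_carriers.to_sql(name='demo_with_index', con=con, index=True)
demo_carriers.to_sql(name='demo_without_index', con=con, index=False)

# Read both tables back and compare their columns
columns_with_index = pd.read_sql('SELECT * FROM demo_with_index', con).columns.tolist()
columns_without_index = pd.read_sql('SELECT * FROM demo_without_index', con).columns.tolist()
print(columns_with_index)
print(columns_without_index)

con.close()
```

With `index=True` the dataframe index is stored as an extra column, which is usually not wanted for a lookup table like carriers.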

To make our lives easier and to expand what our sql_functions module can do, we will build a function that returns the engine without us having to fill in the config details each time.

We want to add a function to sql_functions that returns the engine. So please copy the below code block into the `sql_functions.py` file in this folder. 

```python
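# function to create a sqlalchemy engine for writing data to a database
# (assumes sqlalchemy is already imported at the top of sql_functions.py)

def get_engine():
    sql_config = get_sql_config()
    # the URL looks like a placeholder; the actual connection details
    # are expected to come from sql_config via connect_args
    engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                        connect_args=sql_config
                        )
    return engine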
```

In [14]:
# Import get_engine from sql_functions.py.
# You will need to restart your kernel and rerun the notebook at this point, because the module has changed since we first imported it.
from sql_functions import get_engine
# create a variable called engine using the get_engine function
engine = get_engine()
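As an alternative to restarting the kernel, the standard library's `importlib.reload` can re-execute a module that has changed on disk. The sketch below demonstrates this with a throwaway module called `demo_functions` (a hypothetical stand-in for `sql_functions.py`, created in a temporary folder), so it can be run on its own.

```python
import importlib
import pathlib
import sys
import tempfile

# Create a throwaway module in a temporary folder (stand-in for sql_functions.py)
tmp_dir = tempfile.mkdtemp()
sys.path.insert(0, tmp_dir)
sys.dont_write_bytecode = True  # avoid stale cached bytecode in this quick demo
module_path = pathlib.Path(tmp_dir) / 'demo_functions.py'
module_path.write_text("def greet():\n    return 'old'\n")

importlib.invalidate_caches()
import demo_functions
first_result = demo_functions.greet()

# Simulate editing the file after it was imported
module_path.write_text("def greet():\n    return 'new version'\n")

# Re-execute the module so the edited code is picked up
importlib.reload(demo_functions)
second_result = demo_functions.greet()
print(first_result, second_result)
```

Applied to this notebook, that would mean `import sql_functions`, then `importlib.reload(sql_functions)`, and then running `from sql_functions import get_engine` again, since names imported with `from ... import` are not refreshed automatically. Restarting the kernel remains the simplest and most reliable option.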

Good job! Next, set the `schema` variable and the `table_name` variable. These values will be used to choose the location and the name of the table that will be written to the PostgreSQL database.

In [15]:
# Set the schema to your course name and the table_name variable to 'carriers_' + your initials/group number

schema = 'hh_analytics_24_1' # your course schema name, example: 'hh_analytics_22_1'
table_name = 'carriers_sp' # Example: 'carriers_pw' for Philipp Wendt

Good job! It's time for the final step: writing the carriers data into the database.  
Complete the code below and write carriers data to the database using the `to_sql()` function.

In [16]:
# we need psycopg2 to catch possible database errors
import psycopg2

In [17]:
# Write records stored in a dataframe to SQL database
if engine is not None:
    try:
        carriers.to_sql(name=table_name, # Name of SQL table variable
                        con=engine, # Engine or connection
                        schema=schema, # your class schema variable
                        if_exists='replace', # Drop the table before inserting new values 
                        index=False, # Do not write the DataFrame index as a column
                        chunksize=5000, # Specify the number of rows in each batch to be written at a time
                        method='multi') # Pass multiple values in a single INSERT clause
        print(f"The {table_name} table was imported successfully.")
    # Error handling
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        engine = None
else:
    print('No engine')

The carriers_sp table was imported successfully.


Did you get 'The carriers_sp table was imported successfully.' (with your own table name)? If yes, then you should be able to query the table from the database. Let's do that to make sure we did everything right!

In [18]:
# Query the newly created table and load all of its rows
query = f'SELECT * FROM {schema}.{table_name}'
carriers_sp_df = get_dataframe(query)
carriers_sp_df

Unnamed: 0,carrier,name
0,02Q,Titan Airways
1,04Q,Tradewind Aviation
2,05Q,"Comlux Aviation, AG"
3,06Q,Master Top Linhas Aereas Ltd.
4,07Q,Flair Airlines Ltd.
...,...,...
1560,ZW,Air Wisconsin Airlines Corp
1561,ZX,Air Georgian
1562,ZX (1),Airbc Ltd.
1563,ZY,Atlantic Gulf Airlines
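Looking at the output is one way to check the upload; comparing row counts and codes is a bit more systematic. The sketch below shows the idea on an in-memory SQLite database, which is only a stand-in for the course PostgreSQL database (the table name `carriers_demo` and the `demo_carriers` dataframe are made up for illustration). It also writes the table twice to show that `if_exists='replace'` does not duplicate rows when the cell is rerun.

```python
import sqlite3
import pandas as pd

demo_carriers = pd.DataFrame({'carrier': ['02Q', '04Q', '05Q'],
                              'name': ['Titan Airways', 'Tradewind Aviation', 'Comlux Aviation, AG']})

# In-memory SQLite database as a stand-in for the course PostgreSQL database
con = sqlite3.connect(':memory:')

# Write the table twice with if_exists='replace', as happens when the cell is rerun
demo_carriers.to_sql(name='carriers_demo', con=con, index=False, if_exists='replace')
demo_carriers.to_sql(name='carriers_demo', con=con, index=False, if_exists='replace')

# Count the rows in the database and compare with the dataframe
db_rows = pd.read_sql('SELECT COUNT(*) AS n FROM carriers_demo', con)['n'].iloc[0]
print(db_rows == len(demo_carriers))

# Compare the stored carrier codes with the local ones
db_codes = pd.read_sql('SELECT carrier FROM carriers_demo', con)['carrier'].tolist()
print(sorted(db_codes) == sorted(demo_carriers['carrier']))

con.close()
```

In this notebook the equivalent check would compare `len(carriers)` with the result of a `SELECT COUNT(*)` query sent through `get_dataframe`.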


In [19]:
# the data from the link is stored locally in this dataframe, so we can call it in this notebook
carriers

Unnamed: 0,carrier,name
0,02Q,Titan Airways
1,04Q,Tradewind Aviation
2,05Q,"Comlux Aviation, AG"
3,06Q,Master Top Linhas Aereas Ltd.
4,07Q,Flair Airlines Ltd.
...,...,...
1560,ZW,Air Wisconsin Airlines Corp
1561,ZX,Air Georgian
1562,ZX (1),Airbc Ltd.
1563,ZY,Atlantic Gulf Airlines


In [20]:
# but we can't call this one by name, because the table is stored in the database (in SQL, online) and not in a Python variable
carriers_sp

NameError: name 'carriers_sp' is not defined

The NameError above is expected, and querying the table through get_dataframe worked, awesome! To summarise, we added an external data source to our PostgreSQL database which extends our existing data and allows us to run even better analyses. What's also great is that we didn't need to write a lot of complicated Python code.  

Congratulations on completing this notebook, you can be proud of yourself!  
Take a break and then move on to the next notebook where you will apply everything you've learned here and extend the existing data even further!