<img src="../images/Plane.jpeg" width="800">

# Inserting external data sources into PostgreSQL - Aircrafts
In the last jupyter notebook we learned how to import a csv file directly from the web into Python and write its data into our PostgreSQL database. In this notebook we're going apply similar steps, but this time we're going to add data about aircrafts to our database. Also, instead of importing the data directly from the web, we're going to save it to our local file storage since the files are stored in a zip file. This means we will be unpacking the zip file first, then bring the data into the right shape and format through subsetting, merging and cleaning and ultimately insert it into our PostgreSQL database.

## Downloading aircraft zip file from the web
The aircraft data we're going to work with can be downloaded from the Federal Aviation Administration website. Check out this [link](https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/) to find a description of the data as well as links for downloading either the entire database or yearly data.  

We will be downloading the entire database into a folder called data. The folder does not exist yet in the repository so our first task is to create it.  
Next we're going to create the download url and set the download path to the data folder we just created.

In [2]:
# Create a new folder in your repo called 'data'. You can use '!' to run terminal commands from your notebook.

# Specifies path for saving file
path ='data/' 
# Create the data folder
!mkdir {path}

# add your path to .gitignore so that you do not upload data to github
!echo {path} >> .gitignore

mkdir: data/: File exists


In [3]:
# Create download URL
zip_file = 'ReleasableAircraft.zip'
url = f'https://registry.faa.gov/database/{zip_file}'

In order to download the file, we need to import a package called <ins>requests</ins>. This library specialises in working with data from the web.  
Complete the code below and import the requests package.

In [4]:
# Import requests package
import requests


Now it's time to download the Aircraft Registration Database and save it to our the previously set download path. Since we want to get the files from the web to our local file storage, we have to send a GET request.  
Use the <ins>get()</ins> function and pass it the url we created earlier. In a second step we're writing the files to our local file storage.
The file is ~60MB, so depending on your internet speed this might take a few seconds.

In [5]:
# Download the database
r = requests.get(url)

# Save database to local file storage
with open(path+zip_file, 'wb') as f:
    f.write(r.content)

To make sure the file was downloaded successfully, go to the data folder in this repository and check its content. The folder should have a file called ReleasableAircraft.zip.  

The next steps are to unpack the zip file and load the necessary files into Python. Conveniently there is a package to make unpacking a zip file in python easy, it's called <ins>zipfile</ins>.  

Complete the code below and import the zipfile package.

In [6]:
# Import zipfile package
import zipfile

Now it's time to unpack the zip file. First, we're going to use the <ins>ZipFile()</ins> function and pass it  the path to the file as its first argument. As the second argument, in order to read an existing file, we pass it the parameter 'r'. To extract the ZipFile object we can either use the <ins>extract()</ins> function, to extract a single file, or the <ins>extractall()</ins> function, to extract all files from the archive.  

Complete the code below and either extract all files or only the MASTER.txt, ACFTREF.txt and ardata.pdf file using the extract() or extractall() function respectively.

In [7]:
# Unzip zip file using zipfile.ZipFile()
data = zipfile.ZipFile(path+zip_file, 'r')

In [8]:
# Extract MASTER.txt, ACFTREF.txt and ardata.pdf
members = ['MASTER.txt', 'ACFTREF.txt', 'ardata.pdf']

for member in members:
    data.extract(member, path)

We can use the file explorer in vs code to take a quick peak at the files we just extracted. In the Explorer expand the 'data' folder and click on the file MASTER.TXT to get a quick feel for the data.

Good job! MASTER.txt contains comma separated values and lists all registered aircrafts, exactly what we need. 
Now that we have the necessary files it's time to load them into Python.
One thing we notice in the file is there are some missing values that are filled with multiple spaces, we will have to deal with that when we load it into Python. Which is what we will do now!
Complete the code below, import the MASTER.txt file and save it in a variable called master_df.

In [9]:
# Import the necessary package
import pandas as pd

# Read MASTER.txt file and assign to variable master, enable the parameter 'skipinitialspace'
master = pd.read_csv(path+'MASTER.txt')

# Print first 5 rows
master.head(5)


Unnamed: 0,N-NUMBER,SERIAL NUMBER,MFR MDL CODE,ENG MFR MDL,YEAR MFR,TYPE REGISTRANT,NAME,STREET,STREET2,CITY,...,OTHER NAMES(2),OTHER NAMES(3),OTHER NAMES(4),OTHER NAMES(5),EXPIRATION DATE,UNIQUE ID,KIT MFR,KIT MODEL,MODE S CODE HEX,Unnamed: 34
0,1,680-0519,2076811,52041.0,2014.0,7,TENAX AEROSPACE LLC ...,124 ONE MADISON PLZ STE 2100,,MADISON,...,...,...,...,...,20241130,1141371,,,A00001,
1,100,5334,7100510,17003.0,1940.0,1,BENE MARY D ...,PO BOX 329,,KETCHUM,...,...,...,...,...,20230430,600060,,,A004B3,
2,10001,A28,9601202,67007.0,1928.0,1,STOOS ROBERT A ...,PO BOX 1056,,LAKELAND,...,...,...,...,...,20220228,432072,,,A00726,
3,10004,T18208245,2072738,,,7,ETOS AIR LLC ...,PO BOX 288,,NEW LONDON,...,...,...,...,...,20250331,102879,,,A00729,
4,10006,BG-72,1152020,17026.0,1955.0,1,COUTCHES ROBERT HERCULES DBA ...,550 AIRWAY BLVD,,LIVERMORE,...,...,...,...,...,20240229,480110,,,A0072B,


Next, let's have a look at the column names.  
Complete the code below and print all column names with their non-null counts and data types.

In [10]:
# Print master info
master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287562 entries, 0 to 287561
Data columns (total 35 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   N-NUMBER          287562 non-null  object 
 1   SERIAL NUMBER     287562 non-null  object 
 2   MFR MDL CODE      287562 non-null  object 
 3   ENG MFR MDL       287562 non-null  object 
 4   YEAR MFR          287562 non-null  object 
 5   TYPE REGISTRANT   287562 non-null  object 
 6   NAME              287562 non-null  object 
 7   STREET            287562 non-null  object 
 8   STREET2           287562 non-null  object 
 9   CITY              287562 non-null  object 
 10  STATE             287562 non-null  object 
 11  ZIP CODE          287562 non-null  object 
 12  REGION            287562 non-null  object 
 13  COUNTY            287562 non-null  object 
 14  COUNTRY           287562 non-null  object 
 15  LAST ACTION DATE  287562 non-null  int64  
 16  CERT ISSUE DATE   28

## EDA & data preparation
In total we have 287562 rows across 35 columns, some null values (although none in the ID column) and a mix of object and integer data types (if you're numbers are slightly different, this is because the database is updated regularly with new aircraft). 
All column names are uppercase and separated by spaces. As analysts we already know that this is not a preferred format. So before we continue let's make all columns lowercase and replace the spaces with underscores.  
Complete the code below and transform all column names from uppercase to lowercase.

In [11]:
# Make column names lowercase
master.columns = [x.lower() for x in master.columns]

# Print all column names
master.columns

Index(['n-number', 'serial number', 'mfr mdl code', 'eng mfr mdl', 'year mfr',
       'type registrant', 'name', 'street', 'street2', 'city', 'state',
       'zip code', 'region', 'county', 'country', 'last action date',
       'cert issue date', 'certification', 'type aircraft', 'type engine',
       'status code', 'mode s code', 'fract owner', 'air worth date',
       'other names(1)', 'other names(2)', 'other names(3)', 'other names(4)',
       'other names(5)', 'expiration date', 'unique id', 'kit mfr',
       ' kit model', 'mode s code hex', 'unnamed: 34'],
      dtype='object')

Next, we're going to replace spaces with underscores.

In [12]:
# Replace spaces with underscores
master.columns = [x.replace(' ', '_') for x in master.columns]

# Print all column names
master.columns

Index(['n-number', 'serial_number', 'mfr_mdl_code', 'eng_mfr_mdl', 'year_mfr',
       'type_registrant', 'name', 'street', 'street2', 'city', 'state',
       'zip_code', 'region', 'county', 'country', 'last_action_date',
       'cert_issue_date', 'certification', 'type_aircraft', 'type_engine',
       'status_code', 'mode_s_code', 'fract_owner', 'air_worth_date',
       'other_names(1)', 'other_names(2)', 'other_names(3)', 'other_names(4)',
       'other_names(5)', 'expiration_date', 'unique_id', 'kit_mfr',
       '_kit_model', 'mode_s_code_hex', 'unnamed:_34'],
      dtype='object')

Perfect, now that the column names are in a consistent and clean format we can move on to working with the data. Even though the MASTER.txt file has 35 columns, we're only going to need 3: 
1. n-number: the tail number of the aircraft
2. mfr_mdl_code: the manufacturer model code
3. year_mfr: the year the aircraft was manufactured

Complete the code below, create a subset of the master data that only includes the above mentioned columns and reassign it to the master variable.

In [13]:
# Create subset with n-number, mfr_mdl_code and year_mfg columns
master = master[['n-number', 'mfr_mdl_code', 'year_mfr']]

# Print all column names
master.columns

Index(['n-number', 'mfr_mdl_code', 'year_mfr'], dtype='object')

Let's clean up the column names a bit more.
Complete the code below and rename the columns to: nnum, code and year.

In [14]:
# Rename columns to nnum, code and year
master.rename(columns={'n-number': 'nnum',
                        'mfr_mdl_code': 'code',
                        'year_mfr': 'year'}, inplace=True)


# Print all column names
master.columns

Index(['nnum', 'code', 'year'], dtype='object')

While we're at it, we should remove any leading and trailing whitespace in our nnum and code column.  
Complete the code below and remove any whitepsace from the nnum and code column.

In [15]:
# Remove whitespace from nnum column
master.nnum = master.nnum.str.strip()

# Remove whitespace from code column
master.code = master.code.str.strip()

Great job so far! Now we have a file that contains a list of all aircrafts, their manufacturer model code and the year it was manufactured. I agree, this is not a lot of information but we're also not done yet. We still haven't looked at the other file: ACFTREF.txt. This file consists of detailed information for each aircraft such as the number of engines, engine type, number of seats etc. Let's import it and have a look!  
Complete the code below and import the ACFTREF.txt file. Save it in a variable called ref and print its first 5 rows.

In [16]:
# Read ACFTREF.txt file and assign to variable ref
ref = pd.read_csv(path+'ACFTREF.txt')

# Print first 5 rows
ref.head(5)

Unnamed: 0,CODE,MFR,MODEL,TYPE-ACFT,TYPE-ENG,AC-CAT,BUILD-CERT-IND,NO-ENG,NO-SEATS,AC-WEIGHT,SPEED,TC-DATA-SHEET,TC-DATA-HOLDER,Unnamed: 13
0,0020901,AAR AIRLIFT GROUP INC,UH-60A,6,3,1,0,2,15,CLASS 3,0,,...,
1,0030109,EXLINE ACE-C,ACE-C,4,1,1,1,1,1,CLASS 1,82,,...,
2,003010D,DELEBAUGH,P,4,1,1,1,1,1,CLASS 1,82,,...,
3,003010H,DAL PORTO,BABY ACE D,4,1,1,1,1,1,CLASS 1,82,,...,
4,003010P,DUNN,BABY ACE,4,1,1,1,1,1,CLASS 1,82,,...,


Just like we did before, let's have a look at the column names, null values and data types.

In [17]:
# Print table info
ref.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89022 entries, 0 to 89021
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CODE            89022 non-null  object 
 1   MFR             89022 non-null  object 
 2   MODEL           89022 non-null  object 
 3   TYPE-ACFT       89022 non-null  object 
 4   TYPE-ENG        89022 non-null  int64  
 5   AC-CAT          89022 non-null  int64  
 6   BUILD-CERT-IND  89022 non-null  int64  
 7   NO-ENG          89022 non-null  int64  
 8   NO-SEATS        89022 non-null  int64  
 9   AC-WEIGHT       89022 non-null  object 
 10  SPEED           89022 non-null  int64  
 11  TC-DATA-SHEET   89022 non-null  object 
 12  TC-DATA-HOLDER  89022 non-null  object 
 13  Unnamed: 13     0 non-null      float64
dtypes: float64(1), int64(6), object(7)
memory usage: 9.5+ MB


This time we have 14 columns, no null values and a mix of object and integer data types. On top of that all column names are uppercase again, but this time they're seperated with a dash. So before we continue let's make all columns lowercase and replace the dashes with underscores.  
Complete the code below and transform all column names from uppercase to lowercase.

In [18]:
# Make column names lowercase
ref.columns = [x.lower() for x in ref.columns]

# Print all column names
ref.columns

Index(['code', 'mfr', 'model', 'type-acft', 'type-eng', 'ac-cat',
       'build-cert-ind', 'no-eng', 'no-seats', 'ac-weight', 'speed',
       'tc-data-sheet', 'tc-data-holder', 'unnamed: 13'],
      dtype='object')

Next, we're going to replace spaces with underscores.

In [19]:
# Replace spaces with underscores
ref.columns = [x.replace('-', '_') for x in ref.columns]

# Print all column names
ref.columns

Index(['code', 'mfr', 'model', 'type_acft', 'type_eng', 'ac_cat',
       'build_cert_ind', 'no_eng', 'no_seats', 'ac_weight', 'speed',
       'tc_data_sheet', 'tc_data_holder', 'unnamed: 13'],
      dtype='object')

Now the column names are in a format that makes it a lot easier to work with the table. Next we're going to create a subset of the data since we won't be needing all the columns present in the table.
Complete the code below and create a subset that only contains the following columns: 
1. code: aircraft manufacturer, model and series code
2. mfr: name of the aircraft manufacturer
3. model: name of the aircraft model and series
4. type_acft: the id of the aircraft type
5. type_eng: the id of the engine type
6. no_eng: number of engines on the aircraft
7. no_seats: maximum number of seats in the aircraft
8. speed: the aircraft average cruising speed

Complete the code below and create a subset of the ref data that only includes the above mentioned columns.

In [20]:
# Select columns to keep
ref = ref[['code', 'mfr', 'model', 'type_acft', 'type_eng', 'no_eng', 'no_seats', 'speed']]

# Print all column names
ref.columns

Index(['code', 'mfr', 'model', 'type_acft', 'type_eng', 'no_eng', 'no_seats',
       'speed'],
      dtype='object')

Great, now we have a subset that only contains the columns we're interested in.  
Next, we're going to combine our master and ref dataset on the code column.  
Complete the code below and inner join the master and ref dataset on the code column.  
Save the combined dataset in a new variable called 'all' and print the dataframe's info.

In [21]:
# Inner join master and ref and assign to all
all = pd.merge(master, ref, on='code')  # pd.merge performs inner join by default

# Print info
all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 287562 entries, 0 to 287561
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   nnum       287562 non-null  object
 1   code       287562 non-null  object
 2   year       287562 non-null  object
 3   mfr        287562 non-null  object
 4   model      287562 non-null  object
 5   type_acft  287562 non-null  object
 6   type_eng   287562 non-null  int64 
 7   no_eng     287562 non-null  int64 
 8   no_seats   287562 non-null  int64 
 9   speed      287562 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 24.1+ MB


It worked, awesome! Now we have a combined dataset that consists of detailed aircraft information.  Unfortunately, we're not done yet. The two columns type_acft and type_eng only contain ids and no real information about aircraft and engine type. Fortunately, the Federal Aviation Administration includes a documentation file with the database. The file is called ardata.pdf and contains column descriptions as well as values for codes and ids.  

Let's translate the type_eng column first. Looking into the documentation we find 12 different engine types with ids ranging from 0 to 11. First we should check whether we find these ids in the type_eng column.  
Complete the code below and print a list of all distinct type_eng id values.

In [22]:
# Print list of distinct type_eng ids
display(all.type_eng.nunique())
display(all.type_eng.unique())

12

array([ 5,  1,  0,  8,  2,  7,  3,  4, 10, 11,  6,  9])

Good, the ids in the type_eng column match with the ones listed in the documentation.  
Next, create a list of engine types, called engine, that includes all engine types listed in the documentation. Make sure to store them in the same order they are listed in the documentation.

In [23]:
# Create list of engine type names: engine
engine = ["None", 
          "Reciprocating", 
          "Turbo-prop", 
          "Turbo-shaft", 
          "Turbo-jet",
          "Turbo-fan", 
          "Ramjet", 
          "2 Cycle", 
          "4 Cycle", 
          "Unknown", 
          "Electric", 
          "Rotary"]

Now we're going to add a new column called engine that has the engine name based on the id in type_eng. 

In [24]:
# Add engine column and translate type_eng id
all['engine'] = [engine[i] for i in all['type_eng'].tolist()]

# Print first 5 rows
all.head(5)

Unnamed: 0,nnum,code,year,mfr,model,type_acft,type_eng,no_eng,no_seats,speed,engine
0,1,2076811,2014,CESSNA,680,5,5,2,9,0,Turbo-fan
1,1010N,2076811,2012,CESSNA,680,5,5,2,9,0,Turbo-fan
2,101EF,2076811,2014,CESSNA,680,5,5,2,9,0,Turbo-fan
3,101FC,2076811,2007,CESSNA,680,5,5,2,9,0,Turbo-fan
4,101PG,2076811,2012,CESSNA,680,5,5,2,9,0,Turbo-fan


Good job, that was certainly not an easy task. Now that we have the actual engine type names in our dataset we don't need the type_eng column anymore.  
Complete the code below and delete the type_eng column.

In [25]:
# Delete type_eng column
all.drop(columns=['type_eng'], inplace=True)

# Print all column names
all.columns

Index(['nnum', 'code', 'year', 'mfr', 'model', 'type_acft', 'no_eng',
       'no_seats', 'speed', 'engine'],
      dtype='object')

Now that we know how to translate an id into its actual value, let's do the same for the type_acft column.  
First, let's have a look at the documentation. There we find 11 engine types with ids ranging from 1 to 9 and two letters H and O. This complicates things, but before we look further into this let's check the type_afct column values for matching ids first.

In [26]:
# Print list of distinct type_acft ids
all.type_acft.unique()

array(['5', '4', '6', '1', '7', '2', '8', '9', '3', 'O', 'H'],
      dtype=object)

Again, the ids are matching, which is good news. The bad news is we have a mix of numerical ids and letters and because of the letters the column is in a string format. The approach we used with the type_eng column will not work with the current format of the column and ids. There is no reason to get demotivated though, this points us to a more efficient and secure way to fill data based on ids. First, let's create a dictionary to capture the ids and type names. Similar to before, use the documentation file to get the values and this time type them into a dictionary.

In [27]:
acft_dict= {
            '1':'Glider',
            '2':'Balloon',
            '3':'Blimp/Dirigible',
            '4':'Fixed wing single engine', 
            '5':'Fixed wing multi engine',
            '6':'Rotorcraft',
            '7':'Weight-shift-control',
            '8':'Powered Parachute',
            '9':'Gyroplane',
            'O':'Hybrid Lift',
            'H':'Other'}


The last step is to translate the ids in the type_acft column into aircraft type names.  
Complete the code below and add a new column called aircraft_type that consists of the correct aircraft type names based on the id in the type_acft column.

In [28]:
# Assign values in the aircraft_type column by mapping the keys from type_acft column to the values in the acft_dict dictionary.
all['aircraft_type'] = all['type_acft'].map(acft_dict)
all.head(5)

Unnamed: 0,nnum,code,year,mfr,model,type_acft,no_eng,no_seats,speed,engine,aircraft_type
0,1,2076811,2014,CESSNA,680,5,2,9,0,Turbo-fan,Fixed wing multi engine
1,1010N,2076811,2012,CESSNA,680,5,2,9,0,Turbo-fan,Fixed wing multi engine
2,101EF,2076811,2014,CESSNA,680,5,2,9,0,Turbo-fan,Fixed wing multi engine
3,101FC,2076811,2007,CESSNA,680,5,2,9,0,Turbo-fan,Fixed wing multi engine
4,101PG,2076811,2012,CESSNA,680,5,2,9,0,Turbo-fan,Fixed wing multi engine


In [29]:

# Print list of distinct aircraft types and aircraft ids found in our dataset and the corresponding number of records for each type.
print(all[['aircraft_type', 'type_acft']].nunique())

aircraft_type    11
type_acft        11
dtype: int64


Good job, that was a tricky one! Now that we have the actual aircraft type names in our dataset we don't need the type_acft column anymore.  
Complete the code below and delete the type_acft column.

In [30]:
# Delete type_acft column
all.drop(columns=['type_acft'], inplace=True)

# Print all column names
all.columns

Index(['nnum', 'code', 'year', 'mfr', 'model', 'no_eng', 'no_seats', 'speed',
       'engine', 'aircraft_type'],
      dtype='object')

Done! We successfully translated the ids to their actual values. Before we write the table into our database, let's make sure we have clean and descriptive column names. The first column we are going to change is the nnum column. Lets have a look at its values before we rename it.  
Complete the code below and print the first 5 rows of the nnum column.

In [31]:
# Print first 5 rows of nnum column
all.nnum.head(5)

0        1
1    1010N
2    101EF
3    101FC
4    101PG
Name: nnum, dtype: object

Something is weird. Aren't the tail numbers in our flights data starting with the letter 'N'? Let's check that just to be sure. In order to do this we need to connect to our database and query the unique tail numbers from the flights table. In order to create a connection we need to create a connection object using the connect() function in combination with the credentials for the PostgreSQL database. We stored this in the sql.py file already. All that's left to do is to enter the credentials and import he connection object conn into this notebook.
Fill in the credentials in the sql.py file, complete the code below and import the conn object from the sql.py file.

In [32]:
# Import conn and get_data from sql.py
from sql import conn, get_data

Next, query the distinct tail numbers from the flights table and store them in a variable f_planes and print the first 5 rows.

In [33]:
# Store unique tail numbers in f_planes
f_planes = get_data('SELECT DISTINCT tail_number FROM flights')

# Print first 5 rows of f_planes
f_planes.head()

Unnamed: 0,tail_number
0,
1,N8647A
2,N665NK
3,N342DN
4,N585NN


Indeed, the tail numbers all start with the letter 'N'. To make sure the nnum column really consists of tail numbers and only the letter 'N' is missing, let's calculate how many matching values we have between the two columns.

In [34]:
# Count matching values in tail_number and nnum

# all['nnum'].isin(f_planes['tail_number'].str[1:]).value_counts()
f_planes['tail_number'].str[1:].isin(all['nnum']).value_counts()

True     4530
False      81
Name: tail_number, dtype: int64

We have around 4500 matches (something similar since the live data changes), which is quite a significant number. Therefore, we can assume that in order to match the nnum and the tail_number column, all we need to do is add the letter 'N' in front of each value in the nnum column.  
Complete the code below and create a new column in all called tail_number that consists of the letter 'N' and the nnum values.

In [35]:
# Create tailnum column
all['tail_number'] = all['nnum'].apply(lambda x: ('N' + str(x)))

# Print first 5 rows
all.tail_number.head(5)

0        N1
1    N1010N
2    N101EF
3    N101FC
4    N101PG
Name: tail_number, dtype: object

There we go! Now we have a tail number column that we can join with our flights table later on. Let's change the order of the columns and get rid of the code column since we don't need it anymore.

In [36]:
# Remove code column, change column order and assign to planes
planes = all[['tail_number', 'year', 'aircraft_type', 'mfr', 'model', 'engine', 'no_eng', 'no_seats', 'speed']]

As a final data cleaning step give the mfr, no_eng and no_seats column a more descriptive name.  
Complete the code below and change the column names mfr into manufacturer, no_eng into engines and no_seats into seats.

In [37]:
# Change column names
planes.rename(columns={'mfr' : 'manufacturer', 'no_eng' : 'engines', 'no_seats' : 'seats'}, inplace=True)

# Print all column names
planes.columns


Index(['tail_number', 'year', 'aircraft_type', 'manufacturer', 'model',
       'engine', 'engines', 'seats', 'speed'],
      dtype='object')

Awesome! We finally have a clean dataset with detailed information about aircrafts. Let's check how many unique aircrafts we have in our dataset.  
Complete the code below and count the unique airplanes in the planes variable.

In [38]:
# Count unique aircrafts
planes.tail_number.nunique()

287562

Wow, almost 290k aircrafts. Previously when we counted the matches between the nnum columnd and tail_number column we only had 4506 matches, which is a lot less than we have from the official source. Since we only need a small subset, let's filter out the remaining ones.  
Complete the code below and create a dataframe called final_table that has all aircrafts from the planes dataset that have matches in the f_planes dataset.

In [39]:
# Create dataframe with only matching values called final_table
final_table = planes.merge(f_planes, how='inner', on='tail_number')

In [40]:
# Print count of planes in final_table
final_table.tail_number.nunique()

4502

Good job! Instead of a huge dataset with all aircrafts we now have a smaller subset that matches the aircrafts we have in our flights table in our PostregSQL database.

## Inserting aircrafts data into the database
The last step is to write this table into our database. We already learned how to do this using the sql.py file. The credentials should be filled out since we've already done that in the previous notebook.  
Make sure the credentials a set up correctly and import the engine from the sql.py file.

In [41]:
# Import engine from sql.py and import psycopg2
from sql import engine
import psycopg2

Next, set the table name variable. This will be name of the table that will be written to the PostgreSQL database.

In [42]:
# IMPORTANT: Set the table_name variable to 'planes_' + your initials/group number
# Example: planes_pw for Philipp Wendt / planes_1 for group1
table_name = 'flights_delete_me'

The final step is to write the dataset to the database.  
Complete the code below and write the dataset stored in planes_in_both to the PostgreSQL database.

In [43]:
# Write records stored in a dataframe to SQL database
if engine!=None:
    try:
        final_table.to_sql(table_name, # Name of SQL table
                        con=engine, # Engine or connection
                        schema='muc_analytics_21_1', # Name of class schema
                        if_exists='replace', # Drop the table before inserting new values 
                        index=False, # Write DataFrame index as a column
                        chunksize=5000, # Specify the number of rows in each batch to be written at a time
                        method='multi') # Pass multiple values in a single INSERT clause
        print(f"The {table_name} table was imported successfully.")
    # Error handling
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        engine = None

The flights_delete_me table was imported successfully.


To check if everything worked try querying the table from the database.

In [46]:
# Query the new planes table to get number of planes in the SQL table
get_data('select * from flights_delete_me').tail_number.nunique()

4502

You made it, congratulations!