<img src="images/Plane.jpeg" width="800">

# Inserting external data sources into PostgreSQL - Aircraft
In the last Jupyter notebook we learned how to import a csv file directly from the web into Python and write its data into our PostgreSQL database. In this notebook we're going to apply similar steps, but this time we're going to add data about aircraft to our database. Instead of reading the data directly from the web, we're going to save it to our local file storage first, since the files are packed in a zip file. This means we will unpack the zip file, bring the data into the right shape and format through subsetting, merging and cleaning, and finally insert it into our PostgreSQL database.

## Downloading aircraft zip file from the web
The aircraft data we're going to work with can be downloaded from the Federal Aviation Administration website. Check out this [link](https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/) to find a description of the data as well as links for downloading either the entire database or yearly data.  

We will be downloading the entire database into a folder called data. The folder does not exist yet in the repository so our first task is to create it.  
Next we're going to create the download url and set the download path to the data folder we just created.

In [1]:
# Create a new folder in your repo called 'data'. You can use '!' to run terminal commands from your notebook.

# Specifies path for saving file
path ='data/' 
# Create the data folder
#!mkdir {path}

# add your path to .gitignore so that you do not upload data to github
#!echo {path} >> .gitignore
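
If the `!mkdir` shell escape is not available (for example when the code runs as a plain script, or on a system without that command), the folder can also be created from Python itself. The following is a minimal sketch using only the standard library; `ensure_folder` is a hypothetical helper name that does not exist in this repository.

```python
from pathlib import Path
import tempfile


def ensure_folder(folder):
    """Create folder (and missing parents); do nothing if it already exists."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Demo inside a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_folder(Path(tmp) / "data")
    folder_exists = created.is_dir()
    # Calling it a second time is safe because of exist_ok=True
    ensure_folder(created)
```

In the notebook the call would simply be `ensure_folder(path)`.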

In [2]:
# Create download URL
zip_file = 'ReleasableAircraft.zip'
url = f'https://registry.faa.gov/database/{zip_file}'

In order to download the file, we need to import a package called <ins>requests</ins>. This library specialises in working with data from the web.  
Complete the code below and import the requests package.

In [3]:
# Import requests package
#!pip install requests

In [4]:
import requests

Now it's time to download the Aircraft Registration Database and save it to the download path we set previously. Since we want to get the files from the web to our local file storage, we have to send a GET request.  
Use the <ins>get()</ins> function of the requests package and pass it the url we created earlier. In a second step we write the content of the response to our local file storage.
The file is ~60MB, so depending on your internet speed this might take a few seconds.

Sometimes, we need to use a User-Agent with our GET request. A User-Agent header is used to identify the software or client making an HTTP request.

For the purpose of this exercise, we will all use the same User-Agent.

In [5]:
# Try out this User-Agent. If it doesn't work, see instructions below.
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'}

In [6]:
# Download the database
#r = requests.get(url, headers=headers)

# Save database to local file storage
#with open(path+zip_file, 'wb') as f:
#    f.write(r.content)
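
For a file of this size it can help to check the HTTP status and to write the response in chunks instead of holding it all in memory. The sketch below is one possible way to do that with `requests` and is not the notebook's official solution. `download_file` and its `get` parameter are hypothetical names introduced here (the `get` parameter only exists so the function can be tried without network access), and the chunk size and timeout are arbitrary assumptions.

```python
from pathlib import Path

import requests


def download_file(url, dest, headers=None, timeout=60, get=requests.get):
    """Stream url into dest and raise an error on HTTP failures (e.g. 403)."""
    response = get(url, headers=headers, timeout=timeout, stream=True)
    # Raises requests.HTTPError for 4xx/5xx responses instead of saving an error page
    response.raise_for_status()
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(dest, "wb") as f:
        # Write the body piece by piece (1 MB chunks, an arbitrary choice)
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)
    return dest
```

With the variables defined above, the call would look like `download_file(url, path + zip_file, headers=headers)`.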

**Note:** If requests.get() with the User-Agent doesn’t work for you, don’t worry! Just download the data manually here: [Download the Aircraft Registration Database (60MB)](https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download). <br>
Put the zip file as it is into the data folder we created above and go on with the notebook.

To make sure the file was downloaded successfully, go to the data folder in this repository and check its content. The folder should have a file called ReleasableAircraft.zip.  
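
Besides looking into the folder, the download can also be checked from Python: if the server answered with an error page instead of the archive, the saved file will not be a valid zip. This is a small sketch using the standard library; `list_zip_members` is a hypothetical helper, and the in-memory archive only stands in for `data/ReleasableAircraft.zip`.

```python
import io
import zipfile


def list_zip_members(zip_source):
    """Return the member names of a zip archive, or raise ValueError if it is not one."""
    if not zipfile.is_zipfile(zip_source):
        raise ValueError("not a valid zip archive (the download may have failed)")
    with zipfile.ZipFile(zip_source, "r") as archive:
        return archive.namelist()


# Demo with a tiny in-memory archive instead of the real download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("MASTER.txt", "N-NUMBER,SERIAL NUMBER\n")
    archive.writestr("ACFTREF.txt", "CODE,MFR\n")
buffer.seek(0)
members = list_zip_members(buffer)
```

In the notebook the call would be `list_zip_members(path + zip_file)`.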

The next steps are to unpack the zip file and load the necessary files into Python. Conveniently, there is a package called <ins>zipfile</ins> that makes unpacking a zip file in Python easy.  

Complete the code below and import the zipfile package.

In [7]:
# Import zipfile package
import zipfile

Now it's time to unpack the zip file. First, we're going to use the <ins>ZipFile()</ins> function and pass it the path to the file as its first argument. As the second argument we pass the mode 'r', which opens an existing file for reading.

To get files out of the ZipFile object we can either use the <ins>extract()</ins> function, to extract a single file, or the <ins>extractall()</ins> function, to extract all files from the archive.  

Complete the code below and either extract all files using the extractall() function, or extract only the MASTER.txt, ACFTREF.txt and ardata.pdf files using the extract() function.

In [8]:
# Unzip zip file using zipfile.ZipFile()
#zipfile.ZipFile('data/ReleasableAircraft.zip', 'r')
#zipfile.ZipFile(path+zip_file, 'r')

In [9]:
# extract all files or only the MASTER.txt, ACFTREF.txt and ardata.pdf file
#zipfile.ZipFile(path+zip_file, 'r').extractall(path)
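
As an alternative to `extractall()`, the sketch below extracts only selected files and uses a `with` block so the archive is closed again afterwards. `extract_selected` is a hypothetical helper name, and the demo archive with its `OTHER.txt` member is made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_selected(zip_path, wanted, target_dir):
    """Extract only the members listed in wanted; return the names that were found."""
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as archive:
        available = set(archive.namelist())
        for name in wanted:
            if name in available:
                archive.extract(name, path=target_dir)
                extracted.append(name)
    return extracted


# Demo with a small throwaway archive
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        for name in ("MASTER.txt", "ACFTREF.txt", "OTHER.txt"):
            archive.writestr(name, "placeholder\n")
    wanted_files = ["MASTER.txt", "ACFTREF.txt", "ardata.pdf"]
    found = extract_selected(demo_zip, wanted_files, Path(tmp) / "out")
    extracted_files = sorted(p.name for p in (Path(tmp) / "out").iterdir())
```

Names that are not in the archive (here `ardata.pdf`) are skipped silently, so comparing the returned list with the requested one shows whether anything is missing.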

We can use the file explorer in VS Code to take a quick peek at the files we just extracted. In the Explorer, expand the 'data' folder and click on the file MASTER.txt to get a quick feel for the data.

Good job! MASTER.txt contains comma-separated values and lists all registered aircraft, which is exactly what we need. 
Now that we have the necessary files it's time to load them into Python.
One thing we notice in the file is that missing values are filled with multiple spaces. We will have to deal with that when we load the file into Python, which is what we will do now!
Complete the code below, import the MASTER.txt file and save it in a variable called master_df.

**Hint:** Check the [pandas documentation on read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to see if there is a parameter that handles how spaces are treated, especially after the delimiter.

In [10]:
# Import the necessary package
import pandas as pd

# Read MASTER.txt file and assign to variable master_df
# (if pandas shows a DtypeWarning about mixed types, low_memory=False can be passed as well)
master_df = pd.read_csv('data/MASTER.txt', sep=',', skipinitialspace=True)

# Print first 7 rows
master_df.head(7)



Unnamed: 0,N-NUMBER,SERIAL NUMBER,MFR MDL CODE,ENG MFR MDL,YEAR MFR,TYPE REGISTRANT,NAME,STREET,STREET2,CITY,...,OTHER NAMES(2),OTHER NAMES(3),OTHER NAMES(4),OTHER NAMES(5),EXPIRATION DATE,UNIQUE ID,KIT MFR,KIT MODEL,MODE S CODE HEX,Unnamed: 34
0,1,680-0519,2076811,52041.0,2014.0,7.0,TENAX AEROSPACE LLC ...,400 W PARKWAY PL STE 201,,RIDGELAND,...,,,,,20281130.0,1141371,,,A00001,
1,100,5334,7100510,17003.0,1940.0,1.0,BENE MARY D ...,PO BOX 329,,KETCHUM,...,,,,,20270430.0,600060,,,A004B3,
2,10000,10000,2130004,,,8.0,CIRRUS DESIGN CORP ...,4515 TAYLOR CIRCLE,,DULUTH,...,,,,,,1443200,,,A00725,
3,10001,A28,9601202,67007.0,1928.0,1.0,STOOS ROBERT A ...,PO BOX 1056,,LAKELAND,...,,,,,20290228.0,432072,,,A00726,
4,10004,T18208245,2072738,,,7.0,ETOS AIR LLC ...,PO BOX 288,,NEW LONDON,...,,,,,20290331.0,102879,,,A00729,
5,10006,BG-72,1152020,17026.0,1955.0,1.0,COUTCHES ROBERT HERCULES DBA ...,550 AIRWAY BLVD,,LIVERMORE,...,,,,,20280229.0,480110,,,A0072B,
6,10007,21058839,2073430,17032.0,1966.0,1.0,INSKO MATTHEW T ...,5804 HILLTOP ST,,PAPILLION,...,,,,,20290131.0,470110,,,A0072C,
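
To see what `skipinitialspace=True` does, here is a small sketch on made-up data: a field that consists only of spaces is read as a missing value, and leading spaces in front of real values are dropped. Trailing spaces are not affected by this parameter.

```python
import io

import pandas as pd

# Two made-up rows: STREET2 is blank (only spaces) in the first row
raw = "N-NUMBER,STREET2,CITY\n1,    ,RIDGELAND\n100, SUITE 5,KETCHUM\n"

default_df = pd.read_csv(io.StringIO(raw))
skipped_df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Without the parameter the blank field is kept as a string of spaces
missing_default = int(default_df["STREET2"].isna().sum())
# With the parameter the blank field becomes NaN
missing_skipped = int(skipped_df["STREET2"].isna().sum())
```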


Next, let's have a look at the column names.  
Complete the code below and print all column names with their non-null counts and data types.

In [11]:
# Print master info
master_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 295558 entries, 0 to 295557
Data columns (total 35 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   N-NUMBER          295558 non-null  object 
 1   SERIAL NUMBER     295558 non-null  object 
 2   MFR MDL CODE      295558 non-null  object 
 3   ENG MFR MDL       268928 non-null  float64
 4   YEAR MFR          241121 non-null  float64
 5   TYPE REGISTRANT   295020 non-null  float64
 6   NAME              295018 non-null  object 
 7   STREET            295013 non-null  object 
 8   STREET2           10326 non-null   object 
 9   CITY              295020 non-null  object 
 10  STATE             293500 non-null  object 
 11  ZIP CODE          294960 non-null  object 
 12  REGION            295207 non-null  object 
 13  COUNTY            293402 non-null  float64
 14  COUNTRY           295020 non-null  object 
 15  LAST ACTION DATE  295558 non-null  int64  
 16  CERT ISSUE DATE   28

## EDA & data preparation
In total we have 295558 rows across 35 columns, some null values (although none in the N-NUMBER column) and a mix of object, integer and float data types (if your numbers are slightly different, this is because the database is updated regularly with new aircraft). 
All column names are uppercase and separated by spaces. As analysts we already know that this is not a preferred format. So before we continue let's make all columns lowercase and replace the spaces with underscores.  
Complete the code below and transform all column names from uppercase to lowercase.

In [12]:
# Make column names lowercase
#master_df.columns = map(str.lower, master_df.columns)
master_df.columns = master_df.columns.str.lower()

# Print all column names
master_df.columns

Index(['n-number', 'serial number', 'mfr mdl code', 'eng mfr mdl', 'year mfr',
       'type registrant', 'name', 'street', 'street2', 'city', 'state',
       'zip code', 'region', 'county', 'country', 'last action date',
       'cert issue date', 'certification', 'type aircraft', 'type engine',
       'status code', 'mode s code', 'fract owner', 'air worth date',
       'other names(1)', 'other names(2)', 'other names(3)', 'other names(4)',
       'other names(5)', 'expiration date', 'unique id', 'kit mfr',
       'kit model', 'mode s code hex', 'unnamed: 34'],
      dtype='object')

Next, we're going to replace spaces with underscores.

In [13]:
# Replace spaces with underscores
master_df.columns = master_df.columns.str.replace(' ', '_')

# Print all column names
master_df.columns

Index(['n-number', 'serial_number', 'mfr_mdl_code', 'eng_mfr_mdl', 'year_mfr',
       'type_registrant', 'name', 'street', 'street2', 'city', 'state',
       'zip_code', 'region', 'county', 'country', 'last_action_date',
       'cert_issue_date', 'certification', 'type_aircraft', 'type_engine',
       'status_code', 'mode_s_code', 'fract_owner', 'air_worth_date',
       'other_names(1)', 'other_names(2)', 'other_names(3)', 'other_names(4)',
       'other_names(5)', 'expiration_date', 'unique_id', 'kit_mfr',
       'kit_model', 'mode_s_code_hex', 'unnamed:_34'],
      dtype='object')
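
The two renaming steps can also be combined into one reusable function. The sketch below is a stricter variant than the cells above and is only one possible approach: it replaces every run of characters other than letters and digits (spaces, dashes, brackets, colons) with a single underscore, so `n-number` would become `n_number`. `clean_columns` is a hypothetical helper name.

```python
import pandas as pd


def clean_columns(df):
    """Return a copy of df with lowercase, underscore-separated column names."""
    cleaned = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)  # any other character run -> "_"
        .str.strip("_")  # no leading or trailing underscores
    )
    out = df.copy()
    out.columns = cleaned
    return out


demo = pd.DataFrame(columns=["N-NUMBER", "MFR MDL CODE", "OTHER NAMES(1)", "Unnamed: 34"])
cleaned_names = clean_columns(demo).columns.tolist()
```

Note that the rest of this notebook keeps the column name `n-number`, so the later cells would need to be adjusted if this variant were used.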

Perfect, now that the column names are in a consistent and clean format we can move on to working with the data. Even though the MASTER.txt file has 35 columns, we're only going to need 3: 
1. n-number: the tail number of the aircraft
2. mfr_mdl_code: the manufacturer model code
3. year_mfr: the year the aircraft was manufactured

Complete the code below, create a subset of the master data that only includes the above mentioned columns and reassign it to the master_df variable.

In [14]:
master_df.columns

Index(['n-number', 'serial_number', 'mfr_mdl_code', 'eng_mfr_mdl', 'year_mfr',
       'type_registrant', 'name', 'street', 'street2', 'city', 'state',
       'zip_code', 'region', 'county', 'country', 'last_action_date',
       'cert_issue_date', 'certification', 'type_aircraft', 'type_engine',
       'status_code', 'mode_s_code', 'fract_owner', 'air_worth_date',
       'other_names(1)', 'other_names(2)', 'other_names(3)', 'other_names(4)',
       'other_names(5)', 'expiration_date', 'unique_id', 'kit_mfr',
       'kit_model', 'mode_s_code_hex', 'unnamed:_34'],
      dtype='object')

In [15]:
# Create subset with n-number, mfr_mdl_code and year_mfr columns
subset = master_df[['n-number', 'mfr_mdl_code', 'year_mfr']]
master_df = subset.copy()

# Print all column names
master_df.columns

Index(['n-number', 'mfr_mdl_code', 'year_mfr'], dtype='object')

Let's clean up the column names a bit more.
Complete the code below and rename the columns to: nnum, code and year.

In [16]:
# Rename columns to nnum, code and year
master_df.rename(columns= {'n-number': 'nnum', 'mfr_mdl_code': 'code', 'year_mfr': 'year'}, inplace = True)

# Print all column names
master_df.columns

Index(['nnum', 'code', 'year'], dtype='object')

In [17]:
master_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 295558 entries, 0 to 295557
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   nnum    295558 non-null  object 
 1   code    295558 non-null  object 
 2   year    241121 non-null  float64
dtypes: float64(1), object(2)
memory usage: 6.8+ MB


In [18]:
master_df['nnum'].str.startswith(' ').value_counts()

nnum
False    295558
Name: count, dtype: int64

In [19]:
master_df['nnum'].str.endswith(' ').value_counts()

nnum
False    251278
True      44280
Name: count, dtype: int64

In [20]:
master_df['code'].str.startswith(' ').value_counts()

code
False    295558
Name: count, dtype: int64

In [21]:
master_df['code'].str.endswith(' ').value_counts()

code
False    295558
Name: count, dtype: int64

While we're at it, we should remove any leading and trailing whitespace in our nnum and code columns. The checks above show that some nnum values still end with a space.  
Complete the code below and remove any leading and trailing whitespace from the nnum and code columns.

In [22]:
# Remove whitespace from nnum column
#master_df['nnum'].str.replace(' ', '')
master_df['nnum'] = master_df['nnum'].str.strip()

# Remove whitespace from code column
#master_df['code'].str.replace(' ', '')
master_df['code'] = master_df['code'].str.strip()

master_df

Unnamed: 0,nnum,code,year
0,1,2076811,2014.0
1,100,7100510,1940.0
2,10000,2130004,
3,10001,9601202,1928.0
4,10004,2072738,
...,...,...,...
295553,9ZR,8680511,
295554,9ZS,5760102,1974.0
295555,9ZT,2130001,2001.0
295556,9ZU,7101828,1959.0


In [23]:
master_df['nnum'].str.startswith(' ').value_counts()

nnum
False    295558
Name: count, dtype: int64

In [24]:
master_df['nnum'].str.endswith(' ').value_counts()

nnum
False    295558
Name: count, dtype: int64

In [25]:
master_df['code'].str.startswith(' ').value_counts()

code
False    295558
Name: count, dtype: int64

In [27]:
master_df['code'].str.endswith(' ').value_counts()

code
False    295558
Name: count, dtype: int64
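
The separate startswith/endswith checks can be condensed into one comparison: a value is padded if stripping it changes it. This is a small sketch on made-up values; `count_padded` is a hypothetical helper name.

```python
import pandas as pd


def count_padded(series):
    """Count string values that change when leading/trailing whitespace is stripped."""
    text = series.dropna().astype(str)
    return int((text != text.str.strip()).sum())


demo = pd.DataFrame({
    "nnum": ["1", "100  ", "9ZR "],
    "code": ["2076811", "7100510", "8680511"],
})

before = {col: count_padded(demo[col]) for col in ["nnum", "code"]}
demo["nnum"] = demo["nnum"].str.strip()
after = {col: count_padded(demo[col]) for col in ["nnum", "code"]}
```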

Great job so far! Now we have a table that contains a list of all aircraft, their manufacturer model code and the year each one was manufactured. I agree, this is not a lot of information but we're also not done yet. We still haven't looked at the other file: ACFTREF.txt. This file consists of detailed information for each aircraft model such as the number of engines, engine type, number of seats etc. Let's import it and have a look!  
Complete the code below and import the ACFTREF.txt file. Save it in a variable called ref_df and print its first 5 rows.

In [28]:
# Read ACFTREF.txt file and assign to variable ref_df
ref_df = pd.read_csv('data/ACFTREF.txt', sep=",", skipinitialspace=True)

# Print first 5 rows
ref_df.head(5)

Unnamed: 0,CODE,MFR,MODEL,TYPE-ACFT,TYPE-ENG,AC-CAT,BUILD-CERT-IND,NO-ENG,NO-SEATS,AC-WEIGHT,SPEED,TC-DATA-SHEET,TC-DATA-HOLDER,Unnamed: 13
0,0020901,AAR AIRLIFT GROUP INC,UH-60A,6,3,1,0,2,15,CLASS 3,0,,,
1,0030109,EXLINE ACE-C,ACE-C,4,1,1,1,1,1,CLASS 1,82,,,
2,003010D,DELEBAUGH,P,4,1,1,1,1,1,CLASS 1,82,,,
3,003010H,DAL PORTO,BABY ACE D,4,1,1,1,1,1,CLASS 1,82,,,
4,003010P,DUNN,BABY ACE,4,1,1,1,1,1,CLASS 1,82,,,


Just like we did before, let's have a look at the column names, null values and data types.

In [29]:
# Print table info
ref_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91739 entries, 0 to 91738
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CODE            91739 non-null  object 
 1   MFR             91739 non-null  object 
 2   MODEL           91739 non-null  object 
 3   TYPE-ACFT       91739 non-null  object 
 4   TYPE-ENG        91739 non-null  int64  
 5   AC-CAT          91739 non-null  int64  
 6   BUILD-CERT-IND  91739 non-null  int64  
 7   NO-ENG          91739 non-null  int64  
 8   NO-SEATS        91739 non-null  int64  
 9   AC-WEIGHT       91739 non-null  object 
 10  SPEED           91739 non-null  int64  
 11  TC-DATA-SHEET   1080 non-null   object 
 12  TC-DATA-HOLDER  1080 non-null   object 
 13  Unnamed: 13     0 non-null      float64
dtypes: float64(1), int64(6), object(7)
memory usage: 9.8+ MB


This time we have 14 columns and a mix of object, integer and float data types. Only the last three columns contain null values, and the final unnamed column is completely empty. On top of that all column names are uppercase again, but this time they're separated with a dash. So before we continue let's make all columns lowercase and replace the dashes with underscores.  
Complete the code below and transform all column names from uppercase to lowercase.

In [30]:
# Make column names lowercase
ref_df.columns = ref_df.columns.str.lower()

# Print all column names
ref_df.columns

Index(['code', 'mfr', 'model', 'type-acft', 'type-eng', 'ac-cat',
       'build-cert-ind', 'no-eng', 'no-seats', 'ac-weight', 'speed',
       'tc-data-sheet', 'tc-data-holder', 'unnamed: 13'],
      dtype='object')

Next, we're going to replace dashes with underscores.

In [31]:
# Replace dashes with underscores
ref_df.columns = ref_df.columns.str.replace('-', '_')

# Print all column names
ref_df.columns

Index(['code', 'mfr', 'model', 'type_acft', 'type_eng', 'ac_cat',
       'build_cert_ind', 'no_eng', 'no_seats', 'ac_weight', 'speed',
       'tc_data_sheet', 'tc_data_holder', 'unnamed: 13'],
      dtype='object')

Now the column names are in a format that makes it a lot easier to work with the table. Next we're going to create a subset of the data since we won't be needing all the columns present in the table.
We will keep the following columns: 
1. code: aircraft manufacturer, model and series code
2. mfr: name of the aircraft manufacturer
3. model: name of the aircraft model and series
4. type_acft: the id of the aircraft type
5. type_eng: the id of the engine type
6. no_eng: number of engines on the aircraft
7. no_seats: maximum number of seats in the aircraft
8. speed: the aircraft average cruising speed

Complete the code below and create a subset of the ref data that only includes the above mentioned columns.

In [32]:
# Select columns to keep
ref_df = ref_df[['code', 'mfr', 'model', 'type_acft', 'type_eng', 'no_eng', 'no_seats', 'speed']]

# Print all column names
ref_df.columns

Index(['code', 'mfr', 'model', 'type_acft', 'type_eng', 'no_eng', 'no_seats',
       'speed'],
      dtype='object')

Great, now we have a subset that only contains the columns we're interested in.  
Next, we're going to combine our master and ref dataset on the code column.  
Complete the code below and inner join the master and ref dataset on the code column.  
Save the combined dataset in a new variable called 'df_all' and display the dataframe.

In [33]:
# Inner join master and ref and assign to df_all
df_all = master_df.merge(ref_df, how='inner', on='code')

# Display the combined dataframe
df_all

Unnamed: 0,nnum,code,year,mfr,model,type_acft,type_eng,no_eng,no_seats,speed
0,1,2076811,2014.0,CESSNA,680,5,5,2,9,0
1,1010N,2076811,2012.0,CESSNA,680,5,5,2,9,0
2,101EF,2076811,2014.0,CESSNA,680,5,5,2,9,0
3,101FC,2076811,2007.0,CESSNA,680,5,5,2,9,0
4,101PG,2076811,2012.0,CESSNA,680,5,5,2,9,0
...,...,...,...,...,...,...,...,...,...,...
295553,9YW,05901UM,2006.0,SHAY GREGORIE,FPNA-DRIFTER,4,8,1,2,0
295554,9ZF,0590018,2005.0,ATEC,ZEPHYR 2550,4,8,1,2,0
295555,9ZP,056115U,2004.0,LINK JOHN,LANCAIR LEGACY,4,1,1,2,0
295556,9ZQ,05630MY,2015.0,HARRELSON SUSAN E,RANS S6S,4,8,1,2,0
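
An inner join silently drops rows whose code has no partner in the other table. One way to check for such rows, sketched here on made-up data, is a left join with `indicator=True`, which adds a `_merge` column, together with `validate='many_to_one'`, which raises an error if a code appears more than once in the reference table. The codes and names below are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for master_df and ref_df
master_demo = pd.DataFrame({"nnum": ["1", "100", "10000"], "code": ["A1", "B2", "ZZ"]})
ref_demo = pd.DataFrame({"code": ["A1", "B2"], "mfr": ["MAKER ONE", "MAKER TWO"]})

checked = master_demo.merge(
    ref_demo,
    how="left",
    on="code",
    validate="many_to_one",  # every code may appear only once in ref_demo
    indicator=True,          # adds the _merge column
)

# Rows that exist only in the left table would be lost by an inner join
unmatched = checked.loc[checked["_merge"] == "left_only", "nnum"].tolist()
```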


It worked, awesome! Now we have a combined dataset that consists of detailed aircraft information.  Unfortunately, we're not done yet. The two columns type_acft and type_eng only contain ids and no real information about aircraft and engine type. Fortunately, the Federal Aviation Administration includes a documentation file with the database. The file is called ardata.pdf and contains column descriptions as well as values for codes and ids.  

Let's translate the type_eng column first. Looking into the documentation we find 12 different engine types with ids ranging from 0 to 11. First we should check whether we find these ids in the type_eng column.  
Complete the code below and print a list of all distinct type_eng id values.

In [34]:
# Print list of distinct type_eng ids
print(df_all['type_eng'].unique())

[ 5  1  0  8  2  7  3  4 10 11  6  9]


Good, the ids in the type_eng column match with the ones listed in the documentation.  
Next, create a list of engine types, called engine_list, that includes all engine types listed in the documentation. Make sure to store them in the same order they are listed in the documentation.

In [35]:
# Create list of engine type names: engine_list
engine_list = ["None", 
          "Reciprocating", 
          "Turbo-prop", 
          "Turbo-shaft", 
          "Turbo-jet",
          "Turbo-fan", 
          "Ramjet", 
          "2 Cycle", 
          "4 Cycle", 
          "Unknown", 
          "Electric", 
          "Rotary"]

Now we're going to add a new column called engine that has the engine name based on the id in type_eng. 

In [36]:
# Add engine column and translate type_eng id
df_all['engine'] = [engine_list[i] for i in df_all['type_eng'].tolist()]

# Print 10 random rows
df_all.sample(10)

Unnamed: 0,nnum,code,year,mfr,model,type_acft,type_eng,no_eng,no_seats,speed,engine
78259,769TM,5170805,2001.0,LEARJET INC,45,5,5,2,12,0,Turbo-fan
244316,307SC,4220010,2009.0,HAWKER BEECHCRAFT CORP,HAWKER 750,5,5,2,8,0,Turbo-fan
74200,7998U,2072414,1964.0,CESSNA,172F,4,1,1,4,105,Reciprocating
106882,35101,2073708,1974.0,CESSNA,177B,4,1,1,4,108,Reciprocating
146978,7932R,1151544,1969.0,BEECH,V35A,4,1,1,6,150,Reciprocating
51993,973JM,2072732,1978.0,CESSNA,182Q,4,1,1,4,112,Reciprocating
78105,454LC,5170805,2011.0,LEARJET INC,45,5,5,2,12,0,Turbo-fan
197961,8817S,2071814,1965.0,CESSNA,150F,4,1,1,2,90,Reciprocating
291216,907GB,05635HB,2016.0,AK CUBS LLC,CCX-1865,4,1,1,2,0,Reciprocating
151222,8844K,9230404,1947.0,STINSON,108-1,4,1,1,4,84,Reciprocating


Good job, that was certainly not an easy task. Now that we have the actual engine type names in our dataset we don't need the type_eng column anymore.  
Complete the code below and delete the type_eng column.

In [37]:
# Delete type_eng column
df_all.drop('type_eng', axis=1, inplace=True)

# Print all column names
df_all.columns

Index(['nnum', 'code', 'year', 'mfr', 'model', 'type_acft', 'no_eng',
       'no_seats', 'speed', 'engine'],
      dtype='object')

Now that we know how to translate an id into its actual value, let's do the same for the type_acft column.  
First, let's have a look at the documentation. There we find 11 aircraft types with ids ranging from 1 to 9 plus the two letters H and O. This complicates things, but before we look further into this let's check the type_acft column values for matching ids first.

In [38]:
# Print list of distinct type_acft ids
print(df_all['type_acft'].unique())

['5' '4' '6' '1' '2' '7' '8' 'H' '9' '3' 'O']


Again, the ids match, which is good news. The bad news is that we have a mix of numerical ids and letters, and because of the letters the column is stored as strings. The list-index approach we used for the type_eng column will not work here. There is no reason to get demotivated though: this points us to a more robust way to fill in data based on ids. First, let's create a dictionary that maps the ids to the type names. As before, use the documentation file to get the values and this time type them into a dictionary.

In [39]:
acft_dict= {
            '1':'Glider',
            '2':'Balloon',
            '3':'Blimp/Dirigible',
            '4':'Fixed wing single engine', 
            '5':'Fixed wing multi engine',
            '6':'Rotorcraft',
            '7':'Weight-shift-control',
            '8':'Powered Parachute',
            '9':'Gyroplane',
            'H':'Hybrid Lift',
            'O':'Other'}


The last step is to translate the ids in the type_acft column into aircraft type names.  
Complete the code below and add a new column called aircraft_type that consists of the correct aircraft type names based on the id in the type_acft column.
With the help of the <ins>map()</ins> method we can look up every value of a column in a dictionary and replace it with the corresponding dictionary value. In our case, we translate the ids into names, much like a VLOOKUP in a spreadsheet.

In [40]:
# Assign values in the aircraft_type column by mapping the keys from type_acft column to the values in the acft_dict dictionary.
df_all['aircraft_type'] = df_all['type_acft'].map(acft_dict)

# Print list of distinct aircraft types and aircraft ids found in our dataset and the corresponding number of records for each type.
print(df_all['aircraft_type'].value_counts())

aircraft_type
Fixed wing single engine    214506
Fixed wing multi engine      48293
Rotorcraft                   19593
Balloon                       4989
Glider                        4775
Powered Parachute             1864
Weight-shift-control          1028
Gyroplane                      320
Hybrid Lift                    135
Blimp/Dirigible                 43
Other                           12
Name: count, dtype: int64
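
One detail of mapping with a dictionary is worth knowing: ids that are missing from the dictionary do not raise an error, they become NaN. The sketch below, on made-up ids, shows how to find such ids. The id 'X' is invented to trigger this case.

```python
import pandas as pd

type_ids = pd.Series(["4", "5", "H", "X"], name="type_acft")
lookup = {
    "4": "Fixed wing single engine",
    "5": "Fixed wing multi engine",
    "H": "Hybrid Lift",
}

translated = type_ids.map(lookup)

# ids without a dictionary entry end up as NaN after mapping
unmapped_ids = type_ids[translated.isna()].unique().tolist()
```

An empty `unmapped_ids` list means every id was translated.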


Good job, that was a tricky one! Now that we have the actual aircraft type names in our dataset we don't need the type_acft column anymore.  
Complete the code below and delete the type_acft column.

In [41]:
# Delete type_acft column
df_all.drop('type_acft', axis=1, inplace=True)

# Print all column names
df_all.columns

Index(['nnum', 'code', 'year', 'mfr', 'model', 'no_eng', 'no_seats', 'speed',
       'engine', 'aircraft_type'],
      dtype='object')

In [42]:
df_all.sample(5)

Unnamed: 0,nnum,code,year,mfr,model,no_eng,no_seats,speed,engine,aircraft_type
4369,416LA,2130004,2021.0,CIRRUS DESIGN CORP,SR22T,1,5,0,Reciprocating,Fixed wing single engine
24009,565ND,2072439,2008.0,CESSNA,172S,1,4,0,Reciprocating,Fixed wing single engine
44037,8093Q,2076014,1973.0,CESSNA,421B,2,8,172,Reciprocating,Fixed wing multi engine
58468,606TS,7640122,2007.0,ROBINSON HELICOPTER COMPANY,R44 II,1,4,0,Reciprocating,Rotorcraft
116075,2045N,7101828,1958.0,PIPER,PA-18-150,1,2,97,Reciprocating,Fixed wing single engine


Done! We successfully translated the ids to their actual values. Before we write the table into our database, let's make sure we have clean and descriptive column names. The first column we are going to change is the nnum column. Let's have a look at its values before we rename it.  
Complete the code below and print the first 5 rows of the nnum column.

In [43]:
# Print first 5 rows of nnum column
df_all[['nnum']].head()

Unnamed: 0,nnum
0,1
1,1010N
2,101EF
3,101FC
4,101PG


Something is weird. Don't the tail numbers in our flights data start with the letter 'N'? Let's check that just to be sure. To do this we need to connect to our PostgreSQL database and query the unique tail numbers from the flights table. We already stored a function for this in the sql_functions.py file, so all that's left to do is import our get_dataframe function from it.

In [44]:
# Import get_dataframe from sql_functions.py
#import sql_functions
from sql_functions import get_dataframe

In [45]:
df_all['nnum'].unique()

array(['1', '1010N', '101EF', ..., '9ZP', '9ZQ', '9ZX'], dtype=object)

Next, query the distinct tail numbers from the flights table and store them in a variable f_planes and print the first 5 rows.

In [46]:
# Store unique tail numbers in f_planes
schema = 'hh_analytics_24_1'
f_planes = get_dataframe(f'SELECT DISTINCT tail_number FROM {schema}.flights')

# Print first 5 rows of f_planes
f_planes.head()

Unnamed: 0,tail_number
0,
1,N8647A
2,N665NK
3,N342DN
4,N585NN


Indeed, the tail numbers all start with the letter 'N'. To make sure the nnum column really consists of tail numbers and only the letter 'N' is missing, let's calculate how many matching values we have between the two columns.

In [47]:
f_planes['tail_number'].count()

4610

In [48]:
df_all['nnum'].count()

295558

In [49]:
# Count matching values in tail_number and nnum
f_planes['tail_number'].str[1:].isin(df_all['nnum']).value_counts()

tail_number
True     4449
False     162
Name: count, dtype: int64
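
The True and False counts above add up to one more than the result of count(). A likely reason is the empty tail number in the first row of f_planes: count() ignores missing values, while isin() reports False for them. The sketch below reproduces this on made-up values ('N000ZZ' is an invented tail number without a match).

```python
import pandas as pd

tail_numbers = pd.Series([None, "N8647A", "N665NK", "N000ZZ"], name="tail_number")
registry = pd.Series(["8647A", "665NK", "1"], name="nnum")

# Drop the first character; the missing value stays missing
stripped = tail_numbers.str[1:]
is_match = stripped.isin(registry)

non_null = int(tail_numbers.count())   # ignores the missing value
matches = int(is_match.sum())
non_matches = int((~is_match).sum())   # includes the missing value
```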

We have 4449 matches (your number may differ slightly since the data changes over time), which is a large share of the 4610 tail numbers in our flights table. Therefore, we can assume that in order to match the nnum and the tail_number column, all we need to do is add the letter 'N' in front of each value in the nnum column.  
Complete the code below and create a new column in df_all called tail_number that consists of the letter 'N' and the nnum values.

In [50]:
# Create tail_number column
df_all['tail_number'] = 'N' + df_all['nnum']

# Print first 5 rows
df_all.head()

Unnamed: 0,nnum,code,year,mfr,model,no_eng,no_seats,speed,engine,aircraft_type,tail_number
0,1,2076811,2014.0,CESSNA,680,2,9,0,Turbo-fan,Fixed wing multi engine,N1
1,1010N,2076811,2012.0,CESSNA,680,2,9,0,Turbo-fan,Fixed wing multi engine,N1010N
2,101EF,2076811,2014.0,CESSNA,680,2,9,0,Turbo-fan,Fixed wing multi engine,N101EF
3,101FC,2076811,2007.0,CESSNA,680,2,9,0,Turbo-fan,Fixed wing multi engine,N101FC
4,101PG,2076811,2012.0,CESSNA,680,2,9,0,Turbo-fan,Fixed wing multi engine,N101PG


There we go! Now we have a tail number column that we can join with our flights table later on. Let's change the order of the columns and get rid of the code column since we don't need it anymore.

In [51]:
# Remove code column, change column order and assign to planes
planes = df_all[['tail_number', 'year', 'mfr', 'model', 'engine', 'aircraft_type', 'no_eng', 'no_seats', 'speed']].copy()

As a final data cleaning step, give the mfr, no_eng and no_seats columns more descriptive names.  
Complete the code below and change the column names mfr into manufacturer, no_eng into engines and no_seats into seats.

In [52]:
# Change column names
planes.rename(columns={'mfr' : 'manufacturer', 'no_eng' : 'engines', 'no_seats' : 'seats'}, inplace=True)

# Print all column names
planes.columns

Index(['tail_number', 'year', 'manufacturer', 'model', 'engine',
       'aircraft_type', 'engines', 'seats', 'speed'],
      dtype='object')

Awesome! We finally have a clean dataset with detailed information about aircraft. Let's check how many unique aircraft we have in our dataset.  
Complete the code below and count the unique airplanes in the planes variable.

In [53]:
planes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 295558 entries, 0 to 295557
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tail_number    295558 non-null  object 
 1   year           241121 non-null  float64
 2   manufacturer   295558 non-null  object 
 3   model          295558 non-null  object 
 4   engine         295558 non-null  object 
 5   aircraft_type  295558 non-null  object 
 6   engines        295558 non-null  int64  
 7   seats          295558 non-null  int64  
 8   speed          295558 non-null  int64  
dtypes: float64(1), int64(3), object(5)
memory usage: 20.3+ MB


In [54]:
# Count unique aircrafts
planes['tail_number'].nunique()

295558

Wow, around 295k aircraft. Previously, when we counted the matches between the nnum column and the tail_number column, we only found 4449 matches, far fewer than the number of aircraft in the official registry. Since we only need the aircraft that appear in our flights table, let's filter out the remaining ones.  
Complete the code below and create a dataframe called final_table that contains one row for every tail number in f_planes, enriched with the aircraft details from the planes dataset wherever a match exists.

In [55]:
# Right join: keep every tail number from f_planes and add plane details where a match exists
final_table = planes.merge(f_planes, how='right', on='tail_number')

In [56]:
final_table

Unnamed: 0,tail_number,year,manufacturer,model,engine,aircraft_type,engines,seats,speed
0,,,,,,,,,
1,N8647A,2014.0,BOEING,737-8H4,Turbo-fan,Fixed wing multi engine,2.0,143.0,0.0
2,N665NK,2016.0,AIRBUS,A321-231,Turbo-fan,Fixed wing multi engine,2.0,379.0,0.0
3,N342DN,2018.0,AIRBUS,A321-211,Turbo-fan,Fixed wing multi engine,2.0,199.0,0.0
4,N585NN,2016.0,BOMBARDIER INC,CL-600-2D24,Turbo-fan,Fixed wing multi engine,2.0,95.0,0.0
...,...,...,...,...,...,...,...,...,...
4606,N77535,2016.0,BOEING,737-824,Turbo-fan,Fixed wing multi engine,2.0,149.0,0.0
4607,N344NW,1992.0,AIRBUS INDUSTRIE,A320-212,Turbo-fan,Fixed wing multi engine,2.0,182.0,0.0
4608,N602LR,2008.0,BOMBARDIER INC,CL-600-2D24,Turbo-fan,Fixed wing multi engine,2.0,95.0,0.0
4609,N952AT,2000.0,BOEING,717-200,Turbo-fan,Fixed wing multi engine,2.0,100.0,0.0


In [57]:
# Print count of planes in final_table
final_table['tail_number'].count()

4610
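
The choice of join type decides which rows end up in the result. A right join keeps every row of the right table and fills the plane columns with NaN where no match exists, while an inner join keeps matching rows only. The sketch below compares both on made-up data.

```python
import pandas as pd

# Toy stand-ins for planes and f_planes
planes_demo = pd.DataFrame({"tail_number": ["N1", "N2", "N3"], "seats": [9, 143, 4]})
flights_demo = pd.DataFrame({"tail_number": ["N2", "N9", None]})

right_joined = planes_demo.merge(flights_demo, how="right", on="tail_number")
inner_joined = planes_demo.merge(flights_demo, how="inner", on="tail_number")

rows_right = len(right_joined)   # one row per tail number in flights_demo
rows_inner = len(inner_joined)   # matching tail numbers only
rows_without_details = int(right_joined["seats"].isna().sum())
```

If only aircraft with registry details are wanted, `how='inner'` would be the alternative to consider.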

Good job! Instead of a huge dataset with all aircraft we now have a smaller subset that matches the aircraft we have in the flights table in our PostgreSQL database.

## Inserting aircrafts data into the database
The last step is to write this table into our database. We already created functions to help us do this using the sql_functions.py file. 
If the credentials and functions are set up correctly from the previous notebook, we can go ahead and import the helper function from the sql_functions.py file to get our connection engine.

In [58]:
# Import get_engine function from sql_functions.py and set it to a variable called engine
from sql_functions import get_engine
engine = get_engine()

# Import psycopg2
import psycopg2

Next, set the table name variable. This will be the name of the table that will be written to the PostgreSQL database.

In [59]:
# IMPORTANT: Set the schema to your course name and set the table_name variable to 'planes_' + your initials/group number
# Example: planes_pw for Philipp Wendt / planes_1 for group1
schema = 'hh_analytics_24_1' # example: 'hh_analytics_22_1'
table_name = 'planes_sp'

The final step is to write the dataset to the database.  
Complete the code below and write the dataset stored in final_table to the PostgreSQL database.

In [60]:
# Write records stored in a dataframe to SQL database
if engine is not None:
    try:
        final_table.to_sql(name=table_name, # Name of SQL table
                        con=engine, # Engine or connection
                        if_exists='replace', # Drop the table before inserting new values 
                        schema=schema, # your class schema
                        index=False, # Do not write the DataFrame index as a column
                        chunksize=5000, # Specify the number of rows in each batch to be written at a time
                        method='multi') # Pass multiple values in a single INSERT clause
        print(f"The {table_name} table was imported successfully.")
    # Error handling
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        engine = None
else:
    print('No database engine available, nothing was written.')

The planes_sp table was imported successfully.


To check if everything worked try querying the table from the database.

In [61]:
# Query the new planes table to get number of planes in the SQL table
query = f'SELECT COUNT(tail_number) FROM {schema}.{table_name}'

#from sql_functions import get_data
#get_data(query)

get_dataframe(query)

Unnamed: 0,count
0,4610
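
Note that `COUNT(tail_number)` only counts rows where tail_number is not NULL, whereas `COUNT(*)` counts all rows. The sketch below shows the difference on made-up data. It uses an in-memory SQLite database as a stand-in, which is an assumption made purely so the example runs without credentials; the notebook itself writes to PostgreSQL, and `planes_demo` is an invented table name.

```python
import sqlite3

import pandas as pd

demo = pd.DataFrame({
    "tail_number": [None, "N8647A", "N665NK"],
    "seats": [None, 143.0, 379.0],
})

con = sqlite3.connect(":memory:")
try:
    # Missing values are written as NULL
    demo.to_sql("planes_demo", con, if_exists="replace", index=False)
    counts = pd.read_sql_query(
        "SELECT COUNT(*) AS all_rows, COUNT(tail_number) AS with_tail_number "
        "FROM planes_demo",
        con,
    )
finally:
    con.close()

all_rows = int(counts.loc[0, "all_rows"])
with_tail_number = int(counts.loc[0, "with_tail_number"])
```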


You made it, congratulations!