## Rename Weather Files by Airport Code and Add a Key for Joining to the OnTime Data
<br>
The OnTime data contains an Airport Code.<br>
The Weather data contains a STATION ID made up of a USAF code and WBAN code.<br>
<br>
The Master Location Identifier Database contains a wban and national_id_xref. The national_id_xref appears to match to Airport codes.<br>
<br>
In a previous step we used the LCD Station Number cross reference file to narrow the weather files down to US stations only. That file did not help identify the airports any further, but it did remove about half the files (the non-US ones), which freed up space to work with.<br>
<br>
Here we'll go through just the US files in a similar manner. Instead of deciding whether a file is US or not, we take the wban portion of the file name, look up its national_id_xref, and check whether that matches an airport code in the OnTime data. If it does, the file is renamed to the airport code.<br>
<br>
Files will be moved to the following folders, determined by the Station ID in the file name:<br>
* airport_files - these are the files matching to airport codes in the OnTime data<br>
* codename_files - these are the same files but renamed to the airport code, instead of Station ID. These files also have a column added to them called "airport_code" so they can be concatenated together and matched based on their "airport_code" to the OnTime data.<br>
* no_airport_match - the WBAN ID on the file did not match to any airports<br>
* no_wban_match - these are the files that didn't even match to a WBAN so they couldn't be matched to an airport at all<br>
<br>
The no_airport_match and no_wban_match files were kept so they could be checked when investigating airports that had no weather file.
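
`os.rename` does not create missing directories, so the four destination folders need to exist under each year's directory before any files are moved. The notebook does not show where they are created, so the following is only a minimal sketch of one way to do it. The helper name `ensure_year_folders` and the temporary demo directory are assumptions for illustration, not part of the original code.

```python
import os
import tempfile

# Folder names taken from the list above
DEST_FOLDERS = ["airport_files", "codename_files", "no_airport_match", "no_wban_match"]

def ensure_year_folders(year_dir):
    """Create the four destination folders under one year's directory (hypothetical helper)."""
    created = []
    for name in DEST_FOLDERS:
        path = os.path.join(year_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created

# Demonstration in a throwaway directory rather than the real basepath
demo_root = tempfile.mkdtemp()
demo_paths = ensure_year_folders(os.path.join(demo_root, "2010"))
```

In the notebook itself this would be called with `basepath + "/Weather/Hourly/" + YearFolder` for each year before the files are processed.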

In [None]:
import os  # used below for listing and moving files
import pandas as pd
basepath = 'your_path_here'

The origindatetime.pkl file was created in the W1_Make_OnTime_DFW_joining_keys.ipynb code. It consists of the airport code, flight date and time.<br>
The list of airportcodes is created from all unique values in the 'Origin' field.

In [None]:
# Get list of airport codes from OnTime data
originflights = pd.read_pickle(basepath + "/OnTime/origindatetime.pkl")
airportcodes = OriginAirports = originflights['Origin'].unique().tolist()

Reading in the Master Location ID Database - this provides a cross reference from WBAN to NATIONAL_ID_XREF which matches to the airport codes in the OnTime data

In [None]:
filepath = basepath + '/Weather/Hourly/master-location-identifier-database-20130801.csv'
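# Read both columns as text so that .str.zfill works on wban and leading zeros are not lost
master_id = pd.read_csv(filepath, encoding='latin_1', low_memory=False, header=4, usecols=['wban', 'national_id_xref'], dtype=str)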

Note: wban values are not unique in MASTER Location ID database

In [None]:
master_id.dropna(inplace=True)
master_id.drop_duplicates(inplace=True)
master_id['wban'] = master_id['wban'].str.zfill(5) # This is to match to the wban values pulled from the file names, making it 5 characters padded with 0's.
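
To make the effect of the clean-up concrete, here is a small sketch on made-up rows (the values are invented for illustration and are not taken from the real database). It shows the padding to five characters and why a lookup by wban has to return a list: one wban can map to more than one national_id_xref.

```python
import pandas as pd

# Made-up rows for illustration only
demo_master = pd.DataFrame({
    "wban": ["3927", "3927", "13874", None],
    "national_id_xref": ["AAA", "BBB", "CCC", "DDD"],
})

# Same clean-up steps as above: drop incomplete rows, drop duplicates, pad to 5 characters
demo_master = demo_master.dropna().drop_duplicates()
demo_master["wban"] = demo_master["wban"].str.zfill(5)

# wban is not unique, so the lookup yields a list of candidate codes
demo_codes = demo_master.loc[demo_master["wban"] == "03927", "national_id_xref"].tolist()
```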

# Read in all the filenames for the US 2010 Hourly weather

After the matching code below was first run, the airports with no match were investigated. The following cell contains the matches found in that investigation.

In [None]:
# These are 31 of the 33 airport codes not found in the previous cross reference (key: WBAN, value: airport code)
AC_Dict = {	'12834': 'DAB',	'73805': 'ECP',	'11641': 'SJU',	'11640': 'STT',	'54831': 'BMI',	'93753': 'OAJ',	'53893': 'GTR',	'00207': 'BKG',
			'23903': 'UTM',	'03882': 'PFN',	'54764': 'SCE',	'11603': 'BQN',	'14756': 'ACK',	'24146': 'FCA',	'94014': 'ISN',	'41414': 'GUM',
			'41408': 'SPN',	'12919': 'BRO',	'23104': 'AZA',	'63801': 'USA',	'54745': 'OGS',	'04742': 'PBG',	'94824': 'CIU',	'03145': 'YUM',
			'13922': 'SLN',	'61705': 'PPG',	'63837': 'HHH',	'24014': 'XWA',	'23208': 'CLD',	'93194': 'IYK',	'92814': 'UST'}
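
The manual matches are only used as a fallback when the master cross reference has no entry for a WBAN. A minimal sketch of that lookup order follows. The function name `resolve_codes` and the plain-dict `demo_xref` are hypothetical stand-ins for the `master_id` lookup, and the `"99999"` entry is made up; the two override entries are copied from `AC_Dict`.

```python
def resolve_codes(wban, xref, overrides):
    """Return candidate airport codes for a WBAN, using overrides only as a fallback."""
    codes = list(xref.get(wban, []))
    if not codes and wban in overrides:
        codes = [overrides[wban]]
    return codes

# Hypothetical stand-in for the master_id lookup (made-up entry)
demo_xref = {"99999": ["ZZZ"]}
# Two entries copied from AC_Dict
demo_overrides = {"12834": "DAB", "73805": "ECP"}

from_xref = resolve_codes("99999", demo_xref, demo_overrides)      # found in the cross reference
from_override = resolve_codes("12834", demo_xref, demo_overrides)  # falls back to the manual matches
unmatched = resolve_codes("00000", demo_xref, demo_overrides)      # no match in either source
```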

The following function processes one year's folder of files at a time. It moves and renames the files and inserts the "airport_code" column, as described at the top of this notebook.

In [None]:
def determineAirportFiles(YearFolder):
    Direc =  basepath + "/Weather/Hourly/"+ YearFolder
    ForFileList = Direc + "/US"
    # Keep only regular files in the US folder (skip any subdirectories)
    files = [f for f in os.listdir(ForFileList) if os.path.isfile(os.path.join(ForFileList, f))]

    # WBAN for each file: characters 7-11 of the file name, in the same order as files
    wban_list = [station[6:11] for station in files]

    airport_files = []
    check_these_files = []
    dup_weather_for_airport = []
    count = 0
    for file in files:
        ext = file[-4:]
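        # Only the first 3000 files are processed; any beyond that are left in the US folder
        if count < 3000:
            # All national_id_xref values for this WBAN (may be empty or contain several entries)
            int_id_xref = master_id.loc[master_id['wban'] == wban_list[count], 'national_id_xref'].tolist()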
            if len(int_id_xref) == 0:
                if wban_list[count] in AC_Dict:
                    int_id_xref = [AC_Dict[wban_list[count]]]

            if len(int_id_xref) > 0:
                for filenum in range(len(int_id_xref)):
                    if filenum == 0:
                        if int_id_xref[0] in airportcodes:
                            airport_files.append(int_id_xref[0])
                            ap_weather_file = pd.read_csv(Direc + "/US/" + file, low_memory=False)
                            ap_weather_file.insert(loc=0, column = 'airport_code', value = int_id_xref[0])
                            ap_weather_file.to_csv(Direc + "/codename_files/" + int_id_xref[0] + ".csv", index=False)
                            os.rename(Direc + "/US/" + file, Direc + "/airport_files/" + file)
                        else:
                            os.rename(Direc + "/US/" + file, Direc + "/no_airport_match/" + file)
                    else:
                        # If inside this else, the first file was already output to one location
                        if int_id_xref[filenum] in airportcodes:
                            dup_weather_for_airport.append(file)
                            print("\tfile:", file, "is an additional weather file for an airport:", int_id_xref[0] + ext, " in the dup_airport_files")
                        else:
                            check_these_files.append(file)
            else:
                os.rename(Direc + "/US/" + file, Direc + "/no_wban_match/" + file)
        count += 1

    print("For", Direc + "/US", "there are", len(airport_files), "files")
    return [check_these_files, airport_files, dup_weather_for_airport]
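
The routing rule inside the function can be hard to see through the nesting: only the first candidate code decides where a file is moved, and later candidates are only recorded in the returned lists. The sketch below isolates that rule. The name `destination_for` and the demo values are assumptions for illustration; it does not touch any files.

```python
def destination_for(codes, airport_codes):
    """Pick the destination folder for a file from its candidate codes (hypothetical helper)."""
    if not codes:
        return "no_wban_match"      # WBAN not found in the cross reference or manual matches
    if codes[0] in airport_codes:
        return "airport_files"      # a renamed copy also goes to codename_files
    return "no_airport_match"       # WBAN known, but the first code is not an OnTime airport

# Made-up airport list for the demonstration
demo_airports = {"DAB", "ECP"}

dest_match = destination_for(["DAB", "XXX"], demo_airports)
dest_no_airport = destination_for(["XXX", "DAB"], demo_airports)  # second code is not used for routing
dest_no_wban = destination_for([], demo_airports)
```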

In [None]:
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2010')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2011')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2012')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2013')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2014')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2015')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2016')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2017')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2018')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2019')
check_these_files, airport_files, dup_weather_for_airport = determineAirportFiles('2023')
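
Each call above overwrites the three result variables, so the lists have to be reviewed before the next line is run. If the results for every year should be kept instead, one option is to store them in a dict keyed by year. This is only a sketch: `run_years` and `fake_process` are hypothetical names, and `fake_process` stands in for `determineAirportFiles` so the example runs without moving any files.

```python
def run_years(years, process):
    """Call process(year) for each year and keep all three returned lists per year."""
    results = {}
    for year in years:
        check, matched, dups = process(year)
        results[year] = {
            "check_these_files": check,
            "airport_files": matched,
            "dup_weather_for_airport": dups,
        }
    return results

def fake_process(year):
    """Stand-in for determineAirportFiles that returns made-up lists."""
    matched = ["DFW"] if year == "2010" else []
    return [[], matched, []]

demo_results = run_years(["2010", "2011"], fake_process)
```

In the notebook, `run_years([...], determineAirportFiles)` would give the same three lists for each year without losing the earlier ones.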

After running the 2010 data, additional work was done to find which airports had no match. There were approximately 33 airports to investigate. The findings were added to AC_Dict above.<br>
The lists returned by each call were reviewed by running one line at a time.<br>
<br>
Seven files were found to be outside the continental US. Even after investigating the airports above, not all of the remaining files could be matched based on their WBAN numbers.