# CMAQ Flight Pairing

---
    author: Michael J. Pye (pye.michael@epa.gov)
    date created: 2025-01-13
---

### Origin
This notebook is based upon another notebook written by Havala Pye (pye.havala@epa.gov) for the Firex field campaign created on 2023-01-10.  
The path to the original notebook is: /work/MOD3DEV/has/2022firex/postproc/notebooks/2024_firexpairv3_hotp_20240304_base3.ipynb  
Although this notebook has a very different structure, at least some of the ideas for this notebook came from that notebook. 

### Purpose
This notebook pairs flight data from any aircraft field campaign with CMAQ conc files. 

### Log
* 01/13/2025 - Created notebook and organized introduction section
* 01/14/2025 - Began the `pair_flight()` function
    * Added a section to find the model domain grid files
    * Added a section to find the row and col of a grid point associated with a flight lat and lon
* 01/15/2025 - added another section to `pair_flight()` that finds the file associated with a given observation as well as the timestep within that file
* 01/17/2025 - Fixed the longitude distance issue in the section of the pairing function that finds the longitude distance from every grid point to the aircraft. Solved the issue by multiplying the distances by the cosine of the latitude 
* 01/21/2025 - added section to save index information to NetCDF
* 01/23/2025 - Changed method of horizontal matching to calculate the great circle distance and then find the minimum distance
* 01/24/2025 - Added function to extract CMAQ data based on aircraft location within 3D CMAQ grid
* 01/24/2025 - Added `source` attribute to NetCDF file storing lat-lon data and assigned it to a string that provides information about where the simulation domain files are stored. This is so the CMAQ data extraction function knows where to look in the case the user wants to include coordinate data in the CMAQ output DataFrame.
* 01/28/2025 - Added a function to combine the pairing and extraction processes together, allowing the user to avoid saving a NetCDF containing index information if that is not desirable.
* 02/03/2025 - Updated the extraction function to allow extraction of paired data from METCRO3D and GRIDCRO3D files
* 02/05/2025 - Updated the lat-lon pairing algorithm to include the arccosine function to make the process more clear.
* 02/06/2025 - Gave the pairing function the ability to determine if the aircraft is outside the model domain. Set indices to missing values in this case and then extracted values as missing when encountering missing indices.
* 02/13/2025 - Updated the pairing function so the file time string is created with the `datetime.datetime.strftime()` method instead of a hardcoded format. This allows for a future update where the file time string format can vary. This also made the `pad_zeros()` function useless, so it was deleted along with the rest of the general functions section.
* 02/14/2025 - Added a function to print out the flight ID's within a given index NetCDF file. 
---

# Imports

In [1]:
import pandas as pd
import numpy as np
import netCDF4 as nc
import datetime
import os
import cmaqsatproc as csp

# Pair flight data to CMAQ
The purpose of the `pair_flight()` function is to find the CMAQ grid points associated with a flight trajectory and save the indices for each observation time of the flight to a variable in a NetCDF file.  
  
Parameters:  
* `flight_id` (str) - Acts as an identifier for the flight being submitted. Will be used as the NetCDF variable name
* `flight_lons` (list) - Provides the longitude of the aircraft at every observation
* `flight_lats` (list) - Provides the latitude of the aircraft at every observation
* `flight_alts` (list) - Provides the altitude of the aircraft at every observation in meters, either above sea level or above ground level as specified by `alt_type`
* `flight_times` (list) - Provides the time each observation was taken. Required format YYYY-MM-DD HH:MM:SS
* `index_file_path` (str) - Path of the NetCDF file you would like to save the CMAQ indices to, regardless of whether the file at the entered path exists yet.
* `domain_dir` (str) - The name of the directory where the GRIDCRO2D and METCRO3D files are stored.
* `model_resolution` (int) [Default: `12000`] - The horizontal grid spacing of the model in units of meters. This is only used to tell the function if the aircraft has left the model domain. Set this value to limit the farthest distance from a model grid point that you would like flight data to pair to model data. 
* `alt_type` (str) [Default: `'ASL'`] - Specifies whether the altitudes passed to the function are in height above ground level or height above sea level. The options are `'ASL'` (above sea level) and `'AGL'` (above ground level) 
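
The inputs listed above are plain lists of equal length, with times formatted as `YYYY-MM-DD HH:MM:SS`. A minimal sketch of one way to build them from a pandas DataFrame is shown here. The DataFrame `obs`, its column names, and its values are hypothetical placeholders, not taken from any campaign file. Rows with a missing longitude, latitude, altitude, or time are dropped before the lists are built.

```python
import pandas as pd

# Hypothetical flight table; column names and values are illustrative only
obs = pd.DataFrame({
    'time': pd.to_datetime(['2019-08-01 18:00:00',
                            '2019-08-01 18:00:01',
                            '2019-08-01 18:00:02']),
    'lon': [-116.20, None, -116.22],    # the second row is missing a longitude
    'lat': [43.60, 43.61, 43.62],
    'alt_m': [1500.0, 1510.0, 1520.0],
})

# Drop every row that is missing a longitude, latitude, altitude, or time
obs = obs.dropna(subset=['time', 'lon', 'lat', 'alt_m'])

# Build the list inputs, formatting times as YYYY-MM-DD HH:MM:SS
flight_times = obs['time'].dt.strftime('%Y-%m-%d %H:%M:%S').tolist()
flight_lons = obs['lon'].tolist()
flight_lats = obs['lat'].tolist()
flight_alts = obs['alt_m'].tolist()
```

Filtering once on the DataFrame, before splitting it into lists, keeps the four lists aligned with each other and with the observation columns that will later be compared to the model output.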

Notes:
* Make sure to remove any data with a missing longitude, latitude, altitude, OR time. Failure to do so will result in differences in the number of values from the model output and from the aircraft observations.
* The function finds the model grid point closest to the aircraft in the horizontal using the lat-lon coordinates of the grid points and of the aircraft. That section begins with the comment: "Find the x and y indices of the closest grid point to the aircraft by calculating the distance to all points." The algorithm is based on a formula for calculating the great circle distance between two points on Earth from their lat-lon coordinates. Normally, you would multiply the `angle` variable by Earth's radius to get a distance. However, the grid point with the smallest angle between itself and the aircraft must also be the closest one, so finding the index of the smallest angle gives the closest grid point to the aircraft. Websites that describe the formula are listed here:
    * https://dtcenter.org/sites/default/files/community-code/met/docs/write-ups/gc_simple.pdf
    * https://www.geeksforgeeks.org/great-circle-distance-formula/
    * https://www.themathdoctors.org/distances-on-earth-1-the-cosine-formula/  
* For times when the aircraft is farther than `model_resolution` meters from the closest model grid point, `-99999999` is assigned as the index value for the time, layer, row, and column.  
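
The horizontal matching and the out-of-domain check described in the notes above can be sketched on a toy grid as follows. This is a simplified illustration, not the notebook's implementation: the helper name `nearest_grid_point`, the 3 x 3 grid, and the test coordinates are hypothetical, and `np.clip()` is used here in place of the floor-based truncation in `pair_flight()` to keep the cosine within the valid range for `np.arccos()`.

```python
import numpy as np

EARTH_RADIUS_M = 6370000.0    # same Earth radius the notebook takes from CMAQ's CONST.EXT

def nearest_grid_point(grid_lats, grid_lons, lat, lon, max_distance_m=12000.0):
    """Return (row, col, distance_m) for the grid point closest to (lat, lon).

    row and col are None when the closest point is farther than max_distance_m.
    Hypothetical helper for illustration only.
    """
    glat = np.radians(np.asarray(grid_lats, dtype=np.float64))
    glon = np.radians(np.asarray(grid_lons, dtype=np.float64))
    plat, plon = np.radians(lat), np.radians(lon)
    # Cosine of the great circle angle between the aircraft and every grid point
    cos_angle = (np.cos(glat) * np.cos(plat) * np.cos(glon - plon)
                 + np.sin(glat) * np.sin(plat))
    # Assumption: clip to [-1, 1] so rounding error cannot break arccos
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    row, col = np.unravel_index(np.argmin(angle), angle.shape)
    distance_m = float(EARTH_RADIUS_M * angle[row, col])
    if distance_m > max_distance_m:
        return None, None, distance_m
    return int(row), int(col), distance_m

# Toy 3 x 3 grid with 0.1 degree spacing (hypothetical values)
lats, lons = np.meshgrid([35.0, 35.1, 35.2], [-80.0, -79.9, -79.8], indexing='ij')
inside = nearest_grid_point(lats, lons, 35.11, -79.91)    # close to the center point
outside = nearest_grid_point(lats, lons, 40.0, -79.9)     # far north of the grid
```

Because only the location of the minimum matters, the radius is applied once to the winning angle rather than to the whole array, which mirrors the reasoning in the note above.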

In [None]:
def pair_flight(flight_id, flight_lons, flight_lats, flight_alts, flight_times, index_file_path, domain_dir, model_resolution = 12000, alt_type = 'ASL'):
    #force array type to be numpy.ndarray
    flight_lons = np.array(flight_lons)
    flight_lats = np.array(flight_lats)
    flight_alts = np.array(flight_alts)
    flight_times = np.array(flight_times)
    
    #find 2d model domain file name
    no_file = True
    dir_file_list = os.listdir(domain_dir)
    n = 0
    while no_file == True:    #this loop looks for the first GRIDCRO2D file and saves the file name
        if 'GRIDCRO2D.' in dir_file_list[n]:
            gridcro2d_file_name = dir_file_list[n]
            no_file = False
        n += 1
    gridcro2d_file_path = os.path.join(domain_dir, gridcro2d_file_name) 
    
    #extract model lat and lon and get land elevation if alt_type is "ASL"
    grid_data = nc.Dataset(gridcro2d_file_path)
    grid_lons = np.array(grid_data.variables['LON']).astype(np.float64)    #finds the longitude values and converts to a double precision value to avoid rounding errors in the angle calculations to find distance
    grid_lats = np.array(grid_data.variables['LAT']).astype(np.float64)    #finds the latitude values and converts to a double precision value to avoid rounding errors in the angle calculations to find distance
    rad_grid_lons = grid_lons[0, 0, :, :] * np.pi / 180    #version of longitude array in radians
    rad_grid_lats = grid_lats[0, 0, :, :] * np.pi / 180    #version of latitude array in radians
    if alt_type == 'ASL':
        ter_elev = np.array(grid_data.variables['HT'])
    grid_data.close()    
    
    #find unique dates in the flight
    unique_dates = []
    for time in flight_times:
        time = time.split(' ')[0].replace('-', '')[2:]
        if time not in unique_dates:
            unique_dates.append(time)
    
    #find METCRO3D files needed for flight
    met_file_names = []
    for file_name in os.listdir(domain_dir):
        if 'METCRO3D.' in file_name:
            for date in unique_dates:
                if date in file_name:
                    met_file_names.append(file_name)
                    
    index_array = np.zeros((len(flight_times), 6))
    last_date_check_str = None
    #iterate through the flight data
    for i in range(len(flight_times)):
        time_obj = datetime.datetime.strptime(flight_times[i], '%Y-%m-%d %H:%M:%S')
        current_date_check_str = time_obj.strftime('%y%m%d')    #stores the date given in the filename. May need to be changed for different file name formats
        
        #Find the x and y indices of the closest grid point to the aircraft by calculating the distance to all points.
        rad_flight_lon = flight_lons[i] * np.pi / 180
        rad_flight_lat = flight_lats[i] * np.pi / 180
        cos_angle = np.cos(rad_grid_lats) * np.cos(rad_flight_lat) * np.cos(rad_grid_lons - rad_flight_lon) + np.sin(rad_grid_lats) * np.sin(rad_flight_lat)    #represents the cosine of the angle between every grid point and the aircraft along a great circle line on Earth
        angle = np.arccos(np.floor(cos_angle * (10 ** 7)) / (10 ** 7))    #represents the angle between the aircraft and each grid point along a great circle line on Earth. The multiplication and division by 10^7 and finding np.floor() has the effect of chopping off decimals smaller than 10^-7 to avoid rounding error issues when finding the cos_angle array.
        row, col = np.unravel_index(np.argmin(angle), angle.shape)    #finds the index of the grid point with the smallest number of radians from the aircraft 
        earth_radius = 6370000    #Earth radius defined in CMAQ CONST.EXT file: https://github.com/USEPA/CMAQ/blob/main/CCTM/src/ICL/fixed/const/CONST.EXT
        distance_from_plane = earth_radius * angle[row, col]    #multiplies the radius of the earth by the angle between the aircraft and the closest grid point to the aircraft
        
        #if the distance between the aircraft and the closest grid point is larger than the model's horizontal resolution,
        #the aircraft must be outside the model domain. This section accounts for this and sets the time, layer, row, and
        #column indices to the missing value -99999999 so no model data is extracted for this time during CMAQ data extraction
        if distance_from_plane > model_resolution:
            index_array[i, 0] = np.int64(flight_times[i].replace(' ', '').replace(':', '').replace('-', ''))
            index_array[i, 1] = current_date_check_str
            index_array[i, 2] = -99999999
            index_array[i, 3] = -99999999
            index_array[i, 4] = -99999999
            index_array[i, 5] = -99999999
        
        #if the distance between the aircraft and the closest grid point is smaller than the model's horizontal resolution,
        #the aircraft is within the model domain. This section then goes through the process of finding the vertical
        #index and saves index values to index_array
        elif distance_from_plane <= model_resolution:        
            #Find the index of the 3D array associated with the current time
            if current_date_check_str == last_date_check_str:    #this runs if the current ob occurred on the same day as the previous one. Means a new file doesn't need to be opened every ob
                time_index = np.argmin(np.abs(current_file_datetimes - time_obj))    #finds the index of the shortest time between model out and ob 
            elif current_date_check_str != last_date_check_str:     
                for met_file_name in met_file_names:    #this loop finds the file that corresponds to time the ob was taken
                    if current_date_check_str in met_file_name:
                        current_met_file_name = met_file_name      
                met_data = csp.open_ioapi(os.path.join(domain_dir, current_met_file_name))    #open the file matched to the ob date, not the last file visited by the loop
                current_file_times = np.array(met_data.data_vars['ZH']['TSTEP'])
                grid_heights = np.array(met_data.data_vars['ZH'])
                met_data.close()
                current_file_datetimes = []    #stores all times the METCRO3D file has but as datetime objects
                n = 0
                for model_out_time in current_file_times:    #this loop converts the METCRO3D file str dates to datetime objects
                    model_out_time = str(model_out_time).replace('T', ' ').split('.')[0]
                    current_file_datetimes.append(datetime.datetime.strptime(model_out_time, '%Y-%m-%d %H:%M:%S'))
                    n += 1
                current_file_datetimes = np.array(current_file_datetimes)
                time_index = np.argmin(np.abs(current_file_datetimes - time_obj))
            last_date_check_str = current_date_check_str

            #subtract terrain elevation if using altitude above sea level
            if alt_type == 'ASL':
                aircraft_altitude = flight_alts[i] - ter_elev[0, 0, row, col]
            else:
                aircraft_altitude = flight_alts[i]

            #find the z coordinate of the closest gridpoint to the aircraft
            column_heights = grid_heights[time_index, :, row, col]
            lay = np.argmin(np.abs(column_heights - aircraft_altitude))

            #save the index data to the index array
            index_array[i, 0] = np.int64(flight_times[i].replace(' ', '').replace(':', '').replace('-', ''))
            index_array[i, 1] = current_date_check_str
            index_array[i, 2] = time_index
            index_array[i, 3] = lay
            index_array[i, 4] = row
            index_array[i, 5] = col

    
    #This section saves the index values to a NetCDF file if a file already exists to save data to
    if os.path.exists(index_file_path) == True:
        index_file = nc.Dataset(index_file_path, 'a')
        try:
            index_var = index_file.createVariable(flight_id, np.int64, ('time', 'index'))
        except RuntimeError:    #gives the user a more specific error message when they accidentally add a variable to a file twice.
            raise KeyError('CMAQ Flight Pairing: ' + flight_id + ' is already a variable name in the file ' + os.path.basename(index_file_path) + '. Choose another variable name.')
        index_var.units = 'Values by column: observation time, file date, hour index, layer index, row index, column index'
        index_var[:, :] = index_array
        index_file.close()
    
    #This section saves the index values to a NetCDF file if no file has been created to store data yet
    else:
        index_file = nc.Dataset(index_file_path, 'w')
        index_dim = index_file.createDimension('index', 6)
        time_dim = index_file.createDimension('time', None)
        index_file.title = 'Indices of flight location within CMAQ output'
        index_file.source = 'lat_lon_coords: ' + gridcro2d_file_path + ' met_file_directory: ' + os.path.dirname(gridcro2d_file_path)
        index_var = index_file.createVariable(flight_id, np.int64, ('time', 'index'))
        index_var.units = 'Values by column: observation time, file date, hour index, layer index, row index, column index'
        index_var[:, :] = index_array
        index_file.close()  
    print('Indices for ' + flight_id + ' successfully saved!')

# Extract CMAQ Data 

### View available flight ID's
The purpose of the `flight_ids()` function is to show the user the flights that are currently available in a given index NetCDF file and return a list of the available flights. When `print_ids` is set to `True`, running this function will print out the flight ID's of all flights that have been paired to the CMAQ grid. Each flight ID is a variable name in the NetCDF file.

Parameters:  
* `index_file_path` (str) - path to the index NetCDF of which the flight ID's will be turned into a list and then printed  
* `print_ids` (bool) [Default: `False`] - Tells the function whether or not to print out the flight ID's. When set to `True`, the ID's are printed; when set to `False`, they are not printed.

Returns:  
* `flight_ids` (list) - list containing the flight ID of every flight stored in the NetCDF file.


In [None]:
def flight_ids(index_file_path, print_ids = False):
    index_file = nc.Dataset(index_file_path)
    flight_id_list = []
    for flight_id in index_file.variables:
        flight_id_list.append(flight_id)
        if print_ids == True:
            print(flight_id)
    index_file.close()
    return flight_id_list

### Extract CMAQ Data along Flight Trajectory
The purpose of the `extract_cmaq_flight_data()` function is to pull out CMAQ data along the trajectory of a given flight and convert it into a pandas DataFrame so that it can easily be used and compared to flight observations.  

Parameters:  
* `flight_id` (str) - name of the index variable associated with the given flight in the index NetCDF file.
* `index_file_path` (str) - path to the NetCDF file storing flight index information. This can be an absolute or relative path but absolute is recommended.
* `cmaq_data_dir_path` (str) - path to the directory storing all CMAQ output files during the time of the flight. This can be an absolute or relative path but absolute is recommended. If `cmaq_output_type` is `'METCRO3D'` or `'GRIDCRO3D'`, `cmaq_data_dir_path` is automatically reset to the domain directory recorded in the index file, so the value passed to this parameter has no impact on the resulting DataFrame.
* `cmaq_output_type` (str) [Default: `'CONC_3D'`] - type of 3D CMAQ output file you would like to extract data from. Potential options are `'CONC'`, `'CONC_3D'`, `'METCRO3D'`, and `'GRIDCRO3D'` among others. The 3D grid of the file must be based on the grid box centers as opposed to the corners. For example: `'METCRO3D'` works, but not `'METDOT3D'`.
* `output_vars` (list or NoneType) [Default: `None`] - List of variables from the CMAQ output file that you would like to extract. If None is passed, data from all 3D variables will be extracted.
* `output_alt_type` (str) [Default: `'ASL'`] - Specifies whether the altitudes in the output DataFrame are in height above ground level or height above sea level. The options are `'ASL'` (above sea level) and `'AGL'` (above ground level). 
* `list_vars` (bool) [Default: `False`] - When set to `True`, this will tell the function to print a list of the variables within the selected CMAQ output file type. Setting this parameter to `True` will raise a `KeyError` and force the program to end. When set to `False`, no variable list will be printed and the function will not force a `KeyError`

Returns:  
* `cmaq_df` (pandas.DataFrame) - DataFrame containing CMAQ data for all selected 3D variables along the flight trajectory  

Notes:  
* If the aircraft was outside the model domain for a given time step, the data in the output DataFrame for that time step will be marked as `-9999.9999` for all variables in the DataFrame
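
Because out-of-domain time steps are filled with `-9999.9999`, they usually need to be masked before comparing model output with observations. The following is a minimal sketch of one way to do that, assuming the observations are held in a DataFrame with the same time index. The frames `cmaq_df` and `obs_df` below are hypothetical stand-ins with illustrative column names and values, not output from the function.

```python
import numpy as np
import pandas as pd

MISSING = -9999.9999    # value written for time steps outside the model domain

# Hypothetical model output and observations sharing a time index
times = ['2019-08-01 18:00:00', '2019-08-01 18:00:01', '2019-08-01 18:00:02']
cmaq_df = pd.DataFrame({'O3': [0.041, MISSING, 0.043],
                        'LAT': [43.60, MISSING, 43.62]},
                       index=pd.Index(times, name='Time'))
obs_df = pd.DataFrame({'O3_obs': [0.040, 0.050, 0.045]},
                      index=pd.Index(times, name='Time'))

# True for rows where no model value equals the missing-value marker
in_domain = ~np.isclose(cmaq_df.to_numpy(), MISSING).any(axis=1)

# Keep in-domain model rows and attach the observations taken at the same times
paired = cmaq_df.loc[in_domain].join(obs_df, how='inner')
```

Joining on the time index, rather than on row position, keeps the two tables aligned even after the out-of-domain rows have been removed from the model side.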

In [None]:
def extract_cmaq_flight_data(flight_id, index_file_path, cmaq_data_dir_path, cmaq_output_type = 'CONC_3D', output_vars = None, output_alt_type = 'ASL', list_vars = False):
    #extract index data from index NetCDF file
    index_file = nc.Dataset(index_file_path)
    index_data = index_file.variables[flight_id]
    flight_times = index_data[:, 0]
    file_dates = index_data[:, 1]
    time_indices = index_data[:, 2]
    lays = index_data[:, 3]
    rows = index_data[:, 4]
    cols = index_data[:, 5]
    _, gridcro2d_file_path, _, metcro3d_file_dir = index_file.source.split(' ')
    index_file.close()
    
    #If the desired file type is a domain/met file, the CMAQ data directory path gets changed to the domain directory
    if cmaq_output_type in ['METCRO3D', 'GRIDCRO3D']:
        cmaq_data_dir_path = metcro3d_file_dir
    
    #extract lat, lon, and surface elevation data from GRIDCRO2D file
    coord_data = nc.Dataset(gridcro2d_file_path)
    coord_dict = {'LON':np.array(coord_data.variables['LON'][0, 0, :, :]), 'LAT':np.array(coord_data.variables['LAT'][0, 0, :, :]), 'ALT':np.array(coord_data.variables['HT'][0, 0, :, :])}
    coord_data.close()
    
    #creates a list of all CMAQ output files of the desired output type
    selected_cmaq_files = []
    cmaq_output_split = cmaq_output_type.split('_')
    for file_name in sorted(os.listdir(cmaq_data_dir_path)):
        if cmaq_output_type in ['METCRO3D', 'GRIDCRO3D']:
            file_name_split = file_name.split('.')
        else:
            file_name_split = file_name.split('_')
        for i in range(len(cmaq_output_split), len(file_name_split) + 1):
            if cmaq_output_split == file_name_split[i - len(cmaq_output_split): i]:    #check to see if the cmaq_output_split list appears in the file_name_split list
                selected_cmaq_files.append(os.path.join(cmaq_data_dir_path, file_name))
    cmaq_flight_data = {}
    last_file_date = -1
    
    #iterates through all times in the flight and extracts model data associated with flight trajectory
    for i in range(len(file_dates)):
        #opens a new CMAQ file if one has not been opened or if the aircraft observation was taken a day
        #later than the previous observation and extracts CMAQ data
        if file_dates[i] != last_file_date:
            #closes old cmaq file if one is already open 
            if 'cmaq_data' in locals():
                cmaq_data.close()
               
            #finds new cmaq file path and opens it
            for file_path in selected_cmaq_files:
                if str(file_dates[i]) in os.path.basename(file_path):
                    current_file_path = file_path
            cmaq_data = nc.Dataset(current_file_path)
            
            if list_vars == True:
                for item in list(cmaq_data.variables):
                    print(item)
                raise KeyError('CMAQ Flight Pairing: No issue here! File variables have been listed and the process has been ended. To stop this error from occurring again, set the value of the list_vars parameter to False (the default option)')
                
            #populate the cmaq_flight_data dict with arrays to store values if cmaq_flight_data is empty 
            if cmaq_flight_data == {}:
                if output_vars is None:
                    output_vars = list(cmaq_data.variables)    #use every variable in the file when no list is given; a list is needed so the coordinate names can be appended
                else:
                    output_vars = output_vars[:]    #copy the list so that if the function is run multiple times in one script, the caller's list does not gain extra copies of the coordinate variables
                output_vars.append('LAT')
                output_vars.append('LON')
                output_vars.append('ALT')
                output_vars.append('Time')
                for var_name in output_vars:
                    if var_name != 'Time':
                        cmaq_flight_data[var_name] = np.zeros(file_dates.shape)
                    elif var_name == 'Time':
                        cmaq_flight_data[var_name] = [] 
            
            #if aircraft is outside model domain, set all variables except time to the missing value
            if lays[i] == rows[i] == cols[i] == -99999999:
                for key in output_vars:
                    if key == 'Time':
                        cmaq_flight_data[key].append(datetime.datetime.strptime(str(flight_times[i]), '%Y%m%d%H%M%S').strftime('%Y-%m-%d %H:%M:%S'))
                    elif key == 'LAT':
                        cmaq_flight_data[key][i] = -9999.9999
                    elif key == 'LON':
                        cmaq_flight_data[key][i] = -8888.8888    #the use of 8's here as opposed to 9's is so there are at least two unique values in the row. This prevents the row from getting removed as a result of the program thinking the flight has ended.
                    else:
                        cmaq_flight_data[key][i] = -9999.9999
            
            #extract data from selected grid point. If aircraft is outside model domain, at least open the vertical coordinate data required to add future data points
            for key in output_vars:
                if key not in ['LAT', 'LON', 'ALT', 'Time']:    #extracts non-coordinate data
                    if rows[i] != -99999999:
                        var_data = cmaq_data.variables[key]
                        if len(var_data.shape) == 4:
                            cmaq_flight_data[key][i] = var_data[time_indices[i], lays[i], rows[i], cols[i]]
                else:    #extracts coordinate data
                    if key == 'ALT':
                        if cmaq_output_type != 'METCRO3D':    #open METCRO3D file if it has not been opened and extract grid point height data
                            for domain_file_name in os.listdir(metcro3d_file_dir):
                                if 'METCRO3D' in domain_file_name and str(file_dates[i]) in domain_file_name:
                                    met_file_name = domain_file_name
                            met_data = nc.Dataset(os.path.join(metcro3d_file_dir, met_file_name))
                            grid_heights = np.array(met_data.variables['ZH'])
                            met_data.close()
                        elif cmaq_output_type == 'METCRO3D':    #extract grid point height data if METCRO3D is already open
                            grid_heights = cmaq_data.variables['ZH']
                        if rows[i] != -99999999:
                            if output_alt_type == 'AGL':    #in the case that the desired output altitude references height above ground level, no change is made to the value from the grid_heights array
                                cmaq_flight_data[key][i] = grid_heights[time_indices[i], lays[i], rows[i], cols[i]]
                            elif output_alt_type == 'ASL':    #in the case that the desired output altitude references height above sea level, land elevation is added to the grid point height above ground level
                                cmaq_flight_data[key][i] = grid_heights[time_indices[i], lays[i], rows[i], cols[i]] + coord_dict[key][rows[i], cols[i]]
                    elif rows[i] != -99999999:
                        if key == 'Time':
                            cmaq_flight_data[key].append(datetime.datetime.strptime(str(flight_times[i]), '%Y%m%d%H%M%S').strftime('%Y-%m-%d %H:%M:%S'))
                        else:    #extracts lat and lon values
                            cmaq_flight_data[key][i] = coord_dict[key][rows[i], cols[i]]
            
        #uses the currently opened CMAQ file if the date has not changed since the previous aircraft 
        #observation and extracts CMAQ data   
        elif file_dates[i] == last_file_date:
            #if aircraft is outside model domain, set all variables except time to the missing value
            if lays[i] == rows[i] == cols[i] == -99999999:
                for key in output_vars:
                    if key == 'Time':
                        cmaq_flight_data[key].append(datetime.datetime.strptime(str(flight_times[i]), '%Y%m%d%H%M%S').strftime('%Y-%m-%d %H:%M:%S'))
                    elif key == 'LAT':
                        cmaq_flight_data[key][i] = -9999.9999
                    elif key == 'LON':
                        cmaq_flight_data[key][i] = -8888.8888    #the use of 8's here as opposed to 9's is so there are at least two unique values in the row. This prevents the row from getting removed as a result of the program thinking the flight has ended.
                    else:
                        cmaq_flight_data[key][i] = -9999.9999
            
            #if aircraft is inside the model domain, extract data from selected grid point
            else:
                for key in output_vars:
                    if key not in ['LAT', 'LON', 'ALT', 'Time']:    #extracts non-coordinate data
                        var_data = cmaq_data.variables[key]
                        if len(var_data.shape) == 4:
                            cmaq_flight_data[key][i] = var_data[time_indices[i], lays[i], rows[i], cols[i]]
                    else:    #extracts coordinate data
                        if key == 'ALT':
                            if output_alt_type == 'AGL':    #in the case that the desired output altitude references height above ground level, no change is made to the value from the grid_heights array
                                cmaq_flight_data[key][i] = grid_heights[time_indices[i], lays[i], rows[i], cols[i]]
                            elif output_alt_type == 'ASL':    #in the case that the desired output altitude references height above sea level, land elevation is added to the grid point height above ground level
                                cmaq_flight_data[key][i] = grid_heights[time_indices[i], lays[i], rows[i], cols[i]] + coord_dict[key][rows[i], cols[i]]
                        elif key == 'Time':
                            cmaq_flight_data[key].append(datetime.datetime.strptime(str(flight_times[i]), '%Y%m%d%H%M%S').strftime('%Y-%m-%d %H:%M:%S'))
                        else:    #extracts lat and lon values
                            cmaq_flight_data[key][i] = coord_dict[key][rows[i], cols[i]]
        
        last_file_date = file_dates[i]
    cmaq_data.close()
    
    length_diff = len(cmaq_flight_data['LAT']) - len(cmaq_flight_data['Time'])
    if length_diff > 0:
        for extra_index in range(length_diff):
            cmaq_flight_data['Time'].append(0)
    
    #create a Pandas DataFrame based on the data and remove all rows where all values are zero. Then set all
    #missing values to -9999.9999 and set the DataFrame index to time.
    #Note: when all values in a row of the DataFrame are zero, the row is associated with a time beyond the end 
    #of the flight, so it is removed. This is due to all NetCDF variable arrays needing to have the same shape.
    #Extra values are added to flight index variables with a flight duration shorter than the longest flight.
    cmaq_df = pd.DataFrame(cmaq_flight_data)
    cmaq_df = cmaq_df.loc[cmaq_df.nunique(axis = 1) > 1]
    cmaq_df = cmaq_df.replace(-8888.8888, -9999.9999)    #this change is made to reduce the number of unique values in the row back down to one. Now this row can be recognized as a time where the flight was outside the model domain and can be removed with the corresponding row in the flight data.
    cmaq_df = cmaq_df.set_index('Time')
    
    print('Extraction of CMAQ data along trajectory of ' + flight_id + ' complete!')
    return cmaq_df

# Pair Flight and Extract CMAQ Data
The purpose of the `pair_and_extract()` function is to run `pair_flight()` and `extract_cmaq_flight_data()` in one step without permanently saving a NetCDF file of CMAQ indices. This may be more efficient when data from only one flight is needed. It lets the user decide whether saving time or saving storage space is more important.  

Parameters:  
* `flight_id` (str) - Acts as an identifier for the flight being submitted. Will be used as the NetCDF variable name
* `flight_lons` (list) - Provides the longitude of the aircraft at every observation
* `flight_lats` (list) - Provides the latitude of the aircraft at every observation
* `flight_alts` (list) - Provides the altitude of the aircraft at every observation in meters, either above sea level or above ground level as specified by `alt_type`
* `flight_times` (list) - Provides the time each observation was taken. Required format YYYY-MM-DD HH:MM:SS
* `cmaq_data_dir_path` (str) - path to the directory storing all CMAQ output files during the time of the flight. This can be an absolute or relative path but absolute is recommended.
* `domain_dir` (str) [Default value: `'2023_12US4'`] - The name of the directory where the GRIDCRO2D and METCRO3D are stored. This can either be an absolute path, or the section of the path between the `MOD3DATA` directory and the met directory that stores the files.
* `alt_type` (str) [Default value: `'ASL'`] - Specifies whether the altitudes passed to the function are in meters above ground level or meters above sea level. The options are `'ASL'` (above sea level) and `'AGL'` (above ground level)
* `cmaq_output_type` (str) [Default value: `'CONC'`] - type of 3D CMAQ output file you would like to extract data from.
* `output_vars` (list or NoneType) [Default value: `None`] - List of variables from the CMAQ output file that you would like to extract. If `None` is passed, data from all 3D variables will be extracted.  

Returns:  
* `cmaq_df` (pandas.DataFrame) - DataFrame containing CMAQ data for all selected 3D variables along the flight trajectory
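
Since the index file only needs to exist between the pairing and extraction steps, one possible pattern is to wrap those steps so the temporary file is removed even if a step fails. The sketch below is an alternative illustration using only the standard library, not how `pair_and_extract()` is written. The helper `with_temp_index_file` and the callable `work` are hypothetical, and `demo_work` merely stands in for the pairing and extraction calls.

```python
import os
import tempfile

def with_temp_index_file(flight_id, work):
    """Run work(path) with a temporary index-file path and always clean up.

    work is a hypothetical callable standing in for the pairing and
    extraction steps; it receives the path it may write the index file to.
    """
    tmp_dir = tempfile.mkdtemp(prefix='cmaq_pair_')
    path = os.path.join(tmp_dir, flight_id + '_temp_indices.nc')
    try:
        return work(path)
    finally:
        # Remove the file (if it was created) and its directory, even after an error
        if os.path.isfile(path):
            os.remove(path)
        os.rmdir(tmp_dir)

seen_paths = []    # records the path used so cleanup can be checked afterwards

def demo_work(path):
    # Stand-in for pairing and extraction: write a placeholder file and report that it exists
    seen_paths.append(path)
    with open(path, 'w') as placeholder:
        placeholder.write('placeholder')
    return os.path.isfile(path)

created = with_temp_index_file('FLIGHT01', demo_work)
```

With this pattern a crashed run leaves nothing behind, so there is no stale file to delete before the next run.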

In [5]:
def pair_and_extract(flight_id, flight_lons, flight_lats, flight_alts, flight_times, cmaq_data_dir_path, domain_dir = '2023_12US4', alt_type = 'ASL', cmaq_output_type = 'CONC', output_vars = None):
    nc_file_name = flight_id + '_temp_indices.nc'
    
    #if the NetCDF already exists, delete it. This would only occur if the function crashed and needed to be rerun.
    if os.path.isfile(nc_file_name) == True:
        os.remove(nc_file_name)
    
    #pair flight and extract data from CMAQ. Then, remove the NetCDF file.
    pair_flight(flight_id, flight_lons, flight_lats, flight_alts, flight_times, nc_file_name, domain_dir, alt_type = alt_type)    #alt_type is passed by keyword so it is not mistaken for model_resolution
    cmaq_df = extract_cmaq_flight_data(flight_id, nc_file_name, cmaq_data_dir_path, cmaq_output_type, output_vars)
    os.remove(nc_file_name)
    
    return cmaq_df