# 1 - Read Flight Data

The data for the four flights are stored in HDF5 files:
* Flt1002-train.h5
* Flt1003-train.h5
* Flt1004-train.h5
* Flt1005-train.h5

The generic name of a flight is therefore `f'Flt100{flight_number}'`, with `flight_number` in `[2, 3, 4, 5]`.
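The naming pattern can be expanded into the full list of file names, for instance to loop over the flights. This is a minimal sketch; `FLIGHT_NUMBERS` and `flight_file_name` are illustrative names, not part of the notebook:

```python
FLIGHT_NUMBERS = [2, 3, 4, 5]

def flight_file_name(flight_number):
    """Return the HDF5 file name for a given flight number (2 to 5)."""
    return f'Flt100{flight_number}-train.h5'

# Build the list of file names from the pattern
file_names = [flight_file_name(n) for n in FLIGHT_NUMBERS]
print(file_names)
```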

In [1]:
import pandas as pd
import h5py

def read_filght_data(flight_number, verbose=False):
    file_name = f'../data/raw/Flt100{flight_number}-train.h5'
    df = pd.DataFrame()
    # read the HDF5 file; the context manager closes it when done
    with h5py.File(file_name, 'r') as flight_data:
        for key in flight_data.keys():
            data = flight_data[key]
            if data.shape != ():
                # array dataset: store it as one column
                df[key] = data[:]
                if verbose and df[key].isnull().any():
                    print(f'{key} contains NaN(s)')
            elif verbose:
                # scalar dataset: only report its value
                print(f'{key} = {data[()]}')

    # rename the columns according to 'Appendix B Datafields'
    datafields = pd.read_csv('../data/raw/datafields.csv',
                         header=None,
                         index_col=0).to_dict()[1]
    df = df.rename(columns=datafields,
                   errors="raise")
    
    # index by TIME (sort)
    df = df.sort_values(by=['TIME'])
    df.index = df['TIME']
    df.index.name = 'Time [s]'    
    return df


In [2]:
flight_number = 3
df = read_filght_data(flight_number, verbose=True)

N = 160030
drape contains NaN(s)
dt = 0.09611163227016886
ogs_alt contains NaN(s)
ogs_mag contains NaN(s)


"NOTE: The dt field in each HDF5 file is incorrect. The correct value is 0.1."
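The sampling interval can also be estimated from the time stamps themselves rather than read from the `dt` field. The sketch below assumes the `TIME` column is in seconds and sampled at a roughly constant rate. `estimate_dt` is a hypothetical helper, and the synthetic array stands in for `df['TIME'].values`:

```python
import numpy as np

def estimate_dt(time_s):
    """Estimate the sampling interval as the median step between sorted time stamps."""
    time_s = np.sort(np.asarray(time_s, dtype=float))
    return float(np.median(np.diff(time_s)))

# Synthetic 10 Hz time stamps (assumption); with flight data, pass df['TIME'].values instead
time_s = 1000.0 + 0.1 * np.arange(50)
dt_estimate = estimate_dt(time_s)
print(round(dt_estimate, 6))
```

The median is used rather than the mean so that a few gaps or duplicated time stamps do not skew the estimate.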

In [3]:
df[['FLUXB_X','FLUXC_X']].describe()

| | FLUXB_X | FLUXC_X |
|---|---|---|
| count | 160030.0 | 160030.0 |
| mean | 34805.294581 | -52089.150094 |
| std | 10137.198973 | 1958.944527 |
| min | -15.877 | -56392.728 |
| 25% | 25884.5125 | -53427.49425 |
| 50% | 35410.303 | -52414.0305 |
| 75% | 44255.286 | -51101.914 |
| max | 54512.841 | -37037.329 |


Checking our understanding of the geographic conventions: the `LAT`/`LONG` columns, converted to UTM zone 18N, should match the `UTM_X`/`UTM_Y` columns.

In [4]:
from pyproj import Transformer
import numpy as np

WGS_to_UTM = Transformer.from_crs(crs_from=4326, # EPSG:4326 World Geodetic System 1984, https://epsg.io/4326
                                  crs_to=32618)  # EPSG:32618 WGS 84/UTM zone 18N, https://epsg.io/32618

# Transform (LAT, LONG) -> (UTM_X, UTM_Y)
# EPSG:4326 expects coordinates in (latitude, longitude) order
UTM_X_pyproj, UTM_Y_pyproj = WGS_to_UTM.transform(df.LAT.values,
                                                  df.LONG.values)

# Check that the converted coordinates match the dataset coordinates to within 1.4 cm.
all(np.sqrt((df.UTM_X - UTM_X_pyproj)**2 + (df.UTM_Y - UTM_Y_pyproj)**2) < 0.014)

True
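A boolean result hides how close the two coordinate sets actually are. As a possible complement, the largest horizontal discrepancy can be reported directly. `max_horizontal_error` is a hypothetical helper, and the coordinates below are made up for illustration:

```python
import numpy as np

def max_horizontal_error(x_ref, y_ref, x_new, y_new):
    """Return the largest Euclidean distance between paired (x, y) points, in the input units."""
    dx = np.asarray(x_ref, dtype=float) - np.asarray(x_new, dtype=float)
    dy = np.asarray(y_ref, dtype=float) - np.asarray(y_new, dtype=float)
    return float(np.max(np.hypot(dx, dy)))

# Made-up coordinates in metres: the first point is offset by (3 mm, 4 mm)
err = max_horizontal_error([0.0, 10.0], [0.0, 10.0],
                           [0.003, 10.0], [0.004, 10.0])
print(err)
```

With the notebook's variables, the call would be `max_horizontal_error(df.UTM_X, df.UTM_Y, UTM_X_pyproj, UTM_Y_pyproj)`.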

## Export data

In the following, we will use the HDF5 file **Flt_data.h5**. We also export to CSV for convenience.  

In [5]:
for flight_number in range(2,6):
    df = read_filght_data(flight_number)
    # export to HDF5
    df.to_hdf('../data/interim/Flt_data.h5',
              key=f'Flt100{flight_number}')
    # export to csv
    df.to_csv(f'../data/interim/Flt_data_csv/Flt100{flight_number}.csv')

Let's check that the import works properly:

In [6]:
df2 = pd.read_hdf('../data/interim/Flt_data.h5',
                  key=f'Flt100{flight_number}')
# DataFrame.equals compares values; all(df2 == df) would only iterate over column labels
df2.equals(df)

True
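A note on this kind of check: Python's built-in `all` applied to a DataFrame iterates over the column labels rather than the cell values, so an expression of the form `all(df2 == df)` returns `True` whenever the labels are truthy. `DataFrame.equals` compares the values and treats NaNs at the same position as equal. Here is a small illustration on made-up frames (`a` and `b` are hypothetical, not flight data):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, np.nan]})
b = pd.DataFrame({'x': [1.0, 99.0], 'y': [3.0, np.nan]})  # differs from a in column 'x'

# all() iterates over the column labels 'x' and 'y', which are non-empty strings
labels_check = all(a == b)

# equals() compares the actual values, with NaNs in the same location counted as equal
values_check = a.equals(b)
same_frame_check = a.equals(a.copy())

print(labels_check, values_check, same_frame_check)
```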