In [1]:
from pathlib import Path

import pandas as pd
import functions.data_acquisition as data_funcs

%cd ..

import src.configuration as config

/workspaces/Dansah2_Aviation_Final_Project


## 2. On-time performance data from the Bureau of Transportation Statistics (BTS)

Data source: [Airline Service Quality Performance 234 (On-Time performance data)](https://www.bts.gov/browse-statistical-products-and-data/bts-publications/airline-service-quality-performance-234-time)

The data generated in the notebook was uploaded to Kaggle as an open-source dataset.

Contributors: [Dyimah Ansah](https://www.kaggle.com/dyimahansah), [Madeshwaran Selvaraj](https://www.kaggle.com/madeshwaranselvaraj), [Adam Val](https://www.kaggle.com/valadamcanton), [George Perdrizet](https://www.kaggle.com/gperdrizet)

Features: 

The fields in the asc files are unlabeled. We confirmed each feature by cross-referencing a sample record from an asc file against the matching record from the BTS manual download page, located [here](https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr).

'carrier_code': A unique identifier assigned to a company that provides transportation services.

'flight_number': A code assigned to a specific flight, acting as its identifier.

'iata_code_reporting_airline': The airline designator used to identify an airline for commercial purposes.

'origin': The three-character code of the departure airport.

'destination': The three-character code of the arrival airport.

'date': The date of flight departure.

'departure_time': The time of flight departure.

'cancellation': Whether the flight was cancelled (1 = cancelled flight, 0 = completed flight).

'departure_delay': How long, in minutes, the flight was delayed in departing the origin airport.

'arrival_delay': How long, in minutes, the flight was delayed in arriving at the destination airport.

'net_delay': The difference between the time the flight was scheduled to arrive and its actual arrival time.

'tail_number': The aircraft registration number, also known as the N-Number.
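
The cross-referencing step described above can be sketched in a few lines. This is only an illustration of the idea, not the procedure we actually ran: `match_positions` is a hypothetical helper, and the sample values are made up rather than taken from the BTS files. It reports, for each labeled value, the positions in an unlabeled record that hold the same value.

```python
def match_positions(unlabeled_row, labeled_record):
    """Map each label to the positions in the unlabeled row holding the same value."""
    matches = {}
    for label, value in labeled_record.items():
        matches[label] = [
            i for i, field in enumerate(unlabeled_row)
            if str(field).strip() == str(value).strip()
        ]
    return matches


# Made-up values for illustration only (not real BTS records)
unlabeled_row = ['XX', '100', 'AAA', 'BBB', '20240101']
labeled_record = {'flight_number': '100', 'origin': 'AAA', 'destination': 'BBB'}

positions = match_positions(unlabeled_row, labeled_record)
print(positions)
```

A value that occurs more than once in a record yields several candidate positions, so an ambiguous field would need a second sample record to resolve.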



### Download on-time data

In [2]:
links = data_funcs.get_ontime_links(config.ONTIME_DATA_URL)
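
The implementation of `get_ontime_links` lives in `functions/data_acquisition.py` and is not shown in this notebook. As a rough sketch of how download links could be collected from a page, assuming the files are exposed as ordinary `<a href>` anchors, the standard library is enough. `LinkCollector`, `extract_links`, the keyword filter, and the example URL are all hypothetical and are not part of this project's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from anchor tags whose href contains a keyword."""

    def __init__(self, base_url, keyword):
        super().__init__()
        self.base_url = base_url
        self.keyword = keyword
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and self.keyword in href:
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, keyword):
    collector = LinkCollector(base_url, keyword)
    collector.feed(html)
    return collector.links


# Inline HTML stands in for a downloaded page (hypothetical content)
sample_html = '<a href="/files/ontime_202412.zip">Dec</a><a href="/about">About</a>'
found = extract_links(sample_html, 'https://example.com/data/', 'ontime')
print(found)
```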

### Parse and combine on-time datafiles

In [3]:
ontime_df = data_funcs.parse_asc_datafiles(
    config.ONTIME_FILES,
    config.RAW_DATA_DIRECTORY,
    config.RAW_ONTIME_CSV_FILE
)

./data/raw/ontime.td.202412.asc
./data/raw/ontime.td.202410.asc
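
Because the asc files carry no header row, the reader has to be told not to treat the first record as column names. In the transposed preview that follows, labels such as `DL`, `4032` and `4032.1` look like values from a data record rather than field names, which may indicate that a first record was consumed as a header somewhere along the way. The sketch below shows the headerless read with pandas. The pipe delimiter and the sample records are assumptions made for illustration, and this is not the code inside `parse_asc_datafiles`.

```python
import io

import pandas as pd

# Two made-up, pipe-delimited records with no header line (assumed layout)
sample = "XX|100|AAA|BBB\nYY|200|CCC|DDD\n"

# header=None keeps the first record as data and labels columns 0..n-1
df = pd.read_csv(io.StringIO(sample), sep='|', header=None, dtype=str)

print(df.columns.tolist())
print(len(df))
```

With integer column labels, positional selection via `iloc` and label-based selection agree, which makes an index-to-name mapping easier to audit.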


In [4]:
ontime_df.head().transpose()

Unnamed: 0,0,1,2,3,4
DL,DL,DL,DL,DL,DL
4032,4032.0,3667.0,4066.0,4066.0,3664.0
OO,OO,OO,OO,OO,OO
4032.1,4032.0,3667.0,4066.0,4066.0,3664.0
9E,9E,9E,9E,9E,9E
...,...,...,...,...,...
29,,,,,
10,,,,,
75,,,,,
0.30,,,,,


In [5]:
# Map each feature name to its column position in the unlabeled asc data
ontime_features = {
    'carrier_code': 0,
    'flight_number': 1,
    'iata_code_reporting_airline': 4,
    'origin': 6,
    'destination': 7,
    'date': 8,
    'departure_time': 12,
    'cancellation': 16,
    'departure_delay': 18,
    'arrival_delay': 21,
    'net_delay': 22,
    'tail_number': 25
}

# Select the columns by position, then assign the feature names
extracted_ontime_df = ontime_df.iloc[:, list(ontime_features.values())].copy()
extracted_ontime_df.columns = list(ontime_features.keys())

# Drop rows with a missing value in any selected feature
extracted_ontime_df.dropna(inplace=True)
extracted_ontime_df.head()

Unnamed: 0,carrier_code,flight_number,iata_code_reporting_airline,origin,destination,date,departure_time,cancellation,departure_delay,arrival_delay,net_delay,tail_number
0,DL,4032.0,9E,LGA,ORF,20241213.0,1338.0,0,95.0,164.0,6,N915XJ
1,DL,3667.0,9E,DTW,MSN,20241202.0,821.0,0,30.0,-2.0,62,N907XJ
2,DL,4066.0,9E,MSP,BIS,20241202.0,901.0,0,104.0,4.0,8,N166PQ
3,DL,4066.0,9E,BIS,MSP,20241202.0,1135.0,0,92.0,10.0,4,N166PQ
4,DL,3664.0,9E,MSN,DTW,20241202.0,928.0,0,143.0,-7.0,-60,N907XJ
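
The `date` and `departure_time` columns come through as floats (for example `20241213.0` and `1338.0`). A minimal sketch of combining them into a single timestamp follows, assuming `date` is `YYYYMMDD` and `departure_time` is a 24-hour `HHMM` value. That reading is an assumption based on the preview above, and the `departure_datetime` column name is hypothetical.

```python
import pandas as pd

# Two date/time pairs copied from the preview above
sample = pd.DataFrame({
    'date': [20241213.0, 20241202.0],
    'departure_time': [1338.0, 821.0],
})

# Drop the float suffix, then left-pad times such as 821 to '0821'
date_str = sample['date'].astype(int).astype(str)
time_str = sample['departure_time'].astype(int).astype(str).str.zfill(4)

# Assumes YYYYMMDD dates and HHMM times
sample['departure_datetime'] = pd.to_datetime(
    date_str + time_str, format='%Y%m%d%H%M'
)
print(sample['departure_datetime'])
```

A time recorded as `2400`, if any exist in the data, would not parse with `%H%M` and would need separate handling.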
