# Trajectory prediction

As a Flight Operations Engineer researcher, it is your task to develop several trajectory prediction 
algorithms. The goal is to predict the position of the aircraft over the next 10 minutes from any given point. To this 
end, different models should be evaluated in order to propose to Eurocontrol which one should be explored 
further. 
The following requirements apply to the problem:
1. EDA + plots.
2. Data cleaning and variable conversion are expected.
3. Regression algorithm + another (explain which and why).
5. You should predict the trajectory in the next 10 minutes from a selected point.
    a. 4D output: latitude, longitude, altitude, and time.
    b. Show the degradation (or improvement) of the solution.
6. Your justification of the selected parameters used in your predictor should be validated using statistical tools or techniques such as feature engineering, or any other you think is valid. An explanation is expected.
7. You must justify the quality of your model using tools such as residuals, F-statistics, or any other relevant tool.


The optimization is divided as follows:
* Data initialization;
* Data-type conversion;
* Data cleaning & variable conversion;
* Data splitting:
    - Climb;
    - Enroute;
    - Descent;
* Data plotting & visualisation;
* Regression models

I still have to build the whole model; so far I have only filtered the data into separate flights. For each CSV file I split everything into separate flights using the "onground" column. Then I check for each "flight" whether it meets my requirements (is it long enough, does it start at LEBL, does it end at EHAM, etc.). If it meets all requirements, I add the flight. Going through all 4 batches this way gives roughly 2000 separate flights.

1. Read the CSV file with pandas `read_csv`
2. Convert the timestamp and last_position columns to datetime format
3. If the difference between timestamp and last_position of a row is more than 2 seconds, drop the row
4. Group the DataFrame by onground: `df['group'] = (df['onground'] != df['onground'].shift(1)).cumsum()`
5. Then use df.groupby on the new 'group' column and remove all groups where onground is 'True'
6. groupby now gives one group per 'flight'; for each group, check whether it really is a good flight, and remove it if not
7. You now have separate DataFrames that each contain only 1 flight, and the data can be cleaned per flight
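
The steps above can be sketched on a small toy batch. This is a minimal illustration, assuming the columns `timestamp`, `last_position` and `onground` as they appear in the data; the function name `split_into_flights`, the parameter `max_lag_s` and the toy values are made up for the example and are not part of the notebook.

```python
import pandas as pd


def split_into_flights(df: pd.DataFrame, max_lag_s: float = 2.0) -> list[pd.DataFrame]:
    """Split one batch into airborne segments using the 'onground' column."""
    # Step 3: drop rows whose position report is more than max_lag_s seconds old
    lag = (df["timestamp"] - df["last_position"]).dt.total_seconds()
    df = df[lag <= max_lag_s].copy()

    # Step 4: a new group starts wherever 'onground' changes value
    df["group"] = (df["onground"] != df["onground"].shift(1)).cumsum()

    # Steps 5-7: keep only the groups that are entirely airborne, one DataFrame each
    return [
        g.reset_index(drop=True)
        for _, g in df.groupby("group")
        if not g["onground"].any()
    ]


# Toy batch (hypothetical values): ground, flight, ground, flight
t = pd.Timestamp("2022-10-01 12:00:00") + pd.to_timedelta(list(range(8)), unit="s")
toy = pd.DataFrame({
    "timestamp": t,
    "last_position": t,
    "onground": [True, False, False, True, True, False, False, False],
})
flights = split_into_flights(toy)
print(len(flights), [len(f) for f in flights])
```

Each returned DataFrame can then be checked against the flight requirements (duration, start at LEBL, end at EHAM) before it is kept.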

1. Remove all rows where onground = True
2. New column timestamp_delta (s), greater than x seconds
3. New column: group

1. Remove all rows where onground = True
2. Group by ICAO code and day
### Data initialization

In [22]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns

In [23]:
file_location = r'C:/Users/delan/Documents/HvA/Aviation Y4 - FOE en stage/FOE S1/FOE CoP/Optimization algorithm/CSV'

# One DataFrame per CSV batch file
flight_files: list[pd.DataFrame] = []

for i in range(1, 26):
    file_path = os.path.join(file_location, f"File_{i}.csv")
    if not os.path.isfile(file_path):
        print(f"File not found: {file_path}")
        continue  # skip missing files instead of letting read_csv raise

    flight_files.append(pd.read_csv(file_path, low_memory=False))

### Initial data cleaning

Prepare the data for analysis:

Drop the columns 'Unnamed: 0', 'hour' and 'callsign'.
Convert 'timestamp' and 'last_position' to datetime.

In [24]:
for df in flight_files:
    # Values that cannot be parsed become NaT instead of raising an error
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
    df['last_position'] = pd.to_datetime(df['last_position'], errors='coerce')
    df['icao24'] = df['icao24'].astype(str)
    df.drop(columns=['Unnamed: 0', 'hour', 'callsign'], inplace=True)

In [25]:
flight_files[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083774 entries, 0 to 1083773
Data columns (total 14 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   timestamp      1083774 non-null  datetime64[ns]
 1   icao24         1083774 non-null  object        
 2   latitude       1083774 non-null  float64       
 3   longitude      1083774 non-null  float64       
 4   ground_speed   1061121 non-null  float64       
 5   track          1061121 non-null  float64       
 6   vertical_rate  1061121 non-null  float64       
 7   onground       1083774 non-null  bool          
 8   alert          1083774 non-null  bool          
 9   spi            1083774 non-null  bool          
 10  squawk         988737 non-null   float64       
 11  baro_altitude  1026550 non-null  float64       
 12  altitude       1020215 non-null  float64       
 13  last_position  1083774 non-null  datetime64[ns]
dtypes: bool(3), datetime64[ns](2), flo

Drop every row where the difference between 'timestamp' and 'last_position' is greater than 2 seconds.

In [26]:
for i in range(len(flight_files)):
    time_difference = (flight_files[i]['timestamp'] - flight_files[i]['last_position']).dt.total_seconds()
    flight_files[i] = flight_files[i][time_difference <= 2].copy()
    flight_files[i].drop(columns=['last_position'], inplace=True)

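Determine where the 'onground' column is equal to True, meaning the aircraft is not in the air. These rows should be deleted. The remaining rows are then split into groups wherever the gap between consecutive timestamps is 30 minutes or more.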

In [73]:
x_seconds_threshold = 60 * 30  # a gap of 30 minutes or more starts a new group

for i in range(len(flight_files)):
    # Remove all rows recorded on the ground
    flight_files[i] = flight_files[i].drop(flight_files[i][flight_files[i]['onground']].index)

    # Seconds elapsed since the previous row
    flight_files[i]['timestamp_delta'] = flight_files[i]['timestamp'].diff().dt.total_seconds()
    # Increment the group id at every large gap, starting from group 0
    flight_files[i]['group'] = (flight_files[i]['timestamp_delta'] >= x_seconds_threshold).cumsum()

    # Set the group to NaN on the gap rows themselves, so groupby skips them
    flight_files[i]['group'] = flight_files[i]['group'].mask(flight_files[i]['timestamp_delta'] >= x_seconds_threshold)
    flight_files[i].reset_index(drop=True, inplace=True)
    # Total duration per group in hours
    flight_files[i]['group_duration'] = flight_files[i].groupby('group')['timestamp_delta'].transform('sum') / 3600

In [64]:
flight_files[0].head()

| | timestamp | icao24 | latitude | longitude | ground_speed | track | vertical_rate | onground | alert | spi | squawk | baro_altitude | altitude | timestamp_delta | group | group_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-10-01 17:32:55 | 4853d1 | 52.300508 | 4.766968 | 83.528366 | 238.212747 | 0.0 | False | False | False | 2115.0 | 150.0 | NaN | NaN | 0.0 | 2.489444 |
| 1 | 2022-10-01 17:32:56 | 4853d1 | 52.300275 | 4.766262 | 87.132008 | 238.134022 | 0.0 | False | False | False | 2115.0 | 150.0 | 150.0 | 1.0 | 0.0 | 2.489444 |
| 2 | 2022-10-01 17:32:57 | 4853d1 | 52.300003 | 4.765625 | 87.132008 | 238.134022 | 0.0 | False | False | False | 2115.0 | 150.0 | 150.0 | 1.0 | 0.0 | 2.489444 |
| 3 | 2022-10-01 17:32:58 | 4853d1 | 52.299728 | 4.764938 | 98.792627 | 238.24052 | 0.0 | False | False | False | 2115.0 | 150.0 | 150.0 | 1.0 | 0.0 | 2.489444 |
| 4 | 2022-10-01 17:32:59 | 4853d1 | 52.299437 | 4.764221 | 98.792627 | 238.24052 | 0.0 | False | False | False | 2115.0 | 150.0 | 150.0 | 1.0 | 0.0 | 2.489444 |


In [75]:
flight_files[0][['latitude','longitude','ground_speed','track','vertical_rate','baro_altitude','altitude','group_duration']].describe()

| | latitude | longitude | ground_speed | track | vertical_rate | baro_altitude | altitude | group_duration |
|---|---|---|---|---|---|---|---|---|
| count | 940883.0 | 940883.0 | 939907.0 | 939907.0 | 939907.0 | 937512.0 | 937055.0 | 940738.0 |
| mean | 46.263755 | -7.279736 | 413.096442 | 165.987189 | 43.093693 | 28498.071251 | 29234.461264 | -49.756628 |
| std | 4.563016 | 29.287076 | 92.557482 | 136.428518 | 1142.272248 | 12624.795368 | 12901.594958 | 180.824858 |
| min | 25.908829 | -118.706939 | 58.591758 | 0.0 | -18880.0 | -1000.0 | -50.0 | -703.118056 |
| 25% | 42.55298 | 1.103985 | 384.774927 | 27.730532 | -64.0 | 20425.0 | 21050.0 | 1.743333 |
| 50% | 46.61879 | 1.644819 | 443.805881 | 140.954107 | 0.0 | 36000.0 | 36500.0 | 1.8425 |
| 75% | 50.342402 | 3.608643 | 472.546944 | 341.113913 | 64.0 | 37000.0 | 37625.0 | 1.974444 |
| max | 52.640625 | 19.83429 | 1125.599312 | 359.890027 | 26624.0 | 117300.0 | 127175.0 | 4.8325 |
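
The negative `group_duration` values in the summary above suggest that `timestamp.diff()` is being taken across rows that are not in time order, for example where one aircraft's track ends and the next one begins. A negative jump never reaches the 30-minute threshold, so it does not start a new group. One possible remedy, sketched here under that assumption, is to sort by aircraft and time and take the difference per aircraft. The function name `label_flights`, the columns `flight_id` and `flight_duration_h`, and the toy data are illustrative and not part of the notebook.

```python
import pandas as pd


def label_flights(df: pd.DataFrame, gap_s: float = 30 * 60) -> pd.DataFrame:
    """Assign a flight id per aircraft, starting a new flight after a long gap."""
    # Sort so that diff() always compares consecutive reports of one aircraft
    df = df.sort_values(["icao24", "timestamp"]).reset_index(drop=True)

    # Seconds since the previous report of the same aircraft (NaN on its first row)
    delta = df.groupby("icao24")["timestamp"].diff().dt.total_seconds()

    # A flight starts at the first row of an aircraft or after a gap >= gap_s
    new_flight = delta.isna() | (delta >= gap_s)
    df["flight_id"] = new_flight.cumsum() - 1  # ids start at 0

    # Duration in hours as last minus first timestamp, so it cannot be negative
    per_flight = df.groupby("flight_id")["timestamp"]
    span = per_flight.transform("max") - per_flight.transform("min")
    df["flight_duration_h"] = span.dt.total_seconds() / 3600
    return df


# Toy data (hypothetical): two interleaved aircraft, one with a two-hour gap
base = pd.Timestamp("2022-10-01 12:00:00")
toy = pd.DataFrame({
    "icao24": ["aaa", "bbb", "aaa", "bbb", "aaa"],
    "timestamp": [base + pd.Timedelta(seconds=s) for s in (0, 5, 60, 65, 7200)],
})
labelled = label_flights(toy)
print(labelled)
```

Computing the duration as last minus first timestamp also avoids counting the gap that opened the group.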


In [82]:
group_row_counts = flight_files[0].groupby('group').size()
print(group_row_counts.mean())

6443.41095890411


In [63]:
def filter_group(df):
    condition_1 = group_row_counts.quantile(0.25) < df['group'].shape[0] < group_row_counts.quantile(0.75)
    condition_2 = len(df['icao24'].unique()) == 1
    condition_3 = (pd.Timedelta(hours=1, minutes=30) < df['group_duration']) & (df['group_duration'] < pd.Timedelta(hours=2, minutes=30))

    # condition_5 = first_alt < 2000
    # condition_6 = last_alt < 2000

    return condition_1 and condition_2 and condition_3

valid_flights = [flight for flight in flight_files if filter_group(flight)]


    # flight_files[i] = flight_files[i].groupby('group').filter(lambda x: self.__filter_group(x)).groupby('group')

TypeError: '>' not supported between instances of 'numpy.ndarray' and 'Timedelta'

In [32]:
valid_flights

NameError: name 'valid_flights' is not defined
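
The `TypeError` above arises because `group_duration` holds plain floats (hours) while the bounds in `condition_3` are `pd.Timedelta` objects; the `NameError` follows because `valid_flights` was never assigned. In addition, `filter_group` receives a whole batch rather than a single group. A minimal sketch of a per-group check is shown here, comparing hours against hours. The name `is_valid_flight`, its parameters and the toy data are assumptions made for the example.

```python
import pandas as pd


def is_valid_flight(group: pd.DataFrame, min_rows: float, max_rows: float,
                    min_hours: float = 1.5, max_hours: float = 2.5) -> bool:
    """Return True if one flight group passes all checks."""
    enough_rows = min_rows < len(group) < max_rows
    single_aircraft = group["icao24"].nunique() == 1
    # group_duration is constant within a group and already in hours,
    # so it is compared against float bounds instead of pd.Timedelta
    duration = group["group_duration"].iloc[0]
    good_duration = min_hours < duration < max_hours
    return bool(enough_rows and single_aircraft and good_duration)


# Toy batch (hypothetical): group 0 lasts 2.0 h, group 1 lasts 5.0 h
toy = pd.DataFrame({
    "icao24": ["aaa"] * 3 + ["bbb"] * 3,
    "group": [0.0] * 3 + [1.0] * 3,
    "group_duration": [2.0] * 3 + [5.0] * 3,
})
kept = [g for _, g in toy.groupby("group") if is_valid_flight(g, min_rows=1, max_rows=10)]
print(len(kept))
```

For the real data, the row-count bounds could be taken from `group_row_counts.quantile(0.25)` and `group_row_counts.quantile(0.75)` as in the original cell, and the comprehension would loop over the groups of every batch in `flight_files`.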

Determine if aircraft is within end/start bounds of EHAM/LEBL

In [None]:
# # For the onground category, some values are equal to True, while other values do not seem that way
# df[df['onground'] == True]
# def in_region(lat, lon, min_lat, max_lat, min_lon, max_lon):
#     return (lat.between(min_lat, max_lat)) & (lon.between(min_lon, max_lon))

# onground_true_df = df[df['onground'] == True]

# # Define the latitude and longitude bounds for Amsterdam Schiphol Airport region
# min_latitude_amsterdam, max_latitude_amsterdam = 52.3000, 52.4000
# min_longitude_amsterdam, max_longitude_amsterdam = 4.7000, 4.8000

# # Define the latitude and longitude bounds for El Prat Barcelona Airport region
# min_latitude_barcelona, max_latitude_barcelona = 41.3000, 41.4000
# min_longitude_barcelona, max_longitude_barcelona = 2.0500, 2.1500

# # Check if any row is within the specified regions
# in_amsterdam_region = in_region(
#     onground_true_df['latitude'], onground_true_df['longitude'],
#     min_latitude_amsterdam, max_latitude_amsterdam, min_longitude_amsterdam, max_longitude_amsterdam
# ).any()

# in_barcelona_region = in_region(
#     onground_true_df['latitude'], onground_true_df['longitude'],
#     min_latitude_barcelona, max_latitude_barcelona, min_longitude_barcelona, max_longitude_barcelona
# ).any()

# print(f"The bounds are in the region of Schiphol: " + str(in_amsterdam_region))
# print(f"The bounds are in the region of El Prat: " + str(in_barcelona_region))
# # So, the on-ground column is ignored as outliers
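
The commented-out cell above tests whether any on-ground row falls inside an airport region. An alternative, sketched here as an assumption rather than the notebook's method, is to test only the first and last point of a single flight against the same bounding boxes. The names `in_box` and `connects` and the toy coordinates are made up for the example; the box limits are the ones from the cell above.

```python
import pandas as pd

# Bounding boxes as (min_lat, max_lat, min_lon, max_lon), taken from the cell above
EHAM_BOX = (52.30, 52.40, 4.70, 4.80)
LEBL_BOX = (41.30, 41.40, 2.05, 2.15)


def in_box(lat: float, lon: float, box: tuple[float, float, float, float]) -> bool:
    """Return True if the point lies inside the bounding box (edges included)."""
    min_lat, max_lat, min_lon, max_lon = box
    return bool(min_lat <= lat <= max_lat and min_lon <= lon <= max_lon)


def connects(flight: pd.DataFrame, origin_box, destination_box) -> bool:
    """Return True if the flight starts in origin_box and ends in destination_box."""
    first, last = flight.iloc[0], flight.iloc[-1]
    return (in_box(first["latitude"], first["longitude"], origin_box)
            and in_box(last["latitude"], last["longitude"], destination_box))


# Toy flight (hypothetical coordinates): starts near LEBL, ends near EHAM
toy = pd.DataFrame({
    "latitude": [41.35, 46.60, 52.35],
    "longitude": [2.10, 3.00, 4.75],
})
print(connects(toy, LEBL_BOX, EHAM_BOX))
```

Because on-ground rows are removed earlier, the first airborne point of a real flight may already lie outside such a narrow box, so the limits may need to be widened.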

### Data splitting

Split every batch (CSV) into separate DataFrames, one per flight, using the 'onground' column

In [None]:
# # A new segment starts wherever 'onground' changes value
# flight_ids = (batch1['onground'] != batch1['onground'].shift(1)).cumsum()

# flight_groups = batch1.groupby(flight_ids)

# # Keep only the airborne segments, one DataFrame per segment
# flight_dataframes = {flight_id: group for flight_id, group in flight_groups if not group['onground'].any()}

### Data cleaning

The data needs to be converted as follows:
* Timestamp - to datetime
* Callsign  - to string

* Longitude/latitude/altitude/barometric altitude equal to NaN - dropna
    - These values are treated as the aircraft being on the ground and not operating

Check that each flight:
* Starts at LEBL
* Ends at EHAM
* Has timestamps within limits

Per batch, plot the flight profiles on latitude, longitude and altitude.
If they look good, add them together.