For general steps, just read the text in between the code.
For more detailed steps, read the comments in the code.

In [1]:
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)


Standard start-up of NumPy and Pandas. Pandas is a data analysis tool built on top of NumPy. The display precision option limits the number of decimals Pandas shows.

Below, the data is read into Pandas DataFrames. A DataFrame can be seen as one really big table of data.

In [2]:
df_01I = pd.read_csv('data/[01I].txt')  #inbound data-set
df_01O = pd.read_csv('data/[01O].txt')  #outbound data-set

Let's look at the top of the data-sets, using .head().

In [3]:
df_01I.head()

Unnamed: 0,FLIGHT_ID,FLIGHT_ID_1,TRACK_ID,TRACK_ID_1,X,Y,MODE_C,CALLSIGN,ICAO_ACTYPE,DEST,ADEP,FLIGHT_TYPE,RADAR,LANDING_TIME,TIME,Unnamed: 15
0,5170097,5170097,135322667,135322667,9586,18498,400,KLM7451,B737,EHAM,ENGM,INBOUND,ARTACC,22-10-2010 0:51:33,22-10-2010 0:20:01,
1,5170097,5170097,135322667,135322667,9562,18432,400,KLM7451,B737,EHAM,ENGM,INBOUND,ARTACC,22-10-2010 0:51:33,22-10-2010 0:20:06,
2,5170097,5170097,135322667,135322667,9536,18362,400,KLM7451,B737,EHAM,ENGM,INBOUND,ARTACC,22-10-2010 0:51:33,22-10-2010 0:20:10,
3,5170097,5170097,135322667,135322667,9512,18286,400,KLM7451,B737,EHAM,ENGM,INBOUND,ARTACC,22-10-2010 0:51:33,22-10-2010 0:20:15,
4,5170097,5170097,135322667,135322667,9486,18216,400,KLM7451,B737,EHAM,ENGM,INBOUND,ARTACC,22-10-2010 0:51:33,22-10-2010 0:20:20,


In [4]:
df_01O.head()

Unnamed: 0,FLIGHT_ID,FLIGHT_ID_1,TRACK_ID,TRACK_ID_1,X,Y,MODE_C,CALLSIGN,ICAO_ACTYPE,DEST,ADEP,FLIGHT_TYPE,RADAR,TAKEOFF_TIME,TIME,Unnamed: 15
0,5170107,5170107,135323318,135323318,-8,-66,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:38,
1,5170107,5170107,135323318,135323318,-24,-78,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:43,
2,5170107,5170107,135323318,135323318,-44,-90,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:48,
3,5170107,5170107,135323318,135323318,-66,-102,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:52,
4,5170107,5170107,135323318,135323318,-88,-118,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:57,


We can see duplicate columns and an empty final column. This is fixed by selecting only the wanted columns in both data-sets.

In [5]:
df_01I = df_01I[['FLIGHT_ID', 'TRACK_ID', 'X', 'Y', 'MODE_C', 'CALLSIGN', 'ICAO_ACTYPE', 'DEST', 'ADEP', 'FLIGHT_TYPE', 'RADAR', 'LANDING_TIME', 'TIME']]
df_01O = df_01O[['FLIGHT_ID', 'TRACK_ID', 'X', 'Y', 'MODE_C', 'CALLSIGN', 'ICAO_ACTYPE', 'DEST', 'ADEP', 'FLIGHT_TYPE', 'RADAR', 'TAKEOFF_TIME', 'TIME']]

The dataset now looks like:

In [6]:
df_01O.head()

Unnamed: 0,FLIGHT_ID,TRACK_ID,X,Y,MODE_C,CALLSIGN,ICAO_ACTYPE,DEST,ADEP,FLIGHT_TYPE,RADAR,TAKEOFF_TIME,TIME
0,5170107,135323318,-8,-66,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:38
1,5170107,135323318,-24,-78,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:43
2,5170107,135323318,-44,-90,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:48
3,5170107,135323318,-66,-102,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:52
4,5170107,135323318,-88,-118,-1.0,SQC7377,B744,OMSJ,EHAM,OUTBOUND,ARTACC,22-10-2010 1:12:48,22-10-2010 1:12:57
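
As an alternative to selecting the columns after loading, the same clean-up could be done while reading the file. The sketch below is only an illustration on made-up in-memory data: the sample rows and the names 'raw', 'wanted' and 'df_demo' are assumptions and not part of the original notebook. It uses the 'usecols' parameter of 'pd.read_csv' to skip the duplicate and empty columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for a raw file: duplicated ID columns and a trailing
# comma on every line, which pandas reads as an empty "Unnamed" column.
raw = io.StringIO(
    "FLIGHT_ID,FLIGHT_ID_1,TRACK_ID,TRACK_ID_1,X,Y,\n"
    "5170107,5170107,135323318,135323318,-8,-66,\n"
    "5170107,5170107,135323318,135323318,-24,-78,\n"
)

# Only the columns named here are loaded; everything else is skipped.
wanted = ["FLIGHT_ID", "TRACK_ID", "X", "Y"]
df_demo = pd.read_csv(raw, usecols=wanted)

print(list(df_demo.columns))
```

Selecting after loading, as done in In [5], has the advantage that the raw columns can be inspected first.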


With that fixed, the big data-set can be split into a training set and a prediction/validation set. The split is made per flight, so all points of one flight end up in the same set. Important to note here is that 'np.random.seed()' fixes the starting point of the pseudo-random number generator, so the same "random" split is produced every time the program runs.
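
A small illustration of what seeding does, separate from the actual split (the variable names 'first', 'second' and 'third' are made up for this sketch): re-seeding with the same value restarts the pseudo-random sequence, while drawing again without re-seeding continues it.

```python
import numpy as np

np.random.seed(31415)
first = np.random.rand(5)    # first five numbers of the seeded sequence

np.random.seed(31415)
second = np.random.rand(5)   # same seed, so the same five numbers again

third = np.random.rand(5)    # no re-seed: the sequence simply continues

print(np.array_equal(first, second))
print(np.array_equal(second, third))
```

This is why the inbound and outbound masks in the next cell differ from each other, even though only one seed is set.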

In [7]:

flights_01I = df_01I.groupby('FLIGHT_ID')  #group the data per flight ID
flights_01O = df_01O.groupby('FLIGHT_ID')

np.random.seed(31415)  #fix the seed so the same pseudo-random split is drawn on every run

msk_01I = np.random.rand(len(flights_01I)) < 0.75  #boolean mask with one entry per flight, True with probability 0.75
msk_01O = np.random.rand(len(flights_01O)) < 0.75  #repeated for outbound as well

flightIDs_01I = [b for a, b in zip(msk_01I, list(flights_01I.groups.keys())) if a]           #flight IDs that line up with True in the mask (training)
flightIDs_01IP = [b for a, b in zip(msk_01I, list(flights_01I.groups.keys())) if not a]      #flight IDs that line up with False (prediction)

flightIDs_01O = [b for a, b in zip(msk_01O, list(flights_01O.groups.keys())) if a]
flightIDs_01OP = [b for a, b in zip(msk_01O, list(flights_01O.groups.keys())) if not a]

df_11I = df_01I.loc[df_01I['FLIGHT_ID'].isin(flightIDs_01I)]                                 #from the big data-sets take all rows whose flight ID is in the list above
df_11IP = df_01I.loc[df_01I['FLIGHT_ID'].isin(flightIDs_01IP)]

df_11O = df_01O.loc[df_01O['FLIGHT_ID'].isin(flightIDs_01O)]
df_11OP = df_01O.loc[df_01O['FLIGHT_ID'].isin(flightIDs_01OP)]

df_11 = pd.concat([df_11I, df_11O], axis=0, join='outer', ignore_index=True)                 #merge the inbound and outbound flights into one set
df_11P = pd.concat([df_11IP, df_11OP], axis=0, join='outer', ignore_index=True)

df_11SE = df_11.head(200)                                                                    #take the first 200 rows as a 'Single Early' data-set
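
The point of splitting per flight ID rather than per row is that no flight ends up partly in the training set and partly in the prediction set. The sketch below checks this on made-up toy data. The frame 'toy' and all other names in it are hypothetical, and it takes the IDs from 'unique()' instead of from the groupby keys, which is a small variation on the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 8 flights with 3 radar points each.
toy = pd.DataFrame({
    "FLIGHT_ID": np.repeat(np.arange(100, 108), 3),
    "X": np.arange(24),
})

np.random.seed(31415)
ids = toy["FLIGHT_ID"].unique()            # one entry per flight
mask = np.random.rand(len(ids)) < 0.75     # True marks a training flight

train = toy[toy["FLIGHT_ID"].isin(ids[mask])]
pred = toy[toy["FLIGHT_ID"].isin(ids[~mask])]

# No flight may appear in both sets, and no row may be lost.
shared = set(train["FLIGHT_ID"]) & set(pred["FLIGHT_ID"])
print(len(shared))
print(len(train) + len(pred) == len(toy))
```

Because the mask is drawn per flight, the 75/25 ratio holds for the number of flights and only approximately for the number of rows.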





The bottoms of the data-sets can also be checked to see if everything is right, using .tail().

In [8]:
df_11.tail()

Unnamed: 0,FLIGHT_ID,TRACK_ID,X,Y,MODE_C,CALLSIGN,ICAO_ACTYPE,DEST,ADEP,FLIGHT_TYPE,RADAR,LANDING_TIME,TIME,TAKEOFF_TIME
258183,5172119,135377047,11658,9102,258.0,JAE7454,B744,ZSPD,EHAM,OUTBOUND,ARTACC,,22-10-2010 21:37:32,22-10-2010 21:16:13
258184,5172119,135377047,11732,9148,259.0,JAE7454,B744,ZSPD,EHAM,OUTBOUND,ARTACC,,22-10-2010 21:37:36,22-10-2010 21:16:13
258185,5172119,135377047,11804,9194,260.0,JAE7454,B744,ZSPD,EHAM,OUTBOUND,ARTACC,,22-10-2010 21:37:41,22-10-2010 21:16:13
258186,5172119,135377047,11878,9238,261.0,JAE7454,B744,ZSPD,EHAM,OUTBOUND,ARTACC,,22-10-2010 21:37:46,22-10-2010 21:16:13
258187,5172119,135377047,11952,9284,261.0,JAE7454,B744,ZSPD,EHAM,OUTBOUND,ARTACC,,22-10-2010 21:37:51,22-10-2010 21:16:13


In [9]:
df_11P.tail()

Unnamed: 0,FLIGHT_ID,TRACK_ID,X,Y,MODE_C,CALLSIGN,ICAO_ACTYPE,DEST,ADEP,FLIGHT_TYPE,RADAR,LANDING_TIME,TIME,TAKEOFF_TIME
85594,5172116,135378687,9996,-5644,298.0,CAI1618,B734,LTAI,EHAM,OUTBOUND,ARTACC,,22-10-2010 23:18:51,22-10-2010 23:02:32
85595,5172116,135378687,10052,-5696,300.0,CAI1618,B734,LTAI,EHAM,OUTBOUND,ARTACC,,22-10-2010 23:18:56,22-10-2010 23:02:32
85596,5172116,135378687,10108,-5748,300.0,CAI1618,B734,LTAI,EHAM,OUTBOUND,ARTACC,,22-10-2010 23:19:01,22-10-2010 23:02:32
85597,5172116,135378687,10164,-5798,302.0,CAI1618,B734,LTAI,EHAM,OUTBOUND,ARTACC,,22-10-2010 23:19:06,22-10-2010 23:02:32
85598,5172116,135378687,10220,-5850,302.0,CAI1618,B734,LTAI,EHAM,OUTBOUND,ARTACC,,22-10-2010 23:19:11,22-10-2010 23:02:32


Confirmed correct: judging by the last index values, the prediction set has far fewer data points (85,599 rows) than the normal set (258,188 rows), roughly in line with the 75/25 split. The sets can now be saved as Python pickles, an easy-to-use binary format. This is done so the data can be exchanged between programs.

In [10]:
df_11.to_pickle('data/[11].pkl')         #normal
df_11P.to_pickle('data/[11_P].pkl')      #prediction


df_11I.to_pickle('data/[11_I].pkl')      #inbound
df_11IP.to_pickle('data/[11_IP].pkl')    #inbound prediction
df_11O.to_pickle('data/[11_O].pkl')      #outbound
df_11OP.to_pickle('data/[11_OP].pkl')    #outbound prediction

df_11SE.to_pickle('data/[11_SE].pkl')    #single early



The first-level data-sets have now been created and written out as pickles.
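
For the receiving program, reading a pickle back could look like the sketch below. It is a minimal round trip on a made-up DataFrame written to a temporary directory, so the file name 'demo.pkl' and the variables 'demo' and 'restored' are assumptions and not files or names from this notebook.

```python
import os
import tempfile
import pandas as pd

# Hypothetical miniature of one of the data-sets.
demo = pd.DataFrame({
    "FLIGHT_ID": [5170107, 5170107],
    "X": [-8, -24],
    "MODE_C": [-1.0, -1.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    demo.to_pickle(path)               # write, as in In [10]
    restored = pd.read_pickle(path)    # read back in another program

# Values and column types survive the round trip unchanged.
print(restored.equals(demo))
```

Unlike a CSV file, a pickle keeps the column types, so nothing has to be parsed again on load. Pickles should only be loaded from sources that are trusted.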