<h1> Data Handling - Preparing Data for Modelling </h1>

<h2> Preliminary Steps </h2>

Let's begin with importing the necessary libraries:

In [51]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from bs4 import BeautifulSoup
import re
import requests
from datetime import datetime, timedelta
import json


For this step, we need to load the dataframes which are the final product of our two crawling steps:

In [29]:
df_arrivals = pd.read_csv("flight_arrivals_JFK_raw_data.csv")
df_departures = pd.read_csv("flight_departures_JFK_raw_data.csv")

<h2> The Process </h2>

<h3> Data Analysis </h3>

First, Let's merge both of the csv files to one big DataFrame

In [30]:
merged_df = pd.concat([df_arrivals, df_departures])
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS
0,05 May 2023,New York (JFK),Teterboro (TEB),GLF4 (N455MB),0:31,14:45,15:33,16:10,Landed 16:04
1,05 May 2023,Bermuda (BDA),New York (JFK),GLF4 (N455MB),2:01,12:30,13:23,13:24,Landed 14:24
2,03 May 2023,White Plains (HPN),Bermuda (BDA),GLF4 (N455MB),1:44,07:30,08:24,10:17,Landed 11:08
3,03 May 2023,Teterboro (TEB),White Plains (HPN),GLF4 (N455MB),0:22,06:00,06:33,06:24,Landed 06:55
4,30 Apr 2023,Van Nuys (VNY),Teterboro (TEB),GLF4 (N455MB),4:48,14:45,14:45,22:34,Landed 22:34
...,...,...,...,...,...,...,...,...,...
63215,29 Jan 2023,Mexico City (MEX),New York (JFK),A306 (XA-UYR),3:51,08:15,01:44,14:00,Landed 06:35
63216,22 Jan 2023,Mexico City (MEX),New York (JFK),B762 (XA-EFR),—,08:15,18:37,14:00,Diverted to ORD
63217,19 Jan 2023,New York (JFK),Chicago (ORD),B762 (XA-LRC),2:04,16:55,17:04,18:05,Landed 18:08
63218,19 Jan 2023,Guatemala City (GUA),New York (JFK),B762 (XA-LRC),4:13,10:30,09:21,15:25,Landed 14:34


In [31]:
#remove empty cells to make sure the data is complete

merged_df = merged_df.dropna()
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS
0,05 May 2023,New York (JFK),Teterboro (TEB),GLF4 (N455MB),0:31,14:45,15:33,16:10,Landed 16:04
1,05 May 2023,Bermuda (BDA),New York (JFK),GLF4 (N455MB),2:01,12:30,13:23,13:24,Landed 14:24
2,03 May 2023,White Plains (HPN),Bermuda (BDA),GLF4 (N455MB),1:44,07:30,08:24,10:17,Landed 11:08
3,03 May 2023,Teterboro (TEB),White Plains (HPN),GLF4 (N455MB),0:22,06:00,06:33,06:24,Landed 06:55
4,30 Apr 2023,Van Nuys (VNY),Teterboro (TEB),GLF4 (N455MB),4:48,14:45,14:45,22:34,Landed 22:34
...,...,...,...,...,...,...,...,...,...
63215,29 Jan 2023,Mexico City (MEX),New York (JFK),A306 (XA-UYR),3:51,08:15,01:44,14:00,Landed 06:35
63216,22 Jan 2023,Mexico City (MEX),New York (JFK),B762 (XA-EFR),—,08:15,18:37,14:00,Diverted to ORD
63217,19 Jan 2023,New York (JFK),Chicago (ORD),B762 (XA-LRC),2:04,16:55,17:04,18:05,Landed 18:08
63218,19 Jan 2023,Guatemala City (GUA),New York (JFK),B762 (XA-LRC),4:13,10:30,09:21,15:25,Landed 14:34


In [32]:
#Verifying that there are no empty cells that need to be handled

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 201508 entries, 0 to 63219
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   DATE         201508 non-null  object
 1   FROM         201508 non-null  object
 2   TO           201508 non-null  object
 3   AIRCRAFT     201508 non-null  object
 4   FLIGHT TIME  201508 non-null  object
 5   STD          201508 non-null  object
 6   ATD          201508 non-null  object
 7   STA          201508 non-null  object
 8   STATUS       201508 non-null  object
dtypes: object(9)
memory usage: 15.4+ MB


In [33]:
#In the below command, we are looking for a non-null cells that contain '-' and delete all those rows because 
#these cells represent a missing data and therefore are like empty cells

mask = merged_df.apply(lambda x: x.str.contains('—')).any(axis=1)
merged_df = merged_df[~mask]
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183360 entries, 0 to 63219
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   DATE         183360 non-null  object
 1   FROM         183360 non-null  object
 2   TO           183360 non-null  object
 3   AIRCRAFT     183360 non-null  object
 4   FLIGHT TIME  183360 non-null  object
 5   STD          183360 non-null  object
 6   ATD          183360 non-null  object
 7   STA          183360 non-null  object
 8   STATUS       183360 non-null  object
dtypes: object(9)
memory usage: 14.0+ MB


In [34]:
#This line of code is resetting the indexes because after we deleted the empty cells, there were gaps in the index

merged_df = merged_df.reset_index(drop=True)

In [35]:
#Deleting 'Landed' part from the STATUS column and preparing the data for the visualization step 
#(all columns need to be int/float)

merged_df['STATUS'] = merged_df['STATUS'].str.replace('Landed ', '')
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS
0,05 May 2023,New York (JFK),Teterboro (TEB),GLF4 (N455MB),0:31,14:45,15:33,16:10,16:04
1,05 May 2023,Bermuda (BDA),New York (JFK),GLF4 (N455MB),2:01,12:30,13:23,13:24,14:24
2,03 May 2023,White Plains (HPN),Bermuda (BDA),GLF4 (N455MB),1:44,07:30,08:24,10:17,11:08
3,03 May 2023,Teterboro (TEB),White Plains (HPN),GLF4 (N455MB),0:22,06:00,06:33,06:24,06:55
4,30 Apr 2023,Van Nuys (VNY),Teterboro (TEB),GLF4 (N455MB),4:48,14:45,14:45,22:34,22:34
...,...,...,...,...,...,...,...,...,...
183355,30 Jan 2023,New York (JFK),Chicago (ORD),A306 (XA-UYR),2:13,06:30,09:16,10:28,10:29
183356,29 Jan 2023,Mexico City (MEX),New York (JFK),A306 (XA-UYR),3:51,08:15,01:44,14:00,06:35
183357,19 Jan 2023,New York (JFK),Chicago (ORD),B762 (XA-LRC),2:04,16:55,17:04,18:05,18:08
183358,19 Jan 2023,Guatemala City (GUA),New York (JFK),B762 (XA-LRC),4:13,10:30,09:21,15:25,14:34


In [36]:
#In the code below, we are running in a loop on all the cells of 'STATUS' column and we are inserting to 
#a new column called 'DIFFERENCE' the difference bewteen 'STA' (scheduled time of arrival) 
#and 'STATUS' (actual time of arrival) for each line accordingly

difflist = []
for i in range(len(merged_df['STATUS'])):
    if ':' in merged_df['STA'][i] and ':' in merged_df['STATUS'][i]:
        time1 = datetime.strptime(merged_df['STA'][i], '%H:%M')
        time2 = datetime.strptime(merged_df['STATUS'][i], '%H:%M')
        diff = time2-time1
        total_minutes = diff.seconds // 60
        hours, minutes = divmod(total_minutes, 60)
        if diff.days < 0 and hours >= 23:
            diff = timedelta(1) - diff
            total_minutes = diff.seconds // 60
            total_minutes = -total_minutes
        else:
            total_minutes = diff.seconds // 60
        difflist.append(total_minutes)
    else:
            difflist.append(None)



merged_df['DIFFERENCE'] = difflist
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS,DIFFERENCE
0,05 May 2023,New York (JFK),Teterboro (TEB),GLF4 (N455MB),0:31,14:45,15:33,16:10,16:04,-6.0
1,05 May 2023,Bermuda (BDA),New York (JFK),GLF4 (N455MB),2:01,12:30,13:23,13:24,14:24,60.0
2,03 May 2023,White Plains (HPN),Bermuda (BDA),GLF4 (N455MB),1:44,07:30,08:24,10:17,11:08,51.0
3,03 May 2023,Teterboro (TEB),White Plains (HPN),GLF4 (N455MB),0:22,06:00,06:33,06:24,06:55,31.0
4,30 Apr 2023,Van Nuys (VNY),Teterboro (TEB),GLF4 (N455MB),4:48,14:45,14:45,22:34,22:34,0.0
...,...,...,...,...,...,...,...,...,...,...
183355,30 Jan 2023,New York (JFK),Chicago (ORD),A306 (XA-UYR),2:13,06:30,09:16,10:28,10:29,1.0
183356,29 Jan 2023,Mexico City (MEX),New York (JFK),A306 (XA-UYR),3:51,08:15,01:44,14:00,06:35,995.0
183357,19 Jan 2023,New York (JFK),Chicago (ORD),B762 (XA-LRC),2:04,16:55,17:04,18:05,18:08,3.0
183358,19 Jan 2023,Guatemala City (GUA),New York (JFK),B762 (XA-LRC),4:13,10:30,09:21,15:25,14:34,-51.0


In [37]:
#Verifying that there are no empty cells that need to be handled

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183360 entries, 0 to 183359
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   DATE         183360 non-null  object 
 1   FROM         183360 non-null  object 
 2   TO           183360 non-null  object 
 3   AIRCRAFT     183360 non-null  object 
 4   FLIGHT TIME  183360 non-null  object 
 5   STD          183360 non-null  object 
 6   ATD          183360 non-null  object 
 7   STA          183360 non-null  object 
 8   STATUS       183360 non-null  object 
 9   DIFFERENCE   183067 non-null  float64
dtypes: float64(1), object(9)
memory usage: 14.0+ MB


In [38]:
merged_df.dropna(subset=['DIFFERENCE'], inplace=True)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183067 entries, 0 to 183359
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   DATE         183067 non-null  object 
 1   FROM         183067 non-null  object 
 2   TO           183067 non-null  object 
 3   AIRCRAFT     183067 non-null  object 
 4   FLIGHT TIME  183067 non-null  object 
 5   STD          183067 non-null  object 
 6   ATD          183067 non-null  object 
 7   STA          183067 non-null  object 
 8   STATUS       183067 non-null  object 
 9   DIFFERENCE   183067 non-null  float64
dtypes: float64(1), object(9)
memory usage: 15.4+ MB


In [39]:
#In this step we are creating a new list called 'arcrft_reg' which represent the actual aircraft of the flight 
#and we are copying into this list only the string that inside of the brackets 
#in 'AIRCRAFT' column

arcrft_reg = [item.split('(')[-1].strip(')') for item in merged_df['AIRCRAFT']]
arcrft_reg

['N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',
 'N455MB',

<h4> Saving to a DataFrame </h4>

In [40]:
df_arcrft_reg = pd.DataFrame(arcrft_reg)
df_arcrft_reg

Unnamed: 0,0
0,N455MB
1,N455MB
2,N455MB
3,N455MB
4,N455MB
...,...
183062,XA-UYR
183063,XA-UYR
183064,XA-LRC
183065,XA-LRC


<h4> Removing duplicates in order to leave only unique values in the list </h4>

In [41]:
df_arcrft_reg = df_arcrft_reg.drop_duplicates()
df_arcrft_reg

Unnamed: 0,0
0,N455MB
132,GLF4
139,B-222J
140,B-2077
141,B-223A
...,...
182896,OE-IFB
183019,XA-LRC
183023,XA-GGL
183031,XA-EFR


<h5> Saving to a csv file </h5>

In [14]:
df_arcrft_reg.to_csv("aircraft_reg_raw.csv")

<h4> after we saved the file in the line above, we worked on it in the Arrivals scraping Notebook. in the below line of code we are reading from the  updated csv to a dataframe and we are cleaning all the none-null cells than containg "-" in them, because they are like empty cells </h4>

In [42]:
df_arcrft_age = pd.read_csv("aircraft_reg_new.csv")

In [43]:
df_arcrft_age = df_arcrft_age.replace("-", np.nan)

In [44]:
df_arcrft_age = df_arcrft_age.dropna()

In [45]:
df_arcrft_age.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4876 entries, 0 to 4879
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   FLIGHT REG  4876 non-null   object
 1   AGE         4876 non-null   object
dtypes: object(2)
memory usage: 114.3+ KB


In [46]:
#In the below step we are looking for 'Brand New' airplanes and changing their value to 0 so all 
#the value would be numerical, in addition, we are deleting from each cell of 'AGE' the 'years' word 
#from the string in order to convert this column to int

df_arcrft_age.replace('Brand new', 0, inplace=True)
df_arcrft_age['AGE'] = df_arcrft_age['AGE'].apply(lambda x: str(x).replace(' year', '') if isinstance(x, str) else x)
df_arcrft_age

Unnamed: 0,FLIGHT REG,AGE
0,N455MB,19s
1,B-222J,0
2,B-2077,13s
3,B-223A,0
4,B-222N,0
...,...,...
4875,OE-IFB,19s
4876,XA-LRC,36s
4877,XA-GGL,31s
4878,XA-EFR,35s


<b> In the following 3 blocks of code we are doing the follwing: <br> 1. creating a new column called 'AGE'. <br> 2. extracting the string of the aircraft inside the brackets in the AIRCRAFT column. <br> 3. for each row, we are looking in the - df_arcrft_age for the actual age and copying it to the 'AGE' column in the merged DataFrame </b>

In [47]:
merged_df['AGE'] = pd.Series([])
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS,DIFFERENCE,AGE
0,05 May 2023,New York (JFK),Teterboro (TEB),GLF4 (N455MB),0:31,14:45,15:33,16:10,16:04,-6.0,
1,05 May 2023,Bermuda (BDA),New York (JFK),GLF4 (N455MB),2:01,12:30,13:23,13:24,14:24,60.0,
2,03 May 2023,White Plains (HPN),Bermuda (BDA),GLF4 (N455MB),1:44,07:30,08:24,10:17,11:08,51.0,
3,03 May 2023,Teterboro (TEB),White Plains (HPN),GLF4 (N455MB),0:22,06:00,06:33,06:24,06:55,31.0,
4,30 Apr 2023,Van Nuys (VNY),Teterboro (TEB),GLF4 (N455MB),4:48,14:45,14:45,22:34,22:34,0.0,
...,...,...,...,...,...,...,...,...,...,...,...
183355,30 Jan 2023,New York (JFK),Chicago (ORD),A306 (XA-UYR),2:13,06:30,09:16,10:28,10:29,1.0,
183356,29 Jan 2023,Mexico City (MEX),New York (JFK),A306 (XA-UYR),3:51,08:15,01:44,14:00,06:35,995.0,
183357,19 Jan 2023,New York (JFK),Chicago (ORD),B762 (XA-LRC),2:04,16:55,17:04,18:05,18:08,3.0,
183358,19 Jan 2023,Guatemala City (GUA),New York (JFK),B762 (XA-LRC),4:13,10:30,09:21,15:25,14:34,-51.0,


In [48]:
#In the below code we are creating a new column calles MODEL

merged_df['MODEL'] = pd.Series([])
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS,DIFFERENCE,AGE,MODEL
0,05 May 2023,New York (JFK),Teterboro (TEB),GLF4 (N455MB),0:31,14:45,15:33,16:10,16:04,-6.0,,
1,05 May 2023,Bermuda (BDA),New York (JFK),GLF4 (N455MB),2:01,12:30,13:23,13:24,14:24,60.0,,
2,03 May 2023,White Plains (HPN),Bermuda (BDA),GLF4 (N455MB),1:44,07:30,08:24,10:17,11:08,51.0,,
3,03 May 2023,Teterboro (TEB),White Plains (HPN),GLF4 (N455MB),0:22,06:00,06:33,06:24,06:55,31.0,,
4,30 Apr 2023,Van Nuys (VNY),Teterboro (TEB),GLF4 (N455MB),4:48,14:45,14:45,22:34,22:34,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...
183355,30 Jan 2023,New York (JFK),Chicago (ORD),A306 (XA-UYR),2:13,06:30,09:16,10:28,10:29,1.0,,
183356,29 Jan 2023,Mexico City (MEX),New York (JFK),A306 (XA-UYR),3:51,08:15,01:44,14:00,06:35,995.0,,
183357,19 Jan 2023,New York (JFK),Chicago (ORD),B762 (XA-LRC),2:04,16:55,17:04,18:05,18:08,3.0,,
183358,19 Jan 2023,Guatemala City (GUA),New York (JFK),B762 (XA-LRC),4:13,10:30,09:21,15:25,14:34,-51.0,,


In [49]:
#in this code we are splitting the string until the first space, which represents the airplane model and pasting 
#it in the MODEL column, ann then we delete the brackets from the string inside it.

merged_df['MODEL'] = merged_df['AIRCRAFT'].str.split().str[0]
merged_df['AIRCRAFT'] = [item.split('(')[-1].strip(')') for item in merged_df['AIRCRAFT']]
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS,DIFFERENCE,AGE,MODEL
0,05 May 2023,New York (JFK),Teterboro (TEB),N455MB,0:31,14:45,15:33,16:10,16:04,-6.0,,GLF4
1,05 May 2023,Bermuda (BDA),New York (JFK),N455MB,2:01,12:30,13:23,13:24,14:24,60.0,,GLF4
2,03 May 2023,White Plains (HPN),Bermuda (BDA),N455MB,1:44,07:30,08:24,10:17,11:08,51.0,,GLF4
3,03 May 2023,Teterboro (TEB),White Plains (HPN),N455MB,0:22,06:00,06:33,06:24,06:55,31.0,,GLF4
4,30 Apr 2023,Van Nuys (VNY),Teterboro (TEB),N455MB,4:48,14:45,14:45,22:34,22:34,0.0,,GLF4
...,...,...,...,...,...,...,...,...,...,...,...,...
183355,30 Jan 2023,New York (JFK),Chicago (ORD),XA-UYR,2:13,06:30,09:16,10:28,10:29,1.0,,A306
183356,29 Jan 2023,Mexico City (MEX),New York (JFK),XA-UYR,3:51,08:15,01:44,14:00,06:35,995.0,,A306
183357,19 Jan 2023,New York (JFK),Chicago (ORD),XA-LRC,2:04,16:55,17:04,18:05,18:08,3.0,,B762
183358,19 Jan 2023,Guatemala City (GUA),New York (JFK),XA-LRC,4:13,10:30,09:21,15:25,14:34,-51.0,,B762


In [30]:
#Here we are filling 'AGE' column 

for index, row in merged_df.iterrows():
    aircraft = row['AIRCRAFT']
    df_filtered = df_arcrft_age.loc[df_arcrft_age['FLIGHT REG'] == aircraft, 'AGE']
    if not df_filtered.empty:
        age = df_filtered.iloc[0]
        merged_df.at[index, 'AGE'] = age
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS,DIFFERENCE,AGE,MODEL
0,05 May 2023,New York (JFK),Teterboro (TEB),N455MB,0:31,14:45,15:33,16:10,16:04,-6.0,19s,GLF4
1,05 May 2023,Bermuda (BDA),New York (JFK),N455MB,2:01,12:30,13:23,13:24,14:24,60.0,19s,GLF4
2,03 May 2023,White Plains (HPN),Bermuda (BDA),N455MB,1:44,07:30,08:24,10:17,11:08,51.0,19s,GLF4
3,03 May 2023,Teterboro (TEB),White Plains (HPN),N455MB,0:22,06:00,06:33,06:24,06:55,31.0,19s,GLF4
4,30 Apr 2023,Van Nuys (VNY),Teterboro (TEB),N455MB,4:48,14:45,14:45,22:34,22:34,0.0,19s,GLF4
...,...,...,...,...,...,...,...,...,...,...,...,...
183355,30 Jan 2023,New York (JFK),Chicago (ORD),XA-UYR,2:13,06:30,09:16,10:28,10:29,1.0,30s,A306
183356,29 Jan 2023,Mexico City (MEX),New York (JFK),XA-UYR,3:51,08:15,01:44,14:00,06:35,995.0,30s,A306
183357,19 Jan 2023,New York (JFK),Chicago (ORD),XA-LRC,2:04,16:55,17:04,18:05,18:08,3.0,36s,B762
183358,19 Jan 2023,Guatemala City (GUA),New York (JFK),XA-LRC,4:13,10:30,09:21,15:25,14:34,-51.0,36s,B762


In [32]:
#Deleting the 's' from each cell in AGE column, because we want this column to be int

merged_df['AGE'] = merged_df['AGE'].str.replace('s', '')
merged_df
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183067 entries, 0 to 183359
Data columns (total 12 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   DATE         183067 non-null  object 
 1   FROM         183067 non-null  object 
 2   TO           183067 non-null  object 
 3   AIRCRAFT     183067 non-null  object 
 4   FLIGHT TIME  183067 non-null  object 
 5   STD          183067 non-null  object 
 6   ATD          183067 non-null  object 
 7   STA          183067 non-null  object 
 8   STATUS       183067 non-null  object 
 9   DIFFERENCE   183067 non-null  float64
 10  AGE          179527 non-null  object 
 11  MODEL        183067 non-null  object 
dtypes: float64(1), object(11)
memory usage: 22.2+ MB


In [33]:
#checking for empty cells that need to be clean

merged_df.dropna(subset=['AGE'], inplace=True)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179527 entries, 0 to 183359
Data columns (total 12 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   DATE         179527 non-null  object 
 1   FROM         179527 non-null  object 
 2   TO           179527 non-null  object 
 3   AIRCRAFT     179527 non-null  object 
 4   FLIGHT TIME  179527 non-null  object 
 5   STD          179527 non-null  object 
 6   ATD          179527 non-null  object 
 7   STA          179527 non-null  object 
 8   STATUS       179527 non-null  object 
 9   DIFFERENCE   179527 non-null  float64
 10  AGE          179527 non-null  object 
 11  MODEL        179527 non-null  object 
dtypes: float64(1), object(11)
memory usage: 17.8+ MB


In [35]:
#Here we are creating 3 new columns called 'DAY', 'MONTH', 'YEAR' and we split the DATE column string into 3 int cells.
#the first two digits until the '-' are copied to DAY column, the two next digits until the second '-' are copied to MONTH column, 
#and the last four digits are copied to 'YEAR' column.

merged_df['DATE'] = pd.to_datetime(merged_df['DATE'])
merged_df['DAY'] = merged_df['DATE'].dt.day
merged_df['MONTH'] = merged_df['DATE'].dt.month
merged_df['YEAR'] = merged_df['DATE'].dt.year
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS,DIFFERENCE,AGE,MODEL,DAY,MONTH,YEAR
0,2023-05-05,New York (JFK),Teterboro (TEB),N455MB,0:31,14:45,15:33,16:10,16:04,-6.0,19,GLF4,5,5,2023
1,2023-05-05,Bermuda (BDA),New York (JFK),N455MB,2:01,12:30,13:23,13:24,14:24,60.0,19,GLF4,5,5,2023
2,2023-05-03,White Plains (HPN),Bermuda (BDA),N455MB,1:44,07:30,08:24,10:17,11:08,51.0,19,GLF4,3,5,2023
3,2023-05-03,Teterboro (TEB),White Plains (HPN),N455MB,0:22,06:00,06:33,06:24,06:55,31.0,19,GLF4,3,5,2023
4,2023-04-30,Van Nuys (VNY),Teterboro (TEB),N455MB,4:48,14:45,14:45,22:34,22:34,0.0,19,GLF4,30,4,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183355,2023-01-30,New York (JFK),Chicago (ORD),XA-UYR,2:13,06:30,09:16,10:28,10:29,1.0,30,A306,30,1,2023
183356,2023-01-29,Mexico City (MEX),New York (JFK),XA-UYR,3:51,08:15,01:44,14:00,06:35,995.0,30,A306,29,1,2023
183357,2023-01-19,New York (JFK),Chicago (ORD),XA-LRC,2:04,16:55,17:04,18:05,18:08,3.0,36,B762,19,1,2023
183358,2023-01-19,Guatemala City (GUA),New York (JFK),XA-LRC,4:13,10:30,09:21,15:25,14:34,-51.0,36,B762,19,1,2023


In [36]:
#Here we are creating a new column called 'DAY OF WEEK' and we are using dt.day_name in order to determine the day of week
#for this date

merged_df['DAY OF WEEK'] = merged_df['DATE'].dt.day_name()
merged_df

Unnamed: 0,DATE,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS,DIFFERENCE,AGE,MODEL,DAY,MONTH,YEAR,DAY OF WEEK
0,2023-05-05,New York (JFK),Teterboro (TEB),N455MB,0:31,14:45,15:33,16:10,16:04,-6.0,19,GLF4,5,5,2023,Friday
1,2023-05-05,Bermuda (BDA),New York (JFK),N455MB,2:01,12:30,13:23,13:24,14:24,60.0,19,GLF4,5,5,2023,Friday
2,2023-05-03,White Plains (HPN),Bermuda (BDA),N455MB,1:44,07:30,08:24,10:17,11:08,51.0,19,GLF4,3,5,2023,Wednesday
3,2023-05-03,Teterboro (TEB),White Plains (HPN),N455MB,0:22,06:00,06:33,06:24,06:55,31.0,19,GLF4,3,5,2023,Wednesday
4,2023-04-30,Van Nuys (VNY),Teterboro (TEB),N455MB,4:48,14:45,14:45,22:34,22:34,0.0,19,GLF4,30,4,2023,Sunday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183355,2023-01-30,New York (JFK),Chicago (ORD),XA-UYR,2:13,06:30,09:16,10:28,10:29,1.0,30,A306,30,1,2023,Monday
183356,2023-01-29,Mexico City (MEX),New York (JFK),XA-UYR,3:51,08:15,01:44,14:00,06:35,995.0,30,A306,29,1,2023,Sunday
183357,2023-01-19,New York (JFK),Chicago (ORD),XA-LRC,2:04,16:55,17:04,18:05,18:08,3.0,36,B762,19,1,2023,Thursday
183358,2023-01-19,Guatemala City (GUA),New York (JFK),XA-LRC,4:13,10:30,09:21,15:25,14:34,-51.0,36,B762,19,1,2023,Thursday


In [37]:
#Lookinmg for empty cells

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179527 entries, 0 to 183359
Data columns (total 16 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   DATE         179527 non-null  datetime64[ns]
 1   FROM         179527 non-null  object        
 2   TO           179527 non-null  object        
 3   AIRCRAFT     179527 non-null  object        
 4   FLIGHT TIME  179527 non-null  object        
 5   STD          179527 non-null  object        
 6   ATD          179527 non-null  object        
 7   STA          179527 non-null  object        
 8   STATUS       179527 non-null  object        
 9   DIFFERENCE   179527 non-null  float64       
 10  AGE          179527 non-null  object        
 11  MODEL        179527 non-null  object        
 12  DAY          179527 non-null  int64         
 13  MONTH        179527 non-null  int64         
 14  YEAR         179527 non-null  int64         
 15  DAY OF WEEK  179527 non-null  obje

In [38]:
#Here we delete the DATE column, because it is not necessary anymore

merged_df = merged_df.drop('DATE', axis=1)
merged_df

Unnamed: 0,FROM,TO,AIRCRAFT,FLIGHT TIME,STD,ATD,STA,STATUS,DIFFERENCE,AGE,MODEL,DAY,MONTH,YEAR,DAY OF WEEK
0,New York (JFK),Teterboro (TEB),N455MB,0:31,14:45,15:33,16:10,16:04,-6.0,19,GLF4,5,5,2023,Friday
1,Bermuda (BDA),New York (JFK),N455MB,2:01,12:30,13:23,13:24,14:24,60.0,19,GLF4,5,5,2023,Friday
2,White Plains (HPN),Bermuda (BDA),N455MB,1:44,07:30,08:24,10:17,11:08,51.0,19,GLF4,3,5,2023,Wednesday
3,Teterboro (TEB),White Plains (HPN),N455MB,0:22,06:00,06:33,06:24,06:55,31.0,19,GLF4,3,5,2023,Wednesday
4,Van Nuys (VNY),Teterboro (TEB),N455MB,4:48,14:45,14:45,22:34,22:34,0.0,19,GLF4,30,4,2023,Sunday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183355,New York (JFK),Chicago (ORD),XA-UYR,2:13,06:30,09:16,10:28,10:29,1.0,30,A306,30,1,2023,Monday
183356,Mexico City (MEX),New York (JFK),XA-UYR,3:51,08:15,01:44,14:00,06:35,995.0,30,A306,29,1,2023,Sunday
183357,New York (JFK),Chicago (ORD),XA-LRC,2:04,16:55,17:04,18:05,18:08,3.0,36,B762,19,1,2023,Thursday
183358,Guatemala City (GUA),New York (JFK),XA-LRC,4:13,10:30,09:21,15:25,14:34,-51.0,36,B762,19,1,2023,Thursday


In [40]:
#Here we reorganize our columns, in order to make it look more organized

new_order = ['DAY', 'MONTH', 'YEAR', 'DAY OF WEEK', 'FROM', 'TO', 'AIRCRAFT','MODEL', 'AGE', 'FLIGHT TIME', 'STD', 'ATD', 'STA', 'STATUS', 'DIFFERENCE']
merged_df = merged_df.reindex(columns=new_order)
merged_df = merged_df.rename(columns={'STATUS': 'ATA'})
merged_df


Unnamed: 0,DAY,MONTH,YEAR,DAY OF WEEK,FROM,TO,AIRCRAFT,MODEL,AGE,FLIGHT TIME,STD,ATD,STA,ATA,DIFFERENCE
0,5,5,2023,Friday,New York (JFK),Teterboro (TEB),N455MB,GLF4,19,0:31,14:45,15:33,16:10,16:04,-6.0
1,5,5,2023,Friday,Bermuda (BDA),New York (JFK),N455MB,GLF4,19,2:01,12:30,13:23,13:24,14:24,60.0
2,3,5,2023,Wednesday,White Plains (HPN),Bermuda (BDA),N455MB,GLF4,19,1:44,07:30,08:24,10:17,11:08,51.0
3,3,5,2023,Wednesday,Teterboro (TEB),White Plains (HPN),N455MB,GLF4,19,0:22,06:00,06:33,06:24,06:55,31.0
4,30,4,2023,Sunday,Van Nuys (VNY),Teterboro (TEB),N455MB,GLF4,19,4:48,14:45,14:45,22:34,22:34,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183355,30,1,2023,Monday,New York (JFK),Chicago (ORD),XA-UYR,A306,30,2:13,06:30,09:16,10:28,10:29,1.0
183356,29,1,2023,Sunday,Mexico City (MEX),New York (JFK),XA-UYR,A306,30,3:51,08:15,01:44,14:00,06:35,995.0
183357,19,1,2023,Thursday,New York (JFK),Chicago (ORD),XA-LRC,B762,36,2:04,16:55,17:04,18:05,18:08,3.0
183358,19,1,2023,Thursday,Guatemala City (GUA),New York (JFK),XA-LRC,B762,36,4:13,10:30,09:21,15:25,14:34,-51.0


We need to analyze the dataframe we've scraped. Let's get familliar with what are the columns we built:<br>
<ul>
    <li><b>Day</b> - Specifies the day of the month of the flight.</li>
    <li><b>Month</b> - Specifies the month of the flight.</li>
    <li><b>Year</b> - Specifies the year of the flight.</li>
    <li><b>Day Of Week</b> - Specifies the day of week of the flight.</li>
    <li><b>From</b> - Specifies the origin of the flight.</li>
    <li><b>To</b> - Specifies the destination of the flight.</li>
    <li><b>Aircraft</b> - Specifies the specific registration of the airplane of the flight.</li>
    <li><b>Model</b> - Specifies the model of the airplane.</li>
    <li><b>Age</b> - Specifies the age of the registrated airplane.</li>
    <li><b>Flight Time</b> - Specifies the total time passed since the aircraft started departure until landing.</li>
    <li><b>STD</b> - Stands for 'Scheduled Time for Departure', Specifies the scheduled time of departure.</li>
    <li><b>ATD</b> - Stands for 'Actual Time of Departure', Specifies the actual time of departure.</li>
    <li><b>STA</b> - Stands for 'Scheduled Time of Arrival', Specifies the scheduled time of arrival.</li>
    <li><b>ATA</b> - Stands for 'Actual Time of Arrival', Specifies the actual time of arrival.</li>
    <li><b>Difference</b> - Specifies the delta between STA and ATA, the delta can be negative (meaning the flight arrived before STA) and can be positive (meaning the flight arrivd after the STA).</li>

</ul>

In [49]:
#weekday - Every weekday will be replaced with it's appropriate number depending on its position within the week:

weekdays_rep_map = {"Sunday" : 1, "Monday" : 2, "Tuesday" : 3, "Wednesday" : 4, "Thursday" : 5, "Friday" : 6, "Saturday" : 7}
merged_df["DAY OF WEEK"].replace(weekdays_rep_map, inplace=True)
merged_df.head()

Unnamed: 0,DAY,MONTH,YEAR,DAY OF WEEK,FROM,TO,AIRCRAFT,MODEL,AGE,FLIGHT TIME,STD,ATD,STA,ATA,DIFFERENCE
0,5,5,2023,6,New York (JFK),Teterboro (TEB),N455MB,GLF4,19,0:31,14:45,15:33,16:10,16:04,-6.0
1,5,5,2023,6,Bermuda (BDA),New York (JFK),N455MB,GLF4,19,2:01,12:30,13:23,13:24,14:24,60.0
2,3,5,2023,4,White Plains (HPN),Bermuda (BDA),N455MB,GLF4,19,1:44,07:30,08:24,10:17,11:08,51.0
3,3,5,2023,4,Teterboro (TEB),White Plains (HPN),N455MB,GLF4,19,0:22,06:00,06:33,06:24,06:55,31.0
4,30,4,2023,1,Van Nuys (VNY),Teterboro (TEB),N455MB,GLF4,19,4:48,14:45,14:45,22:34,22:34,0.0


In [50]:
#Here we create a new list that contains all of the airports in the merged dataframe, this list contains airports from 
#two columns

airport_list = []

for index, row in merged_df.iterrows():
    airport_list.append(row['FROM'])
    airport_list.append(row['TO'])

airport_list = list(set(airport_list))


In [45]:
#Here we create a dictionary for the airports, so each airport has its unique number, we are doing it to convert 
#the columns to int

airport_dict={}
for i in range(len(airport_list)):
    airport_dict[airport_list[i]] = i
airport_dict

{'Miami (OPF)': 0,
 'New York (EWR)': 1,
 'Bridgetown (BGI)': 2,
 'La Crosse (LSE)': 3,
 'Duluth (DLH)': 4,
 'Cali (CLO)': 5,
 'State College (SCE)': 6,
 'Spencer (SPW)': 7,
 'Abu Dhabi (AUH)': 8,
 'Tokyo (HND)': 9,
 'Greenville-Spartanburg (GSP)': 10,
 'Casablanca (CMN)': 11,
 'Port-au-Prince (PAP)': 12,
 'Traverse City (TVC)': 13,
 'Denver (BJC)': 14,
 'Frankfurt (FRA)': 15,
 'Auckland (AKL)': 16,
 'Waterloo (YKF)': 17,
 'Teterboro (TEB)': 18,
 'Georgetown (GEO)': 19,
 'Knoxville (TYS)': 20,
 'Long Beach (LGB)': 21,
 'Kissimmee (ISM)': 22,
 'Montreal (YUL)': 23,
 'Edinburgh (EDI)': 24,
 'Medellin (MDE)': 25,
 'Sydney (YQY)': 26,
 'Portland (PDX)': 27,
 'Birmingham (BHM)': 28,
 'Columbus (CMH)': 29,
 'Columbia (CAE)': 30,
 'Leesburg (LEE)': 31,
 'Albany (ALB)': 32,
 'Tokyo (NRT)': 33,
 'Puerto Vallarta (PVR)': 34,
 'Appleton (ATW)': 35,
 'Amsterdam (AMS)': 36,
 'Willemstad (CUR)': 37,
 'Phoenix (AZA)': 38,
 'Panama City (ECP)': 39,
 'Louisville (LOU)': 40,
 'Dayton (DAY)': 41,
 'Topek

In [51]:
#Here we are replacing the airport string for each cell in 'TO' and 'FROM' to its unique number from the dictionary.
#now that we have done that, our cells are int and not string

for i in range(len(merged_df['FROM'])):
    merged_df['FROM'][i]=airport_dict[merged_df['FROM'][i]]
for i in range(len(merged_df['TO'])):
    merged_df['TO'][i]=airport_dict[merged_df['TO'][i]]
    
merged_df                                                              

Unnamed: 0,DAY,MONTH,YEAR,DAY OF WEEK,FROM,TO,AIRCRAFT,MODEL,AGE,FLIGHT TIME,STD,ATD,STA,ATA,DIFFERENCE
0,5,5,2023,6,56,18,N455MB,GLF4,19,0:31,14:45,15:33,16:10,16:04,-6.0
1,5,5,2023,6,386,56,N455MB,GLF4,19,2:01,12:30,13:23,13:24,14:24,60.0
2,3,5,2023,4,376,386,N455MB,GLF4,19,1:44,07:30,08:24,10:17,11:08,51.0
3,3,5,2023,4,18,376,N455MB,GLF4,19,0:22,06:00,06:33,06:24,06:55,31.0
4,30,4,2023,1,316,18,N455MB,GLF4,19,4:48,14:45,14:45,22:34,22:34,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179522,30,1,2023,2,56,223,XA-UYR,A306,30,2:13,06:30,09:16,10:28,10:29,1.0
179523,29,1,2023,1,220,56,XA-UYR,A306,30,3:51,08:15,01:44,14:00,06:35,995.0
179524,19,1,2023,5,56,223,XA-LRC,B762,36,2:04,16:55,17:04,18:05,18:08,3.0
179525,19,1,2023,5,379,56,XA-LRC,B762,36,4:13,10:30,09:21,15:25,14:34,-51.0


In [52]:
#aircraft_type:
#Converting the aircraft number to int

aircraft_labels = merged_df["AIRCRAFT"].astype('category').cat.categories.to_list()
aircraft_rep_map={x:y for x,y in zip(aircraft_labels,list(range(0,len(aircraft_labels)+1)))}
merged_df["AIRCRAFT"].replace(aircraft_rep_map, inplace=True)
merged_df

Unnamed: 0,DAY,MONTH,YEAR,DAY OF WEEK,FROM,TO,AIRCRAFT,MODEL,AGE,FLIGHT TIME,STD,ATD,STA,ATA,DIFFERENCE
0,5,5,2023,6,56,18,2652,GLF4,19,0:31,14:45,15:33,16:10,16:04,-6.0
1,5,5,2023,6,386,56,2652,GLF4,19,2:01,12:30,13:23,13:24,14:24,60.0
2,3,5,2023,4,376,386,2652,GLF4,19,1:44,07:30,08:24,10:17,11:08,51.0
3,3,5,2023,4,18,376,2652,GLF4,19,0:22,06:00,06:33,06:24,06:55,31.0
4,30,4,2023,1,316,18,2652,GLF4,19,4:48,14:45,14:45,22:34,22:34,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179522,30,1,2023,2,56,223,4676,A306,30,2:13,06:30,09:16,10:28,10:29,1.0
179523,29,1,2023,1,220,56,4676,A306,30,3:51,08:15,01:44,14:00,06:35,995.0
179524,19,1,2023,5,56,223,4668,B762,36,2:04,16:55,17:04,18:05,18:08,3.0
179525,19,1,2023,5,379,56,4668,B762,36,4:13,10:30,09:21,15:25,14:34,-51.0


In [54]:
#model type:
#Converting the MODEL number to int

model_labels = merged_df["MODEL"].astype('category').cat.categories.to_list()
model_rep_map={x:y for x,y in zip(model_labels,list(range(0,len(model_labels)+1)))}
merged_df["MODEL"].replace(model_rep_map, inplace=True)
merged_df

Unnamed: 0,DAY,MONTH,YEAR,DAY OF WEEK,FROM,TO,AIRCRAFT,MODEL,AGE,FLIGHT TIME,STD,ATD,STA,ATA,DIFFERENCE
0,5,5,2023,6,56,18,2652,50,19,0:31,14:45,15:33,16:10,16:04,-6.0
1,5,5,2023,6,386,56,2652,50,19,2:01,12:30,13:23,13:24,14:24,60.0
2,3,5,2023,4,376,386,2652,50,19,1:44,07:30,08:24,10:17,11:08,51.0
3,3,5,2023,4,18,376,2652,50,19,0:22,06:00,06:33,06:24,06:55,31.0
4,30,4,2023,1,316,18,2652,50,19,4:48,14:45,14:45,22:34,22:34,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179522,30,1,2023,2,56,223,4676,2,30,2:13,06:30,09:16,10:28,10:29,1.0
179523,29,1,2023,1,220,56,4676,2,30,3:51,08:15,01:44,14:00,06:35,995.0
179524,19,1,2023,5,56,223,4668,26,36,2:04,16:55,17:04,18:05,18:08,3.0
179525,19,1,2023,5,379,56,4668,26,36,4:13,10:30,09:21,15:25,14:34,-51.0


In [53]:
with open("model_dict.txt", "w") as file:
    file.write(json.dumps(model_rep_map))

In [55]:
#Here we are checking how many columns we still need to convert to int/float

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179527 entries, 0 to 179526
Data columns (total 15 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   DAY          179527 non-null  int64  
 1   MONTH        179527 non-null  int64  
 2   YEAR         179527 non-null  int64  
 3   DAY OF WEEK  179527 non-null  int64  
 4   FROM         179527 non-null  object 
 5   TO           179527 non-null  object 
 6   AIRCRAFT     179527 non-null  int64  
 7   MODEL        179527 non-null  int64  
 8   AGE          179527 non-null  int64  
 9   FLIGHT TIME  179527 non-null  object 
 10  STD          179527 non-null  object 
 11  ATD          179527 non-null  object 
 12  STA          179527 non-null  object 
 13  ATA          179527 non-null  object 
 14  DIFFERENCE   179527 non-null  float64
dtypes: float64(1), int64(7), object(7)
memory usage: 20.5+ MB


In [56]:
#Although we have numbers in TO and FROM columns, We see that they are still considered as strings, 
#Here we are converting them to ints

merged_df['FROM'] = merged_df['FROM'].astype(int)
merged_df['TO'] = merged_df['TO'].astype(int)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179527 entries, 0 to 179526
Data columns (total 15 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   DAY          179527 non-null  int64  
 1   MONTH        179527 non-null  int64  
 2   YEAR         179527 non-null  int64  
 3   DAY OF WEEK  179527 non-null  int64  
 4   FROM         179527 non-null  int32  
 5   TO           179527 non-null  int32  
 6   AIRCRAFT     179527 non-null  int64  
 7   MODEL        179527 non-null  int64  
 8   AGE          179527 non-null  int64  
 9   FLIGHT TIME  179527 non-null  object 
 10  STD          179527 non-null  object 
 11  ATD          179527 non-null  object 
 12  STA          179527 non-null  object 
 13  ATA          179527 non-null  object 
 14  DIFFERENCE   179527 non-null  float64
dtypes: float64(1), int32(2), int64(7), object(5)
memory usage: 19.2+ MB


In [64]:
#Here we are converting 'FLIGHT TIME', 'STD', 'ATD', 'STA', 'ATA' columns from string to int (the converted numbers are 
#represented in minutes), for example, 00:00 is 0 , 00:44 is 44, 23:59 is 1440.

def hour_to_minutes(time_string):
    hours, minutes = map(int, time_string.split(':'))
    return hours * 60 + minutes

merged_df['FLIGHT TIME'] = merged_df['FLIGHT TIME'].apply(hour_to_minutes)
merged_df['STD'] = merged_df['STD'].apply(hour_to_minutes)
merged_df['ATD'] = merged_df['ATD'].apply(hour_to_minutes)
merged_df['STA'] = merged_df['STA'].apply(hour_to_minutes)
merged_df['ATA'] = merged_df['ATA'].apply(hour_to_minutes)
merged_df

Unnamed: 0,DAY,MONTH,YEAR,DAY OF WEEK,FROM,TO,AIRCRAFT,MODEL,AGE,FLIGHT TIME,STD,ATD,STA,ATA,DIFFERENCE
0,5,5,2023,6,56,18,2652,50,19,31,885,933,970,964,-6.0
1,5,5,2023,6,386,56,2652,50,19,121,750,803,804,864,60.0
2,3,5,2023,4,376,386,2652,50,19,104,450,504,617,668,51.0
3,3,5,2023,4,18,376,2652,50,19,22,360,393,384,415,31.0
4,30,4,2023,1,316,18,2652,50,19,288,885,885,1354,1354,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179522,30,1,2023,2,56,223,4676,2,30,133,390,556,628,629,1.0
179523,29,1,2023,1,220,56,4676,2,30,231,495,104,840,395,995.0
179524,19,1,2023,5,56,223,4668,26,36,124,1015,1024,1085,1088,3.0
179525,19,1,2023,5,379,56,4668,26,36,253,630,561,925,874,-51.0


In [65]:
#Making sure that these columns are from type int

merged_df['FLIGHT TIME'] = merged_df['FLIGHT TIME'].astype(int)
merged_df['STD'] = merged_df['STD'].astype(int)
merged_df['ATD'] = merged_df['ATD'].astype(int)
merged_df['STA'] = merged_df['STA'].astype(int)
merged_df['ATA'] = merged_df['ATA'].astype(int)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 178991 entries, 0 to 179526
Data columns (total 15 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   DAY          178991 non-null  int64  
 1   MONTH        178991 non-null  int64  
 2   YEAR         178991 non-null  int64  
 3   DAY OF WEEK  178991 non-null  int64  
 4   FROM         178991 non-null  int32  
 5   TO           178991 non-null  int32  
 6   AIRCRAFT     178991 non-null  int64  
 7   MODEL        178991 non-null  int64  
 8   AGE          178991 non-null  int64  
 9   FLIGHT TIME  178991 non-null  int32  
 10  STD          178991 non-null  int32  
 11  ATD          178991 non-null  int32  
 12  STA          178991 non-null  int32  
 13  ATA          178991 non-null  int32  
 14  DIFFERENCE   178991 non-null  float64
dtypes: float64(1), int32(7), int64(7)
memory usage: 17.1 MB


In [66]:
#Here we are making sure that our dataframe looks as we want it to look

merged_df.head()

Unnamed: 0,DAY,MONTH,YEAR,DAY OF WEEK,FROM,TO,AIRCRAFT,MODEL,AGE,FLIGHT TIME,STD,ATD,STA,ATA,DIFFERENCE
0,5,5,2023,6,56,18,2652,50,19,31,885,933,970,964,-6.0
1,5,5,2023,6,386,56,2652,50,19,121,750,803,804,864,60.0
2,3,5,2023,4,376,386,2652,50,19,104,450,504,617,668,51.0
3,3,5,2023,4,18,376,2652,50,19,22,360,393,384,415,31.0
4,30,4,2023,1,316,18,2652,50,19,288,885,885,1354,1354,0.0


In [67]:
#Importing the dataframe to a CSV file in order to keep working on it on the next notebook which is Visualizations notebook

merged_df.to_csv("Merged_Numerical.csv", index=False)