# Final Project Title


# Column Descriptions

YEAR Year of the Flight Trip 

MONTH Month of the Flight Trip

DAY Day of the Flight Trip 

DAY_OF_WEEK Day of week of the Flight Trip

AIRLINE Airline Identifier

FLIGHT_NUMBER Flight Identifier

TAIL_NUMBER Aircraft Identifier

ORIGIN_AIRPORT Starting Airport

DESTINATION_AIRPORT  Destination Airport

SCHEDULED_DEPARTURE  Planned Departure Time

DEPARTURE_TIME: WHEELS_OFF - TAXI_OUT

DEPARTURE_DELAY  Total Delay on Departure

TAXI_OUT The time duration elapsed between departure from the origin airport gate and wheels off

WHEELS_OFF The time point that the aircraft's wheels leave the ground

SCHEDULED_TIME: Planned time amount needed for the flight trip

ELAPSED_TIME:  AIR_TIME + TAXI_IN + TAXI_OUT

AIR_TIME The time duration between WHEELS_OFF and WHEELS_ON

DISTANCE Distance between two airports

WHEELS_ON The time point that the aircraft's wheels touch the ground

TAXI_IN The time duration elapsed between wheels-on and gate arrival at the destination airport

SCHEDULED_ARRIVAL Planned arrival time

ARRIVAL_TIME:  WHEELS_ON + TAXI_IN

ARRIVAL_DELAY: ARRIVAL_TIME - SCHEDULED_ARRIVAL

DIVERTED Aircraft landed at an airport other than the scheduled destination

CANCELLED Flight Cancelled (1 = cancelled)

CANCELLATION_REASON Reason for Cancellation of flight: A - Airline/Carrier; B - Weather; C - National Air System; D - Security

AIR_SYSTEM_DELAY Delay caused by air system

SECURITY_DELAY Delay caused by security

AIRLINE_DELAY Delay caused by the airline

LATE_AIRCRAFT_DELAY Delay caused by a late-arriving aircraft

WEATHER_DELAY Delay caused by weather
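
The time-of-day columns (SCHEDULED_DEPARTURE, DEPARTURE_TIME, SCHEDULED_ARRIVAL, ARRIVAL_TIME, WHEELS_OFF, WHEELS_ON) appear to be stored as HHMM integers rather than minutes, so a formula such as ARRIVAL_TIME - SCHEDULED_ARRIVAL cannot be applied as plain integer subtraction. Below is a minimal plain-Python sketch of the conversion, assuming the HHMM encoding and assuming the delay lies within ±12 hours. The helper names are hypothetical and not part of the dataset or of Spark.

```python
def hhmm_to_minutes(hhmm):
    """Convert an HHMM integer (e.g. 2354) to minutes since midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes


def delay_minutes(scheduled_hhmm, actual_hhmm):
    """Signed delay in minutes, wrapping around midnight.

    Assumption: the true delay lies within +/- 12 hours, so the
    difference is folded into the range [-720, 720).
    """
    diff = hhmm_to_minutes(actual_hhmm) - hhmm_to_minutes(scheduled_hhmm)
    return (diff + 720) % 1440 - 720


# Scheduled 0005, actual 2354: the flight left 11 minutes early
early_across_midnight = delay_minutes(5, 2354)
# Scheduled 0010, actual 0002: the flight left 8 minutes early
early_same_day = delay_minutes(10, 2)
print(early_across_midnight, early_same_day)
```

These two calls use the SCHEDULED_DEPARTURE and DEPARTURE_TIME values of the first two rows shown by `show(5)` further down, and they match the DEPARTURE_DELAY values reported there. Delays longer than 12 hours (the maximum DEPARTURE_DELAY is 1988 minutes) would need date information to resolve, so this sketch would wrap them incorrectly.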

In [1]:
# import pyspark modules and supporting libraries
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *       # for datatype conversion
from pyspark.sql.functions import *   # for col() function
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression
import pandas as pd
import os
import pyspark.sql.types as typ
import pyspark.sql.functions as F

In [2]:
from pyspark.sql import SparkSession 
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("app") \
    .config("spark.executor.memory", '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.cores.max', '2') \
    .config("spark.driver.memory",'4g') \
    .getOrCreate()

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

### Start of APT edit

In [3]:
path_to_data = os.path.join("/home/jovyan/FlightDelay/flights.csv")

Read into an RDD

In [4]:
delay_rdd = sc.textFile(path_to_data)

In [5]:
#delay_rdd.take(3)

Read into a Spark DataFrame

In [6]:
delay_df = spark.read.format("csv") \
    .option("header", "true").option("inferschema","true").load(path_to_data)

In [7]:
delay_df.show(3)
delay_df.cache()

+----+-----+---+-----------+-------+-------------+-----------+--------------+-------------------+-------------------+--------------+---------------+--------+----------+--------------+------------+--------+--------+---------+-------+-----------------+------------+-------------+--------+---------+-------------------+----------------+--------------+-------------+-------------------+-------------+
|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|TAIL_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|SCHEDULED_DEPARTURE|DEPARTURE_TIME|DEPARTURE_DELAY|TAXI_OUT|WHEELS_OFF|SCHEDULED_TIME|ELAPSED_TIME|AIR_TIME|DISTANCE|WHEELS_ON|TAXI_IN|SCHEDULED_ARRIVAL|ARRIVAL_TIME|ARRIVAL_DELAY|DIVERTED|CANCELLED|CANCELLATION_REASON|AIR_SYSTEM_DELAY|SECURITY_DELAY|AIRLINE_DELAY|LATE_AIRCRAFT_DELAY|WEATHER_DELAY|
+----+-----+---+-----------+-------+-------------+-----------+--------------+-------------------+-------------------+--------------+---------------+--------+----------+--------------+------------+--------+-

DataFrame[YEAR: int, MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, TAIL_NUMBER: string, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, DEPARTURE_DELAY: int, TAXI_OUT: int, WHEELS_OFF: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, AIR_TIME: int, DISTANCE: int, WHEELS_ON: int, TAXI_IN: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, ARRIVAL_DELAY: int, DIVERTED: int, CANCELLED: int, CANCELLATION_REASON: string, AIR_SYSTEM_DELAY: int, SECURITY_DELAY: int, AIRLINE_DELAY: int, LATE_AIRCRAFT_DELAY: int, WEATHER_DELAY: int]

In [8]:
delay_df.printSchema()

root
 |-- YEAR: integer (nullable = true)
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- TAIL_NUMBER: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- DEPARTURE_TIME: integer (nullable = true)
 |-- DEPARTURE_DELAY: integer (nullable = true)
 |-- TAXI_OUT: integer (nullable = true)
 |-- WHEELS_OFF: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- ELAPSED_TIME: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- WHEELS_ON: integer (nullable = true)
 |-- TAXI_IN: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- ARRIVAL_TIME: integer (nullable = true)
 |-- ARRIVAL_DELAY: integer (null

In [9]:
delay_df.count()

5287214

In [10]:
delay_df.describe(['DEPARTURE_DELAY', 'ARRIVAL_DELAY']).show()

+-------+------------------+------------------+
|summary|   DEPARTURE_DELAY|     ARRIVAL_DELAY|
+-------+------------------+------------------+
|  count|           5208890|           5191895|
|   mean|  9.17736811489588| 4.284605717180336|
| stddev|36.605655917829274|38.808542303886945|
|    min|               -68|               -87|
|    max|              1988|              1971|
+-------+------------------+------------------+



# Check for Duplicates

In [11]:
print('rows = {}'.format(delay_df.count()))

rows = 5287214


In [12]:
print('rows = {}'.format(delay_df.distinct().count()))

rows = 5287214


The distinct row count matches the total row count, so there appear to be no duplicated entries.

# Check for Missing Values 

In [13]:
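# Calculate the fraction of missing values in each column
# (F.count(c) counts only non-null values; show() returns None, so keep the DataFrame separately)
missing = delay_df.agg(*[
    (1 - F.count(c) / F.count('*')).alias(c + '_missing')
    for c in delay_df.columns
])
missing.show()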

+------------+-------------+-----------+-------------------+---------------+---------------------+--------------------+----------------------+---------------------------+---------------------------+----------------------+-----------------------+-------------------+-------------------+----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------------+--------------------+---------------------+--------------------+--------------------+---------------------------+------------------------+----------------------+---------------------+---------------------------+---------------------+
|YEAR_missing|MONTH_missing|DAY_missing|DAY_OF_WEEK_missing|AIRLINE_missing|FLIGHT_NUMBER_missing| TAIL_NUMBER_missing|ORIGIN_AIRPORT_missing|DESTINATION_AIRPORT_missing|SCHEDULED_DEPARTURE_missing|DEPARTURE_TIME_missing|DEPARTURE_DELAY_missing|   TAXI_OUT_missing| WHEELS_OFF_missing|SCHEDULED_TIME_missing|ELAPSED_TIME_missin

The last 6 columns appear to have a very large percentage of missing values:

CANCELLATION_REASON_missing, AIR_SYSTEM_DELAY_missing, SECURITY_DELAY_missing, 
AIRLINE_DELAY_missing, LATE_AIRCRAFT_DELAY_missing, WEATHER_DELAY_missing 

Should I drop all of these columns? 
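
As a quick sanity check on the missing-value formula (1 - non-null count / total count), the same arithmetic can be done by hand with the counts already printed by `count()` and `describe()` above. This is only a plain-Python illustration; the function name is hypothetical.

```python
def missing_fraction(non_null_count, total_count):
    """Fraction of rows whose value is null in a given column."""
    return 1 - non_null_count / total_count


TOTAL_ROWS = 5287214  # from delay_df.count()

# Non-null counts taken from the describe() output above
dep_missing = missing_fraction(5208890, TOTAL_ROWS)
arr_missing = missing_fraction(5191895, TOTAL_ROWS)

print("DEPARTURE_DELAY missing: {:.2%}".format(dep_missing))
print("ARRIVAL_DELAY missing:   {:.2%}".format(arr_missing))
```

Both delay columns come out at under 2% missing, which is a very different situation from the six columns listed above.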

In [14]:
delay_df.select('WEATHER_DELAY', 'SECURITY_DELAY', 'AIR_SYSTEM_DELAY', 
                'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'CANCELLATION_REASON').show(5)

+-------------+--------------+----------------+-------------+-------------------+-------------------+
|WEATHER_DELAY|SECURITY_DELAY|AIR_SYSTEM_DELAY|AIRLINE_DELAY|LATE_AIRCRAFT_DELAY|CANCELLATION_REASON|
+-------------+--------------+----------------+-------------+-------------------+-------------------+
|         null|          null|            null|         null|               null|               null|
|         null|          null|            null|         null|               null|               null|
|         null|          null|            null|         null|               null|               null|
|         null|          null|            null|         null|               null|               null|
|         null|          null|            null|         null|               null|               null|
+-------------+--------------+----------------+-------------+-------------------+-------------------+
only showing top 5 rows



The selected columns are almost 90% null, so I'm dropping them from the dataset.

# Drop columns 

In [15]:
cols_to_drop = ['WEATHER_DELAY', 'SECURITY_DELAY', 'AIR_SYSTEM_DELAY','AIRLINE_DELAY', 
                'LATE_AIRCRAFT_DELAY', 'CANCELLATION_REASON', 'WHEELS_ON', 'WHEELS_OFF', 
                'TAXI_IN', 'TAXI_OUT', 'AIR_TIME', 'TAIL_NUMBER'] 

delay_df = delay_df.drop(*cols_to_drop)

In [16]:
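# Keep only rows with at least 3 non-null values
# (dropna(thresh=3) drops rows that have fewer than 3 non-null values)

delay_df = delay_df.dropna(thresh=3)

This doesn't change the count of the DF, so every row has at least 3 non-null values. Note that this is not the same as dropping rows with at least 3 missing columns.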
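
To make the `thresh` semantics concrete: PySpark's `dropna(thresh=n)` keeps a row only if it has at least `n` non-null values. The sketch below mimics that rule in plain Python on a few made-up rows (the helper and the sample rows are hypothetical, not taken from the dataset).

```python
def dropna_thresh(rows, thresh):
    """Keep rows that have at least `thresh` non-null (non-None) values."""
    return [row for row in rows if sum(v is not None for v in row) >= thresh]


rows = [
    (2015, 1, 1, None, None),         # 3 non-null values: kept
    (2015, 1, None, None, None),      # 2 non-null values: dropped
    (2015, None, None, None, None),   # 1 non-null value: dropped
]

kept = dropna_thresh(rows, thresh=3)
print(len(kept))

# To drop rows with at least 3 NULL values instead, the threshold has to be
# expressed in non-null terms: at most 2 nulls means at least (n_cols - 2) non-nulls.
n_cols = len(rows[0])
kept_max_two_nulls = dropna_thresh(rows, thresh=n_cols - 2)
print(len(kept_max_two_nulls))
```

Under that reading, the Spark equivalent of "drop rows with at least 3 missing columns" would presumably be `thresh=len(delay_df.columns) - 2`, which I have not run here.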

In [17]:
delay_df.count()

5287214

# Impute Missing values 

In [18]:
drop_cols = ['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'ORIGIN_AIRPORT', 'DIVERTED', 
                   'CANCELLED', 'DESTINATION_AIRPORT', 'AIRLINE']


df_impute = delay_df.drop(*drop_cols)
means = df_impute.agg(*[F.mean(c).alias(c) \
                                for c in df_impute.columns]) \
                                .toPandas().to_dict('records')[0]

df_impute_mode = delay_df.select('YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'ORIGIN_AIRPORT', 
                                 'DIVERTED','CANCELLED', 'DESTINATION_AIRPORT', 'AIRLINE')


In [19]:
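modes = []
for c in df_impute_mode.columns:
    # Count occurrences of each value (nulls form their own group)
    df = df_impute_mode.groupBy(c).count()
    # The mode is the first row after sorting by count in descending order
    mode = df.orderBy(df['count'].desc()).first()[0]
    modes.append((c, mode))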
    

In [20]:
# Turn list of tuples to dictionary 
modes = dict(modes)

In [21]:
# combine dictionaries
def Merge(dict1, dict2): 
    res = {**dict1, **dict2} 
    return res 

imputed_vals = Merge(means,modes)

I'm hesitant to impute values for the date and categorical columns, since I don't know how imputing integer values such as day of the week or year will affect the data. We can ask the professor about this!

In [22]:
# dictionary of means and modes to impute
imputed_vals

{'FLIGHT_NUMBER': 2174.2944622631126,
 'SCHEDULED_DEPARTURE': 1329.2202784340257,
 'DEPARTURE_TIME': 1334.8751653423283,
 'DEPARTURE_DELAY': 9.17736811489588,
 'SCHEDULED_TIME': 141.35763740666857,
 'ELAPSED_TIME': 136.7412064381117,
 'DISTANCE': 820.9152377632602,
 'SCHEDULED_ARRIVAL': 1494.5158333889708,
 'ARRIVAL_TIME': 1476.9788760477654,
 'ARRIVAL_DELAY': 4.284605717180336,
 'YEAR': 2015,
 'MONTH': 7,
 'DAY': 2,
 'DAY_OF_WEEK': 5,
 'ORIGIN_AIRPORT': 'ATL',
 'DIVERTED': 0,
 'CANCELLED': 0,
 'DESTINATION_AIRPORT': 'ATL',
 'AIRLINE': 'WN'}

Fill null values with the mean (numeric columns) or the mode (date and categorical columns)

In [23]:
delay_df.columns

['YEAR',
 'MONTH',
 'DAY',
 'DAY_OF_WEEK',
 'AIRLINE',
 'FLIGHT_NUMBER',
 'ORIGIN_AIRPORT',
 'DESTINATION_AIRPORT',
 'SCHEDULED_DEPARTURE',
 'DEPARTURE_TIME',
 'DEPARTURE_DELAY',
 'SCHEDULED_TIME',
 'ELAPSED_TIME',
 'DISTANCE',
 'SCHEDULED_ARRIVAL',
 'ARRIVAL_TIME',
 'ARRIVAL_DELAY',
 'DIVERTED',
 'CANCELLED']

In [24]:
delay_df = delay_df.fillna(imputed_vals)
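
The imputation rule used above (column mean for numeric columns, most frequent value for the others) can be sketched in plain Python as follows. This is an illustration on made-up rows, not the Spark code path. One difference to keep in mind: this sketch ignores nulls when looking for the mode, whereas the `groupBy(c).count()` approach above counts nulls as their own group.

```python
from collections import Counter
from statistics import mean


def fit_fill_values(rows, numeric_cols, categorical_cols):
    """Compute one fill value per column: mean for numeric, mode otherwise."""
    fill = {}
    for c in numeric_cols:
        fill[c] = mean(r[c] for r in rows if r[c] is not None)
    for c in categorical_cols:
        observed = Counter(r[c] for r in rows if r[c] is not None)
        fill[c] = observed.most_common(1)[0][0]
    return fill


def fillna(rows, fill):
    """Replace None with the fill value of the matching column."""
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in r.items()}
        for r in rows
    ]


# Hypothetical sample rows
rows = [
    {"DEPARTURE_DELAY": -11, "AIRLINE": "WN"},
    {"DEPARTURE_DELAY": None, "AIRLINE": "WN"},
    {"DEPARTURE_DELAY": -5, "AIRLINE": None},
]

fill_values = fit_fill_values(rows, ["DEPARTURE_DELAY"], ["AIRLINE"])
filled = fillna(rows, fill_values)
print(fill_values)
print(filled)
```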

In [25]:
delay_df.select('SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 
                'DEPARTURE_DELAY','CANCELLED', 
                'SCHEDULED_TIME', 'ARRIVAL_TIME', 'ARRIVAL_DELAY').show(5)

+-------------------+--------------+---------------+---------+--------------+------------+-------------+
|SCHEDULED_DEPARTURE|DEPARTURE_TIME|DEPARTURE_DELAY|CANCELLED|SCHEDULED_TIME|ARRIVAL_TIME|ARRIVAL_DELAY|
+-------------------+--------------+---------------+---------+--------------+------------+-------------+
|                  5|          2354|            -11|        0|           205|         408|          -22|
|                 10|             2|             -8|        0|           280|         741|           -9|
|                 20|            18|             -2|        0|           286|         811|            5|
|                 20|            15|             -5|        0|           285|         756|           -9|
|                 25|            24|             -1|        0|           235|         259|          -21|
+-------------------+--------------+---------------+---------+--------------+------------+-------------+
only showing top 5 rows



In [26]:
delay_df.select('ORIGIN_AIRPORT', 'DESTINATION_AIRPORT').show(10)

+--------------+-------------------+
|ORIGIN_AIRPORT|DESTINATION_AIRPORT|
+--------------+-------------------+
|           ANC|                SEA|
|           LAX|                PBI|
|           SFO|                CLT|
|           LAX|                MIA|
|           SEA|                ANC|
|           SFO|                MSP|
|           LAS|                MSP|
|           LAX|                CLT|
|           SFO|                DFW|
|           LAS|                ATL|
+--------------+-------------------+
only showing top 10 rows



In [27]:
delay_df.cache()

DataFrame[YEAR: int, MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, DEPARTURE_DELAY: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, DISTANCE: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, ARRIVAL_DELAY: int, DIVERTED: int, CANCELLED: int]

# Outliers 

# Alternate Outlier approach: 
    

In [28]:
# Calculate values used for outlier filtering

df_for_outlier_calc = delay_df.select('DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'ELAPSED_TIME', 'DISTANCE')

for c in df_for_outlier_calc.columns:
    mean_val = delay_df.agg({c: 'mean'}).collect()[0][0]
    stddev_val = delay_df.agg({c: 'stddev'}).collect()[0][0]

    # Create lower and upper bounds at μ ± 3.3σ
    # Use 3.3 instead of 3 since our data is not normally distributed, so we widen the bounds to account for this
    low_bound = mean_val - (3.3 * stddev_val)
    hi_bound = mean_val + (3.3 * stddev_val)

    # Filter the data to fit between the lower and upper bounds
    delay_df = delay_df.where((delay_df[c] < hi_bound) & (delay_df[c] > low_bound))
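
For reference, here is the same μ ± 3.3σ rule in plain Python on a small made-up sample, which makes it easier to see what the filter keeps and what share of the data it removes. The function names and sample values are hypothetical. `statistics.stdev` is the sample standard deviation, which as far as I know is also what Spark's `stddev` aggregate computes.

```python
from statistics import mean, stdev


def sigma_bounds(values, k=3.3):
    """Return (low, high) bounds at mean -/+ k sample standard deviations."""
    mu = mean(values)
    sigma = stdev(values)
    return mu - k * sigma, mu + k * sigma


def filter_outliers(values, k=3.3):
    """Keep only the values strictly inside the bounds."""
    low, high = sigma_bounds(values, k)
    return [v for v in values if low < v < high]


# 20 ordinary delays plus one extreme value (1988 is the max DEPARTURE_DELAY)
delays = [-5, -2, 0, 3, 5] * 4 + [1988]

kept = filter_outliers(delays)
removed_share = 1 - len(kept) / len(delays)
print(len(delays), len(kept), round(removed_share, 3))
```

Because the mean and standard deviation are themselves pulled by extreme values, a single very large delay widens the bounds considerably. That is worth remembering when judging how much data this filter actually removes.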

In [29]:
delay_df.cache()

DataFrame[YEAR: int, MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, DEPARTURE_DELAY: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, DISTANCE: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, ARRIVAL_DELAY: int, DIVERTED: int, CANCELLED: int]

Should I get rid of these outliers or impute them? It seems like a lot of data to impute or get rid of.

In [30]:
outliers.filter(outliers.DEPARTURE_DELAY_outlier == 'true').count() \
/(delay_df.select('DEPARTURE_DELAY').count())

NameError: name 'outliers' is not defined

In [None]:
#outliers.filter(outliers.ARRIVAL_DELAY_outlier == 'true').count()\
#/(delay_df.select('ARRIVAL_DELAY').count())

In [None]:
#outliers.where((outliers.DISTANCE_outlier == 'true') & (outliers.ELAPSED_TIME_outlier == 'true')).count()/(outliers.count())

In [None]:
#outliers.filter(outliers.ELAPSED_TIME_outlier == 'true').count()\
#/(delay_df.select('ELAPSED_TIME').count())

Consider creating a variable that uses a ratio of elapsed time to distance.
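
A rough sketch of what such a ratio feature might look like, in plain Python. The function name is hypothetical, and the units (presumably minutes and miles) are an assumption, since the column descriptions do not state them.

```python
def time_per_distance(elapsed_time, distance):
    """Elapsed time per unit of distance; None when it cannot be computed."""
    if elapsed_time is None or distance is None or distance == 0:
        return None
    return elapsed_time / distance


# Hypothetical flight: 200 time units over a distance of 1000
ratio = time_per_distance(200, 1000)
print(ratio)
```

A single ratio like this could be screened for outliers instead of filtering ELAPSED_TIME and DISTANCE separately, since a long flight time is only unusual relative to the distance flown.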

Drop columns that contain outliers: 

    

# One Hot Encoder 

We want to use OneHotEncoder on the string-type variables
'AIRLINE', 'DESTINATION_AIRPORT', 'ORIGIN_AIRPORT'
to represent them in numerical form.

OneHotEncoder maps a column of label indices to a column of binary vectors, with at most a single one-value. This is the same as dummy coding. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.


An intermediate step is to use StringIndexer.
StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.
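
To illustrate the two steps without Spark, the sketch below builds a frequency-ordered index (most frequent label gets 0) and then one-hot encodes an index. It is a plain-Python approximation with hypothetical helper names. Two assumptions are baked in: ties in frequency are broken alphabetically, and the last category is dropped (encoded as all zeros), which mirrors the `dropLast=True` default of Spark's OneHotEncoder.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index in [0, numLabels), most frequent first."""
    counts = Counter(labels)
    # Assumption: ties are broken alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_labels, drop_last=True):
    """Binary vector with at most a single 1.0 at position `index`."""
    size = num_labels - 1 if drop_last else num_labels
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Hypothetical sample of airline codes
airlines = ["WN", "DL", "WN", "AA", "DL", "WN"]

index = fit_string_index(airlines)
encoded = [one_hot(index[a], len(index)) for a in airlines]
print(index)
print(encoded[0], encoded[3])
```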

In [31]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder, StringIndexer
spark= SparkSession.builder.getOrCreate()

Apply OneHotEncoder to AIRLINE: 

In [32]:
# StringIndexer orders labels by frequency: index 0 for the most frequent, then 1, ...

stringIndexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_Index")
model = stringIndexer.fit(delay_df)
indexed = model.transform(delay_df)

encoder = OneHotEncoder(inputCol="AIRLINE_Index", outputCol="AIRLINE_Vec")
encoded = encoder.transform(indexed)


In [33]:
type(encoded)

pyspark.sql.dataframe.DataFrame

Apply OneHotEncoder to ORIGIN_AIRPORT:

Is there a way to encode ORIGIN_AIRPORT and DESTINATION_AIRPORT
together so the same airport gets the same encoding in both columns?

How do we use the OneHotEncoder output column?
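
On the question of a shared encoding: one possible approach is to build a single index from the union of both airport columns and apply that same index to each column. The plain-Python sketch below only illustrates the idea (the helper name is hypothetical, and ties are broken alphabetically as an assumption). I have not verified the equivalent Spark code.

```python
from collections import Counter


def fit_shared_index(*columns):
    """Build one frequency-ordered index over the values of all given columns."""
    counts = Counter()
    for column in columns:
        counts.update(column)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: i for i, value in enumerate(ordered)}


# First five origin/destination pairs from the notebook output
origins = ["ANC", "LAX", "SFO", "LAX", "SEA"]
destinations = ["SEA", "PBI", "CLT", "MIA", "ANC"]

shared = fit_shared_index(origins, destinations)
origin_idx = [shared[a] for a in origins]
destination_idx = [shared[a] for a in destinations]

print(shared)
print(origin_idx, destination_idx)
```

With a shared index, SEA as an origin and SEA as a destination get the same number, so the two one-hot vectors would line up position by position.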

In [34]:
# StringIndexer orders labels by frequency: index 0 for the most frequent, then 1, ...

stringIndexer2 = StringIndexer(inputCol="ORIGIN_AIRPORT", outputCol="ORIGIN_AIRPORT_Index")
model2 = stringIndexer2.fit(encoded)
indexed2 = model2.transform(encoded)

encoder2 = OneHotEncoder(inputCol="ORIGIN_AIRPORT_Index", outputCol="ORIGIN_AIRPORT_Vec")
encoded2 = encoder2.transform(indexed2)



In [35]:
# StringIndexer orders labels by frequency: index 0 for the most frequent, then 1, ...

stringIndexer3 = StringIndexer(inputCol="DESTINATION_AIRPORT", outputCol="DESTINATION_AIRPORT_Index")
model3 = stringIndexer3.fit(encoded2)
indexed3 = model3.transform(encoded2)

encoder3 = OneHotEncoder(inputCol="DESTINATION_AIRPORT_Index", outputCol="DESTINATION_AIRPORT_Vec")
encoded3 = encoder3.transform(indexed3)
encoded3.select('DESTINATION_AIRPORT','DESTINATION_AIRPORT_Index', "DESTINATION_AIRPORT_Vec").show()
#encoded3.cache()

+-------------------+-------------------------+-----------------------+
|DESTINATION_AIRPORT|DESTINATION_AIRPORT_Index|DESTINATION_AIRPORT_Vec|
+-------------------+-------------------------+-----------------------+
|                SEA|                     10.0|       (626,[10],[1.0])|
|                PBI|                     53.0|       (626,[53],[1.0])|
|                CLT|                     14.0|       (626,[14],[1.0])|
|                MIA|                     24.0|       (626,[24],[1.0])|
|                ANC|                     68.0|       (626,[68],[1.0])|
|                MSP|                      9.0|        (626,[9],[1.0])|
|                MSP|                      9.0|        (626,[9],[1.0])|
|                CLT|                     14.0|       (626,[14],[1.0])|
|                DFW|                      2.0|        (626,[2],[1.0])|
|                ATL|                      0.0|        (626,[0],[1.0])|
|                ATL|                      0.0|        (626,[0],

Drop unnecessary columns from the encoded3 dataframe

In [None]:
#encoded3.show(10)

In [36]:
new_cols_to_drop = ['AIRLINE_Index', 'AIRLINE', 'ORIGIN_AIRPORT_Index', 
                                   'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT_Index', 'DESTINATION_AIRPORT']

final_encoded = encoded3.drop(*new_cols_to_drop)

final_encoded.cache()


DataFrame[YEAR: int, MONTH: int, DAY: int, DAY_OF_WEEK: int, FLIGHT_NUMBER: int, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, DEPARTURE_DELAY: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, DISTANCE: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, ARRIVAL_DELAY: int, DIVERTED: int, CANCELLED: int, AIRLINE_Vec: vector, ORIGIN_AIRPORT_Vec: vector, DESTINATION_AIRPORT_Vec: vector]

In [None]:
data_graph, data_to_not_graph = final_encoded.randomSplit([0.10, 0.90])

In [None]:
type(data_graph)

# Bucketize 

In [37]:
from pyspark.ml.feature import Bucketizer
from pyspark.sql import Row

In [38]:
delay_splits = [0, 300, 600, 900, 1200, 1500, 1800, 2100, 2400]

In [39]:
# Does the job, quickly too, but not very elegantly. Look into how to bucketize groups of columns

deptime_bucketizer = Bucketizer() \
  .setInputCol("DEPARTURE_TIME") \
  .setOutputCol("B_DEPARTURE_TIME") \
  .setSplits(delay_splits)

scheddep_bucketizer = Bucketizer() \
  .setInputCol("SCHEDULED_DEPARTURE") \
  .setOutputCol("B_SCHEDULED_DEPARTURE") \
  .setSplits(delay_splits)

arrtime_bucketizer = Bucketizer() \
  .setInputCol("ARRIVAL_TIME") \
  .setOutputCol("B_ARRIVAL_TIME") \
  .setSplits(delay_splits)

schedarr_bucketizer = Bucketizer() \
  .setInputCol("SCHEDULED_ARRIVAL") \
  .setOutputCol("B_SCHEDULED_ARRIVAL") \
  .setSplits(delay_splits)


In [41]:
#Transform original data into its bucket index.
final_df = deptime_bucketizer\
               .transform(scheddep_bucketizer\
               .transform(arrtime_bucketizer\
               .transform(schedarr_bucketizer\
               .transform(final_encoded))))
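
To check what the bucket indices mean, here is a plain-Python sketch of the bucketing rule as I understand it from the Bucketizer documentation: each bucket covers [x, y), except the last one, which also includes its upper bound, and values outside the splits are treated as errors by default. The helper name is hypothetical.

```python
from bisect import bisect_right

delay_splits = [0, 300, 600, 900, 1200, 1500, 1800, 2100, 2400]


def bucket_index(value, splits):
    """Index of the bucket [splits[i], splits[i+1]) containing `value`.

    The last bucket also includes its upper bound; values outside the
    splits raise an error.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value {} is outside the splits".format(value))
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1


# Applying the same splits to several values in a loop
first_row = {"DEPARTURE_TIME": 2354, "SCHEDULED_DEPARTURE": 5, "ARRIVAL_TIME": 408}
buckets = {"B_" + name: bucket_index(v, delay_splits) for name, v in first_row.items()}
print(buckets)
```

The sample values are the first row's times from the `show(5)` output earlier in the notebook. With nine split points there are eight 3-hour buckets, indexed 0 to 7. For bucketizing groups of columns in Spark itself, recent PySpark versions appear to document `inputCols`, `outputCols` and `splitsArray` parameters on Bucketizer, which would be worth checking against the installed version.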

In [42]:
final_df.select(["B_DEPARTURE_TIME", "B_SCHEDULED_DEPARTURE", "B_ARRIVAL_TIME", "B_SCHEDULED_ARRIVAL"]).show(10)

+----------------+---------------------+--------------+-------------------+
|B_DEPARTURE_TIME|B_SCHEDULED_DEPARTURE|B_ARRIVAL_TIME|B_SCHEDULED_ARRIVAL|
+----------------+---------------------+--------------+-------------------+
|             7.0|                  0.0|           1.0|                1.0|
|             0.0|                  0.0|           2.0|                2.0|
|             0.0|                  0.0|           2.0|                2.0|
|             0.0|                  0.0|           2.0|                2.0|
|             0.0|                  0.0|           0.0|                1.0|
|             0.0|                  0.0|           2.0|                2.0|
|             0.0|                  0.0|           1.0|                1.0|
|             0.0|                  0.0|           2.0|                2.0|
|             0.0|                  0.0|           1.0|                1.0|
|             0.0|                  0.0|           2.0|                2.0|
+-----------

In [43]:
final_df.show(4)

+----+-----+---+-----------+-------------+-------------------+--------------+---------------+--------------+------------+--------+-----------------+------------+-------------+--------+---------+--------------+------------------+-----------------------+-------------------+--------------+---------------------+----------------+
|YEAR|MONTH|DAY|DAY_OF_WEEK|FLIGHT_NUMBER|SCHEDULED_DEPARTURE|DEPARTURE_TIME|DEPARTURE_DELAY|SCHEDULED_TIME|ELAPSED_TIME|DISTANCE|SCHEDULED_ARRIVAL|ARRIVAL_TIME|ARRIVAL_DELAY|DIVERTED|CANCELLED|   AIRLINE_Vec|ORIGIN_AIRPORT_Vec|DESTINATION_AIRPORT_Vec|B_SCHEDULED_ARRIVAL|B_ARRIVAL_TIME|B_SCHEDULED_DEPARTURE|B_DEPARTURE_TIME|
+----+-----+---+-----------+-------------+-------------------+--------------+---------------+--------------+------------+--------+-----------------+------------+-------------+--------+---------+--------------+------------------+-----------------------+-------------------+--------------+---------------------+----------------+
|2015|    1|  1|   