# Spark Practical Work

We are supposed to create a model capable of predicting the arrival delay time of a commercial flight based on several parameters known at the take-off time. Tasks:
* Load the input data, previously stored at a known location.
* Select, process and transform the input variables, to prepare them for training the model.
* Perform some basic analysis of each input variable. 
* Create a ML model that predicts the arrival delay time.
* Validate the created model and provide some measures of its accuracy.


In [1]:
import os
os.getcwd()

'/home/dslab/workspaces/rrunix/spark/final_project'

# 1. Load data

In [2]:
# extract files into csv formats
import bz2
# extract every bz2 format file in ../BigData/data/project_data/ to csv file
files = os.listdir("../BigData/data/project_data/")
def bz2_to_csv(files):
	for file in files:
		if file.endswith(".bz2"):
			file_path = "../BigData/data/project_data/" + file
			with bz2.open(file_path, "rb") as f:
				file_content = f.read()
			with open("../BigData/data/project_data/" + file[:-4], "wb") as f:
				f.write(file_content)

bz2_to_csv(files)

In [3]:
# Create a SparkSession
from pyspark.sql import SparkSession
from pyspark import SparkContext

sc = SparkContext("local", "ComercialFlights")
spark = SparkSession.builder \
            .appName("First Session") \
            .master("local[*]") \
            .getOrCreate()

sc.setLogLevel("ERROR")
print("Spark Version: {}".format(sc.version))

23/12/24 15:18:10 WARN Utils: Your hostname, mordor resolves to a loopback address: 127.0.1.1; using 193.147.50.16 instead (on interface eno1np0)
23/12/24 15:18:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/24 15:18:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark Version: 3.5.0


In [9]:
# Use DataFrames to read csv files
# read all csv files in ../BigData/data/project_data/ to pyspark dataframe
csv_files = os.listdir("../BigData/data/project_data/")
def csv_to_df(csv_files):
	df_pyspark =[]
	for file in csv_files:
		file_path = "../BigData/data/project_data/" + file
		df = spark.read.csv(file_path, header=True, inferSchema=True)
		df_pyspark.append(df)
	return df_pyspark

df_pyspark = csv_to_df(csv_files)

# unir todos los dataframes en uno solo
from functools import reduce
from pyspark.sql import DataFrame
def unionAll(*dfs):
	return reduce(DataFrame.unionAll, dfs)

df = unionAll(*df_pyspark)
df.show(5)

		
	

                                                                                

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|1988|    1|         9|        6|   1348|      1331|   1458|      1435|           PI|      942

In [10]:
df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: string (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Carr

In [21]:
# comprobar el numero de filas de df 
print(df.count())
# comprobar el numero de columnas de df
print(len(df.columns))



25952068
29


                                                                                

# 2. Process data

The dataset has 29 columns. We won't use all of them. The ones that should be droped are: 
* ArrTime
* ActualElapsedTime
* AirTime
* TaxiIn
* Diverted
* CarrierDelay
* WeatherDelay
* NASDelay
* SecurityDelay
* LateAircraftDelay

Meaning of the variables that we keep: 
1. Year 1987-2008 
2. Month 1-12 
3. DayofMonth 1-31 
4. DayOfWeek 1 (Monday) - 7 (Sunday)
5. DepTime actual departure time (local, hhm m) 
6. CRSDepTime scheduled departure time (local, hhmm) 
7. CRSArrTime scheduled arrival time (local, hhmm) 
8. UniqueCarrier Airline code 
9. FlightNum flight number 
10. TailNum plane tail number 
11. CRSElapsedTime in minutes (estimated flight time)
12. ArrDelay arrival delay, in minutes -- TARGET VARIABLE
13. DepDelay departure delay, in minutes 
14. Origin origin IATA airport code 
15. Dest destination IATA airport code 
16. Distance in miles 
17. TaxiOut taxi out time in minutes (tiempo que tarda el avión desde la puerta de embarque hasta el despegue")
18. Cancelled was the flight cancelled? 
19. CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security) 

In [70]:
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer

# rename columns:
def edit_column_names(df):
    df =  df.withColumnRenamed('DayofMonth','day_of_month').\
                withColumnRenamed('DayOfWeek','day_of_week').\
                withColumnRenamed('DepTime','actual_departure_time').\
                withColumnRenamed('CRSDepTime','scheduled_departure_time').\
                withColumnRenamed('ArrTime','actual_arrival_time').\
                withColumnRenamed('CRSArrTime','scheduled_arrival_time').\
                withColumnRenamed('UniqueCarrier','airline_code').\
                withColumnRenamed('FlightNum','flight_number').\
                withColumnRenamed('TailNum','plane_number').\
                withColumnRenamed('ActualElapsedTime','actual_flight_time').\
                withColumnRenamed('CRSElapsedTime','scheduled_flight_time').\
                withColumnRenamed('AirTime','air_time').\
                withColumnRenamed('ArrDelay','arrival_delay').\
                withColumnRenamed('DepDelay','departure_delay').\
                withColumnRenamed('TaxiIn','taxi_in').\
                withColumnRenamed('TaxiOut','taxi_out').\
                withColumnRenamed('CancellationCode','cancellation_code').\
                withColumnRenamed('CarrierDelay','carrier_delay').\
                withColumnRenamed('WeatherDelay','weather_delay').\
                withColumnRenamed('NASDelay','nas_delay').\
                withColumnRenamed('SecurityDelay','security_delay').\
                withColumnRenamed('LateAircraftDelay','late_aircraft_delay')
    for col in df.columns:
        df = df.withColumnRenamed(col, col.lower())
    return df

# select columns:
def my_columns (df):
    df = df.select('year','month','day_of_month', 'day_of_week', 'actual_departure_time',
                   'scheduled_departure_time', 'scheduled_arrival_time', 'airline_code',
                   'flight_number', 'plane_number', 'scheduled_flight_time', 'arrival_delay',
                   'departure_delay', 'origin', 'dest', 'distance', 'taxi_out', 'cancelled',
                   'cancellation_code')
    return df

# combine to create dates:
def add_date_column(df):
    df = df.withColumn('date', to_date(concat(col('day_of_month'), lit(' '),
                                              col('month'), lit(' '), col('year')), 'd M yyyy'))
    return df

# some strings to float:
def string_to_float(df):
    df = df.withColumn('arrival_delay', col('arrival_delay').cast('float'))
    df = df.withColumn('departure_delay', col('departure_delay').cast('float'))
    df = df.withColumn('taxi_out', col('taxi_out').cast('float'))
    df = df.withColumn('distance', col('distance').cast('float'))
    return df

# encode categorical features:
def encode_categorical_features(df):
    indexer = StringIndexer(inputCols=['airline_code', 'origin', 'dest', 'cancellation_code', 'plane_number'],
                            outputCols=['airline_index', 'origin_index', 'dest_index', 'cancellation_index', 'plane_index'])
    
    df = indexer.fit(df).transform(df)
    return df
    

# convert time to minutes:
def convert_time_to_minutes(df):
    # for actual_departure_time, scheduled_departure_time, scheduled_arrival_time, scheduled_flight_time transform to minutes
    # take the first two digits and multiply by 60 and add the last two digits
    df = df.withColumn('actual_departure_hour', (col('actual_departure_time') / 100).cast('int'))
    df = df.withColumn('scheduled_departure_hour', (col('scheduled_departure_time') / 100).cast('int'))
    df = df.withColumn('scheduled_arrival_hour', (col('scheduled_arrival_time') / 100).cast('int'))
    df = df.withColumn('scheduled_flight_hour', (col('scheduled_flight_time') / 100).cast('int'))
    
    df = df.withColumn('actual_departure_time_mins', (col('actual_departure_hour') * 60) + (col('actual_departure_time') % 100))
    df = df.withColumn('scheduled_departure_time_mins', (col('scheduled_departure_hour') * 60) + (col('scheduled_departure_time') % 100))
    df = df.withColumn('scheduled_arrival_time_mins', (col('scheduled_arrival_hour') * 60) + (col('scheduled_arrival_time') % 100))
    df = df.withColumn('scheduled_flight_time_mins', (col('scheduled_flight_hour') * 60) + (col('scheduled_flight_time') % 100))
    
    # drop actual_departure_hour, scheduled_departure_hour, scheduled_arrival_hour, scheduled_flight_hour
    
    df = df.drop('actual_departure_hour', 'scheduled_departure_hour', 'scheduled_arrival_hour', 'scheduled_flight_hour')
    
    return df

# handle missing values:
def handle_missing_values(df):
    # eliminar filas donde actual_departure_time es null
    df = df.filter(df.actual_departure_time.isNotNull())
    # eliminar filas donde scheduled_flight_time es null
    df = df.filter(df.scheduled_flight_time.isNotNull())
    return df

def my_df(df):
    # select columns
    df = df.select('date','year','month','day_of_month', 'day_of_week', 'actual_departure_time_mins',
                   'scheduled_departure_time_mins', 'scheduled_arrival_time_mins', 'airline_index',
                   'flight_number', 'scheduled_flight_time_mins',
                   'origin_index', 'dest_index', 'distance', 'taxi_out', 'cancelled', 'departure_delay')
    return df
    

# standarize df
def standarize_dataframe(df):
    temp = edit_column_names(df)
    temp = my_columns(temp)
    temp = string_to_float(temp)
    temp = add_date_column(temp)
    temp = encode_categorical_features(temp)
    temp = convert_time_to_minutes(temp)
    temp = handle_missing_values(temp)
    temp = my_df(temp)

    return temp



In [59]:
new_df = standarize_dataframe(df)
new_df.show(5)

                                                                                

+----------+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+
|      date|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|taxi_out|cancelled|
+----------+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+
|1988-01-09|1988|    1|           9|          6|                     828.0|                          811|                        875|          8.0|          942|                      64.0|           17.0|        63.0|      23.0| 

In [60]:
new_df.printSchema()

root
 |-- date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- actual_departure_time_mins: double (nullable = true)
 |-- scheduled_departure_time_mins: integer (nullable = true)
 |-- scheduled_arrival_time_mins: integer (nullable = true)
 |-- airline_index: double (nullable = false)
 |-- flight_number: integer (nullable = true)
 |-- scheduled_flight_time_mins: double (nullable = true)
 |-- departure_delay: float (nullable = true)
 |-- origin_index: double (nullable = false)
 |-- dest_index: double (nullable = false)
 |-- distance: float (nullable = true)
 |-- taxi_out: float (nullable = true)
 |-- cancelled: integer (nullable = true)



In [61]:
new_df.columns

['date',
 'year',
 'month',
 'day_of_month',
 'day_of_week',
 'actual_departure_time_mins',
 'scheduled_departure_time_mins',
 'scheduled_arrival_time_mins',
 'airline_index',
 'flight_number',
 'scheduled_flight_time_mins',
 'departure_delay',
 'origin_index',
 'dest_index',
 'distance',
 'taxi_out',
 'cancelled']

In [62]:
# contar el número de variables nan en cada columna
new_df.select([count(when(col(c) == "NA", c)).alias(c) for c in new_df.columns]).show()



+----+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+
|date|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|taxi_out|cancelled|
+----+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+
|   0|   0|    0|           0|          0|                         0|                            0|                          0|            0|            0|                         0|              0|           0|         0|       0|       0|       

                                                                                

In [68]:
# contar valores NULL en cada columna
new_df.select([count(when(col(c).isNull(), c)).alias(c) for c in new_df.columns]).show()



+----+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+
|date|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|taxi_out|cancelled|
+----+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+
|   0|   0|    0|           0|          0|                    518228|                            0|                          0|            0|            0|                      5586|         518228|           0|         0|   22204|10533076|       

                                                                                

* El 0.019 % de actual_departure_time es nan		--> eliminamos filas
* el 0.4 % de plane_number es nan		--> No usamos esta variable
* mucho menos del 0.02 % de schedule_flig_time		--> eliminamos filas con nulos
* 0.81% de cancellation_code es nan 		--> No usamos esta variable

AHORA APARECE COMO QUE NO HAY NINGUNO PORQUE YA LOS HE QUITADO, PERO SI QUE HABÍA

taxi_out tiene muchos valores NULL

In [66]:
new_df.show()

+----------+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+
|      date|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|taxi_out|cancelled|
+----------+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+
|1988-01-09|1988|    1|           9|          6|                     828.0|                          811|                        875|          8.0|          942|                      64.0|           17.0|        63.0|      23.0| 

In [67]:
from pyspark.ml.feature import VectorAssembler

featureassembler = VectorAssembler(inputCols=['month', 'day_of_week', 'actual_departure_time_mins',
                   'scheduled_departure_time_mins', 'scheduled_arrival_time_mins', 'airline_index',
                   'flight_number', 'scheduled_flight_time_mins', 'departure_delay',
                   'origin_index', 'dest_index', 'distance',  'cancelled'], 
                outputCol='features')
# no he incluido TaxiOut porque tiene muchos valores nulos (pero lo mismo deberíamos)

output = featureassembler.transform(new_df)
output.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|[1.0,6.0,828.0,81...|
|[1.0,7.0,814.0,81...|
|[1.0,1.0,886.0,81...|
|[1.0,2.0,814.0,81...|
|[1.0,3.0,821.0,81...|
+--------------------+
only showing top 5 rows



In [71]:
output.show(5)

+----------+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+--------------------+
|      date|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|taxi_out|cancelled|            features|
+----------+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+--------+---------+--------------------+
|1988-01-09|1988|    1|           9|          6|                     828.0|                          811|                        875|          8.0|          942|     

In [69]:
final_data = output.select('features', 'arrival_delay')
final_data.show(5)

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `arrival_delay` cannot be resolved. Did you mean one of the following? [`departure_delay`, `origin_index`, `airline_index`, `cancelled`, `date`].;
'Project [features#27866, 'arrival_delay]
+- Project [date#26724, year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time_mins#27070, scheduled_departure_time_mins#27101, scheduled_arrival_time_mins#27133, airline_index#26911, flight_number#26025, scheduled_flight_time_mins#27166, departure_delay#26664, origin_index#26912, dest_index#26913, distance#26704, taxi_out#26684, cancelled#26385, UDF(struct(month_double_VectorAssembler_65e4c87d18d1, cast(month#25785 as double), day_of_week_double_VectorAssembler_65e4c87d18d1, cast(day_of_week#25845 as double), actual_departure_time_mins, actual_departure_time_mins#27070, scheduled_departure_time_mins_double_VectorAssembler_65e4c87d18d1, cast(scheduled_departure_time_mins#27101 as double), scheduled_arrival_time_mins_double_VectorAssembler_65e4c87d18d1, cast(scheduled_arrival_time_mins#27133 as double), airline_index, airline_index#26911, flight_number_double_VectorAssembler_65e4c87d18d1, cast(flight_number#26025 as double), scheduled_flight_time_mins, scheduled_flight_time_mins#27166, departure_delay_double_VectorAssembler_65e4c87d18d1, cast(departure_delay#26664 as double), origin_index, origin_index#26912, dest_index, dest_index#26913, distance_double_VectorAssembler_65e4c87d18d1, cast(distance#26704 as double), ... 2 more fields)) AS features#27866]
   +- Project [date#26724, year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time_mins#27070, scheduled_departure_time_mins#27101, scheduled_arrival_time_mins#27133, airline_index#26911, flight_number#26025, scheduled_flight_time_mins#27166, departure_delay#26664, origin_index#26912, dest_index#26913, distance#26704, taxi_out#26684, cancelled#26385]
      +- Filter isnotnull(scheduled_flight_time#26115)
         +- Filter isnotnull(actual_departure_time#25875)
            +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 5 more fields]
               +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 9 more fields]
                  +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 8 more fields]
                     +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 7 more fields]
                        +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 6 more fields]
                           +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 5 more fields]
                              +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 4 more fields]
                                 +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 3 more fields]
                                    +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, airline_index#26911, origin_index#26912, dest_index#26913, cancellation_index#26914, ... 2 more fields]
                                       +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, date#26724, UDF(cast(airline_code#25995 as string)) AS airline_index#26911, UDF(cast(origin#26235 as string)) AS origin_index#26912, UDF(cast(dest#26265 as string)) AS dest_index#26913, UDF(cast(cancellation_code#26415 as string)) AS cancellation_index#26914, UDF(cast(plane_number#26055 as string)) AS plane_index#26915]
                                          +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415, to_date(concat(cast(day_of_month#25815 as string),  , cast(month#25785 as string),  , cast(year#25755 as string)), Some(d M yyyy), Some(Etc/UTC), false) AS date#26724]
                                             +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, cast(distance#26295 as float) AS distance#26704, taxi_out#26684, cancelled#26385, cancellation_code#26415]
                                                +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, departure_delay#26664, origin#26235, dest#26265, distance#26295, cast(taxi_out#26355 as float) AS taxi_out#26684, cancelled#26385, cancellation_code#26415]
                                                   +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26644, cast(departure_delay#26205 as float) AS departure_delay#26664, origin#26235, dest#26265, distance#26295, taxi_out#26355, cancelled#26385, cancellation_code#26415]
                                                      +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, cast(arrival_delay#26175 as float) AS arrival_delay#26644, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_out#26355, cancelled#26385, cancellation_code#26415]
                                                         +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, scheduled_flight_time#26115, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_out#26355, cancelled#26385, cancellation_code#26415]
                                                            +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#26355, cancelled#26385, cancellation_code#26415, diverted#26445, ... 5 more fields]
                                                               +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#26355, cancelled#26385, cancellation_code#26415, diverted#26445, ... 5 more fields]
                                                                  +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#26355, cancelled#26385, cancellation_code#26415, diverted#26445, ... 5 more fields]
                                                                     +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#26355, cancelled#26385, cancellation_code#26415, diverted#26445, ... 5 more fields]
                                                                        +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#26355, cancelled#26385, cancellation_code#26415, diverted#26445, ... 5 more fields]
                                                                           +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#26355, cancelled#26385, cancellation_code#26415, Diverted#119 AS diverted#26445, ... 5 more fields]
                                                                              +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#26355, cancelled#26385, cancellation_code#25575 AS cancellation_code#26415, Diverted#119, ... 5 more fields]
                                                                                 +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#26355, Cancelled#117 AS cancelled#26385, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                    +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#26325, taxi_out#25545 AS taxi_out#26355, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                       +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, distance#26295, taxi_in#25515 AS taxi_in#26325, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                          +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, dest#26265, Distance#114 AS distance#26295, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                             +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, origin#26235, Dest#113 AS dest#26265, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#26205, Origin#112 AS origin#26235, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                   +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#26175, departure_delay#25485 AS departure_delay#26205, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                      +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#26145, arrival_delay#25455 AS arrival_delay#26175, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                         +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#26115, air_time#25425 AS air_time#26145, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                            +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#26085, scheduled_flight_time#25395 AS scheduled_flight_time#26115, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                               +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#26055, actual_flight_time#25365 AS actual_flight_time#26085, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                  +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#26025, plane_number#25335 AS plane_number#26055, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                     +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25995, flight_number#25305 AS flight_number#26025, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                        +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25965, airline_code#25275 AS airline_code#25995, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                           +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25935, scheduled_arrival_time#25245 AS scheduled_arrival_time#25965, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                              +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25905, actual_arrival_time#25215 AS actual_arrival_time#25935, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                 +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25875, scheduled_departure_time#25185 AS scheduled_departure_time#25905, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                    +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25845, actual_departure_time#25155 AS actual_departure_time#25875, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                       +- Project [year#25755, month#25785, day_of_month#25815, day_of_week#25125 AS day_of_week#25845, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                          +- Project [year#25755, month#25785, day_of_month#25095 AS day_of_month#25815, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                             +- Project [year#25755, Month#97 AS month#25785, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                                +- Project [Year#96 AS year#25755, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                                   +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                                      +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                                         +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                                            +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                                               +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                                                  +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, taxi_out#25545, Cancelled#117, CancellationCode#118 AS cancellation_code#25575, Diverted#119, ... 5 more fields]
                                                                                                                                                                     +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, taxi_in#25515, TaxiOut#116 AS taxi_out#25545, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                        +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, departure_delay#25485, Origin#112, Dest#113, Distance#114, TaxiIn#115 AS taxi_in#25515, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                           +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, arrival_delay#25455, DepDelay#111 AS departure_delay#25485, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                              +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, air_time#25425, ArrDelay#110 AS arrival_delay#25455, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                 +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, scheduled_flight_time#25395, AirTime#109 AS air_time#25425, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                    +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, actual_flight_time#25365, CRSElapsedTime#529 AS scheduled_flight_time#25395, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                       +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, plane_number#25335, ActualElapsedTime#107 AS actual_flight_time#25365, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                          +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, flight_number#25305, TailNum#106 AS plane_number#25335, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                             +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, airline_code#25275, FlightNum#105 AS flight_number#25305, TailNum#106, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, scheduled_arrival_time#25245, UniqueCarrier#104 AS airline_code#25275, FlightNum#105, TailNum#106, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                   +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, actual_arrival_time#25215, CRSArrTime#103 AS scheduled_arrival_time#25245, UniqueCarrier#104, FlightNum#105, TailNum#106, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                      +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, scheduled_departure_time#25185, ArrTime#102 AS actual_arrival_time#25215, CRSArrTime#103, UniqueCarrier#104, FlightNum#105, TailNum#106, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                         +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, actual_departure_time#25155, CRSDepTime#101 AS scheduled_departure_time#25185, ArrTime#102, CRSArrTime#103, UniqueCarrier#104, FlightNum#105, TailNum#106, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                            +- Project [Year#96, Month#97, day_of_month#25095, day_of_week#25125, DepTime#100 AS actual_departure_time#25155, CRSDepTime#101, ArrTime#102, CRSArrTime#103, UniqueCarrier#104, FlightNum#105, TailNum#106, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                               +- Project [Year#96, Month#97, day_of_month#25095, DayOfWeek#99 AS day_of_week#25125, DepTime#100, CRSDepTime#101, ArrTime#102, CRSArrTime#103, UniqueCarrier#104, FlightNum#105, TailNum#106, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                                  +- Project [Year#96, Month#97, DayofMonth#98 AS day_of_month#25095, DayOfWeek#99, DepTime#100, CRSDepTime#101, ArrTime#102, CRSArrTime#103, UniqueCarrier#104, FlightNum#105, TailNum#106, ActualElapsedTime#107, CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                                     +- Union false, false
                                                                                                                                                                                                                        :- Project [Year#96, Month#97, DayofMonth#98, DayOfWeek#99, DepTime#100, CRSDepTime#101, ArrTime#102, CRSArrTime#103, UniqueCarrier#104, FlightNum#105, TailNum#106, ActualElapsedTime#107, cast(CRSElapsedTime#108 as string) AS CRSElapsedTime#529, AirTime#109, ArrDelay#110, DepDelay#111, Origin#112, Dest#113, Distance#114, TaxiIn#115, TaxiOut#116, Cancelled#117, CancellationCode#118, Diverted#119, ... 5 more fields]
                                                                                                                                                                                                                        :  +- Relation [Year#96,Month#97,DayofMonth#98,DayOfWeek#99,DepTime#100,CRSDepTime#101,ArrTime#102,CRSArrTime#103,UniqueCarrier#104,FlightNum#105,TailNum#106,ActualElapsedTime#107,CRSElapsedTime#108,AirTime#109,ArrDelay#110,DepDelay#111,Origin#112,Dest#113,Distance#114,TaxiIn#115,TaxiOut#116,Cancelled#117,CancellationCode#118,Diverted#119,... 5 more fields] csv
                                                                                                                                                                                                                        :- Project [Year#171, Month#172, DayofMonth#173, DayOfWeek#174, DepTime#175, CRSDepTime#176, ArrTime#177, CRSArrTime#178, UniqueCarrier#179, FlightNum#180, TailNum#181, ActualElapsedTime#182, CRSElapsedTime#183, AirTime#184, ArrDelay#185, DepDelay#186, Origin#187, Dest#188, cast(Distance#189 as string) AS Distance#530, TaxiIn#190, TaxiOut#191, Cancelled#192, CancellationCode#193, Diverted#194, ... 5 more fields]
                                                                                                                                                                                                                        :  +- Relation [Year#171,Month#172,DayofMonth#173,DayOfWeek#174,DepTime#175,CRSDepTime#176,ArrTime#177,CRSArrTime#178,UniqueCarrier#179,FlightNum#180,TailNum#181,ActualElapsedTime#182,CRSElapsedTime#183,AirTime#184,ArrDelay#185,DepDelay#186,Origin#187,Dest#188,Distance#189,TaxiIn#190,TaxiOut#191,Cancelled#192,CancellationCode#193,Diverted#194,... 5 more fields] csv
                                                                                                                                                                                                                        :- Project [Year#246, Month#247, DayofMonth#248, DayOfWeek#249, DepTime#250, CRSDepTime#251, ArrTime#252, CRSArrTime#253, UniqueCarrier#254, FlightNum#255, TailNum#256, ActualElapsedTime#257, cast(CRSElapsedTime#258 as string) AS CRSElapsedTime#562, AirTime#259, ArrDelay#260, DepDelay#261, Origin#262, Dest#263, Distance#264, TaxiIn#265, TaxiOut#266, Cancelled#267, CancellationCode#268, Diverted#269, ... 5 more fields]
                                                                                                                                                                                                                        :  +- Relation [Year#246,Month#247,DayofMonth#248,DayOfWeek#249,DepTime#250,CRSDepTime#251,ArrTime#252,CRSArrTime#253,UniqueCarrier#254,FlightNum#255,TailNum#256,ActualElapsedTime#257,CRSElapsedTime#258,AirTime#259,ArrDelay#260,DepDelay#261,Origin#262,Dest#263,Distance#264,TaxiIn#265,TaxiOut#266,Cancelled#267,CancellationCode#268,Diverted#269,... 5 more fields] csv
                                                                                                                                                                                                                        :- Project [Year#321, Month#322, DayofMonth#323, DayOfWeek#324, DepTime#325, CRSDepTime#326, ArrTime#327, CRSArrTime#328, UniqueCarrier#329, FlightNum#330, TailNum#331, ActualElapsedTime#332, CRSElapsedTime#333, AirTime#334, ArrDelay#335, DepDelay#336, Origin#337, Dest#338, cast(Distance#339 as string) AS Distance#593, cast(TaxiIn#340 as string) AS TaxiIn#594, cast(TaxiOut#341 as string) AS TaxiOut#595, Cancelled#342, CancellationCode#343, Diverted#344, ... 5 more fields]
                                                                                                                                                                                                                        :  +- Relation [Year#321,Month#322,DayofMonth#323,DayOfWeek#324,DepTime#325,CRSDepTime#326,ArrTime#327,CRSArrTime#328,UniqueCarrier#329,FlightNum#330,TailNum#331,ActualElapsedTime#332,CRSElapsedTime#333,AirTime#334,ArrDelay#335,DepDelay#336,Origin#337,Dest#338,Distance#339,TaxiIn#340,TaxiOut#341,Cancelled#342,CancellationCode#343,Diverted#344,... 5 more fields] csv
                                                                                                                                                                                                                        :- Project [Year#396, Month#397, DayofMonth#398, DayOfWeek#399, DepTime#400, CRSDepTime#401, ArrTime#402, CRSArrTime#403, UniqueCarrier#404, FlightNum#405, TailNum#406, ActualElapsedTime#407, CRSElapsedTime#408, AirTime#409, ArrDelay#410, DepDelay#411, Origin#412, Dest#413, cast(Distance#414 as string) AS Distance#626, TaxiIn#415, TaxiOut#416, Cancelled#417, CancellationCode#418, Diverted#419, ... 5 more fields]
                                                                                                                                                                                                                        :  +- Relation [Year#396,Month#397,DayofMonth#398,DayOfWeek#399,DepTime#400,CRSDepTime#401,ArrTime#402,CRSArrTime#403,UniqueCarrier#404,FlightNum#405,TailNum#406,ActualElapsedTime#407,CRSElapsedTime#408,AirTime#409,ArrDelay#410,DepDelay#411,Origin#412,Dest#413,Distance#414,TaxiIn#415,TaxiOut#416,Cancelled#417,CancellationCode#418,Diverted#419,... 5 more fields] csv
                                                                                                                                                                                                                        +- Project [Year#471, Month#472, DayofMonth#473, DayOfWeek#474, DepTime#475, CRSDepTime#476, ArrTime#477, CRSArrTime#478, UniqueCarrier#479, FlightNum#480, TailNum#481, ActualElapsedTime#482, CRSElapsedTime#483, AirTime#484, ArrDelay#485, DepDelay#486, Origin#487, Dest#488, cast(Distance#489 as string) AS Distance#657, cast(TaxiIn#490 as string) AS TaxiIn#658, cast(TaxiOut#491 as string) AS TaxiOut#659, Cancelled#492, CancellationCode#493, Diverted#494, ... 5 more fields]
                                                                                                                                                                                                                           +- Relation [Year#471,Month#472,DayofMonth#473,DayOfWeek#474,DepTime#475,CRSDepTime#476,ArrTime#477,CRSArrTime#478,UniqueCarrier#479,FlightNum#480,TailNum#481,ActualElapsedTime#482,CRSElapsedTime#483,AirTime#484,ArrDelay#485,DepDelay#486,Origin#487,Dest#488,Distance#489,TaxiIn#490,TaxiOut#491,Cancelled#492,CancellationCode#493,Diverted#494,... 5 more fields] csv


# 3. Linear Regression

In [None]:
from pyspark.ml.regression import LinearRegression
# train test split
train_data, test_data = final_data.randomSplit([0.75, 0.25])
regressor = LinearRegression(featuresCol='features', labelCol='arrival_delay')
regressor = regressor.fit(train_data)


In [None]:
regressor.coefficients

In [None]:
regressor.intercept

In [None]:
# evalute model
pred_results = regressor.evaluate(test_data)
pred_results.predictions.show(5)

In [None]:
pred_results.meanAbsoluteError 
pred_results.meanSquaredError
pred_results.rootMeanSquaredError
pred_results.r2

In [10]:
spark.stop()