# Spark Practical Work

The goal is to build a model that predicts the arrival delay of a commercial flight from variables known at take-off time. Tasks:
* Load the input data, previously stored at a known location.
* Select, process and transform the input variables, to prepare them for training the model.
* Perform some basic analysis of each input variable. 
* Create a ML model that predicts the arrival delay time.
* Validate the created model and provide some measures of its accuracy.


In [1]:
import os
os.getcwd()

'/home/dslab/workspaces/rrunix/spark/final_project'

# 1. Load data

In [2]:
# extract files into csv formats
# import bz2
# files = os.listdir("../BigData/data/project_data/")
# def bz2_to_csv(files):
# 	for file in files:
# 		if file.endswith(".bz2"):
# 			file_path = "../BigData/data/project_data/" + file
# 			with bz2.open(file_path, "rb") as f:
# 				file_content = f.read()
# 			with open("../BigData/data/project_data/" + file[:-4], "wb") as f:
# 				f.write(file_content)

# bz2_to_csv(files)

In [3]:
# Create a SparkSession and reuse its SparkContext
# (creating a separate SparkContext("local", ...) first would make getOrCreate()
# reuse that single-threaded context and ignore master("local[*]"))
from pyspark.sql import SparkSession

spark = SparkSession.builder \
            .appName("CommercialFlights") \
            .master("local[*]") \
            .getOrCreate()
sc = spark.sparkContext

sc.setLogLevel("ERROR")
print("Spark Version: {}".format(sc.version))

23/12/25 09:22:39 WARN Utils: Your hostname, mordor resolves to a loopback address: 127.0.1.1; using 193.147.50.16 instead (on interface eno1np0)
23/12/25 09:22:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/25 09:22:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark Version: 3.5.0


In [4]:
# Use DataFrames to read csv files
# read all csv files in ../BigData/data/project_data/ to pyspark dataframe
from functools import reduce
from pyspark.sql import DataFrame

def csv_to_df(csv_files):
	df_pyspark =[]
	for file in csv_files:
		file_path = "../BigData/data/project_data/" + file
		df = spark.read.csv(file_path, header=True, inferSchema=True)
		df_pyspark.append(df)
	return df_pyspark

csv_files = os.listdir("../BigData/data/project_data/")
df_pyspark = csv_to_df(csv_files)

# merge all pyspark dataframes into one (union matches columns by position, not by name)
def unionAll(*dfs):
	return reduce(DataFrame.union, dfs)

df = unionAll(*df_pyspark)
df.show(5)

		
	

                                                                                

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|1988|    1|         9|        6|   1348|      1331|   1458|      1435|           PI|      942

In [5]:
df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: string (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Carr

In [9]:
# count the rows and columns of df
n_rows = df.count()
n_cols1 = len(df.columns)

print("Number of rows: ", n_rows)
print("Number of columns: ", n_cols1)



Number of rows:  25952068
Number of columns:  29


                                                                                

# 2. Process data

The dataset has 29 columns, but we won't use all of them. The following must be dropped because their values are only known after take-off:
* ArrTime
* ActualElapsedTime
* AirTime
* TaxiIn
* Diverted
* CarrierDelay
* WeatherDelay
* NASDelay
* SecurityDelay
* LateAircraftDelay

Meaning of the variables that we keep: 
1. Year 1987-2008 
2. Month 1-12 
3. DayofMonth 1-31 
4. DayOfWeek 1 (Monday) - 7 (Sunday)
5. DepTime actual departure time (local, hhmm) 
6. CRSDepTime scheduled departure time (local, hhmm) 
7. CRSArrTime scheduled arrival time (local, hhmm) 
8. UniqueCarrier Airline code 
9. FlightNum flight number 
10. TailNum plane tail number 
11. CRSElapsedTime in minutes (estimated flight time)
12. ArrDelay arrival delay, in minutes -- TARGET VARIABLE
13. DepDelay departure delay, in minutes 
14. Origin origin IATA airport code 
15. Dest destination IATA airport code 
16. Distance in miles 
17. TaxiOut taxi out time in minutes (time the plane takes from the boarding gate to take-off)
18. Cancelled was the flight cancelled? 
19. CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security) 
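
The clock-time fields in the list above (DepTime, CRSDepTime, CRSArrTime) are encoded as hhmm integers, which are awkward as numeric features because 1359 and 1400 are one minute apart but 41 units apart. A minimal sketch of the conversion to minutes after midnight follows; `hhmm_to_minutes` is a hypothetical helper written for illustration, not part of the notebook. CRSElapsedTime and the delay columns are already durations in minutes, so this rule would not apply to them.

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """Convert a clock time encoded as hhmm (e.g. 1348) to minutes after midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes


# DepTime, CRSDepTime and CRSArrTime of the first row of the dataset.
print(hhmm_to_minutes(1348))  # 828
print(hhmm_to_minutes(1331))  # 811
print(hhmm_to_minutes(1435))  # 875
```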

### Useful functions

In [44]:
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer

# rename columns:
def edit_column_names(df):
    df =  df.withColumnRenamed('DayofMonth','day_of_month').\
                withColumnRenamed('DayOfWeek','day_of_week').\
                withColumnRenamed('DepTime','actual_departure_time').\
                withColumnRenamed('CRSDepTime','scheduled_departure_time').\
                withColumnRenamed('ArrTime','actual_arrival_time').\
                withColumnRenamed('CRSArrTime','scheduled_arrival_time').\
                withColumnRenamed('UniqueCarrier','airline_code').\
                withColumnRenamed('FlightNum','flight_number').\
                withColumnRenamed('TailNum','plane_number').\
                withColumnRenamed('ActualElapsedTime','actual_flight_time').\
                withColumnRenamed('CRSElapsedTime','scheduled_flight_time').\
                withColumnRenamed('AirTime','air_time').\
                withColumnRenamed('ArrDelay','arrival_delay').\
                withColumnRenamed('DepDelay','departure_delay').\
                withColumnRenamed('TaxiIn','taxi_in').\
                withColumnRenamed('TaxiOut','taxi_out').\
                withColumnRenamed('CancellationCode','cancellation_code').\
                withColumnRenamed('CarrierDelay','carrier_delay').\
                withColumnRenamed('WeatherDelay','weather_delay').\
                withColumnRenamed('NASDelay','nas_delay').\
                withColumnRenamed('SecurityDelay','security_delay').\
                withColumnRenamed('LateAircraftDelay','late_aircraft_delay')
    # lower-case any remaining column names ('name' avoids shadowing the imported col())
    for name in df.columns:
        df = df.withColumnRenamed(name, name.lower())
    return df

# select columns:
def my_columns (df):
    df = df.select('year','month','day_of_month', 'day_of_week', 'actual_departure_time',
                   'scheduled_departure_time', 'scheduled_arrival_time', 'airline_code',
                   'flight_number', 'plane_number', 'scheduled_flight_time', 'arrival_delay',
                   'departure_delay', 'origin', 'dest', 'distance', 'taxi_out', 'cancelled',
                   'cancellation_code')
    return df

# # combine to create dates:
# def add_date_column(df):
#     df = df.withColumn('date', to_date(concat(col('day_of_month'), lit(' '),
#                                               col('month'), lit(' '), col('year')), 'd M yyyy'))
#     return df

# some strings to float:
def string_to_float(df):
    df = df.withColumn('arrival_delay', col('arrival_delay').cast('float'))
    df = df.withColumn('departure_delay', col('departure_delay').cast('float'))
    df = df.withColumn('taxi_out', col('taxi_out').cast('float'))
    df = df.withColumn('distance', col('distance').cast('float'))
    return df

# encode categorical features:
def encode_categorical_features(df):
    indexer = StringIndexer(inputCols=['airline_code', 'origin', 'dest', 'cancellation_code', 'plane_number'],
                            outputCols=['airline_index', 'origin_index', 'dest_index', 'cancellation_index', 'plane_index'])
    
    df = indexer.fit(df).transform(df)
    return df
    

# convert time to minutes:
def convert_time_to_minutes(df):
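    # convert the hhmm columns actual_departure_time, scheduled_departure_time and scheduled_arrival_time to minutes:
    # hours (all digits but the last two) * 60 + minutes (the last two digits)
    # NOTE: scheduled_flight_time (CRSElapsedTime) is already a duration in minutes, so applying the
    # hhmm rule to it shortens any value of 100 or more (e.g. 130 becomes 90)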
    df = df.withColumn('actual_departure_hour', (col('actual_departure_time') / 100).cast('int'))
    df = df.withColumn('scheduled_departure_hour', (col('scheduled_departure_time') / 100).cast('int'))
    df = df.withColumn('scheduled_arrival_hour', (col('scheduled_arrival_time') / 100).cast('int'))
    df = df.withColumn('scheduled_flight_hour', (col('scheduled_flight_time') / 100).cast('int'))
    
    df = df.withColumn('actual_departure_time_mins', (col('actual_departure_hour') * 60) + (col('actual_departure_time') % 100))
    df = df.withColumn('scheduled_departure_time_mins', (col('scheduled_departure_hour') * 60) + (col('scheduled_departure_time') % 100))
    df = df.withColumn('scheduled_arrival_time_mins', (col('scheduled_arrival_hour') * 60) + (col('scheduled_arrival_time') % 100))
    df = df.withColumn('scheduled_flight_time_mins', (col('scheduled_flight_hour') * 60) + (col('scheduled_flight_time') % 100))
    
    # drop actual_departure_hour, scheduled_departure_hour, scheduled_arrival_hour, scheduled_flight_hour
    
    df = df.drop('actual_departure_hour', 'scheduled_departure_hour', 'scheduled_arrival_hour', 'scheduled_flight_hour')
    
    return df

# handle missing values:
def handle_missing_values(df):
    # drop rows where actual_departure_time is null
    df = df.filter(df.actual_departure_time.isNotNull())
    # drop rows where scheduled_flight_time is null
    df = df.filter(df.scheduled_flight_time.isNotNull())
    return df

def my_df(df):
    # select columns
    df = df.select('year', 'month', 'day_of_month', 'day_of_week', 'actual_departure_time_mins',
 					'scheduled_departure_time_mins', 'scheduled_arrival_time_mins', 'airline_index',
 					'flight_number', 'scheduled_flight_time_mins', 'departure_delay',
 					'origin_index', 'dest_index', 'distance', 'cancelled',
 					'arrival_delay')
    return df

def drop_cancelled(df):
    df = df.filter(df.cancelled == 0)
    return df
    

# # standarize df
# def standarize_dataframe(df):
#     temp = edit_column_names(df)
#     temp = my_columns(temp)
#     temp = string_to_float(temp)
#     temp = add_date_column(temp)
#     temp = encode_categorical_features(temp)
#     temp = convert_time_to_minutes(temp)
#     temp = handle_missing_values(temp)
#     temp = my_df(temp)

#     return temp



###

In [11]:
# renamed columns:
df1 = edit_column_names(df)

# delete columns that we are not supposed to use:
df2 = my_columns(df1)
df2.show(5)

+----+-----+------------+-----------+---------------------+------------------------+----------------------+------------+-------------+------------+---------------------+-------------+---------------+------+----+--------+--------+---------+-----------------+
|year|month|day_of_month|day_of_week|actual_departure_time|scheduled_departure_time|scheduled_arrival_time|airline_code|flight_number|plane_number|scheduled_flight_time|arrival_delay|departure_delay|origin|dest|distance|taxi_out|cancelled|cancellation_code|
+----+-----+------------+-----------+---------------------+------------------------+----------------------+------------+-------------+------------+---------------------+-------------+---------------+------+----+--------+--------+---------+-----------------+
|1988|    1|           9|          6|                 1348|                    1331|                  1435|          PI|          942|          NA|                   64|           23|             17|   SYR| BWI|     273|      

In [13]:
# convert strings to float: arrival_delay, departure_delay, taxi_out, distance
df3 = string_to_float(df2)
df3.printSchema()

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- actual_departure_time: string (nullable = true)
 |-- scheduled_departure_time: integer (nullable = true)
 |-- scheduled_arrival_time: integer (nullable = true)
 |-- airline_code: string (nullable = true)
 |-- flight_number: integer (nullable = true)
 |-- plane_number: string (nullable = true)
 |-- scheduled_flight_time: string (nullable = true)
 |-- arrival_delay: float (nullable = true)
 |-- departure_delay: float (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- distance: float (nullable = true)
 |-- taxi_out: float (nullable = true)
 |-- cancelled: integer (nullable = true)
 |-- cancellation_code: string (nullable = true)



In [14]:
# encode categorical features: airline_code, origin, dest, cancellation_code, plane_number (When null --> 0.0)
df4 = encode_categorical_features(df3)
df4.printSchema()

                                                                                

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- actual_departure_time: string (nullable = true)
 |-- scheduled_departure_time: integer (nullable = true)
 |-- scheduled_arrival_time: integer (nullable = true)
 |-- airline_code: string (nullable = true)
 |-- flight_number: integer (nullable = true)
 |-- plane_number: string (nullable = true)
 |-- scheduled_flight_time: string (nullable = true)
 |-- arrival_delay: float (nullable = true)
 |-- departure_delay: float (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- distance: float (nullable = true)
 |-- taxi_out: float (nullable = true)
 |-- cancelled: integer (nullable = true)
 |-- cancellation_code: string (nullable = true)
 |-- airline_index: double (nullable = false)
 |-- origin_index: double (nullable = false)
 |-- dest_index: double (nullable = false)
 |-- cancell
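
The `*_index` columns come from `StringIndexer`, which by default orders labels by descending frequency, so the most frequent label gets index 0.0. A pure-Python sketch of that idea follows; `fit_label_index` and the toy carrier list are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy carrier codes, made up for illustration.
carriers = ["PI", "AA", "AA", "UA", "AA", "UA"]
index = fit_label_index(carriers)
print(index)  # {'AA': 0.0, 'UA': 1.0, 'PI': 2.0}
```

Note that these indices are arbitrary codes: a linear model will treat them as magnitudes, so one-hot encoding the indexed columns may be worth considering.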

In [15]:
# convert time (hhmm) to minutes: actual_departure_time, scheduled_departure_time, scheduled_arrival_time, scheduled_flight_time
df5 = convert_time_to_minutes(df4)
df5.show(5)

+----+-----+------------+-----------+---------------------+------------------------+----------------------+------------+-------------+------------+---------------------+-------------+---------------+------+----+--------+--------+---------+-----------------+-------------+------------+----------+------------------+-----------+--------------------------+-----------------------------+---------------------------+--------------------------+
|year|month|day_of_month|day_of_week|actual_departure_time|scheduled_departure_time|scheduled_arrival_time|airline_code|flight_number|plane_number|scheduled_flight_time|arrival_delay|departure_delay|origin|dest|distance|taxi_out|cancelled|cancellation_code|airline_index|origin_index|dest_index|cancellation_index|plane_index|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|scheduled_flight_time_mins|
+----+-----+------------+-----------+---------------------+------------------------+----------------------+------------+--

In [31]:
df5.columns

['year',
 'month',
 'day_of_month',
 'day_of_week',
 'actual_departure_time',
 'scheduled_departure_time',
 'scheduled_arrival_time',
 'airline_code',
 'flight_number',
 'plane_number',
 'scheduled_flight_time',
 'arrival_delay',
 'departure_delay',
 'origin',
 'dest',
 'distance',
 'taxi_out',
 'cancelled',
 'cancellation_code',
 'airline_index',
 'origin_index',
 'dest_index',
 'cancellation_index',
 'plane_index',
 'actual_departure_time_mins',
 'scheduled_departure_time_mins',
 'scheduled_arrival_time_mins',
 'scheduled_flight_time_mins']

In [42]:
# check for missing values (NULL, NAN, NA)
# count Na values in each column
from pyspark.sql.functions import isnan, when, count, col

df5.select([count(when(col(c) == "NA", c)).alias(c) for c in df5.columns]).show()



+----+-----+------------+-----------+---------------------+------------------------+----------------------+------------+-------------+------------+---------------------+-------------+---------------+------+----+--------+--------+---------+-----------------+-------------+------------+----------+------------------+-----------+--------------------------+-----------------------------+---------------------------+--------------------------+
|year|month|day_of_month|day_of_week|actual_departure_time|scheduled_departure_time|scheduled_arrival_time|airline_code|flight_number|plane_number|scheduled_flight_time|arrival_delay|departure_delay|origin|dest|distance|taxi_out|cancelled|cancellation_code|airline_index|origin_index|dest_index|cancellation_index|plane_index|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|scheduled_flight_time_mins|
+----+-----+------------+-----------+---------------------+------------------------+----------------------+------------+--

                                                                                

In [43]:
# count NULL values in each column
df5.select([count(when(col(c).isNull(), c)).alias(c) for c in df5.columns]).show()



+----+-----+------------+-----------+---------------------+------------------------+----------------------+------------+-------------+------------+---------------------+-------------+---------------+------+----+--------+--------+---------+-----------------+-------------+------------+----------+------------------+-----------+--------------------------+-----------------------------+---------------------------+--------------------------+
|year|month|day_of_month|day_of_week|actual_departure_time|scheduled_departure_time|scheduled_arrival_time|airline_code|flight_number|plane_number|scheduled_flight_time|arrival_delay|departure_delay|origin|dest|distance|taxi_out|cancelled|cancellation_code|airline_index|origin_index|dest_index|cancellation_index|plane_index|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|scheduled_flight_time_mins|
+----+-----+------------+-----------+---------------------+------------------------+----------------------+------------+--

                                                                                

__Missing Values__

_NA_
* actual_departure_time: 518228 --- (1.99%) ---> DROP
* plane_number: 10404192 --- (40.09%) ---> DROP COLUMN (don't keep plane_index either because it carries the same information)
* scheduled_flight_time: 5586 --- (0.02%) ---> DROP
* cancellation_code: 21173634 --- (81.58%) ---> DROP COLUMN (don't keep cancellation_index either because it carries the same information)

_NULL_
* plane_number: 84904 --- (0.32%) ---> DROP COLUMN
* arrival_delay: 584730 --- (2.25%) ---> REMOVE ROWS (it is the target)
* departure_delay: 518228 --- (1.99%) ---> IMPUTE THE MEAN OF THE AIRLINE __train/test__
* distance: 22204 --- (0.08%) ---> IMPUTE THE MEAN OVER OTHER FLIGHTS WITH THE SAME ORIGIN AND DEST __train/test__
* taxi_out: 10533076 --- (40.58%) ---> DROP COLUMN
* cancellation_code: 4649550 --- (17.91%) ---> DROP COLUMN
* actual_departure_time_mins: 518228 --- (1.99%) ---> scheduled_departure_time + departure_delay __train/test__
* scheduled_flight_time_mins: 5586 --- (0.02%) ---> REMOVE ROWS
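
The __train/test__ marks above suggest that imputation statistics should be computed on the training split only and then applied to both splits, so that no information leaks from the test set. A minimal pure-Python sketch of per-airline mean imputation under that assumption follows; `fit_group_means`, `impute` and the toy rows are hypothetical and do not appear in the notebook.

```python
from collections import defaultdict


def fit_group_means(rows, group_key, value_key):
    """Return ({group: mean}, overall_mean) computed from non-missing values."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        value = row[value_key]
        if value is not None:
            sums[row[group_key]] += value
            counts[row[group_key]] += 1
    means = {group: sums[group] / counts[group] for group in sums}
    total = sum(counts.values())
    overall = sum(sums.values()) / total if total else None
    return means, overall


def impute(rows, group_key, value_key, means, fallback):
    """Fill missing values with the group mean, or the fallback for unseen groups."""
    return [
        {**row, value_key: means.get(row[group_key], fallback)}
        if row[value_key] is None
        else dict(row)
        for row in rows
    ]


# Toy rows; airline codes and delays are made up for illustration.
train = [
    {"airline_code": "AA", "departure_delay": 10.0},
    {"airline_code": "AA", "departure_delay": 20.0},
    {"airline_code": "AA", "departure_delay": None},
    {"airline_code": "UA", "departure_delay": 4.0},
]
test = [
    {"airline_code": "AA", "departure_delay": None},
    {"airline_code": "DL", "departure_delay": None},
]

# Statistics come from the training rows only and are reused for the test rows.
means, overall = fit_group_means(train, "airline_code", "departure_delay")
train_filled = impute(train, "airline_code", "departure_delay", means, overall)
test_filled = impute(test, "airline_code", "departure_delay", means, overall)
print(test_filled)
```

The same pattern would cover distance by grouping on the (origin, dest) pair instead of the airline.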

In [47]:
# keep only useful columns: 'year', 'month', 'day_of_month', 'day_of_week', 'actual_departure_time_mins',
# 							'scheduled_departure_time_mins', 'scheduled_arrival_time_mins', 'airline_index',
# 							'flight_number', 'scheduled_flight_time_mins', 'departure_delay',
# 							'origin_index', 'dest_index', 'distance', 'cancelled',
# 							'arrival_delay',

df6 = my_df(df5)
df6.show(5)

+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+---------+-------------+
|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|cancelled|arrival_delay|
+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+---------+-------------+
|1988|    1|           9|          6|                     828.0|                          811|                        875|          8.0|          942|                      64.0|           17.0|        63.0|      23.0|   273.0|        0|         23

In [50]:
# handle nulls
def drop_nulls(df):
    # remove rows where arrival_delay (the target) is null
    df = df.filter(df.arrival_delay.isNotNull())
    # remove rows where scheduled_flight_time_mins is null
    df = df.filter(df.scheduled_flight_time_mins.isNotNull())
    # remove rows where distance is null
    df = df.filter(df.distance.isNotNull())
    return df

df7 = drop_nulls(df6)

In [51]:
# count NULL values in each column
df7.select([count(when(col(c).isNull(), c)).alias(c) for c in df7.columns]).show()



+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+---------+-------------+
|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|cancelled|arrival_delay|
+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+---------+-------------+
|   0|    0|           0|          0|                         0|                            0|                          0|            0|            0|                         0|              0|           0|         0|       0|        0|           

                                                                                

In [54]:
# drop cancelled flights (cancelled = 1)
df8 = drop_cancelled(df7)
n_rows2 = df8.count()
n_cols2 = len(df8.columns)

print("Number of rows before: ", n_rows)
print("Number of rows now: ", n_rows2, "(", n_rows - n_rows2, " rows dropped )")

# compare the number of columns before and after
print("Number of columns before: ", n_cols1)
print("Number of columns now: ", n_cols2, "(", n_cols1 - n_cols2, " columns dropped )")



Number of rows before:  25952068
Number of rows now:  25345340 ( 606728  rows dropped )
Number of columns before:  29
Number of columns now:  16 ( 13  columns dropped )


In [55]:
df8.printSchema()

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- actual_departure_time_mins: double (nullable = true)
 |-- scheduled_departure_time_mins: integer (nullable = true)
 |-- scheduled_arrival_time_mins: integer (nullable = true)
 |-- airline_index: double (nullable = false)
 |-- flight_number: integer (nullable = true)
 |-- scheduled_flight_time_mins: double (nullable = true)
 |-- departure_delay: float (nullable = true)
 |-- origin_index: double (nullable = false)
 |-- dest_index: double (nullable = false)
 |-- distance: float (nullable = true)
 |-- cancelled: integer (nullable = true)
 |-- arrival_delay: float (nullable = true)



In [56]:
df8.show(5)

+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+---------+-------------+
|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|cancelled|arrival_delay|
+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+---------+-------------+
|1988|    1|           9|          6|                     828.0|                          811|                        875|          8.0|          942|                      64.0|           17.0|        63.0|      23.0|   273.0|        0|         23

# 3. Creating the Model

__VECTOR ASSEMBLER__

In [59]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

# Split the data (Train/Test)(0.7, 0.3)
train, test = df8.randomSplit([0.7, 0.3], seed=42)

# Create a VectorAssembler
my_features = ['year', 'month', 'day_of_month', 'day_of_week', 'actual_departure_time_mins',
               'scheduled_departure_time_mins', 'scheduled_arrival_time_mins', 'airline_index',
               'flight_number', 'scheduled_flight_time_mins', 'departure_delay',
               'origin_index', 'dest_index', 'distance', 'cancelled']
featureassembler = VectorAssembler(inputCols=my_features, outputCol="features")

# Create a Normalizer
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

# Create LinearRegression
lr = LinearRegression(labelCol="arrival_delay", featuresCol="features_norm", maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Create a pipeline
pipeline = Pipeline(stages=[featureassembler, normalizer, lr])

# Fit the pipeline on training data
model = pipeline.fit(train)


                                                                                

In [64]:
# coefficients and intercept for linear regression
print("Coefficients: " + str(model.stages[2].coefficients))
print("Intercept: " + str(model.stages[2].intercept))


Coefficients: [-7.641549166493271,-389.31388309605893,0.0,-1048.8045729249613,0.0,0.0,0.0,225.98892437091425,0.3854630303632286,-147.59234929323068,6347.9680462580545,0.0,-16.33496898106161,0.0,0.0]
Intercept: 5.3294728541366245
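
For reference, `Normalizer` with `p=1.0` rescales each row's feature vector to unit L1 norm; it does not standardize each column. A pure-Python sketch of that operation follows; `l1_normalize` is a hypothetical helper, the toy values are made up, and leaving an all-zero vector unchanged is an assumption of this sketch.

```python
def l1_normalize(vector):
    """Scale a vector so that its absolute values sum to 1 (unit L1 norm)."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]


# Toy feature vector (made-up values): year, departure time in minutes, delay.
row = [1988.0, 828.0, 17.0]
scaled = l1_normalize(row)
print(scaled)
print(sum(abs(x) for x in scaled))
```

Because each value becomes a share of its row total, the coefficients above multiply those shares rather than the raw features, which may be why some of them are so large.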


# 4. Validating the model

In [61]:
# Make predictions on test data
predictions = model.transform(test)
predictions.select("prediction", "arrival_delay", "features_norm").show(5)


+------------------+-------------+--------------------+
|        prediction|arrival_delay|       features_norm|
+------------------+-------------+--------------------+
| 8.629841200589368|         13.0|[0.34907813871817...|
| 27.74944779478431|         18.0|[0.34658298465829...|
|3.6894928381296768|         -8.0|[0.30898352502331...|
|135.26770922739655|        114.0|[0.34122897356676...|
|-4.942867547229427|         13.0|[0.74013402829486...|
+------------------+-------------+--------------------+
only showing top 5 rows



                                                                                

In [63]:
# evaluate model
predictions.show(5)


+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+---------+-------------+--------------------+--------------------+------------------+
|year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_index|flight_number|scheduled_flight_time_mins|departure_delay|origin_index|dest_index|distance|cancelled|arrival_delay|            features|       features_norm|        prediction|
+----+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+-------------+-------------+--------------------------+---------------+------------+----------+--------+---------+-------------+--------------------+--------------------+------------------+
|1988|    1|           1|          5|                       2.0|

                                                                                

In [65]:
# MAE, MSE, RMSE, R2

from pyspark.ml.evaluation import RegressionEvaluator

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="arrival_delay", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE = %g" % mae)

evaluator = RegressionEvaluator(labelCol="arrival_delay", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)
print("MSE = %g" % mse)

evaluator = RegressionEvaluator(labelCol="arrival_delay", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE = %g" % rmse)

evaluator = RegressionEvaluator(labelCol="arrival_delay", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R2 = %g" % r2)



                                                                                

MAE = 9.20746
MSE = 254.213
RMSE = 15.944
R2 = 0.720329
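
As a reminder of what these four numbers measure, here is a pure-Python sketch of the standard definitions. `regression_metrics` is a hypothetical helper, not the `RegressionEvaluator` implementation, and the sample values are the five (arrival_delay, prediction) pairs shown earlier, rounded to two decimals.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target (minutes)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot           # 1 - SS_res / SS_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Five rows only, so these values will not match the full test-set metrics.
y_true = [13.0, 18.0, -8.0, 114.0, 13.0]
y_pred = [8.63, 27.75, 3.69, 135.27, -4.94]
print(regression_metrics(y_true, y_pred))
```

The reported RMSE is the square root of the reported MSE, so both carry the same information; MAE and RMSE are both expressed in minutes of delay.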


                                                                                

In [66]:
spark.stop()