# Spark Practical Work

We are supposed to create a model capable of predicting the arrival delay time of a commercial flight based on several parameters known at the take-off time. Tasks:
* Load the input data, previously stored at a known location.
* Select, process and transform the input variables, to prepare them for training the model.
* Perform some basic analysis of each input variable. 
* Create a ML model that predicts the arrival delay time.
* Validate the created model and provide some measures of its accuracy.


In [29]:
import os
os.getcwd()

'/home/dslab/workspaces/rrunix/spark/Final_project'

In [30]:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.types import *
from pyspark.sql.functions import *

conf = SparkConf().set("spark.driver.memory", "4g").set("spark.executor.memory", "4g").set("loglevel", "ERROR")
spark = SparkSession.builder \
            .appName("ComercialFlights") \
            .master("local[*]") \
            .config(conf=conf) \
            .getOrCreate()


In [31]:
# spark.stop()

# 1. Load data

In [32]:
# load files .bz2 from data folder
# extract files and read them into pyspark dataframe

# if "data/my_df.csv/" existe cargarlo, sino crar un nuevo df
if os.path.exists("data/my_df.csv"):
	df = spark.read.csv("data/my_df.csv", header=True, sep=",")

else:
	files = os.listdir("data/")
	files = [file for file in files if file.endswith('.bz2')]

	# read files into spark dataframes
	dfs = []
	for file in files:
		# read them into df
		df_ = spark.read.csv("data/" + files[0], header=True, sep=",")
		dfs.append(df_)

	from functools import reduce
	# union all dataframes into one
	df = reduce(DataFrame.unionAll, dfs)

In [4]:
df.show(5)

24/01/02 10:24:45 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|1988|   10|        21|        5|   2146|      2115|   2326|      2305|           US|     2744

In [6]:
df.printSchema()

root
 |-- Year: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: string (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: string (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: string (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: string (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: string (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: string (nullable = true)
 |-- CarrierDelay:

In [17]:
# comprobar el numero de filas de df 
n_rows = df.count()
n_cols1 = len(df.columns)

print("Number of rows: ", n_rows)
# comprobar el numero de columnas de df
print("Number of columns: ", n_cols1)



Number of rows:  15606288
Number of columns:  29


                                                                                

# 2. Process data

The dataset has 29 columns. We won't use all of them. The ones that should be droped are: 
* ArrTime
* ActualElapsedTime
* AirTime
* TaxiIn
* Diverted
* CarrierDelay
* WeatherDelay
* NASDelay
* SecurityDelay
* LateAircraftDelay

Meaning of the variables that we keep: 
1. Year 1987-2008 
2. Month 1-12 
3. DayofMonth 1-31 
4. DayOfWeek 1 (Monday) - 7 (Sunday)
5. DepTime actual departure time (local, hhm m) 
6. CRSDepTime scheduled departure time (local, hhmm) 
7. CRSArrTime scheduled arrival time (local, hhmm) 
8. UniqueCarrier Airline code 
9. FlightNum flight number 
10. TailNum plane tail number 
11. CRSElapsedTime in minutes (estimated flight time)
12. ArrDelay arrival delay, in minutes -- TARGET VARIABLE
13. DepDelay departure delay, in minutes 
14. Origin origin IATA airport code 
15. Dest destination IATA airport code 
16. Distance in miles 
17. TaxiOut taxi out time in minutes (tiempo que tarda el avión desde la puerta de embarque hasta el despegue")
18. Cancelled was the flight cancelled? 
19. CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security) 

### Useful functions

In [33]:
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline

# rename columns:
def edit_column_names(df):
    """
    Edit column names to lowercase and replace spaces with underscores.
    
    Param:
    - df: spark dataframe
    
    Return:
    - df: spark dataframe with edited column names
    """
    df =  df.withColumnRenamed('DayofMonth','day_of_month').\
                withColumnRenamed('DayOfWeek','day_of_week').\
                withColumnRenamed('DepTime','actual_departure_time').\
                withColumnRenamed('CRSDepTime','scheduled_departure_time').\
                withColumnRenamed('ArrTime','actual_arrival_time').\
                withColumnRenamed('CRSArrTime','scheduled_arrival_time').\
                withColumnRenamed('UniqueCarrier','airline_code').\
                withColumnRenamed('FlightNum','flight_number').\
                withColumnRenamed('TailNum','plane_number').\
                withColumnRenamed('ActualElapsedTime','actual_flight_time').\
                withColumnRenamed('CRSElapsedTime','scheduled_flight_time').\
                withColumnRenamed('AirTime','air_time').\
                withColumnRenamed('ArrDelay','arrival_delay').\
                withColumnRenamed('DepDelay','departure_delay').\
                withColumnRenamed('TaxiIn','taxi_in').\
                withColumnRenamed('TaxiOut','taxi_out').\
                withColumnRenamed('CancellationCode','cancellation_code').\
                withColumnRenamed('CarrierDelay','carrier_delay').\
                withColumnRenamed('WeatherDelay','weather_delay').\
                withColumnRenamed('NASDelay','nas_delay').\
                withColumnRenamed('SecurityDelay','security_delay').\
                withColumnRenamed('LateAircraftDelay','late_aircraft_delay')
    for col in df.columns:
        df = df.withColumnRenamed(col, col.lower())
    return df

# some strings to float:
def string_to_float(df):
    """ 
    Convert some columns from string to float.
    
    Param:
    - df: spark dataframe
    
    Return:
    - df: spark dataframe with some columns converted to float
    """
    df = df.withColumn('year', col('year').cast('float'))
    df = df.withColumn('month', col('month').cast('float'))
    df = df.withColumn('day_of_month', col('day_of_month').cast('float'))
    df = df.withColumn('day_of_week', col('day_of_week').cast('float'))
    df = df.withColumn('arrival_delay', col('arrival_delay').cast('float'))
    df = df.withColumn('departure_delay', col('departure_delay').cast('float'))
    df = df.withColumn('taxi_out', col('taxi_out').cast('float'))
    df = df.withColumn('distance', col('distance').cast('float'))
    df = df.withColumn('cancelled', col('cancelled').cast('float'))
    df = df.withColumn('flight_number', col('flight_number').cast('float'))


    return df

# encode categorical features:
def encode_categorical_features(df):
    """ 
    Encode categorical features using StringIndexer and OneHotEncoder. 
    The output is a new DataFrame with the specified columns encoded.
    
    Params:
    - df: Spark DataFrame
    
    Returns:
    - df_encoded: Spark DataFrame with categorical features encoded
    """
    # StringIndexer
    indexer = StringIndexer(inputCols=['airline_code', 'origin', 'dest',  'plane_number'],
                            outputCols=['airline_index', 'origin_index', 'dest_index', 'plane_index'],
                            handleInvalid="keep")
    
    # OneHotEncoder
    encoder = OneHotEncoder(inputCols=['airline_index', 'origin_index', 'dest_index',  'plane_index'],
                            outputCols=['airline_encoded', 'origin_encoded', 'dest_encoded', 'plane_encoded'])
    
    # Pipeline
    pipeline = Pipeline(stages=[indexer, encoder])
    df_encoded = pipeline.fit(df).transform(df)
    
    return df_encoded
    
# convert time to minutes:
def convert_time_to_minutes(df):
    """ 
    Convert time to minutes. Creates new columns with the time in minutes and drop the original ones.
    To compute it we take the first two digits and multiply by 60 and add the last two digits.
    This is applied to: actual_departure_time, scheduled_departure_time, scheduled_arrival_time, scheduled_flight_time
    Param:
    - df: spark dataframe
    
    Return:
    - df: spark dataframe with new columns with time in minutes
    """
    df = df.withColumn('actual_departure_hour', (col('actual_departure_time') / 100).cast('int'))
    df = df.withColumn('scheduled_departure_hour', (col('scheduled_departure_time') / 100).cast('int'))
    df = df.withColumn('scheduled_arrival_hour', (col('scheduled_arrival_time') / 100).cast('int'))
    df = df.withColumn('scheduled_flight_hour', (col('scheduled_flight_time') / 100).cast('int'))
    
    df = df.withColumn('actual_departure_time_mins', (col('actual_departure_hour') * 60) + (col('actual_departure_time') % 100))
    df = df.withColumn('scheduled_departure_time_mins', (col('scheduled_departure_hour') * 60) + (col('scheduled_departure_time') % 100))
    df = df.withColumn('scheduled_arrival_time_mins', (col('scheduled_arrival_hour') * 60) + (col('scheduled_arrival_time') % 100))
    df = df.withColumn('scheduled_flight_time_mins', (col('scheduled_flight_hour') * 60) + (col('scheduled_flight_time') % 100))
    
    df = df.drop('actual_departure_hour', 'scheduled_departure_hour', 'scheduled_arrival_hour', 'scheduled_flight_hour')
    
    return df

# drop columns:
def my_df(df):
    """
    Select columns to keep in the dataframe. 
    Some columns are dropped as asked in the project instructions.
    Others are dropped because they are not useful for the model since they are derived from the original ones.
    
    Params: 
    - df: spark dataframe
    
    Return:
    - df: spark dataframe with selected columns
    """
    df = df.select('year', 'month', 'day_of_month', 'day_of_week', 'actual_departure_time_mins',
 					'scheduled_departure_time_mins', 'scheduled_arrival_time_mins', 'airline_encoded',
 					'flight_number', 'scheduled_flight_time_mins', 'departure_delay',
 					'origin_encoded', 'dest_encoded', 'distance', 'cancelled',
 					'arrival_delay')
    return df

# drop nulls:
def drop_nulls(df):
    """ 
    Drop rows with null values in the following columns: arrival_delay, scheduled_flight_time_mins, distance.

    Param:
    - df: spark dataframe

    Return:
    - df: spark dataframe with rows with null values dropped
    """
    # remove rows in arrival_delay where arrival_delay is null
    df = df.filter(df.arrival_delay.isNotNull())
    # remove rows in scheduled_flight_time_mins where departure_delay is null
    df = df.filter(df.scheduled_flight_time_mins.isNotNull())
    # remove rows in distance where distance is null
    df = df.filter(df.distance.isNotNull())
    return df

# drop cancelled flights:
def drop_cancelled(df):
    """ 
    Drop rows with cancelled flights.
    
    Param:
    - df: spark dataframe
    
    Return:
    - df: spark dataframe with cancelled flights dropped
    """
    df = df.filter(df.cancelled == 0)
    # drop cancelled column
    df = df.drop('cancelled')
    return df


###

In [38]:
# renamed columns:
df1 = edit_column_names(df)
df1.show(5)

+----+-----+------------+-----------+---------------------+------------------------+-------------------+----------------------+------------+-------------+------------+------------------+---------------------+--------+-------------+---------------+------+----+--------+-------+--------+---------+-----------------+--------+-------------+-------------+---------+--------------+-------------------+
|year|month|day_of_month|day_of_week|actual_departure_time|scheduled_departure_time|actual_arrival_time|scheduled_arrival_time|airline_code|flight_number|plane_number|actual_flight_time|scheduled_flight_time|air_time|arrival_delay|departure_delay|origin|dest|distance|taxi_in|taxi_out|cancelled|cancellation_code|diverted|carrier_delay|weather_delay|nas_delay|security_delay|late_aircraft_delay|
+----+-----+------------+-----------+---------------------+------------------------+-------------------+----------------------+------------+-------------+------------+------------------+---------------------+

In [39]:
# convert strings to float: arrival_delay, departure_delay, taxi_out, distance
df2 = string_to_float(df1)
df2.printSchema()

root
 |-- year: float (nullable = true)
 |-- month: float (nullable = true)
 |-- day_of_month: float (nullable = true)
 |-- day_of_week: float (nullable = true)
 |-- actual_departure_time: string (nullable = true)
 |-- scheduled_departure_time: string (nullable = true)
 |-- actual_arrival_time: string (nullable = true)
 |-- scheduled_arrival_time: string (nullable = true)
 |-- airline_code: string (nullable = true)
 |-- flight_number: float (nullable = true)
 |-- plane_number: string (nullable = true)
 |-- actual_flight_time: string (nullable = true)
 |-- scheduled_flight_time: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- arrival_delay: float (nullable = true)
 |-- departure_delay: float (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- distance: float (nullable = true)
 |-- taxi_in: string (nullable = true)
 |-- taxi_out: float (nullable = true)
 |-- cancelled: float (nullable = true)
 |-- cancellation_code: strin

In [40]:
# encode categorical features: airline_code, origin, dest, plane_number (When null --> 0.0)
df3 = encode_categorical_features(df2)
df3.show(5)
# df3.printSchema()

                                                                                

+------+-----+------------+-----------+---------------------+------------------------+-------------------+----------------------+------------+-------------+------------+------------------+---------------------+--------+-------------+---------------+------+----+--------+-------+--------+---------+-----------------+--------+-------------+-------------+---------+--------------+-------------------+-------------+------------+----------+-----------+---------------+---------------+---------------+-------------+
|  year|month|day_of_month|day_of_week|actual_departure_time|scheduled_departure_time|actual_arrival_time|scheduled_arrival_time|airline_code|flight_number|plane_number|actual_flight_time|scheduled_flight_time|air_time|arrival_delay|departure_delay|origin|dest|distance|taxi_in|taxi_out|cancelled|cancellation_code|diverted|carrier_delay|weather_delay|nas_delay|security_delay|late_aircraft_delay|airline_index|origin_index|dest_index|plane_index|airline_encoded| origin_encoded|   dest_enc

In [41]:
df3.printSchema()

root
 |-- year: float (nullable = true)
 |-- month: float (nullable = true)
 |-- day_of_month: float (nullable = true)
 |-- day_of_week: float (nullable = true)
 |-- actual_departure_time: string (nullable = true)
 |-- scheduled_departure_time: string (nullable = true)
 |-- actual_arrival_time: string (nullable = true)
 |-- scheduled_arrival_time: string (nullable = true)
 |-- airline_code: string (nullable = true)
 |-- flight_number: float (nullable = true)
 |-- plane_number: string (nullable = true)
 |-- actual_flight_time: string (nullable = true)
 |-- scheduled_flight_time: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- arrival_delay: float (nullable = true)
 |-- departure_delay: float (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- distance: float (nullable = true)
 |-- taxi_in: string (nullable = true)
 |-- taxi_out: float (nullable = true)
 |-- cancelled: float (nullable = true)
 |-- cancellation_code: strin

In [42]:
# convert time (hh:mm) to minutes: actual_departure_time, scheduled_departure_time, scheduled_arrival_time, scheduled_flight_time
df4 = convert_time_to_minutes(df3)
df4.show(5)

+------+-----+------------+-----------+---------------------+------------------------+-------------------+----------------------+------------+-------------+------------+------------------+---------------------+--------+-------------+---------------+------+----+--------+-------+--------+---------+-----------------+--------+-------------+-------------+---------+--------------+-------------------+-------------+------------+----------+-----------+---------------+---------------+---------------+-------------+--------------------------+-----------------------------+---------------------------+--------------------------+
|  year|month|day_of_month|day_of_week|actual_departure_time|scheduled_departure_time|actual_arrival_time|scheduled_arrival_time|airline_code|flight_number|plane_number|actual_flight_time|scheduled_flight_time|air_time|arrival_delay|departure_delay|origin|dest|distance|taxi_in|taxi_out|cancelled|cancellation_code|diverted|carrier_delay|weather_delay|nas_delay|security_delay|l

In [43]:
df4.printSchema()

root
 |-- year: float (nullable = true)
 |-- month: float (nullable = true)
 |-- day_of_month: float (nullable = true)
 |-- day_of_week: float (nullable = true)
 |-- actual_departure_time: string (nullable = true)
 |-- scheduled_departure_time: string (nullable = true)
 |-- actual_arrival_time: string (nullable = true)
 |-- scheduled_arrival_time: string (nullable = true)
 |-- airline_code: string (nullable = true)
 |-- flight_number: float (nullable = true)
 |-- plane_number: string (nullable = true)
 |-- actual_flight_time: string (nullable = true)
 |-- scheduled_flight_time: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- arrival_delay: float (nullable = true)
 |-- departure_delay: float (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- distance: float (nullable = true)
 |-- taxi_in: string (nullable = true)
 |-- taxi_out: float (nullable = true)
 |-- cancelled: float (nullable = true)
 |-- cancellation_code: strin

In [44]:
df4.columns

['year',
 'month',
 'day_of_month',
 'day_of_week',
 'actual_departure_time',
 'scheduled_departure_time',
 'actual_arrival_time',
 'scheduled_arrival_time',
 'airline_code',
 'flight_number',
 'plane_number',
 'actual_flight_time',
 'scheduled_flight_time',
 'air_time',
 'arrival_delay',
 'departure_delay',
 'origin',
 'dest',
 'distance',
 'taxi_in',
 'taxi_out',
 'cancelled',
 'cancellation_code',
 'diverted',
 'carrier_delay',
 'weather_delay',
 'nas_delay',
 'security_delay',
 'late_aircraft_delay',
 'airline_index',
 'origin_index',
 'dest_index',
 'plane_index',
 'airline_encoded',
 'origin_encoded',
 'dest_encoded',
 'plane_encoded',
 'actual_departure_time_mins',
 'scheduled_departure_time_mins',
 'scheduled_arrival_time_mins',
 'scheduled_flight_time_mins']

In [45]:
# check for missing values (NULL, NAN, NA)
# count Na values in each column
from pyspark.sql.functions import isnan, when, count, col

# df4.select([count(when(col(c) == "NA", c)).alias(c) for c in df4.columns]).show()

In [46]:
# count NULL values in each column
# df4.select([count(when(col(c).isNull(), c)).alias(c) for c in df4.columns]).show()

__Missing Values__

_NA_
* actual_departure_time: 518228 --- (0.0199%) ---> DROP
* plane_number: 10404192 --- (0.4009%) ---> DROP COLLUMN (dont keep plane_index either bc it has the same info)
* scheduled_flight_time: 5586 --- (0.0002%) --> DROP
* cancellation_code: 21173634 --- (0.8158%) ---> DROP COLLUMN (dont keep cancellation_index either bc it has the same info)

_NULL_
* plane_number: 84904 --- (0.0032%) ---> DROP COLLUMN .
* arrival_delay: 584730 --- (0.0225%) ---> REMOVE ROWS (ES LA TARGET) .
* departure_delay: 518228 --- (0.0199%) ---> IMPUT MEAN OF THE AIRLINE __train/test__
* distance: 22204 --- (0.0008%) ---> BUSCAR OTROS VUELOS CON MISMO ORIGINY DEST Y PONER LA MEDIA __train/test__
* taxi_out: 10533076 --- (0.4058%) ---> DROP COLLUMN .
* cancellation_code: 4649550 --- (0.1791%) ---> DROP COLLUMN .
* actual_departure_time_mins: 518228 --- (0.0199%) ---> schedule_departure_time + deparutre delay __train/test__
* scheduled_flight_time_mins: 5586 --- (0.0002%) ---> REMOVE ROWS .

In [47]:
# keep only useful columns: 'year', 'month', 'day_of_month', 'day_of_week', 'actual_departure_time_mins',
# 							'scheduled_departure_time_mins', 'scheduled_arrival_time_mins', 'airline_encoded',
# 							'flight_number', 'scheduled_flight_time_mins', 'departure_delay',
# 							'origin_encoded', 'dest_encoded', 'distance', 'cancelled',
# 							'arrival_delay',

df5 = my_df(df4)
df5.show(5)

+------+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+---------------+-------------+--------------------------+---------------+---------------+---------------+--------+---------+-------------+
|  year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_encoded|flight_number|scheduled_flight_time_mins|departure_delay| origin_encoded|   dest_encoded|distance|cancelled|arrival_delay|
+------+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+---------------+-------------+--------------------------+---------------+---------------+---------------+--------+---------+-------------+
|1988.0| 10.0|        21.0|        5.0|                    1306.0|                       1275.0|                     1385.0| (14,[3],[1.0])|       2744.0|                      70.0|           31.0|(238,[5],[1.0]

In [48]:
df6 = drop_nulls(df5)

In [49]:
# # count NULL values in each column
# df6.select([count(when(col(c).isNull(), c)).alias(c) for c in df6.columns]).show()

In [50]:
# drop cancelled flights (cancelled = 1)
df7 = drop_cancelled(df6)

In [51]:
n_rows2 = df7.count()
n_cols2 = len(df7.columns)

print("Number of rows before: ", n_rows)
print("Number of rows now: ", n_rows2, "(", n_rows - n_rows2, " rows dropped )")

# comprobar el numero de columnas de df
print("Number of columns before: ", n_cols1)
print("Number of columns now: ", n_cols2, "(", n_cols1 - n_cols2, " columns dropped )")





Number of rows before:  15606288
Number of rows now:  15379494 ( 226794  rows dropped )
Number of columns before:  29
Number of columns now:  15 ( 14  columns dropped )


                                                                                

In [19]:
df7.printSchema()

root
 |-- year: float (nullable = true)
 |-- month: float (nullable = true)
 |-- day_of_month: float (nullable = true)
 |-- day_of_week: float (nullable = true)
 |-- actual_departure_time_mins: double (nullable = true)
 |-- scheduled_departure_time_mins: double (nullable = true)
 |-- scheduled_arrival_time_mins: double (nullable = true)
 |-- airline_encoded: vector (nullable = true)
 |-- flight_number: float (nullable = true)
 |-- scheduled_flight_time_mins: double (nullable = true)
 |-- departure_delay: float (nullable = true)
 |-- origin_encoded: vector (nullable = true)
 |-- dest_encoded: vector (nullable = true)
 |-- distance: float (nullable = true)
 |-- arrival_delay: float (nullable = true)



In [20]:
df7.show(5)

+------+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+---------------+-------------+--------------------------+---------------+---------------+---------------+--------+-------------+
|  year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_encoded|flight_number|scheduled_flight_time_mins|departure_delay| origin_encoded|   dest_encoded|distance|arrival_delay|
+------+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+---------------+-------------+--------------------------+---------------+---------------+---------------+--------+-------------+
|1988.0| 10.0|        21.0|        5.0|                    1306.0|                       1275.0|                     1385.0| (14,[3],[1.0])|       2744.0|                      70.0|           31.0|(238,[5],[1.0])|(236,[8],[1.0])|   651.0|   

# 3. Creating the Model

__VECTOR ASSEMBLER__

In [52]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

# Split the data (Train/Test)(0.7, 0.3)
train, test = df7.randomSplit([0.7, 0.3], seed=42)

#####################################################################
### WE SHOULD EXPLORE WHICH FEATURES ARE THE MOST IMPORTANT ONES ###
#####################################################################
# Create a VectorAssembler
my_features = ['year', 'month', 'day_of_month', 'day_of_week', 'actual_departure_time_mins',
               'scheduled_departure_time_mins', 'scheduled_arrival_time_mins', 'airline_encoded',
               'flight_number', 'scheduled_flight_time_mins', 'departure_delay',
               'origin_encoded', 'dest_encoded', 'distance']
featureassembler = VectorAssembler(inputCols=my_features, outputCol="features")


#####################################################################
### WE SHOUL EXPLORE DIFFERENT PARAMETERS AND UNDERSTAND THEM  ######
#####################################################################
# Create a Normalizer
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)

# Create LinearRegression
lr = LinearRegression(labelCol="arrival_delay", featuresCol="features_norm", maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Create a pipeline
pipeline = Pipeline(stages=[featureassembler, normalizer, lr])

# Fit the pipeline on training data
model = pipeline.fit(train)


                                                                                

In [53]:
# coefficients and intercept for linear regression
print("Coefficients: " + str(model.stages[2].coefficients))
print("Intercept: " + str(model.stages[2].intercept))


Coefficients: [-2.5829914745548765,-212.20343641478436,0.0,-1039.8628245958892,2.2901568759113218,0.0,2.7961724857291452,71.70706949255045,-5888.884105662422,0.0,0.0,0.0,-9373.008164213083,2948.0212196738917,-996.2528223462382,0.0,-12863.699177032591,0.0,0.0,405.0428721556513,-4936.804086955268,0.0,-54.12084015415666,5638.4354569582,0.0,0.0,0.0,-10931.066070429353,-6552.958519218816,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4875.745873278625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.

In [54]:
# print a table with the coefficients and the features
import pandas as pd
coefficients_table = pd.DataFrame(list(zip(my_features, model.stages[2].coefficients.toArray())),
								  columns=['feature', 'coefficients'])
# ordenar por valor absoluto de los coeficientes
coefficients_table['abs_coefficients'] = coefficients_table['coefficients'].abs()
coefficients_table = coefficients_table.sort_values(by=['abs_coefficients'], ascending=False)
coefficients_table = coefficients_table.drop('abs_coefficients', axis=1)

coefficients_table
# save
coefficients_table.to_csv('data/coefficients_table.csv', index=False)


In [55]:
coefficients_table

Unnamed: 0,feature,coefficients
12,dest_encoded,-9373.008164
8,flight_number,-5888.884106
13,distance,2948.02122
3,day_of_week,-1039.862825
1,month,-212.203436
7,airline_encoded,71.707069
6,scheduled_arrival_time_mins,2.796172
0,year,-2.582991
4,actual_departure_time_mins,2.290157
2,day_of_month,0.0


# 4. Validating the model

In [24]:
# Make predictions on test data
predictions = model.transform(test)
predictions.select("prediction", "arrival_delay", "features_norm").show(5)

[Stage 24:>                                                         (0 + 1) / 1]

+------------------+-------------+--------------------+
|        prediction|arrival_delay|       features_norm|
+------------------+-------------+--------------------+
| 27.93160765941896|         20.0|(499,[0,1,2,3,4,5...|
|-4.314880882052852|         11.0|(499,[0,1,2,3,4,5...|
|25.980916829625887|         16.0|(499,[0,1,2,3,4,5...|
|2.5066803821768975|         22.0|(499,[0,1,2,3,4,5...|
| 54.56258475974316|         54.0|(499,[0,1,2,3,4,5...|
+------------------+-------------+--------------------+
only showing top 5 rows



                                                                                

In [25]:
# evalute model
predictions.show(5)

[Stage 25:>                                                         (0 + 1) / 1]

+------+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+---------------+-------------+--------------------------+---------------+----------------+----------------+--------+-------------+--------------------+--------------------+------------------+
|  year|month|day_of_month|day_of_week|actual_departure_time_mins|scheduled_departure_time_mins|scheduled_arrival_time_mins|airline_encoded|flight_number|scheduled_flight_time_mins|departure_delay|  origin_encoded|    dest_encoded|distance|arrival_delay|            features|       features_norm|        prediction|
+------+-----+------------+-----------+--------------------------+-----------------------------+---------------------------+---------------+-------------+--------------------------+---------------+----------------+----------------+--------+-------------+--------------------+--------------------+------------------+
|1988.0| 10.0|         1.0|        6.0|             

                                                                                

In [26]:
# MAE, MSE, RMSE, R2

from pyspark.ml.evaluation import RegressionEvaluator

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="arrival_delay", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE = %g" % mae)

evaluator = RegressionEvaluator(labelCol="arrival_delay", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)
print("MSE = %g" % mse)

evaluator = RegressionEvaluator(labelCol="arrival_delay", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE = %g" % rmse)

evaluator = RegressionEvaluator(labelCol="arrival_delay", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R2 = %g" % r2)



                                                                                

MAE = 8.1552


                                                                                

MSE = 188.427


                                                                                

RMSE = 13.7269




R2 = 0.654543


                                                                                

In [56]:
spark.stop()