# Spark Practical Work

The goal is to build a model that predicts the arrival delay of a commercial flight from variables known at take-off time. Tasks:
* Load the input data, previously stored at a known location.
* Select, process and transform the input variables to prepare them for training the model.
* Perform some basic analysis of each input variable.
* Create an ML model that predicts the arrival delay time.
* Validate the created model and provide some measures of its accuracy.


In [2]:
import os
os.getcwd()

'/home/dslab/workspaces/rrunix/spark/final_project'

# 1. Load data

In [12]:
import bz2

# Path to the compressed input; adjust to your own data location
file_path = "../BigData/data/project_data/1987.csv.bz2"

# Decompress the whole archive into memory
with bz2.open(file_path, "rb") as f:
    file_content = f.read()

# Write the decompressed bytes out as a plain CSV file
with open("../BigData/data/project_data/1987.csv", "wb") as f:
    f.write(file_content)
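
The cell above holds the entire decompressed file in memory before writing it out. For larger yearly files, a streaming variant may be preferable. The following is a minimal sketch; the helper name `decompress_bz2` and the throwaway sample file are illustrative assumptions, not part of the project.

```python
import bz2
import os
import shutil
import tempfile


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path in fixed-size chunks."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads and writes chunk by chunk, so memory use stays bounded
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Self-contained demo on a temporary file (hypothetical sample data,
# since the project paths are not assumed to exist here)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.csv.bz2")
    dst = os.path.join(tmp, "sample.csv")
    with bz2.open(src, "wb") as f:
        f.write(b"Year,Month\n1987,10\n")
    decompress_bz2(src, dst)
    with open(dst, "rb") as f:
        roundtrip = f.read()

print(roundtrip)
```

With the project data, the call would look like `decompress_bz2("../BigData/data/project_data/1987.csv.bz2", "../BigData/data/project_data/1987.csv")`. Spark's CSV reader can usually read `.bz2` files directly as well, so the explicit decompression step may turn out to be optional.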

In [13]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Commercial Flights").getOrCreate()

csv_file_path = "../BigData/data/project_data/1987.csv"
df_pyspark = spark.read.csv(csv_file_path, header=True, inferSchema=True)
df_pyspark.printSchema()
df_pyspark.show(5)

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Car

The dataset has 29 columns, but we won't use all of them. The following columns hold values that are only known after the flight has landed, so they should be dropped:
* ArrTime
* ActualElapsedTime
* AirTime
* TaxiIn
* Diverted
* CarrierDelay
* WeatherDelay
* NASDelay
* SecurityDelay
* LateAircraftDelay
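
As an alternative to listing the kept columns by hand, the drop list above can be applied programmatically. This is only a sketch: the helper `usable_columns` and the short `example_columns` list are hypothetical stand-ins for `df_pyspark.columns`.

```python
# Columns listed above as unusable for the model
FORBIDDEN_COLUMNS = [
    "ArrTime", "ActualElapsedTime", "AirTime", "TaxiIn", "Diverted",
    "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
    "LateAircraftDelay",
]


def usable_columns(all_columns, forbidden=FORBIDDEN_COLUMNS):
    """Return all_columns without the forbidden ones, preserving order."""
    blocked = set(forbidden)
    return [c for c in all_columns if c not in blocked]


# Small stand-in for df_pyspark.columns (assumption for illustration)
example_columns = ["Year", "DepTime", "ArrTime", "ArrDelay", "TaxiIn", "TaxiOut"]
kept = usable_columns(example_columns)
print(kept)  # ['Year', 'DepTime', 'ArrDelay', 'TaxiOut']
```

On the real DataFrame this would be `df_pyspark.select(*usable_columns(df_pyspark.columns))`, or equivalently `df_pyspark.drop(*FORBIDDEN_COLUMNS)`. Keeping the drop list in one place makes it harder for the selection and the list to drift apart.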

In [15]:
# Keep only the columns we are going to use; the rest are dropped
my_df = df_pyspark.select(
    "Year", "Month", "DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime",
    "CRSArrTime", "UniqueCarrier", "FlightNum", "TailNum", "CRSElapsedTime",
    "ArrDelay", "DepDelay", "Origin", "Dest", "Distance", "TaxiOut",
    "Cancelled", "CancellationCode",
)

# Print the number of columns before and after the selection
print(len(df_pyspark.columns))
print(len(my_df.columns))


29
19


In [16]:
my_df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)

