# Flight Delay Prediction:
- This project uses flight data from various routes to predict flight delays with PySpark.
- We used several Spark libraries: ML, Structured Streaming, and GraphFrames.
- We covered the whole workflow: getting the data, cleaning it, modelling, and evaluation.
- This was a time-limited task, so we prioritized the main deliverables:
  - Predict the delay category using any <b>Classifier</b> of your choice.
  - All your steps should be in a pipeline.
  - You should obtain at least <b>0.5 f1-score</b> and <b>0.6 accuracy</b>.

## Environment Preparation for Colab :

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=557157915f774a66ded1446abc433f6df621ba7aaa5d53033441cd4ee9c38cf8
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [None]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://bitbucket.org/habedi/datasets/raw/b6769c4664e7ff68b001e2f43bc517888cbe3642/spark/spark-3.0.2-bin-hadoop2.7.tgz
!tar xf spark-3.0.2-bin-hadoop2.7.tgz
!rm -rf spark-3.0.2-bin-hadoop2.7.tgz*
!pip -q install findspark pyspark graphframes

In [None]:
!wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.0-s_2.12/graphframes-0.8.2-spark3.0-s_2.12.jar -P /content/spark-3.0.2-bin-hadoop2.7/jars/
!cp /content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar /content/spark-3.0.2-bin-hadoop2.7/graphframes-0.8.2-spark3.0-s_2.12.zip

--2024-07-13 07:27:37--  https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.0-s_2.12/graphframes-0.8.2-spark3.0-s_2.12.jar
Resolving repos.spark-packages.org (repos.spark-packages.org)... 13.35.166.78, 13.35.166.111, 13.35.166.66, ...
Connecting to repos.spark-packages.org (repos.spark-packages.org)|13.35.166.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 247882 (242K) [binary/octet-stream]
Saving to: ‘/content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar’


2024-07-13 07:27:38 (590 KB/s) - ‘/content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar’ saved [247882/247882]



In [None]:
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = os.environ["SPARK_HOME"]

os.environ["PYSPARK_DRIVER_PYTHON"] = "jupyter"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

In [None]:
import findspark
findspark.init()

In [None]:
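# Note: each `!export` runs in its own subshell, so these lines do not change
# the notebook's environment. The os.environ assignments above are what take effect.
!export PYSPARK_SUBMIT_ARGS="--master local[*] pyspark-shell"
!export PYSPARK_DRIVER_PYTHON=jupyter
!export PYSPARK_DRIVER_PYTHON_OPTS=notebook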

In [None]:
from pyspark.sql import SparkSession
from graphframes import *

spark = SparkSession.builder.master("local[*]").appName("GraphFrames").getOrCreate()

In [None]:
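# Note: PYSPARK_SUBMIT_ARGS is read when the JVM is launched, so setting it after
# the SparkSession already exists likely has no effect on this session. The
# GraphFrames jar copied into the Spark jars folder earlier is what the session uses.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 pyspark-shell"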

In [None]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

## Streaming & GraphFrames:
- The provided data contains two dataframes: one for the GraphFrame vertices and one for the GraphFrame edges.
- Define one schema for each folder of the provided data, <b>VertFinalExam</b> and <b>EdgesFinalExam</b>.
- Create two empty folders to use as streaming read sources.
- Create a streaming reader to read streaming data from the read sources.

- For the streaming Edges dataframe, create a new column that indicates the delay category as follows:
    - Early: for early flights (negative delay values).
    - Late: for delayed flights (positive delay values).
    - OnTime: for on-time flights (delay value of 0).
- For the streaming Vertices dataframe, remove all rows whose state is an empty string (<b>state=""</b>).

- Create a writer for the final streaming Edges dataframe that writes the streaming data to a sink in Parquet format.
- Create a writer for the final streaming Vertices dataframe that writes the streaming data to a sink in Parquet format.
- Start a query for the Edges writer.
- Start a query for the Vertices writer.
- Read the vertices and edges data from the sink directories into static dataframes.
- Create a <b>GraphFrame</b> from these data.
- Apply the <b>PageRank</b> algorithm to find the <b>10</b> most important vertices. Order the results by rank in descending order.

## Machine Learning:
- Use the Edges dataframe from the GraphFrame creation in the previous part.
- Convert the three delay categories of the Edges dataframe into integers (0, 1, 2).



# Streaming :

In [None]:
df1 = spark.read.parquet("/content/MyFirstFolder/part-00000-18c44c6d-af85-42a2-924c-bb789e03af2d-c000.snappy.parquet")
df1.show()

+-------+-----+--------+---+---+
| tripid|delay|distance|src|dst|
+-------+-----+--------+---+---+
|1011245|    6|     602|ABE|ATL|
|1020600|   -8|     369|ABE|DTW|
|1021245|   -2|     602|ABE|ATL|
|1020605|   -4|     602|ABE|ATL|
|1031245|   -4|     602|ABE|ATL|
|1030605|    0|     602|ABE|ATL|
|1041243|   10|     602|ABE|ATL|
|1040605|   28|     602|ABE|ATL|
|1051245|   88|     602|ABE|ATL|
|1050605|    9|     602|ABE|ATL|
|1061215|   -6|     602|ABE|ATL|
|1061725|   69|     602|ABE|ATL|
|1061230|    0|     369|ABE|DTW|
|1060625|   -3|     602|ABE|ATL|
|1070600|    0|     369|ABE|DTW|
|1071725|    0|     602|ABE|ATL|
|1071230|    0|     369|ABE|DTW|
|1070625|    0|     602|ABE|ATL|
|1071219|    0|     569|ABE|ORD|
|1080600|    0|     369|ABE|DTW|
+-------+-----+--------+---+---+
only showing top 20 rows



In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df1_schema = StructType([
    StructField("tripid", IntegerType(), nullable=False),
    StructField("delay", IntegerType()),
    StructField("distance", IntegerType()),
    StructField("src", StringType()),
    StructField("dst", StringType())
    ])

In [None]:
df2 = spark.read.parquet("/content/MySecondFolder/part-00000-5eaa8c57-c4ec-45df-bb39-7524d02b9e78-c000.snappy.parquet")
df2.show()

+---+-------------+-----+-------+
| id|         City|State|Country|
+---+-------------+-----+-------+
|ABE|    Allentown|   PA|    USA|
|ABI|      Abilene|   TX|    USA|
|ABQ|  Albuquerque|   NM|    USA|
|ABR|     Aberdeen|   SD|    USA|
|ABY|       Albany|   GA|    USA|
|ACK|    Nantucket|   MA|    USA|
|ACT|         Waco|   TX|    USA|
|ACV|       Eureka|   CA|    USA|
|ACY|Atlantic City|   NJ|    USA|
|ADQ|       Kodiak|   AK|    USA|
|AEX|   Alexandria|   LA|    USA|
|AGS|      Augusta|   GA|    USA|
|AHN|       Athens|   GA|    USA|
|AIA|     Alliance|   NE|    USA|
|AKN|  King Salmon|   AK|    USA|
|ALB|       Albany|   NY|    USA|
|ALO|     Waterloo|   IA|    USA|
|ALS|      Alamosa|   CO|    USA|
|ALW|  Walla Walla|   WA|    USA|
|AMA|     Amarillo|   TX|    USA|
+---+-------------+-----+-------+
only showing top 20 rows



In [None]:
df2_schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("City", StringType()),
    StructField("State", StringType()),
    StructField("Country", StringType())
])

In [None]:
#Edges dataframe

df1 = spark.readStream.schema(df1_schema).parquet("/content/MyFirstFolder")

In [None]:
#Vertices dataframe

df2 = spark.readStream.schema(df2_schema).parquet("/content/MySecondFolder")

I know I should create a child dataframe from df1 that contains the new column, to make the best use of Spark's implementation. However, I overwrite df1 here for simplicity, for the sake of the task and to avoid mixing up dataframes.

I applied the child-dataframe approach in the ML part below.

In [None]:
#Creating Delay column

from pyspark.sql.functions import when, col

df1 = df1.withColumn("Delay_Cat",
                  when(col("delay") < 0, "Early")
                  .when(col("delay") > 0, "Late")
                  .otherwise("OnTime"))
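
As a plain-Python illustration of the rule in this cell (a sketch only, not part of the Spark job; the function name `delay_category` is hypothetical): negative delays map to Early, positive delays to Late, and everything else falls through to OnTime. In Spark, a comparison against a null `delay` evaluates to null rather than true, so such rows would also land in the `otherwise` branch. The sketch mirrors that with `None`.

```python
from typing import Optional


def delay_category(delay: Optional[int]) -> str:
    """Mirror the when/otherwise chain used for the Delay_Cat column."""
    if delay is not None and delay < 0:
        return "Early"
    if delay is not None and delay > 0:
        return "Late"
    # Zero delays, and missing delays (None), fall through to the default.
    return "OnTime"


for d in (6, -8, 0, None):
    print(d, "->", delay_category(d))
```

If missing delays should not count as on time, filtering them out before this step would be one option.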

In [None]:
#Removing empty State

df2 = df2.filter(df2.State != "")

In [None]:
# write stream df1

from pyspark.sql.functions import *
from pyspark.sql.types import *

w1 = df1.writeStream \
   .format("parquet") \
   .option("checkpointLocation","/content/chk1") \
   .option("path", "/content/Out1")

In [None]:
w2 = df2.writeStream \
   .format("parquet") \
   .option("checkpointLocation","/content/chk2") \
   .option("path", "/content/Out2")

In [None]:
df1.printSchema()

root
 |-- tripid: integer (nullable = true)
 |-- delay: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- Delay_Cat: string (nullable = false)



In [None]:
q1 = w1.start()

In [None]:
q1.stop()

In [None]:
q2 = w2.start()

In [None]:
q2.stop()
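
If the start and stop cells are run back to back, a query may be stopped before every file in the source folder has reached the sink. One way to make this deterministic, sketched under the assumption that the source folders are already populated, is to call `processAllAvailable()` on the running query before `stop()`. The helper name `run_until_drained` is hypothetical.

```python
def run_until_drained(writer):
    """Start a streaming writer, wait for pending input, then stop the query.

    `writer` is expected to behave like a DataStreamWriter such as w1 or w2.
    """
    query = writer.start()
    try:
        # Blocks until the data available at the source has been processed.
        query.processAllAvailable()
    finally:
        # Always release the query, even if processing raised an error.
        query.stop()
    return query


# Usage in this notebook (assumes w1 and w2 are defined as above):
# q1 = run_until_drained(w1)
# q2 = run_until_drained(w2)
```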

In [None]:
# import shutil

# shutil.rmtree("/content/chk1")

# GraphFrames :

In [None]:
df1_schema1 = StructType([
    StructField("tripid", IntegerType(), nullable=False),
    StructField("delay", IntegerType()),
    StructField("distance", IntegerType()),
    StructField("src", StringType()),
    StructField("dst", StringType()),
    StructField("Delay_Cat", StringType())
    ])

In [None]:
E_df = spark.read.format("parquet") \
    .schema(df1_schema1) \
    .load("/content/Out1")

In [None]:
E_df.show()

+-------+-----+--------+---+---+---------+
| tripid|delay|distance|src|dst|Delay_Cat|
+-------+-----+--------+---+---+---------+
|1010630|  -10|     928|RSW|EWR|    Early|
|1021029|   87|     974|RSW|ORD|     Late|
|1021346|    0|     928|RSW|EWR|   OnTime|
|1021044|   18|     928|RSW|EWR|     Late|
|1021730|   29|     748|RSW|IAH|     Late|
|1020535|  605|     974|RSW|ORD|     Late|
|1021820|   71|     974|RSW|ORD|     Late|
|1021743|    0|     928|RSW|EWR|   OnTime|
|1022017|    0|     928|RSW|EWR|   OnTime|
|1020600|   -2|     748|RSW|IAH|    Early|
|1021214|   29|     891|RSW|CLE|     Late|
|1020630|   -5|     928|RSW|EWR|    Early|
|1031029|   13|     974|RSW|ORD|     Late|
|1031346|  279|     928|RSW|EWR|     Late|
|1031740|   29|     748|RSW|IAH|     Late|
|1030535|    0|     974|RSW|ORD|   OnTime|
|1031808|   -3|     974|RSW|ORD|    Early|
|1031516|   -2|    1396|RSW|DEN|    Early|
|1032017|   14|     928|RSW|EWR|     Late|
|1031214|   17|     891|RSW|CLE|     Late|
+-------+-----+--------+---+---+---------+

In [None]:
V_df = spark.read.format("parquet") \
    .schema(df2_schema) \
    .load("/content/Out2")

In [None]:
V_df.show()

+---+-------------+-----+-------+
| id|         City|State|Country|
+---+-------------+-----+-------+
|ABE|    Allentown|   PA|    USA|
|ABI|      Abilene|   TX|    USA|
|ABQ|  Albuquerque|   NM|    USA|
|ABR|     Aberdeen|   SD|    USA|
|ABY|       Albany|   GA|    USA|
|ACK|    Nantucket|   MA|    USA|
|ACT|         Waco|   TX|    USA|
|ACV|       Eureka|   CA|    USA|
|ACY|Atlantic City|   NJ|    USA|
|ADQ|       Kodiak|   AK|    USA|
|AEX|   Alexandria|   LA|    USA|
|AGS|      Augusta|   GA|    USA|
|AHN|       Athens|   GA|    USA|
|AIA|     Alliance|   NE|    USA|
|AKN|  King Salmon|   AK|    USA|
|ALB|       Albany|   NY|    USA|
|ALO|     Waterloo|   IA|    USA|
|ALS|      Alamosa|   CO|    USA|
|ALW|  Walla Walla|   WA|    USA|
|AMA|     Amarillo|   TX|    USA|
+---+-------------+-----+-------+
only showing top 20 rows



In [None]:
gf = GraphFrame(V_df, E_df)

In [None]:
gf.vertices.show()

+---+-------------+-----+-------+
| id|         City|State|Country|
+---+-------------+-----+-------+
|ABE|    Allentown|   PA|    USA|
|ABI|      Abilene|   TX|    USA|
|ABQ|  Albuquerque|   NM|    USA|
|ABR|     Aberdeen|   SD|    USA|
|ABY|       Albany|   GA|    USA|
|ACK|    Nantucket|   MA|    USA|
|ACT|         Waco|   TX|    USA|
|ACV|       Eureka|   CA|    USA|
|ACY|Atlantic City|   NJ|    USA|
|ADQ|       Kodiak|   AK|    USA|
|AEX|   Alexandria|   LA|    USA|
|AGS|      Augusta|   GA|    USA|
|AHN|       Athens|   GA|    USA|
|AIA|     Alliance|   NE|    USA|
|AKN|  King Salmon|   AK|    USA|
|ALB|       Albany|   NY|    USA|
|ALO|     Waterloo|   IA|    USA|
|ALS|      Alamosa|   CO|    USA|
|ALW|  Walla Walla|   WA|    USA|
|AMA|     Amarillo|   TX|    USA|
+---+-------------+-----+-------+
only showing top 20 rows



In [None]:
gf.edges.show()

+-------+-----+--------+---+---+---------+
| tripid|delay|distance|src|dst|Delay_Cat|
+-------+-----+--------+---+---+---------+
|1010630|  -10|     928|RSW|EWR|    Early|
|1021029|   87|     974|RSW|ORD|     Late|
|1021346|    0|     928|RSW|EWR|   OnTime|
|1021044|   18|     928|RSW|EWR|     Late|
|1021730|   29|     748|RSW|IAH|     Late|
|1020535|  605|     974|RSW|ORD|     Late|
|1021820|   71|     974|RSW|ORD|     Late|
|1021743|    0|     928|RSW|EWR|   OnTime|
|1022017|    0|     928|RSW|EWR|   OnTime|
|1020600|   -2|     748|RSW|IAH|    Early|
|1021214|   29|     891|RSW|CLE|     Late|
|1020630|   -5|     928|RSW|EWR|    Early|
|1031029|   13|     974|RSW|ORD|     Late|
|1031346|  279|     928|RSW|EWR|     Late|
|1031740|   29|     748|RSW|IAH|     Late|
|1030535|    0|     974|RSW|ORD|   OnTime|
|1031808|   -3|     974|RSW|ORD|    Early|
|1031516|   -2|    1396|RSW|DEN|    Early|
|1032017|   14|     928|RSW|EWR|     Late|
|1031214|   17|     891|RSW|CLE|     Late|
+-------+-----+--------+---+---+---------+

## PageRank :

In [None]:
pagerank = gf.pageRank(resetProbability=0.15, maxIter=5)

top_vertices = pagerank.vertices.select("id", "pagerank").orderBy("pagerank", ascending=False).limit(10)

In [None]:
top_vertices.show()

+---+------------------+
| id|          pagerank|
+---+------------------+
|ATL|31.402169285067313|
|DFW| 22.76415219751248|
|ORD| 21.83241348762772|
|DEN|16.026921025779515|
|LAX|14.358865452635795|
|IAH|13.229634347806075|
|SFO|11.322517232690489|
|PHX|10.852423159730376|
|SLC| 9.622759351860472|
|LAS| 8.778471071473987|
+---+------------------+
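
To give an intuition for what the ranks above measure, here is a minimal pure-Python sketch of a fixed-iteration PageRank. It is not the GraphFrames implementation. It assumes the commonly described update rule, where every vertex starts at 1.0 and each iteration sets rank = reset + (1 - reset) * (sum of incoming shares). The function names and the toy route list are made up for illustration.

```python
from collections import defaultdict


def pagerank(edges, reset_probability=0.15, max_iter=5):
    """Fixed-iteration PageRank over a list of (src, dst) pairs."""
    out_degree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        out_degree[src] += 1
        nodes.update((src, dst))

    ranks = {node: 1.0 for node in nodes}
    for _ in range(max_iter):
        incoming = defaultdict(float)
        for src, dst in edges:
            # Each edge carries an equal share of its source's current rank.
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {
            node: reset_probability + (1.0 - reset_probability) * incoming[node]
            for node in nodes
        }
    return ranks


def top_k(ranks, k):
    """Return the k highest-ranked (node, rank) pairs, best first."""
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy route list (made-up data): every airport flies to ATL and back.
toy_edges = [("ABE", "ATL"), ("DTW", "ATL"), ("ORD", "ATL"),
             ("ATL", "ABE"), ("ATL", "DTW"), ("ATL", "ORD")]
print(top_k(pagerank(toy_edges), 2))
```

Because every flight is its own edge, routes that are flown often contribute more shares, which is consistent with large hubs appearing at the top of the list.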



# Machine Learning :

## Data Exploration :

In [None]:
E_df_mapped = E_df.withColumn("Delay_Cat",
    when(col("Delay_Cat") == "Early", 0)
    .when(col("Delay_Cat") == "OnTime", 1)
    .otherwise(2))

In [None]:
E_df_mapped.show()

+-------+-----+--------+---+---+---------+
| tripid|delay|distance|src|dst|Delay_Cat|
+-------+-----+--------+---+---+---------+
|1010630|  -10|     928|RSW|EWR|        0|
|1021029|   87|     974|RSW|ORD|        2|
|1021346|    0|     928|RSW|EWR|        1|
|1021044|   18|     928|RSW|EWR|        2|
|1021730|   29|     748|RSW|IAH|        2|
|1020535|  605|     974|RSW|ORD|        2|
|1021820|   71|     974|RSW|ORD|        2|
|1021743|    0|     928|RSW|EWR|        1|
|1022017|    0|     928|RSW|EWR|        1|
|1020600|   -2|     748|RSW|IAH|        0|
|1021214|   29|     891|RSW|CLE|        2|
|1020630|   -5|     928|RSW|EWR|        0|
|1031029|   13|     974|RSW|ORD|        2|
|1031346|  279|     928|RSW|EWR|        2|
|1031740|   29|     748|RSW|IAH|        2|
|1030535|    0|     974|RSW|ORD|        1|
|1031808|   -3|     974|RSW|ORD|        0|
|1031516|   -2|    1396|RSW|DEN|        0|
|1032017|   14|     928|RSW|EWR|        2|
|1031214|   17|     891|RSW|CLE|        2|
+-------+-----+--------+---+---+---------+

In [None]:
E_df_mapped.printSchema()

root
 |-- tripid: integer (nullable = false)
 |-- delay: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- Delay_Cat: integer (nullable = false)



In [None]:
train_df, test_df = E_df_mapped.randomSplit([0.8,0.2],seed=42)

In [None]:
summary = train_df.describe()

In [None]:
summary.show()

+-------+-----------------+------------------+-----------------+-------+-------+------------------+
|summary|           tripid|             delay|         distance|    src|    dst|         Delay_Cat|
+-------+-----------------+------------------+-----------------+-------+-------+------------------+
|  count|          1113550|           1113550|          1113550|1113550|1113550|           1113550|
|   mean|2180905.310870639|12.076650352476314|690.4121808630057|   null|   null|0.9446392169188631|
| stddev|838065.7442207639| 38.88248975964204|513.7984141939904|   null|   null|0.9501106881371156|
|    min|          1010010|              -112|               21|    ABE|    ABE|                 0|
|    max|          3312359|              1642|             4330|    YUM|    YUM|                 2|
+-------+-----------------+------------------+-----------------+-------+-------+------------------+



I did some more outlier checking here, especially on the distance column. I did not go very deep and found no obvious outliers. These cells were accidentally removed while I was cleaning the notebook.

## Preparation for modelling :

In [None]:
cat_cols = [col_name for col_name, col_type in E_df_mapped.dtypes if col_type == 'string']
print(cat_cols)

['src', 'dst']


In [None]:
strIndOut = [k+'_Index' for k,v in E_df_mapped.dtypes if v=='string']

In [None]:
strIndOut

['src_Index', 'dst_Index']

In [None]:
OHE_Out = [k+'_OHE' for k,v in E_df_mapped.dtypes if v=='string']
OHE_Out

['src_OHE', 'dst_OHE']

In [None]:
num_cols = [k for k,v in E_df_mapped.dtypes if ((v=='int'))]
num_cols

['tripid', 'delay', 'distance', 'Delay_Cat']

In [None]:
num_cols = [col for col in num_cols if col != 'Delay_Cat']

In [None]:
num_cols

['tripid', 'delay', 'distance']

In [None]:
all_cols = num_cols + OHE_Out
all_cols

['tripid', 'delay', 'distance', 'src_OHE', 'dst_OHE']
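
One thing worth checking in this feature list: `Delay_Cat` was computed from `delay` earlier in the notebook, so keeping `delay` among the features may let the model read the label from its inputs. The sketch below shows how an alternative list could be built. It assumes that `delay` should be excluded and that `tripid` is treated as an identifier rather than a feature. The `_alt` names are hypothetical, and this is not the list used for the results reported in this notebook.

```python
# (column, dtype) pairs matching the E_df_mapped schema printed earlier.
dtypes = [("tripid", "int"), ("delay", "int"), ("distance", "int"),
          ("src", "string"), ("dst", "string"), ("Delay_Cat", "int")]

label_col = "Delay_Cat"
# Assumption: columns kept out of the features. `delay` determines the label,
# and `tripid` is treated here as an identifier.
excluded = {label_col, "delay", "tripid"}

num_cols_alt = [name for name, dtype in dtypes
                if dtype == "int" and name not in excluded]
ohe_cols_alt = [name + "_OHE" for name, dtype in dtypes if dtype == "string"]
all_cols_alt = num_cols_alt + ohe_cols_alt
print(all_cols_alt)
```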

## Pipeline :

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import *

In [None]:
# Softmax Regression

from pyspark.ml.classification import LogisticRegression

log_r = LogisticRegression(maxIter=100, regParam=0.1, elasticNetParam=0.0, family='multinomial',
                        featuresCol='features', labelCol='Delay_Cat', predictionCol='prediction')

In [None]:
str_Ind = StringIndexer(inputCols=cat_cols,outputCols=strIndOut,handleInvalid='skip')

In [None]:
OHE = OneHotEncoder(inputCols=strIndOut,outputCols=OHE_Out)

In [None]:
vec_Asmb = VectorAssembler(inputCols=all_cols,outputCol='features')
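
To make the indexing and encoding stages more concrete, here is a small pure-Python sketch of the idea behind them. It is not Spark code. It assumes the default behaviours as I understand them: `StringIndexer` gives the most frequent label index 0, and `OneHotEncoder` drops the last category so it is encoded as all zeros. The function names and sample values are made up.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; treat that detail as an assumption.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector


src_values = ["ATL", "ORD", "ATL", "ABE", "ORD", "ATL"]  # made-up sample
index = fit_string_index(src_values)
print(index)
print(one_hot(index["ATL"], len(index)))
print(one_hot(index["ABE"], len(index)))
```

Spark stores these vectors in sparse form, which is why the prediction output shows entries such as `(254,[7],[1.0])` instead of long lists of zeros.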

In [None]:
stages = [str_Ind,OHE,vec_Asmb,log_r]

In [None]:
pl = Pipeline(stages=stages)

In [None]:
model_pl = pl.fit(train_df)

## Prediction :

In [None]:
preds = model_pl.transform(test_df)

In [None]:
preds.show()

+-------+-----+--------+---+---+---------+---------+---------+-----------------+----------------+--------------------+--------------------+--------------------+----------+
| tripid|delay|distance|src|dst|Delay_Cat|src_Index|dst_Index|          src_OHE|         dst_OHE|            features|       rawPrediction|         probability|prediction|
+-------+-----+--------+---+---+---------+---------+---------+-----------------+----------------+--------------------+--------------------+--------------------+----------+
|1010020|    0|    1273|SFO|DFW|        1|      7.0|      1.0|  (254,[7],[1.0])| (302,[1],[1.0])|(559,[0,2,10,258]...|[0.74557016749828...|[0.56724533174672...|       0.0|
|1010154|   -7|    1397|SJU|EWR|        0|     49.0|     11.0| (254,[49],[1.0])|(302,[11],[1.0])|(559,[0,1,2,52,26...|[0.70689808610824...|[0.57100948098031...|       0.0|
|1010344|   -7|    1455|SJU|BOS|        0|     49.0|     14.0| (254,[49],[1.0])|(302,[14],[1.0])|(559,[0,1,2,52,27...|[0.87895935635709...|[

## Model Evaluation :

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="Delay_Cat", predictionCol="prediction", metricName="f1")
f1_score = evaluator.evaluate(preds)

evaluator = MulticlassClassificationEvaluator(labelCol="Delay_Cat", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(preds)

print(f"F1 Score: {f1_score:.4f}")
print(f"Accuracy: {accuracy:.4f}")

F1 Score: 0.7201
Accuracy: 0.7587
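
For reference, the `f1` metric of `MulticlassClassificationEvaluator` is, to my understanding, a weighted F1: the per-class F1 scores averaged with weights proportional to how often each class occurs among the true labels. The pure-Python sketch below shows that calculation on made-up labels. The function name is hypothetical and the numbers are unrelated to the scores above.

```python
def weighted_f1_and_accuracy(labels, predictions):
    """Support-weighted F1 and accuracy for two equal-length label lists."""
    classes = sorted(set(labels))
    total = len(labels)
    weighted_f1 = 0.0
    for cls in classes:
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, predictions) if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight each class by its share of the true labels.
        weighted_f1 += f1 * (tp + fn) / total
    accuracy = sum(1 for y, p in zip(labels, predictions) if y == p) / total
    return weighted_f1, accuracy


# Made-up labels and predictions for the three classes (0, 1, 2).
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
f1, acc = weighted_f1_and_accuracy(y_true, y_pred)
print(f"F1 Score: {f1:.4f}")
print(f"Accuracy: {acc:.4f}")
```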


## Further Work:

- More data preparation and model exploration would be needed to improve the scores. This was out of scope given the time limit and the long training time on this data.
