<a href="https://colab.research.google.com/github/MarinaWolters/Coding-Tracker/blob/master/W11_DataStreams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Stream Processing

We will be using Spark Streaming (since everyone is familiar with Spark) as well as StreamParse (which is a bit lower-level).

## Spark Installation

As a preliminary step, we'll first set Apache Spark up on the current machine.

In [None]:
## Let's install Apache Spark on the local machine

import os

!apt install openjdk-8-jdk-headless
!wget https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

Reading package lists... Done
Building dependency tree       
Reading state information... Done
openjdk-8-jdk-headless is already the newest version (8u222-b10-1ubuntu1~18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-430
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.
--2019-11-22 23:09:57--  https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
Resolving www-us.apache.org (www-us.apache.org)... 40.79.78.1
Connecting to www-us.apache.org (www-us.apache.org)|40.79.78.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 230091034 (219M) [application/x-gzip]
Saving to: ‘spark-2.4.4-bin-hadoop2.7.tgz.3’


2019-11-22 23:10:14 (12.9 MB/s) - ‘spark-2.4.4-bin-hadoop2.7.tgz.3’ saved [230091034/230091034]



In [None]:
import findspark

findspark.init('./spark-2.4.5-bin-hadoop2.7')

In [None]:
# Let's set up a connection to local Spark

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("Flights").getOrCreate()
sc = spark.sparkContext

## Downloading Data from Google Drive

Next, we'll be analyzing **airline flight info** in streaming fashion.  This info started off in a giant data file from the US Department of Transportation's [Bureau of Transportation Statistics](https://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time).

First, we need to download it from Google Drive, where it's publicly shared.

In [None]:
!pip install googledrivedownloader



In [None]:
from google_drive_downloader import GoogleDriveDownloader as gdd

# One month's flights, which we'll split into segments for streaming
gdd.download_file_from_google_drive(file_id='1PPtjGx8lr_cDUfVa3qwlk1W8yY6hY91n',
                                    dest_path='/content/ontime.csv')

# Airlines
gdd.download_file_from_google_drive(file_id='1z80wSbMwQmFcr3S3OC5p_4FgnZysymUt',
                                    dest_path='/content/airlines.csv')


# Airports
gdd.download_file_from_google_drive(file_id='1Qe4FpLg473FjfVhdRGGO4dHLmJa1ETq4',
                                    dest_path='/content/airports.csv')

# 2015 flights
gdd.download_file_from_google_drive(file_id='1-rvBDAwtET-J9LcDG_TDGE9G4Yh5DhkQ',
                                    dest_path='/content/2015-ontime.csv')

Let's see what we have downloaded...

In [None]:
!ls -l /content/*.csv

-rw-r--r-- 1 root root 592406591 Nov 22 19:39 /content/2015-ontime.csv
-rw-r--r-- 1 root root       359 Nov 22 19:39 /content/airlines.csv
-rw-r--r-- 1 root root     23867 Nov 22 19:39 /content/airports.csv
-rw-r--r-- 1 root root  28805770 Nov 22 19:39 /content/ontime.csv


In [None]:
# Now, to demonstrate Spark's incremental processing, we'll break the one-month
# ontime.csv file into 10000 chunks; the l/ prefix keeps every line intact
# (a plain `-n 10000` cuts at byte offsets and can split a row in two)
! split -n l/10000 ontime.csv

!head ontime.csv

"YEAR","MONTH","DAY_OF_MONTH","AIRLINE_ID","CARRIER","FL_NUM","ORIGIN","DEST","ARR_DELAY_NEW","CANCELLED",
2018,1,2,19393,"WN","1325","SJU","MCO",0.00,0.00,
2018,1,2,19393,"WN","5159","SJU","MCO",0.00,0.00,
2018,1,2,19393,"WN","5890","SJU","MCO",9.00,0.00,
2018,1,2,19393,"WN","6618","SJU","MCO",0.00,0.00,
2018,1,2,19393,"WN","1701","SJU","MDW",8.00,0.00,
2018,1,2,19393,"WN","844","SJU","TPA",23.00,0.00,
2018,1,2,19393,"WN","4679","SJU","TPA",0.00,0.00,
2018,1,2,19393,"WN","6294","SLC","BUR",20.00,0.00,
2018,1,2,19393,"WN","5245","SLC","DAL",0.00,0.00,
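
The same chunking can be sketched in plain Python, which makes it easy to repeat the header line at the top of every chunk. This matters if the chunks are later read with a `header` option, since the first line of a headerless chunk may then be treated as a header and skipped. The function name `split_csv`, the `chunk_NNNNN.csv` naming, and the synthetic demo file are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile


def split_csv(path, out_dir, lines_per_chunk=10000):
    """Split a CSV into files of at most `lines_per_chunk` data rows each,
    repeating the header line at the top of every chunk."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(path) as src:
        header = src.readline()
        out = None
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file and write the header first
                if out is not None:
                    out.close()
                chunk_path = os.path.join(
                    out_dir, "chunk_%05d.csv" % (i // lines_per_chunk))
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w")
                out.write(header)
            out.write(line)
        if out is not None:
            out.close()
    return chunk_paths


# Tiny demonstration on a synthetic file (not the real ontime.csv)
demo_dir = tempfile.mkdtemp()
demo_csv = os.path.join(demo_dir, "demo.csv")
with open(demo_csv, "w") as f:
    f.write("CARRIER,FL_NUM\n")
    for n in range(25):
        f.write("WN,%d\n" % n)

demo_chunks = split_csv(demo_csv, os.path.join(demo_dir, "chunks"),
                        lines_per_chunk=10)
print(len(demo_chunks))  # 25 rows at 10 rows per chunk -> 3 chunks
```
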


In [None]:
!head airports.csv

IATA_CODE,AIRPORT,CITY,STATE,COUNTRY,LATITUDE,LONGITUDE
ABE,Lehigh Valley International Airport,Allentown,PA,USA,40.65236,-75.44040
ABI,Abilene Regional Airport,Abilene,TX,USA,32.41132,-99.68190
ABQ,Albuquerque International Sunport,Albuquerque,NM,USA,35.04022,-106.60919
ABR,Aberdeen Regional Airport,Aberdeen,SD,USA,45.44906,-98.42183
ABY,Southwest Georgia Regional Airport,Albany,GA,USA,31.53552,-84.19447
ACK,Nantucket Memorial Airport,Nantucket,MA,USA,41.25305,-70.06018
ACT,Waco Regional Airport,Waco,TX,USA,31.61129,-97.23052
ACV,Arcata Airport,Arcata/Eureka,CA,USA,40.97812,-124.10862
ACY,Atlantic City International Airport,Atlantic City,NJ,USA,39.45758,-74.57717


In [None]:
!head airlines.csv

IATA_CODE,AIRLINE
UA,United Air Lines Inc.
AA,American Airlines Inc.
US,US Airways Inc.
F9,Frontier Airlines Inc.
B6,JetBlue Airways
OO,Skywest Airlines Inc.
AS,Alaska Airlines Inc.
NK,Spirit Air Lines
WN,Southwest Airlines Co.


In [None]:
!ls

2015-ontime.csv			 xbqk  xdhf  xeya  xgov  xifq  xjwl  xlng  xneb
airlines.csv			 xbql  xdhg  xeyb  xgow  xifr  xjwm  xlnh  xnec
airports.csv			 xbqm  xdhh  xeyc  xgox  xifs  xjwn  xlni  xned
ontime.csv			 xbqn  xdhi  xeyd  xgoy  xift  xjwo  xlnj  xnee
sample_data			 xbqo  xdhj  xeye  xgoz  xifu  xjwp  xlnk  xnef
spark-2.4.4-bin-hadoop2.7	 xbqp  xdhk  xeyf  xgpa  xifv  xjwq  xlnl  xneg
spark-2.4.4-bin-hadoop2.7.tgz	 xbqq  xdhl  xeyg  xgpb  xifw  xjwr  xlnm  xneh
spark-2.4.4-bin-hadoop2.7.tgz.1  xbqr  xdhm  xeyh  xgpc  xifx  xjws  xlnn  xnei
spark-2.4.4-bin-hadoop2.7.tgz.2  xbqs  xdhn  xeyi  xgpd  xify  xjwt  xlno  xnej
spark-2.4.4-bin-hadoop2.7.tgz.3  xbqt  xdho  xeyj  xgpe  xifz  xjwu  xlnp  xnek
spark-warehouse			 xbqu  xdhp  xeyk  xgpf  xiga  xjwv  xlnq  xnel
xaaa				 xbqv  xdhq  xeyl  xgpg  xigb  xjww  xlnr  xnem
xaab				 xbqw  xdhr  xeym  xgph  xigc  xjwx  xlns  xnen
xaac				 xbqx  xdhs  xeyn  xgpi  xigd  xjwy  xlnt  xneo
xaad				 xbqy  xdht  xeyo  xgpj  xige  xjwz  xlnu  xnep
xaa

## From the filesystem to Spark's HDFS filesystem

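The streaming query below watches a directory through Hadoop's filesystem API, the same interface Spark uses to talk to HDFS, so we'll copy the input files there.  (With no cluster configured, as on this machine, that API most likely resolves a path such as `/in` against the local disk.)  Let's start by getting a handle on the filesystem...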

In [None]:
# Next, for Spark we will need to copy the files to HDFS
# So first we need to connect to the HDFS filesystem

######
# From https://diogoalexandrefranco.github.io/interacting-with-hdfs-from-pyspark/
#
# Get fs handler from java gateway
######
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(sc._jsc.hadoopConfiguration())

# Make sure we have an empty directory in HDFS
fs.delete(Path('/in'), True)
fs.mkdirs(Path('/in'))


True

## Creating a Streaming (Microbatched) Query

As a first part of this notebook, we will look at doing *incremental computation* over data as it arrives.  To do this we'll use Spark's **microbatch** capability, where it incrementally re-runs a computation based on a "triggering" event (typically a delay).

Here's how this will work:

1. We'll define a "stream query" that gets periodically executed, tracking, for each carrier and flight number, how many flights we've seen so far and the average delay.
1. The query will look for csv files as they are added into the `/in` directory, as a series of chunks in a stream.
1. We'll parse the CSV format into appropriate columns, arriving at a Spark dataframe.
1. We'll register the dataframe under an SQL table name, so we can query it from SparkSQL.
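
To make the idea of incremental recomputation concrete, here is a minimal pure-Python sketch of what a microbatched "count and average per key" query has to maintain between batches. It only illustrates the bookkeeping and is not how Spark implements it internally; the class name `RunningAverages` and the sample rows are made up for the example.

```python
from collections import defaultdict


class RunningAverages:
    """Keeps a per-key count and sum so the average can be updated
    batch by batch without rereading earlier data."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def add_batch(self, rows):
        # rows: iterable of (key, delay); a delay of None is skipped,
        # mirroring how SQL count(col) and avg(col) ignore NULLs
        for key, delay in rows:
            if delay is None:
                continue
            self._count[key] += 1
            self._total[key] += delay

    def result(self):
        # The full, up-to-date answer after the batches seen so far
        return {k: (self._count[k], self._total[k] / self._count[k])
                for k in sorted(self._count)}


state = RunningAverages()

# First microbatch: one file's worth of (carrier, flight number) delays
state.add_batch([(("WN", 1325), 0.0), (("WN", 844), 23.0)])
after_first = state.result()

# Second microbatch: only the new rows are processed
state.add_batch([(("WN", 1325), 10.0), (("WN", 844), None)])
after_second = state.result()

print(after_second[("WN", 1325)])  # (2, 5.0)
```

Each call to `result()` returns the whole table, which corresponds to re-emitting the complete aggregate after every batch rather than only the changed rows.
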


In [None]:
# We need this to set a schema
from pyspark.sql.types import StructType

# Here's the basic schema of ontime.csv (and its splits)
flightSchema = StructType().add("YEAR", "integer").add("MONTH", "integer")\
  .add("DAY_OF_MONTH", "integer").add("AIRLINE_ID", "integer")\
  .add("CARRIER", "string").add("FL_NUM", "integer").add("ORIGIN", "string")\
  .add("DEST", "string").add("ARR_DELAY", "double").add("CANCELLED", "double")
airlineSchema = StructType().add("IATA_CODE", "string").add("AIRLINE", "string")
airportSchema = StructType().add("IATA_CODE", "string").add("AIRPORT", "string")\
  .add("CITY", "string").add("STATE", "string").add("COUNTRY", "string")\
  .add("LATITUDE", "double").add("LONGITUDE", "double")

# The airlines
airlinesDF = spark.read.option("sep", ",").option("header", "true").\
  schema(airlineSchema).csv('airlines.csv')
airlinesDF.createOrReplaceTempView("airlines")

# The airports
airportsDF = spark.read.option("sep", ",").option("header", "true").\
  schema(airportSchema).csv('airports.csv')
airportsDF.createOrReplaceTempView("airports")

# This will be a stream, and in each microbatch we'll read one file at a time
flightsStreamDF = spark.readStream.option("sep", ",").option("header", "true").\
  option("maxFilesPerTrigger", 1).\
  schema(flightSchema).csv("/in/")
flightsStreamDF.createOrReplaceTempView("flights")


avg_delay = spark.sql("""select CARRIER,FL_NUM, ORIGIN, DEST, org.LATITUDE AS from_lat, org.LONGITUDE AS from_long,
                          dst.LATITUDE AS to_lat, dst.LONGITUDE AS to_long,
                          count(ARR_DELAY) as NbrFlights, 
                          avg(ARR_DELAY) as avg_delay 
                        from (flights f join airports org on f.origin=org.IATA_CODE) join airports dst on f.dest=dst.IATA_CODE
                        GROUP BY CARRIER, FL_NUM, ORIGIN, DEST, org.LATITUDE, org.LONGITUDE, dst.LATITUDE, dst.LONGITUDE
                        ORDER BY CARRIER, FL_NUM, ORIGIN, DEST""")


## Launching the Stream Query

Now we will have to do three things:
1. Launch the streaming query (which initially has no data)
1. Periodically add new data into the input stream (by copying a file into `/in`).
1. Show the updated query results.

The code below shows only the first 4 rows of the stream output, but you should be able to see them evolve over time.  You can press the Stop button to end the query (then execute `query.stop()` in the next cell) or let it run until it completes.

In [None]:
# We'll need this to periodically sleep
import time

# Start the query, run every 1 second, recompute the complete output, and in
# each case store it in-memory in a table called flight_info
query = avg_delay.writeStream.outputMode("complete").queryName("flight_info").format("memory").\
    trigger(processingTime='1 seconds').start()

# As the query is running, start copying the split chunks (xaaa, xaab, ...)
# from /content into /in, skipping everything else that lives in /content.
# Then wait 3 sec for Spark to process each one, and display the updated output
for filename in sorted(os.listdir('/content')):
    if not filename.startswith('x'):
        continue
    fs.copyFromLocalFile(Path('/content/' + filename), Path('/in'))
    time.sleep(3)
    display(spark.sql("select * from flight_info").limit(4).toPandas())

query.stop()

# Time-Windowed Processing

Up to this point, we've only looked at Spark Streaming in the context of incremental recomputation.  Of course, in many cases you want to do computation over temporal aspects of the data.

For this one we'll use the much bigger longitudinal dataset for 2015 on-time performance.  Its schema is considerably bigger than that of the simpler `ontime.csv`.

For simplicity we will load the whole file into a single dataframe, without streaming.
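
Before turning to Spark's own windowing function, it may help to see what a fixed-size ("tumbling") time window does to individual timestamps. The sketch below assigns each event to the one-day window that contains it and averages per window. It assumes windows are aligned to the Unix epoch and ignores time zones; the helper name `tumbling_window` and the three sample events are invented for the illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window(ts, size):
    """Return (start, end) of the fixed-size window containing `ts`,
    with windows aligned to the Unix epoch."""
    offset = (ts - EPOCH) % size  # how far ts sits inside its window
    start = ts - offset
    return start, start + size


# (scheduled departure, arrival delay in minutes), made-up sample events
events = [
    (datetime(2015, 1, 1, 0, 5), -22),
    (datetime(2015, 1, 1, 23, 50), 8),
    (datetime(2015, 1, 2, 0, 20), 5),
]

# Group delays by the start of their one-day window, then average
by_window = {}
for ts, delay in events:
    start, _ = tumbling_window(ts, timedelta(days=1))
    by_window.setdefault(start, []).append(delay)

daily_avg = {start: sum(v) / len(v) for start, v in by_window.items()}
print(daily_avg[datetime(2015, 1, 1)])  # -7.0
```

Because the window size and the slide are equal here, every event lands in exactly one window; a slide shorter than the size would put each event in several overlapping windows.
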

In [None]:
! head 2015-ontime.csv

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
2015,1,1,4,AS,98,N407AS,ANC,SEA,0005,2354,-11,21,0015,205,194,169,1448,0404,4,0430,0408,-22,0,0,,,,,,
2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,0010,0002,-8,12,0014,280,279,263,2330,0737,4,0750,0741,-9,0,0,,,,,,
2015,1,1,4,US,840,N171US,SFO,CLT,0020,0018,-2,16,0034,286,293,266,2296,0800,11,0806,0811,5,0,0,,,,,,
2015,1,1,4,AA,258,N3HYAA,LAX,MIA,0020,0015,-5,15,0030,285,281,258,2342,0748,8,0805,0756,-9,0,0,,,,,,
2015,1,1,4,AS,135,N527AS,SEA,ANC,0025,0024,-1,11,0035,235,215,199,1448,0254,5,0320,0259,-21,0,0,,,,,,
2015,1,1,4,DL,806,N3730B,SFO,MSP,0025,0020,-5,18,0038,217,230,206,1589,0604,6,0602,0610,8,0,0,,,,

In [None]:
# We need this to set a schema
from pyspark.sql.types import StructType
from pyspark.sql.functions import window

fs.copyFromLocalFile(Path('/content/2015-ontime.csv'),Path('/2015-ontime.csv'))

# Here's the basic schema of 2015-ontime.csv
ontimeSchema = StructType().add("YEAR", "integer").add("MONTH", "integer")\
  .add("DAY", "integer").add("DAY_OF_WEEK", "integer").add("AIRLINE_ID", "string")\
  .add("FL_NUM", "integer").add("TAIL_NUM", "string").add("ORIGIN", "string")\
  .add("DEST", "string").add("SCH_DEPARTURE", "integer").add("DEPARTURE", "integer")\
  .add("DEP_DELAY","integer").add("TAXI_OUT","integer").add("WHEELS_OFF","integer")\
  .add("SCH_TIME","integer").add("ELAPSED_TIME","integer").add("AIR_TIME","integer")\
  .add("DISTANCE","integer").add("WHEELS_ON","integer").add("TAXI_IN","integer")\
  .add("SCH_ARRIVAL","integer").add("ARRIVAL_TIME","integer")\
  .add("ARR_DELAY", "integer").add("DIVERTED", "integer").add("CANCELLED", "integer")\
  .add("CANCELLATION_REASON","string").add("AIR_SYSTEM_DELAY", "integer")\
  .add("SECURITY_DELAY", "integer").add("AIRLINE_DELAY", "integer")\
  .add("LATE_AIRCRAFT_DELAY", "integer").add("WEATHER_DELAY", "integer")

# This is a plain batch read (not a stream): we load the whole file at once
ontimeDF = spark.read.option("sep", ",").option("header", "true").\
  schema(ontimeSchema).csv("/2015-ontime.csv")
ontimeDF.createOrReplaceTempView("ontime")

display(ontimeDF.take(2))

[Row(YEAR=2015, MONTH=1, DAY=1, DAY_OF_WEEK=4, AIRLINE_ID='AS', FL_NUM=98, TAIL_NUM='N407AS', ORIGIN='ANC', DEST='SEA', SCH_DEPARTURE=5, DEPARTURE=2354, DEP_DELAY=-11, TAXI_OUT=21, WHEELS_OFF=15, SCH_TIME=205, ELAPSED_TIME=194, AIR_TIME=169, DISTANCE=1448, WHEELS_ON=404, TAXI_IN=4, SCH_ARRIVAL=430, ARRIVAL_TIME=408, ARR_DELAY=-22, DIVERTED=0, CANCELLED=0, CANCELLATION_REASON=None, AIR_SYSTEM_DELAY=None, SECURITY_DELAY=None, AIRLINE_DELAY=None, LATE_AIRCRAFT_DELAY=None, WEATHER_DELAY=None),
 Row(YEAR=2015, MONTH=1, DAY=1, DAY_OF_WEEK=4, AIRLINE_ID='AA', FL_NUM=2336, TAIL_NUM='N3KUAA', ORIGIN='LAX', DEST='PBI', SCH_DEPARTURE=10, DEPARTURE=2, DEP_DELAY=-8, TAXI_OUT=12, WHEELS_OFF=14, SCH_TIME=280, ELAPSED_TIME=279, AIR_TIME=263, DISTANCE=2330, WHEELS_ON=737, TAXI_IN=4, SCH_ARRIVAL=750, ARRIVAL_TIME=741, ARR_DELAY=-9, DIVERTED=0, CANCELLED=0, CANCELLATION_REASON=None, AIR_SYSTEM_DELAY=None, SECURITY_DELAY=None, AIRLINE_DELAY=None, LATE_AIRCRAFT_DELAY=None, WEATHER_DELAY=None)]

In [None]:
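# Build a timestamp (YMD) from the date columns and the HHMM departure time.
# Note: SCH_DURATION subtracts HHMM-encoded clock values, so it is a rough
# proxy for the scheduled duration rather than a true count of minutes.
simplerDF = spark.sql("""select cast(concat(cast(YEAR as string), '-',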
                      cast (MONTH as string), '-', CAST (DAY as string), ' ', 
                      cast(cast (SCH_DEPARTURE / 100 as integer) as string), ':', 
                           cast (SCH_DEPARTURE % 100 as string) ) as timestamp) as YMD,
                      AIRLINE_ID, ORIGIN, DEST, DISTANCE, ARR_DELAY, 
                      (SCH_ARRIVAL - SCH_DEPARTURE + 2400) % 2400 as SCH_DURATION 
                      from ontime""")
simplerDF.createOrReplaceTempView("simpler")

w = simplerDF.groupBy(
    window(simplerDF.YMD, "1 day", "1 day"),
    simplerDF.ORIGIN,
    simplerDF.DEST
).avg()

w

DataFrame[window: struct<start:timestamp,end:timestamp>, ORIGIN: string, DEST: string, avg(DISTANCE): double, avg(ARR_DELAY): double, avg(SCH_DURATION): double]
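
The `SCH_DURATION` expression above subtracts clock values encoded as HHMM integers, which is not the same as subtracting minutes. As a hedged sketch of the alternative, the helpers below first convert HHMM to minutes since midnight and then take the difference modulo one day. The function names are hypothetical, and the result still compares local clock times, so it does not account for time zones.

```python
def hhmm_to_minutes(hhmm):
    """Convert a clock value such as 430 (04:30) to minutes since midnight."""
    return (hhmm // 100) * 60 + hhmm % 100


def scheduled_minutes(sch_departure, sch_arrival):
    """Minutes from scheduled departure to scheduled arrival, wrapping
    past midnight; both arguments are HHMM-encoded integers."""
    return (hhmm_to_minutes(sch_arrival)
            - hhmm_to_minutes(sch_departure)) % (24 * 60)


# First sample row of 2015-ontime.csv: departs 0005, arrives 0430
print(scheduled_minutes(5, 430))   # 265
print((430 - 5 + 2400) % 2400)     # 425, the HHMM-style difference

# An overnight flight: departs 23:50, arrives 01:30 the next day
print(scheduled_minutes(2350, 130))  # 100
```
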

In [None]:
from pyspark.sql.functions import window, month, dayofmonth, hour


windowedDelays = w.select(month(w.window.start).alias("month"),\
                          dayofmonth(w.window.start).alias("day"),\
                          hour(w.window.start).alias("hour"),\
                          w.ORIGIN, w.DEST,\
               w['avg(DISTANCE)'].alias("distance"), 
               w['avg(SCH_DURATION)'].alias("duration"),
               w['avg(ARR_DELAY)'].alias("delay"))

delayDF = windowedDelays.filter(windowedDelays.month == 2)\
  .orderBy(windowedDelays.delay.desc()).toPandas()

In [None]:
delayDF

Unnamed: 0,month,day,hour,ORIGIN,DEST,distance,duration,delay
0,2,9,0,JFK,HNL,4983.0,615.000000,1467.0
1,2,22,0,EGE,ORD,1007.0,338.000000,1460.0
2,2,8,0,HNL,JFK,4983.0,1425.000000,1391.0
3,2,27,0,DFW,HNL,3784.0,461.000000,1295.0
4,2,28,0,ORD,EGE,1007.0,190.000000,1235.0
...,...,...,...,...,...,...,...,...
104440,2,21,0,EWR,GSP,594.0,204.000000,
104441,2,21,0,DCA,CAK,274.0,155.000000,
104442,2,23,0,MLU,DFW,293.0,146.000000,
104443,2,23,0,GRK,DFW,134.0,90.555556,


In [None]:
import pandas as pd
import numpy as np

# Take the dataframe and one-hot encode the airport origins and destinations
main_df = pd.concat([delayDF, pd.get_dummies(delayDF[['ORIGIN']],prefix='ORIGIN', drop_first=True),
          pd.get_dummies(delayDF[['DEST']],prefix='DEST', drop_first=True)], axis=1)

main_df.drop(['ORIGIN','DEST'], axis=1, inplace=True)

# Drop outliers (an hour or more) and mark things that arrive early as having 0 delay
main_df['delay'] = main_df['delay'].apply(lambda x: x if x >= 0 and x < 60 else 0 if x < 0 else np.nan)

# Some entries, such as delay, show up with NaN
main_df.dropna(how='any', axis=0, inplace=True)

main_df

Unnamed: 0,month,day,hour,distance,duration,delay,ORIGIN_ABI,ORIGIN_ABQ,ORIGIN_ABR,ORIGIN_ABY,ORIGIN_ACT,ORIGIN_ACV,ORIGIN_ACY,ORIGIN_ADK,ORIGIN_ADQ,ORIGIN_AEX,ORIGIN_AGS,ORIGIN_ALB,ORIGIN_ALO,ORIGIN_AMA,ORIGIN_ANC,ORIGIN_APN,ORIGIN_ASE,ORIGIN_ATL,ORIGIN_ATW,ORIGIN_AUS,ORIGIN_AVL,ORIGIN_AVP,ORIGIN_AZO,ORIGIN_BDL,ORIGIN_BET,ORIGIN_BFL,ORIGIN_BGM,ORIGIN_BGR,ORIGIN_BHM,ORIGIN_BIL,ORIGIN_BIS,ORIGIN_BJI,ORIGIN_BLI,ORIGIN_BMI,...,DEST_SHV,DEST_SIT,DEST_SJC,DEST_SJT,DEST_SJU,DEST_SLC,DEST_SMF,DEST_SMX,DEST_SNA,DEST_SPI,DEST_SPS,DEST_SRQ,DEST_STC,DEST_STL,DEST_STT,DEST_STX,DEST_SUN,DEST_SUX,DEST_SWF,DEST_SYR,DEST_TLH,DEST_TOL,DEST_TPA,DEST_TRI,DEST_TTN,DEST_TUL,DEST_TUS,DEST_TVC,DEST_TWF,DEST_TXK,DEST_TYR,DEST_TYS,DEST_UST,DEST_VEL,DEST_VLD,DEST_VPS,DEST_WRG,DEST_XNA,DEST_YAK,DEST_YUM
5960,2,26,0,108.0,97.500000,59.888889,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5961,2,2,0,762.0,220.461538,59.857143,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5962,2,8,0,475.0,266.000000,59.833333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
5963,2,20,0,1325.0,289.500000,59.833333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5964,2,25,0,997.0,303.833333,59.833333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102001,2,23,0,2607.0,428.000000,0.000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
102002,2,22,0,2607.0,428.000000,0.000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
102003,2,7,0,1504.0,370.000000,0.000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
102004,2,19,0,2520.0,885.000000,0.000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
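
The delay-cleaning rule used above packs three cases into one lambda. Spelled out as a plain function it reads as follows; this is a sketch of the same rule without pandas, where the name `clean_delay` and the sample values are assumptions for illustration.

```python
import math


def clean_delay(delay, cap=60.0):
    """Early arrivals become 0, delays below `cap` are kept as they are,
    and anything else (outliers or missing values) becomes NaN."""
    if delay is None or math.isnan(delay):
        return math.nan
    if delay < 0:
        return 0.0
    if delay < cap:
        return delay
    return math.nan


raw = [-22.0, 8.0, 59.9, 60.0, 1467.0, math.nan]
cleaned = [clean_delay(d) for d in raw]

# Dropping the NaN entries mirrors the dropna() step
kept = [d for d in cleaned if not math.isnan(d)]
print(kept)  # [0.0, 8.0, 59.9]
```
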


In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

y = main_df['delay']
X = main_df.drop(['delay'], axis=1)

delay_X_train, delay_X_test, delay_y_train, delay_y_test = train_test_split(\
  X, y, test_size=0.20, random_state=42)
regr = linear_model.LinearRegression()

regr.fit(delay_X_train, delay_y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
y_pred = regr.predict(delay_X_test)

What do the errors look like?

In [None]:
y_pred - delay_y_test

52753     3.449764
31759    -3.561367
49075     5.811687
50030     3.023708
97828     4.740913
           ...    
74926     5.365898
21647   -11.183247
27981    -1.496341
66845     2.997246
60324     4.716377
Name: delay, Length: 19210, dtype: float64

Let's analyze this more systematically, using the mean squared error and the R² score (the coefficient of determination)...
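
As a reminder of what these two metrics measure, here is a small hand-rolled version on made-up numbers. The `_manual` function names and the toy values are assumptions for illustration and are unrelated to the flight data.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares).
    1.0 is a perfect fit; 0.0 is no better than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


toy_true = [0.0, 10.0, 20.0, 30.0]
toy_pred = [2.0, 8.0, 22.0, 28.0]

toy_mse = mean_squared_error_manual(toy_true, toy_pred)
toy_r2 = r2_manual(toy_true, toy_pred)
print(toy_mse)  # 4.0: every prediction is off by 2, and 2 squared is 4
print(round(toy_r2, 3))  # 0.968
```
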

In [None]:
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(delay_y_test, y_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(delay_y_test, y_pred))

Coefficients: 
 [-1.07398801e+10  1.49183943e-01 -8.65222356e+10 -4.76052998e-04
 -1.46213324e-03 -7.56097059e+00 -5.12629706e+00 -6.65588543e+00
 -5.60330838e+00 -6.94016889e+00 -5.90954809e+00 -3.85614891e+00
  2.20237692e+00 -9.92574703e+00 -3.23599296e+00 -1.34451411e+00
 -5.73415900e+00 -7.02390148e+00 -5.34378859e+00 -7.09157578e+00
 -3.04338808e+00 -3.39101612e+00 -4.37166865e+00 -5.59124532e+00
 -4.81997047e+00 -8.35402261e+00 -7.50439942e+00 -5.21988110e+00
 -3.27146912e+00 -3.70863424e+00 -8.32162044e+00  3.10457614e+00
 -3.13906310e+00 -5.84388584e+00 -7.81940410e+00 -2.90983816e+00
 -1.04844453e+01 -7.70921702e+00 -3.02075771e+00 -4.68880963e+00
 -7.29429601e+00  2.67651904e+00 -8.29352198e-01 -6.84422177e+00
 -6.78506904e+00 -7.80836798e+00 -5.70715971e+00 -4.23007184e+00
 -9.25490626e+00 -2.33873934e+00 -1.33251463e+00 -3.85564473e+00
 -4.92212176e+00 -2.76457226e+00 -6.39504722e+00 -1.25334306e+00
 -6.86135572e+00 -7.86732840e+00 -6.76468834e+00 -6.85995021e+00
 -3.84535

# Bonus Exercise

See if you can combine the incremental stream processing portion of the notebook with the time window-based computation!