Do the following:
1. Pick one use case (defined below).
    We picked the Ruter case.
2. Explore and research which algorithm would work best for this use case (regression or classification)
    To predict the number of passengers on a bus, regression is the best-suited approach, because the target is a number.
3. Document your findings in a markdown cell (3-5 lines) on why you chose this algorithm.
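    Classification assigns each observation to a discrete class (for example yes/no, or high/low occupancy). It could in theory be used here with some extra logic, such as one class per passenger-count range, but regression is better suited because it outputs the passenger count directly. With enough classes, classification could in theory be equally accurate, but in practice it is far inferior for this task.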
4. Train the algorithm using Python
    
5. Keep the solution as simple as possible. We are not looking for the best machine-learning algorithm. We are interested in seeing that you know how to work with machine learning.

6. Turn in a **JUPYTER NOTEBOOK** on canvas.

Predict passenger data for Ruter.

Use Ruter-data.csv dataset (in data folder). 

I want you to make a prediction algorithm which predicts the number of passengers on a specific date for a specific bus (pick any one). 

Input should be date and output will be number of passengers. 

You should also show the prediction percentage score. 

Data file: Ruter-data.csv

In [92]:
# imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

Load the data from the file and explore it.

In [93]:
ruter = pd.read_csv('./data/Ruter-data.csv' , sep=';', index_col=None)

## Number of rows and columns in ruter
# DataFrame.shape is (rows, columns)
print("Rows, columns")
print(ruter.shape)
#ruter.info()

Rows, columns
(6000, 17)


In [94]:
ruter.head(10)

Unnamed: 0,TurId,Dato,Fylke,Område,Kommune,Holdeplass_Fra,Holdeplass_Til,Linjetype,Linjefylke,Linjenavn,Linjeretning,Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra,Tidspunkt_Faktisk_Avgang_Holdeplass_Fra,Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra,Tidspunkt_Planlagt_Avgang_Holdeplass_Fra,Kjøretøy_Kapasitet,Passasjerer_Ombord
0,15006-2020-08-10T10:24:00+02:00,10/08/2020,Viken,Vest,Bærum,Nordliveien,Tjernsmyr,Lokal,Viken,150,0,10:53:53,10:53:59,10:53:00,10:53:00,112,5
1,15002-2020-08-15T12:54:00+02:00,15/08/2020,Viken,Vest,Bærum,Nadderud stadion,Bekkestua bussterminal (Plattform C),Lokal,Viken,150,0,13:12:20,13:12:26,13:12:00,13:12:00,112,5
2,15004-2020-08-03T09:54:00+02:00,03/08/2020,Viken,Vest,Bærum,Ringstabekkveien,Skallum,Lokal,Viken,150,0,10:18:56,10:19:21,10:19:00,10:19:00,112,6
3,15003-2020-07-27T13:00:00+02:00,27/07/2020,Viken,Vest,Bærum,Gruvemyra,Gullhaug,Lokal,Viken,150,1,13:52:04,13:52:26,13:51:00,13:51:00,112,10
4,15002-2020-08-27T07:15:00+02:00,27/08/2020,Viken,Vest,Bærum,Lysaker stasjon (Plattform A),Tjernsmyr,Lokal,Viken,150,1,07:34:13,07:34:53,07:33:00,07:33:00,112,10
5,3110-2020-08-01T16:16:00+02:00,01/08/2020,Oslo,Nordøst,Bjerke,Veitvet (mot Kalbakken),Rødtvet (mot Kalbakken),Lokal,Oslo,31,0,17:22:56,17:23:23,17:16:00,17:16:00,151,8
6,15010-2020-07-28T13:09:00+02:00,28/07/2020,Viken,Vest,Bærum,Nedre Toppenhaug,Øvre Toppenhaug,Lokal,Viken,150,0,13:19:00,13:19:05,13:17:00,13:17:00,112,1
7,15003-2020-07-27T06:18:00+02:00,27/07/2020,Oslo,Indre By,St.Hanshaugen,Hammersborggata (ved Storgata retning vest),St. Olavs plass (mot Frederiks gate),Lokal,Viken,150,1,06:20:24,06:20:29,06:20:00,06:20:00,112,-1
8,21002-2020-08-16T14:33:00+02:00,16/08/2020,Viken,Vest,Bærum,Stein gård,Knabberudveien,Lokal,Viken,150,1,15:10:12,15:10:34,15:09:00,15:09:00,112,2
9,15002-2020-08-13T18:09:00+02:00,13/08/2020,Viken,Vest,Bærum,Bekkestua bussterminal (Plattform C),Stabekk skole,Lokal,Viken,150,0,18:37:41,18:38:19,18:29:00,18:29:00,112,4


In [95]:
#print column names for copy paste purposes
print(ruter.columns.tolist())

['TurId', 'Dato', 'Fylke', 'Område', 'Kommune', 'Holdeplass_Fra', 'Holdeplass_Til', 'Linjetype', 'Linjefylke', 'Linjenavn', 'Linjeretning', 'Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra', 'Tidspunkt_Faktisk_Avgang_Holdeplass_Fra', 'Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra', 'Tidspunkt_Planlagt_Avgang_Holdeplass_Fra', 'Kjøretøy_Kapasitet', 'Passasjerer_Ombord']


Task: predict the number of passengers on a specific date for a specific bus (pick any one).

Input should be date and output will be number of passengers. You should also show the prediction percentage score.

In [96]:
# Checking for missing values
print(ruter.isnull().sum())


TurId                                        0
Dato                                         0
Fylke                                        0
Område                                       0
Kommune                                      0
Holdeplass_Fra                               0
Holdeplass_Til                               0
Linjetype                                    0
Linjefylke                                   0
Linjenavn                                    0
Linjeretning                                 0
Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra     0
Tidspunkt_Faktisk_Avgang_Holdeplass_Fra      0
Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra    0
Tidspunkt_Planlagt_Avgang_Holdeplass_Fra     0
Kjøretøy_Kapasitet                           0
Passasjerer_Ombord                           0
dtype: int64


In [97]:
# Negative passenger counts are set to zero, on the assumption that the best guess for such a bus is that it is empty.
# To estimate the true counts, one would have to check video from some of these trips and see whether there is a typical value.
ruter['Passasjerer_Ombord'] = ruter['Passasjerer_Ombord'].clip(lower=0)
ruter.describe()
ruter.head(10)


Unnamed: 0,TurId,Dato,Fylke,Område,Kommune,Holdeplass_Fra,Holdeplass_Til,Linjetype,Linjefylke,Linjenavn,Linjeretning,Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra,Tidspunkt_Faktisk_Avgang_Holdeplass_Fra,Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra,Tidspunkt_Planlagt_Avgang_Holdeplass_Fra,Kjøretøy_Kapasitet,Passasjerer_Ombord
0,15006-2020-08-10T10:24:00+02:00,10/08/2020,Viken,Vest,Bærum,Nordliveien,Tjernsmyr,Lokal,Viken,150,0,10:53:53,10:53:59,10:53:00,10:53:00,112,5
1,15002-2020-08-15T12:54:00+02:00,15/08/2020,Viken,Vest,Bærum,Nadderud stadion,Bekkestua bussterminal (Plattform C),Lokal,Viken,150,0,13:12:20,13:12:26,13:12:00,13:12:00,112,5
2,15004-2020-08-03T09:54:00+02:00,03/08/2020,Viken,Vest,Bærum,Ringstabekkveien,Skallum,Lokal,Viken,150,0,10:18:56,10:19:21,10:19:00,10:19:00,112,6
3,15003-2020-07-27T13:00:00+02:00,27/07/2020,Viken,Vest,Bærum,Gruvemyra,Gullhaug,Lokal,Viken,150,1,13:52:04,13:52:26,13:51:00,13:51:00,112,10
4,15002-2020-08-27T07:15:00+02:00,27/08/2020,Viken,Vest,Bærum,Lysaker stasjon (Plattform A),Tjernsmyr,Lokal,Viken,150,1,07:34:13,07:34:53,07:33:00,07:33:00,112,10
5,3110-2020-08-01T16:16:00+02:00,01/08/2020,Oslo,Nordøst,Bjerke,Veitvet (mot Kalbakken),Rødtvet (mot Kalbakken),Lokal,Oslo,31,0,17:22:56,17:23:23,17:16:00,17:16:00,151,8
6,15010-2020-07-28T13:09:00+02:00,28/07/2020,Viken,Vest,Bærum,Nedre Toppenhaug,Øvre Toppenhaug,Lokal,Viken,150,0,13:19:00,13:19:05,13:17:00,13:17:00,112,1
7,15003-2020-07-27T06:18:00+02:00,27/07/2020,Oslo,Indre By,St.Hanshaugen,Hammersborggata (ved Storgata retning vest),St. Olavs plass (mot Frederiks gate),Lokal,Viken,150,1,06:20:24,06:20:29,06:20:00,06:20:00,112,0
8,21002-2020-08-16T14:33:00+02:00,16/08/2020,Viken,Vest,Bærum,Stein gård,Knabberudveien,Lokal,Viken,150,1,15:10:12,15:10:34,15:09:00,15:09:00,112,2
9,15002-2020-08-13T18:09:00+02:00,13/08/2020,Viken,Vest,Bærum,Bekkestua bussterminal (Plattform C),Stabekk skole,Lokal,Viken,150,0,18:37:41,18:38:19,18:29:00,18:29:00,112,4


In [98]:
# Many of the variables are not relevant and will be dropped.
# To make sure there is enough data, start by counting the rows for each Holdeplass_Fra / Holdeplass_Til pair.
# Combine "from" and "to" into one route label.
ruter['Holdeplass_Fra_Til'] = 'fra_' + ruter['Holdeplass_Fra'] + '_til_' + ruter['Holdeplass_Til']
ruter['Holdeplass_Fra_Til'].value_counts().head(10)


Holdeplass_Fra_Til
fra_Jenseberget_til_Sagdalen                                                  26
fra_Teisenkrysset  (fra Helsfyr)_til_Ulvenkrysset  (fra Helsfyr)              25
fra_Kloppaveien_til_Ahus                                                      21
fra_Vålerenga  (mot Galgeberg)_til_Galgeberg  (fra Etterstad)                 20
fra_Knatten (Solheimsveien)_til_Kloppaveien                                   20
fra_Lillestrøm bussterminal (Plattform 18)_til_Nittedalsgata (mot Kjeller)    19
fra_Karihaugen  (mot Furuset)_til_Folkvangveien  (mot Furuset)                18
fra_Trosterud  (E6 mot Helsfyr)_til_Ulvenkrysset  (mot Helsfyr)               18
fra_Strømsbergveien_til_Stasjonsveien                                         17
fra_Knatten (Solheimsveien)_til_Lørenskog sentrum (plattform 4)               17
Name: count, dtype: int64

In [None]:
# We chose the route fra_Jenseberget_til_Sagdalen and dropped all other rows.
# .copy() makes the filtered frame independent, so columns added later do not trigger SettingWithCopyWarning.
ruter = ruter[ruter['Holdeplass_Fra_Til'] == 'fra_Jenseberget_til_Sagdalen'].copy()
# Keep Linjenavn, since the line may matter for where passengers are going.
ruter = ruter.drop(columns=['TurId', 'Fylke', 'Kommune', 'Område', 'Holdeplass_Fra_Til', 'Holdeplass_Fra', 'Holdeplass_Til', 'Linjetype', 'Linjefylke', 'Linjeretning'])
ruter.head(1)


Unnamed: 0,Dato,Linjenavn,Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra,Tidspunkt_Faktisk_Avgang_Holdeplass_Fra,Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra,Tidspunkt_Planlagt_Avgang_Holdeplass_Fra,Kjøretøy_Kapasitet,Passasjerer_Ombord
156,15/06/2020,100,22:43:24,22:43:29,22:42:00,22:42:00,151,6
320,27/07/2020,110,17:03:13,17:03:18,17:02:00,17:02:00,106,0
401,23/06/2020,110,23:20:47,23:20:52,23:20:00,23:20:00,106,5
801,09/06/2020,100,21:14:45,21:15:14,21:12:00,21:12:00,151,16
850,02/06/2020,100,23:28:54,23:29:26,23:27:00,23:27:00,151,6
880,30/07/2020,100,16:49:07,16:49:11,16:47:00,16:47:00,151,18
904,04/06/2020,100,09:58:42,09:58:48,09:57:00,09:57:00,151,16
1022,20/06/2020,100,23:45:01,23:45:06,23:42:00,23:42:00,151,13
1103,23/08/2020,110,11:02:51,11:02:55,11:02:00,11:02:00,106,3
1130,10/06/2020,110,16:24:19,16:24:24,16:22:00,16:22:00,106,5


In [106]:
#Kjøretøy_Kapasitet max
print(ruter['Kjøretøy_Kapasitet'].max())


151


Metadata thoughts. 


I think that with 26 rows there is not enough data to split by weekday versus weekend, by individual day, and so on. We want to predict the number of passengers on a given route, and we assume that the patterns are stable over time. We also do not have enough data to train a time-series model. Maybe a binary value for weekend/holiday, but that is still only 2/7 of the data, about 7 rows.

Time of day: 
Ideal time groups: morning (05–09), midday (09–12), afternoon (12–17), evening (17–21), night (21–05).
What we have data for is: morning-midday (05-12), afternoon-evening (12-21), night (21-05).

How delayed the bus is: actual arrival minus planned arrival, so a late bus gets a positive value. This can remain continuous, because it is a measurable quantity.
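The two features described above can be sketched as follows. This is only an illustration on three made-up rows: the column names `planned` and `actual`, the helper `time_group`, and the group labels are hypothetical, and the merged groups follow the "what we have data for" split (05-12, 12-21, 21-05).

```python
import pandas as pd

# Hypothetical sample rows; "planned" and "actual" are illustrative column names.
df = pd.DataFrame({
    "planned": ["22:42:00", "17:02:00", "09:57:00"],
    "actual": ["22:43:24", "17:03:13", "09:58:42"],
})

planned = pd.to_datetime(df["planned"], format="%H:%M:%S")
actual = pd.to_datetime(df["actual"], format="%H:%M:%S")

# Delay in minutes: actual minus planned, positive when the bus is late.
df["delay_minutes"] = (actual - planned).dt.total_seconds() / 60.0


def time_group(hour: int) -> str:
    """Map an hour (0-23) to one of the three merged time-of-day groups."""
    if 5 <= hour < 12:
        return "morning_midday"
    if 12 <= hour < 21:
        return "afternoon_evening"
    return "night"


# Group each trip by its planned arrival hour.
df["time_group"] = planned.dt.hour.map(time_group)
print(df)
```

A trip that crosses midnight between planned and actual arrival would need the date as well; this sketch ignores that case.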

In [100]:
# Test the weekend split: count the rows that fall on a Saturday or Sunday.

counter_weekend = 0
for dato in ruter['Dato']:
    dato_dt = pd.to_datetime(dato, format='%d/%m/%Y')
    if dato_dt.dayofweek in [5, 6]: 
        counter_weekend += 1
print(f"Number of weekend days: {counter_weekend}")

Number of weekend days: 7


In [101]:
# Create 1/0 column for weekend
ruter['Weekend'] = ruter['Dato'].apply(lambda x: 1 if pd.to_datetime(x, format='%d/%m/%Y').dayofweek in [5, 6] else 0)
ruter.drop(columns=['Dato'], inplace=True) # drop Dato column as it is no longer needed
ruter.head(1)

Unnamed: 0,Linjenavn,Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra,Tidspunkt_Faktisk_Avgang_Holdeplass_Fra,Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra,Tidspunkt_Planlagt_Avgang_Holdeplass_Fra,Kjøretøy_Kapasitet,Passasjerer_Ombord,Weekend
156,100,22:43:24,22:43:29,22:42:00,22:42:00,151,6,0


In [102]:
# Time of day: convert the time columns to datetime.
# Planned arrival versus actual arrival looked like the cleanest, most stable pair of columns.
ruter['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra'] = pd.to_datetime(ruter['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra'], format='%H:%M:%S')
ruter['Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra'] = pd.to_datetime(ruter['Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra'], format='%H:%M:%S')
# Delay in minutes
ruter['Delay_Minutes'] = (ruter['Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra'] - ruter['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra']).dt.total_seconds() / 60.0
ruter.head(1)



Unnamed: 0,Linjenavn,Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra,Tidspunkt_Faktisk_Avgang_Holdeplass_Fra,Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra,Tidspunkt_Planlagt_Avgang_Holdeplass_Fra,Kjøretøy_Kapasitet,Passasjerer_Ombord,Weekend,Delay_Minutes
156,100,1900-01-01 22:43:24,22:43:29,1900-01-01 22:42:00,22:42:00,151,6,0,1.4


In [103]:
#Check split time of day morning, afternoon, evening, night
morning=0
afternoon=0
evening=0
night=0

for time in ruter['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra']:
    hour = time.hour
    if 6 <= hour < 12:
        morning += 1
    elif 12 <= hour < 18:
        afternoon += 1
    elif 18 <= hour < 23:
        evening += 1
    else:
        night += 1
print(f"Morning: {morning}, Afternoon: {afternoon}, Evening: {evening}, Night: {night}")

Morning: 4, Afternoon: 8, Evening: 10, Night: 4


In [104]:
# One-hot encode the time of day
ruter['Morning'] = ruter['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra'].apply(lambda x: 1 if 6 <= x.hour < 12 else 0)
ruter['Afternoon'] = ruter['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra'].apply(lambda x: 1 if 12 <= x.hour < 18 else 0)
ruter['Evening'] = ruter['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra'].apply(lambda x: 1 if 18 <= x.hour < 23 else 0)
# Night covers 23:00-05:59
ruter['Night'] = ruter['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra'].apply(lambda x: 1 if (x.hour < 6) or (x.hour >= 23) else 0)
# Drop the time columns, as they are no longer needed
ruter = ruter.drop(columns=['Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra', 'Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra', 'Tidspunkt_Planlagt_Avgang_Holdeplass_Fra', 'Tidspunkt_Faktisk_Avgang_Holdeplass_Fra'])
# Drop Afternoon to avoid multicollinearity; afternoon becomes the baseline category
ruter = ruter.drop(columns=['Afternoon'])
ruter.head(1)

Unnamed: 0,Linjenavn,Kjøretøy_Kapasitet,Passasjerer_Ombord,Weekend,Delay_Minutes,Morning,Evening,Night
156,100,151,6,0,1.4,0,1,0


In [105]:
# Regression setup: X is every column except the target Passasjerer_Ombord
X = ruter.drop(columns=['Passasjerer_Ombord'])
y = ruter['Passasjerer_Ombord']

#model = sm.OLS(y, sm.add_constant(X)).fit()
#X = sm.add_constant(X)

#print(model.summary())
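A minimal sketch of how the fit, the percentage score, and a date-based prediction could look. It uses NumPy's least squares instead of the commented-out statsmodels call, and all numbers are made up rather than taken from Ruter-data.csv. The feature order (Weekend, Morning, Evening, Night) and the names `X_design`, `r2` and `new_row` are assumptions for illustration, and "prediction percentage score" is interpreted here as R² times 100.

```python
from datetime import datetime

import numpy as np

# Made-up 0/1 features, one row per trip: Weekend, Morning, Evening, Night.
# Afternoon is the baseline category, so it has no column.
X = np.array([
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)
# Made-up passenger counts, one per row of X.
y = np.array([6, 12, 4, 16, 5, 6, 14, 8], dtype=float)

# Prepend an intercept column (the role sm.add_constant plays in statsmodels).
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# R^2 on the training rows, reported as a percentage.
y_hat = X_design @ coef
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 on the training rows: {100 * r2:.1f}%")

# Date in, passenger count out: derive the Weekend flag from the date
# and predict for an afternoon trip on that day.
date = datetime.strptime("15/08/2020", "%d/%m/%Y")
weekend = 1.0 if date.weekday() >= 5 else 0.0
new_row = np.array([1.0, weekend, 0.0, 0.0, 0.0])
prediction = float(new_row @ coef)
print(f"Predicted passengers: {prediction:.1f}")
```

With only 26 rows on the chosen route, an R² computed on the training rows will likely overstate how well the model predicts unseen dates. In the rows shown earlier, Linjenavn and Kjøretøy_Kapasitet also move together (100 with 151, 110 with 106), so keeping both in X may make the coefficients unstable.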

Note: the dataset sometimes has negative passenger counts.
Fit the data to a sigmoid function, and decide whether the number of passengers is high or low.

Test how full the bus is with the algorithm: high / low.

Some cost functions give better prediction scores than other functions. Test several.
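The high/low idea in these notes could be sketched as a small logistic regression. This is an illustration only: the data are made up, the cutoff of 10 passengers for "high" is an assumption, and log loss is just one of the cost functions one might compare.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Made-up trips: Weekend and Night flags plus passenger counts.
weekend = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0], dtype=float)
night = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 0], dtype=float)
passengers = np.array([6, 0, 5, 16, 6, 18, 16, 13, 3, 5], dtype=float)

# Assumed cutoff: 10 or more passengers counts as "high" (label 1).
y = (passengers >= 10).astype(float)
X = np.column_stack([np.ones(len(y)), weekend, night])


def log_loss(w):
    """Mean cross-entropy between the labels and the predicted probabilities."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


w = np.zeros(X.shape[1])
initial_loss = log_loss(w)

# Plain gradient descent on the log loss.
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= 0.5 * X.T @ (p - y) / len(y)

final_loss = log_loss(w)
pred = (sigmoid(X @ w) >= 0.5).astype(int)
accuracy = float(np.mean(pred == y))
print(f"log loss {initial_loss:.3f} -> {final_loss:.3f}, accuracy {100 * accuracy:.0f}%")
```

Swapping `log_loss` for another cost function, and its gradient in the update step, is one way to compare how the choice affects the score.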

