## Assignment 2: Machine Learning

I chose a regression algorithm to predict the number of passengers on a specific bus on a specific date because the output variable (the number of passengers) is numerical, and the goal is a quantitative prediction: a specific number rather than a category. Passenger counts are whole numbers, but they can be treated as a continuous quantity for this purpose.

Regression algorithms are well suited to this task because they are designed to model and predict numerical outcomes. By using regression, we aim to find a mathematical relationship between the input variable (the date) and the target variable (the number of passengers), which lets us make data-driven predictions for dates not in the training data.
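As a rough illustration of the idea, the sketch below fits a straight line to made-up data. The dates, the passenger values, and the names `X_demo`, `y_demo`, `demo_model`, and `demo_prediction` are all assumptions invented for this example; none of them come from the Ruter dataset. It only shows how a date can be turned into a number and passed to `LinearRegression`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, made-up data: NOT taken from the Ruter dataset.
dates = pd.date_range("2020-06-01", periods=30, freq="D")
passengers = 5 + 0.2 * np.arange(30)  # hypothetical linear trend, no noise

# Dates are not numeric, so map each one to its ordinal day number.
X_demo = np.array([d.toordinal() for d in dates]).reshape(-1, 1)
y_demo = passengers

# Fit a line: passengers = intercept + slope * ordinal_day
demo_model = LinearRegression().fit(X_demo, y_demo)

# Predict for the day right after the synthetic range.
new_day = pd.Timestamp("2020-07-01").toordinal()
demo_prediction = demo_model.predict([[new_day]])[0]

print(f"slope per day: {demo_model.coef_[0]:.3f}, prediction: {demo_prediction:.2f}")
```

Real passenger data is far noisier than this toy trend, so a fitted line should be read as a rough tendency rather than an exact forecast.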

In [25]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import datetime

In [26]:
# Load dataset
# A forward slash avoids the invalid "\R" escape sequence and works on every platform
url = "data/Ruter-data.csv"
df = pd.read_csv(url, sep=";")

# Look at the dataset
df.head()

Unnamed: 0,TurId,Dato,Fylke,Område,Kommune,Holdeplass_Fra,Holdeplass_Til,Linjetype,Linjefylke,Linjenavn,Linjeretning,Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra,Tidspunkt_Faktisk_Avgang_Holdeplass_Fra,Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra,Tidspunkt_Planlagt_Avgang_Holdeplass_Fra,Kjøretøy_Kapasitet,Passasjerer_Ombord
0,15006-2020-08-10T10:24:00+02:00,10/08/2020,Viken,Vest,Bærum,Nordliveien,Tjernsmyr,Lokal,Viken,150,0,10:53:53,10:53:59,10:53:00,10:53:00,112,5
1,15002-2020-08-15T12:54:00+02:00,15/08/2020,Viken,Vest,Bærum,Nadderud stadion,Bekkestua bussterminal (Plattform C),Lokal,Viken,150,0,13:12:20,13:12:26,13:12:00,13:12:00,112,5
2,15004-2020-08-03T09:54:00+02:00,03/08/2020,Viken,Vest,Bærum,Ringstabekkveien,Skallum,Lokal,Viken,150,0,10:18:56,10:19:21,10:19:00,10:19:00,112,6
3,15003-2020-07-27T13:00:00+02:00,27/07/2020,Viken,Vest,Bærum,Gruvemyra,Gullhaug,Lokal,Viken,150,1,13:52:04,13:52:26,13:51:00,13:51:00,112,10
4,15002-2020-08-27T07:15:00+02:00,27/08/2020,Viken,Vest,Bærum,Lysaker stasjon (Plattform A),Tjernsmyr,Lokal,Viken,150,1,07:34:13,07:34:53,07:33:00,07:33:00,112,10


In [27]:
# Generate statistics for each numerical column in the DataFrame.
df.describe()

Unnamed: 0,Linjeretning,Kjøretøy_Kapasitet,Passasjerer_Ombord
count,6000.0,6000.0,6000.0
mean,0.492,104.712167,4.512833
std,0.499978,24.225196,6.73573
min,0.0,33.0,-39.0
25%,0.0,80.0,0.0
50%,0.0,106.0,3.0
75%,1.0,112.0,7.0
max,1.0,151.0,64.0


In [28]:
# df.describe() shows a minimum of -39 passengers on board.
# A negative passenger count is impossible, so drop every row where Passasjerer_Ombord is below 0.
df = df[df["Passasjerer_Ombord"] >= 0]
df.describe()

Unnamed: 0,Linjeretning,Kjøretøy_Kapasitet,Passasjerer_Ombord
count,5333.0,5333.0,5333.0
mean,0.47553,105.052503,5.601163
std,0.499448,24.255181,6.160011
min,0.0,33.0,0.0
25%,0.0,80.0,1.0
50%,0.0,106.0,4.0
75%,1.0,112.0,8.0
max,1.0,151.0,64.0
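The row count above falls from 6000 to 5333, so the filter removed 667 rows. Before dropping rows it can help to count how many fail the check. The sketch below does that on a tiny made-up frame; `toy`, `is_valid`, `n_removed`, and `toy_clean` are hypothetical names used only here.

```python
import pandas as pd

# Tiny made-up frame standing in for the real data (values are invented).
toy = pd.DataFrame({"Passasjerer_Ombord": [5, -39, 0, 12, -1]})

# Boolean mask: True where the passenger count is physically possible.
is_valid = toy["Passasjerer_Ombord"] >= 0

# Count the rows that would be dropped before actually dropping them.
n_removed = int((~is_valid).sum())

# Keep only the valid rows; .copy() makes the result independent of `toy`.
toy_clean = toy[is_valid].copy()

print(f"removed {n_removed} of {len(toy)} rows")
```

The same mask applied to the notebook's `df` would report the number of negative rows before they are discarded.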


In [29]:
# To give the model as much data as possible, use the line (Linjenavn) that occurs most often in the dataset
df["Linjenavn"].describe()

count     5333
unique     148
top        100
freq       423
Name: Linjenavn, dtype: object

In [30]:
# Count the occurrences of each unique Linjenavn
lineCounts = df["Linjenavn"].value_counts()

# Find the line with the most datapoints
mostCommonLine = lineCounts.idxmax()
mostCommonLineCount = lineCounts.max()

# Print the result
print(f"Most common line : {mostCommonLine}\nAmount of data : {mostCommonLineCount}")

Most common line : 100
Amount of data : 423


In [31]:
# Keep only the most common line (line 100, stored in mostCommonLine).
# .copy() makes df an independent DataFrame, so later column edits do not trigger SettingWithCopyWarning.
df = df[df["Linjenavn"] == mostCommonLine].copy()
df.describe()

Unnamed: 0,Linjeretning,Kjøretøy_Kapasitet,Passasjerer_Ombord
count,423.0,423.0,423.0
mean,0.472813,151.0,8.513002
std,0.499852,0.0,7.599838
min,0.0,151.0,0.0
25%,0.0,151.0,3.0
50%,0.0,151.0,7.0
75%,1.0,151.0,12.0
max,1.0,151.0,40.0


In [32]:
# Drop the columns that are not used as feature or target
df.drop(["TurId", "Fylke", "Område", "Kommune", "Holdeplass_Fra", "Holdeplass_Til", "Linjetype", "Linjefylke", "Linjenavn", "Linjeretning",
         "Tidspunkt_Faktisk_Ankomst_Holdeplass_Fra", "Tidspunkt_Faktisk_Avgang_Holdeplass_Fra", "Tidspunkt_Planlagt_Ankomst_Holdeplass_Fra",
         "Tidspunkt_Planlagt_Avgang_Holdeplass_Fra"],
        axis='columns', inplace=True)

In [33]:
# Convert dates
df['Dato'] = pd.to_datetime(df['Dato'], format='%d/%m/%Y')

# Feature X: the date as an ordinal day number (regression needs numeric input)
X = df['Dato'].apply(lambda x: x.toordinal())
Y = df['Passasjerer_Ombord'].values

# Split ratios: 80% of the rows for training, 20% for testing
trainRatio = 0.8
testRatio = 0.2

# Train/test split; random_state is fixed so the split is reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X.values.reshape(-1,1), Y, train_size=trainRatio, test_size=testRatio, random_state=50)
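To see what an 80/20 split means for the 423 rows of line 100, the sketch below runs `train_test_split` on placeholder arrays of the same length. `X_toy` and `y_toy` are stand-ins invented for this check and hold no real data; only the row count and `random_state=50` are taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 423  # row count reported for line 100 above

# Placeholder feature and target with the right length (no real data).
X_toy = np.arange(n_rows).reshape(-1, 1)
y_toy = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=50
)

# The test share is rounded up to a whole row; the rest goes to training.
print("train rows:", X_tr.shape[0], "test rows:", X_te.shape[0])
```

Note that this split is random, so test dates are interleaved with training dates rather than held out as a later period.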