# Models created to predict the number of vehicles involved

Below are our attempts to predict the number of vehicles involved in a collision from the data we've accumulated. We weren't sure whether this would be a good target or whether the features would correlate with it, but the models turned out decently accurate: on the test set, the OneR baseline reaches about 72.5% accuracy and K Nearest Neighbors about 73.4%. We used the following algorithms:
1. OneR (Baseline)
2. Naive Bayes
3. Logistic Regression
4. Random Forest
5. K Nearest Neighbors
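
The five models can be compared side by side with a loop like the one below. This is only a minimal sketch on synthetic data from `make_classification`, not the crash dataset; the `Xd_*`/`yd_*` names, the `demo_scores` dictionary, and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the crash data (assumption: NOT the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=3, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "OneR-style stump": DecisionTreeClassifier(max_depth=1),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=10),
}

# Fit each model on the same split and record its test accuracy
demo_scores = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    demo_scores[name] = accuracy_score(yd_test, model.predict(Xd_test))
    print(f"{name}: {demo_scores[name]:.3f}")
```

Keeping every model on the same split makes the accuracy numbers directly comparable.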

In [16]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

le = LabelEncoder()
df = pd.read_csv('Supervised_Learning_Dataset_2.csv')

#Turning Crash Date into separate numeric columns
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'])
df['year'] = df['CRASH DATE'].dt.year
df['month'] = df['CRASH DATE'].dt.month
df['day'] = df['CRASH DATE'].dt.day
df = df.drop(columns='CRASH DATE')

#Deal with encoding categorical weekdays
df['CRASH DAY'] = df['CRASH DAY'].astype('category')
def day_to_num(day):
    if day == 'Monday':
        return 0
    elif day == 'Tuesday':
        return 1
    elif day == 'Wednesday':
        return 2
    elif day == 'Thursday':
        return 3
    elif day == 'Friday':
        return 4
    elif day == 'Saturday':
        return 5
    elif day == 'Sunday':
        return 6
df['CRASH DAY'] = df['CRASH DAY'].apply(day_to_num)

# df['percent_unknown'] = 100 - df['percent_licensed'] - df['percent_unlicensed'] - df['percent_permit']

#Turning time into numeric data (number of seconds since midnight)
df['CRASH TIME'] = pd.to_datetime(df['CRASH TIME'], format='%H:%M').dt.time
# Function to convert time to seconds
def time_to_seconds(t):
    return t.hour * 3600 + t.minute * 60 + t.second
df['CRASH TIME'] = df['CRASH TIME'].apply(time_to_seconds)

#Encoding holiday names
df['holiday_name'] = df['holiday_name'].fillna('N/A')
df['holiday_name'] = df['holiday_name'].astype('category')
df['holiday_name'] = le.fit_transform(df['holiday_name'])

df['is_holiday'] = df['is_holiday'].astype('category')

#filling missing data
df['snow_depth'] = df['snow_depth'].fillna(df['snow_depth'].mean())

#making target categorical (only 5 possible options)
df['num_vehicles_involved'] = df['num_vehicles_involved'].astype('category')

#Filling missing values for %'s
df['percent_licensed'] = df['percent_licensed'].fillna(0)
df['percent_unlicensed'] = df['percent_unlicensed'].fillna(0)
df['percent_permit'] = df['percent_permit'].fillna(0)
df['percent_unknown'] = 100 - df['percent_licensed'] - df['percent_unlicensed'] - df['percent_permit']
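
To make the weekday and time encodings concrete, here is a small sketch on a made-up two-row frame. It reuses the notebook's column names, but the values, the `toy` frame, and the dictionary-based mapping are assumptions for illustration rather than the code we actually ran.

```python
import pandas as pd

# Toy rows with the notebook's column names (values are made up)
toy = pd.DataFrame({
    "CRASH DATE": ["2021-01-15", "2022-07-04"],
    "CRASH DAY": ["Friday", "Monday"],
    "CRASH TIME": ["08:30", "23:05"],
})

# Split the date into numeric parts
toy["CRASH DATE"] = pd.to_datetime(toy["CRASH DATE"])
toy["month"] = toy["CRASH DATE"].dt.month

# Weekday name -> 0..6 (Monday = 0), same numbering as day_to_num
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
day_codes = {day: i for i, day in enumerate(weekday_order)}
toy["CRASH DAY"] = toy["CRASH DAY"].map(day_codes)

# "HH:MM" -> seconds since midnight, computed column-wise
parsed = pd.to_datetime(toy["CRASH TIME"], format="%H:%M")
toy["CRASH TIME"] = parsed.dt.hour * 3600 + parsed.dt.minute * 60

print(toy[["month", "CRASH DAY", "CRASH TIME"]])
```

For example, 08:30 becomes 8 * 3600 + 30 * 60 = 30600 seconds.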

In [17]:
from sklearn.model_selection import train_test_split

#creating features
X = df.drop('num_vehicles_involved', axis = 1)

#creating target numpy array
y = np.array(df['num_vehicles_involved'])

#Split features and target into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
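
If the target classes are imbalanced, a stratified split keeps the class proportions the same in both sets. The sketch below uses a made-up target (80/15/5 split across three classes) and hypothetical `*_toy`/`*s_*` names; the notebook's own split above does not pass `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80 rows of class 1, 15 of class 2, 5 of class 3 (made up)
y_toy = np.array([1] * 80 + [2] * 15 + [3] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in train and test
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

classes, test_counts = np.unique(ys_test, return_counts=True)
print(dict(zip(classes.tolist(), test_counts.tolist())))
```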

In [18]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error

decision_tree = DecisionTreeClassifier(max_depth=1)
decision_tree.fit(X_train, y_train)

# train_pred = decision_tree.predict(X_train)
test_pred = decision_tree.predict(X_test)



In [19]:
# print("MSE for Train Sample: ", mean_squared_error(y_train, train_pred, squared=False))
print("RMSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=False))
print("MSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=True))
print("Python accuracy score: ", accuracy_score(y_test, test_pred))


RMSE for Test Sample:  0.6520821066920548
MSE for Test Sample:  0.4252110738679484
Python accuracy score:  0.7249715406020744
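
The baseline above uses a depth-1 decision tree as a stand-in for OneR: it makes a single split on one feature, which approximates a one-attribute rule rather than reproducing OneR exactly. The sketch below, on synthetic data with hypothetical names (`X_s`, `stump`, `majority`), shows that such a stump relies on exactly one feature and compares it with a majority-class predictor.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: NOT the crash dataset)
X_s, y_s = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0)

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_s, y_s)
majority = DummyClassifier(strategy="most_frequent").fit(X_s, y_s)

# A depth-1 tree has one split node, so only one feature gets nonzero importance
used = [i for i, imp in enumerate(stump.feature_importances_) if imp > 0]
stump_acc = stump.score(X_s, y_s)
majority_acc = majority.score(X_s, y_s)

print("feature used by the stump:", used)
print("stump training accuracy:   ", stump_acc)
print("majority training accuracy:", majority_acc)
```

Comparing against the majority-class score helps show how much of a baseline's accuracy comes from class imbalance alone.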




In [25]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error

naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)

# Make predictions on the test set

# train_pred = naive_bayes.predict(X_train)
test_pred = naive_bayes.predict(X_test)

# print("MSE for Train Sample: ", mean_squared_error(y_train, train_pred, squared=False))
print("RMSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=False))

RMSE for Test Sample:  0.8775553557919464




In [26]:
print("RMSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=False))
print("MSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=True))
print("Python accuracy score: ", accuracy_score(y_test, test_pred))

RMSE for Test Sample:  0.8775553557919464
MSE for Test Sample:  0.7701034024791298
Python accuracy score:  0.5849196812547433
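
The three metrics printed above measure different things: accuracy counts exact matches, while MSE and RMSE treat the vehicle count as a number, so being off by two vehicles costs more than being off by one. Here is a small worked example on made-up labels; `np.sqrt` is used so the sketch does not rely on the `squared` argument, which we understand newer scikit-learn releases deprecate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up true and predicted vehicle counts
y_true = np.array([1, 2, 2, 3, 2])
y_pred = np.array([2, 2, 2, 1, 2])

acc = accuracy_score(y_true, y_pred)      # 3 of 5 predictions match
mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 0 + 4 + 0) / 5
rmse = np.sqrt(mse)                       # square root of the MSE

print("accuracy:", acc)
print("MSE:", mse)
print("RMSE:", rmse)
```

Here the accuracy is 0.6 while the MSE and RMSE are both 1.0, because the single off-by-two prediction contributes 4 of the 5 squared-error units.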




In [27]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# train_pred = log_reg.predict(X_train)
test_pred = log_reg.predict(X_test)

print("RMSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=False))

RMSE for Test Sample:  0.6068708907310485


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
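
The warning above suggests two remedies: scale the features or raise `max_iter`. A minimal sketch of both, on synthetic data with one deliberately large column (mimicking something like seconds since midnight), might look like this. The names `X_lr`, `scaled_log_reg`, and the value `max_iter=1000` are assumptions, and we have not rerun the notebook's model this way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: NOT the crash dataset)
X_lr, y_lr = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=1)
# Blow up one column to mimic a feature on a much larger scale
X_lr[:, 0] *= 10_000

# Standardize every feature, then fit with a higher iteration cap
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_lr, y_lr)

# make_pipeline names each step after its lowercased class name
n_iter = scaled_log_reg.named_steps["logisticregression"].n_iter_
print("iterations used:", n_iter)
print("training accuracy:", scaled_log_reg.score(X_lr, y_lr))
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged at prediction time.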


In [29]:
from sklearn.ensemble import RandomForestClassifier

rand_for = RandomForestClassifier(n_estimators=100, random_state=42)

rand_for.fit(X_train, y_train)
# train_pred = rand_for.predict(X_train)
test_pred = rand_for.predict(X_test)

# print("MSE for Train Sample: ", mean_squared_error(y_train, train_pred, squared=False))
print("RMSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=False))

RMSE for Test Sample:  0.5213132328254747
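
A fitted random forest also exposes `feature_importances_`, which can hint at which columns drive the predictions. The sketch below uses synthetic data and hypothetical feature names; we did not compute importances for the crash data in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: NOT the crash dataset)
X_rf, y_rf = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42)
feature_names = [f"feature_{i}" for i in range(X_rf.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_rf, y_rf)

# Importances are normalized to sum to 1; sort to see the strongest first
importances = pd.Series(
    forest.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances)
```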




# K Nearest Neighbors Model

The code below fits a K Nearest Neighbors classifier with `n_neighbors=10`.

In [31]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=10)

knn.fit(X_train, y_train)
# train_pred = knn.predict(X_train)
test_pred = knn.predict(X_test)

# print("MSE for Train Sample: ", mean_squared_error(y_train, train_pred, squared=False))
print("RMSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=False))

RMSE for Test Sample:  0.5960338016330088





In [32]:
print("RMSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=False))
print("MSE for Test Sample: ", mean_squared_error(y_test, test_pred, squared=True))
print("Python accuracy score: ", accuracy_score(y_test, test_pred))

RMSE for Test Sample:  0.5960338016330088
MSE for Test Sample:  0.35525629268909686
Python accuracy score:  0.7338927713129269
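
The value `n_neighbors=10` was fixed by hand. One way to choose it more systematically is a cross-validated grid search, ideally with scaling, since KNN relies on distances between rows. This is a sketch on synthetic data; the candidate values for k, the step names, and `cv=5` are assumptions, not settings we tested on the crash data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: NOT the crash dataset)
X_k, y_k = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=7)

# Scale first so no single feature dominates the distance computation
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Try several neighborhood sizes with 5-fold cross-validation
k_values = [1, 5, 10, 20]
search = GridSearchCV(pipe, {"knn__n_neighbors": k_values}, cv=5, scoring="accuracy")
search.fit(X_k, y_k)

print("best k:", search.best_params_["knn__n_neighbors"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```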


