## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [1]:
import pandas as pd
import os

## Exploratory Data Analysis
This step is where you load the data and analyze it. I'll load the data and do only a minimal analysis here; you are encouraged to do a thorough one!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [10]:
dataset_path = 'dataset/'

df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))

print("The shape of the dataset is {}.\n\n".format(df.shape))
df.info()
df.head()


The shape of the dataset is (6407, 16).


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6407 entries, 0 to 6406
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            6407 non-null   int64  
 1   Lat           6407 non-null   float64
 2   Lng           6407 non-null   float64
 3   Bump          6407 non-null   bool   
 4   Distance(mi)  6407 non-null   float64
 5   Crossing      6407 non-null   bool   
 6   Give_Way      6407 non-null   bool   
 7   Junction      6407 non-null   bool   
 8   No_Exit       6407 non-null   bool   
 9   Railway       6407 non-null   bool   
 10  Roundabout    6407 non-null   bool   
 11  Stop          6407 non-null   bool   
 12  Amenity       6407 non-null   bool   
 13  Side          6407 non-null   object 
 14  Severity      6407 non-null   int64  
 15  timestamp     6407 non-null   object 
dtypes: bool(9), float64(3), int64(2), object(2)
memory usage: 406.8+ KB


| | ID | Lat | Lng | Bump | Distance(mi) | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Stop | Amenity | Side | Severity | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37.76215 | -122.40566 | False | 0.044 | False | False | False | False | False | False | False | True | R | 2 | 2016-03-25 15:13:02 |
| 1 | 1 | 37.719157 | -122.448254 | False | 0.0 | False | False | False | False | False | False | False | False | R | 2 | 2020-05-05 19:23:00 |
| 2 | 2 | 37.808498 | -122.366852 | False | 0.0 | False | False | False | False | False | False | True | False | R | 3 | 2016-09-16 19:57:16 |
| 3 | 3 | 37.78593 | -122.39108 | False | 0.009 | False | False | True | False | False | False | False | False | R | 1 | 2020-03-29 19:48:43 |
| 4 | 4 | 37.719141 | -122.448457 | False | 0.0 | False | False | False | False | False | False | False | False | R | 2 | 2019-10-09 08:47:00 |


We've got 6407 examples in the dataset with 14 features, 1 ID, and the `Severity` of the crash.

Judging by the column types and the sample rows, the features are a mix of numerical and categorical types. What about some descriptive statistics?

In [11]:
# Drop rows with missing values in any column that contains nulls
df.dropna(subset=df.columns[df.isnull().mean()!=0], how='any', axis=0, inplace=True)
print(df.shape)

# Encode `Side` as 1 for 'R' and 0 for 'L' (assumes no other values occur).
# Alternative: pd.get_dummies(df['Side'], drop_first=True)
df['Side'] = df['Side'].map({'R': 1, 'L': 0})

# Encode the boolean road features as 0/1
for c in ['Crossing','Give_Way','Junction','Railway','Stop','Amenity']:
    df[c] = df[c].astype(int)

# Descriptive statistics; the result is not displayed because this is not
# the last expression in the cell
df.drop(columns=['ID','Bump','Roundabout','No_Exit']).describe()

# Split the timestamp into separate date and time columns (parsed once)
timestamps = pd.to_datetime(df['timestamp'])
df['Date'] = timestamps.dt.date
df['Time'] = timestamps.dt.time

print(df)

(6407, 16)
        ID        Lat         Lng   Bump  Distance(mi)  Crossing  Give_Way  \
0        0  37.762150 -122.405660  False         0.044         0         0   
1        1  37.719157 -122.448254  False         0.000         0         0   
2        2  37.808498 -122.366852  False         0.000         0         0   
3        3  37.785930 -122.391080  False         0.009         0         0   
4        4  37.719141 -122.448457  False         0.000         0         0   
...    ...        ...         ...    ...           ...       ...       ...   
6402  6402  37.740630 -122.407930  False         0.368         0         0   
6403  6403  37.752755 -122.402790  False         0.639         0         0   
6404  6404  37.726304 -122.446015  False         0.000         0         0   
6405  6405  37.808090 -122.367211  False         0.000         0         0   
6406  6406  37.773745 -122.408515  False         0.000         1         0   

      Junction  No_Exit  Railway  Roundabout  Stop  

The `describe()` call summarizes the numerical columns, including `Lat`, `Lng`, `Distance(mi)`, and `Severity`; because it is not the last expression in the cell, its result does not appear in the output above. I'll use the numerical features to demonstrate how to train the model and make submissions. **However, you shouldn't use only the numerical features for the final submission if you want to make it to the top of the leaderboard.**
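As a minimal sketch of what such descriptive statistics look like, the snippet below runs `describe()` and `value_counts()` on a few hypothetical rows that only mimic some columns of `train.csv`. The name `toy_df` and its values are assumptions made for illustration, not data from the competition.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few columns of train.csv
toy_df = pd.DataFrame({
    "Lat": [37.76, 37.72, 37.81, 37.79],
    "Distance(mi)": [0.044, 0.0, 0.0, 0.009],
    "Side": ["R", "R", "L", "R"],
    "Severity": [2, 2, 3, 1],
})

# Numerical columns: count, mean, std, min, quartiles, max
numeric_summary = toy_df.describe()

# A categorical column: count, number of unique values, most frequent value
side_summary = toy_df["Side"].describe()

# Class balance of the target
severity_counts = toy_df["Severity"].value_counts()

print(numeric_summary)
print(side_summary)
print(severity_counts)
```

Looking at the class balance early is useful because it tells you what accuracy a trivial classifier would already reach.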

## Data Splitting

Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely the training, validation, and test sets. In our case, the test set is already predefined, so we'll split the "training" set into training and validation sets with a 0.8:0.2 ratio.

*Note: a good way to get reproducible results is to set the seed of any algorithm that depends on randomization. This is done with the `random_state` argument in the following command.*
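The effect of `random_state`, and of the optional `stratify` argument of `train_test_split`, can be sketched on hypothetical imbalanced data. The frame `toy_df` and its 80/20 class mix are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 rows of severity 2, 20 rows of severity 3
toy_df = pd.DataFrame({
    "Distance(mi)": [i / 100 for i in range(100)],
    "Severity": [2] * 80 + [3] * 20,
})

# The same seed gives the same split on every run
split_a, _ = train_test_split(toy_df, test_size=0.2, random_state=42)
split_b, _ = train_test_split(toy_df, test_size=0.2, random_state=42)
same_split = split_a.index.equals(split_b.index)

# `stratify` keeps the class proportions the same in both subsets
train_part, val_part = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["Severity"]
)
val_share_of_3 = (val_part["Severity"] == 3).mean()

print(same_split, val_share_of_3)
```

Stratifying matters most when some classes are rare, since a plain random split can leave the validation set with too few examples of them.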

In [13]:
from sklearn.model_selection import train_test_split
import numpy as np
from bs4 import BeautifulSoup

# Read and parse the holidays XML file (the context manager closes the file)
with open("dataset/holidays.xml", 'r') as file:
    soup = BeautifulSoup(file.read(), 'xml')

# Extract the <date> and <description> elements
date = soup.find_all('date')
description = soup.find_all('description')

# Collect one [date, description] row per holiday
data = [[d.get_text(), desc.get_text()] for d, desc in zip(date, description)]

# Convert the list into a DataFrame
holiday_df = pd.DataFrame(data, columns=['date', 'description'])

# Flag crashes that happened on a holiday (1.0) or not (0.0).
# Dates are compared as strings, exactly as they appear in the XML file.
df['holiday'] = df['Date'].astype(str).isin(holiday_df['date']).astype(float)

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42) # Try adding `stratify` here

# Columns that are not used as features
drop_cols = ['ID','Date','Time', 'Severity','timestamp','Bump','Roundabout','No_Exit']

X_train = train_df.drop(columns=drop_cols)
y_train = train_df['Severity']

X_val = val_df.drop(columns=drop_cols)
y_val = val_df['Severity']

print(X_train)

            Lat         Lng  Distance(mi)  Crossing  Give_Way  Junction  \
748   37.720890 -122.448044         0.000         1         0         0   
5720  37.727319 -122.402749         0.000         0         0         0   
1310  37.731370 -122.423590         0.161         0         0         0   
5343  37.731860 -122.418282         0.231         0         0         0   
1480  37.808498 -122.366852         0.000         0         0         0   
...         ...         ...           ...       ...       ...       ...   
3772  37.710819 -122.455711         0.000         0         0         0   
5191  37.761349 -122.392647         0.000         0         0         0   
5226  37.725182 -122.401639         0.000         0         0         1   
5390  37.769646 -122.417847         0.000         1         0         0   
860   37.778107 -122.401192         0.000         0         0         0   

      Railway  Stop  Amenity  Side  holiday  
748         1     0        0     1      0.0  
5720   

As pointed out earlier, I'll use the numerical features to train the classifier. **However, you shouldn't use only the numerical features for the final submission if you want to make it to the top of the leaderboard.**

In [14]:
# This cell is used to select the numerical features. IT SHOULD BE REMOVED AS YOU DO YOUR WORK.
# X_train = X_train[['Lat', 'Lng', 'Distance(mi)']]
# X_val = X_val[['Lat', 'Lng', 'Distance(mi)']]

## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [15]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

Now let's test our classifier on the validation dataset and see the accuracy.

In [16]:
print("The accuracy of the classifier on the validation set is ", (classifier.score(X_val, y_val)))

The accuracy of the classifier on the validation set is  0.7433697347893916


Well, that's a good start, right? For reference, a classifier that predicts a `Severity` of 2 for every example gets an accuracy of around 0.63. You should get a better score as you add more features and improve the data preprocessing.
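A majority-class baseline like the one mentioned above can be sketched with scikit-learn's `DummyClassifier`. The labels below are hypothetical: they are chosen so that 63% of them are 2, and the rest of the class mix is an assumption rather than the real distribution of `Severity`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 63% of the examples have severity 2
y_toy = np.array([2] * 63 + [3] * 30 + [1] * 7)
X_toy = np.zeros((len(y_toy), 1))  # this baseline ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

On the real data you would fit the baseline on `X_train`, `y_train` and score it on `X_val`, `y_val`, so that it is compared with the Random Forest on the same validation split.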

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [17]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
test_df.head()

| | ID | Lat | Lng | Bump | Distance(mi) | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Stop | Amenity | Side | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6407 | 37.78606 | -122.3909 | False | 0.039 | False | False | True | False | False | False | False | False | R | 2016-04-04 19:20:31 |
| 1 | 6408 | 37.769609 | -122.415057 | False | 0.202 | False | False | False | False | False | False | False | False | R | 2020-10-28 11:51:00 |
| 2 | 6409 | 37.807495 | -122.476021 | False | 0.0 | False | False | False | False | False | False | False | False | R | 2019-09-09 07:36:45 |
| 3 | 6410 | 37.761818 | -122.405869 | False | 0.0 | False | False | True | False | False | False | False | False | R | 2019-08-06 15:46:25 |
| 4 | 6411 | 37.73235 | -122.4141 | False | 0.67 | False | False | False | False | False | False | False | False | R | 2018-10-17 09:54:58 |


Note that the test set has the same features but doesn't have the `Severity` column.
At this stage, one must **NOT** forget to apply to the test set the same processing that was applied to the features of the training set.
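One way to keep the two sets in sync is to put the preprocessing in a single function and call it on both. The sketch below is only an illustration: `preprocess`, `BOOL_COLS`, and the toy frames are hypothetical names and data, and the column list is shortened.

```python
import pandas as pd

BOOL_COLS = ["Crossing", "Junction"]  # shortened list for this sketch

def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: encode columns the same way for any split."""
    out = frame.copy()
    out["Side"] = out["Side"].map({"R": 1, "L": 0})
    for col in BOOL_COLS:
        out[col] = out[col].astype(int)
    return out

# Hypothetical stand-ins for the training and test DataFrames
toy_train = pd.DataFrame({
    "Side": ["R", "L"],
    "Crossing": [True, False],
    "Junction": [False, False],
    "Severity": [2, 3],
})
toy_test = pd.DataFrame({
    "Side": ["L", "R"],
    "Crossing": [False, False],
    "Junction": [True, False],
})

train_ready = preprocess(toy_train)
test_ready = preprocess(toy_test)

# The test features must match the training features, in the same order
feature_cols = [c for c in train_ready.columns if c != "Severity"]
columns_match = list(test_ready.columns) == feature_cols
print(columns_match)
```

With a shared function, any change to the training preprocessing automatically reaches the test set as well.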

Now we'll add a `Severity` column to the test `DataFrame` and fill it with the predicted classes.

**I'll select the numerical features here as I did in the training set. DO NOT forget to change this step as you change the preprocessing of the training data.**

In [18]:
# Apply the same encoding that was used for the training set
test_df['Side'] = test_df['Side'].map({'R': 1, 'L': 0})
for c in ['Crossing','Give_Way','Junction','Railway','Stop','Amenity']:
    test_df[c] = test_df[c].astype(int)

# Split the timestamp into separate date and time columns (parsed once)
test_timestamps = pd.to_datetime(test_df['timestamp'])
test_df['Date'] = test_timestamps.dt.date
test_df['Time'] = test_timestamps.dt.time

# Holiday flag (1.0 or 0.0), computed the same way as for the training set
test_df['holiday'] = test_df['Date'].astype(str).isin(holiday_df['date']).astype(float)

X_test = test_df.drop(columns=['ID','timestamp','Date','Time','Bump','Roundabout','No_Exit'])

# X_test must contain the same feature columns, in the same order, as X_train;
# update the drop list above whenever the training features change

y_test_predicted = classifier.predict(X_test)

test_df['Severity'] = y_test_predicted

test_df.head()

| | ID | Lat | Lng | Bump | Distance(mi) | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Stop | Amenity | Side | timestamp | Date | Time | holiday | Severity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6407 | 37.78606 | -122.3909 | False | 0.039 | 0 | 0 | 1 | False | 0 | False | 0 | 0 | 1 | 2016-04-04 19:20:31 | 2016-04-04 | 19:20:31 | 0.0 | 2 |
| 1 | 6408 | 37.769609 | -122.415057 | False | 0.202 | 0 | 0 | 0 | False | 0 | False | 0 | 0 | 1 | 2020-10-28 11:51:00 | 2020-10-28 | 11:51:00 | 0.0 | 2 |
| 2 | 6409 | 37.807495 | -122.476021 | False | 0.0 | 0 | 0 | 0 | False | 0 | False | 0 | 0 | 1 | 2019-09-09 07:36:45 | 2019-09-09 | 07:36:45 | 0.0 | 2 |
| 3 | 6410 | 37.761818 | -122.405869 | False | 0.0 | 0 | 0 | 1 | False | 0 | False | 0 | 0 | 1 | 2019-08-06 15:46:25 | 2019-08-06 | 15:46:25 | 0.0 | 2 |
| 4 | 6411 | 37.73235 | -122.4141 | False | 0.67 | 0 | 0 | 0 | False | 0 | False | 0 | 0 | 1 | 2018-10-17 09:54:58 | 2018-10-17 | 09:54:58 | 0.0 | 2 |


Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.

In [19]:
test_df[['ID', 'Severity']].to_csv('dataset/submission.csv', index=False)
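Before uploading, it can help to read the file back and check its shape. The sketch below does this on hypothetical predictions written to a temporary directory; `toy_submission` stands in for `test_df[['ID', 'Severity']]`, and the specific checks are assumptions about what a valid submission needs, not rules stated by the competition.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions standing in for test_df[['ID', 'Severity']]
toy_submission = pd.DataFrame({"ID": [6407, 6408, 6409], "Severity": [2, 2, 3]})

# Write the file without the index, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)
    loaded = pd.read_csv(path)

# Basic sanity checks on the written file
columns_ok = list(loaded.columns) == ["ID", "Severity"]
ids_unique = loaded["ID"].is_unique
no_missing = not loaded["Severity"].isna().any()

print(columns_ok, ids_unique, no_missing)
```

Passing `index=False` is what keeps the row index out of the file; without it, the CSV would start with an extra unnamed column.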