## You're here! 
Welcome to your first competition in the [ITI's AI Pro training program](https://ai.iti.gov.eg/epita/ai-engineer/)! We hope you enjoy it and learn as much as we did preparing this competition.


## Introduction

In this competition, the task is to predict the `Severity` of a car crash given information about the crash, e.g., its location.

This is the getting started notebook. Things are kept simple so that it's easier to understand the steps and modify it.

Feel free to `Fork` this notebook and share it with your modifications **OR** use it to create your submissions.

### Prerequisites
You should know how to use Python and have some familiarity with machine learning. You can apply the techniques you learned in the training program and submit new solutions!

### Checklist
You can participate in this competition any way you prefer. However, I recommend following these steps if this is your first time joining a competition on Kaggle.

* Fork this notebook and run the cells in order.
* Submit this solution.
* Make changes to the data processing step as you see fit.
* Submit the new solutions.

*You can make up to 5 submissions per day. You can select only one of your submissions to be considered in the final ranking.*


Don't hesitate to leave a comment or contact me if you have any questions!

## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [None]:
import pandas as pd
import os
import numpy as np

## Exploratory Data Analysis
In this step, one should load the data and analyze it. However, I'll load the data and do minimal analysis. You are encouraged to do thorough analysis!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [None]:
dataset_path = '/kaggle/input/car-crashes-severity-prediction/'

df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))
weather_df = pd.read_csv(os.path.join(dataset_path, 'weather-sfcsv.csv'))

print("The shape of the dataset is {}.\n\n".format(df.shape))

# Assemble a datetime from the Year/Month/Day/Hour columns
weather_df_dates = pd.to_datetime(weather_df[['Year','Month','Day','Hour']])

weather_df['dates_weather']=weather_df_dates
weather_df_dates = weather_df.drop(columns=['Year','Month','Day','Hour'])
# weather_df_dates['Hour']=pd.to_datetime(weather_df_dates['Hour'],unit='h',origin=pd.Timestamp(weather_df['date']))
weather_df_dates_sorted = weather_df_dates.sort_values(by='dates_weather')

# Forward-fill, then back-fill, so gaps at the start of a series are covered too
for col in ['Precipitation(in)', 'Wind_Chill(F)', 'Wind_Speed(mph)']:
    weather_df_dates_sorted[col] = weather_df_dates_sorted[col].ffill().bfill()

# Back-fill only (take the next observed value)
weather_df_dates_sorted['Visibility(mi)'] = weather_df_dates_sorted['Visibility(mi)'].bfill()

# Forward-fill only (carry the previous observed value forward)
for col in ['Temperature(F)', 'Humidity(%)']:
    weather_df_dates_sorted[col] = weather_df_dates_sorted[col].ffill()
# weather_df_dates_sorted_cleaned = weather_df_dates_sorted.dropna()
# weather_df_dates_sorted_cleaned = weather_df_dates_sorted_cleaned.loc[(weather_df_dates_sorted_cleaned[['Wind_Speed(mph)', 'Visibility(mi)']] != 0).all(axis=1)]
weather_df_dates_sorted.info()


The shape of the dataset is (6407, 16).


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6901 entries, 5724 to 2360
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Weather_Condition  6900 non-null   object        
 1   Wind_Chill(F)      6901 non-null   float64       
 2   Precipitation(in)  6901 non-null   float64       
 3   Temperature(F)     6901 non-null   float64       
 4   Humidity(%)        6901 non-null   float64       
 5   Wind_Speed(mph)    6901 non-null   float64       
 6   Visibility(mi)     6901 non-null   float64       
 7   Selected           6901 non-null   object        
 8   dates_weather      6901 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(6), object(2)
memory usage: 539.1+ KB


In [None]:
# # getting the correlation of the data with respect to severity 

# data.corr(method='pearson')['Severity'].sort_values(ascending=False)

In [None]:
# Encode True/False values as 1/0 (multiplying by 1 casts booleans to integers)
data = df*1
# Convert the timestamp to a datetime and truncate it to the hour
dates = pd.to_datetime(df['timestamp'])
dates_z = dates.apply(lambda x: x.replace(minute=0,second=0))
df['dates_weather']=dates_z
df.info()
#Encoding "Side" Column

from sklearn.preprocessing import LabelBinarizer

# Binarize a two-valued column of `data` into a single 0/1 column
def Binarizer(column,data):
    sides_encoder = LabelBinarizer()
    sides_encoder.fit(data[column])
    # Transform the frame that was passed in, not the global `df`
    transformed = sides_encoder.transform(data[column])
    ohe_df = pd.DataFrame(transformed, index=data.index)
    data = pd.concat([data, ohe_df], axis=1).drop([column], axis=1)
    data=data.rename(columns={0:column})
    return data

data=Binarizer('Side',data)
data.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6407 entries, 0 to 6406
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   ID             6407 non-null   int64         
 1   Lat            6407 non-null   float64       
 2   Lng            6407 non-null   float64       
 3   Bump           6407 non-null   bool          
 4   Distance(mi)   6407 non-null   float64       
 5   Crossing       6407 non-null   bool          
 6   Give_Way       6407 non-null   bool          
 7   Junction       6407 non-null   bool          
 8   No_Exit        6407 non-null   bool          
 9   Railway        6407 non-null   bool          
 10  Roundabout     6407 non-null   bool          
 11  Stop           6407 non-null   bool          
 12  Amenity        6407 non-null   bool          
 13  Side           6407 non-null   object        
 14  Severity       6407 non-null   int64         
 15  timestamp      6407 n

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Severity,timestamp,Side
0,0,37.76215,-122.40566,0,0.044,0,0,0,0,0,0,0,1,2,2016-03-25 15:13:02,1
1,1,37.719157,-122.448254,0,0.0,0,0,0,0,0,0,0,0,2,2020-05-05 19:23:00,1
2,2,37.808498,-122.366852,0,0.0,0,0,0,0,0,0,1,0,3,2016-09-16 19:57:16,1
3,3,37.78593,-122.39108,0,0.009,0,0,1,0,0,0,0,0,1,2020-03-29 19:48:43,1
4,4,37.719141,-122.448457,0,0.0,0,0,0,0,0,0,0,0,2,2019-10-09 08:47:00,1


In [None]:
# Mean of every numeric column, including Severity, per latitude
sev_weight=data.groupby(['Lat']).mean(numeric_only=True)
sev_weight_df=sev_weight.reset_index()

sev_weight_df.rename({'Severity':'Sev_weight'},inplace=True,axis=1)

sev_temp = sev_weight_df[['Lat','Lng','Sev_weight']]
sev_temp = sev_temp.groupby(['Lng']).mean()
sev_temp_final = sev_temp.reset_index()

In [None]:
sev_weight_df

Unnamed: 0,Lat,ID,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Sev_weight,Side
0,37.609619,6084.00,-122.390540,0.0,0.199,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
1,37.614593,669.00,-122.385414,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0
2,37.615020,3476.00,-122.393990,0.0,0.239,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
3,37.629330,5786.00,-122.401787,0.0,0.860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
4,37.629623,3638.00,-122.401779,0.0,0.310,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2056,37.825462,3023.00,-122.479152,0.0,0.324,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
2057,37.825603,5948.00,-122.479279,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
2058,37.825614,3828.25,-122.479151,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
2059,37.825615,2271.25,-122.479266,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0


In [None]:


data_with_sevWeight=pd.merge(data,sev_temp_final[['Lng','Sev_weight']],on=['Lng'],how='left')

data_with_sevWeight2=pd.merge(data_with_sevWeight,sev_weight_df[['Lat','Sev_weight']],on=['Lat'],how='left')

data_with_sevWeight2.drop(columns=['Sev_weight_x'],inplace = True)
data_with_sevWeight2 = data_with_sevWeight2.rename({'Sev_weight_y':'Sev_weight'},axis = 1)
data_with_sevWeight2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6407 entries, 0 to 6406
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            6407 non-null   int64  
 1   Lat           6407 non-null   float64
 2   Lng           6407 non-null   float64
 3   Bump          6407 non-null   int64  
 4   Distance(mi)  6407 non-null   float64
 5   Crossing      6407 non-null   int64  
 6   Give_Way      6407 non-null   int64  
 7   Junction      6407 non-null   int64  
 8   No_Exit       6407 non-null   int64  
 9   Railway       6407 non-null   int64  
 10  Roundabout    6407 non-null   int64  
 11  Stop          6407 non-null   int64  
 12  Amenity       6407 non-null   int64  
 13  Severity      6407 non-null   int64  
 14  timestamp     6407 non-null   object 
 15  Side          6407 non-null   int64  
 16  Sev_weight    6407 non-null   float64
dtypes: float64(4), int64(12), object(1)
memory usage: 901.0+ KB


In [None]:
# df[['Severity','Lng','Lat']].corr()

In [None]:
data_date=pd.to_datetime(data_with_sevWeight2['timestamp'])
dataDF=pd.DataFrame(data_date)
data1 = pd.concat([data.drop('timestamp',axis=1), dataDF], axis=1)
data1.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Severity,Side,timestamp
0,0,37.76215,-122.40566,0,0.044,0,0,0,0,0,0,0,1,2,1,2016-03-25 15:13:02
1,1,37.719157,-122.448254,0,0.0,0,0,0,0,0,0,0,0,2,1,2020-05-05 19:23:00
2,2,37.808498,-122.366852,0,0.0,0,0,0,0,0,0,1,0,3,1,2016-09-16 19:57:16
3,3,37.78593,-122.39108,0,0.009,0,0,1,0,0,0,0,0,1,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,0,0.0,0,0,0,0,0,0,0,0,2,1,2019-10-09 08:47:00


We've got 6407 examples in the dataset with 14 features, 1 ID, and the `Severity` of the crash.

Looking at the features and a sample of the data, the features appear to be of numerical and categorical types. What about some descriptive statistics?

The output shows descriptive statistics for the numerical features, `Lat`, `Lng`, `Distance(mi)`, and `Severity`. I'll use the numerical features to demonstrate how to train the model and make submissions. **However, you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.**
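
As a minimal sketch of such a summary, assuming a small hand-made frame that only borrows the column names (the values below are illustrative and not taken from the competition data), `DataFrame.describe` and `value_counts` could be used like this:

```python
import pandas as pd

# Toy stand-in for the training frame; values are made up for illustration.
toy_df = pd.DataFrame({
    "Lat": [37.76, 37.72, 37.81, 37.79],
    "Lng": [-122.41, -122.45, -122.37, -122.39],
    "Distance(mi)": [0.044, 0.0, 0.0, 0.009],
    "Severity": [2, 2, 3, 1],
})

# Summary statistics (count, mean, std, min, quartiles, max) per numerical column.
summary = toy_df.describe()
print(summary)

# Share of each target class; useful for spotting class imbalance.
class_share = toy_df["Severity"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same two calls on `df` would likely be a reasonable first look before deciding which features to keep.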

## Data Splitting

Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely, the training, validation and test sets. In our case, the test set is already predefined, so we'll split the "training" set into training and validation sets with a 0.8:0.2 ratio.

*Note: a good way to generate reproducible results is to set the seed of the algorithms that depend on randomization. This is done with the argument `random_state` in the following command.*
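
The sketch below illustrates the 0.8:0.2 split with a fixed seed on a toy frame. The frame and its values are assumptions made for illustration; `stratify` is optional and is shown only because it keeps the class proportions similar in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced target; values are made up for illustration.
toy_df = pd.DataFrame({
    "Lat": [37.70 + 0.01 * i for i in range(20)],
    "Severity": [2] * 15 + [3] * 5,
})

# 0.8:0.2 split; `random_state` fixes the shuffle, `stratify` keeps class ratios.
train_part, val_part = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["Severity"]
)

print(len(train_part), len(val_part))
print(val_part["Severity"].value_counts().to_dict())
```

Running the cell twice with the same `random_state` should give the same rows in each subset.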

As pointed out earlier, I'll use the numerical features to train the classifier. **However, you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.**

In [None]:
# This cell is used to select the numerical features. IT SHOULD BE REMOVED AS YOU DO YOUR WORK.
# X_train = X_train[['Lat', 'Lng', 'Distance(mi)']]
# X_val = X_val[['Lat', 'Lng', 'Distance(mi)']]

data_with_sevWeight2['timestamp']=pd.to_datetime(data_with_sevWeight2['timestamp'])

data_with_sevWeight2.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Severity,timestamp,Side,Sev_weight
0,0,37.76215,-122.40566,0,0.044,0,0,0,0,0,0,0,1,2,2016-03-25 15:13:02,1,2.0
1,1,37.719157,-122.448254,0,0.0,0,0,0,0,0,0,0,0,2,2020-05-05 19:23:00,1,2.0
2,2,37.808498,-122.366852,0,0.0,0,0,0,0,0,0,1,0,3,2016-09-16 19:57:16,1,2.988679
3,3,37.78593,-122.39108,0,0.009,0,0,1,0,0,0,0,0,1,2020-03-29 19:48:43,1,1.545455
4,4,37.719141,-122.448457,0,0.0,0,0,0,0,0,0,0,0,2,2019-10-09 08:47:00,1,2.0


In [None]:
# Parse the holidays XML file into a DataFrame
import xml.etree.ElementTree as Xet

cols = ["date", "description"]
rows = []

xmlparse = Xet.parse(os.path.join(dataset_path, 'holidays.xml'))
root = xmlparse.getroot()
for row in root:
    # Each child element holds one holiday with a date and a description
    holiday_date = row.find("date").text
    holiday_description = row.find("description").text

    rows.append({"date": holiday_date,
                 "description": holiday_description})

holidays_df = pd.DataFrame(rows, columns=cols)

# Count how many times each holiday appears
holidays_df['description'].value_counts()

Martin Luther King Jr. Day               9
Labor Day                                9
Presidents Day (Washingtons Birthday)    9
Thanksgiving Day                         9
Memorial Day                             9
Independence Day                         9
Veterans Day                             9
Columbus Day                             9
Christmas Day                            9
New Year Day                             9
Name: description, dtype: int64

In [None]:
holidays_factorized,uniqes=pd.factorize(holidays_df['description'])
hf=pd.DataFrame(holidays_factorized,columns=['description_factorized'])

holidays_df1=pd.concat([holidays_df,hf],axis=1)
holidays_df1=holidays_df1.drop('description', axis=1)

date_holiday=pd.to_datetime(holidays_df1['date'])
dataDF=pd.DataFrame(holidays_df1)
holidays_df1 = pd.concat([holidays_df1.drop('date',axis=1), date_holiday], axis=1)
holidays_df1.tail()

Unnamed: 0,description_factorized,date
85,5,2020-09-07
86,6,2020-10-12
87,7,2020-11-11
88,8,2020-11-26
89,9,2020-12-25


In [None]:
data_with_sevWeight2['dates']=pd.to_datetime(data_with_sevWeight2['timestamp'].dt.date)
# Join on the date-only column; joining on the full timestamp would never match a holiday date
accidents_holidays=holidays_df1.merge(data_with_sevWeight2,left_on='date',right_on='dates',how = 'right')

accidents_holidays_merged=accidents_holidays.drop('timestamp',axis=1)


accidents_holidays_merged['hour']=(data_with_sevWeight2['timestamp'].dt.time)
accidents_holidays_merged['description_factorized'].fillna(-1,inplace = True)
accidents_holidays_merged

Unnamed: 0,description_factorized,date,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Severity,Side,Sev_weight,dates,hour
0,-1.0,NaT,0,37.762150,-122.405660,0,0.044,0,0,0,0,0,0,0,1,2,1,2.000000,2016-03-25,15:13:02
1,-1.0,NaT,1,37.719157,-122.448254,0,0.000,0,0,0,0,0,0,0,0,2,1,2.000000,2020-05-05,19:23:00
2,-1.0,NaT,2,37.808498,-122.366852,0,0.000,0,0,0,0,0,0,1,0,3,1,2.988679,2016-09-16,19:57:16
3,-1.0,NaT,3,37.785930,-122.391080,0,0.009,0,0,1,0,0,0,0,0,1,1,1.545455,2020-03-29,19:48:43
4,-1.0,NaT,4,37.719141,-122.448457,0,0.000,0,0,0,0,0,0,0,0,2,1,2.000000,2019-10-09,08:47:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,-1.0,NaT,6402,37.740630,-122.407930,0,0.368,0,0,0,0,0,0,0,0,3,1,2.176471,2017-10-01,18:36:13
6403,-1.0,NaT,6403,37.752755,-122.402790,0,0.639,0,0,1,0,0,0,0,0,2,1,2.333333,2018-10-23,07:40:27
6404,-1.0,NaT,6404,37.726304,-122.446015,0,0.000,0,0,1,0,0,0,0,0,2,1,2.000000,2019-10-28,15:45:00
6405,-1.0,NaT,6405,37.808090,-122.367211,0,0.000,0,0,1,0,0,0,0,0,3,1,3.000000,2019-05-04,13:45:31


In [None]:
datentime=data1['timestamp']
acc_hol=pd.concat([accidents_holidays_merged,datentime],axis=1)
acc_hol.drop('hour',axis=1,inplace = True)
acc_hol.head()


Unnamed: 0,description_factorized,date,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Severity,Side,Sev_weight,dates,timestamp
0,-1.0,NaT,0,37.76215,-122.40566,0,0.044,0,0,0,0,0,0,0,1,2,1,2.0,2016-03-25,2016-03-25 15:13:02
1,-1.0,NaT,1,37.719157,-122.448254,0,0.0,0,0,0,0,0,0,0,0,2,1,2.0,2020-05-05,2020-05-05 19:23:00
2,-1.0,NaT,2,37.808498,-122.366852,0,0.0,0,0,0,0,0,0,1,0,3,1,2.988679,2016-09-16,2016-09-16 19:57:16
3,-1.0,NaT,3,37.78593,-122.39108,0,0.009,0,0,1,0,0,0,0,0,1,1,1.545455,2020-03-29,2020-03-29 19:48:43
4,-1.0,NaT,4,37.719141,-122.448457,0,0.0,0,0,0,0,0,0,0,0,2,1,2.0,2019-10-09,2019-10-09 08:47:00


In [None]:
data_date=pd.to_datetime(acc_hol['timestamp'])
dataDF=pd.DataFrame(data_date)
data2 = pd.concat([acc_hol.drop('timestamp',axis=1), dataDF], axis=1)
data2.rename({'timestamp':'dates_weather'},axis = 1 , inplace = True)

In [None]:
data2

Unnamed: 0,description_factorized,date,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Severity,Side,Sev_weight,dates,dates_weather
0,-1.0,NaT,0,37.762150,-122.405660,0,0.044,0,0,0,0,0,0,0,1,2,1,2.000000,2016-03-25,2016-03-25 15:13:02
1,-1.0,NaT,1,37.719157,-122.448254,0,0.000,0,0,0,0,0,0,0,0,2,1,2.000000,2020-05-05,2020-05-05 19:23:00
2,-1.0,NaT,2,37.808498,-122.366852,0,0.000,0,0,0,0,0,0,1,0,3,1,2.988679,2016-09-16,2016-09-16 19:57:16
3,-1.0,NaT,3,37.785930,-122.391080,0,0.009,0,0,1,0,0,0,0,0,1,1,1.545455,2020-03-29,2020-03-29 19:48:43
4,-1.0,NaT,4,37.719141,-122.448457,0,0.000,0,0,0,0,0,0,0,0,2,1,2.000000,2019-10-09,2019-10-09 08:47:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6402,-1.0,NaT,6402,37.740630,-122.407930,0,0.368,0,0,0,0,0,0,0,0,3,1,2.176471,2017-10-01,2017-10-01 18:36:13
6403,-1.0,NaT,6403,37.752755,-122.402790,0,0.639,0,0,1,0,0,0,0,0,2,1,2.333333,2018-10-23,2018-10-23 07:40:27
6404,-1.0,NaT,6404,37.726304,-122.446015,0,0.000,0,0,1,0,0,0,0,0,2,1,2.000000,2019-10-28,2019-10-28 15:45:00
6405,-1.0,NaT,6405,37.808090,-122.367211,0,0.000,0,0,1,0,0,0,0,0,3,1,3.000000,2019-05-04,2019-05-04 13:45:31


In [None]:
# Cleaning weather data
weather_df_dates_sorted_clean = weather_df_dates_sorted.drop_duplicates(subset=['dates_weather'])
# weather_df_dates_sorted_clean.info()
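# Truncate crash timestamps to the hour, as done for the test set, so they can match the hourly weather records
data2['dates_weather'] = data2['dates_weather'].apply(lambda x: x.replace(minute=0,second=0))
# Merge the weather data with the dataset
merged_data = pd.merge(data2,weather_df_dates_sorted_clean,on=['dates_weather'],how = 'left')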
# df_merged_data = merged_data.drop(columns=['timestamp'])
df_dates_sorted_merged = merged_data.sort_values(by='dates_weather')

# Fill the remaining weather gaps with fixed, hard-coded per-column values
weather_fill_values = {
    'Precipitation(in)': 0.009867,
    'Humidity(%)': 68.571366,
    'Temperature(F)': 59.907086,
    'Wind_Speed(mph)': 10.695899,
    'Visibility(mi)': 9.441932,
    'Wind_Chill(F)': 50.229329,
}
df_dates_sorted_merged = df_dates_sorted_merged.fillna(value=weather_fill_values)

## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [None]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df_dates_sorted_merged, test_size=0.2, random_state=42) # Try adding `stratify` here

X_train = train_df.drop(columns=['ID', 'Severity','Roundabout','Bump','No_Exit','Distance(mi)','dates_weather','Selected','Weather_Condition','dates','date','Precipitation(in)'])
y_train = train_df['Severity']

X_val = val_df.drop(columns=['ID', 'Severity','Roundabout','Bump','No_Exit','Distance(mi)','dates_weather','Selected','Weather_Condition','dates','date','Precipitation(in)'])
y_val = val_df['Severity']

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5125 entries, 284 to 3859
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   description_factorized  5125 non-null   float64
 1   Lat                     5125 non-null   float64
 2   Lng                     5125 non-null   float64
 3   Crossing                5125 non-null   int64  
 4   Give_Way                5125 non-null   int64  
 5   Junction                5125 non-null   int64  
 6   Railway                 5125 non-null   int64  
 7   Stop                    5125 non-null   int64  
 8   Amenity                 5125 non-null   int64  
 9   Side                    5125 non-null   int64  
 10  Sev_weight              5125 non-null   float64
 11  Wind_Chill(F)           5125 non-null   float64
 12  Temperature(F)          5125 non-null   float64
 13  Humidity(%)             5125 non-null   float64
 14  Wind_Speed(mph)         5125 non-null 

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

Now let's test our classifier on the validation dataset and see the accuracy.

In [None]:
print("The accuracy of the classifier on the validation set is ", (classifier.score(X_val, y_val)))

The accuracy of the classifier on the validation set is  0.9009360374414976


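Well, that's a good start, right? A classifier that predicts the `Severity` of every example as 2 would get around 0.63. You should get a better score as you add more features and improve the data preprocessing. Keep in mind that `Sev_weight` is computed from `Severity` over the whole training set before the split, so this validation score is probably optimistic.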
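
To make the majority-class comparison concrete, here is a minimal sketch using scikit-learn's `DummyClassifier`. The labels are made up so that class 2 covers 63% of the examples; on the real data you would pass `X_train`, `y_train`, `X_val` and `y_val` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels in which class 2 makes up 63% of the examples (illustrative only).
y_toy = np.array([2] * 63 + [3] * 30 + [1] * 7)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any model worth submitting should beat this baseline on the validation set.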

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [None]:
sev_temp_final

Unnamed: 0,Lng,Lat,Sev_weight
0,-122.510440,37.763900,4.0
1,-122.507191,37.743544,2.0
2,-122.506742,37.741686,2.0
3,-122.494469,37.733791,2.0
4,-122.494186,37.776985,2.0
...,...,...,...
1917,-122.353855,37.817651,2.0
1918,-122.352951,37.818428,3.0
1919,-122.352173,37.818672,3.0
1920,-122.350359,37.819110,2.0


In [None]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
# Get test_df ready for prediction: encode True/False values as 1/0
test_df = test_df * 1

# Convert the timestamp to a datetime and truncate it to the hour
dates = pd.to_datetime(test_df['timestamp'])
dates_z = dates.apply(lambda x: x.replace(minute=0,second=0))
test_df['dates_weather']=dates_z
# Merge the weather data with the test set
merged_data = pd.merge(test_df,weather_df_dates_sorted_clean,on=['dates_weather'],how = 'left')
df_dates_sorted_merged = merged_data.drop(columns=['timestamp'])
test_df = df_dates_sorted_merged


data_with_sevWeight=pd.merge(test_df,sev_temp_final[['Lng','Sev_weight']],on=['Lng'],how='left')

data_with_sevWeight2=pd.merge(data_with_sevWeight,sev_weight_df[['Lat','Sev_weight']],on=['Lat'],how='left')

data_with_sevWeight2.drop(columns=['Sev_weight_x'],inplace = True)
data_with_sevWeight2 = data_with_sevWeight2.rename({'Sev_weight_y':'Sev_weight'},axis = 1)
data_with_sevWeight2.sort_values(['Lng'],inplace = True)
data_with_sevWeight2['Sev_weight'].fillna(method='pad',inplace =True)
data_with_sevWeight2.info()
##############################

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 1112 to 1427
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   ID                 1601 non-null   int64         
 1   Lat                1601 non-null   float64       
 2   Lng                1601 non-null   float64       
 3   Bump               1601 non-null   int64         
 4   Distance(mi)       1601 non-null   float64       
 5   Crossing           1601 non-null   int64         
 6   Give_Way           1601 non-null   int64         
 7   Junction           1601 non-null   int64         
 8   No_Exit            1601 non-null   int64         
 9   Railway            1601 non-null   int64         
 10  Roundabout         1601 non-null   int64         
 11  Stop               1601 non-null   int64         
 12  Amenity            1601 non-null   int64         
 13  Side               1601 non-null   object        
 14  dates

In [None]:

data_with_sevWeight2['dates_weather'] = pd.to_datetime(data_with_sevWeight2['dates_weather'])
data_with_sevWeight2['dates']=pd.to_datetime(data_with_sevWeight2['dates_weather'].dt.date)
test_df_holidays=holidays_df1.merge(data_with_sevWeight2,left_on='date',right_on='dates',how = 'right')

test_df_holidays_merged=test_df_holidays.drop('dates_weather',axis=1)


# test_df_holidays_merged['hour']=(test_df['timestamp'].dt.time)
test_df_holidays_merged['description_factorized'].fillna(-1,inplace = True)
test_df_holidays_merged.info()
test_df_holidays_merged.head()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 1600
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   description_factorized  1601 non-null   float64       
 1   date                    37 non-null     datetime64[ns]
 2   ID                      1601 non-null   int64         
 3   Lat                     1601 non-null   float64       
 4   Lng                     1601 non-null   float64       
 5   Bump                    1601 non-null   int64         
 6   Distance(mi)            1601 non-null   float64       
 7   Crossing                1601 non-null   int64         
 8   Give_Way                1601 non-null   int64         
 9   Junction                1601 non-null   int64         
 10  No_Exit                 1601 non-null   int64         
 11  Railway                 1601 non-null   int64         
 12  Roundabout              1601 non-null   int64   

Unnamed: 0,description_factorized,date,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,...,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,Sev_weight,dates
0,-1.0,NaT,7519,37.7639,-122.51044,0,1.969,1,0,0,...,Fair,56.0,0.0,56.0,77.0,9.0,10.0,No,4.0,2020-04-10
1,-1.0,NaT,7941,37.74168,-122.50694,0,0.007,1,0,0,...,Partly Cloudy,68.0,0.0,68.0,56.0,9.0,10.0,No,4.0,2020-10-25
2,-1.0,NaT,6603,37.783828,-122.486077,0,0.0,0,0,0,...,Partly Cloudy,85.0,0.0,85.0,16.0,0.0,10.0,No,4.0,2019-10-24
3,-1.0,NaT,7859,37.819321,-122.478447,0,0.0,0,0,0,...,Mostly Cloudy,59.0,0.0,59.0,78.0,15.0,10.0,No,2.0,2020-01-24
4,-1.0,NaT,6715,37.813958,-122.477934,0,0.0,0,0,0,...,Cloudy,54.0,0.0,54.0,75.0,14.0,10.0,No,2.0,2020-04-11


In [None]:
# Encoding
# test_df_holidays_merged = test_df_holidays_merged*1
# Side edit
sides_encoder = LabelBinarizer()
sides_encoder.fit(test_df_holidays_merged['Side'])
transformed = sides_encoder.transform(test_df_holidays_merged['Side'])
ohe_df = pd.DataFrame(transformed)
test_df_holidays_merged2 = pd.concat([test_df_holidays_merged, ohe_df], axis=1).drop(['Side'], axis=1)
test_df_holidays_merged2=test_df_holidays_merged2.rename(columns={0:'Side'})
test_df_holidays_merged2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 1600
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   description_factorized  1601 non-null   float64       
 1   date                    37 non-null     datetime64[ns]
 2   ID                      1601 non-null   int64         
 3   Lat                     1601 non-null   float64       
 4   Lng                     1601 non-null   float64       
 5   Bump                    1601 non-null   int64         
 6   Distance(mi)            1601 non-null   float64       
 7   Crossing                1601 non-null   int64         
 8   Give_Way                1601 non-null   int64         
 9   Junction                1601 non-null   int64         
 10  No_Exit                 1601 non-null   int64         
 11  Railway                 1601 non-null   int64         
 12  Roundabout              1601 non-null   int64   

Note that the test set has the same features and doesn't have the `Severity` column.
At this stage, one must **NOT** forget to apply the same processing that was done on the training set to the features of the test set.
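
One way to keep the two sets consistent is to fit each transformer on the training data once and reuse the fitted object on the test data. The sketch below shows this for the `Side` column with `LabelBinarizer`; the toy frames and the name `side_encoder` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy stand-ins for the train and test frames (values are illustrative).
toy_train = pd.DataFrame({"Side": ["R", "L", "R", "R"]})
toy_test = pd.DataFrame({"Side": ["R", "R", "L"]})

# Fit the encoder once, on the training data only.
side_encoder = LabelBinarizer()
side_encoder.fit(toy_train["Side"])

# Reuse the fitted encoder so both sets share the same mapping (L -> 0, R -> 1).
toy_train["Side"] = side_encoder.transform(toy_train["Side"]).ravel()
toy_test["Side"] = side_encoder.transform(toy_test["Side"]).ravel()

print(toy_test["Side"].tolist())
```

Refitting the encoder on the test set happens to give the same mapping when both values are present, but reusing the fitted encoder avoids relying on that.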

Now we'll add a `Severity` column to the test `DataFrame` and fill it with the predicted classes.

**I'll select the same features here as I did for the training set. DO NOT forget to change this step as you change the preprocessing of the training data.**

In [None]:
X_test = test_df_holidays_merged2.drop(columns=['ID','Roundabout','Bump','No_Exit','Distance(mi)','Selected','Weather_Condition','dates','date','Precipitation(in)'])
X_test = X_test[['description_factorized', 'Lat', 'Lng', 'Crossing', 'Give_Way',
       'Junction', 'Railway', 'Stop', 'Amenity', 'Side', 'Sev_weight',
       'Wind_Chill(F)', 'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)',
       'Visibility(mi)']]
X_test.columns

Index(['description_factorized', 'Lat', 'Lng', 'Crossing', 'Give_Way',
       'Junction', 'Railway', 'Stop', 'Amenity', 'Side', 'Sev_weight',
       'Wind_Chill(F)', 'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)',
       'Visibility(mi)'],
      dtype='object')

In [None]:
X_train.columns

Index(['description_factorized', 'Lat', 'Lng', 'Crossing', 'Give_Way',
       'Junction', 'Railway', 'Stop', 'Amenity', 'Side', 'Sev_weight',
       'Wind_Chill(F)', 'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)',
       'Visibility(mi)'],
      dtype='object')

In [None]:

# You should update/remove the next line once you change the features used for training
# X_test = X_test[['Lat', 'Lng', 'Distance(mi)']]

y_test_predicted = classifier.predict(X_test)

test_df_holidays_merged2['Severity'] = y_test_predicted

test_df_holidays_merged2.head()

Unnamed: 0,description_factorized,date,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,...,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,Sev_weight,dates,Side,Severity
0,-1.0,NaT,7519,37.7639,-122.51044,0,1.969,1,0,0,...,0.0,56.0,77.0,9.0,10.0,No,4.0,2020-04-10,1,2
1,-1.0,NaT,7941,37.74168,-122.50694,0,0.007,1,0,0,...,0.0,68.0,56.0,9.0,10.0,No,4.0,2020-10-25,0,2
2,-1.0,NaT,6603,37.783828,-122.486077,0,0.0,0,0,0,...,0.0,85.0,16.0,0.0,10.0,No,4.0,2019-10-24,0,3
3,-1.0,NaT,7859,37.819321,-122.478447,0,0.0,0,0,0,...,0.0,59.0,78.0,15.0,10.0,No,2.0,2020-01-24,1,2
4,-1.0,NaT,6715,37.813958,-122.477934,0,0.0,0,0,0,...,0.0,54.0,75.0,14.0,10.0,No,2.0,2020-04-11,1,2


Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.

In [None]:
test_df_holidays_merged2[['ID', 'Severity']].to_csv('/kaggle/working/submission.csv', index=False)
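
Before submitting, a quick sanity check along the following lines may catch common mistakes such as extra columns or missing rows. The helper `check_submission` is hypothetical (it is not part of Kaggle or pandas), and the IDs and predictions are made up.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(submission.columns) == ["ID", "Severity"], "unexpected columns"
    assert submission["ID"].is_unique, "duplicate IDs"
    assert set(submission["ID"]) == set(expected_ids), "IDs differ from the test set"
    assert submission["Severity"].notna().all(), "missing predictions"

# Toy example; the IDs and predictions are made up.
toy_submission = pd.DataFrame({"ID": [6407, 6408, 6409], "Severity": [2, 2, 3]})
check_submission(toy_submission, expected_ids=[6407, 6408, 6409])
print("submission looks fine")
```

In the notebook, the expected IDs would be the `ID` column of the original `test.csv`.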

The remaining steps to submit the generated file are as follows.

1. Press `Save Version` on the upper right corner of this notebook.
2. Write a `Version Name` of your choice and choose `Save & Run All (Commit)` then click `Save`.
3. Wait for the saved notebook to finish running, then go to the saved notebook.
4. Scroll down until you see the output files then select the `submission.csv` file and click `Submit`.

Now your submission will be evaluated and your score will be updated on the leaderboard! CONGRATULATIONS!!

## Conclusion

In this notebook, we demonstrated the essential steps needed to get "slightly" familiar with the data and the submission process. We chose not to go into detail in each step, to keep this welcome notebook simple and leave room for improvement.

You're encouraged to `Fork` the notebook, edit it, add your insights and use it to create your submission.