## You're here! 
Welcome to your first competition in the [ITI's AI Pro training program](https://ai.iti.gov.eg/epita/ai-engineer/)! We hope you enjoy it and learn as much as we did while preparing this competition.


## Introduction

In this competition, the task is to predict the `Severity` of a car crash from information about the crash, e.g., its location.

This is the getting started notebook. Things are kept simple so that it's easier to understand the steps and modify it.

Feel free to `Fork` this notebook and share it with your modifications **OR** use it to create your submissions.

### Prerequisites
You should know how to use Python and a little bit of Machine Learning. You can apply the techniques you learned in the training program and submit the new solutions!

### Checklist
You can participate in this competition the way you prefer. However, I recommend following these steps if this is your first time joining a competition on Kaggle.

* Fork this notebook and run the cells in order.
* Submit this solution.
* Make changes to the data processing step as you see fit.
* Submit the new solutions.

*You can make up to 5 submissions per day. Only one of your submissions can be selected for the final ranking.*


Don't hesitate to leave a comment or contact me if you have any questions!

## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [1]:
import pandas as pd
import os

## Exploratory Data Analysis
In this step, one should load the data and analyze it. Here, I'll load the data and do only a minimal analysis. You are encouraged to do a thorough analysis!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [2]:
dataset_path = '/kaggle/input/car-crashes-severity-prediction/'

df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))

print("The shape of the dataset is {}.\n\n".format(df.shape))

df.head()

The shape of the dataset is (6407, 16).




Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,timestamp
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,False,False,True,R,2,2016-03-25 15:13:02
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,False,False,False,R,2,2020-05-05 19:23:00
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,False,True,False,R,3,2016-09-16 19:57:16
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,False,False,False,R,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,False,False,False,R,2,2019-10-09 08:47:00


We've got 6407 examples in the dataset with 14 features, 1 ID, and the `Severity` of the crash.

Looking at the column names and a sample of the data, the features are a mix of numerical and categorical types. What about some descriptive statistics?

In [3]:
df.drop(columns='ID').describe()

Unnamed: 0,Lat,Lng,Distance(mi),Severity
count,6407.0,6407.0,6407.0,6407.0
mean,37.765653,-122.40599,0.135189,2.293429
std,0.032555,0.028275,0.39636,0.521225
min,37.609619,-122.51044,0.0,1.0
25%,37.737096,-122.41221,0.0,2.0
50%,37.768238,-122.404835,0.0,2.0
75%,37.787813,-122.392477,0.041,3.0
max,37.825626,-122.349734,6.82,4.0


The output shows descriptive statistics for the numerical columns `Lat`, `Lng`, `Distance(mi)`, and `Severity`. I'll use the numerical features to demonstrate how to train the model and make submissions. **However, you shouldn't use only the numerical features for the final submission if you want to make it to the top of the leaderboard.**
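
Beyond `describe()`, two quick checks that are often useful at this stage are the class distribution of the target and the number of missing values per column. The following is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Severity": [2, 2, 3, 2, 1, 2, 3, 2],
    "Distance(mi)": [0.0, 0.04, None, 0.0, 0.2, 0.0, 0.67, 0.01],
})

# Share of each class: with a skewed target, accuracy alone can be misleading.
class_share = toy_df["Severity"].value_counts(normalize=True).sort_index()
print(class_share)

# Number of missing values per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

On the real data, the same two calls can be applied directly to `df`.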

## Preprocessing the Training Data

In [4]:
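# Parse the timestamp, then split it into a calendar date and an hour of day
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Date'] = df['timestamp'].dt.date
df['Hour'] = df['timestamp'].dt.hour
# The raw timestamp and the ID are not used as features
df = df.drop(columns=['timestamp', 'ID'])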

In [5]:
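# Mapping Side column to numerical representation.
# A dense rank in descending order gives R -> 1.0 and L -> 2.0 when only these two
# labels are present; the codes depend on which labels occur in the column.
df['Side'] = df['Side'].rank(method='dense', ascending=False).astype(float)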
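
Because a rank-based encoding is computed from whatever labels appear in a column, the same label can receive a different code in another frame (for example, a test set that happens to contain only `L`). One alternative is a fixed mapping. The sketch below is an assumption, not the notebook's method: `SIDE_CODES` and `encode_side` are hypothetical names, and the codes 0 and 1 are an arbitrary choice.

```python
import pandas as pd

# Hypothetical fixed mapping; the codes 0/1 are an arbitrary choice for illustration.
SIDE_CODES = {"R": 0, "L": 1}


def encode_side(side: pd.Series) -> pd.Series:
    """Map side labels to fixed integer codes; unknown or missing labels become -1."""
    return side.map(SIDE_CODES).fillna(-1).astype(int)


train_side = pd.Series(["R", "L", "R"])
test_side = pd.Series(["L", "L"])  # only one label present

print(encode_side(train_side).tolist())
print(encode_side(test_side).tolist())
```

With a fixed mapping, `L` gets the same code in both frames regardless of which labels each frame contains.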

## Loading and Preprocessing the Weather Data

In [6]:
df_weather = pd.read_csv(os.path.join(dataset_path, 'weather-sfcsv.csv'))

print("The shape of the dataset is {}.\n\n".format(df_weather.shape))

The shape of the dataset is (6901, 12).




In [7]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6901 entries, 0 to 6900
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               6901 non-null   int64  
 1   Day                6901 non-null   int64  
 2   Month              6901 non-null   int64  
 3   Hour               6901 non-null   int64  
 4   Weather_Condition  6900 non-null   object 
 5   Wind_Chill(F)      3292 non-null   float64
 6   Precipitation(in)  3574 non-null   float64
 7   Temperature(F)     6899 non-null   float64
 8   Humidity(%)        6899 non-null   float64
 9   Wind_Speed(mph)    6556 non-null   float64
 10  Visibility(mi)     6900 non-null   float64
 11  Selected           6901 non-null   object 
dtypes: float64(6), int64(4), object(2)
memory usage: 647.1+ KB


In [8]:
# Build a calendar date from the separate year/month/day columns
df_weather['Date'] = pd.to_datetime(df_weather[['Year', 'Month', 'Day']]).dt.date

# Keep only the columns used below. Wind_Chill(F) and Precipitation(in) are not used;
# about half of their values are missing (see the info() output above).
df_weather = df_weather[['Date', 'Weather_Condition', 'Visibility(mi)', 'Temperature(F)', 'Humidity(%)', 'Wind_Speed(mph)']].copy()

# Fill missing wind speeds with the column mean (assignment instead of a chained inplace fillna)
df_weather['Wind_Speed(mph)'] = df_weather['Wind_Speed(mph)'].fillna(df_weather['Wind_Speed(mph)'].mean())

In [9]:
# Mapping weather condition to numerical representation.
# Note: pd.factorize assigns the code -1 to missing values.
df_weather.Weather_Condition = pd.factorize(df_weather.Weather_Condition)[0]

## Merging the Training Data and the Weather Data

In [10]:
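# Merge the DataFrames.
# Caution: this joins on the row index, so crash row i is paired with weather row i
# regardless of when the crash happened. Joining on the date (and hour) would align
# each crash with the weather recorded at that time.
df_merged = pd.merge(df, df_weather, how='inner', left_index=True, right_index=True, suffixes=('', '_drop'))

# Drop the duplicate columns created by the merge (e.g. Date_drop)
df_merged.drop([col for col in df_merged.columns if 'drop' in col], axis=1, inplace=True)
# Drop the rows that still contain missing values
df_merged = df_merged.dropna()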
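
An index-based merge pairs rows by position only. A key-based merge on the date and hour is one way to attach the weather reading that matches each crash. The sketch below uses toy frames with made-up values, and it assumes the weather frame keeps its `Hour` column (the cell above drops it), so treat it as an illustration rather than a drop-in replacement.

```python
import datetime as dt

import pandas as pd

# Toy crash records (made-up values).
crashes = pd.DataFrame({
    "Lat": [37.76, 37.71, 37.80],
    "Date": [dt.date(2016, 3, 25), dt.date(2016, 3, 25), dt.date(2016, 9, 16)],
    "Hour": [15, 19, 19],
})

# Toy hourly weather records (made-up values), deliberately in a different row order.
weather = pd.DataFrame({
    "Date": [dt.date(2016, 3, 25), dt.date(2016, 3, 25), dt.date(2016, 9, 16)],
    "Hour": [19, 15, 19],
    "Temperature(F)": [55.0, 64.9, 71.1],
})

# If several readings share the same date and hour, keep one so that
# the merge cannot duplicate crash rows.
weather = weather.drop_duplicates(subset=["Date", "Hour"])

# A left join keeps every crash row, even when no weather reading matches.
merged = crashes.merge(weather, how="left", on=["Date", "Hour"], validate="many_to_one")
print(merged)
```

A left join also keeps the number of rows unchanged, which matters later when predictions have to line up with the test IDs.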


In [11]:
# Showing information about the new merged data
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6406 entries, 0 to 6406
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Lat                6406 non-null   float64
 1   Lng                6406 non-null   float64
 2   Bump               6406 non-null   bool   
 3   Distance(mi)       6406 non-null   float64
 4   Crossing           6406 non-null   bool   
 5   Give_Way           6406 non-null   bool   
 6   Junction           6406 non-null   bool   
 7   No_Exit            6406 non-null   bool   
 8   Railway            6406 non-null   bool   
 9   Roundabout         6406 non-null   bool   
 10  Stop               6406 non-null   bool   
 11  Amenity            6406 non-null   bool   
 12  Side               6406 non-null   float64
 13  Severity           6406 non-null   int64  
 14  Date               6406 non-null   object 
 15  Hour               6406 non-null   int64  
 16  Weather_Condition  6406 

## Loading and Preprocessing the Holidays Data

In [12]:
import xml.etree.ElementTree as ET

# Read the raw XML text
with open(os.path.join(dataset_path, 'holidays.xml'), 'r', encoding='utf8') as f:
    xml_holidays = f.read()


def xmlToDf(xml_data):
    """Convert an XML string into a DataFrame with one row per child of the root."""
    root = ET.XML(xml_data)
    all_records = []
    for child in root:
        # Each sub-element becomes a column: tag -> text
        record = {sub_child.tag: sub_child.text for sub_child in child}
        all_records.append(record)
    return pd.DataFrame(all_records)


df_holidays = xmlToDf(xml_holidays)


In [13]:
# Extract Year from date
df_holidays['Year']=pd.to_datetime(df_holidays['date']).dt.year

In [14]:
# Keep the holidays from 2016 onward
df_holidays = df_holidays.loc[df_holidays['Year'] >= 2016]

In [15]:
df_holidays

Unnamed: 0,date,description,Year
40,2016-01-01,New Year Day,2016
41,2016-01-18,Martin Luther King Jr. Day,2016
42,2016-02-15,Presidents Day (Washingtons Birthday),2016
43,2016-05-30,Memorial Day,2016
44,2016-07-04,Independence Day,2016
45,2016-09-05,Labor Day,2016
46,2016-10-10,Columbus Day,2016
47,2016-11-11,Veterans Day,2016
48,2016-11-24,Thanksgiving Day,2016
49,2016-12-25,Christmas Day,2016


In [16]:
# Drop the Year column and rename the remaining columns to the names used for merging
df_holidays = df_holidays.drop(columns='Year').rename(columns={'date': 'Date', 'description': 'Holiday'})
# Map each holiday name to an integer code
df_holidays['Holiday'] = pd.factorize(df_holidays['Holiday'])[0]
# Convert the Date column to datetime.date objects so that it matches df_merged['Date']
df_holidays['Date'] = pd.to_datetime(df_holidays['Date']).dt.date


In [17]:
# Showing the dataframe
df_holidays

Unnamed: 0,Date,Holiday
40,2016-01-01,0
41,2016-01-18,1
42,2016-02-15,2
43,2016-05-30,3
44,2016-07-04,4
45,2016-09-05,5
46,2016-10-10,6
47,2016-11-11,7
48,2016-11-24,8
49,2016-12-25,9


## Merging `df_merged` with `df_holidays`

In [18]:
#Merge the DataFrames
df_merged = pd.merge(df_merged, df_holidays, how='left', on='Date', suffixes=('', '_drop'))
#Drop the duplicate columns
df_merged.drop([col for col in df_merged.columns if 'drop' in col], axis=1, inplace=True)

In [19]:
# Showing information about the dataframe
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6406 entries, 0 to 6405
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Lat                6406 non-null   float64
 1   Lng                6406 non-null   float64
 2   Bump               6406 non-null   bool   
 3   Distance(mi)       6406 non-null   float64
 4   Crossing           6406 non-null   bool   
 5   Give_Way           6406 non-null   bool   
 6   Junction           6406 non-null   bool   
 7   No_Exit            6406 non-null   bool   
 8   Railway            6406 non-null   bool   
 9   Roundabout         6406 non-null   bool   
 10  Stop               6406 non-null   bool   
 11  Amenity            6406 non-null   bool   
 12  Side               6406 non-null   float64
 13  Severity           6406 non-null   int64  
 14  Date               6406 non-null   object 
 15  Hour               6406 non-null   int64  
 16  Weather_Condition  6406 

In [20]:
# Ordinary (non-holiday) days have no match in df_holidays, so give them the extra label 10
df_merged['Holiday'] = df_merged['Holiday'].fillna(10)
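
The integer codes 0 to 10 carry no natural order, so a simpler alternative worth trying is a single binary flag that marks whether a crash happened on a holiday. The sketch below uses made-up dates, and the column name `Is_Holiday` is a hypothetical choice.

```python
import datetime as dt

import pandas as pd

# Toy crash dates (made-up values).
crashes = pd.DataFrame({
    "Date": [dt.date(2016, 7, 4), dt.date(2016, 7, 5), dt.date(2016, 12, 25)],
})

# Toy set of holiday dates.
holiday_dates = {dt.date(2016, 7, 4), dt.date(2016, 12, 25)}

# 1 if the crash happened on a listed holiday, 0 otherwise.
crashes["Is_Holiday"] = crashes["Date"].isin(holiday_dates).astype(int)
print(crashes)
```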

## Data Splitting

Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely, the training, validation and test sets. In our case, the test set is already predefined, so we'll split the "training" set into training and validation sets with an 0.8:0.2 ratio.

*Note: a good way to get reproducible results is to set the seed of the algorithms that depend on randomization. This is done with the `random_state` argument in the following command.*

In [21]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df_merged, test_size=0.2, random_state=42) # Try adding `stratify` here

X_train = train_df.drop(columns=['Severity'])
y_train = train_df['Severity']

X_val = val_df.drop(columns=['Severity'])
y_val = val_df['Severity']
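
The comment in the cell above suggests trying `stratify`. With a stratified split, each part keeps the same class proportions as the full dataset, which helps when some `Severity` classes are rare. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced target (made-up values).
toy_df = pd.DataFrame({
    "Distance(mi)": [i / 100 for i in range(20)],
    "Severity": [2] * 14 + [3] * 4 + [1] * 2,
})

train_part, val_part = train_test_split(
    toy_df,
    test_size=0.5,
    random_state=42,
    stratify=toy_df["Severity"],  # keep class proportions equal in both parts
)

print(train_part["Severity"].value_counts().sort_index())
print(val_part["Severity"].value_counts().sort_index())
```

In the notebook, the equivalent change would be passing `stratify=df_merged['Severity']` to `train_test_split`.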


In [22]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5124 entries, 748 to 860
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Lat                5124 non-null   float64
 1   Lng                5124 non-null   float64
 2   Bump               5124 non-null   bool   
 3   Distance(mi)       5124 non-null   float64
 4   Crossing           5124 non-null   bool   
 5   Give_Way           5124 non-null   bool   
 6   Junction           5124 non-null   bool   
 7   No_Exit            5124 non-null   bool   
 8   Railway            5124 non-null   bool   
 9   Roundabout         5124 non-null   bool   
 10  Stop               5124 non-null   bool   
 11  Amenity            5124 non-null   bool   
 12  Side               5124 non-null   float64
 13  Date               5124 non-null   object 
 14  Hour               5124 non-null   int64  
 15  Weather_Condition  5124 non-null   int64  
 16  Visibility(mi)     5124

As pointed out earlier, I'll train the classifier on a subset of the features: the numerical columns plus a few boolean and integer-encoded ones. **However, you shouldn't limit yourself to these features for the final submission if you want to make it to the top of the leaderboard.**

In [23]:
# Columns used for training, kept in one list so that train and validation stay in sync
feature_columns = ['Lat', 'Lng', 'Distance(mi)','Crossing','Junction','Railway','Amenity','Side','Weather_Condition','Visibility(mi)','Holiday','Temperature(F)', 'Humidity(%)','Stop','Wind_Speed(mph)']
X_train = X_train[feature_columns]
X_val = X_val[feature_columns]

## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [24]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

Now let's test our classifier on the validation dataset and see the accuracy.

In [25]:
print("The accuracy of the classifier on the validation set is ", (classifier.score(X_val, y_val)))

The accuracy of the classifier on the validation set is  0.751170046801872


Well. That's a good start, right? A classifier that predicts every example's `Severity` as 2 will get around 0.63. You should get a better score as you add more features and do better data preprocessing.
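
A majority-class baseline like the one mentioned above can be measured with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels rather than the competition data, so the number it prints is not the 0.63 quoted above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 5 of the 8 examples belong to class 2.
y_toy = np.array([2, 2, 3, 2, 1, 2, 3, 2])
X_toy = np.zeros((len(y_toy), 1))  # the features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)  # 5 / 8 = 0.625
```

Fitting the same baseline on `X_train`, `y_train` and scoring it on `X_val`, `y_val` gives a reference point for judging any real model.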

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [26]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
test_df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,timestamp
0,6407,37.78606,-122.3909,False,0.039,False,False,True,False,False,False,False,False,R,2016-04-04 19:20:31
1,6408,37.769609,-122.415057,False,0.202,False,False,False,False,False,False,False,False,R,2020-10-28 11:51:00
2,6409,37.807495,-122.476021,False,0.0,False,False,False,False,False,False,False,False,R,2019-09-09 07:36:45
3,6410,37.761818,-122.405869,False,0.0,False,False,True,False,False,False,False,False,R,2019-08-06 15:46:25
4,6411,37.73235,-122.4141,False,0.67,False,False,False,False,False,False,False,False,R,2018-10-17 09:54:58


Note that the test set has the same features but doesn't have the `Severity` column.
At this stage one must **NOT** forget to apply to the test set the same processing that was applied to the training set.
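
One way to keep the two pipelines identical is to put each preprocessing step in a function and call it on both frames. The sketch below covers only the timestamp step; `add_time_features` is a hypothetical helper name, and the toy rows borrow values from the samples shown in this notebook.

```python
import pandas as pd


def add_time_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `Date` and `Hour` derived from `timestamp`."""
    out = frame.copy()
    ts = pd.to_datetime(out["timestamp"])
    out["Date"] = ts.dt.date
    out["Hour"] = ts.dt.hour
    return out.drop(columns=["timestamp"])


# Toy one-row frames standing in for train.csv and test.csv.
train_toy = pd.DataFrame({"ID": [0], "timestamp": ["2016-03-25 15:13:02"], "Severity": [2]})
test_toy = pd.DataFrame({"ID": [6407], "timestamp": ["2016-04-04 19:20:31"]})

# The same function handles both frames, so the steps cannot drift apart.
train_ready = add_time_features(train_toy)
test_ready = add_time_features(test_toy)
print(train_ready)
print(test_ready)
```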

Now we'll add a `Severity` column to the test `DataFrame` and fill it with the predicted classes.

**I'll select the same features here as I did for the training set. DO NOT forget to change this step when you change the preprocessing of the training data.**

In [27]:
test_df['timestamp'] = pd.to_datetime(test_df['timestamp'])
test_df['Date'] = test_df['timestamp'].dt.date
test_df['Hour'] = test_df['timestamp'].dt.hour
X_test = test_df.drop(columns=['timestamp', 'ID'])

# Mapping Side column to numerical representation
# (rank-based codes depend on the labels present, so check that they match the training codes)
X_test['Side'] = X_test['Side'].rank(method='dense', ascending=False).astype(float)

# Merging the test data and the weather data (joined on the row index, as for the training data)
test_merged = pd.merge(X_test, df_weather, how='inner', left_index=True, right_index=True, suffixes=('', '_drop'))

# Drop the duplicate columns
test_merged.drop([col for col in test_merged.columns if 'drop' in col], axis=1, inplace=True)
# Caution: any row dropped here leaves fewer predictions than rows in test_df
test_merged = test_merged.dropna()

# Merging test_merged with df_holidays
test_merged = pd.merge(test_merged, df_holidays, how='left', on='Date', suffixes=('', '_drop'))

# Drop the duplicate columns
test_merged.drop([col for col in test_merged.columns if 'drop' in col], axis=1, inplace=True)

# Ordinary (non-holiday) days get the extra label 10, as in the training data
test_merged['Holiday'] = test_merged['Holiday'].fillna(10)


# You should update/remove the next line once you change the features used for training
test_merged = test_merged[['Lat', 'Lng', 'Distance(mi)','Crossing','Junction','Railway','Amenity','Side','Weather_Condition','Visibility(mi)','Holiday','Temperature(F)', 'Humidity(%)','Stop','Wind_Speed(mph)']]

y_test_predicted = classifier.predict(test_merged)

test_df['Severity'] = y_test_predicted

test_df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,timestamp,Date,Hour,Severity
0,6407,37.78606,-122.3909,False,0.039,False,False,True,False,False,False,False,False,R,2016-04-04 19:20:31,2016-04-04,19,2
1,6408,37.769609,-122.415057,False,0.202,False,False,False,False,False,False,False,False,R,2020-10-28 11:51:00,2020-10-28,11,2
2,6409,37.807495,-122.476021,False,0.0,False,False,False,False,False,False,False,False,R,2019-09-09 07:36:45,2019-09-09,7,2
3,6410,37.761818,-122.405869,False,0.0,False,False,True,False,False,False,False,False,R,2019-08-06 15:46:25,2019-08-06,15,2
4,6411,37.73235,-122.4141,False,0.67,False,False,False,False,False,False,False,False,R,2018-10-17 09:54:58,2018-10-17,9,2


Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.

In [28]:
test_df[['ID', 'Severity']].to_csv('/kaggle/working/submission.csv', index=False)
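
Before submitting, it can be worth checking that the file has exactly the two expected columns, one prediction per test ID, and no missing values. The sketch below is an assumption about what a useful check looks like, not a Kaggle requirement; `check_submission` is a hypothetical helper and the IDs are toy values.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(submission.columns) == ["ID", "Severity"], "unexpected columns"
    assert submission["Severity"].notna().all(), "missing predictions"
    assert submission["ID"].tolist() == expected_ids.tolist(), "IDs missing or reordered"


# Toy submission with two rows.
toy_submission = pd.DataFrame({"ID": [6407, 6408], "Severity": [2, 2]})
check_submission(toy_submission, pd.Series([6407, 6408]))
print("submission looks OK")
```

In the notebook, the call would be `check_submission(test_df[['ID', 'Severity']], test_df['ID'])`, ideally with the IDs taken from a freshly loaded `test.csv`.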

The remaining step is to submit the generated file, as follows.

1. Press `Save Version` on the upper right corner of this notebook.
2. Write a `Version Name` of your choice and choose `Save & Run All (Commit)` then click `Save`.
3. Wait for the saved notebook to finish running, then go to the saved notebook.
4. Scroll down until you see the output files then select the `submission.csv` file and click `Submit`.

Now your submission will be evaluated and your score will be updated on the leaderboard! CONGRATULATIONS!!

## Conclusion

In this notebook, we have demonstrated the essential steps needed to get "slightly" familiar with the data and the submission process. We chose not to go into detail in each step, to keep this welcome notebook simple and leave room for improvement.

You're encouraged to `Fork` the notebook, edit it, add your insights and use it to create your submission.