## You're here! 
Welcome to your first competition in the [ITI's AI Pro training program](https://ai.iti.gov.eg/epita/ai-engineer/)! We hope you enjoy and learn as much as we did prepairing this competition.


## Introduction

In the competition, it's required to predict the `Severity` of a car crash given info about the crash, e.g., location.

This is the getting started notebook. Things are kept simple so that it's easier to understand the steps and modify it.

Feel free to `Fork` this notebook and share it with your modifications **OR** use it to create your submissions.

### Prerequisites
You should know how to use python and a little bit of Machine Learning. You can apply the techniques you learned in the training program and submit the new solutions! 

### Checklist
You can participate in this competition the way you perefer. However, I recommend following these steps if this is your first time joining a competition on Kaggle.

* Fork this notebook and run the cells in order.
* Submit this solution.
* Make changes to the data processing step as you see fit.
* Submit the new solutions.

*You can submit up to 5 submissions per day. You can select only one of the submission you make to be considered in the final ranking.*


Don't hesitate to leave a comment or contact me if you have any question!

## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [1]:
import pandas as pd
import os
from datetime import datetime


## Exploratory Data Analysis
In this step, one should load the data and analyze it. However, I'll load the data and do minimal analysis. You are encouraged to do thorough analysis!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [2]:
dataset_path = '../input/car-crashes-severity-prediction'
dataset_holiday_path = '../input/holidayscsv'
dataset_feature_path = '../input/feature3'
# dataset_road_path = '../input/roadcsv'
# dataset_last_path = '../input/lastchance'

df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))
df_holidays = pd.read_csv(os.path.join(dataset_holiday_path, 'holidays.csv'))
# df_x = pd.read_csv(os.path.join(dataset_feature_path, 'x.csv'))
# df_r = pd.read_csv(os.path.join(dataset_road_path, 'RoadName.csv'))
# df_l = pd.read_csv(os.path.join(dataset_last_path, 'lastchance.csv'))

df_weather = pd.read_csv(os.path.join(dataset_path, 'weather-sfcsv.csv'))


print("The shape of the dataset is {}.\n\n".format(df.shape))

print(df['Severity'].value_counts())
print(len(df))

# df = pd.concat([df, df_x],axis=1)
# df = pd.concat([df, df_r],axis=1)
# df = pd.concat([df, df_l],axis=1)

df.head()
# class_counts = df['Severity'].value_counts()
# class_weights = len(class_counts)/class_counts
# df = df.sample(
#     n=class_counts.max()*len(class_counts),
#     weights=df['Severity'].map(class_weights), 
#     replace=True)

# print(df['Severity'].value_counts())
# print(len(df))
# df.head()
# print(df['timestamp'][1])

The shape of the dataset is (6407, 16).


2    4346
3    1855
1     129
4      77
Name: Severity, dtype: int64
6407


Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,timestamp
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,False,False,True,R,2,2016-03-25 15:13:02
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,False,False,False,R,2,2020-05-05 19:23:00
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,False,True,False,R,3,2016-09-16 19:57:16
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,False,False,False,R,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,False,False,False,R,2,2019-10-09 08:47:00


We've got 6407 examples in the dataset with 14 featues, 1 ID, and the `Severity` of the crash.

By looking at the features and a sample from the data, the features look of numerical and catogerical types. What about some descriptive statistics?

In [3]:
df['timestamp']= pd.to_datetime(df['timestamp'])
df_holidays['date']= pd.to_datetime(df_holidays['date']).dt.date
df_weather['timestamp'] = pd.to_datetime(df_weather[['Year', 'Month','Day','Hour']].assign(Minute=0)).dt.date



In [4]:
df['holiday'] = 0
df['Night'] = 0

exists = False
for i in range(len(df['timestamp'])):
    if df['timestamp'][i].date().strftime("%A") == 'Saturday' or df['timestamp'][i].date().strftime("%A") == 'Sunday':
        exists = True
    if exists:
        df['holiday'][i] = 1
        #print(exists)
    exists = False
    
for i in range(len(df['timestamp'])):
    exists = df['timestamp'][i].date() in df_holidays['date']
   
    if exists:
        df['holiday'][i] = 1
        print(exists)
for i in range(len(df['timestamp'])):
    if 6 <= df['timestamp'][i].hour <= 17:
        pass
    else:
         df['Night'][i] = 1
        
df['timestamp'] = pd.to_datetime(df['timestamp']).dt.date

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [5]:
df = df.merge(df_weather,on='timestamp',how='left')
df = df.drop_duplicates(subset=['ID'])
df= df.drop(columns=['Wind_Chill(F)','Precipitation(in)'])
# class_counts = df['Severity'].value_counts()
# print(class_counts)
# class_weights = len(class_counts)/class_counts
# df = df.sample(
#     n=class_counts.max()*len(class_counts),
#     weights=df['Severity'].map(class_weights), 
#     replace=True)

# df.reset_indexs = True
# print(df['Severity'].value_counts())
# print(len(df))
# df.head()

In [6]:
df[df.isnull().any(axis=1)]
df['holiday'].value_counts()

0    5199
1    1208
Name: holiday, dtype: int64

In [7]:
# df['Precipitation(in)'].fillna((df['Precipitation(in)'].mean()), inplace=True)
# df['Wind_Chill(F)'].fillna((df['Wind_Chill(F)'].mean()), inplace=True)
# df['Wind_Speed(mph)'].fillna((df['Wind_Speed(mph)'].mean()), inplace=True)
# df['Temperature(F)'].fillna((df['Temperature(F)'].mean()), inplace=True)
# df['Humidity(%)'].fillna((df['Humidity(%)'].mean()), inplace=True)
# df['Visibility(mi)'].fillna((df['Visibility(mi)'].mean()), inplace=True)
# df['Weather_Condition'].fillna((df['Weather_Condition'].mode()[0]), inplace=True)
# df.info()
df = df.dropna()


In [8]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6115 entries, 0 to 42910
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 6115 non-null   int64  
 1   Lat                6115 non-null   float64
 2   Lng                6115 non-null   float64
 3   Bump               6115 non-null   bool   
 4   Distance(mi)       6115 non-null   float64
 5   Crossing           6115 non-null   bool   
 6   Give_Way           6115 non-null   bool   
 7   Junction           6115 non-null   bool   
 8   No_Exit            6115 non-null   bool   
 9   Railway            6115 non-null   bool   
 10  Roundabout         6115 non-null   bool   
 11  Stop               6115 non-null   bool   
 12  Amenity            6115 non-null   bool   
 13  Side               6115 non-null   object 
 14  Severity           6115 non-null   int64  
 15  timestamp          6115 non-null   object 
 16  holiday            6115

In [9]:
df= df.drop(columns=['Year','Month','Day','Hour','Selected',])

In [10]:
df.tail(10)

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Side,Severity,timestamp,holiday,Night,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)
42873,6397,37.80791,-122.3674,False,0.0,False,False,True,False,False,...,R,2,2016-04-01,0,0,Partly Cloudy,61.0,60.0,10.4,10.0
42877,6398,37.756012,-122.409447,False,0.0,True,False,False,False,False,...,R,2,2019-12-25,0,0,Cloudy / Windy,50.0,71.0,22.0,10.0
42887,6399,37.743958,-122.405594,False,0.0,False,False,False,False,False,...,R,2,2018-08-31,0,0,Overcast,55.9,93.0,9.2,10.0
42888,6400,37.808491,-122.367348,False,0.0,False,False,False,False,False,...,R,3,2019-05-11,1,1,Cloudy,57.0,77.0,6.0,10.0
42890,6401,37.788067,-122.440445,False,0.0,True,False,False,False,False,...,L,2,2016-06-28,0,0,Partly Cloudy,55.0,80.0,19.6,10.0
42892,6402,37.74063,-122.40793,False,0.368,False,False,False,False,False,...,R,3,2017-10-01,1,1,Scattered Clouds,61.0,62.0,17.3,10.0
42893,6403,37.752755,-122.40279,False,0.639,False,False,True,False,False,...,R,2,2018-10-23,0,0,Mostly Cloudy,55.9,75.0,5.8,10.0
42897,6404,37.726304,-122.446015,False,0.0,False,False,True,False,False,...,R,2,2019-10-28,0,0,Fair,55.0,27.0,10.0,10.0
42906,6405,37.80809,-122.367211,False,0.0,False,False,True,False,False,...,R,3,2019-05-04,1,0,Fair,63.0,58.0,13.0,10.0
42910,6406,37.773745,-122.408515,False,0.0,True,False,False,False,False,...,R,2,2020-02-28,0,1,Mostly Cloudy,52.0,83.0,13.0,10.0


In [11]:
df["Bump"] = df["Bump"].astype('category').cat.codes
df["Crossing"] = df["Crossing"].astype('category').cat.codes
df["Give_Way"] = df["Give_Way"].astype('category').cat.codes
df["Junction"] = df["Junction"].astype('category').cat.codes
df["Roundabout"] = df["Roundabout"].astype('category').cat.codes
df["Amenity"] = df["Amenity"].astype('category').cat.codes
df["Railway"] = df["Railway"].astype('category').cat.codes
df["Stop"] = df["Stop"].astype('category').cat.codes
df["Side"] = df["Side"].astype('category').cat.codes
df["No_Exit"] = df["No_Exit"].astype('category').cat.codes
df["Weather_Condition"] = df["Weather_Condition"].astype('category').cat.codes
#df["roadpostcode"] = df["roadpostcode"].astype('category').cat.codes


In [12]:
# #df["Bump"] = df["Bump"].astype('category').cat.codes
# df["Crossing"] = df["Crossing"].astype('category').cat.codes
# df["Give_Way"] = df["Give_Way"].astype('category').cat.codes
# df["Junction"] = df["Junction"].astype('category').cat.codes
# #df["Roundabout"] = df["Roundabout"].astype('category').cat.codes
# df["Amenity"] = df["Amenity"].astype('category').cat.codes
# df["Railway"] = df["Railway"].astype('category').cat.codes
# df["Stop"] = df["Stop"].astype('category').cat.codes
# df["Side"] = df["Side"].astype('category').cat.codes
# df["No_Exit"] = df["No_Exit"].astype('category').cat.codes
# df["Weather_Condition"] = df["Weather_Condition"].astype('category')


In [13]:
# encoded_conditions = pd.get_dummies(df['Weather_Condition'])
# # encoded_island = pd.get_dummies(categorical_data['island'])
# # encoded_sex = pd.get_dummies(categorical_data['sex'])

# df = df.join(encoded_conditions)
# # categorical_data = categorical_data.join(encoded_island)
# # categorical_data = categorical_data.join(encoded_sex)

In [14]:
# def mapping(data,feature):
#     featureMap=dict()
#     count=0
#     for i in sorted(data[feature].unique(),reverse=True):
#         featureMap[i]=count
#         count=count+1
#     data[feature]=data[feature].map(featureMap)
#     return data

In [15]:
# df=mapping(df,"Crossing")
# df=mapping(df,"Give_Way")
# df=mapping(df,"Junction")
# df=mapping(df,"Amenity")
# df=mapping(df,"Railway")
# df=mapping(df,"Stop")
# df=mapping(df,"Side")
# df=mapping(df,"No_Exit")
# df=mapping(df,"Weather_Condition")


In [16]:
df.tail()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Side,Severity,timestamp,holiday,Night,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)
42892,6402,37.74063,-122.40793,0,0.368,0,0,0,0,0,...,1,3,2017-10-01,1,1,17,61.0,62.0,17.3,10.0
42893,6403,37.752755,-122.40279,0,0.639,0,0,1,0,0,...,1,2,2018-10-23,0,0,11,55.9,75.0,5.8,10.0
42897,6404,37.726304,-122.446015,0,0.0,0,0,1,0,0,...,1,2,2019-10-28,0,0,3,55.0,27.0,10.0,10.0
42906,6405,37.80809,-122.367211,0,0.0,0,0,1,0,0,...,1,3,2019-05-04,1,0,3,63.0,58.0,13.0,10.0
42910,6406,37.773745,-122.408515,0,0.0,1,0,0,0,0,...,1,2,2020-02-28,0,1,11,52.0,83.0,13.0,10.0


In [17]:
# from sklearn.preprocessing import MinMaxScaler
# from pandas import DataFrame

# scaler = MinMaxScaler()
# data = scaler.fit_transform(df)
# df = DataFrame(data)
# # get max values in each column

# print(df.describe())

# def normalize(dataset):
#     dataNorm=((dataset-dataset.min())/(dataset.max()-dataset.min()))
#     dataNorm["ID"]=dataset["ID"]
#     return dataNorm

# df=normalize(df)
# df.sample(5)


# copy the data
#df_max_scaled = df.copy()
  
# # apply normalization techniques
# for column in df.columns:
#     df[column] = df[column]  / df[column].abs().max()
      
# # view normalized data
# display(df)

# import pandas as pd
# from sklearn import preprocessing

# min_max_scaler = preprocessing.MinMaxScaler()
# x_scaled = min_max_scaler.fit_transform(df.values)
# normalized_df = pd.DataFrame(x_scaled)

In [18]:
df[df.isnull().any(axis=1)]


Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Side,Severity,timestamp,holiday,Night,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)


In [19]:
#df.drop(columns='ID').describe()
#df= df.drop(columns=['timestamp'])

In [20]:
# from sklearn.decomposition import PCA
# pca = PCA(n_components=4)
#pca.fit(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6115 entries, 0 to 42910
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 6115 non-null   int64  
 1   Lat                6115 non-null   float64
 2   Lng                6115 non-null   float64
 3   Bump               6115 non-null   int8   
 4   Distance(mi)       6115 non-null   float64
 5   Crossing           6115 non-null   int8   
 6   Give_Way           6115 non-null   int8   
 7   Junction           6115 non-null   int8   
 8   No_Exit            6115 non-null   int8   
 9   Railway            6115 non-null   int8   
 10  Roundabout         6115 non-null   int8   
 11  Stop               6115 non-null   int8   
 12  Amenity            6115 non-null   int8   
 13  Side               6115 non-null   int8   
 14  Severity           6115 non-null   int64  
 15  timestamp          6115 non-null   object 
 16  holiday            6115

In [21]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

standard_scaler = StandardScaler()
df['Lat'] = standard_scaler.fit_transform(df[['Lat']])
df['Lng'] = standard_scaler.fit_transform(df[['Lng']])

df['Distance(mi)'] = standard_scaler.fit_transform(df[['Distance(mi)']])
#df['Wind_Chill(F)'] = standard_scaler.fit_transform(df[['Wind_Chill(F)']])
#df['Precipitation(in)'] = standard_scaler.fit_transform(df[['Precipitation(in)']])
df['Temperature(F)'] = standard_scaler.fit_transform(df[['Temperature(F)']])
df['Humidity(%)'] = standard_scaler.fit_transform(df[['Humidity(%)']])
df['Wind_Speed(mph)'] = standard_scaler.fit_transform(df[['Wind_Speed(mph)']])
df['Visibility(mi)'] = standard_scaler.fit_transform(df[['Visibility(mi)']])



In [22]:
df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Side,Severity,timestamp,holiday,Night,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)
0,0,-0.114285,0.014514,0,-0.226293,0,0,0,0,0,...,1,2,2016-03-25,0,0,17,0.590967,-0.626067,0.473775,0.349851
3,1,-1.432378,-1.481658,0,-0.336274,0,0,0,0,0,...,1,2,2020-05-05,0,1,3,0.852243,-0.56336,0.980869,0.349851
15,2,1.306666,1.377698,0,-0.336274,0,0,0,0,0,...,1,3,2016-09-16,0,1,14,0.478992,-0.187124,1.757355,0.349851
17,3,0.614769,0.526657,0,-0.313778,0,0,1,0,0,...,1,1,2020-03-29,1,1,1,-0.640761,0.56535,-0.762265,0.349851
24,4,-1.432869,-1.488789,0,-0.336274,0,0,0,0,0,...,1,2,2019-10-09,0,0,3,1.101077,-1.817483,0.505469,0.349851


In [23]:
df['Stop2'] = (df['Stop'] ) * df['Stop']



In [24]:
df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Severity,timestamp,holiday,Night,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Stop2
0,0,-0.114285,0.014514,0,-0.226293,0,0,0,0,0,...,2,2016-03-25,0,0,17,0.590967,-0.626067,0.473775,0.349851,0
3,1,-1.432378,-1.481658,0,-0.336274,0,0,0,0,0,...,2,2020-05-05,0,1,3,0.852243,-0.56336,0.980869,0.349851,0
15,2,1.306666,1.377698,0,-0.336274,0,0,0,0,0,...,3,2016-09-16,0,1,14,0.478992,-0.187124,1.757355,0.349851,1
17,3,0.614769,0.526657,0,-0.313778,0,0,1,0,0,...,1,2020-03-29,1,1,1,-0.640761,0.56535,-0.762265,0.349851,0
24,4,-1.432869,-1.488789,0,-0.336274,0,0,0,0,0,...,2,2019-10-09,0,0,3,1.101077,-1.817483,0.505469,0.349851,0


In [25]:
df[df.columns[1:]].corr()['Severity'][:]


Lat                  0.099134
Lng                  0.146227
Bump                      NaN
Distance(mi)        -0.010203
Crossing            -0.090825
Give_Way            -0.012278
Junction            -0.069739
No_Exit             -0.007088
Railway             -0.036607
Roundabout                NaN
Stop                 0.225454
Amenity             -0.075957
Side                 0.061346
Severity             1.000000
holiday              0.044284
Night                0.043302
Weather_Condition    0.127356
Temperature(F)       0.019045
Humidity(%)          0.033494
Wind_Speed(mph)      0.066247
Visibility(mi)       0.000278
Stop2                0.225454
Name: Severity, dtype: float64

The output shows desciptive statistics for the numerical features, `Lat`, `Lng`, `Distance(mi)`, and `Severity`. I'll use the numerical features to demonstrate how to train the model and make submissions. **However you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.**

## Data Splitting

Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely, the training, validation and test sets. In our case, the test set is already predefined. So we'll split the "training" set into training and validation sets with 0.8:0.2 ratio. 

*Note: a good way to generate reproducible results is to set the seed to the algorithms that depends on randomization. This is done with the argument `random_state` in the following command* 

In [26]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42) # Try adding `stratify` here

X_train = train_df.drop(columns=['ID', 'Severity'])
y_train = train_df['Severity']

X_val = val_df.drop(columns=['ID', 'Severity'])
y_val = val_df['Severity']


As pointed out eariler, I'll use the numerical features to train the classifier. **However, you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.** 

In [27]:
# # This cell is used to select the numerical features. IT SHOULD BE REMOVED AS YOU DO YOUR WORK.
# X_train = X_train[['Lat', 'Lng', 'Distance(mi)']]
# X_val = X_val[['Lat', 'Lng', 'Distance(mi)']]

# This cell is used to select the numerical zfeatures. IT SHOULD BE REMOVED AS YOU DO YOUR WORK.
# X_train = X_train[['Lat','Lng','Bump','Distance(mi)','Crossing','Give_Way','Junction','No_Exit','Railway','Roundabout','Stop','Amenity','Side','holiday','Weather_Condition','Wind_Chill(F)','Precipitation(in)','Temperature(F)','Humidity(%)','Wind_Speed(mph)','Visibility(mi)']]
# X_val = X_val[['Lat','Lng','Bump','Distance(mi)','Crossing','Give_Way','Junction','No_Exit','Railway','Roundabout','Stop','Amenity','Side','holiday','Weather_Condition','Wind_Chill(F)','Precipitation(in)','Temperature(F)','Humidity(%)','Wind_Speed(mph)','Visibility(mi)']]
#X_train = X_train[['Lat','Lng','Distance(mi)','Crossing','Give_Way','Junction','No_Exit','Railway','Stop','Amenity','Side','holiday','Night','Weather_Condition','Wind_Chill(F)','Precipitation(in)','Temperature(F)','Humidity(%)','Wind_Speed(mph)','Visibility(mi)']]
#X_val = X_val[['Lat','Lng','Distance(mi)','Crossing','Give_Way','Junction','No_Exit','Railway','Stop','Amenity','Side','holiday','Night','Weather_Condition','Wind_Chill(F)','Precipitation(in)','Temperature(F)','Humidity(%)','Wind_Speed(mph)','Visibility(mi)']]

X_train = X_train[['Stop','Weather_Condition','Lat','Lng','Side']]
X_val = X_val[['Stop','Weather_Condition','Lat','Lng','Side']]

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6115 entries, 0 to 42910
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 6115 non-null   int64  
 1   Lat                6115 non-null   float64
 2   Lng                6115 non-null   float64
 3   Bump               6115 non-null   int8   
 4   Distance(mi)       6115 non-null   float64
 5   Crossing           6115 non-null   int8   
 6   Give_Way           6115 non-null   int8   
 7   Junction           6115 non-null   int8   
 8   No_Exit            6115 non-null   int8   
 9   Railway            6115 non-null   int8   
 10  Roundabout         6115 non-null   int8   
 11  Stop               6115 non-null   int8   
 12  Amenity            6115 non-null   int8   
 13  Side               6115 non-null   int8   
 14  Severity           6115 non-null   int64  
 15  timestamp          6115 non-null   object 
 16  holiday            6115

## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [29]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

Now let's test our classifier on the validation dataset and see the accuracy.

In [30]:
print("The accuracy of the classifier on the validation set is ", (classifier.score(X_val, y_val)))

The accuracy of the classifier on the validation set is  0.749795584627964


In [31]:
print("The accuracy of the classifier on the train set is ", (classifier.score(X_train, y_train)))

The accuracy of the classifier on the train set is  0.7371218315617334


In [32]:
ypredd = classifier.predict(X_val)
from sklearn.metrics import classification_report
print(classification_report(y_val,ypredd))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        23
           2       0.74      0.99      0.84       841
           3       0.90      0.25      0.39       350
           4       0.00      0.00      0.00         9

    accuracy                           0.75      1223
   macro avg       0.41      0.31      0.31      1223
weighted avg       0.76      0.75      0.69      1223



  _warn_prf(average, modifier, msg_start, len(result))


Well. That's a good start, right? A classifier that predicts all examples' `Severity` as 2 will get around 0.63. You should get better score as you add more features and do better data preprocessing.

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [33]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
test_df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,timestamp
0,6407,37.78606,-122.3909,False,0.039,False,False,True,False,False,False,False,False,R,2016-04-04 19:20:31
1,6408,37.769609,-122.415057,False,0.202,False,False,False,False,False,False,False,False,R,2020-10-28 11:51:00
2,6409,37.807495,-122.476021,False,0.0,False,False,False,False,False,False,False,False,R,2019-09-09 07:36:45
3,6410,37.761818,-122.405869,False,0.0,False,False,True,False,False,False,False,False,R,2019-08-06 15:46:25
4,6411,37.73235,-122.4141,False,0.67,False,False,False,False,False,False,False,False,R,2018-10-17 09:54:58


In [34]:
test_df['timestamp']= pd.to_datetime(test_df['timestamp'])
df_holidays['date']= pd.to_datetime(df_holidays['date']).dt.date
df_weather['timestamp'] = pd.to_datetime(df_weather[['Year', 'Month','Day','Hour']].assign(Minute=0)).dt.date



In [35]:
test_df['holiday'] = 0
test_df['Night'] = 0

exists = False
for i in range(len(test_df['timestamp'])):
    if test_df['timestamp'][i].date().strftime("%A") == 'Saturday' or test_df['timestamp'][i].date().strftime("%A") == 'Sunday':
        exists = True
    if exists:
        test_df['holiday'][i] = 1
        #print(exists)
    exists = False
    
for i in range(len(test_df['timestamp'])):
    exists = test_df['timestamp'][i].date() in df_holidays['date']
   
    if exists:
        test_df['holiday'][i] = 1
        print(exists)
for i in range(len(test_df['timestamp'])):
    if 6 <= test_df['timestamp'][i].hour <= 17:
        pass
    else:
         test_df['Night'][i] = 1
        
test_df['timestamp'] = pd.to_datetime(test_df['timestamp']).dt.date

test_df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,timestamp,holiday,Night
0,6407,37.78606,-122.3909,False,0.039,False,False,True,False,False,False,False,False,R,2016-04-04,0,1
1,6408,37.769609,-122.415057,False,0.202,False,False,False,False,False,False,False,False,R,2020-10-28,0,0
2,6409,37.807495,-122.476021,False,0.0,False,False,False,False,False,False,False,False,R,2019-09-09,0,0
3,6410,37.761818,-122.405869,False,0.0,False,False,True,False,False,False,False,False,R,2019-08-06,0,0
4,6411,37.73235,-122.4141,False,0.67,False,False,False,False,False,False,False,False,R,2018-10-17,0,0


In [36]:
test_df = test_df.merge(df_weather,on='timestamp',how='left')
test_df = test_df.drop_duplicates(subset=['ID'])
test_df= test_df.drop(columns=['Wind_Chill(F)','Precipitation(in)'])
# class_counts = df['Severity'].value_counts()
# print(class_counts)
# class_weights = len(class_counts)/class_counts
# test_df = test_df.sample(
#     n=class_counts.max()*len(class_counts),
#     weights=test_df['Severity'].map(class_weights), 
#     replace=True)

# test_df.reset_indexs = True
# print(test_df['Severity'].value_counts())
# print(len(test_df))
# test_df.head()

In [37]:
test_df[test_df.isnull().any(axis=1)]


Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Year,Day,Month,Hour,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected
0,6407,37.786060,-122.390900,False,0.039,False,False,True,False,False,...,2016,4,4,22,Clear,55.9,75.0,,10.0,No
68,6419,37.741791,-122.398019,False,0.120,False,False,False,False,False,...,2018,16,2,8,Partly Cloudy,52.0,66.0,,10.0,No
137,6431,37.808498,-122.366852,False,0.000,False,False,False,False,False,...,2018,28,3,8,Partly Cloudy,59.0,72.0,,10.0,No
474,6473,37.778091,-122.401176,False,0.000,False,False,False,False,False,...,2017,1,9,23,Clear,84.0,22.0,,8.0,No
520,6479,37.732243,-122.432076,False,0.010,False,False,True,False,False,...,2017,10,3,14,Overcast,66.0,68.0,,10.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10884,7992,37.785542,-122.391380,False,0.000,False,False,True,False,False,...,2017,19,12,7,Mostly Cloudy,39.0,79.0,,9.0,No
10886,7993,37.769670,-122.415980,False,0.013,False,False,True,False,False,...,2016,13,12,19,Overcast,55.9,100.0,,9.0,No
10936,8000,37.808110,-122.367190,False,0.037,False,False,True,False,False,...,2017,15,3,5,Overcast,54.0,100.0,,10.0,No
10950,8002,37.730759,-122.471825,False,0.010,True,False,False,False,True,...,2017,30,1,15,Mostly Cloudy,60.1,49.0,,10.0,No


In [38]:
#test_df['Precipitation(in)'].fillna((test_df['Precipitation(in)'].mean()), inplace=True)
#test_df['Wind_Chill(F)'].fillna((test_df['Wind_Chill(F)'].mean()), inplace=True)
test_df['Wind_Speed(mph)'].fillna((test_df['Wind_Speed(mph)'].mean()), inplace=True)
test_df['Temperature(F)'].fillna((test_df['Temperature(F)'].mean()), inplace=True)
test_df['Humidity(%)'].fillna((test_df['Humidity(%)'].mean()), inplace=True)
test_df['Visibility(mi)'].fillna((test_df['Visibility(mi)'].mean()), inplace=True)
test_df['Weather_Condition'].fillna((test_df['Weather_Condition'].mode()[0]), inplace=True)

#test_df = test_df.dropna()


In [39]:
test_df[test_df.isnull().any(axis=1)]


Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Year,Day,Month,Hour,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected


In [40]:
test_df= test_df.drop(columns=['Year','Month','Day','Hour','Selected'])

In [41]:
test_df.describe()

Unnamed: 0,ID,Lat,Lng,Distance(mi),holiday,Night,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi)
count,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0,1601.0
mean,7207.0,37.765552,-122.40605,0.149761,0.191131,0.364772,60.197439,68.408495,10.999079,9.387758
std,462.313206,0.031883,0.028206,0.467515,0.393314,0.481517,8.247537,16.345161,5.97389,1.698804
min,6407.0,37.614687,-122.51044,0.0,0.0,0.0,39.0,10.0,0.0,0.25
25%,6807.0,37.735352,-122.414206,0.0,0.0,0.0,54.0,59.0,6.0,10.0
50%,7207.0,37.76871,-122.40485,0.0,0.0,0.0,59.0,70.0,10.4,10.0
75%,7607.0,37.786995,-122.39235,0.069,0.0,1.0,66.0,80.0,15.0,10.0
max,8007.0,37.819321,-122.358505,9.84,1.0,1.0,98.0,100.0,37.0,10.0


In [42]:
test_df["Bump"] = test_df["Bump"].astype('category').cat.codes
test_df["Crossing"] = test_df["Crossing"].astype('category').cat.codes
test_df["Give_Way"] = test_df["Give_Way"].astype('category').cat.codes
test_df["Junction"] = test_df["Junction"].astype('category').cat.codes
test_df["Roundabout"] = test_df["Roundabout"].astype('category').cat.codes
test_df["Amenity"] = test_df["Amenity"].astype('category').cat.codes
test_df["Railway"] = test_df["Railway"].astype('category').cat.codes
test_df["Stop"] = test_df["Stop"].astype('category').cat.codes
test_df["Side"] = test_df["Side"].astype('category').cat.codes
test_df["No_Exit"] = test_df["No_Exit"].astype('category').cat.codes
test_df["Weather_Condition"] = test_df["Weather_Condition"].astype('category').cat.codes


In [43]:
df.tail(10)

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Severity,timestamp,holiday,Night,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Stop2
42873,6397,1.288639,1.358449,0,-0.336274,0,0,1,0,0,...,2,2016-04-01,0,0,14,0.105741,-0.500654,-0.065012,0.349851,0
42877,6398,-0.302466,-0.118509,0,-0.336274,1,0,0,0,0,...,2,2019-12-25,0,0,2,-1.262845,0.189113,1.773202,0.349851,0
42887,6399,-0.672021,0.016833,0,-0.336274,0,0,0,0,0,...,2,2018-08-31,0,0,13,-0.528785,1.568648,-0.255172,0.349851,0
42888,6400,1.306452,1.360275,0,-0.336274,0,0,0,0,0,...,3,2019-05-11,1,1,1,-0.391927,0.56535,-0.762265,0.349851,1
42890,6401,0.680286,-1.207356,0,-0.336274,1,0,0,0,0,...,2,2016-06-28,0,0,14,-0.640761,0.753468,1.392882,0.349851,0
42892,6402,-0.774052,-0.065223,0,0.58356,0,0,0,0,0,...,3,2017-10-01,1,1,17,0.105741,-0.375242,1.028409,0.349851,0
42893,6403,-0.40232,0.115327,0,1.260937,0,0,1,0,0,...,2,2018-10-23,0,0,11,-0.528785,0.439938,-0.793958,0.349851,0
42897,6404,-1.213263,-1.40301,0,-0.336274,0,0,1,0,0,...,2,2019-10-28,0,0,3,-0.640761,-2.569956,-0.128398,0.349851,0
42906,6405,1.294158,1.365088,0,-0.336274,0,0,1,0,0,...,3,2019-05-04,1,0,3,0.354575,-0.626067,0.347002,0.349851,0
42910,6406,0.241198,-0.085772,0,-0.336274,1,0,0,0,0,...,2,2020-02-28,0,1,11,-1.014011,0.941586,0.347002,0.349851,0


Note that the test set has the same features and doesn't have the `Severity` column.
At this stage one must **NOT** forget to apply the same processing done on the training set on the features of the test set.

Now we'll add `Severity` column to the test `DataFrame` and add the values of the predicted class to it.

**I'll select the numerical features here as I did in the training set. DO NOT forget to change this step as you change the preprocessing of the training data.**

In [44]:
X_test = test_df.drop(columns=['ID'])

# You should update/remove the next line once you change the features used for training
X_test = X_test[['Stop','Weather_Condition','Lat','Lng','Side']]
print(X_test)
y_test_predicted = classifier.predict(X_test)

test_df['Severity'] = y_test_predicted

test_df.head()

       Stop  Weather_Condition        Lat         Lng  Side
0         0                  0  37.786060 -122.390900     1
5         0                  3  37.769609 -122.415057     1
10        0                  9  37.807495 -122.476021     1
17        0                  3  37.761818 -122.405869     1
20        0                 12  37.732350 -122.414100     1
...     ...                ...        ...         ...   ...
10955     0                  9  37.812973 -122.362335     1
10961     0                 12  37.761818 -122.405861     1
10963     0                  8  37.732260 -122.431970     1
10970     0                  3  37.786782 -122.390126     1
10980     0                  9  37.773040 -122.406570     1

[1601 rows x 5 columns]


Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,...,Side,timestamp,holiday,Night,Weather_Condition,Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Severity
0,6407,37.78606,-122.3909,0,0.039,0,0,1,0,0,...,1,2016-04-04,0,1,0,55.9,75.0,10.999079,10.0,2
5,6408,37.769609,-122.415057,0,0.202,0,0,0,0,0,...,1,2020-10-28,0,0,3,65.0,56.0,5.0,9.0,2
10,6409,37.807495,-122.476021,0,0.0,0,0,0,0,0,...,1,2019-09-09,0,0,9,58.0,90.0,18.0,10.0,2
17,6410,37.761818,-122.405869,0,0.0,0,0,1,0,0,...,1,2019-08-06,0,0,3,72.0,57.0,16.0,10.0,2
20,6411,37.73235,-122.4141,0,0.67,0,0,0,0,0,...,1,2018-10-17,0,0,12,55.0,83.0,15.0,10.0,2


Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.

In [45]:
test_df[['ID', 'Severity']].to_csv('/kaggle/working/submission.csv', index=False)

The remaining steps is to submit the generated file and are as follows. 

1. Press `Save Version` on the upper right corner of this notebook.
2. Write a `Version Name` of your choice and choose `Save & Run All (Commit)` then click `Save`.
3. Wait for the saved notebook to finish running the go to the saved notebook.
4. Scroll down until you see the output files then select the `submission.csv` file and click `Submit`.

Now your submission will be evaluated and your score will be updated on the leaderboard! CONGRATULATIONS!!

## Conclusion

In this notebook, we have demonstrated the essential steps that one should do in order to get "slightly" familiar with the data and the submission process. We chose not to go into details in each step to keep the welcoming notebook simple and make a room for improvement.

You're encourged to `Fork` the notebook, edit it, add your insights and use it to create your submission.