
## Introduction

The task is simple. You are given information about the environment of a car crash and must predict the severity of the crash as one of 4 levels. The predictive system is already built, but it needs good-quality data. You need to prepare the dataset and train the prediction system.
This predictive system will help the San Francisco Municipality anticipate the resources to engage depending on the severity of a crash.
### Descriptions
You're provided with data about the severity of car crashes. The file contains 16 features and covers car crashes in the city of San Francisco between 2016 and 2020.

### File descriptions
* <b>train.csv</b> - the training set.
* <b>holidays.xml</b> - Information about whether the day is a regular day or a holiday.
* <b>weather-sfcsv.csv</b> - Information about the weather.

### Data fields
* <b>Lat</b> - Latitude of the incident.
* <b>Lng</b> - Longitude of the incident.
* <b>Bump, Crossing, Give_Way, Junction, NoExit, RailWay, Roundabout, Stop, Amenity, Side</b> - The characteristics of the location where the incident took place; several can be true at the same time. Side is the side of the street.
* <b>State</b> - The state this dataset comes from.
* <b>Distance</b> - The length of the traffic jam caused by the accident.
* <b>Timestamp</b> - The moment when the incident occurred.
* <b>Severity</b> - (Target) An indicator of the severity of the car crash and its possible impact on traffic. Values range from 1 to 4; a higher value means a higher impact.

Severity is the target variable for this exercise. The target variable is the one we want to predict with the predictive system you will build during this Programming Challenge.

## Import the libraries



In [1]:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import os
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn import preprocessing
from sklearn.cluster import MiniBatchKMeans

## Exploratory Data Analysis
In this step, one should load the data and analyze it. However, I'll load the data and do minimal analysis. You are encouraged to do thorough analysis!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

### Helper Functions

In [None]:
def xml_to_df(xml):
    root = xml.getroot()
    get_range = lambda col: range(len(col))
    l = [{r[i].tag:r[i].text for i in get_range(r)} for r in root]
    df = pd.DataFrame.from_dict(l)
    return df

def summarize(df):
    separator = 50 * "="
    print("The shape of the dataset is {}.".format(df.shape))
    print(separator)
    print(df.head())
    # df.info() prints its report itself and returns None, so call it directly
    df.info()
    print(separator)
    print(df.describe())
    print(separator, "\n\n")
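
To see what `xml_to_df` produces, here is a minimal sketch that runs the same logic on a small in-memory document. The tag names `holidays` and `row` and the two records are assumptions made for illustration; only `date` and `description` are column names the later preprocessing code relies on, and the real `holidays.xml` may be laid out differently.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd


def xml_to_df(xml):
    # Same idea as the helper above: one row per child of the root,
    # one column per tag found inside that child.
    root = xml.getroot()
    rows = [{child.tag: child.text for child in record} for record in root]
    return pd.DataFrame(rows)


# Hypothetical two-record document (invented values).
sample_xml = """<holidays>
  <row><date>2016-01-01</date><description>New Year Day</description></row>
  <row><date>2016-07-04</date><description>Independence Day</description></row>
</holidays>"""

tree = ET.parse(io.StringIO(sample_xml))
holidays_sample = xml_to_df(tree)
print(holidays_sample)
```

Every value comes back as a string, which is why `preprocess` later converts the `date` column with `pd.to_datetime`.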

## Preprocess & Merge Data

In [None]:
def reformat_timestamp(main):
    acc_date = []
    acc_hour = []
    main['timestamp'] = pd.to_datetime(main['timestamp'])
    for i in main["timestamp"]:
        acc_date.append(i.date())
        acc_hour.append(i.time().hour)
    main["acc_date"] = acc_date
    main["acc_date"] = main["acc_date"].astype('datetime64[ns]')
    main["acc_hour"] = acc_hour

    main.drop(['timestamp'], axis=1, inplace=True)

    
def merge_with_main(main, weather, holidays):
    data = pd.merge(main, holidays, how='left', left_on='acc_date', right_on='date')
    data = pd.merge(data, weather, how='inner', left_on=['acc_date','acc_hour'], right_on=['w_date','Hour'], left_index=False, right_index=False)
    # Keep one row per crash, then renumber the index (reset_index returns a new frame)
    data = data.drop_duplicates('ID', keep="last").reset_index(drop=True)
    return data
    

def fatorize_column(df, column):
    codes, uniques = df[column].factorize()
    df[column] = codes
    return df
    
    
def normalize(df, columns):
    result = df.copy()
    for column in columns:
        max_value = df[column].max()
        min_value = df[column].min()
        # Mean normalization: subtract the mean first, then divide by the range
        result[column] = (df[column] - df[column].mean()) / (max_value - min_value)
    return result    
    
    
def preprocess(main, weather, holidays):
    # Dropping main useless columns
    
    main.drop(columns=['Bump',"Distance(mi)", 'Give_Way', 'No_Exit', 'Roundabout'], inplace=True)
    reformat_timestamp(main)
    
    # Dropping weather useless columns
    intersect = set(weather.columns).intersection(['Selected'])
    if(len(intersect)):
        weather.drop(columns=['Selected', "Wind_Chill(F)",'Humidity(%)', "Precipitation(in)"], inplace=True)
    weather['w_date'] = pd.to_datetime(weather[['Year', 'Month', 'Day']])

    holidays['date'] = pd.to_datetime(holidays['date'])
    
    # Joining main data with weather and holidays on date, without duplicates
    data = merge_with_main(main, weather, holidays)
    
    # Replace bool values with (0,1)
    #data.replace({False: 0, True: 1}, inplace=True)
    
    # Replace (R,L) values with (0,1)
    data.replace({"R": 0, "L": 1}, inplace=True)
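    # Bin the hour of day into six 4-hour periods (hours 0-3 -> 1, ..., 20-23 -> 6)
    data['Hour'] = (data['Hour'] % 24 + 4) // 4
    # Assign the result back instead of calling replace(inplace=True) on a column selection
    data['Hour'] = data['Hour'].replace({1: 'Late Night',
                                         2: 'Early Morning',
                                         3: 'Morning',
                                         4: 'Noon',
                                         5: 'Evening',
                                         6: 'Night'})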
    # Factorize Categorical Columns
    data = fatorize_column(data, "Weather_Condition")
    data = fatorize_column(data, "description")
    data = fatorize_column(data, "Hour")
    #data = fatorize_column(data, 'day-of-week')

    # Dropping merged useless columns
    data.drop(columns=['acc_date', 'acc_hour', 'date'], inplace=True)
    data['day-of-week'] = data['w_date'].dt.dayofweek
    data['w_date'] = data['w_date'].apply(lambda x: x.value) / 10**9



    # Impute NaN values in numeric columns with the mean of each column
    data.fillna(data.mean(numeric_only=True), inplace=True)
    
    # Dropping rows that have NAN values
#     data.dropna(inplace=True)

    # Normalizing some numerical columns
    #data = normalize(data, [ "w_date",'Temperature(F)', 'Wind_Speed(mph)', 'Visibility(mi)'])
    # Compress the three weather measurements into a single principal component
    A = data[['Temperature(F)', 'Wind_Speed(mph)', 'Visibility(mi)']]
    pca = PCA(n_components=1)   # instantiate the model with its hyperparameters
    pca.fit(A)                  # fit to data; PCA is unsupervised, so no y is passed
    X_A = pca.transform(A)
    data["weather1"] = X_A[:, 0]
    data = data.drop(columns=['Temperature(F)', 'Wind_Speed(mph)', 'Visibility(mi)'])
    
    
#     B=data[['Lat',
#        'Lng']]
#     pca = PCA(n_components=2)            # 2. Instantiate the model with hyperparameters
#     pca.fit(B)                      # 3. Fit to data. Notice y is not specified!
#     X_B = pca.transform(B)
#     data["Lat"]=X_B[:,0:1]
#     data["Lng"]=X_B[:,1:2]
    

    data=data.drop(columns=['Year', 'Month', 'Hour',"Day","weather1","Weather_Condition","description"])
    return data
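
The join logic in `merge_with_main` can be easier to follow on a tiny example. The sketch below uses invented toy frames, with the column names taken from the code above, to show how the left join keeps every crash, how the inner join drops crashes without a weather reading, and how `drop_duplicates` collapses repeated readings for the same hour.

```python
import pandas as pd

# Toy frames; all values are invented for illustration.
main = pd.DataFrame({
    "ID": [1, 2, 3],
    "acc_date": pd.to_datetime(["2016-07-04", "2016-07-05", "2016-07-06"]),
    "acc_hour": [8, 9, 10],
})
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-04"]),
    "description": ["Independence Day"],
})
weather = pd.DataFrame({
    "w_date": pd.to_datetime(["2016-07-04", "2016-07-04", "2016-07-05"]),
    "Hour": [8, 8, 9],
    "Temperature(F)": [60.0, 61.0, 58.0],
})

# Left join: every crash is kept; non-holidays get NaN in `description`.
data = pd.merge(main, holidays, how="left", left_on="acc_date", right_on="date")

# Inner join: crashes with no weather reading for that date and hour are dropped,
# and crashes with several readings appear once per reading.
data = pd.merge(data, weather, how="inner",
                left_on=["acc_date", "acc_hour"], right_on=["w_date", "Hour"])

# Keep one row per crash (the last matching weather reading).
data = data.drop_duplicates("ID", keep="last").reset_index(drop=True)
print(data[["ID", "description", "Temperature(F)"]])
```

In this toy case crash 3 disappears because no weather row matches it, so the inner join can silently shrink the dataset. Comparing row counts before and after the merge is a cheap sanity check.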

We've got 6407 examples in the dataset with 14 features, 1 ID, and the `Severity` of the crash.

Looking at the features and a sample of the data, the features appear to be of numerical and categorical types. What about some descriptive statistics?

## Load, Clean, Normalize, Impute & Merge Data

In [None]:
# Loading Data
dataset_path = '/Datasets/car-crashes-severity-prediction/'
#dataset_path = './'
df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))

main = pd.read_csv(os.path.join(dataset_path, 'train.csv'),index_col=None)


In [None]:
q_low = main["Lat"].quantile(0.25)
q_hi = main["Lat"].quantile(0.999)
main = main[(main["Lat"] < q_hi) & (main["Lat"] > q_low)]
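
The filter above keeps rows whose `Lat` lies strictly between the 0.25 and 0.999 quantiles. As a rough check of how many rows such a filter removes, this sketch applies the same thresholds to synthetic, evenly spaced latitudes (not values from `train.csv`).

```python
import numpy as np
import pandas as pd

# Synthetic latitudes, evenly spaced; not taken from train.csv.
main = pd.DataFrame({"Lat": np.linspace(37.0, 38.0, 1001)})

q_low = main["Lat"].quantile(0.25)
q_hi = main["Lat"].quantile(0.999)
filtered = main[(main["Lat"] < q_hi) & (main["Lat"] > q_low)]

# Share of rows that survive the filter
kept_fraction = len(filtered) / len(main)
print(f"kept {kept_fraction:.1%} of the rows")
```

Because the lower bound is the 0.25 quantile, roughly a quarter of the rows are dropped. If the goal is only to remove extreme outliers, a much smaller lower quantile may be worth trying.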


In [None]:
weather = pd.read_csv(os.path.join(dataset_path, 'weather-sfcsv.csv'),index_col=None)
holidays_tree = ET.parse(os.path.join(dataset_path, 'holidays.xml'))
holidays = xml_to_df(holidays_tree)

# clean & preprocess
data = preprocess(main, weather, holidays)

The output shows descriptive statistics for the numerical features, `Lat`, `Lng`, `Distance(mi)`, and `Severity`. I'll use the numerical features to demonstrate how to train the model and make submissions. **However, you shouldn't use only the numerical features for the final submission if you want to make it to the top of the leaderboard.**

## Data Splitting

Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely, the training, validation and test sets. In our case, the test set is already predefined. So we'll split the "training" set into training and validation sets with 0.8:0.2 ratio. 

*Note: a good way to generate reproducible results is to set the seed of any algorithm that depends on randomization. This is done with the argument `random_state` in the following command.*

In [None]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(data, test_size=0.2, random_state=42) # Try adding `stratify` here

X_train = train_df.drop(columns=['ID','Severity'])
y_train = train_df['Severity']

X_val = val_df.drop(columns=['ID','Severity'])
y_val = val_df['Severity']
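
The comment in the split above suggests trying `stratify`. A minimal sketch of what that looks like, using a synthetic imbalanced frame (the class proportions are invented and are not those of the real data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target; the proportions are invented.
data = pd.DataFrame({
    "ID": range(100),
    "Lat": [37.7 + i / 1000 for i in range(100)],
    "Severity": [2] * 70 + [3] * 20 + [1] * 10,
})

# Passing the target to `stratify` keeps class proportions equal in both splits.
train_df, val_df = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["Severity"]
)

print(val_df["Severity"].value_counts().sort_index())
```

With stratification, rare severity levels are represented in the validation set in the same proportion as in the training set, which tends to make the validation accuracy less dependent on the random seed.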


As pointed out earlier, I'll use the numerical features to train the classifier. **However, you shouldn't use only the numerical features for the final submission if you want to make it to the top of the leaderboard.**

## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

Now let's test our classifier on the validation dataset and see the accuracy.

In [None]:
print("The accuracy of the classifier on the validation set is ", round(classifier.score(X_val, y_val)*100, 4), "%")

Well. That's a good start, right? A classifier that predicts every example's `Severity` as 2 will get an accuracy of around 0.63. You should get a better score as you add more features and improve the data preprocessing.
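
One way to make that majority-class baseline explicit is scikit-learn's `DummyClassifier`. The sketch below uses synthetic labels in which class 2 makes up 63% of the rows; the label counts are invented to mirror the figure quoted above, not read from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels where class 2 makes up 63% of the rows (invented data).
y = np.array([2] * 63 + [3] * 30 + [1] * 5 + [4] * 2)
X = np.zeros((len(y), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(f"majority-class accuracy: {baseline_accuracy:.2f}")
```

On the real data, fitting such a baseline on `X_train`, `y_train` and scoring it on `X_val`, `y_val` gives the number a trained model has to beat.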

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [None]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
test_df = preprocess(test_df, weather, holidays)
test_df.info()

In [None]:
X_test = test_df.drop(columns=['ID'])

# You should update/remove the next line once you change the features used for training
# X_test = X_test[['Lat', 'Lng', 'Distance(mi)']]

y_test_predicted = classifier.predict(X_test)

test_df['Severity'] = y_test_predicted

test_df.head()

Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.
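
A minimal sketch of that last step, assuming a frame shaped like `test_df` after prediction. The stand-in values and the output file name mentioned in the comment are hypothetical.

```python
import io

import pandas as pd

# Stand-in for the predicted test frame; values are invented.
test_df = pd.DataFrame({
    "ID": [101, 102, 103],
    "Lat": [37.76, 37.77, 37.78],
    "Severity": [2, 2, 3],
})

# Keep only the two columns the submission needs
submission = test_df[["ID", "Severity"]]

# Written to an in-memory buffer here; pass a file name such as
# "submission.csv" (hypothetical) to write to disk instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` keeps the pandas row index out of the file, so the CSV contains exactly the `ID` and `Severity` columns.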