# 02 - Splitting data into testing and training sets
___

Step 2! We have downloaded and joined the data for Casualties and Collisions over the last 5 years. Let's now split our data into training and testing sets so that we don't peek at our testing data, which should be treated as completely unseen.

There are a couple of key things to think about here:
  1. There may be multiple entries for each accident_index where a collision had multiple casualties. We need to make sure that all entries sharing an accident_index are kept together in either the training or the testing set. If the same accident is across both, then the model will have already 'seen' this accident and we will have a severe data leakage situation on our hands!
  2. Our classification categories are really severely unbalanced. We need to stratify the data split so that we preserve the same proportion of examples of each class into the testing and training data. Down the line we could consider bootstrapping the data to synthetically increase the proportion of "severe" classifications, but as you can imagine this can be tricky as we will overweight the model towards the features of the bootstrapped collisions.
  3. We only have 14,179 total data entries. On the ML-scale, this is quite a small dataset so instead of splitting the data into training, validation, and testing data, we will instead use k-fold cross-validation on the training data to use as much data as possible for model training.

First, let's import our libraries and read in the data .csv that we just made.

---

#### Note that we will use the 'casualty_severity' column as our target (the variable we want to predict). This is important to use instead of accident_severity, as some multi-person collisions have a mix of slight and serious injuries. It is unfortunate that the name is still 'accident' rather than 'collision'!

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [10]:
data = pd.read_csv('dft_statistics_collision_and_casualty_last_5_years.csv', low_memory = False)

In the cell below, we can see that the classes in the dataset are really unbalanced: 73% of the casualties were slight, 26% were serious, and only 1% were fatal.

If we want to use our model to try and reduce the number of road collisions that have serious consequences, then I don't think we need to distinguish between fatal or serious collisions. Let's combine these into one 'Serious' class.

This means that our classes are still unbalanced, 73% versus 27%. We need to bear this in mind when we evaluate the performance of our model, and preserve these proportions between our testing and training datasets.

In [11]:
np.round(data.casualty_severity.value_counts(normalize=True)*100)

casualty_severity
Slight     73.0
Serious    26.0
Fatal       1.0
Name: proportion, dtype: float64

In [12]:
data['casualty_severity'] = data['casualty_severity'].replace('Fatal', 'Serious')
np.round(data.casualty_severity.value_counts(normalize=True)*100)

casualty_severity
Slight     73.0
Serious    27.0
Name: proportion, dtype: float64
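
To make the evaluation point concrete, here is a small sketch with made-up labels in the same 73% / 27% proportions. It compares plain accuracy with balanced accuracy for a "model" that always predicts the majority class. The choice of `balanced_accuracy_score` is just for illustration, not a decision about which metric this project will use.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels mirroring the 73% / 27% class balance
y_true = np.array(['Slight'] * 73 + ['Serious'] * 27)

# A "model" that ignores its inputs and always predicts the majority class
y_pred = np.full(y_true.shape, 'Slight')

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of the per-class recalls

print(f'accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}')
```

Plain accuracy looks respectable even though this model never finds a single serious casualty, which is why the class balance has to be kept in mind when judging results.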

Now let's separate our training and test data to prevent any data leakage. We need to be a little bit clever here, as accidents with multiple casualties have multiple entries. We need to make sure that the duplicate accident_index entries are not split between the train and test sets, as this would be quite a severe leakage of information.

To do this, let's make a dummy dataframe with only the unique accident_indexes, randomly split between test and train as normal, and then append the duplicate crash entries to the appropriate set (train or test.)

We should also be aware that we have unbalanced classes in the casualty_severity target, so we should stratify our split (easily done using scikit-learn).

In [13]:
print(f'There are {len(data)} total entries in the dataset.')
print(f'There are {data.accident_index.nunique()} unique accident indexes in the dataset.')

data_no_dup = data.groupby('accident_index').first().reset_index()
print(f'The new dataset without duplicates has {data_no_dup.accident_index.nunique()} unique accident indexes and {len(data_no_dup)} total entries')

There are 14179 total entries in the dataset.
There are 10688 unique accident indexes in the dataset.
The new dataset without duplicates has 10688 unique accident indexes and 10688 total entries
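
One thing worth knowing about `groupby(...).first()`: by default it takes the first non-null value of each column separately, so a "first" row can be stitched together from several casualties if there are missing values. The sketch below shows this on a made-up three-row dataframe (the `age_of_casualty` column and its values are hypothetical) and compares it with `drop_duplicates`, which keeps the first row as it is.

```python
import numpy as np
import pandas as pd

# Made-up data: collision A1 has two casualties, the first with a missing age
toy = pd.DataFrame({
    'accident_index': ['A1', 'A1', 'A2'],
    'age_of_casualty': [np.nan, 34.0, 50.0],   # hypothetical column
    'casualty_severity': ['Slight', 'Serious', 'Slight'],
})

# first() skips nulls column by column: A1 gets age 34.0 but severity 'Slight'
via_first = toy.groupby('accident_index').first().reset_index()

# drop_duplicates keeps the first row untouched: A1 keeps its missing age
via_drop = toy.drop_duplicates(subset='accident_index').reset_index(drop=True)

print(via_first)
print(via_drop)
```

Both give one row per accident_index, so either works for building the list of unique collisions to split on.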


Let's just do a quick check that we've done that correctly.

In [14]:
# compare as sets so the check does not depend on row order
set(data_no_dup.accident_index) == set(data.accident_index)

True

Here, we're using scikit-learn's train_test_split to randomly split the dataset without duplicate accident_indexes into 80% training data and 20% testing data. We need to specify the random_state so that we get the exact same results every time we run this code. This is really important for reproducibility of our modelling. We are specifying the 'casualty_severity' target column as the column to stratify the data on. This means we will preserve the balance of the Slight vs. Serious classes.

After we've split the unique collisions, I'm pulling every row of the original dataset whose accident_index landed in each split. That way all casualties from a collision end up in the same set, and each row keeps its own per-casualty values.

In [15]:
train_no_dup, test_no_dup = train_test_split(data_no_dup, test_size=0.2, stratify=data_no_dup['casualty_severity'], random_state=42)

# keep every original row whose accident_index belongs to each split
train = data[data['accident_index'].isin(train_no_dup['accident_index'])].reset_index(drop=True)
test = data[data['accident_index'].isin(test_no_dup['accident_index'])].reset_index(drop=True)
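
Because the stratification happens on one row per collision, the class balance at the casualty (row) level is only approximately preserved once the extra casualties are added back. Here is a minimal sketch of how that could be checked, on made-up toy data. The use of `drop_duplicates` and `isin` is a choice for this sketch, and the `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 40 collisions, some with a second, more serious casualty
rows = []
for i in range(40):
    first = 'Serious' if i % 4 == 0 else 'Slight'
    rows.append({'accident_index': f'A{i:03d}', 'casualty_severity': first})
    if i % 4 == 1:
        rows.append({'accident_index': f'A{i:03d}', 'casualty_severity': 'Serious'})
toy = pd.DataFrame(rows)

# One row per collision, then a stratified split on that row's label
toy_no_dup = toy.drop_duplicates(subset='accident_index')
toy_train_ids, toy_test_ids = train_test_split(
    toy_no_dup, test_size=0.2,
    stratify=toy_no_dup['casualty_severity'], random_state=42)

# Bring back every casualty belonging to the collisions in each split
toy_train = toy[toy['accident_index'].isin(toy_train_ids['accident_index'])]
toy_test = toy[toy['accident_index'].isin(toy_test_ids['accident_index'])]

# Row-level class proportions in each split
print(toy_train['casualty_severity'].value_counts(normalize=True).round(2))
print(toy_test['casualty_severity'].value_counts(normalize=True).round(2))
```

If the two sets of proportions come out close to each other on the real data, the stratification has done its job well enough at the casualty level too.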



We should make sure we're happy with the splits. Let's check we have the correct number of data entries, the correct number of unique collisions, and finally a quick test for any leakage of the accident indexes between the training and testing datasets. By inner merging on the accident_index we only keep entries that are common across the two datasets. As there aren't any then our splitting has worked well, woohoo!

In [16]:
train.shape[0] + test.shape[0]

14179

In [17]:
train.accident_index.nunique() + test.accident_index.nunique()

10688

In [21]:
# isin compares against the values; `in` on a Series would check its index labels instead
train.accident_index.isin(test.accident_index).any()

False

In [18]:
dup_check = train.merge(test, on='accident_index', how='inner')
print(dup_check.shape[0])

0
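
A side note on membership checks like the ones above: Python's `in` operator on a pandas Series looks at the index labels, not the values, so it is easy to write a leakage test that can never fail. A tiny sketch with made-up accident indexes:

```python
import pandas as pd

s = pd.Series(['A001', 'A002', 'A003'])  # index labels are 0, 1, 2

label_hit = 0 in s                          # True: 0 is an index label
value_miss = 'A001' in s                    # False: 'A001' is a value, not a label
value_hit = 'A001' in set(s)                # True: a set of the values
isin_hit = bool(s.isin(['A001']).any())     # True: vectorised check on the values

print(label_hit, value_miss, value_hit, isin_hit)
```

For checks between two columns, `isin` (or an inner merge, as in the cell above) tests the values themselves.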


To finish, let's export the data to read in next.

In [19]:
# to_csv returns None when given a file path, so there is nothing useful to assign
train.to_csv('training_data.csv', index=False)
test.to_csv('testing_data.csv', index=False)

___
### We're happy with our separated data so now let's do some data wrangling!