### ANALYSING THE TRAFFIC DEMOGRAPHICS IN THE UK - Modeling & Predictive Analytics

Richard Abraham

In [1]:
# Importing python libraries
import numpy as np
import pandas as pd

# To make all outputs show
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Reading in pre-processed and transformed data 

file = 'C:/Users/Admin/Desktop/BDA 106/Project/NEW_Dataset/Predictive_Modelling/predictive_analytics.csv'

dataset = pd.read_csv(file, low_memory = False)

In [3]:
dataset.head()

Unnamed: 0.1,Unnamed: 0,Carriageway_Hazards,Day_of_Week,Junction_Detail,Latitude,Light_Conditions,Local_Authority_(Highway),Longitude,Number_of_Casualties,Number_of_Vehicles,...,Vehicle_Leaving_Carriageway,Vehicle_Location.Restricted_Lane,Vehicle_Manoeuvre,Vehicle_Type,X1st_Point_of_Impact,Target_Severe_Indicator,Daytime,Ped_Cross_Human,Ped_Cross_Physical,Age_Band_of_Driver_order
0,2,,Friday,Not at junction or within 20 metres,51.482442,Daylight,Kensington and Chelsea,-0.173862,1,1,...,Did not leave carriageway,0.0,Going ahead other,Car,Front,0.0,office hours (10-15),None within 50 metres,No physical crossing facilities within 50 metres,5.0
1,4,,Tuesday,Not at junction or within 20 metres,51.51554,Daylight,Kensington and Chelsea,-0.203238,1,2,...,Did not leave carriageway,0.0,Moving off,Car,Did not impact,0.0,office hours (10-15),None within 50 metres,No physical crossing facilities within 50 metres,5.0
2,5,,Tuesday,Not at junction or within 20 metres,51.51554,Daylight,Kensington and Chelsea,-0.203238,1,2,...,Did not leave carriageway,0.0,Going ahead other,Motorcycle 125cc and under,Did not impact,0.0,office hours (10-15),None within 50 metres,No physical crossing facilities within 50 metres,3.0
3,7,,Thursday,T or staggered junction,51.512695,Darkness - lights lit,Kensington and Chelsea,-0.211277,1,2,...,Did not leave carriageway,0.0,Parked,Car,Back,0.0,evening (19-23),None within 50 metres,No physical crossing facilities within 50 metres,4.0
4,8,,Friday,Not at junction or within 20 metres,51.50226,Daylight,Kensington and Chelsea,-0.187623,2,1,...,Nearside,0.0,Going ahead other,Car,Front,0.0,afternoon rush (15-19),None within 50 metres,No physical crossing facilities within 50 metres,7.0


In [4]:
# Dropping unnamed column
dataset.drop(dataset.columns[0],axis=1,inplace=True)

In [5]:
dataset.head()

Unnamed: 0,Carriageway_Hazards,Day_of_Week,Junction_Detail,Latitude,Light_Conditions,Local_Authority_(Highway),Longitude,Number_of_Casualties,Number_of_Vehicles,Road_Surface_Conditions,...,Vehicle_Leaving_Carriageway,Vehicle_Location.Restricted_Lane,Vehicle_Manoeuvre,Vehicle_Type,X1st_Point_of_Impact,Target_Severe_Indicator,Daytime,Ped_Cross_Human,Ped_Cross_Physical,Age_Band_of_Driver_order
0,,Friday,Not at junction or within 20 metres,51.482442,Daylight,Kensington and Chelsea,-0.173862,1,1,Dry,...,Did not leave carriageway,0.0,Going ahead other,Car,Front,0.0,office hours (10-15),None within 50 metres,No physical crossing facilities within 50 metres,5.0
1,,Tuesday,Not at junction or within 20 metres,51.51554,Daylight,Kensington and Chelsea,-0.203238,1,2,Wet or damp,...,Did not leave carriageway,0.0,Moving off,Car,Did not impact,0.0,office hours (10-15),None within 50 metres,No physical crossing facilities within 50 metres,5.0
2,,Tuesday,Not at junction or within 20 metres,51.51554,Daylight,Kensington and Chelsea,-0.203238,1,2,Wet or damp,...,Did not leave carriageway,0.0,Going ahead other,Motorcycle 125cc and under,Did not impact,0.0,office hours (10-15),None within 50 metres,No physical crossing facilities within 50 metres,3.0
3,,Thursday,T or staggered junction,51.512695,Darkness - lights lit,Kensington and Chelsea,-0.211277,1,2,Dry,...,Did not leave carriageway,0.0,Parked,Car,Back,0.0,evening (19-23),None within 50 metres,No physical crossing facilities within 50 metres,4.0
4,,Friday,Not at junction or within 20 metres,51.50226,Daylight,Kensington and Chelsea,-0.187623,2,1,Dry,...,Nearside,0.0,Going ahead other,Car,Front,0.0,afternoon rush (15-19),None within 50 metres,No physical crossing facilities within 50 metres,7.0


In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1263105 entries, 0 to 1263104
Data columns (total 36 columns):
Carriageway_Hazards                 1263105 non-null object
Day_of_Week                         1263105 non-null object
Junction_Detail                     1263105 non-null object
Latitude                            1263105 non-null float64
Light_Conditions                    1263105 non-null object
Local_Authority_(Highway)           1263105 non-null object
Longitude                           1263105 non-null float64
Number_of_Casualties                1263105 non-null int64
Number_of_Vehicles                  1263105 non-null int64
Road_Surface_Conditions             1263105 non-null object
Road_Type                           1263105 non-null object
Special_Conditions_at_Site          1263105 non-null object
Speed_limit                         1263105 non-null float64
Urban_or_Rural_Area                 1263105 non-null object
Weather_Conditions                  1263105 no

#### Converting 'Object' to 'category' dtype - Saves memory

In [7]:
# Convert every non-numeric column to 'category' without computing describe() on the full frame
for col in dataset.select_dtypes(exclude='number').columns:
    dataset[col] = dataset[col].astype('category')
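
The memory saving from this conversion can be checked directly. The sketch below uses a small toy frame with made-up values (not the accident data) and compares `memory_usage(deep=True)` before and after converting the non-numeric columns; the actual saving on the real dataset depends on how many distinct values each column holds.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the accident data (hypothetical values).
rng = np.random.default_rng(0)
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
toy = pd.DataFrame({
    "Day_of_Week": rng.choice(days, size=10_000),
    "Speed_limit": rng.choice([30.0, 40.0, 60.0], size=10_000),
})

# Memory footprint while the text column still holds one string per row.
bytes_before = toy.memory_usage(deep=True).sum()

# Convert only the non-numeric columns, leaving floats and ints untouched.
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].astype("category")

# A category column stores small integer codes plus one copy of each label.
bytes_after = toy.memory_usage(deep=True).sum()
print(f"before: {bytes_before:,} bytes, after: {bytes_after:,} bytes")
```

The gain is largest for columns with few distinct values repeated many times, which is the typical shape of the categorical fields in this dataset.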

In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1263105 entries, 0 to 1263104
Data columns (total 36 columns):
Carriageway_Hazards                 1263105 non-null category
Day_of_Week                         1263105 non-null category
Junction_Detail                     1263105 non-null category
Latitude                            1263105 non-null float64
Light_Conditions                    1263105 non-null category
Local_Authority_(Highway)           1263105 non-null category
Longitude                           1263105 non-null float64
Number_of_Casualties                1263105 non-null int64
Number_of_Vehicles                  1263105 non-null int64
Road_Surface_Conditions             1263105 non-null category
Road_Type                           1263105 non-null category
Special_Conditions_at_Site          1263105 non-null category
Speed_limit                         1263105 non-null float64
Urban_or_Rural_Area                 1263105 non-null category
Weather_Conditions          

### Random Sampling - rows removed at random to speed up model run times (for testing purposes only)
Given the class imbalance in the target variable, stratified random sampling would be a better way to preserve the class proportions of the original dataset. Stratified sampling is not carried out here, however; simple random sampling is used instead.
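
For reference, a stratified subset could be drawn as sketched below. This is not what the notebook does; it is a minimal illustration on a toy frame (made-up values, roughly 15% positives) that uses `train_test_split` with `stratify` purely as a sampling tool and keeps about 13% of the rows, similar to the share kept in the next cell.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced binary target (hypothetical values).
rng = np.random.default_rng(150)
toy = pd.DataFrame({
    "Speed_limit": rng.choice([30.0, 40.0, 60.0, 70.0], size=20_000),
    "Target_Severe_Indicator": (rng.random(20_000) < 0.15).astype(float),
})

# Keep 2,600 of 20,000 rows (13%), sampling within each target class
# so that the class proportions of the full frame are preserved.
toy_subset, _ = train_test_split(
    toy,
    train_size=2_600,
    stratify=toy["Target_Severe_Indicator"],
    random_state=150,
)

full_share = toy["Target_Severe_Indicator"].mean()
subset_share = toy_subset["Target_Severe_Indicator"].mean()
print(round(full_share, 4), round(subset_share, 4))
```

With simple random sampling the two shares are only close in expectation; with stratification they match up to rounding.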

In [10]:
np.random.seed(150)

remove_n = 1100000 #Sample size to remove from original dataset
df = dataset
drop_indices = np.random.choice(df.index, remove_n, replace=False)
df_subset = df.drop(drop_indices)

In [11]:
df_subset.shape
df_subset.head()

(163105, 36)

Unnamed: 0,Carriageway_Hazards,Day_of_Week,Junction_Detail,Latitude,Light_Conditions,Local_Authority_(Highway),Longitude,Number_of_Casualties,Number_of_Vehicles,Road_Surface_Conditions,...,Vehicle_Leaving_Carriageway,Vehicle_Location.Restricted_Lane,Vehicle_Manoeuvre,Vehicle_Type,X1st_Point_of_Impact,Target_Severe_Indicator,Daytime,Ped_Cross_Human,Ped_Cross_Physical,Age_Band_of_Driver_order
1,,Tuesday,Not at junction or within 20 metres,51.51554,Daylight,Kensington and Chelsea,-0.203238,1,2,Wet or damp,...,Did not leave carriageway,0.0,Moving off,Car,Did not impact,0.0,office hours (10-15),None within 50 metres,No physical crossing facilities within 50 metres,5.0
6,,Tuesday,T or staggered junction,51.492622,Darkness - lights lit,Kensington and Chelsea,-0.157753,1,2,Wet or damp,...,Offside,0.0,Going ahead other,Car,Front,0.0,morning rush (5-10),None within 50 metres,No physical crossing facilities within 50 metres,3.0
7,,Tuesday,T or staggered junction,51.492622,Darkness - lights lit,Kensington and Chelsea,-0.157753,1,2,Wet or damp,...,Did not leave carriageway,0.0,Turning left,Car,Did not impact,0.0,morning rush (5-10),None within 50 metres,No physical crossing facilities within 50 metres,3.0
12,,Saturday,T or staggered junction,51.495498,Darkness - lights lit,Kensington and Chelsea,-0.174925,1,1,Dry,...,Did not leave carriageway,0.0,Going ahead other,Car,Front,1.0,night (23-5),None within 50 metres,No physical crossing facilities within 50 metres,3.0
49,,Saturday,Crossroads,51.517616,Daylight,Kensington and Chelsea,-0.203733,1,2,Dry,...,Did not leave carriageway,0.0,Going ahead other,Motorcycle 50cc and under,Front,0.0,morning rush (5-10),None within 50 metres,No physical crossing facilities within 50 metres,4.0


In [12]:
# Roughly 86% to 14% distribution of the target classes. Simple random sampling is expected
# to keep this close to the proportions of the original dataset
df_subset['Target_Severe_Indicator'].value_counts()

0.0    139876
1.0     23229
Name: Target_Severe_Indicator, dtype: int64

### Splitting target variable from predictor variables

In [13]:
dataset_X = df_subset.drop('Target_Severe_Indicator', axis=1)  
dataset_Y = df_subset['Target_Severe_Indicator']  

In [14]:
# Converting independent categorical features to Numerical by creating Dummy variables

dataset_X_dummy = pd.get_dummies(dataset_X)
#print(dataset_X_dummy.head())

In [15]:
dataset_X_dummy.shape

(163105, 753)

## Feature Selection

### 1. Applying VarianceThreshold filter

In [16]:
from sklearn.feature_selection import VarianceThreshold

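# threshold set to 85% for variance
# i.e. if a dummy (0/1) column has the same value in more than 85% of rows (low variation),
# the column is unlikely to be useful in the prediction. A 0/1 column with a share p of ones
# has variance p * (1 - p), so the cut-off is .85 * (1 - .85) = 0.1275
thresh=(.85 * (1 - .85))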
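
To make the threshold concrete, the sketch below applies it to two hypothetical dummy columns, one equal to 1 in 10% of rows and one in 40% of rows. The column names are invented for illustration. `VarianceThreshold` keeps only features whose variance is strictly above the threshold.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Two hypothetical 0/1 columns with different shares of ones.
toy = pd.DataFrame({
    "rare_dummy": np.r_[np.ones(10), np.zeros(90)],    # p = 0.10
    "common_dummy": np.r_[np.ones(40), np.zeros(60)],  # p = 0.40
})

thresh = 0.85 * (1 - 0.85)  # 0.1275

# Population variance (ddof=0) of a 0/1 column equals p * (1 - p):
# 0.10 * 0.90 = 0.09 and 0.40 * 0.60 = 0.24.
variances = toy.var(ddof=0)
print(variances)

# Fit the filter and list the columns whose variance exceeds the threshold.
vt = VarianceThreshold(threshold=thresh).fit(toy)
kept = list(toy.columns[vt.get_support()])
print(kept)
```

Note that the `p * (1 - p)` reading only holds for 0/1 columns. For continuous columns such as `Latitude` or `Engine_Capacity_.CC.` the same number is compared against the raw variance, which depends on the units of the column.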

In [17]:
# Wrapper function to identify low variance features and remove them from the dataframe 

def get_low_variance_columns(dframe=None, columns=None,
                             skip_columns=None, thresh=0.0,
                             autoremove=False):
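    # default value so the final `return` still works if an exception is raised early
    removed_features = []
    try: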
        # get list of all the original df columns
        all_columns = dframe.columns

        # remove `skip_columns`
        remaining_columns = all_columns.drop(skip_columns)

        # get length of new index
        max_index = len(remaining_columns) - 1

        # get indices for `skip_columns`
        skipped_idx = [all_columns.get_loc(column)
                       for column
                       in skip_columns]

        # adjust insert location by the number of columns removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_columns)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item

        # get values of `skip_columns`
        skipped_values = dframe.iloc[:, skipped_idx].values

        # get dataframe values
        X = dframe.loc[:, remaining_columns].values

        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)

        # fit vt to data
        vt.fit(X)

        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)

        # remove low-variance columns from index
        feature_names = list(remaining_columns[feature_indices])

        # get the columns to be removed
        removed_features = list(np.setdiff1d(remaining_columns,
                                             feature_names))
        print("Found {0} low-variance columns."
              .format(len(removed_features)))

        # remove the columns
        if autoremove:
            print("Removing low-variance features.")
            # remove the low-variance columns
            X_removed = vt.transform(X)

            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            dframe = pd.DataFrame(data=X_removed,
                                  columns=feature_names)

            # add back the `skip_columns`
            for idx, index in enumerate(skipped_idx):
                dframe.insert(loc=index,
                              column=skip_columns[idx],
                              value=skipped_values[:, idx])
            print("Succesfully removed low-variance columns.")

        # do not remove columns
        else:
            print("No changes have been made to the dataframe.")

    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")

    return dframe, removed_features

In [18]:
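# retrieve new dataframe (with low-variance features removed)
# the last argument (autoremove) is set to True to remove the low-variance columns
# note: the returned dataframe is rebuilt with a fresh 0..n-1 index,
# so it no longer shares index labels with dataset_Y
dataset_X_new, low_var_col = get_low_variance_columns(dataset_X_dummy,[],[],thresh, True) 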

Found 715 low-variance columns.
Removing low-variance features.
Reassembling the dataframe (with low-variance features removed).
Succesfully removed low-variance columns.


In [19]:
dataset_X_new.shape

(163105, 38)

In [20]:
dataset_X_new.head()

Unnamed: 0,Latitude,Longitude,Number_of_Casualties,Number_of_Vehicles,Speed_limit,Age_of_Vehicle,Engine_Capacity_.CC.,Vehicle_Location.Restricted_Lane,Age_Band_of_Driver_order,Day_of_Week_Friday,...,Sex_of_Driver_Female,Sex_of_Driver_Male,Vehicle_Manoeuvre_Going ahead other,Vehicle_Type_Car,X1st_Point_of_Impact_Back,X1st_Point_of_Impact_Front,Daytime_afternoon rush (15-19),Daytime_morning rush (5-10),Daytime_office hours (10-15),Ped_Cross_Physical_No physical crossing facilities within 50 metres
0,51.51554,-0.203238,1.0,2.0,30.0,1.0,2976.0,0.0,5.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
1,51.492622,-0.157753,1.0,2.0,30.0,2.0,698.0,0.0,3.0,0.0,...,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
2,51.492622,-0.157753,1.0,2.0,30.0,4.0,2148.0,0.0,3.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
3,51.495498,-0.174925,1.0,1.0,30.0,1.0,1997.0,0.0,3.0,0.0,...,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,51.517616,-0.203733,1.0,2.0,30.0,3.0,49.0,0.0,4.0,0.0,...,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


**Normalizing data** - min-max scaling: adjusting values measured on different scales to a notionally common scale (between 0 and 1)

In [21]:
dataset_X_normalized=(dataset_X_new-dataset_X_new.min())/(dataset_X_new.max()-dataset_X_new.min())
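
The cell above scales the whole dataset at once, so the minimum and maximum are computed from rows that later end up in the test set. An alternative, shown here only as a sketch on a toy frame with made-up values, is to split first and fit scikit-learn's `MinMaxScaler` on the training rows only, then apply the same scaling to the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy numeric frame (hypothetical values).
toy = pd.DataFrame({
    "Speed_limit": [20.0, 30.0, 30.0, 40.0, 50.0, 60.0, 70.0, 30.0],
    "Number_of_Vehicles": [1, 2, 2, 3, 1, 2, 4, 1],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

# Learn min and max from the training rows only.
scaler = MinMaxScaler().fit(toy_train)

# Apply the same (x - min) / (max - min) mapping to both splits,
# keeping the original column names and row labels.
train_scaled = pd.DataFrame(scaler.transform(toy_train),
                            columns=toy.columns, index=toy_train.index)
test_scaled = pd.DataFrame(scaler.transform(toy_test),
                           columns=toy.columns, index=toy_test.index)

print(train_scaled.describe().loc[["min", "max"]])
```

With this approach test values can fall slightly outside the 0 to 1 range when they lie beyond the training minimum or maximum, which is expected.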

In [22]:
dataset_X_normalized.head()

Unnamed: 0,Latitude,Longitude,Number_of_Casualties,Number_of_Vehicles,Speed_limit,Age_of_Vehicle,Engine_Capacity_.CC.,Vehicle_Location.Restricted_Lane,Age_Band_of_Driver_order,Day_of_Week_Friday,...,Sex_of_Driver_Female,Sex_of_Driver_Male,Vehicle_Manoeuvre_Going ahead other,Vehicle_Type_Car,X1st_Point_of_Impact_Back,X1st_Point_of_Impact_Front,Daytime_afternoon rush (15-19),Daytime_morning rush (5-10),Daytime_office hours (10-15),Ped_Cross_Physical_No physical crossing facilities within 50 metres
0,0.142774,0.788285,0.0,0.015152,0.2,0.0,0.106284,0.0,0.571429,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
1,0.140562,0.793194,0.0,0.015152,0.2,0.014493,0.02382,0.0,0.285714,0.0,...,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
2,0.140562,0.793194,0.0,0.015152,0.2,0.043478,0.07631,0.0,0.285714,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
3,0.140839,0.79134,0.0,0.0,0.2,0.0,0.070844,0.0,0.285714,0.0,...,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,0.142975,0.788231,0.0,0.015152,0.2,0.028986,0.000326,0.0,0.428571,0.0,...,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


In [23]:
df_X=dataset_X_normalized.round(3) 

In [24]:
df_X.head(3)

Unnamed: 0,Latitude,Longitude,Number_of_Casualties,Number_of_Vehicles,Speed_limit,Age_of_Vehicle,Engine_Capacity_.CC.,Vehicle_Location.Restricted_Lane,Age_Band_of_Driver_order,Day_of_Week_Friday,...,Sex_of_Driver_Female,Sex_of_Driver_Male,Vehicle_Manoeuvre_Going ahead other,Vehicle_Type_Car,X1st_Point_of_Impact_Back,X1st_Point_of_Impact_Front,Daytime_afternoon rush (15-19),Daytime_morning rush (5-10),Daytime_office hours (10-15),Ped_Cross_Physical_No physical crossing facilities within 50 metres
0,0.143,0.788,0.0,0.015,0.2,0.0,0.106,0.0,0.571,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
1,0.141,0.793,0.0,0.015,0.2,0.014,0.024,0.0,0.286,0.0,...,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
2,0.141,0.793,0.0,0.015,0.2,0.043,0.076,0.0,0.286,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0


## Dealing with Imbalanced Data

### 1. Change the Performance Metric before treating imbalance in dataset

#### Popular metric
The simplest metric for classification is accuracy, i.e. the ratio of correct predictions to the total number of samples in the dataset. With imbalanced classes, however, this metric can be misleading, because a high score says nothing about how well the minority class is predicted. A model can reach 99% accuracy and still perform poorly on the class of real interest (e.g. anomaly detection, where anomalies are rare in the dataset).

Therefore, we start by calculating alternative performance measures such as AUC, recall and F1 score.

The **AUC - ROC** curve is a performance measurement for classification problems across threshold settings. ROC is a probability curve (true positive rate against false positive rate) and AUC, the area under it, measures separability: how well the model distinguishes between the classes. The higher the AUC, the better the model is at ranking 1s above 0s.

**Recall** is the percentage of all actual positive cases that the algorithm correctly classifies as positive.

The **F1 score** is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
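
These measures can be computed with `sklearn.metrics`. The sketch below uses ten hand-made labels and scores (not model output from this notebook) so that each number can be checked by hand: there is 1 true positive, 1 false negative, 1 false positive and 7 true negatives.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hand-made example: 8 negatives, 2 positives (hypothetical values).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]            # hard class predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.5]  # P(class 1)

acc = accuracy_score(y_true, y_pred)     # (7 + 1) / 10 = 0.8
prec = precision_score(y_true, y_pred)   # 1 / (1 + 1) = 0.5
rec = recall_score(y_true, y_pred)       # 1 / (1 + 1) = 0.5
f1 = f1_score(y_true, y_pred)            # 2 * 0.5 * 0.5 / (0.5 + 0.5) = 0.5

# AUC uses the scores, not the hard predictions: it is the share of
# (positive, negative) pairs ranked correctly, here 15 of 16.
auc = roc_auc_score(y_true, y_score)

print(acc, prec, rec, f1, auc)
```

Accuracy looks good at 0.8 even though only half of the positive cases are found, which is the gap the alternative metrics are meant to expose.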

In [25]:
dataset_Y.value_counts()

0.0    139876
1.0     23229
Name: Target_Severe_Indicator, dtype: int64

In [26]:
# Using Logistic Regression to train model using imbalanced data (at first)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [27]:
# Splitting data into train and test
from sklearn.model_selection import train_test_split

In [28]:
# 80 train -20 test split
X_train, X_test, y_train, y_test = train_test_split(df_X, dataset_Y, test_size=0.2, random_state=42)

In [29]:
# Train model using the 'saga' solver which is faster for large datasets
clf_0 = LogisticRegression(solver='saga',max_iter=2500).fit(X_train, y_train)

In [30]:
# Predict on testing set
pred_y_0 = clf_0.predict(X_test)

In [31]:
print("Accuracy : ", round(accuracy_score(y_test, pred_y_0),4))

Accuracy :  0.8575
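
This accuracy is best read against a majority-class baseline. Using the class counts printed earlier, always predicting class 0 would already be right for about 85.8% of the sampled rows (the exact share in the test split may differ slightly). The sketch below computes that share and shows, on toy labels with a similar ratio, what such a baseline scores on accuracy and recall; `DummyClassifier` is used here only for illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported by dataset_Y.value_counts() above.
n_negative, n_positive = 139876, 23229
majority_share = n_negative / (n_negative + n_positive)
print(round(majority_share, 4))

# Toy labels with a similar class ratio (hypothetical data).
y_toy = np.array([0] * 857 + [1] * 143)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred_toy = baseline.predict(X_toy)

baseline_acc = accuracy_score(y_toy, pred_toy)
baseline_recall = recall_score(y_toy, pred_toy)
print(baseline_acc, baseline_recall)
```

A baseline with zero recall on the severe class reaches nearly the same accuracy as the logistic regression, which is why recall, F1 and AUC are examined next.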
