# Model Interpretation

At this point we have selected the SVM as our preferred model for making predictions. We will now study its behaviour by analyzing the records it misclassifies (the disengagement descriptions in the test set), which should give us some insight into how the model works.

In [1]:
import pickle
import pandas as pd
import numpy as np
import random

Let's load what we need:

In [4]:
# Dataframe
path_df = "Pickles/df.pickle"
with open(path_df, 'rb') as data:
    df = pickle.load(data)
    
# X_train
path_X_train = "Pickles/X_train.pickle"
with open(path_X_train, 'rb') as data:
    X_train = pickle.load(data)

# X_test
path_X_test = "Pickles/X_test.pickle"
with open(path_X_test, 'rb') as data:
    X_test = pickle.load(data)

# y_train
path_y_train = "Pickles/y_train.pickle"
with open(path_y_train, 'rb') as data:
    y_train = pickle.load(data)

# y_test
path_y_test = "Pickles/y_test.pickle"
with open(path_y_test, 'rb') as data:
    y_test = pickle.load(data)

# features_train
path_features_train = "Pickles/features_train.pickle"
with open(path_features_train, 'rb') as data:
    features_train = pickle.load(data)

# labels_train
path_labels_train = "Pickles/labels_train.pickle"
with open(path_labels_train, 'rb') as data:
    labels_train = pickle.load(data)

# features_test
path_features_test = "Pickles/features_test.pickle"
with open(path_features_test, 'rb') as data:
    features_test = pickle.load(data)

# labels_test
path_labels_test = "Pickles/labels_test.pickle"
with open(path_labels_test, 'rb') as data:
    labels_test = pickle.load(data)
    
# SVM Model
path_model = "Models/best_svc.pickle"
with open(path_model, 'rb') as data:
    svc_model = pickle.load(data)
    
# Category mapping dictionary
category_names = {0: 'Ambarella Corp.', 1: 'Apex.Ai, Inc.', 2: 'Box Bot Inc.',
                  3: 'DiDi Research America, LLC', 4: 'Gatik AI Inc.',
                  5: 'Intel Corporation', 6: 'RIDECELL INC', 7: 'ThorDrive, Inc.'}
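
The repeated open-and-load pattern in the cell above could be condensed with a small helper. The following is only a minimal sketch: `load_pickle` and `load_many` are hypothetical names that do not appear in the original notebook, and the demonstration writes throwaway files to a temporary folder instead of reading the real `Pickles/` directory.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load and return the object stored in a single pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_many(folder, names):
    """Return {name: object} for every '<folder>/<name>.pickle' file."""
    return {name: load_pickle(Path(folder) / f"{name}.pickle") for name in names}


# Demonstration on throwaway files, so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    for name, obj in {"X_test": [1, 2, 3], "y_test": [0, 1, 0]}.items():
        with open(Path(tmp) / f"{name}.pickle", "wb") as handle:
            pickle.dump(obj, handle)
    loaded = load_many(tmp, ["X_test", "y_test"])

print(loaded)
```

With the real files, a call such as `load_many("Pickles", ["df", "X_test", "features_test", "labels_test"])` would return the same objects as the explicit cell, keyed by name. As with any pickle, only files from a trusted source should be loaded this way.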

Let's get the predictions on the test set:

In [5]:
predictions = svc_model.predict(features_test)

Now we'll create the Test Set dataframe with the actual and predicted categories:

In [7]:
# Indexes of the test set
index_X_test = X_test.index

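# We get them from the original df (explicit copy, so adding columns below
# does not touch df or trigger a SettingWithCopyWarning)
df_test = df.loc[index_X_test].copy()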

# Add the predictions
df_test['Prediction'] = predictions

# Clean columns
df_test = df_test[['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT', 'Manufacturer', 'Category_Code', 'Prediction']]

# Decode
df_test['Category_Predicted'] = df_test['Prediction']
df_test = df_test.replace({'Category_Predicted':category_names})

# Clean columns again
df_test = df_test[['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT', 'Manufacturer', 'Category_Predicted']]

In [8]:
df_test.head()

| | DESCRIPTION OF FACTS CAUSING DISENGAGEMENT | Manufacturer | Category_Predicted |
|---|---|---|---|
| 356 | Software Discrepancy | Intel Corporation | Intel Corporation |
| 434 | Reckless driving road user that came from behi... | ThorDrive, Inc. | Box Bot Inc. |
| 429 | Incorrect behavior prediction of other partici... | ThorDrive, Inc. | Box Bot Inc. |
| 262 | Software Discrepancy | Intel Corporation | Intel Corporation |
| 181 | Need to manually drive around car stopped in lane | Box Bot Inc. | Box Bot Inc. |


Let's select the misclassified records, i.e. the rows where the predicted manufacturer differs from the actual one:

In [9]:
condition = (df_test['Manufacturer'] != df_test['Category_Predicted'])

df_misclassified = df_test[condition]

df_misclassified.head(3)

| | DESCRIPTION OF FACTS CAUSING DISENGAGEMENT | Manufacturer | Category_Predicted |
|---|---|---|---|
| 434 | Reckless driving road user that came from behi... | ThorDrive, Inc. | Box Bot Inc. |
| 429 | Incorrect behavior prediction of other partici... | ThorDrive, Inc. | Box Bot Inc. |
| 241 | Hardware discrepancy or system fault | Gatik AI Inc. | Box Bot Inc. |
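
Before reading individual rows, it can help to count which (actual, predicted) pairs account for the errors. The sketch below uses only the standard library and only the three rows displayed above, typed in by hand; it is an illustration, not a tally of the full `df_misclassified`.

```python
from collections import Counter

# Only the three misclassified rows displayed above, typed in by hand;
# the real df_misclassified may hold more rows.
misclassified_rows = [
    {"Manufacturer": "ThorDrive, Inc.", "Category_Predicted": "Box Bot Inc."},
    {"Manufacturer": "ThorDrive, Inc.", "Category_Predicted": "Box Bot Inc."},
    {"Manufacturer": "Gatik AI Inc.", "Category_Predicted": "Box Bot Inc."},
]

# Count each (actual, predicted) pair.
pair_counts = Counter(
    (row["Manufacturer"], row["Category_Predicted"]) for row in misclassified_rows
)

# Most frequent confusion first.
for (actual, predicted), count in pair_counts.most_common():
    print(f"{actual} -> {predicted}: {count}")
```

On the actual DataFrame, `pd.crosstab(df_misclassified['Manufacturer'], df_misclassified['Category_Predicted'])` should give the same tally in table form.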


Let's look at a sample of three misclassified records. We'll define a helper function that prints the actual category, the predicted category and the text of a row:

In [11]:
def output_article(row_article):
    """Print the actual category, predicted category and text of one test-set row."""
    print(f"Actual Category: {row_article['Manufacturer']}")
    print(f"Predicted Category: {row_article['Category_Predicted']}")
    print('-------------------------------------------')
    print('Text: ')
    print(row_article['DESCRIPTION OF FACTS CAUSING DISENGAGEMENT'])

We'll draw three random index values from the misclassified rows, fixing the seed so that the same sample is drawn on every run:

In [12]:
random.seed(8)
list_samples = random.sample(list(df_misclassified.index), 3)
list_samples

[430, 421, 341]
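
The role of the seed can be illustrated in isolation. This is a minimal sketch, assuming a hypothetical helper `sample_indexes` that is not part of the original notebook; it uses a local `random.Random` instance instead of the global `random.seed` call, and the index values are simply the six that appear in the outputs above, not the full index of `df_misclassified`.

```python
import random


def sample_indexes(indexes, k, seed):
    """Return k distinct items drawn from indexes, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(list(indexes), k)


# Index values copied from the outputs above, standing in for df_misclassified.index.
toy_indexes = [434, 429, 241, 430, 421, 341]

first = sample_indexes(toy_indexes, k=3, seed=8)
second = sample_indexes(toy_indexes, k=3, seed=8)

# The same seed and the same population give the same sample.
print(first == second)
```

Because the drawn values depend on the population as well as the seed, this toy list is not expected to reproduce the `[430, 421, 341]` shown in the notebook output.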

First case:

In [13]:
output_article(df_misclassified.loc[list_samples[0]])

Actual Category: ThorDrive, Inc.
Predicted Category: Box Bot Inc.
-------------------------------------------
Text: 
Perception error due to the occlusion on the corner.


Second case:

In [14]:
output_article(df_misclassified.loc[list_samples[1]])

Actual Category: RIDECELL INC
Predicted Category: Box Bot Inc.
-------------------------------------------
Text: 
Steering oscillations detected due to sensitive lateral controller calibration.  Safety Driver took over to drive manually.


Third case:

In [15]:
output_article(df_misclassified.loc[list_samples[2]])

Actual Category: Intel Corporation
Predicted Category: Box Bot Inc.
-------------------------------------------
Text: 
Other Road User


We can see that in all three cases the category is not clear-cut: the descriptions use wording that could fit both the actual and the predicted manufacturer (the third one is just "Other Road User"). Errors of this kind will always occur, and we do not expect the model to reach 100% accuracy on such records.