# Model Interpretation

We have selected the SVM as our preferred model for prediction. We will now study its behaviour by analyzing misclassified articles, which should give us some insight into how the model works.

In [1]:
import pickle
import pandas as pd
import numpy as np
import random

Let's load what we need:

In [2]:
def load_pickle(path):
    """Load and return the object stored in a pickle file."""
    with open(path, 'rb') as data:
        return pickle.load(data)

# Folder with the outputs of the feature engineering step
path_pickles = "../03. Feature Engineering/Pickles/"

# Dataframe
df = load_pickle(path_pickles + "df.pickle")

# X_train, X_test
X_train = load_pickle(path_pickles + "X_train.pickle")
X_test = load_pickle(path_pickles + "X_test.pickle")

# y_train, y_test
y_train = load_pickle(path_pickles + "y_train.pickle")
y_test = load_pickle(path_pickles + "y_test.pickle")

# features_train, labels_train
features_train = load_pickle(path_pickles + "features_train.pickle")
labels_train = load_pickle(path_pickles + "labels_train.pickle")

# features_test, labels_test
features_test = load_pickle(path_pickles + "features_test.pickle")
labels_test = load_pickle(path_pickles + "labels_test.pickle")

# SVM Model
path_model = "../04. Model Training/Models/best_svc.pickle"
svc_model = load_pickle(path_model)
    
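# Category mapping dictionary
# Note: these names must match the labels stored in df['Category'] exactly.
# The dataframe outputs in this notebook show the label 'Other', not 'Others'.
category_codes = {
    'Autonomous Cars': 0,
    'Others': 1,
}

# Inverse mapping, used to decode the predicted codes into names
category_names = {
    0: 'Autonomous Cars',
    1: 'Others',
}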
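
Keeping two hand-written dictionaries in sync is error-prone, and a label that is spelled differently in the data goes unnoticed. The following is a minimal sketch, assuming the labels stored in `df['Category']` are meant to be the keys of `category_codes`: it derives the inverse mapping and lists any label that the mapping does not know. The `unknown_labels` helper and the `observed_labels` list are hypothetical and only serve as illustration.

```python
# Category mapping as defined in this notebook
category_codes = {
    'Autonomous Cars': 0,
    'Others': 1,
}

# Derive the inverse mapping instead of writing it by hand
category_names = {code: name for name, code in category_codes.items()}

def unknown_labels(labels, known):
    """Return the sorted labels that are missing from the mapping."""
    return sorted(set(labels) - set(known))

# Hypothetical labels, standing in for df['Category'].unique()
observed_labels = ['Autonomous Cars', 'Other']

print(category_names)
print(unknown_labels(observed_labels, category_codes))
```

An empty list from `unknown_labels` would mean every label in the data can be encoded and decoded consistently.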

Let's get the predictions on the test set:

In [3]:
predictions = svc_model.predict(features_test)

Now we'll create the Test Set dataframe with the actual and predicted categories:

In [4]:
# Indexes of the test set
index_X_test = X_test.index

# Select the test rows from the original df
# (explicit copy, so adding columns does not trigger SettingWithCopyWarning)
df_test = df.loc[index_X_test].copy()

# Add the predictions
df_test['Prediction'] = predictions

# Decode the predicted codes into category names
df_test['Category_Predicted'] = df_test['Prediction'].map(category_names)

# Keep only the columns we need
df_test = df_test[['Content', 'Category', 'Category_Predicted']]

In [5]:
df_test.head()

|     | Content | Category | Category_Predicted |
|-----|---------|----------|--------------------|
| 493 | `b'Is Chocolate Good for Your Heart?\n\nWhy a l...` | Other | Others |
| 131 | `b'Imagine a world with no car crashes. Our sel...` | Autonomous Cars | Autonomous Cars |
| 234 | `b'Python, C++, Linear Algebra and Calculus. Se...` | Autonomous Cars | Autonomous Cars |
| 25 | `b'Python, C++, Linear Algebra and Calculus. Se...` | Autonomous Cars | Autonomous Cars |
| 127 | `b"Imagine getting into your car, typing\xe2\x8...` | Autonomous Cars | Autonomous Cars |


Let's get the misclassified articles:

In [6]:
condition = (df_test['Category'] != df_test['Category_Predicted'])

df_misclassified = df_test[condition]

df_misclassified.head(3)

|     | Content | Category | Category_Predicted |
|-----|---------|----------|--------------------|
| 493 | `b'Is Chocolate Good for Your Heart?\n\nWhy a l...` | Other | Others |
| 408 | `b'If you think chocolate is heavenly, you\'re ...` | Other | Others |
| 346 | `b'Is chocolate toxic to dogs?\n\nYes, chocolat...` | Other | Others |
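
Before reading individual articles, it can help to count which (actual, predicted) label pairs make up the misclassified set. The sketch below avoids any dependencies and uses made-up labels rather than the notebook's data. With the real dataframe, the two lists would presumably come from `df_misclassified['Category']` and `df_misclassified['Category_Predicted']`. The `count_label_pairs` helper is hypothetical.

```python
from collections import Counter

def count_label_pairs(actual, predicted):
    """Count (actual, predicted) pairs where the two labels differ."""
    return Counter((a, p) for a, p in zip(actual, predicted) if a != p)

# Made-up labels for illustration only
actual = ['Other', 'Autonomous Cars', 'Other', 'Autonomous Cars']
predicted = ['Others', 'Autonomous Cars', 'Others', 'Others']

# Most frequent disagreement first
for pair, n in count_label_pairs(actual, predicted).most_common():
    print(pair, n)
```

If a pair such as `('Other', 'Others')` dominates the counts, the rows are being flagged by a naming mismatch rather than by model errors.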


Let's inspect a sample of three misclassified articles. We'll define a helper function to print each one:

In [7]:
def output_article(row_article):
    """Print the actual category, predicted category and text of one article."""
    print(f"Actual Category: {row_article['Category']}")
    print(f"Predicted Category: {row_article['Category_Predicted']}")
    print('-------------------------------------------')
    print('Text: ')
    print(row_article['Content'])
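
The `Content` values are printed as Python bytes literals (`b'...'`), which leaves escape sequences such as `\xe2\x80\x99` in the output. Here is a minimal sketch for turning them into readable text, assuming the column holds either `bytes` or the string form of a bytes literal, and that the articles are UTF-8 encoded. `readable_text` is a hypothetical helper, not part of the notebook.

```python
import ast

def readable_text(content):
    """Decode bytes, or the string form of a bytes literal, to text."""
    if isinstance(content, str) and content[:2] in ("b'", 'b"'):
        try:
            # Safely evaluate the literal back into a Python object
            parsed = ast.literal_eval(content)
        except (ValueError, SyntaxError):
            # Truncated or malformed literal: leave it untouched
            return content
        if isinstance(parsed, bytes):
            content = parsed
    if isinstance(content, bytes):
        # Assumption: the articles are UTF-8 encoded
        return content.decode('utf-8', errors='replace')
    return content

print(readable_text("b'it\\xe2\\x80\\x99s quite nutritious'"))
```

Calling `readable_text(row_article['Content'])` inside `output_article` would be one way to use it. Truncated literals are returned unchanged, because they cannot be parsed.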

We'll draw three random indexes from the misclassified articles:

In [8]:
random.seed(8)
list_samples = random.sample(list(df_misclassified.index), 3)
list_samples

[66, 311, 513]
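
`random.seed(8)` sets the state of the module-level generator, so any other call to `random` between seeding and sampling changes the result. The following sketch shows one possible alternative: a dedicated `random.Random` instance, with the sample size capped so that it also works when fewer than three articles are misclassified. The `sample_indexes` helper and the index values are hypothetical.

```python
import random

def sample_indexes(indexes, k=3, seed=8):
    """Draw up to k indexes reproducibly with a private generator."""
    rng = random.Random(seed)
    indexes = list(indexes)
    return rng.sample(indexes, min(k, len(indexes)))

# Made-up index values, standing in for df_misclassified.index
example_indexes = [12, 66, 130, 311, 408, 513]
print(sample_indexes(example_indexes))
```

Because the generator is created inside the function, repeated calls with the same indexes and seed return the same sample.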

First case:

In [9]:
output_article(df_misclassified.loc[list_samples[0]])

Actual Category: Autonomous Cars
Predicted Category: Others
-------------------------------------------
Text: 
b"Company Name Country UNITED STATES UNITED KINGDOM CANADA AUSTRALIA INDIA ------ Afghanistan \xc3\x85land Islands Albania Algeria American Samoa Andorra Angola Anguilla Antarctica Antigua and Barbuda Argentina Armenia Aruba Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin Bermuda Bhutan Bolivia Bonaire, Sint Eustatius and Saba Bosnia and Herzegovina Botswana Bouvet Island Brazil British Indian Ocean Territory Brunei Darussalam Bulgaria Burkina Faso Burundi Cambodia Cameroon Cape Verde Cayman Islands Central African Republic Chad Chile China Christmas Island Cocos (Keeling) Islands Colombia Comoros Congo Congo, The Democratic Republic of the Cook Islands Costa Rica C\xc3\xb4te D'Ivoire Croatia Cuba Cura\xc3\xa7ao Cura\xc3\xa7ao Cyprus Czech Republic Denmark Djibouti Dominica Dominican Republic Ecuador Egypt El Salvador Equatorial Guinea Eritr

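Here the text is not an article at all: it is the country list of a web form ("Company Name Country UNITED STATES ..."), most likely scraped in place of the actual content. With no topic-specific vocabulary to rely on, the model assigns it to the other class. Errors like this will always happen, and we do not expect to be 100% accurate on them.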

Second case:

In [10]:
output_article(df_misclassified.loc[list_samples[1]])

Actual Category: Other
Predicted Category: Others
-------------------------------------------
Text: 
b'First Alert Forecast: plenty of sun for Sunday, rip currents low before Larry sweeps by the US\n\nWILMINGTON, N.C. (WECT) - Good to see you this Sunday! This weekend may mark the unofficial end of summer, but Your First Alert Forecast for Labor Day weekend features weather for summer and fall weather lovers alike! Expect another clear and cool morning with temperatures in the 60s - and perhaps 50s in isolated cases - and finishes with a toasty Labor Day Monday with humid 90s sneaking back into the fold.\n\nRead More...'


Third case:

In [11]:
output_article(df_misclassified.loc[list_samples[2]])

Actual Category: Other
Predicted Category: Others
-------------------------------------------
Text: 
b'Share on Pinterest Screen Moment/Stocksy United Dark chocolate is loaded with nutrients that can positively affect your health. Made from the seed of the cacao tree, it\xe2\x80\x99s one of the best sources of antioxidants you can find. Studies show that dark chocolate can improve your health and lower the risk of heart disease. This article reviews 7 health benefits of dark chocolate or cocoa that are supported by science.\n\n1. Very nutritious If you buy quality dark chocolate with a high cocoa content, then it\xe2\x80\x99s quite nutritious. It contains a decent amount of soluble fiber and is loaded with minerals. A 100-gram bar of dark chocolate with 70\xe2\x80\x9385% cocoa contains (1): 11 grams of fiber\n\n67% of the DV for iron\n\n58% of the DV for magnesium\n\n89% of the DV for copper\n\n98% of the DV for manganese In addition, it has plenty of potassium, phosphorus, zinc, and s

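Of the three sampled cases, only the first is a real disagreement between the actual and the predicted category. In the second and third cases the actual category is `Other` and the predicted category is `Others`. The two labels differ only in spelling, so these rows are flagged by the string comparison rather than by a model error, and the label names should be made consistent before counting misclassifications. For the remaining errors the category is often not 100% clear, for example when an article contains concepts of both categories. Such errors will always happen, and we do not expect to be 100% accurate on them.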