# Model Interpretation

Having selected the SVM as our preferred model, we will now study its behaviour by analyzing the articles it misclassifies. This should give us some insight into how the model works.

In [1]:
import pickle
import pandas as pd
import numpy as np
import random

Let's load what we need:

In [2]:
import os
cwd = os.getcwd()
parent = os.path.dirname(cwd) # .../Latest-News-Classifier/0. Latest News Classifier


# Dataframe
path_df = parent + "/03. Feature Engineering/Pickles/df.pickle"
with open(path_df, 'rb') as data:
    df = pickle.load(data)
    
# X_train
path_X_train = parent + "/03. Feature Engineering/Pickles/X_train.pickle"
with open(path_X_train, 'rb') as data:
    X_train = pickle.load(data)

# X_test
path_X_test = parent + "/03. Feature Engineering/Pickles/X_test.pickle"
with open(path_X_test, 'rb') as data:
    X_test = pickle.load(data)

# y_train
path_y_train = parent + "/03. Feature Engineering/Pickles/y_train.pickle"
with open(path_y_train, 'rb') as data:
    y_train = pickle.load(data)

# y_test
path_y_test = parent + "/03. Feature Engineering/Pickles/y_test.pickle"
with open(path_y_test, 'rb') as data:
    y_test = pickle.load(data)

# features_train
path_features_train = parent + "/03. Feature Engineering/Pickles/features_train.pickle"
with open(path_features_train, 'rb') as data:
    features_train = pickle.load(data)

# labels_train
path_labels_train = parent + "/03. Feature Engineering/Pickles/labels_train.pickle"
with open(path_labels_train, 'rb') as data:
    labels_train = pickle.load(data)

# features_test
path_features_test = parent + "/03. Feature Engineering/Pickles/features_test.pickle"
with open(path_features_test, 'rb') as data:
    features_test = pickle.load(data)

# labels_test
path_labels_test = parent + "/03. Feature Engineering/Pickles/labels_test.pickle"
with open(path_labels_test, 'rb') as data:
    labels_test = pickle.load(data)
    
# SVM Model
path_model = parent + "/04. Model Training/Models/best_svc.pickle"
with open(path_model, 'rb') as data:
    svc_model = pickle.load(data)
    
# Category mapping dictionary
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

category_names = {
    0: 'business',
    1: 'entertainment',
    2: 'politics',
    3: 'sport',
    4: 'tech'
}

Let's get the predictions on the test set:

In [3]:
predictions = svc_model.predict(features_test)
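
Before inspecting individual errors, it can help to know how many there are. The following is a minimal sketch, not part of the original notebook: the arrays `toy_predictions` and `toy_labels` are made-up stand-ins for `predictions` and `labels_test`, and the numbers they produce say nothing about the real model.

```python
import numpy as np

# Hypothetical stand-ins for `predictions` and `labels_test` (category codes)
toy_predictions = np.array([3, 0, 0, 0, 0, 1, 4])
toy_labels = np.array([3, 2, 0, 0, 0, 1, 4])

# Boolean mask marking the positions where the prediction is wrong
is_wrong = toy_predictions != toy_labels

n_errors = int(is_wrong.sum())
accuracy = float((~is_wrong).mean())

print(f"Misclassified: {n_errors} of {len(toy_labels)}")
print(f"Accuracy: {accuracy:.3f}")
```

With the real arrays, the same comparison would presumably be `predictions != labels_test`, assuming both hold category codes in the same order.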

Now we'll create the Test Set dataframe with the actual and predicted categories:

In [4]:
# Indexes of the test set
index_X_test = X_test.index

# Get the matching rows from the original df
# (explicit copy, so that adding columns does not raise SettingWithCopyWarning)
df_test = df.loc[index_X_test].copy()

# Add the predictions
df_test['Prediction'] = predictions

# Decode the predicted codes into category names
df_test['Category_Predicted'] = df_test['Prediction'].map(category_names)

# Keep only the columns we need
df_test = df_test[['Content', 'Category', 'Category_Predicted']]

In [5]:
df_test.head()

| | Content | Category | Category_Predicted |
|---|---|---|---|
| 1691 | Ireland call up uncapped Campbell\n\nUlster sc... | sport | sport |
| 1103 | Gurkhas to help tsunami victims\n\nBritain has... | politics | business |
| 477 | Egypt and Israel seal trade deal\n\nIn a sign ... | business | business |
| 197 | Cairn shares up on new oil find\n\nShares in C... | business | business |
| 475 | Saudi NCCI's shares soar\n\nShares in Saudi Ar... | business | business |


Let's get the misclassified articles:

In [6]:
condition = (df_test['Category'] != df_test['Category_Predicted'])

df_misclassified = df_test[condition]

df_misclassified.head(3)

| | Content | Category | Category_Predicted |
|---|---|---|---|
| 1103 | Gurkhas to help tsunami victims\n\nBritain has... | politics | business |
| 1942 | Argonaut founder rebuilds empire\n\nJez San, t... | tech | business |
| 1880 | Half-Life 2 sweeps Bafta awards\n\nPC first pe... | tech | entertainment |
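
To see which categories get confused with each other most often, the misclassified rows can be counted by (actual, predicted) pair. This is a rough sketch on a made-up miniature of `df_test`; the frame `toy_test` and its rows are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical miniature of `df_test`; these rows are made up for illustration
toy_test = pd.DataFrame({
    'Category': ['sport', 'politics', 'business', 'tech', 'tech', 'tech'],
    'Category_Predicted': ['sport', 'business', 'business',
                           'business', 'entertainment', 'tech'],
})

# Keep only the rows where the prediction differs from the actual category
toy_misclassified = toy_test[toy_test['Category'] != toy_test['Category_Predicted']]

# Count how often each (actual, predicted) pair occurs among the errors
error_counts = (
    toy_misclassified
    .groupby(['Category', 'Category_Predicted'])
    .size()
    .sort_values(ascending=False)
)
print(error_counts)
```

Applied to the real `df_misclassified`, the same `groupby` would show whether the errors concentrate on a few category pairs.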


Let's look at a sample of 3 of them. We'll define a function that prints an article together with its actual and predicted categories:

In [7]:
def output_article(row_article):
    # Print the actual and predicted categories, followed by the article text
    print(f"Actual Category: {row_article['Category']}")
    print(f"Predicted Category: {row_article['Category_Predicted']}")
    print('-------------------------------------------')
    print('Text: ')
    print(row_article['Content'])

We'll draw three random indexes from the misclassified articles, fixing the seed so that the sample is reproducible:

In [12]:
random.seed(1)
list_samples = random.sample(list(df_misclassified.index), 3)
list_samples

[627, 640, 1880]
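
As an alternative to `random.sample`, pandas can draw the rows directly with `DataFrame.sample`. The sketch below uses a made-up frame (`toy_misclassified` and its index labels are hypothetical). It will not select the same rows as `random.sample` with the same seed, because the two use different random number generators.

```python
import pandas as pd

# Hypothetical stand-in for `df_misclassified`, with made-up index labels
toy_misclassified = pd.DataFrame(
    {'Category': ['politics', 'tech', 'tech', 'entertainment', 'entertainment']},
    index=[11, 22, 33, 44, 55],
)

# Draw 3 rows without replacement; `random_state` fixes the seed
sample_a = toy_misclassified.sample(n=3, random_state=1)
sample_b = toy_misclassified.sample(n=3, random_state=1)

# The same seed gives the same rows on every run
print(list(sample_a.index))
print(list(sample_a.index) == list(sample_b.index))
```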

First case:

In [13]:
output_article(df_misclassified.loc[list_samples[0]])

Actual Category: entertainment
Predicted Category: sport
-------------------------------------------
Text: 
REM concerts blighted by illness

US rock band REM have been forced to cancel concerts after bass player Mike Mills was taken to hospital suffering from "severe flu-like symptoms".

The band were forced to cut short Monday night's show in Sheffield, and have cancelled Tuesday's Glasgow date. Mills could "hardly stand up, let alone play", said an REM spokesman, who added he is now "resting" in hospital. The remainder of the band played a short acoustic set on Monday. Tuesday's gig has been rescheduled for 15 June. Those who had a ticket for the show in Glasgow are being advised to retain their ticket stub so they can attend the new date. The band's spokesman said that they would review their remaining dates on a "day-to-day basis", based on doctors' advice to Mills. "Obviously we all want Mike to get better, and clearly we all want to play the shows. Rest assured we will do so as 

Second case:

In [14]:
output_article(df_misclassified.loc[list_samples[1]])

Actual Category: entertainment
Predicted Category: politics
-------------------------------------------
Text: 
Franz man seeks government help

Franz Ferdinand frontman Alex Kapranos has called for more government help for musicians, while taking part in an Edinburgh Lectures discussion.

"For any cultural output to thrive there needs to be some kind of state input to that as well," he said. But Kapranos warned against musicians being too closely linked with MPs, at the University of Edinburgh event. "I think the role of musicians is to question politicians rather than to go to bed with them," he said.

Kapranos joined the prestigious lecture series to discuss Scotland's role in making 21st Century music. "There are elements of our musical output which require sustenance because they aren't self-sufficient," he said. "But so-called commercial music would benefit from investment as well." He warned musicians against being allied to a particular party, however. "I don't know if having te

Third case:

In [15]:
output_article(df_misclassified.loc[list_samples[2]])

Actual Category: tech
Predicted Category: entertainment
-------------------------------------------
Text: 
Half-Life 2 sweeps Bafta awards

PC first person shooter Half-Life 2 has won six Bafta Awards, including best game and best online game.

The title, developed by Valve, was released last year to universal acclaim - receiving special praise for its immersive plot and physics engine. The game also won Baftas for best action adventure, best PC game, art direction and animation. Burnout 3 won three awards in the categories for racing, technical direction and best PlayStation 2 game. Grant Dean, chairman of the Bafta games awards, said at a ceremony in London on Tuesday: "The last year has been a great year for the interactive entertainment industry.

"These awards reflect the enormous achievements, progress and diversity that we have seen in that time." Halo 2 won the best Xbox game category, while Prince of Persia: Warrior Within was adjudged the best GameCube title. The sports award

In all three cases the category is ambiguous, since each article mixes concepts from both its actual category and the predicted one. Errors of this kind will always occur, and we do not expect the model to be 100% accurate on such articles.