# Model Interpretation

At this point we have selected the SVM as our preferred model to do the predictions. We will now study its behaviour by analyzing misclassified articles. Hopefully this will give us some insights on the way the model is working.

In [7]:
import pickle
import pandas as pd
import numpy as np
import random

Let's load what we need:

In [8]:
# Dataframe
path_df = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\df.pickle"
with open(path_df, 'rb') as data:
    df = pickle.load(data)
    
# X_train
path_X_train = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\X_train.pickle"
with open(path_X_train, 'rb') as data:
    X_train = pickle.load(data)

# X_test
path_X_test = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\X_test.pickle"
with open(path_X_test, 'rb') as data:
    X_test = pickle.load(data)

# y_train
path_y_train = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\y_train.pickle"
with open(path_y_train, 'rb') as data:
    y_train = pickle.load(data)

# y_test
path_y_test = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\y_test.pickle"
with open(path_y_test, 'rb') as data:
    y_test = pickle.load(data)

# features_train
path_features_train = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\features_train.pickle"
with open(path_features_train, 'rb') as data:
    features_train = pickle.load(data)

# labels_train
path_labels_train = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\labels_train.pickle"
with open(path_labels_train, 'rb') as data:
    labels_train = pickle.load(data)

# features_test
path_features_test = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\features_test.pickle"
with open(path_features_test, 'rb') as data:
    features_test = pickle.load(data)

# labels_test
path_labels_test = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\03. Feature Engineering\Pickles\labels_test.pickle"
with open(path_labels_test, 'rb') as data:
    labels_test = pickle.load(data)
    
# SVM Model
path_model = r"E:\unt_fe\assignment_1\Latest-News-Classifier\0. Latest News Classifier\04. Model Training\Models\best_svc.pickle"
with open(path_model, 'rb') as data:
    svc_model = pickle.load(data)
    
# Category mapping dictionary
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

category_names = {
    0: 'business',
    1: 'entertainment',
    2: 'politics',
    3: 'sport',
    4: 'tech'
}

Let's get the predictions on the test set:

In [9]:
predictions = svc_model.predict(features_test)

Now we'll create the Test Set dataframe with the actual and predicted categories:

In [10]:
# Indexes of the test set
index_X_test = X_test.index

# We get them from the original df
df_test = df.loc[index_X_test]

# Add the predictions
df_test['Prediction'] = predictions

# Clean columns
df_test = df_test[['Content', 'Category', 'Category_Code', 'Prediction']]

# Decode
df_test['Category_Predicted'] = df_test['Prediction']
df_test = df_test.replace({'Category_Predicted':category_names})

# Clean columns again
df_test = df_test[['Content', 'Category', 'Category_Predicted']]

In [11]:
df_test.head()

Unnamed: 0,Content,Category,Category_Predicted
1691,Ireland call up uncapped Campbell\r\n\r\nUlste...,sport,3
1103,Gurkhas to help tsunami victims\r\n\r\nBritain...,politics,0
477,Egypt and Israel seal trade deal\r\n\r\nIn a s...,business,0
197,Cairn shares up on new oil find\r\n\r\nShares ...,business,0
475,Saudi NCCI's shares soar\r\n\r\nShares in Saud...,business,0


Let's get the misclassified articles:

In [12]:
condition = (df_test['Category'] != df_test['Category_Predicted'])

df_misclassified = df_test[condition]

df_misclassified.head(3)

Unnamed: 0,Content,Category,Category_Predicted
1691,Ireland call up uncapped Campbell\r\n\r\nUlste...,sport,3
1103,Gurkhas to help tsunami victims\r\n\r\nBritain...,politics,0
477,Egypt and Israel seal trade deal\r\n\r\nIn a s...,business,0


Let's get a sample of 3 articles. We'll define a function to make this process faster:

In [13]:
def output_article(row_article):
    print('Actual Category: %s' %(row_article['Category']))
    print('Predicted Category: %s' %(row_article['Category_Predicted']))
    print('-------------------------------------------')
    print('Text: ')
    print('%s' %(row_article['Content']))

We'll get three random numbers from the indexes:

In [14]:
random.seed(8)
list_samples = random.sample(list(df_misclassified.index), 3)
list_samples

[1902, 955, 101]

First case:

In [15]:
output_article(df_misclassified.loc[list_samples[0]])

Actual Category: tech
Predicted Category: 4
-------------------------------------------
Text: 
Software watching while you work

Software that can not only monitor every keystroke and action performed at a PC but also be used as legally binding evidence of wrong-doing has been unveiled.

Worries about cyber-crime and sabotage have prompted many employers to consider monitoring employees. The developers behind the system claim it is a break-through in the way data is monitored and stored. But privacy advocates are concerned by the invasive nature of such software.

The system is a joint venture between security firm 3ami and storage specialists BridgeHead Software. They have joined forces to create a system which can monitor computer activity, store it and retrieve disputed files within minutes. More and more firms are finding themselves in deep water as a result of data misuse. Sabotage and data theft are most commonly committed from within an organisation according to the Nation

Second case:

In [16]:
output_article(df_misclassified.loc[list_samples[1]])

Actual Category: politics
Predicted Category: 2
-------------------------------------------
Text: 
Cardinal criticises Iraq war cost

Billions of pounds spent on conflict in Iraq and in the Middle East should have been used to reduce poverty, Cardinal Cormac Murphy-O'Connor has said.

The head of the Catholic Church in England and Wales made the comments on BBC Radio 4 and will re-iterate his stance in his Christmas Midnight Mass. The cardinal used a Christmas message to denounce the war in Iraq as a "terrible" waste of money. He and the Archbishop of Canterbury have both spoken out about the war.

Speaking on BBC Radio 4's Thought for the Day slot, he criticised the fact that "billions" have been spent on war, instead of being used to bring people "out of dire poverty and malnourishment and disease". The cardinal said 2005 should be the year for campaigning to "make history poverty". He added: "If the governments of the rich countries were as ready to devote to peace the resourc

Third case:

In [17]:
output_article(df_misclassified.loc[list_samples[2]])

Actual Category: business
Predicted Category: 0
-------------------------------------------
Text: 
US company admits Benin bribery

A US defence and telecommunications company has agreed to pay $28.5m after admitting bribery in the West African state of Benin.

The Titan corporation was accused of funnelling more than $2m into the 2001 re-election campaign of President Mathieu Kerekou. At the time, Titan was trying to get a higher price for a telecommunications project in Benin. There is no suggestion that Mr Kerekou was himself aware of any wrongdoing. Titan, a California-based company, pleaded guilty to falsifying its accounts and violating US anti-bribery laws. It agreed to pay $13m in criminal penalties, as well as $15.5m to settle a civil lawsuit brought by the US financial watchdog, the Securities and Exchange Commission (SEC).

The SEC had accused Titan of illegally paying $2.1m to an unnamed agent in Benin claiming ties with President Kerekou. Some of the money was used t

We can see that in all cases the category is not 100% clear, since these articles contain concepts of both categories. These errors will always happen and we are not looking forward to be 100% accurate on them.