In this notebook we will use two bag-of-words techniques: count vectorization and tfidf vectorization.

Recording on YouTube of the study group session: https://youtu.be/HlmmXrA4FUU

In [11]:
import numpy as np
import pandas as pd

real = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')
fake = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')

In [12]:
### To run the notebook faster at the cost of accuracy, uncomment the two lines below to use a sample of 40k articles

# real = real.sample(20000)
# fake = fake.sample(20000)
real.shape, fake.shape

((21417, 4), (23481, 4))

In [13]:
real.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [14]:
num = 100 # Selects an article to preview from the real dataset

print('Title: ', real.title[num],'\n')
print('Article:\n', real.text[num])

Title:  Senator Warren hits out at 'effort to politicize' U.S. consumer agency 

Article:
 WASHINGTON (Reuters) - Democratic Senator Elizabeth Warren is taking aim at budget chief Mick Mulvaney’s plan to fill the ranks of the U.S. consumer financial watchdog with political allies, according to letters seen by Reuters, the latest salvo in a broader battle over who should run the bureau. President Donald Trump last month appointed Mulvaney as acting director of the Consumer Financial Protection Bureau (CFPB), though the decision is being legally challenged by the agency’s deputy director, Leandra English, who says she is the rightful interim head. Mulvaney told reporters earlier this month he planned to bring in several political appointees to help overhaul the agency, but Warren warned in a pair of letters sent Monday to Mulvaney and the Office of Personnel Management (OPM), which oversees federal hiring, that doing so was inappropriate and potentially illegal. The CFPB is meant to be a

In [15]:
### The subject values do not overlap between the two datasets, so using this column as a feature would leak the label.
print('subjects of fake news articles:',fake['subject'].unique())
print('subjects of real news articles:',real['subject'].unique())

subjects of fake news articles: ['News' 'politics' 'Government News' 'left-news' 'US_News' 'Middle-east']
subjects of real news articles: ['politicsNews' 'worldnews']
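
A quick way to confirm this kind of leakage is to cross-tabulate the column against the label. The sketch below uses a made-up toy frame and a hypothetical `label` column (1 for real, 0 for fake) rather than the actual dataset; it only illustrates the check.

```python
import pandas as pd

# Toy rows (made up) that mimic the disjoint subject values printed above
toy = pd.DataFrame({
    "subject": ["politicsNews", "worldnews", "News", "politics", "left-news", "worldnews"],
    "label":   [1, 1, 0, 0, 0, 1],  # hypothetical label: 1 = real, 0 = fake
})

# Rows are subjects, columns are labels, cells are article counts
table = pd.crosstab(toy["subject"], toy["label"])
print(table)

# True when every subject occurs under exactly one label,
# i.e. the column alone would give away the class
leaks = bool(((table > 0).sum(axis=1) == 1).all())
print("subject perfectly separates the classes:", leaks)
```

If any subject appeared under both labels, `leaks` would be False and the column would be less of a giveaway.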


In [16]:
### Since the real news articles and fake news articles are in two different data sets we can add a label column easily
real['is_real'] = 1
fake['is_real'] = 0

In [17]:
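# DataFrame.append was removed in pandas 2.0; concat with ignore_index=True
# stacks the two frames and renumbers the index from 0 in one step
data = pd.concat([real, fake], ignore_index=True)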
data.sample(10)

Unnamed: 0,title,text,subject,date,is_real
7536,"Jump in Florida, Nevada early voting could rea...",MIAMI (Reuters) - The man answering a voluntee...,politicsNews,"November 6, 2016",1
30789,WHAT TRUMP JUST SAID Should Scare All Gropers ...,President Trump wants those who settled sexual...,politics,"Nov 21, 2017",0
4980,Trump Middle East envoy meets Netanyahu in Jer...,JERUSALEM (Reuters) - U.S. President Donald Tr...,politicsNews,"March 13, 2017",1
35164,BUSTED! HERE’S WHY HILLARY CLINTON’S Brother-I...,Roger Clinton got busted for a DUI on Sunday b...,politics,"Jun 6, 2016",0
16667,Deadly air strike hits Syrian government-held ...,BEIRUT (Reuters) - An air raid in the governme...,worldnews,"October 23, 2017",1
15983,Factbox: Catalonia crisis - What's next?,MADRID (Reuters) - Catalonia s ousted leader C...,worldnews,"October 31, 2017",1
17046,Vote may have put independence out of reach fo...,ERBIL (Reuters) - The Kurdish independence vot...,worldnews,"October 18, 2017",1
43704,CONVENIENT? ‘Active Shooter’ Kills 5 in Fort L...,"21st Century Wire says Incredibly, on the same...",US_News,"January 6, 2017",0
28701,Pope Francis Just INFURIATED Conservatives By...,Pope Francis has once again infuriated conserv...,News,"March 25, 2016",0
19979,Russia says North Korea's latest missile launc...,MOSCOW (Reuters) - The Russian Foreign Ministr...,worldnews,"September 15, 2017",1


In [18]:
data.isnull().sum()

title      0
text       0
subject    0
date       0
is_real    0
dtype: int64

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

Here we create the count vectorizer and the tfidf vectorizer. The optional arguments strip accents, include both unigrams and bigrams (`ngram_range=(1,2)`), filter out stop words, and cap the vocabulary at 1,000 features, so every vector has 1,000 elements. Both vectorizers perform tokenization on their own, so that step has been skipped. The `text_column` variable selects what gets encoded: 'title' keeps things quick and simple, while 'text' (the setting used in this run) encodes the full articles and takes longer. It is worth experimenting with both.

In [47]:
countvec = CountVectorizer(strip_accents='ascii', stop_words=stopwords, ngram_range=(1,2), max_features=1000)
tfidf = TfidfVectorizer(strip_accents='ascii', stop_words=stopwords, ngram_range=(1,2), max_features=1000)

text_column = 'text' # use 'text' to train on the full articles or 'title' to only use the titles.
                     # 'text' will take a lot longer for the vectorizers to run
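
To make the difference between the two encodings concrete, here is a minimal sketch on a made-up three-sentence corpus. It uses the default tokenizer with no stop-word list or `max_features` cap, so it is a simplification of the settings above, not a copy of them.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus, for illustration only
docs = [
    "trump signs tax bill",
    "senate passes tax bill",
    "you will not believe what trump said",
]

# ngram_range=(1, 2) keeps single words and two-word phrases
cv = CountVectorizer(ngram_range=(1, 2))
counts = cv.fit_transform(docs)

tv = TfidfVectorizer(ngram_range=(1, 2))
weights = tv.fit_transform(docs)

# Both vectorizers learn the same term -> column mapping; only the values differ
same_vocab = cv.vocabulary_ == tv.vocabulary_
col = cv.vocabulary_["tax bill"]
print("same vocabulary:", same_vocab)
print("count for 'tax bill' in doc 0:", counts[0, col])
print("tfidf for 'tax bill' in doc 0:", round(float(weights[0, col]), 3))

# With the default norm='l2', every tfidf row has unit length
row_norm = float(np.linalg.norm(weights[0].toarray()))
print("L2 norm of tfidf row 0:", round(row_norm, 6))
```

The count matrix holds raw occurrence counts, while the tfidf matrix holds weights that are scaled down for terms appearing in many documents and then normalized per row.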

The sample from the count vectorizer below may appear to contain only zeroes. This is because the matrix is sparse: the vast majority of columns in any one row are zero. Below this cell we can also view the vocabulary used to build the vectors, which maps each term to its column index.

In [48]:
count_dat = countvec.fit_transform(data[text_column])
count_dat = pd.DataFrame(count_dat.toarray())
count_dat.sample(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
32428,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,1,0,0,0,0,0
7051,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34968,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
124,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38820,0,0,0,0,0,0,0,0,1,0,...,0,1,0,1,0,0,0,0,0,0
13704,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
40605,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40838,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
44334,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16476,0,0,0,0,0,0,0,0,0,0,...,0,5,0,0,0,0,0,0,0,0
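
To put a number on how sparse such a matrix is, one can look at the stored non-zero count before converting to a dense array. This is a small sketch on a made-up corpus, not the notebook's data; `nnz` and `shape` are standard attributes of the SciPy sparse matrix that `fit_transform` returns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up headlines, for illustration only
docs = ["tax bill passes", "trump signs bill", "markets rally", "senate debates tax plan"]

sparse_counts = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix

n_rows, n_cols = sparse_counts.shape
density = sparse_counts.nnz / (n_rows * n_cols)  # fraction of cells that are non-zero
print(f"shape: {sparse_counts.shape}, non-zeros: {sparse_counts.nnz}, density: {density:.2f}")

# Converting to dense keeps the same values but stores every zero explicitly
dense = sparse_counts.toarray()
print("dense and sparse agree:", int((dense != 0).sum()) == sparse_counts.nnz)
```

Some scikit-learn estimators, `LinearSVC` among them, accept the sparse matrix directly, which saves memory on larger vocabularies; `GaussianNB` needs a dense array.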


In [49]:
countvec.vocabulary_

{'washington': 958,
 'reuters': 742,
 'head': 381,
 'conservative': 194,
 'republican': 735,
 'congress': 192,
 'voted': 947,
 'month': 558,
 'national': 572,
 'debt': 229,
 'pay': 631,
 'tax': 869,
 'called': 131,
 'sunday': 854,
 'budget': 124,
 'way': 962,
 'among': 67,
 'republicans': 736,
 'representative': 733,
 'mark': 526,
 'speaking': 827,
 'face': 294,
 'nation': 571,
 'hard': 379,
 'line': 498,
 'federal': 308,
 'spending': 830,
 'lawmakers': 476,
 'january': 447,
 'return': 741,
 'wednesday': 965,
 'trying': 916,
 'pass': 627,
 'fight': 310,
 'likely': 497,
 'issues': 445,
 'immigration': 414,
 'policy': 650,
 'even': 282,
 'november': 595,
 'congressional': 193,
 'election': 267,
 'keep': 456,
 'control': 199,
 'president': 663,
 'donald': 251,
 'trump': 909,
 'want': 953,
 'big': 112,
 'increase': 421,
 'military': 550,
 'democrats': 238,
 'also': 60,
 'non': 589,
 'defense': 235,
 'programs': 685,
 'support': 855,
 'education': 260,
 'public': 692,
 'health': 382,
 'admi
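
Since `vocabulary_` maps term to column index, reading a vector back into words needs the inverse mapping. A minimal sketch on a made-up corpus (the variable names here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
docs = ["washington reuters budget", "budget fight looms", "reuters reports fight"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Invert term -> index into index -> term
index_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

# Recover the terms present in the second document from its non-zero columns
row = X[1].toarray().ravel()
terms_in_row = sorted(index_to_term[i] for i in row.nonzero()[0])
print(terms_in_row)
```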

In [50]:
tfidf_dat = tfidf.fit_transform(data[text_column])
tfidf_dat = pd.DataFrame(tfidf_dat.toarray())
tfidf_dat.sample(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
2278,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.20302,0.0,0.0,0.0
23447,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.109683,0.0,0.0,0.0,0.0,0.125195,0.0,0.0,0.0
19592,0.0,0.028541,0.033319,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.024443,0.073461,0.0,0.0,0.0,0.0,0.062888,0.0,0.027125,0.0
43884,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33388,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42531,0.0,0.0,0.0,0.0,0.067308,0.0,0.0,0.0,0.074587,0.0,...,0.0,0.0,0.070256,0.1257,0.0,0.0,0.042506,0.0,0.0,0.0
14533,0.047688,0.0,0.061564,0.0,0.0,0.0,0.0,0.0,0.067966,0.0,...,0.0,0.027147,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.293191,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.097606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
77,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.018861,0.0,0.0,0.024409,0.039578,0.0,0.036058,0.0,0.0


Here we split the data into training and test sets. The two sets of training data represent the two vectorization methods, run side by side for comparison. Since the dataset is close to balanced (21,417 real vs. 23,481 fake articles), we will use accuracy as the metric.

In [51]:
from sklearn.model_selection import train_test_split
y = data.is_real

train_x1, test_x1, train_y1, test_y1 = train_test_split(count_dat, y, test_size=.3, random_state=42)
train_x2, test_x2, train_y2, test_y2 = train_test_split(tfidf_dat, y, test_size=.3, random_state=42)
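
As a sanity check on the choice of accuracy, the class shares can be computed from the row counts reported by `real.shape` and `fake.shape` earlier. The arithmetic below covers the whole dataset; the proportions in a random 30% test split will be close but not identical.

```python
# Row counts taken from the real.shape / fake.shape output earlier in the notebook
n_real, n_fake = 21417, 23481
total = n_real + n_fake

share_real = n_real / total
# Accuracy of a model that always predicts the larger class
majority_baseline = max(n_real, n_fake) / total

print(f"total articles: {total}")
print(f"share real: {share_real:.3f}, majority-class accuracy: {majority_baseline:.3f}")
```

Any model worth keeping should beat that majority-class figure by a clear margin.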

Below we use three different models: a support vector machine, a random forest, and a naive Bayes model. The %%time magic reports how long each cell takes to train and run inference. It is worth pointing out how fast the SVM trains and predicts compared with the random forest (under a second versus 14 to 22 seconds); naive Bayes is similarly quick.

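Depending on the model used, the difference between the count vectorizer and tfidf is either trivial or significant. For the SVM the accuracy is essentially unchanged (0.9942 vs. 0.9947). The tfidf accuracy recorded for the random forest and naive Bayes cells, however, is identical to the SVM's tfidf score to every digit (0.9947290274684484), which is what scoring the SVM's leftover `preds` instead of each model's own predictions would produce. The apparent jump for naive Bayes from 0.889 with counts should therefore not be read as a real improvement until those two cells are re-scored. The recorded wall times do differ: naive Bayes drops from 735 ms to 469 ms with tfidf, while the random forest rises from 14 s to 22.4 s. Note also that neither Gaussian naive Bayes nor a random forest is trained by iterating from initial coefficients (the former estimates per-class means and variances directly, the latter grows trees by greedy splitting), so any real difference between the two encodings would more plausibly come from how the feature values are scaled than from starting closer to an optimum.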
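
One way to keep each reported score tied to the model that was just fitted is a small helper that fits, predicts, and scores in one place. This is a sketch under assumptions: `fit_and_score` is a hypothetical name, and the data is a synthetic stand-in from `make_classification`, not the news vectors.

```python
import time

from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC


def fit_and_score(model, train_x, test_x, train_y, test_y):
    """Fit `model`, predict with that same model, return (accuracy, seconds)."""
    start = time.perf_counter()
    model.fit(train_x, train_y)
    preds = model.predict(test_x)  # local variable, so no stale predictions
    elapsed = time.perf_counter() - start
    return metrics.accuracy_score(test_y, preds), elapsed


# Synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=.3, random_state=42)

results = {}
for name, model in [("svm", LinearSVC()), ("naive bayes", GaussianNB())]:
    results[name] = fit_and_score(model, train_x, test_x, train_y, test_y)
    acc, secs = results[name]
    print(f"{name}: accuracy={acc:.3f}, seconds={secs:.3f}")
```

Because the predictions never leave the function, a score cannot accidentally be computed from another model's output.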

In [52]:
%%time

from sklearn.svm import LinearSVC
from sklearn import metrics

svm = LinearSVC()
svm.fit(train_x1, train_y1)
preds = svm.predict(test_x1)
print('Accuracy with count vectorizer:', metrics.accuracy_score(preds, test_y1), '\n\n')

Accuracy with count vectorizer: 0.9942093541202672 


CPU times: user 777 ms, sys: 94.1 ms, total: 871 ms
Wall time: 871 ms




In [53]:
%%time

svm = LinearSVC()
svm.fit(train_x2, train_y2)
preds = svm.predict(test_x2)
print('Accuracy with tfidf vectorizer:', metrics.accuracy_score(preds, test_y2), '\n\n')

Accuracy with tfidf vectorizer: 0.9947290274684484 


CPU times: user 436 ms, sys: 18 ms, total: 454 ms
Wall time: 452 ms


In [54]:
%%time
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(train_x1, train_y1)
print('Accuracy with count vectorizer:', metrics.accuracy_score(rfc.predict(test_x1), test_y1), '\n\n')

Accuracy with count vectorizer: 0.9977728285077951 


CPU times: user 13.9 s, sys: 37.8 ms, total: 14 s
Wall time: 14 s


In [55]:
%%time
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(train_x2, train_y2)
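# Score the random forest's own predictions; the output recorded below came from scoring the SVM's `preds`
print('accuracy with tfidf vectorizer:', metrics.accuracy_score(rfc.predict(test_x2), test_y2), '\n\n')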

accuracy with tfidf vectorizer: 0.9947290274684484 


CPU times: user 22.3 s, sys: 28.7 ms, total: 22.4 s
Wall time: 22.4 s


In [56]:
%%time
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(train_x1, train_y1)
print('Accuracy with count vectorizer:', metrics.accuracy_score(gnb.predict(test_x1), test_y1), '\n\n')

Accuracy with count vectorizer: 0.8893095768374165 


CPU times: user 520 ms, sys: 217 ms, total: 737 ms
Wall time: 735 ms


In [57]:
%%time
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(train_x2, train_y2)
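# Score the naive Bayes model's own predictions; the output recorded below came from scoring the SVM's `preds`
print('Accuracy with tfidf vectorizer:', metrics.accuracy_score(gnb.predict(test_x2), test_y2), '\n\n')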

Accuracy with tfidf vectorizer: 0.9947290274684484 


CPU times: user 335 ms, sys: 133 ms, total: 468 ms
Wall time: 469 ms
