# Generate features from text and use Multinomial Naive Bayes to predict fake news

References
* https://github.com/justmarkham/pycon-2016-tutorial/blob/master/tutorial_with_output.ipynb
* https://www.youtube.com/watch?v=hXNbFNCgPfY

## Generate data

In [60]:
import pandas as pd
import numpy as np
import os
import newspaper 

### Read data from files and merge them

In [93]:
path = os.path.join('data', 'fakenews_jz.csv')
# keep only the url, source, title and text columns
data_fakenews = pd.read_csv(path, usecols=[1, 2, 3, 5])
# label fake news as 1
data_fakenews['label_num'] = 1
data_fakenews.tail()

| | url | source | title | text | label_num |
|---|---|---|---|---|---|
| 435 | http://now8news.com/fidget-spinner-bursts-flam... | now8news | Fidget Spinner Bursts Into Flames Killing Todd... | The parents of a 3 year old girl woke up to tr... | 1 |
| 436 | http://now8news.com/18-year-old-girl-marries-f... | now8news | 18 Year Old Girl Marries Her Father In Arkansa... | 18 Year Old Girl Marries Her Father In Arkansa... | 1 |
| 437 | http://now8news.com/trump-raising-age-limit-to... | now8news | Trump Raising Age Limit For Tobacco Consumptio... | There is more bad news for cigarette smokers –... | 1 |
| 438 | http://now8news.com/caitlyn-jenner-discusses-d... | now8news | Caitlyn Jenner Discusses Her Desire To Transit... | Caitlyn Jenner or “CJ” as he refers to herself... | 1 |
| 439 | http://now8news.com/3-year-old-dies-tickled-de... | now8news | 3 Year Old Girl Dies After Accidentally Being ... | Charlotte, NC – It’s a warning being sent out ... | 1 |


In [94]:
data_fakenews.source.value_counts()

newsbbc                 87
nationonenews           62
newswithviews           49
infostormer             39
nephef                  31
nbc                     27
lastdeplorables         24
interestingdailynews    21
local31news             21
ladylibertysnews        19
krbcnews                19
majorthoughts           15
madworldnews            15
now8news                11
Name: source, dtype: int64

In [95]:
path = os.path.join('data', 'realnews_jz.csv')
# keep only the url, source, title and text columns
data_realnews = pd.read_csv(path, usecols=[1, 2, 3, 5])
# label real news as 0
data_realnews['label_num'] = 0
data_realnews.tail()

| | url | source | title | text | label_num |
|---|---|---|---|---|---|
| 412 | http://www.newyorker.com/magazine/1960/12/17/c... | newyorker | The World of Dr. Seuss | The face of Theodor Seuss Geisel—an arresting ... | 0 |
| 413 | http://www.newyorker.com/podcast/political-sce... | newyorker | Ai Weiwei Talks to David Remnick About Art, Ce... | David Remnick sits down with Ai Weiwei, China’... | 0 |
| 414 | http://www.newyorker.com/magazine/2017/07/31/w... | newyorker | Why Corrupt Bankers Avoid Jail | In the summer of 2012, a subcommittee of the U... | 0 |
| 415 | http://www.newyorker.com/news/benjamin-wallace... | newyorker | Benjamin Wallace-Wells: American Politics and ... | NaN | 0 |
| 416 | http://www.newyorker.com/culture/cultural-comm... | newyorker | Why Justin Bieber Got Banned from Performing i... | Until last week, the Dalai Lama had fairly lit... | 0 |


In [96]:
data_realnews.source.value_counts()

newyorker    96
msnbc        96
go           92
cbsnews      72
nbcnews      61
Name: source, dtype: int64

In [97]:
news_data = pd.concat([data_fakenews, data_realnews], ignore_index=True)
news_data.tail()

| | url | source | title | text | label_num |
|---|---|---|---|---|---|
| 852 | http://www.newyorker.com/magazine/1960/12/17/c... | newyorker | The World of Dr. Seuss | The face of Theodor Seuss Geisel—an arresting ... | 0 |
| 853 | http://www.newyorker.com/podcast/political-sce... | newyorker | Ai Weiwei Talks to David Remnick About Art, Ce... | David Remnick sits down with Ai Weiwei, China’... | 0 |
| 854 | http://www.newyorker.com/magazine/2017/07/31/w... | newyorker | Why Corrupt Bankers Avoid Jail | In the summer of 2012, a subcommittee of the U... | 0 |
| 855 | http://www.newyorker.com/news/benjamin-wallace... | newyorker | Benjamin Wallace-Wells: American Politics and ... | NaN | 0 |
| 856 | http://www.newyorker.com/culture/cultural-comm... | newyorker | Why Justin Bieber Got Banned from Performing i... | Until last week, the Dalai Lama had fairly lit... | 0 |


## Convert text to numbers (features)

In [98]:
# shape of our data
news_data.shape

(857, 5)

In [99]:
# drop every row that has a NaN in any column
news_data = news_data.dropna(axis=0, how='any')
news_data.tail()

| | url | source | title | text | label_num |
|---|---|---|---|---|---|
| 850 | http://www.newyorker.com/culture/cover-story/c... | newyorker | Javier Mariscal’s “Private Beach” | “The beach is always only three hundred feet f... | 0 |
| 852 | http://www.newyorker.com/magazine/1960/12/17/c... | newyorker | The World of Dr. Seuss | The face of Theodor Seuss Geisel—an arresting ... | 0 |
| 853 | http://www.newyorker.com/podcast/political-sce... | newyorker | Ai Weiwei Talks to David Remnick About Art, Ce... | David Remnick sits down with Ai Weiwei, China’... | 0 |
| 854 | http://www.newyorker.com/magazine/2017/07/31/w... | newyorker | Why Corrupt Bankers Avoid Jail | In the summer of 2012, a subcommittee of the U... | 0 |
| 856 | http://www.newyorker.com/culture/cultural-comm... | newyorker | Why Justin Bieber Got Banned from Performing i... | Until last week, the Dalai Lama had fairly lit... | 0 |


In [100]:
# class distribution (1 = fake, 0 = real)
news_data.label_num.value_counts()

1    423
0    399
Name: label_num, dtype: int64
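
`dropna` removes rows but keeps the original index labels, which is why the `tail()` output after the drop skips 851 and 855, and why later cells can look up an article by its label (for example `x_test[476]`). A minimal sketch on a made-up three-row frame (the values are placeholders, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the middle row has no text.
toy = pd.DataFrame({"title": ["a", "b", "c"], "text": ["x", np.nan, "z"]})

# how="any" drops a row as soon as one of its values is NaN.
clean = toy.dropna(axis=0, how="any")
print(clean.index.tolist())  # the surviving labels are not renumbered

# Optional: renumber from 0 if positional and label access should agree.
clean_reset = clean.reset_index(drop=True)
print(clean_reset.index.tolist())
```

Whether to call `reset_index` is a design choice; this notebook keeps the original labels, so label lookups still point to the same rows as in the merged frame.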

### Define x and y for modeling later, and split data into training and testing sets

In [101]:
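# sklearn.cross_validation was removed in newer scikit-learn releases; model_selection is its replacement
from sklearn.model_selection import train_test_split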

In [102]:
x = news_data.text
y = news_data.label_num
print(x.shape)
print(y.shape)

(822,)
(822,)


In [103]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(616,)
(206,)
(616,)
(206,)
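
When no `test_size` is given, `train_test_split` holds out 25% of the rows, which is where the 616/206 split of 822 rows comes from. The sketch below reproduces those sizes on synthetic stand-in labels (not the news data) and also shows the optional `stratify` argument, which the notebook does not use; it is included only as an illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 822 labelled articles (423 fake, 399 real).
y_demo = np.array([1] * 423 + [0] * 399)
x_demo = np.arange(len(y_demo))

# Default test_size is 0.25; the test count is rounded up (ceil(205.5) = 206).
xtr, xte, ytr, yte = train_test_split(x_demo, y_demo, random_state=1)
print(len(xtr), len(xte))

# stratify keeps the fake/real ratio nearly identical in both splits.
xtr_s, xte_s, ytr_s, yte_s = train_test_split(
    x_demo, y_demo, random_state=1, stratify=y_demo
)
print(ytr_s.mean(), yte_s.mean())
```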


### Method 1: Use CountVectorizer to generate features

In [104]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer
vect = CountVectorizer()

In [105]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(x_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [106]:
x_train_dtm = vect.transform(x_train)

In [107]:
# examine the document-term matrix
x_train_dtm

<616x23858 sparse matrix of type '<class 'numpy.int64'>'
	with 148177 stored elements in Compressed Sparse Row format>

In [108]:
# transform testing data (using fitted vocabulary) into a document-term matrix
x_test_dtm = vect.transform(x_test)
x_test_dtm

<206x23858 sparse matrix of type '<class 'numpy.int64'>'
	with 44875 stored elements in Compressed Sparse Row format>
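
Both matrices have 23858 columns because the test data is transformed with the vocabulary learned from the training data; words that appear only in the test set are dropped. A small sketch with made-up sentences (the corpus and the `demo_vect` name are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["fake news spreads fast", "real news takes time"]
test_docs = ["fake aliens spread news"]

demo_vect = CountVectorizer()
train_dtm = demo_vect.fit_transform(train_docs)  # learn vocabulary and count
test_dtm = demo_vect.transform(test_docs)        # reuse the fitted vocabulary

# Columns follow the alphabetical order of the learned vocabulary.
print(sorted(demo_vect.vocabulary_))
print(train_dtm.toarray())
# "aliens" and "spread" were never seen in training, so they are ignored.
print(test_dtm.toarray())
```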

### Method 2: Use TF-IDF to generate features

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [178]:
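# learn the vocabulary and IDF weights from the training text and build its matrix in one step
# (this reuses the name x_train_dtm, so it replaces the Method 1 matrix; run only one method before modeling)
x_train_dtm = tfidf.fit_transform(x_train)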

In [179]:
# examine the document-term matrix
x_train_dtm

<163x8324 sparse matrix of type '<class 'numpy.float64'>'
	with 25430 stored elements in Compressed Sparse Row format>

In [180]:
x_test_dtm = tfidf.transform(x_test)
x_test_dtm

<55x8324 sparse matrix of type '<class 'numpy.float64'>'
	with 7567 stored elements in Compressed Sparse Row format>
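
TF-IDF down-weights terms that occur in many documents, so the matrix holds floats instead of raw counts. A minimal sketch on three made-up headlines (the corpus is hypothetical, and the default `TfidfVectorizer` settings are assumed) shows that, within one document, a rarer word gets a larger weight than a word found everywhere:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["senate news today", "weather news today", "sports news"]

demo_tfidf = TfidfVectorizer()
weights = demo_tfidf.fit_transform(docs)

col = demo_tfidf.vocabulary_           # maps each term to its column index
row0 = weights[0].toarray().ravel()    # weights for "senate news today"

# "senate" appears in 1 document, "today" in 2, "news" in all 3.
print(row0[col["senate"]], row0[col["today"]], row0[col["news"]])

# With the default norm='l2', each row has unit length.
print(np.linalg.norm(row0))
```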

## Predicting fake news using Multinomial Naive Bayes 

In [109]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [110]:
# train the model using x_train_dtm (timing it with an IPython "magic command")
%time nb.fit(x_train_dtm, y_train)

CPU times: user 3.89 ms, sys: 1.94 ms, total: 5.83 ms
Wall time: 4.79 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [111]:
# make class predictions for x_test_dtm
y_pred_class = nb.predict(x_test_dtm)

In [112]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.83495145631067957
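
The vectorize, fit and predict steps above can also be chained into a single estimator. The sketch below is one possible arrangement using scikit-learn's `make_pipeline`; the toy sentences and labels are invented for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = fake, 0 = real.
toy_text = [
    "shocking miracle cure doctors hate",
    "shocking secret they hide from you",
    "senate passes budget bill",
    "court rules on budget appeal",
]
toy_label = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together, and applies
# the same fitted vocabulary automatically at prediction time.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(toy_text, toy_label)

toy_pred = pipe.predict(["shocking secret cure", "senate budget vote"])
print(toy_pred)
```

A pipeline like this avoids accidentally fitting the vectorizer on test data, and it can be passed directly to cross-validation helpers.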

In [113]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[94, 16],
       [18, 78]])
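
In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix above reads as 94 true negatives, 16 false positives, 18 false negatives and 78 true positives. The following sketch recomputes the accuracy from those four numbers and adds precision and recall for the fake class:

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[94, 16],
               [18, 78]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all articles classified correctly
precision = tp / (tp + fp)        # of articles flagged fake, how many are fake
recall = tp / (tp + fn)           # of fake articles, how many were flagged
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```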

In [114]:
# print the false positives (real news incorrectly classified as fake)
x_test[y_test < y_pred_class]

713    Katy Perry says the prospect of sharing her mu...
476    Video\n\nGen. Dunford on North Korea: We can p...
755    One of President Donald Trump's top advisers o...
733    Republican Sen. Lindsey Graham today warned th...
680    Boy Scouts of America Chief Scout Executive Mi...
832    If you read Jared Kushner’s statement to congr...
465    After Volkswagen was caught cheating on diesel...
835    Ryan Lizza talks with Dorothy Wickenden about ...
469    The American Heart Association wants you to re...
491    The U.S. government said on Monday it would re...
717    The European Commission announced a new emerge...
570    Is banning selfies in the voting booth is a vi...
702    Are the United States and North Korea moving c...
710    Miami Dolphins center Mike Pouncey says Aaron ...
777    WASHINGTON ( The Borowitz Report )—In an extra...
730    It's hard to overstate the enormity of Preside...
Name: text, dtype: object
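
The comparison `y_test < y_pred_class` is true only where the true label is 0 and the prediction is 1, so it serves as a boolean mask for false positives; the mirror-image `>` selects false negatives. A small sketch with made-up values (all names here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy texts with non-consecutive index labels, as in the split data.
toy_x = pd.Series(["a", "b", "c", "d"], index=[10, 11, 12, 13])
toy_true = pd.Series([0, 0, 1, 1], index=toy_x.index)
toy_guess = np.array([0, 1, 1, 0])   # predictions come back as a plain array

false_pos = toy_x[toy_true < toy_guess]   # true 0, predicted 1
false_neg = toy_x[toy_true > toy_guess]   # true 1, predicted 0
print(false_pos.index.tolist(), false_neg.index.tolist())
```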

In [115]:
# example false positive
x_test[476]

'Video\n\nGen. Dunford on North Korea: We can protect the American people today'

In [116]:
test = news_data[news_data.text=='Video\n\nGen. Dunford on North Korea: We can protect the American people today']
print(test.url)

476    http://www.nbcnews.com/news/north-korea
Name: url, dtype: object


In [65]:
# print the false negatives (fake news incorrectly classified as real)
x_test[y_test > y_pred_class]

405    WASHINGTON (AP) — President Donald Trump and B...
117    First Lady Melania Trump returned to the Unite...
380    “The Message” Reveals Truth About Eugene Peter...
91     Being a Minnesotan, I didn’t realize I talked ...
107    There is already more than ample evidence to s...
438    Caitlyn Jenner or “CJ” as he refers to herself...
50     So shocking! A report is claiming that there i...
113    A highly decorated Air Force officer is breaki...
79     Iceland is a bizarre country. Still remember t...
265    JoAnne Cusick was wearing a pink floral sundre...
59     The entire series of Transformers films direct...
90     I wouldn’t care if it took me extra time to ge...
322    I watch my wonderful wife of twenty-two years ...
61     A new strain of super-gonorrhoea is ripping th...
41     lol at the fact that the British taxpayer fund...
80     This is exactly why I can’t watch Mythbusters....
68     A new strain of super-gonorrhoea is ripping th...
287    A Wal-Mart parking lot b

In [117]:
# example false negative
x_test[68]

'A new strain of super-gonorrhoea is ripping through Houstons youth\n\nUS health experts are expressing a huge concern for what they are calling super-gonorrhoea.\n\nThe new superbug began its spread in Englands gay community but is quickly reaching the furthest corners of the globe.\n\nHouston is being hit particularly hard, and the bug is no longer only affecting only the gay community.\n\nThe bug prompted worldwide attention last year when it became immune to its most widely used cure. US health experts are calling their attempts at stopping the bug a limited success and fear it may soon be completely untreatable.\n\nThe bug has been found to cause infertility, along with all the alements associated with regular gonorrhoea.\n\nIt is currently being treated with a combination of azithromycin and ceftriaxone, but resistance to azithromycin is spreading, and experts say its only a matter of time before ceftriaxone fails too.\n\nExperts estimate there are at least 16,000 infected indivi

In [118]:
# calculate predicted probabilities of the fake class (column 1) for x_test_dtm
# (Naive Bayes probabilities tend to be poorly calibrated)
y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.87874053030303023
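
AUC depends only on how the scores rank the articles, not on the score values themselves, which is why poorly calibrated probabilities can still give a meaningful AUC. A minimal sketch with four made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, scores)

# Raising scores to a power distorts the values but keeps their order,
# so the AUC does not change.
auc_distorted = roc_auc_score(y_true, scores ** 10)
print(auc, auc_distorted)
```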