# Generate features from text and use Multinomial Naive Bayes to predict fake news

References
* https://github.com/justmarkham/pycon-2016-tutorial/blob/master/tutorial_with_output.ipynb
* https://www.youtube.com/watch?v=hXNbFNCgPfY

## Generate data

In [3]:
import pandas as pd
import numpy as np
import os
import newspaper 

### Method 1: generate data directly

In [20]:
def generate_data(news_list, data_name):
    """Scrape up to 50 articles per site and write them to <data_name>.csv."""
    col_names = ["source", "title", "author", "text"]
    rows = []

    for link in news_list:
        print(link)
        # memoize_articles=False re-fetches articles already seen in earlier runs
        news_source = newspaper.build(link, memoize_articles=False)
        news_brand = news_source.brand
        # cap the number of articles per site at 50
        num_news = min(50, news_source.size())

        count = 0
        for article in news_source.articles[:num_news]:
            try:
                article.download()
                article.parse()
            except Exception:
                # skip articles that fail to download or parse
                continue
            rows.append([news_brand, article.title, article.authors, article.text])
            count += 1
        print("The total number of " + str(news_brand) + " articles is ", count)

    # build the frame once; DataFrame.append was removed in pandas 2.0
    article_df = pd.DataFrame(rows, columns=col_names)
    article_df.to_csv(data_name + ".csv")

In [21]:
# create fake and real news website list
fake_news_list = ['http://conservativefighters.com/', 'http://cnn.com.de/']
true_news_list = ['http://www.nbcnews.com/', 'http://abcnews.go.com', 'http://www.cbsnews.com/']
my_news_list = list(set(fake_news_list+true_news_list))
print(my_news_list)

['http://www.nbcnews.com/', 'http://abcnews.go.com', 'http://www.cbsnews.com/', 'http://cnn.com.de/', 'http://conservativefighters.com/']


In [22]:
# write data to file
generate_data(my_news_list, 'news_data')

http://www.nbcnews.com/
Article `download()` failed with 404 Client Error: Not Found for url: http://www.nbcnews.com/feature/donald-trump-cabinet on URL http://www.nbcnews.com/feature/donald-trump-cabinet
Article `download()` failed with 404 Client Error: Not Found for url: http://www.nbcnews.com/tv/shows/responding-by-storm/news/dramatic-storm-rescue-photos-2017 on URL http://www.nbcnews.com/tv/shows/responding-by-storm/news/dramatic-storm-rescue-photos-2017
The total number of nbcnews articles is  48
http://abcnews.go.com
The total number of go articles is  50
http://www.cbsnews.com/
The total number of cbsnews articles is  50
http://cnn.com.de/
The total number of com articles is  50
http://conservativefighters.com/
The total number of conservativefighters articles is  36


#### Read and clean data 

In [186]:
# read data
news_data = pd.read_csv('news_data.csv', usecols=[1,2,4])
news_data.tail()

Unnamed: 0,source,title,text
229,conservativefighters,Not Messing Around: Trump’s New Comms Director...,Anthony Scaramucci was hired by President Dona...
230,conservativefighters,Comments on: Singer Bans American Flag From Co...,
231,conservativefighters,Swedish Company Epicenter Implants Microchips ...,A Swedish company Epicenter implants microchip...
232,conservativefighters,Comments on: Charlie Gard’s Parents Release De...,
233,conservativefighters,McDonald’s Employee Gets Fired For Refusing To...,"Officer Scott Naff, who works for Virginia Dep..."


In [189]:
# convert source to a numerical variable:
# news from go (abcnews), cbsnews, and nbcnews are 0
# news from com (cnn.com.de) and conservativefighters are 1 (flagged as fake news)
news_data['label_num'] = news_data.source.map({'go':0,'cbsnews':0,'nbcnews':0, 'conservativefighters':1, 'com':1})

### Method 2: read data from files and merge them

In [75]:
path = os.path.join('data', 'fakenews_jz.csv')
data_fakenews = pd.read_csv(path,usecols=[2,3,5])
data_fakenews.tail()

Unnamed: 0,source,title,text
222,nationonenews,Nation One News,We use cookies to give you the best possible e...
223,nationonenews,"Comments on: Jill Steins still has $1,361,834....",
224,nationonenews,"Not so fast, Scaramucci is not replacing Spice...",Some news organizations are clamoring to disto...
225,nationonenews,"[WATCH] Fox News Put The Washington ""Hurt"" On ...",The Democrats in Congress will stop at nothing...
226,nationonenews,Comments on: [Viral Video] Trump appears on Ga...,


In [13]:
data_fakenews.source.value_counts()

newsbbc                 87
nationonenews           62
nbc                     27
interestingdailynews    21
majorthoughts           15
madworldnews            15
Name: source, dtype: int64

In [76]:
path = os.path.join('data', 'realnews_jz.csv')
data_realnews = pd.read_csv(path,usecols=[2,3,5])
data_realnews.tail()

Unnamed: 0,source,title,text
291,cbsnews,Adam West 1928-2017,
292,cbsnews,Elon Musk and Mark Zuckerberg clash over risks...,Two tech billionaires are clashing over the fu...
293,cbsnews,Confusion and mystery shroud health care bill ...,"July 25, 2017, 7:03 AM | The Republican vow to..."
294,cbsnews,15-pound bag of frozen pork lands on family's ...,"FORT LAUDERDALE, Fla. -- Meat falling from the..."
295,cbsnews,South Carolina domestic violence law unfair to...,"COLUMBIA, S.C. -- People in same-sex relations..."


In [12]:
data_realnews.source.value_counts()

msnbc      100
cbsnews    100
nbcnews     96
Name: source, dtype: int64

In [48]:
news_data = pd.concat([data_fakenews, data_realnews], ignore_index=True)
news_data.tail()

Unnamed: 0,source,title,text
518,cbsnews,Adam West 1928-2017,
519,cbsnews,Elon Musk and Mark Zuckerberg clash over risks...,Two tech billionaires are clashing over the fu...
520,cbsnews,Confusion and mystery shroud health care bill ...,"July 25, 2017, 7:03 AM | The Republican vow to..."
521,cbsnews,15-pound bag of frozen pork lands on family's ...,"FORT LAUDERDALE, Fla. -- Meat falling from the..."
522,cbsnews,South Carolina domestic violence law unfair to...,"COLUMBIA, S.C. -- People in same-sex relations..."


## Convert text to numbers (features)

In [49]:
# shape of our data
news_data.shape

(523, 3)

In [50]:
# source distribution 
news_data.source.value_counts()

msnbc                   100
cbsnews                 100
nbcnews                  96
newsbbc                  87
nationonenews            62
nbc                      27
interestingdailynews     21
majorthoughts            15
madworldnews             15
Name: source, dtype: int64

In [51]:
# real news sources are 0, fake news sources are 1
news_data['label_num'] = news_data.source.map({'msnbc':0, 'cbsnews':0, 'nbcnews':0,
                                               'newsbbc':1, 'nationonenews':1,
                                               'nbc':1, 'interestingdailynews':1,
                                               'majorthoughts':1, 'madworldnews':1})
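
Any source missing from the mapping dictionary becomes NaN under `Series.map`, and a later `dropna` call would then drop those rows without warning. A quick coverage check, sketched in plain Python with a hypothetical extra source (`examplenews`) that is not in the data:

```python
label_map = {
    'msnbc': 0, 'cbsnews': 0, 'nbcnews': 0,
    'newsbbc': 1, 'nationonenews': 1, 'nbc': 1,
    'interestingdailynews': 1, 'majorthoughts': 1, 'madworldnews': 1,
}

# sources as they appear in the data, plus one hypothetical unmapped source
sources = list(label_map) + ['examplenews']

# anything listed here would end up with a NaN label
unmapped = sorted(set(sources) - set(label_map))
print(unmapped)
```

An empty list means every source has a label.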

In [52]:
news_data.tail(10)

Unnamed: 0,source,title,text,label_num
513,cbsnews,New details emerge on Linkin Park frontman Che...,LOS ANGELES -- New details have emerged on Lin...,0
514,cbsnews,The very American art of Steve Penley,,0
515,cbsnews,Louise Penny: How writing became her solace,The loyalty of her millions of readers speaks ...,0
516,cbsnews,Lawmakers from both parties warn Trump not to ...,WASHINGTON -- Members of congress have a messa...,0
517,cbsnews,"Debt ceiling will be hit in October, CBO estim...",The federal government will exhaust its so-cal...,0
518,cbsnews,Adam West 1928-2017,,0
519,cbsnews,Elon Musk and Mark Zuckerberg clash over risks...,Two tech billionaires are clashing over the fu...,0
520,cbsnews,Confusion and mystery shroud health care bill ...,"July 25, 2017, 7:03 AM | The Republican vow to...",0
521,cbsnews,15-pound bag of frozen pork lands on family's ...,"FORT LAUDERDALE, Fla. -- Meat falling from the...",0
522,cbsnews,South Carolina domestic violence law unfair to...,"COLUMBIA, S.C. -- People in same-sex relations...",0


In [53]:
# drop rows that contain NaN
news_data = news_data.dropna(axis=0,how='any')    #to drop if any value in the row has a nan
news_data.tail(10)

Unnamed: 0,source,title,text,label_num
511,cbsnews,On The Horizon: Scorpion venom as cancer treat...,A multitude of potential advances are ON THE H...,0
512,cbsnews,Kristen Bell on why her daughters don't watch ...,Kristen Bell has lent her voice to more than j...,0
513,cbsnews,New details emerge on Linkin Park frontman Che...,LOS ANGELES -- New details have emerged on Lin...,0
515,cbsnews,Louise Penny: How writing became her solace,The loyalty of her millions of readers speaks ...,0
516,cbsnews,Lawmakers from both parties warn Trump not to ...,WASHINGTON -- Members of congress have a messa...,0
517,cbsnews,"Debt ceiling will be hit in October, CBO estim...",The federal government will exhaust its so-cal...,0
519,cbsnews,Elon Musk and Mark Zuckerberg clash over risks...,Two tech billionaires are clashing over the fu...,0
520,cbsnews,Confusion and mystery shroud health care bill ...,"July 25, 2017, 7:03 AM | The Republican vow to...",0
521,cbsnews,15-pound bag of frozen pork lands on family's ...,"FORT LAUDERDALE, Fla. -- Meat falling from the...",0
522,cbsnews,South Carolina domestic violence law unfair to...,"COLUMBIA, S.C. -- People in same-sex relations...",0


In [56]:
# label distribution (0 = real news, 1 = fake news)
news_data.label_num.value_counts()

0    285
1    216
Name: label_num, dtype: int64

### Define x and y for modeling later, and split data into training and testing sets

In [57]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

In [58]:
x = news_data.text
y = news_data.label_num
print(x.shape)
print(y.shape)

(501,)
(501,)


In [59]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(375,)
(126,)
(375,)
(126,)
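
The 375/126 split follows from the default test fraction of 0.25 that `train_test_split` uses when neither `test_size` nor `train_size` is given. A minimal sketch of the arithmetic, assuming the test count is rounded up and the training set takes the remainder:

```python
import math

n_samples = 501          # rows left after dropping NaN
test_fraction = 0.25     # default when test_size is not specified

n_test = math.ceil(n_samples * test_fraction)   # test count is rounded up
n_train = n_samples - n_test                    # remainder goes to training

print(n_train, n_test)
```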


### Method 1: Use CountVectorizer to generate features

In [60]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer
vect = CountVectorizer()

In [61]:
# learn the training data vocabulary (the document-term matrix is built in the next cell)
vect.fit(x_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [64]:
x_train_dtm = vect.transform(x_train)

In [65]:
# examine the document-term matrix
x_train_dtm

<375x13714 sparse matrix of type '<class 'numpy.int64'>'
	with 66790 stored elements in Compressed Sparse Row format>

In [68]:
# transform testing data (using fitted vocabulary) into a document-term matrix
x_test_dtm = vect.transform(x_test)
x_test_dtm

<126x13714 sparse matrix of type '<class 'numpy.int64'>'
	with 18817 stored elements in Compressed Sparse Row format>
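
To make the fit/transform steps concrete, here is a rough pure-Python sketch of a bag-of-words vectorizer. It is not scikit-learn's implementation; it only assumes the lowercasing and token pattern shown in the `CountVectorizer` repr above, and the toy sentences and function names are made up for illustration.

```python
import re
from collections import Counter

# same token pattern as in the CountVectorizer repr: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def fit_vocabulary(docs):
    # map each distinct training token to a column index (alphabetical order)
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocabulary):
    # count tokens per document; tokens outside the vocabulary are ignored
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            row[vocabulary[term]] = n
        rows.append(row)
    return rows

toy_train = ["Fake news spreads fast", "Real news is checked, real slow"]
toy_test = ["Unseen words vanish, news stays"]

vocab = fit_vocabulary(toy_train)
train_dtm = transform(toy_train, vocab)
test_dtm = transform(toy_test, vocab)
print(vocab)
print(test_dtm)
```

The test document keeps only the one word it shares with the training vocabulary, which is why the training and testing matrices above have the same number of columns.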

### Method 2: use TF-IDF to generate features

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [178]:
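# learn the vocabulary and IDF weights from the training text and build the matrix in one step
# note: the outputs in this section (163 and 55 rows) come from a run on a different
# split than the 375/126 one above, and these cells overwrite x_train_dtm / x_test_dtm
x_train_dtm = tfidf.fit_transform(x_train)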

In [179]:
# examine the document-term matrix
x_train_dtm

<163x8324 sparse matrix of type '<class 'numpy.float64'>'
	with 25430 stored elements in Compressed Sparse Row format>

In [180]:
x_test_dtm = tfidf.transform(x_test)
x_test_dtm

<55x8324 sparse matrix of type '<class 'numpy.float64'>'
	with 7567 stored elements in Compressed Sparse Row format>
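
As a rough illustration of what the TF-IDF weighting does to a count matrix, the sketch below applies the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which corresponds to the `smooth_idf=True` and `norm='l2'` settings in the repr above. The function name and toy counts are made up, and this is a plain-Python approximation rather than the library's code.

```python
import math

def tfidf_rows(count_rows):
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # document frequency: number of documents that contain each term
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    out = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # scale each row to unit length; leave all-zero rows untouched
        out.append([v / norm if norm else 0.0 for v in weighted])
    return out

toy_counts = [
    [1, 1, 0],   # term 0 appears in both documents, term 1 only here
    [1, 0, 2],
]
toy_tfidf = tfidf_rows(toy_counts)
print(toy_tfidf)
```

In the first row, the term that occurs in only one document gets a larger weight than the term shared by both, even though their raw counts are equal.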

## Predicting fake news using Multinomial Naive Bayes 

In [69]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [70]:
# train the model using x_train_dtm (timing it with an IPython "magic command")
%time nb.fit(x_train_dtm, y_train)

CPU times: user 2.55 ms, sys: 1.51 ms, total: 4.06 ms
Wall time: 3.12 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
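
For intuition about what `fit` and `predict` compute, here is a small pure-Python sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`) and class priors estimated from the data (`fit_prior=True`), matching the parameters in the repr above. The function names and toy counts are illustrative assumptions, not scikit-learn internals.

```python
import math

def fit_multinomial_nb(count_rows, labels, alpha=1.0):
    classes = sorted(set(labels))
    n_terms = len(count_rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows_c = [row for row, y in zip(count_rows, labels) if y == c]
        # class prior: fraction of training documents in class c
        log_prior[c] = math.log(len(rows_c) / len(count_rows))
        # smoothed per-term probabilities within class c
        term_totals = [sum(row[j] for row in rows_c) for j in range(n_terms)]
        denom = sum(term_totals) + alpha * n_terms
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in term_totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    # pick the class with the highest log prior plus count-weighted log likelihood
    scores = {
        c: log_prior[c] + sum(n * ll for n, ll in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# toy document-term counts: class 0 favors terms 0 and 2, class 1 favors term 1
toy_counts = [[2, 0, 1], [1, 0, 2], [0, 3, 0], [0, 2, 1]]
toy_labels = [0, 0, 1, 1]
prior, likelihood = fit_multinomial_nb(toy_counts, toy_labels)

print(predict([1, 0, 1], prior, likelihood))
print(predict([0, 2, 0], prior, likelihood))
```

Training only needs per-class term totals, which is consistent with the millisecond fit time reported above.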

In [71]:
# make class predictions for x_test_dtm
y_pred_class = nb.predict(x_test_dtm)

In [72]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.87301587301587302

In [73]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[70,  8],
       [ 8, 40]])
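
The accuracy reported above can be recovered from the confusion matrix, whose rows are actual labels and columns are predicted labels. The sketch below also derives precision and recall for the fake class, plus the null accuracy (always predicting the majority class) as a baseline; the variable names are illustrative.

```python
# confusion matrix values from the cell above
tn, fp = 70, 8    # actual real news: predicted real, predicted fake
fn, tp = 8, 40    # actual fake news: predicted real, predicted fake

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)     # share of "fake" predictions that are correct
recall = tp / (tp + fn)        # share of fake articles that are caught
null_accuracy = max(tn + fp, fn + tp) / total   # always guess the majority class

print(accuracy, precision, recall, null_accuracy)
```

Comparing `accuracy` with `null_accuracy` shows how much the classifier improves on a trivial majority-class guess.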

In [37]:
# print the false positives (real news incorrectly classified as fake)
x_test[y_test < y_pred_class]

363    Tune in to the first joint candidate event of ...
400    Video footage shows 5-year-old Omran Daqneesh ...
271    Mexican and U.S. Breweries Team up to Make Bee...
488    April 10, 2017, 8:42 AM | The winners of the 2...
232    Only the Dutch Chocolate packages with Rocky R...
408    Tune in to the first joint candidate event of ...
357    Video footage shows 5-year-old Omran Daqneesh ...
509    Incoming White House press secretary Sarah Huc...
Name: text, dtype: object

In [40]:
# example false positive (select by index label)
x_test.loc[400]

'Video footage shows 5-year-old Omran Daqneesh being plucked away from the rubble in the aftermath of a devastating airstrike in Aleppo and carried inside an ambulance, looking dazed and flat-eyed.\n\nThe boy then runs his hand over his blood-covered face, looks at his hands and wipes them on the ambulance chair.\n\nWatch the heartbreaking video here: http://on.msnbc.com/2bfBhoO'

In [41]:
# print the false negatives (fake news incorrectly classified as real)
x_test[y_test > y_pred_class]

47     Big changes may be on the horizon for the stat...
34     Phoenix, AZ — Sitting in the front row during ...
186    Ivanka Trump is a success, despite what her fa...
4      Nine-speed auto rated 31/47 mpg\n\nThe 2017 Ch...
29     WASHINGTON, D.C. (AP) — Today President Donald...
170    Or, Dr. Mike Adams says: "You have the Constit...
201    A 10-year-old was involved in an alleged argum...
18     Teacher Breaks Down in Tears When She Sees Wha...
Name: text, dtype: object
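
The comparisons `y_test < y_pred_class` and `y_test > y_pred_class` work as masks because the labels are 0 and 1: actual 0 with predicted 1 is a false positive, and actual 1 with predicted 0 is a false negative. A toy sketch with made-up labels:

```python
actual = [0, 0, 1, 1]
predicted = [0, 1, 0, 1]

# True only where actual is 0 and predicted is 1
false_positive_mask = [a < p for a, p in zip(actual, predicted)]
# True only where actual is 1 and predicted is 0
false_negative_mask = [a > p for a, p in zip(actual, predicted)]

print(false_positive_mask)
print(false_negative_mask)
```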

In [43]:
# example false negative (select by index label)
x_test.loc[4]

'Nine-speed auto rated 31/47 mpg\n\nThe 2017 Chevrolet Cruze Diesel sedan has been EPA-rated 30/52/37 mpg (city/highway/combined) when equipped with the six-speed manual transmission. Based on those numbers, GM estimates the Cruze Diesel has a range of up to 702 miles on one tank.\n\nThe Cruze Diesel is powered by a 1.6-liter turbodiesel I-4 that makes 137 hp and 240 lb-ft of torque. In addition to the six-speed manual, the diesel is also offered with a nine-speed automatic, which is rated at 31/47/37 mpg. Chevy calls the 52-mpg highway number “segment-best,” but with Volkswagen’s Jetta TDI sidelined by the company’s diesel emissions scandal, it’s difficult to make any apples-to-apples comparisons. Both manual and automatic EPA estimates best the previous Cruze Diesel’s 27/44/32 mpg result, however. That car was powered by a 151-hp, 256-lb-ft 2.0-liter turbodiesel I-4 mated to a six-speed automatic.\n\nNot surprisingly, the diesels’ fuel economy ratings also come in higher than their g

In [74]:
# calculate predicted probabilities of the fake class for x_test_dtm
# (Naive Bayes probabilities are poorly calibrated, but AUC depends only on their ranking)
y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.88354700854700852
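
AUC can be read as the probability that a randomly chosen fake article receives a higher score than a randomly chosen real one, with ties counting as half. A minimal pairwise sketch of that definition, using made-up labels and scores (the quadratic loop is fine for a toy example but not how a library would compute it):

```python
def roc_auc(labels, scores):
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    # compare every positive score with every negative score
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because only the ordering of scores matters, poorly calibrated probabilities can still give a meaningful AUC.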