# Generate features from text and use Multinomial Naive Bayes to predict fake news

References
* https://github.com/justmarkham/pycon-2016-tutorial/blob/master/tutorial_with_output.ipynb
* https://www.youtube.com/watch?v=hXNbFNCgPfY

## Generate data

In [1]:
import pandas as pd
import numpy as np
import newspaper 

In [185]:
fake_news = ['http://ABCnews.com.co','http://bizstandardnews.com','http://Bloomberg.ma',
     'http://70news.wordpress.com','http://beforeitsnews.com', 'http://ddsnewstrend.com', 
     'http://thebostontribune.com/','http://americanfreepress.net/','http://www.bipartisanreport.com/',
     'http://aurora-news.us/', 'http://conservativefighters.com/','http://conservativespirit.com/',
     'http://conservative101.com/','http://DrudgeReport.com.co','http://NBCNews.com.co','http://TrueTrumpers.com',
     'http://UndergroundNewsReport.com','http://washingtonpost.com.co','http://YourNewsWire.com','http://cnn.com.de/',
     'http://rickwells.us/','http://thedcgazette.com/','http://donaldtrumppotus45.com/','http://24wpn.com',
     'http://AmericanFlavor.news','http://AmericanPresident.co','http://AMPosts.com','http://BB4SP.com',
     'http://BlueVisionPost.com','http://CivicTribune.com']

true_news = ['https://www.wsj.com/','https://www.nytimes.com/','http://www.bbc.com/news',
             'http://www.npr.org/sections/news/', 'http://www.reuters.com/', 'https://www.economist.com/',
             'https://www.apnews.com/', 'http://www.cnn.com', 'http://www.foxnews.com/', 
             'http://www.politico.com/', 'http://www.nbcnews.com/', 'http://www.msnbc.com/', 'http://www.cbsnews.com/',
             'http://www.huffingtonpost.com/','http://www.bloomberg.com/', 'http://abcnews.go.com/',
             'http://www.aljazeera.com/news/', 'https://www.afp.com/en/news-hub', 'http://www.newyorker.com/',
             'https://www.theguardian.com/', 'http://www.telegraph.co.uk/', 'http://www.zeit.de/english/index',
             'http://www.chicagotribune.com/', 'http://www.freep.com/', 'http://www.bostonherald.com/',
             'http://www.dailypress.com/', 'http://www.detroitnews.com/', 'https://www.ft.com/', 
             'http://www.ibtimes.com/', 'http://www.voanews.com/']

fake_news_list = fake_news[10:20] 
true_news_list = true_news[10:20] 

print(fake_news_list)
print(true_news_list)

['http://conservativefighters.com/', 'http://conservativespirit.com/', 'http://conservative101.com/', 'http://DrudgeReport.com.co', 'http://NBCNews.com.co', 'http://TrueTrumpers.com', 'http://UndergroundNewsReport.com', 'http://washingtonpost.com.co', 'http://YourNewsWire.com', 'http://cnn.com.de/']
['http://www.nbcnews.com/', 'http://www.msnbc.com/', 'http://www.cbsnews.com/', 'http://www.huffingtonpost.com/', 'http://www.bloomberg.com/', 'http://abcnews.go.com/', 'http://www.aljazeera.com/news/', 'https://www.afp.com/en/news-hub', 'http://www.newyorker.com/', 'https://www.theguardian.com/']


In [20]:
def generate_data(news_list, data_name):
    """Scrape up to 50 articles per site and write them to <data_name>.csv."""
    col_names = ["source", "title", "author", "text"]
    rows = []

    for link in news_list:
        print(link)
        # build a Source object for the site; memoize_articles=False disables article caching
        news_articles = newspaper.build(link, memoize_articles=False)
        news_brand = news_articles.brand
        # cap the number of articles per site at 50
        num_news = min(50, news_articles.size())

        count = 0
        for article in news_articles.articles[:num_news]:
            try:
                article.download()
                article.parse()
                rows.append([news_brand, article.title, article.authors, article.text])
                count += 1
            except Exception:
                # skip articles that fail to download or parse
                pass
        print("The total number of " + str(news_brand) + " articles is ", count)

    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    article_df = pd.DataFrame(rows, columns=col_names)
    article_df.to_csv(data_name + ".csv")

In [21]:
# create fake and real news website list
fake_news_list = ['http://conservativefighters.com/', 'http://cnn.com.de/']
true_news_list = ['http://www.nbcnews.com/', 'http://abcnews.go.com', 'http://www.cbsnews.com/']
my_news_list = list(set(fake_news_list+true_news_list))
print(my_news_list)

['http://www.nbcnews.com/', 'http://abcnews.go.com', 'http://www.cbsnews.com/', 'http://cnn.com.de/', 'http://conservativefighters.com/']


In [22]:
# write data to file
generate_data(my_news_list, 'news_data')

http://www.nbcnews.com/
Article `download()` failed with 404 Client Error: Not Found for url: http://www.nbcnews.com/feature/donald-trump-cabinet on URL http://www.nbcnews.com/feature/donald-trump-cabinet
Article `download()` failed with 404 Client Error: Not Found for url: http://www.nbcnews.com/tv/shows/responding-by-storm/news/dramatic-storm-rescue-photos-2017 on URL http://www.nbcnews.com/tv/shows/responding-by-storm/news/dramatic-storm-rescue-photos-2017
The total number of nbcnews articles is  48
http://abcnews.go.com
The total number of go articles is  50
http://www.cbsnews.com/
The total number of cbsnews articles is  50
http://cnn.com.de/
The total number of com articles is  50
http://conservativefighters.com/
The total number of conservativefighters articles is  36
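The brand strings printed above (`go` for abcnews.go.com, `com` for cnn.com.de) are short and easy to mix up. A minimal sketch, assuming one wanted a less ambiguous source label, could derive it from the full hostname with the standard library. `hostname_label` is a hypothetical helper and not part of newspaper.

```python
from urllib.parse import urlparse

def hostname_label(url):
    """Return the lowercased hostname without a leading 'www.' (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

for url in ['http://www.nbcnews.com/', 'http://abcnews.go.com', 'http://cnn.com.de/']:
    print(url, '->', hostname_label(url))
```

Such a label could be stored alongside the brand so that the later mapping to 0/1 does not depend on strings like `com`.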


## Convert text to numbers (features)

### Read and clean data 

In [186]:
# read data
news_data = pd.read_csv('news_data.csv', usecols=[1,2,4])
news_data.tail()

Unnamed: 0,source,title,text
229,conservativefighters,Not Messing Around: Trump’s New Comms Director...,Anthony Scaramucci was hired by President Dona...
230,conservativefighters,Comments on: Singer Bans American Flag From Co...,
231,conservativefighters,Swedish Company Epicenter Implants Microchips ...,A Swedish company Epicenter implants microchip...
232,conservativefighters,Comments on: Charlie Gard’s Parents Release De...,
233,conservativefighters,McDonald’s Employee Gets Fired For Refusing To...,"Officer Scott Naff, who works for Virginia Dep..."


In [187]:
# shape of our data
news_data.shape

(234, 3)

In [188]:
# source distribution 
news_data.source.value_counts()

go                      50
cbsnews                 50
com                     50
nbcnews                 48
conservativefighters    36
Name: source, dtype: int64

In [189]:
# convert source to a numerical variable: 
# news from go (abcnews), cbsnews, and nbcnews are 0
# news from com (cnn.com.de) and conservativefighters are 1 (flagged as fake news)
news_data['label_num'] = news_data.source.map({'go':0,'cbsnews':0,'nbcnews':0, 'conservativefighters':1, 'com':1})
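`Series.map` with a dictionary returns NaN for any value that is missing from the dictionary, so a source that is not listed would silently lose its label. The sketch below uses made-up rows (the `toy` frame and the `unknownsite` source are illustrative assumptions) to show one way to check for that.

```python
import pandas as pd

label_map = {'go': 0, 'cbsnews': 0, 'nbcnews': 0, 'conservativefighters': 1, 'com': 1}

# toy frame with one source that is absent from the mapping
toy = pd.DataFrame({'source': ['go', 'com', 'unknownsite']})
toy['label_num'] = toy.source.map(label_map)

# rows whose source had no entry in label_map end up with NaN
unmapped = toy[toy.label_num.isna()]
print(unmapped.source.tolist())
```

Note that the later `dropna` call would also remove such rows, which is another reason to check for them first.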

In [190]:
news_data.tail()

Unnamed: 0,source,title,text,label_num
229,conservativefighters,Not Messing Around: Trump’s New Comms Director...,Anthony Scaramucci was hired by President Dona...,1
230,conservativefighters,Comments on: Singer Bans American Flag From Co...,,1
231,conservativefighters,Swedish Company Epicenter Implants Microchips ...,A Swedish company Epicenter implants microchip...,1
232,conservativefighters,Comments on: Charlie Gard’s Parents Release De...,,1
233,conservativefighters,McDonald’s Employee Gets Fired For Refusing To...,"Officer Scott Naff, who works for Virginia Dep...",1


In [192]:
# drop rows that contain NaN
news_data = news_data.dropna(axis=0, how='any')    # drop a row if any of its values is NaN
news_data.tail()

Unnamed: 0,source,title,text,label_num
227,conservativefighters,"Singer Bans American Flag From Concerts, Says ...",It’s so typical of radical liberals.\n\nThey p...,1
228,conservativefighters,Muslim Immigrant REFUSES to Let Cops Check His...,Far too many Muslim immigrants come to this co...,1
229,conservativefighters,Not Messing Around: Trump’s New Comms Director...,Anthony Scaramucci was hired by President Dona...,1
231,conservativefighters,Swedish Company Epicenter Implants Microchips ...,A Swedish company Epicenter implants microchip...,1
233,conservativefighters,McDonald’s Employee Gets Fired For Refusing To...,"Officer Scott Naff, who works for Virginia Dep...",1


### Define x and y for modeling later, and split data into training and testing sets

In [212]:
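# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split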

In [193]:
x = news_data.text
y = news_data.label_num
print(x.shape)
print(y.shape)

(218,)
(218,)


In [213]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(163,)
(55,)
(163,)
(55,)
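The 163/55 split follows from the default `test_size` of 0.25 in `train_test_split`. As a rough sketch of the arithmetic (not scikit-learn's own code), the test size is rounded up and the remainder goes to training:

```python
import math

n_samples = 218      # rows left after dropna
test_size = 0.25     # scikit-learn's default when neither size is given

n_test = math.ceil(test_size * n_samples)   # the test set is rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)
```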


### Use CountVectorizer to generate features

In [195]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer
vect = CountVectorizer()

In [196]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(x_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [197]:
x_train_dtm = vect.transform(x_train)

In [198]:
# examine the document-term matrix
x_train_dtm

<163x8593 sparse matrix of type '<class 'numpy.int64'>'
	with 35122 stored elements in Compressed Sparse Row format>

In [199]:
# transform testing data (using fitted vocabulary) into a document-term matrix
x_test_dtm = vect.transform(x_test)
x_test_dtm

<55x8593 sparse matrix of type '<class 'numpy.int64'>'
	with 10985 stored elements in Compressed Sparse Row format>
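To make the fit/transform steps above more concrete, here is a small sketch on made-up sentences (the `train_docs` and `test_docs` lists and the `toy_vect` name are illustrative assumptions). It shows that the vocabulary is learned from the training documents only, so tokens that appear only in the test documents are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["fake news spreads fast", "real news travels slowly"]
test_docs = ["breaking news spreads"]

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)   # learn the vocabulary and count in one step
test_dtm = toy_vect.transform(test_docs)         # reuse the fitted vocabulary

# vocabulary_ maps each token to its column index
print(sorted(toy_vect.vocabulary_))
print(train_dtm.toarray())
print(test_dtm.toarray())   # "breaking" was never seen in training, so it is dropped
```

This is why both document-term matrices above have the same number of columns (8593) even though they were built from different articles.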

### Or use TF-IDF to generate features

In [177]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [178]:
x_train_dtm = tfidf.fit_transform(x_train)

In [179]:
# examine the document-term matrix
x_train_dtm

<163x8324 sparse matrix of type '<class 'numpy.float64'>'
	with 25430 stored elements in Compressed Sparse Row format>

In [180]:
x_test_dtm = tfidf.transform(x_test)
x_test_dtm

<55x8324 sparse matrix of type '<class 'numpy.float64'>'
	with 7567 stored elements in Compressed Sparse Row format>
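For intuition, the weighting that `TfidfVectorizer` applies with the settings shown in its repr above (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) can be sketched in plain Python. This is a simplified illustration on made-up, pre-tokenized documents, not the library's implementation.

```python
import math

# made-up, already tokenized documents
docs = [["fake", "news", "news"], ["real", "news"]]
n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# document frequency: number of documents that contain each term
df = {t: sum(t in doc for doc in docs) for t in vocab}
# smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for doc in docs:
    print([round(v, 3) for v in tfidf_row(doc)])
```

A term that occurs in every document ("news" here) gets the lowest idf, so rarer terms contribute relatively more to each row.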

## Predicting fake news using Multinomial Naive Bayes 

In [200]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [201]:
# train the model using x_train_dtm (timing it with an IPython "magic command")
%time nb.fit(x_train_dtm, y_train)

CPU times: user 1.48 ms, sys: 885 µs, total: 2.37 ms
Wall time: 1.75 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
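The vectorize-then-fit steps can also be chained with scikit-learn's `make_pipeline`, which keeps the vocabulary and the model together. The sketch below runs on made-up toy sentences (the `texts` and `labels` lists are illustrative assumptions, with 1 = fake and 0 = real) and is only meant to show the mechanics, not to reproduce the results in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# made-up toy data: 1 = fake, 0 = real
texts = ["shocking secret exposed", "senate passes budget bill",
         "shocking miracle cure exposed", "bank reports quarterly earnings"]
labels = [1, 0, 1, 0]

# the pipeline vectorizes raw text and then fits the classifier
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

preds = pipe.predict(["shocking cure exposed", "senate budget earnings"])
print(preds)
```

With a pipeline, `predict` accepts raw strings directly, so there is no separate `transform` call to forget on the test data.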

In [202]:
# make class predictions for x_test_dtm
y_pred_class = nb.predict(x_test_dtm)

In [203]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.83636363636363631

In [204]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[37,  1],
       [ 8,  9]])
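The confusion matrix above is laid out with actual classes as rows and predicted classes as columns, so the other summary numbers can be derived from it by hand. A short sketch using the four counts from the output:

```python
# confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 37, 1
fn, tp = 8, 9

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # agrees with accuracy_score above
precision = tp / (tp + fp)            # share of articles predicted fake that are fake
recall = tp / (tp + fn)               # share of fake articles that were caught
null_accuracy = (tn + fp) / total     # accuracy of always predicting "real"

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(null_accuracy, 4))
```

The recall figure is the one to watch here: the model misses 8 of the 17 fake articles in the test set, even though its overall accuracy is above the always-predict-real baseline.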

In [205]:
# print the false positives (real news incorrectly classified as fake)
x_test[y_test < y_pred_class]

19    Wells Fargo Introduces Cardless ATMs\n\nThe ba...
Name: text, dtype: object
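The comparison `y_test < y_pred_class` works because the labels are 0 and 1: the only way the actual label can be smaller than the predicted one is actual 0 (real) and predicted 1 (fake). A small sketch on made-up values (all names and numbers below are illustrative assumptions) shows the same masking in both directions.

```python
import numpy as np
import pandas as pd

# made-up texts and labels; the index values are arbitrary row labels
x_toy = pd.Series(["a", "b", "c", "d"], index=[19, 42, 7, 3])
y_true = pd.Series([0, 1, 1, 0], index=x_toy.index)
y_pred = np.array([1, 1, 0, 0])   # a plain array, like the output of predict()

false_pos = x_toy[y_true < y_pred]   # actual 0, predicted 1
false_neg = x_toy[y_true > y_pred]   # actual 1, predicted 0

print(false_pos.index.tolist(), false_neg.index.tolist())
```

Because the mask keeps the original index, a single row can then be looked up by its label, as `x_test[19]` does in this notebook.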

In [206]:
# example false positive
x_test[19]

'Wells Fargo Introduces Cardless ATMs\n\nThe bank is upgrading all 13,000 of its ATMs to process withdrawals using smartphones rather than debit cards. Chase and Bank of America plan to roll out their own version of cardless ATMs, too.'

In [170]:
# print the false negatives (fake news incorrectly classified as real)
x_test[y_test > y_pred_class]

223    Anyone who dismisses Kid Rock’s campaign with ...
194    Although you experience minor pest problems in...
202    Ami Horowitz, it’s a filmmaker known for his c...
160    Baltimore, MD (AP) – Speaking to reporters in ...
219    Senate Minority chairman Chuck Schumer (D-NY) ...
231    A Swedish company Epicenter implants microchip...
204    House Oversight Chairman Trey Gowdy defended A...
233    Officer Scott Naff, who works for Virginia Dep...
Name: text, dtype: object

In [211]:
# example false negative
x_test[219]

'Senate Minority chairman Chuck Schumer (D-NY) is loud anti-Trump Democrat that is constantly trying to obtain every bit of information that the Democrats can use to impeach Trump. But all that now has backfired.\n\nA new survey shows that 37% of New Yorkers had an unfavorable vote of Schumer, and this the highest rate of disapproval he has ever got. Siena College released the poll this Thursday.\n\nThe Dems finally got sick of hearing him attack the president all the time. The results of the poll came just one day after Schumer’s last attack against the president, when he criticized Trump for saying that Republicans should let Obamacare die, claiming that nothing can change his position on this matter.\n\n“It’s hard to believe that he could say something like that,” said the NYC Democrat. “President Trump’s promise to let our health care system collapse is so, so wrong on three counts. It’s a failure morally, it’s a failure politically, and it’s a remarkable failure of presidential le

In [208]:
# calculate predicted probabilities of class 1 (fake) for x_test_dtm (Naive Bayes probabilities tend to be poorly calibrated)
y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.79256965944272451
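AUC can be read as the probability that a randomly chosen fake article receives a higher predicted probability than a randomly chosen real one, which is why it is still usable when the probabilities themselves are poorly calibrated. A minimal sketch of that pairwise definition, on made-up labels and scores (`pairwise_auc` is a hypothetical helper, not a scikit-learn function):

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# made-up labels and scores
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_toy, scores_toy))
```

Only the ordering of the scores matters in this calculation, not their absolute values.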