## What do "fake" headlines have in common?

This notebook makes heavy use of Scattertext (https://github.com/JasonKessler/scattertext).

Datasets:
The fake news data is the Kaggle Fake News Challenge dataset (https://www.kaggle.com/mrisdal/fake-news).

The traditional news data comes from the UCI News Aggregator headline dataset (https://archive.ics.uci.edu/ml/datasets/News+Aggregator).

In [1]:
import pandas as pd
import numpy as np
import datetime
import spacy
import scattertext as st
#import imp; imp.reload(st)
from IPython.display import IFrame, display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))
import pickle
from nltk.corpus import reuters
%matplotlib inline  

### First, parse the headline data

In [2]:
# spaCy 1.x API; on spaCy 2 and later, load a model instead, e.g. spacy.load('en_core_web_sm')
nlp = spacy.en.English()

In [3]:
uci_df = pd.read_csv('data/uci-news-aggregator.csv.gz')
traditional_publishers = ['Forbes','Bloomberg','Los Angeles Times','TIME','Wall Street Journal']
reputable_celebrity_gossip = ['TheCelebrityCafe.com', 'PerezHilton.com']  # not used below
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
real_df = uci_df[uci_df['PUBLISHER'].isin(traditional_publishers)].copy()
real_df.columns = [x.lower() for x in real_df.columns]
# timestamps are milliseconds since the epoch; fromtimestamp converts to local time
real_df['date'] = (real_df.timestamp/1000).apply(datetime.datetime.fromtimestamp)
real_df['type'] = 'traditional'
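
The `date` column above depends on the timezone of the machine that runs the notebook, because `datetime.datetime.fromtimestamp` returns local time. As a minimal sketch of a timezone-independent alternative, the millisecond timestamps can be converted with `pd.to_datetime`. The two values below are copied from the first two rows of the output shown in this notebook; the variable names are illustrative only.

```python
import pandas as pd

# Millisecond epoch timestamps, as stored in the UCI `timestamp` column.
# These two values come from the first two rows of the notebook output.
timestamps_ms = pd.Series([1394470370698, 1394470504274])

# unit='ms' tells pandas the integers are milliseconds since the epoch.
# The result is timezone-naive and expressed in UTC.
dates_utc = pd.to_datetime(timestamps_ms, unit='ms')
print(dates_utc)
```

The first value comes out as 16:52:50.698 UTC, seven hours ahead of the 09:52:50.698 shown in the table, which suggests the original run used a UTC-7 local clock.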

In [4]:
real_df.iloc[:10]

Unnamed: 0,id,title,url,publisher,category,story,hostname,timestamp,date,type
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698,2014-03-10 09:52:50.698,traditional
25,26,ECB's Noyer not Happy With Euro Strength -- Up...,http://online.wsj.com/article/BT-CO-20140310-7...,Wall Street Journal,b,dPhGU51DcrolUIMxbRm0InaHGA2XM,online.wsj.com,1394470504274,2014-03-10 09:55:04.274,traditional
55,56,Icahn Targets Ebay Chief Donahoe After Company...,http://www.forbes.com/sites/steveschaefer/2014...,Forbes,b,dxyGGb4iN9Cs9aMZTKQpJeoiQfruM,www.forbes.com,1394470921918,2014-03-10 10:02:01.918,traditional
103,104,McDonald's Sales Tank In February,http://time.com/18190/mcdonalds-sales-tank-in-...,TIME,b,d4lvRcSzuglGdFMmqvT6ZGcG8vDlM,time.com,1394471391112,2014-03-10 10:09:51.112,traditional
143,144,Buyer Beware: A Bull Market Doesn't Lift All S...,http://www.forbes.com/sites/investor/2014/03/1...,Forbes,b,dchKRTV8vDviyCMH_lfUDTuFThh_M,www.forbes.com,1394472087371,2014-03-10 10:21:27.371,traditional
160,161,Stock Futures Slip on Weak China Data,http://stream.wsj.com/story/latest-headlines/S...,Wall Street Journal,b,dchKRTV8vDviyCMH_lfUDTuFThh_M,stream.wsj.com,1394472091116,2014-03-10 10:21:31.116,traditional
223,224,Mt. Gox Seeks U.S. Court Shield During Japan B...,http://www.bloomberg.com/news/2014-03-10/mt-go...,Bloomberg,b,d85PwV5UxEUaHMM0to00pLgkAXkbM,www.bloomberg.com,1394474325730,2014-03-10 10:58:45.730,traditional
287,288,"Hackers Hit Mt. Gox Exchange's CEO, Claim To P...",http://www.forbes.com/sites/andygreenberg/2014...,Forbes,b,d85PwV5UxEUaHMM0to00pLgkAXkbM,www.forbes.com,1394474337855,2014-03-10 10:58:57.855,traditional
339,340,Metro-North Railroad Worker Killed by Train in...,http://stream.wsj.com/story/latest-headlines/S...,Wall Street Journal,b,dEMAAHOiiULKoTMlQ5PeakjabUaYM,stream.wsj.com,1394476678854,2014-03-10 11:37:58.854,traditional
355,356,Press Release: Pershing Square Issues Statement,http://online.wsj.com/article/BT-CO-20140310-7...,Wall Street Journal,b,dsxeQTqHzsYQzrM7usRl_Shtda-iM,online.wsj.com,1394479007638,2014-03-10 12:16:47.638,traditional


In [5]:
df = pd.read_csv('data/fake.csv.gz')
#df = df.append(real_df)
df = pd.concat([df, real_df]).reset_index()


In [6]:
# keep only rows whose title is a string (drops missing titles)
df = df[df['title'].apply(lambda x: isinstance(x, str))]
# cut each title at the first site-name separator and collapse runs of whitespace
df['clean_title'] = df['title'].apply(lambda x: ' '.join(x.split('»')[0].split('>>')[0].split('[')[0].split('(')[0].split('|')[0].strip().split()))
# keep the first row for each distinct cleaned title
df = df.loc[df['clean_title'].drop_duplicates().index]
df['parsed_title'] = df['clean_title'].apply(nlp)
df['meta'] = df['author'].fillna('') + df['publisher'].fillna('') + ' ' + df['site_url'].fillna('')
df['category'] = df['type'].apply(lambda x: 'Real' if x == 'traditional' else 'Fake')
fake_df = df[df['category'] == 'Fake']
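
The chained `split` calls that build `clean_title` are easier to check in isolation. The sketch below restates the same logic as a standalone function; the name `clean_title`, the `SEPARATORS` list, and the sample headlines are illustrative and not part of the notebook.

```python
# Separators after which a headline usually carries a site name or tag.
# Same characters, in the same order, as the chained split() calls above.
SEPARATORS = ['»', '>>', '[', '(', '|']

def clean_title(title):
    """Truncate a headline at the first separator and collapse whitespace."""
    for sep in SEPARATORS:
        title = title.split(sep)[0]
    return ' '.join(title.strip().split())

# Made-up headlines, for illustration only.
print(clean_title('BREAKING:  Big   Story | Some Site'))
print(clean_title('Big Story (VIDEO) » Some Site'))
```

Note that truncating at `(` and `[` also removes any parenthetical that is part of the headline itself, so some titles lose real content.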

In [7]:
fake_df.iloc[:10]

Unnamed: 0,index,author,category,comments,country,crawled,date,domain_rank,hostname,id,...,text,thread_title,timestamp,title,type,url,uuid,clean_title,parsed_title,meta
0,0,Barracuda Brigade,Fake,0.0,US,2016-10-27T01:49:27.168+03:00,NaT,25689.0,,,...,Print They should pay all the back all the mon...,Muslims BUSTED: They Stole Millions In Gov’t B...,,Muslims BUSTED: They Stole Millions In Gov’t B...,bias,,6a175f46bcd24d39b3e962ad0f29936721db70db,Muslims BUSTED: They Stole Millions In Gov’t B...,"(Muslims, BUSTED, :, They, Stole, Millions, In...",Barracuda Brigade 100percentfedup.com
1,1,reasoning with facts,Fake,0.0,US,2016-10-29T08:47:11.259+03:00,NaT,25689.0,,,...,Why Did Attorney General Loretta Lynch Plead T...,Re: Why Did Attorney General Loretta Lynch Ple...,,Re: Why Did Attorney General Loretta Lynch Ple...,bias,,2bdc29d12605ef9cf3f09f9875040a7113be5d5b,Re: Why Did Attorney General Loretta Lynch Ple...,"(Re, :, Why, Did, Attorney, General, Loretta, ...",reasoning with facts 100percentfedup.com
2,2,Barracuda Brigade,Fake,0.0,US,2016-10-31T01:41:49.479+02:00,NaT,25689.0,,,...,Red State : \nFox News Sunday reported this mo...,BREAKING: Weiner Cooperating With FBI On Hilla...,,BREAKING: Weiner Cooperating With FBI On Hilla...,bias,,c70e149fdd53de5e61c29281100b9de0ed268bc3,BREAKING: Weiner Cooperating With FBI On Hilla...,"(BREAKING, :, Weiner, Cooperating, With, FBI, ...",Barracuda Brigade 100percentfedup.com
3,3,Fed Up,Fake,0.0,US,2016-11-01T15:46:26.304+02:00,NaT,25689.0,,,...,Email Kayla Mueller was a prisoner and torture...,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,bias,,7cf7c15731ac2a116dd7f629bd57ea468ed70284,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,"(PIN, DROP, SPEECH, BY, FATHER, OF, DAUGHTER, ...",Fed Up 100percentfedup.com
4,4,Fed Up,Fake,0.0,US,2016-11-01T23:59:42.266+02:00,NaT,25689.0,,,...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,bias,,0206b54719c7e241ffe0ad4315b808290dbe6c0f,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,"(FANTASTIC, !, TRUMP, 'S, 7, POINT, PLAN, To, ...",Fed Up 100percentfedup.com
5,5,Barracuda Brigade,Fake,0.0,US,2016-11-02T16:31:28.550+02:00,NaT,25689.0,,,...,Print Hillary goes absolutely berserk! She exp...,Hillary Goes Absolutely Berserk On Protester A...,,Hillary Goes Absolutely Berserk On Protester A...,bias,,8f30f5ea14c9d5914a9fe4f55ab2581772af4c31,Hillary Goes Absolutely Berserk On Protester A...,"(Hillary, Goes, Absolutely, Berserk, On, Prote...",Barracuda Brigade 100percentfedup.com
6,6,Fed Up,Fake,0.0,US,2016-11-05T02:13:46.065+02:00,NaT,25689.0,,,...,BREAKING! NYPD Ready To Make Arrests In Weiner...,BREAKING! NYPD Ready To Make Arrests In Weiner...,,BREAKING! NYPD Ready To Make Arrests In Weiner...,bias,,d3cc0fe38f41a59f7c48f8c3528ca5f74193148f,BREAKING! NYPD Ready To Make Arrests In Weiner...,"(BREAKING, !, NYPD, Ready, To, Make, Arrests, ...",Fed Up 100percentfedup.com
7,7,Fed Up,Fake,0.0,US,2016-11-05T05:59:07.458+02:00,NaT,25689.0,,,...,BREAKING! NYPD Ready To Make Arrests In Weiner...,WOW! WHISTLEBLOWER TELLS CHILLING STORY Of Mas...,,WOW! WHISTLEBLOWER TELLS CHILLING STORY Of Mas...,bias,,b4bbf8b5c19e8864f5257832a58b81ef4ed2d4e4,WOW! WHISTLEBLOWER TELLS CHILLING STORY Of Mas...,"(WOW, !, WHISTLEBLOWER, TELLS, CHILLING, STORY...",Fed Up 100percentfedup.com
8,8,Fed Up,Fake,0.0,US,2016-11-07T10:20:06.409+02:00,NaT,25689.0,,,...,\nLimbaugh said that the revelations in the Wi...,BREAKING: CLINTON CLEARED...Was This A Coordin...,,BREAKING: CLINTON CLEARED...Was This A Coordin...,bias,,a19aabaa5a61eb8bc22fadaaa003e5fbba5c4bf6,BREAKING: CLINTON CLEARED...Was This A Coordin...,"(BREAKING, :, CLINTON, CLEARED, ..., Was, This...",Fed Up 100percentfedup.com
9,9,Fed Up,Fake,0.0,US,2016-11-07T10:20:27.252+02:00,NaT,25689.0,,,...,Email \nThese people are sick and evil. They w...,"EVIL HILLARY SUPPORTERS Yell ""F*ck Trump""…Burn...",,"EVIL HILLARY SUPPORTERS Yell ""F*ck Trump""…Burn...",bias,,f54d8e13010d0a79893995ee65360ad4b38b5a35,"EVIL HILLARY SUPPORTERS Yell ""F*ck Trump""…Burn...","(EVIL, HILLARY, SUPPORTERS, Yell, "", F*ck, Tru...",Fed Up 100percentfedup.com


In [8]:
df.type.value_counts()

bs             10346
traditional     7627
conspiracy       341
bias             312
hate             245
satire           144
state            118
junksci          102
fake              19
Name: type, dtype: int64

In [10]:
def news_type_viz(fake_df, news_type,
                  minimum_term_frequency=1, 
                  minimum_not_category_term_frequency=5, 
                  comparison_types=None, 
                  compare_to_trad_only = False,
                  stoplist = []):
    #type_df = fake_df
    #if comparison_types is not None:
    #    type_df = type_df[type_df['type'].isin(comparison_types + [news_type])]
    #type_df['news_type'] = type_df['type'].apply(lambda x: news_type if (x == news_type) else ('not ' + news_type))
    # NOTE: uses the global df (fake and traditional rows), not the fake_df argument
    type_df = df
    if compare_to_trad_only:
        type_df = type_df[type_df['type'].isin([news_type, 'traditional'])]
    vc = type_df['clean_title'].value_counts()
    type_df = type_df[type_df['clean_title'].isin(vc[vc == 1].index)]
    corpus = (st.CorpusFromParsedDocuments(type_df, 
                                           category_col = 'type', 
                                           parsed_col = 'parsed_title', 
                                           feats_from_spacy_doc = st.FeatsFromSpacyDoc(strip_final_period=True))
              .build()
              .get_stoplisted_unigram_corpus_and_custom(stoplist))
    # entity_types_to_censor={'FACILITY', 'NORP', 'GPE', 'PERSON', 'WORK_OF_ART', 'LOC', 'EVENT', 'LANGUAGE', 'ORG', 'PRODUCT'}
    html = st.produce_scattertext_explorer(corpus, 
                                           category=news_type, 
                                           category_name=news_type, 
                                           not_category_name='traditional' if compare_to_trad_only else 'not ' + news_type,
                                           minimum_term_frequency=minimum_term_frequency,
                                           minimum_not_category_term_frequency=minimum_not_category_term_frequency,
                                           metadata=type_df['meta'] + ': ' + type_df['type'],
                                           use_full_doc=True,
                                           term_ranker=st.termranking.OncePerDocFrequencyRanker,
                                           width_in_pixels=1000)
    file_name = 'output/' + news_type + ('_t' if compare_to_trad_only else '_a') + '.html'
    # write with a context manager so the file is closed before the IFrame loads it
    with open(file_name, 'wb') as out:
        out.write(html.encode('utf-8'))
    display(IFrame(src=file_name, width=1200, height=1000))
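
The function passes `st.termranking.OncePerDocFrequencyRanker` as the term ranker, which, as its name indicates, counts a term at most once per document. The sketch below illustrates that counting idea in plain Python. It is not Scattertext's implementation: the function name, the whitespace tokenization, and the sample headlines are assumptions made for illustration.

```python
from collections import Counter

def once_per_doc_counts(docs):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        # set() collapses repeats within one document to a single occurrence
        counts.update(set(doc.lower().split()))
    return counts

# Made-up headlines, for illustration only.
headlines = ['BREAKING BREAKING news', 'breaking story']
counts = once_per_doc_counts(headlines)
print(counts['breaking'], counts['news'])
```

With short texts such as headlines this differs little from raw term frequency, but it keeps a single headline that repeats a word from inflating that word's count.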

## Caution: the traditional news dataset is from 2014 and doesn't mention Trump

## A fake news classifier trained on this data may produce false positives on documents containing the term "Trump"
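
One way to check for this kind of time-period confound is to measure how often a term appears in each news type. The sketch below assumes a frame with the notebook's `clean_title` and `type` columns; the helper name `term_rate_by_type` and the toy rows are hypothetical and are not taken from either dataset.

```python
import pandas as pd

def term_rate_by_type(frame, term, text_col='clean_title', type_col='type'):
    """Return the fraction of headlines in each type that contain `term`."""
    contains = frame[text_col].str.contains(term, case=False, regex=False)
    return contains.groupby(frame[type_col]).mean()

# Toy rows, made up for illustration; not real data.
toy = pd.DataFrame({
    'clean_title': ['Trump rally draws crowd', 'TRUMP plan revealed',
                    'Stocks slip on weak data', 'Retail sales rise'],
    'type': ['bias', 'bias', 'traditional', 'traditional'],
})
rates = term_rate_by_type(toy, 'trump')
print(rates)
```

A term whose rate is near zero in the `traditional` rows but high elsewhere is a candidate confound: a classifier can separate the classes with it without learning anything about headline style.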

In [11]:
news_type_viz(fake_df, 'hate', 
              minimum_term_frequency=1, 
              minimum_not_category_term_frequency=20, 
              comparison_types=None, 
              compare_to_trad_only = True,
              stoplist=['dankof','collett','duke','anglin','shoaf','slattery','farren'])

## Caution: many articles labeled satire are in French!

## A classifier may accidentally label any French article as satire. Répugnant!
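
A rough way to estimate how many headlines are affected is to flag titles that contain several common French function words. This is a crude heuristic sketched for illustration, not a language detector; the word list, the `min_hits` threshold, and the function name are all assumptions.

```python
# A small, hand-picked set of common French function words (assumption).
FRENCH_HINTS = {'le', 'la', 'les', 'des', 'une', 'du', 'et', 'est',
                'pour', 'dans', 'sur', 'au', 'aux', 'un'}

def looks_french(title, min_hits=2):
    """Return True if at least `min_hits` tokens are French function words."""
    tokens = title.lower().split()
    return sum(tok in FRENCH_HINTS for tok in tokens) >= min_hits

# Made-up French sentence and one headline from the UCI sample above.
print(looks_french('Le gouvernement annonce une réforme pour les retraites'))
print(looks_french('Stock Futures Slip on Weak China Data'))
```

Applied to `clean_title` within the satire rows, a filter like this could be used to set French headlines aside before comparing term usage.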

In [12]:
news_type_viz(fake_df, 'satire', 
              minimum_term_frequency=1, 
              minimum_not_category_term_frequency=20, 
              comparison_types=None, 
              compare_to_trad_only = False,
              stoplist=['dankof','collett','duke','anglin','shoaf','slattery','farren'])

In [13]:
news_type_viz(fake_df, 'hate', 
              minimum_term_frequency=1, 
              minimum_not_category_term_frequency=20, 
              comparison_types=None, 
              compare_to_trad_only = False,
              stoplist=['dankof','collett','duke','anglin','shoaf','slattery','farren'])

In [14]:
news_type_viz(fake_df, 'bias', minimum_term_frequency=1, minimum_not_category_term_frequency=20, comparison_types=None, 
              stoplist=['dankof','collett','duke','anglin','shoaf','slattery','farren','shapiro','ben'])

In [15]:
news_type_viz(fake_df, 'satire', minimum_term_frequency=2, minimum_not_category_term_frequency=20, comparison_types=None, 
              stoplist=['dankof','collett','duke','anglin','shoaf','slattery','farren','shapiro','ben','onion'])

In [16]:
news_type_viz(fake_df, 'state', minimum_term_frequency=2, minimum_not_category_term_frequency=20, comparison_types=None, 
              stoplist=['dankof','collett','duke','anglin','shoaf','slattery','farren','shapiro','ben','onion','presstv'])

In [17]:
news_type_viz(fake_df, 'junksci', minimum_term_frequency=2, minimum_not_category_term_frequency=30, comparison_types=None, 
              stoplist=['dankof','collett','duke','anglin','shoaf','slattery','farren','shapiro','ben','onion'])

In [18]:
news_type_viz(fake_df, 'fake', minimum_term_frequency=2, minimum_not_category_term_frequency=50, comparison_types=None, 
              stoplist=['dankof','collett','duke','anglin','shoaf','slattery','farren','shapiro','ben','onion'])