   While sensationalist journalism has existed for centuries, fake news has recently risen to prominence as a result of the political climate.  However, because fake news can target public perception of any entity, it can bring significant harm to corporations as well.  One instance of fake news propagation contributed to a temporary drop of almost 4% in PepsiCo's share price and a more lingering impact on the corporation's public image.  We therefore conclude that corporations may benefit significantly from preventing the proliferation of fake news focused on them.
   
  Twitter and Reddit, two of the most prominent social media sites in the Western world, each have at least 250 million users.  Due to their high traffic, they are convenient avenues for any interested party to propagate news stories, including fake news. While the websites can be used for benign purposes or for relatively normal marketing, their immense reach means that fake news articles that go viral will spread quickly across the internet.
   
   Because viral fake news can quickly damage both the public image and the financial bottom line of a business, we posit that proactively responding to fake news postings on these websites, before they can mislead unsuspecting readers, is a useful addition to the standard crisis communications toolbox. Our approach is to immediately recognize fake news articles pertaining to any interested company and identify them as such to all readers. Users often begin responding to and discussing articles based purely on the headline, without evaluating their contents, so alerting readers that they may be being misled is paramount. Rapidly addressing fake news in this manner could prevent significant losses, both monetary and reputational. Further details pertaining to the business model are contained in the attached report.
  
  We currently possess two primary avenues for achieving this. The first is predicated on the aforementioned list of fake news websites curated by OpenSources. By making the default assumption that any story hosted by these websites has a strong likelihood of being misleading or fake, we can simply preempt any significant discussion on Reddit or Twitter by announcing to readers that the source is, at best, suspect.  This is an approach that could be performed by anyone with sufficient familiarity with fake news and the APIs for each website, and does not require the application of data science concepts.
  
  Our second approach will be to evaluate the text inside the story as the basis for classification of news articles independent of their source. This approach will train software to evaluate text using a large corpus of previously-classified news stories. The fake news articles in this corpus were compiled from websites flagged by OpenSources, while the real news articles were collected from media outlets with a long history of reliable reporting. This software will apply neural networks to the natural language problem of classification, which is an emerging field of research.






   In order to integrate emerging data science into our approach, we propose to use artificial neural networks trained on a corpus of pre-classified fake and real news articles to identify unclassified documents.  While curated lists identifying sources of fake news are helpful, they are unlikely to identify every possible purveyor of fake news.  
   
   Therefore, we require a method of evaluating individual stories based on their contents rather than their source.  In the past few decades, artificial neural networks have achieved significant advances in natural language processing.    Much of the code we use in this section is founded on work done by Gareth Dwyer and implemented with a neural networks API called Keras.
Keras allows easy development of neural network architecture, meaning that we can choose a system customized to our purposes.  Because we are engaged in natural language processing, we rely on recurrent neural networks, which are commonly used in this context.  
   
   We use the long short-term memory (LSTM) architecture introduced by Hochreiter and Schmidhuber in 1997 to address some of the shortcomings of plain recurrent networks, notably their difficulty retaining information across long sequences, as well as some other adjustments recommended by Dwyer.  After training the neural network on our corpus of fake and legitimate news stories, the network predicts the nature of new documents with approximately 88% accuracy on the held-out validation data.  
    
   To make use of this, we have programmed Bot_Defender not only to examine the source of news articles, but also to read the articles itself and come to a conclusion about whether the article's contents are semantically similar to fake news articles.  This can be performed on either Twitter or Reddit, and is demonstrated using our Reddit-based Bot Defender.

   The embedding layer transforms each individual word into a vector of a given size (in our use case, a vector of 128 numbers).  As the neural network is trained, the mapping used to transform the words learns which words are similar, and adjusts itself to map similar words onto similar vectors.
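
To make the lookup concrete, the following is a minimal sketch of what an embedding layer does at prediction time, written in plain Python rather than Keras. The table here is filled with random numbers purely for illustration (an assumption; a trained layer would hold learned values), and the names `embedding_table` and `embed` are ours, not part of any library.

```python
import random

random.seed(0)

VOCAB_SIZE = 2000   # same vocabulary size as Embedding(2000, ...)
EMBED_DIM = 128     # same vector size as Embedding(..., 128, ...)

# One row of EMBED_DIM numbers per word index.
# Illustrative random values only; training would adjust these rows.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(sequence):
    """Replace each integer word index with its row of the table."""
    return [embedding_table[index] for index in sequence]

# The same word index always maps to the same vector.
vectors = embed([12, 7, 12])
print(len(vectors), len(vectors[0]))
```

A 300-word article therefore becomes a 300 x 128 grid of numbers before it reaches the later layers.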
 	
   The dropout layer is loosely similar in concept to the random forest technique applied in ensemble learning methods.  During training it randomly disables a fraction of the units feeding the subsequent layers (20% in our network) so that the network learns a comprehensive variety of lessons and nuances rather than re-learning the same lesson over and over.  Because learning mechanisms may fall prey to local minima and fail to discover the true response (for example, always concluding that 'couple' refers to a pair of people and never learning that it may be used idiomatically), dropout layers help provide a more complete and generalisable approach for the neural network.
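
As a rough illustration of the idea, here is a sketch of the common "inverted dropout" formulation in plain Python. The function name and arguments are our own and this is not the Keras implementation; it only shows the principle that values are zeroed at random during training, the survivors are scaled up to compensate, and nothing is dropped at prediction time.

```python
import random

def dropout(values, rate, training, rng=random):
    """Zero each value with probability `rate` while training.

    Surviving values are scaled by 1 / (1 - rate) so the expected
    sum stays the same. Outside training the input passes through.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

activations = [1.0, 1.0, 1.0, 1.0, 1.0]
print(dropout(activations, rate=0.2, training=True))   # some entries zeroed
print(dropout(activations, rate=0.2, training=False))  # unchanged
```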
 
   The convolutional layer works in tandem with the max pooling layer to train the classifier using pertinent n-grams instead of limiting the trainer to single words.  The convolutional layer applies a filter to successive subsets of the text to obtain what can be seen as a weighting, while the max pooling layer collects the most significantly weighted blocks from each convolution.
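
The interplay of the two layers can be sketched on a toy example. The code below is a simplification under stated assumptions: it uses a single hand-picked filter over a sequence of single numbers, whereas the real network learns 64 filters over 128-dimensional word vectors. The helper names are ours.

```python
def conv1d(signal, kernel, bias=0.0):
    """Slide `kernel` over `signal` (no padding, stride 1) and apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]
        total = sum(w * x for w, x in zip(kernel, window)) + bias
        outputs.append(max(0.0, total))  # ReLU keeps only positive responses
    return outputs

def max_pool1d(values, pool_size):
    """Keep the largest value from each non-overlapping block."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]

signal = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, 3.0, 0.0]
features = conv1d(signal, kernel=[1.0, 0.0, -1.0])
pooled = max_pool1d(features, pool_size=2)
print(features)  # [0.0, 0.0, 2.0, 2.0, 0.0, 0.0]
print(pooled)    # [0.0, 2.0, 0.0]
```

Each filter response covers a window of several adjacent positions (an n-gram), and pooling keeps only the strongest response in each block, which also shortens the sequence passed on to the next layer.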
	
   The recurrent neural network with long short-term memory (RNN-LSTM) layer is the heart of our learning method, as the other layers are simply dedicated to providing input to this layer.  RNNs function by parsing sequential data and passing information gained from each datum on to the next iteration.  Because our data is linguistic, where each word has an impact on the interpretation of the subsequent words in the phrase, an RNN is well-suited to our purposes.  The LSTM modification allows the network to retain information about particularly relevant content over longer spans. 
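
The way a recurrent layer carries information forward can be sketched with a deliberately simplified update rule. Note the assumptions: this is a plain recurrent step with a single number as its state, not an LSTM (which adds gates that decide what to keep and what to forget), and the weights are hand-picked rather than learned.

```python
import math

def rnn_forward(inputs, w_in, w_state, bias=0.0):
    """Run a one-unit recurrent network over a sequence.

    Each new state mixes the current input with the previous state,
    so earlier items influence how later items are interpreted.
    """
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(w_in * x + w_state * state + bias)
        states.append(state)
    return states

sequence = [1.0, 0.0, 0.0]
with_memory = rnn_forward(sequence, w_in=1.0, w_state=0.5)
without_memory = rnn_forward(sequence, w_in=1.0, w_state=0.0)
print(with_memory)     # the first input still echoes in later states
print(without_memory)  # later states ignore the first input
```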
 	
   The dense layer is the final layer and translates the RNN-LSTM output into an actual classification.  This classification is then integrated into the Reddit Bot code for application.
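
As a small sketch of this final step: the dense layer produces one number, a sigmoid squashes it into the range 0 to 1, and the result is rounded to a label. The label convention (0 for fake, 1 for real) follows the notebook below; the helper names and the example scores are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(score, threshold=0.5):
    """Map a sigmoid output to a label: above the threshold means REAL."""
    return "REAL" if score > threshold else "FAKE"

for raw in (-2.0, 0.3, 2.0):
    score = sigmoid(raw)
    print(round(score, 3), classify(score))
```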

The following listing is not meant to be run on its own; it only shows the structure of the network.  The full, runnable version appears in the notebook cells further below.

model = Sequential()
model.add(Embedding(2000, 128, input_length=300))
model.add(Dropout(0.2))
model.add(Conv1D(64, 5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(data, np.array(y), validation_split=0.25, epochs=3)
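
To see how the data changes shape as it moves through these layers, the arithmetic can be worked through directly. This sketch assumes the Keras defaults for the layers as written above (no padding and stride 1 for `Conv1D`, a pooling stride equal to `pool_size`, and a standard LSTM with bias terms); the helper functions are ours.

```python
def conv1d_output_length(steps, kernel_size):
    """Sequence length after an unpadded, stride-1 convolution."""
    return steps - kernel_size + 1

def max_pool_output_length(steps, pool_size):
    """Sequence length after non-overlapping max pooling."""
    return (steps - pool_size) // pool_size + 1

seq_len, vocab, embed_dim = 300, 2000, 128
filters, kernel_size, pool_size, lstm_units = 64, 5, 4, 128

conv_steps = conv1d_output_length(seq_len, kernel_size)       # 296
pooled_steps = max_pool_output_length(conv_steps, pool_size)  # 74

# Trainable parameter counts per layer (weights plus biases).
params = {
    "embedding": vocab * embed_dim,
    "conv1d": kernel_size * embed_dim * filters + filters,
    "lstm": 4 * ((filters + lstm_units) * lstm_units + lstm_units),
    "dense": lstm_units + 1,
}

print("embedding output:", (seq_len, embed_dim))
print("conv1d output:   ", (conv_steps, filters))
print("pooling output:  ", (pooled_steps, filters))
print("lstm output:     ", (lstm_units,))
print("parameters:      ", params, "total", sum(params.values()))
```

Under these assumptions the LSTM reads 74 time steps rather than 300, which is one practical benefit of placing the convolution and pooling layers in front of it.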

   The first approach mentioned above, oriented around the OpenSources list of fake news websites, requires automated comparison of the websites hosting stories linked in tweets or Reddit posts against the OpenSources list. We used the APIs provided by both Twitter and Reddit to design bots that generate pre-programmed responses. The Twitter bot identifies tweets pertaining to our client companies using hashtags, and then inspects these tweets to see if they contain links to other pages.  If these other pages are hosted by websites on the OpenSources list, a response to that tweet is automatically generated by the Twitter bot.     
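
The comparison step described above can be sketched independently of either API. The example below is one possible approach, not the bots' actual matching logic (which uses a simpler substring check): it pulls links out of a post and compares each link's host name against a flagged list. The domains in `FLAGGED_DOMAINS` are placeholders standing in for the OpenSources list, and the function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Placeholder domains for illustration only; the real list comes from OpenSources.
FLAGGED_DOMAINS = {"example-fake-news.com", "hoax.example.org"}

URL_PATTERN = re.compile(r"https?://\S+")

def extract_host(url):
    """Return the lower-cased host name of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def is_flagged(url, flagged=FLAGGED_DOMAINS):
    """True if the URL is hosted on a flagged domain or one of its subdomains."""
    host = extract_host(url)
    return any(host == domain or host.endswith("." + domain) for domain in flagged)

def flagged_links(text, flagged=FLAGGED_DOMAINS):
    """Return every link in `text` that points at a flagged domain."""
    return [url for url in URL_PATTERN.findall(text) if is_flagged(url, flagged)]

post = "Pepsi story: https://hoax.example.org/a and https://abcnews.go.com/b"
print(flagged_links(post))
```

Matching on the parsed host name rather than on a raw substring avoids flagging an unrelated site whose address merely contains a flagged name.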
   The following code implements automated accounts on both Reddit and Twitter to provide the aforementioned automated responses.  Examples of the output are also included.

### Create a neural network to classify news articles as real or fake

In [2]:
from collections import Counter
from datetime import datetime
 
import json
 
from keras.layers import Embedding, LSTM, Dense, Conv1D, MaxPooling1D, Dropout, Activation
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
 
import numpy as np
import pandas as pd




Using TensorFlow backend.


In [3]:
#Read in the dataset of real and fake news articles

# Import `fake_or_real_news.csv` 
df = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv")

# Reset to a clean 0..n-1 integer index
df = df.reset_index(drop=True)

# Set `y` 
y = df.label 

y = list(y)

# Drop the `label` column (the result must be assigned back to take effect)
df = df.drop("label", axis=1)

# Extract the article text
X = df['text']

X = list(X)

print(len(y))

print(len(X))


6335
6335


In [7]:
#Tokenize the text of the articles

tokenizer = Tokenizer(num_words=2000)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
data = pad_sequences(sequences, maxlen=300)

print('Done!')

Done!
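
For readers unfamiliar with this preprocessing step, the sketch below mimics its effect in plain Python. It is a simplification and not the Keras implementation: it only splits on whitespace, and the helper names are ours. It does follow the same conventions as the cell above, namely that more frequent words get smaller indices, only indices below `num_words` are kept, index 0 is reserved for padding, and short sequences are padded (and long ones trimmed) at the front.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Turn each text into a list of word indices below `num_words`."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_left(sequences, maxlen):
    """Keep the last `maxlen` indices and fill the front with zeros."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

texts = ["pepsi pepsi cola", "cola news"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)
print(word_index)              # {'pepsi': 1, 'cola': 2, 'news': 3}
print(sequences)               # [[1, 1, 2], [2]]
print(pad_left(sequences, 4))  # [[0, 1, 1, 2], [0, 0, 0, 2]]
```
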


In [5]:
#Convert the labels from 'REAL' and 'FAKE' to 1 and 0

for indx, label in enumerate(y):
    if label == 'FAKE':
        y[indx] = 0
    else:
        y[indx] = 1
Counter(y)

Counter({0: 3164, 1: 3171})

In [8]:
#Create the cnn+lstm nnet  and train it on our news articles. We see it has an 88% accuracy on the validation data. 

print('Start')
model = Sequential()
model.add(Embedding(2000, 128, input_length=300))
model.add(Dropout(0.2))
model.add(Conv1D(64, 5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(data, np.array(y), validation_split=0.25, epochs=3)

print('Done!')

Start
Train on 4751 samples, validate on 1584 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Done!


In [90]:
# save the tokenizer and model

import pickle
 
with open("keras_tokenizer.pickle", "wb") as f:
   pickle.dump(tokenizer, f)
model.save("news_fake_model.hdf5")

In [17]:
#This function grabs the text of articles given their url. We can tokenize the output of this and then feed it into
#our neural network to classify the article as real or fake 

import urllib.request
from bs4 import BeautifulSoup
    
def text_grabber(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text


In [18]:
#An example of the output 
text_grabber('http://21stcenturywire.com/2017/12/02/white-helmets-local-councils-uk-fco-financing-terrorism-syria-taxpayer-funds/')

'White Helmets & \'Local Councils\' - Is the UK FCO Financing Terrorism in Syria with Taxpayer Funds? - 21st Century Wire\nSupport 21WIRE >\nBecome a Member\nDonate Here\nShop 21Wire\nAbout 21WIRE\nTV & Radio >\n21WIRE.TV\nAlternate Current Radio\niTunes\nPatrick Henningsen LIVE\nPodomatic\nSoundCloud\nSpreaker\nStitcher\nSunday Wire Show\nUK Column Live\nSunday Wire Show\nHave Your Shout\nShop 21Wire\nDONATE HERE\nNews for the Waking Generation\nInternational\nMyanmar\nNorth Korea\nEurope\nBrexit\nGrenfell Tower\nNATO\nMiddle East\nAleppo\nIran\nIsrael\nQatar\nSaudi Arabia\nSyria\nWhite Helmets\nYemen\nUS News\n2016 Election\nCharlottesville\nFake News\nEurasia\nAfghanistan\nArmenia\nChina\nCrimea\nIndia\nIran\nRussia\nTurkey\nUkraine\nSci-Tech\nBitcoin\nHollywood\nWhite Helmets\n“Jihadis You Pay For”\nWhite Helmets & ‘Local Councils’ – Is the UK FCO Financing Terrorism in Syria with Taxpayer Funds?\nDecember 2, 2017\nBy Vanessa Beeley Leave a Comment\nVanessa Beeley\n21st Century Wir

In [146]:
#Let's try it on 6 real and 6 fake news stories

url_list_fake = ['https://www.theonion.com/russian-olympic-coach-gently-breaks-news-to-hulking-200-1821059083',  'https://www.theonion.com/fda-confirms-psilocybin-reduces-risk-of-mindlessly-foll-1821046978', 'https://ahtribune.com/us/trump-at-war/1948-iran-zionists-trump.html', 'https://ahtribune.com/human-rights/american-human-rights/2023-matthew-hoh.html', 'https://entertainment.theonion.com/leah-remini-rediscovers-her-faith-in-scientology-after-1820914252', 'https://politics.theonion.com/new-rnc-ad-endorses-roy-moore-he-s-a-scumbag-but-he-1821020657']

url_list_real =['http://abcnews.go.com/International/fallout-trumps-jerusalem-decision-dangerous-experts-warn/story?id=51615581', 'http://abcnews.go.com/Politics/wireStory/woman-volunteered-conyers-alleges-sexual-harassment-51613701', 'https://www.washingtonpost.com/news/wonk/wp/2017/12/01/gop-eyes-post-tax-cut-changes-to-welfare-medicare-and-social-security/?hpid=hp_hp-top-table-main_wb-medicare-240p%3Ahomepage%2Fstory', 'http://www.cnn.com/2017/12/06/politics/jerusalem-peace-process-white-house/index.html', 'http://money.cnn.com/2017/12/05/news/tax-haven-eu-blacklist/index.html', 'http://abcnews.go.com/Politics/note-metoo-partisan-split/story?id=51640123']

X_new_fake = []
X_new_real = []


# text_grabber downloads each page itself, so no separate fetch is needed
for url in url_list_fake:
    X_new_fake.append(text_grabber(url))

for url in url_list_real:
    X_new_real.append(text_grabber(url))

print('Done!')

Done!


In [148]:
#Tokenize and classify the 6 real and 6 fake stories

sequences_new_fake = tokenizer.texts_to_sequences(X_new_fake)
data_new_fake = pad_sequences(sequences_new_fake, maxlen=300)

predict_on_fake = model.predict(data_new_fake)

sequences_new_real = tokenizer.texts_to_sequences(X_new_real)
data_new_real = pad_sequences(sequences_new_real, maxlen=300)

predict_on_real = model.predict(data_new_real)

for indx, i in enumerate(['Fake', 'Real']):
    count_right = 0
    for prediction in [predict_on_fake, predict_on_real][indx]:
        if round(prediction[0]) == indx:
            count_right += 1
    print("Classified " + str(count_right) + "/" + str(len([predict_on_fake, predict_on_real][indx])) + " " + i + " news stories correctly")
        
        



Classified 5/6 Fake news stories correctly
Classified 5/6 Real news stories correctly


# Twitter Robot

In [62]:
#Keys for twitter robot. Credentials are read from environment variables instead of
#being hard-coded; the variable names below are our own convention.

import os

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

In [63]:
#We use Tweepy instead of twitter package as it is easier to implement a bot with Tweepy

import tweepy

# Authorize with the Twitter credentials defined above
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

In [91]:
#Load in a dataset of sites that have been classified as fake news sites 

import pandas as pd

fake_df = pd.read_csv('fake.csv')

fake_sites_mult = fake_df['site_url']

fake_sites = fake_sites_mult.unique()

fake_sites_list = fake_sites.tolist()

#### Collect tweets of people who share fake news stories

In [None]:
#This listens for tweets that share articles from these fake news sites, and stores the data of each tweet.
#We don't do anything more with this data now, but it could be expanded in the future to keep track of users who
#disseminate fake news to a large audience 

from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):
 
    def on_data(self, data):
        try:
            with open('python.json', 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True
 
    def on_error(self, status):
        print(status)
        return True
 
twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=fake_sites_list)

#### Tweet at people who talk about our customer (Pepsi) and share a fake news story in the same tweet 

In [None]:
#This will reply to any tweet that has our client's name in the text (Pepsi, in this example) as well as a fake news
#story. The idea is that if they're talking about our client and sharing a fake news article in the same Tweet, the
#fake news article is likely about our client

from time import sleep
company_name = 'Pepsi'


for tweet in tweepy.Cursor(api.search, q=company_name).items():
    try:
        for site in fake_sites_list:
            if site in tweet.text:
                sn = tweet.user.screen_name
                print('\nTweet by: @' + sn)
                m = "@%s Dear readers:  This post has been identified as containing a link to a website known for propagating misleading news stories.  Please read with caution and critical thinking!" % (sn)
                # Reply to the tweet that shared the link
                api.update_status(m, tweet.id)
                print('I tweeted!')
                sleep(60)
                # One reply per tweet is enough, even if several sites match
                break

    except tweepy.TweepError as e:
        print(e.reason)
    


            

# Reddit Bot

In [74]:
#We use praw to connect to the Reddit api

import praw

In [75]:
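#Define the Reddit bot. Secrets are read from environment variables instead of
#being hard-coded; the variable names below are our own convention.

import os

bot = praw.Reddit(user_agent='Walter_Defender',
                  client_id=os.environ['REDDIT_CLIENT_ID'],
                  client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                  username='Bot_Defender',
                  password=os.environ['REDDIT_PASSWORD'])
print('Done')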

Done


In [77]:
#We don't want to reply to the same thread twice. This saves the threads we've already replied to. 
submission_done = []

In [None]:
#Find threads that have our company name and a link to a fake news story, and then reply to those threads telling users
#that the article is from a site known for spreading fake news 

from time import sleep

company_name = 'Pepsi'
print("We'll protect you, " + company_name)
subreddit = bot.subreddit('testingground4bots')
for submission in subreddit.stream.submissions():
    for site in fake_sites_list:
        if site in submission.url and company_name in submission.title and submission not in submission_done:
            message = "Dear readers:  This post has been identified as containing a link to a website known for propagating misleading news stories.  Please read with caution and critical thinking!"
            submission.reply(message) # Send message
            submission_done.append(submission)
            print("I posted!")
            sleep(60)

In [76]:
#We don't want to reply to the same comment twice. This saves the comments we've already replied to.  

comment_done = []

In [None]:
#This will reply to any comment in any thread in the bot testing grounds subreddit that contains a link to a known
#fake news site and mentions our company in the comment's text. 

subreddit = bot.subreddit('testingground4bots')

comments = subreddit.stream.comments()

for comment in comments:
    text = comment.body
    author = comment.author # Fetch author
    
    if not (comment in comment_done):
        for site in fake_sites_list: 
            if site in text.lower() and company_name.lower() in text.lower():
                print(text)
                # Generate a message
                message = "Dear readers:  This post has been identified as containing a link to a website known for propagating misleading news stories.  Please read with caution and critical thinking! u/{0} ".format(author)
                comment.reply(message) # Send message
                comment_done.append(comment)
                print("I posted!")
                sleep(600)


### Reddit Bot that classifies an article as fake and then responds 

In [109]:
submission_done = []

In [None]:
#Find threads that have our company name and a link. We then pull the text of the article, and classify it as real or
#fake. We then reply to the thread if the article is fake, and let the users know that our algorithm classified it as 
#such. 

company_name = 'Pepsi'
print("We'll protect you, " + company_name)
subreddit = bot.subreddit('testingground4bots')
for submission in subreddit.stream.submissions():
    if company_name in submission.title:
        if submission.url:
            text_grab = text_grabber(submission.url)
            text_list = [text_grab]
            sequence = tokenizer.texts_to_sequences(text_list)
            data_text = pad_sequences(sequence, maxlen=300)


            predict_text = model.predict(data_text)
            
            if (round(predict_text[0][0]) == 0) and not(submission in submission_done):
                author = submission.author
                print(author)
                print(submission.title)
                print(submission.url)
                message = "Dear readers:  The link shared by u/{0} has been classified by our algorithm as a fake news article. Please read with caution and critical thinking! ".format(author)
                submission.reply(message) # Send message
                submission_done.append(submission)
                print("I posted!")
                sleep(600)

In [None]:
#Find threads that have our company name and a link to a real news story. This is for testing only, 
#and would not be implemented in the product
company_name = 'Pepsi'
print("We'll protect you, " + company_name)
subreddit = bot.subreddit('testingground4bots')
for submission in subreddit.stream.submissions():
    if company_name in submission.title:
        if submission.url:
            text_grab = text_grabber(submission.url)
            text_list = [text_grab]
            sequence = tokenizer.texts_to_sequences(text_list)
            data_text = pad_sequences(sequence, maxlen=300)


            predict_text = model.predict(data_text)
            
            if (round(predict_text[0][0]) == 1) and not(submission in submission_done):
                author = submission.author
                print(author)
                print(submission.title)
                print(submission.url)
                message = "Dear readers:  The link shared by u/{0} has been classified by our algorithm as a real news article. ".format(author)
                submission.reply(message) # Send message
                submission_done.append(submission)
                print("I posted!")
                sleep(600)

<img src="fake.png">

<img src="real.png">

<img src="reddit_example.png">

<img src="tweet_example.png">


Now that we've established the services we can offer as a corporation, we can put our business model in context. It is summarized in the following table:

<img src="s1.png">


A keyword is a term or set of terms, provided by the customer, that Defender uses to monitor and crawl the web and social media as specified for each plan. For example, P&G is a single keyword, but P&G, Gillette and Tide form a set of keywords for one customer. All three plans include Twitter and Reddit. The Pro and Advance plans offer additional features, such as a dashboard and more user accounts on the customer's side. The dashboard delivers real-time activity monitoring for the customer accounts on the subscribed plan. The user accounts on the Pro and Advance plans allow our customers to take advantage of other services available with their subscription, such as social media marketing and search engine optimization. The Advance plan adds sentiment analysis on Facebook, Instagram, and Telegram, as well as mainstream media monitoring and analysis for the customer's keywords. It also includes custom responses, in which the customer works with an assigned customer support representative to craft tailored responses to any fake news related to the customer.   
We priced our services based on the current market. We looked at our competitors and evaluated their services. Although they do not provide our core service (fake news detection), they compete with us in providing sentiment analysis for their clients. Brandwatch, Indico, Lexalytics and Agorapulse are some of the big players in the sentiment analysis area. They target small local businesses and corporations, and their starting plans range from \$500 to \$5000 per month. If a customer needs more services or add-on features, the customer must pay extra. Defender targets medium-sized and corporate clients, and we therefore charge premium prices for our services.    
Our startup financing is as follows:


<img src="S2.png">
