# Generating Fake News Headlines with Sentiment Analysis and Deep Learning

### The code below generates a fake news headline document with either positive or negative sentiment, using a training set of a million headlines. The output of this script is a .pdf file (named 'new_headline.pdf' if run with default settings) containing a generated headline with the sentiment specified below and an image acquired via the Bing image search API by using the generated headline as a query. Please check the Technology_Review notebook in this folder for more information on some of the methods used in this notebook.

#### Version 0 - 12/07/17
#### Shairoz Sohail
#### CS 410 - University of Illinois, Urbana-Champaign

Hello!

You are on this page for one of two reasons:
1) This is your first time using this piece of code (software?) and you arrived here from the readme.
2) You have successfully understood the structure of this software and would like to retrain the news headline models with new data, new parameters, or new models.

If you are (1), then this is all the code on the backend of the fake news generator. The data is pulled in, a model is trained, and functions are developed and tested to create the fake news document. You should be able to run every cell in this notebook, but be warned that the model takes a REALLY LONG TIME TO TRAIN (>= 1 day). The current models have been trained for 5 epochs each. Unless you plan to train for longer than that, please do not run the modeling and parameter-saving cells, since they may overwrite the existing models. Details on the methods are available from the readme as a technology review. A large part of the time allocated to this project was spent trying to train my own sentiment analyzer. Even after extensive training, it did not seem as reliable as the pre-trained VADER classifier available in the NLTK package, so that was used instead. Why reinvent the wheel?

If you are (2), then congratulations! As someone with limited professional software development experience, I'm happy I was able to cobble together a semi-functioning piece of software. If you have previous experience with deep learning (highly recommended), then you'll immediately recognize that these language models are woefully under-trained. 5 epochs is literally the blink of an eye in the GPU-powered world of deep learning. This is a character-level model, meaning it generates text character by character. The reasoning for this is the greatly decreased number of parameters: the input and output layers scale with the number of unique characters rather than the number of unique words. By default the model has the following architecture:

Input >>> LSTM(128) >>> Dense(len(chars)~46) >>> Softmax()
[optimizer=RMSProp]

If you fully understand what's going on, feel free to run the model with your own architecture and un-comment the 'Saving Parameters' cell.
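To make the parameter argument concrete, here is a rough sketch that counts the trainable parameters of the default architecture. It assumes one-hot inputs with no embedding layer and the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias). The 50,000-word vocabulary is a hypothetical figure chosen only for comparison.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, output_dim):
    # Weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim


def model_param_count(vocab_size, units=128):
    # LSTM(units) followed by Dense(vocab_size); softmax adds no parameters
    return lstm_param_count(vocab_size, units) + dense_param_count(units, vocab_size)


char_level = model_param_count(46)       # 46 unique characters, as in this notebook
word_level = model_param_count(50000)    # hypothetical word vocabulary size
print("character-level parameters:", char_level)
print("word-level parameters:", word_level)
```

Under these assumptions the character-level model stays under a hundred thousand parameters, while the same architecture over a word vocabulary would be several hundred times larger.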

In [2]:
#####################
## Package Imports ##
#####################


from __future__ import print_function  # __future__ imports must come first
import pandas as pd
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer as Sentiment
from keras.models import Sequential, model_from_json  # model_from_json is used by get_headline()
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
import random
import sys
import os




In [3]:
##################
## Loading data ##
##################


headlines = pd.read_csv('million_headlines.csv')
print(str(headlines.shape[0]) + " rows x " + str(headlines.shape[1]) + " columns")
(headlines.head())


1093281 rows x 2 columns


Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [4]:
########################
## Sentiment Analysis ##
########################

### Information about the VADER sentiment analyzer chosen for this code
# https://github.com/cjhutto/vaderSentiment
### Citation
# Hutto, C.J. & Gilbert, E.E. (2014). 
# VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. 
# Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

## Running a pre-trained sentiment analyzer over the headlines
sentiment_analyzer = Sentiment()
sentiment_score = list()
for headline in headlines.headline_text:
    sentiment_score.append(sentiment_analyzer.polarity_scores(text = headline)['compound'])

In [5]:
## Attaching sentiment information back onto the headlines data and inspecting
## We are being optimistic and labeling headlines with sentiment scores of 0 (neutral) and above as positive
## Obviously, this leads to the positive sentiment file being quite a bit bigger than the negative sentiment file
## This is justified by the negative headlines being more polarized (average sentiment < 0)

headlines['sentiment'] = pd.Series(sentiment_score).values
print("Average sentiment score = " + str(np.mean(headlines.sentiment)))
print("Positive sentiment headlines: " + str(len(np.argwhere(headlines.sentiment > 0))))
print("Negative sentiment headlines: " + str(len(np.argwhere(headlines.sentiment < 0))))
headlines.head()

Average sentiment score = -0.07637313124438093
Positive sentiment headlines: 228365
Negative sentiment headlines: 379457


Unnamed: 0,publish_date,headline_text,sentiment
0,20030219,aba decides against community broadcasting lic...,0.0
1,20030219,act fire witnesses must be aware of defamation,-0.34
2,20030219,a g calls for infrastructure protection summit,0.0
3,20030219,air nz staff in aust strike for pay rise,-0.2263
4,20030219,air nz strike to affect australian travellers,-0.128


In [6]:
## Saving labeled data

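# Neutral headlines (score == 0) are grouped with the positive ones
positive_headlines = headlines.loc[headlines.sentiment >= 0, 'headline_text']
negative_headlines = headlines.loc[headlines.sentiment < 0, 'headline_text']
# Note: to_csv also writes the row index, so index numbers end up in the training text
positive_headlines.to_csv('positive_headlines.csv')
negative_headlines.to_csv('negative_headlines.csv')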
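
The labeling rule can be illustrated on the five sample scores shown earlier. This is only a sketch: `label_headline` is a hypothetical helper, not part of the notebook, and the totals are the counts printed in the recorded output.

```python
def label_headline(compound_score):
    # Scores of exactly 0 (neutral) fall on the positive side
    return 'positive' if compound_score >= 0 else 'negative'


# Compound scores of the first five headlines in the table above
scores = [0.0, -0.34, 0.0, -0.2263, -0.128]
labels = [label_headline(s) for s in scores]
print(labels)

# Counts taken from the recorded output of this notebook
total, strictly_positive, negative = 1093281, 228365, 379457
neutral = total - strictly_positive - negative
positive_file_rows = strictly_positive + neutral
print("neutral headlines:", neutral)
print("rows in the positive file:", positive_file_rows)
```

Most of the positive file is therefore made up of neutral headlines, which is worth keeping in mind when judging how "positive" the generated text is.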


### Model Training

In [None]:
sentiment = 'positive'

In [22]:
########################
##  Data Preparation  ##
########################

if sentiment=='positive':
    path = 'positive_headlines.csv'
else:
    path = 'negative_headlines.csv'
with open(path) as f:
    text = f.read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
# np.bool was removed in recent NumPy releases; the built-in bool is equivalent
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1



corpus length: 33577331
total chars: 46
nb sequences: 11192431
Vectorization...
Build model...
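
As a sanity check on the numbers above, the following sketch reproduces the sequence count from the corpus length and estimates the size of the one-hot input array. It assumes one byte per boolean element, which is how NumPy stores `bool` arrays. The toy string at the end is only there to show what the windows look like.

```python
def window_count(corpus_length, maxlen=40, step=3):
    # Number of start positions visited by range(0, corpus_length - maxlen, step)
    return len(range(0, corpus_length - maxlen, step))


def one_hot_bytes(n_sequences, maxlen, n_chars):
    # One byte per element of a boolean array of shape (n_sequences, maxlen, n_chars)
    return n_sequences * maxlen * n_chars


n_seq = window_count(33577331)
x_bytes = one_hot_bytes(n_seq, 40, 46)
print("sequences:", n_seq)
print("size of x in GB:", round(x_bytes / 1e9, 1))

# Tiny illustration of the windowing: (input window, next character) pairs
toy = "air nz strike"
pairs = [(toy[i:i + 5], toy[i + 5]) for i in range(0, len(toy) - 5, 3)]
print(pairs)
```

The input array alone is on the order of 20 GB, so a smaller corpus sample or a larger `step` may be needed on machines with less memory.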


In [None]:
###########################
##  Model Specification  ##
###########################

# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
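
The effect of the `temperature` argument (called diversity in the training loop) is easier to see on a small distribution. The sketch below applies the same rescaling as `sample()` using only the standard library. `apply_temperature` and the example probabilities are illustrative, not part of the notebook.

```python
import math


def apply_temperature(probs, temperature):
    # Same transformation as sample(): divide log-probabilities by the
    # temperature, exponentiate, then renormalize so the values sum to 1
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


probs = [0.5, 0.3, 0.2]  # made-up next-character probabilities
for t in [0.2, 1.0, 1.2]:
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Low temperatures concentrate almost all of the probability on the most likely character, which gives repetitive but safe text. Temperatures above 1 flatten the distribution and produce more varied, more error-prone output.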


In [None]:
######################
##  Model Training  ##
######################


# train the model for 5 epochs, output generated text after each one
for iteration in range(1, 6):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(x, y,
              batch_size=1028,
              epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
            
        # Samples are printed after every epoch so you can stop training once the text looks coherent
        # Otherwise, 5 epochs take over a day on a top-spec 2017 MacBook Pro
        print()


--------------------------------------------------
Iteration 1
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "years eve revelry
209786,sydney gears up"
years eve revelry
209786,sydney gears up for break at australian and stands
1077887,australia seeks marine the govt to be court
1078584,australia report committee to be to be set to be leaders
1047297,australia to be stand in the state to be committee to restrictions
1052880,australia to be leadership stand to be senate
1073124,council to stay on the research at police start
1074852,council seeks stand to be stand in the competition
108

----- diversity: 0.5
----- Generating with seed: "years eve revelry
209786,sydney gears up"
years eve revelry
209786,sydney gears up for marine september on school rescue
667351,socceroo deal over rain changed
514387,committee to consider a record adelaide the business
1066158,marathon to station for australia pm growing torres
1002288,aboriginal community heads to toll of peace lead at c

KeyboardInterrupt: 
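
The samples above contain CSV row indices (for example `209786,`) because the index column is part of the training text. One way to clean generated text before using it as a headline is sketched below. `extract_headlines` is a hypothetical helper and not part of the notebook.

```python
import re


def extract_headlines(generated):
    # Split generated text into lines and strip a leading "<digits>," index
    headlines = []
    for line in generated.split('\n'):
        text = re.sub(r'^\d+,', '', line).strip()
        # Skip empty lines and truncated index fragments such as "108"
        if text and not text.isdigit():
            headlines.append(text)
    return headlines


sample_text = ("years eve revelry\n"
               "209786,sydney gears up for marine september on school rescue\n"
               "667351,socceroo deal over rain changed\n"
               "108")
print(extract_headlines(sample_text))
```

Note that the first element is the tail of the seed text rather than a complete headline, so a caller would typically discard it.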

In [None]:
#########################
##  Saving Parameters  ##
#########################


### Run this immediately after the training cell above finishes to avoid losing the trained model
## This can take a while to run

# serialize model to JSON
model_json = model.to_json()
path = sentiment + "_model.json"
with open(path, "w") as json_file:
    json_file.write(model_json)
print("Saved model architecture to disk")
    
# serialize weights to HDF5
path_h5 = sentiment + "_model.h5"
model.save_weights(path_h5)
print("Saved model weights to disk")


In [None]:
##################
##  Test Cases  ##
##################

def test_headline():
    if(len(get_headline())>0):
        return(True)
    else:
        return(False)

def test_image():
    get_image_of("This is a test query", "image_result_test.jpg")
    if('image_result_test.jpg' in os.listdir()):
        os.remove('image_result_test.jpg')
        return(True)
    else:
        return(False)
    
def test_report():
    generate_headline_document("This is a test headline", "fake_news_logo.jpg", "new_headline_test.pdf")
    if('new_headline_test.pdf' in os.listdir()):
        os.remove("new_headline_test.pdf")
        return(True)
    else:
        return(False)
    

In [None]:
############################
##  Function Definitions  ##
############################


def get_headline(seed='the', sentiment='positive'):
    # Relies on maxlen, chars, char_indices, indices_char and sample() from the cells above
    print("Creating model")
    model_name = 'positive' if sentiment == 'positive' else 'negative'

    # load json and create model
    with open(model_name + '_model.json', 'r') as json_file:
        loaded_model = model_from_json(json_file.read())

    # load weights into new model
    loaded_model.load_weights(model_name + '_model.h5')
    loaded_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    print("Loaded " + str(sentiment) + " model from disk")

    print("Generating headline")
    diversity = 0.2
    # the training text was lower-cased, so the seed must be lower-case too
    sentence = seed.lower()[-maxlen:]
    generated = sentence

    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))  # unused trailing positions stay zero (padding)
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = loaded_model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        generated += next_char
        # let the window grow up to maxlen characters, then slide it
        sentence = (sentence + next_char)[-maxlen:]

    print(generated)
    # return the text so callers such as test_headline() can inspect it
    return generated
        
        
def get_image_of(query, filepath='image_result.jpg'):
    subscriptionKey = '058604aaaf914e7f9e46671f0a8c85ae'
    import http.client, urllib.parse, urllib.request, json
    # Verify the endpoint URI.  At this writing, only one endpoint is used for Bing
    # search APIs.  In the future, regional endpoints may be available.  If you
    # encounter unexpected authorization errors, double-check this value against
    # the endpoint for your Bing search instance in your Azure dashboard.
    host = "api.cognitive.microsoft.com"
    path = "/bing/v7.0/images/search"
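    # urllib.parse.quote() below handles the encoding; replacing spaces with '+'
    # first would make the search look for literal plus signs
    term = query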

    def BingImageSearch(search):
        "Performs a Bing image search and returns the results."

        headers = {'Ocp-Apim-Subscription-Key': subscriptionKey}
        conn = http.client.HTTPSConnection(host)
        query = urllib.parse.quote(search)
        conn.request("GET", path + "?q=" + query, headers=headers)
        response = conn.getresponse()
        headers = [k + ": " + v for (k, v) in response.getheaders()
                       if k.startswith("BingAPIs-") or k.startswith("X-MSEdge-")]
        return headers, response

    print('Searching images for: ', term)
    headers, result = BingImageSearch(term)
    
    response_dict = json.load(result)
    image_url = response_dict['value'][1]['contentUrl']
    print('Saving image...')
    urllib.request.urlretrieve(image_url, filepath)


def generate_headline_document(headline_text, headline_image, filename='new_headline.pdf'):
    from reportlab.pdfgen import canvas
    from reportlab.lib.units import cm
    c = canvas.Canvas(filename)
    c.setFont('Helvetica-Bold', 26)
    c.drawString(80,750,headline_text)
    c.drawInlineImage(headline_image, x=80, y=420, preserveAspectRatio=True, width=10*cm, anchor='w')
    c.save()
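
When building the search request, the query only needs to be URL-encoded once. The sketch below shows one way to do that with the standard library, along with what happens when a string that already uses `+` for spaces is quoted again. `build_search_path` is a hypothetical helper, and the path is the one used in `get_image_of`.

```python
from urllib.parse import quote, urlencode

SEARCH_PATH = "/bing/v7.0/images/search"  # same path as in get_image_of


def build_search_path(query):
    # urlencode turns spaces into '+' and escapes reserved characters
    return SEARCH_PATH + "?" + urlencode({"q": query})


print(build_search_path("air nz strike"))

# Quoting a string whose spaces were already replaced by '+' escapes the
# plus signs, so the service receives them as literal characters
print(quote("air+nz+strike"))
```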


In [None]:
#####################
##  Running Tests  ##
#####################

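# Print every result; a notebook cell only displays the value of its last expression
print("test_headline:", test_headline())
print("test_image:", test_image())
print("test_report:", test_report())
for model_file in ['positive_model.json', 'negative_model.json',
                   'positive_model.h5', 'negative_model.h5']:
    print(model_file, "present:", model_file in os.listdir())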
