<a href="https://colab.research.google.com/github/OmarMeriwani/Fake-Financial-News-Detection/blob/master/Fact_Checking_Ups_and_Downs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fact Checking - Ups and Downs
This notebook contains the source code for the ups-and-downs classifier, which predicts whether a news headline reports an event that leads to a rise or a fall in the stock market measures of a specific company.

In [0]:
import numpy as np
from string import punctuation
import pandas as pd
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
from sklearn.model_selection import train_test_split
from nltk.stem.porter import *
from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding, GlobalMaxPooling1D, Concatenate
from Vocabulary import clean_doc
from keras.utils import np_utils
import os
from stanfordcorenlp import StanfordCoreNLP
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


The preprocessing tool definitions and the Stanford CoreNLP setup are explained [here](https://github.com/OmarMeriwani/Fake-Financial-News-Detection/blob/master/Final/Objectivity/News_Sources_Analysis_Who_Said.ipynb).

In [0]:
lemmatizer = WordNetLemmatizer()
java_path = "C:/Program Files/Java/jdk1.8.0_161/bin/java.exe"
os.environ['JAVAHOME'] = java_path
host='http://localhost'
port=9000
scnlp =StanfordCoreNLP(host, port=port,lang='en', timeout=30000)
stemmer = PorterStemmer()

This method checks whether each vocabulary word exists in a pretrained embedding and, if so, copies its 300-dimensional vector into the corresponding row of the weight matrix to build the semantic vector representation. Words that do not exist in the embedding keep an all-zero row.

In [0]:
def get_weight_matrix2(embedding, vocab):
    vocab_size2 = len(vocab) + 1
    weight_matrix = zeros((vocab_size2, 300))
    for word, i in vocab:
        try:
            vector = embedding.get_vector(word)
        except KeyError:
            # word is missing from the pretrained embedding; its row stays all zeros
            continue
        weight_matrix[i] = vector
    return weight_matrix
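
The lookup described above can be sketched without gensim or NumPy. This is only an illustration under stated assumptions: `toy_embedding` is a plain dictionary standing in for the pretrained vectors, the vectors have 3 dimensions instead of 300, and `build_weight_rows` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
# Toy stand-in for pretrained embeddings: word -> vector (3 dims instead of 300).
toy_embedding = {
    "profit": [0.1, 0.2, 0.3],
    "loss": [-0.1, -0.2, -0.3],
}

def build_weight_rows(embedding, word_index, dim):
    """Return one row per vocabulary index; row 0 is reserved for padding."""
    rows = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = embedding.get(word)
        if vector is not None:
            rows[i] = list(vector)  # known word: copy its pretrained vector
        # unknown word: the row stays all zeros
    return rows

word_index = {"profit": 1, "loss": 2, "#ner": 3}
weights = build_weight_rows(toy_embedding, word_index, dim=3)
print(weights[1])  # vector for "profit"
print(weights[3])  # "#ner" is not in the toy embedding, so zeros
```

Row 0 is never assigned because the Keras tokenizer starts its word indices at 1, which is why the matrix has `len(vocab) + 1` rows.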

The method below reads a CSV dataset file and performs the following tasks:
* Get the news title and the stock market effect from each row.
* Replace named entities with `#ner`.
* Replace numbers with `#num`.
* Remove stop words.
* Remove punctuation.
* Get POS tags.
* Create an array of sentences.
* Convert the effect to either 1 or 0 (the original values lie in a range between -1 and 1).

In [0]:
def readfile(filename):
    df = pd.read_csv(filename,header=0)
    mode = 'sentence'
    data = pd.DataFrame(columns=['title','effect'])
    prev = ''
    seq = 0
    table = str.maketrans('', '', punctuation)

    for i in range(0,len(df)):
        sentence = df.loc[i][3]
        company = str(df.loc[i][2]).lower()

        tokens = scnlp.word_tokenize(sentence)
        sentenceList = []
        for word in tokens:
            #print(word)
            # treat tokens written entirely in upper case (e.g. tickers) as named entities
            isAllUpperCase = all(letter.isupper() for letter in word)

            if isAllUpperCase:
                sentenceList.append('#ner')
            else:
                sentenceList.append(str(word))
        tokens = sentenceList

        tokens = [w.translate(table) for w in tokens]
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in tokens if not w in stop_words]
        # filter out short tokens
        tokens = [lemmatizer.lemmatize(word.lower()) for word in tokens if len(word) > 1]
        sentence = ' '.join(tokens)
        NER = scnlp.ner(str(sentence))
        POS = scnlp.pos_tag(str(sentence).lower())

        sentenceList = []
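        # use j here so the row index i stays valid for the effect lookup below
        for j in range(0,len(NER)):
            w = NER[j][0]
            n = NER[j][1]
            # POS tag of the same token (guarded in case the two taggers tokenize differently)
            pos = POS[j][1] if j < len(POS) else ''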
            #print(w, n)
            if str(w).isnumeric() == True:
                sentenceList.append('#num')
                continue
            if pos == 'NNP' and w != '#ner':
                sentenceList.append('#ner')
                continue
            if str(n) == 'O' :
                sentenceList.append(w)
            else:
                sentenceList.append('#ner')
        sentence = ' '.join(sentenceList)
        effect = df.loc[i][4]
        if effect > 0:
            effect = 1
        else:
            effect = 0
        if sentence.strip() != '':
            data.loc[seq] = [sentence,effect]
            print(sentence, effect)
            seq += 1
    return data
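
The masking and labelling rules used by `readfile` can be tried in isolation with the simplified sketch below. It is not the exact pipeline: it assumes the tokens are already split, leaves out stop-word removal, lemmatization and the CoreNLP taggers, and the helper names `mask_tokens` and `binarize` are hypothetical.

```python
from string import punctuation

def mask_tokens(tokens):
    """Apply the masking rules to a list of tokens (no CoreNLP needed)."""
    table = str.maketrans("", "", punctuation)
    masked = []
    for token in tokens:
        if token and all(ch.isupper() for ch in token):
            masked.append("#ner")   # all-caps token, e.g. a ticker symbol
            continue
        token = token.translate(table).lower()
        if not token:
            continue                # token consisted of punctuation only
        if token.isnumeric():
            masked.append("#num")
        else:
            masked.append(token)
    return masked

def binarize(effect):
    """Map a score in [-1, 1] to 1 (up) or 0 (down or neutral)."""
    return 1 if effect > 0 else 0

print(mask_tokens(["BP", "shares", "rise", "5", "%", "today"]))
# ['#ner', 'shares', 'rise', '#num', 'today']
print(binarize(0.4), binarize(-0.2), binarize(0.0))
# 1 0 0
```

Note that a score of exactly 0 falls into class 0, as in `readfile`, where only strictly positive effects are labelled 1.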


Read the dataset, split training and testing samples and convert the labels into categorical output.

In [0]:
data = readfile('SSIX News headlines Gold Standard EN.csv')
headlines = data[['title']]
effects = data[['effect']]

x_train, x_test, y_train, y_test = train_test_split(headlines,effects,test_size=0.2)
traindata = np.array(x_train)
testdata = np.array(x_test)

y_testold = y_test
y_test = np_utils.to_categorical(y_test,num_classes=2)
print(y_testold, y_test)
y_train = np_utils.to_categorical(y_train,num_classes=2)
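
For readers unfamiliar with categorical labels, the conversion turns each integer class into a one-hot row. The sketch below reproduces that idea in plain Python; `to_one_hot` is a hypothetical helper written for illustration, not the Keras function itself.

```python
def to_one_hot(labels, num_classes):
    """Convert integer class labels to one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows

print(to_one_hot([0, 1, 1], num_classes=2))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

With two classes, label 0 (down) becomes `[1.0, 0.0]` and label 1 (up) becomes `[0.0, 1.0]`, matching the two output units of the network.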


Prepare the data for the word2vec vectors by converting the text into integer sequences and padding them to the maximum length of the news titles in the training set.

In [0]:
train_docs = traindata[:,0]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_docs)

test_docs = testdata[:,0]
# pad sequences
encoded_docs = tokenizer.texts_to_sequences(train_docs)
max_length = max([len(s.split()) for s in train_docs])
print('max_length', max_length)
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

encoded_docs = tokenizer.texts_to_sequences(test_docs)
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1
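
What post-padding does to the encoded titles can be sketched in plain Python. `pad_post` is a hypothetical helper that assumes a positive `maxlen` and mimics `padding='post'` together with the Keras default of truncating from the front; the integer sequences are made-up examples.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad (or front-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

encoded = [[4, 2], [7, 1, 3, 5], [9]]
max_length = max(len(seq) for seq in encoded)
print(pad_post(encoded, max_length))
# [[4, 2, 0, 0], [7, 1, 3, 5], [9, 0, 0, 0]]
```

Because `max_length` is computed on the training titles only, a longer test title would be truncated rather than padded.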


Load the pretrained Google News word2vec vectors and build the weight matrix for the embedding layer.

In [0]:
wv_from_bin = KeyedVectors.load_word2vec_format(datapath('E:/Data/GN/GoogleNews-vectors-negative300.bin'), binary=True)
embedding_vectors = get_weight_matrix2(wv_from_bin, tokenizer.word_index.items())

print('embedding_vectors.shape() =============================')
print(embedding_vectors.shape)

# create the embedding layer
embedding_layer = Embedding(vocab_size, 300, weights=[embedding_vectors], input_length=max_length, trainable=False)
# define model


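Neural network parameters: the embedding dimension is 300, and four filter sizes (1 to 4) are used with 100 filters each. The batch size and the number of epochs are set as well; note that `num_epochs` is not used by the training cell, which passes `epochs=50` directly.
The last three lines define the input of the network.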

In [0]:
embeding_dim = 300
filter_sizes = (1,2,3,4)
num_filters = 100
batch_size = 64
num_epochs = 500

input_shape = (max_length,)
model_input = Input(shape=input_shape)
zz = embedding_layer(model_input)


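The deep neural network model starts with parallel convolution layers with ReLU activation, one per filter size, each followed by global max pooling. Their outputs are concatenated and passed through a dropout layer with a rate of 0.8, then through three dense layers, with a second dropout layer before the last one.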

In [0]:
conv_blocks = []
for sz in filter_sizes:
    conv = Convolution1D(filters=num_filters,
                         kernel_size=sz,
                         padding="valid",
                         activation="relu",
                         strides=1)(zz)
    conv = GlobalMaxPooling1D()(conv)
    conv_blocks.append(conv)
z = Concatenate()(conv_blocks if len(conv_blocks) > 1 else conv_blocks[0])
z = Dropout(0.8)(z)
model_output = Dense(10, activation="sigmoid" , bias_initializer='zeros')(z)
model_output = Dense(10)(model_output)
model_output = Dropout(0.8)(model_output)
model_output = Dense(2, activation="selu")(model_output)
model = Model(model_input, model_output)
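
To make the convolution branches more concrete, the sketch below runs a "valid" 1D convolution with ReLU followed by global max pooling on a toy input. It is an illustration only: the input uses one number per token instead of a 300-dimensional vector, each branch has a single filter instead of 100, and the kernel weights are made up.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence (stride 1, no padding), then apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        activation = sum(x * w for x, w in zip(window, kernel))
        outputs.append(max(0.0, activation))  # ReLU
    return outputs

def global_max_pool(feature_map):
    """Keep only the strongest response of one filter over the whole title."""
    return max(feature_map)

sequence = [0.5, -1.0, 2.0, 0.0, 1.5]  # toy one-dimensional "embeddings"
kernels = {1: [1.0], 2: [1.0, 1.0], 3: [1.0, -1.0, 1.0]}  # hypothetical filter weights

pooled = []
for size, kernel in kernels.items():
    feature_map = conv1d_valid(sequence, kernel)
    print(size, len(feature_map), feature_map)
    pooled.append(global_max_pool(feature_map))

print(pooled)  # one value per filter
# [2.0, 2.0, 3.5]
```

A kernel of size k yields `max_length - k + 1` positions, but pooling reduces every filter to a single value, so the concatenated vector in the model above has 4 * 100 = 400 features regardless of the title length.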


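Model compilation, fitting and evaluation. The commented-out early stopping callback was used during the experiments. Note that the test split also serves as the validation data during training.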

In [0]:
model.compile(loss="categorical_hinge", optimizer="adam", metrics=["accuracy"])
model.summary(85)
#callback = tf.keras.callbacks.EarlyStopping(monitor='val_acc', mode='max', min_delta=1, patience=50)
history = model.fit(Xtrain, y_train, batch_size=batch_size, epochs=50,
          validation_data=(Xtest, y_test), verbose=2)
print('History', history.history)
loss, acc = model.evaluate(Xtest, y_test, verbose=2)
print('Test Accuracy: %f' % (acc*100))
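
The `categorical_hinge` loss used above penalizes a sample unless the score of the true class exceeds the largest other score by a margin of 1. The per-sample sketch below follows the formula documented for Keras, `max(0, neg - pos + 1)`; the function is written in plain Python for illustration and the example scores are made up.

```python
def categorical_hinge(y_true, y_pred):
    """Categorical hinge loss for one sample: max(0, neg - pos + 1)."""
    # score assigned to the true class
    pos = sum(t * p for t, p in zip(y_true, y_pred))
    # largest wrong-class score (the true class contributes 0 to this maximum)
    neg = max((1.0 - t) * p for t, p in zip(y_true, y_pred))
    return max(0.0, neg - pos + 1.0)

print(categorical_hinge([0.0, 1.0], [-0.5, 1.5]))  # margin satisfied, loss is 0.0
print(categorical_hinge([0.0, 1.0], [0.8, 0.2]))   # wrong class scores higher, loss is about 1.6
```

Since the loss works on raw scores rather than probabilities, it is compatible with the unbounded `selu` output layer used in the model.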
