<a href="https://colab.research.google.com/github/OmarMeriwani/Fake-Financial-News-Detection/blob/master/Similarity_Check.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Similarity Check
This notebook contains the source code for the similarity check over news groups, which is part of the fake financial news detection framework.

In [0]:
import numpy as np
from string import punctuation
import pandas as pd
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
from sklearn.model_selection import train_test_split
from nltk.stem.porter import *
from keras.utils import np_utils
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Defining the lemmatizer and stemmer to be used in the next steps.

In [0]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()


Normalization method that drops non-ASCII characters, tokenizes the text on white space, removes punctuation and stop words, discards single-character tokens, and finally performs case folding and lemmatization.

In [0]:
def clean_doc(doc):
    # drop non-ASCII characters
    doc = doc.encode('ascii', errors='ignore').decode("utf-8")
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # filter out stop words (case-sensitive, because this runs before lowercasing)
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens, then lowercase and lemmatize the rest
    tokens = [lemmatizer.lemmatize(word.lower()) for word in tokens if len(word) > 1]
    return tokens
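
As a rough illustration of how the order of these steps affects the result, the sketch below re-implements the same pipeline with a tiny hard-coded stop-word set. `STOP_WORDS` and `normalize` are hypothetical stand-ins (the real NLTK list is much longer), and lemmatization is left out so the example runs without NLTK data. Because stop words are filtered before case folding, a capitalized stop word such as "The" passes through the filter.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"the", "a", "of", "in", "and"}

def normalize(doc, lowercase_first=False):
    """Split on white space, strip punctuation, drop stop words and short tokens, lowercase."""
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    if lowercase_first:
        # optional variant: fold case before the stop-word filter
        tokens = [w.lower() for w in tokens]
    tokens = [w for w in tokens if w not in STOP_WORDS]
    return [w.lower() for w in tokens if len(w) > 1]

text = "The shares of Acme, Inc. rose in early trading."
same_order = normalize(text)                         # same step order as clean_doc
lower_first = normalize(text, lowercase_first=True)  # case folding moved earlier
print(same_order)
print(lower_first)
```

The first list still contains `the`, while the second does not. Whether that matters depends on how often stop words start a sentence in the dataset.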


A method that returns the text of a file.

In [0]:
def load_doc(filename):
    # the context manager closes the file even if reading fails
    with open(filename, 'r') as file:
        return file.read()


Line-by-line cleaning of the lines in a document.

In [0]:
def doc_to_clean_lines(doc):
    tokens = []
    for line in doc.splitlines():
        # accumulate the tokens of every line instead of keeping only the last one
        tokens.extend(clean_doc(line))
    return ' '.join(tokens)

This method checks whether each word of the vocabulary exists in the pretrained embeddings and, if so, copies its vector into the row of the weight matrix given by the word's index, which yields the semantic vector representations. Words that do not exist in the embeddings keep an all-zero row.

In [0]:
def get_weight_matrix2(embedding, vocab):
    # one row per word index; row 0 is reserved for padding
    vocab_size2 = len(vocab) + 1
    weight_matrix = zeros((vocab_size2, 300))
    for word, i in vocab:
        try:
            vector = embedding.get_vector(word)
        except KeyError:
            # word not present in the pretrained embeddings: keep the zero row
            continue
        weight_matrix[i] = vector
    return weight_matrix
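
To make the row layout concrete, here is a minimal sketch with a toy 4-dimensional embedding table. The dictionary `toy_embedding`, the vocabulary `toy_vocab`, and the helper `build_weight_matrix` are made up for illustration; the notebook itself uses 300-dimensional vectors loaded through gensim. Row `i` holds the vector of the word with index `i`, row 0 stays empty because the Keras tokenizer starts its indices at 1, and a word missing from the embeddings keeps an all-zero row.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; a plain dict stands in for gensim KeyedVectors.
toy_embedding = {
    "stock": np.array([0.1, 0.2, 0.3, 0.4]),
    "market": np.array([0.5, 0.6, 0.7, 0.8]),
}
# Word -> index mapping in the style of tokenizer.word_index (indices start at 1).
toy_vocab = {"stock": 1, "market": 2, "zzzunknown": 3}

def build_weight_matrix(embedding, vocab, dim):
    matrix = np.zeros((len(vocab) + 1, dim))
    for word, i in vocab.items():
        vector = embedding.get(word)  # None for words without an embedding
        if vector is not None:
            matrix[i] = vector
    return matrix

weights = build_weight_matrix(toy_embedding, toy_vocab, dim=4)
print(weights.shape)
print(weights)
```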


This method reads a CSV file and builds a dataset of normalized sentences, timestamps, and groups (the groups serve as the training labels in this experiment).

In [0]:
def readfile(filename):
    df = pd.read_csv(filename, header=0)
    data = []
    for i in range(len(df)):
        # columns by position: 1 = sentence, 2 = group, 3 = timestamp
        sentence = str(df.iloc[i, 1])
        group = int(df.iloc[i, 2])
        timestamp = df.iloc[i, 3]
        sentence = doc_to_clean_lines(sentence)
        data.append([sentence, timestamp, group])
    return data


This method splits the data into training and test datasets group by group, so that the samples of each group are divided between the two sets according to the given percentage.

In [0]:
def sortByGroup(val):
    return val[2]

def split(docs, percentage):
    docs.sort(key=sortByGroup)
    training = []
    test = []
    groups = []
    previousGroup = None

    def splitGroup(samples):
        # After collecting all the samples of a specific group,
        # train_test_split from sklearn divides them
        groupsTraining, groupsTest = train_test_split(samples, test_size=percentage)
        training.extend(groupsTraining)
        test.extend(groupsTest)

    for sentence, _, group in docs:
        # a new group id means the previous group is complete
        if group != previousGroup and groups:
            splitGroup(groups)
            groups = []
        groups.append([sentence, group])
        previousGroup = group
    # split the final group, which has no following group to trigger the split
    if groups:
        splitGroup(groups)
    return training, test
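
The idea of splitting each group separately, so that every group contributes samples to both sets, can also be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's code: `split_by_group` is a hypothetical helper, it shuffles with `random` instead of calling `train_test_split`, it rounds the test share of each group up, and the data is made up.

```python
import math
import random
from itertools import groupby

def split_by_group(docs, percentage, seed=0):
    """docs: list of [sentence, timestamp, group]; returns (training, test) lists of [sentence, group]."""
    rng = random.Random(seed)
    training, test = [], []
    ordered = sorted(docs, key=lambda d: d[2])
    for group, members in groupby(ordered, key=lambda d: d[2]):
        samples = [[d[0], d[2]] for d in members]
        rng.shuffle(samples)
        # number of test samples for this group, rounded up
        n_test = math.ceil(len(samples) * percentage)
        test.extend(samples[:n_test])
        training.extend(samples[n_test:])
    return training, test

# Made-up data: two groups (1 and 2) with five sentences each.
docs = [["sentence %d" % i, None, 1 + i % 2] for i in range(10)]
training, test = split_by_group(docs, 0.2)
print(len(training), len(test))
print(sorted(g for _, g in test))
```

With five samples per group and a share of 0.2, each group sends one sample to the test set and four to the training set.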


Reading the dataset (which applies the normalization methods above), splitting it into training and test sets, then converting the text into integer sequences and padding them to a common length. The labels (groups) are converted into categorical (one-hot) arrays.

In [0]:
data = readfile('NewsGroups1300.csv')
traindata, testdata = split(data,0.2)
traindata = np.array(traindata)
testdata = np.array(testdata)
train_docs = traindata[:,0]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_docs)
encoded_docs = tokenizer.texts_to_sequences(train_docs)

max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
ytrain = traindata[:,1]
ytrain = np_utils.to_categorical(ytrain)
test_docs = testdata[:,0]
encoded_docs = tokenizer.texts_to_sequences(test_docs)

Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
ytest = testdata[:,1]

ytest = np_utils.to_categorical(ytest)
vocab_size = len(tokenizer.word_index) + 1
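
What the tokenizer, `pad_sequences(..., padding='post')`, and `to_categorical` produce can be imitated in a few lines of NumPy. The sketch below is a simplified stand-in, not the Keras implementation: the helper names `fit_word_index`, `encode_and_pad`, and `one_hot` are hypothetical, the sentences are made up, and words are indexed by descending frequency starting at 1 so that 0 is free for padding.

```python
from collections import Counter
import numpy as np

def fit_word_index(texts):
    """Index words by descending frequency, starting at 1 (0 is reserved for padding)."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(texts, word_index, maxlen):
    """Map words to indices, skip unseen words, and pad with zeros on the right."""
    padded = np.zeros((len(texts), maxlen), dtype=int)
    for row, text in enumerate(texts):
        seq = [word_index[w] for w in text.split() if w in word_index]
        seq = seq[-maxlen:]  # keep the last maxlen indices if the text is too long
        padded[row, :len(seq)] = seq
    return padded

def one_hot(labels, num_classes):
    """One row per sample with a single 1 in the column of its label."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

toy_docs = ["bank profit rise", "bank share fall sharply"]
toy_index = fit_word_index(toy_docs)
toy_max_length = max(len(s.split()) for s in toy_docs)
toy_X = encode_and_pad(toy_docs, toy_index, toy_max_length)
toy_y = one_hot([1, 2], num_classes=3)
print(toy_X)
print(toy_y)
```

The shorter sentence is padded with a trailing zero, and the labels 1 and 2 need three columns because column 0 is counted as well.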


Load the Google News embeddings and build the embedding weight matrix for the words in the tokenizer's vocabulary.

In [0]:
# datapath() resolves files inside gensim's bundled test data, so the absolute path is passed directly
wv_from_bin = KeyedVectors.load_word2vec_format('E:/Data/GN/GoogleNews-vectors-negative300.bin', binary=True)
embedding_vectors = get_weight_matrix2(wv_from_bin, tokenizer.word_index.items())


Creating the neural network model, which is a sequential model with a frozen embedding layer followed by two dense layers with 1400 and 741 units respectively.

In [0]:
import tensorflow as tf

# frozen embedding layer initialized with the pretrained weight matrix
embedding_layer = Embedding(vocab_size, 300, weights=[embedding_vectors], input_length=max_length, trainable=False)
model = Sequential()
model.add(embedding_layer)
# Dense acts on the last axis, so it is applied to each token's 300-dimensional vector
model.add(Dense(1400, activation='relu'))
model.add(Flatten())
# softmax over the group labels
model.add(Dense(741, activation='softmax'))
model.compile(optimizer='adam',
              loss=tf.compat.v1.keras.losses.categorical_crossentropy,
              metrics=['accuracy'])
model.summary()
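
Because the first `Dense` layer comes before `Flatten`, it is applied to each token position separately, and the input size of the final layer grows with `max_length`. The arithmetic below traces the shapes and trainable parameter counts for a hypothetical `max_length` of 20; the real value depends on the dataset. The embedding layer is frozen, so it adds no trainable parameters.

```python
max_length = 20        # hypothetical; the notebook computes it from the training sentences
embedding_dim = 300
hidden_units = 1400
num_classes = 741

# Embedding output per sample: (max_length, embedding_dim)
# Dense(1400) acts on the last axis, giving (max_length, hidden_units)
dense1_params = embedding_dim * hidden_units + hidden_units
# Flatten turns that into max_length * hidden_units values per sample
flat_size = max_length * hidden_units
# Dense(741): one weight per flattened value and output unit, plus one bias per unit
dense2_params = flat_size * num_classes + num_classes

print(dense1_params)
print(flat_size)
print(dense2_params)
```

Under this assumption almost all trainable parameters sit in the final layer, which is worth keeping in mind when the training set is small.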


Training and evaluation

In [0]:
model.fit(Xtrain, ytrain, epochs=20, verbose=2, validation_data=(Xtest, ytest))
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))