<a href="https://colab.research.google.com/github/OmarMeriwani/Fake-Financial-News-Detection/blob/master/News_Sources_Analysis_Who_Said.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Sources Analysis - Who Said
This document contains the source code of the "Who Said" classifier. Please refer to the main README for the prerequisites needed to run it.
The code below imports the libraries the solution requires: NLTK modules for preprocessing tasks and WordNet; scikit-learn for the multilayer perceptron and for splitting the data into training and test samples; and the Stanford CoreNLP wrapper, in addition to the pandas, pickle and os libraries.

In [0]:
import pandas as pd
import pickle
from sklearn.neural_network import MLPClassifier
from stanfordcorenlp import StanfordCoreNLP
import os
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split
from nltk.corpus import wordnet

These lines are repeated in other parts of the project. Because the Stanford CoreNLP tool requires a Java server running on the host, the code below sets up what the tool needs: the port number, the host name (localhost) and the path to Java on the PC. This path will not necessarily be the same on every PC.
After creating the Stanford CoreNLP object, we initialise the NLTK tools: a tokenizer, a lemmatizer and a stemmer.

In [0]:
java_path = "C:/Program Files/Java/jdk1.8.0_161/bin/java.exe"
os.environ['JAVAHOME'] = java_path
host='http://localhost'
port=9000
scnlp = StanfordCoreNLP(host, port=port, lang='en', timeout=30000)

# Split on whitespace, colons and full stops (gaps=True: the pattern matches the separators)
tokenizer = RegexpTokenizer(r'\s+|\:|\.', gaps=True)
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
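
As a quick illustration of the tokenizer configured above, the sketch below runs the same pattern on a made-up headline. It needs no Java server. The names `demo_tokenizer` and `demo_tokens` and the sample title are assumptions chosen for illustration; the title does not come from the dataset.

```python
from nltk.tokenize import RegexpTokenizer

# Same pattern as above; gaps=True means the regex describes the separators,
# so whitespace, colons and full stops are dropped from the output.
demo_tokenizer = RegexpTokenizer(r'\s+|\:|\.', gaps=True)

# Hypothetical headline, not taken from the dataset.
demo_tokens = demo_tokenizer.tokenize('Analyst: markets will rally.')
print(demo_tokens)
```

Because the colon is treated as a separator here, the colon-related features later in the notebook rely on the CoreNLP POS tags rather than on this tokenizer.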

The training dataset (Using Resources Dataset) is read later, at the start of the feature extraction step.

The method below collects the synonyms of a verb from WordNet. It is only used to get the synonyms of the verb "say".

In [0]:
def getVerbs(verb):
    synonyms = []
    for syn in wordnet.synsets(verb):
        for l in syn.lemmas():
            synonyms.append(l.name())
    return set(synonyms)

The method below answers "Who Said": it finds the subject of a say verb. It uses the POS tags and the dependency parse, both produced by Stanford CoreNLP, to locate the word attached to the verb by an nsubj relation, and returns that word together with its POS tag and named entity label.

In [0]:
def WhoSaid(sent, verb):
    result = []
    # Get the dependency parse, POS tags and named entity labels for the sentence
    deps = scnlp.dependency_parse(sent)
    tags = scnlp.pos_tag(sent)
    ners = scnlp.ner(sent)
    # Store the position of every occurrence of the say verb in the sentence.
    # Positions are 1-based to match the dependency parse; the scan starts at the second token.
    verbindex = []
    for i in range(1, len(tags)):
        if tags[i][0] == verb:
            verbindex.append(i + 1)
    # For each stored verb, add its subject (word, POS tag, NER label) to the results
    for i in deps:
        if i[1] in verbindex and i[0] == 'nsubj':
            result.append([tags[i[2] - 1][0], tags[i[2] - 1][1], ners[i[2] - 1][1]])
    return result
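
To make the index arithmetic in WhoSaid easier to follow, here is a minimal sketch that applies the same subject lookup to hand-written parser output, so it runs without the CoreNLP server. The function name `subjects_of` and the sample tags, dependencies and NER labels are assumptions: they were written by hand in the shape the wrapper returns and are not real parser output.

```python
def subjects_of(verb, tags, deps, ners):
    """Return [word, POS tag, NER label] for each nsubj of `verb`."""
    # Dependency triples use 1-based token positions (0 is the ROOT node),
    # while the tag lists are 0-based, hence the +1 / -1 adjustments.
    verb_positions = []
    for i in range(1, len(tags)):  # mirrors WhoSaid: the first token is skipped
        if tags[i][0] == verb:
            verb_positions.append(i + 1)
    result = []
    for relation, governor, dependent in deps:
        if relation == 'nsubj' and governor in verb_positions:
            result.append([tags[dependent - 1][0],
                           tags[dependent - 1][1],
                           ners[dependent - 1][1]])
    return result


# Hand-written (hypothetical) annotations for "Trump says markets will rally"
sample_tags = [('Trump', 'NNP'), ('says', 'VBZ'), ('markets', 'NNS'),
               ('will', 'MD'), ('rally', 'VB')]
sample_ners = [('Trump', 'PERSON'), ('says', 'O'), ('markets', 'O'),
               ('will', 'O'), ('rally', 'O')]
sample_deps = [('ROOT', 0, 2), ('nsubj', 2, 1), ('nsubj', 5, 3),
               ('aux', 5, 4), ('ccomp', 2, 5)]

print(subjects_of('says', sample_tags, sample_deps, sample_ners))
```

With these sample annotations the subject of "says" is a proper noun and a named entity, which is the case the isNPPSaid and isNERSaid features are meant to capture.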


The code below performs feature extraction. First the dataset is read and two arrays are declared, one for the labels and one for the features. The features are of two types: those related to the say verb, which are extracted using the parsers, and those related to the colon.

In [0]:
df = pd.read_csv('Using Resources Dataset.csv',header=0)
seq = 0
x = []
y = []
for i in range(0,len(df)):
    # Get the news title and remove frequently occurring punctuation
    title= str(df.loc[i].values[0])
    title = title.replace('...','')
    '''Performing POS tagging on the title sentence'''
    lemTags = scnlp.pos_tag(title)
    '''Getting labels from the datasource'''
    isreferenced = df.loc[i].values[2]
    
    '''Preparing set of features for each news title'''
    colonAvailable = 1 if (title.find(':') != -1) else 0
    tags = scnlp.pos_tag(title)
    tagsarr = []
    sayverbs = getVerbs('say')
    isSayVerb = 0
    isNPPSaid = 0
    isNERSaid = 0
    isQuestion = 0
    nnpfound = 0
    if '?' in title:
        isQuestion = 1
    nnp_followed_by_colon = 0
    mid = int((len(tags) -1) / 2)
    
    '''Using who said method to get the subject, and then applying POS tagging and Named entity recognition on the sentence to check if the subjects are proper nouns or named entities'''
    for t in lemTags:
        verb = stemmer.stem(str(t[0]).lower())
        if 'V' in t[1]:
            #sayverbs is an array that contains all the synonyms of the verb say, these verbs are retrieved by the method above
            for j in sayverbs:
                if verb == str(j).lower():
                    #Getting subject of the verb
                    whosaid = WhoSaid(title, str(t[0]))
                    if whosaid != []:
                        for w in whosaid:
                            #Checking if the subject is a proper noun or a named entity
                            if w[1] == 'NNP' and isNPPSaid == 0:
                                isNPPSaid = 1
                            if w[2] != 'O' and isNERSaid == 0:
                                isNERSaid = 1
                        print('Whosaid', whosaid)
                    isSayVerb = 1
                
                    print('SayVerb',t[0])
                    break
    # At this point three features are available: isNPPSaid (the subject of the say verb is a
    # proper noun), isNERSaid (the subject is a named entity) and isSayVerb (a say verb exists).
    #
    # The features below relate to the colon: whether a proper noun is directly followed by a
    # colon in the first half of the title, or directly preceded by one in the second half.
    for k in range(0, mid):
        word = tags[k][1]
        if nnpfound == 1 and word == ':':
            nnp_followed_by_colon = 1
            break
        if word == 'NNP':
            nnpfound = 1
        else:
            nnpfound = 0
    nnp_preceeded_by_colon = 0
    # Reset the flag so the backward scan does not inherit state from the forward scan
    nnpfound = 0
    for k in range(0, mid):
        word = tags[len(tags) - 1 - k][1]
        if word == 'NNP':
            nnpfound = 1
        if nnpfound == 1 and word == ':':
            nnp_preceeded_by_colon = 1
            break
        if word != 'NNP':
            nnpfound = 0
    print(title)
    print( 'isreferenced', isreferenced,'colonAvailable', colonAvailable, 'nnp_followed_by_colon:',
           nnp_followed_by_colon,'nnp_preceeded_by_colon',nnp_preceeded_by_colon, 'isNPPSaid',isNPPSaid,
           'isNERSaid',isNERSaid, 'isQuestion',isQuestion)
    x.append([colonAvailable,nnp_followed_by_colon,nnp_preceeded_by_colon, isNPPSaid,isNERSaid, isQuestion])
    y.append(isreferenced)
    print('--------------------------------------------------------------------------')
max = 0
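
The two colon features can be hard to read from the loops alone, so here is a compact sketch of the same idea on hand-written POS tags. It assumes the proper-noun flag starts fresh for each scan. The function name `colon_features` and the sample tag lists are hypothetical; they were written by hand and are not CoreNLP output or dataset rows.

```python
def colon_features(tags):
    """Return (nnp_followed_by_colon, nnp_preceded_by_colon) for a tagged title."""
    pos = [tag for _, tag in tags]
    # Same midpoint as in the notebook: each scan covers `mid` tokens
    mid = int((len(pos) - 1) / 2)

    def nnp_then_colon(sequence):
        # True if an NNP tag is immediately followed by ':' in the sequence
        return any(a == 'NNP' and b == ':' for a, b in zip(sequence, sequence[1:]))

    forward = pos[:mid]          # scan from the start of the title
    backward = pos[::-1][:mid]   # scan from the end of the title
    return int(nnp_then_colon(forward)), int(nnp_then_colon(backward))


# Hypothetical tagged titles
source_first = [('Goldman', 'NNP'), (':', ':'), ('oil', 'NN'),
                ('prices', 'NNS'), ('will', 'MD'), ('rise', 'VB')]
source_last = [('Oil', 'NN'), ('prices', 'NNS'), ('will', 'MD'),
               ('rise', 'VB'), (':', ':'), ('Goldman', 'NNP')]

print(colon_features(source_first))
print(colon_features(source_last))
```

Scanning only up to the midpoint from each end means a source is expected either to open the title ("Source: claim") or to close it ("claim: Source").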


After the features are extracted, the two resulting arrays are split into training and test sets, and a multilayer perceptron classifier is trained and evaluated 100 times. The best-scoring model is stored so that it can be used in the objectivity test.

In [0]:
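X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42)
# Train 100 randomly initialised MLPs and keep the one with the best test accuracy
best_score = 0
for i in range(0, 100):
    mlp = MLPClassifier()
    mlp.fit(X_train, y_train)
    score = mlp.score(X_test, y_test)
    print(score)
    if score > best_score:
        best_score = score
        # The context manager closes the file once the model is written
        with open('WhoSaid.pkl', 'wb') as f:
            pickle.dump(mlp, f)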
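
The stored model is meant to be loaded again later, in the objectivity test. A minimal sketch of that save-and-load round trip follows, assuming scikit-learn is installed. The toy feature rows, labels, file name and variable names are invented for illustration; they are not the project's data or results.

```python
import os
import pickle
import tempfile

from sklearn.neural_network import MLPClassifier

# Made-up rows in the same six-column layout as x above:
# [colonAvailable, nnp_followed_by_colon, nnp_preceeded_by_colon,
#  isNPPSaid, isNERSaid, isQuestion]
toy_x = [[1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0],
         [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0]] * 5
toy_y = [1, 1, 0, 0] * 5

toy_mlp = MLPClassifier(max_iter=500, random_state=0)
toy_mlp.fit(toy_x, toy_y)

# Write the model to a pickle file in a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_mlp, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original model's predictions
same_predictions = list(restored.predict(toy_x)) == list(toy_mlp.predict(toy_x))
print(same_predictions)
```

A pickled scikit-learn model should be loaded with the same library version that wrote it, and only from a trusted source, since unpickling can execute arbitrary code.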
