<a href="https://colab.research.google.com/github/OmarMeriwani/CE807-Sentiment-analysis/blob/master/Feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The libraries used for feature extraction are a mix of NLTK tools and Stanford tools, the latter accessed through a local Java API.

In [0]:
import pandas as pd
import numpy as np
from nltk.tokenize import sent_tokenize
from stanfordcorenlp import StanfordCoreNLP
from nltk.corpus import stopwords
import nltk
from nltk.tokenize import RegexpTokenizer
import os
from senticnet.senticnet import SenticNet
from nltk.stem.porter import *


The Java parameters need to be set before starting, and the class used to access the Stanford services has to be instantiated.

In [0]:
java_path = "C:/Program Files/Java/jdk1.8.0_161/bin/java.exe"
os.environ['JAVAHOME'] = java_path
host='http://localhost'
port=9000
scnlp =StanfordCoreNLP(host, port=port,lang='en', timeout=30000)


The Porter stemmer is used throughout the project.

In [0]:
stemmer = PorterStemmer()

One of the methods used to assign polarity to sentences follows the approach discussed in the literature review: we take the maximum value and the average value over the words that have a polarity in the SenticNet tool. When a word follows a negation word, we use the same polarity value multiplied by -1.

In [0]:
def SentimentsPolarity(sentence):
    sn = SenticNet()
    values = []
    maxPolarity = 0.0
    prev = ''
    for stem in sentence:
        s = stemmer.stem(stem)
        try:
            polarity_value = sn.polarity_intense(s)
            pv = float(polarity_value)
            #In case of negation, we give the opposite polarity value to the word.
            if prev in ['not','no','never']:
                pv = pv * -1
            absPolarityVal = abs(pv)
            if absPolarityVal > abs(float(maxPolarity)):
                maxPolarity = pv
            values.append(pv)
        except Exception as e:
            prev = s
            continue
        prev = s
    if len(values) == 0:
        return 0.0,maxPolarity
    avg = round(sum(values) / len(values), 3)
    return avg,maxPolarity
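
The averaging, maximum and negation steps above can be traced by hand on a tiny example. The sketch below is not the SenticNet API: `TOY_POLARITY`, its words and scores, and the function name `toy_sentence_polarity` are made up for illustration, and stemming is left out.

```python
# Toy lexicon standing in for SenticNet; words and scores are invented.
TOY_POLARITY = {"good": 0.8, "bad": -0.7, "bore": -0.5}
NEGATIONS = {"not", "no", "never"}


def toy_sentence_polarity(tokens):
    """Return (average polarity, signed strongest polarity) for a token list."""
    values = []
    strongest = 0.0
    prev = ""
    for token in tokens:
        if token in TOY_POLARITY:
            pv = TOY_POLARITY[token]
            # A word directly after a negation gets the opposite sign.
            if prev in NEGATIONS:
                pv = -pv
            # Keep the value with the largest magnitude, sign included.
            if abs(pv) > abs(strongest):
                strongest = pv
            values.append(pv)
        prev = token
    if not values:
        return 0.0, strongest
    return round(sum(values) / len(values), 3), strongest


print(toy_sentence_polarity(["not", "good", "bore"]))
print(toy_sentence_polarity(["good", "bore"]))
```

With these invented scores, "not good" contributes -0.8 instead of 0.8, so the first call averages -0.8 and -0.5, while the second averages 0.8 and -0.5.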


For details of the CSV files used below in feature extraction, please refer to the file [ngrams_Polarity_Value.ipynb](https://github.com/OmarMeriwani/CE807-Sentiment-analysis/blob/master/ngrams_Polarity_Value.ipynb).

In [0]:
df_pbigrams = pd.read_csv('BigramsPolarity.csv',header=0,sep=',')
pbigrams = df_pbigrams.values.tolist()

df_punigrams = pd.read_csv('UnigramsPolarity.csv',header=0,sep=',')
punigrams = df_punigrams.values.tolist()
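
The two lists loaded above are later turned into two count features in the main loop: each listed bigram or unigram found in the cleaned sentence adds 1 when its label is 4 and subtracts 1 otherwise. The sketch below illustrates that counting on made-up rows. The row layout (`[index, word(s), label]`) is inferred from how the loop indexes the rows, and the function name `ngram_count_scores` is hypothetical.

```python
# Hypothetical rows in the layout the loop appears to expect.
# Bigrams: [index, first_word, second_word, label]; unigrams: [index, word, label].
toy_bigrams = [[0, "well", "made", 4], [1, "waste", "time", 0]]
toy_unigrams = [[0, "charming", 4], [1, "dull", 0]]


def ngram_count_scores(sentence_clean, bigrams, unigrams):
    """Return (bigram_score, unigram_score) for a cleaned sentence string."""
    bigram_score = 0
    for _, first, second, label in bigrams:
        # `in` on a string is a substring test, as in the notebook's loop.
        if f"{first} {second}" in sentence_clean:
            bigram_score += 1 if label == 4 else -1
    unigram_score = 0
    for _, word, label in unigrams:
        if str(word) in sentence_clean:
            unigram_score += 1 if label == 4 else -1
    return bigram_score, unigram_score


print(ngram_count_scores("charming well made film", toy_bigrams, toy_unigrams))
print(ngram_count_scores("dull waste time", toy_bigrams, toy_unigrams))
```

Because the check is a substring test on the joined sentence, a listed word also matches inside longer words (here "dull" would match "dullness"), which may or may not be intended.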

To understand how the following word lists were extracted, please refer to the file [Frequent_Subjective_Words.ipynb](https://github.com/OmarMeriwani/CE807-Sentiment-analysis/blob/master/Frequent_Subjective_Words.ipynb).

In [0]:
df_positiveWords = pd.read_csv('PositiveWords.csv',header=0,sep=',')
positiveWords = df_positiveWords.values.tolist()

df_negativeWords = pd.read_csv('NegativeWords.csv',header=0,sep=',')
negativeWords = df_negativeWords.values.tolist()


Three sizes of POS-sequence n-grams are used: 3-grams, 4-grams and 5-grams. To understand how these features were extracted, please refer to the file of [POS nGrams](https://github.com/OmarMeriwani/CE807-Sentiment-analysis/blob/master/POS_Ngrams.ipynb).


In [0]:
df_posseq = pd.read_csv('POSTrigrams.csv',header=None,sep=',')
tgrams = df_posseq.values.tolist()

df_posseq = pd.read_csv('POSQgrams.csv',header=None,sep=',')
qgrams = df_posseq.values.tolist()

df_posseq = pd.read_csv('POSPgrams.csv',header=None,sep=',')
pgrams = df_posseq.values.tolist()

Negation words are removed from the stop-word list so that they stay in the text and their effect on subjective terms is not lost.

In [0]:
stop_words = stopwords.words('english')
stop_words = [s for s in stop_words if s not in ['no', 'not', 'never', 'n’t', 'nt']]
df = pd.read_csv('train.csv',header=0,sep='\t')
prev = ''
tknzr = RegexpTokenizer(r'\w+')
ListOfCleanTokens = []
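
A small sketch of the effect of this step, using a made-up stop list (`toy_stop_words`) in place of NLTK's English list: once the negation words are taken out of the stop list, they survive the stop-word filter and can later flip the polarity of the word that follows them.

```python
# Made-up stop list for illustration; the notebook uses NLTK's English list.
toy_stop_words = ["the", "is", "a", "not", "no"]
negations = ["no", "not", "never"]

# Drop negation words from the stop list so they are NOT filtered out.
kept_stop_words = [s for s in toy_stop_words if s not in negations]

tokens = ["the", "film", "is", "not", "good"]
with_negation = [t.lower() for t in tokens if t not in kept_stop_words]
without_negation = [t.lower() for t in tokens if t not in toy_stop_words]

print(with_negation)     # negation survives the filter
print(without_negation)  # negation is lost with the unmodified list
```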


This method has been explained in [POS ngrams](https://github.com/OmarMeriwani/CE807-Sentiment-analysis/blob/master/POS_Ngrams.ipynb); it is a mistake to re-write it here :)

In [0]:
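def getNgram(tags,gram):
    # Build n-grams of `gram` consecutive POS tags, each joined into one string.
    counter = 0
    ngrams = []
    while counter < len(tags):
        # Note: this bound stops one window before the end of `tags`, so the
        # final n-gram is not produced. It is left unchanged here so that the
        # output stays comparable with the POS n-gram CSV files, which were
        # presumably built with the same method.
        if counter + gram <= len(tags) - 1:
            ngrams.append(''.join(tags[counter:counter + gram]))
            counter += 1
        else:
            break
    return ngrams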
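
To make the windowing concrete, the sketch below compares a compact function that mirrors the bound used in `getNgram` with a full sliding window. Both helper names (`ngrams_as_in_notebook`, `all_ngrams`) and the tag sequence are hypothetical and only for illustration. As written, `getNgram` appears to stop one window early, which the comparison makes visible.

```python
def ngrams_as_in_notebook(tags, gram):
    """Mirror getNgram: window starts run from 0 to len(tags) - gram - 1."""
    return ["".join(tags[i:i + gram]) for i in range(max(len(tags) - gram, 0))]


def all_ngrams(tags, gram):
    """Full sliding window: window starts run from 0 to len(tags) - gram."""
    return ["".join(tags[i:i + gram]) for i in range(max(len(tags) - gram + 1, 0))]


tags = ["DT", "JJ", "NN", "VBZ", "JJ"]
print(ngrams_as_in_notebook(tags, 3))
print(all_ngrams(tags, 3))
```

For this five-tag sequence the notebook-style bound yields two trigrams and the full window yields three. Whichever variant is used, it should match the one used to build the POS n-gram CSV files.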


The dataset produced by the loop below contains the following features.

In [0]:
mode = 's'
results = pd.DataFrame(columns=['phraseID','sentenceID','BigramsPolarity','UnigramsPolarity','SenticnetAVG','senticnetMAX','WordsInScore','POSSequenceScore','y'])
j = 0
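
The `mode` flag controls which rows of the training file the loop below processes: with any value other than `'all'`, only the first row of each sentence ID is kept and the remaining phrases of that sentence are skipped. The sketch below reproduces that selection on made-up rows. The column order (phrase ID, sentence ID, text, label) is inferred from how the loop indexes each row, the assumption that the first row of a sentence ID holds the full sentence is not verified here, and `select_rows` is a hypothetical helper.

```python
# Made-up rows in (phrase_id, sentence_id, text, label) order.
rows = [
    (1, 1, "A charming and well made film", 3),
    (2, 1, "A charming", 3),
    (3, 1, "charming", 3),
    (4, 2, "A waste of time", 1),
    (5, 2, "waste", 1),
]


def select_rows(rows, mode):
    """Keep every row for mode 'all', else only the first row per sentence ID."""
    selected = []
    prev = ""
    for row in rows:
        sentence_id = str(row[1])
        if mode != "all":
            if sentence_id == prev:
                continue  # same sentence as the previous row: skip it
            prev = sentence_id
        selected.append(row)
    return selected


print([r[0] for r in select_rows(rows, "s")])
print([r[0] for r in select_rows(rows, "all")])
```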


In [0]:

for i in range(0,len(df)):
    # As in the other notebooks, this step selects either full sentences only
    # or all the phrases in the training dataset.
    if mode != 'all':
        if prev != str(df.loc[i][1]):
            sentence = df.loc[i][2]
            prev = str(df.loc[i][1])
        else:
            continue
    else:
        sentence = df.loc[i][2]

    '''
    Tokenize, get values from training dataset, get NER, and POS tags
    '''
    sentences = sent_tokenize(sentence)
    reviewPolarity = int(df.loc[i][3])
    POSSEQPolarity = 0
    phraseID = df.loc[i][0]
    sentenceID = df.loc[i][1]

    tokens = []
    for sent in sentences:
        t = tknzr.tokenize(sent)
        for tk in t:
            tokens.append(tk)
    NER = scnlp.ner(sentence)
    POStagged = scnlp.pos_tag(sentence)
    POSTags = [p for word, p in POStagged]

    '''
    Get the POS ngrams that occur in each sentence and ensure that there is no punctuation inside the sets.
    '''
    POSTriGrams = getNgram(POSTags,3)
    POSQuadriGrams = getNgram(POSTags,4)
    POSPentaGrams = getNgram(POSTags,5)
    POSTriGrams = [p.replace(',','').replace(':','') for p in POSTriGrams]
    POSPentaGrams = [p.replace(',','').replace(':','') for p in POSPentaGrams]
    POSQuadriGrams = [p.replace(',','').replace(':','') for p in POSQuadriGrams]

    
    '''
    First feature: the polarity of POS sequences. Each stored POS ngram found in the sentence adds its occurrence count multiplied by its polarity multiplied by the ngram size (3, 4 or 5).
    '''
    for possequence, count, pospolarity in tgrams:
        if possequence in POSTriGrams:
            POSSEQPolarity += count * pospolarity * 3

    for possequence, count, pospolarity in qgrams:
        if possequence in POSQuadriGrams:
            POSSEQPolarity += count * pospolarity * 4

    for possequence, count, pospolarity in pgrams:
        if possequence in POSPentaGrams:
            POSSEQPolarity += count * pospolarity * 5
    
    '''
    Second and third features: the occurrence of negative or positive unigrams or bigrams
    '''
    sentenceClean = ' '.join([str(t).lower() for t in tokens if t not in stop_words])
    polarity2 = 0
    polarity3 = 0
    for pb in pbigrams:
        pbigram = pb[1] + ' ' + pb[2]
        if pbigram in sentenceClean:
            if pb[3] == 4:
                polarity2 += 1
            else:
                polarity2 -= 1
    polarity1 = 0
    for up in punigrams:
        punigram = str(up[1])
        if punigram in sentenceClean:
            if up[2] == 4:
                polarity1 += 1
            else:
                polarity1 -= 1
    
    '''
    Fourth feature: whether positive or negative words exist, using words that were extracted in a different way from the previous two features
    '''
    for pw in positiveWords:
        if pw[0] in sentenceClean and pw[1] >=5 :
            polarity3 += 1
    for nw in negativeWords:
        if nw[0] in sentenceClean and nw[1] >=5 :
            polarity3 -= 1
    
    cleanTokens = [str(t).lower() for t in tokens if t not in stop_words]
    '''
    Get max and avg polarity from SenticNet tool
    '''
    avgAndmaxPol = SentimentsPolarity(cleanTokens)
    '''
    Store the results.
    '''
    r = [ phraseID, sentenceID, polarity2, polarity1,avgAndmaxPol[0],avgAndmaxPol[1], polarity3, POSSEQPolarity, reviewPolarity]
    results.loc[j] = r
    print(r)
    j += 1
results.to_csv('TrainingDataset.csv')
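
As a worked example of the first feature in the loop above, the sketch below computes the POS sequence score for one sentence: every stored POS n-gram that appears in the sentence adds `count * polarity * n`. The rows, their counts and polarities, and the helper name `pos_sequence_score` are made up for illustration. Only the `(sequence, count, polarity)` row layout is taken from how the loop unpacks the CSV rows.

```python
# Made-up reference rows in (pos_sequence, count, polarity) order.
toy_trigrams = [["DTJJNN", 10, 0.5], ["JJNNVBZ", 4, -0.25], ["NNINNN", 7, 1.0]]
toy_qgrams = [["DTJJNNVBZ", 2, 1.0]]


def pos_sequence_score(sentence_ngrams, reference_rows, n):
    """Sum count * polarity * n over reference n-grams present in the sentence."""
    score = 0
    for sequence, count, polarity in reference_rows:
        if sequence in sentence_ngrams:
            score += count * polarity * n
    return score


# POS n-grams of one hypothetical sentence.
sentence_trigrams = ["DTJJNN", "JJNNVBZ"]
sentence_qgrams = ["DTJJNNVBZ"]

tri_part = pos_sequence_score(sentence_trigrams, toy_trigrams, 3)
quad_part = pos_sequence_score(sentence_qgrams, toy_qgrams, 4)
print(tri_part, quad_part, tri_part + quad_part)
```

Here the trigram part is 10 * 0.5 * 3 + 4 * (-0.25) * 3 = 12 and the 4-gram part is 2 * 1.0 * 4 = 8, so longer matching sequences weigh more through the factor `n`.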

