
This notebook extracts part-of-speech (POS) n-grams that tend to occur in negative or positive reviews.

In [0]:
import pandas as pd
from nltk.tokenize import sent_tokenize
from stanfordcorenlp import StanfordCoreNLP
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import os
from nltk.stem.porter import PorterStemmer

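# Path to the local Java executable (adjust for your machine)
java_path = "C:/Program Files/Java/jdk1.8.0_161/bin/java.exe"
os.environ['JAVAHOME'] = java_path
# Connect to a Stanford CoreNLP server that is already running at this host and port
host = 'http://localhost'
port = 9000
scnlp = StanfordCoreNLP(host, port=port, lang='en', timeout=30000)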
stemmer = PorterStemmer()

stop_words = stopwords.words('english')

df = pd.read_csv('train.csv',header=0,sep='\t')
prev = ''
tknzr = RegexpTokenizer(r'\w+')
ListOfCleanTokens = []


This function takes the POS tags of a sentence and slides a window of the requested size (3 for trigrams, 4 for four-grams, and so on) across them. It returns a list in which each n-gram is the tags of one window joined together without a separator.

In [0]:
def getNgram(tags, gram):
    # Slide a window of `gram` tags across the list and join each window
    # into a single string with no separator.
    ngrams = []
    counter = 0
    # Stop once fewer than `gram` tags remain; `<=` keeps the final window.
    while counter + gram <= len(tags):
        ngrams.append(''.join(tags[counter:counter + gram]))
        counter += 1
    return ngrams
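
The sliding-window idea described above can be tried on a small, hand-written tag list. This is a minimal sketch rather than the notebook's own function: `pos_ngrams` is a hypothetical name, the tags are made up for illustration, and this version keeps every full window, including the final one.

```python
def pos_ngrams(tags, n):
    """Return every contiguous window of n tags, joined without a separator."""
    # range() is empty when the list has fewer than n tags, so the result is [].
    return ["".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Illustrative tags only; a real tagger may label a sentence differently.
tags = ["DT", "NN", "VBZ", "JJ"]
trigrams = pos_ngrams(tags, 3)
print(trigrams)  # ['DTNNVBZ', 'NNVBZJJ']
```

Because the tags are joined without a separator, different tag sequences can collapse into the same string (for example `NN` + `PRP` and `NNP` + `RP` both give `NNPRP`). Joining with a separator such as `_` would be one way to avoid that, if it matters for the analysis.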


Prepare three lists to store the results.

In [0]:
postrigrams = []
posquadgrams = []
pospetagram = []


In the steps below, each row is read from the training set, its POS tags are extracted, and n-grams of the POS sequence are built. The results for each row are then stored together with the review polarity value assigned to it in the training dataset.

In [0]:
for i in range(0, len(df)):
    # Get the text and polarity label from each row of the training dataset
    # (third and fourth columns).
    sentence = df.iloc[i, 2]
    reviewPolarity = int(df.iloc[i, 3])
    # Extract the POS tags, keeping only the tag from each (word, POS tag) tuple.
    POStagged = scnlp.pos_tag(sentence)
    POSTags = [p for word, p in POStagged]
    # Build POS tag n-grams (3, 4, 5) with getNgram, then strip the comma and
    # colon punctuation tags from the joined strings.
    POSTriGrams = getNgram(POSTags, 3)
    POSQuadriGrams = getNgram(POSTags, 4)
    POSPentaGrams = getNgram(POSTags, 5)
    POSTriGrams = [str(t).replace(',', '').replace(':', '') for t in POSTriGrams]
    POSQuadriGrams = [str(t).replace(',', '').replace(':', '') for t in POSQuadriGrams]
    POSPentaGrams = [str(t).replace(',', '').replace(':', '') for t in POSPentaGrams]
    # Append non-empty results to the lists, together with the review polarity.
    if len(POSTriGrams) != 0:
        postrigrams.append([reviewPolarity, POSTriGrams])
    if len(POSQuadriGrams) != 0:
        posquadgrams.append([reviewPolarity, POSQuadriGrams])
    if len(POSPentaGrams) != 0:
        pospetagram.append([reviewPolarity, POSPentaGrams])
    # Stop early so that only the first rows of the training set are processed.
    if i > 1000:
        break
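
The tag extraction and punctuation clean-up can be illustrated without a running CoreNLP server. In this sketch the `tagged` list is a hand-written assumption that imitates the (word, POS tag) tuples described above; it is not real tagger output.

```python
# Hand-written stand-in for a tagger result; real output may differ.
tagged = [
    ("Great", "JJ"), (",", ","), ("funny", "JJ"),
    ("movie", "NN"), (":", ":"), ("recommended", "VBN"),
]

# Keep only the tag from each (word, tag) tuple.
tags = [tag for _word, tag in tagged]

# Join each window of three tags into one string.
trigrams = ["".join(tags[i:i + 3]) for i in range(len(tags) - 2)]

# Strip the comma and colon tags from the joined strings.
cleaned = [t.replace(",", "").replace(":", "") for t in trigrams]

print(trigrams)  # ['JJ,JJ', ',JJNN', 'JJNN:', 'NN:VBN']
print(cleaned)   # ['JJJJ', 'JJNN', 'JJNN', 'NNVBN']
```

Stripping the comma tag keeps stray commas out of the comma-separated output files. Note that a window that contained a punctuation tag ends up holding fewer than three real tags after cleaning.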


Each group of n-grams is processed separately. A dictionary counts the occurrences of each POS n-gram, and the POS sequences are split into positive and negative according to their polarity values (above or below 2; reviews with a polarity of exactly 2 are skipped). N-grams that appear in both groups are then removed, and each group of POS n-grams is written to its own CSV file.

In [0]:

#Get POS from positive reviews
groups = 0
while groups < 3:
    PositivePOS = {}
    NegativePOS = {}

    posseq = []
    if groups == 0:
        posseq = postrigrams
    if groups == 1:
        posseq = posquadgrams
    if groups == 2:
        posseq = pospetagram
    # The enclosing while loop lets the same code handle all three groups of POS n-grams.

    # For each [polarity, ngrams] entry in the group, and for each POS n-gram in it,
    # count positive and negative occurrences separately.
    for tt in posseq:
        for pt in tt[1]:
            # Get POS from positive reviews (polarity above 2)
            if tt[0] > 2:
                PositivePOS[pt] = PositivePOS.get(pt, 0) + 1
            # Get POS from negative reviews (polarity below 2)
            if tt[0] < 2:
                NegativePOS[pt] = NegativePOS.get(pt, 0) + 1

    #Remove the intersection between the negative and positive POS
    intersection = [t for t,i in PositivePOS.items() if t in NegativePOS]
    for i in intersection:
        if i in PositivePOS:
            PositivePOS.pop(i)
    for i in intersection:
        if i in NegativePOS:
            NegativePOS.pop(i)
            
    '''
    STORE THE RESULTS with the number of occurrences
    '''
    filename = ''
    if groups == 0:
        filename = 'POSTrigrams.csv'
    if groups == 1:
        filename = 'POSQgrams.csv'
    if groups == 2:
        filename = 'POSPgrams.csv'

    with open(filename, 'w') as f:
        for key in PositivePOS.keys():
            f.write("%s,%s,%s\n"%(key,PositivePOS[key],1))
        for key in NegativePOS.keys():
            f.write("%s,%s,%s\n"%(key,NegativePOS[key],-1))

    groups += 1
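
The counting and filtering logic can be checked on a few toy entries. This is a minimal sketch under stated assumptions: the `samples` list is invented for illustration and only mirrors the `[polarity, [ngrams]]` shape used above, and `collections.Counter` is swapped in for the plain dictionaries.

```python
from collections import Counter

# Invented entries in the shape [polarity, [ngrams]]; not real data.
samples = [
    [4, ["DTJJNN", "JJNNVBZ"]],
    [3, ["DTJJNN"]],
    [0, ["DTNNVBZ", "JJNNVBZ"]],
    [2, ["PRPVBPRB"]],  # polarity 2 is neither above nor below 2, so it is skipped
]

positive, negative = Counter(), Counter()
for polarity, ngrams in samples:
    if polarity > 2:
        positive.update(ngrams)
    elif polarity < 2:
        negative.update(ngrams)

# Drop every n-gram that occurs in both positive and negative reviews.
shared = positive.keys() & negative.keys()
for key in shared:
    del positive[key]
    del negative[key]

# Rows in the same (ngram, count, label) layout that is written to the CSV files.
rows = [(k, c, 1) for k, c in positive.items()]
rows += [(k, c, -1) for k, c in negative.items()]
print(rows)  # [('DTJJNN', 2, 1), ('DTNNVBZ', 1, -1)]
```

Here `JJNNVBZ` appears in both a positive and a negative review, so it is removed from both sides, and only n-grams seen with a single polarity remain.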