# Turney Algorithm

In this notebook we will implement the Turney Algorithm (proposed by Peter D. Turney: http://www.aclweb.org/anthology/P02-1053.pdf) for automatically create a sentiment lexicon from our data. To that end, we will use a set of IMDB movie reviews classified into positive and negative.



In [1]:
# Needed imports

import os
import math
import nltk
import re
import numpy as np

## Data Loading

We start by loading the positive and negative reviews from the data folder (there are 1000 reviews per class).


In [2]:
data_path = "./imdb1/"

path_pos = data_path + "pos"
path_neg = data_path + "neg"

pos_filenames = os.listdir(path_pos)
neg_filenames = os.listdir(path_neg)

contents=[]
# Read the text in the positive files
for f in pos_filenames:
    with open (path_pos+ "\\" +f) as txt:
        for line in txt:
            contents.append(line)
        
# Read the text in the negativee files
for f in neg_filenames:
    with open (path_neg+ "\\" +f) as txt:
        for line in txt:
            contents.append(line)
                  
# Join the whole contents and split it by word
res='\n'.join(contents).split()

In [3]:
res[:10]

['films',
 'adapted',
 'from',
 'comic',
 'books',
 'have',
 'had',
 'plenty',
 'of',
 'success']

As seen in class, the first step was based on extracting two-word phrases with adjectives. To allow this, we have to first annotate the words with the POS tagging.

We use the `pos_tag` function implemented into NLTK

In [4]:
from nltk import pos_tag
pos_tags=pos_tag(res)

In [5]:
pos_tags[:10]

[('films', 'NNS'),
 ('adapted', 'VBD'),
 ('from', 'IN'),
 ('comic', 'JJ'),
 ('books', 'NNS'),
 ('have', 'VBP'),
 ('had', 'VBD'),
 ('plenty', 'NN'),
 ('of', 'IN'),
 ('success', 'NN')]

Now we define a function to find the patterns defined in the paper.

    |First Word    |Second Word    |Third Word (not extracted)|
    ***********************************************************
    |JJ	        |NN or NNS	  |anything                  |
    |RB, RBR, RBS  |JJ	         |Not NN nor NNS            |
    |JJ	        |JJ	         |Not NN or NNS             |
    |NN or NNS	 |JJ	         |Nor NN nor NNS            |
    |RB,RBR or RBS |VB,VBD,VBN,VBG |anything                  |
    ***********************************************************


In [6]:
def find_pattern(postag):
    tag_pattern = []  
    for k in range(len(postag)-2):
        if( postag[k][1]=="JJ" and postag[k+1][1]=="JJ" and postag[k+2][1]!="NN" and postag[k+2][1]!="NNS"):
            tag_pattern.append("".join(postag[k][0])+" "+"".join(postag[k+1][0]))
        if( (postag[k][1]=="NN" or postag[k][1]=="NNS") and postag[k+1][1]=="JJ" and postag[k+2][1]!="NN" and postag[k+2][1]!="NNS"):
            tag_pattern.append("".join(postag[k][0])+" "+"".join(postag[k+1][0]))
        if( (postag[k][1]=="RB" or postag[k][1]=="RBR" or postag[k][1]=="RBS") and postag[k+1][1]=="JJ" and postag[k+2][1]!="NN" and postag[k+2][1]!="NNS"):
            tag_pattern.append("".join(postag[k][0])+" "+"".join(postag[k+1][0]))
        if( (postag[k][1]=="RB" or postag[k][1]=="RBR" or postag[k][1]=="RBS") and (postag[k+1][1]=="VB" or postag[k+1][1]=="VBN" or postag[k+1][1]=="VBD" or postag[k+1][1]=="VBG")):
            tag_pattern.append("".join(postag[k][0])+" "+"".join(postag[k+1][0])) 
        if( postag[k][1]=="JJ" and postag[k+1][1]=="NN" ) or ( postag[k][1]=="JJ" and postag[k+1][1]=="NNS" ):
            tag_pattern.append("".join(postag[k][0])+" "+"".join(postag[k+1][0]))
    return tag_pattern

Use the function to store all phrases that satisfy the conditions


In [7]:
tag_pattern = find_pattern(pos_tags)
tag_pattern = list(set(tag_pattern))
tag_pattern[:10]

['not developing',
 '" watch',
 'andy dick',
 'definite difference',
 'bio-mechanical life-forms',
 'sharp term',
 'gigantic neighbor',
 'altar due',
 'stupid screenplay',
 'uninspiring man']

Now we create three data structures to facilitate the algorithm execution

- **mat_phrase_great:** numpy matrix of hits between each phrase and the word great
- **mat_phrase_poor:** hits between phrase and poor in each file
- **mat_phrase_count:** matrix storing 1 if a phrase is present in a file. used for adding corresponding SOs later.
- **hits_great:** stores total hits of great in training set for each fold, correspondingly **hits_poor** stores poor hits


In [8]:
mat_phrase_great= np.zeros((len(tag_pattern), len(pos_filenames) + len(neg_filenames)), dtype="int8")
mat_phrase_poor= np.zeros((len(tag_pattern),  len(pos_filenames) + len(neg_filenames)), dtype="int8")
mat_phrase_count=np.zeros((len(tag_pattern), len(pos_filenames) + len(neg_filenames)), dtype="int8")
hits_great=[]
hits_poor=[]

The following counts the ocurrence of the words `poor` and `great` in the positive files.

Be patient! it is going to take a while

In [9]:
import string

for cnt, fi in enumerate(pos_filenames):
    with open (path_pos + "\\" +fi) as cf:
        txt=cf.read()
        txt = "".join(l for l in txt if l not in string.punctuation)
        file_list=txt.split()
        hits_great.append(file_list.count("great"))
        hits_poor.append(file_list.count("poor"))

        for j in range(len(tag_pattern)):
            all_hit_phrase_index=[]
            hits_phrase_great=0
            hits_phrase_poor=0
            if (tag_pattern[j] in txt):
                mat_phrase_count[j][cnt]=1
                try:
                    for w in (file_list):
                        if (w==tag_pattern[j].split()[0]):
                            ind=file_list.index(w)
                            if(file_list[ind+1]==tag_pattern[j].split()[1]):
                                #print(ind)
                                all_hit_phrase_index.append(ind)
                        for ids in (all_hit_phrase_index):
                            #print(all_hit_index)
                            for words in file_list[ids-10 :ids+11]:
                                if words=="great":
                                    hits_phrase_great+=1
                                if words=="poor":
                                    hits_phrase_poor+=1
                        mat_phrase_great[j][cnt]=hits_phrase_great
                        mat_phrase_poor[j][cnt]=hits_phrase_poor
                except:
                        pass

The same for the negative files

In [10]:
for cnt, fi in enumerate(neg_filenames):    
    with open (path_neg + "\\" +fi) as cf:
            txt=cf.read()
            file_list=txt.split()
            hits_great.append(file_list.count("great"))
            hits_poor.append(file_list.count("poor"))
            for j in range(len(tag_pattern)):
                all_hit_phrase_index=[]
                hits_phrase_great=0
                hits_phrase_poor=0
                if (tag_pattern[j] in txt):
                    mat_phrase_count[j][cnt]=1
                    try:
                        for w in (file_list):
                            if (w==tag_pattern[j].split()[0]):
                                ind=file_list.index(w)
                                if(file_list[ind+1]==tag_pattern[j].split()[1]):
                                    #print(ind)
                                    all_hit_phrase_index.append(ind)
                        for ids in (all_hit_phrase_index):
                            #print(all_hit_phrase_index)
                            for words in file_list[ids-10 :ids+11]:
                                if words=="great":
                                    hits_phrase_great+=1
                                if words=="poor":
                                    hits_phrase_poor+=1                     
                        mat_phrase_great[j][cnt]=hits_phrase_great
                        mat_phrase_poor[j][cnt]=hits_phrase_poor
                    except:
                            pass


Based on this count matrices, we can now calculate the semantic orientation of test data.

The following sentence takes the test_data to annoate and the counts to predict the orientation of each review in the test data (by using the Pointwise mutual information). It also evaluates the prediction compared to the actual label.

In [24]:
acc_all=[]
def semantic_orientation(p_hit_great, p_hit_poor, hits_gr, hits_po, test_data):
    num=(p_hit_great*float(hits_po))+0.01
    den=(p_hit_poor*float(hits_gr))+0.01
    so=np.log2(np.divide(num, den)) #PMI
    so=np.nan_to_num(so) # Change nan to 0 to avoid errors
    acc=0.0
    fold_no=0
    for f in test_data:
        polarity=0.0
        
        if f<1000:
            correct_label="positive"
            
        if f>=1000:
            correct_label="negative"
        
        for p in range(len(so)):
            if mat_phrase_count[p][f]==1:
                polarity+=so[p]
                
        if (polarity>=0.1): # Confidence threshold
            pred="positive"
            
        else:
            pred="negative"
            
        if(pred==correct_label):
             acc+=1
             
    acc=acc/float(200)
    acc_all.append(acc)
    
    print("[INFO] Fold accuracy: %r" %(acc))               
    fold_no+=1

Finally, we apply the latter function to the data.

We use a 10-fold Cross-Validation to predict and evaluate.

In [25]:
"""10 Fold Cross-Validation"""
from sklearn.model_selection import KFold

kf=KFold(n_splits=10, shuffle=True)

# Join the negative and positive filenames
file_names=os.listdir(path_pos)+os.listdir(path_neg)

for train, test_data in kf.split(file_names):  ##for each fold in the 10 fold CV
    tr_great= mat_phrase_great[: , train[0]:train[-1]]
    phrase_hit_great=np.sum(tr_great, axis=1)
    tr_poor= mat_phrase_poor[: , train[0]:train[-1]]
    phrase_hit_poor=np.sum(tr_poor, axis=1)
   
    hits_gr=sum(hits_great[train[0]:train[-1]])
    hits_po=sum(hits_poor[train[0]:train[-1]])
   
    semantic_orientation(phrase_hit_great, phrase_hit_poor, hits_gr, hits_po, test_data)   
    
acc_avg=sum(acc_all)/float(10)
print("[INFO] Accuracy: %r " %(acc_avg))

  """


[INFO] Fold accuracy: 0.975
[INFO] Fold accuracy: 0.98
[INFO] Fold accuracy: 0.995
[INFO] Fold accuracy: 0.97
[INFO] Fold accuracy: 0.98
[INFO] Fold accuracy: 0.99
[INFO] Fold accuracy: 0.97
[INFO] Fold accuracy: 0.98
[INFO] Fold accuracy: 0.97
[INFO] Fold accuracy: 0.985
[INFO] Accuracy: 0.9795 


**Almost perfect classification!**

We have proven that even from a small dataset (2000 reviews) we can create a very good sentiment lexicon.

