# Project Documentation

In [1]:
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tree import *
from nltk.corpus import wordnet as wn
from difflib import SequenceMatcher

# Motivation

The problem our group chose is mining and summarizing customer reviews. With the rapid development of the Internet, more and more products are sold online, and people are increasingly willing to shop online. Product reviews have therefore become much more important for both customers and manufacturers. For potential customers, reviews help them decide whether or not to buy a product. For manufacturers, reviews show how to improve product quality. However, since some reviews contain little useful information, we aim to extract the useful sentences from each review.

# Approach

The algorithms and built-in functions we used are NLTK tagging and NLTK trees for finding
noun phrases.

NLTK tokenization and tagging of each word in each sentence:

In [2]:
def token(raw):
    sentlist=[]
    sents=nltk.sent_tokenize(raw)
    for sent in sents:
        tokenw=nltk.word_tokenize(sent)
        tagw=nltk.pos_tag(tokenw)
        sentlist.append(tagw)
    return sentlist

Detecting opinion sentences (at least one noun and one adjective) and using an NLTK tree to find noun phrases in the tagged sentences:

In [3]:
def opinionS(sent):
    ow=False
    adj=False
    for word in sent:
        if word[1]=='NN' or word[1]=='NNS' or word[1]=='NNP' or word[1]=='NNPS':
            ow=True
        if word[1]=='JJ' or word[1]=='JJR'or word[1]=='JJS':
            adj=True
    return ow and adj

def NounPhrase(sent):
    feature=[]
    grammer = r"""
        NP:
            {<NN|NNS><NN|NNS><NN|NNS>}
            {<NN|NNS><NN|NNS>}
            {<NN|NNS><IN><NN|NNS><NN|NNS>}
            {<NN|NNS><IN><NN|NNS>}
            {<NN|NNS>}
    """
    cp=nltk.RegexpParser(grammer)
    result = cp.parse(sent)
    for subtree in result.subtrees(filter = lambda t:t.label() == 'NP'):
        feature.append (subtree.leaves())
    return feature
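
The opinion-sentence check in `opinionS` above can be illustrated without running a tagger. The following is a minimal sketch on hand-tagged sentences; the tags are written by hand for illustration, and the names `NOUN_TAGS`, `ADJ_TAGS`, and `is_opinion_sentence` are hypothetical and not part of our pipeline.

```python
# Hypothetical helper mirroring opinionS: a sentence counts as an opinion
# sentence when it has at least one noun tag and at least one adjective tag.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}

def is_opinion_sentence(tagged_sent):
    # Collect the set of POS tags that occur in the sentence.
    tags = {tag for _, tag in tagged_sent}
    return bool(tags & NOUN_TAGS) and bool(tags & ADJ_TAGS)

# Hand-tagged examples (assumed tags, not real tagger output).
with_opinion = [("the", "DT"), ("battery", "NN"), ("is", "VBZ"), ("great", "JJ")]
without_opinion = [("i", "PRP"), ("bought", "VBD"), ("it", "PRP"), ("yesterday", "NN")]

print(is_opinion_sentence(with_opinion))     # True
print(is_opinion_sentence(without_opinion))  # False
```

Only sentences that pass this check are later passed to the noun-phrase chunker, so sentences without any adjective never contribute candidate features.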

# Data

We conduct our experiments on five randomly picked electronics products: 2 digital cameras, 1 DVD player, 1 MP3 player, and 1 cellular phone. Their customer reviews come from Amazon.com and C|net.com and are provided by the paper [1].


# Code

*All of the code below was written by us.

Setting up stop words so that words such as 'your', 'haven', and 'themselves' are excluded from noun phrases:

In [4]:
stop_words=set(stopwords.words('english'))

Reading the input file:

In [5]:
def readfile(addr):
    # the context manager closes the file even if reading fails
    with open(addr,'r') as f:
        raw=f.read()
    return raw

NLTK tokenization and tagging of each word in each sentence:

In [6]:
def token(raw):
    sentlist=[]
    sents=nltk.sent_tokenize(raw)
    for sent in sents:
        tokenw=nltk.word_tokenize(sent)
        tagw=nltk.pos_tag(tokenw)
        sentlist.append(tagw)
    return sentlist

Detecting opinion sentences (at least one noun and one adjective) and using an NLTK tree to find noun phrases in the tagged sentences:

In [7]:
def opinionS(sent):
    ow=False
    adj=False
    for word in sent:
        if word[1]=='NN' or word[1]=='NNS' or word[1]=='NNP' or word[1]=='NNPS':
            ow=True
        if word[1]=='JJ' or word[1]=='JJR'or word[1]=='JJS':
            adj=True
    return ow and adj

def NounPhrase(sent):
    feature=[]
    grammer = r"""
        NP:
            {<NN|NNS><NN|NNS><NN|NNS>}
            {<NN|NNS><NN|NNS>}
            {<NN|NNS><IN><NN|NNS><NN|NNS>}
            {<NN|NNS><IN><NN|NNS>}
            {<NN|NNS>}
    """
    cp=nltk.RegexpParser(grammer)
    result = cp.parse(sent)
    for subtree in result.subtrees(filter = lambda t:t.label() == 'NP'):
        feature.append (subtree.leaves())
    return feature

Setting up the positive and negative word lists and storing them in the seed list:

In [8]:
positive=['good','pretty','fantastic','cool','nice','amazing','excellent','perfect','outstanding','clear','remarkable','gorgeous','wonderful','awesome','upbeat','favorable','cheerful','pleased','appealing']
negative=['bad','disappointing','dull','ugly','terrible','disgraceful','poor','shoddy','awful','noisome','disgusting','frustrating','awkward','irritating','weird']
seed_list = {}
for word in positive:
    seed_list[word] = 'positive'
for word in negative:
    seed_list[word] = 'negative'

Setting up the negation word list for later use:

In [9]:
negation_word = ["no","not","yet","never","hardly","little","few","none"]

Setting up a simple text UI for choosing the input file when the code runs:

In [10]:
print("(1) Apex AD2600 Progressive-scan DVD player cleaned.txt\n")
print("(2) Canon G3 cleaned.txt\n")
print("(3) Creative Labs Nomad Jukebox Zen Xtra 40GB cleaned.txt\n")
print("(4) Nikon coolpix 4300 cleaned.txt\n")
print("(5) Nokia 6610 cleaned.txt\n")

val = input("Enter file number you wish to process: ")


if val == '1':
    file_name = 'Apex AD2600 Progressive-scan DVD player cleaned.txt'
elif val == '2':
    file_name = 'Canon G3 cleaned.txt'
elif val == '3':
    file_name = 'Creative Labs Nomad Jukebox Zen Xtra 40GB cleaned.txt'
elif val == '4':
    file_name = 'Nikon coolpix 4300 cleaned.txt'
elif val == '5':
    file_name = 'Nokia 6610 cleaned.txt'
else:
    raise Exception('input should be 1-5. The value of input was: {}'.format(val))

file = 'data/' + file_name
print("Start calculating results...\n")

(1) Apex AD2600 Progressive-scan DVD player cleaned.txt

(2) Canon G3 cleaned.txt

(3) Creative Labs Nomad Jukebox Zen Xtra 40GB cleaned.txt

(4) Nikon coolpix 4300 cleaned.txt

(5) Nokia 6610 cleaned.txt

Enter file number you wish to process: 2
Start calculating results...



Extracting the features from the opinion sentences and storing them in the candidate list:

In [11]:
raw=readfile(file)
tokenized=token(raw)
OS=[sent for sent in tokenized if opinionS(sent)]
opinionS_N=len(OS)
nounphrase=[]
for sent in OS:
    nounphrase.append(NounPhrase(sent))
nounphrase_N=len(nounphrase)
candidate=[]
for i in range(0,nounphrase_N):
    for j in range(0,len(nounphrase[i])):
        f=''
        for x in range(0,len(nounphrase[i][j])):
            if (nounphrase[i][j][x][0] not in stop_words or nounphrase[i][j][x][0] in ['of','for']) and (x!=len(nounphrase[i][j])-1):
                f+=nounphrase[i][j][x][0]+' '
            elif(nounphrase[i][j][x][0] not in stop_words or nounphrase[i][j][x][0] in ['of','for']) and (x==len(nounphrase[i][j])-1):
                f+=nounphrase[i][j][x][0]
        candidate.append(f)
candidate=[elem for elem in candidate if elem.strip()]

Keeping the candidate features whose count is more than 2% of the number of opinion sentences:

In [12]:
candidateDic={}
for i in candidate:
    if i not in candidateDic:
        candidateDic[i]=1
    else:
        candidateDic[i]+=1

features=[elem for elem in candidateDic.keys() if candidateDic[elem]/opinionS_N > 0.02]
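
The same frequency filter can be sketched with `collections.Counter`. This is only an illustration under assumed toy data: `toy_candidates` and `toy_opinion_sentences` are hypothetical values, not taken from the review files.

```python
from collections import Counter

# Toy candidate list and opinion-sentence count (assumed values).
toy_candidates = ["battery", "picture quality", "battery", "lens", "battery"]
toy_opinion_sentences = 100

counts = Counter(toy_candidates)
# Keep a candidate when count / number of opinion sentences exceeds 2%.
frequent = [f for f, n in counts.items() if n / toy_opinion_sentences > 0.02]
print(frequent)  # ['battery']
```

Note that the denominator is the number of opinion sentences, not the length of the candidate list, which matches the cell above.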

Creating a dictionary that records the sentences in which each feature appears (key = feature, value = list of sentence indices):

In [13]:
featuresNS=[]
for f in features:
    s=f
    s=s.replace(' ','')
    featuresNS.append(s)

featuresDic={}

for i in range(0,len(OS)):
    sentNS=''
    sent_N=len(OS[i])
    for j in range(0,sent_N):
        sentNS+=OS[i][j][0]
        
    for z in range(0,len(featuresNS)):
        if featuresNS[z] in sentNS:
            if features[z] not in featuresDic:
                featuresDic[features[z]]=[i]
            else:
                featuresDic[features[z]].append(i)
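
As a rough illustration of the space-insensitive matching used above, the sketch below builds the same kind of index on toy data. The sentences, the features, and the name `feature_index` are assumptions made for this example.

```python
from collections import defaultdict

# Toy tokenized opinion sentences and features (assumed data).
toy_sentences = [
    ["the", "battery", "life", "is", "great"],
    ["picture", "quality", "is", "poor"],
    ["i", "use", "it", "because", "it", "is", "small"],
]
toy_features = ["battery life", "use"]

feature_index = defaultdict(list)
for i, tokens in enumerate(toy_sentences):
    joined = "".join(tokens)  # sentence with all spaces removed
    for feature in toy_features:
        # Compare the space-free feature against the space-free sentence.
        if feature.replace(" ", "") in joined:
            feature_index[feature].append(i)

print(dict(feature_index))  # {'battery life': [0], 'use': [2]}
```

One consequence of this design is that a short feature can also match inside a longer word: `"use"` is a substring of `"because"`, so a sentence containing only "because" would be indexed under "use" as well.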
                

Removing the tags from the opinion sentences for later use:

In [14]:
def remove_tag(OS):
    output = []
    for sent in OS:
        new_sent = []
        for word in sent:
            new_sent.append(word[0])
        output.append(new_sent)
    return output
OS_notag = remove_tag(OS)

For each adjective found in an opinion sentence, we first check whether it is already in seed_list. If it is, we do nothing. If one of its synonyms is already in seed_list, we give the adjective the same orientation as that synonym. Otherwise, we check whether one of its antonyms is in seed_list; if so, we give the adjective the opposite orientation to that antonym. If the adjective has neither synonyms nor antonyms in seed_list, we discard it.

In [15]:
def find_syn_ant(word):
    synonyms = []
    antonyms = []
    for syn in wn.synsets(word):
        for l in syn.lemmas():
            synonyms.append(l.name())
            if l.antonyms():
                antonyms.append(l.antonyms()[0].name())

    return synonyms, antonyms

def negation(orientation):
    if orientation == "positive":
        orientation = "negative"
    else:
        orientation = "positive"
    return orientation

def OrientationPrediction(adj_list, seed_list):
    while True:
        size1 = len(seed_list)
        adj_list, seed_list = OrientationSearch(adj_list, seed_list)
        size2 = len(seed_list)
        if size1 == size2:
            break

    return adj_list, seed_list

def OrientationSearch(adj_list, seed_list):
    for adj in adj_list:
        # reset per adjective so the antonym search is not skipped after an earlier match
        added = False
        adj_syn, adj_ant = find_syn_ant(adj)
        for syn in adj_syn:
            if syn in seed_list:
                adj_orientation = seed_list[syn]
                seed_list[adj] = adj_orientation
                added = True
                break
        if added == False:
            for ant in adj_ant:
                if ant in seed_list:
                    adj_orientation = negation(seed_list[ant])
                    seed_list[adj] = adj_orientation
                    added = True
                    break
    return adj_list, seed_list
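
The propagation idea can be sketched without WordNet. In the example below, `TOY_SYNONYMS` and `TOY_ANTONYMS` are hand-written stand-ins for the WordNet lookups, and `flip` and `propagate` are hypothetical names; this is a simplified illustration, not the code used for our results.

```python
# Toy stand-ins for WordNet lookups (assumed data, not real WordNet output).
TOY_SYNONYMS = {"superb": ["excellent"], "stellar": ["superb"], "lousy": []}
TOY_ANTONYMS = {"lousy": ["good"]}

def flip(orientation):
    return "negative" if orientation == "positive" else "positive"

def propagate(adjectives, seeds):
    """Grow a copy of `seeds` until no new adjective can be labeled."""
    seeds = dict(seeds)
    changed = True
    while changed:
        changed = False
        for adj in adjectives:
            if adj in seeds:
                continue  # already labeled, nothing to do
            for syn in TOY_SYNONYMS.get(adj, []):
                if syn in seeds:
                    seeds[adj] = seeds[syn]  # same orientation as the synonym
                    changed = True
                    break
            else:
                # No synonym matched, so try the antonyms.
                for ant in TOY_ANTONYMS.get(adj, []):
                    if ant in seeds:
                        seeds[adj] = flip(seeds[ant])  # opposite orientation
                        changed = True
                        break
    return seeds

labels = propagate(["stellar", "superb", "lousy", "purple"],
                   {"good": "positive", "excellent": "positive"})
print(labels["stellar"], labels["lousy"], "purple" in labels)  # positive negative False
```

Here "stellar" is only labeled on the second pass, after "superb" has entered the seed list, which is why the search is repeated until the seed list stops growing.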

Finding the words before and after a feature word in an opinion sentence:

In [16]:
def close_word(word, sentence, size):
    # return up to `size` words on each side of `word`, including `word` itself;
    # slicing clamps at the sentence end, so the last word is never dropped
    word_pos = sentence.index(word)
    start = max(0, word_pos - size)
    window = sentence[start:word_pos + size + 1]

    return window

Updating seed_list with the adjectives found in each opinion sentence:

In [17]:
for feature in featuresDic:
    for sentence_index in featuresDic[feature]:
        sentence = OS_notag[sentence_index]
        if feature in sentence:

            window = sentence
            adjs = []
            window_tag = nltk.pos_tag(window)
            for word_tag in window_tag:
                if word_tag[1] == 'JJ':
                    adjs.append(word_tag[0])

            adjs, seed_list = OrientationPrediction(adjs,seed_list)

Function to identify the orientation of a specific adjective in a specific sentence. It returns 1 if the orientation is positive and -1 otherwise.

In [18]:
def wordOrientation(word, sentence):
    orientation = seed_list[word]
    window = close_word(word, sentence, 5)
    for neg_word in negation_word:
        if neg_word in window:
            orientation = negation(orientation)

    if orientation == "positive":
        return 1
    else:
        return -1
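
To show the effect of negation handling, here is a small self-contained sketch. The names `NEGATIONS`, `window_around`, and `signed_orientation` and the one-word lexicon are assumptions for this example; the logic follows the same idea of flipping the orientation for each negation word found near the opinion word.

```python
NEGATIONS = ["no", "not", "never", "hardly"]

def window_around(tokens, word, size):
    """Return up to `size` tokens on each side of the first occurrence of `word`."""
    pos = tokens.index(word)
    return tokens[max(0, pos - size): pos + size + 1]

def signed_orientation(word, tokens, lexicon, size=5):
    score = 1 if lexicon[word] == "positive" else -1
    window = window_around(tokens, word, size)
    # Flip the sign once for every distinct negation word in the window.
    for neg in NEGATIONS:
        if neg in window:
            score = -score
    return score

lexicon = {"good": "positive"}
print(signed_orientation("good", "the zoom is good".split(), lexicon))      # 1
print(signed_orientation("good", "the zoom is not good".split(), lexicon))  # -1
```

Because every distinct negation word flips the sign, a window containing two different negation words ends up with the original orientation again.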

The orientation of each sentence is calculated from the adjectives it contains.
<br>&emsp;&emsp;1. If an adjective's orientation is positive, the sentence orientation is increased by 1.
<br>&emsp;&emsp;2. If an adjective's orientation is negative, the sentence orientation is decreased by 1.
<br>After all adjectives have been counted, the target sentence is:
    <br>&emsp;&emsp;1. positive if the sentence orientation > 0
    <br>&emsp;&emsp;2. negative if the sentence orientation < 0
<br>When the sentence orientation = 0, we continue with the effective adjectives, which we define as the adjectives within 5 words before or after a feature.
    <br>&emsp;&emsp;1. If an effective adjective is positive, the sentence orientation is increased by 1.
    <br>&emsp;&emsp;2. If an effective adjective is negative, the sentence orientation is decreased by 1.
<br>After the effective adjectives have been counted, the target sentence is:
    <br>&emsp;&emsp;1. positive if the sentence orientation > 0
    <br>&emsp;&emsp;2. negative if the sentence orientation < 0
    <br>&emsp;&emsp;3. neutral if the sentence orientation = 0

In [19]:
sentenceOrientation = {}
sentence_effective = {}
sentence_opw = {}
sentence_feature = {}

for i,sentence in enumerate(OS_notag):
    orientation = 0
    sentence_opw[i] = []
    sentence_effective[i] = []
    sentence_feature[i] = []

    for feature in featuresDic:
        if feature in sentence:
            sentence_feature[i].append(feature)

            eff_window = close_word(feature, sentence, 5)
            eff_tag = nltk.pos_tag(eff_window)
            for tag in eff_tag:
                if tag[1] == 'JJ' and tag[0] not in sentence_effective[i]:
                    sentence_effective[i].append(tag[0])

    for word in sentence:
        if word in seed_list:
            sentence_opw[i].append(word)

    for op in sentence_opw[i]:
        if op in seed_list:
            orientation += wordOrientation(op,sentence)

    if orientation > 0:
        sentenceOrientation[i] = "Positive"
    elif orientation < 0:
        sentenceOrientation[i] = "Negative"
    else:
        for eff_op in sentence_effective[i]:
            if eff_op in seed_list:
                orientation += wordOrientation(eff_op,sentence)
        if orientation > 0:
            sentenceOrientation[i] = "Positive"
        elif orientation < 0:
            sentenceOrientation[i] = "Negative"
        else:
            sentenceOrientation[i] = "Neutral"
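
The two-stage decision above can be condensed into a small function. This is a sketch under the assumption that the signed scores (+1 or -1) have already been computed; `classify` is a hypothetical name used only here.

```python
def classify(all_scores, effective_scores):
    """all_scores: signed scores of every opinion word in the sentence.
    effective_scores: signed scores of the adjectives close to a feature."""
    total = sum(all_scores)
    if total == 0:
        # Tie-break using only the adjectives close to the feature.
        total += sum(effective_scores)
    if total > 0:
        return "Positive"
    if total < 0:
        return "Negative"
    return "Neutral"

print(classify([1, 1, -1], []))  # Positive
print(classify([1, -1], [-1]))   # Negative
print(classify([1, -1], []))     # Neutral
```

The effective adjectives are consulted only when the first sum is exactly zero, so they act purely as a tie-breaker.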

Create featureOrientation for output:

In [20]:
featureOrientation = {}

for feature in featuresDic:
    featureOrientation[feature] = {"positive":[], "negative":[], "neutral":[]}

    for sentence_index in featuresDic[feature]:

        if sentenceOrientation[sentence_index] == "Positive" and sentence_index not in featureOrientation[feature]["positive"]:
            featureOrientation[feature]["positive"].append(sentence_index)

        elif sentenceOrientation[sentence_index] == "Negative" and sentence_index not in featureOrientation[feature]["negative"]:
            featureOrientation[feature]["negative"].append(sentence_index)

        elif sentenceOrientation[sentence_index] == "Neutral" and sentence_index not in featureOrientation[feature]["neutral"]:
            featureOrientation[feature]["neutral"].append(sentence_index)

Combining two feature entries in featureOrientation if the two features are highly similar (similarity ratio above 0.7):

In [21]:
def merge_two_dicts(x, y):
    z = {"positive":[], "negative":[], "neutral":[]}
    for key in z.keys():
        z[key] = x[key] + y[key]
    return z

duplicate_feature = []
for i,prev_feature in enumerate(features):
    for feature in features[i+1:]:
        s = SequenceMatcher(None, prev_feature, feature)
        if s.ratio() > 0.7 and s.ratio() != 1.0:
            featureOrientation[feature] = merge_two_dicts(featureOrientation[prev_feature],featureOrientation[feature])
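            # record each merged feature once so the deletion below cannot raise KeyError
            if prev_feature not in duplicate_feature:
                duplicate_feature.append(prev_feature)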

for feature in duplicate_feature:
    del featureOrientation[feature]
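
For intuition about the 0.7 threshold, the sketch below applies `difflib.SequenceMatcher` to a few feature strings. The helper name `similar` and the example strings are assumptions for illustration.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.7):
    # ratio() is 2 * (matching characters) / (total characters in both strings).
    ratio = SequenceMatcher(None, a, b).ratio()
    # Identical strings (ratio 1.0) are excluded, as in the cell above.
    return threshold < ratio < 1.0

print(similar("picture", "pictures"))  # True
print(similar("battery", "lens"))      # False
print(similar("picture", "picture"))   # False
```

With this measure, singular and plural forms of the same feature are usually merged, while unrelated features stay separate.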

Writing the resulting orientations to a file in the output folder:

In [22]:
def list_sentence(input):
    return [[' '.join(i)] for i in input]

sentences = list_sentence(OS_notag)

output_file = 'output/' + file_name.replace(' cleaned','_output')
print("Start outputting results to " + output_file + '\n')

output = open(output_file,'w')
# print output loop
for feature in featureOrientation:
    if featureOrientation[feature]["positive"] != [] and featureOrientation[feature]["negative"] != []:
        output.write(feature + '\n')
        if featureOrientation[feature]["positive"] != []:
            output.write("Positive:" + '\n')
            for index in featureOrientation[feature]["positive"]:
                output.write(sentences[index][0].replace("#", "").strip(" ") + '\n')

        if featureOrientation[feature]["negative"] != []:
            output.write("Negative:"+ '\n')
            for index in featureOrientation[feature]["negative"]:
                output.write(sentences[index][0].replace("#", "").strip(" ") + '\n')

        if featureOrientation[feature]["neutral"] != []:
            output.write("neutral:"+ '\n')
            for index in featureOrientation[feature]["neutral"]:
                output.write(sentences[index][0].replace("#", "").strip(" ") + '\n')

output.close()

print("Output complete" + '\n')

Start outputting results to output/Canon G3_output.txt

Output complete



# Experimental Setup

For evaluation, we use the tagged files provided by the paper [1] (manually tagged by the paper's authors) to test the accuracy of our results. The sentence orientation accuracy = the number of oriented sentences that match the tagged file / the total number of opinion sentences in the tagged file. The sentence extraction accuracy = the number of extracted opinion sentences that match the tagged file / the total number of opinion sentences in the tagged file. We then compare the precision we obtain with the precision reported in the paper [1] to draw our conclusions.

In [23]:
val = input("Do you want to evaluate the output?(y/n): ")

if val.lower() == 'y':
    print("Start evaluating output results ..." + '\n')
    # evaluation
    eval_list = []
    for feature in featureOrientation:

        for sentence_index in featureOrientation[feature]["positive"]:
            if sentence_index not in eval_list:
                eval_list.append(sentence_index)

        for sentence_index in featureOrientation[feature]["negative"]:
            if sentence_index not in eval_list:
                eval_list.append(sentence_index)

        for sentence_index in featureOrientation[feature]["neutral"]:
            if sentence_index not in eval_list:
                eval_list.append(sentence_index)


    eval_list.sort()


    addr = file.replace(' cleaned','')
    f = open(addr,'r')
    raw = f.read()
    add = False

    in_sentence = ''
    for i,char in enumerate(raw):
        if char == '#' and raw[i - 1] == ']':
            add = True
        if add:
            in_sentence += raw[i]
        if raw[i] == '.' or raw[i] == '!':
            add = False

    f.close()

    exist_eval_list = in_sentence.split('##')
    exist_eval_list.remove('')

    total_op = len(exist_eval_list)
    total_correct = 0

    for index in eval_list:
        output_s = sentences[index][0].replace("#", "").strip(" ")
        for sentence in exist_eval_list:
            s = SequenceMatcher(None, output_s, sentence)
            if s.ratio() > 0.9:
                total_correct += 1
                break

    print("Sentence orientation accuracy is:")
    print("%.3f" % (total_correct / total_op))
    print('\n')

    total_correct = 0
    for sentence in sentences:
        sentence = sentence[0].replace("#", "").strip(" ")
        for sentence_comp in exist_eval_list:
            s = SequenceMatcher(None, sentence, sentence_comp)
            if s.ratio() > 0.9:
                total_correct += 1
                break


    print("Opinion sentence extraction precision is:")
    print("%.3f" % (total_correct / total_op))
    print('\n')
    
    print("Program finished")

elif val.lower() == 'n':
    print("Program finished")

else:
    raise Exception('input should be y/n. The value of input was: {}'.format(val))

Do you want to evaluate the output?(y/n): y
Start evaluating output results ...

Sentence orientation accuracy is:
0.626


Opinion sentence extraction precision is:
0.778


Program finished
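
The fuzzy matching used in the evaluation above can be sketched on toy data. The function name `match_rate` and both sentence lists are assumptions for this example; the 0.9 threshold and the choice of denominator follow the evaluation cell.

```python
from difflib import SequenceMatcher

def match_rate(extracted, reference, threshold=0.9):
    """Number of extracted sentences that fuzzily match some reference
    sentence, divided by the number of reference sentences."""
    matched = 0
    for sent in extracted:
        if any(SequenceMatcher(None, sent, ref).ratio() > threshold
               for ref in reference):
            matched += 1
    return matched / len(reference)

# Toy data: the first extracted sentence differs only by tokenizer spacing.
extracted = ["the lens is sharp .", "i bought it in may ."]
reference = ["the lens is sharp.", "battery life is short."]
print("%.3f" % match_rate(extracted, reference))  # 0.500
```

Fuzzy matching is needed because the extracted sentences are rebuilt by joining tokens with spaces, so punctuation spacing differs from the tagged file.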


# Results

OS = opinion sentence
<br>SO = sentence orientation
<br>original = result from paper's model

| Product | OS extraction (original) | OS extraction | SO accuracy (original) | SO accuracy |
| --- | --- | --- | --- | --- |
| Digital camera 1 | 0.643 | 0.739 | 0.927 | 0.925 |
| Digital camera 2 | 0.554 | 0.778 | 0.946 | 0.626 |
| Cellular phone | 0.815 | 0.709 | 0.764 | 0.506 |
| MP3 player | 0.589 | 0.825 | 0.842 | 0.706 |
| DVD player | 0.607 | 0.770 | 0.730 | 0.589 |
| Average | 0.642 | 0.764 | 0.842 | 0.591 |





# Analysis of the Results

We improve the opinion sentence extraction precision by extracting more opinion sentences from the data files, based on the presence of adjectives and noun phrases. As a result, the hit rate of our algorithm is higher. Sentence orientation accuracy is lower because we find more features in each comment, which introduces more errors into our algorithm. Where we prune and how we prune also differ from the algorithm in the paper, which causes further differences in the final results.

# Future Work

We plan to improve our algorithm and increase its accuracy by implementing a better feature pruning algorithm. We then need to handle sentences that contain implicit features, such as "It cannot fit in my pockets", which is about size but does not contain the word 'size'. Finally, we will try to use machine learning to figure out how verbs and nouns in opinion sentences can be used to identify opinions.

# References

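[1] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pages 168–177, August 2004.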