# NLP PRACTICAL ASSIGNMENT


***Marcos Martínez Jiménez***

---

<font color='red'>***ATTENTION:***</font> the execution of this notebook requires some additional files included in the deliverable .zip. These include:

* Aspect terms, yelp datasets, modifiers.csv, ... provided for the assignment
* Amazon food dataset (for task 1.3)
> [too heavy to include in the deliverable; it may be replaced by any other Amazon dataset by changing the loading statement]
* Precomputed results (such as the positions of aspect terms), so professors can check the final results of Task 5 without having to run cells that take very long.
> [found in results folder]
* Modified aspect terms files
> [found in aspects folder]

Furthermore, the function *advanced_parse_opinion* requires the Stanford CoreNLP service.

## Modules

In [1]:
%load_ext autoreload
%autoreload 2

# module with all my code
import nlp_code as nlp

from IPython.display import clear_output
from nltk.corpus import wordnet as wn
from nltk.corpus import opinion_lexicon
from tqdm import tqdm
import pickle
import pandas as pd
import numpy as np
import re

## Task 1 - Review Datasets

### Task 1.1 - Hotel Reviews

In [7]:
reviews = nlp.parse_batch('yelp_dataset/yelp_hotels.json')

num_rev = len(reviews)
print(f"Reviews loaded: {num_rev}.", "\n")
print(reviews[0], "\n")
print(reviews[0].get('reviewerID'))

Reviews loaded: 5034. 

{'reviewerID': 'qLCpuCWCyPb4G2vN-WZz-Q', 'asin': '8ZwO9VuLDWJOXmtAdc7LXQ', 'summary': 'summary', 'reviewText': "Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!", 'overall': 4.0} 

qLCpuCWCyPb4G2vN-WZz-Q


### Task 1.2 - Spa&Resorts and Restaurants Reviews

***Spa&Resorts***

In [8]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_beauty_spas.json')

num_rev = 0
for entry in line_parser:
    if num_rev==0:
        print(entry, "\n")
        print(entry.get('reviewerID'), "\n")
    num_rev+=1
print(f"Reviews loaded: {num_rev}.")

{'reviewerID': 'Xm8HXE1JHqscXe5BKf0GFQ', 'asin': 'WGNIYMeXPyoWav1APUq7jA', 'summary': 'summary', 'reviewText': "Good tattoo shop. Clean space, multiple artists to choose from and books of their work are available for you to look though and decide who's style most mirrors what you're looking for. I chose Jet to do a cover-up for me and he worked with me on the design and our ideas and communication flowed very well. He's a very personable guy, is friendly and keeps the conversation going while he's working on you, and he doesn't dick around (read: He starts to work and continues until the job is done). He's very professional and informative. Good customer service combines with talent at the craft.", 'overall': 4.0} 

Xm8HXE1JHqscXe5BKf0GFQ 

Reviews loaded: 5579.


***Restaurants***

In [9]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_restaurants.json')

num_rev = 0
for entry in line_parser:
    if num_rev==0:
        print(entry, "\n")
        print(entry.get('reviewerID'), "\n")
    num_rev+=1
print(f"Reviews loaded: {num_rev}.")

{'reviewerID': 'rLtl8ZkDX5vH5nAx9C3q5Q', 'asin': '9yKzy9PApeiPPOUJEtnvkg', 'summary': 'summary', 'reviewText': 'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.Anyway, I can\'t wait to go back!', 'overa

### Task 1.3 - Amazon Food Reviews

In [10]:
line_parser = nlp.parse_lines('yelp_dataset/amazon_food.json')

num_rev = 0
for entry in line_parser:
    if num_rev==0:
        print(entry, "\n")
        print(entry.get('reviewerID'), "\n")
    num_rev+=1
print(f"Reviews loaded: {num_rev}.")

{'reviewerID': 'A1VEELTKS8NLZB', 'asin': '616719923X', 'reviewerName': 'Amazon Customer', 'helpful': [0, 0], 'reviewText': 'Just another flavor of Kit Kat but the taste is unique and a bit different.  The only thing that is bothersome is the price.  I thought it was a bit expensive....', 'overall': 4.0, 'summary': 'Good Taste', 'unixReviewTime': 1370044800, 'reviewTime': '06 1, 2013'} 

A1VEELTKS8NLZB 

Reviews loaded: 151254.


## 2 - Aspect Vocabularies

### 2.1 - Aspect Hotels

***Load and print***

In [11]:
aspect_hotels = pd.read_csv("aspects/aspects_hotels.csv", header=None, names=['aspect','term'])
aspect_hotels

Unnamed: 0,aspect,term
0,amenities,amenity
1,amenities,amenities
2,amenities,services
3,atmosphere,atmosphere
4,atmosphere,atmospheres
...,...,...
277,transportation,trains
278,transportation,tube
279,transportation,tubes
280,transportation,vehicle


***Find references #1***

Literal matching of aspect terms to all reviews. For now, only the number of aspect terms matched in each review is registered.

In [None]:
aspect_stats = []
for i in tqdm(range(len(reviews)), desc="Review progress"):
    review = reviews[i]
    text = review.get('reviewText')
    aspect_stats.append(nlp.matchRE(text,aspect_hotels))
    
print(aspect_stats[0])

Review progress: 100%|██████████| 5034/5034 [00:21<00:00, 232.16it/s]

{'bar': 1, 'building': 2, 'pool': 1, 'shopping': 1, 'transportation': 3}





And the +/- 5 character context of the identified terms is:

In [None]:
text = reviews[0].get('reviewText')
nlp.matchRE(text, aspect_hotels, show_context=True)

Aspect: bar                 Term: bar            Context:top patio bar, and a ve
Aspect: building            Term: lobby          Context:very busy lobby with Gall
Aspect: building            Term: patio          Context:T rooftop patio bar, and 
Aspect: pool                Term: pool           Context:. Awesome pool that's ha
Aspect: shopping            Term: boutique       Context:me. Great boutique rooms. Aw
Aspect: transportation      Term: bus            Context:very very busy lobby wi
Aspect: transportation      Term: car            Context:without a car. Not much
Aspect: transportation      Term: car            Context: summer. A GREAT roofto


Problems with this implementation:

* The aspects file includes some terms that can be matched to bigger words (such as bus <-> busy)
* Some terms could be matched more than one time to the same word (towel <-> towels + towels <-> towels)
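
The first problem is easy to reproduce in isolation. The sketch below assumes that each term is searched as a plain substring pattern (the internals of `nlp.matchRE` are not shown in this notebook, so this is an assumption). The word-boundary variant is included only for contrast; it is not the fix adopted here.

```python
import re

text = "A very very busy lobby, but have a car!"

# Plain substring search: the term "bus" is found inside the word "busy".
naive_hits = re.findall("bus", text)

# Word-boundary search: "bus" only matches when it is a standalone word.
bounded_hits = re.findall(r"\bbus\b", text)

print(len(naive_hits), len(bounded_hits))
```
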

To fix these problems, an alternative implementation was developed:

* Tokenize the text and match word by word instead of searching the whole text (at most one term is matched per word)
* Show the context of each match as a number of tokens instead of a number of characters
* Use a string-matching score function so that each word is matched to its single best term (allowing a percentage of mismatch so that word derivations can be picked up)
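
The first two points can be sketched as follows. This is a minimal illustration, not the code in `nlp_code`: the regex tokenizer, the `match_tokens` name, the bracket highlighting, and the term dictionary are all hypothetical, and only exact (zero-tolerance) matching is shown.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: words (with inner apostrophes or hyphens)
    # or single non-letter, non-space characters such as punctuation.
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*|[^\sA-Za-z]", text)

def match_tokens(text, term_to_aspect, window=2):
    """Count aspect hits token by token and record a token-based context."""
    tokens = tokenize(text)
    counts = Counter()
    contexts = []
    for i, tok in enumerate(tokens):
        aspect = term_to_aspect.get(tok.lower())
        if aspect is None:
            continue  # each token matches at most one term
        counts[aspect] += 1
        left = " ".join(tokens[max(0, i - window):i])
        right = " ".join(tokens[i + 1:i + 1 + window])
        contexts.append((aspect, tok, f"{left} [{tok}] {right}".strip()))
    return counts, contexts

terms = {"bus": "transportation", "car": "transportation", "lobby": "building"}
counts, contexts = match_tokens("A very very busy lobby, but have a car!", terms)
print(dict(counts))
for aspect, term, context in contexts:
    print(aspect, term, context)
```

Because whole tokens are compared, "busy" no longer triggers the term "bus".
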

***Find references #2***

In [None]:
aspect_stats = []
for i in tqdm(range(len(reviews)), desc="Review progress"):
    review = reviews[i]
    text = review.get('reviewText')
    aspect_stats.append(nlp.match_token(text,aspect_hotels,tol=0.2))
    
print(aspect_stats[0])

Review progress: 100%|██████████| 5034/5034 [01:15<00:00, 66.35it/s]

{'transportation': 2, 'shopping': 1, 'pool': 1, 'building': 2, 'bar': 1}





In [None]:
text = reviews[0].get('reviewText')
nlp.match_token(text, aspect_hotels, tol=0.2, out="context")

Aspect: transportation      Term: car            Context:and without a **car** .
Aspect: shopping            Term: boutique       Context:Great **boutique** rooms .
Aspect: pool                Term: pool           Context:Awesome **pool** that 's
Aspect: building            Term: patio          Context:A GREAT rooftop **patio** bar ,
Aspect: bar                 Term: bar            Context:GREAT rooftop patio **bar** , and
Aspect: building            Term: lobby          Context:very very busy **lobby** with Gallo
Aspect: transportation      Term: car            Context:but have a **car** !


This implementation should be able to pick up slight variations of the aspect terms (e.g. the misspelling "serbice" for "service"), but it can also match a word to the wrong term (see the example in the following code cell). A low error threshold was used to minimize that possibility.

In any case, tokenizing the text is useful: at most one term is matched per word, and the context shown is more informative.

In [None]:
text = reviews[-1].get('reviewText')
nlp.match_token(text, aspect_hotels, tol=0.2, out="context")

Aspect: atmosphere          Term: lights         Context:here for two **nights** .
Aspect: pool                Term: pool           Context:up in the **pool** area ,
Aspect: bathrooms           Term: towels         Context:dryer and softer **towels** .


The token "nights" is matched to the term "lights" because no "night"/"nights" term fits the token better and the mismatch is small enough (below the 20% tolerance).

Since wrong matches are arguably worse than missing terms with a slightly different derivation, the tolerance will be set to 0 from now on.
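
The effect of the tolerance can be checked with a small sketch. The score function used by `nlp.match_token` is not shown in this notebook, so the version below assumes a Levenshtein distance normalized by the longer string; `edit_distance`, `mismatch` and `best_term` are hypothetical names.

```python
def edit_distance(a, b):
    # Classic Levenshtein distance, keeping only one row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def mismatch(token, term):
    # Assumed score: fraction of characters that differ (0.0 = identical).
    return edit_distance(token, term) / max(len(token), len(term))

def best_term(token, terms, tol):
    # Pick the single best-scoring term, accepted only within the tolerance.
    score, term = min((mismatch(token, t), t) for t in terms)
    return term if score <= tol else None

terms = ["lights", "light", "pool"]
print(round(mismatch("nights", "lights"), 3))  # one substitution over six characters
print(best_term("nights", terms, tol=0.2))
print(best_term("nights", terms, tol=0.0))
```

Under this assumed score, a tolerance of 0.2 accepts "nights" as "lights" (1/6 is below 0.2), while a tolerance of 0 rejects it.
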

### 2.2 - Vocabulary extension with WordNet

To find other terms similar to those already included in the aspect files, we first disambiguate the terms to synsets manually. To do this, a function was created that iterates over all the terms of an aspects file and lets the user select the correct synset from WordNet's list.

Since all variations of a word included in the aspects files map to the same synset (e.g. light and lights), a version of the file with only the base terms was created:

In [34]:
aspect_hotels_base = pd.read_csv("aspects/aspects_hotels_base.csv", header=None, names=['aspect','term'])
aspect_hotels_base

Unnamed: 0,aspect,term
0,amenities,amenity
1,amenities,services
2,atmosphere,atmosphere
3,atmosphere,ambiance
4,atmosphere,light
...,...,...
151,transportation,subway
152,transportation,taxi
153,transportation,train
154,transportation,tube


In [None]:
term_syns = nlp.deambiguate_terms(aspect_hotels_base)

Aspect: TRANSPORTATION      Term: VEHICLE
------------------------------

0: a conveyance that transports people or objects
	 []
1: a medium for the expression or achievement of something
	 ['his editorials provided a vehicle for his political views', 'a congregation is a vehicle of group identity', 'the play was just a vehicle to display her talents']
2: any substance that facilitates the use of a drug or pigment or other material that is mixed with it
	 []
3: any inanimate object (as a towel or money or clothing or dishes or books or toys etc.) that can transmit infectious agents from one person to another
	 []
Enter correct synset [0..n]: 0


The synsets will be saved to a file so they can be accessed again without having to walk through the whole list of terms.

In [None]:
term_syns_names = []
for asp,syns in term_syns.items():
    for syn in syns:
        term_syns_names.append({'aspect':asp, 'synset':syn.name()})

term_syns_names = pd.DataFrame(term_syns_names)
term_syns_names.to_csv("results/hotel_terms_syns_names.csv", index=False)

In [12]:
term_syns_names = pd.read_csv("results/hotel_terms_syns_names.csv")

term_syns = {asp:[] for asp in term_syns_names['aspect'].unique()}
for i in range(len(term_syns_names)):
    aspect = term_syns_names.loc[i,'aspect']
    synset = wn.synset(term_syns_names.loc[i,'synset'])
    term_syns[aspect].append(synset)

Finally, another function extracts words similar to the identified synsets (the other lemmas of each synset and the lemmas of its hyponyms):

In [32]:
terms = nlp.gather_terms(term_syns)
terms['amenities'], terms['dinner']

Aspect progress: 100%|██████████| 31/31 [00:00<00:00, 2079.58it/s]


(['agreeable', 'agreeableness', 'amenity', 'services'],
 ['high tea', 'dine', 'dinner'])
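
The gathering step can be sketched as below. This is not the code of `nlp.gather_terms`; it assumes that only direct hyponyms are visited and it relies on the `lemma_names()` and `hyponyms()` methods of NLTK synsets. A small stand-in class with illustrative data is used so that the sketch runs without the WordNet corpus.

```python
def gather_lemmas(synset):
    """Collect the lemma names of a synset and of its direct hyponyms."""
    names = set(synset.lemma_names())
    for hyponym in synset.hyponyms():
        names.update(hyponym.lemma_names())
    # WordNet joins multi-word lemmas with underscores.
    return sorted(name.replace("_", " ") for name in names)


class StubSynset:
    """Stand-in exposing the two synset methods used above (illustrative data only)."""

    def __init__(self, lemmas, hyponyms=()):
        self._lemmas = list(lemmas)
        self._hyponyms = list(hyponyms)

    def lemma_names(self):
        return self._lemmas

    def hyponyms(self):
        return self._hyponyms


vehicle = StubSynset(
    ["vehicle"],
    [StubSynset(["sled", "sledge", "sleigh"]),
     StubSynset(["steamroller", "road_roller"])],
)
print(gather_lemmas(vehicle))
```

With NLTK and its WordNet data installed, the same function should accept a real synset such as `wn.synset('vehicle.n.01')`.
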

In [35]:
extra_terms = []
for asp,term in terms.items():
    for t in term:
        extra_terms.append({'aspect':asp, 'term':t})

extra_terms = pd.DataFrame(extra_terms)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
aspect_hotels_full = pd.concat([aspect_hotels_base, extra_terms], ignore_index=True)
aspect_hotels_full

Unnamed: 0,aspect,term
0,amenities,amenity
1,amenities,services
2,atmosphere,atmosphere
3,atmosphere,ambiance
4,atmosphere,light
...,...,...
2872,transportation,sledge
2873,transportation,sleigh
2874,transportation,steamroller
2875,transportation,road roller


***Find references #3***

Find references using the extended aspect vocabulary

In [36]:
aspect_stats = []
for i in tqdm(range(len(reviews)), desc="Review progress"):
    review = reviews[i]
    text = review.get('reviewText')
    aspect_stats.append(nlp.match_token(text,aspect_hotels_full))
    
print(aspect_stats[0])

Review progress: 100%|██████████| 5034/5034 [15:23<00:00,  5.45it/s]

{'building': 5, 'transportation': 2, 'location': 2, 'shopping': 1, 'bedrooms': 1, 'pool': 1, 'events': 1, 'bar': 1}





In [37]:
text = reviews[0].get('reviewText')
nlp.match_token(text, aspect_hotels_full, out="context")

Aspect: building            Term: hotel          Context:Great **hotel** in Central
Aspect: building            Term: place          Context:not necessarily a **place** to stay
Aspect: transportation      Term: car            Context:and without a **car** .
Aspect: location            Term: area           Context:much around the **area** , and
Aspect: location            Term: here           Context:you do stay **here** , it
Aspect: shopping            Term: boutique       Context:Great **boutique** rooms .
Aspect: bedrooms            Term: rooms          Context:Great boutique **rooms** .
Aspect: pool                Term: pool           Context:Awesome **pool** that 's
Aspect: events              Term: happening      Context:pool that 's **happening** in the
Aspect: building            Term: patio          Context:A GREAT rooftop **patio** bar ,
Aspect: bar                 Term: bar    

### 2.3 - Additional vocabulary

To build the vocabulary for the Spa&Resorts and Restaurants datasets we follow a similar approach: start with the aspect vocabulary files and extend them with WordNet.

The extension was done with only a subset of the aspects to save time (otherwise I would have to select the correct synset for another 700 terms).

In [38]:
aspect_spas_base = pd.read_csv("aspects/aspects_spas_base.csv", header=None, names=['aspect','term'])
aspect_rest_base = pd.read_csv("aspects/aspects_restaurants_base.csv", header=None, names=['aspect','term'])
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
aspect_rs_base = pd.concat([aspect_spas_base, aspect_rest_base], ignore_index=True)
aspect_rs_base

Unnamed: 0,aspect,term
0,products,product
1,products,gel
2,products,lotion
3,products,soap
4,products,wax
5,products,waxing
6,service,service
7,service,serving
8,service,attention
9,service,attitude


In [None]:
term_syns = nlp.deambiguate_terms(aspect_rs_base)
terms = nlp.gather_terms(term_syns)

Aspect: BREAKFAST           Term: MOORNING MENU
------------------------------



Aspect progress: 100%|██████████| 9/9 [00:00<00:00, 389.09it/s]


In [None]:
extra_terms = []
for asp,term in terms.items():
    for t in term:
        extra_terms.append({'aspect':asp, 'term':t})

extra_terms = pd.DataFrame(extra_terms)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
aspect_rs_full = pd.concat([aspect_rs_base, extra_terms], ignore_index=True)
aspect_rs_full.to_csv("aspects/aspects_rs_full", index=False)
aspect_rs_full

Unnamed: 0,aspect,term
0,products,product
1,products,gel
2,products,lotion
3,products,soap
4,products,wax
...,...,...
445,breakfast,zwieback
446,breakfast,rusk
447,breakfast,Brussels_biscuit
448,breakfast,twice-baked_bread


**Restaurants** (only the first 5,000, since there are more than 100,000 in total)

In [None]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_restaurants.json')

i=0
aspect_stats = []
for entry in line_parser:
    i+=1
    clear_output(wait=True)
    print('Review number:', i)
    text = entry.get('reviewText')
    aspect_stats.append(nlp.match_token(text,aspect_rs_full))
    if i>5000:
        break
    
print(aspect_stats[0])

Review number: 5001
{'breakfast': 3, 'service': 1, 'bread': 1}


In [None]:
nlp.match_token(text, aspect_rs_full, out="context")

Aspect: service             Term: service        Context:Excellent **service** and great


**Spas/Resorts**

In [None]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_beauty_spas.json')

i=0
aspect_stats = []
for entry in line_parser:
    i+=1
    clear_output(wait=True)
    print('Review number:', i)
    text = entry.get('reviewText')
    aspect_stats.append(nlp.match_token(text,aspect_rs_full))
    
print(aspect_stats[0])

Review number: 5579
{'service': 1}


In [None]:
nlp.match_token(text, aspect_rs_full, out="context")

Since only a very small subset of aspects was used, almost no hits were found. Another round of aspect matching is done below using the unextended aspect files.

***Unextended aspect files***

In [None]:
aspect_spas = pd.read_csv("aspects/aspects_spas.csv", header=None, names=['aspect','term'])
aspect_rest = pd.read_csv("aspects/aspects_restaurants.csv", header=None, names=['aspect','term'])
# combine the unextended files loaded above (aspect_spas, not aspect_spas_base)
aspect_rs = pd.concat([aspect_spas, aspect_rest], ignore_index=True)

**Restaurants**

In [None]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_restaurants.json')

i=0
aspect_stats = []
for entry in line_parser:
    i+=1
    clear_output(wait=True)
    print('Review number:', i)
    text = entry.get('reviewText')
    aspect_stats.append(nlp.match_token(text,aspect_rs))
    if i >5000:
        break
    
print(aspect_stats[0])

Review number: 5579
{'shopping': 1, 'service': 1}


In [None]:
nlp.match_token(text, aspect_rs, out="context")

Aspect: staff               Term: employees      Context:the absolute sweetest **employees** that do


**Spas/Resorts**

In [None]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_beauty_spas.json')

i=0
aspect_stats = []
for entry in line_parser:
    i+=1
    clear_output(wait=True)
    print('Review number:', i)
    text = entry.get('reviewText')
    aspect_stats.append(nlp.match_token(text,aspect_rs))
    
print(aspect_stats[0])

Review number: 5579
{'shopping': 1, 'service': 1}


In [None]:
nlp.match_token(text, aspect_rs, out="context")

Aspect: staff               Term: employees      Context:the absolute sweetest **employees** that do


The unextended aspect files will be used for the final analysis (section 5), but in a real application one could apply the WordNet extension to all the aspects and use the extended vocabulary instead.

## 3 - Opinion lexicons

### 3.1 - Liu's opinion lexicon

In [39]:
print("Positive words:\n--------------\n")
liu_positive = opinion_lexicon.positive()
print(liu_positive)
print(f"Total of words: {len(liu_positive)}\n\n")

print("Negative words:\n--------------\n")
liu_negative = opinion_lexicon.negative()
print(liu_negative)
print(f"Total of words: {len(liu_negative)}")

Positive words:
--------------

['a+', 'abound', 'abounds', 'abundance', 'abundant', ...]
Total of words: 2006


Negative words:
--------------

['2-faced', '2-faces', 'abnormal', 'abolish', ...]
Total of words: 4783


### 3.2 - Modifiers

In [40]:
modifiers = pd.read_csv("modifiers.csv", header=None, names=['term','polarity'])
modifiers

Unnamed: 0,term,polarity
0,above,2.0
1,absolutely,2.0
2,abundantly,2.0
3,acutely,2.0
4,amazingly,2.0
...,...,...
295,violently,-1.0
296,whimsically,-1.0
297,wickedly,-1.0
298,wretchedly,-1.0


## 4. Aspect opinions

### 4.1 - 4.3 Term extraction

The function for finding aspect references was adapted to also perform term extraction: instead of counting the references or printing the context, it saves the positions of the matched terms.
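
Judging by the output further down, each position appears to be a (sentence index, token index) pair. A minimal sketch of that bookkeeping follows; the regex-based sentence splitter and tokenizer are assumptions (the tokenizers used by `nlp.match_token` are not shown here), and `term_positions` is a hypothetical name.

```python
import re

def term_positions(text, term_to_aspect):
    """Map each aspect to the (sentence index, token index) pairs of its terms."""
    positions = {}
    # Naive sentence split: whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for s_idx, sentence in enumerate(sentences):
        # Words (keeping inner apostrophes/hyphens) or single punctuation marks.
        tokens = re.findall(r"[\w'-]+|[^\w\s]", sentence)
        for t_idx, tok in enumerate(tokens):
            aspect = term_to_aspect.get(tok.lower())
            if aspect is not None:
                positions.setdefault(aspect, []).append((s_idx, t_idx))
    return positions

text = "Great boutique rooms. Awesome pool that's happening in the summer."
positions = term_positions(text, {"boutique": "shopping", "pool": "pool"})
print(positions)
```
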

In [41]:
aspect_positions = []
for i in tqdm(range(len(reviews)), desc="Review progress"):
    review = reviews[i]
    text = review.get('reviewText')
    aspect_positions.append(nlp.match_token(text,aspect_hotels,out="pos"))
    
print(aspect_positions[0])

Review progress: 100%|██████████| 5034/5034 [01:13<00:00, 68.24it/s] 

{'transportation': [(0, 22), (6, 9)], 'shopping': [(3, 1)], 'pool': [(4, 1)], 'building': [(5, 3), (5, 11)], 'bar': [(5, 4)]}





### 4.1 - 4.3 Basic opinion extraction

The first attempt at opinion extraction POS-tags the sentences, identifies adjectives or adverbs adjacent to the aspect terms, and checks whether they belong to the opinion lexicon. A polarity of +1 or -1 is assigned depending on whether the adjective/adverb is in the positive or the negative list.
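
The idea can be sketched on a hand-tagged sentence, so that no tagger is needed to run it. This is not the code of `nlp.basic_parse_opinion`: the function name, the toy lexicon sets and the tagged sentence are illustrative, and the `context_range` argument only mirrors the keyword used later in this notebook.

```python
def adjacent_opinions(tagged, term_index, positive, negative, context_range=1):
    """Return [term, opinion_word, polarity] for lexicon adjectives/adverbs near the term."""
    term = tagged[term_index][0]
    found = []
    lo = max(0, term_index - context_range)
    hi = min(len(tagged), term_index + context_range + 1)
    for i in range(lo, hi):
        word, tag = tagged[i]
        # Penn Treebank tags: JJ* = adjectives, RB* = adverbs.
        if i == term_index or not tag.startswith(("JJ", "RB")):
            continue
        word = word.lower()
        if word in positive:
            found.append([term, word, 1])
        elif word in negative:
            found.append([term, word, -1])
    return found

# Hand-tagged sentence and toy lexicons (illustrative only).
tagged = [("The", "DT"), ("pool", "NN"), ("was", "VBD"), ("phenomenal", "JJ"), (".", ".")]
positive, negative = {"phenomenal", "wonderful"}, {"poor", "bad"}

narrow = adjacent_opinions(tagged, 1, positive, negative)
wide = adjacent_opinions(tagged, 1, positive, negative, context_range=2)
print(narrow, wide)
```

With a window of one token the adjective is missed because "was" sits in between; widening the window to two tokens picks it up.
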

In [None]:
aspect_opinions = []
for i in tqdm(range(len(reviews)), desc="Review progress"):
    review = reviews[i]
    text = review.get('reviewText')
    aspect_opinions.append(nlp.basic_parse_opinion(text, aspect_positions[i], opinion_lexicon))
    
print(aspect_opinions[0])

Review progress: 100%|██████████| 5034/5034 [02:15<00:00, 37.17it/s]

{}





In [None]:
aspect_opinions

[{},
 {'drinks': [['drink', 'cheap', -1], ['beer', 'cheap', -1]]},
 {},
 {'spa': [['spa', 'favorite', 1]]},
 {'location': [['river', 'lazy', -1]]},
 {'staff': [['staff', 'disinterested', -1]]},
 {'service': [['service', 'impeccable', 1]],
  'bathrooms': [['shower', 'lovely', 1]]},
 {},
 {},
 {'bedrooms': [['pillows', 'decent', 1]]},
 {},
 {},
 {},
 {'drinks': [['drinks', 'frozen', -1], ['drink', 'wonderful', 1]],
  'cuisine': [['food', 'better', 1]],
  'service': [['service', 'better', 1]]},
 {},
 {},
 {},
 {'service': [['service', 'poor', -1], ['attitudes', 'bad', -1]],
  'staff': [['patrons', 'obnoxious', -1]]},
 {},
 {},
 {},
 {},
 {},
 {'building': [['decor', 'blah', -1], ['lobby', 'cool', 1]],
  'cuisine': [['food', 'decent', 1]]},
 {},
 {},
 {},
 {'staff': [['staff', 'friendly', 1]]},
 {},
 {'cuisine': [['food', 'hot', 1]]},
 {'building': [['spot', 'beautiful', 1]],
  'pool': [['pool', 'tranquil', 1]],
  'service': [['service', 'great', 1]]},
 {},
 {},
 {},
 {'building': [['decor

As before, we might also want to check the entire sentence where an aspect opinion was found:

In [None]:
idx = 13
text = reviews[idx].get('reviewText')
nlp.basic_parse_opinion(text, aspect_positions[idx], opinion_lexicon, show_context=True)

Aspect: drinks              Term: drinks         
Context: i sat around for days on couches drinking bloody mary 's and *frozen* **drinks** .


Aspect: drinks              Term: drink          
Context: the lobby bar had a deadly but *wonderful* **drink** called the stardust .


Aspect: cuisine             Term: food           
Context: Valley Ho had *better* **food** and more style , but Royal Palms had a bit more to it and better service .


Aspect: service             Term: service        
Context: Valley Ho had better food and more style , but Royal Palms had a bit more to it and *better* **service** .




The default context considered when looking for adjectives/adverbs is +/-1 token around the aspect term. A larger context may be used.

In [None]:
idx = 13
text = reviews[idx].get('reviewText')
nlp.basic_parse_opinion(text, aspect_positions[idx], opinion_lexicon, context_range=2,show_context=True)

Aspect: events              Term: trip           
Context: I went to the Valley Ho on an *amazing* business **trip** that involved rock stars and the Oakland Athletics , which happen to be my favorite team .


Aspect: building            Term: decor          
Context: the **decor** was *fantastic* , the rooms were huge , with huge tubs , a poolside patio and a comfy chaise lounge .


Aspect: building            Term: lounge         
Context: the decor was fantastic , the rooms were huge , with huge tubs , a poolside patio and a *comfy* chaise **lounge** .


Aspect: pool                Term: pool           
Context: the **pool** was *phenomenal* .


Aspect: drinks              Term: drinks         
Context: i sat around for days on couches drinking bloody mary 's and *frozen* **drinks** .


Aspect: drinks              Term: drink          
Context: the lobby bar had a deadly bu

This simple approach extracts a considerable amount of information, but in order to handle modifiers and negations a more sophisticated version using dependency parsing was developed. It should match only the adjectives/adverbs that truly refer to the aspect term, regardless of how far away they are.

***Advanced opinion extraction***

Dependency parsing makes it possible to find the adjectives/adverbs related to the aspect term and, likewise, the modifiers related to those adjectives/adverbs. The final polarity of the opinion is obtained by multiplying the opinion word's polarity by the modifier's polarity.
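
The polarity arithmetic can be sketched as follows. The function name is hypothetical, the sign flip for negation is an assumption that agrees with the "No hot food" example shown later, and the modifier weights (0.5 for "pretty", 2.0 for "very") are inferred from the outputs in this notebook rather than read from modifiers.csv.

```python
def final_polarity(word_polarity, modifier_polarity=None, negated=False):
    """Combine the opinion word's polarity with an optional modifier and negation."""
    score = word_polarity
    if modifier_polarity is not None:
        score *= modifier_polarity  # the modifier scales the opinion word
    if negated:
        score = -score              # assumption: a negation flips the sign
    return score

pretty_fun = final_polarity(1, modifier_polarity=0.5)   # "pretty fun"
very_pretty = final_polarity(1, modifier_polarity=2.0)  # "very pretty"
no_hot_food = final_polarity(1, negated=True)           # "No hot food"
print(pretty_fun, very_pretty, no_hot_food)
```
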

In [None]:
aspect_opinions = []
for i in tqdm(range(len(reviews)), desc="Review progress"):
    review = reviews[i]
    text = review.get('reviewText')
    aspect_opinions.append(nlp.advanced_parse_opinion(text, aspect_positions[i], 
                                                      opinion_lexicon, modifiers, corenlp_port=9500))
    
print(aspect_opinions[0])

Review progress: 100%|██████████| 5034/5034 [32:40<00:00,  2.57it/s]  

{'pool': [['pool', 'awesome', '', 'Neg=False', 1]], 'bar': [['bar', 'great', '', 'Neg=False', 1]]}





Examples with modifiers or negations are somewhat hard to find, so some handpicked ones are presented next.

A function was created to display the opinions found clearly. The number in parentheses is the final polarity assigned to the term (taking modifiers and negations into account).
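
A formatter of this kind might look like the sketch below. It is not the code of `nlp.display_opinions`; the function name is hypothetical and the record layout `[term, opinion word, modifier, negation flag, polarity]` is taken from the outputs printed in this notebook.

```python
def format_opinions(aspect_opinions):
    """Render {aspect: [[term, word, modifier, negation, polarity], ...]} as text."""
    lines = []
    for aspect, opinions in aspect_opinions.items():
        lines.append(f"Aspect: {aspect}")
        for term, word, modifier, negation, polarity in opinions:
            lines.append(f"\tTerm: {term} ({polarity}): {word}, {modifier}, {negation}")
    return "\n".join(lines)

sample = {"pool": [["pools", "pretty", "very", "Neg=False", 2.0]]}
rendered = format_opinions(sample)
print(rendered)
```
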

In [None]:
nlp.display_opinions(aspect_opinions[17])

Aspect: building
	Term: spot (0.5): fun, pretty, Neg=False
Aspect: service
	Term: service (-1): poor, , Neg=False
	Term: attitudes (-1): bad, , Neg=False
Aspect: staff
	Term: patrons (-1): obnoxious, , Neg=False


In [None]:
nlp.display_opinions(aspect_opinions[19])

Aspect: pool
	Term: pools (2.0): pretty, very, Neg=False


In [None]:
nlp.display_opinions(aspect_opinions[75])

Aspect: facilities
	Term: facilities (1): clean, , Neg=False
Aspect: staff
	Term: staff (1): fantastic, , Neg=False
	Term: staff (-1): childish, , Neg=False
Aspect: building
	Term: spot (1): hot, , Neg=False
Aspect: shopping
	Term: store (1): convenient, , Neg=False


In [None]:
nlp.display_opinions(aspect_opinions[29])

Aspect: bedrooms
	Term: bed (1): comfy, , Neg=False
Aspect: cuisine
	Term: food (-1): hot, , Neg=True


And, as usual, we inspect the context using the show_context keyword argument:

In [None]:
idx = 17
text = reviews[idx].get('reviewText')
nlp.advanced_parse_opinion(text, aspect_positions[idx], opinion_lexicon, modifiers, 
                           corenlp_port=9500, show_context=True)

Aspect: building            Term: spot           Modif: pretty         
Value: ['spot', 'fun', 'pretty', 'Neg=False', 0.5]
Context: During the day the pool area is poorly run, but in the evenings the pool turns into a pretty fun nightlife spot.


Aspect: service             Term: service        
Value: ['service', 'poor', '', 'Neg=False', -1]
Context: It's just really amateur hour here with the piss poor service and their bad attitudes and obnoxious patrons.


Aspect: service             Term: attitudes      
Value: ['attitudes', 'bad', '', 'Neg=False', -1]
Context: It's just really amateur hour here with the piss poor service and their bad attitudes and obnoxious patrons.


Aspect: staff               Term: patrons        
Value: ['patrons', 'obnoxious', '', 'Neg=False', -1]
Context: It's just really amateur hour here with the piss poor service and their bad attitudes and obnoxious patrons.




In [None]:
idx = 19
text = reviews[idx].get('reviewText')
nlp.advanced_parse_opinion(text, aspect_positions[idx], opinion_lexicon, modifiers, 
                           corenlp_port=9500, show_context=True)

Aspect: pool                Term: pools          Modif: very           
Value: ['pools', 'pretty', 'very', 'Neg=False', 2.0]
Context: The rooms are extremely comfortable, the grounds are picture perfect and the pools are very pretty.




In [None]:
idx = 75
text = reviews[idx].get('reviewText')
nlp.advanced_parse_opinion(text, aspect_positions[idx], opinion_lexicon, modifiers, 
                           corenlp_port=9500, show_context=True)

Aspect: facilities          Term: facilities     
Value: ['facilities', 'clean', '', 'Neg=False', 1]
Context: The facilities are clean and very well maintained.


Aspect: staff               Term: staff          
Value: ['staff', 'fantastic', '', 'Neg=False', 1]
Context: Housekeeping staff is fantastic also.


Aspect: staff               Term: staff          
Value: ['staff', 'childish', '', 'Neg=False', -1]
Context: Regardless, I was a bit bummed that the reception staff was childish and rude to me.


Aspect: building            Term: spot           
Value: ['spot', 'hot', '', 'Neg=False', 1]
Context: So apparently, this hotel pool must be the hot spot for pool crashers.


Aspect: shopping            Term: store          
Value: ['store', 'convenient', '', 'Neg=False', 1]
Context: There is a small convenient store next to the reception desk that sells wine, beer, water, soda, snacks, etc.




In [None]:
idx = 29
text = reviews[idx].get('reviewText')
nlp.advanced_parse_opinion(text, aspect_positions[idx], opinion_lexicon, modifiers, 
                           corenlp_port=9500, show_context=True)

Aspect: bedrooms            Term: bed            
Value: ['bed', 'comfy', '', 'Neg=False', 1]
Context: The king size bed was comfy with fluffy pillows.


Aspect: cuisine             Term: food           
Value: ['food', 'hot', '', 'Neg=True', -1]
Context: No hot food.




## 5. Opinion summarization

### 5.0a) Final opinion extraction

***Hotel Reviews***

Final opinion extraction uses the extended aspects for hotels, but not for restaurants/spas.

In [None]:
hotel_aspect_positions = []
for i in tqdm(range(len(reviews)), desc="Review progress"):
    review = reviews[i]
    text = review.get('reviewText')
    hotel_aspect_positions.append(nlp.match_token(text,aspect_hotels_full,out="pos"))
    
print(hotel_aspect_positions[0])

Review progress: 100%|██████████| 5034/5034 [13:54<00:00,  6.03it/s]

{'building': [(0, 1), (0, 13), (5, 3), (5, 11), (6, 2)], 'transportation': [(0, 22), (6, 9)], 'location': [(1, 4), (2, 5)], 'shopping': [(3, 1)], 'bedrooms': [(3, 2)], 'pool': [(4, 1)], 'events': [(4, 4)], 'bar': [(5, 4)]}





***Restaurant Reviews*** (first 5,000)

For restaurant and spa reviews the unextended aspects are used. Furthermore, only the first 5,000 restaurant reviews are considered (out of more than 100,000).

In [None]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_restaurants.json')

i=0
restaurant_aspect_positions = []
for entry in line_parser:
    i+=1
    clear_output(wait=True)
    print('Review number:', i)
    text = entry.get('reviewText')
    restaurant_aspect_positions.append(nlp.match_token(text,aspect_rs,out='pos'))
    if i >5000:
        break
    
print(restaurant_aspect_positions[0])

Review number: 5001
{'breakfast': [(0, 9), (2, 13), (8, 5)], 'seating': [(1, 6)], 'staff': [(2, 1)], 'food': [(2, 6), (5, 7)], 'building': [(5, 10)], 'menu': [(6, 6), (7, 17)], 'eggs': [(6, 16)], 'vegetables': [(6, 17)], 'bread': [(7, 8)]}


***Spa Reviews***

In [None]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_beauty_spas.json')

i=0
spa_aspect_positions = []
for entry in line_parser:
    i+=1
    clear_output(wait=True)
    print('Review number:', i)
    text = entry.get('reviewText')
    spa_aspect_positions.append(nlp.match_token(text,aspect_rs,out="pos"))
    
print(spa_aspect_positions[0])

Review number: 5579
{'shopping': [(0, 2)], 'service': [(5, 2)]}


### 5.0b) Final opinion parsing

***Hotel Reviews***

In [None]:
hotel_aspect_opinions = []
for i in tqdm(range(len(reviews)), desc="Review progress"):
    review = reviews[i]
    text = review.get('reviewText')
    hotel_aspect_opinions.append(nlp.advanced_parse_opinion(text, hotel_aspect_positions[i], 
                                                      opinion_lexicon, modifiers, corenlp_port=9500))

Review progress: 100%|██████████| 5034/5034 [48:44<00:00,  1.72it/s]  


***Restaurant Reviews***

In [None]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_restaurants.json')

i=0
restaurant_aspect_opinions = []
for entry in line_parser:
    i+=1
    clear_output(wait=True)
    print('Review number:', i)
    text = entry.get('reviewText')
    restaurant_aspect_opinions.append(nlp.advanced_parse_opinion(text, restaurant_aspect_positions[i-1], 
                                                      opinion_lexicon, modifiers, corenlp_port=9500))
    if i>5000:
        break

Review number: 5001


***Spa Reviews***

In [None]:
line_parser = nlp.parse_lines('yelp_dataset/yelp_beauty_spas.json')

i=0
spa_aspect_opinions = []
for entry in line_parser:
    i+=1
    clear_output(wait=True)
    print('Review number:', i)
    text = entry.get('reviewText')
    spa_aspect_opinions.append(nlp.advanced_parse_opinion(text, spa_aspect_positions[i-1], 
                                                      opinion_lexicon, modifiers, corenlp_port=9500))

Review number: 5579


***Save and load results***

Opinion extraction and parsing take a long time, so their results have been saved. You can skip the cells above and load the saved results instead:

In [None]:
with open('results/hotel_opinions.pickle', 'wb') as f:
    pickle.dump(hotel_aspect_opinions, f)

In [None]:
with open('results/rest_opinions.pickle', 'wb') as f:
    pickle.dump(restaurant_aspect_opinions, f)

In [None]:
with open('results/spa_opinions.pickle', 'wb') as f:
    pickle.dump(spa_aspect_opinions, f)

In [31]:
with open('results/hotel_opinions.pickle', 'rb') as f:
    hotel_aspect_opinions = pickle.load(f)

In [32]:
with open('results/rest_opinions.pickle', 'rb') as f:
    restaurant_aspect_opinions = pickle.load(f)

In [33]:
with open('results/spa_opinions.pickle', 'rb') as f:
    spa_aspect_opinions = pickle.load(f)
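
The save and load cells above can be folded into one caching helper that only runs the slow computation when no saved result exists. This is a minimal sketch: `load_or_compute` is a hypothetical helper, and the demonstration uses a temporary file and a trivial stand-in computation rather than the real opinion parsing.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled result if it exists; otherwise compute, save and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Demonstration with a throwaway file and a stand-in for the slow step
demo_path = os.path.join(tempfile.mkdtemp(), "demo_opinions.pickle")
first = load_or_compute(demo_path, lambda: [{"pool": [["pool", "awesome", "", "Neg=False", 1]]}])
second = load_or_compute(demo_path, lambda: [])  # lambda is not called: the file exists
print(first == second)
```

In the notebook, `compute` would wrap one of the parsing loops from section 5.0b.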

### 5.1 Aspect opinions of given reviews

As before, a helper function prints all aspect opinions of a review. The output has the following structure:

```
Aspect: <aspect>
	Term: <term> (<final_polarity>): <opinion>, <modifier>, Neg=<bool>
	...
...
```
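
To make the output structure concrete, the sketch below renders it from a dictionary of opinions. It assumes each review's opinions are stored as `{aspect: [[term, opinion, modifier, neg_flag, polarity], ...]}`, which matches the `Value` lists printed earlier. `format_opinions` is a hypothetical function, not the actual `nlp.display_opinions`.

```python
def format_opinions(aspect_opinions):
    """Render {aspect: [[term, opinion, modifier, neg_flag, polarity], ...]} as text."""
    lines = []
    for aspect, opinions in aspect_opinions.items():
        lines.append(f"Aspect: {aspect}")
        for term, opinion, modifier, neg_flag, polarity in opinions:
            # An empty modifier leaves a blank between the two commas
            lines.append(f"\tTerm: {term} ({polarity}): {opinion}, {modifier}, {neg_flag}")
    return "\n".join(lines)


sample = {"bedrooms": [["bed", "comfy", "", "Neg=False", 1]]}
print(format_opinions(sample))
```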

***Hotel Reviews***

In [34]:
list_reviews = [0,5,17]

for ri in list_reviews:
    print(f"Review {ri}\n-----------\n")
    nlp.display_opinions(hotel_aspect_opinions[ri])
    print("\n\n")

Review 0
-----------

Aspect: building
	Term: hotel (1): great, , Neg=False
	Term: place (1): great, , Neg=False
Aspect: bedrooms
	Term: rooms (1): great, , Neg=False
Aspect: pool
	Term: pool (1): awesome, , Neg=False
Aspect: bar
	Term: bar (1): great, , Neg=False



Review 5
-----------

Aspect: cuisine
	Term: food (1): delicious, , Neg=False
Aspect: staff
	Term: staff (1): attentive, , Neg=False
	Term: staff (-1): grumpy, , Neg=False
	Term: staff (-1): uncooperative, , Neg=False
	Term: staff (-1): disinterested, , Neg=False



Review 17
-----------

Aspect: location
	Term: scene (1): improved, , Neg=False
Aspect: events
	Term: run (-1): poorly, , Neg=False
	Term: fun (1): pretty, , Neg=False
Aspect: building
	Term: spot (0.5): fun, pretty, Neg=False
Aspect: service
	Term: service (-1): poor, , Neg=False





***Restaurant Reviews***

In [None]:
list_reviews = [3,4,12]

for ri in list_reviews:
    print(f"Review {ri}\n-----------\n")
    nlp.display_opinions(restaurant_aspect_opinions[ri])
    print("\n\n")

Review 3
-----------

Aspect: appetizers
	Term: entrees (1): solid, , Neg=False
Aspect: meat
	Term: meat (1): tough, , Neg=False
Aspect: staff
	Term: chef (-0.5): unhappy, apparently, Neg=False
	Term: chef (-1.0): unhappy, so, Neg=False
Aspect: sauces
	Term: sauce (1): delicious, , Neg=False
Aspect: service
	Term: service (-2.0): spotty, so, Neg=False



Review 4
-----------

Aspect: food
	Term: food (1): enough, , Neg=False
	Term: food (1): good, , Neg=False
Aspect: building
	Term: building (1): cute, , Neg=False
Aspect: menu
	Term: menu (1): awesome, , Neg=False
	Term: meals (2.0): delicious, very, Neg=False
Aspect: vegetables
	Term: peppers (-1.0): hot, too, Neg=False



Review 12
-----------

Aspect: potatoes
	Term: chips (1): complimentary, , Neg=False





***Spa Reviews***

In [None]:
list_reviews = [0,1,12]

for ri in list_reviews:
    print(f"Review {ri}\n-----------\n")
    nlp.display_opinions(spa_aspect_opinions[ri])
    print("\n\n")

Review 0
-----------

Aspect: shopping
	Term: shop (1): good, , Neg=False
Aspect: service
	Term: service (1): good, , Neg=False



Review 1
-----------

Aspect: staff
	Term: staff (2.0): friendly, extremely, Neg=False
Aspect: price
	Term: price (2.0): reasonable, very, Neg=False



Review 12
-----------

Aspect: shopping
	Term: mall (-1): weird, , Neg=False





### 5.2 Summary of aspect opinions

***Hotel Reviews***

Review #0

In [36]:
summ = nlp.summarize(hotel_aspect_opinions[0])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,building,2,0,2,1.0,0.0
1,bedrooms,1,0,1,1.0,0.0
2,pool,1,0,1,1.0,0.0
3,bar,1,0,1,1.0,0.0


In [37]:
# Mean of each measure over all aspects.
# This could serve as a single criterion
# for deciding where to stay.
summ[summ.columns[1:]].mean()

Positive count       1.25
Negative count       0.00
Total polarity       1.25
Mean polarity        1.00
Polarity variance    0.00
dtype: float64

Review #5

In [39]:
summ = nlp.summarize(hotel_aspect_opinions[5])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,cuisine,1,0,1,1.0,0.0
1,staff,1,3,-2,-0.5,0.75
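
The columns of the summary table can be reproduced from the per-term polarities listed in section 5.1. The sketch below does so with the standard library, using the population variance because that is what matches the values shown (0.75 for staff). `summarize_polarities` is a hypothetical simplification that takes plain polarity lists; it is not the actual `nlp.summarize`.

```python
from statistics import mean, pvariance


def summarize_polarities(aspect_polarities):
    """Build one summary row per aspect from {aspect: [polarity, ...]}."""
    rows = []
    for aspect, values in aspect_polarities.items():
        rows.append({
            "Aspect": aspect,
            "Positive count": sum(v > 0 for v in values),
            "Negative count": sum(v < 0 for v in values),
            "Total polarity": sum(values),
            "Mean polarity": mean(values),
            # Population variance (divides by n, not n - 1)
            "Polarity variance": pvariance(values),
        })
    return rows


# Polarities of hotel review 5: one 'cuisine' opinion, four 'staff' opinions
rows = summarize_polarities({"cuisine": [1], "staff": [1, -1, -1, -1]})
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would give a table with the same column names as above.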


In [40]:
summ[summ.columns[1:]].mean()

Positive count       1.000
Negative count       1.500
Total polarity      -0.500
Mean polarity        0.250
Polarity variance    0.375
dtype: float64

Review #17

In [41]:
summ = nlp.summarize(hotel_aspect_opinions[17])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,location,1,0,1.0,1.0,0.0
1,building,1,0,0.5,0.5,0.0
2,events,1,1,0.0,0.0,1.0
3,service,0,1,-1.0,-1.0,0.0


In [42]:
summ[summ.columns[1:]].mean()

Positive count       0.750
Negative count       0.500
Total polarity       0.125
Mean polarity        0.125
Polarity variance    0.250
dtype: float64

***Restaurant Reviews***

Review #3

In [43]:
summ = nlp.summarize(restaurant_aspect_opinions[3])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,appetizers,1,0,1.0,1.0,0.0
1,meat,1,0,1.0,1.0,0.0
2,sauces,1,0,1.0,1.0,0.0
3,staff,0,2,-1.5,-0.75,0.0625
4,service,0,1,-2.0,-2.0,0.0


In [44]:
summ[summ.columns[1:]].mean()

Positive count       0.6000
Negative count       0.6000
Total polarity      -0.1000
Mean polarity        0.0500
Polarity variance    0.0125
dtype: float64

Review #4

In [45]:
summ = nlp.summarize(restaurant_aspect_opinions[4])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,menu,2,0,3.0,1.5,0.25
1,food,2,0,2.0,1.0,0.0
2,building,1,0,1.0,1.0,0.0
3,vegetables,0,1,-1.0,-1.0,0.0


In [47]:
summ[summ.columns[1:]].mean()

Positive count       1.2500
Negative count       0.2500
Total polarity       1.2500
Mean polarity        0.6250
Polarity variance    0.0625
dtype: float64

Review #12

In [48]:
summ = nlp.summarize(restaurant_aspect_opinions[12])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,potatoes,1,0,1,1.0,0.0


In [49]:
summ[summ.columns[1:]].mean()

Positive count       1.0
Negative count       0.0
Total polarity       1.0
Mean polarity        1.0
Polarity variance    0.0
dtype: float64

***Spa Reviews***

Review #0

In [50]:
summ = nlp.summarize(spa_aspect_opinions[0])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,shopping,1,0,1,1.0,0.0
1,service,1,0,1,1.0,0.0


In [51]:
summ[summ.columns[1:]].mean()

Positive count       1.0
Negative count       0.0
Total polarity       1.0
Mean polarity        1.0
Polarity variance    0.0
dtype: float64

Review #1

In [52]:
summ = nlp.summarize(spa_aspect_opinions[1])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,staff,1,0,2.0,2.0,0.0
1,price,1,0,2.0,2.0,0.0


In [53]:
summ[summ.columns[1:]].mean()

Positive count       1.0
Negative count       0.0
Total polarity       2.0
Mean polarity        2.0
Polarity variance    0.0
dtype: float64

Review #12

In [None]:
summ = nlp.summarize(spa_aspect_opinions[12])
summ

Unnamed: 0,Aspect,Positive count,Negative count,Total polarity,Mean polarity,Polarity variance
0,shopping,0,1,-1,-1.0,0.0


In [54]:
summ[summ.columns[1:]].mean()

Positive count       0.0
Negative count       1.0
Total polarity      -1.0
Mean polarity       -1.0
Polarity variance    0.0
dtype: float64

### 5.3 Manual evaluation

The current opinion-analysis workflow was evaluated manually on the first 20 hotel reviews only, by checking:
* How many of the identified opinions were correct $\Rightarrow$ Precision
* How many of the relevant opinions were found by the analysis $\Rightarrow$ Recall

In [56]:
# Each number is from one review (0-19)
num_correct = 4 + 0 + 4 + 2 + 5 + 2 + 6 + 1 + 1 + 2 + 1 + \
              5 + 1 + 8 + 2 + 4 + 1 + 1

# Count every opinion extracted from the first 20 reviews
num_opinions = sum(1
                   for i in range(20)
                   for opinions in hotel_aspect_opinions[i].values()
                   for _ in opinions)
print(f"Precision: {round(num_correct/num_opinions,3)}")

Precision: 0.847


In [57]:
# Each number is from one review (0-19)
num_relevant = 10 + 9 + 15 + 6 + 12 + 10 + 9 + 13 + 6 + 6 + \
               3 + 7 + 2 + 17 + 4 + 18 + 7 + 20 + 5 + 5

print(f"Recall: {round(num_correct/num_relevant,3)}")

Recall: 0.272
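
Both measures share the same numerator and differ only in the denominator, as the sketch below shows. The counts 50 and 184 are the sums of the manual tallies above. The count of 59 identified opinions is not printed in the notebook and is inferred here from the reported precision (50/59 ≈ 0.847). `precision_recall` is a hypothetical helper.

```python
def precision_recall(num_correct, num_identified, num_relevant):
    """Precision = correct / identified; recall = correct / relevant."""
    precision = num_correct / num_identified if num_identified else 0.0
    recall = num_correct / num_relevant if num_relevant else 0.0
    return precision, recall


# 50 correct opinions, 59 identified (inferred), 184 relevant
p, r = precision_recall(50, 59, 184)
print(f"Precision: {p:.3f}  Recall: {r:.3f}")
```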


Conclusions: 

* Although the analysis makes some significant mistakes (e.g. treating a negative opinion about a different hotel as an opinion about the reviewed hotel, or misinterpreting a phrase), most of the identified opinions are correct (precision 0.847).

* On the other hand, very few of the relevant opinions are actually picked up by the parsing methods (recall 0.272). There is probably room for improvement, but many opinions are too subtle for a term-matching approach.

* In the absence of a better analysis pipeline, this method might be good enough to support decisions about which hotel to stay in.