Praktikum NLP:

ToDo:
- Load dataset √

- Dependency Parsing √

- Extracting aspects (phrases/sentiment-words) √

    - Summarize aspects for each hotel √

    - Calculate sentiment scores for each aspect (TextBlob, VADER, etc...) √

- Analyse 5 most common aspects √

- Dump into graphs for the presentation

In [349]:
#Imports and downloads

import pandas as pd
import re
import spacy
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer 
import string
from collections import Counter
from textblob import TextBlob

#nltk.download('wordnet')
#nltk.download('omw-1.4')
#!pip install spacy
#!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

    

In [350]:
## Clean data

# Delete redundant and empty reviews

def preprocessing_reviews(df):
    df.drop(df[ df['Negative_Review'] == 'No Negative'].index)
    df.drop(df[ df['Positive_Review'] == 'No Positive'].index)
    df.drop_duplicates(keep=False)

# Split into sentences and add dots to positive reviews and negative reviews
    for text in df['Positive_Review']:
        parts_of_review = re.findall('[A-Z][^A-Z]*', text)
        for index, sentence in enumerate(parts_of_review):
            if "I " or "i " in sentence:
                parts_of_review[index : index +1] = [''.join(parts_of_review[index : index + 1])]
            parts_of_review[index] = sentence.strip() + "."
        complete_review = ' '.join(parts_of_review)

        # correct 'n t'|'nt'|'n t' to ' not'
        complete_review = re.sub(r'n\s+t |n\'t ', ' not',complete_review)
        df['Positive_Review'] = df['Positive_Review'].replace([text], complete_review)

    for text in df['Negative_Review']:
        parts_of_review = re.findall('[A-Z][^A-Z]*', text)
        for index, sentence in enumerate(parts_of_review):
            if "I " or "i " in sentence:
                parts_of_review[index : index +1] = [''.join(parts_of_review[index : index + 1])]
            parts_of_review[index] = sentence.strip() + "."
        complete_review = ' '.join(parts_of_review)

        # correct 'n t'|'nt'|'n t' to ' not'
        complete_review = re.sub(r'n\s+t |n\'t ', ' not',complete_review)
        #        complete_review = re.sub(r'n\s+t|n\'t|nt ', ' not',complete_review)
        df['Negative_Review'] = df['Negative_Review'].replace([text], complete_review)
    return df

#df = preprocessing_reviews(df)
#pd.set_option("display.max_colwidth", None) # -1?

In [351]:
# The main extraction function. Uses 6 rules.
# Rule 0 is self written. 1,3,4,6,7 are taken from https://github.com/ishikaarora/Aspect-Sentiment-Analysis-on-Amazon-Reviews
def extractAspects(listOfReviews):
    # the return value
    aspects = []

    # for intermediate results
    rule0_pairs = []
    rule1_pairs = []
    rule3_pairs = []
    rule4_pairs = []
    rule6_pairs = []
    rule7_pairs = []

    # to id the aspects, used for removal of duplicates later on
    counter = 0

    # main extraction loop
    for review in listOfReviews:

        # pipe the reviews
        doc = nlp(review)

        ## MY (0) RULE OF DEPENDANCY PARSE -
        ## This is the original rule that was shown in our first presentation
        descriptors = '999999'
        aspect = '999999'

        for token in doc:
            # get all the nouns as aspects
            if token.dep_ == 'nsubj' and token.pos_ == 'NOUN':
                aspect = token.text
            # and the adjectives as descriptions...
            if token.pos_ == 'ADJ':
                prepend = ''
                negation = False 
                     
                A_children = token.head.children
                for childa in A_children:
                    if(childa.dep_ == "det" and childa.text == 'no'):
                        negation = True


                for child in token.children:
                    # ... they can be modified by adverbs, so add them, too
                    if child.pos_ != 'ADV':
                        continue
                    prepend += child.text + ' '
                if negation: 
                    descriptors = 'not ' + prepend.strip(".") + token.text.strip(".")
                    negation = True
                else: 
                    descriptors = prepend.strip(".") + token.text.strip(".")
                    negation = True

        # pour into the array and make the dataframe
        if (aspect != '999999' and descriptors != '999999' and aspect != '' and descriptors != ''):
            rule0_pairs.append({'id':counter, 'rule': 0, 'aspect': aspect,
                'description': descriptors, 'sentiment':TextBlob(descriptors).sentiment})



        ## All following rules stem from https://github.com/ishikaarora/Aspect-Sentiment-Analysis-on-Amazon-Reviews


        ## FIRST RULE OF DEPENDANCY PARSE -
        ## M - Sentiment modifier || A - Aspect
        ## RULE = M is child of A with a relationshio of amod
        for token in doc:
            A = "999999"
            M = "999999"
            if token.dep_ == "amod" and not token.is_stop:
                M = token.text
                A = token.head.text
                # add adverbial modifier of adjective (e.g. 'most comfortable headphones')
                M_children = token.children
                for child_m in M_children:
                    if(child_m.dep_ == "advmod"):
                        M_hash = child_m.text
                        M = M_hash + " " + M
                        break
                # negation in adjective, the "no" keyword is a 'det' of the noun (e.g. no interesting characters)
                A_children = token.head.children
                for child_a in A_children:
                    if(child_a.dep_ == "det" and child_a.text == 'no'):
                        neg_prefix = 'not'
                        M = neg_prefix + " " + M
                        break
            if(A != "999999" and M != "999999"):
                rule1_pairs.append({'id':counter, 'rule': 1, 'aspect': A, 'description': M, 'sentiment':TextBlob(M).sentiment})
    


        # SEVENTH RULE OF DEPENDANCY PARSE -
        ## M - Sentiment modifier || A - Aspect
        ## ATTR - link between a verb like 'be/seem/appear' and its complement
        ## Example: 'this is garbage' -> (this, garbage)

        for token in doc:
            children = token.children
            A = "999999"
            M = "999999"
            add_neg_pfx = False
            for child in children :
                if(child.dep_ == "nsubj" and not child.is_stop):
                    A = child.text
                    # check_spelling(child.text)
                if((child.dep_ == "attr") and not child.is_stop):
                    M = child.text
                    #check_spelling(child.text)
                if(child.dep_ == "neg"):
                    neg_prefix = child.text
                    add_neg_pfx = True
            if (add_neg_pfx and M != "999999"):
                M = neg_prefix + " " + M
            if(A != "999999" and M != "999999"):
                rule7_pairs.append({'id':counter,'rule': 7, 'aspect': A, 'description': M, 'sentiment':TextBlob(M).sentiment})




        ## THIRD RULE OF DEPENDANCY PARSE -
        ## M - Sentiment modifier || A - Aspect
        ## Adjectival Complement - A is a child of something with relationship of nsubj, while
        ## M is a child of the same something with relationship of acomp
        ## Assumption - A verb will have only one NSUBJ and DOBJ
        ## "The sound of the speakers would be better. The sound of the speakers could be better" - handled using AUX dependency

        for token in doc:
            children = token.children
            A = "999999"
            M = "999999"
            add_neg_pfx = False
            for child in children :
                if(child.dep_ == "nsubj" and not child.is_stop):
                    A = child.text
                    # check_spelling(child.text)
                if(child.dep_ == "acomp" and not child.is_stop):
                    M = child.text
                # example - 'this could have been better' -> (this, not better)
                if(child.dep_ == "aux" and child.tag_ == "MD"):
                    neg_prefix = "not"
                    add_neg_pfx = True
                if(child.dep_ == "neg"):
                    neg_prefix = child.text
                    add_neg_pfx = True
            if (add_neg_pfx and M != "999999"):
                M = neg_prefix + " " + M
                    #check_spelling(child.text)
            if(A != "999999" and M != "999999"):
                rule3_pairs.append({'id':counter,'rule': 3, 'aspect': A, 'description': M, 'sentiment':TextBlob(M).sentiment})
        ## FOURTH RULE OF DEPENDANCY PARSE -
        ## M - Sentiment modifier || A - Aspect
        #Adverbial modifier to a passive verb - A is a child of something with relationship of nsubjpass, while
        # M is a child of the same something with relationship of advmod
        #Assumption - A verb will have only one NSUBJ and DOBJ

        for token in doc:
            children = token.children
            A = "999999"
            M = "999999"
            add_neg_pfx = False
            for child in children :
                if((child.dep_ == "nsubjpass" or child.dep_ == "nsubj") and not child.is_stop):
                    A = child.text
                    # check_spelling(child.text)
                if(child.dep_ == "advmod" and not child.is_stop):
                    M = child.text
                    M_children = child.children
                    for child_m in M_children:
                        if(child_m.dep_ == "advmod"):
                            M_hash = child_m.text
                            M = M_hash + " " + child.text
                            break
                    #check_spelling(child.text)
                if(child.dep_ == "neg"):
                    neg_prefix = child.text
                    add_neg_pfx = True
            if (add_neg_pfx and M != "999999"):
                M = neg_prefix + " " + M
            if(A != "999999" and M != "999999"):
                rule4_pairs.append({'id':counter,'rule': 4, 'aspect': A, 'description': M, 'sentiment':TextBlob(M).sentiment}) # )


        ## SIXTH RULE OF DEPENDANCY PARSE -
        ## M - Sentiment modifier || A - Aspect
        ## Example - "It ok", "ok" is INTJ (interjections like bravo, great etc)

        for token in doc:
            children = token.children
            A = "999999"
            M = "999999"
            if(token.pos_ == "INTJ" and not token.is_stop):
                for child in children :
                    if(child.dep_ == "nsubj" and not child.is_stop):
                        A = child.text
                        M = token.text
                        # check_spelling(child.text)
            if(A != "999999" and M != "999999"):
                rule6_pairs.append({'id':counter,'rule': 6,  'aspect': A, 'description': M, 'sentiment':TextBlob(M).sentiment})
        counter += 1

    aspects = rule0_pairs + rule1_pairs + rule3_pairs + rule4_pairs + rule6_pairs + rule7_pairs
    return pd.DataFrame(aspects)


In [352]:
def cleanAspects(listOfAspects):
    cleanedDF = []
    lemmatizer = WordNetLemmatizer()
    for id, rule, aspect, descriptor, sentiment in listOfAspects.values:
        if aspect != '' and descriptor != '': cleanedDF.append({
            'id':id, 'rule': rule, 'aspect': lemmatizer.lemmatize(aspect).lower(), 'description': descriptor, 'sentiment': TextBlob(descriptor).sentiment})
    rtn = pd.DataFrame(cleanedDF).drop_duplicates()
    return rtn.drop(rtn[rtn['sentiment'] == (0.0, 0.0)].index)

def countAspects(aspects):
    counter = Counter()
    for aspect in aspects['aspect']:
        counter[aspect] += 1
    return counter

In [353]:
# getting to know our data and importing it

with open('../../include/Hotel_Reviews.csv') as file:
    df = pd.read_csv(file)

onlyAddresses = df["Hotel_Address"]
onlyNames = df["Hotel_Name"]
counterHotels = Counter()

for name in onlyNames:
    counterHotels[name] += 1
    
hotels = []
for name in onlyNames.drop_duplicates():
    hotels.append(name)

print("Total number of hotels in data: ", len(hotels))

c = 1
tenMostCommon = []
for hotel in counterHotels.most_common(10):
    print("Most reviewed hotel #", c, ": ", hotel)
    tenMostCommon.append(hotel[0])
    c += 1


Total number of hotels in data:  1492
Most reviewed hotel # 1 :  ('Britannia International Hotel Canary Wharf', 4789)
Most reviewed hotel # 2 :  ('Strand Palace Hotel', 4256)
Most reviewed hotel # 3 :  ('Park Plaza Westminster Bridge London', 4169)
Most reviewed hotel # 4 :  ('Copthorne Tara Hotel London Kensington', 3578)
Most reviewed hotel # 5 :  ('DoubleTree by Hilton Hotel London Tower of London', 3212)
Most reviewed hotel # 6 :  ('Grand Royale London Hyde Park', 2958)
Most reviewed hotel # 7 :  ('Holiday Inn London Kensington', 2768)
Most reviewed hotel # 8 :  ('Hilton London Metropole', 2628)
Most reviewed hotel # 9 :  ('Millennium Gloucester Hotel London', 2565)
Most reviewed hotel # 10 :  ('Intercontinental London The O2', 2551)


In [354]:
aspectLocation = ['location', 'view', 'area', 'place', 'access', 'transport', 'walk', 
                    'parking', 'views', 'station', 'park', 'distance', 'centre', 'surroundings', 
                    'train', 'airport', 'shuttle', 'connection', 'rail', 'place' , 'sightseeing', 
                    'center', 'tram', 'bus', 'connectivity', 'wharf', 'pier', 'centre', 'center']

aspectProperty = ['building', 'atmosphere', 'lobby', 'tidy', 'carpet', 'balcony', 'floor', 
                    'interior', 'design', 'ambiance', 'stair', 'hall', 'staircase', 'door',
                    'chair', 'furnishing', 'corner', 'foyer', 'noise', 'furniture', 'window',
                    'decor', 'pool', 'lift', 'wall', 'windows', 'fan' , 'light', 'renovation',
                    'construction', 'property', 'surroundings', 'entrance', 'touch' , 'welcoming'
                    ,'spa']

aspectFood = ['food', 'restaurant', 'dinner', 'breakfast', 'lunch', 'bar', 'meal', 'buffet',   
                'drink', 'chef', 'beer' , 'cocktail', 'pizza', 'drinks', 'pizzeria', 'steak',
                'coffee', 'cocktails', 'restaurants', 'tea', 'meal' , 'croissant', 'egg',
                'juice', 'cup', 'milk', 'kettle', 'pot', 'fork', 'spoon', 'knife', 'plate',
                'table', 'chair', 'napkins', 'glass', 'bacon', 'cutlery']


aspectStaff = ['staff', 'service', 'upgrade', 'check', 'reception', 'receptionist', 'concierge',
                'lady', 'customer', 'guy', 'chef', 'welcome', 'job', 'staffs', 'man', 'queue', 'manager',
                'attitude', 'management', 'maid', 'woman', 'employee', 'employees', 'employe', 'member',
                'worker', 'workers', 'clerk', 'check-in']


aspectRoom = ['room', 'view', 'bed', 'price', 'size', 'stay', 'facility', 'bathroom', 'rooms',
                'shower', 'bath', 'bedroom', 'space', 'furniture', 'water', 'comfortable', 'rate', 'money',
                'temperature', 'tub', 'air', 'suite', 'towel', 'facilities', 'comfort', 'charge', 'discount',
                'maintenance', 'smell', 'tv', 'wifi', 'mirror', 'door', 'heating', 'cash', 'comfy',
                'airconditioning', 'conditioning', 'noise', 'internet', 'sheets', 'fan', 'cost', 'toilet',
                'socket', 'curtain', 'wallpaper', 'look', 'pressure', 'smoke', 'vent', 'phone', 'thermostat',
                'heater', 'toiletry', 'lock', 'usb', 'usb-port', 'gel', 'shampoo', 'roomservice', 'bathtub',
                'floor']


In [355]:
for hotelAddress in tenMostCommon:
    print("sentiment analysis for: ", hotelAddress, "\n")
    singleHotelDF = df.drop(df[df['Hotel_Name'] != hotelAddress].index)
    preprocessedData = preprocessing_reviews(singleHotelDF)

    extractedAspectsPos = extractAspects(preprocessedData["Positive_Review"])
    cleanedAspectsPos = cleanAspects(extractedAspectsPos)
    print("5 most mentioned positive aspects as pre-labled (aspect/count): \n",countAspects(cleanedAspectsPos).most_common(5), "\n")


    extractedAspectsNeg = extractAspects(preprocessedData["Negative_Review"])
    cleanedAspectsNeg = cleanAspects(extractedAspectsNeg)
    print("5 most mentioned negative aspects as pre-labled (aspect/count): \n",countAspects(extractedAspectsNeg).most_common(5), "\n")

    # now check with our own sentiment analysis
    aspectsTotal = pd.concat([extractedAspectsNeg, extractedAspectsPos])
    counterRoom = Counter()
    counterStaff = Counter()
    counterProperty = Counter()
    counterFood = Counter()
    counterLocation = Counter()
    sentRoom = 0
    sentStaff = 0
    sentProperty = 0
    sentFood = 0
    sentLocation = 0

    for id, rule, aspect, descripton, sentiment in aspectsTotal.values:
        if aspect in aspectRoom:
            counterRoom[aspect] += 1
            sentRoom += sentiment[0]
        if aspect in aspectStaff:
            counterStaff[aspect] += 1
            sentStaff += sentiment[0]
        if aspect in aspectProperty:
            counterProperty[aspect] += 1
            sentProperty += sentiment[0]
        if aspect in aspectFood:
            counterFood[aspect] += 1
            sentFood += sentiment[0]
        if aspect in aspectLocation:
            counterLocation[aspect] += 1
            sentLocation += sentiment[0]
            
    print("Mean sentiment score for average aspect room: ", sentRoom / sum(i for i in counterRoom.values()))
    print("Mean sentiment score for average aspect staff: ", sentStaff / sum(i for i in counterStaff.values()))
    print("Mean sentiment score for average aspect property: ", sentProperty / sum(i for i in counterProperty.values()))
    print("Mean sentiment score for average aspect food: ", sentFood / sum(i for i in counterFood.values()))
    print("Mean sentiment score for average aspect location: ", sentLocation / sum(i for i in counterLocation.values()))
    print("___________________________________________________________________________________")


sentiment analysis for:  Britannia International Hotel Canary Wharf 

5 most mentioned positive aspects as pre-labled (aspect/count): 
 [('location', 824), ('room', 740), ('staff', 628), ('view', 284), ('hotel', 237)] 

5 most mentioned negative aspects as pre-labled (aspect/count): 
 [('room', 623), ('hotel', 369), ('staff', 335), ('bed', 304), ('breakfast', 183)] 

Mean sentiment score for average aspect room:  0.12851115067472224
Mean sentiment score for average aspect staff:  0.14548336526861122
Mean sentiment score for average aspect property:  0.03196441760593247
Mean sentiment score for average aspect food:  0.10711360439341847
Mean sentiment score for average aspect location:  0.4358853015514519
___________________________________________________________________________________
sentiment analysis for:  Strand Palace Hotel 

5 most mentioned positive aspects as pre-labled (aspect/count): 
 [('location', 1559), ('staff', 966), ('room', 711), ('breakfast', 361), ('hotel', 268)] 



Presentation: Spatial-Mapping where hotels are // where guests are from
=> maybe what aspects are mentioned by primarily by guests by nationality or positive/negative in general

