In [1]:
%load_ext autoreload
%autoreload 2

### Importing libraries

In [2]:
import os
os.chdir("..")

# my modules
from Preprocessors.ReviewPreprocessor import ReviewPreprocessor
from Aspects.ExplicitAspectExtractor import ExplicitAspectExtractor
from Aspects.CoRefAspectIdentGrouping import CoRefAspectIdentGrouping

# pandas and numpy
import pandas as pd
import numpy as np

# spacy for NLP
import spacy

from time import time

#ignore pandas warning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

nlp = spacy.load("en_core_web_sm")

### Importing the data

The dataset used contains 3,000 reviews, extracted from a larger dataset of 50 million TripAdvisor reviews. [Link to the dataset](https://www.aclweb.org/anthology/2020.lrec-1.605)

The attributes of the resulting dataset are:
- **hotel_url**: link to the reviewed hotel;
- **author**: author of the review;
- **date**: publication date of the review;
- **rating**: score the author gave to the hotel;
- **title**: title of the review;
- **review**: text of the review.

In [3]:
data = pd.read_csv("data/trip_advisor_data_chunk_10000k.csv", encoding="utf-16")
data.rename(columns={"text": "review"}, inplace=True)
print(f"dataset shape: {data.shape}")
data.head(5)

dataset shape: (3000, 6)


Unnamed: 0,hotel_url,author,date,rating,title,review
0,Hotel_Review-g194775-d1121769-Reviews-Hotel_Ba...,Lagaiuzza,2016-01-01T00:00:00,5.0,"Baltic, what else?",We have spent in this hotel our summer holiday...
1,Hotel_Review-g194775-d1121769-Reviews-Hotel_Ba...,ashleyn763,2014-10-01T00:00:00,5.0,Excellent in every way!,I visited Hotel Baltic with my husband for som...
2,Hotel_Review-g194775-d1121769-Reviews-Hotel_Ba...,DavideMauro,2014-08-01T00:00:00,5.0,The house of your family's holiday,I've travelled quite a numbers of hotels but t...
3,Hotel_Review-g303503-d1735469-Reviews-Pousada_...,TwoMonkeysTravel,2017-03-01T00:00:00,5.0,Natural Luxury,"The property is surrounded by trees, which are..."
4,Hotel_Review-g303503-d1735469-Reviews-Pousada_...,analuizade,2016-09-01T00:00:00,5.0,Very cozy!,I had a very pleasant stay at this hotel! All ...


### Data preprocessing

#### Removing unnecessary characters (\n, \t, \r, links, #..., @..., emojis)

After analyzing the reviews, we noticed that the dataset contains words in Moroccan dialect (Darija). We therefore treat them as correctly spelled words.

In [5]:
preprocessor = ReviewPreprocessor(data['review'], spell_allowed_words= ["riad", "dar","rif"], nlp=nlp, subjectivity_threshold=0.6)
data['cleaned_review'] = preprocessor.remove_tags()
data['cleaned_review']

0       We have spent in this hotel our summer holiday...
1       I visited Hotel Baltic with my husband for som...
2       I've travelled quite a numbers of hotels but t...
3       The property is surrounded by trees, which are...
4       I had a very pleasant stay at this hotel! All ...
                              ...                        
2995    We stayed in Portland for three nights and thi...
2996    It was my third time to stay at University pla...
2997    Stayed here for 4 nights in March and I chose ...
2998    I didn't expect much from this hotel from the ...
2999    The hotel rooms were clean and comfortable. Ni...
Name: cleaned_review, Length: 3000, dtype: object
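
The cleaning step named in the heading of this subsection can be illustrated with a minimal sketch. The function `clean_review` and its regular expressions are hypothetical and are not the implementation of `ReviewPreprocessor.remove_tags`; the emoji pattern in particular covers only a few common Unicode blocks.

```python
import re

# Hypothetical patterns; the real ReviewPreprocessor may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
TAG_RE = re.compile(r"[#@]\w+")                 # hashtags and mentions
# Rough emoji coverage: common pictograph and symbol blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WS_RE = re.compile(r"\s+")                      # \n, \t, \r and repeated spaces


def clean_review(text: str) -> str:
    """Remove links, #tags, @mentions and emojis, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


print(clean_review("Great stay!\n\tSee https://example.com #travel @hotel \U0001F600"))
```

Links are removed first so that a `#fragment` inside a URL is not treated as a hashtag.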

#### Spelling correction

To correct spelling mistakes, we proceed as follows: we iterate over the reviews one at a time and extract the words of each review. We then use the **pyspellchecker** library, which relies on the Levenshtein distance to correct misspelled words.
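
To make the idea of edit-distance-based correction concrete, here is a self-contained sketch. It does not call pyspellchecker: `levenshtein`, `correct`, the toy vocabulary `VOCAB` and the `max_distance` parameter are all illustrative assumptions, and the sketch ignores word frequencies, which a real spell checker would typically use to rank candidates.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct(word: str, vocabulary: list, max_distance: int = 2) -> str:
    """Return the closest vocabulary word, or the word itself if none is near."""
    if word in vocabulary:
        return word  # known words (including allowed ones) are left untouched
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else word


# Toy vocabulary (assumption); "riad" stands in for an allowed dialect word.
VOCAB = ["we", "have", "spent", "in", "hotel", "riad"]
print([correct(w, VOCAB) for w in ["wee", "haavee", "spant", "ine", "riad"]])
```

Adding the dialect words to the vocabulary is what keeps them from being "corrected" into unrelated English words.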

To check the procedure, we modify the review at index 0:

**Review content**
'We have spent in this hotel our summer holidays both in summer 2014 and 2015- I was with my husband and my child ( 4 years old at present). I do really recommend this place- Staff si high qualified, Kind and really helpful- Animation staff get You involved, but always with discrection - Miniclub si super and activities offered are interesting and smart- Rooms clean, with AC and balcony- Restaurant offers a great selection of food - always. The beach si extremly closed to the hotel - Miniclub area offers some gazebos to have shade for kids- A lot of bicycles are available for free- I am completely satisfied of this hotel- Go in lime this!'

Modifications applied:
- We -> Wee
- have -> haavee
- spent -> spant
- in -> ine

In [6]:
data.loc[0, "review"] = 'Wee haavee spant ine this hotel our summer holidays both in summer 2014 and 2015- I was with my husband and my child ( 4 years old at present). I do really recommend this place- Staff si high qualified, Kind and really helpful- Animation staff get You involved, but always with discrection - Miniclub si super and activities offered are interesting and smart- Rooms clean, with AC and balcony- Restaurant offers a great selection of food - always. The beach si extremly closed to the hotel - Miniclub area offers some gazebos to have shade for kids- A lot of bicycles are available for free- I am completely satisfied of this hotel- Go in lime this!'
data.loc[0, "review"]

'Wee haavee spant ine this hotel our summer holidays both in summer 2014 and 2015- I was with my husband and my child ( 4 years old at present). I do really recommend this place- Staff si high qualified, Kind and really helpful- Animation staff get You involved, but always with discrection - Miniclub si super and activities offered are interesting and smart- Rooms clean, with AC and balcony- Restaurant offers a great selection of food - always. The beach si extremly closed to the hotel - Miniclub area offers some gazebos to have shade for kids- A lot of bicycles are available for free- I am completely satisfied of this hotel- Go in lime this!'

In [7]:
data['cleaned_review'] = preprocessor.spelling_correction()
data['cleaned_review'][0]

3000it [11:38,  4.30it/s]


'We have spent in this hotel our summer holidays both in summer 2014 and 2015- I was with my husband and my child ( 4 years old at present). I do really recommend this place- Staff si high qualified, Kind and really helpful- Animation staff get You involved, but always with discretion - minicab si super and activities offered are interesting and smart- Rooms clean, with AC and balcony- Restaurant offers a great selection of food - always. The beach si extremly closed to the hotel - minicab area offers some gazebos to have shade for kids- A lot of bicycles are available for free- I am completely satisfied of this hotel- Go in lime this!'

The four words modified in the review at index 0 are corrected, and 'discrection' becomes 'discretion'. However, 'si' and 'extremly' are left unchanged, and 'Miniclub' is wrongly replaced by 'minicab'.

#### Removing objective sentences

In [8]:
data['cleaned_review'] = preprocessor.remove_objective_sentences()
data['cleaned_review'][0]

3000it [01:40, 29.93it/s]


'I was with my husband and my child ( 4 years old at present). Rooms clean, with AC and balcony- Restaurant offers a great selection of food - always. A lot of bicycles are available for free- I am completely satisfied of this hotel- Go in lime this!'

Sentences removed from review 0 include:
- We have spent in this hotel our summer holidays both in summer 2014 and 2015
- Staff si high qualified, Kind and really helpful- Animation staff get You involved, but always with discrection
- The beach si extremly closed to the hotel - minicab area offers some gazebos to have shade for kids-


In [10]:
from textblob import TextBlob
TextBlob("Staff si high qualified, Kind and really helpful- Animation staff get You involved, but always with discrection").subjectivity

0.5466666666666666

In [12]:
TextBlob("Staff si high qualified, Kind and really helpful").subjectivity

0.5466666666666666

In [14]:
TextBlob("Animation staff get You involved, but always with discrection").subjectivity

0.0

In [11]:
TextBlob("I was with my husband and my child ( 4 years old at present)").subjectivity

0.1

**Remarks:**
- The second sentence is removed because its subjectivity score (about 0.55) is below the 0.6 threshold.
- There is a problem in the preprocessing step. In this sentence, the '-' symbol is attached to the adjective 'helpful': the author wrote the review as a list of sentences, and when \n is removed, each '-' ends up concatenated to the word that precedes it.
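
One possible way to mitigate the '-' issue described in the remarks is to treat list-style dashes as sentence boundaries before scoring subjectivity. The sketch below is an assumption about how this could be done, not part of `ReviewPreprocessor`; `split_dash_list` and its pattern are hypothetical names.

```python
import re

# A dash followed by whitespace ("helpful- Animation", "discretion - Rooms")
# is treated as a boundary; hyphens inside words ("well-known") are kept.
DASH_BOUNDARY = re.compile(r"\s*-\s+")


def split_dash_list(text: str) -> list:
    """Split a dash-delimited list of statements into separate segments."""
    parts = DASH_BOUNDARY.split(text)
    return [p.strip() for p in parts if p.strip()]


sample = "Kind and really helpful- Animation staff get You involved - Rooms clean"
for segment in split_dash_list(sample):
    print(segment)
```

Each segment could then be scored on its own, so that a neutral statement does not share a score with the opinion next to it.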

### Extracting explicit aspects

In this phase, we extract the most frequent nouns as explicit aspects.
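
The frequency count itself can be sketched without any NLP dependency, assuming each review has already been turned into `(lemma, part_of_speech)` pairs (with spaCy these would come from `token.lemma_` and `token.pos_`). The helper `most_frequent_nouns` and the toy input are hypothetical and do not reproduce `ExplicitAspectExtractor`.

```python
from collections import Counter


def most_frequent_nouns(tagged_reviews, top_n):
    """Count noun lemmas over all reviews and return the top_n most common."""
    counts = Counter(
        lemma.lower()
        for review in tagged_reviews
        for lemma, pos in review
        if pos == "NOUN"
    )
    return counts.most_common(top_n)


# Toy, pre-tagged input (assumption): one list of (lemma, POS) pairs per review.
tagged = [
    [("room", "NOUN"), ("be", "AUX"), ("clean", "ADJ")],
    [("staff", "NOUN"), ("clean", "VERB"), ("room", "NOUN")],
]
print(most_frequent_nouns(tagged, 2))
```

Counting lemmas rather than surface forms merges 'room' and 'rooms' into one aspect.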

In [159]:
now = time()
aspect_extractor = ExplicitAspectExtractor(data["cleaned_review"], nlp)
extracted_aspects = aspect_extractor.start(60)
print(extracted_aspects)
print(f"extracting aspects {time() - now}s")

[('room', 9359), ('hotel', 7322), ('staff', 4591), ('breakfast', 3410), ('food', 2681), ('place', 2588), ('day', 2308), ('restaurant', 2263), ('time', 2261), ('night', 2221), ('location', 2046), ('service', 1904), ('bed', 1856), ('area', 1726), ('pool', 1480), ('beach', 1374), ('stay', 1262), ('water', 1238), ('resort', 1161), ('bathroom', 1089), ('view', 1085), ('bar', 1036), ('minute', 994), ('price', 991), ('people', 932), ('shower', 930), ('lot', 925), ('guest', 865), ('thing', 854), ('kid', 852), ('dinner', 852), ('experience', 839), ('way', 836), ('bit', 820), ('family', 781), ('desk', 776), ('trip', 753), ('morning', 726), ('floor', 710), ('meal', 710), ('parking', 695), ('door', 686), ('coffee', 673), ('city', 624), ('riad', 595), ('drink', 573), ('evening', 568), ('hour', 556), ('year', 553), ('reception', 552), ('child', 551), ('air', 543), ('street', 541), ('choice', 533), ('problem', 521), ('walk', 512), ('town', 499), ('quality', 491), ('car', 478), ('facility', 467)]
extr

We observe that the most frequent aspects include:
- room
- hotel
- staff
- location
- breakfast
- time
- restaurant
- service
- stay

...

### Grouping similar aspects

Once the explicit aspects are extracted, the goal of this step is to group the explicit aspects that have a similar meaning, for example the group (food, breakfast, dinner).

To achieve this, we propose using the **Word2Vec** model, trained on the sentences of the reviews. Then, for each pair of explicit aspects, we compute their similarity using the trained model.
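
The pairwise comparison can be sketched as follows, assuming the similarity is the cosine between word vectors (which is what Word2Vec implementations usually report). The three-dimensional `toy_vectors` are invented for illustration and stand in for vectors learned from the reviews; `similar_pairs` is a hypothetical helper, and the 0.68 threshold only echoes the value passed later in this notebook.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(vectors, threshold):
    """Return every pair of aspects whose similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(vectors, 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    ]


# Invented vectors (assumption); a trained model would supply the real ones.
toy_vectors = {
    "food": [0.9, 0.1, 0.0],
    "breakfast": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
print(similar_pairs(toy_vectors, 0.68))
```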

We also compute the co-occurrence matrix between the explicit aspects and the words that express a sentiment (adjectives). This matrix will be used to extract the implicit aspects.
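
A minimal sketch of such a co-occurrence count, assuming sentence-level co-occurrence over already tokenized, lower-cased sentences. The function `co_occurrence` and the toy data are hypothetical; how `CoRefAspectIdentGrouping.get_co_occurrence_matrix` actually defines co-occurrence is not shown in this notebook.

```python
from collections import defaultdict


def co_occurrence(sentences, aspects, sentiment_words):
    """Count, per sentiment word, the aspects appearing in the same sentence."""
    matrix = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        present = set(tokens)  # each pair is counted once per sentence
        for word in present & sentiment_words:
            for aspect in present & aspects:
                matrix[word][aspect] += 1
    return matrix


# Toy input (assumption): one list of tokens per sentence.
sentences = [
    ["the", "room", "be", "good"],
    ["good", "staff", "and", "good", "room"],
    ["excellent", "staff"],
]
matrix = co_occurrence(sentences, {"room", "staff"}, {"good", "excellent"})
print({word: dict(row) for word, row in matrix.items()})
```

Rows are sentiment words and columns are aspects, matching the layout of the matrix displayed in this notebook.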

#### Building the co-occurrence matrix between explicit aspects and sentiment words

In [160]:
aspects_ = list(dict(extracted_aspects).keys())

In [161]:
co_ref_aspect_ident_grouping = CoRefAspectIdentGrouping(data[["review", "cleaned_review"]], dict(extracted_aspects), nlp)
aspect_sentiment = co_ref_aspect_ident_grouping.get_co_occurrence_matrix()
aspect_sentiment

Unnamed: 0,room,hotel,staff,breakfast,food,place,day,restaurant,time,night,...,child,air,street,choice,problem,walk,town,quality,car,facility
available,16.0,6.0,5.0,10.0,4.0,1.0,6.0,1.0,1.0,2.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2.0,0.0
satisfied,2.0,3.0,1.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
amazed,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
good,187.0,160.0,110.0,187.0,138.0,72.0,33.0,102.0,19.0,34.0,...,8.0,9.0,4.0,25.0,8.0,10.0,14.0,26.0,9.0,28.0
excellent,56.0,57.0,75.0,57.0,64.0,13.0,10.0,47.0,12.0,9.0,...,3.0,1.0,1.0,13.0,1.0,6.0,3.0,17.0,2.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
damaged,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
female,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
satisfying,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
realistic,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Grouping explicit aspects

The proposed procedure builds pairs of similar explicit aspects.

Next, we apply an algorithm that recursively merges the pairs sharing an element into groups.
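
Merging pairs that share an element amounts to finding connected components. The sketch below uses an iterative union-find rather than explicit recursion, which is a substitution of technique and not necessarily how `get_co_reference_aspects_groups` works; `group_pairs` and its `singletons` parameter are hypothetical names.

```python
def group_pairs(pairs, singletons=()):
    """Merge pairs sharing an element into groups (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two groups
    for item in singletons:
        find(item)                 # register aspects that have no similar pair

    groups = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


pairs = [("food", "breakfast"), ("breakfast", "dinner"), ("kid", "child")]
print(group_pairs(pairs, singletons=["car"]))
```

Because merging is transitive, a single borderline pair can chain unrelated aspects into one large group, so the similarity threshold has a strong effect on the result.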

Grouping result:

In [163]:
co_ref_aspect_ident_grouping.get_co_reference_aspects_groups(0.68)

{'kid': ['kid', 'child', 'family'],
 'location': ['street', 'location', 'town', 'city'],
 'room': ['air', 'room', 'bathroom', 'water', 'bed', 'shower'],
 'people': ['people', 'guest'],
 'desk': ['reception', 'desk'],
 'breakfast': ['quality',
  'time',
  'morning',
  'evening',
  'dinner',
  'night',
  'coffee',
  'drink',
  'day',
  'hour',
  'choice',
  'food',
  'breakfast',
  'meal'],
 'place': ['place'],
 'minute': ['minute'],
 'car': ['car'],
 'problem': ['problem'],
 'view': ['view'],
 'walk': ['walk'],
 'floor': ['floor'],
 'price': ['price'],
 'year': ['year'],
 'restaurant': ['restaurant', 'bar'],
 'beach': ['beach'],
 'thing': ['thing'],
 'parking': ['parking'],
 'trip': ['trip'],
 'experience': ['experience'],
 'hotel': ['hotel'],
 'service': ['service'],
 'pool': ['pool'],
 'area': ['area'],
 'way': ['way'],
 'stay': ['stay'],
 'door': ['door'],
 'resort': ['resort', 'riad'],
 'lot': ['lot'],
 'bit': ['bit'],
 'facility': ['facility'],
 'staff': ['staff']}

### Extracting implicit aspects

The proposed procedure is inspired by article [2]. First, we identify the subjective sentences that do not contain an explicit aspect; then we extract the adjectives (sentiment words) in each such sentence and look up the explicit aspects that most frequently co-occur with each adjective.
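
The lookup at the heart of this procedure can be sketched with plain dictionaries, assuming the co-occurrence matrix is available as a mapping from sentiment word to aspect counts. The helper `implicit_aspects` is hypothetical; the counts in `toy_matrix` are copied from the first three columns of the matrix displayed earlier and ignore all other columns.

```python
def implicit_aspects(adjectives, co_occurrence):
    """For each known adjective, pick the aspect it co-occurs with most often."""
    found = []
    for adjective in adjectives:
        row = co_occurrence.get(adjective)
        if row:  # adjectives absent from the matrix are skipped
            found.append(max(row, key=row.get))
    return found


# Excerpt of the displayed matrix (first three aspect columns only).
toy_matrix = {
    "excellent": {"room": 56, "hotel": 57, "staff": 75},
    "satisfied": {"room": 2, "hotel": 3, "staff": 1},
}
print(implicit_aspects(["excellent", "unknown", "satisfied"], toy_matrix))
```

Since the choice depends only on global counts, the same adjective always maps to the same aspect regardless of the sentence it appears in.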

In [164]:
aspects = list(dict(extracted_aspects).keys())

In [228]:
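from spacy.matcher import Matcher  # Matcher was used without being imported

# match single adjectives, used here as sentiment words
pattern = [{"POS": "ADJ"}]
matcher = Matcher(nlp.vocab)
matcher.add("SENTIMENT_WORDS", [pattern])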

for id_, review in data["cleaned_review"].items():
    doc = nlp(review)
    for sentence in doc.sents:
        aspects_in_sentence = [i for i in aspects if i in sentence.text.lower()]
        if len(aspects_in_sentence) == 0:
            print(f"review id : {id_}")
            # extract ADJ
            adjs = matcher(sentence)
            for id_matcher, start, end in adjs:
                sentiment_word = sentence[start:end].text
                if sentiment_word in list(aspect_sentiment.index):
                    implicit_aspect = aspect_sentiment.loc[sentiment_word].sort_values(ascending=False).index[0]
                    print(f"sentiment word : {sentiment_word}, aspect : {implicit_aspect}")
                    print(sentence.text)
            print("=========")

review id : 1
sentiment word : wrong, aspect : room
I was so wrong.
review id : 1
sentiment word : amazing, aspect : staff
It was truly amazing.
review id : 2
sentiment word : excellent, aspect : staff
The equipe is really excellent.
review id : 5
sentiment word : super, aspect : room
They have super cute monkeys (small ones, the size of a cat) leaving on the trees, everyone loved them!
sentiment word : cute, aspect : hotel
They have super cute monkeys (small ones, the size of a cat) leaving on the trees, everyone loved them!
sentiment word : small, aspect : room
They have super cute monkeys (small ones, the size of a cat) leaving on the trees, everyone loved them!
review id : 5
sentiment word : great, aspect : room
It didn't bother us and we had great sleep.
review id : 8
sentiment word : beautiful, aspect : room
It's just beautiful.
review id : 10
review id : 11
sentiment word : lovely, aspect : hotel
Homemade cakes made by the lovely owner Cinzia who makes also nice fresh eggs taken

In [212]:
from Aspects.ImplicitAspectExtractor import ImplicitAspectExtractor

In [229]:
implicit_aspect_extractor = ImplicitAspectExtractor(data["cleaned_review"], aspect_sentiment, nlp)
implicit_aspect_extractor.extract_implicit_aspects()

3000it [01:16, 39.20it/s]


Unnamed: 0,review_id,sentence,implicit_aspects
0,1,I was so wrong.,[room]
1,1,It was truly amazing.,[staff]
2,2,The equipe is really excellent.,[staff]
3,5,"They have super cute monkeys (small ones, the ...","[room, hotel, room]"
4,5,It didn't bother us and we had great sleep.,[room]
...,...,...,...
1335,2981,Yummy fresh cookie and warm welcome.,"[breakfast, staff]"
1336,2987,Moderate accommodations with moderate cost..,[price]
1337,2992,Walls were pretty thin.,[room]
1338,2995,That was pretty pathetic.,[hotel]


After applying the proposed algorithm, we obtain some good extractions.

Unfortunately, some implicit aspects are misidentified.

**Remarks:**
- The proposed method for extracting implicit aspects does not analyze the context of the sentence, which is why some aspects are misidentified, as in sentences 1545, 1050 and 2682.
- In sentence 1050, the aspect 'staff' is mentioned explicitly through the word 'employee', yet the implicit aspects 'hotel' and 'breakfast' are identified.
- Likewise for 2904: the explicit aspect is 'value', yet the implicit aspect 'staff' is identified.

**References:**
- [1] Multiaspect‐based opinion classification model for tourist reviews
- [2] A Hybrid Co‑occurrence and Ranking‑based Approach for Detection of Implicit Aspects in Aspect‑Based Sentiment Analysis