In [1]:
%load_ext autoreload
%autoreload 2

### Importing libraries

In [2]:
from tqdm import tqdm
from collections import Counter, defaultdict
import os
os.chdir("..")

# my modules
from Preprocessors.ReviewPreprocessor import ReviewPreprocessor
from Aspects.ExplicitAspectExtractor import ExplicitAspectExtractor
from Aspects.CoRefAspectIdentGrouping import CoRefAspectIdentGrouping

# pandas and numpy
import pandas as pd
import numpy as np

# spacy for NLP
import spacy
from spacy.matcher import Matcher
from spacy import displacy

from time import time

# ignore pandas FutureWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# plots
import matplotlib.pyplot as plt
import seaborn as sns

nlp = spacy.load("en_core_web_sm")

In [4]:
data = pd.read_csv("data/all_datasets_cleaned.csv")
data.rename(columns={"text": "review"}, inplace=True)
print(f"dataset shape: {data.shape}")
data.head(5)

dataset shape: (4067, 7)


| Unnamed: 0 | listing_name | listing_score | username | review_score | review_title | review | cleaned_review |
|---|---|---|---|---|---|---|---|
| 0 | Hotel de la Paix Tour Eiffel | 4.5 | Casey V | 5.0 | Charming, clean, and GREAT service! | I was at this hotel last week in a single room... | of course, the hotel is small, and my room wa... |
| 1 | Hotel de la Paix Tour Eiffel | 4.5 | AHLife93 | 5.0 | Best hotel | We enjoyed our stay. The staff was friendly an... | we enjoyed our stay. the staff was friendly a... |
| 2 | Hotel de la Paix Tour Eiffel | 4.5 | Katie Anne | 5.0 | Great Hotel! | I had a fantastic stay at Hotel de la Paix! I ... | i had a fantastic stay at hotel de la paid! i... |
| 3 | Hotel de la Paix Tour Eiffel | 4.5 | pollybrown67 | 5.0 | Paris April 2022 | Upon our arrival to the hotel, we received a w... | upon our arrival to the hotel, we received a ... |
| 4 | Hotel de la Paix Tour Eiffel | 4.5 | Vianey G | 5.0 | Great hotel | Great customer services all Front Desk agents ... | great customer services all front desk agents... |


### Data preprocessing

#### Removing unneeded characters (\n, \t, \r, hyperlinks, #..., @..., emojis)

After analyzing the reviews, we noticed that the dataset contains words in Moroccan dialect (Darija). We therefore treat them as correctly spelled words.

In [5]:
data["cleaned_review"] = data['review']
preprocessor = ReviewPreprocessor(data['cleaned_review'], nlp=nlp, subjectivity_threshold=0.6)
data['cleaned_review'] = preprocessor.remove_tags()
data['cleaned_review']

0       I was at this hotel last week in a single room...
1       We enjoyed our stay. The staff was friendly an...
2       I had a fantastic stay at Hotel de la Paix! I ...
3       Upon our arrival to the hotel, we received a w...
4       Great customer services all Front Desk agents ...
                              ...                        
4062    This hotel would be okay for a short city trip...
4063    Stayed at this hotel during valentines weekend...
4064    Avoid this awful, cold space where they put yo...
4065    This hotel is full of young people having part...
4066    Water and electricity weren't working for two ...
Name: cleaned_review, Length: 4067, dtype: object
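
The internals of `ReviewPreprocessor.remove_tags()` are not shown in this notebook. As a rough illustration of the kind of cleaning described in the heading above, here is a minimal sketch built on regular expressions. The function name `remove_tags_sketch` and the patterns are assumptions made for this example, not the module's actual implementation, and the emoji ranges are deliberately approximate.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")          # hyperlinks
TAG_RE = re.compile(r"[#@]\w+")                        # hashtags and mentions
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # common emoji blocks
WHITESPACE_RE = re.compile(r"\s+")                     # \n, \t, \r and repeated spaces


def remove_tags_sketch(text):
    """Strip links, #tags, @mentions and emojis, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


raw = "Great stay!\n\tSee https://example.com #paris @hotel \U0001F600 loved it\r"
print(remove_tags_sketch(raw))
# → Great stay! See loved it
```

Links are removed before hashtags and mentions so that a `#fragment` inside a URL is not left behind as a stray token.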

#### Converting to lowercase

In [6]:
# vectorized lowercase conversion of every review
data["cleaned_review"] = data["cleaned_review"].str.lower()

i was at this hotel last week in a single room, and really enjoyed my experience!! of course, the hotel is small, and my room was very small... but who is staying inside the room while on a trip to paris!?! the bed was comfortable, shower had hot water and good water pressure, breakfast was simple but delicious, and overall -everything was clean, and very nice. but the best part of my experience by far was the charming and kind front desk host named zied. from the moment i arrived until he arranged my taxi to the airport, he was extremely helpful. he called me by name, ensured everything was to my liking, smiled and chatted many times, and made my short stay feel personal and wonderful. i would definitely stay at this hotel again!

#### Spelling correction

To correct spelling mistakes, we use the following procedure: first, we iterate over the reviews one by one. Next, we extract the words of the review being processed. We then use the **pyspellchecker** library, which relies on the Levenshtein distance to correct misspelled words.
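
To make the idea concrete, the following is a minimal pure-Python sketch of distance-based correction. It is not the pyspellchecker implementation: the tiny `vocabulary` set, the helper names (`levenshtein`, `correct_word`, `correct_text`) and the `max_distance` cutoff are assumptions made for this example, and the sketch ignores word frequency, which a real spell checker would typically also take into account.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between two strings."""
    # previous[j] is the distance between the current prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def correct_word(word, vocabulary, max_distance=2):
    """Return the closest known word, or the word itself if nothing is close."""
    if word in vocabulary:
        return word
    # sorted() makes tie-breaking deterministic (alphabetical order)
    best = min(sorted(vocabulary), key=lambda known: levenshtein(word, known))
    return best if levenshtein(word, best) <= max_distance else word


def correct_text(text, vocabulary):
    """Correct a whitespace-separated text word by word."""
    return " ".join(correct_word(w, vocabulary) for w in text.split())


# Hypothetical toy vocabulary; a real checker ships a full dictionary.
vocabulary = {"i", "was", "at", "this", "hotel", "last",
              "week", "in", "a", "single", "room"}

noisy = "i wass at thhiss hotel lasstt week in a single riom"
print(correct_text(noisy, vocabulary))
# → i was at this hotel last week in a single room
```

The `max_distance` cutoff keeps words that are far from every dictionary entry (proper nouns, dialect words) unchanged instead of forcing them onto an unrelated match.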

To verify the procedure, we modify the review at index 0:

**Review content**
i was at this hotel last week in a single room, and really enjoyed my experience!! of course, the hotel is small, and my room was very small... but who is staying inside the room while on a trip to paris!?! the bed was comfortable, shower had hot water and good water pressure, breakfast was simple but delicious, and overall -everything was clean, and very nice. but the best part of my experience by far was the charming and kind front desk host named zied. from the moment i arrived until he arranged my taxi to the airport, he was extremely helpful. he called me by name, ensured everything was to my liking, smiled and chatted many times, and made my short stay feel personal and wonderful. i would definitely stay at this hotel again!

Modifications applied:
- was -> wass
- this -> thhiss
- last -> lasstt
- room -> riom
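
The substitutions listed above can also be applied programmatically rather than by retyping the whole review. The sketch below assumes a hypothetical helper, `inject_typos`, that replaces only the first whole-word occurrence of each target; it is an illustration, not part of the project's modules.

```python
import re

# Mapping taken from the list above: correct word -> injected misspelling.
typos = {"was": "wass", "this": "thhiss", "last": "lasstt", "room": "riom"}


def inject_typos(text, typos):
    """Replace the first whole-word occurrence of each key with its misspelling."""
    for word, misspelled in typos.items():
        text = re.sub(rf"\b{re.escape(word)}\b", misspelled, text, count=1)
    return text


sample = "i was at this hotel last week in a single room, and really enjoyed my experience!!"
print(inject_typos(sample, typos))
# → i wass at thhiss hotel lasstt week in a single riom, and really enjoyed my experience!!
```

Limiting each replacement to the first occurrence leaves later instances of the same word intact, so the corrected output can be compared against the original review.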

In [8]:
data.loc[0, "cleaned_review"] = 'i wass at thhiss hotel lasstt week in a single riom, and really enjoyed my experience!! of course, the hotel is small, and my room was very small... but who is staying inside the room while on a trip to paris!?! the bed was comfortable, shower had hot water and good water pressure, breakfast was simple but delicious, and overall -everything was clean, and very nice. but the best part of my experience by far was the charming and kind front desk host named zied. from the moment i arrived until he arranged my taxi to the airport, he was extremely helpful. he called me by name, ensured everything was to my liking, smiled and chatted many times, and made my short stay feel personal and wonderful. i would definitely stay at this hotel again!'

'i was at this hotel last week in a single room, and really enjoyed my experience!! of course, the hotel is small, and my room was very small... but who is staying inside the room while on a trip to paris!?! the bed was comfortable, shower had hot water and good water pressure, breakfast was simple but delicious, and overall -everything was clean, and very nice. but the best part of my experience by far was the charming and kind front desk host named zied. from the moment i arrived until he arranged my taxi to the airport, he was extremely helpful. he called me by name, ensured everything was to my liking, smiled and chatted many times, and made my short stay feel personal and wonderful. i would definitely stay at this hotel again!'

**Review content**
"we enjoyed our stay. the staff was friendly and helpful in helping us navigate paris. the hotel is about three minutes walk from the eiffel tower, 6 minute walk from the subway, and has a lot of restaurants and shops. the rooms were beautiful and they were very, very clean, which is the most important thing to me when i travel. i'd recommend to anyone who is looking for a nice, clean hotel that's not right in the middle of all of the tourist traffic."

Modifications applied:
- enjoyed -> enjoyedd
- stay -> stayy
- friendly -> friienddly
- helpful -> helppful

In [9]:
data.loc[1, "cleaned_review"]

"we enjoyed our stay. the staff was friendly and helpful in helping us navigate paris. the hotel is about three minutes walk from the eiffel tower, 6 minute walk from the subway, and has a lot of restaurants and shops. the rooms were beautiful and they were very, very clean, which is the most important thing to me when i travel. i'd recommend to anyone who is looking for a nice, clean hotel that's not right in the middle of all of the tourist traffic."

**Review content**
"i had a fantastic stay at hotel de la paix! i had spent a couple days at disneyland with my family, and then needed to find a great hotel for my one solo day in the city of paris. i made a great choice! the staff was very friendly and helpful! they assisted in checking in early, and helped me book a car to the airport. the room was nice and clean, and the hotel is very conveniently located! i was able to walk everywhere. eiffel tower, rue cler (great cafes), louvre, saint-germain area, etc. all within walking distance! merci! -katie anne"

Modifications applied:
- fantastic -> fantaasticc
- at -> att
- spent -> speent

In [10]:
data.loc[2, "cleaned_review"]

'i had a fantastic stay at hotel de la paix! i had spent a couple days at disneyland with my family, and then needed to find a great hotel for my one solo day in the city of paris. i made a great choice! the staff was very friendly and helpful! they assisted in checking in early, and helped me book a car to the airport. the room was nice and clean, and the hotel is very conveniently located! i was able to walk everywhere. eiffel tower, rue cler (great cafes), louvre, saint-germain area, etc. all within walking distance! merci! -katie anne'