# Feature extraction
## List of features and the fields they are based on
- review identifier (`id_review`): drop it and use the row index instead
- hotel identifier (`hotel_id`)
- location (`user_location`)
    - dict
- rating (`rating`)
    - as is ($[0,5]$)
- user name (`user_pseudo`)
    - as is (text) or label-encoded
- expertise
    - number of reviews per user
    - (log of the number of reviews per user)
    - tanh of the number of reviews per user
- review title (`review_title`) & review text (`review_text`)
    - sentiment analysis, based on the mean score over the recognized words
        - positivity score
        - negativity score
        - objectivity score
    - keywords: ???
- posting date (`date_review`) & stay date (`date_stayed`)
    - gap between stay and review; gap between stay and the "current" date (most recent date in the dataset)

In [53]:
#imports
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt, numpy as np
from sklearn import preprocessing
from IPython.display import clear_output
import re

In [56]:
#paths
DATASET_PATH = "./data2.csv"
COLUMNS = ["id_review", "rating", "review_title", "review_text", "user_pseudo", "user_location", "hotel_id", "date_stayed", "date_review"]

In [66]:
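# One record per line: nine fields separated by ";;", title wrapped in curly quotes.
# The dot in the rating is escaped so that it only matches a literal ".".
RE = re.compile(r"^(\d+);;(\d\.\d);;“(.*)”;;(.*);;(.*);;(.*);;(\d*);;(.*);;(.*)$")
# Example record, shown one field per line: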
#147639004;;
#5.0;;
#“My home away from home!”;;
#On every [...] for every meal!);;
#Maureen V;;
#Sydney, New South Wales, Australia;;
#93338;;
#December 2012;;
#December 17, 2012
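To make the record layout concrete, here is a minimal sketch that applies a pattern of this shape to a single line. The sample line is made up for illustration (it is not taken from the dataset), and `RECORD_RE` and `parse_line` are hypothetical names.

```python
import re

COLUMNS = ["id_review", "rating", "review_title", "review_text", "user_pseudo",
           "user_location", "hotel_id", "date_stayed", "date_review"]

# Same layout as the notebook's pattern: nine fields separated by ";;",
# with the title wrapped in curly quotes.
RECORD_RE = re.compile(
    r"^(\d+);;(\d\.\d);;“(.*)”;;(.*);;(.*);;(.*);;(\d*);;(.*);;(.*)$"
)

def parse_line(line):
    """Return a dict mapping column name to raw string field, or None if no match."""
    match = RECORD_RE.fullmatch(line.strip())
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))

# Made-up sample record, for illustration only.
sample = ("123;;4.0;;“Nice stay”;;Clean room.;;some_user;;"
          "Paris, France;;42;;May 2012;;June 3, 2012\n")
record = parse_line(sample)
```

All fields come back as strings. Because the `(.*)` groups are greedy, a review text that itself contains `;;` could shift the field boundaries, so spot-checking a few parsed rows is probably worthwhile.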

In [74]:
with open(DATASET_PATH, 'r', encoding="utf8") as f:
    data = []
    broken_lines = 0
    for line in f:
        match = RE.fullmatch(line.strip())
        if match:
            fields = match.groups()

            # sanity check: the correct number of fields was found
            if len(fields) == len(COLUMNS):
                data.append({column: field for column, field in zip(COLUMNS, fields)})
            else:
                broken_lines += 1
                print("Unexpected number of matched fields on this line:")
                print(line)
        else:
            broken_lines += 1
            print("No match on this line:")
            print(line)
            
    
    print(f"{len(data)} correctly parsed lines, {broken_lines} incorrectly parsed lines")
#df = pd.read_csv(DATASET_ARCHIVE_PATH, sep='\;\;', names=COLUMNS, header=None, error_bad_lines=False)
#clear_output()
df = pd.DataFrame.from_records(data)
print(f"Data loading finished: {len(data)} correctly parsed lines, {broken_lines} incorrectly parsed lines")

No match on this line:
{"ratings": {"overall": 1.0, "service": 1.0}, "title": "\u201cThis is a Shady Unprofessional Place\u201d", "text": "Please note that I have not stayed here yet. My experience is based on the pre stay experience I had with this company. My stay was supposed to be Oct 5-11 2008 I booked this Loft because 1st- was in Soho, 2nd the pictures looked good, 3rd the price was good, 4- I was contacted immediately. I never should have done it. They require you prepay on a credit card and sign a \"lease.\" I should have known better by the tone of the emails. Unprofessional, Lacking in tact or intelligently written. The terms of the contract are short. Basically (but vaguely) It says cancelation policy is within outside of 6 weeks prior to the stay a refund of 75% will be given back if re-rented. I didn't think we'd cancel. Well, after reading all of the horrible reviews on Trip Advisor and other sites - I canceled. I immediately emailed the 2 people I dealt with told them a

In [75]:
df.head(3)

Unnamed: 0,id_review,rating,review_title,review_text,user_pseudo,user_location,hotel_id,date_stayed,date_review
0,147643103,5.0,"Truly is ""Jewel of the Upper Wets Side""",Stayed in a king suite for 11 nights and yes i...,Papa_Panda,Gold Coast,93338,December 2012,"December 17, 2012"
1,147639004,5.0,My home away from home!,"On every visit to NYC, the Hotel Beacon is the...",Maureen V,"Sydney, New South Wales, Australia",93338,December 2012,"December 17, 2012"
2,147697954,4.0,Great Stay,This is a great property in Midtown. We two di...,vuguru,Houston,1762573,December 2012,"December 18, 2012"


In [None]:
def process_id_review(df):
    """Drop `id_review`; the row index is used as the identifier instead."""
    # "id_review"
    return df.drop(columns=["id_review"]).reset_index(drop=True)

In [77]:
def process_hotel_id(df, nan_value="UNK"):
    """Label-encode the hotel identifiers."""
    # "hotel_id"
    # Handle missing values (NaN)
    hotel_ids = df['hotel_id']
    print(hotel_ids.isna().sum())
    # fillna returns a new Series, so the result has to be assigned back
    hotel_ids = hotel_ids.fillna(nan_value)

    # Fit the LabelEncoder on the hotel identifiers
    label_encoder = preprocessing.LabelEncoder()
    label_encoder.fit(hotel_ids.values)

    # Encode the hotels with the fitted LabelEncoder
    hotel_ids_encoded = label_encoder.transform(hotel_ids.values)

    return hotel_ids_encoded, label_encoder
process_hotel_id(df)

0


(array([3468, 3468, 1283, ..., 3173, 3173, 3173]), LabelEncoder())

In [None]:
def process_user_location(df):
    """Uses a """
    # "user_location"
    # convert to a dictionary

In [None]:
def process_rating(df):
    "rating"
    # as is ([1,5])

In [None]:
def process_user_pseudo(df):
    "user_pseudo"
    # user name
    # as is (text) or label-encoded

In [None]:
def process_expertise(df):
    "user_pseudo"
    "review_number"
    # number of reviews per user
    # (log of the number of reviews per user)
    # tanh of the number of reviews per user
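The three expertise variants listed in this stub could be computed along these lines. This is only a sketch: `expertise_features` and the output column names are hypothetical, the natural logarithm is an assumption (the notebook does not say which base is intended), and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd

def expertise_features(df):
    """Per-row review count of the row's author, plus log and tanh transforms."""
    # Number of reviews written by the author of each row.
    counts = df["user_pseudo"].map(df["user_pseudo"].value_counts())
    return pd.DataFrame({
        "review_count": counts,
        "review_count_log": np.log(counts),    # counts >= 1, so the log is >= 0
        "review_count_tanh": np.tanh(counts),  # squashed into (0, 1)
    }, index=df.index)

# Toy data, made up for illustration.
toy = pd.DataFrame({"user_pseudo": ["ann", "bob", "ann", "ann"]})
features = expertise_features(toy)
```

Note that tanh saturates quickly (tanh(3) is already about 0.995), so dividing the count by some scale before applying it may be worth considering if users with many reviews need to stay distinguishable.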

In [None]:
def process_text(df):
    "review_title"
    "review_text"
    # review title (review_title) & review text (review_text)
    # sentiment analysis based on the mean score over the recognized words
    #   positivity score
    #   negativity score
    #   objectivity score
    # keywords: ???
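The "mean score over the recognized words" idea can be illustrated as follows. This is a sketch under stated assumptions: the lexicon is a tiny made-up dictionary (a real run would need an actual sentiment lexicon, which the notebook does not name), and `TOY_LEXICON` and `sentiment_scores` are hypothetical names.

```python
import re

# Toy lexicon, made up for illustration: word -> (positivity, negativity, objectivity).
TOY_LEXICON = {
    "great": (0.75, 0.0, 0.25),
    "clean": (0.5, 0.0, 0.5),
    "dirty": (0.0, 0.75, 0.25),
    "room": (0.0, 0.0, 1.0),
}

def sentiment_scores(text, lexicon=TOY_LEXICON):
    """Average the (pos, neg, obj) scores over the words found in the lexicon.

    Returns None when no word of the text is recognized.
    """
    words = re.findall(r"[a-z']+", text.lower())
    known = [lexicon[w] for w in words if w in lexicon]
    if not known:
        return None
    n = len(known)
    return {
        "positivity": sum(s[0] for s in known) / n,
        "negativity": sum(s[1] for s in known) / n,
        "objectivity": sum(s[2] for s in known) / n,
    }

# Recognized words here are "great", "room" and "clean".
scores = sentiment_scores("Great stay, the room was clean!")
```

Unrecognized words are ignored rather than counted as neutral, so a long review with a single known word gets that word's scores. Whether that is desirable depends on the lexicon's coverage.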

In [78]:
def process_dates(df, verbose=False):
    """Posting date (date_review) & stay date (date_stayed).
    Gap between stay and review; gap between stay and the "current" date (most recent date in the dataset)."""
    # "date_review" "Month 31, 2020" -> "%B %d, %Y"
    # "date_stayed" "Month 2020" -> "%B %Y"
    review_date = pd.to_datetime(df["date_review"], format="%B %d, %Y")
    stay_date = pd.to_datetime(df["date_stayed"], format="%B %Y", errors="coerce")
    stay_date_missing = pd.isnull(stay_date)
    stay_date_present = ~stay_date_missing

    # fill in missing stay dates with: review date - mean gap between stay and review
    known_date_gap = review_date[stay_date_present] - stay_date[stay_date_present]
    average_date_gap = known_date_gap.mean()
    stay_date[stay_date_missing] = review_date[stay_date_missing] - average_date_gap

    if verbose: print(f"{stay_date_missing.sum()}/{len(stay_date_missing)} stay dates missing, filled in by subtracting the mean gap between stay and review ({average_date_gap}) from the review date.")

    # gap between stay and review
    date_gap = review_date - stay_date

    # gap between stay and the "current" date (most recent date in the dataset)
    most_recent_date = max(stay_date.max(), review_date.max())
    if verbose: print(f"The most recent date in the dataset is '{most_recent_date}', used as the 'current date'.")
    stay_date_gap = most_recent_date - stay_date

    return date_gap, stay_date_gap

process_dates(df, True)

67620/878554 stay dates missing, filled in by subtracting the mean gap between stay and review (52 days 06:03:52.744958) from the review date.
The most recent date in the dataset is '2012-12-20 00:00:00', used as the 'current date'.


(0               16 days 00:00:00
 1               16 days 00:00:00
 2               17 days 00:00:00
 3              138 days 00:00:00
 4               16 days 00:00:00
                    ...          
 878549   52 days 06:03:52.744958
 878550          17 days 00:00:00
 878551        -74 days +00:00:00
 878552        -91 days +00:00:00
 878553          12 days 00:00:00
 Length: 878554, dtype: timedelta64[ns], 0                 19 days 00:00:00
 1                 19 days 00:00:00
 2                 19 days 00:00:00
 3                141 days 00:00:00
 4                 19 days 00:00:00
                     ...           
 878549   1624 days 06:03:52.744958
 878550          1633 days 00:00:00
 878551          1633 days 00:00:00
 878552          1633 days 00:00:00
 878553          1755 days 00:00:00
 Name: date_stayed, Length: 878554, dtype: timedelta64[ns])
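Both returned series have dtype `timedelta64[ns]`, and some gaps are negative (a stay date later than the review date). A minimal sketch of turning such gaps into plain numbers and counting the negative ones, assuming fractional days is the desired unit (the notebook does not say) and using made-up dates:

```python
import pandas as pd

# Made-up dates, for illustration only.
review_date = pd.Series(pd.to_datetime(["2012-12-17", "2012-10-01"]))
stay_date = pd.Series(pd.to_datetime(["2012-12-01", "2012-12-14"]))

# Timedelta series, as returned by the function above.
date_gap = review_date - stay_date

# Dividing by a one-day Timedelta yields floats and keeps fractions of a day,
# whereas `.dt.days` would keep only the whole-day component.
gap_in_days = date_gap / pd.Timedelta(days=1)

# Rows where the stay is dated after the review.
negative_gaps = int((gap_in_days < 0).sum())
```

Negative gaps may simply reflect the month-level precision of `date_stayed`, or reviews posted before the end of a stay; how to treat them is a modelling decision.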

In [40]:
process_dates(df)