# REVIEW SUMMARIZER
## TRIPADVISOR: HOTELS

*   Esteban Ariza
*   Johan Giraldo
*   Mateo Valdes

## Prerequisites

In [1]:
%pip install transformers
%pip install torch
%pip install sentencepiece
%pip install rouge-score
%pip install evaluate
%pip install nltk


In [2]:
import torch
import json
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
import pandas as pd
import csv
from rouge_score import rouge_scorer, scoring
import evaluate
from nltk.translate.bleu_score import sentence_bleu

## Summarizer

Read the cleaned hotel reviews CSV

In [2]:
INPUT_CSV_PATH = "../data/exploratory_analysis/tripadvisor_hotels_clean.csv"
HOTEL_DATA = pd.read_csv(INPUT_CSV_PATH)

Download models

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Move the model to the same device as the input tensors.
model = T5ForConditionalGeneration.from_pretrained('t5-small').to(device)
tokenizer = T5Tokenizer.from_pretrained('t5-small')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


### Normal

Write CSV file

In [5]:
COLUMNS_NAME = ['ORIGINAL_TEXT', 'SUMMARIZED_TEXT']

In [None]:
try:
    writer = csv.DictWriter(open('summarized_reviews.csv', 'w', encoding='UTF8', newline=''), fieldnames=COLUMNS_NAME, delimiter=',', lineterminator='\n')
    writer.writeheader()
except IOError:
    print("I/O error")
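
The writer above keeps its file open for the rest of the notebook, so some rows may still sit in a buffer when the CSV is read back. A minimal sketch of an alternative, assuming the rows can be collected before writing: `write_rows` and `read_rows` are hypothetical helpers, not part of this notebook, and they use `with` blocks so the file is flushed and closed before it is read.

```python
import csv
import os
import tempfile


def write_rows(path, fieldnames, rows):
    """Write dict rows to a CSV file and close it when done (hypothetical helper)."""
    with open(path, 'w', encoding='UTF8', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the CSV back as a list of dicts, one per data row."""
    with open(path, encoding='UTF8', newline='') as handle:
        return list(csv.DictReader(handle))


# Usage with made-up data; the column names match COLUMNS_NAME above.
demo_path = os.path.join(tempfile.mkdtemp(), 'summarized_reviews_demo.csv')
demo_rows = [{'ORIGINAL_TEXT': 'Great stay, friendly staff.', 'SUMMARIZED_TEXT': 'Great stay.'}]
write_rows(demo_path, ['ORIGINAL_TEXT', 'SUMMARIZED_TEXT'], demo_rows)
print(read_rows(demo_path))
```

Collecting rows first trades memory for safety; for very large runs, appending row by row and closing the file explicitly at the end may be the better fit.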

In [6]:
def summarize(review):
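    # Truncate to the 512-token limit of t5-small; longer inputs trigger the tokenizer warning.
    tokenized_text = tokenizer.encode('summarize: ' + review, return_tensors="pt", truncation=True, max_length=512).to(device)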
    summary_ids = model.generate(tokenized_text,
                                    num_beams=4,
                                    no_repeat_ngram_size=2,
                                    min_length=30,
                                    max_length=100,
                                    early_stopping=True)
    row = {}
    row[COLUMNS_NAME[0]] = review
    row[COLUMNS_NAME[1]] = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    try:
        writer.writerow(row)
    except IOError:
        print("I/O error")
    print('Summarized: ' + row[COLUMNS_NAME[0]] + ' to: ' + row[COLUMNS_NAME[1]])

In [None]:
HOTEL_DATA['REVIEW_TEXT'].apply(summarize)

In [4]:
HOTEL_SUMMARY = pd.read_csv('summarized_reviews.csv')

### By Hotel and Year

Open file writer

In [4]:
COLUMNS_NAME = ['HOTEL_NAME','REVIEW_DATE','REVIEW_SUMMARY']
OUTPUT_CSV_PATH = "../data/review_summarizer/summarized_reviews_by_year_and_hotel.csv"

try:
    writer = csv.DictWriter(open(OUTPUT_CSV_PATH, 'w', encoding='UTF8', newline=''), fieldnames=COLUMNS_NAME, delimiter=',', lineterminator='\n')
    writer.writeheader()
except IOError:
    print("I/O error")

Support methods

In [5]:
def fromDateToYear(value):
    return value.split("-")[0]

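def concatReviewsByYearAndHotel(df):
    # Work on a copy so the caller's DataFrame (HOTEL_DATA) is not modified in place.
    df = df.copy()
    # Reduce each review date to its year (assumes dates start with the year, e.g. YYYY-MM-DD).
    df["REVIEW_DATE"] = df["REVIEW_DATE"].map(fromDateToYear)
    # Join all reviews of the same hotel and year into a single text.
    df['REVIEW_TEXT'] = df[['HOTEL_NAME','REVIEW_TEXT','REVIEW_DATE']].groupby(["HOTEL_NAME","REVIEW_DATE"])["REVIEW_TEXT"].transform(lambda x: ". ".join(x))
    return df[['HOTEL_NAME','REVIEW_TEXT','REVIEW_DATE']].drop_duplicates()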
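
The grouping step can be illustrated on a small made-up DataFrame. The sketch below is an alternative formulation, not the notebook's own method: `concat_reviews` is a hypothetical name, and it uses `groupby(...).agg(...)` to produce one row per hotel and year directly, without `drop_duplicates` and without changing the input.

```python
import pandas as pd

# Made-up reviews for illustration only.
toy = pd.DataFrame({
    'HOTEL_NAME': ['Hotel A', 'Hotel A', 'Hotel A', 'Hotel B'],
    'REVIEW_DATE': ['2021-03-01', '2021-07-15', '2022-01-09', '2021-05-20'],
    'REVIEW_TEXT': ['Clean room', 'Nice staff', 'Good breakfast', 'Quiet area'],
})


def concat_reviews(df):
    """Return one row per (hotel, year) with the reviews joined by '. '."""
    # assign() returns a new DataFrame, so the input keeps its full dates.
    with_year = df.assign(REVIEW_DATE=df['REVIEW_DATE'].str.split('-').str[0])
    return (with_year
            .groupby(['HOTEL_NAME', 'REVIEW_DATE'])['REVIEW_TEXT']
            .agg('. '.join)
            .reset_index())


grouped = concat_reviews(toy)
print(grouped)
```

The two 2021 reviews of "Hotel A" end up in one row, while `toy` still holds the original dates and texts.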

Summarizer method

In [6]:
def summarizeByYearHotel(actRow):
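    # Truncate to the 512-token limit of t5-small; the concatenated reviews are often much longer.
    tokenized_text = tokenizer.encode('summarize: ' + actRow["REVIEW_TEXT"], return_tensors="pt", truncation=True, max_length=512).to(device)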
    summary_ids = model.generate(tokenized_text,
                                    num_beams=4,
                                    no_repeat_ngram_size=2,
                                    min_length=30,
                                    max_length=100,
                                    early_stopping=True)
    row = {}
    row[COLUMNS_NAME[0]] = actRow["HOTEL_NAME"]
    row[COLUMNS_NAME[1]] = actRow["REVIEW_DATE"]
    row[COLUMNS_NAME[2]] = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    try:
        writer.writerow(row)
    except IOError:
        print("I/O error")
    print('Summarized: ' + row[COLUMNS_NAME[0]] + '-' + row[COLUMNS_NAME[1]] + ' to: ' + row[COLUMNS_NAME[2]])

Summarize reviews by year and hotel

In [7]:
HOTEL_DATA_BY_YEARHOTEL = concatReviewsByYearAndHotel(HOTEL_DATA)
HOTEL_DATA_BY_YEARHOTEL.apply(summarizeByYearHotel, axis=1)

Summarized: Hotel Restaurant VILLINO-2022 to: stay in this hotel was magnificent. It has a pleasant, sophisticated atmosphere. Only thing missing was airco or fan, but was doable. Bathroom was clean and plenty of space.


Token indices sequence length is longer than the specified maximum sequence length for this model (831 > 512). Running this sequence through the model will result in indexing errors


Summarized: Hotel Restaurant VILLINO-2020 to: the hotel is on the outskirts of Lindau and it is a 40min walk or take the car to get to the old island. the owners are fully involved in the day to day operations and this makes all the difference in world- the service is without fault and you really feel welcomed!
Summarized: Hotel Restaurant VILLINO-2019 to: the hotel is located close to Lindau in a beautiful garden/park. we stayed for two nights at the Villino and everything really is perfect for the relaxing break...unless the sun comes out! the staff, room, breakfast and overall experience was fantastic.
Summarized: Hotel Restaurant VILLINO-2018 to: the hotel is in a magical setting, amongst apple orchards. the grounds were beautiful and inviting, surrounded by apple and pear sands, and the restaurant was excellent. if you are in the mood, prepare to be delighted.
Summarized: Hotel Restaurant VILLINO-2017 to: the hotel is located in a beautiful garden just on the outskirts of Lindau. 

RuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3105035208 bytes.
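
The tokenizer warning above reports an input of 831 tokens against a 512-token limit, and the memory error that follows is probably also tied to the length of the concatenated reviews. One possible workaround, shown only as a sketch, is to split a long text into smaller pieces and summarize each piece separately. `chunk_words` is a hypothetical helper; it counts whitespace-separated words as a rough stand-in for tokens, which is an assumption, since T5's SentencePiece tokenizer usually produces more tokens than words.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words (hypothetical helper)."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Usage with a made-up seven-word text and a small limit.
chunks = chunk_words('one two three four five six seven', max_words=3)
print(chunks)
```

Each chunk could then be summarized on its own and the partial summaries joined, at the cost of losing context across chunk boundaries.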

Load summaries

In [7]:
HOTEL_SUMMARY = pd.read_csv(OUTPUT_CSV_PATH)

## Test

In [None]:
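# COLUMNS_NAME was redefined for the by-year output, which does not store the original text,
# so load the per-review summaries and name the columns explicitly.
EVAL_SUMMARY = pd.read_csv('summarized_reviews.csv')
rouge = evaluate.load('rouge')
predictions = EVAL_SUMMARY['SUMMARIZED_TEXT'].tolist()
references = EVAL_SUMMARY['ORIGINAL_TEXT'].tolist()
results = rouge.compute(predictions=predictions, references=references)
print(results)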

In [20]:
def splitter(value):
    return value.split()

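# Load the per-review summaries and use explicit column names, since COLUMNS_NAME was redefined above.
EVAL_SUMMARY = pd.read_csv('summarized_reviews.csv')
reference = list(map(splitter, EVAL_SUMMARY['ORIGINAL_TEXT'].tolist()))
candidates = list(map(splitter, EVAL_SUMMARY['SUMMARIZED_TEXT'].tolist()))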

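# Average sentence-level BLEU over all candidates for the given n-gram weights.
# Note: every candidate is scored against the full list of references.
def bleu(reference, candidates, weights=(0.25, 0.25, 0.25, 0.25)):
    result = 0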
    for candidate in candidates:
        result += sentence_bleu(reference, candidate, weights=weights)
    result = result / len(candidates)
    return result

print('BLEU: %f' %bleu(reference, candidates))
print('BLEU 1-gram: %f' %bleu(reference, candidates, (1, 0, 0, 0)))
print('BLEU 2-gram: %f' %bleu(reference, candidates, (0, 1, 0, 0)))
print('BLEU 3-gram: %f' %bleu(reference, candidates, (0, 0, 1, 0)))
print('BLEU 4-gram: %f' %bleu(reference, candidates, (0, 0, 0, 1)))

BLEU: 0.717735
BLEU 1-gram: 0.891337
BLEU 2-gram: 0.788820
BLEU 3-gram: 0.669039
BLEU 4-gram: 0.581243
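
In the `bleu` function, `sentence_bleu` receives all original texts as references for each candidate, so a summary can earn credit for n-grams found in any review, which may inflate the scores. A hedged sketch of a per-pair variant follows: `paired_bleu` is a hypothetical name, the sample texts are made up, and the smoothing choice (`method1`) is an assumption rather than something this notebook uses.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def paired_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)):
    """Average sentence-level BLEU, scoring each candidate against its own reference only."""
    if not candidates or len(references) != len(candidates):
        raise ValueError("references and candidates must be non-empty and of equal length")
    # Smoothing avoids zero scores when some n-gram order has no match.
    smooth = SmoothingFunction().method1
    total = 0.0
    for ref, cand in zip(references, candidates):
        # sentence_bleu expects a list of references, so wrap the single reference.
        total += sentence_bleu([ref], cand, weights=weights, smoothing_function=smooth)
    return total / len(candidates)


# Made-up tokenized texts for illustration only.
refs = [['the', 'hotel', 'was', 'clean', 'and', 'quiet'],
        ['great', 'breakfast', 'and', 'very', 'friendly', 'staff']]
cands = [['the', 'hotel', 'was', 'clean'],
         ['great', 'breakfast', 'and', 'friendly', 'staff']]
print('Paired BLEU: %f' % paired_bleu(refs, cands))
```

Scores from this variant are not comparable with the figures printed above, because those were computed against the pooled references.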
