- Amaia Rodríguez-Sierra Aguirrebeña _100472844_
- Lucía de Frutos Martín _100475960_
- Francisco Landa Ortega _100483174_

# Dataset Creation: Extracting Hotel Reviews from Booking.com
This notebook outlines the procedure used to create a custom dataset of hotel reviews. The reviews are collected from Booking.com using a structured web scraping pipeline. The dataset will later be used for various NLP and machine learning tasks.

### 1. Import and Install Required Libraries

In [1]:
!pip install googletrans==4.0.0-rc1


In [None]:
import requests
import pandas as pd
import numpy as np
import gzip
import io
from lxml import etree
from bs4 import BeautifulSoup
import re
from googletrans import Translator

### 2. Main Workflow
The main workflow includes steps for downloading hotel URLs, extracting user reviews, translating non-English reviews, and compiling the data into a structured format.
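The sitemap step can be tried offline before touching the network. The following is a minimal sketch, not the notebook's own code: it builds a tiny gzip-compressed sitemap in memory (the two example.com URLs are made up) and parses it with the standard library's `xml.etree.ElementTree` rather than `lxml`, using the same sitemap namespace.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (same URI as in the notebook)
NS = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Hypothetical sitemap content; the URLs are placeholders for illustration
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/reviews/hotel-a.html</loc></url>
  <url><loc>https://example.com/reviews/hotel-b.html</loc></url>
</urlset>"""


def extract_urls_from_gz(gz_bytes):
    """Decompress a gzipped sitemap and return the text of every <loc> inside a <url>."""
    with gzip.open(io.BytesIO(gz_bytes), 'rb') as f:
        xml_data = f.read()
    root = ET.fromstring(xml_data)
    return [loc.text.strip() for loc in root.findall('ns:url/ns:loc', NS)]


# Simulate a downloaded .gz file entirely in memory
gz_payload = gzip.compress(SAMPLE_XML)
sample_urls = extract_urls_from_gz(gz_payload)
print(sample_urls)
```

Keeping the decompress-and-parse logic in one small function makes it possible to check the XPath-style query on a known input before running it against the real sitemap files.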

*Do not execute the next cell: it takes hours to complete.*

In [None]:
# Function to extract reviews, hotel name, country, and rating info from a Booking hotel review page
def fetch_reviews_with_scores(url):
    try:
        headers = {'Accept-Language': 'en'}
        # A timeout keeps the run from hanging if a URL does not respond
        response = requests.get(url, headers=headers, timeout=30)
        response.encoding = 'utf-8'

        # If the request fails (non-200 status), skip this URL
        if response.status_code != 200:
            print(f"Failed to retrieve {url}")
            return None

        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the block containing hotel metadata
        info_block = soup.find('div', class_='standalone_reviews_hotel_info')

        # Extract hotel name
        name_tag = info_block.find('a', class_='standalone_header_hotel_link') if info_block else None
        name = name_tag.get_text(strip=True) if name_tag else np.nan

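        # Extract country
        country_tag = info_block.find('a', class_='hotel_address_country') if info_block else None
        country = country_tag.get_text(strip=True) if country_tag else np.nan

        # Extract the hotel's average score (stored below as 'AvgRating').
        # Assumption: the original selector is not shown; a 'review-score-badge'
        # span inside the info block is a guess and may need adjusting.
        avg_tag = info_block.find('span', class_='review-score-badge') if info_block else None
        avg_score = avg_tag.get_text(strip=True) if avg_tag else np.nan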

        # Find individual review blocks
        review_blocks = soup.find_all('li', itemprop='review')
        reviews_list = []

        for block in review_blocks:
            # Extract review text
            review_tag = block.find('span', itemprop='reviewBody')
            review_text = review_tag.get_text(strip=True) if review_tag else np.nan

            # Extract review score
            score_tag = block.find('span', class_='review-score-badge')
            score = score_tag.get_text(strip=True) if score_tag else np.nan

            # Save extracted data as a dictionary
            reviews_list.append({'Country': country, 'Name': name, 'Review': review_text, 'Rating': score, 'AvgRating': avg_score})

        # Convert list to DataFrame
        result = pd.DataFrame(reviews_list)
        return result

    # Handle request timeouts
    except requests.exceptions.Timeout:
        print(f"Timeout occurred for URL: {url}. Skipping this URL.")
        return None  # Skip on timeout

    # Handle any other type of exception
    except Exception as e:
        print(f"An error occurred for URL {url}: {e}. Skipping this URL.")
        return None  # Skip on error

# Get the sitemap index provided at booking.com/robots.txt
index_url = "https://www.booking.com/sitembk-hotel-review-index.xml"
response = requests.get(index_url)
response.raise_for_status()

# Parse the sitemap XML to extract all .gz sitemap file links
root = etree.fromstring(response.content)
ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
gz_links = root.xpath("//ns:sitemap/ns:loc/text()", namespaces=ns)

collected_urls = []
reviews = pd.DataFrame(columns=['Country', 'Name', 'Review', 'Rating', 'AvgRating'])

# Loop through the .gz sitemap files
for gz_url in gz_links:
    print(f"Downloading: {gz_url}")

    # Download and decompress the .gz sitemap file
    gz_response = requests.get(gz_url)
    gz_response.raise_for_status()
    with gzip.open(io.BytesIO(gz_response.content), 'rb') as f:
        xml_data = f.read()

    # Parse the decompressed XML to extract individual hotel review page URLs
    sub_root = etree.fromstring(xml_data)
    urls = sub_root.xpath("//ns:url/ns:loc/text()", namespaces=ns)
    collected_urls.extend(urls)

    # For each hotel review URL in this sitemap file, extract and store review data.
    # Iterating over `urls` (not `collected_urls`) avoids re-fetching URLs from earlier files.
    for hotel_url in urls:
        new_review = fetch_reviews_with_scores(hotel_url)
        if new_review is not None:
            # Append the new reviews to the main DataFrame
            reviews = pd.concat([reviews, new_review], ignore_index=True)
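
One thing to watch in a loop like this is that a growing URL list should not be re-scanned for every sitemap file. The sketch below illustrates one way to visit each URL exactly once, using a `seen` set and a list of partial results that is combined at the end. `crawl_once` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in that performs no network access.

```python
def crawl_once(sitemaps, fetch):
    """Call fetch(url) once per distinct URL across all sitemap files.

    sitemaps: iterable of URL lists, one list per sitemap file.
    fetch: callable returning a list of review records, or None on failure.
    """
    seen = set()
    records = []
    for urls in sitemaps:
        for url in urls:
            if url in seen:
                continue  # already processed in an earlier sitemap file
            seen.add(url)
            result = fetch(url)
            if result is not None:
                records.extend(result)
    return records


# Stand-in for a real page fetch: it records which URLs were requested
requested = []

def fake_fetch(url):
    requested.append(url)
    if url.endswith('broken'):
        return None  # simulate a failed request
    return [{'url': url, 'Review': 'placeholder text'}]


demo = crawl_once([['a', 'b'], ['b', 'c', 'broken']], fake_fetch)
print(len(demo), requested)
```

Collecting records in a plain list and building the DataFrame once at the end also avoids copying the whole DataFrame on every `pd.concat` call.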

### 3. Initial Clean and Preview of the Dataset
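Because a large number of reviews has been collected, rows that contain missing (NaN) values can be dropped without meaningfully shrinking the dataset. Country names, which appear in the language of the scraped page, are then translated into English.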
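
Before dropping rows, it can help to check how much data would be lost. A minimal sketch with a made-up three-row DataFrame follows; the column names mirror the notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample data with one missing review and one missing rating
toy = pd.DataFrame({
    'Name': ['Hotel A', 'Hotel B', 'Hotel C'],
    'Review': ['Good location', np.nan, 'Noisy room'],
    'Rating': ['8.0', '7.0', np.nan],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()

# Keep only rows where every column is present; .copy() gives an independent DataFrame
clean = toy.dropna().copy()

dropped_share = 1 - len(clean) / len(toy)
print(missing_per_column.to_dict())
print(f"Dropped {dropped_share:.0%} of rows")
```

If the dropped share turned out to be large, restricting `dropna` to the essential columns via its `subset` parameter would be a gentler alternative.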

In [None]:
# Drop rows with any missing values; .copy() makes later column assignments unambiguous
new_df = reviews.dropna().copy()

In [None]:
# Initialize the translator
translator = Translator()

In [None]:
# Function to translate country names
def translate_country_name(text):
    try:
        # Automatically detect the source language
        return translator.translate(str(text), src="auto", dest="en").text
    except Exception:
        # If translation fails, return the original text
        return text

In [None]:
# Translate each unique country name
unique_countries = new_df['Country'].unique()
translations = {country: translate_country_name(country) for country in unique_countries}

# Map translated country names back to the DataFrame
new_df['New Country'] = new_df['Country'].map(translations)
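
Translating each distinct country name once and then mapping the result back keeps the number of translation calls equal to the number of distinct values rather than the number of rows. The sketch below shows the idea with a stub in place of the real translator; `fake_translate` and its tiny lookup table are assumptions made purely for illustration.

```python
# Records every input the stub is asked to translate
translate_calls = []

def fake_translate(text):
    """Stub translator: looks up a hard-coded table and logs each call."""
    translate_calls.append(text)
    table = {'España': 'Spain', 'Deutschland': 'Germany'}
    return table.get(text, text)  # fall back to the input when no entry exists


countries = ['España', 'Deutschland', 'España', 'España', 'Deutschland']

# Translate each distinct value once (dict.fromkeys keeps first-seen order)
lookup = {name: fake_translate(name) for name in dict.fromkeys(countries)}

# Map the translations back onto every row
translated = [lookup[name] for name in countries]
print(translated, len(translate_calls))
```

With five rows and two distinct names, the stub is called twice; on a dataset of over a million rows the saving is proportionally larger.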

In [None]:
new_df

| Unnamed: 0 | Country | Name | Review | Rating | AvgRating | New Country |
|---|---|---|---|---|---|---|
| 0 | تركيا | The Hera Premium Hotels | الخدمات بعيده | 5.0 | 5.7 | Türkiye |
| 1 | تركيا | The Hera Premium Hotels | لم يصلح المكيف | 7.0 | 5.7 | Türkiye |
| 2 | تركيا | The Hera Premium Hotels | -الافطار كان محدودا وغير ساخن \n-ايضا بعد المس... | 7.0 | 5.7 | Türkiye |
| 3 | تركيا | The Hera Premium Hotels | تعامل الموظفين سيء: 1/ طلبت 5 قوارير ماء وقال ... | 5.0 | 5.7 | Türkiye |
| 4 | تركيا | The Hera Premium Hotels | الفطور لم يكن المتوقع | 7.0 | 5.7 | Türkiye |
| ... | ... | ... | ... | ... | ... | ... |
| 1262594 | مصر | Lacasa Residence | فندق هادي بصلح للعائلات لكن لاسف لا توجد لوحة ... | 8.0 | 9.5 | Egypt |
| 1262595 | مصر | Lacasa Residence | مكان جميل وطاقم عمل اجمل وخصوصا الاستاذة ريحان... | 10 | 9.5 | Egypt |
| 1262596 | مصر | Lacasa Residence | الشقق نظيفه. والعاملين عليها متعاونين جدا | 10 | 9.5 | Egypt |
| 1262597 | مصر | Lacasa Residence | موقعه وغرفه صغيره جداً | 8.0 | 9.5 | Egypt |


In [None]:
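# Save an intermediate file to avoid data loss.
# Note: an .xlsx sheet holds at most 1,048,576 rows; for larger DataFrames, to_csv is the safer choice.
new_df.to_excel('./BookingReviews.xlsx', sheet_name='Reviews', index=False)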

### 4. Translate Arabic Reviews to English
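The googletrans library (an unofficial Python client for Google Translate) is used to convert Arabic text into English. The translation step loops through the first 73,000 reviews, translates each review (and the hotel name, when it contains Arabic characters), and stores the English equivalents.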

In [None]:
# Create a copy of the first 73000 rows of reviews
df = new_df.iloc[:73000].copy()

In [None]:
# Check if the text contains Arabic characters
def is_arabic(text):
    return bool(re.search(r'[\u0600-\u06FF]', str(text)))

# Translate Arabic text to English
def translate_text(text):
    try:
        return translator.translate(str(text), src='ar', dest='en').text
    except Exception:
        # If translation fails, return the original text
        return text
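
Because `translate_text` returns the original text when a call fails, transient errors silently leave Arabic text in the output. One possible mitigation, sketched here under assumptions, is to retry a few times with a growing delay before falling back. `with_retries`, its parameters, and `flaky_translate` are hypothetical; in practice, a function wrapping the real `translator.translate` call would be passed in place of the stub.

```python
import time


def with_retries(func, text, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(text); on an exception, wait and retry, doubling the delay each time.

    Returns (result, True) on success or (text, False) if every attempt failed.
    """
    delay = base_delay
    for attempt in range(attempts):
        try:
            return func(text), True
        except Exception:
            if attempt < attempts - 1:
                sleep(delay)
                delay *= 2
    return text, False


# Stub that fails twice and then succeeds, to exercise the retry path
failures_left = [2]

def flaky_translate(text):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise RuntimeError("temporary failure")
    return text.upper()


# Record the requested waits instead of actually sleeping
waits = []
result, ok = with_retries(flaky_translate, "hello", sleep=waits.append)
print(result, ok, waits)
```

Returning a success flag alongside the text makes it possible to list the rows that were never translated and retry them later.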

In [None]:
translated_rows = []
for idx, row in df.iterrows():
    # Translate the review text from Arabic to English
    translated_review = translate_text(row['Review'])

    # Check if the name is in Arabic and translate it if needed
    name = row['Name']
    translated_name = translate_text(name) if is_arabic(name) else name

    # Create a new row dictionary with translated content
    translated_rows.append({
        'Name': translated_name,
        'Review': translated_review,
        'Rating': row['Rating'],        # Original numerical rating
        'AvgRating': row['AvgRating'],  # Original average rating
        'Country': row['Country']       # Original country
    })

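    # Print progress every 1000 rows. Count processed rows rather than using idx,
    # which is an index label and may have gaps after dropna.
    if len(translated_rows) % 1000 == 0:
        print(f"{len(translated_rows)} rows translated...")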

# Convert the list of translated rows into a new DataFrame
translated_df = pd.DataFrame(translated_rows)
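
A translation run over tens of thousands of rows is long enough that an interruption is likely. The following sketch shows one way to append results to a CSV file in batches so that a rerun can skip rows already written. It uses only the standard library; `append_batch`, `rows_already_done`, and the temporary file path are illustrative names, not part of the notebook.

```python
import csv
import os
import tempfile

FIELDS = ['Name', 'Review', 'Rating', 'AvgRating', 'Country']


def rows_already_done(path):
    """Return how many data rows the checkpoint file already holds (0 if absent)."""
    if not os.path.exists(path):
        return 0
    with open(path, newline='', encoding='utf-8') as f:
        return sum(1 for _ in csv.DictReader(f))


def append_batch(path, batch):
    """Append a list of row dictionaries, writing the header only for a new file."""
    is_new = not os.path.exists(path)
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(batch)


# Demonstration with a throwaway file and invented rows
checkpoint = os.path.join(tempfile.mkdtemp(), 'translated_checkpoint.csv')
batch_1 = [{'Name': 'Hotel A', 'Review': 'Clean rooms', 'Rating': '8.0',
            'AvgRating': '7.9', 'Country': 'Egypt'}]
batch_2 = [{'Name': 'Hotel B', 'Review': 'Small room', 'Rating': '6.0',
            'AvgRating': '7.1', 'Country': 'Egypt'}]

append_batch(checkpoint, batch_1)
done_after_first = rows_already_done(checkpoint)
append_batch(checkpoint, batch_2)
done_after_second = rows_already_done(checkpoint)
print(done_after_first, done_after_second)
```

On restart, the value returned by `rows_already_done` could serve as the offset into the source DataFrame from which to resume.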

In [None]:
# Save the translated reviews into a new CSV file
translated_df.to_csv('BookingReviews_Translated.csv', index=False)
print("Translation completed.")

Translation completed.
