In [3]:
!pip install requests beautifulsoup4 pandas



In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Scraping function
The problem encountered with Beautiful Soup is typical: many sites, including those of Orange and Ooredoo in Tunisia, load customer reviews dynamically (JavaScript, asynchronous loading, embedded JSON data), so the reviews are not accessible through simple HTML parsing.

Ways to work around this problem
Extracting hidden JSON data: Some sites store reviews in JSON structures that are embedded in the page source but not rendered as HTML; these can be extracted once the right endpoints or structures have been identified. Sites such as Tripadvisor, for example, use this method for their customer reviews.

Dynamic scraping with Selenium or Puppeteer: These tools drive a full browser that executes the JavaScript, so the page and its reviews are fully loaded before extraction.

Using APIs or alternative sources: Platforms sometimes expose public or semi-public APIs for accessing customer reviews; alternatively, reviews can be collected from third-party sites such as Trustpilot that list customer reviews for these operators.

Automation through platforms such as Make.com + Apify: Some platforms offer ready-made modules for collecting reviews from sites such as Trustpilot, with analysis options, which could simplify collection.
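
The hidden-JSON approach described above can be sketched as follows. This is a minimal illustration using only the standard library. It assumes the page embeds its data in `<script type="application/ld+json">` tags, which is one common convention but is not confirmed for any of the sites named here. The names `JsonLdExtractor` and `extract_embedded_json` and the sample HTML are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.payloads.append("".join(self._buffer))
            self._in_json_script = False


def extract_embedded_json(html):
    """Return the decoded JSON objects embedded in the page, skipping invalid ones."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    decoded = []
    for raw in parser.payloads:
        try:
            decoded.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # Ignore malformed blocks rather than abort
    return decoded


# Hypothetical page fragment, for illustration only
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Hotel", "name": "Example Hotel",
 "review": [{"author": "A. Guest", "reviewBody": "Clean rooms."}]}
</script>
</head><body><p>Reviews load later via JavaScript.</p></body></html>
"""

data = extract_embedded_json(SAMPLE_HTML)
reviews = data[0]["review"]
```

On a real site the key names and nesting have to be checked in the page source first; if no such script tag exists, the function simply returns an empty list.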

Examples specific to operators in Tunisia
Reviews of Orange Tunisie can be found on Tripadvisor and Trustpilot, but they are not easily scraped with BeautifulSoup because the content is dynamic.

Reviews of Ooredoo Tunisie also exist on Trustpilot and Indeed, but they likewise require more advanced scraping methods or an API.

Tutorials on scraping customer reviews show that BeautifulSoup alone is often not enough to collect all reviews, because the JavaScript or hidden-JSON layer has to be handled.

For our mini-project we prefer one of the following: Selenium for dynamic scraping, direct extraction of hidden JSON, or sources such as Trustpilot for analyzing customer reviews.

Option 1: Selenium

Simulates a real browser

Loads the JavaScript → the reviews become visible

Compatible with Tunisianet / Orange / Ooredoo

This is the most reliable method for scraping reviews in Tunisia today
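
The "load the JavaScript, then read the reviews" step usually comes down to scrolling until no new items appear. The sketch below isolates that loop from the browser so it can be checked on its own. The function `scroll_until_stable`, its parameters, and the `FakePanel` stand-in are hypothetical; with Selenium, `count_items` might wrap `len(driver.find_elements(...))` and `scroll_once` might wrap `driver.execute_script(...)`.

```python
import time


def scroll_until_stable(count_items, scroll_once, max_scrolls=20, patience=2, pause=0.0):
    """Scroll until the item count stops growing for `patience` consecutive scrolls.

    count_items: callable returning how many review blocks are currently loaded.
    scroll_once: callable that triggers one scroll of the reviews panel.
    Returns the final item count.
    """
    last_count = count_items()
    stalled = 0
    for _ in range(max_scrolls):
        scroll_once()
        time.sleep(pause)  # Give lazily loaded content time to arrive
        current = count_items()
        if current > last_count:
            last_count = current
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break  # Nothing new after several scrolls: assume the end
    return last_count


class FakePanel:
    """Stand-in for a browser panel that loads 10 items per scroll, up to 35."""

    def __init__(self):
        self.loaded = 10
        self.scrolls = 0

    def scroll(self):
        self.scrolls += 1
        self.loaded = min(self.loaded + 10, 35)

    def count(self):
        return self.loaded


panel = FakePanel()
total = scroll_until_stable(panel.count, panel.scroll)
```

Compared with a fixed number of scrolls, this stops as soon as the list is exhausted and still caps the work with `max_scrolls`. In a real browser session `pause` would need to be a second or two.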



Option 2: Extract reviews from social networks or forums

Facebook, Twitter, Tunisian tech forums

Often visible without complex JavaScript

But requires some cleaning
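
The cleaning mentioned for Option 2 might look like the sketch below. It is only an illustration of the idea: the function `clean_post`, the choice of what to strip (URLs, @mentions, the `#` symbol, extra whitespace), and the sample post are all assumptions, not steps taken from this notebook.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_post(text):
    """Normalize a raw social-media post before sentiment analysis."""
    text = unicodedata.normalize("NFKC", text)  # Unify look-alike Unicode forms
    text = URL_RE.sub(" ", text)                # Drop links
    text = MENTION_RE.sub(" ", text)            # Drop @mentions
    text = text.replace("#", "")                # Keep the hashtag word, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


# Hypothetical post, for illustration only
raw = "  @some_operator  network very slow since yesterday!!  #4G https://example.com/x  "
cleaned = clean_post(raw)
```

The mention pattern would also mangle e-mail addresses, so a real pipeline may want a stricter rule for forum text.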

I will work on Business Hotel, using these sources:

Google Maps

Reviews on Booking.com

Reviews on the rating site Momondo

Reviews on TripAdvisor

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time
import random  # Used for the randomized delay between page requests
import re  # Used to rewrite the Booking.com pagination URL

# Enhanced headers to mimic full browser request and avoid 406
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests only decodes Brotli when a brotli package is installed
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1'
}

# Function for BeautifulSoup scraping (static sites)
def scrape_with_bs(url, source, review_selector, parse_func, pages=2):
    reviews = []
    
    # Momondo is special, it doesn't have review blocks in the same way, and no pagination for "pros/cons"
    if source == 'Momondo':
        response = requests.get(url, headers=HEADERS, allow_redirects=True)
        print(f"{source} fetch status: {response.status_code}")
        if response.status_code not in [200, 301, 302]:
            print(f"Failed to fetch {source} (status: {response.status_code})")
            return []
        soup = BeautifulSoup(response.text, 'html.parser')
        # parse_func for Momondo returns a list of reviews/items
        parsed_momondo_reviews = parse_func(soup)
        for pr in parsed_momondo_reviews:
            pr['source'] = source
            reviews.append(pr)
        return reviews

    # Standard BS scraping for paginated sites like Booking
    for page in range(1, pages + 1):
        paginated_url = url
        if 'booking' in url.lower():
            # Replace an existing page parameter, or append one with the right separator
            if re.search(r"[?&]page=\d+", paginated_url):
                paginated_url = re.sub(r"([?&])page=\d+", rf"\g<1>page={page}", paginated_url)
            else:
                separator = '&' if '?' in paginated_url else '?'
                paginated_url = f"{paginated_url}{separator}page={page}"

        print(f"Fetching {source} page {page} from {paginated_url}")
        response = requests.get(paginated_url, headers=HEADERS, allow_redirects=True)
        print(f"{source} page {page} fetch status: {response.status_code}")
        if response.status_code not in [200, 301, 302]:
            print(f"Failed to fetch {source} page {page} (status: {response.status_code})")
            continue
        
        soup = BeautifulSoup(response.text, 'html.parser')
        review_blocks = soup.select(review_selector)

        if not review_blocks and page == 1:
            print(f"No reviews found on {source} first page with selector '{review_selector}'.")
            break # No reviews at all
        elif not review_blocks:
            print(f"No more reviews on {source} page {page}.")
            break # End of pagination

        for block in review_blocks:
            try:
                review = parse_func(block)
                review['source'] = source
                reviews.append(review)
            except Exception as e:
                print(f"Error parsing {source} review: {e} in block: {block.prettify()[:200]}...") # Log partial block for debug
        
        time.sleep(random.uniform(2, 5)) # Be polite

    return reviews

# Parse function for Booking.com (Potential updated selectors)
def parse_booking(block):
    # These selectors are examples. YOU MUST VERIFY THEM with browser inspector.
    reviewer = block.select_one('.bui-avatar-block__title') # This is for reviewer name
    if not reviewer: reviewer = block.select_one('.bui-link--light') # Fallback for reviewer

    date = block.select_one('.review_item_date') # Example
    rating = block.select_one('.bui-review-score__badge') # Example
    title = block.select_one('.c-review-block__title') # Example
    
    # Booking often has positive and negative parts
    positive_text_elem = block.select_one('.c-review__body:nth-of-type(1)') # Adjust if multiple .c-review__body
    negative_text_elem = block.select_one('.c-review__body:nth-of-type(2)')
    
    full_text = []
    if positive_text_elem: full_text.append(positive_text_elem.text.strip())
    if negative_text_elem: full_text.append(negative_text_elem.text.strip())

    return {
        'reviewer': reviewer.text.strip() if reviewer else 'Anonymous',
        'date': date.text.strip() if date else 'N/A',
        'rating': rating.text.strip() if rating else 'N/A',
        'title': title.text.strip() if title else 'No Title',
        'text': ' '.join(full_text) if full_text else 'No Text'
    }

# Parse function for Momondo (refined selectors and consistent return)
def parse_momondo(soup):
    reviews = []
    
    # Momondo tends to aggregate pros/cons, not individual reviews easily
    # It might be better to capture this as one "aggregated" review or ignore for individual sentiment analysis.
    # For now, we'll try to capture them as separate "reviews" in the list.

    pros_text = []
    pros_h3 = soup.find('h3', string=lambda text: text and 'Pros +' in text.strip())
    if pros_h3:
        pros_ul = pros_h3.find_next('ul')
        if pros_ul:
            pros_text = [li.text.strip() for li in pros_ul.find_all('li')]
            reviews.append({
                'reviewer': 'Momondo Aggregated',
                'date': 'N/A',
                'rating': 'N/A',
                'title': 'Pros of Business Hotel',
                'text': '; '.join(pros_text),
                'sentiment_label': 'Positive' # Can manually label these
            })

    cons_text = []
    cons_h3 = soup.find('h3', string=lambda text: text and 'Cons -' in text.strip())
    if cons_h3:
        cons_ul = cons_h3.find_next('ul')
        if cons_ul:
            cons_text = [li.text.strip() for li in cons_ul.find_all('li')]
            reviews.append({
                'reviewer': 'Momondo Aggregated',
                'date': 'N/A',
                'rating': 'N/A',
                'title': 'Cons of Business Hotel',
                'text': '; '.join(cons_text),
                'sentiment_label': 'Negative' # Can manually label these
            })
    
    if not reviews:
        reviews.append({'reviewer': 'N/A', 'date': 'N/A', 'rating': 'N/A', 'title': 'No Data', 'text': 'No pros/cons found', 'sentiment_label': 'Neutral'})
    
    return reviews


# Function for Selenium scraping (dynamic sites)
def scrape_with_selenium(url, source, review_locator, parse_func, pages=2):
    reviews = []
    options = Options()
    options.add_argument("--headless=new")  # Run headless; the options.headless setter is deprecated and gone from recent Selenium 4 releases
    # Fix for newer Chrome versions:
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument(f"user-agent={HEADERS['User-Agent']}") # Pass User-Agent to Selenium

    driver = None
    try:
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
        driver.get(url)
        print(f"Initial page load for {source}: {url}")
        time.sleep(5)  # Longer initial load for stability

        # --- Specific logic for Google Maps & TripAdvisor ---
        if source == 'Google Maps':
            # This is the reviews panel, it's often more reliable to find it by its ARIA role or data-attributes
            # Or by its parent container that is scrollable
            # This CSS selector (.m6QErb.DxyBCb.kA9KIf.dS8AEf.XiKgde) might be volatile.
            # Look for a more stable element, e.g., the div containing all reviews, which has a scrollbar.
            try:
                # Find the scrollable reviews panel
                # This selector is crucial and needs to be verified on Google Maps
                reviews_panel = WebDriverWait(driver, 20).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.m6QErb.DxyBCb.kA9KIf.dS8AEf.XiKgde')) # Verify this
                )
                print("Google Maps reviews panel found.")
                # Scroll multiple times to load all reviews
                for _ in range(20): # Increased scrolls for more reviews
                    driver.execute_script("arguments[0].scrollBy(0, 5000);", reviews_panel) # Scroll more aggressively
                    time.sleep(2) # Increased sleep for more content to load
                    # Optional: check if new reviews loaded to stop early
                print("Finished scrolling Google Maps reviews.")
            except Exception as e:
                print(f"Could not find or scroll Google Maps reviews panel: {e}")

        # TripAdvisor pagination is handled internally by Selenium loop
        
        # --- Extract reviews after loading/scrolling ---
        review_blocks = driver.find_elements(*review_locator)
        if not review_blocks:
            print(f"No review blocks found for {source} with locator {review_locator}.")
        
        # This loop will now get all currently loaded reviews after scrolling/pagination
        for block in review_blocks:
            try:
                review = parse_func(block)
                review['source'] = source
                reviews.append(review)
            except Exception as e:
                print(f"Error parsing {source} review: {e} in block: {block.text[:100]}...") # Log partial block

        # TripAdvisor specific pagination (if not already handled by the above loop)
        if source == 'TripAdvisor':
            for page_num in range(pages): # pages here represents "next" clicks
                try:
                    next_button = driver.find_element(By.CSS_SELECTOR, '.ui_button.nav.next.primary') # Verify selector
                    if next_button.is_displayed() and next_button.is_enabled():
                        next_button.click()
                        time.sleep(5) # Wait for new page to load
                        
                        # Re-find review blocks after navigating to next page
                        new_review_blocks = driver.find_elements(*review_locator)
                        if not new_review_blocks:
                            print(f"No new reviews found after clicking next on TripAdvisor page {page_num + 1}.")
                            break
                        for block in new_review_blocks:
                            try:
                                review = parse_func(block)
                                review['source'] = source
                                reviews.append(review)
                            except Exception as e:
                                print(f"Error parsing {source} review: {e} in new block: {block.text[:100]}...")
                    else:
                        print(f"No more next button or not enabled on TripAdvisor after {page_num + 1} pages.")
                        break
                except Exception as e:
                    print(f"Error with TripAdvisor pagination on page {page_num + 1}: {e}")
                    break
        
    finally:
        if driver:
            driver.quit()
    return reviews


# Parse function for TripAdvisor (using Selenium WebElement methods)
def parse_tripadvisor(block):
    # These selectors are examples. YOU MUST VERIFY THEM with browser inspector.
    try:
        reviewer_elem = block.find_elements(By.CSS_SELECTOR, '.info_text div:first-child') # Example
        reviewer = reviewer_elem[0].text.strip() if reviewer_elem else 'Anonymous'
        
        date_elem = block.find_elements(By.CSS_SELECTOR, '.ratingDate') # Example
        date = date_elem[0].get_attribute('title') if date_elem else 'N/A' # Date is often in title attribute
        
        rating_elem = block.find_elements(By.CSS_SELECTOR, '.ui_bubble_rating') # Example
        rating = 'N/A'
        if rating_elem:
            # Rating is often in a class like 'bubble_50' for 5 stars, 'bubble_40' for 4 stars
            for cls in rating_elem[0].get_attribute('class').split():
                if cls.startswith('bubble_'):  # Skip 'ui_bubble_rating', which also contains 'bubble_'
                    rating = str(int(cls.replace('bubble_', '')) / 10.0) # Convert 'bubble_50' to '5.0'
                    break

        title_elem = block.find_elements(By.CSS_SELECTOR, '.noQuotes') # Example
        title = title_elem[0].text.strip() if title_elem else 'No Title'
        
        text_elem = block.find_elements(By.CSS_SELECTOR, '.partial_entry') # Example, might need to click "more"
        text = text_elem[0].text.strip() if text_elem else 'No Text'
        
        return {
            'reviewer': reviewer,
            'date': date,
            'rating': rating,
            'title': title,
            'text': text
        }
    except Exception as e:
        print(f"TripAdvisor parse error: {e}. Block text: {block.text[:100]}...")
        return {'reviewer': 'N/A', 'date': 'N/A', 'rating': 'N/A', 'title': 'No Data', 'text': 'Parse failed'}

# Parse function for Google Maps (using Selenium WebElement methods)
def parse_google(block):
    # These selectors are examples. YOU MUST VERIFY THEM with browser inspector.
    try:
        # Reviewer name
        reviewer_elem = block.find_elements(By.CSS_SELECTOR, '.d4r55') # Example
        reviewer = reviewer_elem[0].text.strip() if reviewer_elem else 'Anonymous'

        # Date
        date_elem = block.find_elements(By.CSS_SELECTOR, '.rsqaWe') # Example
        date = date_elem[0].text.strip() if date_elem else 'N/A'
        
        # Rating - often an aria-label on a star element or its parent
        rating_elem = block.find_elements(By.CSS_SELECTOR, '.kvMYJc') # Example: the div containing stars
        rating = 'N/A'
        if rating_elem:
            rating_label = rating_elem[0].get_attribute('aria-label')  # e.g. "Note 5 sur 5"; wording varies by locale
            rating_match = re.search(r'\d+(?:[.,]\d+)?', rating_label or '')
            if rating_match:
                rating = rating_match.group().replace(',', '.')  # First number in the label
        
        # Full review text
        text_elem = block.find_elements(By.CSS_SELECTOR, '.wiI7pd') # Example
        text = text_elem[0].text.strip() if text_elem else 'No Text'

        return {
            'reviewer': reviewer,
            'date': date,
            'rating': rating,
            'title': 'No Title (Google Maps)', # Google Maps reviews typically don't have titles
            'text': text
        }
    except Exception as e:
        print(f"Google Maps parse error: {e}. Block text: {block.text[:100]}...")
        return {'reviewer': 'N/A', 'date': 'N/A', 'rating': 'N/A', 'title': 'No Data', 'text': 'Parse failed'}


# Save to CSV
def save_to_csv(reviews, filename='hotel_reviews_combined__final.csv'): 
    if not reviews:
        print("No reviews to save.")
        return
    # Ensure all review dictionaries have the same keys for DictWriter
    # Create a superset of all keys found across all reviews
    all_keys = set()
    for review in reviews:
        all_keys.update(review.keys())
    
    # Fill missing keys with None for consistency
    reviews_for_csv = []
    for review in reviews:
        full_review = {key: review.get(key) for key in all_keys}
        reviews_for_csv.append(full_review)

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, sorted(list(all_keys))) # Sort keys for consistent column order
        writer.writeheader()
        writer.writerows(reviews_for_csv)
    print(f"Saved {len(reviews)} reviews to {filename}")

# Main execution
if __name__ == "__main__":
    # URLs
    booking_url = "https://www.booking.com/reviews/tn/hotel/business.en-gb.html?label=gen173nr-1FCAEoggI46AdIM1gEaEaIAQGYAQe4AQfIAQzYAQHoAQH4AQKIAgGoAgO4ApvZoMoGwAIB0gIkY2Q5ZGY5ZjMtZWEyNi00NzE5LWI5NjgtMzY4N2E4N2U3M2Q32AIG4AIB&sid=32b2429f18b4624fc09d32f51f4441cc&customer_type=total&hp_nav=0&keep_landing=1&order=featuredreviews&rows=75"
    momondo_url = "https://www.momondo.com/hotels/tunis-tunis-governorate/Business-Hotel-Tunis.mhd2417013.ksp"
    tripadvisor_url = "https://www.tripadvisor.com/Hotel_Review-g293758-d8767447-Reviews-Business_Hotel_Tunis-Tunis_Tunis_Governorate.html"
    google_url = "https://www.google.com/maps/place/Business+H%C3%B4tel/@36.818325,10.1820211,17z/data=!4m8!3m7!1s0x12fd3489b5a4f4e1:0xdda07f445b5b03cd!8m2!3d36.818325!4d10.184596!9m1!1b1!16s%2Fg%2F11bwn563xh?entry=ttu"
    
    all_reviews = []

    print("\n--- Scraping Booking.com ---")
    # Increased pages for Booking (75 reviews/page * 20 pages = 1500 potential reviews)
    booking_reviews = scrape_with_bs(booking_url, 'Booking.com', 'div.c-review-block', parse_booking, pages=20) 
    all_reviews.extend(booking_reviews)
    
    print("\n--- Scraping Momondo.com ---")
    momondo_reviews = scrape_with_bs(momondo_url, 'Momondo', '', parse_momondo, pages=1) # Momondo is one-off
    all_reviews.extend(momondo_reviews)
    
    print("\n--- Scraping TripAdvisor.com ---")
    # TripAdvisor pages count how many "next" clicks to simulate
    tripadvisor_reviews = scrape_with_selenium(tripadvisor_url, 'TripAdvisor', (By.CSS_SELECTOR, 'div[data-reviewid]'), parse_tripadvisor, pages=10) 
    all_reviews.extend(tripadvisor_reviews)

    print("\n--- Scraping Google Maps ---")
    # Google Maps loads reviews by scrolling; the scroll count is fixed inside scrape_with_selenium, so 'pages' is unused for this source
    google_reviews = scrape_with_selenium(google_url, 'Google Maps', (By.CSS_SELECTOR, 'div.jftiEf'), parse_google, pages=1)
    all_reviews.extend(google_reviews)
    
    save_to_csv(all_reviews)


--- Scraping Booking.com ---
Fetching Booking.com page 1 from https://www.booking.com/reviews/tn/hotel/business.en-gb.html?label=gen173nr-1FCAEoggI46AdIM1gEaEaIAQGYAQe4AQfIAQzYAQHoAQH4AQKIAgGoAgO4ApvZoMoGwAIB0gIkY2Q5ZGY5ZjMtZWEyNi00NzE5LWI5NjgtMzY4N2E4N2U3M2Q32AIG4AIB&sid=32b2429f18b4624fc09d32f51f4441cc&customer_type=total&hp_nav=0&keep_landing=1&order=featuredreviews&rows=75&page=1
Booking.com page 1 fetch status: 200
No reviews found on Booking.com first page with selector 'div.c-review-block'.

--- Scraping Momondo.com ---
Momondo fetch status: 200

--- Scraping TripAdvisor.com ---
Initial page load for TripAdvisor: https://www.tripadvisor.com/Hotel_Review-g293758-d8767447-Reviews-Business_Hotel_Tunis-Tunis_Tunis_Governorate.html
No review blocks found for TripAdvisor with locator ('css selector', 'div[data-reviewid]').
Error with TripAdvisor pagination on page 1: Message: no such element: Unable to locate element: {"method":"css selector","selector":".ui_button.nav.next.primary"}
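
The log above shows the Booking.com URL being extended with a `page` parameter. As a sketch, the same paginated URL can be built with `urllib.parse` instead of string manipulation, which handles URLs with or without an existing query string and does not confuse `page` with a parameter such as `hp_page`. The helper name `with_page` and the example URLs are hypothetical. This only addresses URL construction; it does not explain why the selectors matched nothing.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Keep every parameter except an existing 'page', preserving order
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    params.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical URLs, for illustration only
first = with_page("https://example.com/reviews?rows=75&page=1", 3)
second = with_page("https://example.com/reviews", 2)
third = with_page("https://example.com/reviews?hp_page=4&rows=75", 1)
```

Note that `urlencode` re-encodes parameter values, so a URL containing unusual characters may come back with different percent-encoding than it started with.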