# Booking.com Kerala Hotels & Reviews Scraper (Patchright)

This notebook scrapes hotels and reviews for the Kerala region from Booking.com using **Patchright** (an undetected variant of Playwright), with BeautifulSoup for HTML parsing.

**Goal**: Collect 250 hotels with at least 10 reviews each.

**Features**:

- Async browser automation with Patchright
- Progress checkpoints for recovery
- Detailed status outputs
- CSV export matching the template column names
- Anti-bot handling through Patchright's stealth defaults (no custom user agent)


## 1. Install Required Libraries


In [6]:
# Install required packages (beautifulsoup4 provides the bs4 module imported below)
!uv pip install patchright pandas lxml tqdm nest_asyncio beautifulsoup4

Audited 5 packages in 2ms


## 2. Import Libraries and Setup


In [2]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

from patchright.async_api import async_playwright
from bs4 import BeautifulSoup
import time
import json
import re
from datetime import datetime
from pathlib import Path
from urllib.parse import urljoin, urlencode
import pandas as pd
from tqdm.notebook import tqdm

# Configuration
TARGET_HOTELS = 250
MIN_REVIEWS_PER_HOTEL = 10
CHECKPOINT_FILE = "scraping_checkpoint.json"
OUTPUT_CSV = "kerala_hotels_reviews.csv"

# Global storage
scraped_hotels = []
scraped_reviews = []
browser = None
context = None
page = None

print("✓ Libraries imported successfully (using Patchright - undetected)")
print(f"Target: {TARGET_HOTELS} hotels with {MIN_REVIEWS_PER_HOTEL}+ reviews each")

✓ Libraries imported successfully (using Patchright - undetected)
Target: 250 hotels with 10+ reviews each


## 3. Helper Functions
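
The `save_checkpoint` helper in this section overwrites the checkpoint file in place, so an interruption in the middle of a write could leave a truncated file. A minimal sketch of one way to guard against that, assuming a hypothetical `save_checkpoint_atomic` (not part of the original notebook) that writes to a temporary file and then renames it over the target:

```python
import json
import os
import tempfile
from datetime import datetime


def save_checkpoint_atomic(hotels, reviews, filename="scraping_checkpoint.json"):
    """Write the checkpoint to a temp file, then swap it into place."""
    data = {
        "hotels": hotels,
        "reviews": reviews,
        "timestamp": datetime.now().isoformat(),
    }
    directory = os.path.dirname(os.path.abspath(filename))
    # Create the temp file next to the target so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        # os.replace overwrites the old checkpoint in a single rename step
        os.replace(tmp_path, filename)
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is untouched
        os.unlink(tmp_path)
        raise
    return filename
```

The file layout matches what `load_checkpoint` reads, so the two could be used together.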


In [3]:
def save_checkpoint(hotels, reviews, filename=CHECKPOINT_FILE):
    """Save progress to checkpoint file"""
    data = {
        "hotels": hotels,
        "reviews": reviews,
        "timestamp": datetime.now().isoformat(),
    }
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"💾 Checkpoint saved: {len(hotels)} hotels, {len(reviews)} reviews")


def load_checkpoint(filename=CHECKPOINT_FILE):
    """Load progress from checkpoint file"""
    if Path(filename).exists():
        with open(filename, "r", encoding="utf-8") as f:
            data = json.load(f)
        print(
            f"📂 Loaded checkpoint: {len(data.get('hotels', []))} hotels, {len(data.get('reviews', []))} reviews"
        )
        return data.get("hotels", []), data.get("reviews", [])
    return [], []


def clean_text(text):
    """Clean and normalize text"""
    if not text:
        return ""
    return re.sub(r"\s+", " ", text.strip())


print("✓ Helper functions defined")

✓ Helper functions defined


## 4. Initialize Browser

Start the Patchright browser (Playwright-compatible API) in visible, non-headless mode for debugging.
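
The notebook never closes the browser session it opens. A minimal teardown sketch, assuming the temp profile directory created inside `init_browser` is kept somewhere accessible (the original only prints it), and using a hypothetical `shutdown_browser` helper:

```python
import shutil


async def shutdown_browser(context, playwright, temp_dir):
    """Close the persistent context, stop the driver, and delete the temp profile."""
    try:
        # Closes every page that belongs to the persistent context
        await context.close()
    finally:
        try:
            # Stops the driver started with async_playwright().start()
            await playwright.stop()
        finally:
            # Remove the throwaway profile directory even if closing failed
            shutil.rmtree(temp_dir, ignore_errors=True)
```

In this notebook the global `browser` holds the driver handle, so a call could look like `asyncio.get_event_loop().run_until_complete(shutdown_browser(context, browser, temp_dir))`, where `temp_dir` is the assumed saved path.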


In [4]:
async def init_browser():
    """Initialize Patchright browser with stealth settings"""
    global browser, context, page

    playwright = await async_playwright().start()

    # Launch a persistent context with a fresh profile directory so the
    # session does not conflict with an already-running Chrome instance
    import tempfile

    # Create a unique temp directory for this session
    temp_dir = tempfile.mkdtemp(prefix="booking_scraper_")

    context = await playwright.chromium.launch_persistent_context(
        user_data_dir=temp_dir,
        channel="chrome",  # Use the installed Chrome instead of bundled Chromium
        executable_path="/home/noel/.local/bin/chrome",  # Machine-specific Chrome path; adjust for your system
        headless=False,
        no_viewport=True,
        # Don't add custom user_agent - let Patchright handle it
    )

    page = context.pages[0] if context.pages else await context.new_page()
    browser = playwright  # Driver handle (needed later for stop()), not a Browser object

    print("✓ Patchright browser initialized (stealth mode)")
    print(f"  Using temp dir: {temp_dir}")
    return page


# Initialize browser
page = asyncio.get_event_loop().run_until_complete(init_browser())
print("✓ Ready to scrape")

✓ Patchright browser initialized (stealth mode)
  Using temp dir: /tmp/booking_scraper___iw1f43
✓ Ready to scrape


## 5. Scrape Hotel List

Collect hotels from Kerala region search pages.
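
Each search page is addressed by an `offset` query parameter alongside the fixed `dest_id` and `dest_type` values used in this notebook. A small sketch of building that URL with `urlencode` (which the setup cell imports); `SEARCH_URL` and `build_search_url` are illustrative names, not part of the original code:

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.booking.com/searchresults.html"


def build_search_url(offset, dest_id=3476, dest_type="region"):
    """Build the search-results URL for one result page."""
    # Dict order is preserved, so the query string matches the hand-written URL
    params = {"dest_id": dest_id, "dest_type": dest_type, "offset": offset}
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url(25))
```

Keeping the parameters in a dict makes it easier to add further filters without editing a long f-string.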


In [5]:
async def scrape_hotels_async(max_hotels=TARGET_HOTELS):
    """Scrape hotel list using Playwright"""
    hotels = []
    offset = 0

    print(f"🏨 Collecting up to {max_hotels} hotels...")

    while len(hotels) < max_hotels:
        url = f"https://www.booking.com/searchresults.html?dest_id=3476&dest_type=region&offset={offset}"
        print(f"\n📄 Page {offset // 25 + 1} (offset={offset})...")

        try:
            # NOTE: "networkidle" waits for network activity to stop, which may
            # never happen on a busy page; the recorded run timed out on page 3
            await page.goto(url, wait_until="networkidle", timeout=60000)
            await asyncio.sleep(3)

            # Handle popups
            try:
                await page.click(
                    'button[aria-label="Dismiss sign-in info."]', timeout=3000
                )
            except Exception:
                pass  # Popup not shown; nothing to dismiss

            # Scroll to load all content
            for _ in range(3):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await asyncio.sleep(1)

            # Get page content
            html = await page.content()
            soup = BeautifulSoup(html, "lxml")

            # Find hotel cards
            cards = soup.select('div[data-testid="property-card"]')
            if not cards:
                cards = soup.select("[data-hotelid]")

            if not cards:
                print("⚠️ No hotel cards found")
                # Save the first 100,000 characters of the page for debugging
                with open(f"debug_page_{offset}.html", "w", encoding="utf-8") as f:
                    f.write(html[:100000])
                break

            print(f"  Found {len(cards)} hotel cards")
            hotels_added = 0

            for card in cards:
                if len(hotels) >= max_hotels:
                    break

                try:
                    # Hotel name
                    name_elem = card.select_one('div[data-testid="title"]')
                    if not name_elem:
                        name_elem = card.select_one('[data-testid="title-link"]')
                    name = clean_text(name_elem.get_text()) if name_elem else None

                    # Hotel URL
                    link_elem = card.select_one('a[data-testid="title-link"]')
                    if not link_elem:
                        link_elem = card.select_one('a[href*="/hotel/"]')
                    url = link_elem.get("href", "") if link_elem else ""
                    if url and not url.startswith("http"):
                        url = "https://www.booking.com" + url

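                    # Star rating
                    # NOTE: this counts every descendant <span>. The recorded output
                    # shows only even values (4 to 10 STAR), which suggests each star
                    # contributes two spans and the count is double the real class.
                    star_elem = card.select('[data-testid="rating-stars"] span')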
                    stars = len(star_elem) if star_elem else 0
                    star_rating = f"{stars} STAR" if stars > 0 else ""

                    # Overall rating
                    rating_elem = card.select_one(
                        'div[data-testid="review-score"] div:first-child'
                    )
                    overall_rating = ""
                    if rating_elem:
                        match = re.search(r"(\d+\.?\d*)", rating_elem.get_text())
                        if match:
                            overall_rating = match.group(1)

                    # Review count
                    review_elem = card.select_one('div[data-testid="review-score"]')
                    review_count = ""
                    if review_elem:
                        text = review_elem.get_text()
                        match = re.search(r"(\d+[\d,]*)\s*review", text, re.I)
                        if match:
                            review_count = match.group(1).replace(",", "")

                    if name and url:
                        hotels.append(
                            {
                                "name": name,
                                "url": url.split("?")[0],
                                "star_rating": star_rating,
                                "overall_rating": overall_rating,
                                "review_count": review_count,
                            }
                        )
                        hotels_added += 1

                except Exception:
                    # Skip cards that fail to parse
                    continue

            print(f"  ✓ Added {hotels_added} hotels (total: {len(hotels)})")

            if hotels_added == 0:
                print("⚠️ No new hotels, stopping")
                break

            offset += 25
            await asyncio.sleep(2)

        except Exception as e:
            print(f"❌ Error: {e}")
            break

    return hotels


# Run hotel collection
scraped_hotels = asyncio.get_event_loop().run_until_complete(
    scrape_hotels_async(TARGET_HOTELS)
)
print(f"\n✅ Collected {len(scraped_hotels)} hotels")

# Save checkpoint
if scraped_hotels:
    save_checkpoint(scraped_hotels, [])
    print("💾 Checkpoint saved")

🏨 Collecting up to 250 hotels...

📄 Page 1 (offset=0)...
  Found 69 hotel cards
  ✓ Added 69 hotels (total: 69)

📄 Page 2 (offset=25)...
  Found 69 hotel cards
  ✓ Added 69 hotels (total: 138)

📄 Page 3 (offset=50)...
❌ Error: Page.goto: Timeout 60000ms exceeded.
Call log:
  - navigating to "https://www.booking.com/searchresults.html?dest_id=3476&dest_type=region&offset=50", waiting until "networkidle"


✅ Collected 138 hotels
💾 Checkpoint saved: 138 hotels, 0 reviews
💾 Checkpoint saved


## 6. View Collected Hotels

Check the hotel list before collecting reviews.
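
The recorded run found 69 cards per page while the offset advanced by 25, so consecutive pages may overlap and the list may contain repeated hotels. A minimal sketch of removing repeats before review collection, using a hypothetical `dedupe_hotels` helper that keys on the cleaned URL and falls back to the name:

```python
def dedupe_hotels(hotels):
    """Return hotels in original order, keeping the first entry for each URL."""
    seen = set()
    unique = []
    for hotel in hotels:
        # Prefer the URL as the identity; fall back to the name if it is missing
        key = hotel.get("url") or hotel.get("name")
        if key in seen:
            continue
        seen.add(key)
        unique.append(hotel)
    return unique


sample = [
    {"name": "A", "url": "https://example.com/a"},
    {"name": "B", "url": "https://example.com/b"},
    {"name": "A", "url": "https://example.com/a"},
]
print(len(dedupe_hotels(sample)))
```

Applied as `scraped_hotels = dedupe_hotels(scraped_hotels)`, this would also show how many distinct hotels were really collected.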


In [7]:
# Load from checkpoint if needed
if not scraped_hotels:
    scraped_hotels, scraped_reviews = load_checkpoint()

if scraped_hotels:
    hotels_df = pd.DataFrame(scraped_hotels)
    print(f"📊 Total hotels: {len(hotels_df)}")
    display(hotels_df.head(15))

    print("\n⭐ Star distribution:")
    print(hotels_df["star_rating"].value_counts())
else:
    print("⚠️ No hotels collected. Run cell 5 first.")

📊 Total hotels: 138


|   | name | url | star_rating | overall_rating | review_count |
| --- | --- | --- | --- | --- | --- |
| 0 | Blue Serene Backwater Resort | https://www.booking.com/hotel/in/blue-serene-r... | 10 STAR | 7.3 | 100 |
| 1 | SylVan Resort | https://www.booking.com/hotel/in/sylvan-resort... |  | 9.2 | 159 |
| 2 | The Crescent Wayanad Heritage Pool Resort | https://www.booking.com/hotel/in/the-cresent-w... |  | 7.3 | 25 |
| 3 | PARK RESIDENCY ARCADIA | https://www.booking.com/hotel/in/park-residenc... | 8 STAR | 8.1 | 184 |
| 4 | Super Hotel O Thrissur Near Thrissur Medical C... | https://www.booking.com/hotel/in/oyo-flagship-... | 6 STAR | 6.5 | 5 |
| 5 | Hotel Urban Bella | https://www.booking.com/hotel/in/urban-bella.e... | 6 STAR | 8.5 | 39 |
| 6 | Nature Valley Farmhouse Resort by Raarees, Mun... | https://www.booking.com/hotel/in/tamshikanya-i... | 8 STAR | 8.1 | 71 |
| 7 | Royal Plaza Inn by RAK Rooms, Calicut | https://www.booking.com/hotel/in/royal-plaza-i... | 6 STAR | 8.4 | 118 |
| 8 | WithInn Hotel - Kannur Airport | https://www.booking.com/hotel/in/withinn-kannu... | 6 STAR | 8.5 | 135 |
| 9 | Ushasree Wayanad Premium Pool Resort by VOYE H... | https://www.booking.com/hotel/in/ushashree-way... |  | 8.3 | 73 |



⭐ Star distribution:
star_rating
           48
8 STAR     34
6 STAR     32
10 STAR    18
4 STAR      6
Name: count, dtype: int64
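
Every non-empty label in this distribution is an even number between 4 and 10, which would be consistent with the scraper counting two `<span>` elements per star. Under that assumption (it has not been verified against the page markup), a hypothetical `halve_star_label` helper could correct the labels after the fact:

```python
import re


def halve_star_label(label):
    """Convert a doubled label such as '10 STAR' to '5 STAR'; leave others unchanged."""
    match = re.fullmatch(r"(\d+) STAR", label or "")
    if not match:
        return label or ""
    count = int(match.group(1))
    if count % 2:
        # Odd counts do not fit the doubling assumption, so keep them as-is
        return label
    return f"{count // 2} STAR"


for raw in ["10 STAR", "8 STAR", "6 STAR", "4 STAR", ""]:
    print(repr(raw), "->", repr(halve_star_label(raw)))
```

This should only be applied to data produced by the original descendant-span count; a corrected selector would make it unnecessary.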


## 7. Scrape Reviews for Hotels

Visit each hotel and collect reviews.
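
Each visit depends on a single `page.goto` call, and a navigation timeout leaves the hotel without reviews. A minimal sketch of retrying with exponential backoff, assuming a hypothetical `with_retries` wrapper that is not part of the original notebook:

```python
import asyncio


async def with_retries(make_call, attempts=3, base_delay=2.0):
    """Await make_call() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # make_call must return a fresh awaitable on every invocation
            return await make_call()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, then 4x, ... before the next try
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_error
```

A navigation could then be wrapped as `await with_retries(lambda: page.goto(hotel["url"], wait_until="networkidle", timeout=60000))`; the lambda matters because an awaitable can only be awaited once.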


In [8]:
async def scrape_reviews_async(hotel, min_reviews=MIN_REVIEWS_PER_HOTEL, debug=False):
    """Scrape reviews for a single hotel"""
    reviews = []
    hotel_name = hotel["name"]

    try:
        # Go to hotel page
        await page.goto(hotel["url"], wait_until="networkidle", timeout=60000)
        await asyncio.sleep(3)

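        # Check sold out status
        # NOTE: this is a rough heuristic; it matches the phrase anywhere in
        # the page HTML, not only in the availability section
        html = await page.content()
        sold_out = (
            "SOLD OUT"
            if "sold out" in html.lower() or "no availability" in html.lower()
            else "AVAILABLE"
        )
        hotel["sold_out_status"] = sold_out

        # Close any popups
        try:
            await page.click('button[aria-label="Dismiss sign-in info."]', timeout=2000)
        except Exception:
            pass  # Popup not shown; nothing to dismiss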

        # Scroll to reviews section (it's usually lower on the page)
        for i in range(5):
            await page.evaluate(
                f"window.scrollTo(0, document.body.scrollHeight * {0.2 * (i + 1)})"
            )
            await asyncio.sleep(0.5)

        # Wait for page to stabilize
        await asyncio.sleep(2)

        # Click "Read all reviews" button to open reviews modal/section
        try:
            # Use page.click with timeout - it will scroll to element if needed
            await page.click('[data-testid="fr-read-all-reviews"]', timeout=5000)
            if debug:
                print("    ✓ Clicked: fr-read-all-reviews button")
            await asyncio.sleep(3)
        except Exception as e:
            if debug:
                print(f"    ⚠️ Could not click fr-read-all-reviews: {e}")

            # Try alternative selectors
            alt_selectors = [
                'button[data-testid="fr-read-all-reviews"]',
                '[data-testid="review-score-read-all"]',
                'a[href*="#tab-reviews"]',
            ]
            for selector in alt_selectors:
                try:
                    await page.click(selector, timeout=3000)
                    if debug:
                        print(f"    ✓ Clicked alternative: {selector}")
                    await asyncio.sleep(3)
                    break
                except Exception:
                    continue

        # Get updated page content
        html = await page.content()
        soup = BeautifulSoup(html, "lxml")

        # DEBUG: Save HTML
        if debug:
            debug_file = f"debug_hotel_{hotel_name[:20].replace(' ', '_')}.html"
            with open(debug_file, "w", encoding="utf-8") as f:
                f.write(html)
            print(f"    📄 Saved debug HTML to: {debug_file}")

        # Find review cards
        review_cards = soup.select('div[data-testid="review-card"]')

        if debug:
            print(f"    🔍 Found {len(review_cards)} review cards")

        # min_reviews acts as a cap here: at most that many cards are parsed
        for card in review_cards[:min_reviews]:
            try:
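                # Reviewer name - look for the name inside the reviewer section
                # NOTE: b08850ce41 / f546354b44 look like generated class names and
                # may change without notice; missing matches fall back to "Anonymous"
                reviewer = "Anonymous"
                name_elem = card.select_one("div.b08850ce41.f546354b44")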
                if name_elem:
                    reviewer = clean_text(name_elem.get_text())

                # Rating - the score number like "7.0"
                rating = ""
                rating_elem = card.select_one("div.f63b14ab7a")
                if rating_elem:
                    rating = clean_text(rating_elem.get_text())

                # Title - h4[data-testid="review-title"]
                title = ""
                title_elem = card.select_one('h4[data-testid="review-title"]')
                if title_elem:
                    title = clean_text(title_elem.get_text())

                # Get actual user review text (positive + negative combined)
                text_parts = []

                # Positive review text - get the span inside b99b6ef58f div
                pos_container = card.select_one('[data-testid="review-positive-text"]')
                if pos_container:
                    pos_span = pos_container.select_one("div.b99b6ef58f span")
                    if pos_span:
                        text_parts.append(clean_text(pos_span.get_text()))

                # Negative review text - get the span inside b99b6ef58f div
                neg_container = card.select_one('[data-testid="review-negative-text"]')
                if neg_container:
                    neg_span = neg_container.select_one("div.b99b6ef58f span")
                    if neg_span:
                        text_parts.append(clean_text(neg_span.get_text()))

                # Combine all text
                review_text = " ".join(text_parts)[:500] if text_parts else ""

                reviews.append(
                    {
                        "hotel_name": hotel_name,
                        "reviewer_name": reviewer,
                        "rating": rating,
                        "review_title": title,
                        "review_text": review_text,
                        "star_rating": hotel.get("star_rating", ""),
                        "overall_rating": hotel.get("overall_rating", ""),
                        "total_review_count": hotel.get("review_count", ""),
                        "sold_out_status": sold_out,
                    }
                )

                if debug and len(reviews) <= 3:
                    print(
                        f"      Review {len(reviews)}: {reviewer} - {rating} - {title[:30]}..."
                    )
                    print(f"         Text: {review_text[:100]}...")

            except Exception as e:
                if debug:
                    print(f"    Error parsing card: {e}")
                continue

    except Exception as e:
        print(f"  Error: {e}")

    return reviews


print("✓ Review scraping function defined (with scrolling + click)")

✓ Review scraping function defined (with scrolling + click)


## 8. Debug Single Hotel (Test)

Run this to debug review extraction on one hotel. Check the browser while it runs!


In [33]:
# DEBUG: Test on first hotel with detailed output
# The browser will navigate to the hotel page - watch it!

if scraped_hotels:
    test_hotel = scraped_hotels[0]
    print(f"🧪 Testing on: {test_hotel['name']}")
    print(f"   URL: {test_hotel['url']}")
    print()

    # Run with debug=True
    test_reviews = asyncio.get_event_loop().run_until_complete(
        scrape_reviews_async(test_hotel, min_reviews=10, debug=True)
    )

    print(f"\n📊 Found {len(test_reviews)} reviews")

    if test_reviews:
        for i, r in enumerate(test_reviews[:5]):
            print(f"\n--- Review {i + 1} ---")
            print(f"Reviewer: {r['reviewer_name']}")
            print(f"Rating: {r['rating']}")
            print(f"Title: {r['review_title']}")
            print(f"Text: {r['review_text'][:200]}...")
    else:
        print("\n⚠️ No reviews found. Check the debug HTML file created.")
        print("   Also look at the browser - is the reviews section visible?")
        print("   You may need to scroll down or click something manually.")
else:
    print("⚠️ No hotels loaded. Run section 5 first.")

🧪 Testing on: Fort Bridge View
   URL: https://www.booking.com/hotel/in/fortbridgeview.en-gb.html

    ⚠️ Could not click fr-read-all-reviews: Page.click: Timeout 5000ms exceeded.
Call log:
  - waiting for locator("[data-testid=\"fr-read-all-reviews\"]")
    - locator resolved to 2 elements. Proceeding with the first one: JSHandle@<button type="button" data-testid="fr-read-all-reviews" class="de576f5064 b46cd7aad7 d0a01e3d83 c7a901b0e7 bbf83acb81">…</button>
  - attempting click action
    2 × waiting for element to be visible, enabled and stable
      - element is visible, enabled and stable
      - scrolling into view if needed
      - done scrolling
      - <div class="bbe73dce14">…</div> from <div class="dc7e768484 a37804931c">…</div> subtree intercepts pointer events
    - retrying click action
    - waiting 20ms
    - waiting for element to be visible, enabled and stable
    - element is visible, enabled and stable
    - scrolling into view if needed
    - done scro
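
The call log above shows the button being found but another element intercepting pointer events, so a normal click never lands. One possible workaround, sketched under the assumption that the button reacts to a synthetic DOM click (not verified for this site), is to dispatch the event directly on the element. `click_via_dom_event` is a hypothetical helper:

```python
async def click_via_dom_event(page, selectors, timeout=3000):
    """Try each selector in order; return the one whose click was dispatched, or None."""
    for selector in selectors:
        try:
            # dispatch_event fires the click on the element itself, so an
            # overlay that intercepts pointer events does not get in the way.
            # .first avoids ambiguity when a selector matches several elements.
            await page.locator(selector).first.dispatch_event("click", timeout=timeout)
            return selector
        except Exception:
            # Selector not found within the timeout; try the next one
            continue
    return None
```

It could be called as `await click_via_dom_event(page, ['[data-testid="fr-read-all-reviews"]'])`, with a `None` result meaning no selector matched.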

## 9. Run Full Review Collection

Process all hotels and collect reviews (after debugging works).


In [9]:
async def collect_all_reviews():
    global scraped_reviews

    if not scraped_hotels:
        print("⚠️ No hotels. Run section 5 first.")
        return

    # Track hotels that already have reviews (for example after loading a checkpoint)
    processed = {r["hotel_name"] for r in scraped_reviews}
    all_reviews = list(scraped_reviews)

    print(f"🚀 Collecting reviews for {len(scraped_hotels)} hotels...")
    print(f"Already processed: {len(processed)}")

    for idx, hotel in enumerate(scraped_hotels):
        if hotel["name"] in processed:
            continue

        print(f"\n🏨 [{idx + 1}/{len(scraped_hotels)}] {hotel['name'][:50]}...")

        reviews = await scrape_reviews_async(hotel, debug=False)
        all_reviews.extend(reviews)
        processed.add(hotel["name"])

        print(f"  ✓ Got {len(reviews)} reviews")

        # Checkpoint every 10 hotels
        if (idx + 1) % 10 == 0:
            save_checkpoint(scraped_hotels, all_reviews)
            print(f"💾 Saved: {len(processed)} hotels, {len(all_reviews)} reviews")

        await asyncio.sleep(2)

    scraped_reviews = all_reviews
    save_checkpoint(scraped_hotels, scraped_reviews)

    print(f"\n✅ Done! {len(scraped_reviews)} reviews from {len(processed)} hotels")


# Run collection
asyncio.get_event_loop().run_until_complete(collect_all_reviews())

🚀 Collecting reviews for 138 hotels...
Already processed: 0

🏨 [1/138] Blue Serene Backwater Resort...
  ✓ Got 10 reviews

🏨 [2/138] SylVan Resort...
  ✓ Got 10 reviews

🏨 [3/138] The Crescent Wayanad Heritage Pool Resort...
  ✓ Got 10 reviews

🏨 [4/138] PARK RESIDENCY ARCADIA...
  ✓ Got 10 reviews

🏨 [5/138] Super Hotel O Thrissur Near Thrissur Medical Colle...
  ✓ Got 10 reviews

🏨 [6/138] Hotel Urban Bella...
  ✓ Got 10 reviews

🏨 [7/138] Nature Valley Farmhouse Resort by Raarees, Munnar ...
  ✓ Got 10 reviews

🏨 [8/138] Royal Plaza Inn by RAK Rooms, Calicut...
  ✓ Got 10 reviews

🏨 [9/138] WithInn Hotel - Kannur Airport...
  ✓ Got 10 reviews

🏨 [10/138] Ushasree Wayanad Premium Pool Resort by VOYE HOMES...
  ✓ Got 10 reviews
💾 Checkpoint saved: 138 hotels, 100 reviews
💾 Saved: 10 hotels, 100 reviews

🏨 [11/138] The TeaTree Munnar...
  ✓ Got 10 reviews

🏨 [12/138] Thekkady Gavi Suites...
  ✓ Got 10 reviews

🏨 [1

## 10. View & Export Results


In [10]:
# View results and export to CSV
if scraped_reviews:
    df = pd.DataFrame(scraped_reviews)

    # Add row numbers
    df.insert(0, "No.", range(1, len(df) + 1))

    # Rename columns to match template
    df = df.rename(
        columns={
            "hotel_name": "Hotel Name",
            "reviewer_name": "Reviewer Name",
            "rating": "Review Rating (1–10)",
            "review_title": "Review Title",
            "review_text": "Review Text",
            "total_review_count": "TOTAL REVIEW COUNT OF HOTEL",
            "star_rating": "HOTEL STAR CLASSIFICATION",
            "overall_rating": "OVERAL HOTEL RATING",
            "sold_out_status": "SOLD OUT STATUS",
        }
    )

    # Save to CSV (OUTPUT_CSV is set in the configuration cell)
    output_file = OUTPUT_CSV
    df.to_csv(output_file, index=False, encoding="utf-8-sig")
    print(f"✅ Saved {len(df)} reviews to {output_file}")

    # Display sample
    display(df.head(20))

    # Stats
    print(f"\nTotal reviews: {len(df)}")
    print(f"Unique hotels: {df['Hotel Name'].nunique()}")
else:
    print("No reviews collected yet. Run the collection cells first.")

✅ Saved 670 reviews to kerala_hotels_reviews.csv


|   | No. | Hotel Name | Reviewer Name | Review Rating (1–10) | Review Title | Review Text | HOTEL STAR CLASSIFICATION | OVERAL HOTEL RATING | TOTAL REVIEW COUNT OF HOTEL | SOLD OUT STATUS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Blue Serene Backwater Resort | Dharmalingam | 6.0 | Pleasant | Food - very tasty and quantity also much bette... | 10 STAR | 7.3 | 100 | SOLD OUT |
| 1 | 2 | Blue Serene Backwater Resort | Jitesh | 7.0 | Good | Ambience and the water body Spa was not functi... | 10 STAR | 7.3 | 100 | SOLD OUT |
| 2 | 3 | Blue Serene Backwater Resort | Neelam | 8.0 | Relaxing Stay at Blue Serene Resorts | Blue Serene Resorts offers a delightful stay w... | 10 STAR | 7.3 | 100 | SOLD OUT |
| 3 | 4 | Blue Serene Backwater Resort | G | 8.0 | Very clean hotel at excellent location… | Location,cleanliness Catering staff needs more... | 10 STAR | 7.3 | 100 | SOLD OUT |
| 4 | 5 | Blue Serene Backwater Resort | Noordheen | 8.0 | Very good | Food | 10 STAR | 7.3 | 100 | SOLD OUT |
| 5 | 6 | Blue Serene Backwater Resort | Sreelal | 7.0 | Can be a really great place with better admini... | great and quiet location, nice property, view ... | 10 STAR | 7.3 | 100 | SOLD OUT |
| 6 | 7 | Blue Serene Backwater Resort | Neeraj | 8.0 | Excellent value for money. | Great value for money and location by the rive... | 10 STAR | 7.3 | 100 | SOLD OUT |
| 7 | 8 | Blue Serene Backwater Resort | Rajani | 8.0 | Very good | Clean , serene property. Food quality was very... | 10 STAR | 7.3 | 100 | SOLD OUT |
| 8 | 9 | Blue Serene Backwater Resort | Pradeep | 8.0 | Very good. | Generally very good. Stay, location and the fo... | 10 STAR | 7.3 | 100 | SOLD OUT |
| 9 | 10 | Blue Serene Backwater Resort | Loftin | 7.0 | Good rooms. Bad swimming pool | Breakfast was very good. Room was neat and cle... | 10 STAR | 7.3 | 100 | SOLD OUT |



Total reviews: 670
Unique hotels: 70
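
With 670 reviews across 70 hotels, the average is below 10 per hotel, so at least one hotel misses the stated goal of 10 reviews each. A small sketch for listing those hotels from the raw `scraped_reviews` records, using a hypothetical `hotels_below_target` helper:

```python
from collections import Counter


def hotels_below_target(reviews, min_reviews=10):
    """Return {hotel_name: review_count} for hotels with fewer than min_reviews reviews."""
    counts = Counter(r["hotel_name"] for r in reviews)
    return {name: n for name, n in counts.items() if n < min_reviews}


sample_reviews = (
    [{"hotel_name": "Hotel A"}] * 10
    + [{"hotel_name": "Hotel B"}] * 4
)
print(hotels_below_target(sample_reviews))
```

Hotels that returned no reviews at all never appear in `scraped_reviews`, so they would have to be found by comparing against `scraped_hotels` instead.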
