This notebook shows another web scraping method, one that tries to avoid getting blocked by the website.

Selenium (our previous method) is a great tool for web scraping, but it is not the best option when you need to scrape many pages from the same website. Selenium drives a real browser to make the requests, and the website can often detect that automation and block your IP address.

Here we use the `httpx` library to make the requests and the `BeautifulSoup` library to parse the HTML content.

We'll also set a 'User-Agent' header to make each request look like it's coming from a real browser.

This method is not perfect (how perfect it needs to be is up to you), but it's a good alternative and very light to run.
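
Since the method can still get blocked now and then, one option is to space out the requests and retry a failed page after a pause. The sketch below is only an illustration under our own assumptions: `fetch_with_retries`, its parameters, and the convention that `fetch` returns `None` for a blocked page are all hypothetical names, not part of `httpx` or of the code in this notebook.

```python
import random
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns something other than None.

    `fetch` is any callable that returns None when the page looks blocked
    (an assumed convention for this sketch).
    """
    for attempt in range(max_attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # wait 1s, 2s, 4s, ... plus a little random jitter before retrying
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    return None
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting.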

In [None]:
# use httpx to fetch the page
import httpx

url = 'https://www.tripadvisor.com/Hotel_Review-g187849-d2340336-Reviews-Armani_Hotel-Milan_Lombardy.html'
# send a browser-like User-Agent so the request looks less like a script
response = httpx.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
response.text  # raw HTML of the page


In [None]:
page_text = response.text


In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_text, 'html.parser')

for review in soup.find_all('div', attrs={'data-reviewid': True}):
    review_id = review.get('data-reviewid')
    review_rating = review.find('div', attrs={'data-test-target': "review-rating"}).find('title').text
    review_rating = float(review_rating.split(' ')[0])
    review_title = review.find('div', attrs={'data-test-target': "review-title"}).text
    review_text = review.find('span', attrs={'data-automation': f"reviewText_{review_id}"}).text
    print(review_id, review_rating, review_title.strip(), review_text.strip())

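# get the next page link using data-smoke-attr="pagination-next-arrow"
next_page = soup.find('a', attrs={'data-smoke-attr': 'pagination-next-arrow'})
# find() returns None when the link is missing (last page, or a blocked response)
print(next_page.get('href') if next_page is not None else 'no next page')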
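
The rating is taken from the first word of the `title` text and converted with `float(...)`, which raises an error if the text is empty or has an unexpected shape. A slightly more defensive version might look like the sketch below. The helper name `parse_rating` is hypothetical, and the label format in the example (`"5.0 of 5 bubbles"`) is an assumption about what the site returns.

```python
def parse_rating(title_text):
    """Return the leading number of a rating label as a float, or None."""
    parts = title_text.strip().split()
    if not parts:
        return None  # empty label
    try:
        return float(parts[0])
    except ValueError:
        return None  # first word is not a number

# assumed label format, for illustration only
print(parse_rating('5.0 of 5 bubbles'))
```

Returning `None` instead of raising lets the loop skip or log a single odd review without stopping the whole scrape.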


In [None]:
BASE_URL = 'https://www.tripadvisor.com'
httpx.get(BASE_URL+next_page.get('href'), headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}).text
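
The `href` of the next-page link is a path relative to the site root, which is why we prepend `BASE_URL`. As a small alternative sketch, the standard library's `urljoin` does the same job and also copes with an `href` that is already absolute. The helper name `absolute_url` is ours, not something the site or `httpx` provides.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.tripadvisor.com'

def absolute_url(href):
    """Turn a relative href into a full URL; absolute hrefs pass through unchanged."""
    return urljoin(BASE_URL, href)

print(absolute_url('/Hotel_Review-g187849-d2340336-Reviews-or10-Armani_Hotel-Milan_Lombardy.html'))
```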


In [12]:
import httpx
from bs4 import BeautifulSoup
import csv

BASE_URL = 'https://www.tripadvisor.com'

def get_page(url):
    response = httpx.get(BASE_URL+url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
    page_text = response.text
    soup = BeautifulSoup(page_text, 'html.parser')
    return soup

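# note: this path starts at offset 10 (the second page of reviews);
# the same path without "-or10" is the first page
next_page = '/Hotel_Review-g187849-d2340336-Reviews-or10-Armani_Hotel-Milan_Lombardy.html'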

with open('data/ArmaniHotelReviews_202409.csv', 'w', encoding="utf-8") as csv_file:
    csvwriter = csv.writer(csv_file, lineterminator='\n' ) #lineterminator='\n' is used to avoid blank rows in the csv
    csvwriter.writerow(['review_id', 'date_of_stay', 'review_rating', 'review_title', 'review_text'])
    
    page_num = 0
    date_of_stay_class = '' # this class name is generated with random characters on the site, so we fetch it once from the first review
    while len(next_page) > 0:# and page_num <5:
        page_num += 1
        print(page_num, next_page)
        soup = get_page(next_page)

        # check if the page is blocked by finding data-test-target="reviews-tab". if found then page is not blocked
        if not soup.find('div', attrs={'data-test-target': 'reviews-tab'}):
            print('Page has been blocked')
            break

        for review in soup.find_all('div', attrs={'data-reviewid': True}):
            review_id = review.get('data-reviewid')
            review_rating = review.find('div', attrs={'data-test-target': "review-rating"}).find('title').text
            review_rating = float(review_rating.split(' ')[0])
            review_title = review.find('div', attrs={'data-test-target': "review-title"}).text
            review_text = review.find('span', attrs={'data-automation': f"reviewText_{review_id}"}).text

            if date_of_stay_class == '':
                # date of stay is not easy to access. we need to get it by position within the first review
                # first we get the 4th child div
                # then the 2nd child of the current child div
                # then the 1st child of the current child div - this should be a span
                date_of_stay_class = review.contents[3].contents[1].contents[0].get('class')
            
            # now we can get the date of stay from the span
            date_of_stay = review.contents[3].find('span', attrs={'class': date_of_stay_class}).text
            #then we split the text to get just the date (example: "Date of stay: August 2024" -> "August 2024")
            date_of_stay = date_of_stay.split(': ')[1]
            
            csvwriter.writerow([review_id, date_of_stay, review_rating, review_title.strip(), review_text.strip()])

        #get next page using data-smoke-attr="pagination-next-arrow"
        next_page_nav = soup.find('a', attrs={'data-smoke-attr': 'pagination-next-arrow'})
        if next_page_nav is not None:
            next_page = next_page_nav.get('href', default='')
        else:
            next_page = ''

print("ended")

1 /Hotel_Review-g187849-d2340336-Reviews-or10-Armani_Hotel-Milan_Lombardy.html
2 /Hotel_Review-g187849-d2340336-Reviews-or20-Armani_Hotel-Milan_Lombardy.html
3 /Hotel_Review-g187849-d2340336-Reviews-or30-Armani_Hotel-Milan_Lombardy.html
4 /Hotel_Review-g187849-d2340336-Reviews-or40-Armani_Hotel-Milan_Lombardy.html
5 /Hotel_Review-g187849-d2340336-Reviews-or50-Armani_Hotel-Milan_Lombardy.html
6 /Hotel_Review-g187849-d2340336-Reviews-or60-Armani_Hotel-Milan_Lombardy.html
7 /Hotel_Review-g187849-d2340336-Reviews-or70-Armani_Hotel-Milan_Lombardy.html
8 /Hotel_Review-g187849-d2340336-Reviews-or80-Armani_Hotel-Milan_Lombardy.html
9 /Hotel_Review-g187849-d2340336-Reviews-or90-Armani_Hotel-Milan_Lombardy.html
10 /Hotel_Review-g187849-d2340336-Reviews-or100-Armani_Hotel-Milan_Lombardy.html
11 /Hotel_Review-g187849-d2340336-Reviews-or110-Armani_Hotel-Milan_Lombardy.html
12 /Hotel_Review-g187849-d2340336-Reviews-or120-Armani_Hotel-Milan_Lombardy.html
13 /Hotel_Review-g187849-d2340336-Reviews-or13
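
The printed paths suggest that each page is identified by an offset (`or10`, `or20`, ...) that grows by 10. If we want to log progress or resume from a given page, we could read that offset back from the path. This is a sketch that assumes the `-Reviews-or<N>-` pattern seen above keeps holding; `review_offset` is a hypothetical helper name.

```python
import re

def review_offset(path):
    """Return the review offset encoded in a page path, or 0 if there is none."""
    match = re.search(r'-Reviews-or(\d+)-', path)
    return int(match.group(1)) if match else 0

print(review_offset('/Hotel_Review-g187849-d2340336-Reviews-or20-Armani_Hotel-Milan_Lombardy.html'))
```

A path without the `or<N>` part, like the first page of reviews, gives 0.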

In [11]:
import pandas as pd

df = pd.read_csv('data/ArmaniHotelReviews_202409.csv')
print(df['review_rating'].mean())
df.head(10)

4.627027027027027


Unnamed: 0,review_id,date_of_stay,review_rating,review_title,review_text
0,869387650,November 2022,5.0,I love this hotel,It is truly difficult to find a flaw in this h...
1,889405632,May 2023,5.0,Outstanding time at Armani in Milan,Fantastic experience at the Armani Hotel in Mi...
2,779338587,October 2020,5.0,Great Milano Hotel - Very Well Located,This hotel is all you expect from Armani. Desi...
3,837733277,May 2022,5.0,Thank you Lifestyle team,This hotel deserves a lot more than the flak r...
4,923971481,October 2023,5.0,Stylish Armani in stylish Milan,What an amazing stay! Totally stylish and such...
5,793435259,June 2021,5.0,Amazing....every time,"I love everything about the Armani hotel, one ..."
6,904077271,July 2023,3.0,First class team,The team at the Armani were amongst the best I...
7,797695140,July 2021,4.0,Everything is beautiful about this hotel excep...,The Hotel is a 'go-to' place with an amazing d...
8,837350044,April 2022,5.0,Enjoyable trip,It is not the first time we visit Milan but it...
9,815655696,October 2021,4.0,"Good location, sleek hotel",Very sleek hotel and impressive to see. The lo...
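
Because `date_of_stay` is stored as text like "November 2022", it has to be parsed before we can group by year. The sketch below does this with the standard library only, on the ten rows shown in the preview above, so it runs without the CSV file; the same grouping could be done in pandas. `mean_rating_by_year` is a hypothetical helper name, and `'%B %Y'` assumes English month names.

```python
from collections import defaultdict
from datetime import datetime

# (date_of_stay, review_rating) pairs copied from the preview above
sample = [
    ('November 2022', 5.0), ('May 2023', 5.0), ('October 2020', 5.0),
    ('May 2022', 5.0), ('October 2023', 5.0), ('June 2021', 5.0),
    ('July 2023', 3.0), ('July 2021', 4.0), ('April 2022', 5.0),
    ('October 2021', 4.0),
]

def mean_rating_by_year(rows):
    """Group ratings by the year of stay and average each group."""
    ratings = defaultdict(list)
    for date_of_stay, rating in rows:
        year = datetime.strptime(date_of_stay, '%B %Y').year
        ratings[year].append(rating)
    return {year: sum(values) / len(values) for year, values in sorted(ratings.items())}

by_year = mean_rating_by_year(sample)
for year, mean in by_year.items():
    print(year, round(mean, 2))
```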
