# Webscraping with Python (another way)

This file shows another method for webscraping that tries to avoid getting blocked by the website.

Selenium (our previous method) is a great tool for webscraping, but it's not the best option when you need to scrape a lot of pages from a website. Selenium drives a real browser to make the requests, which is heavy to run, and the website can detect the automated browser and block your IP address.

We're using the `httpx` library to make the requests and the `BeautifulSoup` library to parse the HTML content.

We'll also set a 'User-Agent' header to make the request look like it's coming from a real browser.

This method is not perfect, and a website can still block it, but it's a good alternative and very light to run.

## Use HTTPX to make the requests

Here we'll make a basic request to the website and get the HTML content.

This will be useful to understand how the website is structured and how we can extract the information we need.

In [8]:
#use httpx to get the page
import httpx

url = 'https://www.tripadvisor.com/Hotel_Review-g187849-d2340336-Reviews-Armani_Hotel-Milan_Lombardy.html'
response = httpx.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
response.text


'<!DOCTYPE html><html lang="en-US"><head><link rel="icon" id="favicon" href="https://static.tacdn.com/favicon.ico?v2" type="image/x-icon"/><link rel="mask-icon" sizes="any" href="https://static.tacdn.com/img2/brand_refresh/application_icons/mask-icon.svg" color="#000000"/><meta name="theme-color" content="#34e0a1"/><meta name="format-detection" content="telephone=no"/><meta property="al:ios:app_name" content="TripAdvisor"/><meta property="al:ios:app_store_id" content="284876795"/><meta property="twitter:app:id:ipad" name="twitter:app:id:ipad" content="284876795"/><meta property="twitter:app:id:iphone" name="twitter:app:id:iphone" content="284876795"/><meta property="al:ios:url" content="tripadvisor://www.tripadvisor.com/Hotel_Review-g187849-d2340336-Reviews-Armani_Hotel-Milan_Lombardy.html?m=33762"/><meta property="twitter:app:url:ipad" name="twitter:app:url:ipad" content="tripadvisor://www.tripadvisor.com/Hotel_Review-g187849-d2340336-Reviews-Armani_Hotel-Milan_Lombardy.html?m=33762"/

## Use BeautifulSoup to parse the HTML content

After getting the HTML content, we'll use BeautifulSoup to parse it and extract the information we need.

We're writing a basic outline of the code here, just to test whether we can get the HTML content and parse it.

Our goal is to check functionality, not to extract all the information we need, so we're only scraping data from the single page we downloaded.
> We'll write more complete code later.

In [9]:
page_text = response.text # here we're saving the response text to a variable and making our code more readable

In [10]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_text, 'html.parser')

for review in soup.find_all('div', attrs={'data-reviewid': True}):
    review_id = review.get('data-reviewid')
    review_rating = review.find('div', attrs={'data-test-target': "review-rating"}).find('title').text
    review_rating = float(review_rating.split(' ')[0])
    review_title = review.find('div', attrs={'data-test-target': "review-title"}).text
    review_text = review.find('span', attrs={'data-automation': f"reviewText_{review_id}"}).text
    print(review_id, review_rating, review_title.strip(), review_text.strip())

# get the next page link using data-smoke-attr="pagination-next-arrow"
next_page = soup.find('a', attrs={'data-smoke-attr': 'pagination-next-arrow'})
print(next_page.get('href'))


742018385 5.0 Decent hotel Super modern hotel in a really good location.The rooms are why this hotel is so good. They are modern, large, comfortable and very quiet. The technology that runs them is great, though a bit annoying with all the light it puts out - impossible to get the room pitch black. Bathrooms are fabulous. The spa pool is lovely with great views over Milan. I would recommend them having hours for kids to be able to use the pool, quite frustrating that they don't allow kids at all.Breakfast was decent but over priced. Other guests obviously felt this way too as we were the only ones ever there and the 6-7 waiters just hung around.
808143792 5.0 August 2021 Returned to Milan for the first time since the pandemic. Have always loved Armani for it's fantastic location, service, amenities, and overall aesthetic. Staff are lovely and helped arrange same-day dinner reservations at Michelin star restaurants, as well as a in-room Covid test prior to departure. The rooms are start
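
The loop above reads `.text` straight from the result of `.find(...)`, which raises `AttributeError` whenever an element is missing from a review. A more defensive variant might look like the sketch below. `text_or_none`, `parse_review` and `SAMPLE_HTML` are hypothetical: the sample markup only imitates the attributes used above and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the attributes used in the loop above.
SAMPLE_HTML = """
<div data-reviewid="1">
  <div data-test-target="review-rating"><svg><title>4.0 of 5 bubbles</title></svg></div>
  <div data-test-target="review-title"> Nice stay </div>
  <span data-automation="reviewText_1">Clean room.</span>
</div>
<div data-reviewid="2">
  <div data-test-target="review-title">No rating here</div>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_review(review):
    """Extract the review fields, leaving None where an element is missing."""
    review_id = review.get("data-reviewid")
    rating_box = review.find("div", attrs={"data-test-target": "review-rating"})
    rating_title = rating_box.find("title") if rating_box is not None else None
    rating_text = text_or_none(rating_title)
    return {
        "review_id": review_id,
        "review_rating": float(rating_text.split(" ")[0]) if rating_text else None,
        "review_title": text_or_none(
            review.find("div", attrs={"data-test-target": "review-title"})
        ),
        "review_text": text_or_none(
            review.find("span", attrs={"data-automation": f"reviewText_{review_id}"})
        ),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
reviews = [parse_review(r) for r in soup.find_all("div", attrs={"data-reviewid": True})]
for parsed in reviews:
    print(parsed)
```

The second sample review has no rating and no text, so those fields come back as `None` instead of stopping the loop.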

## Putting it all together

Now that we understand how the website is structured and how to extract the information we need, we'll put everything together.

- We'll write a function that receives the URL of the page we want to scrape and returns the parsed HTML.
- We'll write a loop to scrape all the pages we need.
- We'll save the information in a CSV file.
- We'll also check whether the website is blocking us and whether there are more pages to scrape.
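
The CSV step in the list above can be tried on its own before any scraping. The sketch below assumes made-up sample rows and two hypothetical helpers, `save_reviews` and `load_reviews`. It opens the file with `newline=''`, which is what the `csv` module documentation recommends so that line breaks inside a field survive a round trip.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["review_id", "date_of_stay", "review_rating", "review_title", "review_text"]


def save_reviews(path, rows):
    """Write a header plus one row per review."""
    with open(path, "w", encoding="utf-8", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(FIELDS)
        writer.writerows(rows)


def load_reviews(path):
    """Read every row back as a list of strings."""
    with open(path, encoding="utf-8", newline="") as csv_file:
        return list(csv.reader(csv_file))


# Made-up review: the text contains both a line break and a comma.
sample_rows = [
    ["1", "August 2024", 5.0, "Great", "Line one\nline two, with a comma"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "reviews.csv"
    save_reviews(csv_path, sample_rows)
    loaded = load_reviews(csv_path)

print(loaded[0])
print(loaded[1])
```

Everything comes back as a string, so the rating is read as `'5.0'` and needs converting again if you load the file with the `csv` module instead of pandas.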

In [11]:
BASE_URL = 'https://www.tripadvisor.com'
httpx.get(BASE_URL+next_page.get('href'), headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}).text


'<!DOCTYPE html><html lang="en-US"><head><link rel="icon" id="favicon" href="https://static.tacdn.com/favicon.ico?v2" type="image/x-icon"/><link rel="mask-icon" sizes="any" href="https://static.tacdn.com/img2/brand_refresh/application_icons/mask-icon.svg" color="#000000"/><meta name="theme-color" content="#34e0a1"/><meta name="format-detection" content="telephone=no"/><meta property="al:ios:app_name" content="TripAdvisor"/><meta property="al:ios:app_store_id" content="284876795"/><meta property="twitter:app:id:ipad" name="twitter:app:id:ipad" content="284876795"/><meta property="twitter:app:id:iphone" name="twitter:app:id:iphone" content="284876795"/><meta property="al:ios:url" content="tripadvisor://www.tripadvisor.com/Hotel_Review-g187849-d2340336-Reviews-or10-Armani_Hotel-Milan_Lombardy.html?m=33762"/><meta property="twitter:app:url:ipad" name="twitter:app:url:ipad" content="tripadvisor://www.tripadvisor.com/Hotel_Review-g187849-d2340336-Reviews-or10-Armani_Hotel-Milan_Lombardy.html
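
Concatenating `BASE_URL` and the `href` works as long as the link is always relative. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles both relative and absolute links. `absolute_url` is a hypothetical helper name, and the `example.com` address is a placeholder.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com"


def absolute_url(href: str) -> str:
    """Resolve a link found on the page against BASE_URL."""
    return urljoin(BASE_URL, href)


# A relative link, as returned by the pagination arrow above.
print(absolute_url("/Hotel_Review-g187849-d2340336-Reviews-or10-Armani_Hotel-Milan_Lombardy.html"))

# An absolute link (placeholder address) is returned unchanged.
print(absolute_url("https://example.com/page"))
```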

In [15]:
import httpx
from bs4 import BeautifulSoup
import csv

BASE_URL = 'https://www.tripadvisor.com'

def get_page(url):
    response = httpx.get(BASE_URL+url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
    page_text = response.text
    soup = BeautifulSoup(page_text, 'html.parser')
    return soup

next_page = '/Hotel_Review-g187849-d2340336-Reviews-Armani_Hotel-Milan_Lombardy.html'

with open('data/ArmaniHotelReviews_202409.csv', 'w', encoding="utf-8") as csv_file:
    csvwriter = csv.writer(csv_file, lineterminator='\n' ) #lineterminator='\n' is used to avoid blank rows in the csv
    csvwriter.writerow(['review_id', 'date_of_stay', 'review_rating', 'review_title', 'review_text'])
    
    page_num = 0
    date_of_stay_class = '' # this class name is generated with random characters on the site, so we fetch it only once
    while len(next_page) > 0:# and page_num <5:
        page_num += 1
        print(page_num, next_page)
        soup = get_page(next_page)

        # check if the page is blocked by finding data-test-target="reviews-tab". if found then page is not blocked
        if not soup.find('div', attrs={'data-test-target': 'reviews-tab'}):
            print('Page has been blocked')
            break

        for review in soup.find_all('div', attrs={'data-reviewid': True}):
            review_id = review.get('data-reviewid')
            review_rating = review.find('div', attrs={'data-test-target': "review-rating"}).find('title').text
            review_rating = float(review_rating.split(' ')[0])
            review_title = review.find('div', attrs={'data-test-target': "review-title"}).text
            review_text = review.find('span', attrs={'data-automation': f"reviewText_{review_id}"}).text

            if date_of_stay_class == '':
                # date of stay is not easy to access. we need to get it by position within the first review
                # first we get the 4th child div
                # then the 2nd child of the current child div
                # then the 1st child of the current child div - this should be a span
                #date_of_stay_class = review.contents[3].contents[1].contents[0].get('class')

                # or we can get it by the content we're trying to find (ie. "Date of stay:") and then get the class of the parent
                date_of_stay_class = review.find(string="Date of stay:").parent.parent.get('class')
            
            # now we can get the date of stay from the span
            date_of_stay_span = review.contents[3].find('span', attrs={'class': date_of_stay_class})
            # default to an empty string so a review without a date doesn't reuse the previous review's value
            date_of_stay = ''
            # then we split the text to get just the date (example: "Date of stay: August 2024" -> "August 2024")
            if date_of_stay_span is not None:
                date_of_stay = date_of_stay_span.text.split(': ')[1]
            
            csvwriter.writerow([review_id, date_of_stay, review_rating, review_title.strip(), review_text.strip()])

        #get next page using data-smoke-attr="pagination-next-arrow"
        next_page_nav = soup.find('a', attrs={'data-smoke-attr': 'pagination-next-arrow'})
        if next_page_nav is not None:
            next_page = next_page_nav.get('href', default='')
        else:
            next_page = ''

print("ended")

1 /Hotel_Review-g187849-d2340336-Reviews-Armani_Hotel-Milan_Lombardy.html
2 /Hotel_Review-g187849-d2340336-Reviews-or10-Armani_Hotel-Milan_Lombardy.html
3 /Hotel_Review-g187849-d2340336-Reviews-or20-Armani_Hotel-Milan_Lombardy.html
4 /Hotel_Review-g187849-d2340336-Reviews-or30-Armani_Hotel-Milan_Lombardy.html
5 /Hotel_Review-g187849-d2340336-Reviews-or40-Armani_Hotel-Milan_Lombardy.html
6 /Hotel_Review-g187849-d2340336-Reviews-or50-Armani_Hotel-Milan_Lombardy.html
7 /Hotel_Review-g187849-d2340336-Reviews-or60-Armani_Hotel-Milan_Lombardy.html
8 /Hotel_Review-g187849-d2340336-Reviews-or70-Armani_Hotel-Milan_Lombardy.html
9 /Hotel_Review-g187849-d2340336-Reviews-or80-Armani_Hotel-Milan_Lombardy.html
10 /Hotel_Review-g187849-d2340336-Reviews-or90-Armani_Hotel-Milan_Lombardy.html
11 /Hotel_Review-g187849-d2340336-Reviews-or100-Armani_Hotel-Milan_Lombardy.html
12 /Hotel_Review-g187849-d2340336-Reviews-or110-Armani_Hotel-Milan_Lombardy.html
13 /Hotel_Review-g187849-d2340336-Reviews-or120-Arma
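
The loop stores the date of stay as plain text such as "August 2024". If a real date is more convenient for later analysis, the label could be parsed as in the sketch below. `parse_date_of_stay` is a hypothetical helper, and it assumes the label always has the form "Date of stay: <Month> <Year>" with English month names.

```python
from datetime import datetime
from typing import Optional


def parse_date_of_stay(label_text: Optional[str]) -> Optional[datetime]:
    """Turn 'Date of stay: August 2024' into datetime(2024, 8, 1), or None."""
    if not label_text:
        return None
    # partition() never raises, unlike indexing the result of split(': ')
    _, separator, value = label_text.partition(":")
    if not separator:
        return None
    try:
        # %B matches the full month name in the current locale (English by default)
        return datetime.strptime(value.strip(), "%B %Y")
    except ValueError:
        return None


print(parse_date_of_stay("Date of stay: August 2024"))
print(parse_date_of_stay("no date in this text"))
```

Returning `None` for anything unexpected keeps one odd review from stopping a long scraping run.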

## Checking the data

After scraping the data, we'll check if everything is correct.

If the data is correct, we can move on to analyzing it.

In [16]:
import pandas as pd

df = pd.read_csv('data/ArmaniHotelReviews_202409.csv')
print(df['review_rating'].mean())
df.head(10)

4.636299435028248


| | review_id | date_of_stay | review_rating | review_title | review_text |
|---|---|---|---|---|---|
| 0 | 742018385 | January 2020 | 5.0 | Decent hotel | Super modern hotel in a really good location.T... |
| 1 | 808143792 | August 2021 | 5.0 | August 2021 | Returned to Milan for the first time since the... |
| 2 | 878021261 | February 2023 | 5.0 | Beautiful hotel, lovely staff | The hotel is beautiful as one would expect, wh... |
| 3 | 740655301 | January 2020 | 5.0 | my new favorite hotel | One of the best thought out suites we have eve... |
| 4 | 745399028 | December 2019 | 5.0 | A very Merry Christmas for someone who isn’t ... | My third visit and everything is just as excel... |
| 5 | 802595672 | July 2021 | 4.0 | armani hotel milano | arriving to the hotel door doesn’t take you to... |
| 6 | 838119503 | May 2022 | 5.0 | Armani gets an “A+” | When you enter the hotel, you are greeted cord... |
| 7 | 857062192 | July 2022 | 5.0 | Summer in Milan at Armani | Spacious room, friendly staff and housekeepers... |
| 8 | 775289203 | October 2020 | 5.0 | Great! | As far as I can this hotel is perfect. My bedr... |
| 9 | 808717285 | September 2021 | 1.0 | Worst experience, racist staff foreigners are ... | Worst experience, rude and abusive staff do no... |
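
Beyond the mean rating and the first rows, a few quick counts can show whether the scrape went wrong somewhere. The sketch below uses made-up data in `SAMPLE_CSV` and a hypothetical `summarize` helper. It assumes ratings should fall between 1 and 5, which matches the values seen above but is not confirmed anywhere in this notebook.

```python
import io

import pandas as pd

# Made-up data: one duplicated review, one missing date, one impossible rating.
SAMPLE_CSV = """review_id,date_of_stay,review_rating,review_title,review_text
1,January 2020,5.0,Decent hotel,Good location
2,August 2021,4.0,Nice,Friendly staff
2,August 2021,4.0,Nice,Friendly staff
3,,7.0,Odd rating,
"""


def summarize(df):
    """Count rows and the most common signs of a faulty scrape."""
    return {
        "rows": len(df),
        "duplicate_ids": int(df["review_id"].duplicated().sum()),
        "missing_dates": int(df["date_of_stay"].isna().sum()),
        # between() is inclusive on both ends; a missing rating also counts as out of range
        "ratings_out_of_range": int((~df["review_rating"].between(1, 5)).sum()),
    }


sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))
report = summarize(sample_df)
print(report)
```

To run the same checks on the scraped file, pass the `df` loaded above to `summarize`.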
