

# Scraping Data from Tripadvisor Using Beautiful Soup

Not all services provide an API for accessing their data, including popular websites such as Tripadvisor, Timeout London, and Gumtree. In cases where no API is available, we resort to extracting data directly from the webpage by parsing the HTML code.

The process is relatively straightforward: the webpage is loaded into memory, and the HTML content is then searched to locate the desired information. This is made feasible by the highly structured nature of HTML pages.

To accomplish this task, we require a couple of libraries. One is 'requests' (familiar to you), and the other is '***Beautiful Soup***' which assists in extracting information from a soup of HTML code. Refer to the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more details.

![](https://www.crummy.com/software/BeautifulSoup/bs4/doc/_images/6.1.jpg)

## Beautiful Soup

Let's get started with Beautiful Soup. We import the library and we have a look at an example:


In [1]:
from bs4 import BeautifulSoup

text = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
 """

soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly

The first command, `prettify()`, prints the HTML with one tag per line and consistent indentation.

❗You can use *Sublime Text* as an alternative editor to inspect the code. Copy and paste the text into a new file, then use "Ctrl" + "Shift" + "P" to install ["HTMLBeautify"](https://github.com/rareyman/HTMLBeautify).

In [2]:
print (soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



Simple lookups can be done by chaining tag names as attributes:

In [3]:
print(soup.body.p.get_text())
print(soup.head.title.get_text())



The Dormouse's story



The Dormouse's story



The two most used commands are `.find_all()` and `.find()`. Both work the same way, except that `.find()` returns only the first match and `.find_all()` returns all matches as a list.
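The difference also shows in what each call returns when nothing matches. The following is a minimal sketch on a made-up HTML snippet (the snippet and the variable names are illustrative and not part of the example above):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only to illustrate the return types.
html = '<p><a href="/one">One</a> <a href="/two">Two</a></p>'
demo = BeautifulSoup(html, "html.parser")

first_link = demo.find("a")          # a single Tag: the first match
all_links = demo.find_all("a")       # a list-like ResultSet with every match
missing = demo.find("table")         # None, because there is no <table>
no_tables = demo.find_all("table")   # an empty list

print(first_link.get_text())               # One
print([a.get_text() for a in all_links])   # ['One', 'Two']
print(missing, len(no_tables))             # None 0
```

Because `.find()` can return `None`, calling `.get_text()` directly on its result fails when the element is missing, so it is worth checking for `None` first.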

Let's inspect the basic block of the html code:
```
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
```


In [None]:
name = soup.find("a", class_="sister")
print (name)
print (name.get_text())

In [None]:
name = soup.find("a", id="link1")
print (name)
print (name.get_text())

<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>

Elsie



In [None]:
name = soup.find("a")
print (name)
print (name.get_text())

<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>

Elsie



We get everything between the opening and closing `a` tags. The individual items can be extracted like this:

In [None]:
text = name.get_text().strip()
print(text)

link = name["href"]
print(link)

link_id = name["id"]  # avoid shadowing the built-in id()
print(link_id)

Elsie
http://example.com/elsie
link1
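Square-bracket access raises a `KeyError` when the attribute is absent. As a small sketch on one of the links from the example, `.get()` returns `None` instead, and `.attrs` shows all attributes at once:

```python
from bs4 import BeautifulSoup

html = '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
tag = BeautifulSoup(html, "html.parser").find("a")

print(tag.attrs)          # dictionary with all attributes of the tag
print(tag.get("href"))    # http://example.com/elsie
print(tag.get("title"))   # None: the tag has no title attribute
print(tag["class"])       # ['sister'] (class values come back as a list)
```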


Same goes for all names:

In [None]:
names = soup.find_all("a", class_="sister")

for name in names:
  print(name.get_text().strip())

Elsie
Lacie
Tillie


This is admittedly a very short introduction to Beautiful Soup, but it covers about 95% of what you will need to scrape a webpage. And with that in mind, we can start...



⚠️ It's worth noting that if you encounter challenges in your coding exercise, you have the option to seek assistance from ***ChatGPT***. Simply input the HTML, express your intent to utilize Python Beautiful Soup, and describe the issues.

---




## Scraping Tripadvisor



This is a full example of how to scrape data from a webpage, [Tripadvisor](https://www.tripadvisor.co.uk/Hotels-g186338-London_England-Hotels.html) in this case. The same techniques apply to other pages.

### Getting the HTML and creating a soup

We load the page with `requests` and send browser-like headers along with the request. Without them, the server may recognise the request as coming from a script and refuse it.

In [None]:
from bs4 import BeautifulSoup
import requests


url = "https://www.tripadvisor.co.uk/Hotel_Review-g186338-d10810215-Reviews-The_Montcalm_Royal_London_House-London_England.html"


headers = {
    'authority': 'www.tripadvisor.co.uk',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
}

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.text, "html.parser")

print(soup.prettify())



### Searching in the soup of a single entry



Here we examine the page, randomly pick some items and flex our muscles...

In [5]:
title = soup.find_all("h1", {"class": "QdLfr b d Pn"})
print(title)

print (title[0])

print(title[0].get_text().strip())

[<h1 class="QdLfr b d Pn" id="HEADING">The Montcalm Royal London House</h1>]
<h1 class="QdLfr b d Pn" id="HEADING">The Montcalm Royal London House</h1>
The Montcalm Royal London House


In [6]:
title = soup.find_all("h1", class_="QdLfr b d Pn")
title[0].get_text()

'The Montcalm Royal London House'

Note the difference between `{"class": "QdLfr b d Pn"}` and `class_="QdLfr b d Pn"` in the two searches. Both give the same result. `class_=` is a keyword argument of Beautiful Soup's search methods; it carries a trailing underscore because `class` is a reserved word in Python. The dictionary form is more general: it works for any attribute, including names such as `data-automation` that cannot be written as a Python keyword argument.
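The two search styles can be compared side by side. This is a sketch on a made-up snippet; the class string and the `data-automation` value mirror the ones used in this notebook, while the hotel name and link are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure of the real page.
html = '''
<div data-automation="hotel-card-title"><a href="/hotel-a">Hotel A</a></div>
<h1 class="QdLfr b d Pn" id="HEADING">Hotel A</h1>
'''
demo = BeautifulSoup(html, "html.parser")

# Keyword form and dictionary form find the same element.
by_keyword = demo.find_all("h1", class_="QdLfr b d Pn")
by_dict = demo.find_all("h1", {"class": "QdLfr b d Pn"})
print(by_keyword[0] is by_dict[0])   # True

# An attribute name with a hyphen only works in the dictionary form.
cards = demo.find_all("div", {"data-automation": "hotel-card-title"})
print(cards[0].a["href"])            # /hotel-a
```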

In [7]:
amount_of_reviews = soup.find_all("span", class_="qqniT")
amount_of_reviews[0].get_text()

'4,082 reviews'

In [8]:
rating = soup.find_all("span", class_="uwJeR P")
rating[0].get_text()

'5.0'

In [9]:
ranking = soup.find_all("div", class_="cGAqf")
ranking[0].get_text()

#clean up the results:
#ranking[0].get_text().split(" ")[0].replace('#', '')


'#21 of 1,225 hotels in London'

In [10]:
address = soup.find_all("span", class_="fHvkI PTrfg")
address[0].get_text()

'Royal London House 22-25 Finsbury Square, London EC2A 1DX England'
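The values returned above are plain strings. If numbers are needed later, they can be converted with a little text processing. The helpers below (`parse_count` and `parse_ranking`) are hypothetical names introduced for this sketch, and they assume the text formats shown in the outputs above:

```python
import re

def parse_count(text):
    """Return the first integer in text, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None

def parse_ranking(text):
    """Split a string like '#21 of 1,225 hotels in London' into (rank, total)."""
    match = re.match(r"#(\d[\d,]*) of (\d[\d,]*)", text)
    if not match:
        return None
    rank, total = (int(group.replace(",", "")) for group in match.groups())
    return rank, total

print(parse_count("4,082 reviews"))                    # 4082
print(parse_ranking("#21 of 1,225 hotels in London"))  # (21, 1225)
print(float("5.0"))                                    # 5.0
```

Both helpers return `None` when the text does not have the expected shape, which is easier to handle downstream than an exception in the middle of a long run.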

### Modularising the search


Now that we have tested the searches and know that they work, we wrap them in functions, which makes the code easier to reuse and adapt later on. The first four functions repeat the searches from above (the ranking is additionally reduced to its number). The last function combines the four and returns the result as a dictionary.

In [13]:
def get_name(soup):
  name = soup.find_all("h1", class_="QdLfr b d Pn")
  return name[0].get_text()

def get_adress(soup):
  address = soup.find_all("span", class_="fHvkI PTrfg")
  return address[0].get_text()

def get_ranking(soup):
  ranking = soup.find_all("div", class_="cGAqf")
  # keep only the rank, e.g. '#21 of 1,225 hotels in London' -> '21'
  ranking = ranking[0].get_text().split(" ")[0].replace('#', '')
  return ranking

def get_amount_reviews(soup):
  amount_of_reviews = soup.find_all("span", class_="qqniT")
  return amount_of_reviews[0].get_text()


def get_info(url):
  # download one hotel page and collect all fields in a dictionary
  r = requests.get(url, headers=headers)
  soup = BeautifulSoup(r.text, "html.parser")

  infodict = {}
  infodict["Name"] = get_name(soup)
  infodict["Adress"] = get_adress(soup)
  infodict["Ranking"] = get_ranking(soup)
  infodict["Amount_Reviews"] = get_amount_reviews(soup)
  infodict["url"] = url

  return infodict


... and we test the functions

In [14]:
line = get_info ("https://www.tripadvisor.co.uk/Hotel_Review-g186338-d3164384-Reviews-Park_Grand_London_Hyde_Park-London_England.html")

line

{'Name': 'Park Grand London Hyde Park',
 'Adress': '78 - 82 Westbourne Terrace Paddington, London W2 6QA England',
 'Ranking': '1',
 'Amount_Reviews': '4,036 reviews',
 'url': 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d3164384-Reviews-Park_Grand_London_Hyde_Park-London_England.html'}


### The landing page

Extracting links from the landing page.


In [None]:
main_url = "https://www.tripadvisor.co.uk/Hotels-g186338-London_England-Hotels.html"

req = requests.get(main_url, headers=headers)
soup = BeautifulSoup(req.text, "html.parser")

soup


We parse the landing page and extract a list of links. We first test the search expression and then wrap it in a function.

In [39]:
# Look up the link of a single entry (index 21 here).
# Finding the right search expression takes a bit of trial and error.

entries = soup.find_all("div", {"data-automation": "hotel-card-title"})
link = entries[21].find("a", href=True)["href"]
print(link)

/Hotel_Review-g186338-d1886452-Reviews-The_Hide_London-London_England.html


In [41]:
# Now loop over the entries; only the first four are printed here.

entries = soup.find_all("div", {"data-automation": "hotel-card-title"})

for entry in entries[:4]:
  link = entry.find("a", href=True)["href"]
  # print(entry)
  full_url = "https://www.tripadvisor.co.uk" + link
  print(full_url)



https://www.tripadvisor.co.uk/Hotel_Review-g186338-d602781-Reviews-Limegrove_Hotel_London-London_England.html
https://www.tripadvisor.co.uk/Hotel_Review-g186338-d10810215-Reviews-The_Montcalm_Royal_London_House-London_England.html
https://www.tripadvisor.co.uk/Hotel_Review-g186338-d193062-Reviews-Park_Grand_Paddington_Court-London_England.html
https://www.tripadvisor.co.uk/Hotel_Review-g186338-d3164384-Reviews-Park_Grand_London_Hyde_Park-London_England.html


In [42]:
# now that we know it works, we combine that into a function:
# the input is the landing page and the result is a list with the links
# to the detail pages.

def get_hotel_links(url_landing):
  requ = requests.get(url_landing, headers=headers)
  bsoup = BeautifulSoup(requ.text, "html.parser")

  # each hotel card title wraps the link to the detail page
  entries = bsoup.find_all("div", {"data-automation": "hotel-card-title"})

  hotel_links = []
  for entry in entries:
    link = entry.find("a", href=True)["href"]
    full_url = "https://www.tripadvisor.co.uk" + link
    hotel_links.append(full_url)

  return hotel_links


hotel_links = get_hotel_links(main_url)

hotel_links

['https://www.tripadvisor.co.uk/Hotel_Review-g186338-d187686-Reviews-The_Savoy-London_England.html',
 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d10810215-Reviews-The_Montcalm_Royal_London_House-London_England.html',
 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d193062-Reviews-Park_Grand_Paddington_Court-London_England.html',
 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d3164384-Reviews-Park_Grand_London_Hyde_Park-London_England.html',
 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d15134890-Reviews-The_Resident_Covent_Garden-London_England.html',
 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d13569031-Reviews-Travelodge_London_City_hotel-London_England.html',
 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d242994-Reviews-Central_Hotel-London_England.html',
 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d243667-Reviews-Travelodge_London_Kings_Cross_Royal_Scot-London_England.html',
 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d19
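The links on the landing page are relative, which is why the domain is prepended by hand. As an alternative sketch, `urljoin` from the standard library does the same and also leaves links untouched that are already absolute (`BASE_URL` is a name introduced here for illustration):

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.co.uk"  # illustrative constant

relative = "/Hotel_Review-g186338-d1886452-Reviews-The_Hide_London-London_England.html"
print(urljoin(BASE_URL, relative))

# A link that is already absolute passes through unchanged.
absolute = "https://www.tripadvisor.co.uk/Hotels-g186338-London_England-Hotels.html"
print(urljoin(BASE_URL, absolute))
```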


### Parsing all subsequent pages from one landing page



This is the same as for a single entry, only inside a loop. Note the delay of 0.5 seconds between the calls. The delay matters: without it, the server might block the requests.

In [55]:
import time

for hotel_link in hotel_links:
  info = get_info(hotel_link)
  print (info)
  time.sleep(0.5)

{'Name': 'The Savoy', 'Adress': 'The Strand, London WC2R 0EZ England', 'Ranking': '33', 'Amount_Reviews': '7,328 reviews', 'url': 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d187686-Reviews-The_Savoy-London_England.html'}
{'Name': 'Wilde Aparthotels, London, Paddington', 'Adress': '4 North Wharf Road Paddington, London W2 1NW England', 'Ranking': '12', 'Amount_Reviews': '1,190 reviews', 'url': 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d23500327-Reviews-Wilde_Aparthotels_London_Paddington-London_England.html'}
{'Name': 'Hilton London Metropole', 'Adress': '225 Edgware Road, London W2 1JU England', 'Ranking': '218', 'Amount_Reviews': '1,409 reviews', 'url': 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d193089-Reviews-Hilton_London_Metropole-London_England.html'}
{'Name': 'Strand Palace', 'Adress': '372 Strand, London WC2R 0JJ England', 'Ranking': '294', 'Amount_Reviews': '1,073 reviews', 'url': 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d193112-Reviews-S
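A fixed pause is the simplest option. A possible variation, sketched here, adds a random component so that the requests do not arrive at perfectly regular intervals. Whether this makes any difference for a given site is an assumption, and `polite_pause` is a hypothetical helper name:

```python
import random
import time

def polite_pause(base=0.5, jitter=0.5):
    """Sleep for base seconds plus a random extra of up to jitter seconds."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Tiny values here so the demonstration finishes quickly.
for link in ["page-1", "page-2", "page-3"]:
    waited = polite_pause(base=0.01, jitter=0.01)
    print(link, round(waited, 3))
```

In the loop above, `polite_pause()` would take the place of `time.sleep(0.5)`.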

### Populating a pandas DataFrame


We know that it's possible to get information for all the hotels on the landing page. As the data volume grows, it needs better organisation.
A ***pandas DataFrame*** is well suited to this. We run the same code as before, but this time we collect the results in a DataFrame.


In [56]:
import pandas as pd

# collect one dictionary per hotel, then build the DataFrame in one go
rows_list = []
for hotel_link in hotel_links:
  info = get_info(hotel_link)
  # print(info)
  rows_list.append(info)
  time.sleep(0.5)


df = pd.DataFrame(rows_list)

df

| | Name | Adress | Ranking | Amount_Reviews | url |
|---|---|---|---|---|---|
| 0 | The Savoy | The Strand, London WC2R 0EZ England | 33 | 7,328 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 1 | Wilde Aparthotels, London, Paddington | 4 North Wharf Road Paddington, London W2 1NW E... | 12 | 1,190 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 2 | Hilton London Metropole | 225 Edgware Road, London W2 1JU England | 218 | 1,409 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 3 | Strand Palace | 372 Strand, London WC2R 0JJ England | 294 | 1,073 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 4 | The Chesterfield Mayfair | 35 Charles Street Mayfair, London W1J 5EB England | 9 | 5,195 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 5 | hub by Premier Inn London Westminster Abbey hotel | 21 Tothill Street Westminster, London SW1H 9LL... | 183 | 3,264 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 6 | Central Hotel | 16-18 Argyle Street, London WC1H 8EG England | 42 | 686 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 7 | The Resident Soho | 10 Carlisle Street, London W1D 3BR England | 22 | 3,765 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 8 | ibis London Canning Town | 8 Silvertown Way Canning Town, London E16 1ED ... | 120 | 762 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
| 9 | Royal Lancaster London | Lancaster Terrace, London W2 2TY England | 4 | 1,808 reviews | https://www.tripadvisor.co.uk/Hotel_Review-g18... |
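All columns in this table hold strings, so sorting by ranking or summing reviews will not behave numerically. A minimal sketch of the conversion, using two rows copied from the table (in practice `df` would be the DataFrame built from `get_info`):

```python
import pandas as pd

# Two rows taken from the table above, reduced to the relevant columns.
df = pd.DataFrame([
    {"Name": "The Savoy", "Ranking": "33", "Amount_Reviews": "7,328 reviews"},
    {"Name": "Central Hotel", "Ranking": "42", "Amount_Reviews": "686 reviews"},
])

# Convert the ranking to integers.
df["Ranking"] = df["Ranking"].astype(int)

# Drop the " reviews" suffix and the thousands separator, then convert.
df["Amount_Reviews"] = (
    df["Amount_Reviews"]
    .str.replace(" reviews", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(df.sort_values("Ranking"))
print(df.dtypes)
```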


... and the same code as a function:


In [57]:
def get_all_entries(hotel_links):
  # scrape every detail page and return the results as one DataFrame
  rows_list = []
  for hotel_link in hotel_links:
    info = get_info(hotel_link)
    print(info)
    rows_list.append(info)
    time.sleep(0.5)
  df = pd.DataFrame(rows_list)
  return df



### File naming

Ultimately, we want to save a CSV file. A sloppy naming convention for files is risky, though: after a while it is hard to tell what's what, and files can be overwritten by accident.

To help with this, it makes sense to put the date and time of creation in front of the filename. This way you know which file is the latest, and you can delete old files manually.


In [58]:
from datetime import datetime

dateTimeObj = datetime.now()
print(dateTimeObj.year, '/', dateTimeObj.month, '/', dateTimeObj.day)
print(dateTimeObj.hour, ':', dateTimeObj.minute, ':', dateTimeObj.second, '.', dateTimeObj.microsecond)

# strftime zero-pads every field, so names stay unambiguous and sort chronologically
timestamp = dateTimeObj.strftime("%Y%m%d_%H%M%S")

def filename(basename):
  # prefix the base name with the creation time, e.g. 20231111_144042_data.csv
  timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_")
  return timestamp + basename + ".csv"

filename("data")

2023 / 11 / 11
14 : 40 : 42 . 772646


'20231111_144042_data.csv'



### Scraping Strategy



Tripadvisor informs us that there will be 4,424 hotel entries. Considering that each landing page has 30 entries, we can expect about 148 landing pages (4,424 / 30 ≈ 147.5, rounded up), each of them with a unique address.
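The page count can be checked with a ceiling division. The sketch below also lists the offsets that appear after `oa` in the landing-page URLs; treating the first page as offset 0 is an assumption made here for illustration:

```python
import math

total_hotels = 4424   # number reported by Tripadvisor
per_page = 30         # entries per landing page

n_pages = math.ceil(total_hotels / per_page)
print(n_pages)  # 148

# One offset per landing page, in steps of 30.
offsets = list(range(0, total_hotels, per_page))
print(len(offsets), offsets[:4], offsets[-1])  # 148 [0, 30, 60, 90] 4410
```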

The data should be safe in case a problem causes a crash while parsing a page. The simplest approach is to append data to the CSV file each time Python has finished parsing a landing page.

In other words, we start with an empty text file. Each time a landing page is parsed, the file is opened, the data is added and the file is closed. This cycle repeats until the program ends. If the program crashes, the data collected so far is already saved. This is possibly not the fastest way, but it is fairly easy and robust.
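The append cycle can be sketched in isolation with made-up rows and a temporary file. The helper `append_rows` is a hypothetical name; unlike the cells that follow, it writes the column header on the first write and leaves out the index, which is one possible design choice rather than the approach used in this notebook:

```python
import os
import tempfile

import pandas as pd

def append_rows(rows, path):
    """Append a list of dictionaries to a CSV file, writing the header only once."""
    df = pd.DataFrame(rows)
    # The header is needed only if the file is missing or still empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode="a", header=write_header, index=False, encoding="utf-8")

# Demonstration with invented rows in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_rows([{"Name": "Hotel A", "Ranking": "1"}], path)
append_rows([{"Name": "Hotel B", "Ranking": "2"}], path)

print(pd.read_csv(path))
```

Each call opens the file, appends and closes it again, so everything written before a crash stays on disk.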

We create the filename and the empty file.

In [59]:
datafile_name = filename("tripadvisor")

dt =  pd.DataFrame()

dt.to_csv(datafile_name , sep=',', encoding='utf-8')

We create a list of links to the landing pages we would like to parse:

In [62]:
# Landing pages after the first one carry an offset in the URL, for example:
# https://www.tripadvisor.co.uk/Hotels-g186338-oa30-London_England-Hotels.html

# note: this function replaces the earlier main_url string
def main_url(pagenumber):
  url = "https://www.tripadvisor.co.uk/Hotels-g186338-oa" + str(pagenumber) + "-London_England-Hotels.html"
  return url

# page_numbers = [30, 60, 90, 120]  # a longer run
page_numbers = [90, 120]            # the two pages used for this test

for page_number in page_numbers:
  print(main_url(page_number))


https://www.tripadvisor.co.uk/Hotels-g186338-oa90-London_England-Hotels.html
https://www.tripadvisor.co.uk/Hotels-g186338-oa120-London_England-Hotels.html


And now we loop through each page and append the data during each iteration:

In [63]:
for page_number in page_numbers:
  url_to_get = main_url(page_number)
  print(url_to_get)
  hotel_links = get_hotel_links(url_to_get)
  dt = get_all_entries(hotel_links)
  dt.to_csv(datafile_name, mode='a', header=False)


https://www.tripadvisor.co.uk/Hotels-g186338-oa90-London_England-Hotels.html
{'Name': 'Central Hotel', 'Adress': '16-18 Argyle Street, London WC1H 8EG England', 'Ranking': '42', 'Amount_Reviews': '686 reviews', 'url': 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d242994-Reviews-Central_Hotel-London_England.html'}
{'Name': 'The Savoy', 'Adress': 'The Strand, London WC2R 0EZ England', 'Ranking': '33', 'Amount_Reviews': '7,328 reviews', 'url': 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d187686-Reviews-The_Savoy-London_England.html'}
{'Name': "Lost Property St Paul's London - Curio Collection by Hilton", 'Adress': '3-5 Ludgate Hill, London EC4M 7AA England', 'Ranking': '26', 'Amount_Reviews': '428 reviews', 'url': 'https://www.tripadvisor.co.uk/Hotel_Review-g186338-d23906171-Reviews-Lost_Property_St_Paul_s_London_Curio_Collection_by_Hilton-London_England.html'}
{'Name': 'Travelodge London Central Kings Cross', 'Adress': "Willing House 356-364 Gray's Inn Road Willings House,

## Closing Remarks

Google Colab is a great platform for code development: it lets you collaborate with colleagues without installing libraries and dependencies locally. However, the free version limits how long code may run, and runtime issues can disconnect the session and abort the code. Anticipate failures on several fronts. Lengthy scraping scripts are better run locally on your computer. In addition, `try`/`except` blocks and breaking procedures into smaller steps make the code easier to manage and more robust.
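As a small illustration of the `try` structure, a lookup can be wrapped so that a missing element yields `None` instead of stopping the whole run. The helper `safe_text` is a hypothetical name and the HTML is made up; the class strings are the ones used earlier in this notebook:

```python
from bs4 import BeautifulSoup

def safe_text(soup, tag, class_name):
    """Return the stripped text of the first match, or None if nothing matches."""
    try:
        return soup.find_all(tag, class_=class_name)[0].get_text().strip()
    except IndexError:
        # find_all returned an empty list, so index 0 does not exist
        return None

html = '<h1 class="QdLfr b d Pn">Example Hotel</h1>'
demo = BeautifulSoup(html, "html.parser")

print(safe_text(demo, "h1", "QdLfr b d Pn"))  # Example Hotel
print(safe_text(demo, "span", "qqniT"))       # None
```

With a wrapper like this, one hotel page with an unexpected layout produces an empty field in the results rather than an aborted scrape.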