# Scraping Review Data From TripAdvisor

We will use the `requests` and `bs4` libraries to scrape review data from TripAdvisor. Unlike sites such as Yelp, TripAdvisor does not block repeated requests to its URLs, so there is no need to send authorization headers. We can call the reviews pages repeatedly and extract the review data without running into IP blocking.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

In [3]:
# Browser-like request headers. The Access-Control-* entries are CORS
# response headers and are not needed in a request.
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'en,mr;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}

url = "https://www.tripadvisor.ca/Hotel_Review-g304551-d304216-Reviews-The_Oberoi_New_Delhi-New_Delhi_National_Capital_Territory_of_Delhi.html"
req = requests.get(url, headers=headers, timeout=5, verify=True)
print(req.status_code)
soup = BeautifulSoup(req.content, 'html.parser')

200
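The request above uses `timeout=5`, so a slow response raises an exception instead of hanging. As a minimal sketch, assuming a few retries with a short pause are acceptable, the call can be wrapped in a small helper. The names `fetch_with_retries`, `attempts`, and `pause` are hypothetical and not part of `requests`; the fetching function is passed in as an argument so the helper can be tried without network access.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, pause=1.0):
    """Call fetch(url) up to `attempts` times, pausing between failures.

    `fetch` is any callable that returns a response or raises on failure,
    for example: lambda u: requests.get(u, headers=headers, timeout=5)
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # in real use, narrow to requests.RequestException
            last_error = error
            if attempt < attempts:
                time.sleep(pause)
    # Every attempt failed: surface the last error to the caller.
    raise last_error


# Stand-in fetcher (an assumption for illustration): fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "page for " + url


result = fetch_with_retries(flaky_fetch, "https://example.com", attempts=3, pause=0)
```

Passing the fetcher in keeps the retry logic separate from the HTTP library, which makes it easy to exercise with a fake function as shown.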


We use the tqdm library to show the current loop number during execution. This is especially helpful when a timeout occurs while the script is running: based on the loop number, we can restart the script from the page where the timeout occurred and collect the remaining reviews. The classes for the review content were found by inspecting the page elements. These classes may change in the future.

In [18]:
master_review_list = []

for x in tqdm(range(0, 2900, 10)):
    # The "-or<offset>-" part of the URL selects the page; offsets step by 10
    url = "https://www.tripadvisor.ca/Hotel_Review-g304551-d304216-Reviews-or" + str(x) + "-The_Oberoi_New_Delhi-New_Delhi_National_Capital_Territory_of_Delhi.html#REVIEWS"
    req = requests.get(url, headers=headers, timeout=5, verify=True)
    soup = BeautifulSoup(req.content, 'html.parser')
    # Each review sits in a container with class "YibKl"
    for y in soup.body.find_all(class_="YibKl"):
        # Review content
        review = y.find("q", {"class": "QewHA"})
        # Date of stay
        stay = y.find("span", {"class": "teHYY"})
        # Rating: the second class on the inner span holds the rating
        rating = y.find("div", {"class": "Hlmiy F1"})
        # Owner's response
        response = y.find("span", {"class": "MInAm"})
        # Store None for any field that is missing from this review
        master_review_list.append([
            review.text if review else None,
            stay.text if stay else None,
            rating.span['class'][1] if rating else None,
            response.text if response else None,
        ])

100%|██████████| 290/290 [08:39<00:00,  1.79s/it]
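Resuming after a timeout, as described before the loop, amounts to starting the offset range at the page that failed. The following is a minimal sketch, assuming the same `-or<offset>-` URL pattern used in the loop and that tqdm reported the number of completed iterations when the script stopped. The names `URL_TEMPLATE`, `review_page_url`, and `remaining_offsets` are hypothetical helpers, not part of the original script.

```python
# Same URL as in the loop, with the page offset left as a placeholder.
URL_TEMPLATE = (
    "https://www.tripadvisor.ca/Hotel_Review-g304551-d304216-Reviews-or{offset}"
    "-The_Oberoi_New_Delhi-New_Delhi_National_Capital_Territory_of_Delhi.html#REVIEWS"
)


def review_page_url(offset):
    """Build the reviews-page URL for a given review offset."""
    return URL_TEMPLATE.format(offset=offset)


def remaining_offsets(completed_pages, total=2900, step=10):
    """Offsets still to fetch after `completed_pages` loop iterations finished."""
    return range(completed_pages * step, total, step)


# Example: tqdm showed 120/290 when the timeout occurred.
offsets = remaining_offsets(120)
first_url = review_page_url(offsets[0])
```

Replacing `range(0, 2900, 10)` in the loop with `remaining_offsets(120)` would continue from offset 1200, provided `master_review_list` is not reset before rerunning.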


The scraped data is a list of lists. We convert it to a DataFrame with the pandas library, which makes it easier to write the data to CSV.

In [19]:
reviews_df = pd.DataFrame(master_review_list, columns=['Customer Review', 'Date Of Stay', 'Customer Rating', 'Owner Responded'])

We write the DataFrame to a CSV file inside the project structure. Cleaning and preprocessing are done later from this CSV file, so the scraping script is not overloaded with logic and only collects the data in raw form.

In [20]:
reviews_df.to_csv('../data/oberoi_delhi_reviews.csv', index=False)
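Note that `to_csv` does not create missing folders, so the write fails if `../data` does not exist yet. A minimal sketch of one way to guard against that, assuming pandas is installed; `save_reviews` is a hypothetical helper, the sample row holds made-up placeholder values, and the file is written to a temporary folder rather than the project's `data` directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_reviews(df, csv_path):
    """Create the parent folder if needed, then write df without the index."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False)
    return csv_path


# Placeholder row in the same four-column layout as reviews_df.
sample_df = pd.DataFrame(
    [["Sample review text", "sample date", "sample_rating_class", None]],
    columns=["Customer Review", "Date Of Stay", "Customer Rating", "Owner Responded"],
)

with tempfile.TemporaryDirectory() as tmp:
    # The "data" subfolder does not exist yet; save_reviews creates it.
    out_path = save_reviews(sample_df, Path(tmp) / "data" / "sample_reviews.csv")
    round_trip = pd.read_csv(out_path)
```

Reading the file back is only a quick check that the raw columns survive the write; the missing owner response comes back as a missing value.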