# SP702 Final Project (Task 1)

**By**: Francis Mark M. Cayco

**GitHub profile**: https://github.com/PeteCastle

**Instructions:**
**Prepare hotel data**
Extract hotel data (Note: at least 25 different hotels)
- Name of hotel
- Location of hotel (Note: Barangay, City/Municipality, Province)
- Amenities
- Price range (e.g. price per person, room, or hour)
- Number of available rooms (e.g. number of rooms for two people)
- Other data (extract more data regarding hotels that can be used by tourists)
- Customer reviews per attribute of hotels from review sites (e.g. customer review
regarding the customer service of hotels). You need to gather at least 15 customer
reviews. Make sure that each review is unique and posted by different users (i.e. no
duplicate reviews and no duplicate username).

Store the collected data in a csv file. File name format: **Hotel.csv** <br>
Please upload your Jupyter notebook file. Filename format: **Extract_Hotel_Source.ipynb**. Note
that you might need to create different python code per source of data. Submit all code you
used to extract data.

## Data Sources
1. TripAdvisor (https://www.tripadvisor.com.ph)
2. RapidAPI's TripAdvisor Endpoint (https://rapidapi.com/apidojo/api/travel-advisor)

## How did I extract the data?
1. The GeoID of the location is obtained using **RapidAPI's TripAdvisor Endpoint**. The GeoID is then used to look up the location on the TripAdvisor website.
2. The GeoID is referenced on the TripAdvisor website, which returns the URLs of the hotels, restaurants, and attractions in the location.
3. The list of hotels and the details of each hotel are obtained from the TripAdvisor website.
4. The **BeautifulSoup** library is used to extract the data from each hotel page.

The extracted data is stored in two **CSV** files: one for the list of hotels and another for the user reviews of each hotel.
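
The GeoID mentioned in the steps above also appears inside TripAdvisor URLs. The following is a minimal sketch, assuming the `-g<digits>` segment of a URL is the GeoID (the review URLs in the sample output contain `g298573`). The helper name `extract_geo_id` is hypothetical and is not part of `hotel.py`.

```python
import re


def extract_geo_id(url):
    """Return the number that follows '-g' in a TripAdvisor URL, or None."""
    # Assumption: the GeoID is the first run of digits after "-g".
    match = re.search(r"-g(\d+)", url)
    return int(match.group(1)) if match else None


# URL prefix taken from the review output shown in this notebook.
print(extract_geo_id("/ShowUserReviews-g298573-d483187-r911589965"))
```

Returning `None` instead of raising keeps the helper safe to call on URLs that do not carry a GeoID.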


## Website Screenshots
![](resources/images/2023-08-26-15-01-46.png)
![](resources/images/2023-08-26-15-02-52.png)
![](resources/images/2023-08-26-15-03-13.png)


## What data did I extract?
| **Hotel Information** | **Description**    |
|-----------------------|--------------------|
| Name                  | Name of the hotel  |
| URL                   | URL of the hotel   |
| Address               | Full address of the hotel |
| About                 | Description of the hotel |
| Review_Count          | Number of reviews of the hotel |
| Rating                | Rating of the hotel (1 to 5) |
| Rating_Description    | Description of the rating (Excellent, Very Good, Average, Poor, Terrible) |
| Rating_Location       | Rating of the location (1 to 5) |
| Rating_Cleanliness    | Rating of the cleanliness (1 to 5) |
| Rating_Service        | Rating of the service (1 to 5) |
| Rating_Value          | Rating of the value (1 to 5) |
| Property_Amenities    | List of amenities of the hotel |
| Room_Features         | List of room features of the hotel |
| Room_Types            | List of room types of the hotel |
| Hotel_Class           | Hotel class (1 to 5 stars) |
| Walkability_Score     | Walkability score of the hotel |
| Walkability_Description | Description of the walkability score (Great for walkers, somewhat walkable, etc) |
| Nearby_Restaurant_Count | Number of restaurants near the hotel |
| Nearby_Attraction_Count | Number of attractions near the hotel |
| Price_Range           | Price range of the hotel in PHP |
| Old_Name              | Old name of the hotel |
| Room_Count            | Number of rooms of the hotel |


| **Hotel Review**      | **Description**    |
|-----------------------|--------------------|
| Name                  | Name of the hotel  |
| Rating                | Reviewer's rating |
| Rating_Date           | Date of the review |
| Title                 | Title of the review |
| Content               | Content of the review |
| Visit_Date            | Date when the customer visited the hotel |
| URL                   | URL of the review |

In [2]:
# The full code is in the file `tripadvisor/hotel.py`
from tripadvisor.hotel import TripadvisorHotel

NUM_PAGES = 1 # Number of listing pages to scrape per location. Each page lists 30 hotels, so up to NUM_PAGES * 30 hotels are scraped per location.

# List of locations to scrape.
#  If a location is already present in datasets/locations.json, its stored URLs are used.
#  Otherwise, the location has to be looked up through RapidAPI's Tripadvisor API, which requires credentials.
#  The locations currently available are: Manila, Baguio, Cavite, and Tagaytay
locations = ["Manila", "Cavite","Tagaytay"] 

hotels = TripadvisorHotel(NUM_PAGES=NUM_PAGES)  # pass by keyword; the first positional parameter is locations_file
hotels.extractData(locations)
hotel_details, hotel_reviews = hotels.getDataframe()

print("Hotel Details")
display(hotel_details.head(5))

print("Hotel Reviews")
display(hotel_reviews.head(5))

2023-08-26 15:08:28 INFO Scraping the URLS of selected restaurants in Manila
2023-08-26 15:08:32 INFO Scraped the URLS of 30 selected restaurants in Manila
2023-08-26 15:08:39 INFO Scraped all details from New Coast Hotel Manila
2023-08-26 15:08:41 INFO Scraped all details from Sofitel Philippine Plaza Manila
2023-08-26 15:08:44 INFO Scraped all details from Diamond Hotel Philippines
2023-08-26 15:08:46 INFO Scraped all details from The Manila Hotel
2023-08-26 15:08:49 INFO Scraped all details from Hotel H2O
2023-08-26 15:08:51 INFO Scraped all details from The Bayleaf Intramuros
2023-08-26 15:08:53 INFO Scraped all details from Hotel Lucky Chinatown
2023-08-26 15:08:55 INFO Scraped all details from City Garden Suites
2023-08-26 15:08:57 INFO Scraped all details from Ba

Hotel Details


Unnamed: 0,name,url,address,about,review_count,rating,rating_description,rating_Location,rating_Cleanliness,rating_Service,...,Room_features,Room_types,hotel_class,walkability_score,walkability_description,nearby_restaurant_count,nearby_attraction_count,price_range,old_name,room_count
0,New Coast Hotel Manila,https://www.tripadvisor.com.ph/Hotel_Review-g2...,"1588 Pedro Gil Street Corner MH del Pilar, Man...",New Coast Hotel Manila is a 5-star deluxe busi...,3420,4.5,Excellent,4.0,4.5,4.5,...,"Air conditioning,Room service,VIP room facilit...","Ocean view,Non-smoking rooms,Suites,Family rooms",5.0,97,Great for walkers,196,16,"₱4,241",New World Manila Bay Hotel,305
1,Sofitel Philippine Plaza Manila,https://www.tripadvisor.com.ph/Hotel_Review-g2...,"Roxas Boulevard CCP Complex, Manila, Luzon 130...",Sofitel Philippine Plaza Manila is an iconic 5...,7806,4.5,Excellent,4.0,4.5,4.5,...,"Blackout curtains,Bathrobes,Air conditioning,D...","Ocean view,City view,Landmark view,Bridal suit...",5.0,57,Somewhat walkable,30,4,"₱7,634",,609
2,Diamond Hotel Philippines,https://www.tripadvisor.com.ph/Hotel_Review-g2...,"Roxas Boulevard cor. Dr. J. Quintos St., Manil...",Set against the magnificent golden sunset of t...,3179,4.5,Excellent,4.5,4.5,4.5,...,"Bathrobes,Air conditioning,Desk,Housekeeping,C...","Ocean view,City view,Pool view,Non-smoking roo...",5.0,97,Great for walkers,186,19,"₱6,122",,483
3,The Manila Hotel,https://www.tripadvisor.com.ph/Hotel_Review-g2...,"1 Rizal Park, Manila, Luzon 0913 Philippines",Manila Hotel embodies a rich tradition of eleg...,2013,4.0,Very good,4.0,4.5,4.5,...,"Air conditioning,Housekeeping,Room service,Saf...","Ocean view,Bridal suite,Non-smoking rooms,Suit...",5.0,72,Somewhat walkable,12,11,"₱4,411",,510
4,Hotel H2O,https://www.tripadvisor.com.ph/Hotel_Review-g2...,"Luneta Manila Ocean Park, Behind Quirino Grand...",Hotel H2O is the first and only marine-themed ...,1654,4.0,Very good,4.0,4.0,4.0,...,"Air conditioning,Room service,Safe,Refrigerato...","Ocean view,Non-smoking rooms,Suites,Family rooms",4.0,68,Somewhat walkable,12,8,"₱4,015",,147
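
The `price_range` column in the preview above holds strings such as `₱4,241` rather than numbers. A minimal sketch of how such values could be converted for analysis follows. The helper name `parse_price_php` is hypothetical, and it assumes that every run of digits and commas in the string is one price in PHP.

```python
import re


def parse_price_php(text):
    """Return every price found in the string as a list of integers."""
    # Each match is a run of digits that may contain thousands separators.
    matches = re.findall(r"\d[\d,]*", text or "")
    return [int(match.replace(",", "")) for match in matches]


# Value copied from the Hotel Details preview.
print(parse_price_php("₱4,241"))
```

Returning a list rather than a single number means a value that holds both a lower and an upper price would keep both bounds.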


Hotel Reviews


Unnamed: 0,name,rating,ratingDate,title,content,visit_date,url
0,New Coast Hotel Manila,5.0,19 Aug,Great place to stay. Highly recommend.,Excellent place to stay. Laiza did a great job...,August 2023,/ShowUserReviews-g298573-d483187-r911589965-Ne...
1,New Coast Hotel Manila,5.0,17 Aug,Exceeding Expectations!!!,This is our very first hotel stay in manila an...,July 2023,/ShowUserReviews-g298573-d483187-r911007475-Ne...
2,New Coast Hotel Manila,5.0,7 Aug,Good value for money,The lobby still has the same appearance as it ...,July 2023,/ShowUserReviews-g298573-d483187-r908653327-Ne...
3,New Coast Hotel Manila,5.0,5 Aug,A Wonderful bad weather experience,It's been a rainy week in Metro Manila and we ...,July 2023,/ShowUserReviews-g298573-d483187-r908038555-Ne...
4,New Coast Hotel Manila,4.0,Jul 2023,Friendly staff and nice service,It was a nice hotel experience. The room atten...,July 2023,/ShowUserReviews-g298573-d483187-r906855689-Ne...
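
The assignment asks for unique reviews, so the review rows could be de-duplicated before they are saved. The sketch below is one possible approach, not the author's code: the scraped fields do not include a username, so it assumes the review `url` identifies a review uniquely. The helper name `drop_duplicate_reviews` and the sample rows are hypothetical.

```python
def drop_duplicate_reviews(reviews, key="url"):
    """Keep the first review seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        if review[key] in seen:
            continue  # a review with this key was already kept
        seen.add(review[key])
        unique.append(review)
    return unique


# Made-up placeholder rows with the same keys as the review records.
sample = [
    {"name": "Hotel A", "url": "/review-1"},
    {"name": "Hotel A", "url": "/review-1"},
    {"name": "Hotel B", "url": "/review-2"},
]
print(len(drop_duplicate_reviews(sample)))
```

With the DataFrame returned by `getDataframe()`, `hotel_reviews.drop_duplicates(subset="url")` would apply the same idea.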


In [4]:
hotel_details.to_csv('final_data/Hotel_details.csv', index=False)
hotel_reviews.to_csv('final_data/Hotel_reviews.csv', index=False)
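
The `NUM_PAGES * 30` relationship noted earlier comes from requesting the hotel listing at offsets of 30. The sketch below mirrors the string slicing used in `getItemList` in `hotel.py`, which inserts an `-oa<offset>` segment before the first `-g` in the listing URL. The function name `build_page_urls` and the shortened route `/Hotels-g298573` are hypothetical placeholders, and whether the site accepts this exact URL form is not verified here.

```python
def build_page_urls(listing_url, num_pages, page_size=30):
    """Return one listing URL per page, each with an '-oa<offset>' segment."""
    # str.index raises ValueError when the URL has no "-g" segment.
    index = listing_url.index("-g")
    return [
        f"{listing_url[:index]}-oa{page * page_size}{listing_url[index:]}"
        for page in range(num_pages)
    ]


for page_url in build_page_urls("/Hotels-g298573", num_pages=2):
    print(page_url)
```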

#### Full Code from hotel.py file:
```python

from abc import ABC, abstractmethod
import json
import os
from dotenv import load_dotenv
import requests
import coloredlogs
import logging.config
import pandas as pd
from bs4 import BeautifulSoup
from seleniumwire import webdriver

class TripadvisorWrapper(ABC):
    # locations : dict = {}
    driver: webdriver.Chrome
    TRIPADVISOR_URL : str = "https://www.tripadvisor.com.ph"
    NUM_PAGES : int = 1
    RAPIDAPI_KEY : str = ""
    RAPIDAPI_HOST : str = ""
    
    def __init__(self, locations_file = "datasets/locations.json", NUM_PAGES = 1):
        # self.locations = json.load(open(locations_file, "r"))
        load_dotenv('credentials.env')
        self.RAPIDAPI_KEY = os.getenv("RAPIDAPI-KEY")
        self.RAPIDAPI_HOST = os.getenv("RAPIDAPI-HOST")
        self.locations_file = locations_file
        logging.config.dictConfig({
            'version': 1,
            'disable_existing_loggers': True,
        })
        coloredlogs.install(fmt='%(asctime)s %(levelname)s %(message)s')
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "profile.managed_default_content_settings.images": 2,
            "profile.managed_default_content_settings.javascript": 2
        }
        chrome_options.add_experimental_option("prefs", prefs)
        # `options` replaces the deprecated `chrome_options` keyword.
        self.driver = webdriver.Chrome(options=chrome_options)

        self.NUM_PAGES = NUM_PAGES
        self.scraped_infos = []
        self.scraped_reviews = []
        return
    
    @abstractmethod
    def extractData(self, locations):
        pass

    @abstractmethod
    def getItemList(self, location) -> list:
        pass
    
    @abstractmethod
    def getItemInfo(self, page: BeautifulSoup, url, location):
        pass

    def getDataframe(self) -> tuple[pd.DataFrame, pd.DataFrame]:
        return  \
            pd.DataFrame(self.scraped_infos), \
            pd.DataFrame(self.scraped_reviews)

    def getRating(self, element_classes):
        rating_value = None
        if "bubble_50" in element_classes:
            rating_value = 5.0
        elif "bubble_45" in element_classes:
            rating_value = 4.5
        elif "bubble_40" in element_classes:
            rating_value = 4.0
        elif "bubble_35" in element_classes:
            rating_value = 3.5
        elif "bubble_30" in element_classes:
            rating_value = 3.0
        elif "bubble_25" in element_classes:
            rating_value = 2.5
        elif "bubble_20" in element_classes:
            rating_value = 2.0
        elif "bubble_15" in element_classes:
            rating_value = 1.5
        elif "bubble_10" in element_classes:
            rating_value = 1.0
        return rating_value

    def getRatingDescription(self,rating: float):
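        # Thresholds are inclusive so that 4.5 maps to "Excellent", as in the scraped rating_description values.
        if rating >= 4.5:
            return "Excellent"
        elif rating >= 3.5:
            return "Very Good"
        elif rating >= 2.5:
            return "Average"
        elif rating >= 1.5:
            return "Poor"
        elif rating >= 0.5: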
            return "Terrible"
        else:
            return None
        
    def getLocationUrl(self, location, type) -> str:
        '''
        location : str 
        type : str -  only accepts the following: Attractions, Hotels, Restaurants, Tourism
        '''
        location = location.title()
        type = type.lower()
        # Use a context manager so the file handle is closed after reading.
        with open("datasets/locations.json") as infile:
            locations = json.load(infile)

        if location not in locations.keys():
            url = "https://travel-advisor.p.rapidapi.com/locations/v2/auto-complete"

            querystring = {"query":location,"lang":"en_US","units":"km"}

            headers = {
                "X-RapidAPI-Key": self.RAPIDAPI_KEY,
                "X-RapidAPI-Host": self.RAPIDAPI_HOST
            }
            
            response = requests.get(url, headers=headers, params=querystring)

            suggestion = response.json()["data"]["Typeahead_autocomplete"]["results"]

            locations[location] = {}
            # Inspect at most the first four suggestions; fewer may be returned.
            for item in suggestion[:4]:
                if item["__typename"] == "Typeahead_QuerySuggestionItem":
                    # Use a separate name so the requested `type` is not overwritten.
                    category = item["buCategory"].lower()

                    locations[location][category] = item["route"]["url"]
                elif item["__typename"] == "Typeahead_LocationItem":
                    locations[location]["location"] = item["detailsV2"]["route"]["url"]

            with open(self.locations_file, "w") as outfile:
                json.dump(locations, outfile)

            return locations[location][type]

        else:
            return locations[location][type]

from bs4 import BeautifulSoup
import pandas as pd
import logging
import traceback

class TripadvisorHotel(TripadvisorWrapper):
    def __init__(self, locations_file = "datasets/locations.json" , NUM_PAGES=1):
        super().__init__(locations_file, NUM_PAGES)
    
    def getItemList(self,location) -> list:
        logging.info(f"Scraping the URLs of selected hotels in {location}")
        url = self.TRIPADVISOR_URL + self.getLocationUrl(location, "hotels")
        offset = 0
        hotel_links = []
        
        for _ in range(self.NUM_PAGES):
            index = url.index('-g')
            offset_url = url[:index] + f"-oa{offset}" + url[index:]
            self.driver.get(offset_url)
            hotel_list_page = BeautifulSoup(self.driver.page_source, "html.parser")

            hotel_items = []
            hotel_items.extend(hotel_list_page.find_all("div", class_="jsTLT K"))

            for item in hotel_items:
                hotel_links.append(item.contents[0].get("href"))

            offset += 30
        # Report the links gathered across all pages, not only those on the last page.
        logging.info(f"Scraped the URLs of {len(hotel_links)} selected hotels in {location}")
        return hotel_links
    
    def getItemInfo(self, page: BeautifulSoup, url, location):
        hotel_info = {}
        hotel_info["name"] = page.find("h1","QdLfr b d Pn").contents[0]
        hotel_info["url"] = url
        hotel_info["location"]= location
        hotel_info["address"] = page.find("span", "fHvkI PTrfg").contents[0]
        hotel_info["about"] = page.find("div", "fIrGe _T").contents[0]
        hotel_info["review_count"] = int(page.find("span", "qqniT").contents[0].replace(",",""))
        hotel_info["rating"] = float(page.find("span", "uwJeR P").contents[0])
        hotel_info["rating_description"] = page.find("div", "kkzVG").contents[0]

        #Ratings
        ratings = page.find_all("div", "HXCfp")
        for rating in ratings:
            rating_type = rating.find("div", "hLoRK").contents[0]
            classes = rating.find("span", "ui_bubble_rating")
            rating_value = self.getRating(classes.get("class"))
            hotel_info[f"rating_{rating_type}"] = rating_value

        # More Info
        try:
            more_info = page.find("div", "aeQAp S5 b Pf ME").parent.contents # contents of the parent of the first section header
            current_info_type = ""
            for element in more_info:
                if element.get("class") is None:
                    continue
                if " ".join(element.get("class")) == "aeQAp S5 b Pf ME" :
                    current_info_type = element.contents[0].replace(" ","_")
                elif " ".join(element.get("class")) == "OsCbb K":
                    details = []
                    detail_elements = element.find_all("div", "yplav f ME H3 _c")
                    for detail_element in detail_elements:
                        details.append(detail_element.contents[1])
                    hotel_info[current_info_type] = ",".join(details)
        except Exception:
            # This section is optional; skip it when the expected markup is absent.
            pass

        try:
            hotel_info["hotel_class"] = page.find("svg", "JXZuC d H0").get("aria-label").split(" ")[0]
        except Exception:
            hotel_info["hotel_class"] = None

        # To add:
        # hotel style
        # languages spoken

        # Proximity Details
        try:
            hotel_info["walkability_score"] = page.find("span","iVKnd fSVJN").contents[0]
            hotel_info["walkability_description"] = page.find("span","lSyvc H3 b zpbpA").contents[0]
            hotel_info["nearby_restaurant_count"] = page.find("span","iVKnd Bznmz").contents[0]
            hotel_info["nearby_attraction_count"] = page.find("span","iVKnd rYxbA").contents[0]
        except Exception:
            # Proximity details are optional; skip them when the expected markup is absent.
            pass

        # Tags
        tags_elements = page.find("div", "GFCJJ").contents[0].contents
        current_tag = ""
        for tag_element in tags_elements:
            if tag_element.get("class") is None:
                continue
            if " ".join(tag_element.get("class")) == "mpDVe Ci b":
                current_tag = tag_element.contents[0].replace(" ","_").lower()

                # modify tag elements:
                current_tag = "other_name" if current_tag == "also_known_as" else current_tag
                current_tag = "old_name" if current_tag == "formerly_known_as" else current_tag
                current_tag = "room_count" if current_tag == "number_of_rooms" else current_tag
            elif " ".join(tag_element.get("class")) == "IhqAp Ci":
                hotel_info[current_tag] = tag_element.contents[0].replace("<!-- -->","")
        del hotel_info["location"] #redundant

        # Reviews
        hotel_review = []
        for review in page.find_all("div", "YibKl MC R2 Gi z Z BB pBbQr"):
            review_info = {}
            review_info["name"] = hotel_info["name"]
            # review_info["type"] = "hotel"
            review_info["rating"] = self.getRating(review.find("span", "ui_bubble_rating").get("class"))
            review_info["ratingDate"] = review.find("div", "cRVSd").contents[0].contents[1].replace(" wrote a review ","")
            review_info["title"] = review.find("div", "KgQgP MC _S b S6 H5 _a").contents[0].contents[0].contents[0].contents[0]
            review_info["content"] = review.find("span", "QewHA H4 _a").contents[0].contents[0]
            review_info["visit_date"] = review.find("span", "teHYY _R Me S4 H3").contents[1].strip()
            
            review_info["url"] = review.find("a", "Qwuub").get("href")
            hotel_review.append(review_info)
        logging.info(f"Scraped all details from {hotel_info['name']}")
        return hotel_info, hotel_review
    
    def extractData(self, locations=[]):
        for location in locations:
            for link in self.getItemList(location):
                try:
                    url = self.TRIPADVISOR_URL + link
                    self.driver.get(url)

                    hotel_info, hotel_review = self.getItemInfo( BeautifulSoup(self.driver.page_source, "html.parser") , url, location)

                    self.scraped_infos.append(hotel_info)
                    self.scraped_reviews.extend(hotel_review)
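                except Exception:
                    # format_exc() returns the traceback as a string; print_exc() returns None.
                    logging.warning(f"An error has occurred on {link}:\n{traceback.format_exc()}")
                    continue
        logging.info(f"Scraped a total of {len(self.scraped_infos)} hotels from {len(locations)} location(s)")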
        return 
        
```
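
The `if`/`elif` chain in `getRating` maps class names such as `bubble_45` to `4.5`. As a hedged alternative (not the author's code), the same value could be derived from the class name with a regular expression. The helper name `rating_from_classes` is hypothetical. Unlike the original, it would also accept two-digit values outside the 10 to 50 range.

```python
import re


def rating_from_classes(element_classes):
    """Return the rating encoded in a 'bubble_NN' class name, or None."""
    for css_class in element_classes or []:
        # "bubble_45" encodes 4.5: the two digits are the rating times ten.
        match = re.fullmatch(r"bubble_(\d{2})", css_class)
        if match:
            return int(match.group(1)) / 10
    return None


# BeautifulSoup returns the class attribute as a list of class names.
print(rating_from_classes(["ui_bubble_rating", "bubble_45"]))
```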
 
    
