# SP702 Final Project (Task 3)

**By**: Francis Mark M. Cayco

**Github profile**: https://github.com/PeteCastle

**Instructions:**
**Task 3: Prepare tourist site data**
Extract **tourist site** data (Note: at least 25 different tourist sites)
- Tourist Site Name
- Classification (e.g. museum, park, zoo, etc.)
- Location (Note: Barangay, City/Municipality, Province)
- Number of visitors per year (Note: in thousands)
- Entrance fee (Note: Peso)
- Other data (extract more data regarding tourist sites that can be used by tourists)
- Customer reviews of tourist site from review sites. Make sure that each review is unique and posted by different users (i.e. no duplicate reviews and no duplicate username).  
  
Store the collected data to a csv file. File name format: **Tourist_site.csv **

Please upload your Jupyter notebook file. Filename format: **Extract_Tourist_site_Source.ipynb**. Note that you might need to create different python code per source of data. Submit all code you used to extract data.
SPARTA

## Data Sources
1. TripAdvisor (https://www.tripadvisor.com.ph)
2. RapidAPI's TripAdvisor Endpoint (https://rapidapi.com/apidojo/api/travel-advisor)

## How did I extract the data?
1. The GeoID of the location is obatined using the **RapidAPI's TripAdvisor Endpoint**. The GeoID is used to get the data from the TripAdvisor website.
2. The GeoID is referenced in the TripAdvisor website, returning the URLs of hotel, restaurants, and attractions in the location.
3. The list of toursit site and detail of each toursit site is obtained from the TripAdvisor website.
4. **BeautifulSoup** library to is used to extract the data tourist site page.

The extracted data is stored in a **CSV** file.  One for the list of toursit site and another for the user reviews of each toursit site.


## Website Screenshots
![](resources/images/2023-08-26-16-07-39.png)
![](resources/images/2023-08-26-16-08-00.png)


## What are the list of the data that I have extracted?

| **Tourist Site Information** |                    |
|--------------------------|--------------------|
| Name                     | Name of the tourist site |
| Location                 | Location of the tourist site |
| URL                      | URL of the tourist site |
| About                    | About the tourist site |
| Operating_Hours          | Operating hours of the tourist site |
| Trip_Duration            | Suggested trip duration of the tourist site |
| Site_Type                | Type of the tourist site |
| Rating                   | Rating of the tourist site |
| Review_Count             | Number of reviews of the tourist site |
| Rating_Description       | Rating description of the tourist site |
| Entrance_Fee             | Entrance fee of the tourist site |
| Address                  | Address of the tourist site |
| **Tourist Site Review Information** |                           |
|------------------------------------|---------------------------|
| Name                               | Name of the tourist site  |
| URL                                | URL of the tourist site   |
| Title                              | Title of the review       |
| User_Link                          | Link to the user's profile |
| User_Name                          | Name of the user          |
| Rating                             | Rating of the review      |
| Content                            | Content of the review     |
| Visit_Date                         | Date of visit             |
| Purpose                            | Purpose of the visit      |
| Review_Date                        | Date of the review        |





In [1]:
# The full code is in the file `tripadvisor/tourist_site.py`
from tripadvisor.tourist_site import TripadvisorTouristSite

NUM_PAGES = 1 # Number of pages to scrape.  The total number of tourist site scraped is NUM_PAGES * 30

# List of locations to scrape.  
#  Note that if the location is present in datasets/locations.json, the location will be scraped. 
#  Otherwise, the location will have to be obtained from RapidAPI's Tripadvisor API, which require credential access.
#  The locations available are: Manila, Baguio, Cavite, and Tagaytay
locations = ["Manila", "Cavite","Tagaytay"] 

tourist_site = TripadvisorTouristSite(NUM_PAGES)
tourist_site.extractData(locations)
tourist_site_details, tourist_site_reviews = tourist_site.getDataframe()

print("Tourist Site Details")
display(tourist_site_details.head(5))

print("Tourist Site Reviews")
display(tourist_site_reviews.head(5))

[32m2023-08-26 16:12:17[0m [1;30mINFO[0m Scraping the URLS of selected restaurants in Manila
[32m2023-08-26 16:12:22[0m [1;30mINFO[0m Scraped the URLS of 30 selected tourist destinations in Manila
[32m2023-08-26 16:12:24[0m [1;30mINFO[0m Scraped all details from Intramuros
[32m2023-08-26 16:12:25[0m [1;30mINFO[0m Scraped all details from Fort Santiago
[32m2023-08-26 16:12:27[0m [1;30mINFO[0m Scraped all details from San Agustin Church
[32m2023-08-26 16:12:29[0m [1;30mINFO[0m Scraped all details from Rizal Park
[32m2023-08-26 16:12:31[0m [1;30mINFO[0m Scraped all details from Manila Cathedral
[32m2023-08-26 16:12:33[0m [1;30mINFO[0m Scraped all details from National Museum
[32m2023-08-26 16:12:35[0m [1;30mINFO[0m Scraped all details from Museo San Agustin
[32m2023-08-26 16:12:37[0m [1;30mINFO[0m Scraped all details from Robinsons Place Mall
[32m2023-08-26 16:12:39[0m [1;30mINFO[0m Scraped all details from Manila Ocean Park
[32m2023-08-26 16:1

Tourist Site Details


Unnamed: 0,name,location,url,about,operating_hours,trip_duration,site_type,rating,review_count,rating_description,entrance_fee,address
0,Intramuros,Manila,https://www.tripadvisor.com.ph/Attraction_Revi...,"Intramuros, ""The Walled City,"" is the oldest d...",8:00 AM - 6:00 PM,1-2 hours,"Neighborhoods,Historic Walking Areas",4.0,3594,Very Good,6429.046667,"Bonifacio Dr & Padre Burgos St, Manila, Luzon ..."
1,Fort Santiago,Manila,https://www.tripadvisor.com.ph/Attraction_Revi...,This museum and public park was built as a sto...,,1-2 hours,"Historic Sites,Parks",4.0,2105,Very Good,5604.195217,"Intramuros, Manila, Luzon 1002 Philippines"
2,San Agustin Church,Manila,https://www.tripadvisor.com.ph/Attraction_Revi...,This museum and courtyard gardens is one of th...,,,Religious Sites,4.5,1577,Very Good,5680.478636,"Gen Luna & Real Sts Intramuros, Manila, Luzon ..."
3,Rizal Park,Manila,https://www.tripadvisor.com.ph/Attraction_Revi...,,,1-2 hours,Parks,4.0,2083,Very Good,6737.211,"Roxas Blvd, Manila, Luzon 1000 Philippines"
4,Manila Cathedral,Manila,https://www.tripadvisor.com.ph/Attraction_Revi...,Former Philippine archbishops are buried in a ...,,2-3 hours,Religious Sites,4.0,1142,Very Good,3621.528,"Cabildo cor Beaterio Intramuros, Manila, Luzon..."


Tourist Site Reviews


Unnamed: 0,name,url,title,user_link,user_name,rating,content,visit_date,purpose,review_date
0,Intramuros,/ShowUserReviews-g298573-d548076-r910654959-In...,Weekly Sales Training,/Profile/marikrisvis,Marikrisvi S,5.0,Was assingend to do Sales Training with this c...,Aug 2023,Business,",15 August 2023"
1,Intramuros,/ShowUserReviews-g298573-d548076-r908236233-In...,Explore Manila’s Old Walled City,/Profile/luzlK3560IQ,Luz Li,5.0,I do not like the traffic in Manila at all but...,Aug 2023,Couples,",6 August 2023"
2,Intramuros,/ShowUserReviews-g298573-d548076-r899571460-In...,Very interesting historical area of Manila,/Profile/roaming_kiwi58,roaming_kiwi58,4.0,Intramuros is the historic old walled city par...,Mar 2023,,",1 July 2023"
3,Intramuros,/ShowUserReviews-g298573-d548076-r899081251-In...,Amazing Experience,/Profile/minnieShellharbour,minnieShellharbour,5.0,What an amazing place to see. It is a must whe...,Jun 2023,Couples,",30 June 2023"
4,Intramuros,/ShowUserReviews-g298573-d548076-r893463779-In...,Great touristy spot.,/Profile/BigBlock13,BigBlock13,5.0,"Very beautiful spot in Manila, with lots of hi...",Apr 2023,,",4 June 2023"


In [3]:
tourist_site_details.to_csv('final_data/Tourist_site_details.csv', index=False)
tourist_site_reviews.to_csv('final_data/Tourist_site_reviews.csv', index=False)

#### Full Code from tourist_site.py file:
```

from abc import ABC, abstractmethod
import json
import os
from dotenv import load_dotenv
import requests
import coloredlogs
import logging.config
import pandas as pd
from bs4 import BeautifulSoup
from seleniumwire import webdriver

class TripadvisorWrapper(ABC):
    # locations : dict = {}
    driver: webdriver.Chrome
    TRIPADVISOR_URL : str = "https://www.tripadvisor.com.ph"
    NUM_PAGES : int = 1
    RAPIDAPI_KEY : str = ""
    RAPIDAPI_HOST : str = ""
    
    def __init__(self, locations_file = "datasets/locations.json", NUM_PAGES = 1):
        # self.locations = json.load(open(locations_file, "r"))
        load_dotenv('credentials.env')
        self.RAPIDAPI_KEY = os.getenv("RAPIDAPI-KEY")
        self.RAPIDAPI_HOST = os.getenv("RAPIDAPI-HOST")
        self.locations_file = locations_file
        logging.config.dictConfig({
            'version': 1,
            'disable_existing_loggers': True,
        })
        coloredlogs.install(fmt='%(asctime)s %(levelname)s %(message)s')
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "profile.managed_default_content_settings.images": 2,
            "profile.managed_default_content_settings.javascript": 2
        }
        chrome_options.add_experimental_option("prefs", prefs)
        self.driver = webdriver.Chrome(chrome_options=chrome_options)

        self.NUM_PAGES = NUM_PAGES
        self.scraped_infos = []
        self.scraped_reviews = []
        return
    
    @abstractmethod
    def extractData(self, locations):
        pass

    @abstractmethod
    def getItemList(self, location) -> []:
        pass
    
    @abstractmethod
    def getItemInfo(self, page: BeautifulSoup, url, location):
        pass

    def getDataframe(self) -> tuple[pd.DataFrame, pd.DataFrame]:
        return  \
            pd.DataFrame(self.scraped_infos), \
            pd.DataFrame(self.scraped_reviews)

    def getRating(self, element_classes):
        rating_value = None
        if "bubble_50" in element_classes:
            rating_value = 5.0
        elif "bubble_45" in element_classes:
            rating_value = 4.5
        elif "bubble_40" in element_classes:
            rating_value = 4.0
        elif "bubble_35" in element_classes:
            rating_value = 3.5
        elif "bubble_30" in element_classes:
            rating_value = 3.0
        elif "bubble_25" in element_classes:
            rating_value = 2.5
        elif "bubble_20" in element_classes:
            rating_value = 2.0
        elif "bubble_15" in element_classes:
            rating_value = 1.5
        elif "bubble_10" in element_classes:
            rating_value = 1.0
        return rating_value

    def getRatingDescription(self,rating: float):
        if rating > 4.5:
            return "Excellent"
        elif rating > 3.5:
            return "Very Good"
        elif rating > 2.5:
            return "Average"
        elif rating > 1.5:
            return "Poor"
        elif rating > 0.5:
            return "Terrible"
        else:
            return None
        
    def getLocationUrl(self, location, type) -> str:
        '''
        location : str 
        type : str -  only accepts the following: Attractions, Hotels, Restaurants, Tourism
        '''
        location = location.title()
        type = type.lower()
        locations = json.load(open("datasets/locations.json"))

        if location not in locations.keys():
            url = "https://travel-advisor.p.rapidapi.com/locations/v2/auto-complete"

            querystring = {"query":location,"lang":"en_US","units":"km"}

            headers = {
                "X-RapidAPI-Key": self.RAPIDAPI_KEY,
                "X-RapidAPI-Host": self.RAPIDAPI_HOST
            }
            
            response = requests.get(url, headers=headers, params=querystring)

            suggestion = response.json()["data"]["Typeahead_autocomplete"]["results"]

            locations[location] = {}
            for i in range(4):
                if suggestion[i]["__typename"] == "Typeahead_QuerySuggestionItem":
                    type = suggestion[i]["buCategory"].lower()

                    locations[location][type]  = suggestion[i]["route"]["url"]
                elif suggestion[i]["__typename"] == "Typeahead_LocationItem":
                    locations[location]["location"] = suggestion[i]["detailsV2"]["route"]["url"]

            with open(self.locations_file, "w") as outfile:
                json.dump(locations, outfile)

            return locations[location][type]

        else:
            return locations[location][type]
            
from bs4 import BeautifulSoup
import pandas as pd
import logging
from .wrapper import TripadvisorWrapper
import traceback

class TripadvisorTouristSite(TripadvisorWrapper):
    def __init__(self, locations_file = "datasets/locations.json" , NUM_PAGES=1):
            super().__init__(locations_file, NUM_PAGES)
            return
    
    def getItemList(self,location) -> []:
        logging.info(f"Scraping the URLS of selected restaurants in {location}")
        url = self.TRIPADVISOR_URL + self.getLocationUrl(location, "attractions")
        offset = 0
        attraction_links = []
        for i in range(self.NUM_PAGES):
            index = url.index('-g')
            offset_url = url[:index] + f"-oa{offset}" + url[index:]
            self.driver.get(offset_url)
            attraction_list_page = BeautifulSoup(self.driver.page_source, "html.parser")

            attraction_items = []
        
            attraction_items.extend(attraction_list_page.find_all("div", class_="alPVI eNNhq PgLKC tnGGX"))

            for item in attraction_items:
                attraction_links.append(item.contents[0].get("href"))

            offset += 30
        logging.info(f"Scraped the URLS of {len(attraction_links)} selected tourist destinations in {location}") 
        return attraction_links
    
    def getItemInfo(self, page: BeautifulSoup, url,location):
        attraction_info = {}
        attraction_info["name"] = page.find("div", class_="iSVKr").contents[0].contents[0]
        attraction_info["location"] = location
        attraction_info["url"] = url
        try:
            attraction_info["about"] = page.find("div", class_="pqqta _d").contents[0].contents[0].contents[0].contents[0]
        except:
            pass
        try:
            attraction_info["operating_hours"] = page.find("span", class_="EFKKt").contents[0]
        except:
            pass
        try:
            attraction_info["trip_duration"] = page.find("div", class_="nvXSy f _Y Q2").contents[0].contents[1].contents[0].contents[0]
        except:
            pass
        try:
            attraction_info["site_type"] = page.find_all("div", class_="kUaIL")[2].contents[0].contents[0].contents[0].contents[0].replace(" • ",",")
        except:
            pass

        try:
            review_info = page.find_all("div", class_="jVDab o W f u w GOdjs")[1]
            attraction_info["rating"] = float(review_info.contents[0].get("aria-label").split(" ")[0])
            attraction_info["review_count"] = int(review_info.contents[1].contents[0].split(" ")[0].replace(",",""))
            attraction_info["rating_description"] = self.getRatingDescription(attraction_info["rating"])
            
        except:
            pass

        prices = []
        for price_element in page.find_all("div", class_="biGQs _P fiohW avBIb fOtGX"):
            try:
                prices.append(float(price_element.contents[0].replace("₱","").replace(",","")))
            except:
                continue
        attraction_info["entrance_fee"] = sum(prices)/len(prices) if len(prices) > 0 else None

        try:
            attraction_info["address"] = page.find("div", class_="AcNPX A").contents[0].contents[0].contents[0].contents[0].contents[0].contents[1].contents[0].contents[0]
        except:
            pass

        attraction_reviews = []
        for review_elements in page.find("div","LbPSX").contents[0].contents:
            try:
                review = {}
                review["name"] = attraction_info["name"]
                # review["type"] = "tourist_site"
                review["url"] = review_elements.find_all("a",class_="BMQDV _F Gv wSSLS SwZTJ FGwzt ukgoS")[1].get("href")
                review["title"] = review_elements.find_all("a", class_="BMQDV _F Gv wSSLS SwZTJ FGwzt ukgoS")[1].contents[0].contents[0]
                review["user_link"] = review_elements.find_all("a",class_="BMQDV _F Gv wSSLS SwZTJ FGwzt ukgoS")[0].get("href")
                review["user_name"] = review_elements.find_all("a", class_="BMQDV _F Gv wSSLS SwZTJ FGwzt ukgoS")[0].contents[0]

                review["rating"] = review_elements.find("svg","UctUV d H0").get("aria-label").split(" ")[0]
                review["content"] = review_elements.find("div","biGQs _P pZUbB KxBGd").contents[0].contents[0]

                addtl_details = review_elements.find("div", class_="RpeCd").contents[0].split(" • ")
                review["visit_date"] = addtl_details[0]
                review["purpose"] = addtl_details[1] if len(addtl_details) >1 else None

                review["review_date"] = review_elements.find("div", class_="TreSq").contents[0].contents[0].replace("Written ",",")
            except:
                pass
            attraction_reviews.append(review)

        logging.info(f"Scraped all details from {attraction_info['name']}")
        return attraction_info, attraction_reviews
    
    def extractData(self, locations):
        for location in locations:
            for link in self.getItemList(location):
                try:
                    attraction_url = self.TRIPADVISOR_URL + link
                    self.driver.get(attraction_url)

                    attraction_info, attraction_review = self.getItemInfo( BeautifulSoup(self.driver.page_source, "html.parser") , attraction_url, location)

                    self.scraped_infos.append(attraction_info)
                    self.scraped_reviews.extend(attraction_review)
                except Exception as e:
                    logging.warn(f"An error has occured on {link}:{traceback.print_exc()}")
                    continue
        logging.info(f"Scraped all {len(self.scraped_infos)} tourist destinations in {location}")
        return
```
 
    
