# Capstone - Travel Recommender (WanderLust)

## Notebook 1 of 4
- **Notebook 1: Introduction, Scraping**
- Notebook 2: Combining Datasets, Data Cleaning, EDA and Base Model
- Notebook 3: NLP (Sentiment Analysis), Feature Engineering + EDA + Model (with Feature Engineering), Conclusion + Recommendations
- Notebook 4: Google Cloud + Streamlit

## Executive Summary

### Introduction & Background
Tourism is defined as people travelling to and staying in places outside their usual environment for less than one consecutive year for leisure, business, health, or other reasons ([link](https://www.statista.com/topics/962/global-tourism/#dossierContents__outerWrapper)). In 2019 it made up 10 percent of global GDP and was worth almost $9 trillion ([link](https://www.mckinsey.com/industries/travel-logistics-and-infrastructure/our-insights/reimagining-the-9-trillion-tourism-economy-what-will-it-take)).

As the world settles into post-COVID times, more people are looking into leisure travel and searching for things to do overseas to fill their itineraries. However, with the overwhelming amount of information and the number of options available online, finding something one would enjoy doing can be a hassle.

Popular travel websites, such as [Tripadvisor.com](https://www.tripadvisor.com/), [Expedia.com](https://www.expedia.com/) and [Booking.com](https://www.booking.com/attractions/index.html?aid=397594&label=gog235jc-1DCAEoggI46AdIM1gDaMkBiAEBmAExuAEXyAEP2AED6AEB-AECiAIBqAIDuAL_3MGbBsACAdICJGQ2NDZlYjljLTJiNDEtNGM5Yi05NDExLTQzNzIyYmE5MjFiMtgCBOACAQ&sid=0bd894e0a09fa5a41d0d1005be44fb09), prioritise the country as an input before recommending activities.

Research has shown that 97% of travel consumers are influenced by other customers' post-experience reviews when making a purchase decision. Hence, we decided to incorporate the reviews and ratings that individual users gave each activity into the model. Sentiment analysis was done using a pretrained model from Hugging Face, [cardiffnlp/twitter-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment). As a result, the more positively an activity is received, and the better it matches the genres the user is interested in, the more likely it is to be recommended.
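
As a rough illustration of how review-level sentiment could be rolled up into one value per activity, the sketch below averages per-review scores. It assumes each review has already been passed through the model and yields three probabilities in the order negative, neutral, positive (the label order given on the model card). The helper names `review_score` and `activity_sentiment` and the sample numbers are hypothetical, and the aggregation actually used in Notebook 3 may differ.

```python
from collections import defaultdict
from statistics import mean


def review_score(probs):
    """Collapse (negative, neutral, positive) probabilities into one score in [-1, 1]."""
    negative, _neutral, positive = probs
    return positive - negative


def activity_sentiment(reviews):
    """Average the review scores of each activity.

    `reviews` is an iterable of (activity_name, probs) pairs.
    """
    scores = defaultdict(list)
    for activity, probs in reviews:
        scores[activity].append(review_score(probs))
    return {activity: mean(values) for activity, values in scores.items()}


# Hypothetical model outputs, for illustration only
sample_reviews = [
    ("Snowshoeing", (0.05, 0.15, 0.80)),
    ("Snowshoeing", (0.10, 0.30, 0.60)),
    ("City Bus Tour", (0.50, 0.30, 0.20)),
]
sentiment = activity_sentiment(sample_reviews)
```

An activity with mostly positive reviews ends up with a score close to 1, and one with mostly negative reviews ends up close to -1.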

What’s interesting is that the recommendations may include activities the user has never tried before. For example, a user may not know about snowshoeing, but if the same user likes adventure and nature and has rated them high in interest, snowshoeing may be recommended as an activity. The Travel Recommendation System is not only a useful tool, especially for those who focus on researching activities over locations when travelling, but can also be a trusted source, since it is based on analysing past reviews.

### Problem Statement
- Popular travel websites, such as [Tripadvisor.com](https://www.tripadvisor.com/), [Expedia.com](https://www.expedia.com/) and [Booking.com](https://www.booking.com/attractions/index.html?aid=397594&label=gog235jc-1DCAEoggI46AdIM1gDaMkBiAEBmAExuAEXyAEP2AED6AEB-AECiAIBqAIDuAL_3MGbBsACAdICJGQ2NDZlYjljLTJiNDEtNGM5Yi05NDExLTQzNzIyYmE5MjFiMtgCBOACAQ&sid=0bd894e0a09fa5a41d0d1005be44fb09), prioritise the country as an input before recommending activities. However, that assumes everyone has already decided which country to visit. What if the user has not decided where to go, or prefers to choose based on their hobbies or interests?

- That’s where the Travel Recommender System comes in. It returns a list of 6 activities that a user is likely to enjoy, based on what they like to do when they travel and how important each genre of activity is to them when overseas.
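
A minimal sketch of that idea, not the model built in the later notebooks: each activity is scored by how well its genres match the user's stated interest weights, multiplied by a sentiment factor, and the six highest-scoring activities are returned. The function name `recommend`, the genre weights, and the sample catalogue are all hypothetical.

```python
def recommend(user_interest, activities, top_n=6):
    """Return the names of the `top_n` activities with the highest score.

    user_interest: dict mapping genre -> importance chosen by the user (e.g. 0 to 5).
    activities: list of dicts with 'name', 'genres' (dict genre -> 0..1)
                and 'sentiment' (0..1, higher means better reviews).
    """
    def score(activity):
        # Weighted overlap between the user's interests and the activity's genres
        match = sum(user_interest.get(genre, 0) * weight
                    for genre, weight in activity["genres"].items())
        return match * activity["sentiment"]

    ranked = sorted(activities, key=score, reverse=True)
    return [activity["name"] for activity in ranked[:top_n]]


# Hypothetical catalogue and user, for illustration only
catalogue = [
    {"name": "Snowshoeing", "genres": {"adventure": 1.0, "nature": 0.8}, "sentiment": 0.9},
    {"name": "Museum Tour", "genres": {"culture": 1.0}, "sentiment": 0.8},
    {"name": "Kayaking", "genres": {"adventure": 0.9, "nature": 0.6}, "sentiment": 0.7},
]
new_user = {"adventure": 5, "nature": 4, "culture": 1}
top_picks = recommend(new_user, catalogue)
```

Because the score depends only on the interests the user selects, a brand-new user with no rating history can still receive a ranked list.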

### Project Goals
1. To achieve accurate recommendations based on the user’s selection of activity categories they would like to do, especially for new users (the cold start problem)
2. To incorporate sentiment analysis of reviews into the modelling as feature engineering

# Scraping

- Some information was not in the dataset and had to be scraped, including:
1. Description of the activity
2. Duration of the activity
3. URL of images

In [1]:
import pandas as pd
import csv
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import urllib

In [2]:
final_att_data = pd.read_csv('datasets/final_attractions_data.csv')

In [3]:
option = webdriver.ChromeOptions()
chrome_executable = Service('/Users/calerlime/OneDrive/my_materials/capstone/chromedriver.exe')
driver = webdriver.Chrome(service=chrome_executable, options=option)
driver.implicitly_wait(15)

path_to_file = "./datasets/durationdescriptionimages.csv"
header = ["attraction_id", "attraction", "description", "duration", "images"]

# Split the row indices into batches so the website is scraped at intervals
start = [0, 400, 800, 1200, 1600]
end = [400, 800, 1200, 1600, 1705]

# Open the output file once; mode 'a' appends, so rerunning this cell adds another header row
with open(path_to_file, 'a', encoding="utf-8", newline='') as csvFile:
    csvWriter = csv.DictWriter(csvFile, fieldnames=header)
    csvWriter.writeheader()

    for starting, ending in zip(start, end):

        for i in range(starting, ending):

            # URL of the attraction page to scrape
            cat_url = final_att_data['attraction'][i]
            driver.get(cat_url)

            # Default every field to 'NA' so a failed scrape never reuses the previous row's values
            description = duration = images = 'NA'

            try:
                # Scrape the activity's description (wait until the element is visible)
                WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.biGQs._P.pZUbB.Whbsa.KxBGd")))
                description = driver.find_element(By.CSS_SELECTOR, "div.biGQs._P.pZUbB.Whbsa.KxBGd").text

                # Scrape the duration of the activity (wait until the element is visible)
                WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.biGQs._P.pZUbB.egaXP.KxBGd")))
                duration = driver.find_elements(By.CSS_SELECTOR, "span.biGQs._P.pZUbB.egaXP.KxBGd")[2].text

                # Scrape the style attribute that holds the URL of the first image on the page
                images = driver.find_elements(By.CSS_SELECTOR, "div.Kxegy._R.w._Z.GA")[0].get_attribute("style")

            except Exception:
                # Timeout or missing element: keep whatever was scraped before the failure
                pass

            csvWriter.writerow({"attraction_id": final_att_data['attraction_id'][i], "attraction": final_att_data['attraction'][i],
                                "description": description, "duration": duration, "images": images})

        # Optionally sleep between batches so the website is not flooded with requests
        # time.sleep(300)

driver.close()

In [4]:
driver.quit()
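
The hard-coded `start` and `end` lists in the scraping cell could also be derived from the number of rows, so the batches stay correct if the dataset changes size. A minimal sketch, where `batch_bounds` is a hypothetical helper that is not part of the original notebook:

```python
def batch_bounds(total_rows, batch_size):
    """Yield (start, end) index pairs that cover range(total_rows) in batches."""
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)


# 1705 rows in batches of 400 reproduces the pairs used in the scraping cell
bounds = list(batch_bounds(1705, 400))
```

In the loop, `for starting, ending in batch_bounds(len(final_att_data), 400):` would then take the place of `zip(start, end)`.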

- The scraped data is saved as 'durationdescriptionimages.csv' and will be combined with the other datasets in Notebook 2.
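
Note that the `images` column holds the raw `style` attribute rather than a bare URL. Assuming the attribute has the usual CSS form `background-image: url("...")` (an assumption, since the exact markup of the scraped pages is not shown here), the URL could be pulled out with a regular expression. `extract_image_url` is a hypothetical helper name.

```python
import re

# Matches url(...), with optional single or double quotes around the address
_URL_PATTERN = re.compile(r"url\(\s*['\"]?(.*?)['\"]?\s*\)")


def extract_image_url(style):
    """Return the first URL found in a CSS style string, or None if there is none."""
    match = _URL_PATTERN.search(style)
    return match.group(1) if match else None


# Hypothetical style value, for illustration only
example_style = 'background-image: url("https://example.com/photo.jpg");'
image_url = extract_image_url(example_style)
```

Rows where the scrape failed and the value is `'NA'` return `None`, so they can be filtered out during cleaning.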