# Webscraping TripAdvisor
## -- All recommended hotels on TripAdvisor

### Modules

The modules we will be using for this version of web scraping are:

- **pandas**: After scraping TripAdvisor, we use pandas to build a DataFrame holding each recommended hotel's name and site link.

- **requests**: This module lets us send HTTP requests and receive the response data.

- **BeautifulSoup**: The web scraping tutorials I found online recommend BeautifulSoup as a very useful module for extracting data from HTML.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as soup

# Part 1: Data Preparation

## Step 1: Top 30 hotels

### Get the Site Link with the Name

First, we use the requests and BeautifulSoup modules to fetch the hotel listing page, which is ranked by "Best Value". We then inspect the page with the browser's Developer Tools to find the elements that hold the information we want to scrape.

In [2]:
html_main = requests.get("https://www.tripadvisor.com/Hotels-g32655-Los_Angeles_California-Hotels.html")
bsobj_main = soup(html_main.content, "lxml")

In [3]:
rank = []
hotel_name = []
rate_score = []
hotel_link = []
# Each hotel card on the listing page lives in one of these <div> elements.
for s1 in bsobj_main.find_all("div", class_ = "ui_column is-8 main_col allowEllipsis"):
    # Skip cards that carry the "sponsored" pill.
    if s1.find("span", class_ = "ui_merchandising_pill sponsored_v2") is None:
        for s2 in s1.find_all("div", class_ = "listing_title"):
            # The slice drops the first six characters of the link text.
            hotel_name.append(s1.a.text[6:])
            hotel_link.append(s1.a["href"])
        for s3 in s1.find_all("div", class_ = "info-col"):
            if s3.a.text == "0 reviews":
                rate_score.append("0 of 5 bubbles")
            else:
                rate_score.append(s3.a["alt"])

# The rank is the first token of the "popindex" text without its leading character.
for s4 in bsobj_main.find_all("div", class_ = "popindex"):
    rank.append(s4.text.split(" ")[0][1:])
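
The string slicing used for the rank is easier to follow on a sample value. The sketch below is only an illustration: the helper names `parse_rank` and `parse_rating` are hypothetical, and the sample `popindex` text is an assumed format (the real page text may differ). The rating format "4.5 of 5 bubbles" matches the strings stored in `rate_score`.

```python
def parse_rank(popindex_text):
    # Same slicing as the scraping loop: take the first space-separated
    # token and drop its leading character (assumed here to be "#").
    return popindex_text.split(" ")[0][1:]


def parse_rating(rate_text):
    # "4.5 of 5 bubbles" -> 4.5; the first token is the numeric score.
    return float(rate_text.split(" ")[0])


# Assumed sample text, not copied from the live page.
sample_rank = parse_rank("#12 of 450 hotels in Los Angeles")
sample_rating = parse_rating("4.5 of 5 bubbles")
print(sample_rank, sample_rating)
```

Converting the rating to a float like this would make it possible to sort or filter hotels by score later on.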

## Step 2: The rest of the hotels - 450 recommended hotels in Los Angeles

Using the same approach as in Step 1, we can collect the hotel information from the remaining listing pages on TripAdvisor.

In [4]:
top_600 = list(range(30, 600, 30))
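
To see which pages these offsets point to, the URLs can be built up front. This is a small sketch using the same URL pattern as the request in the next cell; the names `BASE`, `SUFFIX`, and `page_urls` are hypothetical helpers introduced only for illustration.

```python
BASE = "https://www.tripadvisor.com/Hotels-g32655"
SUFFIX = "-Los_Angeles_California-Hotels.html"


def page_urls(start=30, stop=600, step=30):
    # One URL per offset: oa30, oa60, ..., oa570.
    return [f"{BASE}-oa{offset}{SUFFIX}" for offset in range(start, stop, step)]


urls = page_urls()
print(len(urls))
print(urls[0])
```

The range yields 19 offsets, so together with the first page the loop visits 20 listing pages in total.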

In [5]:
for num in top_600:
    html_main = requests.get("https://www.tripadvisor.com/Hotels-g32655-oa" + str(num) + "-Los_Angeles_California-Hotels.html")
    bsobj_main = soup(html_main.content, "lxml")
    
    # Same extraction as in Step 1, applied to each further listing page.
    for s1 in bsobj_main.find_all("div", class_ = "ui_column is-8 main_col allowEllipsis"):
        # Skip cards that carry the "sponsored" pill.
        if s1.find("span", class_ = "ui_merchandising_pill sponsored_v2") is None:
            for s2 in s1.find_all("div", class_ = "listing_title"):
                hotel_name.append(s1.a.text[6:])
                hotel_link.append(s1.a["href"])
            for s3 in s1.find_all("div", class_ = "info-col"):
                if s3.a.text == "0 reviews":
                    rate_score.append("0 of 5 bubbles")
                else:
                    rate_score.append(s3.a["alt"])

    for s4 in bsobj_main.find_all("div", class_ = "popindex"):
        rank.append(s4.text.split(" ")[0][1:])

## Step 3: Build a DataFrame from the scraped information

### There are 450 recommended hotels in Los Angeles!

In [6]:
hotel_link_update = []
for site in hotel_link:
    hotel_link_update.append("https://www.tripadvisor.com/" + site)
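
The links in the output below contain a double slash after the domain, which suggests the scraped `href` values already start with "/". One possible alternative to plain string concatenation is `urljoin` from the standard library, which resolves the path against the base URL. This is a sketch only; `BASE_URL`, `absolute_link`, and the sample `href` are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com/"


def absolute_link(href):
    # urljoin resolves href against the base, so a leading "/" in href
    # does not produce a doubled slash.
    return urljoin(BASE_URL, href)


# Hypothetical href, not a real TripAdvisor path.
example = absolute_link("/Hotel_Review-example.html")
print(example)
```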

In [7]:
data = {"Rank": rank[:450], "Hotel Name": hotel_name[:450], "Rate": rate_score[:450], "Site Link": hotel_link_update[:450]}
df = pd.DataFrame.from_dict(data)
df

| | Rank | Hotel Name | Rate | Site Link |
|---|---|---|---|---|
| 0 | 1 | The Hollywood Roosevelt | 4.5 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
| 1 | 2 | Hollywood Hotel | 4 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
| 2 | 3 | Hilton Los Angeles/Universal City | 4 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
| 3 | 4 | Hotel Erwin | 4.5 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
| 4 | 5 | Hotel Figueroa | 4.5 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
| ... | ... | ... | ... | ... |
| 445 | 108 | Villa Delle Stelle | 4.5 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
| 446 | 132 | Park Plaza Lodge Hotel | 3.5 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
| 447 | 155 | Days Inn by Wyndham Hollywood Near Universal S... | 3.5 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
| 448 | 160 | Super 8 by Wyndham Canoga Park | 3.5 of 5 bubbles | https://www.tripadvisor.com//Hotel_Review-g326... |
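
Slicing each list to `[:450]` makes the lengths match, but it does not by itself guarantee that position *i* in every list describes the same hotel. In the table above, row 445 carries rank 108, which may mean the lists drifted out of step (or simply that the page order differs from the rank labels). A minimal sketch of a length check before building rows follows; `build_rows` and the sample values are hypothetical.

```python
def build_rows(ranks, names, rates, links):
    # Refuse to combine lists of different lengths instead of silently slicing.
    lengths = {"rank": len(ranks), "name": len(names),
               "rate": len(rates), "link": len(links)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Lists differ in length: {lengths}")
    # One dictionary per hotel keeps the four fields together.
    return [
        {"Rank": r, "Hotel Name": n, "Rate": s, "Site Link": l}
        for r, n, s, l in zip(ranks, names, rates, links)
    ]


# Made-up sample values for illustration.
rows = build_rows(["1", "2"], ["Hotel A", "Hotel B"],
                  ["4.5 of 5 bubbles", "4 of 5 bubbles"], ["/a", "/b"])
print(rows[1])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.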


## Step 4: CSV File -- There are 450 recommended hotels on TripAdvisor in total!

We have now scraped all the recommended hotels on TripAdvisor. For convenience, we save them to a CSV file.

In [8]:
df.to_csv("450_hotel.csv")

# Part 2: Web App Simulation

## Step 1: Show all the hotels to the users for selection

In [9]:
hotel_name[:450]

['The Hollywood Roosevelt',
 'Hollywood Hotel',
 'Hilton Los Angeles/Universal City',
 'Hotel Erwin',
 'Hotel Figueroa',
 'Loews Hollywood Hotel',
 'Hotel June',
 'Luxe Sunset Boulevard Hotel',
 'Hotel Indigo Los Angeles Downtown',
 'Sheraton Grand Los Angeles',
 'InterContinental Los Angeles Downtown',
 'The Wayfarer Downtown LA',
 'Sheraton Gateway Los Angeles Hotel',
 'La Quinta Inn & Suites by Wyndham Lax',
 'The Westin Bonaventure Hotel & Suites, Los Angeles',
 'Hotel Angeleno',
 'JW Marriott Los Angeles L.A. LIVE',
 'Blvd Hotel & Spa',
 'SLS Hotel, a Luxury Collection Hotel, Beverly Hills',
 'Venice Suites',
 'Kawada Hotel',
 'Millennium Biltmore Los Angeles',
 'The Westin Los Angeles Airport',
 'Hilton Los Angeles Airport',
 'Renaissance Los Angeles Airport Hotel',
 'W Hollywood',
 'Su Casa Venice Beach',
 'The LINE Hotel Los Angeles',
 'BLVD Hotel & Suites',
 'Rotex Plaza Hotel',
 'Silver Lake Pool and Inn',
 'Magic Castle Hotel',
 'DoubleTree by Hilton Hotel San Pedro - Port o

In [10]:
def visit_hotel(df, want_to_go_name):
    # Multiple hotel names are separated by "." in the input string.
    each_name = want_to_go_name.split(".")

    # Collect the matching rows for each requested name, in input order.
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    matches = [df.loc[df["Hotel Name"] == name, ["Hotel Name", "Site Link"]] for name in each_name]
    visit_html = pd.concat(matches)

    return visit_html
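
Because `visit_hotel` splits the input on ".", a name that itself contains a period, such as "JW Marriott Los Angeles L.A. LIVE" from the list above, is broken into pieces and never matches. One possible workaround, sketched below, is to split on a different separator and strip surrounding whitespace. The function `split_hotel_names` and the choice of ";" are assumptions for illustration, not part of the original notebook.

```python
def split_hotel_names(user_input, sep=";"):
    # Split on the separator, trim whitespace, and drop empty pieces.
    return [name.strip() for name in user_input.split(sep) if name.strip()]


names = split_hotel_names("Hotel Figueroa; JW Marriott Los Angeles L.A. LIVE")
print(names)

# For comparison: splitting on "." cuts the second name into three parts.
broken = "JW Marriott Los Angeles L.A. LIVE".split(".")
print(len(broken))
```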

## Step 2: Ask for an input from the users

In [11]:
# want_to_go_name = input("Which hotel do you want to stay at? ")
# visit_html = visit_hotel(df, want_to_go_name)

Which hotel do you want to stay at? Hotel Figueroa


In [12]:
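def find_location(visit_html):

    location = []

    for link in visit_html["Site Link"]:
        html = requests.get(link)
        bsobj = soup(html.content, "lxml")

        # Reset for every hotel so a page without address data cannot raise
        # a NameError or reuse the values from the previous hotel.
        street_name = city_name = zipcode = ""

        # The address is embedded in a JSON-LD <script> block on the hotel page.
        for loc in bsobj.find_all("script", type = "application/ld+json"):
            # loc.string is None when the tag has no single text child.
            if loc.string and "streetAddress" in loc.string:
                find_part_html = loc.string.split("{")
                for street in find_part_html:
                    if "streetAddress" in street:
                        for value in street.split(","):
                            # Each slice strips the surrounding quote characters.
                            if "streetAddress" in value:
                                street_name = value.split(":")[1][1:][:-1] + ", "

                            if "addressLocality" in value:
                                city_name = value.split(":")[1][1:][:-1] + ", "

                            if "postalCode" in value:
                                # The state is hard-coded because all hotels are in California.
                                zipcode = "CA " + value.split(":")[1][1:][:-1]

        location_info = street_name + city_name + zipcode
        location.append(location_info)

    return location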
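
Splitting the JSON-LD text on "{", "," and ":" works for simple values but breaks if a street address itself contains a comma or colon. A sturdier alternative is to parse the script content with the standard `json` module. The sketch below assumes the block is a single JSON object with an `address` object using the schema.org keys `streetAddress`, `addressLocality`, `addressRegion`, and `postalCode`; that structure is an assumption about the page, and `address_from_json_ld` and the sample payload are hypothetical.

```python
import json


def address_from_json_ld(script_text):
    data = json.loads(script_text)
    # Assumed layout: the address fields sit under an "address" object.
    address = data.get("address", {}) if isinstance(data, dict) else {}
    region_zip = " ".join(
        part for part in (address.get("addressRegion"), address.get("postalCode")) if part
    )
    parts = [address.get("streetAddress"), address.get("addressLocality"), region_zip]
    # Skip any missing pieces rather than failing.
    return ", ".join(part for part in parts if part)


# Hand-written sample payload mirroring the assumed structure.
sample = json.dumps({
    "@type": "LodgingBusiness",
    "address": {
        "streetAddress": "939 S Figueroa St",
        "addressLocality": "Los Angeles",
        "addressRegion": "CA",
        "postalCode": "90015-1302",
    },
})
formatted = address_from_json_ld(sample)
print(formatted)
```

Reading the region from the data instead of hard-coding "CA" would also let the same function work for hotels outside California.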

In [13]:
# location = find_location(visit_html)

In [14]:
# location

['939 S Figueroa St, Los Angeles, CA 90015-1302',
 '939 S Figueroa St, Los Angeles, CA 90015-1302']