# TripAdvisor

### Webscraping

##### Imports

We start with the imports. Four main libraries are needed:
 + requests (to fetch the website)
 + lxml (a fast HTML parser that bs4 can use as its backend)
 + bs4 (a.k.a. Beautiful Soup, a web scraping library)
 + pandas (a data(set) manipulation and analysis library)

Since requests depends on urllib3, we import urllib3 first and configure it to suppress the annoying warning it raises for every request made with SSL certificate verification disabled (verify=False).

In [1]:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import requests
from bs4 import BeautifulSoup as soup
import pandas as pd
from random import randint
from time import sleep
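
The randint and sleep imports are not used in the cells shown here. One plausible use, sketched below as an assumption rather than as what this notebook actually does, is to pause for a random interval between requests so the scraper is gentler on the server. The fetch_politely name, its fetch callback and the delay bounds are all hypothetical.

```python
from random import randint
from time import sleep

def fetch_politely(urls, fetch, min_delay=1, max_delay=3, pause=sleep):
    """Call fetch(url) for every URL, pausing a random whole number of
    seconds between consecutive calls (hypothetical helper)."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # wait between min_delay and max_delay seconds (inclusive)
            pause(randint(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With requests, the fetch argument could be a small function that calls requests.get with the headers and timeout of your choice; the pause argument only exists so the delay can be replaced when trying the helper out.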

##### Request configuration

We need to configure our request carefully in this case, since TripAdvisor won't send us a webpage unless we at least try to emulate a real browser.
First we set our headers, copying the main ones from our browser as shown in the Developer Tools (F12) of a Chromium-based browser (we used the new Microsoft Edge).

Then we request the webpage (Hotels in Beja, Portugal) with our headers attached, a timeout so the request stops if it takes too long (a sign that something is wrong), and SSL certificate verification disabled.
If the status code is OK (200), nothing is printed.

After that we create a BeautifulSoup object from the HTML content of the page, using the lxml parser.

In [2]:
headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36 Edg/96.0.1054.29'}
url = "https://www.tripadvisor.pt/Hotels-g189102-Beja_Beja_District_Alentejo-Hotels.html"
req = requests.get(url,headers=headers,timeout=5,verify=False)
req.raise_for_status()  # raises an HTTPError on a 4xx/5xx status; silent when OK
bsobj = soup(req.content, 'lxml')

##### Scraping

Now we start scraping, beginning with the hotel names.

In [3]:
hotel = []
for name in bsobj.findAll('div',{'class':'listing_title'}):
  hotel.append(name.text.strip())
print(len(hotel))

30


Then their ratings.

In [4]:
ratings = []
for rating in bsobj.findAll('a',{'class':'ui_bubble_rating'}):
  ratings.append(rating['alt'])
print(len(ratings))

29


Next, the number of reviews (which comes with a big issue).

In [5]:
reviews = []
for review in bsobj.findAll('a',{'class':'review_count'}):
  reviews.append(review.text.strip())
print(len(reviews))

30


This number is the TOTAL number of reviews. However, we can only scrape the reviews in a single language, which depends on the domain/locale (.com, .pt); we found no query parameter that changes either this number or the sorting of the reviews.

Now we get the prices.

In [6]:
price = []
for p in bsobj.findAll('div',{'class':'price-wrap'}):
  price.append(p.text.replace('€','').strip()) 
print(len(price))

25


Some of these will be empty, since for those hotels the price is only given by phone.

Now we get to the weird part: the reviews themselves. The hotel page handles them in odd ways:
 + only four are shown per subpage
 + subpages are addressed by an offset that is a multiple of five
 + a non-existent offset does not return a 404, but redirects to the first subpage
 + the language cannot be selected while scraping: there are no query parameters, only radio buttons with random labels, and the default is chosen by domain/locale (.com, .pt)

So the strategy we settled on is:
 + create a monstrous number of subpage links (about 200 per hotel)
 + scrape all reviews, including the ones repeated by the redirect to the first subpage
 + later use sets or dictionaries to remove duplicates

That was done with this awful-looking but functional code.

In [7]:
links = []
for review in bsobj.findAll('a',{'class':'review_count'}):
  try:
    # absolute link to the first review subpage of this hotel
    a = 'https://www.tripadvisor.pt' + review['href']
    cut = a.find('Reviews') + 7  # index right after the word 'Reviews'
    links.append(a)
    # one link per offset: -or5, -or10, ..., -or995
    for i in range(5,1000,5):
        links.append(a[:cut] + '-or' + str(i) + a[cut:])
  except KeyError:  # anchor without an href attribute
    pass
print(links)

['https://www.tripadvisor.pt/Hotel_Review-g189102-d239324-Reviews-Pousada_Convento_Beja-Beja_Beja_District_Alentejo.html#REVIEWS', 'https://www.tripadvisor.pt/Hotel_Review-g189102-d239324-Reviews-or5-Pousada_Convento_Beja-Beja_Beja_District_Alentejo.html#REVIEWS', 'https://www.tripadvisor.pt/Hotel_Review-g189102-d239324-Reviews-or10-Pousada_Convento_Beja-Beja_Beja_District_Alentejo.html#REVIEWS', 'https://www.tripadvisor.pt/Hotel_Review-g189102-d239324-Reviews-or15-Pousada_Convento_Beja-Beja_Beja_District_Alentejo.html#REVIEWS', 'https://www.tripadvisor.pt/Hotel_Review-g189102-d239324-Reviews-or20-Pousada_Convento_Beja-Beja_Beja_District_Alentejo.html#REVIEWS', 'https://www.tripadvisor.pt/Hotel_Review-g189102-d239324-Reviews-or25-Pousada_Convento_Beja-Beja_Beja_District_Alentejo.html#REVIEWS', 'https://www.tripadvisor.pt/Hotel_Review-g189102-d239324-Reviews-or30-Pousada_Convento_Beja-Beja_Beja_District_Alentejo.html#REVIEWS', 'https://www.tripadvisor.pt/Hotel_Review-g189102-d239324-Rev
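
The link construction can also be written as a small function, which makes it easier to try on a single href. This is only a sketch: build_review_links and its parameters are hypothetical names, and it assumes the '-or<offset>' segment always goes right after the word 'Reviews', as in the output above.

```python
def build_review_links(href, base='https://www.tripadvisor.pt', step=5, stop=1000):
    """Return the first review subpage plus one link per offset
    (hypothetical helper mirroring the cell above)."""
    url = base + href
    marker = 'Reviews'
    cut = url.find(marker)
    if cut == -1:
        # unexpected link format: keep only the plain URL
        return [url]
    cut += len(marker)
    links = [url]
    for offset in range(step, stop, step):
        links.append(url[:cut] + '-or' + str(offset) + url[cut:])
    return links

href = '/Hotel_Review-g189102-d239324-Reviews-Pousada_Convento_Beja-Beja_Beja_District_Alentejo.html#REVIEWS'
example_links = build_review_links(href)
```

With the default step and stop, this yields 200 links for the hotel: the first subpage plus 199 offsets.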

Now we take the smallest length among the hotel-related lists, hoping this removes some of the (repeated) sponsored listings.

In [8]:
# use the shortest list so that every column ends up with the same length
length = min(len(price), len(reviews), len(hotel), len(ratings))
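
Cutting every list to the shortest length makes the columns equally long, but it does not guarantee that position i in each list refers to the same hotel: if a listing in the middle of the page lacks a rating or a price, everything after it may shift by one. The sketch below uses made-up data (the hotel names and ratings are hypothetical) to show the effect, together with one possible alternative that keeps one record per listing.

```python
# Made-up data: 'Hotel B' has no rating on the page, so the ratings list
# is one item shorter and its second entry really belongs to 'Hotel C'.
hotel_names = ['Hotel A', 'Hotel B', 'Hotel C']
hotel_ratings = ['4,5', '3,0']

shortest = min(len(hotel_names), len(hotel_ratings))
rows = list(zip(hotel_names[:shortest], hotel_ratings[:shortest]))
# rows pairs 'Hotel B' with the rating of 'Hotel C'

# Alternative: build one record per listing and store None for missing fields.
listings = [
    {'hotel': 'Hotel A', 'rating': '4,5'},
    {'hotel': 'Hotel B', 'rating': None},
    {'hotel': 'Hotel C', 'rating': '3,0'},
]
aligned = [(item['hotel'], item['rating']) for item in listings]
```

Building such records would mean scraping each listing's container element and looking up the name, rating, review count and price inside it, which depends on the page markup and is not shown here.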

We then build the ID table of the hotels, with the most basic information, as a pandas DataFrame and export it to a .csv file that can be used in Excel, Power BI, or ML libraries such as Keras, TensorFlow and scikit-learn.

In [9]:
d1 = {'Hotel':hotel[:length],'Estrelas':ratings[:length],'Avaliações':reviews[:length],'Preço':price[:length]}
df = pd.DataFrame.from_dict(d1)
print(df)
df.to_csv('listtable.csv')

                                  Hotel         Estrelas        Avaliações  \
0                 Pousada Convento Beja  4,5 de 5 bolhas    751 avaliações   
1              Vila Galé Clube de Campo  4,5 de 5 bolhas    917 avaliações   
2                     Herdade dos Grous  4,5 de 5 bolhas    229 avaliações   
3                         Hotel Bejense    4 de 5 bolhas    192 avaliações   
4                        Herdade do Vau  4,5 de 5 bolhas     76 avaliações   
5                  Herdade Da Diabroria    4 de 5 bolhas     54 avaliações   
6                          Hotel Melius    4 de 5 bolhas     90 avaliações   
7                      BejaParque Hotel  3,5 de 5 bolhas    159 avaliações   
8                    Hotel São Domingos    4 de 5 bolhas    135 avaliações   
9                    Maria`s Guesthouse    5 de 5 bolhas      7 avaliações   
10                  Hotel Santa Bárbara    4 de 5 bolhas     62 avaliações   
11                          Beja Hostel  3,5 de 5 bolhas     35 
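
The Estrelas and Avaliações columns are still text ('4,5 de 5 bolhas', '751 avaliações'). If numeric values are needed later, they could be extracted along these lines. This is a sketch, not part of the original notebook: parse_rating and parse_review_count are hypothetical helpers, and they assume the strings keep the format shown in the table above.

```python
import re

def parse_rating(text):
    """'4,5 de 5 bolhas' -> 4.5 (None if the text does not start with a number)."""
    match = re.match(r'\s*(\d+(?:[.,]\d+)?)', text)
    if match is None:
        return None
    # the Portuguese locale uses a decimal comma
    return float(match.group(1).replace(',', '.'))

def parse_review_count(text):
    """'751 avaliações' -> 751 (None if the text contains no digits)."""
    digits = re.sub(r'\D', '', text)  # also drops thousands separators
    return int(digits) if digits else None
```

On the DataFrame, these could be applied with df['Estrelas'].map(parse_rating) and df['Avaliações'].map(parse_review_count).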

Now the most horrible code of all creates several separate .csv files, one per hotel, with the scraped reviews.

It iterates over all the links and, since there are 200 links per hotel, every 200 links it uses some list comprehension magic with sets to remove duplicates and exports the DataFrame to a .csv file.
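
The two ideas in this step, removing duplicates while keeping the order and handling the links in batches of 200, can be sketched in isolation with the standard library only. The helper names dedupe_keep_order and chunks are hypothetical; dict.fromkeys relies on dictionaries preserving insertion order, which Python guarantees since version 3.7.

```python
def dedupe_keep_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    return list(dict.fromkeys(items))

def chunks(seq, size):
    """Yield consecutive slices of seq with at most size items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# Made-up reviews: the redirect to the first subpage produces repeats
scraped = ['Great stay', 'Nice pool', 'Great stay', 'Too noisy', 'Nice pool']
unique_reviews = dedupe_keep_order(scraped)
```

Iterating over chunks(links, 200) would give one batch of links per hotel without keeping a manual counter.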

In [10]:
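count = 0       # links processed for the current hotel
count2 = 0      # index of the current hotel (used in the file name)
allreviews = []
for link in links:
    try:
        # same timeout and SSL settings as the first request
        html2 = requests.get(link,headers=headers,timeout=5,verify=False)
        bsobj2 = soup(html2.content,'lxml')
        for r in bsobj2.findAll('q'):
            try:
                rev = r.span.text.strip()
                allreviews.append(rev + '\n')
            except AttributeError:  # <q> element without a <span> child
                pass
    except requests.RequestException:  # skip subpages that fail to load
        pass
    count += 1
    if count == 200:  # all 200 subpage links of one hotel are done
        # drop duplicates while keeping the original order
        seen = set()
        allreviews = [item for item in allreviews if not (item in seen or seen.add(item))]
        dfr = pd.DataFrame.from_dict({'Avaliações':allreviews})
        print(dfr)
        dfr.to_csv('hotel' + str(count2) + '.csv')
        allreviews = []
        count = 0
        count2 += 1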

                                            Avaliações
0    Excelente hotel .  Pessoal da recepção e servi...
1    Património histórico ao seu melhor nível de re...
2    Chegámos depois da meia-noite e fomos recebido...
3    Gostei muito do alojamento e da estadia. Desta...
4    Na reserva tinha a descrição de um tipo de qua...
..                                                 ...
324  Não é pode esperar de um hotel de luxo e não o...
325  A Pousada em beja, como muitos, tem uma antigo...
326  A primeira estadia em uma pousada e nós impres...
327  O hotel se hospedado por uma noite em este sle...
328  Duas das US hospedado aqui para uma noite em m...

[329 rows x 1 columns]
                                            Avaliações
0    Hotel localizado no "meio" do Alentejo, mais p...
1    Ideal para uns excelentes momentos, descontrai...
2    Excelente local! A comida é fantástica! Os vin...
3    Este hotel nunca falha , espaços para crianças...
4    Éramos um casal com filhos gémeos de