# TripAdvisor

### Webscraping

##### Imports

We start with the imports. We need essentially three (or four) main libraries:
 + requests (to fetch the website)
 + lxml (a faster HTML parser that speeds up bs4)
 + bs4 (a.k.a. Beautiful Soup, a web scraping library)
 + pandas (a maths-oriented data(set) manipulation library)

Since requests uses urllib3 as a dependency, we can import it first and configure it to suppress the annoying warning about the "insecure" connection (the warning appears because we disable SSL certificate verification in the request below).

In [1]:
import urllib3
import requests
import re
from bs4 import BeautifulSoup as soup
import pandas as pd
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

##### Request configuration

We need to configure our request, especially in this case, since TripAdvisor won't send us a webpage unless we at least try to emulate a real browser.
First we configure our headers, copying the main headers from our browser, as seen in the Developer Tools (F12) in Chromium (we used the new Microsoft Edge).

Then we request the webpage (attractions in Beja, Portugal) with our headers attached, a timeout so the request stops if it takes too long (a sign that something is wrong), and SSL certificate verification disabled.
The status code should be OK (200); since req.status_code is not the last expression in the cell, the notebook does not display it.

After that we build a BeautifulSoup object from the HTML content of the page, using lxml as the parser.

In [2]:

headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET",
    "Access-Control-Allow-Headers": "Content-Type",
    "accept": "*/*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-GB,en;q=0.9,en-US;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36 Edg/96.0.1054.29",
}
url = (
    "https://www.tripadvisor.pt/Attractions-g189102-Activities-Beja_Beja_District_Alentejo.html"
)
req = requests.get(url, headers=headers, timeout=5, verify=False)
req.status_code
bsobj = soup(req.content, "lxml")
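
A side note on the headers in the cell above: the three `Access-Control-Allow-*` entries are CORS response headers, which a server sends back to a browser, so including them in a request should have no effect. The sketch below is not part of the original notebook; it only shows one way to filter them out, and the helper name `request_side_headers` is hypothetical.

```python
# Hypothetical helper (not in the original notebook): drop CORS response
# headers, which only make sense in a server's reply, from a request dict.
CORS_RESPONSE_PREFIX = "access-control-allow-"


def request_side_headers(headers):
    """Return a copy of `headers` without Access-Control-Allow-* entries."""
    return {
        key: value
        for key, value in headers.items()
        if not key.lower().startswith(CORS_RESPONSE_PREFIX)
    }


# A shortened copy of the notebook's headers, for illustration only.
example_headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET",
    "accept": "*/*",
    "accept-language": "en-GB,en;q=0.9,en-US;q=0.8",
}
clean_headers = request_side_headers(example_headers)
print(sorted(clean_headers))
```

The browser-like entries (accept, accept-language, User-Agent) are the ones that matter for emulating a real browser.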

##### Scraping

Now we start scraping, beginning with the attraction names.

In [3]:

place = []
for name in bsobj.find_all("span", {"name": "title"}):
    # Remove standalone numbers, then drop the first two characters
    place.append(re.sub(r"\b\d+\b", "", name.text.strip())[2:])
print(place)

['Castelo de Beja', 'Museu Regional de Beja (Museu Rainha D. Leonor)', 'Nucleo Museologico', 'Casa de Santa Vitória', 'Igreja de Nossa Senhora Dos Prazeres E Museu Episcopal', 'Ruínas Romanas de Pisões', 'Museu Visigotico--Igreja de Santo Amaro', 'Jardim Gago Coutinho e Sacadura Cabral', 'Sé Catedral de Beja / Igreja de São Tiago', 'Museu Jorge Vieira/Casa Das Artes', 'Porta de Évora - Arco romano de Beja', 'Pelourinho de Beja', 'Igreja de Santa Maria da Feira', 'Igreja do Salvador', 'Igreja da Misericórdia', 'Estátua da Rainha Dona Leonor', 'Igreja do Carmo', 'Ermida de Santo André', 'Igreja de Nossa Senhora do Pé da Cruz', 'Ermida de Santo Estêvão', 'Bairro da Mouraria', 'Janela Manuelina', 'Arcadas da Praça da República', 'Arco das portas de Avis', 'Monumento ao Prisioneiro Político Desconhecido', 'Palácio dos Maldonados', 'Convento de Santo António em Beja', 'Colégio dos Jesuítas de Beja', 'Piscina Descoberta Municipal de Beja', 'Passo da Rua da Ancha']
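
To see what the `re.sub(...)[2:]` expression in the loop above does, here is a small offline sketch. It assumes the raw title text has the form "1. Castelo de Beja" (a rank, a dot, a space, then the name), which is inferred from the code rather than shown in the notebook. The name containing a number and the anchored pattern `^\d+\.\s*` are illustrative assumptions, not part of the original.

```python
import re


def clean_title_original(raw):
    """Same expression as the notebook: delete every standalone number,
    then drop the first two characters (the leftover ". ")."""
    return re.sub(r"\b\d+\b", "", raw.strip())[2:]


def clean_title_anchored(raw):
    """Alternative sketch: remove only a leading "<rank>. " prefix."""
    return re.sub(r"^\d+\.\s*", "", raw.strip())


print(clean_title_original("1. Castelo de Beja"))
print(clean_title_anchored("12. Pelourinho de Beja"))

# Hypothetical name containing a number: the original expression also
# deletes the "25", while the anchored pattern keeps it.
print(repr(clean_title_original("3. Museu 25 de Abril")))
print(repr(clean_title_anchored("3. Museu 25 de Abril")))
```

For the names scraped here both versions would give the same result, as long as no name contains a standalone number.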


Some of these will be empty, since the price is only available by phone.

Now we get to the weird part: reviews. Each attraction's page handles them in odd ways:
 + only four are shown per subpage
 + subpages are numbered in multiples of ten
 + a non-existent multiple of five does not return a 404, but redirects to the first subpage
 + the language cannot be selected when scraping: there are no query parameters, only radio buttons with random labels, and the default is chosen by domain/locale (.com, .pt)

So the strategy we settled on is:
 + create a monstrous number of subpage links (about 400 per attraction)
 + scrape all reviews, including those repeated by the redirect to the first subpage
 + later use sets or dictionaries to remove duplicates
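
The last step of that strategy can be sketched on its own. Because every non-existent subpage redirects to the first one, the same reviews come back many times; the snippet below removes the repeats while keeping the first occurrence of each, using a dictionary. The sample strings are made up for illustration, and `drop_duplicates` is a hypothetical helper name.

```python
def drop_duplicates(items):
    """Remove repeated items, keeping the first occurrence of each.

    dict keys are unique and keep insertion order (Python 3.7+).
    """
    return list(dict.fromkeys(items))


# Made-up review texts: the first page is scraped again after a redirect.
scraped = [
    "Muito bom.",
    "Bonito castelo.",
    "Muito bom.",
    "Bonito castelo.",
    "Vale a visita.",
]
print(drop_duplicates(scraped))
```

A plain set would also remove the duplicates, but it would not preserve the order in which the reviews were scraped.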

That was done with this awful-looking but functional code.

In [4]:

links = []
for review in bsobj.find_all("a", {"href": re.compile(r"#REVIEWS")}):
    # Every matched tag has an href, since the filter above is applied to it
    a = "https://www.tripadvisor.pt" + review["href"]
    # Index just past the word "Reviews", where the "-or<N>" offset goes
    cut = a.find("Reviews") + 7
    links.append(a)  # first subpage, no offset
    for i in range(10, 4000, 10):
        links.append(a[:cut] + "-or" + str(i) + a[cut:])
# print(links)
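
As a worked example of what the loop produces, the sketch below applies the same slicing to a single link. The href value is a made-up placeholder in the general shape of a TripAdvisor review link, and `subpage_links` is a hypothetical helper name.

```python
BASE = "https://www.tripadvisor.pt"


def subpage_links(href, last_offset, step=10):
    """Build the first-page URL plus one URL per "-or<offset>" subpage."""
    url = BASE + href
    # Index just past the word "Reviews", where the offset marker goes.
    cut = url.find("Reviews") + len("Reviews")
    pages = [url]
    for offset in range(step, last_offset + 1, step):
        pages.append(url[:cut] + "-or" + str(offset) + url[cut:])
    return pages


# Placeholder href, not a real attraction page.
example = subpage_links("/Attraction_Review-g0-d0-Reviews-Example.html#REVIEWS", 20)
for page in example:
    print(page)
```

In the notebook, range(10, 4000, 10) yields 399 offsets, which together with the first page gives 400 links per attraction.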

We then build the ID table of the attractions, with only the most basic information, in a pandas DataFrame, and export it to a .csv file that can be used in Excel, Power BI, or ML libraries such as Keras, TensorFlow, and scikit-learn.

In [5]:

length = len(place)
d1 = {
    "Attraction": place[:length]
}
df = pd.DataFrame.from_dict(d1)
print(df)
df.to_csv("listtable.csv")

                                           Attraction
0                                     Castelo de Beja
1     Museu Regional de Beja (Museu Rainha D. Leonor)
2                                  Nucleo Museologico
3                               Casa de Santa Vitória
4   Igreja de Nossa Senhora Dos Prazeres E Museu E...
5                            Ruínas Romanas de Pisões
6             Museu Visigotico--Igreja de Santo Amaro
7              Jardim Gago Coutinho e Sacadura Cabral
8           Sé Catedral de Beja / Igreja de São Tiago
9                   Museu Jorge Vieira/Casa Das Artes
10               Porta de Évora - Arco romano de Beja
11                                 Pelourinho de Beja
12                     Igreja de Santa Maria da Feira
13                                 Igreja do Salvador
14                             Igreja da Misericórdia
15                      Estátua da Rainha Dona Leonor
16                                    Igreja do Carmo
17                          

Now comes the ugliest code of all: it creates several separate .csv files with the scraped reviews, one per attraction.

It iterates over all the links and, since there are 400 links per attraction, after every 400 links we use some list comprehension magic with a set to remove duplicates and export the DataFrame to a useful .csv file.
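
The batching idea can also be expressed by slicing the flat list of links into consecutive groups, one per attraction. This is only a sketch under the assumption that every attraction contributed the same number of links; `chunks` is a hypothetical helper, and the tiny list stands in for the real URLs.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each `size` long
    (the last one may be shorter)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in data: 2 attractions x 3 links each (the notebook uses 400).
fake_links = ["a0", "a1", "a2", "b0", "b1", "b2"]
for index, group in enumerate(chunks(fake_links, 3)):
    # In the notebook, this is the point where one place<index>.csv is written.
    print("place" + str(index) + ".csv", group)
```

The notebook's own cell reaches the same grouping with a counter that resets every 400 links.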

In [6]:

count = 0  # links processed for the current attraction
count2 = 0  # index of the current attraction, used in the file name
allreviews = []
for link in links:
    try:
        html2 = requests.get(link, headers=headers)
        bsobj2 = soup(html2.content, "lxml")
        # Class name as of 7 Dec; on 6 Dec it was "cSoNT". TripAdvisor keeps changing it.
        for r in bsobj2.find_all("span", {"class": "NejBf"}):
            for rev in r:
                try:
                    rv = rev.text
                    # Skip text that looks like a price ("desde" means "from")
                    if "desde" not in rv and "€" not in rv:
                        allreviews.append(rv + "\n")
                except Exception:
                    pass
    except Exception:
        pass
    count += 1
    if count == 400:
        # Remove duplicates while keeping the original order
        seen = set()
        allreviews = [
            item
            for item in allreviews
            if not (item in seen or seen.add(item))
        ]
        # "Avaliações" is Portuguese for "Reviews"
        dfr = pd.DataFrame.from_dict({"Avaliações": allreviews})
        print(dfr)
        dfr.to_csv("place" + str(count2) + ".csv")
        allreviews = []
        count = 0
        count2 += 1


                                            Avaliações
0                                         Muito bom.\n
1    Castelo bem conservado, com torre de menagem a...
2                       Bonito, mas podia ser melhor\n
3    Castelo bonito numa zona central de Beja. Não ...
4                                                   \n
..                                                 ...
683  Passamos o dia em Beja em 9 de Junho de 2012. ...
684                                      Elevando-se\n
685  soberbas vistas sobre a área circundante. Bem ...
686             Definitivamente um se você na região\n
687  O castelo maio ser principalmente Vazio, mas p...

[688 rows x 1 columns]
                                            Avaliações
0                                              Museu\n
1    Bom. Um excelente local que conta parte da his...
2                                     Baixo Alentejo\n
3    História muito interessante, especialmente da ...
4                                        