# TripAdvisor

### Webscraping

##### Imports

We start with the imports. We need four main libraries to work this out:
 + requests (to fetch the website)
 + lxml (a faster HTML parser to speed up bs4)
 + bs4 (a.k.a. Beautiful Soup, a web scraping library)
 + pandas (a maths-oriented data(set) manipulation library)

Since requests uses urllib3 as a dependency, we can import it first and configure it to suppress the annoying warning about the "insecure" connection (we disable SSL certificate verification in the request below).

In [1]:
import urllib3
import requests
import re
from bs4 import BeautifulSoup as soup
import pandas as pd
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

##### Request configuration

We need to configure our request carefully in this case, since TripAdvisor won't send us a webpage unless we at least try to emulate a real browser.
First we configure our headers, copying the main headers from our browser as seen in the Developer tools (F12) of a Chromium-based browser (we used the new Microsoft Edge).

Then we request the webpage (Restaurants in Beja, Portugal) with our headers attached, a timeout so the request stops if it takes too long (something is wrong), and SSL verification disabled.
The status code should be OK (200). Note that the bare req.status_code line prints nothing, since it is not the last expression in the cell.

After that we create a BeautifulSoup4 scrapable object with the html content of the page, using lxml.

In [2]:
headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET",
    "Access-Control-Allow-Headers": "Content-Type",
    "accept": "*/*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-GB,en;q=0.9,en-US;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36 Edg/96.0.1054.29",
}
url = "https://www.tripadvisor.pt/Restaurants-g189102-Beja_Beja_District_Alentejo.html"
req = requests.get(url, headers=headers, timeout=5, verify=False)
req.status_code
bsobj = soup(req.content, "lxml")
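
A side note on the headers above: the three Access-Control-Allow-* entries are CORS response headers, which a server sends to a browser, so a client request does not need them and the server most likely ignores them. A minimal sketch of trimming them out follows; the helper name `request_headers_only` and the shortened sample dictionary are our own illustration, not part of requests or of the original cell.

```python
def request_headers_only(headers):
    """Return a copy of `headers` without CORS response headers."""
    return {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith("access-control-allow-")
    }


# Shortened sample for illustration; the real cell uses a full User-Agent string.
example = {
    "Access-Control-Allow-Origin": "*",
    "accept": "*/*",
    "User-Agent": "Mozilla/5.0",
}
trimmed = request_headers_only(example)
print(trimmed)
```

The trimmed dictionary can be passed to requests.get in the same way as the original one.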

##### Scraping

Now we start scraping, beginning with the restaurant names and the links to their pages.

In [3]:
place = []
prelinks = []
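# Each listing title sits in a div with this class.
for name in bsobj.find_all("div", {"class": "OhCyu"}):
    # Drop standalone numbers (e.g. a leading rank) and the leftover 2-character prefix.
    place.append(re.sub(r"\b\d+\b", "", name.span.text.strip())[2:])
    # Relative link to the restaurant's own page.
    prelinks.append(name.span.a["href"])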

Some of these will be empty, since the price is only available by phone call.
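
The name clean-up in the cell above removes every standalone number from the title, so a restaurant with a number in its name would lose it. Below is a minimal sketch of a narrower clean-up, assuming the listing text has the form "N. Name" (which is what the `[2:]` slice suggests). The function name `strip_rank` and the sample strings are made up for illustration.

```python
import re


def strip_rank(title):
    """Remove a leading "N. " ranking prefix, leaving the rest untouched."""
    return re.sub(r"^\s*\d+\.\s*", "", title.strip())


cleaned = strip_rank("1. Íntimo restaurante")
kept = strip_rank("12. Cafe 21")  # hypothetical name that contains a number
unranked = strip_rank("Pulo Do Lobo")  # no prefix: returned unchanged
print(cleaned, kept, unranked, sep=" | ")
```

Anchoring the pattern at the start of the string means only the prefix is touched, and titles without a rank pass through as they are.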

Now we get to the weird part: reviews. The restaurant page handles them in odd ways:
 + only ten are shown per subpage
 + each subpage is addressed by an offset that is a multiple of ten
 + a non-existent offset does not give a 404, but redirects to the first subpage
 + there is no way to select the language when scraping: no query parameters, only radio buttons with random labels, and the default is chosen by domain/locale (.com, .pt)

So the strategy we settled on is:
 + create a monstrous number of subpage links (about 400 per restaurant)
 + scrape all reviews, including the repeats caused by the redirect to the first subpage
 + later use sets or dictionaries to remove duplicates

That was done with this awful-looking but functional code.

In [4]:
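base = "https://www.tripadvisor.pt"
marker = "-Reviews-"
links = []
for pre in prelinks:
    full = base + pre
    # Split just before the hyphen that ends "-Reviews-",
    # so the "-orN" offset can be inserted at that point.
    cut = full.find(marker) + len(marker) - 1
    head, tail = full[:cut], full[cut:]
    links.append(full)  # first subpage (no offset)
    # Offsets 10, 20, ..., 3990: together with the first subpage,
    # this gives exactly 400 links per restaurant.
    for i in range(10, 4000, 10):
        links.append(head + "-or" + str(i) + tail)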
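
The link construction above can also be wrapped in a small function, which makes the "-orN" offset scheme easier to check on a single link. This is a sketch under assumptions: the function name `review_page_urls`, its parameters, and the sample href are hypothetical. The href only mimics the shape of the ones collected in prelinks.

```python
def review_page_urls(href, base="https://www.tripadvisor.pt", pages=400, step=10):
    """Build the first review page plus the "-orN" offset subpages."""
    head, marker, tail = href.partition("-Reviews-")
    if not marker:
        raise ValueError("href does not contain '-Reviews-': " + href)
    urls = [base + href]  # first subpage, no offset
    for offset in range(step, pages * step, step):
        urls.append(f"{base}{head}-Reviews-or{offset}-{tail}")
    return urls


# Hypothetical href for illustration only.
sample = "/Restaurant_Review-g189102-d0000000-Reviews-Example-Beja.html"
urls = review_page_urls(sample, pages=3)
for u in urls:
    print(u)
```

With the default pages=400 the function returns 400 links per restaurant, matching the count the rest of the notebook relies on.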

Then we create the ID table of the restaurants with the most basic information in a pandas DataFrame, and export it to a .csv file that can be used in Excel, Power BI, or ML libraries such as Keras, TensorFlow and scikit-learn.

In [5]:
length = len(place)

d1 = {"Restaurant": place[:length]}
df = pd.DataFrame.from_dict(d1)
print(df)
df.to_csv("listtable.csv")

                                       Restaurant
0                              Íntimo restaurante
1                           Restaurante Dom Dinis
2                   Herdade dos Grous Restaurante
3                        Adega Tipica Restaurante
4                              Bifanas do Márinho
5                                    Pulo Do Lobo
6                                      Toi Farois
7                    Restaurante Sabores Do Monte
8                                 Pizzaria Milano
9   Pizaria e Restaurante Mediterrâneo Dona Maria
10                                  Frango à Guia
11                     Casa de Pasto - Tem Avondo
12                     Restaurante Espelho D'Água
13                        Hamburgueria da Avenida
14                                      O Arbitro
15              Adega do Castelo - Museu do Vinho
16                  Pinguinhas - Tapas e Petiscos
17                                 Taberna A Pipa
18                                  Luiz Da Rocha


Now comes the most terrible piece of code: it creates several separate .csv files with the scraped reviews, one per restaurant.

It iterates over all the links and, since there are 400 links per restaurant, every 400 links we use some list comprehension magic with a set to remove duplicates and export the DataFrame to a .csv file.
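
The two ideas in this step, removing duplicates while keeping order and handling the links in groups of 400, can be sketched in isolation with the standard library. The helper names `dedupe_keep_order` and `chunked` and the sample data are ours, and `dict.fromkeys` is swapped in for the set-based comprehension used in the cell below. It relies on dictionaries preserving insertion order (Python 3.7+).

```python
def dedupe_keep_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    return list(dict.fromkeys(items))


def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


# Made-up sample data for illustration.
reviews = ["good\n", "bad\n", "good\n", "ok\n", "bad\n"]
unique = dedupe_keep_order(reviews)
groups = list(chunked(list(range(10)), 4))
print(unique)
print(groups)
```

In the notebook's setting, each chunk of 400 links would correspond to one restaurant, and the de-duplication would run once per chunk.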

In [6]:
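count = 0  # links processed for the current restaurant
count2 = 0  # index of the current restaurant (used in the file name)
allreviews = []
for link in links:
    try:
        html2 = requests.get(link, headers=headers)
        bsobj2 = soup(html2.content, "lxml")
        for r in bsobj2.find_all("p", {"class": "partial_entry"}):
            # Iterate over the children of the paragraph (text nodes and tags).
            for rev in r:
                try:
                    rv = rev.text.strip()
                    allreviews.append(rv + "\n")
                except Exception:
                    pass  # skip children without usable text
    except Exception:
        pass  # skip subpages that fail to download or parse
    count += 1
    if count == 400:  # all subpages of one restaurant are done
        # Remove duplicates, keeping the first occurrence of each review.
        seen = set()
        allreviews = [
            item for item in allreviews if not (item in seen or seen.add(item))
        ]
        dfr = pd.DataFrame.from_dict({"Avaliações": allreviews})
        dfr.to_csv("restaurant" + str(count2) + ".csv")
        print(dfr)
        # Reset the state for the next restaurant.
        allreviews = []
        count = 0
        count2 += 1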

                                            Avaliações
0    Num bairro de Beja encontra-se este restaurant...
1    Não se deixem intimidar pelo aspecto do restau...
2    duas pessoas, embora sendo individuais. \nServ...
3                                               Mais\n
4    Cozinhar divinamente, sem dúvida é uma arte! N...
..                                                 ...
118  Restaurante excelente a todos os níveis, começ...
119  Restaurante de excelência, decoração simples, ...
120  A minha primeira vez em Portugal e este foi o ...
121  Este é um lugar para relaxar e desfrutar de um...
122  É um restaurante agradável todos os pratos são...

[123 rows x 1 columns]
                                            Avaliações
0    Comida muito boa, muita variedade neste tipo d...
1    Jantar de amigos\n solicitamos tábua de vaca c...
2    Almoço muito bom com uma qualidade acima da mé...
3    Aconcelhas a quem venha visitar a cidade de Be...
4    Para os amantes da carne de vaca, en