# Парсинг HTML страниц

Во время путешествия по Турции я столкнулось с проблемой недостаточной релевантности комментариев о достопремечательностях с моей туристической реальностью. Было замечено, что местные жители отдают предпочтение совершенно другим местам, чем руско- и англоговорящие пользователи. 

Для того, чтобы сравнить как отличаются предпочтения разных языковых групп, мы выполним парсинг сайта **tripadvisor** по главным достопримечательностям Португалии. 

Выгружать будем будем комментарии на русском, португальском и английском языках. 

А так же данный проек демонстрирует как по названию достопримечательности ее можно искать на **googlemaps** и получать комментарии из данного источника.

## 1. Импорт библиотек

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

import googlemaps
import geopandas as gpd
import geocoder

## 2. Tripadvisor

Первая и последующие страницы для сайта Tripadvisor в разделе Attractions различаются, запишем 2 разных адреса. Помимо списка достопримечательностей с первой страницы, загрузим данные еще с 2х.

In [2]:
URL_1 = 'https://www.tripadvisor.com/Attractions-g189100-Activities-a_allAttractions.true-Portugal.html'
URL_2 = 'https://www.tripadvisor.com/Attractions-g189100-Activities-oa{}-Portugal.html'

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
          'assept' : '*/*'}

PAGE_COUNT = 2

In [3]:
def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all("article", class_="bMktZ bQbFd")
    place = []
    for item in items:
        place.append({
            'title': (re.sub(r'[0-9.]', '', item.find('div', class_='bUshh o csemS').get_text(strip=True))),
            'rating': item.find('svg', class_='RWYkj d H0').get('title')[0:3],
            'link': item.find('a').get('href'),
            'comment_count': item.find('span', class_='WlYyy diXIH bGusc bQCoY').get_text(strip=True).replace(',', ''),
            'category': item.find('div', class_='WlYyy diXIH fPixj').get_text(strip=True).replace(' • ', ', ')
        })
    return place

def parse():
    html = get_html(URL_1)
    if html.status_code == 200:
        pages_count = PAGE_COUNT
        place = []
        html = get_html(URL_1)
        print(f'Парсинг страницы 1 из {pages_count + 1}...')
        place.extend(get_content(html.text))
        for page in range(1, pages_count + 1):
            print(f'Парсинг страницы {page + 1} из {pages_count + 1}...')
            url = URL_2.format(page * 30)
            html = get_html(url)
            place.extend(get_content(html.text))
        return place
    else:
        print('Error')

d = parse()

Парсинг страницы 1 из 3...
Парсинг страницы 2 из 3...
Парсинг страницы 3 из 3...


In [4]:
tripadvisor  = pd.DataFrame(d)

In [5]:
tripadvisor.head(10)

Unnamed: 0,title,rating,link,comment_count,category
0,Oceanário de Lisboa,4.5,/Attraction_Review-g189158-d195144-Reviews-Oce...,39856,Aquariums
1,Ponte de Dom Luís I,4.5,/Attraction_Review-g189180-d636456-Reviews-Pon...,24582,Bridges
2,Praia Sao Rafael,4.5,/Attraction_Review-g189112-d3869627-Reviews-Pr...,1359,Beaches
3,Pico do Arieiro,4.5,/Attraction_Review-g189167-d4740534-Reviews-Pi...,6707,Mountains
4,Ponta da Piedade,5.0,/Attraction_Review-g189117-d1755231-Reviews-Po...,5464,Geologic Formations
5,Quinta da Regaleira,5.0,/Attraction_Review-g189164-d484394-Reviews-Qui...,13418,"Architectural Buildings, Castles"
6,Animaris Ilha Deserta,4.5,/Attraction_Review-g189116-d3672317-Reviews-An...,1346,"Boat Tours, Speed Boats Tours"
7,Seven Hanging Valleys Trail,4.5,/Attraction_Review-g189111-d17378213-Reviews-S...,17,Hiking Trails
8,Krazy World Zoo,4.0,/Attraction_Review-g189111-d670139-Reviews-Kra...,401,Amusement & Theme Parks
9,Fontes and Cascada da Risco,4.5,/Attraction_Review-g189166-d548047-Reviews-25_...,1008,Hiking Trails


Мы выгрузили таблицу с:
- названием достопримечательности,
- общим рейтингом,
- ссылкой на комментарии,
- количество комментариев,
- категорией достопримечательности.

Добавим в нашу таблицу комментарии на разных языках.

In [6]:
URL_list = list(tripadvisor['link'])
HOST = 'https://www.tripadvisor.{}'
domains = ['com', 'ru', 'pt']

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    place = []
    try:
        item = soup.find('div', class_='dHjBB')
        texts = item.find_all('span', class_='NejBf')
        ratings = item.find_all('svg', class_='RWYkj d H0')
        for i in range(0, len(texts)):
            place.append({
                'text': texts[i].get_text(strip=True),
                'rating': ratings[i].get('title')[0:3]
            })
        
    except:
        place.append('None')

    return place

def parse():
    html = get_html(host + URL_list[i])
    if html.status_code == 200:
        place = []
        place.extend(get_content(html.text))
        return place
    else:
        print('Error')
        
domain_dict = []
for domain in domains:  
    print(f'Парсинг комментариев домена {domain} ...')
    host = HOST.format(domain)
    d = []
    for i in range(0, len(URL_list)):
        x = parse()
        d.append(x)
    domain_dict.append({'domain_{}'.format(domain): d})

Парсинг комментариев домена com ...
Парсинг комментариев домена ru ...
Парсинг комментариев домена pt ...


In [7]:
tripadvisor['domain_com'] = pd.DataFrame(domain_dict[0])
tripadvisor['domain_ru'] = pd.DataFrame(domain_dict[1])
tripadvisor['domain_pt'] = pd.DataFrame(domain_dict[2])
tripadvisor.head(10)

Unnamed: 0,title,rating,link,comment_count,category,domain_com,domain_ru,domain_pt
0,Oceanário de Lisboa,4.5,/Attraction_Review-g189158-d195144-Reviews-Oce...,39856,Aquariums,[{'text': 'Incredible to see such an abundance...,"[{'text': 'Нам понравилось', 'rating': '5,0'},...","[{'text': 'Oceanário para todas as idades', 'r..."
1,Ponte de Dom Luís I,4.5,/Attraction_Review-g189180-d636456-Reviews-Pon...,24582,Bridges,"[{'text': 'Worth a Walk', 'rating': '4.0'}, {'...","[{'text': 'Страшновато', 'rating': '5,0'}, {'t...","[{'text': 'excelente', 'rating': '4,0'}, {'tex..."
2,Praia Sao Rafael,4.5,/Attraction_Review-g189112-d3869627-Reviews-Pr...,1359,Beaches,"[{'text': 'Lovely beach!', 'rating': '5.0'}, {...","[{'text': 'Очень хорошо', 'rating': '5,0'}, {'...","[{'text': 'Pequeninas mas muito limpa', 'ratin..."
3,Pico do Arieiro,4.5,/Attraction_Review-g189167-d4740534-Reviews-Pi...,6707,Mountains,"[{'text': 'Could this be more beautiful!', 'ra...","[{'text': 'Волшебная Мадейра', 'rating': '5,0'...","[{'text': 'Local incrível', 'rating': '5,0'}, ..."
4,Ponta da Piedade,5.0,/Attraction_Review-g189117-d1755231-Reviews-Po...,5464,Geologic Formations,[{'text': 'Impressive rock formations that mus...,"[{'text': 'Самое красивое место', 'rating': '5...","[{'text': 'Paisagem magnífica', 'rating': '5,0..."
5,Quinta da Regaleira,5.0,/Attraction_Review-g189164-d484394-Reviews-Qui...,13418,"Architectural Buildings, Castles","[{'text': 'Beautiful Palace and Grounds', 'rat...","[{'text': 'Мистическое место', 'rating': '5,0'...","[{'text': 'Maravilhoso!!!', 'rating': '5,0'}, ..."
6,Animaris Ilha Deserta,4.5,/Attraction_Review-g189116-d3672317-Reviews-An...,1346,"Boat Tours, Speed Boats Tours",[None],[None],[None]
7,Seven Hanging Valleys Trail,4.5,/Attraction_Review-g189111-d17378213-Reviews-S...,17,Hiking Trails,[{'text': 'Breathtaking Grottos and panoramic ...,[{'text': 'Breathtaking Grottos and panoramic ...,"[{'text': 'Paradisíaco', 'rating': '4,0'}, {'t..."
8,Krazy World Zoo,4.0,/Attraction_Review-g189111-d670139-Reviews-Kra...,401,Amusement & Theme Parks,"[{'text': 'Friendly with a personal touch', 'r...","[{'text': 'Детям очень понравилось', 'rating':...",[{'text': 'Parque inseguro 😔 e com poucas medi...
9,Fontes and Cascada da Risco,4.5,/Attraction_Review-g189166-d548047-Reviews-25_...,1008,Hiking Trails,"[{'text': 'This trail has waterfalls, Leveda, ...","[{'text': 'Левада 25 водопадов', 'rating': '5,...","[{'text': 'Levada 25 fontes', 'rating': '5,0'}..."


In [8]:
tripadvisor[tripadvisor['domain_com'].apply(lambda x: x[0]) == 'None']

Unnamed: 0,title,rating,link,comment_count,category,domain_com,domain_ru,domain_pt
6,Animaris Ilha Deserta,4.5,/Attraction_Review-g189116-d3672317-Reviews-An...,1346,"Boat Tours, Speed Boats Tours",[None],[None],[None]


Обнаружили предложение на бронирование лодочных туров. Они не совсем подгодят под наш запрос, удалим их.

In [9]:
none_index = list(tripadvisor[tripadvisor['domain_com'].apply(lambda x: x[0]) == 'None'].index)
tripadvisor = tripadvisor.query('index != @none_index').reset_index(drop=True)

In [10]:
tripadvisor.head(10)

Unnamed: 0,title,rating,link,comment_count,category,domain_com,domain_ru,domain_pt
0,Oceanário de Lisboa,4.5,/Attraction_Review-g189158-d195144-Reviews-Oce...,39856,Aquariums,[{'text': 'Incredible to see such an abundance...,"[{'text': 'Нам понравилось', 'rating': '5,0'},...","[{'text': 'Oceanário para todas as idades', 'r..."
1,Ponte de Dom Luís I,4.5,/Attraction_Review-g189180-d636456-Reviews-Pon...,24582,Bridges,"[{'text': 'Worth a Walk', 'rating': '4.0'}, {'...","[{'text': 'Страшновато', 'rating': '5,0'}, {'t...","[{'text': 'excelente', 'rating': '4,0'}, {'tex..."
2,Praia Sao Rafael,4.5,/Attraction_Review-g189112-d3869627-Reviews-Pr...,1359,Beaches,"[{'text': 'Lovely beach!', 'rating': '5.0'}, {...","[{'text': 'Очень хорошо', 'rating': '5,0'}, {'...","[{'text': 'Pequeninas mas muito limpa', 'ratin..."
3,Pico do Arieiro,4.5,/Attraction_Review-g189167-d4740534-Reviews-Pi...,6707,Mountains,"[{'text': 'Could this be more beautiful!', 'ra...","[{'text': 'Волшебная Мадейра', 'rating': '5,0'...","[{'text': 'Local incrível', 'rating': '5,0'}, ..."
4,Ponta da Piedade,5.0,/Attraction_Review-g189117-d1755231-Reviews-Po...,5464,Geologic Formations,[{'text': 'Impressive rock formations that mus...,"[{'text': 'Самое красивое место', 'rating': '5...","[{'text': 'Paisagem magnífica', 'rating': '5,0..."
5,Quinta da Regaleira,5.0,/Attraction_Review-g189164-d484394-Reviews-Qui...,13418,"Architectural Buildings, Castles","[{'text': 'Beautiful Palace and Grounds', 'rat...","[{'text': 'Мистическое место', 'rating': '5,0'...","[{'text': 'Maravilhoso!!!', 'rating': '5,0'}, ..."
6,Seven Hanging Valleys Trail,4.5,/Attraction_Review-g189111-d17378213-Reviews-S...,17,Hiking Trails,[{'text': 'Breathtaking Grottos and panoramic ...,[{'text': 'Breathtaking Grottos and panoramic ...,"[{'text': 'Paradisíaco', 'rating': '4,0'}, {'t..."
7,Krazy World Zoo,4.0,/Attraction_Review-g189111-d670139-Reviews-Kra...,401,Amusement & Theme Parks,"[{'text': 'Friendly with a personal touch', 'r...","[{'text': 'Детям очень понравилось', 'rating':...",[{'text': 'Parque inseguro 😔 e com poucas medi...
8,Fontes and Cascada da Risco,4.5,/Attraction_Review-g189166-d548047-Reviews-25_...,1008,Hiking Trails,"[{'text': 'This trail has waterfalls, Leveda, ...","[{'text': 'Левада 25 водопадов', 'rating': '5,...","[{'text': 'Levada 25 fontes', 'rating': '5,0'}..."
9,Graham's Port Lodge,4.5,/Attraction_Review-g580268-d3893366-Reviews-Gr...,2536,Wineries & Vineyards,"[{'text': 'The personal touch.', 'rating': '5....",[{'text': 'Увлекательная экскурсия в мир портв...,"[{'text': 'Adoramos a visita! Vinho incrível',..."


## GoogleMaps

Для загрузки данных можете истопользовать свой API.

In [11]:
gmaps = googlemaps.Client(key='AI***')

place_list = list(tripadvisor['title'])

In [None]:
place_dict = []

for p in range(0, len(place_list)):
    place_name = place_list[p]
    place_result = gmaps.places(place_name)
    if place_result['results'] != []:
        place_id = place_result['results'][0]['place_id']
        place = gmaps.place(place_id = place_id, language='en')
        try:
            reviews = place['result']['reviews']
            user_ratings_total = place['result']['user_ratings_total']
            types = place['result']['types']
            rating =  place['result']['rating']
            reviews_dict = []

            for i in range(len(reviews)):
                text = reviews[i]['text']
                rating = reviews[i]['rating']
                reviews_dict.append({'text': text,
                                     'rating': rating})
            place_dict.append({'place': place_name,
                               'rating': rating,
                               'types': types,
                               'user_ratings_total': user_ratings_total,
                               'reviews_en': reviews_dict})
        except:
            place_dict.append({'place': place_name,
                               'rating': 'No rating',
                               'types': 'No types',
                               'user_ratings_total': 'No ratings',
                               'reviews_en': 'No reviews'})

In [None]:
google_maps = pd.DataFrame(place_dict)
google_maps.head()

In [None]:
place_dict = []

for p in range(0, len(place_list)):
    place_name = place_list[p]
    place_result = gmaps.places(place_name)
    if place_result['results'] != []:
        place_id = place_result['results'][0]['place_id']
        place = gmaps.place(place_id = place_id, language='ru')
        try:
            reviews = place['result']['reviews']
            reviews_dict = []

            for i in range(len(reviews)):
                text = reviews[i]['text']
                rating = reviews[i]['rating']
                reviews_dict.append({'text': text,
                                     'rating': rating})
            place_dict.append({'reviews': reviews_dict})
        except:
            place_dict.append({'reviews': 'No reviews'})

In [None]:
place_dict

In [None]:
google_maps['reviews_ru'] = pd.DataFrame(place_dict)
google_maps.head()

In [None]:
place_dict = []

for p in range(0, len(place_list)):
    place_name = place_list[p]
    place_result = gmaps.places(place_name)
    if place_result['results'] != []:
        place_id = place_result['results'][0]['place_id']
        place = gmaps.place(place_id = place_id, language='pt')
        try:
            reviews = place['result']['reviews']
            reviews_dict = []

            for i in range(len(reviews)):
                text = reviews[i]['text']
                rating = reviews[i]['rating']
                reviews_dict.append({'text': text,
                                     'rating': rating})
            place_dict.append({'reviews': reviews_dict})
        except:
            place_dict.append({'reviews': 'No reviews'})

In [None]:
google_maps['reviews_pt'] = pd.DataFrame(place_dict)
google_maps.head()

In [None]:
google_maps[google_maps['reviews_ru'] == 'No reviews'].head()

In [None]:
google_maps.to_csv('google_maps.csv', index=False) 
tripadvisor.to_csv('tripadvisor.csv', index=False) 

In [15]:
tripadvisor = pd.read_csv('tripadvisor.csv')
tripadvisor.head()

Unnamed: 0.1,Unnamed: 0,title,rating,link,comment_count,category,domain_com,domain_ru,domain_pt
0,0,Oceanário de Lisboa,4.5,/Attraction_Review-g189158-d195144-Reviews-Oce...,39835,Aquariums,"[{'text': ""One of the best oceanariums I've be...",[{'text': 'Мой рекомендасьен! красиво очень! р...,[{'text': 'Excelente para miúdos e graúdos. O ...
1,1,Ponte de Dom Luís I,4.5,/Attraction_Review-g189180-d636456-Reviews-Pon...,24566,Bridges,[{'text': 'A beautiful bridge with many angles...,[{'text': 'Не боялся высоты до этого моста. Ст...,[{'text': 'Único! Experiência fora de série! F...
2,2,Praia Sao Rafael,4.5,/Attraction_Review-g189112-d3869627-Reviews-Pr...,1358,Beaches,"[{'text': ""We walked here one very hot morning...",[{'text': 'Отель на хорошую 8-ку. Замечательны...,"[{'text': 'Praia familiar e reservada, apesar ..."
3,3,Pico do Arieiro,4.5,/Attraction_Review-g189167-d4740534-Reviews-Pi...,6703,Mountains,[{'text': 'As most have written in their revie...,[{'text': 'Будете на Мадейре - заходите на пик...,[{'text': 'Saí do Funchal de madrugada para as...
4,4,Ponta da Piedade,5.0,/Attraction_Review-g189117-d1755231-Reviews-Po...,5460,Geologic Formations,[{'text': 'We walked to the site from our hote...,[{'text': 'Побережье в этом месте самое красив...,[{'text': 'São sempre dignas de deslumbramento...


In [16]:
google_maps = pd.read_csv('google_maps.csv')
google_maps.head()

Unnamed: 0,place,rating,types,user_ratings_total,reviews_en,reviews_ru,reviews_pt
0,Oceanário de Lisboa,4,"['aquarium', 'tourist_attraction', 'point_of_i...",60249,"[{'text': 'Fantastic aquarium, one of the best...",[{'text': 'Я с двумя детьми потратила около 4 ...,[{'text': 'Sem dúvida um ponto de interesse a ...
1,Ponte de Dom Luís I,5,"['tourist_attraction', 'point_of_interest', 'e...",56047,[{'text': 'A very impressive structure. Import...,"[{'text': 'Замечательный мост сам по себе, а т...",[{'text': 'Vista deslumbrante abrangendo a lin...
2,Praia Sao Rafael,5,"['parking', 'point_of_interest', 'establishment']",25,"[{'text': 'Hidden gem of Algarve, a small piec...","[{'text': 'Скрытая жемчужина Алгарве, маленьки...","[{'text': 'Praia pequena mas linda!', 'rating'..."
3,Pico do Arieiro,4,"['natural_feature', 'establishment']",1627,[{'text': 'Get there at least an hour before s...,"[{'text': 'Облака и дождь, а так вообще красот...","[{'text': 'Pena não ter conseguido ver, estava..."
4,Ponta da Piedade,5,"['tourist_attraction', 'point_of_interest', 'e...",11375,[{'text': 'Place worth visiting if you are not...,"[{'text': 'Место притягивает своими пейзажами,...",[{'text': 'Lugar de paisagem ímpar... bastante...


В результате мы получили 2 таблицы с комментариями из двух источников. Осталось проанализировать полученные комментарии.