# Web scraping project - Complete API and Web Scraping Notebook
### **Léo RINGEISSEN & Santiago MARTIN - DIA 3**

## Introduction

**This notebook is composed of 4 parts:**
- First, we import all of the libraries we'll need to complete this phase of the project
- Second, we retrieve emissions information for different itineraries from the SNCF API and pair each itinerary with its respective TripAdvisor links
- Third, we use the corresponding links to scrape the TripAdvisor pages for reviews of the destinations
- Finally, we merge all of the gathered data into two CSV files, one that's not aggregated and one that is, with concatenated titles and reviews and averaged ratings

**Adjustments to the project scope:**

We've learned over the course of this work that a typical method used by companies to prevent bots from illegally scraping their web pages is to deliberately complicate the repetitive tasks a bot would do. 

An example of this is having inconsistent page scrolling, where going to page 2 from page 3 won't be the same as if you'd gone from page 1. Another example is not having simple logic between the URLs of different pages. There are also dynamic elements like text translations that do not get retrieved in the initial soup provided by Selenium.

We initially thought that these were merely poor implementations on TripAdvisor's part, but we now understand that these inefficiencies are deliberate and meant to prevent automated web scraping. We therefore cannot bypass most of these issues without partnering with or paying TripAdvisor to access their data.

As a result of these limitations, as well as limitations in the data available from the SNCF API, we've reduced the scope of our project to only trips by train whose origin is Paris and whose destination has a page with sufficient reviews in French on TripAdvisor. In addition, many destinations didn't have review pages at all, so we took the reviews of that destination's most iconic landmark as an adequate substitute.

**Data quality and project scalability:**

With these adjustments to the project scope, we end up working with around a third (36) of the itineraries accessible with the API (119), and about half of the itineraries whose origin is Paris (75). This still leaves us with ample room to make fun and ecological travel recommendations based on traveler preferences and ecological goals, which was the objective we set out to achieve with this project.

Since we couldn't scroll through review pages without being thrown out for bot detection, we resolved the issue by manually selecting the URLs of the first and second review pages (when a second review page is available). We recognize that this solution would not scale to a project with a bigger scope: with more destinations and review pages, copying all of the URLs by hand would require a lot of manual labor. Fortunately, the goals of this project are to learn to scrape web data and use it for Machine Learning applications, so this is not a concern and we can still move forward.

Using automated Google searches to find the URLs was not a viable option either: Google's search results are also made difficult to use via a bot, and we would still have to manually check the quality of the first URLs returned. A more scalable solution would need more complex and powerful libraries than Selenium.

**Final CSV generation:**

By the end of this notebook we generate an aggregated and a non-aggregated CSV file containing our API and web scraped data. The columns are origin, destination, links for first and second review pages, distance between origin and destination, train trip carbon emissions, the scraped URL (only in the non-aggregated file), the review title (concatenated in the aggregated file), the review content (concatenated in the aggregated file), and the review rating (averaged in the aggregated file).

## Importations

In [70]:
import requests
import pandas as pd
import csv
import time
import random
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

## API data retrieval

In [71]:
# Define API URL
url = "https://ressources.data.sncf.com/api/records/1.0/search/"

# List of destinations and links
# Each tuple contains (Destination, Page1 Link, Page2 Link)
destinations_and_links = [
    ('Orléans', 'https://www.tripadvisor.fr/Attraction_Review-g187129-d9788284-Reviews-or10-Centre_ville-Orleans_Loiret_Centre_Val_de_Loire.html', None),
    ('Metz', 'https://www.tripadvisor.fr/ShowUserReviews-g187164-d2060561-r425068727-Gare_de_Metz_Ville-Metz_Moselle_Grand_Est.html', None),
    ('Strasbourg', 'https://www.tripadvisor.fr/ShowUserReviews-g187075-r287480889-Strasbourg_Bas_Rhin_Grand_Est.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187075-r86136484-Strasbourg_Bas_Rhin_Grand_Est.html#REVIEWS'),
    ('Mulhouse', 'https://fr.tripadvisor.ch/ShowUserReviews-g196495-d4323527-r683174230-Mulhouse_Alsace_Agglomeration-Mulhouse_Haut_Rhin_Grand_Est.html', None),
    ('Nancy', 'https://www.tripadvisor.fr/ShowUserReviews-g187162-r189484920-Nancy_Meurthe_et_Moselle_Grand_Est.html', None),
    ('Reims', 'https://www.tripadvisor.fr/Attraction_Review-g187137-d230790-Reviews-Cathedrale_Notre_Dame_de_Reims-Reims_Marne_Grand_Est.html', None),
    ('Freiburg Breisgau HBF', 'https://www.tripadvisor.fr/Attraction_Review-g315924-d7613164-Reviews-Fribourg_Centre-Fribourg_Canton_of_Fribourg.html', None),
    ('Annecy', 'https://www.tripadvisor.fr/ShowUserReviews-g187260-r111704080-Annecy_Haute_Savoie_Auvergne_Rhone_Alpes.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187260-r86606692-Annecy_Haute_Savoie_Auvergne_Rhone_Alpes.html'),
    ('Zuerich HB', 'https://www.tripadvisor.fr/ShowUserReviews-g188113-r110402149-Zurich.html', None),
    ('Grenoble', 'https://www.tripadvisor.fr/ShowUserReviews-g187264-r86484268-Grenoble_Isere_Auvergne_Rhone_Alpes.html#REVIEWS', None),
    ('Geneve', 'https://www.tripadvisor.fr/ShowUserReviews-g188057-r110008648-Geneva.html#REVIEWS', None),
    ('Torino Porta Susa', 'https://www.tripadvisor.fr/ShowUserReviews-g187855-d14015512-r680565044-Torino_Centro_Storico-Turin_Province_of_Turin_Piedmont.html', None),
    ('Montpellier Saint-Roch', 'https://www.tripadvisor.fr/ShowUserReviews-g187153-r196910894-Montpellier_Herault_Occitanie.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187153-r87903461-Montpellier_Herault_Occitanie.html#REVIEWS'),
    ('Lausanne', 'https://www.tripadvisor.fr/ShowUserReviews-g188107-r83803063-Lausanne_Canton_of_Vaud.html', None),
    ('Lyon Part Dieu', 'https://www.tripadvisor.fr/ShowUserReviews-g187265-d195466-r930458891-Vieux_Lyon-Lyon_Rhone_Auvergne_Rhone_Alpes.html', None),
    ('Milano P Garibaldi', 'https://www.tripadvisor.fr/ShowUserReviews-g187849-r86055239-Milan_Lombardy.html', 'https://www.tripadvisor.fr/ShowUserReviews-g187849-r47647288-Milan_Lombardy.html#REVIEWS'),
    ('Besançon Franche Comté', 'https://www.tripadvisor.fr/ShowUserReviews-g187143-d12685787-r661965295-Centre_historique_de_Besancon-Besancon_Doubs_Bourgogne_Franche_Comte.html', None),
    ('Marseille Saint-Charles', 'https://www.tripadvisor.fr/ShowUserReviews-g187253-r282071566-Marseille_Bouches_du_Rhone_Provence_Alpes_Cote_d_Azur.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187253-r236557551-Marseille_Bouches_du_Rhone_Provence_Alpes_Cote_d_Azur.html#REVIEWS'),
    ('Toulon', 'https://www.tripadvisor.fr/ShowUserReviews-g187257-d2240716-r908833935-Old_Town-Toulon_Var_Provence_Alpes_Cote_d_Azur.html', None),
    ('Basel SBB', 'https://www.tripadvisor.fr/Attraction_Review-g188049-d3253822-Reviews-Basel_s_Old_Town-Basel.html', None),
    ('Barcelona Sants', 'https://www.tripadvisor.fr/ShowUserReviews-g187497-r296210183-Barcelona_Catalonia.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187497-r224251966-Barcelona_Catalonia.html#REVIEWS'),
    ('Chambéry - Challes-les-Eaux', 'https://www.tripadvisor.fr/ShowUserReviews-g8309764-d8036417-r834475582-Ville_Ancienne-Chambery_Savoie_Auvergne_Rhone_Alpes.html', None),
    ('Nice', 'https://www.tripadvisor.fr/ShowUserReviews-g187234-r227473428-Nice_French_Riviera_Cote_d_Azur_Provence_Alpes_Cote_d_Azur.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187234-r148571361-Nice_French_Riviera_Cote_d_Azur_Provence_Alpes_Cote_d_Azur.html#REVIEWS'),
    ('Lille Europe', 'https://www.tripadvisor.fr/ShowUserReviews-g187178-r243531408-Lille_Nord_Hauts_de_France.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187178-r55549677-Lille_Nord_Hauts_de_France.html#REVIEWS'),
    ('La Rochelle', 'https://www.tripadvisor.fr/ShowUserReviews-g187206-d1161992-r905093931-Vieux_Port-La_Rochelle_Charente_Maritime_Nouvelle_Aquitaine.html', None),
    ('Le Mans', 'https://www.tripadvisor.fr/ShowUserReviews-g187195-d8820281-r947848356-Le_Mans-Le_Mans_City_Le_Mans_Sarthe_Pays_de_la_Loire.html', None),
    ('Arcachon', 'https://www.tripadvisor.fr/Attraction_Review-g196505-d13116954-Reviews-Bassin_d_Arcachon-Arcachon_Gironde_Nouvelle_Aquitaine.html', None),
    ('Vannes', 'https://www.tripadvisor.fr/ShowUserReviews-g196537-d4744044-r271350649-Centre_Historique_de_Vannes-Vannes_Morbihan_Brittany.html', None),
    ('Saint-Pierre-des-Corps', 'https://www.tripadvisor.fr/ShowUserReviews-g187130-r87967680-Tours_Indre_et_Loire_Centre_Val_de_Loire.html#REVIEWS', None),
    ('Biarritz', 'https://www.tripadvisor.fr/ShowUserReviews-g187080-d662516-r719142517-La_Cote_des_Basques-Biarritz_Basque_Country_Pyrenees_Atlantiques_Nouvelle_Aquitai.html', None),
    ('Brest', 'https://www.tripadvisor.fr/ShowUserReviews-g187095-d13115677-r758545879-Le_Telepherique-Brest_Finistere_Brittany.html', None),
    ('Bordeaux Saint-Jean', 'https://www.tripadvisor.fr/ShowUserReviews-g187079-r146752265-Bordeaux_Gironde_Nouvelle_Aquitaine.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187079-r87964667-Bordeaux_Gironde_Nouvelle_Aquitaine.html#REVIEWS'),
    ('Bruxelles N Aero', 'https://www.tripadvisor.fr/ShowUserReviews-g188644-r116978873-Brussels.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g188644-r47132120-Brussels.html#REVIEWS'),
    ('Calais Ville', 'https://www.tripadvisor.fr/ShowUserReviews-g196659-d2578078-r765445999-Plage_de_Calais-Calais_Pas_de_Calais_Hauts_de_France.html', None),
    ('London St-Pancras', 'https://www.tripadvisor.fr/ShowUserReviews-g186338-r298426044-London_England.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g186338-r216169055-London_England.html#REVIEWS'),
    ('Koeln HBF', 'https://www.tripadvisor.fr/Attraction_Review-g187371-d8135291-Reviews-Historic_Old_Town-Cologne_North_Rhine_Westphalia.html#REVIEWS', None),
    ('Amsterdam Centraal', 'https://www.tripadvisor.fr/ShowUserReviews-g188590-r108902684-Amsterdam_North_Holland_Province.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g188590-r32248877-Amsterdam_North_Holland_Province.html#REVIEWS'),
    ('Rouen Rive Droite', 'https://www.tripadvisor.fr/ShowUserReviews-g187191-r88211602-Rouen_Seine_Maritime_Haute_Normandie_Normandy.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g187191-r87879158-Rouen_Seine_Maritime_Haute_Normandie_Normandy.html#REVIEWS'),
    ('Le Havre Gare locale', 'https://www.tripadvisor.fr/ShowUserReviews-g187190-d10256712-r461549484-Centre_ville_reconstruit_du_Havre-Le_Havre_Seine_Maritime_Haute_Normandie_Norma.html', None),
    ('Trouville-Deauville', 'https://www.tripadvisor.fr/Attraction_Review-g187184-d4947553-Reviews-Plage_de_Deauville-Deauville_City_Calvados_Basse_Normandie_Normandy.html#REVIEWS', None),
    ('Brive-la-Gaillarde', 'https://www.tripadvisor.fr/ShowUserReviews-g196612-r90222351-Brive_la_Gaillarde_Correze_Nouvelle_Aquitaine.html#REVIEWS', 'https://www.tripadvisor.fr/ShowUserReviews-g196612-r57594561-Brive_la_Gaillarde_Correze_Nouvelle_Aquitaine.html#REVIEWS'),
    ('Limoges-Bénédictins', 'https://www.tripadvisor.fr/ShowUserReviews-g187159-r146599786-Limoges_Haute_Vienne_Nouvelle_Aquitaine.html#REVIEWS', None),
    ('Clermont-Ferrand', 'https://www.tripadvisor.fr/ShowUserReviews-g187091-r105094784-Clermont_Ferrand_Puy_de_Dome_Auvergne_Rhone_Alpes.html#REVIEWS', None)
]

# Extract API data
params = {
    "dataset": "emission-co2-perimetre-complet",
    "q": "",
    "rows": 1000,
    "select": "origine, destination, distance_entre_les_gares, train_empreinte_carbone_kgco2e"
}

results = []
response = requests.get(url, params=params)
if response.status_code == 200:
    data = response.json()
    city_to_link = {city: (link1, link2) for city, link1, link2 in destinations_and_links}
    if "records" in data and data["records"]:
        for record in data["records"]:
            fields = record.get("fields", {})
            origine = fields.get("origine", "")
            destination = fields.get("destination", "")
            if 'Paris' in origine and destination in city_to_link:
                results.append({
                    "origine": origine,
                    "destination": destination,
                    "page1_link": city_to_link[destination][0],
                    "page2_link": city_to_link[destination][1],
                    "distance": fields.get("distance_entre_les_gares", 0),
                    "train_emissions": fields.get("train_empreinte_carbone_kgco2e", None)
                })
else:
    print(f"Error: {response.status_code} - {response.text}")
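The filtering step above can be exercised without calling the API. The following is a minimal sketch rather than the notebook's own code: `filter_paris_records` and `unmatched_destinations` are hypothetical helpers, the `example.org` links are placeholders, and the sample records are invented (only their shape, a `fields` dict per record, follows the cell above). It also reports which destinations of the link table found no matching record, which can help spot spelling mismatches between the API's station names and the hand-written list.

```python
def filter_paris_records(records, city_to_link):
    """Keep records leaving from a Paris station whose destination has links.

    `records` is a list of dicts shaped like the API's "records" entries;
    `city_to_link` maps a destination name to a (page1, page2) tuple.
    """
    kept = []
    for record in records:
        fields = record.get("fields", {})
        origine = fields.get("origine", "")
        destination = fields.get("destination", "")
        if "Paris" in origine and destination in city_to_link:
            page1, page2 = city_to_link[destination]
            kept.append({
                "origine": origine,
                "destination": destination,
                "page1_link": page1,
                "page2_link": page2,
                "distance": fields.get("distance_entre_les_gares"),
                "train_emissions": fields.get("train_empreinte_carbone_kgco2e"),
            })
    return kept


def unmatched_destinations(kept, city_to_link):
    """Destinations from the link table that no kept record refers to."""
    matched = {row["destination"] for row in kept}
    return sorted(set(city_to_link) - matched)


# Invented sample data, for illustration only.
sample_links = {
    "Annecy": ("https://example.org/annecy-1", None),
    "Nice": ("https://example.org/nice-1", "https://example.org/nice-2"),
}
sample_records = [
    {"fields": {"origine": "Paris Gare de Lyon", "destination": "Annecy",
                "distance_entre_les_gares": 545.0,
                "train_empreinte_carbone_kgco2e": 1.5805}},
    {"fields": {"origine": "Lyon Part Dieu", "destination": "Nice"}},
    {"fields": {"origine": "Paris Nord", "destination": "Somewhere else"}},
]

kept = filter_paris_records(sample_records, sample_links)
missing = unmatched_destinations(kept, sample_links)
print(kept)
print(missing)
```

In this toy run only the Paris to Annecy record is kept, and `Nice` is reported as unmatched because its only record does not start from Paris.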

In [72]:
print(results)

[{'origine': 'Paris Gare de Lyon', 'destination': 'Annecy', 'page1_link': 'https://www.tripadvisor.fr/ShowUserReviews-g187260-r111704080-Annecy_Haute_Savoie_Auvergne_Rhone_Alpes.html#REVIEWS', 'page2_link': 'https://www.tripadvisor.fr/ShowUserReviews-g187260-r86606692-Annecy_Haute_Savoie_Auvergne_Rhone_Alpes.html', 'distance': 545.0, 'train_emissions': 1.5805}, {'origine': 'Paris Gare de Lyon', 'destination': 'Zuerich HB', 'page1_link': 'https://www.tripadvisor.fr/ShowUserReviews-g188113-r110402149-Zurich.html', 'page2_link': None, 'distance': 614.0, 'train_emissions': 2.0876}, {'origine': 'Paris Saint-Lazare', 'destination': 'Rouen Rive Droite', 'page1_link': 'https://www.tripadvisor.fr/ShowUserReviews-g187191-r88211602-Rouen_Seine_Maritime_Haute_Normandie_Normandy.html#REVIEWS', 'page2_link': 'https://www.tripadvisor.fr/ShowUserReviews-g187191-r87879158-Rouen_Seine_Maritime_Haute_Normandie_Normandy.html#REVIEWS', 'distance': 139.0, 'train_emissions': 3.3916}, {'origine': 'Paris Montp

## Web scraping reviews

In [73]:
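# Collect Reviews
# A Selenium Chrome driver is started here, but the loop below fetches the
# pages with requests; the driver itself is not used in this cell.
driver_path = 'C:/ChromeDriver/chromedriver-win64/chromedriver.exe'
service = Service(driver_path)
driver = webdriver.Chrome(service=service)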

# Function to convert rating bubbles
def convert_bubble_to_rating(bubble_class):
    if 'bubble_' in bubble_class:
        bubble_value = int(bubble_class.split('_')[1])
        return bubble_value // 10
    return None

all_reviews = []
for record in results:
    origine = record['origine']
    destination = record['destination']
    for page_link in ['page1_link', 'page2_link']:
        url = record[page_link]
        if not url: continue
        time.sleep(random.uniform(3, 7))
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0',
            'Accept-Language': 'fr-FR,fr;q=0.9',
            'Referer': 'https://www.google.com',
            'Connection': 'keep-alive'
        })
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            review_blocks = soup.find_all('div', id=re.compile(r'review_\d+'))
            for block in review_blocks:
                # Look up each element once, then extract its text if present
                title_element = block.find('div', class_='quote')
                title = title_element.get_text(strip=True) if title_element else None
                content_element = block.find('div', class_='entry')
                content = content_element.get_text(strip=True) if content_element else None
                rating_element = block.find('span', class_='ui_bubble_rating')
                rating = convert_bubble_to_rating(rating_element['class'][1]) if rating_element else None
                all_reviews.append({
                    'origine': origine,
                    'destination': destination,
                    'page1_link': record['page1_link'],
                    'page2_link': record['page2_link'],
                    'distance': record['distance'],
                    'train_emissions': record['train_emissions'],
                    'scraped_url': url,
                    'title': title,
                    'review': content,
                    'rating': rating
                })
        else:
            print(f"Failed to fetch page: {url}")
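`convert_bubble_to_rating` above relies on the `bubble_NN` class being the second entry of the element's class list. A slightly more defensive variant is sketched below. `rating_from_classes` is a hypothetical name, and the assumption that five bubbles are encoded as `bubble_50` is taken from the function above, not from any TripAdvisor documentation. The sketch scans every class and divides by 10 without truncating, so a half-bubble class such as `bubble_45` would give 4.5 instead of 4.

```python
import re

# Matches class names such as "bubble_50" and captures the number.
BUBBLE_PATTERN = re.compile(r"^bubble_(\d+)$")


def rating_from_classes(classes):
    """Return the rating encoded in a list of CSS classes, or None."""
    for css_class in classes or []:
        match = BUBBLE_PATTERN.match(css_class)
        if match:
            return int(match.group(1)) / 10
    return None


print(rating_from_classes(["ui_bubble_rating", "bubble_50"]))
print(rating_from_classes(["bubble_45", "ui_bubble_rating"]))
print(rating_from_classes(["ui_bubble_rating"]))
```

With BeautifulSoup, the class list of a tag can be read with `rating_element.get('class')`, so the helper could be called as `rating_from_classes(rating_element.get('class'))` once `rating_element` is known to exist.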

In [74]:
print(all_reviews)

[{'origine': 'Paris Gare de Lyon', 'destination': 'Annecy', 'page1_link': 'https://www.tripadvisor.fr/ShowUserReviews-g187260-r111704080-Annecy_Haute_Savoie_Auvergne_Rhone_Alpes.html#REVIEWS', 'page2_link': 'https://www.tripadvisor.fr/ShowUserReviews-g187260-r86606692-Annecy_Haute_Savoie_Auvergne_Rhone_Alpes.html', 'distance': 545.0, 'train_emissions': 1.5805, 'scraped_url': 'https://www.tripadvisor.fr/ShowUserReviews-g187260-r111704080-Annecy_Haute_Savoie_Auvergne_Rhone_Alpes.html#REVIEWS', 'title': '“Annecyyyy...!!! Quand tu nous tiens..!!!”', 'review': "Vtt sur le Semnoz, pédalo sur le lac, promenade au bord de l'eau (superbe), l'ambiance de la vielle ville, pti resto super sympa... vraiment vraiment bien.... I love Annecy .... :-)", 'rating': 5}, {'origine': 'Paris Gare de Lyon', 'destination': 'Annecy', 'page1_link': 'https://www.tripadvisor.fr/ShowUserReviews-g187260-r111704080-Annecy_Haute_Savoie_Auvergne_Rhone_Alpes.html#REVIEWS', 'page2_link': 'https://www.tripadvisor.fr/ShowU

## CSV generation

### Non-aggregated CSV

In [75]:
# Save non-aggregated reviews to CSV
non_aggregated_csv = 'non_aggregated_emissions_and_reviews.csv'
reviews_df = pd.DataFrame(all_reviews)
reviews_df.to_csv(non_aggregated_csv, index=False, encoding='utf-8')

### Aggregated CSV

In [76]:
# Aggregate Reviews
aggregated_reviews = reviews_df.groupby('destination').agg(
    titles=('title', lambda x: ' || '.join(x.dropna().astype(str))),
    reviews=('review', lambda x: ' || '.join(x.dropna().astype(str))),
    average_rating=('rating', 'mean')
).reset_index()

# Merge with Emissions Data
emissions_df = pd.DataFrame(results)
merged_df = pd.merge(emissions_df, aggregated_reviews, how='left', on='destination')

# Final Aggregated CSV
output_csv = 'aggregated_emissions_and_reviews.csv'
merged_df.to_csv(output_csv, index=False, encoding='utf-8')
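To make the aggregation concrete, here is the same grouping logic written in plain Python on a few invented rows. This is an illustrative sketch, not a replacement for the pandas cell: `aggregate_reviews` is a hypothetical helper and the sample reviews are made up. As in the cell above, missing titles and reviews are skipped before joining with `' || '`, and ratings are averaged per destination.

```python
from collections import defaultdict
from statistics import mean


def aggregate_reviews(rows, separator=" || "):
    """Group review rows by destination: join texts, average ratings."""
    grouped = defaultdict(lambda: {"titles": [], "reviews": [], "ratings": []})
    for row in rows:
        bucket = grouped[row["destination"]]
        if row.get("title") is not None:
            bucket["titles"].append(str(row["title"]))
        if row.get("review") is not None:
            bucket["reviews"].append(str(row["review"]))
        if row.get("rating") is not None:
            bucket["ratings"].append(row["rating"])
    return {
        destination: {
            "titles": separator.join(bucket["titles"]),
            "reviews": separator.join(bucket["reviews"]),
            # No rating at all for a destination gives None instead of a mean
            "average_rating": mean(bucket["ratings"]) if bucket["ratings"] else None,
        }
        for destination, bucket in grouped.items()
    }


# Invented rows, for illustration only.
toy_rows = [
    {"destination": "Annecy", "title": "Superbe", "review": "Le lac est magnifique", "rating": 5},
    {"destination": "Annecy", "title": "Sympa", "review": None, "rating": 4},
    {"destination": "Nice", "title": None, "review": "Belle promenade", "rating": None},
]

aggregated = aggregate_reviews(toy_rows)
print(aggregated["Annecy"])
print(aggregated["Nice"])
```

A destination for which no review row was scraped never appears in the grouped result, so the left merge with the emissions data leaves its review columns empty.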

Delete empty rows (destinations for which no reviews were scraped)

In [22]:
import pandas as pd
df = pd.read_csv('aggregated_emissions_and_reviews.csv',sep=",")
len(df)

43

In [23]:
df = df.dropna(subset=['reviews'])
len(df)

36

In [24]:
df.to_csv('cleaned_aggregated_emissions_and_reviews.csv', index=False)
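Before dropping rows, it can be useful to see which destinations are about to disappear. The sketch below does this with the standard `csv` module on an invented three-row file. The column names match the aggregated CSV, but the destinations and values are placeholders, not results from this project.

```python
import csv
import io

# Invented CSV content with the same column names as the aggregated file.
toy_csv = io.StringIO(
    "destination,reviews,average_rating\n"
    "Destination A,Le lac est magnifique,4.5\n"
    "Destination B,,\n"
    "Destination C,Belle promenade,5.0\n"
)

rows = list(csv.DictReader(toy_csv))

# A destination counts as empty when its reviews cell is blank.
dropped = [row["destination"] for row in rows if not row["reviews"].strip()]
kept_rows = [row for row in rows if row["reviews"].strip()]

print(dropped)
print(len(kept_rows))
```

Listing the dropped destinations this way makes it easier to decide whether a missing page is worth a second manual look at its URL.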