<h1>Group 1 - Data Collection</h1>
> Rome2Rio scraping script
<span class="tocSkip"></span>
> *Authors : All*

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Environment" data-toc-modified-id="Environment-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Environment</a></span><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#Data-Loading" data-toc-modified-id="Data-Loading-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Data Loading</a></span></li><li><span><a href="#Functions" data-toc-modified-id="Functions-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Functions</a></span></li></ul></li><li><span><a href="#Crawl" data-toc-modified-id="Crawl-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Crawl</a></span></li><li><span><a href="#Descriptive-statistics-on-recovered-data" data-toc-modified-id="Descriptive-statistics-on-recovered-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Descriptive statistics on recovered data</a></span></li></ul></div>

# Introduction

One of the sources we needed to scrape was the Rome2Rio website. This notebook contains the code that retrieves the data using Selenium with ChromeDriver. To keep the Rome2Rio data up to date, we run this script manually every month. We scraped the itineraries between all European capitals. Each record contains the departure and arrival cities, the number of times each transport mode is taken, the minimum and maximum price, and the duration of the itinerary.

V0: Basic scrape 
We retrieve all the information from the website and store the results in a JSON file.

V1: Code quality 
First steps towards complying with the quality charter.

V2: Generalization 
Creation of statistics for Rome2Rio. We ended up scraping 13,674 rows.

V_not used: We tried to adapt the code from ChromeDriver to PhantomJS, but it did not work.

# Environment

## Libraries

In [2]:
import ast
import json
import pandas as pd
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
import re

## Data Loading

## Functions

This part contains all the functions we developed during the project for Rome2Rio.

In [3]:
def getdata(departure_city, arrival_city):
    """Scrape the Rome2Rio itineraries between two cities.

    Parameters:

        departure_city: The departure city
        arrival_city: The arrival city

    Return:

        datafinal: The final dataframe with the variables imposed in the data dictionary for Rome2Rio

    """
    # Chrome options: run headless, without the sandbox or /dev/shm usage
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # Path to the ChromeDriver executable (named so that it does not shadow
    # the imported selenium `webdriver` module)
    driver_path = "../Driver/chromedriver.exe"

    driver = Chrome(driver_path, chrome_options=chrome_options)
    # URL of the page we want to scrape
    url = "https://www.rome2rio.com/map/" + departure_city + "/" + arrival_city
    driver.get(url)

    # Locate the route titles and details once, by class attribute name
    titles = driver.find_elements_by_class_name("route__title")
    details = driver.find_elements_by_class_name("route__details")

    total = []

    for title, detail in zip(titles, details):
        a = title.text
        b = detail.text

        # Number of days
        nb_days = re.search("([0-9]+) days", a)
        nb_days = int(nb_days.group(1)) if nb_days else 0

        # Number of hours (parsed whether or not the title contains days)
        nb_hours = re.search("([0-9]+)h", a)
        nb_hours = int(nb_hours.group(1)) if nb_hours else 0

        # Number of minutes
        nb_minute = re.search("([0-9]+)m", a)
        nb_minute = int(nb_minute.group(1)) if nb_minute else 0

        total.append((a, b, nb_days, nb_hours, nb_minute))

    # Close the browser so that each call does not leave a Chrome process running
    driver.quit()

    data = pd.DataFrame(total, columns=[
                        'itineraire', 'prix', 'liste_jour', 'liste_hours', 'liste_minute'])

    data['Data_Source'] = 'Rome2rio'
    data['Departure_city'] = departure_city
    data['Arrival_city'] = arrival_city
    # Count which transport we take and how many times
    data['Nb_bus_taken'] = data.itineraire.str.count("Bus|bus")
    data['Nb_train_taken'] = data.itineraire.str.count("Train|train")
    data['Nb_car_taken'] = data.itineraire.str.count(
        "Drive|drive|Rideshare|rideshare|car|Car")
    data['Nb_plane_taken'] = data.itineraire.str.count("Fly|fly")

    # Give the minimal price and the maximum price
    data['Price_max'] = data['prix'].apply(lambda st: st[st.find("-")+1:])
    data['Price_min'] = data['prix'].apply(lambda st: st[:st.find("-")])
    data['Duration1'] = data['itineraire'].apply(
        lambda st: st[st.find("•")+1:])

    # Total duration in minutes (the three columns are already integers)
    data['Duration'] = (data['liste_jour'] * 24 * 60
                        + data['liste_hours'] * 60
                        + data['liste_minute'])

    datafinal = data[['Data_Source', 'Departure_city', 'Arrival_city', 'Nb_bus_taken',
                      'Nb_train_taken', 'Nb_car_taken', 'Nb_plane_taken', 'Duration', 'Price_min', 'Price_max']]

    return datafinal
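
The parsing steps inside `getdata` can be isolated into small helpers that are easy to test without launching a browser. The sketch below is only an illustration: the helper names `parse_duration` and `split_price_range` are hypothetical, and the sample strings (`"Fly • 4h 45m"`, `"47€ - 224€"`) are assumed shapes for the scraped text, not captured from the site. It also covers the case where a price has no `-` separator, for which `str.find` returns `-1` and slicing would cut off the last character.

```python
import re


def parse_duration(title):
    """Return the duration in minutes found in a route title.

    Assumed title shape: 'Fly • 4h 45m' or 'Train • 1 days 2h 5m'.
    """
    total = 0
    # Each pattern captures a number followed by its unit
    units = ((r"(\d+)\s*days?", 24 * 60), (r"(\d+)\s*h", 60), (r"(\d+)\s*m", 1))
    for pattern, minutes_per_unit in units:
        match = re.search(pattern, title)
        if match:
            total += int(match.group(1)) * minutes_per_unit
    return total


def split_price_range(text):
    """Split a price range into (minimum, maximum).

    When there is no '-' separator, the single price is used for both bounds.
    """
    low, separator, high = text.partition("-")
    low, high = low.strip(), high.strip()
    if not separator:
        high = low
    return low, high


print(parse_duration("Fly • 4h 45m"))
print(split_price_range("47€ - 224€"))
```

Keeping these helpers separate from the Selenium calls means the duration and price logic can be checked on plain strings before running the full crawl.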

# Crawl

This part crawls the Rome2Rio website by running the functions defined above.

To run the script on all European capitals, we use the list below.

In [1]:
liste_capitale = ["Paris", "Berlin", "Rome", "Madrid", "London", "Dublin", "Lisbon", "Brussels", "Luxembourg", "Amsterdam",
                  "Bern", "Copenhagen", "Oslo", "Stockholm", "Helsinki", "Tallinn", "Riga", "Vilnius", "Warsaw", "Prague", "Vienna", "Bratislava",
                  "Budapest", "Ljubljana", "Ankara", "Bucharest", "Belgrade", "Sofia", "Tirana", "Skopje", "Athens", "Chisinau", "Kiev", "Minsk",
                  "Moscow", "Tbilissi", "Bakou", "Yerevan", "Sarajevo", "Reykjavik", "Valletta", "Zagreb", "Nicosia", "Andorra la Vella",
                  "San Marino", "Vatican City"]

In [5]:
# Run getdata on every ordered pair of distinct capitals
pairs = [(x, y) for x in liste_capitale for y in liste_capitale if x != y]
appended_data = []
for departure, arrival in pairs:
    appended_data.append(getdata(departure, arrival))
appended_data = pd.concat(appended_data, ignore_index=True)
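
With 46 capitals, the crawl covers 46 × 45 = 2,070 ordered pairs, so a single failing page would interrupt a long run. The sketch below shows one possible way to generate the pairs with `itertools.permutations` and to record failures instead of stopping. The names `crawl_pairs` and `fake_fetch` are hypothetical; `fake_fetch` stands in for `getdata` so that the example runs without a browser.

```python
from itertools import permutations


def crawl_pairs(cities, fetch):
    """Call fetch(departure, arrival) for every ordered pair of distinct cities."""
    results, failures = [], []
    for departure, arrival in permutations(cities, 2):
        try:
            results.append(fetch(departure, arrival))
        except Exception as error:
            # Record the failing pair and keep crawling the remaining ones
            failures.append((departure, arrival, str(error)))
    return results, failures


def fake_fetch(departure, arrival):
    """Hypothetical stand-in for getdata: fails for one destination."""
    if arrival == "Rome":
        raise ValueError("no route found")
    return f"{departure}->{arrival}"


results, failures = crawl_pairs(["Paris", "Berlin", "Rome"], fake_fetch)
print(len(results), "pairs scraped,", len(failures), "pairs failed")
```

The list of failed pairs can then be retried separately rather than restarting the whole crawl.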



In [6]:
appended_data.head(3)

| | Data_Source | Departure_city | Arrival_city | Nb_bus_taken | Nb_train_taken | Nb_car_taken | Nb_plane_taken | Duration | Price_min | Price_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rome2rio | Paris | Berlin | 0 | 0 | 0 | 1 | 285 | 47€ | 224€ |
| 1 | Rome2rio | Paris | Berlin | 0 | 0 | 0 | 1 | 281 | 49€ | 277€ |
| 2 | Rome2rio | Paris | Berlin | 0 | 0 | 0 | 1 | 293 | 57€ | 326€ |


In [7]:
# The other columns we want to have in the final dataframe
col_names = ['Date_Review', 'Review', 'Airline_Name', 'Airline_Type', 'Region_Operation', 'Aircraft_Type', 'Cabin_Class', 'Type_Of_Lounge',
             'Type_Of_Traveller', 'Date_Visit', 'Date_Flown', 'Airport', 'Route', 'Category', 'Category_Detail',
             'Cabin_Staff_Service', 'Lounge_Staff_Service', 'Bar_And_Beverages', 'Food_And_Beverages', 'Ground_Service', 'Catering', 'Cleanliness',
             'Lounge_Comfort', 'Aisle_Space', 'Wifi_And_Connectivity', 'Inflight_Entertainment', 'Viewing_Tv_Screen', 'Power_Supply',
             'Seat', 'Seat_type', 'Seat_Comfort', 'Seat_Legroom', 'Seat_Storage', 'Seat_Width', 'Seat_Recline', 'Washrooms',
             'Value_For_Money', 'Overall_Customer_Rating', 'Overall_Service_Rating', 'Overall_Airline_Rating',
             'Recommended', 'Title', 'Author', 'Description', 'Date_publication',
             'View_count', 'Likes', 'Dislikes', 'Nb_subscribers', 'Nb_comments', 'Nb_sharing', 'Hashtags', 'Awards', 'Registration', 'Location',
             'Contributions_Pers', 'Nb_pertinent_comments_Pers', 'Queuing_Times', 'Terminal_Seating', 'Terminal_Signs', 'Airport_Shopping',
             'Experience_At_Airport']
datafinal = pd.DataFrame(columns=col_names)

In [8]:
# Concatenate to have all features
rom2rio_ = pd.concat([appended_data, datafinal], sort=False)

In [9]:
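# Export the final result to JSON
# to_json() returns a string, so parse it back before dumping to avoid double encoding
rom2rio = json.loads(rom2rio_.to_json())
# Open in write mode so that the file stays a single valid JSON document
with open('../Results_json/data_Rome2Rio.json', 'w', encoding='utf8') as outfile:
    json.dump(rom2rio, outfile, ensure_ascii=False, indent=4)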
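
A point worth checking in the export step: `DataFrame.to_json()` called without a path returns a `str`, so passing its result straight to `json.dump` would write one quoted string rather than a JSON object. The sketch below reproduces the effect with the standard library only. The `records` dictionary is a hand-written stand-in for the dataframe content, not actual output.

```python
import json

# Hand-written stand-in for the column-oriented content of the dataframe
records = {"Departure_city": {"0": "Paris"}, "Duration": {"0": 285}}

# What to_json() would return: JSON text held in a Python str
as_text = json.dumps(records)

# Dumping that str again wraps the whole document in quotes
double_encoded = json.dumps(as_text)
print(type(json.loads(double_encoded)))  # a str, not a dict

# Parsing the text first yields a document that loads back as a dict
clean = json.dumps(json.loads(as_text), ensure_ascii=False, indent=4)
print(type(json.loads(clean)))
```

Loading the exported file with `json.load` and checking the type of the result is a quick way to confirm which of the two cases applies.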

# Descriptive statistics on recovered data

In [19]:
# Load the results for all European cities
r2r = pd.read_csv("../Results_json/data_Rome2Rio.csv", sep=";")

In [21]:
r2r[['Departure_city', 'Arrival_city', 'Duration']].groupby(['Departure_city', 'Arrival_city']).agg(
    [('Min', 'min'), ('Mean', 'mean'), ('Median', 'median'), ('Max', 'max'), ('Count', 'count')])

| Departure_city | Arrival_city | Duration Min | Duration Mean | Duration Median | Duration Max | Duration Count |
|---|---|---|---|---|---|---|
| Amsterdam | Andorra la Vella | 478 | 944.333333 | 875.0 | 1902 | 9 |
| Amsterdam | Athens | 390 | 1681.818182 | 1931.0 | 3000 | 11 |
| Amsterdam | Bakou | 636 | 3008.272727 | 3180.0 | 6300 | 11 |
| Amsterdam | Belgrade | 319 | 1005.700000 | 1093.5 | 1840 | 10 |
| Amsterdam | Berlin | 240 | 358.888889 | 354.0 | 585 | 9 |
| Amsterdam | Bern | 301 | 442.333333 | 427.0 | 763 | 9 |
| Amsterdam | Bratislava | 295 | 645.200000 | 537.5 | 1142 | 10 |
| Amsterdam | Brussels | 113 | 160.333333 | 147.5 | 262 | 6 |
| Amsterdam | Bucharest | 366 | 1275.100000 | 1515.5 | 2246 | 10 |
| Amsterdam | Budapest | 299 | 809.200000 | 825.0 | 1554 | 10 |
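
The same aggregation can be tried on a small sample before loading the full file. The sketch below uses the three Paris to Berlin durations shown earlier in this notebook and pandas named aggregation, which gives flat column names instead of the two-level header above. It is an alternative way to write the call, not the one used for the table.

```python
import pandas as pd

# Three Paris -> Berlin rows taken from the head() output of the crawl
sample = pd.DataFrame({
    "Departure_city": ["Paris", "Paris", "Paris"],
    "Arrival_city": ["Berlin", "Berlin", "Berlin"],
    "Duration": [285, 281, 293],
})

# One row per (departure, arrival) pair, one column per statistic
stats = sample.groupby(["Departure_city", "Arrival_city"])["Duration"].agg(
    Min="min", Mean="mean", Median="median", Max="max", Count="count")

print(stats)
```

Durations are in minutes, so the statistics can be divided by 60 when hours are easier to read.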
