#### 0. Imports

In [None]:
# data processing
import pandas as pd
import numpy as np

# browser automation
import selenium

# working with time
import time
from datetime import date, timedelta

# working with asynchronous functions
import asyncio

# import system to append parent folder to path - enables src importing
import sys
sys.path.append("..")

# data extraction support functions
import src.data_extraction_support as des
from src.data_extraction_support import create_country_airport_code_df

# 1.Introduction to this notebook

# 2. Data extraction

The idea is to periodically be saving information to the database about travel to several cities. That way, people can consult the what the best dates are among their options, with respect to their preferences in budget and other travel preferences.

For the time being, the only origin city is going to be Madrid. However, destinations range from 6 different cities:
- Barcelona
- Sevilla
- Bilbao
- Valencia
- Málaga
- Valencia

The idea is to store query information about flights, accommodations and activites, being able to analyze:
- What dates are the best to go to.
- What days and times yield the least expensive prices for the same flights and accommodations.

Therefore, the goal will be to:
- Application:
    1. Be able to select flights based on user requirements:
        - Stops
        - Duration
        - Price
        - Time of departure
        - Origin airport
        - Destination airport
    2. Be able to select accommodations based on user requirements:
        - Stars
        - Score
        - Number of comments
        - Number of people attended
        - Distance Score
        - People per room
        - Type of room
        - Distance to city centre

    3. Be able to offer activities based on user:
        - Dates of travel
        - Categories of activites


- Analysis:
    - Destinations:
        1. Best absolute price to scores (accommodations, activities) per destination
        2. Best price to score based on user restrictions (of accommodation type, categories, etc)
        3. Best absolute destination to go to based on user preferences

    - Flights:
        1. Best absolute (least expensive) dates to travel somewhere
        2. Best (least expensive) dates in advance to get flight tickets
        3. Best (least expensive) days and times to consult flight ticket prices
    
    - Accommodations
        1. Best absolute price to score per destination
        2. Best price to score based on user preferences of accommodation
        3. Destination offering more accommodations of certain preferences (distance, house, etc)
        4. Best dates for better price
        5. Best dates and advance to get availability
    
    - Activities:
        1. Best dates for available activities
        2. Best dates for available activities of preference
        3. Best dates to get offers
        4. Best destinations for activities of preference

    - Demand:
        1. Are there acitvities that make certain dates more expensive?
        2. Is it high season or low season there?

    - Weather:
        1. If the travel is to be had in a lower advance than 14 days, what will the weather be like?



For Flights requests, I will not be able to make more than 35 000 requests a month. To make my flights analysis I will need to:
1. Analyse best times, days of the week and days in advance to query a flight price:
    1. Check flights at different times, for the same destination, for several sample destinations. This is 4 destinations, 24 times a day.
    2. Check flights at different days of the week, for the same destinations, for several sample destinations. This is 4 destinations, for 4 weeks at least.
    3. Check flights at different days in advance with respect to the flight, for several sample destinations and days of the week. This is 4 destinations, for advances from the same day to 4 weeks.
    
    For that ideal comparison, that makes up for 4 destinations x 24 queries a day x 28 days window x during 28 days, equating to 75k requests. 
    However, the main problem is that until monday there is not enough time to gather the information, so it will have to be storing this information and analysing hourly information and potentially the differences in querying on a weekend vs a monday, to compare ir with the average most expensive days during the year (special events at each city might apply)
    
2. Analyse best dates to get flights. 
    1. Check flights for different dates. I can do this once for 365 days of the year, for the 4 destinations chosen. 

    That makes up for 4 destinations x 365 days. 


The extractions to be made are:
- Flights
    - Skyscrapper
- Accommodations
    - Booking
    - Airbnb
- Activities
    - Civitatis
- Weather
- Stationality demand

# 2.1 Flights

First, let's make queries for our 4 chosen destinations. Selecting Spain gives already almost all the airports we need, lacking Bilbao, for what we add that option into our list.

In [None]:
list_of_countries_or_cities = ["spain","bilbao"]

In [None]:
# # lines commented to not execute them when running the notebook
# countries_airports = create_country_airport_code_df(list_of_countries_or_cities)

# countries_airports.to_csv("../data/airport_codes/countries_airports.csv")
countries_airports = pd.read_csv("../data/airport_codes/countries_airports.csv")

In [None]:
countries_airports

Information needed for extraction from each flight:
- Duration
- Price
- Stops
- Departure
- Arrival
- Company
- Self_transfer
- Fare_policy columns: 'isChangeAllowed', 'isPartiallyChangeable', 'isCancellationAllowed', 'isPartiallyRefundable'
- Score
- Luggage price (optional)
- Origin airport
- Destination airport

In [None]:
origin_city = "madrid"
n_adults = 2

# would have to translate them to english if user is spanish
# destination_cities = ['barcelona','bilbao','seville','valencia']
destination_cities = ['barcelona']


In [None]:
# careful here as how to select the main airport is now mere coincidence and in the future it will need a method to be selected
querystrings_list = des.build_flight_request_querystring_list(countries_airports,origin_city,destination_cities, '2024-11-01', days_window=2, n_adults= 1, n_children=0, n_infants=0, origin_airport_code="Yes", 
                                   destination_airport_code="Yes",sort_by="price_high",currency="EUR")
querystrings_list

In [None]:
print(f"The number of API requests is {len(querystrings_list)}")

In [None]:
itineraries_dict_list =  await des.request_flight_itineraries_async_multiple(querystrings_list)

In [None]:
itineraries_dict_list_flat = [itinerary_dict for dict_list in itineraries_dict_list if dict_list for itinerary_dict in dict_list]

In [None]:
print(f"The number of API itineraries got is {len(itineraries_dict_list_flat)}")

In [None]:
import datetime

In [None]:
def create_itineraries_dataframe(itineraries_dict_list):

    extracted_itinerary_info_list = list()

    for itinerary in itineraries_dict_list:
        extracted_itinerary_info_list.append(extract_flight_info(itinerary))
        
    return pd.DataFrame(extracted_itinerary_info_list)

def extract_flight_info(flight_dict):

    flight_result_dict = {}

    flight_result_dict_assigner = {
        'date_query': lambda _: datetime.datetime.now(),
        'score': lambda flight: float(flight['score']),
        'duration': lambda flight: int(flight['legs'][0]['durationInMinutes']),
        'price': lambda flight: int(flight['price']['formatted'].split()[0].replace(",","")),
        'price_currency': lambda flight: flight['price']['formatted'].split()[1],
        'stops': lambda flight: int(flight['legs'][0]['stopCount']),
        'departure': lambda flight: pd.to_datetime(flight['legs'][0]['departure']),
        'arrival': lambda flight: pd.to_datetime(flight['legs'][0]['arrival']),
        'company': lambda flight: flight['legs'][0]['carriers']['marketing'][0]['name'],
        'self_transfer': lambda flight: flight['isSelfTransfer'],
        'fare_isChangeAllowed': lambda flight: flight['farePolicy']['isChangeAllowed'],
        'fare_isPartiallyChangeable': lambda flight: flight['farePolicy']['isPartiallyChangeable'],
        'fare_isCancellationAllowed': lambda flight: flight['farePolicy']['isCancellationAllowed'],
        'fare_isPartiallyRefundable': lambda flight: flight['farePolicy']['isPartiallyRefundable'],
        'score': lambda flight: float(flight['score']),
        'origin_airport': lambda flight: flight['legs'][0]['origin']['name'],
        'destination_airport': lambda flight: flight['legs'][0]['destination']['name']
    }


    for key, function in flight_result_dict_assigner.items():
        try:
            flight_result_dict[key] = function(flight_dict)
        except KeyError:
            flight_result_dict[key] = np.nan  


    return flight_result_dict

In [None]:
itineraries_df = create_itineraries_dataframe(itineraries_dict_list_flat)

In [None]:
itineraries_df.sort_values(by="price",ascending=True)

In [None]:
itineraries_df.sort_values(by="score",ascending=False)

The flights have been acquired for a test window of 2 days and 4 cities. Let's first transform and then load into the database, to check if there is something missing, before launching full range queries.

## 2.2 Accommodations

### 2.2.1 Booking

1. Get all accommodation links
2. Get soups from all accommodation links
3. 

#### 2.2.1.1 Testing functions

In [26]:
# data processing
import pandas as pd
import numpy as np

## Scraping
# Webdriver automation
from selenium import webdriver 
from webdriver_manager.chrome import ChromeDriverManager  
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# html parsing
from bs4 import BeautifulSoup
# make synchronous request
import requests

# math operations
import math

# work with dates and time
import time
import datetime

# # work with asynchronicity
import asyncio
import aiohttp

# work with concurrency
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import json

# environment variables
import dotenv
import os
dotenv.load_dotenv()
AIR_SCRAPPER_API_KEY = os.getenv("AIR_SCRAPPER_KEY")
GOOGLE_API = os.getenv("GOOGLE API")

# import support functions
import sys 
sys.path.append("..")

# function typing
from typing import List, Optional

# regular expressions
import re

In [31]:
### Accommodations - Booking - NEW VERSIONS

def scrape_accommodations_from_page(page_soup, booking_url, verbose=False):
    print(booking_url)
    accommodation_scraper_dict = {
        "query_date": lambda _: datetime.datetime.now(),
        "checkin": re.findall(r"checkin=(\d{4}-\d{2}-\d{2})", booking_url),
        "checkout": re.findall(r"checkout=(\d{4}-\d{2}-\d{2})", booking_url),
        "name": lambda card: card.find("div",{"data-testid":"title"}).text,
        "url": lambda card: card.find("a",{"data-testid":"title-link"})["href"],
        "price_currency": lambda card: card.find("span",{"data-testid":"price-and-discounted-price"}).text.split()[0],
        "total_price_amount": lambda card: card.find("span",{"data-testid":"price-and-discounted-price"}).text.split()[1].replace(".","").replace(",","."),
        "distance_city_center_km": lambda card: card.find("span",{"data-testid":"distance"}).text.split()[1].replace(".","").replace(",","."),
        "score": lambda card: card.find("div",{"data-testid": "review-score"}).find_all("div",recursive=False)[0].find("div").next_sibling.text.strip().replace(",","."),
        "n_comments": lambda card: card.find("div",{"data-testid": "review-score"}).find_all("div",recursive=False)[1].find("div").next_sibling.text.strip().split()[0].replace(".",""),
        "close_to_metro": lambda card: "Yes" if card.find("span",{"class":"f419a93f12"}) else "No",
        "sustainability_cert": lambda card: "Yes" if card.find("span",{"class":"abf093bdfe e6208ee469 f68ecd98ea"}) else "No",
        "room_type": lambda card: card.find("h4",{"class":"abf093bdfe e8f7c070a7"}).text,
        "double_bed": lambda card: "Yes" if any(["doble" in element.text for element in card.find_all("div",{"class":"abf093bdfe"})]) else "No",
        "single_bed": lambda card: "Yes" if any(["individual" in element.text for element in card.find_all("div",{"class":"abf093bdfe"})]) else "No",
        "free_cancellation": lambda card: "Yes" if any([element.text == "Cancelación gratis" for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})]) else "No",
        "breakfast_included": lambda card: "Yes" if any([element.text == "Cancelación gratis" for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})]) else "No",
        "pay_at_hotel": lambda card: "Yes" if any(['Sin pago por adelantado' in element.text for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})]) else "No",
        "location_score": lambda card: card.find("span",{"class":"a3332d346a"}).text.split()[1].replace(",","."),
        "free_taxi": lambda card: "Yes" if any(["taxi gratis" in element.text.lower() for element in card.find_all("div",{"span":"b30f8eb2d6"})]) else "No"
    }

    accommodation_data_dict = {key: [] for key in accommodation_scraper_dict}

    for accommodation_card in page_soup.findAll("div", {"aria-label":"Alojamiento"}):
            for key, accommodation_scraper_function in accommodation_scraper_dict.items():
                try:
                    accommodation_data_dict[key].append(accommodation_scraper_function(accommodation_card))
                except Exception as e:
                    if verbose == True:
                        print(f"Error filling {key} due to {e}")
                    accommodation_data_dict[key].append(np.nan)

    return accommodation_data_dict


# dynamic html loading functions
def scroll_to_bottom(driver):
    last_height = driver.execute_script("return window.pageYOffset")

    while True:

        driver.execute_script('window.scrollBy(0, 2000)')
        time.sleep(0.8)
        
        new_height =  driver.execute_script("return window.pageYOffset")
        if new_height == last_height:
            break
        last_height = new_height

def scroll_back_up(driver):
    driver.execute_script('window.scrollBy(0, -600)')
    time.sleep(0.2)

def click_load_more(driver):
    try:
        button = WebDriverWait(driver, 3).until(EC.element_to_be_clickable(("xpath",'//*[@id="bodyconstraint-inner"]/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[3]/div[*]/button')))
        button.click()

        return True
    except:
        return print("'Load more' not found")

def scroll_and_click_cycle(driver):
    while True:
        print("Scrolling again")
        scroll_to_bottom(driver)
        scroll_back_up(driver)
        if not click_load_more(driver):
            break





def build_booking_urls(destinations_list: List[str], start_date: str, stay_duration: int = 2, step_length: int = 7, n_steps: int = 52, adults: int = 2, children: int = 0,
                           rooms: int = 1, max_price: int = 350, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = None):
    

    start_date_datetime = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    booking_url_list = list()
    for destination in destinations_list:
        for step in range(n_steps):
            checkin = (start_date_datetime + datetime.timedelta(days=step*step_length)).strftime("%Y-%m-%d")
            checkout = (start_date_datetime + datetime.timedelta(days=step*step_length + stay_duration)).strftime("%Y-%m-%d")

            booking_search_link = build_booking_url_full(
                destination=destination,
                checkin=checkin,
                checkout=checkout,
                adults=adults, 
                children=children, 
                rooms=rooms, 
                max_price=max_price, 
                star_ratings=star_ratings, 
                meal_plan=meal_plan,  
                review_score=review_score,  
                max_distance_meters=max_distance_meters 
            )

            booking_url_list.append(booking_search_link)

    return booking_url_list

def build_booking_url_full(destination: str, checkin: str, checkout: str, adults: int = 1, children: int = 0,
                           rooms: int = 1, min_price: int = 1, max_price: int = 1, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = None):
    """
    Build a Booking.com search URL by including all parameter filters, 
    ensuring proper formatting for all parameters.

    Parameters:
    - destination (str): Destination city.
    - checkin (str): Check-in date in YYYY-MM-DD format.
    - checkout (str): Check-out date in YYYY-MM-DD format.
    - adults (int): Number of adults.
    - children (int): Number of children.
    - rooms (int): Number of rooms.
    - min_price (int): Minimum price in Euros.
    - max_price (int): Maximum price in Euros.
    - star_ratings (list): List of star ratings (e.g., [3, 4, 5]).
    - meal_plan (int): Meal plan (0 for no meal, 1 for breakfast, etc.).
    - review_score (list): List of review scores (e.g., [80, 90] for 8.0+ and 9.0+).
    - max_distance_meters (int): Maximum distance from city center in meters (e.g., 500).

    Returns:
    - str: A Booking.com search URL based on the specified filters.
    """
    
    base_url = "https://www.booking.com/searchresults.es.html?"
    
    # Start with basic search parameters (ensure no tuple formatting)
    url = f"{base_url}ss={destination}&checkin={checkin}&checkout={checkout}&group_adults={adults}&group_children={children}"
    
    if rooms is not None:
       url += f"&no_rooms={rooms}"
    
    if min_price is not None and max_price is not None:
        price_filter = f"price%3DEUR-{min_price}-{max_price}-1"
    elif min_price is not None:
        price_filter = f"price%3DEUR-{min_price}-1-1"
    elif max_price is not None:
        price_filter = f"price%3DEUR-{max_price}-1"
    else:
        price_filter = None

    # Construct 'nflt' parameter to add other filters
    nflt_filters = []
    
    if price_filter:
        nflt_filters.append(price_filter)
    
    if star_ratings:
        star_filter = '%3B'.join([f"class%3D{star}" for star in star_ratings])
        nflt_filters.append(star_filter)
    
    meal_plan_options = {
            "breakfast": 1,
            "breakfast_dinner": 9,
            "kitchen": 999,
            "nothing": None
        }
    meal_plan_formatted = meal_plan_options.get(meal_plan, None)

    if meal_plan_formatted is not None:
        meal_plan_str = f"mealplan%3D{meal_plan_formatted}"
        nflt_filters.append(meal_plan_str)
    
    if review_score:
        review_filter = '%3B'.join([f"review_score%3D{score}" for score in review_score])
        nflt_filters.append(review_filter)
    
    if max_distance_meters is not None:
        distance_str = f"distance%3D{max_distance_meters}"
        nflt_filters.append(distance_str)
    
    # Add all 'nflt' filters to URL
    if nflt_filters:
        url += f"&nflt={'%3B'.join(nflt_filters)}"
    
    return url

def accommodations_booking_selenium_fetch_all_html_contents_concurrent(booking_url_list):
    # Determine optimal max_workers, usually best around the number of CPUs for Selenium
    max_workers = min(len(booking_url_list), os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(fetch_booking_html, booking_url) for booking_url in booking_url_list]

        # Collect results as they complete
        html_contents_total = []
        for future in futures:
            html_contents_total.append(future.result())

    return html_contents_total, booking_url_list


def fetch_booking_html(booking_url):

    # open driver
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(booking_url)

    # scroll and load more until bottom
    # css_selector = "#bodyconstraint-inner > div:nth-child(8) > div > div.af5895d4b2 > div.df7e6ba27d > div.bcbf33c5c3 > div.dcf496a7b9.bb2746aad9 > div.d4924c9e74 > div.c82435a4b8.f581fde0b8 > button"
    scroll_and_click_cycle(driver)

    # fetch booking url html
    html_page = driver.page_source

    return html_page

def fetch_booking_html_optimized(booking_url):

    # ADD OPTIMIZATION OPTIONS HERE

    # open driver
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(booking_url)

    # scroll and load more until bottom
    css_selector = "#bodyconstraint-inner > div:nth-child(8) > div > div.af5895d4b2 > div.df7e6ba27d > div.bcbf33c5c3 > div.dcf496a7b9.bb2746aad9 > div.d4924c9e74 > div.c82435a4b8.f581fde0b8 > button"
    scroll_and_click_cycle(driver, css_selector)

    # fetch booking url html
    html_page = driver.page_source

    return html_page

def accommodations_booking_soup_from_all_html_contents_parallel(html_contents_total, booking_urls_list, verbose=False):
    print(html_contents_total)
    print(f"There are {len(html_contents_total)} html contents")
    start_time = time.time()
    with ThreadPoolExecutor() as executor:

        page_dfs = list(executor.map(accommodations_booking_parse_single_page_wrapper, html_contents_total, booking_urls_list, [verbose] * len(html_contents_total)))

    total_activities_df = pd.concat(page_dfs).reset_index(drop=True)
    end_time = time.time()
    print(f"The whole parallel Beautiful Soup process took {end_time-start_time}")
    return total_activities_df

def accommodations_booking_parse_single_page_wrapper(page_html, booking_url, verbose=False):
    return accommodations_booking_parse_single_page(page_html, booking_url,verbose=verbose)


def accommodations_booking_parse_single_page(page_html,booking_url, verbose=False):
    page_soup = BeautifulSoup(page_html, "html.parser")
    return pd.DataFrame(scrape_accommodations_from_page(page_soup,booking_url, verbose=verbose))


def accommodations_booking_extract_all_acommodations_selenium_concurrent(destinations_list: List[str], start_date: str, stay_duration: int = 2, step_length: int = 7, n_steps: int = 52, adults: int = 2, children: int = 0,
                           rooms: int = 1, max_price: int = 350, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = 5000, verbose=False):
    
    start_time = time.time()

    booking_urls_list = build_booking_urls(destinations_list = destinations_list, start_date= start_date, stay_duration =stay_duration , step_length = step_length, n_steps = n_steps, adults = adults, children = children,
                           rooms = rooms, max_price = max_price, star_ratings = star_ratings, meal_plan = meal_plan, review_score = review_score, max_distance_meters = max_distance_meters)
    
    print(f"It took {time.time() - start_time} seconds to build the urls")
    booking_html_contents_total, booking_urls_list = accommodations_booking_selenium_fetch_all_html_contents_concurrent(booking_urls_list)
    print(f"It took {time.time() - start_time} seconds for selenium to get the html contents")

    print("Now parsing with beautiful soup")
    total_accommodations_df = accommodations_booking_soup_from_all_html_contents_parallel(booking_html_contents_total, booking_urls_list,verbose=verbose)
    return total_accommodations_df
        

In [32]:
destination_cities = ['barcelona']

In [33]:
total_accommodations_df = accommodations_booking_extract_all_acommodations_selenium_concurrent(destinations_list=destination_cities, start_date="2024-11-02", n_steps=1, max_price=100)

It took 0.0 seconds to build the urls
Scrolling again
'Load more' not found
It took 18.396345853805542 seconds for selenium to get the html contents
Now parsing with beautiful soup
There are 1 html contents
https://www.booking.com/searchresults.es.html?ss=barcelona&checkin=2024-11-02&checkout=2024-11-04&group_adults=2&group_children=0&no_rooms=1&nflt=price%3DEUR-1-100-1%3Bdistance%3D5000
The whole parallel Beautiful Soup process took 0.8013639450073242


In [30]:
total_accommodations_df

Unnamed: 0,query_date,checkin,checkout,name,url,price_currency,total_price_amount,distance_city_center_km,score,n_comments,close_to_metro,sustainability_cert,room_type,double_bed,single_bed,free_cancellation,breakfast_included,pay_at_hotel,location_score,free_taxi
0,2024-11-02 09:40:41.179163,,,Hostel New York,https://www.booking.com/hotel/es/hostel-new-yo...,€,162,1.1,7.0,3721.0,Yes,No,Habitación Económica con literas,No,No,No,No,No,,No
1,2024-11-02 09:40:41.181163,,,Rooms Diago,https://www.booking.com/hotel/es/rooms-diago.e...,€,199,1.3,,,Yes,No,Habitación Triple con baño compartido,Yes,Yes,No,No,No,,No
2,2024-11-02 09:40:41.186168,,,Generator Barcelona,https://www.booking.com/hotel/es/generator-hos...,€,189,1.5,7.7,7221.0,Yes,No,Cama en habitación compartida de 6 camas,No,No,No,No,No,,No
3,2024-11-02 09:40:41.188166,,,Bed and Bike Barcelona,https://www.booking.com/hotel/es/bed-and-bike-...,€,184,1.2,8.6,1790.0,Yes,No,Cama individual en dormitorio compartido mixto...,No,No,No,No,No,,No
4,2024-11-02 09:40:41.191165,,,Unite Hostel Barcelona,https://www.booking.com/hotel/es/unite-hostel-...,€,175,2.2,8.1,8176.0,Yes,No,Cama en habitación compartida de 8 camas,No,No,No,No,No,,No
5,2024-11-02 09:40:41.193166,,,Pars Teatro Hostel,https://www.booking.com/hotel/es/albareda-yout...,€,163,1.5,8.2,1166.0,Yes,No,Cama individual en dormitorio compartido de 4 ...,No,No,No,No,No,,No
6,2024-11-02 09:40:41.196167,,,Kabul Party Hostel Barcelona,https://www.booking.com/hotel/es/kabul-backpac...,€,176,0.9,8.4,4919.0,Yes,Yes,Cama individual en dormitorio compartido mixto...,No,No,No,No,No,9.6,No
7,2024-11-02 09:40:41.198162,,,The Loft Hostel Barcelona La Pedrera,https://www.booking.com/hotel/es/la-flor-de-ga...,€,195,1.2,8.2,2648.0,Yes,Yes,Cama en habitación compartida mixta de 6 camas,No,No,No,No,No,9.3,No
8,2024-11-02 09:40:41.201169,,,Arc House Barcelona,https://www.booking.com/hotel/es/arc-house.es....,€,145,1.3,7.0,1293.0,Yes,No,Litera en dormitorio compartido mixto con 10 c...,No,No,No,No,No,,No
9,2024-11-02 09:40:41.204167,,,Alberg Sants Bcn,https://www.booking.com/hotel/es/alberg-sants-...,€,155,2.4,6.0,89.0,Yes,No,Litera en habitación compartida mixta,No,No,No,No,No,,No


### 2.2.2 Airbnb

In [34]:
destination_city = "barcelona"
checkin_date = "2024-11-01"
checkout_date = "2024-11-03"
n_adults = 2
url = f"https://www.airbnb.com/s/{destination_city}/homes?checkin={checkin_date}&checkout={checkout_date}&adults={n_adults}"

driver = webdriver.Chrome()

driver.get(url)
driver.maximize_window()
WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.XPATH, "//button[text()='Aceptar todas']"))
).click()

html_page = driver.page_source
# pagination = driver.find_element(By.XPATH, "//nav[aria-label()='Paginación de resultados de búsqueda']")
# pagination_elements = pagination.find_elements(By.TAG_NAME,"a")
# pagination_elements


In [35]:
soup = BeautifulSoup(html_page,"html.parser")
soup

<html class="dir native vz2oe5x v1koiow6 vrbhsjc vgue9iu vyb6402" data-is-async-local-storage="true" data-is-hyperloop="true" dir="ltr" lang="es" style="--vh: 6.78px; --vw: 15.36px; --vw-unitless: 1536;"><head><style>.gm-style-moc{background-color:rgba(0,0,0,.45);pointer-events:none;text-align:center;-webkit-transition:opacity ease-in-out;transition:opacity ease-in-out}.gm-style-mot{color:white;font-family:Roboto,Arial,sans-serif;font-size:22px;margin:0;position:relative;top:50%;transform:translateY(-50%);-webkit-transform:translateY(-50%);-ms-transform:translateY(-50%)}sentinel{}
</style><style>.gm-style img{max-width: none;}.gm-style {font: 400 11px Roboto, Arial, sans-serif; text-decoration: none;}</style><meta charset="utf-8"/><meta content="es" name="locale"/><meta content="notranslate" name="google"/><meta content="authenticity_token" id="csrf-param-meta-tag" name="csrf-param"/><meta content="" id="csrf-token-meta-tag" name="csrf-token"/><meta content="" id="english-canonical-url

In [44]:
len(soup.find("div",{"style":"display: contents;"}).find_all("div",{"itemprop":"itemListElement"}))

18

In [45]:
soup.find("div",{"style":"display: contents;"}).find_all("div",{"itemprop":"itemListElement"})[0]

<div itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem"><meta content="Apartamento en el centro" itemprop="name"/><meta content="1" itemprop="position"/><meta content="www.airbnb.es/rooms/29413203?adults=2&amp;search_mode=regular_search&amp;check_in=2024-11-01&amp;check_out=2024-11-03&amp;source_impression_id=p3_1730486354_P3fjrHbYRr_O28Wa&amp;previous_page_section_name=1000" itemprop="url"/><div><div><div aria-labelledby="title_29413203" class="cy5jw6o atm_5j_223wjw atm_70_87waog atm_j3_1u6x1zy atm_jb_4shrsx atm_mk_h2mmj6 atm_vy_7abht0 dir dir-ltr" data-testid="card-container" role="group"><a aria-labelledby="title_29413203" class="l1ovpqvx atm_1he2i46_1k8pnbi_10saat9 atm_yxpdqi_1pv6nv4_10saat9 atm_1a0hdzc_w1h1e8_10saat9 atm_2bu6ew_929bqk_10saat9 atm_12oyo1u_73u7pn_10saat9 atm_fiaz40_1etamxe_10saat9 bn2bl2p atm_5j_223wjw atm_9s_1ulexfb atm_e2_1osqo2v atm_fq_idpfg4 atm_mk_stnw88 atm_tk_idpfg4 atm_vy_1osqo2v atm_26_1j28jx2 atm_3f_glywfm atm_kd_glywfm atm_3f_gl

Sera posible entrar en todos los alojamientos de airbnb, sacar su precio, informacion y demás y sus disponibilidades directamente?

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from concurrent.futures import ThreadPoolExecutor

In [69]:
def click_button(xpath):
    try:
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, xpath))
        ).click()
    except TimeoutException:
        print(f"Button with XPath '{xpath}' not found.")

# XPaths for the buttons
xpath_cerrar = "//button[@aria-label='Cerrar']"
xpath_aceptar_todas = "//button[text()='Aceptar todas']"


In [78]:
url = "https://www.airbnb.es/rooms/29413203?adults=2&search_mode=regular_search&check_in=2024-11-01&check_out=2024-11-03&source_impression_id=p3_1730487404_P321LpZsGLiX0DKt&previous_page_section_name=1000&federated_search_id=1293e1b7-d1cd-4837-a262-b1013a1b5764"
# response = requests.get(url)
driver = webdriver.Chrome()

driver.get(url)
with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(click_button, xpath_cerrar)
    executor.submit(click_button, xpath_aceptar_todas)

driver.maximize_window()


html_page = driver.page_source
soup_bs = BeautifulSoup(html_page, "html.parser")

In [79]:
soup_bs

<html class="scrollbar-gutter dir native vz2oe5x v1koiow6 vrbhsjc vgue9iu vyb6402" data-is-async-local-storage="true" data-is-hyperloop="true" dir="ltr" lang="es" style="--vh: 5.33px; --vw: 12.8px; --vw-unitless: 1280;"><head><meta charset="utf-8"/><meta content="es" name="locale"/><meta content="notranslate" name="google"/><meta content="authenticity_token" id="csrf-param-meta-tag" name="csrf-param"/><meta content="" id="csrf-token-meta-tag" name="csrf-token"/><meta content="" id="english-canonical-url"/><meta content="on" name="twitter:widgets:csp"/><meta content="yes" name="mobile-web-app-capable"/><meta content="yes" name="apple-mobile-web-app-capable"/><meta content="Airbnb" name="application-name"/><meta content="Airbnb" name="apple-mobile-web-app-title"/><meta content="#ffffff" name="theme-color"/><meta content="#ffffff" name="msapplication-navbutton-color"/><meta content="black-translucent" name="apple-mobile-web-app-status-bar-style"/><meta content="/?utm_source=homescreen" na

In [80]:
calendar = soup_bs.find("div",{"data-plugin-in-point-id":"AVAILABILITY_CALENDAR_INLINE"})

In [81]:
calendar

<div data-plugin-in-point-id="AVAILABILITY_CALENDAR_INLINE" data-section-id="AVAILABILITY_CALENDAR_INLINE" style="padding-top: 48px; padding-bottom: 48px;"><div data-testid="inline-availability-calendar"><div><div class="sewcpu6 atm_le_74f3fj atm_le_8opf4g__oggzyc atm_le_dm248g__qky54b dir dir-ltr" style="--spacingBottom: 0;"><div class="t5p7tdn atm_7l_dezgoh atm_bx_48h72j atm_cs_10d11i2 atm_c8_sz6sci atm_g3_17zsb9a atm_fr_kzfbxz dir dir-ltr"><section><h2 class="hpipapi atm_7l_1kw7nm4 atm_c8_1x4eueo atm_cs_1kw7nm4 atm_g3_1kw7nm4 atm_gi_idpfg4 atm_l8_idpfg4 atm_kd_idpfg4_pfnrn2 dir dir-ltr" elementtiming="LCP-target" tabindex="-1">2 noches en Vilanova i la Geltrú</h2></section></div><div class="s1bh1tge atm_7l_1esdqks atm_bx_48h72j atm_cs_6adqpa atm_c8_km0zk7 atm_g3_18khvle atm_fr_1m9t47k atm_lo_1yuitx dir dir-ltr"><div class="_16j7g3i" data-testid="availability-calendar-date-range">1 de nov. de 2024 - 3 de nov. de 2024</div></div></div></div><div class="c1e8f4ze atm_9s_1txwivl atm_ks_1

In [82]:
available_dates = []
# Loop through all divs with 'data-testid' that match the 'calendar-day' pattern
for day in calendar.find_all('div', attrs={'data-testid': True}):
    # Check if the div is a date and is available
    if day['data-testid'].startswith('calendar-day') and day.get('data-is-day-blocked') != 'true':
        date = day['data-testid'].replace('calendar-day-', '')
        available_dates.append(date)

# Output available dates
print("Available dates:", available_dates)

Available dates: ['01/11/2024', '02/11/2024', '03/11/2024', '04/11/2024', '05/11/2024', '06/11/2024', '08/11/2024', '09/11/2024', '10/11/2024', '11/11/2024', '12/11/2024', '13/11/2024', '14/11/2024', '17/11/2024', '18/11/2024', '19/11/2024', '20/11/2024', '21/11/2024', '22/11/2024', '23/11/2024', '24/11/2024', '25/11/2024', '26/11/2024', '27/11/2024', '28/11/2024', '29/11/2024', '30/11/2024', '01/12/2024', '02/12/2024', '03/12/2024', '04/12/2024', '05/12/2024', '06/12/2024', '07/12/2024', '08/12/2024', '09/12/2024', '10/12/2024', '11/12/2024', '12/12/2024', '14/12/2024', '15/12/2024', '16/12/2024', '17/12/2024', '18/12/2024', '19/12/2024', '20/12/2024', '21/12/2024', '22/12/2024', '23/12/2024']


## 2.3 Activities

The functions used for this processed have been inspired by a previous project. In order to optimize the extraction of such a vast information, Multithreading and Parallel processes have implemented in variations of the original function in order to assess the impact of this optimization, as an additional exploration of this project.

In [None]:
cities = ['barcelona','bilbao','sevilla','valencia']
## calculate today
today = str(date.today()) 
end_date = str(date.today()  + timedelta(days=360))
today

Availability is only possible for a 15 days window. That forces to calculate roundup(days(end_date - start_date)/15) = iterations, which is something that can be handled by the main function. 

Updating the `src/data_extraction_support.py` function.

In [None]:
# result_df_parallel = des.activities_civitatis_extract_all_activites_parallel(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_parallel

In [None]:
# result_df_multithread = des.activities_civitatis_extract_all_activites_multithread(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_multithread

In [None]:
# result_df_multithread_selenium = des.activities_civitatis_extract_all_activites_multithread_selenium(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_multithread_selenium

In [None]:
result_df_parallel_selenium = des.activities_civitatis_extract_all_activites_parallel_selenium(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

result_df_parallel_selenium

In [None]:
# result_df_parallel_selenium_optimized = des.activities_civitatis_extract_all_activites_parallel_selenium_optimized(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_parallel_selenium_optimized

### Time and results comparison

In [None]:
# print(f"The shape of the multithread soup is {result_df_multithread.shape[0]}.")
# print("Multithread soup has these null values:")
# display(result_df_multithread.isna().sum())

# print(f"\nThe shape of the multithread soup + selenium concurrent is {result_df_multithread_selenium.shape[0]}.")
# print("Multithread soup + selenium concurrent has these null values:")
# display(result_df_multithread_selenium.isna().sum())

# print(f"\nThe shape of the parallel soup is {result_df_parallel.shape[0]}.")
# print("Parallel soup has these null values:")
# display(result_df_parallel.isna().sum())

print(f"\nThe shape of the multithread soup + selenium concurrent is {result_df_parallel_selenium.shape[0]}.")
print("Multithread soup + selenium concurrent has these null values:")
display(result_df_parallel_selenium.isna().sum())

# print(f"\nThe shape of the multithread soup + selenium concurrent optimized is {result_df_parallel_selenium_optimized.shape[0]}.")
# print("Multithread soup + selenium concurrent optimized has these null values:")
# result_df_parallel_selenium_optimized.isna().sum()

Optimising selenium with a multithread works best as not only it accelerates the whole process, but also returns less NaN values somehow.

Now, let's get the addresses from the latitude and longitude with geopy asyncronized: