#### 0. Imports

In [1]:
# data processing
import pandas as pd
import numpy as np

# browser automation
import selenium

# working with time
import time
from datetime import date
import datetime

# working with asynchronous functions
import asyncio

# import system to append parent folder to path - enables src importing
import os
import sys
sys.path.append("..")

# data extraction support functions
import src.data_extraction_support as des



# 1. Introduction to this notebook

# 2. Data extraction

The idea is to periodically save travel information for several cities to the database. That way, people can look up the best dates among their options, according to their budget and other travel preferences.

For the time being, the only origin city is Madrid. The candidate destinations are the following cities:
- Barcelona
- Sevilla
- Bilbao
- Valencia
- Málaga

The idea is to store query information about flights, accommodations and activities, in order to analyze:
- What dates are the best to go to.
- What days and times yield the least expensive prices for the same flights and accommodations.

Therefore, the goal will be to:
- Application:
    1. Be able to select flights based on user requirements:
        - Stops
        - Duration
        - Price
        - Time of departure
        - Origin airport
        - Destination airport
    2. Be able to select accommodations based on user requirements:
        - Stars
        - Score
        - Number of comments
        - Number of people attended
        - Distance Score
        - People per room
        - Type of room
        - Distance to city centre

    3. Be able to offer activities based on user:
        - Dates of travel
        - Categories of activities


- Analysis:
    - Destinations:
        1. Best absolute price to scores (accommodations, activities) per destination
        2. Best price to score based on user restrictions (of accommodation type, categories, etc)
        3. Best absolute destination to go to based on user preferences

    - Flights:
        1. Best absolute (least expensive) dates to travel somewhere
        2. Best (least expensive) dates in advance to get flight tickets
        3. Best (least expensive) days and times to consult flight ticket prices
    
    - Accommodations
        1. Best absolute price to score per destination
        2. Best price to score based on user preferences of accommodation
        3. Destination offering more accommodations of certain preferences (distance, house, etc)
        4. Best dates for better price
        5. Best dates and advance to get availability
    
    - Activities:
        1. Best dates for available activities
        2. Best dates for available activities of preference
        3. Best dates to get offers
        4. Best destinations for activities of preference

    - Demand:
        1. Are there activities that make certain dates more expensive?
        2. Is it high season or low season there?

    - Weather:
        1. If the trip is less than 14 days away, what will the weather be like?



For flight requests, I cannot make more than 35,000 requests a month. The flight analysis requires:
1. Analysing the best times, days of the week and days in advance to query a flight price:
    1. Check flights at different times of day, for the same destinations, for several sample destinations. This is 4 destinations, 24 times a day.
    2. Check flights on different days of the week, for the same destinations, for several sample destinations. This is 4 destinations, for at least 4 weeks.
    3. Check flights at different numbers of days in advance of the flight, for several sample destinations and days of the week. This is 4 destinations, for advances ranging from the same day to 4 weeks.
    
    That ideal comparison amounts to 4 destinations x 24 queries a day x a 28-day window x 28 days of querying = 75,264 requests (about 75k), more than twice the monthly quota.
    However, the main problem is that there is not enough time until Monday to gather that information. The plan is therefore to keep storing data and to analyse the hourly variation and, potentially, the difference between querying on a weekend and on a Monday, and to compare it with the most expensive days on average during the year (special events in each city might apply).
    
2. Analysing the best dates to fly:
    1. Check flights for different dates. I can do this once for the 365 days of the year, for the 4 chosen destinations.

    That makes 4 destinations x 365 days = 1,460 requests.
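
The two request counts above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the figures are the ones stated in the text (35,000 requests a month, 4 destinations, hourly queries, a 28-day window, 28 days of querying).

```python
# Rough request budget for the two flight analyses described above.
MONTHLY_QUOTA = 35_000  # monthly request limit stated in the text

n_destinations = 4
queries_per_day = 24       # one query per hour
travel_dates_window = 28   # travel dates covered: same day up to 4 weeks ahead
collection_days = 28       # number of days over which the queries are repeated

# Analysis 1: hourly queries for every travel date in the window, for 28 days
ideal_requests = (
    n_destinations * queries_per_day * travel_dates_window * collection_days
)

# Analysis 2: one query per destination for each day of the year
yearly_requests = n_destinations * 365

print(f"Ideal comparison: {ideal_requests:,} requests "
      f"({ideal_requests / MONTHLY_QUOTA:.2f}x the monthly quota)")
print(f"Yearly sweep: {yearly_requests:,} requests")
```

The first figure exceeds the quota, which is why the ideal comparison has to be cut down; the yearly sweep fits comfortably.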


The extractions to be made are:
- Flights
    - Skyscanner
- Accommodations
    - Booking
    - Airbnb
- Activities
    - Civitatis
- Weather
- Seasonal demand

## 2.1 Flights

First, let's make queries for our 4 chosen destinations. Selecting Spain already returns almost all the airports we need; only Bilbao is missing, so we add it to the list as a separate entry.

In [2]:
list_of_countries_or_cities = ["spain","bilbao"]

In [3]:
countries_airports = des.create_country_airport_code_df(list_of_countries_or_cities)

countries_airports[["city","latitude","longitude"]] = await des.get_cities_coordinates(countries_airports["city"].to_list())
countries_airports

countries_airports.to_csv("../data/airport_codes/countries_airports.csv")

In [4]:
countries_airports

Unnamed: 0,country,city,city_entityId,airport_skyId,airport_entityId,airport_name,latitude,longitude
0,spain,Madrid,27544850,MAD,95565077,Madrid,40.4167047,-3.7035825
1,spain,Barcelona,27548283,BCN,95565085,Barcelona,41.3828939,2.1774322
2,spain,Port Of Spain,27546011,POS,104120358,Port Of Spain,,
3,spain,Málaga,27547484,AGP,95565095,Malaga,36.7213028,-4.4216366
4,spain,Seville,27547022,SVQ,95565089,Seville,37.3886303,-5.9953403
5,spain,Valencia,27547405,VLC,95565090,Valencia,39.4697065,-0.3763353
6,spain,Ceuta,29952264,JCU,129054552,Ceuta,35.89442195,-5.355817352394269
7,bilbao,Bilbao,27538794,BIO,95565104,Bilbao,43.2630018,-2.9350039


### Flights

Information needed for extraction from each flight:
- Duration
- Price
- Stops
- Departure
- Arrival
- Company
- Self_transfer
- Fare_policy columns: 'isChangeAllowed', 'isPartiallyChangeable', 'isCancellationAllowed', 'isPartiallyRefundable'
- Score
- Luggage price (optional)
- Origin airport
- Destination airport
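
As an illustration of how some of these fields could be pulled out of a single itinerary, here is a minimal sketch. It is not the implementation of `des.create_itineraries_dataframe`: `itinerary_to_row` is a hypothetical helper, and the sample dictionary is an assumed, trimmed version of one API response for a one-way flight with a single leg. Score, self-transfer and fare policy are left out because their location in the response is not assumed here.

```python
from datetime import datetime

def itinerary_to_row(itinerary: dict) -> dict:
    """Flatten one single-leg itinerary into a flat record (hypothetical helper)."""
    leg = itinerary["legs"][0]
    return {
        "itinerary_id": itinerary["id"],
        "price": itinerary["price"]["raw"],
        "duration": leg["durationInMinutes"],
        "stops": leg["stopCount"],
        "departure": datetime.fromisoformat(leg["departure"]),
        "arrival": datetime.fromisoformat(leg["arrival"]),
        # first marketing carrier only; multi-carrier legs would need more care
        "company": leg["carriers"]["marketing"][0]["name"],
        "origin_airport_code": leg["origin"]["displayCode"],
        "destination_airport_code": leg["destination"]["displayCode"],
    }

# Assumed, trimmed response for one itinerary
sample = {
    "id": "13870-2411081510--32680-0-9772-2411081635",
    "price": {"raw": 48.45, "formatted": "49 €"},
    "legs": [{
        "origin": {"displayCode": "MAD", "city": "Madrid"},
        "destination": {"displayCode": "BCN", "city": "Barcelona"},
        "durationInMinutes": 85,
        "stopCount": 0,
        "departure": "2024-11-08T15:10:00",
        "arrival": "2024-11-08T16:35:00",
        "carriers": {"marketing": [{"id": -32680, "name": "Air Europa"}]},
    }],
}

row = itinerary_to_row(sample)
print(row["company"], row["duration"], row["price"])
```

A list of such records can be passed directly to `pd.DataFrame` to get one row per itinerary.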

In [5]:
origin_city = "madrid"
n_adults = 2

# would have to translate them to english if user is spanish
destination_cities = ['barcelona','bilbao','seville','valencia']


In [6]:
# Careful: the main airport of each city is currently selected by mere coincidence; a proper selection method will be needed in the future
querystrings_list = des.build_flight_request_querystring_list_single(countries_airports,origin_city,destination_cities, '2024-11-08', n_steps=52, step_length=7, days_window=2, n_adults= 1, n_children=0, n_infants=0, origin_airport_code=True, 
                                   destination_airport_code=True,sort_by="price_high",currency="EUR")
querystrings_list

[{'originSkyId': 'madrid',
  'destinationSkyId': 'barcelona',
  'originEntityId': '95565077',
  'destinationEntityId': '95565085',
  'date': '2024-11-08',
  'adults': '1',
  'childrens': '0',
  'infants': '0',
  'sortBy': 'price_high',
  'currency': 'EUR'},
 {'originSkyId': 'barcelona',
  'destinationSkyId': 'madrid',
  'originEntityId': '95565085',
  'destinationEntityId': '95565077',
  'date': '2024-11-10',
  'adults': '1',
  'childrens': '0',
  'infants': '0',
  'sortBy': 'price_high',
  'currency': 'EUR'},
 {'originSkyId': 'madrid',
  'destinationSkyId': 'barcelona',
  'originEntityId': '95565077',
  'destinationEntityId': '95565085',
  'date': '2024-11-15',
  'adults': '1',
  'childrens': '0',
  'infants': '0',
  'sortBy': 'price_high',
  'currency': 'EUR'},
 {'originSkyId': 'barcelona',
  'destinationSkyId': 'madrid',
  'originEntityId': '95565085',
  'destinationEntityId': '95565077',
  'date': '2024-11-17',
  'adults': '1',
  'childrens': '0',
  'infants': '0',
  'sortBy': 'pri

In [8]:
print(f"The number of API requests is {len(querystrings_list)}")

The number of API requests is 416
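
The figure of 416 is consistent with 4 destinations x 52 weekly steps x 2 directions (outbound and return). The sketch below reproduces the date logic under that assumption. `build_date_pairs` is a hypothetical helper, not the function in `src/data_extraction_support.py`; its parameter names simply mirror the ones passed to the call above.

```python
from datetime import date, timedelta

def build_date_pairs(start_date: str, n_steps: int, step_length: int, days_window: int):
    """Return (outbound, return) ISO date pairs, one pair per step."""
    start = date.fromisoformat(start_date)
    pairs = []
    for step in range(n_steps):
        outbound = start + timedelta(days=step * step_length)
        inbound = outbound + timedelta(days=days_window)
        pairs.append((outbound.isoformat(), inbound.isoformat()))
    return pairs

pairs = build_date_pairs("2024-11-08", n_steps=52, step_length=7, days_window=2)
destinations = ["barcelona", "bilbao", "seville", "valencia"]

# one request per destination, per step, per direction
n_requests = len(destinations) * len(pairs) * 2

print(pairs[:2])
print(n_requests)
```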


In [9]:
itineraries_dict_list =  await des.request_flight_itineraries_async_multiple(querystrings_list)

In [10]:
itineraries_dict_list_flat = [itinerary_dict for dict_list in itineraries_dict_list if dict_list for itinerary_dict in dict_list]

In [11]:
print(f"The number of API itineraries got is {len(itineraries_dict_list_flat)}")

The number of API itineraries got is 16552


In [12]:
itineraries_dict_list_flat[0]

{'id': '13870-2411081510--32680-0-9772-2411081635',
 'price': {'raw': 48.45,
  'formatted': '49 €',
  'pricingOptionId': '9DcPRM1OtY74'},
 'legs': [{'id': '13870-2411081510--32680-0-9772-2411081635',
   'origin': {'id': 'MAD',
    'entityId': '95565077',
    'name': 'Madrid',
    'displayCode': 'MAD',
    'city': 'Madrid',
    'country': 'Spain',
    'isHighlighted': False},
   'destination': {'id': 'BCN',
    'entityId': '95565085',
    'name': 'Barcelona',
    'displayCode': 'BCN',
    'city': 'Barcelona',
    'country': 'Spain',
    'isHighlighted': False},
   'durationInMinutes': 85,
   'stopCount': 0,
   'isSmallestStops': False,
   'departure': '2024-11-08T15:10:00',
   'arrival': '2024-11-08T16:35:00',
   'timeDeltaInDays': 0,
   'carriers': {'marketing': [{'id': -32680,
      'logoUrl': 'https://logos.skyscnr.com/images/airlines/favicon/UX.png',
      'name': 'Air Europa'}],
    'operationType': 'fully_operated'},
   'segments': [{'id': '13870-9772-2411081510-2411081635--32680'

In [13]:
itineraries_df = des.create_itineraries_dataframe(itineraries_dict_list_flat)

In [14]:
itineraries_df.head(2)

Unnamed: 0,itinerary_id,query_date,score,duration,price,price_currency,stops,departure,arrival,company,...,fare_is_change_allowed,fare_is_partially_changeable,fare_is_cancellation_allowed,fare_is_partially_refundable,origin_airport,destination_airport,origin_airport_code,destination_airport_code,origin_airport_entityid,destination_airport_entityid
0,13870-2411081510--32680-0-9772-2411081635,2024-11-04 05:52:24.873861,0.999,85,49,€,0,2024-11-08 15:10:00,2024-11-08 16:35:00,Air Europa,...,False,False,False,False,Madrid,Barcelona,MAD,BCN,95565077,95565085
1,13870-2411080715--32222-0-9772-2411080830,2024-11-04 05:52:24.879866,0.864733,75,38,€,0,2024-11-08 07:15:00,2024-11-08 08:30:00,Iberia,...,False,False,False,False,Madrid,Barcelona,MAD,BCN,95565077,95565085


In [49]:
flights_depart = itineraries_df[itineraries_df["origin_airport"]=="Madrid"].sort_values(by="price",ascending=True)
flights_return = itineraries_df[itineraries_df["origin_airport"]=="Barcelona"].sort_values(by="price",ascending=True)

In [50]:
flights_depart.head(2)

Unnamed: 0,date_query,score,duration,price,price_currency,stops,departure,arrival,company,self_transfer,fare_isChangeAllowed,fare_isPartiallyChangeable,fare_isCancellationAllowed,fare_isPartiallyRefundable,origin_airport,destination_airport,origin_airport_code,destination_airport_code,origin_airport_entityid,destination_airport_entityid
5356,2024-11-04 00:15:44.172238,0.894993,75,21,€,0,2025-09-12 21:35:00,2025-09-12 22:50:00,Iberia,False,False,False,False,False,Madrid,Barcelona,MAD,BCN,95565077,95565085
5357,2024-11-04 00:15:44.173429,0.79704,75,21,€,0,2025-09-12 06:45:00,2025-09-12 08:00:00,Iberia,False,False,False,False,False,Madrid,Barcelona,MAD,BCN,95565077,95565085


In [51]:
flights_return.head(2)

Unnamed: 0,date_query,score,duration,price,price_currency,stops,departure,arrival,company,self_transfer,fare_isChangeAllowed,fare_isPartiallyChangeable,fare_isCancellationAllowed,fare_isPartiallyRefundable,origin_airport,destination_airport,origin_airport_code,destination_airport_code,origin_airport_entityid,destination_airport_entityid
1370,2024-11-04 00:15:40.749062,0.999,85,15,€,0,2025-01-12 11:50:00,2025-01-12 13:15:00,Vueling Airlines,False,False,False,False,False,Barcelona,Madrid,BCN,MAD,95565085,95565077
1499,2024-11-04 00:15:40.849430,0.999,85,15,€,0,2025-01-19 11:50:00,2025-01-19 13:15:00,Vueling Airlines,False,False,False,False,False,Barcelona,Madrid,BCN,MAD,95565085,95565077


The flights have been acquired for a test window of 2 days and 4 cities. Let's first transform the data and load it into the database, to check whether anything is missing, before launching full-range queries.

In [15]:
itineraries_df.to_parquet("../data/flights/itineraries.parquet")

All of this is encapsulated in the `get_flights()` function of `src/data_extraction_support.py`.

In [None]:
itineraries_df = await des.get_flights(countries_airports,origin_city,destination_cities, '2024-11-08', n_steps=52, step_length=7, days_window=2, n_adults= 1, n_children=0, n_infants=0, origin_airport_code=True, 
                                   destination_airport_code=True,sort_by="price_high",currency="EUR")

## 2.2 Accommodations

### 2.2.1 Booking

1. Get all accommodation links
2. Get soups from all accommodation links
3. Parse the soups into a DataFrame of accommodations

#### 2.2.1.1 Testing functions

In [None]:
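### Accommodations - Booking - NEW VERSIONS

# imports needed by the functions below
import re
from typing import List
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC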

def scrape_accommodations_from_page(page_soup, booking_url, verbose=False):
    # each entry maps a column name to a function that extracts its value from an accommodation card
    accommodation_scraper_dict = {
        "query_date": lambda _: datetime.datetime.now(),
        "city": lambda _: re.findall(r"ss=([a-z]+)&", booking_url)[0],
        "checkin": lambda _: re.findall(r"checkin=(\d{4}-\d{2}-\d{2})", booking_url)[0],
        "checkout": lambda _: re.findall(r"checkout=(\d{4}-\d{2}-\d{2})", booking_url)[0],
        "n_adults_search": lambda _: re.findall(r"group_adults=(\d+)", booking_url)[0],
        "n_children_search":lambda _: re.findall(r"group_children=(\d+)", booking_url)[0],
        "n_rooms_search": lambda _: re.findall(r"no_rooms=(\d+)", booking_url)[0],
        "name": lambda card: card.find("div",{"data-testid":"title"}).text,
        "url": lambda card: card.find("a",{"data-testid":"title-link"})["href"],
        "price_currency": lambda card: card.find("span",{"data-testid":"price-and-discounted-price"}).text.split()[0],
        "total_price_amount": lambda card: card.find("span",{"data-testid":"price-and-discounted-price"}).text.split()[1].replace(".","").replace(",","."),
        "distance_city_center_km": lambda card: card.find("span",{"data-testid":"distance"}).text.split()[1].replace(".","").replace(",","."),
        "score": lambda card: card.find("div",{"data-testid": "review-score"}).find_all("div",recursive=False)[0].find("div").next_sibling.text.strip().replace(",","."),
        "n_comments": lambda card: card.find("div",{"data-testid": "review-score"}).find_all("div",recursive=False)[1].find("div").next_sibling.text.strip().split()[0].replace(".",""),
        "close_to_metro": lambda card: card.find("span",{"class":"f419a93f12"}) is not None,
        "sustainability_cert": lambda card: card.find("span",{"class":"abf093bdfe e6208ee469 f68ecd98ea"}) is not None,
        "room_type": lambda card: card.find("h4",{"class":"abf093bdfe e8f7c070a7"}).text,
        "double_bed": lambda card: any("doble" in element.text for element in card.find_all("div",{"class":"abf093bdfe"})),
        "single_bed": lambda card: any("individual" in element.text for element in card.find_all("div",{"class":"abf093bdfe"})),
        "free_cancellation": lambda card: any(element.text == "Cancelación gratis" for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})),
        # assumption: the card labels included breakfast with the text "Desayuno incluido"
        "breakfast_included": lambda card: any("desayuno incluido" in element.text.lower() for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})),
        "pay_at_hotel": lambda card: any('Sin pago por adelantado' in element.text for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})),
        "location_score": lambda card: card.find("span",{"class":"a3332d346a"}).text.split()[1].replace(",","."),
        # assumption: "b30f8eb2d6" is the class of a span element
        "free_taxi": lambda card: any("taxi gratis" in element.text.lower() for element in card.find_all("span",{"class":"b30f8eb2d6"}))
    }

    accommodation_data_dict = {key: [] for key in accommodation_scraper_dict}

    for accommodation_card in page_soup.find_all("div", {"aria-label":"Alojamiento"}):
        for key, accommodation_scraper_function in accommodation_scraper_dict.items():
            try:
                accommodation_data_dict[key].append(accommodation_scraper_function(accommodation_card))
            except Exception as e:
                if verbose:
                    print(f"Error filling {key} due to {e}")
                # keep all columns the same length by filling failed fields with NaN
                accommodation_data_dict[key].append(np.nan)

    return accommodation_data_dict


# dynamic html loading functions
def scroll_to_bottom(driver):
    last_height = driver.execute_script("return window.pageYOffset")

    while True:

        driver.execute_script('window.scrollBy(0, 2000)')
        time.sleep(0.2)
        
        new_height =  driver.execute_script("return window.pageYOffset")
        if new_height == last_height:
            break
        last_height = new_height

def scroll_back_up(driver):
    driver.execute_script('window.scrollBy(0, -600)')
    time.sleep(0.2)

def click_load_more(driver):
    # returns True if the 'Load more' button was found and clicked, False otherwise
    try:
        button = WebDriverWait(driver, 3).until(EC.element_to_be_clickable(("xpath",'//*[@id="bodyconstraint-inner"]/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[3]/div[*]/button')))
        button.click()

        return True
    except Exception:
        print("'Load more' not found")
        return False

def scroll_and_click_cycle(driver):
    while True:
        print("Scrolling again")
        scroll_to_bottom(driver)
        scroll_back_up(driver)
        if not click_load_more(driver):
            break





def build_booking_urls(destinations_list: List[str], start_date: str, stay_duration: int = 2, step_length: int = 7, n_steps: int = 52, adults: int = 2, children: int = 0,
                           rooms: int = 1, max_price: int = 350, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = None):
    

    start_date_datetime = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    booking_url_list = list()
    for destination in destinations_list:
        for step in range(n_steps):
            checkin = (start_date_datetime + datetime.timedelta(days=step*step_length)).strftime("%Y-%m-%d")
            checkout = (start_date_datetime + datetime.timedelta(days=step*step_length + stay_duration)).strftime("%Y-%m-%d")

            booking_search_link = build_booking_url_full(
                destination=destination,
                checkin=checkin,
                checkout=checkout,
                adults=adults, 
                children=children, 
                rooms=rooms, 
                max_price=max_price, 
                star_ratings=star_ratings, 
                meal_plan=meal_plan,  
                review_score=review_score,  
                max_distance_meters=max_distance_meters 
            )

            booking_url_list.append(booking_search_link)

    return booking_url_list

def build_booking_url_full(destination: str, checkin: str, checkout: str, adults: int = 1, children: int = 0,
                           rooms: int = 1, min_price: int = 1, max_price: int = 1, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = None):
    """
    Build a Booking.com search URL by including all parameter filters, 
    ensuring proper formatting for all parameters.

    Parameters:
    - destination (str): Destination city.
    - checkin (str): Check-in date in YYYY-MM-DD format.
    - checkout (str): Check-out date in YYYY-MM-DD format.
    - adults (int): Number of adults.
    - children (int): Number of children.
    - rooms (int): Number of rooms.
    - min_price (int): Minimum price in Euros.
    - max_price (int): Maximum price in Euros.
    - star_ratings (list): List of star ratings (e.g., [3, 4, 5]).
    - meal_plan (str): Meal plan, one of "breakfast", "breakfast_dinner", "kitchen" or "nothing".
    - review_score (list): List of review scores (e.g., [80, 90] for 8.0+ and 9.0+).
    - max_distance_meters (int): Maximum distance from city center in meters (e.g., 500).

    Returns:
    - str: A Booking.com search URL based on the specified filters.
    """
    
    base_url = "https://www.booking.com/searchresults.es.html?"
    
    # Start with basic search parameters (ensure no tuple formatting)
    url = f"{base_url}ss={destination}&checkin={checkin}&checkout={checkout}&group_adults={adults}&group_children={children}"
    
    if rooms is not None:
       url += f"&no_rooms={rooms}"
    
    if min_price is not None and max_price is not None:
        price_filter = f"price%3DEUR-{min_price}-{max_price}-1"
    elif min_price is not None:
        price_filter = f"price%3DEUR-{min_price}-1-1"
    elif max_price is not None:
        price_filter = f"price%3DEUR-{max_price}-1"
    else:
        price_filter = None

    # Construct 'nflt' parameter to add other filters
    nflt_filters = []
    
    if price_filter:
        nflt_filters.append(price_filter)
    
    if star_ratings:
        star_filter = '%3B'.join([f"class%3D{star}" for star in star_ratings])
        nflt_filters.append(star_filter)
    
    meal_plan_options = {
            "breakfast": 1,
            "breakfast_dinner": 9,
            "kitchen": 999,
            "nothing": None
        }
    meal_plan_formatted = meal_plan_options.get(meal_plan, None)

    if meal_plan_formatted is not None:
        meal_plan_str = f"mealplan%3D{meal_plan_formatted}"
        nflt_filters.append(meal_plan_str)
    
    if review_score:
        review_filter = '%3B'.join([f"review_score%3D{score}" for score in review_score])
        nflt_filters.append(review_filter)
    
    if max_distance_meters is not None:
        distance_str = f"distance%3D{max_distance_meters}"
        nflt_filters.append(distance_str)
    
    # Add all 'nflt' filters to URL
    if nflt_filters:
        url += f"&nflt={'%3B'.join(nflt_filters)}"

    url += "&sr_view=list"
    
    return url

def accommodations_booking_selenium_fetch_all_html_contents_concurrent(booking_url_list,max_threads=5):
    # Determine optimal max_workers
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        futures = [executor.submit(fetch_booking_html_optimized, booking_url) for booking_url in booking_url_list]

        # Collect results as they complete
        html_contents_total = []
        for future in futures:
            html_contents_total.append(future.result())

    return html_contents_total, booking_url_list


def fetch_booking_html(booking_url):

    # open driver
    driver = webdriver.Chrome()
    try:
        driver.maximize_window()
        driver.get(booking_url)

        # scroll and load more until bottom
        scroll_and_click_cycle(driver)

        # fetch booking url html
        html_page = driver.page_source
    finally:
        # always close the browser so that windows do not pile up across URLs
        driver.quit()

    return html_page

def fetch_booking_html_optimized(booking_url):

    # add optimization options
    options = Options()
    options.add_argument("--no-sandbox")      # Enables no-sandbox mode
    options.add_argument("--disable-gpu")     # Disables GPU usage
    # options.add_argument("--headless")        # Runs Chrome in headless mode

    # open driver
    driver = webdriver.Chrome(options=options)
    try:
        driver.maximize_window()
        driver.get(booking_url)

        # scroll and load more until bottom
        scroll_and_click_cycle(driver)

        # fetch booking url html
        html_page = driver.page_source
    finally:
        # always close the browser so that windows do not pile up across threads
        driver.quit()

    return html_page

def accommodations_booking_soup_from_all_html_contents_parallel(html_contents_total, booking_urls_list, verbose=False):
    start_time = time.time()
    with ThreadPoolExecutor() as executor:

        page_dfs = list(executor.map(accommodations_booking_parse_single_page_wrapper, html_contents_total, booking_urls_list, [verbose] * len(html_contents_total)))

    total_accommodations_df = pd.concat(page_dfs).reset_index(drop=True)
    end_time = time.time()
    print(f"The whole parallel Beautiful Soup process took {end_time-start_time} seconds")
    return total_accommodations_df

def accommodations_booking_parse_single_page_wrapper(page_html, booking_url, verbose=False):
    return accommodations_booking_parse_single_page(page_html, booking_url,verbose=verbose)


def accommodations_booking_parse_single_page(page_html,booking_url, verbose=False):
    page_soup = BeautifulSoup(page_html, "html.parser")
    return pd.DataFrame(scrape_accommodations_from_page(page_soup,booking_url, verbose=verbose))


def get_accommodations_booking(destinations_list: List[str], start_date: str, stay_duration: int = 2, step_length: int = 7, n_steps: int = 52, adults: int = 2, children: int = 0,
                           rooms: int = 1, max_price: int = 350, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = 5000, max_threads = 5,verbose=False):
    
    start_time = time.time()

    booking_urls_list = build_booking_urls(destinations_list = destinations_list, start_date= start_date, stay_duration =stay_duration , step_length = step_length, n_steps = n_steps, adults = adults, children = children,
                           rooms = rooms, max_price = max_price, star_ratings = star_ratings, meal_plan = meal_plan, review_score = review_score, max_distance_meters = max_distance_meters)
    
    print(f"It took {time.time() - start_time} seconds to build the urls")
    booking_html_contents_total, booking_urls_list = accommodations_booking_selenium_fetch_all_html_contents_concurrent(booking_urls_list, max_threads=max_threads)
    print(f"It took {time.time() - start_time} seconds for selenium to get the html contents")

    print("Now parsing with beautiful soup")
    total_accommodations_df = accommodations_booking_soup_from_all_html_contents_parallel(booking_html_contents_total, booking_urls_list,verbose=verbose)
    return total_accommodations_df
        

In [6]:
total_accommodations_booking_df = des.get_accommodations_booking(destinations_list=destination_cities, start_date="2024-11-08", n_steps=52, max_price=150,scroll_period=0.2, max_threads=6)

It took 0.0020008087158203125 seconds to build the urls
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
'Load more' not found
'Load more' not found
'Load more' not found
'Load more' not found
'Load more' not found
'Load more' not found
...

In [7]:
total_accommodations_booking_df.head()

Unnamed: 0,query_date,city,checkin,checkout,n_adults_search,n_children_search,n_rooms_search,name,url,price_currency,...,close_to_metro,sustainability_cert,room_type,double_bed,single_bed,free_cancellation,breakfast_included,pay_at_hotel,location_score,free_taxi
0,2024-11-04 00:43:02.045730,barcelona,2024-11-08,2024-11-10,2,0,1,Nice and comfortable room for your stay in BCN,https://www.booking.com/hotel/es/nice-and-comf...,€,...,True,False,Habitación Doble Económica,True,False,False,False,False,,False
1,2024-11-04 00:43:02.239603,barcelona,2024-11-08,2024-11-10,2,0,1,Hotel Derby,https://www.booking.com/hotel/es/derby.es.html...,€,...,True,False,Habitación (1 o 2 adultos) - 1 o 2 camas,True,True,False,False,False,,False
2,2024-11-04 00:43:02.245775,barcelona,2024-11-08,2024-11-10,2,0,1,Travelodge Barcelona Poblenou,https://www.booking.com/hotel/es/travelodge-ba...,€,...,True,False,Habitación Doble,True,False,False,False,False,,False
3,2024-11-04 00:43:02.266016,barcelona,2024-11-08,2024-11-10,2,0,1,Hotel Best Front Maritim,https://www.booking.com/hotel/es/front-maritim...,€,...,True,True,Habitación Doble - 2 camas,False,True,False,False,False,,False
4,2024-11-04 00:43:02.269974,barcelona,2024-11-08,2024-11-10,2,0,1,Apartamentos DV,https://www.booking.com/hotel/es/apartamentos-...,€,...,True,False,Habitación Doble con baño privado,True,False,False,False,False,8.9,False


In [8]:
total_accommodations_booking_df.to_parquet("../data/accommodations/booking.parquet")

#### 2.2.1.2 Booking - extra information
Ideally, we would have general information that does not come from the search results but from the individual accommodations themselves: availability dates, rooms and characteristics. This seems impossible to get from the cards, and it is hard to know whether it can be scraped directly from the individual accommodation URLs.

#### Do I need extra information from the individual accommodation pages?

### 2.2.2 Airbnb

Left for later, if possible. For the most part, there is no information about distance to the city center. Opening each listing might give its latitude and/or longitude.
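
If the coordinates of a listing can be obtained, the distance to the city center could be computed with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km; `haversine_km` is a hypothetical helper, the listing coordinates are made up for illustration, and the city center uses the Barcelona coordinates from the airports table in section 2.1.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

barcelona_center = (41.3828939, 2.1774322)  # from the airports table
listing = (41.4036, 2.1744)                 # hypothetical listing coordinates

distance_km = haversine_km(*barcelona_center, *listing)
print(f"Distance to city center: {distance_km:.2f} km")
```

This would yield a column comparable to `distance_city_center_km` from the Booking extraction, although Booking may measure the distance from a different reference point.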

In [9]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [10]:
destination_city = "barcelona"
checkin_date = "2024-11-01"
checkout_date = "2024-11-03"
n_adults = 2
url = f"https://www.airbnb.com/s/{destination_city}/homes?checkin={checkin_date}&checkout={checkout_date}&adults={n_adults}"

driver = webdriver.Chrome()

driver.get(url)
driver.maximize_window()
WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.XPATH, "//button[text()='Aceptar todas']"))
).click()

html_page = driver.page_source
# pagination = driver.find_element(By.XPATH, "//nav[aria-label()='Paginación de resultados de búsqueda']")
# pagination_elements = pagination.find_elements(By.TAG_NAME,"a")
# pagination_elements


AttributeError: 'NoneType' object has no attribute 'is_displayed'

In [None]:
soup = BeautifulSoup(html_page,"html.parser")
soup

In [None]:
len(soup.find("div",{"style":"display: contents;"}).find_all("div",{"itemprop":"itemListElement"}))

In [None]:
soup.find("div",{"style":"display: contents;"}).find_all("div",{"itemprop":"itemListElement"})[0]

Would it be possible to open every Airbnb listing and extract its price, details and availability directly?

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from concurrent.futures import ThreadPoolExecutor

In [None]:
def click_button(xpath):
    try:
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, xpath))
        ).click()
    except TimeoutException:
        print(f"Button with XPath '{xpath}' not found.")

# XPaths for the buttons
xpath_cerrar = "//button[@aria-label='Cerrar']"
xpath_aceptar_todas = "//button[text()='Aceptar todas']"


In [None]:
url = "https://www.airbnb.es/rooms/29413203?adults=2&search_mode=regular_search&check_in=2024-11-01&check_out=2024-11-03&source_impression_id=p3_1730487404_P321LpZsGLiX0DKt&previous_page_section_name=1000&federated_search_id=1293e1b7-d1cd-4837-a262-b1013a1b5764"
# response = requests.get(url)
driver = webdriver.Chrome()

driver.get(url)
with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(click_button, xpath_cerrar)
    executor.submit(click_button, xpath_aceptar_todas)

driver.maximize_window()


html_page = driver.page_source
soup_bs = BeautifulSoup(html_page, "html.parser")

In [None]:
soup_bs

In [None]:
calendar = soup_bs.find("div",{"data-plugin-in-point-id":"AVAILABILITY_CALENDAR_INLINE"})

In [None]:
calendar

In [None]:
available_dates = []
# Loop through all divs with 'data-testid' that match the 'calendar-day' pattern
for day in calendar.find_all('div', attrs={'data-testid': True}):
    # Check if the div is a date and is available
    if day['data-testid'].startswith('calendar-day') and day.get('data-is-day-blocked') != 'true':
        # use a name other than `date` so the imported datetime.date class is not shadowed
        day_date = day['data-testid'].replace('calendar-day-', '')
        available_dates.append(day_date)

# Output available dates
print("Available dates:", available_dates)

## 2.3 Activities

The functions used for this process are inspired by a previous project. To speed up the extraction of such a large volume of information, multithreaded and parallel (multi-process) variations of the original function have been implemented, in order to assess the impact of each optimization as an additional exploration within this project.
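
As a rough illustration of why concurrency helps with I/O-bound extraction, the sketch below times a sequential loop against a thread pool. It is not the code in `src/data_extraction_support.py`: `fake_fetch` is a hypothetical stand-in that sleeps instead of requesting a page, and the worker count is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(city: str) -> str:
    """Hypothetical stand-in for a network request; it only sleeps."""
    time.sleep(0.05)
    return f"{city}: ok"


def run_sequential(cities):
    # One simulated request after another
    return [fake_fetch(city) for city in cities]


def run_threaded(cities, max_workers=4):
    # executor.map returns results in the same order as the input
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_fetch, cities))


def timed(func, *args):
    # Return the function result together with the elapsed seconds
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


demo_cities = ["barcelona", "bilbao", "sevilla", "valencia"]
sequential_result, sequential_seconds = timed(run_sequential, demo_cities)
threaded_result, threaded_seconds = timed(run_threaded, demo_cities)
print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Both runs return the same results; only the elapsed time differs. A process pool follows the same pattern with `ProcessPoolExecutor`, and tends to pay off when parsing (CPU-bound work) dominates rather than waiting on the network.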

In [12]:
cities = ['barcelona','bilbao','sevilla','valencia']
# start from tomorrow (kept under the name `today`) and cover the following year
today = str(date.today()  + datetime.timedelta(days=1))
end_date = str(date.today()  + datetime.timedelta(days=365))
today

'2024-11-05'

Availability can only be queried for a limited window of days. This forces the main function to split the requested range into iterations = roundup(days(end_date - start_date) / window_days) windows.

The window was first assumed to be 15 days, but it actually has to be 7 days, so the function in `src/data_extraction_support.py` was updated accordingly.
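
The window arithmetic can be sketched as follows. This is a minimal illustration, not the function in `src/data_extraction_support.py`: `split_into_windows` is a hypothetical helper, and it assumes both ends of the range are inclusive and that windows are consecutive.

```python
import math
from datetime import date, timedelta


def split_into_windows(start: date, end: date, window_days: int = 7):
    """Return inclusive (window_start, window_end) pairs covering start..end."""
    total_days = (end - start).days + 1          # both ends are inclusive
    n_windows = math.ceil(total_days / window_days)
    windows = []
    for i in range(n_windows):
        window_start = start + timedelta(days=i * window_days)
        # the last window is clipped so it never runs past `end`
        window_end = min(window_start + timedelta(days=window_days - 1), end)
        windows.append((window_start, window_end))
    return windows


windows = split_into_windows(date(2024, 11, 5), date(2025, 11, 4))
print(len(windows), windows[0], windows[-1])
```

With a 7-day window, a one-year range needs 53 requests per city, which is why the extraction is worth parallelizing.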

In [13]:
# result_df_parallel = des.activities_civitatis_extract_all_activites_parallel(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_parallel

In [14]:
# result_df_multithread = des.activities_civitatis_extract_all_activites_multithread(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_multithread

In [15]:
# result_df_multithread_selenium = des.activities_civitatis_extract_all_activites_multithread_selenium(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_multithread_selenium

In [16]:
result_df_parallel_selenium = des.activities_civitatis_extract_all_activites_parallel_selenium(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

result_df_parallel_selenium

Now parsing with beautiful soup
The whole parallel Beautiful Soup process took 25.69040846824646


Unnamed: 0,query_date,city,activity_date_range_start,activity_date_range_end,activity_name,description,url,image,image2,available_days,available_times,duration,latitude,longitude,price,currency,category,spanish
0,2024-11-04 01:23:44.031007,barcelona,2024-11-05,2024-11-11,Excursión a Montserrat + Visita a una bodega,En esta excursión a Montserrat no solo disfrut...,www.civitatis.com/es/barcelona/tour-tapas-vino...,www.civitatis.com/f/espana/barcelona/tour-tapa...,,"[06, 11, 07, 09, 08, 05, 10]","[[8:45], [8:45], [8:45], [8:45], [8:45], [8:45...",7h 30m,41.3940236912484,2.181866082214644,19.98,EUR,Gastronomía y enoturismo,Español
1,2024-11-04 01:23:44.034541,barcelona,2024-11-05,2024-11-11,Paseo en catamarán al atardecer con música en ...,Contempla el skyline de Barcelona mientras dis...,www.civitatis.com/es/barcelona/paseo-catamaran...,"www.civitatis.comdata:image/gif;base64,R0lGODl...",www.civitatis.com/f/espana/barcelona/paseo-cat...,"[07, 08, 06, 11, 10, 09, 05]","[[16:30], [16:30], [17:00], [16:30], [16:30], ...",1h 30m,41.37495867288118,2.17849589524371,7.65,EUR,Paseos en barco,Español
2,2024-11-04 01:23:44.037536,barcelona,2024-11-05,2024-11-11,Free tour por el Parque Güell,En este free tour por el Parque Güell conocere...,www.civitatis.com/es/barcelona/free-tour-parqu...,"www.civitatis.comdata:image/gif;base64,R0lGODl...",www.civitatis.com/f/espana/barcelona/free-tour...,"[11, 05, 06, 07, 08]","[[11:30], [11:30], [11:30], [11:30], [11:30]]",1h 30m,41.41508351,2.154768947,2.00,EUR,Visitas guiadas y free tours,Español
3,2024-11-04 01:23:44.039537,barcelona,2024-11-05,2024-11-11,Teleférico de Montjuïc,Con esta entrada al Teleférico de Montjuïc pod...,www.civitatis.com/es/barcelona/billete-telefer...,"www.civitatis.comdata:image/gif;base64,R0lGODl...",www.civitatis.com/f/espana/barcelona/billete-t...,"[07, 11, 05, 08, 06, 09, 10]","[[10:00], [10:00], [10:00], [10:00], [10:00], ...",,41.368762,2.163434,3.20,EUR,Entradas,
4,2024-11-04 01:23:44.043574,barcelona,2024-11-05,2024-11-11,Excursión a Tarragona y Sitges,En esta excursión a Tarragona y Sitges desde B...,www.civitatis.com/es/barcelona/excursion-tarra...,"www.civitatis.comdata:image/gif;base64,R0lGODl...",www.civitatis.com/f/espana/barcelona/excursion...,"[05, 07, 10, 09, 08, 06]","[[8:30], [8:30], [8:30], [8:30], [8:30], [8:30]]",10 horas,0,0,24.63,EUR,Excursiones de un día,Español
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18446,2024-11-04 01:24:04.173259,valencia,2025-10-25,2025-10-31,Paseo en catamarán por Valencia,En este paseo en catamarán a vela por Valencia...,www.civitatis.com/es/valencia/paseo-catamaran-...,"www.civitatis.comdata:image/gif;base64,R0lGODl...",www.civitatis.com/f/espana/valencia/paseo-cata...,,,1 hora,39.460925,-0.324599,3.00,EUR,Paseos en barco,
18447,2024-11-04 01:24:04.174238,valencia,2025-10-25,2025-10-31,Entrada al Museo Iluziona de Valencia,Descubrid las leyendas de Valencia a través de...,www.civitatis.com/es/valencia/entrada-museo-il...,"www.civitatis.comdata:image/gif;base64,R0lGODl...",www.civitatis.com/f/espana/valencia/entrada-mu...,,,,39.46475896364761,-0.3758850218933492,3.00,EUR,Entradas,
18448,2024-11-04 01:24:04.175239,valencia,2025-10-25,2025-10-31,Excursión a Montanejos,¿Os gusta caminar por la naturaleza? En esta e...,www.civitatis.com/es/valencia/excursion-montan...,"www.civitatis.comdata:image/gif;base64,R0lGODl...",www.civitatis.com/f/espana/valencia/excursion-...,,,8h 30m,39.47625373037879,-0.3469842302587606,11.25,EUR,Excursiones de un día,Español
18449,2024-11-04 01:24:04.176238,valencia,2025-10-25,2025-10-31,Tour del estadio de Mestalla,Acompañadnos a descubrir los secretos de uno d...,www.civitatis.com/es/valencia/tour-estadio-mes...,"www.civitatis.comdata:image/gif;base64,R0lGODl...",www.civitatis.com/f/espana/valencia/tour-estad...,,,1 hora,39.474775,-0.359234,2.60,EUR,Visitas guiadas y free tours,Español


In [17]:
result_df_parallel_selenium.isna().sum()

query_date                      0
city                            0
activity_date_range_start       0
activity_date_range_end         0
activity_name                 240
description                     0
url                           240
image                           0
image2                       2971
available_days               5504
available_times              5504
duration                     1738
latitude                      240
longitude                     240
price                         447
currency                      447
category                      240
spanish                      4403
dtype: int64

In [15]:
# result_df_parallel_selenium_optimized = des.activities_civitatis_extract_all_activites_parallel_selenium_optimized(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_parallel_selenium_optimized

### Time and results comparison

In [18]:
# print(f"The shape of the multithread soup is {result_df_multithread.shape[0]}.")
# print("Multithread soup has these null values:")
# display(result_df_multithread.isna().sum())

# print(f"\nThe shape of the multithread soup + selenium concurrent is {result_df_multithread_selenium.shape[0]}.")
# print("Multithread soup + selenium concurrent has these null values:")
# display(result_df_multithread_selenium.isna().sum())

# print(f"\nThe shape of the parallel soup is {result_df_parallel.shape[0]}.")
# print("Parallel soup has these null values:")
# display(result_df_parallel.isna().sum())

print(f"\nThe shape of the parallel soup + selenium concurrent is {result_df_parallel_selenium.shape[0]}.")
print("Parallel soup + selenium concurrent has these null values:")
display(result_df_parallel_selenium.isna().sum())

# print(f"\nThe shape of the parallel soup + selenium concurrent optimized is {result_df_parallel_selenium_optimized.shape[0]}.")
# print("Parallel soup + selenium concurrent optimized has these null values:")
# result_df_parallel_selenium_optimized.isna().sum()


The shape of the parallel soup + selenium concurrent is 18451.
Parallel soup + selenium concurrent has these null values:


query_date                      0
city                            0
activity_date_range_start       0
activity_date_range_end         0
activity_name                 240
description                     0
url                           240
image                           0
image2                       2971
available_days               5504
available_times              5504
duration                     1738
latitude                      240
longitude                     240
price                         447
currency                      447
category                      240
spanish                      4403
dtype: int64

### 2.3.1 Save results for transformation experiments

In [19]:
os.makedirs("../data/activities/", exist_ok=True)  # make sure the target folder exists
result_df_parallel_selenium.to_parquet("../data/activities/activities.parquet")

## 2.4 Special dates and events

In [42]:
import aiohttp
import asyncio
import pandas as pd

BASE_URL = "https://calendarific.com/api/v2/holidays"

cities_dict = {
    "Barcelona": "CT",     # Catalonia
    "Sevilla": "AN",       # Andalusia
    "Bilbao": "PV",        # Basque Country
    "Valencia": "VC"       # Valencian Community
}

async def fetch_holidays(city, region_code):
    params = {
        "api_key": CALENDARIFIC_API_KEY,
        "country": "ES",
        "year": 2024,
        "location": region_code  # E.g., "CT" for Catalonia, "AN" for Andalusia
    }
    async with aiohttp.ClientSession() as session:
        async with session.get(BASE_URL, params=params) as response:
            data = await response.json()
            return {city: data["response"]["holidays"]}

async def get_cities_dates(cities_dict):
    # One request per city, all awaited concurrently
    tasks = [fetch_holidays(city, region) for city, region in cities_dict.items()]
    results = await asyncio.gather(*tasks)

    # Returns a list of {city: [holiday, ...]} dicts, one per city
    return results



In [43]:
cities_dates = await get_cities_dates(cities_dict)

In [47]:
cities_dates[0]

{'Barcelona': [{'name': "New Year's Day",
   'description': 'New Year’s Day is the first day of the year, or January 1, in the Gregorian calendar.',
   'country': {'id': 'es', 'name': 'Spain'},
   'date': {'iso': '2024-01-01',
    'datetime': {'year': 2024, 'month': 1, 'day': 1}},
   'type': ['National holiday'],
   'primary_type': 'National holiday',
   'canonical_url': 'https://calendarific.com/holiday/spain/new-year-day',
   'urlid': 'spain/new-year-day',
   'locations': 'All',
   'states': 'All'},
  {'name': 'Reconquest Day',
   'description': 'Reconquest Day is a observance in Spain',
   'country': {'id': 'es', 'name': 'Spain'},
   'date': {'iso': '2024-01-02',
    'datetime': {'year': 2024, 'month': 1, 'day': 2}},
   'type': ['Observance'],
   'primary_type': 'Observance',
   'canonical_url': 'https://calendarific.com/holiday/spain/reconquest-day',
   'urlid': 'spain/reconquest-day',
   'locations': 'All',
   'states': 'All'},
  {'name': 'Epiphany',
   'description': 'Epiphany is
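
The nested structure above can be flattened into one row per city and holiday before it is stored. The sketch below is one possible approach, not part of the notebook: `holidays_to_rows` is a hypothetical helper, the sample input copies only a few fields from the output shown above, and the date is trimmed to its first ten characters as a precaution in case an entry carries a time component.

```python
def holidays_to_rows(results):
    """Flatten [{city: [holiday, ...]}, ...] into one dict per city and holiday."""
    rows = []
    for city_result in results:
        for city, holidays in city_result.items():
            for holiday in holidays:
                rows.append({
                    "city": city,
                    "name": holiday["name"],
                    # keep only the YYYY-MM-DD part of the ISO string
                    "date": holiday["date"]["iso"][:10],
                    "primary_type": holiday["primary_type"],
                })
    return rows


# Reduced sample mirroring the structure of cities_dates[0]
sample_results = [{
    "Barcelona": [
        {"name": "New Year's Day", "date": {"iso": "2024-01-01"},
         "primary_type": "National holiday"},
        {"name": "Reconquest Day", "date": {"iso": "2024-01-02"},
         "primary_type": "Observance"},
    ]
}]
holiday_rows = holidays_to_rows(sample_results)
print(holiday_rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(holiday_rows)` to obtain a tidy table.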

## 2.5 Weather forecast and history

### 2.5.1 Get cities' latitude and longitude

Set the parameters for the historical and forecast requests.

#### Setting weather API parameters

Key daily parameters for planning based on actual conditions:

1. **Apparent Temperature**: `apparent_temperature_mean`, `apparent_temperature_min`, `apparent_temperature_max`
   - The "feels like" temperature, factoring in humidity and wind.

2. **Precipitation**: `precipitation_sum`, `precipitation_hours`
   - Total amount of rain/snow and duration of precipitation. Critical for knowing if rain gear is needed.

3. **Wind**: `wind_speed_10m_max`, `wind_gusts_10m_max`
   - Maximum wind speed and gusts. High values can affect outdoor plans.

4. **Cloud Cover & Sunshine**: `cloudcover`, `sunshine_duration`
   - Average cloud cover and total sunshine hours. Useful for understanding light conditions.

5. **UV Index**: `uv_index_max`
   - Maximum UV exposure, indicating sun protection needs.

6. **Sunrise, Sunset, Daylight**: `sunrise`, `sunset`, `daylight_duration`
   - Times for sunrise and sunset, plus total daylight hours. Important for scheduling activities in daylight.

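These metrics provide a clear view of daily conditions to help plan clothing, outdoor activities, and sun protection. The requests in this section include only the apparent temperature, precipitation, wind, sunshine duration and daylight duration variables; cloud cover, UV index and sunrise/sunset times are not requested.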


In [17]:
params = {
    "daily": [
        "apparent_temperature_mean", "apparent_temperature_min", "apparent_temperature_max",
        "precipitation_sum", "precipitation_hours",
        "wind_speed_10m_max", "wind_gusts_10m_max",
        "sunshine_duration", "daylight_duration"
    ],
    "timezone": "Europe/Madrid",
    "forecast_days": 14
}

In [18]:
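# Keep only the airports with known coordinates and map airport name -> (latitude, longitude)
not_nan_filter = countries_airports["latitude"].notna()
coordinate_columns = countries_airports.loc[not_nan_filter, ["airport_name", "latitude", "longitude"]]
cities_dict = {name: (lat, lon) for name, lat, lon in coordinate_columns.itertuples(index=False, name=None)}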

In [19]:
forecast = await des.get_forecast(cities_dict, params)
forecast["forecast/history"] = "forecast"

In [20]:
# Define parameters template (will be updated for each city)
params_history = {
    "start_date": str(datetime.datetime.today().date()-datetime.timedelta(days=366*5)),
    "end_date": str(datetime.datetime.today().date()-datetime.timedelta(days=1)),
    "daily": [
        "apparent_temperature_mean", "apparent_temperature_min", "apparent_temperature_max",
        "precipitation_sum", "precipitation_hours",
        "wind_speed_10m_max", "wind_gusts_10m_max",
        "sunshine_duration", "daylight_duration"
    ],
    "timezone": "Europe/Madrid"
}
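
A template like this has to be completed with each city's coordinates before a request is sent. The sketch below shows one way to do that without mutating the shared template. The `build_city_params` helper and the `latitude`/`longitude` key names are assumptions for illustration, since the support functions in `src/data_extraction_support.py` are not shown here, and the coordinates are approximate sample values.

```python
import copy


def build_city_params(template: dict, latitude: float, longitude: float) -> dict:
    """Return a copy of the template with the city's coordinates filled in."""
    city_params = copy.deepcopy(template)   # never mutate the shared template
    city_params["latitude"] = latitude      # key name is an assumption
    city_params["longitude"] = longitude    # key name is an assumption
    return city_params


template = {"daily": ["precipitation_sum"], "timezone": "Europe/Madrid"}
# Approximate coordinates, for illustration only
demo_coords = {"Barcelona": (41.39, 2.17), "Bilbao": (43.26, -2.93)}

per_city_params = {
    city: build_city_params(template, lat, lon)
    for city, (lat, lon) in demo_coords.items()
}
print(per_city_params["Bilbao"])
```

Copying the template per city keeps concurrent requests from overwriting each other's coordinates.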

In [21]:
weather_history = await des.get_weather_history_for_cities(cities_dict, params_history)
weather_history["forecast/history"] = "history"


In [22]:
weather = pd.concat([forecast,weather_history])

In [23]:
os.makedirs("../data/weather/",exist_ok=True)

In [24]:
weather.to_parquet("../data/weather/weather.parquet")

# 3. Comments on updates

I can keep the results of every query together with its query date. This allows analyzing how prices, availability, etc. change with the time and date of the query. However, it requires constantly updating the information, which can be costly in resources and API calls.

## 3.1 Flights updates

For an analysis, it would be interesting to run the flight queries hourly over a 4-week period.

However, for the current task of comparing several weekends, I will run them just once.

Nevertheless, it could be worthwhile to keep daily results and only refresh flight information on user request, so that the results can be re-used for similar user requests.

Information about when the cheapest prices appear can be used to advise the user on when to check flights, apart from the target price notification that could be provided.
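
Once several snapshots of the same flight are stored, the best time to check could be estimated by averaging the price per hour of the query. The sketch below is only an illustration under assumptions: `cheapest_query_hours` is a hypothetical helper, the `query_date` and `price` field names are borrowed from the activities table, and the snapshot values are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def cheapest_query_hours(snapshots):
    """Average price per hour of the day the query was made, cheapest first."""
    prices_by_hour = defaultdict(list)
    for snapshot in snapshots:
        hour = datetime.fromisoformat(snapshot["query_date"]).hour
        prices_by_hour[hour].append(snapshot["price"])
    averages = {hour: mean(prices) for hour, prices in prices_by_hour.items()}
    return sorted(averages.items(), key=lambda item: item[1])


# Made-up snapshots of one flight, queried twice a day
snapshots = [
    {"query_date": "2024-11-04 09:00:00", "price": 62.0},
    {"query_date": "2024-11-04 21:00:00", "price": 55.0},
    {"query_date": "2024-11-05 09:00:00", "price": 64.0},
    {"query_date": "2024-11-05 21:00:00", "price": 57.0},
]
ranking = cheapest_query_hours(snapshots)
print(ranking)
```

The same grouping works per weekday by replacing `.hour` with `.weekday()`.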

It would be ideal to scrape independent flight providers to obtain better results and not depend on an API, although the latter is much faster.

## 3.2 Accommodations updates

Booking limitations:
- To get more specific information about prices, I have to check every individual date.
- To get more specific information about types of rooms and options, I have to check every individual date.
- To get more specific information about availabilities, I have to check every individual date.

It would be ideal to obtain this information via an API or by exploring the website further.

Airbnb limitations:
- I will mostly need to search all dates for all cities, go through every results page and then open each listing, which is heavy.
- To get more specific information about prices, I have to check every individual date.
- To get more specific information about types of rooms and options, I have to check every individual date.
- To get more specific information about availabilities, I have to check every individual date.
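
To get a feel for the volume involved, the sketch below enumerates the search pages needed for a few weekends across the destination cities, reusing the search URL pattern from section 2. The helpers `upcoming_weekends` and `airbnb_search_urls` are hypothetical, and a weekend is assumed to run from Friday to Sunday.

```python
from datetime import date, timedelta
from itertools import product


def upcoming_weekends(start: date, n_weekends: int):
    """Return (friday, sunday) pairs for the next n_weekends on or after start."""
    days_until_friday = (4 - start.weekday()) % 7   # Monday is 0, Friday is 4
    first_friday = start + timedelta(days=days_until_friday)
    return [
        (first_friday + timedelta(weeks=i), first_friday + timedelta(weeks=i, days=2))
        for i in range(n_weekends)
    ]


def airbnb_search_urls(cities, weekends, n_adults=2):
    # One search URL per (city, weekend) combination
    return [
        f"https://www.airbnb.com/s/{city}/homes?checkin={checkin}&checkout={checkout}&adults={n_adults}"
        for city, (checkin, checkout) in product(cities, weekends)
    ]


weekends = upcoming_weekends(date(2024, 11, 1), n_weekends=4)
urls = airbnb_search_urls(["barcelona", "bilbao", "sevilla", "valencia"], weekends)
print(len(urls), urls[0])
```

Four cities and four weekends already give 16 search pages, before counting pagination and the visit to each individual listing.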

To be explored as soon as I can save everything to the database.

## 3.3 Activities updates

Activity updates can be run every day.