#### 0. Imports

In [1]:
# data processing
import pandas as pd
import numpy as np

# browser automation
import selenium

# working with time
import time
from datetime import date, timedelta

# working with asynchronous functions
import asyncio

# import system to append parent folder to path - enables src importing
import sys
sys.path.append("..")

# data extraction support functions
import src.data_extraction_support as des
from src.data_extraction_support import create_country_airport_code_df



# 1. Introduction to this notebook

# 2. Data extraction

The idea is to periodically save information about travel to several cities to the database. That way, people can check which dates are best among their options, according to their budget and other travel preferences.

For the time being, the only origin city is Madrid. Destinations are chosen from the following cities:
- Barcelona
- Sevilla
- Bilbao
- Valencia
- Málaga

The idea is to store query information about flights, accommodations and activities, in order to analyze:
- Which dates are best to travel on.
- Which days and times yield the lowest prices for the same flights and accommodations.

Therefore, the goal will be to:
- Application:
    1. Be able to select flights based on user requirements:
        - Stops
        - Duration
        - Price
        - Time of departure
        - Origin airport
        - Destination airport
    2. Be able to select accommodations based on user requirements:
        - Stars
        - Score
        - Number of comments
        - Number of people attended
        - Distance Score
        - People per room
        - Type of room
        - Distance to city centre

    3. Be able to offer activities based on the user's:
        - Dates of travel
        - Categories of activities


- Analysis:
    - Destinations:
        1. Best absolute price to scores (accommodations, activities) per destination
        2. Best price to score based on user restrictions (of accommodation type, categories, etc)
        3. Best absolute destination to go to based on user preferences

    - Flights:
        1. Best absolute (least expensive) dates to travel somewhere
        2. Best (least expensive) dates in advance to get flight tickets
        3. Best (least expensive) days and times to consult flight ticket prices
    
    - Accommodations
        1. Best absolute price to score per destination
        2. Best price to score based on user preferences of accommodation
        3. Destination offering more accommodations of certain preferences (distance, house, etc)
        4. Best dates for better price
        5. Best dates and advance to get availability
    
    - Activities:
        1. Best dates for available activities
        2. Best dates for available activities of preference
        3. Best dates to get offers
        4. Best destinations for activities of preference

    - Demand:
        1. Are there activities that make certain dates more expensive?
        2. Is it high season or low season there?

    - Weather:
        1. If the trip starts less than 14 days ahead, what will the weather be like?



For flight requests, I cannot make more than 35 000 requests a month. For the flights analysis I will need to:
1. Analyse the best times, days of the week and days in advance to query a flight price:
    1. Check flights at different times of day, for the same destination, for several sample destinations. This is 4 destinations, 24 times a day.
    2. Check flights on different days of the week, for the same destinations, for several sample destinations. This is 4 destinations, for at least 4 weeks.
    3. Check flights at different days in advance of the flight, for several sample destinations and days of the week. This is 4 destinations, for advances ranging from the same day to 4 weeks.

    For that ideal comparison, this amounts to 4 destinations x 24 queries a day x a 28-day window x 28 days of collection = 75,264 requests (about 75k), more than double the monthly limit.
    However, the main problem is that there is not enough time before Monday to gather this information. For now, the data will be stored as it arrives and analysed hourly, potentially comparing queries made on a weekend against queries made on a Monday, and then comparing the results with the most expensive days on average during the year (special events in each city might apply).

2. Analyse the best dates to get flights.
    1. Check flights for different dates. I can do this once for each of the 365 days of the year, for the 4 chosen destinations.

    That amounts to 4 destinations x 365 days = 1,460 requests.
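
The request counts above can be checked with a few lines of arithmetic. This is only a minimal sketch: the variable names are illustrative, and the cap is the 35 000 monthly figure stated above.

```python
# Request-budget check for the two flight analyses described above.
MONTHLY_REQUEST_CAP = 35_000  # monthly limit stated in the text

n_destinations = 4
queries_per_day = 24    # one query per hour
days_window = 28        # flight dates covered by each query round
collection_days = 28    # days over which the queries are repeated

# Analysis 1: best time, weekday and days in advance to query a price
timing_requests = n_destinations * queries_per_day * days_window * collection_days

# Analysis 2: best dates to fly, one query per destination and day of the year
dates_requests = n_destinations * 365

print(f"Timing analysis: {timing_requests:,} requests")
print(f"Dates analysis:  {dates_requests:,} requests")
print(f"Timing analysis fits the cap: {timing_requests <= MONTHLY_REQUEST_CAP}")
print(f"Dates analysis fits the cap:  {dates_requests <= MONTHLY_REQUEST_CAP}")
```

Under these assumptions only the dates analysis fits within one month of requests, which is why the timing analysis has to be sampled rather than run in full.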


The extractions to be made are:
- Flights
    - Skyscrapper
- Accommodations
    - Booking
    - Airbnb
- Activities
    - Civitatis
- Weather
- Seasonality of demand

## 2.1 Flights

First, let's make queries for our 4 chosen destinations. Searching for Spain already returns almost all the airports we need; only Bilbao is missing, so we add it to the list explicitly.

In [2]:
list_of_countries_or_cities = ["spain","bilbao"]

In [3]:
# # lines commented to not execute them when running the notebook
# countries_airports = create_country_airport_code_df(list_of_countries_or_cities)

# countries_airports.to_csv("../data/airport_codes/countries_airports.csv")
countries_airports = pd.read_csv("../data/airport_codes/countries_airports.csv")

In [4]:
countries_airports

Unnamed: 0.1,Unnamed: 0,country,city,city_entityId,airport_skyId,airport_entityId,airport_name
0,0,spain,Madrid,27544850,MAD,95565077,Madrid
1,1,spain,Barcelona,27548283,BCN,95565085,Barcelona
2,2,spain,Port Of Spain,27546011,POS,104120358,Port Of Spain
3,3,spain,Málaga,27547484,AGP,95565095,Malaga
4,4,spain,Seville,27547022,SVQ,95565089,Seville
5,5,spain,Valencia,27547405,VLC,95565090,Valencia
6,6,spain,Salamanca,27546391,SLM,95565100,Salamanca Matacan
7,7,bilbao,Bilbao,27538794,BIO,95565104,Bilbao
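
The table includes "Port Of Spain" among the results for `spain`, so picking an airport by row position is fragile. A minimal sketch of a lookup by exact city name follows; `find_airport` is a hypothetical helper, not part of `src.data_extraction_support`, and the records are a subset of the rows above.

```python
# Rows copied from the table above (subset of columns), as plain records.
# With pandas, the same structure comes from countries_airports.to_dict("records").
airports = [
    {"city": "Madrid", "airport_skyId": "MAD", "airport_entityId": "95565077"},
    {"city": "Barcelona", "airport_skyId": "BCN", "airport_entityId": "95565085"},
    {"city": "Port Of Spain", "airport_skyId": "POS", "airport_entityId": "104120358"},
    {"city": "Bilbao", "airport_skyId": "BIO", "airport_entityId": "95565104"},
]


def find_airport(records, city_name):
    """Hypothetical helper: return the first record whose city matches
    exactly (case-insensitive), or None when there is no match."""
    wanted = city_name.strip().lower()
    for record in records:
        if record["city"].lower() == wanted:
            return record
    return None


print(find_airport(airports, "madrid")["airport_skyId"])
print(find_airport(airports, "spain"))
```

Matching on the full city name means a partial match such as "spain" inside "Port Of Spain" is not returned.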


Information needed for extraction from each flight:
- Duration
- Price
- Stops
- Departure
- Arrival
- Company
- Self_transfer
- Fare_policy columns: 'isChangeAllowed', 'isPartiallyChangeable', 'isCancellationAllowed', 'isPartiallyRefundable'
- Score
- Luggage price (optional)
- Origin airport
- Destination airport

In [5]:
origin_city = "madrid"
n_adults = 2

# would have to translate them to english if user is spanish
# destination_cities = ['barcelona','bilbao','seville','valencia']
destination_cities = ['barcelona']


In [6]:
# careful: the main airport of each city is currently selected by coincidence; a proper selection method will be needed in the future
querystrings_list = des.build_flight_request_querystring_list_single(countries_airports,origin_city,destination_cities, '2024-11-01', n_steps=3, step_length=7, days_window=2, n_adults= 1, n_children=0, n_infants=0, origin_airport_code="Yes", 
                                   destination_airport_code="Yes",sort_by="price_high",currency="EUR")
querystrings_list

[{'originSkyId': 'madrid',
  'destinationSkyId': 'barcelona',
  'originEntityId': '95565077',
  'destinationEntityId': '95565085',
  'date': '2024-11-01',
  'adults': '1',
  'childrens': '0',
  'infants': '0',
  'sortBy': 'price_high',
  'currency': 'EUR'},
 {'originSkyId': 'barcelona',
  'destinationSkyId': 'madrid',
  'originEntityId': '95565085',
  'destinationEntityId': '95565077',
  'date': '2024-11-03',
  'adults': '1',
  'childrens': '0',
  'infants': '0',
  'sortBy': 'price_high',
  'currency': 'EUR'},
 {'originSkyId': 'madrid',
  'destinationSkyId': 'barcelona',
  'originEntityId': '95565077',
  'destinationEntityId': '95565085',
  'date': '2024-11-08',
  'adults': '1',
  'childrens': '0',
  'infants': '0',
  'sortBy': 'price_high',
  'currency': 'EUR'},
 {'originSkyId': 'barcelona',
  'destinationSkyId': 'madrid',
  'originEntityId': '95565085',
  'destinationEntityId': '95565077',
  'date': '2024-11-10',
  'adults': '1',
  'childrens': '0',
  'infants': '0',
  'sortBy': 'pri

In [7]:
print(f"The number of API requests is {len(querystrings_list)}")

The number of API requests is 6
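
The 6 requests follow from `n_steps=3` trips, each with one outbound and one return query. The sketch below reproduces only the date schedule seen in the output; `build_trip_dates` is a hypothetical helper, and the real `des.build_flight_request_querystring_list_single` may be implemented differently.

```python
from datetime import date, timedelta


def build_trip_dates(start_date, n_steps, step_length, days_window):
    """Hypothetical helper: return (outbound, return) ISO date pairs."""
    start = date.fromisoformat(start_date)
    pairs = []
    for step in range(n_steps):
        # each step moves the trip forward by step_length days
        outbound = start + timedelta(days=step * step_length)
        # the return flight is days_window days after the outbound flight
        inbound = outbound + timedelta(days=days_window)
        pairs.append((outbound.isoformat(), inbound.isoformat()))
    return pairs


trip_dates = build_trip_dates("2024-11-01", n_steps=3, step_length=7, days_window=2)
for outbound, inbound in trip_dates:
    print(outbound, "->", inbound)
print("requests:", 2 * len(trip_dates))
```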


In [8]:
itineraries_dict_list =  await des.request_flight_itineraries_async_multiple(querystrings_list)
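
For reference, a minimal sketch of how several requests can be awaited concurrently while capping the number in flight. It does not show the actual implementation of `des.request_flight_itineraries_async_multiple`; `fake_request`, `request_all` and `flatten` are hypothetical names, and the network call is replaced by a short sleep.

```python
import asyncio


async def fake_request(querystring):
    """Stand-in for one API call; it only echoes the query date."""
    await asyncio.sleep(0.01)
    return [{"date": querystring["date"]}]


async def request_all(querystrings, max_concurrent=3):
    """Run all requests concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(querystring):
        async with semaphore:
            try:
                return await fake_request(querystring)
            except Exception:
                return None  # a failed request is recorded as None

    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*(limited(q) for q in querystrings))


def flatten(results):
    """Drop failed or empty responses and flatten the remaining lists."""
    return [item for result in results if result for item in result]


queries = [{"date": "2024-11-01"}, {"date": "2024-11-03"}]
# Inside a notebook, use `await request_all(queries)` instead of asyncio.run.
results = asyncio.run(request_all(queries))
print(flatten(results))
```

Returning `None` for failed requests is what makes a filtering step such as `if dict_list` necessary when flattening the responses.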

In [9]:
itineraries_dict_list_flat = [itinerary_dict for dict_list in itineraries_dict_list if dict_list for itinerary_dict in dict_list]

In [10]:
print(f"The number of API itineraries got is {len(itineraries_dict_list_flat)}")

The number of API itineraries got is 349


In [13]:
import datetime

In [11]:
def create_itineraries_dataframe(itineraries_dict_list):

    extracted_itinerary_info_list = list()

    for itinerary in itineraries_dict_list:
        extracted_itinerary_info_list.append(extract_flight_info(itinerary))
        
    return pd.DataFrame(extracted_itinerary_info_list)

def extract_flight_info(flight_dict):

    flight_result_dict = {}

    flight_result_dict_assigner = {
        'date_query': lambda _: datetime.datetime.now(),
        'score': lambda flight: float(flight['score']),
        'duration': lambda flight: int(flight['legs'][0]['durationInMinutes']),
        'price': lambda flight: int(flight['price']['formatted'].split()[0].replace(",","")),
        'price_currency': lambda flight: flight['price']['formatted'].split()[1],
        'stops': lambda flight: int(flight['legs'][0]['stopCount']),
        'departure': lambda flight: pd.to_datetime(flight['legs'][0]['departure']),
        'arrival': lambda flight: pd.to_datetime(flight['legs'][0]['arrival']),
        'company': lambda flight: flight['legs'][0]['carriers']['marketing'][0]['name'],
        'self_transfer': lambda flight: flight['isSelfTransfer'],
        'fare_isChangeAllowed': lambda flight: flight['farePolicy']['isChangeAllowed'],
        'fare_isPartiallyChangeable': lambda flight: flight['farePolicy']['isPartiallyChangeable'],
        'fare_isCancellationAllowed': lambda flight: flight['farePolicy']['isCancellationAllowed'],
        'fare_isPartiallyRefundable': lambda flight: flight['farePolicy']['isPartiallyRefundable'],
        'origin_airport': lambda flight: flight['legs'][0]['origin']['name'],
        'destination_airport': lambda flight: flight['legs'][0]['destination']['name']
    }


    for key, function in flight_result_dict_assigner.items():
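        try:
            flight_result_dict[key] = function(flight_dict)
        # missing keys, empty lists and unparseable values all become NaN
        except (KeyError, IndexError, TypeError, ValueError):
            flight_result_dict[key] = np.nan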


    return flight_result_dict
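
The price is parsed by splitting the formatted string on whitespace. A slightly stricter variant is sketched below, assuming the API returns strings shaped like `'21 €'` or `'1,234 €'` (amount first, then currency); `parse_formatted_price` is a hypothetical helper.

```python
import re


def parse_formatted_price(formatted):
    """Hypothetical helper: split a string such as '1,234 €' into
    (amount, currency). Raises ValueError on any other shape."""
    match = re.fullmatch(r"\s*(\d[\d,]*)\s*([^\d\s]+)\s*", formatted)
    if match is None:
        raise ValueError(f"Unrecognised price format: {formatted!r}")
    amount_text, currency = match.groups()
    # Assumes ',' is a thousands separator, as in the replace(",", "") call above
    return int(amount_text.replace(",", "")), currency


print(parse_formatted_price("21 €"))
print(parse_formatted_price("1,234 €"))
```

Raising `ValueError` on unexpected input makes a format change in the API visible, provided the caller handles that exception.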

In [14]:
itineraries_df = create_itineraries_dataframe(itineraries_dict_list_flat)

In [16]:
flights_depart = itineraries_df[itineraries_df["origin_airport"]=="Madrid"].sort_values(by="price",ascending=True)
flights_return = itineraries_df[itineraries_df["origin_airport"]=="Barcelona"].sort_values(by="price",ascending=True)

In [19]:
flights_depart.head()

Unnamed: 0,date_query,score,duration,price,price_currency,stops,departure,arrival,company,self_transfer,fare_isChangeAllowed,fare_isPartiallyChangeable,fare_isCancellationAllowed,fare_isPartiallyRefundable,origin_airport,destination_airport
206,2024-11-02 19:10:25.459652,0.857862,75,21,€,0,2024-11-15 06:45:00,2024-11-15 08:00:00,Iberia,False,False,False,False,False,Madrid,Barcelona
207,2024-11-02 19:10:25.460659,0.677158,75,25,€,0,2024-11-15 07:45:00,2024-11-15 09:00:00,Iberia,False,False,False,False,False,Madrid,Barcelona
205,2024-11-02 19:10:25.458576,0.999,85,33,€,0,2024-11-15 15:10:00,2024-11-15 16:35:00,Air Europa,False,False,False,False,False,Madrid,Barcelona
61,2024-11-02 19:10:25.364576,0.834892,75,38,€,0,2024-11-08 07:15:00,2024-11-08 08:30:00,Iberia,False,False,False,False,False,Madrid,Barcelona
68,2024-11-02 19:10:25.372574,0.480435,75,38,€,0,2024-11-08 06:45:00,2024-11-08 08:00:00,Iberia,False,False,False,False,False,Madrid,Barcelona


In [20]:
flights_return.head()

Unnamed: 0,date_query,score,duration,price,price_currency,stops,departure,arrival,company,self_transfer,fare_isChangeAllowed,fare_isPartiallyChangeable,fare_isCancellationAllowed,fare_isPartiallyRefundable,origin_airport,destination_airport
287,2024-11-02 19:10:25.517747,0.999,85,36,€,0,2024-11-17 13:30:00,2024-11-17 14:55:00,Iberia,False,False,False,False,False,Barcelona,Madrid
294,2024-11-02 19:10:25.522663,0.306007,85,38,€,0,2024-11-17 07:00:00,2024-11-17 08:25:00,Iberia,False,False,False,False,False,Barcelona,Madrid
142,2024-11-02 19:10:25.417589,0.564432,85,39,€,0,2024-11-10 07:00:00,2024-11-10 08:25:00,Iberia,False,False,False,False,False,Barcelona,Madrid
289,2024-11-02 19:10:25.519750,0.496261,85,41,€,0,2024-11-17 13:30:00,2024-11-17 14:55:00,Vueling Airlines,False,False,False,False,False,Barcelona,Madrid
288,2024-11-02 19:10:25.518751,0.682714,90,45,€,0,2024-11-17 11:50:00,2024-11-17 13:20:00,Air Europa,False,False,False,False,False,Barcelona,Madrid
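
As a first look at "best dates", the cheapest fare per departure day can be computed from rows like those above. The sketch uses plain Python on the five outbound rows shown by `flights_depart.head()`; with the full dataframe, a pandas `groupby` on the departure date would do the same job.

```python
from collections import defaultdict

# (departure, price in EUR, company) copied from flights_depart.head() above
outbound_rows = [
    ("2024-11-15 06:45:00", 21, "Iberia"),
    ("2024-11-15 07:45:00", 25, "Iberia"),
    ("2024-11-15 15:10:00", 33, "Air Europa"),
    ("2024-11-08 07:15:00", 38, "Iberia"),
    ("2024-11-08 06:45:00", 38, "Iberia"),
]

# keep the lowest price seen for each departure day
cheapest_by_date = defaultdict(lambda: float("inf"))
for departure, price, _company in outbound_rows:
    day = departure[:10]  # 'YYYY-MM-DD' prefix of the timestamp
    cheapest_by_date[day] = min(cheapest_by_date[day], price)

for day in sorted(cheapest_by_date):
    print(day, cheapest_by_date[day])
```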


The flights have been acquired for a test run with a 2-day trip window; of the 4 chosen cities, only Barcelona is queried in the cells above. Let's first transform the data and load it into the database, to check whether anything is missing before launching full-range queries.

In [21]:
itineraries_df.to_parquet("../data/flights/itineraries.parquet")

## 2.2 Accommodations

### 2.2.1 Booking

1. Get all accommodation links
2. Get soups from all accommodation links
3. 

#### 2.2.1.1 Testing functions

In [38]:
# data processing
import pandas as pd
import numpy as np

## Scraping
# Webdriver automation
from selenium import webdriver 
from webdriver_manager.chrome import ChromeDriverManager  
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# html parsing
from bs4 import BeautifulSoup
# make synchronous request
import requests

# math operations
import math

# work with dates and time
import time
import datetime

# # work with asynchronicity
import asyncio
import aiohttp

# work with concurrency
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import json

# environment variables
import dotenv
import os
dotenv.load_dotenv()
AIR_SCRAPPER_API_KEY = os.getenv("AIR_SCRAPPER_KEY")
GOOGLE_API = os.getenv("GOOGLE_API_KEY")
CALENDARIFIC_API_KEY = os.getenv("CALENDARIFIC_API_KEY")

# import support functions
import sys 
sys.path.append("..")

# function typing
from typing import List, Optional

# regular expressions
import re

In [33]:
destination_cities = ['barcelona']

In [34]:
### Accommodations - Booking - NEW VERSIONS

def scrape_accommodations_from_page(page_soup, booking_url, verbose=False):
    accommodation_scraper_dict = {
        "query_date": lambda _: datetime.datetime.now(),
        "checkin": lambda _: re.findall(r"checkin=(\d{4}-\d{2}-\d{2})", booking_url)[0],
        "checkout": lambda _: re.findall(r"checkout=(\d{4}-\d{2}-\d{2})", booking_url)[0],
        "n_adults_search": lambda _: re.findall(r"group_adults=(\d+)", booking_url)[0],
        "n_children_search":lambda _: re.findall(r"group_children=(\d+)", booking_url)[0],
        "n_rooms_search": lambda _: re.findall(r"no_rooms=(\d+)", booking_url)[0],
        "name": lambda card: card.find("div",{"data-testid":"title"}).text,
        "url": lambda card: card.find("a",{"data-testid":"title-link"})["href"],
        "price_currency": lambda card: card.find("span",{"data-testid":"price-and-discounted-price"}).text.split()[0],
        "total_price_amount": lambda card: card.find("span",{"data-testid":"price-and-discounted-price"}).text.split()[1].replace(".","").replace(",","."),
        "distance_city_center_km": lambda card: card.find("span",{"data-testid":"distance"}).text.split()[1].replace(".","").replace(",","."),
        "score": lambda card: card.find("div",{"data-testid": "review-score"}).find_all("div",recursive=False)[0].find("div").next_sibling.text.strip().replace(",","."),
        "n_comments": lambda card: card.find("div",{"data-testid": "review-score"}).find_all("div",recursive=False)[1].find("div").next_sibling.text.strip().split()[0].replace(".",""),
        "close_to_metro": lambda card: "Yes" if card.find("span",{"class":"f419a93f12"}) else "No",
        "sustainability_cert": lambda card: "Yes" if card.find("span",{"class":"abf093bdfe e6208ee469 f68ecd98ea"}) else "No",
        "room_type": lambda card: card.find("h4",{"class":"abf093bdfe e8f7c070a7"}).text,
        "double_bed": lambda card: "Yes" if any(["doble" in element.text for element in card.find_all("div",{"class":"abf093bdfe"})]) else "No",
        "single_bed": lambda card: "Yes" if any(["individual" in element.text for element in card.find_all("div",{"class":"abf093bdfe"})]) else "No",
        "free_cancellation": lambda card: "Yes" if any([element.text == "Cancelación gratis" for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})]) else "No",
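        # assumption: the breakfast label on the card contains the word "desayuno" (e.g. "Desayuno incluido")
        "breakfast_included": lambda card: "Yes" if any(["desayuno" in element.text.lower() for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})]) else "No",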
        "pay_at_hotel": lambda card: "Yes" if any(['Sin pago por adelantado' in element.text for element in card.find_all("div",{"class":"abf093bdfe d068504c75"})]) else "No",
        "location_score": lambda card: card.find("span",{"class":"a3332d346a"}).text.split()[1].replace(",","."),
        "free_taxi": lambda card: "Yes" if any(["taxi gratis" in element.text.lower() for element in card.find_all("div",{"span":"b30f8eb2d6"})]) else "No"
    }

    accommodation_data_dict = {key: [] for key in accommodation_scraper_dict}

    for accommodation_card in page_soup.find_all("div", {"aria-label":"Alojamiento"}):
        for key, accommodation_scraper_function in accommodation_scraper_dict.items():
            try:
                accommodation_data_dict[key].append(accommodation_scraper_function(accommodation_card))
            except Exception as e:
                if verbose:
                    print(f"Error filling {key} due to {e}")
                accommodation_data_dict[key].append(np.nan)

    return accommodation_data_dict


# dynamic html loading functions
def scroll_to_bottom(driver):
    last_height = driver.execute_script("return window.pageYOffset")

    while True:

        driver.execute_script('window.scrollBy(0, 2000)')
        time.sleep(0.4)
        
        new_height =  driver.execute_script("return window.pageYOffset")
        if new_height == last_height:
            break
        last_height = new_height

def scroll_back_up(driver):
    driver.execute_script('window.scrollBy(0, -600)')
    time.sleep(0.2)

def click_load_more(driver):
    # returns True when the 'Load more' button was clicked, False otherwise
    try:
        button = WebDriverWait(driver, 3).until(EC.element_to_be_clickable((By.XPATH,'//*[@id="bodyconstraint-inner"]/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[3]/div[*]/button')))
        button.click()

        return True
    except Exception:
        print("'Load more' not found")
        return False

def scroll_and_click_cycle(driver):
    while True:
        print("Scrolling again")
        scroll_to_bottom(driver)
        scroll_back_up(driver)
        if not click_load_more(driver):
            break





def build_booking_urls(destinations_list: List[str], start_date: str, stay_duration: int = 2, step_length: int = 7, n_steps: int = 52, adults: int = 2, children: int = 0,
                           rooms: int = 1, max_price: int = 350, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = None):
    

    start_date_datetime = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    booking_url_list = list()
    for destination in destinations_list:
        for step in range(n_steps):
            checkin = (start_date_datetime + datetime.timedelta(days=step*step_length)).strftime("%Y-%m-%d")
            checkout = (start_date_datetime + datetime.timedelta(days=step*step_length + stay_duration)).strftime("%Y-%m-%d")

            booking_search_link = build_booking_url_full(
                destination=destination,
                checkin=checkin,
                checkout=checkout,
                adults=adults, 
                children=children, 
                rooms=rooms, 
                max_price=max_price, 
                star_ratings=star_ratings, 
                meal_plan=meal_plan,  
                review_score=review_score,  
                max_distance_meters=max_distance_meters 
            )

            booking_url_list.append(booking_search_link)

    return booking_url_list

def build_booking_url_full(destination: str, checkin: str, checkout: str, adults: int = 1, children: int = 0,
                           rooms: int = 1, min_price: int = 1, max_price: int = 1, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = None):
    """
    Build a Booking.com search URL by including all parameter filters, 
    ensuring proper formatting for all parameters.

    Parameters:
    - destination (str): Destination city.
    - checkin (str): Check-in date in YYYY-MM-DD format.
    - checkout (str): Check-out date in YYYY-MM-DD format.
    - adults (int): Number of adults.
    - children (int): Number of children.
    - rooms (int): Number of rooms.
    - min_price (int): Minimum price in Euros.
    - max_price (int): Maximum price in Euros.
    - star_ratings (list): List of star ratings (e.g., [3, 4, 5]).
    - meal_plan (str): Meal plan, one of "breakfast", "breakfast_dinner", "kitchen" or "nothing".
    - review_score (list): List of review scores (e.g., [80, 90] for 8.0+ and 9.0+).
    - max_distance_meters (int): Maximum distance from city center in meters (e.g., 500).

    Returns:
    - str: A Booking.com search URL based on the specified filters.
    """
    
    base_url = "https://www.booking.com/searchresults.es.html?"
    
    # Start with basic search parameters (ensure no tuple formatting)
    url = f"{base_url}ss={destination}&checkin={checkin}&checkout={checkout}&group_adults={adults}&group_children={children}"
    
    if rooms is not None:
       url += f"&no_rooms={rooms}"
    
    # assumption: Booking accepts the literal bounds "min" and "max" for an open-ended price range
    if min_price is not None and max_price is not None:
        price_filter = f"price%3DEUR-{min_price}-{max_price}-1"
    elif min_price is not None:
        price_filter = f"price%3DEUR-{min_price}-max-1"
    elif max_price is not None:
        price_filter = f"price%3DEUR-min-{max_price}-1"
    else:
        price_filter = None

    # Construct 'nflt' parameter to add other filters
    nflt_filters = []
    
    if price_filter:
        nflt_filters.append(price_filter)
    
    if star_ratings:
        star_filter = '%3B'.join([f"class%3D{star}" for star in star_ratings])
        nflt_filters.append(star_filter)
    
    meal_plan_options = {
            "breakfast": 1,
            "breakfast_dinner": 9,
            "kitchen": 999,
            "nothing": None
        }
    meal_plan_formatted = meal_plan_options.get(meal_plan, None)

    if meal_plan_formatted is not None:
        meal_plan_str = f"mealplan%3D{meal_plan_formatted}"
        nflt_filters.append(meal_plan_str)
    
    if review_score:
        review_filter = '%3B'.join([f"review_score%3D{score}" for score in review_score])
        nflt_filters.append(review_filter)
    
    if max_distance_meters is not None:
        distance_str = f"distance%3D{max_distance_meters}"
        nflt_filters.append(distance_str)
    
    # Add all 'nflt' filters to URL
    if nflt_filters:
        url += f"&nflt={'%3B'.join(nflt_filters)}"

    url += "&sr_view=list"
    
    return url

def accommodations_booking_selenium_fetch_all_html_contents_concurrent(booking_url_list):
    # Each worker opens its own Chrome window, so the pool is kept small (2 workers)
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(fetch_booking_html, booking_url) for booking_url in booking_url_list]

        # Collect results in submission order so they stay aligned with booking_url_list
        html_contents_total = []
        for future in futures:
            html_contents_total.append(future.result())

    return html_contents_total, booking_url_list


def fetch_booking_html(booking_url):

    # open driver
    driver = webdriver.Chrome()
    try:
        driver.maximize_window()
        driver.get(booking_url)

        # scroll and load more until bottom
        # css_selector = "#bodyconstraint-inner > div:nth-child(8) > div > div.af5895d4b2 > div.df7e6ba27d > div.bcbf33c5c3 > div.dcf496a7b9.bb2746aad9 > div.d4924c9e74 > div.c82435a4b8.f581fde0b8 > button"
        scroll_and_click_cycle(driver)

        # fetch booking url html
        html_page = driver.page_source
    finally:
        # always close the browser, even if scraping fails
        driver.quit()

    return html_page

def fetch_booking_html_optimized(booking_url):

    # ADD OPTIMIZATION OPTIONS HERE

    # open driver
    driver = webdriver.Chrome()
    try:
        driver.maximize_window()
        driver.get(booking_url)

        # scroll and load more until bottom
        # scroll_and_click_cycle only takes the driver; the CSS selector is kept for reference
        # css_selector = "#bodyconstraint-inner > div:nth-child(8) > div > div.af5895d4b2 > div.df7e6ba27d > div.bcbf33c5c3 > div.dcf496a7b9.bb2746aad9 > div.d4924c9e74 > div.c82435a4b8.f581fde0b8 > button"
        scroll_and_click_cycle(driver)

        # fetch booking url html
        html_page = driver.page_source
    finally:
        # always close the browser, even if scraping fails
        driver.quit()

    return html_page

def accommodations_booking_soup_from_all_html_contents_parallel(html_contents_total, booking_urls_list, verbose=False):
    start_time = time.time()
    with ThreadPoolExecutor() as executor:

        page_dfs = list(executor.map(accommodations_booking_parse_single_page_wrapper, html_contents_total, booking_urls_list, [verbose] * len(html_contents_total)))

    total_accommodations_df = pd.concat(page_dfs).reset_index(drop=True)
    end_time = time.time()
    print(f"The whole parallel Beautiful Soup process took {end_time-start_time}")
    return total_accommodations_df

def accommodations_booking_parse_single_page_wrapper(page_html, booking_url, verbose=False):
    return accommodations_booking_parse_single_page(page_html, booking_url,verbose=verbose)


def accommodations_booking_parse_single_page(page_html,booking_url, verbose=False):
    page_soup = BeautifulSoup(page_html, "html.parser")
    return pd.DataFrame(scrape_accommodations_from_page(page_soup,booking_url, verbose=verbose))


def accommodations_booking_extract_all_acommodations_selenium_concurrent(destinations_list: List[str], start_date: str, stay_duration: int = 2, step_length: int = 7, n_steps: int = 52, adults: int = 2, children: int = 0,
                           rooms: int = 1, max_price: int = 350, star_ratings: list = None, 
                           meal_plan: str = None, review_score: list = None, max_distance_meters: int = 5000, verbose=False):
    
    start_time = time.time()

    booking_urls_list = build_booking_urls(destinations_list = destinations_list, start_date= start_date, stay_duration =stay_duration , step_length = step_length, n_steps = n_steps, adults = adults, children = children,
                           rooms = rooms, max_price = max_price, star_ratings = star_ratings, meal_plan = meal_plan, review_score = review_score, max_distance_meters = max_distance_meters)
    
    print(f"It took {time.time() - start_time} seconds to build the urls")
    booking_html_contents_total, booking_urls_list = accommodations_booking_selenium_fetch_all_html_contents_concurrent(booking_urls_list)
    print(f"It took {time.time() - start_time} seconds for selenium to get the html contents")

    print("Now parsing with beautiful soup")
    total_accommodations_df = accommodations_booking_soup_from_all_html_contents_parallel(booking_html_contents_total, booking_urls_list,verbose=verbose)
    return total_accommodations_df
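
The `nflt` parameter packs several filters into one value, with `=` encoded as `%3D` and `;` as `%3B`. The sketch below shows the same encoding produced with `urllib.parse.quote`, and how the search parameters can be read back with `parse_qs` as an alternative to the regular expressions used in `scrape_accommodations_from_page`. The filter values are illustrative.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Filters in readable form; '=' and ';' must be percent-encoded inside nflt
filters = ["price=EUR-1-150-1", "class=4", "distance=5000"]
nflt = quote(";".join(filters), safe="")

url = (
    "https://www.booking.com/searchresults.es.html?"
    "ss=barcelona&checkin=2024-11-02&checkout=2024-11-04"
    f"&group_adults=2&group_children=0&no_rooms=1&nflt={nflt}&sr_view=list"
)

# Read the search parameters back without regular expressions
params = parse_qs(urlsplit(url).query)
print(nflt)
print(params["checkin"][0], params["checkout"][0], params["group_adults"][0])
print(params["nflt"][0].split(";"))
```

`parse_qs` decodes the percent-encoding, so the individual filters come back in their readable form.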
        

In [35]:
total_accommodations_booking_df = accommodations_booking_extract_all_acommodations_selenium_concurrent(destinations_list=destination_cities, start_date="2024-11-02", n_steps=4, max_price=150)

It took 0.0 seconds to build the urls
Scrolling again
Scrolling again
'Load more' not found
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
'Load more' not found
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
'Load more' not found
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
Scrolling again
'Load more' not found
It took 123.35918164253235 seconds for selenium to get the html contents
Now parsing with beautiful soup
The whole parallel Beautiful Soup process took 16.833760499954224
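
Because check-in and check-out dates are parsed from each page's URL, every HTML page must stay paired with the URL that produced it. A minimal sketch with a stand-in fetch function (`fake_fetch` is hypothetical, no browser involved) shows that `executor.map` returns results in input order even when a later task finishes first.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for fetch_booking_html: the first URL is deliberately slower."""
    time.sleep(0.05 if url.endswith("a") else 0.01)
    return f"<html>{url}</html>"


urls = ["https://example.com/a", "https://example.com/b"]

with ThreadPoolExecutor(max_workers=2) as executor:
    pages = list(executor.map(fake_fetch, urls))

# pages[i] belongs to urls[i], regardless of which fetch finished first
for url, page in zip(urls, pages):
    print(url, "->", page)
```

Iterating over futures with `as_completed` would not give this guarantee, so the URL would need to be carried along with each result in that case.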


In [36]:
total_accommodations_booking_df.head()

Unnamed: 0,query_date,checkin,checkout,n_adults_search,n_children_search,n_rooms_search,name,url,price_currency,total_price_amount,...,close_to_metro,sustainability_cert,room_type,double_bed,single_bed,free_cancellation,breakfast_included,pay_at_hotel,location_score,free_taxi
0,2024-11-02 20:38:04.015543,2024-11-02,2024-11-04,2,0,1,Catalonia Albeniz,https://www.booking.com/hotel/es/cataloniaalbe...,€,262,...,Yes,Yes,Habitación Doble - 1 o 2 camas,Yes,Yes,No,No,No,,No
1,2024-11-02 20:38:04.044726,2024-11-02,2024-11-04,2,0,1,Catalonia Sagrada Familia,https://www.booking.com/hotel/es/cataloniaarag...,€,292,...,Yes,Yes,Habitación Doble - 1 o 2 camas,Yes,Yes,No,No,No,,No
2,2024-11-02 20:38:04.145106,2024-11-02,2024-11-04,2,0,1,Catalonia Atenas,https://www.booking.com/hotel/es/cataloniatena...,€,285,...,Yes,Yes,Habitación Doble - 1 o 2 camas,Yes,Yes,No,No,No,,No
3,2024-11-02 20:38:04.150123,2024-11-02,2024-11-04,2,0,1,Catalonia Park Güell,https://www.booking.com/hotel/es/catalonia-par...,€,222,...,Yes,Yes,Habitación Doble - 1 o 2 camas,Yes,Yes,No,No,No,,No
4,2024-11-02 20:38:04.154201,2024-11-02,2024-11-04,2,0,1,Roma Reial,https://www.booking.com/hotel/es/roma-reial.es...,€,232,...,Yes,No,Habitación Doble - 2 camas,No,Yes,No,No,No,9.4,No


In [37]:
total_accommodations_booking_df.to_parquet("../data/accommodations/booking.parquet")

#### 2.2.1.2 Booking - extra information
Ideally, we would have information not only from the search results but also from each accommodation's own page: available dates, rooms, and characteristics. This does not seem obtainable from the search cards, and it is unclear whether it can be scraped directly from the individual accommodation URLs.

#### Do I need extra information from the individual page links?

### 2.2.2 Airbnb

Left for later, if possible. For the most part, listings give no information about distance to the city center. Opening each listing might provide latitude and/or longitude.

In [None]:
destination_city = "barcelona"
checkin_date = "2024-11-01"
checkout_date = "2024-11-03"
n_adults = 2
url = f"https://www.airbnb.com/s/{destination_city}/homes?checkin={checkin_date}&checkout={checkout_date}&adults={n_adults}"

driver = webdriver.Chrome()

driver.get(url)
driver.maximize_window()
WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.XPATH, "//button[text()='Aceptar todas']"))
).click()

html_page = driver.page_source
# pagination = driver.find_element(By.XPATH, "//nav[aria-label()='Paginación de resultados de búsqueda']")
# pagination_elements = pagination.find_elements(By.TAG_NAME,"a")
# pagination_elements


In [None]:
soup = BeautifulSoup(html_page,"html.parser")
soup

In [None]:
len(soup.find("div",{"style":"display: contents;"}).find_all("div",{"itemprop":"itemListElement"}))

In [None]:
soup.find("div",{"style":"display: contents;"}).find_all("div",{"itemprop":"itemListElement"})[0]

Would it be possible to open every Airbnb listing and extract its price, details, and availability directly?

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from concurrent.futures import ThreadPoolExecutor

In [None]:
def click_button(xpath):
    try:
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, xpath))
        ).click()
    except TimeoutException:
        print(f"Button with XPath '{xpath}' not found.")

# XPaths for the buttons
xpath_cerrar = "//button[@aria-label='Cerrar']"
xpath_aceptar_todas = "//button[text()='Aceptar todas']"


In [None]:
url = "https://www.airbnb.es/rooms/29413203?adults=2&search_mode=regular_search&check_in=2024-11-01&check_out=2024-11-03&source_impression_id=p3_1730487404_P321LpZsGLiX0DKt&previous_page_section_name=1000&federated_search_id=1293e1b7-d1cd-4837-a262-b1013a1b5764"
# response = requests.get(url)
driver = webdriver.Chrome()

driver.get(url)
with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(click_button, xpath_cerrar)
    executor.submit(click_button, xpath_aceptar_todas)

driver.maximize_window()


html_page = driver.page_source
soup_bs = BeautifulSoup(html_page, "html.parser")

In [None]:
soup_bs

In [None]:
calendar = soup_bs.find("div",{"data-plugin-in-point-id":"AVAILABILITY_CALENDAR_INLINE"})

In [None]:
calendar

In [None]:
available_dates = []
# Loop through all divs that carry a 'data-testid' attribute
for day in calendar.find_all('div', attrs={'data-testid': True}):
    # Keep only calendar-day cells that are not blocked
    if day['data-testid'].startswith('calendar-day') and day.get('data-is-day-blocked') != 'true':
        # Named day_str so it does not shadow `date` imported from datetime
        day_str = day['data-testid'].replace('calendar-day-', '')
        available_dates.append(day_str)

# Output available dates
print("Available dates:", available_dates)

## 2.3 Activities

The functions used for this process were inspired by a previous project. To optimize the extraction of such a large amount of information, multithreading and parallel processing have been implemented in variations of the original function, in order to assess the impact of this optimization as an additional exploration within this project.
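One way to compare the variants is to time each of them on the same inputs. This is a minimal sketch, not the project's code: `fake_extract` is a hypothetical stand-in for one I/O-bound page request, and only a sequential and a multithreaded runner are shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_extract(city):
    """Hypothetical stand-in for one I/O-bound page request."""
    time.sleep(0.05)  # simulate network latency
    return {"city": city, "n_chars": len(city)}


def run_sequential(cities):
    return [fake_extract(city) for city in cities]


def run_multithread(cities, max_workers=4):
    # Threads suit I/O-bound work: the waits on the network overlap
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the order of the inputs
        return list(executor.map(fake_extract, cities))


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


demo_cities = ["barcelona", "bilbao", "sevilla", "valencia"]
seq_result, seq_seconds = timed(run_sequential, demo_cities)
thr_result, thr_seconds = timed(run_multithread, demo_cities)
print(f"sequential: {seq_seconds:.2f}s, multithread: {thr_seconds:.2f}s")
```

Checking that both runners return the same rows before comparing times keeps the comparison honest: a faster variant that silently drops results is not an improvement.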

In [17]:
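cities = ['barcelona','bilbao','sevilla','valencia']
cities = ['barcelona']  # overrides the full list: single city for this run
# start of the query window: 25 days from now (kept under the name `today`)
today = str(date.today() + timedelta(days=25))
# end of the query window: 38 days from now
end_date = str(date.today() + timedelta(days=38))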
today

'2024-11-27'

Availability can only be queried for a limited window of days. That forces the number of iterations to be roundup(days(end_date - start_date) / window), which is something that can be handled by the main function.

Updating the functions in `src/data_extraction_support.py` accordingly.

The window was first assumed to be 15 days. Actually, it has to be 7 days.
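The splitting into windows can be sketched as follows. This is an illustration rather than the code in `src/data_extraction_support.py`: `split_into_windows` is a hypothetical helper, and treating both ends of each window as inclusive is an assumption.

```python
from datetime import date, timedelta


def split_into_windows(start, end, window_days=7):
    """Split [start, end] (both inclusive) into consecutive windows
    of at most window_days days each."""
    windows = []
    window_start = start
    while window_start <= end:
        window_end = min(window_start + timedelta(days=window_days - 1), end)
        windows.append((window_start, window_end))
        window_start = window_end + timedelta(days=1)
    return windows


windows = split_into_windows(date(2024, 11, 27), date(2024, 12, 10))
for window_start, window_end in windows:
    print(window_start, "->", window_end)
```

With these dates the 14-day range yields two windows of 7 days each; a range that does not divide evenly ends with a shorter last window.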

In [18]:
# result_df_parallel = des.activities_civitatis_extract_all_activites_parallel(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_parallel

In [19]:
# result_df_multithread = des.activities_civitatis_extract_all_activites_multithread(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_multithread

In [20]:
# result_df_multithread_selenium = des.activities_civitatis_extract_all_activites_multithread_selenium(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_multithread_selenium

In [21]:
result_df_parallel_selenium = des.activities_civitatis_extract_all_activites_parallel_selenium(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

result_df_parallel_selenium

NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=130.0.6723.92)


In [14]:
result_df_parallel_selenium.isna().sum()

query_date                    0
activity_date_range_start     0
activity_date_range_end       0
activity_name                 2
description                   0
url                           2
image                         0
image2                       44
available_days               20
available_times              20
duration                     36
latitude                      2
longitude                     2
price                         3
currency                      3
category                      2
spanish                      76
dtype: int64

In [15]:
# result_df_parallel_selenium_optimized = des.activities_civitatis_extract_all_activites_parallel_selenium_optimized(cities_list=cities, date_start=today, date_end=end_date,verbose=False)

# result_df_parallel_selenium_optimized

### Time and results comparison

In [21]:
# print(f"The shape of the multithread soup is {result_df_multithread.shape[0]}.")
# print("Multithread soup has these null values:")
# display(result_df_multithread.isna().sum())

# print(f"\nThe shape of the multithread soup + selenium concurrent is {result_df_multithread_selenium.shape[0]}.")
# print("Multithread soup + selenium concurrent has these null values:")
# display(result_df_multithread_selenium.isna().sum())

# print(f"\nThe shape of the parallel soup is {result_df_parallel.shape[0]}.")
# print("Parallel soup has these null values:")
# display(result_df_parallel.isna().sum())

print(f"\nThe shape of the parallel soup + selenium is {result_df_parallel_selenium.shape[0]}.")
print("Parallel soup + selenium has these null values:")
display(result_df_parallel_selenium.isna().sum())

# print(f"\nThe shape of the parallel soup + selenium optimized is {result_df_parallel_selenium_optimized.shape[0]}.")
# print("Parallel soup + selenium optimized has these null values:")
# display(result_df_parallel_selenium_optimized.isna().sum())


The shape of the parallel soup + selenium is 1831.
Parallel soup + selenium has these null values:


activity_name       24
description          0
url                 24
image                0
image2             364
available_days     105
available_times    105
duration           190
latitude            24
longitude           24
price               27
currency            27
category            24
spanish            414
dtype: int64

### 2.3.1 Save results for transformation experiments

In [31]:
result_df_parallel_selenium.to_parquet("../data/activities/90_day_availability.parquet")

NameError: name 'result_df_parallel_selenium' is not defined

## 2.4 Special dates and events

In [42]:
import aiohttp
import asyncio
import pandas as pd

BASE_URL = "https://calendarific.com/api/v2/holidays"

cities_dict = {
    "Barcelona": "CT",     # Catalonia
    "Sevilla": "AN",       # Andalusia
    "Bilbao": "PV",        # Basque Country
    "Valencia": "VC"       # Valencian Community
}

async def fetch_holidays(city, region_code):
    params = {
        "api_key": CALENDARIFIC_API_KEY,
        "country": "ES",
        "year": 2024,
        "location": region_code  # E.g., "CT" for Catalonia, "AN" for Andalusia
    }
    async with aiohttp.ClientSession() as session:
        async with session.get(BASE_URL, params=params) as response:
            data = await response.json()
            return {city: data["response"]["holidays"]}

async def get_cities_dates(cities_dict):

    tasks = [fetch_holidays(city, region) for city, region in cities_dict.items()]
    # One {city: [holiday, ...]} dict per city, in the order of cities_dict
    results = await asyncio.gather(*tasks)

    return results



In [43]:
cities_dates = await get_cities_dates(cities_dict)

In [47]:
cities_dates[0]

{'Barcelona': [{'name': "New Year's Day",
   'description': 'New Year’s Day is the first day of the year, or January 1, in the Gregorian calendar.',
   'country': {'id': 'es', 'name': 'Spain'},
   'date': {'iso': '2024-01-01',
    'datetime': {'year': 2024, 'month': 1, 'day': 1}},
   'type': ['National holiday'],
   'primary_type': 'National holiday',
   'canonical_url': 'https://calendarific.com/holiday/spain/new-year-day',
   'urlid': 'spain/new-year-day',
   'locations': 'All',
   'states': 'All'},
  {'name': 'Reconquest Day',
   'description': 'Reconquest Day is a observance in Spain',
   'country': {'id': 'es', 'name': 'Spain'},
   'date': {'iso': '2024-01-02',
    'datetime': {'year': 2024, 'month': 1, 'day': 2}},
   'type': ['Observance'],
   'primary_type': 'Observance',
   'canonical_url': 'https://calendarific.com/holiday/spain/reconquest-day',
   'urlid': 'spain/reconquest-day',
   'locations': 'All',
   'states': 'All'},
  {'name': 'Epiphany',
   'description': 'Epiphany is
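
The nested structure shown above can be flattened into one row per city and holiday. This is a sketch under assumptions: `holidays_to_rows` is a hypothetical helper, it keeps only the `name`, `date.iso` and `primary_type` fields visible in the output, and the sample input is two entries copied from that output.

```python
def holidays_to_rows(cities_dates):
    """Flatten [{city: [holiday, ...]}, ...] into one dict per city and holiday."""
    rows = []
    for city_result in cities_dates:
        for city, holidays in city_result.items():
            for holiday in holidays:
                rows.append({
                    "city": city,
                    "date": holiday["date"]["iso"],
                    "name": holiday["name"],
                    "primary_type": holiday["primary_type"],
                })
    return rows


# Two entries copied from the output above, reduced to the fields used here
sample = [{"Barcelona": [
    {"name": "New Year's Day", "date": {"iso": "2024-01-01"},
     "primary_type": "National holiday"},
    {"name": "Reconquest Day", "date": {"iso": "2024-01-02"},
     "primary_type": "Observance"},
]}]

rows = holidays_to_rows(sample)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would then give a table that can be filtered by `primary_type`, for example to keep national holidays only.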

## 2.5 Weather forecast and historical

### 2.5.1 Get cities latitude and longitude

In [56]:
import aiohttp
import asyncio


cities_list = ["barcelona","seville","bilbao","valencia"]

# Asynchronous function to fetch latitude and longitude for a city.
# Retries a limited number of times on HTTP 429 instead of recursing without bound.
async def get_lat_lon(city, country="Spain", max_retries=3):
    url = "https://nominatim.openstreetmap.org/search"
    params = {
        "city": city,
        "country": country,
        "format": "json",
        "limit": 1
    }
    async with aiohttp.ClientSession() as session:
        for attempt in range(max_retries):
            async with session.get(url, params=params) as response:
                if response.status == 429:
                    print(f"Rate limit hit for {city}. Retrying...")
                    await asyncio.sleep(1)  # Wait for a second before retrying
                    continue  # Retry the same request, same city and country
                if response.status == 200:
                    data = await response.json()
                    if data:
                        # The API returns the coordinates as strings
                        return city, data[0]["lat"], data[0]["lon"]
                break  # Any other status, or an empty result: give up
    return city, None, None

# Main function to handle multiple cities asynchronously
async def get_cities_coordinates(cities_list): 

    tasks = [get_lat_lon(city) for city in cities_list]
    results = await asyncio.gather(*tasks)
    
    for city, lat, lon in results:
        print(f"City: {city}, Latitude: {lat}, Longitude: {lon}")

    return results


In [58]:
# Run the asynchronous main function
results = await get_cities_coordinates(cities_list)

City: barcelona, Latitude: 41.3828939, Longitude: 2.1774322
City: seville, Latitude: 37.3886303, Longitude: -5.9953403
City: bilbao, Latitude: 43.2630018, Longitude: -2.9350039
City: valencia, Latitude: 39.4697065, Longitude: -0.3763353


In [59]:
results

[('barcelona', '41.3828939', '2.1774322'),
 ('seville', '37.3886303', '-5.9953403'),
 ('bilbao', '43.2630018', '-2.9350039'),
 ('valencia', '39.4697065', '-0.3763353')]
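
The coordinates come back as strings in `(city, lat, lon)` tuples, while the weather functions further down iterate over a `{city: (lat, lon)}` dictionary. A minimal sketch of the conversion, assuming a hypothetical helper `coordinates_to_dict` and a made-up `unknown-city` entry to show the unresolved case:

```python
def coordinates_to_dict(results):
    """Turn (city, lat, lon) tuples with string coordinates
    into a {city: (lat, lon)} dictionary of floats."""
    cities = {}
    for city, lat, lon in results:
        if lat is None or lon is None:
            continue  # skip cities the geocoder could not resolve
        cities[city] = (float(lat), float(lon))
    return cities


sample_results = [
    ("barcelona", "41.3828939", "2.1774322"),
    ("seville", "37.3886303", "-5.9953403"),
    ("unknown-city", None, None),  # hypothetical unresolved city
]
cities_coordinates = coordinates_to_dict(sample_results)
print(cities_coordinates)
```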

In [80]:
import aiohttp
import asyncio
import pandas as pd

# Define the base URL and parameters for the forecast
BASE_URL_FORECAST = "https://api.open-meteo.com/v1/forecast"
params = {
    "daily": "temperature_2m_max,temperature_2m_min,precipitation_sum",
    "timezone": "Europe/Madrid",
    "forecast_days": 14
}

# params = {
#     "daily": "temperature_2m_max,temperature_2m_min,wind_speed_10m_max,wind_speed_10m_min,wind_direction_10m_dominant,precipitation_sum,precipitation_hours,snowfall_sum,snow_depth,humidity_2m_max,humidity_2m_min,pressure_msl,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,shortwave_radiation_sum,sunrise,sunset,uv_index_max",
#     "timezone": "Europe/Madrid",
#     "forecast_days": 14
# }

# params = {
#     "daily": ",".join([
#         # Temperature
#         "temperature_2m_max",
#         "temperature_2m_min",
        
#         # Wind
#         "wind_speed_10m_max",
#         "wind_speed_10m_min",
#         "wind_direction_10m_dominant",
        
#         # Precipitation
#         "precipitation_sum",
#         "precipitation_hours",
        
#         # Snow
#         "snowfall_sum",
#         "snow_depth",
        
#         # Humidity and Pressure
#         "humidity_2m_max",
#         "humidity_2m_min",
#         "pressure_msl",
        
#         # Cloud Cover and Radiation
#         "cloudcover",
#         "cloudcover_low",
#         "cloudcover_mid",
#         "cloudcover_high",
#         "shortwave_radiation_sum",
        
#         # Sun and UV Index
#         "sunrise",
#         "sunset",
#         "uv_index_max"
#     ]),
#     "timezone": "Europe/Madrid",
#     "forecast_days": 14
# }



async def fetch_forecast(city, latitude, longitude):
    async with aiohttp.ClientSession() as session:
        async with session.get(BASE_URL_FORECAST, params={**params, "latitude": latitude, "longitude": longitude}) as response:
            data = await response.json()
            return {city: data.get("daily", {})}

async def get_forecast(cities):
    tasks = [fetch_forecast(city, lat, lon) for city, (lat, lon) in cities.items()]
    results = await asyncio.gather(*tasks)
    
    # Consolidate results into a DataFrame
    forecast_data = {city: result[city] for result in results for city in result}
    all_forecasts = []
    for city, daily_data in forecast_data.items():
        city_df = pd.DataFrame(daily_data)
        city_df["City"] = city
        all_forecasts.append(city_df)
    
    forecast_df = pd.concat(all_forecasts, ignore_index=True)
    return forecast_df



In [81]:
params

{'daily': 'temperature_2m_max,temperature_2m_min,precipitation_sum',
 'timezone': 'Europe/Madrid',
 'forecast_days': 14}

In [82]:
cities_dict = {
    "Valencia": (39.4699, -0.3763),
    "Barcelona": (41.3851, 2.1734),
    "Sevilla": (37.3886, -5.9823),
    "Bilbao": (43.2630, -2.9350),
    "Madrid": (40.4168, -3.7038)
}

forecast = await get_forecast(cities_dict)
forecast

Unnamed: 0,time,temperature_2m_max,temperature_2m_min,precipitation_sum,City
0,2024-11-02,20.5,15.1,0.4,Valencia
1,2024-11-03,20.9,15.0,5.2,Valencia
2,2024-11-04,20.9,17.0,2.7,Valencia
3,2024-11-05,22.8,16.3,0.0,Valencia
4,2024-11-06,20.9,15.5,3.4,Valencia
...,...,...,...,...,...
65,2024-11-11,15.3,11.0,0.0,Madrid
66,2024-11-12,14.8,9.9,0.3,Madrid
67,2024-11-13,15.3,9.4,0.0,Madrid
68,2024-11-14,14.4,9.0,0.0,Madrid


In [69]:
BASE_URL_HISTORICAL = "https://api.open-meteo.com/v1/era5"
years = range(2018, 2023 + 1)

async def fetch_historical(city, latitude, longitude, year):
    async with aiohttp.ClientSession() as session:
        start_date = f"{year}-01-01"
        end_date = f"{year}-12-31"
        params = {
            "latitude": latitude,
            "longitude": longitude,
            "start_date": start_date,
            "end_date": end_date,
            "daily": "temperature_2m_max,temperature_2m_min,precipitation_sum",
            "timezone": "Europe/Madrid"
        }
        async with session.get(BASE_URL_HISTORICAL, params=params) as response:
            data = await response.json()
            return {city: {year: data.get("daily", {})}}

async def get_historical(cities):
    tasks = [
        fetch_historical(city, lat, lon, year) 
        for city, (lat, lon) in cities.items()
        for year in years
    ]
    results = await asyncio.gather(*tasks)
    
    # Consolidate results into a DataFrame
    historical_data = {}
    for result in results:
        for city, year_data in result.items():
            if city not in historical_data:
                historical_data[city] = {}
            historical_data[city].update(year_data)

    all_historical = []
    for city, yearly_data in historical_data.items():
        city_df = pd.concat({year: pd.DataFrame(data) for year, data in yearly_data.items()}, names=["Year", "Day"])
        city_df["City"] = city
        all_historical.append(city_df)

    historical_df = pd.concat(all_historical).reset_index(level="Day").reset_index()
    return historical_df



In [70]:
historical = await get_historical(cities_dict)
historical

Unnamed: 0,Year,Day,City

The result is empty: none of the responses contained a `daily` block, so `data.get("daily", {})` produced an empty frame for every city and year. The next cell queries the Open-Meteo Archive API host instead.


In [86]:
import aiohttp
import asyncio
import pandas as pd

# Base URL for Open-Meteo Archive API
BASE_URL = "https://archive-api.open-meteo.com/v1/archive"

# Define parameters template (will be updated for each city)
params_template = {
    "start_date": "2023-10-17",
    "end_date": "2024-10-31",
    "daily": [
        "weather_code", "temperature_2m_max", "temperature_2m_min", "temperature_2m_mean",
        "apparent_temperature_max", "apparent_temperature_min", "apparent_temperature_mean",
        "sunrise", "sunset", "daylight_duration", "sunshine_duration",
        "precipitation_sum", "rain_sum", "snowfall_sum", "precipitation_hours",
        "wind_speed_10m_max", "wind_gusts_10m_max", "wind_direction_10m_dominant",
        "shortwave_radiation_sum", "et0_fao_evapotranspiration"
    ],
    "timezone": "Europe/Madrid"
}

# Example dictionary of cities with coordinates
cities_dict = {
    "Valencia": (39.4699, -0.3763),
    "Barcelona": (41.3851, 2.1734),
    "Sevilla": (37.3886, -5.9823),
    "Bilbao": (43.2630, -2.9350),
    "Madrid": (40.4168, -3.7038)
}

# Asynchronous function to fetch weather data for a single city
async def fetch_weather_data(url, city, latitude, longitude):
    params = params_template.copy()
    params.update({
        "latitude": latitude,
        "longitude": longitude
    })
    
    async with aiohttp.ClientSession() as session:
        async with session.get(url, params=params) as response:
            if response.status == 200:
                data = await response.json()
                if "daily" in data:
                    df = pd.DataFrame(data["daily"])
                    df["City"] = city  # Add city name to the DataFrame
                    return df
                else:
                    print(f"No daily data for {city}")
                    return pd.DataFrame()  # Return empty DataFrame if no data
            else:
                print(f"Error {response.status} for {city}")
                return pd.DataFrame()  # Return empty DataFrame on error

# Main function to get data for all cities and return it as a consolidated DataFrame
async def get_weather_data_for_cities(cities_dict):
    tasks = [
        fetch_weather_data(BASE_URL, city, lat, lon)
        for city, (lat, lon) in cities_dict.items()
    ]
    results = await asyncio.gather(*tasks)
    
    # Concatenate all city DataFrames into one
    all_cities_df = pd.concat(results, ignore_index=True)
    return all_cities_df

# Run the asynchronous main function
weather_df = await get_weather_data_for_cities(cities_dict)
weather_df


Unnamed: 0,time,weather_code,temperature_2m_max,temperature_2m_min,temperature_2m_mean,apparent_temperature_max,apparent_temperature_min,apparent_temperature_mean,sunrise,sunset,...,precipitation_sum,rain_sum,snowfall_sum,precipitation_hours,wind_speed_10m_max,wind_gusts_10m_max,wind_direction_10m_dominant,shortwave_radiation_sum,et0_fao_evapotranspiration,City
0,2023-10-17,3,25.8,17.7,21.6,27.2,19.0,23.3,2023-10-17T07:13,2023-10-17T18:20,...,0.0,0.0,0.0,0.0,14.0,28.8,13,9.31,1.95,Valencia
1,2023-10-18,3,29.2,17.9,23.8,25.4,18.7,22.5,2023-10-18T07:14,2023-10-18T18:18,...,0.0,0.0,0.0,0.0,29.8,54.7,252,12.72,4.84,Valencia
2,2023-10-19,63,29.1,19.0,23.8,26.2,17.6,21.7,2023-10-19T07:15,2023-10-19T18:17,...,4.7,4.7,0.0,3.0,27.8,51.8,226,8.50,4.74,Valencia
3,2023-10-20,51,22.5,17.1,19.5,17.6,13.1,15.5,2023-10-20T07:16,2023-10-20T18:16,...,0.6,0.6,0.0,3.0,39.8,70.9,259,9.70,4.50,Valencia
4,2023-10-21,3,23.7,15.0,18.6,19.5,11.4,14.9,2023-10-21T07:17,2023-10-21T18:14,...,0.0,0.0,0.0,0.0,22.5,41.8,264,13.98,4.62,Valencia
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1900,2024-10-27,61,13.0,3.2,8.4,8.8,0.3,5.2,2024-10-27T07:39,2024-10-27T18:17,...,2.9,2.9,0.0,5.0,19.2,33.8,32,10.83,1.81,Madrid
1901,2024-10-28,51,18.5,7.9,12.7,15.9,4.6,10.0,2024-10-28T07:40,2024-10-28T18:16,...,1.3,1.3,0.0,5.0,20.5,39.6,50,12.04,2.17,Madrid
1902,2024-10-29,3,16.7,12.1,14.1,14.1,8.8,11.1,2024-10-29T07:41,2024-10-29T18:15,...,0.0,0.0,0.0,0.0,29.9,59.8,35,0.00,0.97,Madrid
1903,2024-10-30,3,18.7,10.7,14.3,17.3,7.4,12.2,2024-10-30T07:42,2024-10-30T18:13,...,0.0,0.0,0.0,0.0,21.1,49.3,75,0.00,0.98,Madrid


## 3. Comments on updates

I can keep the results of every query together with the date it was made. This can feed the analysis of how prices, availability, etc. change with the time and date of the query. However, I will need to update the information constantly, which can be costly in resources and API calls.
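
Keeping every query alongside its timestamp can be sketched as below. The names and prices are made up for illustration: `tag_with_query_time` is a hypothetical helper and the `history` list stands in for a database table that only ever grows.

```python
from datetime import datetime, timezone


def tag_with_query_time(records, queried_at=None):
    """Return copies of the records with the time of the query attached."""
    queried_at = queried_at or datetime.now(timezone.utc)
    stamp = queried_at.isoformat()
    return [{**record, "query_datetime": stamp} for record in records]


history = []  # stands in for an append-only database table

# Made-up results of the same query on two consecutive days
first_run = [{"flight": "MAD-BCN", "price": 54.0}]
second_run = [{"flight": "MAD-BCN", "price": 61.5}]

history.extend(tag_with_query_time(
    first_run, datetime(2024, 11, 2, 9, 0, tzinfo=timezone.utc)))
history.extend(tag_with_query_time(
    second_run, datetime(2024, 11, 3, 9, 0, tzinfo=timezone.utc)))

for row in history:
    print(row)
```

Because rows are appended rather than overwritten, the same flight appears once per query, which is what makes the price-over-time analysis possible and also what drives the storage cost.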

## 3.1 Flights updates

For an analysis, it would be interesting to run them hourly over a 4-week period.

However, for the current task of comparing several weekends, I will do it just once.

Nevertheless, it could be worthwhile to keep daily results and update flight information only on user request, so that the results can be reused for similar requests.

The information on when flights are cheapest can be used to advise the user on when to check them, apart from the target-price notification that could be provided.
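
As a sketch of how such advice could be derived, the query hours can be ranked by their average observed price. The observations below are made up, and `cheapest_query_hour` is a hypothetical helper, not part of the project.

```python
from collections import defaultdict
from statistics import mean


def cheapest_query_hour(observations):
    """Return (hour, average price) for the query hour with the lowest mean price."""
    prices_by_hour = defaultdict(list)
    for hour, price in observations:
        prices_by_hour[hour].append(price)
    averages = {hour: mean(prices) for hour, prices in prices_by_hour.items()}
    best_hour = min(averages, key=averages.get)
    return best_hour, averages[best_hour]


# Made-up (query hour, price in EUR) pairs for a single flight
observations = [(9, 60.0), (9, 64.0), (15, 55.0), (15, 57.0), (22, 58.0)]
best_hour, best_price = cheapest_query_hour(observations)
print(best_hour, best_price)
```

With so few observations the ranking is only indicative; the hourly collection over several weeks mentioned above is what would make it meaningful.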

It would be ideal to scrape independent flight providers to get better results and not depend on an API, although the latter is much faster.

## 3.2 Accommodations updates

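Booking limitations
- To get more specific information about prices, I have to check every individual date.
- To get more specific information about types of rooms and options, I have to check every individual date.
- To get more specific information about availabilities, I have to check every individual date.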

It would be ideal to obtain this information via an API or by exploring the website more thoroughly.

Airbnb limitations
- I will mostly need to search all dates in all cities, go through every results page and then open each listing, which is heavy.
- To get more specific information about prices, I have to check every individual date.
- To get more specific information about types of rooms and options, I have to check every individual date.
- To get more specific information about availabilities, I have to check every individual date.
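
To get a feeling for how heavy the Airbnb crawl is, the page loads can be counted up front. The numbers below are purely illustrative and `estimate_airbnb_requests` is a hypothetical helper; real page and listing counts would have to be measured.

```python
def estimate_airbnb_requests(n_date_ranges, n_cities, pages_per_search, listings_per_page):
    """Count page loads: every results page plus every listing page behind it."""
    n_searches = n_date_ranges * n_cities
    search_pages = n_searches * pages_per_search
    listing_pages = search_pages * listings_per_page
    return search_pages + listing_pages


# Illustrative numbers only: 4 weekends, 4 cities,
# 15 result pages per search, 18 listings per page
total = estimate_airbnb_requests(4, 4, 15, 18)
print(total)
```

The listing pages dominate the total, so anything that avoids opening each listing reduces the cost far more than trimming the number of searches.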

Explore as soon as I can save everything on the database.

## 3.3 Activities updates

Can be done every day.