# **Homework 1 - Text Mining**

## Group Members:
1. **Tarang Kadyan**  
   <tarang.kadyan@bse.eu>

2. **Oriol Gelabert**  
   <oriol.gelabert@bse.eu>

3. **Enzo Infantes**  
   <enzo.infantes@bse.eu>

<img src='https://upload.wikimedia.org/wikipedia/commons/4/41/BSE_primary_logo_color.jpg' width=300 />

# **1. Libraries**

In [9]:
import os
import time

import pandas as pd
import numpy as np
import re

from selenium.webdriver.common.keys import Keys
from packages.selenium_setup import *
from packages.scraper import BookingScraper
from packages.dataloading import DataCollection

# **2. Searching**
This contains past preferences such as passwords, cookie acceptance etc.

We open the **Booking.com** website to start our search. \
Note: This step uses the functions from `selenium_setup.py`.

In [2]:
dfolder= os.getcwd()
geko_path = os.path.join(dfolder, 'geckodriver.exe')
link='https://www.booking.com/index.es.html'

browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)

**Reject Cookies**: The following cell rejects the cookies, which closes the banner and avoids possible interference when scraping.

In [3]:
path_cookies='//button[@id="onetrust-reject-all-handler"]'
cookies= browser.find_elements('xpath',path_cookies)
cookies[0].click()
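
`find_elements` returns an empty list when nothing matches, so indexing `[0]` as in the cell above would raise an `IndexError` if the banner never appears. A minimal sketch of a guard follows. The helper name `click_if_present` is an assumption made for illustration and is not part of `selenium_setup.py`.

```python
def click_if_present(browser, xpath):
    """Click the first element matching `xpath`; return True if one was found.

    `browser` is assumed to be a Selenium WebDriver (anything exposing
    `find_elements(by, value)` works). Hypothetical helper, not taken from
    the project's packages.
    """
    matches = browser.find_elements('xpath', xpath)  # empty list if no match
    if not matches:
        return False
    matches[0].click()
    return True
```

With this helper, `click_if_present(browser, path_cookies)` could stand in for the last two lines of the cell above.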

The **Google Log-In Pop-Up** is inside an iframe object, so we first switch to the iframe and then locate the element we want to click, in this case the close button of the pop-up.

In [4]:
try:
    iframe_google = browser.find_elements(By.TAG_NAME, 'iframe')[0]  # find the first iframe on the page
    browser.switch_to.frame(iframe_google)  # switch into the iframe
    close_log_in = browser.find_element(By.CSS_SELECTOR, '#close')  # find the close button
    close_log_in.click()  # click it to close the pop-up
except Exception:
    print('No Google LogIn found')

browser.switch_to.default_content()  # switch back to the main document to continue scraping
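
The switch-in, act, switch-back sequence above can also be wrapped in a context manager so that the browser returns to the main document even when the click fails. This is only a sketch: `inside_frame` is a hypothetical helper name, and `browser` is assumed to be a Selenium WebDriver.

```python
from contextlib import contextmanager

@contextmanager
def inside_frame(browser, frame):
    """Switch into `frame` and always switch back to the main document."""
    browser.switch_to.frame(frame)
    try:
        yield browser
    finally:
        # Runs on normal exit and on exceptions raised inside the `with` body.
        browser.switch_to.default_content()

# Usage (assuming `browser`, `By`, and `iframe_google` from the cell above):
# with inside_frame(browser, iframe_google):
#     browser.find_element(By.CSS_SELECTOR, '#close').click()
```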

## **2.1 Booking Scraper - Initial Results**
In this section, we are using the `BookingScraper` class from `scraper.py`. Inside it, the following steps are performed:

- Define and search for our destination (city).
- Specify the month and the exact date (month and day).
- Click the search button.

We select the dates of the event we want to track, in our case MWC. This event is held from 3 to 6 March, so our initial search covers the week of 01 to 08 March.
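
The search window can be derived from the event dates rather than typed by hand. The sketch below assumes a padding of two days on each side, which reproduces the 01 to 08 March window, and returns strings in the `MM-DD` form passed as `from_date` and `to_date`. The function name `search_window` and the `pad_days` parameter are hypothetical.

```python
from datetime import date, timedelta

def search_window(event_start, event_end, pad_days=2):
    """Return (from_date, to_date) as 'MM-DD' strings around an event."""
    if event_end < event_start:
        raise ValueError("event_end must not precede event_start")
    pad = timedelta(days=pad_days)
    first_day = event_start - pad
    last_day = event_end + pad
    return first_day.strftime("%m-%d"), last_day.strftime("%m-%d")

# MWC 2025: 3 to 6 March
from_date, to_date = search_window(date(2025, 3, 3), date(2025, 3, 6))
print(from_date, to_date)  # prints: 03-01 03-08
```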

In [5]:
# Pipeline to filter the destination, month, and year of the trip, and perform the search.
scraper = BookingScraper(browser)
scraper.run_pipeline(place="Barcelona", 
                     target_month="marzo", 
                     target_year="2025",
                     from_date="03-01", 
                     to_date="03-08")

Message: Unable to locate element: /html/body/div[3]/div[2]/div/form/div/div[4]/button/span; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:511:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:136:16

Message: Unable to locate element: /html/body/div[3]/div[2]/div/form/div/div[4]/button/span; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.
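
The message above is Selenium's no-such-element error, raised for an absolute XPath (`/html/body/div[3]/...`). Absolute paths depend on the exact page layout and tend to break when it changes. One possible mitigation, sketched here as an assumption rather than a description of what `BookingScraper` does, is to try several candidate locators in order. The helper `find_first` and the first XPath in the usage comment are hypothetical.

```python
def find_first(browser, xpaths):
    """Return the first element matched by any XPath in `xpaths`, else None."""
    for xpath in xpaths:
        matches = browser.find_elements('xpath', xpath)  # no exception on a miss
        if matches:
            return matches[0]
    return None

# Usage (the first locator is a guess and would need to be checked
# against the live page; the second is the path from the error message):
# element = find_first(browser, [
#     '//form//button[@type="submit"]/span',
#     '/html/body/div[3]/div[2]/div/form/div/div[4]/button/span',
# ])
```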

A **Genius pop-up** might appear; if it does, we close it:

In [6]:
def close_genius():
    '''
    Wait about 5 seconds for the content to load, then close the Genius banner.
    If there is no Genius banner, no action is performed.
    '''
    try:
        time.sleep(5)
        path_close_genius = '//div[@class="abcc616ec7 cc1b961f14 c180176d40 f11eccb5e8 ff74db973c"]'
        close_buttons = browser.find_elements('xpath', path_close_genius)
        close_buttons[0].click()  # raises IndexError when the banner is absent
    except Exception:
        print("No Genius Banner")

close_genius()

No Genius Banner
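
A fixed `time.sleep(5)` always costs the full five seconds, even when the banner shows up immediately. Polling is a common alternative. Selenium's own `WebDriverWait` provides this; the hand-rolled version below is only a dependency-free sketch of the idea, and `wait_for_elements` is a hypothetical helper name.

```python
import time

def wait_for_elements(browser, xpath, timeout=5.0, poll=0.25):
    """Poll until `xpath` matches something or `timeout` seconds elapse.

    Returns the list of matches (empty if the timeout was reached).
    """
    deadline = time.monotonic() + timeout
    while True:
        matches = browser.find_elements('xpath', xpath)
        if matches or time.monotonic() >= deadline:
            return matches
        time.sleep(poll)  # short pause before checking again
```

Inside `close_genius`, the result of `wait_for_elements(browser, path_close_genius)` could be checked for emptiness before clicking, which returns as soon as the banner is present.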


**Important Step**

Now, we want to extract information from all hotels in Barcelona. The issue is that Booking initially displays only 25 hotels, and as you scroll down, it loads up to 75 before requiring a click on the 'Load More' button.

To handle this, we create a function that scrolls to the bottom of the page, searches for this clickable button, and clicks it. This process runs in a while loop until the 'Load More' button is no longer available, indicating that all possible hotels have been displayed.

Some waiting times are added to allow the browser to load elements properly.

In [None]:
LOAD_MORE_XPATH = '//button[@class = "a83ed08757 c21c56c305 bf0537ecb5 f671049264 af7297d90d c0e0affd09"]'

def scroll_and_click():
    '''
    Scroll to the bottom of the page and click the button to load more hotels.
    The time.sleep() calls give the content time to load.
    '''
    try:
        browser.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)  # Scroll to the bottom
        time.sleep(3)  # Wait for content to load

        wait = WebDriverWait(browser, 7)  # Wait up to 7 s for the button to be clickable
        button = wait.until(EC.element_to_be_clickable((By.XPATH, LOAD_MORE_XPATH)))

        button.click()
        time.sleep(1)  # Wait for new content to load
        while True:
            try:
                browser.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)  # Scroll to the bottom
                time.sleep(0.7)  # Wait for content to load

                wait = WebDriverWait(browser, 4)  # Wait up to 4 s for the button to be clickable
                button = wait.until(EC.element_to_be_clickable((By.XPATH, LOAD_MORE_XPATH)))

                button.click()
                time.sleep(0.7)  # Wait for new content to load
            except Exception:
                # The button no longer appears: every hotel is on the page
                print("All hotels have been loaded")
                break
    except Exception:
        print('No button found')

scroll_and_click()

All hotels have been loaded
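
The loop above ends when waiting for the button times out. If the button never disappeared, for example after a layout change, the loop would run indefinitely. The sketch below shows the same click-until-gone pattern with an explicit cap. The names `click_until_gone`, `find_button`, and `max_clicks` are hypothetical and chosen only for illustration.

```python
def click_until_gone(find_button, max_clicks=200):
    """Click the button returned by `find_button()` until it returns None.

    `find_button` is any zero-argument callable that returns a clickable
    element or None. Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break  # nothing left to click
        button.click()
        clicks += 1
    return clicks
```

With Selenium, `find_button` could wrap the `WebDriverWait(...).until(...)` call in a `try`/`except` that returns `None` on timeout; the returned count also gives a rough check on how many batches of hotels were loaded.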


## **2.2 Data Loading - Final Results**
Now that all the hotels are loaded, we extract some information from them. After a quick inspection, we identify several elements that could be useful when analyzing a hotel:

* Hotel name: Can be used as an identifier.
* Price: The price of the stay at each hotel.
* Rating: The feedback given by Booking users for each hotel.
* Stars: The number of stars of the hotel.
* Distance to the center: This could indicate whether only central hotels are affected by a price increase.
* Neighbourhood: Perhaps only certain neighbourhoods are affected by the presence of an event.
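
The fields listed above can be captured in a small record type before being collected into a table. This is a sketch only: `HotelRecord` and its field names are assumptions for illustration and not necessarily how `DataCollection` stores its results. The sample values are placeholders, not scraped data.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class HotelRecord:
    name: str                             # used as the identifier
    price: Optional[float] = None         # price of the stay
    rating: Optional[float] = None        # user rating
    stars: Optional[int] = None           # official star category
    distance_km: Optional[float] = None   # distance to the city center
    neighbourhood: Optional[str] = None

# Placeholder values; fields missing on a listing simply stay None.
record = HotelRecord(name="Example Hotel", price=120.0, stars=3)
row = asdict(record)  # plain dict, ready to append to a list of rows
```

A list of such dictionaries can be passed directly to `pd.DataFrame(...)`, and the `None` defaults make missing fields explicit.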

In [10]:
scraper = DataCollection(browser)
df = scraper.get_hotel_information()

In [11]:
df.head()

| Unnamed: 0 | name | rating | stars | price | location | distance | link |
|---|---|---|---|---|---|---|---|
| 0 | chic&basic Habana Hoose | 85 | 3 | 3.023 | Ciutat Vella, Barcelona | 0.9 | https://www.booking.com/hotel/es/chic-amp-basi... |
| 1 | BARCELONA GOTIC Guesthouse | 77 | 1 | 839.0 | Ciutat Vella, Barcelona | 1.0 | https://www.booking.com/hotel/es/guesthouse-ba... |
| 2 | Axel TWO Barcelona 4 Sup - Adults Only | 83 | 4 | 2.331 | Eixample, Barcelona | 1.5 | https://www.booking.com/hotel/es/two-barcelona... |
| 3 | Barcelonaforrent The Central Place | 82 | 4 | 5.844 | Eixample, Barcelona | 0.6 | https://www.booking.com/hotel/es/barcelonaforr... |
| 4 | Travelodge Barcelona Poblenou | 73 | 1 | 1.49 | Sant Martí, Barcelona | 2.9 | https://www.booking.com/hotel/es/travelodge-ba... |
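
In the output above, prices such as `3.023` and `1.49` sit next to `839.0`. This suggests, though the raw strings are not shown and this is only an assumption, that a dot used as a thousands separator on the Spanish-language site was read as a decimal point. A minimal sketch of parsing raw price strings under that convention follows. `parse_price` and the sample strings are hypothetical.

```python
import re

def parse_price(raw):
    """Parse a price string that uses '.' for thousands and ',' for decimals."""
    digits = re.sub(r"[^\d.,]", "", raw)   # drop currency symbols and spaces
    digits = digits.replace(".", "")       # remove thousands separators
    digits = digits.replace(",", ".")      # decimal comma -> decimal point
    return float(digits)

print(parse_price("€ 3.023"))    # 3023.0
print(parse_price("839"))        # 839.0
print(parse_price("1.490,50"))   # 1490.5
```

Applying such a parser to the raw text before converting to a numeric column would keep all prices on the same scale.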
