# Navigating and Collecting Data from Airbnb Listings: A Selenium WebDriver Approach

# Introduction

This Python script employs the Selenium package to navigate and scrape data from Airbnb's website. The process is automated through a web driver that simulates a user browsing through property listings on Airbnb. Specifically, the script is designed to:

- Construct the URL for the Airbnb search page based on a specified location and date range.
- Open the URL in a browser window controlled by Selenium.
- Sequentially navigate through the first three pages of the search results.
- For each listing found, extract the URL, nightly price, number of beds, and user rating, handling instances where information may be missing.
- Store the gathered information in a pandas DataFrame and export the DataFrame to a CSV file for subsequent use or analysis.

In [2]:
#pip install selenium


In [None]:
#Import the necessary libraries and modules:
#selenium and its components for browser automation and web scraping
#time and random for randomized delays between page loads, to simulate human behavior and avoid detection
#pandas for data manipulation and for saving the scraped data to a CSV file

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
import random
import pandas as pd


In [2]:
#Setting Up the WebDriver

serv_obj = Service('C:/Users/shubh/Downloads/chromedriver-win64/chromedriver-win64/chromedriver.exe')
driver = webdriver.Chrome(service=serv_obj)


In [3]:
#Define the search parameters: the Airbnb location and the check-in and check-out dates

location = 'williamsburg-va'
checkin = '2024-03-18'
checkout = '2024-03-24'
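
Because the dates are plain strings, a typo or a reversed range would only surface once the search page loads. A minimal sketch of an up-front check, assuming ISO `YYYY-MM-DD` strings as above (`nights_between` is a hypothetical helper, not part of the original script):

```python
from datetime import date


def nights_between(checkin: str, checkout: str) -> int:
    """Return the number of nights in the stay, validating both dates."""
    start = date.fromisoformat(checkin)   # raises ValueError on a malformed date
    end = date.fromisoformat(checkout)
    if end <= start:
        raise ValueError(f"checkout {checkout} must be after checkin {checkin}")
    return (end - start).days


nights = nights_between('2024-03-18', '2024-03-24')
print(nights)  # 6
```

Calling this before building the URL would stop the run early instead of scraping an empty or unexpected results page.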

In [4]:
# Building the Website URL to scrape
url_to_scrape = "https://www.airbnb.com/s/"+location+"/homes?tab_id=home_tab&date_picker_type=calendar&checkin="+checkin+"&checkout="+checkout

#The WebDriver opens the constructed URL in a browser window
driver.get(url_to_scrape)

#An empty list is initialized to store the scraped listing details
data = []

# Create an explicit wait with a 15-second timeout; it only blocks when wait.until(...) is called
wait = WebDriverWait(driver, 15)  # Initialize once and use throughout
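
The URL above is assembled by string concatenation, which works for the simple values used here but would not escape spaces or special characters in a location. One possible alternative, sketched with the standard library (`build_search_url` is a hypothetical helper name; the query parameters are copied from the cell above):

```python
from urllib.parse import quote, urlencode


def build_search_url(location: str, checkin: str, checkout: str) -> str:
    """Assemble the search URL with the same parameters as the cell above."""
    params = {
        'tab_id': 'home_tab',
        'date_picker_type': 'calendar',
        'checkin': checkin,
        'checkout': checkout,
    }
    # quote() escapes the path segment; urlencode() escapes the query values
    return f"https://www.airbnb.com/s/{quote(location)}/homes?{urlencode(params)}"


url = build_search_url('williamsburg-va', '2024-03-18', '2024-03-24')
print(url)
```

For the inputs used in this notebook, the result is identical to the concatenated `url_to_scrape`.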


In [5]:
#Page-scraping loop: collect data from up to 3 pages of listings

page = 1

while page <= 3:
    try:
        #Wait until the listing cards are present; wait.until raises TimeoutException otherwise
        listings_xpath_base = '//*[@id="site-content"]/div/div[2]/div/div/div/div/div[1]/div'
        #Example of a full XPath (link of the first listing):
        #//*[@id="site-content"]/div/div[2]/div/div/div/div/div[1]/div[1]/div/div[2]/div/div/div/div/a
        listings = wait.until(EC.presence_of_all_elements_located((By.XPATH, listings_xpath_base)))
        num_listings = len(listings)
    except TimeoutException:
        print("Timeout while waiting for listings to load.")
        break

    for i in range(1, num_listings + 1):  #Iterate over each listing found on the page
        try:
            #Build the XPaths for each detail (URL, price, beds, rating) of listing i
            listing_xpath = f'{listings_xpath_base}[{i}]'
            listing_url_xpath = f'{listing_xpath}/div/div[2]/div/div/div/div/a'
            price_xpath = f'{listing_xpath}/div/div[2]/div/div/div/div/div/div[2]/div[5]/div[2]/div/div/span[1]/span'
            beds_xpath = f'{listing_xpath}/div/div[2]/div/div/div/div/div/div[2]/div[3]/span[2]/span'
            rating_xpath = f'{listing_xpath}/div/div[2]/div/div/div/div/div/div[2]/div[6]/span/span'

            #Retrieve each detail with Selenium and store it in the dictionary listing_details
            listing_details = {
                'URL': driver.find_element(By.XPATH, listing_url_xpath).get_attribute('href'),
                'Price': driver.find_element(By.XPATH, price_xpath).text,
                'Beds': driver.find_element(By.XPATH, beds_xpath).text,
                'Rating': driver.find_element(By.XPATH, rating_xpath).text
            }

            data.append(listing_details)
        except NoSuchElementException:
            #Any missing field causes the whole listing to be skipped
            print(f"Missing information for listing {i}, skipping.")
            continue

    if page < 3:
        try:
            #Find the "Next" link, load its href to go to the next page, and increment the page count
            next_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//a[contains(@aria-label, "Next")]')))
            driver.get(next_button.get_attribute('href'))
            page += 1
            time.sleep(random.randint(2, 15))  # Random pause to mimic human browsing
        except TimeoutException:
            #wait.until raises TimeoutException (not NoSuchElementException) when no "Next" link appears
            print("Reached the end of the pages.")
            break
        except Exception as e:
            print(f"An error occurred while navigating: {e}")
            break
    else:
        break


Missing information for listing 1, skipping.
Missing information for listing 16, skipping.
Missing information for listing 5, skipping.
Missing information for listing 11, skipping.
Missing information for listing 14, skipping.
Missing information for listing 15, skipping.
Missing information for listing 17, skipping.
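
The randomized pause between page loads in the loop above could be factored into a small helper so that the delay range is set in one place. This is only a sketch: `polite_pause` is a hypothetical name, and the injectable `sleep` argument is an assumption added here so the helper can be tried without actually waiting.

```python
import random
import time


def polite_pause(min_s: float = 2.0, max_s: float = 15.0, sleep=time.sleep) -> float:
    """Sleep for a random duration in [min_s, max_s] and return the delay used."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Demonstration with a stub in place of time.sleep so nothing actually blocks
waited = polite_pause(2, 15, sleep=lambda s: None)
print(2 <= waited <= 15)  # True
```

Using `random.uniform` instead of `random.randint` gives fractional delays, which may look slightly less mechanical than whole-second pauses.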


In [6]:
#Finally, the browser is closed, 
#the scraped data is converted into a pandas DataFrame

driver.quit()

df = pd.DataFrame(data)

print(df)

                                                  URL  \
0   https://www.airbnb.com/rooms/10711339744456620...   
1   https://www.airbnb.com/rooms/69072429954963487...   
2   https://www.airbnb.com/rooms/11015873820985475...   
3   https://www.airbnb.com/rooms/50155260?adults=1...   
4   https://www.airbnb.com/rooms/43614314?adults=1...   
5   https://www.airbnb.com/rooms/50027632?adults=1...   
6   https://www.airbnb.com/rooms/70212613522070648...   
7   https://www.airbnb.com/rooms/9595397?adults=1&...   
8   https://www.airbnb.com/rooms/11079587108399276...   
9   https://www.airbnb.com/rooms/20031636?adults=1...   
10  https://www.airbnb.com/rooms/41371157?adults=1...   
11  https://www.airbnb.com/rooms/63355636240451489...   
12  https://www.airbnb.com/rooms/50976430?adults=1...   
13  https://www.airbnb.com/rooms/14788657?adults=1...   
14  https://www.airbnb.com/rooms/15490423?adults=1...   
15  https://www.airbnb.com/rooms/72101225582397146...   
16  https://www.airbnb.com/room

In [7]:
#Export the DataFrame to a CSV file
df.to_csv('airbnb_listings.csv', index=False)

# Observations from the Output

Several listings are missing critical information, causing the script to skip them. This could be due to inconsistencies in the page structure or dynamic content that hasn't loaded by the time of extraction.
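
One way to keep such listings instead of skipping them is to look up each field separately and record `None` when a lookup fails. The sketch below shows the idea with a plain dictionary standing in for the page; `value_or_none` and `fake_listing` are hypothetical names, and the sample values are placeholders rather than scraped data.

```python
def value_or_none(lookup, missing=(LookupError,)):
    """Call lookup() and return its result, or None if it raises a 'missing' error."""
    try:
        return lookup()
    except missing:
        return None


# Stand-in for a scraped listing; a real run would query the page instead
fake_listing = {'Price': '$120', 'Beds': '2 beds'}

row = {
    field: value_or_none(lambda f=field: fake_listing[f])
    for field in ('Price', 'Beds', 'Rating')
}
print(row)  # {'Price': '$120', 'Beds': '2 beds', 'Rating': None}
```

In the scraping loop, the lookup could be something like `lambda: driver.find_element(By.XPATH, price_xpath).text` with `missing=(NoSuchElementException,)`, so that a listing without a rating still contributes its URL, price, and beds.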

Despite some skipped listings, the script successfully captures and stores the available data for the majority of the listings.

The output DataFrame contains a variety of information, including discounted prices and listings without reviews, which are labeled "New place to stay."

The compiled dataset appears to contain duplicates, as indicated by repeated URLs, so the script would benefit from additional logic to detect and remove them.
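
A possible de-duplication step is sketched below. It assumes that two URLs refer to the same listing when their host and path match, ignoring the query string; that assumption has not been verified against Airbnb's URL scheme, and the room IDs in the sample rows are made up for illustration.

```python
from urllib.parse import urlsplit


def listing_key(url: str) -> str:
    """Reduce a listing URL to host + path, ignoring the query string."""
    parts = urlsplit(url)
    return f"{parts.netloc}{parts.path}"


def drop_duplicate_listings(rows):
    """Keep the first row seen for each listing key, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = listing_key(row['URL'])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Illustrative rows; the room IDs and query strings are made up
rows = [
    {'URL': 'https://www.airbnb.com/rooms/111?adults=1'},
    {'URL': 'https://www.airbnb.com/rooms/111?adults=2'},
    {'URL': 'https://www.airbnb.com/rooms/222?adults=1'},
]
print(len(drop_duplicate_listings(rows)))  # 2
```

If only exact repeats need to be removed, `df.drop_duplicates(subset='URL')` on the DataFrame would be a shorter alternative.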

The ratings are detailed, providing both an average score and the number of reviews, offering insight into the listing's popularity and guest satisfaction.
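
For analysis, the combined rating text could be split into numeric columns. The sketch below assumes the text looks like `4.95 (120)`, an average followed by a review count in parentheses; the exact format on the page may differ, so the pattern and the `parse_rating` helper should be treated as hypothetical.

```python
import re

# Assumed format: "<average> (<review count>)", e.g. "4.95 (120)"
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*\((\d[\d,]*)\)')


def parse_rating(text: str):
    """Return (average, review_count), or (None, None) if no rating is found."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None, None
    average = float(match.group(1))
    reviews = int(match.group(2).replace(',', ''))
    return average, reviews


print(parse_rating('4.95 (120)'))         # (4.95, 120)
print(parse_rating('New place to stay'))  # (None, None)
```

Listings without reviews fall through to `(None, None)`, which keeps them in the dataset while marking the rating as unavailable.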