# Web Scraper Test Script

## Designed to collect rental property data from **Apartments.com**

This script is a web scraper that collects rental property data from **apartments.com** for properties in San Diego County, California, listed under $4,000. It uses **Selenium** for browser automation and **BeautifulSoup** for HTML parsing, then cleans the data and saves it to a CSV file. The code cells come first; a breakdown of their functionality follows the output.

In [13]:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import time
import os
import logging
import re
from datetime import datetime

In [14]:
HEADLESS = True
WAIT_TIME = 4
LISTINGS_PER_PAGE = 40
BASE_URL = "https://www.apartments.com/apartments-condos/san-diego-county-ca/under-4000/"
LOG_FILE = "scraper_log.txt"
TEST_MODE = True
MAX_UNITS = 10

In [3]:
logging.basicConfig(
    filename=LOG_FILE,
    filemode='w',
    format='%(asctime)s - %(levelname)s - %(message)s',
    level=logging.WARNING  # logging.info() and logging.debug() calls are dropped at this level
)

In [4]:
def init_driver():
    options = Options()
    if HEADLESS:
        options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("user-agent=Mozilla/5.0")
    service = Service(EdgeChromiumDriverManager().install(), log_output=os.devnull)
    return webdriver.Edge(service=service, options=options)

In [5]:
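def extract_low_price(price):
    """Return the low end of a price string (e.g. "$1,200 - $1,500") as a float."""
    if pd.isna(price):
        return None
    # Keep only digits and hyphens, then take the part before the first hyphen.
    low = re.sub(r'[^\d\-]', '', str(price)).split('-')[0]
    # An empty or non-numeric result (e.g. "Call for Rent") yields None instead of raising.
    return float(low) if low.isdigit() else None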

In [6]:
def extract_amenities(soup):
    labels = soup.select('.amenityLabel') + soup.select('.combinedAmenitiesList li span')
    text = ' '.join(el.get_text(separator=' ').lower().strip() for el in labels)

    logging.debug("Combined amenities text: %s", text)

    return {
        'HasWasherDryer': 'washer/dryer' in text or 'in unit washer' in text,
        'HasAirConditioning': 'air conditioning' in text,
        'HasPool': 'pool' in text,
        # Word boundaries keep "spa" from matching words such as "spacious".
        'HasSpa': bool(re.search(r'\bspa\b', text)) or 'hot tub' in text,
        'HasGym': 'fitness center' in text or 'gym' in text,
        'HasEVCharging': 'ev charging' in text,
        'AllowsDogs': 'dogs allowed' in text or 'dog friendly' in text,
        'AllowsCats': 'cats allowed' in text or 'cat friendly' in text
    }

In [7]:
def scrape_listings(driver):
    all_units = []
    page = 1

    while True:
        url = f"{BASE_URL}{page}/"
        logging.info(f"Scraping page {page}: {url}")
        driver.get(url)
        time.sleep(WAIT_TIME)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        listings = soup.find_all('article')

        if not listings:
            break

        for listing in listings:
            title = listing.find('span', class_='js-placardTitle')
            address = listing.find('div', class_='property-address')
            phone = listing.find('button', class_='phone-link')
            property_url = listing.get('data-url')

            if title and address and property_url:
                try:
                    driver.get(property_url)
                    time.sleep(WAIT_TIME / 2)
                    detail_soup = BeautifulSoup(driver.page_source, 'html.parser')
                    unit_containers = detail_soup.find_all('li', class_='unitContainer js-unitContainerV3')

                    rental_type = "Unknown"
                    og_title_tag = detail_soup.find("meta", property="og:title")
                    if og_title_tag and og_title_tag.get("content"):
                        content = og_title_tag["content"].lower()
                        # Terms are checked in priority order; stop at the first match.
                        for term in ["house rental", "townhome", "condo", "apartment"]:
                            if term in content:
                                rental_type = term.replace(" rental", "").capitalize()
                                break

                    amenities = extract_amenities(detail_soup)

                    for unit in unit_containers:
                        unit_number = unit.find('div', class_='unitColumn column')
                        price = unit.find('div', class_='pricingColumn column')
                        sqft = unit.find('div', class_='sqftColumn column')
                        beds = unit.get('data-beds')
                        baths = unit.get('data-baths')

                        all_units.append({
                            'Property': title.text.strip(),
                            'Address': address.text.strip(),
                            'Unit': unit_number.text.strip() if unit_number else "N/A",
                            'Price': price.text.strip() if price else "N/A",
                            'SqFt': sqft.text.strip() if sqft else "N/A",
                            'Beds': beds if beds else "N/A",
                            'Baths': baths if baths else "N/A",
                            'RentalType': rental_type,
                            'Phone': phone.get('phone-data') if phone and phone.has_attr('phone-data') else "N/A",
                            **amenities,
                            'StorageFee': None,
                            'ListingURL': property_url
                        })

                        if TEST_MODE and len(all_units) >= MAX_UNITS:
                            logging.info(f"TEST_MODE: Stopping after {MAX_UNITS} listings.")
                            return pd.DataFrame(all_units)

                except Exception as e:
                    logging.warning(f"Error processing {property_url}: {e}")

        if len(listings) < LISTINGS_PER_PAGE:
            break
        page += 1

    return pd.DataFrame(all_units)

In [8]:
def clean_data(df):
    df['Price'] = df['Price'].apply(extract_low_price)
    df['SqFt'] = pd.to_numeric(df['SqFt'].astype(str).str.replace(',', '').str.extract(r'(\d+)', expand=False), errors='coerce')
    df['Beds'] = pd.to_numeric(df['Beds'], errors='coerce')
    df['Baths'] = pd.to_numeric(df['Baths'], errors='coerce')

    df['ZipCode'] = df['Address'].str.extract(r'(\d{5})(?!.*\d{5})')
    city_state = df['Address'].str.extract(r',\s*([^,]+),\s*([A-Z]{2})\s*\d{5}')
    df['City'] = city_state[0].str.strip()
    df['State'] = city_state[1].str.strip()

    df['PricePerSqFt'] = df.apply(
        lambda row: round(row['Price'] / row['SqFt'], 2)
        if pd.notnull(row['Price']) and pd.notnull(row['SqFt']) and row['SqFt'] > 0 else None,
        axis=1
    )

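    df['Beds_Baths'] = df.apply(
        # The 'g' format keeps half baths (1.5) and prints whole numbers without a trailing ".0".
        lambda row: f"{format(row['Beds'], 'g') if pd.notnull(row['Beds']) else 'N/A'} Bed / {format(row['Baths'], 'g') if pd.notnull(row['Baths']) else 'N/A'} Bath",
        axis=1
    )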

    final_order = [
        'Property', 'Address', 'City', 'State', 'ZipCode', 'Phone',
        'Unit', 'Beds', 'Baths', 'Beds_Baths', 'SqFt', 'Price', 'PricePerSqFt',
        'RentalType',
        'HasWasherDryer', 'HasAirConditioning', 'HasPool', 'HasSpa',
        'HasGym', 'HasEVCharging', 'StorageFee',
        'AllowsDogs', 'AllowsCats',
        'ListingURL'
    ]
    return df[final_order]

In [9]:
def main():
    start_time = time.time()
    driver = init_driver()
    try:
        df = scrape_listings(driver)
    finally:
        # Always close the browser, even if scraping raises.
        driver.quit()

    if df.empty:
        print("No data collected. File not saved.")
        logging.warning("No data collected. File not saved.")
    else:
        df = clean_data(df)
        filename = f'test_san_diego_county_rentals_{datetime.today().strftime("%Y-%m-%d")}.csv'
        df.to_csv(filename, index=False)
        print(f"Scraping complete. Data saved to {filename}")
        logging.info(f"Scraping complete. Data saved to {filename}")

    duration = time.time() - start_time
    minutes, seconds = divmod(duration, 60)
    print(f"Script runtime: {int(minutes)} minutes and {seconds:.2f} seconds")

In [10]:
if __name__ == "__main__":
    main()

Scraping complete. Data saved to test_san_diego_county_rentals_2025-04-25.csv
Script runtime: 0 minutes and 16.16 seconds


---

### **Key Components**

1. **Imports and Configuration**
   - Libraries like `selenium`, `BeautifulSoup`, `pandas`, `logging`, and `re` are imported.
   - Constants are defined:
     - `HEADLESS`: Runs the browser in headless mode (no GUI).
     - `WAIT_TIME`: Seconds to wait after loading a search page; property detail pages wait half as long.
     - `LISTINGS_PER_PAGE`: Expected number of listings on a full results page; a shorter page is treated as the last one.
     - `BASE_URL`: The URL for the search query.
     - `LOG_FILE`: File for logging warnings and errors.
     - `TEST_MODE` and `MAX_UNITS`: Used for testing to stop after a fixed number of units (10 here).

2. **`init_driver()`**
   - Initializes a Selenium WebDriver for Microsoft Edge, using `webdriver_manager` to install the matching driver.
   - Runs the browser headless when `HEADLESS` is set, disables the GPU and sandbox, and sets a custom user-agent.

3. **`extract_low_price(price)`**
   - Extracts the lowest price from a price string as a float (e.g., "$1,200-$1,500" → `1200.0`).

4. **`extract_amenities(soup)`**
   - Parses the HTML of a property page to extract amenities like washer/dryer, air conditioning, pool, gym, etc.
   - Returns a dictionary of boolean values indicating the presence of each amenity.

5. **`scrape_listings(driver)`**
   - The main scraping function:
     - Iterates through pages of listings until a page is empty or holds fewer than `LISTINGS_PER_PAGE` listings.
     - For each listing, navigates to the property detail page and extracts data such as:
       - Property name, address, unit details, price, square footage, beds, baths, rental type, phone number, and amenities.
     - Appends the data to a list of dictionaries (`all_units`).
     - Stops scraping if `TEST_MODE` is enabled and the maximum number of units is reached.
   - Returns a Pandas DataFrame containing all the scraped data.

6. **`clean_data(df)`**
   - Cleans and processes the scraped data:
     - Converts price, square footage, beds, and baths to numeric values.
     - Extracts city, state, and ZIP code from the address.
     - Calculates price per square foot.
     - Creates a "Beds_Baths" column summarizing the number of beds and baths.
     - Reorders columns into a final format for saving.

7. **`main()`**
   - The entry point of the script:
     - Initializes the WebDriver.
     - Calls `scrape_listings()` to collect data.
     - Cleans the data using `clean_data()`.
     - Saves the cleaned data to a CSV file named with the current date.
     - Prints the runtime of the script.
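
Items 3 and 4 above can be tried without a browser. The following is a minimal, standard-library-only sketch rather than the notebook's own code: `lowest_price`, `amenity_flags`, and `AMENITY_KEYWORDS` are hypothetical names, the sample strings are made up, and the word-boundary matching is an assumption added here so that a keyword such as `spa` does not match `spacious`.

```python
import re

def lowest_price(raw):
    """Return the low end of a price string as a float, or None if no number is found."""
    if raw is None:
        return None
    # Drop thousands separators, then take the first number in the string.
    match = re.search(r"\d+(?:\.\d+)?", str(raw).replace(",", ""))
    return float(match.group()) if match else None

# Hypothetical keyword table; the notebook checks a similar set of phrases.
AMENITY_KEYWORDS = {
    "HasPool": ["pool"],
    "HasSpa": ["spa", "hot tub"],
    "HasGym": ["fitness center", "gym"],
}

def amenity_flags(text):
    """Map each flag to True when any of its keywords appears as a whole word or phrase."""
    text = text.lower()
    return {
        flag: any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords)
        for flag, keywords in AMENITY_KEYWORDS.items()
    }

print(lowest_price("$1,200 - $1,500"))   # low end of a range
print(lowest_price("Call for Rent"))     # no number present
print(amenity_flags("Spacious floor plans, fitness center"))
```

With whole-word matching, "Spacious" alone does not set `HasSpa`, while "fitness center" still sets `HasGym`.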

---

### **Workflow**

1. **Initialization**
   - The script sets up the WebDriver and logging.

2. **Scraping**
   - It navigates through pages of listings on **apartments.com**.
   - For each listing, it extracts relevant details and amenities.

3. **Data Cleaning**
   - The raw data is processed to ensure consistency and usability.

4. **Saving Results**
   - The cleaned data is saved to a CSV file in the format: `test_san_diego_county_rentals_YYYY-MM-DD.csv`.

5. **Logging and Runtime**
   - The script writes warnings and errors to `scraper_log.txt` and prints the total runtime to the console.
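
The paging rule in the scraping step (stop on an empty page, or on a page shorter than `LISTINGS_PER_PAGE`) can be sketched on its own. This is an illustration under assumptions, not the notebook's code: `collect_pages` and `fake_fetch` are hypothetical names, and the fake listing data stands in for pages fetched by the browser.

```python
from typing import Callable, List

LISTINGS_PER_PAGE = 40  # same constant name and value as in the notebook

def collect_pages(fetch_page: Callable[[int], List[str]],
                  per_page: int = LISTINGS_PER_PAGE) -> List[str]:
    """Collect items page by page until a page is empty or shorter than per_page."""
    collected: List[str] = []
    page = 1
    while True:
        items = fetch_page(page)
        if not items:
            break  # nothing returned: we are past the last page
        collected.extend(items)
        if len(items) < per_page:
            break  # a short page is taken to be the last one
        page += 1
    return collected

# Stand-in for the browser: 95 fake listings, served as pages of 40, 40, and 15.
FAKE_LISTINGS = [f"listing-{i}" for i in range(95)]

def fake_fetch(page: int) -> List[str]:
    start = (page - 1) * LISTINGS_PER_PAGE
    return FAKE_LISTINGS[start:start + LISTINGS_PER_PAGE]

print(len(collect_pages(fake_fetch)))
```

The empty-page check matters when the total is an exact multiple of the page size, since the last full page is then followed by one extra request that returns nothing.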

---

### **Key Features**
- **Headless Mode**: Allows the script to run without opening a browser window.
- **Error Handling**: Logs a warning and moves on when a property page fails to process.
- **Test Mode**: Stops after `MAX_UNITS` units have been collected, for quick test runs.
- **Data Cleaning**: Ensures the final dataset is well-structured and ready for analysis.
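
The error-handling and test-mode features follow a common pattern: skip and log a failing item, and stop early once a cap is reached. A minimal sketch of that pattern is shown here; `process_all`, `parse_rent`, and the logger name are hypothetical, and the toy parser replaces the real page processing.

```python
import logging

logger = logging.getLogger("scraper_sketch")  # hypothetical logger name

def process_all(items, parse, max_units=None):
    """Parse each item, logging and skipping failures; stop early at max_units if set."""
    results = []
    for item in items:
        try:
            results.append(parse(item))
        except Exception as exc:  # broad on purpose: one bad item should not end the run
            logger.warning("Error processing %r: %s", item, exc)
            continue
        if max_units is not None and len(results) >= max_units:
            logger.info("Test mode: stopping after %d units.", max_units)
            break
    return results

def parse_rent(value):
    """Toy parser that raises ValueError on non-numeric input."""
    return int(value)

print(process_all(["1200", "n/a", "1500", "1800"], parse_rent))
print(process_all(["1200", "n/a", "1500", "1800"], parse_rent, max_units=2))
```

Catching a broad `Exception` trades precision for resilience; narrowing it to the errors actually seen in the log is a reasonable next step.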

---

### **Output**
The script generates a CSV file whose columns fall into these groups:
- Property details (name, address, city, state, ZIP code, phone number, rental type, listing URL).
- Unit details (unit number, price, square footage, beds, baths).
- Amenities (washer/dryer, air conditioning, pool, spa, gym, EV charging, dog and cat policies), plus a `StorageFee` placeholder that this version leaves empty.
- Calculated fields (price per square foot, beds/baths summary).
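
To make the derived columns concrete, here is a small standard-library sketch of how city, state, ZIP code, and price per square foot could be computed for a single row. It reuses the address pattern from `clean_data()`, but `split_address` and `price_per_sqft` are hypothetical helper names and the sample address is invented.

```python
import re

# Same shape as the pattern in clean_data(), with the ZIP code captured as well.
ADDRESS_RE = re.compile(r",\s*([^,]+),\s*([A-Z]{2})\s*(\d{5})")

def split_address(address):
    """Return City, State, and ZipCode parsed from '<street>, <city>, <ST> <zip>'."""
    match = ADDRESS_RE.search(address)
    if not match:
        return {"City": None, "State": None, "ZipCode": None}
    city, state, zip_code = match.groups()
    return {"City": city.strip(), "State": state, "ZipCode": zip_code}

def price_per_sqft(price, sqft):
    """Return price / sqft rounded to 2 decimals, or None when either value is unusable."""
    if price is None or not sqft or sqft <= 0:
        return None
    return round(price / sqft, 2)

print(split_address("123 Main St, San Diego, CA 92101"))  # made-up address
print(price_per_sqft(2400.0, 800))
print(price_per_sqft(2400.0, 0))
```

Addresses that lack a comma-separated city (for example, a bare street name) fall through to the all-`None` result rather than raising.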

---

### **Usage**
Run the script by executing it in the terminal:
```bash
python scraper.py
```
Make sure the dependencies imported in the first cell (Selenium, webdriver-manager, BeautifulSoup, and pandas) are installed and that Microsoft Edge is available; the script uses `webdriver_manager` to install a matching Edge driver.
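
Before a first run, it can help to confirm that the imports will resolve. The check below is an optional sketch and not part of the notebook; `missing_modules` is a hypothetical helper, and the list holds the import names from the first cell.

```python
import importlib.util

# Import names used by the notebook's first cell.
REQUIRED_MODULES = ["selenium", "webdriver_manager", "bs4", "pandas"]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Note that import names and package names differ in two cases: `bs4` is installed as `beautifulsoup4`, and `webdriver_manager` as `webdriver-manager`.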