This script is a web scraper designed to collect rental property data from the website **apartments.com** for properties in San Diego County, California, under $4,000. It uses **Selenium** for browser automation and **BeautifulSoup** for HTML parsing. The script processes the data, cleans it, and saves it to a CSV file. Here's a breakdown of its functionality:

---

### **Key Components**

1. **Imports and Configuration**
   - Libraries like `selenium`, `BeautifulSoup`, `pandas`, `logging`, and `re` are imported.
   - Constants are defined:
     - `HEADLESS`: Runs the browser in headless mode (no GUI).
     - `WAIT_TIME`: Time to wait for pages to load.
     - `LISTINGS_PER_PAGE`: Number of listings per page.
     - `BASE_URL`: The URL for the search query.
     - `LOG_FILE`: File for logging warnings and errors.
     - `TEST_MODE` and `MAX_UNITS`: Used for testing to cap the number of unit rows collected.

2. **`init_driver()`**
   - Initializes a Selenium WebDriver for Microsoft Edge.
   - Configures the browser to run in headless mode and sets a custom user-agent.

3. **`extract_low_price(price)`**
   - Extracts the lowest price from a price string and returns it as a float (e.g., "$1,200-$1,500" → `1200.0`).

4. **`extract_amenities(soup)`**
   - Parses the HTML of a property page to extract amenities like washer/dryer, air conditioning, pool, gym, etc.
   - Returns a dictionary of boolean values indicating the presence of each amenity.

5. **`scrape_listings(driver)`**
   - The main scraping function:
     - Iterates through pages of listings.
     - For each listing, navigates to the property detail page and extracts data such as:
       - Property name, address, unit details, price, square footage, beds, baths, rental type, phone number, and amenities.
     - Appends the data to a list of dictionaries (`all_units`).
     - Stops scraping if `TEST_MODE` is enabled and the maximum number of units is reached.
   - Returns a Pandas DataFrame containing all the scraped data.

6. **`clean_data(df)`**
   - Cleans and processes the scraped data:
     - Converts price, square footage, beds, and baths to numeric values.
     - Extracts city, state, and ZIP code from the address.
     - Calculates price per square foot.
     - Creates a "Beds_Baths" column summarizing the number of beds and baths.
     - Reorders columns into a final format for saving.

7. **`main()`**
   - The entry point of the script:
     - Initializes the WebDriver.
     - Calls `scrape_listings()` to collect data.
     - Cleans the data using `clean_data()`.
     - Saves the cleaned data to a CSV file named with the current date.
     - Prints the runtime of the script to the console.

---

### **Workflow**

1. **Initialization**
   - The script sets up the WebDriver and logging.

2. **Scraping**
   - It navigates through pages of listings on **apartments.com**.
   - For each listing, it extracts relevant details and amenities.

3. **Data Cleaning**
   - The raw data is processed to ensure consistency and usability.

4. **Saving Results**
   - The cleaned data is saved to a CSV file named `test_san_diego_county_rentals_YYYY-MM-DD.csv` (the `test_` prefix is hard-coded in `main()`).

5. **Logging and Runtime**
   - Warnings and errors are written to the log file; the total runtime is printed to the console.

---

### **Key Features**
- **Headless Mode**: Allows the script to run without opening a browser window.
- **Error Handling**: Logs warnings for errors encountered during scraping.
- **Test Mode**: Stops after `MAX_UNITS` unit rows have been collected, which keeps test runs short.
- **Data Cleaning**: Ensures the final dataset is well-structured and ready for analysis.

---

### **Output**
The script generates a CSV file containing the following columns:
- Property details (name, address, city, state, ZIP code, phone number, etc.).
- Unit details (price, square footage, beds, baths, rental type).
- Amenities (washer/dryer, air conditioning, pool, gym, etc.).
- Calculated fields (price per square foot, beds/baths summary).

---

### **Usage**
Run the script by executing it in the terminal:
```bash
python scraper.py
```
Ensure that the required packages (`selenium`, `webdriver-manager`, `beautifulsoup4`, `pandas`) are installed and that Microsoft Edge is available; `webdriver-manager` downloads a matching Edge driver at runtime.

In [13]:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import time
import os
import logging
import re
from datetime import datetime

This code imports various libraries and modules that are essential for the web scraping script. Here's a breakdown of each import:

---

### **Selenium Imports**
1. **`from selenium import webdriver`**:
    - Imports the `webdriver` module from Selenium, which is used to automate browser interactions.

2. **`from selenium.webdriver.edge.service import Service`**:
    - Imports the `Service` class, which is used to manage the Microsoft Edge WebDriver.

3. **`from selenium.webdriver.edge.options import Options`**:
    - Imports the `Options` class, which allows you to configure browser settings (e.g., headless mode, user-agent).

4. **`from webdriver_manager.microsoft import EdgeChromiumDriverManager`**:
    - Imports the `EdgeChromiumDriverManager` class, which automatically downloads and manages the correct version of the Microsoft Edge WebDriver.

---

### **BeautifulSoup Import**
5. **`from bs4 import BeautifulSoup`**:
    - Imports the `BeautifulSoup` class from the `bs4` library, which is used for parsing and navigating HTML content.

---

### **Pandas Import**
6. **`import pandas as pd`**:
    - Imports the `pandas` library, which is used for data manipulation and analysis. It is commonly used to work with tabular data in the form of DataFrames.

---

### **Time and OS Imports**
7. **`import time`**:
    - Imports the `time` module, used here for fixed delays (`time.sleep()`) and for timing the run (`time.time()`).

8. **`import os`**:
    - Imports the `os` module; the script uses it only for `os.devnull`, the path of the platform's null device.

---

### **Logging Import**
9. **`import logging`**:
    - Imports the `logging` module, which is used for logging messages (e.g., warnings, errors, runtime information).

---

### **Regex Import**
10. **`import re`**:
     - Imports the `re` module, which provides functions for working with regular expressions (e.g., pattern matching, text extraction).

---

### **Datetime Import**
11. **`from datetime import datetime`**:
     - Imports the `datetime` class from the `datetime` module, which is used for working with dates and times (e.g., generating timestamps).

---

### **Purpose**
These imports collectively enable the script to:
1. Automate browser interactions using Selenium.
2. Parse and extract data from HTML using BeautifulSoup.
3. Manipulate and analyze data using Pandas.
4. Handle time delays, file paths, and logging.
5. Use regular expressions for text processing.
6. Work with dates and times for logging and file naming.

These libraries and modules are essential for building a robust and efficient web scraping script.

In [14]:
HEADLESS = True
WAIT_TIME = 4
LISTINGS_PER_PAGE = 40
BASE_URL = "https://www.apartments.com/apartments-condos/san-diego-county-ca/under-4000/"
LOG_FILE = "scraper_log.txt"
TEST_MODE = True
MAX_UNITS = 10

### Explanation of Variables

1. **`HEADLESS = True`**:
    - This variable determines whether the browser runs in headless mode. 
    - When set to `True`, the browser operates without a graphical user interface (GUI), which suits unattended runs and usually uses fewer resources.

2. **`WAIT_TIME = 4`**:
    - Specifies the number of seconds the script pauses (via `time.sleep()`) after requesting a page.
    - This is a fixed delay, not a check that the page has finished loading, so a slow page may still be parsed before all of its content is present.

3. **`LISTINGS_PER_PAGE = 40`**:
    - Indicates the number of property listings expected on a full results page.
    - A page with fewer listings than this is treated as the last page, and scraping stops.

4. **`BASE_URL = "https://www.apartments.com/apartments-condos/san-diego-county-ca/under-4000/"`**:
    - The base URL of the website to scrape. 
    - It points to rental property listings in San Diego County, California, with a price cap of $4,000.

5. **`LOG_FILE = "scraper_log.txt"`**:
    - The name of the file where warnings and errors are logged.
    - Useful for debugging and tracking the script's execution.

6. **`TEST_MODE = True`**:
    - When set to `True`, the script runs in test mode, limiting how many unit rows are collected.
    - This is helpful for debugging or testing the script without scraping the entire dataset.

7. **`MAX_UNITS = 10`**:
    - Specifies the maximum number of unit rows (one per unit, not one per property) to collect when `TEST_MODE` is enabled.
    - Prevents excessive data collection during testing.

In [3]:
logging.basicConfig(
    filename=LOG_FILE,
    filemode='w',
    format='%(asctime)s - %(levelname)s - %(message)s',
    level=logging.WARNING
)

### Explanation of `logging.basicConfig`

The `logging.basicConfig` function is used to configure the logging settings for the script. Here's a breakdown of each parameter:

1. **`filename=LOG_FILE`**:
    - Specifies the name of the file where log messages will be saved.
    - In this case, the value of `LOG_FILE` is `'scraper_log.txt'`, so all log messages will be written to this file.

2. **`filemode='w'`**:
    - Sets the mode for opening the log file.
    - `'w'` means the file will be overwritten each time the script runs. If you want to append to the file instead, use `'a'`.

3. **`format='%(asctime)s - %(levelname)s - %(message)s'`**:
    - Defines the format of log messages.
    - `%(asctime)s`: Includes the timestamp of when the log message was created.
    - `%(levelname)s`: Includes the severity level of the log message (e.g., WARNING, ERROR).
    - `%(message)s`: Includes the actual log message.

4. **`level=logging.WARNING`**:
    - Sets the minimum severity level of messages to be logged.
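    - In this case, only messages with a severity of WARNING or higher (e.g., ERROR, CRITICAL) will be logged. Lower severity levels like DEBUG or INFO are ignored, so the script's `logging.info()` and `logging.debug()` calls produce no output in the log file unless `level` is lowered.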

### Example Log Entry
A log entry in the file might look like this:
```
2023-03-15 14:30:45,123 - WARNING - This is a warning message.
```

This configuration ensures that important warnings and errors are recorded in the specified log file for debugging and monitoring purposes.
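
The effect of the `level` setting can be checked in isolation. The sketch below is illustrative only: it writes to in-memory buffers and uses hypothetical logger names (`demo_warning`, `demo_info`) rather than the script's log file, and it compares what is recorded at `WARNING` versus `INFO`.

```python
import io
import logging

def capture_logs(name, level):
    """Emit one INFO and one WARNING record and return what was written."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter('%(levelname)s - %(message)s'))

    logger = logging.getLogger(name)   # hypothetical logger name, not the root logger
    logger.setLevel(level)
    logger.propagate = False           # keep the demo out of any root handlers
    logger.handlers = [handler]

    logger.info("Scraping page 1")
    logger.warning("Error processing listing")
    return buffer.getvalue()

warning_output = capture_logs("demo_warning", logging.WARNING)
info_output = capture_logs("demo_info", logging.INFO)

print(repr(warning_output))  # only the WARNING record
print(repr(info_output))     # both records
```

At `WARNING`, the "Scraping page 1" message is dropped, which is what happens to the page-progress messages in this script.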

In [4]:
def init_driver():
    options = Options()
    if HEADLESS:
        options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("user-agent=Mozilla/5.0")
    service = Service(EdgeChromiumDriverManager().install(), log_output=os.devnull)
    return webdriver.Edge(service=service, options=options)

### Explanation of `init_driver` Function

The `init_driver` function initializes and configures a Selenium WebDriver for Microsoft Edge. Here's a detailed breakdown of its components:

1. **`options = Options()`**:
    - Creates an instance of the `Options` class to configure browser settings.

2. **`if HEADLESS:`**:
    - Checks if the `HEADLESS` variable is set to `True`.
    - If `True`, the browser will run in headless mode (without a graphical user interface), which suits automated tasks.

3. **`options.add_argument("--headless")`**:
    - Adds the `--headless` argument to enable headless mode.

4. **`options.add_argument("--disable-gpu")`**:
    - Disables GPU hardware acceleration. This is often used in headless mode to avoid potential rendering issues.

5. **`options.add_argument("--no-sandbox")`**:
    - Disables the sandboxing feature, which is sometimes required for running the browser in certain environments (e.g., Docker containers).

6. **`options.add_argument("user-agent=Mozilla/5.0")`**:
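    - Overrides the browser's user-agent string with the minimal value `Mozilla/5.0`, presumably so the session looks less like a default automated client. A string this short does not resemble a real browser's full user agent, so it may not prevent detection.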

7. **`service = Service(EdgeChromiumDriverManager().install(), log_output=os.devnull)`**:
    - Creates a `Service` instance for managing the Edge WebDriver.
    - `EdgeChromiumDriverManager().install()` automatically downloads and installs the appropriate version of the Edge WebDriver.
    - `log_output=os.devnull` suppresses WebDriver logs by sending them to the platform's null device (`/dev/null` on Unix-like systems, `nul` on Windows).

8. **`return webdriver.Edge(service=service, options=options)`**:
    - Initializes and returns an instance of the Edge WebDriver with the specified `service` and `options`.

### Purpose
This function sets up a Selenium WebDriver with custom configurations, such as headless mode and user-agent spoofing, to automate browser interactions efficiently and avoid detection during web scraping.

In [5]:
def extract_low_price(price):
    if pd.isna(price):
        return None
    price = re.sub(r'[^\d\-]', '', str(price))
    return float(price.split('-')[0]) if '-' in price else float(price) if price.isdigit() else None

### Explanation of `extract_low_price` Function

The `extract_low_price` function is designed to extract the lowest price from a given price string. Here's a detailed breakdown of its logic:

1. **`if pd.isna(price):`**
    - Checks if the input `price` is `NaN` (Not a Number) or missing using Pandas' `isna()` function.
    - If the price is missing, the function returns `None`.

2. **`price = re.sub(r'[^\d\-]', '', str(price))`**
    - Converts the `price` to a string (if it isn't already).
    - Uses the `re.sub()` function to remove all characters except digits (`\d`) and hyphens (`-`).
    - This ensures that only numeric values and ranges (e.g., "1200-1500") are retained.

3. **`return float(price.split('-')[0]) if '-' in price else float(price) if price.isdigit() else None`**
    - Checks if the cleaned `price` contains a hyphen (`-`):
        - If `True`, splits the string at the hyphen and takes the first part (the lower bound of the range), converting it to a float.
    - If there is no hyphen, checks if the `price` is a valid numeric string using `isdigit()`:
        - If `True`, converts the string to a float and returns it.
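    - If neither condition is met (e.g., the string is empty or invalid), the function returns `None`.
    - One edge case is not handled: if the cleaned string starts with a hyphen (for example, non-numeric text such as "Call-in for pricing" reduces to `"-"`), the first part of the split is empty and `float('')` raises `ValueError`.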

### Purpose
This function is used to standardize and extract the lowest price from various price formats, such as:
- "$1,200-$1,500" → `1200`
- "$1,200" → `1200`
- Invalid or missing prices → `None`

### Example Usage
```python
extract_low_price("$1,200-$1,500")  # Output: 1200.0
extract_low_price("$1,200")         # Output: 1200.0
extract_low_price("N/A")            # Output: None
extract_low_price(None)             # Output: None
```
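
Two input shapes are worth checking against the cleaning step above: a range written with an en dash instead of a hyphen, and a price with cents. The en dash and the decimal point are both stripped, so the remaining digits run together. The sketch below is one possible alternative, not the script's method: `extract_low_price_robust` is a hypothetical name, and it uses only the standard library, so it checks for `None` rather than calling `pd.isna`.

```python
import re

def extract_low_price_robust(price):
    """Return the first number in a price string as a float, or None."""
    if price is None:
        return None
    # First run of digits, allowing thousands separators and an optional decimal part.
    match = re.search(r'\d[\d,]*(?:\.\d+)?', str(price))
    if match is None:
        return None
    return float(match.group(0).replace(',', ''))

# What the original cleaning step produces for an en-dash range:
cleaned = re.sub(r'[^\d\-]', '', "$1,200 – $1,500")
print(cleaned)  # 12001500 (the two bounds are fused)

print(extract_low_price_robust("$1,200 – $1,500"))      # 1200.0
print(extract_low_price_robust("$1,200-$1,500"))        # 1200.0
print(extract_low_price_robust("$1,200.50"))            # 1200.5
print(extract_low_price_robust("Call-in for pricing"))  # None
```

Whether the site actually renders ranges with an en dash is an assumption here; the point is that taking the first number avoids depending on the separator.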

In [6]:
def extract_amenities(soup):
    labels = soup.select('.amenityLabel') + soup.select('.combinedAmenitiesList li span')
    text = ' '.join(el.get_text(separator=' ').lower().strip() for el in labels)

    logging.debug("Combined amenities text: %s", text)

    return {
        'HasWasherDryer': 'washer/dryer' in text or 'in unit washer' in text,
        'HasAirConditioning': 'air conditioning' in text,
        'HasPool': 'pool' in text,
        'HasSpa': 'spa' in text or 'hot tub' in text,
        'HasGym': 'fitness center' in text or 'gym' in text,
        'HasEVCharging': 'ev charging' in text,
        'AllowsDogs': 'dogs allowed' in text or 'dog friendly' in text,
        'AllowsCats': 'cats allowed' in text or 'cat friendly' in text
    }

### Explanation of `extract_amenities` Function

The `extract_amenities` function is designed to parse the HTML content of a property listing page and extract information about the amenities available. Here's a detailed breakdown of its components:

1. **`labels = soup.select('.amenityLabel') + soup.select('.combinedAmenitiesList li span')`**:
    - Uses BeautifulSoup's `select()` method to find all elements matching the specified CSS selectors:
        - `.amenityLabel`: Selects elements with the class `amenityLabel`.
        - `.combinedAmenitiesList li span`: Selects `<span>` elements inside `<li>` elements within the `combinedAmenitiesList` class.
    - Combines the results from both selectors into a single list of elements.

2. **`text = ' '.join(el.get_text(separator=' ').lower().strip() for el in labels)`**:
    - Iterates over each element in the `labels` list.
    - Extracts the text content of each element using `get_text(separator=' ')`.
    - Converts the text to lowercase using `.lower()` for case-insensitive matching.
    - Strips any leading or trailing whitespace using `.strip()`.
    - Joins all the extracted text into a single string, separated by spaces.

3. **`logging.debug("Combined amenities text: %s", text)`**:
    - Logs the combined amenities text at the DEBUG level for debugging purposes.
    - With the logger configured at `logging.WARNING`, this message is not written; lower the level to `logging.DEBUG` to see it during development or troubleshooting.

4. **`return { ... }`**:
    - Returns a dictionary where each key represents a specific amenity, and the value is a boolean indicating whether the amenity is present in the `text`.
    - The presence of each amenity is determined by a plain substring test against `text`, so a short keyword can also match inside a longer word (for example, `'spa'` is found in "spacious"):
        - `'HasWasherDryer'`: Checks for `'washer/dryer'` or `'in unit washer'`.
        - `'HasAirConditioning'`: Checks for `'air conditioning'`.
        - `'HasPool'`: Checks for `'pool'`.
        - `'HasSpa'`: Checks for `'spa'` or `'hot tub'`.
        - `'HasGym'`: Checks for `'fitness center'` or `'gym'`.
        - `'HasEVCharging'`: Checks for `'ev charging'`.
        - `'AllowsDogs'`: Checks for `'dogs allowed'` or `'dog friendly'`.
        - `'AllowsCats'`: Checks for `'cats allowed'` or `'cat friendly'`.

### Purpose
This function extracts and standardizes information about property amenities from the HTML content of a listing page. The resulting dictionary can be used to store or analyze the availability of specific amenities for each property.

### Example Usage
```python
# Assuming `soup` is a BeautifulSoup object containing the HTML of a property page
amenities = extract_amenities(soup)
print(amenities)
# Output:
# {
#     'HasWasherDryer': True,
#     'HasAirConditioning': False,
#     'HasPool': True,
#     'HasSpa': False,
#     'HasGym': True,
#     'HasEVCharging': False,
#     'AllowsDogs': True,
#     'AllowsCats': False
# }
```
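
Because the checks in `extract_amenities` use the `in` operator on one combined string, a keyword also matches when it is part of a longer word. The sketch below contrasts that with a whole-word test. `has_term` is a hypothetical helper and the sample strings are invented for illustration.

```python
import re

def has_term(text, *terms):
    """True if any term appears in text as a whole word or phrase."""
    return any(re.search(rf'\b{re.escape(term)}\b', text) for term in terms)

sample = "spacious floor plans, walk-in closets"   # hypothetical amenities text

print('spa' in sample)                              # True: substring hit inside "spacious"
print(has_term(sample, 'spa', 'hot tub'))           # False: no standalone "spa"
print(has_term("pool and spa", 'spa', 'hot tub'))   # True
print(has_term("two pools", 'pool'))                # False: plurals need their own term
```

The trade-off is that whole-word matching misses plurals and other variants unless they are listed explicitly, so the keyword lists would need to grow accordingly.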


In [7]:
def scrape_listings(driver):
    all_units = []
    page = 1

    while True:
        url = f"{BASE_URL}{page}/"
        logging.info(f"Scraping page {page}: {url}")
        driver.get(url)
        time.sleep(WAIT_TIME)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        listings = soup.find_all('article')

        if not listings:
            break

        for listing in listings:
            title = listing.find('span', class_='js-placardTitle')
            address = listing.find('div', class_='property-address')
            phone = listing.find('button', class_='phone-link')
            property_url = listing.get('data-url')

            if title and address and property_url:
                try:
                    driver.get(property_url)
                    time.sleep(WAIT_TIME / 2)
                    detail_soup = BeautifulSoup(driver.page_source, 'html.parser')
                    unit_containers = detail_soup.find_all('li', class_='unitContainer js-unitContainerV3')

                    rental_type = "Unknown"
                    og_title_tag = detail_soup.find("meta", property="og:title")
                    if og_title_tag and og_title_tag.get("content"):
                        content = og_title_tag["content"].lower()
                        for term in ["house rental", "townhome", "condo", "apartment"]:
                            if term in content:
                                rental_type = term.replace(" rental", "").capitalize()

                    amenities = extract_amenities(detail_soup)

                    for unit in unit_containers:
                        unit_number = unit.find('div', class_='unitColumn column')
                        price = unit.find('div', class_='pricingColumn column')
                        sqft = unit.find('div', class_='sqftColumn column')
                        beds = unit.get('data-beds')
                        baths = unit.get('data-baths')

                        all_units.append({
                            'Property': title.text.strip(),
                            'Address': address.text.strip(),
                            'Unit': unit_number.text.strip() if unit_number else "N/A",
                            'Price': price.text.strip() if price else "N/A",
                            'SqFt': sqft.text.strip() if sqft else "N/A",
                            'Beds': beds if beds else "N/A",
                            'Baths': baths if baths else "N/A",
                            'RentalType': rental_type,
                            'Phone': phone.get('phone-data') if phone and phone.has_attr('phone-data') else "N/A",
                            **amenities,
                            'StorageFee': None,
                            'ListingURL': property_url
                        })

                        if TEST_MODE and len(all_units) >= MAX_UNITS:
                            logging.info(f"TEST_MODE: Stopping after {MAX_UNITS} listings.")
                            return pd.DataFrame(all_units)

                except Exception as e:
                    logging.warning(f"Error processing {property_url}: {e}")

        if len(listings) < LISTINGS_PER_PAGE:
            break
        page += 1

    return pd.DataFrame(all_units)

### Explanation of `scrape_listings` Function

The `scrape_listings` function is the core of the web scraping process. It navigates through multiple pages of property listings, extracts relevant details, and compiles them into a structured dataset. Here's a detailed breakdown:

---

#### **1. Initialization**
- **`all_units = []`**:
    - Initializes an empty list to store data for all property units.
- **`page = 1`**:
    - Starts scraping from the first page of listings.

---

#### **2. Page Navigation**
- **`while True:`**:
    - Enters an infinite loop to scrape pages until no more listings are found.
- **`url = f"{BASE_URL}{page}/"`**:
    - Constructs the URL for the current page using the `BASE_URL` and `page` number.
- **`logging.info(f"Scraping page {page}: {url}")`**:
    - Records the current page at the INFO level; with the logger set to `logging.WARNING`, this message is not written to the log file.
- **`driver.get(url)`**:
    - Navigates to the constructed URL using the Selenium WebDriver.
- **`time.sleep(WAIT_TIME)`**:
    - Pauses for `WAIT_TIME` seconds to give the page time to load before parsing.
- **`soup = BeautifulSoup(driver.page_source, 'html.parser')`**:
    - Parses the page's HTML content using BeautifulSoup.
- **`listings = soup.find_all('article')`**:
    - Finds all property listings on the page, which are contained in `<article>` elements.

---

#### **3. Exit Condition**
- **`if not listings:`**:
    - If no listings are found, the loop breaks, indicating the end of available pages.

---

#### **4. Listing Details Extraction**
- Iterates through each listing on the page:
    - **`title = listing.find('span', class_='js-placardTitle')`**:
        - Extracts the property title.
    - **`address = listing.find('div', class_='property-address')`**:
        - Extracts the property address.
    - **`phone = listing.find('button', class_='phone-link')`**:
        - Finds the phone button element; the number itself is read later from its `phone-data` attribute.
    - **`property_url = listing.get('data-url')`**:
        - Extracts the URL for the property details page.

---

#### **5. Property Details Extraction**
- If `title`, `address`, and `property_url` are valid:
    - **`driver.get(property_url)`**:
        - Navigates to the property details page.
    - **`time.sleep(WAIT_TIME / 2)`**:
        - Pauses for half of `WAIT_TIME` (2 seconds with the value above) while the page loads.
    - **`detail_soup = BeautifulSoup(driver.page_source, 'html.parser')`**:
        - Parses the HTML content of the details page.
    - **`unit_containers = detail_soup.find_all('li', class_='unitContainer js-unitContainerV3')`**:
        - Finds all unit-specific details on the page.

---

#### **6. Rental Type Identification**
- **`rental_type = "Unknown"`**:
    - Initializes the rental type as "Unknown".
- **`og_title_tag = detail_soup.find("meta", property="og:title")`**:
    - Finds the Open Graph `<meta property="og:title">` tag, which carries the page title.
- **`if og_title_tag and og_title_tag.get("content"):`**:
    - If the tag has content, lowercases it and checks for the keywords "house rental", "townhome", "condo", and "apartment". The loop has no `break`, so when several keywords are present the last one in the list wins.
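
The keyword loop in `scrape_listings` keeps going after a match, so a later keyword overwrites an earlier one. The sketch below reproduces that behaviour next to a first-match variant. The function names and the sample title are hypothetical; whether real `og:title` values contain more than one keyword is an assumption.

```python
TERMS = ["house rental", "townhome", "condo", "apartment"]

def rental_type_last_match(content):
    """Same logic as the loop in scrape_listings: the last matching term wins."""
    rental_type = "Unknown"
    for term in TERMS:
        if term in content:
            rental_type = term.replace(" rental", "").capitalize()
    return rental_type

def rental_type_first_match(content):
    """Variant that stops at the first matching term in list order."""
    for term in TERMS:
        if term in content:
            return term.replace(" rental", "").capitalize()
    return "Unknown"

title = "condo for rent in san diego, ca | apartments.com"  # hypothetical og:title

print(rental_type_last_match(title))   # Apartment ("apartment" occurs in "apartments.com")
print(rental_type_first_match(title))  # Condo
```

If the list order is meant to express priority, the first-match variant is the one that honours it.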

---

#### **7. Amenities Extraction**
- **`amenities = extract_amenities(detail_soup)`**:
    - Calls the `extract_amenities` function to extract amenities information from the details page.

---

#### **8. Unit Details Extraction**
- Iterates through each unit in `unit_containers`:
    - Extracts details such as:
        - **`unit_number`**: Unit number.
        - **`price`**: Price of the unit.
        - **`sqft`**: Square footage.
        - **`beds`**: Number of bedrooms.
        - **`baths`**: Number of bathrooms.
    - Appends the extracted data to `all_units` as a dictionary, including:
        - Property details, unit details, rental type, amenities, and the property URL.

---

#### **9. Test Mode Limitation**
- **`if TEST_MODE and len(all_units) >= MAX_UNITS:`**:
    - If `TEST_MODE` is enabled and the number of scraped units reaches `MAX_UNITS`, the function stops scraping and returns the collected data as a Pandas DataFrame.

---

#### **10. Pagination**
- **`if len(listings) < LISTINGS_PER_PAGE:`**:
    - If the number of listings on the current page is less than the expected number (`LISTINGS_PER_PAGE`), it indicates the last page, and the loop breaks.
- **`page += 1`**:
    - Increments the page number to scrape the next page.
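
The two stop conditions (an empty page and a short page) can be exercised without a browser. The sketch below mirrors the loop's control flow with a stand-in fetcher; `collect_pages` and `make_fetcher` are hypothetical helpers, and the listings are plain integers.

```python
def collect_pages(fetch_page, per_page):
    """Mirror the loop's two stop conditions using a stand-in page fetcher."""
    items, requests_made, page = [], 0, 1
    while True:
        listings = fetch_page(page)
        requests_made += 1
        if not listings:              # empty page: nothing left to scrape
            break
        items.extend(listings)
        if len(listings) < per_page:  # short page: treated as the last page
            break
        page += 1
    return items, requests_made

def make_fetcher(total, per_page):
    """Hypothetical fetcher that serves `total` numbered listings."""
    def fetch_page(page):
        start = (page - 1) * per_page
        return list(range(start, min(start + per_page, total)))
    return fetch_page

items_a, requests_a = collect_pages(make_fetcher(90, 40), 40)
items_b, requests_b = collect_pages(make_fetcher(80, 40), 40)

print(len(items_a), requests_a)  # 90 3  (third page is short)
print(len(items_b), requests_b)  # 80 3  (third page is empty)
```

When the total is an exact multiple of the page size, the short-page check never fires and the empty-page check ends the loop after one extra request.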

---

#### **11. Return Data**
- **`return pd.DataFrame(all_units)`**:
    - Converts the collected data into a Pandas DataFrame and returns it.

---

### Purpose
This function automates the process of navigating through pages of property listings, extracting relevant details, and compiling them into a structured dataset for further analysis or storage.

### Example Output
The function returns a Pandas DataFrame with columns such as:
- `Property`, `Address`, `Unit`, `Price`, `SqFt`, `Beds`, `Baths`, `RentalType`, `Phone`, `HasWasherDryer`, `HasAirConditioning`, `ListingURL`, etc.

In [8]:
def clean_data(df):
    df['Price'] = df['Price'].apply(extract_low_price)
    df['SqFt'] = pd.to_numeric(df['SqFt'].astype(str).str.replace(',', '').str.extract(r'(\d+)', expand=False), errors='coerce')
    df['Beds'] = pd.to_numeric(df['Beds'], errors='coerce')
    df['Baths'] = pd.to_numeric(df['Baths'], errors='coerce')

    df['ZipCode'] = df['Address'].str.extract(r'(\d{5})(?!.*\d{5})')
    city_state = df['Address'].str.extract(r',\s*([^,]+),\s*([A-Z]{2})\s*\d{5}')
    df['City'] = city_state[0].str.strip()
    df['State'] = city_state[1].str.strip()

    df['PricePerSqFt'] = df.apply(
        lambda row: round(row['Price'] / row['SqFt'], 2)
        if pd.notnull(row['Price']) and pd.notnull(row['SqFt']) and row['SqFt'] > 0 else None,
        axis=1
    )

    df['Beds_Baths'] = df.apply(
        lambda row: f"{int(row['Beds']) if pd.notnull(row['Beds']) else 'N/A'} Bed / {int(row['Baths']) if pd.notnull(row['Baths']) else 'N/A'} Bath",
        axis=1
    )

    final_order = [
        'Property', 'Address', 'City', 'State', 'ZipCode', 'Phone',
        'Unit', 'Beds', 'Baths', 'Beds_Baths', 'SqFt', 'Price', 'PricePerSqFt',
        'RentalType',
        'HasWasherDryer', 'HasAirConditioning', 'HasPool', 'HasSpa',
        'HasGym', 'HasEVCharging', 'StorageFee',
        'AllowsDogs', 'AllowsCats',
        'ListingURL'
    ]
    return df[final_order]

### Explanation of `clean_data` Function

The `clean_data` function processes and cleans the raw data collected during web scraping to ensure it is structured, consistent, and ready for analysis or storage. Here's a detailed breakdown of its components:

---

#### **1. Price Cleaning**
```python
df['Price'] = df['Price'].apply(extract_low_price)
```
- Applies the `extract_low_price` function to the `Price` column to extract the lowest price from various formats (e.g., "$1,200-$1,500" → `1200`).
- Ensures the `Price` column contains numeric values or `None` for invalid entries.

---

#### **2. Square Footage Cleaning**
```python
df['SqFt'] = pd.to_numeric(df['SqFt'].astype(str).str.replace(',', '').str.extract(r'(\d+)', expand=False), errors='coerce')
```
- Converts the `SqFt` column to a string and removes commas (e.g., "1,200" → "1200").
- Extracts the first run of digits using a regular expression (`\d+`), so a range such as "800 - 1,000" yields `800`.
- Converts the result to a numeric type using `pd.to_numeric`, coercing invalid entries to `NaN`.

---

#### **3. Beds and Baths Cleaning**
```python
df['Beds'] = pd.to_numeric(df['Beds'], errors='coerce')
df['Baths'] = pd.to_numeric(df['Baths'], errors='coerce')
```
- Converts the `Beds` and `Baths` columns to numeric types, coercing invalid entries to `NaN`.

---

#### **4. Extracting Zip Code**
```python
df['ZipCode'] = df['Address'].str.extract(r'(\d{5})(?!.*\d{5})')
```
- Extracts the last 5-digit number from the `Address` column, which represents the ZIP code.

---

#### **5. Extracting City and State**
```python
city_state = df['Address'].str.extract(r',\s*([^,]+),\s*([A-Z]{2})\s*\d{5}')
df['City'] = city_state[0].str.strip()
df['State'] = city_state[1].str.strip()
```
- Uses a regular expression to extract the city and state from the `Address` column.
- The city is captured as the first group, and the state (2-letter abbreviation) is captured as the second group.
- Strips any leading or trailing whitespace from the extracted values.
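
The two address patterns can be tried on plain strings with the standard `re` module. `Series.str.extract` returns the groups of the first match, so `re.search` is a reasonable stand-in for a single value. `parse_address` is a hypothetical helper and both addresses are invented.

```python
import re

ZIP_PATTERN = r'(\d{5})(?!.*\d{5})'
CITY_STATE_PATTERN = r',\s*([^,]+),\s*([A-Z]{2})\s*\d{5}'

def parse_address(address):
    """Apply the ZIP and city/state patterns to a single address string."""
    zip_match = re.search(ZIP_PATTERN, address)
    cs_match = re.search(CITY_STATE_PATTERN, address)
    return {
        'ZipCode': zip_match.group(1) if zip_match else None,
        'City': cs_match.group(1).strip() if cs_match else None,
        'State': cs_match.group(2) if cs_match else None,
    }

full = parse_address("12345 Main St, San Diego, CA 92101")  # hypothetical address
short = parse_address("San Diego, CA 92101")                # no street part

print(full)   # {'ZipCode': '92101', 'City': 'San Diego', 'State': 'CA'}
print(short)  # {'ZipCode': '92101', 'City': None, 'State': None}
```

The negative lookahead makes the ZIP pattern skip the five-digit street number. The city/state pattern expects a comma before the city, so an address without a street part yields no city or state.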

---

#### **6. Calculating Price Per Square Foot**
```python
df['PricePerSqFt'] = df.apply(
    lambda row: round(row['Price'] / row['SqFt'], 2)
    if pd.notnull(row['Price']) and pd.notnull(row['SqFt']) and row['SqFt'] > 0 else None,
    axis=1
)
```
- Calculates the price per square foot by dividing the `Price` by `SqFt`.
- Ensures both `Price` and `SqFt` are valid and `SqFt` is greater than 0 before performing the calculation.
- Rounds the result to 2 decimal places or assigns `None` for invalid rows.

---

#### **7. Creating Beds and Baths Summary**
```python
df['Beds_Baths'] = df.apply(
    lambda row: f"{int(row['Beds']) if pd.notnull(row['Beds']) else 'N/A'} Bed / {int(row['Baths']) if pd.notnull(row['Baths']) else 'N/A'} Bath",
    axis=1
)
```
- Creates a summary column combining the number of beds and baths in the format: "X Bed / Y Bath".
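- Replaces missing values with "N/A". Because the values pass through `int()`, fractional counts are truncated: 1.5 baths is shown as "1 Bath".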
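
Half baths are common in rental listings, and `int()` drops the fractional part. The sketch below shows one way to keep it using the `g` format specifier. `format_count` and `beds_baths` are hypothetical helpers, offered as an alternative rather than as the script's behaviour.

```python
import math

def format_count(value):
    """Format a bed or bath count without dropping the fractional part."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "N/A"
    return f"{value:g}"   # 'g' drops a trailing .0 but keeps .5

def beds_baths(beds, baths):
    return f"{format_count(beds)} Bed / {format_count(baths)} Bath"

print(int(1.5))                       # 1 (what the original lambda would show)
print(beds_baths(2.0, 1.5))           # 2 Bed / 1.5 Bath
print(beds_baths(float('nan'), 1.0))  # N/A Bed / 1 Bath
```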

---

#### **8. Reordering Columns**
```python
final_order = [
    'Property', 'Address', 'City', 'State', 'ZipCode', 'Phone',
    'Unit', 'Beds', 'Baths', 'Beds_Baths', 'SqFt', 'Price', 'PricePerSqFt',
    'RentalType',
    'HasWasherDryer', 'HasAirConditioning', 'HasPool', 'HasSpa',
    'HasGym', 'HasEVCharging', 'StorageFee',
    'AllowsDogs', 'AllowsCats',
    'ListingURL'
]
return df[final_order]
```
- Specifies the desired order of columns for the final DataFrame.
- Returns the DataFrame with columns reordered according to `final_order`.

---

### Purpose
The `clean_data` function ensures that the raw scraped data is cleaned, standardized, and organized into a consistent format. This makes the dataset ready for analysis, visualization, or storage in a CSV file.

### Example Output
After cleaning, the DataFrame will have columns such as:
- `Property`, `Address`, `City`, `State`, `ZipCode`, `Phone`, `Unit`, `Beds`, `Baths`, `Beds_Baths`, `SqFt`, `Price`, `PricePerSqFt`, `RentalType`, `HasWasherDryer`, `HasAirConditioning`, `HasPool`, `HasSpa`, `HasGym`, `HasEVCharging`, `StorageFee`, `AllowsDogs`, `AllowsCats`, `ListingURL`.

In [9]:
def main():
    start_time = time.time()
    driver = init_driver()
    df = scrape_listings(driver)
    driver.quit()

    if df.empty:
        print("No data collected. File not saved.")
        logging.warning("No data collected. File not saved.")
    else:
        df = clean_data(df)
        filename = f'test_san_diego_county_rentals_{datetime.today().strftime("%Y-%m-%d")}.csv'
        df.to_csv(filename, index=False)
        print(f"Scraping complete. Data saved to {filename}")
        logging.info(f"Scraping complete. Data saved to {filename}")

    duration = time.time() - start_time
    minutes, seconds = divmod(duration, 60)
    print(f"Script runtime: {int(minutes)} minutes and {seconds:.2f} seconds")

### Explanation of `main` Function

The `main` function serves as the entry point for the web scraping script. It orchestrates the entire workflow, from initializing the web driver to saving the scraped data. Here's a detailed breakdown:

---

#### **1. Start Timer**
```python
start_time = time.time()
```
- Records the start time of the script execution to calculate the total runtime later.

---

#### **2. Initialize WebDriver**
```python
driver = init_driver()
```
- Calls the `init_driver` function to initialize a Selenium WebDriver with the specified configurations (e.g., headless mode).

---

#### **3. Scrape Listings**
```python
df = scrape_listings(driver)
```
- Calls the `scrape_listings` function to scrape property listings from the website.
- The scraped data is returned as a Pandas DataFrame and stored in the variable `df`.

---

#### **4. Quit WebDriver**
```python
driver.quit()
```
- Ends the WebDriver session and closes the browser, releasing system resources once scraping is complete.
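
As written, `driver.quit()` is only reached if `scrape_listings` returns normally; an exception during scraping would skip it and could leave a browser process running. A common way to guard against this is `try`/`finally`. The sketch below uses a stand-in `FakeDriver` class and a deliberately failing scraper (both hypothetical) so it runs without a browser.

```python
class FakeDriver:
    """Stand-in for a Selenium WebDriver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def failing_scrape(driver):
    # Hypothetical scraper that fails partway through.
    raise RuntimeError("page did not load")


def run(scrape, driver):
    try:
        return scrape(driver)
    finally:
        # Runs whether scrape() returned or raised.
        driver.quit()


driver = FakeDriver()
try:
    run(failing_scrape, driver)
except RuntimeError as exc:
    print(f"scrape failed: {exc}")

print(driver.closed)
```

In the real script the same pattern would wrap the `scrape_listings(driver)` call, with the actual driver in place of the stand-in.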

---

#### **5. Check for Empty Data**
```python
if df.empty:
    print("No data collected. File not saved.")
    logging.warning("No data collected. File not saved.")
```
- Checks if the DataFrame `df` is empty (i.e., no data was scraped).
- If empty:
  - Prints a message to the console.
  - Logs a warning message to the log file.
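
How logging is configured is not shown in this section, so which messages reach the log file depends on the threshold set elsewhere. The sketch below assumes a threshold of `WARNING` (Python's default for the root logger) and uses an in-memory stream in place of the log file; the logger name is hypothetical.

```python
import io
import logging

stream = io.StringIO()  # stands in for the log file
logger = logging.getLogger("scraper_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # assumed threshold
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(logging.StreamHandler(stream))

logger.warning("No data collected. File not saved.")
logger.info("Scraping complete.")  # below the threshold, so not written

captured = stream.getvalue()
print(captured)
```

Under that assumption the warning is recorded, while an `info` call like the one in the success branch is silently dropped unless the level is lowered to `INFO`.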

---

#### **6. Clean and Save Data**
```python
else:
    df = clean_data(df)
    filename = f'test_san_diego_county_rentals_{datetime.today().strftime("%Y-%m-%d")}.csv'
    df.to_csv(filename, index=False)
    print(f"Scraping complete. Data saved to {filename}")
    logging.info(f"Scraping complete. Data saved to {filename}")
```
- If data was successfully scraped:
  - Calls the `clean_data` function to clean and process the raw data.
  - Generates a filename for the output CSV file using the current date (e.g., `test_san_diego_county_rentals_YYYY-MM-DD.csv`).
  - Saves the cleaned DataFrame to a CSV file without including the index column.
  - Prints a success message to the console and logs the same message.
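
The filename pattern can be checked in isolation. This is a small sketch with a fixed date; `build_filename` is a hypothetical helper, not a function from the script.

```python
from datetime import date


def build_filename(day, prefix="test_san_diego_county_rentals"):
    """Hypothetical helper: date-stamped CSV name in the script's format."""
    return f"{prefix}_{day.strftime('%Y-%m-%d')}.csv"


print(build_filename(date(2025, 4, 25)))
```

Because the name only varies by date, a second run on the same day writes to the same path, and `to_csv` replaces the earlier file by default.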

---

#### **7. Calculate and Print Runtime**
```python
duration = time.time() - start_time
minutes, seconds = divmod(duration, 60)
print(f"Script runtime: {int(minutes)} minutes and {seconds:.2f} seconds")
```
- Calculates the total runtime of the script by subtracting the start time from the current time.
- Converts the runtime into minutes and seconds using `divmod`.
- Prints the runtime to the console in a human-readable format.
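
A worked example of the `divmod` step, using an invented elapsed time rather than a real measurement:

```python
duration = 76.16  # invented elapsed time in seconds

# divmod returns (quotient, remainder): whole minutes and leftover seconds.
minutes, seconds = divmod(duration, 60)
message = f"Script runtime: {int(minutes)} minutes and {seconds:.2f} seconds"
print(message)
```

Here `divmod` yields 1 whole minute with about 16.16 seconds left over. For measuring elapsed time, `time.perf_counter()` is generally a more suitable clock than `time.time()`, although either works for a rough runtime report like this one.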

---

### Purpose
The `main` function coordinates the entire scraping process, including:
1. Initializing the WebDriver.
2. Scraping property listings.
3. Closing the WebDriver.
4. Cleaning and saving the data.
5. Logging warnings or success messages.
6. Reporting the script's runtime.

Keeping these steps in one function makes the order of operations explicit and gives the script a single entry point.

In [10]:
if __name__ == "__main__":
    main()

Scraping complete. Data saved to test_san_diego_county_rentals_2025-04-25.csv
Script runtime: 0 minutes and 16.16 seconds


### Explanation of `if __name__ == "__main__":`

This construct is a common Python idiom used to ensure that certain code is executed only when the script is run directly, and not when it is imported as a module in another script. Here's a detailed explanation:

---

#### **1. `__name__` Variable**
- In Python, every module (including a script) has a special built-in variable called `__name__`.
- When a script is run directly, `__name__` is set to `"__main__"`.
- When the file is imported as a module, `__name__` is set to the module's name, which for a top-level module is the filename without the `.py` extension (e.g., `"script_name"` for `script_name.py`).

---

#### **2. Purpose of the Check**
```python
if __name__ == "__main__":
```
- This condition checks if the script is being run directly (i.e., `__name__` is `"__main__"`).
- If `True`, the code block inside the `if` statement is executed.
- If the script is imported as a module, the code block is skipped.

---

#### **3. Calling the `main` Function**
```python
main()
```
- If the script is run directly, the `main()` function is called to execute the workflow defined in the script.
- This ensures that the script's functionality is triggered only when intended.

---

#### **4. Why It's Useful**
- **Prevents Unintended Execution**: Ensures that the script's main logic doesn't run automatically when the script is imported as a module.
- **Modularity**: Allows the script to be reused as a module in other scripts without executing its main logic.

---

### Example
```python
# script.py
def main():
    print("This is the main function.")

if __name__ == "__main__":
    main()
```

- **When run directly**:
```bash
$ python script.py
This is the main function.
```

- **When imported**:
```python
# another_script.py
import script
# No output, as `main()` is not called automatically.
```

This construct is a widely used convention for keeping Python code reusable: the same file can serve both as a runnable script and as an importable module.