## Importing Required Libraries

### undetected_chromedriver
- **Purpose:** A patched ChromeDriver wrapper that hides common automation fingerprints from websites that block automated browsing.
- **Use case:** Reduces the chance that the scraping session is detected and blocked. It does not simulate user behavior by itself; pacing and randomized delays are handled separately with `time` and `random`.

### selenium.webdriver.common.by – By
- **Purpose:** Defines the locator strategies (ID, XPath, CSS selector, class name, etc.) used to find elements on a webpage.
- **Use case:** Essential for finding specific HTML elements to interact with during automation.

### selenium.webdriver.support.ui – WebDriverWait
- **Purpose:** Wait for certain conditions or elements to be ready before continuing execution.
- **Use case:** Prevents errors by ensuring elements have loaded or become interactive before acting.

### selenium.webdriver.support.expected_conditions – EC
- **Purpose:** A collection of pre-built conditions (e.g., element clickable, visible).
- **Use case:** Works with `WebDriverWait` to define what to wait for.

### traceback
- **Purpose:** Extract, format, and print stack traces.
- **Use case:** Debugging and error logging during script execution.

### time
- **Purpose:** Provides time-related functions like `sleep()`.
- **Use case:** Pausing execution for set durations to control scraping pace.

### random
- **Purpose:** Generates pseudo-random numbers.
- **Use case:** Introduces randomness (e.g., variable wait times) to mimic human activity.

### re
- **Purpose:** Regular expression operations for pattern matching in strings.
- **Use case:** Extracting or validating text patterns such as prices or dates.

### csv
- **Purpose:** Read from and write to CSV files.
- **Use case:** Store scraped data in a structured, tabular format.

### urllib.parse
- **Purpose:** Manipulate URLs (encode/decode query parameters, parse components).
- **Use case:** Construct or deconstruct URLs during navigation or data extraction.
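
The standard-library modules above can be combined roughly as follows. This is a minimal sketch rather than the notebook's own code: the helper names `drug_slug`, `parse_price`, and `jittered_delay` are hypothetical, and the `$12.34`-style price format is an assumption about what a scraped price string might look like.

```python
import random
import re
import urllib.parse


def drug_slug(url: str) -> str:
    """Return the last path segment of a URL (assumed here to be the drug name)."""
    path = urllib.parse.urlparse(url).path
    return path.strip("/").split("/")[-1]


def parse_price(text: str):
    """Extract the first dollar amount in text as a float, or None if absent."""
    match = re.search(r"\$(\d+(?:,\d{3})*(?:\.\d{2})?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def jittered_delay(low: float = 3.0, high: float = 5.0) -> float:
    """Pick a random pause length (seconds) between low and high."""
    return random.uniform(low, high)
```

A caller would typically pass the result of `jittered_delay()` to `time.sleep()` so that consecutive page loads are not evenly spaced.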


In [7]:
# Importing the undetected_chromedriver module as 'uc'.
# This is a special version of ChromeDriver that helps to avoid detection by websites
# which might block or restrict automated browser actions.
import undetected_chromedriver as uc

# Importing By class from selenium.webdriver.common.by
# 'By' is used to locate elements on a webpage, such as by ID, XPATH, class name, etc.
from selenium.webdriver.common.by import By

# Importing WebDriverWait class from selenium.webdriver.support.ui
# WebDriverWait allows us to wait for certain conditions or elements to be ready 
# before proceeding with the next steps in automation.
from selenium.webdriver.support.ui import WebDriverWait

# Importing expected_conditions as EC from selenium.webdriver.support
# expected_conditions provides a set of pre-built conditions to use with WebDriverWait,
# e.g., waiting for an element to be clickable or visible.
from selenium.webdriver.support import expected_conditions as EC

# Importing traceback module
# This module provides utilities to extract, format, and print stack traces of Python programs.
# Useful for debugging errors and exceptions.
import traceback

# Importing time module
# Provides various time-related functions such as sleep(), which pauses execution for a given number of seconds.
import time

# Importing random module
# This module implements pseudo-random number generators for various distributions.
# Often used to introduce randomness, e.g., random wait times to mimic human behavior.
import random

# Importing re module
# This module provides regular expression matching operations, allowing pattern matching in strings.
import re

# Importing csv module
# Used for reading from and writing to CSV (Comma-Separated Values) files easily.
import csv

# Importing urllib.parse module
# Contains functions to manipulate URLs, such as encoding query parameters or parsing URL components.
import urllib.parse


### Function: `get_coupons_with_retry`

This function automates the process of retrieving coupon prices from a GoodRx webpage using Selenium.

**Purpose:**
- Locate and extract two coupon prices from dynamically loaded page elements.
- Retry multiple times if coupons are not immediately visible, reducing the chance of missing data due to slow loading.

**Key Features:**
- Uses `WebDriverWait` with a short timeout to avoid long script freezes.
- Scrolls to coupon elements to ensure they are in view before extraction.
- Handles missing or empty text values by returning `"N/A"`.
- Includes a retry mechanism with a configurable number of attempts and delays between retries.

**Parameters:**
- `driver`: Selenium WebDriver controlling the browser session.
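- `wait`: `WebDriverWait` instance supplied by the caller. The current implementation does not use it; each attempt creates its own 5-second wait instead.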
- `retries`: Number of times to attempt retrieval before failing.
- `delay`: Pause duration (seconds) between retries.

**Returns:**
- Tuple containing two coupon prices as strings.
- Returns `("N/A", "N/A")` if prices are not retrieved after all retries.
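
The retry behavior described above can be sketched independently of Selenium. The names `retry`, `fetch`, `fallback`, and the injectable `sleep` parameter are hypothetical and exist only for this illustration; the sketch treats either an exception or a `None` result as a failed attempt.

```python
import time


def retry(fetch, retries=3, delay=2.0, fallback=("N/A", "N/A"), sleep=time.sleep):
    """Call fetch() up to `retries` times; return `fallback` if every attempt fails."""
    for attempt in range(retries):
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception:
            pass  # treat any error as a failed attempt
        # Pause before the next attempt, but not after the final one
        if attempt < retries - 1:
            sleep(delay)
    return fallback
```

One design difference from `get_coupons_with_retry`: this sketch skips the pause after the last failed attempt, since there is nothing left to wait for.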


In [8]:
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_coupons_with_retry(driver, wait, retries=3, delay=2):
    """
    Attempts to retrieve coupon prices from the webpage, retrying if elements are not found or visible.
    
    Parameters:
    - driver: Selenium WebDriver instance controlling the browser.
    - wait: WebDriverWait instance supplied by the caller (currently unused; a 5-second wait is created inline).
    - retries: Number of times to retry finding the coupon elements before giving up (default 3).
    - delay: Number of seconds to wait between retries (default 2).
    
    Returns:
    - Tuple of two strings representing coupon prices if found.
    - Returns ("N/A", "N/A") if coupons could not be retrieved after all retries.
    """
    
    for attempt in range(retries):
        try:
            # Use a short, inline 5-second wait instead of the caller's `wait`,
            # which may carry a long timeout (scrape_goodrx uses 30 seconds) and
            # would stall every retry. Waits until all matching elements are visible.
            coupons = WebDriverWait(driver, 5).until(
                EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "span[data-qa='pricing-option-price']"))
            )
            
            # Once visible elements are found
            if len(coupons) >= 2:
                driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", coupons[0])
                time.sleep(1)  # Allow for animations/item loads to finish
                
                # Clean the coupon text; if empty, treat as "N/A"
                coupon1 = coupons[0].text.strip() or "N/A"
                coupon2 = coupons[1].text.strip() or "N/A"
                
                return coupon1, coupon2
            
            else:
                # Not enough coupon elements found yet, retry after delay
                time.sleep(delay)
        except Exception as e:
            # Catch specific exceptions if you prefer (TimeoutException, NoSuchElementException)
            # but broad except keeps it safe here.
            
            # Optional: log the exception for debugging
            # print(f"[Warning] Attempt {attempt + 1}: Exception when fetching coupons: {e}")
            
            time.sleep(delay)
    
    # After all retries exhausted, return N/A
    return "N/A", "N/A"


### Function: `scrape_goodrx`

Automates the process of scraping GoodRx pharmacy pricing information for a given drug and ZIP code.

**Purpose:**
- Open a drug’s GoodRx page in a Chrome browser (undetected mode).
- Enter a specified ZIP code to update location-based prices.
- Extract pharmacy details, prices, dosages, quantities, and special offers.
- Retrieve coupon prices for listings with special offers.

**Key Features:**
- **Browser automation:** Uses `undetected_chromedriver` to avoid bot detection.
- **Dynamic element handling:** Waits for and interacts with modal dialogs, inputs, and buttons.
- **Data extraction:** Collects drug details, pharmacy names, standard prices, special offers, and coupon prices.
- **Fallback handling:** Assigns `"N/A"` where data is missing.
- **Retry mechanism:** Uses `get_coupons_with_retry` to fetch coupon prices reliably.
- **Graceful termination:** Closes browser even if errors occur.

**Parameters:**
- `drug_url` *(str)*: Full GoodRx URL for the target drug.
- `zip_code` *(str)*: ZIP code to use for location-based pricing.

**Returns:**
- **List[dict]**: Each dictionary contains:
  - `drug`, `location`, `dosage`, `quantity`, `pharmacy`, `price`,  
    `special_offer`, `standard_coupon_price`, `special_coupon_price`
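
The `drug`, `dosage`, and `quantity` fields come from a single line of page text that is split with a regular expression. A minimal sketch of that step is shown here in isolation, using the same pattern as `scrape_goodrx`. The helper name `parse_drug_info` is hypothetical, and the sample strings in the usage note are assumptions about the page text format, not captured output.

```python
import re

NA = "N/A"


def parse_drug_info(text, drug_name):
    """Split text like '<Name> <N>mg (<count> <unit>)' into (name, dosage, quantity).

    Returns ("N/A", "N/A", "N/A") when the text does not match the pattern.
    """
    pattern = rf"({re.escape(drug_name)})\s+(\d+mg)\s+\((\d+\s+\w+)\)"
    match = re.match(pattern, text)
    return match.groups() if match else (NA, NA, NA)
```

Under these assumptions, `parse_drug_info("Atorvastatin 40mg (30 tablets)", "Atorvastatin")` yields `("Atorvastatin", "40mg", "30 tablets")`. Because the pattern only accepts whole-number doses in `mg`, a string such as `"Levothyroxine 50mcg (30 tablets)"` falls through to the `"N/A"` tuple, which is worth keeping in mind when reading the output.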


In [9]:
def scrape_goodrx(drug_url, zip_code):
    scraped_results = []  # Will hold all scraped data entries

    # Parse drug name from URL path, capitalize first letter
    parsed_url = urllib.parse.urlparse(drug_url)
    drug_name_from_url = parsed_url.path.strip("/").split("/")[-1]
    drug_name_formatted = drug_name_from_url.capitalize()  # e.g. 'Atorvastatin'

    try:
        # Initialize Chrome WebDriver options to start maximized for visual stability
        options = uc.ChromeOptions()
        options.add_argument("--start-maximized")

        # Create undetected chromedriver instance (to avoid easy detection)
        driver = uc.Chrome(options=options)
        driver.delete_all_cookies()

        # Navigate to drug URL
        driver.get(drug_url)
        time.sleep(random.uniform(3, 5))  # Random delay to mimic human browsing
        print(f"✅ Page loaded: {driver.title}")

        wait = WebDriverWait(driver, 30)  # Explicit wait setup

        # Step 1: Click location selector to enter ZIP code
        location_span = wait.until(EC.element_to_be_clickable((By.XPATH, "//span[contains(@class,'truncate')]")))
        location_span.click()
        time.sleep(2)  # Allow modal to appear

        # Step 2: Input ZIP code and submit
        zip_input = wait.until(EC.presence_of_element_located((By.ID, "locationModalAddress")))
        zip_input.clear()
        zip_input.send_keys(zip_code)

        submit_button = wait.until(EC.element_to_be_clickable(
            (By.XPATH, "//button[@type='submit' and @form='locationModalForm' and contains(translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'set location')]")
        ))
        submit_button.click()
        time.sleep(6)  # Wait longer for page to update prices by location

        # Find all pharmacy cards displayed on the page
        pharmacy_cards = driver.find_elements(By.CSS_SELECTOR, "li.list-none.flex.shadow-raised")
        print(f"ℹ️ Found {len(pharmacy_cards)} pharmacy cards for ZIP {zip_code}")

        # Iterate over each pharmacy card to extract details
        for i, card in enumerate(pharmacy_cards):
            try:
                # Scroll pharmacy card into view smoothly
                driver.execute_script("arguments[0].scrollIntoView({behavior: 'smooth', block: 'center'});", card)
                time.sleep(1.5)  # pause to stabilize rendering

                # Extract drug info dynamically by locating the span containing the drug name
                try:
                    xpath_str = f"//span[contains(text(),'{drug_name_formatted}')]"
                    drug_info = driver.find_element(By.XPATH, xpath_str).text.strip()
                    pattern = rf"({re.escape(drug_name_formatted)})\s+(\d+mg)\s+\((\d+\s+\w+)\)"
                    match = re.match(pattern, drug_info)
                    if match:
                        drug_name, dosage, quantity = match.groups()
                    else:
                        drug_name, dosage, quantity = "N/A", "N/A", "N/A"
                except Exception:
                    drug_name, dosage, quantity = "N/A", "N/A", "N/A"

                # Extract pharmacy name
                try:
                    pharmacy = card.find_element(By.CSS_SELECTOR, "span[data-qa='seller-name']").text.strip()
                except Exception:
                    pharmacy = "N/A"

                # Extract price listed
                try:
                    price = card.find_element(By.CSS_SELECTOR, "span[data-qa='seller-price']").text.strip()
                except Exception:
                    price = "N/A"

                # Extract special offer description
                try:
                    offer = card.find_element(By.CSS_SELECTOR, "span[data-qa='special-offer-text']").text.strip()
                except Exception:
                    offer = "No special offer"

                # Handle cards without special offers differently - just append data
                if offer.lower() == "no special offer":
                    scraped_results.append({
                        "drug": drug_name,
                        "location": zip_code,
                        "dosage": dosage,
                        "quantity": quantity,
                        "pharmacy": pharmacy,
                        "price": price,
                        "special_offer": offer,
                        "standard_coupon_price": "N/A",
                        "special_coupon_price": "N/A",
                    })
                    print(f"Listed (No Special Offer): {pharmacy} at {zip_code}")
                    print("-" * 50)
                    continue

                # For cards with special offers, click to load coupon prices
                try:
                    card.click()
                except Exception:
                    pass

                # Retrieve the coupon prices, retrying if they have not loaded yet
                standard, special = get_coupons_with_retry(driver, wait)

                scraped_results.append({
                    "drug": drug_name,
                    "location": zip_code,
                    "dosage": dosage,
                    "quantity": quantity,
                    "pharmacy": pharmacy,
                    "price": price,
                    "special_offer": offer,
                    "standard_coupon_price": standard,
                    "special_coupon_price": special,
                })

                print(f"Scraped: {pharmacy} at {zip_code} (Special Offer)")
                print("-" * 50)

            except Exception:
                print(f"❌ Error on card {i+1} at {zip_code}")
                traceback.print_exc()
                continue

    except Exception:
        print("❌ Critical failure during scrape")
        traceback.print_exc()

    finally:
        try:
            driver.quit()
            print("🔚 Browser closed")
        except Exception:
            pass

    return scraped_results


### Target Drug URLs and ZIP Codes

**Drug URLs:**
A predefined list of GoodRx drug pages to scrape.  
Each URL points to a specific drug’s pricing and offers page.

**ZIP Codes:**
A list of U.S. ZIP codes used for location-based pricing lookups.  
Scraping for multiple ZIP codes allows price comparison across different regions.


In [11]:
# List of URLs for different drugs on the GoodRx website
# Each URL corresponds to a specific drug page to scrape pricing and offers from
drug_urls = [
    "https://www.goodrx.com/atorvastatin",
    "https://www.goodrx.com/lisinopril",
    "https://www.goodrx.com/metformin",
    "https://www.goodrx.com/amlodipine",
    "https://www.goodrx.com/omeprazole",
    "https://www.goodrx.com/simvastatin",
    "https://www.goodrx.com/losartan",
    "https://www.goodrx.com/albuterol",
    "https://www.goodrx.com/hydrochlorothiazide",
    "https://www.goodrx.com/levothyroxine",
]


# List of ZIP codes to specify location-based price lookups
# Each ZIP code corresponds to a geographical area in the US to emulate user location
zip_codes = [
    "10001",  # ZIP code for New York, NY
    "90001",  # ZIP code for Los Angeles, CA
    "60601",  # ZIP code for Chicago, IL
]


### Scraping Loop – Multiple Drugs and Locations

This section runs the scraping process for every combination of target drug and ZIP code.

**Process:**
1. Iterate over each drug URL in `drug_urls`.
2. For each drug, iterate over each ZIP code in `zip_codes`.
3. Call `scrape_goodrx()` to retrieve pricing and offer data for that drug-location pair.
4. Append results to `all_scraped_data`, a master list storing all collected records.
5. Pause for 10 seconds between requests to avoid overwhelming the GoodRx servers and reduce the risk of being blocked.
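
The process above can also be expressed with `itertools.product`, which yields the same drug and ZIP pairs in the same order as the nested loops. This is a sketch under stated assumptions: `run_all` is a hypothetical name, and the `scrape` and `sleep` callables are passed in so the loop can be exercised without opening a browser.

```python
import itertools
import time


def run_all(drug_urls, zip_codes, scrape, pause=10, sleep=time.sleep):
    """Run scrape(url, zip) for every combination and collect all records."""
    records = []
    for drug_url, zip_code in itertools.product(drug_urls, zip_codes):
        # scrape() is expected to return a list of dictionaries
        records.extend(scrape(drug_url, zip_code))
        # Polite pause after each run
        sleep(pause)
    return records
```

With the notebook's inputs this would be called as `run_all(drug_urls, zip_codes, scrape_goodrx)`, giving 10 × 3 = 30 browser sessions.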


In [12]:
# Initialize an empty list to collect all scraped data from multiple drug and ZIP code combinations
all_scraped_data = []

# Outer loop: iterate over each drug URL in the drug_urls list
for drug_url in drug_urls:
    # Inner loop: iterate over each ZIP code in the zip_codes list for location-specific scraping
    for zip_code in zip_codes:
        # Print a message to indicate which drug URL and ZIP code is currently being scraped
        print(f"\n=== Scraping {drug_url} for ZIP {zip_code} ===")
        
        # Call the scrape_goodrx function to scrape pricing and offer data for the specific drug and location
        data = scrape_goodrx(drug_url, zip_code)
        
        # Extend the master all_scraped_data list with the scraped records (a list of dictionaries)
        all_scraped_data.extend(data)
        
        # Pause execution for 10 seconds to avoid overwhelming the target website
        # This polite delay reduces the risk of being blocked or flagged as a bot
        time.sleep(10)


### Saving Scraped Data to CSV

After completing the scraping process, this section saves the collected data to a CSV file.

**Process:**
1. Check if `all_scraped_data` contains any records.
2. Use the keys from the first record as column headers.
3. Create a new file named `scraped_goodrx_data.csv` with UTF-8 encoding.
4. Write the header row followed by all scraped records.
5. Print a confirmation message showing the total number of records saved.
6. If no data was scraped, print a message indicating the dataset is empty.
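
One way to check the steps above is a round trip: write the records with `csv.DictWriter`, then read them back with `csv.DictReader` and compare. The sketch below uses an in-memory buffer instead of a file so it leaves nothing on disk. The function names `records_to_csv` and `csv_to_records` are hypothetical, and the record in the usage note is a placeholder, not scraped data.

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of dictionaries to CSV text; return '' for an empty list."""
    if not records:
        return ""
    buffer = io.StringIO(newline="")
    # Column headers come from the keys of the first record
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_to_records(text):
    """Parse CSV text back into a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

For a placeholder record such as `{"pharmacy": "Example Pharmacy", "price": "$1.00"}`, `csv_to_records(records_to_csv([record]))` returns a one-element list equal to `[record]`. Note that `DictWriter` raises `ValueError` if a later record contains a key that the first record lacks, so all records should share the same keys.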


In [15]:
# Check if the list 'all_scraped_data' contains any scraped data
if all_scraped_data:
    
    # Get the dictionary keys from the first item to use as CSV column headers
    keys = all_scraped_data[0].keys()
    
    # Open a new CSV file named "scraped_goodrx_data.csv" in write mode
    # 'newline=""' prevents adding extra blank lines on some platforms
    # 'encoding="utf-8"' ensures proper character encoding support
    with open("scraped_goodrx_data.csv", "w", newline="", encoding="utf-8") as csvfile:
        
        # Create a CSV DictWriter object using the keys as fieldnames (column headers)
        writer = csv.DictWriter(csvfile, fieldnames=keys)
        
        # Write the header row to the CSV file
        writer.writeheader()
        
        # Write all the dictionaries in 'all_scraped_data' as rows in the CSV file
        writer.writerows(all_scraped_data)
    
    # Print a confirmation message including the number of records saved
    print(f"Saved {len(all_scraped_data)} records to scraped_goodrx_data.csv")

else:
    # If 'all_scraped_data' is empty, print a message to indicate no data was scraped
    print("No data scraped.")


Saved 345 records to scraped_goodrx_data.csv
