# Scraping Zocdoc.com

This notebook defines a function that scrapes every primary care physician (PCP) that zocdoc.com returns within a 1-mile radius of a given zip code.

The information scraped for each doctor includes:
* Name
* Specialty
* Address
* Upcoming appointments for the next five days, or soonest available appointment if they don't have any within five days
* Rating and total number of reviews
* In-network insurances
* Education
* Languages spoken at their office
* Gender
* NPI number (National Provider Identifier)

The scrape function does not scrape the text of individual reviews, only the rating and the review count.

This notebook also contains a function to clean and standardize that data and output it to a pandas DataFrame and a CSV file.

Lastly, it will run the scraper on each zip code in NYC to produce a comprehensive dataset of doctors on zocdoc.com and their upcoming availability.

In [86]:
import pandas as pd
import numpy as np

import requests
from bs4 import BeautifulSoup
import json

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains

from webdriver_manager.chrome import ChromeDriverManager

from datetime import datetime

In [99]:
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))  # Selenium 4: pass the driver path via Service

Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [/Users/jmingram/.wdm/drivers/chromedriver/mac64/96.0.4664.45/chromedriver] found in cache


A function that scrolls to the insurance section of an individual doctor profile, opens the insurance popup, and scrapes the list of in-network insurances from it.

In [88]:
def get_insurance():
    # Profiles without an insurance section have nothing to scrape
    try:
        scroll = driver.find_element(By.ID, 'insurance-target-element')
    except Exception:
        return []
    driver.execute_script("arguments[0].scrollIntoView(true);", scroll)
    time.sleep(2)  # give the section time to render after scrolling
    # The link inside the insurance section opens the popup
    button = driver.find_element(By.ID, 'insurance-target-element').find_element(By.TAG_NAME, 'a')
    button.click()
    insurance_soup = BeautifulSoup(driver.page_source)
    # These hashed class names look auto-generated and may change without notice
    insurances = insurance_soup.find_all(class_='dx0sxs-0 nCTkH')
    return [insurance.text for insurance in insurances]
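
The fixed `time.sleep(2)` in get_insurance waits the same two seconds whether the section renders almost immediately or never renders at all. A minimal sketch of the alternative, polling until a condition holds, is shown here. The names `wait_until`, `condition`, `timeout` and `poll` are hypothetical and not part of the notebook. Selenium's `WebDriverWait`, imported at the top of the notebook, applies the same idea to page elements.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.25):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll)

# Stand-in for "has the element rendered yet?": becomes true on the third check
checks = []
def ready():
    checks.append(1)
    return len(checks) >= 3

wait_until(ready, timeout=2.0, poll=0.01)
print(f'condition met after {len(checks)} checks')
```

The helper returns as soon as the condition holds, so the common case costs far less than a fixed sleep, while a page that never loads fails loudly instead of silently returning an empty list.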

A function to check whether a doctor has already been scraped when multiple zip codes are being scraped, so that the scrape can skip them and run faster.

In [89]:
def check_for_dupes(df_list, column, name):
    # Return True if name already appears in the given column of any dataframe in df_list
    for df in df_list:
        if name in df[column].tolist():
            print(f'HERE: {name}')
            return True
    return False
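
check_for_dupes rebuilds a list from every earlier dataframe on each call, so the cost of a lookup grows with the number of zip codes already scraped. A sketch of one alternative, assuming it is acceptable to keep a running set of names alongside the dataframes, is below. `seen_names` and `is_new` are hypothetical names, and the doctor names are placeholders.

```python
seen_names = set()

def is_new(name, seen=seen_names):
    """Return True the first time a name is offered, False on every repeat."""
    if name in seen:
        return False
    seen.add(name)
    return True

# Placeholder names, for illustration only
offered = ['Dr. A', 'Dr. B', 'Dr. A']
fresh = [name for name in offered if is_new(name)]
print(fresh)
```

Like check_for_dupes, this keys on the display name, so two different doctors who share a name would be treated as one.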

The scrape function takes a zip code and scrapes information on all doctors within a one-mile radius of it. If multi_zip evaluates to True, it runs the check_for_dupes function above against the global list all_zips_dfs so that it doesn't re-scrape doctors already collected for other zip codes. For a complete scrape of a single zip code, multi_zip must be set to False. The function returns a list of dictionaries, one per doctor.

In [98]:
def scrape(zip_code, multi_zip):
    print(f'Starting scrape at {datetime.now()}')
    driver.maximize_window()
    driver.get(f'https://www.zocdoc.com/search?address={zip_code}&after_5pm=false&before_10am=false&city=New+York&day_filter=AnyDay&dr_specialty=153&filters=%7B%22distance_radius%22%3A%5B%22to_1_mile%22%5D%7D&gender=-1&insurance_carrier=-1&insurance_plan=-1&language=-1&locationType=placemark&offset=0&reason_visit=75&searchQueryGuid=5bbddc60-6399-4dba-8faa-d624ed6c6018&searchType=specialty&search_query=Primary+Care+Physician+%28PCP%29&sort_type=Default&timesgridType=fiveDays&visitType=inPersonVisit')
    soup_doc = BeautifulSoup(driver.page_source)
    results = []
    doc_list = []
    pages = soup_doc.find(attrs={'data-test': 'search-pagination'}).find_all('a')
    print(f"Found {len(pages)} pages")
    #Iterate through all search result pages
    for i, page in enumerate(pages):
        print(f"On page {i + 1}")
        #Iterate through all search results
        search_results = soup_doc.find_all(attrs={'data-test': 'search-result-item'})
        print(f"Found {len(search_results)} search results")
        for card in search_results:
            if (card.find(attrs={'data-test': 'doctor-card-info-name'}).text) in doc_list:
                print(f"SKIPPING: {card.find(attrs={'data-test': 'doctor-card-info-name'}).text}")
                continue
            elif multi_zip:
                if check_for_dupes(all_zips_dfs, 'name', card.find(attrs={'data-test': 'doctor-card-info-name'}).text):
                    continue
            doc = {}
            doc['name'] = card.find(attrs={'data-test': 'doctor-card-info-name'}).text
            print(doc['name'])
            doc_list.append(doc['name'])
            doc['specialty'] = card.find(attrs={'data-test': 'doctor-card-info-specialty'}).text
            if card.find(attrs={'data-test': 'doctor-card-info-location-address'}) is not None:
                #Address doesn't appear for telehealth only providers
                doc['street_address'] = card.find(attrs={'data-test': 'doctor-card-info-location-address'}).text
                doc['city'] = card.find(attrs={'data-test': 'doctor-card-info-location-city'}).text
                doc['state'] = card.find(attrs={'data-test': 'doctor-card-info-location-state'}).text
                doc['zip'] = card.find(attrs={'data-test': 'doctor-card-info-location-zip'}).text
            if card.find(attrs={'data-test': 'doctor-card-info-rating-number'}) is not None:
                doc['rating'] = card.find(attrs={'data-test': 'doctor-card-info-rating-number'}).text 
                doc['num_reviews'] = card.find(attrs={'data-test': 'doctor-card-review-count'}).text[1:-1]
            else:
                doc['rating'] = np.nan
                doc['num_reviews'] = '0'
            if card.find(attrs={'data-test': 'no-availability-view'}) is not None:
                doc['num_appts_next_5days'] = 0
            else:
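                # Collect the open slots across all day columns rather than keeping only the last column's
                slots = [a for grid in card.find_all(attrs={'data-test': 'timesgrid-day-column'}) for a in grid.find_all('a')]
                doc['num_appts_next_5days'] = len(slots)
                if slots:
                    # Assumes the columns appear in chronological order, so the first slot is the soonest
                    doc['next_appt'] = slots[0]['aria-label']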
            if card.find_all(attrs={'data-is-sponsored-result': 'true'}): 
                doc['sponsored'] = True
            else:
                doc['sponsored'] = False
            doc['profile_url'] = card.find('a')['href']
            #Move to individual provider profile
            driver.get('http://zocdoc.com' + card.find('a')['href'])
            doctor_soup = BeautifulSoup(driver.page_source)
            educ = []
            for item in doctor_soup.find_all(attrs={'data-test': 'education-list'}):
                educ.append(item.text)
            doc['education'] = educ 
            langs = []
            if doctor_soup.find(attrs={'data-test': 'Languages-section'}) is not None:
                for item in doctor_soup.find(attrs={'data-test': 'Languages-section'}).find_all('li'):
                    langs.append(item.text)
            doc['languages'] = langs
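            # Guard against profiles that lack this section, as is done for languages above
            sex_section = doctor_soup.find(attrs={'data-test': 'Sex-section'})
            doc['gender'] = sex_section.find('p').text if sex_section is not None else np.nan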
            if doctor_soup.find(attrs={'itemprop': 'identifier'}) is not None:
                doc['npi'] = doctor_soup.find(attrs={'itemprop': 'identifier'}).text #dtype string
            if doc['num_appts_next_5days'] == 0:
                try:
                    doc['next_appt'] = doctor_soup.find(attrs={'role': 'gridcell'})['aria-label']
                except:
                    doc['next_appt'] = np.nan
            try:
                doc['insurance'] = get_insurance()
            except:
                print("error gathering insurance data")
            doc['asof'] = datetime.now()
            results.append(doc)
            #Return to search results page
            driver.get('http://zocdoc.com' + page['href'])
            soup_doc = BeautifulSoup(driver.page_source)
        #Move to next page
        page_nav = driver.find_element(By.TAG_NAME, 'nav').find_elements(By.TAG_NAME, 'span')
        if i + 1 < len(page_nav):
            page_nav[i+1].find_element(By.TAG_NAME, 'a').click()
            soup_doc = BeautifulSoup(driver.page_source)
    print(f'Finished scrape at {datetime.now()}')
    return results
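
The search URL hard-coded in scrape is easier to read and adjust when it is assembled from named parameters. The sketch below rebuilds the same query string with the standard library. `build_search_url` is a hypothetical helper, not part of the notebook, and the parameter values are copied from the URL above without any claim about what each one means to the site.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(zip_code):
    """Assemble the PCP search URL for one zip code from named parameters."""
    params = {
        'address': zip_code,
        'after_5pm': 'false',
        'before_10am': 'false',
        'city': 'New York',
        'day_filter': 'AnyDay',
        'dr_specialty': '153',
        # The filters parameter is compact JSON, percent-encoded by urlencode
        'filters': json.dumps({'distance_radius': ['to_1_mile']}, separators=(',', ':')),
        'gender': '-1',
        'insurance_carrier': '-1',
        'insurance_plan': '-1',
        'language': '-1',
        'locationType': 'placemark',
        'offset': '0',
        'reason_visit': '75',
        'searchQueryGuid': '5bbddc60-6399-4dba-8faa-d624ed6c6018',
        'searchType': 'specialty',
        'search_query': 'Primary Care Physician (PCP)',
        'sort_type': 'Default',
        'timesgridType': 'fiveDays',
        'visitType': 'inPersonVisit',
    }
    return 'https://www.zocdoc.com/search?' + urlencode(params)

url = build_search_url('10001')
query = parse_qs(urlsplit(url).query)
print(query['address'], query['filters'])
```

Because the same URL also appears in update_availability_scrape, a helper like this would keep the two copies from drifting apart.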

A function to turn the output of scrape, a list of dictionaries, into a dataframe with the appropriate data types, as well as save it to a csv. 

In [92]:
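def clean_data(results, csv_name):
    df = pd.DataFrame(results)
    # Drop repeated NPIs but keep rows with no NPI; drop_duplicates(subset='npi')
    # would treat every missing NPI as a duplicate of the first one
    df = df[~df.npi.duplicated() | df.npi.isna()].copy()
    df.npi = df.npi.astype(int, errors='ignore')
    df.rating = df.rating.astype(float, errors='ignore')
    # Strip thousands separators and any trailing words, leaving only the count
    df.num_reviews = df.num_reviews.str.replace(',', '')
    df.num_reviews = df.num_reviews.str.split(' ').str[0]
    df.num_reviews = df.num_reviews.fillna(0).astype(int)
    # Appointment labels that don't match this format leave the column unconverted
    df.next_appt = pd.to_datetime(df.next_appt, format='%I:%M %p on %A, %B %d, %Y', errors='ignore')
    df.to_csv(csv_name, index=False)
    return df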
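
To make the two string conversions in clean_data concrete, here is a sketch of the same steps applied to single values with the standard library. `parse_appt` and `parse_review_count` are hypothetical helpers, and the sample strings are made up to match the format string used above; they are not captured from the site.

```python
from datetime import datetime

APPT_FORMAT = '%I:%M %p on %A, %B %d, %Y'  # same format string as in clean_data

def parse_appt(label):
    """Return a datetime for a label in APPT_FORMAT, or None if it does not match."""
    try:
        return datetime.strptime(label, APPT_FORMAT)
    except (TypeError, ValueError):
        return None

def parse_review_count(text):
    """Turn a count such as '1,234' into an int; empty or missing text becomes 0."""
    if not text:
        return 0
    return int(text.replace(',', '').split(' ')[0])

# Made-up sample values
print(parse_appt('9:30 AM on Monday, December 13, 2021'))
print(parse_appt('no slots listed'))
print(parse_review_count('1,234 reviews'))
```

Returning None for a label that does not match mirrors a per-row fallback, whereas `errors='ignore'` in clean_data leaves the whole column as strings if any label fails to parse.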

A faster scrape to get updated appointment availability. It requires a dataframe produced by the scrape and clean_data functions. The code is largely borrowed from the scrape function; it searches for the doctors in the dataframe and updates their next_appt and num_appts_next_5days values in place.

In [91]:
def update_availability_scrape(frame_to_update):
    print(f'Scrape began at {datetime.now()}')
    driver.maximize_window()
    zips = set(list(frame_to_update.zip))
    updated_list = []
    frame_to_update['asof'] = ''
    for zip_code in zips:
        print(f'Starting zip code: {zip_code}')
        driver.get(f'https://www.zocdoc.com/search?address={zip_code}&after_5pm=false&before_10am=false&city=New+York&day_filter=AnyDay&dr_specialty=153&filters=%7B%22distance_radius%22%3A%5B%22to_1_mile%22%5D%7D&gender=-1&insurance_carrier=-1&insurance_plan=-1&language=-1&locationType=placemark&offset=0&reason_visit=75&searchQueryGuid=5bbddc60-6399-4dba-8faa-d624ed6c6018&searchType=specialty&search_query=Primary+Care+Physician+%28PCP%29&sort_type=Default&timesgridType=fiveDays&visitType=inPersonVisit')
        soup_doc = BeautifulSoup(driver.page_source)
        pages = soup_doc.find(attrs={'data-test': 'search-pagination'}).find_all('a')
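        for i, page in enumerate(pages):
            # Load each results page explicitly; otherwise every iteration re-reads the first page
            driver.get('http://zocdoc.com' + page['href'])
            soup_doc = BeautifulSoup(driver.page_source)
            search_results = soup_doc.find_all(attrs={'data-test': 'search-result-item'})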
            for card in search_results:
                name = card.find(attrs={'data-test': 'doctor-card-info-name'}).text
                if name not in updated_list:
                    print('Found a new name')
                    if card.find(attrs={'data-test': 'no-availability-view'}) is not None:
                        print('No upcoming availability, went into the page to get something else')
                        frame_to_update.loc[frame_to_update.name == name, 'num_appts_next_5days'] = 0 
                        time.sleep(0.5)
                        driver.get('http://zocdoc.com' + card.find('a')['href'])
                        doctor_soup = BeautifulSoup(driver.page_source)
                        try:
                            frame_to_update.loc[frame_to_update.name == name, 'next_appt'] = doctor_soup.find(attrs={'role': 'gridcell'})['aria-label'] #mod
                        except:
                            frame_to_update.loc[frame_to_update.name == name, 'next_appt'] = np.nan 
                    else:
                        print('Updating upcoming availability')
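                        # Collect the open slots across all day columns rather than keeping only the last column's
                        slots = [a for grid in card.find_all(attrs={'data-test': 'timesgrid-day-column'}) for a in grid.find_all('a')]
                        frame_to_update.loc[frame_to_update.name == name, 'num_appts_next_5days'] = len(slots)
                        if slots:
                            # Assumes the columns appear in chronological order, so the first slot is the soonest
                            frame_to_update.loc[frame_to_update.name == name, 'next_appt'] = slots[0]['aria-label']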
                    updated_list.append(name)
                    frame_to_update.loc[frame_to_update.name == name, 'asof'] = datetime.now()
                else: print('Already updated this name')
            time.sleep(1)
    print(f'Scrape concluded at {datetime.now()}')

## Citywide Scrape

Getting the list of NYC zip codes

Data source: https://www.unitedstateszipcodes.org/zip-code-database

In [93]:
full_df = pd.read_csv('zip_code_database.csv')
nyc_counties = ['New York County', 'Queens County', 'Bronx County', 'Kings County', 'Richmond County']
# .copy() makes nyc_zips an independent dataframe, which avoids the SettingWithCopyWarning on the next line
nyc_zips = full_df[full_df.county.isin(nyc_counties) & (full_df.decommissioned == 0) & (full_df.state == 'NY')].copy()
nyc_zips.zip = nyc_zips.zip.astype(str)
zips = list(nyc_zips.zip)


In [None]:
#THIS WILL SCRAPE THE WHOLE CITY
all_zips_dfs = []
broken_zips = []
for zip_code in zips:
    try:
        print(f'**STARTING {zip_code} SCRAPE**')
        results = scrape(zip_code, True)
        if len(results) > 0:
            df = clean_data(results, f'{zip_code}_data.csv')
            all_zips_dfs.append(df)
            print(f'{zip_code} added to dataframes list')
    except Exception as e:
        broken_zips.append(zip_code)
        print(f'{zip_code} scrape encountered an error')
        print(e)
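
The loop above records failing zip codes in broken_zips but never revisits them. A minimal sketch of a second pass is below, assuming that failures are often transient (a page that did not finish loading, for example). `retry_failed`, `scrape_fn`, `attempts` and `pause` are hypothetical names, and a stand-in scraper replaces the real one so the example runs without a browser.

```python
import time

def retry_failed(failed, scrape_fn, attempts=2, pause=0.0):
    """Re-run scrape_fn on each failed item; return (results by item, items still failing)."""
    recovered = {}
    still_failing = list(failed)
    for _ in range(attempts):
        remaining = []
        for item in still_failing:
            try:
                recovered[item] = scrape_fn(item)
            except Exception as e:
                print(f'{item} failed again: {e}')
                remaining.append(item)
        still_failing = remaining
        if not still_failing:
            break
        time.sleep(pause)
    return recovered, still_failing

# Stand-in scraper: '10002' fails once and then succeeds, '10003' always fails
calls = {}
def fake_scrape(zip_code):
    calls[zip_code] = calls.get(zip_code, 0) + 1
    if zip_code == '10003' or (zip_code == '10002' and calls[zip_code] == 1):
        raise RuntimeError('page did not load')
    return [{'zip': zip_code}]

recovered, still_failing = retry_failed(['10001', '10002', '10003'], fake_scrape)
print(sorted(recovered), still_failing)
```

In the notebook this could be called as `retry_failed(broken_zips, lambda z: scrape(z, True))`, with each recovered result passed through clean_data as in the main loop.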

In [511]:
#nyc_docs = pd.concat(all_zips_dfs)
#nyc_docs.to_csv('all-zocdoc-data-nyc.csv', index=False)

In [514]:
nyc_docs.shape

(487, 16)