# Web Scraping

This notebook scrapes doctors' names and their specialities from https://www.zocdoc.com/specialty. The landing page lists many specialities, along with the insurances, procedures, and visit reasons for which doctors are available. Only the speciality links are followed, and the doctor's name and specific speciality are scraped from each resulting listing. Each speciality spreads its doctors across 10 pages, so every page has to be visited in turn to scrape its contents.
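
The crawl described above is two nested loops: one over specialities and one over the result pages of each speciality. The following is a minimal sketch of that shape only, with the browser and the parser passed in as plain functions so the control flow can be tried without Selenium. The names `crawl`, `url_for`, `fetch_html`, and `parse_doctors`, and the `fake_*` stand-ins, are hypothetical and are not part of the notebook.

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, str]  # (doctor name, speciality)


def crawl(
    specialities: Iterable[str],
    url_for: Callable[[str, int], str],
    fetch_html: Callable[[str], str],
    parse_doctors: Callable[[str], List[Record]],
    pages: int = 10,
) -> List[Record]:
    """Visit pages 1..pages of every speciality and collect the parsed records."""
    records: List[Record] = []
    for speciality in specialities:
        for page in range(1, pages + 1):
            html = fetch_html(url_for(speciality, page))
            records.extend(parse_doctors(html))
    return records


# Stand-ins for illustration only: no network access, no real site markup.
def fake_url_for(speciality: str, page: int) -> str:
    return f"{speciality}/{page}"


def fake_fetch(url: str) -> str:
    return url  # pretend the "HTML" is just the URL


def fake_parse(html: str) -> List[Record]:
    speciality, page = html.split("/")
    return [(f"Doctor on page {page}", speciality)]


demo = crawl(["Dentist", "Allergist"], fake_url_for, fake_fetch, fake_parse, pages=3)
print(len(demo))
```

In the real run, `fetch_html` would wrap `driver.get` plus `driver.page_source`, and `parse_doctors` would wrap the BeautifulSoup lookups used later in the notebook.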

#### Importing libraries

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd

I used the **Firefox** browser, so **geckodriver** is needed to connect **selenium** to Firefox.

In [2]:
driver = webdriver.Firefox()
# Open the home page with Selenium (driver.get returns None, so nothing is assigned)
driver.get('https://www.zocdoc.com/specialty')

In [3]:
# Get the full HTML source of the page that is currently loaded
content = driver.page_source

In [4]:
# Hand the page source to BeautifulSoup; naming the parser avoids the "no parser specified" warning
soup = bs(content, 'html.parser')

#### Scraping all the specialities from the home page

In [5]:
specialities = []

In [6]:
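# Each speciality tile has an inner element that holds its label
for tile in soup.find_all(class_='sc-2gkh1u-2 sc-1hky09g-1 jULkYr'):
    label = tile.find(class_='sc-2gkh1u-3 jBFlDB')
    if label is not None:  # skip tiles without a label element
        specialities.append(label.text)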

In [7]:
specialities

['Acupuncturist',
 'Allergist',
 'Audiologist',
 'Cardiologist',
 'Chiropractor',
 'Dentist',
 'Dermatologist',
 'Dietitian',
 'Ear, Nose & Throat Doctor',
 'Emergency Medicine Physician',
 'Endocrinologist',
 'Endodontist',
 'Eye Doctor',
 'Family Physician',
 'Gastroenterologist',
 'Hand Surgeon',
 'Hearing Specialist',
 'Hematologist',
 'Infectious Disease Specialist',
 'Infertility Specialist',
 'Internist',
 'Naturopathic Doctor',
 'Nephrologist',
 'Neurologist',
 'Neurosurgeon',
 'Nurse Practitioner',
 'Nutritionist',
 'OB-GYN',
 'Oncologist',
 'Ophthalmologist',
 'Optometrist',
 'Oral Surgeon',
 'Orthodontist',
 'Orthopedic Surgeon',
 'Pain Management Specialist',
 'Pediatric Dentist',
 'Pediatric Urgent Care Specialist',
 'Pediatrician',
 'Periodontist',
 'Physiatrist',
 'Physical Therapist',
 'Plastic Surgeon',
 'Podiatrist',
 'Primary Care Doctor',
 'Prosthodontist',
 'Psychiatrist',
 'Psychologist',
 'Psychotherapist',
 'Pulmonologist',
 'Radiologist',
 'Rheumatologist',
 'S

In [8]:
#Lists to contain scraped Doctor's name and their speciality
name = []
speciality = []

#### Scraping Doctor's name and speciality from all the pages.

In [9]:
# Iterate over all the specialities
for doc in specialities:
    # Return to the home page
    driver.get('https://www.zocdoc.com/specialty')
    # Click the link for this speciality
    # (find_element_by_partial_link_text was removed in later Selenium 4 releases)
    driver.find_element('partial link text', doc).click()
    # Hand the page source to BeautifulSoup
    new_soup = bs(driver.page_source, 'html.parser')
    # Clicking a speciality on the home page changes the URL.
    # For example, 'Ear, Nose & Throat Doctor' leads to
    # 'https://www.zocdoc.com/ear-nose-throat-doctors', so the speciality
    # label has to be cleaned to match the URL format.
    slug = doc.lower()
    slug = slug.replace('-', '')
    slug = slug.replace(', ', '-')
    slug = slug.replace(' & ', '-')
    slug = slug.replace(' / ', '-')
    slug = slug.replace(' ', '-')
    # Scrape the first page
    for tag in new_soup.find_all(class_='htzklx-15 iHYPbJ'):
        name.append(tag.text)
    for tag in new_soup.find_all(class_='htzklx-16 jrBxhj'):
        speciality.append(tag.text)
    # Scrape pages 2 to 10
    for page_number in range(2, 11):
        page = 'https://www.zocdoc.com/' + slug + 's/' + str(page_number)
        driver.get(page)
        new_soup = bs(driver.page_source, 'html.parser')
        for tag in new_soup.find_all(class_='htzklx-15 iHYPbJ'):
            name.append(tag.text)
        for tag in new_soup.find_all(class_='htzklx-16 jrBxhj'):
            speciality.append(tag.text)
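
The label-to-URL cleaning in the loop above is easier to check when it is pulled out into a function. The sketch below applies the same replacements in the same order and then builds the URLs for pages 2 to 10. The names `speciality_slug`, `listing_urls`, and `BASE_URL` are hypothetical helpers introduced here for illustration. The function only mirrors the notebook's rule; whether every speciality on the site actually follows the "slug plus `s`" pattern is not verified.

```python
BASE_URL = "https://www.zocdoc.com/"


def speciality_slug(label: str) -> str:
    """Apply the same replacements, in the same order, as the scraping loop."""
    slug = label.lower()
    for old, new in (("-", ""), (", ", "-"), (" & ", "-"), (" / ", "-"), (" ", "-")):
        slug = slug.replace(old, new)
    return slug


def listing_urls(label: str, first_page: int = 2, last_page: int = 10) -> list:
    """Build the URLs for pages first_page..last_page of one speciality."""
    slug = speciality_slug(label)
    return [f"{BASE_URL}{slug}s/{page}" for page in range(first_page, last_page + 1)]


urls = listing_urls("Ear, Nose & Throat Doctor")
print(urls[0])
```

Keeping the rule in one place makes it straightforward to spot-check unusual labels such as `'OB-GYN'` before starting a long crawl.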

In [10]:
len(name)

9298

In [11]:
len(speciality)

9298

#### Converting lists to a dataframe

In [12]:
df = pd.DataFrame(list(zip(name,speciality)),columns = ["Doctor's Name",'speciality'])
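
`zip` stops at the shorter of its inputs, so the equal lengths checked above (9298 each) are what make this pairing safe. If one listing had a name but no speciality element, the two lists would drift out of step without any error. A minimal sketch of a guard, using a hypothetical helper `pair_strict` and two rows taken from the output in this notebook:

```python
def pair_strict(names, specialities):
    """Pair the two lists, refusing to continue if their lengths differ."""
    if len(names) != len(specialities):
        raise ValueError(
            f"length mismatch: {len(names)} names vs {len(specialities)} specialities"
        )
    return list(zip(names, specialities))


names = ["Diem Truong, LAc, MSTOM", "Monique Rivera, LAc"]
labels = ["Acupuncturist", "Acupuncturist"]
rows = pair_strict(names, labels)

# Plain zip silently drops the unmatched name when one list is shorter.
truncated = list(zip(names, labels[:1]))
print(len(rows), len(truncated))
```

On Python 3.10 and later, `zip(names, labels, strict=True)` raises a `ValueError` in the same situation.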

In [13]:
df.shape

(9298, 2)

In [14]:
df.head(10)

| | Doctor's Name | speciality |
|---|---|---|
| 0 | Diem Truong, LAc, MSTOM | Acupuncturist |
| 1 | Monique Rivera, LAc | Acupuncturist |
| 2 | Ronald Pratt, LAc, DiplAc, MA, MSAc | Acupuncturist |
| 3 | Daniel Camburn, LAc | Acupuncturist |
| 4 | Deborah Barbiere, LAc, MSTOM, PsyD | Acupuncturist |
| 5 | Miguel Maya, MSTOM, LAc | Acupuncturist |
| 6 | Elizabeth Healy, LAc | Acupuncturist |
| 7 | Irina Logman, LAc, DACM | Acupuncturist |
| 8 | Han Jun, LAc | Acupuncturist |
| 9 | Michiko Yoshifuji, DiplAc, LAc | Acupuncturist |


#### Converting dataframe to csv

In [15]:
# Write the data to disk (to_csv returns None when given a path; duplicates are still present at this point)
df.to_csv('zocdoc.csv')

In [16]:
# The scraped dataset contains many duplicate rows, so drop them and renumber the index.
df = df.drop_duplicates()
df = df.reset_index(drop=True)

In [17]:
df.shape

(3782, 2)
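
The shape falls from 9298 rows to 3782, so 5516 rows were exact repeats of an earlier (name, speciality) pair. By default `drop_duplicates` keeps the first occurrence of each row. The sketch below shows the same idea in plain Python rather than pandas, so it is an illustration of the behaviour and not the notebook's code; `drop_duplicate_rows` is a hypothetical helper, and the third row is a repeat added purely for the example.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    return list(dict.fromkeys(rows))


rows = [
    ("Diem Truong, LAc, MSTOM", "Acupuncturist"),
    ("Monique Rivera, LAc", "Acupuncturist"),
    ("Diem Truong, LAc, MSTOM", "Acupuncturist"),  # repeat added for the example
]
unique_rows = drop_duplicate_rows(rows)

# Rows removed in the notebook run: original count minus deduplicated count.
removed = 9298 - 3782
print(len(unique_rows), removed)
```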

In [18]:
df.head()

| | Doctor's Name | speciality |
|---|---|---|
| 0 | Diem Truong, LAc, MSTOM | Acupuncturist |
| 1 | Monique Rivera, LAc | Acupuncturist |
| 2 | Ronald Pratt, LAc, DiplAc, MA, MSAc | Acupuncturist |
| 3 | Daniel Camburn, LAc | Acupuncturist |
| 4 | Deborah Barbiere, LAc, MSTOM, PsyD | Acupuncturist |
