# 2. Crawling

Let's import the required libraries:

In [1]:
import pandas as pd
import json
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import TimeoutException
from datetime import datetime
import time

Let's also write some important Selenium helper functions:

>## 2.1 Global variables for Selenium:

In [2]:
driver = ""
original_tab = ""
secound_tab = ""
already_opend = False
options = Options()
options.add_argument("-headless")  # the options.headless setter is deprecated in recent Selenium 4 releases

>## 2.2 Selenium functions:

#### 1. openBroswerWindow(URL):
Opens a headless Firefox window (launching the driver on first use) and loads the given URL.
#### 2. openNewTabAndSwitch():
Opens a new tab in the headless browser and switches the driver's control to it.
#### 3. switchBetween2Tabs():
Switches the driver's control between the two tabs.
#### 4. closeDriverAndBroswer():
Closes the headless browser and the driver.

In [3]:
def openBroswerWindow(URL):
    global driver, original_tab, secound_tab, already_opend, options
    if(already_opend == False):
        
        driver = webdriver.Firefox(options=options, service=Service(GeckoDriverManager().install()))
        already_opend = True
    driver.get(URL)
    original_tab = driver.current_window_handle

def openNewTabAndSwitch():
    global driver, secound_tab
    driver.switch_to.new_window('tab')
    secound_tab = driver.current_window_handle

def switchBetween2Tabs():
    global driver, original_tab, secound_tab
    current_tab = driver.current_window_handle
    if(original_tab == current_tab):
        driver.switch_to.window(secound_tab)
    else:
        driver.switch_to.window(original_tab)
        
def closeDriverAndBroswer():
    global driver
    driver.quit()
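
The four helpers above share their state through module-level globals. As a rough alternative sketch (not the code this notebook runs), the same bookkeeping can live in one small object. `TabSession`, `FakeDriver` and `FakeSwitchTo` are hypothetical names; the fake classes only imitate the few driver attributes used here (`get`, `current_window_handle`, `switch_to.new_window`, `switch_to.window`, `quit`) so the tab logic can be tried without launching Firefox.

```python
class FakeSwitchTo:
    """Imitates the small part of Selenium's `driver.switch_to` used below."""

    def __init__(self, driver):
        self._driver = driver

    def new_window(self, kind):
        # Create a new handle and make it the current one.
        handle = f"{kind}-{len(self._driver.handles) + 1}"
        self._driver.handles.append(handle)
        self._driver.current_window_handle = handle

    def window(self, handle):
        self._driver.current_window_handle = handle


class FakeDriver:
    """Hypothetical stand-in for a WebDriver, for illustration only."""

    def __init__(self):
        self.handles = ["tab-1"]
        self.current_window_handle = "tab-1"
        self.switch_to = FakeSwitchTo(self)
        self.closed = False

    def get(self, url):
        self.url = url

    def quit(self):
        self.closed = True


class TabSession:
    """Keeps the driver and both tab handles together instead of in globals."""

    def __init__(self, driver):
        self.driver = driver
        self.original_tab = driver.current_window_handle
        self.second_tab = None

    def open_new_tab_and_switch(self):
        self.driver.switch_to.new_window("tab")
        self.second_tab = self.driver.current_window_handle

    def switch_between_tabs(self):
        # Jump to whichever of the two tabs is not currently active.
        if self.driver.current_window_handle == self.original_tab:
            self.driver.switch_to.window(self.second_tab)
        else:
            self.driver.switch_to.window(self.original_tab)

    def close(self):
        self.driver.quit()


session = TabSession(FakeDriver())
session.open_new_tab_and_switch()  # now on the second tab
session.switch_between_tabs()      # back on the original tab
on_original = session.driver.current_window_handle == session.original_tab
session.close()
```

With a real driver, the same class would be constructed from the `webdriver.Firefox(...)` instance instead of `FakeDriver()`.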

Now that we have functions for controlling Selenium, we can get started.\
I am going to collect all the available appointments at the Maccabi HMO.\
The first step is to access Maccabi's website using the openBroswerWindow() function.

>## 2.3 Fetching pre-crawling important information

Before I start crawling, I need some information that will make the job easier.\
Using the browser's web inspector,\
I studied Maccabi's website and its appointment system and learned how they work.\
Now I will use the functions I wrote above to simulate a patient \
looking for appointments with doctors in various fields.
#### The process will be as follows:
==> For each treatment field:\
====> The crawler opens the search page, enters the treatment field and clicks the search button.\
====> For each results page:\
======> For each appointment on the page:\
========> The crawler retrieves the relevant information for that appointment.
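
The process above is three nested loops. A minimal sketch of that shape, with the website replaced by made-up data: `crawl`, its callback parameters and `fake_site` are illustrative names, not part of the crawler written in this notebook.

```python
def crawl(fields, get_pages, get_appointments, extract):
    """Walk fields -> result pages -> appointments and collect one record each."""
    records = []
    for field in fields:
        for page in get_pages(field):
            for appointment in get_appointments(page):
                records.append(extract(field, appointment))
    return records


# Made-up data standing in for the website: field -> pages -> appointments.
fake_site = {
    "field A": [["appt 1", "appt 2"], ["appt 3"]],
    "field B": [["appt 4"]],
}

records = crawl(
    fields=fake_site.keys(),
    get_pages=lambda field: fake_site[field],
    get_appointments=lambda page: page,
    extract=lambda field, appointment: {"field": field, "appointment": appointment},
)
print(len(records))
```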

In order to carry out the proposed solution,\
we will need to find all the areas of care that Maccabi offers to its insured.\
Let's go to the "Services Guide" and see what can be obtained from there.

In [4]:
url = 'https://serguide.maccabi4u.co.il/heb/doctors/'
openBroswerWindow(url)




Current firefox version is 100.0
Get LATEST geckodriver version for 100.0 firefox
Driver [C:\Users\Haim\.wdm\drivers\geckodriver\win64\v0.31.0\geckodriver.exe] found in cache


After a quick exploration of the HTML, I found the data I was looking for in JSON format, inside a 'script' tag, under 'window.\__INITIAL_STATE\__'.

In [5]:
rawData = driver.execute_script("""return window.__INITIAL_STATE__""")

This data will help us later, when we scrape Maccabi's website for available appointments.\
Let's save it for now.

In [6]:
fieldsOfTreatment = rawData["settings"]["doctors"]["Data"]["Fields"]
fieldsOfTreatment = pd.json_normalize(fieldsOfTreatment)
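
To make the shape of this data concrete, here is a small standalone sketch. The sample JSON is invented and only mirrors the key path used above (`settings` -> `doctors` -> `Data` -> `Fields`) and the `K` / `V` pairs. Treating `K` as the field ID and `V` as its display name is an assumption based on how the two columns are used later in this notebook.

```python
import json

# Invented sample shaped like the key path used above;
# the real payload is much larger.
raw_json = """
{"settings": {"doctors": {"Data": {"Fields": [
    {"K": "101", "V": "Example field one"},
    {"K": "102", "V": "Example field two"}
]}}}}
"""

raw_data = json.loads(raw_json)
fields = raw_data["settings"]["doctors"]["Data"]["Fields"]

# Map each field ID (K) to its display name (V).
field_names = {item["K"]: item["V"] for item in fields}
print(field_names)
```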

Now, let's check our data:

In [7]:
print(fieldsOfTreatment.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   K       102 non-null    object
 1   V       102 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB
None


Cool, we have 102 treatment fields to scan!\
Now that we have the preliminary information we were looking for,\
we will write functions that help keep the code clean and readable.

>## 2.4 Crawling functions:

Based on what I learned about how the appointment system works and how its HTML is structured,\
I will write helper functions for the crawl.

>> ## 2.4.1 Main search results page functions:

#### getElement(row, by, value, func= None, funcValue = None, elements = False):
Returns a Selenium web element (or its text, or one of its attributes) using the find_element functions.\
This wrapper is needed because some elements do not appear on every page,\
and calling Selenium's find_element directly for a missing element raises a NoSuchElementException.
#### getNumOfDoctors(docNumStr):
Extracts the number of doctors that came up in the search results, using a regex.
#### extractGender(imageUrl):
Determines the doctor's gender by comparing the avatar image URL against two constant avatar URLs, falling back to "M" when neither matches.
#### getAddress(addressStr):
Splits the clinic address and returns the street (name and number) and the city.\
The split is done with a regex.
#### getSpeciality(docField, numOfTitle, getNumOfSpeciality = False):
Returns the doctor's speciality at the requested position, or the number of specialities when getNumOfSpeciality is True.
#### extractFromArray(arr, index):
Returns the item at the given index of an array.\
This function is required to avoid out-of-bounds errors.

In [8]:
def getElement(row, by, value, func= None, funcValue = None, elements = False):
    if(len(row.find_elements(by, value)) > 0):
        if(func == "get_attribute"):
            return row.find_element(by, value).get_attribute(funcValue)
        if(func == "text"):
            return row.find_element(by, value).text
        if(func == "object"):
            if(elements):
                return row.find_elements(by, value)
            return row.find_element(by, value)
    return None

def getNumOfDoctors(docNumStr):
    # getElement() returns None when the counter element is missing
    if(docNumStr == None):
        return 0
    numOfDoctors = extractFromArray(re.findall("[0-9]+", docNumStr), 0)
    if(numOfDoctors == None):
        numOfDoctors = 0
    return numOfDoctors

def extractGender(imageUrl):
    MALE = 'https://serguide.maccabi4u.co.il/media/09480721fdf8469ca6d6cc51b5bbe29b.svg?v=bd81a3f8-1d30-46a1-98c1-fb26a842c397'
    FEMALE = 'https://serguide.maccabi4u.co.il/media/a0c04626f2734055aa80ea34b632fb40.svg?v=bd81a3f8-1d30-46a1-98c1-fb26a842c397'
    if(imageUrl == MALE):
        return "M"
    if(imageUrl == FEMALE):
        return "F"
    return "M"

def getAddress(addressStr):
    clinicStreet = None
    clinicCity = addressStr
    if(addressStr != None):
        split = re.split(", ", addressStr)
        if(len(split) == 2):
            clinicStreet = split[0]
            clinicCity = split[1]
    return clinicStreet, clinicCity
    
def getSpeciality(docField, numOfTitle, getNumOfSpeciality = False):
    if(docField == None):  # element missing on the page
        return 0 if getNumOfSpeciality else None
    arr = docField.split('\n')
    if(getNumOfSpeciality):
        return len(arr)
    if(numOfTitle == 1):
        if(len(arr) == 1):
            return docField
        return arr[0]
    if(numOfTitle > 1):
        if(len(arr) > 1):
            return arr[numOfTitle -1]
        return None
    return None

def extractFromArray(arr, index):
    # Return None instead of raising IndexError when the index is out of range
    if(index < 0 or index >= len(arr)):
        return None
    return arr[index]
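
To show the parsing idea behind these helpers in isolation, here is a small standalone sketch. `first_number` and `split_address` are hypothetical simplified variants, not the functions above, and the input strings are invented. Unlike `getNumOfDoctors`, `first_number` always returns an `int`.

```python
import re


def first_number(text):
    """Return the first run of digits in `text` as an int, or 0 if there is none."""
    if text is None:
        return 0
    match = re.search(r"[0-9]+", text)
    return int(match.group()) if match else 0


def split_address(address):
    """Split 'street, city' into (street, city); otherwise treat it all as the city."""
    if address is None:
        return None, None
    parts = address.split(", ")
    if len(parts) == 2:
        return parts[0], parts[1]
    return None, address


print(first_number("Found 128 doctors"))
print(first_number(None))
print(split_address("Example St 4, Example City"))
print(split_address("Example City"))
```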


>> ## 2.4.2 Doctor details page functions:

#### scrapeDocDetails(url):
The main function for scraping a doctor's details page.\
It is used to scrape each doctor's page individually.
#### getDocProperties(text):
Extracts boolean values for the following properties:\
needsReferral, performUS, absence, acceptingNewPatients, videoCall.
#### extractActivityTime(timeObjects):
Extracts a weekly calendar of the doctor's activity times.
#### getDoctorNameAndTitle(docNameString):
Extracts the doctor's name and title using a regex.
#### getDocDegreeDetails(degreeButtonObject):
Extracts details of the doctor's degree studies and of the specialization he / she completed.
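
As an illustration of the name and title split, here is a standalone sketch. `split_title` and `TITLE_PATTERN` are hypothetical names, and the sample string is an assumed raw value (title first, then the name). Note that the pattern treats the first Hebrew word as the title, so it relies on every name starting with one.

```python
import re

# Leading run of Hebrew letters and quote marks, followed by whitespace.
TITLE_PATTERN = re.compile(r"""^([א-ת"']+)\s""")


def split_title(full_name):
    """Return (title, name); title is None when no leading title is matched."""
    if full_name is None:
        return None, None
    parts = TITLE_PATTERN.split(full_name, maxsplit=1)
    if len(parts) == 3:
        return parts[1], parts[2]
    return None, full_name


raw = "פרופ' טלר ישראל"  # assumed raw name string, title first
title, name = split_title(raw)
print(title, "|", name)
```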

In [9]:
def scrapeDocDetails(url):
        switchBetween2Tabs()#switch to 2nd tab
        driver.get(url)
        try:
            WebDriverWait(driver, 25).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".docPropTitle")))
        except TimeoutException:
            print("Timeout... Skipped (1 doctor in this treatment field)")
        clinicStreet, clinicCity = getAddress(getElement(driver, By.XPATH, "//div[@itemprop='Address']", 'text'))
        licenseNum = getElement(driver, By.CSS_SELECTOR, 'div.docPropSubTitle:nth-child(2)', "text")     
        if(licenseNum != None):
            licenseNum =  extractFromArray(re.findall(r"([0-9]-[0-9]+)$", licenseNum), 0)
        languages = getElement(driver, By.CSS_SELECTOR, '.docPropInnerBorder > div:nth-child(3) > ul:nth-child(2)', "text")        
        visitCost = getElement(driver, By.CSS_SELECTOR, '.ConsultCosts > div:nth-child(2)', "text")        
        accessible = getElement(driver, By.CSS_SELECTOR, 'li.disNoneMobile > img:nth-child(1)', "get_attribute", "alt")
        sun, mon, tue, wed, thu, fri, sat = extractActivityTime(getElement(driver, By.TAG_NAME, 'time', "object", elements=True))
        graduationYear, academicInstitution, profession, specialization, specializationMedicalInstitution, specializationYear = getDocDegreeDetails(getElement(driver, By.CSS_SELECTOR, '.listBlockMobile > li:nth-child(1) > a:nth-child(1)', "object"))
        switchBetween2Tabs()#switch to main tab
        return clinicStreet, clinicCity, licenseNum, languages, visitCost, accessible, sun, mon, tue, wed, thu, fri, sat, graduationYear, academicInstitution, profession, specialization, specializationMedicalInstitution, specializationYear
        
def getDocProperties(text):
    needsReferral = False
    performUS = False
    absence = False
    acceptingNewPatients = True
    videoCall = False
    for prop in text:
        if(prop.text == "בהפניית רופא מטפל"):
            needsReferral = True
        if(prop.text == "מבצע אולטרסאונד"):
            performUS = True
        if(prop.text == "היעדרות"):
            absence = True
        if(prop.text == "אין קבלת חברים חדשים"):
            acceptingNewPatients = False
        if(prop.text == "אפשרות לשיחת וידאו"):
            videoCall = True            
    return needsReferral, performUS, absence, acceptingNewPatients, videoCall


def extractActivityTime(timeObjects):
    sun = mon = tue = wed = thu = fri = sat = None
    if(timeObjects != None and len(timeObjects) > 0):
        for day in timeObjects:
            timeToAdd = day.get_attribute("datetime") +" "+ getElement(day, By.CSS_SELECTOR, "div:nth-child(1) > div:nth-child(4)", "text")
            if(timeToAdd.find("Su") >= 0):
                sun = timeToAdd
                continue
            if(timeToAdd.find("Mo") >= 0):
                mon = timeToAdd
                continue
            if(timeToAdd.find("Tu") >= 0):
                tue = timeToAdd
                continue
            if(timeToAdd.find("We") >= 0):
                wed = timeToAdd
                continue
            if(timeToAdd.find("Th") >= 0):
                thu = timeToAdd
                continue
            if(timeToAdd.find("Fr") >= 0):
                fri = timeToAdd
                continue
            if(timeToAdd.find("Sa") >= 0):
                sat = timeToAdd
    return sun, mon, tue, wed, thu, fri, sat

def getDoctorNameAndTitle(docNameString):
        docName = docNameString
        docTitle = None
        if(docName != None):
            split = re.split(r'''^([א-ת"']+)\s''', docName)
            if(len(split) == 3):
                docTitle = split[1]
                docName = split[2]
        return docTitle, docName
    
def getDocDegreeDetails(degreeButtonObject):
    graduationYear = None
    academicInstitution = None
    profession = None
    specialization = None
    specializationMedicalInstitution = None
    specializationYear = None
    if(degreeButtonObject != None and getElement(degreeButtonObject, By.CLASS_NAME, "sectionDocTxt", "text") == "פרטי השכלה ומומחיות"):
        degreeButtonObject.click()
        rows = getElement(driver, By.CLASS_NAME, "popUpTableRow", "object", elements=True)
        if(rows != None and len(rows) > 1):
            graduationYear = getElement(rows[1], By.CSS_SELECTOR, "div:nth-child(2) > div:nth-child(3) > div:nth-child(2)", "text")
            academicInstitution = getElement(rows[1], By.CSS_SELECTOR, "div:nth-child(2) > div.popUpTableColWrap.disNoneMobile > div:nth-child(2)", "text")
            profession = getElement(rows[1], By.CSS_SELECTOR, "div:nth-child(2) > div:nth-child(1) > div:nth-child(2)", "text")
            if(len(rows) > 2):
                specialization = getElement(rows[2], By.CSS_SELECTOR, "div:nth-child(2) > div:nth-child(1) > div:nth-child(2)", "text")
                specializationMedicalInstitution = getElement(rows[2], By.CSS_SELECTOR, "div:nth-child(2) > div:nth-child(2) > div:nth-child(2)", "text")
                specializationYear = getElement(rows[2], By.CSS_SELECTOR, "div:nth-child(2) > div:nth-child(3) > div:nth-child(2)", "text")
    return graduationYear, academicInstitution, profession, specialization, specializationMedicalInstitution, specializationYear
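
The chain of `if` checks in extractActivityTime() can also be written as a lookup table. The sketch below is one possible alternative, not the notebook's code: `DAY_PREFIXES` and `group_by_day` are hypothetical names, the entries are sample strings, and it assumes the two-letter day abbreviation is at the start of each entry (the function above searches the whole string instead).

```python
DAY_PREFIXES = {
    "Su": "sun", "Mo": "mon", "Tu": "tue", "We": "wed",
    "Th": "thu", "Fr": "fri", "Sa": "sat",
}


def group_by_day(entries):
    """Map each entry to a weekday key based on its two-letter prefix."""
    week = {day: None for day in DAY_PREFIXES.values()}
    for entry in entries:
        day = DAY_PREFIXES.get(entry[:2])
        if day is not None:
            week[day] = entry
    return week


week = group_by_day(["Th 07:30-13:30", "Fr 09:00-12:00"])
print(week["thu"], week["fri"], week["sun"])
```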

>> ## 2.4.3 Results page scrape function

#### scrapePage(results):
This function receives the results of one page as Selenium objects and scrapes each appointment on that page.\
Each page contains up to 10 appointments.\
The function returns a temporary dataframe with the details of all the appointments on the given page.

In [10]:
def scrapePage(results):
    dfToReturn = pd.DataFrame()
    for row in results:
        doc_url = getElement(row, By.CSS_SELECTOR, ".docResualtTitleList a", "get_attribute", "href")
        clinicStreet, clinicCity, licenseNum, languages, visitCost, accessible, sun, mon, tue, wed, thu, fri, sat, graduationYear, academicInstitution, profession, specialization, specializationMedicalInstitution, specializationYear = scrapeDocDetails(doc_url)
        needsReferral, performUS, absence, acceptingNewPatients, videoCall = getDocProperties(getElement(row, By.CLASS_NAME, "t_G_1", "object", elements=True))
        docNameString = getElement(row, By.CLASS_NAME, "docPropTitle", "text")
        docTitle, docName = getDoctorNameAndTitle(docNameString)
        rowObg= {
            "docTitle": docTitle,
            "docName": docName,
            "gender": extractGender(getElement(row, By.CLASS_NAME, "iconForUser", "get_attribute", "src")),
            "firstSpeciality": getSpeciality(getElement(row, By.CLASS_NAME, "docPropSubTitle", "text"), 1),
            "secountSpeciality": getSpeciality(getElement(row, By.CLASS_NAME, "docPropSubTitle", "text"), 2),
            "numberOfSpecializations": getSpeciality(getElement(row, By.CLASS_NAME, "docPropSubTitle", "text"), 0, True),
            "clinicStreet": clinicStreet,
            "clinicCity": clinicCity,
            "closestAppointment": getElement(row, By.CLASS_NAME, "closestAppointMentText", "text"),
            "dateOfScraping": datetime.today().strftime('%d/%m/%y'),
            "onlineAppointmentScheduling": (getElement(row, By.CLASS_NAME, "newShowSearch", "text") == "זימון תור"),
            "onlineAppointmentCanceling": (getElement(row, By.CLASS_NAME, "clearSearchFields", "text") == "ביטול תור"),
            "needsReferral": needsReferral,
            "preformUS": performUS,
            "absence": absence,
            "acceptingNewPatients": acceptingNewPatients,
            "videoCall": videoCall,
            "licenseNum": licenseNum, 
            "languages": languages, 
            "visitCost": visitCost, 
            "accessible": accessible, 
            "receptionOnSunday": sun, 
            "receptionOnMonday": mon, 
            "receptionOnTuesday": tue, 
            "receptionOnWednesday": wed, 
            "receptionOnThursday": thu, 
            "receptionOnFriday": fri, 
            "receptionOnSaturday": sat,
            "graduationYear" : graduationYear,
            "academicInstitution": academicInstitution, 
            "profession": profession,
            "specialization": specialization,
            "specializationMedicalInstitution": specializationMedicalInstitution,
            "specializationYear": specializationYear
        }
            
        dfToReturn = pd.concat([dfToReturn, pd.DataFrame([rowObg])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    return dfToReturn

Scrape one field:

>> ## 2.4.4 Scrape field function

#### scrapeOneField(fieldID, startFromPage = None):
The function returns all the appointment data for the given field.\
It loops over all of the field's result pages and scrapes each one using the **scrapePage(results)** function.\
The function returns a temporary dataframe with the details of all the appointments in the given field, along with the number of doctors reported for it.

In [11]:
def scrapeOneField(fieldID, startFromPage = None):  
    dfToReturn = pd.DataFrame()
    driver.get(url+'?Field='+str(fieldID))
    driver.find_element(By.ID, "SearchButton").click()
    try: 
        element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "docResualtWrap")))
        numOfDoctors = getNumOfDoctors(getElement(driver, By.CLASS_NAME, "t_B_13", "text"))
        # Note: `row` is the loop variable of the main scraping loop (a global), not a parameter
        print("["+str(datetime.now().strftime('%H:%M:%S'))+"]: Now scraping treatment field \"",row[1]["V"], "\" ("+str(numOfDoctors)+" appointments)...")
        results = driver.find_elements(By.CLASS_NAME, "docResualtWrap")
        nextPage = 1
        if(startFromPage != None):
            for page in range(1, startFromPage):
                nextPage = getElement(driver, By.CLASS_NAME, "pagingNextButton", "object")
                e = driver.find_element(By.CLASS_NAME,"docResualtWrap")
                if(nextPage == None):
                    break
                driver.execute_script("arguments[0].click();", nextPage)
                element = WebDriverWait(driver, 25).until(EC.staleness_of(e))

        while(nextPage != None):
            try:
                results = getElement(driver, By.CLASS_NAME,"docResualtWrap", "object", elements=True)
                scrapeResults = scrapePage(results)
                dfToReturn = pd.concat([dfToReturn, scrapeResults], ignore_index=True)
                nextPage = getElement(driver, By.CLASS_NAME, "pagingNextButton", "object")
                e = driver.find_element(By.CLASS_NAME,"docResualtWrap")
                if(nextPage == None):
                    break
                driver.execute_script("arguments[0].click();", nextPage)
                element = WebDriverWait(driver, 25).until(EC.staleness_of(e))
            except StaleElementReferenceException as e:
                break    
        return dfToReturn, numOfDoctors
    except TimeoutException:
        print("Timeout... Skipped (",row[1]["V"],").")
        return dfToReturn, 0
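
Stripped of the Selenium calls, the paging logic in scrapeOneField() is a "scrape, then follow the next button until there is none" loop. A minimal sketch of that pattern, using made-up pages: `scrape_all_pages` and its callback parameters are illustrative names, not part of the crawler above.

```python
def scrape_all_pages(first_page, scrape_page, next_page):
    """Scrape `first_page`, then keep following `next_page` until it returns None."""
    rows = []
    page = first_page
    while page is not None:
        rows.extend(scrape_page(page))
        page = next_page(page)
    return rows


# Made-up result pages; each holds its rows and the index of the next page.
pages = [
    {"rows": ["a", "b"], "next": 1},
    {"rows": ["c"], "next": 2},
    {"rows": ["d", "e"], "next": None},
]

rows = scrape_all_pages(
    first_page=pages[0],
    scrape_page=lambda page: page["rows"],
    next_page=lambda page: None if page["next"] is None else pages[page["next"]],
)
print(rows)
```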


> ## 2.5 The crawling:

After all the hard work, we can start scraping.\
First, let's define our dataframe:

In [12]:
df = pd.DataFrame()

Using the functions I wrote above, we can start scraping.\
I'm going to loop over the fields and call **scrapeOneField()** for each one, which returns a dataframe for the given field.\
Every returned dataframe is appended to the main dataframe.\
Let's do this!

In [13]:
sumOfAppointments = 0
start = time.time()
openNewTabAndSwitch()
switchBetween2Tabs() # switch back to the main tab
print("[WARNING]: This action can take between 6-9 hours.\nPlease be patient and do not mess with the Firefox window running in the background.\n")
print("["+str(datetime.now().strftime('%d/%m/%Y %H:%M'))+"]: Scraping strted.")
for row in fieldsOfTreatment.iterrows():
    tmpDf, aptSum = scrapeOneField(row[1]["K"])
    aptSum = int(aptSum)
    df = pd.concat([df, tmpDf], ignore_index=True)
    sumOfAppointments += aptSum
closeDriverAndBroswer() # close driver & headless broswer
print("\n["+str(datetime.now().strftime('%d/%m/%Y %H:%M'))+"]: Scraping finished.")
end = time.time()
print("Total scraping time is:", (end - start)/ 3600, "hours.")
print("Total appointments scraped:", sumOfAppointments)

Please be patient and do not mess with the Firefox window running in the background.

[16/05/2022 17:28]: Scraping strted.
[17:28:11]: Now scraping treatment field " אולטרה-סאונד גינקולוגי ומיילדותי " (128 appointments)...
[17:33:59]: Now scraping treatment field " אולטרסאונד נשים (על קול) " (128 appointments)...
[17:39:43]: Now scraping treatment field " אונקולוגיה " (87 appointments)...
[17:41:18]: Now scraping treatment field " אונקולוגיה - גידולי מערכת העיכול " (17 appointments)...
[17:41:53]: Now scraping treatment field " אונקולוגיה - גידולי ראש צוואר " (4 appointments)...
[17:42:01]: Now scraping treatment field " אונקולוגיה - גידולי רדיותרפיה " (5 appointments)...
[17:42:14]: Now scraping treatment field " אונקולוגיה - גידולי ריאה " (9 appointments)...
[17:42:37]: Now scraping treatment field " אונקולוגיה - גידולי שד " (23 appointments)...
[17:43:36]: Now scraping treatment field " אונקולוגיה - גידולים אורולוגים " (16 appointments)...
[17:44:21]: Now scraping treatment field " 

[00:22:01]: Now scraping treatment field " רפואת הנקה " (1 appointments)...
[00:22:07]: Now scraping treatment field " רפואת הפה " (10 appointments)...
[00:22:28]: Now scraping treatment field " רפואת נשים - גיל המעבר " (12 appointments)...
[00:22:45]: Now scraping treatment field " רפואת נשים - נערות ומתבגרות " (9 appointments)...
[00:23:01]: Now scraping treatment field " רפואת נשים - צוואר הרחם " (43 appointments)...
[00:24:30]: Now scraping treatment field " רפואת נשים/גינקולוגיה " (1033 appointments)...

[17/05/2022 01:03]: Scraping finished.
Total scraping time is: 7.595590588582887 hours.
Total appointments scraped: 12155


Wow!\
It took a long time, let's hope the wait was not in vain.\
Let's see what we got!

In [14]:
df

Unnamed: 0,docTitle,docName,gender,firstSpeciality,secountSpeciality,numberOfSpecializations,clinicStreet,clinicCity,closestAppointment,dateOfScraping,...,receptionOnWednesday,receptionOnThursday,receptionOnFriday,receptionOnSaturday,graduationYear,academicInstitution,profession,specialization,specializationMedicalInstitution,specializationYear
0,"ד""ר",עלימי יחיא מונא,F,אולטראסאונד נשים (על קול),,1.0,כפר קרע,כפר קרע,יום ו' 20/05/22,16/05/22,...,,,Fr 11:00-14:00 שבועית,,2010,אוני' העברית ירושלים,רפואה,רפואת נשים,,2017
1,"ד""ר",שטופמאכר יוסף,M,אולטראסאונד נשים (על קול),,1.0,התקוה 4,באר שבע,,16/05/22,...,,Th 07:30-13:30 שבועית,,,1978,"אוני' פורונז', בריה""מ",רפואה,רפואת נשים,"בי""ח - סורוקה ישראל",1997
2,פרופ',טלר ישראל,M,אולטראסאונד נשים (על קול),נשים - גינקולוגיה,2.0,חורב 15,חיפה,,16/05/22,...,,,Fr 09:00-12:00 אחת לשבועיים,,1973,אוני' העברית ירושלים,רפואה,רפואת נשים,"בי""ח - רמב""ם ישראל",1981
3,"ד""ר",אבירם רמי,M,אולטראסאונד נשים (על קול),,1.0,רמז דוד 13,נתניה,,16/05/22,...,We 14:00-17:40 שבועית,,,,1982,אוניברסיטת תל אביב,רפואה,רפואת נשים,"בי""ח - רבין ישראל",1988
4,"ד""ר",ברנן רפאל,M,אולטראסאונד נשים (על קול),,1.0,לוי משה 11,ראשון לציון,,16/05/22,...,,,,,1988,אוניברסיטת תל אביב,רפואה,רפואת נשים,"בי""ח - וולפסון ישראל",1992
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10056,"ד""ר",אבו אחמד נינה,F,נשים - גינקולוגיה,,1.0,אל-קרם אלשמאלי,כפר כנא,,17/05/22,...,,Th 08:30-10:30 שבועית,,,,,,,,
10057,"ד""ר",נסרה רימא,F,נשים - גינקולוגיה,,1.0,"צה""ל 3",צפת,,17/05/22,...,,,,,2016,"אוני' דמשק, סוריה",רפואה,,,
10058,"ד""ר",אילוז רועי,M,נשים - גינקולוגיה,,1.0,מעלה הרדוף 32,רמת ישי,,17/05/22,...,,,,,2015,הטכניון חיפה,רפואה,,,
10059,"ד""ר",סואלחי ודאד,F,נשים - גינקולוגיה,,1.0,,כפר קרע,,17/05/22,...,,,,,1984,"אוני' פדואה, איטליה",רפואה,,,


Okay, that looks great.\
The data clearly meets the minimum requirement (50K data items).\
Let's check exactly how many:

In [15]:
print(df.shape[0], "*", df.shape[1], "=", df.shape[0]*df.shape[1], "data items.")

10061 * 34 = 342074 data items.


Great! Let's continue.

As you may have noticed, there is a big gap between the total number of doctors reported by the Maccabi website (12,155)\
and the number of rows in my dataframe (10,061).\
The difference is due to 2 reasons:
1. The number of doctors listed is the number reported at the beginning of each search.\
As you know, the scrape took a long time, and appointments were scheduled / canceled while it ran.
2. The number of doctors on the list is the total number of doctors who work with the Maccabi HMO.\
Our study covers Maccabi doctors in public medicine only,\
where the payment for the appointment is covered by health insurance or requires only a minimal deductible,\
and therefore the code adds only doctors from public medicine to the dataframe.

With all that said, let's save the dataframe to a CSV file.\
We save it with the "utf-8-sig" encoding because the data includes Hebrew characters,\
and we need an appropriate encoding to be able to read the file back in the following steps.

In [16]:
df.to_csv("data/data_after_scraping.csv", encoding='utf-8-sig')
print("Dataframe saved as CSV file.")

Dataframe saved as CSV file.
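
A short standalone sketch of what the "utf-8-sig" codec does, using only the standard library. The file name and sample text are made up. The codec writes a byte-order mark (BOM) in front of the UTF-8 text, which some spreadsheet programs use to detect the encoding, and strips it again on reading.

```python
import os
import tempfile

text = "docName\nטלר ישראל\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")

    # "utf-8-sig" writes a byte-order mark (BOM) before the text.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        handle.write(text)

    with open(path, "rb") as handle:
        raw_bytes = handle.read()

    # Reading with the same codec strips the BOM again.
    with open(path, "r", encoding="utf-8-sig", newline="") as handle:
        round_trip = handle.read()

has_bom = raw_bytes.startswith(b"\xef\xbb\xbf")
print(has_bom, round_trip == text)
```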


Amazing, let's continue to the data cleaning step.