# Webscraper

#### **Author** -- Eric Chantland (ericchantland@gmail.com)
#### **Created** -- October 2022


<p>The objective of this program is to webscrape the eHRAF database for various text passages in order to explore misery and how it relates with folk/wild traditions. To run this program, merely press play and enter in the URL and your name. To obtain the URL, progress through ehRAF website and enter in your search terms in the "advanced search" boxes and the filters. When you have successfully filtered the searches (so the region list) but BEFORE you actually look at any of the culture's links, copy the URL that it gives you into this program. This program seeks to scrape the webpage for texts and OCM's. It is therefore very susceptible to the eHRAF webpage changing, loading slow/fast, and/or not providing standardized information. If this program fails, it is likely due to one of these three.<p>

<p> To run this program, you can go individually or just run all the cells at once. However, this autonomous webpage (chrome) must be running for the rest of the webscraper to work. Therefore, if you run into an issue, just rerurn the whole program. If that does not fix it, then there must be a different issue. <p>

<p> If the program stalls for whatever reason, then webpage may have loaded too slow and the scraper was not able to catch that this happened resulting in an infinite loop. The program needs to be updated for its sleep timer. <b>Stop the program and contact me.<b> <p>


## Packages requirements

If you are loading this for the first time, you will need the packages used in the file. The best way is to use a virtual environment or use anaconda (which might save on space but some package may not be available through there). For a virtual environment (venv) type into the terminal:

        python - m venv venv
        
When venv is created, select it as your prefered kernal (should give you a prompt to do so, otherwise, select it in the to right corner). Then create a new terminal (should be able to do so at the top of the mac screen under "terminal"). Each line in your terminal should start with "venv" or whatever you named the environment if you called it something else. Now you can install the requirments in the terminal using:

        pip install -r requirements.txt -v
        
NOTE: It is likely the requirements file is bloated with packages you do not need. I have not tried to slim it down so feel free to just install packages as you see fit with

        pip install <your package name>

In [5]:
import pandas as pd                 # dataframe storing
from bs4 import BeautifulSoup       # parsing web content in a nice way
import ssl                          # MAY BE UNNECESSARY: provides access to the security socket layer (ssl) https://docs.python.org/3/library/ssl.html
import urllib                       # MAY BE UNNECESSARY: open and navigate URL's
import os                           # Find where this file is located.
import time                         # for pausing the program in order for it to load the webpage
import re                           # regex for searching through strings


from selenium import webdriver      # load and run the webpage dynamically.
from selenium.webdriver.chrome.options import Options

# for wait times
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


### URL


<b>You will need to use your browser for this.</b> On eHRAF homepage, you may put in various search terms and queries using their GUI (their menu) once you are happy with the search terms,
To access this menu, go to https://ehrafworldcultures.yale.edu/search/advanced where you will see a section for "culture", "subjects" and "keywords". You may input words separated by a comma into any and all of these boxes to conduct your search. Addiitonally, you can click the boixes which say "Any of these ___" to choose if you want "any" "none" or "all" of the search terms you put in the specific box. When you think it looks good, press "search" and you will get a URL at the top of your browser. <b> However</b> you may also filter the document further in the main page (shown in the picture below) by clicking on the "filter" button. A common filter to use is checking the "culture Level Samples" tab and selecting "PSF". This will update the URL which you can paste

<p> Try getting the same URL as shown below. We advanced searched for the keyword "apple" and chose the "culture level sample" of "PSF" in the filter </p>

 Copy and paste the URL into the parenthesis of the Variable below:
For example, to get the search for all documents containing "apple" and through the PSF, I got a hyperlink like this:
https://ehrafworldcultures.yale.edu/search?q=text%3AApple&fq=culture_level_samples%7CPSF

In [6]:
# make sure the hyperlink goes within quotes!
URL = input("Please enter in an eHRAF URL, otherwise enter nothing for a demo\n")
user = input("What is your name?\n")

<!-- <img src="Hyperlink-Example.png"> -->
![alt text](Hyperlink-Example.png "Title")

In [7]:
# Use a autonomous Chrome page to dynamically load the page for scraping. 
# Requires webdriver to be downloaded and then its path directed to.

# iniate "headless" which stops chrome from showing itself when this is run, 
# switch headless to False if you want to see the webpage or True if you want it to run in the background
options = Options()
options.headless = False


# Unless you want to change to location, make sure the chromedriver program is located within the same file folder that you run this application in.
# You must have chrome (or download another browser driver and change the path). Download the chrome software here: https://chromedriver.chromium.org/downloads
path = os.getcwd() + "/chromedriver"
driver = webdriver.Chrome(executable_path = path, options=options)
# driver.maximize_window()

# Demo if there is no URL entered
if URL == '':
    URL = r'https://ehrafworldcultures.yale.edu/search?q=text%3AApple&fq=culture_level_samples%7CPSF'

homeURL = "https://ehrafworldcultures.yale.edu/"
searchTokens = URL.split('/')[-1]

# Load the HTML page (note that this should be updated to allow for modular input)
driver.get(homeURL + searchTokens)

# Find then click on each tab to reveal content for scraping
# Elements must be individually clicked backwards. I do not know why this is a thing but my guess is each 
# clicked tab adds HTML pushing future tabs to a new location thereby making some indexing no longer point to a retrieved tab. 
# Loading backwards avoids this.
country_tab = driver.find_elements_by_class_name('trad-overview__result')
for i in range(len(country_tab)-1,-1,-1):
    try:
        #Note: this clicking should work for each of the Regions. However, technically, trad-overview__result is not the actual 
        # element that should be clicked on. It is just good enough for simplicity sake. If this give you trouble, consider putting in
        # driver.execute_script("arguments[0].click();", country_tab[i]) and changing the above drive.find to a button element or whatever is the true clickable drop down, 
        # although this will take a bit more indexing so beware.
        country_tab[i].click()
        
    except:
        print(f"WARNING tab {i} failed to be clicked")

# Parse processed webpage with BeautifulSoup
soup = BeautifulSoup(driver.page_source)

# extract the number of documents intended to be found
document_count = soup.find_all("span", {'class':'found__results'})
document_count = document_count[0].small.em.next_element
document_count = int(document_count.split()[1])
# estimate the time this will take
import math
time_sec = document_count/4.33
time_min = ""
time_hour = ""
if time_sec > 3600:
    time_hour = math.floor(time_sec/3600)
    time_sec -= time_hour*3600
    time_hour = f"{time_hour} hour(s), "
if time_sec > 60:
    time_min = math.floor(time_sec/60)
    time_sec -= time_min*60
    time_min = f"{time_min} minute(s), and "

time_sec = f"{math.floor(time_sec)} second(s)"


print(f"This will scrape up to {document_count} documents and take roughly \n{time_hour}{time_min}{time_sec}")

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 106
Current browser version is 108.0.5359.71 with binary path /Applications/Google Chrome.app/Contents/MacOS/Google Chrome


In [4]:
import time

# Create a dictionary to store all cultures and their links for later use
culture_dict = {}

# find the tables containing the cultures then loop through them to extract their subregion, region, name, and the link to the passages
# Note that if the ehraf website changes, this loop might need fixing by changing where the information is retrieved.
# Also note that if the dynamic page is not loaded correctly, (a warning is given above), this may also fail.
table_culture_links = soup.find_all('tr', {'class':'mdc-data-table__row'})

# repeat in case the website took to long to load.
loop_protect = 0
while len(table_culture_links) == 0:
    time.sleep(.1)
    soup = BeautifulSoup(driver.page_source)
    table_culture_links = soup.find_all('tr', {'class':'mdc-data-table__row'})
    loop_protect += 1
    if loop_protect > 5:
        raise Exception(f"Repeated loading {loop_protect-1} times but did not find links")
for culture_i in table_culture_links:
    culture_list = list(culture_i.children)

    subRegion = culture_list[0].text
    cultureName = culture_list[1].a.text
    link = culture_list[1].a.attrs['href']
    region = culture_i.findParent('table', {'role':'region'}).attrs['id']
    source_count = int(culture_list[-2].text)
    
    culture_dict[cultureName] = {"Region":region, "SubRegion":subRegion, "link":link, "Source_count":source_count, "Reloads":{"Source_reload":0, "Doc_reload":0}}
print(f"Number of cultures extracted {len(culture_dict)}")

Number of cultures extracted 1


## Main Scraper

In [5]:

doc_count_total = 0

# create dataframe to hold all the data
df_eHRAF = pd.DataFrame({"Region":[], "SubRegion":[], "Culture":[], 'DocTitle':[], 'Year':[], "OCM":[], "OWC":[], "Passage":[]})



# For each Culture, go to their webpage link then scrape the document data
for key in culture_dict.keys():
    driver.get(homeURL + culture_dict[key]['link'])
    # driver.get(homeURL + culture_dict['Azande']['link'])
    # driver.get(homeURL + culture_dict[key]['link'])
    doc_count = 0
    
    # dataframe for each culture
    df_eHRAFCulture = pd.DataFrame({"Region":[], "SubRegion":[], "Culture":[], 'DocTitle':[], 'Year':[], "OCM":[], "OWC":[], "Passage":[]})

    # loop until every page containing a source tab is clicked
    source_total = culture_dict[key]['Source_count']
    while source_total > 0:
        # Try to make the program wait until the wepage is loaded
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "mdc-data-table__row")))
        #Click every source tab
        sourceTabs = driver.find_elements_by_class_name('mdc-data-table__row')
        for source_i in sourceTabs:
            driver.execute_script("arguments[0].click();", source_i)

        #Log the source table's results number in order to know where to start and stop clicking.
        # Skip every 2 logs as they do not contain the information desired
        soup = BeautifulSoup(driver.page_source)
        sourceCount = soup.find_all('td',{'class':'mdc-data-table__cell mdc-data-table__cell--numeric'})
        sourceCount_list = list(map(lambda x: int(x.text), sourceCount[0::3]))


        
        # WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "trad-data__results")))

        # wait to make sure the page is loaded. CHANGE to a higher time if it runs indefinately
        time.sleep(.1)

        #get the results tab(which is basically the source tab but contained within a different HTML element) for sub indexing sources
        resultsTabs = driver.find_elements_by_class_name('trad-data__results')
        # if the resultsTabs did not all load, reload as necessary
        reload_protect = 0
        while len(sourceCount_list) != len(resultsTabs) and reload_protect<=10:
            time.sleep(.1)
            resultsTabs = driver.find_elements_by_class_name('trad-data__results')
            reload_protect += 1
        if reload_protect != 0:
            culture_dict[key]["Reloads"]["Source_reload"] += reload_protect
        

        resultsTabs_count = len(resultsTabs) #For later reload checking

        # click and extract information from each document within the result/source tabs
        for i in range(len(resultsTabs)):
            total = sourceCount_list[i]
            
            # loop until the program can click and find every piece of information for each document (this is probably where things will break if times are off)
            while True:
                docTabs = resultsTabs[i].find_elements_by_class_name('sre-result__title')
                #Click all the tabs within a source
                for doc in docTabs:
                    driver.execute_script("arguments[0].click();", doc)
                    doc_count +=1


                soup = BeautifulSoup(driver.page_source)
                #Extract the document INFO here
                soupDocs = soup.find_all('section',{'class':'sre-result__sre-result'}, limit=total)
                for soupDoc in soupDocs:
                    docPassage = soupDoc.find('div',{'class':'sre-result__sre-content'}).text
                    
                    soupOCM = soupDoc.find_all('div',{'class':'sre-result__ocms'})
                    # OCMs
                    # find all direct children a tags then extract the text
                    ocmTags = soupOCM[0].find_all('a', recursive=False)
                    OCM_list = []
                    for ocmTag in ocmTags:
                        OCM_list.append(int(ocmTag.span.text))
                    # OWC
                    OWC = soupOCM[1].a['name']

                    DocTitle = soupDoc.find('div',{'class':'sre-result__sre-content-metadata'})
                    DocTitle = DocTitle.div.text
                    # Search for the document's year of creation 
                    Year = re.search('\(([0-9]{0,4})\)', DocTitle)
                    if Year is not None:
                        # remove the date then strip white space at the end and start to give the document's title
                        DocTitle = re.sub(f'\({Year.group()}\)', '', DocTitle).strip()
                        # get the year without the parenthesis
                        Year = int(Year.group()[1:-1])
                    
                    # dataframe for each document
                    df_Doc = pd.DataFrame({'OCM':[OCM_list], 'OWC':[OWC], 'DocTitle':[DocTitle], 'Year':[Year],  'Passage':[docPassage]})
                    df_eHRAFCulture = pd.concat([df_eHRAFCulture, df_Doc], ignore_index=True)
                # set remaining docs in a source tab (for clicking the "next" button if not all of them are shown)
                total -= len(docTabs)

                # If there are more tabs hidden away, find the button, click it, and then refresh the results
                # otherwise, end the loop and close the source tab to make search for information easier
                # NOTE that we have to search for the resultsTabs again because the page refreshed and the points 
                # originally found above no longer point to the same location and therefore will not work
                if total >0:
                    SourceTabFooter = resultsTabs[i].find_elements_by_class_name('trad-data__results--pagination')
                    buttons = SourceTabFooter[0].find_elements_by_class_name('rmwc-icon--ligature')
                    driver.execute_script("arguments[0].click();", buttons[-1])
                    time.sleep(.1)
                    resultsTabs = driver.find_elements_by_class_name('trad-data__results')
                    # in case .1 was not enough time, redo until the entire page is loaded again.
                    reload_protect = 0
                    while len(resultsTabs) < resultsTabs_count and reload_protect<=10:
                        time.sleep(.1)
                        resultsTabs = driver.find_elements_by_class_name('trad-data__results')
                        reload_protect += 1
                    # else:
                    #     raise Exception("failed to load all results tabs, please contact ericchantland@gmail.com for info on fixing the time waits")
                    if reload_protect != 0:
                        if reload_protect > 10:
                            raise Exception("failed to load all results tabs, please contact ericchantland@gmail.com for info on fixing the time waits")
                        else:
                            culture_dict[key]["Reloads"]["Doc_reload"] += reload_protect
                            
                else:
                    ## close sourcetab(this might save time in the long run) 
                    ## NOTE: commented out because it will not work anymore with multi sources (sources with more than 10 passages). 
                    ## If you want it to close the tabs, you could copy the above resultsTabs reload and put it right after this line of code then chnage docTabs = resultsTabs[i] above to docTabs = resultsTabs[0]
                    # driver.execute_script("arguments[0].click();", sourceTabs[i])
                    break
        # Run to the next page if necessary. Check to see if there are more source tabs left, if so, click the next page and continue scraping the page
        source_total -= len(resultsTabs)
        if source_total >0:
            next_page = driver.find_element_by_xpath("//button[@title='Next Page']")
            driver.execute_script("arguments[0].click();", next_page)



    df_eHRAFCulture[['Region','SubRegion',"Culture"]] = [culture_dict[key]['Region'], culture_dict[key]['SubRegion'], key ]    
    df_eHRAF = pd.concat([df_eHRAF, df_eHRAFCulture], ignore_index=True)
    doc_count_total += doc_count
    if doc_count < sum(sourceCount_list):
        print(f"WARNING {doc_count} out of {sum(sourceCount_list)} documents loaded for {key}")

print(f'{doc_count_total} documents out of a possible {document_count} loaded (also check dataframe)')


4 documents out of a possible 4 loaded (also check dataframe)


In [6]:
# print reload count
for key, val in culture_dict.items():
    if culture_dict[key]["Reloads"]["Source_reload"] >0 or culture_dict[key]["Reloads"]["Doc_reload"] >0:
        print(key," Source reloads: ", culture_dict[key]["Reloads"]["Source_reload"]," Document reloads: ", culture_dict[key]["Reloads"]["Doc_reload"])

In [7]:
# close the webpage
driver.close()

In [8]:
# Any null values?
if df_eHRAF.isnull().values.any():
    print('Some null values found:')
    for col in df_eHRAF.columns:
        print(f"{col}: {df_eHRAF[col].isnull().sum()}")
else: 
    print("no null values found")


no null values found


In [9]:
# get time and date that this program was run
from datetime import date, datetime
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
current_date = now.strftime("%m/%d/%y")



In [10]:
# clean and strip the URL to be put into the excel document

replace_dict = {'%28':'(', '%29':')', '%3A':'~', '%7C':'|', '%3B':';'}
remove_list = [homeURL, 'search', '\?q=', 'fq=', '&', 'culture_level_samples']

URL_name = URL

for i in remove_list:
    URL_name = re.sub(i, '', URL_name)
# print(URL_name)
for key, val in replace_dict.items():
    URL_name = re.sub(key, val, URL_name)
# print(URL_name)


URL_name_nonPlussed = re.sub('\+', ' ', URL_name)
URL_name_nonPlussed = re.sub('\~', ':', URL_name)
URL_name_nonPlussed



'cultures:%22Hawaiians%22+AND+(subjects:%22spirits+and+gods%22+AND+text:Apple)'

In [11]:
# place run information within the "run_info" column
df_eHRAF['run_Info'] = None
df_eHRAF.loc[0, 'run_Info'] = "User: " + user
df_eHRAF.loc[1, 'run_Info'] = "Run Time: " + str(current_time)
df_eHRAF.loc[2, 'run_Info'] = "Run Date: " + str(current_date)
df_eHRAF.loc[3, 'run_Info'] = "Run Input: " + URL_name_nonPlussed
df_eHRAF.loc[4, 'run_Info'] = "Run URL: " + URL
df_eHRAF

Unnamed: 0,Region,SubRegion,Culture,DocTitle,Year,OCM,OWC,Passage,run_Info
0,Oceania,Polynesia,Hawaiians,Arts and crafts of Hawaii,1957.0,"[322, 5311, 776, 778]",ov05,Further elaboration is shown in an image in th...,User: Eric
1,Oceania,Polynesia,Hawaiians,Arts and crafts of Hawaii,1957.0,"[322, 5311, 776, 778]",ov05,Further elaboration is shown in an image in th...,Run Time: 15:42:21
2,Oceania,Polynesia,Hawaiians,"Native planters in old Hawaii: their life, lor...",1972.0,"[533, 535, 776, 778, 824, 874]",ov05,"Laka, the goddess of the wildwood who was patr...",Run Date: 11/29/22
3,Oceania,Polynesia,Hawaiians,Arts and crafts of Hawaii,1957.0,"[322, 5311, 776, 778]",ov05,Further elaboration is shown in an image in th...,Run Input: cultures:%22Hawaiians%22+AND+(subje...
4,,,,,,,,,Run URL: https://ehrafworldcultures.yale.edu/s...


In [12]:
directory = os.getcwd()  # get current directory
output_dir = "Data"  # output directory
output_dir_path = directory + '/' + output_dir  # output directory path
os.makedirs(output_dir_path, exist_ok=True)  # make Data folder if it does not exist

try:
    df_eHRAF.to_excel('Data/' + URL_name + '_web_data.xlsx', index=False)
except:
    print("Unable to save the title of the document, please rename it or risk overwriting")
    df_eHRAF.to_excel('Data/' + user + str(now.strftime("%m_%d_%y")) + '_web_data.xlsx', index=False)