# Webscraper

#### **Author** -- Eric Chantland (ericchantland@gmail.com)
#### **Created** -- October 2022


<p>This program scrapes the eHRAF webpage for passage texts and OCMs. It is therefore very sensitive to the eHRAF webpage changing, loading slowly or quickly, or not providing standardized information. If this program fails, it is likely due to one of these three causes.</p>

<p>To run this program, you can run the cells individually or all at once. However, the automated Chrome window must stay open for the rest of the webscraper to work. Therefore, if you run into an issue, just rerun the whole program. If that does not fix it, then there must be a different issue.</p>

<p>If the program stalls, the webpage may have loaded too slowly for the scraper to detect, resulting in an infinite loop. <b>Stop the program and contact me.</b></p>
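One way to guard against this kind of stall is to poll with a hard time limit instead of looping until a condition becomes true. The sketch below is only an illustration and is not part of the scraper: `wait_until` is a hypothetical helper name, and the timeout and interval defaults are assumptions.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError instead of looping forever.
    (Hypothetical helper; the defaults are assumptions, not tuned for eHRAF.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In the cells below, the predicate could be something like a lambda that re-queries the page and checks that the expected number of elements has loaded, so that a slow page produces a clear error rather than a silent hang.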


In [3]:
import pandas as pd                 # dataframe storing
from bs4 import BeautifulSoup       # parsing web content in a nice way
import ssl                          # MAY BE UNNECESSARY: provides access to the Secure Sockets Layer (SSL) https://docs.python.org/3/library/ssl.html
import urllib                       # MAY BE UNNECESSARY: open and navigate URLs
import os                           # Find where this file is located.
import time                         # for pausing the program in order for it to load the webpage
import re                           # regex for searching through strings


from selenium import webdriver      # load and run the webpage dynamically.
from selenium.webdriver.chrome.options import Options

# for wait times
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


For each of these search variables, input the text "any", "none", or "all" followed by a list of the desired keywords. The exception is "culture", which only accepts "any" and "none".

For each of the query variables, just input a list, as they are always "Any".

Alternatively, you can copy and paste the hyperlink and get the same results 


In [307]:

# Search variables, 
# need at least one to contain information but the rest are optional
cultures = ["Any",[]]       # for specific cultures, input "Any" followed by a list of cultures; to exclude specific cultures, input "None" followed by a list
subjects = ["Any",[]]       # for specific subjects, the first index can be "any", "none" or "all"; the second index is a list of desired keywords
text = ["Any",[]]           # for specific words in a text, the first index can be "any", "none" or "all"; the second index is a list of desired keywords
and_or_search = "and"       #specify whether you want 

#Query Variables
# All are optional and do not need to specify "any" except for the 

### URL
On the eHRAF homepage, you may enter search terms and queries using their GUI (their menu). Once you are happy with the search terms, copy and paste the URL into the quotes of the `URL` variable below.
For example, to search the PSF sample for all documents containing "apple", I got a hyperlink like this:
https://ehrafworldcultures.yale.edu/search?q=text%3AApple&fq=culture_level_samples%7CPSF
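To see what such a hyperlink actually encodes, it can be decoded with the standard library. This is a small sketch for inspection only; the scraper itself does not need it, and the variable names here are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://ehrafworldcultures.yale.edu/search?q=text%3AApple&fq=culture_level_samples%7CPSF"

parts = urlsplit(url)
params = parse_qs(parts.query)   # percent-escapes such as %3A and %7C are decoded

print(parts.path)     # /search
print(params["q"])    # ['text:Apple']
print(params["fq"])   # ['culture_level_samples|PSF']

# The part after the last "/" is what gets appended to the home URL later on.
search_tokens = url.split("/")[-1]
print(search_tokens)  # search?q=text%3AApple&fq=culture_level_samples%7CPSF
```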

In [318]:
# make sure the hyperlink goes within quotes!
URL = "https://ehrafworldcultures.yale.edu/search?q=text%3AApple&fq=culture_level_samples%7CPSF"

<!-- <img src="Hyperlink-Example.png"> -->
![alt text](Hyperlink-Example.png "Title")

In [328]:
# Use an automated Chrome window to dynamically load the page for scraping.
# Requires chromedriver to be downloaded and its path specified below.

# Set "headless", which stops Chrome from showing itself when this is run.
# Set headless to False if you want to see the webpage or True if you want it to run in the background.
options = Options()
options.headless = False


# Unless you want to change the location, make sure the chromedriver program is located in the same folder that you run this application in.
# You must have Chrome installed (or download another browser driver and change the path). Download chromedriver here: https://chromedriver.chromium.org/downloads
path = os.getcwd() + "/chromedriver"
driver = webdriver.Chrome(executable_path = path, options=options)
# driver.maximize_window()

homeURL = "https://ehrafworldcultures.yale.edu/"
searchTokens = URL.split('/')[-1]

# Load the HTML page (note that this should be updated to allow for modular input)
driver.get(homeURL + searchTokens)

# Find then click on each tab to reveal content for scraping
# Elements must be clicked individually in reverse order. My guess is that each
# clicked tab adds HTML that pushes later tabs to a new location, so some indexes no longer point to a retrieved tab.
# Clicking backwards avoids this.
country_tab = driver.find_elements(By.CLASS_NAME, 'trad-overview__result')
for i in range(len(country_tab)-1,-1,-1):
    try:
        # Note: this clicking should work for each of the regions. However, technically, trad-overview__result is not the actual
        # element that should be clicked on. It is just good enough for simplicity's sake. If this gives you trouble, consider using
        # driver.execute_script("arguments[0].click();", country_tab[i]) and changing the driver.find_elements call above to target the button element or whatever the true clickable dropdown is,
        # although this will take a bit more indexing, so beware.
        country_tab[i].click()

    except Exception:
        print(f"WARNING tab {i} failed to be clicked")

# Parse processed webpage with BeautifulSoup
soup = BeautifulSoup(driver.page_source)

# extract the number of documents intended to be found
document_count = soup.find_all("span", {'class':'found__results'})
document_count = document_count[0].small.em.next_element
document_count = int(document_count.split()[1])
# estimate the time this will take
import math
time_sec = document_count/4.33
time_min = ""
time_hour = ""
if time_sec > 3600:
    time_hour = math.floor(time_sec/3600)
    time_sec -= time_hour*3600
    time_hour = f"{time_hour} hour(s), "
if time_sec > 60:
    time_min = math.floor(time_sec/60)
    time_sec -= time_min*60
    time_min = f"{time_min} minute(s), and "

time_sec = f"{math.floor(time_sec)} second(s)"


print(f"This will scrape up to {document_count} documents and take roughly \n{time_hour}{time_min}{time_sec}")

This will scrape up to 489 documents and take roughly 
 1 minute(s), and 52 seconds
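The time estimate above can also be written more compactly with `divmod`. The following is a sketch rather than a drop-in replacement: `format_duration` is a hypothetical helper, and the 4.33 documents-per-second rate is simply the figure used in the cell above.

```python
def format_duration(total_seconds):
    """Turn a number of seconds into a short 'h, m, s' string."""
    total_seconds = int(total_seconds)          # drop fractional seconds
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)

    parts = []
    if hours:
        parts.append(f"{hours} hour(s)")
    if minutes:
        parts.append(f"{minutes} minute(s)")
    parts.append(f"{seconds} second(s)")
    return ", ".join(parts)


DOCS_PER_SECOND = 4.33   # rate taken from the estimate in the cell above

print(format_duration(489 / DOCS_PER_SECOND))   # 1 minute(s), 52 second(s)
```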


In [329]:
#Example of finding then printing out the dynamic webpage (before clicking). Shown here are the regions
region_dir = soup.find_all('div',
{'class':'trad-overview__result'})

for x in region_dir:
    print(x.h4.button.text)


Africa
Asia
Europe
Middle America and the Caribbean
Middle East
North America
Oceania
South America


In [330]:
import time

# Create a dictionary to store all cultures and their links for later use
culture_dict = {}

# find the tables containing the cultures then loop through them to extract their subregion, region, name, and the link to the passages
# Note that if the ehraf website changes, this loop might need fixing by changing where the information is retrieved.
# Also note that if the dynamic page is not loaded correctly, (a warning is given above), this may also fail.
table_culture_links = soup.find_all('tr', {'class':'mdc-data-table__row'})

# repeat in case the website took too long to load.
loop_protect = 0
while len(table_culture_links) == 0:
    time.sleep(.1)
    soup = BeautifulSoup(driver.page_source)
    table_culture_links = soup.find_all('tr', {'class':'mdc-data-table__row'})
    loop_protect += 1
    if loop_protect > 5:
        raise Exception(f"Repeated loading {loop_protect-1} times but did not find links")
for culture_i in table_culture_links:
    culture_list = list(culture_i.children)

    subRegion = culture_list[0].text
    cultureName = culture_list[1].a.text
    link = culture_list[1].a.attrs['href']
    region = culture_i.findParent('table', {'role':'region'}).attrs['id']
    # 
    culture_dict[cultureName] = {"Region":region, "SubRegion":subRegion, "link":link}
print(f"Number of cultures extracted {len(culture_dict)}")

Number of cultures extracted 48


## Main Scraper

In [331]:

doc_count_total = 0

# create dataframe to hold all the data
df_eHRAF = pd.DataFrame({"Region":[], "SubRegion":[], "Culture":[], 'DocTitle':[], 'Year':[], "OCM":[], "OWC":[], "Passage":[]})

# For each Culture, go to their webpage link then scrape the document data
for key in culture_dict.keys():
    driver.get(homeURL + culture_dict[key]['link'])
    # driver.get(homeURL + culture_dict['Azande']['link'])
    # driver.get(homeURL + culture_dict[key]['link'])
    doc_count = 0
    
    # dataframe for each culture
    df_eHRAFCulture = pd.DataFrame({"Region":[], "SubRegion":[], "Culture":[], 'DocTitle':[], 'Year':[], "OCM":[], "OWC":[], "Passage":[]})
   
    # Try to make the program wait until the webpage is loaded
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "mdc-data-table__row")))
    #Click every source tab
    sourceTabs = driver.find_elements(By.CLASS_NAME, 'mdc-data-table__row')
    for source_i in sourceTabs:
        driver.execute_script("arguments[0].click();", source_i)

    #Log the source table's results number in order to know where to start and stop clicking.
    # Skip every 2 logs as they do not contain the information desired
    soup = BeautifulSoup(driver.page_source)
    sourceCount = soup.find_all('td',{'class':'mdc-data-table__cell mdc-data-table__cell--numeric'})
    sourceCount_list = list(map(lambda x: int(x.text), sourceCount[0::3]))


    
    # WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "trad-data__results")))

    # wait to make sure the page is loaded. CHANGE to a higher time if it runs indefinitely
    time.sleep(.1)

    #get the results tab(which is basically the source tab but contained within a different HTML element) for sub indexing sources
    resultsTabs = driver.find_elements(By.CLASS_NAME, 'trad-data__results')
    # if the resultsTabs did not all load, re-query until they do or the retry limit is reached
    reload_protect = 0
    while len(sourceCount_list) != len(resultsTabs) and reload_protect <= 10:
        time.sleep(.1)
        resultsTabs = driver.find_elements(By.CLASS_NAME, 'trad-data__results')
        reload_protect += 1
    

    resultsTabs_count = len(resultsTabs) #For later reload checking

    for i in range(len(resultsTabs)):
        total = sourceCount_list[i]
        
        #While total is > 0
        while True:
            docTabs = resultsTabs[i].find_elements(By.CLASS_NAME, 'sre-result__title')
            #Click all the tabs within a source
            for doc in docTabs:
                driver.execute_script("arguments[0].click();", doc)
                doc_count +=1


            soup = BeautifulSoup(driver.page_source)
            #Extract the document INFO here
            soupDocs = soup.find_all('section',{'class':'sre-result__sre-result'}, limit=total)
            for soupDoc in soupDocs:
                docPassage = soupDoc.find('div',{'class':'sre-result__sre-content'}).text
                
                soupOCM = soupDoc.find_all('div',{'class':'sre-result__ocms'})
                # OCMs
                # find all direct children a tags then extract the text
                ocmTags = soupOCM[0].find_all('a', recursive=False)
                OCM_list = []
                for ocmTag in ocmTags:
                    OCM_list.append(ocmTag.span.text)
                # OWC
                OWC = soupOCM[1].a['name']

                DocTitle = soupDoc.find('div',{'class':'sre-result__sre-content-metadata'})
                DocTitle = DocTitle.div.text
                # Search for the document's year of creation 
                Year = re.search(r'\(([0-9]{1,4})\)', DocTitle)
                if Year is not None:
                    # remove the date, then strip white space at the start and end to give the document's title
                    DocTitle = DocTitle.replace(Year.group(), '').strip()
                    # get the year without the parentheses
                    Year = int(Year.group(1))
                
                # dataframe for each document
                df_Doc = pd.DataFrame({'OCM':[OCM_list], 'OWC':[OWC], 'DocTitle':[DocTitle], 'Year':[Year],  'Passage':[docPassage]})
                df_eHRAFCulture = pd.concat([df_eHRAFCulture, df_Doc], ignore_index=True)
            # set remaining docs in a source tab (for clicking the "next" button if not all of them are shown)
            total -= len(docTabs)

            # If there are more tabs hidden away, find the button, click it, and then refresh the results
            # otherwise, end the loop and close the source tab to make search for information easier
            if total >0:
                SourceTabFooter = resultsTabs[i].find_elements(By.CLASS_NAME, 'trad-data__results--pagination')
                buttons = SourceTabFooter[0].find_elements(By.CLASS_NAME, 'rmwc-icon--ligature')
                driver.execute_script("arguments[0].click();", buttons[-1])
                time.sleep(.1)
                resultsTabs = driver.find_elements(By.CLASS_NAME, 'trad-data__results')
                # in case .1 was not enough time, redo until the entire page is loaded again or the retry limit is reached.
                reload_protect = 0
                while len(resultsTabs) < resultsTabs_count and reload_protect <= 10:
                    time.sleep(.1)
                    resultsTabs = driver.find_elements(By.CLASS_NAME, 'trad-data__results')
                    reload_protect += 1
            else:
                # close sourcetab(this might save time in the long run)
                driver.execute_script("arguments[0].click();", sourceTabs[i])
                break


    df_eHRAFCulture[['Region','SubRegion',"Culture"]] = [culture_dict[key]['Region'], culture_dict[key]['SubRegion'], key ]    
    df_eHRAF = pd.concat([df_eHRAF, df_eHRAFCulture], ignore_index=True)
    doc_count_total += doc_count
    if doc_count < sum(sourceCount_list):
        print(f"WARNING {doc_count} out of {sum(sourceCount_list)} documents loaded for {key}")

print(f'{doc_count_total} documents out of a possible {document_count} loaded (also check dataframe)')


489 documents out of a possible 489 loaded (also check dataframe)
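The title and year split used inside the main loop can be tested on its own, without a browser. The sketch below assumes the metadata text has the form "Title (Year)"; that format is an assumption based on the regex in the loop, and `split_title_year` is a hypothetical helper name.

```python
import re

# A parenthesised run of 1 to 4 digits, e.g. "(1926)"
YEAR_PATTERN = re.compile(r"\((\d{1,4})\)")


def split_title_year(raw_title):
    """Return (title, year); year is None when no '(YYYY)' is present."""
    match = YEAR_PATTERN.search(raw_title)
    if match is None:
        return raw_title.strip(), None
    # Remove the parenthesised year, then trim the leftover white space
    title = raw_title.replace(match.group(0), "").strip()
    return title, int(match.group(1))


print(split_title_year("An account of the Zande (1926)"))
# ('An account of the Zande', 1926)
```

Returning `None` for a missing year keeps the row usable; such rows will show up as null values in the `Year` column.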


In [305]:
df_eHRAF

Unnamed: 0,Continent,SubRegion,Culture,DocTitle,Year,OCM,OWC,Passage
0,Africa,Central Africa,Azande,An account of the Zande,1926.0,"[137, 222, 252]",fo07,The bagara looks like a miniature custard appl...
1,Africa,Central Africa,Azande,"Witchcraft, oracles and magic among the Azande",1937.0,[787],fo07,It is operated in the following manner. A man ...
2,Africa,Central Africa,Azande,"Witchcraft, oracles and magic among the Azande",1937.0,"[754, 755, 825, 902]",fo07,‘She began by grinding sesame and when she had...
3,Africa,Central Africa,Azande,"Witchcraft, oracles and magic among the Azande",1937.0,"[789, 826, 856]",fo07,"‘When a child begins to grow, his teeth try to..."
4,Africa,Central Africa,Azande,"Witchcraft, oracles and magic among the Azande",1937.0,[000],fo07,"Abagite , (deep intramuscular abscesses), 483,..."
...,...,...,...,...,...,...,...,...
484,South-America,Central Andes,Aymara,The Aymara Indians of the Lake Titicaca Plateau,1948.0,"[292, 535]",sf05,Dance costumes form a large category in themse...
485,South-America,Eastern South America,Bororo,"Through the wilderness of Brazil by horse, can...",1909.0,[000],sp08,"Wallace describes another milk tree, the Masse..."
486,South-America,Eastern South America,Guaraní,Prophets of agroforestry: Guarani communities ...,1995.0,"[432, 435, 883]",sm04,Table 5 . Sample of an Adolescent’s Wage Expen...
487,South-America,Southern South America,Mataco,The Mataco Indians and their language,1897.0,"[192, 196]",si07,One notices here the constant change of u to a...


In [306]:
# Any null values?
df_eHRAF.isnull().values.any()

False

In [9]:

df_eHRAF.to_excel('Data/web_data.xlsx', index=False)
# df_eHRAF.to_csv('web_data.csv', index=False)

In [None]:
# df_demo = df_eHRAF.copy()
# df_demo['Original_Scraper_URL'] = URL
# df_demo

## Optional dataframe manipulation

In [332]:
# for searching by OCM in the list
lst = ["753", "583", "435"]
msk = df_eHRAF['OCM'].apply(lambda x: not set(x).isdisjoint(lst))
out = df_eHRAF.loc[msk]
out

Unnamed: 0,Region,SubRegion,Culture,DocTitle,Year,OCM,OWC,Passage
41,Africa,Western Africa,Tiv,A source book on Tiv religion,1969.0,"[753, 755, 757, 761, 842, 845]",ff57,1. Wantor . [LB 18 August 1951 Ukusu] On the w...
88,Asia,South Asia,Sinhalese,Domestic architecture among the Kandyan Sinhalese,1971.0,"[114, 335, 341, 346, 351, 361, 364, 463, 532, ...",ax04,"1. With love I venerate munindu , ‘the eminent..."
113,Europe,Scandinavia,Saami,The history of Lapland: containing a geographi...,1704.0,"[132, 137, 223, 231, 251, 262, 278, 291, 438, ...",ep04,Next to the Beasts we will take a view of the ...
118,Europe,Scandinavia,Saami,The history of Lapland: containing a geographi...,1704.0,"[573, 578, 753, 754, 755, 756, 776, 778, 789, ...",ep04,Having thus given you a large Account of what ...
123,Europe,Southeastern Europe,Serbs,Peasant life in Jugoslavia,1941.0,"[000, 583]",ef06,"As already observed, wives, in many districts,..."
172,Europe,Southeastern Europe,Serbs,A Serbian village,1967.0,"[432, 435]",ef06,Table 5. Agricultural Produce Sold in Arandjel...
185,Europe,Southeastern Europe,Serbs,The peasant urbanites; a study of rural-urban ...,1973.0,[435],ef06,Prices 1. On the Kalenić open market (Kaleniće...
186,Europe,Southeastern Europe,Serbs,Healing ritual: studies in the technique and t...,1935.0,"[753, 784]",ef06,One should not: Keep a black cat; keep one in ...
214,Middle-America-and-the-Caribbean,Northern Mexico,Tarahumara,The Tarahumara: an Indian tribe of northern Me...,1935.0,"[423, 435]",nu33,"The fruit trees, introduced by the padres, hav..."
247,Middle-East,Middle East,Kurds,Rowanduz: a Kurdish administrative and mercant...,1953.0,"[583, 584, 831, 836, 885]",ma11,There is a strong prejudice against unmarried ...
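The mask above keeps a row when its OCM list shares at least one code with the search list, which is what `not set(x).isdisjoint(lst)` expresses. A plain-Python sketch of the same test, using three rows copied from the outputs above, may make the logic easier to follow:

```python
wanted = {"753", "583", "435"}

# Three rows taken from the dataframe outputs above (OCM codes are strings)
rows = [
    {"Culture": "Azande", "OCM": ["137", "222", "252"]},
    {"Culture": "Serbs", "OCM": ["000", "583"]},
    {"Culture": "Tarahumara", "OCM": ["423", "435"]},
]

# isdisjoint() is True when the two collections share no element,
# so "not isdisjoint" means "at least one wanted OCM is present".
matches = [row for row in rows if not wanted.isdisjoint(row["OCM"])]

print([row["Culture"] for row in matches])   # ['Serbs', 'Tarahumara']
```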


In [333]:
# Make each OCM have its own row by exploding (you can reset the index with .reset_index(drop=True))
df_eHRAF.explode(column='OCM')

Unnamed: 0,Region,SubRegion,Culture,DocTitle,Year,OCM,OWC,Passage
0,Africa,Central Africa,Azande,An account of the Zande,1926.0,137,fo07,The bagara looks like a miniature custard appl...
0,Africa,Central Africa,Azande,An account of the Zande,1926.0,222,fo07,The bagara looks like a miniature custard appl...
0,Africa,Central Africa,Azande,An account of the Zande,1926.0,252,fo07,The bagara looks like a miniature custard appl...
1,Africa,Central Africa,Azande,"Witchcraft, oracles and magic among the Azande",1937.0,787,fo07,It is operated in the following manner. A man ...
2,Africa,Central Africa,Azande,"Witchcraft, oracles and magic among the Azande",1937.0,754,fo07,‘She began by grinding sesame and when she had...
...,...,...,...,...,...,...,...,...
487,South-America,Southern South America,Mataco,The Mataco Indians and their language,1897.0,192,si07,One notices here the constant change of u to a...
487,South-America,Southern South America,Mataco,The Mataco Indians and their language,1897.0,196,si07,One notices here the constant change of u to a...
488,South-America,Southern South America,Ona,Analytical and critical bibliography of the tr...,1917.0,104,sh04,"4. Throat. Sk, j[unknown] e[unknown] ka[unknow..."
488,South-America,Southern South America,Ona,Analytical and critical bibliography of the tr...,1917.0,192,sh04,"4. Throat. Sk, j[unknown] e[unknown] ka[unknow..."
