
Web Scraping


Web scraping in Wikipedia Graph Mapper is handled by the script process_link.py, which is written in Python and powered by libraries such as Requests and BeautifulSoup.
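All the code excerpts on this page come from process_link.py and rely on module-level imports and globals that are not repeated in every snippet. They look roughly like the sketch below; the exact values, in particular HOME_URL, are assumptions rather than quotes from the source.

import requests
import requests_toolbelt.adapters.appengine
from bs4 import BeautifulSoup

# base URL prepended to the relative hrefs found on each page (assumed value)
HOME_URL = "https://en.wikipedia.org"

# module-level state shared by generate_lists() and scrape()
entryList = []   # one dictionary per hyperlink found during a crawl
MAX_LEVEL = 1    # maximum crawl depth; overwritten by generate_lists()
SEARCH_MODE = 1  # hyperlink selection strategy; overwritten by generate_lists()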

Table of Contents

  • Receiving the Command
  • Scraping Data off Wikipedia
  • Processing all those Data

Receiving the Command

Upon receiving a user request, the function generate_lists is first called to handle and process user input.

def generate_lists(self_url, max_level, isMobileBrowser, search_mode):
    # patches requests as it has compatibility issues with Google App Engine; comment this out to test on a development server
    requests_toolbelt.adapters.appengine.monkeypatch() 

    global MAX_LEVEL
    MAX_LEVEL = int(max_level)

    global SEARCH_MODE
    SEARCH_MODE = int(search_mode)

    # clears list for each new request made
    del entryList[:]

    nodeList, linkList = proc_data(scrape(self_url, currentLevel = 0), isMobileBrowser)
    return nodeList, linkList

The line calling monkeypatch() is only needed when deploying the application on Google App Engine. Comment it out when testing on your development server, as it may otherwise cause errors at runtime.
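One way to avoid editing the file by hand is to apply the patch only when the script detects it is running on App Engine, for instance by checking the SERVER_SOFTWARE environment variable set by the legacy App Engine runtime. This is a sketch rather than part of the original script:

import os

# apply the patch only in production on (legacy) Google App Engine;
# the local development server sets SERVER_SOFTWARE to a 'Development/...' value instead
if os.environ.get('SERVER_SOFTWARE', '').startswith('Google App Engine'):
    requests_toolbelt.adapters.appengine.monkeypatch()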

Besides calling the dependent functions scrape() and proc_data(), which carry out the actual web scraping and data processing respectively, generate_lists() also stores the user-supplied maximum search depth max_level as a global variable, forcing the recursive function scrape() to return prematurely once the maximum search depth is reached. Similarly, search_mode, which selects one of the three search settings*, is set as a global variable so that scrape() applies the same hyperlink selection criteria to every page it scrapes. A brief usage sketch follows the note below.

  • Refer to this repository's README for a more elaborate explanation of the various search settings offered, and to the following section Scraping Data off Wikipedia for more details on how these three options are implemented.
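For reference, a call to generate_lists() might look as follows; the URL and parameter values here are purely illustrative.

# crawl two levels deep, scanning only the first paragraph of each page (search mode 1),
# for a request coming from a desktop (non-mobile) browser
nodeList, linkList = generate_lists("https://en.wikipedia.org/wiki/Web_scraping", 2, False, 1)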

Scraping Data off Wikipedia

The recursive function scrape initially takes in a URL and performs an HTTP request to that page. It then parses the HTML document and extracts the relevant URLs from it, saving them into the list entryList.

Last but not least, it calls itself on each of the URLs found and repeats the same process all over again, terminating only when the maximum search level has been reached:

if currentLevel + 1 > MAX_LEVEL: # stops sending requests if spider is going to reach max crawl level
    return
currentLevel += 1 # links found on this page belong to the next level down

The page at the URL passed to the function is fetched using Requests. BeautifulSoup then parses this page into a BeautifulSoup object:

page = requests.get(self_url)
soup = BeautifulSoup(page.content, 'lxml')
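The original call performs no error handling; if you adapt the script, a slightly more defensive variant (not part of process_link.py) could look like this:

page = requests.get(self_url, timeout=10)  # avoid hanging indefinitely on a slow response
page.raise_for_status()                    # abort early on 4xx/5xx responses
soup = BeautifulSoup(page.content, 'lxml') # parse the HTML with the lxml parser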

This BeautifulSoup object is then used to extract the page title, as well as other URLs contained within the page:

self_title = soup.title.contents[0] # stores title of page

listTags = [] # list to store all hyperlinks found

if SEARCH_MODE == 1: # OPTION 1: Use this to search only the FIRST paragraph of each page for hyperlinks
    tag = soup.find('p') # search for first paragraph
    while (tag.find('a') != None and 'Coordinates' in tag.find('a').contents) or (tag.get('class') != None): # skips paragraphs that are coordinate links or not plain <p> tags
        tag = tag.findNext('p')
    listTags.extend(tag.findAll('a'))
elif SEARCH_MODE == 2: # OPTION 2: Use this to search the introduction of each page only for hyperlinks
    stop_at = soup.find('h2') # finds the first h2 element i.e. where the first subsection header is found
    class_extr = stop_at.find_all_previous('p') # extracts all <p> elements appearing before this header
    for paragraph in class_extr:
        listTags.extend(paragraph.findAll('a'))
elif SEARCH_MODE == 3: # OPTION 3: Use this to search the entire page for hyperlinks
    for paragraph in soup.findAll('p'): # for each paragraph found
        listTags.extend(paragraph.findAll('a')) # stores all hyperlinks found
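A small caveat on option 2: find_all_previous() walks backwards from the <h2> element, so the introduction paragraphs come back in reverse document order. Every hyperlink in them is still collected, so this only affects the order of listTags. A toy illustration with made-up HTML:

from bs4 import BeautifulSoup

html = "<p>first <a href='/wiki/A'>A</a></p><p>second <a href='/wiki/B'>B</a></p><h2>Section</h2>"
toy = BeautifulSoup(html, 'lxml')

stop_at = toy.find('h2')
print([p.find('a')['href'] for p in stop_at.find_all_previous('p')]) # ['/wiki/B', '/wiki/A']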

Once this has been done, some additional filtering of the Tag objects found is needed. A typical Wikipedia paragraph can contain various kinds of irrelevant links, such as those shown below:

(Screenshot: a sample Wikipedia paragraph whose links include citation notes and Wikimedia file links alongside regular article links)

So, in order to filter out citation notes, Wikimedia file links, and the like, each tag is converted to a string and checked against a list of special keywords. Tags containing any of these keywords, as well as duplicates, are discarded. The remaining tags are each saved as a dictionary and stored in the list entryList. The function scrape() is then (recursively) called on each legitimate tag found:

# cleans up list of hyperlinks; retains only relevant links
listLinks = [] # stores the name and url of each hyperlink found
# stores list of keywords that indicates links to be filtered out
listOfFilterKeywords = ['cite_note', 'File', 'wikimedia', 'Help', ':Verifiability', 'Wikipedia:', 'wiktionary.org'] 
for tag in listTags:
    if not any(keyword in str(tag) for keyword in listOfFilterKeywords): # checks if keyword is found; if so, skip this tag
        if 'title' in tag.attrs and 'href' in tag.attrs: # checks if title and link elements exist in the tag
            # stores a dictionary of the information regarding each hyperlink i.e. which page it is found on
            entry = {"self_title": self_title, "self_url": self_url, "ext_title": tag['title'], "ext_url": HOME_URL + tag['href'], "current_level": currentLevel} 
            if entry not in entryList: # filters out entries already present
                entryList.append(entry)
            scrape(entry["ext_url"], currentLevel) # depth-search via recursion
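To see how the keyword filter behaves, consider two hypothetical tags rendered as strings:

listOfFilterKeywords = ['cite_note', 'File', 'wikimedia', 'Help', ':Verifiability', 'Wikipedia:', 'wiktionary.org'] # as defined above

good = '<a href="/wiki/Web_crawler" title="Web crawler">Web crawler</a>'
bad = '<a href="#cite_note-1">[1]</a>'

print(any(keyword in good for keyword in listOfFilterKeywords)) # False -> the link is kept
print(any(keyword in bad for keyword in listOfFilterKeywords))  # True  -> the link is filtered out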

Last but not least, after all of its recursion branches have terminated, scrape() returns entryList:

return entryList
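To make the next section easier to follow, here is an illustration of what entryList might contain after a shallow crawl starting from the Web scraping article; the titles, URLs, and levels are made up for the example.

# illustrative contents of entryList (not actual crawler output)
[
    {"self_title": "Web scraping", "self_url": "https://en.wikipedia.org/wiki/Web_scraping",
     "ext_title": "Data scraping", "ext_url": "https://en.wikipedia.org/wiki/Data_scraping",
     "current_level": 1},

    {"self_title": "Data scraping", "self_url": "https://en.wikipedia.org/wiki/Data_scraping",
     "ext_title": "Data extraction", "ext_url": "https://en.wikipedia.org/wiki/Data_extraction",
     "current_level": 2}
]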

Processing all those Data

The entryList returned is then passed into the function proc_data(), which processes the data it contains into 2 lists, nodeList and linkList, representing respectively the nodes and links (edges) that will be built into the graph.

To create a list of nodes representing all the Wikipedia pages found, all unique URLs found in entryList are added to an intermediate list urls:

# creates a list to store unique urls found
urls = list(set([ data['self_url'] for data in entryList ])) # removes URL duplicates from self_urls
urls.extend(list(set([ data['ext_url'] for data in entryList ]))) # adds the unique external URLs to the list

Each url in urls is then saved in nodeList as a dictionary containing the url as a unique identifier id, the page title as label, and the search depth at which the page was first found as level (0 being the source, 1 being its neighbours, and so on):

nodeList = [] # to store nodes
for url in urls:
    for data in entryList:
        if url == data["self_url"]:
            entry = {"id": url, "label": data["self_title"], "level": data["current_level"] - 1}
            if entry not in nodeList:
                nodeList.append(entry)
            break
        elif url == data["ext_url"]:
            entry = {"id": url, "label": data["ext_title"], "level": data["current_level"]}
            if entry not in nodeList:
                nodeList.append(entry)
            break

The next step involves filling linkList with the data needed to define the edges of the graph. This is easier than the previous step, as we only need to extract data from entryList without any further filtering.

However, there is an additional step: defining the strength of each edge, which roughly represents the attractive force between the target and the source of the link. In other words, the higher the value, the closer the two nodes are likely to sit. Two different ways of defining this strength value have been implemented - one used when the user accesses the application via a mobile browser, and the other when it is accessed via a regular web browser:

linkList = [] # to store links
for data in entryList:
    if isMobileBrowser:
        strength = 0.6
    else:
        # strength formula 
        strength = 0.8 - 0.4*data["current_level"]/MAX_LEVEL
    linkList.append({"target": data["self_url"], "source": data["ext_url"], "strength": strength})

As can be seen, a uniform strength is used to pack nodes closer to one another on a mobile browser, whereas a decreasing function with respect to node level is used to define strength on a regular browser.
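For instance, with MAX_LEVEL = 3, the desktop formula spreads the strengths evenly between 0.8 at level 0 and 0.4 at level 3:

MAX_LEVEL = 3
for level in range(MAX_LEVEL + 1):
    print(level, round(0.8 - 0.4 * level / MAX_LEVEL, 3))
# 0 0.8
# 1 0.667
# 2 0.533
# 3 0.4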

Last but not least, the function returns both nodeList and linkList, which will then be used to generate the graph on the client-side:

return nodeList, linkList