# MonsterScrap doc
These functions scrape the Monster platform to collect job details.
____
Requirements (the `lxml` parser used by *dicoFromJson()* must also be installed):

In [1]:
from urllib.request import urlopen
from urllib.error import HTTPError
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
from unicodedata import normalize
from json import loads
import re

The *scrapBody()* function fetches a URL and returns the body of the HTML document.  
It avoids repeating this code in the main function.

In [2]:
def scrapBody(url):
    with urlopen(url) as response:
        body = BeautifulSoup(response.read(), 'html.parser').body
    return body

The *idFromLink()* function extracts the job ID from a link.
IDs come in two forms:
- A series of 9 digits
- A character string of the form xxxxxxxxxx-xxxx-xxxx-xxxx-xxxxxx-xxxxxxxxxxxx

Links on Monster come in 3 forms:
- Pages generated with ASP, whose ID consists of 9 digits only
- The form \*/monster/\* followed by the string ID, which is the standard form on the main site
- "job offer" *(in the language of the country)* with the numeric ID at the end, which is the standard form on each sub-site.

In [3]:
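def idFromLink(link):
    if ".aspx" in link:
        # ASP page: the 9 digits just before a trailing ".aspx"
        jobID = link[-14:-5]
    elif "/monster/" in link:
        # Main site: the text between "monster/" and the first "?"
        jobID = re.findall(r'monster/.+?\?', link)[0][8:-1]
    else:
        # Sub-sites: everything after the last "/"
        jobID = link[link.rfind('/')+1:]
    return jobID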
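
As a quick check of the three link forms listed above, the sketch below runs the same slicing logic on made-up links. The URLs and IDs are hypothetical, chosen only to match the described shapes, and the function is repeated so that the snippet runs on its own.

```python
import re

def idFromLink(link):
    # Same logic as the cell above, repeated so this snippet is self-contained
    if ".aspx" in link:
        jobID = link[-14:-5]
    elif "/monster/" in link:
        jobID = re.findall(r'monster/.+?\?', link)[0][8:-1]
    else:
        jobID = link[link.rfind('/')+1:]
    return jobID

# Hypothetical links, one per form (not real Monster URLs)
samples = [
    "https://example.invalid/job/data-scientist-123456789.aspx",
    "https://example.invalid/monster/abcdef-1234?mescoid=1",
    "https://example.invalid/job-offer/data-scientist/987654321",
]
ids = [idFromLink(link) for link in samples]
print(ids)
```

Note that the `/monster/` branch relies on a `?` following the ID: without one, `re.findall` returns an empty list and the `[0]` lookup raises an `IndexError`.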

The _scrapMonsterID()_ function extracts the *jobIDs* for each job search and country from the search results provided by Monster.
The [monster.co.uk](https://www.monster.co.uk/internationalJobs) site has access to the main database, unlike the sub-sites, which only have access to the database of their own country.

A query returns the total number of matches, if any. If there are more than 5000, the *resultCountLabel* division displays "5000+". If there are none, the division does not appear on the page.

While browsing pages $p$ greater than 1, the site sometimes reports no results on page $p$ although page $p+1$ has some, so the 20 results of page $p$ are skipped. The function counts this as an error.

The absence of *resultCountLabel* is not interpreted as the end of the results unless the number of jobs collected is equal to (or greater than) the match count minus the number of jobs skipped because of errors.

In [4]:
def scrapMonsterID(searchList, countryList):
    setID = set()
    for search in searchList:
        search = search.replace(" ","+")
        for country in countryList:
            match = 5001
            error = 0
            listID = set()
            page = 1
            while True:
                url = "https://www.monster.co.uk/medley?q={}&fq=countryabbrev_s%3A{}&pg={}".format(
                    search, country, page)
                try:
                    body = scrapBody(url)
                except HTTPError:
                    break
                else:
                    if body.find(id="resultCountLabel") is None:
                        if len(listID) == 0:
                            break
                        else:
                            error += 1
                            if len(listID) >= (match - 20 * error):
                                break
                            else:
                                page += 1
                                continue
                    else:
                        match = int(
                            re.sub(
                                r"\D", "",
                                body.find(
                                    id="resultCountLabel").text.split()[-1]))
                        links = [
                            link.a.attrs['href']
                            for link in body.find_all("div", class_="jobTitle")
                        ]
                        # Accumulate across pages instead of overwriting
                        listID.update(idFromLink(link) for link in links)
                        page += 1
                setID = setID.union(listID)
    return setID
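
The stopping rule described above can be isolated as two small pure functions, which makes it easier to reason about without querying the site. This is a sketch only: `parseMatch` and `shouldStop` are hypothetical helper names that do not exist in the notebook, the label text is an invented example, and the page size of 20 is taken from the description.

```python
import re

PAGE_SIZE = 20  # results per page, as stated in the description

def parseMatch(labelText):
    # Keep only the digits of the last token, so "5000+" becomes 5000
    return int(re.sub(r"\D", "", labelText.split()[-1]))

def shouldStop(collected, match, errors):
    # Called when a page has no resultCountLabel; `errors` already counts
    # that page. The crawl ends if nothing was collected at all, or if the
    # collected IDs cover the match count minus the results lost to errors.
    if collected == 0:
        return True
    return collected >= match - PAGE_SIZE * errors

print(parseMatch("1-20 of 5000+"))  # invented label text
print(shouldStop(0, 5001, 0))
print(shouldStop(40, 100, 1))
print(shouldStop(80, 100, 1))
```

With 100 announced matches and one skipped page, the crawl continues at 40 collected IDs and stops once 80 have been collected.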

The *dicoFromJson()* function normalizes the response to the request. For a given *jobID*, it collects information about the ad, the company and the specifics of the job in a dictionary with the following keys:

- description: long description
- country: 2-letter abbreviation of the country
- city: full city name
- posted: job post creation date
- header: post title
- company: company name
- type: type of employment contract (employee, intern...)
- category: job category
- url: link to the job post

In [5]:
def dicoFromJson(jobID):
    url = "https://job-openings.monster.com/v2/job/pure-json-view?jobid={}".format(
        jobID)
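    try:
        with urlopen(url) as response:
            query = response.read()
    except HTTPError:
        return {}
    # Strip accents (NFKD, then ASCII) before parsing; `loads` is the name
    # imported from json in the requirements cell
    dico = loads(
        normalize('NFKD', query.decode('utf-8')).encode('ascii', 'ignore'))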

    general = (("description", "jobDescription"),
               ("country", "jobLocationCountry"),
               ("city", "jobLocationCity"),
               ("posted", "postedDate"))
    company = (("header", "companyHeader"),
               ("company", "name"))
    tracks = (("type", "eVar33"),
              ("category", "eVar28"))

    ginfo, cinfo, tinfo = {}, {}, {}
    for g in general:
        try:
            ginfo[g[0]] = normalize(
                "NFKD",
                " ".join(BeautifulSoup(dico[g[1]], 'lxml').get_text().split()))
        except KeyError:
            ginfo[g[0]] = ""
    for c in company:
        try:
            cinfo[c[0]] = BeautifulSoup(dico["companyInfo"][c[1]],
                                        'lxml').get_text().rstrip()
        except KeyError:
            cinfo[c[0]] = ""
    for t in tracks:
        try:
            tinfo[t[0]] = BeautifulSoup(dico["adobeTrackingProperties"][t[1]],
                                        'lxml').get_text().rstrip()
        except KeyError:
            tinfo[t[0]] = ""
    
    dico = {**ginfo, **cinfo, **tinfo}
    dico["url"] = "https://job-openings.monster.co.uk/monster/{}".format(jobID)
    return dico
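
To illustrate the normalization step without a network call, the sketch below applies the same NFKD and ASCII conversion to a small payload and reads keys with an empty-string fallback. The payload is invented for the example and only mimics the key names read by *dicoFromJson()*. `dict.get` is used here in place of the `try`/`except KeyError` blocks, and the HTML stripping done with BeautifulSoup is left out.

```python
from json import loads
from unicodedata import normalize

# Hypothetical payload (not a real Monster response)
raw = '{"jobLocationCity": "Zürich", "companyInfo": {"name": "Café Data"}}'.encode("utf-8")

# NFKD splits an accented letter into a base letter plus a combining mark;
# encoding to ASCII with errors="ignore" then drops the combining marks.
asciiBytes = normalize("NFKD", raw.decode("utf-8")).encode("ascii", "ignore")
payload = loads(asciiBytes)

city = payload.get("jobLocationCity", "")
company = payload.get("companyInfo", {}).get("name", "")
posted = payload.get("postedDate", "")  # missing key falls back to ""
print(city, "|", company, "|", repr(posted))
```

One consequence of this choice is that any character without an ASCII decomposition is dropped silently rather than transliterated.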

**MonsterScrap()** is the main function, which collects and standardizes data from the Monster site.  
Job details are fetched in parallel threads: one thread per job below 20 results, otherwise one thread per five jobs.

In [6]:
def MonsterScrap(searchList, countryList):
    setID = scrapMonsterID(searchList, countryList)
    if not setID:
        # ThreadPoolExecutor rejects max_workers=0
        return []
    if len(setID) < 20:
        workers = len(setID)
    else:
        # Integer division: max_workers must be an int
        workers = len(setID) // 5
    with ThreadPoolExecutor(workers) as executor:
        scraped = list(executor.map(dicoFromJson, setID))
    return scraped
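
The thread sizing rule can be tried offline with a stand-in for *dicoFromJson()*. In this sketch, `workerCount` and `fakeFetch` are hypothetical names introduced for illustration, and the floor of one worker is an added assumption so that an empty result set does not break the executor.

```python
from concurrent.futures import ThreadPoolExecutor

def workerCount(n):
    # One thread per job below 20 jobs, otherwise one thread per five jobs,
    # and never fewer than one thread (assumption added for this sketch)
    return max(1, n if n < 20 else n // 5)

def fakeFetch(jobID):
    # Stand-in for dicoFromJson(); no network access
    return {"url": "job/{}".format(jobID)}

jobIDs = ["a", "b", "c"]
with ThreadPoolExecutor(workerCount(len(jobIDs))) as executor:
    # executor.map yields results in the order of its input
    results = list(executor.map(fakeFetch, jobIDs))
print(results)
```

Because the notebook passes a set to `executor.map`, the order of the returned dictionaries there follows the set's iteration order, not the order of the search results.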

## Example of use
Let's search for Data Scientist positions in the United Kingdom only.  
Preview the 6th item in the list.

In [7]:
listOfJob = MonsterScrap(["Data Scientist"],["UK"])
listOfJob[5]

{'description': 'Role We are looking for a Deep Learning Engineer to join our AI Engineering team in Cambridge or Gothenburg. The ideal candidate will have industry experience developing and applying Machine Learning and Deep Learning solutions, e.g. developing data pre-processing pipelines, modelling, training state of the art deep neural networks (CNNs/RNNs/LSTMs/Transformers) as well as deploying inferencing pipelines to process unseen data at scale. The position will involve taking these skills and applying them to some of the most exciting data & prediction problems in drug discovery. You will work as part of a global team of deeply technical data scientists, knowledge engineers & machine learning engineers and have the chance to create tools that will advance the standard of healthcare improving the lives of millions of patients across the globe.We are working in collaboration with our scientists to help develop better drugs faster, choose the right treatment for a patient and ru

The final goal is to combine this list with those from other job search engines to build a data frame *(with pandas)*.