# "Toy model" for job posting scraping

This is a basic system that illustrates the process of web scraping. I used the requests library to access URLs, Beautiful Soup and regular expressions to parse webpage content, and the Natural Language Toolkit (NLTK) to handle the extracted text. The project follows this idea:

## Building the model

1. Create a corpus of job postings. I restricted myself to postings in the mechanical engineering ("strojírenství") category of jobs.cz. For simplicity I used only about 40 postings.
2. Analyze the corpus: tokenize, normalize and filter it in order to extract the key words, i.e. the most frequent words in the corpus.
3. Define criteria for a text to be a job posting, based on the length of the text and the proportion of key words in it.
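
Steps 2 and 3 can be sketched on a tiny in-memory corpus. This is only a minimal illustration, not the notebook's code: it uses `str.split` instead of NLTK's `word_tokenize`, and the names `toy_corpus`, `tokenize`, `top_words` and `toy_key_words` are hypothetical.

```python
from collections import Counter

# Hypothetical miniature corpus standing in for the scraped postings
toy_corpus = [
    "nabízíme práce na plný úvazek",
    "hledáme technik výroby práce na směny",
    "práce ve výrobě plný úvazek",
]

def tokenize(text):
    # Lowercase, split on whitespace, keep tokens of 4 to 8 characters
    # (the same length window the notebook uses in place of a stopword list)
    return [t for t in text.lower().split() if 3 < len(t) < 9]

def top_words(postings, n):
    # Count tokens across all postings and return the n most frequent ones
    counts = Counter(tok for posting in postings for tok in tokenize(posting))
    return [word for word, _ in counts.most_common(n)]

toy_key_words = top_words(toy_corpus, 3)
print(toy_key_words)
```

With a real corpus the key-word list would be much longer (the notebook keeps 150 words), but the flow is the same: clean, tokenize, filter, count.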

## Testing the model

1. Access the company's website and find the page with career postings. This model only works for sites where postings are displayed on internal pages.
2. Find the first two levels of internal links reachable from the career page. Links that also appear on the main page are ignored: they belong to the header and footer and do not lead to job postings. I also assume that every job posting can be reached within two clicks from the career page.
3. Assess whether the content of each linked page corresponds to a job posting.
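
The link discovery in step 2 can be sketched without any network access. The example below is a simplified illustration under stated assumptions: `site` is a hypothetical dictionary that maps each page to the links found on it, and `candidate_pages` is a hypothetical helper, not a function from the notebook.

```python
# Hypothetical site map: page -> links found on that page
site = {
    "/": ["/about", "/career", "/contact"],
    "/career": ["/about", "/career/jobs", "/contact"],
    "/career/jobs": ["/career/jobs/welder", "/career/jobs/electrician", "/about"],
    "/career/jobs/welder": ["/contact"],
    "/career/jobs/electrician": ["/contact"],
}

def candidate_pages(site, home, career, depth=2):
    # Links shown on the home page are treated as header/footer navigation
    ignored = set(site.get(home, [])) | {home}
    visited = {career}
    frontier = [career]
    candidates = []
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            for link in site.get(page, []):
                if link in ignored or link in visited:
                    continue
                visited.add(link)
                candidates.append(link)
                next_frontier.append(link)
        # Pages found at this level are expanded at the next level
        frontier = next_frontier
    return candidates

print(candidate_pages(site, "/", "/career"))
```

Each candidate returned this way would then be passed to the classifier from step 3.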


In [1]:
import requests

In [2]:
from bs4 import BeautifulSoup
from nltk import FreqDist
import nltk
from nltk.tokenize import word_tokenize
import re
from urllib.parse import urlparse

# Training the model




In [3]:
#access jobs.cz, category "strojirenstvi"
try:
    r = requests.get('https://www.jobs.cz/prace/strojirenstvi/')
except requests.exceptions.RequestException as e:
    raise SystemExit(e)

#initialize an instance of BeautifulSoup (naming the parser avoids a warning)
soup = BeautifulSoup(r.text, 'html.parser')

#find all links on the page, skipping <a> tags without an href
links = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href:
        links.append(href)

#filter links corresponding to job postings (they have the form https://www.jobs.cz/rpd/...)
test_links = []
for link in links:
    if re.search(r'https://www\.jobs\.cz/rpd/.', link):
        test_links.append(link)

In [4]:
#create a list of job postings accessed on jobs.cz
job_postings = []
for link in test_links:
    try:
        r = requests.get(link)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)

    #get text content from the webpage
    soup = BeautifulSoup(r.text, 'html.parser')
    job_postings.append(soup.get_text())

#check number of postings
len(job_postings)

38

Preprocess job postings

In [5]:
#remove redundant characters
def clean_text(text):
    #characters replaced by a space: newline, hyphen, comma, period,
    #non-breaking space and ideographic space
    replaced_text = re.sub(r'[\n\-,.\xa0\u3000]', ' ', text)
    #characters removed entirely (the ASCII '?' is added next to the full-width '？')
    replaced_text = re.sub(r'[？?!►/•:()]', '', replaced_text)

    return replaced_text.lower()

In [6]:
#get list of tokens with stop words removed
#I did not create a stopword list for the Czech language; I filter on the number of characters instead
#note: word_tokenize needs the NLTK tokenizer data to be downloaded

def trim_test_postings(domains,job_postings):
    #get tokens; domains is only used to decide how many postings to read
    tokens=[]
    for i in range(0,len(domains)):
        tokens.extend(word_tokenize(clean_text(job_postings[i])))

    #filter stopwords: keep tokens of 4 to 8 characters
    stopwords_removed=[]
    for item in tokens:
        if 3 < len(item) < 9:
            stopwords_removed.append(item)
    return stopwords_removed


In [7]:
#creating a corpus of preprocessed job postings
corpus=nltk.Text(trim_test_postings(test_links,job_postings))


In [8]:
#function that finds the 150 most common words in a corpus
def frequent_words(text):
    fdist = FreqDist(text)
    #most_common returns (word, count) tuples; keep only the words
    return [word for word, count in fdist.most_common(150)]

In [10]:
#creating list of words that will serve as the key words in the analysis
corpus_words=frequent_words(corpus)

corpus_words

[('práce', 218), ('pracovní', 94), ('firmy', 93), ('týdnů', 81), ('vstup', 66), ('spol', 65), ('lidé', 57), ('nebo', 57), ('poměru', 55), ('pracovat', 49), ('jsme', 46), ('vzdělání', 45), ('nabídky', 42), ('plný', 40), ('úvazek', 39), ('skupiny', 39), ('kraj', 39), ('průmysl', 38), ('okres', 38), ('historie', 37), ('menu', 37), ('odborné', 37), ('vyučení', 37), ('podmínky', 37), ('pozice', 35), ('ochrana', 35), ('rodiny', 35), ('firemní', 35), ('smlouva', 34), ('sdílet', 34), ('členem', 34), ('nábor', 34), ('člen', 34), ('všechna', 34), ('brigády', 33), ('odpovědí', 33), ('zařazeno', 33), ('přidat', 33), ('napsat', 33), ('jobsmůj', 33), ('firmypro', 33), ('námpro', 33), ('cookies', 33), ('soukromí', 33), ('english', 33), ('lmcjobs', 33), ('czslušná', 33), ('czfirmy', 33), ('očima', 33), ('jednom', 33), ('czonline', 33), ('alma', 33), ('career', 33), ('práva', 33), ('obsahu', 33), ('chytrou', 33), ('matej', 33), ('benefity', 33), ('máte', 32), ('praha', 32), ('vztahu', 32), ('pozici', 3

['práce',
 'pracovní',
 'firmy',
 'týdnů',
 'vstup',
 'spol',
 'lidé',
 'nebo',
 'poměru',
 'pracovat',
 'jsme',
 'vzdělání',
 'nabídky',
 'plný',
 'úvazek',
 'skupiny',
 'kraj',
 'průmysl',
 'okres',
 'historie',
 'menu',
 'odborné',
 'vyučení',
 'podmínky',
 'pozice',
 'ochrana',
 'rodiny',
 'firemní',
 'smlouva',
 'sdílet',
 'členem',
 'nábor',
 'člen',
 'všechna',
 'brigády',
 'odpovědí',
 'zařazeno',
 'přidat',
 'napsat',
 'jobsmůj',
 'firmypro',
 'námpro',
 'cookies',
 'soukromí',
 'english',
 'lmcjobs',
 'czslušná',
 'czfirmy',
 'očima',
 'jednom',
 'czonline',
 'alma',
 'career',
 'práva',
 'obsahu',
 'chytrou',
 'matej',
 'benefity',
 'máte',
 'praha',
 'vztahu',
 'pozici',
 'dobu',
 'znalost',
 'školení',
 'strojů',
 'výroby',
 'stroje',
 'dovolená',
 'práci',
 'česká',
 'technik',
 'kurzy',
 'tuto',
 'nabídku',
 'výrobní',
 'naše',
 'kontrola',
 'adresa',
 'šanci',
 'zatím',
 'méně',
 'oblasti',
 'možnost',
 'továrna',
 'výroba',
 'můžete',
 'oboru',
 'jazyky',
 'délka',
 'o

I defined a function that evaluates whether a given text is a job posting. The current model is oversimplified: I set a very low threshold so that it works well on my particular problem.


In [None]:
def trim(page_text):
    #get tokens
    tokens = word_tokenize(clean_text(page_text))

    #filter stopwords by length: keep tokens of 4 to 8 characters
    stopwords_removed=[]
    for item in tokens:
        if 3 < len(item) < 9:
            stopwords_removed.append(item)
    return stopwords_removed

In [11]:
#check if it is a job posting. Criteria: more than 100 words after preprocessing, and more than 10% of them are key words.
def job_posting(text,corpus_words):
    # preprocessed list of words on the given webpage
    bag_of_words=trim(text)

    #the score rises for every word that is among the corpus key words
    score=0
    for item in bag_of_words:
        if item in corpus_words:
            score=score+1

    #check conditions on job postings (the length check comes first, so the ratio never divides by zero)
    return len(bag_of_words) > 100 and score/len(bag_of_words) > 0.1
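
The criterion can also be written with the thresholds exposed as parameters, which makes it easier to experiment with stricter values. This is a hedged sketch rather than the notebook's function: `keyword_ratio`, `looks_like_posting`, `min_tokens` and `min_ratio` are hypothetical names, and the input is assumed to be an already preprocessed token list.

```python
def keyword_ratio(tokens, key_words):
    # Share of tokens that appear among the key words (0.0 for empty input)
    if not tokens:
        return 0.0
    key_set = set(key_words)  # a set gives constant-time membership tests
    hits = sum(1 for tok in tokens if tok in key_set)
    return hits / len(tokens)

def looks_like_posting(tokens, key_words, min_tokens=100, min_ratio=0.1):
    # Both conditions must hold: enough text and enough key-word hits
    return len(tokens) > min_tokens and keyword_ratio(tokens, key_words) > min_ratio

# Hypothetical page: 120 tokens, 30 of which are key words
page_tokens = ["práce"] * 30 + ["lorem"] * 90
print(looks_like_posting(page_tokens, ["práce", "úvazek"]))
```

The defaults mirror the values used in the cell above (more than 100 tokens, more than 10% key words).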


In [12]:
#check if the evaluation function works well with the training job postings
for posting in job_postings:    
    print(job_posting(posting,corpus_words))

True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True


# Testing the model

Test the model on an example website, https://www.rdrymarov.cz/.

This model is oversimplified: it only works with sites where job postings are displayed on internal pages.

In [13]:
page='https://www.rdrymarov.cz/'

In [14]:

def access_url(url):
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)

    #remember the final URL (after redirects) in the global set visited_pages
    formatted_url=r.url
    visited_pages.add(formatted_url)
    soup=BeautifulSoup(r.text, 'html.parser')

    #parse url to get the precise domain
    domain=urlparse(formatted_url).netloc

    return domain, soup, formatted_url


In [15]:
#initialize set of pages already visited
visited_pages=set()


In [16]:
#get all links on the given page
def get_links(soup):
    links=set()

    for link in soup.find_all('a'):
        href=link.get('href')
        #skip <a> tags that have no href attribute
        if href:
            links.add(href)

    return links

In [17]:
#filter internal links
def format_links(links,domain):
    #unify the form of the links first
    formatted_links=set()

    for link in links:
        #skip missing hrefs
        if not link:
            continue
        #add domain to internal links
        if link.startswith('/'):
            formatted_links.add('https://{}{}'.format(domain,link))
        else:
            formatted_links.add(link)

    #filter internal links; re.escape keeps the dots in the domain literal
    formatted_internal_links=set()
    pattern='.{}.'.format(re.escape(domain))

    for link in formatted_links:
        if re.search(pattern,link):
            formatted_internal_links.add(link)

    return formatted_internal_links
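
As an alternative to building absolute URLs by hand, the standard library can resolve and compare them. The sketch below is one possible approach, not the notebook's method: `internal_links` is a hypothetical helper that uses `urljoin` to resolve relative links and compares the host from `urlparse` instead of matching a regular expression.

```python
from urllib.parse import urljoin, urlparse

def internal_links(base_url, hrefs):
    # Resolve each href against the page URL and keep same-host links only
    base_host = urlparse(base_url).netloc
    result = set()
    for href in hrefs:
        if not href:  # skip <a> tags without an href
            continue
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            result.add(absolute)
    return result

# Hypothetical hrefs as they might appear on a page
hrefs = ["/kariera", "kontakt", "https://www.rdrymarov.cz/cenik",
         "https://example.com/", "mailto:info@example.com", None]
print(sorted(internal_links("https://www.rdrymarov.cz/", hrefs)))
```

Comparing hosts exactly is stricter than the regular expression above: a link to a different subdomain would be treated as external.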

Access website www.rdrymarov.cz

In [18]:

#access_url returns (domain, soup, final url); call it once to avoid a second request
domain, soup, _ = access_url(page)
links=get_links(soup)
formatted=format_links(links,domain)

#remember links shown on the homepage
links_main_page=set(formatted)

#show internal links displayed on the homepage
links_main_page



{'https://www.rdrymarov.cz/',
 'https://www.rdrymarov.cz/cenik',
 'https://www.rdrymarov.cz/financovani',
 'https://www.rdrymarov.cz/fotogalerie',
 'https://www.rdrymarov.cz/kariera',
 'https://www.rdrymarov.cz/katalog-domu',
 'https://www.rdrymarov.cz/kontakt',
 'https://www.rdrymarov.cz/kubis',
 'https://www.rdrymarov.cz/largo',
 'https://www.rdrymarov.cz/mapa-stranek',
 'https://www.rdrymarov.cz/mini',
 'https://www.rdrymarov.cz/montovane-domy',
 'https://www.rdrymarov.cz/nasi-partneri',
 'https://www.rdrymarov.cz/nova',
 'https://www.rdrymarov.cz/novinka-ela-s-krystofem-si-navrhli-svuj-dum-snu',
 'https://www.rdrymarov.cz/novinka-reference-bydleni-v-bungalovu-largo-121',
 'https://www.rdrymarov.cz/novinka-rozhovor-o-dome-kubis-74-s-manzeli-monikou-a-vladimirem',
 'https://www.rdrymarov.cz/novinky-a-akce/projekty-rodinnych-domu-rd-rymarov',
 'https://www.rdrymarov.cz/novinky-a-akce/rd-magazin',
 'https://www.rdrymarov.cz/o-nas',
 'https://www.rdrymarov.cz/podminky-pouziti',
 'https:

In [19]:
#find a page with job postings
key_words=['kariera', 'prace', 'pozice', 'mista', 'zamestnani']

#stays None if no link contains a key word
career=None

for link in formatted:
    for key_word in key_words:
        if re.search('.{}'.format(key_word),link):
            career=link

#link to career page
career

'https://www.rdrymarov.cz/kariera'
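
Because `formatted` is a set, the loop above returns an arbitrary match when several links contain a key word. A deterministic variant might look like the sketch below; `strip_diacritics` and `find_career_link` are hypothetical helpers, and removing diacritics is an added assumption so that a URL such as `/kariéra` would also match the ASCII key words.

```python
import unicodedata

def strip_diacritics(text):
    # Decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def find_career_link(links, key_words):
    # Return the first link (in sorted order) containing a key word, else None
    for link in sorted(links):
        normalized = strip_diacritics(link.lower())
        if any(word in normalized for word in key_words):
            return link
    return None

sample_links = {
    "https://www.rdrymarov.cz/cenik",
    "https://www.rdrymarov.cz/kariera",
}
print(find_career_link(sample_links, ["kariera", "prace", "pozice"]))
```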

In [20]:
def links_to_visit(url,links_main_page,visited_pages):
    result=access_url(url)

    links=get_links(result[1])
    formatted=format_links(links,result[0])

    links=[]
    for link in formatted:
        if link not in visited_pages and link not in links_main_page:
            links.append(link)
            
    return links

Let us get links accessible from https://www.rdrymarov.cz/kariera. I assume that job postings should be available within two clicks from the webpage https://www.rdrymarov.cz/kariera.

In [22]:
#loop through links accessible from https://www.rdrymarov.cz/kariera

#pages to visit at the current level; start with the career page
pages_to_visit=[career]

#list of candidates for job postings
final_pages=[]

#follow internal links two levels deep
for level in range(2):
    next_pages=[]
    for page in pages_to_visit:
        new_links=links_to_visit(page, links_main_page,visited_pages)

        #add links to final_pages, skipping duplicates
        for link in new_links:
            if link not in final_pages:
                final_pages.append(link)
                next_pages.append(link)
    #links found at this level are visited at the next level
    pages_to_visit=next_pages

In [23]:
#candidates
final_pages

['https://www.rdrymarov.cz/montaznik-ridic-autojerabu-vazac-bremen',
 'https://www.rdrymarov.cz/montaznik-instalater-topenar',
 'https://www.rdrymarov.cz/pokryvac',
 'https://www.rdrymarov.cz/elektrikar',
 'https://www.rdrymarov.cz/tesar',
 'https://www.rdrymarov.cz/instalater-topenar',
 'https://www.rdrymarov.cz/nova-vyrobni-linka',
 'https://www.rdrymarov.cz/montaznik-malir',
 'http://www.rdrymarov.cz/novinky-a-akce/projekty-rd-rymarov',
 'https://www.rdrymarov.cz/delnik-vyroby-domu-stolar',
 'https://www.rdrymarov.cz/stavebni-elektrikar',
 'https://www.rdrymarov.cz/instalater-vodovodu']

In [24]:
#initialize set of postings
scraped_postings=set()

for link in final_pages:
    #only the soup (second item returned by access_url) is needed here
    page_text=access_url(link)[1].get_text()

    #evaluate once if the content is a job posting
    is_posting=job_posting(page_text,corpus_words)
    print(is_posting,link)

    #if True, then add content to the set of postings
    if is_posting:
        scraped_postings.add(page_text)

True https://www.rdrymarov.cz/montaznik-ridic-autojerabu-vazac-bremen
True https://www.rdrymarov.cz/montaznik-instalater-topenar
True https://www.rdrymarov.cz/pokryvac
True https://www.rdrymarov.cz/elektrikar
True https://www.rdrymarov.cz/tesar
True https://www.rdrymarov.cz/instalater-topenar
False https://www.rdrymarov.cz/nova-vyrobni-linka
True https://www.rdrymarov.cz/montaznik-malir
False http://www.rdrymarov.cz/novinky-a-akce/projekty-rd-rymarov
True https://www.rdrymarov.cz/delnik-vyroby-domu-stolar
True https://www.rdrymarov.cz/stavebni-elektrikar
True https://www.rdrymarov.cz/instalater-vodovodu


In [None]:
#print job postings

for item in scraped_postings:
    print(item)

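This toy model successfully detected the job postings on the test website: 10 of the 12 candidate pages were classified as postings, and the two rejected URLs look like news pages rather than postings.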

To be improved:

1. Generalization to websites with job postings on external pages

2. A larger corpus of job postings for better evaluation

3. A more sophisticated evaluation method (cluster URLs to find those corresponding to postings? consider n-grams and collocations?)

4. Use the Selenium library for certain tasks?
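
As a starting point for the n-gram idea in item 3, frequent bigrams can be counted with the standard library alone. This is a minimal sketch on made-up tokens; `top_bigrams` is a hypothetical helper, and NLTK offers its own collocation tools that may be a better fit for a real corpus.

```python
from collections import Counter

def top_bigrams(tokens, n):
    # Pair each token with its successor and count the pairs
    pairs = zip(tokens, tokens[1:])
    return [pair for pair, _ in Counter(pairs).most_common(n)]

# Hypothetical preprocessed tokens
sample_tokens = "plný úvazek práce na plný úvazek".split()
print(top_bigrams(sample_tokens, 1))
```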

Is it legal to scrape any site, and how should anti-scraping software be dealt with?
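
One small, partial step towards the first question is to check a site's `robots.txt` before crawling it. The sketch below uses Python's `urllib.robotparser` on a hypothetical rule set (the rules, the `toy-scraper` user agent and the `example.com` URLs are all made up). It only shows which paths the site operator asks crawlers to avoid; it does not settle whether scraping is legal.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download the site's own file
robots_lines = [
    "User-agent: *",
    "Disallow: /admin/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch reports whether the given user agent may request the URL
allowed = parser.can_fetch("toy-scraper", "https://example.com/kariera")
blocked = parser.can_fetch("toy-scraper", "https://example.com/admin/users")
print(allowed, blocked)
```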


