# Job Scraping on LinkedIn with BeautifulSoup, Selenium and Natural Language Processing

Recently, a friend asked me to help him automate his job search using Python. He wanted me to rank the jobs by how likely he was to be called in for an interview. This meant I had to collect the description of every job, find the keywords in these descriptions, and check whether they appear in his resume. Let's start.

### Part 1: Getting the information from LinkedIn

Let us start by importing all the necessary libraries. For this project we are mainly going to use `selenium` and `BeautifulSoup`.


In [None]:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
from time import time
import os 

Let us change the working directory to the folder where you are going to save all your files.

In [None]:
start_time = time()
os.chdir(r"C:\Users\Aakash\Desktop\AAKASH\Coding Stuff\Python\Project\Linkedin Project")

Let us define the URL to search from.

In [None]:
url = 'https://www.linkedin.com/jobs/search/?keywords=senior%20electrical%20engineer'

Install the webdriver for your preferred browser and make sure you save it where your code is located.

In [None]:
driver = webdriver.Chrome()
driver.get(url)
sleep(3)
action = ActionChains(driver)

LinkedIn uses lazy loading. This means that content is not loaded until it is needed. In the case of LinkedIn, we need to scroll down to load more jobs. We will use a for loop to scroll down multiple times.

In [None]:
for scroll in range(0, 20):
    sleep(2) # Has time to load
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
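
A fixed number of scrolls can stop too early or keep scrolling after the list has ended. Below is a minimal sketch of an alternative, assuming the page height stops growing once no more jobs are loaded. The names `scroll_until_stable`, `pause` and `max_scrolls` are hypothetical; the only driver method it relies on is `execute_script`, which the loop above already uses.

```python
from time import sleep

def scroll_until_stable(driver, pause=2.0, max_scrolls=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `driver` is any object with an
    execute_script method, such as a Selenium WebDriver.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        scrolls += 1
        sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        last_height = new_height
    return scrolls
```

It could be called as `scroll_until_stable(driver)` in place of the for loop; `max_scrolls` still caps the number of scrolls if the page keeps growing.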

Next, let us get the HTML source of the page from the driver and parse it with `BeautifulSoup`, so we can find the links to the job pages.

In [None]:
source = driver.page_source
soup = BeautifulSoup(source, 'lxml')

We are going to find the links using `.find_all`, passing the class name of the `a` tag so that we only pick up the correct links. We append the links to a list so we can visit them later.

In [None]:
job_link = []

for a in soup.find_all('a', 'base-card__full-link',href=True):
    job_link.append(a['href'])
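
The collected links may carry query strings, and the same posting could appear more than once. That is an assumption about the search page, not something verified here. The sketch below shows one way to normalise and de-duplicate such a list with the standard library; `clean_job_links` and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def clean_job_links(hrefs):
    """Drop query strings and fragments, then remove duplicates in order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        parts = urlsplit(href)
        bare = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if bare not in seen:
            seen.add(bare)
            cleaned.append(bare)
    return cleaned

# Hypothetical example links, not real postings
sample = [
    "https://www.linkedin.com/jobs/view/123?refId=abc",
    "https://www.linkedin.com/jobs/view/123?refId=xyz",
    "https://www.linkedin.com/jobs/view/456",
]
print(clean_job_links(sample))
```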

Let us visit the pages we extracted before and get all the information we need. To find each piece of information, we look up its tag by class name. We use `sleep(2)` so that the code waits a few seconds between requests, to reduce the chance of being classified as a bot by the servers. Running this code only gives us the tags, so later we need to extract the text from them. The first parameter of `soup.find()` tells us which tag we extract; in the case of `raw_job_title` it is `h1`.

In [None]:
# Lists that collect the raw tags found on each job page
raw_job_title = []
raw_company_name = []
raw_location = []
raw_job_description = []
raw_level = []
raw_function = []

for x in range(0, len(job_link)):
    driver.get(job_link[x])
    sleep(2)
      
    job_source = driver.page_source
    soup = BeautifulSoup(job_source, 'lxml')
    
    raw_job_title.append(soup.find('h1', class_='top-card-layout__title topcard__title'))
    raw_company_name.append(soup.find('a', class_ = 'topcard__org-name-link topcard__flavor--black-link'))
    raw_location.append(soup.find('span', class_='topcard__flavor topcard__flavor--bullet'))
    raw_job_description.append(soup.find('div', class_='show-more-less-html__markup show-more-less-html__markup--clamp-after-5'))
    raw_level.append(soup.find('span', class_= "description__job-criteria-text description__job-criteria-text--criteria")) 
    # Note: same selector as raw_level, and find() returns only the first match
    raw_function.append(soup.find('span',class_= 'description__job-criteria-text description__job-criteria-text--criteria'))
    
    sleep(2)

Once we are done using the driver, let us close the driver.

In [None]:
driver.close()

Make sure the information was stored correctly. The length of all the lists should be the same.

In [None]:
len(job_link)
len(raw_job_title)
len(raw_company_name)
len(raw_location)
len(raw_job_description)
len(raw_level)
len(raw_function)
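
In a notebook, only the value of the last expression in a cell is displayed, so the cell above shows just `len(raw_function)`. A small helper can print every length and report whether they agree. This is a sketch; `lengths_match` is a hypothetical name and the lists in the example are placeholders.

```python
def lengths_match(**named_lists):
    """Print the length of each list and return True if they all agree."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    for name, length in lengths.items():
        print(f"{name}: {length}")
    return len(set(lengths.values())) <= 1

# Placeholder data for illustration
print(lengths_match(links=["a", "b"], titles=["x", "y"]))
```

With the real data it would be called as `lengths_match(job_link=job_link, raw_job_title=raw_job_title, ...)`.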

Let us strip the tags, leaving only the text, so that we can carry on with our natural language processing.

In [None]:
# Lists that hold the plain text of each field
job_title, company_name, location = [], [], []
job_description, level, function = [], [], []

for y in range(0, len(job_link)):
    job_title.append(raw_job_title[y].text)
    company_name.append(raw_company_name[y].text)
    location.append(raw_location[y].text)
    level.append(raw_level[y].text)
    function.append(raw_function[y].text)
    job_description.append(raw_job_description[y].get_text())
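
`soup.find()` returns `None` when no tag matches, and calling `.text` on `None` raises an `AttributeError`. If a page is missing one of the fields, a small guard keeps the loop running. The helper below is a sketch; `safe_text` and its `default` parameter are hypothetical names, and stripping the surrounding whitespace is an extra step not present in the loop above.

```python
def safe_text(tag, default=""):
    """Return the stripped text of a tag, or `default` if the tag is None."""
    if tag is None:
        return default
    return tag.get_text().strip()
```

For example, `job_title.append(safe_text(raw_job_title[y]))` would store an empty string instead of failing when the title tag was not found.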

Once again let us make sure the information was stored correctly. The length of all the lists should be the same

In [None]:
len(job_title)
len(company_name)
len(location)
len(job_description)
len(level)
len(function)

### Part 2: Finding keywords and ranking jobs

Let us start by testing some commonly used keyword detection models on our resume. We are going to test four of them: `Rake`, `gensim`, `yake` and `spacy`. I did consider other approaches, such as a deep learning model like an `LSTM`, but I didn't have enough training and testing data to build one.

In [None]:
resume_file = open('resume.txt', 'r', encoding='utf-8')
resume = resume_file.read()
resume_file.close()

In [None]:
from rake_nltk import Rake

rake_nltk_var = Rake()
rake_nltk_var.extract_keywords_from_text(resume)
keyword_extracted = rake_nltk_var.get_ranked_phrases()


In [None]:
# Note: gensim.summarization was removed in Gensim 4.0, so this cell needs an older gensim
from gensim.summarization import keywords

keyword = keywords(resume)
result = keyword.split()


In [None]:
import en_core_web_md
import en_core_web_lg

nlp_md = en_core_web_md.load()
nlp_lg = en_core_web_lg.load()

doc_md = nlp_md(resume)
result_md = list(doc_md.ents)

doc_lg = nlp_lg(resume)
result_lg = list(doc_lg.ents)


In [None]:
import yake

language = 'en'
max_ngram_size = 3
deduplication_threshold = 0.9
numOfKeywords = 30

custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
keywords = custom_kw_extractor.extract_keywords(resume)

Out of the four models tested, I think the Rake-nltk and yake models are the best. We will still use the spacy library later on to find proper nouns, so that words like company names don't end up in our keyword list.

In [None]:
import re
import yake
import spacy
from rake_nltk import Rake
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from statistics import mean

# spaCy pipeline used below to tag proper nouns
nlp = spacy.load('en_core_web_lg')

Let us continue by removing numbers and special characters from the job descriptions and the resume, and by converting everything to lower case so that keyword matching is not case sensitive.

In [None]:
for x in range(0, len(job_description)):
    job_description[x] = re.sub("[^A-Za-z" "]+"," ",job_description[x]).lower()
    job_description[x] = re.sub("[0-9" "]+"," ", job_description[x])

In [None]:
resume = re.sub("[^A-Za-z" "]+"," ",resume).lower()
resume = re.sub("[0-9" "]+"," ", resume)
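
To see what the cleaning does, note that Python joins adjacent string literals, so `"[^A-Za-z" "]+"` is the same pattern as `"[^A-Za-z]+"`: every run of non-letters becomes a single space. Digits are already removed by that first substitution, so the second one has nothing left to replace. The short example below, with a made-up input string and a hypothetical `clean_text` helper, shows the effect.

```python
import re

def clean_text(text):
    """Replace every run of non-letters with one space and lower-case the rest."""
    return re.sub("[^A-Za-z]+", " ", text).lower()

# Made-up sample text
print(clean_text("Senior Engineer (5+ years), C++/Python!"))
# prints "senior engineer years c python " (with a trailing space)
```

One side effect worth knowing: terms such as "C++" are reduced to a single letter.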

Let us load a text file containing the words of the English language, so our model doesn't pick up a word that doesn't exist in English.

In [None]:
dict_file = open('dict.txt', 'r', encoding='utf-8')
dict = dict_file.read()
dict_file.close()

We also have to remove all capitalization, special characters and numbers from the dict file.

In [None]:
raw_dict = re.sub("[^A-Za-z" "]+"," ",dict).lower()
raw_dict = re.sub("[0-9" "]+"," ", raw_dict)

We are also going to remove stop words from the dictionary, so words like 'I', 'you' and 'about' will be removed. To do this we tokenize the text, which splits it into individual words, so we can check whether each one is in the stop word list. For the list itself we use the stop word list provided by the `nltk` library.

In [None]:
stop_words = set(stopwords.words('english'))
dict_tokens = word_tokenize(raw_dict)

dict = []
for w in dict_tokens:
    if w not in stop_words:
        dict.append(w)
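
The filtered dictionary is later used for many `w in dict` checks. On a list each check scans the entries one by one, while a set gives a constant-time lookup on average and also drops duplicates. The sketch below uses a made-up miniature word list and stop word list; the names `raw_words`, `stop_words_demo` and `vocabulary` are hypothetical.

```python
# Hypothetical miniature word list and stop word list, for illustration only
raw_words = "the engineer designs a circuit and the engineer tests it".split()
stop_words_demo = {"the", "a", "and", "it"}

# A set removes duplicates and makes `word in vocabulary` a fast lookup
vocabulary = {w for w in raw_words if w not in stop_words_demo}

print(sorted(vocabulary))
```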

Now that we have prepared all the necessary items, we can find the keywords of the job descriptions we extracted.

In [None]:
language = 'en'
max_ngram_size = 3
deduplication_threshold = 0.5
numOfKeywords = 30

custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
rake_nltk_var = Rake()

raw_yake_job_keyword = []
rake_job_keyword = []

for z in range(0, len(job_link)):
    raw_yake_job_keyword.append(list(custom_kw_extractor.extract_keywords(job_description[z])))
    rake_nltk_var.extract_keywords_from_text(job_description[z])
    rake_job_keyword.append(rake_nltk_var.get_ranked_phrases())

Now we have found the keywords in the job descriptions. Next, we need to find the synonyms of the keywords, as our resume might contain a keyword but worded differently. Without the synonym finder, such jobs would be scored too low.

In the code below, we first clean each keyword phrase, because a phrase returned by the models can contain several words. We split the phrase into words and keep a word only if it is not a proper noun (such as a name or a place), is an actual English word, and is not a stop word. Then the code looks up the synonyms of each remaining word and replaces the underscores that WordNet uses in multi-word synonyms with spaces.

In [None]:
for list_num in range(0, len(rake_job_keyword)):
    
    filtered_sentence = []
    for word_num in range(0, len(rake_job_keyword[list_num])):
        keyword_phrase = rake_job_keyword[list_num][word_num]
        word_tokens = word_tokenize(keyword_phrase)
        
        filter_sentence = []
        for w in word_tokens:
            
            doc = nlp(w)
            if doc[0].tag_ == 'NNP':
                proper = True
            else:
                proper = False
            
            if w in dict:
                in_dict = True
            else:
                in_dict = False
                
            if w not in stop_words and proper == False and in_dict == True:
                filter_sentence.append(w)
            
        filtered_sentence.append(filter_sentence)
    
   
    # Build one synonym list per keyword phrase so the lists stay aligned
    synonyms = []
    for filtered_num in range(0, len(filtered_sentence)):
        phrase_synonym = []
        for filtered_word in range(0, len(filtered_sentence[filtered_num])):
            for syn in wordnet.synsets(filtered_sentence[filtered_num][filtered_word]):
                for l in syn.lemmas():
                    # WordNet joins multi-word lemmas with underscores
                    phrase_synonym.append(l.name().replace("_", " "))
        synonyms.append(phrase_synonym)
    for num in range(0, len(rake_job_keyword[list_num])):
        rake_job_keyword[list_num][num] = synonyms[num]
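
The last loop of the cell assigns one synonym list to each keyword phrase, so the synonym list needs exactly one entry per phrase, even when every word of a phrase was filtered out. The sketch below illustrates that shape with a tiny hand-written table standing in for WordNet; `SYNONYMS` and `expand_phrases` are hypothetical.

```python
# Hypothetical stand-in for WordNet: a tiny hand-written synonym table
SYNONYMS = {
    "design": ["design", "plan"],
    "circuit": ["circuit", "electrical circuit"],
}

def expand_phrases(phrases, synonym_table):
    """Return one list of candidate words per phrase, in the same order."""
    expanded = []
    for phrase in phrases:
        candidates = []
        for word in phrase.split():
            candidates.extend(synonym_table.get(word, []))
        expanded.append(candidates)  # may be empty, but keeps the slot
    return expanded

result = expand_phrases(["design circuit", "zzz", "circuit"], SYNONYMS)
print(result)
```

The phrase "zzz" has no entry in the table, so it keeps an empty list and the positions of the other phrases are not shifted.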

It is a little different for the `yake` model, because `yake` returns each keyword together with a float score, so we first have to copy the keywords into a separate list without the score.

In [None]:
yake_job_keyword = []
for f in range(0, len(job_link)):
    yake_job = []
    for word in raw_yake_job_keyword[f]:
        yake_job.append(word[0])
    yake_job_keyword.append(yake_job)
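
The same extraction can be written as a list comprehension. The sample data below is made up; it only mimics the `(keyword, score)` pairs that the cell above reads with `word[0]`. In YAKE, a lower score means a more relevant keyword.

```python
# Made-up YAKE-style output: (keyword, score) pairs, lower score = more relevant
raw_keywords = [("circuit design", 0.02), ("power systems", 0.05), ("team", 0.31)]

# Keep only the keyword text and drop the score
keywords_only = [keyword for keyword, _score in raw_keywords]
print(keywords_only)
```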

In [None]:
for list_num in range(0, len(yake_job_keyword)):
    
    filtered_sentence = []
    for word_num in range(0, len(yake_job_keyword[list_num])):
        keyword_phrase = yake_job_keyword[list_num][word_num]
        word_tokens = word_tokenize(keyword_phrase)
        
        filter_sentence = []
        for w in word_tokens:
            
            doc = nlp(w)
            if doc[0].tag_ == 'NNP':
                proper = True
            else:
                proper = False
            
            if w in dict:
                in_dict = True
            else:
                in_dict = False
                
            if w not in stop_words and proper == False and in_dict == True:
                filter_sentence.append(w)
            
        filtered_sentence.append(filter_sentence)
    
   
    synonyms = []
    for filtered_num in range(0, len(filtered_sentence)):
        for filtered_word in range(0, len(filtered_sentence[filtered_num])):
            word_synonym = []
            for syn in wordnet.synsets(filtered_sentence[filtered_num][filtered_word]):
                for l in syn.lemmas():
                    syn_word = l.name()
                    try:
                        syn_word = syn_word.replace("_", " ")
                    except:
                        pass
                    word_synonym.append(syn_word)
            synonyms.append(word_synonym)
    for num in range(0, len(yake_job_keyword[list_num])):
        yake_job_keyword[list_num][num] = synonyms[num]

We now have to calculate the percentage of keywords that are in the resume. We will define a function called `calculate_percentage` that will calculate the percentage given the two parameters `point` and `length`

In [None]:
def calculate_percentage(point, length):
    decimal = point / length
    return  decimal*100
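
`calculate_percentage` divides by `length`, which raises a `ZeroDivisionError` if a job description produced no keywords at all. A guarded variant could look like the sketch below; `safe_percentage` is a hypothetical name, and returning `0.0` for an empty keyword list is a design choice rather than part of the original code.

```python
def safe_percentage(point, length):
    """Percentage of matched keywords; 0.0 when there are no keywords at all."""
    if length == 0:
        return 0.0
    return point / length * 100
```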

In [None]:
yake_percent = []
rake_percent = []
avg_percent = []

In [None]:
for q in range(0, len(job_link)):
    rake_point = 0
    for keyword in rake_job_keyword[q]:
        for key in keyword:
            if key in resume:
                rake_point += 1
                break
    rake_percent.append(calculate_percentage(rake_point, len(rake_job_keyword[q])))
    
for q in range(0, len(job_link)):
    yake_point = 0
    for keyword in yake_job_keyword[q]:
        for key in keyword:
            if key in resume:
                yake_point += 1 
                break
    yake_percent.append(calculate_percentage(yake_point, len(yake_job_keyword[q])))
    
for g in range(0, len(job_link)):
    avg_percent.append(mean([yake_percent[g], rake_percent[g]]))
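
Because `resume` is a string, `key in resume` is a substring test, so a short synonym such as "art" also matches inside "smart". If that turns out to inflate the scores, a whole-word check is one option. The sketch below assumes the text has already been cleaned to lower-case letters and spaces; `contains_word` and `resume_demo` are hypothetical.

```python
def contains_word(text, candidate):
    """True if `candidate` occurs in `text` as whole word(s), not as a substring."""
    if not candidate.split():
        return False  # an empty candidate never counts as a match
    padded_text = " " + " ".join(text.split()) + " "
    padded_candidate = " " + " ".join(candidate.split()) + " "
    return padded_candidate in padded_text

resume_demo = "smart electrical engineer with circuit design experience"
print("art" in resume_demo)                          # True: substring of "smart"
print(contains_word(resume_demo, "art"))             # False
print(contains_word(resume_demo, "circuit design"))  # True
```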

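Now we can put the results into a DataFrame. Here `data_dict` is a dictionary that maps each column name to one of the lists built above; it is written out in full inside the function in Part 3. For ease of use, we sort the rows by `avg_percent`. Note that `sort_values` sorts in ascending order by default, so the best matches end up at the bottom of the table.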

In [None]:
data = pd.DataFrame(data_dict)
data = data.sort_values(by = 'avg_percent')

###  Part 3: Making it a Function

Now that we have the code, let us make it easy for the user by wrapping it in a function that can be called with the necessary parameters. Let us also add some custom exceptions in case the function is not given the correct parameters.

In [None]:
def job_find(url, scroll_num , model_name = 'both'):
    import os 
    import re
    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from time import sleep
    from time import time
    import yake
    import spacy
    from rake_nltk import Rake
    from nltk.corpus import wordnet
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from statistics import mean
    
    start_time = time()
    os.chdir(r"C:\Users\Aakash\Desktop\AAKASH\Coding Stuff\Python\Project\Linkedin Project")
    
    nlp = spacy.load('en_core_web_lg')

    driver = webdriver.Chrome()
    driver.get(url)
    sleep(3)
    action = ActionChains(driver)

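    from selenium.webdriver.common.by import By

    for scroll in range(0, scroll_num):
        try:
            # Newer Selenium 4 releases removed the find_element_by_* helpers
            element = driver.find_element(By.LINK_TEXT, "See more jobs")
            action.click(element).perform()
        except Exception:
            # No clickable "See more jobs" button yet, so scroll to load more jobs
            sleep(2) # Has time to load
            driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")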

    source = driver.page_source
    soup = BeautifulSoup(source, 'lxml')

    job_link = []
    
    for a in soup.find_all('a', 'base-card__full-link',href=True):
        job_link.append(a['href'])
        
    class WebLinkError(Exception):
        def __init__(self, link, message='is an invalid LinkedIn URL!'):
            self.link = link
            self.message = message
            super().__init__(self.message)
        def __str__(self):
            return f'{self.link} {self.message}'
        
    class ModelError(Exception):
        def __init__(self, model ,message = 'is not a valid model. Use rake model or yake model'):
            self.model = model
            self.message = message
            super().__init__(self.message)
        def __str__(self):
            return f"{self.model} {self.message}"
    def calculate_percentage(num, length):
        decimal = num / length
        return  decimal*100
        

    raw_job_title = []
    raw_company_name = []
    raw_location = []
    raw_job_description = []
    raw_level = []
    raw_function = []

    for x in range(0, len(job_link)):
        driver.get(job_link[x])
        sleep(2)
        
        url_count = 0 
        url_flag = True
        while url_flag == True:
            current_url = driver.current_url
            if current_url != job_link[x]:
                driver.get(job_link[x])
                url_count += 1
            else: 
                url_flag = False
            if url_count == 10:
                raise WebLinkError(job_link[x])
        
        job_source = driver.page_source
        soup = BeautifulSoup(job_source, 'lxml')
        
        raw_job_title.append(soup.find('h1', class_='top-card-layout__title topcard__title'))
        raw_company_name.append(soup.find('a', class_ = 'topcard__org-name-link topcard__flavor--black-link'))
        raw_location.append(soup.find('span', class_='topcard__flavor topcard__flavor--bullet'))
        raw_job_description.append(soup.find('div', class_='show-more-less-html__markup show-more-less-html__markup--clamp-after-5'))
        raw_level.append(soup.find('span', class_= "description__job-criteria-text description__job-criteria-text--criteria")) 
        raw_function.append(soup.find('span',class_= 'description__job-criteria-text description__job-criteria-text--criteria'))
        
        sleep(2)
        
    driver.close()

    if len(job_link) == len(raw_job_title) == len(raw_company_name) == len(raw_location) == len(raw_job_description) == len(raw_level) ==len(raw_function):
        pass
    else:
        raise ValueError("Length of job information list is mismatched.")
    
    job_title = []
    company_name = []
    location = []
    job_description = []
    level = []
    function = []

    for y in range(0, len(job_link)):
        job_title.append(raw_job_title[y].text)
        company_name.append(raw_company_name[y].text)
        location.append(raw_location[y].text)
        level.append(raw_level[y].text)
        function.append(raw_function[y].text)
        job_description.append(raw_job_description[y].get_text())

    if len(job_link) == len(job_title) == len(company_name) == len(location) == len(job_description) == len(level) ==len(function):
        pass
    else:
        raise ValueError("Length of job information list is mismatched.")
    
    for x in range(0, len(job_description)):
        job_description[x] = re.sub("[^A-Za-z" "]+"," ",job_description[x]).lower()
        job_description[x] = re.sub("[0-9" "]+"," ", job_description[x])
    
    resume_file = open('resume.txt', 'r', encoding='utf-8')
    resume = resume_file.read()
    resume_file.close()
    
    resume = re.sub("[^A-Za-z" "]+"," ",resume).lower()
    resume = re.sub("[0-9" "]+"," ", resume)
    
    dict_file = open('dict.txt', 'r', encoding='utf-8')
    dict = dict_file.read()
    dict_file.close()
    
    raw_dict = re.sub("[^A-Za-z" "]+"," ",dict).lower()
    raw_dict = re.sub("[0-9" "]+"," ", raw_dict)
    
    stop_words = set(stopwords.words('english'))
    dict_tokens = word_tokenize(raw_dict)

    dict = []
    for w in dict_tokens:
        if w not in stop_words:
            dict.append(w)
    
    if model_name == 'both':
    
        language = 'en'
        max_ngram_size = 3
        deduplication_threshold = 0.5
        numOfKeywords = 30
    
        custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
        rake_nltk_var = Rake()
        
        raw_yake_job_keyword = []
        rake_job_keyword = []
        
        for z in range(0, len(job_link)):
            raw_yake_job_keyword.append(list(custom_kw_extractor.extract_keywords(job_description[z])))
            rake_nltk_var.extract_keywords_from_text(job_description[z])
            rake_job_keyword.append(rake_nltk_var.get_ranked_phrases())
        
        for list_num in range(0, len(rake_job_keyword)):
            
            filtered_sentence = []
            for word_num in range(0, len(rake_job_keyword[list_num])):
                keyword_phrase = rake_job_keyword[list_num][word_num]
                word_tokens = word_tokenize(keyword_phrase)
                
                filter_sentence = []
                for w in word_tokens:
                    
                    doc = nlp(w)
                    if doc[0].tag_ == 'NNP':
                        proper = True
                    else:
                        proper = False
                    
                    if w in dict:
                        in_dict = True
                    else:
                        in_dict = False
                        
                    if w not in stop_words and proper == False and in_dict == True:
                        filter_sentence.append(w)
                    
                filtered_sentence.append(filter_sentence)
            
           
            # Build one synonym list per keyword phrase so the lists stay aligned
            synonyms = []
            for filtered_num in range(0, len(filtered_sentence)):
                phrase_synonym = []
                for filtered_word in range(0, len(filtered_sentence[filtered_num])):
                    for syn in wordnet.synsets(filtered_sentence[filtered_num][filtered_word]):
                        for l in syn.lemmas():
                            # WordNet joins multi-word lemmas with underscores
                            phrase_synonym.append(l.name().replace("_", " "))
                synonyms.append(phrase_synonym)
            for num in range(0, len(rake_job_keyword[list_num])):
                rake_job_keyword[list_num][num] = synonyms[num]
        
        
        yake_job_keyword = []
        for f in range(0, len(job_link)):
            yake_job = []
            for word in raw_yake_job_keyword[f]:
                yake_job.append(word[0])
            yake_job_keyword.append(yake_job)
            
        for list_num in range(0, len(yake_job_keyword)):
            
            filtered_sentence = []
            for word_num in range(0, len(yake_job_keyword[list_num])):
                keyword_phrase = yake_job_keyword[list_num][word_num]
                word_tokens = word_tokenize(keyword_phrase)
                
                filter_sentence = []
                for w in word_tokens:
                    
                    doc = nlp(w)
                    if doc[0].tag_ == 'NNP':
                        proper = True
                    else:
                        proper = False
                    
                    if w in dict:
                        in_dict = True
                    else:
                        in_dict = False
                        
                    if w not in stop_words and proper == False and in_dict == True:
                        filter_sentence.append(w)
                    
                filtered_sentence.append(filter_sentence)
            
           
            # Build one synonym list per keyword phrase so the lists stay aligned
            synonyms = []
            for filtered_num in range(0, len(filtered_sentence)):
                phrase_synonym = []
                for filtered_word in range(0, len(filtered_sentence[filtered_num])):
                    for syn in wordnet.synsets(filtered_sentence[filtered_num][filtered_word]):
                        for l in syn.lemmas():
                            # WordNet joins multi-word lemmas with underscores
                            phrase_synonym.append(l.name().replace("_", " "))
                synonyms.append(phrase_synonym)
            for num in range(0, len(yake_job_keyword[list_num])):
                yake_job_keyword[list_num][num] = synonyms[num]

        yake_percent = []
        rake_percent = []
        avg_percent = []
        
        for q in range(0, len(job_link)):
            rake_point = 0
            for keyword in rake_job_keyword[q]:
                for key in keyword:
                    if key in resume:
                        rake_point += 1
                        break
            rake_percent.append(calculate_percentage(rake_point, len(rake_job_keyword[q])))
            
        for q in range(0, len(job_link)):
            yake_point = 0
            for keyword in yake_job_keyword[q]:
                for key in keyword:
                    if key in resume:
                        yake_point += 1 
                        break
            yake_percent.append(calculate_percentage(yake_point, len(yake_job_keyword[q])))
            
        for g in range(0, len(job_link)):
            avg_percent.append(mean([yake_percent[g], rake_percent[g]]))
            
        data_dict = {
            'link':job_link,
            'job_title' : job_title,
            'company_name':company_name,
            'location': location,
            'job_description' : job_description,
            'yake_keywords' : yake_job_keyword,
            'rake_keywords' : rake_job_keyword,
            'level':level,
            'function':function,
            'yake_percent': yake_percent,
            'rake_percent': rake_percent,
            'avg_percent': avg_percent}
        
        data = pd.DataFrame(data_dict)
        data = data.sort_values(by = 'avg_percent')
        return data
        
    elif model_name == 'rake':
        rake_nltk_var = Rake()
        
        rake_job_keyword = []
        
        for z in range(0, len(job_link)):
            rake_nltk_var.extract_keywords_from_text(job_description[z])
            rake_job_keyword.append(rake_nltk_var.get_ranked_phrases())
        
        for list_num in range(0, len(rake_job_keyword)):
            
            filtered_sentence = []
            for word_num in range(0, len(rake_job_keyword[list_num])):
                keyword_phrase = rake_job_keyword[list_num][word_num]
                word_tokens = word_tokenize(keyword_phrase)
                
                filter_sentence = []
                for w in word_tokens:
                    
                    doc = nlp(w)
                    if doc[0].tag_ == 'NNP':
                        proper = True
                    else:
                        proper = False
                    
                    if w in dict:
                        in_dict = True
                    else:
                        in_dict = False
                        
                    if w not in stop_words and proper == False and in_dict == True:
                        filter_sentence.append(w)
                    
                filtered_sentence.append(filter_sentence)
                   
            # Build one synonym list per keyword phrase so the lists stay aligned
            synonyms = []
            for filtered_num in range(0, len(filtered_sentence)):
                phrase_synonym = []
                for filtered_word in range(0, len(filtered_sentence[filtered_num])):
                    for syn in wordnet.synsets(filtered_sentence[filtered_num][filtered_word]):
                        for l in syn.lemmas():
                            # WordNet joins multi-word lemmas with underscores
                            phrase_synonym.append(l.name().replace("_", " "))
                synonyms.append(phrase_synonym)
            for num in range(0, len(rake_job_keyword[list_num])):
                rake_job_keyword[list_num][num] = synonyms[num]

        # Created once, outside the loop above
        rake_percent = []
        
        for q in range(0, len(job_link)):
            rake_point = 0
            for keyword in rake_job_keyword[q]:
                for key in keyword:
                    if key in resume:
                        rake_point += 1
                        break
            rake_percent.append(calculate_percentage(rake_point, len(rake_job_keyword[q])))

        data_dict = {
            'link':job_link,
            'job_title' : job_title,
            'company_name':company_name,
            'location': location,
            'job_description' : job_description,
            'rake_keywords' : rake_job_keyword,
            'level':level,
            'function':function,
            'rake_percent': rake_percent}
        
        data = pd.DataFrame(data_dict)
        data = data.sort_values(by = 'rake_percent')
        return data
                
    elif model_name == 'yake':
        language = 'en'
        max_ngram_size = 3
        deduplication_threshold = 0.5
        numOfKeywords = 30
    
        custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
        rake_nltk_var = Rake()
        
        raw_yake_job_keyword = []
        
        for z in range(0, len(job_link)):
            raw_yake_job_keyword.append(list(custom_kw_extractor.extract_keywords(job_description[z])))
         
        yake_job_keyword = []
        for f in range(0, len(job_link)):
            yake_job = []
            for word in raw_yake_job_keyword[f]:
                yake_job.append(word[0])
            yake_job_keyword.append(yake_job)
            
        for list_num in range(0, len(yake_job_keyword)):
            
            filtered_sentence = []
            for word_num in range(0, len(yake_job_keyword[list_num])):
                keyword_phrase = yake_job_keyword[list_num][word_num]
                word_tokens = word_tokenize(keyword_phrase)
                
                filter_sentence = []
                for w in word_tokens:
                    
                    doc = nlp(w)
                    if doc[0].tag_ == 'NNP':
                        proper = True
                    else:
                        proper = False
                    
                    if w in dict:
                        in_dict = True
                    else:
                        in_dict = False
                        
                    if w not in stop_words and proper == False and in_dict == True:
                        filter_sentence.append(w)
                    
                filtered_sentence.append(filter_sentence)
            
           
            # Build one synonym list per keyword phrase so the lists stay aligned
            synonyms = []
            for filtered_num in range(0, len(filtered_sentence)):
                phrase_synonym = []
                for filtered_word in range(0, len(filtered_sentence[filtered_num])):
                    for syn in wordnet.synsets(filtered_sentence[filtered_num][filtered_word]):
                        for l in syn.lemmas():
                            # WordNet joins multi-word lemmas with underscores
                            phrase_synonym.append(l.name().replace("_", " "))
                synonyms.append(phrase_synonym)
            for num in range(0, len(yake_job_keyword[list_num])):
                yake_job_keyword[list_num][num] = synonyms[num]
        
        yake_percent = []
        avg_percent = []
            
        for q in range(0, len(job_link)):
            yake_point = 0
            for keyword in yake_job_keyword[q]:
                for key in keyword:
                    if key in resume:
                        yake_point += 1 
                        break
            yake_percent.append(calculate_percentage(yake_point, len(yake_job_keyword[q])))
            
        data_dict = {
            'link':job_link,
            'job_title' : job_title,
            'company_name':company_name,
            'location': location,
            'job_description' : job_description,
            'yake_keywords' : yake_job_keyword,
            'level':level,
            'function':function,
            'yake_percent': yake_percent}
        
        data = pd.DataFrame(data_dict)
        data = data.sort_values(by= 'yake_percent')
        return data   
    else:
        raise ModelError(model_name)
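
In `job_find`, an unsupported `model_name` is only rejected in the final `else`, after the browser has already visited every posting. Checking the argument before any scraping starts avoids that wasted run. The sketch below uses the built-in `ValueError` to stay self-contained; `VALID_MODELS` and `check_model_name` are hypothetical names, and inside the function the same check could raise `ModelError` instead.

```python
VALID_MODELS = ("both", "rake", "yake")

def check_model_name(model_name):
    """Raise ValueError right away if the model name is not supported."""
    if model_name not in VALID_MODELS:
        raise ValueError(
            f"{model_name!r} is not a valid model. Use one of {VALID_MODELS}."
        )
    return model_name
```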

The full code is in my GitHub repository, and that is all for this project. Thank you for reading.