# NLP: Keywords extraction

> ## Content
>
> Pre-processing user data (by web-scraping)
>
> Keywords Extraction
>
> 1. RAKE (Rapid Automatic Keyword Extraction algorithm)
> 2. YAKE (Yet Another Keyword Extractor)
> 3. Postion Rank (pke unsupervised model)
> 4. Topic Rank (pke unsupervised model)
> 5. Multi-partite Rank (pke unsupervised model)
> 6. KeyBERT
>
> Method Evaluation


## packages


In [1]:
from urllib.request import urlopen, Request
import urllib
from html.parser import HTMLParser
import json
import re   # get rid of weird white spaces

# for keywords extraction 
from keybert import KeyBERT
from rake_nltk import Rake
# import nltk
# nltk.download('stopwords')
import yake
import pke
import string
import spacy


# for method evaluation
import logging
import time
import random
import tqdm

  from .autonotebook import tqdm as notebook_tqdm


## pre-processing user data

Read all the URLs from user data, then do some web-scraping to get every sentence in the webpages user has browsed.

Data is stored in `sentences` for further natural language processing


In [222]:
# words parser class
class WordsParser(HTMLParser):
    # tags to search text within
    search_tags = ['p', 'div', 'span', 'a']
    
    # current tag
    current_tag = ''
    
    # list of all sentences
    all_words = []
    
    # handle starting tag
    def handle_starttag(self, tag, attr):
        # store current tag
        self.current_tag = tag        
            
    # handle tag's data
    def handle_data(self, data):
        
      # make sure current tag matches search tags
      if self.current_tag in self.search_tags:
        if (
            (('.' in data) or ('!' in data) or ('?' in data)) and
            ('...' not in data)
        ):
            
            # clean the data
            data = re.sub('\s+' ,' ', data)
            " ".join(data.split())      # remove duplicate spaces and newline characters
            
            # add to the list
            self.all_words.append(data)

In [223]:
# main driver
if __name__ == '__main__':
  # read all json files to get URLs
  URLs = []
  cnt = 0
  for i in range(1):                # <------- change no. of user data files here
      with open('../data/{}.json'.format(i+1)) as openfile:

          json_object = json.load(openfile)

          # # the num of webpages every json file visitd
          # numPages = len(json_object["data"])
          # cnt = numPages + cnt

          # append all urls into a list `URLs`
          # for j in range(len(json_object["data"])):
          for j in range(5):        # <------- change no. of urls here
              URLs.append(json_object["data"][j]['url'])

  # # if (cnt == len(URLs)): print("true")        # check if all webpages are put into `URLs`

  html = ""

  # target URL to scrape
  for url in URLs:
      print(url)
      # make HTTP GET request to the target URL
      class AppURLopener(urllib.request.FancyURLopener):
          version = "Mozilla/5.0"

      opener = AppURLopener()
      response = opener.open(url)

      # extract HTML document from response
      html = response.read().decode('utf-8', errors='ignore') + html
      print(len(html))

  # create words parser instance
  words_parser = WordsParser()

  # feed HTML to words parser
  words_parser.feed(html)

  # get all full sentences
  sentences = words_parser.all_words

  # now that we have got all the sentences webpages user browsed
  # save to a txt file for backup
  # print(sentences)
  with open("../web_scraping/sentences.txt", "w") as f:
      for sentence in sentences:
          f.write(sentence + '\n')
  

https://www.google.com/search?q=umich+biostatistics+master&oq=umich+biostatistics+master&aqs=chrome..69i57j69i60l3.11673j0j7&sourceid=chrome&ie=UTF-8


  opener = AppURLopener()


100632
https://sph.umich.edu/biostat/
156026
https://sph.umich.edu/biostat/programs/masters.html
196971
https://sph.umich.edu/biostat/apply-ms-biostatistics.html
243121
https://rackham.umich.edu/admissions/applying/
467909


In [224]:
print(len(sentences))

70


## Keywords Extraction

https://towardsdatascience.com/keyword-extraction-a-benchmark-of-7-algorithms-in-python-8a905326d93f

method 2, 3, 4, 5: https://boudinfl.github.io/pke/build/html/unsupervised.html


In [2]:
# small dataset for quick test, can delete later
# sentences = ['Why Study Public Health?', 'Graduate programs in the Department of Biostatistics at the University of Michigan School of Public Health are among the best in the world. Currently, we are ranked as the No. 4 biostatistics department by US News and World Report. ']

# initiate BERT outside of functions
bert = KeyBERT()

In [226]:
# initialize stop-wrod list
print(pke.lang.stopwords.get('en'))

stoplist = list(string.punctuation)         # punctuations:  !"#$%&'()*+, -./:;<=>?@[\]^_`{|}~
stoplist += pke.lang.stopwords.get('en')

{'am', 'has', 'three', 'around', 'front', 'seemed', 'again', 'himself', 'back', 'would', 'mostly', 'with', 'always', 'besides', 'yourselves', 'off', 'beyond', 'have', 'does', "'re", 'anyway', 'take', 'themselves', 'full', 'ten', '’m', 'whom', 'noone', 'against', 'therefore', 'because', 'therein', 'meanwhile', 'moreover', 'per', 'mine', 'amount', 'and', 'never', 'each', 'what', 'over', 'four', 'although', 'already', '‘m', 'they', 'towards', 'across', "'ll", 'hereupon', 'by', 'be', 'in', 'eight', 'almost', 'top', 'nowhere', 'or', 'unless', 'others', 'n‘t', 'that', 'been', 'now', 'until', 'another', 'i', 'once', 'became', 'whoever', 'go', 'whereupon', 'ever', 'whole', 'down', 'along', 'third', 'whatever', 'were', 'something', 'name', 'all', 'whose', 'really', 'except', 'thereupon', 'their', 'some', 'below', 'call', 'nor', 'also', 'herself', 'next', 'while', 'your', 'everything', 'often', 'such', 'indeed', 'sometime', 'should', 'eleven', 'may', 'sixty', 'whereby', 'less', 'within', 'are', 

### Forms of data for further processing

**sentences**: list of strings

**article**: string

**doc**: spacy model processed


In [3]:
# spacy models
nlp = spacy.load('en_core_web_sm')
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

article = ''
for sentence in sentences:
    sentence +=  " "
    article += sentence
doc = nlp(article)

print(article)

[('This', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]
Why Study Public Health? Graduate programs in the Department of Biostatistics at the University of Michigan School of Public Health are among the best in the world. Currently, we are ranked as the No. 4 biostatistics department by US News and World Report.  


In [228]:
from spacy.matcher import Matcher 
# checks if a list of keywords match a certain POS pattern
def match(text):
    patterns = [
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'VERB'}],
        [{'POS': 'NOUN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],  
        [{'POS': 'NOUN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'ADV'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'VERB'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADP'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'VERB'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}, {'POS': 'PROPN'}],
        [{'POS': 'VERB'}, {'POS': 'ADV'}],
        [{'POS': 'PROPN'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}],
        [{'POS': 'VERB'}],
        [{'POS': 'ADJ'}],
        ]
    matcher = Matcher(nlp.vocab)
    matcher.add("pos-matcher", patterns)
    # create spacy object
    doc = nlp(text)
    # iterate through the matches
    matches = matcher(doc)
    # if matches is not empty, it means that it has found at least a match
    if len(matches) > 0:
        return True
    return False

match("very delicious cake")

True

### 1. RAKE

https://pypi.org/project/rake-nltk/


In [248]:
def rake_extractor(text, n):
    # Uses stopwords for english from NLTK, and all puntuation characters by default
    r = Rake()

    # Extraction given string
    # r.extract_keywords_from_text(text)    <-- raises error
    
    # Extraction given the list of strings where each string is a sentence.
    r.extract_keywords_from_sentences([text])

    # To get keyword phrases ranked highest to lowest with scores.
    keywords_scored = r.get_ranked_phrases_with_scores()[:n]

    results = {}

    for keyword in keywords_scored:
        # The lower the score, the more relevant the keyword is.
        # print("keyword \'", keyword[1], "\' scores", keyword[0])

        # convert scores to integers
        score = int(keyword[0])
        results[keyword[1]] = score

    return results 
    
result_rake = rake_extractor(article, 5)
print(result_rake)

{'edu › biostat › programs › mastersthe ms': 43, 'edu › biostat › programs › masters': 38, 'edu › biostat › programssph': 28}


### 2. YAKE

https://liaad.github.io/yake/docs/getting_started

https://github.com/LIAAD/yake


In [230]:
def yake_extractor(text, n):

    # params tuning
    language = "en"
    max_ngram_size = 1
    deduplication_thresold = 0.9
    deduplication_algo = 'seqm'
    windowSize = 1
    numOfKeywords = n
    
    scored_keywords = yake.KeywordExtractor(lan=language, 
                                     n=max_ngram_size, 
                                     dedupLim=deduplication_thresold, 
                                     dedupFunc=deduplication_algo, 
                                     windowsSize=windowSize, 
                                     top=numOfKeywords).extract_keywords(text)
    
    results = {}
    for keyword in scored_keywords:
        # The lower the score, the more relevant the keyword is.
        # print("keyword \'", keyword[0], "\' scores", keyword[1])

        # convert scores to integers
        score = 100 - int(keyword[1] * 1000)
        results[keyword[0]] = score
    return results

# usage
article = ''
for sentence in sentences:
    article += sentence

result_yake = yake_extractor(article, 5)
print(result_yake)

{'Graduate': 93, 'Rackham': 90, 'programs': 89, 'Biostatistics': 88, 'program': 81}


### 3. PositionRank


In [231]:
def position_rank_extractor(text, n):

    # define the valid Part-of-Speeches to occur in the graph
    pos = {'NOUN', 'PROPN', 'ADJ'}      # , 'ADV'

    # define extractor & load document
    extractor = pke.unsupervised.PositionRank()
    extractor.load_document(text, language='en')

    # candidate number
    extractor.candidate_selection(maximum_word_number=1)

    # weight the candidates using the sum of their word's scores that are computed using random walk biaised with the position of the words in the document. In the graph, nodes are words (nouns and adjectives only) that are connected if they occur in a window of 3 words.
    # window:  the window within the sentence for connecting two words in the graph, defaults to 10.
    extractor.candidate_weighting(window=3, pos=pos)
    
    # get the 5-highest scored candidates as keyphrases
    scored_keywords = extractor.get_n_best(n=n)
    
    results = {}
    for keyword in scored_keywords:
        # The lower the score, the more relevant the keyword is.
        # print("keyword \'", keyword[0], "\' scores", keyword[1])

        # convert scores to integers
        score = int(keyword[1] * 1000)
        results[keyword[0]] = score
    return results 

# usage
result_position_rank = position_rank_extractor(doc, 5)
print(result_position_rank)

{'graduate': 104, 'program': 63, 'rackham': 42, 'schools': 39, 'michigan': 32}


### 4. TopicRank

Uses TopicRank to extract the top 5 keywords from a text

Arguments: text (str)

Returns: list of keywords (list)


In [232]:
# method tuning
def topic_rank_extractor(text, n):

    # use TopicRank as extractor
    extractor = pke.unsupervised.TopicRank()

    # load content with stopwords and punctuations
    extractor.load_document(text, language='en', stoplist=stoplist)

    # sentence structure pattern
    # select the longest sequences of nouns and adjectives, that do not contain punctuation marks or stopwords as candidates.
    pos = {'NOUN', 'PROPN', 'ADJ'}   #, 'ADV'
    extractor.candidate_selection(pos=pos)

    # default:
    # build topics by grouping candidates with HAC (average linkage, threshold of 1/4 of shared stems). 
    # Weight the topics using random walk, and select the first occuring candidate from each topic.
    extractor.candidate_weighting()

    # get the 5-highest scored candidates as keyphrases
    scored_keywords = extractor.get_n_best(n=n)
    
    results = {}
    for keyword in scored_keywords:
        # The lower the score, the more relevant the keyword is.
        # print("keyword \'", keyword[0], "\' scores", keyword[1])

        # convert scores to integers
        score = int(keyword[1] * 100)
        results[keyword[0]] = score
    return results 

# # usage
result_topic_rank = topic_rank_extractor(doc, 5)
print(result_topic_rank)

{'application review process': 4, 'graduate education': 3, 'graduate students': 3, 'biostatistics': 3, 'master': 3}


### 5. MultipartiteRank


In [233]:
def multipartite_rank_extractor(text, n):
    
    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(text, language='en', stoplist=stoplist)
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    extractor.candidate_selection(pos=pos)

    # build the Multipartite graph and rank candidates using random walk, alpha controls the weight adjustment mechanism
    extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
    
    # get the 5-highest scored candidates as keyphrases
    scored_keywords = extractor.get_n_best(n=n)
    
    results = {}
    for keyword in scored_keywords:
        # The lower the score, the more relevant the keyword is.
        # print("keyword \'", keyword[0], "\' scores", keyword[1])

        # convert scores to integers
        score = int(keyword[1] * 100)
        results[keyword[0]] = score
    return results 

# # usage
result_mtpartite_rank = multipartite_rank_extractor(doc, 5)
print(result_mtpartite_rank)

{'graduate education': 6, 'application review process': 3, 'graduate students': 2, 'rackham graduate school': 2, 'master': 2}


### 6. KeyBERT

https://towardsdatascience.com/how-to-extract-relevant-keywords-with-keybert-6e7b3cf889ae

https://maartengr.github.io/KeyBERT/guides/quickstart.html#usage

Uses KeyBERT to extract the top 5 one-word-long keywords from a text, combined with _frequency count_

Arguments(str): text

Returns(dictrionary): list of keywords, with scores add up together


In [252]:
def keybert_extractor(text, n):
    keywords = bert.extract_keywords(
                    text, 
                    keyphrase_ngram_range=(1, 1), 
                    stop_words= 'english', 
                    top_n=n,
                    diversity = 0.8)

    results = {}
    #index = 0

    # get keywords in every sentence
    # for scored_keywords in keywords:

        # index = index + 1
        # print("keywords in sentence", index, "is", scored_keywords)

        # loop every keyword in each sentence
    for keyword in keywords:
        word = keyword[0]
        score = int(float(keyword[1] * 100) )
        try:
            # try to update count of the given keyword if available
            results[word] += score
            # print("update:", word, score)
        
        except:
            # store current keyword
            results[word] = score
            # print("record:", word)
    
    # get top n keywords
    import operator
    import itertools
    sorted_results = dict( sorted(results.items(), key=operator.itemgetter(1), reverse=True))
    results = dict(itertools.islice(sorted_results.items(), n))
    return results 

# # usage:
result_KeyBERT = keybert_extractor(article, 5)
print(result_KeyBERT)

{'graduate': 48, 'graduates': 44, 'rackham': 42, 'scholarship': 41, 'doctoral': 39}


In [236]:
# outputs
results = [result_rake, result_yake, result_position_rank, result_topic_rank, result_mtpartite_rank, result_KeyBERT]
i = 1

# score normalization
# ... code here ...

for result in results:
    print(i, ".")
    for item in result.items():
      print(item)
    i += 1

1 .
('edu › biostat › programs › masters', 37)
('edu › biostat › programs sph', 28)
2 .
('Graduate', 93)
('Rackham', 90)
('programs', 89)
('Biostatistics', 88)
('program', 81)
3 .
('graduate', 104)
('program', 63)
('rackham', 42)
('schools', 39)
('michigan', 32)
4 .
('application review process', 4)
('graduate education', 3)
('graduate students', 3)
('biostatistics', 3)
('master', 3)
5 .
('graduate education', 6)
('application review process', 3)
('graduate students', 2)
('rackham graduate school', 2)
('master', 2)
6 .
('graduate', 48)
('graduates', 44)
('rackham', 42)
('scholarship', 41)
('doctoral', 39)


## Method Evaluation


In [237]:
import numpy as np
import pandas as pd

# Get seconds from time.
def get_sec(time_str):
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)

In [238]:
# uses an extractor to retrieve keywords from a list of documents
def extract_keywords_from_corpus(extractor, corpus, n):
    extractor_name = extractor.__name__.replace("_extractor", "")
    logging.info(f"Starting keyword extraction with {extractor_name}")
    corpus_kws = {}
    start = time.time()
    # logging.info(f"Timer initiated.")     # output start of timer
    for idx, text in tqdm.tqdm(enumerate(corpus), desc="Extracting keywords from corpus..."):
        corpus_kws[idx] = extractor(text, n)
    end = time.time()
    # logging.info(f"Timer stopped.")       # output end of timer
    elapsed = time.strftime("%H:%M:%S", time.gmtime(end - start))
    logging.info(f"Time elapsed: {elapsed}")
    
    return {"algorithm": extractor.__name__, 
            "corpus_kws": corpus_kws, 
            "elapsed_time": elapsed}

In [254]:
# This function runs the benchmark for the above keyword extraction algorithms
def benchmark(corpus, shuffle=True):
    logging.info("Starting benchmark...\n")
    
    # Shuffle the corpus
    if shuffle:
        random.shuffle(corpus)

    # extract keywords from corpus
    results = []
    extractors = [
        rake_extractor, 
        yake_extractor, 
        topic_rank_extractor, 
        position_rank_extractor,
        multipartite_rank_extractor,
        keybert_extractor,
    ]
    for extractor in extractors:
        result = extract_keywords_from_corpus(extractor, corpus, 8) # <-- change num of keywrods here
        results.append(result)

    # compute average number of extracted keywords
    for result in results:
        len_of_kw_list = []
        for kws in result["corpus_kws"].values():
            len_of_kw_list.append(len(kws))
        result["avg_keywords_per_document"] = np.mean(len_of_kw_list)

    # match keywords
    for result in results:
        for idx, kws in result["corpus_kws"].items():
            match_results = []
            for kw in kws:
                match_results.append(match(kw))
                result["corpus_kws"][idx] = match_results

    # compute average number of matched keywords
    for result in results:
        len_of_matching_kws_list = []
        for idx, kws in result["corpus_kws"].items():
            len_of_matching_kws_list.append(len([kw for kw in kws if kw]))
        result["avg_matched_keywords_per_document"] = np.mean(len_of_matching_kws_list)
        # compute average percentange of matching keywords, round 2 decimals
        result["avg_percentage_matched_keywords"] = round(
            result["avg_matched_keywords_per_document"] / result["avg_keywords_per_document"], 
            2)
        
    # create score based on the avg percentage of matched keywords divided by time elapsed (in seconds)
    for result in results:
        elapsed_seconds = get_sec(result["elapsed_time"]) + 0.1
        # weigh the score based on the time elapsed
        result["performance_score"] = round(result["avg_percentage_matched_keywords"] / elapsed_seconds, 2)
    
    # # delete corpus_kw
    # for result in results:
    #     del result["corpus_kws"]

    # create results dataframe
    df = pd.DataFrame(results)
    df.to_csv("method_eval.csv", index=False)
    logging.info("Benchmark finished. Results saved to results.csv")
    return df

In [255]:
# Evaluate
results = benchmark(sentences[:500], shuffle=True)

Extracting keywords from corpus...: 70it [00:00, 4643.68it/s]
Extracting keywords from corpus...: 70it [00:00, 277.68it/s]
Extracting keywords from corpus...: 70it [00:49,  1.42it/s]
Extracting keywords from corpus...: 70it [00:49,  1.41it/s]
Extracting keywords from corpus...: 70it [00:51,  1.37it/s]
Extracting keywords from corpus...: 70it [00:04, 16.80it/s]


## Final Output for Word-cloud Viuslisation


In [251]:
import json

# n: number of keywords
n = 25

print(article)
dic = keybert_extractor(article, n)
print(dic)  # the output with original scores

# convert scores to scale of 5 - 40
scores = []
result = {}

# get every score, calculate scale ratio
for keyword in dic.items():
    scores.append(keyword[1])
scores.sort()
original_scale = float(scores[n-1] - scores[0])
# print(original_scale)   # 17.0
ratio = float(35.0 / original_scale)
base = scores[0] * ratio - 5

# loop through every score to update converted value
for keyword in dic.items():
    result[keyword[0]] = keyword[1] * ratio - base

print(result)

# Serializing json
json_object = json.dumps(result, indent=4)

# Writing to sample.json
with open("../js/wordcloud//keywords.json", "w") as outfile:
    outfile.write(json_object)

    

Graduate education at the University of Michigan is a shared enterprise. The Rackham Graduate School works together with faculty in the schools and colleges of the University to provide more than 180 graduate degree programs and to sustain a dynamic intellectual climate within which graduate students thrive.The Rackham Graduate School and the graduate program work as a team to manage the application review process. As an applicant you will be interacting with both offices.The University of Michigan provides many sources of financial assistance to help students meet educational and living expenses. Whether you are a prospective student, a current student, a master’s or doctoral student, we want to make sure you know about the funding available for your graduate education.From your first registration through the final stages of the degree process, we’re here to help every step of the way.Ph.D.D.M.A.Rackham offers opportunities, funding, and resources that prepare graduate students and po