Keyphrase extraction is one of the simplest yet most powerful techniques for extracting important information from unstructured text documents. Also known as terminology extraction, it is the process of extracting key terms or phrases from a body of unstructured text so that the core themes are captured. This technique falls under the broad umbrella of information retrieval and extraction. Keyphrase extraction is useful in many areas, some of which are mentioned here:

- Semantic web
- Query-based search engines and crawlers
- Recommendation systems
- Tagging systems
- Document similarity
- Translation

Keyphrase extraction is often the starting point for carrying out more complex tasks in text analytics or natural language processing and the output can act as features for more complex systems. There are various approaches for keyphrase extraction; we cover the following two major techniques:

- Collocations
- Weighted tag-based phrase extraction

An important point to remember is that we will be extracting phrases, which are usually groups of words but can sometimes be single words. Extracting only single words is known as keyword extraction, and it is a subset of keyphrase extraction.

# keyword --> TF-IDF
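
The heading above points to TF-IDF as a way to pick single keywords. A minimal sketch of that idea follows, assuming a simple weighting (term frequency as count divided by document length, inverse document frequency as log(N / df)). The names `tfidf_keywords` and `toy_docs` are hypothetical and exist only for illustration; library implementations usually apply smoothing and normalization that this sketch omits.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Rank each document's tokens by a basic TF-IDF score.

    docs: list of token lists (already normalized).
    Assumed weighting: tf = count / doc length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each token
    df = Counter(token for doc in docs for token in set(doc))
    ranked = []
    for doc in docs:
        if not doc:
            ranked.append([])
            continue
        counts = Counter(doc)
        scores = {
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        }
        # Sort by score (descending), then alphabetically to break ties
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        ranked.append([token for token, _ in top])
    return ranked

# Hypothetical toy corpus: three tiny "documents"
toy_docs = [
    ["dragon", "attacks", "city", "dragon"],
    ["player", "explores", "city"],
    ["player", "fights", "dragon"],
]
toy_keywords = tfidf_keywords(toy_docs, top_k=2)
print(toy_keywords)
```

Tokens that appear in only one document ("attacks", "explores", "fights") rank first because their inverse document frequency is highest, even when another token is more frequent within the document.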


## Collocations

The term collocation is borrowed from corpus analysis and linguistics. A collocation can be defined as a sequence or group of words that occur together more often than would be expected by chance. Various types of collocations can be formed based on parts of speech like nouns, verbs, and so on. There are various ways to extract collocations, and a straightforward one is an n-gram grouping or segmentation approach: we construct n-grams out of a corpus, count the frequency of each n-gram, and rank them by frequency of occurrence to get the most frequent n-gram collocations.

The idea is to take a corpus of documents (paragraphs or sentences), tokenize them, and flatten the result into one long sequence of tokens. We then slide a window of size n over this sequence to compute the n-grams. Once they are computed, we count how often each n-gram occurs and rank the n-grams by that count, which yields the most frequent collocations. We implement this from scratch initially so that you can understand the algorithm better, and then we use some of NLTK’s built-in capabilities to do the same.

Let’s start by loading some necessary dependencies and a corpus on which we will be computing collocations. We use Lewis Carroll’s Alice in Wonderland, from the NLTK Gutenberg corpus, as our corpus. We also normalize the corpus to standardize the text content, here with a small normalize function that plays the role of the text_normalizer module we built and used in the previous chapters.

In [107]:
from nltk.corpus import gutenberg 
import nltk
from operator import itemgetter
import re


import string
from nltk.corpus import stopwords
from nltk.corpus import inaugural
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
stopwords_en = stopwords.words('english')



[nltk_data] Downloading package punkt to /Users/olmos/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/olmos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/olmos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [108]:
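# Set up the normalization resources
lemmatizer = WordNetLemmatizer()
snowball = SnowballStemmer('english')
# Named stop_words so it does not shadow the imported stopwords module
stop_words = set(nltk.corpus.stopwords.words('english'))
punctuation = string.punctuation

def normalize(text):
    """Tokenize text, strip non-letters, lowercase, and drop stopwords."""
    normalized_text = []
    for token in nltk.word_tokenize(text):
        # flags must be passed by keyword: the fourth positional
        # argument of re.sub is count, not flags
        token = re.sub(r'[^a-zA-Z\s]', '', token, flags=re.I|re.A)
        token = token.lower()
        #token = lemmatizer.lemmatize(token)
        #token = snowball.stem(token)
        if token not in stop_words and token not in punctuation and token.isalnum():
            normalized_text.append(token)

    return normalized_text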

In [109]:
# load corpus
alice = gutenberg.raw(fileids='carroll-alice.txt') 

alice_norm = normalize(alice)

In [110]:
def compute_ngrams(sequence, n):
    return list(
            zip(*(sequence[index:] for index in range(n)))
    )

In [111]:
compute_ngrams([1,2,3,4], 2)

[(1, 2), (2, 3), (3, 4)]

In [112]:
compute_ngrams([1,2,3,4], 3)

[(1, 2, 3), (2, 3, 4)]

In [113]:
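# Two equivalent ways to sort pairs by their second element
def take_second(elem):
    return elem[1]


# sample list of pairs (named pairs to avoid shadowing the random module)
pairs = [(2, 2), (3, 4), (4, 1), (1, 3)]

# sort with a custom key function
sorted_list = sorted(pairs, key=take_second)
# sort with operator.itemgetter, which gives the same result
sorted_list = sorted(pairs, key=itemgetter(1))

# print list
print('Sorted list:', sorted_list)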

Sorted list: [(4, 1), (2, 2), (1, 3), (3, 4)]


In [114]:
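def get_top_ngrams(flat_corpus, ngram_val=1, limit=5):
    """Return the `limit` most frequent n-grams as (text, frequency) pairs."""
    # build all n-grams of size ngram_val over the flat token list
    ngrams = compute_ngrams(flat_corpus, ngram_val)
    # count how often each n-gram occurs
    ngrams_freq_dist = nltk.FreqDist(ngrams)
    # rank n-grams from most to least frequent
    sorted_ngrams_fd = sorted(ngrams_freq_dist.items(), key=itemgetter(1), reverse=True)
    sorted_ngrams = sorted_ngrams_fd[0:limit]
    # join each n-gram tuple into a single readable string
    sorted_ngrams = [(' '.join(text), freq) for text, freq in sorted_ngrams]

    return sorted_ngrams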

In [115]:
get_top_ngrams(alice_norm, ngram_val=2,limit=10)

[('said alice', 123),
 ('mock turtle', 55),
 ('march hare', 31),
 ('said king', 29),
 ('thought alice', 26),
 ('ca nt', 25),
 ('wo nt', 23),
 ('white rabbit', 22),
 ('said hatter', 22),
 ('said mock', 20)]

In [116]:
get_top_ngrams(alice_norm, ngram_val=3,limit=10)

[('said mock turtle', 20),
 ('said march hare', 9),
 ('nt wo nt', 7),
 ('poor little thing', 6),
 ('wo nt wo', 6),
 ('little golden key', 5),
 ('certainly said alice', 5),
 ('white kid gloves', 5),
 ('said alice nt', 5),
 ('march hare said', 5)]

This output shows sequences of two and three words (bigrams and trigrams) along with the number of times each occurs throughout the corpus. Most of the collocations follow the pattern “said <person>”, marking who is speaking. The collocations also surface popular characters in Alice in Wonderland, such as the Mock Turtle, the King, the White Rabbit, the Hatter, and Alice herself. Entries such as “ca nt” and “wo nt” are artifacts of tokenizing the contractions “can’t” and “won’t”.
    
We now look at NLTK’s collocation finders, which enable us to find collocations using various measures like raw frequencies, pointwise mutual information, and so on. Briefly, the pointwise mutual information of two events or terms is the logarithm of the ratio of the probability of them occurring together to the product of their individual probabilities, which is the joint probability we would expect if they were independent of each other. Mathematically, we can represent it as follows:
    
    
$$pmi(x,y) = \log \frac{p(x,y)}{p(x)p(y)}$$
    
This measure is symmetric, that is, pmi(x, y) = pmi(y, x). The following code snippets show how to find collocations using these measures:

In [117]:
from nltk.collocations import BigramCollocationFinder 
from nltk.collocations import BigramAssocMeasures

In [118]:
finder = BigramCollocationFinder.from_words(alice_norm)

In [119]:
finder

<nltk.collocations.BigramCollocationFinder at 0x7fd289b907f0>

In [120]:
finder.nbest(BigramAssocMeasures.raw_freq, 10)

[('said', 'alice'),
 ('mock', 'turtle'),
 ('march', 'hare'),
 ('said', 'king'),
 ('thought', 'alice'),
 ('ca', 'nt'),
 ('wo', 'nt'),
 ('said', 'hatter'),
 ('white', 'rabbit'),
 ('said', 'mock')]

In [121]:
finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)

[('mock', 'turtle'),
 ('march', 'hare'),
 ('said', 'alice'),
 ('white', 'rabbit'),
 ('ca', 'nt'),
 ('wo', 'nt'),
 ('join', 'dance'),
 ('soo', 'oop'),
 ('minute', 'two'),
 ('said', 'king')]

In [122]:
finder.nbest(BigramAssocMeasures.pmi, 10)

[('abide', 'figures'),
 ('acceptance', 'elegant'),
 ('accounting', 'tastes'),
 ('accustomed', 'usurpation'),
 ('act', 'crawling'),
 ('adding', 'youre'),
 ('adjourn', 'immediate'),
 ('adoption', 'energetic'),
 ('affair', 'trusts'),
 ('agony', 'terror')]
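
The PMI ranking above appears to be dominated by word pairs that occur only once or twice: when both words are rare, a single co-occurrence already gives a very high ratio in the PMI formula. The sketch below illustrates this effect on a made-up token list and shows how a minimum-frequency cutoff changes the ranking. The function `bigram_pmi` and the toy data are hypothetical, and probabilities are estimated as plain relative frequencies, which may differ in detail from NLTK's estimate. NLTK's collocation finders offer `apply_freq_filter(min_freq)` for the same purpose.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_freq=1):
    """Score adjacent word pairs by PMI, skipping pairs seen < min_freq times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_freq:
            continue
        p_xy = count / n_bi        # probability of the pair
        p_x = unigrams[x] / n_uni  # probability of each word alone
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    # Highest PMI first; ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy sequence: one frequent pair plus a few one-off pairs
toy_tokens = "said alice said alice said alice said king agony terror".split()

unfiltered = bigram_pmi(toy_tokens)
filtered = bigram_pmi(toy_tokens, min_freq=2)
print(unfiltered[0])
print(filtered)
```

Without the cutoff, a pair seen once ("agony terror") outranks the pair seen three times ("said alice"). With `min_freq=2`, only the repeated pairs remain.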

In [123]:
from nltk.collocations import TrigramCollocationFinder 
from nltk.collocations import TrigramAssocMeasures

trifinder = TrigramCollocationFinder.from_words(alice_norm)

trifinder.nbest(TrigramAssocMeasures.raw_freq, 10)

[('said', 'mock', 'turtle'),
 ('said', 'march', 'hare'),
 ('nt', 'wo', 'nt'),
 ('poor', 'little', 'thing'),
 ('wo', 'nt', 'wo'),
 ('certainly', 'said', 'alice'),
 ('little', 'golden', 'key'),
 ('march', 'hare', 'said'),
 ('mock', 'turtle', 'said'),
 ('said', 'alice', 'nt')]

In [124]:
trifinder.nbest(TrigramAssocMeasures.pmi, 10)

[('accustomed', 'usurpation', 'conquest'),
 ('adjourn', 'immediate', 'adoption'),
 ('adoption', 'energetic', 'remedies'),
 ('ancient', 'modern', 'seaography'),
 ('arithmetic', 'ambition', 'distraction'),
 ('bed', 'various', 'pretexts'),
 ('brother', 'latin', 'grammar'),
 ('canvas', 'bag', 'tied'),
 ('cherrytart', 'custard', 'pineapple'),
 ('circle', 'exact', 'shape')]

In [126]:
trifinder.nbest(TrigramAssocMeasures.likelihood_ratio, 20)

[('said', 'mock', 'turtle'),
 ('mock', 'turtle', 'sighed'),
 ('mock', 'turtle', 'replied'),
 ('mock', 'turtle', 'soup'),
 ('different', 'mock', 'turtle'),
 ('ix', 'mock', 'turtle'),
 ('mock', 'turtle', 'capering'),
 ('mock', 'turtle', 'seals'),
 ('cried', 'mock', 'turtle'),
 ('miserable', 'mock', 'turtle'),
 ('mock', 'turtle', 'drive'),
 ('mock', 'turtle', 'persisted'),
 ('mock', 'turtle', 'recovered'),
 ('mock', 'turtle', 'sang'),
 ('mock', 'turtle', 'yawned'),
 ('mock', 'turtle', 'youve'),
 ('mystery', 'mock', 'turtle'),
 ('mock', 'turtle', 'went'),
 ('obliged', 'mock', 'turtle'),
 ('show', 'mock', 'turtle')]

### BAG OF N-GRAMS!!

https://radimrehurek.com/gensim/models/phrases.html

http://uc-r.github.io/creating-text-features


https://svn.spraakdata.gu.se/repos/gerlof/pub/www/Docs/npmi-pfd.pdf

In [103]:
DOCUMENT = """
The Elder Scrolls V: Skyrim is an action role-playing video game developed by Bethesda Game Studios 
and published by Bethesda Softworks. It is the fifth main installment in The Elder Scrolls series, 
following The Elder Scrolls IV: Oblivion.
The game's main story revolves around the player character's quest to defeat Alduin the World-Eater, 
a dragon who is prophesied to destroy the world. The game is set 200 years after the events of Oblivion 
and takes place in the fictional province of Skyrim. Over the course of the game, the player completes 
quests and develops the character by improving skills. The game continues the open-world tradition of 
its predecessors by allowing the player to travel anywhere in the game world at any time, and to ignore 
or postpone the main storyline indefinitely.
The team opted for a unique and more diverse open world than Oblivion's Imperial Province of Cyrodiil, 
which game director and executive producer Todd Howard considered less interesting by comparison. 
The game was released to critical acclaim, with reviewers particularly mentioning the character advancement 
and setting, and is considered to be one of the greatest video games of all time.


The Elder Scrolls V: Skyrim is an action role-playing game, playable from either a first or 
third-person perspective. The player may freely roam over the land of Skyrim which is an open world 
environment consisting of wilderness expanses, dungeons, cities, towns, fortresses, and villages. 
Players may navigate the game world more quickly by riding horses or by utilizing a fast-travel system 
which allows them to warp to previously discovered locations. The game's main quest can be completed or 
ignored at the player's preference after the first stage of the quest is finished. However, some quests 
rely on the main storyline being at least partially completed. Non-player characters (NPCs) populate the 
world and can be interacted with in a number of ways: the player may engage them in conversation, 
marry an eligible NPC, kill them or engage in a nonlethal "brawl". The player may 
choose to join factions which are organized groups of NPCs — for example, the Dark Brotherhood, a band 
of assassins. Each of the factions has an associated quest path to progress through. Each city and town 
in the game world has jobs that the player can engage in, such as farming.

Players have the option to develop their character. At the beginning of the game, players create 
their character by selecting their sex and choosing between one of several races including humans, 
orcs, elves, and anthropomorphic cat or lizard-like creatures and then customizing their character's 
appearance. Over the course of the game, players improve their character's skills which are numerical 
representations of their ability in certain areas. There are eighteen skills divided evenly among the 
three schools of combat, magic, and stealth. When players have trained skills enough to meet the 
required experience, their character levels up. Health is depleted primarily when the player 
takes damage and the loss of all health results in death. Magicka is depleted by the use of spells, 
certain poisons and by being struck by lightning-based attacks. Stamina determines the player's 
effectiveness in combat and is depleted by sprinting, performing heavy "power attacks" 
and being struck by frost-based attacks. Skyrim is the first entry in The Elder Scrolls to 
include dragons in the game's wilderness. Like other creatures, dragons are generated randomly in 
the world and will engage in combat with NPCs, creatures and the player. Some dragons may attack 
cities and towns when in their proximity. The player character can absorb the souls of dragons 
in order to use powerful spells called "dragon shouts" or "Thu'um". A regeneration 
period limits the player's use of shouts in gameplay.

Skyrim is set around 200 years after the events of The Elder Scrolls IV: Oblivion, although it is 
not a direct sequel. The game takes place in Skyrim, a province of the Empire on the continent of 
Tamriel, amid a civil war between two factions: the Stormcloaks, led by Ulfric Stormcloak, and the 
Imperial Legion, led by General Tullius. The player character is a Dragonborn, a mortal born with 
the soul and power of a dragon. Alduin, a large black dragon who returns to the land after being 
lost in time, serves as the game's primary antagonist. Alduin is the first dragon created by Akatosh, 
one of the series' gods, and is prophesied to destroy and consume the world.
"""

# Latent Semantic Analysis

http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf

https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models/Ch06e%20-%20Document%20Summarization.ipynb



In [142]:
import gensim
from gensim.models import TfidfModel

In [143]:
# Create dictionary of tokens: the input is the preprocessed corpus 

sentences = nltk.sent_tokenize(DOCUMENT)

norm_text = [normalize(s) for s in sentences]

D = gensim.corpora.Dictionary(norm_text)

corpus_bow = [D.doc2bow(doc) for doc in norm_text]

model = TfidfModel(corpus_bow)  

corpus_tfidf = model[corpus_bow]


In [144]:
from gensim.matutils import corpus2dense, corpus2csc

n_tokens = len(D)
num_docs = len(corpus_bow)
# Convert the BoW representation (transposed so that rows are documents)
corpus_bow_dense = corpus2dense(corpus_bow, num_terms=n_tokens, num_docs=num_docs).T
corpus_bow_sparse = corpus2csc(corpus_bow, num_terms=n_tokens, num_docs=num_docs).T
# Convert the TF-IDF representation (transposed so that rows are documents)
corpus_tfidf_dense = corpus2dense(corpus_tfidf, num_terms=n_tokens, num_docs=num_docs).T
corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms=n_tokens, num_docs=num_docs).T

In [145]:
corpus_tfidf_dense.shape

(35, 270)

# Text Rank & Page Rank



https://radimrehurek.com/gensim_3.8.3//auto_examples/tutorials/run_summarization.html#sphx-glr-auto-examples-tutorials-run-summarization-py

# Rake algorithm

Rapid Automatic Keyword Extraction (RAKE) is a well-known keyword extraction method which uses a list of stopwords and phrase delimiters to detect the most relevant words or phrases in a piece of text.

First, the algorithm splits the text at phrase delimiters and stopwords to create candidate expressions.
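
The splitting step can be sketched in a few lines. This is a simplified illustration, not the rake-nltk implementation: the stopword list is a tiny hypothetical one, punctuation marks stand in for the phrase delimiters, and the name `rake_candidates` is made up for this example.

```python
import re

# Tiny illustrative stopword list (an assumption; real implementations
# use a much longer list, such as NLTK's English stopwords)
STOPWORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "or", "the", "to", "which"}

def rake_candidates(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at punctuation and stopwords."""
    candidates = []
    # Phrase delimiters: split on any run of punctuation or newlines
    for fragment in re.split(r'[.,;:!?()"\n]+', text.lower()):
        phrase = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword closes the current candidate phrase
                if phrase:
                    candidates.append(tuple(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(tuple(phrase))
    return candidates

sample = "Skyrim is an open world game, developed by Bethesda Game Studios."
candidates = rake_candidates(sample)
print(candidates)
```

Each candidate is a run of contiguous content words, so no candidate contains a stopword or crosses a punctuation mark.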

Once the text has been split, the algorithm creates a matrix of word co-occurrences. Each row shows the number of times that a given content word co-occurs with every other content word in the candidate phrases.

After that matrix is built, words are given a score. That score can be calculated as the degree of a word in the matrix (i.e. the sum of the number of co-occurrences the word has with any other content word in the text), as the word frequency (i.e. the number of times the word appears in the text), or as the degree of the word divided by its frequency.
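
The three scoring options above can be sketched without building the full matrix, because a word's degree is just the sum of the lengths of the candidate phrases it appears in. One assumption to note: this sketch counts the word itself in its degree (the diagonal of the co-occurrence matrix), which is a common convention; subtract the frequency to exclude it. Ranking each phrase by the sum of its member word scores is also an assumption of this sketch, and the function names and toy candidates are hypothetical.

```python
from collections import defaultdict

def rake_word_scores(candidates):
    """Return degree, frequency, and degree/frequency for every word."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidates:
        for word in phrase:
            freq[word] += 1
            # Co-occurrences within this phrase, including the word itself
            degree[word] += len(phrase)
    ratio = {word: degree[word] / freq[word] for word in freq}
    return dict(degree), dict(freq), ratio

def rake_phrase_scores(candidates, word_scores):
    """Score each distinct phrase as the sum of its member word scores."""
    scores = {
        phrase: sum(word_scores[word] for word in phrase)
        for phrase in set(candidates)
    }
    # Highest score first; ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical candidate phrases, as produced by the splitting step
toy_candidates = [
    ("open", "world"),
    ("open", "world", "game"),
    ("game",),
    ("main", "quest"),
]
degree, freq, ratio = rake_word_scores(toy_candidates)
ranked = rake_phrase_scores(toy_candidates, ratio)
print(ranked)
```

Scoring by degree favors words that occur in long candidates, scoring by frequency favors words that occur often regardless of context, and the ratio favors words that occur mostly inside longer candidates.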

In other words, RAKE identifies the main content-bearing words as candidate keywords by parsing the document with the help of stop words and phrase delimiters. First, the document text is split into an array of words at the specified word delimiters. Second, this array is split into sequences of contiguous words at phrase delimiters and stop word positions. Finally, the words within a sequence are assigned the same position in the text and are together considered a candidate keyword.

After all the candidate keywords have been identified, a graph of word co-occurrences is built, and the score of each candidate keyword is defined as the sum of its member word scores. Several metrics for calculating word scores can be evaluated on this graph, based on the degree and frequency of its vertices.

Because RAKE splits candidate keywords at stop words, the extracted keywords do not contain interior stop words. However, there is interest in identifying keywords that do contain interior stop words, such as “axis of evil”. To find them, RAKE looks for pairs of keywords that adjoin one another at least twice in the same document and in the same order. A new candidate keyword is then created as the combination of those keywords and their interior stop words. Very few of these linked keywords are extracted, which adds to their significance.
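
A rough sketch of this adjoining-keyword step is shown below. It groups each text fragment into alternating runs of content words and stop words, then counts every keyword / stop-word run / keyword sequence. The stopword list, the `min_count` parameter, and the function name `adjoined_keywords` are assumptions made for illustration, not part of any library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch)
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was"}

def adjoined_keywords(text, stopwords=STOPWORDS, min_count=2):
    """Find keyword pairs joined by interior stopwords at least min_count times."""
    joined = Counter()
    for fragment in re.split(r'[.,;:!?()"\n]+', text.lower()):
        # Group the fragment into alternating runs: [is_stopword_run, words]
        runs = []
        for word in fragment.split():
            is_stop = word in stopwords
            if runs and runs[-1][0] == is_stop:
                runs[-1][1].append(word)
            else:
                runs.append([is_stop, [word]])
        # Count keyword / stopword-run / keyword sequences, in order
        for left, middle, right in zip(runs, runs[1:], runs[2:]):
            if not left[0] and middle[0] and not right[0]:
                joined[" ".join(left[1] + middle[1] + right[1])] += 1
    return sorted(phrase for phrase, count in joined.items() if count >= min_count)

sample_text = (
    "The axis of evil was named in a speech. "
    "Critics said the axis of evil was a slogan. "
    "A slogan is a phrase."
)
linked = adjoined_keywords(sample_text)
print(linked)
```

Only "axis of evil" occurs twice in the same order in this sample, so it is the only linked keyword returned; sequences seen once, such as "evil was named", are discarded.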

https://medium.datadriveninvestor.com/rake-rapid-automatic-keyword-extraction-algorithm-f4ec17b2886c


https://catalogimages.wiley.com/images/db/pdf/9780470749821.excerpt.pdf

Figure: RAKE vs TextRank

https://monkeylearn.com/keyword-extraction/

https://pypi.org/project/rake-nltk/

In [127]:
!pip install rake-nltk



In [128]:
from rake_nltk import Rake

r = Rake() # Uses NLTK's English stopwords and all punctuation characters.

r.extract_keywords_from_text(DOCUMENT)

r.get_ranked_phrases() # To get keyword phrases ranked highest to lowest.


['executive producer todd howard considered less interesting',
 'eighteen skills divided evenly among',
 'several races including humans',
 'main story revolves around',
 'set around 200 years',
 'player may freely roam',
 'open world environment consisting',
 'use powerful spells called',
 'playing video game developed',
 'dragons may attack cities',
 'set 200 years',
 'greatest video games',
 'player may choose',
 'reviewers particularly mentioning',
 'regeneration period limits',
 'previously discovered locations',
 'fifth main installment',
 'trained skills enough',
 'elder scrolls v',
 'elder scrolls iv',
 'players may navigate',
 'main storyline indefinitely',
 'player takes damage',
 'large black dragon',
 'player completes quests',
 'diverse open world',
 'player may engage',
 'least partially completed',
 'associated quest path',
 'bethesda game studios',
 'game takes place',
 'elder scrolls series',
 'first dragon created',
 'elder scrolls',
 'playing game',
 'main storyline'