# Tutorial 5: Keyphrase Extraction

**Keyphrase extraction** is the task of automatically selecting a small set of phrases that best describe a given free text document.

Keyphrase extraction, also known as terminology extraction, is the process of extracting the most important and relevant terms or phrases from a body of unstructured text so that these key phrases capture the core topics or themes of the document(s). This technique falls under the broad umbrella of information retrieval and extraction.

_Supervised_ keyphrase extraction requires large amounts of labeled training data and generalizes poorly outside the domain of that data. _Unsupervised_ systems, in turn, have accuracy issues and often do not generalize well either, because they require the input document to belong to a larger corpus that is also given as input.

In this notebook, we will cover keyphrase extraction methods based on:

+ Collocations
+ N-grams
+ Weighted Tag Based methods
+ RAKE (Rapid Automatic Keyword Extraction algorithm)

## Import Libraries

In [16]:
!pip install contractions
!pip install textsearch



In [17]:
import nltk

In [18]:
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [19]:
nltk.download('gutenberg')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Basic Text Pre-processor

In [20]:
import re
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # remove special characters and digits, then lower case and trim whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

In [21]:
from nltk.corpus import gutenberg
from operator import itemgetter

## Load Dataset

In this notebook, we will make use of a sample description of elephants taken from Wikipedia.


In [69]:
import requests
import bs4
import re

res = requests.get('https://pastebin.com/raw/W182iruJ')
data = res.text

In [70]:
elephants = nltk.sent_tokenize(data)
elephants[:10]

['Elephants are the largest living land mammals.',
 'The largest elephant recorded was one shot in Angola, 1974.',
 'It weighed 27,060 pounds (13.5 tons) and stood 13 feet 8 inches tall.',
 'Their skin color is grey.',
 'At birth, an elephant calf may weigh as much as 100 kg (225 pounds).',
 'The baby elephant develops for 20 to 22 months inside its mother.',
 'No other land animal takes this long to develop before being born.',
 'In the wild, elephants have strong family relationship.',
 'Their ways of acting toward other elephants are hard for people to understand.',
 'They "talk" to each other with very low sounds.']

In [71]:
norm_elephants = list(filter(None, [normalize_document(line)
                                      for line in elephants]))
norm_elephants[:10]

['elephants largest living land mammals',
 'largest elephant recorded one shot angola',
 'weighed pounds tons stood feet inches tall',
 'skin color grey',
 'birth elephant calf may weigh much kg pounds',
 'baby elephant develops months inside mother',
 'land animal takes long develop born',
 'wild elephants strong family relationship',
 'ways acting toward elephants hard people understand',
 'talk low sounds']

## Collocations

A collocation is a sequence or group of words that occur together more often than would be expected by chance.

There are various ways to extract collocations. One straightforward way is an n-gram approach: we construct n-grams from a corpus, count the frequency of each n-gram, and rank the n-grams by frequency to get the most frequent n-gram collocations.

Collocations are thus multi-word phrases or expressions whose words are highly likely to co-occur, for example ‘social media’, ‘school holiday’, and ‘machine learning’.

Let us prepare a function to generate n-grams from a sequence of tokens.

In [72]:
def compute_ngrams(sequence, n):
    return list(
            zip(*(sequence[index:] 
                     for index in range(n)))
    )

In [73]:
# test bi-gram extraction
compute_ngrams([1,2,3,4], 2)

[(1, 2), (2, 3), (3, 4)]

In [74]:
# test tri-gram extraction
compute_ngrams([1,2,3,4], 3)

[(1, 2, 3), (2, 3, 4)]

## N-Grams

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
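The same sliding-window idea applies to any kind of item. As a minimal sketch, the snippet below builds word bigrams and character trigrams; `compute_ngrams` mirrors the helper from the previous section and is redefined here only so that the snippet runs on its own, and the sample inputs are made up for illustration.

```python
def compute_ngrams(sequence, n):
    # Zip n copies of the sequence, each shifted one position further,
    # so that every resulting tuple holds n consecutive items.
    return list(zip(*(sequence[index:] for index in range(n))))

# Items are words: split a (made-up) normalized sentence into tokens first.
word_bigrams = compute_ngrams("elephants largest living land mammals".split(), 2)

# Items are letters: a string is already a sequence of characters.
char_trigrams = ["".join(gram) for gram in compute_ngrams("tusks", 3)]

print(word_bigrams)
print(char_trigrams)
```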

Let's find the most frequent n-grams in our dataset.

In [75]:
def flatten_corpus(corpus):
    return ' '.join([document.strip() 
                     for document in corpus])

In [76]:
def get_top_ngrams(corpus, ngram_val=1, limit=5):

    corpus = flatten_corpus(corpus)
    tokens = nltk.word_tokenize(corpus)

    ngrams = compute_ngrams(tokens, ngram_val)
    ngrams_freq_dist = nltk.FreqDist(ngrams)
    sorted_ngrams_fd = sorted(ngrams_freq_dist.items(), 
                              key=itemgetter(1), reverse=True)
    sorted_ngrams = sorted_ngrams_fd[0:limit]
    sorted_ngrams = [(' '.join(text), freq) 
                     for text, freq in sorted_ngrams]

    return sorted_ngrams

We use nltk's `FreqDist` class to count how often each n-gram occurs, sort the n-grams by frequency, and return the top ones up to the user-specified `limit`.
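In NLTK 3, `FreqDist` is a subclass of Python's `collections.Counter`, so the counting and ranking can be illustrated with the standard library alone. The following is a minimal sketch on a made-up toy corpus, not a replacement for `get_top_ngrams`:

```python
from collections import Counter

# Made-up, already normalized sentences (illustrative only).
toy_corpus = [
    "african elephants live savanna",
    "asian elephants live forest",
    "african elephants large ears",
]

# Flatten the corpus into one token stream, as flatten_corpus does.
tokens = " ".join(toy_corpus).split()

# Count every pair of consecutive tokens and keep the two most frequent.
bigram_counts = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
top_bigrams = bigram_counts.most_common(2)

print(top_bigrams)
```

Because the corpus is flattened before the n-grams are built, some n-grams span sentence boundaries; here `savanna asian` is counted as a bigram even though the two words come from different sentences.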

In [77]:
get_top_ngrams(corpus=norm_elephants, ngram_val=2,
               limit=20)

[('african elephants', 8),
 ('asian elephants', 6),
 ('modern elephants', 4),
 ('indian elephants', 3),
 ('elephants used', 3),
 ('elephants strong', 2),
 ('elephants african', 2),
 ('loxodonta africanus', 2),
 ('elephants eat', 2),
 ('teeth called', 2),
 ('group loxodonta', 2),
 ('evolved gomphotheres', 2),
 ('used tourists', 2),
 ('south africa', 2),
 ('female elephant', 2),
 ('elephant often', 2),
 ('elephants largest', 1),
 ('largest living', 1),
 ('living land', 1),
 ('land mammals', 1)]

In [78]:
get_top_ngrams(corpus=norm_elephants, ngram_val=3,
               limit=20)

[('indian elephants eat', 2),
 ('elephants used tourists', 2),
 ('elephants largest living', 1),
 ('largest living land', 1),
 ('living land mammals', 1),
 ('land mammals largest', 1),
 ('mammals largest elephant', 1),
 ('largest elephant recorded', 1),
 ('elephant recorded one', 1),
 ('recorded one shot', 1),
 ('one shot angola', 1),
 ('shot angola weighed', 1),
 ('angola weighed pounds', 1),
 ('weighed pounds tons', 1),
 ('pounds tons stood', 1),
 ('tons stood feet', 1),
 ('stood feet inches', 1),
 ('feet inches tall', 1),
 ('inches tall skin', 1),
 ('tall skin color', 1)]

## N-Gram and Pointwise Mutual Information

The `collocations` module in `nltk` provides collocation finders, which by default consider all n-grams in a text as candidate collocations.


### Pointwise Mutual Information
Simple frequency isn’t the best measure of association between words. One problem is that raw frequency is very skewed and not very discriminative. If we want to know what kinds of contexts are shared by apricot and pineapple but not by digital and information, we’re not going to get good discrimination from words like the, it, or they, which occur frequently with all sorts of words and aren’t informative about any particular word.

Pointwise mutual information (PMI) for two events or terms is the logarithm of the ratio between the probability of them occurring together and the product of their individual probabilities, which is the joint probability they would have if they were independent of each other.

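For more details, see [this chapter of Jurafsky and Martin's Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/15.pdf) and [this introduction to pointwise mutual information](https://eranraviv.com/understanding-pointwise-mutual-information-in-statistics/).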
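As a worked illustration of the definition, the sketch below computes PMI by hand for bigrams in a made-up token list. Probabilities are estimated as simple relative frequencies, which is an assumption of this sketch; NLTK's `BigramAssocMeasures.pmi` may normalize the counts slightly differently.

```python
import math
from collections import Counter

# Made-up token stream (illustrative only).
tokens = ("african elephants eat grass asian elephants eat leaves "
          "african elephants drink water").split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(w1, w2):
    # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
    # Only defined for pairs that were actually observed (count > 0).
    p_xy = bigram_counts[(w1, w2)] / n_bigrams
    p_x = unigram_counts[w1] / n_unigrams
    p_y = unigram_counts[w2] / n_unigrams
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("african", "elephants"), 3))
print(round(pmi("drink", "water"), 3))
```

In this toy data `drink water` occurs once and `african elephants` twice, yet `drink water` receives the higher PMI because neither of its words appears anywhere else.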

In [79]:
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import BigramAssocMeasures

In [81]:
finder = BigramCollocationFinder.from_documents([item.split() 
                                                for item 
                                                in norm_elephants])
finder

<nltk.collocations.BigramCollocationFinder at 0x7fdd9ada6ac8>

In [82]:
bigram_measures = BigramAssocMeasures()                                                
finder.nbest(bigram_measures.raw_freq, 20)

[('african', 'elephants'),
 ('asian', 'elephants'),
 ('modern', 'elephants'),
 ('elephants', 'used'),
 ('indian', 'elephants'),
 ('elephant', 'often'),
 ('elephants', 'eat'),
 ('elephants', 'strong'),
 ('evolved', 'gomphotheres'),
 ('female', 'elephant'),
 ('group', 'loxodonta'),
 ('loxodonta', 'africanus'),
 ('south', 'africa'),
 ('teeth', 'called'),
 ('used', 'tourists'),
 ('acacia', 'trees'),
 ('across', 'alps'),
 ('acting', 'toward'),
 ('actual', 'family'),
 ('africa', 'asia')]

In [83]:
finder.nbest(bigram_measures.pmi, 20)   

[('across', 'alps'),
 ('alive', 'feeding'),
 ('allows', 'restricted'),
 ('ancestors', 'palaeocene'),
 ('ants', 'bite'),
 ('appendix', 'ii'),
 ('avoid', 'acacia'),
 ('became', 'cooler'),
 ('cameroon', 'gabon'),
 ('carthaginian', 'general'),
 ('climate', 'became'),
 ('color', 'grey'),
 ('concentration', 'silica'),
 ('conservation', 'efforts'),
 ('controlled', 'contraception'),
 ('cooler', 'drier'),
 ('countries', 'sport'),
 ('country', 'found'),
 ('crushed', 'criminals'),
 ('dietary', 'supply')]
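The PMI ranking above is dominated by pairs such as `('across', 'alps')` that rank low by raw frequency: PMI rewards words that appear almost exclusively together, so a pair seen only once can outrank a frequent one. A common remedy is to discard rare candidates before ranking; NLTK's collocation finders offer `apply_freq_filter(min_freq)` for this purpose. The sketch below shows the idea with the standard library only; the token list, the helper name `rank_by_pmi`, and the threshold of 2 are illustrative assumptions.

```python
import math
from collections import Counter

# Made-up token stream (illustrative only).
tokens = ("african elephants eat grass asian elephants eat leaves "
          "african elephants drink water").split()

def rank_by_pmi(tokens, min_freq=1):
    """Rank observed bigrams by PMI, ignoring those seen fewer than min_freq times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams, n_bigrams = len(tokens), len(tokens) - 1
    scored = {}
    for (w1, w2), count in bigrams.items():
        if count < min_freq:
            continue  # drop rare candidates before scoring
        p_xy = count / n_bigrams
        p_x, p_y = unigrams[w1] / n_unigrams, unigrams[w2] / n_unigrams
        scored[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

unfiltered = rank_by_pmi(tokens)
filtered = rank_by_pmi(tokens, min_freq=2)

print(unfiltered[:3])
print(filtered)
```

Without the filter, the top of the ranking consists of pairs seen once; with `min_freq=2`, only the repeated pairs remain as candidates.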

In [84]:
from nltk.collocations import TrigramCollocationFinder
from nltk.collocations import TrigramAssocMeasures

In [85]:
finder = TrigramCollocationFinder.from_documents([item.split() 
                                                for item 
                                                in norm_elephants])

In [86]:
trigram_measures = TrigramAssocMeasures()                                                
finder.nbest(trigram_measures.raw_freq, 20)

[('elephants', 'used', 'tourists'),
 ('indian', 'elephants', 'eat'),
 ('acacia', 'trees', 'symbiotic'),
 ('across', 'alps', 'fought'),
 ('acting', 'toward', 'elephants'),
 ('actual', 'family', 'elephantidae'),
 ('africa', 'asia', 'hard'),
 ('africa', 'tanzania', 'zambia'),
 ('african', 'asian', 'elephants'),
 ('african', 'elephant', 'kind'),
 ('african', 'elephants', 'cool'),
 ('african', 'elephants', 'larger'),
 ('african', 'elephants', 'live'),
 ('african', 'elephants', 'low'),
 ('african', 'elephants', 'receive'),
 ('african', 'elephants', 'tusks'),
 ('african', 'elephants', 'two'),
 ('african', 'loxodonta', 'africanus'),
 ('africanus', 'asian', 'elephants'),
 ('alive', 'feeding', 'soft')]

In [87]:
finder.nbest(trigram_measures.pmi, 20)  

[('ancestors', 'palaeocene', 'eocene'),
 ('appendix', 'ii', 'status'),
 ('became', 'cooler', 'drier'),
 ('cameroon', 'gabon', 'mozambique'),
 ('carthaginian', 'general', 'hannibal'),
 ('climate', 'became', 'cooler'),
 ('concentration', 'silica', 'abrasive'),
 ('cooler', 'drier', 'pliocene'),
 ('distantly', 'related', 'sea'),
 ('early', 'ancestors', 'palaeocene'),
 ('eocene', 'small', 'semiaquatic'),
 ('exists', 'outside', 'protected'),
 ('feeders', 'generalist', 'eaters'),
 ('forests', 'extended', 'grassland'),
 ('general', 'hannibal', 'took'),
 ('heavy', 'work', 'like'),
 ('high', 'concentration', 'silica'),
 ('ii', 'status', 'allows'),
 ('includes', 'mammoth', 'mastodon'),
 ('little', 'wear', 'indicating')]

## Weighted Tag Based Extraction


We will now look at a slightly different approach to extracting keyphrases. This method borrows concepts from two papers, namely K. Barker and N. Cornacchia, "Using Noun Phrase Heads to Extract Document Keyphrases", and Ian Witten et al., "KEA: Practical Automatic Keyphrase Extraction", which you can consult for further details on their experiments and approaches.


Our algorithm follows a two-step process:

- Extract all noun phrase chunks using shallow parsing
- Compute TF-IDF weights for each chunk and return the top weighted phrases


For the first step, we will use a simple pattern based on parts of speech (POS) tags to extract noun phrase chunks. 

Chunking is the process of extracting phrases from unstructured text: a sentence is analyzed to identify its constituents (noun groups, verbs, verb groups, etc.). However, chunking specifies neither their internal structure nor their role in the main sentence. It works on top of POS tagging.

We use our sample description of elephants taken from Wikipedia.


In [88]:
elephants[:10]

['Elephants are the largest living land mammals.',
 'The largest elephant recorded was one shot in Angola, 1974.',
 'It weighed 27,060 pounds (13.5 tons) and stood 13 feet 8 inches tall.',
 'Their skin color is grey.',
 'At birth, an elephant calf may weigh as much as 100 kg (225 pounds).',
 'The baby elephant develops for 20 to 22 months inside its mother.',
 'No other land animal takes this long to develop before being born.',
 'In the wild, elephants have strong family relationship.',
 'Their ways of acting toward other elephants are hard for people to understand.',
 'They "talk" to each other with very low sounds.']

### Simple Text Pre-processor

We use this only to remove unnecessary special characters and extra whitespace. Unlike `normalize_document`, it does not lower-case the text or remove stopwords.

In [89]:
def normalize_document_simple(doc):
    # remove special characters and digits, then trim surrounding whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # re-create document from whitespace stripped tokens
    doc = ' '.join([token.strip() for token in tokens])
    return doc

In [90]:
norm_elephants_simple = list(filter(None, [normalize_document_simple(line)
                                  for line in elephants]))
norm_elephants_simple[:10]

['Elephants are the largest living land mammals',
 'The largest elephant recorded was one shot in Angola',
 'It weighed pounds tons and stood feet inches tall',
 'Their skin color is grey',
 'At birth an elephant calf may weigh as much as kg pounds',
 'The baby elephant develops for to months inside its mother',
 'No other land animal takes this long to develop before being born',
 'In the wild elephants have strong family relationship',
 'Their ways of acting toward other elephants are hard for people to understand',
 'They talk to each other with very low sounds']

Now that we have our corpus ready, we will use the pattern, 

__`" NP: {<DT>? <JJ>* <NN.*>+}"`__ 

to extract all possible noun phrases from our corpus of sentences. You can experiment later with more sophisticated patterns that incorporate verb, adjective, or even adverb phrases.

However, we keep things simple and concise here to focus on the core logic. Once we have our pattern, we define a function to parse the sentences and extract these phrases.
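To see what this pattern matches before running the full function, the following standard-library sketch applies an equivalent regular expression to hand-tagged tokens. The POS tags are assigned by hand for illustration (the notebook obtains them from `nltk.pos_tag`), and `chunk_noun_phrases` is a hypothetical helper, not part of NLTK.

```python
import re

# Equivalent of NP: {<DT>? <JJ>* <NN.*>+} written over a string of <TAG> markers:
# an optional determiner, any number of adjectives, one or more noun tags.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+")

def chunk_noun_phrases(tagged_tokens):
    """Return the noun phrases found in a list of (word, POS tag) pairs."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Every token contributes exactly one "<", so counting "<" maps
        # character offsets in tag_string back to token positions.
        start = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        phrases.append(" ".join(word for word, _ in tagged_tokens[start:start + length]))
    return phrases

# Hand-tagged example sentences (tags are assumptions, not tagger output).
sentence_a = [("Their", "PRP$"), ("skin", "NN"), ("color", "NN"),
              ("is", "VBZ"), ("grey", "JJ")]
sentence_b = [("The", "DT"), ("baby", "NN"), ("elephant", "NN"), ("develops", "VBZ"),
              ("inside", "IN"), ("its", "PRP$"), ("mother", "NN")]

print(chunk_noun_phrases(sentence_a))
print(chunk_noun_phrases(sentence_b))
```

The trailing adjective `grey` is not returned as a phrase because the pattern requires at least one noun.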

In [91]:
import itertools
stopwords = nltk.corpus.stopwords.words('english')

In [93]:
def get_chunks(sentences, grammar = r'NP: {<DT>? <JJ>* <NN.*>+}', stopword_list=stopwords):
    
    all_chunks = []
    chunker = nltk.chunk.regexp.RegexpParser(grammar)
    
    for sentence in sentences:
        
        tagged_sents = [nltk.pos_tag(nltk.word_tokenize(sentence))]   
        
        chunks = [chunker.parse(tagged_sent) 
                      for tagged_sent in tagged_sents]
        
        wtc_sents = [nltk.chunk.tree2conlltags(chunk)
                         for chunk in chunks]    
        
        flattened_chunks = list(
                            itertools.chain.from_iterable(
                                wtc_sent for wtc_sent in wtc_sents)
                           )
        
        valid_chunks_tagged = [(status, [wtc for wtc in chunk]) 
                                   for status, chunk 
                                       in itertools.groupby(flattened_chunks, 
                                                lambda word_pos_chunk: word_pos_chunk[2] != 'O')]
        
        valid_chunks = list(set([' '.join(word.lower() 
                                for word, tag, chunk in wtc_group 
                                    if word.lower() not in stopword_list) 
                                        for status, wtc_group in valid_chunks_tagged
                                            if status]))
        
        if valid_chunks and valid_chunks not in all_chunks:
            all_chunks.append(valid_chunks)
    
    return all_chunks

The function above uses a predefined grammar pattern to chunk, or extract, noun phrases:

- We define a chunker over the pattern and apply it to each sentence in the document.
- We first annotate the sentence with POS tags and then build a shallow parse tree in which noun phrases are the chunks and all other words are chinks, that is, words that are not part of any chunk.
- We then use the `tree2conlltags` function to generate (w, t, c) triples: the word, its POS tag, and its IOB-formatted chunk tag.
- We drop all triples with a chunk tag of 'O', since these words do not belong to any chunk.
- Finally, we join the words of each remaining chunk group, after removing stopwords, to form a phrase.

_Refer to Text Analytics with Python Chapter 3 to dive into shallow parsing and chunking if needed_
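The grouping step can be illustrated without NLTK. In the sketch below, the (word, tag, chunk) triples are written by hand in the format that `tree2conlltags` produces, and `group_chunks` is a hypothetical helper that mirrors only the `itertools.groupby` step of `get_chunks` (stopword removal and de-duplication are left out).

```python
from itertools import groupby

def group_chunks(wtc_triples):
    """Join consecutive words whose chunk tag is not 'O' into phrases."""
    return [
        " ".join(word.lower() for word, _, _ in group)
        for in_chunk, group in groupby(wtc_triples, key=lambda triple: triple[2] != "O")
        if in_chunk
    ]

# Hand-written triples: B-NP starts a noun phrase, I-NP continues it, O is outside.
simple = [("Their", "PRP$", "O"), ("skin", "NN", "B-NP"), ("color", "NN", "I-NP"),
          ("is", "VBZ", "O"), ("grey", "JJ", "O")]

# Two noun phrases that sit directly next to each other.
adjacent = [("elephants", "NNS", "B-NP"), ("the", "DT", "B-NP"), ("trunk", "NN", "I-NP")]

print(group_chunks(simple))
print(group_chunks(adjacent))
```

Note that the grouping key only checks whether the tag is 'O' and ignores the B-/I- prefix, so noun phrases that are directly adjacent are merged into a single phrase.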

In [94]:
chunks = get_chunks(norm_elephants_simple)

In [95]:
chunks[:50]

[['living land mammals', 'elephants'],
 ['angola', 'elephant', 'shot'],
 ['feet inches', 'pounds tons'],
 ['skin color'],
 ['elephant calf', 'kg pounds'],
 ['mother', 'months', 'baby'],
 ['land'],
 ['wild elephants', 'strong family relationship'],
 ['elephants', 'people', 'ways'],
 ['low sounds'],
 ['elephants sounds', 'low people'],
 ['sounds', 'elephants'],
 ['strong leathery skin', 'elephants'],
 ['elephants', 'genera'],
 ['asian elephants elephas maximus', 'african loxodonta africanus'],
 ['obvious part', 'trunk'],
 ['upper lip', 'trunk'],
 ['food', 'objects', 'trunk'],
 ['rest', 'trunk', 'elephants hide'],
 ['elephants trunk', 'elephants', 'symbiotic ants', 'inside', 'acacia trees'],
 ['tusks', 'elephants'],
 ['tusks', 'upper jaws', 'large teeth'],
 ['elephant tusks', 'ivory', 'lot'],
 ['ivory traders', 'many elephants'],
 ['trunk'],
 ['blows', 'trunk'],
 ['signal', 'elephants', 'wildlife'],
 ['african elephants', 'ears'],
 ['leaves branches', 'grass', 'lot', 'grazers'],
 ['body',

### TF-IDF Weighting of Chunks

We now build on our `get_chunks()` function to implement Step 2: we build a TF-IDF model over the extracted keyphrases using `gensim` and compute a TF-IDF weight for each keyphrase based on its occurrence in the corpus.

Finally, we sort the keyphrases by their TF-IDF weights and return the top N, where `top_n` is specified by the user.
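To make the resulting weights easier to interpret, here is a minimal standard-library sketch of TF-IDF weighting over lists of chunks. It assumes raw term frequency, an inverse document frequency of log2(N / df), and L2 normalization per sentence, which we believe corresponds to gensim's `TfidfModel` defaults; the function name `tfidf_weights` and the toy chunk lists are illustrative only.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (each document is a list of terms)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted_docs = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {term: tf * math.log2(n_docs / doc_freq[term])
               for term, tf in term_freq.items()}
        # Scale each document vector to unit length (L2 normalization).
        norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
        weighted_docs.append({term: (weight / norm if norm else 0.0)
                              for term, weight in raw.items()})
    return weighted_docs

# Toy chunk lists, one per sentence (illustrative only).
toy_chunks = [
    ["skin color"],
    ["feet inches", "pounds tons"],
    ["elephants", "trunk"],
    ["elephants", "tusks"],
]

for doc_weights in tfidf_weights(toy_chunks):
    print({term: round(weight, 3) for term, weight in doc_weights.items()})
```

Under these assumptions, a sentence whose only chunk is unique gets a weight of 1.0, and a sentence with two equally rare chunks gets about 0.707 for each, which is the pattern visible in the keyphrase weights further down.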

In [96]:
from gensim import corpora, models

def get_tfidf_weighted_keyphrases(sentences, 
                                  grammar=r'NP: {<DT>? <JJ>* <NN.*>+}',
                                  top_n=10):
    
    valid_chunks = get_chunks(sentences, grammar=grammar)
                                     
    dictionary = corpora.Dictionary(valid_chunks)
    corpus = [dictionary.doc2bow(chunk) for chunk in valid_chunks]
    
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    
    weighted_phrases = {dictionary.get(idx): value 
                           for doc in corpus_tfidf 
                               for idx, value in doc}
                            
    weighted_phrases = sorted(weighted_phrases.items(), 
                              key=itemgetter(1), reverse=True)
    weighted_phrases = [(term, round(wt, 3)) for term, wt in weighted_phrases]
    
    return weighted_phrases[:top_n]

In [97]:
get_tfidf_weighted_keyphrases(sentences=norm_elephants_simple, top_n=50)

[('skin color', 1.0),
 ('land', 1.0),
 ('low sounds', 1.0),
 ('many circuses', 1.0),
 ('living land mammals', 0.956),
 ('sounds', 0.956),
 ('strong leathery skin', 0.956),
 ('genera', 0.956),
 ('tourists', 0.94),
 ('obvious part', 0.866),
 ('upper lip', 0.866),
 ('blows', 0.866),
 ('last molar wears', 0.796),
 ('gomphotheres teeth', 0.796),
 ('largescale', 0.796),
 ('ways', 0.795),
 ('sequence', 0.795),
 ('war', 0.795),
 ('criminals', 0.762),
 ('range exists', 0.762),
 ('elephants gestation', 0.762),
 ('rides', 0.742),
 ('conservation efforts', 0.742),
 ('products ivory meat', 0.742),
 ('feet inches', 0.707),
 ('pounds tons', 0.707),
 ('elephant calf', 0.707),
 ('kg pounds', 0.707),
 ('strong family relationship', 0.707),
 ('wild elephants', 0.707),
 ('elephants sounds', 0.707),
 ('low people', 0.707),
 ('african loxodonta africanus', 0.707),
 ('asian elephants elephas maximus', 0.707),
 ('ivory traders', 0.707),
 ('many elephants', 0.707),
 ('different species', 0.707),
 ('today many 

We can also use the `keywords` function from gensim's `summarization` module to extract keywords or phrases from text. Note that this module was removed in gensim 4.0, so the cell below requires an earlier gensim release.

It uses a variation of the TextRank algorithm, which we will explore when we cover document summarization.

In [98]:
from gensim.summarization import keywords

key_words = keywords(' '.join(elephants), ratio=1.0, scores=True, lemmatize=True)
[(item, round(score, 3)) for item, score in key_words][:50]

[('elephant recorded', 0.339),
 ('african', 0.127),
 ('called', 0.094),
 ('babies', 0.091),
 ('female', 0.09),
 ('people', 0.088),
 ('lives', 0.085),
 ('species', 0.084),
 ('grass', 0.082),
 ('south', 0.078),
 ('legal', 0.076),
 ('humans', 0.075),
 ('loxodonta', 0.074),
 ('ears', 0.073),
 ('animals', 0.072),
 ('aquatic', 0.072),
 ('reduced forests', 0.071),
 ('long', 0.07),
 ('estimate', 0.07),
 ('pounds', 0.07),
 ('upper', 0.07),
 ('strong family', 0.07),
 ('different', 0.068),
 ('large teeth', 0.068),
 ('mother', 0.064),
 ('eat', 0.064),
 ('inches', 0.062),
 ('gomphotheres', 0.062),
 ('feet', 0.062),
 ('ivory comes', 0.062),
 ('mainly', 0.061),
 ('namibia', 0.058),
 ('groups', 0.057),
 ('botswana', 0.057),
 ('mammals', 0.057),
 ('modern', 0.057),
 ('low', 0.057),
 ('indians', 0.057),
 ('sounds', 0.054),
 ('countries sport', 0.054),
 ('allows restricted', 0.054),
 ('carry blood', 0.054),
 ('land', 0.053),
 ('asians', 0.053),
 ('remaining population probably succumbed', 0.052),
 ('tanz

## Bonus: Extraction using RAKE

RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing how frequently each word appears and how often it co-occurs with other words in the text.

The full paper is available on [ResearchGate](https://www.researchgate.net/profile/Stuart_Rose/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents/links/55071c570cf27e990e04c8bb.pdf).
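The core scoring idea can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than the `rake-nltk` implementation: candidate phrases are the runs of words between stopwords and punctuation, each word is scored by its degree (the total length of the phrases it appears in) divided by its frequency, and a phrase is scored by the sum of its word scores. The tiny stopword set, the sample text, and the function name `rake_sketch` are assumptions made for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (the notebook uses NLTK's English list).
STOPWORDS = {"are", "the", "is", "in", "of", "and", "a", "their", "to", "have"}

def rake_sketch(text, stopwords=STOPWORDS):
    # 1. Split on punctuation, then on stopwords, to get candidate phrases.
    phrases = []
    for fragment in re.split(r'[.,;:!?()"]+', text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # 2. Word frequency and degree (co-occurrence count, including the word itself).
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # 3. Phrase score = sum of degree / frequency over its words.
    scores = {" ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
              for phrase in set(phrases)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rake_sketch("Elephants are the largest living land mammals. "
                     "Their skin color is grey.")
print(ranked)
```

Because a word's degree grows with the length of the phrases it appears in, long phrases of rare words receive the highest scores under this scheme.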

In [57]:
!pip install rake-nltk

Collecting rake-nltk
  Downloading https://files.pythonhosted.org/packages/8e/c4/b4ff57e541ac5624ad4b20b89c2bafd4e98f29fd83139f3a81858bdb3815/rake_nltk-1.0.4.tar.gz
Building wheels for collected packages: rake-nltk
  Building wheel for rake-nltk (setup.py) ... done
  Created wheel for rake-nltk: filename=rake_nltk-1.0.4-py2.py3-none-any.whl size=7819 sha256=e43dc0866028b602cc81ee0818194c746a29a7823625591db618173480431176
  Stored in directory: /root/.cache/pip/wheels/ef/92/fc/271b3709e71a96ffe934b27818946b795ac6b9b8ff8682483f
Successfully built rake-nltk
Installing collected packages: rake-nltk
Successfully installed rake-nltk-1.0.4


In [58]:
from rake_nltk import Rake

In [59]:
r = Rake()

In [99]:
# Extraction given the text.
r.extract_keywords_from_text(' '.join(elephants))

In [100]:
# To get keyword phrases ranked highest to lowest with scores.
r.get_ranked_phrases_with_scores()[:50]

[(31.0, 'stood 13 feet 8 inches tall'),
 (24.5, 'heavy work like lifting trees'),
 (23.0, 'range exists outside protected areas'),
 (21.666666666666668, 'actual family elephantidae – evolved'),
 (19.476190476190474, 'another female elephant often stays'),
 (18.328735632183907, 'ivory traders killed many elephants'),
 (16.078735632183907, 'indian elephants eat mainly grass'),
 (16.0, 'carthaginian general hannibal took'),
 (15.362068965517242, 'elephants avoid acacia trees'),
 (14.8, 'favoured specialist grass feeders'),
 (14.666666666666666, 'teeth show little wear'),
 (14.166666666666666, 'forest group loxodonta cyclotis'),
 (14.0, 'largest living land mammals'),
 (13.976190476190476, 'elephant calf may weigh'),
 (13.862068965517242, 'asian elephants elephas maximus'),
 (13.666666666666666, 'savanna group loxodonta africanus'),
 (13.6, 'weigh around 120 kg'),
 (13.5, 'calf ") every four'),
 (13.333333333333334, 'remaining population probably succumbed'),
 (13.333333333333334, 'forced 