# Keyword Extraction

Keyword extraction is the task of automatically identifying the set of terms that best describe the subject of a document.


### Automatic keyword extraction algorithms used

- Rapid Automatic Keyword Extraction (RAKE), using the `rake-nltk` Python implementation
- Gensim implementation of TextRank
- Yet Another Keyword Extractor (YAKE)


In this kernel we apply each of these keyword extraction approaches to the NIPS Papers dataset.

## Loading the Dataset

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [2]:
# load the dataset
df = pd.read_csv('/kaggle/input/nips-papers/papers.csv')
df.head()

| | id | year | title | event_type | pdf_name | abstract | paper_text |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | Self-Organization of Associative Database and ... | | 1-self-organization-of-associative-database-an... | Abstract Missing | 767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA... |
| 1 | 10 | 1987 | A Mean Field Theory of Layer IV of Visual Cort... | | 10-a-mean-field-theory-of-layer-iv-of-visual-c... | Abstract Missing | 683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU... |
| 2 | 100 | 1988 | Storing Covariance by the Associative Long-Ter... | | 100-storing-covariance-by-the-associative-long... | Abstract Missing | 394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n... |
| 3 | 1000 | 1994 | Bayesian Query Construction for Neural Network... | | 1000-bayesian-query-construction-for-neural-ne... | Abstract Missing | Bayesian Query Construction for Neural\nNetwor... |
| 4 | 1001 | 1994 | Neural Network Ensembles, Cross Validation, an... | | 1001-neural-network-ensembles-cross-validation... | Abstract Missing | Neural Network Ensembles, Cross\nValidation, a... |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7241 entries, 0 to 7240
Data columns (total 7 columns):
id            7241 non-null int64
year          7241 non-null int64
title         7241 non-null object
event_type    2422 non-null object
pdf_name      7241 non-null object
abstract      7241 non-null object
paper_text    7241 non-null object
dtypes: int64(2), object(5)
memory usage: 396.1+ KB


In [4]:
# count the rows whose abstract is the placeholder string
print("{} abstracts are missing".format((df['abstract'] == 'Abstract Missing').sum()))

3317 abstracts are missing


In [5]:
import pprint
sample = 4114
#2551
#3113

pprint.pprint("TITLE:{}".format(df['title'][sample]))
pprint.pprint("ABSTRACT:{}".format(df['abstract'][sample]))
pprint.pprint("FULL TEXT:{}".format(df['paper_text'][sample][:2000]))

'TITLE:Density Propagation and Improved Bounds on the Partition Function'
('ABSTRACT:Given a probabilistic graphical model, its density of states is a '
 'function that, for any likelihood value, gives the number of configurations '
 'with that probability. We introduce a novel message-passing algorithm called '
 'Density Propagation (DP) for estimating this function. We show that DP is '
 'exact for tree-structured graphical models and is, in general, a strict '
 'generalization of both sum-product and max-product algorithms. Further, we '
 'use density of states and tree decomposition to introduce a new family of '
 'upper and lower bounds on the partition function. For any tree decompostion, '
 'the new upper bound based on finer-grained density of state information is '
 'provably at least as tight as previously known bounds based on convexity of '
 'the log-partition function, and strictly stronger if a general condition '
 'holds. We conclude with empirical evidence of improvemen

## Pre-processing the Data

In [6]:
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
##Creating a list of custom stopwords
new_words = ["fig","figure","image","sample","using", 
             "show", "result", "large", 
             "also", "one", "two", "three", 
             "four", "five", "seven","eight","nine"]
stop_words = list(stop_words.union(new_words))

def pre_process(text):
    
    # lowercase
    text=text.lower()
    
    #remove tags
    text=re.sub("</?.*?>"," <> ",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    ##Convert to list from string
    text = text.split()
    
    # remove stopwords
    text = [word for word in text if word not in stop_words]

    # remove words less than three letters
    text = [word for word in text if len(word) >= 3]

    # lemmatize
    lmtzr = WordNetLemmatizer()
    text = [lmtzr.lemmatize(word) for word in text]
    
    return ' '.join(text)
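
To see what the individual steps do without downloading any NLTK data, here is a minimal sketch of the same pipeline on one sentence. It assumes a tiny hand-picked stopword set and skips lemmatization; `TOY_STOP_WORDS` and `pre_process_sketch` are hypothetical names used only for this illustration, not part of the notebook.

```python
import re

# Hypothetical, tiny stopword set for illustration only
# (the notebook itself uses NLTK's English list plus custom words).
TOY_STOP_WORDS = {"the", "of", "is", "a", "we", "for", "this", "and", "in", "fig"}

def pre_process_sketch(text, stop_words=TOY_STOP_WORDS):
    text = text.lower()                          # lowercase
    text = re.sub(r"</?.*?>", " <> ", text)      # replace tags with a placeholder
    text = re.sub(r"(\d|\W)+", " ", text)        # drop digits and special characters
    tokens = text.split()
    tokens = [w for w in tokens if w not in stop_words]  # remove stopwords
    tokens = [w for w in tokens if len(w) >= 3]          # remove very short words
    return " ".join(tokens)

print(pre_process_sketch("We show in Fig. 3 that <b>DP</b> is exact for 2 trees."))
# prints: show that exact trees
```

Note that the length filter also removes short acronyms such as "DP", which may or may not be desirable for keyword extraction.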

In [7]:
%%time
docs = df['paper_text'].apply(lambda x:pre_process(x))

CPU times: user 3min 27s, sys: 975 ms, total: 3min 28s
Wall time: 3min 29s


## Gensim implementation of TextRank summarization algorithm

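Gensim is a free Python library designed to automatically extract semantic topics from documents. Its `summarization` module provides keyword extraction and summarization based on the TextRank algorithm, which ranks words with a PageRank-style computation over a word co-occurrence graph. Note that `gensim.summarization` was removed in Gensim 4.0, so the cells below require a 3.x release.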
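
As a rough illustration of the idea behind TextRank, the sketch below links words that occur close to each other and then runs a PageRank-style update over that graph. It is a simplified, unweighted version and not gensim's actual implementation, which adds further steps such as the `pos_filter` and `lemmatize` options of its `keywords` function. The names `textrank_keywords`, `window`, `damping`, and `iterations` are assumptions made for this example.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_n=5):
    """Rank tokens with a simplified, unweighted TextRank."""
    # Build an undirected co-occurrence graph: two different words are
    # linked if one appears within `window` positions after the other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Repeated PageRank update: each word receives a share of the
    # score of every neighbor, split evenly among that neighbor's links.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

tokens = "density propagation density bound density partition density tree".split()
print(textrank_keywords(tokens, window=1, top_n=1))
```

In this toy input every other word is linked only to "density", so "density" collects the highest score.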


### Small text

In [8]:
import gensim
text = "Given a probabilistic graphical model, its density of states is a " + \
"distribution that," +\
"for any likelihood value, gives the number of configurations with that "+ \
"probability. We introduce a novel message-passing algorithm called Density" + \
"Propagation" + \
"(DP) for estimating this distribution. We show that DP is exact for "+ \
"tree-structured" + \
"graphical models and is, in general, a strict generalization of both "+ \
"sum-product and" +\
"max-product algorithms. Further, we use density of states and tree "+ \
"decomposition" + \
"to introduce a new family of upper and lower bounds on the partition "+ \
"function."+ \
"For any tree decomposition, the new upper bound based on finer-grained "+ \
"density"+ \
"of state information is provably at least as tight as previously known " + \
"bounds based" + \
"on convexity of the log-partition function, and strictly stronger if a "+ \
"general condition holds. We conclude with empirical evidence of improvement "+ \
"over convex" + \
"relaxations and mean-field based bounds."
gensim.summarization.keywords(text, 
         ratio=0.5,               # use 50% of original text
         words=None,              # number of keywords to return (overrides ratio)
         split=True,              # return a list of keywords instead of one string
         scores=False,            # whether to return the score of each keyword
         pos_filter=('NN', 'JJ'), # part-of-speech filter (nouns, adjectives)
         lemmatize=True,          # if True, lemmatize words
         deacc=True)              # if True, remove accentuation

['bound based',
 'algorithms',
 'partition',
 'value',
 'tree models',
 'generalization',
 'condition',
 'densityof state',
 'graphical',
 'message',
 'decompositionto',
 'new',
 'called',
 'known',
 'basedon',
 'upper',
 'strictly']
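
Some of the entries above (`'densityof state'`, `'decompositionto'`, `'basedon'`) are not real terms. They appear to come from how `text` is assembled: several string fragments end without a trailing space, so `+` glues the last word of one fragment to the first word of the next. The sketch below reproduces the effect on two hypothetical fragments and shows one possible way to avoid it.

```python
# Two fragments written the same way as in the cell above:
# the first one has no trailing space.
fragments = ["its density", "of states"]

fused = fragments[0] + fragments[1]   # plain concatenation fuses the words
fixed = " ".join(fragments)           # joining with a space keeps them apart

print(fused)  # its densityof states
print(fixed)  # its density of states
```

Rebuilding `text` this way would change the keyword lists shown in this section, so the outputs here should be read with the fused tokens in mind.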

In [9]:
print("SUMMARY: ", gensim.summarization.summarize(text,
                                                  ratio = 0.5,
                                                  split = True))

SUMMARY:  ['Given a probabilistic graphical model, its density of states is a distribution that,for any likelihood value, gives the number of configurations with that probability.', 'We introduce a novel message-passing algorithm called DensityPropagation(DP) for estimating this distribution.']


### Large text

In [10]:
def get_keywords_gensim(idx, docs):
    
    keywords=gensim.summarization.keywords(docs[idx], 
                                  ratio=None, 
                                  words=10,         
                                  split=True,             
                                  scores=False,           
                                  pos_filter=None, 
                                  lemmatize=True,         
                                  deacc=True)              
    
    return keywords

def print_results_gensim(idx,keywords, df):
    # now print the results
    print("\n=====Title=====")
    print(df['title'][idx])
    print("\n=====Abstract=====")
    print(df['abstract'][idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k)

In [11]:
idx=4114
keywords=get_keywords_gensim(idx, docs)
print_results_gensim(idx,keywords, df)


=====Title=====
Density Propagation and Improved Bounds on the Partition Function

=====Abstract=====
Given a probabilistic graphical model, its density of states is a function that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this function. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decompostion, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations 

## Rapid Automatic Keyword Extraction algorithm (RAKE)


### Setup using pip

In [12]:
!pip install rake-nltk

Collecting rake-nltk
  Downloading https://files.pythonhosted.org/packages/3b/e5/18876d587142df57b1c70ef752da34664bb7dd383710ccf3ccaefba2aa0c/rake_nltk-1.0.6-py3-none-any.whl
Collecting nltk<4.0.0,>=3.6.2 (from rake-nltk)
  Downloading https://files.pythonhosted.org/packages/c5/ea/84c7247f5c96c5a1b619fe822fb44052081ccfbe487a49d4c888306adec7/nltk-3.6.7-py3-none-any.whl (1.5MB)
Collecting regex>=2021.8.3 (from nltk<4.0.0,>=3.6.2->rake-nltk)
  Downloading https://files.pythonhosted.org/packages/4a/13/5fb3cb045a40baa76e32e1403b4f356c8f60db706ad59f1ac8ec549efbaa/regex-2022.6.2-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (671kB)
ERROR: allennlp 0.9.0 requires flaky, which is not installed.
ERROR: allennlp 0.9.0 requires responses>=0.7, which is not installed.
ERROR: preprocessing 0

### Small text

In [13]:
text = "Given a probabilistic graphical model, its density of states is a " + \
"distribution that," +\
"for any likelihood value, gives the number of configurations with that "+ \
"probability. We introduce a novel message-passing algorithm called Density" + \
"Propagation" + \
"(DP) for estimating this distribution. We show that DP is exact for "+ \
"tree-structured" + \
"graphical models and is, in general, a strict generalization of both "+ \
"sum-product and" +\
"max-product algorithms. Further, we use density of states and tree "+ \
"decomposition" + \
"to introduce a new family of upper and lower bounds on the partition "+ \
"function."+ \
"For any tree decomposition, the new upper bound based on finer-grained "+ \
"density"+ \
"of state information is provably at least as tight as previously known " + \
"bounds based" + \
"on convexity of the log-partition function, and strictly stronger if a "+ \
"general condition holds. We conclude with empirical evidence of improvement "+ \
"over convex" + \
"relaxations and mean-field based bounds."

In [14]:
from rake_nltk import Rake
r = Rake()
r.extract_keywords_from_text(text)
r.get_ranked_phrases_with_scores()[:10]

[(23.333333333333336, 'previously known bounds basedon convexity'),
 (16.0, 'passing algorithm called densitypropagation'),
 (16.0, 'grained densityof state information'),
 (13.0, 'new upper bound based'),
 (9.833333333333334, 'field based bounds'),
 (9.0, 'probabilistic graphical model'),
 (8.0, 'general condition holds'),
 (7.0, 'tree decompositionto introduce'),
 (5.333333333333334, 'lower bounds'),
 (5.0, 'new family')]
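
The scores above can be understood with a short sketch of RAKE's default degree-to-frequency scoring. Candidate phrases are the runs of words between stopwords and punctuation; each word scores its degree (the summed length of the phrases it occurs in) divided by its frequency, and a phrase scores the sum of its words. This is a simplified stand-in, not the `rake-nltk` code: `rake_sketch` and `TOY_STOP_WORDS` are hypothetical names, and the stopword set is a tiny assumption where the library uses NLTK's list.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stopword set for illustration only.
TOY_STOP_WORDS = {"a", "and", "for", "is", "of", "on", "the", "to", "we", "with"}

def rake_sketch(text, stop_words=TOY_STOP_WORDS):
    """Score candidate phrases by RAKE's degree-to-frequency ratio."""
    # 1. Candidate phrases: maximal runs of non-stopwords inside each
    #    punctuation-delimited fragment.
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop_words:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # 2. Word statistics: frequency, and degree as the summed length
    #    of every phrase the word occurs in.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # 3. Phrase score is the sum of degree/frequency over its words.
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    return sorted(((s, p) for p, s in scored.items()), reverse=True)

print(rake_sketch("deep learning for graphs, and deep models"))
# prints: [(4.0, 'deep models'), (4.0, 'deep learning'), (1.0, 'graphs')]
```

This is consistent with the output above: `'lower bounds'` scores 5.33 if `lower` contributes 2/1 and `bounds`, which occurs in three listed phrases of lengths 5, 3, and 2, contributes 10/3. Long phrases therefore tend to rank first, whether or not they are meaningful.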

### Large text

In [15]:
def get_keywords_rake(idx, docs, n=10):
    # Uses NLTK's English stopwords and all punctuation characters by default
    r = Rake()
    
    # Extract keywords from a 1,000-character slice of the text.
    r.extract_keywords_from_text(docs[idx][1000:2000])
    
    # To get keyword phrases ranked highest to lowest.
    keywords = r.get_ranked_phrases()[0:n]
    
    return keywords

def print_results(idx,keywords, df):
    # now print the results
    print("\n=====Title=====")
    print(df['title'][idx])
    print("\n=====Abstract=====")
    print(df['abstract'][idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k)

In [16]:
idx=4114
keywords = get_keywords_rake(idx, df['paper_text'], n=10)
print_results(idx, keywords, df)


=====Title=====
Density Propagation and Improved Bounds on the Partition Function

=====Abstract=====
Given a probabilistic graphical model, its density of states is a function that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this function. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decompostion, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations 

## Yet Another Keyword Extractor (YAKE)


In [17]:
!pip install git+https://github.com/LIAAD/yake

Collecting git+https://github.com/LIAAD/yake
  Cloning https://github.com/LIAAD/yake to /tmp/pip-req-build-j7xxrw7c
  Running command git clone -q https://github.com/LIAAD/yake /tmp/pip-req-build-j7xxrw7c
Collecting segtok (from yake==0.4.8)
  Downloading https://files.pythonhosted.org/packages/dd/60/d384dbae5d4756e33f1750fa3472303de2c827011907a64e213e114d0556/segtok-1.5.11-py3-none-any.whl
Collecting jellyfish (from yake==0.4.8)
  Downloading https://files.pythonhosted.org/packages/26/18/cd485f3661c8e8c0ab864c2e54033371dcc1f7e75767318a4044b2808ed4/jellyfish-0.9.0.tar.gz (132kB)
Building wheels for collected packages: yake, jellyfish
  Building wheel for yake (setup.py) ... done
  Created wheel for yake: filename=yake-0.4.8-py2.py3-none-any.whl size=62573 sha256=47b8d5e48fc452c9cb25af7a3722d6954996360e82199e5015d2fd098d576486
  Stored in directory: /tmp/pip-ephem-wheel-cache-wfkgpgz1/wheels/b

In [18]:
import yake

def get_keywords_yake(idx, docs):
    y = yake.KeywordExtractor(lan='en',          # language
                             n = 3,              # maximum n-gram size
                             dedupLim = 0.9,     # deduplication threshold
                             dedupFunc = 'seqm', # deduplication algorithm
                             windowsSize = 1,
                             top = 10,           # number of keywords to return
                             features=None)           
    
    # extract from the requested document, not from the global `text`
    keywords = y.extract_keywords(docs[idx])
    return keywords

idx= 4114
keywords = get_keywords_yake(idx, docs)
print_results(idx, keywords, df)


=====Title=====
Density Propagation and Improved Bounds on the Partition Function

=====Abstract=====
Given a probabilistic graphical model, its density of states is a function that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this function. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decompostion, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations 