### What is Keyword Extraction?
Keyword extraction is the natural language processing (NLP) task of automatically identifying a set of terms that best describe the subject of a text. It is an important technique in information retrieval (IR) systems, where keywords simplify and speed up research. Keyword extraction can also be used to reduce the dimensionality of text for further analysis, such as topic modeling or text classification.

In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv(r'C:\Users\amany\Desktop\archive datasets\papers.csv')
print(df.shape)
df.head()

(7241, 7)


|   | id | year | title | event_type | pdf_name | abstract | paper_text |
|---|----|------|-------|------------|----------|----------|------------|
| 0 | 1 | 1987 | Self-Organization of Associative Database and ... |  | 1-self-organization-of-associative-database-an... | Abstract Missing | 767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA... |
| 1 | 10 | 1987 | A Mean Field Theory of Layer IV of Visual Cort... |  | 10-a-mean-field-theory-of-layer-iv-of-visual-c... | Abstract Missing | 683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU... |
| 2 | 100 | 1988 | Storing Covariance by the Associative Long-Ter... |  | 100-storing-covariance-by-the-associative-long... | Abstract Missing | 394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n... |
| 3 | 1000 | 1994 | Bayesian Query Construction for Neural Network... |  | 1000-bayesian-query-construction-for-neural-ne... | Abstract Missing | Bayesian Query Construction for Neural\nNetwor... |
| 4 | 1001 | 1994 | Neural Network Ensembles, Cross Validation, an... |  | 1001-neural-network-ensembles-cross-validation... | Abstract Missing | Neural Network Ensembles, Cross\nValidation, a... |


#### Preprocess Textual Data

In [2]:
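import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Requires the NLTK 'stopwords' and 'wordnet' corpora (e.g. via nltk.download)
# Note: this rebinds the name `stopwords` from the NLTK module to a set of words
stopwords = set(stopwords.words('english'))

# Add custom stopwords that are frequent in papers but carry little meaning
new_words = ['fig', 'figure', 'image', 'sample', 'show', 'result', 'large', 'using', 'also',
             'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
# Keep a set (not a list) so membership tests during preprocessing stay fast
stopwords = stopwords.union(new_words)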

In [3]:
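lm = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, strip non-alphanumerics, drop stopwords, and lemmatize."""
    text = text.lower()
    # Replace every character that is not a letter or digit with a space
    text = re.sub(r'[^A-Za-z0-9]', ' ', text)
    tokens = text.split()
    tokens = [lm.lemmatize(word) for word in tokens if word not in stopwords]
    return ' '.join(tokens)

docs = df['paper_text'].apply(preprocess)
docs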

0       767 self organization associative database app...
1       683 mean field theory layer iv visual cortex a...
2       394 storing covariance associative long term p...
3       bayesian query construction neural network mod...
4       neural network ensemble cross validation activ...
                              ...                        
7236    single transistor learning synapsis paul hasle...
7237    bias variance combination least square estimat...
7238    real time clustering cmos neural engine serran...
7239    learning direction global motion class psychop...
7240    correlation interpolation network real time ex...
Name: paper_text, Length: 7241, dtype: object

In [4]:
docs[0]

'767 self organization associative database application hisashi suzuki suguru arimoto osaka university toyonaka osaka 560 japan abstract efficient method self organizing associative database proposed together application robot eyesight system proposed database associate input output first half part discussion algorithm self organization proposed aspect hardware produce new style neural network latter half part applicability handwritten letter recognition autonomous mobile robot system demonstrated introduction let mapping f x given x finite infinite set another finite infinite set learning machine observes set pair x sampled randomly x x x x mean cartesian product x computes estimate j x f make small estimation error measure usually say faster decrease estimation error increase number sample better learning machine however expression performance incomplete since lack consideration candidate j j assumed preliminarily find good learning machine clarify conception let u discus type learni
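
To see why hyphenated terms such as "non-negative" end up as separate tokens in the cleaned text, here is a minimal sketch of just the lowercasing and regex step. The helper name `basic_clean` is hypothetical, and stopword removal and lemmatization are left out on purpose so the snippet runs without NLTK:

```python
import re

def basic_clean(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9]', ' ', text)
    return text.split()

# Hyphens, parentheses, commas, and periods all become token boundaries
tokens = basic_clean("Non-negative Matrix Factorization (NMF), Fig. 2")
print(tokens)
```

Because punctuation is replaced by spaces rather than deleted, "non-negative" is split into "non" and "negative", which is worth keeping in mind when reading n-gram keywords later on.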

### Using TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. The weight of a word increases in proportion to the number of times it appears in a document (term frequency, TF) but is offset by how many documents in the corpus contain it (inverse document frequency, IDF).
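
As a rough illustration of this weighting, the sketch below computes TF-IDF by hand on a toy corpus. The corpus and the helper names (`smoothed_idf`, `tf_idf`) are made up for this example. It assumes the smoothed IDF variant, ln((1 + n) / (1 + df)) + 1, which is what Scikit-learn documents for `smooth_idf=True`, and it skips the L2 normalization that Scikit-learn applies by default:

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already cleaned and whitespace-tokenizable
corpus = [
    "neural network learning",
    "neural network ensemble",
    "matrix factorization update rule",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tf_idf(doc_tokens):
    """Raw term count times smoothed IDF, without any normalization."""
    counts = Counter(doc_tokens)
    return {term: count * smoothed_idf(term) for term, count in counts.items()}

scores = tf_idf(tokenized[0])
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(term, round(score, 3))
```

In the first document every term occurs once, so the ranking is driven purely by IDF: "learning" appears in only one document and therefore scores higher than "neural" and "network", which appear in two.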

Under the TF-IDF weighting scheme, the keywords of a document are the terms with the highest TF-IDF scores. For this task, I’ll first use the CountVectorizer class in Scikit-learn to build a vocabulary and generate the word counts:

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer(max_df=0.95,         # ignore terms that appear in more than 95% of documents
                   max_features=10000,  # keep only the 10,000 most frequent terms
                   ngram_range=(1,3)    # vocabulary contains unigrams, bigrams, and trigrams
                  )
word_count_vector=cv.fit_transform(docs)
word_count_vector

<7241x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 6447304 stored elements in Compressed Sparse Row format>
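
To make the effect of `ngram_range=(1,3)` concrete, here is a small pure-Python sketch of word n-gram generation. The function `word_ngrams` is a hypothetical helper written for illustration, not Scikit-learn's implementation, and it ignores the vectorizer's own tokenization rules:

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

grams = word_ngrams("non negative matrix factorization".split())
print(grams)
```

A four-word phrase yields four unigrams, three bigrams, and two trigrams, so overlapping terms such as "negative matrix" and "non negative matrix" can both enter the vocabulary and both show up as keywords.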

Now I’m going to use the TfidfTransformer in Scikit-learn to compute the inverse document frequency (IDF) weights:

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidftransformer = TfidfTransformer(smooth_idf=True,use_idf=True)
tfidftransformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [9]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Return a dict mapping the top-n feature names to their tf-idf scores."""

    # use only the top-n items from the sorted vector
    sorted_items = sorted_items[:topn]

    # map each feature name to its rounded score, preserving rank order
    results = {}
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    return results

# get feature names
# (scikit-learn >= 1.0; older releases used cv.get_feature_names(), which was removed in 1.2)
feature_names=cv.get_feature_names_out()

def get_keywords(idx, docs):

    #generate tf-idf for the given document
    tf_idf_vector=tfidftransformer.transform(cv.transform([docs[idx]]))

    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    
    return keywords

def print_results(idx, keywords, df):
    """Print the title, abstract, and extracted keywords for one paper."""
    print("\n=====Title=====")
    print(df['title'][idx])
    print("\n=====Abstract=====")
    print(df['abstract'][idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k, keywords[k])

idx=941
keywords=get_keywords(idx, docs)
print_results(idx, keywords, df)


=====Title=====
Algorithms for Non-negative Matrix Factorization

=====Abstract=====
Non-negative matrix factorization (NMF) has previously been shown to 
be a useful decomposition for multivariate data. Two different multi- 
plicative algorithms for NMF are analyzed. They differ only slightly in 
the multiplicative factor used in the update rules. One algorithm can be 
shown to minimize the conventional least squares error while the other 
minimizes the generalized Kullback-Leibler divergence. The monotonic 
convergence of both algorithms can be proven using an auxiliary func- 
tion analogous to that used for proving convergence of the Expectation- 
Maximization algorithm. The algorithms can also be interpreted as diag- 
onally rescaled gradient descent, where the rescaling factor is optimally 
chosen to ensure convergence. 

===Keywords===
ht 0.651
ht ht 0.262
update rule 0.238
update 0.197
auxiliary 0.146
non negative matrix 0.146
negative matrix 0.145
rule 0.133
nmf 0.126
multipli
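
The ranking step used above (sort all non-zero entries by score, then keep the first ten) can also be sketched with `heapq.nlargest`, which avoids sorting the whole vector. This is an alternative written for illustration, not the code used in this notebook. The helper name `top_n_items` and the sample numbers are hypothetical:

```python
import heapq

def top_n_items(indices, scores, n=10):
    """Return the n (index, score) pairs with the highest scores.

    Ties on score are broken by the larger index, matching a descending
    sort on the key (score, index).
    """
    pairs = zip(indices, scores)
    return heapq.nlargest(n, pairs, key=lambda p: (p[1], p[0]))

# Made-up column indices and tf-idf scores for one document
top = top_n_items([4, 7, 2, 9], [0.2, 0.65, 0.2, 0.1], n=3)
print(top)
```

With a COO matrix, the two arguments would be its `col` and `data` attributes, as in `sort_coo`.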

In [11]:
df.iloc[941,:]

id                                                         1861
year                                                       2000
title          Algorithms for Non-negative Matrix Factorization
event_type                                                  NaN
pdf_name      1861-algorithms-for-non-negative-matrix-factor...
abstract      Non-negative matrix factorization (NMF) has pr...
paper_text    Algorithms for Non-negative Matrix\nFactorizat...
Name: 941, dtype: object