A keyword extraction app powered by Natural Language Processing (NLP) is designed to automatically identify and extract the most relevant and significant words or phrases from a given text. This app leverages advanced NLP techniques such as tokenization, part-of-speech tagging, and statistical or machine learning models to analyze large volumes of unstructured data and highlight key information. Whether it's used for summarizing articles, improving search engine optimization (SEO), or enhancing content analysis, the app streamlines the process of understanding and organizing textual data efficiently.

In [7]:
# import numpy and pandas for data analysis and manipulation

import numpy as np
import pandas as pd

In [12]:
# create a dataframe out of the csv file and keep only its first 5000 rows

df = pd.read_csv('papers.csv')
df = df.iloc[:5000,:]

In [3]:
df.head(10) # check that the dataframe was created successfully by viewing its first 10 rows

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."
5,1002,1994,Using a neural net to instantiate a deformable...,,1002-using-a-neural-net-to-instantiate-a-defor...,Abstract Missing,U sing a neural net to instantiate a\ndeformab...
6,1003,1994,Plasticity-Mediated Competitive Learning,,1003-plasticity-mediated-competitive-learning.pdf,Abstract Missing,Plasticity-Mediated Competitive Learning\n\nTe...
7,1004,1994,ICEG Morphology Classification using an Analog...,,1004-iceg-morphology-classification-using-an-a...,Abstract Missing,ICEG Morphology Classification using an\nAnalo...
8,1005,1994,Real-Time Control of a Tokamak Plasma Using Ne...,,1005-real-time-control-of-a-tokamak-plasma-usi...,Abstract Missing,Real-Time Control of a Tokamak Plasma\nUsing N...
9,1006,1994,Pulsestream Synapses with Non-Volatile Analogu...,,1006-pulsestream-synapses-with-non-volatile-an...,Abstract Missing,Real-Time Control of a Tokamak Plasma\nUsing N...


In [4]:
df.shape # checking the dimensions of the dataframe

(5000, 7)

In [5]:
df.isnull().sum() # checking number of null values in each column

id               0
year             0
title            0
event_type    4335
pdf_name         0
abstract         0
paper_text       0
dtype: int64

In [6]:
df['paper_text'][0] # inspect the first paper, i.e. the first value of column 'paper_text', since this is the column we will work on

'767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABASE\nAND ITS APPLICATIONS\nHisashi Suzuki and Suguru Arimoto\nOsaka University, Toyonaka, Osaka 560, Japan\nABSTRACT\nAn efficient method of self-organizing associative databases is proposed together with\napplications to robot eyesight systems. The proposed databases can associate any input\nwith some output. In the first half part of discussion, an algorithm of self-organization is\nproposed. From an aspect of hardware, it produces a new style of neural network. In the\nlatter half part, an applicability to handwritten letter recognition and that to an autonomous\nmobile robot system are demonstrated.\n\nINTRODUCTION\nLet a mapping f : X -+ Y be given. Here, X is a finite or infinite set, and Y is another\nfinite or infinite set. A learning machine observes any set of pairs (x, y) sampled randomly\nfrom X x Y. (X x Y means the Cartesian product of X and Y.) And, it computes some\nestimate j : X -+ Y of f to make small, the estimation erro

Now we have to do the following things with the text data:

1) convert all the text to lowercase

2) remove all the HTML tags from the text

3) remove all special characters and digits from the text

4) create a list of all the words in the text

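5) remove all the stop words, like 'is', 'am', 'are' etc., from the list of words

6) remove all words with fewer than 3 characters from the list

7) lemmatize all the words in the list, i.e. convert them to their dictionary form, like 'tags' to 'tag' (the WordNet lemmatizer treats every word as a noun by default, so verb forms such as 'running' stay unchanged unless a part-of-speech tag is passed)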

In [8]:
# now we have to import some more libraries

import re # module for regular expressions to remove punctuation marks, special characters, and digits
import nltk # natural language processing toolkit to tokenize text, i.e. create a list of words from a paragraph
from nltk.corpus import stopwords # to import built-in stop words so we can identify and remove them from the list of words
from nltk.stem.wordnet import WordNetLemmatizer # to lemmatize words, i.e. convert them to their base form

# first run only: download the NLTK data used below (the exact resource names can vary by NLTK version)
# nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')



In [9]:
stop_words = set(stopwords.words('english')) # load the built-in English stop words as a set

# define some extra stop words to add to the set
new_stop_words = ["fig","figure","image","sample","using", "show", "result", "large", "also", "one", "two", "three", "four", "five", "seven","eight","nine"]

stop_words = stop_words.union(new_stop_words) # combine both into one set; membership tests on a set are faster than on a list

In [10]:
# now we create a function that performs all 7 tasks listed above

def preprocess_text(txt):
    # convert all the text to lower case
    txt = txt.lower()

    # remove all the HTML tags using regular expression
    txt = re.sub(r"<.*?>", " ", txt)

    # remove special characters, punctuation marks and digits using regular expression
    txt = re.sub(r"[^a-zA-Z]", " ", txt)

    # tokenize the text into a list of words
    txt = nltk.word_tokenize(txt)

    # remove stopwords
    txt = [word for word in txt if word not in stop_words]

    # remove words with fewer than three characters
    txt = [word for word in txt if len(word) >= 3]

    # lemmatize the words by creating an object of WordNetLemmatizer class and using lemmatize() method for each word in the list
    lmtr = WordNetLemmatizer()
    txt = [lmtr.lemmatize(word) for word in txt]

    return " ".join(txt) # return a string of all the words remaining in the list separated by a space

In [12]:
# checking if our function works by testing it on a sample text
preprocess_text("HELLO word loving moving the to from 99999 *&^ <p>This is a <b>sample</b> text with <i>HTML tags</i>.</p>")

'hello word loving moving text html tag'

In [13]:
docs = df['paper_text'].apply(preprocess_text) # create a pandas Series 'docs' by applying 'preprocess_text' to every value of column 'paper_text'

To build the keyword extraction app, we will now use a count vectorizer together with the TF-IDF algorithm.

TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical technique used to evaluate how important a word is in a document relative to a collection of documents (corpus). In a keyword extraction app, TF-IDF plays a crucial role by assigning a weight to each term based on how frequently it appears in a single document (term frequency) and how rare it is across all documents (inverse document frequency). This helps in highlighting unique and meaningful words while downplaying common ones that appear frequently across the corpus but contribute little to the specific context. By focusing on terms with high TF-IDF scores, the app effectively extracts keywords that are most representative and informative for each individual document.
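The weighting described above can be sketched on a toy corpus. This is a minimal illustration of the textbook formula (term frequency times the log of inverse document frequency), not scikit-learn's exact variant. The three documents and the helper name `tf_idf` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, already preprocessed into tokens)
corpus = [
    "neural network learning".split(),
    "neural network model".split(),
    "matrix factorization model".split(),
]

def tf_idf(doc, corpus):
    """Return {term: tf * idf} for one tokenized document (textbook formula)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc)                     # share of the document taken up by the term
        df = sum(1 for d in corpus if term in d)  # number of documents that contain the term
        idf = math.log(n_docs / df)               # rarer terms get a larger idf
        scores[term] = tf * idf
    return scores

scores = tf_idf(corpus[0], corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # 'learning' ranks first: it occurs in only one of the three documents
```

All three words occur once in the first document, so the ranking is decided entirely by how rare each word is across the corpus.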

CountVectorizer is a scikit-learn class that converts a collection of text documents into a matrix of token counts. In the context of a keyword extraction app, CountVectorizer helps by breaking down the text into individual words (tokens) and counting how many times each word appears in each document. This results in a simple "bag-of-words" model where each document is represented as a vector of word frequencies. While it doesn’t consider the importance or uniqueness of a word across documents like TF-IDF, CountVectorizer is useful for identifying commonly occurring terms within a document, making it a good baseline for keyword extraction, especially when working with smaller or more uniform datasets.
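To make the bag-of-words idea concrete, here is a hand-rolled sketch of a count matrix over unigrams and bigrams. It is a simplified stand-in for CountVectorizer, not its implementation: it splits on whitespace only, and the helper `ngram_counts` and the two demo documents are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count every n-gram of the requested sizes in a whitespace-tokenized text."""
    tokens = text.split()
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

docs_demo = ["matrix factorization update rule", "update rule update"]

# Vocabulary: every unigram or bigram seen in any document, in sorted order
vocab = sorted(set().union(*(ngram_counts(d) for d in docs_demo)))

# One row per document, one column per vocabulary entry, cells hold the counts
matrix = [[ngram_counts(d)[term] for term in vocab] for d in docs_demo]

for row in matrix:
    print(row)
```

Each row has one entry per vocabulary term, and a zero marks a word or phrase that does not occur in that document.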

In [15]:
from sklearn.feature_extraction.text import CountVectorizer # import CountVectorizer from scikit-learn

cv = CountVectorizer(max_features=6000, ngram_range=(1, 2)) # create a CountVectorizer object which keeps only 6000 most frequent words or phrases and considers unigrams (single word) and bigrams (two word phrase)

word_count_vectors = cv.fit_transform(docs) # fit the CountVectorizer object to 'docs' and transform it into a matrix where:

# Each row is a document, each column is a word or word pair learned, each cell shows how many times that word appeared in that document

In [16]:
from sklearn.feature_extraction.text import TfidfTransformer # import the TF-IDF transformer from scikit-learn

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True) # create a TF-IDF transformer with the following parameters set to True:

# smooth_idf: in TF-IDF, IDF is used to reduce the weight of common words that appear in many documents.

# Without smoothing, a word that appears in no document would cause a division by zero when its IDF is computed.

# So, when smooth_idf=True, 1 is added to every document frequency and to the document count, as if one extra document contained every word exactly once.

# This makes the math safer and more stable, especially for rare or very common words.

# use_idf: makes sure that IDF weighting is used in the calculation

tfidf_transformer.fit(word_count_vectors) # fit the transformer to the count matrix formed in the previous step

# so that it learns an IDF weight for each word or phrase from the number of documents that contain it
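The effect of smoothing can be checked by hand. This sketch follows the IDF formulas given in the scikit-learn documentation for TfidfTransformer: ln((1 + n) / (1 + df)) + 1 with smoothing and ln(n / df) + 1 without. The helper name `idf` and the document frequencies are illustrative choices, not values taken from the corpus.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """IDF of a term found in doc_freq out of n_docs documents."""
    if smooth:
        # behaves as if one extra document contained every term exactly once
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

n = 5000  # number of documents kept in this notebook
for df_t in (1, 50, 5000):
    print(df_t, round(idf(n, df_t), 3), round(idf(n, df_t, smooth=False), 3))
```

Because of the trailing `+ 1`, a term that occurs in every document keeps a weight of 1 rather than dropping to 0. The unsmoothed formula cannot handle a document frequency of 0.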

In [1]:
def sort_coo(coo_matrix): # takes the tf-idf scores of one document as a sparse matrix in coordinate (COO) format
    tuples = zip(coo_matrix.col, coo_matrix.data) # pair each column index (the index of the word in the vocabulary)
    # with its value (the tf-idf score of that word in the document)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True) # return the pairs as a list sorted first by
    # tf-idf score and, for equal scores, by column index

    # reverse=True means that the list is sorted in descending order so that the most important words come first

In [2]:
# create a function that takes 3 things: the list of words or phrases learned by the vectorizer,
# a sorted list of (column index, tf-idf score) pairs,
# and the number of keywords to return (10 by default)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    sorted_items = sorted_items[:topn] # select the top 'n' items from 'sorted_items'

    results = {} # dictionary that maps each keyword to its tf-idf score

    for idx, score in sorted_items:
        # look up the word by its column index and store its score rounded to 3 decimal places
        results[feature_names[idx]] = round(score, 3)
    return results
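A quick way to see how the sorting and top-n steps fit together is to run them on hand-made data. In this sketch a `SimpleNamespace` stands in for the SciPy COO matrix, since only its `.col` and `.data` attributes are read. The scores and feature names are invented, and both functions are restated in compact form so the block runs on its own.

```python
from types import SimpleNamespace

# Stand-in for one document's tf-idf row in COO format (made-up scores)
row = SimpleNamespace(col=[0, 1, 2, 3], data=[0.12, 0.45, 0.45, 0.30])
feature_names_demo = ["matrix", "update", "update rule", "nmf"]

def sort_coo(coo_matrix):
    """Return (column index, score) pairs, highest score first."""
    pairs = zip(coo_matrix.col, coo_matrix.data)
    return sorted(pairs, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Map the top-n feature names to their rounded scores."""
    return {feature_names[idx]: round(score, 3) for idx, score in sorted_items[:topn]}

top = extract_topn_from_vector(feature_names_demo, sort_coo(row), topn=3)
print(top)
```

The two tied scores are ordered by column index in descending order, because the index is the second element of the sort key and `reverse=True` applies to the whole key.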

In [None]:
feature_names = cv.get_feature_names_out() # get the words and phrases learned by count vectorizer

# create a function to get the most important keywords

def get_keywords(idx, docs):
    # take the document at index 'idx' in 'docs', turn it into word counts with the count vectorizer and convert those counts into tf-idf scores
    tf_idf_vector = tfidf_transformer.transform(cv.transform([docs[idx]]))

    # convert the tf-idf matrix to coordinate format and sort it using the 'sort_coo' function
    sorted_items = sort_coo(tf_idf_vector.tocoo())

    # extract the top 10 keywords with their scores
    keywords = extract_topn_from_vector(feature_names, sorted_items, 10)
    
    return keywords

In [4]:
# create a function to print a document's title, abstract and keywords alongside their tf-idf scores

def print_results(idx, keywords, df):
    # now print the results
    print("\n=====Title=====")
    print(df['title'][idx])
    print("\n=====Abstract=====")
    print(df['abstract'][idx])
    print("\n===Keywords===")
    
    for k in keywords:
        print(k,keywords[k])

In [19]:
# test that the functions work by taking the document at index 941 and getting its keywords

idx = 941
keywords = get_keywords(idx, docs)
print_results(idx,keywords, df)


=====Title=====
Algorithms for Non-negative Matrix Factorization

=====Abstract=====
Non-negative matrix factorization (NMF) has previously been shown to 
be a useful decomposition for multivariate data. Two different multi- 
plicative algorithms for NMF are analyzed. They differ only slightly in 
the multiplicative factor used in the update rules. One algorithm can be 
shown to minimize the conventional least squares error while the other 
minimizes the generalized Kullback-Leibler divergence. The monotonic 
convergence of both algorithms can be proven using an auxiliary func- 
tion analogous to that used for proving convergence of the Expectation- 
Maximization algorithm. The algorithms can also be interpreted as diag- 
onally rescaled gradient descent, where the rescaling factor is optimally 
chosen to ensure convergence. 

===Keywords===
update rule 0.365
update 0.317
auxiliary 0.238
rule 0.205
nmf 0.196
multiplicative 0.194
matrix factorization 0.182
matrix 0.176
factorization 0.

In [20]:
# using the pickle module, save the transformer, the vectorizer and the learned words and phrases as byte streams

import pickle

# 'with' closes each file as soon as the object has been written
with open('tfidf_transformer.pkl', 'wb') as f:
    pickle.dump(tfidf_transformer, f)
with open('count_vectorizer.pkl', 'wb') as f:
    pickle.dump(cv, f)
with open('feature_names.pkl', 'wb') as f:
    pickle.dump(feature_names, f)
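The app that consumes these files will need to read them back. This is a minimal sketch of that step, assuming a small helper called `load_pickle` (a name introduced here, not part of the notebook). It is demonstrated on a stand-in list in a temporary directory so that it runs without the real `.pkl` files.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Load one pickled object and close the file afterwards."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with a stand-in object; in the app, the paths would instead be
# 'tfidf_transformer.pkl', 'count_vectorizer.pkl' and 'feature_names.pkl'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_names_demo.pkl")
    with open(path, "wb") as f:
        pickle.dump(["matrix", "update rule"], f)
    restored = load_pickle(path)

print(restored)
```

Only unpickle files you created yourself or otherwise trust, because loading a pickle can execute arbitrary code. It is also safest to load scikit-learn objects with the same scikit-learn version that saved them.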