<a href="https://colab.research.google.com/github/AyushiKashyapp/NLP/blob/main/KeywordExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keyword Extraction

Keyword extraction can be used to automatically index data, summarize text, or generate tag clouds from the most representative keywords.

In [14]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [9]:
from google.colab import files
import pandas as pd

uploaded = files.upload()


Saving keyword_extraction.csv to keyword_extraction (3).csv


In [10]:
data = pd.read_csv("keyword_extraction (3).csv")
data.head()

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


**Text Pre-processing**

- Lower case conversion.
- Remove tags.
- Remove special characters and digits.
- Remove stopwords.
- Remove words shorter than three letters.
- Lemmatize.

In [15]:
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words('english'))

# Creating a list of custom stop words
new_words = ["fig", "figure", "image", "sample", "using", "show", "result", "large", "also", "one", "two", "three", "four","five", "seven", "eight", "nine"]
stop_words = stop_words.union(new_words) #Keep a set so membership tests stay fast

def pre_process(text):

  text = text.lower() #Lower case
  text = re.sub("</?.*?>"," <> ",text) #Remove tags
  text = re.sub("(\\d|\\W)+"," ",text) #Remove special characters and digits
  text = text.split() #Convert from string to list
  text = [word for word in text if word not in stop_words] #Remove stopwords
  text = [word for word in text if len(word) >= 3] #Remove words shorter than three letters

  lmtzr = WordNetLemmatizer() #Lemmatize
  text = [lmtzr.lemmatize(word) for word in text]

  return ' '.join(text)

docs = data['paper_text'].apply(pre_process)
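
To see what these steps do to a single string, here is a minimal sketch that needs only the standard library. It is a simplification, not the notebook's function: `toy_stop_words`, `toy_pre_process`, and the sample sentence are hypothetical, the stop-word list is a tiny stand-in for NLTK's, and lemmatization is left out.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list plus custom words.
toy_stop_words = {"the", "of", "and", "in", "is", "are", "figure"}

def toy_pre_process(text):
    text = text.lower()                        # Lower case
    text = re.sub(r"</?.*?>", " ", text)       # Remove tags
    text = re.sub(r"(\d|\W)+", " ", text)      # Remove special characters and digits
    words = [w for w in text.split() if w not in toy_stop_words]  # Remove stopwords
    words = [w for w in words if len(w) >= 3]  # Drop words shorter than three letters
    return " ".join(words)                     # Lemmatization is omitted in this sketch

sample = "<b>Figure 3</b>: The stability of 2 learning algorithms is bounded."
cleaned = toy_pre_process(sample)
print(cleaned)
```

With a lemmatizer added as the last step, plural forms such as "algorithms" would typically be reduced to their singular form as well.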

**TF - IDF**

TF-IDF stands for Term Frequency - Inverse Document Frequency. A word's importance increases in proportion to the number of times it appears in the document (Term Frequency) but is offset by how many documents in the corpus contain it (Inverse Document Frequency).

Under the TF-IDF weighting scheme, the keywords of a document are the terms with the highest TF-IDF scores.
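
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand on a toy corpus. The corpus and the helper names `smoothed_idf` and `tfidf` are hypothetical. The IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is the variant scikit-learn applies when `smooth_idf = True`; scikit-learn additionally L2-normalizes each document vector by default, which this sketch skips.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
corpus = [
    "stability bound algorithm",
    "neural network algorithm",
    "neural network ensemble",
]

def smoothed_idf(term, documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf(document, documents):
    """Raw term count times smoothed IDF for every term in one document."""
    counts = Counter(document.split())
    return {term: count * smoothed_idf(term, documents) for term, count in counts.items()}

scores = tfidf(corpus[0], corpus)
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(score, 3))
```

Here "algorithm" appears in two of the three documents, so it scores lower than "stability" and "bound", which appear in only one.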

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(
    max_df = 0.95, #Ignore terms that appear in more than 95% of the documents.
    max_features = 10000, #Keep only the 10,000 most frequent terms in the vocabulary.
    ngram_range = (1,3) #Vocabulary contains single words, bigrams, and trigrams.
)

word_count_vector = cv.fit_transform(docs)
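
To make the effect of `ngram_range = (1,3)` concrete, here is a plain-Python sketch that lists the candidate terms generated from one pre-processed string. The helper `word_ngrams` is hypothetical and only approximates what the vectorizer does internally; it ignores tokenization details, `max_df`, and `max_features`.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("regularization network bound")
print(grams)
```

This is why multi-word keywords such as "regularization network" can appear among the extracted results alongside single words.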

**Preparing the word count to transform into TF-IDF scores.**

Use scikit-learn's TfidfTransformer to compute the inverse document frequency (IDF) of each term from the word counts.

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

# smooth_idf adds 1 to every document frequency (as if one extra document contained every term), which prevents division by zero when computing the IDF.
tfidf_transformer = TfidfTransformer(smooth_idf = True, use_idf = True)
tfidf_transformer.fit(word_count_vector)

Next, define helper functions that extract keywords from a document using its TF-IDF scores.

In [23]:
# Convert the COO Matrix (Coordinate List matrix) into a list of (column index, value)
# tuples and sort them in descending order by value and then column index
def sort_coo(coo_matrix):
  tuples = zip(coo_matrix.col, coo_matrix.data)
  return sorted(tuples, key = lambda x: (x[1], x[0]), reverse = True)

# Extract the top N features and their TF-IDF scores from the sorted items.
def extract_topn_from_vector(feature_names, sorted_items, topn = 10):
  """Get the feature names and TF-IDF scores of the top n items."""

  sorted_items = sorted_items[:topn] #Use only the top n items from the vector.

  score_vals = []
  feature_vals = []

  for idx, score in sorted_items:
    fname = feature_names[idx]

    score_vals.append(round(score, 3)) #Keep track of each feature name and its corresponding score.
    feature_vals.append(fname)

  results = {} #Map each feature name to its score.
  for idx in range(len(feature_vals)):
    results[feature_vals[idx]] = score_vals[idx]

  return results

feature_names = cv.get_feature_names_out() #Get feature names

# Transform the document at the given index into a TF-IDF vector,
# sort the vector by score, and extract the top N keywords.
def get_keywords(idx, docs):
  tf_idf_vector = tfidf_transformer.transform(cv.transform([docs[idx]])) #Generate tf-idf for the given document.
  sorted_items = sort_coo(tf_idf_vector.tocoo()) #Sort the tf-idf vectors by descending order of scores.
  keywords = extract_topn_from_vector(feature_names, sorted_items, 10) #Extract only the top n, here n is 10.
  return keywords

# Prints the title and abstract of the document.
# Prints the extracted keywords and their scores.
def print_results(idx, keywords, df):
  print("\n-------Title--------")
  print(df['title'][idx])
  print("\n------Abstract------")
  print(df['abstract'][idx])
  print("\n-----Keywords-------")
  for k in keywords:
    print(k, keywords[k])

idx = 200
keywords = get_keywords(idx, docs)
print_results(idx, keywords, data)


-------Title--------
Algorithmic Stability and Generalization Performance

------Abstract------
Abstract Missing

-----Keywords-------
stability 0.293
bound 0.272
regularization network 0.251
algorithm 0.196
inequality 0.183
bound generalization 0.176
stable 0.162
regularization 0.149
generalization 0.148
generalization error 0.147
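
The ranking step can also be tried in isolation. The sketch below applies the same sort key, score first and then column index, both descending, to a handful of made-up `(column index, score)` pairs. `toy_features` and `toy_pairs` are hypothetical stand-ins for the vocabulary and for one row of the TF-IDF matrix.

```python
# Hypothetical vocabulary and (column index, score) pairs for one document.
toy_features = ["algorithm", "bound", "stability", "network"]
toy_pairs = [(0, 0.196), (1, 0.272), (2, 0.293), (3, 0.272)]

# Sort by score, then by column index, both in descending order.
ranked = sorted(toy_pairs, key=lambda pair: (pair[1], pair[0]), reverse=True)

# Keep the top 3 and map each feature name to its rounded score.
top_keywords = {toy_features[i]: round(score, 3) for i, score in ranked[:3]}
print(top_keywords)
```

Because ties are broken by column index in descending order, "network" (index 3) ranks ahead of "bound" (index 1) even though their scores are equal.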
