## Resource Keyword Creation Using NLTK

The following code automatically generates keyword phrases from the text of resources related to the RDA. The process involves the following steps:
1. Load the resources as PDFs and convert their raw text into a list of strings.
2. Strip boilerplate and punctuation from the strings.
3. Tokenize the strings, that is, split them into candidate phrases.
4. Vectorize the candidate phrases, removing stopwords, very frequent and very infrequent phrases, and syntactically meaningless phrases.
5. Rank the vectorized phrases according to their [TF-IDF](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/) scores and keep a specified number of keyword phrases.
6. Manually remove unsuitable phrases.

In the broadest sense, we build a corpus of documents, score how often each phrase occurs within a document, and then weight that score by how rare the phrase is across the whole corpus. This should (hopefully) return the phrases that are most characteristic of each specific document.
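
To make the weighting concrete, here is a minimal pure-Python sketch of the idea on a toy corpus. The documents and the `tf_idf` helper are invented for illustration, and the formula (raw count multiplied by a smoothed inverse document frequency) is a simplification rather than the exact configuration used later in this notebook.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already split into phrases.
docs = [
    ["research data", "data management", "research data", "metadata schema"],
    ["research data", "data transfer", "data transfer"],
    ["research data", "persistent identifier"],
]

def tf_idf(docs):
    """Return one {phrase: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each phrase occur?
    df = Counter(phrase for doc in docs for phrase in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # Smoothed IDF: phrases found in every document get the lowest weight.
            phrase: count * (math.log((1 + n_docs) / (1 + df[phrase])) + 1)
            for phrase, count in tf.items()
        })
    return scores

scores = tf_idf(docs)
best_for_doc_1 = max(scores[1], key=scores[1].get)
print(best_for_doc_1)  # prints "data transfer"
```

"research data" occurs in every toy document, so it receives the minimum weight, while "data transfer" is both repeated and unique to the second document.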

In [41]:
import os
import re
import sys
import regex
import PyPDF2
import pandas as pd
import numpy as np
import urllib
import urllib.request

import nltk
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

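nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("punkt") # Sentence tokenizer data, needed by nltk.sent_tokenize in the tokenizer class further down.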

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
# Path to the folder containing the resources. I would recommend naming the files according to their UUIDs,
# as this makes it much easier to join the generated keywords to the dataset later.
resource_folder = r"C:\{Path to folder with resources}"

In [6]:
resource = [os.path.join(resource_folder, x) for x in os.listdir(resource_folder)] # This creates a list with all the resources
                                                                                   # as file paths.

In [10]:
def extract_text_from_pdf(pdf_file: str) -> list[str]: # Extracts the text of each page of the PDF and returns it as a list of strings.
    with open(pdf_file, 'rb') as pdf:
        reader = PyPDF2.PdfReader(pdf, strict=False)
        pdf_text = []

        for page in reader.pages:
            content = page.extract_text()
            pdf_text.append(content)
            
        return pdf_text

In [11]:
text = [] # We create an empty list and append the page texts of each resource to it, giving a list of lists of strings.

In [12]:
for x in resource:
    text.append(extract_text_from_pdf(x))

FloatObject (b'0.00000000000-22737368') invalid; use 0.0 instead
FloatObject (b'0.00000000000-45474735') invalid; use 0.0 instead
FloatObject (b'0.00000000000-45474735') invalid; use 0.0 instead
FloatObject (b'0.00000000000-25579538') invalid; use 0.0 instead
FloatObject (b'0.00000000000-25579538') invalid; use 0.0 instead
incorrect startxref pointer(1)
FloatObject (b'0.00-70') invalid; use 0.0 instead
FloatObject (b'0.00-70') invalid; use 0.0 instead
FloatObject (b'0.00-70') invalid; use 0.0 instead
FloatObject (b'0.00-70') invalid; use 0.0 instead


In [None]:
saved = text.copy() 

In [25]:
text = [str(item) for item in text] # The vectorizer we build later throws an error if it is fed a list of lists, so we convert
                                    # each item to a string.
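
The cleaning step further down has to strip brackets, quotes and literal `\n` sequences because `str()` applied to a list returns its Python representation. The sketch below, using made-up page text, shows that effect next to `" ".join(...)`, which is one possible alternative and not what this notebook does.

```python
# Hypothetical page texts standing in for the pages of one PDF.
pages = ["First page\nline two", "Second page"]

as_repr = str(pages)         # list repr: brackets, quotes, escaped newlines
as_joined = " ".join(pages)  # plain text: pages separated by a space

print(as_repr)
print(as_joined)
```

With the joined form the newline is a real whitespace character, so a later `split()` removes it without any special handling.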

In [26]:
text = [
    re.sub(r"●", "",
    re.sub(r"\s[a-zA-Z]\.", "",
    re.sub(r"\(\d+\)", "",
    re.sub(r"10\.\d+/[a-z]+\d+", "",
    re.sub(r"doi:10\.\d+/[a-z]+\d+", "",
    re.sub(r"https?://\S+", "",
    " ".join(x.replace("\\n", "").replace("[", "").replace("]", "").replace("'", "").split()).lower()))))))
    for x in text
] # This list comprehension strips out boilerplate text, URLs and punctuation. The calls run from the innermost outwards:
  # list residue and whitespace, URLs, DOIs, bracketed numbers, single-letter labels, then bullet characters.
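
The nested calls above can be hard to follow, so here is a sketch of the regular-expression part written as an ordered list of patterns. The `clean` helper and the sample string are illustrative assumptions; the patterns mirror the ones above, written as raw strings with the dot in the DOI prefix escaped.

```python
import re

# Patterns are applied in order: URLs first, so DOI links disappear whole.
PATTERNS = [
    r"https?://\S+",           # URLs
    r"doi:10\.\d+/[a-z]+\d+",  # DOIs with a "doi:" prefix
    r"10\.\d+/[a-z]+\d+",      # bare DOIs
    r"\(\d+\)",                # numbered markers such as (12)
    r"\s[a-zA-Z]\.",           # single-letter labels such as " a."
    "●",                       # bullet characters
]

def clean(raw: str) -> str:
    """Lowercase, collapse whitespace, then delete each pattern in turn."""
    cleaned = " ".join(raw.split()).lower()
    for pattern in PATTERNS:
        cleaned = re.sub(pattern, "", cleaned)
    return cleaned

sample = "● See   https://example.org/x and doi:10.1234/abc567 (3) for Details"
print(clean(sample).split())
```

Deleting matches leaves gaps of several spaces behind. That is harmless here because the tokenizer splits on whitespace anyway.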

In [42]:
# This variable stores the set of Penn Treebank part-of-speech tags that are allowed in keywords: nouns (NN*), verbs (VB*) and
# adjectives (JJ*). "NN", for example, marks a singular noun and "NNS" a plural noun.
valid_pos_tags = {'NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS'}

The class below defines a tokenizer. It uses the [Treebank Word Tokenizer](https://www.nltk.org/api/nltk.tokenize.TreebankWordTokenizer.html), which deals well with punctuation. As this tokenizer expects sentences rather than raw blocks of text, the method first splits the text into sentences. Each word is also given a part-of-speech (POS) tag, which allows uninformative words to be filtered out.

In [32]:
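class MyTokenizer:
    def tokenize(self, text):
        tokenizer = TreebankWordTokenizer()
        result = []
        letter_pattern = r"\p{letter}" # Matches any Unicode letter (requires the regex module, not re).
        for sent in nltk.sent_tokenize(text):
            tokens = tokenizer.tokenize(sent)
            tokens = [t for t in tokens if regex.search(letter_pattern, t)] # Keep tokens that contain at least one letter.
            tokens = [t for t in tokens if len(t) > 1] # Drop single-character tokens.
            tokens = [t for t in tokens if "\\" not in t] # Drop tokens that contain a backslash.
            pos_tags = pos_tag(tokens)
            tokens = [word for word, tag in pos_tags if tag in valid_pos_tags] # Keep nouns, verbs and adjectives only.
            result += tokens
        return result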
    
mytokenizer = MyTokenizer()
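
To illustrate what the part-of-speech filter keeps, the sketch below applies the same membership test to a hand-tagged sentence. The tags are written by hand for illustration rather than produced by `pos_tag`, so a real tagger may label some words differently. `keep_tags` repeats the tag set defined earlier so that the block runs on its own.

```python
# Same tag set as valid_pos_tags: nouns, verbs and adjectives.
keep_tags = {'NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN',
             'VBP', 'VBZ', 'JJ', 'JJR', 'JJS'}

# Hand-written (token, tag) pairs; an assumption, not real tagger output.
tagged = [("the", "DT"), ("repository", "NN"), ("quickly", "RB"),
          ("stores", "VBZ"), ("large", "JJ"), ("datasets", "NNS"),
          ("in", "IN"), ("it", "PRP")]

kept = [token for token, tag in tagged if tag in keep_tags]
print(kept)  # ['repository', 'stores', 'large', 'datasets']
```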

In [33]:
# This stopword list is custom built. You can add other words or remove items from this list as appropriate. I chose these
# specific words simply because inspecting the tokenized text revealed these letters or words to still be present despite them
# carrying little meaning.
mystopwords = ["'", 'b', 'c', 'e', 'f', 'g', 'h', 'j', 'l', 'n', 'p', 'r', 'u', 'v', 'w', "'d", "'ll", "'re", "'s", "'ve", 
               'could', 'might', 'must', "n't", 'need', 'sha', 'wo', 'would'] + stopwords.words("english")

In [36]:
tfidf_vectorizer = TfidfVectorizer( 
    tokenizer=mytokenizer.tokenize, # The previously defined tokenizer.
    stop_words=mystopwords, # The previously defined stopword list.
    ngram_range=(2, 3), # This sets the size of the phrases returned. In this case I found phrases of 2 or 3 words to be
                        # the most informative.
    max_df=0.75, # Ignores phrases that appear in more than 75% of the documents, as these are generally uninformative.
    min_df=0.005, # Ignores phrases that appear in fewer than 0.5% of the documents, as these are again usually uninformative.
    sublinear_tf=True # Replaces the raw term frequency tf with 1 + log(tf), which dampens phrases repeated many times in one document.
)
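
As a rough illustration of what `ngram_range=(2, 3)` does, the sketch below builds 2- and 3-word phrases from a token list after dropping stopwords, which is the order scikit-learn applies the two steps in. The `ngrams` helper and the tiny stopword set are assumptions made for this example.

```python
def ngrams(tokens, min_n=2, max_n=3):
    """Return all contiguous phrases of min_n to max_n tokens, shortest first."""
    phrases = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[start:start + n]))
    return phrases

toy_stopwords = {"for", "and"}  # hypothetical, much smaller than mystopwords
tokens = ["things", "for", "curating", "reproducible", "research"]

# Stopwords are removed before phrases are formed, so words that were
# separated by a stopword end up adjacent.
filtered = [t for t in tokens if t not in toy_stopwords]
phrases = ngrams(filtered)
print(phrases)
```

This may explain why a phrase such as "things curating" can show up among the keywords even though the two words are not adjacent in the source text.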

In [39]:
d_w = tfidf_vectorizer.fit_transform(text) 
d_w # Fitting the vectorizer returns a sparse document-term matrix of tf-idf weights: one row per document, one column per phrase.

In [43]:
feature_names = tfidf_vectorizer.get_feature_names_out() # This creates an array of the terms. 

In [46]:
# Given the sparse matrix and the array of feature names, this function returns a list of lists. Each inner list holds
# (term, tf-idf score) tuples for the top n terms of one document, where n is set by the top_n argument.
def get_top_keywords(matrix, feature_names, top_n=5):
    top_keywords_per_doc = []
    for row in matrix:
        row_data = row.toarray().flatten()
        top_indices = row_data.argsort()[-top_n:][::-1]
        top_keywords = [(feature_names[i], row_data[i]) for i in top_indices if row_data[i] > 0]
        top_keywords_per_doc.append(top_keywords)
    return top_keywords_per_doc
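
The selection logic can be seen on a single toy row without any matrices. The sketch below uses `heapq.nlargest` from the standard library on invented scores; it illustrates the ranking step and is not a replacement for the function above. The `top_phrases` helper is hypothetical.

```python
import heapq

# Hypothetical tf-idf scores for one document; 0.0 means the phrase is absent.
row = {"data transfer": 0.42, "early adopters": 0.31,
       "research data": 0.05, "metadata schema": 0.0}

def top_phrases(row, top_n=3):
    """Return up to top_n (phrase, score) pairs, highest score first."""
    present = [(phrase, score) for phrase, score in row.items() if score > 0]
    return heapq.nlargest(top_n, present, key=lambda pair: pair[1])

print(top_phrases(row, top_n=3))
print(top_phrases(row, top_n=10))
```

A document with fewer than `top_n` phrases scoring above zero yields a shorter list, which is worth keeping in mind when later steps assume the same number of keywords per document.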

In [123]:
top_keywords = get_top_keywords(d_w, feature_names, top_n=10)

In [124]:
keywords = [[x[0] for x in inner_list] for inner_list in top_keywords] # This removes the tf-idf scores, keeping only the phrases.

In [125]:
data = pd.DataFrame({"resource": resource, "keywords": keywords}) # This creates a dataframe with resource names and keywords.

In [126]:
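# Clean up the resource names and keyword lists. The extension is removed with a regex because rstrip(".pdf") strips any
# trailing ".", "p", "d" or "f" characters and can therefore cut into the file name itself.
data['resource'] = data.resource.map(os.path.basename).str.replace(r"\.pdf$", "", regex=True)
data['keywords'] = data.keywords.astype(str).str.strip("[").str.strip("]")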

In [None]:
# Split out the keyword lists into specific keywords in wide format. 
data = pd.concat([data['resource'], data['keywords'].str.split(',', expand = True).add_prefix("keyword ")], axis=1)

In [129]:
# Pivot the dataset into long format.
data = pd.melt(
    data,
    id_vars=['resource'],
    value_vars=[f'keyword {i}' for i in range(10)],
).sort_values(["resource"]).drop(columns = "variable").reset_index(drop = True)
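
If the keywords are kept as Python lists, pandas can also produce the long format directly with `DataFrame.explode`, which avoids converting the lists to strings and splitting them again. The sketch below uses an invented two-row frame and is only an alternative under that assumption. Unlike the approach above, its values carry no surrounding quote characters.

```python
import pandas as pd

# Hypothetical stand-in for the resource/keywords frame built earlier.
toy = pd.DataFrame({
    "resource": ["doc-a", "doc-b"],
    "keywords": [["data transfer", "early adopters"], ["research compendium"]],
})

# One row per (resource, keyword) pair; the index is rebuilt afterwards.
long_form = (
    toy.explode("keywords")
       .rename(columns={"keywords": "value"})
       .reset_index(drop=True)
)
print(long_form)
```

This form also copes with documents that have different numbers of keywords, since no fixed set of `keyword N` columns is needed.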

In [130]:
data # In the end we have a data frame with resources and keywords. These keywords need to be further manually pruned as not all
     # of them will make logical sense. 

| | resource | value |
|---|---|---|
| 0 | 10 Things for Curating Reproducible and FAIR R... | 'research compendium' |
| 1 | 10 Things for Curating Reproducible and FAIR R... | 'research artifacts' |
| 2 | 10 Things for Curating Reproducible and FAIR R... | 'air principles' |
| 3 | 10 Things for Curating Reproducible and FAIR R... | 'computing environment' |
| 4 | 10 Things for Curating Reproducible and FAIR R... | 'things curating' |
| ... | ... | ... |
| 2485 | presentations from the RDA17 BoF on Addressing... | 'data movement' |
| 2486 | presentations from the RDA17 BoF on Addressing... | 'early adopters' |
| 2487 | presentations from the RDA17 BoF on Addressing... | 'data transfer' |
| 2488 | presentations from the RDA17 BoF on Addressing... | 'management systems' |
