# Text mining: Extracting keywords from papers
## Abstract
Text mining consists of extracting useful information from text data.

We will first reproduce the text mining used by VOSviewer, a popular tool for analysing bibliometric networks (https://arxiv.org/pdf/1109.2058).

Here are their steps related to text mining:
<blockquote cite="http://www.worldwildlife.org/who/index.html">
    
1. <strong>Identification of noun phrases.</strong> The approach that we take is similar to what is
reported in an earlier paper (Van Eck, Waltman, Noyons, & Buter, 2010). We first
perform part-of-speech tagging (i.e., identification of verbs, nouns, adjectives, etc.).
The Apache OpenNLP toolkit (http://incubator.apache.org/opennlp/) is used for this
purpose. We then use a linguistic filter to identify noun phrases. Our filter selects all
word sequences that consist exclusively of nouns and adjectives and that end with a
noun (e.g., paper, visualization, interesting result, and text mining, but not degrees of
freedom and highly cited publication). Finally, we convert plural noun phrases into
singular ones.

2. <p>
        <strong>Selection of the most relevant noun phrases.</strong> The selected noun phrases are referred to
        as terms. We have developed a new technique for selecting the most relevant noun
        phrases. The essence of this technique is as follows. For each noun phrase, the
        distribution of (second-order) co-occurrences over all noun phrases is determined.
    </p>
    <p>
        This distribution is compared with the overall distribution of co-occurrences over
        noun phrases. The larger the difference between the two distributions (measured using
        the Kullback-Leibler distance), the higher the relevance of a noun phrase. Intuitively,
        the idea is that noun phrases with a low relevance (or noun phrases with a general
        meaning), such as paper, interesting result, and new method, have a more or less
        equal distribution of their (second-order) co-occurrences. On the other hand, noun
        phrases with a high relevance (or noun phrases with a specific meaning), such as
        visualization, text mining, and natural language processing, have a distribution of
        their (second-order) co-occurrences that is significantly biased towards certain other
        noun phrases. Hence, it is assumed that in a co-occurrence network noun phrases with
        a high relevance are grouped together into clusters. Each cluster may be seen as a
        topic.
    </p>
</blockquote>
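
To make the relevance idea quoted above concrete, here is a minimal sketch of a Kullback-Leibler score on toy data. It is a simplification and not VOSviewer's implementation: it uses plain (first-order) co-occurrence counts rather than second-order ones, and the helper names (`normalize`, `kl_divergence`) as well as the counts are hypothetical.

```python
import math
from collections import Counter


def normalize(counts: Counter) -> dict[str, float]:
    """Turn raw co-occurrence counts into a probability distribution."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Kullback-Leibler divergence D(p || q); entries with p = 0 contribute nothing."""
    return sum(pi * math.log(pi / q[term]) for term, pi in p.items() if pi > 0)


# Toy co-occurrence counts (assumption): how often each phrase appears with other phrases.
cooccurrences = {
    "paper": Counter({"method": 10, "network": 10, "cluster": 10}),
    "text mining": Counter({"method": 1, "network": 1, "cluster": 13}),
}

# Overall distribution of co-occurrences, pooled over all noun phrases.
overall = normalize(sum(cooccurrences.values(), Counter()))

# A phrase is more relevant when its distribution departs from the overall one.
relevance = {
    phrase: kl_divergence(normalize(counts), overall)
    for phrase, counts in cooccurrences.items()
}
print(relevance)
```

With these toy counts, `text mining` scores higher than `paper`, because its co-occurrences concentrate on a single phrase while those of `paper` are spread evenly.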

We won't be using Apache OpenNLP for POS tagging as they did: the later build that we tried had an issue, and being a Java library it is inconvenient to use from Python. Instead, we will start with nltk, a well-liked Python package for NLP.

Aside from nltk, we might consider these other promising NLP tools:
- Google Parsey McParseface
- Stanford CoreNLP
- Amazon Comprehend
- Flair
- Gensim
- spaCy


## Setup
We first import the dependencies and download the models used for POS tagging and tokenising.

In [6]:
import nltk
import polars as pl
from pprint import pp
import os
import spacy

adj_tags = ["JJ", "JJR", "JJS"]
noun_tags = ["NN", "NNS", "NNP", "NNPS"]

def df(name: str) -> str:
    return os.path.join("dataframes", name + ".parquet")

nltk.download('averaged_perceptron_tagger')  # pos tagger
nltk.download('punkt')  # tokenizer
nltk.download('tagsets')
nltk.download('wordnet')

!python -m spacy download en_core_web_sm

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/moonlyss/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/moonlyss/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package tagsets to /home/moonlyss/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package wordnet to /home/moonlyss/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 2.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Identification of noun phrases
### Evaluating part-of-speech (POS) tagging
Let's start by taking an abstract from our BIM papers dataset and applying POS tagging with nltk.

In [4]:
abstract = pl.scan_parquet(df("papers")).first().collect()[0, "Abstract"]
tokens = nltk.word_tokenize(abstract)
tags = nltk.pos_tag(tokens)
print("Successful tagging : ")
pp(tags[:10])
print("\nTagging failure, 'stricter' should be an adjective (JJ) :")
pp(tags[219:225])

Successful tagging : 
[('Abstract', 'NN'),
 (':', ':'),
 ('The', 'DT'),
 ('development', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('digital', 'JJ'),
 ('economy', 'NN'),
 ('has', 'VBZ'),
 ('changed', 'VBN')]

Tagging failure, 'stricter' should be an adjective (JJ) :
[('cities', 'NNS'),
 ('with', 'IN'),
 ('stricter', 'NN'),
 ('governmental', 'JJ'),
 ('environmental', 'JJ'),
 ('regulations', 'NNS')]


The result is convincing but not perfect. We can now move on to filtering and extracting noun phrases.

Method used by VOSviewer:
<blockquote>
We then use a linguistic filter to identify noun phrases. Our filter selects all
word sequences that consist exclusively of nouns and adjectives and that end with a
noun (e.g., paper, visualization, interesting result, and text mining, but not degrees of
freedom and highly cited publication). Finally, we convert plural noun phrases into
singular ones.  
</blockquote>
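
Before turning to a dataframe implementation, the filter quoted above can be sketched in plain Python. This is an illustrative version under stated assumptions: the function name `noun_phrases` and the hand-tagged sentence are hypothetical, and the plural-to-singular conversion is left out.

```python
ADJ_TAGS = {"JJ", "JJR", "JJS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_phrases(tagged: list[tuple[str, str]]) -> list[str]:
    """Return maximal noun/adjective runs, trimmed so that each ends with a noun."""
    phrases: list[str] = []
    run: list[tuple[str, str]] = []
    # The trailing sentinel flushes the last run.
    for token, tag in tagged + [("", "END")]:
        if tag in ADJ_TAGS or tag in NOUN_TAGS:
            run.append((token, tag))
            continue
        # Drop trailing adjectives so that the phrase ends with a noun.
        while run and run[-1][1] in ADJ_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(tok for tok, _ in run).lower())
        run = []
    return phrases


# Hand-tagged example (assumption), using Penn Treebank tags.
tagged = [
    ("the", "DT"), ("digital", "JJ"), ("economy", "NN"), ("shows", "VBZ"),
    ("interesting", "JJ"), ("results", "NNS"), ("for", "IN"),
    ("degrees", "NNS"), ("of", "IN"), ("freedom", "NN"),
    ("is", "VBZ"), ("new", "JJ"),
]
print(noun_phrases(tagged))
# → ['digital economy', 'interesting results', 'degrees', 'freedom']
```

As in the quoted description, "degrees of freedom" is not kept as one phrase because the preposition breaks the run, and the trailing adjective "new" is discarded because it is not followed by a noun.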

In [5]:
           
filtered_q = (
    pl.DataFrame(tags, orient="row", schema={"token": pl.String, "tag": pl.Categorical(ordering="physical")}).lazy()
    .select(
        pl.arange(0, pl.len()).alias("pos"),
        pl.all(),
        pl.col("tag").is_in(adj_tags + noun_tags).alias("keep")  # keep only nouns and adjectives
    )
    # Give each run of consecutive kept (or dropped) tokens its own id
    .with_columns(group_id=pl.col("keep").rle_id())
    .filter(pl.col("keep"))
    .select(
        "pos", 
        "token",
        "tag",
        "group_id")
    .group_by("group_id").agg("pos", "token", "tag")
    
    # Keep only word sequences that end with a noun
    .with_columns(
        pl.col("tag").list.eval(pl.element().is_in(adj_tags)).list.reverse().alias("adjs_reversed")
    )
    .with_columns(
        (pl.col("adjs_reversed").list.len() - 1 - pl.col("adjs_reversed").list.arg_min()).alias("last_noun_pos"),
    )
    .with_columns(
        pl.col("token").list.head(pl.col("last_noun_pos") + 1)
    )
    
    .filter(~pl.col("adjs_reversed").list.all())  # remove groups that are only adjectives
    .select(
        pl.col("token").list.join(" ").str.to_lowercase().alias("noun_phrase"),
        pl.col("tag").alias("tags"),
    )
)

with pl.StringCache():
    pl.Series("tags_order", ["NN", "NNS", "JJ"], pl.Categorical)
    filtered = filtered_q.collect()
filtered

| noun_phrase | tags |
| --- | --- |
| str | list[cat] |
| "fixed effect model" | ["JJ", "NN", "NN"] |
| "other robustness tests" | ["JJ", "NN", "NNS"] |
| "study" | ["NN"] |
| "urban level" | ["JJ", "NN"] |
| "super-efficient slack-based me… | ["JJ", "JJ", "NN"] |
| … | … |
| "cities" | ["NNS"] |
| "sdgs" | ["NNP"] |
| "mechanism analysis" | ["NN", "NN"] |
| "urban gee" | ["JJ", "NNP"] |


Now that we have a working solution for one text, let's adapt it to make it work on the complete dataframe of papers.

In [44]:
nlp = spacy.load("en_core_web_sm", disable = ['ner'])

In [49]:

doc = nlp("My beautiful dogs are better than yours")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

My my PRON PRP$ poss Xx True True
beautiful beautiful ADJ JJ amod xxxx True False
dogs dog NOUN NNS nsubj xxxx True False
are be AUX VBP ROOT xxx True True
better well ADJ JJR acomp xxxx True False
than than ADP IN prep xxxx True True
yours your NOUN NNS pobj xxxx True True


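We now build a dataframe by applying the VOSviewer strategy to all papers in our dataset.
The resulting dataframe has the following columns:
- id: rank of the term once sorted by paper_count, then count
- term: the noun phrase
- paper_ids: id of the paper for each occurrence, so a paper in which the term appears several times is listed several times
- tags: POS tags of the first occurrence of the term
- count: total number of occurrences
- paper_count: number of distinct papers in which the term appears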
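
The difference between `count`, `paper_ids` and `paper_count` can be illustrated on a tiny example. This is a plain-Python sketch with hypothetical data, not the query used on the dataset.

```python
from collections import Counter

# Hypothetical (term, paper_id) pairs, one per occurrence of a term in a paper.
occurrences = [
    ("smart city", 1), ("smart city", 1), ("smart city", 2),
    ("bim", 3),
]

# count: total number of occurrences of each term.
count = Counter(term for term, _ in occurrences)

# paper_ids: one id per occurrence, so ids repeat when a term recurs in a paper.
paper_ids: dict[str, list[int]] = {}
for term, paper_id in occurrences:
    paper_ids.setdefault(term, []).append(paper_id)

# paper_count: number of distinct papers that mention the term.
paper_count = {term: len(set(ids)) for term, ids in paper_ids.items()}

print(count["smart city"], paper_ids["smart city"], paper_count["smart city"])
# → 3 [1, 1, 2] 2
```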

In [387]:
import time

from nltk.stem import WordNetLemmatizer


def nltk_tags(col: str, alias: str | None = None) -> pl.Expr:
    """Tokenize a text column and POS-tag every token with nltk."""
    return (
        pl.col(col)
        .alias(alias if alias is not None else col.lower() + "_tokens_tags")
        .map_elements(nltk.word_tokenize, return_dtype=pl.List(pl.String), strategy="thread_local")
        # POS-tag all tokenized texts in one batched call
        .map_batches(lambda series: pl.Series(nltk.pos_tag_sents(series)), return_dtype=pl.List(pl.List(pl.String)))
    )


lemmatizer = WordNetLemmatizer()

# Terms excluded from the final result
skip_words = ["abstract", "%", "paper", "study", "use"]

t0 = time.time()

# Extract noun phrases from every paper, following the VOSviewer steps
q = (
    pl.scan_parquet(df("papers"))
    .select(
        pl.col("Id").alias("paper_id"), 
        pl.concat_list(
            pl.col("Subject"),
            pl.col("Abstract"),
            pl.col("Keywords")
        ).alias("text"),   
    )
    .explode("text")
    .with_columns(
        pl.arange(0, pl.len()).alias("text_id"),
        nltk_tags("text", "tokens_tags"),
    )
    .explode("tokens_tags")
    .select(
        pl.exclude("tokens_tags"),
        pl.col("tokens_tags").list.get(0).alias("token"),
        pl.col("tokens_tags").list.get(1).alias("tag").cast(pl.Categorical)
    ) 
    .with_columns(
        pl.col("tag").is_in(adj_tags + noun_tags).alias("keep")  # keep only nouns and adjectives
    )                  
    .with_columns(group_id=pl.col("keep").rle_id())
    .filter(pl.col("keep") == True)
    
    # Lemmatize plural forms to singular form
    .with_columns(
        pl.struct("token", "tag").map_elements(lambda t: lemmatizer.lemmatize(t["token"]) if t["tag"] in noun_tags else t["token"], return_dtype=pl.String).alias("token"),
    )
    .group_by("paper_id", "text_id", "group_id").agg("token", "tag")
    
    # Keep only word sequences that end with a noun
    .with_columns(
        pl.col("tag").list.eval(pl.element().is_in(adj_tags)).list.reverse().alias("adjs_reversed")
    )
    .with_columns(
        (pl.col("adjs_reversed").list.len() - 1 - pl.col("adjs_reversed").list.arg_min()).alias("last_noun_pos"),
    )
    .with_columns(
        pl.col("token").list.head(pl.col("last_noun_pos") + 1)
    )
    .filter(~pl.col("adjs_reversed").list.all())  # remove groups that are only adjectives

    # Final data preparation
    .select(
        pl.col("paper_id").alias("paper_ids"),
        pl.col("token").list.join(" ").str.to_lowercase().alias("term"),
        pl.col("tag").alias("tags"),
    )
    .filter(~pl.col("term").is_in(skip_words))
    .group_by("term").agg(
        pl.col("paper_ids"), pl.col("term").len().alias("count"),
        pl.col("tags").first()
    )
    .filter(pl.col("count") > 5)  # keep only terms that appear more than 5 times
    .with_columns(pl.col("paper_ids").list.unique().list.len().alias("paper_count"))
    
    .sort("paper_count", "count", descending=True)
    .select(
        pl.arange(0, pl.len()).alias("id"),
        pl.all()
    )
)

res = q.collect()
# res = q.fetch(50)
print(time.time() - t0)
res.write_parquet(df("terms_from_subject_abstract_keywords"))
res

8.36829662322998


| id | term | paper_ids | count | tags | paper_count |
| --- | --- | --- | --- | --- | --- |
| i64 | str | list[i64] | u32 | list[cat] | u32 |
| 0 | "technology" | [232, 479, … 73] | 231 | ["NNS"] | 159 |
| 1 | "sustainability" | [205, 475, … 368] | 199 | ["NN"] | 142 |
| 2 | "result" | [308, 63, … 199] | 152 | ["NNS"] | 125 |
| 3 | "development" | [53, 184, … 326] | 143 | ["NN"] | 117 |
| 4 | "research" | [182, 362, … 420] | 150 | ["NN"] | 112 |
| … | … | … | … | … | … |
| 867 | "ict development" | [238, 238, … 238] | 6 | ["NN", "NN"] | 1 |
| 868 | "firm green innovation" | [49, 49, … 49] | 6 | ["NN", "JJ", "NN"] | 1 |
| 869 | "digital placemaking" | [108, 108, … 108] | 6 | ["JJ", "NN"] | 1 |
| 870 | "social medium usage" | [231, 231, … 231] | 6 | ["JJ", "NNS", "NN"] | 1 |


In [390]:
from itertools import combinations
import jax.numpy as jnp
import jax.experimental.sparse as sparse

res = (
    pl.scan_parquet(df("terms_from_subject_abstract_keywords"))
    .explode("paper_ids")
    .with_columns(pl.col("term").cast(pl.Categorical))
    .group_by("id", "term", "paper_ids").agg(pl.col("term").len().alias("count"),
                                       # pl.col("tags").first()
                                      )
    .rename({"paper_ids": "paper_id"})
    # .group_by("paper_id").agg(pl.struct(term=pl.col("term"), count=pl.col("count")).alias("terms"))
    # .filter(pl.col("paper_id") == 265)
    # .with_columns(
    #     pl.col("terms")
    #     # .map_elements(
    #     #     lambda a: print(a),
    #     #     # return_dtype=pl.List(pl.List(pl.Struct({"term":pl.Categorical, "count":pl.UInt32})))
    #     # )
    #     ,
    #     pl.col("terms").list.len().alias("term_count")
    # )
    # .select(pl.exclude("term_count"))
    # .explode("terms")
    # .unnest("terms")
).collect()
res

| id | term | paper_id | count |
| --- | --- | --- | --- |
| i64 | cat | i64 | u32 |
| 41 | "environment" | 365 | 1 |
| 221 | "smart technology" | 206 | 1 |
| 291 | "information modeling" | 267 | 1 |
| 67 | "adoption" | 441 | 1 |
| 102 | "mobility" | 188 | 1 |
| … | … | … | … |
| 376 | "transparency" | 15 | 1 |
| 299 | "b" | 148 | 1 |
| 163 | "mechanism" | 179 | 1 |
| 207 | "concern" | 484 | 1 |


In [420]:
import numpy as np

data = res["count"].to_numpy()
# One (term id, paper id) pair per stored entry of the sparse matrix
indices = res.select("id", "paper_id").to_numpy()
kw_paper = sparse.BCOO((data, indices), shape=(872, 543))
kw_paper

BCOO(uint32[872, 543], nse=11582)

In [423]:
res.filter(pl.col("count") > 5)

| id | term | paper_id | count |
| --- | --- | --- | --- |
| i64 | cat | i64 | u32 |
| 6 | "digital technology" | 207 | 6 |
| 821 | "urban underground space resour… | 476 | 8 |
| 215 | "education" | 384 | 7 |
| 826 | "hbmm" | 350 | 8 |
| 84 | "bim" | 262 | 9 |
| … | … | … | … |
| 70 | "chapter" | 140 | 8 |
| 317 | "enterprise" | 179 | 7 |
| 538 | "scheme" | 208 | 6 |
| 859 | "sponge city strategy" | 440 | 6 |


In [461]:
from jax.experimental.sparse import sparsify
import jax
dot_sp = sparsify(jnp.dot)

In [None]:
@jax.jit
def get_cooccurences(a):
    return dot_sp(a, a.T)

cooccurences = dot_sp(kw_paper, kw_paper.T)
cooccurences

In [482]:
get_cooccurences(kw_paper).device

2024-06-06 10:41:07.855695: W external/tsl/tsl/framework/bfc_allocator.cc:482] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.62GiB (rounded to 2817208320)requested by op 
2024-06-06 10:41:07.855889: W external/tsl/tsl/framework/bfc_allocator.cc:494] *********************************************************************************************____***
E0606 10:41:07.856171  108300 pjrt_stream_executor_client.cc:2826] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2817208104 bytes.

XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2817208104 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:   135.7KiB
              constant allocation:         8B
        maybe_live_out allocation:    8.70MiB
     preallocated temp allocation:    2.62GiB
                 total allocation:    2.63GiB
Peak buffers:
	Buffer 1:
		Size: 511.71MiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/jit(_unique_sorted_mask)/jit(lexsort)/sort[dimension=0 is_stable=True num_keys=2]" source_file="/tmp/ipykernel_108300/2930833792.py" source_line=3
		XLA Label: sort
		Shape: s32[134142724]
		==========================

	Buffer 2:
		Size: 511.71MiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/jit(_unique_sorted_mask)/jit(lexsort)/sort[dimension=0 is_stable=True num_keys=2]" source_file="/tmp/ipykernel_108300/2930833792.py" source_line=3
		XLA Label: sort
		Shape: s32[134142724]
		==========================

	Buffer 3:
		Size: 511.71MiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/jit(_unique_sorted_mask)/jit(lexsort)/iota[dtype=int32 shape=(134142724,) dimension=0]" source_file="/tmp/ipykernel_108300/2930833792.py" source_line=3
		XLA Label: fusion
		Shape: s32[134142724]
		==========================

	Buffer 4:
		Size: 511.71MiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/jit(_unique_sorted_mask)/jit(lexsort)/slice[start_indices=(1, 0) limit_indices=(2, 134142724) strides=(1, 1)]" source_file="/tmp/ipykernel_108300/2930833792.py" source_line=3
		XLA Label: fusion
		Shape: s32[1,134142724]
		==========================

	Buffer 5:
		Size: 511.71MiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/jit(_unique_sorted_mask)/jit(lexsort)/slice[start_indices=(1, 0) limit_indices=(2, 134142724) strides=(1, 1)]" source_file="/tmp/ipykernel_108300/2930833792.py" source_line=3
		XLA Label: fusion
		Shape: s32[1,134142724]
		==========================

	Buffer 6:
		Size: 127.93MiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/reduce_or[axes=(1,)]" source_file="/tmp/ipykernel_108300/4272418259.py" source_line=3
		XLA Label: fusion
		Shape: pred[11582,11582]
		==========================

	Buffer 7:
		Size: 5.80MiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/gather[dimension_numbers=GatherDimensionNumbers(offset_dims=(0, 1), collapsed_slice_dims=(), start_index_map=(0,)) slice_sizes=(760384, 2) unique_indices=True indices_are_sorted=True mode=GatherScatterMode.PROMISE_IN_BOUNDS fill_value=None]" source_file="/tmp/ipykernel_108300/4272418259.py" source_line=3
		XLA Label: fusion
		Shape: s32[1,760384,2]
		==========================

	Buffer 8:
		Size: 2.90MiB
		XLA Label: fusion
		Shape: u32[760384]
		==========================

	Buffer 9:
		Size: 90.5KiB
		Entry Parameter Subshape: s32[11582,2]
		==========================

	Buffer 10:
		Size: 45.2KiB
		XLA Label: fusion
		Shape: s32[11582,1]
		==========================

	Buffer 11:
		Size: 45.2KiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/lt" source_file="/tmp/ipykernel_108300/4272418259.py" source_line=3
		XLA Label: fusion
		Shape: s32[1,11582]
		==========================

	Buffer 12:
		Size: 45.2KiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/lt" source_file="/tmp/ipykernel_108300/4272418259.py" source_line=3
		XLA Label: fusion
		Shape: s32[1,1,11582]
		==========================

	Buffer 13:
		Size: 45.2KiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/lt" source_file="/tmp/ipykernel_108300/4272418259.py" source_line=3
		XLA Label: fusion
		Shape: s32[11582,1]
		==========================

	Buffer 14:
		Size: 45.2KiB
		Entry Parameter Subshape: u32[11582]
		==========================

	Buffer 15:
		Size: 11.3KiB
		Operator: op_name="jit(get_cooccurences)/jit(main)/lt" source_file="/tmp/ipykernel_108300/4272418259.py" source_line=3
		XLA Label: fusion
		Shape: pred[1,1,11582]
		==========================



In [457]:
cooccurences.todense()

Array([[467,  95,  37, ...,   6,   0,   6],
       [ 95, 369,  47, ...,   0,   0,   0],
       [ 37,  47, 212, ...,   6,   6,   0],
       ...,
       [  6,   0,   6, ...,  36,   0,   0],
       [  0,   0,   6, ...,   0,  36,   0],
       [  6,   0,   0, ...,   0,   0,  36]], dtype=uint32)
