# Text mining: Extracting keywords from papers
## Abstract
Text mining consists of extracting useful information from text data.

We will first reproduce the text mining approach used by VOSviewer, a popular tool for analysing bibliometric networks (https://arxiv.org/pdf/1109.2058).

Here are the steps of their approach that relate to text mining:
<blockquote cite="https://arxiv.org/pdf/1109.2058">
    
1. <strong>Identification of noun phrases.</strong> The approach that we take is similar to what is
reported in an earlier paper (Van Eck, Waltman, Noyons, & Buter, 2010). We first
perform part-of-speech tagging (i.e., identification of verbs, nouns, adjectives, etc.).
The Apache OpenNLP toolkit (http://incubator.apache.org/opennlp/) is used for this
purpose. We then use a linguistic filter to identify noun phrases. Our filter selects all
word sequences that consist exclusively of nouns and adjectives and that end with a
noun (e.g., paper, visualization, interesting result, and text mining, but not degrees of
freedom and highly cited publication). Finally, we convert plural noun phrases into
singular ones.

2. <p>
        <strong>Selection of the most relevant noun phrases.</strong> The selected noun phrases are referred to
        as terms. We have developed a new technique for selecting the most relevant noun
        phrases. The essence of this technique is as follows. For each noun phrase, the
        distribution of (second-order) co-occurrences over all noun phrases is determined.
    </p>
    <p>
        This distribution is compared with the overall distribution of co-occurrences over
        noun phrases. The larger the difference between the two distributions (measured using
        the Kullback-Leibler distance), the higher the relevance of a noun phrase. Intuitively,
        the idea is that noun phrases with a low relevance (or noun phrases with a general
        meaning), such as paper, interesting result, and new method, have a more or less
        equal distribution of their (second-order) co-occurrences. On the other hand, noun
        phrases with a high relevance (or noun phrases with a specific meaning), such as
        visualization, text mining, and natural language processing, have a distribution of
        their (second-order) co-occurrences that is significantly biased towards certain other
        noun phrases. Hence, it is assumed that in a co-occurrence network noun phrases with
        a high relevance are grouped together into clusters. Each cluster may be seen as a
        topic.
    </p>
</blockquote>

We won't be using Apache OpenNLP for POS tagging as they did: the most recent build we tried had an issue, and because it is a Java library it is inconvenient to use from Python. Instead, we will start with nltk, a widely used Python package for NLP.

Aside from nltk, we might consider these other promising NLP tools:
- Google Parsey McParseface
- Stanford CoreNLP
- Amazon Comprehend
- Flair
- Gensim
- spaCy


## Setup
We first import the dependencies and download the models used for POS tagging and tokenising.

In [1]:
import os
from pprint import pp

import jax
import jax.numpy as jnp
import jax.experimental.sparse as sparse
import nltk
import numpy as np
import polars as pl
import spacy
from jax.experimental.sparse import sparsify

# Penn Treebank tags for adjectives and nouns
adj_tags = ["JJ", "JJR", "JJS"]
noun_tags = ["NN", "NNS", "NNP", "NNPS"]

def df(name: str) -> str:
    """Return the path of the parquet file that stores the dataframe `name`."""
    return os.path.join("dataframes", name + ".parquet")

In [3]:




nltk.download('averaged_perceptron_tagger')  # pos tagger
nltk.download('punkt')  # tokenizer
nltk.download('tagsets')
nltk.download('wordnet')

!python -m spacy download en_core_web_sm

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/moonlyss/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/moonlyss/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package tagsets to /home/moonlyss/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package wordnet to /home/moonlyss/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 5.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Identification of noun phrases
### Evaluating part-of-speech (POS) tagging
Let's start by taking an abstract from our BIM papers dataset and applying POS tagging with nltk.

In [4]:
abstract = pl.scan_parquet(df("papers")).first().collect()[0, "Abstract"]
tokens = nltk.word_tokenize(abstract)
tags = nltk.pos_tag(tokens)
print("Successful tagging : ")
pp(tags[:10])
print("\nTagging failure, 'stricter' should be an adjective (JJ) :")
pp(tags[219:225])

Successful tagging : 
[('Abstract', 'NN'),
 (':', ':'),
 ('The', 'DT'),
 ('development', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('digital', 'JJ'),
 ('economy', 'NN'),
 ('has', 'VBZ'),
 ('changed', 'VBN')]

Tagging failure, 'stricter' should be an adjective (JJ) :
[('cities', 'NNS'),
 ('with', 'IN'),
 ('stricter', 'NN'),
 ('governmental', 'JJ'),
 ('environmental', 'JJ'),
 ('regulations', 'NNS')]


The result is mostly convincing but not perfect: 'stricter' is tagged as a noun (NN) instead of an adjective. We can move on to the filtering and extraction of noun phrases.

Method used by VOSviewer:
<blockquote>
We then use a linguistic filter to identify noun phrases. Our filter selects all
word sequences that consist exclusively of nouns and adjectives and that end with a
noun (e.g., paper, visualization, interesting result, and text mining, but not degrees of
freedom and highly cited publication). Finally, we convert plural noun phrases into
singular ones.  
</blockquote>
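
To make the filter concrete, here is a minimal pure-Python sketch of the rule quoted above: collect maximal runs of nouns and adjectives, then drop trailing adjectives so that each phrase ends with a noun. The function name `extract_noun_phrases` and the hand-written tagged tokens are assumptions made for illustration; they are not part of VOSviewer or of our dataset, and the conversion of plurals to singular is left out.

```python
ADJ_TAGS = {"JJ", "JJR", "JJS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_noun_phrases(tagged):
    """Return lowercased noun phrases from a list of (token, tag) pairs."""
    phrases, run = [], []
    # A sentinel with an empty tag flushes the last run.
    for token, tag in list(tagged) + [("", "")]:
        if tag in ADJ_TAGS or tag in NOUN_TAGS:
            run.append((token, tag))
            continue
        # Drop trailing adjectives so that the phrase ends with a noun.
        while run and run[-1][1] in ADJ_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(tok for tok, _ in run).lower())
        run = []
    return phrases


# Hand-written (token, tag) pairs, for illustration only.
tagged = [
    ("cities", "NNS"), ("with", "IN"), ("stricter", "JJR"),
    ("governmental", "JJ"), ("environmental", "JJ"), ("regulations", "NNS"),
    ("are", "VBP"), ("highly", "RB"), ("cited", "VBN"), ("publication", "NN"),
]
print(extract_noun_phrases(tagged))
# ['cities', 'stricter governmental environmental regulations', 'publication']
```

Note that "highly cited publication" is not kept as a whole phrase, because "highly" (adverb) and "cited" (verb) break the run; only "publication" survives.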

In [5]:
           
filtered_q = (
    pl.DataFrame(tags, orient="row", schema={"token": pl.String, "tag": pl.Categorical(ordering="physical")}).lazy()
    .select(
        pl.arange(0, pl.len()).alias("pos"),
        pl.all(),
        pl.col("tag").is_in(adj_tags + noun_tags).alias("keep")  # keep only nouns and adjectives
    )                  
    .with_columns(group_id=pl.col("keep").rle_id())
    .filter(pl.col("keep"))
    .select(
        "pos", 
        "token",
        "tag",
        "group_id")
    .group_by("group_id").agg("pos", "token", "tag")
    
    # Keep only word sequences that end with a noun
    .with_columns(
        pl.col("tag").list.eval(pl.element().is_in(adj_tags)).list.reverse().alias("adjs_reversed")
    )
    .with_columns(
        (pl.col("adjs_reversed").list.len() - 1 - pl.col("adjs_reversed").list.arg_min()).alias("last_noun_pos"),
    )
    .with_columns(
        pl.col("token").list.head(pl.col("last_noun_pos") + 1)
    )
    
    .filter(~pl.col("adjs_reversed").list.all())  # remove groups that are only adjectives
    .select(
        pl.col("token").list.join(" ").str.to_lowercase().alias("noun_phrase"),
        pl.col("tag").alias("tags"),
    )
)

with pl.StringCache():
    pl.Series("tags_order", ["NN", "NNS", "JJ"], pl.Categorical)
    filtered = filtered_q.collect()
filtered

| noun_phrase | tags |
| --- | --- |
| str | list[cat] |
| "fixed effect model" | ["JJ", "NN", "NN"] |
| "other robustness tests" | ["JJ", "NN", "NNS"] |
| "study" | ["NN"] |
| "urban level" | ["JJ", "NN"] |
| "super-efficient slack-based me… | ["JJ", "JJ", "NN"] |
| … | … |
| "cities" | ["NNS"] |
| "sdgs" | ["NNP"] |
| "mechanism analysis" | ["NN", "NN"] |
| "urban gee" | ["JJ", "NNP"] |


Now that we have a working solution for one text, let's adapt it to make it work on the complete dataframe of papers.

In [44]:
nlp = spacy.load("en_core_web_sm", disable = ['ner'])

In [49]:

doc = nlp("My beautiful dogs are better than yours")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

My my PRON PRP$ poss Xx True True
beautiful beautiful ADJ JJ amod xxxx True False
dogs dog NOUN NNS nsubj xxxx True False
are be AUX VBP ROOT xxx True True
better well ADJ JJR acomp xxxx True False
than than ADP IN prep xxxx True True
yours your NOUN NNS pobj xxxx True True


We now create a dataframe by applying the VOSviewer strategy to all papers in our dataset.
The resulting dataframe has the following columns:
- id
- term: the noun phrase
- paper_ids: one entry per occurrence, so a paper id is repeated when the term appears several times in that paper
- tags
- count: total number of occurrences of the term
- paper_count: number of distinct papers that contain the term
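
The difference between `count` and `paper_count` is easier to see on a toy example. The sketch below uses hypothetical `(paper_id, term)` occurrences, not our dataset, and plain Python instead of polars.

```python
from collections import defaultdict

# Hypothetical (paper_id, term) occurrences, one entry per appearance.
occurrences = [
    (1, "smart city"), (1, "smart city"), (2, "smart city"),
    (2, "green finance"), (2, "green finance"), (2, "green finance"),
]

paper_ids = defaultdict(list)
for paper_id, term in occurrences:
    paper_ids[term].append(paper_id)

summary = {
    term: {"paper_ids": ids, "count": len(ids), "paper_count": len(set(ids))}
    for term, ids in paper_ids.items()
}
print(summary["smart city"])
# {'paper_ids': [1, 1, 2], 'count': 3, 'paper_count': 2}
```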

In [751]:
import time

from nltk.stem import WordNetLemmatizer

def nltk_tags(col: str, alias: str | None=None):
    """Tokenize a text column and POS-tag it, giving a list of [token, tag] pairs per row."""
    return (pl.col(col).alias(alias if alias is not None else col.lower() + "_tokens_tags")
        .map_elements(nltk.word_tokenize, return_dtype=pl.List(pl.String), strategy="thread_local")
        .map_batches(lambda series : pl.Series(nltk.pos_tag_sents(series)), return_dtype=pl.List(pl.List(pl.String)))
           )

lemmatizer = WordNetLemmatizer()

# Generic terms excluded from the results
skip_words = ["abstract", "%",
              "paper", "study", "use"
             ]
t0 = time.time()

# Lazy query that extracts the terms from all papers
q = (
    pl.scan_parquet(df("papers"))
    .select(
        pl.col("Id").alias("paper_id"), 
        pl.concat_list(
            pl.col("Subject"),
            pl.col("Abstract"),
            pl.col("Keywords")
        ).alias("text"),   
    )
    .explode("text")
    .with_columns(
        pl.arange(0, pl.len()).alias("text_id"),
        nltk_tags("text", "tokens_tags"),
    )
    .explode("tokens_tags")
    .select(
        pl.exclude("tokens_tags"),
        pl.col("tokens_tags").list.get(0).alias("token"),
        pl.col("tokens_tags").list.get(1).alias("tag").cast(pl.Categorical)
    ) 
    .with_columns(
        pl.col("tag").is_in(adj_tags + noun_tags).alias("keep")  # keep only nouns and adjectives
    )                  
    .with_columns(group_id=pl.col("keep").rle_id())
    .filter(pl.col("keep"))
    
    # Lemmatize plural forms to singular form
    .with_columns(
        pl.struct("token", "tag").map_elements(lambda t: lemmatizer.lemmatize(t["token"]) if t["tag"] in noun_tags else t["token"], return_dtype=pl.String).alias("token"),
    )
    .group_by("paper_id", "text_id", "group_id").agg("token", "tag")
    
    # Keep only word sequences that end with a noun
    .with_columns(
        pl.col("tag").list.eval(pl.element().is_in(adj_tags)).list.reverse().alias("adjs_reversed")
    )
    .with_columns(
        (pl.col("adjs_reversed").list.len() - 1 - pl.col("adjs_reversed").list.arg_min()).alias("last_noun_pos"),
    )
    .with_columns(
        pl.col("token").list.head(pl.col("last_noun_pos") + 1)
    )
    .filter(~pl.col("adjs_reversed").list.all())  # remove groups that are only adjectives

    # Final data preparation
    .select(
        pl.col("paper_id").alias("paper_ids"),
        pl.col("token").list.join(" ").str.to_lowercase().alias("term"),
        pl.col("tag").alias("tags"),
    )
    .filter(~pl.col("term").is_in(skip_words), )
    .group_by("term").agg(
        pl.col("paper_ids"), pl.col("term").len().alias("count"),
        pl.col("tags").first()
    )
    .filter(pl.col("count") > 5)  # keep only terms that appear more than 5 times
    .with_columns(pl.col("paper_ids").list.unique().list.len().alias("paper_count"))
    
    .sort("paper_count", "count", descending=True)
    .select(
        pl.arange(0, pl.len(), dtype=pl.UInt32).alias("id"),
        pl.all()
    )
)

res = q.collect()
# res = q.fetch(50)
print(time.time() - t0)
res.write_parquet(df("terms_from_subject_abstract_keywords"))
res

10.28857159614563


| id | term | paper_ids | count | tags | paper_count |
| --- | --- | --- | --- | --- | --- |
| u32 | str | list[i64] | u32 | list[cat] | u32 |
| 0 | "technology" | [481, 413, … 311] | 231 | ["NNS"] | 159 |
| 1 | "sustainability" | [236, 358, … 116] | 199 | ["NN"] | 142 |
| 2 | "result" | [164, 172, … 335] | 152 | ["NNS"] | 125 |
| 3 | "development" | [497, 169, … 241] | 143 | ["NNS", "JJ"] | 117 |
| 4 | "research" | [444, 378, … 283] | 150 | ["NN"] | 112 |
| … | … | … | … | … | … |
| 867 | "firm green innovation" | [49, 49, … 49] | 6 | ["NN", "JJ", "NN"] | 1 |
| 868 | "economic digitalization" | [2, 2, … 2] | 6 | ["JJ", "NN"] | 1 |
| 869 | "dm" | [216, 216, … 216] | 6 | ["NN"] | 1 |
| 870 | "high-quality green development" | [214, 214, … 214] | 6 | ["NN", "NN", "NN"] | 1 |


In [844]:
from itertools import combinations


df_kw_paper = (
    pl.scan_parquet(df("terms_from_subject_abstract_keywords"))
    .explode("paper_ids")
    .with_columns(pl.col("term").cast(pl.Categorical))
    .group_by("id", "term", "paper_ids").agg(pl.col("term").len().alias("count"),
                                       # pl.col("tags").first()
                                      )
    .rename({"paper_ids": "paper_id"})
).collect()
df_kw_paper

| id | term | paper_id | count |
| --- | --- | --- | --- |
| u32 | cat | i64 | u32 |
| 507 | "robustness test" | 229 | 1 |
| 758 | "supply chain management" | 253 | 2 |
| 273 | "regulation" | 3 | 1 |
| 152 | "circular economy" | 353 | 2 |
| 79 | "aim" | 423 | 1 |
| … | … | … | … |
| 21 | "finding" | 24 | 1 |
| 56 | "benefit" | 249 | 1 |
| 831 | "dci" | 241 | 7 |
| 210 | "ability" | 175 | 1 |


In [878]:
data = df_kw_paper["count"].to_numpy()
indices = np.concatenate(df_kw_paper.select(pl.concat_list("id", "paper_id").alias("indices"))["indices"].to_numpy()).reshape(len(df_kw_paper), 2)
kw_paper = sparse.BCOO((data, indices), shape=(872, 543)).todense()
kw_paper

Array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 3, 0],
       [0, 1, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint32)

In [23]:
res.filter(pl.col("count") > 5)

| id | term | paper_id | count |
| --- | --- | --- | --- |
| i64 | cat | i64 | u32 |
| 70 | "chapter" | 366 | 6 |
| 700 | "intelligent building" | 117 | 6 |
| 18 | "digitalization" | 42 | 8 |
| 732 | "green finance" | 64 | 12 |
| 629 | "urban agglomeration" | 92 | 8 |
| … | … | … | … |
| 851 | "di" | 99 | 6 |
| 852 | "high-quality green development" | 214 | 6 |
| 817 | "micrm" | 373 | 9 |
| 823 | "social-ecological resilience" | 118 | 8 |


In [514]:
# The dimensions for the dot product (contracting dimensions)
# Since numpy.dot(A, B) sums over the last axis of A and the second-to-last of B,
# these are the dimensions to contract.
dimension_numbers = (([1], [0]), ([], []))

@jax.jit
def get_cooccurences(a):
    return a @ a.T

@jax.jit
def get_second_order_cooccurences(a):
    fo = a @ a.T
    return fo @ fo  # no need to transpose because fo is symmetric


test = jnp.array([1,1,0,0,1,2,1,0,0,1,2,1,0,0,1,1]).reshape(4,4)
# test =get_cooccurences(test)  # second order
# test = get_cooccurences(test)  # third order
test

Array([[1, 1, 0, 0],
       [1, 2, 1, 0],
       [0, 1, 2, 1],
       [0, 0, 1, 1]], dtype=int32)

In [901]:
import time
t0 = time.time()
cooccurences = get_cooccurences(kw_paper)
print(time.time() - t0)
cooccurences


0.10277223587036133


Array([[467,  95,  37, ...,   6,   0,   6],
       [ 95, 369,  47, ...,   0,   0,   6],
       [ 37,  47, 212, ...,   0,   6,   0],
       ...,
       [  6,   0,   0, ...,  36,   0,   0],
       [  0,   0,   6, ...,   0,  36,   0],
       [  6,   6,   0, ...,   0,   0,  36]], dtype=uint32)

In [497]:
t0 = time.time()
res = jnp.dot(dense, dense.T)
print(time.time() - t0)
res

0.027343034744262695


Array([[467,  95,  37, ...,   6,   0,   6],
       [ 95, 369,  47, ...,   0,   0,   0],
       [ 37,  47, 212, ...,   6,   6,   0],
       ...,
       [  6,   0,   6, ...,  36,   0,   0],
       [  0,   0,   6, ...,   0,  36,   0],
       [  6,   0,   0, ...,   0,   0,  36]], dtype=uint32)

In [879]:
second_order_cooccurences = get_cooccurences(cooccurences)
second_order_cooccurences

Array([[410281, 219728, 105304, ...,   9600,   1962,   7758],
       [219728, 292514,  96887, ...,   4458,   2046,   6966],
       [105304,  96887, 109220, ...,   2622,   4074,   2496],
       ...,
       [  9600,   4458,   2622, ...,   4104,     36,     72],
       [  1962,   2046,   4074, ...,     36,   2952,      0],
       [  7758,   6966,   2496, ...,     72,      0,   2988]],      dtype=uint32)

Compute the second-order co-occurrences of terms.
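
As a small illustration of what "second order" adds, the sketch below uses a hypothetical term-paper matrix with 3 terms and 2 papers, and NumPy instead of JAX for brevity. Terms 0 and 2 never share a paper, so their first-order co-occurrence is 0, but they both co-occur with term 1, which gives them a non-zero second-order co-occurrence.

```python
import numpy as np

# Hypothetical term-paper count matrix: 3 terms (rows) x 2 papers (columns).
term_paper = np.array([
    [1, 0],   # term 0 appears only in paper 0
    [1, 1],   # term 1 appears in both papers
    [0, 1],   # term 2 appears only in paper 1
])

# First order: entry (i, j) sums, over papers, the product of the counts of terms i and j.
first_order = term_paper @ term_paper.T
# Second order: terms are linked through the terms they both co-occur with.
second_order = first_order @ first_order

print(first_order)
# [[1 1 0]
#  [1 2 1]
#  [0 1 1]]
print(second_order)
# [[2 3 1]
#  [3 6 3]
#  [1 3 2]]
```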

In [884]:
scnd_order_cooc = get_second_order_cooccurences(kw_paper)
print("number of keywords that are not cooccuring on second order :", jnp.count_nonzero(scnd_order_cooc == 0) // 2)
print("number of keywords that are cooccuring : ", jnp.count_nonzero(scnd_order_cooc != 0) // 2 - scnd_order_cooc.shape[0])

number of keywords that are not cooccuring on second order : 1587
number of keywords that are cooccuring :  377733


### Second-order co-occurrence distribution
The distribution is computed so that each row holds the co-occurrence frequencies of one term (i.e. sum(row) = 1).
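
A minimal sketch of this normalisation on a hypothetical 3 x 3 symmetric matrix, written with NumPy: each row is divided by its own sum. For a symmetric matrix, this gives the same result as dividing by the column sums and transposing.

```python
import numpy as np

# Hypothetical symmetric second-order co-occurrence matrix.
cooc = np.array([
    [2.0, 3.0, 1.0],
    [3.0, 6.0, 3.0],
    [1.0, 3.0, 2.0],
])

# Divide each row by its sum so that it becomes a probability distribution.
distrib = cooc / cooc.sum(axis=1, keepdims=True)

# Equivalent formulation for a symmetric matrix: normalise columns, then transpose.
distrib_via_columns = (cooc / cooc.sum(axis=0)).T

print(distrib[1])
# [0.25 0.5  0.25]
```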

In [893]:
scnd_order_distrib = (scnd_order_cooc / scnd_order_cooc.sum(axis=0)).T
print("check :\n", scnd_order_distrib.sum(axis=1)[:5]) 
scnd_order_distrib

check :
 [1.         1.         1.         0.99999994 1.        ]


Array([[2.6347449e-02, 1.4110506e-02, 6.7624184e-03, ..., 6.1649334e-04,
        1.2599582e-04, 4.9820368e-04],
       [1.6776664e-02, 2.2334019e-02, 7.3975129e-03, ..., 3.4037707e-04,
        1.5621612e-04, 5.3186779e-04],
       [1.3800960e-02, 1.2697843e-02, 1.4314185e-02, ..., 3.4363478e-04,
        5.3393142e-04, 3.2712144e-04],
       ...,
       [2.1324236e-02, 9.9024419e-03, 5.8241822e-03, ..., 9.1161113e-03,
        7.9965888e-05, 1.5993178e-04],
       [6.6385157e-03, 6.9227335e-03, 1.3784563e-02, ..., 1.2180763e-04,
        9.9882251e-03, 0.0000000e+00],
       [1.9778809e-02, 1.7759627e-02, 6.3634836e-03, ..., 1.8356202e-04,
        0.0000000e+00, 7.6178242e-03]], dtype=float32)

### Overall co-occurrence distribution
We need the overall co-occurrence distribution to compare it with the second-order co-occurrence distribution.

"Overall" means that instead of computing co-occurrences per paper, we compute them over the whole corpus.

In [798]:
q = (
    pl.scan_parquet(df("terms_from_subject_abstract_keywords"))
    .select("count")
)
counts = q.collect().get_column("count")
res = jnp.array(counts.to_numpy()).reshape(len(counts), 1)
overall_cooc = get_cooccurences(res)
overall_cooc

Array([[53361, 45969, 35112, ...,  1386,  1386,  1386],
       [45969, 39601, 30248, ...,  1194,  1194,  1194],
       [35112, 30248, 23104, ...,   912,   912,   912],
       ...,
       [ 1386,  1194,   912, ...,    36,    36,    36],
       [ 1386,  1194,   912, ...,    36,    36,    36],
       [ 1386,  1194,   912, ...,    36,    36,    36]], dtype=uint32)

As before, we compute the distribution of co-occurrences over each row.
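
Because the overall matrix is the outer product of the count vector with itself, every normalised row ends up identical and equal to `counts / counts.sum()`. The sketch below checks this on hypothetical counts, with NumPy instead of JAX.

```python
import numpy as np

# Hypothetical corpus-wide term counts.
counts = np.array([4.0, 3.0, 1.0])

# Outer product: entry (i, j) is counts[i] * counts[j].
overall = np.outer(counts, counts)
overall_rows = overall / overall.sum(axis=1, keepdims=True)

print(overall_rows)
# [[0.5   0.375 0.125]
#  [0.5   0.375 0.125]
#  [0.5   0.375 0.125]]
```

In other words, the reference distribution is the same for every term: the share of each term in the total number of occurrences.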

In [842]:
overall_distrib = overall_cooc / overall_cooc.sum(axis=1)[:, jnp.newaxis]
print("sanity check :", overall_distrib[0].sum())
overall_distrib

sanity check : 1.0


Array([[0.0143997 , 0.01240494, 0.00947513, ..., 0.00037402, 0.00037402,
        0.00037402],
       [0.0143997 , 0.01240494, 0.00947513, ..., 0.00037402, 0.00037402,
        0.00037402],
       [0.0143997 , 0.01240494, 0.00947513, ..., 0.00037402, 0.00037402,
        0.00037402],
       ...,
       [0.0143997 , 0.01240494, 0.00947513, ..., 0.00037402, 0.00037402,
        0.00037402],
       [0.0143997 , 0.01240494, 0.00947513, ..., 0.00037402, 0.00037402,
        0.00037402],
       [0.0143997 , 0.01240494, 0.00947513, ..., 0.00037402, 0.00037402,
        0.00037402]], dtype=float32)

### Selecting the most relevant terms based on the Kullback-Leibler divergence
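
For two discrete distributions $p$ and $q$, the Kullback-Leibler divergence is $D_{KL}(p \,\|\, q) = \sum_i p_i \log(p_i / q_i)$. The sketch below is a plain-Python illustration on hypothetical distributions; the helper `kl_divergence` is ours, not a library function. As far as we know, `jax.scipy.special.kl_div` follows the SciPy definition and returns the element-wise terms $p_i \log(p_i / q_i) - p_i + q_i$; the extra $-p_i + q_i$ terms cancel when summed over two distributions that each sum to 1, so the row sum is the same quantity.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three terms.
overall = [0.5, 0.375, 0.125]    # reference distribution
generic = [0.5, 0.375, 0.125]    # a term that co-occurs like the corpus average
specific = [0.05, 0.05, 0.9]     # a term biased towards one other term

print(kl_divergence(generic, overall))   # 0.0
print(kl_divergence(specific, overall) > kl_divergence(generic, overall))  # True
```

A term whose distribution matches the overall one gets a relevance of 0, while a term biased towards a few other terms gets a higher score.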

In [896]:
dist = jax.scipy.special.kl_div(scnd_order_distrib, overall_distrib)
dist

Array([[3.97042371e-03, 1.12220645e-04, 4.31818422e-04, ...,
        6.56128977e-05, 1.10931942e-04, 1.86517136e-05],
       [1.86191872e-04, 3.20369191e-03, 2.46537849e-04, ...,
        1.56057649e-06, 8.14154919e-05, 2.94161146e-05],
       [1.26222149e-05, 3.43378633e-06, 1.06670521e-03, ...,
        1.26886880e-06, 3.01469117e-05, 3.07126902e-06],
       ...,
       [1.44813582e-03, 2.71344557e-04, 8.16599466e-04, ...,
        2.03701574e-02, 1.70688640e-04, 7.82152056e-05],
       [2.62085441e-03, 1.44428341e-03, 8.58112238e-04, ...,
        1.15559313e-04, 2.31956877e-02, 3.74018215e-04],
       [8.98757949e-04, 1.01806037e-03, 5.78379724e-04, ...,
        5.98057231e-05, 3.74018215e-04, 1.57158710e-02]], dtype=float32)

In [897]:
relevance = dist.sum(axis=1)
relevance

Array([0.05283442, 0.04818851, 0.02658592, 0.03075027, 0.048304  ,
       0.04415703, 0.04185352, 0.0425231 , 0.07724868, 0.05556743,
       0.17190629, 0.04879113, 0.13997538, 0.10424098, 0.03371423,
       0.07620816, 0.03667569, 0.0294241 , 0.06760576, 0.03198233,
       0.07824764, 0.03720746, 0.0894107 , 0.03815623, 0.03346838,
       0.03422605, 0.05207967, 0.03721029, 0.04994608, 0.02739443,
       0.31744623, 0.0950837 , 0.06295551, 0.0612434 , 0.02322143,
       0.04343025, 0.03029149, 0.05970468, 0.0550732 , 0.05467683,
       0.03015793, 0.0467086 , 0.03243691, 0.11948493, 0.03801347,
       0.07230576, 0.08613472, 0.08048847, 0.03938145, 0.09007974,
       0.04133759, 0.03705375, 0.03054662, 0.06112577, 0.05862594,
       0.0411022 , 0.03481007, 0.03934808, 0.04619797, 0.0401652 ,
       0.0528256 , 0.05270232, 0.0859139 , 0.09482181, 0.04018054,
       0.04050227, 0.03392748, 0.04489309, 0.10078646, 0.07052536,
       0.07117409, 0.06528997, 0.06349993, 0.03224806, 0.06169

In [907]:
relevance_threshold = 0.1
# Indices of the terms whose relevance exceeds the threshold
relevant_kw = jnp.where(relevance > relevance_threshold)[0]
kw_relevance = relevance[relevant_kw]
print(relevant_kw[:3])
print(kw_relevance[:3])

[10 12 13]
[0.17190629 0.13997538 0.10424098]


In [908]:
def search(col: str, q: str) -> pl.Expr:
    return pl.col(col).str.starts_with(q)
df_relevant_terms = (
    pl.DataFrame([
        pl.Series("id", np.array(relevant_kw), dtype=pl.UInt32),
        pl.Series("relevance", np.array(kw_relevance))
    ]).lazy()
    .join(
        pl.scan_parquet(df("terms_from_subject_abstract_keywords")),
        on="id"
    )
    .select("id", "term", "paper_ids", "count", "paper_count", "relevance")
    # .sort("relevance",descending=True)
    # .filter(search("term", "buil"))
).collect()

df_relevant_terms.write_parquet(df("relevant_terms"))

df_relevant_terms

| id | term | paper_ids | count | paper_count | relevance |
| --- | --- | --- | --- | --- | --- |
| u32 | str | list[i64] | u32 | u32 | f32 |
| 10 | "construction" | [277, 494, … 331] | 164 | 82 | 0.171906 |
| 12 | "industry" | [295, 372, … 116] | 157 | 75 | 0.139975 |
| 13 | "smart city" | [356, 141, … 125] | 179 | 73 | 0.104241 |
| 30 | "china" | [264, 92, … 154] | 106 | 53 | 0.317446 |
| 43 | "effect" | [401, 238, … 479] | 52 | 44 | 0.119485 |
| … | … | … | … | … | … |
| 867 | "firm green innovation" | [49, 49, … 49] | 6 | 1 | 0.62776 |
| 868 | "economic digitalization" | [2, 2, … 2] | 6 | 1 | 0.581411 |
| 869 | "dm" | [216, 216, … 216] | 6 | 1 | 0.191426 |
| 870 | "high-quality green development" | [214, 214, … 214] | 6 | 1 | 0.487733 |


In [902]:
df_relevant_terms.sort("term")

| id | term | paper_ids | count | paper_count | relevance |
| --- | --- | --- | --- | --- | --- |
| u32 | str | list[i64] | u32 | u32 | f32 |
| 798 | "]" | [462, 462, … 462] | 6 | 2 | 0.196774 |
| 734 | "acceptability" | [227, 227, … 227] | 11 | 3 | 0.187262 |
| 749 | "administration" | [472, 472, … 472] | 7 | 3 | 0.110309 |
| 573 | "africa" | [168, 272, … 168] | 7 | 6 | 0.116766 |
| 738 | "air quality" | [219, 298, … 298] | 9 | 3 | 0.117672 |
| … | … | … | … | … | … |
| 724 | "visibility" | [381, 166, … 166] | 6 | 4 | 0.192997 |
| 600 | "web" | [171, 520, … 270] | 6 | 6 | 0.390692 |
| 467 | "woman" | [234, 213, … 234] | 15 | 7 | 0.774842 |
| 766 | "workability" | [80, 211, … 80] | 6 | 3 | 0.146405 |
