# RQTR (for Lemmatized Corpora)

With this notebook you can calculate the RQTR(n) values for a lemmatized corpus.

In [None]:
# SETUP

from src.corpus import Corpus
from src.metrics import rqtr_lemma
import pathlib
import pandas as pd
import json
from src.load_data import load_files

%load_ext autoreload
%autoreload 2

Loading the data...

Put the path to your corpus in the variable `CORPUSDIR`.

I assume that the data is a set of JSON files, each containing a list of lemmata under the key 'lemmas'.
If your data has a different format, you need to adjust the code accordingly. The result should be a list of lists of lemmata, one inner list per document.
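
If you need to write your own loader, the following is a minimal sketch of reading such files into a list of lists of lemmata. It is not the code behind `load_files` (which also returns metadata); the function name `load_lemma_lists` and the `*.json` file pattern are assumptions made for illustration.

```python
import json
import pathlib


def load_lemma_lists(corpus_dir, key="lemmas"):
    """Read every *.json file in corpus_dir and collect its lemma list.

    Hypothetical helper: assumes each file holds a JSON object whose
    `key` entry is a list of lemmata.
    """
    docs = []
    # Sort the paths so the document order is reproducible
    for path in sorted(pathlib.Path(corpus_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        docs.append(list(record[key]))
    return docs
```

Calling `load_lemma_lists(CORPUSDIR)` would then give one inner list per file, which is the shape described above.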

In [None]:
# Put the path to the directory containing the corpus files here
CORPUSDIR = '/home/brunobrocai/Data/MoWiKo/Paper-themKorp/full'

docs, metadata = load_files(CORPUSDIR)
corpus = Corpus(docs, metadata)

Pick any number of base terms for the RQTR calculation. In the example below, a multi-word term is written as a tuple of lemmata.

In [None]:
# Picking base terms
base_terms = (('künstlich', 'Intelligenz'), 'KI', 'AI')

Here it comes!

Let's calculate baseline QTR values.

In [None]:
b, core_term = rqtr_lemma.qtr_baseline(
    base_terms, corpus
)
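
For orientation, here is a rough sketch of what a QTR value expresses, assuming the document-level definition: the share of documents containing a candidate term that also contain at least one base term. This is an illustration only; `rqtr_lemma` may count differently, and the helpers `contains_term` and `qtr` as well as the toy data are hypothetical.

```python
def contains_term(doc, term):
    """True if a lemma (str) or a lemma sequence (tuple) occurs in doc."""
    if isinstance(term, str):
        return term in doc
    n = len(term)
    # Slide a window of length n over the document and compare
    return any(tuple(doc[i:i + n]) == tuple(term) for i in range(len(doc) - n + 1))


def qtr(candidate, base_terms, docs):
    """Share of documents with `candidate` that also contain a base term.

    Assumed document-level definition, not the repository's implementation.
    """
    with_candidate = [doc for doc in docs if contains_term(doc, candidate)]
    if not with_candidate:
        return 0.0
    with_both = [
        doc for doc in with_candidate
        if any(contains_term(doc, base) for base in base_terms)
    ]
    return len(with_both) / len(with_candidate)


# Toy corpus: each document is a list of lemmata
toy_docs = [
    ["künstlich", "Intelligenz", "Algorithmus"],
    ["KI", "Algorithmus", "Daten"],
    ["Algorithmus", "Wetter"],
    ["Wetter", "Regen"],
]
toy_base_terms = (("künstlich", "Intelligenz"), "KI", "AI")

print(qtr("Algorithmus", toy_base_terms, toy_docs))
```

A relative score then compares each candidate's QTR against the baseline `b`, so that terms more strongly tied to the base terms than the baseline score higher.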

Now we can calculate RQTR values for all terms in the corpus, or at least for those that co-occur with the base terms at least once.

In [None]:
cooccurence_values = rqtr_lemma.count_cooccurence(
    base_terms,
    corpus,
    max_ngram_len=1,
)

The function we just used returns the co-occurrence counts for the terms in the corpus. We can now convert them into RQTR(n) values and use a Pandas DataFrame to get a nice overview of the results.

In [None]:
rqtrn_table = rqtr_lemma.cooccurence_to_metric(
    cooccurence_values,
    b,
    metric='rqtrn'
)
rqtrn_table

In [None]:
# Keep the terms with RQTRN > 40
filtered_df = rqtrn_table[rqtrn_table['rqtrn'] > 40]
# Keep the terms with count > 2; copy so the new column is not written to a slice
filtered_df = filtered_df[filtered_df['count'] > 2].copy()

# Scale RQTRN to a weight (RQTRN / 100)
filtered_df['weight'] = filtered_df['rqtrn'] / 100
filtered_df

## Part 2: Corpus Creation
Now we can retrieve documents based on the wordlist we created with the RQTR method.

In [None]:
from src.corpus_creation import document_retriever as dr


wordlist = filtered_df['value'].tolist()
wordlist.extend(base_terms)
found_docs = dr.match_wordlist(corpus, wordlist, min=2)

In [None]:
for doc in found_docs:
    print(doc[1]['h1'])
    print(doc[1]['url'])
    print()

In [None]:
weighted_wordlist = dict(filtered_df[['value', 'weight']].values.tolist())
found_docs = dr.match_weighted_wordlist(corpus, weighted_wordlist, min=2)

for doc in found_docs:
    print(doc[1]['h1'])
    print(doc[1]['url'])
    print()