# RQTR (for Lemmatized Corpora)

With this notebook you can calculate the RQTR(n) values for a lemmatized corpus.

In [None]:
# SETUP

from src.corpus import Corpus
from src.metrics import rqtr_lemma
import pathlib
import pandas as pd
import json

%load_ext autoreload
%autoreload 2

Loading the data...

Put the path to your corpus in the variable `CORPUSDIR`.

The code below assumes that the data is a set of JSON files, each containing a list of lemmata under the key `'lemmas'`.
If your data has a different format, adjust the code accordingly. The result should be a list of lists of lemmata, one inner list per document.

In [21]:
# Put the path to the directory containing the corpus files here
CORPUSDIR = '/home/brunobrocai/Data/MoWiKo/Paper-themKorp/full'

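# Collect the lemma list of every file in the corpus directory
files = sorted(pathlib.Path(CORPUSDIR).iterdir())
data = []
for file in files:
    if not file.is_file():
        continue  # skip subdirectories
    with open(file, 'r', encoding='utf-8') as f:
        doc = json.load(f)
    data.append(doc['lemmas'])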
corpus = Corpus(data)
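
If the corpus is stored differently, only the loading step needs to change. As a minimal sketch, assume a hypothetical layout with one plain-text file per document and whitespace-separated lemmata. The helper `load_plaintext_corpus` is made up for illustration and is not part of `src`:

```python
import pathlib


def load_plaintext_corpus(corpus_dir):
    """Return a list of lemma lists, one per .txt file (hypothetical format)."""
    documents = []
    for path in sorted(pathlib.Path(corpus_dir).glob('*.txt')):
        text = path.read_text(encoding='utf-8')
        lemmas = text.split()  # assumes lemmata are separated by whitespace
        if lemmas:  # skip empty files
            documents.append(lemmas)
    return documents
```

The returned list of lists can then be passed to `Corpus(...)` in the same way as `data` above.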

Picking the two base terms for RQTR calculation...

You might want to use n-grams instead of single words. If so, you have to treat them as one word. The corpus object has a method `treat_as_one()` with which you can achieve this.

In [None]:
# Treating 'künstlich Intelligenz' as one token
corpus.treat_as_one(['künstlich', 'Intelligenz'], 'künstlich_Intelligenz')

# Picking base terms
base_terms = ('künstlich_Intelligenz', 'KI')
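
To illustrate what treating an n-gram as one token means, here is a minimal sketch on a plain list of lemmata. The helper `merge_ngram` is hypothetical and is not the implementation behind `treat_as_one()`; it only shows the intended effect:

```python
def merge_ngram(lemmas, ngram, replacement):
    """Replace each occurrence of the sequence `ngram` with one token."""
    lemmas = list(lemmas)
    ngram = list(ngram)
    n = len(ngram)
    if n == 0:
        return lemmas  # nothing to merge
    merged = []
    i = 0
    while i < len(lemmas):
        if lemmas[i:i + n] == ngram:
            # Full n-gram found: emit the single replacement token
            merged.append(replacement)
            i += n
        else:
            merged.append(lemmas[i])
            i += 1
    return merged


doc = ['die', 'künstlich', 'Intelligenz', 'lernen', 'künstlich']
print(merge_ngram(doc, ['künstlich', 'Intelligenz'], 'künstlich_Intelligenz'))
```

Only complete occurrences of the sequence are merged; a stray `'künstlich'` without a following `'Intelligenz'` stays untouched.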

Here it comes!

Let's calculate baseline QTR values.

In [None]:
b, core_term = rqtr_lemma.qtr_baseline(
    base_terms[0], base_terms[1], corpus
)
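
For orientation, the quantities involved can be sketched on toy data. This sketch assumes document-level definitions: QTR is taken to be the share of documents containing a candidate term that also contain at least one base term, and RQTR the percentage difference between that QTR and a baseline. The functions `qtr` and `rqtr`, the toy documents, and the baseline value are all hypothetical; `rqtr_lemma` may define and normalise these values differently.

```python
def qtr(term, base_terms, documents):
    """Share of documents with `term` that also contain a base term."""
    base = set(base_terms)
    with_term = [set(doc) for doc in documents if term in doc]
    if not with_term:
        return 0.0  # term never occurs
    with_both = sum(1 for doc in with_term if doc & base)
    return with_both / len(with_term)


def rqtr(qtr_value, baseline):
    """Percentage difference between a QTR value and the baseline."""
    return 100 * (qtr_value - baseline) / baseline


# Invented toy documents, each a list of lemmata
docs = [
    ['KI', 'Algorithmus', 'lernen'],
    ['Algorithmus', 'Daten'],
    ['KI', 'Daten', 'Ethik'],
    ['Wetter', 'Daten'],
]
base = ('künstlich_Intelligenz', 'KI')

score = qtr('Algorithmus', base, docs)  # 1 of the 2 'Algorithmus' documents
print(score, rqtr(score, 0.4))  # 0.4 is an arbitrary baseline for illustration
```

Under these assumptions, a term scoring above the baseline gets a positive RQTR and a term scoring below it gets a negative one.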

Now we can calculate RQTR values for all terms in the corpus, or at least for those that co-occur with the base terms at least once.

In [None]:
values = rqtr_lemma.get_all_ngrams(
    base_terms,
    b,
    corpus.documents,
    min_count=1,
    n=1,
)

The function we just used returns the RQTR results for all terms it found in the corpus. As the next cell shows, each result carries the term, its count, and an `rqtrn()` method. We can now perform some pandas DataFrame magic to get a nice overview of the results.

In [None]:
# Sort values by rqtrn
sorted_values = sorted(
    values,
    key=lambda x: x.rqtrn(b),
    reverse=True
)

def tuple_to_string(t):
    return ' '.join(t)

# Create a pandas dataframe
df = pd.DataFrame(
    [
        (tuple_to_string(value.term), value.term_count, value.rqtrn(b))
        for value in sorted_values
    ],
    columns=['value', 'count', 'rqtrn']
)

# Keep only values that contain alphabetic characters
df = df[df['value'].str.contains('[a-zA-ZÜüÖöÄäß]')]

In [None]:
# Get the values with RQTRN > 40
filtered_df = df[(df['rqtrn'] > 40)]
# Get the values with count > 3
filtered_df = filtered_df[filtered_df['count'] > 3]

print(filtered_df)
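
The filtering steps can be tried out on a small hand-made table. The values below are invented purely for illustration and do not come from any corpus; only the column names match the DataFrame built above.

```python
import pandas as pd

# Invented example rows: (term, count, rqtrn)
toy = pd.DataFrame(
    [
        ('Algorithmus', 12, 85.0),
        ('2023', 40, 90.0),
        ('Wetter', 50, -20.0),
        ('Ethik', 2, 95.0),
    ],
    columns=['value', 'count', 'rqtrn'],
)

# Keep only rows whose term contains at least one letter
toy = toy[toy['value'].str.contains('[a-zA-ZÜüÖöÄäß]')]

# Apply both thresholds at once with a combined boolean mask
toy_filtered = toy[(toy['rqtrn'] > 40) & (toy['count'] > 3)]
print(toy_filtered)
```

Combining the two conditions with `&` gives the same rows as filtering in two steps; the thresholds 40 and 3 are simply the ones used in the cell above and can be adjusted to the corpus at hand.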