# Combine RQTR and Keyness Metrics

With this notebook, you can build a search query by combining RQTR methods with keyness (keyword) methods.

In [None]:
# Setup: Import necessary libraries

from src.corpus import Corpus, FrequencyCorpus
from src.metrics import keyness
from src.corpus_creation import document_retriever as dr
import pathlib
import json
import pandas as pd
import random
from src.load_data import load_files, load_reference_sample

%load_ext autoreload
%autoreload 2

Loading the data...

Put the path to your corpus in the variable `CORPUSDIR`.

I assume that the data is a set of JSON files, each containing a list of lemmata under the key `'lemmas'`. With this format, the `load_files` function will create two lists (documents and metadata) that can be turned into a corpus.

For corpora that are inside of a single file and lack metadata and other features (i.e. our reference corpus format), you can use the `load_reference_sample` function.

If you have a different format, you need to adjust the code accordingly. The result should be a list of lists of lemmata.
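
If you need to write your own loader, the expected result can be sketched as follows. This is only a minimal illustration: `load_lemma_lists` is a hypothetical helper, not part of `src.load_data`, and it assumes one JSON object per file with a `'lemmas'` key and no metadata handling.

```python
import json
import pathlib


def load_lemma_lists(corpus_dir):
    """Read every *.json file in corpus_dir and collect its 'lemmas' list.

    Hypothetical helper for illustration; the notebook's own `load_files`
    also returns metadata, which is ignored here.
    """
    docs = []
    # Sort the paths so that the document order is reproducible
    for path in sorted(pathlib.Path(corpus_dir).glob('*.json')):
        with path.open(encoding='utf-8') as handle:
            record = json.load(handle)
        docs.append(list(record['lemmas']))
    return docs
```

The returned value is a list of lists of lemmata, which is the shape described above.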

In [None]:
# Put the path to the directory containing the corpus files here
CORPUSDIR = '/home/brunobrocai/Data/MoWiKo/Paper-themKorp/full'

docs, metadata = load_files(CORPUSDIR)
corpus = Corpus(docs, metadata)

# While we're at it, we can also load the reference corpora
REFERENCE_CORPUS = '/home/brunobrocai/Data/Reference/Leipzig-Corpora/reference_corpus_20-25.json'
reference_docs = load_reference_sample(REFERENCE_CORPUS)
reference_corpus_leipzig = FrequencyCorpus(reference_docs)

reference_corpus_scraped = FrequencyCorpus(docs, metadata)

## 1: Keyness calculation

Using the corpus we just loaded, we can create a subcorpus by selecting files that contain terms from the `SEARCH_TERMS` list.

>The `match_wordlist` function has additional parameters to customize your search. For example, you can:
>1. Edit the min parameter to set the minimum number of matches required for a file to be included in the subcorpus.
>2. Set *unique* to True to only count unique matches (i.e. if a file contains the same term multiple times, it will only be counted once).

From our search term matches, we can build two corpora: one from the documents that *do match* and one from those that *do not match* (i.e. all other documents). We can use these as study and reference corpora, respectively.
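
To make the effect of *min* and *unique* concrete, here is a rough sketch of the matching logic on plain lists of lemmata. The helpers `count_matches` and `split_by_match` are hypothetical, and the real `match_wordlist` may count matches differently. The sketch assumes that a string matches a single token and a tuple matches consecutive tokens.

```python
def count_matches(doc, terms, unique=False):
    """Count how often the search terms occur in one lemmatised document.

    Hypothetical helper: strings match single tokens, tuples match
    consecutive tokens. With unique=True each term counts at most once.
    """
    total = 0
    for term in terms:
        parts = (term,) if isinstance(term, str) else tuple(term)
        width = len(parts)
        occurrences = sum(
            tuple(doc[i:i + width]) == parts
            for i in range(len(doc) - width + 1)
        )
        total += min(occurrences, 1) if unique else occurrences
    return total


def split_by_match(docs, terms, min_matches=1, unique=False):
    """Return (indices of matching documents, indices of all others)."""
    found, not_found = [], []
    for index, doc in enumerate(docs):
        hit = count_matches(doc, terms, unique) >= min_matches
        (found if hit else not_found).append(index)
    return found, not_found


toy_docs = [
    ['KI', 'sein', 'künstlich', 'Intelligenz'],
    ['Wetter', 'sein', 'schön'],
    ['KI', 'KI'],
]
toy_terms = ['KI', ('künstlich', 'Intelligenz')]

print(split_by_match(toy_docs, toy_terms, min_matches=2))
print(split_by_match(toy_docs, toy_terms, min_matches=2, unique=True))
```

With `unique=True`, the third toy document no longer reaches two matches, because its repeated 'KI' is counted only once.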

In [None]:
SEARCH_TERMS = ['KI', ('künstlich', 'Intelligenz')]

# Find the documents that contain the search terms (at least min times)
hits = dr.match_wordlist(
    corpus, SEARCH_TERMS, min=1
)

# Load the found documents into a new corpus
study_corpus = dr.corpus_from_found(
    hits, source_corpus=corpus,
    goal_corpus=FrequencyCorpus
)

# We can also create a corpus from the documents that do not contain the search terms
reference_corpus_nomatch = dr.corpus_from_notfound(
    hits, source_corpus=corpus,
    goal_corpus=FrequencyCorpus
)

Let's actually perform the **keyness calculation**! We can use different functions here, either `keyword_list` or `keyword_list_ngram`. The first one calculates keyness for single words *and* ngrams. You can specify up to which length you want to calculate ngrams with the *max_ngram_len* parameter. The second one only calculates keyness for *one given ngram length* (use length 1 for words).
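
As a rough picture of what "all ngrams up to a given length" means, consider the sketch below. `ngrams_up_to` is a hypothetical helper for illustration only; how the `keyness` module extracts and counts ngrams internally is not shown in this notebook.

```python
from collections import Counter


def ngrams_up_to(tokens, max_ngram_len):
    """Yield every ngram of length 1..max_ngram_len as a tuple of tokens."""
    for n in range(1, max_ngram_len + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


# Frequency table over single words and bigrams of one toy document
toy_tokens = ['künstlich', 'Intelligenz', 'lernen', 'künstlich', 'Intelligenz']
toy_counts = Counter(ngrams_up_to(toy_tokens, max_ngram_len=2))
print(toy_counts.most_common(3))
```

Calling the helper with `max_ngram_len=1` yields only single words, which corresponds to the "length 1" case mentioned above.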

As function parameters, you must specify a study corpus and a reference corpus. Both must be `FrequencyCorpus` Python objects. In addition, you can specify the following:
1. **metric**: Which keyness metric to use. 'log_likelihood_rayson' is commonly used in corpus linguistics. Alternatively, use 'odds_ratio' or 'percent_difference' for effect size metrics. There are even more metrics you can use -- just check the metrics.keyness module!
2. **smoothing**: We use Laplace smoothing, not least to avoid division by zero. You can specify the amount of smoothing. In general, the higher the smoothing, the more conservative the results. The default is 0.00001, which is quite low.
3. **min_freq**: Only keywords that appear at least this often in the study corpus will be included in the results. The default is 1, which means that all keywords are included.
4. **min_docs**: Only keywords that appear in at least this many documents in the study corpus will be included in the results. The default is 1, which means that all keywords are included.
5. **filter_stopwords**: If set to True, stopwords and ngrams beginning or ending with stopwords will be filtered out. The default is True.
6. **filter**: You can specify a filter function here, e.g. in order to filter out punctuation when counting frequencies. (By default, all words without alphabetic characters are already ignored when loading a corpus)
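
To get a feeling for how the metric and the smoothing interact, here is a small sketch of a two-corpus log-likelihood score. It assumes that 'log_likelihood_rayson' refers to the formulation usually attributed to Rayson and Garside, and that smoothing is simply added to both raw frequencies. Both are assumptions on my part; check the `metrics.keyness` module for the actual implementation.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref, smoothing=0.00001):
    """Two-corpus log-likelihood for one word (illustrative sketch).

    Assumption: the smoothing constant is added to both observed
    frequencies before expected frequencies are computed.
    """
    observed_study = freq_study + smoothing
    observed_ref = freq_ref + smoothing
    total_observed = observed_study + observed_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora
    expected_study = size_study * total_observed / total_size
    expected_ref = size_ref * total_observed / total_size
    return 2 * (
        observed_study * math.log(observed_study / expected_study)
        + observed_ref * math.log(observed_ref / expected_ref)
    )


# A word that is absent from the reference corpus, scored with low and high smoothing
low = log_likelihood(5, 0, 1000, 1000, smoothing=0.00001)
high = log_likelihood(5, 0, 1000, 1000, smoothing=0.5)
print(low, high)
```

In this toy case the score with the higher smoothing value is lower, which matches the "more conservative" behaviour described in the parameter list.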

In [None]:
# Create a keyword list for ngrams of length 1 and 2

keynesses = keyness.keyword_list(
    study_corpus=study_corpus,
    ref_corpus=reference_corpus_scraped,
    metric='log_likelihood_rayson',
    max_ngram_len=2,
    min_docs=3,
    smoothing=0.5,
    filter_stopwords=True,
)

In [None]:
# Let's take a look at the first 50 keywords
keynesses.head(50)

In [None]:
# You can also save pandas dataframes (e.g. the keyword list) to a file
# keynesses.to_excel('keyword_list.xlsx', index=False)

## 2: RQTR calculation

Now that we have our keyness results, let them be for a while and calculate RQTR. As a first step, copy our corpus from before.

Then, we pick the base terms.

**Note:** unfortunately, as of now, n-gram base terms are not supported. You need to use the `treat_as_one` method of the corpus to turn an ngram into a single token. This will be changed in the future!
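
Conceptually, turning an ngram into a single token looks like the sketch below. `merge_ngram` is a hypothetical stand-alone function working on one list of lemmata; the corpus method `treat_as_one` may be implemented differently.

```python
def merge_ngram(tokens, ngram, replacement):
    """Return a copy of tokens with each occurrence of ngram fused into one token."""
    tokens = list(tokens)
    ngram = list(ngram)
    width = len(ngram)
    if width == 0:
        raise ValueError('ngram must contain at least one token')
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i:i + width] == ngram:
            # Replace the whole ngram and skip past it
            merged.append(replacement)
            i += width
        else:
            merged.append(tokens[i])
            i += 1
    return merged


toy_sentence = ['die', 'künstlich', 'Intelligenz', 'sein', 'künstlich']
print(merge_ngram(toy_sentence, ['künstlich', 'Intelligenz'], 'künstlich_Intelligenz'))
```

Note that a lone 'künstlich' that is not followed by 'Intelligenz' stays untouched.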

In [None]:
from copy import deepcopy

# Keep an untouched copy of the corpus: `treat_as_one` below is called on `corpus` itself
corpus_copy = deepcopy(corpus)

# Picking base terms
BASE_TERMS = ('künstlich_Intelligenz', 'KI')

# Treating 'künstlich Intelligenz' as one token -- this is a bit of a hack and will be changed in a future update
corpus.treat_as_one(['künstlich', 'Intelligenz'], 'künstlich_Intelligenz')

Now calculate the baseline QTR value!

In [None]:
from src.metrics import rqtr_lemma

baseline, core_term = rqtr_lemma.qtr_baseline(
    BASE_TERMS[0], BASE_TERMS[1], corpus
)

After calculating the baseline QTR value, we can move on to calculating the QTR (or RQTR or RQTRn) values for all words/ngrams that cooccur with the base terms at least once. As a first step, count these cooccurrences and the instances where they do not cooccur. Like with keywords, you can do this for a specific ngram length (`count_cooccurence_ngram`) or for all ngrams up to a certain length (`count_cooccurence`).

You also have the following parameter to play with:
1. **min_count**: How often a word/phrase must cooccur with the base term in order to be included in the results. The default is 1.

After that, we can calculate the metric we want by using the `cooccurence_to_metric` function. You pass the cooccurrence results and the baseline. Also, specify which metric you want to use (the default is 'rqtrn').
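
For orientation, here is a sketch of the underlying idea on toy data. It assumes that QTR is the share of a candidate term's documents that also contain a base term, that the baseline is the lower of the two QTR values between the base terms, and that RQTR is the percentage difference from that baseline. These definitions and all function names are assumptions for illustration; the `rqtr_lemma` module may differ, and the normalised RQTRn variant is not reproduced here.

```python
def docs_containing(docs, term):
    """Indices of the documents that contain term at least once."""
    return {i for i, doc in enumerate(docs) if term in doc}


def qtr(candidate_docs, base_docs):
    """Share of the candidate's documents that also contain a base term."""
    if not candidate_docs:
        return 0.0
    return len(candidate_docs & base_docs) / len(candidate_docs)


def rqtr(qtr_value, baseline):
    """Percentage difference between a QTR value and the baseline."""
    return 100 * (qtr_value - baseline) / baseline


toy_docs = [
    ['KI', 'künstlich_Intelligenz', 'Algorithmus'],
    ['KI', 'Algorithmus'],
    ['künstlich_Intelligenz', 'Roboter'],
    ['Algorithmus', 'Wetter'],
    ['KI', 'künstlich_Intelligenz'],
]
first = docs_containing(toy_docs, 'künstlich_Intelligenz')
second = docs_containing(toy_docs, 'KI')

# Assumed baseline: the lower QTR of the two base terms against each other
toy_baseline = min(qtr(first, second), qtr(second, first))

for word in ('Algorithmus', 'Roboter', 'Wetter'):
    value = qtr(docs_containing(toy_docs, word), first | second)
    print(word, round(rqtr(value, toy_baseline), 1))
```

Under these assumptions, a positive value means the word cooccurs with the base terms more reliably than the base terms cooccur with each other.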

In [None]:
cooccurence_values = rqtr_lemma.count_cooccurence(
    BASE_TERMS,
    corpus,
    min_count=1,
    max_ngram_len=1
)
rqtrn_table = rqtr_lemma.cooccurence_to_metric(
    cooccurence_values,
    baseline,
    metric='rqtrn'
)

In [None]:
# Again, you could save the rqtrn table to a file
# rqtrn_table.to_excel('rqtrn_table.xlsx', index=False)

## 3: Combined Evaluation

With the RQTR and keyness results, we can now create a final search term query with Pandas magic!

Combine the two dataframes...

In [None]:
# Create a new dataframe with both keyness and rqtrn

combined_df = pd.merge(
    rqtrn_table,
    keynesses,
    on='Word',
    how='outer'
)
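
If you are unsure what the outer merge keeps, this toy example may help. The two small dataframes and their values are invented for illustration; only the column names 'Word', 'RQTRN' and 'Keyness' are taken from this notebook.

```python
import pandas as pd

# Toy tables: only 'Roboter' appears in both
rqtrn_demo = pd.DataFrame({'Word': ['Algorithmus', 'Roboter'], 'RQTRN': [12.0, 40.0]})
keyness_demo = pd.DataFrame({'Word': ['Roboter', 'Daten'], 'Keyness': [55.0, 8.0]})

# Outer merge: keeps every word, missing values become NaN
outer = pd.merge(rqtrn_demo, keyness_demo, on='Word', how='outer')

# Inner merge: keeps only words that have both values
inner = pd.merge(rqtrn_demo, keyness_demo, on='Word', how='inner')

print(outer)
print(inner)
```

The outer merge lets you inspect words that only one method found. Dropping the rows with missing values afterwards leaves the same words as an inner merge.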

In [None]:
# Yet again, we can save it...
# combined_df.to_excel('Combined-RQTRn-Keyness.xlsx', index=False)

We can now use pandas dataframe methods to filter the results. For example, we can keep only the results above a certain threshold.

See below for examples.

In [None]:
# Drop rows that lack rqtrn or keyness values
filtered_df = combined_df.dropna()

# Filter the dataframe to only include rows with rqtrn > 0 and keyness > 10
filtered_df = filtered_df[
    (filtered_df['Keyness'] > 10) &
    (filtered_df['RQTRN'] > 0)
]

# Keep only the top 50 rows after sorting by rqtrn
filtered_df = filtered_df.sort_values(
    by='RQTRN', ascending=False
).head(50)

# Let's take a look at the filtered dataframe
filtered_df

Now that we have our final dataframe, we can use the 'Word' column as a search query to create our final corpus.

The function we use will be familiar: it is the `match_wordlist` function we used before. Again, we have the parameters *min* and *unique* to play with.
Then we (again) use `corpus_from_found` to create our final thematic corpus.

In [None]:
found_docs = dr.match_wordlist(
    corpus,
    wordlist=filtered_df['Word'].tolist(),  # Use the filtered (final) dataframe
    min=5  # Let's be strict
)
created_corpus = dr.corpus_from_found(
    found_docs,
    source_corpus=corpus,
    goal_corpus=Corpus
)

# You can uncomment the following lines to print the metadata of the created corpus
# for _, meta in created_corpus:
#     print(meta['h1'])
#     print(meta['url'])

If the corpus we loaded is annotated, we can evaluate how well our search query and search parameters performed.

The evaluation parameters are:
1. **annotator**: The key under which the annotations are stored. Use:
   1. *expert_annotator_1* for Janine
   2. *expert_annotator_2* for Bruno
   3. *majority_vote* for an average of all expert and HiWi annotations
2. **mode**:
   1. *'pooling'*: treat all non-annotated files as negative examples
   2. *'annotated'*: only evaluate files that are annotated, ignore the rest
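
The difference between the two modes can be sketched as follows. This notebook does not state which scores `eval_retrieval` reports, so precision and recall are an assumption here, and `precision_recall` is a hypothetical helper working on plain document ids.

```python
def precision_recall(found, annotations, mode='pooling'):
    """Compute (precision, recall) for a set of retrieved document ids.

    found: set of ids returned by the search query.
    annotations: dict mapping annotated ids to True (relevant) or False.
    mode 'pooling' counts unannotated hits as irrelevant,
    mode 'annotated' ignores unannotated hits entirely.
    """
    if mode == 'annotated':
        found = {doc for doc in found if doc in annotations}
    elif mode != 'pooling':
        raise ValueError("mode must be 'pooling' or 'annotated'")
    relevant = {doc for doc, label in annotations.items() if label}
    true_positives = len(set(found) & relevant)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


toy_found = {1, 2, 3, 4}
toy_annotations = {1: True, 2: False, 5: True}

print(precision_recall(toy_found, toy_annotations, mode='pooling'))
print(precision_recall(toy_found, toy_annotations, mode='annotated'))
```

In this toy case, pooling gives the lower precision because the two unannotated hits count as errors, while recall is the same in both modes.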

In [None]:
dr.eval_retrieval(
    corpus,
    found_docs,
    annotator='majority_vote',
    mode='pooling'
)

You can also create an annotation corpus from the found documents. Specify which annotator you want to look at to see which files are already annotated (if you take 'majority_vote', then you will get all files that at least one person annotated).

Also specify the path you want to copy the files into. The files will be readable txt files with metadata for easy annotation.

In [None]:
dr.prepare_annotations(
    corpus,
    found_docs,
    annotator='majority_vote',  # Ignore everything that was annotated at least once
    goalpath='annotation_round2'
)