# Keyness Metrics (for Lemmatized Corpora)

Setup...

In [None]:
import json
import pathlib

import pandas as pd

from src.corpus import Corpus, FrequencyCorpus
from src.metrics import keyness
from src.corpus_creation import document_retriever as dr

%load_ext autoreload
%autoreload 2

Loading the data...

Put the path to your corpus in the variable `CORPUSDIR`.

The code below assumes that the corpus is a directory of JSON files, each containing a list of lemmata under the key `lemmas`, plus the metadata keys `h1` and `url`.
If your data has a different format, adjust the code accordingly. The result should be a list of lists of lemmata, one inner list per document.
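
For orientation, a single corpus file might look like the sketch below. Only the key names `lemmas`, `h1`, and `url` come from the loading code; the file name and all values are invented for illustration.

```python
import json
import pathlib
import tempfile

# Hypothetical example document; only the key names come from the notebook.
example_doc = {
    "h1": "Example headline",
    "url": "https://example.org/article",
    "lemmas": ["künstlich", "Intelligenz", "verändern", "Arbeit"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "example.json"
    # ensure_ascii=False keeps umlauts readable in the file
    path.write_text(json.dumps(example_doc, ensure_ascii=False), encoding="utf-8")

    # Reading the file back yields one inner list of the "list of lists"
    loaded = json.loads(path.read_text(encoding="utf-8"))
    example_docs = [loaded["lemmas"]]

print(example_docs)
```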

In [None]:
# Put the path to the directory containing the corpus files here
CORPUSDIR = '/home/brunobrocai/Data/MoWiKo/Paper-themKorp/full'

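# Read every file in the corpus directory (sorted for a reproducible order)
files = sorted(pathlib.Path(CORPUSDIR).iterdir())
docs = []
metadata = []
for file in files:
    # Be explicit about the encoding instead of relying on the platform default
    with open(file, 'r', encoding='utf-8') as f:
        doc = json.load(f)
    docs.append(doc['lemmas'])
    metadata.append({'h1': doc['h1'], 'url': doc['url']})
corpus = Corpus(docs, metadata)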

Now, let's create two subcorpora:

1. STUDY CORPUS: All documents containing the word 'KI' or the 2-gram 'künstlich Intelligenz' at least once. (You could of course use different search terms or raise the number of hits required to include a document.)
2. REFERENCE CORPUS: All other documents. (You could also use a completely different corpus here, e.g. one from the Leipzig Corpora Collection.)
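
The split described above can be sketched in plain Python. This is not the implementation of `dr.match_wordlist`; the helpers `count_hits` and `split_by_terms` are hypothetical and only illustrate the idea of matching single lemmata and n-grams against lemmatized documents.

```python
def count_hits(lemmas, terms):
    """Count how often any search term occurs in a lemmatized document.

    A term is either a single lemma (str) or an n-gram (tuple of lemmas).
    """
    hits = 0
    for term in terms:
        ngram = (term,) if isinstance(term, str) else tuple(term)
        n = len(ngram)
        # Slide a window of length n over the document
        for i in range(len(lemmas) - n + 1):
            if tuple(lemmas[i:i + n]) == ngram:
                hits += 1
    return hits


def split_by_terms(docs, terms, min_hits=1):
    """Split docs into (study, reference) by the number of term hits."""
    study, reference = [], []
    for lemmas in docs:
        target = study if count_hits(lemmas, terms) >= min_hits else reference
        target.append(lemmas)
    return study, reference


# Invented toy documents
toy_docs = [
    ["KI", "sein", "überall"],
    ["künstlich", "Intelligenz", "lernen"],
    ["Wetter", "sein", "schön"],
]
study, reference = split_by_terms(toy_docs, ["KI", ("künstlich", "Intelligenz")])
print(len(study), len(reference))
```

Raising `min_hits` in this sketch makes the study corpus stricter, at the cost of moving borderline documents into the reference corpus.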

In [None]:
# Each search term is either a single lemma or a tuple of lemmata (n-gram)
hits = dr.match_wordlist(
    corpus, ['KI', ('künstlich', 'Intelligenz')], min=1
)

study_corpus = dr.corpus_from_found(
    hits, source_corpus=corpus,
    goal_corpus='FrequencyCorpus'
)
reference_corpus = dr.corpus_from_notfound(
    hits, source_corpus=corpus,
    goal_corpus='FrequencyCorpus'
)

Now, for every word in our study corpus, we calculate its keyness score. You can use different metrics here:

1. Statistical Significance:
    + Log-Likelihood (according to Chi-Square)
    + Log-Ratio (according to Rayson)
2. Effect size:
    + Odds Ratio
    + Percentage Difference

More to come!
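
As a rough illustration of what these metrics compute, here is a minimal sketch using the textbook formulas on raw frequencies. It does not reproduce the `keyness` module; the function names and the toy counts are assumptions. Here `a` and `b` are the word's frequencies in the study and reference corpus, and `c` and `d` are the total token counts of the two corpora. The effect-size functions assume non-zero frequencies in both corpora.

```python
import math


def log_likelihood(a, b, c, d):
    """G2 statistic on the 2x2 table [[a, b], [c - a, d - b]]."""
    observed = [[a, b], [c - a, d - b]]
    total = c + d
    row_totals = [a + b, total - (a + b)]
    col_totals = [c, d]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if observed[i][j] > 0:  # 0 * ln(0) is treated as 0
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2


def log_ratio(a, b, c, d):
    """Binary log of the ratio of relative frequencies."""
    return math.log2((a / c) / (b / d))


def odds_ratio(a, b, c, d):
    """Odds of the word in the study corpus over its odds in the reference."""
    return (a * (d - b)) / (b * (c - a))


def percentage_difference(a, b, c, d):
    """Difference of relative frequencies as a percentage of the reference."""
    return (a / c - b / d) * 100.0 / (b / d)


# Invented toy counts: 30 hits in 1000 tokens vs. 10 hits in 2000 tokens
a, b, c, d = 30, 10, 1000, 2000
print(log_likelihood(a, b, c, d))
print(log_ratio(a, b, c, d))
print(odds_ratio(a, b, c, d))
print(percentage_difference(a, b, c, d))
```

Note that the G2 statistic is non-directional: a word that is strongly under-represented in the study corpus also receives a high score, so an effect-size measure is useful for telling the two cases apart.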

In [None]:
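# Compute a log-likelihood keyness score for every word in the study corpus
keynesses = {}
for word in study_corpus.get_unigrams():
    # Contingency table of the word's frequencies in study and reference corpus
    contingency_table = keyness.corpus_to_contingency(
        word, study_corpus, reference_corpus
    )
    keynesses[word] = keyness.log_likelihood_scipy(contingency_table)

# Collect the scores in a table, highest keyness first
df = pd.DataFrame(keynesses.items(), columns=['Word', 'LL'])
df = df.sort_values(by='LL', ascending=False)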

Keep only the words whose keyness score exceeds a minimum (here `LL > 2.0`), then take the top n results (here n = 50).
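
To judge what a log-likelihood cutoff means statistically, the sketch below converts a score into a p-value. It assumes that the score is a G2 statistic on a 2x2 table, which is compared against a chi-square distribution with one degree of freedom; whether `log_likelihood_scipy` returns exactly this value is an assumption. The helper name `p_value_1df` is made up for illustration.

```python
import math


def p_value_1df(statistic):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom.

    For one degree of freedom this has the closed form erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# 2.0 is the cutoff used in this notebook; the others are common critical values
for threshold in (2.0, 3.84, 6.63, 10.83):
    print(f"LL = {threshold}: p = {p_value_1df(threshold):.4f}")
```

Under that assumption, a cutoff of 2.0 is fairly permissive (p is roughly 0.16), while 3.84 corresponds to the conventional 0.05 level.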

In [None]:
filtered_df = df[df['LL'] > 2.0]

# top-50
filtered_df = filtered_df.head(50)
filtered_df

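Now we can use the keyword list to build our thematic corpus: every document with at least two keyword hits (`min=2`) is included. Finally, we print the headline and URL of each selected document.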

In [None]:
found_docs = dr.match_wordlist(
    corpus, filtered_df['Word'].tolist(), min=2
)
created_corpus = dr.corpus_from_found(
    found_docs, source_corpus=corpus,
    goal_corpus='Corpus'
)

for _, meta in created_corpus:
    print(meta['h1'])
    print(meta['url'])