# Collocation Metrics
This notebook computes collocation metrics for a given corpus.

It also evaluates how well the resulting collocations work as a query against that corpus.

Setup...

In [None]:
import os
from src.corpus import FrequencyCorpus
from src.metrics import cooccurrence_parallel as co
from src.load_data import load_files
from src.corpus_creation import document_retriever as dr
from src.corpus_creation import handle_wordlists as hw


%load_ext autoreload
%autoreload 2

Loading the data...

Put the path to your corpus in the variable `CORPUSDIR`.

I assume that the data is a set of JSON files, each containing a list of lemmata under the key `lemmas`.
If your data has a different format, adjust the loading code accordingly. The result should be a list of lists of lemmata, one inner list per document.
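
For orientation, here is a minimal sketch of a loader that produces the expected list of lists. It is not the project's `load_files`; the function name `read_lemma_files` is hypothetical, and it assumes one JSON object per file with a `lemmas` key and no metadata handling.

```python
import json
from pathlib import Path


def read_lemma_files(corpus_dir):
    """Read every *.json file in corpus_dir and return a list of lemma lists.

    Hypothetical helper: assumes each file holds one JSON object
    with a 'lemmas' key that maps to a list of strings.
    """
    documents = []
    # Sort the paths so the document order is reproducible.
    for path in sorted(Path(corpus_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        documents.append(list(record["lemmas"]))
    return documents
```

If your files store tokens under another key or as plain text, only the line that reads `record["lemmas"]` needs to change.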

In [None]:
# Put the path to the directory containing the corpus files here
CORPUSDIR = '/home//.../final_corpus'

data, metadata = load_files(CORPUSDIR)
corpus = FrequencyCorpus(data, metadata, filter=None)

We calculate the collocations of the unigram 'KI' and the bigram 'künstliche Intelligenz' (lemmatized as 'künstlich Intelligenz') as if they were one word.

In [None]:
# Treating 'künstlich Intelligenz' as one token and giving it the name 'KI'
# Now, 'künstlich Intelligenz' and 'KI' are the same token
corpus.treat_as_one(['künstlich', 'Intelligenz'], 'KI')
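
To make the effect of this step concrete, the sketch below merges a bigram into a single token in a plain list of lemmata. It only illustrates the idea; `merge_ngram` is a hypothetical helper and not how `treat_as_one` is implemented.

```python
def merge_ngram(tokens, ngram, replacement):
    """Return a copy of tokens with each occurrence of ngram replaced by one token."""
    ngram = list(ngram)
    n = len(ngram)
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i:i + n] == ngram:
            # The whole n-gram collapses into the replacement token.
            merged.append(replacement)
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged


lemmas = ["die", "künstlich", "Intelligenz", "und", "KI"]
print(merge_ngram(lemmas, ["künstlich", "Intelligenz"], "KI"))
```

After merging, both spellings contribute to the same counts, which is what lets the later steps treat them as one term.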

Use the method `count_cooccurrences` to get a cooccurrence matrix for the corpus.

In [None]:
cooccurrences = co.Cooccurrences(window_size=None, unit_separator='\n\n', duplicate_counting=True)
cooccurrences.count_cooccurrences(corpus)
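
As a rough illustration of unit-based counting, the sketch below counts how often two terms appear in the same unit. It rests on assumptions that the notebook does not confirm: that `window_size=None` means the whole unit is the window, and that units are delimited by a separator token such as `'\n\n'` inside the lemma list. It counts each pair once per unit and does not model `duplicate_counting`. The names `split_units` and `count_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def split_units(tokens, separator="\n\n"):
    """Split a token list into units at every separator token."""
    units, current = [], []
    for token in tokens:
        if token == separator:
            if current:
                units.append(current)
            current = []
        else:
            current.append(token)
    if current:
        units.append(current)
    return units


def count_pairs(documents, separator="\n\n"):
    """Count, for each unordered pair of terms, the units that contain both."""
    pair_counts = Counter()
    for tokens in documents:
        for unit in split_units(tokens, separator):
            # Sorting the distinct terms gives each pair one canonical key.
            for pair in combinations(sorted(set(unit)), 2):
                pair_counts[pair] += 1
    return pair_counts
```

Looking up a pair requires the terms in sorted order, for example `count_pairs(docs)[("Daten", "KI")]`.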

With the matrix, we can calculate collocation metrics for the unigram/bigram combo 'KI'.

In [None]:
df = co.all_collocations(
    cooccurrences,
    'KI',
    co.calculate_pmi,
    min_count=1,
    smoothing=0.0001,
    normalize=True
)
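
For reference, pointwise mutual information compares the observed joint probability of two terms with what independence would predict, and the normalized variant (NPMI) divides by `-log p(x, y)` so that scores fall between -1 and 1. The sketch below uses the standard definitions; it is not the code of `co.calculate_pmi`, and adding `smoothing` to the joint count is an assumption made only for illustration.

```python
import math


def npmi(pair_count, count_x, count_y, total, smoothing=0.0):
    """Normalized PMI from raw counts over `total` units.

    Assumption: smoothing is simply added to the joint count.
    """
    p_xy = (pair_count + smoothing) / total
    if p_xy <= 0:
        return -1.0  # limit value: the terms never co-occur
    if p_xy >= 1:
        return 1.0   # limit value: the terms always co-occur
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# Two terms that each occur in 10 of 100 units:
print(npmi(10, 10, 10, 100))  # always together, score close to 1
print(npmi(1, 10, 10, 100))   # as often as chance predicts, score close to 0
print(npmi(0, 10, 10, 100))   # never together, score -1
```

A score above 0 means the pair co-occurs more often than independence would predict.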

Now let's filter the resulting dataframe to only include positive values and add a column with document counts.

In [None]:
# Get unigrams so that we get the document frequencies
_ = corpus.get_unigrams()

In [None]:
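# Copy the filtered slice so that adding a column does not trigger pandas' SettingWithCopyWarning
filtered_df = df[df['Stat'] > 0].copy()
filtered_df['Doc_Freq'] = filtered_df['Term'].apply(lambda term: corpus.ngram_doccounts[1].get((term,), 0))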
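
The lookup above reads document frequencies from a mapping keyed by 1-tuples such as `('KI',)`. A minimal sketch of how such a mapping could be built from a list of lemma lists follows; `document_frequencies` is a hypothetical helper, not the implementation behind `ngram_doccounts`.

```python
from collections import Counter


def document_frequencies(documents):
    """Map each unigram, as a 1-tuple, to the number of documents containing it."""
    doc_counts = Counter()
    for tokens in documents:
        # set() ensures a term is counted once per document, however often it occurs.
        doc_counts.update((token,) for token in set(tokens))
    return doc_counts


docs = [["KI", "KI", "Daten"], ["KI"]]
frequencies = document_frequencies(docs)
print(frequencies[("KI",)], frequencies[("Daten",)])
```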

Save the results.

In [None]:
FILEPATH = 'wordlists/collocations/windowsizeParagraph-npmi.csv'
if os.path.exists(FILEPATH):
    raise FileExistsError(
        f"File {FILEPATH} already exists. Please remove it or choose a different name."
    )
filtered_df.to_csv(FILEPATH, index=False)

We can evaluate the collocations as a query...

First keep only terms that occur in at least 5 documents and take the top 50 collocations by score, then use them to query the corpus.

In [None]:
filtered_df = filtered_df[filtered_df['Doc_Freq'] >= 5]
filtered_df = hw.top_x_with_core(50, 'Stat', filtered_df, ['KI'])

wordlist = filtered_df['Term'].to_list()

In [None]:
min_to_try = [1, 3, 5]
for n in min_to_try:
    print(f"+++++ MIN == {n} +++++")
    hits = dr.match_wordlist(
        corpus,
        wordlist=wordlist,
        min=n,
        unique=False
    )
    _ = dr.eval_retrieval(
        corpus,
        hits,
        annotator='gold_label',
        mode='pooling'
    )
    print()
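
To show what this loop varies, here is a simplified sketch of threshold-based matching and its evaluation. It assumes that a document counts as a hit when it contains at least `min_hits` matches from the wordlist, and that `unique` switches between counting distinct terms and counting every occurrence. These are assumptions about `dr.match_wordlist`, not its documented behavior. The `pooling` mode of `dr.eval_retrieval` is not modeled, and both function names are hypothetical.

```python
def match_documents(documents, wordlist, min_hits=1, unique=False):
    """Return the indices of documents with at least min_hits wordlist matches."""
    terms = set(wordlist)
    hits = []
    for index, tokens in enumerate(documents):
        matches = [token for token in tokens if token in terms]
        # unique=True counts distinct terms, unique=False counts every occurrence.
        n_matches = len(set(matches)) if unique else len(matches)
        if n_matches >= min_hits:
            hits.append(index)
    return hits


def precision_recall(hits, relevant):
    """Compare retrieved document indices with the indices labeled relevant."""
    hits, relevant = set(hits), set(relevant)
    true_positives = len(hits & relevant)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

Under these assumptions, raising the threshold can only shrink the set of hits, so recall cannot increase, while precision may move in either direction.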