Author: Mateusz Burza

## Implementation of function `lang_confidence_score`

In [None]:
def lang_confidence_score(word_counts: dict,
                            language_words_with_frequency: dict) -> float:
    """
    Given a dictionary of word counts and a dictionary of language words
    with their frequencies, return a confidence score in [0, 1]
    indicating how well the text matches the language distribution.
    """
    if not word_counts or not language_words_with_frequency:
        return 0.0

    total = sum(word_counts.values())
    if total == 0:
        return 0.0

    # Normalise the raw `word_counts` to relative frequencies
    article_freqs = {w: c / total for w, c in word_counts.items()}

    # Language frequencies as provided (assumed normalized by wordfreq)
    lang_freqs = language_words_with_frequency
    k = len(lang_freqs)
    if k == 0:
        return 0.0

    overlap = set(article_freqs.keys()) & set(lang_freqs.keys())
    comb_size = (len(article_freqs) + len(lang_freqs)) / 2
    overlap_ratio = len(overlap) / comb_size

    if not overlap:
        return 0.0

    # Frequency difference: how well do article frequencies match language frequencies?
    # (An epsilon added to both terms would cancel out, so it is omitted.)
    diffs = [abs(article_freqs[w] - lang_freqs[w]) for w in overlap]
    avg_diff = sum(diffs) / len(diffs)
    
    # Convert difference to similarity: lower difference = higher score
    freq_similarity = max(0.0, 1.0 - avg_diff)

    # Confidence penalty for small overlaps: avoid false positives with few common words.
    # Since len(overlap) <= k, this ratio never exceeds 0.2; larger overlaps are handled below.
    overlap_confidence = min(1.0, len(overlap) / (k * 5))
    if len(overlap) > 50:
        overlap_confidence = 1.0

    score = 0.4 * overlap_ratio + (0.6 * freq_similarity * overlap_confidence)

    # Apply logistic function to smooth the score
    from math import exp
    logistic_score = 1.0 / (1.0 + exp(-(score - 0.5) * 10))

    return logistic_score
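
The smoothing step can be inspected in isolation. The sketch below copies only the logistic formula from the function above into a helper named `logistic` (a hypothetical name, not part of the notebook) and evaluates it at a few raw scores. The raw score is a weighted sum of terms that each lie in [0, 1], so the smoothed result stays strictly inside that interval (roughly 0.0067 to 0.9933) and reaches exactly 0 only through the early returns.

```python
from math import exp


def logistic(score: float) -> float:
    """Hypothetical helper: the same smoothing formula as in lang_confidence_score."""
    return 1.0 / (1.0 + exp(-(score - 0.5) * 10))


# Evaluate the smoothing at the ends and the midpoint of the raw score range.
for raw in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"raw={raw:.2f} -> smoothed={logistic(raw):.4f}")
```

A raw score of 0.5 maps to exactly 0.5, and the steepness factor of 10 pushes values on either side quickly toward the extremes.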


## Loading data

Collect data from wiki:

In [None]:
from google.colab import files
from pathlib import Path
from json import load

big_path = Path("/content/wiki_big_article_wc.json")
low_path = Path("/content/wiki_low_conf_score_wc.json")

# Upload the two JSON files named above; each should contain a word -> count dict
uploaded = files.upload()

with big_path.open("r", encoding="utf-8") as f:
    big_article_wc = load(f)

with low_path.open("r", encoding="utf-8") as f:
    low_conf_score_wc = load(f)


Second data source: FrequencyWords (https://github.com/hermitdave/FrequencyWords),
licensed under CC BY-SA 4.0.

Used for statistical analysis only.


In [None]:
from pandas import read_csv

def load_freq(lang, n=10_000):
    """Load frequency list from GitHub FrequencyWords repository"""
    try:
        url = f"https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/{lang}/{lang}_50k.txt"
        df = read_csv(
            url,
            sep=" ",
            names=["word", "frequency"],
            # Keep tokens such as "null" or "nan" as strings instead of missing values
            keep_default_na=False,
        )
        return df.head(n)
    except Exception as e:
        print(f"Error loading frequency data for {lang}: {e}")
        return None

def normalize_to_dict(df):
    if df is None:
        return {}
    total = df["frequency"].sum()
    return dict(zip(df["word"], df["frequency"] / total))

langs = ["en", "de", "es"]

external_wc = {}
for lang in langs:
    df = load_freq(lang)
    external_wc[lang] = normalize_to_dict(df)
    if external_wc[lang]:
        print(f"Loaded {len(external_wc[lang])} words for {lang}")
    else:
        print(f"Failed to load data for {lang}")
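
As a rough illustration of what the normalisation above computes, here is a minimal sketch in plain Python, without pandas. The sample lines are made up, and `normalize_counts` is a hypothetical helper, not part of the notebook. It assumes the same `word count` line layout that `load_freq` reads.

```python
def normalize_counts(pairs):
    """Turn (word, count) pairs into relative frequencies that sum to 1."""
    total = sum(count for _, count in pairs)
    if total == 0:
        return {}
    return {word: count / total for word, count in pairs}


# Made-up lines in the "word count" layout (not real corpus data).
sample_lines = ["the 50", "of 30", "and 20"]

pairs = []
for line in sample_lines:
    # Split on the last space so the count is always the final field.
    word, count = line.rsplit(" ", 1)
    pairs.append((word, int(count)))

freqs = normalize_counts(pairs)
print(freqs)
```

Note that the total is taken over the loaded rows only. In the notebook this means the frequencies are relative to the top `n` words rather than to the whole corpus.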



Collect data from `wordfreq`

In [None]:
!pip install wordfreq
from wordfreq import top_n_list, word_frequency

# Helper function: build dict of top-k words with their language frequencies
def build_language_freq(language: str, k: int) -> dict:
    words = top_n_list(language, k)
    return {w: word_frequency(w, language) for w in words}

## Analysis

In [None]:
import matplotlib.pyplot as plt

k_values = [3, 10, 100, 1000]
langs = [("English", "en"), ("Spanish", "es"), ("German", "de")]

word_counts_dict = {
    "low_conf": low_conf_score_wc,
    "big_article": big_article_wc,
    "external_en": external_wc["en"],
    "external_de": external_wc["de"],
    "external_es": external_wc["es"],
}

fig, axes = plt.subplots(1, 3, figsize=(13, 5))

for idx, (lang_name, lang_code) in enumerate(langs):
    ax = axes[idx]
    for label, wc in word_counts_dict.items():
        scores = []
        for k in k_values:
            lang_freq = build_language_freq(lang_code, k)
            scores.append(lang_confidence_score(wc, lang_freq))
        ax.plot(k_values, scores, marker="o", label=label)
    ax.set_title(f"Lang. confidence vs k ({lang_name})")
    ax.set_xscale("log")
    ax.set_xlabel("k (number of top language words)")
    ax.set_ylabel("Confidence")
    ax.set_ylim(0, 1)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.subplots_adjust(wspace=0.5)
plt.show()


## Description of the results

### Did the choice of languages have an impact on the confidence scores?

Yes, it had an impact. English and German both belong to the Germanic language family and developed in geographically close regions. The charts suggest that the two languages share many mutual borrowings and many common high-frequency words, an effect that is likely amplified by, for example, cultural similarity.

In summary, the choice of languages had a significant impact.

### Looking at the `language_words_with_frequency` values for the data and the most frequent words in the language of the data, can you see that in the selected language words are often inflected?

The most frequent words show little inflection (unlike, for example, in Polish).

Something else stands out:
-> the vocabulary used in the wiki is quite niche

This is visible in the first chart: as `k` increases, the `lang_confidence_score` value clearly trends downward.
This means that beyond the set of the most popular words (some of which also appear as loanwords in other languages), the wiki articles contain words that are uncommon in the language as a whole. Given the subject of this wiki (Pokemon and its whole universe), I consider this an expected result.
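
One mechanism behind such a downward trend can be shown with a toy example. The sketch below uses made-up vocabularies and a hypothetical helper `overlap_ratio` that mirrors only the overlap term of `lang_confidence_score`. It is an illustration under those assumptions, not a reproduction of the charts.

```python
def overlap_ratio(article_vocab: set, lang_top_k: list) -> float:
    """Overlap term only: shared words divided by the mean of the two vocabulary sizes."""
    overlap = article_vocab & set(lang_top_k)
    comb_size = (len(article_vocab) + len(lang_top_k)) / 2
    return len(overlap) / comb_size


# Made-up ranked word list standing in for the top-k words of a language.
ranked_words = [f"w{i}" for i in range(1000)]

# Made-up article: the 10 most common words plus 40 niche words
# that never appear in the ranked list.
article_vocab = set(ranked_words[:10]) | {f"niche{i}" for i in range(40)}

for k in (10, 100, 1000):
    print(k, round(overlap_ratio(article_vocab, ranked_words[:k]), 4))
```

In this toy setup the number of shared words stays at 10 once `k` passes 10, while the denominator keeps growing with `k`, so the ratio falls at each larger `k`.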

The second and third charts show that the overlap between the compared languages and the English wiki word sets differs noticeably.

### Was it difficult to find an article for which the `lang_confidence_score` result is as low as possible in the wiki? Is this a specificity of this wiki?

Yes and no.

For the value of this function to be simply low? -> Yes
That mainly follows from what I described in the previous answer.

For it to be the lowest possible? -> That took a bit of work.

However, the differences were not huge. (Maybe we need to add a new page to the wiki ;)