Author: Mateusz Burza

This notebook should be executed in Google Colab (online version).

**Steps**
1. Open: https://colab.research.google.com/
2. Click **Upload notebook** and choose this file.
3. Run `get_dicts_to_analysis.py` to generate:
   - `wiki_big_article_wc.json`
   - `wiki_low_conf_score_wc.json`

## Implementation of function `lang_confidence_score`

The `lang_confidence_score` function measures how closely an article’s word distribution matches a target language distribution. It first normalizes the article’s word counts, then computes the overlap of vocabularies and the average frequency difference on the shared words. These components are combined into an intermediate score, penalized when the overlap is small, and finally smoothed with a logistic function to return a value in $[0, 1]$.

### Formulas used in `lang_confidence_score`
Let $A$ be the set of unique article words and $L$ the set of words in an arbitrary language.
Let $c_w$ be the count of word $w$ in the article and $L_w$ the language frequency of word $w$.

**1. Normalization:**
$$
f_w = \frac{c_w}{\sum_u c_u}
$$

**2. Overlap and overlap ratio:**
$$
O = A \cap L
$$
$$
r = \frac{|O|}{( |A| + |L| )/2}
$$

**3. Average frequency difference** (the code adds the same $\varepsilon$ to both terms, so it cancels and is omitted here):
$$
\Delta = \frac{1}{|O|} \sum_{w \in O} \left| f_w - L_w \right|
$$

**4. Similarity:**
$$
s = \max(0, 1 - \Delta)
$$

**5. Overlap confidence penalty:**
$$
p = \min\left(1, \frac{|O|}{5|L|}\right) \quad \text{(or } p=1 \text{ if } |O| > 1000\text{)}
$$

**6. Intermediate score:**
$$
q = 0.4\,r + 0.6\,s\,p
$$

**7. Logistic smoothing:**
$$
\text{score} = \frac{1}{1 + e^{-10(q-0.5)}}
$$
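
To make steps 1 to 4 concrete, here is a minimal sketch on made-up toy data. The words, counts, and frequencies are illustrative assumptions, not values from the wiki or from `wordfreq`. It only traces $f_w$, $O$, $r$, $\Delta$, and $s$; the penalty and the smoothing are left out.

```python
# Toy inputs (hypothetical values chosen so the arithmetic is easy to follow).
article_counts = {"the": 3, "cat": 1}
language_freqs = {"the": 0.5, "a": 0.3, "dog": 0.2}

# 1. Normalization: f_w = c_w / sum_u c_u
total = sum(article_counts.values())
f = {w: c / total for w, c in article_counts.items()}  # the: 0.75, cat: 0.25

# 2. Overlap O = A ∩ L and overlap ratio r = |O| / ((|A| + |L|) / 2)
O = set(f) & set(language_freqs)                        # {"the"}
r = len(O) / ((len(f) + len(language_freqs)) / 2)       # 1 / 2.5

# 3. Average absolute frequency difference on the shared words
delta = sum(abs(f[w] - language_freqs[w]) for w in O) / len(O)

# 4. Similarity: lower difference -> higher similarity
s = max(0.0, 1.0 - delta)

print(f"r={r:.2f}, delta={delta:.2f}, s={s:.2f}")  # r=0.40, delta=0.25, s=0.75
```

Only one of the two article words is shared with the language list, so $r$ is low even though the shared word's frequency is fairly close.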

In [None]:
def lang_confidence_score(word_counts: dict,
                            language_words_with_frequency: dict) -> float:
    """
    Given a dictionary of word counts and a dictionary of language words
    with their frequencies, return a confidence score in [0, 1]
    indicating how well the text matches the language distribution.
    """
    if not word_counts or not language_words_with_frequency:
        return 0.0

    total = sum(word_counts.values())
    if total == 0:
        return 0.0

    # 1. Normalize `word_counts` to get frequencies
    article_freqs = {w: c / total for w, c in word_counts.items()}

    # Language frequencies as provided (assumed normalized by wordfreq)
    lang_freqs = language_words_with_frequency
    k = len(lang_freqs)
    if k == 0:
        return 0.0

    # 2. Overlap ratio: how many words are common between article and language?
    overlap = set(article_freqs.keys()) & set(lang_freqs.keys())
    if not overlap:
        return 0.0
    
    mean_vocab_size = (len(article_freqs) + len(lang_freqs)) / 2
    overlap_ratio = len(overlap) / mean_vocab_size

    # 3. Avg frequency difference: how well do article frequencies match language frequencies?
    # Note: eps is added to both terms, so it cancels and does not change the difference.
    eps = 1e-9
    diffs = [abs((article_freqs[w] + eps) - (lang_freqs[w] + eps)) for w in overlap]
    avg_diff = sum(diffs) / len(diffs)
    
    # 4. Similarity: lower difference -> higher score
    similarity = max(0.0, 1.0 - avg_diff)

    # 5. Overlap confidence penalty: for small overlaps - avoid false positives with few common words
    overlap_confidence = min(1.0, len(overlap) / (k * 5))
    if len(overlap) > 1000:
        overlap_confidence = 1.0

    # 6. Combined score:
    score = 0.4 * overlap_ratio + (0.6 * similarity * overlap_confidence)

    # 7. Logistic smoothing:
    from math import exp
    logistic_score = 1.0 / (1.0 + exp(- 10 * (score - 0.5)))

    return logistic_score
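
A quick way to check the edge cases handled at the top of the function is a small contract test. The sketch below is an assumption on my side (the helper name `check_score_contract` and the toy inputs are hypothetical); it accepts any scoring function with the same two arguments.

```python
def check_score_contract(score_fn):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    toy_lang = {"the": 0.5, "a": 0.3}  # hypothetical language frequencies

    # Empty inputs and disjoint vocabularies should short-circuit to 0.0.
    if score_fn({}, toy_lang) != 0.0:
        problems.append("empty article should score 0.0")
    if score_fn({"the": 1}, {}) != 0.0:
        problems.append("empty language should score 0.0")
    if score_fn({"xyz": 2}, toy_lang) != 0.0:
        problems.append("disjoint vocabularies should score 0.0")

    # Any regular input should stay inside the documented range.
    value = score_fn({"the": 3, "cat": 1}, toy_lang)
    if not 0.0 <= value <= 1.0:
        problems.append(f"score {value} is outside [0, 1]")
    return problems
```

In the notebook, `check_score_contract(lang_confidence_score)` is expected to return an empty list.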


## Loading data

### Here we collect data from the wiki articles.

You should upload the JSON files generated earlier by `get_dicts_to_analysis.py`:
- `wiki_big_article_wc.json`
- `wiki_low_conf_score_wc.json`

In [None]:
from google.colab import files
from pathlib import Path
from json import load

big_path = Path("/content/wiki_big_article_wc.json")
low_path = Path("/content/wiki_low_conf_score_wc.json")

# Upload both JSON files; Colab saves uploads to the current working directory
# (/content by default), which is where the paths above point.
uploaded = files.upload()

with big_path.open("r", encoding="utf-8") as f:
    big_article_wc = load(f)

with low_path.open("r", encoding="utf-8") as f:
    low_conf_score_wc = load(f)
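
Before using the loaded objects, it can help to confirm that they have the expected shape. This is a minimal sketch that assumes each file holds a flat `{word: count}` JSON object; the helper name `validate_word_counts` and the sample content are hypothetical.

```python
import json

def validate_word_counts(obj):
    """Check that obj is a flat {word: non-negative number} mapping and return it."""
    if not isinstance(obj, dict):
        raise TypeError(f"expected a JSON object, got {type(obj).__name__}")
    for word, count in obj.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(count, bool) or not isinstance(count, (int, float)):
            raise ValueError(f"count for {word!r} is not a number: {count!r}")
        if count < 0:
            raise ValueError(f"count for {word!r} is negative: {count!r}")
    return obj

sample = json.loads('{"pikachu": 12, "the": 40}')  # hypothetical file content
validate_word_counts(sample)
```

In the notebook this could be applied as `validate_word_counts(big_article_wc)` and `validate_word_counts(low_conf_score_wc)`.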


### Here we collect data from `FrequencyWords`

This data will be used to compare results with wiki article word frequencies.

**The code below produces three dictionaries:**
- `external_wc["en"]`
- `external_wc["de"]`
- `external_wc["es"]`

Data source: FrequencyWords (https://github.com/hermitdave/FrequencyWords),
licensed under CC BY-SA 4.0.

Used for statistical analysis only.


In [None]:
from pandas import read_csv

def load_freq(lang, n=10_000):
    """Load frequency list from GitHub FrequencyWords repository"""
    try:
        url = f"https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/{lang}/{lang}_50k.txt"
        df = read_csv(
            url,
            sep=" ",
            names=["word", "frequency"],
            keep_default_na=False,  # keep tokens such as "null" or "nan" as strings, not NaN
        )
        return df.head(n)
    except Exception as e:
        print(f"Error loading frequency data for {lang}: {e}")
        return None

def normalize_to_dict(df):
    if df is None:
        return {}
    total = df["frequency"].sum()
    return dict(zip(df["word"], df["frequency"] / total))

langs = ["en", "de", "es"]

external_wc = {}
for lang in langs:
    df = load_freq(lang)
    external_wc[lang] = normalize_to_dict(df)
    if external_wc[lang]:
        print(f"Loaded {len(external_wc[lang])} words for {lang}")
    else:
        print(f"Failed to load data for {lang}")
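
For reference, the same parsing and normalization can be sketched without pandas. The sketch assumes the list has one `word count` pair per line, as the `sep=" "` call above implies; the function name `parse_frequency_lines` and the sample lines are made up. It also shows that tokens such as `null` or `nan` stay ordinary strings, whereas pandas' default settings read them as missing values.

```python
def parse_frequency_lines(lines, n=10_000):
    """Parse 'word count' lines; return {word: relative frequency} for the first n distinct words."""
    counts = {}
    for line in lines:
        if len(counts) >= n:
            break
        parts = line.strip().rsplit(" ", 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip malformed lines
        word, count = parts
        counts[word] = int(count)
    total = sum(counts.values())
    # Normalize so that the returned frequencies sum to 1.
    return {w: c / total for w, c in counts.items()} if total else {}

sample = ["the 6", "null 3", "nan 1"]  # made-up counts
freqs = parse_frequency_lines(sample)
print(freqs)
```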



### Here we collect data from `wordfreq`

In this code cell I install `wordfreq`, import helpers, and define `build_language_freq()`.

The function returns a dictionary of the top-$k$ words in the selected language with their frequencies.

In [None]:
!pip install wordfreq
from wordfreq import top_n_list, word_frequency

# Helper function: build dict of top-k words with their language frequencies
def build_language_freq(language: str, k: int) -> dict:
    words = top_n_list(language, k)
    return {w: word_frequency(w, language) for w in words}

## Analysis

In the next cell I plot line charts of `lang_confidence_score` versus $k$ (top-$k$ words).

There are three subplots, one per language (EN, ES, DE).

Each subplot compares datasets (wiki and external) on the same $k$ values.

In [None]:
import matplotlib.pyplot as plt

k_values = [3, 10, 100, 1000, 2000, 3000, 4000, 4300, 4500, 4700, 4800, 5000]
langs = [("English", "en"), ("Spanish", "es"), ("German", "de")]

word_counts_dict = {
    "low_conf": low_conf_score_wc,
    "big_article": big_article_wc,
    "external_en": external_wc["en"],
    "external_de": external_wc["de"],
    "external_es": external_wc["es"],
}

fig, axes = plt.subplots(1, 3, figsize=(13, 5))

for idx, (lang_name, lang_code) in enumerate(langs):
    ax = axes[idx]
    for label, wc in word_counts_dict.items():
        scores = []
        for k in k_values:
            lang_freq = build_language_freq(lang_code, k)
            s = lang_confidence_score(wc, lang_freq)
            scores.append(s)
        ax.plot(k_values, scores, marker="o", label=label)
    ax.set_title(f"Lang. confidence vs k ({lang_name})")
    ax.set_xscale("log")
    ax.set_xlabel("k values")
    ax.set_ylabel("Confidence")
    ax.set_ylim(0, 1)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.subplots_adjust(wspace=0.5)
plt.show()
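
The loop above rebuilds the same language dictionary once per dataset for every `(lang_code, k)` pair. One possible optimization, sketched here under assumptions, is to memoize the builder with `functools.lru_cache`. The stand-in `fake_build_language_freq` is hypothetical and only exists so the sketch runs without `wordfreq`.

```python
from functools import lru_cache

calls = []  # records how often the underlying builder actually runs

def fake_build_language_freq(language: str, k: int) -> dict:
    """Hypothetical stand-in for build_language_freq, so this sketch runs without wordfreq."""
    calls.append((language, k))
    return {f"{language}_word_{i}": 1.0 / k for i in range(k)}

# Wrap the builder; results are cached per (language, k) argument pair.
cached_build = lru_cache(maxsize=None)(fake_build_language_freq)

for _dataset in range(5):  # five datasets, as in word_counts_dict
    for k in (3, 10):
        cached_build("en", k)

print(len(calls))  # 2: one real call per distinct (language, k)
```

In the notebook the same wrapper could be applied as `lru_cache(maxsize=None)(build_language_freq)`. The cached dictionaries are shared between calls, so they should not be mutated.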


## Description of the results

### Did the choice of languages have an impact on the confidence scores?

In the final version, the impact is small, as seen in the charts.

However, English and German belong to the same language family (Germanic) and come from geographically close regions, so they share some borrowings and common words. As $k$ increases, these shared words become less significant.

When `lang_confidence_score` is made less strict (e.g., $\text{score} = \frac{1}{1 + e^{-2(q-0.5)}}$),
German performs slightly better than Spanish.
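
To see what "less strict" means numerically, the sketch below compares the two steepness values on a few sample values of $q$. The function name `logistic` and its `steepness` parameter are introduced here for illustration only.

```python
from math import exp

def logistic(q: float, steepness: float = 10.0) -> float:
    """Logistic smoothing centered at q = 0.5."""
    return 1.0 / (1.0 + exp(-steepness * (q - 0.5)))

for q in (0.0, 0.4, 0.5, 0.6, 1.0):
    print(f"q={q:.1f}  steepness 10: {logistic(q, 10):.3f}  steepness 2: {logistic(q, 2):.3f}")
```

With steepness 2 the output stays within roughly 0.27 to 0.73 for $q \in [0, 1]$, and low values of $q$ are not flattened toward 0. This may be why small differences between languages become visible in that variant.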

In summary, the choice of languages has only a small impact.

### Looking at the `language_words_with_frequency` values for the data and the most frequent words in the language of the data, can you see that words in the selected language are often inflected?

You can see that these words do not have many inflections (unlike, for example, in Polish).

Something else stands out:
- The vocabulary used in the wiki is quite niche.

This is visible in the first chart: as $k$ increases, the `lang_confidence_score` function trends downward
until about $k = 5000$, where it rises sharply.

Why?
- Larger $k$ increases the overlap; once $|O|$ exceeds 1000, the penalty is switched off ($p = 1$).

That means beyond the most popular words (often shared as loanwords), the wiki articles contain less common, domain-specific terms.
Given the specificity of this wiki (Pokémon and its whole universe), this is expected.
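
The jump can be reproduced in isolation. The sketch below follows the penalty as computed in the code cell of `lang_confidence_score` (ratio divided by `k * 5`, switched off above 1000 shared words); the helper name `overlap_confidence` and the sample overlap sizes are assumptions chosen for illustration.

```python
def overlap_confidence(overlap_size: int, k: int) -> float:
    """Penalty p: capped ratio, switched off once more than 1000 words are shared."""
    if overlap_size > 1000:
        return 1.0
    return min(1.0, overlap_size / (k * 5))

k = 5000
for overlap_size in (500, 1000, 1001):
    p = overlap_confidence(overlap_size, k)
    # 0.6 * p is the largest contribution of the similarity term to q (when s = 1).
    print(f"|O|={overlap_size:>4}  p={p:.2f}  0.6*p={0.6 * p:.3f}")
```

Because the overlap can never exceed $k$, the ratio stays at or below 0.2 until the threshold is crossed. At that point the similarity term's weight in $q$ jumps, which matches the sharp rise in the chart.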

The second and third charts show that the overlap between the compared languages and the English‑wiki word sets differs noticeably.

### Was it difficult to find an article for which the `lang_confidence_score` result is as low as possible in the wiki? Is this a specificity of this wiki?

Yes and no.

Finding an article with a merely low score was easy.
That mainly follows from the niche vocabulary described in the previous answer.

Finding the article with the lowest possible score took a bit of work.

However, the differences were not huge (maybe we need to add a new page to the wiki ;)).