# The Complexity Stress Test


First, install the following libraries:

In [2]:
import sys
!{sys.executable} -m pip install nltk
!{sys.executable} -m pip install textcomplexity
!{sys.executable} -m pip install stanza
!{sys.executable} -m pip install wordfreq 
!{sys.executable} -m spacy download en_core_web_md
!{sys.executable} -m pip install tqdm spacy numpy

Collecting nltk
  Using cached nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting click (from nltk)
  Using cached click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting joblib (from nltk)
  Using cached joblib-1.5.3-py3-none-any.whl.metadata (5.5 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2025.11.3-cp313-cp313-win_amd64.whl.metadata (41 kB)
Collecting tqdm (from nltk)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Using cached nltk-3.9.2-py3-none-any.whl (1.5 MB)
Downloading regex-2025.11.3-cp313-cp313-win_amd64.whl (277 kB)
Using cached click-8.3.1-py3-none-any.whl (108 kB)
Using cached joblib-1.5.3-py3-none-any.whl (309 kB)
Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, joblib, click, nltk

   ---------------- ----------------------- 2/5 [joblib]
   ------------------------ --------------- 3/5 [click]
   -------------------------------- ------- 4/5 [nltk]
   -------------------------------- ---

c:\Users\rroll\AppData\Local\Programs\Python\Python313\python.exe: No module named spacy


Collecting spacy
  Downloading spacy-3.8.11-cp313-cp313-win_amd64.whl.metadata (28 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Using cached spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.15-cp313-cp313-win_amd64.whl.metadata (2.3 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.13-cp313-cp313-win_amd64.whl.metadata (9.9 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.12-cp313-cp313-win_amd64.whl.metadata (2.6 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.10-cp313-cp313-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Using cached wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.5.2-cp313-cp313-win

First, import the following Python libraries:

In [1]:
# Standard library imports
import json
from collections import Counter
from functools import lru_cache
from pprint import pprint
from typing import Dict, Set, Iterable, Optional, Any, Tuple
import importlib.resources as pkg_resources

# Third-party imports
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn
import spacy
import stanza
import textcomplexity  # only used to access en.json
from tqdm.auto import tqdm  

# Download required resources
stanza.download('en')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Make sure WordNet is available; if not, download it.
try:
    _ = wn.synsets("dog")
except LookupError:
    nltk.download("wordnet")
    nltk.download("omw-1.4")

# Load spaCy model
nlp = spacy.load("en_core_web_md")
spacy_nlp = nlp
spacy_nlp.add_pipe("sentencizer")



  from .autonotebook import tqdm as notebook_tqdm
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.11.0.json: 435kB [00:00, 8.05MB/s]                    
2025-12-18 20:21:16 INFO: Downloaded file to C:\Users\rroll\stanza_resources\resources.json
2025-12-18 20:21:16 INFO: Downloading default packages for language: en (English) ...
2025-12-18 20:21:17 INFO: File exists: C:\Users\rroll\stanza_resources\en\default.zip
2025-12-18 20:21:20 INFO: Finished downloading models and saved to C:\Users\rroll\stanza_resources
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rroll\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\rroll\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


<spacy.pipeline.sentencizer.Sentencizer at 0x2bebcb28f10>

## Data loading

In [7]:
datasets ={'ose_adv_ele':'data_sampled/OSE_adv_ele.csv', 
           'ose_adv_int':'data_sampled/OSE_adv_int.csv',
           'swipe': 'data_sampled/swipe.csv',
           'vikidia':'data_sampled/vikidia.csv'}

def load_data(path):
    return pd.read_csv(path, sep='\t')
    

def load_dataset(name):
    if name not in datasets:
        raise ValueError(f"Dataset {name} not found")
    return load_data(datasets[name])

Let's load one of the datasets, in this case "ose_adv_ele".

In [8]:
df = load_dataset('ose_adv_ele')
df.head(3)


Unnamed: 0,Simple,Complex
0,"﻿When you see the word Amazon, what’s the firs...","﻿When you see the word Amazon, what’s the firs..."
1,"﻿To tourists, Amsterdam still seems very liber...","﻿Amsterdam still looks liberal to tourists, wh..."
2,"﻿Anitta, a music star from Brazil, has million...","﻿Brazil’s latest funk sensation, Anitta, has w..."


Let's look at a random row of the dataset:

In [9]:
row = df.sample(1)

print('SIMPLE TEXT')
print(row['Simple'].iloc[0])
print('-'*100)
print('COMPLEX TEXT')
print(row['Complex'].iloc[0])


SIMPLE TEXT
﻿The Manchester United manager, Sir Alex Ferguson, will retire at the end of the season after 27 years. He will become a director of the club. He is the most successful manager in British football. He has won 13 Premier League titles, two Champions Leagues, the Cup Winners’ Cup, five FA Cups and four League Cups.
“The decision to retire is one that I have thought a lot about,” Ferguson said. “It is the right time. It was important to me to leave an organization in the strongest possible condition and I believe I have done so.” He said that he thinks the quality of the team will bring continued success at the highest level. They also have lots of good young players, so Ferguson thinks the club has a very good future.
“Our training facilities are some of the best in world sport,” he added. “Our stadium, Old Trafford, is one of the most important venues in the world. I am delighted to become both director and ambassador for the club. I am looking forward to the future.” He als

Let's look at the size of each dataset:

In [10]:
cnt = 0
for name, path in datasets.items():
    df = load_dataset(name)
    print(f"{name}: {df.shape[0]} rows")
    cnt += df.shape[0]
print(f"Total: {cnt} rows")

ose_adv_ele: 189 rows
ose_adv_int: 189 rows
swipe: 1233 rows
vikidia: 1233 rows
Total: 2844 rows


Let's load again the dataset for computing the complexity measure in the following section.

In [11]:
df = load_dataset('ose_adv_ele')

## Complexity measures

In [12]:

# Cache stanza pipelines to avoid re-loading models
_STANZA_PIPELINES: Dict[str, stanza.Pipeline] = {}

# UPOS tags considered content words (C)
CONTENT_UPOS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


@lru_cache()
def load_cow_top5000_en() -> Set[str]:
    """
    Load the COW-based list of the 5,000 most frequent English content words
    from textcomplexity's English language definition file (en.json).

    We ignore POS tags and keep only lowercased word forms.
    """
    with pkg_resources.files(textcomplexity).joinpath("en.json").open(
        "r", encoding="utf-8"
    ) as f:
        lang_def = json.load(f)

    most_common = lang_def["most_common"]  # list of [word, xpos]
    cow_top5000 = {w.lower() for w, xpos in most_common}
    return cow_top5000


def get_stanza_pipeline(lang: str = "en", use_gpu: bool = False) -> stanza.Pipeline:
    """
    Get (or create) a cached stanza Pipeline for a given language.

    NOTE: You must have downloaded the models beforehand, e.g.:
        import stanza
        stanza.download('en')
    """
    if lang not in _STANZA_PIPELINES:
        _STANZA_PIPELINES[lang] = stanza.Pipeline(
            lang=lang,
            processors="tokenize,pos,lemma,depparse,constituency",
            use_gpu=use_gpu,
            tokenize_no_ssplit=False,
        )
    return _STANZA_PIPELINES[lang]


### Lexical complexity

In [13]:
def _compute_mtld(tokens: Iterable[str], ttr_threshold: float = 0.72) -> Optional[float]:
    """
    Compute MTLD (Measure of Textual Lexical Diversity) for a list of tokens.

    MTLD = total_number_of_tokens / number_of_factors

    A factor is a contiguous segment where the running TTR stays >= threshold.
    When the TTR drops below the threshold, we close a factor (at the previous
    token) and start a new one. At the end, the remaining partial segment is
    counted as a fractional factor, with weight proportional to how close the
    final TTR is to the threshold.
    """
    tokens = [tok for tok in tokens if tok]
    if not tokens:
        return None

    types = set()
    factor_count = 0.0
    token_count_in_factor = 0

    for tok in tokens:
        token_count_in_factor += 1
        types.add(tok)
        ttr = len(types) / token_count_in_factor

        if ttr < ttr_threshold:
            factor_count += 1.0
            types = set()
            token_count_in_factor = 0

    # final partial factor
    if token_count_in_factor > 0:
        final_ttr = len(types) / token_count_in_factor
        if final_ttr < 1.0:
            fractional = (1.0 - final_ttr) / (1.0 - ttr_threshold)
            fractional = max(0.0, min(1.0, fractional))
            factor_count += fractional

    if factor_count == 0:
        return None

    return len(tokens) / factor_count



def _compute_lexical_density(total_tokens: int, content_tokens: int) -> Optional[float]:
    """
    LD = |C| / |T|
    where:
        |C| = number of content-word tokens
        |T| = total number of non-punctuation tokens
    """
    if total_tokens == 0:
        return None
    return content_tokens / total_tokens


def _compute_lexical_sophistication_cow(
    content_forms: Iterable[str],
    cow_top5000: set,
) -> Optional[float]:
    """
    LS = |{ w in C : w not in R }| / |C|
    where:
        C = content-word tokens (surface forms, lowercased)
        R = COW top-5000 content word forms (lowercased)
    """
    forms = [f for f in content_forms if f]
    if not forms:
        return None

    off_list = sum(1 for f in forms if f not in cow_top5000)
    return off_list / len(forms)



def lexical_measures_from_doc(doc) -> Dict[str, Optional[float]]:
    """
    Compute MTLD, LD, LS from a stanza Document.
    """
    cow_top5000 = load_cow_top5000_en()

    mtld_tokens = []
    total_tokens = 0
    content_tokens = 0
    content_forms = []

    for sent in doc.sentences:
        for word in sent.words:
            if word.upos == "PUNCT":
                continue

            lemma = (word.lemma or word.text or "").lower()
            if not lemma:
                continue

            mtld_tokens.append(lemma)
            total_tokens += 1

            if word.upos in CONTENT_UPOS:
                content_tokens += 1
                form = (word.text or "").lower()
                content_forms.append(form)

    mtld = _compute_mtld(mtld_tokens) if mtld_tokens else None
    ld = _compute_lexical_density(total_tokens, content_tokens)
    ls = _compute_lexical_sophistication_cow(content_forms, cow_top5000)

    return {"MTLD": mtld, "LD": ld, "LS": ls}


def lexical_measures_from_text(text: str, lang: str = "en") -> Dict[str, Optional[float]]:
    """
    Convenience wrapper: parse a single text and compute lexical measures.
    """
    if text is None:
        text = ""
    text = str(text)

    if not text.strip():
        return {"MTLD": None, "LD": None, "LS": None}

    nlp = get_stanza_pipeline(lang)
    doc = nlp(text)
    return lexical_measures_from_doc(doc)



def compute_lexical_measures_df(
    df: pd.DataFrame,
    column: str = "text",
    lang: str = "en",
) -> Dict[str, Dict[Any, Optional[float]]]:
    """
    Compute lexical measures for each row in df[column].

    Returns:
        {
            "MTLD": {index: value},
            "LD":   {index: value},
            "LS":   {index: value},
        }
    """
    mtld_res: Dict[Any, Optional[float]] = {}
    ld_res: Dict[Any, Optional[float]] = {}
    ls_res: Dict[Any, Optional[float]] = {}

    for idx, text in df[column].items():
        metrics = lexical_measures_from_text(text, lang=lang)
        mtld_res[idx] = metrics["MTLD"]
        ld_res[idx] = metrics["LD"]
        ls_res[idx] = metrics["LS"]

    return {"MTLD": mtld_res, "LD": ld_res, "LS": ls_res}


### Syntactic complexity

In [14]:

def mdd_from_doc(doc) -> Optional[float]:
    """
    Compute Mean Dependency Distance (MDD) from a stanza Document.

    For each sentence s_i with dependency set D_i:
        MDD_i = (1 / |D_i|) * sum_{(h,d) in D_i} |h - d|
    Then:
        MDD = (1 / k) * sum_i MDD_i, over all sentences with at least one dependency.
    """
    sentence_mdds = []

    for sent in doc.sentences:
        distances = []
        for w in sent.words:
            if w.head is None or w.head == 0:
                continue
            distances.append(abs(w.id - w.head))

        if distances:
            sentence_mdds.append(sum(distances) / len(distances))

    if not sentence_mdds:
        return None
    return sum(sentence_mdds) / len(sentence_mdds)



def _count_clauses_in_tree(tree) -> int:
    """
    Count clause nodes in a constituency tree.

    A simple and standard heuristic (PTB-style) is:
        count all nodes whose label starts with 'S'
        (S, SBAR, SBARQ, SINV, SQ, etc.).

    This aligns with the idea of counting finite and subordinate clauses
    as in Hunt (1965) and later complexity work.
    """
    if tree is None:
        return 0

    # Stanza's constituency tree: tree.label, tree.children
    count = 1 if getattr(tree, "label", "").startswith("S") else 0

    for child in getattr(tree, "children", []):
        # leaves can be strings or terminals without 'label'
        if hasattr(child, "label"):
            count += _count_clauses_in_tree(child)

    return count


def cs_from_doc(doc) -> Optional[float]:
    """
    Compute CS (clauses per sentence) from a stanza Document.

        CS = (1 / k) * sum_i L_i

    where L_i is the number of clauses in sentence s_i, estimated by counting
    all constituents whose label starts with 'S' in the constituency tree of s_i.
    """
    clause_counts = []
    for sent in doc.sentences:
        tree = getattr(sent, "constituency", None)
        if tree is None:
            # No constituency tree available for this sentence
            continue
        num_clauses = _count_clauses_in_tree(tree)
        clause_counts.append(num_clauses)

    if not clause_counts:
        return None

    return sum(clause_counts) / len(clause_counts)



def syntactic_measures_from_doc(doc) -> Dict[str, Optional[float]]:
    """
    Compute MDD and CS from a stanza Document.
    """
    mdd = mdd_from_doc(doc)
    cs = cs_from_doc(doc)
    return {"MDD": mdd, "CS": cs}


def syntactic_measures_from_text(text: str, lang: str = "en") -> Dict[str, Optional[float]]:
    """
    Convenience wrapper: parse a single text and compute syntactic measures.
    """
    if text is None:
        text = ""
    text = str(text)

    if not text.strip():
        return {"MDD": None, "CS": None}

    nlp = get_stanza_pipeline(lang)
    doc = nlp(text)
    return syntactic_measures_from_doc(doc)


def compute_syntactic_measures_df(
    df: pd.DataFrame,
    column: str = "text",
    lang: str = "en",
) -> Dict[str, Dict[Any, Optional[float]]]:
    """
    Compute syntactic measures for each row in df[column].

    Returns:
        {
            "MDD": {index: value},
            "CS":  {index: value},
        }
    """
    mdd_res: Dict[Any, Optional[float]] = {}
    cs_res: Dict[Any, Optional[float]] = {}

    for idx, text in df[column].items():
        metrics = syntactic_measures_from_text(text, lang=lang)
        mdd_res[idx] = metrics["MDD"]
        cs_res[idx] = metrics["CS"]

    return {"MDD": mdd_res, "CS": cs_res}


### Discourse complexity

In [15]:

# Approximate set of content POS tags (spaCy universal POS)
CONTENT_POS =  {"NOUN", "VERB", "ADJ", "ADV"}


def is_content_token(tok):
    """
    Return True if token is considered a content word.
    We ignore stopwords, punctuation, and non-alphabetic tokens.
    """
    return (
        tok.is_alpha
        and not tok.is_stop
        and tok.pos_ in CONTENT_POS
    )


@lru_cache(maxsize=100000)
def get_related_lemmas(lemma):
    """
    Return a set of semantically related lemmas for the given lemma
    using WordNet, including:
      - synonyms
      - antonyms
      - hypernyms / hyponyms
      - meronyms (part/member/substance)
      - coordinate terms (siblings under the same hypernym)

    NOTE: Some older examples mention 'troponyms', but in NLTK's
    WordNet interface there is no 'troponyms()' method on Synset,
    so we do NOT use it here.
    """
    lemma = lemma.lower()
    related = set()
    synsets = wn.synsets(lemma)

    for syn in synsets:
        # Synonyms and antonyms
        for l in syn.lemmas():
            related.add(l.name().lower().replace("_", " "))
            for ant in l.antonyms():
                related.add(ant.name().lower().replace("_", " "))

        # Hypernyms (more general) and hyponyms (more specific)
        for hyper in syn.hypernyms():
            for l in hyper.lemmas():
                related.add(l.name().lower().replace("_", " "))
        for hypo in syn.hyponyms():
            for l in hypo.lemmas():
                related.add(l.name().lower().replace("_", " "))

        # Meronyms: part/member/substance
        for mer in syn.part_meronyms() + syn.member_meronyms() + syn.substance_meronyms():
            for l in mer.lemmas():
                related.add(l.name().lower().replace("_", " "))

        # Coordinate terms (siblings under same hypernym)
        for hyper in syn.hypernyms():
            for sibling in hyper.hyponyms():
                if sibling == syn:
                    continue
                for l in sibling.lemmas():
                    related.add(l.name().lower().replace("_", " "))

    # Remove the lemma itself if present
    related.discard(lemma)
    return related


def lexical_cohesion_single(text, nlp):
    """
    Compute Lexical Cohesion (LC) for a single document:

        LC = |C| / m

    where:
      - |C| is the number of cohesive devices between sentences
        (lexical repetition + semantic relations),
      - m  is the total number of word tokens (alphabetic) in the document.

    If the document has fewer than 2 sentences or no valid words,
    LC is returned as 0.0.
    """
    if not isinstance(text, str) or not text.strip():
        return 0.0

    doc = nlp(text)

    # Total number of alphabetic tokens (denominator m)
    m = sum(1 for tok in doc if tok.is_alpha)
    if m == 0:
        return 0.0

    sentences = list(doc.sents)
    if len(sentences) < 2:
        # With only one sentence, cross-sentence cohesion is not defined
        return 0.0

    # Collect sets of content lemmas per sentence
    sent_lemmas = []
    for sent in sentences:
        lemmas = set(
            tok.lemma_.lower()
            for tok in sent
            if is_content_token(tok)
        )
        if lemmas:
            sent_lemmas.append(lemmas)

    if len(sent_lemmas) < 2:
        return 0.0

    cohesive_count = 0

    for i in range(len(sent_lemmas) - 1):
        for j in range(i + 1, len(sent_lemmas)):
            li = sent_lemmas[i]
            lj = sent_lemmas[j]

            # 1) Lexical repetition: shared lemmas
            shared = li & lj
            cohesive_count += len(shared)

            # 2) Semantic relations via WordNet
            for lemma in li:
                related = get_related_lemmas(lemma)
                cohesive_count += len(related & lj)

    return float(cohesive_count) / float(m)


def sentence_vector(sent, vector_size):
    """
    Represent a sentence as the average of token vectors.
    If no token has a vector, return a zero vector.
    """
    vecs = [
        tok.vector
        for tok in sent
        if tok.has_vector and not tok.is_punct and not tok.is_space
    ]
    if not vecs:
        return np.zeros(vector_size, dtype="float32")
    return np.mean(vecs, axis=0)


def coherence_single(text, nlp):
    """
    Compute Coherence (CoH) for a single document as the average
    cosine similarity between adjacent sentence vectors:

        CoH = (1 / (k-1)) * sum_{i=1}^{k-1} cos(h_i, h_{i+1})

    where h_i is the sentence/topic vector for sentence i.

    If the document has fewer than 2 sentences, CoH = 0.0.
    """
    if not isinstance(text, str) or not text.strip():
        return 0.0

    if nlp.vocab.vectors_length == 0:
        raise ValueError(
            "The loaded spaCy model does not contain word vectors "
            "(nlp.vocab.vectors_length == 0). "
            "Use a model like 'en_core_web_md' or similar."
        )

    doc = nlp(text)
    sentences = list(doc.sents)
    k = len(sentences)

    if k < 2:
        # Only one sentence: no adjacent pair, coherence = 0.0
        return 0.0

    vector_size = nlp.vocab.vectors_length
    sent_vectors = [
        sentence_vector(sent, vector_size)
        for sent in sentences
    ]

    sims = []
    for i in range(k - 1):
        v1 = sent_vectors[i]
        v2 = sent_vectors[i + 1]
        norm1 = np.linalg.norm(v1)
        norm2 = np.linalg.norm(v2)
        denom = norm1 * norm2
        if denom == 0.0:
            # Skip pairs where at least one sentence vector is zero
            continue
        cos_sim = float(np.dot(v1, v2) / denom)
        sims.append(cos_sim)

    if not sims:
        return 0.0

    return float(np.mean(sims))



def compute_lexical_cohesion_vector(df, nlp, column="text"):
    """
    Compute LC for each row of a DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing the texts.
    nlp : spaCy Language object
        Pre-loaded spaCy pipeline with lemmatizer, POS tagger, etc.
    column : str, default "text"
        Name of the column that contains the text.

    Returns
    -------
    np.ndarray
        1D array of LC scores, length == len(df).
    """
    texts = df[column].fillna("").astype(str)
    scores = [lexical_cohesion_single(t, nlp) for t in texts]
    return np.array(scores, dtype="float32")


def compute_coherence_vector(df, nlp, column="text"):
    """
    Compute CoH for each row of a DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing the texts.
    nlp : spaCy Language object
        Pre-loaded spaCy pipeline with word vectors.
    column : str, default "text"
        Name of the column that contains the text.

    Returns
    -------
    np.ndarray
        1D array of CoH scores, length == len(df).
    """
    texts = df[column].fillna("").astype(str)
    scores = [coherence_single(t, nlp) for t in texts]
    return np.array(scores, dtype="float32")


def compute_discourse_measures(df, nlp, column="text"):
    """
    Compute both LC and CoH for each row of a DataFrame and return
    them in a dictionary.

    Returns
    -------
    dict
        {
            "LC":  np.ndarray of lexical cohesion scores,
            "CoH": np.ndarray of coherence scores
        }
    """
    lc_vec = compute_lexical_cohesion_vector(df, nlp, column=column)
    coh_vec = compute_coherence_vector(df, nlp, column=column)
    return {"LC": lc_vec, "CoH": coh_vec}




### Text complexity

Here we compute the complexity of each function. Note that we use a method that calculates all measures at once. However, it is advisable to compute each measure separately so that you can better handle any potential errors. For example, calculate MLTD first and save it, then LD, and so on. The following code is an example of how to compute the measures.

In [16]:
def _analyze_text_all(text: str, lang: str = "en") -> Dict[str, Optional[float]]:
    """
    Parse a text with stanza and compute all measures (lexical + syntactic)
    in a single pass.

    Returns a dict with keys:
        "MTLD", "LD", "LS", "MDD", "CS"
    (Discourse measures LC/CoH are added later at DataFrame level, via spaCy.)
    """
    if text is None:
        text = ""
    text = str(text)

    if not text.strip():
        return {"MTLD": None, "LD": None, "LS": None, "MDD": None, "CS": None}

    nlp = get_stanza_pipeline(lang)
    doc = nlp(text)

    lex = lexical_measures_from_doc(doc)
    syn = syntactic_measures_from_doc(doc)

    out: Dict[str, Optional[float]] = {}
    out.update(lex)
    out.update(syn)
    return out


def compute_all_complexity_measures_df(
    df: pd.DataFrame,
    column: str = "text",
    lang: str = "en",
    spacy_nlp=None,
) -> Dict[str, Dict[Any, Optional[float]]]:
    """
    Compute all complexity measures for each row in df[column].

    Args
    ----
    df : pandas.DataFrame
        DataFrame with a text column.
    column : str, default "text"
        Name of the text column.
    lang : str, default "en"
        Language code for stanza.
    n_jobs : int, default 1
        Number of worker processes to use.
            - 1  : sequential execution (no multiprocessing).
            - >1 : multiprocessing with that many workers.
            - 0 or None : use cpu_count() workers.
    spacy_nlp : spaCy Language, required for LC / CoH
        Pre-loaded spaCy pipeline with:
            - POS / lemmatizer for LC
            - word vectors for CoH (e.g. 'en_core_web_md').

    Returns
    -------
    dict
        {
            "MTLD": {index: value},
            "LD":   {index: value},
            "LS":   {index: value},
            "MDD":  {index: value},
            "CS":   {index: value},
            "LC":   {index: value},
            "CoH":  {index: value},
        }
    """
    mtld_res: Dict[Any, Optional[float]] = {}
    ld_res: Dict[Any, Optional[float]] = {}
    ls_res: Dict[Any, Optional[float]] = {}
    mdd_res: Dict[Any, Optional[float]] = {}
    cs_res: Dict[Any, Optional[float]] = {}

    items = list(df[column].items())  # list[(index, text)]
    total_items = len(items)

    # ---- Lexical + syntactic (stanza) ----
    for idx, text in tqdm(
        items,
        total=total_items,
        desc="Computing lexical & syntactic complexity (sequential)",
    ):
        metrics = _analyze_text_all(text, lang=lang)
        mtld_res[idx] = metrics["MTLD"]
        ld_res[idx] = metrics["LD"]
        ls_res[idx] = metrics["LS"]
        mdd_res[idx] = metrics["MDD"]
        cs_res[idx] = metrics["CS"]


    # ---- Discourse measures (spaCy: LC & CoH) ----
    if spacy_nlp is None:
        raise ValueError(
            "spacy_nlp must be provided to compute LC and CoH. "
            "Load a spaCy model with vectors, e.g. 'en_core_web_md', and "
            "pass it as spacy_nlp=..."
        )

    discourse = compute_discourse_measures(df, spacy_nlp, column=column)
    lc_vec = discourse["LC"]
    coh_vec = discourse["CoH"]

    lc_res: Dict[Any, float] = {}
    coh_res: Dict[Any, float] = {}

    # Map arrays back to DataFrame indices
    for i, idx in enumerate(df.index):
        lc_res[idx] = float(lc_vec[i])
        coh_res[idx] = float(coh_vec[i])

    return {
        "MTLD": mtld_res,
        "LD": ld_res,
        "LS": ls_res,
        "MDD": mdd_res,
        "CS": cs_res,
        "LC": lc_res,
        "CoH": coh_res,
    }


In [17]:
#!/usr/bin/env python3
"""
Example script: load a DataFrame and compute all complexity measures.
"""

if __name__ == "__main__":

    df_example = df.sample(n=5) # We sample 5 random rows
    # Compute all measures for Simple texts
    metrics = compute_all_complexity_measures_df(
        df_example,
        column="Simple", # Note that we use the column "Simple" for the Simple text. Use 'Complex' for the Complex text.
        lang="en",

        spacy_nlp=spacy_nlp
    )

    print("All complexity measures (per row):")
    pprint(metrics)


Computing lexical & syntactic complexity (sequential):   0%|          | 0/5 [00:00<?, ?it/s]2025-12-18 17:34:19 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.11.0.json: 435kB [00:00, 7.44MB/s]                    
2025-12-18 17:34:19 INFO: Downloaded file to C:\Users\rroll\stanza_resources\resources.json
2025-12-18 17:34:20 INFO: Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| mwt          | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |

2025-12-18 17:34:20 INFO: Using device: cpu
2025-12-18 17:34:20 INF

All complexity measures (per row):
{'CS': {17: 2.9069767441860463,
        38: 3.6333333333333333,
        46: 4.8,
        51: 3.3548387096774195,
        55: 3.1707317073170733},
 'CoH': {17: 0.8373317718505859,
         38: 0.8397873044013977,
         46: 0.8502054214477539,
         51: 0.8557181358337402,
         55: 0.8372233510017395},
 'LC': {17: 2.9229559898376465,
        38: 1.283203125,
        46: 1.7163375616073608,
        51: 1.116279125213623,
        55: 3.4348485469818115},
 'LD': {17: 0.5186846038863976,
        38: 0.48295454545454547,
        46: 0.5365853658536586,
        51: 0.45020746887966806,
        55: 0.5331369661266568},
 'LS': {17: 0.1988472622478386,
        38: 0.3058823529411765,
        46: 0.22402597402597402,
        51: 0.25806451612903225,
        55: 0.30939226519337015},
 'MDD': {17: 3.0743672533969515,
         38: 3.090355057151884,
         46: 3.459314447642028,
         51: 2.9663430377463005,
         55: 3.212641674453416},
 'MTLD': {

Pay attention when using the function and ensure proper error handling for NaN values. As a rule, if any complexity dimension produces NaN values for a sample, that dimension must be discarded and not included in the subsequent model training or analysis.

**It is strongly recommended to implement a function that incorporates a backup strategy in case errors occur during execution (e.g., IO errors). Please note that if it is impossible to calculate a measure for at least one row (e.g., NaN value), that row must be discarded. At the end of this process, the goal is to obtain a dataframe with 16 columns. The columns should include Simple and Complex, followed by 7 columns containing the measures for the Simple text, and the final 7 columns containing the complexity measures for the Complex text (pay attention to use distinct names for the Simple and Complex columns.)**