# Natural Language Processing Milestone 1

## Content
1. [Initial Data prep and label representation](#initial-data-prep-and-label-representation)
    - [Create a first dataframe](#creating-a-first-dataframe)
    - [Extract the label taxonomy](#extract-the-label-taxonomy)
    - [Map the labels to the taxonomy](#map-the-labels-in-the-file-to-the-taxonomy)
2. [Text segmentation](#text-segmentation)
    - [Tokenize sentences and words](#tokenize-sentences-and-words)
    - [Find sentences of unusual length](#find-sentences-unusual-length)
    - [Handle very short sentences](#handle-short-sentences)
    - [Handle very long sentences](#handle-long-sentences)
    - [Verify remaining sentences of unusual length](#verify-remaining-unusual-sentences)
3. [Text Normalization](#text-normalization)
    - [Verify text normalization](#check-normalization)
    - [Print the results of the text normalization](#result-printing)
4. [CoNNL-U format](#connlu-format)

In [36]:
# %pip install --requirement requirements.txt
# When you import something new that is not in the requirements.txt file, please add it to the requirements.txt file and re-run the cell above.


import os
import pandas as pd
import fitz  # PyMuPDF
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer

from stanza.utils.conll import CoNLL
import re
from nltk.stem import PorterStemmer
import stanza
import torch
import nltk
import string
from tqdm import tqdm


# nltk.download('stopwords')
# stanza.download('bg')
# stanza.download('ru')
# stanza.download('pt')
# stanza.download('hi')

##%%capture

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json: 392kB [00:00, 18.2MB/s]                    
2024-12-29 19:53:20 INFO: Downloaded file to C:\Users\asus9\stanza_resources\resources.json
2024-12-29 19:53:20 INFO: Downloading default packages for language: hi (Hindi) ...
2024-12-29 19:53:21 INFO: File exists: C:\Users\asus9\stanza_resources\hi\default.zip
2024-12-29 19:53:23 INFO: Finished downloading models and saved to C:\Users\asus9\stanza_resources


## Initial Data Prep and label representation

### Creating a first dataframe
containing [filename, content, narrative, subnarrative, topic] for each datapoint
- The documents belong to two different topics, ukraine war and climate change
- We will make it clear in the dataframe which topic each document belongs to by adding the column "topic", which is based on the abbreviations UA or CC in the document filenames
    - The reason for this is the implementation decision we made to **build separate models to predict the two topics** instead of one that can predict both

The dataset has **two ways of categorizing articles** into topics:
- The above-mentioned UA and CC in the article files
- URW and CC in the label strings which classifies every label for an article (since there can be multiple) 
    - This means, that theoretically, an article could contain both topics
    - However, we will see that almost all datapoints only have labels for the topic that is also in the corresponding article filenames (i.e. if there is CC in the article filename, for almost all datapoints, there would also only be CC labels)
    - More details that drove the decision to use the article file topic instead of the label topics are provided further below
*English files:*
Filtering of topic based on file name. Every file contains "UA" or "CC", making it easy to differentiate between both topics.

*Russian files:*
All files containing Ukraine War, no selection necessary

*Portuguese files (topic labeled):*
PT_CC_* : CC
PT_URW_* : UA

*Portuguese files (topic labeled manually):*
PT_[number]: CC or UA, labeled in annotation-file

*Bulgarian files:*
Following interesting bulgarian narratives:
    - A9_BG_3779 not containing UA nor CC -> removed from files and annotations

*_URW_* : UA
*_CC_* : CC
A9_BG_437 - A9_BG_8450 : UA
BG_171 - BG_364 : CC
BG_423 - BG_555 : UA
BG_559 - BG_560 : CC
BG_573 - BG_869 : UA
BG_888 - BG_2536 : CC
BG_3079 - BG_8455 : UA

*Indian files (topic labeled manually):*
HI_[number]: CC and UA, labeled in annotation-file

In [3]:
def load_annotations(annotation_file):
    annotations = {}
    with open(annotation_file, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:
                # Split by tabs to separate columns
                parts = line.split("\t")
                if len(parts) > 1:
                    filename = parts[0].strip()
                    # Extract the first topic from the second column
                    topic_section = parts[1].split(";")[0]
                    topic = topic_section.split(":")[0].strip()

                    # Map URW to UA for Ukraine War
                    if topic == "URW":
                        topic = "UA"
                    elif topic == "CC":
                        topic = "CC"
                    else:
                        topic = None

                    annotations[filename] = topic
    return annotations


def determine_topic(lang, filename):
    """
    Function that determines the topic of a narrative (Climate Change or Ukraine War) based on given language and filename
    """
    if lang == "BG":
        # Bulgarian files with topic in name
        if "_URW_" in filename or "A9_BG_" in filename:
            return "UA"
        elif "_CC_" in filename:
            return "CC"

        # Ranges for bulgarian ukraine war files
        ranges_ua = [
            ("A9_BG_437", "A9_BG_8450"),
            ("BG_423", "BG_869"),
            ("BG_3079", "BG_8455"),
        ]
        # Ranges for bulgarian climate change files
        ranges_cc = [
            ("BG_171", "BG_364"),
            ("BG_888", "BG_2536"),
        ]

        # Checks whether filename is inside a range UA or CC
        def is_within_range(filename, start, end):
            # Extract numeric parts from filenames
            file_num = int("".join(filter(str.isdigit, filename)))
            start_num = int("".join(filter(str.isdigit, start)))
            end_num = int("".join(filter(str.isdigit, end)))
            return start_num <= file_num <= end_num

        for start, end in ranges_ua:
            if is_within_range(filename, start, end):
                return "UA"

        for start, end in ranges_cc:
            if is_within_range(filename, start, end):
                return "CC"

    # Portuguese files contain topic in filenames
    elif lang == "PT":
        if filename.startswith("PT_CC_"):
            return "CC"
        elif filename.startswith("PT_URW_"):
            return "UA"

    # All russian narratives have topic UA
    elif lang == "RU":
        return "UA"

    # English narratives topic can be extracated from filename
    elif lang == "EN":
        return "UA" if "UA" in filename else "CC"

    # Default case to handle outliers
    return None


def should_load_file(lang, filename):
    """
    Function that checks if a file in a language should be loaded.
    """
    # Load ever file of english, russian and bulgarian narratives.
    if lang in ["EN", "RU", "BG"]:
        return True
    # For portuguese, load only the file, if it starts with "PT_CC_" or "PT_URW_"
    elif lang == "PT":
        return filename.startswith("PT_CC_") or filename.startswith("PT_URW_")

    return False


# Define a function to iterate over all languages specified in argument and load all narratives of each languages.
def process_languages(languages):
    data = []
    annotations_dict = {}
    for lang in ["HI", "PT"]:
        annotation_file = os.path.join(
            "../training_data_04_December_release", lang, "subtask-2-annotations.txt"
        )
        if os.path.exists(annotation_file):
            annotations_dict.update(load_annotations(annotation_file))

    # Iterate over each language
    for lang in languages:
        # Define the paths to articles and annotations
        documents_path = os.path.join(
            "../training_data_04_December_release", lang, "raw-documents"
        )
        annotations_file = os.path.join(
            "../training_data_04_December_release", lang, "subtask-2-annotations.txt"
        )

        # Read and preprocess annotations
        annotations = pd.read_csv(
            annotations_file,
            sep="\t",
            header=None,
            names=["filename", "narrative", "subnarrative"],
        )

        # Remove all occurrences of "CC: " and "URW: " from narratives and subnarratives
        annotations["narrative"] = annotations["narrative"].str.replace(
            r"(CC: |URW: )", "", regex=True
        )
        annotations["subnarrative"] = annotations["subnarrative"].str.replace(
            r"(CC: |URW: )", "", regex=True
        )

        # Split the narratives and subnarratives into lists
        annotations["narrative"] = annotations["narrative"].str.split(";")
        annotations["subnarrative"] = annotations["subnarrative"].str.split(";")

        # Process each file in the directory
        for _, row in annotations.iterrows():
            filename = row["filename"]
            narratives = row["narrative"]
            subnarratives = row["subnarrative"]

            # Read the document content
            with open(
                os.path.join(documents_path, filename), "r", encoding="utf-8"
            ) as file:
                content = file.read()
            # Determine the topic based on filename or annoation in annotaion file
            if (
                filename.startswith("PT_")
                and filename.split("_")[1].split(".")[0].isdigit()
            ):
                topic = annotations_dict.get(filename, None)
            elif lang == "HI":
                topic = annotations_dict.get(filename, None)
            else:
                # Use determine_topic function for other cases
                topic = determine_topic(lang, filename)

            # Append the document content, narratives, subnarratives, language and topic to the data list
            data.append(
                {
                    "filename": filename,
                    "language": lang,
                    "content": content,
                    "narratives": narratives,
                    "subnarratives": subnarratives,
                    "topic": topic,
                }
            )

    # Convert to a DataFrame
    df = pd.DataFrame(data)
    return df


# Define the base directory and languages
languages = ["EN", "RU", "BG", "PT", "HI"]

# Process files for all languages
df = process_languages(languages)

# Display the DataFrame
df.head(2000)

Unnamed: 0,filename,language,content,narratives,subnarratives,topic
0,EN_CC_100013.txt,EN,Bill Gates Says He Is ‘The Solution’ To Climat...,[Criticism of climate movement],[Criticism of climate movement: Ad hominem att...,CC
1,EN_UA_300009.txt,EN,Russia: Clashes erupt in Bashkortostan as righ...,[Other],[Other],UA
2,EN_UA_300017.txt,EN,"McDonald's to exit Russia, sell business in co...",[Other],[Other],UA
3,EN_CC_100021.txt,EN,"Collaborative plans, innovation keys to circul...",[Other],[Other],CC
4,EN_UA_300041.txt,EN,Russia intends to supply light ‘Mountain’ tank...,[Other],[Other],UA
...,...,...,...,...,...,...
1693,HI_374.txt,HI,तीसरे विश्वयुद्ध की शुरुआत... अमेरिका ने पहली ...,"[Praise of Russia, Amplifying war-related fear...",[Praise of Russia: Russia has international su...,UA
1694,HI_90.txt,HI,"Russia Ukraine War: पुतिन की गंभीर चेतावनी, पश...","[Negative Consequences for the West, Blaming t...","[Negative Consequences for the West: Other, Bl...",UA
1695,HI_167.txt,HI,यूक्रेन में शांति के लिए होने वाले किसी भी शिख...,[Other],[Other],UA
1696,HI_203.txt,HI,"इस्लाम का उदय, फिर बना अरब साम्राज्य; कैसे धरत...",[Other],[Other],UA


- There seems to be redundant information in the narratives and subnarratives column, since entries in the subnarrative column is structured in the way *"narrative:subnarrative"* -> we could therefore get rid of some redundant information
- We first check if the narrative information in the subnarrative column is exactly the same as in the narrative column
- If so, we will remove the narrative information from the subnarrative column

In [4]:
# Define a function to check if the subnarrative starts with the narrative in a given row
def check_pattern(row):
    narratives = row["narratives"]
    subnarratives = row["subnarratives"]

    for narrative, subnarrative in zip(narratives, subnarratives):
        if not subnarrative.startswith(narrative + ":"):
            return (row.name, narrative, subnarrative)
    return None


# Apply the function to each row and collect the results
pattern_check_results = df.apply(check_pattern, axis=1)

# Filter out the rows where the pattern does not hold
problematic_rows = pattern_check_results[pattern_check_results.notnull()]

# Set display options to avoid truncation
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

# Display the problematic rows
print("Problematic rows where the pattern does not hold:")
print(problematic_rows)

Problematic rows where the pattern does not hold:
1          (1, Other, Other)
2          (2, Other, Other)
3          (3, Other, Other)
4          (4, Other, Other)
5          (5, Other, Other)
8          (8, Other, Other)
9          (9, Other, Other)
14        (14, Other, Other)
15        (15, Other, Other)
17        (17, Other, Other)
18        (18, Other, Other)
19        (19, Other, Other)
20        (20, Other, Other)
24        (24, Other, Other)
25        (25, Other, Other)
26        (26, Other, Other)
29        (29, Other, Other)
31        (31, Other, Other)
32        (32, Other, Other)
34        (34, Other, Other)
36        (36, Other, Other)
38        (38, Other, Other)
41        (41, Other, Other)
42        (42, Other, Other)
44        (44, Other, Other)
45        (45, Other, Other)
47        (47, Other, Other)
51        (51, Other, Other)
54        (54, Other, Other)
58        (58, Other, Other)
72        (72, Other, Other)
73        (73, Other, Other)
74        (74, Other, 

- The only exception from the pattern is *narrative other, subnarrative other* cases 
- Since there is no redundancy in those cases and there are no other exceptions from the pattern, we can delete the narratives before the ":" in the subnarratives column

In [5]:
# Function to remove redundant narratives from subnarratives
def remove_redundant_narratives(row):
    narratives = row["narratives"]
    subnarratives = row["subnarratives"]

    cleaned_subnarratives = []
    for narrative, subnarrative in zip(narratives, subnarratives):
        if subnarrative.startswith(narrative + ":"):
            cleaned_subnarrative = subnarrative[len(narrative) + 1 :].strip()
            cleaned_subnarratives.append(cleaned_subnarrative)
        else:
            cleaned_subnarratives.append(subnarrative)

    return cleaned_subnarratives


# Apply the function to each row to clean the subnarratives
df["subnarratives"] = df.apply(remove_redundant_narratives, axis=1)

# Display the updated DataFrame
print("Updated DataFrame with cleaned subnarratives:")
df.head()

Updated DataFrame with cleaned subnarratives:


Unnamed: 0,filename,language,content,narratives,subnarratives,topic
0,EN_CC_100013.txt,EN,"Bill Gates Says He Is ‘The Solution’ To Climate Change So It’s OK To Own Four Private Jets \n\nBill Gates has the right to fly around the world on private jets while normal people are forced to live in 15 minute cities without freedom of travel, according to Bill Gates himself, who told the BBC he is doing much more than anybody else to fight climate change.\n\nGates claimed that because he continues to “spend billions of dollars” on climate change activism, his carbon footprint isn’t an issue.\n\nSign up to get unfiltered news delivered straight to your inbox.\n\nYou can unsubscribe any time. By subscribing you agree to our Terms of Use\n\n“Should I stay at home and not come to Kenya and learn about farming and malaria?” Gates said in the interview with Amol Rajan.\n\n“I’m comfortable with the idea that not only am I not part of the problem by paying for the offsets, but also through the billions that my Breakthrough Energy Group is spending, that I’m part of the solution,” Gates added. Watch:\n\nEarlier this year, Gates flew around Australia on board his $70 million dollar luxury private jet lecturing people about climate change and ordering them to stop flying on planes.\n\nGates, who has declared that the energy crisis is a good thing, owns no fewer than FOUR private jets at a combined cost of $194 million dollars.\n\nA study carried out by Linnaeus University economics professor Stefan Gössling found that Gates flew more than 213,000 miles on 59 private jet flights in 2017 alone.\n\nGates emitted an estimated 1,760 tons of carbon dioxide emissions, over a hundred times more than the emissions per capita in the United States, according to data from the World Bank.\n\nElsewhere during the carefully constructed interview, Gates said he was surprised that he was targeted by ‘conspiracy theorists’ for pushing vaccines during the pandemic.\n\nWhile the BBC interview was set up to look like Gates was being challenged or grilled, he wasn’t asked about his close friendship with the elite pedophile Jeffrey Epstein.",[Criticism of climate movement],[Ad hominem attacks on key activists],CC
1,EN_UA_300009.txt,EN,"Russia: Clashes erupt in Bashkortostan as rights activist sentenced \n\n Russian riot police clashed with protesters in Bashkortostan following the sentencing of rights activist Fail Alsynov to four years in a penal colony. Social media footage captured the confrontations near the court in Baymak, with supporters engaging in clashes with police, including throwing snowballs.\n\nViolent clashes in Baymak— NEXTA (@nexta_tv) January 17, 2024\n\nLaw enforcers used stun grenades in Baymak, Bashkortostan, Russia. The demonstrators responded by throwing snow and ice at them and forced them to retreat.\n\nIt is reported that negotiations are underway between the protesters and special forces: the law… pic.twitter.com/AVHf2gBi7w\n\nAlsynov’s conviction for inciting ethnic hatred sparked rare large-scale protests in Russia, where the risk of arrest typically stifles such demonstrations. Reports suggest thousands participated in the multi-day protest in -20°C temperatures, resulting in several injuries. \n\nThe activist denies the charges related to insulting migrants during a demonstration against gold mining plans. Supporters claim the case is retaliation for Alsynov’s activism against soda mining in a culturally significant area. He allegedly referred to Central Asians and Caucasians, comprising a significant portion of Russia’s migrant population, as “black people” in the Bashkir language. \n\nAlsynov contends the Bashkir words meant “poor people” and were mistranslated into Russian. He plans to appeal the verdict. Alsynov has previously criticized military mobilization in the region as “genocide” against the Bashkir people. Ongoing concerns exist about the disproportionate deployment of ethnic minorities, including Bashkirs, in conflicts such as Ukraine. \n\nAlsynov was a leader of Bashkort, a grassroots movement focused on preserving Bashkir ethnic identity, which was banned as extremist in 2020. The clashes signify a rare instance of public dissent in Russia, underscoring the intensity of emotions surrounding Alsynov’s case and broader issues of ethnic identity and activism in the country.",[Other],[Other],UA
2,EN_UA_300017.txt,EN,"McDonald's to exit Russia, sell business in country \n\n American fast-food giant McDonald said Monday it will exit Russia in the wake of the Ukraine invasion.\n\nmerican fast-food giant McDonald said Monday it will exit Russia in the wake of the Ukraine invasion, ending a more than three-decade run begun in the hopeful period near the end of the Cold War.\n\nThe restaurant chain, which launched in Moscow in January 1990 to great fanfare almost two years before the Soviet Union was dissolved, characterized the withdrawal as difficult but necessary.\n\n""The humanitarian crisis caused by the war in Ukraine, and the precipitating unpredictable operating environment, have led McDonald's to conclude that continued ownership of the business in Russia is no longer tenable, nor is it consistent with McDonald's values,"" the company said in a statement.\n\nThe chain is looking to sell ""its entire portfolio of McDonald's restaurants in Russia to a local buyer.""\n\nThe burger giant is one of numerous foreign firms that have pulled out of the country or suspended operations following Moscow's invasion of Ukraine in late February.\n\nEarlier on Monday, French automaker Renault announced it had handed over its Russian assets to the government, marking the first major nationalization since the onset of Western sanctions against Moscow's military campaign.\n\nRussia's President Vladimir Putin ordered troops into pro-Western Ukraine on February 24, triggering unprecedented sanctions and sparking an exodus of foreign corporations including H&M, Starbucks and Ikea.\n\n\n",[Other],[Other],UA
3,EN_CC_100021.txt,EN,"Collaborative plans, innovation keys to circular RMG industry: stakeholders \n\n Stakeholders at a programme on Sunday said that collaborative strategies and innovation were keys to unlocking Bangladesh’s circular apparel industry for availing a significant opportunity for the country to reduce its environmental impact, improve its economic performance and create social benefits.\n\nThe readymade garment sector in Bangladesh is under a transformative journey towards circularity, but embracing circularity also poses certain challenges that must be collectively considered and resolved, they said.\n\nA panel of industry leaders, policymakers and experts in circular economies made the comments at an event titled ‘Switch to Upstream Circularity Dialogue: Pre-consumer Textile Waste in Bangladesh’ at the Amari Hotel in Dhaka on Sunday.\n\nThe event was organised under the Switch to Circular Economy Value Chains project (SWITCH2CE), co-funded by the European Union and the government of Finland.\n\nParliamentary standing committee on the ministry of environment, forest and climate change chairman Saber Hossain Chowdhury was present in the opening session as chief guest.\n\nBangladesh Garment Manufacturers and Exporters Association president Faruque Hassan spoke at the opening session as special guest.\n\nDeputy head of European Union to Bangladesh delegation mission Bernd Spanier, SWITCH2CE chief technical adviser Mark Draeck, among others, also contributed to the event’s opening session.\n\nHilde van Duijn, head of Global Value Chains, Circle Economy, also participated in the event and made circular game demonstration.\n\nFaruque Hassan said, ‘Living in a world in our time where climate is most threatened, business as usual is no more an option. In a race to zero emission and resource decoupling, circularity emerges as the “next normal” linking business and sustainable development. For the BGMEA, circularity sits in the core of our values, mission and vision. Our goal is to help conserve the natural eco-system as much as possible via an economic shift from a linear to circular system, while generating greater social and economic values.’\n\nMark Draeck said, ‘With the support of the European Union and the government of Finland, UNIDO leads the global Switch to Circular Economy Value Chains project.’\n\n‘In Bangladesh, we support the circular transition for the textile and garments industry, by piloting innovative circular solutions in close cooperation with global and national industry leaders, which as BESTSELLER, H&M and their manufacturers,’ he said.\n\n‘These pilots will address acute challenges on new technology, business models and traceability, and will demonstrate the economic opportunity for circular approaches.\n\nThe project also collaborates with government partners, academia, and NGOs to create an enabling environment for circularity through policy and tailored capacity building,’ he added.",[Other],[Other],CC
4,EN_UA_300041.txt,EN,"Russia intends to supply light ‘Mountain’ tanks, infantry fighting vehicles to India \n\n Russia’s state-owned arms company, Rosoboronexport, has declared its intention to participate in the bidding process for the supply of light tanks and the Future Infantry Combat Vehicle (FICV) to India.\n\nAlexander Mikheyev, the CEO of the Russian company announced the decision with the state-run media outlet TASS during the International Aerospace Exhibition at Dubai Airshow 2023.\n\nMikheyev outlined plans for collaboration with Indian partners, to introduce a light tank and an advanced infantry fighting vehicle as part of the Indian Ministry of Defense’s FICV project tender. Emphasizing alignment with the principles of the ‘Make in India’ program, he highlighted the intention to contribute to local manufacturing.\n\nIn his statements, Mikheyev underlined Russia’s comprehension of the Indian government’s aspirations, acknowledging India’s commitment to achieving technological sovereignty and promoting independent industrial development. The proposed plans reflect a collaborative approach that adheres to the principles of the ‘Make in India’ initiative.\n\nEarlier, Russia signed a contract to supply Igla-S hand-held anti-aircraft missiles to India and allow production of the Igla there under licence, the Russian state news agency TASS quoted a top arms export official as saying on Tuesday.\n\nThe Igla-S is a man-portable air defence system (MANPADS) that can be fired by an individual or crew to bring down an enemy aircraft.\n\nIndia is the world’s largest arms importer and Russia remains its largest supplier despite the damage to the reputation of its army and weaponry from the war in Ukraine, where Russia has suffered numerous setbacks at the hands of a smaller but highly motivated and Western-equipped military.\n\nAccording to the Stockholm International Peace Research Institute (SIPRI), Russia accounted for 45% of India’s arms imports between 2018 and 2022, with France providing 29% and the United States 11%.\n\nAnother Russian state news agency, RIA, quoted Mikheyev earlier as saying that “Rosoboronexport is working with Indian private and public enterprises to organise joint production of aviation weapons and integrate them into the existing aviation fleet in India”.\n\nNo details were provided about which Indian companies would be involved or when potential production would start.\n\nMikheyev said Rosoboronexport and Indian partners had provided the Indian Ministry of Defence with Su-30MKI fighter jets, tanks, armoured vehicles and shells. At the beginning of the year, India and Russia also started joint production of AK-203 Kalashnikov assault rifles.\n\n",[Other],[Other],UA


- Currently, a subnarrative is assigned to a narrative by the order in the list in the respective column
- We change this to a better assignment using lists containing dictionaries where each contains a "narrative" and "subnarrative" key
- After this, we only have a single column for the labels "narrative_subnarrative_pairs"

In [6]:
# Function to create narrative-subnarrative pairs in a single column
def create_narrative_subnarrative_pairs(row):
    narratives = row["narratives"]
    subnarratives = row["subnarratives"]

    pairs = []
    for narrative, subnarrative in zip(narratives, subnarratives):
        if subnarrative.startswith(narrative + ":"):
            cleaned_subnarrative = subnarrative[len(narrative) + 1 :].strip()
        else:
            cleaned_subnarrative = subnarrative
        pairs.append({"narrative": narrative, "subnarrative": cleaned_subnarrative})

    return pairs


# Apply the function to each row to create narrative-subnarrative pairs
df["narrative_subnarrative_pairs"] = df.apply(
    create_narrative_subnarrative_pairs, axis=1
)

# Drop the original narratives and subnarratives columns if no longer needed
df = df.drop(columns=["narratives", "subnarratives"])

# Display the updated DataFrame
print("Updated DataFrame with narrative-subnarrative pairs:")
df.head()

Updated DataFrame with narrative-subnarrative pairs:


Unnamed: 0,filename,language,content,topic,narrative_subnarrative_pairs
0,EN_CC_100013.txt,EN,"Bill Gates Says He Is ‘The Solution’ To Climate Change So It’s OK To Own Four Private Jets \n\nBill Gates has the right to fly around the world on private jets while normal people are forced to live in 15 minute cities without freedom of travel, according to Bill Gates himself, who told the BBC he is doing much more than anybody else to fight climate change.\n\nGates claimed that because he continues to “spend billions of dollars” on climate change activism, his carbon footprint isn’t an issue.\n\nSign up to get unfiltered news delivered straight to your inbox.\n\nYou can unsubscribe any time. By subscribing you agree to our Terms of Use\n\n“Should I stay at home and not come to Kenya and learn about farming and malaria?” Gates said in the interview with Amol Rajan.\n\n“I’m comfortable with the idea that not only am I not part of the problem by paying for the offsets, but also through the billions that my Breakthrough Energy Group is spending, that I’m part of the solution,” Gates added. Watch:\n\nEarlier this year, Gates flew around Australia on board his $70 million dollar luxury private jet lecturing people about climate change and ordering them to stop flying on planes.\n\nGates, who has declared that the energy crisis is a good thing, owns no fewer than FOUR private jets at a combined cost of $194 million dollars.\n\nA study carried out by Linnaeus University economics professor Stefan Gössling found that Gates flew more than 213,000 miles on 59 private jet flights in 2017 alone.\n\nGates emitted an estimated 1,760 tons of carbon dioxide emissions, over a hundred times more than the emissions per capita in the United States, according to data from the World Bank.\n\nElsewhere during the carefully constructed interview, Gates said he was surprised that he was targeted by ‘conspiracy theorists’ for pushing vaccines during the pandemic.\n\nWhile the BBC interview was set up to look like Gates was being challenged or grilled, he wasn’t asked about his close friendship with the elite pedophile Jeffrey Epstein.",CC,"[{'narrative': 'Criticism of climate movement', 'subnarrative': 'Ad hominem attacks on key activists'}]"
1,EN_UA_300009.txt,EN,"Russia: Clashes erupt in Bashkortostan as rights activist sentenced \n\n Russian riot police clashed with protesters in Bashkortostan following the sentencing of rights activist Fail Alsynov to four years in a penal colony. Social media footage captured the confrontations near the court in Baymak, with supporters engaging in clashes with police, including throwing snowballs.\n\nViolent clashes in Baymak— NEXTA (@nexta_tv) January 17, 2024\n\nLaw enforcers used stun grenades in Baymak, Bashkortostan, Russia. The demonstrators responded by throwing snow and ice at them and forced them to retreat.\n\nIt is reported that negotiations are underway between the protesters and special forces: the law… pic.twitter.com/AVHf2gBi7w\n\nAlsynov’s conviction for inciting ethnic hatred sparked rare large-scale protests in Russia, where the risk of arrest typically stifles such demonstrations. Reports suggest thousands participated in the multi-day protest in -20°C temperatures, resulting in several injuries. \n\nThe activist denies the charges related to insulting migrants during a demonstration against gold mining plans. Supporters claim the case is retaliation for Alsynov’s activism against soda mining in a culturally significant area. He allegedly referred to Central Asians and Caucasians, comprising a significant portion of Russia’s migrant population, as “black people” in the Bashkir language. \n\nAlsynov contends the Bashkir words meant “poor people” and were mistranslated into Russian. He plans to appeal the verdict. Alsynov has previously criticized military mobilization in the region as “genocide” against the Bashkir people. Ongoing concerns exist about the disproportionate deployment of ethnic minorities, including Bashkirs, in conflicts such as Ukraine. \n\nAlsynov was a leader of Bashkort, a grassroots movement focused on preserving Bashkir ethnic identity, which was banned as extremist in 2020. The clashes signify a rare instance of public dissent in Russia, underscoring the intensity of emotions surrounding Alsynov’s case and broader issues of ethnic identity and activism in the country.",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
2,EN_UA_300017.txt,EN,"McDonald's to exit Russia, sell business in country \n\n American fast-food giant McDonald said Monday it will exit Russia in the wake of the Ukraine invasion.\n\nmerican fast-food giant McDonald said Monday it will exit Russia in the wake of the Ukraine invasion, ending a more than three-decade run begun in the hopeful period near the end of the Cold War.\n\nThe restaurant chain, which launched in Moscow in January 1990 to great fanfare almost two years before the Soviet Union was dissolved, characterized the withdrawal as difficult but necessary.\n\n""The humanitarian crisis caused by the war in Ukraine, and the precipitating unpredictable operating environment, have led McDonald's to conclude that continued ownership of the business in Russia is no longer tenable, nor is it consistent with McDonald's values,"" the company said in a statement.\n\nThe chain is looking to sell ""its entire portfolio of McDonald's restaurants in Russia to a local buyer.""\n\nThe burger giant is one of numerous foreign firms that have pulled out of the country or suspended operations following Moscow's invasion of Ukraine in late February.\n\nEarlier on Monday, French automaker Renault announced it had handed over its Russian assets to the government, marking the first major nationalization since the onset of Western sanctions against Moscow's military campaign.\n\nRussia's President Vladimir Putin ordered troops into pro-Western Ukraine on February 24, triggering unprecedented sanctions and sparking an exodus of foreign corporations including H&M, Starbucks and Ikea.\n\n\n",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
3,EN_CC_100021.txt,EN,"Collaborative plans, innovation keys to circular RMG industry: stakeholders \n\n Stakeholders at a programme on Sunday said that collaborative strategies and innovation were keys to unlocking Bangladesh’s circular apparel industry for availing a significant opportunity for the country to reduce its environmental impact, improve its economic performance and create social benefits.\n\nThe readymade garment sector in Bangladesh is under a transformative journey towards circularity, but embracing circularity also poses certain challenges that must be collectively considered and resolved, they said.\n\nA panel of industry leaders, policymakers and experts in circular economies made the comments at an event titled ‘Switch to Upstream Circularity Dialogue: Pre-consumer Textile Waste in Bangladesh’ at the Amari Hotel in Dhaka on Sunday.\n\nThe event was organised under the Switch to Circular Economy Value Chains project (SWITCH2CE), co-funded by the European Union and the government of Finland.\n\nParliamentary standing committee on the ministry of environment, forest and climate change chairman Saber Hossain Chowdhury was present in the opening session as chief guest.\n\nBangladesh Garment Manufacturers and Exporters Association president Faruque Hassan spoke at the opening session as special guest.\n\nDeputy head of European Union to Bangladesh delegation mission Bernd Spanier, SWITCH2CE chief technical adviser Mark Draeck, among others, also contributed to the event’s opening session.\n\nHilde van Duijn, head of Global Value Chains, Circle Economy, also participated in the event and made circular game demonstration.\n\nFaruque Hassan said, ‘Living in a world in our time where climate is most threatened, business as usual is no more an option. In a race to zero emission and resource decoupling, circularity emerges as the “next normal” linking business and sustainable development. For the BGMEA, circularity sits in the core of our values, mission and vision. Our goal is to help conserve the natural eco-system as much as possible via an economic shift from a linear to circular system, while generating greater social and economic values.’\n\nMark Draeck said, ‘With the support of the European Union and the government of Finland, UNIDO leads the global Switch to Circular Economy Value Chains project.’\n\n‘In Bangladesh, we support the circular transition for the textile and garments industry, by piloting innovative circular solutions in close cooperation with global and national industry leaders, which as BESTSELLER, H&M and their manufacturers,’ he said.\n\n‘These pilots will address acute challenges on new technology, business models and traceability, and will demonstrate the economic opportunity for circular approaches.\n\nThe project also collaborates with government partners, academia, and NGOs to create an enabling environment for circularity through policy and tailored capacity building,’ he added.",CC,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
4,EN_UA_300041.txt,EN,"Russia intends to supply light ‘Mountain’ tanks, infantry fighting vehicles to India \n\n Russia’s state-owned arms company, Rosoboronexport, has declared its intention to participate in the bidding process for the supply of light tanks and the Future Infantry Combat Vehicle (FICV) to India.\n\nAlexander Mikheyev, the CEO of the Russian company announced the decision with the state-run media outlet TASS during the International Aerospace Exhibition at Dubai Airshow 2023.\n\nMikheyev outlined plans for collaboration with Indian partners, to introduce a light tank and an advanced infantry fighting vehicle as part of the Indian Ministry of Defense’s FICV project tender. Emphasizing alignment with the principles of the ‘Make in India’ program, he highlighted the intention to contribute to local manufacturing.\n\nIn his statements, Mikheyev underlined Russia’s comprehension of the Indian government’s aspirations, acknowledging India’s commitment to achieving technological sovereignty and promoting independent industrial development. The proposed plans reflect a collaborative approach that adheres to the principles of the ‘Make in India’ initiative.\n\nEarlier, Russia signed a contract to supply Igla-S hand-held anti-aircraft missiles to India and allow production of the Igla there under licence, the Russian state news agency TASS quoted a top arms export official as saying on Tuesday.\n\nThe Igla-S is a man-portable air defence system (MANPADS) that can be fired by an individual or crew to bring down an enemy aircraft.\n\nIndia is the world’s largest arms importer and Russia remains its largest supplier despite the damage to the reputation of its army and weaponry from the war in Ukraine, where Russia has suffered numerous setbacks at the hands of a smaller but highly motivated and Western-equipped military.\n\nAccording to the Stockholm International Peace Research Institute (SIPRI), Russia accounted for 45% of India’s arms imports between 2018 and 2022, with France providing 29% and the United States 11%.\n\nAnother Russian state news agency, RIA, quoted Mikheyev earlier as saying that “Rosoboronexport is working with Indian private and public enterprises to organise joint production of aviation weapons and integrate them into the existing aviation fleet in India”.\n\nNo details were provided about which Indian companies would be involved or when potential production would start.\n\nMikheyev said Rosoboronexport and Indian partners had provided the Indian Ministry of Defence with Su-30MKI fighter jets, tanks, armoured vehicles and shells. At the beginning of the year, India and Russia also started joint production of AK-203 Kalashnikov assault rifles.\n\n",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"


### Extract the label taxonomy
- The competition creators provide a complete taxonomy of labels for each of the two topics
- We create a templete for each label taxonomy (for each of the two topics) from the subtask 2 pdf file
- We can use this to encode the labels of the datapoints in the dataset
- We do this by assigning an index to every possible class (narrative-subnarrative pair) to numerically represent the labels for each document

In [7]:
df.head()

Unnamed: 0,filename,language,content,topic,narrative_subnarrative_pairs
0,EN_CC_100013.txt,EN,"Bill Gates Says He Is ‘The Solution’ To Climate Change So It’s OK To Own Four Private Jets \n\nBill Gates has the right to fly around the world on private jets while normal people are forced to live in 15 minute cities without freedom of travel, according to Bill Gates himself, who told the BBC he is doing much more than anybody else to fight climate change.\n\nGates claimed that because he continues to “spend billions of dollars” on climate change activism, his carbon footprint isn’t an issue.\n\nSign up to get unfiltered news delivered straight to your inbox.\n\nYou can unsubscribe any time. By subscribing you agree to our Terms of Use\n\n“Should I stay at home and not come to Kenya and learn about farming and malaria?” Gates said in the interview with Amol Rajan.\n\n“I’m comfortable with the idea that not only am I not part of the problem by paying for the offsets, but also through the billions that my Breakthrough Energy Group is spending, that I’m part of the solution,” Gates added. Watch:\n\nEarlier this year, Gates flew around Australia on board his $70 million dollar luxury private jet lecturing people about climate change and ordering them to stop flying on planes.\n\nGates, who has declared that the energy crisis is a good thing, owns no fewer than FOUR private jets at a combined cost of $194 million dollars.\n\nA study carried out by Linnaeus University economics professor Stefan Gössling found that Gates flew more than 213,000 miles on 59 private jet flights in 2017 alone.\n\nGates emitted an estimated 1,760 tons of carbon dioxide emissions, over a hundred times more than the emissions per capita in the United States, according to data from the World Bank.\n\nElsewhere during the carefully constructed interview, Gates said he was surprised that he was targeted by ‘conspiracy theorists’ for pushing vaccines during the pandemic.\n\nWhile the BBC interview was set up to look like Gates was being challenged or grilled, he wasn’t asked about his close friendship with the elite pedophile Jeffrey Epstein.",CC,"[{'narrative': 'Criticism of climate movement', 'subnarrative': 'Ad hominem attacks on key activists'}]"
1,EN_UA_300009.txt,EN,"Russia: Clashes erupt in Bashkortostan as rights activist sentenced \n\n Russian riot police clashed with protesters in Bashkortostan following the sentencing of rights activist Fail Alsynov to four years in a penal colony. Social media footage captured the confrontations near the court in Baymak, with supporters engaging in clashes with police, including throwing snowballs.\n\nViolent clashes in Baymak— NEXTA (@nexta_tv) January 17, 2024\n\nLaw enforcers used stun grenades in Baymak, Bashkortostan, Russia. The demonstrators responded by throwing snow and ice at them and forced them to retreat.\n\nIt is reported that negotiations are underway between the protesters and special forces: the law… pic.twitter.com/AVHf2gBi7w\n\nAlsynov’s conviction for inciting ethnic hatred sparked rare large-scale protests in Russia, where the risk of arrest typically stifles such demonstrations. Reports suggest thousands participated in the multi-day protest in -20°C temperatures, resulting in several injuries. \n\nThe activist denies the charges related to insulting migrants during a demonstration against gold mining plans. Supporters claim the case is retaliation for Alsynov’s activism against soda mining in a culturally significant area. He allegedly referred to Central Asians and Caucasians, comprising a significant portion of Russia’s migrant population, as “black people” in the Bashkir language. \n\nAlsynov contends the Bashkir words meant “poor people” and were mistranslated into Russian. He plans to appeal the verdict. Alsynov has previously criticized military mobilization in the region as “genocide” against the Bashkir people. Ongoing concerns exist about the disproportionate deployment of ethnic minorities, including Bashkirs, in conflicts such as Ukraine. \n\nAlsynov was a leader of Bashkort, a grassroots movement focused on preserving Bashkir ethnic identity, which was banned as extremist in 2020. The clashes signify a rare instance of public dissent in Russia, underscoring the intensity of emotions surrounding Alsynov’s case and broader issues of ethnic identity and activism in the country.",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
2,EN_UA_300017.txt,EN,"McDonald's to exit Russia, sell business in country \n\n American fast-food giant McDonald said Monday it will exit Russia in the wake of the Ukraine invasion.\n\nmerican fast-food giant McDonald said Monday it will exit Russia in the wake of the Ukraine invasion, ending a more than three-decade run begun in the hopeful period near the end of the Cold War.\n\nThe restaurant chain, which launched in Moscow in January 1990 to great fanfare almost two years before the Soviet Union was dissolved, characterized the withdrawal as difficult but necessary.\n\n""The humanitarian crisis caused by the war in Ukraine, and the precipitating unpredictable operating environment, have led McDonald's to conclude that continued ownership of the business in Russia is no longer tenable, nor is it consistent with McDonald's values,"" the company said in a statement.\n\nThe chain is looking to sell ""its entire portfolio of McDonald's restaurants in Russia to a local buyer.""\n\nThe burger giant is one of numerous foreign firms that have pulled out of the country or suspended operations following Moscow's invasion of Ukraine in late February.\n\nEarlier on Monday, French automaker Renault announced it had handed over its Russian assets to the government, marking the first major nationalization since the onset of Western sanctions against Moscow's military campaign.\n\nRussia's President Vladimir Putin ordered troops into pro-Western Ukraine on February 24, triggering unprecedented sanctions and sparking an exodus of foreign corporations including H&M, Starbucks and Ikea.\n\n\n",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
3,EN_CC_100021.txt,EN,"Collaborative plans, innovation keys to circular RMG industry: stakeholders \n\n Stakeholders at a programme on Sunday said that collaborative strategies and innovation were keys to unlocking Bangladesh’s circular apparel industry for availing a significant opportunity for the country to reduce its environmental impact, improve its economic performance and create social benefits.\n\nThe readymade garment sector in Bangladesh is under a transformative journey towards circularity, but embracing circularity also poses certain challenges that must be collectively considered and resolved, they said.\n\nA panel of industry leaders, policymakers and experts in circular economies made the comments at an event titled ‘Switch to Upstream Circularity Dialogue: Pre-consumer Textile Waste in Bangladesh’ at the Amari Hotel in Dhaka on Sunday.\n\nThe event was organised under the Switch to Circular Economy Value Chains project (SWITCH2CE), co-funded by the European Union and the government of Finland.\n\nParliamentary standing committee on the ministry of environment, forest and climate change chairman Saber Hossain Chowdhury was present in the opening session as chief guest.\n\nBangladesh Garment Manufacturers and Exporters Association president Faruque Hassan spoke at the opening session as special guest.\n\nDeputy head of European Union to Bangladesh delegation mission Bernd Spanier, SWITCH2CE chief technical adviser Mark Draeck, among others, also contributed to the event’s opening session.\n\nHilde van Duijn, head of Global Value Chains, Circle Economy, also participated in the event and made circular game demonstration.\n\nFaruque Hassan said, ‘Living in a world in our time where climate is most threatened, business as usual is no more an option. In a race to zero emission and resource decoupling, circularity emerges as the “next normal” linking business and sustainable development. For the BGMEA, circularity sits in the core of our values, mission and vision. Our goal is to help conserve the natural eco-system as much as possible via an economic shift from a linear to circular system, while generating greater social and economic values.’\n\nMark Draeck said, ‘With the support of the European Union and the government of Finland, UNIDO leads the global Switch to Circular Economy Value Chains project.’\n\n‘In Bangladesh, we support the circular transition for the textile and garments industry, by piloting innovative circular solutions in close cooperation with global and national industry leaders, which as BESTSELLER, H&M and their manufacturers,’ he said.\n\n‘These pilots will address acute challenges on new technology, business models and traceability, and will demonstrate the economic opportunity for circular approaches.\n\nThe project also collaborates with government partners, academia, and NGOs to create an enabling environment for circularity through policy and tailored capacity building,’ he added.",CC,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
4,EN_UA_300041.txt,EN,"Russia intends to supply light ‘Mountain’ tanks, infantry fighting vehicles to India \n\n Russia’s state-owned arms company, Rosoboronexport, has declared its intention to participate in the bidding process for the supply of light tanks and the Future Infantry Combat Vehicle (FICV) to India.\n\nAlexander Mikheyev, the CEO of the Russian company announced the decision with the state-run media outlet TASS during the International Aerospace Exhibition at Dubai Airshow 2023.\n\nMikheyev outlined plans for collaboration with Indian partners, to introduce a light tank and an advanced infantry fighting vehicle as part of the Indian Ministry of Defense’s FICV project tender. Emphasizing alignment with the principles of the ‘Make in India’ program, he highlighted the intention to contribute to local manufacturing.\n\nIn his statements, Mikheyev underlined Russia’s comprehension of the Indian government’s aspirations, acknowledging India’s commitment to achieving technological sovereignty and promoting independent industrial development. The proposed plans reflect a collaborative approach that adheres to the principles of the ‘Make in India’ initiative.\n\nEarlier, Russia signed a contract to supply Igla-S hand-held anti-aircraft missiles to India and allow production of the Igla there under licence, the Russian state news agency TASS quoted a top arms export official as saying on Tuesday.\n\nThe Igla-S is a man-portable air defence system (MANPADS) that can be fired by an individual or crew to bring down an enemy aircraft.\n\nIndia is the world’s largest arms importer and Russia remains its largest supplier despite the damage to the reputation of its army and weaponry from the war in Ukraine, where Russia has suffered numerous setbacks at the hands of a smaller but highly motivated and Western-equipped military.\n\nAccording to the Stockholm International Peace Research Institute (SIPRI), Russia accounted for 45% of India’s arms imports between 2018 and 2022, with France providing 29% and the United States 11%.\n\nAnother Russian state news agency, RIA, quoted Mikheyev earlier as saying that “Rosoboronexport is working with Indian private and public enterprises to organise joint production of aviation weapons and integrate them into the existing aviation fleet in India”.\n\nNo details were provided about which Indian companies would be involved or when potential production would start.\n\nMikheyev said Rosoboronexport and Indian partners had provided the Indian Ministry of Defence with Su-30MKI fighter jets, tanks, armoured vehicles and shells. At the beginning of the year, India and Russia also started joint production of AK-203 Kalashnikov assault rifles.\n\n",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"


In [8]:
# Define the path to the PDF file
pdf_path = "../info/subtask2_NARRATIVE-TAXONOMIES.pdf"

# Open the PDF file
pdf_document = fitz.open(pdf_path)


# Function to extract text from a specific page
def extract_text_from_page(page_number):
    page = pdf_document.load_page(page_number)
    text = page.get_text("text")
    return text


# Extract text from the relevant pages
ukraine_war_text = extract_text_from_page(0)  # First page contains Ukraine War taxonomy
climate_change_text = extract_text_from_page(
    1
)  # Second page contains Climate Change taxonomy


# Function to parse the taxonomy text and create a DataFrame
def parse_taxonomy(text):
    lines = text.split("\n")
    # Exclude the last two lines
    lines = lines[:-3]
    data = []
    current_narrative = None
    for line in lines:
        if line.strip() == "":
            continue
        if line.startswith("-"):  # Subnarrative
            subnarrative = line.strip("- ").strip()
            data.append({"narrative": current_narrative, "subnarrative": subnarrative})
        else:  # Narrative
            if current_narrative and not any(
                d["narrative"] == current_narrative for d in data
            ):
                # Add the narrative itself as subnarrative if it has no subnarratives
                data.append(
                    {"narrative": current_narrative, "subnarrative": current_narrative}
                )
            current_narrative = line.strip()
            if current_narrative == "Other":
                data.append({"narrative": "Other", "subnarrative": "Other"})
    # Handle the last narrative if it has no subnarratives
    if current_narrative and not any(d["narrative"] == current_narrative for d in data):
        data.append({"narrative": current_narrative, "subnarrative": "Other"})

    df = pd.DataFrame(data)
    df = df.sort_values(by="narrative", ascending=True).reset_index(drop=True)
    return df


# Parse the taxonomies and create DataFrames
ukraine_war_df = parse_taxonomy(ukraine_war_text)
climate_change_df = parse_taxonomy(climate_change_text)

In [9]:
ukraine_war_df.head(50)

Unnamed: 0,narrative,subnarrative
0,Amplifying war-related fears,Russia will also attack other countries
1,Amplifying war-related fears,By continuing the war we risk WWIII
2,Amplifying war-related fears,There is a real possibility that nuclear weapons will be employed
3,Amplifying war-related fears,NATO should/will directly intervene
4,Blaming the war on others rather than the invader,Ukraine is the aggressor
5,Blaming the war on others rather than the invader,The West are the aggressors
6,Discrediting Ukraine,Situation in Ukraine is hopeless
7,Discrediting Ukraine,Ukraine is associated with nazism
8,Discrediting Ukraine,Ukraine is a hub for criminal activities
9,Discrediting Ukraine,Discrediting Ukrainian government and officials and policies


In [10]:
climate_change_df.head(50)

Unnamed: 0,narrative,subnarrative
0,Amplifying Climate Fears,Earth will be uninhabitable soon
1,Amplifying Climate Fears,Amplifying existing fears of global warming
2,Amplifying Climate Fears,Doomsday scenarios for humans
3,Amplifying Climate Fears,Whatever we do it is already too late
4,Climate change is beneficial,CO2 is beneficial
5,Climate change is beneficial,Temperature increase is beneficial
6,Controversy about green technologies,Nuclear energy is not climate friendly
7,Controversy about green technologies,Renewable energy is costly
8,Controversy about green technologies,Renewable energy is unreliable
9,Controversy about green technologies,Renewable energy is dangerous


- Handle some edge cases in the taxonomy as instructed in the task description (see https://propaganda.math.unipd.it/semeval2025task10/index.html)
- E.g. if a narrative is identified but no subnarrative applies, the subnarrative is "Other". We apply this to all rows in the taxonomy dataframe except for the narratives "Other" and "Hidden plots by secret schemes of powerful groups" since those are already present (in one of the two dataframes) 
- After manually adding the "Other", "Hidden plots by secret schemes of powerful groups" combination in the climate change taxonomy, all possible combinations are be covered
    - We do this because the "Hidden plots by secret schemes of powerful groups" - "Other" combination is already present in the UA taxonomy but not in the CC taxonomy

In [11]:
# Function to add "Other" subnarrative for each narrative group, excluding specific narratives
def add_other_subnarrative(df):
    additional_rows = []
    unique_narratives = df["narrative"].unique()
    for narrative in unique_narratives:
        if narrative not in [
            "Other",
            "Hidden plots by secret schemes of powerful groups",
        ]:
            additional_rows.append({"narrative": narrative, "subnarrative": "Other"})
    additional_df = pd.DataFrame(additional_rows)
    return pd.concat([df, additional_df], ignore_index=True)


# Function to sort the DataFrame and add an index column
def sort_and_index_df(df):
    df = df.sort_values(by=["narrative", "subnarrative"]).reset_index(drop=True)
    df["index"] = df.index + 1
    df = df[["index", "narrative", "subnarrative"]]
    return df


# Add "Other" subnarrative to each DataFrame
ukraine_war_df = add_other_subnarrative(ukraine_war_df)
climate_change_df = add_other_subnarrative(climate_change_df)

# Manually add the specific row to the climate change DataFrame
climate_change_df = pd.concat(
    [
        climate_change_df,
        pd.DataFrame(
            [
                {
                    "narrative": "Hidden plots by secret schemes of powerful groups",
                    "subnarrative": "Other",
                }
            ]
        ),
    ],
    ignore_index=True,
)

# Sort and add index column to each DataFrame
ukraine_war_df = sort_and_index_df(ukraine_war_df)
climate_change_df = sort_and_index_df(climate_change_df)

In [12]:
ukraine_war_df.head(55)

Unnamed: 0,index,narrative,subnarrative
0,1,Amplifying war-related fears,By continuing the war we risk WWIII
1,2,Amplifying war-related fears,NATO should/will directly intervene
2,3,Amplifying war-related fears,Other
3,4,Amplifying war-related fears,Russia will also attack other countries
4,5,Amplifying war-related fears,There is a real possibility that nuclear weapons will be employed
5,6,Blaming the war on others rather than the invader,Other
6,7,Blaming the war on others rather than the invader,The West are the aggressors
7,8,Blaming the war on others rather than the invader,Ukraine is the aggressor
8,9,Discrediting Ukraine,Discrediting Ukrainian government and officials and policies
9,10,Discrediting Ukraine,Discrediting Ukrainian military


In [13]:
climate_change_df.head(50)

Unnamed: 0,index,narrative,subnarrative
0,1,Amplifying Climate Fears,Amplifying existing fears of global warming
1,2,Amplifying Climate Fears,Doomsday scenarios for humans
2,3,Amplifying Climate Fears,Earth will be uninhabitable soon
3,4,Amplifying Climate Fears,Other
4,5,Amplifying Climate Fears,Whatever we do it is already too late
5,6,Climate change is beneficial,CO2 is beneficial
6,7,Climate change is beneficial,Other
7,8,Climate change is beneficial,Temperature increase is beneficial
8,9,Controversy about green technologies,Nuclear energy is not climate friendly
9,10,Controversy about green technologies,Other


### Map the labels in the file to the taxonomy

- In the next step, we want to add another column to our "df" that combines the narrative_subnarrative_pairs column with the information in our taxonomy dataframes
- Firstly, the column "topic" tells us what taxonomy df should be applied (UA for ukraine war and CC for climate change)
- Then, theoretically, the dictionaries should exactly correspond to a given row in one of the two taxonomy dataframes
- We check if every narrative-subnarrative pair of every row in the dataset can be mapped to its row in the taxonomy and if so, add a column to the dataframe that contains the indices of the targets.

In [14]:
# Create a mapping of narrative-subnarrative pairs to their indices
def create_mapping(df):
    mapping = {}
    for _, row in df.iterrows():
        key = (row["narrative"], row["subnarrative"])
        mapping[key] = row["index"]
    return mapping


ukraine_war_mapping = create_mapping(ukraine_war_df)
climate_change_mapping = create_mapping(climate_change_df)


# Function to check the mapping and add indices to the DataFrame
def add_target_indices(row, ukraine_war_mapping, climate_change_mapping):
    pairs = row["narrative_subnarrative_pairs"]
    topic = row["topic"]
    indices = []

    if topic == "UA":
        mapping = ukraine_war_mapping
    elif topic == "CC":
        mapping = climate_change_mapping
    else:
        return None  # Invalid topic

    for pair in pairs:
        key = (pair["narrative"], pair["subnarrative"])
        if key in mapping:
            indices.append(mapping[key])
        else:
            return (row.name, key)  # Mapping does not exist

    return indices


# Apply the function to each row and collect the results
df["target_indices"] = df.apply(
    add_target_indices, axis=1, args=(ukraine_war_mapping, climate_change_mapping)
)

# Filter out the rows where the mapping does not exist
problematic_rows = df[df["target_indices"].apply(lambda x: isinstance(x, tuple))]

# Display the problematic rows
print("Problematic rows where the mapping does not exist:")
print(problematic_rows)

# Display the first few problematic rows for inspection
if not problematic_rows.empty:
    print("First few problematic rows:")
    for index, row in problematic_rows.iterrows():
        print(f"Row index: {index}, Problematic pair: {row['target_indices']}")

Problematic rows where the mapping does not exist:
               filename language  \
57     EN_UA_103011.txt       EN   
337    EN_UA_103025.txt       EN   
771   A8_CC_BG_8534.txt       BG   
1300     PT_URW_418.txt       PT   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                

In [15]:
len(problematic_rows)

4

- There are two datapoints/rows in the dataset where the mapping does not work, the indices 65 and 143
- With a closer look, we see what the problem is: All but the two rows are EITHER ukraine war OR climate change
- In the two rows, this is not the case (remember that our implementation relies on classifying whole articles as a topic, i.e. all the labels must be from the taxonomy of that topic)

Below we check the annotation file for rows where there are mixed topics in the labels.

In [16]:
# Define the path to the annotations file
annotations_file = "../training_data_04_December_release/EN/subtask-2-annotations.txt"

# Read the annotations file
annotations = pd.read_csv(
    annotations_file,
    sep="\t",
    header=None,
    names=["filename", "narrative", "subnarrative"],
)

# Initialize a list to store the line numbers with both "CC" and "URW"
mixed_topic_lines = []

# Iterate through each row and check for mixed topics
for index, row in annotations.iterrows():
    narrative = row["narrative"]
    subnarrative = row["subnarrative"]

    # Check if both "CC" and "URW" are present in either narrative or subnarrative
    if ("CC: " in narrative and "URW: " in narrative) or (
        "CC: " in subnarrative and "URW: " in subnarrative
    ):
        mixed_topic_lines.append(index + 1)  # Adding 1 to index to match line numbers

# Print the line numbers with mixed topics
print("Lines with both 'CC' and 'URW' present:")
print(mixed_topic_lines)

# Print the total count of such lines
print("Total number of lines with mixed topics:", len(mixed_topic_lines))

Lines with both 'CC' and 'URW' present:
[58]
Total number of lines with mixed topics: 1


- This shows us that line 58 in the annotations file (index 57 in the dataset) has labels from both taxonomies
- Line 338 and 962 in the annotation file on the other hand (Index 337 and 961) has UA in the filename but CC in all the labels, which we assume to be an error in the dataset
- Line 772 in the annotation file (Index 771) has CC in the filename, but UA in all the labels, which we also assume to be an error in the dataset
- We deal with this by dropping the four rows from the dataset

In [17]:
df_short = df.drop([57, 337, 771, 1300])

### Conclusion for initial data prep and label representation
- We now have a dataframe containing all relevant data, including which topic an article belongs to, the article content (in raw form up until now) and the labels (narrative and subnarrative combinations) in text form and in numerical form
- We removed 4 problematic datapoints and are left with 1695
- While the label classes are basically ordinally encoded right now, one-hot encoding would be more suitable since there is no ordinal relationship for the class labels
- We can now easily change the encoding which we will do in Milestone 2. The current implementation also easily supports differentiating the topics to train separate models using the "topic" column

## Text Segmentation <a class="anchor" id="text-segmentation"></a>

<a class="anchor" id="tokenize-sentences-and-words"></a>
Now we handle the content of the articles. Currently, each entry in our dataframe has a single plain string that contains the whole article.

Let's start by splitting it into sentences and words.

In [18]:
def tokenize(df):
    df["tokens"] = None
    for i, row in df.iterrows():
        # split the content into sentences
        sentences = nltk.sent_tokenize(row["content"])
        # tokenize each sentence
        tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
        df.at[i, "tokens"] = tokens
    return df


df_short = tokenize(df_short)
df_short.head()

Unnamed: 0,filename,language,content,topic,narrative_subnarrative_pairs,target_indices,tokens
0,EN_CC_100013.txt,EN,"Bill Gates Says He Is ‘The Solution’ To Climate Change So It’s OK To Own Four Private Jets \n\nBill Gates has the right to fly around the world on private jets while normal people are forced to live in 15 minute cities without freedom of travel, according to Bill Gates himself, who told the BBC he is doing much more than anybody else to fight climate change.\n\nGates claimed that because he continues to “spend billions of dollars” on climate change activism, his carbon footprint isn’t an issue.\n\nSign up to get unfiltered news delivered straight to your inbox.\n\nYou can unsubscribe any time. By subscribing you agree to our Terms of Use\n\n“Should I stay at home and not come to Kenya and learn about farming and malaria?” Gates said in the interview with Amol Rajan.\n\n“I’m comfortable with the idea that not only am I not part of the problem by paying for the offsets, but also through the billions that my Breakthrough Energy Group is spending, that I’m part of the solution,” Gates added. Watch:\n\nEarlier this year, Gates flew around Australia on board his $70 million dollar luxury private jet lecturing people about climate change and ordering them to stop flying on planes.\n\nGates, who has declared that the energy crisis is a good thing, owns no fewer than FOUR private jets at a combined cost of $194 million dollars.\n\nA study carried out by Linnaeus University economics professor Stefan Gössling found that Gates flew more than 213,000 miles on 59 private jet flights in 2017 alone.\n\nGates emitted an estimated 1,760 tons of carbon dioxide emissions, over a hundred times more than the emissions per capita in the United States, according to data from the World Bank.\n\nElsewhere during the carefully constructed interview, Gates said he was surprised that he was targeted by ‘conspiracy theorists’ for pushing vaccines during the pandemic.\n\nWhile the BBC interview was set up to look like Gates was being challenged or grilled, he wasn’t asked about his close friendship with the elite pedophile Jeffrey Epstein.",CC,"[{'narrative': 'Criticism of climate movement', 'subnarrative': 'Ad hominem attacks on key activists'}]",[14],"[[Bill, Gates, Says, He, Is, ‘, The, Solution, ’, To, Climate, Change, So, It, ’, s, OK, To, Own, Four, Private, Jets, Bill, Gates, has, the, right, to, fly, around, the, world, on, private, jets, while, normal, people, are, forced, to, live, in, 15, minute, cities, without, freedom, of, travel, ,, according, to, Bill, Gates, himself, ,, who, told, the, BBC, he, is, doing, much, more, than, anybody, else, to, fight, climate, change, .], [Gates, claimed, that, because, he, continues, to, “, spend, billions, of, dollars, ”, on, climate, change, activism, ,, his, carbon, footprint, isn, ’, t, an, issue, .], [Sign, up, to, get, unfiltered, news, delivered, straight, to, your, inbox, .], [You, can, unsubscribe, any, time, .], [By, subscribing, you, agree, to, our, Terms, of, Use, “, Should, I, stay, at, home, and, not, come, to, Kenya, and, learn, about, farming, and, malaria, ?, ”, Gates, said, in, the, interview, with, Amol, Rajan, .], [“, I, ’, m, comfortable, with, the, idea, that, not, only, am, I, not, part, of, the, problem, by, paying, for, the, offsets, ,, but, also, through, the, billions, that, my, Breakthrough, Energy, Group, is, spending, ,, that, I, ’, m, part, of, the, solution, ,, ”, Gates, added, .], [Watch, :, Earlier, this, year, ,, Gates, flew, around, Australia, on, board, his, $, 70, million, dollar, luxury, private, jet, lecturing, people, about, climate, change, and, ordering, them, to, stop, flying, on, planes, .], [Gates, ,, who, has, declared, that, the, energy, crisis, is, a, good, thing, ,, owns, no, fewer, than, FOUR, private, jets, at, a, combined, cost, of, $, 194, million, dollars, .], [A, study, carried, out, by, Linnaeus, University, economics, professor, Stefan, Gössling, found, that, Gates, flew, more, than, 213,000, miles, on, 59, private, jet, flights, in, 2017, alone, .], [Gates, emitted, an, estimated, 1,760, tons, of, carbon, dioxide, emissions, ,, over, a, hundred, times, more, than, the, emissions, per, capita, in, the, United, States, ,, according, to, data, from, the, World, Bank, .], [Elsewhere, during, the, carefully, constructed, interview, ,, Gates, said, he, was, surprised, that, he, was, targeted, by, ‘, conspiracy, theorists, ’, for, pushing, vaccines, during, the, pandemic, .], [While, the, BBC, interview, was, set, up, to, look, like, Gates, was, being, challenged, or, grilled, ,, he, wasn, ’, t, asked, about, his, close, friendship, with, the, elite, pedophile, Jeffrey, Epstein, .]]"
1,EN_UA_300009.txt,EN,"Russia: Clashes erupt in Bashkortostan as rights activist sentenced \n\n Russian riot police clashed with protesters in Bashkortostan following the sentencing of rights activist Fail Alsynov to four years in a penal colony. Social media footage captured the confrontations near the court in Baymak, with supporters engaging in clashes with police, including throwing snowballs.\n\nViolent clashes in Baymak— NEXTA (@nexta_tv) January 17, 2024\n\nLaw enforcers used stun grenades in Baymak, Bashkortostan, Russia. The demonstrators responded by throwing snow and ice at them and forced them to retreat.\n\nIt is reported that negotiations are underway between the protesters and special forces: the law… pic.twitter.com/AVHf2gBi7w\n\nAlsynov’s conviction for inciting ethnic hatred sparked rare large-scale protests in Russia, where the risk of arrest typically stifles such demonstrations. Reports suggest thousands participated in the multi-day protest in -20°C temperatures, resulting in several injuries. \n\nThe activist denies the charges related to insulting migrants during a demonstration against gold mining plans. Supporters claim the case is retaliation for Alsynov’s activism against soda mining in a culturally significant area. He allegedly referred to Central Asians and Caucasians, comprising a significant portion of Russia’s migrant population, as “black people” in the Bashkir language. \n\nAlsynov contends the Bashkir words meant “poor people” and were mistranslated into Russian. He plans to appeal the verdict. Alsynov has previously criticized military mobilization in the region as “genocide” against the Bashkir people. Ongoing concerns exist about the disproportionate deployment of ethnic minorities, including Bashkirs, in conflicts such as Ukraine. \n\nAlsynov was a leader of Bashkort, a grassroots movement focused on preserving Bashkir ethnic identity, which was banned as extremist in 2020. The clashes signify a rare instance of public dissent in Russia, underscoring the intensity of emotions surrounding Alsynov’s case and broader issues of ethnic identity and activism in the country.",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]",[32],"[[Russia, :, Clashes, erupt, in, Bashkortostan, as, rights, activist, sentenced, Russian, riot, police, clashed, with, protesters, in, Bashkortostan, following, the, sentencing, of, rights, activist, Fail, Alsynov, to, four, years, in, a, penal, colony, .], [Social, media, footage, captured, the, confrontations, near, the, court, in, Baymak, ,, with, supporters, engaging, in, clashes, with, police, ,, including, throwing, snowballs, .], [Violent, clashes, in, Baymak—, NEXTA, (, @, nexta_tv, ), January, 17, ,, 2024, Law, enforcers, used, stun, grenades, in, Baymak, ,, Bashkortostan, ,, Russia, .], [The, demonstrators, responded, by, throwing, snow, and, ice, at, them, and, forced, them, to, retreat, .], [It, is, reported, that, negotiations, are, underway, between, the, protesters, and, special, forces, :, the, law…, pic.twitter.com/AVHf2gBi7w, Alsynov, ’, s, conviction, for, inciting, ethnic, hatred, sparked, rare, large-scale, protests, in, Russia, ,, where, the, risk, of, arrest, typically, stifles, such, demonstrations, .], [Reports, suggest, thousands, participated, in, the, multi-day, protest, in, -20°C, temperatures, ,, resulting, in, several, injuries, .], [The, activist, denies, the, charges, related, to, insulting, migrants, during, a, demonstration, against, gold, mining, plans, .], [Supporters, claim, the, case, is, retaliation, for, Alsynov, ’, s, activism, against, soda, mining, in, a, culturally, significant, area, .], [He, allegedly, referred, to, Central, Asians, and, Caucasians, ,, comprising, a, significant, portion, of, Russia, ’, s, migrant, population, ,, as, “, black, people, ”, in, the, Bashkir, language, .], [Alsynov, contends, the, Bashkir, words, meant, “, poor, people, ”, and, were, mistranslated, into, Russian, .], [He, plans, to, appeal, the, verdict, .], [Alsynov, has, previously, criticized, military, mobilization, in, the, region, as, “, genocide, ”, against, the, Bashkir, people, .], [Ongoing, concerns, exist, about, the, disproportionate, deployment, of, ethnic, minorities, ,, including, Bashkirs, ,, in, conflicts, such, as, Ukraine, .], [Alsynov, was, a, leader, of, Bashkort, ,, a, grassroots, movement, focused, on, preserving, Bashkir, ethnic, identity, ,, which, was, banned, as, extremist, in, 2020, .], [The, clashes, signify, a, rare, instance, of, public, dissent, in, Russia, ,, underscoring, the, intensity, of, emotions, surrounding, Alsynov, ’, s, case, and, broader, issues, of, ethnic, identity, and, activism, in, the, country, .]]"
2,EN_UA_300017.txt,EN,"McDonald's to exit Russia, sell business in country \n\n American fast-food giant McDonald said Monday it will exit Russia in the wake of the Ukraine invasion.\n\nmerican fast-food giant McDonald said Monday it will exit Russia in the wake of the Ukraine invasion, ending a more than three-decade run begun in the hopeful period near the end of the Cold War.\n\nThe restaurant chain, which launched in Moscow in January 1990 to great fanfare almost two years before the Soviet Union was dissolved, characterized the withdrawal as difficult but necessary.\n\n""The humanitarian crisis caused by the war in Ukraine, and the precipitating unpredictable operating environment, have led McDonald's to conclude that continued ownership of the business in Russia is no longer tenable, nor is it consistent with McDonald's values,"" the company said in a statement.\n\nThe chain is looking to sell ""its entire portfolio of McDonald's restaurants in Russia to a local buyer.""\n\nThe burger giant is one of numerous foreign firms that have pulled out of the country or suspended operations following Moscow's invasion of Ukraine in late February.\n\nEarlier on Monday, French automaker Renault announced it had handed over its Russian assets to the government, marking the first major nationalization since the onset of Western sanctions against Moscow's military campaign.\n\nRussia's President Vladimir Putin ordered troops into pro-Western Ukraine on February 24, triggering unprecedented sanctions and sparking an exodus of foreign corporations including H&M, Starbucks and Ikea.\n\n\n",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]",[32],"[[McDonald, 's, to, exit, Russia, ,, sell, business, in, country, American, fast-food, giant, McDonald, said, Monday, it, will, exit, Russia, in, the, wake, of, the, Ukraine, invasion, .], [merican, fast-food, giant, McDonald, said, Monday, it, will, exit, Russia, in, the, wake, of, the, Ukraine, invasion, ,, ending, a, more, than, three-decade, run, begun, in, the, hopeful, period, near, the, end, of, the, Cold, War, .], [The, restaurant, chain, ,, which, launched, in, Moscow, in, January, 1990, to, great, fanfare, almost, two, years, before, the, Soviet, Union, was, dissolved, ,, characterized, the, withdrawal, as, difficult, but, necessary, .], [``, The, humanitarian, crisis, caused, by, the, war, in, Ukraine, ,, and, the, precipitating, unpredictable, operating, environment, ,, have, led, McDonald, 's, to, conclude, that, continued, ownership, of, the, business, in, Russia, is, no, longer, tenable, ,, nor, is, it, consistent, with, McDonald, 's, values, ,, '', the, company, said, in, a, statement, .], [The, chain, is, looking, to, sell, ``, its, entire, portfolio, of, McDonald, 's, restaurants, in, Russia, to, a, local, buyer, ., ''], [The, burger, giant, is, one, of, numerous, foreign, firms, that, have, pulled, out, of, the, country, or, suspended, operations, following, Moscow, 's, invasion, of, Ukraine, in, late, February, .], [Earlier, on, Monday, ,, French, automaker, Renault, announced, it, had, handed, over, its, Russian, assets, to, the, government, ,, marking, the, first, major, nationalization, since, the, onset, of, Western, sanctions, against, Moscow, 's, military, campaign, .], [Russia, 's, President, Vladimir, Putin, ordered, troops, into, pro-Western, Ukraine, on, February, 24, ,, triggering, unprecedented, sanctions, and, sparking, an, exodus, of, foreign, corporations, including, H, &, M, ,, Starbucks, and, Ikea, .]]"
3,EN_CC_100021.txt,EN,"Collaborative plans, innovation keys to circular RMG industry: stakeholders \n\n Stakeholders at a programme on Sunday said that collaborative strategies and innovation were keys to unlocking Bangladesh’s circular apparel industry for availing a significant opportunity for the country to reduce its environmental impact, improve its economic performance and create social benefits.\n\nThe readymade garment sector in Bangladesh is under a transformative journey towards circularity, but embracing circularity also poses certain challenges that must be collectively considered and resolved, they said.\n\nA panel of industry leaders, policymakers and experts in circular economies made the comments at an event titled ‘Switch to Upstream Circularity Dialogue: Pre-consumer Textile Waste in Bangladesh’ at the Amari Hotel in Dhaka on Sunday.\n\nThe event was organised under the Switch to Circular Economy Value Chains project (SWITCH2CE), co-funded by the European Union and the government of Finland.\n\nParliamentary standing committee on the ministry of environment, forest and climate change chairman Saber Hossain Chowdhury was present in the opening session as chief guest.\n\nBangladesh Garment Manufacturers and Exporters Association president Faruque Hassan spoke at the opening session as special guest.\n\nDeputy head of European Union to Bangladesh delegation mission Bernd Spanier, SWITCH2CE chief technical adviser Mark Draeck, among others, also contributed to the event’s opening session.\n\nHilde van Duijn, head of Global Value Chains, Circle Economy, also participated in the event and made circular game demonstration.\n\nFaruque Hassan said, ‘Living in a world in our time where climate is most threatened, business as usual is no more an option. In a race to zero emission and resource decoupling, circularity emerges as the “next normal” linking business and sustainable development. For the BGMEA, circularity sits in the core of our values, mission and vision. Our goal is to help conserve the natural eco-system as much as possible via an economic shift from a linear to circular system, while generating greater social and economic values.’\n\nMark Draeck said, ‘With the support of the European Union and the government of Finland, UNIDO leads the global Switch to Circular Economy Value Chains project.’\n\n‘In Bangladesh, we support the circular transition for the textile and garments industry, by piloting innovative circular solutions in close cooperation with global and national industry leaders, which as BESTSELLER, H&M and their manufacturers,’ he said.\n\n‘These pilots will address acute challenges on new technology, business models and traceability, and will demonstrate the economic opportunity for circular approaches.\n\nThe project also collaborates with government partners, academia, and NGOs to create an enabling environment for circularity through policy and tailored capacity building,’ he added.",CC,"[{'narrative': 'Other', 'subnarrative': 'Other'}]",[42],"[[Collaborative, plans, ,, innovation, keys, to, circular, RMG, industry, :, stakeholders, Stakeholders, at, a, programme, on, Sunday, said, that, collaborative, strategies, and, innovation, were, keys, to, unlocking, Bangladesh, ’, s, circular, apparel, industry, for, availing, a, significant, opportunity, for, the, country, to, reduce, its, environmental, impact, ,, improve, its, economic, performance, and, create, social, benefits, .], [The, readymade, garment, sector, in, Bangladesh, is, under, a, transformative, journey, towards, circularity, ,, but, embracing, circularity, also, poses, certain, challenges, that, must, be, collectively, considered, and, resolved, ,, they, said, .], [A, panel, of, industry, leaders, ,, policymakers, and, experts, in, circular, economies, made, the, comments, at, an, event, titled, ‘, Switch, to, Upstream, Circularity, Dialogue, :, Pre-consumer, Textile, Waste, in, Bangladesh, ’, at, the, Amari, Hotel, in, Dhaka, on, Sunday, .], [The, event, was, organised, under, the, Switch, to, Circular, Economy, Value, Chains, project, (, SWITCH2CE, ), ,, co-funded, by, the, European, Union, and, the, government, of, Finland, .], [Parliamentary, standing, committee, on, the, ministry, of, environment, ,, forest, and, climate, change, chairman, Saber, Hossain, Chowdhury, was, present, in, the, opening, session, as, chief, guest, .], [Bangladesh, Garment, Manufacturers, and, Exporters, Association, president, Faruque, Hassan, spoke, at, the, opening, session, as, special, guest, .], [Deputy, head, of, European, Union, to, Bangladesh, delegation, mission, Bernd, Spanier, ,, SWITCH2CE, chief, technical, adviser, Mark, Draeck, ,, among, others, ,, also, contributed, to, the, event, ’, s, opening, session, .], [Hilde, van, Duijn, ,, head, of, Global, Value, Chains, ,, Circle, Economy, ,, also, participated, in, the, event, and, made, circular, game, demonstration, .], [Faruque, Hassan, said, ,, ‘, Living, in, a, world, in, our, time, where, climate, is, most, threatened, ,, business, as, usual, is, no, more, an, option, .], [In, a, race, to, zero, emission, and, resource, decoupling, ,, circularity, emerges, as, the, “, next, normal, ”, linking, business, and, sustainable, development, .], [For, the, BGMEA, ,, circularity, sits, in, the, core, of, our, values, ,, mission, and, vision, .], [Our, goal, is, to, help, conserve, the, natural, eco-system, as, much, as, possible, via, an, economic, shift, from, a, linear, to, circular, system, ,, while, generating, greater, social, and, economic, values., ’, Mark, Draeck, said, ,, ‘, With, the, support, of, the, European, Union, and, the, government, of, Finland, ,, UNIDO, leads, the, global, Switch, to, Circular, Economy, Value, Chains, project., ’, ‘, In, Bangladesh, ,, we, support, the, circular, transition, for, the, textile, and, garments, industry, ,, by, piloting, innovative, circular, solutions, in, close, cooperation, with, global, and, national, industry, leaders, ,, which, as, BESTSELLER, ,, H, &, M, ...], [‘, These, pilots, will, address, acute, challenges, on, new, technology, ,, business, models, and, traceability, ,, and, will, demonstrate, the, economic, opportunity, for, circular, approaches, .], [The, project, also, collaborates, with, government, partners, ,, academia, ,, and, NGOs, to, create, an, enabling, environment, for, circularity, through, policy, and, tailored, capacity, building, ,, ’, he, added, .]]"
4,EN_UA_300041.txt,EN,"Russia intends to supply light ‘Mountain’ tanks, infantry fighting vehicles to India \n\n Russia’s state-owned arms company, Rosoboronexport, has declared its intention to participate in the bidding process for the supply of light tanks and the Future Infantry Combat Vehicle (FICV) to India.\n\nAlexander Mikheyev, the CEO of the Russian company announced the decision with the state-run media outlet TASS during the International Aerospace Exhibition at Dubai Airshow 2023.\n\nMikheyev outlined plans for collaboration with Indian partners, to introduce a light tank and an advanced infantry fighting vehicle as part of the Indian Ministry of Defense’s FICV project tender. Emphasizing alignment with the principles of the ‘Make in India’ program, he highlighted the intention to contribute to local manufacturing.\n\nIn his statements, Mikheyev underlined Russia’s comprehension of the Indian government’s aspirations, acknowledging India’s commitment to achieving technological sovereignty and promoting independent industrial development. The proposed plans reflect a collaborative approach that adheres to the principles of the ‘Make in India’ initiative.\n\nEarlier, Russia signed a contract to supply Igla-S hand-held anti-aircraft missiles to India and allow production of the Igla there under licence, the Russian state news agency TASS quoted a top arms export official as saying on Tuesday.\n\nThe Igla-S is a man-portable air defence system (MANPADS) that can be fired by an individual or crew to bring down an enemy aircraft.\n\nIndia is the world’s largest arms importer and Russia remains its largest supplier despite the damage to the reputation of its army and weaponry from the war in Ukraine, where Russia has suffered numerous setbacks at the hands of a smaller but highly motivated and Western-equipped military.\n\nAccording to the Stockholm International Peace Research Institute (SIPRI), Russia accounted for 45% of India’s arms imports between 2018 and 2022, with France providing 29% and the United States 11%.\n\nAnother Russian state news agency, RIA, quoted Mikheyev earlier as saying that “Rosoboronexport is working with Indian private and public enterprises to organise joint production of aviation weapons and integrate them into the existing aviation fleet in India”.\n\nNo details were provided about which Indian companies would be involved or when potential production would start.\n\nMikheyev said Rosoboronexport and Indian partners had provided the Indian Ministry of Defence with Su-30MKI fighter jets, tanks, armoured vehicles and shells. At the beginning of the year, India and Russia also started joint production of AK-203 Kalashnikov assault rifles.\n\n",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]",[32],"[[Russia, intends, to, supply, light, ‘, Mountain, ’, tanks, ,, infantry, fighting, vehicles, to, India, Russia, ’, s, state-owned, arms, company, ,, Rosoboronexport, ,, has, declared, its, intention, to, participate, in, the, bidding, process, for, the, supply, of, light, tanks, and, the, Future, Infantry, Combat, Vehicle, (, FICV, ), to, India, .], [Alexander, Mikheyev, ,, the, CEO, of, the, Russian, company, announced, the, decision, with, the, state-run, media, outlet, TASS, during, the, International, Aerospace, Exhibition, at, Dubai, Airshow, 2023, .], [Mikheyev, outlined, plans, for, collaboration, with, Indian, partners, ,, to, introduce, a, light, tank, and, an, advanced, infantry, fighting, vehicle, as, part, of, the, Indian, Ministry, of, Defense, ’, s, FICV, project, tender, .], [Emphasizing, alignment, with, the, principles, of, the, ‘, Make, in, India, ’, program, ,, he, highlighted, the, intention, to, contribute, to, local, manufacturing, .], [In, his, statements, ,, Mikheyev, underlined, Russia, ’, s, comprehension, of, the, Indian, government, ’, s, aspirations, ,, acknowledging, India, ’, s, commitment, to, achieving, technological, sovereignty, and, promoting, independent, industrial, development, .], [The, proposed, plans, reflect, a, collaborative, approach, that, adheres, to, the, principles, of, the, ‘, Make, in, India, ’, initiative, .], [Earlier, ,, Russia, signed, a, contract, to, supply, Igla-S, hand-held, anti-aircraft, missiles, to, India, and, allow, production, of, the, Igla, there, under, licence, ,, the, Russian, state, news, agency, TASS, quoted, a, top, arms, export, official, as, saying, on, Tuesday, .], [The, Igla-S, is, a, man-portable, air, defence, system, (, MANPADS, ), that, can, be, fired, by, an, individual, or, crew, to, bring, down, an, enemy, aircraft, .], [India, is, the, world, ’, s, largest, arms, importer, and, Russia, remains, its, largest, supplier, despite, the, damage, to, the, reputation, of, its, army, and, weaponry, from, the, war, in, Ukraine, ,, where, Russia, has, suffered, numerous, setbacks, at, the, hands, of, a, smaller, but, highly, motivated, and, Western-equipped, military, .], [According, to, the, Stockholm, International, Peace, Research, Institute, (, SIPRI, ), ,, Russia, accounted, for, 45, %, of, India, ’, s, arms, imports, between, 2018, and, 2022, ,, with, France, providing, 29, %, and, the, United, States, 11, %, .], [Another, Russian, state, news, agency, ,, RIA, ,, quoted, Mikheyev, earlier, as, saying, that, “, Rosoboronexport, is, working, with, Indian, private, and, public, enterprises, to, organise, joint, production, of, aviation, weapons, and, integrate, them, into, the, existing, aviation, fleet, in, India, ”, .], [No, details, were, provided, about, which, Indian, companies, would, be, involved, or, when, potential, production, would, start, .], [Mikheyev, said, Rosoboronexport, and, Indian, partners, had, provided, the, Indian, Ministry, of, Defence, with, Su-30MKI, fighter, jets, ,, tanks, ,, armoured, vehicles, and, shells, .], [At, the, beginning, of, the, year, ,, India, and, Russia, also, started, joint, production, of, AK-203, Kalashnikov, assault, rifles, .]]"


<a class="anchor" id="find-sentences-unusual-length"></a>
To uncover potential errors, let us check for and handle sentences of unusual length.

In [19]:
# Function to find sentences of unusual length
def find_unusual_length_sentences(df, min_length=3, max_length=130):
    unusual_sentences = []
    for i, row in df.iterrows():
        for j, sentence in enumerate(row["tokens"]):
            if len(sentence) < min_length or len(sentence) > max_length:
                # also store the previous and next sentences for context
                prev_sentence = row["tokens"][j - 1] if j > 0 else None
                next_sentence = (
                    row["tokens"][j + 1] if j < len(row["tokens"]) - 1 else None
                )
                unusual_sentences.append(
                    {
                        "row_index": i,  # for later handling of the unusual sentences
                        "sentence_index": j,  # for later handling of the unusual sentences
                        "sentence": sentence,
                        "previous": prev_sentence,
                        "next": next_sentence,
                    }
                )
    return unusual_sentences


# Find sentences with less than 3 words or more than 130 words
unusual_sentences = find_unusual_length_sentences(df_short)

print(f"There are {len(unusual_sentences)} sentences of unusual length.")

# Display the unusual sentences
for entry in unusual_sentences:
    print(
        f"Sentence length: {len(entry["sentence"])}, Sentence: {' '.join(entry["sentence"])}"
    )

There are 379 sentences of unusual length.
Sentence length: 136, Sentence: “ The Complaint alleged that several of the Vietnamese orphans brought to the United States under Operation Babylift stated they are not orphans and that they wish to return to Vietnam. ” A statement issued on April 4 , 1975 , by “ professors of ethics and religion , ” pointed out that many “ of the children are not orphans ; their parents or relatives may still be alive , although displaced , in Vietnam… The Vietnamese children should be allowed to stay in Vietnam where they belong. ” The operation was celebrated by the corporate media and “ Hollywood ’ s celebrity elite… [ and , as a propaganda event ] generated a spectacle of celebration and emphasized that the babies were more than just average orphans , ” writes US History Scene .
Sentence length: 2, Sentence: Tags :
Sentence length: 2, Sentence: Sure .
Sentence length: 2, Sentence: Coincidence ?
Sentence length: 2, Sentence: Gov .
Sentence length: 2, Sente

We find multiple very short and very large sentences.

<a class="anchor" id="handle-short-sentences"></a>
#### Handling very short sentences.

The sentences of size 1 all consist of non-meaningful characters. Therefore we can drop them directly.

In [21]:
# Function to drop unusual sentences of a specific length from the DataFrame
def drop_sentences_of_length(df, unusual_sentences, length):
    # Create a copy of the list to iterate over
    for entry in unusual_sentences[:]:
        if len(entry["sentence"]) == length:
            row_index = entry["row_index"]
            sentence_index = entry["sentence_index"]
            # Check if the sentence index is within the valid range
            if 0 <= sentence_index < len(df.at[row_index, "tokens"]):
                # Drop from DataFrame
                df.at[row_index, "tokens"].pop(sentence_index)
                # Drop from unusual_sentences list
                unusual_sentences.remove(entry)
                print("dropped: ", entry["sentence"])
    return df


# Drop sentences of length 1
print(
    f"There are {len(unusual_sentences)} sentences of unusual length before dropping sentences of length 1."
)
df_short = drop_sentences_of_length(df_short, unusual_sentences, length=1)
print(
    f"There are {len(unusual_sentences)} sentences of unusual length after dropping sentences of length 1."
)

There are 361 sentences of unusual length before dropping sentences of length 1.
There are 361 sentences of unusual length after dropping sentences of length 1.


The sentences of size 2 might make sense. Let's have a look at their context.

In [22]:
# Display sentences of length 2 with their preceding and following sentences
for entry in unusual_sentences:

    print(f"(Previous) {' '.join(entry['previous']) if entry['previous'] else 'None'}")
    print(
        f"(Idx {entry['row_index']}, {entry['sentence_index']}, ListIdx {unusual_sentences.index(entry)}) {' '.join(entry['sentence'])}"
    )
    if entry["next"]:
        print(f"(Next) {' '.join(entry['next'])}")
    print("-" * 50)

(Previous) “ The suit seeks to enjoin adoption proceedings until it has been ascertained either that the parents or appropriate relatives in Vietnam have consented to their adoption or that these parents or relatives can not be found , ” The Adoption History Project notes .
(Idx 40, 12, ListIdx 0) “ The Complaint alleged that several of the Vietnamese orphans brought to the United States under Operation Babylift stated they are not orphans and that they wish to return to Vietnam. ” A statement issued on April 4 , 1975 , by “ professors of ethics and religion , ” pointed out that many “ of the children are not orphans ; their parents or relatives may still be alive , although displaced , in Vietnam… The Vietnamese children should be allowed to stay in Vietnam where they belong. ” The operation was celebrated by the corporate media and “ Hollywood ’ s celebrity elite… [ and , as a propaganda event ] generated a spectacle of celebration and emphasized that the babies were more than just a

We find, that the sentences of size 2 are of different types.
 - Words belonging to the previous or following sentence, but are split by punctation errors (e.g. "Mild. Tonight: Rain slowly returns ...") -> merge manually 
 - Words at the end of an document (e.g. "Watch:") -> drop
 - Valid sentences (e.g. "Why?") -> valid, keep
 - Section numerations (e.g. "2. (paragraph)") -> valid, keep (might be helpful for the model to understand the text, since they give a structure)
 

Since the number of such sentences is managable, we can manually decide on each case, whether to keep, merge or drop it.

In [23]:
# function to merge specific sentences with either the previous or next sentence
def merge_specific_sentence(df, row_idx, sentence_idx, direction):

    # List of sentences of the specified row
    sentences = df.at[row_idx, "tokens"]

    # Merge based on the direction
    if direction == "previous":
        merged_sentence = sentences[sentence_idx - 1] + sentences[sentence_idx]
        sentences[sentence_idx - 1] = merged_sentence
        del sentences[sentence_idx]

    elif direction == "next":
        merged_sentence = sentences[sentence_idx] + sentences[sentence_idx + 1]
        sentences[sentence_idx] = merged_sentence
        del sentences[sentence_idx + 1]

    # Update the row in the dataframe
    df.at[row_idx, "tokens"] = sentences


print(
    f"There are {len(unusual_sentences)} sentences of unusual length before merging some sentences of length 2."
)

merge_specific_sentence(df_short, 52, 20, "next")
merge_specific_sentence(df_short, 62, 12, "next")
merge_specific_sentence(df_short, 78, 6, "previous")
merge_specific_sentence(df_short, 79, 0, "next")
merge_specific_sentence(df_short, 131, 3, "next")
merge_specific_sentence(df_short, 175, 32, "next")
merge_specific_sentence(df_short, 180, 4, "next")
merge_specific_sentence(df_short, 239, 20, "next")
merge_specific_sentence(df_short, 275, 14, "next")
merge_specific_sentence(df_short, 505, 1, "next")
merge_specific_sentence(df_short, 574, 17, "previous")
merge_specific_sentence(df_short, 591, 14, "previous")
merge_specific_sentence(df_short, 696, 18, "previous")
merge_specific_sentence(df_short, 721, 0, "next")
merge_specific_sentence(df_short, 744, 13, "next")
merge_specific_sentence(df_short, 744, 15, "next")
merge_specific_sentence(df_short, 763, 0, "next")
merge_specific_sentence(df_short, 763, 16, "next")
merge_specific_sentence(df_short, 763, 32, "next")
merge_specific_sentence(df_short, 779, 7, "previous")
merge_specific_sentence(df_short, 790, 28, "previous")
merge_specific_sentence(df_short, 791, 12, "previous")
merge_specific_sentence(df_short, 794, 8, "previous")
merge_specific_sentence(df_short, 826, 7, "previous")
merge_specific_sentence(df_short, 851, 7, "next")
merge_specific_sentence(df_short, 854, 4, "previous")
merge_specific_sentence(df_short, 854, 18, "previous")
merge_specific_sentence(df_short, 905, 4, "next")
merge_specific_sentence(df_short, 919, 23, "next")


# drop them from the list of unusual sentences
unusual_sentences = [
    i
    for j, i in enumerate(unusual_sentences)
    if j
    not in [
        4,
        5,
        9,
        10,
        11,
        16,
        17,
        23,
        25,
        49,
        54,
        58,
        63,
        68,
        74,
        75,
        77,
        78,
        80,
        85,
        86,
        87,
        88,
        89,
        90,
        91,
        93,
        96,
        98,
    ]
]

print(
    f"There are {len(unusual_sentences)} sentences of unusual length after merging some sentences of length 2."
)

There are 361 sentences of unusual length before merging some sentences of length 2.
There are 332 sentences of unusual length after merging some sentences of length 2.


In [24]:
print(f"Row 763 tokens: {df_short.at[763, 'tokens']}")
print(f"Number of sentences in row 763: {len(df_short.at[763, 'tokens'])}")

Row 763 tokens: [['Проф', '.', 'Чуков', ':', 'Три', 'империи', 'си', 'приличат', ',', 'а', ',', 'ние', ',', 'българите', ',', 'много', 'силно', 'трябва', 'да', 'се', 'замислим', 'Най-големият', 'враг', 'на', 'Русия', 'и', 'Путин', 'в', 'момента', 'е', 'Полша', 'Отношенията', 'между', 'Иран', 'и', 'Русия', 'са', 'стратегически', '.'], ['Тези', 'две', 'държави', 'ужасно', 'си', 'приличат', '.'], ['Най-големият', 'враг', 'на', 'Русия', 'и', 'Путин', 'в', 'момента', 'е', 'Полша', '.'], ['Три', 'империи', 'си', 'приличат', ',', 'независимо', 'от', 'външнополитическата', 'им', 'доктрина', ',', 'българите', 'много', 'силно', 'трябва', 'да', 'се', 'замислим', '.'], ['Това', 'коментира', 'в', 'предаването', '``', 'България', ',', 'Европа', 'и', 'светът', 'на', 'фокус', "''", 'на', 'Радио', '``', 'Фокус', "''", 'проф', '.'], ['Владимир', 'Чуков', '.'], ['Той', 'припомни', 'думи', 'на', 'Хасан', 'Рохани', 'от', 'декември', '2015', 'година', ',', 'когато', 'Путин', 'е', 'бил', 'на', 'посещение', '

In [25]:
# function to drop specific sentences from the dataframe
def drop_sentence_by_indices(df, row_idx, sentence_idx):

    # copy list of sentences of the specified row
    sentences = df.at[row_idx, "tokens"]

    if 0 <= sentence_idx < len(sentences):
        # drop sentence
        del sentences[sentence_idx]
        # update the dataframe
        df.at[row_idx, "tokens"] = sentences

    else:
        print(f"Invalid sentence index {sentence_idx} for row {row_idx}")


print(
    f"There are {len(unusual_sentences)} sentences of unusual length before dropping some sentences of length 2."
)

drop_sentence_by_indices(df_short, 49, 17)
drop_sentence_by_indices(df_short, 72, 21)
drop_sentence_by_indices(df_short, 201, 1)
drop_sentence_by_indices(df_short, 259, 10)
drop_sentence_by_indices(df_short, 323, 23)
drop_sentence_by_indices(df_short, 362, 19)
drop_sentence_by_indices(df_short, 381, 18)
drop_sentence_by_indices(df_short, 438, 1)
drop_sentence_by_indices(df_short, 456, 1)
drop_sentence_by_indices(df_short, 462, 13)
drop_sentence_by_indices(df_short, 499, 8)
drop_sentence_by_indices(df_short, 503, 5)
drop_sentence_by_indices(df_short, 577, 18)
drop_sentence_by_indices(df_short, 581, 30)
drop_sentence_by_indices(df_short, 603, 6)
drop_sentence_by_indices(df_short, 635, 11)
drop_sentence_by_indices(df_short, 660, 5)
drop_sentence_by_indices(df_short, 702, 23)
drop_sentence_by_indices(df_short, 702, 23)
drop_sentence_by_indices(df_short, 744, 12)
drop_sentence_by_indices(df_short, 790, 28)


# drop them from the list of unusual sentences
unusual_sentences = [
    i
    for j, i in enumerate(unusual_sentences)
    if j
    not in [
        2,
        8,
        20,
        24,
        32,
        37,
        40,
        43,
        45,
        46,
        47,
        48,
        55,
        56,
        59,
        61,
        62,
        66,
        67,
        73,
        86,
    ]
]

print(
    f"There are {len(unusual_sentences)} sentences of unusual length after dropping some sentences of length 2."
)

There are 332 sentences of unusual length before dropping some sentences of length 2.
Invalid sentence index 28 for row 790
There are 311 sentences of unusual length after dropping some sentences of length 2.


#### Handling very long sentences <a class="anchor" id="handle-long-sentences"></a>

There are several very long sentences. Looking at the data, we see that some of them are infact correctly split and complete single sentences. Others, however, are in fact multiple sentences stored as one, because the splitting did not work correctly. Let's manually split them.

In [26]:
# Display sentences of length 2 with their preceding and following sentences
for entry in unusual_sentences:

    print(f"(Previous) {' '.join(entry['previous']) if entry['previous'] else 'None'}")
    print(
        f"(Idx {entry['row_index']}, {entry['sentence_index']}, ListIdx {unusual_sentences.index(entry)}) {' '.join(entry['sentence'])}"
    )
    if entry["next"]:
        print(f"(Next) {' '.join(entry['next'])}")
    print("-" * 50)

(Previous) “ The suit seeks to enjoin adoption proceedings until it has been ascertained either that the parents or appropriate relatives in Vietnam have consented to their adoption or that these parents or relatives can not be found , ” The Adoption History Project notes .
(Idx 40, 12, ListIdx 0) “ The Complaint alleged that several of the Vietnamese orphans brought to the United States under Operation Babylift stated they are not orphans and that they wish to return to Vietnam. ” A statement issued on April 4 , 1975 , by “ professors of ethics and religion , ” pointed out that many “ of the children are not orphans ; their parents or relatives may still be alive , although displaced , in Vietnam… The Vietnamese children should be allowed to stay in Vietnam where they belong. ” The operation was celebrated by the corporate media and “ Hollywood ’ s celebrity elite… [ and , as a propaganda event ] generated a spectacle of celebration and emphasized that the babies were more than just a

In [27]:
# function to manually replace a unsplitted sequence of sentences with the manually correctly splitted sentences
def manually_split_sentence(idx_row, idx_sentence, splitted_sentence):
    df_short.at[idx_row, "tokens"] = (
        df_short.at[idx_row, "tokens"][:idx_sentence]
        + splitted_sentence
        + df_short.at[6, "tokens"][idx_sentence + 1 :]
    )


print(
    f"There are {len(unusual_sentences)} sentences of unusual length before manually splitting some too long sentences."
)

manually_split_sentence(
    213,
    6,
    [
        [
            "Jens",
            "Stoltenberg",
            "(",
            "pictured",
            ")",
            ",",
            "the",
            "13th",
            "secretary",
            "general",
            "of",
            "NATO",
            ",",
            "revealed",
            "there",
            "were",
            "live",
            "discussions",
            "among",
            "members",
            "about",
            "removing",
            "missiles",
            "from",
            "storage",
            "and",
            "putting",
            "them",
            "on",
            "standby",
            ".",
        ],
        [
            "A",
            "Netherlands",
            "'",
            "Air",
            "Force",
            "F-16",
            "jetfighter",
            "takes",
            "part",
            "in",
            "the",
            "NATO",
            "exercise",
            "as",
            "part",
            "of",
            "the",
            "NATO",
            "Air",
            "Policing",
            "mission",
            ".",
        ],
        [
            "The",
            "head",
            "of",
            "Kyiv",
            "'s",
            "national",
            "security",
            "council",
            "said",
            "Putin",
            "could",
            "demand",
            "a",
            "tactical",
            "nuclear",
            "weapon",
            "be",
            "used",
            "if",
            "Russia",
            "'s",
            "army",
            "is",
            "beaten",
            "in",
            "Ukraine",
            ".",
        ],
        [
            "Russian",
            "soldiers",
            "load",
            "a",
            "Iskander-M",
            "short-range",
            "ballistic",
            "missile",
            "launcher",
            "at",
            "a",
            "firing",
            "position",
            "as",
            "part",
            "of",
            "a",
            "Russian",
            "military",
            "drill",
            "intended",
            "to",
            "train",
            "the",
            "troops",
            "in",
            "using",
            "tactical",
            "nuclear",
            "weapons",
            ".",
        ],
        [
            "Meanwhile",
            ",",
            "Mr",
            "Stoltenberg",
            "warned",
            "in",
            "Brussels",
            "of",
            "the",
            "threat",
            "from",
            "China",
            ",",
            "adding",
            "that",
            "nuclear",
            "transparency",
            "should",
            "form",
            "the",
            "basis",
            "of",
            "NATO",
            "'s",
            "nuclear",
            "strategy",
            "to",
            "prepare",
            "the",
            "alliance",
            "for",
            "the",
            "dangers",
            "of",
            "the",
            "world",
            ".",
        ],
    ],
)

manually_split_sentence(
    40,
    12,
    [
        [
            "“",
            "The",
            "Complaint",
            "alleged",
            "that",
            "several",
            "of",
            "the",
            "Vietnamese",
            "orphans",
            "brought",
            "to",
            "the",
            "United",
            "States",
            "under",
            "Operation",
            "Babylift",
            "stated",
            "they",
            "are",
            "not",
            "orphans",
            "and",
            "that",
            "they",
            "wish",
            "to",
            "return",
            "to",
            "Vietnam",
            ".",
            "”",
        ],
        [
            "A",
            "statement",
            "issued",
            "on",
            "April",
            "4",
            ",",
            "1975",
            ",",
            "by",
            "“",
            "professors",
            "of",
            "ethics",
            "and",
            "religion",
            ",",
            "”",
            "pointed",
            "out",
            "that",
            "many",
            "“",
            "of",
            "the",
            "children",
            "are",
            "not",
            "orphans",
            ";",
            "their",
            "parents",
            "or",
            "relatives",
            "may",
            "still",
            "be",
            "alive",
            ",",
            "although",
            "displaced",
            ",",
            "in",
            "Vietnam",
            "…",
            "The",
            "Vietnamese",
            "children",
            "should",
            "be",
            "allowed",
            "to",
            "stay",
            "in",
            "Vietnam",
            "where",
            "they",
            "belong",
            ".",
            "”",
        ],
        [
            "The",
            "operation",
            "was",
            "celebrated",
            "by",
            "the",
            "corporate",
            "media",
            "and",
            "“",
            "Hollywood",
            "’",
            "s",
            "celebrity",
            "elite",
            "…",
            "[",
            "and",
            ",",
            "as",
            "a",
            "propaganda",
            "event",
            "]",
            "generated",
            "a",
            "spectacle",
            "of",
            "celebration",
            "and",
            "emphasized",
            "that",
            "the",
            "babies",
            "were",
            "more",
            "than",
            "just",
            "average",
            "orphans",
            ",",
            "”",
            "writes",
            "US",
            "History",
            "Scene",
            ".",
        ],
    ],
)

manually_split_sentence(
    66,
    8,
    [
        [
            "But",
            "he",
            "noted",
            "that",
            "“",
            "even",
            "as",
            "the",
            "Russians",
            "have",
            "gained",
            "territory",
            ",",
            "they",
            "do",
            "it",
            "at",
            "a",
            "pretty",
            "big",
            "cost",
            "in",
            "number",
            "of",
            "casualties",
            ",",
            "like",
            "in",
            "personnel",
            ",",
            "but",
            "also",
            "in",
            "number",
            "of",
            "pieces",
            "of",
            "equipment",
            "that",
            "are",
            "being",
            "taken",
            "out.",
            "”",
        ],
        [
            "Austin",
            "said",
            "in",
            "his",
            "remarks",
            "Tuesday",
            "that",
            "“",
            "Russia",
            "has",
            "paid",
            "a",
            "staggering",
            "cost",
            "for",
            "(",
            "President",
            "Vladimir",
            ")",
            "Putin",
            "’",
            "s",
            "imperial",
            "dreams",
            "”",
            ",",
            "using",
            "“",
            "up",
            "to",
            "$",
            "211",
            "billion",
            "to",
            "equip",
            ",",
            "deploy",
            ",",
            "maintain",
            ",",
            "and",
            "sustain",
            "its",
            "imperial",
            "aggression",
            "against",
            "Ukraine.",
            "”",
        ],
        [
            "“",
            "At",
            "least",
            "315,000",
            "Russian",
            "troops",
            "have",
            "been",
            "killed",
            "or",
            "wounded",
            "”",
            "since",
            "Russia",
            "launched",
            "its",
            "all-out",
            "invasion",
            "of",
            "Ukraine",
            "in",
            "2022",
            ",",
            "Austin",
            "said",
            ".",
        ],
        [
            "Austin",
            "added",
            "that",
            "Ukraine",
            "has",
            "also",
            "“",
            "sunk",
            ",",
            "destroyed",
            ",",
            "or",
            "damaged",
            "some",
            "20",
            "medium-to-large",
            "Russian",
            "navy",
            "vessels.",
            "”",
        ],
        [
            "The",
            "sinkings",
            "have",
            "been",
            "an",
            "embarrassment",
            "for",
            "Moscow",
            "and",
            "Russian",
            "state",
            "media",
            "confirmed",
            "Tuesday",
            "that",
            "the",
            "country",
            "had",
            "replaced",
            "the",
            "head",
            "of",
            "its",
            "navy",
            ".",
        ],
    ],
)

manually_split_sentence(
    280,
    13,
    [
        [
            "Подробнее",
            "на",
            "РБК",
            ":",
            "https",
            ":",
            "//www.rbc.ru/politics/14/06/2023/6489e6f39a794778d61881b4",
            ".",
        ],
        [
            "The",
            "picture",
            "of",
            "widening",
            "war",
            "is",
            "beginning",
            "to",
            "form",
            ":",
            "Professor",
            "Sergey",
            "Karaganov",
            ",",
            "honorary",
            "chairman",
            "of",
            "Russia",
            "’",
            "s",
            "Council",
            "on",
            "Foreign",
            "and",
            "Defense",
            "Policy",
            ",",
            "and",
            "academic",
            "supervisor",
            "at",
            "the",
            "School",
            "of",
            "International",
            "Economics",
            "and",
            "Foreign",
            "Affairs",
            "Higher",
            "School",
            "of",
            "Economics",
            "(",
            "HSE",
            ")",
            "in",
            "Moscow",
            ".",
        ],
        [
            "Sergey",
            "Karaganov",
            ":",
            "By",
            "using",
            "its",
            "nuclear",
            "weapons",
            ",",
            "Russia",
            "could",
            "save",
            "humanity",
            "from",
            "a",
            "global",
            "catastrophe",
            ".",
        ],
        [
            "A",
            "tough",
            "but",
            "necessary",
            "decision",
            "would",
            "likely",
            "force",
            "the",
            "West",
            "to",
            "back",
            "off",
            ",",
            "enabling",
            "an",
            "earlier",
            "end",
            "to",
            "the",
            "Ukraine",
            "crisis",
            "and",
            "preventing",
            "it",
            "from",
            "expanding",
            "to",
            "other",
            "states",
            ".",
        ],
        [
            "Karaganov",
            "’",
            "s",
            "description",
            "of",
            "the",
            "Western",
            "World",
            "as",
            "“",
            "anti-human",
            "ideologies",
            ":",
            "the",
            "denial",
            "of",
            "family",
            ",",
            "homeland",
            ",",
            "history",
            ",",
            "love",
            "between",
            "men",
            "and",
            "women",
            ",",
            "faith",
            ",",
            "service",
            "to",
            "higher",
            "ideals",
            ",",
            "everything",
            "that",
            "is",
            "human",
            ",",
            "”",
            "shows",
            "a",
            "rising",
            "realization",
            "that",
            "Russia",
            "sees",
            "itself",
            "confronted",
            "by",
            "a",
            "Satanic",
            "force",
            "that",
            "must",
            "be",
            "destroyed",
            ".",
        ],
    ],
)

manually_split_sentence(
    151,
    7,
    [
        [
            "At",
            "the",
            "same",
            "time",
            ",",
            "the",
            "official",
            "claimed",
            "that",
            "the",
            "danger",
            "of",
            "Kiev",
            "using",
            "a",
            "‘",
            "dirty",
            "bomb",
            "’",
            "remains",
            "“",
            "very",
            "high",
            ",",
            "”",
            "and",
            "that",
            "Ukraine",
            "“",
            "has",
            "the",
            "opportunity",
            "”",
            "and",
            "“",
            "has",
            "every",
            "reason",
            "to",
            "use",
            "it",
            ".",
        ],
        [
            "Earlier",
            "on",
            "Tuesday",
            ",",
            "in",
            "a",
            "letter",
            "to",
            "UN",
            "Secretary-General",
            "Antonio",
            "Guterres",
            ",",
            "the",
            "Russian",
            "mission",
            "’",
            "s",
            "head",
            ",",
            "Vassily",
            "Nebenzia",
            ",",
            "said",
            "that",
            "Moscow",
            "would",
            "consider",
            "the",
            "use",
            "of",
            "a",
            "‘",
            "dirty",
            "bomb",
            "’",
            "by",
            "Ukraine",
            "“",
            "an",
            "act",
            "of",
            "nuclear",
            "terrorism",
            ".",
            "”",
        ],
        [
            "Meanwhile",
            ",",
            "Ukrainian",
            "Foreign",
            "Minister",
            "Dmitry",
            "Kuleba",
            "earlier",
            "called",
            "the",
            "Russian",
            "allegations",
            "“",
            "as",
            "absurd",
            "as",
            "they",
            "are",
            "dangerous",
            ".",
            "”",
        ],
        [
            "He",
            "also",
            "noted",
            "that",
            "“",
            "Russians",
            "often",
            "accuse",
            "others",
            "of",
            "what",
            "they",
            "plan",
            "themselves",
            ".",
            "”",
        ],
        [
            "On",
            "Tuesday",
            ",",
            "the",
            "minister",
            "revealed",
            "that",
            "Ukraine",
            "had",
            "invited",
            "IAEA",
            "inspectors",
            "to",
            "come",
            "and",
            "to",
            "“",
            "prove",
            "that",
            "Ukraine",
            "has",
            "neither",
            "any",
            "dirty",
            "bombs",
            "nor",
            "plans",
            "to",
            "develop",
            "them",
            ".",
            "”",
        ],
        [
            "“",
            "Good",
            "cooperation",
            "with",
            "IAEA",
            "and",
            "partners",
            "allows",
            "us",
            "to",
            "foil",
            "Russia",
            "’",
            "s",
            "‘",
            "dirty",
            "bomb",
            "’",
            "disinfo",
            "campaign",
            ",",
            "”",
            "Kuleba",
            "said",
            ".",
        ],
    ],
)

manually_split_sentence(
    158,
    5,
    [
        [
            "WHO",
            "Tedros",
            "describes",
            "Disease",
            "X",
            "as",
            "a",
            "blueprint",
            "at",
            "a",
            "panel",
            "discussion",
            "at",
            "WEF24",
            "—",
            "Tamara",
            "Ugolini",
            "🇨🇦",
            "(",
            "@",
            "TamaraUgo",
            ")",
            "January",
            "17",
            ",",
            "2024",
            ".",
        ],
        [
            "He",
            "says",
            "that",
            "COVID",
            "was",
            "the",
            "first",
            "Disease",
            "X",
            "and",
            "we",
            "“",
            "need",
            "a",
            "placeholder",
            "for",
            "diseases",
            "we",
            "don",
            "’",
            "t",
            "know",
            ",",
            "”",
            "including",
            "dedication",
            "to",
            "private",
            "sector",
            "drug",
            "research",
            "and",
            "development",
            ".",
        ],
        [
            "Disease",
            "X",
            "serves",
            "as",
            "a",
            "“",
            "placeholder",
            "for",
            "the",
            "diseases",
            "we",
            "don",
            "’",
            "t",
            "know",
            ",",
            "”",
            "and",
            "it",
            "begins",
            "with",
            "private-sector",
            "research",
            "and",
            "development",
            "to",
            "test",
            "drugs",
            "and",
            "“",
            "other",
            "things",
            ".",
            "”",
        ],
        [
            "Tedros",
            "stressed",
            "that",
            "the",
            "next",
            "pandemic",
            "is",
            "“",
            "not",
            "a",
            "matter",
            "of",
            "if",
            ",",
            "but",
            "rather",
            "when",
            ",",
            "”",
            "while",
            "noting",
            "that",
            "COVID-19",
            "was",
            "the",
            "original",
            "Disease",
            "X",
            ",",
            "in",
            "which",
            "they",
            "were",
            "able",
            "to",
            "facilitate",
            "the",
            "Pandemic",
            "Fund",
            "in",
            "partnership",
            "with",
            "the",
            "World",
            "Bank",
            ".",
        ],
    ],
)

manually_split_sentence(
    226,
    3,
    [
        [
            "As",
            "things",
            "stand",
            "today",
            ",",
            "the",
            "United",
            "States",
            "has",
            "not",
            "adjusted",
            "our",
            "nuclear",
            "posture",
            ",",
            "but",
            "it",
            "is",
            "something",
            "that",
            "we",
            "monitor",
            "day",
            "by",
            "day",
            ",",
            "hour",
            "by",
            "hour",
            ",",
            "because",
            "it",
            "is",
            "a",
            "paramount",
            "priority",
            "to",
            "the",
            "president.",
        ],
        [
            "”",
            "When",
            "asked",
            "if",
            "the",
            "Biden",
            "administration",
            "was",
            "“",
            "concerned",
            "”",
            "about",
            "the",
            "situation",
            ",",
            "Sullivan",
            "said",
            ",",
            "“",
            "Anytime",
            "you",
            "have",
            "a",
            "nuclear",
            "power",
            "fighting",
            "in",
            "a",
            "conflict",
            "zone",
            "in",
            "Europe",
            "near",
            "NATO",
            "territory",
            ",",
            "of",
            "course",
            "we",
            "have",
            "to",
            "focus",
            "on",
            "and",
            "be",
            "concerned",
            "about",
            "the",
            "possibility",
            "of",
            "escalation",
            ",",
            "the",
            "risk",
            "of",
            "escalation.",
        ],
        [
            "”",
            "However",
            ",",
            "Sullivan",
            "again",
            "doubled",
            "down",
            "on",
            "his",
            "previous",
            "comment",
            "that",
            "officials",
            "“",
            "have",
            "not",
            "seen",
            "anything",
            "that",
            "would",
            "require",
            "us",
            "to",
            "change",
            "our",
            "nuclear",
            "posture",
            "at",
            "this",
            "time.",
        ],
        [
            "”",
            "Within",
            "days",
            "of",
            "invading",
            "Ukraine",
            ",",
            "Russian",
            "President",
            "Vladimir",
            "Putin",
            "ordered",
            "his",
            "nuclear",
            "forces",
            "on",
            "high",
            "alert",
            ",",
            "blaming",
            "the",
            "“",
            "hostile",
            "actions",
            "”",
            "and",
            "statements",
            "of",
            "Western",
            "nations",
            "and",
            "NATO",
            ".",
        ],
    ],
)

manually_split_sentence(
    363,
    13,
    [
        [
            "In",
            "fact",
            ",",
            "the",
            "pressure",
            "they",
            "placed",
            "on",
            "Kennedy",
            "was",
            "so",
            "intense",
            "that",
            "Robert",
            "Kennedy",
            ",",
            "the",
            "president",
            "’",
            "s",
            "brother",
            ",",
            "secretly",
            "told",
            "Soviet",
            "Ambassador",
            "Anatoly",
            "Dobrynin",
            "“",
            "If",
            "the",
            "situation",
            "continues",
            "much",
            "longer",
            ",",
            "the",
            "President",
            "is",
            "not",
            "sure",
            "that",
            "the",
            "military",
            "will",
            "not",
            "overthrow",
            "him",
            "and",
            "seize",
            "power.",
            "”",
        ],
        [
            "On",
            "October",
            "27",
            "—",
            "two",
            "days",
            "before",
            "the",
            "crisis",
            "was",
            "resolved",
            "—",
            "Soviet",
            "leader",
            "Nikita",
            "Khrushchev",
            "wrote",
            "a",
            "letter",
            "to",
            "Kennedy",
            "stating",
            "the",
            "following",
            ":",
            "But",
            "how",
            "are",
            "we",
            ",",
            "the",
            "Soviet",
            "Union",
            ",",
            "our",
            "Government",
            ",",
            "to",
            "assess",
            "your",
            "actions",
            ",",
            "which",
            "are",
            "expressed",
            "in",
            "the",
            "fact",
            "that",
            "you",
            "have",
            "surrounded",
            "the",
            "Soviet",
            "Union",
            "with",
            "military",
            "bases",
            ";",
            "surrounded",
            "our",
            "allies",
            "with",
            "military",
            "bases",
            ";",
            "placed",
            "military",
            "bases",
            "literally",
            "around",
            "our",
            "country",
            ";",
            "and",
            "stationed",
            "your",
            "missile",
            "armaments",
            "there",
            "?",
        ],
    ],
)

manually_split_sentence(
    451,
    2,
    [
        [
            "Для",
            "удобства",
            "жителей",
            "полуострова",
            "действуют",
            "следующие",
            "электронные",
            "каналы",
            "связи",
            "с",
            "ведомством",
            ":",
            "телефон",
            "дежурного",
            "по",
            "управлению",
            ":",
            "8",
            "(",
            "3652",
            ")",
            "799-723",
            ";",
            "телефон",
            "доверия",
            ":",
            "8",
            "(",
            "3652",
            ")",
            "799-721",
            ";",
            "телефонная",
            "линия",
            "``",
            "''",
            "Ребенок",
            "в",
            "опасности",
            "''",
            "''",
            ":",
            "123",
            ";",
            "8",
            "(",
            "3652",
            ")",
            "799-722",
            ";",
            "телефонная",
            "линия",
            "для",
            "приема",
            "сообщений",
            "о",
            "давлении",
            "на",
            "бизнес",
            ":",
            "(",
            "3652",
            ")",
            "500-750",
            ";",
            "+7-978-909-11-11",
            ";",
            "горячая",
            "линия",
            "для",
            "граждан",
            ",",
            "прибывающих",
            "в",
            "Российскую",
            "Федерацию",
            "из",
            "Донецкой",
            "и",
            "Луганской",
            "Народных",
            "Республик",
            "и",
            "территорий",
            "Украины",
            ":",
            "+7-978-936-80-74",
            ";",
            "горячая",
            "линия",
            "для",
            "участников",
            "СВО",
            ",",
            "мобилизованных",
            "военнослужащих",
            "и",
            "членов",
            "их",
            "семей",
            ":",
            "+7-978-936-80-74",
            "интернет-приемная",
            ":",
            "прием",
            "обращений",
            "посредством",
            "ведомственного",
            "телеграм-канала",
            "@",
            "sledcomcrimea",
            "(",
            "закрепленная",
            "кнопка",
            "в",
            "правом",
            "верхнем",
            "углу",
            "«",
            "Подать",
            "обращение",
            "»",
            "Для",
            "записи",
            "на",
            "личный",
            "прием",
            "к",
            "руководству",
            "ведомства",
            "необходимо",
            "обращаться",
            "по",
            "телефонам",
            ",",
            "размещенным",
            "по",
            "следующей",
            "ссылке",
            "https",
            ":",
            "//crim.sledcom.ru/references/Otdel-po-priemu-gr",
            "..",
        ],
        [
            "Прием",
            "граждан",
            ",",
            "работа",
            "с",
            "их",
            "обращениями",
            "и",
            "заявлениями",
            "всегда",
            "была",
            "и",
            "остается",
            "приоритетным",
            "направлением",
            "деятельности",
            "офицеров",
            "СК",
            "Крыма",
            "и",
            "Севастополя",
            ".",
            "''",
        ],
    ],
)

# drop them from the list of unusual sentences
unusual_sentences = [
    i
    for j, i in enumerate(unusual_sentences)
    if j
    not in [
        0,
        3,
        7,
        8,
        12,
        13,
        15,
        25,
        30,
    ]
]

print(
    f"There are {len(unusual_sentences)} sentences of unusual length after manually splitting some too long sentences."
)

There are 311 sentences of unusual length before manually splitting some too long sentences.
There are 302 sentences of unusual length after manually splitting some too long sentences.


#### Validating fixed unusual-length-sentences <a class="anchor" id="verify-remaining-unusual-sentences"></a>

We update the unusual sentences and print them. We find that all unusually short and long sentences that still occur, are valid and ment to be kept.

In [28]:
# Find sentences with less than 3 words or more than 100 words
unusual_sentences = find_unusual_length_sentences(df_short)

print(f"After handling, there are {len(unusual_sentences)} unusual sentences left.")

# Display the unusual sentences
for entry in unusual_sentences:
    print(
        f"Sentence length: {len(entry["sentence"])}, (Idx {entry['row_index']}, {entry['sentence_index']}), Sentence: {' '.join(entry["sentence"])}"
    )

After handling, there are 304 unusual sentences left.
Sentence length: 2, (Idx 47, 21), Sentence: Tags :
Sentence length: 2, (Idx 49, 25), Sentence: Coincidence ?
Sentence length: 2, (Idx 67, 1), Sentence: Nope .
Sentence length: 2, (Idx 143, 0), Sentence: BEWARE !
Sentence length: 155, (Idx 150, 7), Sentence: This could be a huge source of national security , economic vitality can build our industrial base in Florida. ” DeSantis ’ claim echoes Speaker of the House Kevin McCarthy ‘ s , who last Thursday said God has blessed America with large natural gas resources , and the United States should use them to make two of the world ’ s largest countries and largest economies , China and India , “ dependent on American natural gas. ” He suggested doing so would “ make us economically stronger but geopolitically the world safer , ” while falsely claiming it would make the world “ environmentally sound. ” The Florida governor bragged to Varney about Florida ’ s reduction in emissions , which 

Now that the text is correctly segmentated into sentences and words, we can proceed with text normalization.

### Text Normalization  <a class="anchor" id="text-normalization"></a>

  
Text normalization is the process of transforming text into a standard format, which typically involves:

- Converting text to lowercase
- Removing punctuation
- Removing stopwords
- Removing special characters and numbers
- Lemmatization or stemming

This process helps in reducing the complexity of the text and making it more uniform for further analysis or processing.

We will implement text normalization in the next steps.

In [32]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

print(torch.version.cuda)  # Shows CUDA version if available
print(torch.cuda.is_available())  # Checks if CUDA is available

Using device: cpu
None
False


Check if a NVIDIA Graphics card is installed (and the nessesary CUDA packages) to be used later for lemmatization because with just the CPU it was taking around 15min every time. 

In [49]:
def text_normalization(df, column_name):
    """
    Text normalization with optimized batch processing for nested lists of tokens
    with proper abbreviation handling
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    # Defining nlp piepelines specific for every language we are covering
    nlp_pipelines = {
        "EN": stanza.Pipeline(
            "en",
            processors="tokenize,lemma",
            device=device,
            use_gpu=True,
            batch_size=4096,
            tokenize_pretokenized=True,
            download_method=None,
        ),
        "BG": stanza.Pipeline(
            "bg",
            processors="tokenize,lemma",
            device=device,
            use_gpu=True,
            batch_size=4096,
            tokenize_pretokenized=True,
            download_method=None,
        ),
        "RU": stanza.Pipeline(
            "ru",
            processors="tokenize,lemma",
            device=device,
            use_gpu=True,
            batch_size=4096,
            tokenize_pretokenized=True,
            download_method=None,
        ),
        "PT": stanza.Pipeline(
            "pt",
            processors="tokenize,lemma",
            device=device,
            use_gpu=True,
            batch_size=4096,
            tokenize_pretokenized=True,
            download_method=None,
        ),
        "HI": stanza.Pipeline(
            "hi",
            processors="tokenize,lemma",
            device=device,
            use_gpu=True,
            batch_size=4096,
            tokenize_pretokenized=True,
            download_method=None,
        ),
    }

    # Depending on the graphics card the batch size can be adjusted

    stopwords_mapped = {
        "EN": set(stopwords.words("english")),
        # Bulgarian stopwords not available in nltk and were imported manually from https://github.com/stopwords-iso/stopwords-bg/blob/master/stopwords-bg.json
        "BG": set(
            [
                "а",
                "автентичен",
                "аз",
                "ако",
                "ала",
                "бе",
                "без",
                "беше",
                "би",
                "бивш",
                "бивша",
                "бившо",
                "бил",
                "била",
                "били",
                "било",
                "благодаря",
                "близо",
                "бъдат",
                "бъде",
                "бяха",
                "в",
                "вас",
                "ваш",
                "ваша",
                "вероятно",
                "вече",
                "взема",
                "ви",
                "вие",
                "винаги",
                "внимава",
                "време",
                "все",
                "всеки",
                "всички",
                "всичко",
                "всяка",
                "във",
                "въпреки",
                "върху",
                "г",
                "ги",
                "главен",
                "главна",
                "главно",
                "глас",
                "го",
                "година",
                "години",
                "годишен",
                "д",
                "да",
                "дали",
                "два",
                "двама",
                "двамата",
                "две",
                "двете",
                "ден",
                "днес",
                "дни",
                "до",
                "добра",
                "добре",
                "добро",
                "добър",
                "докато",
                "докога",
                "дори",
                "досега",
                "доста",
                "друг",
                "друга",
                "други",
                "е",
                "евтин",
                "едва",
                "един",
                "една",
                "еднаква",
                "еднакви",
                "еднакъв",
                "едно",
                "екип",
                "ето",
                "живот",
                "за",
                "забавям",
                "зад",
                "заедно",
                "заради",
                "засега",
                "заспал",
                "затова",
                "защо",
                "защото",
                "и",
                "из",
                "или",
                "им",
                "има",
                "имат",
                "иска",
                "й",
                "каза",
                "как",
                "каква",
                "какво",
                "както",
                "какъв",
                "като",
                "кога",
                "когато",
                "което",
                "които",
                "кой",
                "който",
                "колко",
                "която",
                "къде",
                "където",
                "към",
                "лесен",
                "лесно",
                "ли",
                "лош",
                "м",
                "май",
                "малко",
                "ме",
                "между",
                "мек",
                "мен",
                "месец",
                "ми",
                "много",
                "мнозина",
                "мога",
                "могат",
                "може",
                "мокър",
                "моля",
                "момента",
                "му",
                "н",
                "на",
                "над",
                "назад",
                "най",
                "направи",
                "напред",
                "например",
                "нас",
                "не",
                "него",
                "нещо",
                "нея",
                "ни",
                "ние",
                "никой",
                "нито",
                "нищо",
                "но",
                "нов",
                "нова",
                "нови",
                "новина",
                "някои",
                "някой",
                "няколко",
                "няма",
                "обаче",
                "около",
                "освен",
                "особено",
                "от",
                "отгоре",
                "отново",
                "още",
                "пак",
                "по",
                "повече",
                "повечето",
                "под",
                "поне",
                "поради",
                "после",
                "почти",
                "прави",
                "пред",
                "преди",
                "през",
                "при",
                "пък",
                "първата",
                "първи",
                "първо",
                "пъти",
                "равен",
                "равна",
                "с",
                "са",
                "сам",
                "само",
                "се",
                "сега",
                "си",
                "син",
                "скоро",
                "след",
                "следващ",
                "сме",
                "смях",
                "според",
                "сред",
                "срещу",
                "сте",
                "съм",
                "със",
                "също",
                "т",
                "т.н.",
                "тази",
                "така",
                "такива",
                "такъв",
                "там",
                "твой",
                "те",
                "тези",
                "ти",
                "то",
                "това",
                "тогава",
                "този",
                "той",
                "толкова",
                "точно",
                "три",
                "трябва",
                "тук",
                "тъй",
                "тя",
                "тях",
                "у",
                "утре",
                "харесва",
                "хиляди",
                "ч",
                "часа",
                "че",
                "често",
                "чрез",
                "ще",
                "щом",
                "юмрук",
                "я",
                "як",
            ]
        ),
        "RU": set(stopwords.words("russian")),
        "PT": set(stopwords.words("portuguese")),
        "HI": set(
            [
                "अंदर",
                "अत",
                "अदि",
                "अप",
                "अपना",
                "अपनि",
                "अपनी",
                "अपने",
                "अभि",
                "अभी",
                "आदि",
                "आप",
                "इंहिं",
                "इंहें",
                "इंहों",
                "इतयादि",
                "इत्यादि",
                "इन",
                "इनका",
                "इन्हीं",
                "इन्हें",
                "इन्हों",
                "इस",
                "इसका",
                "इसकि",
                "इसकी",
                "इसके",
                "इसमें",
                "इसि",
                "इसी",
                "इसे",
                "उंहिं",
                "उंहें",
                "उंहों",
                "उन",
                "उनका",
                "उनकि",
                "उनकी",
                "उनके",
                "उनको",
                "उन्हीं",
                "उन्हें",
                "उन्हों",
                "उस",
                "उसके",
                "उसि",
                "उसी",
                "उसे",
                "एक",
                "एवं",
                "एस",
                "एसे",
                "ऐसे",
                "ओर",
                "और",
                "कइ",
                "कई",
                "कर",
                "करता",
                "करते",
                "करना",
                "करने",
                "करें",
                "कहते",
                "कहा",
                "का",
                "काफि",
                "काफ़ी",
                "कि",
                "किंहें",
                "किंहों",
                "कितना",
                "किन्हें",
                "किन्हों",
                "किया",
                "किर",
                "किस",
                "किसि",
                "किसी",
                "किसे",
                "की",
                "कुछ",
                "कुल",
                "के",
                "को",
                "कोइ",
                "कोई",
                "कोन",
                "कोनसा",
                "कौन",
                "कौनसा",
                "गया",
                "घर",
                "जब",
                "जहाँ",
                "जहां",
                "जा",
                "जिंहें",
                "जिंहों",
                "जितना",
                "जिधर",
                "जिन",
                "जिन्हें",
                "जिन्हों",
                "जिस",
                "जिसे",
                "जीधर",
                "जेसा",
                "जेसे",
                "जैसा",
                "जैसे",
                "जो",
                "तक",
                "तब",
                "तरह",
                "तिंहें",
                "तिंहों",
                "तिन",
                "तिन्हें",
                "तिन्हों",
                "तिस",
                "तिसे",
                "तो",
                "था",
                "थि",
                "थी",
                "थे",
                "दबारा",
                "दवारा",
                "दिया",
                "दुसरा",
                "दुसरे",
                "दूसरे",
                "दो",
                "द्वारा",
                "न",
                "नहिं",
                "नहीं",
                "ना",
                "निचे",
                "निहायत",
                "नीचे",
                "ने",
                "पर",
                "पहले",
                "पुरा",
                "पूरा",
                "पे",
                "फिर",
                "बनि",
                "बनी",
                "बहि",
                "बही",
                "बहुत",
                "बाद",
                "बाला",
                "बिलकुल",
                "भि",
                "भितर",
                "भी",
                "भीतर",
                "मगर",
                "मानो",
                "मे",
                "में",
                "यदि",
                "यह",
                "यहाँ",
                "यहां",
                "यहि",
                "यही",
                "या",
                "यिह",
                "ये",
                "रखें",
                "रवासा",
                "रहा",
                "रहे",
                "ऱ्वासा",
                "लिए",
                "लिये",
                "लेकिन",
                "व",
                "वगेरह",
                "वरग",
                "वर्ग",
                "वह",
                "वहाँ",
                "वहां",
                "वहिं",
                "वहीं",
                "वाले",
                "वुह",
                "वे",
                "वग़ैरह",
                "संग",
                "सकता",
                "सकते",
                "सबसे",
                "सभि",
                "सभी",
                "साथ",
                "साबुत",
                "साभ",
                "सारा",
                "से",
                "सो",
                "हि",
                "ही",
                "हुअ",
                "हुआ",
                "हुइ",
                "हुई",
                "हुए",
                "हे",
                "हें",
                "है",
                "हैं",
                "हो",
                "होता",
                "होति",
                "होती",
                "होते",
                "होना",
                "होने",
            ]
        ),
    }

    # Common abbreviations and their normalized forms
    abbreviations = {
        "EN": {
            "p.m.": "pm",
            "a.m.": "am",
            "e.g.": "eg",
            "i.e.": "ie",
            "etc.": "etc",
            "vs.": "vs",
            "mr.": "mr",
            "mrs.": "mrs",
            "dr.": "dr",
            "prof.": "prof",
            "u.s.": "us",
            "u.k.": "uk",
            "n.y.": "ny",
            "l.a.": "la",
            "st.": "st",
            "inc.": "inc",
            "ltd.": "ltd",
            "co.": "co",
            "corp.": "corp",
            "avg.": "avg",
            "approx.": "approx",
        },
        "BG": {"т.е .": "то есть", "проф": "професор"},
        "RU": {"т.е .": "то есть"},
        "PT": {"sra.": "senhora", "sr.": "senhor"},
        "HI": {
            "सं॰": "संस्थान",
            "स्त्री॰": "स्त्रीलिंग",
            "विवि": "विश्वविद्यालय",
            "यौ॰": "यौवन",
            "फ़ा॰": "फ़ारसी",
            "पुर्त॰": "पुर्तगाली",
            "पु॰": "पुल्लिंग",
            "प्रा॰": "प्राचार्य",
        },
    }

    def normalize_token(token, lang):
        """Normalize a single token"""
        if not isinstance(token, str):
            return ""

        # Convert to lowercase first
        token = token.lower().strip()

        # Check if it's an abbreviation
        if token in abbreviations.get(lang, {}):
            return abbreviations[lang][token]

        # Remove special characters and numbers for non-abbreviations
        token = re.sub(r"[^a-zа-яёА-ЯЁअ-हःं०-९]", "", token, flags=re.IGNORECASE)

        return token

    def clean_text(nested_tokens, lang):
        """Clean and preprocess nested list of tokens for specific lang"""
        if not isinstance(nested_tokens, list):
            return []

        cleaned_tokens = []
        stop_words = stopwords_mapped.get(lang, set())
        for sentence in nested_tokens:
            if isinstance(sentence, list):
                for token in sentence:
                    # Normalize the token
                    normalized = normalize_token(token, lang)
                    # Check if token is not empty and not a stopword
                    if normalized and normalized not in stop_words:
                        cleaned_tokens.append(normalized)

        return cleaned_tokens

    def process_text(tokens, lang):
        """Process a single text through Stanza"""
        try:
            if not isinstance(tokens, list) or not all(
                isinstance(t, str) for t in tokens
            ):
                print("Error: Tokens must be a list of strings.")
                return []

            if lang not in nlp_pipelines:
                print(f"Error: No pipeline initialized for language '{lang}'")
                return []

            if not tokens:
                return []
            # Join tokens into a single string for processing
            text = " ".join(tokens)
            nlp = nlp_pipelines[lang]
            doc = nlp(text)
            # Extract lemmas and filter stopwords
            stop_words = stopwords_mapped.get(lang, set())
            lemmas = []
            for sent in doc.sentences:
                for word in sent.words:
                    lemma = word.lemma.lower()
                    # Check if the original token was an abbreviation
                    if lemma not in stop_words:
                        lemmas.append(lemma)
            return lemmas
        except Exception as e:
            print(f"Error processing text: {str(e)}")
            return []

    def process_batch(batch_tokens, lang):
        """Process a batch of nested token lists"""
        results = []
        for tokens in batch_tokens:
            # Clean and flatten tokens
            cleaned_tokens = clean_text(tokens, lang)
            # Process cleaned tokens
            normalized = process_text(cleaned_tokens, lang)
            # Ensure all tokens are properly normalized
            normalized = [
                normalize_token(token, lang)
                for token in normalized
                if normalize_token(token, lang)
            ]
            results.append(normalized)

        return results

    # Process in batches
    batch_size = 50
    normalized_tokens = []
    total_batches = (len(df) + batch_size - 1) // batch_size

    print(f"Starting processing of {len(df)} rows in {total_batches} batches")

    for i in tqdm(range(0, len(df), batch_size), desc="Normalizing text"):
        batch_df = df.iloc[i : i + batch_size]
        batch_tokens = batch_df[column_name].tolist()
        batch_languages = batch_df["language"].tolist()

        for tokens, lang in zip(batch_tokens, batch_languages):
            # Print sample of first batch for debugging

            # Cleaning tokens
            cleaned_tokens = clean_text(tokens, lang)
            normalized = process_text(cleaned_tokens, lang)
            normalized_tokens.append(normalized)

    print(f"\nProcessing completed. Total normalized entries: {len(normalized_tokens)}")
    print(
        f"Non-empty normalized entries: {sum(1 for tokens in normalized_tokens if tokens)}"
    )

    # Final check to ensure no punctuation or special characters remain
    final_tokens = []
    for tokens in normalized_tokens:
        cleaned = [
            token
            for token in tokens
            if token and not any(char in string.punctuation for char in token)
        ]
        final_tokens.append(cleaned)

    # Update DataFrame with normalized tokens
    df_normalized = df.copy()
    df_normalized[f"{column_name}_normalized"] = final_tokens

    return df_normalized

## Verifying of text nomalization  <a class="anchor" id="check-normalization"></a>

In [56]:
def check_text_normalization(df):
    """
    Validates if text normalization has been applied correctly

    Parameters:
    df (pandas.DataFrame): DataFrame containing original and normalized tokens

    Returns:
    dict: Validation results with detailed statistics and examples of any issues found
    """
    results = {
        "overall_status": "PASS",
        "tests": {},
        "statistics": {},
        "issues_found": {},
        "sample_issues": {},
    }

    stopwords_map = {
        "EN": set(stopwords.words("english")),
        "BG": set(
            [
                "а",
                "автентичен",
                "аз",
                "ако",
                "ала",
                "бе",
                "без",
                "беше",
                "би",
                "бивш",
                "бивша",
                "бившо",
                "бил",
                "била",
                "били",
                "било",
                "благодаря",
                "близо",
                "бъдат",
                "бъде",
                "бяха",
                "в",
                "вас",
                "ваш",
                "ваша",
                "вероятно",
                "вече",
                "взема",
                "ви",
                "вие",
                "винаги",
                "внимава",
                "време",
                "все",
                "всеки",
                "всички",
                "всичко",
                "всяка",
                "във",
                "въпреки",
                "върху",
                "г",
                "ги",
                "главен",
                "главна",
                "главно",
                "глас",
                "го",
                "година",
                "години",
                "годишен",
                "д",
                "да",
                "дали",
                "два",
                "двама",
                "двамата",
                "две",
                "двете",
                "ден",
                "днес",
                "дни",
                "до",
                "добра",
                "добре",
                "добро",
                "добър",
                "докато",
                "докога",
                "дори",
                "досега",
                "доста",
                "друг",
                "друга",
                "други",
                "е",
                "евтин",
                "едва",
                "един",
                "една",
                "еднаква",
                "еднакви",
                "еднакъв",
                "едно",
                "екип",
                "ето",
                "живот",
                "за",
                "забавям",
                "зад",
                "заедно",
                "заради",
                "засега",
                "заспал",
                "затова",
                "защо",
                "защото",
                "и",
                "из",
                "или",
                "им",
                "има",
                "имат",
                "иска",
                "й",
                "каза",
                "как",
                "каква",
                "какво",
                "както",
                "какъв",
                "като",
                "кога",
                "когато",
                "което",
                "които",
                "кой",
                "който",
                "колко",
                "която",
                "къде",
                "където",
                "към",
                "лесен",
                "лесно",
                "ли",
                "лош",
                "м",
                "май",
                "малко",
                "ме",
                "между",
                "мек",
                "мен",
                "месец",
                "ми",
                "много",
                "мнозина",
                "мога",
                "могат",
                "може",
                "мокър",
                "моля",
                "момента",
                "му",
                "н",
                "на",
                "над",
                "назад",
                "най",
                "направи",
                "напред",
                "например",
                "нас",
                "не",
                "него",
                "нещо",
                "нея",
                "ни",
                "ние",
                "никой",
                "нито",
                "нищо",
                "но",
                "нов",
                "нова",
                "нови",
                "новина",
                "някои",
                "някой",
                "няколко",
                "няма",
                "обаче",
                "около",
                "освен",
                "особено",
                "от",
                "отгоре",
                "отново",
                "още",
                "пак",
                "по",
                "повече",
                "повечето",
                "под",
                "поне",
                "поради",
                "после",
                "почти",
                "прави",
                "пред",
                "преди",
                "през",
                "при",
                "пък",
                "първата",
                "първи",
                "първо",
                "пъти",
                "равен",
                "равна",
                "с",
                "са",
                "сам",
                "само",
                "се",
                "сега",
                "си",
                "син",
                "скоро",
                "след",
                "следващ",
                "сме",
                "смях",
                "според",
                "сред",
                "срещу",
                "сте",
                "съм",
                "със",
                "също",
                "т",
                "т.н.",
                "тази",
                "така",
                "такива",
                "такъв",
                "там",
                "твой",
                "те",
                "тези",
                "ти",
                "то",
                "това",
                "тогава",
                "този",
                "той",
                "толкова",
                "точно",
                "три",
                "трябва",
                "тук",
                "тъй",
                "тя",
                "тях",
                "у",
                "утре",
                "харесва",
                "хиляди",
                "ч",
                "часа",
                "че",
                "често",
                "чрез",
                "ще",
                "щом",
                "юмрук",
                "я",
                "як",
            ]
        ),
        "RU": set(stopwords.words("russian")),
        "PT": set(stopwords.words("portuguese")),
        "HI": set(
            [
                "अंदर",
                "अत",
                "अदि",
                "अप",
                "अपना",
                "अपनि",
                "अपनी",
                "अपने",
                "अभि",
                "अभी",
                "आदि",
                "आप",
                "इंहिं",
                "इंहें",
                "इंहों",
                "इतयादि",
                "इत्यादि",
                "इन",
                "इनका",
                "इन्हीं",
                "इन्हें",
                "इन्हों",
                "इस",
                "इसका",
                "इसकि",
                "इसकी",
                "इसके",
                "इसमें",
                "इसि",
                "इसी",
                "इसे",
                "उंहिं",
                "उंहें",
                "उंहों",
                "उन",
                "उनका",
                "उनकि",
                "उनकी",
                "उनके",
                "उनको",
                "उन्हीं",
                "उन्हें",
                "उन्हों",
                "उस",
                "उसके",
                "उसि",
                "उसी",
                "उसे",
                "एक",
                "एवं",
                "एस",
                "एसे",
                "ऐसे",
                "ओर",
                "और",
                "कइ",
                "कई",
                "कर",
                "करता",
                "करते",
                "करना",
                "करने",
                "करें",
                "कहते",
                "कहा",
                "का",
                "काफि",
                "काफ़ी",
                "कि",
                "किंहें",
                "किंहों",
                "कितना",
                "किन्हें",
                "किन्हों",
                "किया",
                "किर",
                "किस",
                "किसि",
                "किसी",
                "किसे",
                "की",
                "कुछ",
                "कुल",
                "के",
                "को",
                "कोइ",
                "कोई",
                "कोन",
                "कोनसा",
                "कौन",
                "कौनसा",
                "गया",
                "घर",
                "जब",
                "जहाँ",
                "जहां",
                "जा",
                "जिंहें",
                "जिंहों",
                "जितना",
                "जिधर",
                "जिन",
                "जिन्हें",
                "जिन्हों",
                "जिस",
                "जिसे",
                "जीधर",
                "जेसा",
                "जेसे",
                "जैसा",
                "जैसे",
                "जो",
                "तक",
                "तब",
                "तरह",
                "तिंहें",
                "तिंहों",
                "तिन",
                "तिन्हें",
                "तिन्हों",
                "तिस",
                "तिसे",
                "तो",
                "था",
                "थि",
                "थी",
                "थे",
                "दबारा",
                "दवारा",
                "दिया",
                "दुसरा",
                "दुसरे",
                "दूसरे",
                "दो",
                "द्वारा",
                "न",
                "नहिं",
                "नहीं",
                "ना",
                "निचे",
                "निहायत",
                "नीचे",
                "ने",
                "पर",
                "पहले",
                "पुरा",
                "पूरा",
                "पे",
                "फिर",
                "बनि",
                "बनी",
                "बहि",
                "बही",
                "बहुत",
                "बाद",
                "बाला",
                "बिलकुल",
                "भि",
                "भितर",
                "भी",
                "भीतर",
                "मगर",
                "मानो",
                "मे",
                "में",
                "यदि",
                "यह",
                "यहाँ",
                "यहां",
                "यहि",
                "यही",
                "या",
                "यिह",
                "ये",
                "रखें",
                "रवासा",
                "रहा",
                "रहे",
                "ऱ्वासा",
                "लिए",
                "लिये",
                "लेकिन",
                "व",
                "वगेरह",
                "वरग",
                "वर्ग",
                "वह",
                "वहाँ",
                "वहां",
                "वहिं",
                "वहीं",
                "वाले",
                "वुह",
                "वे",
                "वग़ैरह",
                "संग",
                "सकता",
                "सकते",
                "सबसे",
                "सभि",
                "सभी",
                "साथ",
                "साबुत",
                "साभ",
                "सारा",
                "से",
                "सो",
                "हि",
                "ही",
                "हुअ",
                "हुआ",
                "हुइ",
                "हुई",
                "हुए",
                "हे",
                "हें",
                "है",
                "हैं",
                "हो",
                "होता",
                "होति",
                "होती",
                "होते",
                "होना",
                "होने",
            ]
        ),
    }
    allowed_characters = {
        "EN": r"[a-zA-Z]",
        "BG": r"[а-яА-ЯёЁ]",
        "RU": r"[а-яА-ЯёЁ]",
        "PT": r"[a-zA-Zçãõáàâéêíóôúü]",
        "HI": r"[अ-हः०-९]",
    }
    lemmatizer = WordNetLemmatizer()

    def is_lowercase(text):
        return text.islower()

    def contains_punctuation(text):
        return any(char in string.punctuation for char in text)

    def contains_numbers(text):
        return bool(re.search(r"\d", text))

    # Need to differentiate between special languages
    def contains_special_chars(text, lang):
        allowed_pattern = allowed_characters.get(lang, r"[a-zA-Z]")
        return bool(re.search(rf"[^{allowed_pattern}\s]", text))

    def is_stopword(text, lang):
        stop_words = stopwords_map.get(lang, set())
        return text in stop_words

    # Initialize counters for statistics
    stats = {
        "total_original_tokens": 0,
        "total_normalized_tokens": 0,
        "uppercase_found": 0,
        "punctuation_found": 0,
        "numbers_found": 0,
        "special_chars_found": 0,
        "stopwords_found": 0,
    }

    # Initialize issue tracking
    issues = {
        "uppercase_tokens": [],
        "punctuation_tokens": [],
        "number_tokens": [],
        "special_char_tokens": [],
        "stopword_tokens": [],
    }

    # Check each row
    for idx, row in df.iterrows():
        if "tokens_normalized" not in row or "language" not in row:
            results["overall_status"] = "FAIL"
            results["tests"]["normalization_column_exists"] = False
            return results

        lang = row["language"]
        normalized_tokens = row["tokens_normalized"]

        if not isinstance(normalized_tokens, list):
            continue

        # Check each normalized token
        for token in normalized_tokens:
            stats["total_normalized_tokens"] += 1

            # Check for uppercase
            if lang == "HI":
                # Skip uppercase check for Hindi tokens since the concept doesn't apply
                continue
            elif not is_lowercase(token):
                stats["uppercase_found"] += 1
                if len(issues["uppercase_tokens"]) < 5:
                    issues["uppercase_tokens"].append((idx, token))

            # Check for punctuation
            if contains_punctuation(token):
                stats["punctuation_found"] += 1
                if len(issues["punctuation_tokens"]) < 5:
                    issues["punctuation_tokens"].append((idx, token))

            # Check for numbers
            if contains_numbers(token):
                stats["numbers_found"] += 1
                if len(issues["number_tokens"]) < 5:
                    issues["number_tokens"].append((idx, token))

            # Check for special characters
            if contains_special_chars(token, lang):
                stats["special_chars_found"] += 1
                if len(issues["special_char_tokens"]) < 5:
                    issues["special_char_tokens"].append((idx, token))

            # Check for stopwords
            if is_stopword(token, lang):
                stats["stopwords_found"] += 1
                if len(issues["stopword_tokens"]) < 5:
                    issues["stopword_tokens"].append((idx, token))

    # Calculate pass/fail for each test
    tests = {
        "uppercase_test": stats["uppercase_found"] == 0,
        "punctuation_test": stats["punctuation_found"] == 0,
        "numbers_test": stats["numbers_found"] == 0,
        "special_chars_test": stats["special_chars_found"] == 0,
        "stopwords_test": stats["stopwords_found"] == 0,
    }

    # Update overall status
    if not all(tests.values()):
        results["overall_status"] = "FAIL"

    # Calculate percentages for statistics
    total_tokens = stats["total_normalized_tokens"]
    if total_tokens > 0:
        stats.update(
            {
                "uppercase_percentage": (stats["uppercase_found"] / total_tokens) * 100,
                "punctuation_percentage": (stats["punctuation_found"] / total_tokens)
                * 100,
                "numbers_percentage": (stats["numbers_found"] / total_tokens) * 100,
                "special_chars_percentage": (
                    stats["special_chars_found"] / total_tokens
                )
                * 100,
                "stopwords_percentage": (stats["stopwords_found"] / total_tokens) * 100,
            }
        )

    # Compile results
    results["tests"] = tests
    results["statistics"] = stats
    results["issues_found"] = {k: len(v) for k, v in issues.items()}
    results["sample_issues"] = issues

    # Print summary report
    print("\nText Normalization Validation Report")
    print("====================================")
    print(f"Overall Status: {results['overall_status']}")
    print("\nTest Results:")
    for test, passed in tests.items():
        print(f"- {test}: {'PASS' if passed else 'FAIL'}")

    print("\nStatistics:")
    print(f"- Total normalized tokens: {stats['total_normalized_tokens']}")
    if total_tokens > 0:
        print(
            f"- Uppercase tokens: {stats['uppercase_found']} ({stats['uppercase_percentage']:.2f}%)"
        )
        print(
            f"- Tokens with punctuation: {stats['punctuation_found']} ({stats['punctuation_percentage']:.2f}%)"
        )
        print(
            f"- Tokens with numbers: {stats['numbers_found']} ({stats['numbers_percentage']:.2f}%)"
        )
        print(
            f"- Tokens with special characters: {stats['special_chars_found']} ({stats['special_chars_percentage']:.2f}%)"
        )
        print(
            f"- Stopwords found: {stats['stopwords_found']} ({stats['stopwords_percentage']:.2f}%)"
        )

    if results["overall_status"] == "FAIL":
        print("\nSample Issues Found:")
        for issue_type, samples in issues.items():
            if samples:
                print(f"\n{issue_type}:")
                for idx, token in samples:
                    print(f"- Row {idx}: '{token}'")

    return results

# Print the results  <a class="anchor" id="result-printing"></a>

In [57]:
# Clear GPU memory
torch.cuda.empty_cache()


# Run the normalization
df_short = text_normalization(df_short, "tokens")

# Verify the results
print("\nResults verification:")
print("Sample of normalized tokens (first 3 rows):")
print(df_short["tokens_normalized"].head(3))


results = check_text_normalization(df_short)

2024-12-29 20:36:20 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| lemma     | combined_nocharlm |

2024-12-29 20:36:20 INFO: Using device: cpu
2024-12-29 20:36:20 INFO: Loading: tokenize
2024-12-29 20:36:20 INFO: Loading: lemma
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-12-29 20:36:20 INFO: Done loading processors!
2024-12-29 20:36:20 INFO: Loading these models for language: bg (Bulgarian):
| Processor | Package      |
----------------------------
| tokenize  | btb          |
| lemma     | btb_nocharlm |

2024-12-29 20:36:20 INFO: Using device: cpu
2024-12-29 20:36:20 INFO: Loading: tokenize
2024-12-29 20:36:20 INFO: Loading: lemma
2024-12-29 20:36:20 INFO: Done loading processors!
2024-12-29 20:36:20 INFO: Loading these models for language: ru (Russian):
| Processor | Package            |
----------------------------------
| tokenize  | syntagrus

Using device: cpu


2024-12-29 20:36:20 INFO: Done loading processors!
2024-12-29 20:36:20 INFO: Loading these models for language: pt (Portuguese):
| Processor | Package         |
-------------------------------
| tokenize  | bosque          |
| lemma     | bosque_nocharlm |

2024-12-29 20:36:20 INFO: Using device: cpu
2024-12-29 20:36:20 INFO: Loading: tokenize
2024-12-29 20:36:20 INFO: Loading: lemma
2024-12-29 20:36:20 INFO: Done loading processors!
2024-12-29 20:36:20 INFO: Loading these models for language: hi (Hindi):
| Processor | Package       |
-----------------------------
| tokenize  | hdtb          |
| lemma     | hdtb_nocharlm |

2024-12-29 20:36:20 INFO: Using device: cpu
2024-12-29 20:36:20 INFO: Loading: tokenize
2024-12-29 20:36:20 INFO: Loading: lemma
2024-12-29 20:36:20 INFO: Done loading processors!


Starting processing of 1694 rows in 34 batches


Normalizing text: 100%|██████████| 34/34 [04:10<00:00,  7.36s/it]



Processing completed. Total normalized entries: 1694
Non-empty normalized entries: 1694

Results verification:
Sample of normalized tokens (first 3 rows):
0                                                                                                                                                   [bill, gate, say, solution, climate, change, ok, four, private, jet, bill, gate, right, fly, around, world, private, jet, normal, person, force, live, minute, city, without, freedom, travel, accord, bill, gate, tell, bbc, much, anybody, else, fight, climate, change, gate, claim, continue, spend, billion, dollar, climate, change, activism, carbon, footprint, issue, sign, get, unfiltered, news, deliver, straight, inbox, unsubscribe, time, subscribing, agree, term, use, stay, home, come, kenya, learn, farming, malaria, gate, say, interview, amol, rajan, comfortable, idea, part, problem, pay, offset, also, billion, breakthrough, energy, group, spending, part, solution, gate, add, watch, earl

In [62]:
# Saving full dataframe for training BERT
output_file = "df_normalized.csv"
df_short.to_csv(output_file, index=False)

In [59]:
# Saving UA dataframe for training BERT
df_normalized_ua = df_short[df_short["topic"] == "UA"]
output_file = "df_normalized_ua.csv"
df_normalized_ua.to_csv(output_file, index=False)

In [60]:
# Saving CC dataframe for training BERT
df_normalized_cc = df_short[df_short["topic"] == "CC"]
output_file = "df_normalized_cc.csv"
df_normalized_cc.to_csv(output_file, index=False)

### CoNNL-U format  <a class="anchor" id="#connlu-format"></a>
Now we will write a function to store the data in CoNLL-U format to ensure *reproducibility* and *platform independency*. Ten text parts will be written in one file in CoNNL-U format. 
If the execution of a cell is aborted, the currently open file will be closed.

In [None]:
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")


def convert_to_connlu(dataframe, column_name):
    """
    Function that takes a dataframe and converts each row in a CoNNL-U file. Each file consists of ten text parts
    in the CoNNL-U format.
    """
    output_dir = os.path.join("..", "CoNLL")

    file_index = 1
    sentence_count = 0
    total_sentences = 0
    output_file = os.path.join(output_dir, f"output_{file_index}.conllu")

    # Converting text parts to CoNNL-U format and closing files again.
    try:
        f = open(output_file, "w", encoding="utf-8")

        for idx, row in dataframe.iterrows():
            sentence = " ".join(row[f"{column_name}_normalized"])
            doc = nlp(sentence)
            CoNLL.write_doc2conll(doc, f)
            f.write("\n")

            sentence_count += 1
            total_sentences += 1

            # Closing file after ten converted text parts to one CoNNL-U file
            if sentence_count >= 10:
                f.close()
                file_index += 1
                output_file = os.path.join(output_dir, f"output_{file_index}.conllu")
                f = open(output_file, "w", encoding="utf-8")
                sentence_count = 0
    except KeyboardInterrupt:
        print("\nStopped running. Closing open files...")
    except Exception as e:
        print(f"\nAn error happened {e}")
    # If the cell gets aborted, any open file is
    # closed so that there is no remaining open file when the cell is stopped.
    finally:
        if not f.closed:
            f.close()

    created_files = len(
        [
            name
            for name in os.listdir(output_dir)
            if name.startswith("output") and name.endswith(".conllu")
        ]
    )
    print(f"\nTotal sentences processed: {total_sentences}")
    print(f"Total files created: {created_files}")

Now we call the function *convert_to_connlu* on our dataframe *df_short* to receive files in CoNNL-U format consisting of ten text parts per file.

In [None]:
# only run if the conll files are not already created
# convert_to_connlu(df_short, 'tokens')

Creating function to load tokens in a dataframe, keeping the CoNNL-u format in columns.

In [12]:
def extract_conllu_to_dataframe(directory):
    """
    Function that takes the directory string where the CoNNL-u files are located and loops though each files, extracting each word with
    its corresponding CoNNL-u features. At the end, all tokens including its features are stored in a dataframe. Each word is assigned a word_id.
    Each single narrative is assigned a narrative_id
    :returns df
    """
    columns = [
        "narrative_id",
        "word_id",
        "form",
        "lemma",
        "upos",
        "xpos",
        "feats",
        "head",
        "deprel",
        "deps",
        "misc",
    ]
    data = []
    narrative_id = 0

    # Looping though all files
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".conllu"):
            file_path = os.path.join(directory, filename)
            with open(file_path, "r") as f:
                word_id = 0
                consecutive_comments = 0

                for line in f:
                    line = line.strip()

                    # Checking for comment lines, indicated by two hashtags. If new narrative starts, increment narrative_id
                    if line.startswith("#"):
                        consecutive_comments += 1
                        if consecutive_comments == 2:
                            narrative_id += 1
                            word_id = 0
                        continue
                    else:
                        consecutive_comments = 0

                    # Continuing for empty strokes
                    if not line:
                        continue

                    parts = line.split("\t")
                    if len(parts) == 10:
                        word_id += 1
                        row = [narrative_id, word_id] + parts[1:]
                        data.append(row)

    df = pd.DataFrame(data, columns=columns)
    return df


# Extract all narratives from all CoNNL-u files to dataframe df_connlu
directory_path = "../CoNLL"
df_connlu = extract_conllu_to_dataframe("../CoNLL")