# Natural Language Processing Milestone 1

## Content
1. [Initial Data prep and label representation](#initial-data-prep-and-label-representation)
    - [Create a first dataframe](#creating-a-first-dataframe)
    - [Extract the label taxonomy](#extract-the-label-taxonomy)
    - [Map the labels to the taxonomy](#map-the-labels-in-the-file-to-the-taxonomy)
2. [Text segmentation](#text-segmentation)
    - [Tokenize sentences and words](#tokenize-sentences-and-words)
    - [Find sentences of unusual length](#find-sentences-unusual-length)
    - [Handle very short sentences](#handle-short-sentences)
    - [Handle very long sentences](#handle-long-sentences)
    - [Verify remaining sentences of unusual length](#verify-remaining-unusual-sentences)
3. [Text Normalization](#text-normalization)
    - [Verify text normalization](#check-normalization)
    - [Print the results of the text normalization](#result-printing)
4. [CoNLL-U format](#connlu-format)

In [4]:
#%pip install --requirement requirements.txt
# When you import something new that is not in the requirements.txt file, please add it to the requirements.txt file and re-run the cell above.


import os
import pandas as pd
import fitz  # PyMuPDF
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer

from stanza.utils.conll import CoNLL
import re
from nltk.stem import PorterStemmer
import stanza
import torch
import nltk
import string
from tqdm import tqdm


nltk.download('stopwords')

##%%capture

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asus9\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Initial Data Prep and label representation

### Creating a first dataframe
The dataframe contains [filename, content, narratives, subnarratives, topic] for each datapoint.
- The documents belong to two different topics: the Ukraine war and climate change
- We make explicit in the dataframe which topic each document belongs to by adding the column "topic", which is based on the abbreviations UA or CC in the document filenames
    - The reason for this is the implementation decision we made to **build separate models to predict the two topics** instead of one that can predict both

The dataset has **two ways of categorizing articles** into topics:
- The above-mentioned UA and CC in the article filenames
- URW and CC prefixes in the label strings, which classify every label of an article (since there can be multiple)
    - This means that, theoretically, an article could contain both topics
    - However, we will see that almost all datapoints only have labels for the topic that is also in the corresponding article filename (i.e. if the filename contains CC, almost all datapoints have only CC labels)
    - More details on why we use the filename topic instead of the label topics are provided further below
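
The agreement between the two topic markers can be checked directly. The following is a minimal sketch, not part of the original pipeline: the filenames and label strings are made-up examples, and `label_topics` and `find_topic_mismatches` are hypothetical helper names. It assumes labels are `;`-separated and prefixed with `URW: ` or `CC: `, as in the annotations file.

```python
import re

# Hypothetical example rows: (filename, raw narrative label string).
# Filenames and labels are invented for illustration only.
rows = [
    ("EN_UA_000001.txt", "URW: Discrediting Ukraine;URW: Praise of Russia"),
    ("EN_CC_000002.txt", "CC: Criticism of climate policies"),
    ("EN_UA_000003.txt", "CC: Criticism of climate policies"),
    ("EN_CC_000004.txt", "Other"),
]

# Label prefix -> filename abbreviation
PREFIX_TO_TOPIC = {"URW": "UA", "CC": "CC"}


def filename_topic(filename):
    """Topic abbreviation derived from the filename (UA or CC)."""
    return "UA" if "UA" in filename else "CC"


def label_topics(label_string):
    """Set of topics named by the prefixes of a ';'-separated label string."""
    topics = set()
    for label in label_string.split(";"):
        match = re.match(r"\s*(URW|CC): ", label)
        if match:
            topics.add(PREFIX_TO_TOPIC[match.group(1)])
    return topics


def find_topic_mismatches(rows):
    """Filenames whose label prefixes name a topic other than the filename's."""
    mismatches = []
    for filename, labels in rows:
        topics = label_topics(labels)
        # Labels without a prefix (e.g. "Other") carry no topic and are skipped
        if topics and topics != {filename_topic(filename)}:
            mismatches.append(filename)
    return mismatches


mismatches = find_topic_mismatches(rows)
print(mismatches)
```

A row is reported both when its labels mix the two topics and when all of its labels belong to the topic that is not in the filename.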

In [None]:
# Define the paths to articles and annotations
documents_path = "../training_data_16_October_release/EN/raw-documents"
annotations_file = "../training_data_16_October_release/EN/subtask-2-annotations.txt"

# Read the annotations file
annotations = pd.read_csv(annotations_file, sep='\t', header=None, names=['filename', 'narrative', 'subnarrative'])

# Remove all occurrences of "CC: " and "URW: " from narratives and subnarratives
annotations['narrative'] = annotations['narrative'].str.replace(r'(CC: |URW: )', '', regex=True)
annotations['subnarrative'] = annotations['subnarrative'].str.replace(r'(CC: |URW: )', '', regex=True)

# Split the narratives and subnarratives into lists
annotations['narrative'] = annotations['narrative'].str.split(';')
annotations['subnarrative'] = annotations['subnarrative'].str.split(';')

# Initialize a list to store the data
data = []

# Iterate over the annotations and read the corresponding documents
for _, row in annotations.iterrows():
    filename = row['filename']
    narratives = row['narrative']
    subnarratives = row['subnarrative']
    
    # Read the document content
    with open(os.path.join(documents_path, filename), 'r', encoding='utf-8') as file:
        content = file.read()
    
    # Determine the topic based on the filename
    topic = "UA" if "UA" in filename else "CC"
    
    # Append the document content, narratives, subnarratives, and topic to the data list
    data.append({
        'filename': filename,
        'content': content,
        'narratives': narratives,
        'subnarratives': subnarratives,
        'topic': topic
    })

# Convert the data list to a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df.head()

- There seems to be redundant information in the narratives and subnarratives columns, since entries in the subnarratives column are structured as *"narrative:subnarrative"* -> we could therefore remove the repeated narrative part
- We first check if the narrative information in the subnarrative column is exactly the same as in the narrative column
- If so, we will remove the narrative information from the subnarrative column

In [None]:
# Define a function to check if the subnarrative starts with the narrative in a given row
def check_pattern(row):
    narratives = row['narratives']
    subnarratives = row['subnarratives']
    
    for narrative, subnarrative in zip(narratives, subnarratives):
        if not subnarrative.startswith(narrative + ":"):
            return (row.name, narrative, subnarrative)
    return None

# Apply the function to each row and collect the results
pattern_check_results = df.apply(check_pattern, axis=1)

# Filter out the rows where the pattern does not hold
problematic_rows = pattern_check_results[pattern_check_results.notnull()]

# Set display options to avoid truncation
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

# Display the problematic rows
print("Problematic rows where the pattern does not hold:")
print(problematic_rows)

- The only exceptions to the pattern are the *narrative other, subnarrative other* cases
- Since there is no redundancy in those cases and there are no other exceptions to the pattern, we can delete the narratives before the ":" in the subnarratives column

In [None]:
# Function to remove redundant narratives from subnarratives
def remove_redundant_narratives(row):
    narratives = row['narratives']
    subnarratives = row['subnarratives']
    
    cleaned_subnarratives = []
    for narrative, subnarrative in zip(narratives, subnarratives):
        if subnarrative.startswith(narrative + ":"):
            cleaned_subnarrative = subnarrative[len(narrative) + 1:].strip()
            cleaned_subnarratives.append(cleaned_subnarrative)
        else:
            cleaned_subnarratives.append(subnarrative)
    
    return cleaned_subnarratives

# Apply the function to each row to clean the subnarratives
df['subnarratives'] = df.apply(remove_redundant_narratives, axis=1)

# Display the updated DataFrame
print("Updated DataFrame with cleaned subnarratives:")
df.head()

- Currently, a subnarrative is assigned to a narrative by the order in the list in the respective column
- We change this to a better assignment using lists containing dictionaries where each contains a "narrative" and "subnarrative" key
- After this, we only have a single column for the labels "narrative_subnarrative_pairs"

In [None]:
# Function to create narrative-subnarrative pairs in a single column
def create_narrative_subnarrative_pairs(row):
    narratives = row['narratives']
    subnarratives = row['subnarratives']
    
    pairs = []
    for narrative, subnarrative in zip(narratives, subnarratives):
        if subnarrative.startswith(narrative + ":"):
            cleaned_subnarrative = subnarrative[len(narrative) + 1:].strip()
        else:
            cleaned_subnarrative = subnarrative
        pairs.append({'narrative': narrative, 'subnarrative': cleaned_subnarrative})
    
    return pairs

# Apply the function to each row to create narrative-subnarrative pairs
df['narrative_subnarrative_pairs'] = df.apply(create_narrative_subnarrative_pairs, axis=1)

# Drop the original narratives and subnarratives columns if no longer needed
df = df.drop(columns=['narratives', 'subnarratives'])

# Display the updated DataFrame
print("Updated DataFrame with narrative-subnarrative pairs:")
df.head()

### Extract the label taxonomy
- The competition creators provide a complete taxonomy of labels for each of the two topics
- We create a template for each label taxonomy (one per topic) from the subtask 2 PDF file
- We can use this to encode the labels of the datapoints in the dataset
- We do this by assigning an index to every possible class (narrative-subnarrative pair) to numerically represent the labels for each document

In [None]:

# Define the path to the PDF file
pdf_path = "../info/subtask2_NARRATIVE-TAXONOMIES.pdf"

# Open the PDF file
pdf_document = fitz.open(pdf_path)

# Function to extract text from a specific page
def extract_text_from_page(page_number):
    page = pdf_document.load_page(page_number)
    text = page.get_text("text")
    return text

# Extract text from the relevant pages
ukraine_war_text = extract_text_from_page(0)  # First page contains Ukraine War taxonomy
climate_change_text = extract_text_from_page(1)  # Second page contains Climate Change taxonomy

# Function to parse the taxonomy text and create a DataFrame
def parse_taxonomy(text):
    lines = text.split('\n')
    # Exclude the last three entries of the split, which are not part of the taxonomy
    lines = lines[:-3]
    data = []
    current_narrative = None
    for line in lines:
        if line.strip() == "":
            continue
        if line.startswith("-"):  # Subnarrative
            subnarrative = line.strip("- ").strip()
            data.append({'narrative': current_narrative, 'subnarrative': subnarrative})
        else:  # Narrative
            if current_narrative and not any(d['narrative'] == current_narrative for d in data):
                # Add the narrative itself as subnarrative if it has no subnarratives
                data.append({'narrative': current_narrative, 'subnarrative': current_narrative})
            current_narrative = line.strip()
            if current_narrative == "Other":
                data.append({'narrative': "Other", 'subnarrative': "Other"})
    # Handle the last narrative if it has no subnarratives
    if current_narrative and not any(d['narrative'] == current_narrative for d in data):
        data.append({'narrative': current_narrative, 'subnarrative': 'Other'})
    
    df = pd.DataFrame(data)
    df = df.sort_values(by='narrative', ascending=True).reset_index(drop=True)
    return df

# Parse the taxonomies and create DataFrames
ukraine_war_df = parse_taxonomy(ukraine_war_text)
climate_change_df = parse_taxonomy(climate_change_text)


In [None]:
ukraine_war_df.head(50)

In [None]:
climate_change_df.head(50)

- Handle some edge cases in the taxonomy as instructed in the task description (see https://propaganda.math.unipd.it/semeval2025task10/index.html)
- E.g. if a narrative is identified but no subnarrative applies, the subnarrative is "Other". We add this to every narrative in the taxonomy dataframes except "Other" and "Hidden plots by secret schemes of powerful groups", since those rows are already present (in one of the two dataframes)
- After manually adding the "Hidden plots by secret schemes of powerful groups" - "Other" combination to the climate change taxonomy, all possible combinations are covered
    - We do this because this combination is already present in the UA taxonomy but not in the CC taxonomy

In [None]:
# Function to add "Other" subnarrative for each narrative group, excluding specific narratives
def add_other_subnarrative(df):
    additional_rows = []
    unique_narratives = df['narrative'].unique()
    for narrative in unique_narratives:
        if narrative not in ["Other", "Hidden plots by secret schemes of powerful groups"]:
            additional_rows.append({'narrative': narrative, 'subnarrative': 'Other'})
    additional_df = pd.DataFrame(additional_rows)
    return pd.concat([df, additional_df], ignore_index=True)

# Function to sort the DataFrame and add an index column
def sort_and_index_df(df):
    df = df.sort_values(by=['narrative', 'subnarrative']).reset_index(drop=True)
    df['index'] = df.index + 1
    df = df[['index', 'narrative', 'subnarrative']]
    return df

# Add "Other" subnarrative to each DataFrame
ukraine_war_df = add_other_subnarrative(ukraine_war_df)
climate_change_df = add_other_subnarrative(climate_change_df)

# Manually add the specific row to the climate change DataFrame
climate_change_df = pd.concat([climate_change_df, pd.DataFrame([{'narrative': 'Hidden plots by secret schemes of powerful groups', 'subnarrative': 'Other'}])], ignore_index=True)

# Sort and add index column to each DataFrame
ukraine_war_df = sort_and_index_df(ukraine_war_df)
climate_change_df = sort_and_index_df(climate_change_df)

In [None]:
ukraine_war_df.head(55)

In [None]:
climate_change_df.head(58)

### Map the labels in the file to the taxonomy

- In the next step, we want to add another column to our "df" that combines the narrative_subnarrative_pairs column with the information in our taxonomy dataframes
- First, the column "topic" tells us which taxonomy dataframe applies (UA for Ukraine war and CC for climate change)
- Then, theoretically, the dictionaries should exactly correspond to a given row in one of the two taxonomy dataframes
- We check if every narrative-subnarrative pair of every row in the dataset can be mapped to its row in the taxonomy and if so, add a column to the dataframe that contains the indices of the targets.

In [None]:
# Create a mapping of narrative-subnarrative pairs to their indices
def create_mapping(df):
    mapping = {}
    for _, row in df.iterrows():
        key = (row['narrative'], row['subnarrative'])
        mapping[key] = row['index']
    return mapping

ukraine_war_mapping = create_mapping(ukraine_war_df)
climate_change_mapping = create_mapping(climate_change_df)

# Function to check the mapping and add indices to the DataFrame
def add_target_indices(row, ukraine_war_mapping, climate_change_mapping):
    pairs = row['narrative_subnarrative_pairs']
    topic = row['topic']
    indices = []
    
    if topic == "UA":
        mapping = ukraine_war_mapping
    elif topic == "CC":
        mapping = climate_change_mapping
    else:
        return None  # Invalid topic
    
    for pair in pairs:
        key = (pair['narrative'], pair['subnarrative'])
        if key in mapping:
            indices.append(mapping[key])
        else:
            return (row.name, key)  # Mapping does not exist
    
    return indices

# Apply the function to each row and collect the results
df['target_indices'] = df.apply(add_target_indices, axis=1, args=(ukraine_war_mapping, climate_change_mapping))

# Filter out the rows where the mapping does not exist
problematic_rows = df[df['target_indices'].apply(lambda x: isinstance(x, tuple))]

# Display the problematic rows
print("Problematic rows where the mapping does not exist:")
print(problematic_rows)

# Display the first few problematic rows for inspection
if not problematic_rows.empty:
    print("First few problematic rows:")
    for index, row in problematic_rows.iterrows():
        print(f"Row index: {index}, Problematic pair: {row['target_indices']}")

In [None]:
len(problematic_rows)

- There are two datapoints/rows in the dataset where the mapping does not work: indices 65 and 143
- On closer inspection, we see the problem: all rows except these two are EITHER Ukraine war OR climate change
- In these two rows, this is not the case (remember that our implementation relies on classifying whole articles as a topic, i.e. all the labels must be from the taxonomy of that topic)

Below we check the annotation file for rows where there are mixed topics in the labels.

In [None]:


# Define the path to the annotations file
annotations_file = "../training_data_16_October_release/EN/subtask-2-annotations.txt"

# Read the annotations file
annotations = pd.read_csv(annotations_file, sep='\t', header=None, names=['filename', 'narrative', 'subnarrative'])

# Initialize a list to store the line numbers with both "CC" and "URW"
mixed_topic_lines = []

# Iterate through each row and check for mixed topics
for index, row in annotations.iterrows():
    narrative = row['narrative']
    subnarrative = row['subnarrative']
    
    # Check if both "CC" and "URW" are present in either narrative or subnarrative
    if ("CC: " in narrative and "URW: " in narrative) or ("CC: " in subnarrative and "URW: " in subnarrative):
        mixed_topic_lines.append(index + 1)  # Adding 1 to index to match line numbers

# Print the line numbers with mixed topics
print("Lines with both 'CC' and 'URW' present:")
print(mixed_topic_lines)

# Print the total count of such lines
print("Total number of lines with mixed topics:", len(mixed_topic_lines))

- This shows us that line 66 in the annotations file (index 65 in the dataset) has labels from both taxonomies
- Line 144 in the annotations file (index 143), on the other hand, has UA in the filename but CC in all the labels, which we assume to be an error in the dataset
- We deal with this by dropping the two rows from the dataset

In [None]:
df_short = df.drop([65, 143])

### Conclusion for initial data prep and label representation
- We now have a dataframe containing all relevant data, including which topic an article belongs to, the article content (in raw form up until now) and the labels (narrative and subnarrative combinations) in text form and in numerical form
- We removed 2 problematic datapoints and are left with 198
- While the label classes are currently encoded as integer indices, one-hot encoding would be more suitable since there is no ordinal relationship between the class labels
- We can now easily change the encoding which we will do in Milestone 2. The current implementation also easily supports differentiating the topics to train separate models using the "topic" column
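
As an illustration of the planned re-encoding, here is a minimal sketch (not the Milestone 2 implementation). Because an article can carry several labels, the one-hot idea becomes a multi-hot vector with one position per taxonomy row. `to_multi_hot` is a hypothetical helper; it assumes the 1-based indices produced by `sort_and_index_df`, and the class count of 6 is made up for the example.

```python
def to_multi_hot(target_indices, num_classes):
    """Turn a list of 1-based class indices into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for idx in target_indices:
        if not 1 <= idx <= num_classes:
            raise ValueError(f"index {idx} is outside 1..{num_classes}")
        # Taxonomy indices start at 1, list positions start at 0
        vector[idx - 1] = 1
    return vector


# An article labelled with taxonomy rows 2 and 5 (assumed example values)
example = to_multi_hot([2, 5], num_classes=6)
print(example)  # [0, 1, 0, 0, 1, 0]
```

Since the two topics have separate taxonomies, `num_classes` would be the number of rows in the taxonomy dataframe selected by the "topic" column.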

## Text Segmentation <a class="anchor" id="text-segmentation"></a>

<a class="anchor" id="tokenize-sentences-and-words"></a>
Now we handle the content of the articles. Currently, each entry in our dataframe has a single plain string that contains the whole article.

Let's start by splitting it into sentences and words.

In [None]:
def tokenize(df):
    df['tokens'] = None
    for i, row in df.iterrows():
        # split the content into sentences
        sentences = nltk.sent_tokenize(row['content'])
        # tokenize each sentence
        tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
        df.at[i, 'tokens'] = tokens
    return df

df_short = tokenize(df_short)
#df_short.head()

<a class="anchor" id="find-sentences-unusual-length"></a>
To uncover potential errors, let us check for and handle sentences of unusual length.

In [None]:
# Function to find sentences of unusual length
def find_unusual_length_sentences(df, min_length=3, max_length=130):
    unusual_sentences = []
    for i, row in df.iterrows():
        for j, sentence in enumerate(row['tokens']):
            if len(sentence) < min_length or len(sentence) > max_length:
                # also store the previous and next sentences for context
                prev_sentence = row['tokens'][j-1] if j > 0 else None
                next_sentence = row['tokens'][j+1] if j < len(row['tokens']) - 1 else None
                unusual_sentences.append({
                    'row_index': i, # for later handling of the unusual sentences
                    'sentence_index': j, # for later handling of the unusual sentences
                    'sentence': sentence,
                    'previous': prev_sentence,
                    'next': next_sentence
                })
    return unusual_sentences

# Find sentences with less than 3 words or more than 130 words
unusual_sentences = find_unusual_length_sentences(df_short)

print(f"There are {len(unusual_sentences)} sentences of unusual length.")

# Display the unusual sentences
for entry in unusual_sentences:
    print(f"Sentence length: {len(entry['sentence'])}, Sentence: {' '.join(entry['sentence'])}")

We find multiple very short and very long sentences.
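
To get a quick overview before handling individual cases, the sentences can be counted per length category. This is a minimal sketch on made-up token lists; `length_buckets` is a hypothetical helper, and the thresholds 3 and 130 are the ones used above.

```python
from collections import Counter


def length_buckets(tokenized_docs, min_length=3, max_length=130):
    """Count sentences that are too short, too long, or within the usual range."""
    counts = Counter()
    for doc in tokenized_docs:
        for sentence in doc:
            n = len(sentence)
            if n < min_length:
                counts["short"] += 1
            elif n > max_length:
                counts["long"] += 1
            else:
                counts["normal"] += 1
    return counts


# Two toy documents, each a list of tokenized sentences (invented data)
docs = [
    [["Hi"], ["This", "is", "fine", "."]],
    [["w"] * 131, ["Why", "?"]],
]
buckets = length_buckets(docs)
print(dict(buckets))
```

With the real data, this could be called as `length_buckets(df_short['tokens'])`, since each entry of that column is a list of tokenized sentences.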

<a class="anchor" id="handle-short-sentences"></a>
#### Handling very short sentences.

The sentences of size 1 all consist of non-meaningful characters. Therefore we can drop them directly.

In [None]:
# Function to drop unusual sentences of a specific length from the DataFrame
def drop_sentences_of_length(df, unusual_sentences, length):
    # Create a copy of the list to iterate over
    for entry in unusual_sentences[:]:
        if len(entry['sentence']) == length:
            row_index = entry['row_index']
            sentence_index = entry['sentence_index']
            # Check if the sentence index is within the valid range
            if 0 <= sentence_index < len(df.at[row_index, 'tokens']):
                # Drop from DataFrame
                df.at[row_index, 'tokens'].pop(sentence_index)
                # Drop from unusual_sentences list
                unusual_sentences.remove(entry)
                print("dropped: ", entry['sentence'])
    return df

# Drop sentences of length 1
print(f"There are {len(unusual_sentences)} sentences of unusual length before dropping sentences of length 1.")
df_short = drop_sentences_of_length(df_short, unusual_sentences, length=1)
print(f"There are {len(unusual_sentences)} sentences of unusual length after dropping sentences of length 1.")

The sentences of size 2 might make sense. Let's have a look at their context.

In [None]:
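# Display sentences of length 2 with their preceding and following sentences
for entry in unusual_sentences:
    # skip the very long sentences, which are handled separately
    if len(entry['sentence']) != 2:
        continue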
    print(f"(Previous) {' '.join(entry['previous']) if entry['previous'] else 'None'}")
    print(f"(Idx {entry['row_index']}, {entry['sentence_index']}, ListIdx {unusual_sentences.index(entry)}) {' '.join(entry['sentence'])}")
    if entry['next']:
        print(f"(Next) {' '.join(entry['next'])}")
    print("-" * 50)

We find that the sentences of size 2 are of different types:
 - Words that belong to the previous or following sentence but were split off by punctuation errors (e.g. "Mild. Tonight: Rain slowly returns ...") -> merge manually
 - Words at the end of a document (e.g. "Watch:") -> drop
 - Valid sentences (e.g. "Why?") -> valid, keep
 - Section numbering (e.g. "2. (paragraph)") -> valid, keep (might help the model understand the text, since it provides structure)

Since the number of such sentences is manageable, we can decide manually for each case whether to keep, merge or drop it.
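
The manual decisions can also be written down as a small table and applied in one pass. The sketch below is an assumption-laden illustration, not the notebook's implementation: `apply_actions` and the action names are hypothetical, and the toy sentences only mirror the examples above. Actions are applied from the highest sentence index down, so that deleting or merging a sentence does not shift the indices of the decisions still to be applied.

```python
def apply_actions(sentences, actions):
    """Apply keep/drop/merge decisions to one document.

    sentences: list of tokenized sentences
    actions:   dict mapping sentence index -> 'keep', 'drop',
               'merge_previous' or 'merge_next'
    Returns a new list; the input list is left unchanged.
    """
    result = [list(sentence) for sentence in sentences]
    # Highest index first: changes never affect indices that are still pending
    for idx in sorted(actions, reverse=True):
        action = actions[idx]
        if action == "keep":
            continue
        if action == "drop":
            del result[idx]
        elif action == "merge_previous":
            if idx == 0:
                raise IndexError("the first sentence has no previous sentence")
            result[idx - 1] = result[idx - 1] + result[idx]
            del result[idx]
        elif action == "merge_next":
            if idx >= len(result) - 1:
                raise IndexError("the last sentence has no next sentence")
            result[idx] = result[idx] + result[idx + 1]
            del result[idx + 1]
        else:
            raise ValueError(f"unknown action: {action}")
    return result


# Toy document modelled on the cases listed above
sentences = [
    ["Rain", "today", "."],
    ["Mild", "."],
    ["Tonight", ":", "Rain", "slowly", "returns", "."],
    ["Why", "?"],
    ["Watch", ":"],
]
actions = {1: "merge_next", 3: "keep", 4: "drop"}
cleaned = apply_actions(sentences, actions)
for sentence in cleaned:
    print(" ".join(sentence))
```

Keeping the decisions in a dictionary keyed by sentence index also documents them in one place, which makes them easier to review than a sequence of individual calls.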

In [None]:
# function to merge specific sentences with either the previous or next sentence
def merge_specific_sentence(df, row_idx, sentence_idx, direction):
    
    # List of sentences of the specified row
    sentences = df.at[row_idx, 'tokens']
    
    # Merge based on the direction
    if direction == 'previous':
        merged_sentence = sentences[sentence_idx - 1] + sentences[sentence_idx]
        sentences[sentence_idx - 1] = merged_sentence
        del sentences[sentence_idx]
        
    elif direction == 'next':
        merged_sentence = sentences[sentence_idx] + sentences[sentence_idx + 1]
        sentences[sentence_idx] = merged_sentence
        del sentences[sentence_idx + 1]
    
    # Update the row in the dataframe
    df.at[row_idx, 'tokens'] = sentences


print(f"There are {len(unusual_sentences)} sentences of unusual length before merging some sentences of length 2.")

merge_specific_sentence(df_short, 87, 6, 'previous')  # Merge "Drones ." with previous
merge_specific_sentence(df_short, 88, 0, 'next')      # Merge "U.K ." with next
merge_specific_sentence(df_short, 127, 23, 'next')    # Merge "Mild ." with next
merge_specific_sentence(df_short, 136, 3, 'next')     # Merge "Gov ." with next

# drop them from the list of unusual sentences
unusual_sentences = [i for j, i in enumerate(unusual_sentences) if j not in [6,7,12,13]]

print(f"There are {len(unusual_sentences)} sentences of unusual length after merging some sentences of length 2.")

In [None]:
# function to drop specific sentences from the dataframe
def drop_sentence_by_indices(df, row_idx, sentence_idx):
    
    # get the list of sentences of the specified row (a reference, not a copy)
    sentences = df.at[row_idx, 'tokens']
        
    if 0 <= sentence_idx < len(sentences):
        # drop sentence
        del sentences[sentence_idx]
        # update the dataframe
        df.at[row_idx, 'tokens'] = sentences
        
    
    else:
        print(f"Invalid sentence index {sentence_idx} for row {row_idx}")


print(f"There are {len(unusual_sentences)} sentences of unusual length before dropping some sentences of length 2.")

drop_sentence_by_indices(df_short, 55, 10)
drop_sentence_by_indices(df_short, 80, 21)

# drop them from the list of unusual sentences
unusual_sentences = [i for j, i in enumerate(unusual_sentences) if j not in [2, 5]]

print(f"There are {len(unusual_sentences)} sentences of unusual length after dropping some sentences of length 2.")

#### Handling very long sentences <a class="anchor" id="handle-long-sentences"></a>

There are several very long sentences. Looking at the data, we see that some of them are in fact correctly split, complete single sentences. Others, however, are multiple sentences stored as one because the splitting did not work correctly. Let's split those manually.
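
Replacing one wrongly merged entry by several sentences is a list splice: keep everything before the entry, insert the replacement sentences, then keep everything after it. The following minimal sketch shows this on toy data; `splice_sentences` is a hypothetical helper name. Both slices must come from the same document's sentence list.

```python
def splice_sentences(sentences, idx_sentence, replacement):
    """Return a new list in which sentences[idx_sentence] is replaced by
    the sentences in `replacement`."""
    if not 0 <= idx_sentence < len(sentences):
        raise IndexError(f"sentence index {idx_sentence} is out of range")
    return sentences[:idx_sentence] + replacement + sentences[idx_sentence + 1:]


# Toy document: the middle entry actually contains two sentences
doc = [
    ["First", "."],
    ["Second", ".", "Third", "."],
    ["Fourth", "."],
]
fixed = splice_sentences(doc, 1, [["Second", "."], ["Third", "."]])
print(fixed)
```

The number of sentences grows by `len(replacement) - 1`, so any indices recorded earlier for later sentences of the same document shift by that amount.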

In [None]:
# function to manually replace an unsplit sequence of sentences with the manually split sentences
def manually_split_sentence(idx_row, idx_sentence, splitted_sentence):
    # keep the sentences before and after idx_sentence from the same row
    sentences = df_short.at[idx_row, 'tokens']
    df_short.at[idx_row, 'tokens'] = sentences[:idx_sentence] + splitted_sentence + sentences[idx_sentence+1:]


print(f"There are {len(unusual_sentences)} sentences of unusual length before manually splitting some too long sentences.")

manually_split_sentence(6,6,[
    ["Jens", "Stoltenberg", "(", "pictured", ")", ",", "the", "13th", "secretary", "general", "of", "NATO", ",", "revealed", "there", "were", "live", "discussions", "among", "members", "about", "removing", "missiles", "from", "storage", "and", "putting", "them", "on", "standby", "."],
    ["A", "Netherlands", "'", "Air", "Force", "F-16", "jetfighter", "takes", "part", "in", "the", "NATO", "exercise", "as", "part", "of", "the", "NATO", "Air", "Policing", "mission", "."],
    ["The", "head", "of", "Kyiv", "'s", "national", "security", "council", "said", "Putin", "could", "demand", "a", "tactical", "nuclear", "weapon", "be", "used", "if", "Russia", "'s", "army", "is", "beaten", "in", "Ukraine", "."],
    ["Russian", "soldiers", "load", "a", "Iskander-M", "short-range", "ballistic", "missile", "launcher", "at", "a", "firing", "position", "as", "part", "of", "a", "Russian", "military", "drill", "intended", "to", "train", "the", "troops", "in", "using", "tactical", "nuclear", "weapons", "."],
    ["Meanwhile", ",", "Mr", "Stoltenberg", "warned", "in", "Brussels", "of", "the", "threat", "from", "China", ",", "adding", "that", "nuclear", "transparency", "should", "form", "the", "basis", "of", "NATO", "'s", "nuclear", "strategy", "to", "prepare", "the", "alliance", "for", "the", "dangers", "of", "the", "world", "."]
])

manually_split_sentence(46,12,[
    ["“", "The", "Complaint", "alleged", "that", "several", "of", "the", "Vietnamese", "orphans", "brought", "to", "the", "United", "States", "under", "Operation", "Babylift", "stated", "they", "are", "not", "orphans", "and", "that", "they", "wish", "to", "return", "to", "Vietnam", ".", "”"],
    ["A", "statement", "issued", "on", "April", "4", ",", "1975", ",", "by", "“", "professors", "of", "ethics", "and", "religion", ",", "”", "pointed", "out", "that", "many", "“", "of", "the", "children", "are", "not", "orphans", ";", "their", "parents", "or", "relatives", "may", "still", "be", "alive", ",", "although", "displaced", ",", "in", "Vietnam", "…", "The", "Vietnamese", "children", "should", "be", "allowed", "to", "stay", "in", "Vietnam", "where", "they", "belong", ".", "”"],
    ["The", "operation", "was", "celebrated", "by", "the", "corporate", "media", "and", "“", "Hollywood", "’", "s", "celebrity", "elite", "…", "[", "and", ",", "as", "a", "propaganda", "event", "]", "generated", "a", "spectacle", "of", "celebration", "and", "emphasized", "that", "the", "babies", "were", "more", "than", "just", "average", "orphans", ",", "”", "writes", "US", "History", "Scene", "."]
])

manually_split_sentence(73,8,[
    ["But", "he", "noted", "that", "“", "even", "as", "the", "Russians", "have", "gained", "territory", ",", "they", "do", "it", "at", "a", "pretty", "big", "cost", "in", "number", "of", "casualties", ",", "like", "in", "personnel", ",", "but", "also", "in", "number", "of", "pieces", "of", "equipment", "that", "are", "being", "taken", "out.", "”"],
    ["Austin", "said", "in", "his", "remarks", "Tuesday", "that", "“", "Russia", "has", "paid", "a", "staggering", "cost", "for", "(", "President", "Vladimir", ")", "Putin", "’", "s", "imperial", "dreams", "”", ",", "using", "“", "up", "to", "$", "211", "billion", "to", "equip", ",", "deploy", ",", "maintain", ",", "and", "sustain", "its", "imperial", "aggression", "against", "Ukraine.", "”"],
    ["“", "At", "least", "315,000", "Russian", "troops", "have", "been", "killed", "or", "wounded", "”", "since", "Russia", "launched", "its", "all-out", "invasion", "of", "Ukraine", "in", "2022", ",", "Austin", "said", "."],
    ["Austin", "added", "that", "Ukraine", "has", "also", "“", "sunk", ",", "destroyed", ",", "or", "damaged", "some", "20", "medium-to-large", "Russian", "navy", "vessels.", "”"],
    ["The", "sinkings", "have", "been", "an", "embarrassment", "for", "Moscow", "and", "Russian", "state", "media", "confirmed", "Tuesday", "that", "the", "country", "had", "replaced", "the", "head", "of", "its", "navy", "."]
])

manually_split_sentence(77,13,[
    ["Подробнее", "на", "РБК", ":", "https", ":", "//www.rbc.ru/politics/14/06/2023/6489e6f39a794778d61881b4", "."],
    ["The", "picture", "of", "widening", "war", "is", "beginning", "to", "form", ":", "Professor", "Sergey", "Karaganov", ",", "honorary", "chairman", "of", "Russia", "’", "s", "Council", "on", "Foreign", "and", "Defense", "Policy", ",", "and", "academic", "supervisor", "at", "the", "School", "of", "International", "Economics", "and", "Foreign", "Affairs", "Higher", "School", "of", "Economics", "(", "HSE", ")", "in", "Moscow", "."],
    ["Sergey", "Karaganov", ":", "By", "using", "its", "nuclear", "weapons", ",", "Russia", "could", "save", "humanity", "from", "a", "global", "catastrophe", "."],
    ["A", "tough", "but", "necessary", "decision", "would", "likely", "force", "the", "West", "to", "back", "off", ",", "enabling", "an", "earlier", "end", "to", "the", "Ukraine", "crisis", "and", "preventing", "it", "from", "expanding", "to", "other", "states", "."],
    ["Karaganov", "’", "s", "description", "of", "the", "Western", "World", "as", "“", "anti-human", "ideologies", ":", "the", "denial", "of", "family", ",", "homeland", ",", "history", ",", "love", "between", "men", "and", "women", ",", "faith", ",", "service", "to", "higher", "ideals", ",", "everything", "that", "is", "human", ",", "”", "shows", "a", "rising", "realization", "that", "Russia", "sees", "itself", "confronted", "by", "a", "Satanic", "force", "that", "must", "be", "destroyed", "."]
])

manually_split_sentence(147,7,[
    ["At", "the", "same", "time", ",", "the", "official", "claimed", "that", "the", "danger", "of", "Kiev", "using", "a", "‘", "dirty", "bomb", "’", "remains", "“", "very", "high", ",", "”", "and", "that", "Ukraine", "“", "has", "the", "opportunity", "”", "and", "“", "has", "every", "reason", "to", "use", "it", "."],
    ["Earlier", "on", "Tuesday", ",", "in", "a", "letter", "to", "UN", "Secretary-General", "Antonio", "Guterres", ",", "the", "Russian", "mission", "’", "s", "head", ",", "Vassily", "Nebenzia", ",", "said", "that", "Moscow", "would", "consider", "the", "use", "of", "a", "‘", "dirty", "bomb", "’", "by", "Ukraine", "“", "an", "act", "of", "nuclear", "terrorism", ".", "”"],
    ["Meanwhile", ",", "Ukrainian", "Foreign", "Minister", "Dmitry", "Kuleba", "earlier", "called", "the", "Russian", "allegations", "“", "as", "absurd", "as", "they", "are", "dangerous", ".", "”"],
    ["He", "also", "noted", "that", "“", "Russians", "often", "accuse", "others", "of", "what", "they", "plan", "themselves", ".", "”"],
    ["On", "Tuesday", ",", "the", "minister", "revealed", "that", "Ukraine", "had", "invited", "IAEA", "inspectors", "to", "come", "and", "to", "“", "prove", "that", "Ukraine", "has", "neither", "any", "dirty", "bombs", "nor", "plans", "to", "develop", "them", ".", "”"],
    ["“", "Good", "cooperation", "with", "IAEA", "and", "partners", "allows", "us", "to", "foil", "Russia", "’", "s", "‘", "dirty", "bomb", "’", "disinfo", "campaign", ",", "”", "Kuleba", "said", "."]
])

manually_split_sentence(154,5,[
    ["WHO", "Tedros", "describes", "Disease", "X", "as", "a", "blueprint", "at", "a", "panel", "discussion", "at", "WEF24", "—", "Tamara", "Ugolini", "🇨🇦", "(", "@", "TamaraUgo", ")", "January", "17", ",", "2024", "."],
    ["He", "says", "that", "COVID", "was", "the", "first", "Disease", "X", "and", "we", "“", "need", "a", "placeholder", "for", "diseases", "we", "don", "’", "t", "know", ",", "”", "including", "dedication", "to", "private", "sector", "drug", "research", "and", "development", "."],
    ["Disease", "X", "serves", "as", "a", "“", "placeholder", "for", "the", "diseases", "we", "don", "’", "t", "know", ",", "”", "and", "it", "begins", "with", "private-sector", "research", "and", "development", "to", "test", "drugs", "and", "“", "other", "things", ".", "”"],
    ["Tedros", "stressed", "that", "the", "next", "pandemic", "is", "“", "not", "a", "matter", "of", "if", ",", "but", "rather", "when", ",", "”", "while", "noting", "that", "COVID-19", "was", "the", "original", "Disease", "X", ",", "in", "which", "they", "were", "able", "to", "facilitate", "the", "Pandemic", "Fund", "in", "partnership", "with", "the", "World", "Bank", "."]
])

# drop them from the list of unusual sentences
unusual_sentences = [i for j, i in enumerate(unusual_sentences) if j not in [0, 1, 2, 3, 8, 9]]

print(f"There are {len(unusual_sentences)} sentences of unusual length after manually splitting overly long sentences.")


#### Validating fixed unusual-length-sentences <a class="anchor" id="verify-remaining-unusual-sentences"></a>

We recompute the list of unusual sentences and print them. All unusually short and long sentences that remain are valid and meant to be kept.

In [None]:
# Find sentences with less than 3 words or more than 100 words
unusual_sentences = find_unusual_length_sentences(df_short)

print(f"After handling, there are {len(unusual_sentences)} unusual sentences left.")

# Display the unusual sentences
for entry in unusual_sentences:
    # Use single quotes inside the f-string: reusing the outer quote type only works on Python 3.12+
    print(f"Sentence length: {len(entry['sentence'])}, (Idx {entry['row_index']}, {entry['sentence_index']}), Sentence: {' '.join(entry['sentence'])}")

Now that the text is correctly segmented into sentences and words, we can proceed with text normalization.

### Text Normalization  <a class="anchor" id="text-normalization"></a>

  
Text normalization is the process of transforming text into a standard format, which typically involves:

- Converting text to lowercase
- Removing punctuation
- Removing stopwords
- Removing special characters and numbers
- Lemmatization or stemming

This process helps in reducing the complexity of the text and making it more uniform for further analysis or processing.

We will implement text normalization in the next steps.
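
The steps listed above can be sketched on a single tokenized sentence with plain Python. This is a minimal illustration rather than the notebook's implementation: the tiny `STOP_WORDS` set and the `normalize_sentence` helper are hypothetical stand-ins (the notebook uses NLTK's English stopword list and Stanza for lemmatization, which is omitted here).

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses nltk.corpus.stopwords.words('english').
STOP_WORDS = {"the", "of", "is", "was", "a", "and", "in", "to"}

def normalize_sentence(tokens):
    """Lowercase tokens, strip non-letters, and drop stopwords and empty tokens."""
    normalized = []
    for token in tokens:
        token = token.lower()
        # Removes punctuation, digits and any other character outside a-z
        token = re.sub(r"[^a-z]", "", token)
        if token and token not in STOP_WORDS:
            normalized.append(token)
    return normalized

example = ["The", "head", "of", "the", "Navy", "was", "replaced", "in", "2024", "."]
print(normalize_sentence(example))
```

Tokens that consist only of digits or punctuation become empty strings after the regex substitution and are dropped together with the stopwords.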

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

print(torch.version.cuda)  # Shows CUDA version if available
print(torch.cuda.is_available())  # Checks if CUDA is available



We check whether an NVIDIA graphics card (and the necessary CUDA packages) is available so that it can be used later for lemmatization; on the CPU alone, lemmatization took around 15 minutes per run.

In [None]:
def text_normalization(df, column_name):
    """
    Text normalization with optimized batch processing for nested lists of tokens
    with proper abbreviation handling
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    
    nlp = stanza.Pipeline('en',
                         processors='tokenize,lemma',
                         device=device,
                         use_gpu=True,
                         batch_size=4096,
                         tokenize_batch_size=4096,
                         tokenize_pretokenized=True,
                         download_method=None
                         )
    
    # Depending on the graphics card the batch size can be adjusted
    
    stop_words = set(stopwords.words('english'))
    
    # Common abbreviations and their normalized forms
    abbreviations = {
        'p.m.': 'pm',
        'a.m.': 'am',
        'e.g.': 'eg',
        'i.e.': 'ie',
        'etc.': 'etc',
        'vs.': 'vs',
        'mr.': 'mr',
        'mrs.': 'mrs',
        'dr.': 'dr',
        'prof.': 'prof',
        'u.s.': 'us',
        'u.k.': 'uk',
        'n.y.': 'ny',
        'l.a.': 'la',
        'st.': 'st',
        'inc.': 'inc',
        'ltd.': 'ltd',
        'co.': 'co',
        'corp.': 'corp',
        'avg.': 'avg',
        'approx.': 'approx'
    }
    
    def normalize_token(token):
        """Normalize a single token"""
        if not isinstance(token, str):
            return ''
            
        # Convert to lowercase first
        token = token.lower().strip()
        
        # Check if it's an abbreviation
        if token in abbreviations:
            return abbreviations[token]
            
        # Remove special characters and numbers for non-abbreviations
        token = re.sub(r'[^a-z]', '', token)
        
        return token
    
    def clean_text(nested_tokens):
        """Clean and preprocess nested list of tokens"""
        if not isinstance(nested_tokens, list):
            return []
        
        cleaned_tokens = []
        for sentence in nested_tokens:
            if isinstance(sentence, list):
                for token in sentence:
                    # Normalize the token
                    normalized = normalize_token(token)
                    # Check if token is not empty and not a stopword
                    if normalized and normalized not in stop_words:
                        cleaned_tokens.append(normalized)
        
        return cleaned_tokens
    
    def process_text(tokens):
        """Process a single text through Stanza"""
        try:
            if not tokens:
                return []
            # Join tokens into a single string for processing
            text = ' '.join(tokens)
            doc = nlp(text)
            # Extract lemmas and filter stopwords
            lemmas = []
            for sent in doc.sentences:
                for word in sent.words:
                    lemma = word.lemma.lower()
                    # Drop lemmas that are stopwords (lemmatization can map a word onto one)
                    if lemma not in stop_words:
                        lemmas.append(lemma)
            return lemmas
        except Exception as e:
            print(f"Error processing text: {str(e)}")
            return []
    
    def process_batch(batch_tokens):
        """Process a batch of nested token lists"""
        results = []
        for tokens in batch_tokens:
            # Clean and flatten tokens
            cleaned_tokens = clean_text(tokens)
            # Process cleaned tokens
            normalized = process_text(cleaned_tokens)
            # Ensure all tokens are properly normalized
            normalized = [t for t in map(normalize_token, normalized) if t]
            results.append(normalized)
            
        return results
    
    # Process in batches
    batch_size = 50
    normalized_tokens = []
    total_batches = (len(df) + batch_size - 1) // batch_size
    
    print(f"Starting processing of {len(df)} rows in {total_batches} batches")
    
    for i in tqdm(range(0, len(df), batch_size), desc="Normalizing text"):
        batch_df = df.iloc[i:i + batch_size]
        batch_tokens = batch_df[column_name].tolist()
        
        # Print sample of first batch for debugging
        if i == 0:
            print("\nSample processing:")
            sample_tokens = batch_tokens[0][:5] if batch_tokens else []  # First 5 sentences of first row
            print(f"Original tokens: {sample_tokens}")
            # clean_text expects a list of sentences, so pass the sample directly
            cleaned = clean_text(sample_tokens)
            print(f"Cleaned tokens: {cleaned}")
            normalized = process_batch([sample_tokens])
            print(f"Normalized tokens: {normalized[0]}\n")
        
        normalized_batch = process_batch(batch_tokens)
        normalized_tokens.extend(normalized_batch)
    
    print(f"\nProcessing completed. Total normalized entries: {len(normalized_tokens)}")
    print(f"Non-empty normalized entries: {sum(1 for tokens in normalized_tokens if tokens)}")
    
    # Final check to ensure no punctuation or special characters remain
    final_tokens = []
    for tokens in normalized_tokens:
        cleaned = [token for token in tokens if token and not any(char in string.punctuation for char in token)]
        final_tokens.append(cleaned)
    
    # Update DataFrame with normalized tokens
    df_normalized = df.copy()
    df_normalized[f'{column_name}_normalized'] = final_tokens
    
    return df_normalized

#### Verifying text normalization  <a class="anchor" id="check-normalization"></a>

In [None]:


def check_text_normalization(df):
    """
    Validates if text normalization has been applied correctly
    
    Parameters:
    df (pandas.DataFrame): DataFrame containing original and normalized tokens
    
    Returns:
    dict: Validation results with detailed statistics and examples of any issues found
    """
    results = {
        'overall_status': 'PASS',
        'tests': {},
        'statistics': {},
        'issues_found': {},
        'sample_issues': {}
    }
    
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    
    def is_lowercase(text):
        return text.islower()
    
    def contains_punctuation(text):
        return any(char in string.punctuation for char in text)
    
    def contains_numbers(text):
        return bool(re.search(r'\d', text))
    
    def contains_special_chars(text):
        return bool(re.search(r'[^a-zA-Z\s]', text))
    
    def is_stopword(text):
        return text in stop_words
    
    # Initialize counters for statistics
    stats = {
        'total_original_tokens': 0,
        'total_normalized_tokens': 0,
        'uppercase_found': 0,
        'punctuation_found': 0,
        'numbers_found': 0,
        'special_chars_found': 0,
        'stopwords_found': 0
    }
    
    # Initialize issue tracking
    issues = {
        'uppercase_tokens': [],
        'punctuation_tokens': [],
        'number_tokens': [],
        'special_char_tokens': [],
        'stopword_tokens': []
    }
    
    # Check each row
    for idx, row in df.iterrows():
        if 'tokens_normalized' not in row:
            results['overall_status'] = 'FAIL'
            results['tests']['normalization_column_exists'] = False
            return results
        
        normalized_tokens = row['tokens_normalized']
        
        if not isinstance(normalized_tokens, list):
            continue
            
        # Check each normalized token
        for token in normalized_tokens:
            stats['total_normalized_tokens'] += 1
            
            # Check for uppercase
            if not is_lowercase(token):
                stats['uppercase_found'] += 1
                if len(issues['uppercase_tokens']) < 5:
                    issues['uppercase_tokens'].append((idx, token))
            
            # Check for punctuation
            if contains_punctuation(token):
                stats['punctuation_found'] += 1
                if len(issues['punctuation_tokens']) < 5:
                    issues['punctuation_tokens'].append((idx, token))
            
            # Check for numbers
            if contains_numbers(token):
                stats['numbers_found'] += 1
                if len(issues['number_tokens']) < 5:
                    issues['number_tokens'].append((idx, token))
            
            # Check for special characters
            if contains_special_chars(token):
                stats['special_chars_found'] += 1
                if len(issues['special_char_tokens']) < 5:
                    issues['special_char_tokens'].append((idx, token))
            
            # Check for stopwords
            if is_stopword(token):
                stats['stopwords_found'] += 1
                if len(issues['stopword_tokens']) < 5:
                    issues['stopword_tokens'].append((idx, token))
    
    # Calculate pass/fail for each test
    tests = {
        'uppercase_test': stats['uppercase_found'] == 0,
        'punctuation_test': stats['punctuation_found'] == 0,
        'numbers_test': stats['numbers_found'] == 0,
        'special_chars_test': stats['special_chars_found'] == 0,
        'stopwords_test': stats['stopwords_found'] == 0
    }
    
    # Update overall status
    if not all(tests.values()):
        results['overall_status'] = 'FAIL'
    
    # Calculate percentages for statistics
    total_tokens = stats['total_normalized_tokens']
    if total_tokens > 0:
        stats.update({
            'uppercase_percentage': (stats['uppercase_found'] / total_tokens) * 100,
            'punctuation_percentage': (stats['punctuation_found'] / total_tokens) * 100,
            'numbers_percentage': (stats['numbers_found'] / total_tokens) * 100,
            'special_chars_percentage': (stats['special_chars_found'] / total_tokens) * 100,
            'stopwords_percentage': (stats['stopwords_found'] / total_tokens) * 100
        })
    
    # Compile results
    results['tests'] = tests
    results['statistics'] = stats
    results['issues_found'] = {k: len(v) for k, v in issues.items()}
    results['sample_issues'] = issues
    
    # Print summary report
    print("\nText Normalization Validation Report")
    print("====================================")
    print(f"Overall Status: {results['overall_status']}")
    print("\nTest Results:")
    for test, passed in tests.items():
        print(f"- {test}: {'PASS' if passed else 'FAIL'}")
    
    print("\nStatistics:")
    print(f"- Total normalized tokens: {stats['total_normalized_tokens']}")
    if total_tokens > 0:
        print(f"- Uppercase tokens: {stats['uppercase_found']} ({stats['uppercase_percentage']:.2f}%)")
        print(f"- Tokens with punctuation: {stats['punctuation_found']} ({stats['punctuation_percentage']:.2f}%)")
        print(f"- Tokens with numbers: {stats['numbers_found']} ({stats['numbers_percentage']:.2f}%)")
        print(f"- Tokens with special characters: {stats['special_chars_found']} ({stats['special_chars_percentage']:.2f}%)")
        print(f"- Stopwords found: {stats['stopwords_found']} ({stats['stopwords_percentage']:.2f}%)")
    
    if results['overall_status'] == 'FAIL':
        print("\nSample Issues Found:")
        for issue_type, samples in issues.items():
            if samples:
                print(f"\n{issue_type}:")
                for idx, token in samples:
                    print(f"- Row {idx}: '{token}'")
    
    return results
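
The individual checks are simple string and regex predicates. The following minimal sketch shows how they behave on a few made-up tokens; the `samples` list is hypothetical and only meant to illustrate which check catches which kind of leftover.

```python
import re
import string

def contains_punctuation(text):
    # Only ASCII punctuation is covered by string.punctuation
    return any(char in string.punctuation for char in text)

def contains_numbers(text):
    return bool(re.search(r"\d", text))

def contains_special_chars(text):
    # Anything that is not an ASCII letter or whitespace
    return bool(re.search(r"[^a-zA-Z\s]", text))

# Made-up tokens for illustration
samples = ["navy", "Moscow", "covid-19", "don’t"]
for token in samples:
    print(
        token,
        token.islower(),
        contains_punctuation(token),
        contains_numbers(token),
        contains_special_chars(token),
    )
```

Note that the typographic apostrophe in `don’t` is not part of `string.punctuation`, so it is only caught by the special-character check.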



#### Print the results of the text normalization  <a class="anchor" id="result-printing"></a>

In [None]:
# Clear GPU memory
torch.cuda.empty_cache()



# Run the normalization
df_short = text_normalization(df_short, 'tokens')

# Verify the results
print("\nResults verification:")
print("Sample of normalized tokens (first 3 rows):")
print(df_short['tokens_normalized'].head(3))


results = check_text_normalization(df_short)

### CoNLL-U format  <a class="anchor" id="connlu-format"></a>
Next, we write a function that stores the data in CoNLL-U format to ensure *reproducibility* and *platform independence*. Each output file contains ten text parts in CoNLL-U format.
If execution of the cell is aborted, the currently open file is closed.
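
The grouping of ten text parts per file can be illustrated independently of Stanza. The following is a minimal sketch, assuming the `output_{n}.conllu` naming used in this notebook; `plan_output_files` is a hypothetical helper and not part of the notebook. It only computes which row positions would end up in which file.

```python
def plan_output_files(n_rows, parts_per_file=10):
    """Map each output file name to the row positions it would contain."""
    plan = {}
    for start in range(0, n_rows, parts_per_file):
        file_index = start // parts_per_file + 1
        end = min(start + parts_per_file, n_rows)
        plan[f"output_{file_index}.conllu"] = list(range(start, end))
    return plan

plan = plan_output_files(25)
for name, rows in plan.items():
    print(name, rows[0], "-", rows[-1])
```

Iterating over fixed-size slices in this way creates a file only when there is at least one row to write into it.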

In [None]:
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')

def convert_to_connlu(dataframe, column_name):
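    """
    Takes a dataframe and converts each row to CoNLL-U format. Each output file contains ten text parts
    in CoNLL-U format.
    """
    output_dir = os.path.join("..", "CoNLL")
    # Create the output directory if it does not exist yet, otherwise open() fails
    os.makedirs(output_dir, exist_ok=True)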
    
    file_index = 1
    sentence_count = 0
    total_sentences = 0
    output_file = os.path.join(output_dir, f"output_{file_index}.conllu")
    
    # Convert the text parts to CoNLL-U format and close the files again
    try:        
        f = open(output_file, "w", encoding="utf-8")
    
        for idx, row in dataframe.iterrows():
            sentence = " ".join(row[f"{column_name}_normalized"])
            doc = nlp(sentence)
            CoNLL.write_doc2conll(doc, f)
            f.write("\n")
            
            sentence_count += 1
            total_sentences += 1
            
            # Close the file after ten converted text parts and start the next CoNLL-U file
            if sentence_count >= 10:
                f.close()
                file_index += 1
                output_file = os.path.join(output_dir, f"output_{file_index}.conllu")
                f = open(output_file, "w", encoding="utf-8")
                sentence_count = 0 
    except KeyboardInterrupt:
        print("\nStopped running. Closing open files...")
    except Exception as e:
        print(f"\nAn error occurred: {e}")
    # If the cell gets aborted, any open file is
    # closed so that there is no remaining open file when the cell is stopped.
    finally:
        if not f.closed:
            f.close()
        
    created_files = len([name for name in os.listdir(output_dir) if name.startswith('output') and name.endswith('.conllu')])
    print(f"\nTotal sentences processed: {total_sentences}")
    print(f"Total files created: {created_files}")

Now we call the function *convert_to_connlu* on our dataframe *df_short* to obtain files in CoNLL-U format, each containing ten text parts.

In [None]:
#only run if the conll files are not already created
#convert_to_connlu(df_short, 'tokens')

Next, we create a function that loads the tokens into a dataframe, keeping the CoNLL-U fields as columns.

In [12]:
def extract_conllu_to_dataframe(directory):
    """
    Takes the directory where the CoNLL-U files are located and loops through each file, extracting each word with
    its corresponding CoNLL-U features. All tokens and their features are stored in a dataframe. Each word is assigned a word_id
    and each narrative a narrative_id.
    :returns df
    """
    columns = ['narrative_id', 'word_id', 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc']
    data = []
    narrative_id = 0

    # Loop through all files
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.conllu'):
            file_path = os.path.join(directory, filename)
            # Read with the same encoding the files were written in (UTF-8)
            with open(file_path, 'r', encoding='utf-8') as f:
                word_id = 0
                consecutive_comments = 0

                for line in f:
                    line = line.strip()

                    # Comment lines start with '#'. Two consecutive comment lines mark the start of a new narrative, so increment narrative_id
                    if line.startswith('#'):
                        consecutive_comments += 1
                        if consecutive_comments == 2:
                            narrative_id += 1
                            word_id = 0
                        continue
                    else:
                        consecutive_comments = 0

                    # Skip empty lines
                    if not line:
                        continue

                    parts = line.split('\t')
                    if len(parts) == 10:
                        word_id += 1
                        row = [narrative_id, word_id] + parts[1:]
                        data.append(row)

    df = pd.DataFrame(data, columns=columns)
    return df

# Extract all narratives from all CoNLL-U files into the dataframe df_connlu
directory_path = "../CoNLL"
df_connlu = extract_conllu_to_dataframe(directory_path)
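
For reference, each token line in a CoNLL-U file has ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines start with `#`, and sentences are separated by blank lines. The sketch below parses a small hand-written sample from a string to show that layout; the sample content and the `parse_conllu_text` helper are made up for illustration and do not come from the project's files.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

# Hand-written sample, not taken from the project's data
SAMPLE = (
    "# text = navy replace head\n"
    "# sent_id = 0\n"
    "1\tnavy\tnavy\tNOUN\tNN\tNumber=Sing\t2\tnsubj\t_\t_\n"
    "2\treplace\treplace\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\thead\thead\tNOUN\tNN\tNumber=Sing\t2\tobj\t_\t_\n"
    "\n"
)

def parse_conllu_text(text):
    """Return a list of sentences, each a list of dicts keyed by the ten CoNLL-U fields."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.split("\t")
        if len(parts) == len(FIELDS):
            current.append(dict(zip(FIELDS, parts)))
    if current:
        sentences.append(current)
    return sentences

sentences = parse_conllu_text(SAMPLE)
print([(tok["form"], tok["upos"], tok["deprel"]) for tok in sentences[0]])
```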