# Natural Language Processing Milestone 1

## Data Prep

**Uncertainties:**

- Is the current way of handling whether an article is UA or CC sensible?
- The proper numerical representation of the labels is questionable. Should the hierarchical structure somehow be kept? Is One-Hot-Encoding sensible?
- The two different taxonomies for the two topics lead to the question whether two different models should be trained, one for the ukraine war and one for climate change.

### Creating a first dataframe containing [filename, content, label, topic] for each datapoint

The documents belong to two different topics, ukraine war and climate change. We will show which topic each document belongs to by adding the column "topic", which is based on the abbreviations UA or CC in the document filenames.

In [None]:
%pip install --requirement requirements.txt
# When you import something new that is not in the requirements.txt file, please add it to the requirements.txt file and re-run the cell above.


import os
import pandas as pd
import fitz  # PyMuPDF
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer
nltk.download('stopwords')
from stanza.utils.conll import CoNLL
import re
from nltk.stem import PorterStemmer
import stanza
import torch
import nltk
import string
from tqdm import tqdm

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:


# Define the paths
documents_path = "../training_data_16_October_release/EN/raw-documents"
annotations_file = "../training_data_16_October_release/EN/subtask-2-annotations.txt"

# Read the annotations file
annotations = pd.read_csv(annotations_file, sep='\t', header=None, names=['filename', 'narrative', 'subnarrative'])

# Remove all occurrences of "CC: " and "URW: " from narratives and subnarratives
annotations['narrative'] = annotations['narrative'].str.replace(r'(CC: |URW: )', '', regex=True)
annotations['subnarrative'] = annotations['subnarrative'].str.replace(r'(CC: |URW: )', '', regex=True)

# Split the narratives and subnarratives into lists
annotations['narrative'] = annotations['narrative'].str.split(';')
annotations['subnarrative'] = annotations['subnarrative'].str.split(';')

# Initialize a list to store the data
data = []

# Iterate over the annotations and read the corresponding documents
for _, row in annotations.iterrows():
    filename = row['filename']
    narratives = row['narrative']
    subnarratives = row['subnarrative']
    
    # Read the document content
    with open(os.path.join(documents_path, filename), 'r', encoding='utf-8') as file:
        content = file.read()
    
    # Determine the topic based on the filename
    topic = "UA" if "UA" in filename else "CC"
    
    # Append the document content, narratives, subnarratives, and topic to the data list
    data.append({
        'filename': filename,
        'content': content,
        'narratives': narratives,
        'subnarratives': subnarratives,
        'topic': topic
    })

# Convert the data list to a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df.head()

Unnamed: 0,filename,content,narratives,subnarratives,topic
0,EN_UA_103861.txt,The World Needs Peacemaker Trump Again \n\n by...,[Other],[Other],UA
1,EN_UA_103667.txt,Desperation and Diplomacy: North Korea's Tech ...,[Other],[Other],UA
2,EN_UA_021270.txt,"Ukraine's Fate Will Be Decided In Coming Year,...","[Speculating war outcomes, Discrediting Ukrain...","[Speculating war outcomes: Other, Discrediting...",UA
3,EN_UA_103403.txt,Russia Stages Major Airstrike on Ukraine; One ...,[Other],[Other],UA
4,EN_CC_100145.txt,Strategy needed to preserve water resources in...,[Other],[Other],CC


The narratives row seems to be redundant, since the subnarrative row is structured in the way *narrative:subnarrative*. To check whether this is the case for all labels, we use the following function before removing the redundancy.

In [3]:
def check_pattern(row):
    narratives = row['narratives']
    subnarratives = row['subnarratives']
    
    for narrative, subnarrative in zip(narratives, subnarratives):
        if not subnarrative.startswith(narrative + ":"):
            return (row.name, narrative, subnarrative)
    return None

# Apply the function to each row and collect the results
pattern_check_results = df.apply(check_pattern, axis=1)

# Filter out the rows where the pattern does not hold
problematic_rows = pattern_check_results[pattern_check_results.notnull()]

# Set display options to avoid truncation
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

# Display the problematic rows
print("Problematic rows where the pattern does not hold:")
print(problematic_rows)

Problematic rows where the pattern does not hold:
0        (0, Other, Other)
1        (1, Other, Other)
3        (3, Other, Other)
4        (4, Other, Other)
5        (5, Other, Other)
7        (7, Other, Other)
9        (9, Other, Other)
10      (10, Other, Other)
16      (16, Other, Other)
17      (17, Other, Other)
19      (19, Other, Other)
20      (20, Other, Other)
21      (21, Other, Other)
23      (23, Other, Other)
27      (27, Other, Other)
28      (28, Other, Other)
30      (30, Other, Other)
32      (32, Other, Other)
33      (33, Other, Other)
34      (34, Other, Other)
35      (35, Other, Other)
37      (37, Other, Other)
39      (39, Other, Other)
42      (42, Other, Other)
44      (44, Other, Other)
45      (45, Other, Other)
48      (48, Other, Other)
49      (49, Other, Other)
50      (50, Other, Other)
51      (51, Other, Other)
52      (52, Other, Other)
53      (53, Other, Other)
56      (56, Other, Other)
60      (60, Other, Other)
63      (63, Other, Other)
64   

Since the only exception from the pattern seems to be *narrative other, subnarrative other* cases, we can delete the narratives before the ":" in the subnarratives column.

In [4]:
def remove_redundant_narratives(row):
    narratives = row['narratives']
    subnarratives = row['subnarratives']
    
    cleaned_subnarratives = []
    for narrative, subnarrative in zip(narratives, subnarratives):
        if subnarrative.startswith(narrative + ":"):
            cleaned_subnarrative = subnarrative[len(narrative) + 1:].strip()
            cleaned_subnarratives.append(cleaned_subnarrative)
        else:
            cleaned_subnarratives.append(subnarrative)
    
    return cleaned_subnarratives

# Apply the function to each row to clean the subnarratives
df['subnarratives'] = df.apply(remove_redundant_narratives, axis=1)

# Display the updated DataFrame
print("Updated DataFrame with cleaned subnarratives:")
df.head()

Updated DataFrame with cleaned subnarratives:


Unnamed: 0,filename,content,narratives,subnarratives,topic
0,EN_UA_103861.txt,"The World Needs Peacemaker Trump Again \n\n by Jeff Crouere, The Liberty Daily:\n\nThe world is in total chaos after 39 months of the Biden presidency. The southern border of our country is porous and millions of individuals from around the world have descended on our country.\n\nThese “undocumented migrants” include terrorists, drug dealers, and intelligence agents of countries such as our enemy, China. It should alarm every American that 22,233 Chinese nationals have illegally entered the United States since the beginning of the fiscal year in October. If this rate continues, this year’s total will easily top the 24,125 Chinese nationals who illegally entered the country last year.\n\nTRUTH LIVES on at https://sgtreport.tv/\n\nThere has been an astounding 6,300% increase in the number of Chinese nationals illegally entering the country in the last few years. Since China is a communist nation, these individuals are not freely traveling to the United States for a “better life.” U.S. Senator Roger Marshall (R-KS) believes the influx is due to “the direction of the CCP (Chinese Communist Party)” and it involves “espionage” as well as “stealing military and economic secrets.”\n\nLast year, China sent a spy balloon over our country, and it was not shot down until it completed its trek across many states. In addition, Chinese investors are buying American farmland for nefarious reasons, and it was confirmed that China operated “police stations” in the United States to monitor their citizens in our country.\n\nChina is constantly threatening Taiwan and its other neighbors while building up its military forces. Their proxy state, North Korea, has started regularly testing long-range ballistic missiles that concern their neighboring countries.\n\nUkraine is in the third year of a bitter war with Russia. Last year, a United States report estimated that the total number of injuries and deaths in the war neared 500,000. There is no end in sight to the war as no peace talks are planned and the United States Congress is on the verge of allocating more military aid to Ukraine.\n\nOn October 7, Israel was invaded by Hamas resulting in the death of 1,200 innocent people. It was the deadliest attack on Israel since its founding in 1948. Hamas abducted over 240 hostages and 129 have not been returned home. These hostages either remain in captivity or are already deceased.\n\nAs Israel has responded to the Hamas invasion by sending military forces into Gaza, many Democrats have been critical. Sadly, innocent civilians have been killed in the conflict, but how can a nation survive with a terrorist organization operating on its border?\n\nTo further strengthen their security, on April 1, Israel bombed the Iranian embassy in Damascus, Syria. Among the seven officials killed in the airstrike were Iranian Brig. Gen. Mohammad Reza Zahedi, who helped plan the October 7 massacre in Israel.\n\nTo retaliate for this strike, Iran launched over 300 drones and missiles into Israel on Saturday. Fortunately, almost all the strikes were intercepted by Israeli forces, with help from the United States, Jordan, and other countries.\n\nAs the world waits to see if Israeli Prime Minister Benjamin Netanyahu will respond to this attack, the Iranians are continuing to issue threats. In an interview on their state TV network, Iranian Major General Mohammad Bagheri said, “Our response will be much larger than tonight’s military action if Israel retaliates against Iran.”\n\nIt is being widely reported by Axios, CNN, and other media outlets that President Joe Biden is urging Netanyahu not to respond and, if he does, the United States will not be involved. This response is disturbing to U.S. Senator Marco Rubio (R-FL), who believes the world is witnessing the most dangerous period for the Middle East since 1973. He said, “What I don’t understand is why Joe Biden and the administration would leak to the media the contents of a conversation in which he tells Netanyahu he doesn’t think that Netanyahu should respond at all. It is the continuing part of the public game they are playing which frankly encourages Iran and Hezbollah.”\n\nBiden has also encouraged Iran by relaxing our economic sanctions, which allowed additional oil revenue to flow into their coffers. This led to their renewed financial support for terrorist organizations like Hamas. Without Iran’s generous backing, Hamas would not have been able to launch their deadly October 7 attack against Israel.\n\nNot content to allow Iran to prosper from oil sales, Biden also released $16 billion in funds for Iran, partially for the return of six American hostages. Terror regimes should never be rewarded for taking hostages, but this is exactly what happened by the Biden administration’s unwise decision.\n\nAs the world deals with multiple international wars and unprecedented chaos in the Middle East, it is useful to remember what was happening when Donald Trump was our President.\n\nRead More @ TheLibertyDaily.com",[Other],[Other],UA
1,EN_UA_103667.txt,"Desperation and Diplomacy: North Korea's Tech Hunt in Russia Amid Sanctions \n\n For several decades, North Korea has masterfully balanced its relations with the Chinese Communist Party (CCP) and Russia, maintaining a relationship with the latter that is both complex and ever-changing.\n\nThe dynamics have shifted notably since Russia found itself sanctioned and isolated by the international community following its invasion of Ukraine in February 2022.\n\nSince then, North Korea and Russia have grown increasingly close, as recently seen when North Korean leader Kim Jong Un visited eastern Russia, where he held a summit with Russian President Vladimir Putin last week.\n\nAfter a six-day visit, Mr. Kim returned to Pyongyang on Sunday.\n\nSanctioned by the international community, North Korea desperately needs Russian technology for nuclear weapons, satellite development, and food production.\n\nSimilarly, Russia, also facing sanctions, urgently seeks to replenish its dwindling front-line supplies with large quantities of ammunition from North Korea.\n\nBefore the summit, when asked whether Russia would assist North Korea in developing artificial satellites, Mr. Putin confirmed this was a primary reason for their meeting.\n\nBoth sides appear willing to overlook the threat of escalated international sanctions against them.\n\nDuring their summit, Mr. Kim vowed to “fully and unconditionally support” Moscow.\n\nRocket technology holds particular interest for Mr. Kim and his regime.\n\nFrom Outward Support to Clandestine CyberattacksAccording to reports, North Korea has long eyed Russian technology, resorting to cyber theft to obtain it, and it is something that Moscow now appears ready to overlook despite recent reports such as one by Microsoft’s Threat Analysis Center (MTAC).\n\nThe MTAC report indicates that between March last year and this March, North Korean cyber operatives launched attacks on Russian aerospace research facilities and penetrated academic institutions engaged in research. Additionally, the operatives sent phishing emails to personnel within Russian diplomatic agencies.\n\nThe study further revealed that the countries most frequently targeted by North Korean cyberattacks during the same period included South Korea, Israel, Germany, and Russia.\n\nThe hacking groups, identified as ScarCruft and Lazarus, covertly installed digital backdoors within the company’s systems to exfiltrate sensitive information. While the full extent of the hackers’ achievements remains unclear, North Korea announced significant advancements in its ballistic missile program mere months after the cyber-attacks were initiated.\n\nSPUTNIX, a private enterprise affiliated with the Russian Academy of Sciences’ Space Research Institute, lost a substantial amount of information to these cyber incursions. The report speculates that North Korea may have stolen critical technology related to the design of ultra-small satellite bodies.\n\nExperts suggest North Korea’s successful launches of rockets equipped with reconnaissance satellites this year likely capitalized on its prior hacking exploits. These cyber activities appear crucial in advancing the country’s space technology.\n\nAdditionally, the report disclosed that in 2020, North Korean hackers penetrated the internal network of Russia’s Almaz-Antey, a leading manufacturer of surface-to-air missiles. The intruders pilfered various information, including developer personal data and proprietary details on missile components.\n\nIn 2019, North Korean cyber operatives also exfiltrated design blueprints from Russia’s Uralvagonzavod tank factory, the entity responsible for producing Russia’s next-generation T-14 Armata battle tank.\n\nMoreover, the report highlighted the frequency of North Korean cyber-attacks against Russian defense corporations specializing in avant-garde weapon systems like hypersonic technologies and intercontinental ballistic missiles (ICBMs).\n\nMilitary commentator Xia Loshan told The Epoch Times on Sept. 10 that both North Korea and Russia are in a desperate situation.\n\n“While North Korea is eager to acquire military technology, the ultimate decision is not in their hands,” Mr. Xia stated, emphasizing that the alliance is precarious and likely unsustainable in the face of stringent international sanctions.\n\nIndependent analyst Zhuge Mingyang also weighed in.\n\n“In a relationship marked more by utilitarian needs than genuine alliance, both nations recognize the other’s motivations,” he told The Epoch Times.\n\n“Moscow likely knows full well that Pyongyang is siphoning off its military technology but seems willing to turn a blind eye in exchange for much-needed ammunition—a clear example of a ‘friendship’ based on mutual expediency.”",[Other],[Other],UA
2,EN_UA_021270.txt,"Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits \n\n Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits\n\nIn surprisingly blunt words, a top aide to Ukrainian President Volodymyr Zelensky has warned that the coming year will essentially decide the fate of Ukraine and its war with Russia.\n\n""A turning point in the war is approaching,"" Andrii Yermak, who serves as chief of staff for the Office of the President of Ukraine, said Monday. ""The next year will be decisive in this regard."" He issued the words while appealing for more urgent aid from Washington in an address to the hawkish DC-based Hudson Institute think tank.\n\nYermak sought to assure the audience that Zelensky has ""a clear plan"" forward even as Western media has by and large soured on Kiev's prospects for success. Much of this is about Zelensky sending envoys to do damage control in Washington at a moment the US administration's focus is off Ukraine and on Gaza events instead.\n\nHe described advancing plans for ""the development of our defense industry, and the deploying of our own arms production. But [that] will be later.""\n\nBut he quickly pivoted to an immediate need for more ""weapons right now""--describing that ""Russia still has air superiority. It is still capable of producing missiles, doing evasion of sanctions…And we especially need air defense systems.""\n\nWithout doubt, the Zelensky admin is in damage control after eyebrow-raising comments were issued to The Economist early this month by Ukraine's top commander, who admitted there will be no breakthrough and the battlefield situation is in a stalemate. The New York Times had characterized his remarks as ""the first time a top Ukrainian commander said the fighting had reached an impasse.""\n\nSo now Zelensky appears to be dispatching his envoys to calm Washington jitters over all the ""bad news"" of late out of Ukraine.\n\nYermak also sought to assure the Hudson Institute conference that more billions given to Ukraine won't be ""charity"" but is instead an ""investment"" in America's ""global leadership.""\n\nHe further emphasized Zelensky's continued rejection of ceasefire talks with Russia, unless it's purely on Kiev's terms. ""We seek peace, but not just any peace. In our case, ending the war through compromise is nothing more than pausing it. Ukraine will not repeat the mistake of Minsk,"" Yermak said.\n\nWatch the full Hudson Institute speech below:","[Speculating war outcomes, Discrediting Ukraine, Discrediting the West, Diplomacy, Praise of Russia, Discrediting the West, Diplomacy]","[Other, Situation in Ukraine is hopeless, West is tired of Ukraine, Praise of Russian military might, The West does not care about Ukraine, only about its interests]",UA
3,EN_UA_103403.txt,"Russia Stages Major Airstrike on Ukraine; One Missile Enters Polish Airspace \n\n KYIV—Russia struck critical infrastructure in Ukraine’s western region of Lviv with missiles early on March 24, Kyiv said, in a major airstrike that saw one Russian cruise missile briefly fly into Polish airspace, according to Warsaw.\n\nMoscow launched 57 missiles and drones in the attack that also targeted Kyiv, two days after the largest aerial bombardment of Ukraine’s energy system in more than two years of full-scale war, Kyiv said.\n\n“There were two preliminary hits on the same critical infrastructure facility that the occupiers targeted at night,” Lviv’s regional governor, Maksym Kozytskyi, wrote on the Telegram messaging app.\n\nThe strike used Kinzhal hypersonic missiles, which are harder to shoot down, he added, without identifying the facility.\n\nThe energy ministry said equipment caught fire when a critical energy facility in the Lviv region was attacked, causing it to lose power. It was unclear whether they were talking about the same facility.\n\nAir defences destroyed 18 of 29 inbound missiles and 25 of 28 attack drones, the Ukranian air force said.\n\nThere were almost no details about what had been damaged, but the targeting of critical infrastructure could indicate that Russia is trying to keep up pressure on the energy system after its strikes caused widespread blackouts on March 22.\n\nThe energy ministry said that Ukraine, which has been exporting power in recent weeks, had sharply increased imports of electricity and stopped exports on March 24 after attacks on the energy system.\n\nSeveral explosions rang out in Kyiv in the early hours as air defences destroyed about a dozen missiles over the capital and in its vicinity, according to Serhiy Popko, head of Kyiv’s military administration.\n\nThere was only minor damage from the attack, he said.\n\nSmall groups of people huddled for safety underground in a central Kyiv metro station in the early hours, some of them sleeping on camping mats.\n\nMoscow has been pounding Ukraine for days in attacks portrayed by Moscow as revenge for Ukrainian attacks that were conducted during Russia’s presidential election.\n\nThe wreckage of a downed Kh-55 cruise missile was found in a Kyiv park, officials said.\n\n“For the third pre-dawn morning this week, all of Ukraine is under an air alert and has been advised to seek shelter,” U.S. Ambassador Bridget Brink posted on social media platform X.\n\nPolish AirspacePoland’s armed forces said a Russian cruise missile launched at the region of Lviv had violated Poland’s airspace.\n\n“The object entered Polish space near the town of Oserdow [Lublin Voivodeship] and stayed there for 39 seconds,” it said on X. “During the entire flight, it was observed by military radar systems.”\n\nPoland’s army spokesperson, Jacek Goryszewski, told reporters that the missile travelled about 2 kilometers (1.2 miles) into Polish airspace before returning to Ukraine.\n\nThere was no immediate comment from Russia. Warsaw said it would demand an explanation from Moscow.\n\nPolish Defense Minister Wladyslaw Kosiniak-Kamysz said Warsaw would continue to support Ukraine both militarily and on the humanitarian side.",[Other],[Other],UA
4,EN_CC_100145.txt,"Strategy needed to preserve water resources in Pakistan \n\n ISLAMABAD-Pakistan needs to chalk out an effective and feasible plan to preserve its water resources and stop the depletion of underground water, WealthPK reported.\n\nAccording to an expert, water is one of the basic necessities of life and Pakistan is blessed with an abundance of water resources. However, the country’s water resources are depleting with the passage of time. It is feared that the entire world will face an acute water shortage in the coming years. Saiqa Imran, the deputy director of the Pakistan Council of Research and Water Reservoirs (PCRWR), told WealthPK that the country was feared to face an acute shortage of fresh water. “The quantity of fresh water is quite limited in the country,” she added. According to her, climate change is the primary cause of the depletion of water resources. She said that the shortage of water triggered concerns about the future availability of water for the world’s exponentially rising population.\n\nSaiqa Imran said that in most cases, the water table was constantly going down as a result of the frequent pumping of water from the ground. “With a growing world population, the more we pump water from the ground at a rapid rate, the harder it becomes to get the amount of water we need because we pump the groundwater faster than it can be replenished,” she added. She said that inefficient water distribution and mismanagement were the main causes of the water shortage. “Pakistan has one of the largest contiguous irrigation systems in the world where more than 93% of water is used by the agriculture sector, 5% by the domestic sector and 2% by the industrial sector,” said the expert.\n\nShe said that domestic and industrial sectors would use 15% more water by 2025 than their current consumption. She said that the agriculture sector was the biggest user of water as modern irrigation techniques were not adopted in Pakistan but its contribution to the national economy was declining with the passage of time. Saiqa Imran said that the world’s average storage capacity reached 40% but Pakistan could conserve only up to 10% of its annual river water due to insufficient space. “People have turned to use underground water as a result of the shortage of the most essential commodity. The indiscriminate over-pumping in the cities in the absence of regulatory bodies has also led to the depletion of groundwater,” she added.\n\nAccording to her, there is a dire need to devise a proper pricing mechanism for water usage in any sector. She said that incentives should also be in place to encourage people to use water efficiently. The expert said that the country could take a variety of technical and management measures to conserve water at all levels, reduce irrigation losses, encourage farmers to adopt more efficient irrigation methods by creating a regulatory framework, launch licensing water, and implement integrated water resource management. “Under the principle of private sector participation and optimal water pricing, the policymakers should rethink water policy to encourage wastewater recycling,” Saiqa Imran told WealthPK.",[Other],[Other],CC


In [5]:
def create_narrative_subnarrative_pairs(row):
    narratives = row['narratives']
    subnarratives = row['subnarratives']
    
    pairs = []
    for narrative, subnarrative in zip(narratives, subnarratives):
        if subnarrative.startswith(narrative + ":"):
            cleaned_subnarrative = subnarrative[len(narrative) + 1:].strip()
        else:
            cleaned_subnarrative = subnarrative
        pairs.append({'narrative': narrative, 'subnarrative': cleaned_subnarrative})
    
    return pairs

# Apply the function to each row to create narrative-subnarrative pairs
df['narrative_subnarrative_pairs'] = df.apply(create_narrative_subnarrative_pairs, axis=1)

# Drop the original narratives and subnarratives columns if no longer needed
df = df.drop(columns=['narratives', 'subnarratives'])

# Display the updated DataFrame
print("Updated DataFrame with narrative-subnarrative pairs:")
df.head()

Updated DataFrame with narrative-subnarrative pairs:


Unnamed: 0,filename,content,topic,narrative_subnarrative_pairs
0,EN_UA_103861.txt,"The World Needs Peacemaker Trump Again \n\n by Jeff Crouere, The Liberty Daily:\n\nThe world is in total chaos after 39 months of the Biden presidency. The southern border of our country is porous and millions of individuals from around the world have descended on our country.\n\nThese “undocumented migrants” include terrorists, drug dealers, and intelligence agents of countries such as our enemy, China. It should alarm every American that 22,233 Chinese nationals have illegally entered the United States since the beginning of the fiscal year in October. If this rate continues, this year’s total will easily top the 24,125 Chinese nationals who illegally entered the country last year.\n\nTRUTH LIVES on at https://sgtreport.tv/\n\nThere has been an astounding 6,300% increase in the number of Chinese nationals illegally entering the country in the last few years. Since China is a communist nation, these individuals are not freely traveling to the United States for a “better life.” U.S. Senator Roger Marshall (R-KS) believes the influx is due to “the direction of the CCP (Chinese Communist Party)” and it involves “espionage” as well as “stealing military and economic secrets.”\n\nLast year, China sent a spy balloon over our country, and it was not shot down until it completed its trek across many states. In addition, Chinese investors are buying American farmland for nefarious reasons, and it was confirmed that China operated “police stations” in the United States to monitor their citizens in our country.\n\nChina is constantly threatening Taiwan and its other neighbors while building up its military forces. Their proxy state, North Korea, has started regularly testing long-range ballistic missiles that concern their neighboring countries.\n\nUkraine is in the third year of a bitter war with Russia. Last year, a United States report estimated that the total number of injuries and deaths in the war neared 500,000. There is no end in sight to the war as no peace talks are planned and the United States Congress is on the verge of allocating more military aid to Ukraine.\n\nOn October 7, Israel was invaded by Hamas resulting in the death of 1,200 innocent people. It was the deadliest attack on Israel since its founding in 1948. Hamas abducted over 240 hostages and 129 have not been returned home. These hostages either remain in captivity or are already deceased.\n\nAs Israel has responded to the Hamas invasion by sending military forces into Gaza, many Democrats have been critical. Sadly, innocent civilians have been killed in the conflict, but how can a nation survive with a terrorist organization operating on its border?\n\nTo further strengthen their security, on April 1, Israel bombed the Iranian embassy in Damascus, Syria. Among the seven officials killed in the airstrike were Iranian Brig. Gen. Mohammad Reza Zahedi, who helped plan the October 7 massacre in Israel.\n\nTo retaliate for this strike, Iran launched over 300 drones and missiles into Israel on Saturday. Fortunately, almost all the strikes were intercepted by Israeli forces, with help from the United States, Jordan, and other countries.\n\nAs the world waits to see if Israeli Prime Minister Benjamin Netanyahu will respond to this attack, the Iranians are continuing to issue threats. In an interview on their state TV network, Iranian Major General Mohammad Bagheri said, “Our response will be much larger than tonight’s military action if Israel retaliates against Iran.”\n\nIt is being widely reported by Axios, CNN, and other media outlets that President Joe Biden is urging Netanyahu not to respond and, if he does, the United States will not be involved. This response is disturbing to U.S. Senator Marco Rubio (R-FL), who believes the world is witnessing the most dangerous period for the Middle East since 1973. He said, “What I don’t understand is why Joe Biden and the administration would leak to the media the contents of a conversation in which he tells Netanyahu he doesn’t think that Netanyahu should respond at all. It is the continuing part of the public game they are playing which frankly encourages Iran and Hezbollah.”\n\nBiden has also encouraged Iran by relaxing our economic sanctions, which allowed additional oil revenue to flow into their coffers. This led to their renewed financial support for terrorist organizations like Hamas. Without Iran’s generous backing, Hamas would not have been able to launch their deadly October 7 attack against Israel.\n\nNot content to allow Iran to prosper from oil sales, Biden also released $16 billion in funds for Iran, partially for the return of six American hostages. Terror regimes should never be rewarded for taking hostages, but this is exactly what happened by the Biden administration’s unwise decision.\n\nAs the world deals with multiple international wars and unprecedented chaos in the Middle East, it is useful to remember what was happening when Donald Trump was our President.\n\nRead More @ TheLibertyDaily.com",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
1,EN_UA_103667.txt,"Desperation and Diplomacy: North Korea's Tech Hunt in Russia Amid Sanctions \n\n For several decades, North Korea has masterfully balanced its relations with the Chinese Communist Party (CCP) and Russia, maintaining a relationship with the latter that is both complex and ever-changing.\n\nThe dynamics have shifted notably since Russia found itself sanctioned and isolated by the international community following its invasion of Ukraine in February 2022.\n\nSince then, North Korea and Russia have grown increasingly close, as recently seen when North Korean leader Kim Jong Un visited eastern Russia, where he held a summit with Russian President Vladimir Putin last week.\n\nAfter a six-day visit, Mr. Kim returned to Pyongyang on Sunday.\n\nSanctioned by the international community, North Korea desperately needs Russian technology for nuclear weapons, satellite development, and food production.\n\nSimilarly, Russia, also facing sanctions, urgently seeks to replenish its dwindling front-line supplies with large quantities of ammunition from North Korea.\n\nBefore the summit, when asked whether Russia would assist North Korea in developing artificial satellites, Mr. Putin confirmed this was a primary reason for their meeting.\n\nBoth sides appear willing to overlook the threat of escalated international sanctions against them.\n\nDuring their summit, Mr. Kim vowed to “fully and unconditionally support” Moscow.\n\nRocket technology holds particular interest for Mr. Kim and his regime.\n\nFrom Outward Support to Clandestine CyberattacksAccording to reports, North Korea has long eyed Russian technology, resorting to cyber theft to obtain it, and it is something that Moscow now appears ready to overlook despite recent reports such as one by Microsoft’s Threat Analysis Center (MTAC).\n\nThe MTAC report indicates that between March last year and this March, North Korean cyber operatives launched attacks on Russian aerospace research facilities and penetrated academic institutions engaged in research. Additionally, the operatives sent phishing emails to personnel within Russian diplomatic agencies.\n\nThe study further revealed that the countries most frequently targeted by North Korean cyberattacks during the same period included South Korea, Israel, Germany, and Russia.\n\nThe hacking groups, identified as ScarCruft and Lazarus, covertly installed digital backdoors within the company’s systems to exfiltrate sensitive information. While the full extent of the hackers’ achievements remains unclear, North Korea announced significant advancements in its ballistic missile program mere months after the cyber-attacks were initiated.\n\nSPUTNIX, a private enterprise affiliated with the Russian Academy of Sciences’ Space Research Institute, lost a substantial amount of information to these cyber incursions. The report speculates that North Korea may have stolen critical technology related to the design of ultra-small satellite bodies.\n\nExperts suggest North Korea’s successful launches of rockets equipped with reconnaissance satellites this year likely capitalized on its prior hacking exploits. These cyber activities appear crucial in advancing the country’s space technology.\n\nAdditionally, the report disclosed that in 2020, North Korean hackers penetrated the internal network of Russia’s Almaz-Antey, a leading manufacturer of surface-to-air missiles. The intruders pilfered various information, including developer personal data and proprietary details on missile components.\n\nIn 2019, North Korean cyber operatives also exfiltrated design blueprints from Russia’s Uralvagonzavod tank factory, the entity responsible for producing Russia’s next-generation T-14 Armata battle tank.\n\nMoreover, the report highlighted the frequency of North Korean cyber-attacks against Russian defense corporations specializing in avant-garde weapon systems like hypersonic technologies and intercontinental ballistic missiles (ICBMs).\n\nMilitary commentator Xia Loshan told The Epoch Times on Sept. 10 that both North Korea and Russia are in a desperate situation.\n\n“While North Korea is eager to acquire military technology, the ultimate decision is not in their hands,” Mr. Xia stated, emphasizing that the alliance is precarious and likely unsustainable in the face of stringent international sanctions.\n\nIndependent analyst Zhuge Mingyang also weighed in.\n\n“In a relationship marked more by utilitarian needs than genuine alliance, both nations recognize the other’s motivations,” he told The Epoch Times.\n\n“Moscow likely knows full well that Pyongyang is siphoning off its military technology but seems willing to turn a blind eye in exchange for much-needed ammunition—a clear example of a ‘friendship’ based on mutual expediency.”",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
2,EN_UA_021270.txt,"Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits \n\n Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits\n\nIn surprisingly blunt words, a top aide to Ukrainian President Volodymyr Zelensky has warned that the coming year will essentially decide the fate of Ukraine and its war with Russia.\n\n""A turning point in the war is approaching,"" Andrii Yermak, who serves as chief of staff for the Office of the President of Ukraine, said Monday. ""The next year will be decisive in this regard."" He issued the words while appealing for more urgent aid from Washington in an address to the hawkish DC-based Hudson Institute think tank.\n\nYermak sought to assure the audience that Zelensky has ""a clear plan"" forward even as Western media has by and large soured on Kiev's prospects for success. Much of this is about Zelensky sending envoys to do damage control in Washington at a moment the US administration's focus is off Ukraine and on Gaza events instead.\n\nHe described advancing plans for ""the development of our defense industry, and the deploying of our own arms production. But [that] will be later.""\n\nBut he quickly pivoted to an immediate need for more ""weapons right now""--describing that ""Russia still has air superiority. It is still capable of producing missiles, doing evasion of sanctions…And we especially need air defense systems.""\n\nWithout doubt, the Zelensky admin is in damage control after eyebrow-raising comments were issued to The Economist early this month by Ukraine's top commander, who admitted there will be no breakthrough and the battlefield situation is in a stalemate. The New York Times had characterized his remarks as ""the first time a top Ukrainian commander said the fighting had reached an impasse.""\n\nSo now Zelensky appears to be dispatching his envoys to calm Washington jitters over all the ""bad news"" of late out of Ukraine.\n\nYermak also sought to assure the Hudson Institute conference that more billions given to Ukraine won't be ""charity"" but is instead an ""investment"" in America's ""global leadership.""\n\nHe further emphasized Zelensky's continued rejection of ceasefire talks with Russia, unless it's purely on Kiev's terms. ""We seek peace, but not just any peace. In our case, ending the war through compromise is nothing more than pausing it. Ukraine will not repeat the mistake of Minsk,"" Yermak said.\n\nWatch the full Hudson Institute speech below:",UA,"[{'narrative': 'Speculating war outcomes', 'subnarrative': 'Other'}, {'narrative': 'Discrediting Ukraine', 'subnarrative': 'Situation in Ukraine is hopeless'}, {'narrative': 'Discrediting the West, Diplomacy', 'subnarrative': 'West is tired of Ukraine'}, {'narrative': 'Praise of Russia', 'subnarrative': 'Praise of Russian military might'}, {'narrative': 'Discrediting the West, Diplomacy', 'subnarrative': 'The West does not care about Ukraine, only about its interests'}]"
3,EN_UA_103403.txt,"Russia Stages Major Airstrike on Ukraine; One Missile Enters Polish Airspace \n\n KYIV—Russia struck critical infrastructure in Ukraine’s western region of Lviv with missiles early on March 24, Kyiv said, in a major airstrike that saw one Russian cruise missile briefly fly into Polish airspace, according to Warsaw.\n\nMoscow launched 57 missiles and drones in the attack that also targeted Kyiv, two days after the largest aerial bombardment of Ukraine’s energy system in more than two years of full-scale war, Kyiv said.\n\n“There were two preliminary hits on the same critical infrastructure facility that the occupiers targeted at night,” Lviv’s regional governor, Maksym Kozytskyi, wrote on the Telegram messaging app.\n\nThe strike used Kinzhal hypersonic missiles, which are harder to shoot down, he added, without identifying the facility.\n\nThe energy ministry said equipment caught fire when a critical energy facility in the Lviv region was attacked, causing it to lose power. It was unclear whether they were talking about the same facility.\n\nAir defences destroyed 18 of 29 inbound missiles and 25 of 28 attack drones, the Ukranian air force said.\n\nThere were almost no details about what had been damaged, but the targeting of critical infrastructure could indicate that Russia is trying to keep up pressure on the energy system after its strikes caused widespread blackouts on March 22.\n\nThe energy ministry said that Ukraine, which has been exporting power in recent weeks, had sharply increased imports of electricity and stopped exports on March 24 after attacks on the energy system.\n\nSeveral explosions rang out in Kyiv in the early hours as air defences destroyed about a dozen missiles over the capital and in its vicinity, according to Serhiy Popko, head of Kyiv’s military administration.\n\nThere was only minor damage from the attack, he said.\n\nSmall groups of people huddled for safety underground in a central Kyiv metro station in the early hours, some of them sleeping on camping mats.\n\nMoscow has been pounding Ukraine for days in attacks portrayed by Moscow as revenge for Ukrainian attacks that were conducted during Russia’s presidential election.\n\nThe wreckage of a downed Kh-55 cruise missile was found in a Kyiv park, officials said.\n\n“For the third pre-dawn morning this week, all of Ukraine is under an air alert and has been advised to seek shelter,” U.S. Ambassador Bridget Brink posted on social media platform X.\n\nPolish AirspacePoland’s armed forces said a Russian cruise missile launched at the region of Lviv had violated Poland’s airspace.\n\n“The object entered Polish space near the town of Oserdow [Lublin Voivodeship] and stayed there for 39 seconds,” it said on X. “During the entire flight, it was observed by military radar systems.”\n\nPoland’s army spokesperson, Jacek Goryszewski, told reporters that the missile travelled about 2 kilometers (1.2 miles) into Polish airspace before returning to Ukraine.\n\nThere was no immediate comment from Russia. Warsaw said it would demand an explanation from Moscow.\n\nPolish Defense Minister Wladyslaw Kosiniak-Kamysz said Warsaw would continue to support Ukraine both militarily and on the humanitarian side.",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"
4,EN_CC_100145.txt,"Strategy needed to preserve water resources in Pakistan \n\n ISLAMABAD-Pakistan needs to chalk out an effective and feasible plan to preserve its water resources and stop the depletion of underground water, WealthPK reported.\n\nAccording to an expert, water is one of the basic necessities of life and Pakistan is blessed with an abundance of water resources. However, the country’s water resources are depleting with the passage of time. It is feared that the entire world will face an acute water shortage in the coming years. Saiqa Imran, the deputy director of the Pakistan Council of Research and Water Reservoirs (PCRWR), told WealthPK that the country was feared to face an acute shortage of fresh water. “The quantity of fresh water is quite limited in the country,” she added. According to her, climate change is the primary cause of the depletion of water resources. She said that the shortage of water triggered concerns about the future availability of water for the world’s exponentially rising population.\n\nSaiqa Imran said that in most cases, the water table was constantly going down as a result of the frequent pumping of water from the ground. “With a growing world population, the more we pump water from the ground at a rapid rate, the harder it becomes to get the amount of water we need because we pump the groundwater faster than it can be replenished,” she added. She said that inefficient water distribution and mismanagement were the main causes of the water shortage. “Pakistan has one of the largest contiguous irrigation systems in the world where more than 93% of water is used by the agriculture sector, 5% by the domestic sector and 2% by the industrial sector,” said the expert.\n\nShe said that domestic and industrial sectors would use 15% more water by 2025 than their current consumption. She said that the agriculture sector was the biggest user of water as modern irrigation techniques were not adopted in Pakistan but its contribution to the national economy was declining with the passage of time. Saiqa Imran said that the world’s average storage capacity reached 40% but Pakistan could conserve only up to 10% of its annual river water due to insufficient space. “People have turned to use underground water as a result of the shortage of the most essential commodity. The indiscriminate over-pumping in the cities in the absence of regulatory bodies has also led to the depletion of groundwater,” she added.\n\nAccording to her, there is a dire need to devise a proper pricing mechanism for water usage in any sector. She said that incentives should also be in place to encourage people to use water efficiently. The expert said that the country could take a variety of technical and management measures to conserve water at all levels, reduce irrigation losses, encourage farmers to adopt more efficient irrigation methods by creating a regulatory framework, launch licensing water, and implement integrated water resource management. “Under the principle of private sector participation and optimal water pricing, the policymakers should rethink water policy to encourage wastewater recycling,” Saiqa Imran told WealthPK.",CC,"[{'narrative': 'Other', 'subnarrative': 'Other'}]"


### Extract the label taxonomy

Here we create a templete for each label taxonomy (for each of the two topics) from the subtask 2 pdf file. We can use this to encode the labels of the datapoints in the dataset. We do this by assigning an index to every possible class (narrative-subnarrative pair) that we can use to numerically classify the articles.

In [6]:

# Define the path to the PDF file
pdf_path = "../info/subtask2_NARRATIVE-TAXONOMIES.pdf"

# Open the PDF file
pdf_document = fitz.open(pdf_path)

# Function to extract text from a specific page
def extract_text_from_page(page_number):
    page = pdf_document.load_page(page_number)
    text = page.get_text("text")
    return text

# Extract text from the relevant pages
ukraine_war_text = extract_text_from_page(0)  # Assuming the first page contains Ukraine War taxonomy
climate_change_text = extract_text_from_page(1)  # Assuming the second page contains Climate Change taxonomy

# Function to parse the taxonomy text and create a DataFrame
def parse_taxonomy(text):
    lines = text.split('\n')
    # Exclude the last two lines
    lines = lines[:-3]
    data = []
    current_narrative = None
    for line in lines:
        if line.strip() == "":
            continue
        if line.startswith("-"):  # Subnarrative
            subnarrative = line.strip("- ").strip()
            data.append({'narrative': current_narrative, 'subnarrative': subnarrative})
        else:  # Narrative
            if current_narrative and not any(d['narrative'] == current_narrative for d in data):
                # Add the narrative itself as subnarrative if it has no subnarratives
                data.append({'narrative': current_narrative, 'subnarrative': current_narrative})
            current_narrative = line.strip()
            if current_narrative == "Other":
                data.append({'narrative': "Other", 'subnarrative': "Other"})
    # Handle the last narrative if it has no subnarratives
    if current_narrative and not any(d['narrative'] == current_narrative for d in data):
        data.append({'narrative': current_narrative, 'subnarrative': 'Other'})
    
    df = pd.DataFrame(data)
    df = df.sort_values(by='narrative', ascending=True).reset_index(drop=True)
    return df

# Parse the taxonomies and create DataFrames
ukraine_war_df = parse_taxonomy(ukraine_war_text)
climate_change_df = parse_taxonomy(climate_change_text)


In [7]:
ukraine_war_df.head(50)

Unnamed: 0,narrative,subnarrative
0,Amplifying war-related fears,Russia will also attack other countries
1,Amplifying war-related fears,By continuing the war we risk WWIII
2,Amplifying war-related fears,There is a real possibility that nuclear weapons will be employed
3,Amplifying war-related fears,NATO should/will directly intervene
4,Blaming the war on others rather than the invader,Ukraine is the aggressor
5,Blaming the war on others rather than the invader,The West are the aggressors
6,Discrediting Ukraine,Situation in Ukraine is hopeless
7,Discrediting Ukraine,Ukraine is associated with nazism
8,Discrediting Ukraine,Ukraine is a hub for criminal activities
9,Discrediting Ukraine,Discrediting Ukrainian government and officials and policies


In [8]:
climate_change_df.head(50)

Unnamed: 0,narrative,subnarrative
0,Amplifying Climate Fears,Earth will be uninhabitable soon
1,Amplifying Climate Fears,Amplifying existing fears of global warming
2,Amplifying Climate Fears,Doomsday scenarios for humans
3,Amplifying Climate Fears,Whatever we do it is already too late
4,Climate change is beneficial,CO2 is beneficial
5,Climate change is beneficial,Temperature increase is beneficial
6,Controversy about green technologies,Nuclear energy is not climate friendly
7,Controversy about green technologies,Renewable energy is costly
8,Controversy about green technologies,Renewable energy is unreliable
9,Controversy about green technologies,Renewable energy is dangerous


Handle some edge cases in the taxonomy as instructed in the task description. E.g. if a narrative is identified but no subnarrative applies, the subnarrative is "Other". We apply this to all rows in the taxonomy dataframe except for the narratives "Other" and "Hidden plots by secret schemes of powerful groups" since those are already present. After manually adding the "Other", "Hidden plots by secret schemes of powerful groups" combination in the climate change taxonomy, all combinations should be covered.

In [9]:
# Function to add "Other" subnarrative for each narrative group, excluding specific narratives
def add_other_subnarrative(df):
    additional_rows = []
    unique_narratives = df['narrative'].unique()
    for narrative in unique_narratives:
        if narrative not in ["Other", "Hidden plots by secret schemes of powerful groups"]:
            additional_rows.append({'narrative': narrative, 'subnarrative': 'Other'})
    additional_df = pd.DataFrame(additional_rows)
    return pd.concat([df, additional_df], ignore_index=True)

# Function to sort the DataFrame and add an index column
def sort_and_index_df(df):
    df = df.sort_values(by=['narrative', 'subnarrative']).reset_index(drop=True)
    df['index'] = df.index + 1
    df = df[['index', 'narrative', 'subnarrative']]
    return df

# Add "Other" subnarrative to each DataFrame
ukraine_war_df = add_other_subnarrative(ukraine_war_df)
climate_change_df = add_other_subnarrative(climate_change_df)

# Manually add the specific row to the climate change DataFrame
climate_change_df = pd.concat([climate_change_df, pd.DataFrame([{'narrative': 'Hidden plots by secret schemes of powerful groups', 'subnarrative': 'Other'}])], ignore_index=True)

# Sort and add index column to each DataFrame
ukraine_war_df = sort_and_index_df(ukraine_war_df)
climate_change_df = sort_and_index_df(climate_change_df)

In [10]:
ukraine_war_df.head(55)

Unnamed: 0,index,narrative,subnarrative
0,1,Amplifying war-related fears,By continuing the war we risk WWIII
1,2,Amplifying war-related fears,NATO should/will directly intervene
2,3,Amplifying war-related fears,Other
3,4,Amplifying war-related fears,Russia will also attack other countries
4,5,Amplifying war-related fears,There is a real possibility that nuclear weapons will be employed
5,6,Blaming the war on others rather than the invader,Other
6,7,Blaming the war on others rather than the invader,The West are the aggressors
7,8,Blaming the war on others rather than the invader,Ukraine is the aggressor
8,9,Discrediting Ukraine,Discrediting Ukrainian government and officials and policies
9,10,Discrediting Ukraine,Discrediting Ukrainian military


In [11]:
climate_change_df.head(58)

Unnamed: 0,index,narrative,subnarrative
0,1,Amplifying Climate Fears,Amplifying existing fears of global warming
1,2,Amplifying Climate Fears,Doomsday scenarios for humans
2,3,Amplifying Climate Fears,Earth will be uninhabitable soon
3,4,Amplifying Climate Fears,Other
4,5,Amplifying Climate Fears,Whatever we do it is already too late
5,6,Climate change is beneficial,CO2 is beneficial
6,7,Climate change is beneficial,Other
7,8,Climate change is beneficial,Temperature increase is beneficial
8,9,Controversy about green technologies,Nuclear energy is not climate friendly
9,10,Controversy about green technologies,Other


### Map the labels in the file to the taxonomy

In the next step, we want to add another column to our "df" that combines the narrative_subnarrative_pairs column with the information in our taxonomy dataframes. Firstly, the column "topic" tells us what taxonomy df should be applied (UA for ukraine war and CC for climate change). Then, theoretically, the dictionaries should exactly correspond to a given row in one of the two taxonomy dataframes. We check if every narrative-subnarrative pair of every row in the dataset can be mapped to its row in the taxonomy and if so, add a column to the dataframe that contains the indices of the targets.

In [12]:
# Create a mapping of narrative-subnarrative pairs to their indices
def create_mapping(df):
    mapping = {}
    for _, row in df.iterrows():
        key = (row['narrative'], row['subnarrative'])
        mapping[key] = row['index']
    return mapping

ukraine_war_mapping = create_mapping(ukraine_war_df)
climate_change_mapping = create_mapping(climate_change_df)

# Function to check the mapping and add indices to the DataFrame
def add_target_indices(row, ukraine_war_mapping, climate_change_mapping):
    pairs = row['narrative_subnarrative_pairs']
    topic = row['topic']
    indices = []
    
    if topic == "UA":
        mapping = ukraine_war_mapping
    elif topic == "CC":
        mapping = climate_change_mapping
    else:
        return None  # Invalid topic
    
    for pair in pairs:
        key = (pair['narrative'], pair['subnarrative'])
        if key in mapping:
            indices.append(mapping[key])
        else:
            return (row.name, key)  # Mapping does not exist
    
    return indices

# Apply the function to each row and collect the results
df['target_indices'] = df.apply(add_target_indices, axis=1, args=(ukraine_war_mapping, climate_change_mapping))

# Filter out the rows where the mapping does not exist
problematic_rows = df[df['target_indices'].apply(lambda x: isinstance(x, tuple))]

# Display the problematic rows
print("Problematic rows where the mapping does not exist:")
print(problematic_rows)

# Display the first few problematic rows for inspection
if not problematic_rows.empty:
    print("First few problematic rows:")
    for index, row in problematic_rows.iterrows():
        print(f"Row index: {index}, Problematic pair: {row['target_indices']}")

Problematic rows where the mapping does not exist:
             filename  \
65   EN_UA_103011.txt   
143  EN_UA_103025.txt   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         

In [13]:
len(problematic_rows)

2

There are two datapoints/rows in the dataset where the mapping does not work, the indices 65 and 143. With a closer look, we see what the problem is. All but the two rows are EITHER ukraine war OR climate change. In the two rows, CC and UA is mixed. As the current implementation does not cover such cases and assumes an article is either UA or CC, we have to deal with those.

In [14]:


# Define the path to the annotations file
annotations_file = "../training_data_16_October_release/EN/subtask-2-annotations.txt"

# Read the annotations file
annotations = pd.read_csv(annotations_file, sep='\t', header=None, names=['filename', 'narrative', 'subnarrative'])

# Initialize a list to store the line numbers with both "CC" and "URW"
mixed_topic_lines = []

# Iterate through each row and check for mixed topics
for index, row in annotations.iterrows():
    narrative = row['narrative']
    subnarrative = row['subnarrative']
    
    # Check if both "CC" and "URW" are present in either narrative or subnarrative
    if ("CC: " in narrative and "URW: " in narrative) or ("CC: " in subnarrative and "URW: " in subnarrative):
        mixed_topic_lines.append(index + 1)  # Adding 1 to index to match line numbers

# Print the line numbers with mixed topics
print("Lines with both 'CC' and 'URW' present:")
print(mixed_topic_lines)

# Print the total count of such lines
print("Total number of lines with mixed topics:", len(mixed_topic_lines))

Lines with both 'CC' and 'URW' present:
[66]
Total number of lines with mixed topics: 1


This shows us that line 66 in the annotations file (index 65 in the dataset) has labels from both taxonomies. Index 143 on the other hand has UA in the filename but CC in the labels, which might be an error in the dataset. For now, we deal with this by dropping the two rows from the dataset.

In [15]:
df_short = df.drop([65, 143])

end of data prep...

One of the main open questions is how we should represent the class labels numerically. Right now, they are represented as indices from 1 to n for either group (UA and CC). For model training, one-hot-encoding would probably be the right choice since there is no ordinal relationship between the different class labels.

The main next steps would be to prepare the file contents (content column in the df), since i did not touch those yet, meaning they are still in the raw form extracted from the documents.

## Text Segmentation

Now we handle the content of the articles. Currently, each entry in our dataframe has a single plain string that contains the whole article.

Let's start by splitting it into sentences and words.

In [None]:
def tokenize(df):
    df['tokens'] = None
    for i, row in df.iterrows():
        # split the content into sentences
        sentences = nltk.sent_tokenize(row['content'])
        # tokenize each sentence
        tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
        df.at[i, 'tokens'] = tokens
    return df

df_short = tokenize(df_short)
#df_short.head()

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\krona\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unnamed: 0,filename,content,topic,narrative_subnarrative_pairs,target_indices,tokens
0,EN_UA_103861.txt,"The World Needs Peacemaker Trump Again \n\n by Jeff Crouere, The Liberty Daily:\n\nThe world is in total chaos after 39 months of the Biden presidency. The southern border of our country is porous and millions of individuals from around the world have descended on our country.\n\nThese “undocumented migrants” include terrorists, drug dealers, and intelligence agents of countries such as our enemy, China. It should alarm every American that 22,233 Chinese nationals have illegally entered the United States since the beginning of the fiscal year in October. If this rate continues, this year’s total will easily top the 24,125 Chinese nationals who illegally entered the country last year.\n\nTRUTH LIVES on at https://sgtreport.tv/\n\nThere has been an astounding 6,300% increase in the number of Chinese nationals illegally entering the country in the last few years. Since China is a communist nation, these individuals are not freely traveling to the United States for a “better life.” U.S. Senator Roger Marshall (R-KS) believes the influx is due to “the direction of the CCP (Chinese Communist Party)” and it involves “espionage” as well as “stealing military and economic secrets.”\n\nLast year, China sent a spy balloon over our country, and it was not shot down until it completed its trek across many states. In addition, Chinese investors are buying American farmland for nefarious reasons, and it was confirmed that China operated “police stations” in the United States to monitor their citizens in our country.\n\nChina is constantly threatening Taiwan and its other neighbors while building up its military forces. Their proxy state, North Korea, has started regularly testing long-range ballistic missiles that concern their neighboring countries.\n\nUkraine is in the third year of a bitter war with Russia. Last year, a United States report estimated that the total number of injuries and deaths in the war neared 500,000. There is no end in sight to the war as no peace talks are planned and the United States Congress is on the verge of allocating more military aid to Ukraine.\n\nOn October 7, Israel was invaded by Hamas resulting in the death of 1,200 innocent people. It was the deadliest attack on Israel since its founding in 1948. Hamas abducted over 240 hostages and 129 have not been returned home. These hostages either remain in captivity or are already deceased.\n\nAs Israel has responded to the Hamas invasion by sending military forces into Gaza, many Democrats have been critical. Sadly, innocent civilians have been killed in the conflict, but how can a nation survive with a terrorist organization operating on its border?\n\nTo further strengthen their security, on April 1, Israel bombed the Iranian embassy in Damascus, Syria. Among the seven officials killed in the airstrike were Iranian Brig. Gen. Mohammad Reza Zahedi, who helped plan the October 7 massacre in Israel.\n\nTo retaliate for this strike, Iran launched over 300 drones and missiles into Israel on Saturday. Fortunately, almost all the strikes were intercepted by Israeli forces, with help from the United States, Jordan, and other countries.\n\nAs the world waits to see if Israeli Prime Minister Benjamin Netanyahu will respond to this attack, the Iranians are continuing to issue threats. In an interview on their state TV network, Iranian Major General Mohammad Bagheri said, “Our response will be much larger than tonight’s military action if Israel retaliates against Iran.”\n\nIt is being widely reported by Axios, CNN, and other media outlets that President Joe Biden is urging Netanyahu not to respond and, if he does, the United States will not be involved. This response is disturbing to U.S. Senator Marco Rubio (R-FL), who believes the world is witnessing the most dangerous period for the Middle East since 1973. He said, “What I don’t understand is why Joe Biden and the administration would leak to the media the contents of a conversation in which he tells Netanyahu he doesn’t think that Netanyahu should respond at all. It is the continuing part of the public game they are playing which frankly encourages Iran and Hezbollah.”\n\nBiden has also encouraged Iran by relaxing our economic sanctions, which allowed additional oil revenue to flow into their coffers. This led to their renewed financial support for terrorist organizations like Hamas. Without Iran’s generous backing, Hamas would not have been able to launch their deadly October 7 attack against Israel.\n\nNot content to allow Iran to prosper from oil sales, Biden also released $16 billion in funds for Iran, partially for the return of six American hostages. Terror regimes should never be rewarded for taking hostages, but this is exactly what happened by the Biden administration’s unwise decision.\n\nAs the world deals with multiple international wars and unprecedented chaos in the Middle East, it is useful to remember what was happening when Donald Trump was our President.\n\nRead More @ TheLibertyDaily.com",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]",[32],"[[The, World, Needs, Peacemaker, Trump, Again, by, Jeff, Crouere, ,, The, Liberty, Daily, :, The, world, is, in, total, chaos, after, 39, months, of, the, Biden, presidency, .], [The, southern, border, of, our, country, is, porous, and, millions, of, individuals, from, around, the, world, have, descended, on, our, country, .], [These, “, undocumented, migrants, ”, include, terrorists, ,, drug, dealers, ,, and, intelligence, agents, of, countries, such, as, our, enemy, ,, China, .], [It, should, alarm, every, American, that, 22,233, Chinese, nationals, have, illegally, entered, the, United, States, since, the, beginning, of, the, fiscal, year, in, October, .], [If, this, rate, continues, ,, this, year, ’, s, total, will, easily, top, the, 24,125, Chinese, nationals, who, illegally, entered, the, country, last, year, .], [TRUTH, LIVES, on, at, https, :, //sgtreport.tv/, There, has, been, an, astounding, 6,300, %, increase, in, the, number, of, Chinese, nationals, illegally, entering, the, country, in, the, last, few, years, .], [Since, China, is, a, communist, nation, ,, these, individuals, are, not, freely, traveling, to, the, United, States, for, a, “, better, life., ”, U.S, .], [Senator, Roger, Marshall, (, R-KS, ), believes, the, influx, is, due, to, “, the, direction, of, the, CCP, (, Chinese, Communist, Party, ), ”, and, it, involves, “, espionage, ”, as, well, as, “, stealing, military, and, economic, secrets., ”, Last, year, ,, China, sent, a, spy, balloon, over, our, country, ,, and, it, was, not, shot, down, until, it, completed, its, trek, across, many, states, .], [In, addition, ,, Chinese, investors, are, buying, American, farmland, for, nefarious, reasons, ,, and, it, was, confirmed, that, China, operated, “, police, stations, ”, in, the, United, States, to, monitor, their, citizens, in, our, country, .], [China, is, constantly, threatening, Taiwan, and, its, other, neighbors, while, building, up, its, military, forces, .], [Their, proxy, state, ,, North, Korea, ,, has, started, regularly, testing, long-range, ballistic, missiles, that, concern, their, neighboring, countries, .], [Ukraine, is, in, the, third, year, of, a, bitter, war, with, Russia, .], [Last, year, ,, a, United, States, report, estimated, that, the, total, number, of, injuries, and, deaths, in, the, war, neared, 500,000, .], [There, is, no, end, in, sight, to, the, war, as, no, peace, talks, are, planned, and, the, United, States, Congress, is, on, the, verge, of, allocating, more, military, aid, to, Ukraine, .], [On, October, 7, ,, Israel, was, invaded, by, Hamas, resulting, in, the, death, of, 1,200, innocent, people, .], [It, was, the, deadliest, attack, on, Israel, since, its, founding, in, 1948, .], [Hamas, abducted, over, 240, hostages, and, 129, have, not, been, returned, home, .], [These, hostages, either, remain, in, captivity, or, are, already, deceased, .], [As, Israel, has, responded, to, the, Hamas, invasion, by, sending, military, forces, into, Gaza, ,, many, Democrats, have, been, critical, .], [Sadly, ,, innocent, civilians, have, been, killed, in, the, conflict, ,, but, how, can, a, nation, survive, with, a, terrorist, organization, operating, on, its, border, ?], [To, further, strengthen, their, security, ,, on, April, 1, ,, Israel, bombed, the, Iranian, embassy, in, Damascus, ,, Syria, .], [Among, the, seven, officials, killed, in, the, airstrike, were, Iranian, Brig, .], [Gen., Mohammad, Reza, Zahedi, ,, who, helped, plan, the, October, 7, massacre, in, Israel, .], [To, retaliate, for, this, strike, ,, Iran, launched, over, 300, drones, and, missiles, into, Israel, on, Saturday, .], [Fortunately, ,, almost, all, the, strikes, were, intercepted, by, Israeli, forces, ,, with, help, from, the, United, States, ,, Jordan, ,, and, other, countries, .], [As, the, world, waits, to, see, if, Israeli, Prime, Minister, Benjamin, Netanyahu, will, respond, to, this, attack, ,, the, Iranians, are, continuing, to, issue, threats, .], [In, an, interview, on, their, state, TV, network, ,, Iranian, Major, General, Mohammad, Bagheri, said, ,, “, Our, response, will, be, much, larger, than, tonight, ’, s, military, action, if, Israel, retaliates, against, Iran., ”, It, is, being, widely, reported, by, Axios, ,, CNN, ,, and, other, media, outlets, that, President, Joe, Biden, is, urging, Netanyahu, not, to, respond, and, ,, if, he, does, ,, the, United, States, will, not, be, involved, .], [This, response, is, disturbing, to, U.S, .], [Senator, Marco, Rubio, (, R-FL, ), ,, who, believes, the, world, is, witnessing, the, most, dangerous, period, for, the, Middle, East, since, 1973, .], [He, said, ,, “, What, I, don, ’, t, understand, is, why, Joe, Biden, and, the, administration, would, leak, to, the, media, the, contents, of, a, conversation, in, which, he, tells, Netanyahu, he, doesn, ’, t, think, that, Netanyahu, should, respond, at, all, .], [It, is, the, continuing, part, of, the, public, game, they, are, playing, which, frankly, encourages, Iran, and, Hezbollah., ”, Biden, has, also, encouraged, Iran, by, relaxing, our, economic, sanctions, ,, which, allowed, additional, oil, revenue, to, flow, into, their, coffers, .], [This, led, to, their, renewed, financial, support, for, terrorist, organizations, like, Hamas, .], [Without, Iran, ’, s, generous, backing, ,, Hamas, would, not, have, been, able, to, launch, their, deadly, October, 7, attack, against, Israel, .], [Not, content, to, allow, Iran, to, prosper, from, oil, sales, ,, Biden, also, released, $, 16, billion, in, funds, for, Iran, ,, partially, for, the, return, of, six, American, hostages, .], [Terror, regimes, should, never, be, rewarded, for, taking, hostages, ,, but, this, is, exactly, what, happened, by, the, Biden, administration, ’, s, unwise, decision, .], [As, the, world, deals, with, multiple, international, wars, and, unprecedented, chaos, in, the, Middle, East, ,, it, is, useful, to, remember, what, was, happening, when, Donald, Trump, was, our, President, .], [Read, More, @, TheLibertyDaily.com]]"
1,EN_UA_103667.txt,"Desperation and Diplomacy: North Korea's Tech Hunt in Russia Amid Sanctions \n\n For several decades, North Korea has masterfully balanced its relations with the Chinese Communist Party (CCP) and Russia, maintaining a relationship with the latter that is both complex and ever-changing.\n\nThe dynamics have shifted notably since Russia found itself sanctioned and isolated by the international community following its invasion of Ukraine in February 2022.\n\nSince then, North Korea and Russia have grown increasingly close, as recently seen when North Korean leader Kim Jong Un visited eastern Russia, where he held a summit with Russian President Vladimir Putin last week.\n\nAfter a six-day visit, Mr. Kim returned to Pyongyang on Sunday.\n\nSanctioned by the international community, North Korea desperately needs Russian technology for nuclear weapons, satellite development, and food production.\n\nSimilarly, Russia, also facing sanctions, urgently seeks to replenish its dwindling front-line supplies with large quantities of ammunition from North Korea.\n\nBefore the summit, when asked whether Russia would assist North Korea in developing artificial satellites, Mr. Putin confirmed this was a primary reason for their meeting.\n\nBoth sides appear willing to overlook the threat of escalated international sanctions against them.\n\nDuring their summit, Mr. Kim vowed to “fully and unconditionally support” Moscow.\n\nRocket technology holds particular interest for Mr. Kim and his regime.\n\nFrom Outward Support to Clandestine CyberattacksAccording to reports, North Korea has long eyed Russian technology, resorting to cyber theft to obtain it, and it is something that Moscow now appears ready to overlook despite recent reports such as one by Microsoft’s Threat Analysis Center (MTAC).\n\nThe MTAC report indicates that between March last year and this March, North Korean cyber operatives launched attacks on Russian aerospace research facilities and penetrated academic institutions engaged in research. Additionally, the operatives sent phishing emails to personnel within Russian diplomatic agencies.\n\nThe study further revealed that the countries most frequently targeted by North Korean cyberattacks during the same period included South Korea, Israel, Germany, and Russia.\n\nThe hacking groups, identified as ScarCruft and Lazarus, covertly installed digital backdoors within the company’s systems to exfiltrate sensitive information. While the full extent of the hackers’ achievements remains unclear, North Korea announced significant advancements in its ballistic missile program mere months after the cyber-attacks were initiated.\n\nSPUTNIX, a private enterprise affiliated with the Russian Academy of Sciences’ Space Research Institute, lost a substantial amount of information to these cyber incursions. The report speculates that North Korea may have stolen critical technology related to the design of ultra-small satellite bodies.\n\nExperts suggest North Korea’s successful launches of rockets equipped with reconnaissance satellites this year likely capitalized on its prior hacking exploits. These cyber activities appear crucial in advancing the country’s space technology.\n\nAdditionally, the report disclosed that in 2020, North Korean hackers penetrated the internal network of Russia’s Almaz-Antey, a leading manufacturer of surface-to-air missiles. The intruders pilfered various information, including developer personal data and proprietary details on missile components.\n\nIn 2019, North Korean cyber operatives also exfiltrated design blueprints from Russia’s Uralvagonzavod tank factory, the entity responsible for producing Russia’s next-generation T-14 Armata battle tank.\n\nMoreover, the report highlighted the frequency of North Korean cyber-attacks against Russian defense corporations specializing in avant-garde weapon systems like hypersonic technologies and intercontinental ballistic missiles (ICBMs).\n\nMilitary commentator Xia Loshan told The Epoch Times on Sept. 10 that both North Korea and Russia are in a desperate situation.\n\n“While North Korea is eager to acquire military technology, the ultimate decision is not in their hands,” Mr. Xia stated, emphasizing that the alliance is precarious and likely unsustainable in the face of stringent international sanctions.\n\nIndependent analyst Zhuge Mingyang also weighed in.\n\n“In a relationship marked more by utilitarian needs than genuine alliance, both nations recognize the other’s motivations,” he told The Epoch Times.\n\n“Moscow likely knows full well that Pyongyang is siphoning off its military technology but seems willing to turn a blind eye in exchange for much-needed ammunition—a clear example of a ‘friendship’ based on mutual expediency.”",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]",[32],"[[Desperation, and, Diplomacy, :, North, Korea, 's, Tech, Hunt, in, Russia, Amid, Sanctions, For, several, decades, ,, North, Korea, has, masterfully, balanced, its, relations, with, the, Chinese, Communist, Party, (, CCP, ), and, Russia, ,, maintaining, a, relationship, with, the, latter, that, is, both, complex, and, ever-changing, .], [The, dynamics, have, shifted, notably, since, Russia, found, itself, sanctioned, and, isolated, by, the, international, community, following, its, invasion, of, Ukraine, in, February, 2022, .], [Since, then, ,, North, Korea, and, Russia, have, grown, increasingly, close, ,, as, recently, seen, when, North, Korean, leader, Kim, Jong, Un, visited, eastern, Russia, ,, where, he, held, a, summit, with, Russian, President, Vladimir, Putin, last, week, .], [After, a, six-day, visit, ,, Mr., Kim, returned, to, Pyongyang, on, Sunday, .], [Sanctioned, by, the, international, community, ,, North, Korea, desperately, needs, Russian, technology, for, nuclear, weapons, ,, satellite, development, ,, and, food, production, .], [Similarly, ,, Russia, ,, also, facing, sanctions, ,, urgently, seeks, to, replenish, its, dwindling, front-line, supplies, with, large, quantities, of, ammunition, from, North, Korea, .], [Before, the, summit, ,, when, asked, whether, Russia, would, assist, North, Korea, in, developing, artificial, satellites, ,, Mr., Putin, confirmed, this, was, a, primary, reason, for, their, meeting, .], [Both, sides, appear, willing, to, overlook, the, threat, of, escalated, international, sanctions, against, them, .], [During, their, summit, ,, Mr., Kim, vowed, to, “, fully, and, unconditionally, support, ”, Moscow, .], [Rocket, technology, holds, particular, interest, for, Mr., Kim, and, his, regime, .], [From, Outward, Support, to, Clandestine, CyberattacksAccording, to, reports, ,, North, Korea, has, long, eyed, Russian, technology, ,, resorting, to, cyber, theft, to, obtain, it, ,, and, it, is, something, that, Moscow, now, appears, ready, to, overlook, despite, recent, reports, such, as, one, by, Microsoft, ’, s, Threat, Analysis, Center, (, MTAC, ), .], [The, MTAC, report, indicates, that, between, March, last, year, and, this, March, ,, North, Korean, cyber, operatives, launched, attacks, on, Russian, aerospace, research, facilities, and, penetrated, academic, institutions, engaged, in, research, .], [Additionally, ,, the, operatives, sent, phishing, emails, to, personnel, within, Russian, diplomatic, agencies, .], [The, study, further, revealed, that, the, countries, most, frequently, targeted, by, North, Korean, cyberattacks, during, the, same, period, included, South, Korea, ,, Israel, ,, Germany, ,, and, Russia, .], [The, hacking, groups, ,, identified, as, ScarCruft, and, Lazarus, ,, covertly, installed, digital, backdoors, within, the, company, ’, s, systems, to, exfiltrate, sensitive, information, .], [While, the, full, extent, of, the, hackers, ’, achievements, remains, unclear, ,, North, Korea, announced, significant, advancements, in, its, ballistic, missile, program, mere, months, after, the, cyber-attacks, were, initiated, .], [SPUTNIX, ,, a, private, enterprise, affiliated, with, the, Russian, Academy, of, Sciences, ’, Space, Research, Institute, ,, lost, a, substantial, amount, of, information, to, these, cyber, incursions, .], [The, report, speculates, that, North, Korea, may, have, stolen, critical, technology, related, to, the, design, of, ultra-small, satellite, bodies, .], [Experts, suggest, North, Korea, ’, s, successful, launches, of, rockets, equipped, with, reconnaissance, satellites, this, year, likely, capitalized, on, its, prior, hacking, exploits, .], [These, cyber, activities, appear, crucial, in, advancing, the, country, ’, s, space, technology, .], [Additionally, ,, the, report, disclosed, that, in, 2020, ,, North, Korean, hackers, penetrated, the, internal, network, of, Russia, ’, s, Almaz-Antey, ,, a, leading, manufacturer, of, surface-to-air, missiles, .], [The, intruders, pilfered, various, information, ,, including, developer, personal, data, and, proprietary, details, on, missile, components, .], [In, 2019, ,, North, Korean, cyber, operatives, also, exfiltrated, design, blueprints, from, Russia, ’, s, Uralvagonzavod, tank, factory, ,, the, entity, responsible, for, producing, Russia, ’, s, next-generation, T-14, Armata, battle, tank, .], [Moreover, ,, the, report, highlighted, the, frequency, of, North, Korean, cyber-attacks, against, Russian, defense, corporations, specializing, in, avant-garde, weapon, systems, like, hypersonic, technologies, and, intercontinental, ballistic, missiles, (, ICBMs, ), .], [Military, commentator, Xia, Loshan, told, The, Epoch, Times, on, Sept., 10, that, both, North, Korea, and, Russia, are, in, a, desperate, situation, .], [“, While, North, Korea, is, eager, to, acquire, military, technology, ,, the, ultimate, decision, is, not, in, their, hands, ,, ”, Mr., Xia, stated, ,, emphasizing, that, the, alliance, is, precarious, and, likely, unsustainable, in, the, face, of, stringent, international, sanctions, .], [Independent, analyst, Zhuge, Mingyang, also, weighed, in, .], [“, In, a, relationship, marked, more, by, utilitarian, needs, than, genuine, alliance, ,, both, nations, recognize, the, other, ’, s, motivations, ,, ”, he, told, The, Epoch, Times, .], [“, Moscow, likely, knows, full, well, that, Pyongyang, is, siphoning, off, its, military, technology, but, seems, willing, to, turn, a, blind, eye, in, exchange, for, much-needed, ammunition—a, clear, example, of, a, ‘, friendship, ’, based, on, mutual, expediency, ., ”]]"
2,EN_UA_021270.txt,"Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits \n\n Ukraine's Fate Will Be Decided In Coming Year, Top Zelensky Aide Admits\n\nIn surprisingly blunt words, a top aide to Ukrainian President Volodymyr Zelensky has warned that the coming year will essentially decide the fate of Ukraine and its war with Russia.\n\n""A turning point in the war is approaching,"" Andrii Yermak, who serves as chief of staff for the Office of the President of Ukraine, said Monday. ""The next year will be decisive in this regard."" He issued the words while appealing for more urgent aid from Washington in an address to the hawkish DC-based Hudson Institute think tank.\n\nYermak sought to assure the audience that Zelensky has ""a clear plan"" forward even as Western media has by and large soured on Kiev's prospects for success. Much of this is about Zelensky sending envoys to do damage control in Washington at a moment the US administration's focus is off Ukraine and on Gaza events instead.\n\nHe described advancing plans for ""the development of our defense industry, and the deploying of our own arms production. But [that] will be later.""\n\nBut he quickly pivoted to an immediate need for more ""weapons right now""--describing that ""Russia still has air superiority. It is still capable of producing missiles, doing evasion of sanctions…And we especially need air defense systems.""\n\nWithout doubt, the Zelensky admin is in damage control after eyebrow-raising comments were issued to The Economist early this month by Ukraine's top commander, who admitted there will be no breakthrough and the battlefield situation is in a stalemate. The New York Times had characterized his remarks as ""the first time a top Ukrainian commander said the fighting had reached an impasse.""\n\nSo now Zelensky appears to be dispatching his envoys to calm Washington jitters over all the ""bad news"" of late out of Ukraine.\n\nYermak also sought to assure the Hudson Institute conference that more billions given to Ukraine won't be ""charity"" but is instead an ""investment"" in America's ""global leadership.""\n\nHe further emphasized Zelensky's continued rejection of ceasefire talks with Russia, unless it's purely on Kiev's terms. ""We seek peace, but not just any peace. In our case, ending the war through compromise is nothing more than pausing it. Ukraine will not repeat the mistake of Minsk,"" Yermak said.\n\nWatch the full Hudson Institute speech below:",UA,"[{'narrative': 'Speculating war outcomes', 'subnarrative': 'Other'}, {'narrative': 'Discrediting Ukraine', 'subnarrative': 'Situation in Ukraine is hopeless'}, {'narrative': 'Discrediting the West, Diplomacy', 'subnarrative': 'West is tired of Ukraine'}, {'narrative': 'Praise of Russia', 'subnarrative': 'Praise of Russian military might'}, {'narrative': 'Discrediting the West, Diplomacy', 'subnarrative': 'The West does not care about Ukraine, only about its interests'}]","[47, 14, 24, 39, 21]","[[Ukraine, 's, Fate, Will, Be, Decided, In, Coming, Year, ,, Top, Zelensky, Aide, Admits, Ukraine, 's, Fate, Will, Be, Decided, In, Coming, Year, ,, Top, Zelensky, Aide, Admits, In, surprisingly, blunt, words, ,, a, top, aide, to, Ukrainian, President, Volodymyr, Zelensky, has, warned, that, the, coming, year, will, essentially, decide, the, fate, of, Ukraine, and, its, war, with, Russia, .], [``, A, turning, point, in, the, war, is, approaching, ,, '', Andrii, Yermak, ,, who, serves, as, chief, of, staff, for, the, Office, of, the, President, of, Ukraine, ,, said, Monday, .], [``, The, next, year, will, be, decisive, in, this, regard, ., ''], [He, issued, the, words, while, appealing, for, more, urgent, aid, from, Washington, in, an, address, to, the, hawkish, DC-based, Hudson, Institute, think, tank, .], [Yermak, sought, to, assure, the, audience, that, Zelensky, has, ``, a, clear, plan, '', forward, even, as, Western, media, has, by, and, large, soured, on, Kiev, 's, prospects, for, success, .], [Much, of, this, is, about, Zelensky, sending, envoys, to, do, damage, control, in, Washington, at, a, moment, the, US, administration, 's, focus, is, off, Ukraine, and, on, Gaza, events, instead, .], [He, described, advancing, plans, for, ``, the, development, of, our, defense, industry, ,, and, the, deploying, of, our, own, arms, production, .], [But, [, that, ], will, be, later, ., ''], [But, he, quickly, pivoted, to, an, immediate, need, for, more, ``, weapons, right, now, '', --, describing, that, ``, Russia, still, has, air, superiority, .], [It, is, still, capable, of, producing, missiles, ,, doing, evasion, of, sanctions…And, we, especially, need, air, defense, systems, ., ''], [Without, doubt, ,, the, Zelensky, admin, is, in, damage, control, after, eyebrow-raising, comments, were, issued, to, The, Economist, early, this, month, by, Ukraine, 's, top, commander, ,, who, admitted, there, will, be, no, breakthrough, and, the, battlefield, situation, is, in, a, stalemate, .], [The, New, York, Times, had, characterized, his, remarks, as, ``, the, first, time, a, top, Ukrainian, commander, said, the, fighting, had, reached, an, impasse, ., ''], [So, now, Zelensky, appears, to, be, dispatching, his, envoys, to, calm, Washington, jitters, over, all, the, ``, bad, news, '', of, late, out, of, Ukraine, .], [Yermak, also, sought, to, assure, the, Hudson, Institute, conference, that, more, billions, given, to, Ukraine, wo, n't, be, ``, charity, '', but, is, instead, an, ``, investment, '', in, America, 's, ``, global, leadership, ., ''], [He, further, emphasized, Zelensky, 's, continued, rejection, of, ceasefire, talks, with, Russia, ,, unless, it, 's, purely, on, Kiev, 's, terms, .], [``, We, seek, peace, ,, but, not, just, any, peace, .], [In, our, case, ,, ending, the, war, through, compromise, is, nothing, more, than, pausing, it, .], [Ukraine, will, not, repeat, the, mistake, of, Minsk, ,, '', Yermak, said, .], [Watch, the, full, Hudson, Institute, speech, below, :]]"
3,EN_UA_103403.txt,"Russia Stages Major Airstrike on Ukraine; One Missile Enters Polish Airspace \n\n KYIV—Russia struck critical infrastructure in Ukraine’s western region of Lviv with missiles early on March 24, Kyiv said, in a major airstrike that saw one Russian cruise missile briefly fly into Polish airspace, according to Warsaw.\n\nMoscow launched 57 missiles and drones in the attack that also targeted Kyiv, two days after the largest aerial bombardment of Ukraine’s energy system in more than two years of full-scale war, Kyiv said.\n\n“There were two preliminary hits on the same critical infrastructure facility that the occupiers targeted at night,” Lviv’s regional governor, Maksym Kozytskyi, wrote on the Telegram messaging app.\n\nThe strike used Kinzhal hypersonic missiles, which are harder to shoot down, he added, without identifying the facility.\n\nThe energy ministry said equipment caught fire when a critical energy facility in the Lviv region was attacked, causing it to lose power. It was unclear whether they were talking about the same facility.\n\nAir defences destroyed 18 of 29 inbound missiles and 25 of 28 attack drones, the Ukranian air force said.\n\nThere were almost no details about what had been damaged, but the targeting of critical infrastructure could indicate that Russia is trying to keep up pressure on the energy system after its strikes caused widespread blackouts on March 22.\n\nThe energy ministry said that Ukraine, which has been exporting power in recent weeks, had sharply increased imports of electricity and stopped exports on March 24 after attacks on the energy system.\n\nSeveral explosions rang out in Kyiv in the early hours as air defences destroyed about a dozen missiles over the capital and in its vicinity, according to Serhiy Popko, head of Kyiv’s military administration.\n\nThere was only minor damage from the attack, he said.\n\nSmall groups of people huddled for safety underground in a central Kyiv metro station in the early hours, some of them sleeping on camping mats.\n\nMoscow has been pounding Ukraine for days in attacks portrayed by Moscow as revenge for Ukrainian attacks that were conducted during Russia’s presidential election.\n\nThe wreckage of a downed Kh-55 cruise missile was found in a Kyiv park, officials said.\n\n“For the third pre-dawn morning this week, all of Ukraine is under an air alert and has been advised to seek shelter,” U.S. Ambassador Bridget Brink posted on social media platform X.\n\nPolish AirspacePoland’s armed forces said a Russian cruise missile launched at the region of Lviv had violated Poland’s airspace.\n\n“The object entered Polish space near the town of Oserdow [Lublin Voivodeship] and stayed there for 39 seconds,” it said on X. “During the entire flight, it was observed by military radar systems.”\n\nPoland’s army spokesperson, Jacek Goryszewski, told reporters that the missile travelled about 2 kilometers (1.2 miles) into Polish airspace before returning to Ukraine.\n\nThere was no immediate comment from Russia. Warsaw said it would demand an explanation from Moscow.\n\nPolish Defense Minister Wladyslaw Kosiniak-Kamysz said Warsaw would continue to support Ukraine both militarily and on the humanitarian side.",UA,"[{'narrative': 'Other', 'subnarrative': 'Other'}]",[32],"[[Russia, Stages, Major, Airstrike, on, Ukraine, ;, One, Missile, Enters, Polish, Airspace, KYIV—Russia, struck, critical, infrastructure, in, Ukraine, ’, s, western, region, of, Lviv, with, missiles, early, on, March, 24, ,, Kyiv, said, ,, in, a, major, airstrike, that, saw, one, Russian, cruise, missile, briefly, fly, into, Polish, airspace, ,, according, to, Warsaw, .], [Moscow, launched, 57, missiles, and, drones, in, the, attack, that, also, targeted, Kyiv, ,, two, days, after, the, largest, aerial, bombardment, of, Ukraine, ’, s, energy, system, in, more, than, two, years, of, full-scale, war, ,, Kyiv, said, .], [“, There, were, two, preliminary, hits, on, the, same, critical, infrastructure, facility, that, the, occupiers, targeted, at, night, ,, ”, Lviv, ’, s, regional, governor, ,, Maksym, Kozytskyi, ,, wrote, on, the, Telegram, messaging, app, .], [The, strike, used, Kinzhal, hypersonic, missiles, ,, which, are, harder, to, shoot, down, ,, he, added, ,, without, identifying, the, facility, .], [The, energy, ministry, said, equipment, caught, fire, when, a, critical, energy, facility, in, the, Lviv, region, was, attacked, ,, causing, it, to, lose, power, .], [It, was, unclear, whether, they, were, talking, about, the, same, facility, .], [Air, defences, destroyed, 18, of, 29, inbound, missiles, and, 25, of, 28, attack, drones, ,, the, Ukranian, air, force, said, .], [There, were, almost, no, details, about, what, had, been, damaged, ,, but, the, targeting, of, critical, infrastructure, could, indicate, that, Russia, is, trying, to, keep, up, pressure, on, the, energy, system, after, its, strikes, caused, widespread, blackouts, on, March, 22, .], [The, energy, ministry, said, that, Ukraine, ,, which, has, been, exporting, power, in, recent, weeks, ,, had, sharply, increased, imports, of, electricity, and, stopped, exports, on, March, 24, after, attacks, on, the, energy, system, .], [Several, explosions, rang, out, in, Kyiv, in, the, early, hours, as, air, defences, destroyed, about, a, dozen, missiles, over, the, capital, and, in, its, vicinity, ,, according, to, Serhiy, Popko, ,, head, of, Kyiv, ’, s, military, administration, .], [There, was, only, minor, damage, from, the, attack, ,, he, said, .], [Small, groups, of, people, huddled, for, safety, underground, in, a, central, Kyiv, metro, station, in, the, early, hours, ,, some, of, them, sleeping, on, camping, mats, .], [Moscow, has, been, pounding, Ukraine, for, days, in, attacks, portrayed, by, Moscow, as, revenge, for, Ukrainian, attacks, that, were, conducted, during, Russia, ’, s, presidential, election, .], [The, wreckage, of, a, downed, Kh-55, cruise, missile, was, found, in, a, Kyiv, park, ,, officials, said, .], [“, For, the, third, pre-dawn, morning, this, week, ,, all, of, Ukraine, is, under, an, air, alert, and, has, been, advised, to, seek, shelter, ,, ”, U.S, .], [Ambassador, Bridget, Brink, posted, on, social, media, platform, X, .], [Polish, AirspacePoland, ’, s, armed, forces, said, a, Russian, cruise, missile, launched, at, the, region, of, Lviv, had, violated, Poland, ’, s, airspace, .], [“, The, object, entered, Polish, space, near, the, town, of, Oserdow, [, Lublin, Voivodeship, ], and, stayed, there, for, 39, seconds, ,, ”, it, said, on, X, .], [“, During, the, entire, flight, ,, it, was, observed, by, military, radar, systems., ”, Poland, ’, s, army, spokesperson, ,, Jacek, Goryszewski, ,, told, reporters, that, the, missile, travelled, about, 2, kilometers, (, 1.2, miles, ), into, Polish, airspace, before, returning, to, Ukraine, .], [There, was, no, immediate, comment, from, Russia, .], [Warsaw, said, it, would, demand, an, explanation, from, Moscow, .], [Polish, Defense, Minister, Wladyslaw, Kosiniak-Kamysz, said, Warsaw, would, continue, to, support, Ukraine, both, militarily, and, on, the, humanitarian, side, .]]"
4,EN_CC_100145.txt,"Strategy needed to preserve water resources in Pakistan \n\n ISLAMABAD-Pakistan needs to chalk out an effective and feasible plan to preserve its water resources and stop the depletion of underground water, WealthPK reported.\n\nAccording to an expert, water is one of the basic necessities of life and Pakistan is blessed with an abundance of water resources. However, the country’s water resources are depleting with the passage of time. It is feared that the entire world will face an acute water shortage in the coming years. Saiqa Imran, the deputy director of the Pakistan Council of Research and Water Reservoirs (PCRWR), told WealthPK that the country was feared to face an acute shortage of fresh water. “The quantity of fresh water is quite limited in the country,” she added. According to her, climate change is the primary cause of the depletion of water resources. She said that the shortage of water triggered concerns about the future availability of water for the world’s exponentially rising population.\n\nSaiqa Imran said that in most cases, the water table was constantly going down as a result of the frequent pumping of water from the ground. “With a growing world population, the more we pump water from the ground at a rapid rate, the harder it becomes to get the amount of water we need because we pump the groundwater faster than it can be replenished,” she added. She said that inefficient water distribution and mismanagement were the main causes of the water shortage. “Pakistan has one of the largest contiguous irrigation systems in the world where more than 93% of water is used by the agriculture sector, 5% by the domestic sector and 2% by the industrial sector,” said the expert.\n\nShe said that domestic and industrial sectors would use 15% more water by 2025 than their current consumption. She said that the agriculture sector was the biggest user of water as modern irrigation techniques were not adopted in Pakistan but its contribution to the national economy was declining with the passage of time. Saiqa Imran said that the world’s average storage capacity reached 40% but Pakistan could conserve only up to 10% of its annual river water due to insufficient space. “People have turned to use underground water as a result of the shortage of the most essential commodity. The indiscriminate over-pumping in the cities in the absence of regulatory bodies has also led to the depletion of groundwater,” she added.\n\nAccording to her, there is a dire need to devise a proper pricing mechanism for water usage in any sector. She said that incentives should also be in place to encourage people to use water efficiently. The expert said that the country could take a variety of technical and management measures to conserve water at all levels, reduce irrigation losses, encourage farmers to adopt more efficient irrigation methods by creating a regulatory framework, launch licensing water, and implement integrated water resource management. “Under the principle of private sector participation and optimal water pricing, the policymakers should rethink water policy to encourage wastewater recycling,” Saiqa Imran told WealthPK.",CC,"[{'narrative': 'Other', 'subnarrative': 'Other'}]",[42],"[[Strategy, needed, to, preserve, water, resources, in, Pakistan, ISLAMABAD-Pakistan, needs, to, chalk, out, an, effective, and, feasible, plan, to, preserve, its, water, resources, and, stop, the, depletion, of, underground, water, ,, WealthPK, reported, .], [According, to, an, expert, ,, water, is, one, of, the, basic, necessities, of, life, and, Pakistan, is, blessed, with, an, abundance, of, water, resources, .], [However, ,, the, country, ’, s, water, resources, are, depleting, with, the, passage, of, time, .], [It, is, feared, that, the, entire, world, will, face, an, acute, water, shortage, in, the, coming, years, .], [Saiqa, Imran, ,, the, deputy, director, of, the, Pakistan, Council, of, Research, and, Water, Reservoirs, (, PCRWR, ), ,, told, WealthPK, that, the, country, was, feared, to, face, an, acute, shortage, of, fresh, water, .], [“, The, quantity, of, fresh, water, is, quite, limited, in, the, country, ,, ”, she, added, .], [According, to, her, ,, climate, change, is, the, primary, cause, of, the, depletion, of, water, resources, .], [She, said, that, the, shortage, of, water, triggered, concerns, about, the, future, availability, of, water, for, the, world, ’, s, exponentially, rising, population, .], [Saiqa, Imran, said, that, in, most, cases, ,, the, water, table, was, constantly, going, down, as, a, result, of, the, frequent, pumping, of, water, from, the, ground, .], [“, With, a, growing, world, population, ,, the, more, we, pump, water, from, the, ground, at, a, rapid, rate, ,, the, harder, it, becomes, to, get, the, amount, of, water, we, need, because, we, pump, the, groundwater, faster, than, it, can, be, replenished, ,, ”, she, added, .], [She, said, that, inefficient, water, distribution, and, mismanagement, were, the, main, causes, of, the, water, shortage, .], [“, Pakistan, has, one, of, the, largest, contiguous, irrigation, systems, in, the, world, where, more, than, 93, %, of, water, is, used, by, the, agriculture, sector, ,, 5, %, by, the, domestic, sector, and, 2, %, by, the, industrial, sector, ,, ”, said, the, expert, .], [She, said, that, domestic, and, industrial, sectors, would, use, 15, %, more, water, by, 2025, than, their, current, consumption, .], [She, said, that, the, agriculture, sector, was, the, biggest, user, of, water, as, modern, irrigation, techniques, were, not, adopted, in, Pakistan, but, its, contribution, to, the, national, economy, was, declining, with, the, passage, of, time, .], [Saiqa, Imran, said, that, the, world, ’, s, average, storage, capacity, reached, 40, %, but, Pakistan, could, conserve, only, up, to, 10, %, of, its, annual, river, water, due, to, insufficient, space, .], [“, People, have, turned, to, use, underground, water, as, a, result, of, the, shortage, of, the, most, essential, commodity, .], [The, indiscriminate, over-pumping, in, the, cities, in, the, absence, of, regulatory, bodies, has, also, led, to, the, depletion, of, groundwater, ,, ”, she, added, .], [According, to, her, ,, there, is, a, dire, need, to, devise, a, proper, pricing, mechanism, for, water, usage, in, any, sector, .], [She, said, that, incentives, should, also, be, in, place, to, encourage, people, to, use, water, efficiently, .], [The, expert, said, that, the, country, could, take, a, variety, of, technical, and, management, measures, to, conserve, water, at, all, levels, ,, reduce, irrigation, losses, ,, encourage, farmers, to, adopt, more, efficient, irrigation, methods, by, creating, a, regulatory, framework, ,, launch, licensing, water, ,, and, implement, integrated, water, resource, management, .], [“, Under, the, principle, of, private, sector, participation, and, optimal, water, pricing, ,, the, policymakers, should, rethink, water, policy, to, encourage, wastewater, recycling, ,, ”, Saiqa, Imran, told, WealthPK, .]]"


To uncover potential errors, let us check for and handle sentences of unusual length.

In [17]:
# Function to find sentences of unusual length
def find_unusual_length_sentences(df, min_length=3, max_length=130):
    unusual_sentences = []
    for i, row in df.iterrows():
        for j, sentence in enumerate(row['tokens']):
            if len(sentence) < min_length or len(sentence) > max_length:
                # also store the previous and next sentences for context
                prev_sentence = row['tokens'][j-1] if j > 0 else None
                next_sentence = row['tokens'][j+1] if j < len(row['tokens']) - 1 else None
                unusual_sentences.append({
                    'row_index': i, # for later handling of the unusual sentences
                    'sentence_index': j, # for later handling of the unusual sentences
                    'sentence': sentence,
                    'previous': prev_sentence,
                    'next': next_sentence
                })
    return unusual_sentences

# Find sentences with less than 3 words or more than 130 words
unusual_sentences = find_unusual_length_sentences(df_short)

print(f"There are {len(unusual_sentences)} sentences of unusual length.")

# Display the unusual sentences
for entry in unusual_sentences:
    print(f"Sentence length: {len(entry["sentence"])}, Sentence: {' '.join(entry["sentence"])}")

There are 28 sentences of unusual length.
Sentence length: 143, Sentence: Jens Stoltenberg ( pictured ) , the 13th secretary general of NATO , revealed there were live discussions among members about removing missiles from storage and putting them on standby A Netherlands ' Air Force F-16 jetfighter takes part in the NATO exercise as part of the NATO Air Policing mission The head of Kyiv 's national security council said Putin could demand a tactical nuclear weapon be used if Russia 's army is beaten in Ukraine Russian soldiers load a Iskander-M short-range ballistic missile launchers at a firing position as part of Russian military drill intended to train the troops in using tactical nuclear weapons Meanwhile , Mr Stoltenberg warned in Brussels of the threat from China , adding that nuclear transparency should form the basis of Nato 's nuclear strategy to prepare the alliance for the dangers of the world .
Sentence length: 1, Sentence: ...
Sentence length: 136, Sentence: “ The Complai

We find multiple very short and very large sentences.

#### Handling very short sentences.

The sentences of size 1 all consist of non-meaningful characters. Therefore we can drop them directly.

In [18]:
# Function to drop unusual sentences of a specific length from the DataFrame
def drop_sentences_of_length(df, unusual_sentences, length):
    # Create a copy of the list to iterate over
    for entry in unusual_sentences[:]:
        if len(entry['sentence']) == length:
            row_index = entry['row_index']
            sentence_index = entry['sentence_index']
            # Check if the sentence index is within the valid range
            if 0 <= sentence_index < len(df.at[row_index, 'tokens']):
                # Drop from DataFrame
                df.at[row_index, 'tokens'].pop(sentence_index)
                # Drop from unusual_sentences list
                unusual_sentences.remove(entry)
                print("dropped: ", entry['sentence'])
    return df

# Drop sentences of length 1
print(f"There are {len(unusual_sentences)} sentences of unusual length before dropping sentences of length 1.")
df_short = drop_sentences_of_length(df_short, unusual_sentences, length=1)
print(f"There are {len(unusual_sentences)} sentences of unusual length after dropping sentences of length 1.")

There are 28 sentences of unusual length before dropping sentences of length 1.
dropped:  ['...']
dropped:  ['Comments']
dropped:  ['■']
dropped:  ['■']
dropped:  ['■']
dropped:  ['HT']
dropped:  ['■']
dropped:  ['.']
There are 20 sentences of unusual length after dropping sentences of length 1.


The sentences of size 2 might make sense. Let's have a look at their context.

In [19]:
# Display sentences of length 2 with their preceding and following sentences
for entry in unusual_sentences:

    print(f"(Previous) {' '.join(entry['previous']) if entry['previous'] else 'None'}")
    print(f"(Idx {entry['row_index']}, {entry['sentence_index']}, ListIdx {unusual_sentences.index(entry)}) {' '.join(entry['sentence'])}")
    if entry['next']:
        print(f"(Next) {' '.join(entry['next'])}")
    print("-" * 50)

(Previous) Oleksandr Lytvynenko made the comments after G7 leaders warned any use by Russia of chemical , biological or nuclear weapons would be met with 'severe consequences ' , The Times reported .
(Idx 6, 6, ListIdx 0) Jens Stoltenberg ( pictured ) , the 13th secretary general of NATO , revealed there were live discussions among members about removing missiles from storage and putting them on standby A Netherlands ' Air Force F-16 jetfighter takes part in the NATO exercise as part of the NATO Air Policing mission The head of Kyiv 's national security council said Putin could demand a tactical nuclear weapon be used if Russia 's army is beaten in Ukraine Russian soldiers load a Iskander-M short-range ballistic missile launchers at a firing position as part of Russian military drill intended to train the troops in using tactical nuclear weapons Meanwhile , Mr Stoltenberg warned in Brussels of the threat from China , adding that nuclear transparency should form the basis of Nato 's nuc

We find, that the sentences of size 2 are of different types.
 - Words belonging to the previous or following sentence, but are split by punctation errors (e.g. "Mild. Tonight: Rain slowly returns ...") -> merge manually 
 - Words at the end of an document (e.g. "Watch:") -> drop
 - Valid sentences (e.g. "Why?") -> valid, keep
 - Section numerations (e.g. "2. (paragraph)") -> valid, keep (might be helpful for the model to understand the text, since they give a structure)
 

Since the number of such sentences is managable, we can manually decide on each case, whether to keep, merge or drop it.

In [20]:
# function to merge specific sentences with either the previous or next sentence
def merge_specific_sentence(df, row_idx, sentence_idx, direction):
    
    # List of sentences of the specified row
    sentences = df.at[row_idx, 'tokens']
    
    # Merge based on the direction
    if direction == 'previous':
        merged_sentence = sentences[sentence_idx - 1] + sentences[sentence_idx]
        sentences[sentence_idx - 1] = merged_sentence
        del sentences[sentence_idx]
        
    elif direction == 'next':
        merged_sentence = sentences[sentence_idx] + sentences[sentence_idx + 1]
        sentences[sentence_idx] = merged_sentence
        del sentences[sentence_idx + 1]
    
    # Update the row in the dataframe
    df.at[row_idx, 'tokens'] = sentences


print(f"There are {len(unusual_sentences)} sentences of unusual length before merging some sentences of length 2.")

merge_specific_sentence(df_short, 87, 6, 'previous')  # Merge "Drones ." with previous
merge_specific_sentence(df_short, 88, 0, 'next')      # Merge "U.K ." with next
merge_specific_sentence(df_short, 127, 23, 'next')    # Merge "Mild ." with next
merge_specific_sentence(df_short, 136, 3, 'next')     # Merge "Gov ." with next

# drop them from the list of unusual sentences
unusual_sentences = [i for j, i in enumerate(unusual_sentences) if j not in [6,7,12,13]]

print(f"There are {len(unusual_sentences)} sentences of unusual length after merging some sentences of length 2.")

There are 20 sentences of unusual length before merging some sentences of length 2.
There are 16 sentences of unusual length after merging some sentences of length 2.


In [21]:
# function to drop specific sentences from the dataframe
def drop_sentence_by_indices(df, row_idx, sentence_idx):
    
    # copy list of sentences of the specified row
    sentences = df.at[row_idx, 'tokens']
        
    if 0 <= sentence_idx < len(sentences):
        # drop sentence
        del sentences[sentence_idx]
        # update the dataframe
        df.at[row_idx, 'tokens'] = sentences
        
    
    else:
        print(f"Invalid sentence index {sentence_idx} for row {row_idx}")


print(f"There are {len(unusual_sentences)} sentences of unusual length before dropping some sentences of length 2.")

drop_sentence_by_indices(df_short, 55, 10)
drop_sentence_by_indices(df_short, 80, 21)

# drop them from the list of unusual sentences
unusual_sentences = [i for j, i in enumerate(unusual_sentences) if j not in [2, 5]]

print(f"There are {len(unusual_sentences)} sentences of unusual length after dropping some sentences of length 2.")

There are 16 sentences of unusual length before dropping some sentences of length 2.
There are 14 sentences of unusual length after dropping some sentences of length 2.


#### Handling very long sentences

There are several very long sentences. Looking at the data, we see that some of them are infact correctly split and complete single sentences. Others, however, are in fact multiple sentences stored as one, because the splitting did not work correctly. Let's manually split them.

In [22]:
# function to manually replace a unsplitted sequence of sentences with the manually correctly splitted sentences
def manually_split_sentence(idx_row, idx_sentence, splitted_sentence):
    df_short.at[idx_row, 'tokens'] = df_short.at[idx_row, 'tokens'][:idx_sentence] + splitted_sentence + df_short.at[6, 'tokens'][idx_sentence+1:]


print(f"There are {len(unusual_sentences)} sentences of unusual length before manually splitting some too long sentences.")

manually_split_sentence(6,6,[
    ["Jens", "Stoltenberg", "(", "pictured", ")", ",", "the", "13th", "secretary", "general", "of", "NATO", ",", "revealed", "there", "were", "live", "discussions", "among", "members", "about", "removing", "missiles", "from", "storage", "and", "putting", "them", "on", "standby", "."],
    ["A", "Netherlands", "'", "Air", "Force", "F-16", "jetfighter", "takes", "part", "in", "the", "NATO", "exercise", "as", "part", "of", "the", "NATO", "Air", "Policing", "mission", "."],
    ["The", "head", "of", "Kyiv", "'s", "national", "security", "council", "said", "Putin", "could", "demand", "a", "tactical", "nuclear", "weapon", "be", "used", "if", "Russia", "'s", "army", "is", "beaten", "in", "Ukraine", "."],
    ["Russian", "soldiers", "load", "a", "Iskander-M", "short-range", "ballistic", "missile", "launcher", "at", "a", "firing", "position", "as", "part", "of", "a", "Russian", "military", "drill", "intended", "to", "train", "the", "troops", "in", "using", "tactical", "nuclear", "weapons", "."],
    ["Meanwhile", ",", "Mr", "Stoltenberg", "warned", "in", "Brussels", "of", "the", "threat", "from", "China", ",", "adding", "that", "nuclear", "transparency", "should", "form", "the", "basis", "of", "NATO", "'s", "nuclear", "strategy", "to", "prepare", "the", "alliance", "for", "the", "dangers", "of", "the", "world", "."]
])

manually_split_sentence(46,12,[
    ["“", "The", "Complaint", "alleged", "that", "several", "of", "the", "Vietnamese", "orphans", "brought", "to", "the", "United", "States", "under", "Operation", "Babylift", "stated", "they", "are", "not", "orphans", "and", "that", "they", "wish", "to", "return", "to", "Vietnam", ".", "”"],
    ["A", "statement", "issued", "on", "April", "4", ",", "1975", ",", "by", "“", "professors", "of", "ethics", "and", "religion", ",", "”", "pointed", "out", "that", "many", "“", "of", "the", "children", "are", "not", "orphans", ";", "their", "parents", "or", "relatives", "may", "still", "be", "alive", ",", "although", "displaced", ",", "in", "Vietnam", "…", "The", "Vietnamese", "children", "should", "be", "allowed", "to", "stay", "in", "Vietnam", "where", "they", "belong", ".", "”"],
    ["The", "operation", "was", "celebrated", "by", "the", "corporate", "media", "and", "“", "Hollywood", "’", "s", "celebrity", "elite", "…", "[", "and", ",", "as", "a", "propaganda", "event", "]", "generated", "a", "spectacle", "of", "celebration", "and", "emphasized", "that", "the", "babies", "were", "more", "than", "just", "average", "orphans", ",", "”", "writes", "US", "History", "Scene", "."]
])

manually_split_sentence(73,8,[
    ["But", "he", "noted", "that", "“", "even", "as", "the", "Russians", "have", "gained", "territory", ",", "they", "do", "it", "at", "a", "pretty", "big", "cost", "in", "number", "of", "casualties", ",", "like", "in", "personnel", ",", "but", "also", "in", "number", "of", "pieces", "of", "equipment", "that", "are", "being", "taken", "out.", "”"],
    ["Austin", "said", "in", "his", "remarks", "Tuesday", "that", "“", "Russia", "has", "paid", "a", "staggering", "cost", "for", "(", "President", "Vladimir", ")", "Putin", "’", "s", "imperial", "dreams", "”", ",", "using", "“", "up", "to", "$", "211", "billion", "to", "equip", ",", "deploy", ",", "maintain", ",", "and", "sustain", "its", "imperial", "aggression", "against", "Ukraine.", "”"],
    ["“", "At", "least", "315,000", "Russian", "troops", "have", "been", "killed", "or", "wounded", "”", "since", "Russia", "launched", "its", "all-out", "invasion", "of", "Ukraine", "in", "2022", ",", "Austin", "said", "."],
    ["Austin", "added", "that", "Ukraine", "has", "also", "“", "sunk", ",", "destroyed", ",", "or", "damaged", "some", "20", "medium-to-large", "Russian", "navy", "vessels.", "”"],
    ["The", "sinkings", "have", "been", "an", "embarrassment", "for", "Moscow", "and", "Russian", "state", "media", "confirmed", "Tuesday", "that", "the", "country", "had", "replaced", "the", "head", "of", "its", "navy", "."]
])

manually_split_sentence(77,13,[
    ["Подробнее", "на", "РБК", ":", "https", ":", "//www.rbc.ru/politics/14/06/2023/6489e6f39a794778d61881b4", "."],
    ["The", "picture", "of", "widening", "war", "is", "beginning", "to", "form", ":", "Professor", "Sergey", "Karaganov", ",", "honorary", "chairman", "of", "Russia", "’", "s", "Council", "on", "Foreign", "and", "Defense", "Policy", ",", "and", "academic", "supervisor", "at", "the", "School", "of", "International", "Economics", "and", "Foreign", "Affairs", "Higher", "School", "of", "Economics", "(", "HSE", ")", "in", "Moscow", "."],
    ["Sergey", "Karaganov", ":", "By", "using", "its", "nuclear", "weapons", ",", "Russia", "could", "save", "humanity", "from", "a", "global", "catastrophe", "."],
    ["A", "tough", "but", "necessary", "decision", "would", "likely", "force", "the", "West", "to", "back", "off", ",", "enabling", "an", "earlier", "end", "to", "the", "Ukraine", "crisis", "and", "preventing", "it", "from", "expanding", "to", "other", "states", "."],
    ["Karaganov", "’", "s", "description", "of", "the", "Western", "World", "as", "“", "anti-human", "ideologies", ":", "the", "denial", "of", "family", ",", "homeland", ",", "history", ",", "love", "between", "men", "and", "women", ",", "faith", ",", "service", "to", "higher", "ideals", ",", "everything", "that", "is", "human", ",", "”", "shows", "a", "rising", "realization", "that", "Russia", "sees", "itself", "confronted", "by", "a", "Satanic", "force", "that", "must", "be", "destroyed", "."]
])

manually_split_sentence(147,7,[
    ["At", "the", "same", "time", ",", "the", "official", "claimed", "that", "the", "danger", "of", "Kiev", "using", "a", "‘", "dirty", "bomb", "’", "remains", "“", "very", "high", ",", "”", "and", "that", "Ukraine", "“", "has", "the", "opportunity", "”", "and", "“", "has", "every", "reason", "to", "use", "it", "."],
    ["Earlier", "on", "Tuesday", ",", "in", "a", "letter", "to", "UN", "Secretary-General", "Antonio", "Guterres", ",", "the", "Russian", "mission", "’", "s", "head", ",", "Vassily", "Nebenzia", ",", "said", "that", "Moscow", "would", "consider", "the", "use", "of", "a", "‘", "dirty", "bomb", "’", "by", "Ukraine", "“", "an", "act", "of", "nuclear", "terrorism", ".", "”"],
    ["Meanwhile", ",", "Ukrainian", "Foreign", "Minister", "Dmitry", "Kuleba", "earlier", "called", "the", "Russian", "allegations", "“", "as", "absurd", "as", "they", "are", "dangerous", ".", "”"],
    ["He", "also", "noted", "that", "“", "Russians", "often", "accuse", "others", "of", "what", "they", "plan", "themselves", ".", "”"],
    ["On", "Tuesday", ",", "the", "minister", "revealed", "that", "Ukraine", "had", "invited", "IAEA", "inspectors", "to", "come", "and", "to", "“", "prove", "that", "Ukraine", "has", "neither", "any", "dirty", "bombs", "nor", "plans", "to", "develop", "them", ".", "”"],
    ["“", "Good", "cooperation", "with", "IAEA", "and", "partners", "allows", "us", "to", "foil", "Russia", "’", "s", "‘", "dirty", "bomb", "’", "disinfo", "campaign", ",", "”", "Kuleba", "said", "."]
])

manually_split_sentence(154,5,[
    ["WHO", "Tedros", "describes", "Disease", "X", "as", "a", "blueprint", "at", "a", "panel", "discussion", "at", "WEF24", "—", "Tamara", "Ugolini", "🇨🇦", "(", "@", "TamaraUgo", ")", "January", "17", ",", "2024", "."],
    ["He", "says", "that", "COVID", "was", "the", "first", "Disease", "X", "and", "we", "“", "need", "a", "placeholder", "for", "diseases", "we", "don", "’", "t", "know", ",", "”", "including", "dedication", "to", "private", "sector", "drug", "research", "and", "development", "."],
    ["Disease", "X", "serves", "as", "a", "“", "placeholder", "for", "the", "diseases", "we", "don", "’", "t", "know", ",", "”", "and", "it", "begins", "with", "private-sector", "research", "and", "development", "to", "test", "drugs", "and", "“", "other", "things", ".", "”"],
    ["Tedros", "stressed", "that", "the", "next", "pandemic", "is", "“", "not", "a", "matter", "of", "if", ",", "but", "rather", "when", ",", "”", "while", "noting", "that", "COVID-19", "was", "the", "original", "Disease", "X", ",", "in", "which", "they", "were", "able", "to", "facilitate", "the", "Pandemic", "Fund", "in", "partnership", "with", "the", "World", "Bank", "."]
])

# drop them from the list of unusual sentences
unusual_sentences = [i for j, i in enumerate(unusual_sentences) if j not in [0, 1, 2, 3, 8, 9]]

print(f"There are {len(unusual_sentences)} sentences of unusual length after manually splitting some too long sentences.")


There are 14 sentences of unusual length before manually splitting some too long sentences.
There are 8 sentences of unusual length after manually splitting some too long sentences.


#### Validating fixed unusual-length-sentences

We update the unusual sentences and print them. We find that all unusually short and long sentences that still occur, are valid and ment to be kept.

In [23]:
# Find sentences with less than 3 words or more than 100 words
unusual_sentences = find_unusual_length_sentences(df_short)

print(f"After handling, there are {len(unusual_sentences)} unusual sentences left.")

# Display the unusual sentences
for entry in unusual_sentences:
    print(f"Sentence length: {len(entry["sentence"])}, (Idx {entry['row_index']}, {entry['sentence_index']}), Sentence: {' '.join(entry["sentence"])}")

After handling, there are 8 unusual sentences left.
Sentence length: 2, (Idx 113, 7), Sentence: 2 .
Sentence length: 2, (Idx 113, 10), Sentence: 3 .
Sentence length: 2, (Idx 113, 13), Sentence: 4 .
Sentence length: 2, (Idx 113, 16), Sentence: 5 .
Sentence length: 2, (Idx 174, 3), Sentence: Why ?
Sentence length: 2, (Idx 185, 18), Sentence: LOL !
Sentence length: 133, (Idx 186, 0), Sentence: The Lies that All Pro-U.S.-Government Media Spread by Eric Zuesse , The Duran : First here , are a group of lies that all are false , all for the very same reason — that they all blatantly contradict the actual ( as is to be documented here ) history ( just click onto each given lying phrase below , to see instances in which the given false phrase has been reported as being instead true — and , then , I shall here document them all to be not just false but the reverse of truth , the exact opposite of the reality ) : TRUTH LIVES on at https : //sgtreport.tv/ “ Russia ’ s war of aggression against Ukr

Now that the text is correctly segmentated into sentences and words, we can proceed with text normalization.

### Text Normalization

  
Text normalization is the process of transforming text into a standard format, which typically involves:

- Converting text to lowercase
- Removing punctuation
- Removing stopwords
- Removing special characters and numbers
- Lemmatization or stemming

This process helps in reducing the complexity of the text and making it more uniform for further analysis or processing.

We will implement text normalization in the next steps.

In [24]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

print(torch.version.cuda)  # Shows CUDA version if available
print(torch.cuda.is_available())  # Checks if CUDA is available



Using device: cuda
12.1
True


Check if a NVIDIA Graphics card is installed (and the nessesary CUDA packages) to be used later for lemmatization because with just the CPU it was taking around 15min every time. 

In [None]:
def text_normalization(df, column_name):
    """
    Text normalization with optimized batch processing for nested lists of tokens
    with proper abbreviation handling
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    
    nlp = stanza.Pipeline('en',
                         processors='tokenize,lemma',
                         device=device,
                         use_gpu=True,
                         batch_size=4096,
                         tokenize_batch_size=4096,
                         tokenize_pretokenized=True,
                         download_method=None
                         )
    
    # Depending on the graphics card the batch size can be adjusted
    
    stop_words = set(stopwords.words('english'))
    
    # Common abbreviations and their normalized forms
    abbreviations = {
        'p.m.': 'pm',
        'a.m.': 'am',
        'e.g.': 'eg',
        'i.e.': 'ie',
        'etc.': 'etc',
        'vs.': 'vs',
        'mr.': 'mr',
        'mrs.': 'mrs',
        'dr.': 'dr',
        'prof.': 'prof',
        'u.s.': 'us',
        'u.k.': 'uk',
        'n.y.': 'ny',
        'l.a.': 'la',
        'st.': 'st',
        'inc.': 'inc',
        'ltd.': 'ltd',
        'co.': 'co',
        'corp.': 'corp',
        'avg.': 'avg',
        'approx.': 'approx'
    }
    
    def normalize_token(token):
        """Normalize a single token"""
        if not isinstance(token, str):
            return ''
            
        # Convert to lowercase first
        token = token.lower().strip()
        
        # Check if it's an abbreviation
        if token in abbreviations:
            return abbreviations[token]
            
        # Remove special characters and numbers for non-abbreviations
        token = re.sub(r'[^a-z]', '', token)
        
        return token
    
    def clean_text(nested_tokens):
        """Clean and preprocess nested list of tokens"""
        if not isinstance(nested_tokens, list):
            return []
        
        cleaned_tokens = []
        for sentence in nested_tokens:
            if isinstance(sentence, list):
                for token in sentence:
                    # Normalize the token
                    normalized = normalize_token(token)
                    # Check if token is not empty and not a stopword
                    if normalized and normalized not in stop_words:
                        cleaned_tokens.append(normalized)
        
        return cleaned_tokens
    
    def process_text(tokens):
        """Process a single text through Stanza"""
        try:
            if not tokens:
                return []
            # Join tokens into a single string for processing
            text = ' '.join(tokens)
            doc = nlp(text)
            # Extract lemmas and filter stopwords
            lemmas = []
            for sent in doc.sentences:
                for word in sent.words:
                    lemma = word.lemma.lower()
                    # Check if the original token was an abbreviation
                    if lemma not in stop_words:
                        lemmas.append(lemma)
            return lemmas
        except Exception as e:
            print(f"Error processing text: {str(e)}")
            return []
    
    def process_batch(batch_tokens):
        """Process a batch of nested token lists"""
        results = []
        for tokens in batch_tokens:
            # Clean and flatten tokens
            cleaned_tokens = clean_text(tokens)
            # Process cleaned tokens
            normalized = process_text(cleaned_tokens)
            # Ensure all tokens are properly normalized
            normalized = [normalize_token(token) for token in normalized if normalize_token(token)]
            results.append(normalized)
            
        return results
    
    # Process in batches
    batch_size = 50
    normalized_tokens = []
    total_batches = (len(df) + batch_size - 1) // batch_size
    
    print(f"Starting processing of {len(df)} rows in {total_batches} batches")
    
    for i in tqdm(range(0, len(df), batch_size), desc="Normalizing text"):
        batch_df = df.iloc[i:i + batch_size]
        batch_tokens = batch_df[column_name].tolist()
        
        # Print sample of first batch for debugging
        if i == 0:
            print("\nSample processing:")
            sample_tokens = batch_tokens[0][:5] if batch_tokens else []  # First 5 tokens of first row
            print(f"Original tokens: {sample_tokens}")
            cleaned = clean_text([sample_tokens])
            print(f"Cleaned tokens: {cleaned}")
            normalized = process_batch([sample_tokens])
            print(f"Normalized tokens: {normalized[0]}\n")
        
        normalized_batch = process_batch(batch_tokens)
        normalized_tokens.extend(normalized_batch)
    
    print(f"\nProcessing completed. Total normalized entries: {len(normalized_tokens)}")
    print(f"Non-empty normalized entries: {sum(1 for tokens in normalized_tokens if tokens)}")
    
    # Final check to ensure no punctuation or special characters remain
    final_tokens = []
    for tokens in normalized_tokens:
        cleaned = [token for token in tokens if token and not any(char in string.punctuation for char in token)]
        final_tokens.append(cleaned)
    
    # Update DataFrame with normalized tokens
    df_normalized = df.copy()
    df_normalized[f'{column_name}_normalized'] = final_tokens
    
    return df_normalized

In [26]:


def check_text_normalization(df):
    """
    Validates if text normalization has been applied correctly
    
    Parameters:
    df (pandas.DataFrame): DataFrame containing original and normalized tokens
    
    Returns:
    dict: Validation results with detailed statistics and examples of any issues found
    """
    results = {
        'overall_status': 'PASS',
        'tests': {},
        'statistics': {},
        'issues_found': {},
        'sample_issues': {}
    }
    
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    
    def is_lowercase(text):
        return text.islower()
    
    def contains_punctuation(text):
        return any(char in string.punctuation for char in text)
    
    def contains_numbers(text):
        return bool(re.search(r'\d', text))
    
    def contains_special_chars(text):
        return bool(re.search(r'[^a-zA-Z\s]', text))
    
    def is_stopword(text):
        return text in stop_words
    
    # Initialize counters for statistics
    stats = {
        'total_original_tokens': 0,
        'total_normalized_tokens': 0,
        'uppercase_found': 0,
        'punctuation_found': 0,
        'numbers_found': 0,
        'special_chars_found': 0,
        'stopwords_found': 0
    }
    
    # Initialize issue tracking
    issues = {
        'uppercase_tokens': [],
        'punctuation_tokens': [],
        'number_tokens': [],
        'special_char_tokens': [],
        'stopword_tokens': []
    }
    
    # Check each row
    for idx, row in df.iterrows():
        if 'tokens_normalized' not in row:
            results['overall_status'] = 'FAIL'
            results['tests']['normalization_column_exists'] = False
            return results
        
        normalized_tokens = row['tokens_normalized']
        
        if not isinstance(normalized_tokens, list):
            continue
            
        # Check each normalized token
        for token in normalized_tokens:
            stats['total_normalized_tokens'] += 1
            
            # Check for uppercase
            if not is_lowercase(token):
                stats['uppercase_found'] += 1
                if len(issues['uppercase_tokens']) < 5:
                    issues['uppercase_tokens'].append((idx, token))
            
            # Check for punctuation
            if contains_punctuation(token):
                stats['punctuation_found'] += 1
                if len(issues['punctuation_tokens']) < 5:
                    issues['punctuation_tokens'].append((idx, token))
            
            # Check for numbers
            if contains_numbers(token):
                stats['numbers_found'] += 1
                if len(issues['number_tokens']) < 5:
                    issues['number_tokens'].append((idx, token))
            
            # Check for special characters
            if contains_special_chars(token):
                stats['special_chars_found'] += 1
                if len(issues['special_char_tokens']) < 5:
                    issues['special_char_tokens'].append((idx, token))
            
            # Check for stopwords
            if is_stopword(token):
                stats['stopwords_found'] += 1
                if len(issues['stopword_tokens']) < 5:
                    issues['stopword_tokens'].append((idx, token))
    
    # Calculate pass/fail for each test
    tests = {
        'uppercase_test': stats['uppercase_found'] == 0,
        'punctuation_test': stats['punctuation_found'] == 0,
        'numbers_test': stats['numbers_found'] == 0,
        'special_chars_test': stats['special_chars_found'] == 0,
        'stopwords_test': stats['stopwords_found'] == 0
    }
    
    # Update overall status
    if not all(tests.values()):
        results['overall_status'] = 'FAIL'
    
    # Calculate percentages for statistics
    total_tokens = stats['total_normalized_tokens']
    if total_tokens > 0:
        stats.update({
            'uppercase_percentage': (stats['uppercase_found'] / total_tokens) * 100,
            'punctuation_percentage': (stats['punctuation_found'] / total_tokens) * 100,
            'numbers_percentage': (stats['numbers_found'] / total_tokens) * 100,
            'special_chars_percentage': (stats['special_chars_found'] / total_tokens) * 100,
            'stopwords_percentage': (stats['stopwords_found'] / total_tokens) * 100
        })
    
    # Compile results
    results['tests'] = tests
    results['statistics'] = stats
    results['issues_found'] = {k: len(v) for k, v in issues.items()}
    results['sample_issues'] = issues
    
    # Print summary report
    print("\nText Normalization Validation Report")
    print("====================================")
    print(f"Overall Status: {results['overall_status']}")
    print("\nTest Results:")
    for test, passed in tests.items():
        print(f"- {test}: {'PASS' if passed else 'FAIL'}")
    
    print("\nStatistics:")
    print(f"- Total normalized tokens: {stats['total_normalized_tokens']}")
    if total_tokens > 0:
        print(f"- Uppercase tokens: {stats['uppercase_found']} ({stats['uppercase_percentage']:.2f}%)")
        print(f"- Tokens with punctuation: {stats['punctuation_found']} ({stats['punctuation_percentage']:.2f}%)")
        print(f"- Tokens with numbers: {stats['numbers_found']} ({stats['numbers_percentage']:.2f}%)")
        print(f"- Tokens with special characters: {stats['special_chars_found']} ({stats['special_chars_percentage']:.2f}%)")
        print(f"- Stopwords found: {stats['stopwords_found']} ({stats['stopwords_percentage']:.2f}%)")
    
    if results['overall_status'] == 'FAIL':
        print("\nSample Issues Found:")
        for issue_type, samples in issues.items():
            if samples:
                print(f"\n{issue_type}:")
                for idx, token in samples:
                    print(f"- Row {idx}: '{token}'")
    
    return results



In [None]:
# Clear GPU memory
torch.cuda.empty_cache()



# Run the normalization
df_short = text_normalization(df_short, 'tokens')

# Verify the results
print("\nResults verification:")
print("Sample of normalized tokens (first 3 rows):")
print(df_short['tokens_normalized'].head(3))


results = check_text_normalization(df_short)

Using device: cuda


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\krona\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
2024-11-09 23:05:58 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| lemma     | combined_nocharlm |

2024-11-09 23:05:58 INFO: Using device: cuda
2024-11-09 23:05:58 INFO: Loading: tokenize
2024-11-09 23:05:58 INFO: Loading: lemma
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-11-09 23:06:00 INFO: Done loading processors!


Starting processing of 198 rows in 4 batches


Normalizing text:   0%|          | 0/4 [00:00<?, ?it/s]


Sample processing:
Original tokens: [['The', 'World', 'Needs', 'Peacemaker', 'Trump', 'Again', 'by', 'Jeff', 'Crouere', ',', 'The', 'Liberty', 'Daily', ':', 'The', 'world', 'is', 'in', 'total', 'chaos', 'after', '39', 'months', 'of', 'the', 'Biden', 'presidency', '.'], ['The', 'southern', 'border', 'of', 'our', 'country', 'is', 'porous', 'and', 'millions', 'of', 'individuals', 'from', 'around', 'the', 'world', 'have', 'descended', 'on', 'our', 'country', '.'], ['These', '“', 'undocumented', 'migrants', '”', 'include', 'terrorists', ',', 'drug', 'dealers', ',', 'and', 'intelligence', 'agents', 'of', 'countries', 'such', 'as', 'our', 'enemy', ',', 'China', '.'], ['It', 'should', 'alarm', 'every', 'American', 'that', '22,233', 'Chinese', 'nationals', 'have', 'illegally', 'entered', 'the', 'United', 'States', 'since', 'the', 'beginning', 'of', 'the', 'fiscal', 'year', 'in', 'October', '.'], ['If', 'this', 'rate', 'continues', ',', 'this', 'year', '’', 's', 'total', 'will', 'easily', 'top'

Normalizing text: 100%|██████████| 4/4 [00:15<00:00,  3.79s/it]


Processing completed. Total normalized entries: 198
Non-empty normalized entries: 198

Results verification:
Sample of normalized tokens (first 3 rows):
0                                        [world, need, peacemaker, trump, jeff, crouere, liberty, daily, world, total, chaos, month, biden, presidency, southern, border, country, porous, million, individual, around, world, descend, country, undocumented, migrant, include, terrorist, drug, dealer, intelligence, agent, country, enemy, china, alarm, every, american, chinese, national, illegally, enter, unite, state, since, beginning, fiscal, year, october, rate, continue, year, total, easily, top, chinese, national, illegally, enter, country, last, year, truth, life, http, sgtreporttv, astounding, increase, number, chinese, national, illegally, enter, country, last, year, since, china, communist, nation, individual, freely, travel, unite, state, good, life, senator, roger, marshall, rk, believe, influx, due, direction, ccp, chinese, comm




### CoNLL-U format
Now we will write a function to store the data in CoNLL-U format to ensure *reproducibility* and *platform independency*. Ten text parts will be written in one file in CoNNL-U format. 
If the execution of a cell is aborted, the currently open file will be closed.

In [28]:
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')

def convert_to_connlu(dataframe, column_name):
    """
    Function that takes a dataframe and converts each row in a CoNNL-U file. Each file consists of ten text parts
    in the CoNNL-U format.
    """
    output_dir = os.path.join("..", "CoNLL")
    
    file_index = 1
    sentence_count = 0
    total_sentences = 0
    output_file = os.path.join(output_dir, f"output_{file_index}.conllu")
    
    # Converting text parts to CoNNL-U format and closing files again. 
    try:        
        f = open(output_file, "w", encoding="utf-8")
    
        for idx, row in dataframe.iterrows():
            sentence = " ".join(row[f"{column_name}_normalized"])
            doc = nlp(sentence)
            CoNLL.write_doc2conll(doc, f)
            f.write("\n")
            
            sentence_count += 1
            total_sentences += 1
            
            # Closing file after ten converted text parts to one CoNNL-U file
            if sentence_count >= 10:
                f.close()
                file_index += 1
                output_file = os.path.join(output_dir, f"output_{file_index}.conllu")
                f = open(output_file, "w", encoding="utf-8")
                sentence_count = 0 
    except KeyboardInterrupt:
        print("\nStopped running. Closing open files...")
    except Exception as e:
        print(f"\nAn error happened {e}")
    # If the cell gets aborted, any open file is
    # closed so that there is no remaining open file when the cell is stopped.
    finally:
        if not f.closed:
            f.close()
        
    created_files = len([name for name in os.listdir(output_dir) if name.startswith('output') and name.endswith('.conllu')])
    print(f"\nTotal sentences processed: {total_sentences}")
    print(f"Total files created: {created_files}")

2024-11-09 23:06:15 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

2024-11-09 23:06:15 INFO: Downloaded file to C:\Users\krona\stanza_resources\resources.json
2024-11-09 23:06:16 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |

2024-11-09 23:06:16 INFO: Using device: cuda
2024-11-09 23:06:16 INFO: Loading: tokenize
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-11-09 23:06:16 INFO: Loading: mwt
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-11-09 23:06:16 INFO: Loading: pos
  checkpoint = torch.load(filename, lambda storage, loc: storage)
  data = torch.load(self.filename, lambda storage, loc: storage)
  state = torch.load(filename, lambda storage, loc: storage)
2024-11-09 23:06:17 INFO: Loading: lemma
  checkpoint = torch.load(filename, lambda storage, loc: stora

Now we call the function *convert_to_connlu* on our dataframe *df_short* to receive files in CoNNL-U format consisting of ten text parts per file.

In [29]:
convert_to_connlu(df_short, 'tokens')


Total sentences processed: 198
Total files created: 20
