# Data Processing for Persuasion Detection

This notebook prepares the data for the persuasion detection project. The workflow imports the project's helper modules, inspects annotated spans, wraps those spans in the article text, translates the wrapped articles into Russian, and maps the span labels back onto the translated articles.

## Importing Required Modules

First, we add the `../scripts` directory to `sys.path` so that the custom `data_processing` package can be imported. The package provides the utility functions used throughout this notebook.

In [1]:
import sys

sys.path.append('../scripts')  # adjust path as needed

In [2]:
from data_processing.wrap import wrap_spans_from_file, print_span

## Exploring Annotated Spans

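`print_span` prints the text of a single annotated span, given an article ID, the start and end offsets, and a language folder. This is a quick way to check that the offsets in the annotation files line up with the article text and to see what the annotated data looks like.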

In [3]:
# Print the span between character offsets 3196 and 3214 of article 2318 in the fr folder
print_span(231, 3196, 3214, 'fr', base_path='../data/raw')

falsifient le réel
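
The lookup that `print_span` performs can be approximated as follows. This is a minimal sketch, not the project's implementation: the function name `read_span` is hypothetical, and the article path layout (`<base_path>/<lang>/train-articles-subtask-3/article<id>.txt`) is an assumption pieced together from the paths and filenames used later in this notebook. It also assumes that offsets count characters and that the end offset is exclusive, which matches the 18-character output above for offsets 3196 to 3214.

```python
import os


def read_span(article_id, start, end, lang, base_path="../data/raw"):
    """Return the characters in [start, end) of one article (hypothetical helper)."""
    # Assumed layout: <base_path>/<lang>/train-articles-subtask-3/article<id>.txt
    article_path = os.path.join(
        base_path, lang, "train-articles-subtask-3", f"article{article_id}.txt"
    )
    with open(article_path, encoding="utf-8") as f:
        text = f.read()
    # Offsets are assumed to index characters, not bytes
    return text[start:end]
```

Reading the file as UTF-8 text matters here: slicing raw bytes would shift the offsets in any article that contains accented characters.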


## Preparing Directories for Processing

We define the paths for the raw and processed data directories. This ensures that our scripts can locate the input files and save the processed outputs in the correct locations.

In [4]:
import os

RAW_DIR = '../data/raw'
PROCESSED_DIR = '../data/processed'

## Wrapping Annotated Spans in Articles

The following code iterates through each language directory in the raw data folder, locates the annotation files, and wraps the annotated spans in the corresponding articles. The processed articles are saved in the output directory for each language.

In [5]:
# Iterate through language directories in the raw data folder
# for lang_dir in os.listdir(RAW_DIR):
#     lang_path = os.path.join(RAW_DIR, lang_dir)
#     if os.path.isdir(lang_path):
#         language_code = lang_dir
#         print(f"Processing language: {language_code}")

#         # Define paths for the current language
#         labels_file = os.path.join(lang_path, 'train-labels-subtask-3-spans.txt')
#         articles_folder = os.path.join(lang_path, 'train-articles-subtask-3')
#         output_folder = os.path.join(PROCESSED_DIR, language_code, 'wrapped-articles')

#         # Check if the labels file exists for this language
#         if os.path.exists(labels_file):
#             print(f"  Labels file found: {labels_file}")
#             print(f"  Articles folder: {articles_folder}")
#             print(f"  Output folder: {output_folder}")

#             # Ensure the output directory exists
#             os.makedirs(output_folder, exist_ok=True)

#             # Wrap spans for the current language
#             wrap_spans_from_file(
#                 labels_file=labels_file,
#                 articles_folder=articles_folder,
#                 output_folder=output_folder,
#                 lang=language_code
#             )
#             print(f"  Finished wrapping spans for {language_code}.")
#         else:
#             print(f"  Labels file not found for {language_code}, skipping.")
#         print("---")

# print("Processing complete.")
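
The wrapping step itself can be sketched as below. This is an illustration under stated assumptions rather than the behavior of `wrap_spans_from_file`: the function name `wrap_spans`, the `<span label="...">` tag format, and the `(start, end, label)` tuple layout are all hypothetical, and the sketch assumes that spans do not overlap. Wrapping presumably lets the span boundaries travel with the text through translation, where character offsets would otherwise be lost.

```python
def wrap_spans(text, spans):
    """Wrap each (start, end, label) span of `text` in a marker tag.

    Assumes non-overlapping spans with character offsets and an exclusive end.
    The <span label="..."> tag format is illustrative only.
    """
    # Work from the last span to the first so earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        wrapped = f'<span label="{label}">{text[start:end]}</span>'
        text = text[:start] + wrapped + text[end:]
    return text
```

For example, `wrap_spans("abc def ghi", [(0, 3, "A"), (8, 11, "B")])` returns `<span label="A">abc</span> def <span label="B">ghi</span>`. Processing the spans in reverse order is the key design choice: inserting a tag changes the position of everything after it, so inserting from the end leaves the remaining offsets untouched.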

In [6]:
import glob
import os

import openai
from dotenv import load_dotenv

# Load the OpenAI API key from the environment or a .env file
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to the environment or a .env file")

client = openai.AsyncOpenAI(api_key=api_key)

# Send one small request to confirm that the key and the client work
test_response = await client.responses.create(
    model="gpt-4o-mini",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(test_response.output_text)

Under a sky sprinkled with stars, a gentle unicorn named Lila danced on moonbeams, spreading dreams of kindness to every sleeping child.


In [7]:
# from data_processing.translate import translate_file_to_language

# en_folder = '../data/processed/en/wrapped-articles'
# ru_folder = '../data/processed/ru/wrapped-articles'
# os.makedirs(ru_folder, exist_ok=True)

# en_files = glob.glob(os.path.join(en_folder, '*.txt'))

# for en_file in en_files:
#     filename = os.path.basename(en_file)
#     ru_file = os.path.join(ru_folder, filename)
#     if not os.path.exists(ru_file):
#         print(f"Translating {filename} to Russian...")
#         translate_file_to_language(en_file, 'ru', client, base_dir='../data/processed')
#     else:
#         print(f"Already translated: {filename}")

# Translating Texts

## English - Russian

In [8]:
# Async translation worker for batch file translation using the OpenAI API
import asyncio
from data_processing.translate import async_translate_file_to_language


CONCURRENCY = 8  # maximum number of translations in flight at once
sem = asyncio.Semaphore(CONCURRENCY)

async def translate_worker(file_list, target_lang, client, base_dir):
    """Translate each file in file_list into target_lang, skipping files already translated."""
    total = len(file_list)
    # Create the output folder once rather than once per file
    target_folder = os.path.join(base_dir, target_lang, 'wrapped-articles')
    os.makedirs(target_folder, exist_ok=True)

    async def process_file(idx, file_path):
        # Skip files whose translation already exists in the target folder
        filename = os.path.basename(file_path)
        target_file = os.path.join(target_folder, filename)
        if os.path.exists(target_file):
            print(f"[{idx+1}/{total}] Already translated: {filename}")
            return
        # The semaphore caps how many translations run concurrently
        async with sem:
            try:
                print(f"[{idx+1}/{total}] Translating {file_path}...")
                await async_translate_file_to_language(file_path, target_lang, client, base_dir=base_dir)
                print(f"[{idx+1}/{total}] Done: {file_path}")
            except Exception as e:
                # Log the failure and continue with the remaining files
                print(f"[{idx+1}/{total}] Error translating {file_path}: {e}")

    await asyncio.gather(*(process_file(i, f) for i, f in enumerate(file_list)))
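
The concurrency pattern used by `translate_worker` can be tried in isolation, without any API calls. The sketch below is a stand-alone illustration: `run_bounded` and its internals are hypothetical names, and `asyncio.sleep` stands in for the translation request. It records the peak number of tasks running at once to show that the semaphore really does cap it.

```python
import asyncio


async def run_bounded(items, limit):
    """Process items concurrently, with at most `limit` running at once."""
    sem = asyncio.Semaphore(limit)
    running = 0  # tasks currently inside the semaphore
    peak = 0     # highest value `running` ever reached

    async def handle(item):
        nonlocal running, peak
        async with sem:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for the API call
            running -= 1
            return item * 2

    # gather returns results in the same order as the inputs
    results = await asyncio.gather(*(handle(i) for i in items))
    return results, peak
```

In a notebook cell this can be run with `results, peak = await run_bounded(range(20), limit=8)`; in a plain script, wrap the call in `asyncio.run(...)`. All tasks are created up front, but only `limit` of them hold the semaphore at any moment, which is why a translation log shows a first batch of files starting together and each later file starting only as an earlier one finishes.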


In [9]:

# Example usage:
# en_folder = '../data/processed/en/wrapped-articles'
# ru_folder = '../data/processed/ru/wrapped-articles'
# os.makedirs(ru_folder, exist_ok=True)
# en_files = glob.glob(os.path.join(en_folder, '*.txt'))
# await translate_worker(en_files, 'ru', client, base_dir='../data/processed')

## French - Russian

In [10]:
# fr_folder = '../data/processed/fr/wrapped-articles'
# ru_folder = '../data/processed/ru/wrapped-articles'
# os.makedirs(ru_folder, exist_ok=True)
# fr_files = glob.glob(os.path.join(fr_folder, '*.txt'))
# await translate_worker(fr_files, 'ru', client, base_dir='../data/processed')

## German - Russian

In [11]:
ge_folder = '../data/processed/ge/wrapped-articles'
ru_folder = '../data/processed/ru/wrapped-articles'
os.makedirs(ru_folder, exist_ok=True)
ge_files = glob.glob(os.path.join(ge_folder, '*.txt'))
await translate_worker(ge_files, 'ru', client, base_dir='../data/processed')


[1/177] Translating ../data/processed/ge/wrapped-articles/article2266.txt...
[2/177] Translating ../data/processed/ge/wrapped-articles/article22132.txt...
[3/177] Translating ../data/processed/ge/wrapped-articles/article2231.txt...
[4/177] Translating ../data/processed/ge/wrapped-articles/article22186.txt...
[5/177] Translating ../data/processed/ge/wrapped-articles/article22189.txt...
[6/177] Translating ../data/processed/ge/wrapped-articles/article22120.txt...
[7/177] Translating ../data/processed/ge/wrapped-articles/article22104.txt...
[8/177] Translating ../data/processed/ge/wrapped-articles/article22121.txt...
[5/177] Done: ../data/processed/ge/wrapped-articles/article22189.txt
[9/177] Translating ../data/processed/ge/wrapped-articles/article22114.txt...
[6/177] Done: ../data/processed/ge/wrapped-articles/article22120.txt
[10/177] Translating ../data/processed/ge/wrapped-articles/article229.txt...
[4/177] Done: ../data/processed/ge/wrapped-articles/article22186.txt
[11/177] Transla

## Italian - Russian

In [19]:
from data_processing.translate import prefix_translated_filenames

it_folder = '../data/processed/it/wrapped-articles'
ru_folder = '../data/processed/ru/wrapped-articles'
os.makedirs(ru_folder, exist_ok=True)
it_files = glob.glob(os.path.join(it_folder, '*.txt'))
await translate_worker(it_files, 'ru', client, base_dir='../data/processed')

prefix_translated_filenames(
    target_lang='ru',
    source_lang='it',
)

[1/303] Translating ../data/processed/it/wrapped-articles/article26190.txt...
[2/303] Translating ../data/processed/it/wrapped-articles/article26221.txt...
[3/303] Translating ../data/processed/it/wrapped-articles/article26284.txt...
[4/303] Translating ../data/processed/it/wrapped-articles/article26241.txt...
[5/303] Translating ../data/processed/it/wrapped-articles/article2625.txt...
[6/303] Translating ../data/processed/it/wrapped-articles/article26154.txt...
[7/303] Translating ../data/processed/it/wrapped-articles/article26253.txt...
[8/303] Translating ../data/processed/it/wrapped-articles/article2641.txt...
[4/303] Done: ../data/processed/it/wrapped-articles/article26241.txt
[9/303] Translating ../data/processed/it/wrapped-articles/article26209.txt...
[7/303] Done: ../data/processed/it/wrapped-articles/article26253.txt
[10/303] Translating ../data/processed/it/wrapped-articles/article26114.txt...
[2/303] Done: ../data/processed/it/wrapped-articles/article26221.txt
[11/303] Trans

# Prefixing Filenames for Later Retrieval

The following code prefixes the filename of each translated article with the code of its source language, so that `article2266.txt` becomes `ge_article2266.txt`. The prefix makes it possible to tell which source language a translated file came from and to retrieve files by language later.

In [None]:
from data_processing.translate import prefix_translated_filenames

# prefix_translated_filenames(
#     target_lang='ru',
#     source_lang='en'
# )

# prefix_translated_filenames(
#     target_lang='ru',
#     source_lang='fr'
# )

# prefix_translated_filenames(
#     target_lang='ru',
#     source_lang='ge'
# )



Renamed article2266.txt -> ge_article2266.txt
Renamed article22132.txt -> ge_article22132.txt
Renamed article2231.txt -> ge_article2231.txt
Renamed article22186.txt -> ge_article22186.txt
Renamed article22189.txt -> ge_article22189.txt
Renamed article22120.txt -> ge_article22120.txt
Renamed article22104.txt -> ge_article22104.txt
Renamed article22121.txt -> ge_article22121.txt
Renamed article22114.txt -> ge_article22114.txt
Renamed article229.txt -> ge_article229.txt
Renamed article2241.txt -> ge_article2241.txt
Renamed article2283.txt -> ge_article2283.txt
Renamed article22171.txt -> ge_article22171.txt
Renamed article22175.txt -> ge_article22175.txt
Renamed article2229.txt -> ge_article2229.txt
Renamed article2211.txt -> ge_article2211.txt
Renamed article22168.txt -> ge_article22168.txt
Renamed article2214.txt -> ge_article2214.txt
Renamed article2260.txt -> ge_article2260.txt
Renamed article2285.txt -> ge_article2285.txt
Renamed article2272.txt -> ge_article2272.txt
Renamed article2
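
A renaming step of this kind can be sketched as follows. The helper `prefix_filenames` is hypothetical and is not the signature of `prefix_translated_filenames`, which takes `target_lang` and `source_lang`. The sketch assumes that unprefixed files are named `article<id>.txt`, as in the log above, and it renames every such file in the folder, so it would need to run after one source language is translated and before the next one starts.

```python
import os


def prefix_filenames(folder, prefix):
    """Rename each 'article*.txt' file in `folder` to '<prefix>_article*.txt'.

    Hypothetical helper; returns the list of (old_name, new_name) pairs.
    """
    renamed = []
    for name in sorted(os.listdir(folder)):
        # Only touch unprefixed article files; anything else is left alone,
        # including files that already carry a language prefix
        if not (name.startswith("article") and name.endswith(".txt")):
            continue
        new_name = f"{prefix}_{name}"
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        renamed.append((name, new_name))
    return renamed
```

Because already prefixed files no longer start with `article`, running the function twice on the same folder does nothing the second time.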

In [18]:
from data_processing.unwrap import remap_folder

remap_folder(
    translated_dir="../data/processed/ru/wrapped-articles",
    src_lang='ge',
    lookup_file="../data/processed/ge/train-labels-subtask-3-spans.txt",
    out_label_file="../data/processed/ru/train-labels-subtask-3-spans-ge.txt",
)

Finished with no errors.


In [21]:
remap_folder(
    translated_dir="../data/processed/ru/wrapped-articles",
    src_lang='it',
    lookup_file="../data/processed/it/train-labels-subtask-3-spans.txt",
    out_label_file="../data/processed/ru/train-labels-subtask-3-spans-it.txt",
)

Finished with no errors.


In [22]:
import sys
sys.path.append('../scripts')  # adjust path as needed
from data_processing.unwrap import unwrap_articles

# Unwrap the translated articles
unwrap_articles(lang_dir='ru')

Processing 1387/1387 files...
Done.
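
The idea behind remapping and unwrapping can be sketched in a few lines: strip the span markers from a translated article and record where each span now sits in the plain text. This is a minimal sketch under assumptions, not the implementation of `remap_folder` or `unwrap_articles`. The `<span label="...">` tag format and the name `unwrap` are hypothetical, and spans are assumed not to be nested.

```python
import re

# Illustrative tag format; the project's actual markers may differ
TAG = re.compile(r'<span label="([^"]*)">(.*?)</span>', re.DOTALL)


def unwrap(wrapped):
    """Strip span tags and return (plain_text, [(start, end, label), ...])."""
    parts = []
    spans = []
    pos = 0     # read position in the wrapped text
    length = 0  # length of the plain text built so far
    for match in TAG.finditer(wrapped):
        # Copy the untagged text that precedes this span
        before = wrapped[pos:match.start()]
        parts.append(before)
        length += len(before)
        # Record the span's offsets in the plain text, then copy its content
        label, inner = match.group(1), match.group(2)
        spans.append((length, length + len(inner), label))
        parts.append(inner)
        length += len(inner)
        pos = match.end()
    parts.append(wrapped[pos:])
    return "".join(parts), spans
```

For example, `unwrap('abc <span label="A">def</span> g <span label="B">hi</span>')` returns `('abc def g hi', [(4, 7, 'A'), (10, 12, 'B')])`. The recovered offsets refer to the translated text, which is what a new labels file for the target language needs.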


## Summary

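This notebook walked through the data preparation pipeline for the persuasion detection project: inspecting annotated spans, wrapping the spans in the raw articles, translating the wrapped articles into Russian with the OpenAI API, prefixing the translated files with their source-language codes, remapping the span labels onto the translated articles, and unwrapping the translated articles for further analysis.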