# Data Processing for Persuasion Detection

This notebook prepares the data for the persuasion detection project. It imports the project's helper modules, inspects an annotated span, wraps the annotated spans in the article texts, and translates the wrapped English and French articles into Russian with the OpenAI API.

## Importing Required Modules

First, we add the project's `scripts` directory to `sys.path` so that the custom `data_processing` package can be imported, then import the two helper functions used in the following cells.

In [7]:
import sys

sys.path.append('../scripts')  # adjust path as needed

In [2]:
from data_processing.wrap import wrap_spans_from_file, print_span

## Exploring Annotated Spans

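`print_span` prints the text of a single annotated span, given an article ID, the start and end offsets of the span, and a language code. This is a quick way to check that the offsets in the annotation files line up with the article text.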

In [3]:
# Print the span between offsets 3196 and 3214 of article 231 in the fr folder
print_span(231, 3196, 3214, 'fr', base_path='../data/raw')

falsifient le réel
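
The printed span is 18 characters long, which matches `3214 - 3196`, so the offsets appear to be character positions with an exclusive end. A minimal sketch of that lookup under this assumption follows; `extract_span` is a hypothetical helper written for illustration, not the project's `print_span`, and the sample sentence is invented.

```python
def extract_span(text: str, start: int, end: int) -> str:
    """Return text[start:end] after checking that the offsets fit the text."""
    if not 0 <= start <= end <= len(text):
        raise ValueError(
            f"span ({start}, {end}) is outside the text (length {len(text)})"
        )
    return text[start:end]


# Toy sentence standing in for an article (assumption: offsets count characters)
sample = "Ils falsifient le réel sans hésiter."
start = sample.index("falsifient")
end = start + len("falsifient le réel")
print(extract_span(sample, start, end))
```

Validating the offsets first turns a silently truncated slice into an explicit error, which helps when an annotation file and an article are out of sync.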


## Preparing Directories for Processing

We define the paths for the raw and processed data directories. This ensures that our scripts can locate the input files and save the processed outputs in the correct locations.

In [4]:
import os

RAW_DIR = '../data/raw'
PROCESSED_DIR = '../data/processed'

## Wrapping Annotated Spans in Articles

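For each language directory under `RAW_DIR`, the following code looks for the annotation file `train-labels-subtask-3-spans.txt`, wraps the annotated spans in the articles under `train-articles-subtask-3`, and saves the results to `PROCESSED_DIR/<language>/wrapped-articles`. Languages without a labels file are skipped. The cell is commented out here; uncomment it to regenerate the wrapped articles.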

In [None]:
# Iterate through language directories in the raw data folder
# for lang_dir in os.listdir(RAW_DIR):
#     lang_path = os.path.join(RAW_DIR, lang_dir)
#     if os.path.isdir(lang_path):
#         language_code = lang_dir
#         print(f"Processing language: {language_code}")

#         # Define paths for the current language
#         labels_file = os.path.join(lang_path, 'train-labels-subtask-3-spans.txt')
#         articles_folder = os.path.join(lang_path, 'train-articles-subtask-3')
#         output_folder = os.path.join(PROCESSED_DIR, language_code, 'wrapped-articles')

#         # Check if the labels file exists for this language
#         if os.path.exists(labels_file):
#             print(f"  Labels file found: {labels_file}")
#             print(f"  Articles folder: {articles_folder}")
#             print(f"  Output folder: {output_folder}")

#             # Ensure the output directory exists
#             os.makedirs(output_folder, exist_ok=True)

#             # Wrap spans for the current language
#             wrap_spans_from_file(
#                 labels_file=labels_file,
#                 articles_folder=articles_folder,
#                 output_folder=output_folder,
#                 lang=language_code
#             )
#             print(f"  Finished wrapping spans for {language_code}.")
#         else:
#             print(f"  Labels file not found for {language_code}, skipping.")
#         print("---")

# print("Processing complete.")
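
The source does not show what `wrap_spans_from_file` writes, so the following is only a sketch of one way span wrapping can work. It assumes non-overlapping spans given as `(start, end, label)` with an exclusive end, and it uses an invented `<span label="...">` marker; the function name, the marker format, and the labels are all hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); end is exclusive


def wrap_spans(text: str, spans: List[Span]) -> str:
    """Wrap each span in <span label="...">...</span> markers.

    Assumes the spans do not overlap. Spans are applied from the last to
    the first so that the offsets of earlier spans stay valid while the
    text grows.
    """
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        wrapped = f'<span label="{label}">{text[start:end]}</span>'
        text = text[:start] + wrapped + text[end:]
    return text


# Invented example text and labels
text = "They distort reality and blame everyone else."
spans = [(5, 20, "label_a"), (25, 44, "label_b")]
print(wrap_spans(text, spans))
```

Working from right to left avoids having to shift every remaining offset after each insertion. Overlapping or nested spans would need a different approach, such as inserting opening and closing markers as separate events.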

In [6]:
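import glob
import os

import openai
from dotenv import load_dotenv

# Load the OpenAI API key from the environment or a local .env file
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to the environment or a .env file")

# Async client, reused by the translation cells below
client = openai.AsyncOpenAI(api_key=api_key)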

test_response = await client.responses.create(
    model="gpt-4o-mini",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(test_response.output_text)

As the moonlight danced on the enchanted meadow, a gentle unicorn named Starlight spread her shimmering wings and soared into the night sky, spreading dreams of magic and wonder to every sleeping child below.


In [7]:
# from data_processing.translate import translate_file_to_language

# en_folder = '../data/processed/en/wrapped-articles'
# ru_folder = '../data/processed/ru/wrapped-articles'
# os.makedirs(ru_folder, exist_ok=True)

# en_files = glob.glob(os.path.join(en_folder, '*.txt'))

# for en_file in en_files:
#     filename = os.path.basename(en_file)
#     ru_file = os.path.join(ru_folder, filename)
#     if not os.path.exists(ru_file):
#         print(f"Translating {filename} to Russian...")
#         translate_file_to_language(en_file, 'ru', client, base_dir='../data/processed')
#     else:
#         print(f"Already translated: {filename}")
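
The skip-if-exists check in the loop above makes the translation resumable: a rerun only touches files that have no counterpart in the target folder. A small sketch of that check as a standalone function, demonstrated in a temporary directory; `pending_files` is a hypothetical helper and is not part of `data_processing.translate`.

```python
import os
import tempfile


def pending_files(source_files, target_folder):
    """Return the source files that have no same-named file in target_folder."""
    return [
        path
        for path in source_files
        if not os.path.exists(os.path.join(target_folder, os.path.basename(path)))
    ]


# Demonstration in a throwaway directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "en")
    dst_dir = os.path.join(tmp, "ru")
    os.makedirs(src_dir)
    os.makedirs(dst_dir)

    sources = []
    for name in ("a.txt", "b.txt", "c.txt"):
        path = os.path.join(src_dir, name)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("placeholder")
        sources.append(path)

    # Pretend b.txt was already translated in an earlier run
    with open(os.path.join(dst_dir, "b.txt"), "w", encoding="utf-8") as fh:
        fh.write("placeholder")

    remaining = [os.path.basename(p) for p in pending_files(sources, dst_dir)]

print(remaining)
```

Because the check compares file names only, two source languages that write into the same target folder would treat a shared file name as already translated.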

# Translating texts

## English - Russian

In [8]:
# Async translation worker for batch file translation using the OpenAI API
import asyncio
from data_processing.translate import async_translate_file_to_language


CONCURRENCY = 8  # maximum number of translations in flight at once
sem = asyncio.Semaphore(CONCURRENCY)

async def translate_worker(file_list, target_lang, client, base_dir):
    """Translate every file in file_list into target_lang, skipping files that already exist."""
    total = len(file_list)
    # Create the output folder once instead of once per file
    target_folder = os.path.join(base_dir, target_lang, 'wrapped-articles')
    os.makedirs(target_folder, exist_ok=True)

    async def process_file(idx, file_path):
        # Skip files that already have a translation in the target folder
        filename = os.path.basename(file_path)
        target_file = os.path.join(target_folder, filename)
        if os.path.exists(target_file):
            print(f"[{idx+1}/{total}] Already translated: {filename}")
            return
        # The semaphore caps the number of concurrent translation calls
        async with sem:
            try:
                print(f"[{idx+1}/{total}] Translating {file_path}...")
                await async_translate_file_to_language(file_path, target_lang, client, base_dir=base_dir)
                print(f"[{idx+1}/{total}] Done: {file_path}")
            except Exception as e:
                # Report the failure and continue with the remaining files
                print(f"[{idx+1}/{total}] Error translating {file_path}: {e}")

    await asyncio.gather(*(process_file(i, f) for i, f in enumerate(file_list)))
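
The worker above relies on a common pattern: start one task per file with `asyncio.gather`, and let an `asyncio.Semaphore` cap how many of them run at the same time. The sketch below isolates that pattern with a stand-in coroutine in place of the API call; `run_bounded`, `fake_translate`, and `demo` are hypothetical names used only for illustration.

```python
import asyncio


async def run_bounded(items, worker, limit):
    """Run worker(item) for every item with at most `limit` running at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:
            return await worker(item)

    # return_exceptions=True collects failures in the result list
    # instead of raising the first one
    return await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)


async def demo():
    running = 0
    peak = 0

    async def fake_translate(name):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stands in for the API call
        running -= 1
        return name.upper()

    results = await run_bounded(["a", "b", "c", "d", "e"], fake_translate, limit=2)
    return results, peak


# In a notebook cell, use `results, peak = await demo()` instead of asyncio.run
results, peak = asyncio.run(demo())
print(results, peak)
```

Creating the semaphore inside the function rather than at module level keeps it tied to the event loop that actually runs the tasks, and `gather` returns the results in input order even though the tasks finish out of order.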


In [9]:

# Example usage:
en_folder = '../data/processed/en/wrapped-articles'
ru_folder = '../data/processed/ru/wrapped-articles'
os.makedirs(ru_folder, exist_ok=True)
en_files = glob.glob(os.path.join(en_folder, '*.txt'))
await translate_worker(en_files, 'ru', client, base_dir='../data/processed')

[1/518] Translating ../data/processed/en/wrapped-articles/article999001323.txt...
[2/518] Translating ../data/processed/en/wrapped-articles/article710100700.txt...
[3/518] Translating ../data/processed/en/wrapped-articles/article727405181.txt...
[4/518] Translating ../data/processed/en/wrapped-articles/article694356862.txt...
[5/518] Translating ../data/processed/en/wrapped-articles/article698780559.txt...
[6/518] Translating ../data/processed/en/wrapped-articles/article729561658.txt...
[7/518] Translating ../data/processed/en/wrapped-articles/article999001297.txt...
[8/518] Translating ../data/processed/en/wrapped-articles/article734265267.txt...
[2/518] Done: ../data/processed/en/wrapped-articles/article710100700.txt
[9/518] Translating ../data/processed/en/wrapped-articles/article764518567.txt...
[6/518] Done: ../data/processed/en/wrapped-articles/article729561658.txt
[10/518] Translating ../data/processed/en/wrapped-articles/article779309765.txt...
[3/518] Done: ../data/processed/e

## French - Russian

In [10]:
fr_folder = '../data/processed/fr/wrapped-articles'
ru_folder = '../data/processed/ru/wrapped-articles'
os.makedirs(ru_folder, exist_ok=True)
fr_files = glob.glob(os.path.join(fr_folder, '*.txt'))
await translate_worker(fr_files, 'ru', client, base_dir='../data/processed')

[1/211] Translating ../data/processed/fr/wrapped-articles/article2315.txt...
[2/211] Translating ../data/processed/fr/wrapped-articles/article23176.txt...
[3/211] Translating ../data/processed/fr/wrapped-articles/article2345.txt...
[4/211] Translating ../data/processed/fr/wrapped-articles/article23196.txt...
[5/211] Translating ../data/processed/fr/wrapped-articles/article2348.txt...
[6/211] Translating ../data/processed/fr/wrapped-articles/article2335.txt...
[7/211] Translating ../data/processed/fr/wrapped-articles/article2333.txt...
[8/211] Translating ../data/processed/fr/wrapped-articles/article2390.txt...
[6/211] Done: ../data/processed/fr/wrapped-articles/article2335.txt
[9/211] Translating ../data/processed/fr/wrapped-articles/article23145.txt...
[4/211] Done: ../data/processed/fr/wrapped-articles/article23196.txt
[10/211] Translating ../data/processed/fr/wrapped-articles/article2343.txt...
[5/211] Done: ../data/processed/fr/wrapped-articles/article2348.txt
[11/211] Translating 

## Summary

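This notebook walked through the data processing steps for the persuasion detection project: importing the helper modules, inspecting an annotated span, wrapping the annotated spans in the raw articles, and translating the wrapped English (518 files) and French (211 files) articles into Russian with a concurrency-limited asynchronous worker.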