# Data Processing for Persuasion Detection

This notebook prepares the data for the persuasion detection project. The workflow imports the required modules, inspects individual annotated spans, and wraps the annotated spans in each article for further analysis.

## Importing Required Modules

First, we import the standard libraries and add the scripts directory to sys.path so that the custom data_processing utilities can be imported.

In [1]:
# Add the scripts directory to sys.path so data_processing can be imported
import os
import sys

sys.path.append('../scripts')  # adjust path as needed
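
Re-running the cell above appends the same relative path to sys.path each time, and the relative path only resolves correctly from the notebook's working directory. A minimal sketch of a more defensive variant follows; `add_to_sys_path` is a hypothetical helper introduced here for illustration, not part of the project's scripts.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Append the absolute form of `directory` to sys.path unless it is already there.

    Hypothetical helper: returns the resolved path that was (or already is) on sys.path.
    """
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Example (same directory as in the cell above):
# add_to_sys_path('../scripts')
```

Resolving the path first means the entry stays valid even if the working directory changes later in the session.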

In [2]:
from data_processing.wrap import wrap_spans_from_file, print_span

## Exploring Annotated Spans

We can inspect specific annotated spans within the dataset to verify the annotation quality and understand the data format.

In [3]:
# Print the span between offsets 4045 and 4063 of article 2318 in the fr folder
print_span(2318, 4045, 4063, 'fr', base_path='../data/raw')

Ce pays est foutu 
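
For intuition, a span lookup of this kind can be approximated by a plain string slice. The sketch below assumes that offsets are zero-based character positions with an exclusive end, which is consistent with the 18-character output printed for offsets 4045 to 4063. `read_span` and `read_span_from_file` are hypothetical names used for illustration; they are not the project's `print_span`, whose implementation is not shown in this notebook.

```python
def read_span(text, start, end):
    """Return the characters of `text` in the half-open range [start, end)."""
    if not 0 <= start <= end <= len(text):
        raise ValueError(
            f"span ({start}, {end}) is outside the text of length {len(text)}"
        )
    return text[start:end]


def read_span_from_file(article_path, start, end, encoding="utf-8"):
    """Read an article file and return the span [start, end).

    newline="" disables newline translation, so offsets computed on the raw
    file content are not shifted by "\r\n" line endings (an assumption about
    how the offsets were produced).
    """
    with open(article_path, encoding=encoding, newline="") as handle:
        text = handle.read()
    return read_span(text, start, end)
```

If a printed span looks shifted by a few characters, mismatched newline handling or a different file encoding is a likely cause.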


## Preparing Directories for Processing

We define the paths for the raw and processed data directories. This ensures that our scripts can locate the input files and save the processed outputs in the correct locations.

In [4]:
import os

RAW_DIR = '../data/raw'
PROCESSED_DIR = '../data/processed'
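
Before processing, it can help to confirm that the raw data directory exists and to see which language folders it contains. The following is a minimal sketch; `list_language_dirs` is a hypothetical helper, and it assumes that every sub-directory of the raw data folder corresponds to one language code.

```python
from pathlib import Path


def list_language_dirs(raw_dir):
    """Return the sorted names of the sub-directories of `raw_dir`.

    Hypothetical helper: plain files inside `raw_dir` are ignored.
    """
    root = Path(raw_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Raw data directory not found: {root}")
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


# Example (using the constant defined above):
# print(list_language_dirs(RAW_DIR))
```

Sorting the names gives a stable processing order, whereas os.listdir returns entries in arbitrary order.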

## Wrapping Annotated Spans in Articles

The following code iterates through each language directory in the raw data folder, looks for that language's span labels file, and wraps the annotated spans in the corresponding articles. The wrapped articles are saved in a per-language output directory. Languages without a labels file are skipped.

In [5]:
# Iterate through language directories in the raw data folder
for lang_dir in os.listdir(RAW_DIR):
    lang_path = os.path.join(RAW_DIR, lang_dir)
    if os.path.isdir(lang_path):
        language_code = lang_dir
        print(f"Processing language: {language_code}")

        # Define paths for the current language
        labels_file = os.path.join(lang_path, 'train-labels-subtask-3-spans.txt')
        articles_folder = os.path.join(lang_path, 'train-articles-subtask-3')
        output_folder = os.path.join(PROCESSED_DIR, language_code, 'wrapped-articles')

        # Check if the labels file exists for this language
        if os.path.exists(labels_file):
            print(f"  Labels file found: {labels_file}")
            print(f"  Articles folder: {articles_folder}")
            print(f"  Output folder: {output_folder}")

            # Ensure the output directory exists
            os.makedirs(output_folder, exist_ok=True)

            # Wrap spans for the current language
            wrap_spans_from_file(
                labels_file=labels_file,
                articles_folder=articles_folder,
                output_folder=output_folder,
                lang=language_code
            )
            print(f"  Finished wrapping spans for {language_code}.")
        else:
            print(f"  Labels file not found for {language_code}, skipping.")
        print("---")

print("Processing complete.")

Processing language: ru
  Labels file found: ../data/raw/ru/train-labels-subtask-3-spans.txt
  Articles folder: ../data/raw/ru/train-articles-subtask-3
  Output folder: ../data/processed/ru/wrapped-articles
  Finished wrapping spans for ru.
---
Processing language: sl
  Labels file not found for sl, skipping.
---
Processing language: it
  Labels file found: ../data/raw/it/train-labels-subtask-3-spans.txt
  Articles folder: ../data/raw/it/train-articles-subtask-3
  Output folder: ../data/processed/it/wrapped-articles
  Finished wrapping spans for it.
---
Processing language: en
  Labels file found: ../data/raw/en/train-labels-subtask-3-spans.txt
  Articles folder: ../data/raw/en/train-articles-subtask-3
  Output folder: ../data/processed/en/wrapped-articles
  Finished wrapping spans for en.
---
Processing language: es
  Labels file not found for es, skipping.
---
Processing language: po
  Labels file found: ../data/raw/po/train-labels-subtask-3-spans.txt
  Articles folder: ../data/raw/p
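
The wrapping step itself can be pictured as two small operations: grouping the labelled spans by article, then inserting markers around each span. The sketch below is illustrative only and rests on assumptions that are not confirmed by this notebook: that each row of the labels file is tab-separated in the order article id, technique, start, end; that offsets are zero-based with an exclusive end; and that spans are marked with `<technique>` and `</technique>` tags. `parse_span_labels` and `wrap_spans` are hypothetical names, not the project's `wrap_spans_from_file`.

```python
from collections import defaultdict


def parse_span_labels(lines):
    """Group (start, end, technique) tuples by article id.

    Assumed row format (tab-separated): article_id, technique, start, end.
    """
    spans = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        article_id, technique, start, end = line.split("\t")
        spans[article_id].append((int(start), int(end), technique))
    return dict(spans)


def wrap_spans(text, spans):
    """Insert <technique> ... </technique> markers around each span.

    Markers are inserted from right to left so that earlier offsets remain
    valid. Overlapping spans are not guaranteed to produce well-nested tags.
    """
    inserts = []  # (offset, priority, marker); closers sort before openers
    for start, end, technique in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"invalid span ({start}, {end}) for text of length {len(text)}")
        inserts.append((start, 1, f"<{technique}>"))
        inserts.append((end, 0, f"</{technique}>"))
    for offset, _, marker in sorted(inserts, reverse=True):
        text = text[:offset] + marker + text[offset:]
    return text


# Made-up rows and text, for illustration only
rows = ["1\tA\t0\t3\n", "1\tB\t4\t7\n"]
print(wrap_spans("one two", parse_span_labels(rows)["1"]))
```

Working from the end of the text backwards avoids having to recompute offsets after each insertion, which is the main pitfall when several spans fall in the same article.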

## Summary

This notebook walked through the data processing steps for the persuasion detection project: importing the modules, inspecting annotated spans, and writing the wrapped articles to the processed data directory for further analysis.