# Text Data Preprocessing Pipeline

This notebook implements a text preprocessing pipeline for natural language processing tasks. It loads messages from multiple CSV files, normalizes the French text, translates it to English through a local translation service, and exports the paired result for further analysis or model training.

## Objective
- Load and consolidate text data from multiple CSV files
- Clean and normalize text content
- Translate the French text to English
- Prepare a standardized French/English dataset for NLP tasks

## Environment Configuration

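- Enables the autoreload extension; with %autoreload 1, only modules registered via %aimport (here utils.text_processing) are reloaded when they change
- Imports the custom TextProcessor class from utils.text_processing
- Imports pandas, requests, and glob for data handling, HTTP calls, and file discovery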

In [None]:
%load_ext autoreload
%aimport utils.text_processing
%autoreload 1

In [None]:
from utils.text_processing import TextProcessor
import pandas as pd
import requests
import glob

## Load CSV Files

The pipeline scans the data directory for CSV files:
1. Uses glob to find all .csv files in data/raw
2. Reports how many files were found
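
A lightweight structure check before loading could look like the sketch below. It reads only the header row of each file and separates files that have the expected column from those that do not. The helper names `has_column` and `find_valid_csvs` are illustrative and not part of the notebook; the `content` column name is taken from the data preparation step.

```python
import csv
import glob


def has_column(path, column="content"):
    """Return True if the header row of the CSV at `path` contains `column`."""
    # utf-8-sig also accepts plain UTF-8 and strips a leading BOM if present
    with open(path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])  # empty file -> empty header
    return column in header


def find_valid_csvs(pattern, column="content"):
    """Split the files matching `pattern` into (valid, invalid) lists."""
    valid, invalid = [], []
    # Sorting makes the processing order reproducible across runs
    for path in sorted(glob.glob(pattern)):
        (valid if has_column(path, column) else invalid).append(path)
    return valid, invalid
```

Calling `find_valid_csvs('data/raw/*.csv')` would return the files that are safe to load and the ones to skip, without reading any file in full.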

In [None]:
csv_files = glob.glob('data/raw/*.csv')
print(f'Found {len(csv_files)} CSV files')

Found 13 CSV files


## Data Preparation

For each CSV file:
1. Skips the file if it has no 'content' column
2. Keeps only the 'content' column, which holds the text messages
3. Removes empty and missing entries
4. Renames the column to 'fr'
5. Prints the number of messages kept
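
To keep per-file statistics as data rather than printed lines, the cleaning step could be wrapped in a function that returns counts alongside the cleaned frame. This is a sketch under assumptions: `clean_content` and the statistic names are hypothetical, and unlike the notebook cell it also drops whitespace-only entries.

```python
import pandas as pd


def clean_content(df, source_col="content", target_col="fr"):
    """Return (cleaned DataFrame, stats dict) for one raw DataFrame."""
    total = len(df)
    # Drop missing values first, then entries that are empty after stripping.
    # Assumption: whitespace-only text counts as empty (the notebook keeps it).
    text = df[source_col].dropna().astype(str)
    text = text[text.str.strip().str.len() > 0]
    cleaned = text.to_frame(name=target_col).reset_index(drop=True)
    stats = {
        "rows_read": total,
        "rows_kept": len(cleaned),
        "rows_dropped": total - len(cleaned),
    }
    return cleaned, stats


# Small in-memory example instead of a real CSV file
raw = pd.DataFrame(
    {"content": ["Bonjour", "", None, "  ", "Merci"], "author": list("abcde")}
)
cleaned, stats = clean_content(raw)
print(stats)
```

Collecting one `stats` dict per file in a list and passing it to `pd.DataFrame` gives a summary table that is easier to inspect than the printed log.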

In [None]:
# Initialize list to store all DataFrames
dataframes = []

# Process each CSV file
for file in csv_files:
    # Read the CSV
    df = pd.read_csv(file)
    
    # keep only the content column
    if 'content' not in df.columns:
        print(f'Skipping {file}: no content column')
        continue
    
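    df = df[['content']]
    # Missing values are dropped too: their length is NaN, which fails the comparison
    df = df[df['content'].str.len() > 0]
    # Reassign instead of inplace=True to avoid SettingWithCopyWarning on a filtered frame
    df = df.rename(columns={'content': 'fr'})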
    
    dataframes.append(df)
    print(f'Processed {file}: {len(df)} messages')

Processed data/raw/shadow-slave_page_24.csv: 996 messages
Processed data/raw/shadow-slave_page_17.csv: 991 messages
Processed data/raw/shadow-slave_page_25.csv: 993 messages
Processed data/raw/shadow-slave_page_23.csv: 989 messages
Processed data/raw/shadow-slave_page_18.csv: 995 messages
Processed data/raw/shadow-slave_page_21.csv: 990 messages
Processed data/raw/shadow-slave_page_16.csv: 984 messages
Processed data/raw/shadow-slave_page_19.csv: 990 messages
Processed data/raw/shadow-slave_page_20.csv: 991 messages
Processed data/raw/shadow-slave_page_26.csv: 994 messages
Processed data/raw/shadow-slave_page_27.csv: 992 messages
Processed data/raw/shadow-slave_page_22.csv: 994 messages
Processed data/raw/shadow-slave_page_28.csv: 599 messages


In [None]:
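# Combine the per-file DataFrames into one
final_df = pd.concat(dataframes, ignore_index=True)

# Drop repeated messages, then normalize the French text
final_df = final_df.drop_duplicates()
final_df['fr'] = TextProcessor(final_df, 'fr').transform()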

In [None]:
def translate_batch(texts, source="fr", target="en", url="http://127.0.0.1:5000/translate"):
    """Translate a list of texts with a single POST to the translation service at `url`."""
    payload = {
        "q": texts,
        "source": source,
        "target": target
    }
    headers = {"Content-Type": "application/json"}
    response = requests.post(url, json=payload, headers=headers)
    # Fail loudly on HTTP errors instead of parsing an error body
    response.raise_for_status()
    return list(response.json()["translatedText"])

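def translate_column(df, column, batch_size=50):
    """Translate df[column] in batches and return the translations as a list."""
    translations = []
    for i in range(0, len(df), batch_size):
        # iloc slices by position, so gaps in the index after drop_duplicates do not matter
        batch = df[column].iloc[i:i + batch_size].tolist()
        # Strip the <start>/<end> markers before sending text to the translator
        batch = [text.replace('<start>', '').replace('<end>', '') for text in batch]

        translated = translate_batch(batch)
        translations.extend(translated)

    return translations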

final_df['en'] = translate_column(final_df, 'fr')
final_df['en'] = TextProcessor(final_df, 'en').transform()
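
The batching logic above can be exercised without a running translation server by passing the translator in as a function. The sketch below is an illustration, not the notebook's code: `translate_in_batches` and `fake_translate` are hypothetical names, and the length check is an added safeguard so that a short response fails at the offending batch rather than at the final column assignment.

```python
def translate_in_batches(texts, translate_fn, batch_size=50):
    """Translate `texts` in order, sending `batch_size` items per call."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        translated = translate_fn(batch)
        # Each input must map to exactly one output, or rows would misalign
        if len(translated) != len(batch):
            raise ValueError(
                f"batch at offset {start}: sent {len(batch)} texts, "
                f"got {len(translated)} back"
            )
        translations.extend(translated)
    return translations


def fake_translate(batch):
    """Stand-in translator for offline testing; it only upper-cases its input."""
    return [text.upper() for text in batch]


result = translate_in_batches(["un", "deux", "trois"], fake_translate, batch_size=2)
print(result)
```

With the notebook's own function, the equivalent call would be `translate_in_batches(final_df['fr'].tolist(), translate_batch)`.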

In [None]:
# shape the final DataFrame
final_df = final_df[['fr', 'en']]
final_df = final_df.drop_duplicates()
final_df = final_df[final_df['fr'].str.len() > 0]
final_df = final_df[final_df['en'].str.len() > 0]
final_df = final_df.dropna()
final_df = final_df.reset_index(drop=True)
print(f'Final DataFrame shape: {final_df.shape}')

Final DataFrame shape: (11187, 2)


## Data Export

Final processing steps:
1. Exports the deduplicated 'fr' and 'en' columns to a single CSV file for downstream tasks
2. Omits the DataFrame index so the file contains only those two columns
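
`DataFrame.to_csv` raises an error when the target directory does not exist, so on a fresh checkout the export can fail after all the translation work is done. A small guard, sketched here with the hypothetical helper name `ensure_parent_dir`, creates the directory first.

```python
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return `path` as a Path."""
    path = Path(path)
    # parents=True builds intermediate directories; exist_ok=True makes reruns safe
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

In the export cell this could be used as `final_df.to_csv(ensure_parent_dir(output_file), index=False)`, since `to_csv` accepts path-like objects.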

In [None]:
output_file = './data/cleaned/fr_en_processed_data.csv'
final_df.to_csv(output_file, index=False)
print(f'\nProcessed data saved to {output_file}')


Processed data saved to ./data/cleaned/fr_en_processed_data.csv
