# News Sentiment Data Preparation

This notebook walks through data preparation for each raw dataset in the project, step by step. Each section adapts the steps to the structure and requirements of its dataset.

## 1. Financial News Dataset Preparation

This section processes the raw financial news data (e.g., `data/financial_news/raw/all-data.csv`).

In [2]:
import pandas as pd
from pathlib import Path

# Define paths
raw_path = Path("../data/financial_news/raw/all-data.csv")
processed_path = Path("../data/financial_news/prepared/processed_all-data.csv")

# Try encodings from strictest to most permissive. latin-1 comes last because
# it accepts any byte sequence and would otherwise mask the other candidates.
encodings = ['utf-8', 'utf-8-sig', 'cp1252', 'latin-1']
df = None
for encoding in encodings:
    try:
        df = pd.read_csv(raw_path, encoding=encoding)
        print(f"Successfully read with {encoding}")
        break
    except UnicodeDecodeError:
        # Wrong encoding guess; other errors (e.g. a missing file) should surface
        continue
if df is None:
    raise RuntimeError("Could not read the raw financial news CSV with any encoding.")

# Strip whitespace from column names
df.columns = df.columns.str.strip()

# Display a sample
df.head()

Successfully read with utf-8


| | sentiment | text |
|---|---|---|
| 0 | neutral | According to Gran , the company has no plans t... |
| 1 | neutral | Technopolis plans to develop in stages an area... |
| 2 | negative | The international electronic industry company ... |
| 3 | positive | With the new production plant the company woul... |
| 4 | positive | According to the company 's updated strategy f... |
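
Before cleaning, a quick audit of the loaded frame can catch missing texts, blank texts, duplicates, or a skewed label distribution early. The following is a minimal sketch on a small stand-in frame. `audit_frame` is a hypothetical helper, not part of the project, and the column names `text` and `sentiment` are taken from the sample above.

```python
import pandas as pd


def audit_frame(df, text_column="text", label_column="sentiment"):
    """Return basic quality counts for a labelled text DataFrame (hypothetical helper)."""
    text = df[text_column]
    return {
        "rows": len(df),
        # Rows where the text is missing entirely
        "missing_text": int(text.isna().sum()),
        # Rows where the text is present but only whitespace
        "empty_text": int(text.dropna().str.strip().eq("").sum()),
        # Rows whose text repeats an earlier row
        "duplicate_text": int(text.duplicated().sum()),
        "label_counts": df[label_column].value_counts().to_dict(),
    }


# Stand-in data; in the notebook this would be the `df` loaded above
sample = pd.DataFrame({
    "sentiment": ["neutral", "neutral", "negative", "positive"],
    "text": ["Sales rose.", "Sales rose.", "  ", None],
})
report = audit_frame(sample)
print(report)
```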


### Clean and Preprocess Financial News

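We use `NewsPreprocessor` with the FinBERT tokenizer (`ProsusAI/finbert`) to clean the financial news text. The text and sentiment columns are auto-detected. As the output shows, the result has one row per text chunk, with chunk metadata (`original_id`, `chunk_id`, `is_long`, `total_chunks`), a token count (`num_tokens`), and a `cleaning_metrics` dictionary.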

In [6]:
from src.preprocessing import NewsPreprocessor

preprocessor = NewsPreprocessor(tokenizer_name="ProsusAI/finbert")
processed_df = preprocessor.preprocess_dataset(df)
processed_df.head()



| | original_id | chunk_id | text | is_long | total_chunks | num_tokens | cleaning_metrics | sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | According to Gran the company has no plans to ... | False | 1 | 22 | {'urls_removed': 0, 'html_tags_removed': 0, 'e... | neutral |
| 1 | 1 | 0 | Technopolis plans to develop in stages an area... | False | 1 | 31 | {'urls_removed': 0, 'html_tags_removed': 0, 'e... | neutral |
| 2 | 2 | 0 | The international electronic industry company ... | False | 1 | 38 | {'urls_removed': 0, 'html_tags_removed': 0, 'e... | negative |
| 3 | 3 | 0 | With the new production plant the company woul... | False | 1 | 33 | {'urls_removed': 0, 'html_tags_removed': 0, 'e... | positive |
| 4 | 4 | 0 | According to the company s updated strategy fo... | False | 1 | 40 | {'urls_removed': 0, 'html_tags_removed': 0, 'e... | positive |
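
The `is_long`, `total_chunks`, and `chunk_id` columns suggest that texts exceeding a token budget are split into several rows. The internals of `NewsPreprocessor` are not shown in this notebook, so the following is only a sketch of that idea under stated assumptions: a whitespace split stands in for the FinBERT tokenizer, and `chunk_text`, `explode_to_chunks`, and `max_tokens` are hypothetical names.

```python
import pandas as pd


def chunk_text(text, max_tokens=8):
    """Split text into consecutive chunks of at most max_tokens whitespace tokens."""
    tokens = text.split()  # assumption: whitespace split instead of a real tokenizer
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def explode_to_chunks(df, text_column="text", max_tokens=8):
    """Emit one row per chunk, with metadata mirroring the columns shown above."""
    rows = []
    for original_id, text in df[text_column].items():
        chunks = chunk_text(text, max_tokens)
        for chunk_id, tokens in enumerate(chunks):
            rows.append({
                "original_id": original_id,
                "chunk_id": chunk_id,
                "text": " ".join(tokens),
                "is_long": len(chunks) > 1,
                "total_chunks": len(chunks),
                "num_tokens": len(tokens),
            })
    return pd.DataFrame(rows)


demo = pd.DataFrame({
    "text": ["short headline", "one two three four five six seven eight nine ten"],
})
chunked = explode_to_chunks(demo, max_tokens=8)
print(chunked)
```

A real tokenizer counts subword and special tokens, so `num_tokens` from this sketch will not match the values in the table above.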


### Save Processed Financial News

In [7]:
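# Create the output directory if it does not exist yet
processed_path.parent.mkdir(parents=True, exist_ok=True)
processed_df.to_csv(processed_path, index=False)
print(f"Saved processed data to {processed_path}")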

Saved processed data to ..\data\financial_news\prepared\processed_all-data.csv
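
Note that `cleaning_metrics` holds Python dictionaries, and `to_csv` writes them as their string representation. After a round trip through CSV the column therefore contains strings, not dictionaries. If the metrics are needed later, one way to recover them is `ast.literal_eval`. The sketch below uses an in-memory buffer in place of the real file; the key names come from the output above, while the values are made up for illustration.

```python
import ast
import io

import pandas as pd

# Stand-in frame with made-up metric values
original = pd.DataFrame({
    "text": ["Sales rose.", "Profit fell."],
    "cleaning_metrics": [
        {"urls_removed": 0, "html_tags_removed": 1},
        {"urls_removed": 2, "html_tags_removed": 0},
    ],
})

# Round trip through CSV, using a buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
type_after_reload = type(reloaded["cleaning_metrics"].iloc[0])  # str, not dict

# Parse the string representation back into dictionaries
reloaded["cleaning_metrics"] = reloaded["cleaning_metrics"].apply(ast.literal_eval)

# Optionally expand the dictionaries into one column per metric
metrics = pd.json_normalize(reloaded["cleaning_metrics"].tolist())
print(metrics)
```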


## 2. Map Labels and Split Financial News Data

This section loads the processed data, maps string sentiment labels to integers (if they are not already numeric), and splits the rows 80/10/10 into train, validation, and test sets, stratified by sentiment.

In [8]:
from sklearn.model_selection import train_test_split

# Load processed data
processed_df = pd.read_csv(processed_path)

# Drop the cleaning_metrics column if present (it is not needed for training)
processed_df = processed_df.drop(columns=['cleaning_metrics'], errors='ignore')

# Map string labels to integers; numeric labels are left unchanged
def map_labels(df, label_column='sentiment'):
    # Check the column dtype rather than only the first value
    if pd.api.types.is_numeric_dtype(df[label_column]):
        print("Labels are already numeric.")
        return df
    df = df.copy()  # avoid mutating the caller's DataFrame
    unique_labels = sorted(df[label_column].unique())
    label_map = {label: i for i, label in enumerate(unique_labels)}
    df[label_column] = df[label_column].map(label_map)
    print(f"Label mapping: {label_map}")
    return df

processed_df = map_labels(processed_df, label_column='sentiment')
processed_df

Label mapping: {'negative': 0, 'neutral': 1, 'positive': 2}


| | original_id | chunk_id | text | is_long | total_chunks | num_tokens | sentiment |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | According to Gran the company has no plans to ... | False | 1 | 22 | 1 |
| 1 | 1 | 0 | Technopolis plans to develop in stages an area... | False | 1 | 31 | 1 |
| 2 | 2 | 0 | The international electronic industry company ... | False | 1 | 38 | 0 |
| 3 | 3 | 0 | With the new production plant the company woul... | False | 1 | 33 | 2 |
| 4 | 4 | 0 | According to the company s updated strategy fo... | False | 1 | 40 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4819 | 4841 | 0 | LONDON MarketWatch Share prices ended lower in... | False | 1 | 26 | 0 |
| 4820 | 4842 | 0 | Rinkuskiai s beer sales fell by 65 per cent to... | False | 1 | 33 | 1 |
| 4821 | 4843 | 0 | Operating profit fell to EUR 354 mn from EUR 6... | False | 1 | 25 | 0 |
| 4822 | 4844 | 0 | Net sales of the Paper segment decreased to EU... | False | 1 | 51 | 0 |
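
Because the mapping is built from the sorted label names it is deterministic, but it exists only in the printed output. Storing it makes it possible to decode integer predictions later. A minimal sketch, assuming the mapping printed above; the file name `label_map.json` is a hypothetical choice, and the example writes to a temporary directory rather than the project tree.

```python
import json
import tempfile
from pathlib import Path

# Mapping as printed by map_labels above
label_map = {"negative": 0, "neutral": 1, "positive": 2}

# Persist the mapping and read it back (temporary directory for illustration)
with tempfile.TemporaryDirectory() as tmp:
    map_path = Path(tmp) / "label_map.json"  # hypothetical file name
    map_path.write_text(json.dumps(label_map, indent=2), encoding="utf-8")
    loaded_map = json.loads(map_path.read_text(encoding="utf-8"))

# Invert the mapping to turn integer predictions back into label names
id_to_label = {index: label for label, index in loaded_map.items()}
predicted_ids = [1, 0, 2]
decoded = [id_to_label[i] for i in predicted_ids]
print(decoded)
```

A pretrained classification head may use a different label order than this alphabetical one, so it is worth comparing the mapping with the model's own `id2label` configuration before fine-tuning or evaluating.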


In [11]:
# Split into train/val/test (80/10/10), stratified by sentiment
train_df, temp_df = train_test_split(
    processed_df, test_size=0.2, random_state=42, stratify=processed_df['sentiment']
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, random_state=42, stratify=temp_df['sentiment']
)

# Save splits
split_dir = Path("../data/financial_news/splits/")
split_dir.mkdir(parents=True, exist_ok=True)
train_df.to_csv(split_dir / "data_train.csv", index=False)
val_df.to_csv(split_dir / "data_val.csv", index=False)
test_df.to_csv(split_dir / "data_test.csv", index=False)
print("Splits saved.")

Splits saved.
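
Two properties of the splits are worth checking: that the sizes come out near 80/10/10, and that no `original_id` appears in more than one split. The second matters because a long text can be stored as several chunk rows, and a row-level split may place chunks of the same text in both train and test. The sketch below runs on small stand-in frames; `split_report` is a hypothetical helper, not part of the project.

```python
import pandas as pd


def split_report(train_df, val_df, test_df, id_column="original_id"):
    """Return the fraction of rows per split and the IDs found in more than one split."""
    splits = {"train": train_df, "val": val_df, "test": test_df}
    total = sum(len(part) for part in splits.values())
    fractions = {name: len(part) / total for name, part in splits.items()}

    first_seen = {}  # ID -> name of the first split it appeared in
    leaked = set()
    for name, part in splits.items():
        for value in part[id_column].unique().tolist():
            if value in first_seen and first_seen[value] != name:
                leaked.add(value)
            first_seen.setdefault(value, name)
    return fractions, sorted(leaked)


# Stand-in splits; ID 7 is deliberately placed in both train and test
train_demo = pd.DataFrame({"original_id": [0, 1, 2, 3, 4, 5, 6, 7]})
val_demo = pd.DataFrame({"original_id": [8]})
test_demo = pd.DataFrame({"original_id": [7]})

fractions, leaked = split_report(train_demo, val_demo, test_demo)
print(fractions)
print(leaked)
```

If leaked IDs show up, one option is to split on the unique `original_id` values rather than on rows, for example with scikit-learn's `GroupShuffleSplit`.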
