# Dataset financial_phrasebank

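A polar sentiment dataset of sentences from English-language financial news, categorised by sentiment. It consists of roughly 4840 sentences (the `50Agree` file loaded below contains 4846 lines, 4838 of them unique). The dataset is split into four subsets according to the agreement rate among the 5-8 annotators who labelled each sentence.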

https://huggingface.co/datasets/takala/financial_phrasebank

In [11]:
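def load_financial_phrasebank(filepath):
    """Read one FinancialPhraseBank file into a list of {"sentence", "label"} dicts."""
    data = []
    # The files are opened as ISO-8859-1 (Latin-1); each line has the form "<sentence>@<label>".
    with open(filepath, encoding="iso-8859-1") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, which would break the unpacking below
            # Split on the last "@" only, in case the sentence itself contains "@".
            sentence, label = line.rsplit("@", 1)
            data.append({
                "sentence": sentence.strip(),
                "label": label.strip()
            })
    return data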

path_to_files = "data/FinancialPhraseBank-v1.0/"
files_base_name = "Sentences_"
possible_datasets = ["50Agree", "66Agree", "75Agree", "AllAgree"]
files_ends_with = ".txt"

## Read the dataset

In [16]:
for dataset in possible_datasets:
    file_path = f"{path_to_files}{files_base_name}{dataset}{files_ends_with}"
    data = load_financial_phrasebank(file_path)
    print(f"Loaded {len(data)} sentences from {file_path}")
    
# Example of how to access the data (`data` now holds the last subset loaded, AllAgree)
for item in data[:5]:
    print(f"Sentence: {item['sentence']}, Label: {item['label']}")

Loaded 4846 sentences from data/FinancialPhraseBank-v1.0/Sentences_50Agree.txt
Loaded 4217 sentences from data/FinancialPhraseBank-v1.0/Sentences_66Agree.txt
Loaded 3453 sentences from data/FinancialPhraseBank-v1.0/Sentences_75Agree.txt
Loaded 2264 sentences from data/FinancialPhraseBank-v1.0/Sentences_AllAgree.txt
Sentence: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing ., Label: neutral
Sentence: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m ., Label: positive
Sentence: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn ., Label: positive
Sentence: Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales ., Label: positive

## Are there repeated sentences?

In [17]:
for dataset in possible_datasets:
    file_path = f"{path_to_files}{files_base_name}{dataset}{files_ends_with}"
    data = load_financial_phrasebank(file_path)
    print(f"Dataset: {dataset}, Number of sentences: {len(data)}, Number of unique sentences: {len(set(item['sentence'] for item in data))}")

Dataset: 50Agree, Number of sentences: 4846, Number of unique sentences: 4838
Dataset: 66Agree, Number of sentences: 4217, Number of unique sentences: 4211
Dataset: 75Agree, Number of sentences: 3453, Number of unique sentences: 3448
Dataset: AllAgree, Number of sentences: 2264, Number of unique sentences: 2259
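
The counts above show that a few sentences occur more than once within each file. The following is a minimal sketch of how one might list those sentences and check whether the copies carry the same label. The helper `find_repeated_sentences` and the toy rows are hypothetical; the real input would be the list returned by `load_financial_phrasebank`.

```python
from collections import defaultdict


def find_repeated_sentences(data):
    """Map each sentence that occurs more than once to the list of labels it carries."""
    labels_by_sentence = defaultdict(list)
    for item in data:
        labels_by_sentence[item["sentence"]].append(item["label"])
    return {s: labels for s, labels in labels_by_sentence.items() if len(labels) > 1}


# Hypothetical toy rows, for illustration only (not taken from the files).
sample = [
    {"sentence": "Sales rose .", "label": "positive"},
    {"sentence": "Sales rose .", "label": "neutral"},
    {"sentence": "Profit fell .", "label": "negative"},
    {"sentence": "Profit fell .", "label": "negative"},
    {"sentence": "No change .", "label": "neutral"},
]

repeated = find_repeated_sentences(sample)
# Repeats whose copies disagree on the label deserve a manual look.
conflicting = {s: labels for s, labels in repeated.items() if len(set(labels)) > 1}
print(f"Repeated sentences: {len(repeated)}, with conflicting labels: {len(conflicting)}")
```

Exact repeats with the same label can simply be dropped; repeats with different labels are a judgment call, so the sketch only reports them.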


In [18]:
all_data = []
for dataset in possible_datasets:
    file_path = f"{path_to_files}{files_base_name}{dataset}{files_ends_with}"
    data = load_financial_phrasebank(file_path)
    all_data.extend(data)

print(f"Total number of sentences across all datasets: {len(all_data)}")
print(f"Total number of unique sentences across all datasets: {len(set(item['sentence'] for item in all_data))}")

Total number of sentences across all datasets: 14780
Total number of unique sentences across all datasets: 4838


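Note that we should use only one of the `possible_datasets`. The four files together hold 14780 rows but only 4838 unique sentences, which is exactly the number of unique sentences in `50Agree`, so every sentence in the stricter subsets already appears there and combining the files would only duplicate data.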

---
# TO DO
---

# Preprocessing, visualizations, etc.


1. Data Size and Structure
- `df.shape`, `df.info()`, `df.describe()`
- Total number of documents
- Check for null fields
- Distribution by class (if supervised)
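
As a starting point for the structure checks, here is a pandas-free sketch that works directly on the list of dicts returned by `load_financial_phrasebank`. The helper `summarize_structure` and the toy rows are hypothetical; with a DataFrame, `df.shape` and `df.info()` would give the same information.

```python
def summarize_structure(data, fields=("sentence", "label")):
    """Return the row count, the field names seen, and empty-value counts per field."""
    empty = {field: 0 for field in fields}
    seen_fields = set()
    for item in data:
        seen_fields.update(item)  # iterating a dict yields its keys
        for field in fields:
            if not item.get(field):  # missing key, None, or empty string
                empty[field] += 1
    return {"rows": len(data), "fields": sorted(seen_fields), "empty": empty}


# Hypothetical toy rows, for illustration only.
sample = [
    {"sentence": "Sales rose .", "label": "positive"},
    {"sentence": "Profit fell .", "label": ""},
    {"sentence": "", "label": "neutral"},
]

summary = summarize_structure(sample)
print(summary)
```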

2. Raw Text Analysis
- Text length (number of characters, tokens)
- `df['text'].str.len()`
- `df['text'].apply(lambda x: len(x.split()))`
- Density of words, symbols, stopwords, emojis
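
The length and density measures could be sketched with the standard library alone, as below. The helpers `length_stats` and `symbol_ratio` and the toy rows are hypothetical, and whitespace splitting is assumed as the tokenizer (the files already separate punctuation with spaces, as the printed examples show).

```python
import statistics


def length_stats(values):
    """Summarize a list of lengths with min, max, mean and median."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def symbol_ratio(text):
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text)


# Hypothetical toy rows, for illustration only.
sample = [
    {"sentence": "Sales rose .", "label": "positive"},
    {"sentence": "Profit fell sharply in 2009 .", "label": "negative"},
    {"sentence": "No change .", "label": "neutral"},
]

char_lengths = [len(item["sentence"]) for item in sample]
token_lengths = [len(item["sentence"].split()) for item in sample]

print("characters:", length_stats(char_lengths))
print("tokens:", length_stats(token_lengths))
print("symbol ratios:", [round(symbol_ratio(item["sentence"]), 3) for item in sample])
```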

3. Class Distribution
- Frequency and proportion per class
- `.value_counts()`
- Check for class imbalance
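
A possible way to get class frequencies and a rough imbalance measure without pandas uses `collections.Counter`. The helper `class_distribution`, the toy rows, and the majority-to-minority ratio as the imbalance measure are all assumptions made for illustration.

```python
from collections import Counter


def class_distribution(data):
    """Return {label: (count, proportion)}, ordered from most to least frequent."""
    counts = Counter(item["label"] for item in data)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Hypothetical toy rows, for illustration only.
sample = [
    {"sentence": "a", "label": "neutral"},
    {"sentence": "b", "label": "neutral"},
    {"sentence": "c", "label": "neutral"},
    {"sentence": "d", "label": "positive"},
    {"sentence": "e", "label": "positive"},
    {"sentence": "f", "label": "negative"},
]

dist = class_distribution(sample)
for label, (count, share) in dist.items():
    print(f"{label}: {count} ({share:.1%})")

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
class_counts = [count for count, _ in dist.values()]
imbalance = max(class_counts) / min(class_counts)
print(f"Imbalance ratio: {imbalance:.1f}")
```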

4. Tokenization and Frequency
- Simple tokenization: `nltk.word_tokenize`, `str.split`
- Word counts:
- `from collections import Counter`
- `Counter(" ".join(df['text']).split()).most_common(10)`
- Word clouds or frequency histograms
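
The word count idea above can be sketched on the list-of-dicts structure as follows. This version uses `str.split` rather than `nltk.word_tokenize`, and additionally lowercases and drops punctuation-only tokens; those choices, the `tokenize` helper, and the toy rows are assumptions.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and drop tokens with no letter or digit."""
    return [tok for tok in text.lower().split() if any(ch.isalnum() for ch in tok)]


# Hypothetical toy rows, for illustration only.
sample = [
    {"sentence": "Net sales rose .", "label": "positive"},
    {"sentence": "Net sales fell .", "label": "negative"},
    {"sentence": "Operating profit rose .", "label": "positive"},
]

word_counts = Counter(tok for item in sample for tok in tokenize(item["sentence"]))

for word, count in word_counts.most_common(3):
    print(word, count)
```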

5. N-grams
- Extraction of frequent bigrams/trigrams
- Identification of local patterns
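
Bigrams and trigrams can be extracted with a sliding window built from `zip`. This is only a sketch: the `ngrams` helper and the toy rows are hypothetical, and n-grams are counted per sentence so that they never span two sentences.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical toy rows, for illustration only.
sample = [
    {"sentence": "Net sales rose .", "label": "positive"},
    {"sentence": "Net sales fell .", "label": "negative"},
    {"sentence": "Operating profit rose .", "label": "positive"},
]

bigram_counts = Counter()
for item in sample:
    tokens = item["sentence"].lower().split()
    bigram_counts.update(ngrams(tokens, 2))

for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

Passing `3` instead of `2` to `ngrams` yields trigrams; a sentence shorter than `n` tokens simply contributes nothing.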


8. TF-IDF and Vector Matrix
- Build the TF-IDF matrix
- Analyze sparsity and discriminative terms
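
To make the TF-IDF step concrete, here is a plain-Python sketch using the textbook weighting $\text{tf} \cdot \log(N / \text{df})$ with term frequency normalized by document length. This is an assumption chosen for readability: library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ. The `tfidf` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns (vocab, rows), one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vocab, rows


# Hypothetical toy documents, already tokenized.
docs = [
    ["net", "sales", "rose"],
    ["net", "sales", "fell"],
    ["operating", "profit", "rose"],
]

vocab, rows = tfidf(docs)

# Sparsity: share of zero cells in the documents x vocabulary matrix.
nonzero = sum(1 for row in rows for weight in row.values() if weight > 0)
sparsity = 1 - nonzero / (len(rows) * len(vocab))
print(f"Vocabulary size: {len(vocab)}, sparsity: {sparsity:.2f}")

# The highest-weighted term of each document is its most discriminative one.
for i, row in enumerate(rows):
    print(i, max(row, key=row.get))
```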

9. Embeddings
- Extract mean embeddings (Word2Vec, FastText, BERT)
- 2D projection with PCA, t-SNE, or UMAP

11. Outliers and Noise
- Empty, very short, or duplicated texts
- Lexical noise (repeated characters, spam)
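
The outlier checks might look like the sketch below. The thresholds (fewer than 3 tokens counts as short, 4 or more identical characters in a row counts as noise), the `flag_outliers` helper, and the toy rows are all assumptions to be tuned against the real data. Duplicates are detected after lowercasing and collapsing whitespace, which is slightly looser than the exact-match count used earlier.

```python
import re
from collections import Counter

REPEATED_CHAR = re.compile(r"(.)\1{3,}")  # the same character 4+ times in a row


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def flag_outliers(data, min_tokens=3):
    """Return (index, reasons) for every row that trips at least one check."""
    norm_counts = Counter(normalize(item["sentence"]) for item in data)
    flags = []
    for idx, item in enumerate(data):
        text = item["sentence"]
        reasons = []
        if not text.strip():
            reasons.append("empty")
        elif len(text.split()) < min_tokens:
            reasons.append("short")
        if REPEATED_CHAR.search(text):
            reasons.append("repeated_chars")
        if text.strip() and norm_counts[normalize(text)] > 1:
            reasons.append("duplicate")
        if reasons:
            flags.append((idx, reasons))
    return flags


# Hypothetical toy rows, for illustration only.
sample = [
    {"sentence": "Sales rose .", "label": "positive"},
    {"sentence": "sales  rose .", "label": "positive"},
    {"sentence": "", "label": "neutral"},
    {"sentence": "Up .", "label": "positive"},
    {"sentence": "Buy nowwwww !!!!!", "label": "neutral"},
    {"sentence": "Profit fell sharply .", "label": "negative"},
]

flags = flag_outliers(sample)
for idx, reasons in flags:
    print(idx, reasons)
```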
