<a href="https://colab.research.google.com/github/LoTzuChin/113-1-FinancialBigData/blob/main/Multimodal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
!pip install datasets



In [2]:
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Financial Phrase Bank v1.0: Polar sentiment dataset of sentences from
financial news. The dataset consists of 4840 sentences from English language
financial news categorised by sentiment. The dataset is divided by agreement
rate of 5-8 annotators."""


import os

import datasets


_CITATION = """\
@article{Malo2014GoodDO,
  title={Good debt or bad debt: Detecting semantic orientations in economic texts},
  author={P. Malo and A. Sinha and P. Korhonen and J. Wallenius and P. Takala},
  journal={Journal of the Association for Information Science and Technology},
  year={2014},
  volume={65}
}
"""

_DESCRIPTION = """\
The key arguments for the low utilization of statistical techniques in
financial sentiment analysis have been the difficulty of implementation for
practical applications and the lack of high quality training data for building
such models. Especially in the case of finance and economic texts, annotated
collections are a scarce resource and many are reserved for proprietary use
only. To resolve the missing training data problem, we present a collection of
∼ 5000 sentences to establish human-annotated standards for benchmarking
alternative modeling techniques.
The objective of the phrase level annotation task was to classify each example
sentence into a positive, negative or neutral category by considering only the
information explicitly available in the given sentence. Since the study is
focused only on financial and economic domains, the annotators were asked to
consider the sentences from the view point of an investor only; i.e. whether
the news may have positive, negative or neutral influence on the stock price.
As a result, sentences which have a sentiment that is not relevant from an
economic or financial perspective are considered neutral.
This release of the financial phrase bank covers a collection of 4840
sentences. The selected collection of phrases was annotated by 16 people with
adequate background knowledge on financial markets. Three of the annotators
were researchers and the remaining 13 annotators were master’s students at
Aalto University School of Business with majors primarily in finance,
accounting, and economics.
Given the large number of overlapping annotations (5 to 8 annotations per
sentence), there are several ways to define a majority vote based gold
standard. To provide an objective comparison, we have formed 4 alternative
reference datasets based on the strength of majority agreement: all annotators
agree, >=75% of annotators agree, >=66% of annotators agree and >=50% of
annotators agree.
"""

_HOMEPAGE = "https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news"

_LICENSE = "Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License"

_REPO = "https://huggingface.co/datasets/financial_phrasebank/resolve/main/data"
_URL = f"{_REPO}/FinancialPhraseBank-v1.0.zip"


_VERSION = datasets.Version("1.0.0")


class FinancialPhraseBankConfig(datasets.BuilderConfig):
    """BuilderConfig for FinancialPhraseBank."""

    def __init__(
        self,
        split,
        **kwargs,
    ):
        """BuilderConfig for FinancialPhraseBank.
        Args:
          split: `string`, the agreement level ("all", "75", "66" or "50");
            it selects the config name and the source filename.
        """

        super(FinancialPhraseBankConfig, self).__init__(name=f"sentences_{split}agree", version=_VERSION, **kwargs)

        self.path = os.path.join("FinancialPhraseBank-v1.0", f"Sentences_{split.title()}Agree.txt")


class FinancialPhrasebank(datasets.GeneratorBasedBuilder):

    BUILDER_CONFIGS = [
        FinancialPhraseBankConfig(
            split="all",
            description="Sentences where all annotators agreed",
        ),
        FinancialPhraseBankConfig(split="75", description="Sentences where at least 75% of annotators agreed"),
        FinancialPhraseBankConfig(split="66", description="Sentences where at least 66% of annotators agreed"),
        FinancialPhraseBankConfig(split="50", description="Sentences where at least 50% of annotators agreed"),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "sentence": datasets.Value("string"),
                    "label": datasets.features.ClassLabel(
                        names=[
                            "negative",
                            "neutral",
                            "positive",
                        ]
                    ),
                }
            ),
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        data_dir = dl_manager.download_and_extract(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={"filepath": os.path.join(data_dir, self.config.path)},
            ),
        ]

    def _generate_examples(self, filepath):
        """Yields examples."""
        with open(filepath, encoding="iso-8859-1") as f:
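            for id_, line in enumerate(f):
                # Split on the last "@" only; the sentence itself may contain "@".
                sentence, label = line.rsplit("@", 1)
                # Strip the trailing newline so the label matches a ClassLabel name exactly.
                yield id_, {"sentence": sentence, "label": label.strip()}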



In [3]:
import datasets

def load_dataset(filepath):
  """Read a `sentence@label` text file into a HuggingFace Dataset."""
  sentences = []
  labels = []
  with open(filepath, encoding="iso-8859-1") as f:
    for line in f:
      # Split on the last "@" only, so sentences that contain "@" still parse.
      sentence, label = line.strip().rsplit("@", 1)
      sentences.append(sentence)
      labels.append(label)

  dataset = datasets.Dataset.from_dict({"sentence": sentences, "label": labels})
  return dataset

if __name__ == "__main__":
  filepath = "/content/Sentences_AllAgree.txt"
  dataset = load_dataset(filepath)
  print(dataset[0])

{'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 'neutral'}
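The file stores one `sentence@label` pair per line, so the parsing step can be checked in isolation without the data file. The snippet below is a minimal sketch: `parse_line` is a hypothetical helper (not part of the notebook) and the sample line is made up to show why splitting on the last `@` matters.

```python
def parse_line(line):
    """Split one raw line into (sentence, label) on the LAST '@' only."""
    sentence, label = line.rstrip("\r\n").rsplit("@", 1)
    return sentence.strip(), label.strip()


# Hypothetical line: the sentence itself contains an '@'.
sample = "Contact ir@example.com for details .@neutral\n"

sentence, label = parse_line(sample)
print(sentence)
print(label)
```

A plain `split("@")` on the same line would return three pieces and the two-name unpacking would raise `ValueError`.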


In [4]:
# Example data (replace with the path to the actual dataset)
# sentences = [
#     "The company reported a significant increase in revenue.",
#     "The stock prices fell sharply after the announcement.",
#     "The new product line is expected to boost sales."
# ]
# labels = ["positive", "negative", "positive"]

# Build the dataset
filepath = "/content/Sentences_AllAgree.txt"
dataset = load_dataset(filepath)

# Split the dataset (80% train, 20% test)
train_test_split_ratio = 0.8
split = dataset.train_test_split(test_size=1 - train_test_split_ratio)
train_dataset, test_dataset = split["train"], split["test"]


In [5]:
from transformers import AutoTokenizer

# Load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Preprocessing function
def preprocess_function(examples):
    tokenized = tokenizer(examples["sentence"], padding="max_length", truncation=True, max_length=128)
    # Map labels to 0 and 1: "positive" -> 1, everything else ("neutral", "negative") -> 0
    tokenized["labels"] = [1 if label == "positive" else 0 for label in examples["label"]]
    return tokenized

# Apply the preprocessing
train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# Drop the raw columns and keep only the inputs the model needs
train_dataset = train_dataset.remove_columns(["sentence", "label"])
test_dataset = test_dataset.remove_columns(["sentence", "label"])




Map:   0%|          | 0/1811 [00:00<?, ? examples/s]

Map:   0%|          | 0/453 [00:00<?, ? examples/s]
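The preprocessing step maps `positive` to 1 and every other label to 0, which merges `neutral` and `negative` into one class. The sketch below puts that binary mapping next to a three-class alternative that follows the `negative`/`neutral`/`positive` order of the `ClassLabel` in the loader script. `encode_labels`, `LABEL_NAMES` and `LABEL2ID` are hypothetical names introduced for illustration; using the three-class variant would also require a model created with `num_labels=3`.

```python
# Label order taken from the ClassLabel definition in the loader script.
LABEL_NAMES = ["negative", "neutral", "positive"]
LABEL2ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def encode_labels(labels, binary=True):
    """Turn string labels into integer ids.

    binary=True reproduces the notebook's mapping (positive vs. the rest);
    binary=False keeps all three classes.
    """
    if binary:
        return [1 if label == "positive" else 0 for label in labels]
    return [LABEL2ID[label] for label in labels]


raw = ["neutral", "positive", "negative"]
print(encode_labels(raw))                # binary ids
print(encode_labels(raw, binary=False))  # three-class ids
```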

In [6]:
from torch.utils.data import Dataset, DataLoader
import torch

class FinancialDataset(Dataset):
    def __init__(self, dataset):
        # Convert the HuggingFace Dataset into a dict of columns
        self.dataset = {key: dataset[key] for key in dataset.column_names}

    def __len__(self):
        return len(next(iter(self.dataset.values())))  # length of the first column

    def __getitem__(self, idx):
        # Fetch one item from each column and convert it to a PyTorch tensor
        return {key: torch.tensor(val[idx]) for key, val in self.dataset.items()}




In [7]:
# Wrap the HuggingFace Datasets as PyTorch Datasets
train_dataset = FinancialDataset(train_dataset)
test_dataset = FinancialDataset(test_dataset)

# Create the DataLoaders (DataLoader was imported in the previous cell)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Start training
from transformers import AutoModelForSequenceClassification
from torch.optim import AdamW
from torch.nn import functional as F

# Load the pretrained model (2-class classification)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Train the model
for epoch in range(3):  # train for 3 epochs
    model.train()
    for batch in train_loader:
        # Move the batch to the device
        inputs = {key: val.to(device) for key, val in batch.items() if key != "labels"}
        labels = batch["labels"].to(device)

        # Forward pass
        outputs = model(**inputs)
        loss = F.cross_entropy(outputs.logits, labels)

        # Backward pass and optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Note: this is the loss of the last batch in the epoch, not the epoch average
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.031026527285575867
Epoch 2, Loss: 0.17890039086341858
Epoch 3, Loss: 0.003432067111134529
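The training loop prints `loss.item()` after the inner loop has finished, so each reported figure is the loss of the final batch only. That explains why the values jump between epochs. A sample-weighted running mean gives a steadier per-epoch number. The sketch below is one way to do it: `RunningMean` is a hypothetical helper and the `(loss, batch_size)` pairs are made-up values, not results from this notebook.

```python
class RunningMean:
    """Sample-weighted running mean of a scalar such as the batch loss."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # Weight each batch loss by its batch size so a short final batch
        # does not count as much as a full one.
        self.total += float(value) * n
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


# Made-up (loss, batch_size) pairs standing in for one epoch of training.
batch_losses = [(0.5, 32), (0.3, 32), (0.1, 16)]

epoch_loss = RunningMean()
for loss_value, batch_size in batch_losses:
    epoch_loss.update(loss_value, batch_size)

print(f"mean loss: {epoch_loss.mean:.4f}")
```

Inside the real loop this would amount to calling `epoch_loss.update(loss.item(), labels.size(0))` after each step and printing `epoch_loss.mean` once per epoch.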


In [8]:
from sklearn.metrics import accuracy_score

# Evaluate the model
model.eval()
all_preds, all_labels = [], []

with torch.no_grad():
    for batch in test_loader:
        inputs = {key: val.to(device) for key, val in batch.items() if key != "labels"}
        labels = batch["labels"].to(device)

        outputs = model(**inputs)
        preds = torch.argmax(outputs.logits, dim=-1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Compute accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f"Test Accuracy: {accuracy}")


Test Accuracy: 0.9735099337748344
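Accuracy alone can look strong when the positive class is a minority, because predicting the majority class is already right most of the time. Precision, recall and F1 for the positive class give a fuller picture. The sketch below computes them in plain Python: `binary_report` is a hypothetical helper and the label lists are made-up examples, not this notebook's predictions.

```python
def binary_report(y_true, y_pred, positive=1):
    """Confusion counts plus accuracy, precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

report = binary_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value}")
```

With the notebook's own variables, the call would be `binary_report(all_labels, all_preds)`.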
