**<center><h1>Sentiment Analysis in Finance</h1></center>**

## **Step 1. Install Libraries**

In [1]:
! pip install -q -U torch=='2.0.0'

In [2]:
! pip install -q -U accelerate=='0.25.0' peft=='0.7.1' bitsandbytes=='0.41.3.post2' trl=='0.7.4'

In [3]:
! pip install -q -U transformers einops

In [4]:
! pip install git+https://github.com/huggingface/trl.git@7630f877f91c556d9e5a3baa4b6e2894d90ff84c

Collecting git+https://github.com/huggingface/trl.git@7630f877f91c556d9e5a3baa4b6e2894d90ff84c
  Cloning https://github.com/huggingface/trl.git (to revision 7630f877f91c556d9e5a3baa4b6e2894d90ff84c) to /tmp/pip-req-build-zwk6uww8
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/trl.git /tmp/pip-req-build-zwk6uww8
  Running command git rev-parse -q --verify 'sha^7630f877f91c556d9e5a3baa4b6e2894d90ff84c'
  Running command git fetch -q https://github.com/huggingface/trl.git 7630f877f91c556d9e5a3baa4b6e2894d90ff84c
  Running command git checkout -q 7630f877f91c556d9e5a3baa4b6e2894d90ff84c
  Resolved https://github.com/huggingface/trl.git to commit 7630f877f91c556d9e5a3baa4b6e2894d90ff84c
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: trl
  Building wheel for trl (pyproject.toml) ... done

In [5]:
! pip install accelerate==0.27.2

Collecting accelerate==0.27.2
  Downloading accelerate-0.27.2-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 280.0/280.0 kB 10.2 MB/s eta 0:00:00
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.25.0
    Uninstalling accelerate-0.25.0:
      Successfully uninstalled accelerate-0.25.0
Successfully installed accelerate-0.27.2


## **Step 2. Import Libraries**

In [6]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [7]:
import warnings
warnings.filterwarnings("ignore")

In [8]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from datasets import load_dataset

2024-06-08 11:50:11.258618: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-08 11:50:11.258738: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-08 11:50:11.385182: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## **Step 3. Data Preparation**

In [9]:
tfns = load_dataset('FinGPT/fingpt-sentiment-train')
tfns = tfns['train']
tfns = tfns.to_pandas()
tfns.head()

Downloading readme:   0%|          | 0.00/529 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.42M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/76772 [00:00<?, ? examples/s]

| | input | output | instruction |
|---|---|---|---|
| 0 | Teollisuuden Voima Oyj , the Finnish utility k... | neutral | What is the sentiment of this news? Please cho... |
| 1 | Sanofi poaches AstraZeneca scientist as new re... | neutral | What is the sentiment of this news? Please cho... |
| 2 | Starbucks says the workers violated safety pol... | moderately negative | What is the sentiment of this news? Please cho... |
| 3 | $brcm raises revenue forecast | positive | What is the sentiment of this tweet? Please ch... |
| 4 | Google parent Alphabet Inc. reported revenue a... | moderately negative | What is the sentiment of this news? Please cho... |


In [10]:
tfns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76772 entries, 0 to 76771
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   input        76772 non-null  object
 1   output       76772 non-null  object
 2   instruction  76772 non-null  object
dtypes: object(3)
memory usage: 1.8+ MB


In [11]:
tfns["output"].value_counts()

output
neutral                29215
positive               21588
negative               11749
moderately positive     6163
moderately negative     2972
mildly positive         2548
mildly negative         2108
strong negative          218
strong positive          211
Name: count, dtype: int64

In [12]:
values_to_remove = ["moderately positive", "moderately negative", "mildly positive", "mildly negative", "strong negative", "strong positive"]     
df_filtered = tfns.loc[~tfns["output"].isin(values_to_remove)]
df_filtered = df_filtered.rename(columns={'input': 'text'})
df_filtered = df_filtered.rename(columns={'output': 'sentiment'})
df_filtered = df_filtered.drop(columns=['instruction'])
df_filtered = df_filtered[['sentiment', 'text']]
df_filtered.head()

| | sentiment | text |
|---|---|---|
| 0 | neutral | Teollisuuden Voima Oyj , the Finnish utility k... |
| 1 | neutral | Sanofi poaches AstraZeneca scientist as new re... |
| 3 | positive | $brcm raises revenue forecast |
| 5 | neutral | The Finnish company Stockmann has signed the c... |
| 6 | neutral | Bernie Madoff, the former Wall Street investme... |


In [13]:
df_filtered["sentiment"].value_counts()

sentiment
neutral     29215
positive    21588
negative    11749
Name: count, dtype: int64
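
The three remaining classes are still imbalanced, which is why the split that follows draws the same number of rows from each class. A minimal sketch of that arithmetic, using only the counts printed above; the variable names are illustrative and not part of the notebook:

```python
# Class counts copied from the value_counts() output above.
counts = {"neutral": 29215, "positive": 21588, "negative": 11749}
total = sum(counts.values())

# Share of each class in the filtered data.
shares = {label: n / total for label, n in counts.items()}

# Rows drawn per class by the split: 600 for training plus 300 for testing.
per_class = 600 + 300
used_fraction = {label: per_class / n for label, n in counts.items()}

for label in counts:
    print(f"{label:8s} share={shares[label]:.3f} sampled={used_fraction[label]:.3f}")
```

Sampling a fixed count per class gives balanced train and test sets, at the cost of using only a small fraction of the available rows.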

In [14]:
X_train = list()
X_test = list()
for sentiment in ["positive", "neutral", "negative"]:
    train, test  = train_test_split(df_filtered[df_filtered.sentiment==sentiment], 
                                    train_size=600,
                                    test_size=300, 
                                    random_state=42)
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

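# Rows used for neither training nor testing form the evaluation pool.
# X_train and X_test cover all three classes; `train` and `test` only hold
# the last class from the loop above.
used_idx = X_train.index.union(X_test.index)
X_eval = df_filtered[~df_filtered.index.isin(used_idx)]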
X_eval = (X_eval
          .groupby('sentiment', group_keys=False)
          .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
X_train = X_train.reset_index(drop=True)

def generate_prompt(data_point):
    return f"""The sentiment of the following phrase: '{data_point["text"]}' is 
            \n\n Positive
            \n Negative
            \n Neutral
            \n Cannot be determined
            \n\nSolution: The correct option is {data_point["sentiment"]}""".strip()
def generate_test_prompt(data_point):
    return f"""The sentiment of the following phrase: '{data_point["text"]}' is 
            \n\n Positive
            \n Negative
            \n Neutral
            \n Cannot be determined
            \n\nSolution: The correct option is""".strip()

X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), 
                       columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), 
                      columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)

In [15]:
print(X_train["text"][0])

The sentiment of the following phrase: 'Keith Weiss, Morgan Stanley analyst, joins 'Squawk on the Street' to discuss Microsoft's earnings results where the company delivered a beat on expectations.' is 
            

 Positive
            
 Negative
            
 Neutral
            
 Cannot be determined
            

Solution: The correct option is neutral


In [16]:
print(X_test["text"][71334])

The sentiment of the following phrase: '$TSLA is now up 57% from its February low. Amazing rebound.' is 
            

 Positive
            
 Negative
            
 Neutral
            
 Cannot be determined
            

Solution: The correct option is
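
The printed prompts show that the indentation inside the triple-quoted f-strings ends up in the model input as runs of spaces. If that whitespace is unwanted, one possible alternative is to assemble the prompt from a list of lines. This is only a sketch: `OPTIONS` and `build_prompt` are hypothetical names that do not appear in the notebook, and switching to it would change the prompts, so the scores reported in this notebook would not carry over.

```python
# Hypothetical helper: same wording as generate_prompt / generate_test_prompt,
# but without the indentation carried over from the source code.
OPTIONS = ["Positive", "Negative", "Neutral", "Cannot be determined"]

def build_prompt(text, sentiment=None):
    """Return a training prompt if `sentiment` is given, else a test prompt."""
    lines = [f"The sentiment of the following phrase: '{text}' is", ""]
    lines += OPTIONS
    lines += ["", "Solution: The correct option is"]
    prompt = "\n".join(lines)
    if sentiment is not None:
        prompt += f" {sentiment}"
    return prompt

print(build_prompt("$brcm raises revenue forecast", "positive"))
```

Calling `build_prompt(text)` without a label yields the test-time variant that ends at "The correct option is".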


In [17]:
def evaluate(y_true, y_pred):
    labels = ['positive', 'neutral', 'negative']
    mapping = {'positive': 2, 'neutral': 1, 'none':1, 'negative': 0}
    def map_func(x):
        return mapping.get(x, 1)
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Per-class accuracy: share of correct predictions among rows of each true label
    unique_labels = set(y_true)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)

In [18]:
model_name = "/kaggle/input/llama-3/transformers/8b-chat-hf/1"

compute_dtype = torch.float16

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"working on {device}")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

max_seq_length = 2048
tokenizer = AutoTokenizer.from_pretrained(model_name, max_seq_length=max_seq_length)
tokenizer.pad_token_id = tokenizer.eos_token_id

working on cuda:0


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [19]:
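def predict(test, model, tokenizer):
    # NOTE: the `test` argument is unused; prompts are read from the global X_test.
    # Build the generation pipeline once instead of once per prompt.
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer, 
                    max_new_tokens = 1, 
                    temperature = 0.0,
                   )
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        result = pipe(prompt)
        # Keep only the completion after the prompt's final marker, so that
        # words inside the quoted phrase are not mistaken for the answer.
        answer = result[0]['generated_text'].split("The correct option is")[-1]
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred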

In [22]:
y_pred = predict(X_test, model, tokenizer)

100%|██████████| 900/900 [04:36<00:00,  3.25it/s]


In [25]:
evaluate(y_true, y_pred)

Accuracy: 0.339
Accuracy for label 0: 0.007
Accuracy for label 1: 1.000
Accuracy for label 2: 0.010

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.01      0.01       300
           1       0.34      1.00      0.50       300
           2       1.00      0.01      0.02       300

    accuracy                           0.34       900
   macro avg       0.78      0.34      0.18       900
weighted avg       0.78      0.34      0.18       900


Confusion Matrix:
[[  2 298   0]
 [  0 300   0]
 [  0 297   3]]
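
The per-label accuracies printed above are per-class recalls, and they can be read directly off the confusion matrix: each row is a true label, so recall is the diagonal entry divided by the row sum. A small sketch using the matrix printed above; `per_class_recall` and `overall_accuracy` are illustrative helpers, not part of the notebook:

```python
# Baseline confusion matrix copied from the output above.
# Rows are true labels, columns are predictions (0=negative, 1=neutral, 2=positive).
baseline_cm = [
    [2, 298, 0],
    [0, 300, 0],
    [0, 297, 3],
]

def per_class_recall(cm):
    """Diagonal entry divided by the row sum, for each true label."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def overall_accuracy(cm):
    """Sum of the diagonal divided by the total number of samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

print([round(r, 3) for r in per_class_recall(baseline_cm)])
print(round(overall_accuracy(baseline_cm), 3))
```

Nearly every prediction lands in the neutral column because answers that cannot be parsed are recorded as `none`, which `evaluate` maps to the neutral class.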


In [20]:
output_dir="trained_weigths"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
)

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # directory to save and repository id
    num_train_epochs=4,                       # number of training epochs
    per_device_train_batch_size=1,            # batch size per device during training
    gradient_accumulation_steps=8,            # micro-batches accumulated before each optimizer update
    gradient_checkpointing=False,             # gradient checkpointing disabled (enable it to trade compute for memory)
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,                         # log every 25 steps
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="tensorboard",                  # report metrics to tensorboard
    evaluation_strategy="epoch"               # run evaluation at the end of every epoch
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)

Map:   0%|          | 0/1800 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]
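
To get a feel for how much of the model the `LoraConfig` above actually trains, the adapter size can be estimated by hand. The sketch below assumes the usual Llama-3-8B layer shapes (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 decoder layers); these dimensions are an assumption, not something stated in the notebook, so treat the result as an estimate.

```python
# LoRA settings copied from the LoraConfig above.
r, lora_alpha = 64, 16

# ASSUMPTION: Llama-3-8B layer shapes (not stated in the notebook).
hidden, kv_dim, intermediate, n_layers = 4096, 1024, 14336, 32

# (in_features, out_features) of each targeted projection in one decoder layer.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapted projection adds A (r x in) and B (out x r): r * (in + out) weights.
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * n_layers

# Factor applied to the low-rank update in standard LoRA.
scaling = lora_alpha / r

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters hold roughly 168 million weights, a small fraction of the 8 billion frozen base parameters.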

In [21]:
# Train model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.2899 | 1.465545 |
| 2 | 0.9739 | 1.399553 |
| 3 | 0.5627 | 1.483027 |
| 4 | 0.3718 | 1.64209 |


TrainOutput(global_step=900, training_loss=0.8776998477511936, metrics={'train_runtime': 7280.1397, 'train_samples_per_second': 0.989, 'train_steps_per_second': 0.124, 'total_flos': 1.7800533887090688e+16, 'train_loss': 0.8776998477511936, 'epoch': 4.0})
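
The `global_step=900` in the output follows from the training arguments, and the loss log shows which epoch generalized best. A short sketch of both checks, assuming a single GPU (consistent with `CUDA_VISIBLE_DEVICES="0"`); the variable names are illustrative:

```python
# Values copied from the notebook: 600 training rows for each of 3 classes,
# plus the TrainingArguments used above.
n_train = 600 * 3
per_device_batch = 1
grad_accum = 8
epochs = 4

# One optimizer step consumes per_device_batch * grad_accum examples.
effective_batch = per_device_batch * grad_accum
steps_per_epoch = n_train // effective_batch  # 1800 divides evenly by 8
total_steps = steps_per_epoch * epochs

# Validation losses copied from the training log above.
val_loss = {1: 1.465545, 2: 1.399553, 3: 1.483027, 4: 1.64209}
best_epoch = min(val_loss, key=val_loss.get)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"lowest validation loss at epoch {best_epoch}")
```

Validation loss is lowest after epoch 2 and rises afterwards while training loss keeps falling, which suggests that the later epochs overfit the 1,800 training prompts.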

In [22]:
# Save trained model and tokenizer
trainer.save_model()
tokenizer.save_pretrained(output_dir)

('trained_weigths/tokenizer_config.json',
 'trained_weigths/special_tokens_map.json',
 'trained_weigths/tokenizer.json')

In [23]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

100%|██████████| 900/900 [06:35<00:00,  2.28it/s]

Accuracy: 0.883
Accuracy for label 0: 0.920
Accuracy for label 1: 0.853
Accuracy for label 2: 0.877

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.92      0.93       300
           1       0.84      0.85      0.85       300
           2       0.88      0.88      0.88       300

    accuracy                           0.88       900
   macro avg       0.88      0.88      0.88       900
weighted avg       0.88      0.88      0.88       900


Confusion Matrix:
[[276  17   7]
 [ 14 256  30]
 [  6  31 263]]





In [24]:
evaluation = pd.DataFrame({'text': X_test["text"], 
                           'y_true':y_true, 
                           'y_pred': y_pred},
                         )
evaluation.to_csv("test_predictions.csv", index=False)