<a href="https://colab.research.google.com/github/AmaruEscalante/llm-distillation/blob/main/FinGPTEvaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers==4.40.1 peft==0.4.0
!pip install sentencepiece accelerate torch datasets bitsandbytes




In [None]:
# Authenticate with Hugging Face using an access token (the Llama 3 base weights are gated)
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from transformers import AutoTokenizer, LlamaForCausalLM
from peft import PeftModel
import torch

# Load Models
base_model = "meta-llama/Meta-Llama-3-8B"
peft_model = "FinGPT/fingpt-mt_llama3-8b_lora"
# AutoTokenizer picks the tokenizer class recorded in the checkpoint (PreTrainedTokenizerFast)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # the base tokenizer defines no pad token
tokenizer.padding_side = "left"  # decoder-only models need left padding for batched generation
model = LlamaForCausalLM.from_pretrained(base_model, device_map="cuda:0")
model = PeftModel.from_pretrained(model, peft_model)
model = model.eval()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
import pandas as pd

df = pd.read_parquet("hf://datasets/FinGPT/fingpt-sentiment-train/data/train-00000-of-00001-dabab110260ac909.parquet")

In [None]:
# Make prompts
prompt = [
'''Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}
Input: FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is aggressively pursuing its growth strategy by increasingly focusing on technologically more demanding HDI printed circuit boards PCBs .
Answer: ''',
'''Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}
Input: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Answer: '''
]

tokens = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
res = model.generate(**tokens, max_length=512)
res_sentences = [tokenizer.decode(i) for i in res]
out_text = [o.split("Answer: ")[1] for o in res_sentences]

# Show results
for sentiment in out_text:
    print(sentiment)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Positive<|end_of_text|>
Neutral<|end_of_text|>


In [None]:
# load csv dataset from "output_dataset.csv"
#import pandas as pd
#df = pd.read_csv("output_dataset.csv")

df.head()

   input                                              output               instruction
0  Teollisuuden Voima Oyj , the Finnish utility k...  neutral              What is the sentiment of this news? Please cho...
1  Sanofi poaches AstraZeneca scientist as new re...  neutral              What is the sentiment of this news? Please cho...
2  Starbucks says the workers violated safety pol...  moderately negative  What is the sentiment of this news? Please cho...
3  $brcm raises revenue forecast                      positive             What is the sentiment of this tweet? Please ch...
4  Google parent Alphabet Inc. reported revenue a...  moderately negative  What is the sentiment of this news? Please cho...
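
The `output` column mixes three-way labels (`neutral`, `positive`) with graded ones (`moderately negative`), while the hand-written prompts earlier in this notebook only offer negative/neutral/positive. One possible way to compare predictions against such labels is to collapse graded labels onto the three coarse classes. The sketch below is an assumption, not part of the FinGPT code: `collapse_label` is a hypothetical helper, and whether collapsing is appropriate depends on the answer options listed in each row's instruction.

```python
COARSE_LABELS = ("negative", "neutral", "positive")

def collapse_label(label: str) -> str:
    """Map a graded label such as 'moderately negative' to its coarse class.

    Hypothetical helper: labels that do not end in a coarse class name are
    returned lowercased and otherwise unchanged.
    """
    text = label.strip().lower()
    for coarse in COARSE_LABELS:
        if text.endswith(coarse):
            return coarse
    return text

# Labels taken from the first rows shown above
for raw in ["neutral", "moderately negative", "positive"]:
    print(raw, "->", collapse_label(raw))
```

With pandas this could be applied as `df['output'].map(collapse_label)`, stored in a new column so the original labels stay available.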


In [None]:
print("instruction :", df['instruction'][0])
print("Input :", df['input'][0])


instruction : What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
Input : Teollisuuden Voima Oyj , the Finnish utility known as TVO , said it shortlisted Mitsubishi Heavy s EU-APWR model along with reactors from Areva , Toshiba Corp. , GE Hitachi Nuclear Energy and Korea Hydro & Nuclear Power Co. .


In [None]:
prompt = f"Instruction:{df['instruction'][987]}\nInput: {df['input'][987]}\nAnswer: "

# truncation=True makes max_length take effect explicitly instead of relying on the tokenizer's fallback
tokens = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=1024).to(device)
res = model.generate(**tokens)
res_sentences = [tokenizer.decode(i) for i in res]
out_text = [o.split("Answer: ")[1] for o in res_sentences]

# Show results
for sentiment in out_text:
    print(sentiment)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


neutral<|end_of_text|>


In [None]:
res_sentences

['<|begin_of_text|>Instruction:What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: $UVXY Put the chum out there at key support then next level down - careful\nAnswer: neutral<|end_of_text|>']
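
The decoded string above contains the full prompt, the answer, and the special tokens. The cells in this notebook recover the answer with `o.split("Answer: ")[1]`, which raises `IndexError` when the marker is missing and picks the wrong piece if the input text itself contains `Answer: `. A minimal sketch of a more defensive extraction follows; `extract_answer` and `SPECIAL_TOKENS` are hypothetical names, and the token list is assumed from the output shown above rather than read from the tokenizer.

```python
# Special tokens seen in the decoded output above (assumed list, not read from the tokenizer)
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|end_of_text|>")

def extract_answer(decoded: str, marker: str = "Answer: ") -> str:
    """Return the text after the last occurrence of marker, without special tokens.

    Returns an empty string when the marker is absent.
    """
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else ""

decoded = (
    "<|begin_of_text|>Instruction:What is the sentiment of this tweet? "
    "Please choose an answer from {negative/neutral/positive}.\n"
    "Input: $UVXY Put the chum out there at key support then next level down - careful\n"
    "Answer: neutral<|end_of_text|>"
)
print(extract_answer(decoded))  # neutral
```

Passing `skip_special_tokens=True` to `tokenizer.decode` removes the special tokens at decode time, in which case only the `rpartition` step is needed.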

In [None]:
# Build a prompt from each row's instruction and input, generate an answer,
# and store the predictions in a new column. Test on the first 100 rows.

import pandas as pd
from tqdm import tqdm

# Assumes df, tokenizer, model and device are already defined by the cells above

def process_data(df_subset):
    """Return one predicted sentiment string per row of df_subset."""
    results = []
    for _, row in tqdm(df_subset.iterrows(), total=len(df_subset)):
        # Same prompt layout as the single-row cell above
        prompt = f"Instruction:{row['instruction']}\nInput: {row['input']}\nAnswer: "
        tokens = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=512).to(device)
        res = model.generate(**tokens, max_length=512)
        res_sentence = tokenizer.decode(res[0], skip_special_tokens=True)
        results.append(res_sentence.split("Answer: ")[-1].strip())
    return results

# Split the data into chunks of 100 rows
chunk_size = 100
num_chunks = (len(df) + chunk_size - 1) // chunk_size

# Process only the first chunk; loop over range(num_chunks) to cover the full dataset
i = 0
start_index = i * chunk_size
end_index = min((i + 1) * chunk_size, len(df))
df_subset = df.iloc[start_index:end_index].copy()
all_results = process_data(df_subset)
print(f"Processed chunk {i+1}/{num_chunks}")

# Store predictions for the processed rows only. Assigning 100 values to the
# full 76,772-row df raises a length-mismatch ValueError, and writing to
# 'output' would overwrite the ground-truth labels.
df_subset['predicted_sentiment'] = all_results

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

Processed chunk 1/768

In [None]:
all_results

NameError: name 'all_results' is not defined

In [None]:
import g

In [None]:
# Run the model over every row of df, walking the DataFrame in chunks of 100 rows
from tqdm import tqdm

# Assuming df is your DataFrame with 'instruction' and 'input' columns
chunk_size = 100
results = []

for i in tqdm(range(0, len(df), chunk_size)):
    chunk = df.iloc[i:i+chunk_size]

    for _, row in chunk.iterrows():
        prompt = f"Instruction:{row['instruction']}\nInput: {row['input']}\nAnswer: "

        tokens = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)

        with torch.no_grad():
            res = model.generate(**tokens, max_length=512)

        res_sentence = tokenizer.decode(res[0], skip_special_tokens=True)
        out_text = res_sentence.split("Answer: ")[-1].strip()

        results.append(out_text)

# Add results to the DataFrame
# df['predicted_sentiment'] = results

# Display the first few results
# print(df[['input', 'predicted_sentiment']].head())

# Optionally, save the results
# df.to_csv('sentiment_predictions.csv', index=False)

Streaming output truncated to the last 5000 lines.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

In [None]:
results

['Neutral',
 'Neutral',
 'Neutral',
 'Positive',
 'Negative',
 'Neutral',
 'Neutral',
 'neutral',
 '',
 'Neutral',
 'neutral',
 'Neutral',
 'Neutral',
 'Positive',
 'Neutral',
 'Neutral',
 'neutral',
 'Neutral',
 'Positive',
 'Neutral']
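
The predictions above vary in capitalisation (`'Neutral'` vs `'neutral'`) and include an empty string, so a direct string comparison with the `output` column would undercount matches. A minimal sketch of a normalised accuracy computation follows. `normalize` and `accuracy` are hypothetical helpers, and the gold labels in the example are made up for illustration, not taken from the dataset.

```python
def normalize(text: str) -> str:
    """Lowercase and trim a label so 'Neutral' and 'neutral ' compare equal."""
    return text.strip().lower()

def accuracy(predictions, labels) -> float:
    """Fraction of predictions that match their label after normalisation.

    An empty prediction simply counts as a miss.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not predictions:
        return 0.0
    correct = sum(normalize(p) == normalize(l) for p, l in zip(predictions, labels))
    return correct / len(predictions)

# Toy example: predictions in the style of the list above, gold labels invented
preds = ["Neutral", "neutral", "", "Positive"]
gold = ["neutral", "neutral", "negative", "negative"]
print(accuracy(preds, gold))  # 0.5
```

For the notebook's data, something like `accuracy(results, df['output'].iloc[:len(results)].tolist())` would compare the generated answers with the first `len(results)` ground-truth labels, assuming `results` is in the same row order as `df`.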