## Running an LLM in the cloud (e.g. Google Colab or Kaggle)

**Open this notebook in Google Colab or Kaggle to use free GPU resources in the cloud, or click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1cbOGsTa96las7bT0bG4cCKWPNzIRdQtb?usp=sharing) for a shared version on Google Colab.**

In this notebook we are going to learn how to run an LLM in the cloud using the Hugging Face Transformers library. Please refer to the README file for more information on the Python installations you should have on your machine.

Pros of the HuggingFace Transformers library:
* models are downloaded automatically
* ready-made code snippets are available to run any model
* models are easy to integrate into your project

Cons:
* you need to code the behaviour / user interaction yourself
* it is not as fast as some alternatives and struggles to run big LLMs (>3-4B parameters) on free cloud GPUs
* it requires large computational resources
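
To make the size limit above more concrete, here is a back-of-the-envelope sketch of how much memory the weights alone occupy. It assumes that memory is roughly the number of parameters times the bytes per parameter; the helper name `weight_memory_gib` and the dtype table are illustrative, not part of any library. Activations, the KV cache and framework overhead come on top of these figures.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: size = number of parameters * bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "float32") -> float:
    """Return the approximate size of the weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for name, params in [("1B model", 1e9), ("7B model", 7e9)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name} in {dtype}: ~{weight_memory_gib(params, dtype):.1f} GiB")
```

Comparing these numbers with the memory reported by `nvidia-smi` gives a quick indication of whether a model has a chance of fitting on the allocated GPU.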

### Importing packages and checking GPU access

In [None]:
!pip install datasets
!pip install --upgrade transformers
!pip install -U langchain langchain-openai
!pip install -U langchain-community

In [2]:
# Import some basic packages
import os
import torch
import torchvision
import tensorflow as tf
import pandas as pd

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModelForCausalLM

from huggingface_hub import hf_hub_download
from huggingface_hub import login

In [None]:
# make sure you have an updated transformers version (>4.43)
import transformers
print(transformers.__version__)

In [None]:
# Both should print True on a macOS machine with MPS support, False when running in the cloud
print(torch.backends.mps.is_available())
print(torch.backends.mps.is_built())
# Should recognize 1 GPU. If not, go to Runtime --> Change runtime type --> select a GPU for this session
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

In [5]:
# Run locally
# device = "mps" if torch.backends.mps.is_available() else "cpu"

# For Colab
if torch.cuda.is_available():
    device = "cuda:0"
else:
    device = "cpu"

device = torch.device(device)

DEVICE = device
DEVICE

device(type='cpu')

### Authenticate through HuggingFace

HuggingFace requires you to log in with an access token to use most of its models and datasets. We suggest creating a HuggingFace account and going to Settings to create an access token. Use a name of your preference and create either a read or a write token. A useful video walkthrough on how to generate tokens: https://www.youtube.com/watch?v=Br7AcznvzSA. Take care not to share your token when sharing this notebook.
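
One way to keep the token out of a shared notebook is to read it from an environment variable instead of pasting it into a cell. The sketch below is only an illustration: the variable name `HF_TOKEN` and the helpers `get_hf_token` and `mask_token` are assumptions of this example, and the actual login call is left as a comment so the snippet runs without network access.

```python
import os
from typing import Optional


def get_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Read the access token from an environment variable, if it is set."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def mask_token(token: str, visible: int = 4) -> str:
    """Hide all but the first few characters so the token is safe to print."""
    return token[:visible] + "*" * max(len(token) - visible, 0)


token = get_hf_token()
if token is None:
    print("HF_TOKEN is not set; fall back to notebook_login().")
else:
    print(f"Using token {mask_token(token)}")
    # With huggingface_hub installed, you could then call: login(token=token)
```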

In [None]:
# Login to HuggingFace with your token
from huggingface_hub import notebook_login
notebook_login()

# Alternatively, you can login via terminal with this command: huggingface-cli login
# Add your token and hit Y when prompted to add token as git credentials

#### Load and run HuggingFace models (non-LLMs)

Find a model you'd like to use on HuggingFace and click the button in the top-right corner of its page to see how to use it with Transformers. Copy and paste that code block here. Make sure you include a pipeline, an input_text and an output argument to call the model. Some useful examples are provided below.

The model name or model_id can be copied from the top of the model page; it will be something like 'meta-llama/Llama-2-7b-chat-hf'.

In [None]:
# EXAMPLES of HuggingFace transformer-based models for different NLP tasks.


## Sentiment Analyser
classifier = pipeline("sentiment-analysis",
                      model="cardiffnlp/twitter-roberta-base-sentiment",
                      tokenizer="cardiffnlp/twitter-roberta-base-sentiment", device = DEVICE)
results = classifier("We are very happy to show you the 🤗 Transformers library.")
print(f"{results[0]['label']} with score {results[0]['score']}")


## Named Entity Recognition

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)




#### Load and run HuggingFace models (LLMs)

We are going to see some examples of running LLMs using HuggingFace and LangChain below.

In [8]:
# check the type of GPU allocated to our notebook
!nvidia-smi


In [9]:
torch.cuda.empty_cache() # free up unused memory

In [10]:
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

Example 1: The snippet below is a sample run of a relatively small LLM. To run an LLM of your choice, adapt the tokenizer and model lines. You also need to adjust the pipeline to the task. Some key tasks:
* "text-generation"
* "text-classification"
* "summarization"
* "question-answering"
* "sentiment-analysis" and more.

In [None]:
%%time
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Set pad_token_id if it’s not already set
if model.generation_config.pad_token_id is None:
    model.generation_config.pad_token_id = tokenizer.eos_token_id
print(f"Pad token ID set to: {model.generation_config.pad_token_id}")


# Other options to experiment with: device_map='auto' (instead of device), load_in_8bit=True
text_gen_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, device=DEVICE, batch_size=1)

# Provide an input prompt for text generation
input_text = "Some of the challenges in current financial markets" #"The global financial markets are currently experiencing"

# Generate text
generated_text = text_gen_pipeline(input_text, max_length=150, num_return_sequences=1, truncation=True) #max_length=150, temperature=0.7, top_k=50, top_p=0.95, num_return_sequences=1

# Print the result
print(generated_text[0]['generated_text'])

Example 2: CAUTION! The following example will run out of memory (OOM) on Colab/Kaggle unless you have Pro access to more powerful GPUs. We do not advise running it.

In [None]:
# ## This piece of code runs out of memory on Colab/Kaggle

# %%time

# # finma-7b-full
# model_id = "TheFinAI/finma-7b-full"
# tokenizer = AutoTokenizer.from_pretrained("TheFinAI/finma-7b-full", legacy = True)
# # tokenizer.save_pretrained(f"cache/tokenizer/{model_id}")
# model = AutoModelForCausalLM.from_pretrained("TheFinAI/finma-7b-full")
# # model.save_pretrained(f"cache/model/{model_id}")

# # Set pad_token_id if it’s not already set
# if model.generation_config.pad_token_id is None:
#     model.generation_config.pad_token_id = tokenizer.eos_token_id
# print(f"Pad token ID set to: {model.generation_config.pad_token_id}")

# text_gen_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, load_in_8bit=True, device=DEVICE, batch_size = 1) #batch_size, device = DEVICE, device_map = 'auto', load_in_8bit=True,

# # Provide an input prompt for text generation
# input_text = "The global financial markets are currently experiencing"

# # Generate text
# generated_text = text_gen_pipeline(input_text, max_length=100, num_return_sequences=1, truncation=True)

# # Print the result
# print(generated_text[0]['generated_text'])


#### Run a model with LangChain
LangChain is a framework designed to simplify the development of applications powered by large language models (LLMs). It provides a suite of open-source tools and integrations that streamline the entire LLM application lifecycle, from development to deployment. Developers can utilize LangChain’s components to build applications such as chatbots, document summarizers, and code analysis tools.

In [None]:
!pip install -U langchain-community

In [None]:
# Example of running an LLM with LangChain
# HuggingFacePipeline lives in langchain-community (installed above);
# PromptTemplate is provided by langchain-core, which is installed with langchain.
from langchain_community.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate

hf = HuggingFacePipeline.from_model_id(
    model_id="microsoft/DialoGPT-medium", task="text-generation", pipeline_kwargs={"max_new_tokens": 200, "pad_token_id": 50256},
)

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "What is electroencephalography?"

print(chain.invoke({"question": question}))

### Saving the model
If you need to save the models locally (for fine-tuning, working offline, etc.), save and reload them like this:


In [None]:
# model_id = "TheFinAI/finma-7b-full"
# tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=True)
# tokenizer.save_pretrained(f"cache/tokenizer/{model_id}")
# model = AutoModelForCausalLM.from_pretrained(model_id)
# model.save_pretrained(f"cache/model/{model_id}")
# # Reload from the local copies saved above
# # tokenizer = AutoTokenizer.from_pretrained(f"cache/tokenizer/{model_id}", legacy=True)
# # model = AutoModelForCausalLM.from_pretrained(f"cache/model/{model_id}")

You can adjust the behaviour of text generation with parameters such as:

* max_length: The maximum total length in tokens (prompt plus generated text); use max_new_tokens to limit only the generated part.
* temperature: Affects randomness in the output (lower values make output more deterministic).
* top_k and top_p: Sampling strategies that control diversity in generation.
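
To build intuition for these parameters, the following pure-Python sketch applies temperature, top-k and top-p to a made-up set of logits. It is a simplified illustration, not the Transformers implementation; all function names and the toy logits are assumptions of this example. Note that in Transformers these three parameters only take effect when sampling is enabled (`do_sample=True`).

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalise(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(idx, p / total) for idx, p in pairs]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest set whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = _renormalise(ranked[:top_k])
    if top_p is not None:
        kept, mass = [], 0.0
        for idx, p in ranked:
            kept.append((idx, p))
            mass += p
            if mass >= top_p:
                break
        ranked = _renormalise(kept)
    return dict(ranked)


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token index from the filtered distribution."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    tokens = list(candidates)
    weights = [candidates[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(softmax(toy_logits, temperature=1.0))
print(softmax(toy_logits, temperature=0.5))  # more mass on the top token
print(filter_top_k_top_p(softmax(toy_logits), top_k=2))
print(sample(toy_logits, temperature=0.7, top_k=3, top_p=0.95))
```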


### Load HuggingFace dataset
Load the dataset (e.g., CIKM for stock movement prediction).

Other datasets:
* Sentiment Analysis: 'TheFinAI/en-fpb'
* Question-Answering:
* Summarisation: 'TheFinAI/flare-ectsum', 'TheFinAI/flare-edtsum'
* Stock Movement Prediction: 'TheFinAI/flare-sm-cikm', 'TheFinAI/flare-sm-bigdata'

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset
dataset = load_dataset('TheFinAI/flare-sm-cikm')

In [None]:
# Display the dataset info and a sample
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'query', 'answer', 'text', 'choices', 'gold'],
        num_rows: 3396
    })
    test: Dataset({
        features: ['id', 'query', 'answer', 'text', 'choices', 'gold'],
        num_rows: 1143
    })
    valid: Dataset({
        features: ['id', 'query', 'answer', 'text', 'choices', 'gold'],
        num_rows: 431
    })
})


In [None]:
import pandas as pd

df = pd.DataFrame(dataset['test'])
df.head()


| | id | query | answer | text | choices | gold |
|---|---|---|---|---|---|---|
| 0 | cikmsm3827 | Assess the data and tweets to estimate whether... | Fall | date,open,high,low,close,adj-close,inc-5,inc-1... | [Rise, Fall] | 1 |
| 1 | cikmsm3828 | Analyze the information and social media posts... | Rise | date,open,high,low,close,adj-close,inc-5,inc-1... | [Rise, Fall] | 0 |
| 2 | cikmsm3829 | Examine the data and tweets to deduce if the c... | Fall | date,open,high,low,close,adj-close,inc-5,inc-1... | [Rise, Fall] | 1 |
| 3 | cikmsm3830 | Assess the data and tweets to estimate whether... | Rise | date,open,high,low,close,adj-close,inc-5,inc-1... | [Rise, Fall] | 0 |
| 4 | cikmsm3831 | Assess the data and tweets to estimate whether... | Fall | date,open,high,low,close,adj-close,inc-5,inc-1... | [Rise, Fall] | 1 |
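
Judging by the sample rows above, `gold` looks like the index of the correct label inside `choices` (0 for Rise, 1 for Fall). The sketch below checks that reading on the five rows shown and computes a simple accuracy score. The `predictions` list is hypothetical and stands in for model outputs; the `accuracy` helper is an illustration, not part of the datasets library.

```python
# Rows copied from the df.head() output above (id, answer, choices, gold).
rows = [
    {"id": "cikmsm3827", "answer": "Fall", "choices": ["Rise", "Fall"], "gold": 1},
    {"id": "cikmsm3828", "answer": "Rise", "choices": ["Rise", "Fall"], "gold": 0},
    {"id": "cikmsm3829", "answer": "Fall", "choices": ["Rise", "Fall"], "gold": 1},
    {"id": "cikmsm3830", "answer": "Rise", "choices": ["Rise", "Fall"], "gold": 0},
    {"id": "cikmsm3831", "answer": "Fall", "choices": ["Rise", "Fall"], "gold": 1},
]

# Check the assumption that `gold` indexes into `choices`.
for row in rows:
    assert row["choices"][row["gold"]] == row["answer"]


def accuracy(predictions, rows):
    """Share of predicted labels that match the reference answer."""
    correct = sum(pred == row["answer"] for pred, row in zip(predictions, rows))
    return correct / len(rows)


# Hypothetical model outputs, for illustration only.
predictions = ["Fall", "Rise", "Rise", "Rise", "Fall"]
print(f"Accuracy: {accuracy(predictions, rows):.2f}")
```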
