<a href="https://colab.research.google.com/github/KinzaaSheikh/lm_research_notes/blob/main/Hands_On_LLM_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## My Notebook of Hands on LLM book

# Chapter 1

This chapter introduces the recent history of Large Language Models and ends with a small coding example. It's interesting to note that a simple query-answer workflow with an LLM is intuitive enough even without using any of the recent frameworks.

In [None]:
!pip install -Uq transformers

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [None]:
from transformers import pipeline

# Create a pipeline
generation = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
    )

In [None]:
# The prompt (user input / query)
messages = [
    {
        "role": "user",
        "content": "Tell me a funny joke about chickens"
    }
]

# Generate output
output = generation(messages)
print(output[0]["generated_text"])

# Chapter 2



### Tokenization & Embeddings

The two main pillars of LLMs:

Tokenization: Splitting text into the smallest chunks (tokens) the model works with, each mapped to an integer ID.


Embeddings: Converting those tokens into numeric vectors the model can compute with.
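
To make the two definitions concrete, here is a minimal sketch in plain Python. The vocabulary, the whitespace splitting, and the random vectors are all made up for illustration; real tokenizers learn subword vocabularies and real embedding tables are learned during training.

```python
import random

# Hypothetical four-entry vocabulary mapping token strings to integer IDs
vocab = {"<unk>": 0, "hello": 1, "world": 2, "token": 3}

def tokenize(text):
    """Split on whitespace and map each piece to its ID (unknown pieces -> 0)."""
    return [vocab.get(piece, vocab["<unk>"]) for piece in text.lower().split()]

# One 4-dimensional vector per vocabulary entry, filled with random numbers
random.seed(0)
embedding_dim = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]

# Tokenization: text -> IDs. Embedding: IDs -> vectors.
token_ids = tokenize("Hello world again")
vectors = [embedding_table[token_id] for token_id in token_ids]

print(token_ids)                      # the integers a model would receive
print(len(vectors), len(vectors[0]))  # one vector per token
```

The word "again" is not in the toy vocabulary, so it falls back to the `<unk>` ID. Subword tokenizers avoid most of these fallbacks by splitting unknown words into smaller known pieces.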

In [None]:
!pip install -Uq transformers

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# load model and tokenizer just like before
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [None]:
prompt = "Write an email to my advisor explaining why I couldn't finish proposal on time. Explain how it happened. <|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))

In [None]:
# Print to see what's inside input_ids

print(input_ids)

In [None]:
# Inspect the input IDs using the tokenizer's decode method,
# which translates each ID back into human-readable text

for token_id in input_ids[0]:
  print(tokenizer.decode(token_id))

In [None]:
print(generation_output)

In [None]:
# The tokenizer's decode method also works on the output side: it translates generated token IDs into actual text
print(tokenizer.decode(8496))
print(tokenizer.decode(29915))
print(tokenizer.decode(29873))

In [None]:
print(tokenizer.decode([8496, 29915, 29873]))

In [None]:
# Generate contextualized word embeddings

from transformers import AutoModel, AutoTokenizer

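# Load a tokenizer
# Note: this checkpoint differs from the model's checkpoint below; normally the
# tokenizer should come from the same checkpoint so the IDs match the model's vocabulary
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")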

# Tokenize the sentence
tokens = tokenizer('Hello World', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]

In [None]:
output.shape

In [None]:
# Inspect why two words produce four tokens (the tokenizer adds special start and end tokens)

for token in tokens['input_ids'][0]:
  print(tokenizer.decode(token))

In [None]:
print(output)

In [None]:
# Text embeddings
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Convert text to text embeddings
vector = model.encode("Best movie ever!")

In [None]:
vector.shape
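
The contextualized model earlier returned one vector per token, while the sentence model returns a single vector for the whole text. One common way to get from the first to the second is mean pooling, sketched below with made-up numbers. Whether `all-mpnet-base-v2` does exactly this internally is an assumption here, not something the cells above show.

```python
# Hypothetical token vectors for a 4-token input, 3 dimensions each
token_vectors = [
    [0.2, 0.4, 0.6],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.6, 0.2],
]

def mean_pool(vectors):
    """Average token vectors dimension by dimension into one text vector."""
    n_tokens = len(vectors)
    return [sum(column) / n_tokens for column in zip(*vectors)]

text_vector = mean_pool(token_vectors)
print(len(token_vectors), "token vectors ->", len(text_vector), "numbers")
```

The pooled vector has the same dimensionality as each token vector, regardless of how many tokens the text contains.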

### Word embeddings beyond LLMs


NOTE: Gensim causes a lot of dependency errors, so it's best to restart the runtime and start over from here.

In [None]:
!pip -q install gensim

In [None]:
# Download embeddings (66MB, GloVe, trained on Wikipedia, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")

In [None]:
model.most_similar([model['king']], topn=11)
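
Gensim ranks neighbors by cosine similarity between vectors. The sketch below reimplements that ranking on three made-up 3-dimensional vectors; the words and numbers are hypothetical, and the local `most_similar` function is a stand-in, not gensim's implementation.

```python
import math

# Made-up vectors; the GloVe vectors loaded above have 50 dimensions
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(toy_vectors[word], vector))
        for other, vector in toy_vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("king"))
```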

### Training a song embedding model

In [None]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen("https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt")

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode('utf-8').split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [None]:
print('Playlist #1:\n', playlists[0], '\n')

In [None]:
# Train the model
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    playlists,
    vector_size=32,
    window=20,
    negative=50,
    min_count=1,
    workers=4
)
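
Word2Vec treats each playlist as a sentence and each song ID as a word, so songs that appear near each other in playlists end up with similar vectors. The sketch below shows which neighbors count as context for a given `window`. It is a simplification: gensim's actual training also varies the effective window size and draws negative samples, both omitted here, and `context_pairs` is a hypothetical helper.

```python
def context_pairs(playlist, window):
    """Return (song, neighbor) pairs for neighbors within `window` positions."""
    pairs = []
    for i, song in enumerate(playlist):
        start = max(0, i - window)
        end = min(len(playlist), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((song, playlist[j]))
    return pairs

# Toy playlist of four song IDs, stored as strings like in the dataset
toy_playlist = ["0", "1", "2", "3"]
pairs = context_pairs(toy_playlist, window=1)
print(pairs)
```

With `window=20`, as in the training call above, most songs in a short playlist become context for each other.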

In [None]:
song_id = 2172

# Ask the model for songs similar to song number 2172
model.wv.most_similar(positive=str(song_id))

In [None]:
print(songs_df.iloc[song_id])

In [None]:
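# Results are all heavy metal and hard rock, within the same genre
def print_recommendations(song_id):
  # most_similar returns (song_id_string, similarity) pairs
  similar = model.wv.most_similar(positive=str(song_id), topn=5)
  # Convert the ID strings to integers so they can be used as row positions
  similar_ids = [int(similar_id) for similar_id, _ in similar]

  return songs_df.iloc[similar_ids]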

# Extract recommendations
print_recommendations(2172)

# Chapter 3

NOTE: Rerun the first 3 cells of the notebook so that `model`, `tokenizer`, and `generation` refer to Phi-3 again


In [None]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generation(prompt)

print(output[0]['generated_text'])

In [None]:
# Display order of layers

print(model)

In [None]:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [None]:
print(model_output)

In [None]:
print(lm_head_output)

In [None]:
token_id = lm_head_output[0, -1].argmax(-1)
tokenizer.decode(token_id)
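
The `argmax` above picks the highest-scoring entry from the `lm_head` output at the last position. Here is a minimal sketch of that step on a made-up four-token vocabulary, with a softmax added to show how the raw scores (logits) relate to probabilities. The vocabulary and logit values are hypothetical.

```python
import math

# Hypothetical logits for a 4-token vocabulary at the last position
toy_vocab = ["Paris", "London", "banana", "the"]
logits = [5.1, 2.3, -1.0, 0.4]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Greedy decoding: take the index with the highest score
next_id = max(range(len(logits)), key=lambda i: logits[i])

print(toy_vocab[next_id], round(probs[next_id], 3))
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits and of the probabilities selects the same token.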

In [None]:
model_output[0].shape

In [None]:
lm_head_output[0].shape

In [None]:
# Testing speed by disabling built-in caching

prompt = "write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")


In [None]:
%%timeit -n 1
# Generate 100 tokens with caching and time it
# (the %%timeit cell magic must be the first line of the cell)
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=True
)

In [None]:
%%timeit -n 1
# Generate the same amount of tokens with caching disabled
# (the %%timeit cell magic must be the first line of the cell)
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=False
)
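
A rough way to see why caching helps is to count how many token positions pass through the model during generation. Without a cache, every step reprocesses the whole sequence so far; with a cache, only the newest token is processed after the first step. The sketch below counts positions only. It is not a timing model, and the prompt length of 25 tokens is a made-up number.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model across all generation steps."""
    total = 0
    for step in range(new_tokens):
        if use_cache and step > 0:
            total += 1                   # only the newest token is processed
        else:
            total += prompt_len + step   # the whole sequence so far is processed
    return total

# Hypothetical 25-token prompt, 100 new tokens as in the cells above
with_cache = tokens_processed(25, 100, use_cache=True)
without_cache = tokens_processed(25, 100, use_cache=False)
print(with_cache, without_cache)
```

The measured speedup on a GPU will differ from this ratio, since attention over the cached keys and values still grows with sequence length.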

# Chapter 4


In [1]:
!pip install -q datasets transformers numpy tqdm scikit-learn

In [2]:
from datasets import load_dataset

data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [3]:
data["train"][0, -1]

# Each review has a binary label: 0 (negative) or 1 (positive)

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

In [4]:
data["test"][0, -1]

{'text': ['lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .',
  "enigma is well-made , but it's just too dry and too placid ."],
 'label': [1, 0]}

In [6]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load the model into a pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)


In [7]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

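# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")),
                  total=len(data["test"])):
  # Scores come back in label order: 0 = negative, 1 = neutral, 2 = positive
  negative_score = output[0]["score"]
  positive_score = output[2]["score"]
  # The dataset is binary, so ignore neutral and keep the larger of the two
  assignment = np.argmax([negative_score, positive_score])
  y_pred.append(assignment)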

100%|██████████| 1066/1066 [00:10<00:00, 101.10it/s]
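
The loop above maps a three-class sentiment model onto a two-class dataset by dropping the neutral score. The sketch below shows that mapping on one made-up output. The scores are hypothetical and `to_binary` is an illustrative helper, not part of the pipeline API.

```python
# Hypothetical pipeline output for one review, in label order
example_output = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.35},
]

def to_binary(output):
    """Return 0 (negative) or 1 (positive), ignoring the neutral score."""
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    return int(positive_score > negative_score)

print(to_binary(example_output))
```

Here neutral is the model's top class, yet the review is still assigned to positive because only the negative and positive scores are compared. Reviews like this one are where the mapping is most likely to introduce errors.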


In [8]:
# After generating predictions, it's time to evaluate them

from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
  """Create and print the classification report"""
  performance = classification_report(
      y_true,
      y_pred,
      target_names=["Negative Review", "Positive Review"]
  )
  print(performance)

In [9]:
# Create classification report
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
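
For reference, the per-class columns in the report can be computed by hand from true positives, false positives, and false negatives. The sketch below does this on a made-up set of six labels, not on the actual predictions; `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

# Toy labels: 1 = positive review, 0 = negative review
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred, positive_label=1)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Precision answers "of the reviews predicted positive, how many were positive?", while recall answers "of the positive reviews, how many were found?". F1 is the harmonic mean of the two.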

