<a href="https://colab.research.google.com/github/KinzaaSheikh/lm_research_notes/blob/main/Hands_On_LLM_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## My Notebook for the Hands-On LLM Book

# Chapter 1

This chapter introduced the recent history of Large Language Models and ended with a small coding example. It's interesting for me to note that a simple query-answer workflow with an LLM is intuitive enough even without using any of the recent frameworks.

In [None]:
!pip install -Uq transformers

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [None]:
from transformers import pipeline

# Create a pipeline
generation = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
    )

In [None]:
# The prompt (user input / query)
messages = [
    {
        "role": "user",
        "content": "Tell me a funny joke about chickens"
    }
]

# Generate output
output = generation(messages)
print(output[0]["generated_text"])

# Chapter 2



### Tokenization & Embeddings

The two main pillars of LLMs:

Tokenization: Breaking text down into tokens, the smallest chunks (words or pieces of words) the model works with.


Embeddings: Numeric vectors that represent those tokens in a form the model can compute on.
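
To make the two ideas concrete for myself, here is a minimal toy sketch in plain Python. It assumes naive whitespace splitting and a made-up embedding table (`vocab`, `embedding_table`, and the numbers in it are all hypothetical); real tokenizers use learned subword vocabularies and real models learn the vectors during training.

```python
# Toy illustration only: real tokenizers use learned subword vocabularies.
sentence = "the cat sat on the mat"

# 1. Tokenization: split the text into tokens (naive whitespace split)
tokens = sentence.split()

# 2. Map every unique token to an integer ID
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# 3. Embedding: look up one vector per token ID in a table
#    (values are made up; a trained model would learn them)
embedding_dim = 3
embedding_table = [
    [round(0.1 * (row + col), 1) for col in range(embedding_dim)]
    for row in range(len(vocab))
]
embeddings = [embedding_table[token_id] for token_id in token_ids]

print(tokens)
print(token_ids)
print(embeddings)
```

Both occurrences of "the" get the same ID and therefore the same vector, which is the limitation that contextualized embeddings (later in this chapter) address.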

In [None]:
!pip install -Uq transformers

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# load model and tokenizer just like before
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [None]:
prompt = "Write an email to my advisor explaining why I couldn't finish the proposal on time. Explain how it happened. <|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))

In [None]:
# Print to see what's inside input_ids

print(input_ids)

In [None]:
# Inspect the input IDs using the tokenizer's decode method,
# which translates each ID back into human-readable text

for token_id in input_ids[0]:
  print(tokenizer.decode(token_id))

In [None]:
print(generation_output)

In [None]:
# We can also use the tokenizer's decode method on the output side to translate token IDs into actual text
print(tokenizer.decode(8496))
print(tokenizer.decode(29915))
print(tokenizer.decode(29873))

In [None]:
print(tokenizer.decode([8496, 29915, 29873]))

In [None]:
# Generate contextualized word embeddings

from transformers import AutoModel, AutoTokenizer

# Load the tokenizer from the same checkpoint as the model,
# so the token IDs line up with the model's embedding matrix
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Tokenize the sentence
tokens = tokenizer('Hello World', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]

In [None]:
output.shape

In [None]:
# Inspect why two words produce four tokens

for token in tokens['input_ids'][0]:
  print(tokenizer.decode(token))

In [None]:
print(output)

In [None]:
# Text embeddings
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Convert text to text embeddings
vector = model.encode("Best movie ever!")

In [None]:
vector.shape
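
A text embedding is a single vector for the whole input, whereas the earlier cell produced one vector per token. One common way to get from the latter to the former is mean pooling, that is, averaging the token vectors. The sketch below uses made-up numbers and a hypothetical `mean_pool` helper; how a particular sentence-transformers model pools is defined in that model's own configuration, so this is only an illustration of the idea.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Made-up 4-dimensional token embeddings for a 3-token sentence
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
    [2.0, 4.0, 1.0, 4.0],
]

sentence_vector = mean_pool(token_vectors)
print(sentence_vector)       # one vector for the whole sentence
print(len(sentence_vector))  # same dimensionality as each token vector
```

The output size depends only on the embedding dimension, not on the number of tokens, which is why `vector.shape` above is the same for any input text.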

### Word embeddings beyond LLMs


NOTE: Gensim causes a lot of dependency errors, so it's best to restart the runtime and start over from here.

In [None]:
!pip -q install gensim

In [None]:
# Download embeddings (66MB, GloVe, trained on Wikipedia, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")

In [None]:
model.most_similar([model['king']], topn=11)
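
`most_similar` ranks words by cosine similarity to the query vector. Because the call above passes the raw vector for "king" rather than the word itself, I would expect "king" to come back as the top hit, which is presumably why `topn=11` is used. Below is a small sketch of the ranking idea with made-up 3-dimensional vectors; `toy_vectors` and `toy_most_similar` are hypothetical names, and unlike the call above this toy version skips the query word.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors; real GloVe vectors here would have 50 dimensions
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "prince": [0.85, 0.6, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def toy_most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = toy_vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = toy_most_similar("king")
print(neighbors)
```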

### Training a song embedding model

In [None]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen("https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt")

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode('utf-8').split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [None]:
print('Playlist #1:\n', playlists[0], '\n')

In [None]:
# Train the model
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    playlists,
    vector_size=32,
    window=20,
    negative=50,
    min_count=1,
    workers=4
)
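
The idea here is that each playlist plays the role of a sentence and each song ID plays the role of a word, so songs that appear near each other in playlists end up with similar vectors. As a rough sketch of where the training signal comes from, the hypothetical helper below lists the (center, context) pairs that fall inside a window. It is a simplification: the real Word2Vec training adds details such as negative sampling, which this omits.

```python
def context_pairs(sequence, window):
    """Return (center, context) pairs within `window` positions of each item."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        end = min(len(sequence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

# A made-up playlist of four song IDs (strings, like in the dataset)
toy_playlist = ["0", "1", "2", "3"]

pairs = context_pairs(toy_playlist, window=1)
print(pairs)
```

With `window=20`, as in the cell above, almost every song in a short playlist counts as context for every other song in it.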

In [None]:
song_id = 2172

# Ask the model for songs similar to song number 2172
model.wv.most_similar(positive=str(song_id))

In [None]:
# Look up the metadata for the same song we queried above
print(songs_df.iloc[song_id])

In [None]:
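# Results are all heavy metal and hard rock, within the same genre
import numpy as np

def print_recommendations(song_id):
  # most_similar returns (song ID, similarity) pairs; keep only the IDs
  similar_songs = np.array(
      model.wv.most_similar(positive=str(song_id), topn=5)
  )[:,0]

  # The IDs come back as strings, so cast to int for positional lookup
  return songs_df.iloc[similar_songs.astype(int)]

# Extract recommendations
print_recommendations(2172)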

# Chapter 3

NOTE: Rerun the first 3 cells of the notebook (install, model loading, and pipeline creation) before continuing.
