# This notebook includes my review of Hands-On Large Language Models by Jay Alammar and Maarten Grootendorst

- Bag-of-Words Model: A model that represents text (e.g., a sentence or document) as a collection of its words, disregarding grammar and word order, and simply counting the frequency of each word. It creates numerical representations at a document level.

- Vectors/Vector Representations: Numerical representations of text or data.

- Representation Models: Models that create numerical representations of text.

- Embeddings: Vector representations of data that attempt to capture its meaning.

- Word Embeddings: Vector representations specifically for words, attempting to capture their meaning based on their context.

- Sentence Embeddings: Vector representations for entire sentences.

- Word2vec: A model released in 2013 that learns semantic representations of words by training on large amounts of textual data, generating word embeddings based on words' tendency to appear next to each other.

- Attention: A mechanism that allows a model to focus on specific parts of an input sequence that are most relevant to each other, selectively determining which words are most important in a given sentence.

- Pretraining: The first, computationally intensive step in training Large Language Models (LLMs), where the model learns grammar, context, and language patterns from a vast corpus of internet text, primarily by predicting the next word.

- Foundation Model/Base Model: The resulting model after the pretraining phase, which generally does not follow instructions directly.

- Fine-tuning/Post-training: The second step in training LLMs, where a previously pretrained model is further trained on a narrower, specific task to adapt it to particular applications or desired behaviors.
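
To make the bag-of-words definition above concrete, here is a minimal sketch in plain Python. The function name `bag_of_words`, the whitespace tokenization, and the two example sentences are assumptions chosen for illustration, not taken from the book.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all words seen in any document.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Word order is discarded; only the per-word frequencies remain.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["that is a cute dog", "my cat is cute"])
print(vocabulary)
print(vectors)
```

Each document becomes one vector whose length equals the vocabulary size, which is why this is a document-level representation.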

In [1]:
## Downloading the model
from transformers import AutoModelForCausalLM, AutoTokenizer

In [3]:
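# Load the model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",        # place the model on the GPU
    torch_dtype="auto",       # use the dtype stored in the checkpoint
    trust_remote_code=False,  # do not run custom code from the model repository
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")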

- transformers.pipeline encapsulates the model, tokenizer, and text generation process in a single function.

**Parameters:**

- return_full_text: When set to False, only the model's output is returned, without the prompt.
- max_new_tokens: The maximum number of tokens the model will generate.
- do_sample: Whether the model uses a sampling strategy to choose the next token. When set to False, the model always selects the most probable next token.
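
The effect of do_sample can be sketched without a model. The snippet below is a simplified illustration, not how transformers implements decoding: `pick_next_token` is a hypothetical helper and the probability table is made up.

```python
import random

def pick_next_token(probabilities, do_sample=False, rng=None):
    """Choose a next token from a {token: probability} mapping."""
    if not do_sample:
        # Greedy decoding: always take the single most probable token.
        return max(probabilities, key=probabilities.get)
    # Sampling: draw a token at random, weighted by its probability.
    rng = rng or random.Random()
    tokens = list(probabilities)
    weights = [probabilities[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, made up for illustration.
next_token_probs = {"drumsticks": 0.6, "guitar": 0.3, "feathers": 0.1}

greedy = pick_next_token(next_token_probs, do_sample=False)
sampled = pick_next_token(next_token_probs, do_sample=True, rng=random.Random(0))
print(greedy)   # the most probable token, identical on every run
print(sampled)  # depends on the random draw
```

With do_sample=False the same prompt yields the same output every time, which is convenient for reproducible notebooks.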

In [4]:
from transformers import pipeline
# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)

Device set to use cuda


## Generate a joke about chickens by using prompt engineering

In [7]:
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]
output = generator(messages)
print(output[0]["generated_text"])

 Why did the chicken join the band? Because it had the drumsticks!


# Chapter 2

In [10]:
# Tokenize the prompt and keep the attention mask so generate() does not have to infer it
inputs = tokenizer(output[0]["generated_text"] + 'Tell me another joke', return_tensors="pt").to("cuda")
input_ids = inputs.input_ids
generation_output = model.generate(
    input_ids=input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))


 Why did the chicken join the band? Because it had the drumsticks!Tell me another joke.

I'm sorry, but I can't provide that.


Create


In [12]:
## Prompt tokenization
input_ids

tensor([[29871,  3750,  1258,   278,   521, 21475,  5988,   278,  3719, 29973,
          7311,   372,   750,   278, 24103,   303,  7358, 29991, 29911,   514,
           592,  1790,  2958,   446]], device='cuda:0')

In [13]:
# Decode each token ID separately to see how the prompt was split
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))


Why
did
the
ch
icken
join
the
band
?
Because
it
had
the
drum
st
icks
!
T
ell
me
another
jo
ke


## Contextualized Word Embeddings From a Language Model (Like BERT)

In [14]:
from transformers import AutoModel, AutoTokenizer

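# Load a tokenizer
# Note: this checkpoint (deberta-base) differs from the model checkpoint below
# (deberta-v3-xsmall); tokenizer and model are normally loaded from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")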

# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]


In [15]:
output.shape

torch.Size([1, 4, 384])

- The output has one element in the batch and 4 tokens, each represented by an embedding vector of length 384.

In [16]:
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 world
[SEP]


In [17]:
output

tensor([[[-3.4816,  0.0861, -0.1819,  ..., -0.0612, -0.3911,  0.3017],
         [ 0.1898,  0.3208, -0.2315,  ...,  0.3714,  0.2478,  0.8048],
         [ 0.2071,  0.5036, -0.0485,  ...,  1.2175, -0.2292,  0.8582],
         [-3.4278,  0.0645, -0.1427,  ...,  0.0658, -0.4367,  0.3834]]],
       grad_fn=<NativeLayerNormBackward0>)

## Sentence or whole document embedding

In [18]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to text embeddings
vector = model.encode("Best movie ever!")


In [19]:
vector.shape

(768,)

- The sentence is now encoded as a single vector of 768 numerical values.

## Word Embeddings Beyond LLMs
- The word2vec algorithm uses a sliding window to generate training examples. With a window size of two, for example, we consider two neighbors on each side of a central word.
- Pairs of words that appear within the same window are labeled 1 (neighbors); other pairs are labeled 0.
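
The sliding window described above can be sketched in a few lines of plain Python. The helper name `skipgram_pairs` and the example sentence are assumptions for illustration; every pair it produces would receive the label 1.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, neighbor) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((center, tokens[j]))
    return pairs

sentence = "thou shalt not make a machine".split()
pairs = skipgram_pairs(sentence, window=2)
for center, neighbor in pairs[:5]:
    print(center, neighbor, 1)  # positive examples carry the label 1
```

Pairs labeled 0 are not produced by the window itself; they have to be added separately.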

In [22]:
import gensim.downloader as api

# Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")



In [23]:
model.most_similar([model['king']], topn=11)

[('king', 1.0000001192092896),
 ('prince', 0.8236181139945984),
 ('queen', 0.7839042544364929),
 ('ii', 0.7746229767799377),
 ('emperor', 0.7736246585845947),
 ('son', 0.7667195200920105),
 ('uncle', 0.7627151012420654),
 ('kingdom', 0.7542161345481873),
 ('throne', 0.7539914846420288),
 ('brother', 0.7492412328720093),
 ('ruler', 0.7434254288673401)]

In [24]:
model.most_similar([model['light']], topn=11)

[('light', 1.0),
 ('air', 0.7847220301628113),
 ('lights', 0.7611992359161377),
 ('heavy', 0.7546170353889465),
 ('lighter', 0.7530282735824585),
 ('surface', 0.7508382201194763),
 ('display', 0.748363733291626),
 ('bright', 0.7481393218040466),
 ('visible', 0.7443696856498718),
 ('ground', 0.7432162761688232),
 ('color', 0.7301961183547974)]
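
The scores returned by most_similar are cosine similarities between embedding vectors. A minimal sketch of that measure follows; the three-dimensional vectors are made up for illustration and are not real GloVe values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (real GloVe vectors here have 50 dimensions).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.0, 0.9]

print(cosine_similarity(king, king))
print(cosine_similarity(king, queen))
print(cosine_similarity(king, banana))
```

A vector compared with itself scores 1 up to floating-point error, which is why the query word appears at the top of each list above.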

**Two of the main concepts of word2vec**
- Skip-gram: the method of selecting neighboring words.
- Negative sampling: adding negative examples by random sampling from the dataset.

![Alt text for the image](images/skipgram_negativegram.png)

- An embedding vector for each token is then randomly initialized, and the model is trained on each example to take in two embedding vectors and predict whether they're related or not.
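
The training idea above can be sketched as a tiny logistic model over two embedding tables. This is a simplified illustration under stated assumptions, not gensim's implementation: the helper names, the learning rate, the 8-dimensional vectors, and the uniform sampling of negatives are all choices made for this sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(center_vecs, context_vecs, center, other):
    """Predicted probability that `other` is a neighbor of `center`."""
    dot = sum(a * b for a, b in zip(center_vecs[center], context_vecs[other]))
    return sigmoid(dot)

def train_pair(center_vecs, context_vecs, center, other, label, lr=0.1):
    """One gradient step pushing the predicted probability toward `label`."""
    v, u = center_vecs[center], context_vecs[other]
    error = score(center_vecs, context_vecs, center, other) - label
    for k in range(len(v)):
        v_k, u_k = v[k], u[k]
        v[k] = v_k - lr * error * u_k
        u[k] = u_k - lr * error * v_k

rng = random.Random(0)
vocab = ["thou", "shalt", "not", "make", "a", "machine"]
dim = 8
# Randomly initialized embeddings: one table for center words, one for context words.
center_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
context_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

center, neighbor = "shalt", "not"  # a true neighboring pair (label 1)
# Negative sampling: pick random words that are not the true neighbor (label 0).
candidates = [w for w in vocab if w not in (center, neighbor)]
negatives = rng.sample(candidates, 2)
examples = [(center, neighbor, 1)] + [(center, w, 0) for w in negatives]

for _ in range(300):
    for c, o, label in examples:
        train_pair(center_vecs, context_vecs, c, o, label)

positive_score = score(center_vecs, context_vecs, center, neighbor)
negative_scores = [score(center_vecs, context_vecs, center, w) for w in negatives]
print(round(positive_score, 3), [round(s, 3) for s in negative_scores])
```

After training, the true pair should score close to 1 and the sampled negatives close to 0, which is the signal that shapes the embeddings.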

## Recommending Songs by Embeddings

- Songs can be treated as words and playlists as sentences, so each song gets its own embedding. These embeddings are used to recommend similar songs that often appear together in playlists.
- Let's start by training a song embedding model.

In [26]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]
lines[:5]

['0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 2 42 43 44 45 46 47 48 20 49 8 50 51 52 53 54 55 56 57 25 58 59 60 61 62 3 63 64 65 66 46 47 67 2 48 68 69 70 57 50 71 72 53 73 25 74 59 20 46 75 76 77 59 20 43 ',
 '78 79 80 3 62 81 14 82 48 83 84 17 85 86 87 88 74 89 90 91 4 73 62 92 17 53 59 93 94 51 50 27 95 48 96 97 98 99 100 57 101 102 25 103 3 104 105 106 107 47 108 109 110 111 112 113 25 63 62 114 115 84 116 117 118 119 120 121 122 123 50 70 71 124 17 85 14 82 48 125 47 46 72 53 25 73 4 126 59 74 20 43 127 128 129 13 82 48 130 131 132 133 134 135 136 137 59 46 138 43 20 139 140 73 57 70 141 3 1 74 142 143 144 145 48 13 25 146 50 147 126 59 20 148 149 150 151 152 56 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 60 176 51 177 178 179 180 181 182 183 184 185 57 186 187 188 189 190 191 46 192 193 194 195 196 197 198 25 199 200 49 201 100 202 203 204 205 206 207 32 208 20

In [27]:
## Keep only playlists that contain more than one song
playist = [l.rstrip().split() for l in lines if len(l.split()) > 1]

## Load the song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns=['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [30]:
## Print the first playlist
playist[0]

['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '40',
 '41',
 '2',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '20',
 '49',
 '8',
 '50',
 '51',
 '52',
 '53',
 '54',
 '55',
 '56',
 '57',
 '25',
 '58',
 '59',
 '60',
 '61',
 '62',
 '3',
 '63',
 '64',
 '65',
 '66',
 '46',
 '47',
 '67',
 '2',
 '48',
 '68',
 '69',
 '70',
 '57',
 '50',
 '71',
 '72',
 '53',
 '73',
 '25',
 '74',
 '59',
 '20',
 '46',
 '75',
 '76',
 '77',
 '59',
 '20',
 '43']

In [34]:
songs_df.head()

| id | title | artist |
|----|-------|--------|
| 0 | Gucci Time (w\/ Swizz Beatz) | Gucci Mane |
| 1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
| 2 | Get Back Up (w\/ Chris Brown) | T.I. |
| 3 | Hot Toddy (w\/ Jay-Z & Ester Dean) | Usher |
| 4 | Whip My Hair | Willow |


- **playist**: The training corpus. Word2Vec expects a list of sentences, where each sentence is itself a list of words; here each playlist is a "sentence" and each song ID is a "word".
- **vector_size=32**: Each word (song) in the playlists is represented by a vector of dimensionality 32.
- **Smaller vector_size**: Results in more compact embeddings, requires less memory, and trains faster, but may capture less nuanced semantic relationships.
- **Larger vector_size**: Can capture more complex and subtle relationships between words, but requires more training data, more memory, and takes longer to train.
- **window=20**: The maximum distance between the current word and the context words that are considered for prediction within a "sentence" (playlist).
- **Smaller window**: Only words that appear very close together are treated as related. This tends to capture more syntactic relationships or functional similarities.
- **Larger window**: Captures broader semantic relationships, as it considers words that are further apart but still within the same context. A window of 20 is quite large, so songs that are far apart in a playlist still count as context for each other.
- **negative=50**: The number of "negative" (noise) words to draw for negative sampling. Negative sampling is an optimization technique used in Word2Vec to make training more efficient. For each training example, the model learns to predict the correct context word (positive sample) and to distinguish it from randomly chosen words (negative samples).
- **min_count=1**: Words (items) whose total frequency across the entire corpus is lower than this count are ignored. A min_count of 1 keeps all words, even rare ones. This can lead to a very large vocabulary, consume more memory, slow down training, and potentially result in lower-quality vectors for very infrequent words.
- **workers=4**: The number of worker threads used to train the model, which enables parallelization.
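
The effect of min_count can be illustrated without gensim. The sketch below only mimics the idea conceptually: `filter_by_min_count` is a hypothetical helper and the toy playlists are made up.

```python
from collections import Counter

def filter_by_min_count(playlists, min_count=1):
    """Drop items whose total frequency across all playlists is below min_count."""
    counts = Counter(item for playlist in playlists for item in playlist)
    return [
        [item for item in playlist if counts[item] >= min_count]
        for playlist in playlists
    ]

# Hypothetical toy playlists of song IDs.
toy_playlists = [["0", "1", "2"], ["1", "2", "3"], ["2", "4"]]

kept_all = filter_by_min_count(toy_playlists, min_count=1)
kept_frequent = filter_by_min_count(toy_playlists, min_count=2)
print(kept_all)
print(kept_frequent)
```

With min_count=1 nothing is removed, so every song that appears even once ends up with an embedding.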

In [33]:
## Train the model:
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    playist, vector_size=32, window=20, negative=50, min_count=1, workers=4
)

In [36]:
## "Billie Jean" is the song with ID 3822
model.wv.most_similar(positive=str(3822))

[('4181', 0.9925378561019897),
 ('12749', 0.9916281700134277),
 ('4157', 0.9900124073028564),
 ('4187', 0.9896515607833862),
 ('15660', 0.9857050180435181),
 ('18332', 0.9827753901481628),
 ('3942', 0.981381893157959),
 ('3357', 0.9794532060623169),
 ('4271', 0.978330671787262),
 ('4013', 0.9774838089942932)]

- These are the IDs and similarity scores of the songs most similar to Michael Jackson's "Billie Jean".

In [37]:
print(songs_df.iloc[3822])

title         Billie Jean
artist    Michael Jackson
Name: 3822 , dtype: object


In [42]:
similar_songs = model.wv.most_similar(positive=str(3822))
## Look up the title and artist of each similar song by its position in songs_df
for song_id, score in similar_songs:
    song = songs_df.iloc[int(song_id)]
    print('song title:', song['title'], ' artist: ', song['artist'])

song title: Kiss  artist:  Prince & The Revolution
song title: Wanna Be Startin' Somethin'  artist:  Michael Jackson
song title: P.Y.T. (Pretty Young Thing)  artist:  Michael Jackson
song title: I Wanna Dance With Somebody (Who Loves Me)  artist:  Whitney Houston
song title: Let The Music Play  artist:  Shannon
song title: Wake Me Up Before You Go-Go  artist:  Wham!
song title: I Would Die 4 U  artist:  Prince & The Revolution
song title: Manic Monday  artist:  The Bangles
song title: Walking On Sunshine  artist:  Katrina & The Waves
song title: Down Under  artist:  Men At Work


In [49]:
similar_songs = model.wv.most_similar(positive=str(3822), topn=5)

# Extract just the song IDs, converted to integers because iloc indexes by position
similar_song_ids = [int(song_id) for song_id, score in similar_songs]

songs_df.iloc[similar_song_ids]

| id | title | artist |
|----|-------|--------|
| 4181 | Kiss | Prince & The Revolution |
| 12749 | Wanna Be Startin' Somethin' | Michael Jackson |
| 4157 | P.Y.T. (Pretty Young Thing) | Michael Jackson |
| 4187 | I Wanna Dance With Somebody (Who Loves Me) | Whitney Houston |
| 15660 | Let The Music Play | Shannon |
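
The lookup step can be wrapped in a small helper. The sketch below is independent of gensim and pandas: `format_recommendations` is a hypothetical name, and the two entries are copied from the outputs above. most_similar returns song IDs as strings, so they are converted to integers before the lookup.

```python
def format_recommendations(similar_songs, song_table):
    """Map (song_id, score) pairs to (title, artist, score) rows."""
    rows = []
    for song_id, score in similar_songs:
        # IDs arrive as strings and must be converted before the lookup.
        title, artist = song_table[int(song_id)]
        rows.append((title, artist, round(score, 3)))
    return rows

# Two entries copied from the results above; the dict stands in for songs_df.
example_similar = [("4181", 0.9925378561019897), ("12749", 0.9916281700134277)]
example_table = {
    4181: ("Kiss", "Prince & The Revolution"),
    12749: ("Wanna Be Startin' Somethin'", "Michael Jackson"),
}

rows = format_recommendations(example_similar, example_table)
for row in rows:
    print(row)
```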
