<a href="https://colab.research.google.com/github/Joan947/mini_LLM/blob/main/Chapter_2_Tokens_and_Token_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Chapter 2 - Tokens and Token Embeddings</h1>
<i>Exploring tokens and embeddings as an integral part of building LLMs</i>


<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb)

---

This notebook is for Chapter 2 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

In [31]:
%%capture
!pip install --upgrade transformers==4.41.2 sentence-transformers==3.0.1 scikit-learn==1.5.0 accelerate==0.31.0 peft==0.11.1 scipy==1.10.1 numpy==1.26.4

# Downloading and Running An LLM

The first step is to load our model onto the GPU for faster inference. We load the model and the tokenizer as separate objects so that we can explore each of them on its own.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")


In [5]:
prompt = "Design a study plan for an aspiring Medical Doctor.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=30
)

# Print the output
print(tokenizer.decode(generation_output[0]))

Design a study plan for an aspiring Medical Doctor.<|assistant|> To design a study plan for an aspiring Medical Doctor (MD), one must consider the extensive and rigorous nature of medical education. Here is a


In [6]:
print(input_ids)

tensor([[12037,   263,  6559,  3814,   363,   385,  7051,  8491, 20795, 15460,
         29889, 32001]], device='cuda:0')


In [7]:
# Decode each token ID on its own to see how the prompt was split
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

Design
a
study
plan
for
an
asp
iring
Medical
Doctor
.
<|assistant|>


In [8]:
generation_output

tensor([[12037,   263,  6559,  3814,   363,   385,  7051,  8491, 20795, 15460,
         29889, 32001,  1763,  2874,   263,  6559,  3814,   363,   385,  7051,
          8491, 20795, 15460,   313,  5773,   511,   697,  1818,  2050,   278,
         20607,   322, 12912, 20657,  5469,   310, 16083,  9793, 29889,  2266,
           338,   263]], device='cuda:0')

In [9]:
print(tokenizer.decode(3323))
print(tokenizer.decode(622))
print(tokenizer.decode([3323, 622]))
print(tokenizer.decode(29901))

Sub
ject
Subject
:
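
The decode calls above show that two token IDs can join into the single word "Subject". The sketch below illustrates the general idea of splitting text into subword tokens with a toy greedy longest-match tokenizer. The names `toy_vocab`, `toy_tokenize`, and `toy_decode` are hypothetical, the IDs are arbitrary, and this is not how the Phi-3 tokenizer is actually implemented; real tokenizers learn their vocabulary from data.

```python
# Toy vocabulary: token string -> token ID (IDs are arbitrary, for illustration only)
toy_vocab = {
    "Sub": 0, "ject": 1, ":": 2,
    "S": 3, "u": 4, "b": 5, "j": 6, "e": 7, "c": 8, "t": 9,
}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def toy_tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in toy_vocab:
                ids.append(toy_vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"No token covers {text[pos]!r}")
    return ids


def toy_decode(ids):
    """Map IDs back to token strings and join them."""
    return "".join(id_to_token[token_id] for token_id in ids)


ids = toy_tokenize("Subject:")
print(ids)
print([id_to_token[token_id] for token_id in ids])
print(toy_decode(ids))
```

As with the real tokenizer, decoding the IDs one at a time shows the pieces, while decoding them together restores the original text.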


# Comparing Trained LLM Tokenizers


In [10]:
from transformers import AutoTokenizer

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    """Print each token of `sentence` on a colored background."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        # Cycle through colors_list so neighboring tokens get different colors
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
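
The `show_tokens` helper relies on ANSI escape sequences to color each token. As a minimal sketch of what those strings mean, the hypothetical helper `colorize` below builds the same kind of sequence for one piece of text: `0` resets attributes, `30` selects black text, and `48;2;R;G;B` sets a 24-bit background color. Terminals or viewers without ANSI support will show the raw codes instead of colors.

```python
def colorize(token_text, rgb):
    """Wrap text in a 24-bit background-color ANSI escape sequence.

    `rgb` is a string such as '102;194;165', as in colors_list.
    """
    start = f"\x1b[0;30;48;2;{rgb}m"  # reset; black text; RGB background
    reset = "\x1b[0m"                 # restore default colors
    return start + token_text + reset


colored = colorize("Hello", "102;194;165")
print(repr(colored))  # the raw escape sequence
print(colored)        # rendered with a colored background where supported
```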

In [11]:
text = """
When generating a password, use Symbols like :";'>?/.<,-_+=()*&^%$#@!
and add a mix of uppercase like: ASDF  and lowercase letters: asdf.
"""

In [12]:
show_tokens(text, "bert-base-uncased")

[CLS] when generating a password , use symbols like : " ; ' > ? / . < , - _ + = ( ) * & ^ % $ # @ ...

In [13]:
show_tokens(text, "bert-base-cased")

[CLS] When generating a password , use S ##ym ##bol ##s like : " ; ' > ? / . < , - _ + = ( ) * & ^ % ...

In [14]:
show_tokens(text, "gpt2")

 When  generating  a  password ,  use  Symb ols  like  : "; '> ? /. < ,- _ += () * & ^ % $ # @ !
 and  add  a ...

In [15]:
show_tokens(text, "google/flan-t5-small")

When  generating  a password , use  Symbol s like  : " ; ' > ? / . <unk> , - _ + = () * & <unk> % ...

In [16]:
# The official tokenizer is `tiktoken`, but this is the same tokenizer on the HF platform
show_tokens(text, "Xenova/gpt-4")

 When  generating  a  password ,  use  Symbols  like  : "; '> ? /. < ,- _ += ()* & ^ % $ #@ !
 and  add  a  mix  of  uppercase ...

In [17]:
# You need to request access before being able to use this tokenizer
show_tokens(text, "bigcode/starcoder2-15b")

 When  generating  a  password ,  use  S ymbols  like  : "; '> ? /. < ,- _ += ()* & ^ % $ # @ !
 and  add  a  mix ...

In [18]:
show_tokens(text, "facebook/galactica-1.3b")

 When  generating  a  password ,  use  Symbols  like   : " ; ' > ? / . < , - _ + = ( ) * & ^ % $ # ...

In [19]:
show_tokens(text, "microsoft/Phi-3-mini-4k-instruct")

[0;30;48;2;102;194;165m[0m [0;30;48;2;252;141;98m
[0m [0;30;48;2;141;160;203mWhen[0m [0;30;48;2;231;138;195mgenerating[0m [0;30;48;2;166;216;84ma[0m [0;30;48;2;255;217;47mpassword[0m [0;30;48;2;102;194;165m,[0m [0;30;48;2;252;141;98muse[0m [0;30;48;2;141;160;203mSymbol[0m [0;30;48;2;231;138;195ms[0m [0;30;48;2;166;216;84mlike[0m [0;30;48;2;255;217;47m:[0m [0;30;48;2;102;194;165m";[0m [0;30;48;2;252;141;98m'>[0m [0;30;48;2;141;160;203m?[0m [0;30;48;2;231;138;195m/.[0m [0;30;48;2;166;216;84m<[0m [0;30;48;2;255;217;47m,-[0m [0;30;48;2;102;194;165m_[0m [0;30;48;2;252;141;98m+=[0m [0;30;48;2;141;160;203m()[0m [0;30;48;2;231;138;195m*[0m [0;30;48;2;166;216;84m&[0m [0;30;48;2;255;217;47m^[0m [0;30;48;2;102;194;165m%[0m [0;30;48;2;252;141;98m$[0m [0;30;48;2;141;160;203m#[0m [0;30;48;2;231;138;195m@[0m [0;30;48;2;166;216;84m![0m [0;30;48;2;255;217;47m
[0m [0;30;48;2;102;194;165mand[0m [0;30;48;2;252;141;98madd[0m [0;30;48;2;141;16

# Contextualized Word Embeddings From a Language Model (Like BERT)

In [20]:
from transformers import AutoModel, AutoTokenizer

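# Load a tokenizer
# NOTE: this checkpoint differs from the model loaded below. The two use
# different vocabularies, so for meaningful embeddings the tokenizer should
# normally come from the same checkpoint as the model.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")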

# Tokenize the sentence
tokens = tokenizer('Hello Students in LLM class', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]

In [21]:
output.shape

torch.Size([1, 8, 384])

In [22]:
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 Students
 in
 LL
M
 class
[SEP]


In [23]:
output

tensor([[[-3.4635,  0.0890, -0.1953,  ..., -0.2522, -0.4098,  0.2081],
         [-0.5348,  0.5601,  0.2044,  ..., -0.3112, -0.1352, -0.0641],
         [-0.4676,  0.5406,  0.4165,  ...,  0.0995, -0.2303, -0.1483],
         ...,
         [-0.1817,  0.3306,  0.3508,  ..., -0.4553, -0.5476,  0.6356],
         [-0.3899,  0.0773,  0.3230,  ..., -1.6049, -0.6239, -0.1847],
         [-3.1603,  0.1930, -0.0469,  ..., -0.2468, -0.6185,  0.2701]]],
       grad_fn=<NativeLayerNormBackward0>)
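
The output above holds one 384-dimensional vector per token (shape `[1, 8, 384]`). One common way to collapse per-token vectors into a single vector for the whole input is mean pooling. The sketch below shows the idea with plain Python lists; `mean_pool`, `toy_vectors`, and `toy_mask` are hypothetical names with made-up values, and this is not necessarily the pooling any particular model uses.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of tokens whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


# Hypothetical 3-token input with 4-dimensional embeddings; the last token is padding
toy_vectors = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 0]

print(mean_pool(toy_vectors, toy_mask))
```

With the tensor from the cell above and no padding, `output.mean(dim=1)` would perform the same averaging and return a tensor of shape `[1, 384]`.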

# Text Embeddings (For Sentences and Whole Documents)

In [29]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to text embeddings
vector = model.encode("That is amazing! I wonder how you do it")

In [30]:
vector.shape

(768,)

# Word Embeddings Beyond LLMs


In [26]:
!pip install gensim



In [27]:
import gensim.downloader as api

# Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")



In [28]:
model.most_similar([model['king']], topn=11)

[('king', 1.0000001192092896),
 ('prince', 0.8236179351806641),
 ('queen', 0.7839043140411377),
 ('ii', 0.7746230363845825),
 ('emperor', 0.7736247777938843),
 ('son', 0.766719400882721),
 ('uncle', 0.7627150416374207),
 ('kingdom', 0.7542161345481873),
 ('throne', 0.7539914846420288),
 ('brother', 0.7492411136627197),
 ('ruler', 0.7434253692626953)]
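
The scores returned by `most_similar` are cosine similarities, which is why `'king'` compared with its own vector scores approximately 1.0. The sketch below computes the same measure by hand on tiny made-up vectors; `toy_embeddings` and `toy_most_similar` are hypothetical and only mimic the ranking step, not gensim's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors, for illustration only
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def toy_most_similar(word, topn=2):
    """Rank every word in toy_embeddings by cosine similarity to `word`."""
    query = toy_embeddings[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in toy_embeddings.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(toy_most_similar("king"))
```

Because the query vector is compared against every entry, the query word itself always ranks first, just as `'king'` does in the output above.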

# Recommending songs by embeddings

In [31]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [32]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

In [33]:
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4
)
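
Here each playlist plays the role of a sentence and each song ID the role of a word, so songs that appear near each other in playlists end up with similar vectors. The `window` argument controls how many neighboring songs on each side count as context. The sketch below only illustrates that windowing idea on a toy playlist; `context_windows` and `toy_playlist` are hypothetical, and the sketch ignores gensim's implementation details such as negative sampling.

```python
def context_windows(playlist, window):
    """Yield (song, context_songs) using up to `window` neighbors on each side."""
    for i, song in enumerate(playlist):
        left = playlist[max(0, i - window):i]
        right = playlist[i + 1:i + 1 + window]
        yield song, left + right


# Made-up playlist of song IDs, stored as strings like the real data
toy_playlist = ["0", "1", "2", "3", "4"]

for song, context in context_windows(toy_playlist, window=2):
    print(song, context)
```

With `window=20`, as in the training call above, most songs in a short playlist see nearly the whole playlist as context.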

In [34]:
song_id = 2172

# Ask the model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))

[('5586', 0.9974304437637329),
 ('6641', 0.9967657923698425),
 ('1922', 0.9963169097900391),
 ('2849', 0.9956973791122437),
 ('6626', 0.9955478310585022),
 ('5634', 0.9950940012931824),
 ('3167', 0.9949778914451599),
 ('5549', 0.9949181079864502),
 ('11473', 0.9945735335350037),
 ('6658', 0.9945141077041626)]

In [35]:
print(songs_df.iloc[2172])

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


In [36]:
import numpy as np

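def print_recommendations(song_id):
    # most_similar returns (song_id, score) pairs; keep the IDs as integers
    similar_songs = np.array(
        model.wv.most_similar(positive=str(song_id), topn=5)
    )[:, 0].astype(int)
    # Song IDs match row positions in songs_df, so positional lookup works
    return songs_df.iloc[similar_songs]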

# Extract recommendations
print_recommendations(2172)

id    title               artist
5586  The Last In Line    Dio
6641  Shout At The Devil  Motley Crue
1922  One                 Metallica
2849  Run To The Hills    Iron Maiden
6626  Blackout            Scorpions


In [38]:
print_recommendations(842)

id     title                                              artist
5698   Turnin' Me On (w\/ Lil Wayne)                      Keri Hilson
5788   Drop It Like It's Hot (w\/ Pharrell)               Snoop Dogg
27078  Out Of My Head (w\/ Trey Songz)                    Lupe Fiasco
890    Knock You Down (w\/ Ne-Yo & Kanye West)            Keri Hilson
413    If I Ruled The World (Imagine That) (w\/ Laury...  Nas
