<h1>Chapter 1 - Introduction to Language Models</h1>
<i>Exploring the exciting field of Language AI</i>


<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter01/Chapter%201%20-%20Introduction%20to%20Language%20Models.ipynb)

---

This notebook is for Chapter 1 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

In [1]:
# %%capture
# !pip install "transformers>=4.40.1" "accelerate>=0.27.2"

# Qwen2.5-0.5B-Instruct

The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately (although that isn't always necessary).

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model (the weights are downloaded on first use)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
# Inspect the model architecture
print(model)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((
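The printed module tree is enough to estimate where the parameters sit. The sketch below recomputes parameter counts from the shapes shown above. It only counts what is visible in the printout (the embedding and the 24 decoder layers); the final norm and anything after it are cut off, so they are left out. It also assumes each `Qwen2RMSNorm` holds a single weight vector and that the rotary embedding has no learnable weights.

```python
# Parameters of a Linear layer: in_features * out_features, plus out_features if it has a bias.
def linear_params(in_features: int, out_features: int, bias: bool) -> int:
    return in_features * out_features + (out_features if bias else 0)

# Sizes read off the printed module tree
hidden, kv_dim, mlp_dim, vocab, n_layers = 896, 128, 4864, 151936, 24

attention = (
    linear_params(hidden, hidden, bias=True)      # q_proj
    + linear_params(hidden, kv_dim, bias=True)    # k_proj
    + linear_params(hidden, kv_dim, bias=True)    # v_proj
    + linear_params(hidden, hidden, bias=False)   # o_proj
)
mlp = (
    linear_params(hidden, mlp_dim, bias=False)    # gate_proj
    + linear_params(hidden, mlp_dim, bias=False)  # up_proj
    + linear_params(mlp_dim, hidden, bias=False)  # down_proj
)
norms = 2 * hidden  # assumption: one weight vector per RMSNorm, two norms per layer

per_layer = attention + mlp + norms
embedding = vocab * hidden
shown_total = embedding + n_layers * per_layer

print(f"attention per layer: {attention:,}")
print(f"MLP per layer:       {mlp:,}")
print(f"one decoder layer:   {per_layer:,}")
print(f"embedding matrix:    {embedding:,}")
print(f"embedding + layers:  {shown_total:,}")
```

Note that `k_proj` and `v_proj` map 896 features down to 128 while `q_proj` keeps 896, so the keys and values are much narrower than the queries.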

Although we can use the model and tokenizer directly, it is often easier to wrap them in a `pipeline` object, as a later cell does. First, we load the tokenizer and inspect its special tokens:

In [6]:
# The tokenizer defines special tokens; print them to see what they are
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# tokenizer.all_special_tokens
tokenizer.special_tokens_map

{'eos_token': '<|im_end|>',
 'pad_token': '<|endoftext|>',
 'additional_special_tokens': ['<|im_start|>',
  '<|im_end|>',
  '<|object_ref_start|>',
  '<|object_ref_end|>',
  '<|box_start|>',
  '<|box_end|>',
  '<|quad_start|>',
  '<|quad_end|>',
  '<|vision_start|>',
  '<|vision_end|>',
  '<|vision_pad|>',
  '<|image_pad|>',
  '<|video_pad|>']}

Finally, we create our prompt as a user and give it to the model:

In [10]:
prompt = "讲一个猫有关的笑话？"  # "Tell me a joke about cats?"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# See what format the messages take once the chat template is applied
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

print(text)
print("====" * 10)
print(model_inputs)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
讲一个猫有关的笑话？<|im_end|>
<|im_start|>assistant

{'input_ids': tensor([[151644,   8948,    198,   2610,    525,   1207,  16948,     11,   3465,
            553,  54364,  14817,     13,   1446,    525,    264,  10950,  17847,
             13, 151645,    198, 151644,    872,    198,  99526,  46944, 100472,
         101063,   9370, 109959,  11319, 151645,    198, 151644,  77091,    198]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
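The printed text shows what the chat template did: each message is wrapped as `<|im_start|>{role}`, a newline, the content, and `<|im_end|>`, and an open `<|im_start|>assistant` turn is appended for the model to continue. As a rough illustration, the same string can be assembled by hand. The function `render_chatml` below is a hypothetical helper written for this notebook, not part of `transformers`, and it ignores other things the real template may do (for example default system prompts or tool calls).

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the chat format printed above."""
    text = ""
    for message in messages:
        # One turn: start marker, role, newline, content, end marker, newline
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "讲一个猫有关的笑话？"},
]
rendered = render_chatml(messages)
print(rendered)
```

In practice `tokenizer.apply_chat_template` remains the safer choice, since the template ships with the model and can differ between model families.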


In [15]:
# Step 3: run inference and decode the result
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
print(generated_ids)

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(response)

[tensor([ 18493, 117431,  69249,   3837, 102292, 100472,  99250, 102543, 110184,
        101600, 100985,   3837,  56007,  36987,  56568, 100678, 104067,  81264,
           198,   2073,  35946, 104067, 102488, 101281,  32945,    198,   2073,
        102488, 101281,  81264, 102543, 108651,  29490, 106116,   8997,   2073,
         20412, 103924,   3837,  35946, 101922, 104278, 104067, 102488, 101281,
         32945,    198,   2073,  99212, 107409, 108209, 101036,  81264,    198,
          2073, 104100,  99405, 100655,   3837, 100131,  80443, 100655, 107347,
          3837,  99999, 104115, 104067, 102488, 101281,  32945,    198,   2073,
        104170,   3837,  99212, 105365,  99165, 103027, 100003,  32945, 151645],
       device='cuda:0')]
在动物园里，一只猫被主人带到笼子里，问：“你为什么在这里？”
“我在这里晒太阳。”
“晒太阳？”主人疑惑地问道。
“是啊，我每天都要在这里晒太阳。”
“那你想干什么呢？”
“我想吃鱼，但是没有鱼缸，所以我就在这里晒太阳。”
“哦，那你就很有趣吧。”<|im_end|>
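The list comprehension in the cell above exists because `generate` returns the prompt IDs followed by the newly generated IDs, so the prompt has to be sliced off before decoding. A minimal sketch of that slicing, using plain Python lists and made-up token IDs in place of the real tensors:

```python
# Toy stand-ins: plain lists instead of tensors, and invented token IDs.
prompt_batch = [[101, 7, 8, 9]]                 # one prompt of four tokens
generated_batch = [[101, 7, 8, 9, 42, 43, 2]]   # the prompt followed by three new tokens

# Keep only what comes after the prompt, sequence by sequence
new_tokens_batch = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(prompt_batch, generated_batch)
]
print(new_tokens_batch)  # [[42, 43, 2]]
```

The same slice works per sequence in a batch as long as each prompt occupies the first `len(input_ids)` positions of its output.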


In [28]:
from transformers import pipeline


# Step 1: create the pipeline
# return_full_text=False returns only the newly generated text, without the input prompt
# return_full_text=True returns the full text: the input prompt plus the generated text
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # return only the newly generated text
    max_new_tokens=2000,
    do_sample=False
)

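# Step 2: build the prompt
# The user message reads: "Write a passage about peace, 2000 characters."
messages = [
    {"role": "user", "content": "生成一段文字，关于和平，2000字."}
]

# Step 3: generate and print the decoded output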
output = generator(messages)
print(output[0]["generated_text"])

Device set to use cuda


### 和平的定义与重要性

和平是人类社会永恒的主题之一。它不仅关乎国家主权和领土完整，更是一种精神追求和社会价值。在当今世界，和平被视为一种基本人权，也是构建和谐社会、促进全球发展的重要基石。

#### 和平的概念

1. **国际法**：和平是指通过谈判、对话和协商解决争端的方式，而不是采取武力或暴力手段。
2. **人道主义原则**：和平意味着尊重人的尊严和权利，包括言论自由、宗教信仰自由等。
3. **可持续发展**：和平有助于实现经济、社会和环境的长期稳定和发展。

#### 和平的重要性

- **社会稳定与发展**：和平有利于减少冲突，促进经济发展，提高人民生活水平。
- **文化多样性**：和平促进了不同文化的交流与融合，增强了民族凝聚力。
- **国际合作**：和平有助于推动国际关系的健康发展，促进全球治理的进步。
- **国家安全**：和平有助于维护国家的安全和稳定，避免战争带来的破坏。

#### 中国作为和平的倡导者

自古以来，中国就主张“天下为公”，强调“民惟邦本，本固邦宁”。新中国成立后，中国政府始终致力于维护国家统一、民族团结和社会稳定，通过多种方式推进和平进程：

1. **外交政策**：坚持独立自主的和平外交政策，反对霸权主义和强权政治，支持联合国及其安理会的权威性和有效性。
2. **国际会议参与**：积极参与多边外交活动，如二十国集团（G20）、亚太经合组织（APEC）等，以促进地区和全球合作。
3. **人权保障**：在全球范围内推广《世界人权宣言》和《公民权利和政治权利国际公约》，确保各国的人权得到普遍尊重和保护。

#### 中华文明的和平传统

中华文明历史悠久，其核心价值观中蕴含着深厚的和平理念。例如，“和为贵”、“天人合一”的哲学思想，以及“仁爱”、“礼让”等道德观念，都体现了中国人对和平生活的向往和追求。

1. **儒家思想**：孔子提出“己所不欲，勿施于人”，强调个人行为应符合社会规范，这与现代意义上的和平共处的理念相契合。
2. **道家思想**：庄子提倡“无为而治”，认为顺应自然规律，避免人为干预，从而达到内心的平静和和谐。
3. **佛教思想**：佛陀教导众生应当远离痛苦，追求解脱，这种思想鼓励人们放下执着，寻求心灵的宁静和和谐。

#### 现代中国的和平之路

面对全球化、信息化带来的挑战，中国

In [30]:
# Step 1: load the model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

In [39]:
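# So, for Qwen, can we append the assistant marker to the prompt ourselves?
# Chinese prompt, roughly: "Help me write a 请教条 (note); the reason is that I am sick."
prompt = "帮我写一个请教条，原因是自己生病了。<|im_start|>assistant"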

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=20
)
print(generation_output[0])
# Print the output
print(tokenizer.decode(generation_output[0]))

tensor([108965,  61443,  46944, 116069,  38989,   3837, 107711,  99283, 109281,
         34187,   1773, 151644,  77091,   5122, 109723,   9370, 103998,   3837,
        111308,   6313,  35946, 104044, 101099, 101895, 108684,   3837,  99172,
        100703, 100158, 101214, 101898,  33108, 101899], device='cuda:0')
帮我写一个请教条，原因是自己生病了。<|im_start|>assistant：尊敬的医生，您好！我最近身体有些不适，想咨询一下您的建议和治疗


In [34]:
print(input_ids)

tensor([[108965,  61443,  46944, 116069,  38989,   3837, 107711,  99283, 109281,
          34187,   1773, 151644,  77091]], device='cuda:0')


Decoding the prompt's token IDs one at a time shows how the prompt was split into tokens:

In [41]:
for id_ in input_ids[0]:
    print(id_)
    print(tokenizer.decode(id_))

tensor(108965, device='cuda:0')
帮我
tensor(61443, device='cuda:0')
写
tensor(46944, device='cuda:0')
一个
tensor(116069, device='cuda:0')
请教
tensor(38989, device='cuda:0')
条
tensor(3837, device='cuda:0')
，
tensor(107711, device='cuda:0')
原因是
tensor(99283, device='cuda:0')
自己
tensor(109281, device='cuda:0')
生病
tensor(34187, device='cuda:0')
了
tensor(1773, device='cuda:0')
。
tensor(151644, device='cuda:0')
<|im_start|>
tensor(77091, device='cuda:0')
assistant


In [45]:
# English prompt
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))

Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Dear Sarah,

I hope this message finds you well.

As I was out with some friends, I


In [47]:
for id_ in input_ids[0]:
    print(tokenizer.decode(id_))

Write
 an
 email
 apolog
izing
 to
 Sarah
 for
 the
 tragic
 gardening
 mish
ap
.
 Explain
 how
 it
 happened
.<
|
assistant
|
>
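The two token listings differ in one telling way: `<|im_start|>` came back as a single token, while `<|assistant|>` was broken into `<`, `|`, `assistant`, `|`, `>`. A plausible reading is that only markers registered as special tokens for this tokenizer are kept whole, and everything else goes through ordinary subword splitting. The toy function below illustrates that first step only. `split_on_special_tokens` is a hypothetical helper, not the tokenizer's real implementation.

```python
import re


def split_on_special_tokens(text, special_tokens):
    """Split text so that registered special tokens stay intact as single pieces."""
    if not special_tokens:
        return [text]
    # Capturing group keeps the matched special tokens in the result of re.split
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    return [piece for piece in re.split(pattern, text) if piece]


# Two of the special tokens listed for the Qwen tokenizer earlier in the notebook
qwen_specials = ["<|im_start|>", "<|im_end|>"]

with_known_marker = split_on_special_tokens("Hello<|im_start|>assistant", qwen_specials)
with_unknown_marker = split_on_special_tokens("Hello<|assistant|>", qwen_specials)

print(with_known_marker)    # ['Hello', '<|im_start|>', 'assistant']
print(with_unknown_marker)  # ['Hello<|assistant|>']
```

The unknown marker is left inside ordinary text, which a subword tokenizer would then break into smaller pieces, as seen in the output above.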


In [53]:
from transformers import AutoModelForCausalLM, AutoTokenizer

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
text = """
English and CAPITALIZATION
🎵 鸟你是我
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

In [61]:
show_tokens(text, "gpt2")

 English  and  CAP ITAL IZ ATION 
 � � �  � � � � � 是 � � 
 show _ t ok ens  False  None  el if  ==  >=  else [output truncated]

In [57]:
show_tokens(text, "bert-base-uncased")

[CLS] english and capital ##ization [UNK] [UNK] [UNK] [UNK] 我 show _ token ##s false none eli ##f = = > = else : two tab ##s : " " three [output truncated]
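The `##` prefixes in the BERT output mark pieces that continue a word, and `[UNK]` stands in for input the vocabulary cannot cover. BERT's tokenizer is a WordPiece tokenizer, which splits a word greedily by taking the longest vocabulary entry that matches at each position. Below is a minimal sketch of that greedy matching with a tiny invented vocabulary. The real vocabulary has roughly 30,000 entries, and the real tokenizer also lowercases and pre-splits text, which this sketch skips.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Invented toy vocabulary, chosen to mirror a few splits seen in the output above
toy_vocab = {"capital", "##ization", "token", "##s", "eli", "##f"}

for word in ["capitalization", "tokens", "elif", "🎵"]:
    print(word, wordpiece(word, toy_vocab))
```

With this toy vocabulary the splits mirror the ones above: `capital ##ization`, `token ##s`, `eli ##f`, and `[UNK]` for the emoji.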

In [58]:
show_tokens(text, "Qwen/Qwen2.5-0.5B-Instruct")

 English  and  CAPITAL IZATION 
 🎵  � � � 你 是我 
 show _tokens  False  None  elif  ==  >=  else :  two  tabs :"      "  Three  tabs :  " [output truncated]

In [63]:
import gensim.downloader as api

# Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")

model.most_similar([model['king']], topn=11)

[('king', 1.0000001192092896),
 ('prince', 0.8236179351806641),
 ('queen', 0.7839043140411377),
 ('ii', 0.7746230363845825),
 ('emperor', 0.7736247777938843),
 ('son', 0.766719400882721),
 ('uncle', 0.7627150416374207),
 ('kingdom', 0.7542161345481873),
 ('throne', 0.7539914846420288),
 ('brother', 0.7492411136627197),
 ('ruler', 0.7434253692626953)]
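The scores returned by `most_similar` are cosine similarities between word vectors, which is why `king` compared with itself scores 1 (up to floating-point rounding). A small sketch of the computation follows, using invented three-dimensional vectors rather than the real 50-dimensional GloVe vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only
king = [0.8, 0.3, 0.1]
queen = [0.7, 0.4, 0.2]
banana = [-0.1, 0.2, 0.9]

print(cosine_similarity(king, king))
print(cosine_similarity(king, queen))
print(cosine_similarity(king, banana))
```

Because the measure depends only on direction, scaling a vector by a positive constant leaves its similarity to other vectors unchanged.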

In [88]:
from transformers import AutoModel, AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load a language model
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]

In [89]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Example text
text = "I love programming!"

# Encode the input text
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Feed the encoded text to the BERT model
outputs = model(**inputs)

# outputs.logits holds the classification head's scores, computed from the
# [CLS] representation; it is not the [CLS] embedding itself
logits = outputs.logits

# Print the classification result (logits)
print(logits)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[-0.4706, -0.0217]], grad_fn=<AddmmBackward0>)
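The two numbers printed above are raw logits, one per class. To read them as probabilities they are usually passed through a softmax. The sketch below does this in plain Python with the values copied from the output. Since the warning says the classifier weights are newly initialized, the resulting probabilities carry no meaning until the model is fine-tuned.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.4706, -0.0217]  # values copied from the printed tensor
probs = softmax(logits)
print([round(p, 3) for p in probs])
```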


In [87]:
from transformers import BertTokenizer, BertModel
import torch

# Load the pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, output_hidden_states=True)

# Put the model in evaluation mode
model.eval()

# Example text
text = "The quick brown fox jumps over the lazy dog."

# Encode the input text with the tokenizer
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run the model and collect the outputs of every layer
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding output followed by one tensor per layer
hidden_states = outputs.hidden_states

# Take the second-to-last layer, because the last layer's representation is "oriented"
# (presumably biased toward the pre-training objective)
second_last_hidden_state = hidden_states[-2]

# Embedding of the [CLS] token: batch index 0, token position 0
cls_embedding = second_last_hidden_state[0, 0]

# Check the dimensionality of the second-to-last layer's [CLS] embedding
print(f"Shape of the second-to-last layer [CLS] embedding: {cls_embedding.shape}")

# Print the second-to-last layer's [CLS] embedding
print(cls_embedding)

Shape of the second-to-last layer [CLS] embedding: torch.Size([768])
tensor([-2.6080e-01,  1.2339e-02, -6.7656e-02,  4.1337e-01, -2.3746e-01,
        -6.3126e-01, -2.6496e-01,  4.1304e-01,  7.4866e-02, -4.5898e-01,
        -5.8496e-01, -2.1534e-02,  1.2094e-01,  1.6789e-01, -2.9455e-01,
        -1.2126e-01, -2.4585e-01, -1.0238e-01, -2.7770e-01, -4.3773e-01,
        -5.9221e-03, -4.2733e-02, -1.1661e-01,  2.5629e-02,  5.1659e-01,
        -4.4920e-03, -1.3251e-01,  2.1097e-01, -1.0595e-01,  4.7786e-01,
        -5.9861e-01,  2.3056e-01,  7.8810e-02, -6.4975e-01,  4.5683e-01,
        -1.2953e-01,  3.2481e-01, -3.0616e-01, -4.4183e-01,  2.0779e-01,
        -1.8338e-01,  1.3580e-01, -1.1301e-02, -5.3926e-01, -1.2395e-02,
        -5.3533e-01, -3.7442e+00,  2.5199e-01,  9.1076e-02, -3.0403e-01,
        -4.6988e-01, -1.6335e-01, -1.5999e-01,  5.4835e-01, -3.5547e-01,
         9.0133e-02, -2.1743e-01,  2.9163e-03,  2.3797e-01,  2.6842e-01,
        -3.6542e-01,  1.3203e-01, -4.0884e-01, -9.0026e-02, -5.6172e-01,
       

In [90]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

# Convert text to text embeddings
# The input sentence reads: "Test the embedding ability of a small model"
vector = model.encode("测试一个小模型的 embedding 能力")

In [92]:
vector.shape

(768,)
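The result is a single 768-dimensional vector for the whole sentence, even though the underlying Transformer produces one vector per token. Sentence-embedding models bridge that gap with a pooling step. One common choice is to average the token vectors while skipping padding positions. Whether this particular model pools exactly this way is an assumption. The sketch below shows masked mean pooling on tiny made-up vectors in plain Python.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three made-up 2-dimensional token vectors; the last one is padding
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padding vector does not affect the result because its mask entry is 0.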

In [93]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')
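To make the parsing logic easier to follow without downloading anything, the sketch below runs the same steps on a tiny invented sample. The sample assumes the layout implied by the code above: two metadata lines followed by one playlist per line as whitespace-separated song IDs, and a song file with tab-separated `id`, `title`, and `artist`. The song names and IDs are made up, and a plain dictionary stands in for the DataFrame.

```python
# Invented sample data mimicking the assumed file layout
raw_playlists = "metadata line 1\nmetadata line 2\n0 1 2 \n3 \n1 2 \n"
raw_songs = "0\tSong A\tArtist X\n1\tSong B\tArtist Y\n2\tSong C\tArtist Z\n3\tSong D\tArtist X\n"

# Skip the two metadata lines, then drop playlists with only one song
lines = raw_playlists.split("\n")[2:]
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Build an id -> (title, artist) lookup, skipping blank rows
songs = {}
for row in raw_songs.split("\n"):
    if not row.strip():
        continue
    song_id, title, artist = row.rstrip().split("\t")
    songs[song_id] = (title, artist)

print(playlists)
first_playlist_titles = [songs[song_id][0] for song_id in playlists[0]]
print(first_playlist_titles)
```

Each playlist ends up as a list of song ID strings, which can be looked up in the song table by ID.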