In [6]:
corpus = """LLM Research Papers: The 2024 List
I want to share my running bookmark list of many fascinating (mostly LLM-related) papers I stumbled upon in 2024. It's just a list, but maybe it will come in handy for those who are interested in finding some gems to read for the holidays.
Nov 3, 2024
Understanding Multimodal LLMs An Introduction to the Main Techniques and Latest Models
There has been a lot of new research on the multimodal LLM front, including the latest Llama 3.2 vision models, which employ diverse architectural strategies to integrate various data types like text and images. For instance, The decoder-only method uses a single stack of decoder blocks to process all modalities sequentially. On the other hand, cross-attention methods (for example, used in Llama 3.2) involve separate encoders for different modalities with a cross-attention layer that allows these encoders to interact. This article explains how these different types of multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other recent multimodal papers and models published in recent weeks to compare their approaches.
Sep 21, 2024
Building A GPT-Style LLM Classifier From Scratch Finetuning a GPT Model for Spam Classification
This article shows you how to transform pretrained large language models (LLMs) into strong text classifiers. But why focus on classification? First, finetuning a pretrained model for classification offers a gentle yet effective introduction to model finetuning. Second, many real-world and business challenges revolve around text classification: spam detection, sentiment analysis, customer feedback categorization, topic labeling, and more.
Sep 1, 2024
Building LLMs from the Ground Up: A 3-hour Coding Workshop
This tutorial is aimed at coders interested in understanding the building blocks of large language models (LLMs), how LLMs work, and how to code them from the ground up in PyTorch. We will kick off this tutorial with an introduction to LLMs, recent milestones, and their use cases. Then, we will code a small GPT-like LLM, including its data input pipeline, core architecture components, and pretraining code ourselves. After understanding how everything fits together and how to pretrain an LLM, we will learn how to load pretrained weights and finetune LLMs using open-source libraries.
"""

In [13]:
# words = list(set(corpus.split()))
# words = [f"{word}</w>" for word in words]
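# Split the corpus on whitespace. Each word becomes a space-separated
# sequence of characters followed by an end-of-word marker.
words = corpus.split()
vocab = dict()
for word in words:
    # Raw string: '<\w>' without the r prefix is an invalid escape sequence
    # and triggers a warning on recent Python versions.
    vocab_key = ' '.join(list(word) + [r'<\w>'])
    # Count how often each word occurs in the corpus.
    vocab[vocab_key] = vocab.get(vocab_key, 0) + 1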
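
The counting step above can also be sketched with collections.Counter. This is only an illustrative variant: the names build_vocab, END, and toy_vocab are hypothetical and do not appear elsewhere in this notebook, and the three-word input is made up for the example.

```python
from collections import Counter

END = r"<\w>"  # assumed to match the end-of-word marker used in the cell above


def build_vocab(text, end=END):
    """Map each space-separated symbol sequence to the word's frequency."""
    counts = Counter(" ".join(list(word) + [end]) for word in text.split())
    return dict(counts)


# Tiny made-up input, just to show the shape of the result.
toy_vocab = build_vocab("low low lower")
for symbols, freq in toy_vocab.items():
    print(symbols, freq)
```

Each key is one word spelled out as separate symbols, so the later merge steps can operate on the symbols by splitting on spaces.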

In [16]:
import collections

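def get_stats(vocab):
    """
    Count the frequency of every adjacent symbol pair in the vocabulary.
    Args:
      vocab: dict mapping a space-separated symbol sequence (str) to the
             number of times that word occurs.
    Returns:
      pairs: dict mapping an adjacent symbol pair (tuple) to its frequency.
    """
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        # Walk over each pair of adjacent symbols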
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

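def merge_vocab(pair, vocab):
    """
    Merge every occurrence of the given symbol pair into a single new symbol.
    Args:
      pair: tuple, the symbol pair (a, b) to merge.
      vocab: dict, the current vocabulary.
    Returns:
      out_vocab: dict, the updated vocabulary.
    """
    out_vocab = {}
    # Compare symbol by symbol rather than by substring, so a pair is only
    # merged when both symbols match exactly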
    bigram = ' '.join(pair)
    for word, freq in vocab.items():
        symbols = word.split()
        new_symbols = []
        i = 0
        while i < len(symbols):
            # If the current and next symbols form the target pair, merge them
            if i < len(symbols) - 1 and symbols[i] == pair[0] and symbols[i+1] == pair[1]:
                new_symbols.append(pair[0] + pair[1])
                i += 2  # skip the next symbol
            else:
                new_symbols.append(symbols[i])
                i += 1
        # Join the updated symbol sequence back into a string to form the new key
        new_word = ' '.join(new_symbols)
        out_vocab[new_word] = freq
    return out_vocab

# Example vocabulary: each word is already split into space-separated symbols, with the end-of-word marker </w> appended
# vocab = {
#     'l o w </w>': 5,
#     'l o w e r </w>': 2,
#     'n e w e s t </w>': 6,
#     'w i d e s t </w>': 3
# }

num_merges = 100  # number of merges to perform

print("Initial vocabulary:")
for word, freq in vocab.items():
    print(f"{word}  {freq}")
print("="*30)

for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    # Merge the most frequent symbol pair
    best = max(pairs, key=pairs.get)
    print(f"Merge {i+1}: {best}, frequency {pairs[best]}")
    vocab = merge_vocab(best, vocab)
    print("Updated vocabulary:")
    for word, freq in vocab.items():
        print(f"{word}  {freq}")
    print("-"*30)


Initial vocabulary:
L L M <\w>  3
R e s e a r c h <\w>  1
P a p er s : <\w>  1
T h e<\w>  2
2 0 2 4 <\w>  4
L i s t <\w>  1
I <\w>  3
w an t <\w>  1
t o <\w>  13
s h a r e<\w>  1
m y <\w>  1
r u n n in g <\w>  1
b o o k m a r k <\w>  1
l i s t <\w>  1
o f <\w>  5
m an y <\w>  2
f a s c in a t in g <\w>  1
( m o s t l y <\w>  1
L L M - r e l a t e d ) <\w>  1
p a p er s<\w>  2
s t u m b l e d<\w>  1
u p o n <\w>  1
in <\w>  7
2 0 2 4 . <\w>  1
I t ' s<\w>  1
j u s t <\w>  1
a <\w>  9
l i s t ,<\w>  1
b u t <\w>  1
m a y b e<\w>  1
i t <\w>  1
w i l l <\w>  5
c o m e<\w>  1
h an d y <\w>  1
f o r <\w>  5
th o s e<\w>  1
w h o <\w>  1
a r e<\w>  1
in t er e s t e d<\w>  2
f in d in g <\w>  1
s o m e<\w>  1
g e m s<\w>  1
r e a d<\w>  1
th e<\w>  8
h o l i d a y s . <\w>  1
N o v <\w>  1
3 ,<\w>  1
U n d er s t an d in g <\w>  1
M u l ti m od a l <\w>  1
L L M s<\w>  5
A n <\w>  1
I n t r od u c ti o n <\w>  1
M a in <\w>  1
T e c h n i q u e s<\w>  1
an d<\w>  11
L a t e s t <\w>  1
M od e l s<\w>  1
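
Once merges have been learned, they can be replayed in order to segment a word that was not in the training vocabulary. The sketch below is one possible way to do this, not part of the notebook above: merge_pair, learn_merges, and encode are hypothetical helper names, and the toy vocabulary is the commented-out four-word example, which uses the conventional </w> marker.

```python
from collections import defaultdict


def merge_pair(pair, symbols):
    """Return a copy of symbols with each adjacent occurrence of pair fused."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2  # skip the symbol that was just merged
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(vocab, num_merges):
    """Return the ordered list of symbol pairs merged during training."""
    merges = []
    for _ in range(num_merges):
        pairs = defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by insertion order, i.e. the pair seen first wins.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {" ".join(merge_pair(best, w.split())): f for w, f in vocab.items()}
    return merges


def encode(word, merges, end="</w>"):
    """Segment a single word by replaying the learned merges in order."""
    symbols = list(word) + [end]
    for pair in merges:
        symbols = merge_pair(pair, symbols)
    return symbols


toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges = learn_merges(toy_vocab, num_merges=5)
print(merges)
print(encode("lowest", merges))  # ['low', 'est</w>']
```

The word "lowest" never occurs in the toy vocabulary, yet it is covered by two learned subwords. Replaying every merge over the whole word is simple but not the fastest approach; it is used here only because it mirrors the training loop.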
