# Chapter 2

For ModelScope dataset usage, see: <https://modelscope.cn/docs/sdk/dataset>

## Install dependencies

```bash
pip install datasets transformers sentencepiece "modelscope[framework]"

# windows
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# linux
pip install torch torchvision torchaudio
```

## Prepare datasets: run the shell command

```bash
modelscope download --dataset BAAI/IndustryCorpus2 --local_dir ./datasets/ --include 'computer_programming_code/*'
```

Specifically, you can run:

```bash

```
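
Once the download completes, the shard paths do not have to be typed by hand. The following is a minimal sketch, assuming the downloaded Parquet files sit somewhere under `./datasets/computer_programming_code/`; `list_parquet_files` is a hypothetical helper for illustration, not part of ModelScope.

```python
from pathlib import Path


def list_parquet_files(root, pattern="**/*.parquet"):
    """Return the sorted paths of all Parquet shards found under `root`."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing has been downloaded yet (or the path is wrong).
        return []
    return sorted(str(p) for p in root_path.glob(pattern))


if __name__ == "__main__":
    shards = list_parquet_files("./datasets/computer_programming_code")
    print(f"found {len(shards)} parquet shard(s)")
    for path in shards[:5]:
        print(path)
```

The returned list can be passed as `data_files` in place of a hand-written list of shard paths.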

In [None]:
'''
load dataset
'''
from modelscope import MsDataset
from datasets import Dataset

files = [
    './datasets/computer_programming_code/chinese/low/rank_00163.parquet',
    './datasets/computer_programming_code/english/low/rank_01598.parquet',
    './datasets/computer_programming_code/english/low/rank_01599.parquet',
    './datasets/computer_programming_code/english/low/rank_01600.parquet',
    './datasets/computer_programming_code/english/low/rank_01601.parquet',
    './datasets/computer_programming_code/english/low/rank_01602.parquet',
]

ds = MsDataset.load(dataset_name='parquet', data_files=files)

hf_ds: Dataset = ds.to_hf_dataset()
print('hf_ds:', hf_ds)

# only keep text column
hf_ds = hf_ds.remove_columns([col for col in hf_ds.column_names if col != "text"])
# split into train and test sets with 90% training and 10% testing
d = hf_ds.train_test_split(test_size=0.1)
print('train and test dataset:', d)

hf_ds = None # free memory
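
`train_test_split` shuffles the rows and then sets aside the requested fraction as the test set. Because no seed is passed above, the split differs from run to run; the library call accepts a `seed` argument to make it repeatable. The sketch below illustrates the idea on a plain Python list. `shuffle_split` is a hypothetical stand-in for illustration, not the `datasets` implementation.

```python
import random


def shuffle_split(rows, test_size=0.1, seed=None):
    """Shuffle `rows` and split them into (train, test) lists."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the split reproducible per seed.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = [f"sample {i}" for i in range(20)]
train_rows, test_rows = shuffle_split(rows, test_size=0.1, seed=42)
print(len(train_rows), len(test_rows))
```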


In [None]:
# show dataset info
print('d["train"]:', d["train"])
# index the row first so that only one sample is read, not the whole column
print('d["train"][text][0]:', d["train"][0]["text"])

d["train"]: Dataset({
    features: ['text', 'alnum_ratio', 'avg_line_length', 'char_rep_ratio', 'flagged_words_ratio', 'max_line_length', 'num_words', 'perplexity', 'quality_score', 'special_char_ratio', 'word_rep_ratio', '_id', 'industry_type'],
    num_rows: 1862001
})
d["train"][text][0]: Q:

How to collect into a Map forming a List in value when duplicate keys in streams, in Java 8

I have a stream of elements in the form of either 2-D array or EntrySet. I need these to be collected in a Map. Now the issue is the stream of elements can have duplicate elements. Let's say I want the value to be a list:
Map<String,List<String>>

Example :
class MapUtils
{
// Function to get Stream of String[]
private static Stream<String[]> getMapStream()
{
    return Stream.of(new String[][] {
            {"CAR", "Audi"},
            {"BIKE", "Harley Davidson"},
            {"BIKE", "Pulsar"}
    });
}

// Program to convert Stream to Map in Java 8
public static void main(String args[])
{
    // get

In [None]:
'''
save train and test dataset
'''
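import os

def dataset_to_text(dataset, output_filename="data.txt"):
    """Write every sample of the "text" column to a UTF-8 text file."""
    # create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(output_filename) or ".", exist_ok=True)
    with open(output_filename, "w", encoding="utf-8") as f:
        for t in dataset["text"]:
            print(t, file=f)

print('begin to save dataset')
dataset_to_text(d["train"], "./datasets/for_train/ds-train.txt")
dataset_to_text(d["test"], "./datasets/for_test/ds-test.txt")
print('dataset saved')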

In [5]:
'''
training tokenizer: set params for training
'''

special_tokens = [
    "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"
]
# training the tokenizer on the training set
files = ["./datasets/for_train/ds-train.txt"]
# 30,522 vocab is BERT's default vocab size, feel free to tweak
vocab_size = 30_522
# maximum sequence length; lowering it speeds up training and leaves room for a larger batch size
max_length = 512
# whether to truncate
truncate_longer_samples = False
model_path = "models/pretrained-bert"

print('set params for training tokenizer')

set params for training tokenizer


In [None]:
'''
train tokenizer: train and save the tokenizer
'''
from tokenizers import BertWordPieceTokenizer

import os
import json

# initialize the WordPiece tokenizer
tokenizer = BertWordPieceTokenizer()

# train the tokenizer
print('begin to train a tokenizer')
tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)
print('finished training the tokenizer')
# enable truncation up to max_length (512) tokens
tokenizer.enable_truncation(max_length=max_length)

# create the directory (including missing parents) if it is not already there
os.makedirs(model_path, exist_ok=True)

# save the tokenizer
tokenizer.save_model(model_path)
print(f'saved tokenizer to {model_path}')

# dumping some of the tokenizer config to config file,
# including special tokens, whether to lower case and the maximum sequence length
configPath = os.path.join(model_path, "config.json")
with open(configPath, "w") as f:
    tokenizer_cfg = {
        "do_lower_case": True,
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
        "model_max_length": max_length,
        "max_len": max_length,
    }
    json.dump(tokenizer_cfg, f, indent=2)
    print(f'save config to {configPath}')
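
At encoding time, a WordPiece vocabulary like the one saved above is applied with greedy longest-match-first matching: each word is split into the longest prefix found in the vocabulary, followed by continuation pieces prefixed with `##`. The following is a minimal sketch of that matching step with a made-up toy vocabulary. `wordpiece_split` and the vocabulary entries are hypothetical, and the real tokenizer also performs normalization and pre-tokenization.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example only.
toy_vocab = {"token", "##izer", "##ize", "##r", "un", "##known"}
print(wordpiece_split("tokenizer", toy_vocab))
print(wordpiece_split("unknown", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```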


In [None]:
'''
preprocessing datasets
'''
from transformers import BertTokenizerFast

def encode_with_truncation(examples):
    """Mapping function to tokenize the sentences passed with truncation"""
    return tokenizer(examples["text"], truncation=True, padding="max_length",
            max_length=max_length, return_special_tokens_mask=True)

def encode_without_truncation(examples):
    """Mapping function to tokenize the sentences passed without truncation"""
    return tokenizer(examples["text"], return_special_tokens_mask=True)

# when the tokenizer is trained and configured, load it as BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained(model_path)
print('load model from', model_path)

# the encode function will depend on the truncate_longer_samples variable
encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation
# tokenizing the train dataset
train_dataset = d["train"].map(encode, batched=True)
# tokenizing the testing dataset
test_dataset = d["test"].map(encode, batched=True)
if truncate_longer_samples:
    # remove other columns and set input_ids and attention_mask as PyTorch tensors
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
    # remove other columns, and keep the remaining ones as Python lists
    test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
    train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
print('finished tokenizing datasets')
d = None # free memory

# group texts
from itertools import chain

def group_texts(examples):
    print("keys:", examples.keys())
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    # Compute length of concatenated texts.
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last samples to ensure they can be batched perfectly.
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length
    result = {
        k: [t[i : i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated_examples.items()
    }
    return result

if not truncate_longer_samples:
    print('begin grouping texts')
    train_dataset = train_dataset.map(group_texts, batched=True, desc=f"Grouping texts in chunks of {max_length}")
    test_dataset = test_dataset.map(group_texts, batched=True, desc=f"Grouping texts in chunks of {max_length}")

    # convert to torch tensors
    train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
    test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

print('preprocessing done')

load model from models/pretrained-bert
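
To make the grouping step concrete, here is a small pure-Python run of the same concatenate-and-chunk logic with the chunk size reduced to 4. The token IDs are made up for illustration, and `group_into_chunks` is a standalone copy of the idea rather than the function passed to `Dataset.map`.

```python
from itertools import chain


def group_into_chunks(examples, chunk_size):
    """Concatenate each column, then cut it into chunks of `chunk_size`."""
    concatenated = {k: list(chain(*v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the tail so that every chunk has exactly `chunk_size` items.
    if total_length >= chunk_size:
        total_length = (total_length // chunk_size) * chunk_size
    return {
        k: [t[i:i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }


# Three tokenized samples of different lengths (13 tokens in total).
batch = {
    "input_ids": [[2, 11, 12, 3], [2, 21, 3], [2, 31, 32, 33, 34, 3]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
grouped = group_into_chunks(batch, chunk_size=4)
print(grouped["input_ids"])
# -> [[2, 11, 12, 3], [2, 21, 3, 2], [31, 32, 33, 34]]
```

Note that chunks cross sample boundaries and the final leftover token is discarded, which is the trade-off of this packing strategy.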



In [3]:
'''
training model
'''
from transformers import BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments

model_config = BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length)
model = BertForMaskedLM(config=model_config)

# init data collator: randomly selects 20% of the tokens for masking
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2
)

training_args = TrainingArguments(
    output_dir='./models/results',
    evaluation_strategy="steps",  # newer transformers releases rename this argument to eval_strategy
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=64,
    logging_steps=1000,
    save_steps=1000,
    # load_best_model_at_end=True,
    # save_total_limit=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


NameError: name 'tokenizer' is not defined
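
With `mlm=True`, the collator picks a random subset of non-special tokens as prediction targets at rate `mlm_probability`. In the standard BERT recipe, about 80% of the selected tokens are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged, while unselected positions receive the label `-100` so that the loss ignores them. The following pure-Python sketch illustrates that scheme. `mask_tokens` and the token IDs are hypothetical, and this is not the `DataCollatorForLanguageModeling` implementation.

```python
import random


def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.2, seed=None):
    """Return (masked_inputs, labels) following the BERT 80/10/10 scheme."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions the loss ignores
    for i, token in enumerate(input_ids):
        if token in special_ids or rng.random() >= mlm_probability:
            continue  # not selected as a prediction target
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


# Made-up IDs: 2 and 3 play the role of [CLS] and [SEP], 4 of [MASK].
ids = [2, 57, 103, 998, 45, 3]
masked, labels = mask_tokens(ids, special_ids={2, 3}, mask_id=4,
                             vocab_size=30_522, mlm_probability=0.2, seed=0)
print(masked)
print(labels)
```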

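## Large language model architecture

![GPT-2 model architecture](./C2/assets/GPT-2_model.png)

Differences:

- Uses pre-normalization:
    - before the multi-head self-attention (MSA)
    - before the FFN
- Residual connections are added after the MSA and after the FFN
- Normalization: RMSNorm (Root Mean Square Layer Normalization)
- Activation function: SwiGLU
- Position embedding: RoPE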
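
The data flow described by this list can be sketched in a few lines. The following is a minimal pure-Python illustration, assuming `norm`, `attention`, and `ffn` are arbitrary callables on lists of floats (toy stand-ins here, not real layers). `swiglu` shows the gating `Swish(xW) * (xV)` applied to two already-projected vectors; all function names are hypothetical.

```python
import math


def swish(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate, value):
    """SwiGLU gating: Swish(gate) * value, element by element.

    `gate` and `value` stand for the two linear projections xW and xV.
    """
    return [swish(g) * v for g, v in zip(gate, value)]


def prenorm_block(x, norm, attention, ffn):
    """One pre-normalization block with residual connections."""
    # Normalize first, apply the sublayer, then add the residual.
    h = [xi + ai for xi, ai in zip(x, attention(norm(x)))]
    return [hi + fi for hi, fi in zip(h, ffn(norm(h)))]


# Toy stand-ins so the sketch runs; real layers have learned weights.
def identity(v):
    return list(v)


def double(v):
    return [2.0 * t for t in v]


print(swiglu([0.0, 1.0], [3.0, 2.0]))
print(prenorm_block([1.0, -1.0], identity, double, identity))
```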

RMSNorm is computed as:

$$
\begin{aligned}
RMS(a) &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2} + \epsilon \\
\overline{a}_i &= \frac{a_i}{RMS(a)}
\end{aligned}
$$

A scaling factor $g_i$ and an offset $b_i$ can also be introduced:

$$
\overline{a}_i = \frac{a_i}{RMS(a)} g_i + b_i
$$
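
These formulas translate directly into code. The following is a minimal pure-Python sketch that follows the formula as written here, with $\epsilon$ added after the square root; many implementations instead add it under the square root. The function name and the default `eps` value are assumptions.

```python
import math


def rms_norm(a, gain=None, bias=None, eps=1e-6):
    """RMSNorm of a vector `a`, with optional per-element gain and bias."""
    n = len(a)
    # RMS(a) = sqrt(mean of squares) + eps, as in the formula above.
    rms = math.sqrt(sum(x * x for x in a) / n) + eps
    gain = gain if gain is not None else [1.0] * n
    bias = bias if bias is not None else [0.0] * n
    return [x / rms * g + b for x, g, b in zip(a, gain, bias)]


a = [3.0, 4.0]
print(rms_norm(a))                                    # plain normalization
print(rms_norm(a, gain=[2.0, 2.0], bias=[0.5, 0.5]))  # with gain and bias
```

After plain normalization the output vector has a root mean square of approximately 1, regardless of the scale of the input.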


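RoPE borrows ideas from complex numbers; its starting point is to achieve relative position encoding by means of absolute position encoding.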
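
This idea can be checked numerically. In the sketch below, each pair of features is treated as one complex number and rotated by an angle proportional to its absolute position; the inner product between a rotated query and a rotated key then depends only on the difference of their positions. The frequency schedule `base ** (-2i/d)` with `base = 10000` follows the common RoPE convention and is an assumption here, as are the function names and sample vectors.

```python
import cmath


def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)           # frequency of pair i
        z = complex(vec[2 * i], vec[2 * i + 1])  # pair -> complex number
        z *= cmath.exp(1j * position * theta)    # absolute-position rotation
        out.extend([z.real, z.imag])
    return out


def dot(u, v):
    """Plain inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(u, v))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(score_a, score_b)  # equal up to floating-point error
```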
