# LLM

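## What is an LLM?

What is a language model?

Given a sequence of words (a sentence) $x_1, x_2, \ldots, x_L$, a language model is a probabilistic model that assigns a generation probability $p(x_1, x_2, \ldots, x_L)$ to that sequence.  
  
By the chain rule, this probability factorizes as  
  
$$p(x_1, x_2, \ldots, x_L) = p(x_1)\,p(x_2 \mid x_1) \cdots p(x_L \mid x_1, x_2, \ldots, x_{L-1})$$  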
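
The factorization can be made concrete with a small sketch. It assumes a toy bigram model, which conditions only on the previous token instead of the full history, and the probabilities in `cond_prob` are made up for illustration. Summing log probabilities rather than multiplying raw probabilities is a common way to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical conditional probabilities p(next | previous) for a toy bigram model.
# "<s>" marks the start of the sequence; all numbers are made up for illustration.
cond_prob = {
    ("<s>", "I"): 0.5,
    ("I", "like"): 0.4,
    ("like", "cats"): 0.25,
}

def sequence_log_prob(tokens):
    """Sum log p(x_t | x_{t-1}) over the sequence (chain rule, bigram approximation)."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(cond_prob[(prev, tok)])
        prev = tok
    return log_p

log_p = sequence_log_prob(["I", "like", "cats"])
prob = math.exp(log_p)  # equals the product 0.5 * 0.4 * 0.25
print(prob)
```

A neural language model replaces the lookup table with a network that outputs a distribution over the next token given the whole preceding context.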
  
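How are the conditional probabilities obtained? ⇒ With a neural language model.  
  
However, earlier architectures have several problems:  
- Convolutional networks and MLPs have difficulty handling long contexts  
- RNN-based models cannot be trained in parallel across time steps and also suffer from vanishing gradients  
  
The Self-Attention mechanism of the Transformer addresses these problems.   
  
Why models are scaled up 
  
- Performance keeps improving as the dataset, the compute budget, and the number of parameters are scaled up  
- Some tasks cannot be solved unless the model is sufficiently large  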

## RAG and Prompting



## Pre-Train

## Tokenize

In [56]:
pip install transformers


In [69]:
text =  "I don't know what to say. 😀 I'm just a computer program."
print(text)

I don't know what to say. 😀 I'm just a computer program.


In [70]:
# Split the text on whitespace into a list
text_list = text.split()
print(text_list)

['I', "don't", 'know', 'what', 'to', 'say.', '😀', "I'm", 'just', 'a', 'computer', 'program.']
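
Whitespace splitting leaves punctuation attached to words (for example `'say.'` above). As a rough sketch of how a rule-based tokenizer separates it, the regular expression below keeps contractions together and emits every other non-space symbol as its own token. The pattern is an illustrative assumption, not the rule set used by nltk or any other library.

```python
import re

text = "I don't know what to say. 😀 I'm just a computer program."

# \w+(?:'\w+)? keeps contractions such as "don't" together;
# [^\w\s] matches any single non-word, non-space character (punctuation, emoji).
pattern = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
tokens = pattern.findall(text)
print(tokens)
```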


# Tokenize with nltk
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
text_tokens = word_tokenize(text)
print(text_tokens)

In [72]:
# Tokenize with Hugging Face transformers
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_tokenized = tokenizer.tokenize(text)
print(text_tokenized)

['i', 'don', "'", 't', 'know', 'what', 'to', 'say', '.', '[UNK]', 'i', "'", 'm', 'just', 'a', 'computer', 'program', '.']


In [73]:
# Generate the list of token IDs directly
input_ids = tokenizer.encode(text, add_special_tokens=True)
print("Encoded Input IDs:", input_ids)

# Convert the token IDs back to text
decoded_text = tokenizer.decode(input_ids)
print("Decoded Text:", decoded_text)


Encoded Input IDs: [101, 1045, 2123, 1005, 1056, 2113, 2054, 2000, 2360, 1012, 100, 1045, 1005, 1049, 2074, 1037, 3274, 2565, 1012, 102]
Decoded Text: [CLS] i don't know what to say. [UNK] i'm just a computer program. [SEP]


In [None]:
# Split further into single characters
text_list = list(text)
print(text_list)

['I', "'", 'm', ' ', 'f', 'i', 'n', 'e', ',', ' ', 't', 'h', 'a', 'n', 'k', ' ', 'y', 'o', 'u', '.']


In [2]:
# Download the PTB corpus
import os
import urllib.request
url = "https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt"
save_name = "ptb.train.txt"
if not os.path.exists(save_name):
    urllib.request.urlretrieve(url, save_name)

# Read the file
with open(save_name, "r") as f:
    ptb_text = f.read()

# Display the beginning of the corpus (ptb_text, not the earlier example string `text`)
print(ptb_text[:1000])

 aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter 
 pierre <unk> N years old will join the board as a nonexecutive director nov. N 
 mr. <unk> is chairman of <unk> n.v. the dutch publishing group 
 rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate 
 a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported 
 the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said 
 <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N 
 although preliminary findings wer

In [13]:
# Split the loaded corpus into single characters
text_list = list(ptb_text)
# Display part of the list
print(text_list[:50])

[' ', 'a', 'e', 'r', ' ', 'b', 'a', 'n', 'k', 'n', 'o', 't', 'e', ' ', 'b', 'e', 'r', 'l', 'i', 't', 'z', ' ', 'c', 'a', 'l', 'l', 'o', 'w', 'a', 'y', ' ', 'c', 'e', 'n', 't', 'r', 'u', 's', 't', ' ', 'c', 'l', 'u', 'e', 't', 't', ' ', 'f', 'r', 'o']


## Experiment code

In [2]:
import nltk
from nltk.corpus import treebank

# Download the data
nltk.download('treebank')

# Join the Penn Treebank sentences into a single string
sentences = treebank.sents()  # returns a list of sentences, each a list of words
text = ' '.join([' '.join(sentence) for sentence in sentences])

# Tokenize at the character level
char_tokens = list(text)

[nltk_data] Downloading package treebank to /home/yoshida/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [3]:
from collections import Counter
import numpy as np

# Count how often each character occurs and sort by descending frequency
counter = Counter(char_tokens)
vocab = sorted(counter, key=counter.get, reverse=True)

# Mapping: character -> index
char2idx = {char: idx for idx, char in enumerate(vocab)}

# Mapping: index -> character
idx2char = {idx: char for idx, char in enumerate(vocab)}

# Convert the character tokens to indices
data_as_int = np.array([char2idx[c] for c in char_tokens])

print(f'Vocabulary size: {len(vocab)}')
print(f'Sample data (indices): {data_as_int[:10]}')

Vocabulary size: 80
Sample data (indices): [49  4  1  8  8  1  0 72  4  6]
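
The two mappings are inverses of each other, so decoding the indices recovers the original text. The sketch below checks this on a short made-up string using only the standard library. The alphabetical tie-break between equally frequent characters is an assumption added here to make the ordering deterministic.

```python
from collections import Counter

sample = "hello world"
counter = Counter(sample)

# Most frequent characters get the smallest indices.
# Ties are broken alphabetically (an assumption, for determinism).
vocab = sorted(counter, key=lambda c: (-counter[c], c))
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = {i: c for c, i in char2idx.items()}

ids = [char2idx[c] for c in sample]            # encode
restored = "".join(idx2char[i] for i in ids)   # decode
print(ids)
print(restored)
```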


In [4]:
import numpy as np
import torch

# Sequence length
seq_length = 100  # arbitrary

# Build the input sequences and the targets
sequences = []
targets = []
for i in range(0, len(data_as_int) - seq_length):
    sequences.append(data_as_int[i:i + seq_length])  # input sequence
    targets.append(data_as_int[i + 1:i + seq_length + 1])  # target sequence, shifted by one

# Convert to PyTorch tensors. Stacking into a single ndarray first avoids the
# slow list-of-ndarrays conversion that PyTorch warns about.
sequences = torch.tensor(np.array(sequences))
targets = torch.tensor(np.array(targets))

print(f'Shape of sequences: {sequences.shape}')  # (number of samples, sequence length)
print(f'Shape of targets: {targets.shape}')      # (number of samples, sequence length)


Shape of sequences: torch.Size([544169, 100])
Shape of targets: torch.Size([544169, 100])
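
The windowing above can be checked on a tiny example. In the sketch below, `make_windows` is a hypothetical helper and its `stride` parameter is not in the original code. With `stride=1` it behaves like the loop above, so consecutive windows overlap in all but one position. A larger stride produces fewer and less redundant samples.

```python
def make_windows(ids, seq_length, stride=1):
    """Return (inputs, targets) where each target is its input shifted by one position."""
    inputs, targets = [], []
    for i in range(0, len(ids) - seq_length, stride):
        inputs.append(ids[i:i + seq_length])
        targets.append(ids[i + 1:i + seq_length + 1])
    return inputs, targets

ids = list(range(10))

# stride=1 behaves like the loop in the cell above
inputs, targets = make_windows(ids, seq_length=4)
print(inputs[0], targets[0])

# Non-overlapping windows (hypothetical variant)
sparse_inputs, sparse_targets = make_windows(ids, seq_length=4, stride=4)
print(len(inputs), len(sparse_inputs))
```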


In [5]:
# Split the data into training, validation, and test sets (e.g. 80% / 10% / 10%)
from torch.utils.data import DataLoader, TensorDataset

# Split the dataset
train_size = int(0.8 * len(sequences))
val_size = int(0.1 * len(sequences))

train_sequences = sequences[:train_size]
train_targets = targets[:train_size]
val_sequences = sequences[train_size:train_size + val_size]
val_targets = targets[train_size:train_size + val_size]
test_sequences = sequences[train_size + val_size:]
test_targets = targets[train_size + val_size:]

# Training data loader
train_dataset = TensorDataset(train_sequences, train_targets)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Validation data loader
val_dataset = TensorDataset(val_sequences, val_targets)
val_loader = DataLoader(val_dataset, batch_size=64)

print(f"Training sequences shape: {train_sequences.shape}")  # (number of samples, sequence length)
print(f"Training targets shape: {train_targets.shape}")      # (number of samples, sequence length)
print(train_loader.dataset.tensors[0].shape)  # (number of samples, sequence length)
print(val_loader.dataset.tensors[0].shape)    # (number of samples, sequence length)

Training sequences shape: torch.Size([435335, 100])
Training targets shape: torch.Size([435335, 100])
torch.Size([435335, 100])
torch.Size([54416, 100])


In [6]:
import torch.nn as nn
import torch.optim as optim

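class TransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = nn.Parameter(torch.zeros(1, seq_length, d_model))  # simple learned positional encoding
        # batch_first=True because the inputs have shape (batch, sequence length)
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward, dropout, batch_first=True)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        # Causal mask: a decoder position must not attend to later positions
        T = tgt.size(1)
        tgt_mask = torch.triu(torch.full((T, T), float('-inf'), device=tgt.device), diagonal=1)
        # Slice the positional encoding so that shorter inputs also work
        src = self.embedding(src) + self.pos_encoder[:, :src.size(1)]
        tgt = self.embedding(tgt) + self.pos_encoder[:, :tgt.size(1)]
        # Note: the encoder still sees all of src, which overlaps with tgt here,
        # so future tokens leak through it. A decoder-only model avoids this.
        output = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.fc_out(output)

# Initialize the model
vocab_size = len(vocab)
model = TransformerModel(vocab_size)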



In [7]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [8]:
# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

epochs = 10

# Training loop
for epoch in range(epochs):
    model.train()
    total_loss = 0

    for src, tgt in train_loader:
        src, tgt = src.to(device), tgt.to(device)

        # Reset the gradients
        optimizer.zero_grad()

        # The output at position t is scored against tgt[t + 1], so the last
        # output position (which has no next token) is dropped. Without this,
        # the output has 64 * 100 rows but the shifted target only 64 * 99.
        output = model(src, tgt)[:, :-1]
        output = output.reshape(-1, vocab_size)  # (batch size * (sequence length - 1), vocab size)

        # Shift the targets by one and flatten to (batch size * (sequence length - 1))
        tgt_flat = tgt[:, 1:].reshape(-1)

        # Compute the loss
        loss = criterion(output, tgt_flat)

        # Backpropagate and update the parameters
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader)}")

In [None]:
model.eval()
with torch.no_grad():
    val_loss = 0
    for src, tgt in val_loader:
        src, tgt = src.to(device), tgt.to(device)

        # Score output[t] against tgt[t + 1]; comparing the output with the
        # decoder input itself would reward simply copying the input.
        output = model(src, tgt)[:, :-1]
        loss = criterion(output.reshape(-1, vocab_size), tgt[:, 1:].reshape(-1))
        val_loss += loss.item()

    print(f"Validation Loss: {val_loss/len(val_loader)}")


## Example

In [19]:
!pip install japanize-matplotlib -q

In [20]:
# Load the data
!wget --no-check-certificate https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2/-/raw/main/ja/ja_wiki/train_9.jsonl.gz
!wget --no-check-certificate https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2/-/raw/main/ja/ja_wiki/validation_0.jsonl.gz


In [21]:
import gzip
import json

mini_train_data_file_num = 1000
mini_train_data_text = ""
file_count = 0
with gzip.open('train_9.jsonl.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        # Parse each line as JSON
        data = json.loads(line)
        mini_train_data_text += data['text'] + "\n"
        file_count += 1
        if file_count == mini_train_data_file_num:
            break
print(data.keys())
print(data['text'][:100])
print(data['meta'])

dict_keys(['text', 'meta'])
チリコシーは、アメリカ合衆国オハイオ州中央部南ロス郡の都市であり、同郡の郡庁所在地である。コロンバス大都市圏に属している。

2010年の国勢調査では人口21,901 人だった。ロス郡では唯一の都市で
{'id': '2973866', 'title': 'チリコシー (オハイオ州)', 'url': 'https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%AA%E3%82%B3%E3%82%B7%E3%83%BC%20%28%E3%82%AA%E3%83%8F%E3%82%A4%E3%82%AA%E5%B7%9E%29'}


In [22]:
val_data_file_num = 1
val_data_text = ""
file_count = 0
with gzip.open('validation_0.jsonl.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        # Parse each line as JSON
        data = json.loads(line)
        val_data_text += data['text'] + "\n"
        file_count += 1
        if file_count == val_data_file_num:
            break
print(data.keys())
print(data['text'][:100])
print(data['meta'])

dict_keys(['text', 'meta'])
梶原 一騎（かじわら いっき、1936年9月4日 - 1987年1月21日）は、日本の漫画原作者、小説家、映画プロデューサー。本名は高森 朝樹（たかもり あさき）。高森 朝雄（たかもり あさお）の筆名
{'id': '506', 'title': '梶原一騎', 'url': 'https://ja.wikipedia.org/wiki/%E6%A2%B6%E5%8E%9F%E4%B8%80%E9%A8%8E'}


In [25]:
# Tokenizer
unique_chars_in_train_text = sorted(list(set(mini_train_data_text)))
# One character per token
class Tokenizer:
    def __init__(self, chars):
        self.str_to_idx = dict()
        self.str_to_idx['<|endoftext|>'] = 0
        # utf-8
        for i in range(256):
            if f'<utf8_{i}>' not in self.str_to_idx:
                self.str_to_idx[f'<utf8_{i}>'] = len(self.str_to_idx)
        for char in chars:
            self.str_to_idx[char] = len(self.str_to_idx)
        self.idx_to_str = dict()
        for key, value in self.str_to_idx.items():
            self.idx_to_str[value] = key

    def encode(self, text, eot=False):
        result = []
        for char in text:
            if char not in self.str_to_idx:
                utf_8_num = list(char.encode("utf-8"))
                for num in utf_8_num:
                    result.append(self.str_to_idx[f'<utf8_{num}>'])
            else:
                result.append(self.str_to_idx[char])
        if eot:
            result.append(self.str_to_idx['<|endoftext|>'])
        return result

    def decode(self, tokens):
        decoded_with_utf_token = [self.idx_to_str[token] for token in tokens]
        decoded_postprocess_utf = []
        utf_tokens = []
        for token in decoded_with_utf_token:
            if token.startswith("<utf8_"):
                utf_num = int(token.replace("<utf8_", "").replace(">", ""))
                utf_tokens.append(utf_num)
            else:
                if utf_tokens:
                    decoded_postprocess_utf.append(bytes(utf_tokens).decode("utf-8"))
                    utf_tokens = []
                decoded_postprocess_utf.append(token)
        if utf_tokens:
            decoded_postprocess_utf.append(bytes(utf_tokens).decode("utf-8"))
            utf_tokens = []
        return "".join(decoded_postprocess_utf)

    def decode_with_utf(self, tokens):
        return "".join([self.idx_to_str[token] for token in tokens])

tokenizer = Tokenizer(unique_chars_in_train_text)  # Initialize the tokenizer; in practice tokenizers are usually built with Byte Pair Encoding, a Unigram Language Model, or similar
text = '言語モデルの勉強は楽しいです。'
print(tokenizer.encode(text))

[3310, 3354, 694, 667, 703, 591, 1027, 1621, 592, 2156, 568, 549, 584, 570, 526]
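
The comment above mentions Byte Pair Encoding as the more common choice in practice. As a rough illustration of its core idea, the sketch below performs a single merge step on a toy character sequence: it finds the most frequent adjacent pair and replaces each occurrence with one merged token. The helper names are hypothetical, and a real BPE tokenizer repeats this step many times over a large corpus and records the merge order.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("abababcab")
pair = most_frequent_pair(tokens)   # ('a', 'b') occurs four times
tokens = merge_pair(tokens, pair)
print(pair, tokens)
```

Each merge shortens the sequence at the cost of one extra vocabulary entry, which is the trade-off a character-level tokenizer like the one above does not make.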


In [6]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1234)

class Head(nn.Module):
    def __init__(self, d_model, d_head):
        super().__init__()
        self.key = nn.Linear(d_model, d_head, bias=False)
        self.query = nn.Linear(d_model, d_head, bias=False)
        self.value = nn.Linear(d_model, d_head, bias=False)
        self.back_to_d_model = nn.Linear(d_head, d_model)


    def forward(self, x):
        # x: (batch_size, sequence_length, d_model)
        B, T, d_model = x.size()
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)

        qk_dot_product = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # QK^T / sqrt(d_head), without relying on a global d_head
        mask = torch.tril(torch.ones((T, T), device=qk_dot_product.device))  # lower-triangular matrix
        qk_dot_product = qk_dot_product.masked_fill(mask == 0, float('-inf'))  # causal masking
        attention_score = torch.softmax(qk_dot_product, dim=-1)
        out = attention_score @ v
        out = self.back_to_d_model(out)
        return out


class AttentionLM(nn.Module):
    def __init__(self, vocab_size, sequence_length, d_model, d_head):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(sequence_length, d_model)
        self.head = Head(d_model, d_head)
        self.unembed = nn.Linear(d_model, vocab_size)
        print('number of parameters:', sum(p.numel() for p in self.parameters()))


    def forward(self, token_indexes):
        # token_indexes: (batch_size, sequence_length)
        B, T = token_indexes.size()
        token_embed = self.embed(token_indexes)
        pos_embed = self.pos_embed(torch.arange(T).to(token_embed.device))
        x = token_embed + pos_embed
        x = self.head(x)
        logits = self.unembed(x)

        return logits

    def loss_per_token(self, token_indexes, targets):
        logits = self(token_indexes)
        # logits: (batch_size, sequence_length, vocab_size)
        # targets: (batch_size, sequence_length)
        batch_size, sequence_length, vocab_size = logits.shape
        loss = F.cross_entropy(
            logits.view(batch_size*sequence_length, vocab_size),
            targets.view(batch_size*sequence_length),
            reduction='none'
            )
        # loss: (batch_size*sequence_length)
        return loss.view(batch_size, sequence_length)

    def loss(self, token_indexes, targets):
        logits = self(token_indexes)
        # logits: (batch_size, sequence_length, vocab_size)
        # targets: (batch_size, sequence_length)
        batch_size, sequence_length, vocab_size = logits.shape
        loss = F.cross_entropy(
            logits.view(batch_size*sequence_length, vocab_size),
            targets.view(batch_size*sequence_length)
            )
        # loss: scalar
        return loss

    def generate(self, token_indexes, max_new_tokens):
        # token_indexes: (batch_size, sequence_length)
        batch_size, sequence_length = token_indexes.shape
        for _ in range(max_new_tokens):
            logits = self(token_indexes)
            # logits: (batch_size, sequence_length, vocab_size)
            next_token_logits = logits[:, -1, :]
            # next_token_logits: (batch_size, vocab_size)
            next_token_probs = F.softmax(next_token_logits, dim=-1)
            # next_token_probs: (batch_size, vocab_size)
            next_token = torch.multinomial(next_token_probs, num_samples=1)
            # next_token: (batch_size, 1)
            token_indexes = torch.cat([token_indexes, next_token], dim=1)
            # token_indexes: (batch_size, sequence_length+1)
        return token_indexes
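
To see what the masking in `Head.forward` does to the attention weights, here is a dependency-free sketch of a causal softmax over a small score matrix. Dropping the future positions before the softmax has the same effect as filling them with `-inf`, since `exp(-inf)` is 0. The scores are made up for illustration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                      # future positions are masked out
        m = max(visible)                            # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

scores = [[0.0, 1.0, 2.0],
          [0.0, 0.0, 5.0],
          [1.0, 1.0, 1.0]]
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

The first row puts all of its weight on position 0, and the large score `5.0` in the second row has no effect because it belongs to a future position.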

In [7]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
vocab_size = len(tokenizer.str_to_idx)
d_model = 4
d_head = 4
sequence_length = 512
attention_lm = AttentionLM(vocab_size, sequence_length, d_model, d_head).to(device)

number of parameters: 4879
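
The parameter count can be reproduced by hand from the layer shapes in `AttentionLM`. In the sketch below, `vocab_size=307` is an inferred value: it is the vocabulary size that makes the formula match the reported 4879, and it is not stated anywhere in the notebook. Since the Japanese tokenizer above emits indices above 3000, this output may come from a run with a smaller, different vocabulary.

```python
def attention_lm_param_count(vocab_size, sequence_length, d_model, d_head):
    """Count the parameters of AttentionLM layer by layer."""
    embed = vocab_size * d_model                 # nn.Embedding(vocab_size, d_model)
    pos_embed = sequence_length * d_model        # nn.Embedding(sequence_length, d_model)
    qkv = 3 * d_model * d_head                   # key / query / value, bias=False
    back = d_head * d_model + d_model            # back_to_d_model, with bias
    unembed = d_model * vocab_size + vocab_size  # nn.Linear(d_model, vocab_size), with bias
    return embed + pos_embed + qkv + back + unembed

# vocab_size=307 is inferred from the reported count, not taken from the notebook
count = attention_lm_param_count(vocab_size=307, sequence_length=512, d_model=4, d_head=4)
print(count)
```

Most of the parameters sit in the embedding tables and the output layer. The attention head itself contributes only 68 of them at `d_model = d_head = 4`.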




In [28]:
mini_train_data_tokens = []
mini_train_data_file_num = 1000
mini_train_data_text = ""
file_count = 0

with gzip.open('train_9.jsonl.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        # Parse each line as JSON
        data = json.loads(line)
        mini_train_data_tokens += tokenizer.encode(data['text'], eot=True)
        file_count += 1
        if file_count == mini_train_data_file_num:
            break

val_data_tokens = []
val_data_file_num = 1
val_data_text = ""
file_count = 0
with gzip.open('validation_0.jsonl.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        # Parse each line as JSON
        data = json.loads(line)
        val_data_tokens += tokenizer.encode(data['text'], eot=True)
        file_count += 1
        if file_count == val_data_file_num:
            break

In [46]:
# Training
optimizer = torch.optim.AdamW(attention_lm.parameters(), lr=1e-2)

batch_size = 10
epochs = 4
training_tokens = 0
print('before training')
val_loss = 0
for i in range(0, len(val_data_tokens), sequence_length+1):
    batch_tokens = val_data_tokens[i:i+sequence_length+1]
    input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
    target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
    with torch.no_grad():
        loss = attention_lm.loss_per_token(input_token_indexes, target_token_indexes)
    val_loss += loss.sum().item()
val_loss = val_loss / len(val_data_tokens)
print(f'val_loss: {val_loss}')
print('start training')
for epoch in range(epochs):
    for i in range(0, len(mini_train_data_tokens), sequence_length+1):
        batch_tokens = mini_train_data_tokens[i:i+sequence_length+1]
        input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
        target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
        loss = attention_lm.loss(input_token_indexes, target_token_indexes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        training_tokens += len(batch_tokens)
        if training_tokens % ((sequence_length+1)*1000) == 0:
            print(f'epoch: {epoch}, loss: {loss.item()}, training_tokens: {training_tokens}')
    val_loss = 0
    for i in range(0, len(val_data_tokens), sequence_length+1):
        batch_tokens = val_data_tokens[i:i+sequence_length+1]
        input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
        target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
        with torch.no_grad():
            loss = attention_lm.loss_per_token(input_token_indexes, target_token_indexes)
        val_loss += loss.sum().item()
    val_loss = val_loss / len(val_data_tokens)
    print(f'epoch: {epoch}, val_loss: {val_loss}')

before training
val_loss: 5.971977072722963
start training
epoch: 0, loss: 5.401193141937256, training_tokens: 513000
epoch: 0, loss: 4.720692157745361, training_tokens: 1026000
epoch: 0, loss: 4.498340606689453, training_tokens: 1539000
epoch: 0, val_loss: 5.643494399883111
epoch: 1, val_loss: 5.549017661390987
epoch: 2, val_loss: 5.529861712615489
epoch: 3, val_loss: 5.510522535951902
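
The validation losses are easier to interpret as perplexity. `F.cross_entropy` returns the average negative log-likelihood in nats, so perplexity is its exponential. The sketch below converts the values logged above. It assumes the logged numbers are per-token averages, which holds only approximately because the loop divides by the total token count.

```python
import math

# Validation losses copied from the log above (before training, then epochs 0 to 3)
val_losses = [5.971977072722963, 5.643494399883111, 5.549017661390987,
              5.529861712615489, 5.510522535951902]

# Perplexity = exp(mean negative log-likelihood in nats)
perplexities = [math.exp(loss) for loss in val_losses]
for loss, ppl in zip(val_losses, perplexities):
    print(f"loss {loss:.3f} -> perplexity {ppl:.1f}")
```

A perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens at each step.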


In [44]:
context_length = 512
input_tokens =  mini_train_data_tokens[:context_length]
target_tokens = mini_train_data_tokens[1:context_length+1]
input_token_indexes = torch.tensor(input_tokens).unsqueeze(0).to(device)
target_token_indexes = torch.tensor(target_tokens).unsqueeze(0).to(device)
attention_lm.loss(input_token_indexes, target_token_indexes).item()

# For comparison, the bi-gram model (16516096 parameters) scores 5.082467079162598

5.264456272125244

In [45]:
context = '人間'
context_token_indexes = torch.tensor(tokenizer.encode(context)).unsqueeze(0).to(device)
generated_tokens = attention_lm.generate(context_token_indexes, max_new_tokens=20)
for token in generated_tokens[0]:
    print(repr(tokenizer.decode([token.item()])))

'人'
'間'
'グ'
'中'
' '
'人'
'm'
'2'
'ッ'
'月'
'ル'
'ビ'
'ス'
'ク'
'た'
'所'
'm'
'を'
'前'
'間'
'ミ'
'コ'


In [75]:
# GPT-2 implementation; code from https://github.com/karpathy/makemore/tree/master
# https://github.com/karpathy/makemore/blob/master/makemore.py
"""
MIT License

Copyright (c) 2022 Andrej Karpathy

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""
import os
import sys
import time
import math
import argparse
from dataclasses import dataclass
from typing import List

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

# -----------------------------------------------------------------------------

@dataclass
class ModelConfig:
    block_size: int = None # length of the input sequences of integers
    vocab_size: int = None # the input integers are in range [0 .. vocab_size -1]
    # parameters below control the sizes of each model slightly differently
    n_layer: int = 4
    n_embd: int = 64
    n_embd2: int = 64
    n_head: int = 4

# -----------------------------------------------------------------------------
# Transformer Language Model (*exactly* as used in GPT-2)

class NewGELU(nn.Module):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k ,v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.c_proj(y)
        return y

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            act     = NewGELU(),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.c_proj(m.act(m.c_fc(x))) # MLP forward

    def forward(self, x):
        # residual connection
        x = x + self.attn(self.ln_1(x))
        # residual connection
        x = x + self.mlpf(self.ln_2(x))
        return x

class Transformer(nn.Module):
    """ Transformer Language Model, exactly as seen in GPT-2 """

    def __init__(self, config):
        super().__init__()
        self.block_size = config.block_size

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # report number of parameters (note we don't count the decoder parameters in lm_head)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters in the transformer: %.2fM" % (n_params/1e6,))

    def get_block_size(self):
        return self.block_size

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

        return logits, loss

## English version

In [76]:
# Download the PTB corpus (train, valid, and test splits)
!wget https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt
!wget https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.valid.txt
!wget https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.test.txt


In [9]:
# Show part of the downloaded PTB corpus
!head ptb.train.txt

 aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter 
 pierre <unk> N years old will join the board as a nonexecutive director nov. N 
 mr. <unk> is chairman of <unk> n.v. the dutch publishing group 
 rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate 
 a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported 
 the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said 
 <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N 
 although preliminary findings wer

In [10]:
# Use only a subset of the data (the first 1,000 lines of each split) for training this time
!head -n 1000 ptb.train.txt > mini_ptb.train.txt
!head -n 1000 ptb.valid.txt > mini_ptb.valid.txt
!head -n 1000 ptb.test.txt > mini_ptb.test.txt

In [1]:
# Read the full training split into ptb_train_text
ptb_train_text = ""
with open('ptb.train.txt') as f:
    ptb_train_text = f.read()

In [2]:
# Read the reduced training split into mini_ptb_train_text
mini_ptb_train_text = ""
with open('mini_ptb.train.txt', 'r') as f:
    mini_ptb_train_text = f.read()

In [4]:
# Split the text into single characters, as one would for Japanese
# Tokenizer
unique_chars_in_train_text = sorted(list(set(ptb_train_text)))
# One character = one token
class Tokenizer:
    def __init__(self, chars):
        self.str_to_idx = dict()
        self.str_to_idx['<|endoftext|>'] = 0
        # utf-8
        for i in range(256):
            if f'<utf8_{i}>' not in self.str_to_idx:
                self.str_to_idx[f'<utf8_{i}>'] = len(self.str_to_idx)
        for char in chars:
            self.str_to_idx[char] = len(self.str_to_idx)
        self.idx_to_str = dict()
        for key, value in self.str_to_idx.items():
            self.idx_to_str[value] = key

    def encode(self, text, eot=False):
        result = []
        for char in text:
            if char not in self.str_to_idx:
                utf_8_num = list(char.encode("utf-8"))
                for num in utf_8_num:
                    result.append(self.str_to_idx[f'<utf8_{num}>'])
            else:
                result.append(self.str_to_idx[char])
        if eot:
            result.append(self.str_to_idx['<|endoftext|>'])
        return result

    def decode(self, tokens):
        decoded_with_utf_token = [self.idx_to_str[token] for token in tokens]
        decoded_postprocess_utf = []
        utf_tokens = []
        for token in decoded_with_utf_token:
            if token.startswith("<utf8_"):
                utf_num = int(token.replace("<utf8_", "").replace(">", ""))
                utf_tokens.append(utf_num)
            else:
                if utf_tokens:
                    decoded_postprocess_utf.append(bytes(utf_tokens).decode("utf-8"))
                    utf_tokens = []
                decoded_postprocess_utf.append(token)
        if utf_tokens:
            decoded_postprocess_utf.append(bytes(utf_tokens).decode("utf-8"))
            utf_tokens = []
        return "".join(decoded_postprocess_utf)

    def decode_with_utf(self, tokens):
        return "".join([self.idx_to_str[token] for token in tokens])

# Initialize the Tokenizer. In practice, tokenizers are usually built with Byte Pair Encoding or a Unigram Language Model.
tokenizer = Tokenizer(unique_chars_in_train_text)
text = 'I want to study language model.'
print(tokenizer.encode(text))

[74, 258, 303, 281, 294, 300, 258, 300, 295, 258, 299, 300, 301, 284, 305, 258, 292, 281, 294, 287, 301, 281, 287, 285, 258, 293, 295, 284, 285, 292, 265]
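The first ID in the output above, 74, is consistent with `I` being absent from the training characters: the tokenizer then falls back to the character's UTF-8 byte (73), which sits at index 74 because index 0 is reserved for `<|endoftext|>`. The sketch below reproduces that ID layout with a toy vocabulary. `build_vocab`, `encode`, and `toy_vocab` are hypothetical names used only for illustration, not part of the notebook.

```python
def build_vocab(chars):
    """Same ID layout as the Tokenizer class: EOT, then 256 byte tokens, then characters."""
    vocab = {"<|endoftext|>": 0}
    for i in range(256):
        vocab[f"<utf8_{i}>"] = len(vocab)  # <utf8_i> gets ID i + 1
    for ch in chars:
        vocab[ch] = len(vocab)             # known characters start at ID 257
    return vocab


def encode(text, vocab):
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            # Unknown character: emit one token per UTF-8 byte
            ids.extend(vocab[f"<utf8_{b}>"] for b in ch.encode("utf-8"))
    return ids


# Toy vocabulary (an assumption): a few lowercase characters only
toy_vocab = build_vocab(sorted(set("a want\n")))

ids_upper = encode("I", toy_vocab)    # 'I' is unknown, so it becomes one byte token
ids_emoji = encode("😀", toy_vocab)   # a 4-byte character becomes four byte tokens
ids_known = encode("\n", toy_vocab)   # first known character in sorted order
print(ids_upper, ids_emoji, ids_known)
```

Because every byte value has its own token, any string can be encoded, at the cost of longer sequences for characters outside the training set.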


In [5]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1234)

class Head(nn.Module):
    def __init__(self, d_model, d_head):
        super().__init__()
        self.key = nn.Linear(d_model, d_head, bias=False)
        self.query = nn.Linear(d_model, d_head, bias=False)
        self.value = nn.Linear(d_model, d_head, bias=False)
        self.back_to_d_model = nn.Linear(d_head, d_model)


    def forward(self, x):
        # x: (batch_size, sequence_length, d_model)
        B, T, d_model = x.size()
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)

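        qk_dot_product = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # QK^T / sqrt(d_head); d_head is read from k instead of a global variable
        mask = torch.tril(torch.ones((T, T))).to(qk_dot_product.device)  # lower-triangular matrix
        qk_dot_product = qk_dot_product.masked_fill(mask == 0, float('-inf'))  # mask out future positions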
        attention_score = torch.softmax(qk_dot_product, dim=-1)
        out = attention_score @ v
        out = self.back_to_d_model(out)
        return out


class AttentionLM(nn.Module):
    def __init__(self, vocab_size, sequence_length, d_model, d_head):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(sequence_length, d_model)
        self.head = Head(d_model, d_head)
        self.unembed = nn.Linear(d_model, vocab_size)
        print('number of parameters:', sum(p.numel() for p in self.parameters()))


    def forward(self, token_indexes):
        # token_indexes: (batch_size, sequence_length)
        B, T = token_indexes.size()
        token_embed = self.embed(token_indexes)
        pos_embed = self.pos_embed(torch.arange(T).to(token_embed.device))
        x = token_embed + pos_embed
        x = self.head(x)
        logits = self.unembed(x)

        return logits

    def loss_per_token(self, token_indexes, targets):
        logits = self(token_indexes)
        # logits: (batch_size, sequence_length, vocab_size)
        # targets: (batch_size, sequence_length)
        batch_size, sequence_length, vocab_size = logits.shape
        loss = F.cross_entropy(
            logits.view(batch_size*sequence_length, vocab_size),
            targets.view(batch_size*sequence_length),
            reduction='none'
            )
        # loss: (batch_size*sequence_length)
        return loss.view(batch_size, sequence_length)

    def loss(self, token_indexes, targets):
        logits = self(token_indexes)
        # logits: (batch_size, sequence_length, vocab_size)
        # targets: (batch_size, sequence_length)
        batch_size, sequence_length, vocab_size = logits.shape
        loss = F.cross_entropy(
            logits.view(batch_size*sequence_length, vocab_size),
            targets.view(batch_size*sequence_length)
            )
        # loss: scalar
        return loss

    def generate(self, token_indexes, max_new_tokens):
        # token_indexes: (batch_size, sequence_length)
        batch_size, sequence_length = token_indexes.shape
        for _ in range(max_new_tokens):
            logits = self(token_indexes)
            # logits: (batch_size, sequence_length, vocab_size)
            next_token_logits = logits[:, -1, :]
            # next_token_logits: (batch_size, vocab_size)
            next_token_probs = F.softmax(next_token_logits, dim=-1)
            # next_token_probs: (batch_size, vocab_size)
            next_token = torch.multinomial(next_token_probs, num_samples=1)
            # next_token: (batch_size, 1)
            token_indexes = torch.cat([token_indexes, next_token], dim=1)
            # token_indexes: (batch_size, sequence_length+1)
        return token_indexes
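To make the effect of the causal mask in `Head.forward` concrete, here is a minimal sketch on random tensors. The sizes `T = 4` and `d_head = 8` are arbitrary assumptions chosen for readability.

```python
import torch

torch.manual_seed(0)
T, d_head = 4, 8                      # toy sizes (assumptions)
q = torch.randn(1, T, d_head)
k = torch.randn(1, T, d_head)

# Scaled dot-product scores, shape (1, T, T)
scores = q @ k.transpose(-2, -1) / (d_head ** 0.5)

# Lower-triangular mask: position t may only attend to positions <= t
mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float("-inf"))

# softmax turns -inf into a weight of exactly 0
weights = torch.softmax(scores, dim=-1)
print(weights[0])
```

Each row of `weights` sums to 1 and has zeros above the diagonal, so the first position attends only to itself.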

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
vocab_size = len(tokenizer.str_to_idx)
d_model = 4
d_head = 4
sequence_length = 512
attention_lm = AttentionLM(vocab_size, sequence_length, d_model, d_head).to(device)

number of parameters: 4879
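The reported total can be reproduced by hand. The sketch below assumes `vocab_size = 307` (1 end-of-text token + 256 byte tokens + 50 distinct characters). That value is inferred from the printed total and is not shown in the notebook. `attention_lm_param_count` is a hypothetical helper.

```python
def attention_lm_param_count(vocab_size, sequence_length, d_model, d_head):
    embed = vocab_size * d_model                 # nn.Embedding(vocab_size, d_model)
    pos_embed = sequence_length * d_model        # nn.Embedding(sequence_length, d_model)
    qkv = 3 * d_model * d_head                   # key, query, value (bias=False)
    back = d_head * d_model + d_model            # back_to_d_model (with bias)
    unembed = d_model * vocab_size + vocab_size  # nn.Linear(d_model, vocab_size)
    return embed + pos_embed + qkv + back + unembed


# vocab_size=307 is an assumption inferred from the printed total
total = attention_lm_param_count(vocab_size=307, sequence_length=512, d_model=4, d_head=4)
print(total)
```

With `d_model = 4`, most of the parameters sit in the embedding tables and the unembedding layer rather than in the attention head itself.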


In [7]:
# Tokenize the train, valid, and test data
mini_ptb_train_tokens = []
mini_ptb_valid_tokens = []
mini_ptb_test_tokens = []
with open("mini_ptb.train.txt", "r") as f:
    mini_ptb_train_text = f.read()
    mini_ptb_train_tokens = tokenizer.encode(mini_ptb_train_text, eot=True)
with open("mini_ptb.valid.txt", "r") as f:
    mini_ptb_valid_text = f.read()
    mini_ptb_valid_tokens = tokenizer.encode(mini_ptb_valid_text, eot=True)
with open("mini_ptb.test.txt", "r") as f:
    mini_ptb_test_text = f.read()
    mini_ptb_test_tokens = tokenizer.encode(mini_ptb_test_text, eot=True)

In [8]:
# Training
optimizer = torch.optim.AdamW(attention_lm.parameters(), lr=1e-2)

batch_size = 10
epochs = 4
training_tokens = 0
print('before training')
val_loss = 0
for i in range(0, len(mini_ptb_valid_tokens), sequence_length+1):
    batch_tokens = mini_ptb_valid_tokens[i:i+sequence_length+1]
    input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
    target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
    with torch.no_grad():
        loss = attention_lm.loss_per_token(input_token_indexes, target_token_indexes)
    val_loss += loss.sum().item()
val_loss = val_loss / len(mini_ptb_valid_tokens)
print(f'val_loss: {val_loss}')
print('start training')
for epoch in range(epochs):
    for i in range(0, len(mini_ptb_train_tokens), sequence_length+1):
        batch_tokens = mini_ptb_train_tokens[i:i+sequence_length+1]
        input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
        target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
        loss = attention_lm.loss(input_token_indexes, target_token_indexes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        training_tokens += len(batch_tokens)
        if training_tokens % ((sequence_length+1)*1000) == 0:
            print(f'epoch: {epoch}, loss: {loss.item()}, training_tokens: {training_tokens}')
    val_loss = 0
    for i in range(0, len(mini_ptb_valid_tokens), sequence_length+1):
        batch_tokens = mini_ptb_valid_tokens[i:i+sequence_length+1]
        input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
        target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
        with torch.no_grad():
            loss = attention_lm.loss_per_token(input_token_indexes, target_token_indexes)
        val_loss += loss.sum().item()
    val_loss = val_loss / len(mini_ptb_valid_tokens)
    print(f'epoch: {epoch}, val_loss: {val_loss}')

before training
val_loss: 5.794701402065326
start training
epoch: 0, val_loss: 2.756043357819885
epoch: 1, val_loss: 2.6967229926788168
epoch: 2, val_loss: 2.6541151915885393
epoch: 3, val_loss: 2.57906783437806
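One way to read these numbers: a model that spreads probability uniformly over the vocabulary has a cross-entropy of ln(vocab_size), and exp(loss) gives the per-character perplexity. The sketch assumes `vocab_size = 307`, a value inferred from the parameter count rather than printed in the notebook.

```python
import math

vocab_size = 307                       # assumption, inferred from the parameter count
uniform_loss = math.log(vocab_size)    # loss of a uniform guess over the vocabulary

final_val_loss = 2.57906783437806      # epoch 3 value from the log above
perplexity = math.exp(final_val_loss)  # effective number of equally likely characters

print(f"uniform baseline: {uniform_loss:.4f}")
print(f"perplexity after training: {perplexity:.2f}")
```

The pre-training validation loss (about 5.79) is close to the uniform baseline, which is what one would expect from a freshly initialized model.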


In [9]:
context_length = 512
input_tokens =  mini_ptb_train_tokens[:context_length]
target_tokens = mini_ptb_train_tokens[1:context_length+1]
input_token_indexes = torch.tensor(input_tokens).unsqueeze(0).to(device)
target_token_indexes = torch.tensor(target_tokens).unsqueeze(0).to(device)
attention_lm.loss(input_token_indexes, target_token_indexes).item()

# For reference, the bi-gram model (16,516,096 parameters) scored 5.082467079162598

2.767530679702759

In [12]:
context = 'I want'
context_token_indexes = torch.tensor(tokenizer.encode(context)).unsqueeze(0).to(device)
generated_tokens = attention_lm.generate(context_token_indexes, max_new_tokens=20)
for token in generated_tokens[0]:
    print(repr(tokenizer.decode([token.item()])))

'I'
' '
'w'
'a'
'n'
't'
's'
'd'
'.'
'l'
'c'
'a'
't'
'e'
'o'
'c'
'i'
'n'
'i'
'n'
'd'
'.'
'i'
's'
'f'
' '
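`generate` feeds the whole growing sequence back into the model, so once the sequence exceeds `sequence_length` the position-embedding lookup would go out of range. A possible guard is to crop the context to the last `sequence_length` tokens before each forward pass. The sketch below shows this with `TinyLM`, a hypothetical stand-in for `AttentionLM` that keeps only the embeddings. `generate_with_crop` is likewise an illustrative name.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class TinyLM(nn.Module):
    """Hypothetical stand-in for AttentionLM: token + position embeddings only."""

    def __init__(self, vocab_size, sequence_length, d_model):
        super().__init__()
        self.sequence_length = sequence_length
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(sequence_length, d_model)
        self.unembed = nn.Linear(d_model, vocab_size)

    def forward(self, token_indexes):
        T = token_indexes.size(1)
        pos = torch.arange(T, device=token_indexes.device)
        return self.unembed(self.embed(token_indexes) + self.pos_embed(pos))


@torch.no_grad()
def generate_with_crop(model, token_indexes, max_new_tokens):
    for _ in range(max_new_tokens):
        # Keep only the most recent tokens so positions stay within the table
        context = token_indexes[:, -model.sequence_length:]
        next_token_logits = model(context)[:, -1, :]
        next_token_probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(next_token_probs, num_samples=1)
        token_indexes = torch.cat([token_indexes, next_token], dim=1)
    return token_indexes


torch.manual_seed(0)
model = TinyLM(vocab_size=10, sequence_length=8, d_model=4)
start = torch.zeros((1, 6), dtype=torch.long)
out = generate_with_crop(model, start, max_new_tokens=5)  # total length 11 > 8
print(out.shape)
```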


In [13]:
# Try a larger dataset
# Tokenize the train, valid, and test data
ptb_train_tokens = []
ptb_valid_tokens = []
ptb_test_tokens = []
with open("ptb.train.txt", "r") as f:
    ptb_train_text = f.read()
    ptb_train_tokens = tokenizer.encode(ptb_train_text, eot=True)
with open("ptb.valid.txt", "r") as f:
    ptb_valid_text = f.read()
    ptb_valid_tokens = tokenizer.encode(ptb_valid_text, eot=True)
with open("ptb.test.txt", "r") as f:
    ptb_test_text = f.read()
    ptb_test_tokens = tokenizer.encode(ptb_test_text, eot=True)

In [38]:
# Check the lengths of ptb_train_tokens and mini_ptb_train_tokens
print(len(ptb_train_tokens))
print(len(mini_ptb_train_tokens))

5101619
120541


In [18]:
# Training
optimizer = torch.optim.AdamW(attention_lm.parameters(), lr=1e-3)

batch_size = 10
epochs = 6
training_tokens = 0
print('before training')
val_loss = 0
for i in range(0, len(ptb_valid_tokens), sequence_length+1):
    batch_tokens = ptb_valid_tokens[i:i+sequence_length+1]
    input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
    target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
    with torch.no_grad():
        loss = attention_lm.loss_per_token(input_token_indexes, target_token_indexes)
    val_loss += loss.sum().item()
val_loss = val_loss / len(ptb_valid_tokens)
print(f'val_loss: {val_loss}')
print('start training')
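for epoch in range(epochs):
    # Note: this loop iterates over mini_ptb_train_tokens, so only validation uses the full PTB split;
    # swap in ptb_train_tokens here to train on the full corpus.
    for i in range(0, len(mini_ptb_train_tokens), sequence_length+1):
        batch_tokens = mini_ptb_train_tokens[i:i+sequence_length+1]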
        input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
        target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
        loss = attention_lm.loss(input_token_indexes, target_token_indexes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        training_tokens += len(batch_tokens)
        if training_tokens % ((sequence_length+1)*1000) == 0:
            print(f'epoch: {epoch}, loss: {loss.item()}, training_tokens: {training_tokens}')
    val_loss = 0
    for i in range(0, len(ptb_valid_tokens), sequence_length+1):
        batch_tokens = ptb_valid_tokens[i:i+sequence_length+1]
        input_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, :-1].to(device)
        target_token_indexes = torch.tensor(batch_tokens).unsqueeze(0)[:, 1:].to(device)
        with torch.no_grad():
            loss = attention_lm.loss_per_token(input_token_indexes, target_token_indexes)
        val_loss += loss.sum().item()
    val_loss = val_loss / len(ptb_valid_tokens)
    print(f'epoch: {epoch}, val_loss: {val_loss}')

before training
val_loss: 2.5236325281355274
start training
epoch: 0, val_loss: 2.5177860912152266
epoch: 1, val_loss: 2.516481531715575
epoch: 2, val_loss: 2.5153543035255805
epoch: 3, val_loss: 2.5143319756888522
epoch: 4, val_loss: 2.5133876746482136
epoch: 5, val_loss: 2.5125052989867607


In [31]:
context = 'I'
context_token_indexes = torch.tensor(tokenizer.encode(context)).unsqueeze(0).to(device)
generated_tokens = attention_lm.generate(context_token_indexes, max_new_tokens=4)
for token in generated_tokens[0]:
    print(repr(tokenizer.decode([token.item()])))

'I'
'h'
'i'
'g'
'i'


In [96]:
# GPT-2 implementation; code from https://github.com/karpathy/makemore/tree/master
# https://github.com/karpathy/makemore/blob/master/makemore.py
"""
MIT License

Copyright (c) 2022 Andrej Karpathy

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""
import os
import sys
import time
import math
import argparse
from dataclasses import dataclass
from typing import List

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

# -----------------------------------------------------------------------------

@dataclass
class ModelConfig:
    block_size: int = None # length of the input sequences of integers
    vocab_size: int = None # the input integers are in range [0 .. vocab_size -1]
    # parameters below control the sizes of each model slightly differently
    n_layer: int = 4
    n_embd: int = 64
    n_embd2: int = 64
    n_head: int = 4

# -----------------------------------------------------------------------------
# Transformer Language Model (*exactly* as used in GPT-2)

class NewGELU(nn.Module):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k ,v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.c_proj(y)
        return y

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            act     = NewGELU(),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.c_proj(m.act(m.c_fc(x))) # MLP forward

    def forward(self, x):
        # residual connection
        x = x + self.attn(self.ln_1(x))
        # residual connection
        x = x + self.mlpf(self.ln_2(x))
        return x

class Transformer(nn.Module):
    """ Transformer Language Model, exactly as seen in GPT-2 """

    def __init__(self, config):
        super().__init__()
        self.block_size = config.block_size

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # report number of parameters (note we don't count the decoder parameters in lm_head)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("トランスフォーマーのパラメーター数: %.2fM" % (n_params/1e6,))

    def get_block_size(self):
        return self.block_size

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

        return logits, loss
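The `view(...).transpose(1, 2)` pattern in `CausalSelfAttention.forward` only reshuffles the embedding dimension into heads. A small sketch with made-up sizes (`B`, `T`, `C`, `n_head` are assumptions) shows that head `h` receives channels `h*hs` to `(h+1)*hs` and that the merge step restores the original tensor.

```python
import torch

B, T, C, n_head = 2, 5, 64, 4          # toy sizes (assumptions)
hs = C // n_head                        # per-head size

x = torch.arange(B * T * C, dtype=torch.float32).view(B, T, C)

# Split: (B, T, C) -> (B, T, nh, hs) -> (B, nh, T, hs)
heads = x.view(B, T, n_head, hs).transpose(1, 2)

# Merge: (B, nh, T, hs) -> (B, T, nh, hs) -> (B, T, C)
merged = heads.transpose(1, 2).contiguous().view(B, T, C)

print(heads.shape, merged.shape)
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, which `view` cannot reshape directly.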

In [35]:
# Reduce the data even further
!head -n 10 mini_ptb.train.txt > mini_mini_ptb.train.txt
!head -n 10 mini_ptb.valid.txt > mini_mini_ptb.valid.txt
!head -n 10 mini_ptb.test.txt > mini_mini_ptb.test.txt

In [97]:
# Read the data into mini_mini_ptb_train_text
mini_mini_ptb_train_text = ""
with open('mini_mini_ptb.train.txt', 'r') as f:
    mini_mini_ptb_train_text = f.read()

In [99]:
# Split the text into single characters, as one would for Japanese
# Tokenizer
# One character = one token
class Tokenizer:
    def __init__(self, chars):
        self.str_to_idx = dict()
        self.str_to_idx['<|endoftext|>'] = 0
        # utf-8
        for i in range(256):
            if f'<utf8_{i}>' not in self.str_to_idx:
                self.str_to_idx[f'<utf8_{i}>'] = len(self.str_to_idx)
        for char in chars:
            self.str_to_idx[char] = len(self.str_to_idx)
        self.idx_to_str = dict()
        for key, value in self.str_to_idx.items():
            self.idx_to_str[value] = key

    def encode(self, text, eot=False):
        result = []
        for char in text:
            if char not in self.str_to_idx:
                utf_8_num = list(char.encode("utf-8"))
                for num in utf_8_num:
                    result.append(self.str_to_idx[f'<utf8_{num}>'])
            else:
                result.append(self.str_to_idx[char])
        if eot:
            result.append(self.str_to_idx['<|endoftext|>'])
        return result

    def decode(self, tokens):
        decoded_with_utf_token = [self.idx_to_str[token] for token in tokens]
        decoded_postprocess_utf = []
        utf_tokens = []
        for token in decoded_with_utf_token:
            if token.startswith("<utf8_"):
                utf_num = int(token.replace("<utf8_", "").replace(">", ""))
                utf_tokens.append(utf_num)
            else:
                if utf_tokens:
                    decoded_postprocess_utf.append(bytes(utf_tokens).decode("utf-8"))
                    utf_tokens = []
                decoded_postprocess_utf.append(token)
        if utf_tokens:
            decoded_postprocess_utf.append(bytes(utf_tokens).decode("utf-8"))
            utf_tokens = []
        return "".join(decoded_postprocess_utf)

    def decode_with_utf(self, tokens):
        return "".join([self.idx_to_str[token] for token in tokens])

#tokenizer = Tokenizer(unique_chars_in_train_text) # Initialize the Tokenizer; in practice, Byte Pair Encoding or a Unigram Language Model is usually used
#text = 'I am fine, thank you.'
#print(tokenizer.encode(text))

In [100]:
# Initialize the Tokenizer
unique_chars_in_train_text = sorted(list(set(ptb_train_text)))
tokenizer = Tokenizer(unique_chars_in_train_text)

In [101]:
# Initialize the GPT-2 model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
block_size = 128
vocab_size = len(tokenizer.str_to_idx)
config = ModelConfig(block_size=block_size, vocab_size=vocab_size)
transformer = Transformer(config).to(device)

# Build the dataset
class TextDataset(Dataset):
    def __init__(self, text, tokenizer, block_size):
        self.tokenizer = tokenizer
        self.block_size = block_size
        self.tokens = self.tokenizer.encode(text, eot=True)

    def __len__(self):
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        return torch.tensor(self.tokens[idx:idx+self.block_size]), torch.tensor(self.tokens[idx+1:idx+self.block_size+1])

train_dataset = TextDataset(mini_mini_ptb_train_text, tokenizer, block_size)
valid_dataset = TextDataset(mini_ptb_valid_text, tokenizer, block_size)
test_dataset = TextDataset(mini_ptb_test_text, tokenizer, block_size)

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=2, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)

トランスフォーマーのパラメーター数: 0.23M
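`TextDataset` yields overlapping windows: sample `idx` is `block_size` tokens starting at `idx`, and its target is the same window shifted one position to the right. The sketch below mirrors that indexing on a plain list. `tokens`, `window`, and the sizes are illustrative assumptions.

```python
tokens = list(range(10))   # stand-in for tokenizer.encode(text, eot=True)
block_size = 4


def window(idx):
    """Same slicing as TextDataset.__getitem__, without tensors."""
    inputs = tokens[idx:idx + block_size]
    targets = tokens[idx + 1:idx + block_size + 1]
    return inputs, targets


num_samples = len(tokens) - block_size   # same formula as TextDataset.__len__
first = window(0)
last = window(num_samples - 1)
print(num_samples, first, last)
```

Consecutive samples differ by a single token, so one epoch over this dataset shows the model each token roughly `block_size` times.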


In [102]:
# Training loop
def train(model, dataloader, optimizer, epochs, device):
    model.train()  # put the model in training mode
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (input_ids, target_ids) in enumerate(dataloader):
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)

            # Forward pass (the model also returns the loss when targets are given)
            logits, loss = model(input_ids, targets=target_ids)

            # Backward pass and parameter update
            optimizer.zero_grad()  # reset the gradients
            loss.backward()  # backpropagation
            optimizer.step()  # update the parameters

            running_loss += loss.item()

            if i % 100 == 99:  # report progress every 100 steps
                print(f"Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(dataloader)}], Loss: {running_loss / 100:.4f}")
                running_loss = 0.0

    print("学習完了")

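# Example setup for the model, data loader, and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model configuration
config = ModelConfig(
    block_size=128,  # sequence length
    vocab_size=50257,  # vocabulary size (GPT-2's BPE vocabulary size, larger than the character-level tokenizer's vocabulary)
    n_layer=6,  # reduced number of layers
    n_embd=512,  # reduced embedding dimension
    n_head=8  # reduced number of attention heads
)
model = Transformer(config).to(device)

# Batch size and learning rate
batch_size = 16  # not used below: train_loader was built with batch_size=2
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # slightly lower learning rate

# Data loader setup (the dataset must be prepared beforehand)
# dataset = YourDataset()
# dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Start training for the given number of epochs
train(model, train_loader, optimizer, epochs=1, device=device)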


トランスフォーマーのパラメーター数: 44.71M
Epoch [1/1], Step [100/586], Loss: 3.2312
Epoch [1/1], Step [200/586], Loss: 2.2020
Epoch [1/1], Step [300/586], Loss: 1.8528
Epoch [1/1], Step [400/586], Loss: 1.2408
Epoch [1/1], Step [500/586], Loss: 0.6762
学習完了
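Both parameter counts printed by `Transformer.__init__` (0.23M for the default config, 44.71M for this one) can be checked by hand. The sketch below counts only `self.transformer` parameters, as the class does, so `lm_head` and the causal-mask buffer are excluded. `gpt2_param_count` is a hypothetical helper, and `vocab_size = 307` for the default config is an assumption inferred from the earlier outputs.

```python
def gpt2_param_count(vocab_size, block_size, n_layer, n_embd):
    wte = vocab_size * n_embd            # token embedding
    wpe = block_size * n_embd            # position embedding
    # Per block: c_attn (3n^2 + 3n) + attn c_proj (n^2 + n)
    #          + c_fc (4n^2 + 4n) + mlp c_proj (4n^2 + n) + two LayerNorms (4n)
    per_block = 12 * n_embd ** 2 + 13 * n_embd
    ln_f = 2 * n_embd                    # final LayerNorm (weight + bias)
    return wte + wpe + n_layer * per_block + ln_f


# Default ModelConfig with the character-level tokenizer (vocab_size=307 is an assumption)
small = gpt2_param_count(vocab_size=307, block_size=128, n_layer=4, n_embd=64)
# The configuration used in this cell
large = gpt2_param_count(vocab_size=50257, block_size=128, n_layer=6, n_embd=512)

print(f"{small / 1e6:.2f}M, {large / 1e6:.2f}M")
```

In the larger configuration the token embedding alone accounts for about 25.7M of the parameters, because `vocab_size=50257` is much larger than the character-level tokenizer requires.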


In [None]:
# Model evaluation
def evaluate(model, dataloader, device):
    model.eval()  # put the model in evaluation mode
    total_loss = 0.0
    with torch.no_grad():
        for input_ids, target_ids in dataloader:
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)

            # Forward pass
            logits, loss = model(input_ids, targets=target_ids)
            total_loss += loss.item()

    return total_loss / len(dataloader)

# Evaluate on the validation set
val_loss = evaluate(model, valid_loader, device)
print(f"Validation Loss: {val_loss:.4f}")

KeyboardInterrupt: 

In [None]:
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

# 1. Load the dataset and the metric
# `from datasets import load_metric` fails on this installation with
# "ImportError: cannot import name 'load_metric' from 'datasets'",
# so the metric is loaded through the separate `evaluate` package (`pip install evaluate`).
dataset = load_dataset("glue", "cola")
metric = evaluate.load("glue", "cola")

# 2. Prepare the tokenizer and the model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 3. Preprocess the data
def preprocess_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)

# 4. Define the evaluation function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # the Trainer passes NumPy arrays here, not tensors
    return metric.compute(predictions=predictions, references=labels)

# 5. Prepare the training settings
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# 6. Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    compute_metrics=compute_metrics,
)

# 7. Train and evaluate the model
trainer.train()
eval_results = trainer.evaluate()

# Show the results
print(f"Evaluation Results: {eval_results}")

In [16]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from GPT2 import Transformer, ModelConfig
from Tokenizer import Tokenizer

# Dataset definition
class TextDataset(Dataset):
    def __init__(self, text, tokenizer, block_size):
        self.tokenizer = tokenizer
        self.block_size = block_size
        self.tokens = self.tokenizer.encode(text, eot=True)

    def __len__(self):
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        return torch.tensor(self.tokens[idx:idx+self.block_size]), torch.tensor(self.tokens[idx+1:idx+self.block_size+1])

# Load the data
with open("mini_ptb.train.txt", "r") as f:
    mini_ptb_train_text = f.read()
with open("mini_ptb.valid.txt", "r") as f:
    mini_ptb_valid_text = f.read()
with open("mini_ptb.test.txt", "r") as f:
    mini_ptb_test_text = f.read()

# Initialize the Tokenizer
unique_chars_in_train_text = sorted(list(set(mini_ptb_train_text)))
tokenizer = Tokenizer(unique_chars_in_train_text)

# Initialize the model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
block_size = 128
vocab_size = len(tokenizer.str_to_idx)
config = ModelConfig(
    block_size=128,    # sequence length
    vocab_size= vocab_size,  # vocabulary size (here, that of the character-level tokenizer)
    n_layer=12,        # number of layers
    n_embd=768,        # embedding dimension
    n_head=12,         # number of attention heads
    dropout=0.1        # dropout rate
)

transformer = Transformer(config).to(device)

# Build the datasets
train_dataset = TextDataset(mini_ptb_train_text, tokenizer, block_size)
valid_dataset = TextDataset(mini_ptb_valid_text, tokenizer, block_size)
test_dataset = TextDataset(mini_ptb_test_text, tokenizer, block_size)

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=2, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)

# Training loop
def train(model, dataloader, optimizer, epochs, device):
    model.train()  # put the model in training mode
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (input_ids, target_ids) in enumerate(dataloader):
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)

            # Forward pass (the model also returns the loss when targets are given)
            logits, loss = model(input_ids, targets=target_ids)

            # Backward pass and parameter update
            optimizer.zero_grad()  # reset the gradients
            loss.backward()  # backpropagation
            optimizer.step()  # update the parameters

            running_loss += loss.item()

            if i % 100 == 99:  # report progress every 100 steps
                print(f"Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(dataloader)}], Loss: {running_loss / 100:.4f}")
                running_loss = 0.0

    print("学習完了")


  return torch._C._cuda_getDeviceCount() > 0


トランスフォーマーの総パラメータ数: 85.62M


In [2]:
optimizer = torch.optim.AdamW(transformer.parameters(), lr=3e-4)
train(transformer, train_loader, optimizer, epochs=1, device=device)

# Model evaluation
def evaluate(model, dataloader, device):
    model.eval()  # put the model in evaluation mode
    total_loss = 0.0
    with torch.no_grad():
        for input_ids, target_ids in dataloader:
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)

            # Forward pass
            logits, loss = model(input_ids, targets=target_ids)
            total_loss += loss.item()

    return total_loss / len(dataloader)

# Evaluate on the validation set
val_loss = evaluate(transformer, valid_loader, device)
print(f"Validation Loss: {val_loss:.4f}")

Epoch [1/1], Step [100/60207], Loss: 2.6082
Epoch [1/1], Step [200/60207], Loss: 2.3724
Epoch [1/1], Step [300/60207], Loss: 2.3356
Epoch [1/1], Step [400/60207], Loss: 2.3130
Epoch [1/1], Step [500/60207], Loss: 2.2551
Epoch [1/1], Step [600/60207], Loss: 2.1806
Epoch [1/1], Step [700/60207], Loss: 2.1045
Epoch [1/1], Step [800/60207], Loss: 2.0818
Epoch [1/1], Step [900/60207], Loss: 2.0484
Epoch [1/1], Step [1000/60207], Loss: 2.0207
Epoch [1/1], Step [1100/60207], Loss: 1.9827
Epoch [1/1], Step [1200/60207], Loss: 1.9540
Epoch [1/1], Step [1300/60207], Loss: 1.9348
Epoch [1/1], Step [1400/60207], Loss: 1.8972
Epoch [1/1], Step [1500/60207], Loss: 1.8565
Epoch [1/1], Step [1600/60207], Loss: 1.8361
Epoch [1/1], Step [1700/60207], Loss: 1.8281
Epoch [1/1], Step [1800/60207], Loss: 1.7979
Epoch [1/1], Step [1900/60207], Loss: 1.7594
Epoch [1/1], Step [2000/60207], Loss: 1.7436
Epoch [1/1], Step [2100/60207], Loss: 1.7574
Epoch [1/1], Step [2200/60207], Loss: 1.7226
Epoch [1/1], Step [
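
The loss values in the log are easier to interpret as perplexity. The sketch below assumes the model returns a mean cross-entropy in nats (the usual PyTorch convention; the model code is not shown here), in which case perplexity is simply `exp(loss)`. Because the tokenizer in this notebook appears to be character-level, the result would be a per-character perplexity. The helper name `perplexity` is hypothetical.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)


# Loss values copied from the first and the last complete line of the log above.
for step, loss in [(100, 2.6082), (2200, 1.7226)]:
    print(f"step {step}: loss={loss:.4f}, perplexity={perplexity(loss):.2f}")
```

A perplexity of 1 would mean the model predicts every next token with certainty, so lower is better.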

In [3]:
# Save the model weights
torch.save(transformer.state_dict(), "transformer.pth")

In [None]:
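# Save the config together with the weights
import torch

config = ModelConfig(
    block_size=128,    # sequence length
    vocab_size=vocab_size,  # vocabulary size (e.g. the GPT vocabulary)
    n_layer=12,        # number of layers
    n_embd=768,        # embedding dimension
    n_head=12,         # number of attention heads
    dropout=0.1        # dropout rate
)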

save_data = {
    "model_state_dict": transformer.state_dict(),
    "config": config
}

torch.save(save_data, "transformer_with_config.pth")
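
The cell above stores the `ModelConfig` instance itself, so `torch.save` pickles it and loading the file later requires the same class to be importable. A minimal sketch of an alternative is to store the config as a plain dict and rebuild it on load. `TinyConfig` and `build_model` are hypothetical stand-ins for `ModelConfig` and `Transformer`, and the sketch assumes the config is a dataclass, which this notebook does not confirm.

```python
import os
import tempfile
from dataclasses import asdict, dataclass

import torch
from torch import nn


@dataclass
class TinyConfig:  # hypothetical stand-in for ModelConfig
    n_in: int = 4
    n_out: int = 2


def build_model(cfg: TinyConfig) -> nn.Module:
    """Hypothetical stand-in for Transformer(config)."""
    return nn.Linear(cfg.n_in, cfg.n_out)


cfg = TinyConfig()
model = build_model(cfg)

# Store the config as a plain dict so loading does not unpickle a custom class.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save({"model_state_dict": model.state_dict(), "config": asdict(cfg)}, path)

# weights_only=True restricts unpickling to tensors and basic containers.
checkpoint = torch.load(path, map_location="cpu", weights_only=True)
restored_cfg = TinyConfig(**checkpoint["config"])
restored = build_model(restored_cfg)
restored.load_state_dict(checkpoint["model_state_dict"])
```

Rebuilding the model from the stored config also guards against loading weights into a model created with different hyperparameters.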

In [20]:
print(transformer.state_dict().keys())

odict_keys(['transformer.wte.weight', 'transformer.wpe.weight', 'transformer.h.0.ln_1.weight', 'transformer.h.0.ln_1.bias', 'transformer.h.0.attn.bias', 'transformer.h.0.attn.c_attn.weight', 'transformer.h.0.attn.c_attn.bias', 'transformer.h.0.attn.c_proj.weight', 'transformer.h.0.attn.c_proj.bias', 'transformer.h.0.ln_2.weight', 'transformer.h.0.ln_2.bias', 'transformer.h.0.mlp.0.weight', 'transformer.h.0.mlp.0.bias', 'transformer.h.0.mlp.3.weight', 'transformer.h.0.mlp.3.bias', 'transformer.h.1.ln_1.weight', 'transformer.h.1.ln_1.bias', 'transformer.h.1.attn.bias', 'transformer.h.1.attn.c_attn.weight', 'transformer.h.1.attn.c_attn.bias', 'transformer.h.1.attn.c_proj.weight', 'transformer.h.1.attn.c_proj.bias', 'transformer.h.1.ln_2.weight', 'transformer.h.1.ln_2.bias', 'transformer.h.1.mlp.0.weight', 'transformer.h.1.mlp.0.bias', 'transformer.h.1.mlp.3.weight', 'transformer.h.1.mlp.3.bias', 'transformer.h.2.ln_1.weight', 'transformer.h.2.ln_1.bias', 'transformer.h.2.attn.bias', 'tran

In [33]:
# Collect the decoded tokens in a list
context = 'wa'
context_token_indexes = torch.tensor(tokenizer.encode(context)).unsqueeze(0).to(device)
generated_tokens = transformer.generate(context_token_indexes, max_new_tokens=8, temperature=0.2, top_k=40)
generated_text = []

for token in generated_tokens[0]:
    generated_text.append(tokenizer.decode([token.item()]))

# Print the tokens on one line
print(' '.join(repr(token) for token in generated_text))

# Print one token per line
#for token in generated_tokens[0]:
#    print(repr(tokenizer.decode([token.item()])))

'w' 'a' 'r' 'd' ' ' 't' 'o' ' ' 't' 'h'
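
The `temperature` and `top_k` arguments passed to `generate` control how the next token is drawn. The notebook's own `generate` implementation is not shown, so the following is only a sketch of one common way to combine the two: divide the logits by the temperature, keep the `top_k` largest, and sample from the renormalised distribution. The function name `sample_next_token` is hypothetical.

```python
from typing import Optional

import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: Optional[int] = None) -> torch.Tensor:
    """Sample one token id per row from logits of shape (batch, vocab)."""
    logits = logits / temperature  # temperature < 1 sharpens the distribution
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        # Smallest logit that still belongs to the top k, one per row.
        kth_value = torch.topk(logits, k).values[:, [-1]]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # shape (batch, 1)


torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])
token = sample_next_token(logits, temperature=0.2, top_k=2)
print(token.shape)
```

With `top_k=2` only the two highest-scoring ids (0 and 1 here) can ever be drawn, and a low temperature such as 0.2 makes the highest-scoring one far more likely.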


In [25]:
# Debug: print the raw token ids
print([token.item() for token in generated_tokens[0]])


[286, 289, 277, 295, 286, 275, 294, 283, 289, 288, 258, 162]


In [1]:
import torch
from datasets import load_dataset
import evaluate  # Hugging Face evaluate library

# Load the dataset
dataset = load_dataset("glue", "cola")

# Inspect the dataset
print(dataset)
for i in range(10):
    print(dataset["train"][i])

# sentence: the sentence text
# label: grammatical acceptability (1 if acceptable, 0 if not)
# idx: example index



DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.", 'label': 1, 'idx': 0}
{'sentence': "One more pseudo generalization and I'm giving up.", 'label': 1, 'idx': 1}
{'sentence': "One more pseudo generalization or I'm giving up.", 'label': 1, 'idx': 2}
{'sentence': 'The more we study verbs, the crazier they get.', 'label': 1, 'idx': 3}
{'sentence': 'Day by day the facts are getting murkier.', 'label': 1, 'idx': 4}
{'sentence': "I'll fix you a drink.", 'label': 1, 'idx': 5}
{'sentence': 'Fred watered the plants flat.', 'label': 1, 'idx': 6}
{'sentence': 'Bill coughed his way out of the restaurant.', 'label': 1, 'idx': 7}
{'sentence': "We're da

In [2]:
with open("mini_ptb.train.txt", "r") as f:
    mini_ptb_train_text = f.read()

In [None]:
# prompt: have the model solve a grammatical acceptability judgment task
import torch
from datasets import load_dataset
from transformers import Trainer, TrainingArguments
import evaluate
from GPT2 import Transformer, ModelConfig
from Tokenizer import Tokenizer

# Initialize the tokenizer
unique_chars_in_train_text = sorted(list(set(mini_ptb_train_text)))
tokenizer = Tokenizer(unique_chars_in_train_text)


# Prepare the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vocab_size = len(tokenizer.str_to_idx)
# The config must use the same hyperparameters as the checkpoint being loaded
config = ModelConfig(block_size=128, vocab_size=vocab_size)
pretrained_model = Transformer(config).to(device)
pretrained_model.load_state_dict(torch.load("transformer.pth", map_location=device))
# PretrainedTransformerForClassification is not imported in this cell; it is assumed to be defined elsewhere
model = PretrainedTransformerForClassification(pretrained_model, num_labels=2).to(device)

# Data preprocessing
# Tokenization
def preprocess_function(examples):
    max_length = 128  # maximum sequence length
    padded_inputs = []
    attention_masks = []

    for sentence in examples["sentence"]:
        encoded = tokenizer(sentence, eot=True)

        # Truncate, then record the real length before padding so that the
        # mask marks real tokens with 1 and padding with 0
        encoded = encoded[:max_length]
        num_real_tokens = len(encoded)
        encoded = encoded + [0] * (max_length - num_real_tokens)  # id 0 is used as padding

        attention_mask = [1] * num_real_tokens + [0] * (max_length - num_real_tokens)

        padded_inputs.append(encoded)
        attention_masks.append(attention_mask)

    return {
        "input_ids": padded_inputs,
        "attention_mask": attention_masks,
        "labels": examples["label"],
    }

# Apply the tokenization
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Drop the columns that are no longer needed
tokenized_dataset = tokenized_dataset.remove_columns(["sentence"])  # remove the raw text column
tokenized_dataset.set_format("torch")  # return PyTorch tensors

train_dataset = tokenized_dataset["train"]
valid_dataset = tokenized_dataset["validation"]
test_dataset = tokenized_dataset["test"]

print(train_dataset["input_ids"][0].shape)

Transformer parameter count: 0.23M




AttributeError: 'Transformer' object has no attribute 'output_dim'

In [None]:
print("Input IDs shape:", train_dataset["input_ids"].shape)  # (num_examples, seq_length)

Input IDs shape: torch.Size([8551, 128])
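
The shape above reflects every sentence being truncated or padded to 128 token ids. A minimal sketch of that step follows, with the attention mask built from the sentence length measured before padding. The helper name `pad_and_mask` and the choice of 0 as the padding id are assumptions for illustration.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the matching mask."""
    token_ids = list(token_ids)[:max_length]
    num_real = len(token_ids)  # measured before any padding is added
    num_pad = max_length - num_real
    padded = token_ids + [pad_id] * num_pad
    mask = [1] * num_real + [0] * num_pad  # 1 = real token, 0 = padding
    return padded, mask


ids, mask = pad_and_mask([5, 9, 3], max_length=6)
print(ids)   # [5, 9, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

If the mask were computed from the length after padding, it would be all ones and the model could not tell padding from real tokens.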


In [51]:
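# Define the evaluation metric
import numpy as np

accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # the Trainer passes NumPy arrays here
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)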

# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # called eval_strategy in recent transformers releases
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
)

# Train and evaluate the model (anomaly detection is enabled for debugging and slows training)
with torch.autograd.detect_anomaly():
    trainer.train()
    eval_results = trainer.evaluate()
#trainer.train()
#eval_results = trainer.evaluate()

# Show the results
print(f"Evaluation Results: {eval_results}")


RuntimeError: mat1 and mat2 shapes cannot be multiplied (8x301 and 64x2)
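
The shapes in this error are consistent with a 2-class linear layer that expects 64 input features receiving 301 features per example. One plausible reading, stated as an assumption because the wrapper class is not shown in this notebook, is that the classification head was fed vocabulary logits instead of hidden states. The sketch below shows the intended shape flow with a hypothetical `ToyBackbone` standing in for the pretrained model body; the sizes 301 and 64 are taken from the error message.

```python
import torch
from torch import nn


class ToyBackbone(nn.Module):
    """Hypothetical stand-in for the pretrained body: token ids -> hidden states."""

    def __init__(self, vocab_size: int, n_embd: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, n_embd)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(input_ids)  # (batch, seq_len, n_embd)


class SequenceClassifier(nn.Module):
    """Classifies a sequence from the hidden state of its last real token."""

    def __init__(self, backbone: nn.Module, n_embd: int, num_labels: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(n_embd, num_labels)  # input width is n_embd, not vocab_size

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)  # (batch, seq_len, n_embd)
        last_index = attention_mask.sum(dim=1) - 1  # position of the last real token
        batch_index = torch.arange(input_ids.size(0))
        pooled = hidden[batch_index, last_index]  # (batch, n_embd)
        return self.head(pooled)  # (batch, num_labels)


model = SequenceClassifier(ToyBackbone(vocab_size=301, n_embd=64), n_embd=64, num_labels=2)
input_ids = torch.randint(0, 301, (8, 128))
attention_mask = torch.ones(8, 128, dtype=torch.long)
logits = model(input_ids, attention_mask)
print(logits.shape)  # torch.Size([8, 2])
```

Pooling on the last real token suits a causal model, where only that position has attended to the whole sentence.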

In [6]:
import torch
from datasets import load_dataset
import evaluate  # Hugging Face evaluate library
from GPT2 import Transformer, ModelConfig
from Tokenizer import Tokenizer

# Load the model
model = Transformer(config)
model.load_state_dict(torch.load('transformer.pth'))
model.eval()

# Prepare the tokenizer (the same character-level tokenizer built from the PTB training text)
tokenizer = Tokenizer(unique_chars_in_train_text)


Transformer parameter count: 0.23M




In [8]:
# Tokenize the sentence
sentence = "This is a grammatically correct sentence."
sentence_tokens = tokenizer.encode(sentence)

# Grammaticality check
is_correct = model.is_grammatically_correct(sentence_tokens, tokenizer)
if is_correct:
    print("Grammatically correct")
else:
    print("Grammatically incorrect")


AttributeError: 'Transformer' object has no attribute 'is_grammatically_correct'

In [None]:
import numpy as np
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

# Load the dataset (CoLA, the same GLUE task loaded earlier in this notebook)
dataset = load_dataset("glue", "cola")

# Prepare the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Prompt format for training (defined here but not applied below)
def prompt(example):
    return f"sentence: {example['sentence']}"

# Tokenize every split; truncate or pad each sentence to 128 tokens
def tokenize_function(examples):
    return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize_function, batched=True)
tokenized_train = tokenized["train"]
tokenized_valid = tokenized["validation"]
tokenized_test = tokenized["test"]  # the GLUE test split has no gold labels (all -1)

# Prepare the model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Evaluation metric
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",  # must match the evaluation strategy when load_best_model_at_end=True
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    compute_metrics=compute_metrics,
)

# Run training
trainer.train()

# Evaluate on labeled held-out data; the unlabeled test split cannot be scored locally
trainer.evaluate(tokenized_valid)
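
Accuracy can look good on CoLA even when a model only ever predicts the acceptable class, which is why the GLUE benchmark scores this task with the Matthews correlation coefficient. The following is a minimal pure-Python sketch of that metric for binary labels; the function name and the convention of returning 0.0 when the denominator is zero are choices made here for illustration.

```python
import math


def matthews_corrcoef(predictions, references):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention used here when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denominator


references = [1, 1, 1, 0]
always_acceptable = [1, 1, 1, 1]
accuracy = sum(p == r for p, r in zip(always_acceptable, references)) / len(references)
print(accuracy)                                          # 0.75
print(matthews_corrcoef(always_acceptable, references))  # 0.0
```

In this toy case the constant predictor reaches 75% accuracy but a correlation of 0, showing that it has learned nothing about the labels.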