# IPython kernel installation required!

This project uses a Python virtual environment.

## Sources

This project is based on a video by Professor Jeong-Mo Hong (홍정모).

Sincere thanks to Professor Hong.

[HongLab](https://honglab.co.kr/)

[HongLab YouTube video used as reference](https://www.youtube.com/watch?v=osv2csoHVAo)

- Dataset: <https://www.kaggle.com/datasets/shubhammaindola/harry-potter-books>

### Installing packages

In [None]:
# NumPy
%pip install numpy

In [None]:
# PyTorch (the index URL selects the CUDA 12.6 build)
%pip install torch --index-url https://download.pytorch.org/whl/cu126

In [None]:
# tiktoken: a fast byte pair encoding (BPE) tokenizer made by OpenAI.
%pip install tiktoken

In [None]:
# Save the installed package versions to requirements.txt
%pip freeze > requirements.txt

### Checking the device

In [2]:
import torch

print("cuda" if torch.cuda.is_available() else "cpu")

cuda


### A first look at the tokenizer

In [9]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")  # encoding of the open-source GPT-2 model

text = "Hello, world!"
tokens = tokenizer.encode(text)

print("Text:", text, "Tokens:", tokens, "\n")
print("Text Length:", len(text), "Token Length:", len(tokens), "\n")

for token in tokens:
    print(f"{token}\t -> {tokenizer.decode([token])}")

Text: Hello, world! Tokens: [15496, 11, 995, 0] 

Text Length: 13 Token Length: 4 

15496	 -> Hello
11	 -> ,
995	 ->  world
0	 -> !
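
The output above shows that tokens are subword units rather than characters: 13 characters become 4 tokens. As a rough illustration of where such units come from, here is a toy sketch of the merge step used when a byte pair encoding vocabulary is learned. It is not tiktoken's implementation: GPT-2's real tokenizer works on bytes with a pre-learned merge table, and the helper names `most_frequent_pair` and `merge_pair` are hypothetical, introduced only for this example.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the most frequent adjacent pair in a list of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Start from single characters and apply two merge rounds.
original = "low lower lowest"
symbols = list(original)
for _ in range(2):
    symbols = merge_pair(symbols, most_frequent_pair(symbols))

print(symbols)
```

After two rounds the frequent sequence `low` has become a single symbol, while rarer characters stay separate. Repeating this many times over a large corpus is, roughly, how a subword vocabulary grows.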


### Cleaning the dataset

In [4]:
import re
import glob


def clean_text(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()

    # Replace newlines with spaces, then collapse any remaining whitespace runs.
    cleaned_text = re.sub(r"\n+", " ", text)
    cleaned_text = re.sub(r"\s+", " ", cleaned_text)

    file_path = re.sub(r"\.txt$", " CLEANED.txt", file_path)

    with open(file_path, "w", encoding="utf-8") as file:
        file.write(cleaned_text)

    print(file_path, "was written with", len(cleaned_text), "characters.")


for file_path in glob.glob("dataset/harry-potter-books/*.txt"):
    # Skip files already produced by a previous run (the dot is escaped to match literally).
    if not re.search(r"CLEANED\.txt$", file_path):
        clean_text(file_path)

dataset/harry-potter-books\01 Harry Potter and the Sorcerers Stone CLEANED.txt was written with 436000 characters.
dataset/harry-potter-books\02 Harry Potter and the Chamber of Secrets CLEANED.txt was written with 488771 characters.
dataset/harry-potter-books\03 Harry Potter and the Prisoner of Azkaban CLEANED.txt was written with 621137 characters.
dataset/harry-potter-books\04 Harry Potter and the Goblet of Fire CLEANED.txt was written with 1093670 characters.
dataset/harry-potter-books\05 Harry Potter and the Order of the Phoenix CLEANED.txt was written with 1489734 characters.
dataset/harry-potter-books\06 Harry Potter and the Half-Blood Prince CLEANED.txt was written with 982041 characters.
dataset/harry-potter-books\07 Harry Potter and the Deathly Hallows CLEANED.txt was written with 1133063 characters.
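
The cleaning step relies on `\s` matching every whitespace character, including newlines and tabs. The minimal sketch below checks this on a made-up sample string; `normalize_whitespace` is a hypothetical helper name, not part of the notebook. It also suggests that a single `\s+` pass gives the same result as replacing newlines first.

```python
import re


def normalize_whitespace(text):
    # \s matches spaces, tabs, and newlines, so one pass collapses them all.
    return re.sub(r"\s+", " ", text)


sample = "first line\n\nsecond   line\tend"  # made-up sample text

one_pass = normalize_whitespace(sample)
# Two passes, as in the cell above: newlines first, then all whitespace.
two_pass = re.sub(r"\s+", " ", re.sub(r"\n+", " ", sample))

print(one_pass)
print(one_pass == two_pass)
```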


### Dataset loader

In [5]:
import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, text, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Uses the global tiktoken `tokenizer` created in the tokenizer cell above.
        token_ids = tokenizer.encode(text)

        print("Token Length:", len(token_ids))

        # Slide a window of max_length tokens; each target is its input shifted by one token.
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]


with open(
    "dataset/harry-potter-books/01 Harry Potter and the Sorcerers Stone CLEANED.txt",
    "r",
    encoding="utf-8-sig",  # utf-8-sig strips the BOM (byte order mark).
) as file:
    text = file.read()

dataset = MyDataset(text, max_length=32, stride=4)
train_loader = DataLoader(dataset, batch_size=128, shuffle=True, drop_last=True)

Token Length: 117767


In [6]:
dataiter = iter(train_loader)
x, y = next(dataiter)

print(tokenizer.decode(x[0].tolist()))
print(tokenizer.decode(y[0].tolist()))

What do they think they’re doing, keeping a thing like that locked up in a school?” said Ron finally. “If any dog
 do they think they’re doing, keeping a thing like that locked up in a school?” said Ron finally. “If any dog needs
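
The two decoded lines above differ by exactly one token: the target is the input shifted one position to the right. The sketch below reproduces the same windowing logic in plain Python on a toy list of integers, so the shift and the number of windows are easy to inspect. `sliding_windows` and `toy_ids` are hypothetical names used only here; the final arithmetic assumes the token count (117767), window length (32), stride (4), and batch size (128) shown in the cells above.

```python
def sliding_windows(token_ids, max_length, stride):
    """Build (input, target) pairs where the target is the input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i : i + max_length]
        targets = token_ids[i + 1 : i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(10))  # stand-in for real token IDs
pairs = sliding_windows(toy_ids, max_length=4, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)

# Same arithmetic with the notebook's numbers.
num_samples = len(range(0, 117767 - 32, 4))
num_batches = num_samples // 128  # drop_last=True discards the final partial batch
print(num_samples, num_batches)
```

With a stride smaller than the window length, consecutive windows overlap, so each token appears in several training samples.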
