## [🥢Cultural Alignment LLM]() - PART II
University College London

<b>YUE HU</b>

#### <b>Outline</b>
- This notebook validates the functions defined in [`MyData.py`](./MyData.py), [`inference.py`](./inference.py) and [`Train.py`](./Train.py) before model training.
<hr width=70% style="float: left">

In [50]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


<div class="alert alert-block alert-warning">
<h4>👩🏻‍💻 <b>0. Set Up</b></h4>
</div>

In [2]:
# Let's import libraries
import os
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

import sys
sys.path.append('/content/drive/MyDrive/CulturalAlignment')

In [51]:
# Check Available Devices
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

<div class="alert alert-block alert-warning">
<h4>👩🏻‍💻 <b>1. Does MyDataset Work Correctly?</b></h4>

* Function `get_csv_data`: loads data from `.csv` files.
* Class `MyDataset`: a custom Dataset used later for training.
</div>

In [52]:
from MyData import get_csv_data, MyDataset

data_path = '/content/drive/MyDrive/CulturalAlignment/Datasets/exam.csv'
TOKENIZER_NAME = 'm-a-p/CT-LLM-Base'

Load the data (from a CSV file into a pandas DataFrame) and the tokenizer.

In [54]:
# Data
df = get_csv_data(data_path)
df.sample(3)

| Unnamed: 0 | instruction | input | output | task_type |
|---|---|---|---|---|
| 1630 | 阅读下文，完成下面小题\n镜子①星期天下午，女孩儿戴着耳塞，在床上。听这种节目就是要尽量放松... | 初中阶段，我们要能“区分写实作品与虚构作品”，那么本文是写实的，还是虚构的？请做出判断，并结... | 是虚构作品。理由：①情节离奇（故事富有想象力）。文中“女孩儿”和曾经的“小窗”照镜子时，从镜... | 问答 |
| 2144 | 30.2017年10月18日上午，中国共产党第十九次全国代表大会在人民大会堂开幕，习近平总书... | 要实现上述奋斗目标，我们必须大力实施哪些发展战略？ | 科教兴国战略、人才强国战略、创新驱动发展战略、可持续发展战略、西部大开发战略、乡村振兴战略等... | 问答 |
| 107 | 心力衰竭\n这个词语是什么意思？ |  | 也称充血性心力衰竭或心功能不全。心脏因疾病、过劳、排血功能减弱，以至排血量不能满足器官及组织... | 问答 |


In [55]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME,use_fast=False,trust_remote_code=True)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

print('Tokenizer loaded.')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Tokenizer loaded.


In [56]:
tokenizer.pad_token_id

0

Next, build the dataset:

In [65]:
print("Text to Tensor...")
mydataset = MyDataset(df,tokenizer)
print('Dataset loaded.\n')

Text to Tensor...
Dataset loaded.



Inspect one sample from the dataset (here a fixed index, `idx = 100`).

In [66]:
idx = 100
print('A sample in dataset: \n')
print(mydataset[idx])

A sample in dataset: 

{'input_ids': tensor([     0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
   

In [67]:
mydataset[idx]['input_ids'].shape

torch.Size([256])

In [68]:
len(mydataset[idx]['input_ids']) == len(mydataset[idx]['labels'])

True
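
The printed sample starts with a run of zeros, `pad_token_id` is 0, and every sequence has length 256, which suggests left padding to a fixed length. The following is a minimal sketch of that idea, not the code in `MyData.py`: the helper names `pad_left` and `encode_example`, the right-side truncation, and the choice to mask prompt and padding positions in `labels` with `-100` are all assumptions made for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_left(ids, max_length, pad_value):
    """Truncate to max_length (from the right), then left-pad with pad_value."""
    ids = list(ids)[:max_length]
    return [pad_value] * (max_length - len(ids)) + ids


def encode_example(source_ids, target_ids, max_length=256, pad_token_id=0):
    """Hypothetical helper: build equal-length input_ids and labels."""
    input_ids = list(source_ids) + list(target_ids)
    # Assumption: only answer tokens are supervised; prompt positions are ignored.
    labels = [IGNORE_INDEX] * len(source_ids) + list(target_ids)
    return {
        "input_ids": pad_left(input_ids, max_length, pad_token_id),
        "labels": pad_left(labels, max_length, IGNORE_INDEX),
    }


example = encode_example([11, 12, 13], [21, 22], max_length=8)
print(example["input_ids"])  # [0, 0, 0, 11, 12, 13, 21, 22]
print(example["labels"])     # [-100, -100, -100, -100, -100, -100, 21, 22]
```

Padding both lists with the same helper keeps `input_ids` and `labels` the same length by construction, which is the property the cell above checks.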

<b><span style="color: #C0392B">IMPORTANT 🛑:</span></b> Check whether decoding the encoded sequence recovers the original input.

In [69]:
# Q: What is the original text?
data = df.iloc[idx]
instruction,input,output = data['instruction'],data['input'],data['output']
if input is not None and input != "":
    instruction = instruction+'\n'+input

source = f"问题：{instruction}\n答案："
target = f"{output}{tokenizer.eos_token}"
input = source+target
print(f"ORIGINAL INPUT:\n{input}")

ORIGINAL INPUT:
问题：成语释义：炙鸡渍酒
含义：
答案：指以棉絮浸酒，晒干后裹烧鸡，携以吊丧。后遂用为不忘恩的典实。
成语出处：《后汉书·徐穉传》穉尝为太尉黄琼所辟，不就李贤注引三国吴谢承《后汉书》穉诸公所辟虽不就，有死丧负笈赴吊。常於家豫炙鸡一只，以一两绵絮渍酒中，暴乾以裹鸡，径到所起冢外，……醊酒毕，留谒则去，不见丧主。</s>
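
The prompt template in the cell above can be wrapped in a small helper. This is a sketch under assumptions: `is_missing` and `build_example` are hypothetical names, not functions from `MyData.py`. It also treats float NaN as a missing `input`, because pandas reads empty CSV cells as NaN by default; whether `get_csv_data` already fills such cells is not shown in this notebook.

```python
import math


def is_missing(value):
    """True for None, blank strings, and float NaN."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def build_example(instruction, extra_input, output, eos_token):
    """Return (source, target) using the same template as the cell above."""
    if not is_missing(extra_input):
        instruction = f"{instruction}\n{extra_input}"
    source = f"问题：{instruction}\n答案："
    target = f"{output}{eos_token}"
    return source, target


# With an extra input field, as in the sample at idx = 100.
source, target = build_example("成语释义：炙鸡渍酒", "含义：", "ANSWER", "</s>")
print(source + target)

# With a missing input field (NaN), the instruction is used on its own.
source, target = build_example("心力衰竭\n这个词语是什么意思？", float("nan"), "ANSWER", "</s>")
print(source + target)
```

Here `"ANSWER"` is a placeholder string rather than real dataset content.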


In [70]:
decoded_text = tokenizer.decode(mydataset[idx]['input_ids'],skip_special_tokens=True)

print(f"DECODED TEXT:\n{decoded_text}")

DECODED TEXT:
问题：成语释义：炙鸡渍酒
含义：
答案：指以棉絮浸酒，晒干后裹烧鸡，携以吊丧。后遂用为不忘恩的典实。
成语出处：《后汉书·徐穉传》穉尝为太尉黄琼所辟，不就李贤注引三国吴谢承《后汉书》穉诸公所辟虽不就，有死丧负笈赴吊。常於家豫炙鸡一只，以一两绵絮渍酒中，暴乾以裹鸡，径到所起冢外，……醊酒毕，留谒则去，不见丧主。</s>


In [71]:
print(f"IS THE DECODED TEXT THE SAME AS THE INPUT TEXT?\n{input==decoded_text}")

IS THE DECODED TEXT THE SAME AS THE INPUT TEXT?
True
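
The check above covers a single index. A sketch of how the same round-trip test could be run over many samples is given here; `find_roundtrip_failures` is a hypothetical helper, and the toy codec (one id per Unicode code point) only stands in for a real tokenizer so that the example runs on its own.

```python
def find_roundtrip_failures(texts, encode, decode):
    """Return the indices whose decode(encode(text)) differs from text."""
    return [i for i, text in enumerate(texts) if decode(encode(text)) != text]


# Toy codec standing in for a tokenizer: one id per Unicode code point.
def toy_encode(text):
    return [ord(ch) for ch in text]


def toy_decode(ids):
    return "".join(chr(i) for i in ids)


# A deliberately lossy decoder, to show what a failure looks like.
def lossy_decode(ids):
    return toy_decode(ids).strip()


texts = ["问题：示例\n答案：好</s>", "trailing space "]
print(find_roundtrip_failures(texts, toy_encode, toy_decode))    # []
print(find_roundtrip_failures(texts, toy_encode, lossy_decode))  # [1]
```

With the notebook's objects, `encode` and `decode` could be thin wrappers around the tokenizer. How padding and the EOS token should be handled in that comparison depends on `MyData.py`.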


<div class="alert alert-block alert-warning">
<h4>👩🏻‍💻 <b>2. Check Training Loop Step by Step</b></h4>

The checks above show that `MyDataset(data:pd.DataFrame, tokenizer:AutoTokenizer)` behaves as expected. The following sections step through the model training procedure.
</div>

<b>(1) Check the DataLoader</b>

Unlike in formal training, the data used here is not split into training and validation sets.
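
For reference, a formal run would hold out part of the data for validation. The split used in `Train.py` is not shown in this notebook, so this is only a sketch: `split_indices`, the 10% validation fraction, the seed, and the sample count of 1000 are illustrative assumptions.

```python
import random


def split_indices(n_samples, val_fraction=0.1, seed=42):
    """Shuffle range(n_samples) reproducibly and split it into (train, val) index lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global random state untouched
    n_val = max(1, int(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1000, val_fraction=0.1)
print(len(train_idx), len(val_idx))  # 900 100
```

The two index lists could then be passed to `torch.utils.data.Subset` to obtain separate training and validation datasets.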

In [72]:
mydataloader = DataLoader(mydataset,shuffle=True,batch_size=8)
print("DATALOADER PREPARED.")

DATALOADER PREPARED.


<b>(2) Check the LOOP</b>

All the steps below are performed within a single epoch.

In [96]:
print("What's in a BATCH? \n")
for step,batch in enumerate(mydataloader):
    print("STEP: ",step)

    batch = {k:v.to(device) for k, v in batch.items()}
    print(f"SHAPE OF EACH BATCH:\n{batch['input_ids'].shape}")
    print(f"BATCH SIZE:\n{len(batch['input_ids'])}")
    print(f"THE SIZE OF EACH INPUT:\n{len(batch['input_ids'][0])}")

    del batch
    torch.cuda.empty_cache()

    break

What's in a BATCH? 

STEP:  0
SHAPE OF EACH BATCH:
torch.Size([8, 256])
BATCH SIZE:
8
THE SIZE OF EACH INPUT:
256
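
The batch shape above is `[batch_size, sequence_length]`. As a small worked example of how the batch size determines the number of steps in an epoch, the sketch below mirrors how a `DataLoader` counts batches with and without `drop_last`. The sample count of 2500 is hypothetical, since the size of `exam.csv` is not shown here.

```python
import math


def steps_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


def last_batch_size(n_samples, batch_size):
    """Size of the final batch when drop_last is False."""
    remainder = n_samples % batch_size
    return remainder if remainder else batch_size


n_samples = 2500  # hypothetical dataset size
print(steps_per_epoch(n_samples, 8))                  # 313
print(steps_per_epoch(n_samples, 8, drop_last=True))  # 312
print(last_batch_size(n_samples, 8))                  # 4
```

A smaller final batch is worth keeping in mind when code assumes every batch has shape `[8, 256]`.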


<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
<b>🤖: Congratulations!</b>
</div>