## Overiew
This project is to train and to finetune pretrained model to become PoetBot.  The goal of this project is to train chatbot that can construct short quote using themes listed in keywords with correct logic and grammar.  For example, given keywords 'friendship, love', model could generate sentences with theme of friendship and love but not necessary contain words of 'friendship' or 'love', like:
   1. There is nothing I would not do for those who are really firends. I have no notion of loving people by halves, it is not my native.
   2. If I had a flower for every time I thought of you... I could walk through my garden forever.

This model is different with general keyword-generation model on:
   1. Number of keywords is not fixed.
   2. Sentence generated is random.

## Related Works
Conditional text/story generation: For task of conditional text or story generation, model usually use all keywords to compromise new sentence or story using seq2seq model with attention mechanism other pretrained model as GPT-3 and BART.
Hierarchial story generation: This task requires to draft story line and then new story is generated based on input story line. Meaningful and detailed sentence would be required as input to generate accurate stories.

## Dataset
Quotes-500k in Kaggle, which stores qoutes with various category tags ranging from love, life to philisophy, motivation to describes quote which categories belong to.

In [1]:
import os
import pandas as pd
import torch
from torch.utils.data import DataLoader
import datasets
from transformers import TrainingArguments, pipeline
from modelscope import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import trl
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig
import deepspeed
from collections import Counter
import json
import datetime

[2024-09-15 02:12:36,699] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


W0915 02:12:39.046000 11924 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.


### Config

In [2]:
dataset_args = {
    'data_path':r'data/quotes.csv',
    'n_top': 50, # Number of Categories included
    'token_length':64, # Maximum context length
    'train_size':20000,
    'valid_size':2000,
    'test_size':10
}
model_args = {
    'batch_size':1,
    'epochs':16,
    'num_workers':os.cpu_count(),
    'bf16':True,
    'fp16':False,
    'learning_rate':0.0001,
    'gradient_accumulation_steps':256,
    'model_name_path':r'C:\Users\user\.cache\modelscope\hub\Qwen\Qwen1___5-4B-Chat',
    'out_dir':'outputs/qwen1___5-4B/best_model',
}

# Lora Config
peft_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=['q_proj','k_proj','v_proj','o_proj'],
    bias='none',
    task_type='Causal_LM',
)

# Deepspeed Config - ZeRO-2 is used to offload GPU memory to CPU memory
ds_config = {
    'bfloat16':{
        'enable':'auto'
    },
    'fp16':{
        'enable':'auto'
    },
    'zero_optimization':{
        'stage':2,
        'offload_optimizer':{
            'device':'cpu',
            'pin_memory':True
        },
        'offload_parameter':{
            'device':'cpu',
            'pin_memory':True
        },
        'overlap_comm':True,
        'reduce_scatter':True,
        'reduce_bucket_size':1e8,
        'allgather_partitions':True,
        'allgather_bucket_size':1e8,
        'contiguous_gradients':True
    },
    'gradient_accumulation_steps':1,
    'gradient_clipping':'auto',
    'train_batch_size':'auto',
    'train_micro_batch_size_per_gpu':'auto',
    'steps_per_print':1e5
}

### Load pretrained model - Qwen

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_args['model_name_path'])
if model_args['bf16'] == True:
    model = AutoModelForCausalLM.from_pretrained(model_args['model_name_path'], trust_remote_code=True).to(dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_args['model_name_path'], trust_remote_code=True)
else:
    model = AutoModelForCausalLM.from_pretrained(model_args['model_name_path'], trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_args['model_name_path'], trust_remote_code=True)

# Lora
model = get_peft_model(model, peft_config)

print('#'*12)
print('Total parameters: {}'.format(sum([p.numel() for p in model.parameters()])))
print('Total parameters: {}'.format(sum([p.numel() for p in model.parameters() if p.requires_grad])))
print('#'*12)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

############
Total parameters: 3956922880
Total parameters: 6553600
############


### Data preparation

In [4]:
def create_dataloader(tokenizer, batch_size:int, split:str) -> DataLoader:
    n_top=dataset_args['n_top']
    token_len=dataset_args['token_length']
    train_size=dataset_args['train_size']
    valid_size=dataset_args['valid_size']
    test_size=dataset_args['test_size']
    file_path=dataset_args['data_path']

    data_df = pd.read_csv(dataset_args['data_path']).drop('author',axis=1)
    data_df['category'] = data_df['category'].apply(lambda x: x.split(', ') if isinstance(x, str) else x)
    data_df = data_df[data_df['category'].notnull()]

    categories = []
    for sublist in data_df['category']:
        if isinstance(sublist, list):
            for item in sublist:
                categories.append(item)
        else:
            categories.append(sublist)

    if split == 'train':
        print('Top Categories: ', Counter(categories).most_common(n_top))
    top_cat =[item[0] for i, item in enumerate(Counter(categories).most_common(n_top))]
    top_cat.sort()

    ## Remove records not having any top categories tagging
    data_df['category'] = data_df['category'].apply(lambda x: [cat if cat in x else '' for cat in top_cat])
    data_df = data_df[data_df['category'].apply(lambda x: sum([1 for cat in x if cat!='']))>0]
    data_df['category'] = data_df['category'].apply(lambda x: [cat for cat in x if cat!=''])

    dataset = datasets.Dataset.from_pandas(data_df)
    dataset.shuffle(8021)
    if split == 'train':
        print('Data total size: ', len(dataset))

    def preprocess_truncate(_dataset):
        categories = ', '.join(_dataset['category'])
        return {
            'input': f'### Intstruction:\nYou are AI peom generator that turns "Input" categories below to poetic masterpieces.\n\n### Input:\n{categories}',
            'output': ' '.join(_dataset['quote'].split()[:token_len])
        }

    if split == 'train':
        dataset = dataset.select(range(train_size)).map(preprocess_truncate)
    elif split == 'valid':
        dataset = dataset.select(range(train_size,train_size+valid_size)).map(preprocess_truncate)
    elif split == 'test':
        dataset = dataset.select(range(train_size+valid_size,train_size+valid_size+test_size)).map(preprocess_truncate)

    ds_out = dataset
    return ds_out

ds_train = create_dataloader(tokenizer, model_args['batch_size'], 'train')
ds_valid = create_dataloader(tokenizer, model_args['batch_size'], 'valid')
ds_test = create_dataloader(tokenizer, model_args['batch_size'], 'test')

print(ds_test[0]['input'])
print('\n###Response:\n'+ds_test[0]['output'])

Top Categories:  [('love', 38805), ('life', 35074), ('inspirational', 29080), ('philosophy', 14939), ('humor', 14081), ('god', 12559), ('truth', 11827), ('wisdom', 10820), ('happiness', 10424), ('hope', 9623), ('inspirational-quotes', 9279), ('quotes', 9191), ('romance', 9121), ('faith', 8933), ('death', 8292), ('inspiration', 8163), ('success', 8127), ('writing', 7962), ('poetry', 7180), ('religion', 7155), ('knowledge', 6430), ('education', 6306), ('motivational', 6182), ('time', 6029), ('relationships', 5711), ('spirituality', 5679), ('Life', 5433), ('life-lessons', 5408), ('fear', 5338), ('motivation', 5336), ('books', 5211), ('people', 5166), ('science', 5109), ('funny', 5061), ('friendship', 5000), ('purpose', 4918), ('change', 4779), ('dreams', 4718), ('freedom', 4617), ('life-quotes', 4585), ('work', 4576), ('leadership', 4488), ('spiritual', 4434), ('women', 4400), ('christianity', 4355), ('debasish-mridha', 4343), ('peace', 4338), ('love-quotes', 4316), ('war', 4284), ('beaut

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

### Intstruction:
You are AI peom generator that turns "Input" categories below to poetic masterpieces.

### Input:
inspirational, inspirational-quotes

###Response:
In markets, stupid action does not have equal but severe opposite reaction.


In [5]:
def preprocess_function(example):
    """
    Formatting function returning a list of samples (kind of necessary for SFT API).
    """
    text = f"{example['input']}\n\n### Response:\n{example['output']}"
    return text

In [6]:
# help(trl.SFTConfig) - Args can be set # https://www.huaxiaozhuan.com/%E5%B7%A5%E5%85%B7/huggingface_transformer/chapters/4_trainer.html
training_args = TrainingArguments(
    output_dir=f'{model_args['out_dir']}/logs',
    overwrite_output_dir=True,
    eval_strategy='epoch',
    weight_decay=0.01,
    per_device_train_batch_size=model_args['batch_size'],
    per_device_eval_batch_size=model_args['batch_size'],
    num_train_epochs=model_args['epochs'],
    bf16=model_args['bf16'],
    fp16=model_args['fp16'],
    dataloader_num_workers=model_args['num_workers'],
    # gradient_accumulation_steps=gradient_accumulation_steps,
    # gradient_checkpointing=True,
    learning_rate=model_args['learning_rate'],
    lr_scheduler_type='constant',
    optim='adamw_hf',
    deepspeed=ds_config,
    # ddp_backend='nccl',
)

# Gradient Enable: To prevent '''RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn''' for gradient checkpoints
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# To prevent """AttributeError: 'DummyOptim' object has no attribute 'step'""" for DeepSpeed
from accelerate.utils import DistributedType, DeepSpeedPlugin
training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED

In [7]:
# Unable to use deepspeed.initialize due to CommandError. As alternative, self-define save_checkpoint func as null
def save_checkpoint(self, out_dir=f'{model_args['out_dir']}/chkpt'):
    # model.save_pretrained(out_dir)
    return
model.save_checkpoint = save_checkpoint
model.save_checkpoint

<function __main__.save_checkpoint(self, out_dir='outputs/qwen1___5-4B/best_model/chkpt')>

### Train

In [8]:
trainer = trl.SFTTrainer(
    model=model,
    train_dataset=ds_train,
    eval_dataset=ds_valid,
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=preprocess_function,
    packing=True,
    peft_config=peft_config,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [9]:
train_start = datetime.datetime.now()
history = trainer.train()
train_end = datetime.datetime.now()
print('Total train time: {}'.format(train_end-train_start))

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  attn_output = torch.nn.functional.scaled_dot_product_attention(
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Epoch,Training Loss,Validation Loss
1,1.6552,No log
2,1.6022,No log
3,1.5442,No log
4,1.4953,No log
5,1.4387,No log
6,1.352,No log
7,1.2967,No log
8,1.2371,No log
9,1.1848,No log
10,1.1261,No log


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwarg

Total train time: 6:33:32.507191


In [10]:
trainer.save_model(f'{model_args['out_dir']}/best_model')

### Model Evaluation

In [11]:
pipe = pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    max_length=dataset_args['token_length'],
    # device_='cuda',
    eos_token_id=tokenizer.eos_token_id
)

def poet_gen(data):
    response = pipe(f"{data['input']}\n\n### Response:\n")[0]['generated_text'][len(data['input'])+16:]
    return {'generated_text':response}

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
The model 'PeftModel' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'Mamba2ForCa

Use Rouge (Recall-Oriented Understudy for Gisting Evaluation) score and BERT score to evaluate similarity of reference sentences.

In [12]:
from rouge import Rouge
from bert_score import score
import numpy as np
def model_eval(model, data):
    rouge = Rouge()
    # Calculate Bert score and Rouge score
    rouge_scores = []; bert_scores = []
    out_text = ds_test.map(poet_gen)['generated_text']
    label_text = ds_test['output']
    
    r_scores = rouge.get_scores(out_text, label_text, avg=True)
    rouge_scores.append(r_scores)

    P, R, b_scores = score(out_text, label_text, lang='en')
    bert_scores.append(np.mean(b_scores.tolist()))
    
    rouge_1 = np.mean([r['rouge-1']['f'] for r in rouge_scores])
    rouge_2 = np.mean([r['rouge-2']['f'] for r in rouge_scores])
    rouge_l = np.mean([r['rouge-l']['f'] for r in rouge_scores])
    bert = np.mean(bert_scores)
    print(f'Score: rouge-1:{rouge_1}; rouge-2:{rouge_2}; rouge-l:{rouge_l}; bert:{bert}')

    return out_text

chk = model_eval(model, ds_test)



Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


model.safetensors:  14%|#4        | 199M/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Score: rouge-1:0.0956779751419744; rouge-2:0.006666666168888927; rouge-l:0.08446445747684383; bert:0.8558026432991028


Low in Rouge scores mean low similarity in exact words matching but high Bert scores for high contextual similarity.

### Sample demonstrate

In [13]:
def poet_gen(cats):
    categories = ', '.join(cats)
    return pipe(f'### Intstruction:\nYou are AI peom generator that turns "Input" categories below to poetic masterpieces.\n\n### Input:\n{categories}\n\n### Response')[0]['generated_text']
print(poet_gen(['love','friendship']))

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


### Intstruction:
You are AI peom generator that turns "Input" categories below to poetic masterpieces.

### Input:
love, friendship

### Response:
I’m not sure I ever really got you. Not like you got me. But the more I try to figure you out, the clearer it gets. Like


In [14]:
print(poet_gen(['love','romance']))

### Intstruction:
You are AI peom generator that turns "Input" categories below to poetic masterpieces.

### Input:
love, romance

### Response:
I don't want to be the girl who never gets happy, the girl who's always pining for what could have been.


In [15]:
print(poet_gen(['god','religion''truth','wisdom']))

### Intstruction:
You are AI peom generator that turns "Input" categories below to poetic masterpieces.

### Input:
god, religiontruth, wisdom

### Response:
And we say there's so much nonsense out there. But when God tells you to do something, it is a request, and if you refuse


In [16]:
print(poet_gen(['life','happiness','hope']))

### Intstruction:
You are AI peom generator that turns "Input" categories below to poetic masterpieces.

### Input:
life, happiness, hope

### Response:
So long as there is this hope, joy cannot be lost.


In [17]:
# Release GPU memory
del tokenizer, pipe, model, trainer
with torch.no_grad():
    torch.cuda.empty_cache()