<a href="https://colab.research.google.com/github/22bq1a42d4/Linkchat/blob/main/linkchat_qa_training_auto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers torch datasets sentencepiece accelerate evaluate rouge_score




In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"  # Turn off Weights & Biases logging (env var read by older transformers releases)


In [None]:
!wget -O data.json https://raw.githubusercontent.com/22bq1a42d4/Linkchat/main/data.json


--2025-08-09 15:53:12--  https://raw.githubusercontent.com/22bq1a42d4/Linkchat/main/data.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113175 (111K) [text/plain]
Saving to: ‘data.json’


2025-08-09 15:53:12 (2.97 MB/s) - ‘data.json’ saved [113175/113175]



In [None]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")["train"]
dataset = dataset.train_test_split(test_size=0.1, seed=42)
print(dataset["train"][0])


Generating train split: 0 examples [00:00, ? examples/s]

{'category': 'Company Insights', 'question': 'Help me understand this: What is it like working at Google as a SWE?', 'answer': 'You’ll work on scalable systems, get great perks, and collaborate with top engineers.'}
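Before tokenizing, it can help to confirm that every record carries usable text. The sketch below is one way to do that, assuming each record is a dict with the `category`, `question`, and `answer` keys seen in the printed example; `find_bad_records` and the sample rows are hypothetical and not taken from `data.json`.

```python
REQUIRED_FIELDS = ("category", "question", "answer")

def find_bad_records(records):
    """Return indices of records that lack a required field or hold an empty/non-string value."""
    bad = []
    for i, rec in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = rec.get(field) if isinstance(rec, dict) else None
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break  # one problem is enough to flag the record
    return bad

# Illustrative records only (not from data.json).
sample = [
    {"category": "A", "question": "Q1?", "answer": "A1."},
    {"category": "A", "question": "   ", "answer": "A2."},  # blank question
    {"category": "A", "question": "Q3?"},                    # missing answer
]
print(find_bad_records(sample))  # [1, 2]
```

Because a `datasets` split yields dicts when iterated, something like `find_bad_records(list(dataset["train"]))` should work on the split loaded above.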


In [None]:
import sentencepiece as spm

texts = []
for ex in dataset["train"]:
    texts.append(ex["question"])
    texts.append(ex["answer"])

with open("all_texts.txt", "w", encoding="utf-8") as f:
    for t in texts:
        f.write(t.replace("\n"," ") + "\n")

# Small vocab size for small dataset
spm.SentencePieceTrainer.train(
    input='all_texts.txt',
    model_prefix='spm_tokenizer',
    vocab_size=2000,
    model_type='bpe',
    character_coverage=1.0
)
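With `character_coverage=1.0`, every distinct character in the corpus has to fit in the vocabulary, so `vocab_size` cannot be arbitrarily small. The sketch below estimates a rough lower bound. It assumes three meta pieces (`<unk>`, `<s>`, `</s>`, the SentencePiece defaults) and ignores SentencePiece's text normalization, so treat the result as approximate; `min_vocab_size` is a hypothetical helper.

```python
def min_vocab_size(lines, num_meta_pieces=3):
    """Rough lower bound on vocab_size: distinct characters plus meta pieces."""
    chars = {"\u2581"}  # SentencePiece represents spaces with U+2581
    for line in lines:
        chars.update(line.replace(" ", "\u2581"))
    return len(chars) + num_meta_pieces

# Tiny illustrative corpus: characters a, b, c, d plus the space marker.
print(min_vocab_size(["ab c", "bd"]))  # 8
```

Running it over the lines of `all_texts.txt` and comparing against `vocab_size=2000` gives a quick sanity check on the chosen size.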


In [None]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer(vocab_file="spm_tokenizer.model")

# Add special tokens
special_tokens_dict = {
    "pad_token": "<pad>",
    "eos_token": "</s>",
    "bos_token": "<s>"
}
tokenizer.add_special_tokens(special_tokens_dict)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


0

In [None]:
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=len(tokenizer),
    d_model=256,
    d_ff=1024,
    num_layers=4,
    num_decoder_layers=4,
    num_heads=4,
    dropout_rate=0.1,
    pad_token_id=tokenizer.pad_token_id,
    decoder_start_token_id=tokenizer.pad_token_id
)

model = T5ForConditionalGeneration(config)
model.resize_token_embeddings(len(tokenizer))


Embedding(2101, 256)

In [None]:
max_input_length = 64
max_target_length = 64

def preprocess(examples):
    inputs = [q.strip() for q in examples["question"]]
    targets = [a.strip() for a in examples["answer"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)


Map:   0%|          | 0/441 [00:00<?, ? examples/s]

Map:   0%|          | 0/49 [00:00<?, ? examples/s]
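`preprocess` truncates but does not pad, so the tokenized examples have different lengths. Padding happens later, per batch, in `DataCollatorForSeq2Seq`, which by default fills label positions with `-100` so the loss skips them. The pure-Python sketch below illustrates that convention and why `-100` must be swapped back for a real token id before decoding; the token ids and `pad_token_id=0` are made up for illustration.

```python
IGNORE_INDEX = -100  # default label_pad_token_id in DataCollatorForSeq2Seq

def pad_labels(batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (width - len(seq)) for seq in batch]

def restore_for_decoding(padded, pad_token_id):
    """Replace the ignore index with a real token id so a tokenizer can decode it."""
    return [[pad_token_id if t == IGNORE_INDEX else t for t in seq] for seq in padded]

labels = [[5, 6, 2], [7, 2]]          # hypothetical token ids
padded = pad_labels(labels)
print(padded)                                         # [[5, 6, 2], [7, 2, -100]]
print(restore_for_decoding(padded, pad_token_id=0))   # [[5, 6, 2], [7, 2, 0]]
```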

In [None]:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
import evaluate
import numpy as np
import torch

rouge = evaluate.load("rouge")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    if isinstance(preds, tuple):
        preds = preds[0]
    # -100 marks padded positions; swap it for the pad id so decoding does not fail
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return {k: round(v, 4) for k, v in result.items()}

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    do_eval=True,                 # periodic evaluation also needs eval_strategy, which is not set here
    eval_steps=500,
    save_steps=500,
    logging_steps=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    num_train_epochs=5,
    learning_rate=5e-4,
    save_total_limit=2,
    fp16=torch.cuda.is_available(),  # mixed precision requires a CUDA GPU
    report_to="none"               # disable wandb and other logging integrations
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()




Step,Training Loss
100,3.2186
200,0.4511


TrainOutput(global_step=280, training_loss=1.350771164894104, metrics={'train_runtime': 83.0718, 'train_samples_per_second': 26.543, 'train_steps_per_second': 3.371, 'total_flos': 1816949990400.0, 'train_loss': 1.350771164894104, 'epoch': 5.0})
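The `global_step=280` in the output above can be reproduced by hand, which also shows which step-based intervals can fire in a run of this length. The sketch assumes a single device and no gradient accumulation (the defaults); the example count, batch size, and epoch count come from the cells above.

```python
import math

num_train_examples = 441   # from the Map progress bar
batch_size = 8             # per_device_train_batch_size
num_epochs = 5             # num_train_epochs

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 56 280

def firing_steps(interval, total):
    """Steps at which an every-`interval`-steps event falls within `total` steps."""
    return list(range(interval, total + 1, interval))

print(firing_steps(100, total_steps))  # [100, 200]  (logging_steps)
print(firing_steps(500, total_steps))  # []          (eval_steps / save_steps)
```

The two logged losses at steps 100 and 200 match the first list. An interval of 500 is larger than the whole run, so step-triggered evaluation or checkpointing at that interval would not occur here; a smaller interval or an epoch-based strategy would be needed to see ROUGE scores during training.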

In [None]:
from transformers import pipeline

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
print(pipe("How can I improve my LinkedIn profile?", max_length=64)[0]['generated_text'])


Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=64) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Add a professional headline, detailed work descriptions, quantify achievements, and list certifications. certifications... certifications...... certifications...... certifications........ certifications. certifications.. and list certifications. and list certifications.. certifications....
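The answer above starts well and then trails into repetition, which is what one would expect if generation ran past the end-of-sequence token. One possible cause, stated as an assumption rather than a confirmed diagnosis: SentencePiece assigns `</s>` the id 2 by default (`<unk>`=0, `<s>`=1), while `T5Config` defaults `eos_token_id` to 1, so the model may emit `</s>` without the decoder treating it as a stop signal. The pure-Python sketch below illustrates the effect with made-up token ids.

```python
def truncate_at_eos(token_ids, eos_id):
    """Keep tokens up to (not including) the first eos_id, as a decoder's stop rule would."""
    kept = []
    for token in token_ids:
        if token == eos_id:
            break
        kept.append(token)
    return kept

# Hypothetical ids: an answer (10, 11, 12), then </s> as id 2, then leftover tokens.
generated = [10, 11, 12, 2, 12, 12, 12]

config_default_eos = 1  # T5Config default eos_token_id
tokenizer_eos = 2       # SentencePiece default eos_id

print(truncate_at_eos(generated, config_default_eos))  # [10, 11, 12, 2, 12, 12, 12]
print(truncate_at_eos(generated, tokenizer_eos))       # [10, 11, 12]
```

If this is what happened, two things would be worth trying: passing `eos_token_id=tokenizer.eos_token_id` to `T5Config` (or to the generation call), and using `max_new_tokens=64` instead of `max_length=64`, since the warning above shows the pipeline's `max_new_tokens` of 256 taking precedence.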


In [None]:
model.save_pretrained("qa_model")
tokenizer.save_pretrained("qa_tokenizer")


('qa_tokenizer/tokenizer_config.json',
 'qa_tokenizer/special_tokens_map.json',
 'qa_tokenizer/spiece.model',
 'qa_tokenizer/added_tokens.json')

In [None]:
!zip -r qa_model.zip qa_model
!zip -r qa_tokenizer.zip qa_tokenizer



  adding: qa_model/ (stored 0%)
  adding: qa_model/config.json (deflated 48%)
  adding: qa_model/model.safetensors (deflated 7%)
  adding: qa_model/generation_config.json (deflated 30%)
  adding: qa_tokenizer/ (stored 0%)
  adding: qa_tokenizer/special_tokens_map.json (deflated 86%)
  adding: qa_tokenizer/tokenizer_config.json (deflated 94%)
  adding: qa_tokenizer/spiece.model (deflated 43%)
  adding: qa_tokenizer/added_tokens.json (deflated 82%)
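The `!zip` commands rely on a shell with `zip` installed. Outside Colab, a portable alternative is the standard library's `shutil.make_archive`. The sketch below is one way to wrap it; `zip_directory` is a hypothetical helper, and the loop skips directories that do not exist so the snippet can run anywhere.

```python
import os
import shutil

def zip_directory(dir_path):
    """Create <dir_path>.zip next to the directory and return the archive path."""
    dir_path = os.path.abspath(dir_path)
    parent, name = os.path.split(dir_path)
    # base_dir keeps the top-level folder name inside the archive, like `zip -r`.
    return shutil.make_archive(dir_path, "zip", root_dir=parent, base_dir=name)

for name in ("qa_model", "qa_tokenizer"):
    if os.path.isdir(name):  # skip when the saved directories are absent
        print(zip_directory(name))
```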


In [None]:
from google.colab import files
files.download("qa_model.zip")
files.download("qa_tokenizer.zip")

