<a href="https://colab.research.google.com/github/SUBHA2211/DATA_SETS/blob/main/Experiment_KhanQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# 📌 Step 1: Install required libraries
!pip install -q transformers datasets accelerate evaluate

# 📌 Step 2: Upload your KhanQ.json
from google.colab import files
uploaded = files.upload()


Saving KhanQ.json to KhanQ (1).json


In [2]:
# 📌 Step 3: Load JSON and convert to input/target pairs
import json
import pandas as pd

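# The upload above was saved as "KhanQ (1).json" because "KhanQ.json" already existed,
# so parse the uploaded bytes directly instead of reopening a possibly stale file
data = json.loads(next(iter(uploaded.values())).decode("utf-8"))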

rows = []
for item in data:
    context = item.get("Context", "").strip()
    prompt_type = item.get("Prompt", {}).get("type", "").strip()
    prompt_content = item.get("Prompt", {}).get("content", "").strip()
    question = item.get("Question", "").strip()

    if context and question:
        input_text = f"[{prompt_type}] {prompt_content}\n\n{context}" if prompt_type and prompt_content else context
        rows.append({"context": context, "input": input_text, "target": question})

df = pd.DataFrame(rows)
df = df.dropna()
df.head()


Unnamed: 0,context,input,target
0,Electronegativity is how strongly the element ...,[Question] Reactivity is often described by el...,Do electronegativity and elektrodepotential bo...
1,all the reducing agents undergo oxidation them...,[Citation] Lithium having highest ionisation p...,How lithium behaves as a strong reducing agent?
2,Reduction = gain in electrons. K would find it...,[Citation] Isn't the valence electron in Li mo...,How come Li has more reduction potential than ...
3,Reduction = gain in electrons. K would find it...,[Citation] K is more reactive than Li\n\nReduc...,Does more reactive imply a greater reducing po...
4,Lithium is a stronger reducing agent than all ...,[Question] Isn't the valence electron in Li mo...,How come Li has more reduction potential than ...
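
The pairing logic in Step 3 can be pulled into a small helper, which makes the fallback easier to see: the prompt is prepended only when both its type and its content are present, and records without a context or a question are skipped. This is a minimal sketch; `build_pair` and the sample record are hypothetical and not part of the notebook.

```python
from typing import Optional


def build_pair(item: dict) -> Optional[dict]:
    """Return an {'input', 'target'} pair, or None if context or question is missing."""
    context = item.get("Context", "").strip()
    prompt = item.get("Prompt", {})
    prompt_type = prompt.get("type", "").strip()
    prompt_content = prompt.get("content", "").strip()
    question = item.get("Question", "").strip()

    if not (context and question):
        return None
    if prompt_type and prompt_content:
        # Prompt goes first, separated from the context by a blank line
        input_text = f"[{prompt_type}] {prompt_content}\n\n{context}"
    else:
        input_text = context
    return {"input": input_text, "target": question}


# Hypothetical record shaped like the entries the loop above expects
record = {
    "Context": "X is a thing.",
    "Prompt": {"type": "Question", "content": "What is X?"},
    "Question": "Why is X a thing?",
}
pair = build_pair(record)
print(pair["input"])
```

With a helper like this, the loop reduces to collecting every non-`None` result.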


In [3]:
# Save the DataFrame as a CSV file and download it

df.to_csv("khanq.csv", index=False)
files.download("khanq.csv")


In [4]:
!pip install numpy==1.26.4




In [5]:
import numpy as np
import transformers
import datasets


In [6]:
# 📌 Step 4: Convert to Hugging Face dataset
from datasets import Dataset
dataset = Dataset.from_pandas(df[['input', 'target']])
dataset = dataset.train_test_split(test_size=0.1, seed=42)  # fixed seed for a reproducible split
dataset


DatasetDict({
    train: Dataset({
        features: ['input', 'target'],
        num_rows: 930
    })
    test: Dataset({
        features: ['input', 'target'],
        num_rows: 104
    })
})
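
The split sizes above can be reproduced by hand. The sketch below assumes scikit-learn-style rounding (test share rounded up, train share rounded down), which is consistent with the 930/104 split shown. The total row count is taken from that output.

```python
import math

n_rows = 930 + 104   # total rows, from the DatasetDict output above
test_size = 0.1

n_test = math.ceil(test_size * n_rows)           # test share is rounded up
n_train = math.floor((1 - test_size) * n_rows)   # train share is rounded down
print(n_train, n_test)  # 930 104
```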

In [7]:
# Inspect the first training example
print(dataset['train'][0])

{'input': "[Question] There are two carbon atoms between each pair of our six oxygen atoms, totaling eighteen carbons. Does the number of oxygens affect the properties of the crown ether molecule? What about the distribution? Can there be a different number of carbons between each pair of oxygens and still call it a crown ether? Are crown ethers always symmetrical?\n\nMost crown ethers such as 18-crown-6 have two C atoms between the O atoms because they are easy to make from ethane-1,2-diol as a starting material. 15-crown-5 is slightly smaller but it is still a crown ether.There is also a diaza-15-crown-6, in which N atoms replace two of the O atoms. Crown ethers don't have to be symmetrical, but they are much more difficult to make. There is, for example, a 16-crown 6, which has three carbon atoms between one pair of O atoms. The most useful property of crown ethers is their ability to complex (or 'chelate') with cations. For example the 'hole' between the O atoms in 18-crown-6 is ju

In [8]:
# 📌 Step 5: Tokenize
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")

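def tokenize(example):
    model_input = tokenizer(example['input'], padding="max_length", truncation=True, max_length=256)
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=example['target'], padding="max_length", truncation=True, max_length=64)
    # Replace padding ids with -100 so padded positions are ignored by the loss
    model_input["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_input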

tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


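
Because Step 5 uses `truncation=True` with `max_length=256`, any longer input is cut silently. A quick way to see how often that happens is to measure the share of texts whose encoded length exceeds the limit. The sketch below is illustrative only: `truncation_rate` is a hypothetical helper, and the example uses whitespace splitting as a stand-in encoder rather than T5's actual tokenizer.

```python
from typing import Callable, Iterable, List


def truncation_rate(texts: Iterable[str],
                    encode: Callable[[str], List],
                    max_length: int) -> float:
    """Fraction of texts whose encoded length exceeds max_length."""
    lengths = [len(encode(t)) for t in texts]
    if not lengths:
        return 0.0
    return sum(n > max_length for n in lengths) / len(lengths)


# Stand-in encoder for illustration: whitespace tokens, not SentencePiece pieces
toy_texts = ["a b c", "a b c d e f", "a"]
rate = truncation_rate(toy_texts, str.split, max_length=4)
print(f"{rate:.2f}")  # 0.33
```

In the notebook, one could pass `df["input"]` as `texts` and `lambda t: tokenizer(t)["input_ids"]` as `encode` to get the rate for the real tokenizer.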

In [9]:
# 📌 Step 6: Load the model
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")


In [10]:
# 📌 Step 7: Setup training
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="qg-t5-small",
    eval_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="logs",
    save_total_limit=1,
    push_to_hub=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
)




In [11]:
# 📌 Step 8: Train
trainer.train()




Epoch,Training Loss,Validation Loss
1,No log,0.660765
2,No log,0.651313
3,No log,0.651072


TrainOutput(global_step=351, training_loss=1.0206432722912215, metrics={'train_runtime': 63.4706, 'train_samples_per_second': 43.957, 'train_steps_per_second': 5.53, 'total_flos': 188801813053440.0, 'train_loss': 1.0206432722912215, 'epoch': 3.0})
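
The numbers in `TrainOutput` can be checked against the training setup. The sketch below assumes a single device and no gradient accumulation. It also assumes the `Trainer` default of `logging_steps=500`, which would explain the "No log" entries: the run ends before the first logging step is reached.

```python
import math

train_rows = 930          # from the DatasetDict output
batch_size = 8            # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 63.4706   # seconds, from TrainOutput

# The last, partial batch of each epoch still counts as one optimizer step
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime

default_logging_steps = 500  # assumed Trainer default, not set in this notebook
never_logged = total_steps < default_logging_steps

print(steps_per_epoch, total_steps)   # 117 351
print(round(samples_per_second, 3))
print(never_logged)                   # True
```

Setting a smaller `logging_steps` in `TrainingArguments` (for example 50) would populate the Training Loss column.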

In [12]:
# 📌 Step 9: Inference
sample = df.iloc[0]['input']
inputs = tokenizer(sample, return_tensors="pt", truncation=True, padding=True).to(model.device)
output = model.generate(**inputs, max_length=64)
print("🧠 Input:\n", sample)
print("\n📝 Generated Question:\n", tokenizer.decode(output[0], skip_special_tokens=True))


🧠 Input:
 [Question] Reactivity is often described by electronegativity or by electrodpotential. What is the difference?

Electronegativity is how strongly the element hogs the election ONCE the covalent bond is made. Electropotential is the tendancy of the element to lose/gain electrons (so I see it close to ionization energy/electron affinity definitions). Nevertheless, they are related.

📝 Generated Question:
 What is the difference between the two?


In [13]:
!pip install -q evaluate


In [15]:
# Install the required library for the ROUGE metric
!pip install rouge_score



In [16]:
import evaluate

# Load metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

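# Generate predictions on a sample of the held-out test split
references = []
predictions = []

# ⚠️ Limit to 50 examples for quick evaluation. Sample from the test split rather than df,
# because df also contains the training examples
test_sample = dataset["test"].shuffle(seed=42).select(range(50))
for example in test_sample: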
    inputs = tokenizer(example["input"], return_tensors="pt", truncation=True, padding=True).to(model.device)
    output = model.generate(**inputs, max_length=64)
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    predictions.append(decoded)
    references.append(example["target"])

# Evaluate metrics
bleu_result = bleu.compute(predictions=predictions, references=[[ref] for ref in references])
rouge_result = rouge.compute(predictions=predictions, references=references)
meteor_result = meteor.compute(predictions=predictions, references=references)

# Print results
print("\n📊 Evaluation Metrics on Test Sample")
print(f"BLEU Score   : {bleu_result['bleu']:.4f}")
print(f"ROUGE-L Score: {rouge_result['rougeL']:.4f}")
print(f"METEOR Score : {meteor_result['meteor']:.4f}")





📊 Evaluation Metrics on Test Sample
BLEU Score   : 0.0545
ROUGE-L Score: 0.2552
METEOR Score : 0.2433
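
To make the ROUGE-L figure easier to interpret, here is a simplified sketch of what the metric measures: an F-score built on the longest common subsequence (LCS) of prediction and reference tokens. It uses lowercased whitespace tokens and no stemming, so its values will not match the `evaluate` implementation exactly. The function names and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """LCS-based F1 over lowercased whitespace tokens (simplified ROUGE-L)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 5 shared tokens in order, 7 predicted, 9 in the reference
score = rouge_l_f1(
    "what is the difference between the two",
    "what is the difference between electronegativity and electrode potential",
)
print(f"{score:.3f}")  # 0.625
```

A generic question such as "What is the difference between the two?" can share a long opening with the reference and still miss its specific content. That is one reason to read ROUGE-L alongside BLEU and METEOR rather than alone.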
