# Question Rephraser Bot

Problem Statement :

The goal of this project is to build a Question Rephraser Bot that can rewrite user questions into a clearer, more formal, and grammatically correct form.

Many user queries on the internet are often written in an informal or unclear way, such as:

“wat is best way to learn ai fast?”

The model should generate a refined version like:

“What is the most effective way to learn AI quickly?”

To achieve this, we fine-tune the Phi-2 language model using the Quora Question Pairs dataset, where duplicate question pairs act as paraphrase examples.

In [3]:
! pip install -q transformers datasets peft trl accelerate bitsandbytes



[notice] A new release of pip is available: 25.3 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


### Load Quora Question Pairs Dataset

In [4]:
from datasets import load_dataset

ds = load_dataset("AlekseyKorshuk/quora-question-pairs")


In [5]:
print(ds)
print(ds["train"][0])


DatasetDict({
    train: Dataset({
        features: ['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate'],
        num_rows: 404290
    })
})
{'id': 0, 'qid1': 1, 'qid2': 2, 'question1': 'What is the step by step guide to invest in share market in india?', 'question2': 'What is the step by step guide to invest in share market?', 'is_duplicate': 0}


### Instruction-Response Format

In [6]:
def convert_to_instruction(example):
    q1 = example["question1"]
    q2 = example["question2"]

    prompt = f"""### Instruction:
Rewrite the following question in a clearer and more formal way.

### Input:
{q1}

### Response:
{q2}"""

    return prompt   # ✅ Return STRING only


### Filter Only Duplicate Question Pairs

Meaning of is_duplicate

1 → Both questions mean the same thing (We only want real paraphrases)

0 → Both questions are completely different

In [7]:
filtered_ds = ds["train"].filter(lambda x: x["is_duplicate"] == 1)

print("Total duplicate pairs:", len(filtered_ds))


Total duplicate pairs: 149263


In [8]:
filtered_ds.shape

(149263, 6)

### Reduce Dataset for Quick Training

In [9]:
small_ds = filtered_ds.select(range(200))


### Load Phi-2 Model + Tokenizer

In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "microsoft/phi-2"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_4bit=True,
    trust_remote_code=True
)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
The 8-bit optimizer is not available on your device, only available on CUDA for now.
The 8-bit optimizer is not available on your device, only available on CUDA for now.
Loading checkpoint shards: 100%|██████████| 2/2 [02:04<00:00, 62.35s/it] 


### Apply LoRA Fine-Tuning Config

In [11]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

model.print_trainable_parameters()


trainable params: 2,621,440 || all params: 2,782,305,280 || trainable%: 0.0942


### Train with SFTTrainer

In [12]:
# Training Arguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi2-question-rephraser",

    per_device_train_batch_size=1,  
    gradient_accumulation_steps=1,   # no accumulation = quickest

    num_train_epochs=1,     # keep only 1 epoch
    max_steps=20,

    learning_rate=2e-4,

    logging_steps=10,
    save_steps=500,                 # save less often = faster

    fp16=True,
    report_to="none"
)



#### Supervised Fine-Tuning Trainer  (SFTTrainer)

It is a special training class from the TRL library (trl) that makes it very easy to fine-tune large language models (LLMs) on instruction datasets.

In [13]:
# Trainer Setup

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=filtered_ds,
    args=training_args,
    processing_class=tokenizer,
    formatting_func= convert_to_instruction
)



In [14]:
# Training

trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


Step,Training Loss
10,2.6364
20,2.2187


TrainOutput(global_step=20, training_loss=2.427521514892578, metrics={'train_runtime': 2030.4327, 'train_samples_per_second': 0.01, 'train_steps_per_second': 0.01, 'total_flos': 17370880450560.0, 'train_loss': 2.427521514892578})

### Test the Model

In [15]:
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=60
)

def rephrase(question):
    prompt = f"""### Instruction:
Rewrite the following question in a clearer and more formal way.

### Input:
{question}

### Response:
"""

    output = pipe(prompt)[0]["generated_text"]
    return output.split("### Response:")[-1].strip()


Device set to use cpu


### Custom Questions

In [16]:
print(rephrase("wat is best laptop for ai student"))
print(rephrase("how to get job in google fast"))
print(rephrase("what is difference between ai and ml??"))
print(rephrase("how to crack google interview fast??"))
print(rephrase("how to build resume for freshers"))
print(rephrase("tell me best way to learn machine learning quickly"))
print(rephrase("which programming language is best for beginners?"))
print(rephrase("how to increase cgpa in engineering"))
print(rephrase("can u explain neural networks in simple words"))
print(rephrase("how to stay motivated while learning coding"))
print(rephrase("why my python code not running"))


Which laptop would you recommend for a computer science student?
What are the best ways to apply for a job at Google?
AI and ML are different in terms of the type of tasks they can perform. AI can be either narrow or general, depending on the domain of the problem. ML can be either supervised or unsupervised, depending on the availability of labeled data.
How to prepare for Google Interview?
How to write a resume for freshers?
What is the most efficient way to learn machine learning quickly?
Which programming language is most suitable for beginners?
How can I improve my CGPA in engineering?
Can you explain neural networks in simple words?
- How can one maintain a positive mindset and drive while acquiring knowledge and skills in the field of coding?
What is the reason for the Python code not to run?


#### What is LoRA ?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique used to adapt large language models without updating all model weights.

Instead of modifying the full model, LoRA introduces small trainable matrices into specific layers (such as attention layers).

Key idea:

The original model remains frozen

Only a small number of additional parameters are trained

This makes training faster and lighter

LoRA allows large models to learn new tasks efficiently with minimal compute requirements.

#### Why LoRA Instead of Full Fine-Tuning ?

Full fine-tuning updates all parameters of a large model, which is expensive and requires high GPU memory.

LoRA is preferred because:

* Lower memory usage (only small adapters are trained)

* Faster training compared to full fine-tuning

* Cheaper compute cost

* Prevents overfitting when dataset size is limited

* Makes deployment easier since LoRA adapters are lightweight

Thus, LoRA is an efficient approach for fine-tuning models like Phi-2 on tasks such as question rewriting.

### Observations from the Results :

After training Phi-2 with LoRA on duplicate question pairs:

The model successfully learned to rewrite informal questions into more structured and formal language.

Output questions became clearer, grammatically correct, and closer to professional phrasing.

The model performed well on common user queries such as job, education, and programming questions.

Example:

Input:

“why my python code not running??”

Output:

“Why is my Python code not executing properly?”

Limitations Observed:

Occasionally, the model produces outputs too similar to the input if the question is already formal.

More training epochs and a larger dataset can further improve paraphrasing quality.

Overall, LoRA fine-tuning proved effective for building a lightweight question rephrasing assistant.