## Problem Statement

### Business Context

In the digital age, online question-answer forums such as Stack Overflow, Quora, and Reddit are essential platforms for knowledge sharing and community engagement. These platforms host millions of queries and answers, providing users with a vast repository of information.

Maintaining the quality of user-generated content is crucial for the success and satisfaction of these forums. High-quality content attracts more users, fosters a vibrant community, and enhances the platform's reputation. On the other hand, low-quality content can lead to user frustration, reduce engagement, and damage the forum's credibility.

However, the quality of these contributions can vary significantly. Ensuring high-quality content while effectively managing low-quality submissions is a significant challenge that directly impacts the overall value of the forum.

### Problem Definition

Despite the benefits of user-generated content, managing its quality poses significant challenges. The primary issues faced by these platforms include:

1. **Volume and Diversity of Content:** With thousands of queries posted daily, manually monitoring and evaluating each query's quality is impractical.
2. **Varying Quality Standards:** User contributions vary widely in quality, with some queries being clear and detailed, while others are ambiguous, off-topic, or inappropriate.
3. **Resource-Intensive Moderation:** Moderating content to ensure quality is resource-intensive, requiring significant human effort to edit low-quality queries or close inappropriate ones.
4. **User Experience:** Inconsistent quality of content can lead to a poor user experience, reducing engagement and the likelihood of users returning to the platform.

To address these challenges, there is a need for an automated solution that can efficiently classify user queries into three classes: high quality (`HQ`), low quality and edited one or more times (`LQ_EDIT`), and low quality and closed (`LQ_CLOSE`). By leveraging Generative AI techniques, specifically fine-tuning large language models (LLMs), we aim to develop a system that can:

- Automatically assess the quality of user queries.
- Reduce the manual effort required for content moderation.
- Enhance the overall user experience by ensuring high-quality content is prominently featured and low-quality content is appropriately managed.

This case study focuses on implementing and fine-tuning an LLM to achieve these objectives, providing a scalable and efficient solution for maintaining high content standards in open question-answer forums.

# Methodology



**Fine-Tuning BERT Model:** We will begin by fine-tuning a BERT model using our training dataset. This step involves training the model specifically on our data to improve its performance in understanding and classifying the content relevant to our forum.

**Applying Large Language Models (LLMs):** Next, we will utilize large language models to enhance the quality and relevance of the responses. By leveraging the capabilities of these advanced models, we aim to improve the overall user experience on the platform.

**Fine-Tuning LLMs:** Following the application of LLMs, we will fine-tune these models to further align them with the specific requirements and nuances of our forum. This step ensures that the LLMs are not only powerful but also tailored to the unique needs of our community.

**Evaluating Performance:** Finally, we will evaluate three approaches on our dataset: the fine-tuned BERT model, the LLM without fine-tuning, and the fine-tuned LLM. We will compare their F1-scores to identify the best-performing method for ensuring high-quality content and improved user engagement on our platform.
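As a rough illustration of the metric used for this comparison, the sketch below computes a macro-averaged F1-score in plain Python. The label lists are made-up toy values, not results from this case study, and `macro_f1` is a hypothetical helper. The choice of macro averaging is also an assumption; in the notebook itself, scikit-learn's `f1_score` (imported in Setup) would normally do this work.

```python
from collections import Counter

LABELS = ["HQ", "LQ_EDIT", "LQ_CLOSE"]

def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels for illustration only (not real model output).
y_true = ["HQ", "HQ", "LQ_EDIT", "LQ_EDIT", "LQ_CLOSE", "LQ_CLOSE"]
y_pred = ["HQ", "LQ_EDIT", "LQ_EDIT", "LQ_EDIT", "LQ_CLOSE", "HQ"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Computing the same score for each of the three approaches on the same held-out examples would give directly comparable numbers.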

# Objective

Fine-tune a Mistral model on a training dataset for query classification.

# Setup

In [1]:
!pip install -q datasets==2.16.1

In [6]:
#@title Run this cell to setup Unsloth on Colab
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps xformers==0.0.27.post2 "trl<0.9.0" peft==0.12.0 accelerate==0.32.1 bitsandbytes==0.43.2
!pip install triton


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting triton
  Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)
Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)
Installing collected packages: triton
Successfully installed triton-3.0.0


In [7]:
!pip freeze > r_pertfinetuning.txt

In [8]:
# Import tqdm to show progress bars over iterables
from tqdm import tqdm


# Import necessary libraries for data manipulation and analysis
import pandas as pd
import numpy as np

# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Import modules from scikit-learn for machine learning tasks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score, classification_report

# Import TensorFlow for deep learning tasks
import tensorflow as tf

# Import the regular-expression library for text preprocessing
import re

# Import modules from the Hugging Face transformers library
from transformers import BertTokenizer, TFBertForSequenceClassification, TrainingArguments, EarlyStoppingCallback

The functionalities we use from the above packages are:
- `transformers`, `datasets`: helpers to load models and to convert CSV files into datasets
- `unsloth`: facilitates application of QLoRA in conjunction with peft on 4-bit quantized base models
- `trl`: abstractions to train the LoRA adapter

# Model Fine Tuning

In [9]:
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [10]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True
)

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/155 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.13k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

In [None]:
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (

In [None]:
tokenizer

LlamaTokenizerFast(name_or_path='unsloth/mistral-7b-instruct-v0.2-bnb-4bit', vocab_size=32000, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

By default, the tokenizer adds a beginning-of-sequence token but does not add an end-of-sequence token. We will need to explicitly add this during training.

In [None]:
tokenizer.add_bos_token, tokenizer.add_eos_token

(True, False)

In [None]:
EOS_TOKEN = tokenizer.eos_token

# Prepare Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from datasets import Dataset

training = pd.read_csv("/content/drive/MyDrive/GANLP Week 9/stackflow_training_data.csv")
training_dict = training.to_dict(orient='list')
validation = pd.read_csv("/content/drive/MyDrive/GANLP Week 9/stackflow_validation_data.csv")
validation_dict = validation.to_dict(orient='list')

# Create a dataset from the dictionary
training_dataset = Dataset.from_dict(training_dict)

In [None]:
training_dataset[:5]

{'query': ['Title: Java: Repeat Task Every Random SecondsQuery: <p>I\'m already familiar with repeating tasks every n seconds by using Java.util.Timer and Java.util.TimerTask. But lets say I want to print "Hello World" to the console every random seconds from 1-5. Unfortunately I\'m in a bit of a rush and don\'t have any code to show so far. Any help would be apriciated.  </p>\n',
  "Title: Why are Java Optionals immutable?Query: <p>I'd like to understand why Java 8 Optionals were designed to be immutable.  Is it just for thread-safety?</p>\n",
  'Title: Text Overlay Image with Darkened Opacity React NativeQuery: <p>I am attempting to overlay a title over an image - with the image darkened with a lower opacity. However, the opacity effect is changing the overlaying text as well - making it dim. Any fix to this? Here is what is looks like:</p>\n\n<p><a href="https://i.stack.imgur.com/1HzD7.png" rel="noreferrer"><img src="https://i.stack.imgur.com/1HzD7.png" alt="enter image description 

The Alpaca instruction prompt is a general purpose prompt template that can be adapted to any task.

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [None]:
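def prompt_formatter(example, prompt_template):
    """Fill the prompt template with the instruction, the query, and its label."""
    instruction = 'Classify the query as LQ_EDIT, HQ or LQ_CLOSE'
    query = example["query"]
    label = example["Y"]  # target class: HQ, LQ_EDIT or LQ_CLOSE

    # Append the end-of-sequence token so the model learns where to stop
    formatted_prompt = prompt_template.format(instruction, query, label) + EOS_TOKEN

    return {'formatted_prompt': formatted_prompt}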

Notice how we are adding the end-of-sequence token to the prompt.
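To make the effect of the formatter concrete, here is a small self-contained sketch that applies the same template to a single record. The record is an invented placeholder (not a row from the dataset), and `EOS_TOKEN` is hard-coded to `</s>` as an assumption based on the tokenizer printout above; in the notebook it comes from `tokenizer.eos_token`.

```python
# Same template text as the notebook's `alpaca_prompt`.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # assumption: matches the eos_token shown in the tokenizer printout

# Hypothetical record for illustration only.
example = {"query": "Title: Example titleQuery: <p>Example body</p>", "Y": "HQ"}

def prompt_formatter(example, prompt_template):
    instruction = "Classify the query as LQ_EDIT, HQ or LQ_CLOSE"
    text = prompt_template.format(instruction, example["query"], example["Y"])
    return {"formatted_prompt": text + EOS_TOKEN}

formatted = prompt_formatter(example, alpaca_prompt)["formatted_prompt"]
print(formatted)
```

The printed prompt ends with the label immediately followed by the end-of-sequence token, which is the stopping pattern the model sees during training.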

In [None]:
formatted_training_dataset = training_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}
)

Map:   0%|          | 0/45000 [00:00<?, ? examples/s]

In [None]:
from datasets import Dataset

validation = pd.read_csv("/content/drive/MyDrive/GANLP Week 9/stackflow_validation_data.csv")
validation_dict = validation.to_dict(orient='list')

# Create a dataset from the dictionary
validation_dataset = Dataset.from_dict(validation_dict)

In [None]:
formatted_validation_dataset = validation_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}
)

Map:   0%|          | 0/15000 [00:00<?, ? examples/s]

# Fine-Tuning

We now patch in the adapter modules to the base model using the `get_peft_model` method.

> Practical Tip: $r$ defines the rank (inner dimension) of the low-rank matrices, while $\alpha$ determines the scaling factor ($\alpha / r$) applied to the low-rank update. It is common to fix $\alpha=16$, vary $r$ over $\alpha, \alpha/2, \alpha/4$, and choose the value of $r$ that gives the lowest validation loss (note that we use the same loss used for the base model, e.g., perplexity or log loss).
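The relationship between $r$ and $\alpha$ can be sketched in a few lines. This assumes the standard LoRA formulation, in which the low-rank update is multiplied by $\alpha / r$; `lora_scaling` is a hypothetical helper written for this illustration, not a function from `peft` or `unsloth`.

```python
ALPHA = 16  # fixed numerator of the scaling factor, as in the tip above

def lora_scaling(alpha, r):
    """Factor applied to the low-rank update in standard LoRA (alpha / r)."""
    return alpha / r

# Candidate ranks from the tip: alpha, alpha/2, alpha/4.
candidates = [ALPHA, ALPHA // 2, ALPHA // 4]
for r in candidates:
    print(f"r={r:>2}  scaling={lora_scaling(ALPHA, r):.1f}")
```

With $\alpha$ held fixed, a smaller rank means fewer trainable parameters but a larger multiplier on the update, so each candidate needs its own training run and validation-loss comparison.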

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
    random_state=42,
    loftq_config=None
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_featu

Notice how LoRA adapters are attached to the layers specified during instantiation.
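A back-of-the-envelope count of the trainable adapter parameters follows from the layer shapes in the model printout. The sketch assumes each adapted projection gets one `lora_A` matrix of shape `(r, in_features)` and one `lora_B` matrix of shape `(out_features, r)` with no bias, as the printout suggests; `lora_params` is a hypothetical helper.

```python
R = 16           # LoRA rank passed to get_peft_model
NUM_LAYERS = 32  # decoder layers reported in the model printout

# (in_features, out_features) of each adapted projection, read from the printout.
PROJECTIONS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_params(in_features, out_features, r=R):
    """Parameters in one adapter pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the adapters add roughly 42 million trainable parameters, a small fraction of the base model, whose weights stay frozen.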

For training, we use the following practices borrowed from the broader deep learning discipline:

- Low learning rates for smooth parameter updates
- Early stopping based on validation loss (negative log likelihood in this case)
- Checkpointing to enable resumption of training
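The early-stopping idea can be sketched independently of any library. The function below is a simplified illustration of patience-based stopping, not the implementation of `EarlyStoppingCallback`; the function name and the loss values are made up for the example.

```python
import math

def early_stop_index(eval_losses, patience=3):
    """Return the index of the evaluation at which training would stop,
    or None if the loss keeps improving often enough."""
    best = math.inf
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None

# Hypothetical validation losses, one per evaluation.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]
stop_at = early_stop_index(losses, patience=3)
print(stop_at)
```

Note that stopping can only trigger after at least `patience` evaluations have taken place, so the evaluation frequency determines how soon it can take effect.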


In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_training_dataset,
    eval_dataset=formatted_validation_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    dataset_text_field="formatted_prompt",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,  # Packing is off; enabling it can speed up training on short sequences.
    args=TrainingArguments(
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
        warmup_steps=5,
        num_train_epochs=1,
        max_steps=500,  # When positive, max_steps takes precedence over num_train_epochs.
        evaluation_strategy="epoch",
        save_strategy='epoch',
        metric_for_best_model="eval_loss",
        load_best_model_at_end=True,
        greater_is_better=False,  # lower eval_loss is better
        learning_rate=5e-5,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="paged_adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs"
    )
)

Map (num_proc=2):   0%|          | 0/45000 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/15000 [00:00<?, ? examples/s]
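It helps to work out how much data the configured run actually covers. The arithmetic below uses the batch settings from the training arguments and the 45,000-example training set; `num_devices = 1` is an assumption (a single Colab GPU).

```python
import math

train_examples = 45_000  # training set size shown in the map output
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
max_steps = 500          # optimizer steps before training stops
num_devices = 1          # assumption: one GPU

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
examples_seen = max_steps * effective_batch
fraction_of_epoch = examples_seen / train_examples

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"examples seen:        {examples_seen} ({fraction_of_epoch:.1%} of one epoch)")
```

Because `max_steps` takes precedence over `num_train_epochs`, the run stops well before a full pass over the training data, and with epoch-level evaluation the early-stopping callback has little opportunity to act.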

In [None]:
training_history = trainer.train()

# Inference

In [None]:
from datasets import Dataset

# Use the first 100 rows of the validation file as a small test set
test_data = pd.read_csv("/content/drive/MyDrive/GANLP Week 9/stackflow_validation_data.csv")[:100]
test_dict = test_data.to_dict(orient='list')

# Create a dataset from the dictionary
test_dataset = Dataset.from_dict(test_dict)

In [None]:
# Adjacent string literals are concatenated, so each piece ends with a space.
instruction = (
    "You will act as a technical assistant. "
    "You will not answer the query. Your task is to classify the quality of the query presented in the input as "
    "LQ_EDIT, HQ or LQ_CLOSE. "
    "The query will be delimited by triple backticks in the input. "
    "Answer only LQ_EDIT, HQ or LQ_CLOSE as the quality of the query, not a solution to the query."
)
test_dialogue = test_dataset[0]['query']
test_class = test_dataset[0]['Y']

In [None]:
FastLanguageModel.for_inference(model)

> Reminder: At this stage, we have the model + adapters patched in!

In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_featu

In [None]:
inputs = tokenizer(
[
    alpaca_prompt.format(
        instruction,
        test_dialogue,
        "", # leave output blank for generation
    )
], return_tensors="pt").to("cuda")

In [None]:
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id
)

In [None]:
print(
    tokenizer.decode(
        outputs[0][inputs.input_ids.shape[-1]:],  # keep only the newly generated tokens
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )
)

LQ_EDIT
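To score the model on more than one example, the generated text has to be mapped back to one of the three labels, since a generative model may add whitespace or extra words. The helper below is a hypothetical sketch of such a post-processing step; `extract_label` and its regular expression are not part of the original notebook.

```python
import re

# Match one of the three class names as a whole word.
LABEL_PATTERN = re.compile(r"\b(LQ_EDIT|LQ_CLOSE|HQ)\b")

def extract_label(generated_text):
    """Return the first recognised label in the text, or None if there is none."""
    match = LABEL_PATTERN.search(generated_text)
    return match.group(1) if match else None

# Hypothetical generations for illustration.
for text in [" LQ_EDIT\n", "The query is HQ.", "I cannot tell."]:
    print(repr(text), "->", extract_label(text))
```

Generations that yield `None` could be counted as errors or inspected separately before computing the F1-score.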


Now that we have a fine-tuned model, we can save the model to disk.

# Save Trained Model

In [None]:
# @title Setup to enable bash commands
import locale

def getpreferredencoding():
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding

In [None]:
lora_model_name = "dialogue-summarizer-mistral"

In [None]:
model.save_pretrained(lora_model_name)

In [None]:
!ls -lh {lora_model_name}

total 161M
-rw-r--r-- 1 root root  746 May 27 00:02 adapter_config.json
-rw-r--r-- 1 root root 161M May 27 00:02 adapter_model.safetensors
-rw-r--r-- 1 root root 5.0K May 27 00:02 README.md


As we can see from the output above, only the adapter is saved (the base model can be loaded on demand). To make the adapter available for later inference, we can export the saved model to a remote, secure location (in this case, Google Drive).

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!cp -r {lora_model_name} /content/drive/MyDrive