# Mental Health Chatbot with Fine-tuned Falcon-7B

This project implements a mental health chatbot using Transformers, PEFT (Parameter-Efficient Fine-Tuning), and Gradio. The chatbot uses a fine-tuned Falcon-7B model to generate responses to user queries about mental health topics.

## Installation
To run this chatbot, install the required dependencies. The commands below use the notebook `!` prefix; drop it when running them in a regular shell:

```sh
# Install core dependencies
!pip install trl==0.6.0 transformers accelerate peft gradio -Uqqq

# Install additional dependencies for model optimization
!pip install datasets bitsandbytes einops wandb -Uqqq
```

These packages include:

- `trl`: For supervised fine-tuning with `SFTTrainer`.
- `transformers`: For working with pre-trained language models.
- `accelerate`: For efficient multi-GPU and mixed-precision training.
- `peft`: For parameter-efficient fine-tuning.
- `gradio`: For building the chatbot interface.
- `datasets`: For loading and processing datasets.
- `bitsandbytes`: For 4-bit quantization of the model.
- `einops`: For tensor manipulation.
- `wandb`: For logging and monitoring model training.

## Code Modules
The implementation is structured into multiple functions to maintain modularity and readability.

### 1. Model and Tokenizer Initialization
This function initializes the model and tokenizer using the Falcon-7B model with PEFT and bitsandbytes for efficient model loading and inference.

### 2. Fine-Tuning Model with PEFT
This step prepares the model for LoRA (Low-Rank Adaptation) fine-tuning, a PEFT method, using `LoraConfig`.

### 3. Training the Model
This section includes the setup for model training with SFTTrainer.

### 4. Generating Responses
This function generates responses from both the original model and the fine-tuned PEFT model.


In [3]:
"""
Falcon-7b-Model-Finetuned-By-Mental-Health-Data
=====================================
This project implements a Mental Health Chatbot using Transformers, PEFT (Parameter-Efficient Fine-Tuning), and Gradio. 
The chatbot uses a fine-tuned Falcon-7B model to generate responses to user queries about mental health topics.
"""

# Import necessary libraries and modules
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, GenerationConfig
from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
import warnings
warnings.filterwarnings("ignore")

In [4]:
# ============================
# Login to Hugging Face Hub
# ============================

from huggingface_hub import notebook_login
notebook_login()

In [5]:
# ============================
# Load the mental health chatbot dataset
# ============================

data = load_dataset("heliosbrahma/mental_health_chatbot_dataset")
print(data["train"][0]['text']) # Print a sample from the dataset

<HUMAN>: What is a panic attack?
<ASSISTANT>: Panic attacks come on suddenly and involve intense and often overwhelming fear. They’re accompanied by very challenging physical symptoms, like a racing heartbeat, shortness of breath, or nausea. Unexpected panic attacks occur without an obvious cause. Expected panic attacks are cued by external stressors, like phobias. Panic attacks can happen to anyone, but having more than one may be a sign of panic disorder, a mental health condition characterized by sudden and repeated panic attacks.
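
Each record is a single string that holds both turns, marked with `<HUMAN>:` and `<ASSISTANT>:`. As a minimal sketch (the helper name `split_turns` is hypothetical and not part of the notebook), a record in this layout could be split into a question and an answer like this:

```python
import re

def split_turns(text):
    """Split a '<HUMAN>: ... <ASSISTANT>: ...' record into (question, answer)."""
    match = re.search(
        r"<HUMAN>:\s*(.*?)\s*<ASSISTANT>:\s*(.*)", text, flags=re.DOTALL
    )
    if match is None:
        return None  # record does not follow the expected layout
    question, answer = match.groups()
    return question.strip(), answer.strip()

# Shortened stand-in for a dataset record
sample = "<HUMAN>: What is a panic attack?\n<ASSISTANT>: Panic attacks come on suddenly."
print(split_turns(sample))
```

Splitting the turns this way can help when inspecting the data or building prompts, although the training below uses the raw `text` field unchanged.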


In [6]:
# ============================
# Define model and tokenizer
# ============================


model_name = "ybelkada/falcon-7b-sharded-bf16" # sharded falcon-7b model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,            # load model weights in 4-bit precision
    bnb_4bit_quant_type="nf4",    # quantize weights with the 4-bit NormalFloat (NF4) data type
    bnb_4bit_use_double_quant=True, # use double quantization, as described in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16, # run computations in BF16 (weights stay stored in 4-bit)
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config, # Use bitsandbytes config
    device_map="auto",  # Specifying device_map="auto" so that HF Accelerate will determine which GPU to put each layer of the model on
    trust_remote_code=True, # Set trust_remote_code=True to use falcon-7b model with custom code
)
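
To see why 4-bit loading matters here, a rough back-of-the-envelope estimate of weight storage may help. This sketch assumes a round figure of 7 billion parameters and ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so treat the numbers as approximate:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumption: round figure for a 7B-parameter model

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # original bf16 checkpoint
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights
print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the quantized weights take roughly a quarter of the bf16 footprint, which is what makes fine-tuning on a single GPU plausible.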

In [7]:
# ============================
# Load tokenizer for the model
# ============================

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) # Set trust_remote_code=True
tokenizer.pad_token = tokenizer.eos_token # Setting pad_token same as eos_token

In [8]:
# ============================
# Prepare model for k-bit training
# ============================


model = prepare_model_for_kbit_training(model)

# Define LoRA configurations for efficient training
lora_alpha = 32 # scaling factor for the weight matrices
lora_dropout = 0.05 # dropout probability of the LoRA layers
lora_rank = 32 # dimension of the low-rank matrices

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_rank,
    bias="none",  # setting to 'none' for only training weight params instead of biases
    task_type="CAUSAL_LM",
    target_modules=[  # Setting names of modules in falcon-7b model that we want to apply LoRA to
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)

# Apply LoRA to the model
peft_model = get_peft_model(model, peft_config)
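
The number of parameters LoRA trains can be estimated from the rank and the shapes of the targeted layers. The sketch below assumes the usual Falcon-7B dimensions (hidden size 4544, fused QKV output of 4672, MLP width of 4 × 4544, 32 decoder layers); these values are not stated in this notebook, so verify them against the model config before relying on the result:

```python
# Assumed Falcon-7B shapes (not taken from this notebook)
HIDDEN = 4544
NUM_LAYERS = 32
TARGET_SHAPES = {
    "query_key_value": (HIDDEN, 4672),
    "dense": (HIDDEN, HIDDEN),
    "dense_h_to_4h": (HIDDEN, 4 * HIDDEN),
    "dense_4h_to_h": (4 * HIDDEN, HIDDEN),
}

def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted linear layer adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(rank=32)
print(f"{trainable:,} trainable LoRA parameters")
print(f"~{trainable * 4 / 1e6:.0f} MB if stored in float32")
```

To get the actual figure for the loaded model rather than an estimate, call `peft_model.print_trainable_parameters()`.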

In [9]:
# ============================
# Training configurations
# ============================


output_dir = "./Falcon-7b-Model-Finetuned-By-Mental-Health-Data"
per_device_train_batch_size = 4 # reduce batch size by 2x if out-of-memory error
gradient_accumulation_steps = 16  # increase gradient accumulation steps by 2x if batch size is reduced
optim = "paged_adamw_32bit" # activates the paging for better memory management
save_strategy = "steps" # checkpoint save strategy to adopt during training
save_steps = 1 # number of update steps between two checkpoint saves
logging_steps = 1  # number of update steps between two logs if logging_strategy="steps"
learning_rate = 2e-4  # learning rate for AdamW optimizer
max_grad_norm = 0.3 # maximum gradient norm (for gradient clipping)
max_steps = 10        # training runs for 10 optimizer update steps
warmup_ratio = 0.03 # fraction of total training steps used for a linear warmup from 0 to learning_rate
lr_scheduler_type = "cosine"  # learning rate scheduler

# Define training arguments for the fine-tuning process
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    bf16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    push_to_hub=True,
)
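
The interaction between batch size, gradient accumulation, and `max_steps` is easier to see with the arithmetic written out. This sketch assumes a single GPU and a train split of 172 examples (the size of the split in the original run), and it assumes warmup steps are obtained by rounding `max_steps * warmup_ratio` up:

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 16
max_steps = 10
warmup_ratio = 0.03
num_devices = 1      # assumption: single GPU
dataset_size = 172   # size of the train split in the original run

# Examples consumed per optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples consumed over the whole run, and the equivalent number of epochs
samples_seen = effective_batch * max_steps
epochs = samples_seen / dataset_size
# Assumed rounding rule for the warmup phase
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen} (~{epochs:.2f} epochs)")
print(f"warmup steps: {warmup_steps}")
```

With these settings, halving the per-device batch size while doubling the accumulation steps keeps the effective batch size unchanged.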

In [10]:
# ============================
# Tokenization function for preparing dataset
# ============================

def tokenize_function(examples):
    return tokenizer(
        examples["text"], truncation=True, padding="max_length", max_length=1024
    )

# Tokenizing the dataset
tokenized_dataset = data.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["text"])

In [11]:
# ============================
# Initialize the trainer for fine-tuning
# ============================

trainer = SFTTrainer(
    model=peft_model,
    train_dataset=tokenized_dataset["train"],  #Use preprocessed dataset
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [12]:
# ============================
# Cast the layer norms to torch.bfloat16 for more stable training
# ============================

for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.bfloat16)  # nn.Module.to() converts parameters in place

In [14]:
# ============================
# WandB login for experiment tracking
# ============================

!wandb login 'WANDB_API_KEY'
import wandb
wandb.login()

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin

wandb: Currently logged in as: atickft13129 (atickft13129-green-university-of-bangladesh) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [15]:
# ============================
# Disable the cache and start training
# ============================

# Disable cache to prevent issues during training
peft_model.config.use_cache = False

# Start the training process
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1 | 1.6592 |
| 2 | 1.6686 |
| 3 | 1.6901 |
| 4 | 1.5745 |
| 5 | 1.4499 |
| 6 | 1.4439 |
| 7 | 1.3915 |
| 8 | 1.3387 |
| 9 | 1.2745 |
| 10 | 1.3673 |


TrainOutput(global_step=10, training_loss=1.4858099102973938, metrics={'train_runtime': 5221.4497, 'train_samples_per_second': 0.123, 'train_steps_per_second': 0.002, 'total_flos': 2.384538496401408e+16, 'train_loss': 1.4858099102973938})
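
As a quick sanity check, the `training_loss` reported in `TrainOutput` can be compared with the per-step values logged during the run. The sketch below simply averages the ten logged losses; small differences are expected because the logged values are rounded:

```python
# Per-step training losses copied from the log above
losses = [1.6592, 1.6686, 1.6901, 1.5745, 1.4499,
          1.4439, 1.3915, 1.3387, 1.2745, 1.3673]

mean_loss = sum(losses) / len(losses)
# Relative improvement between the first and last logged step
relative_drop = (losses[0] - losses[-1]) / losses[0]

print(f"mean logged loss: {mean_loss:.4f}")
print(f"relative drop from step 1 to step 10: {relative_drop:.1%}")
```

With only 10 update steps the curve is short, so the downward trend is suggestive rather than conclusive.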

In [16]:
# ============================
# Push the fine-tuned model to Hugging Face Hub
# ============================

trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/FardinAbrar/Falcon-7b-model-Finetuned-By-Mental-Health-Data/commit/dbef2b83607bbfc3236636dbc5b9eff495947a34', commit_message='End of training', commit_description='', oid='dbef2b83607bbfc3236636dbc5b9eff495947a34', pr_url=None, repo_url=RepoUrl('https://huggingface.co/FardinAbrar/Falcon-7b-model-Finetuned-By-Mental-Health-Data', endpoint='https://huggingface.co', repo_type='model', repo_id='FardinAbrar/Falcon-7b-model-Finetuned-By-Mental-Health-Data'), pr_revision=None, pr_num=None)

In [4]:
# ============================
# Loading the original model
# ============================

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

In [5]:
# ============================
# Load PEFT model from the Hub
# ============================


PEFT_MODEL = "FardinAbrar/Falcon-7b-model-Finetuned-By-Mental-Health-Data"

config = PeftConfig.from_pretrained(PEFT_MODEL)
peft_base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

peft_model = PeftModel.from_pretrained(peft_base_model, PEFT_MODEL)

# Initialize PEFT tokenizer
peft_tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
peft_tokenizer.pad_token = peft_tokenizer.eos_token

In [10]:
# ============================
# Function to generate responses from both original model and PEFT model
# ============================

def generate_answer(query):
  system_prompt = """ """

  user_prompt = f"""<HUMAN>: {query}
  <ASSISTANT>: """

  final_prompt = system_prompt + "\n" + user_prompt

  device = "cuda:0"
  dashline = "-" * 49  # separator printed between the responses

  # Generation settings shared by both models. temperature and top_p only
  # take effect when do_sample=True; without it, generate() decodes greedily.
  def make_config(tok):
    return GenerationConfig(
        max_new_tokens=256,
        pad_token_id=tok.eos_token_id,
        eos_token_id=tok.eos_token_id,
        temperature=0.4,
        top_p=0.6,
        repetition_penalty=1.3,
        num_return_sequences=1,
    )

  # Generating response from the original model
  encoding = tokenizer(final_prompt, return_tensors="pt").to(device)
  outputs = model.generate(
      input_ids=encoding.input_ids,
      attention_mask=encoding.attention_mask,  # belongs to generate(), not GenerationConfig
      generation_config=make_config(tokenizer),
  )
  text_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

  print(dashline)
  print(f'ORIGINAL MODEL RESPONSE:\n{text_output}')
  print(dashline)

  # Generating response from the PEFT model
  peft_encoding = peft_tokenizer(final_prompt, return_tensors="pt").to(device)
  peft_outputs = peft_model.generate(
      input_ids=peft_encoding.input_ids,
      attention_mask=peft_encoding.attention_mask,
      generation_config=make_config(peft_tokenizer),
  )
  peft_text_output = peft_tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

  print(f'PEFT MODEL RESPONSE:\n{peft_text_output}')
  print(dashline)
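
The decoded text includes the prompt itself and may continue into extra turns the model invents. One possible way to keep only the first assistant reply is sketched below; `extract_reply` is a hypothetical helper that is not part of the notebook, and it assumes the `<HUMAN>:` / `<ASSISTANT>:` tags used in the prompt:

```python
def extract_reply(decoded, human_tag="<HUMAN>:", assistant_tag="<ASSISTANT>:"):
    """Return only the first assistant turn from a decoded generation."""
    # Drop everything up to and including the first assistant tag (the echoed prompt).
    _, found, reply = decoded.partition(assistant_tag)
    if not found:
        return decoded.strip()
    # Cut the reply at the next turn tag, if the model generated one.
    for tag in (human_tag, assistant_tag):
        reply = reply.split(tag, 1)[0]
    return reply.strip()

# Made-up decoded output, used only to illustrate the helper
raw = (
    "\n<HUMAN>: How to take care of mental health?\n"
    "  <ASSISTANT>: Sleep well and stay active.\n"
    "<HUMAN>: 1. <ASSISTANT>: ..."
)
print(extract_reply(raw))
```

A helper like this could be applied to `text_output` and `peft_text_output` before printing, which would make the two responses easier to compare.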

In [11]:
# ============================
# Testing the model's response generation
# ============================

query = "How can I prevent anxiety and depression?"
generate_answer(query)

-------------------------------------------------
ORIGINAL MODEL RESPONSE:
 
<HUMAN>: How can I prevent anxiety and depression?
  <ASSISTANT>: <HUMAN>, you can prevent anxiety and depression by taking a walk, exercising, meditating, and eating healthy foods.

I'm not sure if this is the best way to do it, but it's a start.
 I'm not sure if I should use the <HUMAN> tag or not.
 @user2357112 I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should use the `` tag or not. I'm not sure if I should
-------------------------------------------------
PEFT MODEL RESPONSE:
 
<HUMAN>: How can I prevent an

In [12]:
# ============================
# Testing the model's response generation
# ============================

query = "How to take care of mental health?"
generate_answer(query)

-------------------------------------------------
ORIGINAL MODEL RESPONSE:
 
<HUMAN>: How to take care of mental health?
  <ASSISTANT>: <HUMAN>, you can take care of your mental health by doing the following things:
<HUMAN>: 1. <ASSISTANT>: <HUMAN>, you can take care of your mental health by doing the following things:
<HUMAN>: 2. <ASSISTANT>: <HUMAN>, you can take care of your mental health by doing the following things:
<HUMAN>: 3. <ASSISTANT>: <HUMAN>, you can take care of your mental health by doing the following things:
<HUMAN>: 4. <ASSISTANT>: <HUMAN>, you can take care of your mental health by doing the following things:
<HUMAN>: 5. <ASSISTANT>: <HUMAN>, you can take care of your mental health by doing the following things:
<HUMAN>: 6. <ASSISTANT>: <HUMAN>, you can take care of your mental health by doing the following things:
<HUMAN>: 7. <ASSISTANT>: <HUMAN>, you can take care of
-------------------------------------------------
PEFT MODEL RESPONSE:
 
<HUMAN>: How to take care 

In [13]:
# ============================
# Testing the model's response generation
# ============================

query = "What is the warning sign of depression?"
generate_answer(query)

-------------------------------------------------
ORIGINAL MODEL RESPONSE:
 
  <ASSISTANT>: <HUMAN> is depressed.

I'm not sure if I'm doing something wrong or if I'm just not understanding how to use the <HUMAN> tag.
 You can use the <HUMAN> tag to represent a human, but it's not really meant for that. It's meant for representing a human in a conversation.
For example, if you want to say "I'm depressed", you would say "I'm depressed", and then you would say "I'm depressed because...".
If you want to say "I'm depressed", you would say "I'm depressed", and then you would say "I'm depressed because...".
If you want to say "I'm depressed", you would say "I'm depressed", and then you would say "I'm depressed because...".
If you want to say "I'm depressed", you would say "I'm depressed", and then you would say "I'm depressed because...".
If you want to say "I'm depressed", you would say "I'm depressed", and then you
-------------------------------------------------
PEFT MODEL RESPONSE:
 
  