# Mental Health Support Chatbot (Fine-Tuned)

We take `distilgpt2`, a small, freely available GPT-2 model, and fine-tune it on the `EmpatheticDialogues` dataset.

Install the Hugging Face libraries `transformers`, `datasets`, and `accelerate`, plus `sentencepiece`.

In [2]:
!pip install transformers datasets accelerate sentencepiece --quiet

## Load the dataset

The dataset comes already divided into train, validation, and test splits.

Loading `facebook/empathetic_dialogues` with Hugging Face `datasets` failed with an error, so I searched the Hugging Face Hub and downloaded an alternative copy, `Adapting/empathetic_dialogues_v2`, instead.


In [3]:
from datasets import load_dataset

dataset = load_dataset("Adapting/empathetic_dialogues_v2")

train_dataset = dataset["train"]
test_dataset = dataset["test"]

Generating train split: 40245 examples
Generating validation split: 5734 examples
Generating test split: 5255 examples

To check the data format and column names, select the first three training records and print the first one.

In [4]:
first_three_train = train_dataset.select(range(3))
print(first_three_train[0])

{'id': 1, 'chat_history': "['Childrens place is the best for kids clothes', 'Oh really? i have never been there. do they have good collection?', 'They do! Is inexpesive and good quality. I dont buy from anywhere else']", 'sys_response': 'sounds nice. i am going to check that out myself', 'situation': 'I have always been a big fan of childrens place, I will never shop anywhere else', 'emotion': 'faithful', 'question or not': '[None]', 'behavior': "I'm in a positive mood, please congratulate me and praise me."}
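
The `chat_history` value in the record above is a single string that looks like a Python list, not a list of turns. Below is a minimal sketch, assuming every row follows that format, of turning one record into plain dialogue text before tokenization. The helper name `format_dialogue` and the `User:`/`Bot:` labels are hypothetical and not part of the original notebook.

```python
import ast


def format_dialogue(record):
    """Hypothetical helper: build plain dialogue text from one dataset record."""
    try:
        # chat_history is stored as the string form of a Python list.
        turns = ast.literal_eval(record["chat_history"])
    except (ValueError, SyntaxError):
        # Assumption: if the string is not a valid list literal, keep it as one turn.
        turns = [record["chat_history"]]

    lines = []
    for i, turn in enumerate(turns):
        # Assumption: the last turn in the history is the user, and speakers alternate.
        speaker = "User" if (len(turns) - 1 - i) % 2 == 0 else "Bot"
        lines.append(f"{speaker}: {turn}")

    # sys_response is the reply that follows the history.
    lines.append(f"Bot: {record['sys_response']}")
    return "\n".join(lines)


example = {
    "chat_history": "['Childrens place is the best for kids clothes', "
                    "'Oh really? i have never been there. do they have good collection?', "
                    "'They do! Is inexpesive and good quality. I dont buy from anywhere else']",
    "sys_response": "sounds nice. i am going to check that out myself",
}
print(format_dialogue(example))
```

Training on text shaped like this, rather than on the raw list string, may reduce stray quotes and brackets in generated replies, though that has not been verified in this notebook.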


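Load `distilgpt2`, a free, distilled version of GPT-2, together with its tokenizer. GPT-2 has no padding token, so the end-of-sequence token is reused for padding.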

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)

Tokenize the `chat_history` column of the train and test splits, truncating or padding every example to 256 tokens.

In [6]:
def tokenize(batch):
    return tokenizer(batch["chat_history"], truncation=True, padding="max_length", max_length=256)

train_tokenized = train_dataset.map(tokenize, batched=True, remove_columns=["chat_history"])
test_tokenized = test_dataset.map(tokenize, batched=True, remove_columns=["chat_history"])

I train for only one epoch in the training configuration below because compute resources are limited.

In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llm-finetuned",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    logging_steps=50,
    save_steps=500,
    fp16=True,
    push_to_hub=False
)

In [8]:
from transformers import Trainer, DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    data_collator=data_collator,
)
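
With `mlm=False`, the collator prepares labels for causal language modeling: the labels are a copy of `input_ids`, with padding positions replaced by `-100` so the loss ignores them. The sketch below is a simplified pure-Python illustration of that rule, not the library's code; `causal_lm_labels` is a hypothetical name. Because this notebook sets `pad_token` to `eos_token`, every end-of-sequence token is masked as well.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def causal_lm_labels(input_ids, pad_token_id):
    """Hypothetical illustration: copy input_ids and mask padding positions."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]


# 50256 is the end-of-text id in the GPT-2 vocabulary, reused here as padding.
padded_example = [15496, 995, 50256, 50256]
print(causal_lm_labels(padded_example, pad_token_id=50256))
```

One consequence of masking every end-of-sequence token is that the model gets no training signal for when to stop, which is worth keeping in mind when reading the generated samples.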

Training the model

I use Weights & Biases (W&B) to track training metrics, so its API key has to be set before training starts. The metrics can then be viewed on the W&B website.

In [9]:
import os
from google.colab import userdata
wandb_api_key = userdata.get('WANDB_API_KEY')

os.environ['WANDB_API_KEY'] = wandb_api_key
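
The cell above only works inside Google Colab. Below is a minimal sketch, assuming the key may instead already be set as an environment variable when the notebook runs elsewhere; `resolve_wandb_key` is a hypothetical helper, not part of the original notebook.

```python
import os


def resolve_wandb_key(env_var="WANDB_API_KEY"):
    """Hypothetical helper: read the key from Colab secrets, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Outside Colab, fall back to an existing environment variable (may be None).
        return os.environ.get(env_var)
    return userdata.get(env_var)


key = resolve_wandb_key()
if key:
    os.environ["WANDB_API_KEY"] = key
```

If W&B logging is not wanted at all, passing `report_to="none"` to `TrainingArguments` disables it.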

In [10]:
trainer.train()

wandb: Currently logged in as: naveedhematmal (naveedahmadhematmal) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 50   | 3.7627 |
| 100  | 3.4961 |
| 150  | 3.4091 |
| 200  | 3.4707 |
| 250  | 3.4103 |
| 300  | 3.3979 |
| 350  | 3.3203 |
| 400  | 3.3685 |
| 450  | 3.3576 |
| 500  | 3.2668 |


TrainOutput(global_step=20123, training_loss=2.9962959973894923, metrics={'train_runtime': 2806.9558, 'train_samples_per_second': 14.338, 'train_steps_per_second': 7.169, 'total_flos': 2628971931893760.0, 'train_loss': 2.9962959973894923, 'epoch': 1.0})
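
The numbers in `TrainOutput` can be cross-checked against the dataset size and batch size. The sketch below assumes a single GPU and no gradient accumulation (the `TrainingArguments` default). The perplexity is derived from the average training loss only, so it says nothing about held-out data.

```python
import math

train_examples = 40245       # size of the train split
per_device_batch_size = 2    # per_device_train_batch_size
num_devices = 1              # assumption: one GPU
train_runtime_s = 2806.9558  # train_runtime from TrainOutput
train_loss = 2.9962959973894923

# One optimizer step per batch; the last, partial batch still counts as a step.
steps_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
samples_per_second = train_examples / train_runtime_s
steps_per_second = steps_per_epoch / train_runtime_s

# Perplexity is the exponential of the mean cross-entropy loss.
train_perplexity = math.exp(train_loss)

print(f"steps per epoch:    {steps_per_epoch}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
print(f"train perplexity:   {train_perplexity:.1f}")
```

The step count matches `global_step=20123`, and the two throughput figures agree with `train_samples_per_second` and `train_steps_per_second` after rounding.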

In [13]:
trainer.save_model("./llm-finetuned-2")
tokenizer.save_pretrained("./llm-finetuned-2")

('./llm-finetuned-2/tokenizer_config.json',
 './llm-finetuned-2/special_tokens_map.json',
 './llm-finetuned-2/vocab.json',
 './llm-finetuned-2/merges.txt',
 './llm-finetuned-2/added_tokens.json',
 './llm-finetuned-2/tokenizer.json')

In [14]:
!zip -r llm-finetuned.zip llm-finetuned-2

  adding: llm-finetuned-2/ (stored 0%)
  adding: llm-finetuned-2/tokenizer_config.json (deflated 54%)
  adding: llm-finetuned-2/training_args.bin (deflated 53%)
  adding: llm-finetuned-2/config.json (deflated 52%)
  adding: llm-finetuned-2/vocab.json (deflated 59%)
  adding: llm-finetuned-2/model.safetensors (deflated 7%)
  adding: llm-finetuned-2/tokenizer.json (deflated 82%)
  adding: llm-finetuned-2/merges.txt (deflated 53%)
  adding: llm-finetuned-2/special_tokens_map.json (deflated 60%)
  adding: llm-finetuned-2/generation_config.json (deflated 24%)


In [17]:
!ls -l

total 298500
drwxr-xr-x 44 root root      4096 Nov 27 03:37 llm-finetuned
drwxr-xr-x  2 root root      4096 Nov 27 03:39 llm-finetuned-2
-rw-r--r--  1 root root 305640753 Nov 27 03:40 llm-finetuned.zip
drwxr-xr-x  1 root root      4096 Nov 20 14:30 sample_data
drwxr-xr-x  3 root root      4096 Nov 27 02:50 wandb


Generate a simple response from the fine-tuned model, without any prompt engineering.

In [12]:
from transformers import pipeline

generator = pipeline("text-generation", model="./llm-finetuned")

print(generator("Roadmap to become genai engineer starts with...?", max_length=50))

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'Roadmap to become genai engineer starts with...?  He gets to build something with his brain.  He will be starting tomorrow!  I\'m a little jealous of him!", \'What is it about?\', "It is about getting a new car.  Im hoping it will be the first vehicle he built.  I\'m so happy for him.  I am really excited about it."]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]"]]]]]]"]]]"]]"]]]]]]]]]]"]]]]]]]]]]]]"]]]]]"]]]]"]]]]]]]]]]]"]]]]]]]]]]]]"]]]]]]]]]]]]]]]"]]]"]]]]]]]]"]]]]]]]]]]]"]]]]"]"]]]]]]]]]]]]]'}]
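
The warning above reports that `max_new_tokens` (=256) takes precedence over `max_length`, which is why the reply runs well past 50 tokens. Below is a minimal sketch of limiting the reply length directly, using `max_new_tokens` and `return_full_text`, both documented call arguments of the text-generation pipeline. `generate_reply` is a hypothetical wrapper, and the commented usage assumes the `generator` object created in the cell above.

```python
def generate_reply(generator, prompt, max_new_tokens=50):
    """Hypothetical wrapper: cap the number of newly generated tokens."""
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,  # counts generated tokens only, not the prompt
        return_full_text=False,         # return the continuation without the prompt
    )
    return outputs[0]["generated_text"].strip()


# Usage with the pipeline from the cell above (not run here):
# print(generate_reply(generator, "Roadmap to become genai engineer starts with...?"))
```

Setting `max_new_tokens` explicitly avoids the conflict with the default stored in the model's generation config.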


Generate a response with prompt engineering, steering the model toward a gentle and emotionally supportive tone.

In [18]:
from transformers import pipeline

generator = pipeline("text-generation", model="./llm-finetuned")

prompt = (
    "Write a gentle, emotionally supportive, and encouraging response. "
    "Roadmap to become a GenAI engineer starts with...?"
)

output = generator(
    prompt,
    max_length=100,
    do_sample=True,
    temperature=0.6,
    top_p=0.85
)

print(output[0]["generated_text"])

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Write a gentle, emotionally supportive, and encouraging response. Roadmap to become a GenAI engineer starts with...?', 'I know. I hope you get the job.  How long ago did it take you to get it?', "I didn't get it, but I'm glad I got it."] "I've been working hard for it, but I'm glad it's going to be easier to get it."] "That's great. I've been working hard for it for a while, but I'm sure I'll get it."]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]"]]]]]]]]]]]]]]]]]]]]]]]]]]"]]]]]]]]]]]]]]]]]]]]"]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]"]]]]]]]]]]]]]]]]]]]]]]]]]]]
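
Both samples end in long runs of `]`, which resemble the list-style `chat_history` strings the model was trained on. Below is a minimal post-processing sketch that cuts a generated string at the first run of two or more closing brackets. `trim_bracket_tail` is a hypothetical helper, and the rule is a heuristic chosen for the outputs shown here, not a general fix.

```python
import re

# An optional double quote followed by two or more closing brackets.
_BRACKET_RUN = re.compile(r'"?\]{2,}')


def trim_bracket_tail(text):
    """Hypothetical helper: drop everything from the first run of ']]' onward."""
    return _BRACKET_RUN.split(text, maxsplit=1)[0].rstrip()


sample = 'I am really excited about it."]]]]]]]"]]]]'
print(trim_bracket_tail(sample))
```

This only hides the symptom. The quotes and single brackets inside the reply remain, and cleaning the training text is the more direct remedy.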
