
In [1]:
!pip install transformers sentencepiece nltk evaluate rouge bleu rouge-score
!git clone https://github.com/ChaitaliV/generative-explanation


In [None]:
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration, T5TokenizerFast
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW  # the AdamW re-exported by transformers is deprecated
from transformers import get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from tqdm import tqdm, trange
import pandas as pd
import torch
import nltk
import evaluate
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge
nltk.download('punkt')

### Evaluate the model performance before and after pre-training

* Take a few sentences, tokenize them, mask one word in each sentence, and use the model to predict the masked word.
* Use the model logits to get candidates for the masked word.
* Calculate the ROUGE score between the original and generated sentences.
* Plot the scores.
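The masking step above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the helper name `mask_word` is hypothetical, and it assumes the target is the first whole-word occurrence in the sentence and that `<extra_id_0>` (T5's first sentinel token) marks the gap.

```python
import re

SENTINEL = "<extra_id_0>"  # T5's first sentinel token


def mask_word(sentence: str, word: str, sentinel: str = SENTINEL) -> str:
    """Replace the first whole-word occurrence of `word` with the sentinel."""
    pattern = r"\b" + re.escape(word) + r"\b"
    masked, n = re.subn(pattern, sentinel, sentence, count=1)
    if n == 0:
        raise ValueError(f"{word!r} not found in sentence")
    return masked


sentences = [
    "Persistent sadness and a sense of hopelessness are common symptoms of depression.",
    "Anxiety often coexists with depression, and symptoms may overlap.",
]
targets = ["sadness", "coexists"]

# Build the masked inputs from the original sentences and their target words
masked = [mask_word(s, w) for s, w in zip(sentences, targets)]
print(masked[0])
```

Generating the masked sentences from the originals this way keeps the two lists aligned, which is easy to get wrong when they are typed by hand.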

In [22]:
depression_symptoms = [
    "Persistent sadness and a sense of hopelessness are common symptoms of depression.",
    "Changes in sleep patterns, such as insomnia or oversleeping, may indicate depressive symptoms.",
    "Fatigue and a lack of energy are often reported by individuals experiencing depression.",
    "Difficulty concentrating or making decisions is a cognitive symptom associated with depression.",
    "Feelings of worthlessness or excessive guilt can be indicative of depressive thoughts.",
    "Loss of interest or pleasure in activities once enjoyed is a hallmark symptom of depression.",
    "Appetite changes, leading to weight loss or gain, can be part of depressive symptoms.",
    "Irritability and restlessness are emotional symptoms that may accompany depression.",
    "Physical symptoms of depression may include headaches and unexplained body aches.",
    "Thoughts of death or suicide are severe symptoms requiring immediate attention.",
    "Social withdrawal and isolation are behavioral signs often seen in depression.",
    "Decreased libido and sexual dysfunction can be associated with depressive disorders.",
    "Psychomotor agitation or retardation may affect an individual's physical movements.",
    "Recurrent thoughts of death or suicide should be taken seriously and addressed promptly.",
    "Depressive symptoms in children may manifest as irritability and academic decline.",
    "Seasonal changes can contribute to Seasonal Affective Disorder (SAD), a subtype of depression.",
    "Anxiety often coexists with depression, and symptoms may overlap.",
    "Physical complaints, such as digestive issues, may be linked to underlying depression.",
    "Persistent feelings of emptiness and a lack of purpose are emotional symptoms of depression.",
    "Postpartum depression affects some women after giving birth and requires prompt diagnosis and treatment to support maternal well-being."
]

In [56]:
masks = ['sadness', 'oversleeping','lack', 'cognitive','excessive', 'pleasure',
         'Appetite','restlessness', 'headaches', 'suicide', 'behavioral', 'dysfunction',
         'physical', 'seriously', 'children', 'Disorder', 'coexists', 'depression',
         'symptoms', 'Postpartum'
         ]

In [13]:
depression_symptoms_masked = [
    "Persistent <extra_id_0> and a sense of hopelessness are common symptoms of depression.",
    "Changes in sleep patterns, such as insomnia or <extra_id_0>, may indicate depressive symptoms.",
    "Fatigue and a <extra_id_0> of energy are often reported by individuals experiencing depression.",
    "Difficulty concentrating or making decisions is a <extra_id_0> symptom associated with depression.",
    "Feelings of worthlessness or <extra_id_0> guilt can be indicative of depressive thoughts.",
    "Loss of interest or <extra_id_0> in activities once enjoyed is a hallmark symptom of depression.",
    "<extra_id_0> changes, leading to weight loss or gain, can be part of depressive symptoms.",
    "Irritability and <extra_id_0> are emotional symptoms that may accompany depression.",
    "Physical symptoms of depression may include <extra_id_0> and unexplained body aches.",
    "Thoughts of death or <extra_id_0> are severe symptoms requiring immediate attention.",
    "Social withdrawal and isolation are <extra_id_0> signs often seen in depression.",
    "Decreased libido and sexual <extra_id_0> can be associated with depressive disorders.",
    "Psychomotor agitation or retardation may affect an individual's <extra_id_0> movements.",
    "Recurrent thoughts of death or suicide should be taken <extra_id_0> and addressed promptly.",
    "Depressive symptoms in <extra_id_0> may manifest as irritability and academic decline.",
    "Seasonal changes can contribute to Seasonal Affective <extra_id_0> (SAD), a subtype of depression.",
    "Anxiety often <extra_id_0> with depression, and symptoms may overlap.",
    "Physical complaints, such as digestive issues, may be linked to underlying <extra_id_0>.",
    "Persistent feelings of emptiness and a lack of purpose are emotional <extra_id_0> of depression.",
    "<extra_id_0> depression affects some women after giving birth and requires prompt diagnosis and treatment to support maternal well-being."
]

## Model evaluation before pre-training

In [86]:
def eval_model():
  # Predict the first word generated for each masked sentence.
  # Uses the global `tokenizer` and `model` loaded in the pre-training section.
  candidate_list = []
  for sentence in depression_symptoms_masked:
    encoding = tokenizer(sentence, max_length=218, padding="max_length", truncation=True, return_tensors="pt")
    # Pass the attention mask so that padding positions are ignored during generation
    outputs = model.generate(input_ids=encoding.input_ids, attention_mask=encoding.attention_mask)
    candidates = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    candidate_list.append(candidates[0].strip().split(' ')[0])
  return candidate_list
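`eval_model` keeps only the first word of the decoded output. An alternative, sketched below, is to decode without skipping special tokens and keep whatever the model produced for `<extra_id_0>`. The helper name `first_sentinel_span` and the example string are hypothetical. The sketch assumes the raw output follows T5's span-corruption format, in which each sentinel is followed by its predicted text.

```python
import re


def first_sentinel_span(decoded: str) -> str:
    """Return the text generated for <extra_id_0> in a raw T5 output string."""
    # Capture everything after <extra_id_0> up to the next sentinel, </s>, or the end
    match = re.search(r"<extra_id_0>(.*?)(?:<extra_id_\d+>|</s>|$)", decoded, flags=re.S)
    return match.group(1).strip() if match else ""


# Illustrative raw output, not an actual model generation
raw = "<pad> <extra_id_0> sadness <extra_id_1> feelings</s>"
span = first_sentinel_span(raw)
print(span)
```

This keeps multi-word predictions intact and returns an empty string when the model emits no sentinel at all.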

In [58]:
labels_list = eval_model()



In [66]:
rouge = evaluate.load('rouge')
results = rouge.compute(predictions= labels_list,references=masks)
print(results)

{'rouge1': 0.3, 'rouge2': 0.0, 'rougeL': 0.3, 'rougeLsum': 0.3}
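Because every reference here is a single word, ROUGE-1 effectively reduces to exact-match accuracy, and ROUGE-2 is always 0 because a one-word reference contains no bigrams. The sketch below illustrates this with a simplified ROUGE-1 F1. The tokenizer only approximates the one used by the `rouge_score` package, and the predictions are hypothetical. The library also aggregates scores with bootstrap resampling by default, so its numbers may differ slightly from a plain mean.

```python
import re
from collections import Counter


def tokens(text):
    # Lowercase and keep alphanumeric runs (an approximation of the library tokenizer)
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between one prediction and one reference."""
    pred, ref = Counter(tokens(prediction)), Counter(tokens(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


predictions = ["sadness", "sleep", "Lack", "mental"]  # hypothetical model outputs
references = ["sadness", "oversleeping", "lack", "cognitive"]

scores = [rouge1_f1(p, r) for p, r in zip(predictions, references)]
mean_f1 = sum(scores) / len(scores)
exact = sum(tokens(p) == tokens(r) for p, r in zip(predictions, references)) / len(references)
print(mean_f1, exact)
```

Read this way, a `rouge1` of 0.3 is consistent with roughly 6 of the 20 masked words being recovered.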


## Model Pre-training

In [68]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True)
device = 'cuda:0'
batch_size = 4
epochs = 5
optimizer = AdamW(model.parameters(), lr=0.0001)



In [84]:
df = pd.read_csv('generative-explanation/datasets/unsupervised dataset/unsupervised_dataset.csv')

In [85]:
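import ast
# Parse stringified id lists without executing code; values that are already tensors pass through, so re-running is safe
to_tensor = lambda x: x if torch.is_tensor(x) else torch.tensor(ast.literal_eval(x))
df['Encoder'] = df['Encoder'].apply(to_tensor)
df['Decoder'] = df['Decoder'].apply(to_tensor)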
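The cell above turns stringified lists from the CSV into tensors. A minimal sketch of that parsing step, using `ast.literal_eval` so that cell contents are never executed as code, might look as follows. The helper names, the example rows, and the choice of `0` as padding id (T5's pad token id) are assumptions for illustration. Padding to a common length is included because batching requires equally sized sequences.

```python
import ast


def parse_ids(cell: str) -> list:
    """Parse a stringified list of token ids such as "[37, 1782, 1]"."""
    ids = ast.literal_eval(cell)
    if not isinstance(ids, list) or not all(isinstance(i, int) for i in ids):
        raise ValueError(f"expected a list of ints, got {cell!r}")
    return ids


def pad_to(ids: list, length: int, pad_id: int = 0) -> list:
    """Right-pad (or truncate) a list of ids to a fixed length."""
    return (ids + [pad_id] * length)[:length]


rows = ["[37, 1782, 1]", "[8, 1]"]  # hypothetical CSV cells
parsed = [parse_ids(r) for r in rows]
max_len = max(len(p) for p in parsed)
padded = [pad_to(p, max_len) for p in parsed]
print(padded)
```

Once every row has the same length, the lists can be converted with `torch.tensor(padded)` into a single 2-D tensor.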

In [74]:
input_ids = df.Encoder
labels = df.Decoder

In [76]:
#create train, validation split
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels,random_state = 2018, test_size = 0.1)

In [80]:
#create dataloaders for training and validation data
# TensorDataset expects tensors, not pandas Series, so stack the per-row tensors first
# (this assumes every sequence is already padded to the same length)
train_data = TensorDataset(torch.stack(list(train_inputs)), torch.stack(list(train_labels)))
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(torch.stack(list(validation_inputs)), torch.stack(list(validation_labels)))
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)