## Importing Libraries

### **Loading a pre-trained GPT-2 model and tokenizer with the Hugging Face Transformers library**

1. Import the necessary libraries:
   - `tensorflow` library for TensorFlow functionalities
   - `GPT2Tokenizer` and `TFGPT2LMHeadModel` from the `transformers` module for GPT-2 model and tokenizer

2. Import the `pad_sequences` function from `tensorflow.keras.preprocessing.sequence` for padding sequences to a common length.

3. Load the pre-trained GPT-2 model and tokenizer:
   - `model_name` variable stores the name of the pre-trained GPT-2 model to be loaded (in this case, it is "gpt2").
   - `tokenizer` is initialized using `GPT2Tokenizer.from_pretrained(model_name)`, which loads the tokenizer for the specified model.
   - `model` is initialized using `TFGPT2LMHeadModel.from_pretrained(model_name)`, which loads the pre-trained GPT-2 model.


In [21]:
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = TFGPT2LMHeadModel.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


## Data for Fine-tuning the Model

In [16]:
medical_queries = ["What are the symptoms of COVID-19?",
                   "How can I prevent the flu?",
                   "What are the treatment options for diabetes?",
                   "What causes migraines?",
                   "Is there a cure for asthma?",
                   "What are the signs of a heart attack?",
                   "How can I manage my anxiety?",
                   "What are the risk factors for high blood pressure?",
                   "What is the recommended diet for someone with celiac disease?",
                   "What are the symptoms of depression?"]

responses = ["Common symptoms of COVID-19 include fever, cough, and difficulty breathing.",
             "To prevent the flu, you can get vaccinated annually and practice good hand hygiene.",
             "Treatment options for diabetes may include lifestyle changes, medication, and insulin therapy.",
             "Migraines can be caused by various factors such as hormonal changes, certain foods, and stress.",
             "While there is no cure for asthma, it can be managed with medications and lifestyle changes.",
             "Signs of a heart attack include chest pain, shortness of breath, and pain radiating to the left arm.",
             "Managing anxiety may involve therapy, medication, and adopting relaxation techniques.",
             "Risk factors for high blood pressure include obesity, sedentary lifestyle, and a family history of hypertension.",
             "A recommended diet for someone with celiac disease involves avoiding gluten-containing foods like wheat, barley, and rye.",
             "Symptoms of depression can include persistent sadness, loss of interest, and changes in sleep and appetite."]

## **Tokenizing and encoding the input queries and responses with the tokenizer**

1. Initialize empty lists to store the tokenized queries and responses:
   - `input_ids` will store the tokenized queries.
   - `labels` will store the tokenized responses.

2. Iterate over the `medical_queries` and `responses` using the `zip` function to process each pair of query and response.

3. Check if the query or response is None:
   - If either the query or response is None, skip to the next iteration using the `continue` statement.

4. Tokenize and encode the query and response:
   - `encoded_input` stores the token IDs of the query, produced by the tokenizer's `encode` method. `add_special_tokens=True` asks the tokenizer to add whatever special tokens it is configured to add; with the default GPT-2 tokenizer configuration this typically adds none, so no end-of-text marker is appended automatically.
   - `encoded_response` stores the token IDs of the response, produced the same way.

5. Append the encoded query and response to the respective lists:
   - `input_ids` list appends the `encoded_input`.
   - `labels` list appends the `encoded_response`.

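6. Pad the `input_ids` and `labels` sequences:
   - Determine `max_seq_length` as the length of the longest sequence across `input_ids` and `labels`.
   - Use `pad_sequences` to pad the `input_ids` sequences with zeros at the end (`padding='post'`) up to `max_seq_length` (`maxlen=max_seq_length`) and store the result back in `input_ids`. Note that GPT-2 has no dedicated padding token and ID 0 is an ordinary vocabulary entry, so the model cannot tell this padding apart from real text.
   - Pad the `labels` sequences the same way, but with `maxlen=max_seq_length - 1`, so that the labels are one position shorter than the inputs. Any sequence longer than `maxlen` is truncated; `pad_sequences` truncates from the front by default (`truncating='pre'`), and `truncating='post'` is needed to drop the last token instead.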
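
The padding and truncation behaviour can be illustrated without TensorFlow. The sketch below is a simplified stand-in for `pad_sequences` with `padding='post'`, not the library implementation; the helper name `pad_or_truncate` and the toy ID lists are made up for illustration. It contrasts truncating from the front (the `pad_sequences` default) with truncating from the end.

```python
def pad_or_truncate(seq, maxlen, truncating="pre", value=0):
    """Simplified, hypothetical stand-in for pad_sequences(..., padding='post')."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    # Post-padding: fill values are appended after the real tokens
    return list(seq) + [value] * (maxlen - len(seq))


toy_ids = [[11, 12, 13, 14], [21, 22]]  # made-up token IDs
max_len = max(len(s) for s in toy_ids)  # longest sequence has 4 tokens

padded_inputs = [pad_or_truncate(s, max_len) for s in toy_ids]
labels_pre = [pad_or_truncate(s, max_len - 1, truncating="pre") for s in toy_ids]
labels_post = [pad_or_truncate(s, max_len - 1, truncating="post") for s in toy_ids]

print(padded_inputs)  # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(labels_pre)     # [[12, 13, 14], [21, 22, 0]]
print(labels_post)    # [[11, 12, 13], [21, 22, 0]]
```

Only the longest sequence is affected by the choice of `truncating`; shorter sequences are simply padded.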

## Usage ⬇️

The code below tokenizes and encodes the queries and responses, then pads them into fixed-length integer arrays that a model expecting encoded sequences can be trained on.


In [17]:
input_ids = []
labels = []
for query, response in zip(medical_queries, responses):
    if query is None or response is None:
        continue
    encoded_input = tokenizer.encode(query, add_special_tokens=True)
    encoded_response = tokenizer.encode(response, add_special_tokens=True)
    input_ids.append(encoded_input)
    labels.append(encoded_response)

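# Pad the input_ids and labels sequences to a common length
max_seq_length = max(len(seq) for seq in input_ids + labels)
input_ids = pad_sequences(input_ids, padding='post', maxlen=max_seq_length, dtype='int32')
# Labels are one position shorter; truncating='post' drops the last token of over-long
# sequences (the default, 'pre', would drop the first token instead)
labels = pad_sequences(labels, padding='post', truncating='post', maxlen=max_seq_length - 1, dtype='int32')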

In [None]:
# Convert the padded arrays to a TensorFlow dataset of (input, label) pairs
train_dataset = tf.data.Dataset.from_tensor_slices((input_ids, labels))

# Print the contents of train_dataset; the loop variables are named so that
# they do not overwrite the `input_ids` and `labels` arrays
for i, (example_inputs, example_labels) in enumerate(train_dataset):
    print(f"Example {i + 1}:")
    print("Input IDs:", example_inputs)
    print("Labels:", example_labels)
    print()

## **Fine-tuning the model using the defined training parameters**

1. Define the training parameters:
   - `batch_size` represents the number of training examples in each batch.
   - `num_epochs` specifies the total number of training epochs.
   - `optimizer` is an instance of the Adam optimizer, responsible for updating the model's weights during training.
   - `loss_fn` defines the loss function used to calculate the model's training loss.

2. Define the training loop:
   - `num_batches` is the number of batches per epoch, derived from the length of the training dataset and the batch size. Because `batch` keeps a final partial batch by default, 10 examples with a batch size of 4 yield 3 batches (4, 4, and 2 examples).
   - The outer loop iterates over the specified number of epochs.
   - The inner loop iterates over the batches of the training dataset, obtained using the `batch` method.
   - Within each batch, the model's forward pass is executed on the current inputs.
   - The loss is computed by comparing the predicted logits with the labels. The logits for the last position are dropped (`logits[:, :-1, :]`) so that their sequence length matches the labels, which are one position shorter than the inputs.
   - The gradients of the loss with respect to the model's trainable variables are computed using a gradient tape.
   - The optimizer applies the gradients to update the model's weights.
   - The epoch number is printed at the start of each epoch, and the batch number and loss value are printed every 10 batches. With only 3 batches per epoch, this means once per epoch.

3. Save the fine-tuned model:
   - After the training loop completes, the fine-tuned model is saved using the `save_pretrained` method, specifying the desired save directory.
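
The batch arithmetic from step 2 can be checked in isolation. The sketch below uses plain Python and the numbers from this notebook (10 examples, batch size 4); the variable names are chosen for illustration. It compares floor division with ceiling division and lists the batch sizes a loop over the data would see when the final partial batch is kept.

```python
import math

num_examples = 10  # the ten query/response pairs defined earlier
batch_size = 4

# Floor division ignores the final partial batch
full_batches = num_examples // batch_size
# Ceiling division counts the final partial batch as well
total_batches = math.ceil(num_examples / batch_size)

# Sizes of the batches when the remainder is kept rather than dropped
batch_sizes = [min(batch_size, num_examples - start)
               for start in range(0, num_examples, batch_size)]

print(full_batches, total_batches, batch_sizes)  # 2 3 [4, 4, 2]
```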

## Usage ⬇️

The code below serves as a reference for fine-tuning a model with TensorFlow using the training parameters described above. Modify the parameters and adapt the code to your task and dataset.
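
The loop in this section scores logits computed from the query against response tokens at the same positions. A common alternative for causal language models is to join query and response into one sequence and use that same sequence, shifted by one position, as the target. The sketch below shows only the shifting step on made-up token IDs. The helper name `build_shifted_pair` and the use of an end-of-text separator are assumptions for illustration; 50256 is the ID of GPT-2's `<|endoftext|>` token.

```python
EOS_ID = 50256  # GPT-2's <|endoftext|> ID, used here as a separator (an assumption of this sketch)


def build_shifted_pair(query_ids, response_ids, eos_id=EOS_ID):
    """Hypothetical helper: join query and response, then shift by one for next-token targets."""
    sequence = list(query_ids) + [eos_id] + list(response_ids) + [eos_id]
    model_inputs = sequence[:-1]  # every token except the last
    targets = sequence[1:]        # every token except the first
    return model_inputs, targets


# Made-up IDs standing in for tokenizer.encode(query) and tokenizer.encode(response)
query_ids = [101, 102, 103]
response_ids = [201, 202]

model_inputs, targets = build_shifted_pair(query_ids, response_ids)
print(model_inputs)  # [101, 102, 103, 50256, 201, 202]
print(targets)       # [102, 103, 50256, 201, 202, 50256]
```

With this layout the target at position `t` is the token that follows position `t` in the input, so inputs and targets have equal length and the labels need no separate truncation.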


In [None]:
# Define the training parameters
batch_size = 4
num_epochs = 3
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

# Define the loss function (the model returns raw logits, hence from_logits=True)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Define the training loop
# Ceiling division so that the final, partial batch is counted as well
num_batches = (len(train_dataset) + batch_size - 1) // batch_size
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    for step, (batch_inputs, batch_labels) in enumerate(train_dataset.batch(batch_size)):
        with tf.GradientTape() as tape:
            # training=True enables dropout during the forward pass
            logits = model(batch_inputs, training=True)[0]
            # Drop the last position so the logits line up with the shorter labels
            loss_value = loss_fn(batch_labels, logits[:, :-1, :])

        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        if step % 10 == 0:
            print(f"  Batch {step + 1}/{num_batches} - Loss: {float(loss_value):.4f}")

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_model")
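
Because the sequences are padded, the loss in the training loop also averages over padding positions. One way to avoid that is to skip those positions when averaging. The sketch below computes a masked cross-entropy by hand in plain Python on toy logits. `PAD_ID`, the helper names, and the numbers are illustrative assumptions. In a TensorFlow loop the same effect could be approached by passing a 0/1 mask as the `sample_weight` argument when calling the Keras loss. Since ID 0 is a real GPT-2 token, the sketch uses a separate marker for padding rather than masking on zero.

```python
import math

PAD_ID = -1  # hypothetical padding marker, chosen so it cannot collide with a real token ID


def token_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def masked_mean_loss(logit_rows, targets, pad_id=PAD_ID):
    """Average the per-token loss over non-padding positions only."""
    losses = [token_nll(row, t) for row, t in zip(logit_rows, targets) if t != pad_id]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: vocabulary of 3 tokens, 3 positions, the last position is padding
logit_rows = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
targets = [0, 1, PAD_ID]

loss = masked_mean_loss(logit_rows, targets)
print(f"masked loss: {loss:.4f}")
```

Only the first two positions contribute to the average; the logits at the padded position are ignored entirely.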