In [1]:
import os

os.chdir('/scratch1/aalamel')

os.getcwd()

'/scratch1/aalamel'

A detailed explanation of the code:

1- Import necessary libraries and modules:

- torch: PyTorch library for deep learning.
- GPT2LMHeadModel, GPT2Tokenizer, GPT2Config: Hugging Face Transformers classes for GPT-2 model, tokenizer, and configuration.
- TextDataset, DataCollatorForLanguageModeling: Hugging Face Transformers classes for creating a dataset and data collator for language modeling.
- Trainer, TrainingArguments: Hugging Face Transformers classes for creating a trainer and setting up training arguments.
- pandas: Library for data manipulation and analysis.
- Dataset: PyTorch class for creating a custom dataset.

2- Read the dataset pubmed.csv using pandas and store it in df.

3- Define the read_data function that takes a DataFrame and returns the "text" column.

4- Define the TextDataset class, which inherits from PyTorch's Dataset class, to create a custom dataset for the GPT-2 model.

- __init__: Initialize the dataset with the tokenized encodings.
- __getitem__: Get an item from the dataset by its index.
- __len__: Get the length of the dataset.

5- Define the train_gpt2 function to fine-tune the GPT-2 model on the given dataset:

- Load the tokenizer and configuration for the specified model name.
- Load the GPT-2 model using the configuration.
- Set the padding token to be the same as the end-of-sentence (EOS) token.
- Tokenize the input texts, create the dataset, and create a data collator for language modeling.
- Set up training arguments, including the output directory, number of epochs, batch size, and learning rate.
- Create a trainer instance with the model, training arguments, data collator, and dataset.
- Train the model and save the model and tokenizer to the output directory.

6- Define the chat function to generate responses from the fine-tuned GPT-2 model:

- Tokenize the input text and generate a response using the model.
- Adjust the generation parameters, such as temperature and top_k, to control the randomness and diversity of the responses.
- Decode the generated response and remove special tokens.

7 - In the main block:

- Read the dataset and extract the text data.
- Set the model name to "gpt2-medium" and specify the output directory.
- Fine-tune the GPT-2 model using the dataset and the specified parameters.
- Load the fine-tuned model and tokenizer from the output directory.
- Start an interactive chat loop, taking user input, generating responses using the chat function, and printing the responses.


With this code, you can fine-tune a GPT-2 model on your dataset and interact with the fine-tuned model to generate responses.

In [None]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, GPT2LMHeadModel
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import pandas as pd
from torch.utils.data import Dataset


df = pd.read_csv('pubmed.csv')


def read_data(df):
    return df["text"]


class TextDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])


def train_gpt2(model_name, df, output_dir, epochs=10):
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    config = GPT2Config.from_pretrained(model_name)

    model = GPT2LMHeadModel.from_pretrained(model_name, config=config)

    tokenizer.pad_token = tokenizer.eos_token

    texts = df["text"].tolist()
    train_text = "\n".join(texts)
    train_encodings = tokenizer(train_text, return_tensors='pt', padding=True, truncation=True)
    train_dataset = TextDataset(train_encodings)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=epochs,
        per_device_train_batch_size=4,
        save_steps=10_000,
        save_total_limit=2,
        learning_rate=5e-5,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model(output_dir)

    tokenizer.save_pretrained(output_dir)


def chat(model, tokenizer, input_text, max_length=100):
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    output = model.generate(input_ids, max_length=max_length, pad_token_id=tokenizer.eos_token_id, no_repeat_ngram_size=2, temperature=0.8, top_k=50)
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response


if __name__ == "__main__":
    df = pd.read_csv('pubmed.csv')
    text_data = read_data(df)

    model_name = "gpt2-medium"
    output_dir = "fine_tuned_model"
    train_gpt2(model_name, df, output_dir, epochs=3)

    tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
    model = GPT2LMHeadModel.from_pretrained(output_dir)

    while True:
        user_input = input("You: ")
        response = chat(model, tokenizer, user_input)
        print("Chatbot:", response)


loading file vocab.json from cache at /home/aalamel/.cache/huggingface/hub/models--gpt2-medium/snapshots/425b0cc90498ac177aa51ba07be26fc2fea6af9d/vocab.json
loading file merges.txt from cache at /home/aalamel/.cache/huggingface/hub/models--gpt2-medium/snapshots/425b0cc90498ac177aa51ba07be26fc2fea6af9d/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /home/aalamel/.cache/huggingface/hub/models--gpt2-medium/snapshots/425b0cc90498ac177aa51ba07be26fc2fea6af9d/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2-medium",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 102

Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to fine_tuned_model
Configuration saved in fine_tuned_model/config.json
Configuration saved in fine_tuned_model/generation_config.json
Model weights saved in fine_tuned_model/pytorch_model.bin
tokenizer config file saved in fine_tuned_model/tokenizer_config.json
Special tokens file saved in fine_tuned_model/special_tokens_map.json
loading file vocab.json
loading file merges.txt
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file fine_tuned_model/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2-medium",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 

You:  what is hepatitis?


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}



Chatbot: what is hepatitis?

Hepatitis is a virus that causes inflammation of the liver. It is caused by a hepatitis B virus.
, the hepatitis virus is transmitted through the blood of infected people. The virus can be spread through contact with blood, urine, feces, or saliva. Hepatotoxicity is the most common cause of liver damage in people with hepatitis. In addition, hepatitis can cause liver cancer. People with liver disease are at increased risk for developing liver cirrhosis


You:  how to treat it ?


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}



Chatbot: how to treat it?

The best way to deal with it is to use a topical cream.
, which is a cream that contains a lot of ingredients that are not found in the skin. It is used to remove dead skin cells and to prevent the formation of new ones. The cream is applied to the affected area and the cream will be absorbed into the bloodstream. This is the best treatment for acne. However, it can also cause irritation and redness. If you are using a


You:  how to treat hepatitis ?


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}



Chatbot: how to treat hepatitis?

Hepatitis is a contagious disease that can be spread through contact with blood, saliva, urine, feces, or vomit. It is caused by the hepatitis B virus, which is transmitted through the bite of an infected animal.
, and the liver is the main organ responsible for producing the virus. The liver produces the active form of the Hepatotoxic Virus (HVT) which causes liver damage and cirrhosis. Hepatic cirrosis is an
