<a href="https://colab.research.google.com/github/KOFIYEB/projects/blob/main/GPT_2_fined_tuned_fitness_domain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **LOADING PACKAGES**

In [None]:
!pip install transformers[sentencepiece]
!pip install datasets
!pip install accelerate -U

In [None]:
import pandas as pd
import numpy as np
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from transformers import TrainingArguments
from transformers import Trainer

# **Load Pre-trained Model**

First, we load a causal language model and its matching tokenizer from the model repository.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [None]:
model_name = "distilgpt2"

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# **Test drive the Model**

In [None]:
input_txt = "fitness in germany"

In [None]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_length=64, do_sample=True)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


fitness in germany, in which a certain number of Jewish members of the Jewish community live with this family, and the family member to which the rest of the Jewish community resides in the home is not just a Jew, but also a Jewish man from Israel, and is on a very good and fruitful path toward


This output shows why domain-specific fine-tuning is needed: the pre-trained model drifts away from fitness entirely and does not produce coherent, on-topic text for a fitness prompt. Adapting the language model to fitness text, through the fine-tuning process that follows, should bring its output closer to the target domain.

# **Prepare a Dataset for Domain Adaptation**

We now load our dataset and preprocess it to handle missing data.

In [None]:
import pandas as pd
# Load and preprocess the dataset
def preprocess_data(file_path):
    data = pd.read_excel(file_path)
    # Handling missing data
    data = data.fillna('')  # or any other strategy you deem fit
    # Combine title, sub-topic into one string, and content as another string
    data['combined'] = data['title'] + ' ' + data['sub-topic']
    return data

fit = preprocess_data('Stanzy_Fitness_Dataset.xlsx')
fit

Unnamed: 0,title,sub-topic,content,combined
0,Evidence to Support the Effectiveness of Perso...,Evidence to Support the Effectiveness of Perso...,Whether you are a newcomer to the fitness worl...,Evidence to Support the Effectiveness of Perso...
1,Pilates Ball Core-strengthening Exercises,Pilates Ball Core-strengthening Exercises,The Pilates ball is an effective training tool...,Pilates Ball Core-strengthening Exercises Pila...
2,Pilates Ball Core-strengthening Exercises,Pilates Roll-up,Position the body into a “V” sit and place the...,Pilates Ball Core-strengthening Exercises Pila...
3,Pilates Ball Core-strengthening Exercises,,,Pilates Ball Core-strengthening Exercises
4,Pilates Ball Core-strengthening Exercises,Glute Bridge,Lie on the floor and place the Pilates ball be...,Pilates Ball Core-strengthening Exercises Glut...
...,...,...,...,...
1885,,Realizing the power of exercise,The irony of it all is that when my back pain ...,Realizing the power of exercise
1886,,Social exercise enhances physical gains\n,While talking to participants after the end of...,Social exercise enhances physical gains\n
1887,,Research has changed my love-hate relationship...,My pain decreased when I began moving for heal...,Research has changed my love-hate relationshi...
1888,Signs of Heart Problems During Exercise,Signs of Heart Problems During Exercise,A sedentary lifestyle is one of the major risk...,Signs of Heart Problems During Exercise Signs ...



Tokenize the texts and remove the columns that are no longer needed.

In [None]:
def tokenize_function(examples):
    # Tokenize the 'content' column; pad or truncate every sample to 512 tokens
    return tokenizer(examples['content'], truncation=True, padding='max_length', max_length=512)
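
The effect of `truncation=True, padding='max_length'` can be illustrated without the tokenizer. The sketch below uses a hypothetical helper, `pad_or_truncate`, and made-up token IDs. It only mimics the shape of the tokenizer's output under the assumption of right-side padding, and it is not the library's implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Hypothetical helper: force a token list to exactly max_length items."""
    kept = token_ids[:max_length]      # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)     # padding: number of slots still empty
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 0 marks padding
    }

# 50256 is the eos_token_id reused as the pad token in this notebook
short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=50256)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=50256)
print(short)
print(long)
```

With `max_length=512`, short articles end up mostly padding, which matters for the chunking step that comes next.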

In [None]:
from datasets import Dataset

# Set the padding token to the eos_token
tokenizer.pad_token = tokenizer.eos_token
tokenized_datasets = Dataset.from_pandas(fit).map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['title', 'sub-topic', 'content', 'combined'])

In [None]:
tokenized_datasets

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 1890
})

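In causal language modeling, it is typical to concatenate all text samples into one continuous token sequence and then split that sequence into chunks of a fixed size, no larger than the model's context size. This usually replaces padding or truncating each sample individually and gives the model a consistent input length for every example. Note that in this notebook the samples were already padded or truncated to 512 tokens in the previous step, so each sample contributes exactly four 128-token chunks (1890 × 4 = 7560) and some chunks contain padding tokens.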

In [None]:
chunk_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result['labels'] = result["input_ids"].copy()
    return result

The function above also adds a 'labels' column to the dataset. It is a copy of 'input_ids': in causal language modeling the targets are the input tokens themselves, and the model shifts them by one position when it computes the loss.
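
To see the concatenate-and-chunk step on something small, here is a sketch that mirrors the logic of `group_texts` on a toy batch. The name `group_texts_toy`, the explicit `chunk_size` argument, and the token IDs are illustrative assumptions, not part of the notebook's pipeline.

```python
def group_texts_toy(examples, chunk_size):
    # Concatenate the per-sample lists of every column into one long list
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels are a copy of the inputs
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result

toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
chunks = group_texts_toy(toy_batch, chunk_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens with a chunk size of 4 give two full chunks; tokens 9 and 10 are dropped, and chunk boundaries ignore where the original samples started or ended.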

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 7560
})
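
Because every sample was padded to 512 tokens before grouping, the copied labels also cover padding positions, so the loss is computed on them too. One common alternative, sketched here under the assumption that the usual Transformers convention applies (label value -100 is skipped by the cross-entropy loss), is to mask those positions using the attention mask. The helper name `mask_padding_labels` is hypothetical and this step was not part of the run reported in this notebook.

```python
IGNORE_INDEX = -100  # label value ignored by the loss (assumed convention)

def mask_padding_labels(batch):
    """Hypothetical helper: build labels that skip padded positions."""
    labels = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return {"labels": labels}

toy = {
    "input_ids": [[464, 2933, 50256, 50256]],   # made-up IDs, 50256 used as padding
    "attention_mask": [[1, 1, 0, 0]],
}
masked = mask_padding_labels(toy)
print(masked["labels"])  # [[464, 2933, -100, -100]]
```

The attention mask is used rather than the token ID because the pad token here is the same as the end-of-sequence token. Applying such a step would change the losses and perplexities reported below, so those numbers describe the unmasked setup only.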

In [None]:
lm_datasets["input_ids"][0:10]

In [None]:
lm_datasets["labels"][0:10]

# **Domain-Adapt with Trainer API**

In [None]:
train_size = 6800
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets.train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)


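No language-modeling-specific data collator is needed here: every chunk already has the same length and carries its own 'labels' column, so the Trainer's default collator can batch the examples as they are. The labels are shifted by one position inside the model when the loss is computed, so the objective is to predict the token at timestep t+1 from all tokens up to t.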
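
The one-position offset can be made concrete with a small sketch. It pairs each position t with its target at t+1, using whole words instead of subword token IDs for readability; the helper `next_token_pairs` is illustrative only and does not reproduce the model's internal code.

```python
def next_token_pairs(token_ids):
    """Pair the token at each position t with the target token at t+1."""
    inputs = token_ids[:-1]    # the last token has no target
    targets = token_ids[1:]    # the same sequence shifted left by one
    return list(zip(inputs, targets))

tokens = ["Lie", "on", "the", "floor"]
for t, (current, target) in enumerate(next_token_pairs(tokens)):
    print(f"t={t}: context ends with {current!r} -> predict {target!r}")
```

A sequence of n tokens therefore yields n-1 prediction targets, which is why passing a copy of the inputs as labels is sufficient.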

In [None]:
from transformers import TrainingArguments

# Name the output directory after the base model
output_dir = f"{model_name}-clm-finetuned-fit"

batch_size = 8
logging_steps = len(downsampled_dataset["train"]) // batch_size

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    evaluation_strategy="epoch",  # newer transformers releases call this eval_strategy
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=logging_steps,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"]
)

In [None]:
import numpy as np
import math

As a baseline, first compute the perplexity of the pre-trained model before it is adapted to the fitness domain.

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Debug: Print the complete eval_results dictionary
print("Evaluation Results:", eval_results)

# Check if 'eval_loss' key exists before calculating perplexity
if 'eval_loss' in eval_results:
    perplexity = math.exp(eval_results['eval_loss'])
    print(f">>> Perplexity: {perplexity:.2f}")
else:
    print("The 'eval_loss' key does not exist in the evaluation results.")

Evaluation Results: {'eval_loss': 2.032716989517212, 'eval_runtime': 1.8131, 'eval_samples_per_second': 375.046, 'eval_steps_per_second': 46.881}
>>> Perplexity: 7.63
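
The relationship used above is that perplexity is the exponential of the mean per-token cross-entropy loss. A minimal sketch reproduces the reported figure from the reported `eval_loss`; the function name `perplexity_from_loss` is an illustrative choice.

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Perplexity = exp(mean per-token cross-entropy, measured in nats)."""
    return math.exp(mean_cross_entropy)

eval_loss = 2.032716989517212  # eval_loss reported above
baseline = perplexity_from_loss(eval_loss)
print(f"{baseline:.2f}")  # 7.63

# Intuition: a model spreading probability evenly over k tokens
# has loss ln(k) and therefore perplexity k.
uniform_50 = perplexity_from_loss(math.log(50))
print(round(uniform_50, 6))
```

Read this way, a perplexity of 7.63 means the model is, on average, about as uncertain as if it were choosing evenly among roughly eight tokens at each position of the evaluation chunks.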


# **Conduct Training**

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,1.9126,1.848906
2,1.8014,1.71617
3,1.7104,1.679504


TrainOutput(global_step=2550, training_loss=1.808134957107843, metrics={'train_runtime': 186.1239, 'train_samples_per_second': 109.604, 'train_steps_per_second': 13.701, 'total_flos': 666306714009600.0, 'train_loss': 1.808134957107843, 'epoch': 3.0})
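
The `global_step` value can be checked against the configuration with a little arithmetic. The sketch assumes a single GPU and no gradient accumulation, and relies on the `TrainingArguments` default of three epochs, since `num_train_epochs` was not set above.

```python
import math

train_examples = 6800   # train_size passed to train_test_split
batch_size = 8          # per_device_train_batch_size
num_epochs = 3          # TrainingArguments default for num_train_epochs
num_devices = 1         # assumption: one GPU, no gradient accumulation

steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 850 2550
```

The result matches `global_step=2550`. Since `logging_steps` was also set to 6800 // 8 = 850, the training loss is logged once per epoch, which is why the table has one row per epoch.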

In [None]:
model_path = "models/gpt2-finetuned-fit"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)  # keep the tokenizer next to the weights

Calculate the perplexity of the domain-adapted model and observe that it decreases, from 7.63 before fine-tuning to 5.36 on the same held-out split.

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Debug: Print the complete eval_results dictionary
print("Evaluation Results:", eval_results)

# Check if 'eval_loss' key exists before calculating perplexity
if 'eval_loss' in eval_results:
    perplexity = math.exp(eval_results['eval_loss'])
    print(f">>> Perplexity: {perplexity:.2f}")
else:
    print("The 'eval_loss' key does not exist in the evaluation results.")

Evaluation Results: {'eval_loss': 1.6795037984848022, 'eval_runtime': 1.8119, 'eval_samples_per_second': 375.305, 'eval_steps_per_second': 46.913, 'epoch': 3.0}
>>> Perplexity: 5.36


# **Try the trained model**

In [None]:
input_txt = "How to stay Healthy "
# Tokenize the input text and create attention mask
input_data = tokenizer(input_txt, return_tensors="pt", padding=True, truncation=True)
input_ids = input_data['input_ids'].to(device)
attention_mask = input_data['attention_mask'].to(device)

# Generate the output
output = model.generate(input_ids, attention_mask=attention_mask, max_length=500, do_sample=True)

# Decode and print the output
print(tokenizer.decode(output[0]))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


How to stay Healthy �The truth is, the amount of time and resources that make you have to work more effectively.If you’re going to lose weight, you need to go through time of work.Think of a fat loss for five minutes of sleep. It’s about 6-8 hours per year that makes you less efficient. When you’re doing heavy lifting, you’ll be unable to eat enough calories or find ways to be efficient at achieving that goal. That means you’re less likely to want to sleep and less likely to start crunches or crunches, and more likely to start a workout. In my experience, it’s easier to get a little more than you want, and more efficient workouts, but it means that the same is not true in most places. The truth is, most people aren’t making the point that it doesn’t mean to be perfect. It’s true that you can improve only by exercising. But more importantly, it means that you don’t see how well you can do it to a lot of other people’s levels of fat and calories, the same amount of energy you do not like