# Fine-Tuning Language Models for Personalized Culinary Guidance
Performed by:

Ritik Krishnan Ambadi, Atul Kumar

## Installing necessary libraries

In [None]:
!pip install transformers datasets accelerate peft trl
!pip install gradio

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Collecting peft
  Downloading peft-0.6.2-py3-none-any.whl (174 kB)
Collecting trl
  Downloading trl-0.7.4-py3-none-any.whl (133 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)

## Exploratory Data Analysis and Pre-processing

### Recipe Dataset to Text File Converter

This script loads a dataset of recipes and extracts conversations from it, where each conversation consists of user instructions and assistant responses. It then saves this conversation data to a text file.

Usage:
1. Install the required dependencies.
2. Replace the 'dataset_name' variable with the name of the dataset you want to use.
3. Run the script.

The script performs the following steps:
1. Checks if a GPU (CUDA) is available and sets the device accordingly.
2. Loads the specified dataset using the Hugging Face 'datasets' library.
3. Extracts the user instructions and assistant responses from the training split of the dataset.
4. Creates a text file ('conversation_data.txt') and writes the conversations in the following format:
   User: [User Instruction]
   Assistant: [Assistant Response]

Example Output:
- Number of examples: [Number of conversation examples in the dataset]
- Sample instruction: [Sample user instruction]
- Sample response: [Sample assistant response]
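
As a minimal sketch of step 4, the helper below builds the `User:`/`Assistant:` text from two parallel lists without touching the file system, which makes the format easy to inspect before it is written to disk. `format_conversations` is a hypothetical name that is not part of the notebook, and the sample strings are made up for illustration.

```python
def format_conversations(instructions, responses):
    """Join instruction/response pairs into User:/Assistant: lines."""
    chunks = []
    for instruction, response in zip(instructions, responses):
        chunks.append(f"User: {instruction}\nAssistant: {response}\n")
    return "".join(chunks)


# Illustrative placeholder pair, not taken from the dataset
text = format_conversations(
    ["Have you got a recipe for toast?"],
    ["Here's a recipe for toast: ..."],
)
print(text)
```

The returned string is what would be written to `conversation_data.txt`.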

In [None]:
from datasets import load_dataset
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


dataset_name = "dctanner/oa_recipes"
dataset = load_dataset(dataset_name)

# Extract instructions and responses as lists
instructions = dataset['train']['INSTRUCTION']
responses = dataset['train']['RESPONSE']

# Sample output
print("Number of examples:", len(instructions))
print("Sample instruction:", instructions[0])
print("Sample response:", responses[0])

# Create a text file for the conversations
# (UTF-8 is set explicitly because the recipes contain characters such as "•" and "˚")
with open('/content/conversation_data.txt', 'w', encoding='utf-8') as file:
    for instruction, response in zip(instructions, responses):
        file.write(f"User: {instruction}\nAssistant: {response}\n")

Number of examples: 4747
Sample instruction: Have you got a recipe for Homemade Cinnamon Rolls?
Sample response: Here's a recipe for Homemade Cinnamon Rolls:

Ingredients:

Dough

• 1/2 cup unsalted butter, melted
• 2 cups whole milk, warm to the touch
• 1/2 cup granulated sugar
• 2 1/4 teaspoons active dry yeast
• 5 cups flour, divided
• 1 teaspoon baking powder
• 2 teaspoons salt

Filling

• 3/4 cup butter, softened
• 3/4 cup light brown sugar
• 2 tablespoons ground cinnamon

Frosting

• 4 oz cream cheese, softened
• 2 tablespoons butter, melted
• 2 tablespoons whole milk
• 1 teaspoon vanilla extract
• 1 cup powdered sugar

Instructions:

1. Generously butter two disposable foil pie/cake pans.
2. In a large bowl, whisk together warm milk, melted butter, and granulated sugar. The mixture should be just warm, registering between 100-110˚F (37-43˚C). If it is hotter, allow to cool slightly.
3. Sprinkle the yeast evenly over the warm mixture and let set for 1 minute.
4. Add 4 cups (500g)
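
One quick exploratory check that may be worth running here is the length distribution of the responses, since long recipes interact with the 256-token training blocks and the 100-token generation cap used later in this notebook. The sketch below is an assumption about how that could be done: it counts whitespace-separated words (a rough proxy for tokens, not the GPT-2 tokenizer), and `length_summary` is a hypothetical helper name.

```python
import statistics


def length_summary(texts):
    """Summarize word counts (whitespace split) for a list of strings."""
    counts = [len(t.split()) for t in texts]
    return {
        "n": len(counts),
        "min": min(counts),
        "median": statistics.median(counts),
        "max": max(counts),
    }


# Placeholder strings; in the notebook this would be called on `responses`
print(length_summary(["one two three", "one two", "one two three four five"]))
```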

## Load Transformers and GPT-2 Model, Tokenizer

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [None]:
# Load GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

## Fine-tune GPT-2 Model on the recipes data

Fine-Tuning GPT-2 Model for Text Generation

This function fine-tunes a GPT-2 (Generative Pre-trained Transformer 2) model on a custom text dataset and saves the fine-tuned model and tokenizer to an output directory.

Parameters:
- model_name (str): The name or path of the pre-trained GPT-2 model to be fine-tuned.
- train_file (str): The path to the training dataset file in text format.
- output_dir (str): The directory where the fine-tuned model and tokenizer will be saved.

Dependencies:
- Hugging Face Transformers library (install via 'pip install transformers')
- PyTorch (install via 'pip install torch')

Usage:
1. Install the required dependencies.
2. Define the 'model_name', 'train_file', and 'output_dir' parameters.
3. Call this function with the defined parameters.

The function performs the following steps:
1. Moves the GPT-2 model to the specified device (GPU or CPU).
2. Loads the training dataset from the provided 'train_file'.
3. Creates a data collator for language modeling.
4. Sets training arguments, including the output directory, number of training epochs, batch size, and save settings.
5. Initializes a Trainer for fine-tuning the model.
6. Fine-tunes the GPT-2 model on the training dataset.
7. Saves the fine-tuned model and tokenizer to the 'output_dir'.

Example Usage:
```python
model_name = "gpt2-medium"  # Pre-trained GPT-2 model name or path
train_file = "custom_text_dataset.txt"  # Path to your custom training dataset
output_dir = "fine_tuned_gpt2_model"  # Directory to save the fine-tuned model

fine_tune_gpt2(model_name, train_file, output_dir)
```

In [None]:
def fine_tune_gpt2(model_name, train_file, output_dir):

    # Load the pre-trained model and tokenizer named by model_name,
    # so the argument is actually used instead of the global objects
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    # Move the model to the selected device (GPU if available, else CPU)
    model.to(device)

    # Load training dataset, split into blocks of 256 tokens
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=256)
    # Create data collator for causal language modeling (mlm=False)
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False)
    # Set training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=12,
        save_steps=1_000,
        save_total_limit=2,
    )
    # Train the model
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )
    trainer.train()
    # Save the fine-tuned model and tokenizer
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
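
For intuition about what `block_size=256` does, here is a simplified sketch of the chunking step: the tokenized file is treated as one long sequence and cut into consecutive fixed-size blocks, and a trailing partial block is dropped. This is a plain-Python illustration under that assumption, not the library's code; `chunk_token_ids` is a hypothetical name and the integer ids are placeholders. Recent versions of Transformers also mark `TextDataset` as deprecated in favor of the `datasets` library.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into consecutive full blocks."""
    blocks = []
    # Stop early enough that every block has exactly block_size ids;
    # any leftover ids at the end are discarded.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


# Ten placeholder ids cut into blocks of four: the last two ids are dropped
print(chunk_token_ids(list(range(10)), 4))
```

One consequence of this scheme is that a single `User:`/`Assistant:` pair can straddle two blocks, so a training example does not necessarily begin at the start of a conversation.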

In [None]:
fine_tune_gpt2("gpt2", "/content/conversation_data.txt", "output")



| Step | Training Loss |
|------|---------------|
| 500  | 1.5639        |
| 1000 | 1.5102        |
| 1500 | 1.4838        |
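
One way to read these numbers: the loss reported during training is the mean cross-entropy in nats, so exponentiating it gives the corresponding perplexity. The sketch below does that conversion for the three logged values. These are training losses, so the result is a training perplexity and says nothing about held-out data.

```python
import math

# Training losses copied from the log above (step -> loss)
logged_losses = {500: 1.5639, 1000: 1.5102, 1500: 1.4838}

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step}: loss {logged_losses[step]:.4f} -> perplexity {ppl:.2f}")
```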


## Zip the output fine-tuned GPT-2 model

In [None]:
!zip -r /content/output.zip /content/output

  adding: content/output/ (stored 0%)
  adding: content/output/checkpoint-2000/ (stored 0%)
  adding: content/output/checkpoint-2000/training_args.bin (deflated 51%)
  adding: content/output/checkpoint-2000/config.json (deflated 52%)
  adding: content/output/checkpoint-2000/trainer_state.json (deflated 61%)
  adding: content/output/checkpoint-2000/generation_config.json (deflated 24%)
  adding: content/output/checkpoint-2000/model.safetensors (deflated 7%)
  adding: content/output/checkpoint-2000/optimizer.pt (deflated 8%)
  adding: content/output/checkpoint-2000/scheduler.pt (deflated 56%)
  adding: content/output/checkpoint-2000/rng_state.pth (deflated 25%)
  adding: content/output/special_tokens_map.json (deflated 74%)
  adding: content/output/config.json (deflated 52%)
  adding: content/output/checkpoint-3000/ (stored 0%)
  adding: content/output/checkpoint-3000/training_args.bin (deflated 51%)
  adding: content/output/checkpoint-3000/config.json (deflated 52%)
  adding: content/ou

## Download the output fine-tuned model for future use

In [None]:
from google.colab import files
files.download("/content/output.zip")

In [None]:
!cp /content/output.zip /content/drive/MyDrive/output.zip

## Integrate Gradio to interact with the Chat model

In [None]:
import torch
import gradio as gr

# Load the fine-tuned GPT-2 model and tokenizer
model_path = "output"  # Replace with the actual path to your "output" directory
model = GPT2LMHeadModel.from_pretrained(model_path)
tokenizer = GPT2Tokenizer.from_pretrained(model_path)

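# Define a function to generate responses
def generate_response(input_text):
    # Truncate the prompt to GPT-2's 1024-token context window
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)
    # do_sample=True is required for top_p and temperature to take effect;
    # pad_token_id is set explicitly because GPT-2 defines no pad token
    response_ids = model.generate(input_ids, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2,
                                  do_sample=True, top_p=0.9, temperature=0.7, pad_token_id=tokenizer.eos_token_id)
    response_text = tokenizer.decode(response_ids[0], skip_special_tokens=True)
    return response_text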

# Set up the Gradio interface
iface = gr.Interface(
    fn=generate_response,
    inputs="text",
    outputs="text",
    live=True,
    title="Chat with Fine-Tuned GPT-2 Model"
)

# Launch the Gradio interface
iface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
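
Because the training file uses the `User: ...` / `Assistant: ...` layout, it may help to wrap the raw input in the same layout before generation and to strip the prompt from the decoded output afterwards. The two helpers below are a sketch of that idea. `build_prompt` and `extract_reply` are hypothetical names that do not appear in the notebook, and whether this improves the answers has not been measured here.

```python
def build_prompt(user_text):
    """Wrap raw user text in the User:/Assistant: layout used for training."""
    return f"User: {user_text.strip()}\nAssistant:"


def extract_reply(generated_text, prompt):
    """Return only the assistant's part of a decoded generation."""
    # The decoded text normally begins with the prompt; drop it if so
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    # Cut off any follow-up turn the model may have started on its own
    return reply.split("\nUser:")[0].strip()


prompt = build_prompt("Have you got a recipe for pancakes?")
# Placeholder standing in for the decoded model output
decoded = prompt + " Here's a recipe for pancakes:\nUser: Another question"
print(extract_reply(decoded, prompt))
```

Inside `generate_response`, the prompt would be passed to the tokenizer in place of `input_text`, and `extract_reply` would be applied to the decoded string.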


## Observations

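Because generation in this notebook is capped at 100 tokens (`max_length=100`) and GPT-2's context window is itself limited, the model cannot produce full recipes, but it does at least return the ingredients for a recipe.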

```
User: Can you please provide a recipe for spaghetti carbonara?
Assistant: Sure! Here's a delicious recipe for spaghetti carbonara:

Ingredients:
- 200g spaghetti
- 100g pancetta or guanciale, diced
- 2 large eggs
- 50g Pecorino Romano cheese, grated
- 50g Parmesan cheese, grated
- Freshly ground black pepper
- Salt
- Olive oil
```
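
The truncation can be framed as a token budget. In `model.generate`, `max_length` counts the prompt tokens as well as the newly generated ones, so the room left for the answer is the cap minus the prompt length. The sketch below works through that arithmetic. `generation_budget` is a hypothetical helper, the 1024-token window is GPT-2's context size, and the prompt length of 15 is an assumed figure for illustration.

```python
def generation_budget(prompt_tokens, max_length=100, context_window=1024):
    """Number of new tokens that can be generated after the prompt."""
    # The effective cap can never exceed the model's context window
    cap = min(max_length, context_window)
    return max(cap - prompt_tokens, 0)


# With an assumed 15-token prompt, the notebook's cap leaves 85 new tokens
print(generation_budget(15))
# Raising the cap to the full context window leaves far more room
print(generation_budget(15, max_length=1024))
```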

### Findings

However, we did notice that our model had been fine-tuned from its peculiar way of answering questions that had nothing to do with recipes or food. Essentially, when prompted with a question about, say, programming or another subject, it would draft its answer like a recipe.

For example, if you ask:

User Input: "How can I implement a binary search algorithm in Python?"

Model Response:

```
Sure! Here's a great recipe for binary search:

Ingredients:
- Python 3
- left, right = 0, len(arr) - 1
- while left <= right:
- mid = left + (right - left) // 2
```