# Fine-Tuning with Distillation

In this notebook, we will fine-tune a smaller language model, Falcon-7B Instruct, by following a distillation process using synthetic data generated from a larger model. The purpose of distillation is to transfer the knowledge of the larger model to the smaller one, making it more efficient while retaining high-quality performance for a specific task: responding as a vegan nutritionist.

**Key Steps:**
- **Generating Synthetic Data:** The larger model will act as a "teacher" by generating responses for a given set of prompts, simulating the role of a vegan nutritionist. These responses will form the synthetic dataset that the smaller model will learn from, covering a range of nutritional topics related to vegan diets.

- **Fine-Tuning the Smaller Model:** With the synthetic dataset prepared, we will fine-tune the smaller Falcon-7B Instruct model using the teacher model’s outputs as the target responses. This process enables the smaller model to learn from the larger model’s behavior, making it capable of producing relevant responses efficiently.

- **Experiment Tracking:** Throughout the fine-tuning process, an experiment tracking system will be used to log hyperparameters, model performance, and training metrics. This will help in tracking progress and identifying the optimal model configuration.

- **Next Steps:** Once the fine-tuning process is complete, we will proceed with building a separate inference pipeline and web application to handle real-time interactions. These components will be detailed in future steps of the project.

## Generating Synthetic Data

We need to do few shot learning to genearate synthetic data (at least 100 samples). So we will load in some pdf files for example papers and manually create Q&A pairs from them.

Steps are as follows:

- Load pdf files from s3
- Create user_context and question pairs from pdf files (3 examples)
- Use OpenAI to generate synthetic data (at least 100 samples) using these examples for few-shot learning
- Then we will again use OpenAI to generate outputs for the generated data, but we will also use Rag to get relevant context

In [10]:
import os
import json
import boto3
import openai
from dotenv import load_dotenv, find_dotenv

# move to root directory
os.chdir('..')

load_dotenv(find_dotenv())

False

In [11]:
open_api_key = os.environ.get('OPENAI_API_KEY')

### Loading Raw PDF Files

In [12]:
s3 = boto3.client('s3')

bucket_name = os.environ.get('AWS_BUCKET_NAME')

response = s3.list_objects_v2(Bucket=bucket_name)
response["ResponseMetadata"]["HTTPStatusCode"]

200

In [14]:
file_name = 'vegan_or_plant_based_nutrition_data.json'

# Get the file contents from S3
response = s3.get_object(Bucket=bucket_name, Key=file_name)

In [15]:
data = json.loads(response['Body'].read())

### Manually Generated Examples

Let's take a look at a bunch of abstracts and decide where to pull examples from based on interesting examples.

In [24]:
from IPython.display import display, Markdown

The following abstracts seem to be good:

In [27]:
for i in range(10, 15):
    if type(data[i]['meta_data']['abstract']['p']) == list:
        single_string_abstract = " ".join(data[i]['meta_data']['abstract']['p'])
        display(Markdown(f"**{i}. Abstract**\n\n{single_string_abstract}"))
    else:
        display(Markdown(f"**{i}. Abstract**\n\n{data[i]['meta_data']['abstract']['p']}"))
        
    # we'll visit the paper ourselves and find some good facts or results to create questions around
    print("Title:", data[i]['meta_data']['title'])
    print("URL:", data[i]['meta_data']['url'][0]['value'])

**10. Abstract**

Rheumatoid arthritis is a debilitating inflammatory condition which has a high disease burden. While there is emerging evidence that certain foods and diets could have anti-inflammatory properties and there are published ‘anti-inflammatory’ diets, there is very little understanding of patient beliefs and perceptions about the impact of diet on symptom management or attitudes to particular dietary interventions. This scoping review aims to summarize the existing literature around the beliefs that patients with rheumatoid arthritis hold regarding the impact of diet on disease activity and joint pain. It also examines the current state of evidence regarding the impact of specific dietary interventions on patient reported and objective parameters of RA disease activity. A search was conducted across seven databases for studies which included reporting on dietary beliefs related to disease management or investigations on the effect of particular diets on disease activity or joint pain. Articles were excluded if they examined extracted compounds or individual dietary supplements. Included studies were synthesized narratively. We retrieved 25,585 papers from which 68 were included in this review: 7 assessed dietary beliefs, 61 explored dietary interventions. The available literature on patient beliefs has been largely limited to quantitative studies with limited qualitative exploration. The Mediterranean, fasting and vegan diets appear to have the most benefit with regards to rheumatoid arthritis outcomes for patients. Research which examines RA patient’s beliefs and attitudes about the impact of diet on their RA symptoms and disease is currently lacking.

Title: What do we know about dietary perceptions and beliefs of patients with rheumatoid arthritis? A scoping review
URL: http://dx.doi.org/10.1007/s00296-024-05691-5


**11. Abstract**

The present study aimed to produce frozen dessert containing plant-based milk (almond, hazelnut, and lupine) and the probiotic Lb. acidophilus bacteria and to evaluate the chemical, microbiological and sensory properties during the 90 day-storage. Frozen dessert antioxidant capacity at day 0 and 90 of evaluation and changes in the phenolic compounds based on variations between different species were significant ( p  < 0.05). The differences in Lb. acidophilus counts between storage days were significant and values ranged from 4.15–8.99 log CFU/mL on the first day of storage to 3.61–7.06 at the end of the storage. Regarding the results of general acceptability in sensory evaluation, the highest color, taste and aroma scores was determined on day 0 in the hazelnut-lupine milk frozen dessert sample whereas the lowest was determined on day 30 in the almond-lupine milk frozen dessert sample. The samples with the highest antioxidant capacity were found on day 90 day in lupine frozen dessert (87.28 ± 0.007 mM) whereas the samples with the lowest antioxidant capacity were found on day 0 in the almond-hazelnut-lupine frozen dessert (18.83 ± 4.56 mM). Plant-based milk is considered suitable for the main ingredients in ice cream production, due to its health benefits its potential to be consumed as frozen dessert.

Title: Manufacturing plant-based non-dairy and probiotic frozen desserts and their impact on physicochemical, sensory and functional aspects
URL: http://dx.doi.org/10.1007/s13197-024-05964-8


**12. Abstract**

Fibromyalgia (FM) is a complex and common syndrome characterized by chronic widespread pain, fatigue, sleep disturbances, and various functional symptoms without clear structural or pathological causes. Affecting approximately 1–5% of the global population, with a higher prevalence in women, FM significantly impacts patients’ quality of life, often leading to considerable healthcare costs and loss of productivity. Despite its prevalence, the etiology of FM remains elusive, with genetic, environmental, and psychological factors, including nutrition, being implicated. Currently, no universally accepted treatment guidelines exist, and management strategies are often symptomatic. This narrative review explores the potential of a neuronutritional approach to FM management. It synthesizes existing research on the relationship between FM and nutrition, suggesting that dietary interventions could be a promising complementary treatment strategy. Various nutritional interventions, including vitamin D, magnesium, iron, and probiotics supplementation, have shown potential in reducing FM symptoms, such as chronic pain, anxiety, depression, cognitive dysfunction, sleep disturbances, and gastrointestinal issues. Additionally, weight loss has been associated with reduced inflammation and improved quality of life in FM patients. The review highlights the anti-inflammatory benefits of plant-based diets and the low-FODMAPs diet, which have shown promise in managing FM symptoms and related gastrointestinal disorders. Supplements such as vitamin D, magnesium, vitamin B12, coenzyme Q10, probiotics, omega-3 fatty acids, melatonin, S-adenosylmethionine, and acetyl- l -carnitine are discussed for their potential benefits in FM management through various mechanisms, including anti-inflammatory effects, modulation of neurotransmitters, and improvement of mitochondrial function. In conclusion, this review underscores the importance of considering neuronutrition as a holistic approach to FM treatment, advocating for further research and clinical trials to establish comprehensive dietary guidelines and to optimize management strategies for FM patients.

Title: Neuronutritional Approach to Fibromyalgia Management: A Narrative Review
URL: http://dx.doi.org/10.1007/s40122-024-00641-2


**13. Abstract**

Purpose of Review International guidelines emphasize advice to incorporate dietary measures for the prevention and in the management of hypertension. Current data show that modest reductions in weight can have an impact on blood pressure. Reducing salt and marine oils have also shown consistent benefit in reducing blood pressure. Whether other dietary constituents, in particular the amount and type of fat that play important roles in cardiovascular prevention, influence blood pressure sufficiently to be included in the management of hypertension is less certain. In this review, we provide a summary of the most recent findings, with a focus on dietary patterns, fats and other nutrients and their impact on blood pressure and hypertension. Recent Findings Since reducing salt consumption is an established recommendation only corollary dietary advice is subject to the current review. Population studies that have included reliable evaluation of fat intake have indicated almost consistently blood pressure lowering with consumption of marine oils and fats. Results with vegetable oils are inconclusive. However dietary patterns that included total fat reduction and changes in the nature of vegetable fats/oils have suggested beneficial effects on blood pressure. Plant-based foods, dairy foods and yoghurt particularly, may also lower blood pressure irrespective of fat content. Summary Total fat consumption is not directly associated with blood pressure except when it is part of a weight loss diet. Consumption of marine oils has mostly shown moderate blood pressure lowering and possibly greatest effect with docosahexaenoic acid-rich oil.

Title: Diet to Stop Hypertension: Should Fats be Included?
URL: http://dx.doi.org/10.1007/s11906-024-01310-7


**14. Abstract**

Fossil fuel-based products should be replaced by products derived from modern biomass such as plant starch, in the context of the future circular economy. Starch production globally surpasses 50 million tons annually, predominantly sourced from maize, rice, and potatoes. Here, we review plant starch with an emphasis on structure and properties, extraction, modification, and green applications. Modification techniques comprise physical, enzymatic, and genetic methods. Applications include stabilization of food, replacement of meat, three-dimensional food printing, prebiotics, encapsulation, bioplastics, edible films, textiles, and wood adhesives. Starch from maize, potatoes, and cassava shows amylose content ranging from 20 to 30% in regular varieties to 70% in high-amylose varieties. Extraction by traditional wet milling achieves starch purity up to 99.5%, while enzymatic methods maintain higher structural integrity, which is crucial for pharmaceutical applications. Enzymatic extraction improves starch yield by of up to 20%, reduces energy consumption by about 30%, and lowers wastewater production by up to 50%, compared to conventional methods. Sustainable starch modification can reduce the carbon footprint of starch production by up to 40%. Modified starches contribute to approximately 70% of the food texturizers market. The market of starch in plant-based meat alternatives has grown by over 30% in the past five years. Similarly, the use of biodegradable starch-based plastics by the bioplastic industry is growing over 20% annually, driven by the demand for sustainable packaging.Kindly check and confirm the layout of Table 1.Layout is right

Title: Plant starch extraction, modification, and green applications: a review
URL: http://dx.doi.org/10.1007/s10311-024-01753-z


Now looking at these papers manually we can get some good user context and questions that are relevant to these papers.

In [6]:
about_me_1 = "I am obese and just got the weight loss surgery done. I need to increase my protein intake bu can't tolerate dairy well."
question_1 = "What is the most tolerable protein enhancing strategy for someone like me?"

about_me_2 = "I am obese am considering getting the weight loss surgery."
question_2 = "Is there any major concerns dietary wise?"

about_me_3 = "I am an old Chinese adult and am at risk of falling often."
question_3 = "Is there a specific type of diet that might help in preventing me fall?"

### Generating 100 Similar Samples

Now we'll use OpenAI to generate 100 similar samples from the above examples.

In [7]:
PROMPT_TEMPLATE = """
Here are 3 user contexts and questions about science-based research on plant-based health and nutrition. 
Generate 100 more samples like them:

# about_me_1
{about_me_1}

# question_1
{question_1}

# about_me_2
{about_me_2}

# question_2
{question_2}

# about_me_3
{about_me_3}

# question_3
{question_3}

And put them in a JSON format with the keys: about_me and question.
"""

In [8]:
prompt = PROMPT_TEMPLATE.format(
    about_me_1=about_me_1,
    question_1=question_1,
    about_me_2=about_me_2,
    question_2=question_2,
    about_me_3=about_me_3,
    question_3=question_3
)

print(prompt)


Here are 3 user contexts and questions about science-based research on plant-based health and nutrition. 
Generate 100 more samples like them:

# about_me_1
I am obese and just got the weight loss surgery done. I need to increase my protein intake bu can't tolerate dairy well.

# question_1
What is the most tolerable protein enhancing strategy for someone like me?

# about_me_2
I am obese am considering getting the weight loss surgery.

# question_2
Is there any major concerns dietary wise?

# about_me_3
I am an old Chinese adult and am at risk of falling often.

# question_3
Is there a specific type of diet that might help in preventing me fall?

And put them in a JSON format with the keys: about_me and question.



In [85]:
from openai import OpenAI

client = OpenAI(
    api_key=open_api_key,
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": prompt}
    ],
    model="gpt-3.5-turbo",
)

In [88]:
synthetic_data = chat_completion.choices[0].message.content
print(synthetic_data)

{
  "data": [
    {
      "about_me": "I am obese and just got the weight loss surgery done. I need to increase my protein intake but can't tolerate dairy well.",
      "question": "What is the most tolerable protein enhancing strategy for someone like me?"
    },
    {
      "about_me": "I am obese and am considering getting weight loss surgery.",
      "question": "Is there any major concerns dietary wise?"
    },
    {
      "about_me": "I am an old Chinese adult and am at risk of falling often.",
      "question": "Is there a specific type of diet that might help prevent me from falling?"
    },
    {
      "about_me": "I am a vegan athlete looking to optimize my performance.",
      "question": "What plant-based foods can help improve athletic performance?"
    },
    {
      "about_me": "I have a family history of heart disease and want to improve my heart health.",
      "question": "What plant-based diet is recommended for improving heart health?"
    },
    {
      "about_me":

In [92]:
synthetic_data = json.loads(synthetic_data)

Now that we have this data, let's save it into a config file so we can load it whenever we need it. We will just copy and paste it since it is a small dataset.

In [94]:
synthetic_data['data']

[{'about_me': "I am obese and just got the weight loss surgery done. I need to increase my protein intake but can't tolerate dairy well.",
  'question': 'What is the most tolerable protein enhancing strategy for someone like me?'},
 {'about_me': 'I am obese and am considering getting weight loss surgery.',
  'question': 'Is there any major concerns dietary wise?'},
 {'about_me': 'I am an old Chinese adult and am at risk of falling often.',
  'question': 'Is there a specific type of diet that might help prevent me from falling?'},
 {'about_me': 'I am a vegan athlete looking to optimize my performance.',
  'question': 'What plant-based foods can help improve athletic performance?'},
 {'about_me': 'I have a family history of heart disease and want to improve my heart health.',
  'question': 'What plant-based diet is recommended for improving heart health?'},
 {'about_me': 'I am pregnant and following a plant-based diet.',
  'question': 'How can I ensure I am getting all the necessary nutr

### Generating Outputs with RAG

Next we need to generate responses to all these questions. But to make them relevant we will use RAG on our pdf data to get relevant context.

In [2]:
pwd

'c:\\Users\\RaviB\\GitHub\\vegan-ai-nutritionist\\notebooks'

In [4]:
# check current directory with above line and here change it to root
import os
os.chdir('..')

Now load in the examples that we saved manually.

In [5]:
from modules.q_and_a_dataset.src.examples import EXAMPLES
from modules.data_processing.src.config import INDEX_NAME, EMBEDDING_MODEL_ID
from modules.data_processing.src.embeddings import get_embedding_model, generate_embeddings
from langchain.docstore.document import Document

  from tqdm.autonotebook import tqdm, trange


In [37]:
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
from dotenv import load_dotenv

load_dotenv()

opensearch_endpoint = os.environ.get('OPENSEARCH_ENDPOINT')

AWS_ACCESS_KEY = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
AWS_REGION = os.environ.get('AWS_REGION')

In [99]:
awsauth = AWS4Auth(AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION, 'es')

# Create the OpenSearch client
client = OpenSearch(
    hosts=[{'host': opensearch_endpoint, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

# Test connection
info = client.info()
print(info)

{'name': 'b618476fdb5879478fc667c4fb6cd473', 'cluster_name': '590184030535:vegan-pdf-data', 'cluster_uuid': 'D2l3HY8VSk-qIbco5LH9gg', 'version': {'distribution': 'opensearch', 'number': '2.5.0', 'build_type': 'tar', 'build_hash': 'unknown', 'build_date': '2024-05-02T06:25:23.555552Z', 'build_snapshot': False, 'lucene_version': '9.4.2', 'minimum_wire_compatibility_version': '7.10.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'The OpenSearch Project: https://opensearch.org/'}


In [46]:
embedding_model = get_embedding_model(EMBEDDING_MODEL_ID)



We will concatenate the about me and question text to find relevant papers to it.

In [115]:
query_text = EXAMPLES[15]['about_me'] + ' ' + EXAMPLES[15]['question']
query_embedding = embedding_model.encode(query_text)
query_text += " I also want to be vegan and can't stand the thought of animal suffering. But my main concern is still that I can drink milk with no health issues."

In [116]:
query_text

"I am lactose intolerant and want to explore plant-based milk alternatives. What are some plant-based milk options that are lactose-free? I also want to be vegan and can't stand the thought of animal suffering. But my main concern is still that I can drink milk with no health issues."

In [117]:
search_body = {
    "size": 10,
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": 5
            }
        }
    },
    "_source": ["text", "metadata"]
}

response = client.search(index=INDEX_NAME, body=search_body)

In [118]:
for hit in response['hits']['hits']:
    print(f"Score: {hit['_score']}")
    print(f"Title: {hit['_source']['metadata']['title']}")
    print(f"Abstract: {hit['_source']['metadata']['abstract']['p']}")
    print(f"Text: {hit['_source']['text']}")
    text_sample = hit['_source']['text']
    print()
    

Score: 0.42665416
Title: Freeze drying microencapsulation using whey protein, maltodextrin and corn powder improved survivability of probiotics during storage
Abstract: Various studies demonstrated that probiotics play important roles in maintaining the balance of microorganisms in the body. Some strains produce bile salt hydrolase enzyme (BSH), which is an indirect mechanism for lowering cholesterol. BSH-producing probiotics as a supplement might be an alternative way to help reducing cholesterol in the body. The aim of this study was to investigate the effects of different microcapsule formulations with selected vegetable powders on growth characteristics of 3 Thai probiotic strains, Lactobacillus gasseri TM1, Lacticaseibacillus rhamnosus TM7, and L. rhamnosus TM14. Probiotics were cultured in MRS broth supplemented with 5 vegetable powders. Corn powder significantly increased growth rate of probiotics from 10^9 to 10^12 CFU/ml. Therefore, different microcapsule formulations by Maill

Let's now put this in a function.

In [103]:
PROMPT_TEMPLATE = """
You are a nutritionist specialized in plant-based diets. 
I will give you some information about myself and you will provide me with good health and diet advice.

# ABOUT ME
{ABOUT_ME}

# CONTEXT
{CONTEXT}

Please provide concrete advice in less than 250 words, and justify your answer based on the information provided in the context only if it is relevant.
"""

def build_prompt(example, context):
    about_me = example["about_me"] + ' ' + example["question"]
    
    return PROMPT_TEMPLATE.format(
        ABOUT_ME=about_me,
        CONTEXT=context,
    )

In [107]:
context = response['hits']['hits'][0]['_source']['text']
context

'Plants have evolved miRNA-target modules to regulate the tolerance to nutrient stress, some of which are evolutionarily related to environmental adaptation. A single miRNA may target more than one transcript, and vice versa, to fine-tune the expression of genes, which converges into a sophisticated and extremely fault-tolerant crosstalk. Accumulating findings highlight that miRNAs furnish a bridge for the transportation and stockpile of N and Pi nutrition in the plants, plant-environment interactions, and plant-plant communications through the modulation of N and Pi signaling transduction.The main themes that emerged from previous studies are the role of miRNAs in enhancing NUE and PUE of plants, as well as the adaptive responses of plants to N and Pi stresses. Destructive effects of nutrient stress on plants are largely dependent on the inheritance and variation of plants and the influence of the environment due to the immobility of plants and the complexity of the environment. One a

In [108]:
prompt_sample = build_prompt(EXAMPLES[15], context)
print(prompt_sample)


You are a nutritionist specialized in plant-based diets. 
I will give you some information about myself and you will provide me with good health and diet advice.

# ABOUT ME
I am lactose intolerant and want to explore plant-based milk alternatives. What are some plant-based milk options that are lactose-free?

# CONTEXT
Plants have evolved miRNA-target modules to regulate the tolerance to nutrient stress, some of which are evolutionarily related to environmental adaptation. A single miRNA may target more than one transcript, and vice versa, to fine-tune the expression of genes, which converges into a sophisticated and extremely fault-tolerant crosstalk. Accumulating findings highlight that miRNAs furnish a bridge for the transportation and stockpile of N and Pi nutrition in the plants, plant-environment interactions, and plant-plant communications through the modulation of N and Pi signaling transduction.The main themes that emerged from previous studies are the role of miRNAs in enha

In [109]:
from openai import OpenAI

gpt_client = OpenAI(
    api_key=open_api_key,
)

gpt_response = gpt_client.chat.completions.create(
    messages=[
        {"role": "user", "content": prompt_sample}
    ],
    model="gpt-3.5-turbo",
)

response_sample = gpt_response.choices[0].message.content
print(response_sample)

Based on your lactose intolerance, some plant-based milk alternatives that are lactose-free include almond milk, soy milk, coconut milk, and oat milk. These options are not only lactose-free but also packed with essential nutrients like calcium, vitamin D, and protein to support your overall health.

In the context provided, while the information mainly focuses on the role of miRNAs in plant nutrition and stress responses, it highlights the importance of fine-tuning gene expressions to enhance nutrient use efficiency in plants. Similarly, by choosing plant-based milk alternatives, you can optimize your nutrient intake and enhance your overall health by avoiding lactose-containing dairy products that may cause digestive issues for you. Embracing plant-based milk options aligns with the concept of utilizing miRNAs to modulate nutrient uptake and signaling pathways for better nutrient utilization in plants. Therefore, incorporating lactose-free plant-based milk alternatives into your diet

Now let's put everything together with functions and we can iterate over examples and save the responses.

In [122]:
def get_query(example):
    query_text = example['about_me'] + ' ' + example['question']
    return query_text

def get_query_embedding(query_text, EMBEDDING_MODEL_ID):
    embedding_model = get_embedding_model(EMBEDDING_MODEL_ID)
    query_embedding = embedding_model.encode(query_text)
    return query_embedding

In [124]:
def get_context(query_embedding, size=3):    
    search_body = {
        "size": size,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_embedding,
                    "k": 5
                }
            }
        },
        "_source": ["text", "metadata"]
    }

    response = client.search(index=INDEX_NAME, body=search_body)
    
    context = ""
    
    for i in range(size):
        context += response['hits']['hits'][i]['_source']['text'] + ' '
    
    return context
    
    

In [125]:
def get_gpt_response(open_api_key, prompt):
    gpt_client = OpenAI(api_key=open_api_key)
    gpt_response = gpt_client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-3.5-turbo",
    )
    return gpt_response.choices[0].message.content


Now here we can iterate over examples and save the responses. The we'll want to save everything in a json file.

In [133]:
from tqdm import tqdm

for example in tqdm(EXAMPLES):
    query_text = get_query(example)
    query_embedding = get_query_embedding(query_text, EMBEDDING_MODEL_ID)
    context = get_context(query_embedding)
    prompt = build_prompt(example, context)
    response = get_gpt_response(open_api_key, prompt)
    
    full_q_and_a = example.copy()
    full_q_and_a['context'] = context
    full_q_and_a['response'] = response
    print(full_q_and_a)
    
    break



{'about_me': "I am obese and just got the weight loss surgery done. I need to increase my protein intake but can't tolerate dairy well.", 'question': 'What is the most tolerable protein enhancing strategy for someone like me?', 'context': 'Nutritional supplements for sports and exercise (NSSE) refer to products containing carbohydrates, proteins, fats, minerals, vitamins, herbs, enzymes, metabolic intermediates (amino acids), or extracts of various plants/foods [Sports and exercise have become indispensable components of individuals’ lives in contemporary society [Bibliometrics is the interdisciplinary science of quantitative analysis of all knowledge carriers through mathematical and statistical methods [ In the era of competitive sports, an increasing number of people take NSSE [The overall development of research publications in the field of NSSE is favorable and is currently in the stage of a surge in the amount of literature. From a global perspective, North American countries, Eu

We can see an example of what it looks like. Next we'll want to save it to a json file, which is done in the generating_training_data.py file.

In [139]:
full_q_and_a

{'about_me': "I am obese and just got the weight loss surgery done. I need to increase my protein intake but can't tolerate dairy well.",
 'question': 'What is the most tolerable protein enhancing strategy for someone like me?',
 'context': 'Nutritional supplements for sports and exercise (NSSE) refer to products containing carbohydrates, proteins, fats, minerals, vitamins, herbs, enzymes, metabolic intermediates (amino acids), or extracts of various plants/foods [Sports and exercise have become indispensable components of individuals’ lives in contemporary society [Bibliometrics is the interdisciplinary science of quantitative analysis of all knowledge carriers through mathematical and statistical methods [ In the era of competitive sports, an increasing number of people take NSSE [The overall development of research publications in the field of NSSE is favorable and is currently in the stage of a surge in the amount of literature. From a global perspective, North American countries, 

## Fine-Tuning

We are going to fine-tune the Falcon-7B Instruct model.

In [14]:
pwd

'c:\\Users\\RaviB\\GitHub\\vegan-ai-nutritionist\\notebooks'

Before starting move into the root directory.

In [15]:
import os

os.chdir('..')

In [16]:
pwd

'c:\\Users\\RaviB\\GitHub\\vegan-ai-nutritionist'

### Loading Training and Testing Sets

First let's split the training dataset into train and test sets.

In [2]:
import json
from sklearn.model_selection import train_test_split
from pathlib import Path

training_data_path = Path("../modules/q_and_a_dataset/data/training_data.json")

# Load the JSON file
with open(training_data_path, 'r') as f:
    data = json.load(f)

# Split the data into training and testing sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

Saving the training data locally

In [3]:
train_data_path = Path("../modules/model_training/datasets/training_data.json")
test_data_path = Path("../modules/model_training/datasets/testing_data.json")

with open(train_data_path, 'w') as f:
    json.dump(train_data, f)

with open(test_data_path, 'w') as f:
    json.dump(test_data, f)

### Setting up CUDA

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

import comet_ml

COMET_API_KEY = os.environ.get("COMET_API_KEY")
if COMET_API_KEY is None:
    raise ValueError("COMET_API_KEY is not set in the environment variables.")

comet_ml.login(api_key=COMET_API_KEY)

[1;38;5;39mCOMET INFO:[0m Valid Comet API Key saved in C:\Users\RaviB\.comet.config (set COMET_CONFIG to change where it is saved).


In [2]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    print("GPU is available.")

GPU is available.


In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
#model_name = "openai-community/gpt2"
model_name = "openai-community/gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



In [4]:
import json
from datasets import Dataset
with open("C:/Users/RaviB/GitHub/vegan-ai-nutritionist/modules/model_training/datasets/training_data.json", "r") as f:
    train_data = json.load(f)

with open("C:/Users/RaviB/GitHub/vegan-ai-nutritionist/modules/model_training/datasets/testing_data.json", "r") as f:
    test_data = json.load(f)

# Convert to Hugging Face Dataset
train_dataset = Dataset.from_list(train_data)
test_dataset = Dataset.from_list(test_data)

In [5]:
train_dataset

Dataset({
    features: ['about_me', 'question', 'context', 'response'],
    num_rows: 60
})

In [6]:
def tokenize_function(examples):
    # Combine 'about_me' and 'context' for the full context
    full_context = examples['about_me'] + ' ' + examples['context']
    
    # Create the prompt using the full context
    prompt = f"Question: {examples['question']}\nContext: {full_context}\nAnswer:"
    response = examples['response']
    
    # Tokenize inputs and labels
    tokenized_input = tokenizer(prompt, truncation=True, padding="max_length", max_length=512)
    tokenized_output = tokenizer(response, truncation=True, padding="max_length", max_length=512)
    
    # Create the final input_ids and labels
    input_ids = tokenized_input["input_ids"] + tokenized_output["input_ids"][1:]  # Remove BOS token
    labels = [-100] * len(tokenized_input["input_ids"]) + tokenized_output["input_ids"][1:]
    
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids), "labels": labels}

In [7]:
def tokenize_function(examples):
    # Combine 'about_me' and 'context' for the full context
    full_context = examples['about_me'] + ' ' + examples['context']
    
    # Create the prompt using the full context
    prompt = f"Question: {examples['question']}\nContext: {full_context}\nAnswer:"
    response = examples['response']
    
    # Tokenize inputs and labels
    tokenized_input = tokenizer(prompt, truncation=True, padding="max_length", max_length=512)
    tokenized_output = tokenizer(response, truncation=True, padding="max_length", max_length=512)
    
    # Combine input and output (GPT-2 is autoregressive)
    input_ids = tokenized_input["input_ids"] + tokenized_output["input_ids"]
    
    # Create the labels (output sequence should be the entire concatenated sequence)
    labels = [-100] * len(tokenized_input["input_ids"]) + tokenized_output["input_ids"]
    
    return {
        "input_ids": input_ids, 
        "attention_mask": [1] * len(input_ids), 
        "labels": labels
    }


In [8]:
tokenizer.pad_token = tokenizer.eos_token

In [9]:
tokenized_train = train_dataset.map(tokenize_function, remove_columns=train_dataset.column_names)
tokenized_test = test_dataset.map(tokenize_function, remove_columns=test_dataset.column_names)

Map:   0%|          | 0/60 [00:00<?, ? examples/s]

Map:   0%|          | 0/15 [00:00<?, ? examples/s]

In [10]:
from comet_ml import Experiment

experiment = Experiment(api_key=COMET_API_KEY, project_name="ai_vegan_nutritionist")

[1;38;5;39mCOMET INFO:[0m Experiment is live on comet.com https://www.comet.com/ravinderrai/ai-vegan-nutritionist/25ae4e6f2cce4efbb324ef171cd7d81b





In [12]:
from transformers import AutoModelForCausalLM

# Load the model without the timeout argument
model = AutoModelForCausalLM.from_pretrained(model_name)

model.safetensors:  84%|########4 | 1.28G/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [13]:
model = model.to(device)

In [22]:
import torch

torch.cuda.empty_cache()

In [23]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

In [26]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",                # Directory to save the model and logs
    num_train_epochs=1,                    # Number of training epochs
    per_device_train_batch_size=2,         # Batch size for training
    per_device_eval_batch_size=2,          # Batch size for evaluation
    gradient_accumulation_steps=4,
    warmup_steps=500,                       # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,                     # Strength of weight decay
    logging_dir="./logs",                  # Directory for storing logs
    logging_steps=100,                       # Log every 10 steps
    eval_strategy="steps",            # Evaluate every few steps
    eval_steps=500,                         # Evaluation frequency
    save_steps=1000,                        # Save model every 1000 steps
    load_best_model_at_end=True,           # Load the best model at the end of training
    fp16=True
)

In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
)

trainer.train()

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
[1;38;5;39mCOMET INFO:[0m An experiment with the same configuration options is already running and will be reused.


  0%|          | 0/7 [00:00<?, ?it/s]

2024/10/12 21:41:18 ERROR mlflow.utils.async_logging.async_logging_queue: Run Id d3d1502db77a4d358d22649c94fcd100: Failed to log run data: Exception: Changing param values is not allowed. Param with key='torch_dtype' was already logged with value='None' for run ID='d3d1502db77a4d358d22649c94fcd100'. Attempted logging new value 'float32'.


{'train_runtime': 56.4018, 'train_samples_per_second': 1.064, 'train_steps_per_second': 0.124, 'train_loss': 4.175367082868304, 'epoch': 0.93}


TrainOutput(global_step=7, training_loss=4.175367082868304, metrics={'train_runtime': 56.4018, 'train_samples_per_second': 1.064, 'train_steps_per_second': 0.124, 'total_flos': 104014477787136.0, 'train_loss': 4.175367082868304, 'epoch': 0.9333333333333333})

### Inference

In [28]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model from the 'results' directory
model_name = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model_name_or_path = "./results/checkpoint-7"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)



In [33]:
tokenizer.pad_token = tokenizer.eos_token

In [29]:
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=50257, bias=False)
)

In [30]:
test_data[0]

{'about_me': 'I have a family history of heart disease and want to improve my heart health.',
 'question': 'What plant-based diet is recommended for improving heart health?',
 'context': 'This study presents the first extensive overview of sports nutritional supplement research through bibliometric and visual analyses and provides useful insights and analyses on future research directions. Research in the field of the NSSE can be divided into four stages: steady growth (2000–2007), exponential growth (2007–2013), fluctuation (2013–2017) and surge (2017–2024). The United States is the most active country in the field of nutritional supplements. European countries are the early batch of countries in this field, whose attention to research is not decreasing at the current stage, followed by South American and Southeast Asian countries as the emerging research mainstay countries. In recent years, Croatia, Colombia, Slovenia, Chile, Egypt, China, and Thailand have become the dominant countr

In [31]:
def preprocess_test_data(test_example, tokenizer):
    # Combine 'about_me' and 'context' for the full context
    full_context = test_example['about_me'] + ' ' + test_example['context']
    
    # Create the prompt using the full context (as done during training)
    prompt = f"Question: {test_example['question']}\nContext: {full_context}\nAnswer:"
    
    # Tokenize the input prompt
    tokenized_input = tokenizer(prompt, truncation=True, padding="max_length", max_length=100, return_tensors="pt")
    
    return tokenized_input

In [34]:
sample_datapoint = preprocess_test_data(test_data[0], tokenizer)

sample_datapoint = {
    "input_ids": sample_datapoint["input_ids"].to(device),
    "attention_mask": sample_datapoint["attention_mask"].to(device)
}

In [35]:
with torch.no_grad():
    output = model.generate(
        sample_datapoint["input_ids"],
        attention_mask=sample_datapoint["attention_mask"],
        max_length=512,
        max_new_tokens=100,
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print the generated response
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"Generated response: {generated_text}")

Both `max_new_tokens` (=100) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generated response: Question: What plant-based diet is recommended for improving heart health?
Context: I have a family history of heart disease and want to improve my heart health. This study presents the first extensive overview of sports nutritional supplement research through bibliometric and visual analyses and provides useful insights and analyses on future research directions. Research in the field of the NSSE can be divided into four stages: steady growth (2000–2007), exponential growth (2007–2013), fluctuation (2013–2017) and transition (2017–2023). In this study, we present the results of a systematic review and meta-analysis of all published studies on sports nutrition supplements and their effects on cardiovascular disease (CVD) risk factors. Keywords: Sports nutrition, sports supplements, cardiorespiratory fitness

Background: Cardiovascular disease is the leading cause of death and disability worldwide [1]. It is estimated that more than 1.5 million people die each year f

### Sagemaker

Now we are going to set up the fine-tuning process using a sagemaker training job to make use of their GPU instances. 

#### Uploading Data to S3

First we need to upload the data to an S3 bucket.

In [1]:
import boto3
import json
from pathlib import Path

train_data_path = Path("../modules/model_training/datasets/training_data.json")
test_data_path = Path("../modules/model_training/datasets/testing_data.json")

with open(train_data_path, 'r') as f:
    train_data = json.load(f)

with open(test_data_path, 'r') as f:
    test_data = json.load(f)

s3_client = boto3.client('s3')
bucket_name = "fine-tuning-training-data"

train_data_key = "train_data.json"  # S3 key for training data
s3_client.put_object(
    Bucket=bucket_name,
    Key=train_data_key,
    Body=json.dumps(train_data)  # Convert the training data to JSON string
)

test_data_key = "test_data.json"  # S3 key for testing data
s3_client.put_object(
    Bucket=bucket_name,
    Key=test_data_key,
    Body=json.dumps(test_data)  # Convert the testing data to JSON string
)

print(f"Training data saved to s3://{bucket_name}/{train_data_key}")
print(f"Testing data saved to s3://{bucket_name}/{test_data_key}")

Training data saved to s3://fine-tuning-training-data/train_data.json
Testing data saved to s3://fine-tuning-training-data/test_data.json


#### Creating Sagemaker Job

In [36]:
import os
from dotenv import load_dotenv, find_dotenv
import sagemaker
from sagemaker.huggingface import HuggingFace
import boto3
import json

load_dotenv(find_dotenv())

sagemaker_role = os.environ.get("SAGEMAKER_ROLE")
huggingface_access_token = os.environ.get("HUGGINGFACE_ACCESS_TOKEN")

In [4]:
bucket_name = "fine-tuning-training-data"
train_data_s3_uri = f"s3://{bucket_name}/train_data.json"
test_data_s3_uri = f"s3://{bucket_name}/test_data.json"

In [29]:
hyperparameters = {
    "model_name_or_path": "tiiuae/falcon-7b-instruct",
    "train_file": train_data_s3_uri,
    "validation_file": test_data_s3_uri,
    "do_train": True,
    "do_eval": True,
    "fp16": True,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "num_train_epochs": 1,
    "save_steps": 1000,
    "evaluation_strategy": "steps",
    "eval_steps": 500,
    "logging_steps": 100,
    "output_dir": "s3://falcon-artifact/", # default value is "/opt/ml/model",
}

In [30]:
%%writefile ../modules/model_training/src/train.py

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

def tokenize_function(examples):
    full_context = examples['about_me'] + ' ' + examples['context']
    prompt = f"Question: {examples['question']}\nContext: {full_context}\nAnswer:"
    response = examples['response']
    
    tokenized_input = tokenizer(prompt, truncation=True, padding="max_length", max_length=512)
    tokenized_output = tokenizer(response, truncation=True, padding="max_length", max_length=512)
    
    input_ids = tokenized_input["input_ids"] + tokenized_output["input_ids"][1:]  # Remove BOS token
    labels = [-100] * len(tokenized_input["input_ids"]) + tokenized_output["input_ids"][1:]
    
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids), "labels": labels}

# Load model and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load datasets
train_dataset = load_dataset('json', data_files={"train": "/opt/ml/input/data/train/train_data.json"})['train']
test_dataset = load_dataset('json', data_files={"test": "/opt/ml/input/data/test/test_data.json"})['test']

tokenized_train = train_dataset.map(tokenize_function, remove_columns=train_dataset.column_names)
tokenized_test = test_dataset.map(tokenize_function, remove_columns=test_dataset.column_names)

# Training arguments
training_args = TrainingArguments(
    output_dir="s3://falcon-artifact/", # default value is "/opt/ml/model",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    logging_dir="/opt/ml/logs",
    logging_steps=100,
    fp16=True,
)

# Trainer setup
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
)

trainer.train()


Overwriting ../modules/model_training/src/train.py


You need at least 16 GB of memory on the GPU to run falcon 7B instruct. You can find the options here but ml.p3.2xlarge should work.

https://aws.amazon.com/sagemaker/pricing/

In [39]:
huggingface_estimator = HuggingFace(
    entry_point="train.py",  # Python script to launch training
    source_dir="../modules/model_training/src",  # Folder where your train script and requirements.txt are
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=sagemaker_role,
    transformers_version="4.11",
    pytorch_version="1.9",
    py_version="py38",
    hyperparameters=hyperparameters,
    environment = {"HUGGINGFACEHUB_TOKEN": huggingface_access_token}
)

In [40]:
huggingface_estimator.fit({"train": train_data_s3_uri, "test": test_data_s3_uri})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2024-10-12-15-41-50-267


2024-10-12 15:41:53 Starting - Starting the training job
2024-10-12 15:41:53 Pending - Training job waiting for capacity......
2024-10-12 15:42:31 Pending - Preparing the instances for training...
2024-10-12 15:43:19 Downloading - Downloading the training image........................
2024-10-12 15:47:37 Training - Training image download completed. Training in progress...bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-10-12 15:47:55,959 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-10-12 15:47:55,989 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-10-12 15:47:55,991 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-10-12 15:47:56,262 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "test": 

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2024-10-12-15-41-50-267: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.8 train.py --do_eval True --do_train True --eval_steps 500 --evaluation_strategy steps --fp16 True --logging_steps 100 --model_name_or_path tiiuae/falcon-7b-instruct --num_train_epochs 1 --output_dir s3://falcon-artifact/ --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --save_steps 1000 --train_file s3://fine-tuning-training-data/train_data.json --validation_file s3://fine-tuning-training-data/test_data.json"
401 Client Error: Unauthorized for url: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/resolve/main/config.json
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py", line 546, in get_config_dict
    resolved_config_file = cached_path(
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 1402, in cached_path
    output_path = get_from_cache(
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", . Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html