# Fine-Tuning with Distillation

In this notebook, we will fine-tune a smaller language model, Falcon-7B Instruct, by following a distillation process using synthetic data generated from a larger model. The purpose of distillation is to transfer the knowledge of the larger model to the smaller one, making it more efficient while retaining high-quality performance for a specific task: responding as a vegan nutritionist.

**Key Steps:**
- **Generating Synthetic Data:** The larger model will act as a "teacher" by generating responses for a given set of prompts, simulating the role of a vegan nutritionist. These responses will form the synthetic dataset that the smaller model will learn from, covering a range of nutritional topics related to vegan diets.

- **Fine-Tuning the Smaller Model:** With the synthetic dataset prepared, we will fine-tune the smaller Falcon-7B Instruct model using the teacher model’s outputs as the target responses. This process enables the smaller model to learn from the larger model’s behavior, making it capable of producing relevant responses efficiently.

- **Experiment Tracking:** Throughout the fine-tuning process, an experiment tracking system will be used to log hyperparameters, model performance, and training metrics. This will help in tracking progress and identifying the optimal model configuration.

- **Next Steps:** Once the fine-tuning process is complete, we will proceed with building a separate inference pipeline and web application to handle real-time interactions. These components will be detailed in future steps of the project.

## Generating Synthetic Data

We will use few-shot learning to generate synthetic data (at least 100 samples). To build the few-shot examples, we will load some PDF files (for example, research papers) and manually create a handful of example pairs from them.

Steps are as follows:

- Load the PDF files from S3
- Create user_context and question pairs from the PDF files (3 examples)
- Use OpenAI to generate synthetic data (at least 100 samples), using these examples for few-shot learning
- Use OpenAI again to generate outputs for the generated data, this time using RAG to retrieve relevant context

In [10]:
import os
import json
import boto3
import openai
from dotenv import load_dotenv, find_dotenv

# move to root directory
os.chdir('..')

load_dotenv(find_dotenv())

True

In [5]:
open_api_key = os.environ.get('OPENAI_API_KEY')

### Loading Raw PDF Files

In [6]:
s3 = boto3.client('s3')

bucket_name = os.environ.get('AWS_BUCKET_NAME')

response = s3.list_objects_v2(Bucket=bucket_name)
response["ResponseMetadata"]["HTTPStatusCode"]

200

In [7]:
file_name = 'vegan_research_papers.json'

# Get the file contents from S3
response = s3.get_object(Bucket=bucket_name, Key=file_name)

In [9]:
data = json.loads(response['Body'].read())

Let's look at a few abstracts and pick the ones that are relevant enough to build examples from.

In [35]:
from IPython.display import display, Markdown

# change the slicing to find relevant abstracts by trial and error
for i, pdf in enumerate(data[60:63]):
    abstract = pdf['meta_data']['abstract']['p']
    display(Markdown(f"**{i}. Abstract**\n\n{abstract}"))

**0. Abstract**

Controlled sediment flushing operations (CSFOs) allow to recover reservoirs storage loss while rebalancing the sediment flux interrupted by dams but, at the same time, may cause unacceptable ecological impact. In this study, we investigated the responses of the food web of an upland stream to a CSFO, focusing on the effects of fine sediment deposition detected in three different mesohabitats, i.e., a pool, a riffle, and a step-pool. The field campaign lasted two years and included repeated measurements of fine sediment deposits, and sampling of periphyton, benthic macroinvertebrates and fishes. A moderate and patchy deposition occurred due to the CSFO with short and medium-term ecological impact on the lower trophic levels of the food web, which may affect the whole ecosystem functioning. The monitoring of all available mesohabitats in the investigated stream allowed to detect variations in the ecological response to CSFO, providing a more adequate assessment of the impact. As expected, sedimentation was larger in the pool but, in contrast to our hypotheses, the impact was lower and the recovery was longer for the benthic organisms inhabiting the riffle. In the case of fishes, no lethal impact of both brown trout and bullhead was recorded in the short term but the occurrence of longer lasting effects could not be excluded. To date, this is one of the few studies dealing with a detailed integrative assessment of the downstream impact of sediment management from reservoir on both abiotic and biotic components of stream ecosystem.

**1. Abstract**

['Background', 'Antibiotic resistance is a critical global concern, posing significant challenges to human health and medical treatments. Studying antibiotic resistance genes (ARGs) is essential not only in clinical settings but also in diverse environmental contexts. However, ARGs in unique environments such as anchialine caves, which connect both fresh and marine water, remain largely unexplored despite their intriguing ecological characteristics.', 'Results', 'We present the first study that comprehensively explores the occurrence and distribution of ARGs and mobile genetic elements (MGEs) within an anchialine cave. Utilizing metagenomic sequencing we uncovered a wide array of ARGs with the bacitracin resistance gene, bac A and multidrug resistance genes, being the most dominant. The cave’s microbial community and associated resistome were significantly influenced by the salinity gradient. The discovery of novel β-lactamase variants revealed the cave’s potential as a reservoir for previously undetected resistance genes. ARGs in the cave demonstrated horizontal transfer potential via plasmids, unveiling ecological implications.', 'Conclusions', 'These findings highlight the need for further exploration of the resistome in unique environments like anchialine caves. The interconnected dynamics of ARGs and MGEs within anchialine caves offer valuable insights into potential reservoirs and mechanisms of antibiotic resistance in natural ecosystems. This study not only advances our fundamental understanding but also highlights the need for a comprehensive approach to address antibiotic resistance in diverse ecological settings.']

**2. Abstract**

['Background', 'Inulin and inulin-derived fructooligosaccharides (FOS) are well-known prebiotics for use in companion animals and livestock. The mechanisms by which FOS contribute to health has not been fully established. Further, the fine chemistry of fructan structures from diverse sources, such as graminan-type fructans found in cereal crops, has not been fully elucidated. New methods to study fructan structure and microbial responses to these complex carbohydrates will be key for evaluating the prebiotic potency of cereal fructans found in cattle feeds. As the rumen microbiome composition is closely associated with their metabolic traits, such as feed utilization and waste production, prebiotics and probiotics represent promising additives to shift the microbial community toward a more productive state.', 'Results', 'Within this study, inulin, levan, and graminan-type fructans from winter wheat, spring wheat, and barley were used to assess the capacity of rumen-derived Bifidobacterium boum , Bifidobacterium merycicum , and Lactobacillus vitulinus to metabolize diverse fructans. Graminan-type fructans were purified and structurally characterized from the stems and kernels of each plant. All three bacterial species grew on FOS, inulin, and cereal crop fructans in pure cultures. L. vitulinus was the only species that could metabolize levan, albeit its growth was delayed. Fluorescently labelled polysaccharides (FLAPS) were used to demonstrate interactions with Gram-positive bacteria and confirm fructan metabolism at the single-cell level; these results were in agreement with the individual growth profiles of each species. The prebiotic potential of inulin was further investigated within naïve rumen microbial communities, where increased relative abundance of Bifidobacterium and Lactobacillus species occurred in a dose-dependent and temporal-related manner. This was supported by in situ analysis of rumen microbiota from cattle fed inulin. 
FLAPS probe derived from inulin and fluorescent in situ hybridization using taxon-specific probes confirmed that inulin interacts with Bifidobacteria and Lactobacilli at the single-cell level.', 'Conclusion', 'This research revealed that rumen-derived Bifidobacteria and Lactobacilli vary in their metabolism of structurally diverse fructans, and that inulin has limited prebiotic potential in the rumen. This knowledge establishes new methods for evaluating the prebiotic potential of fructans from diverse plant sources as prebiotic candidates for use in ruminants and other animals.']
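
As the output above shows, many abstracts in the dataset have nothing to do with plant-based nutrition, so slicing by trial and error can take a while. A keyword filter is one way to narrow the search. The sketch below is only an illustration: the helper names and the keyword list are assumptions, and the record layout is assumed to match the `meta_data` → `abstract` → `p` nesting used in the cell above.

```python
def abstract_to_text(abstract):
    """Return the abstract as one string, whether it is stored as a str or a list of str."""
    if isinstance(abstract, list):
        return " ".join(abstract)
    return abstract or ""


def find_relevant(records, keywords):
    """Yield (index, text) for every record whose abstract mentions any keyword."""
    lowered = [k.lower() for k in keywords]
    for i, record in enumerate(records):
        # assumes the same nesting as the notebook: record['meta_data']['abstract']['p']
        text = abstract_to_text(record["meta_data"]["abstract"]["p"])
        if any(k in text.lower() for k in lowered):
            yield i, text


# hypothetical keyword list; adjust it to the topics you care about
KEYWORDS = ["plant-based", "vegan", "vegetarian", "protein intake"]

# tiny stand-in records so the sketch runs on its own; in the notebook, pass `data` instead
sample_records = [
    {"meta_data": {"abstract": {"p": "Sediment flushing in upland streams."}}},
    {"meta_data": {"abstract": {"p": ["Background", "Plant-based diets and falls."]}}},
]

matches = list(find_relevant(sample_records, KEYWORDS))
for index, text in matches:
    print(index, text[:80])
```

The indices it yields are positions in the full list, so they can be used directly when picking papers to read.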

After some searching, abstracts 78 and 102 look like good candidates:

In [46]:
for i in [78, 102]:
    abstract = data[i]['meta_data']['abstract']['p']
    # some abstracts are stored as a list of section strings; join them into one string
    if isinstance(abstract, list):
        abstract = " ".join(abstract)
    display(Markdown(f"**{i}. Abstract**\n\n{abstract}"))

    # we'll visit the paper ourselves and find some good facts or results to create questions around
    print("URL:", data[i]['meta_data']['url'][0]['value'])

**78. Abstract**

Purpose Disproportional fat-free mass loss often occurs post-bariatric surgery, partly due to insufficient protein intake during the post-surgery recovery phase. We compared five protein-enhancing strategies (PES) on patient tolerability, satisfaction and protein intake. Materials and Methods Ninety-four participants, scheduled for bariatric surgery, were enrolled and allocated to either of the following: (1) whey powder, (2) hydrolysed collagen powder, (3) plant-based powder, (4) protein-rich products, (5) protein gel, or control. PES groups were instructed to add 30 g of powder or 2 gels or protein products to their diet. Patient satisfaction and tolerability were evaluated with questionnaires. Dietary intake was assessed prior to and during PES use. Results Seven patients dropped out (i.e. loss of contact, personal reasons or post-surgery complications) yielding an analytical cohort of 87 participants. The majority of patients (61%) did not experience dietary complaints from PES and could use PES ≥ 5 days of the week. PES non-usage was mainly related to taste dislike (58%). Hydrolysed collagen scored highest on tolerability and satisfaction: 86% of the participants could use HC ≥ 5 days and 71% were satisfied with the product. PES increased protein intake from 54.7 ± 21.5 g/day to 64.7 ± 23.4 g/day during the intervention ( p  = 0.002), which differed from the control group (+ 10.1 ± 24.5 g/day vs. − 6.3 ± 23.8 g/day for controls, p  = 0.019). Whey showed the highest increase, namely + 18.3 ± 16.3 g/day ( p  = 0.009). Conclusion PES were tolerated by the majority of participants, and an improved protein intake with PES use was seen. However, the taste of the products could be improved to further enhance satisfaction and tolerability. Graphical Abstract 

URL: http://dx.doi.org/10.1007/s11695-024-07462-4


**102. Abstract**

Objectives Epidemiology showed that the falling incidences increased with advanced age, and recent findings found link between nutritional intake and risk of falls. Nevertheless, the relationship between different plant-based diets and the risk of falls in older adults remains unclear. Our investigation aimed to evaluate the correlation between various plant-based diet indices and the occurrence of falls. Design This study is a cross-sectional and post-hoc analysis from a national cohort study. Setting and participants We included individuals over 65 years from the Chinese Longitudinal Healthy Longevity Survey (CLHLS) recruited in 2018 with information on falls and dietary assessments, finally 11,044 participants were eligible. Measurements Using food frequency questionnaire (FFQ), we calculated plant-based index scores categorized as unhealthy plant-based index (uPDI) and healthy plant-based index (hPDI). The primary outcome was falls obtained from questionnaire. Statistical analysis was performed utilizing logistic regression model to investigate the relationship between the plant-based diet indices and falls. We also used the subgroup analysis to investigate the interaction of falls and plant-based diet index (PDI) among different status and used the restricted cubic spline (RCS) curves to investigate the connection between the PDI scores and falls risk. Results Among 11,044 participants included in our study, a total of 2493 fall cases were observed. The logistic regression analysis revealed that the plant-based index related to falls. In the adjusted model, per 10-unit increment of hPDI has a significant decreased risk of falls (odd ratio [OR]: 0.85, 95% confidence interval [CI]: 0.79–0.91, P for trend < 0.001) and per 10-unit increment in uPDI increased the risk of falls (OR: 1.21, 95% CI: 1.13–1.30, P for trend < 0.001). We also revealed an interaction between smoking status and falls among the uPDI group ( P _interaction = 0.012). 
Finally, we found that with plant-based index scores increased, the odds of falls among hPDI decreased (P for overall < 0.001, P nonlinear = 0.0239), and the odds of falls among uPDI increased (P for overall < 0.001, P nonlinear = 0.0332). Conclusion and implications We found significant association between the Plant-based diet index and the risk of falls, highlighting the key role of the consumption of nutritious plant-based foods on the risk of falls, which needed take into account in developing intervention and prevention strategies to decrease falls among older Chinese adults.

URL: http://dx.doi.org/10.1007/s40520-024-02838-z


After reading these papers manually, we can write user contexts and questions that are relevant to them.

In [6]:
about_me_1 = "I am obese and just got the weight loss surgery done. I need to increase my protein intake bu can't tolerate dairy well."
question_1 = "What is the most tolerable protein enhancing strategy for someone like me?"

about_me_2 = "I am obese am considering getting the weight loss surgery."
question_2 = "Is there any major concerns dietary wise?"

about_me_3 = "I am an old Chinese adult and am at risk of falling often."
question_3 = "Is there a specific type of diet that might help in preventing me fall?"

In [7]:
PROMPT_TEMPLATE = """
Here are 3 user contexts and questions about science-based research on plant-based health and nutrition. 
Generate 100 more samples like them:

# about_me_1
{about_me_1}

# question_1
{question_1}

# about_me_2
{about_me_2}

# question_2
{question_2}

# about_me_3
{about_me_3}

# question_3
{question_3}

And put them in a JSON format with the keys: about_me and question.
"""

In [8]:
prompt = PROMPT_TEMPLATE.format(
    about_me_1=about_me_1,
    question_1=question_1,
    about_me_2=about_me_2,
    question_2=question_2,
    about_me_3=about_me_3,
    question_3=question_3
)

print(prompt)


Here are 3 user contexts and questions about science-based research on plant-based health and nutrition. 
Generate 100 more samples like them:

# about_me_1
I am obese and just got the weight loss surgery done. I need to increase my protein intake bu can't tolerate dairy well.

# question_1
What is the most tolerable protein enhancing strategy for someone like me?

# about_me_2
I am obese am considering getting the weight loss surgery.

# question_2
Is there any major concerns dietary wise?

# about_me_3
I am an old Chinese adult and am at risk of falling often.

# question_3
Is there a specific type of diet that might help in preventing me fall?

And put them in a JSON format with the keys: about_me and question.
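
The template above hard-codes exactly three examples. If the number of few-shot examples changes, the prompt could instead be built from a list. The following is a minimal sketch under that assumption: `build_few_shot_prompt` and the two demo pairs are hypothetical and only illustrate the shape of the prompt.

```python
def build_few_shot_prompt(examples, n_samples=100):
    """Build the few-shot prompt from a list of (about_me, question) pairs."""
    lines = [
        f"Here are {len(examples)} user contexts and questions about science-based "
        "research on plant-based health and nutrition.",
        f"Generate {n_samples} more samples like them:",
        "",
    ]
    for idx, (about_me, question) in enumerate(examples, start=1):
        # one numbered block per example, mirroring the layout of PROMPT_TEMPLATE
        lines += [f"# about_me_{idx}", about_me, "", f"# question_{idx}", question, ""]
    lines.append("And put them in a JSON format with the keys: about_me and question.")
    return "\n".join(lines)


# hypothetical demo pairs; in the notebook these would be the three hand-written examples
demo_examples = [
    ("I am a vegan athlete.", "Which foods help recovery?"),
    ("I am pregnant and eat plant-based.", "Which nutrients should I watch?"),
]

demo_prompt = build_few_shot_prompt(demo_examples, n_samples=10)
print(demo_prompt)
```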



In [85]:
from openai import OpenAI

client = OpenAI(
    api_key=open_api_key,
)

In [86]:
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": prompt}
    ],
    model="gpt-3.5-turbo",
)

In [88]:
synthetic_data = chat_completion.choices[0].message.content
print(synthetic_data)

{
  "data": [
    {
      "about_me": "I am obese and just got the weight loss surgery done. I need to increase my protein intake but can't tolerate dairy well.",
      "question": "What is the most tolerable protein enhancing strategy for someone like me?"
    },
    {
      "about_me": "I am obese and am considering getting weight loss surgery.",
      "question": "Is there any major concerns dietary wise?"
    },
    {
      "about_me": "I am an old Chinese adult and am at risk of falling often.",
      "question": "Is there a specific type of diet that might help prevent me from falling?"
    },
    {
      "about_me": "I am a vegan athlete looking to optimize my performance.",
      "question": "What plant-based foods can help improve athletic performance?"
    },
    {
      "about_me": "I have a family history of heart disease and want to improve my heart health.",
      "question": "What plant-based diet is recommended for improving heart health?"
    },
    {
      "about_me":

In [92]:
synthetic_data = json.loads(synthetic_data)
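
`json.loads` worked here, but a chat model does not always return clean JSON: it may add a sentence before the payload, skip a key, or repeat a sample. A more defensive parser might look like the sketch below. The function name `parse_samples` is an assumption, and it accepts either the `{"data": [...]}` shape seen above or a bare list.

```python
import json


def parse_samples(raw_text):
    """Parse model output into a de-duplicated list of {'about_me', 'question'} dicts."""
    # locate the outermost JSON object or array, ignoring any prose around it
    starts = [i for i in (raw_text.find("{"), raw_text.find("[")) if i != -1]
    end = max(raw_text.rfind("}"), raw_text.rfind("]"))
    if not starts or end == -1:
        raise ValueError("no JSON object or array found in model output")
    payload = json.loads(raw_text[min(starts):end + 1])

    # accept {"data": [...]} or a bare list of samples
    items = payload.get("data", []) if isinstance(payload, dict) else payload

    seen, samples = set(), []
    for item in items:
        if not isinstance(item, dict):
            continue
        about_me = str(item.get("about_me") or "").strip()
        question = str(item.get("question") or "").strip()
        key = (about_me, question)
        # keep only complete samples that have not been seen before
        if about_me and question and key not in seen:
            seen.add(key)
            samples.append({"about_me": about_me, "question": question})
    return samples


# made-up model output: leading prose, one duplicate, one incomplete item
raw = (
    "Here you go:\n"
    '{"data": [{"about_me": "A", "question": "Q"}, '
    '{"about_me": "A", "question": "Q"}, {"about_me": "B"}]}'
)
parsed = parse_samples(raw)
print(len(parsed), parsed)
```

Checking `len(parsed)` against the requested 100 also shows whether the model returned fewer samples than asked for.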

Now that we have this data, let's save it into a config file so we can load it whenever we need it. We will just copy and paste it since it is a small dataset.

In [94]:
synthetic_data['data']

[{'about_me': "I am obese and just got the weight loss surgery done. I need to increase my protein intake but can't tolerate dairy well.",
  'question': 'What is the most tolerable protein enhancing strategy for someone like me?'},
 {'about_me': 'I am obese and am considering getting weight loss surgery.',
  'question': 'Is there any major concerns dietary wise?'},
 {'about_me': 'I am an old Chinese adult and am at risk of falling often.',
  'question': 'Is there a specific type of diet that might help prevent me from falling?'},
 {'about_me': 'I am a vegan athlete looking to optimize my performance.',
  'question': 'What plant-based foods can help improve athletic performance?'},
 {'about_me': 'I have a family history of heart disease and want to improve my heart health.',
  'question': 'What plant-based diet is recommended for improving heart health?'},
 {'about_me': 'I am pregnant and following a plant-based diet.',
  'question': 'How can I ensure I am getting all the necessary nutr
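
Copying and pasting is fine for a dataset this small. As an alternative, the list could be written to a JSON file and read back later. The sketch below assumes a hypothetical path, `configs/synthetic_questions.json`, and uses a temporary directory so that it runs without touching the project.

```python
import json
import tempfile
from pathlib import Path


def save_samples(samples, path):
    """Write the samples to a JSON file, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump({"data": samples}, f, indent=2, ensure_ascii=False)


def load_samples(path):
    """Read the samples back from the JSON file."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)["data"]


# one sample taken from the output above; in the notebook pass synthetic_data['data']
demo_samples = [
    {
        "about_me": "I am a vegan athlete looking to optimize my performance.",
        "question": "What plant-based foods can help improve athletic performance?",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical location; in the project this would be a real config path
    config_path = Path(tmp) / "configs" / "synthetic_questions.json"
    save_samples(demo_samples, config_path)
    restored = load_samples(config_path)

print(restored == demo_samples)
```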

Next, we need to generate responses to all of these questions. To keep them relevant, we will use RAG on our PDF data to retrieve supporting context.
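
To make the idea concrete, here is a toy illustration of the retrieval step. It is not the retriever used in this project: it ranks passages by word-overlap cosine similarity instead of embeddings, and the function names are assumptions. The two passages are sentences taken from abstracts 78 and 102 above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]


def build_rag_prompt(about_me, question, passages, k=1):
    """Prepend the retrieved context to the user context and question."""
    context = "\n".join(retrieve(f"{about_me} {question}", passages, k))
    return f"# context\n{context}\n\n# about_me\n{about_me}\n\n# question\n{question}"


passages = [
    "Hydrolysed collagen scored highest on tolerability and satisfaction",
    "In the adjusted model, per 10-unit increment of hPDI has a significant decreased risk of falls",
]
about_me = "I am an old Chinese adult and am at risk of falling often."
question = "Is there a specific type of diet that might help in preventing me fall?"

top = retrieve(f"{about_me} {question}", passages)
print(build_rag_prompt(about_me, question, passages))
```

A real pipeline would typically swap `cosine` over word counts for similarity over embedding vectors, but the prompt assembly stays the same.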