# Using LangSmith to Support Fine-tuning

[blog post](https://blog.langchain.com/using-langsmith-to-support-fine-tuning-of-open-source-llms/)

## Scenarios

- Open Source LLM + Fine-tuning can outperform foundational SOTA models (e.g., ChatGPT).

## When to fine-tune

- Two ways of learning:
    - Weights: pre-training, fine-tuning
    - Prompting, via RAG.
- Analogy:
    - Fine-tuning is like studying one week in advance.
    - Prompting is like taking an exam with open notes.
- Not Good for:
    - Learning new knowledge. Can increase hallucinations.
- Good for:
    - Specialized tasks, in similar ways to RAG.
        - With many examples

## When to fine-tune, more analogies

- Zero shot learning: describe task with words.
- Few shot learning: give few examples of solving task in prompt (e.g., via RAG or manually)
- Fine tuning: allow person to practice task.
    - In applications with concrete well-defined tasks where it is possible to collect a lot of data and "practice" on it.
- [Karpathy's tweet](https://x.com/karpathy/status/1655994367033884672?ref=blog.langchain.com)
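
To make the difference between the first two options concrete, here is a minimal sketch of how a zero-shot and a few-shot prompt could be assembled. The helper names, the sentiment task, and the example sentences are all made up for illustration; they do not come from the blog post.

```python
def zero_shot_prompt(task: str, query: str) -> str:
    """Describe the task in words only, then ask for the answer."""
    return f"{task}\n\nInput: {query}\nOutput:"


def few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Prepend solved input/output pairs before the actual query."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task}\n\n{shots}\n\nInput: {query}\nOutput:"


# Hypothetical task and demonstrations, for illustration only.
task = "Classify the sentiment of the sentence as positive or negative."
examples = [
    ("I loved this film.", "positive"),
    ("The service was terrible.", "negative"),
]

zs = zero_shot_prompt(task, "The food was great.")
fs = few_shot_prompt(task, examples, "The food was great.")
print(fs)
```

Fine-tuning moves the same kind of examples out of the prompt and into the training set, which is why it needs many more of them.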

## Fine-tuning vs others complete guide

Possibilities along two axes:
- Complexity / Cost dimension (higher to lower):
    - From scratch training
    - Reinforcement Learning from Human Feedback
    - Fine Tuning, Retrieval-Augmented Generation (RAG) => Problems Addressed dimension: Form problems - Fine Tuning, Factual problems - RAG
    - Static / Dynamic Example Selection
    - Manual / Automatic Prompt Tuning

- Fine tuning is good at:
    - Learning the style or form of language.
    - Examples:
        - Pure autoregressive model on Q & A, since there's plenty of Q&A data.
        - Pure autoregressive model on instruction following, since there's plenty of data.
        - Imitate style, e.g., Shakespeare, since there is a lot of training material. Also legal jargon or claim responses.
        - Imitate structure / format, like resumes.
- It is not good at:
    - Learning new concepts that do not exist in the base knowledge of the foundational model.
    - Example:
        - Replace Romeo with Bob in a set of texts and fine-tune. See if it forgets association of Romeo with Juliet and learns that it is Bob who was with Juliet. Fails because the *Romeo* concept is ingrained in the base knowledge.
        - Answer questions like "who said to be or not to be" after fine-tuning on set of Shakespeare works. 
            - Better use search engine (RAG) to retrieve those answers and inject them in prompt for LLM.
            - Question: what does LLM add here, if the search engine is already finding relevant answers? Remove FP?

- [Fine tuning is for form, not facts](https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts?ref=blog.langchain.com)
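
The "retrieve the answer and inject it into the prompt" idea from the Shakespeare example can be sketched in a few lines. This is only a toy: the word-overlap scorer stands in for a real search engine or vector store, and the function names and passages are assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank passages by the number of words shared with the question."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Inject the retrieved passages into the prompt sent to the LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical document store, for illustration only.
passages = [
    "Hamlet says 'To be, or not to be' in Act 3, Scene 1 of Hamlet.",
    "Romeo and Juliet is set in Verona.",
    "Macbeth is told he will become king by three witches.",
]
prompt = build_prompt("Who said to be or not to be?", passages)
print(prompt)
```

The facts live in the retrieved context, so the model only has to read and phrase the answer; no weights need to change.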

## Fine-tuning - OpenAI example

[colab](https://colab.research.google.com/drive/1YCyDHPSl0d_ULubCVshrP5hLqUCorr7d?usp=sharing&ref=blog.langchain.com#scrollTo=A-8dt5qqtpgM)

- Transform message format from LangSmith to OpenAI:
    ```python
    import json

    from langsmith import Client

    client = Client()

    def to_openai_messages(example):
        # example.inputs["sentence"] is a string; example.outputs["cluster"] is a dictionary
        sentence = example.inputs["sentence"]
        text_dict = json.dumps(example.outputs["cluster"])
        return [
            {"role": "user", "content": "..." + sentence},
            {"role": "assistant", "content": text_dict},
        ]

    # name_dataset: name of the LangSmith dataset, defined elsewhere
    data = [to_openai_messages(example) for example in client.list_examples(dataset_name=name_dataset)]
    ```
- Write into binary file:
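```python
import json
from io import BytesIO
import openai  # pre-1.0 SDK interface

binary_file = BytesIO()
for m in data:
    # BytesIO only accepts bytes, so encode each JSON line.
    binary_file.write((json.dumps({"messages": m}) + "\n").encode("utf-8"))
binary_file.seek(0)  # rewind so the upload reads from the start
training_file = openai.File.create(file=binary_file, purpose="fine-tune")
```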
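
Before uploading, it can be worth checking that the in-memory file really is valid JSONL. The following is a self-contained sketch using only the standard library; the helper names, the sample conversations, and the "last message must come from the assistant" check are assumptions for illustration, not requirements stated in the notebook.

```python
import json
from io import BytesIO


def to_jsonl_bytes(conversations: list[list[dict]]) -> BytesIO:
    """Serialize one {"messages": [...]} record per line into an in-memory file."""
    buffer = BytesIO()
    for messages in conversations:
        line = json.dumps({"messages": messages}) + "\n"
        buffer.write(line.encode("utf-8"))
    buffer.seek(0)  # rewind so a later reader starts at the beginning
    return buffer


def validate_jsonl(buffer: BytesIO) -> int:
    """Parse every line back and return the number of records found."""
    count = 0
    for raw in buffer.getvalue().decode("utf-8").splitlines():
        record = json.loads(raw)
        roles = [m["role"] for m in record["messages"]]
        if not roles or roles[-1] != "assistant":
            raise ValueError(f"record {count} does not end with an assistant message")
        count += 1
    return count


# Hypothetical training conversations, for illustration only.
conversations = [
    [
        {"role": "user", "content": "extract triplets from: Alice knows Bob"},
        {"role": "assistant", "content": json.dumps({"triplets": [["Alice", "knows", "Bob"]]})},
    ],
    [
        {"role": "user", "content": "extract triplets from: Bob owns a car"},
        {"role": "assistant", "content": json.dumps({"triplets": [["Bob", "owns", "car"]]})},
    ],
]

training_bytes = to_jsonl_bytes(conversations)
n_records = validate_jsonl(training_bytes)
print(n_records)
```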
- Train:
    ```python
    import time

    job = openai.FineTuningJob.create(training_file=training_file.id, model="gpt-3.5-turbo")
    # Poll until the job reports the name of the fine-tuned model.
    while True:
        ftj = openai.FineTuningJob.retrieve(job.id)
        if ftj.fine_tuned_model is None:
            time.sleep(10)
        else:
            break
    ```
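
A polling loop like this one runs forever if the model name never appears, for instance if the job does not succeed. A hedged variant with a timeout could look like the sketch below. `wait_for_model` and its parameters are hypothetical, and the fetch function is a stand-in so that the example runs without any API access; the model name in the fake responses is invented.

```python
import time
from typing import Callable, Optional


def wait_for_model(
    fetch_model_name: Callable[[], Optional[str]],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Call fetch_model_name until it returns a name or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        name = fetch_model_name()
        if name is not None:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError("fine-tuned model was not ready before the timeout")
        time.sleep(poll_seconds)


# Stand-in for the real status call: not ready twice, then a (made-up) model name.
responses = iter([None, None, "ft:gpt-3.5-turbo:example"])
model_name = wait_for_model(lambda: next(responses), poll_seconds=0.0)
print(model_name)
```

In the setting of these notes, the fetch function would wrap the `openai.FineTuningJob.retrieve(job.id).fine_tuned_model` lookup from the loop above.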
- Fine-Tuning chain:
    ```python
    prompt = prompts.ChatPromptTemplate.from_messages(
        [
            ("human", "extract triplets from {sentence}")
        ]
    )
    llm = chat_models.ChatOpenAI(model=ftj.fine_tuned_model, temperature=0)
    fine_tuned_chain = prompt | llm

    # later we'll do:
    results = await client.arun_on_dataset (
        validation_dataset_name, # here is where the examples with format example.inputs["sentence"] are taken from, and this sentence field is used by the prompt template to get triplets
        fine_tuned_chain,
        evaluation=config, # we will build a config object from an evaluation chain below
    )
    ```

- Evaluation chain:
    - `eval_prompt` for model: you are an evaluator...
    - reasoning capability on top of the evaluation score: `commit_grade` function, defined through a dictionary schema
    - transforming obtained reasoning text into structured dict through `normalize_grade` function
    ```python
    eval_chain = (
        eval_prompt  # this is where we obtain a score, by asking it to the following model in the chain
        | ChatOpenAI(model="gpt-4", temperature=0).bind(functions=[commit_grade_schema])  # this is where we obtain a reasoning in addition to a score
        | normalize_grade  # this is where we get a structured output
    )
    ```
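
The notes name a `commit_grade` schema and a `normalize_grade` function without showing them. As a rough idea of what such pieces could look like, here is a hypothetical sketch. The field names (`reasoning`, `score`), the 0-100 scale, and both `example_` identifiers are assumptions for illustration, not the notebook's actual definitions. The sketch also assumes the model's function call arrives as a dict with a JSON-encoded `arguments` string, as in OpenAI function-calling responses.

```python
import json

# Hypothetical function schema in the OpenAI function-calling format.
example_grade_schema = {
    "name": "commit_grade",
    "description": "Record the grade for a prediction together with the reasoning behind it.",
    "parameters": {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string", "description": "Step-by-step justification."},
            "score": {"type": "integer", "minimum": 0, "maximum": 100},
        },
        "required": ["reasoning", "score"],
    },
}


def example_normalize_grade(function_call: dict) -> dict:
    """Turn the JSON-encoded function arguments into a structured result."""
    args = json.loads(function_call["arguments"])
    return {"score": args["score"] / 100, "reasoning": args["reasoning"]}


# Simulated function call, shaped like the payload a chat model would return.
fake_call = {
    "name": "commit_grade",
    "arguments": json.dumps({"reasoning": "Two of three triplets match.", "score": 80}),
}
result = example_normalize_grade(fake_call)
print(result)
```

Asking for the reasoning as a required field forces the evaluator model to justify the score in the same call that produces it.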
- Evaluator class:
    ```python
    class EvaluateTriplets(StringEvaluator):
        ...
        def _evaluate_strings(
            self,
            *,
            prediction,
            reference=None,
            input=None,
            **kwargs,
        ):
            callbacks = kwargs.get("callbacks")
            return eval_chain.invoke(
                {"prediction": prediction, "reference": reference, "input": input},
                {"callbacks": callbacks},
            )

    # Built after the class definition, so that EvaluateTriplets can be instantiated.
    config = smith.RunEvalConfig(custom_evaluators=[EvaluateTriplets()])
    ```
- Comparison vs few-shot examples.
    - Differences: 
        - prompt: in addition to the current sentence to be transformed into triplets, it includes a few examples of inputs and desired outputs, using `partials`
        - model: pre-trained (not fine-tuned) ChatGPT model
    
```python
messages = []
partials = {}
# first_5: list of LangSmith examples used as few-shot demonstrations
for i, example in enumerate(first_5):
    messages.extend([
        ("human", "... {input_%d}" % i),
        ("ai", "{output_%d}" % i),
    ])
    partials["input_%d" % i] = example.inputs["sentence"]
    partials["output_%d" % i] = json.dumps(example.outputs["clusters"])
messages.append(("human", "...{sentence}"))
few_shot_prompt = prompts.ChatPromptTemplate.from_messages(messages).partial(**partials)
```

## Quantization, LoRA and qLoRA

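- Model quantization: store the weights at lower precision (e.g., 4-bit instead of 16-bit) so that a model (e.g., 7B Llama) fits in memory.
- LoRA: efficiently fine-tune a model by reducing the number of parameters to be trained; small low-rank adapter matrices are trained while the original weights stay frozen.
- qLoRA: same, but deals with quantized models.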
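
Some back-of-the-envelope arithmetic shows why these techniques matter. The sketch below counts weights only (no activations, optimizer state, or KV cache) and assumes a square 4096 x 4096 projection matrix and a LoRA rank of 8 as illustrative values; the function names are made up for this example.

```python
def model_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA replaces the update to a d_out x d_in matrix with B (d_out x r) times A (r x d_in)."""
    return rank * (d_in + d_out)


n_params = 7e9  # a 7B-parameter model
fp16_gb = model_memory_gb(n_params, 16)  # 16-bit weights
int4_gb = model_memory_gb(n_params, 4)   # 4-bit quantized weights

d = 4096  # assumed hidden size of one projection matrix
full_params = d * d
lora_params = lora_trainable_params(d, d, rank=8)
ratio = lora_params / full_params

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
print(f"full: {full_params}, LoRA: {lora_params}, ratio: {ratio:.4%}")
```

Under these assumptions, quantization cuts weight storage from 14 GB to 3.5 GB, and LoRA trains 65,536 parameters for that matrix instead of about 16.8 million.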

## Further reading

- [Is Fine-Tuning Still Valuable?](https://hamel.dev/blog/posts/fine_tuning_valuable.html)
- [When and Why to Fine Tune](https://www.youtube.com/watch?v=cPn0nHFsvFg)
- [From Prompt to Model: Fine-tuning when you've already deployed LLMs in prod](https://www.youtube.com/watch?v=4EPZZkVrXC4)
- [Deploying Fine-Tuned Models](https://www.youtube.com/watch?v=GzEcyBykkdo)
- [Fine Tuning OpenAI Models - Best Practices](https://www.youtube.com/watch?v=Q0GSZD0Na1s)
- [Fine-Tuning with Axolotl](https://www.youtube.com/watch?v=mmsa4wDsiy0)
- [Why fine-tuning is dead](https://www.youtube.com/watch?v=h1c_jmk97Ss)
- [Fine-Tuning Llama 3 and Using It Locally: A Step-by-Step Guide](https://www.datacamp.com/tutorial/llama3-fine-tuning-locally)
- [RAG vs Fine-Tuning: A Comprehensive Tutorial with Practical Examples](https://www.datacamp.com/tutorial/rag-vs-fine-tuning?utm_cid=19589720821&utm_aid=157098104375&utm_campaign=230119_1-ps-other~dsa-tofu~all_2-b2c_3-emea_4-prc_5-na_6-na_7-le_8-pdsh-go_9-nb-e_10-na_11-na&utm_loc=9212664-&utm_mtd=-c&utm_kw=&utm_source=google&utm_medium=paid_search&utm_content=ps-other~emea-en~dsa~tofu~tutorial~artificial-intelligence&gad_source=1&gad_campaignid=19589720821&gbraid=0AAAAADQ9WsFIxTj2FMxHfYtkNv25bmuXQ&gclid=Cj0KCQiA9t3KBhCQARIsAJOcR7wj9NfQPsOqKHX3h1x-Tiff_LxQP22G2oVC6YC5A8lgWagBU5tdXlEaAkmlEALw_wcB)
- [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)
- [Supervised fine-tuning - OpenAI](https://platform.openai.com/docs/guides/supervised-fine-tuning)
    - [Model optimization - OpenAI](https://platform.openai.com/docs/guides/model-optimization)
- Using LangSmith to Support Fine-tuning - Colab notebooks:
    - [With Llama](https://colab.research.google.com/drive/1tpywvzwOS74YndNXhI8NUaEfPeqOc7ub?usp=sharing&ref=blog.langchain.com)
    - [OpenAI](https://colab.research.google.com/drive/1YCyDHPSl0d_ULubCVshrP5hLqUCorr7d?usp=sharing&ref=blog.langchain.com)
- [Tuna for synthetic dataset generation for fine-tuning](https://blog.langchain.com/introducing-tuna-a-tool-for-rapidly-generating-synthetic-fine-tuning-datasets/)
    - [Demo](https://blog.langchain.com/fine-tuning-chatgpt-surpassing-gpt-4-summarization/)
- [LIMA: curating small training datasets](https://arxiv.org/abs/2305.11206?ref=blog.langchain.com)