<a href="https://colab.research.google.com/github/MaxiBlinkz/BookyMcBookface/blob/master/examples/ielts_fine_tunning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/simplifine-llm/Simplifine/blob/main/examples/cloud_quickstart.ipynb)
### 📦 Installing Required Libraries

Before we begin fine-tuning, we need to install the necessary libraries. In this step, we install the `Simplifine` library, which provides tools to streamline the fine-tuning process for large language models, together with the `datasets` and `transformers` libraries from Hugging Face.

- The `Simplifine` library makes the fine-tuning process more efficient, whether you're working locally or in the cloud.
- The `datasets` library loads and processes the dataset we'll be using for this project.
- The `transformers` library loads the model and tokenizer once training has finished.

Running this cell installs all three libraries; the `-q` flag keeps pip's output to a minimum.


In [1]:
!pip install git+https://github.com/simplifine-llm/Simplifine.git -q
!pip install datasets -q
!pip install transformers -q
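
Because `-q` hides most of pip's output, it can help to confirm afterwards that the packages are actually installed. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is illustrative and not part of Simplifine, and since this notebook does not state the distribution name under which Simplifine registers itself, only the two Hugging Face packages are checked.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the two Hugging Face packages installed by the cell above.
for name, version in installed_versions(["datasets", "transformers"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If either package is reported as missing, re-run the install cell without `-q` to see pip's full output.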


### 🛠️ Setting Up for Local Training

In this section, we prepare to fine-tune our model using Google Colab’s resources. The steps below outline how to configure and start the training process.

1. **Importing Libraries:**
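   - We import `train_engine` from `simplifine_alpha`, the package installed by the `Simplifine` library, which provides the functions that handle the fine-tuning process.
   - We also import `SFTConfig` from the `trl` library, which allows us to configure the supervised fine-tuning parameters.
   - We import `load_dataset` from the `datasets` library to read the training and test data.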

2. **Dataset Selection:**
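   - We load the training and test data from two local CSV files, `filtered_df_train.csv` and `filtered_df_test.csv`, using `load_dataset('csv', data_files=...)`. Upload both files to the Colab working directory (`/content`) before running the cell; otherwise loading fails with a `FileNotFoundError`.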

3. **Prompt Configuration:**
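   - We create a `sftPromptConfig` object to specify how the training data is formatted.
   - The `keys` parameter lists the dataset columns (`prompt` and `essay`) that fill the placeholders in `template`.
   - The `template` parameter defines the input format, and the `response_template` specifies how the model should generate outputs.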
   - The `use_chat_template` flag is set to `True` to format the inputs in a conversational style, which can be effective for chat-based models.

4. **Training Configuration:**
   - We define the training settings using `SFTConfig`. This includes parameters like batch size, learning rate, and the number of epochs.
   - We also enable `fp16` (16-bit floating-point) training for faster computation and set `gradient_checkpointing` to save memory during training.

5. **Model Selection:**
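   - The model we’re fine-tuning is `'decapoda-research/llama-3-1b-hf'`, which the code refers to as the Llama 3 1B model. A model of roughly this size is small enough for demonstration purposes on Colab. Confirm that this repository ID is available on the Hugging Face Hub before running the cell.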

6. **Training the Model:**
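   - Finally, we call `sft_train` to start the fine-tuning process. This step will take a while to complete, because we update all of the pretrained model's weights (full fine-tuning) without any optimizations like quantization or LoRA. Both `use_zero` and `use_ddp` are set to `False`, since we are training on a single Colab GPU.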

Running this cell will fine-tune the model locally on Colab, using the configurations we’ve set up. This is ideal for quick experiments or when cloud resources are not available.
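
To make the prompt configuration concrete, the sketch below fills the template used in the next cell with one made-up row, using plain `str.format`. It only illustrates how the `keys` line up with the placeholders; it is not Simplifine's internal implementation. The sample row and the helper `render_prompt` are hypothetical, and appending the response template at the end is an assumption about how the pieces fit together.

```python
TEMPLATE = "### Prompt: {prompt}\n### Essay: {essay}"
RESPONSE_TEMPLATE = ". \n### Feedback:"
KEYS = ["prompt", "essay"]


def render_prompt(row, template=TEMPLATE, keys=KEYS):
    """Substitute the columns named in `keys` into `template`."""
    missing = [k for k in keys if k not in row]
    if missing:
        raise KeyError(f"row is missing columns: {missing}")
    return template.format(**{k: row[k] for k in keys})


# Hypothetical example row; real rows come from the CSV files.
row = {
    "prompt": "Some people think cities should ban cars.",
    "essay": "I partly agree with this view.",
}
print(render_prompt(row) + RESPONSE_TEMPLATE)
```

A row that lacks one of the listed columns raises a `KeyError` here, which is a quick way to catch a mismatch between `keys` and the CSV header before starting a long training run.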

In [2]:
from simplifine_alpha import train_engine
from trl import SFTConfig
import pandas as pd
from datasets import load_dataset

# Load the training dataset
train_dataset = load_dataset('csv', data_files='filtered_df_train.csv')

# Load the testing dataset
test_dataset = load_dataset('csv', data_files='filtered_df_test.csv')

# Define prompt config
sft_prompt_config = train_engine.sftPromptConfig(
  keys=['prompt', 'essay'],
  template="### Prompt: {prompt}\n### Essay: {essay}",
  response_template=". \n### Feedback:",
  use_chat_template=True
  )

# Define training config, increase max length to accommodate longer responses
sft_config = SFTConfig(
    output_dir='/content/ielts_review_llama3',
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    num_train_epochs=2,
    report_to='none',
    fp16=True,
    gradient_checkpointing=True,
    max_length=4096,  # Increased sequence length for training
)

# Select the Llama 3 1B model
model_name = 'decapoda-research/llama-3-1b-hf'

# Fine-tune the model
train_engine.sft_train(model_name=model_name,
                       dataset=train_dataset, # using training dataset
                       sft_config=sft_config,
                       sft_prompt_config=sft_prompt_config,
                       use_zero=False,
                       use_ddp=False
                    )

FileNotFoundError: Unable to find '/content/filtered_df_train.csv'
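
The `FileNotFoundError` above means the CSV files were not in `/content` when the cell ran. A small pre-flight check, sketched below with the standard library, makes this failure explicit before any model weights are downloaded. The helper `missing_files` is hypothetical and not part of Simplifine or `datasets`.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# File names taken from the training cell; paths are relative to the working directory.
required = ["filtered_df_train.csv", "filtered_df_test.csv"]
absent = missing_files(required)
if absent:
    print("Upload these files before training:", ", ".join(absent))
else:
    print("All dataset files found.")
```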

### ☁️ Training the Model on Cloud Servers

In this section, we’re moving from local training to cloud-based training using Simplifine’s cloud infrastructure. This allows you to leverage powerful GPUs like the A100 for more intensive tasks, making it easier to handle larger models and datasets.

1. **Importing the `train_utils` Module:**
   - We start by importing the `train_utils` module from the `Simplifine` library. This module provides utilities to interact with Simplifine's cloud servers.

2. **Model and API Configuration:**
   - We select a different model for this cloud training: `'microsoft/Phi-3-mini-4k-instruct'`. With 3.8 billion parameters, it is larger than the model trained locally and well-suited for training on cloud GPUs.
   - The `simplifine_api_key` is your unique key to access Simplifine’s cloud services. Ensure you have it ready.
   - The `gpu_type` is set to `'a100'`, which specifies the type of GPU to be used in the cloud. The A100 is a high-performance GPU ideal for deep learning tasks.

   ### 🔑 Need an API Key?
   If you don't have an API key yet, you can [**request one here for free**](https://www.simplifine.com/api-key-interest). The turnaround time is just 24 hours, so you'll be up and running in no time!

3. **Client Initialization:**
   - We create a `Client` object using the API key and GPU type. This client will handle the communication with Simplifine’s cloud infrastructure, managing the training job on your behalf.

4. **Defining the Training Job:**
   - The `job_name` is set to `'fake_news_english_phi3'`, which uniquely identifies this training task.
   - We then call the `sft_train_cloud` method on our `client` object. This method sends the training job to the cloud, using the model and configurations we’ve defined earlier.

5. **Cloud Training Setup:**
   - We enable `use_zero=True` to utilize DeepSpeed's ZeRO optimization, allowing the model to scale effectively across multiple GPUs.
   - We disable Distributed Data Parallel (DDP) for this job with `use_ddp=False`. ZeRO already partitions optimizer states, gradients, and optionally parameters across the GPUs, so a separate DDP setup is not needed.

Running this cell will initiate the training process on Simplifine’s cloud servers, allowing you to offload the heavy lifting to a powerful cloud infrastructure. This is ideal when working with larger models or when your local resources are insufficient.
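
Rather than pasting the API key into the notebook, one option is to read it from an environment variable so it is not saved with the file. The sketch below shows that pattern under stated assumptions: the variable name `SIMPLIFINE_API_KEY` and the helper `read_api_key` are hypothetical and are not defined by Simplifine.

```python
import os


def read_api_key(env=None, name="SIMPLIFINE_API_KEY"):
    """Fetch the API key from an environment mapping, failing loudly if unset."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before submitting a job.")
    return key


# Usage in the next cell (hypothetical variable name):
# simplifine_api_key = read_api_key()
```

The returned string can then be passed to `train_utils.Client` exactly as the hard-coded key would be.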


In [None]:
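from simplifine_alpha import train_utils

# Model to fine-tune in the cloud (Phi-3 mini)
model_name = 'microsoft/Phi-3-mini-4k-instruct'
simplifine_api_key = 'PUT YOUR OWN API KEY PROVIDED BY SIMPLIFINE'
gpu_type = 'a100'
client = train_utils.Client(simplifine_api_key, gpu_type)

job_name = 'fake_news_english_phi3'

# `dataset_name` is not defined in the earlier cells, so it is set here.
# The job name and the sample outputs in this notebook correspond to this
# Hub dataset; the columns listed in sft_prompt_config.keys must exist in
# whichever dataset you use.
dataset_name = 'community-datasets/fake_news_english'

client.sft_train_cloud(job_name=job_name, model_name=model_name, dataset_name=dataset_name,
                       sft_config=sft_config, sft_prompt_config=sft_prompt_config,
                       use_zero=True, use_ddp=False
                      )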

[2024-08-07 18:34:35,110] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)


### 📝 Checking the Status of Your Training Jobs

After submitting your training job to Simplifine’s cloud servers, it’s important to monitor its status to ensure everything is running smoothly. In this section, we’ll check the status of your most recent job.

1. **Retrieving Job Status:**
   - We call the `get_all_jobs` method on our `client` object. This method returns a list of all jobs associated with your API key, including their current statuses.

2. **Displaying the Latest Job:**
   - We take the last entry in the list (`status[-1:]`), which is the most recent job, and print it. This gives you a quick overview of how your most recent training job is progressing.

3. **Understanding Job Statuses:**
   - Your job can have one of the following statuses:
     - `pending`: The job has been submitted and is waiting to start.
     - `in progress`: The job is currently running.
     - `stopped`: The job was stopped before completion, either manually or due to an error.
     - `completed`: The job has successfully finished.

Running this cell will display the status of your most recent job, helping you keep track of your training tasks on Simplifine’s cloud servers.
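
Since a job moves through several statuses, it can be convenient to poll until it finishes instead of re-running the cell by hand. The following is a minimal sketch, assuming `get_all_jobs` returns a list of dictionaries with a `'status'` key, as in the printed output below. The helpers `latest_status` and `wait_for_latest_job`, the polling interval, and the choice of terminal statuses are illustrative and not part of Simplifine.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"stopped", "completed"}


def latest_status(jobs):
    """Return the status of the last job in the list, or None if there are no jobs."""
    return jobs[-1]["status"] if jobs else None


def wait_for_latest_job(fetch_jobs, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call `fetch_jobs()` repeatedly until the latest job reaches a terminal status."""
    for _ in range(max_polls):
        status = latest_status(fetch_jobs())
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("latest job did not finish within the polling budget")


# Usage with the client from the previous cell:
# final_status = wait_for_latest_job(client.get_all_jobs)
```

Passing the fetch function as an argument keeps the loop independent of the client, so it can be tried out with a stub before spending any cloud time.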


In [None]:
status = client.get_all_jobs()
for num,i in enumerate(status[-1:]):
  print(f'Job {num}: {i}')

Job 0: {'job_id': '183c65ad-2b4e-4d11-b2a5-d66232d5b15b', 'job_name': 'fake_news_english_phi3', 'status': 'completed'}


### 📊 Retrieving and Viewing Training Logs

After checking the status of your training job, you might want to dive deeper into the details by viewing the training logs. These logs provide insights into the training process, including any issues or updates on the progress.

1. **Getting the `job_id`:**
   - We start by extracting the `job_id` of the last job from the status list. The `job_id` is a unique identifier for each training job, which we’ll use to retrieve its logs.

2. **Retrieving Logs:**
   - We call the `get_train_logs` method on our `client` object, passing in the `job_id`. This method fetches the detailed logs for the specified job, giving you access to the complete training history.

3. **Viewing the Logs:**
   - Finally, we print the `response` from the logs, which contains detailed information about the training process. This includes updates, errors, and any other relevant messages from the training run.

Running this cell will display the logs for your most recent job, allowing you to monitor and troubleshoot the training process effectively.
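
Training logs can be long, so filtering them by severity is often the fastest way to find problems. The sketch below handles the two line formats visible in the sample output: prefixes such as `W0806 18:14:41...` and bracketed levels such as `[2024-08-06 18:14:46,878] [INFO]`. It assumes `logs['response']` is a single string; the helpers `line_level` and `filter_logs` are hypothetical.

```python
import re

# "W0806 18:14:41.510000 ..." style: one-letter level, month/day, then time.
_PREFIX = re.compile(r"^([IWEF])\d{4} \d{2}:\d{2}:\d{2}")
# "[2024-08-06 18:14:46,878] [INFO] ..." style: timestamp, then level in brackets.
_BRACKET = re.compile(r"^\[[^\]]+\] \[(\w+)\]")
_LETTER_LEVELS = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def line_level(line):
    """Return the severity of a log line, or None if no known prefix matches."""
    match = _PREFIX.match(line)
    if match:
        return _LETTER_LEVELS[match.group(1)]
    match = _BRACKET.match(line)
    if match:
        return match.group(1).upper()
    return None


def filter_logs(text, levels=("WARNING", "ERROR", "FATAL")):
    """Keep only the lines whose severity is in `levels`."""
    wanted = set(levels)
    return [line for line in text.splitlines() if line_level(line) in wanted]


# Usage with the logs fetched in the next cell:
# for line in filter_logs(logs['response']):
#     print(line)
```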


In [None]:
# getting the job_id of the last job
job_id = status[-1]['job_id']

logs = client.get_train_logs(job_id)
print(logs['response'])

W0806 18:14:41.510000 129132731527296 torch/distributed/run.py:779] 
W0806 18:14:41.510000 129132731527296 torch/distributed/run.py:779] *****************************************
W0806 18:14:41.510000 129132731527296 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0806 18:14:41.510000 129132731527296 torch/distributed/run.py:779] *****************************************
[2024-08-06 18:14:46,878] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-06 18:14:46,910] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-06 18:14:46,961] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-06 18:14:47,065] [INFO] [real_accelerator.py:203:get_accelerator

### 📂 Downloading and Saving the Trained Model

Once your training job is completed, the next step is to download the trained model so you can use it locally or for further fine-tuning.

1. **Creating a Directory for the Model:**
   - We begin by creating a new folder called `sf_trained_model_zero_phi`. This folder will serve as the destination for the downloaded model files.

2. **Downloading the Model:**
   - We use the `download_model` method on our `client` object to download the trained model from the cloud. The `job_id` is passed to specify which model to download, and we extract the files to the newly created directory.
   
   - **Tip:** This process might take some time depending on the size of the model, so feel free to take a break or grab a coffee while you wait! ☕

Running this cell will download your trained model and save it in the specified directory, making it ready for use in your next project or analysis.


In [None]:
import os

# folder that will hold the downloaded model
model_dir = '/content/sf_trained_model_zero_phi'

# create the folder; exist_ok avoids an error when the cell is re-run
os.makedirs(model_dir, exist_ok=True)

# download and save the model to it.
# This might take some time, have a sip of that coffee! :)
client.download_model(job_id=job_id, extract_to=model_dir)

Downloading: 100%|██████████| 6.99G/6.99G [00:42<00:00, 166MiB/s]



Directory downloaded successfully and saved to /content/sf_trained_model_zero_phi/183c65ad-2b4e-4d11-b2a5-d66232d5b15b.zip
Model unzipped successfully to /content/sf_trained_model_zero_phi
Deleted the zip file at /content/sf_trained_model_zero_phi/183c65ad-2b4e-4d11-b2a5-d66232d5b15b.zip
Model downloaded, unzipped, and zip file deleted successfully!


### 🔄 Loading the Trained Model and Tokenizer

Now that we've successfully downloaded the trained model, the next step is to load it into our environment so we can use it for inference or further fine-tuning.

1. **Importing Required Libraries:**
   - We import `AutoModelForCausalLM` and `AutoTokenizer` from the `transformers` library. These classes are used to load the model and tokenizer from the saved files.

2. **Setting the Path:**
   - We set the `path` variable to point to the directory where we saved the trained model (`'/content/sf_trained_model_zero_phi'`).

3. **Loading the Model:**
   - We use `AutoModelForCausalLM.from_pretrained(path)` to load the trained model from the specified path. This initializes the model so it’s ready for use.

4. **Loading the Tokenizer:**
   - Similarly, we load the tokenizer using `AutoTokenizer.from_pretrained(path)`. The tokenizer is essential for processing text input into a format that the model can understand.

Running this cell will load both the trained model and tokenizer into your environment, allowing you to start generating text or continue fine-tuning with your freshly trained model.
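
When `from_pretrained` is given a local directory, it expects to find a `config.json` file there. If the archive was extracted into a nested subfolder, loading fails, so a quick check of the directory layout can save a confusing error. The sketch below uses only the standard library; the helpers `looks_like_model_dir` and `find_model_dirs` are hypothetical.

```python
from pathlib import Path


def looks_like_model_dir(path):
    """True if `path` is a directory that directly contains a `config.json` file."""
    return (Path(path) / "config.json").is_file()


def find_model_dirs(root):
    """Return every directory under `root` (including `root`) that holds a `config.json`."""
    return sorted(str(p.parent) for p in Path(root).rglob("config.json"))


# Usage with the download location from the previous cell:
# path = '/content/sf_trained_model_zero_phi'
# if not looks_like_model_dir(path):
#     print("config.json not found here; candidates:", find_model_dirs(path))
```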

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = '/content/sf_trained_model_zero_phi'
sf_model = AutoModelForCausalLM.from_pretrained(path)
sf_tokenizer = AutoTokenizer.from_pretrained(path)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### 📚 Loading the Dataset

Before we can use our trained model for inference or further fine-tuning, we need to load the dataset that we’ve been working with.

1. **Importing the Datasets Library:**
   - We start by importing the `datasets` library, which provides easy access to a wide range of datasets, including the one we've been using for training.

2. **Loading the Dataset:**
   - We load the dataset using the `load_dataset` function from the `datasets` library. The `dataset_name` variable contains the name of the dataset we specified earlier in our code.

Running this cell will load the dataset into your environment, making it ready for evaluation, inference, or further fine-tuning.
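
The generation cell later takes a slice such as `[:3]` from a split. In the `datasets` library, slicing a split returns a dictionary that maps each column name to a list of values, not a list of row dictionaries. The pure-Python sketch below mimics that shape with made-up values and shows one way to turn it back into rows. The helper `columns_to_rows` is hypothetical, and the column names are placeholders.

```python
def columns_to_rows(batch):
    """Convert {'col': [v0, v1, ...]} into [{'col': v0}, {'col': v1}, ...]."""
    columns = list(batch)
    lengths = {len(batch[c]) for c in columns}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return [dict(zip(columns, values)) for values in zip(*(batch[c] for c in columns))]


# Made-up batch in the column-oriented shape that slicing produces.
batch = {"text": ["first", "second"], "label": [0, 1]}
for row in columns_to_rows(batch):
    print(row)
```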

In [None]:
import datasets
dataset = datasets.load_dataset(dataset_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/5.01k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/43.1k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/492 [00:00<?, ? examples/s]

### 🧠 Generating Text with the Trained Model

Now that we've loaded both the model and the dataset, it’s time to generate some text using our trained model. In this section, we’ll configure the generation settings and produce some sample outputs.

1. **Importing Inference Tools:**
   - We import `inference_tools` from the `simplifine_alpha` library. This module provides the necessary tools to generate text using the model we’ve fine-tuned.

2. **Configuring Text Generation:**
   - We create a `GenerationConfig` object to define how the model should generate text. This configuration includes:
     - `prompt_template` and `response_template`: Templates for how the inputs and outputs are formatted.
     - `keys`: Specifies the data keys used in the templates.
     - `train_type`: Indicates that we're using supervised fine-tuning (`sft`).
     - `max_length`: The maximum length of the generated sequences.
     - `num_return_sequences`: How many sequences to generate.
     - `do_sample`, `top_k`, `top_p`, `temperature`: Parameters that control the randomness and diversity of the generated text.

3. **Generating Text:**
   - We call `generate_from_pretrained` using our fine-tuned model, tokenizer, and the generation configuration. We also pass in a small sample of the dataset to generate text based on the training data.
   
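   - **Note:** We’re using only the first three examples from the test CSV (`test_dataset['train'][:3]`) for quick testing. A CSV file loaded with `load_dataset('csv', ...)` is placed under a split named `train` by default, which is why the test data is accessed through the `'train'` key.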

4. **Displaying the Generated Text:**
   - Finally, we print the generated text, which provides a glimpse into how well the model has learned to detect fake news.

Running this cell will generate text using your trained model, showcasing its ability to produce outputs based on the fine-tuned dataset. This is where you can see the real impact of your training efforts!
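
To give a feel for what `temperature`, `top_k`, and `top_p` do, the sketch below applies them to a toy distribution over three tokens. It is a simplified illustration in pure Python, not the `transformers` implementation; the function `filter_distribution` and the example scores are made up.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p; return renormalised probabilities.

    `logits` maps token -> raw score. Tokens that are filtered out are dropped.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {t: s / temperature for t, s in logits.items()}
    peak = max(scaled.values())
    exps = {t: math.exp(s - peak) for t, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)

    # top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    norm = sum(p for _, p in ranked)
    ranked = [(t, p / norm) for t, p in ranked]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {t: p / norm for t, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}
print(filter_distribution(toy_logits, temperature=0.99, top_k=2, top_p=0.9))
```

Lowering `temperature` sharpens the distribution toward the top token, while smaller `top_k` or `top_p` values cut off the unlikely tail before sampling.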

In [None]:
from simplifine_alpha import inference_tools

# Configuration for generating text with longer sequences
config = inference_tools.GenerationConfig(
    prompt_template=sft_prompt_config.template,
    response_template=sft_prompt_config.response_template,
    keys=sft_prompt_config.keys,
    train_type='sft',
    max_length=4096,  # Increased sequence length for generation
    num_return_sequences=1,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.99
)

generated_text = inference_tools.generate_from_pretrained(sf_model, sf_tokenizer, config, data=test_dataset['train'][:3]) # using test dataset
print(generated_text)

You are not running the flash-attention implementation, expect numerical differences.


[['###URL: http://www.redflagnews.com/headlines-2016/cdc-proposes-rule-to-apprehend-and-detain-anyone-anywhere-at-any-time-for-any-duration-without-due-process-or-right-of-appeal-and-administer-forced-vaccinations-or-medical-treatment-without-consent-or-parens. \n###'], ['###URL: http://www.redflagnews.com/headlines-2016/-outrage-what-obama-just-did-to-the-white-house-logo-will-make-you-sick-128097.html \n###CLS: 0'], ['###URL: http://www.redflagnews.com/headlines-2016/white-house-cancels-all-obama-appearances-at-hillary-campaign-events-as-he-navigates-mandatory-divorce-june-28-2016-1651142.html \n###CLS: 1']]
