# Automating the Training Pipeline

#### Automation and Orchestration of a Supervised Tuning Pipeline

- Reuse an existing Kubeflow Pipeline for Parameter-Efficient Fine-Tuning (PEFT) of [PaLM 2](https://ai.google/discover/palm2/), a foundation model from Google.
- The advantage of reusing a pipeline is that you do not have to build it from scratch; you only need to specify some of the parameters.

In [21]:
### the files are consistent for all learners
TRAINING_DATA_URI = "./tune_data_stack_overflow_python_qa.jsonl" 
EVAUATION_DATA_URI = "./tune_eval_data_stack_overflow_python_qa.jsonl"  

In [30]:
print(TRAINING_DATA_URI)
print(EVAUATION_DATA_URI)
# Outside the course environment, pass the path
# where the data is stored.


./tune_data_stack_overflow_python_qa.jsonl
./tune_eval_data_stack_overflow_python_qa.jsonl
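
Both files use the JSONL format (one JSON object per line), so a quick structural check before submitting a tuning job can catch a malformed file early. The following is a minimal sketch, not part of the original notebook: `count_jsonl_records` is a hypothetical helper, and the field names `input_text` and `output_text` in the demo record are illustrative assumptions rather than something the files above are confirmed to contain.

```python
import json
import tempfile
from pathlib import Path


def count_jsonl_records(path):
    """Count the JSON objects in a JSONL file; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}: line {line_number} is not valid JSON") from error
            if not isinstance(record, dict):
                raise ValueError(f"{path}: line {line_number} is not a JSON object")
            count += 1
    return count


# Demonstration on a throwaway file (field names are assumptions for the demo only).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonl"
    demo_path.write_text(
        '{"input_text": "q1", "output_text": "a1"}\n'
        '{"input_text": "q2", "output_text": "a2"}\n',
        encoding="utf-8",
    )
    demo_count = count_jsonl_records(demo_path)

print(demo_count)
```

In the notebook, the same helper could be called on `TRAINING_DATA_URI` and `EVAUATION_DATA_URI`, since both point at local files.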


- Provide the model with a version.
- Versioning a model allows for:
  - Reproducibility: Reproduce your results and ensure your models perform as expected.
  - Auditing: Track changes to your models.
  - Rollbacks: Roll back to a previous version of your model.

In [31]:
### path to the pipeline file to reuse
### the file is provided in your workspace as well
template_path = 'https://us-kfp.pkg.dev/ml-pipeline/\
large-language-model-pipelines/tune-large-model/v2.0.0'

In [32]:
template_path

'https://us-kfp.pkg.dev/ml-pipeline/large-language-model-pipelines/tune-large-model/v2.0.0'

In [33]:
import datetime

In [34]:
date = datetime.datetime.now().strftime("%H:%d:%m:%Y")
date

'17:11:10:2024'

In [35]:
MODEL_NAME = f"deep-learning-ai-model-{date}"
MODEL_NAME

'deep-learning-ai-model-17:11:10:2024'
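
The `%H:%d:%m:%Y` format orders the fields as hour, day, month, year, so names do not sort chronologically as text, and two runs started within the same hour of the same day get the same name. One possible alternative, sketched below under the assumption that a year-first timestamp with seconds is acceptable for your naming scheme, is a hypothetical helper `make_versioned_name` (not part of the original notebook):

```python
import datetime


def make_versioned_name(prefix, moment):
    """Build a name whose timestamp suffix sorts chronologically as plain text."""
    return f"{prefix}-{moment.strftime('%Y%m%d-%H%M%S')}"


# A fixed moment keeps the example reproducible; the notebook would pass
# datetime.datetime.now() instead.
example_moment = datetime.datetime(2024, 10, 11, 17, 5, 30)
example_name = make_versioned_name("deep-learning-ai-model", example_moment)
print(example_name)  # deep-learning-ai-model-20241011-170530
```

Because the most significant field comes first, sorting such names alphabetically also sorts them by creation time, which helps with the auditing and rollback goals listed above.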

- `TRAINING_STEPS`: The number of training steps to use when tuning the model. For extractive QA, you can set it between 100 and 500.
- `EVALUATION_INTERVAL`: How frequently the model is evaluated against the *evaluation set* during training, to assess its performance and identify issues. The default is 20, which means the model is evaluated on the evaluation dataset after every 20 training steps.

In [36]:
TRAINING_STEPS = 200
EVALUATION_INTERVAL = 20
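
To make the relationship between the two values concrete, the sketch below lists the training steps at which an evaluation would happen. It assumes evaluation runs at every exact multiple of the interval, up to and including the final step; `evaluation_steps` is a hypothetical helper used only for this illustration, not part of the pipeline.

```python
def evaluation_steps(training_steps, evaluation_interval):
    """List the training steps at which the model would be evaluated."""
    if training_steps < 1 or evaluation_interval < 1:
        raise ValueError("both values must be positive integers")
    # Multiples of the interval that do not exceed the total number of steps.
    return list(range(evaluation_interval, training_steps + 1, evaluation_interval))


steps = evaluation_steps(200, 20)
print(len(steps), steps[0], steps[-1])  # 10 20 200
```

With 200 training steps and an interval of 20, this gives 10 evaluations, the first after step 20 and the last after step 200.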

- Load the project ID and credentials.

In [37]:
from utils import authenticate
credentials, PROJECT_ID = authenticate() 

In [38]:
REGION = "us-central1"

- Define the arguments, the input that goes into the pipeline.

In [39]:
pipeline_arguments = {
    "model_display_name": MODEL_NAME,
    "location": REGION,
    # Base model to tune, instead of starting from scratch. "text-bison@001"
    # is pre-configured with certain capabilities, making it a convenient
    # starting point for text processing or generation tasks.
    "large_model_reference": "text-bison@001",
    "project": PROJECT_ID,
    "train_steps": TRAINING_STEPS,
    "dataset_uri": TRAINING_DATA_URI,
    "evaluation_interval": EVALUATION_INTERVAL,
    "evaluation_data_uri": EVAUATION_DATA_URI,
}
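
Because the pipeline is submitted remotely, a typo in the arguments may only surface after the job has started. The following sketch shows one way to check the dictionary locally first. `check_pipeline_arguments` is a hypothetical helper, the list of expected keys is simply copied from the dictionary above (not taken from the pipeline's own specification), and the demo values are placeholders.

```python
EXPECTED_KEYS = [
    "model_display_name",
    "location",
    "large_model_reference",
    "project",
    "train_steps",
    "dataset_uri",
    "evaluation_interval",
    "evaluation_data_uri",
]


def check_pipeline_arguments(arguments):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in EXPECTED_KEYS:
        if key not in arguments or arguments[key] in ("", None):
            problems.append(f"missing or empty: {key}")
    train_steps = arguments.get("train_steps")
    interval = arguments.get("evaluation_interval")
    if isinstance(train_steps, int) and isinstance(interval, int):
        if interval > train_steps:
            problems.append("evaluation_interval is larger than train_steps")
    return problems


# Placeholder values for the demonstration only.
demo_arguments = {
    "model_display_name": "demo-model",
    "location": "us-central1",
    "large_model_reference": "text-bison@001",
    "project": "my-project-id",
    "train_steps": 200,
    "dataset_uri": "./train.jsonl",
    "evaluation_interval": 20,
    "evaluation_data_uri": "./eval.jsonl",
}
demo_problems = check_pipeline_arguments(demo_arguments)
print(demo_problems)  # []
```

In the notebook, calling `check_pipeline_arguments(pipeline_arguments)` before creating the job would flag a missing key or an interval larger than the number of training steps.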

In [42]:
from google.cloud.aiplatform.pipeline_jobs import PipelineJob

# Root directory for storing temporary pipeline files
pipeline_root = "./"

job = PipelineJob(
    template_path=template_path,  # the reusable pipeline definition
    display_name=f"deep_learning_ai_pipeline-{date}",
    parameter_values=pipeline_arguments,  # the inputs defined above
    location=REGION,
    pipeline_root=pipeline_root,
    enable_caching=True,
)

job.submit()
job.state


- A successful execution of the job displays output like this:

<div style="text-align: center;">
    <img src="./images/job_success_message.png" width="511" height="211"/>
</div>

- This is what the pipeline graph looks like:

<div style="text-align: center;">
    <img src="./images/peft_pipeline_1.png" width="511" height="211"/>
</div>