## Finetuning the Gemma LLM by Google for the LAiSER Research Work
"""
Notebook Description:
-------------------
Fine-tuning a Language Model for extracting skill keywords

Ownership:
----------
Project: Leveraging Artificial intelligence for Skills Extraction and Research (LAiSER)
Owner:  George Washington University Institute of Public Policy
        Program on Skills, Credentials and Workforce Policy
        Media and Public Affairs Building
        805 21st Street NW
        Washington, DC 20052
        PSCWP@gwu.edu
        https://gwipp.gwu.edu/program-skills-credentials-workforce-policy-pscwp

License:
--------
Copyright 2024 George Washington University Institute of Public Policy

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the “Software”), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Input Requirements:
-------------------
- Taxonomy dataset
- Job/Course description data

Output/Return Format:
----------------------------
- List of Skill Keywords

"""
"""
Revision History:
-----------------
Rev No.     Date            Author              Description
[1.0.0]     06/18/2024      Satya Phanindra K.  Setup and run Gemma-2b-it
[1.0.3]     06/20/2024      Satya Phanindra K.  Fine-tune the model and push to HuggingFace
[1.0.4]     06/21/2024      Satya Phanindra K.  Import and use the fine-tuned model

TODO:
-----
- 1: huggingface import should use GPU for execution
- 2: Run the model against 10-15 job descriptions

"""

## Prerequisites

Before delving into the fine-tuning process, ensure that you have the following prerequisites in place:

1. **GPU**: [gemma-2b](https://huggingface.co/google/gemma-2b) - can be finetuned on T4(free google colab) while [gemma-7b](https://huggingface.co/google/gemma-7b) requires an A100 GPU.
2. **Python Packages**: Ensure that you have the necessary Python packages installed. You can use the following commands to install them:

Let's begin by checking if your GPU is correctly detected:

In [None]:
!nvidia-smi

Fri Jun 21 14:51:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Step 2 - Model loading
We'll load the model using QLoRA quantization to reduce the usage of memory


In [1]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Now we specify the model ID and then we load it with our previously defined quantization configuration.Now we specify the model ID and then we load it with our previously defined quantization configuration.

In [None]:
# if you are using google colab

import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

In [None]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
# model_id = "google/gemma-7b-it"
# model_id = "google/gemma-7b"
model_id = "google/gemma-2b-it"
# model_id = "google/gemma-2b"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)



config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]



model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/34.2k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

In [6]:
def get_completion(query: str, model, tokenizer) -> str:
  device = "cuda:0"

  prompt_template = """
  <start_of_turn>user
  Name all the skills present in the following description in a single list. Response should be in English and have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
  Below text is the Description:

  {query}
  <end_of_turn>\n<start_of_turn>model
"""
  prompt = prompt_template.format(query=query)

  encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

  model_inputs = encodeds.to(device)


  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
  # decoded = tokenizer.batch_decode(generated_ids)
  decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
  return (decoded)

In [None]:
query_text = '''SANCORP is seeking FTE Level II Data Scientist to support the office of DoD Chief Digital and Artificial Intelligence Office (CDAO) Chief Technology Officer (CTO). CDAO CTO requires support in multiple functional areas to ensure deliverables associated with the CDAO Architecture Council, CTO Federation, and CTO Future Architecture Activities. The mission of the CDAO CTO is to accelerate the DoD's adoption of data, analytics, and AI to improve decision making across all levels of the department. The following are examples of responsibilities:
          Support development of insider threat strategy in support of protecting CDAO technical offerings; balance short-term wins with long-term investments to progressively mature CDAO’s defenses against insider threats.
          Lead coordination of policy and strategy related to insider threats with industry partners and other DoD components.
          Lead exploration of data sources that are relevant to measuring, identifying, and defending against insider threats.
          Provide technical leadership in developing capabilities to detect insider threats among large user communities, leveraging combination of statistical, classical machine learning, and deep learning methods.
          Sancorp Consulting LLC shall, in its discretion, modify or adjust the position to meet Sancorp’s changing needs. This job description is not a contract and may be adjusted as deemed appropriate at Sancorp’s sole discretion.
          Sancorp Consulting, LLC, is an SDVOSB and SBA 8(a) company seeking highly motivated and qualified professionals and offer an attractive salary and benefits package that includes: Medical, Dental, life and Disability Insurance; 401K, and holidays to ensure the highest quality of life for our employees. Please visit our website for more information at www.sancorpconsulting.com.
          Sancorp Consulting, LLC is an equal opportunity employer. At Sancorp Consulting, LLC we are committed to providing equal employment opportunities (EEO) to all employees and applicants without regard to race color, religion, sex, national origin, age, disability, or any other protected characteristic as defined by applicable law. We strive to create an inclusive and diverse workplace where everyone feels valued, respected, and supported."""
          '''
result = get_completion(query=query_text, model=model, tokenizer=tokenizer)
print(result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.



  user
  Name all the skills present in the following description in a single list. Response should have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
  Below text is the Description:

  SANCORP is seeking FTE Level II Data Scientist to support the office of DoD Chief Digital and Artificial Intelligence Office (CDAO) Chief Technology Officer (CTO). CDAO CTO requires support in multiple functional areas to ensure deliverables associated with the CDAO Architecture Council, CTO Federation, and CTO Future Architecture Activities. The mission of the CDAO CTO is to accelerate the DoD's adoption of data, analytics, and AI to improve decision making across all levels of the department. The following are examples of responsibilities:
          Support development of insider threat strategy in support of protecting CDAO technical offerings; balance short-term wins with long-term investments to progressively mature CDAO’s defenses aga

## Step 3 - Load dataset for finetuning

### Lets Load the Dataset

For this tutorial, we will fine-tune Gemma-2B-IT Instruct for code generation.

The dataset structure should resemble the following:

```json
{
  ...keys,
  "RSD Name/Skill Tag": "Skill 1", "Skill 2", "Skill 3", "Skill 4", ...
  "Skill Statement/Task": "Task 1", "Task 2", "Task 3", "Task 4", ...
}
```

In [None]:
from datasets import load_dataset
# Combined OSN taxanomy dataset consists of comp, ind, and pr occupations
dataset = load_dataset("Phanindra-max/osn_combined", split="train")
dataset

Downloading data:   0%|          | 0.00/518k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['Unnamed: 0', 'Canonical URL', 'RSD Name', 'Author', 'Skill Statement', 'Category', 'Keywords', 'Standards', 'Certifications', 'Occupation Major Groups', 'Occupation Minor Groups', 'Broad Occupations', 'Detailed Occupations', 'O*Net Job Codes', 'Employers', 'Alignment Name', 'Alignment URL', 'Alignment Framework'],
    num_rows: 932
})

In [None]:
# df = dataset.to_pandas()
# df.head(10)

Instruction Fintuning - Prepare the dataset under the format of "prompt" so the model can better understand :
1. the function generate_prompt : take the instruction and output and generate a prompt
2. shuffle the dataset
3. tokenizer the dataset

### Formatting the Dataset

Now, let's format the dataset in the required [gemma instruction formate](https://huggingface.co/google/gemma-7b-it).

> Many tutorials and blogs skip over this part, but I feel this is a really important step.

```
<start_of_turn>user What is your favorite condiment? <end_of_turn>
<start_of_turn>model Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!<end_of_turn>
```

You can use the following code to process your dataset and create a JSONL file in the correct format:

In [None]:
def generate_prompt(data_point):
    """Gen. input text based on a prompt, task instruction, (context info.), and answer

    :param data_point: dict: Data point
    :return: dict: tokenzed prompt
    """
    prefix_text = 'Name all the skills present in the following description in a single list. Response should have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words. Below text is the Description:\n\n'
    # Samples with additional context into.
    text = f"""<start_of_turn>user {prefix_text} {data_point["Skill Statement"]} <end_of_turn>\n<start_of_turn>model {data_point["RSD Name"]} <end_of_turn>"""
    return text

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)

We'll need to tokenize our data so the model can understand.


In [None]:
dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

Map:   0%|          | 0/932 [00:00<?, ? examples/s]

Split dataset into 90% for training and 10% for testing

In [None]:
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]

In [None]:
print(test_data)

Dataset({
    features: ['Unnamed: 0', 'Canonical URL', 'RSD Name', 'Author', 'Skill Statement', 'Category', 'Keywords', 'Standards', 'Certifications', 'Occupation Major Groups', 'Occupation Minor Groups', 'Broad Occupations', 'Detailed Occupations', 'O*Net Job Codes', 'Employers', 'Alignment Name', 'Alignment URL', 'Alignment Framework', 'prompt', 'input_ids', 'attention_mask'],
    num_rows: 187
})


## Step 4 - Apply Lora  
Here comes the magic with peft! Let's load a PeftModel and specify that we are going to use low-rank adapters (LoRA) using get_peft_model utility function and  the prepare_model_for_kbit_training method from PEFT.

In [None]:
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
     

In [None]:
import bitsandbytes as bnb
def find_all_linear_names(model):
  cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
  lora_module_names = set()
  for name, module in model.named_modules():
    if isinstance(module, cls):
      names = name.split('.')
      lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names: # needed for 16-bit
      lora_module_names.remove('lm_head')
  return list(lora_module_names)

In [None]:
modules = find_all_linear_names(model)
print(modules)

['up_proj', 'v_proj', 'o_proj', 'k_proj', 'gate_proj', 'down_proj', 'q_proj']


In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

In [None]:
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")

Trainable: 78446592 | total: 2584619008 | Percentage: 3.0351%


## Step 5 - Run the training!

Setting the training arguments:
* for the reason of demo, we just ran it for few steps (100) just to showcase how to use this integration with existing tools on the HF ecosystem.

### Fine-Tuning with qLora and Supervised Fine-Tuning

We're ready to fine-tune our model using qLora. For this tutorial, we'll use the `SFTTrainer` from the `trl` library for supervised fine-tuning. Ensure that you've installed the `trl` library as mentioned in the prerequisites.

In [None]:
#new code using SFTTrainer
import transformers
from trl import SFTTrainer
# from transformers.generation_utils import top_k_top_p_filtering # Import the function from its new location

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=0.03,
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        save_strategy="epoch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)



Map:   0%|          | 0/745 [00:00<?, ? examples/s]

Map:   0%|          | 0/187 [00:00<?, ? examples/s]



## Lets start training

In [None]:
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()



Step,Training Loss
1,7.9587
2,7.8485
3,5.5367
4,4.1551
5,3.2014
6,2.7259
7,2.27
8,2.0823
9,1.762
10,1.4818


KeyboardInterrupt: 

 Share adapters on the 🤗 Hub

In [None]:
new_model = "100epoch-gemma-Code-Finetune-test" #Name of the model you will be pushing to huggingface model hub

In [None]:
trainer.model.save_pretrained(new_model)

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model= PeftModel.from_pretrained(base_model, new_model)
merged_model= merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Phanindra-max/100epoch-gemma-Code-Finetune-test/commit/0355471197c5f50403406bb6ecc427c4aec64499', commit_message='Upload tokenizer', commit_description='', oid='0355471197c5f50403406bb6ecc427c4aec64499', pr_url=None, pr_revision=None, pr_num=None)

## Test out Finetuned Model

In [4]:
text = '''As a Research Engineer at Whissle LLC, you will play a pivotal role in bringing our cutting-edge research to life. Your work will involve implementing and experimenting with the latest research techniques, and developing tools and infrastructure that streamline the transition of research into viable products.


Key Responsibilities:

    Experiment with and adopt the latest research techniques in AI and machine learning, open-source and our own research.
    Develop, maintain, and enhance benchmarks for evaluating AI performance.
    Implement and continuously refine our agent architecture to improve functionality and efficiency.
    Contribute to the creation of a seamless experimental framework, supporting the integration of research findings into product development.


Qualifications:

    Solid background in software engineering and ML development, with expertise in optimized deployment and inference of multi-modal AI solutions (video, audio, speech LLMs).
    Proficiency in Python and C++, with experience in machine learning frameworks and dependencies (e.g., torch, tensorrt, cuda).
    Experience with creating dev tools, which include hosting on-demand micro-services.
    Strong skills in version control, code reviews, and CI/CD practices.
    Familiarity with automated integration and deployment environments.
    Exceptional problem-solving abilities and innovative thinking.
    Excellent teamwork and communication skills.


Preferred Experience:

    Previous role as a research engineer at leading general AI companies like Netflix, Google AI, Anthropic, or OpenAI or in a leading AI research organization.
    Extensive exposure to cutting-edge AI research, especially in multi-modal AI technologies.
    Track record of contributing to AI advancements through publications, patents, or open-source projects.
'''

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Result from original model:
 
  user
  Name all the skills present in the following description in a single list. Response should have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
  Below text is the Description:

  As a Research Engineer at Whissle LLC, you will play a pivotal role in bringing our cutting-edge research to life. Your work will involve implementing and experimenting with the latest research techniques, and developing tools and infrastructure that streamline the transition of research into viable products.


Key Responsibilities:

    Experiment with and adopt the latest research techniques in AI and machine learning, open-source and our own research.
    Develop, maintain, and enhance benchmarks for evaluating AI performance.
    Implement and continuously refine our agent architecture to improve functionality and efficiency.
    Contribute to the creation of a seamless experimental framework, supporting the

In [None]:
# query the fine-tuned model to compare the outputs
result = get_completion(query=text, model=merged_model, tokenizer=tokenizer)
print("Result from fine-tuned and merged model:\n", result)


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Result from fine-tuned and merged model:
 
  user
  Name all the skills present in the following description in a single list. Response should have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
  Below text is the Description:

  As a Research Engineer at Whissle LLC, you will play a pivotal role in bringing our cutting-edge research to life. Your work will involve implementing and experimenting with the latest research techniques, and developing tools and infrastructure that streamline the transition of research into viable products.


Key Responsibilities:

    Experiment with and adopt the latest research techniques in AI and machine learning, open-source and our own research.
    Develop, maintain, and enhance benchmarks for evaluating AI performance.
    Implement and continuously refine our agent architecture to improve functionality and efficiency.
    Contribute to the creation of a seamless experimental framework, s

In [None]:
# clear the GPU cache to prevent hardware bottlneck
torch.cuda.empty_cache()
result = get_completion(query=text, model=model, tokenizer=tokenizer)
print("Result from original model:\n", result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Result from original model:
 
  user
  Name all the skills present in the following description in a single list. Response should have only the skills, no other information or words. Skills should be keywords, each being no more than 3 words.
  Below text is the Description:

  As a Research Engineer at Whissle LLC, you will play a pivotal role in bringing our cutting-edge research to life. Your work will involve implementing and experimenting with the latest research techniques, and developing tools and infrastructure that streamline the transition of research into viable products.


Key Responsibilities:

    Experiment with and adopt the latest research techniques in AI and machine learning, open-source and our own research.
    Develop, maintain, and enhance benchmarks for evaluating AI performance.
    Implement and continuously refine our agent architecture to improve functionality and efficiency.
    Contribute to the creation of a seamless experimental framework, supporting the

From above, it can be observed that the original model performs better than the fine-tuned version.
 - Could be because of me stopping the training function at 50 epochs.
 - Irrespective of the model being fine-tuned or not, the results from 2B model are clearly bad when compared to the output of the 7B base model.
 - Non-english words, Emojis and Any other special characters are NOT expected in the output. TODO: need to experiment trying a couple variations of the query instruct. 7B base model never gave Non-english words or Speacial charcters.
 - The response shouldn't be huge either. Ideally 5-10 skills seems nice to me. TODO: try a few experiments by changing the query to have a limit on the no.of extracted skill keywords.

## Usage of the LLM

- With the fine-tuned model saved on HuggingFace, I tried running a simple query.
- One immediate problem I see is that the model is only RAM, no GPU usage recorded.


In [1]:
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": '''As a Research Engineer at Whissle LLC, you will play a pivotal role in bringing our cutting-edge research to life. Your work will involve implementing and experimenting with the latest research techniques, and developing tools and infrastructure that streamline the transition of research into viable products.


Key Responsibilities:

    Experiment with and adopt the latest research techniques in AI and machine learning, open-source and our own research.
    Develop, maintain, and enhance benchmarks for evaluating AI performance.
    Implement and continuously refine our agent architecture to improve functionality and efficiency.
    Contribute to the creation of a seamless experimental framework, supporting the integration of research findings into product development.


Qualifications:

    Solid background in software engineering and ML development, with expertise in optimized deployment and inference of multi-modal AI solutions (video, audio, speech LLMs).
    Proficiency in Python and C++, with experience in machine learning frameworks and dependencies (e.g., torch, tensorrt, cuda).
    Experience with creating dev tools, which include hosting on-demand micro-services.
    Strong skills in version control, code reviews, and CI/CD practices.
    Familiarity with automated integration and deployment environments.
    Exceptional problem-solving abilities and innovative thinking.
    Excellent teamwork and communication skills.


Preferred Experience:

    Previous role as a research engineer at leading general AI companies like Netflix, Google AI, Anthropic, or OpenAI or in a leading AI research organization.
    Extensive exposure to cutting-edge AI research, especially in multi-modal AI technologies.
    Track record of contributing to AI advancements through publications, patents, or open-source projects.
'''},
]
pipe = pipeline("text-generation", model="Phanindra-max/100epoch-gemma-Code-Finetune-test", max_new_tokens=400)
pipe(messages)

KeyboardInterrupt: 

In [10]:
!pip install transformers -U

Collecting transformers
  Downloading transformers-4.41.2-py3-none-any.whl (9.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.1/9.1 MB[0m [31m26.5 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/3.6 MB[0m [31m57.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.15.2
    Uninstalling tokenizers-0.15.2:
      Successfully uninstalled tokenizers-0.15.2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.39.1
    Uninstalling transformers-4.39.1:
      Successfully uninstalled transformers-4.39.1
Successfully installed tokenizers-0.19.1 transformers-4.41.2


In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Phanindra-max/100epoch-gemma-Code-Finetune-test")
model = AutoModelForCausalLM.from_pretrained("Phanindra-max/100epoch-gemma-Code-Finetune-test")

# Move model to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
result = get_completion(query=text, model=model, tokenizer=tokenizer)
print(result)