## Setup & EDA

In [8]:
!yarn add @langchain/community


yarn add v1.22.21
info No lockfile found.
[1/4] 🔍  Resolving packages...
[2/4] 🚚  Fetching packages...
error @langchain/core@0.1.62: The engine "node" is incompatible with this module. Expected version ">=18". Got "16.15.1"
error Found incompatible module.
info Visit https://yarnpkg.com/en/docs/cli/add

In [1]:
!pip3 install --upgrade pip



First, we need to install the required packages.

In [None]:
!pip3 install --upgrade 'transformers[torch]' accelerate



**Restart kernel after installation**

Next, we install datasets and import the libraries we need.

In [None]:
!pip3 install datasets

In [None]:
import json
from pathlib import Path
from datasets import Dataset, DatasetDict, load_dataset
import pandas as pd

We reformatted the dataset so that it can be loaded with the standard load_dataset function.

In [None]:
data = load_dataset("json", data_files="train1.json", field="data")
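For reference, field="data" tells the JSON loader to read the records from a top-level "data" key. Below is a minimal sketch of a file with that shape. The record layout is inferred from the translation, pli and ktl keys used later in this notebook; the file name train1_example.json and the sample strings are placeholders, not the real data.

```python
import json
from pathlib import Path

# Hypothetical records; the real train1.json is not shown in this notebook.
records = [
    {"translation": {"pli": "source snippet 1", "ktl": "target snippet 1"}},
    {"translation": {"pli": "source snippet 2", "ktl": "target snippet 2"}},
]

# load_dataset("json", ..., field="data") reads the list stored under "data".
example_path = Path("train1_example.json")
example_path.write_text(json.dumps({"data": records}, indent=2))

# Read the file back with the standard library to confirm the layout.
loaded = json.loads(example_path.read_text())
print(sorted(loaded["data"][0]["translation"].keys()))
```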

Let's display the dataset to check it.

In [None]:
data

Great, let's split it up and create training and validation datasets.

In [None]:
split_datasets = data["train"].train_test_split(train_size=0.9, seed=20)
split_datasets["validation"] = split_datasets.pop("test")
split_datasets
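Conceptually, train_test_split(train_size=0.9, seed=20) shuffles the row indices with a seeded generator and cuts them at the 90% mark. The sketch below illustrates that idea in plain Python. It is not the datasets implementation and will not reproduce the same row assignment; split_indices is a hypothetical helper.

```python
import random

def split_indices(n_rows, train_size=0.9, seed=20):
    """Shuffle row indices reproducibly and cut them into two groups."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # the same seed gives the same order
    cut = int(n_rows * train_size)
    return indices[:cut], indices[cut:]

train_idx, val_idx = split_indices(100)
print(len(train_idx), len(val_idx))
```

Fixing the seed is what makes the split repeatable across kernel restarts, so the validation rows stay the same from run to run.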

In [None]:
split_datasets["train"][1]["translation"]

## Tokenize

Import a tokenizer to convert all our inputs and targets

The word "token" means different things in compilers and in LLMs/Transformers. Tokenisation here refers to splitting the input text into tokens and mapping each token to an integer id, which gives a sequence of numbers that the model can use.
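To make the idea concrete, here is a toy sketch that maps text to integer ids. It splits on whitespace and builds its own vocabulary, which is an assumption made purely for illustration; the real tokenizer loaded below uses a learned subword vocabulary and will produce different pieces and ids.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct whitespace-separated piece."""
    vocab = {"<unk>": 0}
    for text in texts:
        for piece in text.split():
            vocab.setdefault(piece, len(vocab))
    return vocab

def encode(text, vocab):
    """Map a string to a list of ids, using <unk> for unseen pieces."""
    return [vocab.get(piece, vocab["<unk>"]) for piece in text.split()]

vocab = build_vocab(["x = x + 1", "y = 2"])
ids = encode("y = x + 3", vocab)
print(ids)  # [5, 2, 1, 3, 0]
```

The trailing 0 is the id for the unseen piece "3", which shows why a fixed vocabulary needs a fallback for unknown input.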

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "meta-llama/CodeLlama-7b-hf"  # Replace this with your desired model
# token=True reuses the credentials stored by notebook_login()
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, token=True, force_download=True)

Note that because we merged the dataset, the same dataset provides both the source and the target.
The most interesting part here is the tokenizer, a Hugging Face object that returns the attention_mask alongside the input ids.

In [None]:
pli_sentence = split_datasets["train"][1]["translation"]["pli"]
ktl_sentence = split_datasets["train"][1]["translation"]["ktl"]

inputs = tokenizer(pli_sentence, return_tensors="pt")
targets = tokenizer(ktl_sentence, return_tensors="pt")
inputs, targets
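With a single sentence, as in the cell above, every entry of the attention_mask is 1. The mask becomes useful once sequences of different lengths are padded into one batch: it marks real tokens with 1 and padding with 0. The sketch below shows this in plain Python. pad_batch is a hypothetical helper, and padding on the right is an arbitrary choice here; the real tokenizer's padding_side setting decides which side is padded.

```python
def pad_batch(sequences, pad_id):
    """Right-pad id sequences to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[5, 2, 1], [7]], pad_id=0)
print(batch["input_ids"])       # [[5, 2, 1], [7, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```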

In [None]:
wrong_targets = tokenizer(ktl_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))

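A quick function to tokenize the input and target sequences and package them in the form the model accepts.

overflowing_tokens and num_truncated_tokens describe what truncation cut off: the tokens beyond max_length and how many of them there were. The tokenizer only returns these keys when asked to, so removing them is a precaution.

The model expects the target token ids under the key "labels".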


In [None]:
max_length = 64

def preprocess_function(examples):
    inputs = [ex["pli"] for ex in examples["translation"]]
    targets = [ex["ktl"] for ex in examples["translation"]]

    # Tokenize inputs and targets separately. Pad both to max_length so that
    # input_ids and labels have the same shape, which the causal LM loss requires.
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True, padding="max_length")
    model_targets = tokenizer(targets, max_length=max_length, truncation=True, padding="max_length")

    # Drop the truncation bookkeeping keys if the tokenizer returned them
    model_inputs.pop("overflowing_tokens", None)
    model_inputs.pop("num_truncated_tokens", None)

    # The model reads the target token ids from the "labels" key
    model_inputs["labels"] = model_targets["input_ids"]

    return model_inputs
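One refinement often applied to labels built this way is to replace padded positions with -100, the index that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the loss. The sketch below shows the idea in plain Python. mask_padding is a hypothetical helper that is not part of this notebook, and it relies on the attention mask rather than the pad id because this notebook reuses the EOS token for padding.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, attention_mask):
    """Replace label ids at padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(label_ids, attention_mask)
    ]

# Token id 2 stands in for EOS: the first 2 is real, the second is padding.
labels = mask_padding([[11, 12, 2, 2]], [[1, 1, 1, 0]])
print(labels)  # [[11, 12, 2, -100]]
```

Masking by attention mask keeps the genuine end-of-sequence token in the loss while still ignoring the padding that shares its id.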

In [None]:
# Padding needs a pad token; reuse EOS in case the tokenizer does not define one
tokenizer.pad_token = tokenizer.eos_token

In [None]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

## Fine-tuning model

In [None]:
!pip3 install --upgrade bitsandbytes

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint, force_download=True)

In [None]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",  # No evaluation during training
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=150,
    predict_with_generate=True,
)
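To get a feel for what num_train_epochs=150 means with a batch size of 16, the number of optimizer updates can be estimated as below. This is a rough sketch that assumes a single device and no gradient accumulation; the figure of 900 training examples is hypothetical, since the dataset size is not shown in this notebook.

```python
import math

def total_optimizer_steps(n_examples, batch_size, epochs):
    """Estimate the number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)  # last batch may be partial
    return batches_per_epoch * epochs

steps = total_optimizer_steps(n_examples=900, batch_size=16, epochs=150)
print(steps)  # 8550
```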

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=None,  # Assuming no evaluation dataset
    data_collator=data_collator,
    tokenizer=tokenizer
)

In [None]:
import os

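# Disable wandb. Run this cell before creating training_args and the trainer,
# since the reporting integrations are resolved when those objects are built.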
os.environ["WANDB_DISABLED"] = "true"

In [None]:
trainer.train()