## Finetuned Llama3 8b for Java Code Generation

This Colab notebook fine-tunes Llama 3 8B, a pretrained language model, on Java programming tasks and then uses the fine-tuned model to generate Java code snippets.


In [None]:
!nvidia-smi # check the details of env before running

# 🔵 **Installation of Dependencies**

<details>

<summary>The first cell of this notebook installs the packages and dependencies required to run the code. (click here to expand)</summary>

It installs:
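- **Unsloth**: A library that speeds up fine-tuning and inference of large language models and reduces their memory use. The `colab-new` extra installed here is the variant intended for Colab environments.
- **Xformers**: Supplies memory-efficient (Flash Attention-style) attention kernels, which speed up the model's attention layers and lower memory use.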
- Other supporting packages required for this task, such as `trl`, `peft`, `accelerate`, and `bitsandbytes`.

The `%%capture` magic command is used to suppress the output of the installation process within the notebook.

<br>
<hr>

### Note:
Ensure that you have adequate resources available in your Colab environment, especially GPU acceleration, to efficiently run the notebook and leverage the power of the language model for Java code generation tasks.

</details>

<details>
  <summary><b>Explanation: Dependencies</b> (click to expand)</summary>
  
### Package Explanations:

#### trl:
- **Description**: Transformer Reinforcement Learning (`trl`) is a Hugging Face library for post-training transformer language models. It provides trainers such as `SFTTrainer` for supervised fine-tuning and `DPOTrainer` for preference optimization.
- **Usefulness**: This notebook uses `SFTTrainer` to fine-tune Llama 3 8B on instruction/response pairs whose responses are Java code, so the knowledge already encoded in the pretrained model is adapted to Java code generation.

#### peft:
- **Description**: Parameter-Efficient Fine-Tuning (`peft`) methods adapt large pretrained models to downstream applications by training only a small number of (extra) parameters instead of all of the model's parameters. The rest of the pretrained network stays frozen, which significantly decreases computational and storage costs.
- **Usefulness**: Fully fine-tuning an 8B-parameter model is impractical on a single Colab GPU such as the Tesla T4. With `peft`, only small LoRA adapter matrices are trained (about 42 million parameters in this notebook, per the training log), which keeps memory use and checkpoint size low while the frozen base weights retain what the model already knows.

#### accelerate:
- **Description**: Accelerate is a Hugging Face library that lets the same PyTorch training code run on a CPU, a single GPU, or multiple devices. It handles device placement, distributed training, and mixed precision.
- **Usefulness**: The Hugging Face trainer classes used in this notebook rely on `accelerate` for device placement and mixed-precision (FP16/BF16) training, so the training cell works unchanged on whichever GPU Colab assigns.

#### bitsandbytes:
- **Description**: Bitsandbytes is a PyTorch extension that provides 8-bit and 4-bit quantization of model weights as well as 8-bit optimizers.
- **Usefulness**: It is central to this notebook: the base model is loaded in 4-bit form (`load_in_4bit = True`, using the `-bnb-4bit` checkpoints), and training uses the `adamw_8bit` optimizer. Both reduce GPU memory use enough to fine-tune an 8B model on a single Colab GPU.

### Note:
Understanding these dependencies and their functionalities can significantly aid in optimizing model performance, accelerating training, and effectively handling data throughout the fine-tuning process.

</details>
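
Before moving on, it can help to confirm which of these packages are actually present in the runtime. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is made up for this example and is not part of any of the packages above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the install cell of this notebook.
required = ["unsloth", "xformers", "trl", "peft", "accelerate", "bitsandbytes"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this after the install cell should list a version for every package; a `not installed` entry suggests the install cell failed or the runtime was restarted.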


In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes


# 🔵 **Loading Pretrained Language Model and Tokenizer**
<details>

<summary>( click here to expand )</summary>

This cell loads a pretrained language model and its associated tokenizer using the `unsloth` library. The model and tokenizer are essential components for generating Java code snippets using the Llama3 8b model.

<br>

<details>
<summary><b>Parameters:</b></summary>

- **max_seq_length**: Defines the maximum sequence length for input tokens. Setting this parameter higher allows the model to process longer sequences but may require more memory.
- **dtype**: Specifies the data type used for model computation. With the default (`None`) it is auto-detected. `torch.float16` is recommended for GPUs such as the Tesla T4 or V100, while `torch.bfloat16` is suitable for Ampere and newer GPUs.
- **load_in_4bit**: Controls whether to use 4-bit quantization to reduce memory usage during model loading. This can be set to `True` or `False` based on memory constraints.

</details>

<br>

<details>
<summary><b>Pre-Quantized Models:</b></summary>

The code includes a list of pre-quantized models available for faster downloading and reduced out-of-memory errors. These models are pre-quantized to 4 bits and include variants such as `mistral-7b`, `llama-2-7b`, `gemma-7b`, and `gemma-2b`, among others. You can find more models at [Hugging Face's model hub](https://huggingface.co/unsloth).

</details>

<br>

<details>
<summary><b>Model Loading:</b></summary>

The `FastLanguageModel.from_pretrained` method is used to load the pretrained language model and tokenizer. It takes several parameters:
- `model_name`: Specifies the name of the pretrained model to be loaded.
- `max_seq_length`: Sets the maximum sequence length for input tokens.
- `dtype`: Specifies the data type for model parameters.
- `load_in_4bit`: Controls whether to load the model with 4-bit quantization.

</details>

<br>

<hr>

### Note:

Ensure that you have a stable internet connection to download the pretrained model and tokenizer. Additionally, verify that your Colab environment has sufficient memory to accommodate the chosen `max_seq_length` and model parameters.

</details>
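
To see why `load_in_4bit` matters on a GPU with roughly 15 GB of memory, a back-of-the-envelope estimate of weight storage helps. The sketch below assumes a round 8 billion parameters and ignores quantization metadata, layers kept in higher precision, activations, and optimizer state, so the numbers are rough lower bounds rather than measurements. The function name is illustrative.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8_000_000_000  # assumption: "8B" rounded to 8 billion parameters

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights alone are about as large as the Tesla T4's entire memory, whereas the 4-bit weights need roughly a quarter of that, leaving room for the LoRA adapters and activations.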

In [None]:
from unsloth import FastLanguageModel
import torch

# Define parameters
max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

# List of 4bit pre-quantized models
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",  # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",  # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit",   # [NEW] 15 Trillion token Llama-3
]  # More models at https://huggingface.co/unsloth

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token="hf_...",  # Use one if using gated models like meta-llama/Llama-2-7b-hf
)


config.json:   0%|          | 0.00/1.14k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/131 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/449 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# 🔵 **Enhancing Model with Parameter Efficient Fine-Tuning (PEFT) Technique**
<details>
<summary>( click here to expand )</summary>

This cell wraps the pretrained language model (`model`) with LoRA adapters, a Parameter-Efficient Fine-Tuning (PEFT) technique. Instead of updating all of the model's weights, LoRA freezes them and trains small low-rank matrices attached to selected layers, which makes fine-tuning for a specific task, such as Java code generation, much cheaper in memory and compute.

<br>

<details>
<summary><b>PEFT Parameters:</b></summary>

- **r**: The LoRA rank, i.e. the inner dimension of the low-rank adapter matrices. It must be greater than 0, and a larger rank adds more trainable parameters. Common values include 8, 16, 32, 64, or 128.

- **target_modules**: Lists the modules in the model architecture that receive LoRA adapters. Here these are the attention and MLP projections: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`.

- **lora_alpha**: The scaling factor of the LoRA (Low-Rank Adaptation) update. The adapter output is scaled by `lora_alpha / r`, so with `lora_alpha=16` and `r=16` the scale is 1.

- **lora_dropout**: Specifies the dropout rate for LoRA. A value of 0 indicates no dropout, while other values are supported but less optimized.

- **bias**: Selects which bias terms are trained. The value `"none"` trains no biases and is the optimized setting.

- **use_gradient_checkpointing**: Controls the usage of gradient checkpointing for memory efficiency during fine-tuning. Set to `"unsloth"` for very long contexts; `True` is also accepted.

- **random_state**: Sets the random seed for reproducibility during fine-tuning.

- **use_rslora**: Indicates whether to utilize rank-stabilized LoRA. Currently set to `False`.

- **loftq_config**: LoftQ configuration. Currently set to `None`.

</details>

<br>
<hr>

### Note:

With suitably chosen LoRA parameters, the model can be adapted to Java code generation at a fraction of the memory and compute cost of full fine-tuning. Adjust the parameters according to the specific requirements and constraints of your project.

</details>
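
To make `r` and `target_modules` concrete, the number of trainable parameters that LoRA adds can be worked out by hand: an adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below assumes the published Llama 3 8B shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); these figures are not stated in this notebook, so treat them as assumptions.

```python
import math

# Assumed Llama 3 8B shapes (not stated in this notebook).
HIDDEN = 4096      # model (hidden) dimension
KV_DIM = 1024      # output width of k_proj / v_proj (grouped-query attention)
MLP_DIM = 14336    # intermediate dimension of the MLP
NUM_LAYERS = 32

# (d_in, d_out) of each linear layer that receives a LoRA adapter.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapter holds an (rank x d_in) and a (d_out x rank) matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


r, lora_alpha = 16, 16
print(f"trainable LoRA parameters: {lora_param_count(r):,}")
print(f"standard LoRA scale (alpha / r): {lora_alpha / r}")
print(f"rank-stabilized scale (alpha / sqrt(r)): {lora_alpha / math.sqrt(r)}")
```

With `r=16` this arithmetic gives 41,943,040, which agrees with the "Number of trainable parameters" line in the training log later in the notebook. The last line shows the scale that `use_rslora=True` would apply instead of the standard one.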

In [None]:
# Enhance model using Parameter Efficient Fine-Tuning (PEFT) technique
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # Use gradient checkpointing for memory efficiency
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None  # And LoftQ
)


Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<details>
<summary><b>🔴 Code to load a dataset from Hugging Face (click to expand)</b> OR press "Ctrl" + "M" + "J" </summary>

# alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

# ### Instruction:
# {}

# ### Input:
# {}

# ### Response:
# {}"""

# EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
# def formatting_prompts_func(examples):
#     instructions = examples["instruction"]
#     inputs       = examples["input"]
#     outputs      = examples["output"]
#     texts = []
#     for instruction, input, output in zip(instructions, inputs, outputs):
#         # Must add EOS_TOKEN, otherwise your generation will go on forever!
#         text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
#         texts.append(text)
#     return { "text" : texts, }
# pass

# from datasets import load_dataset
# dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
# dataset = dataset.map(formatting_prompts_func, batched = True,)

</details>


## 🟢 JSON Dataset Structure Explanation

<details>

<summary> The dataset contains multiple JSON objects, each representing a programming task or problem statement along with its solution in Java code. (click here to expand) </summary>

### JSON Object Structure

Each JSON object consists of the following fields:

- **`instruction`**: A description of the task or problem statement.
- **`input`**: Optional input data for the task (left as an empty string in this dataset).
- **`output`**: The solution or implementation in Java code.

### Example JSON Object

```json
[
{
  "instruction": "Write a Java class for a BankAccount...",
  "input": "",
  "output": "Java code representing a BankAccount class..."
},
{
  "instruction": "Write a Java class for a BankAccount...",
  "input": "",
  "output": "Java code representing a BankAccount class..."
}
]
```

</details>
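
Because the formatting cell indexes every record by these three keys, a malformed entry would raise a `KeyError` partway through. A small pre-check along the following lines can catch that earlier. This is a sketch under the schema described above; `find_invalid_records` is a hypothetical helper, not part of the notebook or of any library.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def find_invalid_records(records):
    """Return (index, reason) pairs for records that do not match the schema."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((index, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((index, f"missing keys: {missing}"))
        elif not all(isinstance(record[key], str) for key in REQUIRED_KEYS):
            problems.append((index, "all three fields must be strings"))
    return problems


sample = [
    {"instruction": "Write a Java class for a BankAccount...", "input": "", "output": "..."},
    {"instruction": "Missing the output field", "input": ""},
]
print(find_invalid_records(sample))  # [(1, "missing keys: ['output']")]
```

In the notebook, the same function could be called on the list returned by `json.load` before any prompts are formatted.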

# 🔵 **Data Preparation: Formatting Alpaca Prompts**
<details>
<summary> This cell builds the training prompts in the Alpaca format. It imports the necessary libraries, defines the prompt template, and loads the dataset from a JSON file. (click here to expand)</summary>

<br>

# 1. Importing Required Libraries:
<details>
<summary><b>click here to expand</b></summary>

- The cell begins by importing the necessary libraries, including `json` for handling JSON files and `Dataset` from the `datasets` library for managing datasets efficiently.

</details>

# 2. Defining the Alpaca Prompt Template:
<details>
<summary><b>click here to expand</b></summary>

- The Alpaca prompt template is defined as a multi-line string (`alpaca_prompt`). It includes placeholders for the instruction, input, and response parts of each prompt.

</details>

# 3. Defining the End-of-Sequence Token:
<details>
<summary><b>click here to expand</b></summary>

- An end-of-sequence token (`EOS_TOKEN`) is defined using the `tokenizer.eos_token` attribute. It is appended to every training example so the model learns where a response ends; without it, generation would not know when to stop.

</details>

# 4. Loading the Dataset:
<details>
<summary><b>click here to expand</b></summary>

- The dataset is loaded from a JSON file located at `/content/your_dataset.json`. Make sure to update the file path according to your dataset's location. The JSON file contains a list of dictionaries, where each dictionary represents an example with keys for instruction, input, and output.

</details>

# 5. Formatting the Prompts:
<details>
<summary><b>click here to expand</b></summary>

- The prompts are formatted by iterating over each example in the dataset list. The instruction, input, and output texts are extracted from each example, and the `alpaca_prompt` template is filled with these values. The end-of-sequence token is appended to each prompt, and the formatted prompt is added to the `texts` list.

</details>

# 6. Creating a Hugging Face Dataset:
<details>
<summary><b>click here to expand</b></summary>

- Finally, a Hugging Face dataset is created from the `texts` list containing the formatted prompts. This dataset is now ready to be used for training or further processing.

</details>

<br>

<hr>

# Note:
- Ensure that the dataset JSON file (`your_dataset.json`) is properly formatted and contains the required fields (instruction, input, output) for each example.
- Additionally, make sure the model-loading cell above has been run, since it defines the `tokenizer` variable used to obtain the end-of-sequence token.
</details>

In [None]:
# Importing required libraries
import json
from datasets import Dataset

# Define the Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Define the end-of-sequence token
EOS_TOKEN = tokenizer.eos_token  # 'tokenizer' is created in the model-loading cell above

# Load the dataset from the JSON file in Google Colab
dataset_path = "/content/your_dataset.json"  # Update with your file path
with open(dataset_path, "r", encoding="utf-8") as f:
    dataset_list = json.load(f)

# Format the prompts
texts = []
for example in dataset_list:
    instruction = example["instruction"]
    input_text = example["input"]
    output_text = example["output"]
    text = alpaca_prompt.format(instruction, input_text, output_text) + EOS_TOKEN
    texts.append(text)

# Create a Hugging Face dataset
dataset = Dataset.from_dict({"text": texts})

# Now 'dataset' contains the formatted prompts ready to be used


# 🔵 **Training and Testing**

<a name=" Train"></a>
### **Training with Huggingface TRL's SFTTrainer**
This code uses Hugging Face's Transformer Reinforcement Learning (TRL) library to train the model with `SFTTrainer`.

The run below is capped at 60 training steps (`max_steps = 60`). For a full pass over the dataset, set `num_train_epochs=1` and remove `max_steps`, since `max_steps` overrides `num_train_epochs` when both are given.
<br>

Unsloth also supports TRL's `DPOTrainer`.
<br>

More documentation on TRL's SFTTrainer can be found [here](https://huggingface.co/docs/trl/sft_trainer).
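
The batch-size settings in the training cell interact with the dataset size in a way that is easy to misread, so here is the arithmetic as a short sketch. The values come from the training cell and from the `Num examples = 9` line of the training log. The rounding used for updates per epoch is an assumption about the trainer's internals that happens to reproduce the log; it may differ between library versions.

```python
import math

num_examples = 9                  # from the training log
per_device_train_batch_size = 2   # from TrainingArguments
gradient_accumulation_steps = 4   # from TrainingArguments
max_steps = 60                    # from TrainingArguments

# Examples that contribute to a single optimizer update on one GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Batches the dataloader yields per pass over the data.
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)

# Assumption: incomplete accumulation windows are floored, with at least
# one optimizer update per epoch.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

epochs_needed = math.ceil(max_steps / updates_per_epoch)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer updates per epoch: {updates_per_epoch}")
print(f"epochs needed for {max_steps} steps: {epochs_needed}")
```

With only 9 examples, 60 steps means the model sees every example many times, which is consistent with the rapidly falling training loss in the log and is worth keeping in mind when judging the results.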


In [None]:
# @title
from trl import SFTTrainer
from transformers import TrainingArguments

# Initialize the trainer with SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,  # Number of processes to use while preparing the dataset
    packing = False,  # Whether to use packing to make training faster for short sequences
    args = TrainingArguments(
        per_device_train_batch_size = 2,  # Batch size per GPU/CPU for training
        gradient_accumulation_steps = 4,  # Number of batches whose gradients are accumulated before each optimizer update
        warmup_steps = 5,  # Number of steps for the warmup phase
        max_steps = 60,  # Maximum number of training steps
        learning_rate = 2e-4,  # Initial learning rate for AdamW optimizer
        fp16 = not torch.cuda.is_bf16_supported(),  # Whether to use FP16 (mixed precision) training
        bf16 = torch.cuda.is_bf16_supported(),  # Whether to use BF16 training
        logging_steps = 1,  # Log every X updates steps
        optim = "adamw_8bit",  # Optimizer type
        weight_decay = 0.01,  # Weight decay rate for AdamW optimizer
        lr_scheduler_type = "linear",  # Learning rate scheduler type
        seed = 3407,  # Random seed for reproducibility
        output_dir = "outputs",  # Directory to save the output files
    ),
)




Map (num_proc=2):   0%|          | 0/9 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.594 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 9 | Num Epochs = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,0.8234
2,0.852
3,0.8068
4,0.7173
5,0.6738
6,0.5983
7,0.4622
8,0.4014
9,0.2815
10,0.2639


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

719.2952 seconds used for training.
11.99 minutes used for training.
Peak reserved memory = 7.756 GB.
Peak reserved memory for training = 2.162 GB.
Peak reserved memory % of max memory = 52.59 %.
Peak reserved memory for training % of max memory = 14.66 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Design a Java class for a BankCustomer that demonstrates encapsulation by encapsulating attributes like name and account number, inheritance by having subclasses like SavingAccountCustomer and LoanAccountCustomer, and association by associating customers with accounts.", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nDesign a Java class for a BankCustomer that demonstrates encapsulation by encapsulating attributes like name and account number, inheritance by having subclasses like SavingAccountCustomer and LoanAccountCustomer, and association by associating customers with accounts.\n\n### Input:\n\n\n### Response:\n// Encapsulated attributes\nprivate String name;\nprivate String accountNumber;\n\n// Association with accounts (can be extended to further subclasses)\nprivate Account account;\n\n// Constructor with arguments for name and account number\npublic BankCustomer(String name, String accountNumber) {\n  this.name = name;\n  this.accountNumber = account']
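
The decoded string above contains the whole prompt as well as the completion. A minimal sketch for pulling out only the generated Java code follows; it relies on the `### Response:` marker from the Alpaca template. The `<|end_of_text|>` token name is an assumption about the tokenizer's end-of-sequence token, and `extract_response` is a hypothetical helper.

```python
RESPONSE_MARKER = "### Response:\n"
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|end_of_text|>")  # second one is assumed


def extract_response(decoded_text):
    """Return the text after the last response marker, without special tokens."""
    _, found, response = decoded_text.rpartition(RESPONSE_MARKER)
    if not found:
        return decoded_text  # marker absent: return the input unchanged
    for token in SPECIAL_TOKENS:
        response = response.replace(token, "")
    return response.strip()


decoded = (
    "<|begin_of_text|>Below is an instruction...\n\n"
    "### Instruction:\nDesign a Java class...\n\n"
    "### Input:\n\n\n"
    "### Response:\nprivate String name;<|end_of_text|>"
)
print(extract_response(decoded))  # private String name;
```

In the notebook this would be applied to each element of `tokenizer.batch_decode(outputs)`. Passing `skip_special_tokens=True` to `batch_decode` is another way to drop the special tokens.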

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output.

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Design a Java class for a BankCustomer that demonstrates encapsulation by encapsulating attributes like name and account number, inheritance by having subclasses like SavingAccountCustomer and LoanAccountCustomer, and association by associating customers with accounts.", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 500)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Design a Java class for a BankCustomer that demonstrates encapsulation by encapsulating attributes like name and account number, inheritance by having subclasses like SavingAccountCustomer and LoanAccountCustomer, and association by associating customers with accounts.

### Input:


### Response:
// Encapsulated attributes
private String name;
private String accountNumber;

// Association with accounts (can be extended to further subclasses)
private Account account;

// Constructor with arguments for name and account number
public BankCustomer(String name, String accountNumber) {
  this.name = name;
  this.accountNumber = accountNumber;
  this.account = null; // Account not initialized by default
}

// Getter methods for encapsulated attributes
public String getName() {
  return name;
}

public S

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("code_lora_model") # Local saving
tokenizer.save_pretrained("code_lora_model")
# model.push_to_hub("rahulAkaVector/java_code_generator_finetuned_model", token = "hugging_face_access_token") # Online saving
# tokenizer.push_to_hub("rahulAkaVector/java_code_generator_finetuned_model", token = "hugging_face_access_token") # Online saving

README.md:   0%|          | 0.00/27.0 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/rahulAkaVector/java_code_generator_finetuned_model
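
To check what was actually written to disk, the saved adapter directory can be inspected without loading the model. The sketch below assumes the directory contains an `adapter_config.json` file with the usual PEFT LoRA fields (`base_model_name_or_path`, `r`, `lora_alpha`, `target_modules`); `summarize_adapter` is a hypothetical helper written for this example.

```python
import json
import os


def summarize_adapter(adapter_dir):
    """Read adapter_config.json from a saved LoRA directory and return key fields."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return {
        "base_model": config.get("base_model_name_or_path"),
        "r": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        "target_modules": sorted(config.get("target_modules") or []),
    }


# Only runs if the local save from the cell above exists.
if os.path.isdir("code_lora_model"):
    print(summarize_adapter("code_lora_model"))
```

The summary should mirror the values passed to `get_peft_model`, which is a quick way to confirm that the directory holds the adapters from this run and not the full model.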


To load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "code_lora_model", # The directory (or Hub repo) the LoRA adapters were saved to
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Write a Java class representing a SocialMediaPost that showcases encapsulation by encapsulating attributes like content and author, abstraction by providing methods for liking and commenting, and inheritance by having subclasses like TextPost and ImagePost.", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 500, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWrite a Java class representing a SocialMediaPost that showcases encapsulation by encapsulating attributes like content and author, abstraction by providing methods for liking and commenting, and inheritance by having subclasses like TextPost and ImagePost.\n\n### Input:\n\n\n### Response:\npublic class SocialMediaPost {\n  private String content;\n  private String author;\n  private int likes;\n  private List<String> comments;\n\n  public SocialMediaPost(String content, String author) {\n    this.content = content;\n    this.author = author;\n    this.likes = 0;\n    this.comments = new ArrayList<>();\n  }\n\n  public String getContent() {\n    return content;\n  }\n\n  public String getAuthor() {\n    return author;\n  }\n\n  public int getLikes() {\n    return likes;\n  }\n\n  public void

You can also use Hugging Face's `AutoPeftModelForCausalLM` from the `peft` library. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.