## Preparing the Dataset

We are fine-tuning our model on Alpaca Dataset (10K Rows). The below mentioned code is to load the Alpaca Dataset from [Hugging Face](https://huggingface.co/datasets/tatsu-lab/alpaca) using datasets library and then selecting 10K rows and saving them in JSON format. We have selected only 10K rows from the dataset is to reduce the dataset size due to frequent server disconnection issue in IDC

In [2]:
# Step 1: Install the 'datasets' library
!pip install datasets

# Step 2: Import required modules and load the Alpaca dataset from Hugging Face
from datasets import load_dataset
import json

# Load the Alpaca dataset
dataset = load_dataset("tatsu-lab/alpaca")

# Step 3: Extract 10,000 rows from the dataset
selected_data = dataset['train'].select(range(10000))

# Transform the selected rows into a list of dictionaries
data_list = [selected_data[i] for i in range(len(selected_data))]

# Step 4: Write the selected rows to a JSON file
json_file_path = 'alpaca_10000_lines.json'
with open(json_file_path, 'w') as json_file:
    json.dump(data_list, json_file, indent=4)

print(f"JSON file saved to {json_file_path}")


Defaulting to user installation because normal site-packages is not writeable
JSON file saved to alpaca_10000_lines.json


### Fine-Tuning a Pre-Trained Language Model on Custom Dataset

This code demonstrates the process of fine-tuning a pre-trained language model, specifically the `databricks/dolly-v2-3b` model, on a custom dataset using Hugging Face and Intel Extension for Transformers libraries. The selected model, `databricks/dolly-v2-3b`, is a powerful language model that is part of the Dolly series, known for its ability to generate coherent and contextually relevant text.

#### What This Code Does:
1. **Imports Necessary Libraries**:
   - The code imports essential classes and functions from Hugging Face and Intel Extension for Transformers libraries, enabling configuration and fine-tuning of the model.

2. **Model Configuration**:
   - The `ModelArguments` class is used to specify the pre-trained model to be fine-tuned (`databricks/dolly-v2-3b`).

3. **Data Configuration**:
   - The `DataArguments` class sets up the path to the custom training dataset (`alpaca_10000_lines.json`) and defines a small validation split.

4. **Training Configuration**:
   - The `TrainingArguments` class configures various training parameters such as the output directory for saving checkpoints, batch sizes, number of epochs, gradient accumulation steps, and other relevant settings. Notably, bfloat16 precision is enabled for training, if supported, to optimize performance.

5. **Finetuning Configuration**:
   - Additional fine-tuning parameters are set using the `FinetuningArguments` class.

6. **Combining Configurations**:
   - All configurations (model, data, training, and fine-tuning arguments) are combined into a single `TextGenerationFinetuningConfig` object.

7. **Initiating Fine-Tuning**:
   - The `finetune_model` function is called with the combined configuration to start the fine-tuning process.

#### Why Choose `databricks/dolly-v2-3b`?
- **Performance**: The Dolly series models, including `dolly-v2-3b`, are known for their robust performance in text generation tasks, producing coherent and contextually appropriate responses.
- **Flexibility**: These models are versatile and can be fine-tuned for various downstream tasks, making them suitable for a wide range of applications.
- **Pre-Training**: The `dolly-v2-3b` model has been pre-trained on extensive datasets, providing a strong foundation for further fine-tuning on specific data.
- **Light-Weight**: `dolly-v2-3b` is trained on 3 billion parameters, unlike larger models like `Llama-2-7b-chat-hf`, which makes the training process easier and faster.

#### Use Case:
Fine-tuning `databricks/dolly-v2-3b` on the Alpaca dataset (10,000 rows) allows customization of the model to better suit specific needs, improving its performance on tasks relevant to the provided dataset. This process is crucial for adapting the general capabilities of the pre-trained model to specialized tasks, enhancing accuracy and relevance in generated outputs.

In [3]:
# Importing necessary classes from Hugging Face and Intel Extension for Transformers
from transformers import TrainingArguments
from intel_extension_for_transformers.neural_chat.config import (
    ModelArguments,
    DataArguments,
    FinetuningArguments,
    TextGenerationFinetuningConfig,
)
from intel_extension_for_transformers.neural_chat.chatbot import finetune_model

# Setting up model arguments, specifying which pre-trained model to use
model_args = ModelArguments(model_name_or_path="databricks/dolly-v2-3b")

# Setting up data arguments, including the training file and validation split
data_args = DataArguments(train_file="alpaca_10000_lines.json", validation_split_percentage=1)

# Configuring training arguments
training_args = TrainingArguments(
    output_dir='./tmp1',               # Directory to save model checkpoints
    do_train=True,                    # Enable training
    do_eval=True,                     # Enable evaluation
    num_train_epochs=3,               # Number of training epochs
    overwrite_output_dir=True,        # Overwrite content of the output directory
    per_device_train_batch_size=4,    # Training batch size per device
    per_device_eval_batch_size=4,     # Evaluation batch size per device
    gradient_accumulation_steps=2,    # Number of updates steps to accumulate gradients before performing a backward pass
    save_strategy="no",               # Strategy for saving checkpoints (here, no intermediate saving)
    log_level="info",                 # Logging level
    save_total_limit=2,               # Limit the total amount of checkpoints
    bf16=True,                        # Use bfloat16 precision for training (if supported)
)

# Setting up additional finetuning arguments
finetune_args = FinetuningArguments()

# Combining all the configurations into a single finetuning configuration
finetune_cfg = TextGenerationFinetuningConfig(
            model_args=model_args,
            data_args=data_args,
            training_args=training_args,
            finetune_args=finetune_args,
        )

# Starting the finetuning process with the specified configuration
finetune_model(finetune_cfg)

2024-06-23 06:26:01.813799: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-23 06:26:02.804816: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
distributed training: True, 16-bits training: True
2024-06-23 06:26:09,665 - finetuning.py - intel_extension_for_transformers.transformers.llm.finetuning.finetuning - INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batc

config.json:   0%|          | 0.00/819 [00:00<?, ?B/s]

[INFO|configuration_utils.py:733] 2024-06-23 06:26:10,869 >> loading configuration file config.json from cache at /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/hub/models--databricks--dolly-v2-3b/snapshots/f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df/config.json
[INFO|configuration_utils.py:796] 2024-06-23 06:26:10,879 >> Model config GPTNeoXConfig {
  "_name_or_path": "databricks/dolly-v2-3b",
  "architectures": [
    "GPTNeoXForCausalLM"
  ],
  "attention_bias": true,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.1,
  "custom_pipelines": {
    "text-generation": {
      "impl": "instruct_pipeline.InstructionTextGenerationPipeline",
      "pt": "AutoModelForCausalLM",
      "tf": "TFAutoModelForCausalLM"
    }
  },
  "eos_token_id": 0,
  "hidden_act": "gelu",
  "hidden_dropout": 0.0,
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 10240,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_ne

tokenizer_config.json:   0%|          | 0.00/450 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/228 [00:00<?, ?B/s]

[INFO|tokenization_utils_base.py:2108] 2024-06-23 06:26:12,496 >> loading file vocab.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-06-23 06:26:12,497 >> loading file merges.txt from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-06-23 06:26:12,498 >> loading file tokenizer.json from cache at /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/hub/models--databricks--dolly-v2-3b/snapshots/f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-06-23 06:26:12,499 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-06-23 06:26:12,500 >> loading file special_tokens_map.json from cache at /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/hub/models--databricks--dolly-v2-3b/snapshots/f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-06-23 06:26:12,501 >> loading file tokenizer_config.json from cache 

Generating train split: 0 examples [00:00, ? examples/s]

Unable to verify splits sizes.
2024-06-23 06:26:13,242 - info_utils.py - datasets.utils.info_utils - INFO - Unable to verify splits sizes.
Dataset json downloaded and prepared to /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/datasets/json/default-23a59859bfe54ec1/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a. Subsequent calls will reuse this data.
2024-06-23 06:26:13,252 - builder.py - datasets.builder - INFO - Dataset json downloaded and prepared to /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/datasets/json/default-23a59859bfe54ec1/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a. Subsequent calls will reuse this data.
Using custom data configuration default-23a59859bfe54ec1
2024-06-23 06:26:13,547 - builder.py - datasets.builder - INFO - Using custom data configuration default-23a59859bfe54ec1
Loading Dataset Infos from /home/u1759c6d9527c9022f99471cd5e84981/.conda/envs/itrex/lib/python3.10/site-packages/datasets

pytorch_model.bin:   0%|          | 0.00/5.68G [00:00<?, ?B/s]

[INFO|modeling_utils.py:3474] 2024-06-23 06:26:45,136 >> loading weights file pytorch_model.bin from cache at /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/hub/models--databricks--dolly-v2-3b/snapshots/f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df/pytorch_model.bin
[INFO|modeling_utils.py:1519] 2024-06-23 06:26:45,717 >> Instantiating GPTNeoXForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:962] 2024-06-23 06:26:45,722 >> Generate config GenerationConfig {
  "bos_token_id": 0,
  "eos_token_id": 0
}

[INFO|modeling_utils.py:4280] 2024-06-23 06:26:46,191 >> All model checkpoint weights were used when initializing GPTNeoXForCausalLM.

[INFO|modeling_utils.py:4288] 2024-06-23 06:26:46,192 >> All the weights of GPTNeoXForCausalLM were initialized from the model checkpoint at databricks/dolly-v2-3b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPTNeoXForCausalLM for predictions without further trai

Map:   0%|          | 0/9900 [00:00<?, ? examples/s]

Caching processed dataset at /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/datasets/json/default-23a59859bfe54ec1/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a/cache-3b4498c23a63fd4a.arrow
2024-06-23 06:26:47,239 - arrow_dataset.py - datasets.arrow_dataset - INFO - Caching processed dataset at /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/datasets/json/default-23a59859bfe54ec1/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a/cache-3b4498c23a63fd4a.arrow


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Caching processed dataset at /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/datasets/json/default-23a59859bfe54ec1/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a/cache-373df0f915057236.arrow
2024-06-23 06:26:50,019 - arrow_dataset.py - datasets.arrow_dataset - INFO - Caching processed dataset at /home/u1759c6d9527c9022f99471cd5e84981/.cache/huggingface/datasets/json/default-23a59859bfe54ec1/0.0.0/7483f22a71512872c377524b97484f6d20c275799bb9e7cd8fb3198178d8220a/cache-373df0f915057236.arrow
2024-06-23 06:26:50,040 - finetuning.py - intel_extension_for_transformers.transformers.llm.finetuning.finetuning - INFO - Using data collator of type DataCollatorForSeq2Seq


trainable params: 2,621,440 || all params: 2,777,707,520 || trainable%: 0.09437422698844837


[INFO|trainer.py:641] 2024-06-23 06:26:50,324 >> Using cpu_amp half precision backend
[INFO|trainer.py:2078] 2024-06-23 06:26:50,649 >> ***** Running training *****
[INFO|trainer.py:2079] 2024-06-23 06:26:50,651 >>   Num examples = 9,900
[INFO|trainer.py:2080] 2024-06-23 06:26:50,652 >>   Num Epochs = 3
[INFO|trainer.py:2081] 2024-06-23 06:26:50,653 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:2084] 2024-06-23 06:26:50,654 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2085] 2024-06-23 06:26:50,655 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:2086] 2024-06-23 06:26:50,656 >>   Total optimization steps = 3,711
[INFO|trainer.py:2087] 2024-06-23 06:26:50,659 >>   Number of trainable parameters = 2,621,440
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass


Step,Training Loss
500,0.5879
1000,0.5038
1500,0.4742
2000,0.4573
2500,0.472
3000,0.43
3500,0.4208


[INFO|trainer.py:2329] 2024-06-23 09:09:08,096 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:3410] 2024-06-23 09:09:08,102 >> Saving model checkpoint to ./tmp
[INFO|tokenization_utils_base.py:2513] 2024-06-23 09:09:08,132 >> tokenizer config file saved in ./tmp/tokenizer_config.json
[INFO|tokenization_utils_base.py:2522] 2024-06-23 09:09:08,134 >> Special tokens file saved in ./tmp/special_tokens_map.json
2024-06-23 09:09:08,167 - finetuning.py - intel_extension_for_transformers.transformers.llm.finetuning.finetuning - INFO - *** Evaluate After Training***
[INFO|trainer.py:3719] 2024-06-23 09:09:08,173 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-06-23 09:09:08,174 >>   Num examples = 100
[INFO|trainer.py:3724] 2024-06-23 09:09:08,175 >>   Batch size = 4


***** eval metrics *****
  epoch                   =     2.9988
  eval_loss               =     0.4564
  eval_ppl                =     1.5786
  eval_runtime            = 0:00:14.16
  eval_samples            =        100
  eval_samples_per_second =      7.061
  eval_steps_per_second   =      1.765


## Merging LoRA Adapter into a Language Model

This code demonstrates the process of merging a LoRA (Low-Rank Adaptation) adapter into a pre-trained language model (`databricks/dolly-v2-3b`). The merged model is then saved to a specified directory (`./merged_model/`).

### What This Code Does:

- **Imports Necessary Libraries**:
  - Imports essential libraries including `torch` for tensor operations, `AutoModelForCausalLM` and `AutoTokenizer` from Hugging Face's Transformers for model loading and tokenization, and `get_peft_model`, `LoraConfig`, `TaskType` from `peft` for adapter merging and configuration.

- **Configuration**:
  - Defines the base model name (`databricks/dolly-v2-3b`), the directory containing adapter configuration (`adapter_dir`), and the output directory for the merged model (`merged_model_output_dir`).

- **Load the Base Model**:
  - Loads the base language model (`AutoModelForCausalLM`) from its pre-trained checkpoint (`base_model_name`) with specific configurations like using `torch.bfloat16` for precision and automatic device mapping (`device_map='auto'`).

- **Load the Tokenizer**:
  - Loads the tokenizer corresponding to the base model (`AutoTokenizer`), sets `pad_token` to `eos_token`, and configures `padding_side` as "right" for token padding.

- **Load the LoRA Adapter Configuration**:
  - Loads the LoRA adapter configuration (`LoraConfig`) from the specified directory (`adapter_dir`).

- **Merge the LoRA Adapter**:
  - Utilizes `get_peft_model` function to merge the LoRA adapter configuration (`peft_config`) into the base model (`base_model`), producing `model_with_adapter`.

- **Save the Merged Model**:
  - Saves the merged model (`model_with_adapter`) and tokenizer (`tokenizer`) to the output directory (`merged_model_output_dir`).


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

# Configuration
base_model_name = "databricks/dolly-v2-3b"
adapter_dir = "./tmp"  # Directory containing adapter_config.json and adapter_model.bin
merged_model_output_dir = "./merged_model/"

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load the LoRA adapter configuration
peft_config = LoraConfig.from_pretrained(adapter_dir)


# Merge the LoRA adapter into the base model
model_with_adapter = get_peft_model(base_model, peft_config)

# Save the merged model
model_with_adapter.save_pretrained(merged_model_output_dir)
tokenizer.save_pretrained(merged_model_output_dir)

### Benchmarking the Fine-Tuned Language Model on CPU

This code benchmarks the performance of a fine-tuned language model, specifically measuring its average latency and throughput when running on a CPU. The benchmarking is performed using a small portion of the Wikitext-2 dataset.

#### What This Code Does:

**Imports Necessary Libraries**:
- Imports essential libraries and modules for model loading, tokenization, dataset handling, and performance measurement.

**Model and Tokenizer Loading**:
- Loads the fine-tuned model and tokenizer from the specified directory.

**Device Configuration**:
- Ensures the model is running on the CPU.

**Dataset Loading**:
- Loads a small sample from the Wikitext-2 dataset for benchmarking purposes.

**Data Preprocessing**:
- Preprocesses the dataset to tokenize the text and prepare it for input into the model.

**Data Collation**:
- Defines a function to collate the data into batches suitable for input into the model.

**Benchmarking Function**:
- Defines a function to measure the model's average latency and throughput during inference.

**Running the Benchmark**:
- Runs the benchmark on the CPU and prints the average latency and throughput.

In [None]:
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load the fine-tuned model and tokenizer
model_name_or_path = './merged_model/'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

# Ensure the model is on CPU
device = torch.device('cpu')
model.to(device)

# Load a sample dataset for benchmarking
# Using a small dataset for demonstration purposes
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test[:1%]')

# Preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512)

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["text"])

# Data collator to handle batching
def collate_fn(batch):
    return {key: torch.tensor([example[key] for example in batch]) for key in batch[0].keys()}

dataloader = DataLoader(tokenized_dataset, batch_size=4, collate_fn=collate_fn)

# Benchmarking function
def benchmark_model(model, dataloader):
    model.eval()
    total_time = 0
    total_tokens = 0

    with torch.no_grad():
        for batch in dataloader:
            inputs = {key: value.to(device) for key, value in batch.items()}
            start_time = time.time()
            outputs = model(**inputs)
            total_time += time.time() - start_time
            total_tokens += inputs['input_ids'].numel()

    avg_latency = total_time / len(dataloader)
    throughput = total_tokens / total_time

    return avg_latency, throughput

# Run benchmark on CPU
avg_latency, throughput = benchmark_model(model, dataloader)

print(f'Average Latency: {avg_latency:.4f} seconds')
print(f'Throughput: {throughput:.2f} tokens/second')

Below codeblock demonstrates the process of inferencing text using a merged model that combines a pre-trained language model (`databricks/dolly-v2-3b`) with a LoRA (Low-Rank Adaptation) adapter.

### Overview:
1. **Configuration**:
    - Defines the base model name (`databricks/dolly-v2-3b`), the directory containing the adapter configuration (`adapter_dir`), and the output directory for the merged model (`merged_model_output_dir`).

2. **Load Tokenizer and Model**:
    - Loads the tokenizer and the merged model from the specified output directory.

3. **System Message and Prompt**:
    - Sets a system message guiding the model to generate engaging and positive stories.
    - Defines a user prompt to initiate the story generation.

4. **Tokenization**:
    - Tokenizes the combined system message and user prompt.

5. **Model Evaluation and Text Generation**:
    - Ensures the model is in evaluation mode.
    - Generates text token-by-token up to a specified maximum length (`max_length`), printing each token as it is generated.
    - Times the generation process to measure performance.


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import time
# Assuming these are your configurations
base_model_name = "databricks/dolly-v2-3b"
adapter_dir = "./tmp"  # Directory containing adapter_config.json and adapter_model.bin
merged_model_output_dir = "./merged_model/"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(merged_model_output_dir)

# Load the merged model
model = AutoModelForCausalLM.from_pretrained(merged_model_output_dir)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
# Define the system message for story generation
system_message = (
    "You are a creative and imaginative storyteller. Your task is to continue stories in a captivating and coherent manner. "
    "Ensure that your narratives are engaging, appropriate for all audiences, and maintain a positive tone. Avoid any content that is harmful, unethical, racist, sexist, toxic, dangerous, or illegal. "
    "Strive to create stories that are socially unbiased and enjoyable. If a prompt is unclear or does not make sense, provide a creative and sensible continuation while maintaining coherence.\n"
)

# Example prompt for generation
user_prompt = "Computer Engineering is"
prompt = system_message + user_prompt

# Tokenize the input text
inputs = tokenizer(prompt, return_tensors="pt")

# Ensure that the model is in evaluation mode
model.eval()

# Generate tokens step-by-step
max_length = 75
generated_tokens = inputs["input_ids"]
generated_text = ""

# Start timing
start_time = time.time()

print(user_prompt, end='', flush=True)  # Print the user prompt

with torch.no_grad():
    for _ in range(max_length):
        outputs = model(generated_tokens)
        next_token_logits = outputs.logits[:, -1, :]

        # Get the most probable next token
        next_token_id = torch.argmax(next_token_logits, dim=-1, keepdim=True)

        # Stop if EOS token is generated
        if next_token_id.item() == tokenizer.eos_token_id:
            break

        # Append the new token to the generated sequence
        generated_tokens = torch.cat((generated_tokens, next_token_id), dim=-1)

        # Decode and print the generated token
        generated_text_chunk = tokenizer.decode(next_token_id[0], skip_special_tokens=True)
        generated_text += generated_text_chunk
        print(generated_text_chunk, end='', flush=True)  # Print each token as it's generated

# End timing
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

print()  # Ensure the output ends with a newline
print(f"\nTime taken for generation: {elapsed_time:.2f} seconds")

Computer Engineering is a broad field that includes many different disciplines. Some of the more popular computer engineering disciplines include computer architecture, computer engineering, computer graphics, and computer science. Computer engineering is the study of the design, construction, and operation of electronic systems and devices. Computer engineering students typically focus on the design of hardware and software, and the integration of these components into larger systems.



Time taken for generation: 19.80 seconds


In [9]:
# Define the system message for story generation
system_message = (
    "You are a creative and imaginative storyteller. Your task is to continue stories in a captivating and coherent manner. "
    "Ensure that your narratives are engaging, appropriate for all audiences, and maintain a positive tone. Avoid any content that is harmful, unethical, racist, sexist, toxic, dangerous, or illegal. "
    "Strive to create stories that are socially unbiased and enjoyable. If a prompt is unclear or does not make sense, provide a creative and sensible continuation while maintaining coherence.\n"
)

# Example prompt for generation
user_prompt = "Once upon a time"
prompt = system_message + user_prompt

# Tokenize the input text
inputs = tokenizer(prompt, return_tensors="pt")

# Ensure that the model is in evaluation mode
model.eval()

# Generate tokens step-by-step
max_length = 75
generated_tokens = inputs["input_ids"]
generated_text = ""

# Start timing
start_time = time.time()

print(user_prompt, end='', flush=True)  # Print the user prompt

with torch.no_grad():
    for _ in range(max_length):
        outputs = model(generated_tokens)
        next_token_logits = outputs.logits[:, -1, :]

        # Get the most probable next token
        next_token_id = torch.argmax(next_token_logits, dim=-1, keepdim=True)

        # Stop if EOS token is generated
        if next_token_id.item() == tokenizer.eos_token_id:
            break

        # Append the new token to the generated sequence
        generated_tokens = torch.cat((generated_tokens, next_token_id), dim=-1)

        # Decode and print the generated token
        generated_text_chunk = tokenizer.decode(next_token_id[0], skip_special_tokens=True)
        generated_text += generated_text_chunk
        print(generated_text_chunk, end='', flush=True)  # Print each token as it's generated

# End timing
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

print()  # Ensure the output ends with a newline
print(f"\nTime taken for generation: {elapsed_time:.2f} seconds")

Once upon a time, there was a little girl who loved to read. She loved to read books about princesses, fairies, and unicorns. She loved to read about the world she created in her head. She loved to read until one day, she found a book that changed her life. It was a book about a little girl who loved to read. The book was called

Time taken for generation: 19.83 seconds


In [10]:
# Define the system message for story generation
system_message = (
    "You are a creative and imaginative storyteller. Your task is to continue stories in a captivating and coherent manner. "
    "Ensure that your narratives are engaging, appropriate for all audiences, and maintain a positive tone. Avoid any content that is harmful, unethical, racist, sexist, toxic, dangerous, or illegal. "
    "Strive to create stories that are socially unbiased and enjoyable. If a prompt is unclear or does not make sense, provide a creative and sensible continuation while maintaining coherence.\n"
)

# Example prompt for generation
user_prompt = "Long ago lived a king"
prompt = system_message + user_prompt

# Tokenize the input text
inputs = tokenizer(prompt, return_tensors="pt")

# Ensure that the model is in evaluation mode
model.eval()

# Generate tokens step-by-step
max_length = 75
generated_tokens = inputs["input_ids"]
generated_text = ""

# Start timing
start_time = time.time()

print(user_prompt, end='', flush=True)  # Print the user prompt

with torch.no_grad():
    for _ in range(max_length):
        outputs = model(generated_tokens)
        next_token_logits = outputs.logits[:, -1, :]

        # Get the most probable next token
        next_token_id = torch.argmax(next_token_logits, dim=-1, keepdim=True)

        # Stop if EOS token is generated
        if next_token_id.item() == tokenizer.eos_token_id:
            break

        # Append the new token to the generated sequence
        generated_tokens = torch.cat((generated_tokens, next_token_id), dim=-1)

        # Decode and print the generated token
        generated_text_chunk = tokenizer.decode(next_token_id[0], skip_special_tokens=True)
        generated_text += generated_text_chunk
        print(generated_text_chunk, end='', flush=True)  # Print each token as it's generated

# End timing
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

print()  # Ensure the output ends with a newline
print(f"\nTime taken for generation: {elapsed_time:.2f} seconds")

Long ago lived a king and queen. They had three children: a boy and two girls. The boy was very handsome and the girls were beautiful. The king and queen loved their children very much. They were so proud of them. The king and queen were especially proud of their son. He was a great warrior and a hero. He was strong and brave. He was a true leader. He

Time taken for generation: 20.96 seconds


In [13]:
# Define the system message for story generation
system_message = (
    "You are a creative and imaginative storyteller. Your task is to continue stories in a captivating and coherent manner. "
    "Ensure that your narratives are engaging, appropriate for all audiences, and maintain a positive tone. Avoid any content that is harmful, unethical, racist, sexist, toxic, dangerous, or illegal. "
    "Strive to create stories that are socially unbiased and enjoyable. If a prompt is unclear or does not make sense, provide a creative and sensible continuation while maintaining coherence.\n"
)

# Example prompt for generation
user_prompt = "Once upon a time, in a faraway kingdom, there lived a princess named Snow White. She was as beautiful as a blooming rose, with skin as white as snow, lips as red as blood, and hair as black as ebony. Her mother, the queen, passed away when Snow White was very young, and her father, the king, remarried a vain and wicked woman who was obsessed with her own beauty. The new queen owned a magical mirror that she would often consult, asking who is the fairest of them all. The mirror always replied that she, the queen, was the fairest of them all. However, as Snow White grew older, her beauty surpassed that of the queen, and one day, when the queen asked her mirror the familiar question, it replied that Snow White is the fairest of them all. Consumed with jealousy, the queen ordered a huntsman to take Snow White into the forest and kill her. She demanded that he bring back Snow White's heart as proof. The huntsman, however, could not bring himself to harm the innocent princess. Instead, he let her go, advising her to run far away and never return. To deceive the queen, he brought back the heart of a wild boar."
prompt = system_message + user_prompt

# Tokenize the input text
inputs = tokenizer(prompt, return_tensors="pt")

# Ensure that the model is in evaluation mode
model.eval()

# Generate tokens step-by-step
max_length = 75
generated_tokens = inputs["input_ids"]
generated_text = ""

# Start timing
start_time = time.time()

print(user_prompt, end='', flush=True)  # Print the user prompt

with torch.no_grad():
    for _ in range(max_length):
        outputs = model(generated_tokens)
        next_token_logits = outputs.logits[:, -1, :]

        # Get the most probable next token
        next_token_id = torch.argmax(next_token_logits, dim=-1, keepdim=True)

        # Stop if EOS token is generated
        if next_token_id.item() == tokenizer.eos_token_id:
            break

        # Append the new token to the generated sequence
        generated_tokens = torch.cat((generated_tokens, next_token_id), dim=-1)

        # Decode and print the generated token
        generated_text_chunk = tokenizer.decode(next_token_id[0], skip_special_tokens=True)
        generated_text += generated_text_chunk
        print(generated_text_chunk, end='', flush=True)  # Print each token as it's generated

# End timing
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

print()  # Ensure the output ends with a newline
print(f"\nTime taken for generation: {elapsed_time:.2f} seconds")

Once upon a time, in a faraway kingdom, there lived a princess named Snow White. She was as beautiful as a blooming rose, with skin as white as snow, lips as red as blood, and hair as black as ebony. Her mother, the queen, passed away when Snow White was very young, and her father, the king, remarried a vain and wicked woman who was obsessed with her own beauty. The new queen owned a magical mirror that she would often consult, asking who is the fairest of them all. The mirror always replied that she, the queen, was the fairest of them all. However, as Snow White grew older, her beauty surpassed that of the queen, and one day, when the queen asked her mirror the familiar question, it replied that Snow White is the fairest of them all. Consumed with jealousy, the queen ordered a huntsman to take Snow White into the forest and kill her. She demanded that he bring back Snow White's heart as proof. The huntsman, however, could not bring himself to harm the innocent princess. Instead, he le

In [16]:
# Define the system message for story generation
system_message = (
    "You are a creative and imaginative storyteller. Your task is to continue stories in a captivating and coherent manner. "
    "Ensure that your narratives are engaging, appropriate for all audiences, and maintain a positive tone. Avoid any content that is harmful, unethical, racist, sexist, toxic, dangerous, or illegal. "
    "Strive to create stories that are socially unbiased and enjoyable. If a prompt is unclear or does not make sense, provide a creative and sensible continuation while maintaining coherence.\n"
)

# Example prompt for generation
user_prompt = "Once upon a time in Stanford"
prompt = system_message + user_prompt

# Tokenize the input text
inputs = tokenizer(prompt, return_tensors="pt")

# Ensure that the model is in evaluation mode
model.eval()

# Generate tokens step-by-step
max_length = 75
generated_tokens = inputs["input_ids"]
generated_text = ""

# Start timing
start_time = time.time()

print(user_prompt, end='', flush=True)  # Print the user prompt

with torch.no_grad():
    for _ in range(max_length):
        outputs = model(generated_tokens)
        next_token_logits = outputs.logits[:, -1, :]

        # Get the most probable next token
        next_token_id = torch.argmax(next_token_logits, dim=-1, keepdim=True)

        # Stop if EOS token is generated
        if next_token_id.item() == tokenizer.eos_token_id:
            break

        # Append the new token to the generated sequence
        generated_tokens = torch.cat((generated_tokens, next_token_id), dim=-1)

        # Decode and print the generated token
        generated_text_chunk = tokenizer.decode(next_token_id[0], skip_special_tokens=True)
        generated_text += generated_text_chunk
        print(generated_text_chunk, end='', flush=True)  # Print each token as it's generated

# End timing
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

print()  # Ensure the output ends with a newline
print(f"\nTime taken for generation: {elapsed_time:.2f} seconds")

Once upon a time in Stanford's English classroom, there was a young man named Harry Potter. He was a very smart wizard, and he was very good at his job. He was also a very bad boy. He would do things like drink excessively, get into fights, and generally cause trouble. One day, Harry Potter was walking down the street when he saw a beautiful girl. He was immediately attracted

Time taken for generation: 18.86 seconds
