<a href="https://github.com/Deffro/Data-Science-Portfolio/tree/master"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17ATJVumI5cBGmQjkVsN1gk-Gn3-_BMuP?usp=sharing)

# **Unlocking the Potential of Generative AI: Advanced Prompt Engineering and Model Optimization**

## **Introduction**

Generative AI is transforming industries by enabling machines to produce human-like text, generate structured outputs, and even creatively solve complex problems.

However, the real magic happens when you combine cutting-edge models with advanced techniques like **prompt engineering**, **quantized model optimization**, and **grammar-constrained sampling**.

In this project, I explore the forefront of these techniques, demonstrating how to harness the capabilities of state-of-the-art language models to achieve specific and efficient results.

From crafting intricate prompts to controlling the randomness and structure of outputs, this work reflects the fusion of creativity, technical expertise, and a deep understanding of model behavior.

---

## **What This Project Explores**

This project focuses on three pivotal aspects of generative AI:

### **1. Advanced Prompt Engineering**
- Learn how to guide models using precise instructions, multi-component prompts, and **in-context learning** (zero-shot, one-shot, and few-shot prompting).
- Use advanced prompt strategies to create structured outputs like **JSON-based descriptions**.

### **2. Optimizing Models with Quantization**
- Explore how **quantized models** enable high-performance inference with reduced memory and computational requirements.
- Understand the trade-offs between file size, precision, and model accuracy using quantization levels such as Q2, Q4, and fp16.

### **3. Ensuring Structured Outputs with Grammar Constraints**
- Enforce predefined formats like JSON during the token generation process using **grammar-constrained sampling**.
- Apply these techniques to achieve reliable and application-ready outputs in domains like structured data generation and sentiment classification.

---

## **Why It Matters**

As AI continues to integrate into our lives, the ability to guide and optimize these systems has become a vital skill. This project showcases the intersection of cutting-edge AI techniques and real-world applications, empowering anyone to:

- Efficiently deploy models on resource-constrained devices.
- Achieve creative and structured outputs for diverse use cases.
- Build confidence in AI outputs through rigorous validation and control.


In [1]:
%%capture
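!pip install "transformers>=4.40.1"
# CMAKE_ARGS must be set on the same line as pip, otherwise the build never sees it
# (recent llama-cpp-python releases use -DGGML_CUDA=on instead of -DLLAMA_CUBLAS=on)
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python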

# **Load a Text Generation Model**

<div style="border-left: 5px solid #2196F3; background-color: #f1f9ff; padding: 10px 15px; font-family: Arial, sans-serif; font-size: 14px; line-height: 1.5; color: #333;">
  <strong style="color: #0d47a1; font-size: 16px;">Tip:</strong>
  Some Hugging Face models are gated and require an account with an access token. If you don't have one, use an openly available model instead.
  <br><br>
  In that case, replace <code style="background-color: #e8f4fc; padding: 2px 4px; border-radius: 4px;">openai-community/gpt2</code> with <code style="background-color: #e8f4fc; padding: 2px 4px; border-radius: 4px;">microsoft/Phi-3-mini-4k-instruct</code>.
</div>


In [2]:
from huggingface_hub import notebook_login

# Launch the login widget
notebook_login()


In [2]:
from transformers import pipeline

# Create a pipeline
pipe = pipeline(
    task="text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",
    return_full_text=False,
    max_new_tokens=100,
    do_sample=False,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0


### **Summary of Parameters**

| **Parameter**         | **Value**                     | **Description**                                                                 |
|------------------------|-------------------------------|---------------------------------------------------------------------------------|
| **task**              | `"text-generation"`           | Defines the task type: generates text continuations based on input prompts.     |
| **model**             | `"Qwen/Qwen2.5-1.5B-Instruct"`| Instruction-tuned generative model with 1.5B parameters, optimized for NLP tasks. |
| **`return_full_text`**| `False`                       | Excludes the input prompt from the output, returning only the generated text.   |
| **`max_new_tokens`**  | `100`                         | Limits the number of tokens generated to control output length and cost.        |
| **`do_sample`**       | `False`                       | Disables randomness to ensure deterministic, reproducible outputs.              |


# **1. Using a simple prompt**

In [3]:
# Prompt
messages = [
    {"role": "user", "content": "Given the company Tesla, describe the best pokemon it would represent. Write a short description of the pokemon-company."}
]

In [4]:
# Apply prompt template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
prompt

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nGiven the company Tesla, describe the best pokemon it would represent. Write a short description of the pokemon-company.<|im_end|>\n'

## Why Should `messages` Have `role` and `content`?

The `messages` structure with `role` and `content` fields is commonly used for **chat-style conversational models**. Below are the key reasons why this format is important and beneficial:

---

### **1. Mimics Multi-Turn Conversations**
- The **`role`** field (e.g., `user`, `assistant`, or `system`) helps the model understand the context of the conversation:
  - **`user`**: Represents the human's input or query.
  - **`assistant`**: Represents the AI model's response.
  - **`system`**: Provides instructions or context for the AI’s behavior (if supported by the model).
- This structure enables models to manage multi-turn conversations effectively by maintaining a clear distinction between different speakers.
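
To make the role separation concrete, here is a minimal sketch of how a ChatML-style template (the format visible in the rendered Qwen prompt above) might serialize a multi-turn conversation. `render_chatml` is a hypothetical helper written for illustration; in practice `apply_chat_template` does this work, and the exact special tokens differ between models.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Serialize role/content messages into a ChatML-style string (illustrative only)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should speak next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Which Pokemon fits Tesla?"},
    {"role": "assistant", "content": "Pikachu."},
    {"role": "user", "content": "Why?"},
]
rendered = render_chatml(conversation)
print(rendered)
```

Each speaker ends up in its own delimited block, which is what lets the model tell earlier assistant replies apart from new user input.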

---

### **2. Instruction-Following Models**
- Instruction-tuned chat models (e.g., Qwen or ChatGPT-like models) are often trained on datasets where conversations are structured this way.
- Including `role` and `content` aligns with the model's training data, improving its ability to:
  - Interpret the input correctly.
  - Generate coherent and contextually relevant responses.

---

### **3. Flexibility for Chat Applications**
- The `messages` format allows for more complex and dynamic interactions, such as:
  - Handling multiple speakers in a conversation.
  - Maintaining context across multi-turn dialogues.
  - Incorporating dynamic instructions via the `system` role.

---

### **4. Model-Specific Requirements**
- Some chat-focused models (like OpenAI’s ChatGPT, GPT-4, or Qwen-Chat) expect input in the `messages` format.
- Omitting `role` and `content` for these models might lead to:
  - Errors in processing.
  - Misinterpretation of the input, resulting in lower-quality responses.

---

### **5. Generality and Consistency**
- Using `role` and `content` creates a standardized format that works across different chat-focused models.
- It ensures compatibility, making it easier to switch models or integrate them into applications.

---

### **What If the Model Does Not Support `messages`?**

If the model you’re using does not natively support the `messages` structure:

1. **Simplify the Input**:
   - Convert `messages` into a plain text prompt:
     ```python
     prompt = "user: Given the company Tesla, describe the best Pokemon it would represent. Write a short description of the Pokemon-company."
     ```
   - Pass the `prompt` to the pipeline:
     ```python
     output = pipe(prompt)
     print(output[0]["generated_text"])
     ```

2. **Use a Chat-Compatible Model**:
   - Models designed for chat (e.g., `Qwen-Chat`) will handle the `messages` format directly.

---

### **Key Takeaway**
The `messages` format is **not mandatory** for all models but is critical for:
- Models trained on conversational or instruction-following datasets.
- Applications requiring multi-turn chat or contextual interactions.

If the model does not support `messages`, convert it into a single string prompt to ensure compatibility.
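
As a rough sketch of that conversion, the hypothetical helper `messages_to_prompt` below flattens a `messages` list into a single `role: content` string. Ending the prompt with an open `assistant:` line is an assumption made here to nudge a plain completion model into answering; it is not a requirement of any particular model.

```python
def messages_to_prompt(messages):
    """Flatten role/content messages into one plain-text prompt."""
    lines = [f"{message['role']}: {message['content']}" for message in messages]
    # Assumption: an open assistant turn encourages the model to continue as the assistant.
    lines.append("assistant:")
    return "\n".join(lines)


example_messages = [
    {"role": "user", "content": "Given the company Tesla, describe the best Pokemon it would represent."}
]
flat_prompt = messages_to_prompt(example_messages)
print(flat_prompt)
```

The resulting string can be passed to the pipeline in place of the `messages` list.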


In [6]:
# Generate the output
output = pipe(messages)
print(output[0]["generated_text"])

Tesla could potentially be represented by the Pokémon "Gyarados." Gyarados is known for its massive size and powerful attacks, which align with Tesla's focus on innovation and technological advancements.

**Description:**
In the world of Pokémon, Gyarados stands as an iconic symbol of Tesla's innovative spirit. This colossal water-type Pokémon boasts a sleek, streamlined body that resembles a futuristic submarine, reflecting Tesla's commitment to cutting-edge technology in all aspects of life. Its large fins and sharp teeth suggest


We can see that the `Qwen/Qwen2.5-1.5B-Instruct` model, given a simple prompt, generated an acceptable response.

But there is an important flaw: the generated text is cut off mid-sentence because generation stops at the `max_new_tokens=100` limit.

We can allow the model to generate more tokens to complete the thought.



# **2. Prompt engineering with an advanced prompt**

## Advanced Components of Prompts
For more nuanced tasks, consider incorporating the following:

1. **Persona**: Define a role for the LLM. For example, “You are a Pokemon expert and a financial analyst.”
2. **Instruction**: Specify the task clearly. Example: “Describe a company as a Pokemon.”
3. **Context**: Provide background or additional information. Example: “Focus on Tesla’s innovation.”
4. **Format**: Specify the structure of the output. Example: “Output the response in JSON format.”
5. **Audience**: Tailor the tone or complexity to the intended audience. Example: “The output is for investors.”
6. **Tone**: Adjust the language style. Example: “Use a playful tone.”


In [7]:
# Prompt components
persona = "You are a Pokemon expert and a financial analyst in the stock market.\n"
instruction = "Given a company name, you will describe the best pokemon it would represent.\n"
context = "Write a short description of the pokemon-company.\n"
data_format = "Give the output in a json format.\n"
audience = "The output is designed for investors.\n"
tone = "The tone should be playful.\n"
text = "Tesla"
data = f"Company: {text}"

# The full prompt - remove and add pieces to view its impact on the generated output
query = persona + instruction + context + data_format + audience + tone + data

messages = [
    {"role": "user", "content": query}
]

print(pipe.tokenizer.apply_chat_template(messages, tokenize=False))

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
You are a Pokemon expert and a financial analyst in the stock market.
Given a company name, you will describe the best pokemon it would represent.
Write a short description of the pokemon-company.
Give the output in a json format.
The output is designed for investors.
The tone should be playful.
Company: Tesla<|im_end|>



In [9]:
output = pipe(messages)
print(output[0]["generated_text"])

```json
{
  "pokemon_company": {
    "name": "Eevee",
    "description": "Tesla is like an Eevee - versatile and adaptable, always evolving to meet new challenges. Just as Eevee can transform into various forms based on its surroundings, Tesla has transformed from an electric car pioneer into a leader in renewable energy solutions. Its journey continues with innovations that push boundaries and redefine what's possible."
  }
}
```


We can see that our Qwen model produced well-formed JSON, wrapped in a Markdown code fence, just as we instructed.
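
One way to check the output programmatically is to strip the Markdown fence and parse what remains. The sketch below assumes the model wraps its answer in a fenced block as in the output above; `extract_json` is a hypothetical helper, and the sample string is a shortened stand-in for the real generation.

```python
import json
import re

# Three backticks, built programmatically so this snippet can sit inside a fenced block.
FENCE = "`" * 3


def extract_json(text):
    """Parse JSON from model output, tolerating an optional Markdown code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    # json.loads raises json.JSONDecodeError if the payload is not valid JSON.
    return json.loads(payload)


sample = FENCE + 'json\n{"pokemon_company": {"name": "Eevee"}}\n' + FENCE
parsed = extract_json(sample)
print(parsed["pokemon_company"]["name"])
```

If parsing raises an exception, the output was not valid JSON, which is a more reliable check than reading it by eye.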

# **3. Adjusting Temperature and Top-p in Text Generation**



When working with text-generation models, **temperature** and **top-p (nucleus sampling)** are key parameters that control the diversity and creativity of the model's output. Fine-tuning these parameters allows you to balance randomness and coherence, tailoring the model’s response to your specific needs.

---

## **1. What Is Temperature?**

- **Definition**:
  - Temperature is a parameter that adjusts the randomness of the model's predictions.
  - It divides the model's logits by the temperature before the softmax, which reshapes the next-token probabilities and influences how deterministic or creative the output is.

- **How It Works**:
  - Temperatures below 1 sharpen the distribution around the most likely tokens, making the output more deterministic.
  - Temperatures above 1 flatten the probability distribution, introducing more randomness.

- **Examples**:
  - **Low Temperature (e.g., 0.2)**:
    - Focuses on the most probable tokens, generating predictable and focused text.
    - Suitable for tasks like factual summarization or code generation.
  - **High Temperature (e.g., 1.0 or higher)**:
    - Expands the range of token selection, leading to more diverse and creative outputs.
    - Ideal for storytelling, poetry, or open-ended tasks.
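
The effect of temperature can be sketched in a few lines of plain Python. The logits below are made-up values for three candidate tokens, and `softmax_with_temperature` is an illustrative helper rather than the library's internal implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
sharp = softmax_with_temperature(logits, 0.2)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
for name, probs in [("T=0.2", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

At a low temperature almost all probability lands on the top token, while at a high temperature the three tokens move closer to equal odds.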

---

## **2. What Is Top-p (Nucleus Sampling)?**

- **Definition**:
  - Top-p sets a threshold on the cumulative probability of the tokens considered for selection.
  - It limits sampling to the smallest set of most probable tokens whose cumulative probability reaches \( p \).

- **How It Works**:
  - The model generates tokens by sampling from a dynamically adjusted subset of all possible tokens.
  - The size of the subset depends on the cumulative probability \( p \).

- **Examples**:
  - **Low Top-p (e.g., 0.2)**:
    - Restricts the selection to the few highest-probability tokens that together account for 20% of the probability mass (often a single token).
    - Produces focused and coherent outputs.
  - **High Top-p (e.g., 0.9)**:
    - Expands the range of potential tokens, allowing for more diverse outputs.
    - Balances creativity and coherence.

- **How It Differs from Temperature**:
  - While temperature scales probabilities globally, top-p selects a subset of tokens based on their cumulative probabilities.
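
A minimal sketch of nucleus filtering, assuming a made-up four-token distribution: tokens are ranked by probability, kept until their cumulative mass reaches \( p \), and then renormalized. `top_p_filter` is a hypothetical helper for illustration.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the smallest set whose mass reaches p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    return {index: prob / cumulative for index, prob in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token probabilities
narrow = top_p_filter(probs, 0.2)
wide = top_p_filter(probs, 0.9)
print("top_p=0.2 keeps tokens:", sorted(narrow))
print("top_p=0.9 keeps tokens:", sorted(wide))
```

With `p = 0.2` only the single most likely token survives, because it alone already carries more than 20% of the probability mass; the size of the kept set depends on the shape of the distribution, not on a fixed share of the vocabulary.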

---

## **3. Using Temperature and Top-p Together**

- These parameters can be used independently or combined for better control:
  - **High Temperature + High Top-p**: Encourages maximum creativity, often at the cost of coherence.
  - **Low Temperature + Low Top-p**: Ensures deterministic and precise outputs.
  - **High Temperature + Low Top-p**: Adds creativity but restricts it to highly probable tokens for a balance of diversity and accuracy.

---

## **4. Choosing the Right Settings**

### **Task-Specific Suggestions**:
| **Task**                | **Temperature** | **Top-p** |
|--------------------------|-----------------|-----------|
| Factual Summarization    | 0.2–0.5         | 0.9–1.0   |
| Creative Writing         | 0.8–1.2         | 0.8–1.0   |
| Code Generation          | 0.1–0.3         | 0.9–1.0   |
| Open-Ended Dialogues     | 0.7–1.0         | 0.8–1.0   |

### **Experimentation Is Key**:
- Start with a common baseline: Temperature = **1.0**, Top-p = **0.9**.
- Adjust incrementally to see how the output changes.
- Combine settings for the desired balance of creativity and accuracy.

---

## **5. Example Code**

```python
from transformers import pipeline

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model="gpt2",
    max_new_tokens=100,
    temperature=0.7,  # Adjust randomness
    top_p=0.9,        # Use nucleus sampling
    do_sample=True    # Enable sampling
)

# Generate text
output = pipe("Once upon a time, in a land far away, there was a", num_return_sequences=1)
print(output[0]["generated_text"])
```

You can override the default parameters of an already defined pipeline by passing them as arguments during inference.

In [10]:
# Enable sampling with a moderate temperature
output = pipe(messages, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])

{
  "Pokemon": {
    "Name": "Steelix",
    "Description": "A powerful and versatile electrician who can handle any situation with ease. Just like Tesla, this company has the ability to adapt to new technologies and stay ahead of the curve.",
    "Strengths": [
      "High Voltage Power - The strength of Tesla's batteries powering their cars and energy solutions.",
      "Innovative Thinking - Steelix constantly looks for ways to improve and innovate, much like how


In [11]:
# Using a high top_p
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

```json
{
  "pokemonCompany": {
    "name": "Pikachu",
    "description": "As a financial powerhouse, Tesla represents the electric charge behind its innovative technology. Just like Pikachu's ability to rapidly generate electricity, Tesla’s cutting-edge electric vehicles and renewable energy solutions propel the world forward with speed and efficiency.",
    "traits": [
      "Fast-paced growth",
      "Innovation at the forefront",
      "Evolving products and services"
    ],
    "


# **4. In-Context Learning: Guiding LLMs with Examples**

While detailed descriptions and instructions are essential for crafting effective prompts, there's another powerful tool in prompt engineering: **examples**.

By showing the model exactly what we want it to accomplish, we can achieve more precise and reliable outputs.

This technique, often called **in-context learning**, allows the model to learn from the examples within the same prompt, without requiring additional fine-tuning.

---

## **What Is In-Context Learning?**
In-context learning involves providing examples of the task directly within the prompt.

These examples demonstrate the structure, tone, and content of the desired output, making it easier for the model to understand and replicate.

There are three main types of in-context learning:
1. **Zero-Shot Prompting**: No examples are provided; the model must rely entirely on the instruction.
2. **One-Shot Prompting**: A single example is included in the prompt.
3. **Few-Shot Prompting**: Multiple examples are used to guide the model.

Each method has its strengths:
- **Zero-Shot**: Works well for straightforward tasks with clear instructions.
- **One-Shot/Few-Shot**: Improves performance for more complex tasks by reducing ambiguity and providing context.
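
The three variants differ only in how many worked examples precede the real query. The sketch below builds the `messages` list for each case by turning every example into a user/assistant pair; `build_few_shot_messages` is a hypothetical helper, and the example answers are invented placeholders, not model outputs.

```python
def build_few_shot_messages(instruction, examples, query):
    """Build chat messages where each example is a user/assistant turn before the query."""
    messages = []
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": f"{instruction}\n{example_input}"})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": f"{instruction}\n{query}"})
    return messages


instruction = "Given a company name, describe the best pokemon it would represent as JSON."
# Invented placeholder examples, used only to show the structure.
examples = [
    ("Company: Nintendo", '{"pokemon": "Pikachu", "type": "Electric"}'),
    ("Company: Shell", '{"pokemon": "Charizard", "type": "Fire/Flying"}'),
]

zero_shot = build_few_shot_messages(instruction, [], "Company: Tesla")
one_shot = build_few_shot_messages(instruction, examples[:1], "Company: Tesla")
few_shot = build_few_shot_messages(instruction, examples, "Company: Tesla")
print(len(zero_shot), len(one_shot), len(few_shot))
```

Any of these lists can be passed to a chat pipeline in place of a single-message prompt.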

---

## **Why Use Examples?**
Examples allow you to:
- **Clarify Ambiguity**: Show the model the exact format and type of output you expect.
- **Reduce Errors**: By replicating the example, the model is less likely to deviate from the intended task.
- **Demonstrate Style**: Examples help convey tone, style, and complexity of language for creative or domain-specific tasks.


In [12]:
# One-shot learning: Providing an example of the output structure
instruction = """Given a company name, you will describe the best pokemon it would represent. Use exactly this format:

{
  "company": "The provided company name",
  "pokemon": "The best pokemon it represents from the first generation",
  "description": "A short description in 40 characters",
  "type": "The pokemon type",
  "moves": {
    "move_1": "the first company move that should be relevant to the company",
    "move_2": "the second company move that should be relevant to the company",
    "move_3": "the third company move that should be relevant to the company",
    "move_4": "the fourth company move that should be relevant to the company"
  }
}
"""

text = "Tesla"
data = f"Company: {text}"

query = instruction + data

one_shot_prompt = [
    {"role": "user", "content": query}
]

# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

{
  "company": "Tesla",
  "pokemon": "Eevee",
  "description": "Fast and versatile.",
  "type": "Normal/Flying",
  "moves": {
    "move_1": "Quick Attack",
    "move_2": "Fly",
    "move_3": "Thunderbolt",
    "move_4": "Psychic"
  }
}


# **5. Forcing JSON output**

When working with Large Language Models (LLMs), one challenge is ensuring the generated output adheres to specific formats or rules.

Even with carefully crafted prompts, models might occasionally generate undesired or invalid results.

### **What is Grammar-Constrained Sampling?**

Grammar-constrained sampling ensures that the output of an LLM adheres to predefined formats or rules. Unlike regular prompting or few-shot learning, where we guide the model with examples or instructions, this method validates and enforces output structures during generation.

#### **Key Benefits**:
1. **Output Validation**: Ensures adherence to specific formats, such as JSON or XML.
2. **Enhanced Control**: Prevents models from generating irrelevant or invalid tokens.
3. **Improved Reliability**: Reduces post-processing errors in applications that require structured data.

---

### **How It Works**

1. **Defining Rules or Grammar**:
   - Predefine constraints (e.g., JSON schema, predefined tokens, or formats).
   - Limit the model’s possible outputs during token sampling.

2. **Validation During Generation**:
   - Instead of validating the output after generation, grammar-constrained sampling applies rules in real-time during token selection.

3. **Applications**:
   - **Sentiment Classification**: Limit output to predefined labels like `"positive"`, `"neutral"`, and `"negative"`.
   - **Structured Data Generation**: Generate JSON, XML, or other structured formats.
   - **Domain-Specific Outputs**: Enforce outputs relevant to a specific domain (e.g., RPG character creation, company profiles).
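
The idea of applying rules during token selection can be illustrated with a toy decoder for the sentiment-label case. Everything here is an assumption made for the sketch: the tiny vocabulary, the fake `toy_scores` function standing in for model logits, and the prefix-matching rule. Real implementations, such as the grammar support in llama.cpp, work on the model's actual vocabulary with a formal grammar.

```python
ALLOWED_LABELS = ["positive", "neutral", "negative"]
VOCAB = ["pos", "neg", "neu", "itive", "ative", "tral", "great", "!", "<eos>"]


def allowed_next_tokens(prefix):
    """Tokens that keep the text a prefix of an allowed label, or end a complete label."""
    allowed = []
    for token in VOCAB:
        if token == "<eos>":
            if prefix in ALLOWED_LABELS:
                allowed.append(token)
        elif any(label.startswith(prefix + token) for label in ALLOWED_LABELS):
            allowed.append(token)
    return allowed


def constrained_greedy_decode(score_fn, max_steps=10):
    """Greedy decoding that only chooses among tokens the constraint allows."""
    prefix = ""
    for _ in range(max_steps):
        candidates = allowed_next_tokens(prefix)
        if not candidates:
            break
        best = max(candidates, key=lambda token: score_fn(prefix, token))
        if best == "<eos>":
            break
        prefix += best
    return prefix


def toy_scores(prefix, token):
    """Stand-in for model logits: this fake model prefers off-format tokens."""
    preferences = {"great": 5.0, "!": 4.0, "pos": 3.0, "itive": 2.0, "<eos>": 1.0}
    return preferences.get(token, 0.0)


label = constrained_greedy_decode(toy_scores)
print(label)
```

Without the constraint the fake model would start with `great`; with it, the only reachable outputs are the three predefined labels.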




In [13]:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF",
    filename="Mistral-7B-Instruct-v0.3.Q2_K.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False,
)

Mistral-7B-Instruct-v0.3.Q2_K.gguf:   0%|          | 0.00/2.72G [00:00<?, ?B/s]

llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


## **LLaMA and Quantized Models**

LLaMA (Large Language Model Meta AI) is a family of large language models from Meta designed to be efficient and scalable for various NLP tasks. The `llama-cpp-python` library used above provides Python bindings for llama.cpp, which runs quantized GGUF models from this family as well as other architectures, such as the Mistral model loaded here.

### **What Are Quantized Models?**
Quantization reduces the precision of a model's weights from 16-bit or 32-bit floating point to lower-precision formats (e.g., 8-bit, 4-bit). This substantially reduces the model's size and memory use and often speeds up inference, usually with a small impact on accuracy at 8-bit or 4-bit and a larger one at lower bit widths.

### **Advantages of Quantization:**
1. **Reduced Memory Footprint**: Smaller models use less storage and memory.
2. **Faster Inference**: Quantized models perform computations faster, especially on devices with limited resources.
3. **Efficient GPU/CPU Usage**: Reduces computational overhead, making it possible to run larger models on less powerful hardware.
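
The precision trade-off can be seen with a simple round-to-nearest scheme on a handful of made-up weights. This is a sketch of the general idea only: it uses one shared scale for all values, whereas the GGUF formats discussed below use more elaborate block-wise schemes.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers using a single shared scale (round-to-nearest)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


weights = [0.12, -0.53, 0.88, -0.07, 0.31, -0.95, 0.44, 0.02]  # hypothetical weights
errors = {}
for bits in (8, 4, 2):
    quantized, scale = quantize_symmetric(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(quantized, scale))
    print(f"{bits}-bit: ints={quantized}, mean abs error={errors[bits]:.4f}")
```

Fewer bits mean fewer representable values, so the reconstruction error grows as the bit width shrinks.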

---

## **Parameters for Loading the Model with `llama-cpp-python`**

The following parameters are used when loading a quantized GGUF model with `Llama.from_pretrained`:

| **Parameter**     | **Description**                                                                                           |
|--------------------|-----------------------------------------------------------------------------------------------------------|
| **repo_id**        | The Hugging Face repository where the quantized model resides.                                            |
| **filename**       | The specific file containing the quantized model.                                                        |
| **n_gpu_layers**   | The number of layers to offload to the GPU. Setting `-1` offloads all layers to the GPU.                  |
| **n_ctx**          | The context window size in tokens: the maximum combined length of the prompt and the generated text.     |
| **verbose**        | A boolean flag to enable or disable detailed logging during the model loading process.                    |

---

## **Quantization Levels and File Sizes**

Quantization levels are indicated in the model filenames and determine the precision of the weights:

| **Quantization Level** | **Precision**            | **Approx. File Size** | **Description**                                                                        |
|------------------------|--------------------------|-----------------------|----------------------------------------------------------------------------------------|
| **IQ1_M**              | 1-bit (Medium)           | ~1.76 GB              | Extremely compact with significant trade-offs in accuracy.                             |
| **IQ2_XS**             | 2-bit (Extra Small)      | ~2.2 GB               | Compact and suitable for lightweight applications.                                     |
| **IQ3_XS**             | 3-bit (Extra Small)      | ~3.02 GB              | Balanced between compactness and performance.                                          |
| **IQ4_XS**             | 4-bit (Extra Small)      | ~3.91 GB              | Higher precision, suitable for more demanding applications.                            |
| **Q2_K**               | 2-bit (k-quant)          | ~2.72 GB              | Optimized for efficiency while maintaining reasonable performance.                     |
| **Q6_K**               | 6-bit (k-quant)          | ~5.95 GB              | High precision, offering excellent accuracy at the cost of larger file size.           |
| **fp16**               | 16-bit (Floating Point)  | ~14.5 GB              | Unquantized half-precision weights for maximum accuracy, requiring significant memory. |
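
The file sizes in the table can be related to bit widths with simple arithmetic: size is roughly the parameter count times the bits per weight, divided by 8. The sketch below assumes about 7.25 billion parameters for Mistral-7B and decimal gigabytes; both are approximations chosen for illustration.

```python
PARAMETER_COUNT = 7.25e9  # assumed approximate parameter count for Mistral-7B


def estimated_size_gb(bits_per_weight, parameter_count=PARAMETER_COUNT):
    """Estimate file size in decimal GB from the average bits stored per weight."""
    return parameter_count * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(file_size_gb, parameter_count=PARAMETER_COUNT):
    """Work backwards from a file size to the average bits stored per weight."""
    return file_size_gb * 1e9 * 8 / parameter_count


print(f"fp16 estimate: {estimated_size_gb(16):.1f} GB")
for name, size_gb in [("Q2_K", 2.72), ("Q6_K", 5.95), ("fp16", 14.5)]:
    print(f"{name}: ~{effective_bits_per_weight(size_gb):.1f} bits per weight")
```

Under these assumptions the Q2_K file averages noticeably more than 2 bits per weight, which is likely because quantized formats also store per-block scaling data and keep some tensors at higher precision.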

---

## **Suffixes and Their Meanings**

The model filenames include suffixes that provide additional information about their configuration:

| **Suffix**    | **Meaning**                                                                                       |
|---------------|---------------------------------------------------------------------------------------------------|
| **Q2**        | Indicates 2-bit quantization for model weights.                                                   |
| **Q3/Q4**     | Indicates 3-bit or 4-bit quantization, offering progressively higher precision.                   |
| **K**         | Refers to the k-quant family of quantization schemes used by llama.cpp.                           |
| **XS**        | Extra Small configuration, optimized for minimal resource usage.                                  |
| **fp16**      | Unquantized 16-bit floating-point (half precision) weights, for high-accuracy use cases but resource-heavy. |
| **M/S/L**     | Medium (M), Small (S), or Large (L) variants, denoting size and precision variations within a quantization level. |


In [14]:
# Generate output
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Given the company Tesla, describe the best pokemon it would represent along with its attributes and moves."},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)['choices'][0]['message']["content"]


In [15]:
import json

# Format as json
json_output = json.dumps(json.loads(output), indent=4)
print(json_output)

{
    "company": "Tesla",
    "pokemon": "Voltorb",
    "attributes": {
        "type": [
            "Electric"
        ],
        "hidden_ability": [
            "Run Away"
        ],
        "stats": {
            "HP": 72,
            "Attack": 60,
            "Defense": 60,
            "Special Attack": 60,
            "Special Defense": 60,
            "Speed": 90
        },
        "evolution": {
            "evolves_from": "Pok\u00e9mon",
            "evolution_level": 22
        },
        "moves": [
            "Tail Glow",
            "Self-Destruct",
            "Swift",
            "Thunder Shock"
        ]
    },
    "description": "Voltorb, representing Tesla, embodies the company's innovative and forward-thinking spirit. Its Electric type reflects Tesla's focus on electric vehicles. Voltorb's moves, such as Tail Glow and Self-Destruct, symbolize the risk and potential for destruction inherent in innovation and technological advancement. The hidden ability Run Away repre