# 📋 Creating Basic Prompts for Data Extraction

**Prompting** is the art of crafting instructions for language models to get desired outputs. This notebook explores fundamental prompting techniques for extracting structured information from unstructured text.

## 🎯 What You'll Learn

1. Loading and using language models for text extraction
2. Creating basic prompts for data extraction
3. Controlling output quality with generation parameters
4. Using repetition penalty to improve results
5. Enforcing structured output formats
6. Prompt engineering techniques
7. Understanding sampling vs greedy decoding
8. Tuning model behavior with temperature, top-k, and top-p

---

## 📊 Project Overview

**Goal:** Extract structured information from resumes across different industries.

**Target Fields:**
- 📝 Name
- 🏠 Address
- 📞 Phone number
- 📧 Email

**Sample Data:**
- Name: Tammy Jones
- Address: 4759 Sunnydale Lane, Plano, Texas, United States 75071
- Phone: 1234567890
- Email: youremailcom

---

## 🚀 Section 1: Setup and Model Loading

### 📥 Loading the Model

We'll use the Qwen/Qwen2.5-0.5B model for text generation and information extraction.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(42)
model_name = "Qwen/Qwen2.5-0.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)


---

## 📂 Section 2: Loading Resume Data

### 💾 Loading the Dataset

We'll load resume data from a CSV file containing text from various resumes.

In [None]:
import pandas as pd

cvs_df = pd.read_csv('https://raw.githubusercontent.com/AI360-Labs/GenAI_Fundamentals/refs/heads/main/data/resumes/resumes_scraped_sampled.csv')
cvs_df.head(3)

### 🔢 Understanding Token Counts

**Important:** Estimating token counts helps you check that a prompt fits the model's input limit and choose a sensible `max_new_tokens` for the output.

**Rules of Thumb (English text, approximate):**
- 1 token ≈ 4 characters
- 1 token ≈ ¾ of a word
- 100 tokens ≈ 75 words

In [None]:
# Count whitespace-separated words, then convert with the 1 token ~= 3/4 word rule
number_of_words = len(cvs_df.cv_text.values[0].split())
print("Approximate number of tokens in the prompt:", round(number_of_words / 0.75))
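
The two rules of thumb above can give noticeably different numbers for the same text. The sketch below compares them; `estimate_tokens` is a hypothetical helper introduced here for illustration, not part of the notebook or of any library.

```python
def estimate_tokens(text: str) -> dict:
    """Rough token estimates based on the two rules of thumb."""
    n_chars = len(text)
    n_words = len(text.split())
    return {
        "by_chars": round(n_chars / 4),     # 1 token ~= 4 characters
        "by_words": round(n_words / 0.75),  # 1 token ~= 3/4 of a word
    }

sample = "Tammy Jones, 4759 Sunnydale Lane, Plano, Texas"
print(estimate_tokens(sample))
```

For an exact count, the tokenizer loaded earlier can be used directly, for example `len(tokenizer(text)["input_ids"])`.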

---

## ✍️ Section 3: Creating Basic Prompts

### 📝 First Attempt: Simple Instruction Prompt

Let's create a basic prompt that instructs the model to extract specific information.

In [None]:
import torch

prompt = """I will provide you with a resume of a person.
Please extract the following information from the resume:
1. Name
2. Address
3. Phone number
4. Email

Here is the resume:
{cv_text}
"""

prompt = prompt.format(cv_text=cvs_df.cv_text.values[0])
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    )

decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded_output)

### 🔧 Helper Function for Output Cleaning

This function removes the initial prompt from the model's output to isolate the answer.

In [None]:
def extract_answer_from_llm(prompt, decoded_output):
    """Extracts the answer from the LLM output by removing initial prompt."""
    if decoded_output.startswith(prompt):
        return decoded_output[len(prompt):].strip()
    else:
        return decoded_output

In [None]:
cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
print(cleaned_output)
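
The string-based helper falls back to returning the full text whenever the decoded output does not start with the exact prompt string. An alternative, sketched below, is to drop the prompt at the token level, since a decoder-only model returns the prompt ids followed by the newly generated ids. `strip_prompt_tokens` is a hypothetical helper, and the ids are toy values standing in for `outputs[0]`.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the ids generated after the prompt (hypothetical helper)."""
    return output_ids[prompt_length:]

# Toy ids standing in for outputs[0]; the first 4 belong to the prompt.
output_ids = [101, 7, 8, 9, 42, 43]
new_ids = strip_prompt_tokens(output_ids, prompt_length=4)
print(new_ids)

# With the objects from this notebook, the call would look roughly like:
#   new_ids = strip_prompt_tokens(outputs[0], inputs["input_ids"].shape[1])
#   answer = tokenizer.decode(new_ids, skip_special_tokens=True).strip()
```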

---

## ⚙️ Section 4: Controlling Output with Parameters

### 🚫 Repetition Penalty

The `repetition_penalty` parameter lowers the scores of tokens that have already appeared, which discourages the model from repeating the same words or phrases.

- **Value = 1.0:** No penalty (default)
- **Value > 1.0:** Penalizes repetition (e.g., 1.1 reduces repetitive text)
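
To make the effect concrete, here is a minimal pure-Python sketch of one common way a repetition penalty is applied to raw scores (logits): positive scores of already-seen tokens are divided by the penalty, negative ones are multiplied by it. The function name and toy values are assumptions for illustration. As far as I know, `transformers` uses a rule of this shape and counts prompt tokens as "seen", so a strong penalty could also discourage copying text verbatim from the resume.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.1):
    """Return a copy of `logits` with already-seen tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Positive scores shrink; negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.0, -1.0, 0.5, 3.0]  # toy scores for a 4-token vocabulary
seen = [0, 1]                    # tokens 0 and 1 already appeared
print(apply_repetition_penalty(logits, seen, penalty=2.0))
# -> [1.0, -2.0, 0.5, 3.0]
```

With `penalty=1.0` the scores are unchanged, which matches the "no penalty" default described above.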

In [None]:
prompt = """I will provide you with a resume of a person.
Please extract the following information from the resume:
1. Name
2. Address
3. Phone number
4. Email

Here is the resume:
{cv_text}
"""

prompt = prompt.format(cv_text=cvs_df.cv_text.values[0])
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1
    )

decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
print(cleaned_output)

---

## 🎯 Section 5: Enforcing Concise Output

### 📋 Adding Output Constraints

Let's refine our prompt to explicitly request less explanatory text and focus on the data.

In [None]:
prompt = """I will provide you with a resume of a person.
Please extract the following information from the resume:
1. Name
2. Address
3. Phone number
4. Email

Here is the resume:
{cv_text}

Do not provide any additional information except name, address, phone and email.
"""

prompt = prompt.format(cv_text=cvs_df.cv_text.values[0])
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1
    )

decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
print(cleaned_output)

### ❌ Attempt: Further Restricting Explanations

Here we add another restriction to suppress explanatory comments. Note that this may not always work: the model can still ignore negative instructions.

In [None]:
prompt = """I will provide you with a resume of a person.
Please extract the following information from the resume:
1. Name
2. Address
3. Phone number
4. Email

Here is the resume:
{cv_text}

Do not provide any additional information except name, address, phone and email.
Do not give additional commentary or explanation.
"""

prompt = prompt.format(cv_text=cvs_df.cv_text.values[0])
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1
    )

decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
print(cleaned_output)

---

## 💡 Section 6: Advanced Prompting Technique - Priming

### 🎲 Suggesting the Next Token

Instead of allowing the model to start with phrases like "Sure! Here is the answer", we can **prime** the response by ending the prompt with "Answer:". The model continues from that cue, which makes the extracted fields the most likely next tokens.

In [None]:
prompt = """I will provide you with a resume of a person.
Please extract the following information from the resume:
1. Name
2. Address
3. Phone number
4. Email

Here is the resume:
{cv_text}

Do not provide any additional information except name, address, phone and email.
Do not give additional commentary or explanation.
Answer:
"""

prompt = prompt.format(cv_text=cvs_df.cv_text.values[0])
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1
    )

decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
print(cleaned_output)

---

## 🏗️ Section 7: Structured Prompt Engineering

### 🎨 Refactoring for Better Results

Applying best practices for prompt design:

1. **Define Role & Goal:** Establishes context and pushes the model toward relevant token space
2. **Repeat Key Information:** Reinforces important instructions
3. **Use Special Tokens:** Leverage model-sensitive markers (XML tags, Markdown `###`, or model-specific tokens like `[/INST]`)

In [None]:
prompt = """### Role:
You are an AI assistant specialized in extracting structured information from resumes.
Your goal is to locate relevant information in the provided resume text and return it in a JSON format exactly as it is in the document.
Return the information in a JSON format with the keys: "name", "address", "phone", "email".

### Ask:
I will provide you with a resume of a person.
Please extract the information from the resume

### Here is the resume:
{cv_text}

### Return only JSON with the keys: "name", "address", "phone", "email".
### Answer:
"""

prompt = prompt.format(cv_text=cvs_df.cv_text.values[0])
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
    )

decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
print(cleaned_output)
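
Because the prompt asks for JSON, the cleaned output can be checked programmatically instead of by eye. The sketch below is one possible validator, assuming the model returns a flat JSON object (no nested braces), possibly surrounded by extra text. `parse_extraction` and `REQUIRED_KEYS` are hypothetical names introduced for illustration.

```python
import json
import re

REQUIRED_KEYS = {"name", "address", "phone", "email"}

def parse_extraction(raw_output):
    """Return the first flat JSON object in `raw_output` as a dict, or None."""
    match = re.search(r"\{.*?\}", raw_output, flags=re.DOTALL)
    if match is None:
        return None  # no JSON-looking span at all
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # looked like JSON but was malformed
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None  # valid JSON, but some expected keys are missing
    return data

raw = (
    'Here you go: {"name": "Tammy Jones", "address": "4759 Sunnydale Lane", '
    '"phone": "1234567890", "email": "youremailcom"} ### End'
)
print(parse_extraction(raw))
```

In the notebook this would be called as `parse_extraction(cleaned_output)`; a `None` result signals that the prompt or the generation parameters need another iteration.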

---

## 🎲 Section 8: Sampling vs Greedy Decoding

### 🔍 The `do_sample` Parameter

The `do_sample` parameter controls how the model selects tokens:

**`do_sample=False` (Greedy Decoding, the default unless the model's generation config overrides it):**
- Model always picks the most probable token
- Results are deterministic and consistent
- Less creative, more predictable

**`do_sample=True` (Sampling):**
- Model samples from probability distribution
- Results vary between runs
- More creative and diverse
- Requires tuning with:
  - `temperature` (randomness)
  - `top_p` (nucleus sampling)
  - `top_k` (top-k sampling)
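
The difference between the two modes can be illustrated on a toy next-token distribution, without loading a model. This is a sketch using only the standard library; the tokens and probabilities are made up for illustration.

```python
import random

tokens = ["Tammy", "Tamara", "Tom"]
probs = [0.6, 0.3, 0.1]  # toy next-token distribution

def greedy_pick(tokens, probs):
    """Greedy decoding: always return the most probable token."""
    return max(zip(tokens, probs), key=lambda pair: pair[1])[0]

def sample_pick(tokens, probs, rng):
    """Sampling: draw one token at random, weighted by probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]

rng = random.Random(42)  # seeded so the run is repeatable
print([greedy_pick(tokens, probs) for _ in range(5)])
print([sample_pick(tokens, probs, rng) for _ in range(5)])
```

The greedy list contains the same token five times, while the sampled list can mix in the less probable options.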

### 🎯 Experiment: Greedy Decoding (`do_sample=False`)

**Key Observations:**
- Responses are **consistent** across all executions
- Values may be **formatted** rather than extracted exactly as written

In [None]:
prompt = """### Role:
You are an AI assistant specialized in extracting structured information from resumes.
Your goal is to locate relevant information in the provided resume text and return it in a JSON format exactly as it is in the document.
Return the information in a JSON format with the keys: "name", "address", "phone", "email".

### Ask:
I will provide you with a resume of a person.
Please extract the information from the resume

### Here is the resume:
{cv_text}

### Return only JSON with the keys: "name", "address", "phone", "email".
### Answer:
"""

prompt = prompt.format(cv_text=cvs_df.cv_text.values[0])
inputs = tokenizer(prompt, return_tensors="pt").to(device)

In [None]:
for i in range(3):  # `i` avoids shadowing the built-in `iter`
    print(f"Iteration {i + 1}")
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=False,
        )

    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
    print(cleaned_output, "\n" + "="*50 + "\n")

### 🎨 Experiment: Sampling (`do_sample=True`)

**Key Observations:**
- Responses are **inconsistent** across executions
- Model may produce **more hallucinations**
- Results vary each time

In [None]:
for i in range(3):  # `i` avoids shadowing the built-in `iter`
    print(f"Iteration {i + 1}")
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        )

    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
    print(cleaned_output, "\n" + "="*50 + "\n")

---

## 🌡️ Section 9: Fine-Tuning with Temperature

### 🌡️ Temperature

Temperature controls the model's **creativity** and **randomness**.

**How it works:**
- Divides the logits by the temperature before the softmax, which reshapes the probability distribution
- **Low temperature** (e.g., 0.01): Sharp distribution, more focused/deterministic
- **High temperature** (e.g., 1.5): Flat distribution, more diverse/creative

**Note:** Sampling stays active even at very low temperature. If several tokens still have similar probabilities after scaling, any of them can be picked, so identical outputs across runs are likely but not guaranteed.
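
The sharpening and flattening effect can be seen numerically with a small softmax sketch. The logits below are toy values chosen for illustration, and `softmax_with_temperature` is a hypothetical helper written with the standard library only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate tokens
for t in (0.1, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At `t=0.1` almost all probability mass sits on the first token; at `t=1.5` the three probabilities move closer together.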

In [None]:
for i in range(3):  # `i` avoids shadowing the built-in `iter`
    print(f"Iteration {i + 1}")
    print("-"*15)
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=1e-6,  # Near-zero temperature: sampling behaves almost like greedy decoding
        )

    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
    print(cleaned_output, "\n")

---

## 🎯 Section 10: Top-p (Nucleus) Sampling

### 🎯 Top-p Parameter

Top-p restricts sampling to the **smallest set of most probable tokens whose cumulative probability reaches p**, then samples from that set.

- **High top-p** (e.g., 0.9): Wide range of word choices
- **Low top-p** (e.g., 0.01): Limits to most probable tokens only
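
The selection rule can be sketched on a toy distribution: sort tokens by probability, keep adding them until the running total reaches `p`, then renormalize what is left. `top_p_filter` is a hypothetical helper for illustration and the probabilities are made up.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose probabilities sum to at least p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1 again.
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]   # toy next-token probabilities
print(top_p_filter(probs, 0.9))   # keeps token ids 0, 1 and 2
print(top_p_filter(probs, 0.01))  # keeps only token id 0
```

The number of surviving tokens depends on the shape of the distribution, which is why top-p is described as a dynamic cutoff.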

In [None]:
for i in range(3):  # `i` avoids shadowing the built-in `iter`
    print(f"Iteration {i + 1}")
    print("-"*15)
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_p=1e-2,  # Very low top_p keeps only the most probable token(s); raise it for more randomness
        )

    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
    print(cleaned_output, "\n")

---

## 🔢 Section 11: Top-k Sampling

### 🔢 Top-k Parameter

Top-k limits the number of tokens considered for sampling.

- **top_k = 1:** No sampling (greedy decoding - always picks most probable token)
- **top_k = 5:** Choose from top 5 most probable tokens
- **top_k = 50:** Choose from top 50 most probable tokens
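
A minimal sketch of the same idea on a toy distribution: keep the `k` most probable tokens, discard the rest, and renormalize. `top_k_filter` is a hypothetical helper for illustration and the probabilities are made up.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize them."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token probabilities
print(top_k_filter(probs, 1))    # only the most probable token survives
print(top_k_filter(probs, 2))    # token ids 0 and 1 share the probability mass
```

Unlike top-p, the number of candidates here is fixed at `k` regardless of how confident the model is.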

In [None]:
for i in range(3):  # `i` avoids shadowing the built-in `iter`
    print(f"Iteration {i + 1}")
    print("-"*15)
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_k=1,  # top_k=1 keeps only the most probable token; raise it for more randomness
        )

    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    cleaned_output = extract_answer_from_llm(prompt=prompt, decoded_output=decoded_output)
    print(cleaned_output, "\n")

---

## 📚 Summary

### ✨ Key Concepts Covered

1. **Model Setup**: Loading and configuring language models for text extraction
2. **Basic Prompting**: Creating simple instruction-based prompts
3. **Token Counting**: Estimating input/output sizes
4. **Output Cleaning**: Removing prompts from model responses
5. **Repetition Penalty**: Preventing repetitive text generation
6. **Output Constraints**: Enforcing concise, focused responses
7. **Priming Technique**: Suggesting next tokens to guide output
8. **Structured Prompts**: Using roles, goals, and special tokens
9. **Sampling vs Greedy**: Understanding deterministic vs probabilistic decoding
10. **Temperature**: Controlling creativity and randomness
11. **Top-p Sampling**: Dynamic probability-based token selection
12. **Top-k Sampling**: Limiting token choices to top k options

### 💡 Best Practices

- ✅ **Define clear roles and goals** in prompts
- ✅ **Repeat important instructions** for emphasis
- ✅ **Use special tokens** (###, XML tags) for structure
- ✅ **Prime responses** with keywords like "Answer:"
- ✅ **Use low temperature** for factual extraction
- ✅ **Apply repetition penalty** to avoid loops
- ✅ **Combine parameters** for optimal results

### 🎯 Next Steps

- 🔹 Experiment with different prompt structures
- 🔹 Try various parameter combinations
- 🔹 Explore structured output formats (JSON)
- 🔹 Compare different sampling strategies
- 🔹 Build on these techniques for complex extraction tasks

### 📖 Reference

For more information on prompting parameters:
- [Prompt Engineering with Temperature and Top-p](https://promptengineering.org/prompt-engineering-with-temperature-and-top-p/#the-overlooked-power-of-llm-parameters-in-prompt-engineering)

---

### 🎓 Congratulations!

You now understand the fundamentals of prompt engineering for data extraction and how to control model behavior using various generation parameters.