In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig



#### 0 : Load LLMs (Get Model and Tokenizer)

**Temperature:** Controls randomness.  
- Low (~0.1): Predictable, repetitive.  
- High (~1.0): Creative, random.

**Top-p (nucleus sampling):** Samples only from the smallest set of most-likely tokens whose cumulative probability reaches `p`.  
- Keeps likely tokens, ignores the long tail.

**Beam search:** Keeps `k` best sequences each step.  
- Better for tasks needing coherence (translation). Slower.
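
To make the temperature effect concrete, here is a minimal, dependency-free sketch of a temperature-scaled softmax. The function name `softmax_with_temperature` and the example logits are illustrative assumptions, not part of `transformers`; the sketch only shows the general idea of dividing logits by the temperature before normalizing.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]

low_temp = softmax_with_temperature(logits, temperature=0.1)
high_temp = softmax_with_temperature(logits, temperature=1.0)

print([round(p, 3) for p in low_temp])
print([round(p, 3) for p in high_temp])
```

At a low temperature almost all probability lands on the top token, which is why output becomes predictable; at 1.0 the less likely tokens keep a meaningful share.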

In [2]:
# model = AutoModelForCausalLM.from_pretrained(
#     'distilgpt2',
#     device_map = 'auto',
#     torch_dtype = torch.float16,
#     trust_remote_code = True
# )

In [None]:
def load_llm(model_name='distilgpt2'):
    # Tokenizer converts text <-> token IDs (vocabulary handling).
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Load the causal LM weights; the options below control device placement and precision.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map = 'auto', # Auto-place on GPU/CPU (requires the `accelerate` package)
        torch_dtype = torch.float16,  # Half precision (saves memory; mainly useful on GPU)
        trust_remote_code = True # Runs custom code from the model repo; only enable for sources you trust
    )
    return model, tokenizer

model, tokenizer = load_llm()

#### 1 :  Advanced Generation Parameters

In [10]:
def generate(model, tokenizer, prompt, **gene):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            **gene
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

The `generate` helper above does three things:

1. **`with torch.no_grad():`** – Disables gradient calculation (faster, less memory during inference).
2. **`model.generate(**inputs, **gene)`** – Generates tokens using the model with your prompt and any extra parameters.
3. **`tokenizer.decode(...)`** – Converts token IDs back to readable text.

It's the core inference step.

`**inputs`: Prompt tokenized as model input.

`**gene`: Any extra generation parameters you pass (like `temperature`, `top_p`, `max_new_tokens`).
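
The double-star unpacking can be seen in isolation with a stand-in function. This is a sketch only: `fake_generate` and `generate_sketch` are hypothetical names, and the token IDs are made up; a real tokenizer returns tensors under the same `input_ids` and `attention_mask` keys.

```python
def fake_generate(input_ids, attention_mask=None, **generation_kwargs):
    """Hypothetical stand-in for model.generate that just reports what it received."""
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "generation_kwargs": generation_kwargs,
    }

def generate_sketch(inputs, **gene):
    # `**inputs` spreads the tokenizer's dict into named arguments;
    # `**gene` forwards whatever generation options the caller supplied.
    return fake_generate(**inputs, **gene)

# Made-up token IDs standing in for a tokenized prompt.
inputs = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}
received = generate_sketch(inputs, max_new_tokens=12, do_sample=False)
print(received["generation_kwargs"])
```

Because the options travel as keyword arguments, the helper never needs to know which generation parameters exist.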

In [8]:
## Greedy Decoding (basic)

In [5]:
prompt = "hello"

In [6]:
greedy_text = generate(
    model, tokenizer, prompt, max_new_tokens=12, do_sample=False, num_beams=1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [7]:
print(greedy_text)

hello The U.S. Department of Justice has been investigating the


**Greedy decoding:**  
The model picks the highest-probability token at each step, for up to 12 new tokens (`max_new_tokens=12`). Deterministic, no randomness.

`do_sample=False`: Turns off random sampling, so the single most probable token is always chosen.

`num_beams=1`: A beam width of 1 means no beam search, so together with `do_sample=False` this is plain greedy decoding.
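
As a rough illustration of the greedy rule, the sketch below runs it on a toy lookup table. The table `toy_model` and the helper `greedy_decode` are hypothetical; a real language model conditions on the whole prefix rather than only the last token, so this shows the selection rule, not the model.

```python
def greedy_decode(next_token_probs, start_token, max_new_tokens, eos_token="<eos>"):
    """Repeatedly append the single most probable next token."""
    sequence = [start_token]
    for _ in range(max_new_tokens):
        candidates = next_token_probs[sequence[-1]]
        best = max(candidates, key=candidates.get)  # argmax, no randomness
        sequence.append(best)
        if best == eos_token:
            break
    return sequence

# Hypothetical bigram table standing in for a language model.
toy_model = {
    "hello": {"world": 0.6, "there": 0.4},
    "world": {"<eos>": 0.7, "hello": 0.3},
    "there": {"<eos>": 0.9, "world": 0.1},
}

first = greedy_decode(toy_model, "hello", max_new_tokens=12)
second = greedy_decode(toy_model, "hello", max_new_tokens=12)
print(first)
```

Running it twice gives the same sequence, which is the practical meaning of "deterministic" here.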

In [64]:
# 2. Sampling with Temperature (and top-k)

In [4]:
prompt = 'what is love?'

In [5]:
temp = generate(
    model, tokenizer, prompt, max_new_tokens=15, do_sample = True, temperature=1.0, top_k=15)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [6]:
print(temp)

what is love? Is love something, not love.

The truth is love is not


In [63]:
# 3. Nucleus Sampling (top-p)

In [16]:
nucleus = generate(
    model, tokenizer, prompt, max_new_tokens=10, do_sample=True,
    top_p = 0.92, # Consider tokens comprising 92% of probability mass
    temperature=0.7)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [17]:
print(nucleus)

what is love, friendship, friendship, and love, love,


In [21]:
# 4. Extra: sampling with temperature=1.0 and top_p=1.0

In [8]:
prompt

'what is love'

In [35]:
text = generate(
                model, tokenizer, prompt='Jesus Christ',
                max_new_tokens=50,
                do_sample=True,
                temperature=1.0,
                top_p=1.0,
                # pad_token_id=tokenizer.eos_token_id
            )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [36]:
print(text)

Jesus Christ for all the sins of men and of women, for the salvation of God, for the good of the universe, for all the good of humanity. ... That is true, and to the contrary, when God commands the world to go for Him,


Reasonable starting values: `do_sample=True, temperature=0.7, top_p=0.9`. The runs in this notebook use `do_sample=True, temperature=1.0, top_p=1.0`, which turns off both temperature scaling and nucleus filtering.

#### 2 : Compare Generation Across Models (GPT-2 vs DistilGPT-2)

In [37]:
def compare_models(prompt, models_to_test):
    """Compare generation across different models"""
    
    for model_name in models_to_test:
        try:
            print(f"\n{'='*50}")
            print(f"MODEL: {model_name}")
            print(f"{'='*50}")
            
            model, tokenizer = load_llm(model_name)
            
            # Consistent generation parameters
            text = generate(
                model, tokenizer, prompt,
                max_new_tokens=50,
                do_sample=True,
                temperature=1.0,
                top_p=1.0,
                pad_token_id=tokenizer.eos_token_id
            )
            
            print(text)
            print("-------------------------------------------------------------------------------------------------------------------")
            # Clean up memory
            del model
            torch.cuda.empty_cache()
            
        except Exception as e:
            # Covers both loading and generation failures
            print(f"Failed to run {model_name}: {e}")

# Test with smaller, accessible models
test_models = [
    "gpt2",                       # Base GPT-2
    "distilgpt2"                  # Even smaller
]

compare_models("what is love?", test_models)


MODEL: gpt2
what is love? Is it not love?

I am a woman. I am a male.

I am a man. I am a woman.

I am a man. I am a woman.

I am a man. I am
-------------------------------------------------------------------------------------------------------------------

MODEL: distilgpt2
what is love?‷It would seem that it‷could have been‷‷t. it is true. it could mean that you‷you could probably be getting used to it, especially if you don‷ve thought you could love it
-------------------------------------------------------------------------------------------------------------------


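- `do_sample`: Enable random sampling (True) or greedy (False).
- `temperature`: Controls randomness (must be > 0; typically 0.1–1.0, and values above 1.0 flatten the distribution further).
- `top_p`: Nucleus sampling threshold (0.0–1.0; 1.0 disables the filter).
- `pad_token_id`: Token ID for padding. GPT-2 has no pad token, so the EOS token is used; passing it explicitly also silences the `Setting pad_token_id` message.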

- `temperature` scales all logits (sharpens/flattens distribution).
- `top_p` cuts off low-probability tail (removes unlikely tokens).
- `do_sample` enables/disables sampling entirely.

They work together for balanced randomness.
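
The tail cut-off performed by `top_p` can be sketched in plain Python. `top_p_filter` and the example distribution are assumptions for illustration; the `transformers` implementation operates on logits and differs in details, so treat this as the idea rather than the library's code.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution (already softmaxed).
probs = {"love": 0.5, "life": 0.3, "art": 0.15, "zebra": 0.05}

kept_probs = top_p_filter(probs, top_p=0.9)  # drops the unlikely tail
full = top_p_filter(probs, top_p=1.0)        # 1.0 keeps every token

# Sampling then happens only among the surviving tokens.
rng = random.Random(0)
sampled = rng.choices(list(kept_probs), weights=list(kept_probs.values()), k=1)[0]
print(sorted(kept_probs), sampled)
```

With `top_p=0.9` the rarest token can never be drawn, while `top_p=1.0` leaves the distribution untouched.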


#### 3 . Advanced: Streaming Generation

In [8]:
def stream_generate(model, tokenizer, prompt, max_tokens=50):
    
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    generated = inputs.input_ids.clone()

    print('streaming:', end=" ")

    for i in range(max_tokens):
        with torch.no_grad():
            outputs = model(generated)
            next_token = outputs.logits[:, -1, :]

            next_tokens = next_token / 0.7
            the_tokens = torch.multinomial(
                torch.softmax(next_tokens, dim=-1), num_samples=1)

            generated = torch.cat([generated, the_tokens], dim=1)

            new_text = tokenizer.decode(the_tokens[0], skip_special_tokens = True)
            print(new_text, end='', flush=True)

            if the_tokens.item() == tokenizer.eos_token_id:
                break
    print()

In [9]:
# Test streaming
stream_generate(model, tokenizer, "The meaning of life is")

streaming:  to be lived there and we have to remain here.�� The meaning of life is to be lived in a world of all people with dignity and dignity and respect. Life is to be lived in a world of all people with dignity and respect.


- `torch.multinomial` + softmax → equivalent to `do_sample=True`.
- `next_token / 0.7` → **temperature** (0.7 here).
- No `top_p` filtering in this code.

Hardcoded to temperature=0.7 with sampling.
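
The same loop structure can be tried without a model. In the sketch below, `toy_logits` is a hypothetical stand-in for the model's last-position logits, `random.choices` plays the role of `torch.multinomial`, and the temperature is a parameter instead of a hardcoded constant. It yields tokens rather than printing them, which is a design choice of this sketch and not something the notebook code does.

```python
import math
import random

EOS = "<eos>"

def toy_logits(sequence):
    """Hypothetical stand-in for the model's last-position logits: one score per vocabulary token."""
    # Make EOS more attractive as the sequence grows so the loop can stop early.
    return {"to": 1.5, "be": 1.2, "lived": 0.8, EOS: 0.3 * len(sequence)}

def stream_tokens(prompt_tokens, max_tokens=50, temperature=0.7, seed=0):
    rng = random.Random(seed)
    generated = list(prompt_tokens)
    for _ in range(max_tokens):
        logits = toy_logits(generated)
        # exp(logit / T) is the unnormalized softmax; random.choices normalizes the weights itself.
        weights = [math.exp(score / temperature) for score in logits.values()]
        token = rng.choices(list(logits), weights=weights, k=1)[0]
        generated.append(token)  # the longer sequence is fed back in on the next step
        if token == EOS:
            break
        yield token

pieces = list(stream_tokens(["life", "is"], max_tokens=50))
print(" ".join(pieces))
```

A generator lets the caller decide how to display each piece (print, send over a socket, collect into a list) while the sampling logic stays the same.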

#### 4

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig



- `AutoTokenizer`: Loads correct tokenizer for model.
- `AutoModelForCausalLM`: Loads causal language model (for text generation).
- `GenerationConfig`: Stores default generation parameters (optional).
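
`GenerationConfig` is imported in this notebook but not used. A minimal sketch of how it could bundle the sampling settings is shown below; the variable name `sampling_config` is an assumption, and the values are the starting values suggested elsewhere in this notebook. The block only builds the config, so it runs without downloading a model.

```python
from transformers import GenerationConfig

# Bundle the sampling settings once instead of repeating them on every call.
sampling_config = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

print(sampling_config.temperature, sampling_config.top_p)

# With a loaded model and tokenized inputs, it could then be passed as:
# outputs = model.generate(**inputs, generation_config=sampling_config)
```

Keeping the settings in one object makes it easier to reuse the same decoding setup across several prompts or models.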

In [2]:
name  = 'distilgpt2'

In [3]:
tokenizer = AutoTokenizer.from_pretrained(name)

In [4]:
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map = 'auto', # auto-place on gpu/cpu
    torch_dtype = torch.float16, # save memory
    # Allows execution of custom code from Hugging Face (for newer/unconventional models).
    # Safer to keep `False` unless needed.
    trust_remote_code = True
)

`torch_dtype` is deprecated! Use `dtype` instead!


In [5]:
def generate(model, tokenizer, prompt, **gene):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, **gene)
        
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

- `return_tensors='pt'`: Returns PyTorch tensors.
- `.to(model.device)`: Moves tensors to same device as model (GPU/CPU).

`**inputs`: Prompt tokenized as model input.

`**gene`: Any extra generation parameters you pass (like `temperature`, `top_p`, `max_new_tokens`).

- `torch.no_grad()`: Disables gradient tracking (faster inference, less memory).
- `model.generate()`: Runs the model to generate text using the inputs and generation parameters.

- `outputs[0]`: First sequence in batch.
- `skip_special_tokens=True`: Removes special tokens, such as `<|endoftext|>` for GPT-2 (or `<s>`, `</s>` in other tokenizers).
- Returns clean generated text.
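
For decoder-only models such as GPT-2, the sequence returned by `generate` starts with the prompt tokens, which is why the printed texts in this notebook begin with the prompt. The sketch below shows the slicing idea on plain lists; the helper name and the token IDs are made up for illustration.

```python
def split_prompt_and_continuation(output_ids, prompt_length):
    """Separate the echoed prompt tokens from the newly generated ones."""
    return output_ids[:prompt_length], output_ids[prompt_length:]

# Made-up token IDs: the first three stand for the prompt, the rest for new tokens.
prompt_ids = [101, 102, 103]
output_ids = prompt_ids + [7, 8, 9, 10]

echoed, new_tokens = split_prompt_and_continuation(output_ids, len(prompt_ids))
print(new_tokens)

# With real tensors the same slice would look like this (assuming a single prompt):
# new_ids = outputs[0][inputs["input_ids"].shape[1]:]
# text = tokenizer.decode(new_ids, skip_special_tokens=True)
```

Decoding only the new tokens is useful when the prompt should not be repeated in the result.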

In [6]:
prompt = 'salvation'

In [7]:
text = generate(
    model, tokenizer, prompt,
    max_new_tokens=50,
    do_sample = True,
    temperature = 1.0,
    top_p = 1.0,
    # pad_token_id = tokenizer.eos_token_id

)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [8]:
print(text)

salvation of their heart.
For a very deep and deeply spiritual reason, we must believe in prayer and the sacrament of mercy. You are a man who has experienced this suffering.
In our heart, we are blessed to receive the sacrament of mercy and


In [22]:
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++#

In [13]:
def stream(model, tokenizer, prompt, max_tokens=50):
    
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    generated = inputs.input_ids.clone()
    
    # Copies the initial token IDs from the prompt.
    # Creates a tensor that will be extended with new tokens during generation.

    print("streaming: ", end=" ")
    for i in range(max_tokens):
        with torch.no_grad():
            outputs = model(generated)
            tokens = outputs.logits[:, -1, :]
            
            # Gets logits for the **last token position** (`[:, -1, :]`).
            # - `outputs.logits`: Shape `[batch_size, sequence_length, vocab_size]`.
            # - `[:, -1, :]`: Index -1 → last position, whose logits score the next token to generate.

            tokens = tokens / 0.7
            dtokens = torch.multinomial(
                torch.softmax(tokens, dim=-1), num_samples=1)
            
            # 1. `tokens / 0.7`: Applies **temperature** scaling (0.7 = moderate randomness).
            # 2. `torch.softmax(tokens, dim=-1)`: Converts logits to probabilities.
            # 3. `torch.multinomial(..., num_samples=1)`: Samples 1 token from the probability distribution (random sampling).
            # Equivalent to `do_sample=True, temperature=0.7`.
            # For greedy decoding (no sampling), replace the sampling call above with:
            #     dtokens = torch.argmax(tokens, dim=-1, keepdim=True)

            generated = torch.cat([generated, dtokens], dim=1)
            # Appends the newly generated token (dtokens) to the existing sequence (generated) along the sequence dimension (dim=1).
            # Because autoregressive generation builds the output sequentially.
            # Each step adds one new token to the sequence, then feeds the extended sequence back into the model to predict the next token.

            new_text = tokenizer.decode(dtokens[0], skip_special_tokens = True)
            print(new_text, end="", flush=True)
           
            # Decodes the single new token into text and prints it immediately.
            # - `end=""`: Keeps printing on the same line without adding a separator.
            # - `flush=True`: Forces immediate display (for real-time streaming).

            if dtokens.item() == tokenizer.eos_token_id:
                break
    print()

In [14]:
stream(model, tokenizer, 'salvation')  # prints as it goes; returns None

streaming:   of the New Testament, Acts, and Secret Acts, the City of New York (New York: George Washington University Press), and the City of New York (New York: George Washington University Press), and and the City of New York (New York


In [15]:
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++#

In [5]:
from transformers import TextStreamer

In [6]:
# Prints decoded text to stdout as tokens are generated; skip_prompt=True omits the prompt echo.
streamer = TextStreamer(tokenizer, skip_prompt=True)

In [7]:
def generate(model, tokenizer, prompt, **gene):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, streamer=streamer, **gene)
        
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [8]:
prompt = 'Python programming'

In [12]:
text = generate(
    model, tokenizer, prompt,
    max_new_tokens=50,
    do_sample = True,
    temperature = 1.0,
    top_p = 1.0,
    # pad_token_id = tokenizer.eos_token_id

)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 language can be useful for many tasks in the programming language. A full-blown language can provide a great experience for working with specific language-based programming languages. Some popular languages include C++, JRuby, Java or Java to understand the most common
