In [2]:
from transformers import pipeline

# Sample Text
text = (
    "Hello TechEmporium, I recently purchased a set of noise-canceling headphones "
    "from your online store in Canada. Upon receiving the item, I noticed a defect: "
    "the left earcup doesn’t produce sound. As someone who relies on quality audio "
    "for my work, this is a major inconvenience. I would appreciate a replacement or "
    "a refund. Attached are the order details and proof of purchase. Please let me "
    "know how we can resolve this issue promptly. Best regards, Alex."
)

def print_model_info(pipe):
    model = pipe.model
    model_size = sum(p.numel() for p in model.parameters())
    print(f"Model: {model.name_or_path}, Size: {model_size:,} parameters")

# Sentiment Analysis
print("\n**Sentiment Analysis**")
sentiment_pipeline = pipeline("sentiment-analysis")
print_model_info(sentiment_pipeline)
sentiment_result = sentiment_pipeline(text)
print(sentiment_result)

# Named Entity Recognition (NER)
print("\n**Named Entity Recognition**")
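# `grouped_entities=True` is deprecated; `aggregation_strategy="simple"` is its replacement
ner_pipeline = pipeline("ner", aggregation_strategy="simple")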
print_model_info(ner_pipeline)
ner_result = ner_pipeline(text)
print(ner_result)

# Question Answering
print("\n**Question Answering**")
qa_pipeline = pipeline("question-answering")
print_model_info(qa_pipeline)
question = "What is the defect?"
qa_result = qa_pipeline(question=question, context=text)
print(qa_result)

# Translation (English to Spanish)
print("\n**Translation**")
translation_pipeline = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print_model_info(translation_pipeline)
translation_result = translation_pipeline(text, max_length=200)
print(translation_result)

# Summarization
print("\n**Summarization**")
summarization_pipeline = pipeline("summarization")
print_model_info(summarization_pipeline)
summarization_result = summarization_pipeline(text, max_length=50, min_length=25, do_sample=False)
print(summarization_result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



**Sentiment Analysis**


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Model: distilbert/distilbert-base-uncased-finetuned-sst-2-english, Size: 66,955,010 parameters
[{'label': 'NEGATIVE', 'score': 0.999281108379364}]

**Named Entity Recognition**


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Model: dbmdz/bert-large-cased-finetuned-conll03-english, Size: 332,538,889 parameters


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'ORG', 'score': 0.9927805, 'word': 'TechEmporium', 'start': 6, 'end': 18}, {'entity_group': 'LOC', 'score': 0.9996623, 'word': 'Canada', 'start': 103, 'end': 109}, {'entity_group': 'PER', 'score': 0.90358573, 'word': 'Alex', 'start': 451, 'end': 455}]

**Question Answering**


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Model: distilbert/distilbert-base-cased-distilled-squad, Size: 65,192,450 parameters
{'score': 0.70292729139328, 'start': 156, 'end': 193, 'answer': 'the left earcup doesn’t produce sound'}

**Translation**


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Model: Helsinki-NLP/opus-mt-en-es, Size: 77,943,296 parameters


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'translation_text': 'Hola TechEmporium, hace poco compré un conjunto de auriculares de cancelación de ruido en su tienda en línea en Canadá. Al recibir el artículo, noté un defecto: la oreja izquierda no produce sonido. Como alguien que confía en audio de calidad para mi trabajo, esto es un inconveniente importante. Me gustaría un reemplazo o un reembolso. Adjunto son los detalles del pedido y la prueba de compra. Por favor, hágame saber cómo podemos resolver este problema rápidamente.'}]

**Summarization**


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Model: sshleifer/distilbart-cnn-12-6, Size: 305,510,400 parameters
[{'summary_text': " TechEmporium's noise-canceling headphones fail to produce sound . The left earcup doesn't produce sound, and this is a major inconvenience for Alex ."}]
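
The warnings in the output above recommend naming the model and revision explicitly and passing a `device`. A minimal sketch of how that could look, using the model IDs and revision hashes printed in those warnings. The helper names `pinned_pipeline_kwargs` and `make_pipeline` are hypothetical, and `device=0` assumes a single CUDA GPU is the intended target:

```python
# Model IDs and revisions copied from the warnings printed above.
PINNED = {
    "sentiment-analysis": ("distilbert/distilbert-base-uncased-finetuned-sst-2-english", "714eb0f"),
    "ner": ("dbmdz/bert-large-cased-finetuned-conll03-english", "4c53496"),
    "question-answering": ("distilbert/distilbert-base-cased-distilled-squad", "564e9b5"),
    "summarization": ("sshleifer/distilbart-cnn-12-6", "a4f8f3e"),
}


def pinned_pipeline_kwargs(task, use_gpu=False):
    """Build keyword arguments for `transformers.pipeline` with a pinned model (hypothetical helper)."""
    model, revision = PINNED[task]
    # device=-1 selects the CPU; device=0 selects the first GPU.
    return {"task": task, "model": model, "revision": revision, "device": 0 if use_gpu else -1}


def make_pipeline(task, use_gpu=False):
    """Create the pipeline; transformers is imported lazily so the helper above has no dependencies."""
    from transformers import pipeline
    return pipeline(**pinned_pipeline_kwargs(task, use_gpu))


print(pinned_pipeline_kwargs("ner", use_gpu=True))
```

Calling `make_pipeline("sentiment-analysis")` would then load the same checkpoint each time, independent of whatever default a future `transformers` release picks.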


In [7]:
import os
from openai import OpenAI

# Load OpenAI API Key
api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

def chatgpt_prompt(prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an assistant that performs various NLP tasks."},
            {"role": "user", "content": prompt}
        ]
    )    
    return response.choices[0].message.content


# Sample Text
text = (
    "Hello TechEmporium, I recently purchased a set of noise-canceling headphones "
    "from your online store in Canada. Upon receiving the item, I noticed a defect: "
    "the left earcup doesn’t produce sound. As someone who relies on quality audio "
    "for my work, this is a major inconvenience. I would appreciate a replacement or "
    "a refund. Attached are the order details and proof of purchase. Please let me "
    "know how we can resolve this issue promptly. Best regards, Alex."
)

# Sentiment Analysis
print("\n**Sentiment Analysis**")
sentiment_prompt = f"Perform sentiment analysis on the following text: \n{text}"
sentiment_result = chatgpt_prompt(sentiment_prompt)
print(sentiment_result)

# Named Entity Recognition (NER)
print("\n**Named Entity Recognition**")
ner_prompt = f"Identify and categorize the named entities in the following text: \n{text}"
ner_result = chatgpt_prompt(ner_prompt)
print(ner_result)

# Question Answering
print("\n**Question Answering**")
question = "What is the defect?"
qa_prompt = f"Based on the following text, answer the question: \nText: {text}\nQuestion: {question}"
qa_result = chatgpt_prompt(qa_prompt)
print(qa_result)

# Translation (English to French)
print("\n**Translation**")
translation_prompt = f"Translate the following text to French: \n{text}"
translation_result = chatgpt_prompt(translation_prompt)
print(translation_result)

# Summarization
print("\n**Summarization**")
summarization_prompt = f"Summarize the following text in 25-50 words: \n{text}"
summarization_result = chatgpt_prompt(summarization_prompt)
print(summarization_result)



**Sentiment Analysis**
The sentiment in the text is primarily negative. The customer expresses dissatisfaction because the product they received is defective, which is a major inconvenience for them. However, the tone is also polite and professional, as the customer is asking for a resolution (either a replacement or a refund) and providing necessary details for follow-up.

**Named Entity Recognition**
Here are the named entities in the text categorized:

1. **Organization**: 
   - TechEmporium

2. **Location**: 
   - Canada

3. **Person**: 
   - Alex

These entities are identified as significant names or terms within the text that refer to specific people, organizations, or locations.

**Question Answering**
The defect is that the left earcup of the noise-canceling headphones doesn’t produce sound.

**Translation**
Bonjour TechEmporium,

J'ai récemment acheté une paire de casques à réduction de bruit sur votre boutique en ligne au Canada. À la réception de l'article, j'ai remarqué un

In [None]:
import os
import openai
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Cache for local models
MODEL_CACHE = {}

def llm_prompt(prompt, model_str, max_length=256):
    """
    Generate a response from an LLM based on the given model string.
    
    Parameters:
        - prompt (str): The input text prompt.
        - model_str (str): The model identifier (e.g., 'gpt-4o' or 'meta-llama/Llama-3.2-3B-Instruct').
        - max_length (int, optional): Maximum response length. Default is 256.
    
    Returns:
        - str: The model-generated response.
    """
    
    # Case 1: OpenAI API Model
    if model_str in ["gpt-4o", "gpt-4", "gpt-3.5-turbo"]:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OpenAI API key not found. Set OPENAI_API_KEY in the environment.")
        
        # openai>=1.0 removed `openai.ChatCompletion`; use a client object instead
        client = openai.OpenAI(api_key=api_key)
        try:
            response = client.chat.completions.create(
                model=model_str,
                messages=[{"role": "system", "content": "You are an AI assistant."},
                          {"role": "user", "content": prompt}],
                max_tokens=max_length,
                temperature=0.7
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            return f"OpenAI API error: {str(e)}"

    # Case 2: Local Hugging Face Model
    else:
        if model_str not in MODEL_CACHE:
            try:
                print(f"Loading model: {model_str} (this may take a while)...")
                MODEL_CACHE[model_str] = {
                    "model": AutoModelForCausalLM.from_pretrained(model_str, torch_dtype=torch.float16, device_map="auto"),
                    "tokenizer": AutoTokenizer.from_pretrained(model_str)
                }
            except Exception as e:
                return f"Error loading model {model_str}: {str(e)}"
        
        model = MODEL_CACHE[model_str]["model"]
        tokenizer = MODEL_CACHE[model_str]["tokenizer"]

        # Tokenize input
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

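        # Generate response; temperature only takes effect when sampling is enabled,
        # and max_new_tokens bounds the response rather than prompt + response
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=max_length, do_sample=True, temperature=0.7)

        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = output[0][inputs["input_ids"].shape[-1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)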



The **temperature** parameter in the `llm_prompt` function controls the **randomness** of the model's output. It determines how deterministic or creative the generated response will be.

### **How Temperature Affects Output:**
- **Low temperature (e.g., 0.1 - 0.3)** → **More deterministic**  
  - The model almost always picks the most probable tokens, so repeated runs give consistent responses.
  - Best for tasks where reproducibility matters, such as summarization or answering factual questions.
  
- **Medium temperature (e.g., 0.5 - 0.7)** → **Balanced creativity and reliability**  
  - Allows some variation while maintaining coherence.
  - Good for general-purpose conversations.

- **High temperature (e.g., 0.8 - 1.5)** → **More diverse and creative**  
  - The model will take more risks, selecting less likely words more often.
  - Can produce more original and imaginative responses but may become less coherent.
  - Suitable for brainstorming, storytelling, or creative writing.
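
The effect of these ranges can be seen on a toy example. The sketch below is not how any particular library implements it; it only illustrates the standard idea of dividing the logits by the temperature before the softmax. The function name `softmax_with_temperature` and the three logit values are made up for illustration:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lowest temperature nearly all probability mass lands on the top-scoring token; as the temperature rises the distribution flattens, so less likely tokens are sampled more often.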

### **Where It Applies in Your Function:**
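In the OpenAI branch of the function, the value is passed straight to the chat-completion call (shown here with the `openai>=1.0` client interface):
```python
# client = openai.OpenAI(api_key=api_key)
response = client.chat.completions.create(
    model=model_str,
    messages=[{"role": "system", "content": "You are an AI assistant."},
              {"role": "user", "content": prompt}],
    max_tokens=max_length,
    temperature=0.7  # <-- Here
)
```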
- Setting `temperature=0.7` is a common middle-ground default between variety and coherence.
- For near-deterministic responses, set `temperature=0.0`; a small value such as `0.1` still samples, just with little variation.
- For more variety in responses, increase it to `0.9` or higher.

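For local Hugging Face models, `generate()` accepts the same parameter, but it only takes effect when sampling is enabled with `do_sample=True`; without it, `temperature` is ignored:
```python
output = model.generate(**inputs, max_new_tokens=max_length, do_sample=True, temperature=0.7)
```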

### **Should You Make Temperature Adjustable?**
If your use case requires different levels of randomness (e.g., factual QA vs. creative writing), you could add `temperature` as a parameter:
```python
def llm_prompt(prompt, model_str, max_length=256, temperature=0.7):
```
Then use it like this:
```python
llm_prompt("Tell me a story.", "gpt-4o", temperature=1.0)  # More creative
llm_prompt("What is the capital of France?", "gpt-4o", temperature=0.1)  # More deterministic
```

In [None]:
import os
import openai
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Cache for local models to avoid reloading
MODEL_CACHE = {}

def llm_prompt(prompt, model_str, max_length=256, temperature=0.7):
    """
    Generate a response from an LLM based on the given model string.
    
    Parameters:
        - prompt (str): The input text prompt.
        - model_str (str): The model identifier (e.g., 'gpt-4o' or 'meta-llama/Llama-3.2-3B-Instruct').
        - max_length (int, optional): Maximum response length. Default is 256.
        - temperature (float, optional): Controls randomness (0.0 = deterministic, 1.0+ = creative). Default is 0.7.
    
    Returns:
        - str: The model-generated response.
    """

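    # Case 1: OpenAI API Model
    # Note: o1-family models may reject `max_tokens` and non-default `temperature`;
    # check the current API reference before routing them through this branch.
    if model_str in ["gpt-4o", "gpt-4o-mini", "o1", "o1-mini"]: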
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OpenAI API key not found. Set OPENAI_API_KEY in the environment.")

        # openai>=1.0 removed `openai.ChatCompletion`; use a client object instead
        client = openai.OpenAI(api_key=api_key)
        try:
            response = client.chat.completions.create(
                model=model_str,
                messages=[{"role": "system", "content": "You are an AI assistant."},
                          {"role": "user", "content": prompt}],
                max_tokens=max_length,
                temperature=temperature  # Adjustable temperature
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            return f"OpenAI API error: {str(e)}"

    # Case 2: Local Hugging Face Model (e.g., LLaMA)
    else:
        if model_str not in MODEL_CACHE:
            try:
                print(f"Loading model: {model_str} (this may take a while)...")
                MODEL_CACHE[model_str] = {
                    "model": AutoModelForCausalLM.from_pretrained(model_str, torch_dtype=torch.float16, device_map="auto"),
                    "tokenizer": AutoTokenizer.from_pretrained(model_str)
                }
            except Exception as e:
                return f"Error loading model {model_str}: {str(e)}"
        
        model = MODEL_CACHE[model_str]["model"]
        tokenizer = MODEL_CACHE[model_str]["tokenizer"]

        # Tokenize input
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

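        # Generate response; sample only when temperature > 0, otherwise decode greedily
        # (generate() ignores temperature without do_sample=True and rejects 0 when sampling)
        sampling = {"do_sample": True, "temperature": temperature} if temperature > 0 else {"do_sample": False}
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=max_length, **sampling)

        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = output[0][inputs["input_ids"].shape[-1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)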



In [None]:
# Using OpenAI GPT-4o with a lower temperature for more deterministic responses
response = llm_prompt("Explain quantum entanglement.", "gpt-4o", temperature=0.3)
print(response)

# Using LLaMA with a higher temperature for more creative output
response = llm_prompt("Tell me a sci-fi story.", "meta-llama/Llama-3.2-3B-Instruct", temperature=1.2)
print(response)
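
For the local branch, a temperature only matters when sampling is switched on. The sketch below shows one way to translate a temperature into `generate()` keyword arguments. It assumes a recent `transformers` release in which `temperature` is ignored unless `do_sample=True` and a value of zero is rejected when sampling; `sampling_kwargs` is a hypothetical helper, not part of any library:

```python
def sampling_kwargs(temperature):
    """Map a temperature to keyword arguments for `model.generate()` (hypothetical helper)."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Treat zero as "no randomness": greedy decoding, no temperature passed.
        return {"do_sample": False}
    # Any positive temperature requires sampling to have an effect.
    return {"do_sample": True, "temperature": temperature}


print(sampling_kwargs(0.0))
print(sampling_kwargs(1.2))
```

The result could then be unpacked into the call, for example `model.generate(**inputs, max_new_tokens=256, **sampling_kwargs(1.2))`.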


I've updated your `llm_prompt` function to include **search strategies** for local Hugging Face models, giving you control over how responses are generated. You can now specify **different decoding strategies** using a new parameter:  
- **`search_strategy="greedy"`** (default) – simple, deterministic response  
- **`search_strategy="beam"`** – beam search for better fluency  
- **`search_strategy="top_k"`** – random sampling from the top-k most likely tokens  
- **`search_strategy="top_p"`** – nucleus sampling for diverse responses  
- **`search_strategy="contrastive"`** – contrastive search for high-quality output  
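
To make the difference between the first two strategies concrete, here is a small, library-free sketch. The `NEXT` table is a made-up toy "model" of next-token probabilities, and `greedy` and `beam_search` are illustrative implementations rather than what `generate()` does internally:

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "end": 0.3},
    "a": {"bird": 0.9, "end": 0.1},
}


def greedy(start, steps):
    """Always take the single most likely next token."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        options = NEXT.get(seq[-1])
        if not options:
            break
        token, p = max(options.items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(p)
    return seq, logp


def beam_search(start, steps, num_beams):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            options = NEXT.get(seq[-1])
            if not options:
                candidates.append((seq, logp))  # finished sequence carries over
                continue
            for token, p in options.items():
                candidates.append((seq + [token], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


print(greedy("<s>", 2))
print(beam_search("<s>", 2, num_beams=2))
```

In this toy table greedy decoding commits to "the" because it is the likelier first token, while beam search keeps "a" alive and finds that "a bird" has the higher overall probability (0.36 versus 0.24).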

---

### **Updated `llm_prompt` Function**
```python
import os
import openai
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Cache for local models to avoid reloading
MODEL_CACHE = {}

def llm_prompt(prompt, model_str, max_length=256, temperature=0.7, search_strategy="greedy", top_k=50, top_p=0.9, num_beams=5):
    """
    Generate a response from an LLM based on the given model string.
    
    Parameters:
        - prompt (str): The input text prompt.
        - model_str (str): The model identifier (e.g., 'gpt-4o' or 'meta-llama/Llama-3.2-3B-Instruct').
        - max_length (int, optional): Maximum response length. Default is 256.
        - temperature (float, optional): Controls randomness (0.0 = deterministic, 1.0+ = creative). Default is 0.7.
        - search_strategy (str, optional): Decoding strategy. Options: 'greedy', 'beam', 'top_k', 'top_p', 'contrastive'. Default is 'greedy'.
        - top_k (int, optional): Used if `search_strategy="top_k"`. Default is 50.
        - top_p (float, optional): Used if `search_strategy="top_p"`. Default is 0.9.
        - num_beams (int, optional): Used if `search_strategy="beam"`. Default is 5.

    Returns:
        - str: The model-generated response.
    """

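    # Case 1: OpenAI API Model
    # Note: o1-family models may reject `max_tokens` and non-default `temperature`;
    # check the current API reference before routing them through this branch.
    if model_str in ["gpt-4o", "gpt-4o-mini", "o1", "o1-mini"]: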
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OpenAI API key not found. Set OPENAI_API_KEY in the environment.")

        # openai>=1.0 removed `openai.ChatCompletion`; use a client object instead
        client = openai.OpenAI(api_key=api_key)
        try:
            response = client.chat.completions.create(
                model=model_str,
                messages=[{"role": "system", "content": "You are an AI assistant."},
                          {"role": "user", "content": prompt}],
                max_tokens=max_length,
                temperature=temperature  # Adjustable temperature
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            return f"OpenAI API error: {str(e)}"

    # Case 2: Local Hugging Face Model (e.g., LLaMA)
    else:
        if model_str not in MODEL_CACHE:
            try:
                print(f"Loading model: {model_str} (this may take a while)...")
                MODEL_CACHE[model_str] = {
                    "model": AutoModelForCausalLM.from_pretrained(model_str, torch_dtype=torch.float16, device_map="auto"),
                    "tokenizer": AutoTokenizer.from_pretrained(model_str)
                }
            except Exception as e:
                return f"Error loading model {model_str}: {str(e)}"
        
        model = MODEL_CACHE[model_str]["model"]
        tokenizer = MODEL_CACHE[model_str]["tokenizer"]

        # Tokenize input
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

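        # Apply search strategy. Temperature is only passed for sampling strategies,
        # since generate() ignores it when do_sample is False.
        gen_kwargs = {"max_new_tokens": max_length}

        if search_strategy == "greedy":
            gen_kwargs["do_sample"] = False  # Greedy decoding
        elif search_strategy == "beam":
            gen_kwargs.update({"do_sample": False, "num_beams": num_beams})  # Beam search
        elif search_strategy == "top_k":
            gen_kwargs.update({"do_sample": True, "top_k": top_k, "temperature": temperature})  # Top-k sampling
        elif search_strategy == "top_p":
            # top_k=0 disables the default top-k filter so only the nucleus cutoff applies
            gen_kwargs.update({"do_sample": True, "top_p": top_p, "top_k": 0, "temperature": temperature})
        elif search_strategy == "contrastive":
            gen_kwargs.update({"do_sample": False, "penalty_alpha": 0.6, "top_k": 4})  # Contrastive search
        else:
            raise ValueError(f"Unknown search_strategy: {search_strategy!r}")

        # Generate response
        with torch.no_grad():
            output = model.generate(**inputs, **gen_kwargs)

        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = output[0][inputs["input_ids"].shape[-1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)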
```

---

### **New Features & Search Strategies**
| Search Strategy  | How It Works | Best Used For |
|------------------|-------------|--------------|
| **greedy** (default) | Picks the single most likely token at each step. | Fast, deterministic output. |
| **beam** | Keeps the `num_beams` highest-scoring partial sequences at each step and returns the best one. | Producing high-probability, fluent text. |
| **top_k** | Samples from the `top_k` most likely tokens. | Adding variety while staying relevant. |
| **top_p** | Samples from the smallest set of tokens whose cumulative probability reaches `top_p`. | More diverse, natural text. |
| **contrastive** | Chooses among the top-k candidates while penalizing tokens too similar to the preceding context (`penalty_alpha`). | High-quality open-ended generation. |
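
The two sampling rows of the table can be illustrated without loading a model. The sketch below filters a made-up next-token distribution; `top_k_filter`, `top_p_filter`, and `sample` are hypothetical helper names, and real libraries operate on logits tensors rather than dictionaries:

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs, rng):
    """Draw one token according to the given probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution.
probs = {"sound": 0.5, "audio": 0.3, "noise": 0.15, "banana": 0.05}
rng = random.Random(0)
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
print(sample(top_p_filter(probs, 0.9), rng))
```

With `k=2` only "sound" and "audio" survive, whereas `p=0.9` keeps three tokens because the first two cover only 0.8 of the probability mass. The nucleus size therefore adapts to how peaked the distribution is, while top-k always keeps a fixed count.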

---

### **Example Usage**
#### **1️⃣ OpenAI GPT-4o (API)**
```python
response = llm_prompt("Explain quantum mechanics.", "gpt-4o", temperature=0.3)
print(response)
```

#### **2️⃣ LLaMA Using Greedy Decoding**
```python
response = llm_prompt("Tell me a story.", "meta-llama/Llama-3.2-3B-Instruct", search_strategy="greedy")
print(response)
```

#### **3️⃣ LLaMA Using Beam Search for Higher Fluency**
```python
response = llm_prompt("Summarize this article.", "meta-llama/Llama-3.2-3B-Instruct", search_strategy="beam", num_beams=5)
print(response)
```

#### **4️⃣ LLaMA Using Top-k Sampling for More Variety**
```python
response = llm_prompt("Write a poem about AI.", "meta-llama/Llama-3.2-3B-Instruct", search_strategy="top_k", top_k=40)
print(response)
```

#### **5️⃣ LLaMA Using Contrastive Search for High-Quality Responses**
```python
response = llm_prompt("Describe the future of AI.", "meta-llama/Llama-3.2-3B-Instruct", search_strategy="contrastive")
print(response)
```

---

### **Why This Update is Useful**
✅ **More Control** – Choose between fluency (beam), diversity (top-p), or determinism (greedy).  
✅ **Better Text Quality** – Contrastive search improves coherence.  
✅ **Preserves API Flexibility** – GPT models still work the same way.  
✅ **Fast & Efficient** – **Caches** the model for reuse instead of reloading it each time.  
