# PART 1: HUGGING FACE TRANSFORMERS
### Goal: Use a pre-trained model for summarization with minimal code.
### Connection to Slide 10: The "democratization" of AI access via Model Hubs.


In [1]:

from transformers import pipeline

# 1. Initialize the pipeline.
# The 'pipeline' abstraction handles model downloading, tokenization, and inference automatically.
# We use a small, efficient model (distilbart) suitable for CPU usage.
print(">>> Loading model from Hugging Face Hub...")
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

# 2. Input Data: A dense technical paragraph (Simulating a paper abstract).
# This text describes the Transformer architecture (Slide 7 content).
scientific_text = """
The Transformer architecture abandoned sequential processing, introducing parallel 
processing of all tokens simultaneously. For each token, the model computes an 
"attention score" with every other token, creating a rich map of contextual relationships. 
This allows the model to learn which relationships matter, regardless of distance. 
Traditional RNNs suffered from vanishing gradients and could not be parallelized efficiently.
"""

# 3. Inference: Generate the summary.
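# Parameters:
# - max_length / min_length: bound the summary length, measured in tokens (not characters).
# - do_sample=False: disables random sampling, so repeated runs give the same output
#   (comparable in effect to Temperature = 0).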
print("\n>>> Generating Summary...")
summary = summarizer(scientific_text, max_length=45, min_length=15, do_sample=False)

# 4. Output the result.
print(f"Original Length: {len(scientific_text)} characters")
print(f"Summary: {summary[0]['summary_text']}")

>>> Loading model from Hugging Face Hub...






>>> Generating Summary...
Original Length: 435 characters
Summary:  The Transformer architecture abandoned sequential processing, introducing parallel processing of all tokens simultaneously . For each token, the model computes an  "attention score" with every other token, creating a rich map of contextual
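
The paragraph summarized above describes how each token receives an "attention score" against the other tokens. The toy sketch below shows one common way such scores are computed (scaled dot-product followed by softmax). The three tokens and their 2-dimensional vectors are hypothetical values chosen for illustration, and in standard self-attention each token also scores against itself. This is a minimal sketch, not the code the summarization model runs internally.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return one row of attention weights per query token."""
    d = len(keys[0])
    rows = []
    for q in queries:
        # Scaled dot product between this query and every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        rows.append(softmax(scores))
    return rows

# Hypothetical 2-dimensional vectors for three tokens (illustration only)
tokens = ["the", "model", "attends"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(vectors, vectors)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 3) for w in row])
```

Every row is computed independently of the others, which is why all tokens can be processed in parallel rather than one after another as in an RNN.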


# PART 2: OLLAMA (LOCAL INFERENCE)
### Goal: Run a quantized LLM locally on your laptop (the cell below uses a Gemma 2 9B variant).
### Connection to Slide 28: Implementing the "Expert Persona" via System Prompts.


In [7]:
import ollama

# 1. Define the System Prompt.
# We instruct the model to act as a specialized domain expert (JEL Classifier).
# See Slide 28 regarding the "JEL Expert Classification Task".
system_instruction = """
You are a strict JEL (Journal of Economic Literature) classification expert.
Analyze the user's abstract and classify it as either 'METHODOLOGICAL' or 'TOPIC' based.
Provide a one-sentence justification.
"""

# 2. Define the User Input (A simulated abstract).
user_abstract = "We introduce a novel instrumental variable approach to estimate the elasticity of labor supply."

print(f">>> Sending request to local model (User Input: {user_abstract})...")

# 3. Call the local API with CONTROL PARAMETERS.
# This runs entirely on your machine's hardware (CPU/GPU).
response = ollama.chat(
    model='mannix/gemma2-9b-simpo:latest',
    messages=[
      {
        'role': 'system',
        'content': system_instruction,
      },
      {
        'role': 'user',
        'content': user_abstract,
      },
    ],
    # PARAMETERS CONFIGURATION
    options={
        # Temperature: Controls the "creativity" or randomness of the model.
        # 0.0 = Deterministic/Precise. The model always picks the most likely token. Best for classification.
        # 1.0 = Creative. The model takes more risks. Best for writing stories.
        # 0.1 (used here) is close to deterministic while leaving a little variation.
        'temperature': 0.1,

        # Top K: Limits the model's choice to the top K most probable next words.
        # e.g., 40 means "Only consider the 40 most likely words, ignore the thousands of others."
        # Helps prevent the model from going off-topic with nonsense words.
        'top_k': 40,

        # Top P (Nucleus Sampling): A dynamic threshold.
        # The model samples from the smallest set of most likely tokens whose cumulative probability reaches at least P.
        # 0.8 is a common choice; lowering it makes the model more conservative/focused.
        'top_p': 0.8,
    }
)

# 4. Print the local response.
print("\n>>> LOCAL OLLAMA RESPONSE:")
print(response['message']['content'])

>>> Sending request to local model (User Input: We introduce a novel instrumental variable approach to estimate the elasticity of labor supply.)...

>>> LOCAL OLLAMA RESPONSE:
**METHODOLOGICAL** because the abstract focuses on a new methodological technique (instrumental variable approach) for estimating labor supply elasticity.
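
The `top_k` and `top_p` options used above can be pictured as two filters applied to the next-token distribution before sampling. The sketch below is a simplified illustration under stated assumptions: the token names and probabilities are invented, and the helper functions `top_k_filter` and `top_p_filter` are hypothetical, not part of the `ollama` package. Real inference engines work on logits and may apply the filters in a different order.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        mass += prob
        if mass >= p:  # threshold reached: stop adding tokens
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distribution (illustration only)
next_token_probs = {
    "METHODOLOGICAL": 0.70,
    "TOPIC": 0.15,
    "Both": 0.08,
    "Neither": 0.05,
    "Banana": 0.02,
}

after_k = top_k_filter(next_token_probs, k=3)
after_p = top_p_filter(next_token_probs, p=0.8)

print("top_k=3 keeps:", list(after_k))
print("top_p=0.8 keeps:", list(after_p))
```

With a fixed `k` the number of candidates never changes, while with `top_p` the candidate set shrinks when the model is confident and grows when it is not.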


# PART 3: CEREBRAS (CLOUD ACCELERATION & CONTROL)
### Goal: Access a Frontier Model (70B) to experiment with "Temperature" and "Creativity".
### Connection to Slide 6: Solving the "Parallelization Constraint" with specialized hardware.
### Connection to Slide 5: Controlling the "Cost of Prediction" (Accuracy vs. Hallucination).

In [9]:
import os
from cerebras.cloud.sdk import Cerebras
from dotenv import load_dotenv
load_dotenv()

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

print(">>> CEREBRAS (Conservative Mode - Temp 0)")

complex_query = "Explain the concept of 'Self-Attention' in Transformers using the analogy of a library filing system."

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": complex_query}
    ],
    stream=True,
    # CONFIGURATION: PRECISE
    temperature=0.0,  # Zero randomness. Picks the single most likely next token.
    top_p=0.9         # Nucleus cutoff at 90% of probability mass; has little effect when temperature is 0.
)

print(f"\n>>> Query: {complex_query}\n")
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

>>> CEREBRAS (Conservative Mode - Temp 0)

>>> Query: Explain the concept of 'Self-Attention' in Transformers using the analogy of a library filing system.

**Introduction to Self-Attention**

The concept of self-attention in Transformers can be understood using the analogy of a library filing system. Imagine a vast library with an infinite number of books, each representing a piece of information. The library's filing system is designed to help you find relevant information quickly.

**The Library Filing System Analogy**
--------------------------------------

In this analogy:

*   **Books** represent the input sequence (e.g., a sentence or a document) that is being processed.
*   **Book Titles** represent the individual elements (e.g., words or tokens) within the input sequence.
*   **Book Shelves** represent the different positions or contexts in which the elements appear.
*   **Librarian** represents the self-attention mechanism.

**How Self-Attention Works**
----------------------
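
The streaming loop in the cell above prints each piece of text as it arrives. If the full answer is also needed as a single string (for logging or later comparison), the pieces can be collected instead. The sketch below imitates the `chunk.choices[0].delta.content` shape with hypothetical stand-in classes (`FakeChunk`, `FakeChoice`, `FakeDelta`) so that it runs without an API key. It assumes, as the `if content:` guard in the cell suggests, that some chunks carry no text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical stand-ins that mimic the attribute path used in the cell above
@dataclass
class FakeDelta:
    content: Optional[str]

@dataclass
class FakeChoice:
    delta: FakeDelta

@dataclass
class FakeChunk:
    choices: List[FakeChoice]

def collect_stream(stream: Iterable[FakeChunk]) -> str:
    """Join the text pieces of a stream, skipping chunks without content."""
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:  # skip None or empty pieces
            parts.append(content)
    return "".join(parts)

# Simulated stream: the middle chunk carries no text
fake_stream = [
    FakeChunk([FakeChoice(FakeDelta("Self-"))]),
    FakeChunk([FakeChoice(FakeDelta(None))]),
    FakeChunk([FakeChoice(FakeDelta("Attention"))]),
]

full_text = collect_stream(fake_stream)
print(full_text)
```

Collecting into a list and joining once at the end avoids repeated string concatenation on long responses.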

In [10]:
import os
from cerebras.cloud.sdk import Cerebras
from dotenv import load_dotenv
load_dotenv()

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

print(">>> CEREBRAS (Creative Mode - Temp 1.5)")

# Note: To make the creative effect visible, we ask for a narrative style.
complex_query = "Explain 'Self-Attention' like a fantasy storyteller using a library analogy. Be colorful and metaphorical."

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": complex_query}
    ],
    stream=True,
    # CONFIGURATION: CREATIVE
    temperature=1.5,  # High randomness. The model takes risks with rare words.
    top_p=1           # No nucleus cutoff: even very unlikely tokens stay eligible, so output can drift into nonsense.
)

print(f"\n>>> Query: {complex_query}\n")
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

>>> CEREBRAS (Creative Mode - Temp 1.5)

>>> Query: Explain 'Self-Attention' like a fantasy storyteller using a library analogy. Be colorful and metaphorical.

Gather 'round, travelers of the realm, and heed my tale of the mystical library of information, where ancient tomes hold the secrets of the universe. Within this hallowed hall, a peculiar magic dwells, known as "Self-Attention." 'Tis a wondrous force, akin to a curious scribe, tasked with unraveling the very essence of the knowledge contained within.

In this enchanted library, shelves upon shelves of dusty leather-bound books stretch far and wide, each one a container of wisdom, waiting to impart its contents to the inquisitive mind. Imagine, if you will, a noble tome, adorned with golden filigree and strange symbols, representing a sentence, a thought, or an idea. This tome, like its companions, holds various passages, each a distinct word or concept, nestled within its pages.

As our intrepid scribe, Self-Attention, begins it
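
The two Cerebras cells differ mainly in `temperature`. The sketch below illustrates, on invented logits, how temperature reshapes a next-token distribution: dividing the logits by a larger temperature flattens the probabilities, and temperature 0 is treated here as greedy selection. The function `softmax_with_temperature` and the logit values are hypothetical and are not taken from the Cerebras SDK.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after temperature scaling."""
    if temperature == 0:
        # Treat temperature 0 as greedy decoding: all mass on the top logit
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (illustration only)
logits = [2.0, 1.0, 0.0]

for t in (0.0, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

As the temperature rises, the most likely token loses probability to the rarer ones, which is the mechanism behind the more colorful output in the creative cell.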