<a href="https://colab.research.google.com/github/RamsesMDLC/Smolagent_Project_1/blob/main/Smolagents_Project_1_YT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**PROJECT 1**

#**1. LOADING LIBRARIES / MODULES / CLASSES**



In [1]:
#Installs the smolagents library along with extensions defined in the [toolkit] option.
!pip install smolagents[toolkit]

#Components from smolagents
  #CodeAgent: The "agent". It orchestrates reasoning and tool usage.
  #DuckDuckGoSearchTool: The "tool". It lets the agent fetch information from the web.
  #TransformersModel: A wrapper that loads a model from the Hugging Face Hub and runs it locally through the Transformers library.
from smolagents import CodeAgent, DuckDuckGoSearchTool, TransformersModel

#API key
  #Provides a secure way to access stored secrets (like API tokens) within Google Colab.
from google.colab import userdata
  #Allows programmatic login to Hugging Face Hub.
from huggingface_hub import login

#Tokenizer: AutoTokenizer is a class in the Hugging Face Transformers library that converts input text (the prompt) into token IDs and converts the model's output token IDs back into text (the answer).
  #This means AutoTokenizer forms the bridge:
    #Input text → tokens/tensors → Model
      #Splitting text into tokens (smaller pieces such as words or subwords).
      #Converting these tokens into numbers called input IDs, which are packed into tensors that the model uses for computation.
      #Managing extra elements like special tokens (e.g., [CLS], [SEP], padding).
    #Model output tokens/tensors → decoded text
  #It automatically loads and configures the correct tokenizer for a specified model (i.e., there is no need to know the model-specific tokenizer class).
from transformers import AutoTokenizer

Collecting smolagents[toolkit]
  Downloading smolagents-1.22.0-py3-none-any.whl.metadata (16 kB)
Collecting ddgs>=9.0.0 (from smolagents[toolkit])
  Downloading ddgs-9.6.0-py3-none-any.whl.metadata (18 kB)
Collecting markdownify>=0.14.1 (from smolagents[toolkit])
  Downloading markdownify-1.2.0-py3-none-any.whl.metadata (9.9 kB)
Collecting primp>=0.15.0 (from ddgs>=9.0.0->smolagents[toolkit])
  Downloading primp-0.15.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting lxml>=6.0.0 (from ddgs>=9.0.0->smolagents[toolkit])
  Downloading lxml-6.0.2-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.metadata (3.6 kB)
Collecting socksio==1.* (from httpx[brotli,http2,socks]>=0.28.1->ddgs>=9.0.0->smolagents[toolkit])
  Downloading socksio-1.0.0-py3-none-any.whl.metadata (6.1 kB)
Downloading ddgs-9.6.0-py3-none-any.whl (41 kB)

#**2. GETTING THE TOKEN**

In [2]:
# Securely get Hugging Face token and login
hf_token = userdata.get('HF_TOKEN')
if hf_token:
    login(hf_token)
    print("Successfully logged in to Hugging Face!")
else:
    print("Token not found. Please add HF_TOKEN secret.")

Successfully logged in to Hugging Face!


#**3. DEFINING MODEL / TOKENIZER / PAD**


In [3]:
#Defining Model (from Hugging Face)
  #Type: Causal Language Models
  #Training Stage: Pretraining
  #Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
  #Number of Parameters: 0.49B
  #Number of Parameters (Non-Embedding): 0.36B
  #Number of Layers: 24
  #Number of Attention Heads (GQA): 14 for Q and 2 for KV
  #Context Length: Full 32,768 tokens
model_id = "Qwen/Qwen2.5-0.5B"

#Initialize tokenizer
  #Load a pretrained tokenizer for the given model identified by model_id (in this case "Qwen/Qwen2.5-0.5B")
    #The tokenizer includes vocabulary, tokenization rules, special tokens, and associated settings needed to convert raw text into token IDs.
tokenizer = AutoTokenizer.from_pretrained(model_id)

#Check whether the tokenizer has a designated padding token.
  #Padding tokens are used to make all input sequences the same length by adding special "pad" tokens to shorter sequences.
    #Real-world texts vary in length, so to batch-process multiple sequences efficiently, shorter sequences are padded with these special tokens until they match the longest sequence length in the batch.
    #Padding tokens carry no meaningful information and are meant only to fill space for model input consistency.
  #Padding tokens ensure stable input preprocessing and model compatibility during tokenization and generation.
  #If the padding token is not set, the tokenizer or model might throw errors during inference or training.
  #By assigning the EOS (end-of-sequence) token as padding, the code ensures compatibility even when a dedicated pad token is not defined for the particular model.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model
model = TransformersModel(model_id=model_id)

# Fix pad_token_id in model config if not set
if model.model.config.pad_token_id is None:
    model.model.config.pad_token_id = tokenizer.pad_token_id

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/681 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/138 [00:00<?, ?B/s]
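
The padding logic described in the cell above can be illustrated without loading any model. The following is a minimal sketch in plain Python: `PAD_ID` and the token IDs are made-up values for illustration, not Qwen's real IDs, and `pad_batch` is a hypothetical helper, not part of the Transformers API.

```python
# Toy illustration of padding and attention masks (no model or tokenizer needed).
# PAD_ID is a hypothetical value chosen for this sketch, not Qwen's real pad id.
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens first, then pad tokens to fill the remaining positions.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [[11, 12, 13, 14], [21, 22]]
ids, mask = pad_batch(batch)
print(ids)   # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Right padding is used here only because it is the simplest to read. For batched generation with decoder-only models, left padding is commonly preferred so that new tokens continue directly from the real text.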

**Log output (Tokenizer Files Being Downloaded)**

Hugging Face automatically downloads the tokenizer artifacts from the repository Qwen/Qwen2.5-0.5B (the `model_id` set above). These include:

* tokenizer_config.json (1.29kB): Contains metadata about the tokenizer, such as whether it lowercases text, what type of tokenizer is being used (e.g., BPE, SentencePiece), and any special tokens (CLS, SEP, etc.).

* vocab.json (2.78MB): Stores the vocabulary mapping for the tokenizer: a dictionary linking each token string to an ID (e.g., "elephant" -> 5031). This is used to convert text into token IDs.

* merges.txt (1.67MB): Defines the merge operations for a Byte Pair Encoding (BPE) tokenizer. It describes how characters and subwords are combined into larger tokens.

* tokenizer.json (7.03MB): A consolidated JSON file that includes vocabulary, merges, and tokenizer configuration in one place. This is often faster and more convenient to load.
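
To make the roles of `vocab.json` and `merges.txt` concrete, here is a toy sketch of byte pair encoding. The `MERGES` list and `VOCAB` dictionary are invented for illustration and are not taken from Qwen's files. The loop is also a simplification: real BPE implementations repeatedly pick the highest-priority pair present, which gives the same result for this small example.

```python
# Toy byte-pair-encoding sketch. MERGES and VOCAB are invented for illustration;
# they are not the contents of Qwen's merges.txt or vocab.json.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
VOCAB = {"low": 0, "er": 1, "l": 2, "o": 3, "w": 4, "e": 5, "r": 6, "s": 7}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def bpe_tokenize(word, merges=MERGES):
    """Start from single characters and apply each merge rule in priority order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # combine the adjacent pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def encode(word):
    """Map each token string to its ID (the role vocab.json plays)."""
    return [VOCAB[token] for token in bpe_tokenize(word)]

def decode(token_ids):
    """Map IDs back to token strings and join them."""
    return "".join(ID_TO_TOKEN[i] for i in token_ids)

print(bpe_tokenize("lower"))  # ['low', 'er']
print(encode("lower"))        # [0, 1]
print(decode([0, 1]))         # lower
```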

**Log output (Model Files Being Downloaded)**

Hugging Face automatically downloads the model artifacts. These include:

* config.json (681B): Stores model architecture parameters (e.g., hidden size, number of layers, attention heads, max position embeddings).

* model.safetensors (988MB): The actual pretrained model weights in the safetensors format (preferred over .bin for safety and efficiency). This file makes up the majority of the download because it contains all of the model's roughly 0.49 billion parameters.

* generation_config.json (138B): Stores default parameters for text generation, like temperature, top-k, top-p, max length, and repetition penalty.
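
As a rough sanity check, the size of the weights file can be estimated from the parameter count. The sketch below assumes the weights are stored as 16-bit floats (2 bytes per parameter). That storage format is an assumption and is not shown in the log. `estimate_weight_size_mb` is a hypothetical helper written for this example.

```python
# Rough estimate of checkpoint size from the parameter count.
# Assumption: each weight is stored as a 16-bit float (2 bytes).
def estimate_weight_size_mb(num_params, bytes_per_param=2):
    """Return the approximate size in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6

size_mb = estimate_weight_size_mb(0.49e9)  # 0.49B parameters, as listed for the model
print(f"{size_mb:.0f} MB")  # 980 MB
```

Under that assumption, the estimate lands close to the 988M reported for `model.safetensors` in the download log. The small gap is expected because 0.49B is a rounded parameter count.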

#**4. PROCESSING THE INPUT AND GENERATING AN OUTPUT**


In [7]:
#Initialize agent
  #CodeAgent: A higher-level wrapper that can run a language model while also using external tools.
  #tools=[DuckDuckGoSearchTool()]: The agent can access web search via DuckDuckGo, so it can pull in fresh information to supplement the model’s reasoning.
  #model=model: The agent is linked to the pretrained language model loaded above (the TransformersModel instance).
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model)

#Prepare input text and tokenize with attention mask
  #input_text: The user’s query in plain text.
  #tokenizer: Converts text into tokens (numerical IDs) suitable for the model.
    #return_tensors="pt": Returns PyTorch tensors (instead of lists or arrays).
    #padding=True: Ensures uniform sequence lengths by filling shorter sequences with padding tokens.
    #truncation=True: Ensures long sequences are cut down to fit the model’s input size.

#IMPORTANT: The input is transformed from TEXT to NUMBERS
input_text = "How long would it take for an elephant to cross the United States from Florida to California?"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

#Tokenization returns a dictionary-like object, inputs, with the fields below (extracted separately for clarity and passed to the generation call):
  #input_ids: Token IDs for the sequence.
  #attention_mask: A binary mask (1s for real tokens, 0s for padding) that tells the model which positions to attend to.
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]

#IMPORTANT: The input (in NUMBERS format) is transformed into output (NUMBERS format)

#agent.run() may not accept an attention_mask directly, so this cell calls the underlying model's generate method instead (the agent itself is not used here):
  #model.model: Accesses the underlying Transformers model inside the TransformersModel wrapper.
  #generate: Method used for text generation (e.g., greedy search, beam search, sampling depending on the model's defaults). This produces generated_ids, which are the token IDs for the generated text.
    #input_ids, attention_mask: Ensure the model knows exactly which parts of the input are real text vs padding.
    #pad_token_id: In case the model needs to pad internally, this specifies the right padding token ID.
    #max_new_tokens=1000: Limits how many new tokens the model can generate beyond the input.
generated_ids = model.model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    pad_token_id=model.model.config.pad_token_id,
    max_new_tokens=1000
)

#IMPORTANT: The output (in NUMBERS format) is transformed into output (TEXT format)

# Decode generated tokens
  #tokenizer.decode: Converts the generated token IDs back into a readable string.
  #skip_special_tokens=True: Removes special tokens (e.g., padding and end-of-sequence tokens) that the model may produce.
  #Prints the final model response.
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("Generated text:")
print(generated_text)

Generated text:
How long would it take for an elephant to cross the United States from Florida to California? To determine how long it would take for an elephant to cross the United States from Florida to California, we need to know the distance between these two states. Let's assume the distance is given in miles. For the sake of this example, let's assume the distance is 2,500 miles.

Here are the steps to calculate the time it would take for an elephant to travel this distance:

1. **Identify the distance between Florida and California:**
   \[
   \text{Distance} = 2500 \text{ miles}
   \]

2. **Determine the speed of the elephant:**
   Let's assume the elephant's speed is 10 miles per hour.

3. **Calculate the time taken to travel the distance:**
   Time = Distance / Speed
   \[
   \text{Time} = \frac{2500 \text{ miles}}{10 \text{ miles per hour}} = 250 \text{ hours}
   \]

Therefore, it would take an elephant \boxed{250} hours to cross the United States from Florida to California.
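
The generated text above begins with the original question because, for decoder-only models, `generate` returns the prompt token IDs followed by the newly generated ones. A minimal sketch of separating the two is shown below. It uses plain lists with invented IDs, and `split_new_tokens` is a hypothetical helper.

```python
# Toy sketch: the output of generate() for a decoder-only model is
# [prompt ids..., new ids...]. The id values here are invented for illustration.
def split_new_tokens(generated, prompt_length):
    """Return only the ids produced after the prompt."""
    return generated[prompt_length:]

prompt_ids = [101, 102, 103]
generated = prompt_ids + [7, 8, 9, 10]
new_ids = split_new_tokens(generated, len(prompt_ids))
print(new_ids)  # [7, 8, 9, 10]
```

With the tensors from the cell above, the equivalent slice would likely be `generated_ids[0][input_ids.shape[1]:]`, applied before calling `tokenizer.decode`, so that only the answer is printed.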

In [8]:
import re

# Remove unwanted bracket characters and forward slashes (backslashes and LaTeX command names are left in place)
clean_text = re.sub(r'[\[\]\{\}\(\)/]', '', generated_text)

# Optionally strip and normalize spaces
clean_text = ' '.join(clean_text.split())

print("Cleaned generated text:")
print(clean_text)


Cleaned generated text:
How long would it take for an elephant to cross the United States from Florida to California? To determine how long it would take for an elephant to cross the United States from Florida to California, we need to know the distance between these two states. Let's assume the distance is given in miles. For the sake of this example, let's assume the distance is 2,500 miles. Here are the steps to calculate the time it would take for an elephant to travel this distance: 1. **Identify the distance between Florida and California:** \ \textDistance = 2500 \text miles \ 2. **Determine the speed of the elephant:** Let's assume the elephant's speed is 10 miles per hour. 3. **Calculate the time taken to travel the distance:** Time = Distance Speed \ \textTime = \frac2500 \text miles10 \text miles per hour = 250 \text hours \ Therefore, it would take an elephant \boxed250 hours to cross the United States from Florida to California.
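
The cleaned text still contains fragments such as `\textDistance` and `\frac2500` because the braces were removed while the LaTeX commands stayed. One possible alternative, sketched below, is to unwrap a few common LaTeX constructs before any other cleanup. `latex_to_plain` is a hypothetical helper written for this example. It handles only `\text`, `\boxed`, `\frac`, and the `\[ \]` delimiters, and it is not a general LaTeX parser.

```python
import re

def latex_to_plain(text):
    """Convert a few simple LaTeX constructs to plain text (not a full parser)."""
    # Unwrap \text{...} and \boxed{...}, keeping only the inner content.
    text = re.sub(r'\\(?:text|boxed)\{([^{}]*)\}', r'\1', text)
    # Rewrite \frac{a}{b} as "(a) / (b)"; runs after unwrapping so arguments are brace-free.
    text = re.sub(r'\\frac\{([^{}]*)\}\{([^{}]*)\}', r'(\1) / (\2)', text)
    # Drop the display-math delimiters \[ and \].
    text = re.sub(r'\\[\[\]]', '', text)
    # Collapse runs of whitespace into single spaces.
    return ' '.join(text.split())

sample = r"\[ \text{Time} = \frac{2500 \text{ miles}}{10 \text{ miles per hour}} = 250 \text{ hours} \]"
print(latex_to_plain(sample))
# Time = (2500 miles) / (10 miles per hour) = 250 hours
```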


In [9]:
clean_text

"How long would it take for an elephant to cross the United States from Florida to California? To determine how long it would take for an elephant to cross the United States from Florida to California, we need to know the distance between these two states. Let's assume the distance is given in miles. For the sake of this example, let's assume the distance is 2,500 miles. Here are the steps to calculate the time it would take for an elephant to travel this distance: 1. **Identify the distance between Florida and California:** \\ \\textDistance = 2500 \\text miles \\ 2. **Determine the speed of the elephant:** Let's assume the elephant's speed is 10 miles per hour. 3. **Calculate the time taken to travel the distance:** Time = Distance Speed \\ \\textTime = \\frac2500 \\text miles10 \\text miles per hour = 250 \\text hours \\ Therefore, it would take an elephant \\boxed250 hours to cross the United States from Florida to California."