# Guide: A practical introduction to Hugging Face (HF), **corrected version**

_Updated: 2025-10-30 00:40 UTC_  
This version adds **offline/online paths**, completes cells that were previously incomplete, and uses the current APIs
of `transformers`, `datasets`, `huggingface_hub`, and `tokenizers`.

## 0) Execution mode

- If you **have internet access** (Colab, a laptop with a network connection): set `RUN_ONLINE = True` and run the installation cell.  
- If you **do not have internet access**: set `RUN_ONLINE = False`. Cells that need to download models or datasets are then **skipped** automatically.

In [1]:
RUN_ONLINE = True  # set to False if you have no internet access

import os
if not RUN_ONLINE:
    os.environ["HF_HUB_OFFLINE"] = "1"   # force offline mode for the Hugging Face Hub
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
print("RUN_ONLINE =", RUN_ONLINE)

RUN_ONLINE = True
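
The offline flags are typically read when the Hugging Face libraries are first imported, so they should be set before the imports cell runs. A minimal sketch of the same switch as a reusable function follows. `apply_run_mode` and `OFFLINE_VARS` are hypothetical names introduced here, not part of any Hugging Face library, and clearing the flags in online mode is an assumption about the desired behavior.

```python
import os
import sys

# Environment variables used by the cell above to force offline mode.
OFFLINE_VARS = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")

def apply_run_mode(run_online, env=None):
    """Set or clear the offline flags.

    Returns True if huggingface_hub was already imported, in which case
    a kernel restart is the safest way to make the new flags take effect.
    """
    env = os.environ if env is None else env
    for name in OFFLINE_VARS:
        if run_online:
            env.pop(name, None)  # clear any stale offline flag
        else:
            env[name] = "1"
    return "huggingface_hub" in sys.modules

# Demonstration on a plain dict so the real environment is left untouched.
demo_env = {}
already_imported = apply_run_mode(False, env=demo_env)
print(demo_env, "| hub already imported:", already_imported)
```

Passing a dictionary instead of `os.environ` makes the helper easy to check without changing the running process.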


## 1) Installation (online only)

In [6]:
import sys, subprocess
if RUN_ONLINE:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U",
                           "transformers", "datasets", "huggingface_hub", "tokenizers", "torch"])


## 2) Imports and versions

In [7]:
import importlib, sys

def safe_import(name):
    try:
        m = importlib.import_module(name)
        print(f"OK  import {name}  ->", getattr(m, "__version__", "(no __version__)"))
        return m
    except Exception as e:
        print(f"FAIL import {name}:", e)
        return None

transformers = safe_import("transformers")
datasets = safe_import("datasets")
hf_hub = safe_import("huggingface_hub")
tokenizers = safe_import("tokenizers")
torch = safe_import("torch")

if torch is not None:
    print("Torch cuda available:", torch.cuda.is_available())



OK  import transformers  -> 4.57.1
OK  import datasets  -> 4.3.0
OK  import huggingface_hub  -> 0.36.0
OK  import tokenizers  -> 0.22.1
OK  import torch  -> 2.9.0+cpu
Torch cuda available: False
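
If importing a package is slow or fails outright, its installed version can still be read from the package metadata. The sketch below uses the standard library's `importlib.metadata`. `installed_version` is a hypothetical helper name, and the package list simply mirrors the installation cell.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Report versions without importing the packages themselves.
for name in ("transformers", "datasets", "huggingface_hub", "tokenizers", "torch"):
    print(f"{name:16s} {installed_version(name) or 'not installed'}")
```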


## 3) `pipeline`: first steps (sentiment, fill-mask, text generation)

In [8]:
from transformers import pipeline

if RUN_ONLINE:
    # 3.1 Sentiment analysis
    # Note: this checkpoint was fine-tuned on English (SST-2), so scores for Spanish input are less reliable.
    clf = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
    out = clf("Este curso de IA en el aula me parece excelente.")
    print("Sentiment:", out)

    # 3.2 Fill-mask (uses the [MASK] token)
    mlm = pipeline("fill-mask", model="bert-base-uncased")
    print(mlm("Paris is the [MASK] of France.")[:2])

    # 3.3 Text generation (uses max_new_tokens, the current API)
    gen = pipeline("text-generation", model="gpt2")
    prompt = "Hello AI classroom, today we will learn about"
    print(gen(prompt, max_new_tokens=40, num_return_sequences=1))
else:
    print("Offline mode: pipeline examples skipped (they require downloading models).")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cpu


Sentiment: [{'label': 'POSITIVE', 'score': 0.952487051486969}]


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Bert

[{'score': 0.9969369173049927, 'token': 3007, 'token_str': 'capital', 'sequence': 'paris is the capital of france.'}, {'score': 0.0005914842477068305, 'token': 2540, 'token_str': 'heart', 'sequence': 'paris is the heart of france.'}]


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello AI classroom, today we will learn about it. What does 'AI' mean? Well, you might say 'AI-like' but I don't mean that in a bad way. It's more like 'AI-like' or"}]
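
Each pipeline call above returns a list of dictionaries. The sketch below shows one way to pull out the best-scoring entry. It runs offline on a copy of the outputs printed above, with scores rounded for readability. `top_prediction` is a hypothetical helper, not a `transformers` function.

```python
def top_prediction(results, key="token_str"):
    """Return (value, score) for the highest-scoring entry in a pipeline result list."""
    best = max(results, key=lambda r: r["score"])
    return best[key], best["score"]

# Copies of the outputs shown above (scores rounded).
fill_mask_output = [
    {"score": 0.9969, "token": 3007, "token_str": "capital",
     "sequence": "paris is the capital of france."},
    {"score": 0.0006, "token": 2540, "token_str": "heart",
     "sequence": "paris is the heart of france."},
]
sentiment_output = [{"label": "POSITIVE", "score": 0.9525}]

print(top_prediction(fill_mask_output))               # fill-mask entries expose "token_str"
print(top_prediction(sentiment_output, key="label"))  # classification entries expose "label"
```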


## 4) AutoTokenizer + AutoModel (fine-grained control)

In [9]:
from transformers import AutoTokenizer, AutoModel

if RUN_ONLINE:
    model_id = "bert-base-uncased"
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModel.from_pretrained(model_id)
    print("Vocab size:", tok.vocab_size, "| Hidden size:", mdl.config.hidden_size)
else:
    print("Offline mode: skipped (requires downloading model weights).")

Vocab size: 30522 | Hidden size: 768


## 5) `datasets`: load and explore (IMDB as an example)

In [10]:
if RUN_ONLINE:
    from datasets import load_dataset
    imdb = load_dataset("imdb")
    print(imdb)
    print("Example train[0]:", imdb["train"][0])
else:
    print("Offline mode: skipped (requires downloading the dataset).")

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
Example train[0]: {'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between aski
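
The sample review contains raw `<br />` tags, so a light cleaning pass is often useful before tokenizing. The sketch below assumes you only want to replace line-break tags with spaces. `clean_example` is a hypothetical helper written in the per-example shape that `Dataset.map` expects (a dict in, a dict out), and the sample text is a shortened stand-in for the review above.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def clean_example(example):
    """Replace HTML line breaks with a space and collapse repeated whitespace."""
    text = BR_TAG.sub(" ", example["text"])
    return {"text": re.sub(r"\s+", " ", text).strip()}

sample = {"text": "I really had to see this for myself.<br /><br />The plot is centered around Lena."}
print(clean_example(sample)["text"])

# With the dataset loaded online, this could be applied as:
# imdb_clean = imdb.map(clean_example)
```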




## 6) Hugging Face Inference API (remote)

To use the Inference API you need a **token** from your HF account:  
1. Create a token at <https://huggingface.co/settings/tokens>  
2. Store it in the `HF_TOKEN` environment variable, or log in with `notebook_login()`.

In [None]:
import os
HF_TOKEN = os.environ.get("HF_TOKEN")  # read the token from the environment; None if unset

if RUN_ONLINE and HF_TOKEN:
    from huggingface_hub import InferenceClient
    client = InferenceClient(token=HF_TOKEN)
    resp = client.text_generation(prompt="Explain transformers in 2 sentences.", model="gpt2", max_new_tokens=40)
    print(resp[:200], "...")
else:
    print("Skipped: requires HF_TOKEN and a connection.")

Skipped: requires HF_TOKEN and a connection.
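
Because the token is a secret, it helps to read it in one place and never print it in full. The sketch below uses hypothetical helpers `read_hf_token` and `mask_token`, and the token string is a made-up placeholder, not a real credential.

```python
import os

def read_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    return token or None

def mask_token(token):
    """Keep only the first 3 and last 2 characters so the value can be logged safely."""
    if token is None:
        return "(no token)"
    if len(token) <= 8:
        return "*" * len(token)
    return token[:3] + "*" * (len(token) - 5) + token[-2:]

# Placeholder value for illustration only.
print(mask_token(read_hf_token({"HF_TOKEN": "hf_exampleTOKEN123"})))
print(mask_token(read_hf_token({})))
```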


## 7) Embeddings (representations) with plain `transformers`

We use a base model (BERT) with **mean pooling** to produce sentence embeddings, avoiding
extra dependencies such as `sentence-transformers`. The cell below runs on CPU as written; to use a GPU, move the model and the encoded inputs to CUDA explicitly with `.to("cuda")`.

In [14]:
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # mask out padding and average over the token dimension
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = (last_hidden_state * mask).sum(1)
    counts = mask.sum(1).clamp(min=1e-9)
    return summed / counts

sentences = [
    "Machine learning improves with more data.",
    "Deep learning models are trained with GPUs.",
    "The Eiffel Tower is in Paris."
]

if RUN_ONLINE:
    model_id = "bert-base-uncased"
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModel.from_pretrained(model_id)
    enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = mdl(**enc)
    emb = mean_pool(out.last_hidden_state, enc["attention_mask"])
    # Cosine similarities
    sim01 = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
    sim02 = torch.nn.functional.cosine_similarity(emb[0], emb[2], dim=0).item()
    print("cos(0,1) ~", round(sim01, 4), "| cos(0,2) ~", round(sim02, 4))
else:
    print("Offline mode: skipped (requires model weights).")

cos(0,1) ~ 0.767 | cos(0,2) ~ 0.5186
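
To see why the attention mask matters in `mean_pool`, here is the same computation for a single sentence written with plain Python lists. This is an illustrative sketch with made-up 2-dimensional vectors, not the tensor code used above. The padded position is excluded from the average, and cosine similarity is the dot product divided by the product of the norms.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the vectors whose mask entry is 1 (real tokens, not padding)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # avoid dividing by zero for an all-padding row
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The last row stands in for a padding position and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
pooled = masked_mean(tokens, [1, 1, 0])
print(pooled)
print(cosine(pooled, [3.0, -2.0]))  # orthogonal vectors give a similarity of zero
```

Without the mask, the padding row would pull the average toward `[100.0, 100.0]`, which is why `mean_pool` multiplies by the mask before summing.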


## 8) `tokenizers`: minimal example (BPE), **works offline**

In [15]:
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
print("Empty BPE tokenizer created (demo).")

Empty BPE tokenizer created (demo).
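
The empty tokenizer above has no vocabulary yet. BPE builds one by repeatedly merging the most frequent adjacent pair of symbols. The sketch below performs one such merge in plain Python on a made-up word-frequency table. It illustrates the idea only and is not how the `tokenizers` library implements training; `most_frequent_pair` and `merge_pair` are hypothetical helpers.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Made-up corpus: each word is a tuple of characters mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)
after_merge = merge_pair(corpus, pair)
print(pair)
print(after_merge)
```

Repeating these two steps until a target vocabulary size is reached gives the list of merges that a trained BPE tokenizer applies to new text.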


---
### Final notes
- Sections that previously contained `...` have been replaced with complete examples.
- Where applicable, we use `max_new_tokens` (not `max_length`) and environment variables for offline mode.
- If something does not run in your environment, check versions with the imports cell and adjust `RUN_ONLINE`.