# Hugging Face + FastAPI

# Introduction

This notebook explores different Transformer architectures:
1. Encoder-only models (e.g., BERT) → good for classification tasks.
2. Decoder-only models (e.g., GPT-2) → good for text generation.
3. Encoder-decoder models (e.g., T5) → good for sequence-to-sequence tasks like translation.

###### Finally, we will deploy GPT-2.
---

In [1]:
!pip install transformers torch datasets sentencepiece --quiet

[HuggingFace](https://huggingface.co)

# Encoder-Only Model (BERT)
## Explanation:
*  Encoder-only models convert input text into hidden states but do not generate text.
* Best for classification tasks (sentiment analysis, spam filtering, topic classification).
* Example: BERT.

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load tokenizer + model
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

# Build a sentiment-analysis pipeline
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Test the model
result = classifier("I love studying transformers, they are amazing!")
print(result)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/669M [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': '5 stars', 'score': 0.9083207249641418}]


# Decoder-Only Model (GPT-2)

## Explanation:
* Decoder-only models generate text autoregressively (predicting one token at a time).
* Best for text generation tasks → chatbot, story generation, code completion.
* Example: GPT-2.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate


dataset = load_dataset("cardiffnlp/tweet_eval", "sentiment")


labels = dataset["train"].features["label"].names
num_labels = len(labels)


model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# 4) Preprocess
def tokenize_fn(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize_fn, batched=True)

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
import torch
import math

# 1. Load dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# 2. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 doesn’t have pad token

# 3. Tokenize
def tokenize_function(examples):
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=64)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# 4. Load model
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
model.resize_token_embeddings(len(tokenizer))

# 5. Training arguments (super basic)
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=10,
    report_to="none"
)

# 6. Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(200)),  # small subset for speed
)

# 7. Train
trainer.train()

# 8. Save fine-tuned model
trainer.save_model("./gpt2-mini-finetuned")

# 9. Quick generation (GPU-safe)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

inputs = tokenizer("The future of AI is", return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    temperature=0.8, # randomness
    top_k=50, # limits choices
    top_p=0.95, # nucleus sampling
    repetition_penalty=2.0,  # >1 discourages repetition
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

README.md: 0.00B [00:00, ?B/s]

wikitext-2-raw-v1/test-00000-of-00001.pa(…):   0%|          | 0.00/733k [00:00<?, ?B/s]

wikitext-2-raw-v1/train-00000-of-00001.p(…):   0%|          | 0.00/6.36M [00:00<?, ?B/s]

wikitext-2-raw-v1/validation-00000-of-00(…):   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/4358 [00:00<?, ? examples/s]

Map:   0%|          | 0/36718 [00:00<?, ? examples/s]

Map:   0%|          | 0/3760 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
10,3.866
20,2.2974
30,1.8135
40,2.0682
50,1.915
60,2.2867
70,2.1631
80,1.737
90,1.8611
100,1.6588


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The future of AI is in the hands and will be decided by a committee chaired group led from within each department . The chairperson has to have been appointed as an advisor for at least two years , with his or her own


# Encoder-Decoder Model (T5, BART or PEGASUS)
## Explanation:
* Encoder-decoder models read input with the encoder, then generate output with the decoder.
* Best for sequence-to-sequence tasks → translation, summarization, question-answering.
* Example: T5 for summarization.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

text = """
Artificial Intelligence is rapidly evolving and impacting various industries, such as healthcare,
finance, and transportation. Experts believe AI will continue to shape the future of work and society
as a whole. While some argue that AI will create new opportunities and improve efficiency,
others worry about job displacement and ethical concerns. Governments and organizations are now
working on frameworks to ensure responsible development and deployment of AI technologies.
"""

inputs = tokenizer("summarize: " + text, max_length=1024, return_tensors="pt", truncation=True)

# Generate summary
summary_ids = model.generate(
    inputs["input_ids"],
    min_length=20,
    max_length=60,
    num_beams=4, # Beam search
    early_stopping=True
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

experts believe AI will continue to shape the future of work and society as a whole . some argue that AI will create new opportunities and improve efficiency . others worry about job displacement and ethical concerns .


# Deploying with FastAPI

In [None]:
!pip install fastapi uvicorn nest_asyncio pyngrok --quiet
!ngrok config add-authtoken ""

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import nest_asyncio
from pyngrok import ngrok
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Colab setup
nest_asyncio.apply()

# Load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# FastAPI setup
app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Request model
class Prompt(BaseModel):
    text: str
    max_length: int = 50
    temperature: float = 0.7
    top_k: int = 50
    top_p: float = 0.95

@app.post("/generate/")
def generate_text(prompt: Prompt):
    inputs = tokenizer(
        prompt.text,
        return_tensors="pt",
        padding=True,
        truncation=True
    )

    inputs = {k: v.to(device) for k, v in inputs.items()}

    outputs = model.generate(
        **inputs,
        max_new_tokens=prompt.max_length,
        do_sample=True,
        top_k=prompt.top_k,
        top_p=prompt.top_p,
        temperature=prompt.temperature,
        repetition_penalty=1.5  # optional to reduce repetition
    )

    return {"result": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Ngrok tunnel
public_url = ngrok.connect(8000)
print("Public URL:", public_url)

# Run server
from uvicorn import Config, Server
config = Config(app=app, host="0.0.0.0", port=8000, loop="asyncio")
server = Server(config)

await server.serve()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Public URL: NgrokTunnel: "https://dumpish-blanchi-meadow.ngrok-free.dev" -> "http://localhost:8000"


INFO:     Started server process [8645]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO:     2c0f:fc88:29:e883:98bd:ca5d:280b:3677:0 - "GET / HTTP/1.1" 404 Not Found
INFO:     2c0f:fc88:29:e883:98bd:ca5d:280b:3677:0 - "GET /docs HTTP/1.1" 200 OK
INFO:     2c0f:fc88:29:e883:98bd:ca5d:280b:3677:0 - "GET /openapi.json HTTP/1.1" 200 OK


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


INFO:     2c0f:fc88:29:e883:98bd:ca5d:280b:3677:0 - "POST /generate/ HTTP/1.1" 200 OK
INFO:     2c0f:fc88:29:e883:98bd:ca5d:280b:3677:0 - "OPTIONS /generate/ HTTP/1.1" 200 OK


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


INFO:     2c0f:fc88:29:e883:98bd:ca5d:280b:3677:0 - "POST /generate/ HTTP/1.1" 200 OK
