Colab Notebook: Introduction to Hugging Face and Transformers
This notebook will guide you through the basics of Hugging Face, demonstrate how to load Large Language Models (LLMs), and provide a conceptual understanding of how Transformer models operate.

##1.  Introduction to Hugging Face
Hugging Face is an AI company that has become a central hub for machine learning, especially in the field of Natural Language Processing (NLP). They provide:

Transformers Library: An open-source library offering thousands of pre-trained models for various tasks like text classification, question answering, translation, and more.

Hugging Face Hub: A platform to share, discover, and collaborate on models, datasets, and demos.

Datasets Library: A fast and easy-to-use library to load and process datasets.

Tokenizers Library: Blazingly fast tokenizers, written in Rust.

Why Hugging Face?

Hugging Face simplifies the process of working with state-of-the-art NLP models. It promotes open science and democratizes AI by making powerful models readily accessible.

Link: https://huggingface.co/

##2. Setting Up Your Environment
First, let's install the necessary libraries.

In [1]:
!pip install transformers accelerate bitsandbytes sentencepiece
!pip install transformers[torch] # Or [tensorflow] or [flax] depending on your backend
!pip install datasets
!pip install matplotlib seaborn # For visualization later

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2

Explanation of Libraries:

transformers: The core Hugging Face library for models.

accelerate: A library for easily training and using PyTorch models across different hardware configurations (e.g., multiple GPUs, TPUs).

bitsandbytes: Used for 8-bit quantization, which allows loading larger models with less memory.

sentencepiece: A dependency for some tokenizers (like LLaMA's).

datasets: For easily loading and working with datasets.

matplotlib, seaborn: For creating plots and visualizations.

## 3. Loading Large Language Models (LLMs) via Hugging Face

Hugging Face makes it incredibly easy to load pre-trained LLMs. The AutoModelForCausalLM and AutoTokenizer classes are your best friends here.

Key Concepts:

Pre-trained Model: A model that has already been trained on a massive amount of text data and learned to understand language patterns.

Tokenizer: A component that converts raw text into numerical representations (tokens) that a model can understand.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Choose a smaller, accessible model for demonstration to avoid OOM errors
# For larger models, consider using Google Colab Pro or enabling 8-bit quantization.
model_name = "gpt2" # A classic small LLM
# model_name = "EleutherAI/gpt-neo-125M" # Another option, slightly larger
# model_name = "microsoft/phi-2" # A more recent small LLM, might require more RAM/GPU in free Colab

print(f"Loading model: {model_name}")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
# Using torch_dtype=torch.bfloat16 or torch.float16 can save memory if your GPU supports it.
# For larger models, you might need load_in_8bit=True
model = AutoModelForCausalLM.from_pretrained(model_name) #, torch_dtype=torch.bfloat16)

# If you have a GHugging Face makes it incredibly easy to load pre-trained LLMs. The AutoModelForCausalLM and AutoTokenizer classes are your best friends here.

Loading model: gpt2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Key Concepts:

Pre-trained Model: A model that has already been trained on a massive amount of text data and learned to understand language patterns.

Tokenizer: A component that converts raw text into numerical representations (tokens) that a model can understand.


In [3]:
if torch.cuda.is_available():
    model.to("cuda")
    print("Model moved to GPU.")
else:
    print("No GPU found. Model will run on CPU (might be slower).")

print("Model and Tokenizer loaded successfully!")

# --- Example Usage: Text Generation ---
prompt = "The quick brown fox jumps over the lazy"

# Tokenize the input prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# If using GPU
if torch.cuda.is_available():
    input_ids = input_ids.to("cuda")

print(f"\nInput Prompt: '{prompt}'")
print(f"Input Tokens: {tokenizer.decode(input_ids[0])}")

# Generate text
# max_new_tokens: The maximum number of tokens to generate.
# num_beams: For beam search (exploring multiple possibilities). >1 makes generation slower but potentially better.
# no_repeat_ngram_size: Avoids repeating n-grams.
# early_stopping: Stops generation early if all beam candidates have met a certain condition.
# temperature: Controls randomness. Lower for more deterministic, higher for more creative.
output = model.generate(
    input_ids,
    max_new_tokens=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
    temperature=0.7
)

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(f"\nGenerated Text:\n{generated_text}")


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


No GPU found. Model will run on CPU (might be slower).
Model and Tokenizer loaded successfully!

Input Prompt: 'The quick brown fox jumps over the lazy'
Input Tokens: The quick brown fox jumps over the lazy

Generated Text:
The quick brown fox jumps over the lazy white fox.

"I'm sorry, I didn't mean to be rude, but I don't know what you're talking about. I'm just trying to help you out, and I can't do anything about it. You know,


## Use Model via API

In [15]:

!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read)

In [16]:
# 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

In [4]:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="hf-inference",
    api_key="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",
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
)

print(completion.choices[0].message)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


ChatCompletionOutputMessage(role='assistant', content="The capital of France is Paris. Paris is a significant city, known globally for its cultural, artistic, and architectural contributions. It's home to famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city plays a crucial role in French politics, economic activity, and cultural life. It houses some of France's most important government offices, including the Presidential Palace (Élysée Palace) and the National Assembly.", tool_call_id=None, tool_calls=None)
