<a href="https://colab.research.google.com/github/NADIAAREF/Hands_On_Language_Model/blob/main/Chapter_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is for Chapter 1 of the Hands-On Large Language Models book by Jay Alammar and Maarten Grootendorst.

A Large Language Model (LLM) is a deep learning model trained on very large amounts of text. It learns statistical patterns in language, which lets it:

Answer questions

Write stories or code

Translate between languages

Chat like a human

LLMs like Phi-3 are pretrained — meaning someone already trained them on large datasets (so you don’t have to).



NOTE: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.
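
Before running the heavier cells, it can help to confirm that the runtime actually exposes a GPU. The following is a minimal sketch, assuming PyTorch is the backend (it comes preinstalled on Colab); `gpu_available` is a hypothetical helper name, not part of the book's code.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA GPU."""
    # Assumption: PyTorch is the backend; without it we simply report False.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` in Colab, the runtime type is probably still set to CPU.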

Install:

In [1]:
# %%capture
# Quote the version specifiers so the shell does not treat ">" as output redirection
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"

The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately (although that isn't always necessary).

This code is loading a pretrained LLM (Large Language Model) — specifically Phi-3 Mini, built by Microsoft — so you can use it to generate text like ChatGPT does.

MODEL

You're importing two tools from Hugging Face's transformers library:

AutoModelForCausalLM: Loads a model designed to generate text (Causal LM = Causal Language Model).

AutoTokenizer: Splits text into tokens (words or subword pieces) and maps them to the integer IDs the model takes as input.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Here's what each argument does:
"microsoft/Phi-3-mini-4k-instruct": The name of the model checkpoint on Hugging Face Hub.

device_map="cuda": Tells the model to use your GPU (faster than CPU).

torch_dtype="auto": Uses the numeric precision recorded in the checkpoint (for example float16 or bfloat16) instead of defaulting to float32, which saves memory and speeds up inference.

trust_remote_code=False: Does not run custom Python code shipped in the model repository; only the model implementation built into transformers is used (safer).
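
To see why the numeric precision matters on a T4, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch: the figure of about 3.8 billion parameters for Phi-3 Mini and the helper name `weight_memory_gb` are assumptions introduced for illustration, and activations and the generation cache need memory on top of this.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: Phi-3 Mini has roughly 3.8 billion parameters.
NUM_PARAMS = 3.8e9

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

Under these assumptions, half precision roughly halves the footprint (about 7.6 GB instead of about 15.2 GB), which leaves far more headroom on a 16 GB T4.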

TOKENIZER

tokenizer = AutoTokenizer.from_pretrained(...)
This loads the tokenizer for the same model. Why?

LLMs don’t operate on raw text; they operate on tokens, which are integer IDs representing words or pieces of words.

The tokenizer handles:

Text → Tokens (input)

Tokens → Text (output)
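
The round trip can be tried directly with the tokenizer loaded above. This is a minimal sketch: `round_trip` is a hypothetical helper, and it relies only on standard tokenizer calls (`tokenizer(text)`, `convert_ids_to_tokens`, and `decode`). The exact IDs and token pieces depend on the model's vocabulary, so none are shown here.

```python
def round_trip(tokenizer, text):
    """Encode text to token IDs, then decode the IDs back to text."""
    # Text -> token IDs (the integers the model actually receives).
    ids = tokenizer(text)["input_ids"]
    # Token IDs -> the vocabulary pieces they stand for (useful for inspection).
    pieces = tokenizer.convert_ids_to_tokens(ids)
    # Token IDs -> text, dropping special markers such as start-of-sequence.
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, pieces, decoded
```

For example, `ids, pieces, decoded = round_trip(tokenizer, "Hello, world!")` followed by `print(ids, pieces, decoded)` shows all three views of the same sentence.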



So what is happening here, really?
You're loading everything you need to talk to Phi-3:

tokenizer: Prepares text to feed into the model.

model: Generates text or answers based on that input.
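
Used directly, without any wrapper, the two objects could be combined roughly as follows. This is a hedged sketch rather than the book's code: `generate_manually` is a hypothetical helper, and it assumes the tokenizer ships a chat template (as instruct models typically do) and a transformers version in which `apply_chat_template` accepts `return_dict=True`.

```python
def generate_manually(model, tokenizer, messages, max_new_tokens=500):
    """Generate a reply to chat messages using the model and tokenizer directly."""
    # 1. Format the messages with the model's chat template and tokenize them.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # append the marker that starts the assistant turn
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    # 2. Generate token IDs; do_sample=False means greedy decoding.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )

    # 3. Drop the prompt tokens and decode only the newly generated ones.
    prompt_length = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_length:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)
```

A call such as `generate_manually(model, tokenizer, [{"role": "user", "content": "Hello"}])` would return the reply as a plain string.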

CREATING A PIPELINE

Although we can now use the model and tokenizer directly, it's much easier to wrap them in a pipeline object:

What's a pipeline?
Hugging Face’s pipeline is a high-level abstraction that wraps together everything — tokenizer, model, input processing, and output formatting — into a simple interface. It handles all the behind-the-scenes work.

In [2]:
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)

Device set to use cuda


The import statement brings in the high-level pipeline utility from Hugging Face.

generator = pipeline(...)
You're creating a text generation pipeline — a complete system to generate text using your LLM.

Let’s break down the arguments:

"text-generation"->	Specifies the task — you're generating text (like ChatGPT does).

model=model-> Uses the Phi-3 model you loaded earlier.

tokenizer=tokenizer-> Uses the corresponding tokenizer for Phi-3.

return_full_text=False-> Returns only the newly generated text, not the original prompt.

max_new_tokens=500-> Caps how many new tokens the model can generate (prevents endless rambling).

do_sample=False-> Disables sampling: the model always picks the most likely next token (greedy decoding), which makes the output repeatable.
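
The effect of do_sample=False can be illustrated with a toy next-token distribution. This is only a sketch in plain Python: the probabilities and the helper names `greedy_pick` and `sampled_pick` are made up for illustration, and a real model produces a distribution over its whole vocabulary at every step.

```python
import random


def greedy_pick(probs):
    """Return the token with the highest probability (like do_sample=False)."""
    return max(probs, key=probs.get)


def sampled_pick(probs, rng):
    """Draw a token at random, weighted by probability (like do_sample=True)."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution after the prompt "Large Language".
next_token_probs = {"Model": 0.7, "Models": 0.2, "Learning": 0.1}

# Greedy decoding picks the same token on every call.
print(greedy_pick(next_token_probs))

# Sampling can pick a different token on each call, depending on the random state.
rng = random.Random(0)
print([sampled_pick(next_token_probs, rng) for _ in range(5)])
```

Greedy decoding trades variety for repeatability, which is convenient in a tutorial where the outputs should match from run to run.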

After creating the pipeline, you can now generate text super easily:

In [11]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Explain about Large Language Model"}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

 A Large Language Model (LLM) is a type of artificial intelligence (AI) that is designed to understand, generate, and manipulate human language. These models are trained on vast amounts of text data, enabling them to learn patterns, structures, and nuances of language. LLMs are a subset of machine learning models, specifically designed for natural language processing (NLP) tasks.

The primary goal of LLMs is to create AI systems that can interact with humans in a more natural and intuitive way. They can be used for various applications, such as language translation, text summarization, question answering, chatbots, and content generation.

LLMs are built using deep learning techniques, which involve the use of neural networks. These networks are composed of interconnected layers of artificial neurons that process and analyze the input data. The more data the model is trained on, the better it becomes at understanding and generating language.

One of the most popular LLMs is GPT (Genera