# Unit 1 Hands-on: Generative AI & NLP Fundamentals

Welcome to your interactive guide to **Generative AI**. This notebook is designed to be a step-by-step tutorial, explaining not just *how* to code, but *why* we use these tools.


## 1. Introduction & Setup

In this section, we will set up our environment. But first, let's understand the tools we are using.


### What is Hugging Face?

Hugging Face (https://huggingface.co/) is often called the "GitHub of AI". It is a massive repository where researchers and companies share their trained models, datasets, and demos.

Instead of training a model from scratch (which, for large models, can cost millions of dollars), we can download pretrained models such as GPT-2, BERT, or RoBERTa directly from Hugging Face and use them.


### What is the `transformers` library?

The `transformers` library is the bridge between the models on Hugging Face and your code. It provides APIs to easily download, load, and run state-of-the-art pretrained models.

It supports framework interoperability, meaning you can often move between PyTorch, TensorFlow, and JAX.


### What is `pipeline()`?

The `pipeline()` function is the library's highest-level entry point. It wraps the complex math and processing in three steps:

1.  **Preprocessing**: Converts your raw text into numbers (Tokens & IDs) that the model can understand.
2.  **Model Inference**: The model processes the numbers and outputs predictions (logits).
3.  **Post-processing**: The raw predictions are converted back into human-readable text (labels, answers, summaries).

With just one line, `pipeline('task-name')` handles all of this for you.
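
The three stages can be sketched in plain Python. This is a toy illustration under stated assumptions, not how `transformers` implements `pipeline()`: the vocabulary, the `toy_model` scoring rule, and the label names are all hypothetical.

```python
import math

# Hypothetical toy vocabulary: maps each known word to an integer ID.
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "movie": 5}
LABELS = ["NEGATIVE", "POSITIVE"]  # hypothetical label names


def preprocess(text):
    """Stage 1: turn raw text into a list of token IDs."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Stage 2: return one raw score (logit) per label.

    A real model is a neural network; this stand-in just counts two words.
    """
    positive = float(token_ids.count(VOCAB["love"]))
    negative = float(token_ids.count(VOCAB["hate"]))
    return [negative, positive]


def postprocess(logits):
    """Stage 3: softmax the logits and return the most likely label."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as `pipeline()` does conceptually."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I love this movie"))
```

The real `pipeline()` swaps each stand-in for a trained tokenizer, a neural network, and task-specific decoding, but the flow of data is the same.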


### Import Pipeline
Let's import this powerful function.


In [1]:
from transformers import pipeline, set_seed, GPT2Tokenizer




### Import Utilities
We also need `nltk` for some traditional NLP tasks and `os` for file handling.


In [2]:
import os
import nltk


### Loading the Course Material
We will define the path to our course text file (`unit 1.txt`).


In [3]:
file_path = "/content/unit 1.txt"


Now we read the file. This text will be the 'Knowledge Base' for our tasks later.


In [4]:
try:
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    print("File loaded successfully!")
except FileNotFoundError:
    # Define `text` anyway so later cells do not raise NameError
    text = ""
    print(f"Error: '{file_path}' not found.")


File loaded successfully!


Let's look at the first 500 characters to make sure we have the right data.


In [5]:
print("--- Data Preview ---")
print(text[:500] + "...")


--- Data Preview ---
Generative AI and Its Applications: A Foundational Briefing

Executive Summary

This document provides a comprehensive overview of Generative AI, synthesizing foundational concepts, technological underpinnings, and practical applications as outlined in the course materials from PES University. Generative AI represents a transformative subset of Artificial Intelligence focused on creating novel content, a capability primarily driven by the advent of Large Language Models (LLMs). The evolution of ...


## 2. Generative AI: Dumb vs. Smart Models

Generative AI creates new content (text, images, audio). But the quality depends heavily on the model's size and training.

We will compare two models:
1.  **`distilgpt2`**: A distilled version of GPT-2. It is smaller, faster, and needs less memory, but its output may be less coherent (the "Dumb" model in this comparison).
2.  **`gpt2`**: The standard version (the "Smart" model, though still small by modern standards).

**How to access a model?**
1.  Go to the Hugging Face Models page.
2.  Search for a task (e.g., 'Text Generation').
3.  Pick a model (e.g., `gpt2`).
4.  Copy the model name.


### Step 1: Set a Seed

A **seed value** is used to make random results **reproducible**. When we set a seed, the random number generator starts from the same point each time, which means it will produce the **same sequence of random values**.

Try running the code multiple times using the **same seed value** and observe the output.

Now, change the seed value and run the code again. This time, the output **will change** because a different seed creates a different sequence of random numbers.
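
The same idea can be seen with Python's built-in `random` module, with no model involved. This is a minimal sketch; `draw` is a hypothetical helper. `set_seed` from `transformers` applies the same principle to the random number generators of Python, NumPy, and PyTorch.

```python
import random


def draw(seed, n=3):
    """Seed the generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(42)
second = draw(42)   # same seed -> same sequence
other = draw(7)     # different seed -> almost certainly a different sequence

print(first == second)
print(first, other)
```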


In [6]:
set_seed(42)


### Step 2: Define a Prompt
Both models will complete this sentence.


In [7]:
prompt = "Generative AI is a revolutionary technology that"


### Step 3: Fast Model (`distilgpt2`)
Let's see how the smaller model performs.


In [8]:
# Initialize the pipeline with the specific model
fast_generator = pipeline('text-generation', model='distilgpt2')

# Generate text
output_fast = fast_generator(prompt, max_length=50, num_return_sequences=1)
print(output_fast[0]['generated_text'])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that is designed to work with existing AI systems. It has been developed by the University of California, Berkeley. Its research team is the leading developer of AI software and its use is limited to AI and AI systems.


The research team led by Professor Daniel Kranz, from the University of California, Berkeley, has developed a program to learn how to use the AI to improve the performance of the software. It has been developed by the University of California, Berkeley. Its research team is the leading developer of AI software and its use is limited to AI and AI systems. It is a top-selling research computer software company, and is a top-selling research computer software company.
The research team developed the program to learn how to use the AI to improve the performance of the software. It has been developed by the University of California, Berkeley, Berkeley, and is a top-selling research computer software company. The research team deve

### Step 4: Standard Model (`gpt2`)
Now let's try the standard model.


In [9]:
smart_generator = pipeline('text-generation', model='gpt2')

output_smart = smart_generator(prompt, max_length=50, num_return_sequences=1)
print(output_smart[0]['generated_text'])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that allows users to build AI that can help solve complex problems. It brings together hundreds of different approaches to solve problems, from solving complex problems in a laboratory to solving complex problems in a city. The technology allows users to build a computer that makes decisions based on user input, not on intuition.

The AI is a model of human intelligence, and has many aspects that are similar to artificial intelligence. It can learn from humans, and it can adapt to the environment. It can learn by experimenting with new ways of thinking, and it can learn by learning from its own experience.

It is the main driving force behind the new Artificial Intelligence, and the AI is very important to the success of AI. The new AI is designed to work out problems that need to be solved in a way that is easy to understand and solve, and that is flexible enough to be easily adaptable to different environments.

The AI is designed to be sca

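**Analysis**: Compare the two outputs. Does the standard model stay more on topic? Does the fast model drift into repetition or nonsense? Note that both outputs run far past 50 tokens: as the warnings in the logs above state, a default `max_new_tokens` of 256 takes precedence over `max_length=50`.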


## 3. NLP Fundamentals: Under the Hood

Before any "Magic" happens, the text must be processed. The pipeline does this automatically, but let's break it down manually to understand the steps.


### 3.1 Tokenization
**Why?** Models cannot read English strings. They only understand numbers.
**What?** Tokenization breaks text into pieces (Tokens) and assigns each piece a unique ID.


In [10]:
# 1. Initialize the Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


Let's take a sample sentence.


In [11]:
sample_sentence = "Transformers revolutionized NLP."


Now we split it into tokens.


In [12]:
tokens = tokenizer.tokenize(sample_sentence)
print(f"Tokens: {tokens}")


Tokens: ['Transform', 'ers', 'Ġrevolution', 'ized', 'ĠN', 'LP', '.']


And finally, convert tokens to IDs.


In [13]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")


Token IDs: [41762, 364, 5854, 1143, 399, 19930, 13]
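
In the token list above, `Ġ` is how GPT-2's byte-level tokenizer marks a token that begins with a space. The sketch below pairs each token with its ID and rebuilds the sentence by hand. It hard-codes the outputs printed above rather than calling the tokenizer, and the variable names are illustrative.

```python
# Tokens and IDs copied from the outputs above.
tokens = ['Transform', 'ers', 'Ġrevolution', 'ized', 'ĠN', 'LP', '.']
token_ids = [41762, 364, 5854, 1143, 399, 19930, 13]

# Each token maps to exactly one ID.
for token, token_id in zip(tokens, token_ids):
    print(f"{token!r:>15} -> {token_id}")

# Joining the pieces and turning 'Ġ' back into a space recovers the text.
rebuilt = "".join(tokens).replace("Ġ", " ")
print(rebuilt)  # Transformers revolutionized NLP.
```

With the real tokenizer, `tokenizer.decode(token_ids)` performs this reconstruction for you.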


### 3.2 POS Tagging (Part-of-Speech)
**Why?** To understand grammar. Is 'book' a noun (the object) or a verb (to book a flight)?
**What?** We label each word as Noun (NN), Verb (VB), Adjective (JJ), etc.


In [18]:
# Download necessary NLTK data
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

True

Let's tag our sentence.


In [20]:
pos_tags = nltk.pos_tag(nltk.word_tokenize(sample_sentence))
print(f"POS Tags: {pos_tags}")

POS Tags: [('Transformers', 'NNS'), ('revolutionized', 'VBD'), ('NLP', 'NNP'), ('.', '.')]
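
The tags follow the Penn Treebank tag set. A small lookup table makes the output easier to read. This sketch covers only the tags that appear above, and `TAG_MEANINGS` is a hypothetical helper, not part of NLTK.

```python
# Penn Treebank tags seen in the output above (not the full tag set).
TAG_MEANINGS = {
    "NNS": "noun, plural",
    "VBD": "verb, past tense",
    "NNP": "proper noun, singular",
    ".": "sentence-final punctuation",
}

# Output of nltk.pos_tag copied from the cell above.
pos_tags = [('Transformers', 'NNS'), ('revolutionized', 'VBD'), ('NLP', 'NNP'), ('.', '.')]

for word, tag in pos_tags:
    print(f"{word:<15} {tag:<4} {TAG_MEANINGS.get(tag, 'unknown tag')}")
```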


### 3.3 Named Entity Recognition (NER)
**Why?** To extract structured information like names, organizations, and dates.
**What?** We use a BERT-large model fine-tuned for NER on the CoNLL-2003 English dataset, as the model name indicates.


In [21]:
# Initialize NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


Let's analyze the first 1,000 characters of our text and keep only the entities the model is confident about (score above 0.90).


In [22]:
snippet = text[:1000]
entities = ner_pipeline(snippet)

print(f"{'Entity':<20} | {'Type':<10} | {'Score':<5}")
print("-"*45)
for entity in entities:
    if entity['score'] > 0.90:
        print(f"{entity['word']:<20} | {entity['entity_group']:<10} | {entity['score']:.2f}")


Entity               | Type       | Score
---------------------------------------------
AI                   | MISC       | 0.98
PES University       | ORG        | 0.99
AI                   | MISC       | 0.98
Large Language Models | MISC       | 0.91
LLMs                 | MISC       | 0.90
Transformer          | MISC       | 0.99
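
One detail in the table: `LLMs` is listed with a score of 0.90 even though the filter keeps only scores strictly greater than 0.90. The likely explanation is rounding, since `:.2f` shows only two decimals. The sketch below uses a made-up score (0.904 is an assumption, not the model's actual output) to show the effect.

```python
# Hypothetical entity; the exact score is an assumption for illustration.
entity = {"word": "LLMs", "entity_group": "MISC", "score": 0.904}

passes_filter = entity["score"] > 0.90   # compared at full precision
displayed = f"{entity['score']:.2f}"     # rounded only for display

print(passes_filter, displayed)
```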


## 4. Advanced Applications: Comparative Analysis

Now we move to more complex tasks: Summarization, Question Answering, and Masked Language Modeling.


### 4.1 Summarization: Efficiency vs. Quality

We will summarize a complex section about Transformer Architecture using two models:
1. **`distilbart-cnn-12-6`**: Optimized for speed.
2. **`bart-large-cnn`**: Optimized for summary quality.


In [23]:
# Let's extract a specific section for summarization
transformer_section = """
The introduction of the Transformer architecture in the 2017 paper "Attention is all you need" was a watershed moment in AI. It provided a more effective and scalable way to handle sequential data like text, replacing older, less efficient methods like recurrence (RNNs) and convolutions.
The fundamental innovation of the Transformer is the attention mechanism. This component allows the model to weigh the importance of different words (tokens) in the input sequence when making a prediction. In essence, for each word it processes, the model can "pay attention" to all other words in the input, helping it understand context, resolve ambiguity, and handle long-range dependencies. This is crucial for tasks like translation, summarization, and question answering.
The Transformer architecture consists of an encoder stack (to process the input) and a decoder stack (to generate the output), both of which heavily utilize multi-head attention and feed-forward networks.
"""


#### Fast Summarizer


In [24]:
fast_sum = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
res_fast = fast_sum(transformer_section, max_length=60, min_length=30, do_sample=False)
print(res_fast[0]['summary_text'])


Device set to use cpu


 The introduction of the Transformer architecture in the 2017 paper "Attention is all you need" was a watershed moment in AI . It provided a more effective and scalable way to handle sequential data like text, replacing older, less efficient methods like recurrence (RNNs) and conv


#### Quality Summarizer


In [25]:
smart_sum = pipeline("summarization", model="facebook/bart-large-cnn")
res_smart = smart_sum(transformer_section, max_length=60, min_length=30, do_sample=False)
print(res_smart[0]['summary_text'])


Device set to use cpu


The introduction of the Transformer architecture in the 2017 paper "Attention is all you need" was a watershed moment in AI. It provided a more effective and scalable way to handle sequential data like text.
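
Comparing the two summaries, the fast model's output stops mid-word ("conv"), most likely because it reached the `max_length=60` token limit, while the quality model ends on a complete sentence. A simple heuristic can flag this kind of cut-off. `looks_truncated` is a hypothetical helper, and only the tail of each summary is copied here.

```python
def looks_truncated(summary):
    """Heuristic: a summary that does not end in . ! or ? was probably cut off."""
    return not summary.rstrip().endswith((".", "!", "?"))


# Tails of the two summaries printed above.
fast_tail = "methods like recurrence (RNNs) and conv"
smart_tail = "sequential data like text."

print(looks_truncated(fast_tail), looks_truncated(smart_tail))
```

The check is deliberately crude: it only inspects the final character, so it cannot tell a truncated summary from one that legitimately ends without punctuation.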


### 4.2 Question Answering

This task is **Extractive**. We provide a `context` (our text) and a `question`. The model does not write a new answer; it selects the span of the context most likely to answer the question.
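
In an extractive setup, the answer is a slice of the context located by character offsets. The sketch below mimics the shape of a question-answering result (`answer`, `start`, `end`) with a hand-built dictionary. No model is run, and the `score` field is left out because any value here would be invented.

```python
context = (
    "The fundamental innovation of the Transformer is the attention mechanism. "
    "This component allows the model to weigh the importance of different words."
)

# Hand-built stand-in for a model prediction: locate the span ourselves.
answer_text = "the attention mechanism"
start = context.find(answer_text)
result = {"answer": answer_text, "start": start, "end": start + len(answer_text)}

# The answer is literally a substring of the context.
print(context[result["start"]:result["end"]])
```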


In [26]:
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")


Device set to use cpu


Let's ask two questions: one about the Transformer and one about the risks mentioned in our text.


In [27]:
questions = [
    "What is the fundamental innovation of the Transformer?",
    "What are the risks of using Generative AI?"
]

for q in questions:
    res = qa_pipeline(question=q, context=text[:5000])
    print(f"\nQ: {q}")
    print(f"A: {res['answer']}")



Q: What is the fundamental innovation of the Transformer?
A: to identify hidden patterns, structures, and relationships within the data

Q: What are the risks of using Generative AI?
A: data privacy, intellectual property, and academic integrity


### 4.3 Masked Language Modeling (The 'Fill-in-the-Blank' Game)

This is a core pre-training objective of BERT. We hide a token (`[MASK]`) and ask the model to predict it from the surrounding context.


In [28]:
mask_filler = pipeline("fill-mask", model="bert-base-uncased")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


Let's see what the model thinks Generative AI creates.


In [29]:
masked_sentence = "The goal of Generative AI is to create new [MASK]."
preds = mask_filler(masked_sentence)

for p in preds:
    print(f"{p['token_str']}: {p['score']:.2f}")


applications: 0.06
ideas: 0.05
problems: 0.05
systems: 0.04
information: 0.03
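
The low scores are informative: they are probabilities over the model's whole vocabulary, so the top five candidates together cover only a small share. A quick check using the numbers printed above (already rounded to two decimals, so the totals are approximate):

```python
# Scores copied from the output above (already rounded to two decimals).
preds = [
    ("applications", 0.06),
    ("ideas", 0.05),
    ("problems", 0.05),
    ("systems", 0.04),
    ("information", 0.03),
]

top5_mass = sum(score for _, score in preds)
print(f"Top-5 probability mass: {top5_mass:.2f}")         # 0.23
print(f"Left for all other tokens: {1 - top5_mass:.2f}")  # 0.77
```

A flat distribution like this suggests the sentence leaves the masked word wide open; a more constraining context would concentrate the probability on fewer candidates.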
