## NAME : NANDAN D
## SRN : PES2UG23CS363
## SEC : F

# Unit 1 Hands-on: Generative AI & NLP Fundamentals

Welcome to your interactive guide to **Generative AI**. This notebook is designed to be a step-by-step tutorial, explaining not just *how* to code, but *why* we use these tools.


### Import Pipeline
This cell imports three tools from Hugging Face Transformers: `pipeline` (a high-level wrapper for model inference), `set_seed` (for reproducible outputs), and `GPT2Tokenizer` (which converts text to and from GPT-2 tokens and IDs).


In [1]:
from transformers import pipeline, set_seed, GPT2Tokenizer


### Import Utilities
We also need `nltk` for some traditional NLP tasks and `os` for file handling.


In [2]:
import os
import nltk


### Loading the Course Material
We will define the path to our course text file (`unit 1.txt`).


In [3]:
file_path = "/content/unit 1.txt"


In [4]:
try:
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    print("File loaded successfully!")
except FileNotFoundError:
    print(f"Error: '{file_path}' not found.")


File loaded successfully!


Let's print the first 500 characters to make sure we have the right data.


In [5]:
print("--- Data Preview ---")
print(text[:500] + "...")


--- Data Preview ---
Generative AI and Its Applications: A Foundational Briefing

Executive Summary

This document provides a comprehensive overview of Generative AI, synthesizing foundational concepts, technological underpinnings, and practical applications as outlined in the course materials from PES University. Generative AI represents a transformative subset of Artificial Intelligence focused on creating novel content, a capability primarily driven by the advent of Large Language Models (LLMs). The evolution of ...


## 2. Generative AI: Dumb vs. Smart Models

Generative AI creates new content (text, images, audio). But the quality depends heavily on the model's size and training.

We will compare two models:
1.  **`distilgpt2`**: A 'distilled' version. It is smaller, faster, and requires less memory, but it might be less coherent (a "Dumb" model for this comparison).
2.  **`gpt2`**: The standard version (The "Smart" model, though still small by modern standards).

**How to access a model?**
1.  Go to the Hugging Face Models page.
2.  Search for a task (e.g., 'Text Generation').
3.  Pick a model (e.g., `gpt2`).
4.  Copy the model name.


### Step 1: Set a Seed

A seed is a fixed starting value for the random number generator, so the model's "random" outputs (such as generated text) become reproducible: same seed + same settings = same result.

In [39]:
set_seed(60)
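
As a minimal sketch of why seeding works, the toy sampler below uses Python's standard `random` module rather than the model itself. The `sample_words` helper and its word list are hypothetical, made up for illustration; `set_seed` plays the same role for the random number generators that the model relies on.

```python
import random

def sample_words(seed, k=5):
    """Draw k words pseudo-randomly from a tiny made-up vocabulary."""
    rng = random.Random(seed)  # independent generator with a fixed starting state
    vocab = ["model", "data", "token", "text", "image", "audio"]
    return [rng.choice(vocab) for _ in range(k)]

first = sample_words(seed=60)
second = sample_words(seed=60)
print(first == second)  # True: same seed and same settings give the same draws
```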

### Step 2: Define a Prompt
Both models will complete this sentence.


In [36]:
prompt = "Generative AI is a revolutionary technology that"

### Step 3: Fast Model (`distilgpt2`)

In [40]:
# Initialize the pipeline with the specific model
fast_generator = pipeline('text-generation', model='distilgpt2')

# Generate text
output_fast = fast_generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(output_fast[0]['generated_text'])

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generative AI is a revolutionary technology that is now the basis of AI research in the field of artificial intelligence research and development.


### Step 4: Standard Model (`gpt2`)



In [41]:
smart_generator = pipeline('text-generation', model='gpt2')

output_smart = smart_generator(prompt, max_length=50, num_return_sequences=1)
print(output_smart[0]['generated_text'])


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generative AI is a revolutionary technology that advances the use of intelligent machines to discover new knowledge.

The first AI to enter the field was the Artificial Intelligence (AI) at Carnegie Mellon University in 1979.

Its technology is based on the concept of the "self" or "machine".

It incorporates a set of rules that allow intelligent machines to work on their own.

The AI can then choose to "learn" from these rules.

This process can only occur on the basis of knowledge, and has no limits.

This innovation has been called the "AI of the future" by the UK's Computer Science Association.

The AI of today is about 1,200 times more complex than it was in the 1980s.

It is based on learning how to use its own knowledge and to make an educated decision.

The second AI to become established was the Autonomous Vehicle (AV).

It is based on the concept of self-learning, and it uses these rules to develop new ones.

This technology has an enormous potential to revolutionise the way 

**Analysis**: Compare the two outputs. Does the standard model stay more on topic? Does the fast model drift into nonsense?
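
One caveat when comparing: Step 3 passes `max_new_tokens=50`, while Step 4 passes `max_length=50`, and the Step 4 log reports that a `max_new_tokens` value of 256 took precedence. The two outputs were therefore generated under different length limits. A minimal sketch of running both generators with identical settings follows. `compare_generators` is a hypothetical helper, and the stub lambdas stand in for `fast_generator` and `smart_generator` so the snippet runs without downloading any model.

```python
def compare_generators(generators, prompt, **gen_kwargs):
    """Call every generator with the same prompt and the same keyword arguments."""
    results = {}
    for name, generate in generators.items():
        output = generate(prompt, **gen_kwargs)
        results[name] = output[0]["generated_text"]
    return results

# Stub generators that mimic the pipeline's return shape (a list of dicts).
# In the notebook, pass fast_generator and smart_generator instead.
stubs = {
    "distilgpt2": lambda prompt, **kw: [{"generated_text": prompt + " [distilgpt2 text]"}],
    "gpt2": lambda prompt, **kw: [{"generated_text": prompt + " [gpt2 text]"}],
}

# One shared settings dict, so both models get the same length limit.
shared = {"max_new_tokens": 50, "num_return_sequences": 1}
for name, text_out in compare_generators(stubs, "Generative AI is", **shared).items():
    print(f"{name}: {text_out}")
```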


# **Observations**

## DistilGPT-2
I expected the smaller model to hallucinate more, but it performed reasonably well. It was much faster than GPT-2; however, its output was less coherent, less detailed, and less explanatory. I ran the experiment multiple times and still couldn't get output as long or as informative as GPT-2's. I also changed the seed and tried again, but I kept getting very similar outputs each time, unlike GPT-2, which showed more variation.

## GPT-2

This is a very good model for its small size: it is surprisingly capable and produces clear, explanatory output. As mentioned above, when I tried different seeds for both models, GPT-2 generated new, better-quality text each time, but it was slower than DistilGPT-2.
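
The speed difference noted above can be measured rather than estimated. The sketch below uses `time.perf_counter` from the standard library. `time_call` is a hypothetical helper, and the `sum` call is only a placeholder for a real call such as `fast_generator(prompt, max_new_tokens=50)`.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Placeholder workload; swap in a generator call to time a model.
total, seconds = time_call(sum, range(1_000_000))
print(f"took {seconds:.4f} s")
```

A single run can be noisy, so averaging several calls per model gives a steadier comparison.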

## 3. NLP Fundamentals: Under the Hood



### 3.1 Tokenization

In [10]:
# 1. Initialize the Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


In [11]:
sample_sentence = "Transformers revolutionized NLP."


Next, let's split it into tokens.


In [12]:
tokens = tokenizer.tokenize(sample_sentence)
print(f"Tokens: {tokens}")


Tokens: ['Transform', 'ers', 'Ġrevolution', 'ized', 'ĠN', 'LP', '.']
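
In the output above, `Ġ` marks a token that begins with a space in GPT-2's byte-level BPE vocabulary. As a simplified sketch, the hypothetical `detokenize` helper below rebuilds the sentence from the tokens. It handles only the space marker, not the full byte-to-character mapping.

```python
def detokenize(tokens):
    """Join GPT-2 style tokens, turning the 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")

# Tokens copied from the output above.
sample_tokens = ["Transform", "ers", "Ġrevolution", "ized", "ĠN", "LP", "."]
print(detokenize(sample_tokens))  # Transformers revolutionized NLP.
```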


Next, we convert the tokens to IDs.
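
A minimal sketch of the lookup idea, using a toy vocabulary: the IDs below are made up for illustration and are not GPT-2's real IDs. With the tokenizer loaded earlier, the real lookup would be `tokenizer.convert_tokens_to_ids(tokens)`.

```python
sample_tokens = ["Transform", "ers", "Ġrevolution", "ized", "ĠN", "LP", "."]

# Toy vocabulary: give each distinct token an arbitrary integer ID (hypothetical values).
toy_vocab = {tok: idx for idx, tok in enumerate(sorted(set(sample_tokens)))}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

ids = [toy_vocab[tok] for tok in sample_tokens]   # tokens -> IDs
restored = [id_to_token[i] for i in ids]          # IDs -> tokens

print(ids)
print(restored == sample_tokens)  # True: the mapping is reversible
```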


### 3.2 POS Tagging (Part-of-Speech)
**Why?** To understand grammar. Is 'book' a noun (the object) or a verb (to book a flight)?
**What?** We label each word as Noun (NN), Verb (VB), Adjective (JJ), etc.


In [19]:
# Download necessary NLTK data
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

True

In [20]:
nltk.download('punkt_tab', quiet=True)
pos_tags = nltk.pos_tag(nltk.word_tokenize(sample_sentence))
print(f"POS Tags: {pos_tags}")

POS Tags: [('Transformers', 'NNS'), ('revolutionized', 'VBD'), ('NLP', 'NNP'), ('.', '.')]
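
The tags above follow the Penn Treebank convention, where the first letters indicate the broad word class (`NN*` for nouns, `VB*` for verbs, `JJ*` for adjectives). As an illustrative sketch, the hypothetical `coarse_class` helper below groups the tagged output by that prefix.

```python
def coarse_class(tag):
    """Map a Penn Treebank tag to a broad word class using its two-letter prefix."""
    prefixes = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}
    return prefixes.get(tag[:2], "other")

# Tagged output copied from the cell above.
tagged = [("Transformers", "NNS"), ("revolutionized", "VBD"), ("NLP", "NNP"), (".", ".")]

for word, tag in tagged:
    print(f"{word:<15} {tag:<4} {coarse_class(tag)}")
```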


### 3.3 Named Entity Recognition (NER)
**Why?** To extract structured information like names, organizations, and dates.
**What?** We use a specific BERT model fine-tuned for the NER task.


In [21]:
# Initialize NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


Let's analyze the first 1,000 characters of our text.


In [22]:
snippet = text[:1000]
entities = ner_pipeline(snippet)

print(f"{'Entity':<20} | {'Type':<10} | {'Score':<5}")
print("-"*45)
for entity in entities:
    if entity['score'] > 0.90:
        print(f"{entity['word']:<20} | {entity['entity_group']:<10} | {entity['score']:.2f}")


Entity               | Type       | Score
---------------------------------------------
AI                   | MISC       | 0.98
PES University       | ORG        | 0.99
AI                   | MISC       | 0.98
Large Language Models | MISC       | 0.91
LLMs                 | MISC       | 0.90
Transformer          | MISC       | 0.99
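
For downstream use, the filtered entities are often easier to work with once grouped by type. A small sketch follows, using rows copied from the table above and a hypothetical `group_entities` helper.

```python
from collections import defaultdict

def group_entities(entities, min_score=0.90):
    """Group unique entity strings by type, keeping scores at or above min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        words = grouped[ent["entity_group"]]
        if ent["word"] not in words:  # skip repeated mentions such as "AI"
            words.append(ent["word"])
    return dict(grouped)

# Rows copied from the table above; scores are the printed, rounded values,
# so this sketch uses >= where the notebook cell uses > on the raw scores.
rows = [
    {"word": "AI", "entity_group": "MISC", "score": 0.98},
    {"word": "PES University", "entity_group": "ORG", "score": 0.99},
    {"word": "AI", "entity_group": "MISC", "score": 0.98},
    {"word": "Large Language Models", "entity_group": "MISC", "score": 0.91},
    {"word": "LLMs", "entity_group": "MISC", "score": 0.90},
    {"word": "Transformer", "entity_group": "MISC", "score": 0.99},
]

print(group_entities(rows))
```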


## 4. Advanced Applications: Comparative Analysis

### 4.1 Summarization: Efficiency vs. Quality

We will summarize a complex section about Transformer Architecture using two models:
1. **`distilbart-cnn-12-6`**: Optimized for speed.
2. **`bart-large-cnn`**: Optimized for performance.


In [23]:
# Let's extract a specific section for summarization
transformer_section = """
The introduction of the Transformer architecture in the 2017 paper "Attention is all you need" was a watershed moment in AI. It provided a more effective and scalable way to handle sequential data like text, replacing older, less efficient methods like recurrence (RNNs) and convolutions.
The fundamental innovation of the Transformer is the attention mechanism. This component allows the model to weigh the importance of different words (tokens) in the input sequence when making a prediction. In essence, for each word it processes, the model can "pay attention" to all other words in the input, helping it understand context, resolve ambiguity, and handle long-range dependencies. This is crucial for tasks like translation, summarization, and question answering.
The Transformer architecture consists of an encoder stack (to process the input) and a decoder stack (to generate the output), both of which heavily utilize multi-head attention and feed-forward networks.
"""


#### Fast Summarizer


In [43]:
fast_sum = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
res_fast = fast_sum(transformer_section, max_length=60, min_length=30, do_sample=False)
print(res_fast[0]['summary_text'])


Device set to use cuda:0


 The introduction of the Transformer architecture in the 2017 paper "Attention is all you need" was a watershed moment in AI . It provided a more effective and scalable way to handle sequential data like text, replacing older, less efficient methods like recurrence (RNNs) and conv


#### Quality Summarizer


In [42]:
smart_sum = pipeline("summarization", model="facebook/bart-large-cnn")
res_smart = smart_sum(transformer_section, max_length=60, min_length=30, do_sample=False)
print(res_smart[0]['summary_text'])


Device set to use cuda:0


The introduction of the Transformer architecture in the 2017 paper "Attention is all you need" was a watershed moment in AI. It provided a more effective and scalable way to handle sequential data like text.


* Using `sshleifer/distilbart-cnn-12-6`, the summary is generated faster, but in this run it stops mid-word ("conv"), apparently cut off by the `max_length=60` limit, and it never reaches the attention mechanism.
* Using `facebook/bart-large-cnn`, the summary is cleaner: it keeps the main idea ("Attention is all you need" as a watershed moment) and ends on a complete sentence, though it drops the mention of RNNs and convolutions. It runs slower and needs more compute.
* Overall, DistilBART is better for quick, efficient summarization, while BART-large is better when you want higher-quality summaries.
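
One quick check when comparing summaries is whether each one ends on a complete sentence. The sketch below applies a hypothetical `ends_cleanly` helper to the tails of the two outputs printed above. It is a rough heuristic, not a quality metric.

```python
def ends_cleanly(summary):
    """Return True if the text ends with sentence-final punctuation."""
    return summary.rstrip().endswith((".", "!", "?"))

# Final words of each summary, copied from the outputs above.
fast_tail = "less efficient methods like recurrence (RNNs) and conv"
smart_tail = "scalable way to handle sequential data like text."

print(ends_cleanly(fast_tail))   # False: the DistilBART output stops mid-word
print(ends_cleanly(smart_tail))  # True: the BART-large output ends on a full sentence
```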

### 4.2 Question Answering



In [26]:
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")


Device set to use cuda:0


Let's ask two questions about our text, including one about the risks it mentions.


In [27]:
questions = [
    "What is the fundamental innovation of the Transformer?",
    "What are the risks of using Generative AI?"
]

for q in questions:
    res = qa_pipeline(question=q, context=text[:5000])
    print(f"\nQ: {q}")
    print(f"A: {res['answer']}")



Q: What is the fundamental innovation of the Transformer?
A: to identify hidden patterns, structures, and relationships within the data

Q: What are the risks of using Generative AI?
A: data privacy, intellectual property, and academic integrity


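* The QA model answered both questions quickly by extracting phrases from the given context.
* For the first question it returned "to identify hidden patterns, structures, and relationships within the data", which does not match the course text's own statement that the Transformer's fundamental innovation is the attention mechanism. A likely cause is that the relevant passage falls outside the `text[:5000]` window passed as context, though this was not verified here. For the second question it listed data privacy, intellectual property, and academic integrity as risks.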
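
An extractive QA model can only return a span that exists inside the `context` it receives, and the loop above passes just `text[:5000]`. The sketch below shows the idea with a short stand-in document. `answer_in_window` is a hypothetical helper, and the stand-in text is invented apart from the sentence quoted from the Transformer section.

```python
def answer_in_window(document, phrase, limit):
    """Return True if phrase occurs entirely within the first `limit` characters."""
    return phrase in document[:limit]

# Stand-in document: filler characters followed by the sentence from the Transformer section.
doc = ("x" * 100) + " The fundamental innovation of the Transformer is the attention mechanism."

print(answer_in_window(doc, "attention mechanism", limit=50))        # False: phrase lies beyond the window
print(answer_in_window(doc, "attention mechanism", limit=len(doc)))  # True: whole document is visible
```

In the notebook, `answer_in_window(text, "attention mechanism", 5000)` would show whether the relevant passage was available to the model.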

### 4.3 Masked Language Modeling (The 'Fill-in-the-Blank' Game)

This is the core training objective of BERT. We hide a token (`[MASK]`) and ask the model to predict it based on context.


In [28]:
mask_filler = pipeline("fill-mask", model="bert-base-uncased")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [29]:
masked_sentence = "The goal of Generative AI is to create new [MASK]."
preds = mask_filler(masked_sentence)

for p in preds:
    print(f"{p['token_str']}: {p['score']:.2f}")


applications: 0.06
ideas: 0.05
problems: 0.05
systems: 0.04
information: 0.03


* The fill-mask (BERT) model uses the surrounding words to predict the missing [MASK] token, so it suggests likely completions for “The goal of Generative AI is to create new {token}.”
* The outputs (applications, ideas, problems, systems, information) are ranked by probability score, showing which word BERT thinks best fits the context (higher score = more likely).
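
The five scores are only the top of a probability distribution over BERT's whole vocabulary, so they do not sum to 1. A quick arithmetic sketch with the printed (rounded) values:

```python
# Scores copied from the output above (rounded to two decimals).
top_scores = {
    "applications": 0.06,
    "ideas": 0.05,
    "problems": 0.05,
    "systems": 0.04,
    "information": 0.03,
}

covered = sum(top_scores.values())
print(f"Top-5 probability mass: {covered:.2f}")
print(f"Mass on all other tokens: {1 - covered:.2f}")
```

The low total suggests the model is far from certain here: many different words fit the blank about equally well.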

# Conclusion

In this lab, I used Hugging Face Transformers to try text generation, tokenization, summarization, question answering, and fill‑mask prediction. DistilGPT‑2 was very fast and lightweight, but its generated text was usually less detailed and less coherent than GPT‑2. GPT‑2 produced more meaningful and explanatory responses, but it took more time to run. Setting a seed helped keep results consistent, and changing the seed caused the generated text to vary.

I also observed how GPT‑2 tokenization splits words into smaller subword pieces and uses symbols like Ġ to represent spaces. For summarization, DistilBART gave quicker but more compressed summaries, while BART‑large produced clearer and more complete summaries with slower performance. The question‑answering model extracted answers directly from the provided context and highlighted risks like privacy and intellectual property. Finally, the BERT fill‑mask task predicted the most likely missing word based on surrounding text and ranked options by confidence scores.