## NAME : NANDAN D
## SRN : PES2UG23CS363
## SEC : F

# Unit 1 Hands-on: Generative AI & NLP Fundamentals

Welcome to your interactive guide to **Generative AI**. This notebook is designed to be a step-by-step tutorial, explaining not just *how* to code, but *why* we use these tools.


In [1]:
from transformers import pipeline, set_seed, GPT2Tokenizer


In [2]:
import os
import nltk


In [3]:
file_path = "/content/unit 1.txt"


In [4]:
try:
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    print("File loaded successfully!")
except FileNotFoundError:
    print(f"Error: '{file_path}' not found.")


File loaded successfully!


In [5]:
print("--- Data Preview ---")
print(text[:500] + "...")


--- Data Preview ---
Generative AI and Its Applications: A Foundational Briefing

Executive Summary

This document provides a comprehensive overview of Generative AI, synthesizing foundational concepts, technological underpinnings, and practical applications as outlined in the course materials from PES University. Generative AI represents a transformative subset of Artificial Intelligence focused on creating novel content, a capability primarily driven by the advent of Large Language Models (LLMs). The evolution of ...


## 2. Generative AI: RoBERTa vs BART vs BERT


In [11]:
set_seed(42)

In [7]:
prompt = "The future of Artificial Intelligence is"
roberta = "roberta-base"
bart = "facebook/bart-base"
bert = "bert-base-uncased"

In [12]:
# Initialize the pipeline with the specific model
fast_generator = pipeline('text-generation', model=roberta)

output_fast = fast_generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(output_fast[0]['generated_text'])

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cuda:0


The future of Artificial Intelligence is


In [13]:
# Initialize the pipeline with the specific model
fast_generator = pipeline('text-generation', model=bart)

# Generate text
output_fast = fast_generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(output_fast[0]['generated_text'])

Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


The future of Artificial Intelligence is Cosby238 phyl Bradford *** Scholarship Scholarship assailant238238 Cosby Ahmadename Nazi/// � 560 Ahmad dominant Ahmad Morocco Nazi appendix Cosby lbs Dani piled DaniDNA Nazi downgrade Bradford warranted commentator Nazi lbs layoffs Nazi boost Nazi Nazi Nazi boostzzoValues boost appendixomething�


In [14]:
smart_generator = pipeline('text-generation', model=bert)

output_smart = smart_generator(prompt, max_length=50, num_return_sequences=1)
print(output_smart[0]['generated_text'])

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is................................................................................................................................................................................................................................................................


## 3. NLP Fundamentals: Under the Hood



### 3.1 Tokenization

In [15]:
# 1. Initialize the Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


In [16]:
sample_sentence = "Transformers revolutionized NLP."


Next, we split the sentence into tokens.


In [17]:
tokens = tokenizer.tokenize(sample_sentence)
print(f"Tokens: {tokens}")


Tokens: ['Transform', 'ers', 'Ġrevolution', 'ized', 'ĠN', 'LP', '.']
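
The `Ġ` prefix in the output above is how GPT-2's byte-level BPE vocabulary displays a leading space, so a token starting with `Ġ` begins a new whitespace-separated chunk. The sketch below shows how to read that marker. The helper names `detokenize` and `group_into_words` are hypothetical, and the string replacement is only a reasonable shortcut for plain ASCII text like this sentence; for real decoding, the tokenizer's own `convert_tokens_to_string` is the safer choice.

```python
# Tokens copied from the output above.
tokens = ['Transform', 'ers', 'Ġrevolution', 'ized', 'ĠN', 'LP', '.']


def detokenize(gpt2_tokens):
    """Join GPT-2-style tokens and turn the 'Ġ' marker back into a space."""
    return "".join(gpt2_tokens).replace("Ġ", " ")


def group_into_words(gpt2_tokens):
    """Group subword pieces: a token starting with 'Ġ' opens a new chunk."""
    words = []
    for tok in gpt2_tokens:
        if tok.startswith("Ġ") or not words:
            words.append([tok])
        else:
            words[-1].append(tok)
    return words


print(detokenize(tokens))
print(group_into_words(tokens))
```

Note that the final `.` stays attached to the `NLP` chunk, because no space precedes it in the original sentence.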


### 3.2 POS Tagging (Part-of-Speech)
**Why?** To understand grammar. Is 'book' a noun (the object) or a verb (to book a flight)?
**What?** We label each word as Noun (NN), Verb (VB), Adjective (JJ), etc.
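
To make the 'book' example concrete, here is a toy sketch that picks a tag by looking only at the previous word. This is an assumption-laden illustration, not how NLTK works: the tagger used below is a trained averaged perceptron, and the word lists and the function name `tag_book` are made up for this example.

```python
# Toy heuristic only: the cue lists below are hand-picked for illustration.
DETERMINERS = {"a", "an", "the", "this", "that", "my", "your"}
VERB_CUES = {"to", "will", "can", "should", "must", "please"}


def tag_book(sentence):
    """Return 'NN' or 'VB' for each occurrence of 'book' in the sentence."""
    words = sentence.lower().rstrip(".?!").split()
    tags = []
    for i, word in enumerate(words):
        if word != "book":
            continue
        prev = words[i - 1] if i > 0 else ""
        if prev in DETERMINERS:
            tags.append("NN")   # "the book" -> noun
        elif prev in VERB_CUES or i == 0:
            tags.append("VB")   # "to book", or an imperative "Book ..."
        else:
            tags.append("NN")   # fallback: assume a noun
    return tags


print(tag_book("I read the book."))
print(tag_book("I want to book a flight."))
```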


In [18]:
# Download necessary NLTK data
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

True

In [19]:
nltk.download('punkt_tab', quiet=True)
pos_tags = nltk.pos_tag(nltk.word_tokenize(sample_sentence))
print(f"POS Tags: {pos_tags}")

POS Tags: [('Transformers', 'NNS'), ('revolutionized', 'VBD'), ('NLP', 'NNP'), ('.', '.')]


### 3.3 Named Entity Recognition (NER)

In [20]:
# Initialize NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [21]:
snippet = text[:1000]
entities = ner_pipeline(snippet)

print(f"{'Entity':<20} | {'Type':<10} | {'Score':<5}")
print("-"*45)
for entity in entities:
    if entity['score'] > 0.90:
        print(f"{entity['word']:<20} | {entity['entity_group']:<10} | {entity['score']:.2f}")


Entity               | Type       | Score
---------------------------------------------
AI                   | MISC       | 0.98
PES University       | ORG        | 0.99
AI                   | MISC       | 0.98
Large Language Models | MISC       | 0.91
LLMs                 | MISC       | 0.90
Transformer          | MISC       | 0.99


## 4. Advanced Applications: Comparative Analysis

### 4.1 Summarization

In [22]:
# Let's extract a specific section for summarization
transformer_section = """
The introduction of the Transformer architecture in the 2017 paper "Attention is all you need" was a watershed moment in AI. It provided a more effective and scalable way to handle sequential data like text, replacing older, less efficient methods like recurrence (RNNs) and convolutions.
The fundamental innovation of the Transformer is the attention mechanism. This component allows the model to weigh the importance of different words (tokens) in the input sequence when making a prediction. In essence, for each word it processes, the model can "pay attention" to all other words in the input, helping it understand context, resolve ambiguity, and handle long-range dependencies. This is crucial for tasks like translation, summarization, and question answering.
The Transformer architecture consists of an encoder stack (to process the input) and a decoder stack (to generate the output), both of which heavily utilize multi-head attention and feed-forward networks.
"""

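The passage stored in `transformer_section` describes attention as weighing the importance of every token when processing each token. A minimal pure-Python sketch of scaled dot-product attention, softmax(QKᵀ / √d_k)·V, may help make that concrete. It assumes a single head, no learned projections, and three made-up 2-dimensional token vectors; real models learn separate query, key, and value matrices and run many heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up token vectors, used as queries, keys, and values (self-attention).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, including itself.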

In [33]:
try:
    fast_sum = pipeline("summarization", model=roberta)
    res_fast = fast_sum(transformer_section, max_length=60, min_length=30, do_sample=False)
    print(res_fast[0]['summary_text'])
except Exception as error:
    print(error)

Device set to use cuda:0


'SummarizationPipeline' object has no attribute 'assistant_model'


In [25]:
smart_sum = pipeline("summarization", model=bart)
res_smart = smart_sum(transformer_section, max_length=60, min_length=30, do_sample=False)
print(res_smart[0]['summary_text'])


Device set to use cuda:0
Both `max_new_tokens` (=256) and `max_length`(=60) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The introduction of the Transformer architecture in the 2017 paper "Attention is all you need" was a watershed moment in AI. It provided a more effective and scalable way to handle sequential data like text, replacing older, less efficient methods like recurrence (RNNs) and convolutions. The Transformer was also the first to implement a multi-head attention mechanism in a neural network. The implementation of this architecture is called the "transformer" architecture.The fundamental innovation of theTransformer is the attention mechanism. This component allows the model to weigh the importance of different words (tokens) in the input sequence when making a prediction. In essence, for each word it processes, the model can "pay attention" to all other words in the output, helping it understand context, resolve ambiguity, and handle long-range dependencies. This is crucial for tasks like translation, summarization, and question answering.The Transformer ArchitectureThe TransTransformer ar

In [32]:
try:
    smart_sum = pipeline("summarization", model=bert)
    res_smart = smart_sum(transformer_section, max_length=60, min_length=30, do_sample=False)
    print(res_smart[0]['summary_text'])
except Exception as error:
    print(error)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


'SummarizationPipeline' object has no attribute 'assistant_model'


### 4.2 Question Answering



In [31]:
qa_pipeline1 = pipeline("question-answering", model=roberta)
qa_pipeline2 = pipeline("question-answering", model=bart)
qa_pipeline3 = pipeline("question-answering", model=bert)


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


In [38]:
questions = [
    "Generative AI poses significant risks such as hallucinations, bias, and deepfakes.",
    "What are the risks?"
]

# Run every question through each model, labelling each answer with the model name.
qa_models = [(roberta, qa_pipeline1), (bart, qa_pipeline2), (bert, qa_pipeline3)]

for name, qa in qa_models:
    for q in questions:
        res = qa(question=q, context=text[:5000])
        print('\n', name)
        print(f"\nQ: {q}")
        print(f"A: {res['answer']}")




 roberta-base

Q: Generative AI poses significant risks such as hallucinations, bias, and deepfakes.
A: composed of many layers (hence

 roberta-base

Q: What are the risks?
A: composed of many layers (hence

 facebook/bart-base

Q: Generative AI poses significant risks such as hallucinations, bias, and deepfakes.
A: -Speech (POS) tagging and Named Entity Recognition (NER

 facebook/bart-base

Q: What are the risks?
A: distribution of the data for each class (i.

 bert-base-uncased

Q: Generative AI poses significant risks such as hallucinations, bias, and deepfakes.
A: , the model is trained on labeled data, meaning each

 bert-base-uncased

Q: What are the risks?
A: , the model is trained on labeled data, meaning each


### 4.3 Masked Language Modeling (The 'Fill-in-the-Blank' Game)

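This is the core pre-training objective of BERT, and RoBERTa is trained the same way. We hide a token behind a mask placeholder (`[MASK]` for BERT, `<mask>` for RoBERTa and BART) and ask the model to predict it from the surrounding context.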
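
To show the shape of the task (one hidden token in, a ranked list of scored candidates out), here is a toy sketch that fills the blank by counting which words appear between the same left and right neighbours in a tiny made-up corpus. This is only an illustration under strong assumptions: a real masked language model uses a Transformer encoder over the whole sentence, not neighbour counts, and the corpus and the function name `toy_fill_mask` are invented for this example.

```python
from collections import Counter

# Tiny made-up corpus; real models are pre-trained on vastly more text.
corpus = [
    "the goal is to create new content",
    "we want to create new tools",
    "the plan is to generate new ideas",
    "they like to create new music",
    "she hopes to find new bugs",
]


def toy_fill_mask(sentence, corpus, mask="[MASK]"):
    """Rank candidate words by how often they sit between the same neighbours."""
    words = sentence.split()
    i = words.index(mask)
    left = words[i - 1] if i > 0 else None
    right = words[i + 1] if i + 1 < len(words) else None

    counts = Counter()
    for line in corpus:
        toks = line.split()
        for j, tok in enumerate(toks):
            l = toks[j - 1] if j > 0 else None
            r = toks[j + 1] if j + 1 < len(toks) else None
            if l == left and r == right:
                counts[tok] += 1

    total = sum(counts.values())
    if total == 0:
        return []
    # Normalise counts so the scores sum to 1, like the pipeline's scores.
    return [(word, count / total) for word, count in counts.most_common()]


preds = toy_fill_mask("The goal of Generative AI is to [MASK] new content.", corpus)
for word, score in preds:
    print(f"{word}: {score:.2f}")
```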


In [40]:
mask_filler1 = pipeline('fill-mask', model=roberta)
mask_filler2 = pipeline('fill-mask', model=bart)
mask_filler3 = pipeline('fill-mask', model=bert)

Device set to use cuda:0
Device set to use cuda:0
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [42]:
masked_sentence_roberta_bart = "The goal of Generative AI is to <mask> new content."
masked_sentence_bert = "The goal of Generative AI is to [MASK] new content."

preds1 = mask_filler1(masked_sentence_roberta_bart)
preds2 = mask_filler2(masked_sentence_roberta_bart)
preds3 = mask_filler3(masked_sentence_bert)

print("--- RoBERTa Predictions ---")
for pred in preds1:
    print(f"{pred['token_str']}: {pred['score']:.2f}")

print("\n--- BART Predictions ---")
for pred in preds2:
    print(f"{pred['token_str']}: {pred['score']:.2f}")

print("\n--- BERT Predictions ---")
for pred in preds3:
    print(f"{pred['token_str']}: {pred['score']:.2f}")

--- RoBERTa Predictions ---
 generate: 0.37
 create: 0.37
 discover: 0.08
 find: 0.02
 provide: 0.02

--- BART Predictions ---
 create: 0.07
 help: 0.07
 provide: 0.06
 enable: 0.04
 improve: 0.03

--- BERT Predictions ---
create: 0.54
generate: 0.16
produce: 0.05
develop: 0.04
add: 0.02
