Applications of BERT

1. Text Representation: BERT generates contextual embeddings, so the vector for each word in a sentence depends on the words around it.

2. Named Entity Recognition (NER): BERT can be fine-tuned for named entity recognition tasks, where the goal is to identify entities such as names of people, organizations, locations, etc., in a given text.

3. Text Classification: BERT is widely used for text classification tasks, including sentiment analysis, spam detection, and topic categorization. It has demonstrated excellent performance in understanding and classifying the context of textual data.

4. Question-Answering Systems: BERT has been applied to question-answering systems, where the model is trained to understand the context of a question and provide relevant answers. This is particularly useful for tasks like reading comprehension.

5. Machine Translation: BERT's contextual embeddings can be leveraged for improving machine translation systems. The model captures the nuances of language that are crucial for accurate translation.

6. Text Summarization: BERT's encoder can be used for summarization, most directly extractive summarization, where sentences are scored and the most important ones are selected. Abstractive summarization needs a decoder on top, since BERT on its own does not generate text.

7. Conversational AI: BERT is employed in building conversational AI systems, such as chatbots, virtual assistants, and dialogue systems. Its ability to grasp context makes it effective for understanding user input (for example intent classification and slot filling); generating the response itself typically requires a separate decoder model.

8. Semantic Similarity: BERT embeddings can be used to measure semantic similarity between sentences or documents. This is valuable in tasks like duplicate detection, paraphrase identification, and information retrieval.

The **tokenizer.encode** method adds the special [CLS] (classification) and [SEP] (separator) tokens at the beginning and end of the encoded sequence, respectively.

In [None]:
import transformers
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
text = "Bert is the google model that was developed in 2018. i like this model becuse this is only encoder part that is not envolve in generation. "
encoding = tokenizer.encode(text)
print("Token IDs:", encoding)

# Convert token IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoding)
print("Tokens:", tokens)

Token IDs: [101, 15035, 1110, 1103, 1301, 8032, 1513, 2235, 1115, 1108, 1872, 1107, 1857, 119, 178, 1176, 1142, 2235, 1129, 6697, 1162, 1142, 1110, 1178, 4035, 13775, 1197, 1226, 1115, 1110, 1136, 4035, 6005, 23534, 1107, 3964, 119, 102]
Tokens: ['[CLS]', 'Bert', 'is', 'the', 'go', '##og', '##le', 'model', 'that', 'was', 'developed', 'in', '2018', '.', 'i', 'like', 'this', 'model', 'be', '##cus', '##e', 'this', 'is', 'only', 'en', '##code', '##r', 'part', 'that', 'is', 'not', 'en', '##vo', '##lve', 'in', 'generation', '.', '[SEP]']
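
The `##` pieces in the output above (for example `go`, `##og`, `##le`) come from WordPiece tokenization. The following is a minimal sketch of greedy longest-match-first splitting, not the library's implementation: `toy_vocab` is a hypothetical vocabulary invented for illustration, and words are split on whitespace only, whereas the real tokenizer also handles punctuation and uses the full `bert-base-cased` vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode_tokens(text, vocab):
    """Wrap the word pieces in [CLS] ... [SEP], as tokenizer.encode does."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary (assumption, not the real BERT vocabulary)
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "the", "go", "##og", "##le", "model"}

tokens = encode_tokens("the google model", toy_vocab)
print(tokens)
# ['[CLS]', 'the', 'go', '##og', '##le', 'model', '[SEP]']
```

A word that is missing from the vocabulary is broken into smaller known pieces rather than dropped, which is why a misspelling such as `becuse` still receives token IDs.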


# **Auto text completion**

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
import torch

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
forced_bos_token_id = 0
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
# BART fills in the <mask> span when it generates the output sequence
sent = "GeekforGeeks has a <mask> article on Bart."
tokenized_sent = tokenizer(sent, return_tensors='pt')
input_ids = tokenized_sent['input_ids']
generated_encoded = model.generate(input_ids, forced_bos_token_id=forced_bos_token_id)
generated_sent = tokenizer.decode(generated_encoded[0], skip_special_tokens=True)
print(generated_sent)

GeekforGeeks has a great article on Bart.


#**Toxic comment classification using BERT**

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from sklearn.metrics import accuracy_score
from datasets import load_dataset
import pandas as pd

In [None]:
from datasets import load_dataset

# Load dataset (small subset for demo)
dataset = load_dataset("civil_comments", split="train[:2000]")
dataset = dataset.train_test_split(test_size=0.2)

print(dataset)
print(dataset['train'][0])


DatasetDict({
    train: Dataset({
        features: ['text', 'toxicity', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack', 'sexual_explicit'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['text', 'toxicity', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack', 'sexual_explicit'],
        num_rows: 400
    })
})
{'text': "I've worked in public contracting for over a decade, as a contract manager for m employer, and as a community partner with it other  jurisdictions including Portland and Multnomah County. I have been dismayed, pretty much every time I've participated in Portland and Multnomah County, to find how they consistently fail to comply with state public bidding rules. My experience is that they know who they want to give the money to and they do with little to no fair access, they rarely articulate measurable outcomes, rarely monitor contracts t be sure taxpayers money is well spent, and almost never introduce consequ

**Preprocess (Tokenization)**

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Convert label into binary toxic (>=0.5 = toxic, else not toxic)
dataset = dataset.map(lambda x: {"labels": 1 if x["toxicity"] >= 0.5 else 0})
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
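
To make the preprocessing step concrete, here is a rough pure-Python sketch of what `padding="max_length"` with `truncation=True` and the `>= 0.5` label rule produce. The helper names `pad_or_truncate` and `to_binary_label` are hypothetical, the token IDs are illustrative, and the pad ID of 0 is an assumption about the vocabulary; the real tokenizer does all of this internally.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    if len(token_ids) > max_length:
        # Keep the final token so the sequence still ends with [SEP]
        ids = token_ids[:max_length - 1] + [token_ids[-1]]
    else:
        ids = list(token_ids)
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


def to_binary_label(toxicity, threshold=0.5):
    """Map a toxicity score in [0, 1] to 1 (toxic) or 0 (not toxic)."""
    return 1 if toxicity >= threshold else 0


short_ids = [101, 7592, 2088, 102]            # illustrative IDs
input_ids, attention_mask = pad_or_truncate(short_ids, max_length=8)
print(input_ids)       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]

binary_labels = [to_binary_label(t) for t in (0.0, 0.49, 0.5, 0.9)]
print(binary_labels)   # [0, 0, 1, 1]
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside `input_ids` in the training loop.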


**Dataset loader**


In [None]:
train_loader = DataLoader(dataset["train"], batch_size=16, shuffle=True)
test_loader = DataLoader(dataset["test"], batch_size=16)

# load bert model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

optimizer = AdamW(model.parameters(), lr=2e-5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Training Loop**

In [None]:
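# Note: the model computes the cross-entropy loss itself when `labels` is
# passed, so this separate loss function is not actually used below.
loss_fn = nn.CrossEntropyLoss()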
epochs = 2

for epoch in range(epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")


Epoch 1, Loss: 0.2384
Epoch 2, Loss: 0.1528
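
The loss printed above is the mean cross-entropy over batches. As a sketch of what that number measures, the snippet below computes cross-entropy by hand for a few two-class logit vectors. The logits and labels are made-up values for illustration, not outputs of the model trained above.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits for three comments: [not toxic, toxic]
batch_logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]
batch_labels = [0, 1, 1]

losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)
print([round(x, 4) for x in losses])
print(round(mean_loss, 4))
```

A confident correct prediction gives a loss near 0, while an undecided prediction such as `[0.0, 0.0]` gives `log(2)`, about 0.693, so a falling epoch loss means the model is assigning more probability to the correct class.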


##Evaluate the model

In [None]:
model.eval()
correct, total = 0, 0

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]

        outputs = model(input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=-1)

        correct += (preds == labels).sum().item()
        total += labels.size(0)

print("Test Accuracy:", correct/total)


Test Accuracy: 0.96
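
Accuracy alone can hide problems when one class is rare, which may be the case here if most comments fall below the 0.5 toxicity threshold (an assumption about this 2,000-row subset, not something verified above). The sketch below computes precision, recall, and F1 next to accuracy. The `labels` and `preds` lists are hypothetical and chosen to show the failure mode; they are not the notebook's actual predictions.

```python
def binary_metrics(preds, labels):
    """Accuracy, precision, recall and F1 for the positive class (1)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical case: 4 toxic comments out of 100, model predicts "not toxic" always
labels = [0] * 96 + [1] * 4
preds = [0] * 100

metrics = binary_metrics(preds, labels)
print(metrics)
# {'accuracy': 0.96, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

In the evaluation loop, the same function could be applied to the collected `preds` and `labels` after converting them to Python lists, to check whether the toxic class is actually being detected.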


#**Mini Project 1: Resume Parsing (Extract Names, Emails, Companies) Using Pipeline**

In [None]:
from transformers import pipeline
import re
from transformers.pipelines.token_classification import AggregationStrategy

# Load a pretrained NER model
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy=AggregationStrategy.SIMPLE)
# Example Resume text
resume_text = """
My name is John Doe. I have worked at Google for 5 years as a software engineer.
You can reach me at john.doe@gmail.com. Previously I worked with Microsoft.
"""

entities = ner_pipeline(resume_text)
print("Named entities")
for ent in entities:
  print(ent)

print("\n🔹 Extracted Info:")
for ent in entities:
    if ent["entity_group"] in ["PER", "ORG"]:  # Person / Organization
        print(f"{ent['entity_group']}: {ent['word']}")

# extract emails
emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", resume_text)

print("\nEmails:")
for email in emails:
    print(email)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Named entities
{'entity_group': 'PER', 'score': np.float32(0.99280906), 'word': 'John Doe', 'start': 12, 'end': 20}
{'entity_group': 'ORG', 'score': np.float32(0.9986141), 'word': 'Google', 'start': 39, 'end': 45}
{'entity_group': 'ORG', 'score': np.float32(0.9989907), 'word': 'Microsoft', 'start': 147, 'end': 156}

🔹 Extracted Info:
PER: John Doe
ORG: Google
ORG: Microsoft

Emails:
john.doe@gmail.com


#**🔹 Mini Project 2: Medical NER (Find Diseases, Symptoms, Drugs)**

In [None]:
from transformers import pipeline
from transformers.pipelines.token_classification import AggregationStrategy

# Load Biomedical NER model
med_ner_pipeline = pipeline("ner", model="Ishan0612/biobert-ner-disease-ncbi" ,aggregation_strategy=AggregationStrategy.SIMPLE)

clinical_text = """
The patient was diagnosed with diabetes and prescribed metformin.
He also complained of chest pain and shortness of breath.
"""
# Run NER
results = med_ner_pipeline(clinical_text)

for r in results:
  print(r)

Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'entity_group': 'LABEL_0', 'score': np.float32(0.9999809), 'word': 'the patient was diagnosed with', 'start': 1, 'end': 31}
{'entity_group': 'LABEL_1', 'score': np.float32(0.9801783), 'word': 'diabetes', 'start': 32, 'end': 40}
{'entity_group': 'LABEL_0', 'score': np.float32(0.9999629), 'word': 'and prescribed metformin. he also complained of', 'start': 41, 'end': 88}
{'entity_group': 'LABEL_1', 'score': np.float32(0.998831), 'word': 'chest', 'start': 89, 'end': 94}
{'entity_group': 'LABEL_2', 'score': np.float32(0.99876654), 'word': 'pain', 'start': 95, 'end': 99}
{'entity_group': 'LABEL_0', 'score': np.float32(0.9999709), 'word': 'and', 'start': 100, 'end': 103}
{'entity_group': 'LABEL_1', 'score': np.float32(0.99949765), 'word': 'short', 'start': 104, 'end': 109}
{'entity_group': 'LABEL_2', 'score': np.float32(0.9967515), 'word': '##ness of breath', 'start': 109, 'end': 123}
{'entity_group': 'LABEL_0', 'score': np.float32(0.99997866), 'word': '.', 'start': 123, 'end': 124}


In [None]:
# extract the information

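# Example label mapping (depends on model's training dataset).
# NOTE: these names are guesses. In the output above, 'chest' is LABEL_1 and
# 'pain' is LABEL_2, which suggests LABEL_1/LABEL_2 may be begin/inside tags
# of a single disease mention rather than separate DISEASE/SYMPTOM classes.
# Check med_ner_pipeline.model.config.id2label before relying on this mapping.
label_map = {
    "LABEL_0": " No Disease",          # Non-entity
    "LABEL_1": "DISEASE",
    "LABEL_2": "SYMPTOM",
}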

# Group results
cleaned = {}
for ent in results:
    label = ent["entity_group"]
    mapped = label_map.get(label, "UNKNOWN")
    if mapped not in cleaned:
        cleaned[mapped] = []
    cleaned[mapped].append(ent["word"])

# Display grouped results
print("🔹 Extracted Entities:")
for category, words in cleaned.items():
    print(f"{category}: {', '.join(words)}")

🔹 Extracted Entities:
 No Disease: the patient was diagnosed with, and prescribed metformin. he also complained of, and, .
DISEASE: diabetes, chest, short
SYMPTOM: pain, ##ness of breath
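
The grouped output splits mentions such as "chest pain" across two categories and leaves a `##` fragment in "##ness of breath". One possible reading, stated here as an assumption, is that `LABEL_1` marks the beginning of a disease mention and `LABEL_2` its continuation. Under that assumption, the sketch below merges adjacent spans using the `start`/`end` character offsets copied from the pipeline output above and slices the original text, which also avoids the `##` fragments. `merge_mentions` is a hypothetical helper, not part of `transformers`.

```python
clinical_text = """
The patient was diagnosed with diabetes and prescribed metformin.
He also complained of chest pain and shortness of breath.
"""

# (label, start, end) copied from the pipeline output shown earlier
spans = [
    ("LABEL_0", 1, 31),
    ("LABEL_1", 32, 40),
    ("LABEL_0", 41, 88),
    ("LABEL_1", 89, 94),
    ("LABEL_2", 95, 99),
    ("LABEL_0", 100, 103),
    ("LABEL_1", 104, 109),
    ("LABEL_2", 109, 123),
    ("LABEL_0", 123, 124),
]

# Assumption: LABEL_1 = begin of a mention, LABEL_2 = inside of a mention
BEGIN, INSIDE = "LABEL_1", "LABEL_2"


def merge_mentions(text, spans):
    """Join each BEGIN span with the INSIDE spans that directly follow it."""
    mentions, current = [], None
    for label, start, end in spans:
        if label == BEGIN:
            if current is not None:
                mentions.append(current)
            current = [start, end]
        elif label == INSIDE and current is not None:
            current[1] = end           # extend the open mention
        else:
            if current is not None:
                mentions.append(current)
            current = None
    if current is not None:
        mentions.append(current)
    # Slice the original text so casing and word pieces are restored
    return [text[s:e] for s, e in mentions]


mentions = merge_mentions(clinical_text, spans)
print(mentions)
# ['diabetes', 'chest pain', 'shortness of breath']
```

Note that "metformin" is not returned under either reading, which is consistent with a model trained only on disease mentions.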


#**Mini Project 3: Question Answering (QA)**

In [None]:
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = """
Elon Musk is the CEO of Tesla and SpaceX.
He founded SpaceX in 2002 with the goal of reducing space transportation costs.
Tesla, on the other hand, focuses on electric vehicles and sustainable energy.
"""
# Example questions
questions = [
    "Who is the CEO of Tesla?",
    "When was SpaceX founded?",
    "What does Tesla focus on?"
]

for q in questions:
  answer = qa_pipeline(question=q, context=context)
  print(f"Question: {q}\nAnswer: {answer['answer']}\n")


Device set to use cpu


Question: Who is the CEO of Tesla?
Answer: Elon Musk

Question: When was SpaceX founded?
Answer: 2002

Question: What does Tesla focus on?
Answer: electric vehicles and sustainable energy
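
Extractive QA models of this kind do not write an answer; they score every token as a possible answer start and answer end, and the answer is the best-scoring span of the context. The sketch below shows that selection step in isolation. The tokens and logits are invented for illustration, `best_span` is a hypothetical helper, and the real pipeline adds further steps (such as mapping tokens back to characters) that are omitted here.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logit + end_logit."""
    best = (None, None, float("-inf"))
    for i, s in enumerate(start_logits):
        # The end must not come before the start, and the span is length-limited
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Hypothetical context tokens and logits (not real model outputs)
tokens = ["He", "founded", "SpaceX", "in", "2002", "with", "the", "goal"]
start_logits = [0.1, 0.2, 1.0, 0.3, 4.0, 0.1, 0.0, 0.2]
end_logits = [0.0, 0.1, 0.8, 0.2, 4.5, 0.3, 0.1, 0.4]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)  # 2002
```

Because the answer is always a slice of the context, a question whose answer is not present in the context will still return some span, usually with a low score.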



#**🔹 Mini Project 4: Text Summarization**

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model= "facebook/bart-large-cnn")

# Example long text
article = """
Artificial Intelligence (AI) is rapidly transforming the healthcare industry.
From diagnostic imaging to personalized medicine, AI technologies are improving
accuracy and efficiency. For example, AI-powered tools can analyze X-rays and MRI
scans faster than humans. Additionally, predictive models help doctors identify
high-risk patients earlier. However, challenges like data privacy, ethical concerns,
and lack of transparency remain significant obstacles for large-scale adoption.
"""

# run summarizer
summary = summarizer(article, max_length=60, min_length=30, do_sample=False)

print("Original Text:\n", article)
print("Summary:",summary[0]['summary_text'])

Device set to use cpu


Original Text:
 
Artificial Intelligence (AI) is rapidly transforming the healthcare industry.
From diagnostic imaging to personalized medicine, AI technologies are improving
accuracy and efficiency. For example, AI-powered tools can analyze X-rays and MRI
scans faster than humans. Additionally, predictive models help doctors identify
high-risk patients earlier. However, challenges like data privacy, ethical concerns,
and lack of transparency remain significant obstacles for large-scale adoption.

Summary: Artificial Intelligence (AI) is rapidly transforming the healthcare industry. However, challenges like data privacy, ethical concerns,and lack of transparency remain significant obstacles for large-scale adoption.


#**🔹 Mini Project 5: Machine Translation**

In [None]:
from transformers import pipeline

# Load translation pipeline (English to French)
translator_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

# Load translation pipeline (English to German)
translator_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

# english to urdu
translator_ur = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ur")


# Example input text
text = "Artificial Intelligence is transforming the healthcare industry."

# translate
en_to_fr = translator_fr(text, max_length=100)
en_to_de = translator_de(text, max_length=100)
en_to_ur = translator_ur(text, max_length=100)

# Display results
print("Original Text:\n", text)
print("English to French Translation:", en_to_fr[0]['translation_text'])
print("English to German Translation:", en_to_de[0]['translation_text'])
print("English to Urdu Translation:", en_to_ur[0]['translation_text'])


Device set to use cpu

Original Text:
 Artificial Intelligence is transforming the healthcare industry.
English to French Translation: L'intelligence artificielle transforme l'industrie des soins de santé.
English to German Translation: Künstliche Intelligenz verändert die Gesundheitsbranche.
English to Urdu Translation: جڑی‌بوٹیاں صحت کی صنعت کو تبدیل کر رہی ہیں ۔


#**Mini Project 6: Semantic Similarity with BERT**

In [2]:
import sentence_transformers
from sentence_transformers import SentenceTransformer, util

In [8]:
# Load the pretrained sentence-embedding model (a SentenceTransformer, not a pipeline)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Example text pairs
sentences = [
    "I love machine learning and AI.",
    "Artificial Intelligence and machine learning are my passion.",
    "The sky is blue and full of clouds.",
    "He is reading a book on deep learning."
]

# encode the text into vectors
embeddings = model.encode(sentences, convert_to_tensor=True)
print("Shape of embeddings:", embeddings.shape)  # (num_sentences, vector_size)

# Compare similarity between first sentence and all others
cosine_scores = util.cos_sim(embeddings[0], embeddings)
for i in range(len(sentences)):
  print(f"Similarity('{sentences[0]}', '{sentences[i]}') = {cosine_scores[0][i]:.4f}")







Shape of embeddings: torch.Size([4, 384])
Similarity('I love machine learning and AI.', 'I love machine learning and AI.') = 1.0000
Similarity('I love machine learning and AI.', 'Artificial Intelligence and machine learning are my passion.') = 0.8129
Similarity('I love machine learning and AI.', 'The sky is blue and full of clouds.') = 0.1066
Similarity('I love machine learning and AI.', 'He is reading a book on deep learning.') = 0.3720
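
The scores above are cosine similarities: the dot product of two embedding vectors divided by the product of their lengths, so 1.0 means the same direction and values near 0 mean unrelated directions. A minimal pure-Python version, using tiny made-up vectors instead of the 384-dimensional embeddings, looks like this:

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (illustrative values only)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(round(sim_ab, 4))  # 1.0
print(round(sim_ac, 4))  # 0.0
```

Because the measure ignores vector length, `a` and `b` score 1.0 even though `b` is twice as long.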


#**Mini Project 7: Zero-Shot Classification with RoBERTa**
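RoBERTa is a robustly optimized variant of BERT 🚀: it keeps the same architecture but is pretrained longer on more data, with dynamic masking and without the next-sentence-prediction objective. The checkpoint used below, roberta-large-mnli, is RoBERTa fine-tuned on the MNLI natural language inference dataset.
It’s used for NLP tasks like:

- Text classification (sentiment, spam, topic)
- Named Entity Recognition (NER)
- Question answering
- Text summarization
- Semantic similarity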

In [14]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model = "FacebookAI/roberta-large-mnli")

text = "I recently bought a new iPhone and the camera quality is amazing."

candidate_labels = ["technology", "sports", "politics", "food"]

results = classifier(text, candidate_labels)
print(results)

# Customer Feedback Classification
feedbacks = [
    "The delivery was late and the package was damaged.",
    "I love the camera quality of this phone.",
    "The restaurant food was delicious but service was slow."
]

labels = ["delivery", "product quality", "food", "customer service"]

for fb in feedbacks:
  results = classifier(fb, labels)
  print(f"\ntext: {fb}")
  print("prediction:", results["labels"][0], "| scores:", round(results["scores"][0],2))



Some weights of the model checkpoint at FacebookAI/roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


{'sequence': 'I recently bought a new iPhone and the camera quality is amazing.', 'labels': ['technology', 'food', 'sports', 'politics'], 'scores': [0.9536327123641968, 0.01970263198018074, 0.015491437166929245, 0.01117322314530611]}

text: The delivery was late and the package was damaged.
prediction: delivery | scores: 0.77

text: I love the camera quality of this phone.
prediction: product quality | scores: 0.83

text: The restaurant food was delicious but service was slow.
prediction: food | scores: 0.67
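
Zero-shot classification with an NLI model works by turning each candidate label into a hypothesis (the pipeline's default template is "This example is {}.") and asking how strongly the text entails it. In single-label mode the entailment scores are then normalized across labels with a softmax. The sketch below reproduces only that last step; the entailment logits are hypothetical numbers, not outputs from `roberta-large-mnli`.

```python
import math


def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits, so the scores sum to 1."""
    m = max(entailment_logits.values())
    exps = {label: math.exp(z - m) for label, z in entailment_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


template = "This example is {}."
candidate_labels = ["technology", "sports", "politics", "food"]
hypotheses = [template.format(label) for label in candidate_labels]
print(hypotheses[0])  # This example is technology.

# Hypothetical entailment logits, one per hypothesis (assumed values)
logits = {"technology": 3.0, "sports": -1.0, "politics": -1.5, "food": -0.8}

scores = zero_shot_scores(logits)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['technology', 'food', 'sports', 'politics']
```

Since the scores are relative to the supplied labels, adding or removing a candidate label changes every score, and the top label is only the best of the options offered.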


#**Mini Project 8: Text Generation with BART**
BART = Bidirectional and Auto-Regressive Transformers.

👉 Think of it as a combo of BERT + GPT:

- Like BERT → a bidirectional encoder that understands context (good for encoding text).
- Like GPT → a left-to-right autoregressive decoder that generates text (good for decoding).

🔹 What it’s used for:

- Text Summarization
- Text Generation / Paraphrasing
- Translation
- Question Answering
- Zero-Shot Classification (via bart-large-mnli)

In [17]:
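from transformers import pipeline
# Note: facebook/bart-large-cnn is fine-tuned for news summarization, not for
# following instructions, so it tends to echo or summarize the prompt.
generator = pipeline("text2text-generation", model="facebook/bart-large-cnn")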

prompt = "Write a short motivational message about learning machine learning."

result = generator(prompt,max_length=90, do_sample=True, top_p=0.95)
print("Generated text:", result[0]['generated_text'])

# Story Generation
story_prompt = "Once upon a time in a futuristic city, an AI robot decided to"
story = generator(story_prompt, max_length=80, do_sample=True, top_p=0.9)
print("Generated Story:", story[0]['generated_text'])




Device set to use cpu


Generated text: Write a short motivational message about learning machine learning. Write a short message about how you can use machine learning to improve your life. Share your story with CNN iReport in the comments below or send us a video on Facebook and Twitter. Back to the page you came from.
Generated Story: An AI robot in a futuristic city decided to kill itself. The robot is now being cared for by a human. It was created by a team of scientists at a university in the U.S. The project is part of a larger project called "Artificial Intelligence in the City"
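
The calls above use `do_sample=True` with `top_p`, known as nucleus sampling: at each step only the smallest set of most likely tokens whose probabilities add up to at least `top_p` is kept, and the next token is drawn from that set. The sketch below shows one such step on a made-up distribution. `top_p_filter` and `sample` are hypothetical helpers, and the probabilities are invented for illustration.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample(probs, rng):
    """Draw one token according to the (renormalized) probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution (assumed values)
next_token_probs = {"learn": 0.5, "explore": 0.3, "sleep": 0.15, "explode": 0.05}

nucleus = top_p_filter(next_token_probs, top_p=0.9)
print(sorted(nucleus))  # ['explore', 'learn', 'sleep']

rng = random.Random(0)  # seeded so the draw is repeatable
choice = sample(nucleus, rng)
print(choice)
```

A lower `top_p` shrinks the candidate set and makes the output more predictable; a value close to 1.0 lets unlikely tokens through, which is one reason sampled generations differ from run to run.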
