In [43]:
# Presented by Ahmed Baari on 25 October 2024

# What can BERT do?
In this notebook, let's explore what BERT can do by applying it to a variety of NLP tasks. We will use the `transformers` library to load fine-tuned models (BERT and close relatives such as DistilBERT and DistilBART) and apply them to the following tasks:
1. Text Classification
2. Named Entity Recognition
3. Question Answering
4. Text Summarization
5. Translation (left as an exercise)

In [44]:
from transformers import pipeline

## Text Classification

In [45]:
sentiment_pipeline = pipeline('text-classification',
                              model='nlptown/bert-base-multilingual-uncased-sentiment',
                              tokenizer='nlptown/bert-base-multilingual-uncased-sentiment',
                              device=0) # 0 is the GPU index

sentiment_pipeline.model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1
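
The printed module tree gives enough shape information to estimate where the parameters sit. The following is a rough sketch based only on the layers visible above (the printout is cut off after the first attention block, so the rest of the encoder is not counted). The helper names are illustrative and are not part of `transformers`.

```python
# Parameter counts derived from the shapes in the printout above.
HIDDEN = 768


def embedding_params(num_rows: int, dim: int = HIDDEN) -> int:
    """An Embedding(num_rows, dim) table holds num_rows * dim weights."""
    return num_rows * dim


def linear_params(in_features: int, out_features: int) -> int:
    """A Linear layer with bias holds a weight matrix plus one bias per output."""
    return in_features * out_features + out_features


def layer_norm_params(dim: int = HIDDEN) -> int:
    """LayerNorm with elementwise_affine=True holds a scale and a shift per feature."""
    return 2 * dim


embeddings_total = (
    embedding_params(105879)   # word_embeddings
    + embedding_params(512)    # position_embeddings
    + embedding_params(2)      # token_type_embeddings
    + layer_norm_params()      # LayerNorm inside the embeddings block
)

# query, key, value, and the attention output dense layer are all 768 -> 768
attention_projections = 4 * linear_params(HIDDEN, HIDDEN)

print(f"embeddings: {embeddings_total:,}")
print(f"attention projections per layer: {attention_projections:,}")
```

The multilingual vocabulary (105,879 rows) accounts for about 81 million parameters on its own, far more than the attention projections of a single layer.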

In [46]:
sentiment_pipeline('NLP Class is amazing!')

[{'label': '5 stars', 'score': 0.8126146197319031}]

In [47]:
# Only the last expression in a cell is displayed, so this first result is not shown.
sentiment_pipeline('NLP Class is actually good')
sentiment_pipeline('NLP Class is good')

[{'label': '4 stars', 'score': 0.45880571007728577}]

In [48]:
sentiment_pipeline('NLP Class is okay')

[{'label': '3 stars', 'score': 0.7211390733718872}]

In [49]:
sentiment_pipeline('NLP Class is boring')

[{'label': '2 stars', 'score': 0.46322232484817505}]

In [50]:
sentiment_pipeline('NLP Class is terrible')

[{'label': '1 star', 'score': 0.7699272036552429}]
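
The scores above vary quite a bit: the model is much less sure about "good" and "boring" than about "amazing" or "terrible". As a small sketch, the displayed results can be collected into one table with uncertain predictions flagged. The 0.5 threshold is an arbitrary choice for illustration, and `stars` is a helper introduced here, not part of the pipeline.

```python
# Results copied from the five single-sentence cells above.
results = [
    ("NLP Class is amazing!", {"label": "5 stars", "score": 0.8126146197319031}),
    ("NLP Class is good", {"label": "4 stars", "score": 0.45880571007728577}),
    ("NLP Class is okay", {"label": "3 stars", "score": 0.7211390733718872}),
    ("NLP Class is boring", {"label": "2 stars", "score": 0.46322232484817505}),
    ("NLP Class is terrible", {"label": "1 star", "score": 0.7699272036552429}),
]


def stars(label: str) -> int:
    """Turn a label such as '4 stars' or '1 star' into the integer 4 or 1."""
    return int(label.split()[0])


CONFIDENCE_THRESHOLD = 0.5  # arbitrary cutoff, chosen only for illustration

low_confidence = []
for sentence, prediction in results:
    uncertain = prediction["score"] < CONFIDENCE_THRESHOLD
    if uncertain:
        low_confidence.append(sentence)
    marker = "  <- low confidence" if uncertain else ""
    print(f"{stars(prediction['label'])} | {prediction['score']:.2f} | {sentence}{marker}")
```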

In [51]:
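# Note: `return_all_scores` is deprecated in recent `transformers` releases in favor of `top_k=None`.
sentiment_pipeline('The 5th semester has finally come to an end', return_all_scores=True)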

[[{'label': '1 star', 'score': 0.054048508405685425},
  {'label': '2 stars', 'score': 0.08055774867534637},
  {'label': '3 stars', 'score': 0.15945588052272797},
  {'label': '4 stars', 'score': 0.36899498105049133},
  {'label': '5 stars', 'score': 0.3369428813457489}]]
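
With the full distribution available, the top label is not the only useful summary. One option, sketched here on the scores shown above, is the probability-weighted average rating. The helper name `label_to_stars` is introduced for this example.

```python
# Scores copied from the output above (inner list of the nested result).
all_scores = [
    {"label": "1 star", "score": 0.054048508405685425},
    {"label": "2 stars", "score": 0.08055774867534637},
    {"label": "3 stars", "score": 0.15945588052272797},
    {"label": "4 stars", "score": 0.36899498105049133},
    {"label": "5 stars", "score": 0.3369428813457489},
]


def label_to_stars(label: str) -> int:
    """'4 stars' -> 4, '1 star' -> 1."""
    return int(label.split()[0])


# Most probable label
top = max(all_scores, key=lambda item: item["score"])

# Expected rating: sum of (stars * probability) over all five classes
expected_stars = sum(label_to_stars(item["label"]) * item["score"] for item in all_scores)

print(f"top label: {top['label']} ({top['score']:.2f})")
print(f"expected rating: {expected_stars:.2f} stars")
```

Here the 4-star and 5-star classes are nearly tied, so the expected rating lands between them even though the top label is "4 stars".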

## Named Entity Recognition
Extract entities such as organizations, locations, and people from the text.

In [64]:
text = """The Shanmugha Arts, Science, Technology & Research Academy, also known as SASTRA, is a private and deemed university in the town of Thirumalaisamudram, Thanjavur district, Tamil Nadu, India. SASTRA is ranked by global ranking agencies such as Times Higher Education and QS. It offers undergraduate, post graduate and doctoral courses in Engineering, Science, Education, Management, Law and the Arts."""

In [65]:
ner_pipeline = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is

In [66]:
entities = ner_pipeline(text, aggregation_strategy="simple")
entities

[{'entity_group': 'ORG',
  'score': 0.97421646,
  'word': 'Shanmugha Arts, Science, Technology & Research Academy',
  'start': 4,
  'end': 58},
 {'entity_group': 'ORG',
  'score': 0.99745464,
  'word': 'SASTRA',
  'start': 74,
  'end': 80},
 {'entity_group': 'LOC',
  'score': 0.9783507,
  'word': 'Thirumalaisamudram',
  'start': 132,
  'end': 150},
 {'entity_group': 'LOC',
  'score': 0.97977877,
  'word': 'Thanjavur',
  'start': 152,
  'end': 161},
 {'entity_group': 'LOC',
  'score': 0.997586,
  'word': 'Tamil Nadu',
  'start': 172,
  'end': 182},
 {'entity_group': 'LOC',
  'score': 0.9825057,
  'word': 'India',
  'start': 184,
  'end': 189},
 {'entity_group': 'ORG',
  'score': 0.9949009,
  'word': 'SASTRA',
  'start': 191,
  'end': 197},
 {'entity_group': 'ORG',
  'score': 0.99622416,
  'word': 'Times Higher Education',
  'start': 243,
  'end': 265},
 {'entity_group': 'ORG',
  'score': 0.9934008,
  'word': 'QS',
  'start': 270,
  'end': 272}]
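
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. The sketch below checks this on the first part of the passage and marks the spans inline. The `annotate` helper and its bracket format are made up for this example.

```python
# First part of the passage from the cell above; the offsets index into this string.
text = (
    "The Shanmugha Arts, Science, Technology & Research Academy, "
    "also known as SASTRA, is a private and deemed university"
)

# First two entities from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "ORG",
     "word": "Shanmugha Arts, Science, Technology & Research Academy",
     "start": 4, "end": 58},
    {"entity_group": "ORG", "word": "SASTRA", "start": 74, "end": 80},
]


def annotate(text: str, entities: list) -> str:
    """Wrap each entity span as [span](TYPE).

    Spans are processed right to left so that earlier offsets stay valid
    while the string grows.
    """
    annotated = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        span = annotated[ent["start"]:ent["end"]]
        annotated = (
            annotated[:ent["start"]]
            + f"[{span}]({ent['entity_group']})"
            + annotated[ent["end"]:]
        )
    return annotated


# Slicing with the offsets reproduces the reported words.
for ent in entities:
    assert text[ent["start"]:ent["end"]] == ent["word"]

print(annotate(text, entities))
```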

In [67]:
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")

Shanmugha Arts, Science, Technology & Research Academy: ORG (0.97)
SASTRA: ORG (1.00)
Thirumalaisamudram: LOC (0.98)
Thanjavur: LOC (0.98)
Tamil Nadu: LOC (1.00)
India: LOC (0.98)
SASTRA: ORG (0.99)
Times Higher Education: ORG (1.00)
QS: ORG (0.99)
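
"SASTRA" appears twice in the list because it is mentioned twice in the passage. If only the distinct names per entity type are needed, a simple grouping step is enough. This is a minimal sketch on the words and groups printed above.

```python
from collections import defaultdict

# (word, entity_group) pairs copied from the printed output above.
entities = [
    {"word": "Shanmugha Arts, Science, Technology & Research Academy", "entity_group": "ORG"},
    {"word": "SASTRA", "entity_group": "ORG"},
    {"word": "Thirumalaisamudram", "entity_group": "LOC"},
    {"word": "Thanjavur", "entity_group": "LOC"},
    {"word": "Tamil Nadu", "entity_group": "LOC"},
    {"word": "India", "entity_group": "LOC"},
    {"word": "SASTRA", "entity_group": "ORG"},
    {"word": "Times Higher Education", "entity_group": "ORG"},
    {"word": "QS", "entity_group": "ORG"},
]

# Group by entity type, keeping first-seen order and dropping repeats.
by_type = defaultdict(list)
for ent in entities:
    names = by_type[ent["entity_group"]]
    if ent["word"] not in names:
        names.append(ent["word"])

for group, names in by_type.items():
    print(f"{group}: {', '.join(names)}")
```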


## Question Answering

In [68]:
qa_pipeline = pipeline("question-answering",
                       device=0)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [69]:
# Load the context passage for question answering
with open("chandrayaan.txt", encoding="utf-8") as f:
    text = f.read()

text[:100]

'It was July 2019 when I, a child, was about to watch the livestream of an Indian rocket launch for t'

In [70]:
question = "Who found water on the moon?"

outputs = qa_pipeline(question=question, context=text)

for key, value in outputs.items():
    print(f"{key}: {value}")

score: 0.8325778245925903
start: 4682
end: 4698
answer: The Soviet Union
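
An extractive QA model returns a span of the context, and a high score does not guarantee that the span really answers the question. Since `start` and `end` are character offsets into the context, printing the surrounding text is a quick way to check. The sketch below uses a toy context and a hypothetical output dict shaped like the one above. `answer_in_context` is a helper written for this example.

```python
def answer_in_context(context: str, output: dict, margin: int = 40) -> str:
    """Return the answer span with up to `margin` characters of context on each side.

    The answer itself is wrapped in >> << so it stands out.
    """
    left = max(0, output["start"] - margin)
    right = min(len(context), output["end"] + margin)
    return (
        context[left:output["start"]]
        + ">>" + context[output["start"]:output["end"]] + "<<"
        + context[output["end"]:right]
    )


# Toy data (hypothetical), standing in for `text` and `outputs` in the notebook.
toy_context = "The notebook reads a long essay. The highlighted phrase sits in the middle of it."
answer = "highlighted phrase"
start = toy_context.index(answer)
toy_output = {"score": 0.9, "start": start, "end": start + len(answer), "answer": answer}

print(answer_in_context(toy_context, toy_output, margin=15))
```

In the notebook, the equivalent call would be `answer_in_context(text, outputs)`.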


## Text Summarization

In [71]:
summarization_pipeline = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
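
The warnings above point at two things that can be made explicit: which model and revision to load, and which device to run on. A possible sketch follows. The model names and revisions are copied from the log messages in this notebook. `pick_device` and `build_pipeline` are helpers made up for this example, and the code assumes the short revision hashes from the logs are accepted as-is.

```python
# Model names and revisions copied from the "No model was supplied" messages above.
PINNED_MODELS = {
    "ner": ("dbmdz/bert-large-cased-finetuned-conll03-english", "4c53496"),
    "question-answering": ("distilbert/distilbert-base-cased-distilled-squad", "564e9b5"),
    "summarization": ("sshleifer/distilbart-cnn-12-6", "a4f8f3e"),
}


def pick_device() -> int:
    """Return 0 (first GPU) when PyTorch can see CUDA, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def build_pipeline(task: str):
    """Create a pipeline with an explicit model, revision, and device."""
    # Imported lazily; the model is downloaded on first use.
    from transformers import pipeline

    model, revision = PINNED_MODELS[task]
    return pipeline(task, model=model, revision=revision, device=pick_device())


print(f"device index: {pick_device()}")
```

With this in place, `summarization_pipeline = build_pipeline("summarization")` would load the same checkpoint as above and use the GPU when one is available.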


In [72]:
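# Summarize only the first 1023 characters (a character cut, not a token cut) to keep the input short
outputs = summarization_pipeline(text[:1023], clean_up_tokenization_spaces=True)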

summary = outputs[0]['summary_text']
summary

" Chandrayaan-2 is India's prestigious moon mission to map and study the variations in lunar surface composition, as well as the location and abundance of lunar water. The Chandraysaan Series is the names of India's lunar exploration missions. Each mission had different objectives and achievements, but they all shared the common goal of advancing India's scientific and technological capabilities in space."

In [73]:
import textwrap

wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)


In [74]:
print(wrapper.fill(summary))

 Chandrayaan-2 is India's prestigious moon mission to map and study the
variations in lunar surface composition, as well as the location and abundance
of lunar water. The Chandraysaan Series is the names of India's lunar
exploration missions. Each mission had different objectives and achievements,
but they all shared the common goal of advancing India's scientific and
technological capabilities in space.
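
Slicing with `text[:1023]` summarizes only the beginning of the document and may cut the input mid-sentence. One alternative, sketched below, is to split the text into sentence-aligned chunks and summarize each chunk in turn. The `chunk_by_sentence` helper and the character budget are assumptions for illustration. The model's real limit is measured in tokens, not characters.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First sentence here. Second sentence follows! Is the third a question? Fourth one ends it."
chunks = chunk_by_sentence(sample, max_chars=45)
for chunk in chunks:
    print(len(chunk), chunk)
```

In the notebook, each chunk could then be passed to `summarization_pipeline` and the partial summaries joined, at the cost of one model call per chunk.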


## Translation

Left as an exercise: try it yourself with the `translation` task.
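
A possible starting point, for anyone who wants a hint: the translation pipeline returns a list of dicts with a `translation_text` field. The sketch below assumes the `Helsinki-NLP/opus-mt-en-fr` checkpoint (English to French) as one option among many. `build_translator` and `translate_all` are helper names invented here.

```python
def build_translator(model_name: str = "Helsinki-NLP/opus-mt-en-fr"):
    """Load a translation pipeline (the model is downloaded on first use)."""
    from transformers import pipeline  # imported lazily so the helpers below stay lightweight

    return pipeline("translation", model=model_name)


def translate_all(translator, sentences):
    """Run the translator on each sentence and collect the 'translation_text' fields."""
    return [translator(sentence)[0]["translation_text"] for sentence in sentences]
```

Usage would look like `translate_all(build_translator(), ["NLP Class is amazing!"])`. Swapping the model name changes the language pair.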

## There's Much More!

In [75]:
from transformers import pipelines
for task in pipelines.SUPPORTED_TASKS:
    print(task)

audio-classification
automatic-speech-recognition
text-to-audio
feature-extraction
text-classification
token-classification
question-answering
table-question-answering
visual-question-answering
document-question-answering
fill-mask
summarization
translation
text2text-generation
text-generation
zero-shot-classification
zero-shot-image-classification
zero-shot-audio-classification
image-classification
image-feature-extraction
image-segmentation
image-to-text
object-detection
zero-shot-object-detection
depth-estimation
video-classification
mask-generation
image-to-image
