<a href="https://colab.research.google.com/github/CvetanV/BERT_NLP/blob/main/BERT_NLP_Extract_Keywords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT NLP: Extracting Keywords from a Text
## In this notebook I use Hugging Face transformers to extract keywords (named entities) from a text.

In [2]:
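%%capture
# Install the transformers library (with the sentencepiece extra), which provides everything needed for this NLP example.
# `%%capture` has to be the first line of the cell; it hides the installer output.
!pip install "transformers[sentencepiece]"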

In [3]:
# Import the pipeline helper from the transformers library, plus textwrap for wrapping printed text
from transformers import pipeline
import textwrap
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)

## Example 1

In [4]:
# This cell uses the Hugging Face Transformers library to run named entity recognition (NER) on a sentence
# with the pre-trained model "dbmdz/bert-large-cased-finetuned-conll03-english".
# After the input sentence is defined, `pipeline` builds a token-classification pipeline around that model.
# Token classification assigns a label to each individual token of the input text.
# `grouped_entities=True` merges consecutive tokens that share an entity label into a single entity
# (newer transformers releases deprecate this flag in favour of `aggregation_strategy="simple"`).
# `ners` holds the resulting list of entities for the sentence.
# Finally, the cell prints the sentence wrapped at 80 characters with `textwrap`,
# then prints each detected entity's text and its entity group.

sentence = "Singapore Airlines was the first airline to fly the A380. Chew Choon Seng was Singapore Airline's CEO at the time. Singapore Airlines flies to New York daily."
ner = pipeline('token-classification', 
               model='dbmdz/bert-large-cased-finetuned-conll03-english', 
               grouped_entities=True)
ners = ner(sentence)
print('\nSentence:')
print(wrapper.fill(sentence))
print('\n')
for n in ners:
  print(f"{n['word']} -> {n['entity_group']}")


Sentence:
Singapore Airlines was the first airline to fly the A380. Chew Choon Seng was
Singapore Airline's CEO at the time. Singapore Airlines flies to New York daily.


Singapore Airlines -> ORG
A380 -> MISC
Chew Choon Seng -> PER
Singapore Airline -> ORG
Singapore Airlines -> ORG
New York -> LOC
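
The output above lists "Singapore Airlines" twice, which is expected for NER but not ideal for a keyword list. The following is a minimal sketch of one way to collapse the results into unique keywords. The helper `unique_keywords` and its `min_score` parameter are hypothetical names introduced for illustration, and `sample` is a hand-written stand-in for the pipeline output so that the snippet runs without downloading the model.

```python
def unique_keywords(entities, min_score=0.0):
    """Collapse NER results into unique keywords, keeping first-seen order.

    `entities` is assumed to be a list of dicts with the keys `word` and
    `entity_group` (as printed above) and an optional `score`.
    """
    counts = {}
    for ent in entities:
        # Entries without a score are kept; low-confidence ones are skipped.
        if ent.get("score", 1.0) < min_score:
            continue
        key = (ent["word"], ent["entity_group"])
        counts[key] = counts.get(key, 0) + 1
    return [
        {"word": word, "entity_group": group, "count": count}
        for (word, group), count in counts.items()
    ]


# Hand-written stand-in for the Example 1 results (assumption: no scores).
sample = [
    {"word": "Singapore Airlines", "entity_group": "ORG"},
    {"word": "A380", "entity_group": "MISC"},
    {"word": "Chew Choon Seng", "entity_group": "PER"},
    {"word": "Singapore Airline", "entity_group": "ORG"},
    {"word": "Singapore Airlines", "entity_group": "ORG"},
    {"word": "New York", "entity_group": "LOC"},
]

for kw in unique_keywords(sample):
    print(f"{kw['word']} -> {kw['entity_group']} (x{kw['count']})")
```

With the real pipeline, passing `ners` instead of `sample` should work the same way, since each result is a dict with these keys. Note that "Singapore Airline" and "Singapore Airlines" stay separate here; merging near-duplicates would need extra normalisation that this sketch does not attempt.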

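To make the effect of `grouped_entities=True` more concrete, here is a simplified, pure-Python sketch of the grouping idea: consecutive tokens with the same entity type are merged into one span. It assumes CoNLL-style labels such as `B-ORG`, `I-ORG` and `O`. The function `group_entities` is a hypothetical illustration, not the library's implementation, which additionally handles sub-word pieces and scores.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share an entity type into one span.

    `tagged_tokens` is a list of (token, label) pairs, where the label is
    "O" or a prefixed type such as "B-ORG" / "I-ORG".
    """
    groups = []
    current_words, current_type = [], None
    for token, label in tagged_tokens:
        ent_type = None if label == "O" else label.split("-", 1)[-1]
        # A new span starts when the type changes or a "B-" tag appears.
        starts_new = ent_type != current_type or label.startswith("B-")
        if current_words and starts_new:
            groups.append({"word": " ".join(current_words),
                           "entity_group": current_type})
            current_words = []
        if ent_type is not None:
            current_words.append(token)
        current_type = ent_type
    if current_words:  # flush a span that runs to the end of the input
        groups.append({"word": " ".join(current_words),
                       "entity_group": current_type})
    return groups


# Hand-labelled tokens for illustration only (not model output).
tokens = [
    ("Singapore", "B-ORG"), ("Airlines", "I-ORG"),
    ("flies", "O"), ("to", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"), ("daily", "O"),
]

for group in group_entities(tokens):
    print(f"{group['word']} -> {group['entity_group']}")
```

Without this grouping step, each token would be reported separately with its own label, which is much harder to read as a keyword list.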

## Example 2

In [8]:
sentence = "The General Motors Company[2] (GM) is an American multinational automotive manufacturing company headquartered in Detroit, Michigan, United States.[3] By sales, it was the largest automaker in the United States in 2022, and was the largest in the world for 77 years before losing the top spot to Toyota in 2008.[4][5] General Motors operates manufacturing plants in eight countries.[6] Its four core automobile brands are Chevrolet, Buick, GMC, and Cadillac. It also holds interests in Chinese brands Baojun and Wuling via SAIC-GM-Wuling Automobile.[2] GM also owns the BrightDrop delivery vehicle manufacturer,[7] a namesake defense vehicles division which produces military vehicles for the United States government and military,[8] the vehicle safety, security, and information services provider OnStar,[9] the auto parts company ACDelco, a namesake financial lending service, and majority ownership in the self-driving cars enterprise Cruise LLC. In January 2021, GM announced plans to end production and sales of vehicles using internal combustion engines, including hybrid vehicles and plug-in hybrids, by 2035, as part of its plan to achieve carbon neutrality by 2040.[10] GM offers more flexible-fuel vehicles, which can operate on either E85 ethanol fuel or gasoline, or any blend of both, than any other automaker.[11] The company traces itself to a holding company for Buick established on September 16, 1908, by William C. Durant, the largest seller of horse-drawn vehicles at the time.[12] The current entity was established in 2009 after the General Motors Chapter 11 reorganization.[13] As of January 2023, GM is ranked 25th on the Fortune 500 rankings of the largest United States corporations by total revenue.[14]"
# Build the same NER pipeline as in Example 1 and run it on a longer, multi-sentence passage
ner = pipeline('token-classification',
               model='dbmdz/bert-large-cased-finetuned-conll03-english',
               grouped_entities=True)
ners = ner(sentence)
print('\nSentence:')
print(wrapper.fill(sentence))
print('\n')
for n in ners:
  print(f"{n['word']} -> {n['entity_group']}")


Sentence:
The General Motors Company[2] (GM) is an American multinational automotive
manufacturing company headquartered in Detroit, Michigan, United States.[3] By
sales, it was the largest automaker in the United States in 2022, and was the
largest in the world for 77 years before losing the top spot to Toyota in
2008.[4][5] General Motors operates manufacturing plants in eight countries.[6]
Its four core automobile brands are Chevrolet, Buick, GMC, and Cadillac. It also
holds interests in Chinese brands Baojun and Wuling via SAIC-GM-Wuling
Automobile.[2] GM also owns the BrightDrop delivery vehicle manufacturer,[7] a
namesake defense vehicles division which produces military vehicles for the
United States government and military,[8] the vehicle safety, security, and
information services provider OnStar,[9] the auto parts company ACDelco, a
namesake financial lending service, and majority ownership in the self-driving
cars enterprise Cruise LLC. In January 2021, GM announced plans to