<a href="https://colab.research.google.com/github/IshaSarangi/Edureka_Notes/blob/main/Edureka_NER_using_Spacy_and_Transformers_demo_31_Aug.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://colab.research.google.com/drive/18kCwr2tds4so5o0W_Tf112npyYRqy_aP?usp=sharing

In [1]:
import spacy
from spacy import displacy
import en_core_web_sm
NER = en_core_web_sm.load()

In [2]:
raw_text = """The Indian Space Research Organisation is the national
space agency of India, headquartered in Bengaluru.
It operates under Department of Space which is
directly overseen by the Prime Minister Narendra Damodardas Modi of
India while Chairman of ISRO acts as executive of DOS as well."""

text_ner = NER(raw_text)

In [3]:
print(text_ner.ents)

(The Indian Space Research Organisation, India, Bengaluru, Department of Space, Narendra Damodardas Modi, India, ISRO, DOS)


In [4]:
for word in text_ner.ents:
    print(word.text, word.label_)

The Indian Space Research Organisation ORG
India GPE
Bengaluru GPE
Department of Space ORG
Narendra Damodardas Modi PERSON
India GPE
ISRO ORG
DOS ORG
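
The flat listing above can be easier to read when grouped by label. The following is a minimal sketch in plain Python, assuming the `(text, label)` pairs are copied from the printed output; `entity_pairs`, `group_by_label`, and `entities_by_label` are illustrative names that do not appear in the original notebook. In a live session the pairs could be built from `text_ner.ents` instead.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output above. In the notebook
# they could be built as [(ent.text, ent.label_) for ent in text_ner.ents].
entity_pairs = [
    ("The Indian Space Research Organisation", "ORG"),
    ("India", "GPE"),
    ("Bengaluru", "GPE"),
    ("Department of Space", "ORG"),
    ("Narendra Damodardas Modi", "PERSON"),
    ("India", "GPE"),
    ("ISRO", "ORG"),
    ("DOS", "ORG"),
]


def group_by_label(pairs):
    """Map each label to its entity texts in first-seen order, skipping repeats."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


entities_by_label = group_by_label(entity_pairs)
for label, texts in entities_by_label.items():
    print(label, texts)
```

Dropping repeats is a design choice for a summary view; keep the duplicates if mention counts matter.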


In [5]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

In [6]:
spacy.explain('GPE')

'Countries, cities, states'

* PERSON: People, including fictional characters.
* NORP: Nationalities, religious, or political groups.
* FAC: Buildings, airports, highways, bridges, etc.
* ORG: Companies, agencies, institutions, etc.
* GPE: Geopolitical entities like countries, cities, or states.
* LOC: Non-GPE locations, such as mountain ranges or bodies of water.
* PRODUCT: Objects, vehicles, foods, etc. (excluding services).
* EVENT: Named occurrences like hurricanes, battles, wars, or sports events.
* WORK_OF_ART: Titles of books, songs, or other artistic creations.
* LAW: Named documents that are laws.
* LANGUAGE: Any named language.
* DATE: Absolute or relative dates or periods.
* TIME: Times smaller than a day.
* PERCENT: Percentage values, including the "%" symbol.
* MONEY: Monetary values, including the unit of currency.
* QUANTITY: Measurements, such as weight or distance.
* ORDINAL: Ordinal numbers like "first," "second," etc.
* CARDINAL: Numerals that do not fall under another specific type.



In [7]:
displacy.render(text_ner, style='ent')

## Transformer-based NER (Named Entity Recognition)

BERT (Bidirectional Encoder Representations from Transformers) for NER

In [8]:
from transformers import pipeline  # Hugging Face Transformers library

# Load a pretrained NER model pipeline
ner_pipeline = pipeline('ner', model='dslim/bert-base-NER', aggregation_strategy='simple')

raw_text=["Barack Obama was born in Hawaii.",
    "Apple Inc. reported a revenue of $120 billion in 2024.",
    "Elon Musk founded SpaceX in California in 2002.",
    "Cristiano Ronaldo plays football in Saudi Arabia."]

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu
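
The `aggregation_strategy='simple'` argument asks the pipeline to merge adjacent token predictions of the same entity type into a single span. The sketch below illustrates that idea in plain Python. It is a simplified illustration, not the library's implementation: `token_predictions` and `group_entities` are hypothetical names, the scores are made up, and tokens are joined with a space rather than through the tokenizer.

```python
from statistics import mean

# Hypothetical token-level predictions (token, tag, score) in the B-/I- scheme.
# The scores are invented for illustration and are not model output.
token_predictions = [
    ("Barack", "B-PER", 0.99),
    ("Obama", "I-PER", 0.98),
    ("was", "O", 0.99),
    ("born", "O", 0.99),
    ("in", "O", 0.99),
    ("Hawaii", "B-LOC", 0.97),
]


def group_entities(predictions):
    """Merge consecutive tokens that share an entity type into one span."""
    groups = []
    current = None  # span being built, or None when outside an entity
    for token, tag, score in predictions:
        if tag == "O":
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        # Extend the open span only on an I- tag of the same entity type.
        if current is not None and prefix == "I" and current["entity_group"] == entity_type:
            current["tokens"].append(token)
            current["scores"].append(score)
        else:
            current = {"entity_group": entity_type, "tokens": [token], "scores": [score]}
            groups.append(current)
    return [
        {
            "entity_group": g["entity_group"],
            "word": " ".join(g["tokens"]),
            "score": mean(g["scores"]),
        }
        for g in groups
    ]


grouped = group_entities(token_predictions)
for entity in grouped:
    print(entity)
```

Under this rule, neighbouring pieces that receive different entity types are never merged, which is one way a single name can come back split into fragments. The Transformers documentation also lists word-level strategies (`'first'`, `'average'`, `'max'`) that may reduce such splits.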


In [9]:
# Apply NER to each sentence and print the detected entities
for text in raw_text:
    print(f"Text:{text}")
    print("Entities:", ner_pipeline(text))

Text:Barack Obama was born in Hawaii.
Entities: [{'entity_group': 'PER', 'score': np.float32(0.9992919), 'word': 'Barack Obama', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': np.float32(0.9997441), 'word': 'Hawaii', 'start': 25, 'end': 31}]
Text:Apple Inc. reported a revenue of $120 billion in 2024.
Entities: [{'entity_group': 'ORG', 'score': np.float32(0.999503), 'word': 'Apple Inc', 'start': 0, 'end': 9}]
Text:Elon Musk founded SpaceX in California in 2002.
Entities: [{'entity_group': 'ORG', 'score': np.float32(0.66747177), 'word': 'Elon', 'start': 0, 'end': 4}, {'entity_group': 'PER', 'score': np.float32(0.83128965), 'word': 'Mu', 'start': 5, 'end': 7}, {'entity_group': 'ORG', 'score': np.float32(0.54482806), 'word': '##sk', 'start': 7, 'end': 9}, {'entity_group': 'ORG', 'score': np.float32(0.9992529), 'word': 'SpaceX', 'start': 18, 'end': 24}, {'entity_group': 'LOC', 'score': np.float32(0.9996731), 'word': 'California', 'start': 28, 'end': 38}]
Text:Cristiano Ronaldo pla