<a href="https://colab.research.google.com/github/Pakhi27/Named-Entity-Recognition-NLP-/blob/main/Named_Entity_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Named Entity Recognition (NER)

In [1]:

import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names
# 'ner' is the named entity recognizer component of the pipeline

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [4]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [5]:
from spacy import displacy

displacy.render(doc, style="ent")


List all the entity labels

In [6]:
nlp.pipe_labels['ner']


['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [7]:

doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | PERSON | People, including fictional
1982 | DATE | Absolute or relative dates or periods


In [9]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)


Tesla Inc  |  ORG  |  0 | 9
Twitter Inc  |  ORG  |  30 | 41
$45 billion  |  MONEY  |  46 | 57
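The start_char and end_char values are character offsets into the original string, with the end offset exclusive. As a minimal check that needs no spaCy install, slicing the sentence with the offsets printed above recovers each entity's text. The list name `spans` is made up for this sketch; the offsets and labels are copied from the output above.

```python
text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 9, "ORG"), (30, 41, "ORG"), (46, 57, "MONEY")]

# Slicing with [start:end] returns exactly the entity text
for start, end, label in spans:
    print(text[start:end], "|", label)
```

Because the offsets refer to the raw string rather than to token positions, they can be used to highlight or annotate entities in the original text.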



Setting custom entities

In [10]:

doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  PRODUCT
$45 billion  |  MONEY


In [11]:

s = doc[2:5]
s
# span of tokens

going to acquire

In [12]:
type(s)

spacy.tokens.span.Span

In [13]:

from spacy.tokens import Span

# tesla
s1 = Span(doc, 0, 1, label="ORG")
# twitter
s2 = Span(doc, 5, 6, label="ORG")

doc.set_ents([s1, s2], default="unmodified")

In [14]:

for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY


NER using Hugging Face

In [15]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
# The output is a list of dictionaries, one per token that the model tagged as part of a named entity.
# Each dictionary contains:
# entity: the predicted tag (e.g., "B-PER" begins a person, "B-LOC" begins a location; "I-" tags continue an entity).
# score: the model's confidence in the prediction.
# index: the position of the token in the tokenized input sequence.
# word: the token text (this can be a WordPiece fragment such as "##ber").
# start: the character offset where the token begins in the input text.
# end: the character offset just past the end of the token in the input text.

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
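The pipeline returns one dictionary per token, so a word that the tokenizer splits into WordPiece fragments (marked with a leading "##") shows up as several entries. The following is a minimal sketch of merging such entries into entity spans using the character offsets. `group_entities` is a hypothetical helper written for this note, not a transformers function, and its merge rule (adjacent "##" pieces, or adjacent "I-" tags with the same label) is a simplifying assumption.

```python
def group_entities(text, tokens):
    """Merge token-level NER predictions into entity spans (illustrative only)."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]      # "B-PER" -> "PER"
        is_piece = tok["word"].startswith("##")      # WordPiece continuation
        continues = tok["entity"].startswith("I-")   # inside-entity tag
        last = groups[-1] if groups else None
        # Adjacent means no gap, or a single character (e.g. a space) between tokens
        adjacent = last is not None and tok["start"] - last["end"] <= 1
        if last and adjacent and (is_piece or (continues and label == last["label"])):
            last["end"] = tok["end"]
            last["score"] = min(last["score"], tok["score"])  # keep the weakest score
        else:
            groups.append({"label": label, "start": tok["start"],
                           "end": tok["end"], "score": tok["score"]})
    for g in groups:
        g["text"] = text[g["start"]:g["end"]]
    return groups


example = "My name is Wolfgang and I live in Berlin"
tokens = [
    {"entity": "B-PER", "score": 0.9990139, "index": 4, "word": "Wolfgang", "start": 11, "end": 19},
    {"entity": "B-LOC", "score": 0.999645, "index": 9, "word": "Berlin", "start": 34, "end": 40},
]
for g in group_entities(example, tokens):
    print(g["text"], "|", g["label"])
```

The transformers pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping internally, which is usually preferable to a hand-written helper.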


NER Dataset using BERT

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [18]:
df = pd.read_csv('/content/NER dataset.csv', encoding='latin-1')  # Replace 'latin-1' with the appropriate encoding

In [19]:
dataset = pd.DataFrame(df)

In [20]:
df.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


In [24]:
df = df.drop('Sentence #', axis=1)

In [25]:
df.tail()

Unnamed: 0,Word,POS,Tag
1048570,they,PRP,O
1048571,responded,VBD,O
1048572,to,TO,O
1048573,the,DT,O
1048574,attack,NN,O


In [26]:
df.columns

Index(['Word', 'POS', 'Tag'], dtype='object')

In [28]:
df

Unnamed: 0,Word,POS,Tag
0,Thousands,NNS,O
1,of,IN,O
2,demonstrators,NNS,O
3,have,VBP,O
4,marched,VBN,O
...,...,...,...
1048570,they,PRP,O
1048571,responded,VBD,O
1048572,to,TO,O
1048573,the,DT,O


In [31]:
sample_size = 1000
df = df.sample(n=sample_size)


In [34]:
df

Unnamed: 0,Word,POS,Tag
934584,an,DT,O
807422,threat,NN,O
395773,",",",",O
371634,operation,NN,O
880575,3.5,CD,O
...,...,...,...
909540,intends,VBZ,O
651786,of,IN,O
992446,"""",``,O
329084,five,CD,O


In [32]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)


In [35]:
ner_results = []
for text in df['Word']:
    results = nlp(text)
    ner_results.append(results)
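The collected `ner_results` list holds one list of token dictionaries per input word, empty when nothing was tagged. A minimal sketch of summarizing such a structure with `collections.Counter` follows. `sample_results` is a small stand-in in the same shape, with values taken from this notebook's printed output and scores rounded; it is not the full result.

```python
from collections import Counter

# Stand-in for ner_results: one list of token dicts per input word
sample_results = [
    [],  # "an": nothing tagged
    [{"entity": "B-LOC", "score": 0.9987, "index": 1, "word": "States", "start": 0, "end": 6}],
    [{"entity": "B-PER", "score": 0.6417, "index": 1, "word": "Sa", "start": 0, "end": 2},
     {"entity": "I-ORG", "score": 0.6352, "index": 2, "word": "##ber", "start": 2, "end": 5}],
    [{"entity": "B-ORG", "score": 0.9988, "index": 1, "word": "United", "start": 0, "end": 6}],
]

# Count how often each tag was predicted across all tokens
label_counts = Counter(tok["entity"] for results in sample_results for tok in results)

# Count how many input words received at least one tag
tagged_words = sum(1 for results in sample_results if results)

print(label_counts)
print(tagged_words, "of", len(sample_results), "words tagged")
```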

In [36]:
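# Run the NER pipeline on each entry of the 'Word' column and print the result.
# Each word is classified in isolation, without its sentence context.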
for index, row in df.iterrows():
    text = row['Word']
    results = nlp(text)
    print(f"For text: {text}")
    print(results)

For text: an
[]
For text: threat
[]
For text: ,
[]
For text: operation
[]
For text: 3.5
[]
For text: States
[{'entity': 'B-LOC', 'score': 0.9986577, 'index': 1, 'word': 'States', 'start': 0, 'end': 6}]
For text: Dogg
[]
For text: bribery
[]
For text: ,
[]
For text: of
[]
For text: to
[]
For text: sovereignty
[]
For text: Saberi
[{'entity': 'B-PER', 'score': 0.64173347, 'index': 1, 'word': 'Sa', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.63520426, 'index': 2, 'word': '##ber', 'start': 2, 'end': 5}]
For text: But
[]
For text: and
[]
For text: the
[]
For text: was
[]
For text: occurs
[]
For text: United
[{'entity': 'B-ORG', 'score': 0.99877137, 'index': 1, 'word': 'United', 'start': 0, 'end': 6}]
For text: nation
[]
For text: in
[]
For text: show
[]
For text: .
[]
For text: hostages
[]
For text: .
[]
For text: founded
[]
For text: VOA
[{'entity': 'B-ORG', 'score': 0.9989691, 'index': 1, 'word': 'V', 'start': 0, 'end': 1}, {'entity': 'I-ORG', 'score': 0.9898385, 'index': 2, 'wor