
## Named Entity Recognition
Named Entity Recognition (NER) is a subtask of information extraction in natural language processing (NLP). It identifies spans of text that refer to specific entities and classifies them into predefined categories. These categories typically include:



*   People: Names of individuals.
*   Organizations: Names of companies, institutions, or groups.
*   Locations: Names of cities, countries, landmarks, etc.
*   Dates and Times: Specific dates and time periods.
*   Miscellaneous Entities: Other categories such as product names, monetary values, and percentages.



In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm



In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [3]:
def show_ents(doc):
    """Print each entity's text, character offsets, label, and label description."""
    if doc.ents:
        for ent in doc.ents:
            print(f"{ent.text} - {ent.start_char} - {ent.end_char} - "
                  f"{ent.label_} - {spacy.explain(ent.label_)}")
    else:
        print('No named entities found.')

In [4]:
doc1 = nlp("Jeff Bezos founded Amazon in 1994, and it became the world’s largest online retailer")

show_ents(doc1)

Jeff Bezos - 0 - 10 - PERSON - People, including fictional
Amazon - 19 - 25 - ORG - Companies, agencies, institutions, etc.
1994 - 29 - 33 - DATE - Absolute or relative dates or periods


In [5]:
doc2 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

show_ents(doc2)

Washington, DC - 12 - 26 - GPE - Countries, cities, states
next May - 27 - 35 - DATE - Absolute or relative dates or periods
the Washington Monument - 43 - 66 - ORG - Companies, agencies, institutions, etc.


## Understanding Entity Annotations in spaCy

In spaCy, the entities recognized in a text are available through `Doc.ents`, a sequence of token spans (`Span` objects) with entity annotations. Each entity span has the following attributes:

- **`ent.text`**: The exact text of the entity as it appears in the document.
- **`ent.label`**: The integer hash value that identifies the entity type.
- **`ent.label_`**: The entity type as a string, such as `ORG`. Pass it to `spacy.explain` for a human-readable description.
- **`ent.start`**: The index of the entity's first token within the `Doc` object.
- **`ent.end`**: The index of the token just after the entity (exclusive) within the `Doc` object.
- **`ent.start_char`**: The character offset at which the entity text starts within the document.
- **`ent.end_char`**: The character offset at which the entity text ends (exclusive) within the document.

Together, these attributes let you locate an entity at both the token level and the character level.
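
As a minimal sketch of how these attributes relate to each other, the snippet below uses a blank English pipeline (tokenizer only, an assumption made here so that no trained model is needed) and marks "San Francisco" as a `GPE` entity by hand. It then checks that the token indices and the character offsets both recover the entity text.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizes text but predicts no entities (assumption for this sketch).
nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Mark tokens 0 and 1 ("San Francisco") as a GPE entity by hand.
doc.ents = [Span(doc, 0, 2, label="GPE")]
ent = doc.ents[0]

print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)

# Token indices slice the Doc; character offsets slice the raw text.
assert doc[ent.start:ent.end].text == ent.text
assert doc.text[ent.start_char:ent.end_char] == ent.text

# ent.label is the integer hash; the vocab's string store maps it back to ent.label_.
assert doc.vocab.strings[ent.label] == ent.label_
```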


In [6]:
doc3 = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')

for ent in doc3.ents:
    print(ent.text, ent.label_)

500 dollars MONEY
Microsoft ORG


## Accessing Entity Annotations in spaCy

Entity annotations are accessed through the `doc.ents` property, which returns a sequence of `Span` objects. You can get the entity type using:

- **`ent.label`**: The entity type's integer hash value.
- **`ent.label_`**: The entity type as a string, such as `GPE`.

You can iterate over a `Span` token by token, or read the whole entity at once through `ent.text`.

For token-level entity annotations:

- **`token.ent_iob`** / **`token.ent_iob_`**: The IOB code of the token, as an integer or as a string. It indicates whether the token begins an entity (`3`, `"B"`), is inside one (`1`, `"I"`), or is outside any entity (`2`, `"O"`). The integer is `0` and the string is empty if no tag is set.
- **`token.ent_type`** / **`token.ent_type_`**: The entity type, as a hash value or as a string. The string is empty if no entity type is set.





In [7]:
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Document level: iterate over the entity spans
for e in doc.ents:
    print(e.text, e.start_char, e.end_char, e.label_)
# Or collect the same fields with a list comprehension
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# Token level: doc[0] and doc[1] are the first two tokens

ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)
print(ent_francisco)

San Francisco 0 13 GPE
[('San Francisco', 0, 13, 'GPE')]
['San', 'B', 'GPE']
['Francisco', 'I', 'GPE']


## IOB Scheme

The IOB (Inside-Outside-Beginning) scheme is used to label tokens in named entity recognition:

- **I**: Token is Inside an entity.
- **O**: Token is Outside an entity.
- **B**: Token is the Beginning of an entity.
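
To make the scheme concrete, here is a small pure-Python sketch that turns per-token IOB tags and entity types back into entities. It is not part of spaCy, and the function name `group_entities` is hypothetical. A `B` tag always opens a new entity, an `I` tag extends the current one, and any other tag closes it. Joining tokens with a single space is a simplification that ignores the original whitespace.

```python
def group_entities(tokens, iob_tags, ent_types):
    """Group tokens into (entity_text, entity_type) pairs using IOB tags."""
    entities = []
    current_tokens, current_type = [], ""
    for token, iob, ent_type in zip(tokens, iob_tags, ent_types):
        # "B" always starts an entity; a stray "I" with nothing open is treated as a start.
        starts_entity = iob == "B" or (iob == "I" and not current_tokens)
        # Close the open entity when a new one starts or the token is not "I".
        if current_tokens and (starts_entity or iob != "I"):
            entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
        if starts_entity:
            current_tokens, current_type = [token], ent_type
        elif iob == "I":
            current_tokens.append(token)
    # Flush an entity that runs to the end of the sequence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["San", "Francisco", "considers", "banning", "sidewalk", "delivery", "robots"]
iob_tags = ["B", "I", "O", "O", "O", "O", "O"]
ent_types = ["GPE", "GPE", "", "", "", "", ""]
print(group_entities(tokens, iob_tags, ent_types))  # [('San Francisco', 'GPE')]
```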

### Example

| Text         | `ent_iob` | `ent_iob_` | `ent_type_` | Description                |
|--------------|-----------|------------|-------------|----------------------------|
| San          | 3         | B          | "GPE"       | Beginning of an entity     |
| Francisco    | 1         | I          | "GPE"       | Inside an entity           |
| considers    | 2         | O          | ""          | Outside an entity          |
| banning      | 2         | O          | ""          | Outside an entity          |
| sidewalk     | 2         | O          | ""          | Outside an entity          |
| delivery     | 2         | O          | ""          | Outside an entity          |
| robots       | 2         | O          | ""          | Outside an entity          |

**Note:** In the example, "San Francisco" is recognized as a named entity ("GPE" - Geopolitical Entity), with "San" being the start and "Francisco" inside the entity. All other tokens are labeled as outside the entity. `ent_iob` holds the IOB code as an integer (3 = B, 1 = I, 2 = O, 0 = no tag set), while `ent_iob_` holds the same code as a string.
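
The table can be approximated in code by looping over the tokens. The sketch below again assumes a blank English pipeline with the entity set by hand, so it runs without a trained model. With a recent spaCy v3 release, the tokens outside the entity are expected to come back as `O`; older versions may report them as unset instead.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only (assumption for this sketch)
doc = nlp("San Francisco considers banning sidewalk delivery robots")
doc.ents = [Span(doc, 0, 2, label="GPE")]

# One row per token: text, integer IOB code, string IOB code, entity type.
rows = [(t.text, t.ent_iob, t.ent_iob_, t.ent_type_) for t in doc]
for row in rows:
    print(row)
```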


## NER Tags

Named Entity Recognition (NER) tags help categorize entities in text. Each tag is accessible through the `.label_` property of an entity. Here are the common types of NER tags:

| TYPE           | DESCRIPTION                                         | EXAMPLE                            |
|----------------|-----------------------------------------------------|------------------------------------|
| **PERSON**     | People, including fictional characters.            | Fred Flintstone                    |
| **NORP**       | Nationalities, religious, or political groups.     | The Republican Party                |
| **FAC**        | Buildings, airports, highways, bridges, etc.       | Logan International Airport, The Golden Gate |
| **ORG**        | Companies, agencies, institutions, etc.            | Microsoft, FBI, MIT                |
| **GPE**        | Countries, cities, states.                         | France, UAR, Chicago, Idaho        |
| **LOC**        | Non-GPE locations, mountain ranges, bodies of water.| Europe, Nile River, Midwest        |
| **PRODUCT**    | Objects, vehicles, foods, etc. (Not services.)     | Formula 1                          |
| **EVENT**      | Named events like hurricanes, battles, sports, etc. | Olympic Games                       |
| **WORK_OF_ART**| Titles of books, songs, etc.                        | The Mona Lisa                       |
| **LAW**        | Named legal documents.                             | Roe v. Wade                         |
| **LANGUAGE**   | Any named language.                                | English                             |
| **DATE**       | Absolute or relative dates or periods.             | 20 July 1969                        |
| **TIME**       | Times smaller than a day.                          | Four hours                          |
| **PERCENT**    | Percentages, including "%".                        | Eighty percent                      |
| **MONEY**      | Monetary values, including units.                  | Twenty Cents                        |
| **QUANTITY**   | Measurements such as weight or distance.           | Several kilometers, 55kg            |
| **ORDINAL**    | Order of items, e.g., "first", "second".           | 9th, Ninth                          |
| **CARDINAL**   | Numerals not categorized under other types.        | 2, Two, Fifty-two                   |
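
The descriptions in this table can also be looked up programmatically. As a short sketch, `spacy.explain` (the function already used in `show_ents`) returns the description for a label; the four labels below are arbitrary examples.

```python
import spacy

# Arbitrary sample of labels from the table above.
labels = ["PERSON", "ORG", "GPE", "MONEY"]

# Map each label to its built-in description.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label}: {description}")
```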


## User-Defined Named Entity and Adding It to a Span

In spaCy, you can assign a named entity to tokens that the pre-trained model did not recognize. To do this, follow these steps:

1. **Create a `Span` Object**: Use the `Span` constructor to define your custom named entity. The arguments for the `Span` constructor are:
   - `doc`: The `Doc` object containing the tokens.
   - `start`: The index of the first token of the span in the `doc`.
   - `end`: The index of the token just after the span (exclusive) in the `doc`.
   - `label`: The label assigned to the entity (e.g., `ORG` for organizations), given as a string or as its hash value.
2. **Add the `Span` to `doc.ents`**: Assign `doc.ents` a new list that contains the existing entities plus the new span.

In [8]:
doc = nlp(u'Tesla to build a U.K. factory for $6 million')

show_ents(doc)

U.K. - 17 - 21 - GPE - Countries, cities, states
$6 million - 34 - 44 - MONEY - Monetary values, including unit


In [9]:
from spacy.tokens import Span


In [10]:


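# Look up the hash value of the ORG label
ORG = doc.vocab.strings[u'ORG']

# Create a Span covering token 0 ("Tesla") and label it ORG
new_ent = Span(doc, 0, 1, label=ORG)

# Add the new entity to the document's existing entities
doc.ents = list(doc.ents) + [new_ent]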

In [11]:
show_ents(doc)

Tesla - 0 - 5 - ORG - Companies, agencies, institutions, etc.
U.K. - 17 - 21 - GPE - Countries, cities, states
$6 million - 34 - 44 - MONEY - Monetary values, including unit


In [12]:
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)


# Label token 0 ("fb") as ORG; here the label is passed as a string instead of a hash
fb_ent = Span(doc, 0, 1, label="ORG")
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)


Before []
After [('fb', 0, 2, 'ORG')]
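
One caveat when extending `doc.ents`: spaCy does not allow entity spans to overlap, so assigning a list in which two spans share a token raises a `ValueError`. The sketch below shows the error and one way around it with `spacy.util.filter_spans`, which keeps the longest span when candidates overlap. The blank English pipeline and the `TITLE` label are assumptions made purely for illustration.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer only; no entities are predicted
doc = nlp("fb is hiring a new vice president of global policy")

fb_ent = Span(doc, 0, 1, label="ORG")         # "fb"
short_title = Span(doc, 5, 7, label="TITLE")  # "vice president" (TITLE is a made-up label)
long_title = Span(doc, 4, 7, label="TITLE")   # "new vice president", overlaps short_title

# Overlapping spans cannot be assigned to doc.ents directly.
try:
    doc.ents = [fb_ent, short_title, long_title]
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
print("Overlap rejected:", overlap_rejected)

# filter_spans removes overlaps, preferring longer spans, so the assignment succeeds.
doc.ents = filter_spans([fb_ent, short_title, long_title])
print([(e.text, e.label_) for e in doc.ents])
```

Whether to keep the longest span or the original one is a design choice; if existing entities should win, check for shared tokens before creating the new span instead.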


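## Visualizing Named Entities with displaCy

spaCy's built-in displaCy visualizer highlights the recognized entities in the text. Its `options` argument can restrict which entity types are shown (`ents`) and set their colors (`colors`).

In [13]:

from spacy import displacy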

In [15]:
text = "When Elon Musk launched SpaceX in 2002, many experts initially doubted its potential to revolutionize space travel."
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

In [16]:
text = """When Google was founded by Larry Page and Sergey Brin in 1998, the idea of a search engine that could deliver highly relevant results to users was met with a mixture of curiosity and skepticism. At the time, many wondered if a new company could truly disrupt the established search engine market dominated by larger players. However, Google's innovative approach to search algorithms and its emphasis on simplicity and speed quickly gained traction. Within a few years, Google not only proved its critics wrong but also transformed the way people access and interact with information online. Its success expanded beyond search, leading to a wide range of products and services that have become integral to modern digital life, from Gmail and Google Maps to Android and cloud computing."""
doc = nlp(text)

displacy.render(doc, style='ent', jupyter=True)

In [17]:
# Render each sentence separately by re-processing its text
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)



In [18]:
options = {'ents': ['ORG', 'PRODUCT']}

displacy.render(doc, style='ent', jupyter=True, options=options)

In [19]:
colors = {'ORG': 'linear-gradient(90deg, #f2c707, #dc9ce7)', 'PRODUCT': 'radial-gradient(white, green)'}

options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}

displacy.render(doc, style='ent', jupyter=True, options=options)

In [20]:
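colors = {'ORG': 'linear-gradient(90deg, #aa9cde, #dc9ce7)', 'PRODUCT': 'radial-gradient(white, red)'}
# Use the 'ents' key to filter entity types; an unrecognized key such as 'ent' is silently ignored
options = {'ents': ['ORG', 'PRODUCT'], 'colors': colors}
displacy.render(doc, style='ent', jupyter=True, options=options)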
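
Outside a notebook, the same options can be used to produce an HTML string. As a rough sketch, passing `jupyter=False` makes `displacy.render` return the markup instead of displaying it. The blank pipeline, the shortened sentence, and the hand-set entities are assumptions made here so the example runs without a trained model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only (assumption for this sketch)
doc = nlp("When Elon Musk launched SpaceX in 2002 many experts doubted it")

# Set the entities by hand: token indices refer to the whitespace-separated words above.
doc.ents = [
    Span(doc, 1, 3, label="PERSON"),  # "Elon Musk"
    Span(doc, 4, 5, label="ORG"),     # "SpaceX"
    Span(doc, 6, 7, label="DATE"),    # "2002"
]

colors = {"ORG": "linear-gradient(90deg, #aa9cde, #dc9ce7)"}
options = {"ents": ["ORG"], "colors": colors}

# jupyter=False returns the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False, options=options)
print(len(html), "characters of HTML")
```

The returned string can be written to a file or embedded in a web page. Only the `ORG` entity is highlighted; the filtered-out entities appear as plain text.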