# Named Entity Recognition (NER)

[Reference Link](https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/)

Named Entity Recognition (NER) is one of the most common tasks in Natural Language Processing (NLP). An entity is a part of the text that is of interest. NER extracts the relevant information from a large corpus and classifies those entities into predefined categories such as location, organization, person name and so on.

## Content

1. <a href = "#1.-Definition">What is Named Entity Recognition</a>
2. <a href = "#2.-Uses-of-NER">Uses of Named Entity Recognition</a>
3. <a href = "#3.-NER-with-spaCy">Named Entity Extraction with spaCy</a>
4. <a href = "#4.-NER-with-NLTK">Named Entity Extraction with NLTK</a>
5. <a href = "#5.-Conclusion">Conclusion</a>

       
Links:
[1](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
[2](https://monkeylearn.com/blog/named-entity-recognition-python/)
[3](https://machinelearningknowledge.ai/beginners-guide-to-named-entity-recognition-ner-in-nltk-library-python/)

## 1. Definition

### Named Entity

In any text data, named entities are objects that exist in the real world. They are typically proper nouns that refer to a specific person, organization or location; many NER schemes also cover dates and similar expressions.
- Example: “Mount Everest is the tallest mountain”. Here Mount Everest is a named entity of type location, as it refers to a specific entity.
- Other examples of named entities are Narendra Modi, Mumbai, MacBook Pro, etc., or anything that can have a name.

More formally, a named entity denotes the proper name of an object. In the examples above, Narendra Modi is the name of a leader, Mumbai is the name of a city and MacBook Pro is the name of a laptop.
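
One way to picture a named entity in code is as a span of characters plus a category label. The sketch below is only an illustration: `EntitySpan` and the label `"LOC"` are hypothetical names chosen for this example, not part of spaCy or NLTK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A named entity as a character span plus a category label (hypothetical helper)."""
    start: int  # index of the first character of the entity
    end: int    # index one past the last character of the entity
    label: str  # category of the entity, e.g. "LOC" or "PERSON"

    def text(self, source: str) -> str:
        """Return the entity text by slicing the source string."""
        return source[self.start:self.end]


sentence = "Mount Everest is the tallest mountain"
# The entity covers the first two words of the sentence.
everest = EntitySpan(start=0, end=len("Mount Everest"), label="LOC")
print(everest.text(sentence), everest.label)
```

Storing offsets instead of the text itself keeps the link back to the original sentence, which is what makes highlighting entities in context possible.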

![ner1.png](attachment:Images//ner1.png)


### Named Entity Recognition

Named entity recognition (NER), or named entity extraction, is an information extraction technique that uses natural language processing (NLP) to automatically identify named entities within raw text and classify them into predetermined categories such as people, organizations, email addresses, locations, values, etc.

In the classical pipeline, which is also the one used with NLTK later in this notebook, NER is a two-step process: we first perform Part of Speech (POS) tagging on the text, and then extract the named entities based on the POS tags.
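
The two steps can be sketched on a toy scale. In the example below the POS tags are written by hand (step 1 is assumed to have been done already), and step 2 is a deliberately naive heuristic that merges consecutive proper nouns. Real chunkers such as the one in NLTK use a trained model, so `proper_noun_chunks` is a hypothetical stand-in, not their algorithm.

```python
from typing import List, Tuple

# Step 1 (assumed already done): each token carries a Penn Treebank POS tag.
tagged: List[Tuple[str, str]] = [
    ("Mount", "NNP"), ("Everest", "NNP"), ("is", "VBZ"), ("the", "DT"),
    ("tallest", "JJS"), ("mountain", "NN"),
]


def proper_noun_chunks(tokens: List[Tuple[str, str]]) -> List[str]:
    """Step 2 (toy heuristic): merge consecutive NNP/NNPS tokens into one candidate entity."""
    chunks, current = [], []
    for word, tag in tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            # A non-proper-noun token closes the chunk collected so far.
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


print(proper_noun_chunks(tagged))
```

The heuristic finds candidate entities but cannot say what type they are; assigning the category is the part that needs a trained model.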



## 2. Uses of NER

Named Entity Recognition is useful for:

* Scanning news articles for the people, organizations and locations reported.
* Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
* Quickly retrieving geographical locations mentioned in Twitter posts.
* Academic work, where it gives students and researchers easier and faster extraction of information from the data they search.
* Question answering systems, where the machine retrieves answers from the data and so reduces human effort.
* Content classification, where identifying the theme and subject of the content makes the process faster and easier and helps suggest the most relevant content.
* Customer service, where user complaints, requests and questions are categorized into the respective fields and filtered by priority keywords.
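
The search use case can be sketched as an inverted index from entities to documents. The document ids and entity lists below are made up for illustration and stand in for the output of an NER step; `build_entity_index` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical output of an NER step: document id -> entities found in that document.
doc_entities = {
    "doc1": ["BCCI", "Mumbai"],
    "doc2": ["Mumbai", "Narendra Modi"],
    "doc3": ["Mount Everest"],
}


def build_entity_index(entities_by_doc):
    """Map each lower-cased entity to the sorted list of documents that mention it."""
    index = defaultdict(set)
    for doc_id, entities in entities_by_doc.items():
        for entity in entities:
            index[entity.lower()].add(doc_id)
    return {entity: sorted(ids) for entity, ids in index.items()}


entity_index = build_entity_index(doc_entities)
# A query now touches only the index, not the full text of every document.
print(entity_index.get("mumbai", []))
```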


## 3. NER with spaCy

spaCy is widely regarded as one of the fastest NLP frameworks in Python, with a single optimized implementation for each of the NLP tasks it provides. It is easy to learn, and simple tasks take only a few lines of code.

### Installation

* pip install spacy
* python -m spacy download en_core_web_sm
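
`spacy.load` raises `OSError: [E050]` when the model package is missing from the environment the kernel runs in. A quick way to check both installs before loading anything is to ask Python whether the packages can be found. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable packages in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The model is installed as a regular Python package named en_core_web_sm.
missing = missing_packages(["spacy", "en_core_web_sm"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("spaCy and the en_core_web_sm model are both available.")
```

If the model is reported as missing even though the download succeeded, the download most likely ran in a different Python environment than the notebook kernel.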

In [1]:
pip install spacy


In [2]:
import spacy 
from spacy import displacy

raw_text = """The Board of Control for Cricket in India (BCCI) is the governing body for cricket in India and is under the jurisdiction of Ministry of Youth Affairs and Sports, Government of India. The board was formed in December 1928 as a society, registered under the Tamil Nadu Societies Registration Act. It is a consortium of state cricket associations and the state associations select their representatives who in turn elect the BCCI Chief. Its headquarters are in Wankhede Stadium, Mumbai. Grant Govan was its first president and Anthony De Mello its first secretary. """



In [4]:
# Upgrade the packaging tools through "python -m pip", as pip itself asks for.
!python -m pip install -U pip setuptools wheel
# Download the small English model used below (spaCy v3 needs the full package name, not "en").
!python -m spacy download en_core_web_sm

In [5]:
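# Load the small English model; only the NER component is needed here, so the other components are disabled.
NER = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
text = NER(raw_text)
# Print each recognized entity together with its label.
for w in text.ents:
    print(w.text, w.label_)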

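Note: if `spacy.load` raises `OSError: [E050] Can't find model 'en_core_web_sm'`, the model package is not installed in the environment the notebook kernel uses. Run `python -m spacy download en_core_web_sm` in that environment and restart the kernel.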

In [None]:
displacy.render(text, style="ent", jupyter=True)

In [None]:
spacy.explain("NORP")

In [None]:
spacy.explain("ORG")

In [None]:
spacy.explain("GPE")

In [None]:
spacy.explain("MONEY")

In [None]:
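# The dependency view needs the parser, which was disabled when NER was loaded, so the full pipeline is loaded here.
displacy.render(spacy.load("en_core_web_sm")(raw_text), style="dep", jupyter=True)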

**spaCy supports the following entity types:**

- PERSON, NORP (nationalities, religious and political groups),
- FAC (buildings, airports etc.), ORG (organizations),
- GPE (countries, cities etc.), LOC (mountain ranges, water bodies etc.),
- PRODUCT (products), EVENT (event names),
- WORK_OF_ART (books, song titles),
- LAW (legal document titles),
- LANGUAGE (named languages),
- DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL.
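
Once entities carry labels like these, a common next step is to group them by type. The sketch below works on plain `(text, label)` pairs shaped like what the loop over `text.ents` prints; the pairs are written by hand for illustration and are not actual model output, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written (text, label) pairs; illustrative only, not real model output.
pairs = [
    ("India", "GPE"),
    ("December 1928", "DATE"),
    ("Mumbai", "GPE"),
    ("Grant Govan", "PERSON"),
    ("India", "GPE"),
    ("Anthony De Mello", "PERSON"),
]


def group_by_label(entity_pairs):
    """Collect entity texts under their label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for entity_text, label in entity_pairs:
        if entity_text not in grouped[label]:
            grouped[label].append(entity_text)
    return dict(grouped)


for label, texts in group_by_label(pairs).items():
    print(f"{label}: {', '.join(texts)}")
```

With a real `Doc`, the same function could be fed `[(ent.text, ent.label_) for ent in text.ents]`.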

### NER of a News Article

In [None]:
from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')  # needs the html5lib package installed
    # Drop elements that carry no article text.
    for script in soup(["script", "style", 'aside']):
        script.extract()
    # Collapse runs of newlines and tabs into single spaces.
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = NER(ny_bb)
len(article.ents)

In [None]:
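from collections import Counter

# Count how often each entity label occurs in the article.
labels = [x.label_ for x in article.ents]
print(Counter(labels))

# The three most frequently mentioned entities.
items = [x.text for x in article.ents]
print(Counter(items).most_common(3))

# article.ents holds entity spans (not sentences); pick one to inspect.
entities = list(article.ents)
print(entities[20])

In [None]:
# Re-run the pipeline on the text of that entity and highlight it.
displacy.render(NER(str(entities[20])), jupyter=True, style='ent')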


## 4. NER with NLTK

NLTK ships with some already tagged sentences; we can inspect them using the treebank corpus.


In [None]:
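import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Resources for the named entity chunker and the tagged sample corpus.
# (Resource names can differ between NLTK versions; a LookupError names the one to download.)
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('treebank')

# Chunk the first pre-tagged treebank sentence into named entities.
sent = nltk.corpus.treebank.tagged_sents()
print(nltk.ne_chunk(sent[0]))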

In [None]:
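# Tokenizer and POS tagger models.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Step 1: split the text into tokens and tag each token with its part of speech.
raw_words = word_tokenize(raw_text)
tags = pos_tag(raw_words)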

In [None]:
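# Step 2: chunk the tagged tokens; binary=True marks chunks only as NE instead of PERSON, GPE, etc.
ne = nltk.ne_chunk(tags, binary=True)
print(ne)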

For better understanding, we can use the IOB tagging format. It keeps the POS tag of each word and adds a tag stating whether the word begins, continues or lies outside an entity chunk.

In [None]:
from nltk.chunk import tree2conlltags
iob = tree2conlltags(ne)
iob

Here the IOB tagging scheme contains tags of the form:

* B-{CHUNK_TYPE} – for the word at the Beginning of a chunk
* I-{CHUNK_TYPE} – for words Inside the chunk
* O – for words Outside any chunk
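
To see how these tags encode entity boundaries, the sketch below decodes a list of `(word, pos, iob_tag)` triples, the shape returned by `tree2conlltags`, back into entity strings. The sample triples are hand-written for illustration (with `NE` as the chunk type, as produced by `binary=True`), and `iob_to_entities` is a hypothetical helper, not an NLTK function.

```python
def iob_to_entities(triples):
    """Turn (word, pos, iob_tag) triples into (entity_text, chunk_type) pairs."""
    entities, words, chunk_type = [], [], None
    for word, _pos, tag in triples:
        # A B- tag, or an I- tag of a different type, starts a new chunk.
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != chunk_type)
        if (tag == "O" or starts_new) and words:
            # Close the chunk collected so far.
            entities.append((" ".join(words), chunk_type))
            words, chunk_type = [], None
        if tag != "O":
            if not words:
                chunk_type = tag[2:]
            words.append(word)
    if words:
        entities.append((" ".join(words), chunk_type))
    return entities


# Hand-written sample in the triple format; the tags are illustrative.
sample = [
    ("Its", "PRP$", "O"), ("headquarters", "NN", "O"), ("are", "VBP", "O"),
    ("in", "IN", "O"), ("Wankhede", "NNP", "B-NE"), ("Stadium", "NNP", "I-NE"),
    (",", ",", "O"), ("Mumbai", "NNP", "B-NE"), (".", ".", "O"),
]
print(iob_to_entities(sample))
```

The `B-` tag is what keeps two adjacent entities apart: without it, "Stadium" followed directly by "Mumbai" would be indistinguishable from one three-word entity.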

## POS Tagging vs NER

* POS tagging identifies the grammatical group a word belongs to (NOUN, ADJECTIVE, VERB, ADVERB, etc.), whereas Named Entity Recognition determines whether or not a word is part of a named entity. Named entities are persons, locations, organizations, time expressions, etc.
* A POS tagger assigns a tag to every word, whereas NER also has to relate neighbouring words to each other to decide which of them form one entity.
* In the NLTK pipeline above, the output of POS tagging is the input for NER: the tagged sentence is passed to the chunker, where words tagged as proper nouns are a strong cue for entities.
* A POS tagger labels one word at a time, whereas NER works over multiple words, detecting the type of the named entity as well as its word boundaries.
* POS tagging typically adds more data than NER, because every token receives a tag while NER annotates only the entity spans.

## 5. Conclusion

Named Entity Recognition can be helpful in analyzing many kinds of textual data. We used a pre-trained model from the spaCy library, as well as the chunker from NLTK, to categorize words into different entities. With the tremendous advancements in NLP, machines are getting smarter and can now process large volumes of textual data, which enables numerous use cases such as machine translation, text summarization, etc.
