# Named Entity Recognition (NER)

## What is it?


Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as:

* Person: Names of individuals (e.g., “Barack Obama”)
* Organization: Names of companies, institutions, agencies, etc. (e.g., “OpenAI”, “Google”)
* Location: Names of geographical locations (e.g., “Paris”, “Mount Everest”)
* Date and Time: Expressions of dates and times (e.g., “January 1, 2020”, “3:00 PM”)
* Monetary Values: Amounts of money (e.g., “$100”, “€20”)
* Percentages: Percentage values (e.g., “50%”)
* Miscellaneous: Other categories of named entities (e.g., product names, event names)

## What for?

Uses of NER:

1. Information Extraction: Automatically extracting structured information from unstructured text, such as identifying key entities in news articles or medical reports.
2. Search and Information Retrieval: Enhancing search engines to understand and retrieve information about specific entities, improving the relevance of search results.
3. Question Answering Systems: Building systems that can answer questions posed in natural language by identifying entities within the question and retrieving relevant information.
4. Document Summarization: Improving the quality of text summarization by identifying and highlighting important entities within the text.
5. Content Recommendation: Providing personalized content recommendations based on entities mentioned in a user’s reading history or search queries.
6. Customer Support Automation: Enhancing chatbots and virtual assistants to understand and respond to user queries by recognizing entities within the conversation.
7. Machine Translation: Improving the accuracy of translating names and other entities by recognizing and properly handling them.
8. Social Media Analysis: Monitoring and analyzing trends and sentiments about specific entities on social media platforms.
9. Fraud Detection: Identifying suspicious entities in financial transactions and communications.
10. Legal Document Processing: Extracting relevant entities from legal documents to assist in legal research and case management.

## Example

Consider the sentence: “Apple Inc. was founded by Steve Jobs in Cupertino, California, on April 1, 1976.”

NER would identify:
* Organization: “Apple Inc.”
* Person: “Steve Jobs”
* Location: “Cupertino, California”
* Date: “April 1, 1976”

By identifying these entities, NER helps in understanding the key components of the text, making it easier to organize, search, and analyze large amounts of unstructured data.
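The entities above can also be written down as character spans, which is the form that training data for NER tools usually takes. The following is a minimal sketch using only the standard library; the helper name `find_span` is hypothetical, and the labels are simply the category names from the list above. End offsets are exclusive, so `sentence[start:end]` recovers the entity text.

```python
sentence = "Apple Inc. was founded by Steve Jobs in Cupertino, California, on April 1, 1976."


def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase.

    The end offset is exclusive, matching Python slicing.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)


# Hand-written annotations for illustration; a real NER model predicts these
entities = [
    find_span(sentence, "Apple Inc.", "Organization"),
    find_span(sentence, "Steve Jobs", "Person"),
    find_span(sentence, "Cupertino, California", "Location"),
    find_span(sentence, "April 1, 1976", "Date"),
]

for start, end, label in entities:
    print(f"{label}: {sentence[start:end]} [{start}:{end}]")
```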

## How to do it?

### Packages

* spaCy

### Examples

In [1]:
import spacy
from spacy.training import Example

In [2]:
# print the meaning of the labels
print(spacy.explain("ORG"))
print(spacy.explain("PERSON"))
print(spacy.explain("GPE"))
print(spacy.explain("DATE"))
print(spacy.explain("CARDINAL"))
print(spacy.explain("WORK_OF_ART"))

Companies, agencies, institutions, etc.
People, including fictional
Countries, cities, states
Absolute or relative dates or periods
Numerals that do not fall under another type
Titles of books, songs, etc.


In [3]:
# download the model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
# load the model
nlp = spacy.load("en_core_web_sm")

In [5]:
example = "Apple Inc. was founded by Steve Jobs in Cupertino, California, on April 1, 1976."

In [6]:
nlp_example = nlp(example)

for ent in nlp_example.ents:
    print(ent.text, ent.label_)

Apple Inc. ORG
Steve Jobs PERSON
Cupertino GPE
California GPE
April 1, 1976 DATE


### Custom Entities

In [7]:
#! pip install spacy-lookups-data

In [8]:
# To train from scratch instead, start from a blank pipeline:
# nlp = spacy.blank("en")
# ner = nlp.add_pipe("ner")

# Annotations must be a dict of the form {"entities": [(start, end, label), ...]}
train_data = [
    ("Tokyo Tower is 333m tall.", {"entities": [(0, 11, "BUILDING")]}),
]

# Add the new label to the NER component before training, if it doesn't exist
ner = nlp.get_pipe("ner")
if "BUILDING" not in ner.labels:
    ner.add_label("BUILDING")
    print("Label added")

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

with nlp.select_pipes(disable=other_pipes):  # Only train NER
    # resume_training keeps the pretrained weights; begin_training would reset them
    optimizer = nlp.resume_training()
    for itn in range(10):  # Number of training iterations
        # random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            doc = nlp.make_doc(text)
            # Use a new name so the `example` sentence defined earlier is not overwritten
            train_example = Example.from_dict(doc, annotations)
            nlp.update([train_example], sgd=optimizer, drop=0.5, losses=losses)
        # print(losses)
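The character offsets in the training data must line up with token boundaries, otherwise spaCy cannot use the annotation. One way to check this before training is `offsets_to_biluo_tags` from `spacy.training`, which marks misaligned tokens with `-`. The sketch below assumes spaCy v3 and uses a blank English pipeline (named `nlp_blank` here to avoid overwriting `nlp`), so only the tokenizer is needed and no model download is required.

```python
import warnings

import spacy
from spacy.training import offsets_to_biluo_tags

nlp_blank = spacy.blank("en")  # tokenizer only; no trained components
doc = nlp_blank.make_doc("Tokyo Tower is 333m tall.")

# Offsets (0, 11) cover exactly the tokens "Tokyo" and "Tower"
good_tags = offsets_to_biluo_tags(doc, [(0, 11, "BUILDING")])

# Offsets (0, 10) cut "Tower" short, so the span ends inside a token
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # spaCy warns about the misaligned span
    bad_tags = offsets_to_biluo_tags(doc, [(0, 10, "BUILDING")])

print(good_tags[:2])    # tags for "Tokyo" and "Tower"
print("-" in bad_tags)  # True when at least one token is misaligned
```

If `-` shows up in the tags, the offsets of that training example are worth re-checking before calling `nlp.update`.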

## Practise

### Quiz 1

* Read the file `twitter.csv`
* Create a new column `NER` that contains the named entities recognized in the `tweet` column.

In [9]:
# write answer here



| Unnamed: 0 | id | label | tweet | NER |
| --- | --- | --- | --- | --- |
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... | {'@user when a father is dysfunctional and is ... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... | {'@user @user thanks for #lyft credit i can't ... |
| 2 | 3 | 0 | bihday your majesty | {'bihday your majesty': 'ORG'} |
| 3 | 4 | 0 | #model i love u take with u all the time in ... | {'#model i love u take with u all the time i... |
| 4 | 5 | 0 | factsguide: society now #motivation | {'factsguide: society now #motivation': 'ORG'} |
| ... | ... | ... | ... | ... |
| 31957 | 31958 | 0 | ate @user isz that youuu?ðððððð... | {'ate @user isz that youuu?ððððð... |
| 31958 | 31959 | 0 | to see nina turner on the airwaves trying to... | {'to see nina turner on the airwaves trying to... |
| 31959 | 31960 | 0 | listening to sad songs on a monday morning otw... | {'listening to sad songs on a monday morning o... |
| 31960 | 31961 | 1 | @user #sikh #temple vandalised in in #calgary,... | {'@user #sikh #temple vandalised in in #calgar... |


### Quiz 2

For each tweet, count the number of entities with each of the following labels:

* ORG
* PERSON
* GPE
* DATE
* CARDINAL
* WORK_OF_ART
* LOC
* PRODUCT
* EVENT
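One possible approach is to tally `ent.label_` over `doc.ents` with a `Counter` and report a count for every label in the list, including zeros. The sketch below is only an illustration: the names `count_entity_labels`, `make_span`, and `LABELS` are hypothetical, and the entities are set by hand on a blank pipeline so that it runs without a trained model. With the notebook's pipeline, the same function could be called as `count_entity_labels(nlp(tweet))`.

```python
from collections import Counter

import spacy

LABELS = ["ORG", "PERSON", "GPE", "DATE", "CARDINAL",
          "WORK_OF_ART", "LOC", "PRODUCT", "EVENT"]


def count_entity_labels(doc, labels=LABELS):
    """Return {label: number of entities with that label} for one Doc."""
    counts = Counter(ent.label_ for ent in doc.ents)
    return {label: counts.get(label, 0) for label in labels}


def make_span(doc, phrase, label):
    """Build a labelled Span for the first occurrence of phrase (demo helper)."""
    start = doc.text.find(phrase)
    return doc.char_span(start, start + len(phrase), label=label)


# Demo on a blank pipeline with hand-set entities standing in for predictions
nlp_blank = spacy.blank("en")
doc = nlp_blank("Steve Jobs founded Apple in Cupertino.")
doc.ents = [
    make_span(doc, "Steve Jobs", "PERSON"),
    make_span(doc, "Apple", "ORG"),
    make_span(doc, "Cupertino", "GPE"),
]

print(count_entity_labels(doc))
```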

In [22]:
# write answer here


### Quiz 3

Write a function `find_mentions` that takes an input text and returns a training example for it.

For each mention `@` in the text, add a new entity `MENTION`.

Example:

```python
text = "@elonmusk is the CEO of @Tesla"
```

The function returns:

```python
("@elonmusk is the CEO of @Tesla", {"entities": [(0, 9, "MENTION"), (24, 30, "MENTION")]})
```
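A regular expression is one way to locate the mentions and their character offsets. The sketch below is a possible starting point, not the only valid answer. It assumes a mention is `@` followed by letters, digits, or underscores, which is a simplification: it would also match the part after `@` in an e-mail address. The name `MENTION_PATTERN` is hypothetical. End offsets are exclusive, so `text[start:end]` yields the full mention including the `@`.

```python
import re

# Assumption: a mention is "@" followed by one or more word characters
MENTION_PATTERN = re.compile(r"@\w+")


def find_mentions(text):
    """Return (text, {"entities": [(start, end, "MENTION"), ...]})."""
    entities = [
        (match.start(), match.end(), "MENTION")
        for match in MENTION_PATTERN.finditer(text)
    ]
    return (text, {"entities": entities})


text = "@elonmusk is the CEO of @Tesla"
print(find_mentions(text))

# Each span slices back to the mention it annotates
for start, end, label in find_mentions(text)[1]["entities"]:
    print(text[start:end], label)
```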


In [13]:
# write answer here

### Quiz 4

Find all tweets that contain a mention `@`.

Save the results in a new DataFrame `df_mentions` or as a boolean mask.

Choose the first 100 rows for training.

Apply the function `find_mentions` to create training data.


In [None]:
# write answer here

### Quiz 5

Add a new label `HASHTAG` to the NER pipeline if it doesn't exist.

In [None]:
# Add new label for NER if it doesn't exist
# nlp = spacy.blank("en")
# ner = nlp.add_pipe("ner")
if "HASHTAG" not in nlp.get_pipe("ner").labels:
    nlp.get_pipe("ner").add_label("HASHTAG")
    print("Label added")

### Quiz 6

Train the NER pipeline with the training data.

In [None]:
# write answer here

### Quiz 7

Create a new column `MenNER` that contains the named entities recognized in the `tweet` column.

In [None]:
# write answer here

| Unnamed: 0 | id | label | tweet | NER3 |
| --- | --- | --- | --- | --- |
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... | {'@user': 'HASHTAG'} |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... | {'@user': 'HASHTAG'} |
| 2 | 3 | 0 | bihday your majesty | {} |
| 3 | 4 | 0 | #model i love u take with u all the time in ... | {} |
| 4 | 5 | 0 | factsguide: society now #motivation | {} |
| ... | ... | ... | ... | ... |
| 31957 | 31958 | 0 | ate @user isz that youuu?ðððððð... | {'@user': 'HASHTAG'} |
| 31958 | 31959 | 0 | to see nina turner on the airwaves trying to... | {} |
| 31959 | 31960 | 0 | listening to sad songs on a monday morning otw... | {} |
| 31960 | 31961 | 1 | @user #sikh #temple vandalised in in #calgary,... | {'@user': 'HASHTAG'} |


### Quiz 8

Save the model to disk.
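In spaCy, a pipeline can be written to a directory with `nlp.to_disk(path)` and read back with `spacy.load(path)`. The sketch below shows the round trip under a few assumptions: a blank pipeline with a `sentencizer` stands in for the trained `nlp` object, the directory name `MentionsNER` is hypothetical, and a temporary directory is used so nothing is left behind.

```python
import tempfile
from pathlib import Path

import spacy

# Stand-in for the trained pipeline; in the notebook this would be `nlp`
pipeline = spacy.blank("en")
pipeline.add_pipe("sentencizer")  # any component, so there is something to save

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir) / "MentionsNER"  # hypothetical directory name
    pipeline.to_disk(model_dir)       # writes config, meta, and component data
    reloaded = spacy.load(model_dir)  # load the pipeline back from the directory

print(reloaded.pipe_names)
```

To keep the model, replace the temporary directory with a permanent path.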


### Quiz 9

Solve the questions 3-7 for `Hashtags` instead of `Mentions`.

* The new label is `HASHTAG`.
* The function is `find_hashtags`.
* The new column is `Hashtags`.
* Save the model to disk as `HashtagsNER`.