# Named Entity Recognition with spaCy

**Named Entity Recognition** (NER) is a common NLP task that deals with identifying named entities in text and classifying them into categories.

A named entity is a real-world object that can be identified by a proper name. People, places, countries and organizations can all be named entities. For example, Microsoft is an organization and Asia is a geographic entity.

Raw, unstructured text is processed, and with the help of named entity recognition its contents can be labeled and classified as different entities. NER systems are developed with the help of linguistic approaches and statistical methods.

An NER model first identifies an entity and then assigns it to the most suitable class.

Practical Applications of NER:

-- Scanning large documents to find the people, organizations and locations they mention.

-- Optimizing search by indexing the key entities found in each document.

-- Text summarization.

**Named Entity Recognition with spaCy**:

spaCy is an open-source natural language processing library with a fast statistical entity recognition system. Its NER component assigns a label to spans of text and classifies them as described above.

spaCy also lets us add arbitrary classes to the entity recognition system and update the model with new examples. We can train on our own data for business-specific needs and prepare the model as necessary.

**spaCy Installation**

pip install spacy

python -m spacy download en_core_web_sm

en_core_web_sm is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities.

**Naming convention of the package**

Taking en_core_web_sm as an example, the parts that follow the language code (en) are:

-- Type: Capabilities (e.g. core for a general-purpose pipeline with tagging, parsing, lemmatization and named entity recognition, or dep for only tagging, parsing and lemmatization).

-- Genre: Type of text the pipeline is trained on, e.g. web or news.

-- Size: Package size indicator: sm, md, lg or trf.
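The naming convention can be made concrete with a small helper that splits a package name into its parts. This is only an illustrative sketch: `parse_pipeline_name` is a hypothetical function, not part of spaCy, and it assumes the four-part pattern language_type_genre_size.

```python
def parse_pipeline_name(name: str) -> dict:
    """Split a spaCy package name such as 'en_core_web_sm' into its parts.

    Hypothetical helper for illustration; assumes lang_type_genre_size.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_type_genre_size, got {name!r}")
    lang, pipeline_type, genre, size = parts
    return {"lang": lang, "type": pipeline_type, "genre": genre, "size": size}


print(parse_pipeline_name("en_core_web_sm"))
# {'lang': 'en', 'type': 'core', 'genre': 'web', 'size': 'sm'}
```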

spaCy provides pretrained pipelines for many languages; they are listed at https://spacy.io/models. These pipelines predict part-of-speech tags, dependency labels, named entities and more.

When a text is passed to nlp, it goes through each component of the pipeline in turn, as shown in the image:
![Example Image](Capture.PNG)

Taipei is labeled GPE, which stands for Geo-Political Entity.

If you are wondering what a returned label means, spacy.explain gives a short description of it.
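A label such as GPE for Taipei comes from running text through a loaded pipeline and reading doc.ents. The sketch below is not the original example: the sentence is an assumed illustration, and `load_pipeline` and `extract_entities` are hypothetical helper names. So that the snippet still runs when en_core_web_sm has not been downloaded, it falls back to a blank English pipeline, which has no NER component and therefore finds no entities.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a pretrained pipeline, or a blank English one if it is missing."""
    try:
        return spacy.load(name)
    except OSError:
        # The package is not installed; a blank pipeline only tokenizes.
        return spacy.blank("en")


def extract_entities(nlp, text):
    """Return (entity text, entity label) pairs found in the text."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


nlp = load_pipeline()
print(nlp.pipe_names)  # components the text passes through, in order

entities = extract_entities(nlp, "Taipei is the capital of Taiwan.")
for text, label in entities:
    print(text, label)
```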
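A minimal sketch of looking up label descriptions with spacy.explain. The labels chosen here are only examples; the function works without any pipeline being loaded and returns None for names it does not know.

```python
import spacy

# Look up the human-readable description of a few entity labels.
descriptions = {label: spacy.explain(label) for label in ("GPE", "NORP", "FAC")}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```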

The labels and output are self-explanatory.

**spaCy supports the following entity types**

PERSON, NORP (nationalities, religious and political groups), FAC (buildings, airports etc.), ORG (organizations), GPE (countries, cities etc.), LOC (mountain ranges, water bodies etc.), PRODUCT (products), EVENT (event names), WORK_OF_ART (books, song titles), LAW (legal document titles), LANGUAGE (named languages), DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL.

**Visualize dependencies**

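Each entity in doc.ents is a span that exposes several useful attributes:

1. Entity text by using ent.text,
2. Starting and ending character of an entity by using ent.start_char and ent.end_char,
3. Entity’s token index by using ent.start,
4. Entity’s ID by using ent.ent_id; the entity type is available as ent.label_ (string) or ent.label (integer ID),
5. Vector norm of an entity by using ent.vector_norm.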
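The attributes listed above can be inspected as in the following sketch. To keep it runnable without a model download, it assumes a blank English pipeline and sets the entities by hand; the sentence and the token positions are illustrative assumptions. ent.vector_norm is left out because it is only meaningful for pipelines that ship word vectors.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; entities are set manually below
doc = nlp("Taipei is the capital of Taiwan.")

# Tokens 0 and 5 are "Taipei" and "Taiwan"; label both by hand for the demo.
doc.ents = [Span(doc, 0, 1, label="GPE"), Span(doc, 5, 6, label="GPE")]

rows = [
    (ent.text, ent.start_char, ent.end_char, ent.start, ent.label_)
    for ent in doc.ents
]
for row in rows:
    print(row)
# ('Taipei', 0, 6, 0, 'GPE')
# ('Taiwan', 25, 31, 5, 'GPE')

# ent.label is the integer ID that the string store assigns to the label.
print(doc.ents[0].label == nlp.vocab.strings["GPE"])  # True
```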

# Adding New Named Entities in spaCy

When the model does not recognize an entity we need, we can create the entity ourselves and add it to the document's entities.

Dogecoin is a cryptocurrency, and it is not recognized by spaCy.

Dogecoin is now labeled as MONEY. If our data contains numerous cryptocurrencies that should be labeled MONEY, we can find the matching spans and update their labels as necessary.
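One way to attach the MONEY label by hand is to build a Span over the token and append it to doc.ents. This is a sketch under assumptions: the sentence is illustrative, and a blank pipeline stands in for en_core_web_sm so the snippet runs without a download (swap in spacy.load("en_core_web_sm") to start from the model's own predictions).

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: stands in for a pretrained pipeline
doc = nlp("Dogecoin is a cryptocurrency.")

before = [(ent.text, ent.label_) for ent in doc.ents]
print(before)  # [] here, since a blank pipeline predicts no entities

# Token 0 is "Dogecoin"; wrap it in a Span carrying the MONEY label.
dogecoin = Span(doc, 0, 1, label="MONEY")

# Keep whatever entities already exist and add the new one.
doc.ents = list(doc.ents) + [dogecoin]

after = [(ent.text, ent.label_) for ent in doc.ents]
print(after)  # [('Dogecoin', 'MONEY')]
```

Note that this changes the entities of this one doc only; the pipeline itself is not retrained.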

# Adding Named Entities to Matching Spans

PhraseMatcher is used to find the spans in a doc that match a list of phrases. Once the matching spans are identified, we can tag all of them with the corresponding entity label.

The pipeline only gives us the end of July as a DATE entity, but we also want spaCy to identify robot as a PRODUCT. Hence, we use PhraseMatcher to find robot and add the PRODUCT label to each matched span.

We have now added robot as a PRODUCT entity.
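The PhraseMatcher approach can be sketched as follows. The sentence is an assumed stand-in for the original text, and a blank pipeline is used so the snippet runs without a model download, which means no DATE entity appears here; with a pretrained pipeline the existing entities are kept and the new spans are merged in.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # assumption: stands in for a pretrained pipeline
doc = nlp("The robot will be delivered by the end of July.")

# Match on lowercase text so "Robot" and "robot" are both found.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("PRODUCT", [nlp.make_doc(term) for term in ["robot"]])

# Each match is (match_id, start_token, end_token); turn it into a labeled span.
new_spans = [
    Span(doc, start, end, label="PRODUCT") for _, start, end in matcher(doc)
]

# filter_spans drops overlaps so that doc.ents accepts the combined list.
doc.ents = filter_spans(list(doc.ents) + new_spans)

labeled = [(ent.text, ent.label_) for ent in doc.ents]
print(labeled)  # [('robot', 'PRODUCT')]
```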

# How to Train a Custom NER Model in spaCy

To train the model, we need relevant data with proper annotations. I have used the medical entities dataset here.

Install spacy-transformers:

pip install spacy[transformers]

We extract the text and the corresponding annotations and arrange them as structured data.

For the text above, we have the labels with their corresponding spans.

spaCy uses the **DocBin** class for annotated data, so we’ll have to create DocBin objects for our training examples. The DocBin class efficiently serializes the information from a collection of Doc objects. It is faster and produces smaller data sizes than pickle, and allows the user to deserialize without executing arbitrary Python code.

The indices of some entities overlap. spaCy provides a utility function, filter_spans, to deal with this: when spans overlap, it keeps the longest one.

The DocBin saves the training data in spaCy format, which we need to train a model. Then, we can either create a config file manually for the use case or quickly create a base config on spaCy’s training quickstart page here.
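The conversion to DocBin might look like the sketch below. The toy record, its labels (MEDICINE, MEDICALCONDITION) and the helper name `build_docbin` are assumptions made for illustration, not the actual dataset; the dictionary layout mirrors a training_data["annotations"] list whose items hold a text and character-offset entities.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

# Hypothetical toy data: (start_char, end_char, label) per entity.
# The last two spans overlap on purpose.
training_data = {
    "annotations": [
        {
            "text": "Aspirin can relieve a mild headache.",
            "entities": [
                (0, 7, "MEDICINE"),
                (27, 35, "MEDICALCONDITION"),
                (22, 35, "MEDICALCONDITION"),
            ],
        }
    ]
}


def build_docbin(nlp, annotations):
    """Turn character-offset annotations into a DocBin of Doc objects."""
    doc_bin = DocBin()
    for example in annotations:
        doc = nlp.make_doc(example["text"])
        spans = []
        for start, end, label in example["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                continue  # offsets do not line up with token boundaries
            spans.append(span)
        doc.ents = filter_spans(spans)  # resolve overlaps, longest span wins
        doc_bin.add(doc)
    return doc_bin


nlp = spacy.blank("en")
doc_bin = build_docbin(nlp, training_data["annotations"])
print(len(doc_bin))  # 1
```

Calling doc_bin.to_disk("./training_data.spacy") then writes the file that the training command expects.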
![Example Image](docbin.PNG)

We’ll be working with a base config file generated on the quickstart page. This is an incomplete file with only our custom options, so we’ll have to fill in the rest with the default values. The command below is run in the command prompt (CMD).

python -m spacy init fill-config base_config.cfg config.cfg

Please make sure that the training data in spaCy format is available in the same path before running the line above. The command creates the complete config file, config.cfg, in the same directory.

Now that we have everything we need, let's train the model with the line below:


python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy

Let’s load the best-performing model and test it on a piece of text. I ran the command lines in Anaconda Prompt, while the output below was generated in a Jupyter notebook.

nlp_ner = spacy.load("model-best")

doc = nlp_ner(training_data['annotations'][0].get('text'))

spacy.displacy.render(doc, style="ent", options= options, jupyter=True)

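Even with a very limited amount of data, the model achieves decent performance. Note, however, that the training command above passes the same file to --paths.train and --paths.dev, so the model is evaluated on data it has already seen; a held-out dev set gives a more reliable estimate. We can follow the same steps to train a model on different data, as long as that data is annotated.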