## spaCy
spaCy’s models are statistical and **every “decision” they make** – for example, which part-of-speech tag to assign, or whether a word is a named entity – **is a prediction**. This prediction is based on the examples the model has seen during training. To train a model, you first need training data – examples of text, and the labels you want the model to predict. This could be a part-of-speech tag, a named entity or any other information.
All spaCy models support online learning, so you can update a pretrained model with new examples. You’ll usually need to provide many examples to meaningfully improve the system — a few hundred is a good start, although more is better.
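
The update step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's own training code: it starts from a blank English pipeline (so no model download is needed) rather than a pretrained one, and `TRAIN_DATA` is a hand-written toy set of two sentences whose labels mirror the predictions shown later in this notebook. Real training data would need far more examples.

```python
import random

import spacy
from spacy.training import Example

# Hypothetical toy annotations: (text, {"entities": [(start_char, end_char, label)]}).
TRAIN_DATA = [
    ("EU rejects German call to boycott British lamb",
     {"entities": [(0, 2, "ORG"), (11, 17, "NORP"), (34, 41, "NORP")]}),
    ("British lamb exports to the EU fell",
     {"entities": [(0, 7, "NORP"), (28, 30, "ORG")]}),
]

# Assumption: a blank pipeline with a fresh 'ner' component, not a pretrained model.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

# Pair each unannotated Doc with its gold-standard annotations.
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)

random.seed(0)
for epoch in range(5):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)
    print(epoch, losses)
```

To continue training an already loaded pretrained pipeline instead, spaCy 3 provides `nlp.resume_training()` to obtain an optimizer without resetting the existing weights.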

In [None]:
# Import packages (the unused `io` and `pyforest` imports are dropped)
import pandas as pd
import spacy

In [33]:
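# Read the file with one token per row.
# Note: `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions use on_bad_lines="skip" instead.
doc = pd.read_csv('train_notags.csv', error_bad_lines=False, lineterminator='\r', engine="c", sep="\n")
df = doc
doc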

Unnamed: 0,-DOCSTART-
,EU
,rejects
,German
,call
,to
...,...
,Swansea
,1
,Lincoln
,2


# Named Entity Recognition (NER)
spaCy has an **'ner'** pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the `ents` property of a `Doc` object.

In [11]:
spacy.__version__

'3.0.5'

In [12]:
# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')

In [13]:
tokens= []
lemma = []
pos = []
tag = []
dep = []
ent= []
label = []
ent_exp= []


for item in nlp.pipe(doc['-DOCSTART-'].astype('str').values, batch_size=50):
    if item.has_annotation("DEP"):
        tokens.append([n.text for n in item])
        lemma.append([n.lemma_ for n in item])
        pos.append([n.pos_ for n in item])
        tag.append([n.tag_ for n in item])
        dep.append([n.dep_ for n in item])
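        # Note: this stores the whole `item.ents` tuple once per entity, which is why the
        # output below is nested; use `ent.text` here to store plain entity strings instead.
        ent.append([item.ents for ent in item.ents])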
        label.append([ent.label_ for ent in item.ents])
        ent_exp.append([str(spacy.explain(ent.label_))  for ent in item.ents])

        

    else:
        # Keep the lists of parsed results the same length as the original
        # DataFrame by adding a placeholder when a row has no parse.
        tokens.append(None)
        lemma.append(None)
        pos.append(None)
        tag.append(None)
        dep.append(None)
        ent.append(None)
        label.append(None)
        ent_exp.append(None)
        

doc['doc_token'] = tokens
doc['doc_lemma'] = lemma
doc['doc_pos'] = pos
doc['doc_tag'] = tag
doc["doc_dep"] = dep
doc['doc_ent']= ent
doc['doc_label'] = label
doc['ent_exp'] = ent_exp



In [14]:
doc.head()

Unnamed: 0,-DOCSTART-,doc_token,doc_lemma,doc_pos,doc_tag,doc_dep,doc_ent,doc_label,ent_exp
,EU,[EU],[EU],[PROPN],[NNP],[ROOT],"[((EU),)]",[ORG],"[Companies, agencies, institutions, etc.]"
,rejects,[rejects],[reject],[VERB],[VBZ],[ROOT],[],[],[]
,German,[German],[german],[ADJ],[JJ],[ROOT],"[((German),)]",[NORP],[Nationalities or religious or political groups]
,call,[call],[call],[VERB],[VB],[ROOT],[],[],[]
,to,[to],[to],[ADP],[IN],[ROOT],[],[],[]


In [16]:
# Show only the NER-related columns
doc[["-DOCSTART-","doc_ent","doc_label","ent_exp"]].head(25)

Unnamed: 0,-DOCSTART-,doc_ent,doc_label,ent_exp
,EU,"[((EU),)]",[ORG],"[Companies, agencies, institutions, etc.]"
,rejects,[],[],[]
,German,"[((German),)]",[NORP],[Nationalities or religious or political groups]
,call,[],[],[]
,to,[],[],[]
,boycott,[],[],[]
,British,"[((British),)]",[NORP],[Nationalities or religious or political groups]
,lamb,[],[],[]
,.,[],[],[]
,Peter,[],[],[]


## Entity annotations
`Doc.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The token span's *start* index position in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The token span's *stop* index position in the Doc</td></tr>
<tr><td>`ent.start_char`</td><td>The entity text's *start* index position in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The entity text's *stop* index position in the Doc</td></tr>
</table>
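
The attributes in the table can be inspected with a short loop. The sketch below is illustrative only: to stay independent of any downloaded model it uses a blank English pipeline and sets the entities by hand (an assumption that mirrors the labels predicted earlier in this notebook) instead of running the trained `ner` component.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; no trained 'ner' component
doc = nlp("EU rejects German call to boycott British lamb")

# Assumption: entities assigned manually as token spans (start token, end token).
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 2, 3, label="NORP"),
    Span(doc, 6, 7, label="NORP"),
]

# Token offsets (start, end) and character offsets (start_char, end_char) per entity.
rows = [
    (ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
    for ent in doc.ents
]
for row in rows:
    print(row)

# ent.label is the integer hash of the string exposed as ent.label_.
label_hash_matches = doc.ents[0].label == nlp.vocab.strings["ORG"]
print(label_hash_matches)
```

Note that `ent.end` and `ent.end_char` are exclusive, so `doc[ent.start:ent.end]` and `doc.text[ent.start_char:ent.end_char]` both return the entity itself.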



## NER Tags
Tags are accessible through the `.label_` property of an entity.
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>
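
The descriptions in this table do not have to be memorized: `spacy.explain()`, already used in the loop earlier in this notebook, looks them up from spaCy's built-in glossary. A small sketch follows; the choice of four labels is arbitrary.

```python
import spacy

# Arbitrary sample of labels from the table above.
labels = ["ORG", "NORP", "GPE", "PERSON"]

# spacy.explain() returns the glossary description for a label string.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<8} {description}")
```

With a loaded pipeline, `nlp.get_pipe("ner").labels` lists the labels that particular model was trained on, which may differ from the table for other models or languages.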

For more on **Named Entity Recognition**, see: https://spacy.io/usage/linguistic-features#101

For more on **noun_chunks**, see: https://spacy.io/usage/linguistic-features#noun-chunks


# Visualizing Named Entities
Besides the dependency-parse view (`style='dep'`), **displaCy** offers an entity visualizer, `style='ent'`:

In [18]:
# Import the displaCy library
from spacy import displacy

In [47]:
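# `doc` still refers to the pandas DataFrame at this point, so parse a sentence into a spaCy Doc first
displacy.render(nlp("EU rejects German call to boycott British lamb"), style='ent', jupyter=True)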


## Viewing Sentences Line by Line
In spaCy 3 (the version used here), `displacy.render()` accepts a single `Doc` or `Span`, or a list of them. To view the entities one sentence at a time, pass the list of sentence spans from `doc.sents`; displaCy converts each span into its own `Doc` before rendering:



In [42]:
nlp = spacy.load("en_core_web_sm")
text = "EU rejects German call to boycott British lamb"
# From here on, `doc` is a spaCy Doc and no longer the DataFrame loaded earlier
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="ent",jupyter=True)

## Customizing Colors and Effects
You can also pass background color and gradient options:

In [44]:
# Show only ORG and PRODUCT entities, each with a custom background
colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}

options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}

displacy.render(sentence_spans, style='ent', jupyter=True, options=options)
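
To check what the `ents` and `colors` options actually do, it can help to look at the generated markup. The sketch below is an illustration under assumptions: it uses a blank pipeline with hand-set entities (mirroring the predictions shown earlier) and passes `jupyter=False` so that `displacy.render()` returns the HTML as a string instead of displaying it.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; entities are set by hand below
doc = nlp("EU rejects German call to boycott British lamb")
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 2, 3, label="NORP"),
    Span(doc, 6, 7, label="NORP"),
]

# Only ORG is listed in 'ents', so the NORP entities are rendered as plain text.
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["ORG"], "colors": colors}

# jupyter=False makes render() return the markup rather than display it.
html = displacy.render(doc, style="ent", jupyter=False, options=options)
print(html)
```

Labels without an entry in `colors` fall back to displaCy's default palette.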

___
# Creating Visualizations Outside of Jupyter
If you're using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately.

Instead of `displacy.render()`, use `displacy.serve()`:

In [46]:
displacy.serve(sentence_spans, style='ent', options=options)


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
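
When a running server is not needed, the same markup can be written to a file and opened in a browser. This is a hedged sketch rather than part of the original notebook: the blank pipeline, the hand-set entity, and the output file name `entities.html` are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; the entity is set by hand below
doc = nlp("EU rejects German call to boycott British lamb")
doc.ents = [Span(doc, 0, 1, label="ORG")]

# page=True wraps the markup in a full HTML page; jupyter=False returns it as a string.
html = displacy.render(doc, style="ent", page=True, jupyter=False)

# Hypothetical output location in the system's temporary directory.
output_path = Path(tempfile.gettempdir()) / "entities.html"
output_path.write_text(html, encoding="utf-8")
print(f"Wrote {output_path}")
```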


Source: https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718
