### Open source library for NLP in Python

### What is NLTK?

#### A suite of libraries and programs for symbolic and statistical NLP for English

### What is spaCy?

#### A free, open-source library for NLP in Python

### NLTK vs spaCy

#### spaCy is generally faster than NLTK and simpler to use: it ships ready-made pipelines, so one call gives you tokens, tags and entities

In [56]:
import spacy

#### The code below demonstrates the tokenizer. Because a blank pipeline has no other components, the tokenizer is all we can use.

In [30]:
nlp = spacy.blank("en")

doc = nlp("This is the text to check tokenizer")

for token in doc:
    print(token)

This
is
the
text
to
check
tokenizer
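
Even without a trained pipeline, each token carries lexical attributes that are computed by rules rather than by a model. The following is a minimal sketch, assuming spaCy v3; the sample sentence and the names `nlp_blank` and `doc_blank` are illustrative and not part of the notebook above.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla Inc paid $45 billion")

for token in doc_blank:
    # is_alpha, like_num and is_currency are rule-based lexical attributes,
    # so they are available without loading a model.
    print(token.i, token.text, token.is_alpha, token.like_num, token.is_currency)

tokens = [token.text for token in doc_blank]
```

Note that the tokenizer splits `$45` into two tokens, which matters later when token indices are used to build spans.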


In [31]:
nlp = spacy.load("en_core_web_sm")  # Small pretrained pipeline for English

In [32]:
nlp.pipe_names  # Names of the components in the pipeline

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

#### A pipeline is a sequence of components that run after the tokenizer. Even if the pipeline is blank, we still get the tokenizer by default.
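
To make this concrete, the sketch below starts from a blank pipeline and adds a single component by name. It assumes spaCy v3, where `sentencizer` is a built-in rule-based component that needs no trained model; the variable names and the sample text are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)  # no components yet, only the tokenizer runs

# Add a built-in, rule-based sentence splitter to the pipeline.
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

doc_blank = nlp_blank("Tesla makes cars. Twitter is a social network.")
sentences = [sent.text for sent in doc_blank.sents]
print(sentences)
```

The tokenizer never appears in `pipe_names`; it always runs first, and the listed components then process the resulting `Doc` in order.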

### To see all the entities in the text

In [33]:
doc = nlp("Tesla Inc has aquired twitter for $45 billion")

for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))
    
# ent.text        - the entity as plain text
# ent.label_      - the entity label
# spacy.explain() - a short description of the label

Tesla Inc | ORG | Companies, agencies, institutions, etc.
$45 billion | MONEY | Monetary values, including unit


#### A more visual way to display the entities is displaCy, which highlights them inline:

In [34]:
from spacy import displacy

displacy.render(doc, style = "ent")
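
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes spaCy v3 and sets the entities by hand on a blank pipeline, mirroring the model output shown earlier, purely so that it runs without a trained model; the variable names are illustrative.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla Inc has acquired twitter for $45 billion")

# Hand-labelled entities (an assumption for this sketch, not model output):
# tokens 0-1 are "Tesla Inc", tokens 6-8 are "$", "45", "billion".
doc_blank.ents = [
    Span(doc_blank, 0, 2, label="ORG"),
    Span(doc_blank, 6, 9, label="MONEY"),
]

# jupyter=False makes render() return the HTML instead of displaying it.
html = displacy.render(doc_blank, style="ent", page=True, jupyter=False)
print(len(html))
```

The returned string can be written to an `.html` file and opened in a browser.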

#### The model did not tag "twitter" as a company in the first sentence. The NER component is a statistical model, and cues such as capitalization and surrounding words probably influence what it labels.

#### So we will change the sentence slightly and write "Twitter Inc"

In [35]:
doc = nlp("Tesla Inc has aquired Twitter Inc for $45 billion")

for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Tesla Inc | ORG | Companies, agencies, institutions, etc.
Twitter Inc | ORG | Companies, agencies, institutions, etc.
$45 billion | MONEY | Monetary values, including unit


In [36]:
displacy.render(doc, style = "ent")

#### A capitalized word followed by "Inc" appears to be a strong cue for the ORG label. This is a learned tendency of the model, not a fixed rule.
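
If a deterministic rule of that kind is what you actually want, it can be written explicitly with spaCy's `entity_ruler` component. This is a minimal sketch, assuming spaCy v3 and a blank pipeline; the pattern and the names `nlp_rules`, `ruler` and `doc_rules` are illustrative, and this is not how the trained `ner` component works internally.

```python
import spacy

nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Pattern: a title-cased token immediately followed by the literal token "Inc".
ruler.add_patterns([
    {"label": "ORG", "pattern": [{"IS_TITLE": True}, {"TEXT": "Inc"}]},
])

doc_rules = nlp_rules("Tesla Inc has acquired Twitter Inc for $45 billion")
rule_ents = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(rule_ents)
```

Unlike the statistical model, this rule never matches a lowercase "twitter" and knows nothing about money amounts; it only finds what the pattern describes.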

In [37]:
nlp.pipe_labels["ner"]  # Entity labels that this model can detect

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']
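
Some of these abbreviations are not self-explanatory. `spacy.explain` is a glossary lookup that does not need a loaded model, so the labels can be described in a loop. A small sketch, using a handful of labels copied from the list above:

```python
import spacy

# A few labels from the NER label list; any of the others can be added here.
labels = ["ORG", "MONEY", "GPE", "NORP", "FAC"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<6} {description}")
```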

### Hugging Face is also a very popular NLP library

https://huggingface.co/dslim/bert-base-NER?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California.

### Adding custom entities in spaCy

In [52]:
doc = nlp("Tesla has aquire Twitter in $45 billion")

for ent in doc.ents:
    print(ent.text, "|", ent.label_)

Tesla | ORG
Twitter | PRODUCT
$45 billion | MONEY


In [53]:
# To relabel "Twitter" as ORG, we first need a Span that covers it.
# Indexing a Doc with an integer gives a Token; slicing gives a Span.

print(doc[0]) # This is a token
doc[0:3] # This is a span.

Tesla


Tesla has aquire
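
The difference between the two objects can be checked directly. A minimal sketch on a blank pipeline, assuming spaCy v3; the sentence and variable names are illustrative:

```python
import spacy
from spacy.tokens import Span, Token

nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla has acquired Twitter for $45 billion")

token = doc_blank[0]    # integer index -> Token
span = doc_blank[0:3]   # slice -> Span (the end index is exclusive)

print(type(token).__name__, "|", token.text)
print(type(span).__name__, "|", span.text, "|", span.start, span.end)
```

`span.start` and `span.end` are token indices, the same numbers that are passed to the `Span` constructor when creating an entity.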

In [54]:
from spacy.tokens import Span

s1 = Span(doc, 3, 4, label="ORG")

doc.set_ents([s1], default = "unmodified")

In [55]:
for ent in doc.ents:
    print(ent.text, "|", ent.label_)

Tesla | ORG
Twitter | ORG
$45 billion | MONEY
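
Counting token indices by hand is error-prone on longer texts. One alternative is to locate the substring by character offsets and let `Doc.char_span` build the span. The sketch below assumes spaCy v3 and uses a blank pipeline so that it runs without a trained model; the sentence and variable names are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")
text = "Tesla has acquired Twitter for $45 billion"
doc_blank = nlp_blank(text)

# Locate the substring by character offsets instead of counting tokens.
start_char = text.find("Twitter")
end_char = start_char + len("Twitter")

# char_span() returns None when the offsets do not align with token boundaries.
span = doc_blank.char_span(start_char, end_char, label="ORG")
if span is not None:
    # default="unmodified" keeps any entities that are already on the doc.
    doc_blank.set_ents([span], default="unmodified")

custom_ents = [(ent.text, ent.label_) for ent in doc_blank.ents]
print(custom_ents)
```

With a trained pipeline such as `en_core_web_sm`, the same call replaces the label on the tokens covered by the span and leaves the other predicted entities as they are, which is the behaviour seen in the output above.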
