# spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

In this section we'll install and set up spaCy to work with Python, and then introduce some concepts related to Natural Language Processing.
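Before running the cells below, it can help to confirm that spaCy and the language models are importable. The sketch below assumes a typical setup in which spaCy was installed with `pip install spacy` and the models were fetched with `python -m spacy download en_core_web_sm` (and likewise for `es_core_news_sm`). The helper name `is_installed` is hypothetical and uses only the standard library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be located, without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of spaCy and the two models used in this notebook.
for name in ("spacy", "en_core_web_sm", "es_core_news_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing, `spacy.load` or the direct model import in the following cells would be expected to fail until it is installed.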

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $7 million')

In [4]:
for token in doc:
    print(token.text, token.pos, token.pos_)

Tesla 96 PROPN
is 87 AUX
looking 100 VERB
at 85 ADP
buying 100 VERB
U.S. 96 PROPN
startup 92 NOUN
for 85 ADP
$ 99 SYM
7 93 NUM
million 93 NUM


In [5]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f32a53ad6d0>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f32a550bd00>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f32a550bca0>)]

In [6]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [7]:
nlp.pipe_labels

OrderedDict([('tagger',
              ['$',
               "''",
               ',',
               '-LRB-',
               '-RRB-',
               '.',
               ':',
               'ADD',
               'AFX',
               'CC',
               'CD',
               'DT',
               'EX',
               'FW',
               'HYPH',
               'IN',
               'JJ',
               'JJR',
               'JJS',
               'LS',
               'MD',
               'NFP',
               'NN',
               'NNP',
               'NNPS',
               'NNS',
               'PDT',
               'POS',
               'PRP',
               'PRP$',
               'RB',
               'RBR',
               'RBS',
               'RP',
               'SYM',
               'TO',
               'UH',
               'VB',
               'VBD',
               'VBG',
               'VBN',
               'VBP',
               'VBZ',
               'WDT',
               'WP',
    

# Now let's apply these examples to Spanish

In [8]:
import es_core_news_sm
nlp_es = es_core_news_sm.load()

In [10]:
doc = nlp_es("Cómo ser más productivo es algo que seguro te has planteado si valoras tu tiempo y quieres aprovecharlo al 100%. Puede que muchas veces te veas saturado por la cantidad de tareas que se te acumulan en el día a día y no sepas por donde empezar. A veces incluso sientes que no avanzas y cada vez te agobias más. Por eso estás decidido a aprender de productividad y mejorar esa faceta de tu vida profesional.")
print([(w.text, w.pos_) for w in doc])

[('Cómo', 'PRON'), ('ser', 'VERB'), ('más', 'ADV'), ('productivo', 'ADJ'), ('es', 'AUX'), ('algo', 'PRON'), ('que', 'SCONJ'), ('seguro', 'ADV'), ('te', 'PRON'), ('has', 'AUX'), ('planteado', 'VERB'), ('si', 'SCONJ'), ('valoras', 'NOUN'), ('tu', 'DET'), ('tiempo', 'NOUN'), ('y', 'CCONJ'), ('quieres', 'NOUN'), ('aprovecharlo', 'VERB'), ('al', 'ADP'), ('100%', 'SYM'), ('.', 'PUNCT'), ('Puede', 'VERB'), ('que', 'SCONJ'), ('muchas', 'DET'), ('veces', 'NOUN'), ('te', 'PRON'), ('veas', 'VERB'), ('saturado', 'ADJ'), ('por', 'ADP'), ('la', 'DET'), ('cantidad', 'NOUN'), ('de', 'ADP'), ('tareas', 'NOUN'), ('que', 'SCONJ'), ('se', 'PRON'), ('te', 'PRON'), ('acumulan', 'AUX'), ('en', 'ADP'), ('el', 'DET'), ('día', 'NOUN'), ('a', 'ADP'), ('día', 'NOUN'), ('y', 'CCONJ'), ('no', 'ADV'), ('sepas', 'VERB'), ('por', 'ADP'), ('donde', 'PRON'), ('empezar', 'VERB'), ('.', 'PUNCT'), ('A', 'ADP'), ('veces', 'INTJ'), ('incluso', 'ADV'), ('sientes', 'ADJ'), ('que', 'SCONJ'), ('no', 'ADV'), ('avanzas', 'NOUN'), 

# Tokenization

Tokenization is the process of breaking up the original text into component pieces (tokens). Tokens are the basic building blocks of a `Doc` object: everything that helps us understand the meaning of a text is derived from tokens and their relationships. spaCy's tokenizer relies on the following kinds of rules:

- Prefix: Character(s) at the beginning
- Suffix: Character(s) at the end
- Infix: Character(s) in between
- Exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.
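
The four rule types above can be illustrated with a toy tokenizer written in plain Python. This is a deliberately simplified sketch, not spaCy's actual implementation: the function `toy_tokenize` and the tiny rule sets `EXCEPTIONS`, `PREFIX_RE`, `SUFFIX_RE` and `INFIX_RE` are hypothetical and cover only a handful of characters.

```python
import re

# Hypothetical, heavily simplified rule sets (not spaCy's real ones).
EXCEPTIONS = {"we're": ["we", "'re"], "L.A.": ["L.A."]}
PREFIX_RE = re.compile(r'^["(\[$]')      # characters split off the beginning
SUFFIX_RE = re.compile(r'["!?,.)\]]$')   # characters split off the end
INFIX_RE = re.compile(r"(-)")            # characters split out of the middle


def toy_tokenize(text):
    """Split on whitespace, then apply exception, prefix, suffix and infix rules."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel punctuation off both ends until an exception matches
        # or nothing more can be removed.
        while chunk and chunk not in EXCEPTIONS:
            if PREFIX_RE.search(chunk):
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif SUFFIX_RE.search(chunk):
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        if chunk in EXCEPTIONS:
            # An exception rule decides how (or whether) the chunk is split.
            middle = EXCEPTIONS[chunk]
        else:
            # Otherwise split on infix characters, keeping them as tokens.
            middle = [part for part in INFIX_RE.split(chunk) if part]
        tokens.extend(prefixes + middle + suffixes)
    return tokens


print(toy_tokenize('"we\'re moving to L.A.!"'))
# ['"', 'we', "'re", 'moving', 'to', 'L.A.', '!', '"']
```

Note how the exception entry for `L.A.` prevents the suffix rule from stripping its final period, while the trailing `!` and `"` are still split off one at a time.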

In [11]:
import en_core_web_sm
nlp_en = en_core_web_sm.load()

In [13]:
my_string = '"we\'re moving to L.A.!"'

In [14]:
print(my_string)

"we're moving to L.A.!"


In [15]:
doc = nlp_en(my_string)

In [18]:
for token in doc:
    print(token.text)

"
we
're
moving
to
L.A.
!
"


In [20]:
another_text = u"We're here to help!, send snail-email, email support@oursite.com or visit http://www.oursite.com "
another_doc = nlp_en(another_text)

In [21]:
for token in another_doc:
    print(token)

We
're
here
to
help
!
,
send
snail
-
email
,
email
support@oursite.com
or
visit
http://www.oursite.com


In [27]:
esp_text = "Un viaje en texi en la ciudad de Mexico cuesta alrededor de $350.45 MXN pesos"
doc_es = nlp_es(esp_text)

In [28]:
for t in doc_es:
    print(t.text, t.pos_)

Un DET
viaje NOUN
en ADP
texi NOUN
en ADP
la DET
ciudad NOUN
de ADP
Mexico PROPN
cuesta VERB
alrededor ADV
de ADP
$ PROPN
350.45 PROPN
MXN PROPN
pesos NOUN


In [29]:
len(doc_es)

16

In [35]:
len(doc_es.vocab)

308

In [36]:
for entity in doc_es.ents:
    print(entity)

Mexico
MXN


In [42]:
n_text = "A startup created by Capital One group is going to invest in a large cluster in AWS Cloud for 11 Million USD"
for ent in nlp_en(n_text).ents:
    print(f"Entity = {ent}, Label = {ent.label_}, Label Explanation = {spacy.explain(ent.label_)}")

Entity = Capital One, Label = ORG, Label Explanation = Companies, agencies, institutions, etc.
Entity = AWS Cloud, Label = PRODUCT, Label Explanation = Objects, vehicles, foods, etc. (not services)
Entity = 11 Million, Label = CARDINAL, Label Explanation = Numerals that do not fall under another type


In [43]:
from spacy import displacy

In [44]:
dod = nlp_en(u'The Ford motor company is going to invest $13 million dollars on building a big factory that will build batteries')

In [48]:
displacy.render(dod, style='dep', jupyter=True, options={'distance':80})

In [49]:
displacy.render(dod, style='ent', jupyter=True)

In [59]:
n_doc = 'Over the last quarter Verizon and Motorola sold 20 Million smartphones (iPods, Moto G and Samsung Galaxy S20) over Korea to make a profit of 45 million'
dd = nlp_en(n_doc)

In [60]:
displacy.render(dd, style='ent', jupyter=True)

# Serving visualizations

To view a visualization outside of a Jupyter notebook, use `displacy.serve`, which starts a local web server:

In [62]:
displacy.serve(dd, style='ent')


127.0.0.1 - - [13/Sep/2020 16:20:11] "GET / HTTP/1.1" 200 3815
127.0.0.1 - - [13/Sep/2020 16:20:11] "GET /favicon.ico HTTP/1.1" 200 3815
127.0.0.1 - - [13/Sep/2020 16:21:06] "GET / HTTP/1.1" 200 3815
127.0.0.1 - - [13/Sep/2020 16:21:06] "GET /favicon.ico HTTP/1.1" 200 3815



Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


___
# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.
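
As a minimal sketch of the "return markup" behaviour, the example below asks displaCy for an HTML string instead of rendering inline. It assumes spaCy is installed and uses displaCy's manual mode, so no model is loaded: the sentence and the entity offsets in `example` are hand-written for illustration and are not model output.

```python
from spacy import displacy

# Hand-written entity annotations (illustrative only, not produced by a model).
text = "Verizon and Motorola sold 20 million smartphones"
example = {
    "text": text,
    "ents": [
        {"start": 0, "end": 7, "label": "ORG"},    # "Verizon"
        {"start": 12, "end": 20, "label": "ORG"},  # "Motorola"
    ],
    "title": None,
}

# jupyter=False asks for the markup as a string rather than inline display;
# page=True wraps it in a complete HTML page.
html = displacy.render(example, style="ent", manual=True, page=True, jupyter=False)
print(html[:200])
```

The resulting string can be written to a file, for example with `pathlib.Path("entities.html").write_text(html, encoding="utf-8")`, and opened in a browser without running a server.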

For more information, visit https://spacy.io/usage/visualizers