spaCy operates on a pipeline made up of several steps, including tokenisation, lemmatisation and other optional add-in components.

A pipeline is instantiated either by creating a blank one with `spacy.blank(<language>)` or by loading a trained one with `spacy.load(<name of trained model>)`.
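
A minimal sketch of the idea that a pipeline is a list of components, assuming spaCy v3. It starts from a blank English pipeline and adds the built-in rule-based `sentencizer` as an optional add-in. The sample sentence is made up for illustration.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components
nlp = spacy.blank("en")
print(nlp.pipe_names)

# Add an optional component: the built-in rule-based sentence splitter
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

doc = nlp("Randy is learning NLP. It is fun!")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

`nlp.pipe_names` is a quick way to check which components will run on a text, which is useful when comparing a blank pipeline with a trained one.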

Objects
- nlp : the pipeline
- doc : the end result of passing raw text through the pipeline
- token : an individual token (word or other component) in the document
- Span : a slice of the document made up of multiple tokens

# Exploration

In [58]:
import spacy

In [59]:
nlp = spacy.blank("en")

In [60]:
sample_string = "Randy is learning NLP on spacy!"
doc = nlp(sample_string)

In [61]:
doc.text

'Randy is learning NLP on spacy!'

In [62]:
# tokens
tokens = [tok for tok in doc]

In [63]:
tokens

[Randy, is, learning, NLP, on, spacy, !]

In [64]:
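# spans
# Note: slicing the token list returns a plain Python list of tokens;
# slicing the doc itself (doc[:3]) would return a true Span object
sample_span = tokens[:3]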
sample_span

[Randy, is, learning]
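
To make the difference between a list of tokens and a real `Span` concrete, here is a small sketch on the same sentence using a blank pipeline. The variable names `token_list` and `span` are only for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Randy is learning NLP on spacy!")

# Slicing a Python list of tokens gives another plain list
token_list = [tok for tok in doc][:3]

# Slicing the Doc itself gives a Span, a view onto the original Doc
span = doc[:3]

print(type(token_list).__name__, type(span).__name__)
print(span.text, span.start, span.end)
```

A `Span` keeps its position in the document (`start`, `end`) and exposes `text`, which a plain list does not.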

In [65]:
# Lexical attributes
# Doc: https://spacy.io/api/token
# e.g. token.is_alpha, token.is_punct, token.like_num
for idx, token in enumerate(tokens):
    print(f"Examining {token} in idx {idx}")
    print(f"Alpha: {token.is_alpha}")
    print(f"ascii: {token.is_ascii}")
    print(f"Title case: {token.is_title}")
    print(f"Lower case: {token.is_lower}")
    print(f"Left punctuation: {token.is_left_punct}")
    print(f"Right punctuation: {token.is_right_punct}")
    print(f"Number: {token.like_num}")
    print(f"OOV?: {token.is_oov}")
    print(f"url: {token.like_url}")
    print(f"Email: {token.like_email}")


Examining Randy in idx 0
Alpha: True
ascii: True
Title case: True
Lower case: False
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining is in idx 1
Alpha: True
ascii: True
Title case: False
Lower case: True
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining learning in idx 2
Alpha: True
ascii: True
Title case: False
Lower case: True
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining NLP in idx 3
Alpha: True
ascii: True
Title case: False
Lower case: False
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining on in idx 4
Alpha: True
ascii: True
Title case: False
Lower case: True
Left punctuation: False
Right punctuation: False
Number: False
OOV?: True
url: False
Email: False
Examining spacy in idx 5
Alpha: True
ascii: True
Title case: False
Lower case: True
Left punct
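
Lexical attributes do not depend on context, so they can be used to filter tokens even on a blank pipeline. The sketch below uses a made-up sentence to collect number-like tokens and punctuation. As far as I understand, `is_oov` is `True` for every token here because a blank pipeline has no word vectors, which matches the output above.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I need two eggs and 10 apples!")

# like_num covers both digits and spelled-out English numbers
number_like = [tok.text for tok in doc if tok.like_num]
punctuation = [tok.text for tok in doc if tok.is_punct]

# No word vectors are loaded, so every token counts as out-of-vocabulary
all_oov = all(tok.is_oov for tok in doc)

print(number_like, punctuation, all_oov)
```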

# Loading a trained model

In [66]:
nlp_en_core_web_sm = spacy.load("en_core_web_sm")



In [67]:
sample_email = "chngyuanlong@gmail.com"
sample_url = "https://myfirstdatasciencejob.wordpress.com/"
doc = nlp_en_core_web_sm(sample_email)
tokens = [tok for tok in doc]
print(tokens)
print(f"Like email : {tokens[0].like_email}")

[chngyuanlong@gmail.com]
Like email : True


In [68]:
doc = nlp_en_core_web_sm(sample_url)
tokens = [tok for tok in doc]
print(tokens)
print(f"Like url : {tokens[0].like_url}")

[https://myfirstdatasciencejob.wordpress.com/]
Like url : True


In [69]:
sample_similarity_sentence = "My cat is like a god. Oh opps I meant dog"
doc = nlp_en_core_web_sm(sample_similarity_sentence)
tokens = [tok for tok in doc]
print(tokens)

[My, cat, is, like, a, god, ., Oh, opps, I, meant, dog]


In [70]:
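# Note: en_core_web_sm ships without word vectors, so these scores are not
# based on real word vectors and may not be meaningful
print(tokens[1].similarity(tokens[5]))
print(tokens[1].similarity(tokens[-1]))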

0.34402838349342346
0.23464104533195496
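
When word vectors are available, `similarity` is the cosine similarity between the two vectors. The sketch below shows this on a blank pipeline with hand-made 2-d toy vectors. The vectors are an assumption chosen purely for illustration and say nothing about real cats, dogs or gods.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")

# Hypothetical toy vectors, picked by hand so the cosine is easy to check
nlp.vocab.set_vector("cat", np.array([1.0, 0.0], dtype="float32"))
nlp.vocab.set_vector("dog", np.array([1.0, 1.0], dtype="float32"))
nlp.vocab.set_vector("god", np.array([0.0, 1.0], dtype="float32"))

doc = nlp("cat dog god")
cat, dog, god = doc[0], doc[1], doc[2]

cat_dog = cat.similarity(dog)  # vectors 45 degrees apart
cat_god = cat.similarity(god)  # orthogonal vectors
print(cat_dog, cat_god)
```

For meaningful scores on real text, a pipeline that ships with word vectors (for example a medium or large model) is probably a better choice than the small one.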



# NLP Tasks

What fascinates me is that spaCy has built-in features for core NLP tasks such as dependency parsing, named entity recognition (NER), part-of-speech (POS) tagging and identification of syntactic heads.

In [71]:
for token in tokens:
    print(f"Text: {token.text}, POS Tag: {token.pos_}, Dep label: {token.dep_}, Head token: {token.head.text}, Lemmatised token: {token.lemma_}")

Text: My, POS Tag: PRON, Dep label: poss, Head token: cat, Lemmatised token: my
Text: cat, POS Tag: NOUN, Dep label: nsubj, Head token: is, Lemmatised token: cat
Text: is, POS Tag: AUX, Dep label: ROOT, Head token: is, Lemmatised token: be
Text: like, POS Tag: ADP, Dep label: prep, Head token: is, Lemmatised token: like
Text: a, POS Tag: DET, Dep label: det, Head token: god, Lemmatised token: a
Text: god, POS Tag: NOUN, Dep label: pobj, Head token: like, Lemmatised token: god
Text: ., POS Tag: PUNCT, Dep label: punct, Head token: is, Lemmatised token: .
Text: Oh, POS Tag: INTJ, Dep label: poss, Head token: opps, Lemmatised token: oh
Text: opps, POS Tag: NOUN, Dep label: npadvmod, Head token: meant, Lemmatised token: opps
Text: I, POS Tag: PRON, Dep label: nsubj, Head token: meant, Lemmatised token: I
Text: meant, POS Tag: VERB, Dep label: ROOT, Head token: meant, Lemmatised token: mean
Text: dog, POS Tag: NOUN, Dep label: dobj, Head token: meant, Lemmatised token: dog
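
The tag and dependency labels in the output above are terse. `spacy.explain` looks up a short description for a label in spaCy's glossary and needs no trained model. The list of labels below is just a sample taken from that output.

```python
import spacy

labels = ["PRON", "AUX", "nsubj", "pobj", "dobj"]

# spacy.explain returns a short description, or None for unknown labels
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label}: {meaning}")
```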


In [72]:
# If the doc contains no entities, nothing is printed.
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

In [73]:
sample_NER = "We will need to destroy the Whiskey company else they get too big"
doc = nlp_en_core_web_sm(sample_NER)
tokens = [tok for tok in doc]
print(tokens)

[We, will, need, to, destroy, the, Whiskey, company, else, they, get, too, big]


In [74]:
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

Whiskey ORG
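
Each item in `doc.ents` is a `Span` with a label, so it also carries token and character offsets. The sketch below reproduces the entity above on a blank pipeline by assigning it by hand. That assignment is an assumption for illustration, not a model prediction.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("We will need to destroy the Whiskey company else they get too big")

# Hand-assigned entity: token 6 is "Whiskey", labelled ORG as in the output above
doc.ents = [Span(doc, 6, 7, label="ORG")]

for ent in doc.ents:
    # start_char / end_char index into the original text
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

entities = [(ent.text, ent.label_) for ent in doc.ents]
```

The character offsets are handy for highlighting entities in the raw string or exporting annotations.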


In [77]:
from spacy.matcher import Matcher

[From the spaCy course website](https://course.spacy.io/en/chapter1)

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use a model's predictions.

For example, find the word "duck" only if it's a verb, not a noun.

!!! Important: each dictionary in a pattern describes exactly one token
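
The "duck" example can be sketched as follows. To keep it runnable without a trained model, the POS tags are assigned by hand when building the `Doc`. This stands in for a model's predictions and is an assumption. The sentence and the pattern name `DUCK_VERB` are made up as well.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-assigned POS tags stand in for a trained model's predictions
words = ["The", "duck", "swims", "and", "I", "duck", "quickly"]
pos = ["DET", "NOUN", "VERB", "CCONJ", "PRON", "VERB", "ADV"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# One dictionary = one token: lowercase text "duck" AND POS tag VERB
matcher.add("DUCK_VERB", [[{"LOWER": "duck", "POS": "VERB"}]])

matches = [(start, end, doc[start:end].text) for _, start, end in matcher(doc)]
print(matches)
```

Only the second "duck" matches because both conditions in the dictionary must hold for the same token.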

In [115]:
sample_matcher_text = "I need two dozen eggs and maybe 1kg of minced beef. Do not beef me"

In [116]:
doc = nlp_en_core_web_sm(sample_matcher_text)

In [118]:
tokens = [tok for tok in doc]

for token in tokens:
    print(token.text, token.pos_)

I PRON
need VERB
two NUM
dozen NOUN
eggs NOUN
and CCONJ
maybe ADV
1 NUM
kg NOUN
of ADP
minced VERB
beef NOUN
. PUNCT
Do AUX
not PART
beef VERB
me PRON


In [154]:
# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp_en_core_web_sm.vocab)

# Let's try to catch the following pattern: a VERB followed by a NOUN or PRON
VERB_PATTERN = [{"POS":"VERB"}, {"POS":{"IN":["PRON", "NOUN"]}}]

In [155]:
# Add the pattern to the matcher
matcher.add("items_pattern", [VERB_PATTERN])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

# The matcher relies on the predicted POS tags: "minced beef" matched because
# "minced" was tagged VERB, although ADJ NOUN would be the more natural reading

Matches: ['minced beef', 'beef me']
