# Token

In [1]:
import spacy
nlp = spacy.load("en_core_web_md")

In [2]:
doc = nlp("I own a ginger cat.")
doc

I own a ginger cat.

In [3]:
[token.text for token in doc]

['I', 'own', 'a', 'ginger', 'cat', '.']

In [4]:
doc = nlp("It's been tough time.")
doc

It's been tough time.

In [5]:
[token.text for token in doc]

['It', "'s", 'been', 'tough', 'time', '.']

In [7]:
[token.orth_ for token in doc]

['It', "'s", 'been', 'tough', 'time', '.']

# Customizing the tokenizer

In [14]:
from spacy.symbols import ORTH
doc = nlp("lemme that")
[token.text for token in doc]

['lemme', 'that']

In [13]:
help(nlp.tokenizer.add_special_case)

Help on method add_special_case in module spacy.tokenizer:

add_special_case(string, substrings) method of spacy.tokenizer.Tokenizer instance
    Tokenizer.add_special_case(self, str string, substrings)
    Add a special-case tokenization rule.
    
            string (str): The string to specially tokenize.
            substrings (iterable): A sequence of dicts, where each dict describes
                a token and its attributes. The `ORTH` fields of the attributes
                must exactly match the string when they are concatenated.
    
            DOCS: https://spacy.io/api/tokenizer#add_special_case



In [15]:
special_case = [{ORTH:"lem"},{ORTH:"me"}]
nlp.tokenizer.add_special_case(string="lemme",
                               substrings=special_case)
doc = nlp("lemme that")
[token.text for token in doc]

['lem', 'me', 'that']
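The help text above says the `ORTH` values must concatenate to exactly the original string. Below is a minimal sketch of guarding that constraint before registering a rule. It assumes spaCy v3 and uses `spacy.blank("en")` so no model download is needed. `add_split_rule` and `nlp_blank` are hypothetical names introduced here for illustration, not part of spaCy.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: a blank English pipeline is enough here, since only the
# tokenizer is exercised.
nlp_blank = spacy.blank("en")


def add_split_rule(tokenizer, string, pieces):
    """Hypothetical helper: check the pieces, then register the special case."""
    # The ORTH values must join back to the original string exactly.
    if "".join(pieces) != string:
        raise ValueError(f"{pieces!r} does not concatenate to {string!r}")
    tokenizer.add_special_case(string, [{ORTH: piece} for piece in pieces])


add_split_rule(nlp_blank.tokenizer, "lemme", ["lem", "me"])
tokens = [token.text for token in nlp_blank("lemme that")]
print(tokens)
```

Checking the pieces up front keeps a mistyped rule from reaching the tokenizer at all.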

# Debugging the tokenizer

The tokenizer's `explain` method returns a list of `(pattern, token)` tuples, showing which rule or pattern produced each token.

In [18]:
tok_exp = nlp.tokenizer.explain("Let's go.")
for pat,tok in tok_exp:
    print(pat,"\t",tok)

SPECIAL-1 	 Let
SPECIAL-2 	 's
TOKEN 	 go
SUFFIX 	 .
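As a quick sanity check, the token strings reported by `explain` can be compared with what the tokenizer actually produces. This is a small sketch that assumes a blank English pipeline (`nlp_blank` is a name introduced here) so that it runs without a trained model.

```python
import spacy

# Assumption: spacy.blank("en") carries the English tokenizer rules,
# which is all this check needs.
nlp_blank = spacy.blank("en")
text = "Let's go."

# Keep only the token strings from the (pattern, token) pairs.
explained = [tok for _, tok in nlp_blank.tokenizer.explain(text)]
# Tokens from a normal tokenizer run on the same text.
actual = [token.text for token in nlp_blank(text)]

print(explained == actual)
```

If the two lists differ for some input, that usually points to a customized tokenizer whose behavior `explain` does not reproduce.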


# Sentence segmentation

In [19]:
sent = "off the keeper's gloves with Smith getting a thin edge as he went back into the crease to defend. Drew Smith forward with the length."
doc = nlp(sent)
for s in doc.sents:
    print(s)

off the keeper's gloves with Smith getting a thin edge as he went back into the crease to defend.
Drew Smith forward with the length.
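With a trained pipeline such as `en_core_web_md`, sentence boundaries in `doc.sents` normally come from the dependency parser. As a lighter alternative, the sketch below uses spaCy's rule-based `sentencizer` on a blank pipeline. It assumes spaCy v3 (where `add_pipe` takes a component name), and `nlp_rules` is a name introduced here for illustration.

```python
import spacy

# Assumption: spaCy v3, where built-in components are added by string name.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # splits on sentence-final punctuation

text = (
    "off the keeper's gloves with Smith getting a thin edge as he went "
    "back into the crease to defend. Drew Smith forward with the length."
)
sentences = [sent.text for sent in nlp_rules(text).sents]
for sentence in sentences:
    print(sentence)
```

The rule-based splitter is faster than the parser but only looks at punctuation, so it can mis-split text with abbreviations or missing full stops.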
