## Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a Unicode text, and the output is a Doc object.

In [1]:
import spacy

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
doc = nlp("Apple isn't looking at buyig U.K. startup for $1 billion")

In [8]:
for token in doc:
    print(token.text)

Apple
is
n't
looking
at
buyig
U.K.
startup
for
$
1
billion
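
To make the idea of rule-based splitting concrete, here is a toy regex tokenizer in plain Python. It is only a sketch and not spaCy's tokenizer: the name `toy_tokenize` and the pattern are assumptions made up for illustration, and they cover just the cases in the sentence above (abbreviations written with periods, the n't contraction, numbers, and single symbols).

```python
import re

# Alternatives are tried left to right at each position in the text.
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}      # abbreviations written with periods, e.g. U.K.
  | n't                     # the contraction suffix as its own token
  | [A-Za-z]+?(?=n't)       # the word stem that precedes n't (is, ca, do)
  | [A-Za-z]+               # ordinary words
  | \d+(?:\.\d+)?           # integers and simple decimals
  | \S                      # any other single non-space character
""", re.VERBOSE)


def toy_tokenize(text):
    """Split text into tokens using the illustrative pattern above."""
    return TOKEN_PATTERN.findall(text)


for tok in toy_tokenize("Apple isn't looking at buyig U.K. startup for $1 billion"):
    print(tok)
```

On this one sentence the sketch produces the same twelve tokens as the spaCy output above. On other text the two will diverge, because spaCy relies on language-specific exception lists and prefix, suffix, and infix rules rather than a single pattern.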


### Part-of-Speech (POS) Tagging

In [10]:
doc

Apple isn't looking at buyig U.K. startup for $1 billion

In [12]:
for token in doc:
    print(token.text, token.lemma_)

Apple Apple
is be
n't not
looking look
at at
buyig buyig
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion


In [14]:
for token in doc:
    print(f'{token.text:15} {token.lemma_:15} {token.pos_:10} {token.is_stop}')

Apple           Apple           PROPN      False
is              be              AUX        True
n't             not             PART       True
looking         look            VERB       False
at              at              ADP        True
buyig           buyig           NOUN       False
U.K.            U.K.            PROPN      False
startup         startup         NOUN       False
for             for             ADP        True
$               $               SYM        False
1               1               NUM        False
billion         billion         NUM        False


### Dependency Parsing

In [18]:
for chunk in doc.noun_chunks:
    print(f'{chunk.text:30} {chunk.root.text:15} {chunk.root.dep_}')

Apple                          Apple           nsubj
buyig U.K. startup             startup         pobj


### Named Entity Recognition

In [20]:
doc

Apple isn't looking at buyig U.K. startup for $1 billion

In [22]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Sentence Segmentation

In [24]:
doc

Apple isn't looking at buyig U.K. startup for $1 billion

In [26]:
for sent in doc.sents:
    print(sent)

Apple isn't looking at buyig U.K. startup for $1 billion


In [27]:
doc1 = nlp("Welcome to KGP Talkie. Thanks for watching. Please like and subscribe")

In [29]:
for sent in doc1.sents:
    print(sent)

Welcome to KGP Talkie.
Thanks for watching.
Please like and subscribe


In [31]:
doc1 = nlp("Welcome to.*.my channel.*.Thanks for watching")

In [33]:
for sent in doc1.sents:
    print(sent)

Welcome to.*.my channel.*.Thanks for watching


In [34]:
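def set_rule(doc):
    # Mark the token that follows each '...' as the start of a new sentence.
    # doc[:-1] skips the last token, so doc[token.i + 1] is always in range.
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc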

In [35]:
# nlp.remove_pipe('set_rule')

In [36]:
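# spaCy v2 call; in v3 use @Language.component('set_rule') and nlp.add_pipe('set_rule', before='parser')
nlp.add_pipe(set_rule, before='parser')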

In [37]:
text = 'Welcome to KGP Talkie...Thanks...Like and Subscribe!'
doc = nlp(text)

In [38]:
for sent in doc.sents:
    print(sent)

Welcome to KGP Talkie...
Thanks...
Like and Subscribe!


In [39]:
for token in doc:
    print(token.text)

Welcome
to
KGP
Talkie
...
Thanks
...
Like
and
Subscribe
!
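
The effect of the ellipsis rule can also be traced without spaCy. The sketch below applies the same idea (start a new sentence right after every '...' token) to a plain list of token strings. `group_after_ellipsis` is a hypothetical helper written for illustration, not part of spaCy, and it ignores any boundaries the parser would add on its own.

```python
def group_after_ellipsis(tokens, marker="..."):
    """Group tokens into sentences, closing a sentence after each marker token."""
    sentences = []
    current = []
    for tok in tokens:
        current.append(tok)
        if tok == marker:
            # The marker ends the current sentence; the next token starts a new one.
            sentences.append(current)
            current = []
    if current:
        # Keep whatever follows the last marker as the final sentence.
        sentences.append(current)
    return sentences


# Token list copied from the output above.
tokens = ["Welcome", "to", "KGP", "Talkie", "...", "Thanks", "...",
          "Like", "and", "Subscribe", "!"]

for sentence in group_after_ellipsis(tokens):
    print(sentence)
```

The three groups line up with the three sentences printed by `doc.sents` above, which suggests that for this text the custom rule alone accounts for the segmentation.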
