# Lesson 1: Tokenization

In [34]:
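import spacy
# Load the small English pipeline from a local directory.
# If the model is not installed, run: python -m spacy download en_core_web_sm
# (an installed model can then be loaded with spacy.load("en_core_web_sm")).
nlp = spacy.load("nlp_venv/en_core_web_sm-3.4.0/en_core_web_sm/en_core_web_sm-3.4.0/")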

In [32]:
# Tokenize a string
s = "Noah doesn't like to run when it rains."
doc = nlp(s)

In [33]:
# print tokens
print([token.text for token in doc])

['Noah', 'does', "n't", 'like', 'to', 'run', 'when', 'it', 'rains', '.']


Notes:

* "doesn't" gets split into two tokens: "does" and "n't".
* The full stop "." gets its own token, separate from "rains".
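
The two notes above can be checked without the trained model. The following is a minimal sketch, assuming a blank English pipeline (`spacy.blank("en")`), which carries only the tokenizer rules; the names `nlp_blank`, `doc_blank`, `text`, and `rebuilt` are illustrative and not part of the original notebook. It also shows that each token keeps its trailing whitespace, so the input string can be rebuilt from the tokens.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained components)
nlp_blank = spacy.blank("en")
text = "Noah doesn't like to run when it rains."
doc_blank = nlp_blank(text)

tokens = [t.text for t in doc_blank]
# text_with_ws is the token text plus any whitespace that followed it
rebuilt = "".join(t.text_with_ws for t in doc_blank)

print(tokens)
print(rebuilt == text)
```

Because the split is recorded rather than applied destructively, "does" and "n't" together still cover exactly the characters of "doesn't" in the original string.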

## Types and attributes

* The doc object is a container, which can be indexed and sliced like a list.

In [18]:
# Index and slice example
print(doc[0])
print(doc[0:3])

Noah
Noah doesn't


* Each entry in the doc object is a token object, and slicing a doc object returns a span object.

In [19]:
# Object types
print(type(doc))
print(type(doc[0]))
print(type(doc[0:3]))

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.span.Span'>
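
A span records where it sits inside its parent doc, and its length is counted in tokens rather than characters. The sketch below illustrates this, again assuming a blank English pipeline so that it runs without the trained model; `nlp_blank`, `doc_blank`, and `span` are illustrative names.

```python
import spacy
from spacy.tokens import Span, Token

# Assumption: a blank English pipeline (tokenizer only)
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Noah doesn't like to run when it rains.")

span = doc_blank[0:3]
print(span.start, span.end)        # token offsets of the slice within the doc
print(len(doc_blank), len(span))   # lengths are counted in tokens
print([t.text for t in span])      # iterating a span yields tokens
print(isinstance(span, Span), isinstance(span[0], Token))
```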


* Each token has several attributes, such as its language, length, and index. We will explore these in more detail in later notebooks, but here are some examples.

In [21]:
# Token attribute examples
print(doc[3].text)
print(doc[3].lang_)
print(len(doc[3]))

like
en
4


In [22]:
# Locate index of tokens
print([(t.text, t.i) for t in doc[:6]])

[('Noah', 0), ('does', 1), ("n't", 2), ('like', 3), ('to', 4), ('run', 5)]
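
Besides the token index `i`, each token also exposes its character offset in the original string (`idx`) and some simple boolean flags. The sketch below prints a few of these side by side, assuming a blank English pipeline; `nlp_blank`, `doc_blank`, and `text` are illustrative names.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only)
nlp_blank = spacy.blank("en")
text = "Noah doesn't like to run when it rains."
doc_blank = nlp_blank(text)

for t in doc_blank:
    # i: position among the tokens, idx: character offset in the original string
    print(t.i, t.idx, repr(t.text), len(t), t.is_alpha, t.is_punct)
```

Slicing the original string with `text[t.idx : t.idx + len(t)]` gives back the token text, which is useful when token positions need to be mapped back onto the raw input.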


## Tokenizing paragraphs

In [23]:
# Tokenize multiple sentences
s = "Hello there! General Kenobi. You are a bold one."
doc = nlp(s)

* We can iterate through the sentences using the .sents attribute.

In [25]:
# Print sentences
list(doc.sents)

[Hello there!, General Kenobi., You are a bold one.]

* Note that each sentence is a span object of the original document.

In [27]:
# Object type of the first sentence
type(list(doc.sents)[0])

spacy.tokens.span.Span
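
Since each sentence is a span, it carries the token offsets that locate it in the document. The sketch below shows this with a different segmenter from the one used in this notebook: it assumes a blank English pipeline with spaCy's rule-based `sentencizer` component added, which splits on sentence-final punctuation, whereas `en_core_web_sm` derives sentence boundaries from its trained components. The two can disagree on harder text. `nlp_blank`, `doc_blank`, and `sentences` are illustrative names.

```python
import spacy

# Assumption: blank English pipeline plus the rule-based sentencizer,
# swapped in here so the example runs without a trained model
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")
doc_blank = nlp_blank("Hello there! General Kenobi. You are a bold one.")

sentences = list(doc_blank.sents)
for sent in sentences:
    # start and end are token offsets into the parent doc
    print(sent.start, sent.end, sent.text)
```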

In [30]:
# Print the tokens of the whole document (all three sentences)
print([t.text for t in doc])

['Hello', 'there', '!', 'General', 'Kenobi', '.', 'You', 'are', 'a', 'bold', 'one', '.']


# End Notebook