# Tokenization

To understand a sentence, we humans read each word, work out its meaning, and then connect the words together. It's the same for a machine, so the first step of almost any NLP task is to split sentences into words.

![tokens](https://www.kdnuggets.com/wp-content/uploads/text-tokens-tokenization-manning.jpg)

Put like that it sounds simple, but you also have to split off punctuation, separate compound words (but not all of them), and so on.
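
To see why a plain whitespace split is not enough, here is a minimal sketch that uses only Python's standard library. The example sentence and the regular expression are assumptions made for illustration; this is not how spaCy tokenizes internally.

```python
import re

# Hypothetical example sentence (not taken from the spaCy example).
sentence = "Self-driving cars aren't science fiction anymore, are they?"

# Naive approach: split on whitespace only.
# Punctuation stays glued to the neighbouring word ("anymore,", "they?").
naive_tokens = sentence.split()
print(naive_tokens)

# Second attempt: runs of word characters, or single punctuation marks.
# Punctuation is now separated, but "aren't" is over-split into "aren", "'", "t".
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
```

Neither result is quite right: the first under-splits and the second over-splits. These are the kinds of edge cases a dedicated tokenizer is built to handle.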

Handling all of these cases by hand is time-consuming, which is why libraries provide a "tokenization" function.

Let's have a look at how [spaCy](https://spacy.io/) handles that.

## Installation

You will need to install spaCy; follow the instructions on [their website](https://spacy.io/).
You will also need to download the English model `en_core_web_sm`. To do that, run:
```shell
python -m spacy download en_core_web_sm
```

spaCy provides models for many languages. We won't use them now, but keep that in mind: it can be useful later!
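
Before going further, it can help to check from Python that both the library and the model are available. The sketch below uses only the standard library and relies on the fact that spaCy models are installed as regular Python packages. The helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the library and of the English model package.
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, go back to the installation steps above before running the rest of the notebook.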

## Tokenize the text

Now that you have installed spaCy, let's take a look at their basic example:

In [None]:
# Tokenize this document with spaCy:
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

import spacy

# Load the model
nlp = #TO COMPLETE

# Store the tokens in doc

doc = #TO COMPLETE


Perfect, our text is tokenized, and the resulting `doc` exposes a lot of interesting features. But first of all, let's see what our tokens look like:

In [None]:
for token in doc:
    print(token)

It should look something like this:
```
When
Sebastian
Thrun
started
working
on
self
-
driving
cars
at
Google
in
2007
,
few
people
outside
of
the
company
took
him
seriously
.
“
I
can
tell
you
very
senior
CEOs
of
major
American
car
companies
would
shake
my
hand
and
turn
away
because
I
was
n’t
worth
talking
to
,
”
said
Thrun
,
in
an
interview
with
Recode
earlier
this
week
.
```

In [4]:
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Define the text to be tokenized
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

# Process the text with SpaCy
doc = nlp(text)

# Print the tokens
for token in doc:
    print(token.text)


We can see that the punctuation marks and the hyphen `-` have been separated from the words they were attached to, and that the contraction `wasn’t` has been split into `was` and `n’t`.
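
Because punctuation ends up in its own token, it becomes easy to filter out. In spaCy, each token carries an `is_punct` flag for this purpose. The sketch below imitates the idea with a hypothetical helper, `is_punctuation`, built on the standard library so that it runs without a model. The token list is a slice copied from the output above.

```python
import unicodedata

# A slice of the tokens printed above.
tokens = ["working", "on", "self", "-", "driving", "cars", "at",
          "Google", "in", "2007", ",", "few", "people"]


def is_punctuation(token: str) -> bool:
    """Return True if every character of the token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


# Keep only the tokens that are not pure punctuation.
words = [t for t in tokens if not is_punctuation(t)]
print(words)
```

With a spaCy `doc`, the equivalent filter would be `[token.text for token in doc if not token.is_punct]`.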

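spaCy also runs a lot of other processing steps on the text. The next cell prints the results of several of them: part-of-speech tagging, named entity recognition, dependency parsing, lemmatization, and sentence boundary detection.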

In [5]:
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Define the text to be tokenized
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

# Process the text with SpaCy
doc = nlp(text)

### Tokenization ###
print("### Tokenization ###")
for token in doc:
    print(token.text)

### Part-of-Speech Tagging ###
print("\n### Part-of-Speech Tagging ###")
for token in doc:
    print(token.text, token.pos_)

### Named Entity Recognition (NER) ###
print("\n### Named Entity Recognition (NER) ###")
for ent in doc.ents:
    print(ent.text, ent.label_)

### Dependency Parsing ###
print("\n### Dependency Parsing ###")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_)

### Lemmatization ###
print("\n### Lemmatization ###")
for token in doc:
    print(token.text, token.lemma_)

### Sentence Boundary Detection ###
print("\n### Sentence Boundary Detection ###")
for sent in doc.sents:
    print(sent.text)

### Tokenization with Spaces ###
print("\n### Tokenization with Spaces ###")
for token in doc:
    print(token.text_with_ws)


### Tokenization ###
When
Sebastian
Thrun
started
working
on
self
-
driving
cars
at
Google
in
2007
,
few
people
outside
of
the
company
took
him
seriously
.
“
I
can
tell
you
very
senior
CEOs
of
major
American
car
companies
would
shake
my
hand
and
turn
away
because
I
was
n’t
worth
talking
to
,
”
said
Thrun
,
in
an
interview
with
Recode
earlier
this
week
.

### Part-of-Speech Tagging ###
When SCONJ
Sebastian ADJ
Thrun PROPN
started VERB
working VERB
on ADP
self NOUN
- PUNCT
driving VERB
cars NOUN
at ADP
Google PROPN
in ADP
2007 NUM
, PUNCT
few ADJ
people NOUN
outside ADP
of ADP
the DET
company NOUN
took VERB
him PRON
seriously ADV
. PUNCT
“ PUNCT
I PRON
can AUX
tell VERB
you PRON
very ADV
senior ADJ
CEOs NOUN
of ADP
major ADJ
American ADJ
car NOUN
companies NOUN
would AUX
shake VERB
my PRON
hand NOUN
and CCONJ
turn VERB
away ADV
because SCONJ
I PRON
was AUX
n’t PART
worth ADJ
talking VERB
to ADP
, PUNCT
” PUNCT
said VERB
Thrun PROPN
, PUNCT
in ADP
an DET
interview NOUN
with ADP
Recode PRO

## Tokenize into sentences using NLTK

In [1]:
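# sent_tokenize and word_tokenize rely on NLTK's Punkt data. If the calls below
# raise a LookupError, run the nltk.download(...) command shown in the error message.
import nltk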


In [2]:
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Mary had a little lamb. Her fleece was white as snow"

# Split the text into sentences
sents = sent_tokenize(text)
print(sents)

['Mary had a little lamb.', 'Her fleece was white as snow']


In [3]:
# Split each sentence into words
words = [word_tokenize(sent) for sent in sents]
print(words)


[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['Her', 'fleece', 'was', 'white', 'as', 'snow']]
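
For comparison, here is a rough sketch of a naive sentence splitter written with only the standard library. The regular expression, the helper name `naive_sent_split`, and the example sentences are assumptions made for illustration; this is not how `sent_tokenize` works internally.

```python
import re


def naive_sent_split(text: str) -> list:
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Works on simple text.
simple = naive_sent_split("The lamb was small. Its fleece was white as snow")
print(simple)

# Breaks on abbreviations: "Dr." is wrongly treated as a sentence end.
tricky = naive_sent_split("Dr. Smith bought a lamb. It was white.")
print(tricky)
```

The second call returns three pieces instead of two, because the naive rule cannot tell an abbreviation from the end of a sentence. Cases like this are one reason to prefer a dedicated sentence tokenizer over a hand-written rule.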
