# Tokenization

To understand a sentence, we humans read each word, work out its meaning, and then connect the words together. A machine does much the same, so the first step of almost any NLP task is to split sentences into words.

![tokens](https://www.kdnuggets.com/wp-content/uploads/text-tokens-tokenization-manning.jpg)

It sounds simple when put like that, but you also have to split off punctuation, break up compound words (but not all of them), and so on.
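
To make the difficulty concrete, here is a minimal sketch using only Python's standard library. The example sentence and the regular expression are illustrative assumptions, not how any particular library tokenizes text. It compares plain whitespace splitting with a simple pattern that treats every punctuation mark as its own token.

```python
import re

# Illustrative sentence (an assumption made for this sketch).
sentence = "Thrun worked on self-driving cars, but few people took him seriously."

# Naive approach: split on whitespace only.
# Punctuation stays glued to the neighbouring word ("cars,", "seriously.").
naive_tokens = sentence.split()

# Simple rule: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(naive_tokens)
print(regex_tokens)
```

The regular expression separates the commas and the final period, but it also blindly splits `self-driving` into three tokens and would break abbreviations such as `U.S.` apart. Deciding when such splits are wanted is exactly the part that takes a lot of rules.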

Doing all of this by hand is time-consuming, which is why people wrote ready-made "tokenization" functions.

Let's have a look at how [spaCy](https://spacy.io/) handles that.

## Installation

You will need to install spaCy; follow the instructions on [their website](https://spacy.io/).
You will also need to download their English model `en_core_web_sm`. To do that, run:
```shell
python -m spacy download en_core_web_sm
```
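
If you want to confirm that both pieces are in place before going further, a small check like the following can help. It is a sketch that relies only on the standard library; `is_installed` is a hypothetical helper written for this example, and it assumes the downloaded model is importable as a regular Python package under the name `en_core_web_sm`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# Report whether spaCy and the English model are available.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, rerun the corresponding installation step in the same Python environment that your notebook uses.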

spaCy provides models for many languages. We won't use them now, but keep that in mind: it can be useful later!

## Tokenize the text

Now that you have installed spaCy, let's take a look at their basic example:

```python
import spacy

# Tokenize this document with spaCy:
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Run the pipeline on the text; the result is a tokenized document
doc = nlp(text)
```

Our text is now tokenized, and the resulting `doc` gives us access to a lot of interesting features. But first of all, let's see what our tokens look like:

```python
for token in doc:
    print(token)
```


It should look something like this:
```
When
Sebastian
Thrun
started
working
on
self
-
driving
cars
at
Google
in
2007
,
few
people
outside
of
the
company
took
him
seriously
.
“
I
can
tell
you
very
senior
CEOs
of
major
American
car
companies
would
shake
my
hand
and
turn
away
because
I
was
n’t
worth
talking
to
,
”
said
Thrun
,
in
an
interview
with
Recode
earlier
this
week
.
```

We can see that the punctuation marks and the hyphen `-` have been separated from the words they were attached to. The contraction `wasn’t` has also been split into `was` and `n’t`.
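
Each token is more than a string, so the same observation can be checked in code. The sketch below makes two assumptions: it uses a shortened sentence with a straight apostrophe, and it uses `spacy.blank("en")`, which creates a pipeline containing only the English tokenizer and so needs no downloaded model. The tokenizer rules should match those of `en_core_web_sm`, but treat that as an assumption.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for this check.
nlp = spacy.blank("en")
doc = nlp("I wasn't worth talking to, said Thrun.")

for token in doc:
    # token.idx is the character offset of the token in the original string;
    # token.is_punct tells us whether the token is a punctuation mark.
    print(f"{token.idx:>3}  {token.text!r:<10} punctuation={token.is_punct}")

# Collect the token texts and the punctuation tokens for inspection.
tokens = [token.text for token in doc]
punctuation = [token.text for token in doc if token.is_punct]
```

Because every token keeps its character offset, no information is lost by the split: you can always map a token back to its position in the original text.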

spaCy also applies a lot of other preprocessing steps that we will see later.

```python
help(spacy)
```

Help on package spacy:

NAME
    spacy

PACKAGE CONTENTS
    __main__
    about
    attrs
    cli (package)
    compat
    displacy (package)
    errors
    git_info
    glossary
    kb (package)
    lang (package)
    language
    lexeme
    lookups
    matcher (package)
    ml (package)
    morphology
    parts_of_speech
    pipe_analysis
    pipeline (package)
    schemas
    scorer
    strings
    symbols
    tests (package)
    tokenizer
    tokens (package)
    training (package)
    ty
    util
    vectors
    vocab

FUNCTIONS
    blank(name: str, *, vocab: Union[spacy.vocab.Vocab, bool] = True, config: Union[Dict[str, Any], confection.Config] = {}, meta: Dict[str, Any] = {}) -> spacy.language.Language
        Create a blank nlp object for a given language code.
        
        name (str): The language code, e.g. "en".
        vocab (Vocab): A Vocab object. If True, a vocab is created.
        config (Dict[str, Any] / Config): Optional config overrides.
        meta (Dict[str, 