# Tokenization

To understand a sentence, we humans read each word, work out its meaning, and then connect the words together. A machine does much the same, so the first step of almost any NLP task is to slice sentences into words.

![tokens](https://www.kdnuggets.com/wp-content/uploads/text-tokens-tokenization-manning.jpg)

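Put like that it sounds simple, but you also have to split off punctuation and break up some compound words (but not all of them), among other cases.

Doing this by hand is time-consuming, which is why NLP libraries provide a tokenization function: it splits a text into *tokens* (words, punctuation marks, and so on).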
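
To see why splitting on spaces is not enough, here is a minimal sketch that uses only the Python standard library. The sentence and the regular expression are made up for illustration; the pattern is a toy rule, not the one spaCy uses.

```python
import re

sentence = "He wasn't taken seriously, at first."

# Naive approach: split on whitespace only.
# Punctuation stays glued to the neighbouring word ("seriously,", "first.").
naive_tokens = sentence.split()
print(naive_tokens)

# Toy rule (an assumption for this sketch): a token is either a word,
# optionally with internal apostrophes, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sentence)
print(regex_tokens)
```

The regular expression already separates the comma and the final period, but it keeps `wasn't` in one piece and knows nothing about abbreviations, URLs or other special cases. A real tokenizer relies on language-specific rules to handle those.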

Let's have a look at how [spaCy](https://spacy.io/) handles that.

## Installation

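You will need to install spaCy; see the instructions on [their website](https://spacy.io/). The examples below also use the `langdetect` package, which can be installed with `pip install langdetect`.
You will also need to download the small English and French models, `en_core_web_sm` and `fr_core_news_sm`. To do that you can type:

```shell
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
```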

## Tokenize the text

Now that spaCy is installed, let's take a look at their basic example, extended here with a French text:


In [40]:
import spacy
from langdetect import detect

In [None]:
# Load the English pipeline (tokenizer, tagger, parser and NER)
EN_nlp = spacy.load("en_core_web_sm")
# Load the French pipeline (tokenizer, tagger, parser and NER)
FR_nlp = spacy.load("fr_core_news_sm")


In [None]:
fr_text = (
    "En 2015, Clara Dupont a décidé de se lancer dans la recherche sur l'intelligence artificielle, "
    "un domaine alors dominé par quelques grandes entreprises technologiques. "
    "« À l’époque, très peu de chercheurs en France s’intéressaient aux réseaux neuronaux profonds », "
    "explique-t-elle. « Beaucoup pensaient que ces modèles étaient trop complexes pour être utiles en pratique. » "
    "Aujourd’hui, ses travaux sont reconnus internationalement et contribuent à des avancées majeures dans le domaine."
)


In [None]:
en_text = (
    "When Sebastian Thrun started working on self-driving cars at "
    "Google in 2007, few people outside of the company took him "
    "seriously. “I can tell you very senior CEOs of major American "
    "car companies would shake my hand and turn away because I wasn’t "
    "worth talking to,” said Thrun, in an interview with Recode earlier "
    "this week."
)

## Detecting the language with langdetect


In [25]:
print(detect(fr_text))
print(detect(en_text))

fr
en


In [None]:
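def load_text_autodetect(text: str):
    # Detect once: langdetect can give different answers on repeated calls.
    is_english = detect(text) == "en"
    print("EN detected" if is_english else "FR detected")
    # Anything not detected as English is treated as French.
    return EN_nlp(text) if is_english else FR_nlp(text)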


In [33]:
doc = load_text_autodetect(fr_text)

FR detected


In [37]:
doc = load_text_autodetect(en_text)

EN detected
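
The function above only knows two languages and treats everything that is not English as French. One way to make the choice explicit is a lookup table keyed by language code. The sketch below illustrates that idea with plain functions: `fake_en_pipeline`, `fake_fr_pipeline`, `PIPELINES` and `pick_pipeline` are hypothetical names standing in for `EN_nlp`, `FR_nlp` and the dispatch logic, so the snippet runs without any model installed.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for loaded pipelines such as EN_nlp and FR_nlp.
def fake_en_pipeline(text: str) -> str:
    return f"EN:{text}"


def fake_fr_pipeline(text: str) -> str:
    return f"FR:{text}"


# Map each supported language code to the pipeline that handles it.
PIPELINES: Dict[str, Callable[[str], str]] = {
    "en": fake_en_pipeline,
    "fr": fake_fr_pipeline,
}


def pick_pipeline(lang_code: str, default: str = "en") -> Callable[[str], str]:
    """Return the pipeline registered for lang_code, or the default one."""
    return PIPELINES.get(lang_code, PIPELINES[default])


french_result = pick_pipeline("fr")("Bonjour")
fallback_result = pick_pipeline("de")("Hallo")  # "de" is not registered
print(french_result)
print(fallback_result)
```

Falling back to English for unknown codes is a design choice made for this sketch; raising an error instead may be safer when a wrong pipeline would silently give poor results.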


## Extracting noun phrases and verbs


In [None]:
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])


Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun', 'an interview', 'Recode']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']


## Named entities


In [None]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun GPE
Recode ORG
earlier this week DATE
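
A common next step is to group the entities by label. The sketch below does this with the standard library only; the `entities` list is copied by hand from the output above, whereas with spaCy the same pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output above.
entities = [
    ("Sebastian Thrun", "PERSON"),
    ("Google", "ORG"),
    ("2007", "DATE"),
    ("American", "NORP"),
    ("Thrun", "GPE"),
    ("Recode", "ORG"),
    ("earlier this week", "DATE"),
]

# Collect the entity texts under their label, keeping the original order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```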


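Note that the model labels the second `Thrun` as `GPE` (a geopolitical entity) instead of `PERSON`: small statistical models make mistakes like this.

So far we have only used the tokens indirectly, through noun phrases, lemmas and entities. Let's see what the tokens themselves look like: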


In [7]:
for token in doc:
    print(token)


It should look something like this:

```
When
Sebastian
Thrun
started
working
on
self
-
driving
cars
at
Google
in
2007
,
few
people
outside
of
the
company
took
him
seriously
.
“
I
can
tell
you
very
senior
CEOs
of
major
American
car
companies
would
shake
my
hand
and
turn
away
because
I
was
n’t
worth
talking
to
,
”
said
Thrun
,
in
an
interview
with
Recode
earlier
this
week
.
```


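We can see that punctuation marks and the hyphen in `self-driving` have been split off from the words they were attached to, and that the contraction `wasn’t` has become the two tokens `was` and `n’t`.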
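
Because punctuation marks are tokens of their own, they are easy to filter out. spaCy tokens expose an `is_punct` attribute for this. The sketch below shows the same idea with the standard library only, on a handful of token strings copied from the output above; the helper `is_punctuation` is a hypothetical name introduced for this example.

```python
import unicodedata

# A few token strings copied from the output above.
tokens = ["self", "-", "driving", "cars", "at", "Google", "in", "2007", ",", "was", "n’t", "."]


def is_punctuation(token: str) -> bool:
    """True when every character is in a Unicode punctuation category (P*)."""
    return all(unicodedata.category(char).startswith("P") for char in token)


# Keep only the tokens that are not pure punctuation.
words = [token for token in tokens if not is_punctuation(token)]
print(words)
```

The token `n’t` is kept because it contains letters, even though it also contains an apostrophe.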

spaCy also applies a lot of other preprocessing steps that we will see later.
