# Tokenization

### Articles to read:
- [6 ways to perform tokenization](https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/)
- [All about tokenization](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/)

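Libraries such as NLTK provide ready-made tokenizers, but first let's see how far plain Python string methods and regular expressions take us.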

In [None]:
# sentence
s = "hi, how are you doing today?"

In [None]:
# split sentence by space
s.split(" ")

['hi,', 'how', 'are', 'you', 'doing', 'today?']

In [None]:
# split on any run of whitespace, so extra spaces never become empty tokens
s.split()

['hi,', 'how', 'are', 'you', 'doing', 'today?']
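
On this sentence both calls return the same list, so the difference is easy to miss. A small sketch with a hypothetical string `spaced` (not part of the original notebook), which contains a double space, shows where they diverge:

```python
# Hypothetical example string with two spaces after the comma
spaced = "hi,  how are you"

# split(" ") cuts at every single space, so the double space yields an empty token
by_space = spaced.split(" ")

# split() with no argument cuts at runs of whitespace and drops empty strings
by_whitespace = spaced.split()

print(by_space)       # ['hi,', '', 'how', 'are', 'you']
print(by_whitespace)  # ['hi,', 'how', 'are', 'you']
```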

In [None]:
import re
# replace every non-word character (punctuation, symbols) with a space, then split on whitespace
re.sub(r"[^\w]", " ", s).split()

['hi', 'how', 'are', 'you', 'doing', 'today']

Now we get only the words: every punctuation mark was replaced by a space, and `split()` then dropped the whitespace.
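
One caveat of this pattern is that an apostrophe counts as punctuation too, so contractions get split in two. The sketch below uses a hypothetical sentence `t`; the alternative pattern is just one possible fix chosen for illustration, not something from the articles above:

```python
import re

# Hypothetical sentence containing a contraction
t = "I don't know, do you?"

# The pattern used above treats the apostrophe as punctuation
split_contraction = re.sub(r"[^\w]", " ", t).split()

# One possible alternative: match a word, optionally followed by an apostrophe suffix
keep_contraction = re.findall(r"\w+(?:'\w+)?", t)

print(split_contraction)  # ['I', 'don', 't', 'know', 'do', 'you']
print(keep_contraction)   # ['I', "don't", 'know', 'do', 'you']
```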

### Tokenization using NLTK

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize import sent_tokenize
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:

word_tokenize(s)

['hi', ',', 'how', 'are', 'you', 'doing', 'today', '?']

In [None]:
wordpunct_tokenize(s)

['hi', ',', 'how', 'are', 'you', 'doing', 'today', '?']
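
Both tokenizers agree on this sentence. `wordpunct_tokenize` is purely regex-based: NLTK documents it as using the pattern `\w+|[^\w\s]+`. The sketch below reproduces that pattern with `re` alone, so it runs without NLTK; the helper name `regex_wordpunct` and the second test string are hypothetical. `word_tokenize` follows different rules and may return different tokens for inputs like the second one.

```python
import re

def regex_wordpunct(text):
    # Either a run of word characters, or a run of characters
    # that are neither word characters nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)

same_as_above = regex_wordpunct("hi, how are you doing today?")
tricky = regex_wordpunct("Don't pay $3.50!")

print(same_as_above)  # ['hi', ',', 'how', 'are', 'you', 'doing', 'today', '?']
print(tricky)         # ['Don', "'", 't', 'pay', '$', '3', '.', '50', '!']
```

Note how the contraction and the price are broken into several pieces, which is worth keeping in mind when choosing between the two tokenizers.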

In [None]:
# tokenizing by sentence
s = "hi, how are you doing today? I am fine"
sent_tokenize(s)

['hi, how are you doing today?', 'I am fine']
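
To see why a dedicated sentence tokenizer is useful, here is a naive splitter for comparison. It is only a sketch: the function name `naive_sent_split` and the second example are hypothetical, and the rule is simply "split on whitespace that follows `.`, `!` or `?`".

```python
import re

def naive_sent_split(text):
    # Split on whitespace that is preceded by sentence-ending punctuation
    return re.split(r"(?<=[.!?])\s+", text)

simple = naive_sent_split("hi, how are you doing today? I am fine")
with_abbrev = naive_sent_split("Dr. Smith arrived. He sat down.")

print(simple)       # ['hi, how are you doing today?', 'I am fine']
print(with_abbrev)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```

The naive rule works on the simple sentence but wrongly splits after the abbreviation "Dr.", which is the kind of case a sentence tokenizer needs to handle.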

### Tokenization using spaCy

#### Word Tokenization

In [None]:
from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer (no tagger, parser or NER)
nlp = English()

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

# nlp object is used to create documents with linguistic annotations
my_doc = nlp(text)

# create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
token_list

['Founded',
 'in',
 '2002',
 ',',
 'SpaceX',
 '’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi',
 '-',
 'planet',
 '\n',
 'species',
 'by',
 'building',
 'a',
 'self',
 '-',
 'sustaining',
 'city',
 'on',
 'Mars',
 '.',
 'In',
 '2008',
 ',',
 'SpaceX',
 '’s',
 'Falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 '\n',
 'liquid',
 '-',
 'fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'Earth',
 '.']

#### Sentence Tokenization

In [None]:
from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer (no tagger, parser or NER)
nlp = English()

# add the rule-based 'sentencizer' component to the pipeline
# (spaCy v3 syntax; in v2 this was nlp.add_pipe(nlp.create_pipe('sentencizer')))
nlp.add_pipe('sentencizer')

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

# nlp object is used to create documents with linguistic annotations
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent)
print(sents_list)
print(len(sents_list))

[Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars., In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth.]
2


More techniques will be added here as I find them.