###Tokenization

In this section we look at how to represent words in a form that a computer can process, with a view to later training a model that can work with their meaning. Tokenization is the process of splitting a string of text into a list of smaller units called tokens.

So let's consider the word 'SILENT'.

It's made up of a sequence of letters. These letters can be represented by numbers using an encoding scheme. A popular one, ASCII, assigns a number to each letter.

In [1]:
#So each letter from the word SILENT can be represented in numbers as follows:

print(ord('S'))     
print(ord('I'))
print(ord('L'))
print(ord('E'))
print(ord('N'))
print(ord('T'))

"""So the above bunch of numbers collectively represent a word SILENT"""

83
73
76
69
78
84


'So the above bunch of numbers collectively represent a word SILENT'

**But the word "LISTEN" has the same letters, and therefore the same numbers, just in a different order. So it is hard to work out the sentiment of a word from the letters in it alone.**

In [2]:
#Both words use the same letters and numbers, only in a different order:
print('SILENT:', "S:",ord('S'),"I:",ord('I'),"L:",ord('L'),"E:",ord('E'),"N:",ord('N'),"T:",ord('T')) 
print('LISTEN:', "L:",ord('L'),"I:",ord('I'),"S:",ord('S'),"T:",ord('T'),"E:",ord('E'),"N:",ord('N')) 

SILENT: S: 83 I: 73 L: 76 E: 69 N: 78 T: 84
LISTEN: L: 76 I: 73 S: 83 T: 84 E: 69 N: 78
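
To make the point concrete, here is a small sketch in plain Python (the helper name `code_points` is only for illustration). It compares the two words by their character codes: the numbers are identical once sorted, and only the order differs.

```python
def code_points(word):
    """Return the character code of each letter in the word, in order."""
    return [ord(ch) for ch in word]

silent = code_points("SILENT")
listen = code_points("LISTEN")

print(silent)
print(listen)

# Same set of numbers once sorted, but the sequences themselves differ.
print(sorted(silent) == sorted(listen))
print(silent == listen)
```

So a letter-level encoding tells the two words apart only through ordering, which is part of why encoding whole words is attractive.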


**So it might be easier to encode words than to encode letters.**

In [3]:
#Let's consider the sentence 
"I love playing cricket."

'I love playing cricket.'

**So what will happen if we start encoding the words in the sentence instead of encoding the letters in each word?**
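
One simple way to picture word-level encoding is to give every distinct word its own integer. The sketch below is only an illustration under simplifying assumptions: it lowercases, splits on whitespace, and assigns IDs in order of first appearance. The name `build_word_index` is hypothetical, and this is not how spaCy stores words internally.

```python
def build_word_index(sentences):
    """Map each distinct (lowercased) word to an integer ID, starting at 1."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

sentences = ["I love playing cricket", "I love playing football"]
word_index = build_word_index(sentences)

# Encode each sentence as a list of word IDs.
encoded = [[word_index[w] for w in s.lower().split()] for s in sentences]

print(word_index)
print(encoded)
```

The two encoded sentences share their first three IDs and differ only in the last one, which reflects how similar the sentences are.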

In [4]:
#Let's import relevant libraries and packages needed 
import spacy

In [5]:
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_sm")

In [6]:
#Let's create a text

txt = 'I love playing cricket'
print(txt)

text2 = nlp("I love playing football")
print(text2)

I love playing cricket
I love playing football


In [7]:
#Let's create a document of the text and convert the sentence into tokens
doc= nlp(txt)

In [8]:
for token in doc:
  print(token.text)
#check the no. of tokens in the sentence
print(len(doc))

I
love
playing
cricket
4


In [9]:
##Let's convert the sentence into tokens
doc2 = nlp("We're here to help you, Send us queries on john01@hotmail.com email id or visit us on https://www.w3schools.com!" )

In [10]:
for token in doc2:
  print(token.text)

We
're
here
to
help
you
,
Send
us
queries
on
john01@hotmail.com
email
i
d
or
visit
us
on
https://www.w3schools.com
!
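
The output shows that spaCy keeps the email address and the URL as single tokens while splitting off the trailing punctuation. One way to inspect this is through the lexical flags on each token. The sketch below assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline, so that it runs without `en_core_web_sm`. The sentence is a shortened version of the one above, and the variable names are illustrative.

```python
import spacy

# Tokenizer-only English pipeline (an assumption made to avoid a model download).
nlp_blank = spacy.blank("en")
doc_flags = nlp_blank(
    "Send us queries on john01@hotmail.com or visit us on https://www.w3schools.com!"
)

# Print a few lexical flags for each token.
for token in doc_flags:
    print(f"{token.text:28} email={token.like_email} url={token.like_url} punct={token.is_punct}")

emails = [t.text for t in doc_flags if t.like_email]
urls = [t.text for t in doc_flags if t.like_url]
print(emails)
print(urls)
```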


In [11]:
#Let's look at examples that contain special characters and see how they are tokenized
doc3 = nlp("A 10km ride in Hyderabad costs Rs.100")
doc4 = nlp("A 10km ride in Hyderabad costs $100")

In [12]:
for t in doc3:
  print(t)
print('\n')

for t in doc4:
  print(t)

A
10
km
ride
in
Hyderabad
costs
Rs.100


A
10
km
ride
in
Hyderabad
costs
$
100


In [13]:
#Let's check the no. of tokens in each sentence
print(len(doc))
print(len(doc2))
print(len(doc3))
print(len(doc4))

4
21
8
9


In [14]:
#Let's check the no. of entries in the vocab (shared by every doc from the same pipeline, so the counts match)
print(len(doc.vocab))
print(len(doc2.vocab))
print(len(doc3.vocab))
print(len(doc4.vocab))

505
505
505
505
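
All four numbers are the same because every `Doc` produced by one pipeline points at the same `Vocab` object, so `len(doc.vocab)` reports the number of lexemes currently stored there rather than the size of the English language. The sketch below illustrates this, assuming spaCy is installed. It uses `spacy.blank("en")` to avoid needing a downloaded model, so the counts will differ from those above.

```python
import spacy

# Tokenizer-only English pipeline (an assumption; the notebook uses en_core_web_sm).
nlp_blank = spacy.blank("en")

doc_a = nlp_blank("I love playing cricket")
size_before = len(nlp_blank.vocab)

doc_b = nlp_blank("A 10km ride in Hyderabad costs $100")
size_after = len(nlp_blank.vocab)

# Both docs reference the very same Vocab object.
print(doc_a.vocab is doc_b.vocab)
# The stored-lexeme count may grow as new words are seen.
print(size_before, size_after)
```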


###Tokenization using Indexing

In [15]:
doc5 = nlp(u"Let's import the packages")

In [16]:
#Using indexing 
print(doc5[0])

#Using slice indexing
print(doc5[0:4])
print(doc5[1:4])
print(doc5[2:5])

Let
Let's import the
's import the
import the packages


In [17]:
doc5[0] = "It's" #Raises a TypeError: a Doc object does not support item assignment

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
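
Because a `Doc` does not allow item assignment, one common way to change the text is to build a new string and run it through the pipeline again. A minimal sketch, assuming spaCy is installed and again using the tokenizer-only `spacy.blank("en")` pipeline in place of `en_core_web_sm`:

```python
import spacy

nlp_blank = spacy.blank("en")
doc_old = nlp_blank("Let's import the packages")

error_message = None
try:
    doc_old[0] = "It's"  # not allowed: Doc has no item assignment
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Edit the underlying text instead, then tokenize the new string.
new_text = doc_old.text.replace("Let's", "It's", 1)
doc_new = nlp_blank(new_text)
print([token.text for token in doc_new])
```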

In [18]:
doc6 = nlp(u"TCS has commissioned its new campus in Pune at a cost of $10 million.")

In [19]:
#Let's tokenize
for token in doc6:
  print(token.text, end=' | ')

TCS | has | commissioned | its | new | campus | in | Pune | at | a | cost | of | $ | 10 | million | . | 

In [20]:
#The sentence contains a few named entities, so we can use NER to list them
for entity in doc6.ents:
  print(entity)
  print(entity.label_)
  print(str(spacy.explain(entity.label_)))         #Describes what the entity label means
  print('\n')

TCS
ORG
Companies, agencies, institutions, etc.


Pune
GPE
Countries, cities, states


$10 million
MONEY
Monetary values, including unit


