# Word Tokenize

Tokenization is essentially splitting a phrase, sentence, paragraph, 
or an entire text document into smaller units, such as individual words or terms. 
Each of these smaller units is called a token.

# Word Tokenization no. 1

In [35]:
# Word Tokenization

text = """Founded in 2002, SpaceX’s mission spacefaring civilization."""

# Split on whitespace (spaces, tabs, newlines); punctuation stays attached to the words
text.split()

['Founded',
 'in',
 '2002,',
 'SpaceX’s',
 'mission',
 'spacefaring',
 'civilization.']
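In the output above, punctuation stays glued to the words ('2002,' and 'civilization.'). One way to separate it is a small regular-expression tokenizer. This is only a minimal sketch: the function name `simple_tokenize` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits or underscores (a word or number);
    # [^\w\s] matches one character that is neither a word character nor whitespace,
    # so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Founded in 2002, SpaceX builds rockets."))
```

Compared with `str.split()`, the comma and the full stop now come out as separate tokens, which usually makes later counting steps cleaner.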

# Word Tokenization no. 2


In [37]:

from keras.preprocessing.text import text_to_word_sequence

text = """
species by building  the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

# tokenize
result = text_to_word_sequence(text)
result

['species',
 'by',
 'building',
 'the',
 'first',
 'privately',
 'developed',
 'liquid',
 'fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'earth']
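The output above is lowercased and 'liquid-fuel' is split in two because `text_to_word_sequence` lowercases the text and filters out punctuation before splitting. The sketch below imitates that behaviour in plain Python. It is an approximation, not the Keras code: the `FILTERS` string is assumed to match the documented default, and `to_word_sequence` is a hypothetical name.

```python
# Assumed to mirror the documented default filter characters (punctuation, tab, newline)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character
    table = str.maketrans({ch: split for ch in filters})
    # Split and drop the empty strings left by repeated separators
    return [tok for tok in text.translate(table).split(split) if tok]

text = """
species by building  the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

print(to_word_sequence(text))
```

Because the hyphen is in the filter string, hyphenated words are split. Passing a custom `filters` value without the hyphen would keep 'liquid-fuel' as one token.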

# Word Tokenization no. 3

In [38]:
from nltk.tokenize import word_tokenize
# Create text

string = "The science of today is the technology of tomorrow"

# Tokenize words
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

# Sentence Tokenize

# Sentence Tokenization no. 1

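This is similar to word tokenization, except that the units are sentences rather than words. A sentence usually ends with a full stop (.), so we can use a full stop followed by a space ('. ') as a separator to break the string. Note that the separator is dropped from every sentence except the last: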


In [40]:
#Sentence Tokenization

text = """Founded in 2002 civilization and a multi-planet. Species by building a self-sustaining
city on Mars. The first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

# Split at '. ' (a full stop followed by a space)
text.split('. ')

['Founded in 2002 civilization and a multi-planet',
 'Species by building a self-sustaining\ncity on Mars',
 'The first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']
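Splitting on '. ' drops the full stops and ignores sentences that end in '?' or '!'. A slightly more robust sketch uses a regular expression with a lookbehind, so the terminal punctuation is kept. The function name `split_sentences` and the sample text are assumptions for illustration. Abbreviations such as "Dr." would still be split incorrectly.

```python
import re

def split_sentences(text):
    # Split on whitespace that directly follows '.', '!' or '?'.
    # The lookbehind (?<=...) checks for the punctuation without consuming it,
    # so each sentence keeps its own terminator.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Rockets are reusable now. Is that new? Yes! It saves money."))
```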

# Sentence Tokenization no. 2

In [41]:
# Load library
from nltk.tokenize import sent_tokenize

# Create text

string = """The science of today is the technology of tomorrow. 
Tomorrow is today. Today is very important for me."""

# Tokenize sentences
sent_tokenize(string)


['The science of today is the technology of tomorrow.',
 'Tomorrow is today.',
 'Today is very important for me.']

# N-grams

In [14]:
from nltk import ngrams
 
sentence = 'I like dancing in the rain'
 
ngram = ngrams(sentence.split(' '), n=2)
 
for x in ngram:
    print(x)

('I', 'like')
('like', 'dancing')
('dancing', 'in')
('in', 'the')
('the', 'rain')
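The same bigrams can be produced without NLTK by zipping the token list against shifted copies of itself. This is a minimal sketch, and `make_ngrams` is a hypothetical helper name.

```python
def make_ngrams(tokens, n):
    # tokens[i:] is the token list shifted left by i positions;
    # zipping n shifted copies yields every window of n consecutive tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = 'I like dancing in the rain'.split(' ')

for gram in make_ngrams(tokens, 2):
    print(gram)

# Trigrams work the same way with n=3
print(make_ngrams(tokens, 3))
```

Since `zip` stops at the shortest input, the windows never run past the end of the sentence.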


# TF-IDF

Term Frequency, Inverse Document Frequency (TF-IDF):
-
TF-IDF is one of the most widely used ways to represent documents as feature vectors.

TF-IDF measures how important a particular word is with respect to a document and the entire corpus.

Term Frequency:

Term frequency is the number of times a word appears in a document, divided by the total number of words in that document.

TF(w) = (number of times word w appears in a document) / (total number of words in the document)

For example, if we want to find the TF of the word cat which occurs 50 times in a document of 1000 words, then 

TF(cat) = 50 / 1000 = 0.05
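The TF formula can be applied to a tokenized document with a counter. This is a small sketch on a made-up six-word document; the function name `term_frequency` and the sample text are illustrative assumptions.

```python
from collections import Counter

def term_frequency(tokens):
    # Count each token, then divide by the document length
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tokens = "the cat sat on the mat".split()
tf = term_frequency(tokens)

print(tf["the"])  # 'the' is 2 of the 6 tokens
print(tf["cat"])  # 'cat' is 1 of the 6 tokens
```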

Inverse Document Frequency:

IDF measures how informative a word is across the whole corpus: a word that appears in many documents gets a low score, and a word that appears in only a few gets a high score.

IDF(w) = log10(total number of documents / number of documents with w in it)

For example, if the word cat occurs in 100 documents out of 3000, then the IDF (using a base-10 logarithm) is calculated as

IDF(cat) = log10(3000 / 100) = log10(30) ≈ 1.48

Finally, to calculate TF-IDF, we multiply these two factors, TF and IDF.

TF-IDF(w) = TF(w) x IDF(w)

TF-IDF(cat) = 0.05 x 1.48 ≈ 0.074
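The worked cat example can be checked numerically. This sketch assumes a base-10 logarithm, which is what the figures above imply. The helper names `tf` and `idf` are illustrative.

```python
import math

def tf(term_count, total_words):
    # Share of the document taken up by the term
    return term_count / total_words

def idf(total_docs, docs_with_term):
    # Base-10 log of the inverse document fraction
    return math.log10(total_docs / docs_with_term)

tf_cat = tf(50, 1000)
idf_cat = idf(3000, 100)
tfidf_cat = tf_cat * idf_cat

print(f"TF={tf_cat:.2f}  IDF={idf_cat:.4f}  TF-IDF={tfidf_cat:.4f}")
```

Carried to four decimal places, the IDF is about 1.4771 and the product about 0.0739, which agrees with the hand calculation up to rounding.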

Let’s do some coding.

We’ll use the TfidfVectorizer from scikit-learn for vectorizing the documents.

In [42]:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = ['Cats have four legs',
          'Cats and dogs are antagonistic',
          'He hate dogs']

tfidf = TfidfVectorizer()
vect = tfidf.fit_transform(corpus)

df = pd.DataFrame()
df['vocabulary'] = tfidf.get_feature_names_out()
df['sentence1'] = vect.toarray()[0]
df['sentence2'] = vect.toarray()[1]
df['sentence3'] = vect.toarray()[2]
df.set_index('vocabulary', inplace=True)
print(df.T)


vocabulary       and  antagonistic       are      cats      dogs      four  \
sentence1   0.000000      0.000000  0.000000  0.402040  0.000000  0.528635   
sentence2   0.490479      0.490479  0.490479  0.373022  0.373022  0.000000   
sentence3   0.000000      0.000000  0.000000  0.000000  0.473630  0.000000   

vocabulary      hate      have        he      legs  
sentence1   0.000000  0.528635  0.000000  0.528635  
sentence2   0.000000  0.000000  0.000000  0.000000  
sentence3   0.622766  0.000000  0.622766  0.000000  
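The numbers in this table do not follow directly from the plain TF x IDF formula, because `TfidfVectorizer` by default uses raw counts for TF, a smoothed natural-log IDF, ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below recomputes the rows by hand under those defaults. Tokenization is simplified to lowercasing and splitting on spaces, which is an assumption that happens to be sufficient for this corpus, and `tfidf_row` is a hypothetical helper name.

```python
import math
from collections import Counter

corpus = ['Cats have four legs',
          'Cats and dogs are antagonistic',
          'He hate dogs']

# Simplified tokenization (assumption): lowercase, split on whitespace
docs = [doc.lower().split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    counts = Counter(doc)
    # Raw count times smoothed IDF
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    # L2-normalize so the row has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for name, doc in zip(['sentence1', 'sentence2', 'sentence3'], docs):
    row = tfidf_row(doc)
    print(name, {t: f"{w:.6f}" for t, w in sorted(row.items())})
```

In each row, 'cats' and 'dogs' receive lower weights than the other words because they occur in two of the three documents, which is the IDF effect described earlier.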


