## Tokenization 

#### Hemant Thapa

In natural language processing (NLP), tokenization is the foundational step of breaking text into smaller units called tokens. Tokens are often words, but they can also be subwords, characters, or n-grams (sequences of n consecutive words).

##### Purpose of Tokenization

Tokenization is a preprocessing step for most NLP tasks. It simplifies text analysis by reducing raw text to a sequence of manageable units.

It also gives a consistent way to handle complications in the text such as different word forms, punctuation, and special characters.

##### Types of Tokenization

- Word Tokenization: Breaks text into individual words. It's the most common form of tokenization and is usually the first step in text analysis.

- Character Tokenization: Breaks text down into its individual characters. This can be useful for languages without clear word boundaries, or when the analysis needs to look at text structure at the character level.

- Subword Tokenization: Splits words into smaller meaningful units (subwords). This is useful for handling rare words, or in languages where words are often compounds of smaller units.

- Sentence Tokenization: Involves breaking text into individual sentences. This is useful in tasks that require understanding the context at the sentence level.
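The four types above can be sketched with the Python standard library alone. This is a minimal illustration, not a production tokenizer: the regular expressions are simplifying assumptions, and `split_subwords` with its hand-picked vocabulary is a hypothetical helper. Real subword tokenizers (for example BPE or WordPiece) learn their vocabulary from data.

```python
import re

text = "I love to read books. Do you love reading books!"

# Word tokenization: runs of letters, digits, or apostrophes
words = re.findall(r"[A-Za-z0-9']+", text)

# Character tokenization: every character, including spaces and punctuation
chars = list(text)

# Sentence tokenization: naive split on whitespace that follows ., ! or ?
sents = re.split(r"(?<=[.!?])\s+", text)


def split_subwords(word, vocab):
    """Greedy longest-match segmentation against a toy vocabulary (hypothetical helper)."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first, then shorter ones
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matched: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces


toy_vocab = {"read", "ing", "book", "s"}  # assumed vocabulary, for illustration only
subwords = split_subwords("reading", toy_vocab)

print(words)
print(chars[:6])
print(sents)
print(subwords)
```

The naive sentence split will break on abbreviations such as "Dr. Smith", which is one reason dedicated sentence tokenizers exist.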

##### Challenges in Tokenization

- Language Variations: Different languages have different syntactic and grammatical rules, making tokenization a language-dependent task.

- Handling Special Cases: Words with apostrophes, hyphens, or special characters can be challenging. For example, deciding whether "don't" should be one token or two ("do" and "n't") depends on the analysis goals.

- Contextual Meaning: Tokenization does not consider what a word means in context, which can be crucial for certain types of analysis. For example, "bank" becomes the same token whether it refers to a river bank or a financial institution.
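The special-case challenge can be made concrete with two regular expressions that tokenize the same sentence differently. Both patterns are illustrative assumptions rather than a standard; which one is appropriate depends on the analysis goals.

```python
import re

text = "I don't like well-known spoilers"

# Option 1: keep apostrophes and hyphens inside a single token
whole = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

# Option 2: split the "n't" clitic off its verb; hyphenated words also split
parts = re.findall(r"[A-Za-z]+(?=n't)|n't|[A-Za-z]+", text)

print(whole)  # ['I', "don't", 'like', 'well-known', 'spoilers']
print(parts)  # ['I', 'do', "n't", 'like', 'well', 'known', 'spoilers']
```

The first option keeps the vocabulary closer to surface forms, while the second exposes the negation as its own token.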


In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
# create a list of strings
sentences = [
    'I love to read books',
    'I love to travel around world',
    'Do you love reading books!'  # adding an exclamation mark
]

In [3]:
print(sentences)

['I love to read books', 'I love to travel around world', 'Do you love reading books!']


In [4]:
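# create the tokenizer; num_words caps how many of the most frequent words are kept when texts are converted to sequences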
tokenizer = Tokenizer(num_words=100)

In [5]:
# build the vocabulary (word counts and indices) from the sentences
tokenizer.fit_on_texts(sentences)

In [6]:
# get the word-to-index dictionary
word_index = tokenizer.word_index

In [7]:
# text is lowercased and punctuation is filtered out, so 'books!' is indexed as 'books'
print(word_index)

{'love': 1, 'i': 2, 'to': 3, 'books': 4, 'read': 5, 'travel': 6, 'around': 7, 'world': 8, 'do': 9, 'you': 10, 'reading': 11}
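To see why the dictionary above comes out in that order, here is a rough pure-Python approximation of the behaviour observed: lowercase the text, replace punctuation with spaces, count words, and rank them by frequency. It is a sketch for illustration, not the Keras implementation; the set of stripped characters (all of `string.punctuation` except the apostrophe) is an assumption, and `encode` is a hypothetical helper.

```python
import string
from collections import Counter

sentences = [
    'I love to read books',
    'I love to travel around world',
    'Do you love reading books!',
]

# Assumed filter set: standard punctuation except the apostrophe
strip_chars = string.punctuation.replace("'", "")
table = str.maketrans(strip_chars, " " * len(strip_chars))


def split_words(sentence):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return sentence.lower().translate(table).split()


# Count every word across all sentences
counts = Counter()
for sentence in sentences:
    counts.update(split_words(sentence))

# Most frequent word gets index 1; ties keep first-seen order
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(sentence):
    """Map a sentence to a list of indices, skipping unknown words (hypothetical helper)."""
    return [word_index[w] for w in split_words(sentence) if w in word_index]


print(word_index)
print(encode('I love to read books'))  # [2, 1, 3, 5, 4]
```

"love" appears three times and so receives index 1; "i", "to", and "books" each appear twice and follow in the order they were first seen.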
