# Language Preprocessing with Tokenization and Sequencing

## Overview

This notebook walks through the core steps of language preprocessing: tokenization, sequencing, padding, and vocabulary indexing. It splits text into sentences and words, converts the text into integer sequences, pads those sequences to a uniform length, and builds a vocabulary index. Example code and explanations accompany each step.

# ============================
# Table of Contents
# ============================

1. [Introduction](#Introduction)
2. [Text Tokenization](#Text-Tokenization)
    - Sentence Tokenization
    - Word Tokenization
3. [Sequencing and Padding](#Sequencing-and-Padding)
4. [Vocabulary and Word Index](#Vocabulary-and-Word-Index)

# ============================
# 1. Introduction
# ============================

## Introduction

This notebook covers the essential steps of text preprocessing in NLP: tokenization, sequencing, padding, and vocabulary indexing. These steps 
prepare raw text for machine learning and deep learning models, particularly neural networks used for text 
classification, sentiment analysis, and other NLP tasks.


In [2]:
# Import necessary libraries for text preprocessing and tokenization
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

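# Download the NLTK Punkt tokenizer data (needed only once per environment).
# Note: recent NLTK releases may look for 'punkt_tab' instead; if sent_tokenize
# raises a LookupError, run nltk.download('punkt_tab') as well.
nltk.download('punkt')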

[nltk_data] Downloading package punkt to /voc/work/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# ============================
# 2. Text Tokenization
# ============================


## Text Tokenization

Tokenization is the process of splitting text into smaller units, called tokens, which can be words, sentences, or even characters. For NLP tasks, 
sentence and word tokenization are commonly used to process text and understand its structure. We will use NLTK for both sentence and word 
tokenization.

In [4]:
# Define a sample text for tokenization
text = "Natural Language Processing is fascinating. We will learn about tokenization today!"

# ----------------------------------------
# 2.1 Sentence Tokenization
# ----------------------------------------

### Sentence Tokenization

Sentence tokenization is the process of splitting a block of text into individual sentences. This technique is a fundamental step in natural language processing (NLP) and is crucial for various applications that require sentence-level analysis or understanding.

#### Importance of Sentence Tokenization

- **Understanding Structure**: By dividing text into sentences, we can better understand the structure and flow of information. This is particularly important in documents where the meaning may change depending on the sentence context.
- **Text Analysis**: Many NLP tasks, such as sentiment analysis, require understanding the sentiment expressed in individual sentences. Sentence tokenization allows models to process each sentence independently.
- **Facilitating Summarization**: In text summarization, breaking down text into sentences enables algorithms to identify the most important sentences and condense information effectively.
- **Enhancing Machine Learning Models**: Many models and pipelines consume text one sentence at a time, so splitting documents into sentences is a common first step when preparing training data.

#### Use Cases

- **Sentiment Analysis**: Understanding the sentiment expressed in customer reviews, where each sentence can convey a different sentiment.
- **Information Retrieval**: Extracting specific sentences that answer user queries from larger documents.
- **Chatbots**: Parsing user input to understand commands or questions at the sentence level.

Below is an image that visually represents the sentence tokenization process, illustrating how a paragraph of text is broken down into individual sentences:

![Sentence tokenization diagram](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/NLP_November/Lesson%205/sentence%20tokenization.png)

In [5]:
# Sentence tokenization - splitting text into sentences
sentences = sent_tokenize(text)
print("Sentence Tokenization:", sentences)

Sentence Tokenization: ['Natural Language Processing is fascinating.', 'We will learn about tokenization today!']
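
To make the splitting rule concrete, here is a minimal sketch of a naive sentence splitter built only on the standard library. The helper `naive_sent_tokenize` is hypothetical and is not how NLTK works internally: it simply splits after `.`, `!`, or `?` when whitespace follows. The second call shows where such a rule breaks down, which is the kind of case NLTK's pre-trained Punkt model is designed to handle better.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sample_text = "Natural Language Processing is fascinating. We will learn about tokenization today!"
print(naive_sent_tokenize(sample_text))
# ['Natural Language Processing is fascinating.', 'We will learn about tokenization today!']

# The naive rule also splits after abbreviations, which is usually not wanted.
print(naive_sent_tokenize("Dr. Smith teaches NLP. Classes start today."))
# ['Dr.', 'Smith teaches NLP.', 'Classes start today.']
```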


# ----------------------------------------
# 2.2 Word Tokenization
# ----------------------------------------


### Word Tokenization

Word tokenization splits each sentence into words, allowing us to analyze text at the word level. This is especially useful for models 
that rely on word-level inputs, like bag-of-words or word embeddings.

![Word tokenization diagram](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/NLP_November/Lesson%205/word%20tokenization.png)


In [6]:
# Word tokenization - splitting each sentence into words
words = [word_tokenize(sentence) for sentence in sentences]
print("Word Tokenization:", words)

Word Tokenization: [['Natural', 'Language', 'Processing', 'is', 'fascinating', '.'], ['We', 'will', 'learn', 'about', 'tokenization', 'today', '!']]
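
The output above treats punctuation marks as separate tokens. A rough approximation of that behaviour, assuming a simple regular expression is good enough, is sketched below. The helper `naive_word_tokenize` is hypothetical; NLTK's `word_tokenize` applies more elaborate rules, for example around contractions, so the two will disagree on some inputs.

```python
import re

def naive_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample_sentences = [
    "Natural Language Processing is fascinating.",
    "We will learn about tokenization today!",
]
sample_words = [naive_word_tokenize(s) for s in sample_sentences]
print(sample_words)
# [['Natural', 'Language', 'Processing', 'is', 'fascinating', '.'],
#  ['We', 'will', 'learn', 'about', 'tokenization', 'today', '!']]

# A contraction shows the limits of the regex approach.
print(naive_word_tokenize("don't"))
# ['don', "'", 't']
```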


# ============================
# 3. Sequencing and Padding
# ============================

## Sequencing and Padding

Once tokenization is complete, the next step is to convert the tokens (words) into numerical sequences. Machine learning models operate on numbers rather than raw text, making it essential to convert words into numerical representations. 

### What is Padding?

Padding is the process of adding placeholder tokens (usually the value 0) to sequences so that they all have the same length. Deep learning architectures such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are typically trained on batches in which every input sequence has the same length, and padding provides that consistency.

### Why Padding is Necessary

- **Uniform Input Size**: Neural networks require inputs of the same shape. Padding ensures that all sequences are the same length, enabling batch processing.
- **Efficient Training**: Sequences of equal length can be stacked into a single array and processed together as a batch, which makes better use of hardware than handling sequences one at a time.
- **Preventing Information Loss**: The alternative way to reach a uniform size is to truncate longer sequences, which discards tokens. Padding shorter sequences up to the length of the longest one keeps every token.

### Types of Padding

1. **Pre-padding**: Adding tokens to the beginning of sequences.
2. **Post-padding**: Adding tokens to the end of sequences.
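
The two padding directions can be sketched in plain Python. The helper `pad` below is hypothetical and only illustrates the idea: it pads every sequence with zeros up to the length of the longest one, either at the end (`post`) or at the start (`pre`).

```python
def pad(sequences, padding="post", value=0):
    """Pad every sequence to the length of the longest one (hypothetical helper)."""
    target = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (target - len(seq))
        padded.append((seq + fill) if padding == "post" else (fill + seq))
    return padded

example = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(pad(example, padding="post"))  # [[1, 2, 3, 0, 0], [4, 5, 6, 7, 8]]
print(pad(example, padding="pre"))   # [[0, 0, 1, 2, 3], [4, 5, 6, 7, 8]]
```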

### Example of Padding

Consider the following sentences that have been tokenized into numerical sequences:

- Sentence 1: "I love NLP" -> [1, 2, 3] 
- Sentence 2: "Deep learning is truly fascinating" -> [4, 5, 6, 7, 8]

**Before Padding:**
- Sequence lengths:
  - Sentence 1: 3
  - Sentence 2: 5

To create uniform length sequences of 5, we can apply **post-padding**:

**After Post-padding:**
- Sentence 1: [1, 2, 3, 0, 0] (padded with two zeros)
- Sentence 2: [4, 5, 6, 7, 8] (remains unchanged)

### Example of Truncating

In some cases, we might encounter sequences that are too long. To handle these, truncating may be necessary, which involves cutting off excess tokens to fit a specified length.

For example, if we want to limit our sequences to a maximum length of 4:

- Sentence 2: "Deep learning is truly fascinating" -> [4, 5, 6, 7, 8]

**Before Truncating:**
- Length: 5

**After Truncating to 4:**
- Sentence 2: [4, 5, 6, 7] (the last token "8" is removed)
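
Truncation, like padding, can be applied at either end. The sketch below uses a hypothetical helper `truncate` to show both options. The example above corresponds to `post` truncation; note that Keras `pad_sequences` defaults to `truncating='pre'`, which drops tokens from the start of the sequence instead.

```python
def truncate(seq, maxlen, truncating="post"):
    """Cut a sequence down to maxlen tokens (hypothetical helper)."""
    if len(seq) <= maxlen:
        return list(seq)
    if truncating == "post":
        return seq[:maxlen]            # drop tokens from the end
    return seq[len(seq) - maxlen:]     # drop tokens from the start

long_seq = [4, 5, 6, 7, 8]
print(truncate(long_seq, 4, truncating="post"))  # [4, 5, 6, 7]
print(truncate(long_seq, 4, truncating="pre"))   # [5, 6, 7, 8]
```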

### Summary

Padding and truncating are standard preprocessing steps that give every sequence the same length, which batch-based models require. Choosing the target length is a trade-off: a larger value preserves more tokens from long sequences but adds more padding to short ones.


In [7]:
# Sample corpus of text data
corpus = [
    "I love machine learning.",
    "NLP is fun and exciting.",
    "Deep learning applications are vast."
]

# Initialize the Tokenizer and fit on text
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")  # Using an out-of-vocabulary (OOV) token for words not seen in training
tokenizer.fit_on_texts(corpus)

# ----------------------------------------
# 3.1 Tokenization to Sequences
# ----------------------------------------


### Tokenization to Sequences

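Here, we use the `texts_to_sequences` method to convert the corpus into sequences of integers, where each integer is the ID of a word in the tokenizer's word index. By default, the Keras `Tokenizer` lowercases the text and strips punctuation, so the trailing periods do not appear in the sequences.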

In [8]:
# Tokenizing and converting text to sequences
sequences = tokenizer.texts_to_sequences(corpus)
print("Tokenized Sequences:", sequences)

Tokenized Sequences: [[3, 4, 5, 2], [6, 7, 8, 9, 10], [11, 2, 12, 13, 14]]
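
The IDs above follow a frequency ranking: ID 1 goes to the OOV token, and the remaining words are numbered from most to least frequent. The following is a minimal pure-Python sketch of that idea, not the Keras implementation. The helpers `simple_tokens`, `fit_word_index`, and `to_sequences` are hypothetical, and the regex-based tokenization is a simplification of what the Keras `Tokenizer` does.

```python
import re
from collections import Counter

def simple_tokens(text):
    """Lowercase the text and keep runs of letters and digits (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_word_index(texts, oov_token="<OOV>"):
    """Build a word -> ID mapping: OOV gets 1, then words by descending frequency."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    index = {oov_token: 1}
    for word in ranked:
        index[word] = len(index) + 1
    return index

def to_sequences(texts, index, oov_token="<OOV>"):
    """Replace each word with its ID, falling back to the OOV ID for unknown words."""
    oov_id = index[oov_token]
    return [[index.get(word, oov_id) for word in simple_tokens(text)] for text in texts]

demo_corpus = [
    "I love machine learning.",
    "NLP is fun and exciting.",
    "Deep learning applications are vast.",
]
demo_index = fit_word_index(demo_corpus)
print(to_sequences(demo_corpus, demo_index))
# [[3, 4, 5, 2], [6, 7, 8, 9, 10], [11, 2, 12, 13, 14]]
```

Because "learning" appears twice in the corpus, it is ranked first among the real words and receives ID 2.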


# ----------------------------------------
# 3.2 Padding Sequences
# ----------------------------------------


### Padding Sequences

Padding is added to ensure all sequences have the same length. This is necessary for feeding data into deep learning models that expect uniform 
input shapes. We pass `padding='post'` to add zeros at the end of sequences; the default in `pad_sequences` is `pre`, which adds them at the start.

In [9]:
# Define the max length for padding sequences (uniform length)
max_length = 5

# Pad sequences to ensure uniform input shape
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
print("Padded Sequences:\n", padded_sequences)

Padded Sequences:
 [[ 3  4  5  2  0]
 [ 6  7  8  9 10]
 [11  2 12 13 14]]
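
The value of `max_length` controls how much padding is added and how many tokens are cut. As a rough aid for choosing it, the sketch below counts both quantities for a few candidate lengths, using the sequence lengths from this example. The helper `padding_and_truncation` is hypothetical.

```python
demo_sequences = [[3, 4, 5, 2], [6, 7, 8, 9, 10], [11, 2, 12, 13, 14]]
lengths = [len(seq) for seq in demo_sequences]  # [4, 5, 5]

def padding_and_truncation(lengths, maxlen):
    """Return (padding tokens added, real tokens removed) for a given maxlen."""
    pad_tokens = sum(max(0, maxlen - n) for n in lengths)
    cut_tokens = sum(max(0, n - maxlen) for n in lengths)
    return pad_tokens, cut_tokens

for maxlen in (4, 5, 8):
    print(maxlen, padding_and_truncation(lengths, maxlen))
# 4 (0, 2)
# 5 (1, 0)
# 8 (10, 0)
```

Here the longest sequence has 5 tokens, so `max_length = 5` loses nothing and adds a single padding token.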


# ============================
# 4. Vocabulary and Word Index
# ============================


## Vocabulary and Word Index

When the tokenizer is fitted on the corpus, it creates a word index that maps each unique word to a unique integer ID. ID 1 is assigned to the OOV token, 
and the remaining words are numbered from most to least frequent. This index forms the vocabulary of our text data and is used to convert new text into sequences that our model can understand.

In [10]:
# View the word index created by the tokenizer
word_index = tokenizer.word_index
print("Word Index:\n", word_index)

# Total vocabulary size: the word index already contains the OOV token (ID 1),
# so the +1 accounts for index 0, which is reserved for padding
vocab_size = len(word_index) + 1
print("Vocabulary Size:", vocab_size)

Word Index:
 {'<OOV>': 1, 'learning': 2, 'i': 3, 'love': 4, 'machine': 5, 'nlp': 6, 'is': 7, 'fun': 8, 'and': 9, 'exciting': 10, 'deep': 11, 'applications': 12, 'are': 13, 'vast': 14}
Vocabulary Size: 15
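
The word index can be used in both directions. The sketch below copies the dictionary printed above into a plain Python variable (named `demo_word_index` here for illustration) and uses it to decode a padded sequence back into words and to encode a new sentence, with unseen words mapped to the OOV ID. Splitting on whitespace is a simplification of the tokenizer's own preprocessing.

```python
demo_word_index = {
    '<OOV>': 1, 'learning': 2, 'i': 3, 'love': 4, 'machine': 5, 'nlp': 6,
    'is': 7, 'fun': 8, 'and': 9, 'exciting': 10, 'deep': 11,
    'applications': 12, 'are': 13, 'vast': 14,
}

# Reverse mapping: ID -> word
index_to_word = {idx: word for word, idx in demo_word_index.items()}

def decode(seq):
    """Turn a sequence of IDs back into text, skipping the padding value 0."""
    return " ".join(index_to_word[idx] for idx in seq if idx != 0)

print(decode([3, 4, 5, 2, 0]))  # i love machine learning

# Encode new text; "reinforcement" was never seen, so it maps to the OOV ID 1.
new_text = "I love deep reinforcement learning"
oov_id = demo_word_index['<OOV>']
encoded = [demo_word_index.get(word, oov_id) for word in new_text.lower().split()]
print(encoded)  # [3, 4, 11, 1, 2]
```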
