# Tokenization

Breaking down text into smaller units such as words, phrases, or sentences.

Tokenization is the process of breaking down a text into smaller units called tokens. 

These tokens could be words, subwords, characters, or even phrases, depending on the specific tokenization technique used. 

Tokenization is a fundamental step in natural language processing (NLP) tasks as it helps in preparing the text data for further analysis.

## Tokenization techniques


Each tokenization technique serves different purposes based on the requirements of the NLP task at hand. For instance, word tokenization is commonly used for tasks like text classification, while sentence tokenization is useful for tasks like machine translation or summarization. Subword tokenization, on the other hand, is beneficial for handling out-of-vocabulary words and morphologically rich languages.


Here are some common tokenization techniques with examples:

### Word Tokenization:

Example: "Tokenization is an important NLP task."

Tokens: ["Tokenization", "is", "an", "important", "NLP", "task", "."]

### Sentence Tokenization:

Example: "Tokenization is an important NLP task. It breaks down text into smaller units."

Tokens: ["Tokenization is an important NLP task.", "It breaks down text into smaller units."]

### Whitespace Tokenization:

Example: "Tokenization separates words by spaces."

Tokens: ["Tokenization", "separates", "words", "by", "spaces."]


### Character Tokenization:

Example: "Tokenization"

Tokens: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]

### Subword Tokenization (e.g., Byte-Pair Encoding, WordPiece):

Example: "Tokenization"

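Tokens: ["To", "ken", "iz", "at", "ion"]

This split is illustrative. The actual pieces depend on the subword vocabulary learned from the training corpus, so different models segment the same word differently.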
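As a rough sketch of how a learned subword vocabulary is applied, the snippet below splits a word by greedy longest-match-first lookup, which is the style of lookup WordPiece tokenizers use. The function name `subword_tokenize` and the tiny `toy_vocab` are assumptions made for illustration; the sketch does not show how the vocabulary itself is learned.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of a single word (WordPiece-style)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: fall back to an unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary; a real one is learned from a training corpus.
toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##related"}

print(subword_tokenize("tokenization", toy_vocab))
print(subword_tokenize("tokens", toy_vocab))
print(subword_tokenize("xyz", toy_vocab))
```

A word the vocabulary cannot cover is mapped to a single unknown token, which is why real vocabularies usually include every individual character as a fallback piece.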

### Phrasal Tokenization:

Example: "Natural language processing"

Tokens: ["Natural language processing"]

### Customized Tokenization:

Example: "email@example.com"

Tokens: ["email", "@", "example", ".", "com"]
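Several of the techniques above can be approximated with the Python standard library alone. The following is a minimal sketch, assuming a simple regular expression (runs of word characters, or single punctuation marks) is an acceptable stand-in for a real word tokenizer; the variable names are chosen for illustration.

```python
import re

text = "Tokenization is an important NLP task."

# Whitespace tokenization: punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Word tokenization: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokenization: every character of the word becomes a token.
char_tokens = list("Tokenization")

# Customized tokenization: the same pattern splits an e-mail address
# into its name, "@", domain, ".", and suffix.
email_tokens = re.findall(r"\w+|[^\w\s]", "email@example.com")

print(whitespace_tokens)
print(word_tokens)
print(char_tokens)
print(email_tokens)
```

The only difference between the first two results is the final token: whitespace splitting yields "task." while the regular expression separates "task" from ".".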

## Tokenization with the Natural Language Toolkit (NLTK)

### Word Tokenization with NLTK:

In [None]:
#pip install nltk

In [5]:
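import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer models used by word_tokenize.
# Recent NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')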



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HariharanSivakumar\AppData\Roaming\nltk_data.
[nltk_data]     ..
[nltk_data]   Package punkt is already up-to-date!


In [6]:

text = "Tokenization is an important NLP task."
tokens = word_tokenize(text)
print("Tokenized Data:")
print(tokens)

Tokenized Data:
['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.']


### Sentence Tokenization with NLTK

In [24]:
from nltk.tokenize import sent_tokenize

text = "Tokenization is an important NLP task. It breaks down text into smaller units."
sentences = sent_tokenize(text)
print(sentences)


['Tokenization is an important NLP task.', 'It breaks down text into smaller units.']


# Stop Words Removal

Stop words are common words in a language that are often filtered out before or after processing text because they contribute little to its meaning. Examples of stop words in English include "the", "is", "and", and "of".

In [27]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopwords corpus if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')

# Example text
text = "This is an example sentence demonstrating the removal of stop words."

# Tokenize the text
tokens = word_tokenize(text)

# Get English stop words
stop_words = set(stopwords.words('english'))

# Remove stop words from the tokenized text
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Join the filtered tokens back into a sentence
filtered_sentence = ' '.join(filtered_tokens)




In [28]:

print("Original Text:")
print(text)
print("\nText after Stop Words Removal:")
print(filtered_sentence)


Original Text:
This is an example sentence demonstrating the removal of stop words.

Text after Stop Words Removal:
example sentence demonstrating removal stop words .


This code tokenizes the input text, removes every token found in NLTK's English stop word list (compared in lowercase), and joins the remaining tokens back into a sentence. Punctuation is not in the stop word list, so the final "." survives as its own token.
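If punctuation should be dropped along with the stop words, the filter can also require at least one letter or digit in each token. The sketch below works on an already tokenized sentence and uses a small hand-picked `STOP_WORDS` set as an assumption; NLTK's English list is much longer.

```python
# Small hand-picked stop list (assumption); NLTK's English list is much longer.
STOP_WORDS = {"this", "is", "an", "the", "of", "a", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and tokens that contain no letters or digits."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words and any(ch.isalnum() for ch in tok)
    ]


tokens = ["This", "is", "an", "example", "sentence", "demonstrating",
          "the", "removal", "of", "stop", "words", "."]
print(remove_stop_words(tokens))
```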

## Stopwords List

In [29]:
import nltk
from nltk.corpus import stopwords

# Download the stopwords corpus if not already downloaded
nltk.download('stopwords')

# Get the English stop words list
english_stopwords = stopwords.words('english')

# Print the list of English stop words
print("NLTK English Stop Words List:")
print(english_stopwords)


NLTK English Stop Words List:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 



# Stemming 

Stemming is the process of reducing words to their root or base form, also known as the stem.

In practice this means stripping affixes with simple rules (the Porter stemmer used below removes suffixes only). The resulting stem may not be a valid word itself, but it still represents the core meaning.

Stemming is commonly used in natural language processing (NLP) and information retrieval tasks to normalize words and reduce them to their common base form.

In [30]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt')


In [31]:


# Example text
text = "Stemming is a process of reducing words to their base form. It helps in normalization of text data."

# Tokenize the text
tokens = word_tokenize(text)

# Initialize the Porter stemmer
stemmer = PorterStemmer()

# Apply stemming to each token
stemmed_tokens = [stemmer.stem(token) for token in tokens]

# Join the stemmed tokens back into a sentence
stemmed_text = ' '.join(stemmed_tokens)

print("Original Text:")
print(text)
print("\nText after Stemming:")
print(stemmed_text)


Original Text:
Stemming is a process of reducing words to their base form. It helps in normalization of text data.

Text after Stemming:
stem is a process of reduc word to their base form . it help in normal of text data .


Here, the Porter stemming algorithm from NLTK is applied to each token, and the stemmed tokens are joined back into a sentence. The output shows typical stemmer behavior: results are lowercased, some stems such as "reduc" (from "reducing") are not valid words, and "normalization" is cut all the way down to "normal". Stemming of this kind is useful for tasks like text normalization, indexing, and information retrieval.
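To make the idea of rule-based suffix stripping concrete, here is a deliberately simplified sketch. It is not the Porter algorithm: the `SUFFIXES` list and the `toy_stem` function are assumptions for illustration, and the real algorithm applies many more rules in several ordered phases.

```python
# Illustrative suffix list, longest first (assumption, not Porter's rule set).
SUFFIXES = ["ization", "ation", "ing", "es", "s"]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["Stemming", "reducing", "words", "helps", "normalization", "is"]:
    print(w, "->", toy_stem(w))
```

The toy version turns "Stemming" into "stemm", whereas the Porter stemmer output shown earlier gives "stem"; handling doubled consonants is one of the extra rules a real stemmer adds.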

# Lemmatization

Lemmatization is a process similar to stemming, where words are transformed to their base or dictionary form, known as the lemma. 

However, unlike stemming, lemmatization ensures that the transformed word belongs to the language and is a valid word.

In [32]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the WordNet corpus if not already downloaded
nltk.download('wordnet')
nltk.download('punkt')



In [36]:

# Example text
text = "Lemmatization is a process of reducing words to their base form. It helps in normalization of text data."

# Tokenize the text
tokens = word_tokenize(text)

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization to each token
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Join the lemmatized tokens back into a sentence
lemmatized_text = ' '.join(lemmatized_tokens)

print("Original Text:")
print(text)
print("\nText after Lemmatization:")
print(lemmatized_text)

Original Text:
Lemmatization is a process of reducing words to their base form. It helps in normalization of text data.

Text after Lemmatization:
Lemmatization is a process of reducing word to their base form . It help in normalization of text data .


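NLTK's WordNet lemmatizer is applied to each token, and the lemmatized tokens are joined back into a sentence. Because the lemmatize method treats every word as a noun unless a part of speech is passed, only the plural-looking forms change here ("words" becomes "word", "helps" becomes "help"), while verb forms such as "is" and "reducing" are left untouched. Lemmatization is useful for tasks like text normalization, information retrieval, and machine learning.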
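The effect of the part of speech can be illustrated without WordNet. The sketch below uses a tiny hand-written lookup table keyed by word and part of speech; `LEMMAS` and `toy_lemmatize` are hypothetical names, and the default of `"n"` simply mirrors the noun default of NLTK's lemmatizer.

```python
# Tiny hand-written lemma table keyed by (word, part of speech).
# A real lemmatizer consults a full dictionary such as WordNet.
LEMMAS = {
    ("is", "v"): "be",
    ("helps", "v"): "help",
    ("reducing", "v"): "reduce",
    ("words", "n"): "word",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("reducing"))        # looked up as a noun: unchanged
print(toy_lemmatize("reducing", "v"))   # looked up as a verb
print(toy_lemmatize("is", "v"))
print(toy_lemmatize("words"))
```

In a full pipeline the part of speech would usually come from a POS tagger rather than being passed by hand.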

# Part-of-Speech (POS) Tagging

Assigning grammatical categories (like noun, verb, adjective) to words in a sentence.

Part-of-Speech (POS) tagging is the process of assigning a grammatical tag to each word in a sentence based on its role and context within the sentence. 

These tags typically indicate the word's part of speech, such as noun, verb, adjective, adverb, etc., as well as additional grammatical information like tense, number, and case.

In [37]:
import nltk
from nltk.tokenize import word_tokenize

# Download the averaged perceptron tagger model and the punkt tokenizer if not already downloaded
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')



In [38]:

# Example text
text = "POS tagging is essential for understanding the grammatical structure of sentences."

# Tokenize the text
tokens = word_tokenize(text)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

print("POS Tagging:")
print(pos_tags)

POS Tagging:
[('POS', 'NNP'), ('tagging', 'NN'), ('is', 'VBZ'), ('essential', 'JJ'), ('for', 'IN'), ('understanding', 'VBG'), ('the', 'DT'), ('grammatical', 'JJ'), ('structure', 'NN'), ('of', 'IN'), ('sentences', 'NNS'), ('.', '.')]


POS tagging is an essential preprocessing step in many NLP tasks, such as syntactic parsing, information extraction, and sentiment analysis, as it provides valuable linguistic information about the words in a sentence.
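The tags in the output above come from the Penn Treebank tag set. As a reading aid, the sketch below pairs each tag from that output with a short description; the `TAG_MEANINGS` dictionary is a hand-written subset covering only these tags, not the full tag set.

```python
# Penn Treebank tags that appear in the output above (hand-written subset).
TAG_MEANINGS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "JJ": "adjective",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
}

# A few (word, tag) pairs copied from the tagger output.
pos_tags = [("POS", "NNP"), ("tagging", "NN"), ("is", "VBZ"),
            ("essential", "JJ"), ("for", "IN"), ("understanding", "VBG")]

for word, tag in pos_tags:
    print(f"{word:15} {tag:4} {TAG_MEANINGS.get(tag, 'other')}")
```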

# Named Entity Recognition (NER)

Named Entity Recognition (NER) is a natural language processing task that involves identifying and classifying named entities in text into predefined categories such as names of persons, organizations, locations, dates, quantities, and more. 

NER is crucial for various NLP applications, including information extraction, question answering, and entity linking.

In [None]:
# Counter is part of Python's standard library (collections); no pip install is needed

In [45]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import ne_chunk

# Download the chunker model, its word list, and the tokenizer if not already downloaded
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')


In [48]:

text = "Apple is a company based in Cupertino, California. John Smith works for Google."

# Tokenize the text
tokens = word_tokenize(text)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

# Perform Named Entity Recognition (NER)
ner_tags = ne_chunk(pos_tags)

# Extract named entities along with their types
named_entities = []
for chunk in ner_tags:
    if hasattr(chunk, 'label'):
        named_entities.append((chunk.label(), ' '.join(c[0] for c in chunk)))
print(named_entities)

[('GPE', 'Apple'), ('GPE', 'Cupertino'), ('GPE', 'California'), ('PERSON', 'John Smith'), ('PERSON', 'Google')]
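The result is worth inspecting rather than trusting: "Apple" is labelled GPE (a geo-political entity) and "Google" is labelled PERSON, although both are organizations in this text. A quick tally by label makes such output easier to review. The sketch below copies the entity list from the output above and uses only the standard library.

```python
from collections import Counter  # standard library, no installation needed

# Entities copied from the output above.
named_entities = [("GPE", "Apple"), ("GPE", "Cupertino"), ("GPE", "California"),
                  ("PERSON", "John Smith"), ("PERSON", "Google")]

# Count how often each entity label occurs.
label_counts = Counter(label for label, _ in named_entities)
print(label_counts.most_common())

# Group entity strings by label for a quick manual review.
by_label = {}
for label, entity in named_entities:
    by_label.setdefault(label, []).append(entity)
print(by_label)
```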


# Bag-of-Words (BoW) Model

Representing text as a collection of words disregarding grammar and word order.

The Bag-of-Words (BoW) model is a simple and commonly used technique in natural language processing (NLP) for text representation. 

It represents text data as a collection of words, disregarding grammar and word order but keeping multiplicity (frequency) of words.

## Theory

### Example:

Suppose we have a corpus consisting of three documents:

Document 1: "The cat sat on the mat."

Document 2: "The dog played in the yard."

Document 3: "The cat and the dog are friends."

### 1. Tokenization:

We first tokenize each document, splitting them into individual words or tokens:

Document 1: ["The", "cat", "sat", "on", "the", "mat"]

Document 2: ["The", "dog", "played", "in", "the", "yard"]

Document 3: ["The", "cat", "and", "the", "dog", "are", "friends"]

### 2. Vocabulary Construction:



Next, we construct the vocabulary, which consists of all unique words in the corpus. Words are compared case-insensitively, so "The" and "the" count as the same entry:

Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "played", "in", "yard", "and", "are", "friends"]

### 3. Document Representation:


We represent each document in the corpus using a vector based on the frequency of words in the vocabulary:

Document 1: [2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

Document 2: [2, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

Document 3: [2, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1]

Each element in the vector represents the frequency of the corresponding word in the vocabulary within the document. The first element is 2 in every vector because each document contains "the" twice.
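The three steps can be sketched in plain Python. This is a minimal illustration, assuming tokens are lowercased and punctuation is dropped; with that assumption "the" occurs twice in each document, so its count is 2. The helper name `tokenize` is chosen for illustration.

```python
import re
from collections import Counter

docs = [
    "The cat sat on the mat.",
    "The dog played in the yard.",
    "The cat and the dog are friends.",
]


def tokenize(doc):
    """Lowercase and keep only runs of letters (drops punctuation)."""
    return re.findall(r"[a-z]+", doc.lower())


# Step 1: tokenization.
tokenized = [tokenize(d) for d in docs]

# Step 2: vocabulary, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocabulary:
            vocabulary.append(tok)

# Step 3: one count vector per document.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vec in vectors:
    print(vec)
```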

## Code

In [49]:
from sklearn.feature_extraction.text import CountVectorizer

# Example text corpus
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus to create the Bag-of-Words representation
bow_matrix = vectorizer.fit_transform(corpus)

# Get the vocabulary (unique words) learned by the vectorizer
vocabulary = vectorizer.get_feature_names_out()

# Convert the Bag-of-Words matrix to a dense array for easier manipulation
bow_array = bow_matrix.toarray()

# Print the Bag-of-Words matrix and vocabulary
print("Bag-of-Words Matrix:")
print(bow_array)
print("\nVocabulary:")
print(vocabulary)


Bag-of-Words Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

Vocabulary:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


# TF-IDF (Term Frequency-Inverse Document Frequency)

A statistical measure to evaluate the importance of a word in a document relative to a collection of documents.

## Theory



Suppose we have a corpus consisting of four documents:

Document 1: "The cat sat on the mat."

Document 2: "The dog played in the yard."

Document 3: "The cat and the dog are friends."

Document 4: "The mat is on the floor."

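### 1. Term Frequency (TF):

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$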



Let's calculate the TF for the term "cat" in Document 1:

Number of times term "cat" appears in Document 1 = 1

Total number of terms in Document 1 = 6

TF("cat", Document 1) = 1/6 ≈ 0.167

Similarly, we can calculate TF for other terms in each document.

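### 2. Inverse Document Frequency (IDF):

$$\mathrm{IDF}(t) = \log\left(\frac{N}{n_t}\right)$$

Here $N$ is the total number of documents in the corpus, $n_t$ is the number of documents containing term $t$, and $\log$ is the natural logarithm.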

Let's calculate the IDF for the term "cat" in the corpus:

Total number of documents in the corpus (N) = 4

Number of documents containing term "cat" (n_cat) = 2

IDF("cat") = log(4/2) = log(2) ≈ 0.693

Similarly, we can calculate IDF for other terms in the corpus.

### 3. TF-IDF:

$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

Let's calculate the TF-IDF for the term "cat" in Document 1:

TF("cat", Document 1) ≈ 0.167

IDF("cat") ≈ 0.693

TF-IDF("cat", Document 1) ≈ 0.167 * 0.693 ≈ 0.116

Similarly, we can calculate TF-IDF for other terms in each document.

Interpretation:

TF measures the local importance of a term within a single document.

IDF measures how rare a term is across the corpus: the fewer documents contain it, the higher its IDF. A term that occurs in every document, such as "the" in this corpus, has an IDF of log(4/4) = 0.

TF-IDF combines TF and IDF to determine the importance of a term in a document relative to the entire corpus.

Terms with high TF-IDF scores are frequent within a document but rare across the corpus.
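The hand calculation for "cat" can be checked with a few lines of Python. This is a minimal sketch of the textbook definitions used in this section (term count divided by document length, and the natural logarithm of N over the document frequency); the function names are chosen for illustration, and `idf` assumes the term occurs in at least one document.

```python
import math
import re

corpus = [
    "The cat sat on the mat.",
    "The dog played in the yard.",
    "The cat and the dog are friends.",
    "The mat is on the floor.",
]
# Lowercase each document and split it into words, dropping punctuation.
docs = [re.findall(r"[a-z]+", d.lower()) for d in corpus]


def tf(term, doc):
    """Share of the document's terms that are `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Natural log of (number of documents / documents containing `term`)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(round(tf("cat", docs[0]), 3))
print(round(idf("cat", docs), 3))
print(round(tf_idf("cat", docs[0], docs), 3))
print(round(tf_idf("the", docs[0], docs), 3))
```

The last line illustrates the interpretation above: "the" is the most frequent term in Document 1, yet its score is zero because it appears in all four documents.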

## Code

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the corpus
corpus = [
    "The cat sat on the mat.",
    "The dog played in the yard.",
    "The cat and the dog are friends.",
    "The mat is on the floor."
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus to calculate TF-IDF scores
tfidf_matrix = vectorizer.fit_transform(corpus)

# Get the feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to a dense array for easier manipulation
tfidf_array = tfidf_matrix.toarray()

# Print the TF-IDF scores for each term in each document
print("TF-IDF Scores:")
for i, doc in enumerate(corpus):
    print(f"Document {i+1}:")
    for j, term in enumerate(feature_names):
        print(f"   {term}: {tfidf_array[i, j]:.3f}")


TF-IDF Scores:
Document 1:
   and: 0.000
   are: 0.000
   cat: 0.396
   dog: 0.000
   floor: 0.000
   friends: 0.000
   in: 0.000
   is: 0.000
   mat: 0.396
   on: 0.396
   played: 0.000
   sat: 0.503
   the: 0.525
   yard: 0.000
Document 2:
   and: 0.000
   are: 0.000
   cat: 0.000
   dog: 0.363
   floor: 0.000
   friends: 0.000
   in: 0.461
   is: 0.000
   mat: 0.000
   on: 0.000
   played: 0.461
   sat: 0.000
   the: 0.481
   yard: 0.461
Document 3:
   and: 0.433
   are: 0.433
   cat: 0.341
   dog: 0.341
   floor: 0.000
   friends: 0.433
   in: 0.000
   is: 0.000
   mat: 0.000
   on: 0.000
   played: 0.000
   sat: 0.000
   the: 0.452
   yard: 0.000
Document 4:
   and: 0.000
   are: 0.000
   cat: 0.000
   dog: 0.000
   floor: 0.480
   friends: 0.000
   in: 0.000
   is: 0.480
   mat: 0.379
   on: 0.379
   played: 0.000
   sat: 0.000
   the: 0.501
   yard: 0.000
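These scores differ from the textbook calculation (for example 0.396 rather than about 0.116 for "cat" in Document 1) because `TfidfVectorizer` uses a different formulation by default: raw term counts for TF, a smoothed IDF of ln((1 + N) / (1 + n_t)) + 1, and scaling of each document vector to unit Euclidean length. The sketch below reproduces the Document 1 scores in plain Python under those defaults. The regular-expression tokenizer is an assumption that happens to match the vectorizer's tokens for this corpus, and the function names are illustrative.

```python
import math
import re

corpus = [
    "The cat sat on the mat.",
    "The dog played in the yard.",
    "The cat and the dog are friends.",
    "The mat is on the floor.",
]
# Simple tokenizer (assumption): lowercase words, punctuation dropped.
docs = [re.findall(r"[a-z]+", d.lower()) for d in corpus]
n_docs = len(docs)


def smoothed_idf(term):
    """ln((1 + N) / (1 + document frequency)) + 1."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_vector(doc):
    """Raw count times smoothed IDF, scaled to unit L2 length."""
    weights = {term: doc.count(term) * smoothed_idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


doc1 = tfidf_vector(docs[0])
for term in sorted(doc1):
    print(f"{term}: {doc1[term]:.3f}")
```

Because of the added 1, a term that appears in every document (here "the") keeps a non-zero weight instead of dropping to zero as in the textbook formula.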


# Word Embeddings

Representing words as dense vectors in a continuous vector space, capturing semantic meaning.

Word embeddings are a type of word representation in natural language processing (NLP) that captures semantic and syntactic information about words in a dense vector space. 

Unlike traditional sparse representations like one-hot encoding or Bag-of-Words, word embeddings encode semantic similarity between words by placing similar words close to each other in the vector space. 

Word embeddings are typically learned from large text corpora. Word2Vec and FastText train shallow neural networks, while GloVe fits vectors to global word co-occurrence statistics.

## Word2Vec

**Definition**

Word2Vec is a popular word embedding technique introduced by Mikolov et al. (2013). It learns embeddings with a shallow neural network in one of two configurations: Continuous Bag of Words (CBOW), which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding words from a given word.

**Key Features**

**Semantic Similarity**: Words with similar meanings are clustered together in the embedding space.

**Analogies**: Word vectors can capture analogical relationships through vector arithmetic. For example:

king - man + woman ≈ queen
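The arithmetic behind the analogy can be illustrated with hand-made vectors. The two-dimensional values below are invented for this sketch (one axis loosely standing for "royalty", the other for "gender") and do not come from a trained model; real embeddings have hundreds of dimensions with no such clean interpretation.

```python
import math

# Hand-made 2-D vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Compute king - man + woman component by component.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining words by cosine similarity to the target vector.
candidates = {w: v for w, v in vectors.items()
              if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```

The query words themselves are excluded from the candidates, which is also the usual convention when evaluating analogies on trained embeddings.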

In [2]:
#pip install gensim



In [3]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')



In [4]:
# Example corpus
corpus = [
    "The cat sat on the mat.",
    "The dog played in the yard.",
    "The cat and the dog are friends.",
    "The mat is on the floor."
]

# Tokenize the corpus
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Define Word2Vec parameters
vector_size = 100  # Dimensionality of the word vectors
window = 5  # Maximum distance between the current and predicted word within a sentence
min_count = 1  # Minimum frequency of words to be included in the model

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=vector_size, window=window, min_count=min_count)

# Get the word vectors
word_vectors = model.wv

# Test word similarity
similarity = word_vectors.similarity('cat', 'dog')
print(f"Similarity between 'cat' and 'dog': {similarity:.2f}")

# Find similar words
similar_words = word_vectors.most_similar('cat', topn=5)
print("Similar words to 'cat':")
for word, score in similar_words:
    print(f"{word}: {score:.2f}")

Similarity between 'cat' and 'dog': 0.06
Similar words to 'cat':
yard: 0.17
on: 0.14
mat: 0.13
dog: 0.06
friends: 0.06

These scores should not be read as meaningful. With only four short sentences the model has far too little data to learn semantics, so the similarities mostly reflect the random initialization of the vectors and will differ between runs. Useful embeddings require training on a large corpus or loading pretrained vectors.
