<span style="font-size:16px; font-weight:bold">Welcome to Natural Language Processing (NLP) in Python</span><br/>

Presented by: Reza Saadatyar (2024-2025)<br/>
E-mail: Reza.Saadatyar@outlook.com<br/>

<span style="font-size: 16px;font-weight:bold"> Bag of Words (BOW):</span><br/>
The `Bag of Words (BOW)` model is a simple, widely used technique in NLP for representing text data. Each document is represented as a vector that records how often each vocabulary word occurs in it, disregarding grammar and word order but keeping multiplicity. BOW is especially useful for tasks like `document classification`, where the presence or frequency of words serves as features for machine learning algorithms. It is a type of `Vector Space Model (VSM)`, in which text is transformed into numerical vectors for further analysis.

**Bag of Words Steps:**<br/>
▪ `Lowercase:` Standardize all text by converting to lowercase.<br/>
▪ `Tokenization:` Split the text into individual words (tokens).<br/>
▪ `Vocabulary Creation:` Identify all unique words to form the vocabulary.<br/>
▪ `Sorting:` Optionally, sort the vocabulary alphabetically (A-Z) for consistency.<br/>
▪ `Vectorization:` Represent each document as a vector indicating the frequency (or presence) of each vocabulary word.<br/>
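
The five steps above can be sketched in plain Python before turning to a library. This is a minimal illustration, not a production tokenizer: the helper name `bag_of_words` and the two sample sentences are hypothetical, and splitting on whitespace is an assumption that ignores punctuation.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (sorted vocabulary, count vectors) for a list of strings."""
    # Steps 1-2: lowercase, then split on whitespace (a deliberately naive tokenizer)
    tokenized = [doc.lower().split() for doc in docs]
    # Steps 3-4: collect the unique words and sort them alphabetically
    vocab = sorted({word for tokens in tokenized for word in tokens})
    # Step 5: one count vector per document, one column per vocabulary word
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["The cat sat", "The cat saw the dog"]  # hypothetical sample data
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

The second vector keeps multiplicity: "the" occurs twice in that sentence, so its column holds 2, while the word order of the sentence is lost.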

<span style="font-size:16.5px; color:rgb(245, 5, 5); font-weight:bold;">Importing libraries</span>

In [68]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [74]:
txt = [
    "Natural language processing is fascinating",
    "Bag of Words is a simple model",
    "Text data can be represented as vectors",
    "Words are important features in NLP"
]

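# Steps 1-3: Fit the Tokenizer, which lowercases, tokenizes, and builds the vocabulary
tok = Tokenizer()
tok.fit_on_texts(txt)

# Step 4: Print the vocabulary sorted alphabetically (for display only)
sorted_vocab = sorted(tok.word_index.keys())
print(f'Vocabulary (sorted): {sorted_vocab}')

# Step 5: Transform to a Bag of Words count matrix.
# Columns follow the indices in tok.word_index (most frequent word first),
# not the alphabetical order printed above. Column 0 is reserved, so the
# matrix has len(tok.word_index) + 1 columns.
vectors = tok.texts_to_matrix(txt, mode='count')
print("Bag of Words matrix:\n", vectors)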

Vocabulary (sorted): ['a', 'are', 'as', 'bag', 'be', 'can', 'data', 'fascinating', 'features', 'important', 'in', 'is', 'language', 'model', 'natural', 'nlp', 'of', 'processing', 'represented', 'simple', 'text', 'vectors', 'words']
Bag of Words matrix:
 [[0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]]
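
The matrix has 24 columns for a 23-word vocabulary, and its columns do not follow the alphabetical list printed above it. The sketch below rebuilds the column-to-word mapping in plain Python to make the output readable. It assumes indices start at 1 and are assigned by descending frequency, with ties broken by first appearance, which is consistent with the output shown. In practice, `tok.word_index` and `tok.index_word` give this mapping directly.

```python
from collections import Counter

txt = [
    "Natural language processing is fascinating",
    "Bag of Words is a simple model",
    "Text data can be represented as vectors",
    "Words are important features in NLP",
]

# Count every lowercased token; Counter keeps first-appearance order
counts = Counter(word for doc in txt for word in doc.lower().split())

# Stable sort by descending count, so ties keep first-appearance order (assumed rule)
ordered = sorted(counts, key=lambda w: -counts[w])
word_index = {w: i + 1 for i, w in enumerate(ordered)}  # index 0 stays unused
index_word = {i: w for w, i in word_index.items()}

# Second row of the matrix printed above, written as integers
row = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1] + [0] * 12
words = [index_word[i] for i, value in enumerate(row) if value]
print(words)  # ['is', 'words', 'bag', 'of', 'a', 'simple', 'model']
```

Under this mapping, column 1 is "is" and column 2 is "words", the only two words that occur twice in the corpus. That is why those two columns are set in more than one row.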
