<span style="font-size:16px; font-weight:bold">Welcome to Natural language processing (NLP) in Python</span><br/>

Presented by: Reza Saadatyar (2024-2025)<br/>
E-mail: Reza.Saadatyar@outlook.com<br/>

<span style="font-size: 16px;font-weight:bold"> Term Frequency - Inverse Document Frequency (TF-IDF):</span><br/>
TF-IDF is a numerical statistic used to evaluate how important a word is to a specific document within a larger collection (corpus).<br/>

**TF-IDF combines two metrics:**<br/>
▪ `TF:` Measures how frequently a word appears in a document.<br/>
▪ `IDF:` Measures how unique or rare a word is across all documents in the corpus.<br/>
▪ The TF-IDF value increases when a word appears often in a document but is less common in the rest of the corpus. This helps highlight words that are both frequent and distinctive for a particular document.<br/>
▪ TF-IDF is widely used in information retrieval, text mining, and natural language processing tasks such as document classification, search engines (to rank document relevance), and down-weighting very common words (such as stop words).<br/>

TF = (Number of times term *t* appears in a document) / (Total number of terms in the document)<br/>
IDF = log((Number of documents) / (Number of documents containing term *t*))<br/>
TF-IDF = TF × IDF
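
As a quick check of the formulas above, here is a minimal sketch that scores two words from a small toy corpus. It assumes the natural logarithm (the formula does not fix a base) and plain whitespace tokenization; the helper names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are illustrative and do not come from any library.

```python
import math

# Toy corpus, already lowercased and split on whitespace (an assumption for this sketch).
corpus = [
    "natural language processing is fascinating".split(),
    "bag of words is a simple model".split(),
    "text data can be represented as vectors".split(),
    "words are important features in nlp".split(),
]

def term_frequency(term, doc):
    """Occurrences of `term` in `doc` divided by the number of tokens in `doc`."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """log(N / df) with the natural log; raises ZeroDivisionError if no document contains `term`."""
    containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Product of the two quantities above."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

doc = corpus[1]  # "bag of words is a simple model"
for term in ("bag", "words"):
    print(term, round(tf_idf(term, doc, corpus), 3))
```

Both words occur once in the second document, but `bag` appears in only one of the four documents while `words` appears in two, so `bag` receives the higher score (about 0.198 versus 0.099).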

**TF-IDF Workflow:**<br/>
▪ `Lowercasing:` Convert all text to lowercase for consistency.<br/>
▪ `Tokenization:` Break text into individual words (tokens).<br/>
▪ `Vocabulary Building:` Collect all unique words to form the vocabulary.<br/>
▪ `Sorting (Optional):` Sort the vocabulary alphabetically for reproducibility.<br/>
▪ `Vectorization:` Represent each document as a vector, using TF-IDF scores for each vocabulary word.<br/>

<span style="font-size:16.5px; color:rgb(245, 5, 5); font-weight:bold;">Importing libraries</span>

In [1]:
import numpy as np
import pandas as pd
import math

from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
documents = [
    "Natural language processing is fascinating",
    "Bag of Words is a simple model",
    "Text data can be represented as vectors",
    "Words are important features in NLP"
]

# Step 1: Preprocess and build vocabulary
tokenized_docs = [doc.lower().split() for doc in documents]  # Tokenize each document by lowercasing and splitting into words
unique_words = set(word for doc in tokenized_docs for word in doc)  # Create a set of all unique words from all tokenized documents
sort_vocab = sorted(unique_words)  # Sort the unique words alphabetically to form the vocabulary
print(f"Step 1:\n{tokenized_docs = }\n{unique_words = }\n{sort_vocab = }")

# Step 2: Term Frequency (TF)
print(f"\nStep 2 (TF):")
tf_list = []  # Initialize an empty list to store term frequency dictionaries for each document
for doc in tokenized_docs:  # Iterate over each tokenized document
    tf = {}  # Initialize an empty dictionary to store term frequencies for the current document
    total = len(doc)  # Calculate the total number of words in the current document
    for word in sort_vocab:  # Iterate over each word in the sorted vocabulary
        tf[word] = doc.count(word) / total  # Compute term frequency for the word and store in the dictionary
    tf_list.append(tf)  # Append the term frequency dictionary to the list
    print(f"{tf}")

# Step 3: Compute Inverse Document Frequency (IDF)
def compute_idf(tokenized_docs, vocab):  # Define a function to compute IDF values given tokenized documents and vocabulary
    N = len(tokenized_docs)  # Number of documents
    idf = {}  # Initialize an empty dictionary to store IDF values
    for word in vocab:  # Iterate over each word in the vocabulary
        doc_count = 0  # Initialize a counter for the number of documents containing the word
        for doc in tokenized_docs:  # Iterate over each tokenized document
            if word in doc:  # Check if the word is present in the current document
                doc_count += 1  # Increment the counter if the word is found in the document
        idf[word] = math.log(N / (1 + doc_count)) + 1  # Smoothed variant: log(N / (1 + df)) + 1; differs from the plain log(N / df) formula and from scikit-learn's default
    return idf  # Return the dictionary of IDF values

idf = compute_idf(tokenized_docs, sort_vocab)
print(f"\nStep 3 (IDF):\n {idf}")

# Step 4: Compute TF-IDF
tfidf_matrix = []
for tf in tf_list:  # Iterate over each term frequency dictionary in the list
    tfidf = {}  # Initialize an empty dictionary to store TF-IDF values for the current document
    for word in sort_vocab:  # Iterate over each word in the sorted vocabulary
        tfidf[word] = round(tf[word] * idf[word], 3)  # Calculate TF-IDF for the word and store in the dictionary
    tfidf_matrix.append(tfidf)  # Append the TF-IDF dictionary to the matrix
print(f"\nStep 4 (Compute TF-IDF):\n {tfidf_matrix}")

# Step 5: Convert to DataFrame for readability
tfidf_df = pd.DataFrame(tfidf_matrix, columns=sort_vocab)
print("\nTF-IDF matrix:\n", tfidf_df)

Step 1:
tokenized_docs = [['natural', 'language', 'processing', 'is', 'fascinating'], ['bag', 'of', 'words', 'is', 'a', 'simple', 'model'], ['text', 'data', 'can', 'be', 'represented', 'as', 'vectors'], ['words', 'are', 'important', 'features', 'in', 'nlp']]
unique_words = {'a', 'is', 'text', 'nlp', 'bag', 'be', 'are', 'of', 'in', 'natural', 'language', 'processing', 'model', 'data', 'important', 'can', 'represented', 'simple', 'fascinating', 'as', 'features', 'vectors', 'words'}
sort_vocab = ['a', 'are', 'as', 'bag', 'be', 'can', 'data', 'fascinating', 'features', 'important', 'in', 'is', 'language', 'model', 'natural', 'nlp', 'of', 'processing', 'represented', 'simple', 'text', 'vectors', 'words']

Step 2 (TF):
{'a': 0.0, 'are': 0.0, 'as': 0.0, 'bag': 0.0, 'be': 0.0, 'can': 0.0, 'data': 0.0, 'fascinating': 0.2, 'features': 0.0, 'important': 0.0, 'in': 0.0, 'is': 0.2, 'language': 0.2, 'model': 0.0, 'natural': 0.2, 'nlp': 0.0, 'of': 0.0, 'processing': 0.2, 'represented': 0.0, 'simple':

In [3]:
# Step 1: Create a TfidfVectorizer object
# - This object is responsible for converting a list of text documents into a matrix of TF-IDF features.
# - It will learn the vocabulary from the documents and calculate the TF-IDF score for each word in each document.
# How to calculate TF-IDF?
# 1. The vectorizer learns the vocabulary from the documents.
# 2. For each word in each document, it computes:
#    - Term Frequency (TF): How often the word appears in the document.
#    - Inverse Document Frequency (IDF): How unique the word is across all documents.
#    - TF-IDF = TF * IDF for each word in each document.
tfidf_vectorizer = TfidfVectorizer()

# Step 2: Fit the vectorizer to the documents and transform the documents into TF-IDF vectors
# - The 'fit' part learns the vocabulary from the documents.
# - The 'transform' part calculates the TF-IDF score for each word in each document.
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Step 3: Retrieve the feature names (i.e., the vocabulary)
# This gives us the list of unique words (features) that the vectorizer has learned from the documents.
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to a dense array and display as a DataFrame for readability
import pandas as pd
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print("TF-IDF matrix:\n", tfidf_df)


TF-IDF matrix:
         are        as       bag        be       can      data  fascinating  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000     0.465162   
1  0.000000  0.000000  0.436719  0.000000  0.000000  0.000000     0.000000   
2  0.000000  0.377964  0.000000  0.377964  0.377964  0.377964     0.000000   
3  0.421765  0.000000  0.000000  0.000000  0.000000  0.000000     0.000000   

   features  important        in  ...     model   natural       nlp        of  \
0  0.000000   0.000000  0.000000  ...  0.000000  0.465162  0.000000  0.000000   
1  0.000000   0.000000  0.000000  ...  0.436719  0.000000  0.000000  0.436719   
2  0.000000   0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000   
3  0.421765   0.421765  0.421765  ...  0.000000  0.000000  0.421765  0.000000   

   processing  represented    simple      text   vectors     words  
0    0.465162     0.000000  0.000000  0.000000  0.000000  0.000000  
1    0.000000     0.000000  0.436719  0.000000  0

In [29]:
# The two cells above produce different outputs because they compute TF-IDF differently:
# 
# 1. The first cell computes TF, IDF, and TF-IDF manually and stores the results in a DataFrame.
#    - The vocabulary and column order come from 'sort_vocab' (23 words, including 'a').
#    - TF is the word count divided by the document length.
#    - IDF is log(N / (1 + df)) + 1, and scores are rounded to 3 decimals without normalization.
# 
# 2. The second cell uses scikit-learn's TfidfVectorizer, which with its default settings:
#    - Lowercases the text and keeps only tokens of two or more word characters,
#      so the single-letter word 'a' is dropped (22 features instead of 23).
#    - Sorts the vocabulary (feature_names) alphabetically.
#    - Uses raw counts as TF and the smoothed IDF log((1 + N) / (1 + df)) + 1 (smooth_idf=True).
#    - Scales each row to unit L2 norm (norm='l2').
# 
# As a result, the two outputs differ in:
# - The set of columns (the manual version keeps 'a').
# - The TF-IDF values, because of the different TF definition, IDF smoothing, and normalization.
# - The handling of tokenization and preprocessing.
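
To make these differences concrete, the following sketch re-implements the TfidfVectorizer recipe in plain Python: lowercasing, the library's default token pattern (two or more word characters), raw counts as TF, the smoothed IDF, and L2 row normalization. The function name `tfidf_sklearn_style` is hypothetical, and the code is an illustration under those stated defaults, not scikit-learn's actual implementation.

```python
import math
import re

documents = [
    "Natural language processing is fascinating",
    "Bag of Words is a simple model",
    "Text data can be represented as vectors",
    "Words are important features in NLP",
]

# Default token pattern of TfidfVectorizer: tokens of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tfidf_sklearn_style(docs):
    """Return (vocabulary, rows) using raw-count TF, smoothed IDF, and L2 normalization."""
    tokenized = [TOKEN_RE.findall(d.lower()) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    # Smoothed IDF: log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        raw = [doc.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_sklearn_style(documents)
print(len(vocab), "a" in vocab)
print(round(rows[0][vocab.index("fascinating")], 6))
```

Under these assumptions the sketch yields 22 features without `a`, and the score for `fascinating` in the first document should agree with the TfidfVectorizer output shown earlier (about 0.465).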


In [5]:
# Manually recompute TF-IDF following scikit-learn's recipe: raw counts as TF, IDF + 1, then L2 normalization

import numpy as np

# Step 1: Build the sorted vocabulary (whitespace tokenization keeps 'a', which TfidfVectorizer drops)
sort_vocab = sorted(set(word for doc in documents for word in doc.lower().split()))

# Step 2: Compute Term Frequency (TF)
tf = []
for doc in documents:
    doc_words = doc.lower().split()
    doc_tf = []
    for word in sort_vocab:
        doc_tf.append(doc_words.count(word))
    tf.append(doc_tf)
tf = np.array(tf)

# Step 3: Compute Document Frequency (DF)
df = np.sum(tf > 0, axis=0)

# Step 4: Compute Inverse Document Frequency (IDF)
N = len(documents)
idf = np.log(N / df) + 1  # Unsmoothed IDF plus 1; scikit-learn's default (smooth_idf=True) uses log((1 + N) / (1 + df)) + 1 instead

# Step 5: Compute TF-IDF
tfidf_manual = tf * idf

# Step 6: Normalize TF-IDF rows to unit norm (L2), as scikit-learn does by default
norms = np.linalg.norm(tfidf_manual, axis=1, keepdims=True)
tfidf_manual_normalized = tfidf_manual / np.where(norms == 0, 1, norms)

# Step 7: Create DataFrame for readability
tfidf_manual_df = pd.DataFrame(tfidf_manual_normalized, columns=sort_vocab)
tfidf_manual_df.head()


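| | a | are | as | bag | be | can | data | fascinating | features | important | ... | model | natural | nlp | of | processing | represented | simple | text | vectors | words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.471225 | 0.0 | 0.0 | ... | 0.0 | 0.471225 | 0.0 | 0.0 | 0.471225 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.408015 | 0.0 | 0.0 | 0.408015 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.408015 | 0.0 | 0.0 | 0.408015 | 0.0 | 0.0 | 0.408015 | 0.0 | 0.0 | 0.289499 |
| 2 | 0.0 | 0.0 | 0.377964 | 0.0 | 0.377964 | 0.377964 | 0.377964 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.377964 | 0.0 | 0.377964 | 0.377964 | 0.0 |
| 3 | 0.0 | 0.426268 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.426268 | 0.426268 | ... | 0.0 | 0.0 | 0.426268 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.30245 |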
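
The manual result above (0.471225 for `fascinating`) is close to, but not equal to, the TfidfVectorizer value (0.465162). The gap comes from the IDF definition. The sketch below places the IDF variants used in this notebook next to scikit-learn's smoothed default for the corpus size N = 4; the function names are illustrative only, and the natural logarithm is assumed throughout.

```python
import math

N = 4  # number of documents in the notebook's corpus

def idf_basic(df, n=N):
    """Textbook form: log(N / df)."""
    return math.log(n / df)

def idf_first_cell(df, n=N):
    """Variant used in the first manual cell: log(N / (1 + df)) + 1."""
    return math.log(n / (1 + df)) + 1

def idf_plus_one(df, n=N):
    """Variant used in the NumPy cell: log(N / df) + 1."""
    return math.log(n / df) + 1

def idf_sklearn_smooth(df, n=N):
    """scikit-learn default (smooth_idf=True): log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

# Compare the variants for a word found in one document and in two documents.
for df in (1, 2):
    values = [f(df) for f in (idf_basic, idf_first_cell, idf_plus_one, idf_sklearn_smooth)]
    print(df, [round(v, 3) for v in values])
```

For a word that appears in a single document, the unsmoothed `log(N / df) + 1` gives about 2.386 while the smoothed form gives about 1.916, which is enough to shift the normalized scores in the third decimal place.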


In [33]:
tfidf_matrix, feature_names

(<Compressed Sparse Row sparse matrix of dtype 'float64'
 	with 24 stored elements and shape (4, 22)>,
 array(['are', 'as', 'bag', 'be', 'can', 'data', 'fascinating', 'features',
        'important', 'in', 'is', 'language', 'model', 'natural', 'nlp',
        'of', 'processing', 'represented', 'simple', 'text', 'vectors',
        'words'], dtype=object))

In [68]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [74]:
txt = [
    "Natural language processing is fascinating",
    "Bag of Words is a simple model",
    "Text data can be represented as vectors",
    "Words are important features in NLP"
]

# Step 1 & 2: Create and fit the Tokenizer
tok = Tokenizer()
tok.fit_on_texts(txt)

# Step 3 & 4: Print the sorted vocabulary
sorted_vocab = sorted(tok.word_index.keys())
print(f'Vocabulary (sorted): {sorted_vocab}')

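# Step 5: Transform to a Bag of Words count matrix
# Note: columns follow tok.word_index (frequency-ranked, starting at 1; column 0 is unused),
# not the alphabetically sorted vocabulary printed above.
vectors = tok.texts_to_matrix(txt, mode='count')
print("Bag of Words matrix:\n", vectors)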

Vocabulary (sorted): ['a', 'are', 'as', 'bag', 'be', 'can', 'data', 'fascinating', 'features', 'important', 'in', 'is', 'language', 'model', 'natural', 'nlp', 'of', 'processing', 'represented', 'simple', 'text', 'vectors', 'words']
Bag of Words matrix:
 [[0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]]
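
The matrix above has 24 columns for a 23-word vocabulary because `texts_to_matrix` indexes columns by `tok.word_index`, which ranks words by frequency starting at 1 and leaves column 0 unused. The pure-Python sketch below mimics that layout for this punctuation-free sample. It assumes that lowercasing and whitespace splitting are sufficient here and that ties in frequency keep first-appearance order; it is an illustration, not the Keras implementation.

```python
from collections import Counter

txt = [
    "Natural language processing is fascinating",
    "Bag of Words is a simple model",
    "Text data can be represented as vectors",
    "Words are important features in NLP",
]

tokenized = [t.lower().split() for t in txt]

# Count words in order of first appearance (Counter keeps insertion order).
counts = Counter(word for doc in tokenized for word in doc)

# Rank by descending frequency; sorted() is stable, so ties keep first-appearance order.
ranked = sorted(counts, key=lambda w: -counts[w])
word_index = {word: i for i, word in enumerate(ranked, start=1)}

# One row per document, one column per index; column 0 stays empty.
matrix = [[0.0] * (len(word_index) + 1) for _ in tokenized]
for row, doc in zip(matrix, tokenized):
    for word in doc:
        row[word_index[word]] += 1.0

# Column labels make the matrix readable: position i holds the word with index i.
columns = ["<unused>"] + ranked
print(columns[:7])
print(matrix[0][:7])
```

Here `is` and `words` take indices 1 and 2 because they are the only words that occur twice, which matches the pattern of ones in the first two columns of the printed matrix.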
