In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


## Load the documents

In [2]:
docA = "apple apple apple apple a banana an banana "
docB = "apple car a car"

## Preprocessing

### Represent each document as 'Bag of words' (bow)

### Tokenize

In NLP (Natural Language Processing), tokenization is the process of breaking down a piece of text (such as a sentence or paragraph) into smaller units, called tokens. 

These tokens can be words, subwords, or even characters, depending on the level of tokenization.
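
To make the levels of tokenization concrete, here is a minimal sketch using only the standard library. The sample sentence and variable names are illustrative assumptions, and the regular expression is a simple stand-in, not what `nltk.word_tokenize` does internally.

```python
import re

sample = "Apples, bananas and cars."

# Whitespace tokenization: punctuation stays attached to the words.
whitespace_tokens = sample.split()

# Word-level tokenization with a simple regex: keeps runs of word characters only.
word_tokens = re.findall(r"\w+", sample.lower())

# Character-level tokenization of a single word.
char_tokens = list("apple")

print(whitespace_tokens)
print(word_tokens)
print(char_tokens)
```

Whitespace splitting is enough for the toy documents in this notebook because they contain no punctuation.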

In [3]:
# for simplicity, we will tokenize our documents assuming everything between
# whitespaces is a word.
# in reality you may want to use nltk.word_tokenize()
tokensA = docA.split()
tokensB = docB.split()
print( tokensA )
print( tokensB )

['apple', 'apple', 'apple', 'apple', 'a', 'banana', 'an', 'banana']
['apple', 'car', 'a', 'car']


### Stemming

Stemming in NLP (Natural Language Processing) is the process of reducing a word to its base or root form, called a "stem."

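'apple' and 'apples' should be counted as the same term. Stemming makes that possible by mapping both to one stem ('appl' with the Porter stemmer), which does not have to be a dictionary word.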
Resource: https://pythonspot.com/nltk-stemming/
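
Why stemming matters for counting can be sketched without NLTK. The `toy_stem` function below is a hypothetical, deliberately crude rule (it only strips a trailing plural "s"); it is not the Porter algorithm and is only meant to show how merging word forms changes the counts.

```python
from collections import Counter

def toy_stem(word):
    """Strip a trailing plural 's' (a crude, hypothetical rule, not Porter)."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

tokens = ["apple", "apples", "car", "cars", "glass"]

# Without stemming every surface form is its own term.
raw_counts = Counter(tokens)

# With stemming, singular and plural forms are counted together.
stemmed_counts = Counter(toy_stem(t) for t in tokens)

print(raw_counts)
print(stemmed_counts)
```

A real stemmer such as `PorterStemmer` applies many more rules, which is why it is used in the cell below.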

In [4]:
# Initialize the stemmer
stemmer = PorterStemmer()

# Apply stemming to the tokens
stemmedA = [stemmer.stem(token) for token in tokensA]
stemmedB = [stemmer.stem(token) for token in tokensB]

print(stemmedA)
print(stemmedB)

['appl', 'appl', 'appl', 'appl', 'a', 'banana', 'an', 'banana']
['appl', 'car', 'a', 'car']


### Remove the 'stop words'

Prepositions, articles, and other very common words carry little lexical content and add little to a document's representation, so you can filter the "Bag of Words" through a standard stop-word list or a custom one built for your domain.
The NLTK module ships stop-word lists for many languages. The corpus has to be downloaded once with `nltk.download('stopwords')` before `stopwords.words()` can be used.
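
A custom stop-word list can be as simple as a Python set. The sketch below assumes a small hand-written list (`custom_stop_words` is an illustrative name, not an NLTK object) and applies the same filtering pattern as the NLTK-based cell.

```python
# Hypothetical, hand-written stop-word list for illustration only.
custom_stop_words = {"a", "an", "the", "of", "and"}

tokens = ["appl", "appl", "a", "banana", "an", "banana"]

# Keep only tokens that are not in the stop-word set (case-insensitive).
filtered = [t for t in tokens if t.lower() not in custom_stop_words]

print(filtered)
```

Using a set rather than a list keeps each membership test fast, which matters once documents grow beyond toy size.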

In [5]:
# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

bowA = [token for token in stemmedA if token.lower() not in stop_words]
bowB = [token for token in stemmedB if token.lower() not in stop_words]

print(bowA)
print(bowB)

['appl', 'appl', 'appl', 'appl', 'banana', 'banana']
['appl', 'car', 'car']


## Create the collection vocabulary

Build one set of the unique terms found across the whole document collection.

That set is the vocabulary of our document collection.



In [6]:
vocabulary = set(bowA).union(set(bowB))
print(f'vocabulary: {vocabulary}')

vocabulary: {'banana', 'appl', 'car'}
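
A Python set has no guaranteed order, so the printed vocabulary (and the column order derived from it) can differ between runs. If a reproducible order is wanted, one option is to sort the terms. This is a sketch with the bag-of-words lists written out by hand.

```python
bow_a = ["appl", "appl", "appl", "appl", "banana", "banana"]
bow_b = ["appl", "car", "car"]

# Union of the unique terms, sorted so the order is the same on every run.
sorted_vocabulary = sorted(set(bow_a) | set(bow_b))

print(sorted_vocabulary)
```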


## Count unique words in each document, i.e. Term Frequency (TF)

Term frequency indicates how significant a particular term is within a document.

Next, we represent each document by the number of times each vocabulary term occurs in it.

In [7]:
# Initialize the dictionaries to store Term Frequencies:
TFa = dict.fromkeys(vocabulary, 0)
TFb = dict.fromkeys(vocabulary, 0)

print(f'TFa:{TFa}')

TFa:{'banana': 0, 'appl': 0, 'car': 0}


In [8]:
# Calculate term frequencies:
for word in vocabulary:
    TFa[word] = bowA.count(word)
    TFb[word] = bowB.count(word)

print(f'countsA: {TFa}')
print(f'countsB: {TFb}')

countsA: {'banana': 2, 'appl': 4, 'car': 0}
countsB: {'banana': 0, 'appl': 1, 'car': 2}
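
The same counts can be obtained with `collections.Counter`, which scans each document once instead of calling `list.count` for every vocabulary term. This is an alternative sketch, not the notebook's method; the lists are written out by hand.

```python
from collections import Counter

bow_a = ["appl", "appl", "appl", "appl", "banana", "banana"]
bow_b = ["appl", "car", "car"]
vocab = sorted(set(bow_a) | set(bow_b))

# Counter returns 0 for terms it has never seen, so missing terms need no special case.
counter_a = Counter(bow_a)
counter_b = Counter(bow_b)

tf_a = {term: counter_a[term] for term in vocab}
tf_b = {term: counter_b[term] for term in vocab}

print(tf_a)
print(tf_b)
```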


## Create a DataFrame to store the document collection

In [9]:
counts_df = pd.DataFrame([TFa, TFb])
counts_df

| | banana | appl | car |
|---|---|---|---|
| 0 | 2 | 4 | 0 |
| 1 | 0 | 1 | 2 |


## Calculate adjusted TF

**Adjusted term frequency for a document** is calculated by dividing the number of occurrences of the term by the total number of terms in that document:

Adjusted Term Frequency = Count of the term / Total number of terms in the document

This gives the relative frequency of the term within the document.


In [10]:
# Calculate the total number of terms in each document (sum of each row)
total_terms_per_doc = counts_df.sum(axis=1)

# Calculate the adjusted term frequency (TF) by dividing each count by the total number of terms in that document
TF_df = counts_df.div(total_terms_per_doc, axis=0)
TF_df

| | banana | appl | car |
|---|---|---|---|
| 0 | 0.333333 | 0.666667 | 0.0 |
| 1 | 0.0 | 0.333333 | 0.666667 |
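
The first row of the result above can be reproduced by hand: document A has 2 + 4 + 0 = 6 terms, so 'banana' gets 2/6 and 'appl' gets 4/6. A minimal sketch of that arithmetic, with the counts copied from the earlier output:

```python
counts_a = {"banana": 2, "appl": 4, "car": 0}

# Total number of terms in the document.
total_a = sum(counts_a.values())

# Relative frequency of each term; guard against an empty document.
adjusted_tf_a = {
    term: (count / total_a if total_a else 0.0)
    for term, count in counts_a.items()
}

print(total_a)
print(adjusted_tf_a)
```

Because every count is divided by the same total, the adjusted frequencies of a non-empty document always sum to 1.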


## Compute IDF

**IDF (Inverse Document Frequency)** is a measure used in text processing, particularly in **TF-IDF (Term Frequency-Inverse Document Frequency)**, to evaluate how important a word is to a document within a collection or corpus.

### Key Idea:
IDF helps determine whether a term is **common or rare** across multiple documents in a corpus:
- Words that appear in **many documents** are considered **less important** (less informative) because they are common.
- Words that appear in **fewer documents** are considered **more important** because they provide more unique information.

### Formula for IDF:
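$$
IDF(t) = \log \left( \frac{1 + N}{1 + df(t)} \right)
$$
Where:
- \(N\) = total number of documents in the corpus.
- \(df(t)\) = number of documents containing the term \(t\).
- \(1\) is added to both the numerator and the denominator. In the denominator it avoids division by zero in case the term doesn’t appear in any document; in the numerator it keeps the ratio at 1, and so the IDF at exactly 0, for a term that appears in every document. This smoothed form is what the code below computes.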

### Explanation:
- If a term appears in **all documents**, its IDF will be low (exactly 0 with the smoothed form computed in the code below), indicating that it is not very informative.
- If a term appears in **few documents**, its IDF will be high, meaning it has more significance for distinguishing between documents.

### Why IDF is Important:
IDF helps reduce the influence of common words like "the," "and," or "is," which might appear frequently in all documents but provide little value for distinguishing between them. When combined with **TF (Term Frequency)**, it highlights words that are more informative for a given document.





In [11]:
# Number of documents (rows in TF_df)
N = TF_df.shape[0]

# Calculate document frequency (df) for each term (number of non-zero entries per column)
df = (TF_df > 0).sum(axis=0)

# Calculate the Inverse Document Frequency (IDF) using the formula
IDF = np.log((1 + N) / (1 + df))
IDF

banana    0.405465
appl      0.000000
car       0.405465
dtype: float64
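
These values can be checked without pandas. The sketch below uses the smoothed form `log((1 + N) / (1 + df))` that the cell above computes; the document lists are written out by hand and the variable names are illustrative.

```python
import math

docs = [
    ["appl", "appl", "appl", "appl", "banana", "banana"],  # document A
    ["appl", "car", "car"],                                # document B
]

n_docs = len(docs)
vocab = sorted(set().union(*docs))

# Document frequency: in how many documents does each term occur at least once?
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) for term in vocab}

print(doc_freq)
print(idf)
```

'appl' occurs in both documents, so its ratio is 3/3 and its IDF is 0; 'banana' and 'car' each occur in one document, giving log(3/2).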

## TF-IDF

**TF-IDF** (Term Frequency-Inverse Document Frequency) is a measure that combines two metrics:
- **TF (Term Frequency)**: Reflects how frequently a term appears in a document.
- **IDF (Inverse Document Frequency)**: Reduces the weight of terms that appear in many documents, highlighting terms that are more unique to specific documents.

The **TF-IDF score** is calculated by multiplying the TF value of a term by its IDF value:

TF_IDF = TF*IDF


This helps identify terms that are both frequent within a document and rare across the corpus, making them more important for understanding document content.


A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. 




In [12]:
# Calculate the TF-IDF by multiplying TF values by the corresponding IDF values for each term
TF_IDF_df = TF_df * IDF

# Display the resulting TF-IDF DataFrame
TF_IDF_df


| | banana | appl | car |
|---|---|---|---|
| 0 | 0.135155 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.27031 |
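
As a cross-check, the whole pipeline from counts to TF-IDF fits in a short standard-library function. This is a sketch under the same assumptions as the notebook (relative term frequency, smoothed IDF); `tf_idf` and its variable names are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (relative TF, smoothed IDF)."""
    n_docs = len(docs)
    vocab = sorted(set().union(*docs))
    doc_freq = {t: sum(t in doc for doc in docs) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) for t in vocab}

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Relative frequency times IDF; an empty document scores 0 everywhere.
        scores.append(
            {t: ((counts[t] / total) * idf[t] if total else 0.0) for t in vocab}
        )
    return scores

bow_a = ["appl", "appl", "appl", "appl", "banana", "banana"]
bow_b = ["appl", "car", "car"]
scores = tf_idf([bow_a, bow_b])

# Report the highest-weighted term of each document.
for i, row in enumerate(scores):
    top_term = max(row, key=row.get)
    print(i, top_term, round(row[top_term], 6))
```

'appl' is the most frequent term in document A, yet it scores 0 in both documents because it appears in both, so 'banana' and 'car' end up as the distinguishing terms.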
