# Goals of this notebook
- What is Text Feature Extraction
- TF-IDF implementation from scratch
- TF-IDF implementation using scikit-learn

## 1. Text Feature Extraction
Text feature extraction in Natural Language Processing (NLP) refers to the process of transforming raw text data into numerical representations or features that machine learning models can understand and process. This is a crucial step in NLP pipelines as it allows the algorithms to learn from and make predictions based on text data.

### 1.1 Count Vectorization
Count Vectorization (also known as Count Vectors) is a method of converting text into numerical features by counting the occurrences of each word in the text. This method captures the frequency of each word within a document but does not account for word order or semantics.

### 1.1.1 Tokenization:
The text is split into individual words or tokens. For example, "The cat sat on the mat" would be tokenized into ["The", "cat", "sat", "on", "the", "mat"].

### 1.1.2 Vocabulary Creation:
A vocabulary is built from all unique tokens in the corpus. For instance, if the corpus consists of multiple documents, the vocabulary might include all unique words across all documents.

### 1.1.3 Vectorization:
Each document is represented as a vector where each dimension corresponds to a word in the vocabulary. The value in each dimension is the count of occurrences of that word in the document.
For example, if the vocabulary is ["The", "cat", "sat", "on", "the", "mat"], the document "The cat sat on the mat" is represented as [1, 1, 1, 1, 1, 1]: "The" and "the" are distinct vocabulary entries, so each is counted once. If the text were lowercased first, the vocabulary would shrink to ["the", "cat", "sat", "on", "mat"] and the vector would be [2, 1, 1, 1, 1].

### 1.1.4 Example:
For a small corpus with two documents:
- Document 1: "The cat sat on the mat"
- Document 2: "The cat is on the mat"

The vocabulary might be ["The", "cat", "sat", "on", "the", "mat", "is"]. Notice that "The" and "the" are considered different due to case sensitivity, so each of them is counted once per document.

The count vectors are:

- Document 1: [1, 1, 1, 1, 1, 1, 0]
- Document 2: [1, 1, 0, 1, 1, 1, 1]
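
The tokenization, vocabulary, and vectorization steps can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and case-sensitive matching (so "The" and "the" stay distinct); the helper names `tokenize`, `build_vocab`, and `count_vector` are made up for this example.

```python
def tokenize(text):
    # naive whitespace tokenizer (assumption: no punctuation to handle)
    return text.split()


def build_vocab(docs):
    # map each unique token to a column index, in order of first appearance
    vocab = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def count_vector(doc, vocab):
    # one dimension per vocabulary entry, holding that token's count
    vec = [0] * len(vocab)
    for token in tokenize(doc):
        vec[vocab[token]] += 1
    return vec


docs = ["The cat sat on the mat", "The cat is on the mat"]
vocab = build_vocab(docs)
vectors = [count_vector(d, vocab) for d in docs]

print(list(vocab))
print(vectors)
```

Lowercasing each document before tokenizing would merge "The" and "the" into a single column with a count of 2.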

## 1.2 Document-Term Matrix (DTM)
The Document-Term Matrix (DTM) is a tabular representation of the text data where rows represent documents and columns represent terms (words). Each cell in the matrix indicates the count of a specific term in a specific document.

### 1.2.1 Constructing the Matrix:
- Each row corresponds to a document.
- Each column corresponds to a term in the vocabulary.
- The cell value at position (i, j) represents the count of term j in document i.

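### 1.2.2 Matrix Representation:
The DTM is usually a sparse matrix: with a large vocabulary, any single document contains only a small fraction of the terms, so most cells are zero.

Example, using the same two documents as above (case-sensitive, so "The" and "the" are separate columns):

|            | The | cat | sat | on | the | mat | is |
|------------|-----|-----|-----|----|-----|-----|----|
| Document 1 | 1   | 1   | 1   | 1  | 1   | 1   | 0  |
| Document 2 | 1   | 1   | 0   | 1  | 1   | 1   | 1  |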

Here, each cell represents the count of the corresponding term in the respective document.
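
As a rough illustration of why sparsity matters, the sketch below stores the same two-document DTM as a dictionary that holds only the non-zero cells. The `sparse_dtm` layout and the `cell` helper are assumptions made for this example, not a standard format.

```python
from collections import Counter

docs = ["The cat sat on the mat", "The cat is on the mat"]

# vocabulary in order of first appearance (case-sensitive, whitespace tokens)
vocab = {}
for doc in docs:
    for token in doc.split():
        vocab.setdefault(token, len(vocab))

# sparse DTM: keep only non-zero cells as {(doc_index, term_index): count}
sparse_dtm = {}
for i, doc in enumerate(docs):
    for token, count in Counter(doc.split()).items():
        sparse_dtm[(i, vocab[token])] = count


def cell(i, j):
    # entries that are not stored are implicit zeros
    return sparse_dtm.get((i, j), 0)


total_cells = len(docs) * len(vocab)
print(f"{len(sparse_dtm)} of {total_cells} cells are non-zero")
print(cell(0, vocab["sat"]), cell(1, vocab["sat"]))
```

With only two short documents almost every cell is filled, but on a real corpus with tens of thousands of terms the stored fraction becomes very small.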

## 1.3 Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a numerical statistic used in text processing to evaluate the importance of a word within a document relative to a collection of documents (corpus). It combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF).

### 1.3.1 Term Frequency (TF)
Term Frequency measures how frequently a term occurs in a document. The simplest form of TF is:
$$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

Where:
- t is the term (word).
- d is the document.

TF provides a basic measure of how important a term is within a specific document, but it doesn’t account for the term’s importance across the entire corpus.
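
A minimal sketch of this formula, assuming the document has already been tokenized into a list of strings; `term_frequency` is a hypothetical helper name.

```python
from collections import Counter


def term_frequency(term, tokens):
    # occurrences of `term` divided by the total number of tokens
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


tokens = "the cat sat on the mat".split()
print(term_frequency("the", tokens))  # 2 of 6 tokens
print(term_frequency("cat", tokens))  # 1 of 6 tokens
```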

### 1.3.2 Inverse Document Frequency (IDF)
Inverse Document Frequency measures the importance of a term across the whole corpus. It helps to adjust for the fact that some words are very common and may not be very informative. The IDF of a term is calculated as:
$$ \text{IDF}(t, D) = \log \frac{\text{Total number of documents } |D|}{\text{Number of documents containing term } t} $$

Where:
- |D| is the total number of documents in the corpus.
- The denominator is the number of documents that contain the term t.

IDF lowers the weight of terms that occur in many documents (like “the” or “is”), since such terms carry little information about any single document. A term that appears in every document gets an IDF of log(1) = 0.
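
The IDF formula can be sketched as follows, assuming a pre-tokenized, lowercased toy corpus and the natural logarithm (the base only rescales the values). The function name `inverse_document_frequency` is made up for this example.

```python
import math


def inverse_document_frequency(term, tokenized_docs):
    # number of documents that contain the term at least once
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(tokenized_docs) / df)


corpus = [
    "the cat sat on the mat".split(),
    "the cat is on the mat".split(),
    "the dog sat on the log".split(),
]
print(inverse_document_frequency("the", corpus))  # in every document
print(inverse_document_frequency("dog", corpus))  # in one of three documents
```

A term missing from the corpus would make the denominator zero, which is why the sketch raises an error instead of dividing.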

### 1.3.3 TF-IDF Calculation
The TF-IDF score for a term in a document is the product of its TF and IDF values:
$$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$

This score reflects both the term's importance in the specific document and its rarity across the corpus.

### 1.3.4 TF-IDF Calculation Example
Let's go through an example with a small corpus:

Corpus:
- "The cat sat on the mat"
- "The cat is on the mat"
- "The dog sat on the log"

**Term Frequency (TF):**

For the term “cat” in Document 1:

$$ \text{TF}(\text{cat}, \text{Document 1}) = \frac{1}{6} $$

(1 occurrence out of 6 total words)

For the term “cat” in Document 2:

$$ \text{TF}(\text{cat}, \text{Document 2}) = \frac{1}{6} $$

(1 occurrence out of 6 total words)

**Inverse Document Frequency (IDF):**

- Total number of documents $|D| = 3$
- Number of documents containing “cat” = 2

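IDF for “cat” (using a base-10 logarithm for this worked example):

$$ \text{IDF}(\text{cat}, D) = \log_{10} \frac{|D|}{\text{Number of documents containing term } \text{cat}} = \log_{10} \frac{3}{2} \approx 0.176 $$

The code later in this notebook uses the natural logarithm (`np.log`), which gives $\ln \frac{3}{2} \approx 0.405$ instead. The choice of base only rescales all scores by a constant factor, so the ranking of terms is unchanged.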

**TF-IDF Calculation:**

For “cat” in Document 1:

$$ \text{TF-IDF}(\text{cat}, \text{Document 1}, D) = \text{TF}(\text{cat}, \text{Document 1}) \times \text{IDF}(\text{cat}, D) = \frac{1}{6} \times 0.176 \approx 0.029 $$

For “cat” in Document 2:

$$ \text{TF-IDF}(\text{cat}, \text{Document 2}, D) = \text{TF}(\text{cat}, \text{Document 2}) \times \text{IDF}(\text{cat}, D) = \frac{1}{6} \times 0.176 \approx 0.029 $$

In this example, the TF-IDF score for "cat" in both documents is the same. If "cat" appeared in fewer documents, its IDF would be higher, and thus its TF-IDF score would be higher as well, reflecting its greater importance in those specific documents.
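
The numbers in this worked example can be checked with a few lines of Python. This is a sketch under stated assumptions: the text is lowercased and split on whitespace, and the logarithm is base 10 to match the 0.176 figure; `tf` and `idf` are illustrative helper names.

```python
import math

corpus = [
    "The cat sat on the mat",
    "The cat is on the mat",
    "The dog sat on the log",
]
# lowercase and split on whitespace (assumption: no punctuation to strip)
tokenized = [doc.lower().split() for doc in corpus]


def tf(term, tokens):
    # occurrences of the term divided by the document length
    return tokens.count(term) / len(tokens)


def idf(term, docs, log=math.log10):
    # documents containing the term; the log base is configurable
    df = sum(term in tokens for tokens in docs)
    return log(len(docs) / df)


for i, tokens in enumerate(tokenized, start=1):
    score = tf("cat", tokens) * idf("cat", tokenized)
    print(f"Document {i}: TF-IDF(cat) = {score:.3f}")
```

Document 3 does not contain "cat", so its TF and therefore its TF-IDF score are zero. Passing `log=math.log` switches to the natural logarithm and yields roughly 0.068 for Documents 1 and 2.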

Fortunately, scikit-learn can perform all of these calculations for us, but we will first implement them from scratch.

## 2. TF-IDF implementation from scratch

In [1]:
import numpy as np
import pandas as pd
import spacy

In [2]:
# we only need spaCy's tokenizer, so disable the pipeline components we don't use
# (note: tok2vec is not in the disable list, so it stays loaded)
nlp = spacy.load(
    "en_core_web_md",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
)

In [3]:
nlp.pipe_names

['tok2vec']

In [4]:
def tokenize_and_remove_punkt(text):
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_punct]
    return tokens

The dataset is the [Medium Articles dataset on Kaggle](https://www.kaggle.com/datasets/hsankesara/medium-articles).

In [5]:
df = pd.read_csv("./medium_articles.csv")

In [6]:
df.head()

| Unnamed: 0 | author | claps | reading_time | link | title | text |
|---|---|---|---|---|---|---|
| 0 | Justin Lee | 8.3K | 11 | https://medium.com/swlh/chatbots-were-the-next... | Chatbots were the next big thing: what happene... | Oh, how the headlines blared:\nChatbots were T... |
| 1 | Conor Dewey | 1.4K | 7 | https://towardsdatascience.com/python-for-data... | Python for Data Science: 8 Concepts You May Ha... | If you’ve ever found yourself looking up the s... |
| 2 | William Koehrsen | 2.8K | 11 | https://towardsdatascience.com/automated-featu... | Automated Feature Engineering in Python – Towa... | Machine learning is increasingly moving from h... |
| 3 | Gant Laborde | 1.3K | 7 | https://medium.freecodecamp.org/machine-learni... | Machine Learning: how to go from Zero to Hero ... | If your understanding of A.I. and Machine Lear... |
| 4 | Emmanuel Ameisen | 935 | 11 | https://blog.insightdatascience.com/reinforcem... | Reinforcement Learning from scratch – Insight ... | Want to learn about applied Artificial Intelli... |


In [7]:
# create vocab and tokenize the docs
vocab = {}
idx = 0
tokenized_docs = []

for doc in df["text"]:
    tokens = tokenize_and_remove_punkt(doc.lower())
    doc_tokens = []
    
    for token in tokens:
        if token not in vocab:
            vocab[token] = idx
            idx += 1
        
        doc_tokens.append(vocab[token])
    
    tokenized_docs.append(doc_tokens)

In [8]:
# reverse mapping (index to word)
# dicts preserve insertion order, so words[i] is the token with index i
words = list(vocab)

In [9]:
# N = no of documents, V = size of vocabulary
N = len(df["text"])
V = len(vocab)

In [10]:
# term frequency matrix (dense)
term_freq = np.zeros((N, V))

Recall that each cell in the term frequency matrix holds the number of times a word occurs in a specific document:
- each row represents a document (e.g. document 1, document 2)
- each column represents a word (identified by its vocabulary index)

### Term Frequency Matrix

|       | this | is | a  | sample | sentence | another | example | different |
|-------|------|----|----|--------|----------|---------|---------|-----------|
| Doc 1 | 1    | 1  | 1  | 1      | 1        | 0       | 0       | 0         |
| Doc 2 | 1    | 1  | 0  | 0      | 1        | 1       | 1       | 0         |
| Doc 3 | 1    | 1  | 0  | 0      | 0        | 0       | 1       | 1         |

In [11]:
# fill the term frequency matrix with the occurrence of words
for doc_idx, tokenized_doc in enumerate(tokenized_docs):
    for token_idx in tokenized_doc:
        term_freq[doc_idx, token_idx] += 1

The line `doc_freq = np.sum(term_freq > 0, axis=0)` in the next cell counts, for each word, the number of documents in which it appears. For example, in the table above "this" appears in all three documents, so its document frequency is 3.

In [12]:
# calculate IDF (inverse document frequency)
doc_freq = np.sum(term_freq > 0, axis=0)

# NumPy broadcasts the scalar N over the array, i.e. it computes N / doc_freq[j] for every term j
idf = np.log(N / doc_freq)
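
To see what `np.sum(term_freq > 0, axis=0)` and the broadcast division do, here is the same computation on the three-document toy table shown earlier. The `toy_*` names are introduced only for this illustration so that they do not overwrite the notebook's variables.

```python
import numpy as np

# toy term-frequency matrix from the table above (3 documents, 8 terms)
toy_tf = np.array([
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 1, 1],
])
n_docs = toy_tf.shape[0]

# boolean mask summed down the rows: number of documents containing each term
toy_doc_freq = np.sum(toy_tf > 0, axis=0)
# scalar / vector broadcasts to one IDF value per term
toy_idf = np.log(n_docs / toy_doc_freq)
# (3, 8) * (8,) broadcasts the IDF vector across every document row
toy_tf_idf = toy_tf * toy_idf

print(toy_doc_freq)
print(np.round(toy_idf, 3))
```

Terms such as "this" and "is" occur in every document, so their IDF is zero and they drop out of the TF-IDF matrix entirely.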

In [13]:
# here each value in the array represents the idf of the word
idf

array([2.87564395, 0.        , 0.01494796, ..., 5.82008293, 5.82008293,
       5.82008293])

In [14]:
# each document row (raw counts, not normalized by document length) is
# multiplied element-wise by the idf vector
tf_idf = term_freq * idf

In [15]:
# let's test this out
random_idx = np.random.choice(N)
row = df.iloc[random_idx]

print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])

scores = tf_idf[random_idx]
top_ten = (-scores).argsort()[:10]

print("\n")
print("Top ten words: ", [words[idx] for idx in top_ten])

Label:  Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data


Starting text:  Over the past few months, I have been collecting AI cheat sheets. From time to time I share them with friends and colleagues and recently I have been getting asked a lot, so I decided to organize and share the entire collection. To make things more interesting and give context, I added descriptions and/or excerpts for each major topic.


Top ten words:  ['cheat', 'sheet', 'numpy', 'matplotlib', 'scipy', 'pandas', 'matlab', 'tpus', 'chatbot', 'scikit']


## 3. Using a sparse matrix from SciPy

The dense matrix above allocates N × V floats even though most of them are zero. Sparse matrices store only the non-zero elements (plus their positions), which saves a large amount of memory when few entries are non-zero.
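
The memory difference can be illustrated with a small experiment. This is a sketch with made-up data (a 1000 × 1000 matrix holding three non-zero values); the exact byte counts depend on the index dtype SciPy picks, so none are quoted here.

```python
import numpy as np
from scipy.sparse import csr_matrix

# a mostly-zero matrix: 1000 x 1000 with only three non-zero entries
dense = np.zeros((1000, 1000))
dense[0, 5] = 2.0
dense[10, 7] = 1.0
dense[999, 0] = 3.0

sparse = csr_matrix(dense)

# CSR keeps three arrays: values, column indices, and row pointers
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print("stored elements:", sparse.nnz)
print("dense bytes:", dense.nbytes)
print("sparse bytes:", sparse_bytes)
```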

In [16]:
from collections import defaultdict
from scipy.sparse import csr_matrix

In [17]:
# for SciPy we collect (value, row, col) triplets for the non-zero entries only
data = []
rows = []
cols = []

for doc_idx, tokenized_doc in enumerate(tokenized_docs):
    term_counts = defaultdict(int)
    
    for token_idx in tokenized_doc:
        term_counts[token_idx] += 1
    
    for token_idx, count in term_counts.items():
        data.append(count)
        rows.append(doc_idx)
        cols.append(token_idx)

In [18]:
sparse_term_freq = csr_matrix((data, (rows, cols)), shape=(N, V))

In [19]:
binary_term_freq = (sparse_term_freq > 0).astype(int)

# sum along the cols and convert to 1D array by flattening it
document_freq = np.array(binary_term_freq.sum(axis=0)).flatten()

In [20]:
idf = np.log(N / document_freq)

In [21]:
tf_idf = sparse_term_freq.multiply(idf)

In [22]:
rand_idx = np.random.choice(N)
row = df.iloc[rand_idx]

print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])

scores = tf_idf.getrow(rand_idx).toarray().flatten()
top_five = (-scores).argsort()[:5]

print("\n")
print("Top five words: ", [words[idx] for idx in top_five])

Label:  TensorFlow Tutorial— Part 1 – Illia Polosukhin – Medium


Starting text:  UPD (April 20, 2016): Scikit Flow has been merged into TensorFlow since version 0.8 and now called TensorFlow Learn or tf.learn.


Top five words:  ['scikit', 'tf.learn', 'tensorflow', 'flow', 'ipython']


## 4. Convert everything to a neat class

In [32]:
import numpy as np
import spacy
from collections import defaultdict
from scipy.sparse import csr_matrix

class TFIDF:
    def __init__(self):
        self.vocab = {}
        self.tokenized_docs = []
        self.reverseMap = []
        self.sparse_matrix = None
        self.tf_idf = None
        self.nlp = spacy.load(
            "en_core_web_md",
            disable=["tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
        )
        
    def tokenize_and_remove_punkt(self, text):
        doc = self.nlp(text)
        tokens = [token.text for token in doc if not token.is_punct]
        return tokens
        
    def create_vocabulary(self, documents):
        idx = 0
        
        for doc in documents:
            tokens = self.tokenize_and_remove_punkt(doc.lower())
            doc_tokens = []

            for token in tokens:
                if token not in self.vocab:
                    self.vocab[token] = idx
                    idx += 1

                doc_tokens.append(self.vocab[token])

            self.tokenized_docs.append(doc_tokens)
        
        self.reverseMap = [key for key in self.vocab.keys()]
        
    def create_sparse_matrix(self):
        data = []
        rows = []
        cols = []

        # use the instance's own state rather than notebook-level globals
        for doc_idx, tokenized_doc in enumerate(self.tokenized_docs):
            term_counts = defaultdict(int)

            for token_idx in tokenized_doc:
                term_counts[token_idx] += 1

            for token_idx, count in term_counts.items():
                data.append(count)
                rows.append(doc_idx)
                cols.append(token_idx)

        n_docs = len(self.tokenized_docs)
        n_terms = len(self.vocab)
        self.sparse_matrix = csr_matrix((data, (rows, cols)), shape=(n_docs, n_terms))

    def calculate_tf_idf(self):
        n_docs = self.sparse_matrix.shape[0]
        binary_term_freq = (self.sparse_matrix > 0).astype(int)
        document_freq = np.array(binary_term_freq.sum(axis=0)).flatten()
        idf = np.log(n_docs / document_freq)

        # keep the result in CSR format so that row access stays cheap
        self.tf_idf = self.sparse_matrix.multiply(idf).tocsr()
        
    def fit_transform(self, documents):
        self.create_vocabulary(documents)
        self.create_sparse_matrix()
        self.calculate_tf_idf()
        
        return self.tf_idf
    
    def get_feature_names(self):
        return self.reverseMap

In [33]:
tfidf_vectorizer = TFIDF()

In [34]:
matrix = tfidf_vectorizer.fit_transform(df["text"])

In [35]:
feature_names = tfidf_vectorizer.get_feature_names()

In [38]:
rand_idx = np.random.choice(N)
row = df.iloc[rand_idx]

print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])

scores = matrix.getrow(rand_idx).toarray().flatten()
top_five = (-scores).argsort()[:5]

print("\n")
print("Top five words: ", [feature_names[idx] for idx in top_five])

Label:  A Rock Album For AI – Carlos Beltran – Medium


Starting text:  https://open.spotify.com/album/0jwnYwJz6XHNrVAYEclQPd


Top five words:  ['album', 'song', 'kurzweil', 'simulation', 'tim']


## 5. TF-IDF using scikit-learn

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [40]:
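tfidf_vectorizer = TfidfVectorizer(
    max_df=0.9,         # ignore terms that appear in more than 90% of the documents
    min_df=0.1,         # ignore terms that appear in fewer than 10% of the documents
    max_features=2000   # keep at most the 2000 most frequent remaining terms
)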

In [41]:
tfidf_mat = tfidf_vectorizer.fit_transform(df["text"])

In [42]:
tfidf_mat

<337x1280 sparse matrix of type '<class 'numpy.float64'>'
	with 109373 stored elements in Compressed Sparse Row format>

In [43]:
feature_names = tfidf_vectorizer.get_feature_names_out()

In [56]:
rand_idx = np.random.choice(tfidf_mat.shape[0])
row = df.iloc[rand_idx]

print("Label: ", row["title"].split("\n", 1)[0])
print("\n")
print("Starting text: ", row["text"].split("\n", 1)[0])

tfidf_vector = tfidf_mat.getrow(rand_idx).toarray().flatten()

top_indices = (-tfidf_vector).argsort()[:5]

top_words = [feature_names[i] for i in top_indices]

print("\n")
print("Top five words: ", top_words)

Label:  What Are The Best Intelligent Chatbots or AI Chatbots Available Online?


Starting text:  How do we define the intelligence of a chatbot? You can see a lot of articles about what would make a chatbot “appear intelligent.” A chatbot is intelligent when it becomes aware of user needs. Its intelligence is what gives the chatbot the ability to handle any scenario of a conversation with ease.


Top five words:  ['bot', 'intelligent', 'conversation', 'most', 'click']
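
The scores from `TfidfVectorizer` will not match the from-scratch values, even for the same term and document. With its default settings (`smooth_idf=True`, `norm="l2"`), scikit-learn uses the raw count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each document row to unit Euclidean length. The sketch below reproduces that weighting in plain Python on a tiny corpus; the helper name `sklearn_style_tfidf` is hypothetical, and it assumes whitespace tokens rather than scikit-learn's own tokenizer.

```python
import math
from collections import Counter


def sklearn_style_tfidf(tokenized_docs):
    n = len(tokenized_docs)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    # smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = [
    "the cat sat on the mat".split(),
    "the cat is on the mat".split(),
    "the dog sat on the log".split(),
]
rows = sklearn_style_tfidf(docs)
print(sorted(rows[2].items(), key=lambda kv: -kv[1])[:3])
```

One visible consequence of the smoothing: a term that occurs in every document keeps an IDF of 1 instead of 0, so it is down-weighted but never removed.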


## The End