# Text Representations: One-Hot, Bag of Words, TF‑IDF

Learn simple, low-math ways to turn text into numbers, with beginner notes and tiny examples.

> Beginner quick start
>
> - Run cells from top to bottom. If you change earlier text, re-run later cells.
> - Each section has a short note on what to run and how to read the outputs.
> - Use the tiny 3-sentence corpus to experiment.

In [None]:
import numpy as np
import pandas as pd
import re
from typing import List

In [None]:
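def clean_tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into tokens made only of the letters a-z."""
    text = text.lower()
    # Replace every run of non-letters (digits, punctuation, whitespace) with one space.
    text = re.sub(r'[^a-z]+', ' ', text)
    # split() with no arguments also discards leading and trailing whitespace.
    return text.split()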
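
To see what this tokenizer keeps and drops, here is a small stand-alone sketch. It applies the same lowercase-and-regex steps inline rather than calling `clean_tokenize`, so it runs on its own. The second sample string and the names `samples` and `tokenized_samples` are made up for illustration.

```python
import re

# Same steps as the tokenizer above: lowercase, then turn every run of
# characters outside a-z into a single space, then split on whitespace.
samples = ["Awesome? Only a fool can say that", "Caf\u00e9 costs 5 euros!"]
tokenized_samples = [re.sub(r"[^a-z]+", " ", s.lower()).split() for s in samples]

for s, t in zip(samples, tokenized_samples):
    print(repr(s), "->", t)
```

Note that digits and accented letters are treated like punctuation, so "Café" becomes "caf" and "5" disappears. That is fine for the tiny English corpus used here, but it is worth keeping in mind for other text.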

## One-Hot Encoding

What it does:
Turns each word in a sentence into a simple list of numbers, where only one number is "on" (1) and the rest are "off" (0). This shows which word it is, without implying any meaning or similarity.

Why it's useful:
It's a basic way to let computers "see" which words are present so we can start working with text.

Example (tiny):
- Sentence: "I like pizza"
- Vocabulary: ["i", "like", "pizza"]

One‑hot matrix (rows = tokens in order, columns = unique words):

| Word | i | like | pizza |
|------|---|------|-------|
| i    | 1 | 0    | 0     |
| like | 0 | 1    | 0     |
| pizza| 0 | 0    | 1     |

How rows and tokens work
- Each row corresponds to a token (word) in the sentence, in order.
- Number of rows = number of tokens; number of columns = number of unique words (vocabulary).
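
The row and column rule above can be sketched in plain Python. This is a minimal illustration, not the notebook's own function: `one_hot_rows` and the second sentence (with a repeated word) are hypothetical, and the sketch splits on whitespace instead of using the regex tokenizer.

```python
def one_hot_rows(sentence):
    """Return (tokens, vocab, rows) with one 0/1 row per token."""
    toks = sentence.lower().split()
    vocab = sorted(set(toks))                      # unique words, alphabetical
    index = {w: j for j, w in enumerate(vocab)}    # word -> column number
    rows = [[int(index[t] == j) for j in range(len(vocab))] for t in toks]
    return toks, vocab, rows


toks_a, vocab_a, rows_a = one_hot_rows("I like pizza")
toks_b, vocab_b, rows_b = one_hot_rows("pizza I like pizza")

print(vocab_a, rows_a)
print(len(rows_b), "rows x", len(vocab_b), "columns")
```

With a repeated word, the second sentence has four rows but still only three columns, and its first and last rows are identical because both encode "pizza".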

Beginner notes: One‑Hot
- Run the next cell.
- Look for: tokens (cleaned words), vocabulary (unique words), and matrix shape (rows=tokens; columns=vocab).

In [None]:
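def one_hot_encoding(sentence):
    """Return (matrix, vocabulary, tokens) for a single sentence."""
    tokens = clean_tokenize(sentence)
    vocabulary = sorted(set(tokens))  # unique words, alphabetical
    word_to_index = {w: i for i, w in enumerate(vocabulary)}
    # One row per token, one column per vocabulary word.
    mat = np.zeros((len(tokens), len(vocabulary)), dtype=int)
    for i, w in enumerate(tokens):
        mat[i, word_to_index[w]] = 1  # switch on the column for this token
    return mat, vocabulary, tokens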


sentence = 'Should we go to a pizzeria or do you prefer a restaurant?'
one_hot_matrix, vocabulary, tokens = one_hot_encoding(sentence)
print('Tokens:', tokens)
print('Vocab ({}):'.format(len(vocabulary)), vocabulary)
print('One-Hot shape:', one_hot_matrix.shape)
one_hot_matrix

## Bag of Words (BoW)

What it does:
Counts how many times each word appears in each sentence (or document). It builds a vocabulary of unique words and then makes a table of counts.

Why it’s useful:
Lets us compare sentences by which words they contain and how often each one occurs. The counts can be used for search, clustering, or as input features for a model.

Tiny example (3 sentences)
1) "This movie is awesome awesome"
2) "I do not say is good, but neither awesome"
3) "Awesome? Only a fool can say that"

Vocabulary (sorted): ["a", "awesome", "but", "can", "do", "fool", "good", "i", "is", "movie", "neither", "not", "only", "say", "that", "this"]

Counts matrix (rows = sentences, columns = vocabulary):
| sentence | a | awesome | but | can | do | fool | good | i | is | movie | neither | not | only | say | that | this |
|---------:|:-:|:------:|:---:|:---:|:--:|:----:|:----:|:-:|:--:|:-----:|:-------:|:---:|:----:|:---:|:----:|:----:|
| 1        | 0 |   2    |  0  |  0  | 0  |  0   |  0   | 0 |  1 |   1   |    0    |  0  |  0   |  0  |  0   |  1   |
| 2        | 0 |   1    |  1  |  0  | 1  |  0   |  1   | 1 |  1 |   0   |    1    |  1  |  0   |  1  |  0   |  0   |
| 3        | 1 |   1    |  0  |  1  | 0  |  1   |  0   | 0 |  0 |   0   |    0    |  0  |  1   |  1  |  1   |  0   |

How to read it
- Row 1 has “awesome” twice and “is”, “movie”, and “this” once each.
- Row 2 has nine different words, each once: “awesome”, “but”, “do”, “good”, “i”, “is”, “neither”, “not”, and “say”.
- Row 3 has seven different words, each once: “a”, “awesome”, “can”, “fool”, “only”, “say”, and “that”.

Notes
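- Bag of Words ignores word order: “movie awesome” and “awesome movie” produce the same row of counts.
- It captures counts, not meaning. TF‑IDF reweights the counts by how distinctive each word is; embeddings are a separate technique that aims to capture meaning.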
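
A quick way to convince yourself that order is ignored is to count words with `collections.Counter` from the standard library. This is only an illustrative sketch with made-up variable names; the notebook's own implementation uses a NumPy matrix.

```python
from collections import Counter

# Two orderings of the same words give identical counts.
bag_a = Counter("movie awesome".split())
bag_b = Counter("awesome movie".split())

# Repeating a word changes the counts, so the bag is different.
bag_c = Counter("awesome awesome movie".split())

print(bag_a == bag_b)
print(bag_a == bag_c)
```

The first comparison is true and the second is false: Bag of Words is blind to order but not to how many times a word occurs.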

Beginner notes: BoW
- Run the next two cells.
- The table shows rows=sentences and columns=words; values are counts.

In [None]:
def bag_of_words(sentences):
    tokenized = [clean_tokenize(s) for s in sentences]
    vocab = sorted(set(w for sent in tokenized for w in sent))
    w2i = {w: i for i, w in enumerate(vocab)}
    mat = np.zeros((len(sentences), len(vocab)), dtype=int)
    for i, sent in enumerate(tokenized):
        for w in sent:
            mat[i, w2i[w]] += 1
    return vocab, mat


corpus = [
    'This movie is awesome awesome',
    'I do not say is good, but neither awesome',
    'Awesome? Only a fool can say that'
]
vocab_bow, bow_matrix = bag_of_words(corpus)
print('Vocab size:', len(vocab_bow), '| Matrix:', bow_matrix.shape)

In [None]:
bow_df = pd.DataFrame(bow_matrix, columns=vocab_bow)
bow_df

## Term Frequency (TF) and TF‑IDF

What it does:
- TF: Shows how often each word appears in a sentence, relative to the sentence length.
- IDF: Downweights words that appear in many sentences; upweights words that are rare across sentences.
- TF‑IDF = TF × IDF: Highlights words that are frequent in one sentence but uncommon overall.

Why it’s useful:
Helps find words that are distinctive to each sentence, useful for search, summarization, and feature engineering.

How to read TF and TF‑IDF
- TF: Within one sentence, count each word and divide by total words in that sentence (row sums ≈ 1.0).
- IDF: Higher for rarer words; equals 1.0 for words present in every sentence (with smoothing).
- TF‑IDF: Large when a word is frequent in a sentence and rare across others.
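
The reading guide above can be checked with a few lines of plain Python. This sketch assumes the sklearn-style smoothed formula idf = ln((1 + N) / (1 + df)) + 1 used by this notebook's code; the helper name `smoothed_idf` and the variable names are hypothetical.

```python
import math


def smoothed_idf(n_docs, df):
    """idf = ln((1 + N) / (1 + df)) + 1, where df = number of sentences with the word."""
    return math.log((1 + n_docs) / (1 + df)) + 1.0


# TF for sentence 1: each word's count divided by the sentence length.
demo_sentence = "this movie is awesome awesome".split()
tf_row = {w: demo_sentence.count(w) / len(demo_sentence) for w in set(demo_sentence)}
tf_sum = sum(tf_row.values())
print(tf_row, "sum =", round(tf_sum, 6))

# IDF for a word found in 1, 2, or all 3 of 3 sentences.
idf_by_df = {df: round(smoothed_idf(3, df), 3) for df in (1, 2, 3)}
print(idf_by_df)
```

The TF values for the sentence add up to 1.0, and the IDF falls from about 1.693 (word in one sentence) to 1.288 (two sentences) to exactly 1.0 (all three).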

Tiny example (same 3 sentences)
1) "this movie is awesome awesome"
2) "i do not say is good, but neither awesome"
3) "awesome? only a fool can say that"

Intuition
- “awesome” appears in all sentences, so its IDF is lower than a word appearing in only one sentence.
- A word that repeats in one sentence can still get a higher TF‑IDF for that sentence.

Tiny numeric walk‑through for “awesome”
- Sentence lengths: 5, 9, 7 → TF1=2/5=0.40, TF2=1/9≈0.11, TF3=1/7≈0.14
- With sklearn‑style smoothing, idf = ln((1+N)/(1+df)) + 1 (natural log). Here N=3 and df(awesome)=3, so IDF = ln(1) + 1 = 1.0
- TF‑IDF values: 0.40, ~0.11, ~0.14 (highest in sentence 1 due to repetition)
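
The walk-through can be reproduced in plain Python. This is a sketch under the same assumptions (relative-frequency TF and the smoothed natural-log IDF); `tfidf_score` and `docs` are hypothetical names, and the sentences are written already cleaned so the block runs without the tokenizer.

```python
import math

docs = [
    "this movie is awesome awesome".split(),
    "i do not say is good but neither awesome".split(),
    "awesome only a fool can say that".split(),
]


def tfidf_score(word, doc, all_docs):
    """TF (count / sentence length) times smoothed IDF for one word in one sentence."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in all_docs if word in d)
    idf = math.log((1 + len(all_docs)) / (1 + df)) + 1.0
    return tf * idf


awesome_scores = [round(tfidf_score("awesome", d, docs), 2) for d in docs]
movie_score = round(tfidf_score("movie", docs[0], docs), 2)
print("awesome:", awesome_scores)
print("movie in sentence 1:", movie_score)
```

For comparison, "movie" occurs in only one sentence (IDF ≈ 1.69) but just once in sentence 1, so it scores about 0.34. The repeated "awesome" still edges it out at 0.40 in that sentence.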

Beginner notes: TF‑IDF
- Run the next cells to compute and view tables.
- A larger TF‑IDF means the word is frequent in that sentence, rare across the other sentences, or both.
- Optional: print top TF‑IDF words per sentence to see the standouts.

In [None]:
def compute_tf(sentences):
    tokenized = [clean_tokenize(s) for s in sentences]
    vocab = sorted(set(w for sent in tokenized for w in sent))
    w2i = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(sentences), len(vocab)), dtype=np.float32)
    for i, words in enumerate(tokenized):
        n = max(1, len(words))
        for w in words:
            tf[i, w2i[w]] += 1.0 / n
    return tf, vocab


def compute_idf(sentences, vocab):
    token_sets = [set(clean_tokenize(s)) for s in sentences]
    N = len(sentences)
    w2i = {w: i for i, w in enumerate(vocab)}
    idf = np.zeros(len(vocab), dtype=np.float32)
    for w in vocab:
        df = sum(1 for s in token_sets if w in s)
        # sklearn-style smoothing: a word present in every sentence gets idf = 1.0
        idf[w2i[w]] = np.log((1 + N) / (1 + df)) + 1.0
    return idf


def tf_idf(sentences):
    tf, vocab = compute_tf(sentences)
    idf = compute_idf(sentences, vocab)
    return vocab, tf * idf


vocabulary, tf_idf_matrix = tf_idf(corpus)
tf_matrix, tf_vocab = compute_tf(corpus)
idf_vector = compute_idf(corpus, tf_vocab)
print('TF-IDF shape:', tf_idf_matrix.shape)
tf_df = pd.DataFrame(tf_matrix, columns=tf_vocab)
idf_df = pd.DataFrame([idf_vector], columns=tf_vocab)
tfidf_df = pd.DataFrame(tf_idf_matrix, columns=vocabulary)
display(tf_df.round(2))
display(idf_df.round(2))
display(tfidf_df.round(2))

In [None]:
# Show top TF-IDF terms per sentence (numpy is already imported as np in the first cell)


def top_tfidf_per_doc(vocab, tfidf_matrix, top_k=3):
    for i in range(tfidf_matrix.shape[0]):
        row = tfidf_matrix[i]
        idxs = np.argsort(-row)[:top_k]
        pairs = [(vocab[j], float(row[j])) for j in idxs if row[j] > 0]
        print(f'Sentence {i+1}: {pairs}')


top_tfidf_per_doc(vocabulary, tf_idf_matrix, top_k=3)