# Text Representations: One-Hot, Bag of Words, TF/IDF

Learn simple, low-math ways to turn text into numbers, with beginner notes and tiny examples.

> Beginner quick start
>
> - Run cells from top to bottom. If you change earlier text, re-run later cells.
> - Each section has a short note on what to run and how to read the outputs.
> - Use the tiny 3-sentence corpus to experiment.

In [None]:
import numpy as np
import pandas as pd
import re
from typing import List

In [None]:
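def clean_tokenize(text: str) -> List[str]:
    """Lowercase the text, keep only the letters a-z, and split into tokens."""
    text = text.lower()
    # Replace every run of non-letters (punctuation, digits, spaces) with one space
    text = re.sub(r'[^a-z]+', ' ', text)
    # Collapse any leftover whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text.split()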

## One‑Hot Encoding

**What it does**
Turns each word in a sentence into a list of numbers where exactly one position is “on” (`1`) and all others are “off” (`0`). This simply marks which word it is—no extra meaning or math is added.

**Why it’s useful**
It’s the most basic way to turn text into numbers so a computer can start processing it.

**Example sentence**

> “Should we go to a pizzeria or do you prefer a restaurant?”

* **Vocabulary** = all unique words in the sentence (sorted).
* **One-hot matrix** = rows represent tokens (words in order), columns represent vocabulary; each row has a single `1` in the column for its word.

---

### How rows and tokens work

* **Rows** → words (tokens) in the sentence, in order.
* **Number of rows** = number of tokens in your sentence.
* **Number of columns** = number of unique words in the vocabulary.

### What are “tokens”?

* Tokens are the individual words after light cleaning (e.g., lowercase, letters only).
* When you run the next cell, you’ll see them printed as:
  `Tokens: [...]`
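
As a rough illustration of "light cleaning", the sketch below repeats the same steps as `clean_tokenize` (lowercase, keep letters a-z only, split on whitespace) in a standalone helper. The name `simple_tokens` is a hypothetical stand-in used only for this example.

```python
import re


def simple_tokens(text):
    # Lowercase, replace every run of non-letters with a space, then split.
    return re.sub(r'[^a-z]+', ' ', text.lower()).split()


example_tokens = simple_tokens(
    "Should we go to a pizzeria or do you prefer a restaurant?")
print("Tokens:", example_tokens)
print("Token count:", len(example_tokens))         # 12
print("Unique words:", len(set(example_tokens)))   # 11
```

The token count is larger than the number of unique words because "a" appears twice.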

### Understanding the one-hot output

* Table structure:

  * Rows = tokens (words in order)
  * Columns = vocabulary (unique words)
* Each row has **exactly one** `1` in the column for that word; all other positions are `0`.
* The printed shape follows:
  `(number_of_tokens, vocabulary_size)`
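
The shape rule can be checked with a compact alternative construction. This is a sketch, not the notebook's own function: it assumes the tokens of the example sentence are typed in by hand, and it uses `np.eye` row indexing (a common NumPy idiom) in place of an explicit loop.

```python
import numpy as np

toks = ["should", "we", "go", "to", "a", "pizzeria",
        "or", "do", "you", "prefer", "a", "restaurant"]
vocab = sorted(set(toks))
index = {w: i for i, w in enumerate(vocab)}

# Row i of the identity matrix is the one-hot vector for vocabulary word i,
# so picking rows by word index stacks one one-hot vector per token.
one_hot = np.eye(len(vocab), dtype=int)[[index[w] for w in toks]]

print(one_hot.shape)         # (12, 11)
print(one_hot.sum(axis=1))   # every row sums to 1
```

The two rows for "a" (positions 5 and 11 in the sentence) come out identical, which is what the duplicate rows in the table show.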

---

> **Beginner tips**
>
> * Run the next cell to see tokens, vocabulary size, and the one-hot table.
> * If you change the sentence, re-run the cell to refresh tokens and vocabulary.

**Example**

Sentence: "Should we go to a pizzeria or do you prefer a restaurant?"

Vocabulary (unique, sorted): ["a", "do", "go", "or", "pizzeria", "prefer", "restaurant", "should", "to", "we", "you"]

Tokens (in order): ["should", "we", "go", "to", "a", "pizzeria", "or", "do", "you", "prefer", "a", "restaurant"]

One‑hot encoding matrix (rows = tokens in order, columns = vocabulary):

| Word       | a | do | go | or | pizzeria | prefer | restaurant | should | to | we | you |
|------------|---|----|----|----|----------|--------|------------|--------|----|----|-----|
| should     | 0 | 0  | 0  | 0  | 0        | 0      | 0          | 1      | 0  | 0  | 0   |
| we         | 0 | 0  | 0  | 0  | 0        | 0      | 0          | 0      | 0  | 1  | 0   |
| go         | 0 | 0  | 1  | 0  | 0        | 0      | 0          | 0      | 0  | 0  | 0   |
| to         | 0 | 0  | 0  | 0  | 0        | 0      | 0          | 0      | 1  | 0  | 0   |
| a          | 1 | 0  | 0  | 0  | 0        | 0      | 0          | 0      | 0  | 0  | 0   |
| pizzeria   | 0 | 0  | 0  | 0  | 1        | 0      | 0          | 0      | 0  | 0  | 0   |
| or         | 0 | 0  | 0  | 1  | 0        | 0      | 0          | 0      | 0  | 0  | 0   |
| do         | 0 | 1  | 0  | 0  | 0        | 0      | 0          | 0      | 0  | 0  | 0   |
| you        | 0 | 0  | 0  | 0  | 0        | 0      | 0          | 0      | 0  | 0  | 1   |
| prefer     | 0 | 0  | 0  | 0  | 0        | 1      | 0          | 0      | 0  | 0  | 0   |
| a          | 1 | 0  | 0  | 0  | 0        | 0      | 0          | 0      | 0  | 0  | 0   |
| restaurant | 0 | 0  | 0  | 0  | 0        | 0      | 1          | 0      | 0  | 0  | 0   |

Note: Column order = Vocabulary order shown above.
> Mnemonic: rows = tokens (words in order); columns = vocabulary (unique words).

So, for example, "should" becomes a vector with a `1` under the "should" column and `0`s elsewhere; "we" has a `1` under "we"; and so on. Duplicate words like "a" appear as multiple rows with the same one‑hot pattern.

**Mini visual (subset for clarity)**

To see the pattern quickly, here’s a tiny subset using just the first three tokens and three vocabulary columns:

- Tokens (first 3): ["should", "we", "go"]
- Vocabulary (subset): ["go", "should", "we"]

| Word   | go | should | we |
|--------|----|--------|----|
| should | 0  | 1      | 0  |
| we     | 0  | 0      | 1  |
| go     | 1  | 0      | 0  |

Each row has exactly one 1 in the column for that word. The full table above, and the output of the next cell, show all words.

In [None]:
def one_hot_encoding(sentence):
    tokens = clean_tokenize(sentence)
    vocabulary = sorted(set(tokens))
    word_to_index = {w: i for i, w in enumerate(vocabulary)}
    mat = np.zeros((len(tokens), len(vocabulary)), dtype=int)
    for i, w in enumerate(tokens):
        mat[i, word_to_index[w]] = 1
    return mat, vocabulary, tokens


sentence = 'Should we go to a pizzeria or do you prefer a restaurant?'
one_hot_matrix, vocabulary, tokens = one_hot_encoding(sentence)

# Display
print(f'Sentence: {sentence}')
print(f'Tokens: {tokens}')
print(f'Vocabulary (unique, sorted): {vocabulary}')

# Sizes: characters in the sentence, tokens, vocabulary, one-hot matrix
print(f'Sentence length (characters): {len(sentence)}')
print(f'Number of tokens: {len(tokens)}')
print(f'Vocabulary size: {len(vocabulary)}')
print(f'One-Hot shape: {one_hot_matrix.shape}')

print('Note: Column order follows the sorted vocabulary above.')
# Pretty table: rows=tokens, cols=vocabulary; highlight 1s
# Use a unique index to avoid Styler errors with duplicate labels
display_index = [f"{i+1}. {w}" for i, w in enumerate(tokens)]
one_hot_df = pd.DataFrame(
    one_hot_matrix, columns=vocabulary, index=display_index)
try:
    styled = one_hot_df.style.applymap(
        lambda v: 'background-color: #ffeeba' if v == 1 else '')
    display(styled)
except Exception:
    # Fallback if Styler/display not available: print the plain table
    # (a bare expression inside a block is not shown as cell output)
    print(one_hot_df)

In [None]:
# Try your own sentence
user_sentence = input(
    "Enter a sentence to one-hot encode (or press Enter to reuse the default): ").strip()
if user_sentence:
    s = user_sentence
else:
    s = sentence  # reuse the earlier example

mat, vocab2, toks2 = one_hot_encoding(s)
print('Tokens:', toks2)
print(f'Vocab size: {len(vocab2)} | One-Hot shape: {mat.shape}')
print('Note: Column order follows the sorted vocabulary.')
# Use a unique index to avoid Styler errors with duplicate labels
display_index2 = [f"{i+1}. {w}" for i, w in enumerate(toks2)]
df2 = pd.DataFrame(mat, columns=vocab2, index=display_index2)
try:
    styled2 = df2.style.applymap(
        lambda v: 'background-color: #ffeeba' if v == 1 else '')
    display(styled2)
except Exception:
    # Fallback: print the plain table
    print(df2)

## Bag of Words (BoW)

> If you’re focusing only on One‑Hot today, you can stop here. Come back to BoW and TF‑IDF later.

What it does:
Counts how many times each word appears in each sentence (or document). It builds a vocabulary of unique words and then makes a table of counts.

Why it’s useful:
Lets us compare sentences by the words they contain and how often. Helpful for search, clustering, or as input features.

Tiny example (3 sentences)
1) "This movie is awesome awesome"
2) "I do not say is good, but neither awesome"
3) "Awesome? Only a fool can say that"

Vocabulary (sorted): ["a", "awesome", "but", "can", "do", "fool", "good", "i", "is", "movie", "neither", "not", "only", "say", "that", "this"]

Counts matrix (rows = sentences, columns = vocabulary):

| sentence | a | awesome | but | can | do | fool | good | i | is | movie | neither | not | only | say | that | this |
|---------:|:-:|:------:|:---:|:---:|:--:|:----:|:----:|:-:|:--:|:-----:|:-------:|:---:|:----:|:---:|:----:|:----:|
| 1        | 0 |   2    |  0  |  0  | 0  |  0   |  0   | 0 |  1 |   1   |    0    |  0  |  0   |  0  |  0   |  1   |
| 2        | 0 |   1    |  1  |  0  | 1  |  0   |  1   | 1 |  1 |   0   |    1    |  1  |  0   |  1  |  0   |  0   |
| 3        | 1 |   1    |  0  |  1  | 0  |  1   |  0   | 0 |  0 |   0   |    0    |  0  |  1   |  1  |  1   |  0   |

How to read it
- Row 1 has “awesome” twice, and includes “is”, “movie”, and “this” once each.
- Row 2 includes single‑occurrence words like “but”, “do”, “good”, “i”, “is”, “neither”, “not”, “say”, and one “awesome”.
- Row 3 contains words like “a”, “awesome”, “can”, “fool”, “only”, “say”, “that” each once.

Notes
- Bag of Words ignores order: “movie awesome” equals “awesome movie”.
- It captures counts, not meaning. TF‑IDF and embeddings improve on this.
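
The order-insensitivity can be seen in a couple of lines. This sketch uses `collections.Counter` from the standard library as a stand-in for a single row of the counts table; it is an illustration, not the function used in this notebook.

```python
from collections import Counter

counts_a = Counter("movie awesome".split())
counts_b = Counter("awesome movie".split())

# Same words, same counts: the two phrases are indistinguishable as bags of words.
print(counts_a == counts_b)   # True
```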

Beginner notes: BoW
- Run the next two cells.
- The table shows rows=sentences and columns=words; values are counts.
- The counting loop does not check `if w in w2i` before incrementing, because the vocabulary is built from the same tokens the loop iterates over, so membership is guaranteed. Add that check only when using a fixed vocabulary or handling out-of-vocabulary (OOV) words.
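
For the fixed-vocabulary case mentioned in the last note, a minimal sketch might look like this. The three-word `fixed_vocab` and the `skipped` list are assumptions made up for the example; they are not part of the notebook's code.

```python
import numpy as np

fixed_vocab = ["awesome", "fool", "movie"]   # assumed: chosen in advance
w2i = {w: i for i, w in enumerate(fixed_vocab)}

new_tokens = ["this", "movie", "is", "awesome", "awesome"]
vec = np.zeros(len(fixed_vocab), dtype=int)
skipped = []
for w in new_tokens:
    if w in w2i:              # membership check needed with a fixed vocabulary
        vec[w2i[w]] += 1
    else:
        skipped.append(w)     # out-of-vocabulary (OOV) word, not counted

print(vec)       # [2 0 1]
print(skipped)   # ['this', 'is']
```

Without the check, `w2i["this"]` would raise a `KeyError`.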

<mark>Each document becomes a vector whose length equals the vocabulary size (often high‑dimensional and sparse) — watch out for the “curse of dimensionality”.</mark>

> Beginner notes: What to look for in the BoW table
>
> - Rows = sentences; Columns = vocabulary (unique, sorted)
> - Printed shape = (number_of_sentences, vocabulary_size)
> - More unique words → longer document vectors (high‑dimensional, often sparse)
>
> Glossary: see key terms like “sparse vector” and “curse of dimensionality” in ../../docs/terminology.md
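
To get a feel for "sparse", the sketch below copies the counts table from the tiny example into an array by hand and measures the share of zero entries. The variable names are illustrative only.

```python
import numpy as np

# Counts copied from the example table (3 sentences x 16 vocabulary words).
counts = np.array([
    [0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
])

zero_fraction = (counts == 0).mean()
print(counts.shape)                            # (3, 16)
print(f"Zero entries: {zero_fraction:.0%}")    # Zero entries: 58%
```

Even with three short sentences, more than half of the table is zeros. With a larger vocabulary the share of zeros typically grows.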

In [None]:
def bag_of_words(sentences):
    tokenized = [clean_tokenize(s) for s in sentences]
    print(f"Tokenized sentences: {tokenized}")

    vocab = sorted(set(w for sent in tokenized for w in sent))
    print(f"Vocabulary: {vocab}")

    w2i = {w: i for i, w in enumerate(vocab)}
    print(f"Word to index mapping: {w2i}")

    mat = np.zeros((len(sentences), len(vocab)), dtype=int)
    print(f"\nBag of Words matrix shape: {mat.shape}")
    print(f"Initial Bag of Words matrix:\n{mat}")
    for i, sent in enumerate(tokenized):
        for w in sent:
            mat[i, w2i[w]] += 1
    return vocab, mat


corpus = [
    'This movie is awesome awesome',
    'I do not say is good, but neither awesome',
    'Awesome? Only a fool can say that'
]
vocab_bow, bow_matrix = bag_of_words(corpus)
print('Vocab size:', len(vocab_bow), '| Matrix:', bow_matrix.shape)

bow_df = pd.DataFrame(bow_matrix, columns=vocab_bow)
bow_df

## TF and TF‑IDF

What it does:
- TF (Term Frequency): Shows how often each word appears in a sentence, relative to the sentence length.
- IDF (Inverse Document Frequency): Downweights words that appear in many sentences; upweights words that are rare across sentences.
- TF‑IDF = TF × IDF: Highlights words that are frequent in one sentence but uncommon overall.

Why it’s useful:
Helps find words that are distinctive to each sentence, useful for search, summarization, and feature engineering.

How to read TF and TF‑IDF
- TF: Within one sentence, count each word and divide by total words in that sentence (row sums ≈ 1.0).
- IDF: Higher for rarer words; equals 1.0 for words present in every sentence (with smoothing).
- TF‑IDF: Large when a word is frequent in a sentence and rare across others.
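
The reading rules above can be tried by hand. This sketch computes TF for the first example sentence and evaluates the smoothed IDF formula used in this notebook, `log((1 + N) / (1 + df)) + 1`, for a 3-sentence corpus. The helper name `smoothed_idf` is hypothetical.

```python
import math
from collections import Counter

words = "this movie is awesome awesome".split()
tf = {w: c / len(words) for w, c in Counter(words).items()}
print(tf["awesome"])       # 0.4
print(sum(tf.values()))    # 1.0, up to floating-point rounding


def smoothed_idf(n_docs, doc_freq):
    # idf = log((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# A word found in 1, 2, or all 3 of the sentences
for df in (1, 2, 3):
    print(df, round(smoothed_idf(3, df), 3))   # 1.693, 1.288, 1.0
```

The rarer the word (smaller `df`), the larger its IDF, and a word present in every sentence gets exactly 1.0.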

Tiny example (same 3 sentences)
1) "this movie is awesome awesome"
2) "i do not say is good, but neither awesome"
3) "awesome? only a fool can say that"

Intuition
- “awesome” appears in all sentences, so its IDF is lower than a word appearing in only one sentence.
- A word that repeats in one sentence can still get a higher TF‑IDF for that sentence.

Tiny numeric walk‑through for “awesome”
- Sentence lengths: 5, 9, 7 → TF1=2/5=0.40, TF2=1/9≈0.11, TF3=1/7≈0.14
- With sklearn‑style smoothing: idf = log((1+N)/(1+df)) + 1 → N=3, df(awesome)=3 → IDF=1.0
- TF‑IDF values: 0.40, ~0.11, ~0.14 (highest in sentence 1 due to repetition)
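
The walk-through can be reproduced in a few lines. This sketch enters the sentence lengths and counts by hand. The comparison with "fool" (which appears once, only in sentence 3) is an extra illustration, not part of the walk-through itself.

```python
import math

lengths = [5, 9, 7]          # tokens per sentence
awesome_counts = [2, 1, 1]   # occurrences of "awesome" per sentence
n_docs = 3

idf_awesome = math.log((1 + n_docs) / (1 + 3)) + 1   # df = 3, so log(1) + 1 = 1.0
tfidf_awesome = [c / n * idf_awesome for c, n in zip(awesome_counts, lengths)]
print([round(v, 2) for v in tfidf_awesome])   # [0.4, 0.11, 0.14]

# "fool" appears once, in sentence 3 only (df = 1), so its IDF is higher.
idf_fool = math.log((1 + n_docs) / (1 + 1)) + 1
tfidf_fool_s3 = (1 / 7) * idf_fool
print(round(tfidf_fool_s3, 2))                # 0.24
```

In sentence 3, "fool" (about 0.24) outscores "awesome" (about 0.14) even though both occur once there, because "fool" is rarer across the corpus.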

Beginner notes: TF/IDF
- Run the next cells to compute and view tables.
- Larger TF‑IDF means “important for that sentence and uncommon overall”.
- Optional: print top TF‑IDF words per sentence to see the standouts.

In [None]:
def compute_tf(sentences):
    tokenized = [clean_tokenize(s) for s in sentences]
    print(f"Tokenized sentences: {tokenized}")

    vocab = sorted(set(w for sent in tokenized for w in sent))
    print(f"Vocabulary: {vocab}")

    w2i = {w: i for i, w in enumerate(vocab)}
    print(f"Word to index mapping: {w2i}")

    tf = np.zeros((len(sentences), len(vocab)), dtype=np.float32)
    print(f"\nTF matrix shape: {tf.shape}")
    print(f"Initial TF matrix:\n{tf}")

    for i, words in enumerate(tokenized):
        n = max(1, len(words))
        for w in words:
            tf[i, w2i[w]] += 1.0 / n
    return tf, vocab


def compute_idf(sentences, vocab):
    token_sets = [set(clean_tokenize(s)) for s in sentences]
    print(f"Token sets per sentence: {token_sets}")

    N = len(sentences)
    w2i = {w: i for i, w in enumerate(vocab)}
    print(f"Word to index mapping: {w2i}")

    idf = np.zeros(len(vocab), dtype=np.float32)
    print(f"\nIDF vector shape: {idf.shape}")
    print(f"Initial IDF vector:\n{idf}")

    for w in vocab:
        df = sum(1 for s in token_sets if w in s)
        idf[w2i[w]] = np.log((1 + N) / (1 + df)) + \
            1.0  # sklearn-style smoothing
    return idf


def tf_idf(sentences):
    tf, vocab = compute_tf(sentences)
    idf = compute_idf(sentences, vocab)
    return vocab, tf * idf


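# tf_idf() already calls compute_tf/compute_idf internally. They are called
# again below only to get the separate TF and IDF tables, so their debug
# prints appear twice in the output.
vocabulary, tf_idf_matrix = tf_idf(corpus)
tf_matrix, tf_vocab = compute_tf(corpus)
idf_vector = compute_idf(corpus, tf_vocab)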
print('TF-IDF shape:', tf_idf_matrix.shape)
tf_df = pd.DataFrame(tf_matrix, columns=tf_vocab)
idf_df = pd.DataFrame([idf_vector], columns=tf_vocab)
tfidf_df = pd.DataFrame(tf_idf_matrix, columns=vocabulary)
display(tf_df.round(2))
display(idf_df.round(2))
display(tfidf_df.round(2))

In [None]:
# Show top TF-IDF terms per sentence
import numpy as np


def top_tfidf_per_doc(vocab, tfidf_matrix, top_k=3):
    for i in range(tfidf_matrix.shape[0]):
        row = tfidf_matrix[i]
        idxs = np.argsort(-row)[:top_k]
        pairs = [(vocab[j], float(row[j])) for j in idxs if row[j] > 0]
        print(f'Sentence {i+1}: {pairs}')


top_tfidf_per_doc(vocabulary, tf_idf_matrix, top_k=3)