The bag‑of‑words model is a popular and simple feature extraction technique for text. It describes the occurrence of each word within a document.
To use this model, we need to:
1. Design a vocabulary of known words (also called tokens).
2. Choose a measure of the presence of known words.
Any information about the order or structure of the words is discarded, which is why it is called a “bag” of words. The model records whether, and how often, a known word appears in a document, but not where in the document it appears.
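The two steps and the loss of word order can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the `tokenize` and `bag_of_words` helpers and the three toy sentences are made up for the example, and the tokenizer (lowercase, runs of two or more letters or digits) is a simplifying assumption.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, keep runs of 2+ letters/digits
    return re.findall(r"[a-z0-9]{2,}", text.lower())


def bag_of_words(docs):
    tokenized = [tokenize(d) for d in docs]
    # Step 1: the vocabulary is the sorted set of all tokens seen in any document
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Step 2: the measure of presence is the raw count of each vocabulary word
    counts = []
    for toks in tokenized:
        c = Counter(toks)
        counts.append([c[w] for w in vocab])
    return vocab, counts


# Hypothetical toy documents; the first two differ only in word order
docs = ["the robot heard the rain", "the rain heard the robot", "the city slept"]
vocab, counts = bag_of_words(docs)

# Alternative measure of presence: 1 if the word occurs at all, else 0
binary = [[1 if n > 0 else 0 for n in row] for row in counts]

print(vocab)
print(counts[0] == counts[1])  # same words, different order -> identical vectors
```

The first two sentences map to the same count vector, which is exactly the information loss the "bag" metaphor describes.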
The intuition is that similar documents have similar content. Moreover, from the content we can learn something about the document’s meaning.
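One common way to make "similar content" concrete is to compare count vectors with cosine similarity. The sketch below is only an illustration of that idea: the choice of cosine similarity, the `cosine_similarity` helper, and the three small count vectors are assumptions for the example, not something taken from the story file.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two count vectors; 0.0 if either is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical count vectors over the vocabulary ["city", "optimizer", "rain", "robot"]
doc_a = [0, 2, 1, 1]
doc_b = [0, 1, 1, 2]
doc_c = [3, 0, 0, 0]

sim_ab = cosine_similarity(doc_a, doc_b)  # documents sharing most of their words
sim_ac = cosine_similarity(doc_a, doc_c)  # documents sharing no words
print(sim_ab, sim_ac)
```

Documents that share many words score close to 1, and documents with no words in common score 0.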

In [1]:
# 1) Load text from file
from pathlib import Path
text_path = Path("The Pause Between Layers.txt")  # ensure this file is in the same folder as the notebook
text = text_path.read_text(encoding="utf-8")

# 2) Bag of Words with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# CountVectorizer lowercases and tokenizes the text; stop_words="english" drops common English words
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform([text])               # single document -> 1 x V matrix

# 3) Show as DataFrame (one row, many columns)
vocab = vectorizer.get_feature_names_out()
bow_df = pd.DataFrame(X.toarray(), columns=vocab)
bow_df

Unnamed: 0,27,43,accepting,act,acted,adjusted,advanced,agreed,air,aligned,...,wake,want,wanted,wants,way,wear,weights,whispered,world,yesterday
0,11,1,1,1,1,1,1,1,1,4,...,1,1,1,1,3,1,1,1,1,1


In [2]:
# Take the single row of counts as a Series and sort it in descending order
freq = bow_df.iloc[0].sort_values(ascending=False)
freq.head(25).to_frame("count")

Unnamed: 0,count
27,11
did,7
chord,6
optimizer,6
pause,5
rain,4
robot,4
aligned,4
like,4
city,3


In [3]:
# Split by blank lines as "documents" (paragraphs)
docs = [p for p in text.split("\n\n") if p.strip()]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

pd.DataFrame(X.toarray(), columns=vocab)

Unnamed: 0,27,43,accepting,act,acted,adjusted,advanced,agreed,air,aligned,...,wake,want,wanted,wants,way,wear,weights,whispered,world,yesterday
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,1,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,2,0,0,0,0,0,0,0,0,4,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,1,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,0,0,0,0,0
8,1,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,2,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


TF–IDF example for The Pause Between Layers.txt

In [4]:
# Import the libraries we need
from sklearn.feature_extraction.text import TfidfVectorizer

# Use the same 'docs' you already built (paragraphs from the story)
# docs = [p for p in text.split("\n\n") if p.strip()]

# Build the TF‑IDF model
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
values = tfidf_vectorizer.fit_transform(docs)

# Show the model as a pandas DataFrame
feature_names = tfidf_vectorizer.get_feature_names_out()
pd.DataFrame(values.toarray(), columns=feature_names)

Unnamed: 0,27,43,accepting,act,acted,adjusted,advanced,agreed,air,aligned,...,wake,want,wanted,wants,way,wear,weights,whispered,world,yesterday
0,0.09704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.157528,0.0,0.203285,0.203285,0.0,0.0
1,0.0,0.204402,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.204402
2,0.13583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.569088,...,0.142272,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.175709,0.0,0.0,0.0,0.0,0.0
4,0.081187,0.0,0.0,0.0,0.170075,0.0,0.0,0.0,0.170075,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.086428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.18722,0.18722,0.18722,0.145079,0.0,0.0,0.0,0.0,0.0
8,0.073815,0.0,0.0,0.154631,0.0,0.0,0.154631,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.159592,0.0,0.167161,0.0,0.0,0.167161,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
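The weights in the table above are no longer raw counts. Each count is scaled by an inverse document frequency, and each row is then normalized. The sketch below reproduces that computation in plain Python, assuming scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm="l2"`, no sublinear term frequency). The `tfidf` helper and the toy count matrix are hypothetical and are not taken from the story.

```python
import math


def tfidf(counts):
    """counts: list of rows of raw term counts (documents x vocabulary)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2-normalize each document so every non-empty row has unit length
        norm = math.sqrt(sum(v * v for v in weighted))
        rows.append([v / norm if norm else 0.0 for v in weighted])
    return rows


# Hypothetical toy counts: 3 documents x 3 terms; term 0 occurs in every document
counts = [
    [1, 1, 0],
    [1, 0, 2],
    [1, 0, 0],
]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

A term that occurs in many documents gets a smaller weight than a rarer term with the same count. The same effect is visible in row 0 of the table above, where "27" and "weights" each occur once but "27", which appears in many paragraphs, scores 0.09704 against 0.203285.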
