# Semantic Analysis Over Moby Dick Lecture

About me:
I am Jeison Robles Arias, an enthusiastic person focused on improving my skills, growing through education, and spreading engineering knowledge.
You can follow me on:
- [Medium](https://medium.com/@roblesjeison)
- [Linkedin](https://www.linkedin.com/in/jeison-robles-arias-6ab8a9ba/)
- [GitHub](https://github.com/JeisonRobles)

**Notebook focus:**

This notebook walks through the basics of document preprocessing, embedding creation, semantic spaces, and clustering for insight extraction.

Throughout this analysis I use common tools and strategies in a basic pipeline to extract information from documents. That information can serve as a RAG source for LLMs and agentic systems, which I'll use later in further publications.

The document will be structured as follows:

[Basic knowledge and strategy](#basics)

## Basics

In [1]:
import sys
from pathlib import Path

# Notebook-safe project root (assumes notebook is in /notebooks)
PROJECT_ROOT = Path.cwd().parent

if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("PROJECT_ROOT:", PROJECT_ROOT)

PROJECT_ROOT: /Users/jeisonroblesarias/Documents/ODSC_2026/moby-embeddings-from-stratch


In [2]:
from pathlib import Path
import pandas as pd

from src.preprocess import PreprocessConfig, make_paragraph_dataset
from src.vectorize_tfidf import TfidfConfig, build_tfidf_matrix
from src.similarity import top_k_similar_rows
from src.reduce_dim import pca_2d
from src.viz import plot_embeddings_2d


In [3]:
DATA_RAW = PROJECT_ROOT / "data" / "raw" / "mobydick.txt"
OUTPUTS = PROJECT_ROOT / "outputs"
OUTPUTS.mkdir(exist_ok=True)

DATA_RAW, OUTPUTS


(PosixPath('/Users/jeisonroblesarias/Documents/ODSC_2026/moby-embeddings-from-stratch/data/raw/mobydick.txt'),
 PosixPath('/Users/jeisonroblesarias/Documents/ODSC_2026/moby-embeddings-from-stratch/outputs'))

In [4]:
with open(DATA_RAW, "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Chars:", len(raw_text))
print(raw_text[:500])


Chars: 1238242
The Project Gutenberg eBook of Moby Dick; Or, The Whale
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBo
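The first 500 characters above are Project Gutenberg license text, not the novel. A minimal sketch of stripping that header and footer follows. It assumes the file uses the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines; `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the `src` package, and the sample string is made up for illustration.

```python
import re

# Assumed marker format; check the actual file before relying on these patterns.
_START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
_END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the beginning or end of the text.
    """
    start = _START_RE.search(text)
    begin = start.end() if start else 0
    end = _END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Tiny made-up sample that mimics the layout of a Gutenberg file.
sample = (
    "The Project Gutenberg eBook of Moby Dick; Or, The Whale\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***\n\n"
    "Call me Ishmael.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK MOBY DICK; OR, THE WHALE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Falling back to the full text when a marker is missing keeps the helper safe on files that have already been cleaned.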


1. Introduction & goal

2. Load raw text

In [5]:
cfg = PreprocessConfig(
    min_paragraph_chars=200,  # increase if you want fewer, richer paragraphs
    lowercase=True
)

rows = make_paragraph_dataset(raw_text, cfg)
df = pd.DataFrame(rows, columns=["paragraph_id", "raw_text", "clean_text"])

print("Paragraphs:", len(df))
df.head(3)


Paragraphs: 1567


| | paragraph_id | raw_text | clean_text |
|---|---|---|---|
| 0 | 0 | This text is a combination of etexts, one from... | this text is a combination of etexts one from ... |
| 1 | 1 | The pale Usher—threadbare in coat, heart, body... | the pale usher threadbare in coat heart body a... |
| 2 | 2 | “While you take in hand to school others, and ... | while you take in hand to school others and to... |
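The internals of `make_paragraph_dataset` are not shown in this notebook. As a rough sketch of the segmentation step, assuming paragraphs are separated by blank lines and that `min_paragraph_chars` is a simple length filter, it could look like this. `split_paragraphs` is a hypothetical name used only for illustration.

```python
import re


def split_paragraphs(text: str, min_chars: int = 200) -> list[str]:
    """Split on blank lines and keep paragraphs with at least min_chars characters."""
    # One or more blank (or whitespace-only) lines mark a paragraph boundary.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = [block.strip() for block in blocks]
    return [p for p in paragraphs if len(p) >= min_chars]


# Made-up sample: one short paragraph and two long ones.
sample = (
    "Short line.\n\n"
    + "A longer paragraph. " * 15
    + "\n\n\n"
    + "Another long one. " * 20
)
kept = split_paragraphs(sample, min_chars=200)
print(len(kept))
```

Raising `min_chars` drops more short fragments (epigraphs, chapter titles, one-line dialogue), which matches the comment on `min_paragraph_chars` in the cell above.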


In [6]:
df["char_len"] = df["clean_text"].str.len()
df["word_len"] = df["clean_text"].str.split().str.len()

df[["char_len", "word_len"]].describe()



| | char_len | word_len |
|---|---|---|
| count | 1567.0 | 1567.0 |
| mean | 684.92023 | 127.462668 |
| std | 479.11573 | 89.246217 |
| min | 200.0 | 32.0 |
| 25% | 347.0 | 65.0 |
| 50% | 541.0 | 101.0 |
| 75% | 884.0 | 163.0 |
| max | 3466.0 | 734.0 |


In [7]:
i = 0
print("RAW:\n", df.loc[i, "raw_text"][:700])
print("\nCLEAN:\n", df.loc[i, "clean_text"][:700])


RAW:
 This text is a combination of etexts, one from the now-defunct ERIS
project at Virginia Tech and one from Project Gutenberg’s archives. The
proofreaders of this version are indebted to The University of Adelaide
Library for preserving the Virginia Tech version. The resulting etext
was compared with a public domain hard copy version of the text.

CLEAN:
 this text is a combination of etexts one from the now defunct eris project at virginia tech and one from project gutenberg s archives the proofreaders of this version are indebted to the university of adelaide library for preserving the virginia tech version the resulting etext was compared with a public domain hard copy version of the text
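Comparing RAW and CLEAN above, the normalization seems to lowercase the text, replace punctuation with spaces, and collapse line breaks and repeated whitespace. The sketch below reproduces that behavior on a fragment of paragraph 0. It is a guess at what the `src.preprocess` code does, and `normalize_text` is a hypothetical name.

```python
import re


def normalize_text(text: str, lowercase: bool = True) -> str:
    """Lowercase, turn every run of non-alphanumeric characters into one space, and trim."""
    if lowercase:
        text = text.lower()
    # Punctuation, curly quotes, hyphens, and newlines all become a single space.
    # Note: this ASCII-only pattern would also drop accented letters.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    return text.strip()


raw = (
    "one from the now-defunct ERIS\n"
    "project at Virginia Tech and one from Project Gutenberg’s archives."
)
print(normalize_text(raw))
```

This is why `gutenberg’s` shows up as `gutenberg s` in the clean text: the apostrophe is replaced by a space rather than removed.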


In [8]:
tfidf_cfg = TfidfConfig(
    min_df=3,
    max_df=0.9,
    ngram_range=(1, 2),     # unigrams + bigrams capture short phrases, not just single words
    stop_words="english",
    sublinear_tf=True,
    norm="l2"               # L2-normalized rows: a dot product between rows equals their cosine similarity
)

X, vectorizer = build_tfidf_matrix(df["clean_text"].tolist(), tfidf_cfg)

print("TF-IDF shape:", X.shape)  # (num_paragraphs, num_terms)


TF-IDF shape: (1567, 6838)
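The parameter names in `TfidfConfig` match scikit-learn's `TfidfVectorizer`, so `build_tfidf_matrix` presumably wraps it; that is an assumption, since the `src` module is not shown. To make the weighting concrete, here is a small pure-Python sketch of TF-IDF with sublinear term frequency, smoothed IDF, and L2 normalization. It leaves out `min_df`, `max_df`, stop words, and bigrams, and `tfidf_rows` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_rows(docs: list[str]) -> list[dict[str, float]]:
    """Sparse TF-IDF rows (term -> weight) with sublinear TF and L2 normalization."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = {}
        for term, tf in counts.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed IDF
            row[term] = (1.0 + math.log(tf)) * idf  # sublinear TF: 1 + ln(tf)
        norm = math.sqrt(sum(w * w for w in row.values()))
        # Scale each non-empty row to unit length.
        rows.append({t: w / norm for t, w in row.items()} if norm else row)
    return rows


# Made-up toy corpus, already normalized.
docs = ["whale ship whale sea", "ship captain sea", "queen gold whale"]
rows = tfidf_rows(docs)
print(round(sum(w * w for w in rows[0].values()), 6))
```

Because every row has unit length, the dot product of two rows is already their cosine similarity, which is what makes the nearest-neighbor search in the next cell cheap.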


In [9]:
query_idx = 10  # change this
neighbors = top_k_similar_rows(X, query_idx, k=5)

print("QUERY PARAGRAPH:\n")
print(df.loc[query_idx, "raw_text"][:600], "...\n")

print("TOP NEIGHBORS:\n")
for idx, score in neighbors:
    print(f"\nScore: {score:.3f} | paragraph_id={df.loc[idx, 'paragraph_id']}")
    print(df.loc[idx, "raw_text"][:450], "...")


QUERY PARAGRAPH:

“Which to secure, no skill of leach’s art Mote him availle, but to
 returne againe To his wound’s worker, that with lowly dart, Dinting
 his breast, had bred his restless paine, Like as the wounded whale to
 shore flies thro’ the maine.” —_The Fairie Queen_. ...

TOP NEIGHBORS:


Score: 0.132 | paragraph_id=1091
In his treatise on “Queen-Gold,” or Queen-pinmoney, an old King’s Bench
author, one William Prynne, thus discourseth: “Ye tail is ye Queen’s,
that ye Queen’s wardrobe may be supplied with ye whalebone.” Now this
was written at a time when the black limber bone of the Greenland or
Right whale was largely used in ladies’ bodices. But this same bone is
not in the tail; it is in the head, which is a sad mistake for a
sagacious lawyer like Prynne. But ...

Score: 0.116 | paragraph_id=1286
“Thou art but too good a fellow, Starbuck,” he said lowly to the mate;
then raising his voice to the crew: “Furl the t’gallant-sails, and
close-reef the top-sails, fore and aft; b
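The loop above consumes `(idx, score)` pairs, and the best score is well below 1.0, so `top_k_similar_rows` appears to exclude the query row itself. A minimal sketch of that kind of search, using plain lists of floats instead of the TF-IDF matrix, is shown here. `cosine` and `top_k_similar` are hypothetical helpers, not the `src.similarity` implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(rows: list[list[float]], query_idx: int, k: int = 5) -> list[tuple[int, float]]:
    """Return the k rows most similar to rows[query_idx], excluding the query itself."""
    scores = [
        (idx, cosine(rows[query_idx], row))
        for idx, row in enumerate(rows)
        if idx != query_idx
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up vectors: row 1 points mostly in the same direction as row 0.
rows = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
neighbors_demo = top_k_similar(rows, query_idx=0, k=2)
print(neighbors_demo)
```

Scores like 0.132 in the real output are low in absolute terms, which is typical for sparse TF-IDF vectors over short paragraphs: two passages only score high when they share many weighted terms.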

In [10]:
records = []
for idx, score in neighbors:
    records.append({
        "query_idx": query_idx,
        "neighbor_idx": idx,
        "cosine_score": score,
        "neighbor_snippet": df.loc[idx, "raw_text"][:250].replace("\n", " ")
    })

pd.DataFrame(records).to_csv(OUTPUTS / "neighbors.csv", index=False)
print("Saved:", OUTPUTS / "neighbors.csv")


Saved: /Users/jeisonroblesarias/Documents/ODSC_2026/moby-embeddings-from-stratch/outputs/neighbors.csv


In [11]:
coords = pca_2d(X)

df_plot = pd.DataFrame({
    "x": coords[:, 0],
    "y": coords[:, 1],
    "paragraph_id": df["paragraph_id"],
    "snippet": df["raw_text"].str.replace("\n", " ").str[:200]
})

out_html = OUTPUTS / "embedding_plot.html"
plot_embeddings_2d(df_plot, str(out_html))

print("Saved plot:", out_html)


Saved plot: /Users/jeisonroblesarias/Documents/ODSC_2026/moby-embeddings-from-stratch/outputs/embedding_plot.html
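`pca_2d` is not shown in the notebook. One common way to project rows onto their two main directions of variance is to center the matrix and take an SVD, sketched below with NumPy on random demo data. `pca_2d_sketch` is a hypothetical name, and the real helper may work differently, for example by using a truncated SVD directly on the sparse matrix.

```python
import numpy as np


def pca_2d_sketch(X: np.ndarray) -> np.ndarray:
    """Project the rows of a dense matrix onto its first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Random demo data standing in for a small dense embedding matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
coords_demo = pca_2d_sketch(X_demo)
print(coords_demo.shape)
```

This version needs a dense array, so a sparse TF-IDF matrix would first have to be converted (for example with `X.toarray()`), which is affordable at 1567 x 6838 but not for much larger corpora.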


3. Strip Gutenberg boilerplate

4. Paragraph segmentation

5. Text normalization

6. TF-IDF embeddings

7. Cosine similarity search

8. 2D semantic visualization

9. Insights & takeaways

