# Bible Search with NLTK, TF-IDF, and LSA — Colab Hands-On

This notebook has two parts:

1) **Warm-up (Python & Colab)** — print, lists/dicts, list comprehensions, functions, file I/O  
2) **Bible Search Pipeline** — download a Bible text from Project Gutenberg, preprocess → TF-IDF → LSA → cosine search

You will learn:
- Basic Python data handling and Colab file I/O
- NLTK tokenization, POS tagging, lemmatization, stopwords
- TF-IDF vectorization and TruncatedSVD (LSA)
- Query processing and similarity search

In [None]:
# Install and import dependencies
!pip -q install nltk scikit-learn pandas numpy

import re
import numpy as np
import pandas as pd
from typing import List

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, sent_tokenize, pos_tag

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1) Warm-up: Python & Colab Basics

**Goal.** Get comfortable with Python/Colab: printing, lists & dicts, list comprehensions, functions, and file I/O.

### 1.1 Print, lists, dicts
- Create a list of words and a frequency dictionary
- Use list comprehension to transform items

### 1.2 Simple functions
- Write a function that counts tokens in a line

### 1.3 File I/O in Colab
- Upload a small `.txt` file
- Read lines, split tokens, and compute simple counts

In [None]:
# 1.1 Print, lists, dicts, list comprehensions
words = ["in", "the", "beginning", "god", "created", "the", "heaven", "and", "the", "earth"]
print("Words:", words)

# frequency dictionary
freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1
print("Freq dict:", freq)

# list comprehension: uppercase non-stopwords (toy example)
stop = {"the", "and"}
upper_nonstop = [w.upper() for w in words if w not in stop]
print("Uppercase (non-stopwords):", upper_nonstop)

Words: ['in', 'the', 'beginning', 'god', 'created', 'the', 'heaven', 'and', 'the', 'earth']
Freq dict: {'in': 1, 'the': 3, 'beginning': 1, 'god': 1, 'created': 1, 'heaven': 1, 'and': 1, 'earth': 1}
Uppercase (non-stopwords): ['IN', 'BEGINNING', 'GOD', 'CREATED', 'HEAVEN', 'EARTH']


In [None]:
# 1.2 Simple functions
def count_tokens(line: str) -> int:
    tokens = line.strip().split()
    return len(tokens)

print(count_tokens("In the beginning God created the heaven and the earth."))

10


In [None]:
# 1.3 File I/O in Colab: upload a small text file and compute counts
from google.colab import files
print("Upload a small .txt file (any short paragraph).")
uploaded = files.upload()

fname = next(iter(uploaded))  # name of the first uploaded file
with open(fname, "r", encoding="utf-8", errors="ignore") as f:
    lines = f.readlines()

print(f"Lines: {len(lines)}")
# token counts per line
line_token_counts = [count_tokens(ln) for ln in lines]
print("First 3 lines + token counts:")
for i in range(min(3, len(lines))):
    print(f"{i+1:>2}: {lines[i].strip()}  | tokens={line_token_counts[i]}")

Upload a small .txt file (any short paragraph).


Saving 1789-Washington[1].txt to 1789-Washington[1] (1).txt
Lines: 13
First 3 lines + token counts:
 1: Fellow-Citizens of the Senate and of the House of Representatives:  | tokens=10
 2:   | tokens=0
 3: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, b

### Mini-exercise
- Modify `count_tokens` to ignore punctuation (e.g., periods and commas).
- Compute the **top-5 most frequent tokens** in your uploaded file.
- (Optional) Write a function `get_top_k(lines, k)` that returns `(token, count)` pairs sorted by count.
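
One possible approach to the optional exercise is sketched here. It assumes punctuation can be dropped with a simple regex and that tokens should be lowercased; the `tokenize` helper and `sample_lines` are illustrative names, not part of the notebook.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())

def get_top_k(lines, k):
    """Return the k most frequent (token, count) pairs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(k)

sample_lines = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void.",
]
print(get_top_k(sample_lines, 2))
```

For the uploaded file, `get_top_k(lines, 5)` would give the top-5 tokens under the same assumptions.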


## 2) Download the Corpus (Project Gutenberg)

We will fetch the King James Bible from Project Gutenberg:

```python
import requests
r = requests.get('http://www.gutenberg.org/cache/epub/10/pg10.txt')
raw = r.text
```
**Code — download & clean**

In [None]:
import requests

url = 'http://www.gutenberg.org/cache/epub/10/pg10.txt'
r = requests.get(url)
raw = r.text
print("Raw length:", len(raw))

# Strip Gutenberg header/footer heuristically
def strip_gutenberg_headers(text: str) -> str:
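    # Find probable start/end markers. Current Gutenberg files use
    # "START OF THE PROJECT ..."; older files use "START OF THIS PROJECT ...".
    # Matching only the "THIS" variant leaves the header in place for this file.
    start_pat = r"\*\*\* START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*"
    end_pat   = r"\*\*\* END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*\*\*\*"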
    start_m = re.search(start_pat, text, flags=re.IGNORECASE)
    end_m   = re.search(end_pat, text, flags=re.IGNORECASE)
    if start_m and end_m:
        text = text[start_m.end():end_m.start()]
    return text.strip()

clean_text = strip_gutenberg_headers(raw)
print("Clean length:", len(clean_text))

# For simplicity, we treat sentences as "verses"
sents = sent_tokenize(clean_text)
print("Sentence count:", len(sents))
sents[:3]

Raw length: 4451818
Clean length: 4451812
Sentence count: 29930


['\ufeffThe Project Gutenberg eBook of The King James Version of the Bible\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever.',
 'You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org.',
 'If you are not located in the United States,\r\nyou will have to check the laws of the country where you are located\r\nbefore using this eBook.']

In [None]:
# Build a simple DataFrame: one sentence per "doc"
df = pd.DataFrame({
    "doc_id": [f"sent_{i}" for i in range(len(sents))],
    "text": sents
})
df.head(3)

Unnamed: 0,doc_id,text
0,sent_0,﻿The Project Gutenberg eBook of The King James...
1,sent_1,"You may copy it, give it away or re-use it und..."
2,sent_2,"If you are not located in the United States,\r..."


## 2’) Structured Loader: Book / Chapter / Verse (Project Gutenberg KJV)

We will download the KJV text from Project Gutenberg and **parse** it into
`book`, `chapter`, `verse`, and `text` fields.

**Heuristics used**
- Detect **BOOK** titles as the sentence immediately before each verse that starts with `1:1` (e.g., “The First Book of Moses: Called Genesis.”).
- Detect **VERSE** boundaries from tokens of the form `chapter:verse` (e.g., `4:13`); the **CHAPTER** number is read from the same token.

This keeps a full reference such as `The First Book of Moses: Called Genesis. 1:1` for every verse.
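
The verse-detection heuristic can be sketched in isolation as follows. This is a minimal sketch that assumes references always appear as a standalone `N:M` token, as in the Gutenberg KJV file; `VERSE_REF`, `parse_verse_ref`, and `split_into_verses` are hypothetical names introduced for illustration.

```python
import re

# Assumed reference format: chapter and verse start at 1 and are joined by ":".
VERSE_REF = re.compile(r"^([1-9][0-9]*):([1-9][0-9]*)$")

def parse_verse_ref(token):
    """Return (chapter, verse) as ints if the token looks like 'N:M', else None."""
    m = VERSE_REF.match(token)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

def split_into_verses(tokens):
    """Group a flat token list into verses, starting a new verse at each 'N:M' token."""
    verses, current = [], None
    for tok in tokens:
        ref = parse_verse_ref(tok)
        if ref is not None:
            if current is not None:
                verses.append(current)
            current = {"chapter": ref[0], "verse": ref[1], "words": []}
        elif current is not None:
            current["words"].append(tok)
    if current is not None:
        verses.append(current)
    return verses

tokens = "1:1 In the beginning 1:2 And the earth".split()
for v in split_into_verses(tokens):
    print(v["chapter"], v["verse"], " ".join(v["words"]))
```

Tokens that appear before the first reference are dropped in this sketch, which is one way to skip title lines.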


In [None]:
# Download the KJV text (a copy of the Project Gutenberg file hosted on Google Drive)
import requests
r = requests.get('https://drive.usercontent.google.com/u/0/uc?id=1ayz_ce_xxIQiHzeeMDFMEGRnfbTdnVwJ&export=download')
raw = r.text # get raw text as a corpus

# Print 1000 characters
print(raw[:1000])

# Sentence and word tokenization
import nltk, re, pprint
from nltk import word_tokenize
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

# Treat runs of three or more consecutive line breaks as sentence boundaries by inserting a period.
raw2 = re.sub(r'\r\n\r\n(\r\n)+', ".\r\n", raw)
raw2 = re.sub(r'\.\.', ".", raw2)  # collapse the double periods created by the previous step

# Tokenize the processed string into sentences and words
sentences = sent_tokenize(raw2)
tokens = word_tokenize(raw2)

print("Num. of Sentences: " + str(len(sentences)))
print("Num. of Words: " + str(len(tokens)))

# Print 19 sentences (indices 101-119)
for i in range(101,120):
    print(sentences[i])
    print("-----")

# Locate the end of the Bible text: the sentence just before Gutenberg's "END OF THE PROJECT" footer marker (the marker wording must be known in advance).
finreg = re.compile(r'END OF THE PROJECT')
endline = [(s1-1, sentences[s1-1]) for s1, s2 in enumerate(sentences) if finreg.search(s2)]
print(endline)

# Extract book titles: each book starts at verse 1:1, so take the sentence just before every match
regexp = re.compile(r'^1\:1 ')
titlelist = [(s1-1, sentences[s1-1]) for s1, s2 in enumerate(sentences) if regexp.search(s2)]
print(len(titlelist))
for t in titlelist:
    print(t)
    print('-----')

# Split the text into books.
# The sentences between two consecutive title lines belong to one book; tokenize them
# and collect the tokens (title excluded) into one list per book.
books = []
prev = 0
for i, title in enumerate(titlelist):
    book = []
    if i == 0:
        prev = title[0]
        continue
    for j in range(title[0]-prev-1):
        book += word_tokenize(sentences[prev+j+1])
    books.append(book)
    prev = title[0]
# Last book: runs from the final title up to the sentence before the Gutenberg footer
book = []
for i in range(endline[0][0]-prev):
    book += word_tokenize(sentences[prev+i+1])
books.append(book)

# Split each book into verses.
# Sentence tokenization does not line up with verse boundaries, so start a new
# verse at every token that looks like a chapter:verse reference (e.g. "4:13").
bookplace = []
verses = []
verse = []
place = 0
re_sever = re.compile(r'^[1-9][0-9]*\:[1-9][0-9]*')
for j,book in enumerate(books):
    bookplace.append(place)
    for i, s in enumerate(book):
        if re_sever.search(s):
            if verse != []:
                verses.append(verse)
                place += 1
            verse = []
            verse.append(s)
        else:
            verse.append(s)
    verses.append(verse)
    place += 1
    verse = []
# Check the number of verses (it may fall short of the true count if some references are missed)
print(len(verses))
# Index in `verses` at which each book starts
print(bookplace)

The Project Gutenberg eBook of The King James Version of the Bible
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The King James Version of the Bible

Release date: August 1, 1989 [eBook #10]
                Most recently updated: October 29, 2024

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK THE KING JAMES VERSION OF THE BIBLE ***
The Old Testament of the King James Version of the Bible
The First Book of Moses: Called Genesis
The Second Book of Moses: Called Exodus
The Third Book of Moses: Called Leviticus
The Fourth Book of Moses: Called Numb

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Num. of Sentences: 30002
Num. of Words: 952607
4:13 And Cain said unto the LORD, My punishment is greater than I can
bear.
-----
4:14 Behold, thou hast driven me out this day from the face of the
earth; and from thy face shall I be hid; and I shall be a fugitive and
a vagabond in the earth; and it shall come to pass, that every one
that findeth me shall slay me.
-----
4:15 And the LORD said unto him, Therefore whosoever slayeth Cain,
vengeance shall be taken on him sevenfold.
-----
And the LORD set a mark
upon Cain, lest any finding him should kill him.
-----
4:16 And Cain went out from the presence of the LORD, and dwelt in the
land of Nod, on the east of Eden.
-----
4:17 And Cain knew his wife; and she conceived, and bare Enoch: and he
builded a city, and called the name of the city, after the name of his
son, Enoch.
-----
4:18 And unto Enoch was born Irad: and Irad begat Mehujael: and
Mehujael begat Methusael: and Methusael begat Lamech.
-----
4:19 And Lamech took unto him

In [None]:
# Build a DataFrame from your existing 'verses', 'bookplace', and 'titlelist'
# Assumptions:
# - verses: list of list[str], e.g., ['1:1', 'In', 'the', ...]
# - bookplace: list[int], verse index where each book starts (same logic as your code)
# - titlelist: list[tuple], where the second element is the book title sentence
#   (your code used: titlelist = [(idx, sentences[idx])] so title = t[1])

import re
import pandas as pd

def detok(tokens):
    """Naive detokenization: join with spaces, then remove spaces before punctuation."""
    s = " ".join(tokens).strip()
    s = re.sub(r"\s+([,.;:!?])", r"\1", s)
    return s

# 'books' is built from the text between consecutive titles, so books[0] corresponds to titlelist[0].
# Keep the offset at 0 for that alignment; set it to 1 only if your book list skips the first title.
TITLE_OFFSET = 0

# Build book title list aligned with bookplace length
def extract_title(t):
    # titlelist item could be (idx, title) or (i, j, title). Handle both.
    if isinstance(t, (list, tuple)):
        if len(t) >= 3:
            return t[2]
        elif len(t) >= 2:
            return t[1]
    return str(t)

book_titles = []
for i in range(len(bookplace)):
    ti = min(i + TITLE_OFFSET, len(titlelist) - 1)
    book_titles.append(extract_title(titlelist[ti]) if len(titlelist) > 0 else "UNKNOWN")

rows = []
cur_book_idx = 0

for v_idx, vtoks in enumerate(verses):
    # advance book index when reaching the next book start
    while cur_book_idx + 1 < len(bookplace) and v_idx >= bookplace[cur_book_idx + 1]:
        cur_book_idx += 1

    if not vtoks:
        continue

    # first token should be like "N:M"
    m = re.match(r"^(\d+):(\d+)$", vtoks[0])
    if not m:
        # skip malformed
        continue
    ch_i, vs_i = int(m.group(1)), int(m.group(2))
    text = detok(vtoks[1:])

    rows.append({
        "book":   book_titles[cur_book_idx],
        "chapter": ch_i,
        "verse":   vs_i,
        "text":    text
    })

df = pd.DataFrame(rows, columns=["book", "chapter", "verse", "text"])
df["chapter_verse"] = df["chapter"].astype(str) + ":" + df["verse"].astype(str)
df["doc_id"] = df["book"] + " " + df["chapter_verse"]

print("DataFrame shape:", df.shape)
display(df.head(10))
display(df[(df["chapter"]==1) & (df["verse"]==1)].head(10))


DataFrame shape: (31102, 6)


Unnamed: 0,book,chapter,verse,text,chapter_verse,doc_id
0,The First Book of Moses: Called Genesis.,1,1,In the beginning God created the heaven and th...,1:1,The First Book of Moses: Called Genesis. 1:1
1,The First Book of Moses: Called Genesis.,1,2,"And the earth was without form, and void; and ...",1:2,The First Book of Moses: Called Genesis. 1:2
2,The First Book of Moses: Called Genesis.,1,3,"And God said, Let there be light: and there wa...",1:3,The First Book of Moses: Called Genesis. 1:3
3,The First Book of Moses: Called Genesis.,1,4,"And God saw the light, that it was good: and G...",1:4,The First Book of Moses: Called Genesis. 1:4
4,The First Book of Moses: Called Genesis.,1,5,"And God called the light Day, and the darkness...",1:5,The First Book of Moses: Called Genesis. 1:5
5,The First Book of Moses: Called Genesis.,1,6,"And God said, Let there be a firmament in the ...",1:6,The First Book of Moses: Called Genesis. 1:6
6,The First Book of Moses: Called Genesis.,1,7,"And God made the firmament, and divided the wa...",1:7,The First Book of Moses: Called Genesis. 1:7
7,The First Book of Moses: Called Genesis.,1,8,And God called the firmament Heaven. And the e...,1:8,The First Book of Moses: Called Genesis. 1:8
8,The First Book of Moses: Called Genesis.,1,9,"And God said, Let the waters under the heaven ...",1:9,The First Book of Moses: Called Genesis. 1:9
9,The First Book of Moses: Called Genesis.,1,10,And God called the dry land Earth; and the gat...,1:10,The First Book of Moses: Called Genesis. 1:10


Unnamed: 0,book,chapter,verse,text,chapter_verse,doc_id
0,The First Book of Moses: Called Genesis.,1,1,In the beginning God created the heaven and th...,1:1,The First Book of Moses: Called Genesis. 1:1
1533,The Second Book of Moses: Called Exodus.,1,1,Now these are the names of the children of Isr...,1:1,The Second Book of Moses: Called Exodus. 1:1
2746,The Third Book of Moses: Called Leviticus.,1,1,"And the LORD called unto Moses, and spake unto...",1:1,The Third Book of Moses: Called Leviticus. 1:1
3605,The Fourth Book of Moses: Called Numbers.,1,1,And the LORD spake unto Moses in the wildernes...,1:1,The Fourth Book of Moses: Called Numbers. 1:1
4893,The Fifth Book of Moses: Called Deuteronomy.,1,1,These be the words which Moses spake unto all ...,1:1,The Fifth Book of Moses: Called Deuteronomy. 1:1
5852,The Book of Joshua.,1,1,Now after the death of Moses the servant of th...,1:1,The Book of Joshua. 1:1
6510,The Book of Judges.,1,1,"Now after the death of Joshua it came to pass,...",1:1,The Book of Judges. 1:1
7128,The Book of Ruth.,1,1,Now it came to pass in the days when the judge...,1:1,The Book of Ruth. 1:1
7213,The First Book of Samuel\r\n\r\nOtherwise Call...,1,1,Now there was a certain man of Ramathaimzophim...,1:1,The First Book of Samuel\r\n\r\nOtherwise Call...
8023,The Second Book of Samuel\r\n\r\nOtherwise Cal...,1,1,"Now it came to pass after the death of Saul, w...",1:1,The Second Book of Samuel\r\n\r\nOtherwise Cal...


### Sanity checks
- Are early rows from **Genesis 1**?
- Do we see familiar verses like **Genesis 1:1**?
- Is `book/chapter/verse` monotonic within a book?
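
The monotonicity question can be checked programmatically. The sketch below assumes the rows are available in corpus order as `(book, chapter, verse)` tuples; `is_monotonic_within_books` and the toy `rows` are hypothetical and only illustrate the check.

```python
from itertools import groupby

def is_monotonic_within_books(rows):
    """Return the books whose (chapter, verse) sequence is not strictly increasing.

    rows: iterable of (book, chapter, verse) tuples in corpus order.
    """
    bad = []
    for book, grp in groupby(rows, key=lambda r: r[0]):
        refs = [(ch, vs) for _, ch, vs in grp]
        # Tuples compare lexicographically, so (1, 31) < (2, 1) as desired.
        if any(a >= b for a, b in zip(refs, refs[1:])):
            bad.append(book)
    return bad

rows = [
    ("Genesis", 1, 1), ("Genesis", 1, 2), ("Genesis", 2, 1),
    ("Exodus", 1, 1), ("Exodus", 1, 3), ("Exodus", 1, 2),
]
print(is_monotonic_within_books(rows))
```

With the notebook's DataFrame, the rows could be produced with `df[["book", "chapter", "verse"]].itertuples(index=False, name=None)`.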

In [None]:
# Basic sanity: show first 15 rows and a sample around Genesis 1
display(df.head(15))

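# Spot-check Genesis 1. Titles are stored in full (e.g. "The First Book of Moses: Called Genesis."),
# so an exact comparison with "Genesis" returns an empty frame; match on a substring instead.
g1 = df[df['book'].str.contains("Genesis", regex=False) & (df['chapter']==1)].head(10)
display(g1)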

# Check distinct books detected
print("Books detected:", df['book'].nunique())
print(sorted(df['book'].unique())[:10], "...")

Unnamed: 0,book,chapter,verse,text,chapter_verse,doc_id
0,The First Book of Moses: Called Genesis.,1,1,In the beginning God created the heaven and th...,1:1,The First Book of Moses: Called Genesis. 1:1
1,The First Book of Moses: Called Genesis.,1,2,"And the earth was without form, and void; and ...",1:2,The First Book of Moses: Called Genesis. 1:2
2,The First Book of Moses: Called Genesis.,1,3,"And God said, Let there be light: and there wa...",1:3,The First Book of Moses: Called Genesis. 1:3
3,The First Book of Moses: Called Genesis.,1,4,"And God saw the light, that it was good: and G...",1:4,The First Book of Moses: Called Genesis. 1:4
4,The First Book of Moses: Called Genesis.,1,5,"And God called the light Day, and the darkness...",1:5,The First Book of Moses: Called Genesis. 1:5
5,The First Book of Moses: Called Genesis.,1,6,"And God said, Let there be a firmament in the ...",1:6,The First Book of Moses: Called Genesis. 1:6
6,The First Book of Moses: Called Genesis.,1,7,"And God made the firmament, and divided the wa...",1:7,The First Book of Moses: Called Genesis. 1:7
7,The First Book of Moses: Called Genesis.,1,8,And God called the firmament Heaven. And the e...,1:8,The First Book of Moses: Called Genesis. 1:8
8,The First Book of Moses: Called Genesis.,1,9,"And God said, Let the waters under the heaven ...",1:9,The First Book of Moses: Called Genesis. 1:9
9,The First Book of Moses: Called Genesis.,1,10,And God called the dry land Earth; and the gat...,1:10,The First Book of Moses: Called Genesis. 1:10


Unnamed: 0,book,chapter,verse,text,chapter_verse,doc_id


Books detected: 66
['Amos.', 'Ecclesiastes\r\n\r\nor\r\n\r\nThe Preacher.', 'Ezra.', 'Habakkuk.', 'Haggai.', 'Hosea.', 'Joel.', 'Jonah.', 'Malachi.', 'Micah.'] ...


## 3) Preprocessing Pipeline

- Lowercase normalization
- Tokenization
- Stopword removal
- POS tagging → lemmatization with POS mapping
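
The four stages above can be sketched without NLTK to show how data flows between them. This is a minimal stand-in, not the notebook's implementation: the regex tokenizer, the tiny `STOP` set, and the hand-written tags in `tagged` are assumptions for illustration. The single-letter codes are the values WordNet uses for its part-of-speech constants.

```python
import re

STOP = {"the", "and", "of", "his", "their", "with"}  # tiny illustrative stopword set

def normalize(text):
    """Collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", text.strip()).lower()

def tokenize(text):
    """Keep alphabetic runs only (a rough stand-in for word_tokenize + isalpha)."""
    return re.findall(r"[a-z]+", text)

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[:1], "n")

text = "And Hamor and  Shechem his son came unto the gate"
tokens = [t for t in tokenize(normalize(text)) if t not in STOP]
print(tokens)

# Hand-written tags for illustration; in the notebook these come from pos_tag.
tagged = [("came", "VBD"), ("gate", "NN"), ("greater", "JJR"), ("unto", "IN")]
print([(w, penn_to_wordnet(t)) for w, t in tagged])
```

The POS code matters because the WordNet lemmatizer treats a word as a noun unless told otherwise, so a verb form such as "came" is only reduced to "come" when it is passed with the verb code.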

In [None]:
def normalize_text(s: str) -> str:
    s = s.strip()
    s = re.sub(r'\s+', ' ', s)
    return s.lower()

stop_en = set(stopwords.words('english'))
wnl = WordNetLemmatizer()
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ, ADV

def to_wn_pos(tag: str):
    if tag.startswith('J'): return ADJ
    if tag.startswith('V'): return VERB
    if tag.startswith('N'): return NOUN
    if tag.startswith('R'): return ADV
    return NOUN

def process_text_to_lemmas(text: str) -> List[str]:
    t = normalize_text(text)
    toks = [w for w in word_tokenize(t) if w.isalpha()]
    toks = [w for w in toks if w not in stop_en]
    tags = pos_tag(toks)
    lems = [wnl.lemmatize(w, pos=to_wn_pos(p)) for w, p in tags]
    return lems

# Apply to corpus
df['text_norm'] = df['text'].apply(normalize_text)
df['tokens'] = df['text_norm'].apply(word_tokenize)
df['lemmas'] = df['text'].apply(process_text_to_lemmas)

# Inspect
ix = 1000
print("TEXT :", df.loc[ix, 'text'])
print("TOKS :", df.loc[ix, 'tokens'][:20])
print("LEMS :", df.loc[ix, 'lemmas'][:20])

TEXT : And Hamor and Shechem his son came unto the gate of their city, and communed with the men of their city, saying,
TOKS : ['and', 'hamor', 'and', 'shechem', 'his', 'son', 'came', 'unto', 'the', 'gate', 'of', 'their', 'city', ',', 'and', 'communed', 'with', 'the', 'men', 'of']
LEMS : ['hamor', 'shechem', 'son', 'come', 'unto', 'gate', 'city', 'commune', 'men', 'city', 'say']


## 4) TF-IDF Vectorization

We build a document–term matrix over **lemmas** using `TfidfVectorizer`.
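
To make the weighting concrete, here is a stdlib-only sketch of the formula `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The toy `docs` and the helper `tfidf_vector` are illustrative assumptions, not the notebook's code.

```python
import math
from collections import Counter

docs = [
    ["god", "create", "heaven", "earth"],
    ["earth", "without", "form", "void"],
    ["god", "say", "let", "light"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df_counts = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df_counts.items()}

def tfidf_vector(doc):
    """Return an L2-normalized {term: weight} mapping for one tokenized document."""
    tf = Counter(doc)
    weights = {t: tf[t] * idf[t] for t in tf if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

vec = tfidf_vector(docs[0])
for term, weight in sorted(vec.items()):
    print(f"{term:8s} {weight:.3f}")
```

Terms that occur in fewer documents ("create", "heaven") receive a larger weight than terms shared across documents ("god", "earth").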

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus_texts = [' '.join(lems) for lems in df['lemmas']]

vectorizer = TfidfVectorizer(
    tokenizer=lambda s: s.split(),
    preprocessor=lambda s: s,
    lowercase=False,
    min_df=3
)
X_tfidf = vectorizer.fit_transform(corpus_texts)
X_tfidf.shape



(31102, 5621)

## 5) Query Processing & Cosine Search

**Important:** Apply exactly the **same preprocessing** to the query.
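
The ranking step itself is small. The sketch below assumes sparse vectors stored as `{term: weight}` dicts; `cosine`, `top_k`, and the toy vectors are hypothetical names and data used only to illustrate the idea.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec, doc_vecs, k):
    """Return [(doc_index, score)] for the k most similar documents, best first."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda p: (-p[0], p[1]))  # ties broken by document index
    return [(i, s) for s, i in scored[:k]]

doc_vecs = [
    {"heaven": 1.0, "earth": 1.0},
    {"love": 2.0, "neighbour": 1.0},
    {"earth": 1.0, "void": 1.0},
]
query_vec = {"heaven": 1.0, "earth": 1.0}
print(top_k(query_vec, doc_vecs, 2))

# A query term in a different surface form shares no dimension with the document.
print(cosine({"neighbor": 1.0}, doc_vecs[1]))
```

The last line shows why identical preprocessing matters: a query term that does not match the vocabulary exactly (here `neighbor` versus `neighbour`) contributes nothing to the score.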

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def process_query(q: str) -> List[str]:
    return process_text_to_lemmas(q)

def search_tfidf(query: str, top_k: int = 10):
    qlems = process_query(query)
    q_text = ' '.join(qlems)
    q_vec = vectorizer.transform([q_text])
    sims = cosine_similarity(q_vec, X_tfidf).ravel()
    top_idx = np.argsort(-sims)[:top_k]
    return top_idx, sims[top_idx], qlems

# Try a few queries
for q in ["creation of heaven and earth", "love your neighbor"]:
    idx, scores, qlems = search_tfidf(q, top_k=5)
    print("\nQuery:", q, "| lemmas:", qlems)
    for i, (j, s) in enumerate(zip(idx, scores), 1):
        print(f"{i:>2}. {df.loc[j, 'doc_id']} | score={s:.3f}")
        print("   ", df.loc[j, 'text'])


Query: creation of heaven and earth | lemmas: ['creation', 'heaven', 'earth']
 1. The Gospel According to Saint Mark. 10:6 | score=0.478
    But from the beginning of the creation God made them male and female.
 2. The Fifth Book of Moses: Called Deuteronomy. 10:14 | score=0.439
    Behold, the heaven and the heaven of heavens is the LORD ’ s thy God, the earth also, with all that therein is.
 3. The First Book of Moses: Called Genesis. 2:4 | score=0.436
    These are the generations of the heavens and of the earth when they were created, in the day that the LORD God made the earth and the heavens,
 4. The Epistle of Paul the Apostle to the Romans. 8:22 | score=0.431
    For we know that the whole creation groaneth and travaileth in pain together until now.
 5. The Gospel According to Saint Mark. 13:19 | score=0.426
    For in those days shall be affliction, such as was not from the beginning of the creation which God created unto this time, neither shall be.

Query: love your neighbo

## 6) LSA with TruncatedSVD

- Reduce dimensionality to capture latent co-occurrence patterns
- Compare retrieval before vs. after LSA
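
A toy example may help show what the latent space buys. The sketch below uses NumPy's full SVD instead of `TruncatedSVD` and a hand-made 4×5 document-term matrix; the matrix values, `unit_rows`, and the query are assumptions for illustration. Documents and the query are projected with the top-`k` right singular vectors, then compared by cosine similarity.

```python
import numpy as np

# Rows are documents, columns are terms (hypothetical toy weights).
terms = ["heaven", "earth", "create", "love", "neighbour"]
X = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T              # term-to-topic projection, shape (n_terms, k)
X_lsa = X @ V_k             # documents in the k-dimensional latent space

def unit_rows(M):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.where(norms == 0, 1.0, norms)

X_lsa = unit_rows(X_lsa)

# Fold a query that contains only "create" into the same space.
q = np.array([[0.0, 0.0, 1.0, 0.0, 0.0]])
q_lsa = unit_rows(q @ V_k)
sims = (q_lsa @ X_lsa.T).ravel()
print(np.round(sims, 3))
```

The second document does not contain "create", yet it scores close to the first because both load on the same latent dimension; the two "love" documents stay near zero.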

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

k = 100  # try 50/100/300
svd = TruncatedSVD(n_components=k, random_state=0)
X_lsa = svd.fit_transform(X_tfidf)
X_lsa = Normalizer(copy=False).fit_transform(X_lsa)

def search_lsa(query: str, top_k: int = 10):
    qlems = process_query(query)
    q_text = ' '.join(qlems)
    q_vec = vectorizer.transform([q_text])
    q_lsa = svd.transform(q_vec)
    q_lsa = Normalizer(copy=False).transform(q_lsa)
    sims = cosine_similarity(q_lsa, X_lsa).ravel()
    top_idx = np.argsort(-sims)[:top_k]
    return top_idx, sims[top_idx], qlems

# Side-by-side comparison
query = "make the earth by words"
idx_t, sc_t, qlems_t = search_tfidf(query, top_k=5)
idx_l, sc_l, qlems_l = search_lsa(query, top_k=5)

print("Query:", query)
print("\nTF-IDF:")
for i, (j, s) in enumerate(zip(idx_t, sc_t), 1):
    print(f"{i:>2}. {df.loc[j,'doc_id']} | {s:.3f} | {df.loc[j,'text'][:120]}")

print("\nLSA:")
for i, (j, s) in enumerate(zip(idx_l, sc_l), 1):
    print(f"{i:>2}. {df.loc[j,'doc_id']} | {s:.3f} | {df.loc[j,'text'][:120]}")


Query: make the earth by words

TF-IDF:
 1. The Book of the Prophet Jeremiah. 22:29 | 0.728 | O earth, earth, earth, hear the word of the LORD.
 2. The Gospel According to Saint John. 1:1 | 0.498 | In the beginning was the Word, and the Word was with God, and the Word was God.
 3. The Second Book of Moses: Called Exodus. 34:27 | 0.462 | And the LORD said unto Moses, Write thou these words: for after the tenor of these words I have made a covenant with the
 4. The Book of Psalms. 115:15 | 0.460 | Ye are blessed of the LORD which made heaven and earth.
 5. The Book of Psalms. 148:11 | 0.434 | Kings of the earth, and all people; princes, and all judges of the earth:

LSA:
 1. The Book of Psalms. 105:28 | 0.797 | He sent darkness, and made it dark; and they rebelled not against his word.
 2. The Book of the Prophet Jeremiah. 22:29 | 0.712 | O earth, earth, earth, hear the word of the LORD.
 3. The Third Book of Moses: Called Leviticus. 26:19 | 0.705 | And I will break the pride of your pow

## 7) Highlighting Query Lemmas in Results

In [None]:
def highlight(text: str, lemmas: List[str]) -> str:
    words = word_tokenize(normalize_text(text))
    hits = set(lemmas)
    out = [(f"<b>{w}</b>" if w in hits else w) for w in words]
    return ' '.join(out)

from IPython.display import HTML, display

def show_results(query: str, use_lsa=True, top_k=5):
    idx, scores, qlems = (search_lsa(query, top_k) if use_lsa else search_tfidf(query, top_k))
    print(f"Query: {query} | Lemmas: {qlems} | LSA={use_lsa}")
    for j, s in zip(idx, scores):
        h = highlight(df.loc[j,'text'], qlems)
        display(HTML(f"<div><b>{df.loc[j,'doc_id']}</b> (score={s:.3f})<br/>{h}</div><hr/>"))

show_results("creation the earth by word", use_lsa=True, top_k=5)


Query: creation the earth by word | Lemmas: ['creation', 'earth', 'word'] | LSA=True


## 8) Mini-Evaluation & Ablations

- Create a few queries with expected matches (titles or fragments)
- Compare TF-IDF vs. LSA using Precision@k
- Ablations: without lemmatization, different `k`, add bigrams
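
Precision@k is the fraction of the top-`k` results that are relevant. A minimal sketch, assuming relevance is given as a set of document ids; the `ranked` list and the `relevant` gold labels are made up for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked ids that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k

ranked = ["Mark 10:6", "Deuteronomy 10:14", "Genesis 2:4", "Romans 8:22", "Mark 13:19"]
relevant = {"Genesis 1:1", "Genesis 2:4"}  # hypothetical gold labels
print(precision_at_k(ranked, relevant, 5))
```

Dividing by `k` rather than by the number of returned results penalizes a system that returns fewer than `k` hits.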

In [None]:
# Simple "gold" standard based on substrings of the verse text (no hand-labelled relevant verses are available)
gold = {
    "love your neighbor": ["love your neighbour", "love thy neighbour", "love your neighbor"],  # spelling variations
    "heaven and earth": ["heaven and earth"],
}

def precision_at_k_text_contains(query, expected_phrases, search_fn, k=5):
    idx, _scores, _q = search_fn(query, top_k=k)
    hits = 0
    for j in idx:
        txt = df.loc[j,'text'].lower()
        if any(ph.lower() in txt for ph in expected_phrases):
            hits += 1
    return hits / k

for q, phrases in gold.items():
    p_t = precision_at_k_text_contains(q, phrases, search_tfidf, k=5)
    p_l = precision_at_k_text_contains(q, phrases, search_lsa,   k=5)
    print(f"{q}\n  TF-IDF P@5 = {p_t:.2f}\n  LSA    P@5 = {p_l:.2f}")


love your neighbor
  TF-IDF P@5 = 0.00
  LSA    P@5 = 0.00
heaven and earth
  TF-IDF P@5 = 0.20
  LSA    P@5 = 0.00


## 9) Optional: Bigrams and Synonym Expansion

- Bigrams in TF-IDF to capture short phrases
- WordNet synonyms to expand the query (may improve recall)
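
For intuition, this is what unigram-plus-bigram features look like for one lemma sequence. It is a stdlib sketch of the feature set that `ngram_range=(1, 2)` asks for; the helper `ngrams` is a hypothetical name.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return space-joined n-grams for every n in [n_min, n_max], shorter n first."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

lemmas = ["love", "thy", "neighbour"]
print(ngrams(lemmas))
```

Bigrams such as `thy neighbour` only exist if both words survive preprocessing, so stopword removal changes which phrases can be captured.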


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

bigram_vectorizer = TfidfVectorizer(
    tokenizer=lambda s: s.split(),
    preprocessor=lambda s: s,
    lowercase=False,
    min_df=3,
    ngram_range=(1,2)
)
X_bi = bigram_vectorizer.fit_transform(corpus_texts)

from sklearn.metrics.pairwise import cosine_similarity

def search_tfidf_bigram(query: str, top_k: int = 10):
    qlems = process_query(query)
    q_text = ' '.join(qlems)
    q_vec = bigram_vectorizer.transform([q_text])
    sims = cosine_similarity(q_vec, X_bi).ravel()
    top_idx = np.argsort(-sims)[:top_k]
    return top_idx, sims[top_idx], qlems

# WordNet synonym expansion
from nltk.corpus import wordnet as wn

def expand_with_synonyms(lemmas: List[str], max_syns_per_word=2):
    expanded = set(lemmas)
    for w in lemmas:
        for syn in wn.synsets(w)[:1]:  # limit per word
            for l in syn.lemmas()[:max_syns_per_word]:
                if l.name().isalpha():
                    expanded.add(l.name().lower())
    return list(expanded)

def search_tfidf_expanded(query: str, top_k: int = 10):
    qlems = process_query(query)
    qlems_exp = expand_with_synonyms(qlems)
    q_text = ' '.join(qlems_exp)
    q_vec = vectorizer.transform([q_text])
    sims = cosine_similarity(q_vec, X_tfidf).ravel()
    top_idx = np.argsort(-sims)[:top_k]
    return top_idx, sims[top_idx], qlems_exp

# Try variants
for fn in [search_tfidf, search_tfidf_bigram, search_tfidf_expanded]:
    idx, sc, qlems = fn("love your neighbor", top_k=5)
    print(f"\nVariant: {fn.__name__} | query lemmas: {qlems}")
    for j, s in zip(idx, sc):
        print(f"  {df.loc[j,'doc_id']} | {s:.3f} | {df.loc[j,'text'][:100]}")





Variant: search_tfidf | query lemmas: ['love', 'neighbor']
  The Gospel According to Saint Luke. 6:32 | 0.855 | For if ye love them which love you, what thank have ye? for sinners also love those that love them.
  The First Epistle General of John. 4:19 | 0.846 | We love him, because he first loved us.
  The Gospel According to Saint John. 15:12 | 0.763 | This is my commandment, That ye love one another, as I have loved you.
  The Proverbs. 8:17 | 0.720 | I love them that love me; and those that seek me early shall find me.
  The First Epistle General of John. 4:10 | 0.707 | Herein is love, not that we loved God, but that he loved us, and sent his Son to be the propitiation

Variant: search_tfidf_bigram | query lemmas: ['love', 'neighbor']
  The First Epistle General of John. 4:19 | 0.698 | We love him, because he first loved us.
  The Gospel According to Saint Luke. 6:32 | 0.569 | For if ye love them which love you, what thank have ye? for sinners also love those that love them.
  Th

## 10) Wrap-Up

**Key takeaways**
- End-to-end pipeline: preprocess → TF-IDF → LSA → cosine search
- Keep the **same preprocessing** for corpus and queries
- LSA can retrieve semantically related sentences despite lexical variation, although on the two mini-evaluation queries above it did not beat plain TF-IDF

**Next**
- Statistical sequence models (HMM) and neural sequence models (RNN/LSTM)
- Toward semantic search with embeddings and Q&A over passages
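
The pipeline summarized in the key takeaways can be sketched on a toy corpus. This is a minimal illustration, not the notebook's actual cells: the four sentences, the `toy_` names, the stand-in `toy_preprocess` function, and `n_components=2` are all assumptions chosen to keep the example small. The point it shows is that one preprocessing function feeds both the corpus and the query, and that the query passes through the same fitted TF-IDF and LSA transforms before cosine similarity is computed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (illustrative only, not the notebook's data)
toy_docs = [
    "love thy neighbour as thyself",
    "we love him because he first loved us",
    "the king went up to the city with his army",
    "the army of the king camped by the river",
]

def toy_preprocess(text: str) -> str:
    # Stand-in for the notebook's preprocessing; used for corpus AND queries
    return text.lower().strip()

# Step 1: TF-IDF fitted on the preprocessed corpus
toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform([toy_preprocess(d) for d in toy_docs])

# Step 2: LSA = truncated SVD of the TF-IDF matrix, then L2-normalize the rows
toy_lsa = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
toy_Z = toy_lsa.fit_transform(toy_X)

def toy_search(query: str, top_k: int = 2):
    # Step 3: push the query through the same fitted transforms, then rank by cosine
    q = toy_lsa.transform(toy_vectorizer.transform([toy_preprocess(query)]))
    sims = cosine_similarity(q, toy_Z).ravel()
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

toy_order, toy_scores = toy_search("the king and his army")
for i, s in zip(toy_order, toy_scores):
    print(f"{s:.3f} | {toy_docs[i]}")
```

On a real corpus the number of LSA components is a tuning choice; two components are only workable here because the toy corpus has two obvious topics.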


In [None]:
# === Preamble (run once) ===
!pip -q install nltk scikit-learn pandas numpy

import re, requests, numpy as np, pandas as pd
import nltk; nltk.download('punkt')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# === Problem 1: Build df(book, chapter, para, text) ===

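URL = "https://www.gutenberg.org/files/2600/2600-0.txt"
resp = requests.get(URL, timeout=60)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
raw = resp.text
text = raw.replace("\r\n", "\n")
# Drop everything from the Project Gutenberg end-of-book marker onward
m_end = re.search(r"END OF THE PROJECT GUTENBERG EBOOK", text, flags=re.IGNORECASE)
if m_end:
    text = text[:m_end.start()]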

# Split into paragraphs by blank lines
paras_raw = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]

# Relaxed heading patterns
BOOK_PAT      = re.compile(r'^\s*BOOK\s+([A-Z]+)\b.*$', re.IGNORECASE)
EPILOGUE_PAT  = re.compile(r'^\s*EPILOGUE(?:\s*:\s*PART\s+([IVXLCM]+)|\s+PART\s+([IVXLCM]+))?\b.*$', re.IGNORECASE)
CHAPTER_PAT   = re.compile(r'^\s*CHAPTER\s+([IVXLCM]+)\b.*$', re.IGNORECASE)

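# Subtractive pairs (CM, CD, XC, ...) are listed so they can be matched as two-character tokens
ROMAN = {'M':1000,'CM':900,'D':500,'CD':400,'C':100,'XC':90,'L':50,'XL':40,'X':10,'IX':9,'V':5,'IV':4,'I':1}
def roman_to_int(s: str) -> int:
    """Convert a Roman numeral such as 'XIV' to an int; unknown characters count as 0."""
    s = s.upper()
    i = 0
    v = 0
    while i < len(s):
        # Prefer a two-character subtractive pair when one starts at position i
        if i + 1 < len(s) and s[i:i+2] in ROMAN:
            v += ROMAN[s[i:i+2]]
            i += 2
        else:
            v += ROMAN.get(s[i], 0)
            i += 1
    return v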

records = []
cur_book, cur_chapter, para_in_chapter = None, None, 0

for p in paras_raw:
    first = p.splitlines()[0].strip()

    # TODO(1): detect book heading into cur_book
    mb = BOOK_PAT.match(first)
    if mb:
        cur_book = f"BOOK {mb.group(1).upper()}"
        cur_chapter = None
        para_in_chapter = 0
        continue

    me = EPILOGUE_PAT.match(first)
    if me:
        part = me.group(1) or me.group(2)
        cur_book = "EPILOGUE" if not part else f"EPILOGUE PART {part.upper()}"
        cur_chapter = None
        para_in_chapter = 0
        continue

    # TODO(2): detect chapter heading into cur_chapter (roman_to_int)
    mc = CHAPTER_PAT.match(first)
    if mc:
        cur_chapter = roman_to_int(mc.group(1))
        para_in_chapter = 0
        continue

    # TODO(3): if cur_book & cur_chapter set, append this paragraph to records
    if cur_book is not None and cur_chapter is not None:
        para_in_chapter += 1
        body = re.sub(r'\s+', ' ', p).strip()
        records.append({
            "book": cur_book,
            "chapter": int(cur_chapter),
            "para": para_in_chapter,
            "text": body
        })

df = pd.DataFrame(records, columns=["book","chapter","para","text"])

# ---- Grading (do not edit) ----
def canonize_opening(s: str) -> str:
    s = (s.replace("\u201c", '"').replace("\u201d", '"')
           .replace("\u2018", "'").replace("\u2019", "'")
           .replace("\u2014", "-").replace("\u00A0", " "))
    s = re.sub(r"\s+", " ", s).lstrip('"\''"“”‘’").lower()
    return s

assert not df.empty and {"book","chapter","para","text"}.issubset(df.columns)
# The first paragraph of BOOK ONE, CHAPTER 1 must match the canonical opening after normalization
p0 = (df[(df["book"]=="BOOK ONE") & (df["chapter"]==1)].sort_values("para").iloc[0]["text"])
gt_prefix = "well, prince, so genoa and lucca are now just family estates of the buonapartes"
assert canonize_opening(p0).startswith(gt_prefix), "Opening paragraph mismatch. Check parsing."
print("✅ Problem 1 passed.")
print(df.shape, "rows parsed.")


✅ Problem 1 passed.
(11345, 4) rows parsed.
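
The heading-driven parsing in Problem 1 can be checked on a tiny hand-written sample before running it against the full download. The sketch below is a simplified illustration under stated assumptions: the sample text is invented, the `demo_` / `DEMO_` names are hypothetical, and the epilogue pattern is left out. It shows how a book heading resets the chapter, how a chapter heading resets the paragraph counter, and how wrapped lines are collapsed into one paragraph.

```python
import re

# Hypothetical miniature input, not taken from the Gutenberg file
demo_sample = """BOOK ONE: 1805

CHAPTER I

First paragraph of chapter one.

Second paragraph,
wrapped over two lines.

CHAPTER XIV

Only paragraph of chapter fourteen.
"""

DEMO_BOOK = re.compile(r'^\s*BOOK\s+([A-Z]+)\b', re.IGNORECASE)
DEMO_CHAPTER = re.compile(r'^\s*CHAPTER\s+([IVXLCM]+)\b', re.IGNORECASE)
DEMO_ROMAN = {'M': 1000, 'CM': 900, 'D': 500, 'CD': 400, 'C': 100, 'XC': 90,
              'L': 50, 'XL': 40, 'X': 10, 'IX': 9, 'V': 5, 'IV': 4, 'I': 1}

def demo_roman_to_int(s: str) -> int:
    # Same idea as the notebook's converter: try a two-character pair first
    s = s.upper()
    i, v = 0, 0
    while i < len(s):
        if i + 1 < len(s) and s[i:i+2] in DEMO_ROMAN:
            v += DEMO_ROMAN[s[i:i+2]]
            i += 2
        else:
            v += DEMO_ROMAN.get(s[i], 0)
            i += 1
    return v

demo_rows = []
demo_book, demo_chapter, demo_para = None, None, 0
for block in re.split(r'\n\s*\n', demo_sample):
    block = block.strip()
    if not block:
        continue
    first = block.splitlines()[0]
    mb = DEMO_BOOK.match(first)
    if mb:
        # A new book clears the chapter so stray text before "CHAPTER" is skipped
        demo_book, demo_chapter, demo_para = f"BOOK {mb.group(1).upper()}", None, 0
        continue
    mc = DEMO_CHAPTER.match(first)
    if mc:
        demo_chapter, demo_para = demo_roman_to_int(mc.group(1)), 0
        continue
    if demo_book is not None and demo_chapter is not None:
        demo_para += 1
        demo_rows.append((demo_book, demo_chapter, demo_para, re.sub(r'\s+', ' ', block)))

for row in demo_rows:
    print(row)
```

Because only the first line of each blank-line-separated block is tested against the heading patterns, a body paragraph that merely mentions "chapter" later in its text is not mistaken for a heading.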


In [None]:
# === Problem 2: TF-IDF self-retrieval ===

# Keep a positional row id; written so that re-running the cell does not add duplicate columns
df = df.reset_index(drop=True)
df["row_id"] = df.index

def normalize_simple(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^a-z0-9'\- ]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_norm"] = df["text"].apply(normalize_simple)

# TODO(1): Fit TF-IDF on text_norm
tfidf = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1,1))
X_tfidf = tfidf.fit_transform(df["text_norm"])

def search_tfidf(query: str, topk=5):
    qv = tfidf.transform([normalize_simple(query)])
    sims = cosine_similarity(qv, X_tfidf).ravel()
    idx = np.argsort(-sims)[:topk]
    return idx, sims[idx]

# Use the exact first paragraph text as the query
p0_row = (df[(df["book"]=="BOOK ONE") & (df["chapter"]==1)].sort_values("para").iloc[0])
p0_idx = (df[(df["book"]=="BOOK ONE") & (df["chapter"]==1)].sort_values("para").index[0])
p0_text = p0_row["text"]

# TODO(2): search and check Top-1
idx, sims = search_tfidf(p0_text, topk=1)
hit = df.iloc[idx[0]]

# ---- Grading (do not edit) ----
assert idx[0] == p0_idx, "Top-1 is not the same paragraph (row index mismatch)."
assert (hit["book"] == p0_row["book"]
        and int(hit["chapter"]) == int(p0_row["chapter"])
        and int(hit["para"]) == int(p0_row["para"])), "Top-1 differs in book/chapter/para."
assert sims[0] >= 0.90, f"Cosine too low with TF-IDF ({sims[0]:.3f})."
print("✅ Problem 2 passed.")
print({"book": hit["book"], "chapter": int(hit["chapter"]), "para": int(hit["para"])}, sims[0])


✅ Problem 2 passed.
{'book': 'BOOK ONE', 'chapter': 1, 'para': 1} 1.0000000000000002
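
The score of `1.0000000000000002` above is what self-retrieval should produce. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and a query that is the exact text of a stored paragraph goes through the same normalization and vocabulary, so it maps to the same unit vector and its cosine with that row is 1 up to floating-point rounding. The sketch below checks this on a toy corpus; the three sentences and the `mini_` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy paragraphs (illustrative only)
mini_paras = [
    "the prince entered the drawing room",
    "the army crossed the river at dawn",
    "the princess spoke of the war with the prince",
]

mini_tfidf = TfidfVectorizer()  # default norm='l2' gives unit-length rows
mini_X = mini_tfidf.fit_transform(mini_paras)

# Query with the exact text of paragraph 1
mini_q = mini_tfidf.transform([mini_paras[1]])

# The query vector reproduces the stored row
mini_same_row = np.allclose(mini_q.toarray(), mini_X[1].toarray())

mini_sims = cosine_similarity(mini_q, mini_X).ravel()
mini_best = int(np.argmax(mini_sims))

# Every non-empty TF-IDF row has Euclidean length 1
mini_norms = np.sqrt(np.asarray(mini_X.multiply(mini_X).sum(axis=1)).ravel())

print("same row:", mini_same_row)
print("best match index:", mini_best)
print("row norms close to 1:", np.allclose(mini_norms, 1.0))
```

This is also why a check such as `sims[0] >= 0.90` is a loose bound for self-retrieval: a score noticeably below 1 would suggest that the query and the corpus were normalized differently.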
