# Step 1: Convert a Book to Markdown
This notebook converts a book to Markdown. The input can be either a PDF or an EPUB file. The custom `textPreprocessing` module performs the conversion and then cleans the resulting Markdown.
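
Because the input can be in either format, it is safer to derive the output name from the file's suffix than from a literal `.epub` string. A minimal sketch, assuming a hypothetical helper `markdown_path_for` that is not part of `textPreprocessing`:

```python
from pathlib import Path

# Hypothetical helper (not part of textPreprocessing): derive the Markdown
# output path from a PDF or EPUB input path.
SUPPORTED_SUFFIXES = {".pdf", ".epub"}

def markdown_path_for(input_file: str) -> str:
    path = Path(input_file)
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported book format: {path.suffix!r}")
    # with_suffix swaps only the final extension and keeps the rest of the name
    return str(path.with_suffix(".md"))
```

The result could then replace the `input_file.replace('.epub', '.md')` expression, so the same cell would also work for a `.pdf` input.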

In [1]:
# Import the standard library and the custom preprocessing module
import os
import textPreprocessing as tp

input_file = 'Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_EBOOK_v103.epub'
output_markdown = input_file.replace('.epub', '.md')

# Stop early if the input file is missing, since the conversion below would fail
if not os.path.exists(input_file):
    raise FileNotFoundError(f"Input file {input_file} not found. Please check the path.")
print(f"Found input file: {input_file}")

converted_path = tp.convert_book_to_markdown(input_file, output_markdown)
print(f"Markdown version saved to: {converted_path}")

# Read the converted Markdown back in so it can be cleaned
with open(output_markdown, 'r', encoding='utf-8') as file:
    markdown = file.read()
if not markdown:
    raise ValueError("Markdown file is empty. Please check the conversion process.")

cleaned_markdown = tp.clean_markdown_text(markdown)

# Save the cleaned markdown text to a new file
cleaned_markdown_path = converted_path.replace('.md', '_cleaned.md')
with open(cleaned_markdown_path, 'w', encoding='utf-8') as file:
    file.write(cleaned_markdown)
print(f"Cleaned markdown version saved to: {cleaned_markdown_path}")

Found input file: Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_EBOOK_v103.epub
Markdown version saved to: Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_EBOOK_v103.md
Cleaned markdown version saved to: Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_EBOOK_v103_cleaned.md




# Step 2: Text Segmentation
The cleaned text is then split into smaller segments so that each one can be embedded separately. The custom `textSegmentation` module provides two segmentation methods:
1. **NLTK**: This method uses the TextTiling algorithm from the Natural Language Toolkit (NLTK).
2. **LangChain**: This method uses a text splitter from the LangChain library, configured with a chunk size and a chunk overlap.
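
To illustrate what the `chunk_size` and `chunk_overlap` parameters control, here is a minimal character-based sketch. It is not LangChain's splitter, which also tries to break on separators, and `chunk_with_overlap` is a hypothetical name used only for illustration.

```python
def chunk_with_overlap(text, chunk_size=512, chunk_overlap=50):
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.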

In [None]:
# Import the segmentation functions from your module
import textSegmentation as ts

# Load your text from a file (e.g., your Markdown version of the book)
def load_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

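# Reuse the cleaned text from Step 1; call load_text(cleaned_markdown_path) when starting from the saved file
book_text = cleaned_markdown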

method = 'NLTK'
# method = 'LangChain'

segments = []

if method == 'NLTK':
    print("Using NLTK for text segmentation.")

    # Option 1: Segment using NLTK's TextTiling
    segments = ts.segment_text_texttiling(book_text)
    print("NLTK TextTiling produced", len(segments), "segments.")

elif method == 'LangChain':
    print("Using LangChain for text segmentation.")

    # Option 2: Segment using LangChain's splitter
    try:
        segments = ts.segment_text_langchain(book_text, chunk_size=512, chunk_overlap=50)
        print("LangChain splitter produced", len(segments), "chunks.")
    except ImportError:
        print("LangChain not installed. Please install it via 'pip install langchain'.")
else:
    raise ValueError(f"Unknown segmentation method: {method}")


# Save the segments to a file
ts.save_segments_to_file(segments, input_file.replace('.epub', '_segments.txt'))

Using NLTK for text segmentation.
NLTK TextTiling produced 374 segments.
![cover.jpg](image/cover.jpg)

![](image/1.png)

Copyright © 2020 Eric Jorgenson

All rights reserved.

ISBN: 978-1-5445-1420-8

This book has been created as a public service. It is available for free download in pdf and e-reader versions on [Navalmanack.com](https://Navalmanack.com). Naval is not earning any money on this book. Naval has essays, podcasts and more at [Nav.al](https://Nav.al) and is on Twitter @Naval.

For my parents, who gave me everything and always seem to find a way to give  ...



# Step 3: Embedding Generation
Each segment is then encoded into an embedding vector with the Sentence Transformers library, using the `all-mpnet-base-v2` model.
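
The progress bar total of 12 in this cell's output follows from the segment count and the batch size. A small worked check, using the 374 segments reported in Step 2 and the `batch_size=32` passed to `model.encode`:

```python
import math

num_segments = 374   # segment count reported by TextTiling in Step 2
batch_size = 32      # value passed to model.encode

# Each batch holds up to batch_size segments; the last one may be partial.
num_batches = math.ceil(num_segments / batch_size)
last_batch_size = num_segments - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)
```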

In [3]:
import logging
from sentence_transformers import SentenceTransformer

# Set logging to INFO level so that you can see more details in console output
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Encode with the progress bar enabled
embeddings = model.encode(segments, show_progress_bar=True, batch_size=32)

2025-04-08 21:04:54,573 - INFO - Use pytorch device_name: cuda:0
2025-04-08 21:04:54,573 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


Batches:   0%|          | 0/12 [00:00<?, ?it/s]

# Step 4: Post Processing
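
This step pairs each text segment with its embedding vector and writes the result to a JSON file. The internals of `textPostprocessing` are not shown in this notebook, so the following is only a sketch of one plausible layout: the function names `build_records` and `save_records` are hypothetical, and the `id`, `text`, and `embedding` field names are assumptions.

```python
import json

def build_records(segments, embeddings):
    """Pair each segment with its vector (assumed layout: id, text, embedding)."""
    if len(segments) != len(embeddings):
        raise ValueError("segments and embeddings must have the same length")
    return [
        # float() turns NumPy scalars into plain floats that json can serialize
        {"id": i, "text": text, "embedding": [float(x) for x in vector]}
        for i, (text, vector) in enumerate(zip(segments, embeddings))
    ]

def save_records(records, output_path):
    """Write the records to a UTF-8 JSON file."""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
```

Storing the text next to its vector keeps the file self-contained, so a later query step would need only this one file and the model.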


In [4]:
import textPostprocessing as tpost

mappedEmbeddingOutput = input_file.replace('.epub', '_mapped_embeddings.json')
tpost.map_text_to_embeddings(segments, embeddings, mappedEmbeddingOutput)



Mapping of text segments to embeddings saved to: Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_EBOOK_v103_mapped_embeddings.json


# Step 5: Querying the Embeddings
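
Querying amounts to embedding the query with the same model and ranking the stored segments by similarity to it. The sketch below ranks by cosine similarity on plain Python lists. Whether `textQuerying` uses cosine similarity is an assumption, `top_k_matches` is a hypothetical name, and the record layout (`id`, `text`, `embedding`) is assumed as well.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k_matches(query_vector, records, top_k=5):
    """Score every record against the query and keep the best top_k."""
    scored = [
        {
            "id": r["id"],
            "text": r["text"],
            "similarity": cosine_similarity(query_vector, r["embedding"]),
        }
        for r in records
    ]
    scored.sort(key=lambda m: m["similarity"], reverse=True)
    return scored[:top_k]
```

A linear scan like this is adequate for a few hundred segments; a much larger collection would usually call for a vector index instead.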

In [8]:
import textQuerying as tq

# Specify the path to your mapped embedding file.
mapping_file = mappedEmbeddingOutput

# Define a query string.
query = "The importance of self-awareness in decision-making."

# Load your SentenceTransformer model (should be the same model used in step 3).
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Retrieve the top 5 matching segments.
top_matches = tq.query_mapped_embeddings(query, mapping_file, model, top_k=5)

# Print out the top matching segments
for match in top_matches:
    print(f"Segment ID: {match['id']}")
    print(f"Similarity: {match['similarity']:.4f}")
    print("Text snippet:", match['text'], "...")
    print("-" * 80)

2025-04-08 22:29:01,818 - INFO - Use pytorch device_name: cuda:0
2025-04-08 22:29:01,818 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Segment ID: 149
Similarity: 0.6351
Text snippet: Self-serving conclusions should have a higher bar.

I do view a lot of my goals over the next few years of unconditioning previous learned responses or habituated responses, so I can make decisions more cleanly in the moment without relying on memory or prepackaged heuristics and judgments. [4]

Almost all biases are time-saving heuristics. For important decisions, discard memory and identity, and focus on the problem. ...
--------------------------------------------------------------------------------
Segment ID: 264
Similarity: 0.5429
Text snippet: If I saw a guy with a bad hair day, I would at first think “Haha, he has a bad hair day.” Well, why am I laughing at him to make me feel better about myself? And why am I trying to make me feel better about my own hair? Because I’m losing my hair, and I’m afraid it’s going to go away. What I find is 90 percent of thoughts I have are fear-based. The other 10 percent may be desire- based.

You