## 1. Data Loading

In [11]:
# Load the data from a PDF file and extract the text

from langchain.document_loaders import PyMuPDFLoader
# Load the PDF file
loader = PyMuPDFLoader("pdf_file/AI, Automation, and War The Rise of a Military-Tech Complex (Anthony King).pdf")
documents = loader.load()

In [12]:
# Remove pages that are mostly whitespace or very short
documents = [
    doc for doc in documents
    if len(doc.page_content.strip()) > 100  # adjustable threshold
]

In [13]:
# Skip front matter: keep only pages whose "page" metadata value exceeds 8
documents = [doc for doc in documents if doc.metadata["page"] > 8]

In [14]:
# Clean the extracted pdf text

import re

def clean_text(text: str) -> str:
    text = text.replace('\x0c', '')                 # drop form feeds (page-break character)
    text = re.sub(r'[ \t]+\n', '\n', text)          # strip trailing spaces/tabs before newlines
    text = re.sub(r'\n{3,}', '\n\n', text)          # collapse runs of blank lines to a single one
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)      # replace non-ASCII runs (incl. curly quotes, dashes) with a space
    text = re.sub(r' +', ' ', text)                 # collapse repeated spaces
    return text.strip()


In [15]:
# Apply cleaning
for doc in documents:
    doc.page_content = clean_text(doc.page_content)

In [18]:
# Display a sample of the extracted text after cleaning
for doc in documents[100:101]:
    print(f"\n--- Page {doc.metadata['page']} ---")
    print(doc.page_content[:1000])


--- Page 110 ---
98 CHAPTER 5
They have focused on the technology to which they impute often unfea 
sible powers ignoring the organisational transformations which have made
and will make the military application of AI pos si ble. Blinded by the remark 
able technical powers of AI, many have overlooked this human collaboration
between the tech sector and Special Operations Forces. The collaboration is
on a small scale, and it is often mundane. It is always discreet, often covert,
and sometimes classified.
The Special Operations Forces are but one small node in the armed forces.
Yet the emergent connection between the tech sector and the Special Opera 
tions Forces is a significant development. In the defence sector, the Special
Operations Forces have become the supersellers. They are the market leaders
in defence, the ones whom other forces tend to follow and imitate. Through
their advocacy for and application of AI, it is likely that the Special Operations
Forces will accelerate the t
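
The sample above still contains words split across line breaks (for example `unfea` / `sible` and `remark` / `able`). A minimal sketch of one possible repair follows. It assumes that a lowercase letter followed by a single trailing space, a newline, and another lowercase letter marks a word that was hyphenated in the PDF; that assumption is a guess from this one sample and could wrongly join two separate words, for instance where a dash was stripped at a line end. The helper name `rejoin_split_words` is hypothetical and not part of the notebook.

```python
import re

# Assumption: "<lowercase> \n<lowercase>" is the residue of a hyphenated line break.
_SPLIT_WORD = re.compile(r"([a-z]) \n([a-z])")

def rejoin_split_words(text: str) -> str:
    """Join word halves separated by a trailing space and a newline."""
    return _SPLIT_WORD.sub(r"\1\2", text)

sample = "often unfea \nsible powers ignoring\nthe remark \nable technical"
print(rejoin_split_words(sample))
```

Splits that occur mid-line, such as `pos si ble`, are not touched by this pattern and would need a dictionary-based check instead.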

In [48]:
import re

def detect_back_matter_start(documents, threshold: float = 0.8) -> int | None:
    """
    Detect the index in the document list where back matter begins (e.g., References, Bibliography, Index, etc.)

    Parameters:
    ----------
    documents : List[Document]
        The list of LangChain Document objects (e.g., from PyMuPDFLoader)
    threshold : float
        Fraction of the book (default 0.8) after which back matter is expected.

    Returns:
    -------
    int | None
        Index of the first back matter page, or None if not found.
    """

    back_keywords = ["bibliography", "references", "index", "appendix", "notes"]
    # Match a keyword at the start of a line, followed by a word boundary
    header_pattern = re.compile(rf"^({'|'.join(back_keywords)})\b")
    total_docs = len(documents)

    # Record each document's position in the list (overwrites any existing "index" value)
    for idx, doc in enumerate(documents):
        doc.metadata["index"] = idx

    # Only scan the pages after the first `threshold` fraction of the book
    search_start = int(total_docs * threshold)

    for i in range(search_start, total_docs):
        doc = documents[i]
        text = doc.page_content.lower()

        # Collect short lines (3-40 characters), which are likely section titles
        lines = text.splitlines()
        short_lines = [line.strip() for line in lines if 3 <= len(line.strip()) <= 40]

        for line in short_lines:
            if header_pattern.match(line):
                print(f"🟡 Back matter detected on page {doc.metadata.get('page', 'unknown')} at index {i}")
                print(f"➡️ Section header: {line}")
                return i

    # If nothing found
    print("✅ No back matter section found with current heuristic.")
    return None

In [49]:
# Detect start of back matter
back_start_index = detect_back_matter_start(documents)

# Split the documents
if back_start_index is not None:
    main_docs = documents[:back_start_index]
    back_docs = documents[back_start_index:]
else:
    main_docs = documents
    back_docs = []

print(f"Main content: {len(main_docs)} pages | Back matter: {len(back_docs)} pages")

🟡 Back matter detected on page 197 at index 186
➡️ Section header: notes
Main content: 186 pages | Back matter: 44 pages
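
The detection step rests on a simple line-level test: a short line (3 to 40 characters) that begins with one of the back-matter keywords. The sketch below isolates that test in a hypothetical helper, `looks_like_back_matter_header` (not part of the notebook), to show which lines the heuristic accepts and which it rejects.

```python
import re

# Same keyword pattern as the detection function above
HEADER = re.compile(r"^(bibliography|references|index|appendix|notes)\b")

def looks_like_back_matter_header(line: str) -> bool:
    """Return True for a short line that starts with a back-matter keyword."""
    stripped = line.strip().lower()
    return 3 <= len(stripped) <= 40 and HEADER.match(stripped) is not None

for candidate in ["NOTES", "Notes on Chapter 1", "Indexes and tables", "See the index"]:
    print(f"{candidate!r}: {looks_like_back_matter_header(candidate)}")
```

A line such as `Notes on Chapter 1` also passes the test, which is one reason to restrict the scan to the tail of the book.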


In [50]:
# Print a sample of back docs
back_docs[0].page_content

'NOTES\n1. Robot Wars\n1. Ray Kurzweil, The Singularity is Near: When Humans Transcend Biology (London: Duck \nworth, 2005).\n2. Ray Kurzweil, The Singularity is Nearer: When We Merge with AI (Oxford: Bodley Head,\n2024).\n3. Kurzweil, The Singularity is Nearer, 10.\n4. James Lovelock, The Novacene: The Coming Age of Hyperintelligence (London: Penguin\nBooks, 2020), 111.\n5. Mustafa Suleyman with Michael Bhaskar, The Coming Wave: AI, Power and the Twenty- first\n Century s Greatest Dilemma (London: Bodley Head, 2023), 3.\n6. Melanie Mitchell, Artificial Intelligence: A Guide for Thinking Humans (London: Pelican,\n2019), 198 9.\n7. Suleyman, The Coming Wave, 53.\n8. Suleyman, The Coming Wave, 51.\n9. Suleyman, The Coming Wave, 53.\n10. Marcus du Sautoy, The Creativity Code: Art and Innovation in the Age of AI (Cambridge,\nMA: The Belknap Press, 2019), 31.\n11. Matthew Sparkes, DeepMind s Protein- Folding AI Cracks Biology s Biggest Prob lem ,\nNew Scientist, 28 July 2022, https:// www .

## 2. Data Chunking

In [23]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", "!", "?", " ", ""]
)
chunks = text_splitter.split_documents(main_docs)

In [24]:
type(chunks)

list

In [25]:
len(chunks)

930
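
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on the listed separators); the function name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

demo_text = "".join(str(i % 10) for i in range(2000))
demo_chunks = sliding_window_chunks(demo_text)
print([len(c) for c in demo_chunks])               # [800, 800, 600]
print(demo_chunks[0][-100:] == demo_chunks[1][:100])  # True
```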

## 3. Data Embeddings (Convert text to numerical vector space)

In [52]:
main_docs[-1].page_content

'War at the Speed of Light 183\nwhich wants to wield military power has to embrace AI with which to enable\nand augment its armed forces. To do other wise would be like failing to adopt\ngunpowder, airpower, tanks, or aircraft or perhaps, even more aptly, failing\nto adopt mapping and charts.\nYet AI is not miraculous; it is not magic. AI offers novel capabilities, but its\npotential can be harnessed only through profound organisational reformation.\nA new relationship between the armed forces and the tech sector is required.\nConsequently, as armed forces pursue AI, a military- tech complex is\nappearing. In the next decade, the partnership between the state military\nforces and the private tech companies is likely to consolidate and deepen.\nUtopian or dystopian visions of war conducted by supercomputers and killer\ndrone swarms are phantasmagorical. Nevertheless, an emerging military- tech\ncomplex transforms the way in which states defend themselves and fight each\nother. The incre
