This is a Python project that takes a book and summarizes it.

How does it work? It splits the book into manageable chunks and stores them in a single ordered list.
Each chunk is then converted into a vector using an embedding model.
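
As a rough illustration of the chunking step, the sketch below splits a string into fixed-size, overlapping windows. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on separators such as blank lines). The function `chunk_text` and its parameter names are hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for number, chunk in enumerate(chunks, start=1):
    print(number, chunk)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.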

The KMeans algorithm is then run on the embeddings to group the chunks into clusters. For each cluster (about 20 in total), the chunk closest to the cluster centre is taken as its representative. These representative chunks, kept in the order in which they appear in the book, are summarized, and the summaries are joined into the final summary.
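
The selection step can be sketched on toy data. This is a minimal sketch, assuming the cluster centres are already known (in the notebook they come from `KMeans`). The function `closest_to_centroids` and the 2-D vectors are hypothetical, and plain Python with `math.dist` is used here instead of NumPy.

```python
import math


def closest_to_centroids(vectors, centroids):
    """Return, in ascending order, the index of the vector nearest to each centroid."""
    picked = []
    for centroid in centroids:
        distances = [math.dist(vector, centroid) for vector in vectors]
        picked.append(distances.index(min(distances)))
    # Sorting restores the order in which the chunks appear in the book
    return sorted(picked)


# Toy 2-D "embeddings": two well-separated groups of chunks
vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.3), (0.1, 0.1), (5.0, 5.1), (4.8, 5.0)]
centroids = [(5.0, 5.0), (0.1, 0.05)]

selected = closest_to_centroids(vectors, centroids)
print(selected)
```

Each centroid contributes exactly one chunk, so the number of clusters directly controls how many passages are sent to the summarizer.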

In [1]:
from langchain.document_loaders import PyPDFLoader

# Load the book
loader = PyPDFLoader("./Artificial Intelligence - A Modern Approach (3rd Edition).pdf")
pages = loader.load()

# Optionally cut out the opening and closing parts by narrowing this slice;
# pages[:] keeps every page
pages = pages[:]

# Combine the pages, and replace the tabs with spaces
text = ""

for page in pages:
    text += page.page_content
    
text = text.replace('\t', ' ')

In [2]:
from langchain import OpenAI
llm = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
num_tokens = llm.get_num_tokens(text)

print (f"This book has {num_tokens} tokens in it")


This book has 911279 tokens in it


In [3]:
# Loaders
from langchain.schema import Document

# Splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Model
from langchain.chat_models import ChatOpenAI

# Embedding Support
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

# Summarizer we'll use for Map Reduce
from langchain.chains.summarize import load_summarize_chain

# Data Science
import numpy as np
from sklearn.cluster import KMeans

In [4]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=10000, chunk_overlap=3000)

docs = text_splitter.create_documents([text])

In [5]:
# Make sure to `pip install openai` first
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def get_embedding(text, model="nomic-ai/nomic-embed-text-v1.5-GGUF"):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding

print(get_embedding("Once upon a time, there was a cat."))


[0.028982488438487053, 0.05746348574757576, -0.15397877991199493, -0.08202224969863892, 0.04305528849363327, 0.03828218951821327, -0.07831504195928574, 0.03515299782156944, -0.02010026015341282, 0.026661841198801994, -0.012544429861009121, 0.03681633993983269, 0.06883080303668976, 0.04359506070613861, -0.07030102610588074, -0.07204297930002213, 0.044680267572402954, -0.037170905619859695, 0.010736566036939621, 0.03328876569867134, 0.008671264164149761, -0.040501371026039124, 0.055754706263542175, 0.03165202960371971, 0.039841484278440475, 0.0496266633272171, -0.03296654671430588, 0.05972766876220703, -0.014261442236602306, 0.019525792449712753, 0.023560449481010437, -0.01934228651225567, -0.0024567130021750927, -0.040603384375572205, -0.012765709310770035, -0.03936634585261345, 0.07554327696561813, 0.017120225355029106, 0.05403933674097061, 0.06017432361841202, 0.024638472124934196, 0.033470723778009415, -0.037430234253406525, -0.011497564613819122, 0.04136235639452934, 0.0289473719894

In [6]:
vectors = []

for x in docs:
    vectors.append(get_embedding(x.page_content))



In [7]:
import os
# Create a new folder inside your workspace
folder_name = "./docs"
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# Iterate over the docs list and write each element to a separate .txt file
for i, doc in enumerate(docs):
    filename = f"{i+1}.txt"  # Use f-string formatting for dynamic filename
    filepath = os.path.join(folder_name, filename)
    
    with open(filepath, "w", encoding="utf-8") as f:  # the PDF text contains non-ASCII characters
        f.write(str(doc))  # Write the element to the file
    
    print(f"Wrote {doc} to {filename}")

Wrote page_content='Artiﬁcial Intelligence
A Modern Approach
Third EditionPRENTICE HALL SERIES
IN ARTIFICIAL INTELLIGENCE
Stuart Russell and Peter Norvig, Editors
FORSYTH &P ONCE Computer Vision: A Modern Approach
GRAHAM ANSI Common Lisp
JURAFSKY &M ARTIN Speech and Language Processing, 2nd ed.
NEAPOLITAN Learning Bayesian Networks
RUSSELL &N ORVIG Artiﬁcial Intelligence: A Modern Approach, 3rd ed.Artiﬁcial Intelligence
A Modern Approach
Third Edition
Stuart J. Russell and Peter Norvig
Contributing writers:
Ernest Davis
Douglas D. Edwards
David Forsyth
Nicholas J. Hay
Jitendra M. Malik
Vibhu Mittal
Mehran Sahami
Sebastian Thrun
Upper Saddle River Boston Columbus San Francisco New York
Indianapolis London Toronto Sydney Singapore Tokyo Montreal
Dubai Madrid Hong Kong Mexico City Munich Paris Amsterdam Cape TownVice President and Editorial Director, ECS: Marcia J. Horton
Editor-in-Chief: Michael Hirsch
Executive Editor: Tracy Dunkelberger
Assistant Editor: Melinda Haggerty
Editorial Assi

In [8]:
import numpy as np

# vectors is a list of embeddings (one list of floats per chunk)
with open('./output.txt', 'w') as f:
    for i, vec in enumerate(vectors):
        np.savetxt(f, vec, header=f'Array {i+1}', comments='')

In [9]:
# `vectors` holds one embedding per chunk, all of the same dimension

# Choose the number of clusters; this can be adjusted based on the book's content.
# Each cluster contributes one representative passage to the summary.
# Around 10 passages are usually enough to tell what a book is about; this run uses 21.
num_clusters = 21

# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)

In [10]:
kmeans.labels_

array([11, 11, 11, 11,  9, 20, 20, 20,  2, 10, 20, 20, 20, 20, 20, 20, 20,
       20, 18, 20, 20,  2,  2, 14,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
       17, 16,  8, 16, 16,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8, 17,
        8,  8, 16,  8,  8,  8,  8,  8, 20,  8, 17, 17, 17, 17, 17,  8,  2,
        8, 17, 17,  8,  7,  7,  7,  7,  8, 19, 10, 17, 19,  8, 19, 19, 19,
       19,  7,  7,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        2,  2, 12, 15, 12, 15, 15, 13, 15, 13,  3, 13, 17, 17, 17, 15, 15,
        8, 15, 13, 15, 20, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 12,
       13, 15, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 15, 11, 13,
       15,  4,  4, 17, 17, 17,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  8,
        4,  4, 17, 17,  4, 17, 17,  4,  4,  4,  2,  2,  4,  4, 15,  0,  0,
        0,  4, 15, 15, 15, 15, 10, 15,  0,  0,  0, 15, 20, 15, 15, 10, 10,
       12, 12, 10,  9,  9,  9,  9, 10,  9,  9,  9,  9,  9,  9,  9,  9,  9,
        9,  9,  9,  9,  9

In [11]:
# Find the closest embeddings to the centroids

# Create an empty list that will hold your closest points
closest_indices = []

# Loop through the number of clusters you have
for i in range(num_clusters):
    
    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    
    # Find the list position of the closest one (using argmin to find the smallest distance)
    closest_index = np.argmin(distances)
    
    # Append that position to your closest indices list
    closest_indices.append(closest_index)

In [17]:
selected_indices = sorted(closest_indices)
selected_indices

[np.int64(6),
 np.int64(29),
 np.int64(55),
 np.int64(83),
 np.int64(86),
 np.int64(93),
 np.int64(105),
 np.int64(106),
 np.int64(136),
 np.int64(159),
 np.int64(176),
 np.int64(185),
 np.int64(265),
 np.int64(298),
 np.int64(333),
 np.int64(381),
 np.int64(400),
 np.int64(407),
 np.int64(415),
 np.int64(466),
 np.int64(470)]

In [12]:
"""llm3 = ChatOpenAI(temperature=0,
                 openai_api_key=openai_api_key,
                 max_tokens=1000,
                 model='gpt-3.5-turbo'
                )
"""
llm3 = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

In [14]:
from langchain import PromptTemplate


map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.

```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
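
Note that `{text}` in the prompt is only a placeholder: it has to be filled with the chunk's text before the prompt is sent to a model. A minimal sketch with plain `str.format`, assuming a shortened, hypothetical prompt (`demo_prompt`) rather than the one above:

```python
# Hypothetical, shortened prompt used only for this illustration
demo_prompt = (
    "Summarize the passage that follows.\n"
    "PASSAGE: {text}\n"
    "FULL SUMMARY:"
)

passage = "Turing proposed a test for machine intelligence."

# Replace the {text} placeholder with the actual passage
filled_prompt = demo_prompt.format(text=passage)
print(filled_prompt)
```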

In [18]:
selected_docs = [docs[doc] for doc in selected_indices]

In [34]:
for i, doc in enumerate(selected_docs):
    print(i)
    print("\n\n")
    print(doc)

0



page_content='These six disciplines compose most of AI, and Turing deserves credit for designing a test
that remains relevant 60 years later. Yet AI researchers have devoted little effort to passing
the Turing Test, believing that it is more important to study the underlying principles of in-
telligence than to duplicate an exemplar. The quest for “artiﬁcial ﬂight” succeeded when the
Wright brothers and others stopped imitating birds and started using wind tunnels and learn-
ing about aerodynamics. Aeronautical engineering texts do not deﬁne the goal of their ﬁeld
as making “machines that ﬂy so exactly like pigeons that they can fool even other pigeons.”
1.1.2 Thinking humanly: The cognitive modeling approach
If we are going to say that a given program thinks like a human, we must have some way of
determining how humans think. We need to getinside the actual workings of human minds.
There are three ways to do this: through introspection—trying to catch our own thoughts as
they go 

In [35]:
doc?

Type:           Document
String form:
page_content='Smyth, P., Heckerman, D., and Jordan, M. I.
           (1997). Probabilistic independence netw <...>  Sunstein, C. (2009). Nudge: Improv-
           ing Decisions About Health, Wealth, and Happiness.
           Penguin.'
File:           c:\users\abhij\appdata\roaming\python\python312\site-packages\langchain_core\documents\base.py
Docstring:
Class for storing a piece of text and associated metadata.

Example:

    .. code-block:: python

        from langchain_core.documents import Document

        document = Document(
            page_content="Hello, world!",
            metadata={"source": "https://example.com"}
        )
Init docstring: Pass page_content in as positional or named arg.

In [37]:

from openai import OpenAI

# Make an empty list to hold your summaries
summary_list = []
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Loop over the selected docs
for i, doc in enumerate(selected_docs):
    
    # Go get a summary of the chunk    
    completion = client.chat.completions.create(
        model="lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF",
        messages=[
             # Fill the {text} placeholder with the chunk's text instead of sending the raw template
             {"role": "user", "content": map_prompt_template.format(text=doc.page_content)}
        ], temperature=0.7,
        )

    chunk_summary = completion.choices[0].message
    
    # Append that summary to your list
    summary_list.append(chunk_summary)
    
    print (f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary} \n")

Summary #0 (chunk #6) - Preview: ChatCompletionMessage(content="The passage discusses the foundations of Artificial Intelligence (AI) and its various approaches. It begins by highlighting the significance of Alan Turing's Test, which assesses a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. However, researchers have devoted little effort to passing the test, focusing instead on understanding the underlying principles of intelligence.\n\nThe passage then introduces three approaches to AI: cognitive modeling, thinking rationally, and acting rationally. Cognitive modeling involves creating precise theories of the mind through introspection, psychological experiments, and brain imaging. This approach has led to the development of cognitive science, an interdisciplinary field that combines computer models from AI with experimental techniques from psychology.\n\nThe rational-agent approach is also discussed, which views an agent a

In [42]:
i = summary_list[0]

In [45]:
i.content?

Type:        str
String form: The passage discusses the foundations of Artificial Intelligence (AI) and its various approaches. <...> e overview of the foundations of AI, highlighting its key approaches and historical developments.
Length:      2815
Docstring:
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.

In [47]:
new_summary = [i.content for i in summary_list]

In [48]:
summaries = "\n".join(new_summary)

# Convert it back to a document
summaries = Document(page_content=summaries)

print (f"Your total summary has {llm.get_num_tokens(summaries.page_content)} tokens")

Your total summary has 10354 tokens


In [59]:
new_summary?

Type:        list
String form: ["The passage discusses the foundations of Artificial Intelligence (AI) and its various approache <...> uested, I hope this overview provides some insight into the topics covered by these references.']
Length:      21
Docstring:
Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.

In [64]:
import os
def join_markdown_strings(markdown_list):
    """
    Join a list of Markdown strings into one, separated by "---".

    Args:
        markdown_list (list): List of Markdown strings.

    Returns:
        str: The joined Markdown string.
    """

    # Surround each "---" rule with blank lines so Markdown renders it as a
    # horizontal rule instead of turning the preceding line into a heading
    joined_markdown = "\n\n---\n\n".join(markdown_list)

    return joined_markdown


# Example usage:
markdown_list = new_summary
joined_markdown_string = join_markdown_strings(markdown_list)

# Save the joined Markdown string to a file named example.md
with open("example.md", "w", encoding="utf-8") as f:
    f.write(joined_markdown_string)