## Project goal

Extract useful insights from a predefined text document. At a minimum, the extracted information should cover the basic premise of the text, its outline, and its main topics. Additional information may be extracted from the main content.


The text analysis will focus on the book: "[Man's Search for Meaning](https://www.amazon.com/Mans-Search-Meaning-Viktor-Frankl-ebook/dp/B009U9S6FI)" by [Viktor Frankl](https://en.wikipedia.org/wiki/Viktor_Frankl).

This project aims to provide both contextual and technical knowledge. The meaning of life is one of the fundamental questions of existence. Finding a solution to such a complex topic isn't trivial and goes beyond the scope of a single data project. Discovering meaningful insights, however, can be informative even without a fixed answer to a broad philosophical problem.

At its core, this project uses [LangChain](https://github.com/hwchase17/langchain) and OpenAI's GPT-3.5. The technical focus is on setting up LangChain model calls that return meaningful and reasonable answers to the questions and tasks provided by the user.

Selected book passages are quoted alongside the model responses for comparison.

The project uses [Pinecone](https://www.pinecone.io/), a vector database well suited to semantic text search.

### Libraries installation

The notebook uses LangChain, which requires some external libraries to be installed.

In [None]:
!pip install langchain
!pip install unstructured
!pip install unstructured[local-inference]
!apt-get install poppler-utils 

!pip install openai
!pip install pinecone-client
!pip install tiktoken

In [None]:
!apt install tesseract-ocr
!apt install libtesseract-dev

!pip install chromadb

In [None]:
!pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"

### Libraries initialization

In [3]:
import os
import json

import pinecone

import langchain
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain



### Credentials set-up

In [4]:
def load_api_keys(credentials_file_name: str = 'credentials.json') -> tuple:
    '''Load API keys from file

    Arguments:
        credentials_file_name: name of file containing credentials

    Returns:
        A tuple containing OpenAI API Key, Pinecone API key and Pinecone API
        environment name

    '''
    
    if not os.path.exists(credentials_file_name):
        # fail loudly: returning a message string would break tuple unpacking
        raise FileNotFoundError(f'No file {credentials_file_name}')

    # open credentials file
    with open(credentials_file_name) as f:
        content = json.load(f)

    # load api keys (a missing key raises KeyError)
    OPENAI_API_KEY = content['OPENAI_API_KEY']
    PINECONE_API_KEY = content['PINECONE_API_KEY']
    PINECONE_API_ENV = content['PINECONE_API_ENV']

    return OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_API_ENV


In [5]:
# load the API keys from the credentials file and set OPENAI_API_KEY as an
# environment variable
OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_API_ENV = load_api_keys('credentials.json')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
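
For reference, the loader above expects a JSON file with three keys. The sketch below shows that structure by writing it to a temporary file and reading it back. The values are placeholders and the temporary file is only there to keep the snippet self-contained; neither comes from the original notebook.

```python
import json
import os
import tempfile

# Placeholder values only (assumptions for illustration, not real credentials).
example_credentials = {
    "OPENAI_API_KEY": "sk-placeholder",
    "PINECONE_API_KEY": "pinecone-placeholder",
    "PINECONE_API_ENV": "example-env",
}

# Write the structure to a temporary credentials.json and read it back,
# mirroring what load_api_keys does with the real file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "credentials.json")
    with open(path, "w") as f:
        json.dump(example_credentials, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

print(sorted(loaded))
```

Keeping the keys in a separate file, rather than in the notebook itself, makes it easier to share the notebook without leaking credentials.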

### Book load

The book is loaded from a PDF file on the local disk. Due to its size, it has to be split into smaller chunks.

In [6]:
loader = UnstructuredPDFLoader('/content/input/book.pdf')
book = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [7]:
len(book)

1

In [8]:
# get the first 500 characters from the book
book[0].page_content[:500]

"be Bee oer) il \n\nRevised and Updated  \n\nInternationally renowned psychiatrist.Viktor E. Frankl,endured years of unspeakablehorror in Nazi death camps. During,and partly because of his suffering, Dr. Frankldeveloped a revolutionary approach topsychotherapy known as logotherapy. At thecore of his theory is the belief thatman's primary motivational force is hissearch for meaning.MAN'S SEARCH FOR MEANING is morethan the story of Viktor E. Frankl's triumph:it is a remarkable blend of science andhuman"

### Book split

In [None]:
# The split is handled by a LangChain built-in text splitter with a chunk size
# of 1000 characters and no chunk overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 0)
texts = text_splitter.split_documents(book)

In [None]:
# the book is split into 391 pieces
len(texts)

391

In [None]:
# preview of a random piece
texts[12]

Document(page_content='("Logotherapy in a Nutshell") boils down, as it were, to the lesson one may distill from the first part, the autobiographical account ("Experiences in a Concen- tration Camp"), whereas Part One serves as the exis- tential validation of my theories. Thus, both parts mutually support their credibility. I had none of this in mind when I wrote the book in 1945. And I did so within nine successive days and with the firm determination that the book would be published anonymously. In fact, the first printing of the original German version does not show my name on the cover, though at the last moment, just before the book\'s initial publication, I did finally give in to my friends who had urged me to let it be published with my name at least on the title page. At first, however, it had been written with the absolute conviction that, as an anonymous opus, it could never earn its author literary fame. I had wanted simply to convey to the reader by way of a concrete example
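
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraphs and sentences, so its chunks will differ from these. The function name `split_fixed` is hypothetical.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters
no_overlap = split_fixed(sample, chunk_size=10, chunk_overlap=0)
with_overlap = split_fixed(sample, chunk_size=10, chunk_overlap=4)
print(len(no_overlap), len(with_overlap))  # prints: 3 5
```

With overlap, each window advances by fewer characters, so the same text yields more chunks of a given size, and neighbouring chunks repeat some context.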

### Embeddings creation

OpenAI embeddings are created for the book content and later upserted to the Pinecone database.

In [9]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

### Pinecone initialization and embeddings dump

The Pinecone index runs in a GCP environment and holds 391 vectors (one per text chunk) of 1536 dimensions each, with cosine similarity as the search metric.

In [None]:
pinecone.init(
    api_key = PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)

index_name = 'langchain'

docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name = index_name)
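
The index compares vectors by cosine similarity. As a rough, self-contained illustration of that metric, the sketch below ranks toy 3-dimensional vectors in plain Python instead of 1536-dimensional OpenAI embeddings. `cosine_similarity` is a hypothetical helper written for this example, not a Pinecone call.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a query embedding and three chunk embeddings.
query = [1.0, 0.0, 1.0]
candidates = {
    "chunk_a": [2.0, 0.0, 2.0],  # same direction as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "chunk_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```

Cosine similarity depends only on direction, not on vector length, which is why `chunk_a` ranks first even though it is twice as long as the query.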

### Chain building and creation of the 'ask' function

In [12]:
def ask(query: str, chain_type : str = 'stuff') -> str:
  """Answer a question about the book.

  The query is embedded and compared against the stored chunks with a
  similarity search; the most relevant chunks are then passed, together with
  the query, to a question-answering chain.

  Args:
      query (str): question provided by the user
      chain_type (str): type of chain to build, one of:
        stuff - put all the retrieved documents into a single prompt

        map_reduce - run the query against each document separately and
        combine the partial results into one final answer

        refine - feed the first document, get an answer, then refine that
        answer with each following document in turn

        map_rerank - get an answer from each document and keep the one
        that answers the question best

  Returns:
      str: Answer to the question generated by the language chain.
  """

  # initiate a Large Language Model
  llm = OpenAI(temperature = 0, openai_api_key=OPENAI_API_KEY)
  # build a chain of type provided by the user
  chain = load_qa_chain(llm, chain_type = chain_type)
  docs = docsearch.similarity_search(query, include_metadata = True)
  
  return chain.run(input_documents = docs, question = query)
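
The difference between the `stuff` and `map_reduce` strategies described in the docstring can be sketched without any API calls. This is a conceptual toy, not LangChain's implementation: `fake_llm` is a stand-in that only reports the prompt size, and all function names here are hypothetical.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call; returns a short description of the prompt."""
    return f"<answer from {len(prompt)} chars>"


def stuff_chain(docs: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Put every document into a single prompt and call the model once."""
    context = "\n\n".join(docs)
    return llm(f"{context}\n\nQuestion: {question}")


def map_reduce_chain(docs: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Ask the question of each document separately, then combine the answers."""
    partial_answers = [llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n".join(partial_answers)
    return llm(f"{combined}\n\nQuestion: {question}")


# Count how many model calls each strategy makes.
calls: List[str] = []


def counting_llm(prompt: str) -> str:
    calls.append(prompt)
    return fake_llm(prompt)


docs = ["first passage", "second passage", "third passage"]

stuff_chain(docs, "What is logotherapy?", counting_llm)
stuff_calls = len(calls)

calls.clear()
map_reduce_chain(docs, "What is logotherapy?", counting_llm)
map_reduce_calls = len(calls)

print(stuff_calls, map_reduce_calls)  # prints: 1 4
```

In this toy, `stuff` makes one call but its prompt grows with the number of documents, while `map_reduce` keeps each prompt small at the cost of one call per document plus a final combining call.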

## Architecture 1

Pod type: P1 - faster queries

chunk_size = 1000

chunk_overlap = 0

no. of vectors: 391

no. of dimensions: 1536

metric type: cosine

chain_type: stuff

### Asking questions

In [None]:
ask("What is the main premise of the book?")

' The main premise of the book is that it is possible to "say yes to life" in spite of all the tragic aspects of human existence.'

In [None]:
ask("What are the main concepts mentioned in this book?")

' The main concepts mentioned in this book are logotherapy, tragic optimism, and addiction.'

In [None]:
ask("Who were Capos?")

' Capos were prisoners in concentration camps who were chosen by the guards to be in charge of other prisoners. They were often harder on the prisoners than the guards and had more privileges than the other prisoners.'

In [None]:
ask("What is logotherapy?")

" Logotherapy is a psychotherapy theory developed by Viktor Frankl that focuses on the meaning of human existence and man's search for such a meaning. It is based on the idea that the primary motivational force in man is the striving to find a meaning in life. Logotherapy is in contrast to Freudian psychoanalysis, which is centered on the pleasure principle, and Adlerian psychology, which is focused on the will to power."

In [None]:
ask("What thought kept the author alive in the concentration camp?")

' The thought that there was a meaning in his life and that whatever he had gone through could still be an asset to him in the future.'

In [None]:
ask("When someone developed the trait of seeing bad things in a good light, was this a good survival strategy?")

' Yes, this was a good survival strategy because it allowed people to cope with difficult situations and find meaning in suffering.'

In [None]:
ask("Did Dr. Frankl sister survive the concentration camp?")

" No, there is no mention of Dr. Frankl's sister surviving the concentration camp in the given context."

In [None]:
ask("What did the author tell the elderly general practitioner? ")

" The author told the elderly general practitioner that he had a manuscript of a scientific book in the inner pocket of his coat and that he wanted to keep it at all costs because it contained his life's work."

### Critique

What was the main premise of the book?
`The main premise of the book is that it is possible to "say yes to life" in spite of all the tragic aspects of human existence.`

In [None]:
# The model summarized the premise accurately. The explanation is short and simple.

---

What are the main concepts mentioned in this book? ` The main concepts mentioned in this book are logotherapy, tragic optimism, and addiction.`

In [29]:
# "Logotherapy" and "tragic optimism" are phrases that occur frequently in the book, but "addiction"
# is only one of three facets of the neurotic syndrome (the other two being depression and aggression).
# Addiction is not the main focus of this book.

---

Who were Capos? ` Capos were prisoners in concentration camps who were chosen by the guards to be in charge of other prisoners. They were often harder on the prisoners than the guards and had more privileges than the other prisoners.`

In [24]:
# A precise, short, and accurate answer.

---

What is logotherapy? ` Logotherapy is a psychotherapy theory developed by Viktor Frankl that focuses on the meaning of human existence and man's search for such a meaning. It is based on the idea that the primary motivational force in man is the striving to find a meaning in life. Logotherapy is in contrast to Freudian psychoanalysis, which is centered on the pleasure principle, and Adlerian psychology, which is focused on the will to power.`

In [27]:
# A good answer, drawn from this passage: "According to logotherapy,
# this striving to find a meaning in one's life is the
# primary motivational force in man. That is why I
# speak of a will to meaning in contrast to the pleasure
# principle (or, as we could also term it, the will to
# pleasure) on which Freudian psychoanalysis is centered, as well as in contrast to the will to power on
# which Adlerian psychology, using the term "striving
# for superiority," is focused."

---

What thought kept the author alive in the concentration camp? `The thought that there was a meaning in his life and that whatever he had gone through could still be an asset to him in the future.`

In [1]:
# A correct answer

---

When someone developed the trait of seeing bad things in a good light, was this a good survival strategy? `Yes, this was a good survival strategy because it allowed people to cope with difficult situations and find meaning in suffering`

In [None]:
# Correct

---

Did Dr. Frankl sister survive the concentration camp? `No, there is no mention of Dr. Frankl's sister surviving the concentration camp in the given context.`

In [3]:
# This isn't true because there is a mention of Frankl's sister in the text:
# "His father, mother, brother, and his wife died in camps or were sent to the gas ovens,
# so that, excepting for his sister, his entire family perished in these camps."

---

What did the author tell the elderly general practitioner? `The author told the elderly general practitioner that he had a manuscript of a scientific book in the inner pocket of his coat and that he wanted to keep it at all costs because it contained his life's work.`

In [5]:
# This isn't true. The correct answer is: "What would have happened, Doctor, if
# you had died first, and your wife would have had to survive you?"

## Architecture 2

Pod type: P2 - lowest latency and highest throughput

chunk_size = 1200

chunk_overlap = 80

no. of vectors: 332

no. of dimensions: 1536

metric type: cosine

chain_type: stuff

In [10]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1200, chunk_overlap = 80)
texts = text_splitter.split_documents(book)

print(len(texts))

pinecone.init(
    api_key = PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)

index_name = 'langchain2'

docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name = index_name)

332


### Asking questions

In [13]:
ask("What is the main premise of the book?")

' The main premise of the book is that hundreds of thousands of people are reaching out for a book that promises to deal with the question of a meaning to life, and that the book provides a compelling introduction to the most significant psychological movement of our day.'

In [14]:
ask("What are the main concepts mentioned in this book?")

' The main concepts mentioned in this book are logotherapy, the meaning of life, and addiction.'

In [15]:
ask("Who were Capos?")

' Capos were prisoners in concentration camps who were chosen by the SS men to act as supervisors over the other prisoners. They were usually chosen from those prisoners whose characters promised to make them suitable for such procedures, and they were often harder on the prisoners than the guards. They were given privileges such as better food and were often able to save the lives of other prisoners.'

In [16]:
ask("What is logotherapy?")

" Logotherapy is a meaning-centered psychotherapy that focuses on the meaning of human existence and man's search for such a meaning. It is based on the idea that the primary motivational force in man is the striving to find a meaning in life. Logotherapy defocuses the vicious-circle formations and feedback mechanisms which play a role in the development of neuroses, and it tries to make the patient fully aware of his own responsibleness. It is neither teaching nor preaching, but rather it is like an eye specialist, trying to enable the patient to see the world as it really is."

In [18]:
ask("What thought kept the author alive in the concentration camp?")

' The thought that there was a meaning in his life and that he could still fulfill a task.'

In [19]:
ask("When someone developed the trait of seeing bad things in a good light, was this a good survival strategy?")

' Yes, this was a good survival strategy because it allowed people to find meaning in suffering and turn their predicaments into human achievements.'

In [20]:
ask("Did Dr. Frankl sister survive the concentration camp?")

" No, Dr. Frankl's sister did not survive the concentration camp."

In [21]:
ask("What did the author tell the elderly general practitioner? ")

' The author told the elderly general practitioner to shave daily, if possible, even if it meant giving up their last piece of bread, in order to look younger and healthier and avoid being sent to the gas chambers.'
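
The same eight questions are asked of both architectures. One way to keep such runs comparable is to collect the questions in a list and loop over them. This is a minimal sketch, assuming any callable that takes a query string and returns an answer, as `ask` does. `run_questions` and the `echo` stub are hypothetical; the stub stands in for the real function so that the snippet runs offline.

```python
from typing import Callable, Dict, List

# The questions used throughout this notebook, verbatim.
QUESTIONS: List[str] = [
    "What is the main premise of the book?",
    "What are the main concepts mentioned in this book?",
    "Who were Capos?",
    "What is logotherapy?",
    "What thought kept the author alive in the concentration camp?",
    "When someone developed the trait of seeing bad things in a good light, "
    "was this a good survival strategy?",
    "Did Dr. Frankl sister survive the concentration camp?",
    "What did the author tell the elderly general practitioner? ",
]


def run_questions(ask_fn: Callable[[str], str], questions: List[str]) -> Dict[str, str]:
    """Call ask_fn on every question and return a question -> answer mapping."""
    return {question: ask_fn(question) for question in questions}


def echo(query: str) -> str:
    """Offline stub used in place of the real ask function."""
    return f"(stub answer to: {query})"


answers = run_questions(echo, QUESTIONS)
for question, answer in answers.items():
    print(question, "->", answer)
```

In the notebook itself, `run_questions(ask, QUESTIONS)` would be called once per architecture, after `docsearch` has been pointed at the corresponding index.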

### Critique

What was the main premise of the book?
`The main premise of the book is that hundreds of thousands of people are reaching out for a book that promises to deal with the question of a meaning to life, and that the book provides a compelling introduction to the most significant psychological movement of our day.`

In [None]:
# A very generic answer. The answer proposed by the first architecture was better.

---

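What are the main concepts mentioned in this book? ` The main concepts mentioned in this book are logotherapy, the meaning of life, and addiction.`

In [None]:
# Nearly the same answer as Architecture 1: "tragic optimism" is replaced by "the meaning of life",
# and addiction is again listed although it is not a main focus of the book.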

---

Who were Capos? `Capos were prisoners in concentration camps who were chosen by the SS men to act as supervisors over the other prisoners. They were usually chosen from those prisoners whose characters promised to make them suitable for such procedures, and they were often harder on the prisoners than the guards. They were given privileges such as better food and were often able to save the lives of other prisoners.`

In [None]:
# A better answer that focuses slightly on the psychological aspect of who was chosen to be a Capo.

---

What is logotherapy? `Logotherapy is a meaning-centered psychotherapy that focuses on the meaning of human existence and man's search for such a meaning. It is based on the idea that the primary motivational force in man is the striving to find a meaning in life. Logotherapy defocuses the vicious-circle formations and feedback mechanisms which play a role in the development of neuroses, and it tries to make the patient fully aware of his own responsibleness. It is neither teaching nor preaching, but rather it is like an eye specialist, trying to enable the patient to see the world as it really is.`

In [None]:
# The eye specialist analogy was taken directly from the book. This answer discards the
# comparison to Freud's and Adler's work and focuses on the core aspects of logotherapy.

---

What thought kept the author alive in the concentration camp? ` The thought that there was a meaning in his life and that he could still fulfill a task.`

In [None]:
# Generally the same answer as before, but this one is more vague about the final task.

---

When someone developed the trait of seeing bad things in a good light, was this a good survival strategy? `Yes, this was a good survival strategy because it allowed people to find meaning in suffering and turn their predicaments into human achievements.`

In [None]:
# Correct answer - a slight variation of the first one

---

Did Dr. Frankl sister survive the concentration camp? `No, Dr. Frankl's sister did not survive the concentration camp.`

In [None]:
# Incorrect answer. There is a brief mention in the book of his sister surviving the concentration camp,
# although it is not easy to infer.

---

What did the author tell the elderly general practitioner? `The author told the elderly general practitioner to shave daily, if possible, even if it meant giving up their last piece of bread, in order to look younger and healthier and avoid being sent to the gas chambers.`

In [None]:
# This isn't true. The correct answer is: "What would have happened, Doctor, if
# you had died first, and your wife would have had to survive you?"