# Question answering on the book "Understanding Deep Learning"

In this notebook we download a PDF from the internet and answer questions about its contents.

We will use the following book: 

- website: https://udlbook.github.io/udlbook/
- download: https://github.com/udlbook/udlbook/releases/download/v1.0.4/UnderstandingDeepLearning_08_05_23_C.pdf

### Contents
0. Install packages
1. Imports & settings
2. Get the book and split into chunks
3. Settings for langchain and similarity search
4. Question Answering
5. Just the similarity search

### Setting your OpenAI API key
To use this notebook, you need an API key from OpenAI. Register with OpenAI and get your API key from https://platform.openai.com/account/api-keys.

The OpenAI library reads your API key from the OPENAI_API_KEY environment variable, so you need to set that variable before running the cells below. I keep my key in my own config.py file, which also stores other parameters.
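
One way to set the variable without hard-coding the key in the notebook is sketched below. The helper name `ensure_api_key` and its `fallback` argument are illustrative assumptions and not part of the OpenAI library; only the OPENAI_API_KEY variable name comes from the text above.

```python
import os


def ensure_api_key(fallback=None, env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, setting it from `fallback` if absent.

    `fallback` is a hypothetical value, for example one loaded from a private config.py.
    """
    key = os.environ.get(env_var) or fallback
    if not key:
        raise RuntimeError(f"{env_var} is not set and no fallback key was given")
    os.environ[env_var] = key  # make the key visible to libraries that read the environment
    return key
```

A key that is already present in the environment takes precedence over the fallback, so the same notebook can run unchanged on a machine where the variable is set globally.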

### Sources
- Blog: https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a

## 0. Install packages

In [7]:
!pip install openai
!pip install langchain
!pip install PyPDF2
!pip install faiss-cpu
!pip install wget



## 1. Imports & settings

In [27]:
# Import everything the notebook needs up front.
from PyPDF2 import PdfReader
# Embeddings
from langchain.embeddings.openai import OpenAIEmbeddings
# Text splitter
from langchain.text_splitter import CharacterTextSplitter
# Vector stores
from langchain.vectorstores import Chroma, FAISS
# Question-answering chain
from langchain.chains.question_answering import load_qa_chain
# LLMs
from langchain.llms import OpenAI
# Summarization chain
from langchain.chains.summarize import load_summarize_chain

In [4]:
# Get your API key from OpenAI; you will need to create a (paid) account.
# Billing overview: https://platform.openai.com/account/billing/overview
# I store my API key in a config file.
import os
import config
os.environ["OPENAI_API_KEY"] = config.openai_key

In [3]:
# Use OpenAI embeddings (other embedding models are available)
embeddings = OpenAIEmbeddings()

In [25]:
# Set the 'temperature', which controls how random the sampling is.
# Values near 0 give more deterministic, focused answers; values near 1 give more varied answers.
llm = OpenAI(temperature=0.2)
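
To see why a low temperature makes the output more predictable, here is a toy sketch of temperature-scaled sampling probabilities. It illustrates the general technique only; the scores are made up, and this is not a description of how the OpenAI API is implemented internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature almost all probability mass sits on the highest-scoring token, so repeated runs tend to produce the same answer.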

## 2. Get the book pdf, convert to txt and split into chunks

In [14]:
import wget
document = wget.download('https://github.com/udlbook/udlbook/releases/download/v1.0.4/UnderstandingDeepLearning_08_05_23_C.pdf')
document

'UnderstandingDeepLearning_08_05_23_C.pdf'
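
If you prefer not to depend on the `wget` package, the download can also be sketched with the standard library alone. The helper name `download_if_missing` is an assumption made for this example; it skips the download when the file already exists, which is convenient when rerunning the notebook. It does not detect a partially downloaded file.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download `url` to `dest` unless the file is already present; return the path."""
    dest = Path(dest)
    if dest.exists():
        return dest  # skip the network call on notebook reruns
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (commented out so that the sketch itself does no network I/O):
# download_if_missing(
#     "https://github.com/udlbook/udlbook/releases/download/v1.0.4/UnderstandingDeepLearning_08_05_23_C.pdf",
#     "UnderstandingDeepLearning_08_05_23_C.pdf",
# )
```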

In [15]:
# Select the .pdf file to read 
reader = PdfReader('./UnderstandingDeepLearning_08_05_23_C.pdf')
reader

<PyPDF2._reader.PdfReader at 0x7f81ac3888b0>

In [16]:
# Extract the text of every page in the PDF and concatenate it into one string
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text
print(raw_text[:200])
print(50 * '-')
print(f'The text is {len(raw_text)} characters')

Understanding
Deep Learning
Simon
J.D. Prince
May 8, 2023
The most recent version of this document can be found at http://udlbook.com.
Copyright in this work has been licensed exclusively to The MIT P
--------------------------------------------------
The text is 1299845 characters


In [18]:
# Split the text into smaller, overlapping chunks so that we stay within the model's token limit during retrieval.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = text_splitter.split_text(raw_text)
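
To make the roles of `separator`, `chunk_size` and `chunk_overlap` concrete, here is a simplified greedy splitter written in plain Python. It is a sketch of the general idea and not LangChain's actual implementation, so its chunk boundaries may differ from those of `CharacterTextSplitter`. The function name `split_with_overlap` is made up for this example.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks of at most `chunk_size` characters,
    carrying trailing pieces (up to `chunk_overlap` characters) into the next chunk.
    A single piece longer than `chunk_size` is kept as one oversized chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until what remains fits the overlap budget
            # and still leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n".join(f"line {i:03d}" for i in range(20))
for chunk in split_with_overlap(sample, chunk_size=30, chunk_overlap=10)[:3]:
    print(repr(chunk))
```

The overlap means that a sentence cut at a chunk boundary still appears in full in at least one neighbouring chunk, at the cost of embedding some text twice.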

In [19]:
#print the number of chunks
len(chunks)

1639

In [20]:
#print the first chunk
chunks[0]

'Understanding\nDeep Learning\nSimon\nJ.D. Prince\nMay 8, 2023\nThe most recent version of this document can be found at http://udlbook.com.\nCopyright in this work has been licensed exclusively to The MIT Press,\nhttps://mitpress.mit.edu, which will be releasing the final version to the public in 2024. All\ninquiries regarding rights should be addressed to the MIT Press, Rights and Permissions\nDepartment.\nThis work is subject to a Creative Commons CC-BY-NC-ND license.\nI would really appreciate help improving this document. No detail too small! Please mail\nsuggestions, factual inaccuracies, ambiguities, questions, and errata to\nudlbookmail@gmail.com.This\nbook is dedicated to Blair, Calvert, Coppola, El lison, F aulkner, Kerpatenko,\nMorris, Robinson, Sträussler, W al lace, W aymon, W ojnarowicz, and al l the others\nwhose work is even more important and interesting than deep learning.Contents\n1\nIntroduction 1\n1.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . 

## 3. Settings for langchain and similarity search

In [35]:
# Select the chain type. The default chain_type="stuff" puts ALL of the text from the documents it receives into a single prompt.
# That gets expensive when many or long documents are passed in.
chain = load_qa_chain(OpenAI(), chain_type="stuff")
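
Conceptually, the "stuff" approach concatenates every document it is given into one prompt. The sketch below shows that idea in plain Python; the prompt wording and the function name `build_stuff_prompt` are illustrative assumptions, not the template LangChain actually uses.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


docs_text = ["Transformers use self-attention.", "Graphs have nodes and edges."]
prompt = build_stuff_prompt(docs_text, "What is a transformer?")
print(prompt)
```

Because the prompt grows with the number and size of the documents, this notebook passes only the chunks returned by the similarity search to the chain, not the whole book.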

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
source: https://faiss.ai/
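
The core idea of similarity search can be shown with an exhaustive nearest-neighbour lookup in plain Python. This is only a toy sketch with made-up three-dimensional vectors; Faiss offers far more efficient index structures, and the function names here are not part of its API.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, vectors, k=2):
    """Return the indices of the k stored vectors closest to `query`."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]


# Toy "embeddings"; real embedding vectors have many more dimensions.
stored = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
print(nearest_neighbors([1.0, 0.0, 0.0], stored, k=2))
```

In the notebook, the stored vectors are the embeddings of the text chunks and the query vector is the embedding of the question.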

In [36]:
# Use FAISS for the similarity search: embed every chunk and build the index
docsearch = FAISS.from_texts(chunks, embeddings)

## 4. Question Answering

In [23]:
question = "What is a transformer?"
#first do similarity search
docs = docsearch.similarity_search(question)
#then ask the question to the LLM
chain.run(input_documents=docs, question=question)

' A transformer is a type of neural network developed for natural language processing (NLP) tasks. It is based on self-attention, which is a quadratic complexity that can process a large number of input variables. Transformers have since been used for image processing tasks, and have now eclipsed the performance of convolutional networks in this area.'

In [24]:
question = "What is a graph neural network?"
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' A graph neural network is a type of deep learning model that is applied to graphs. They use layers of nodes that are equivariant to permutations of the node indices to aggregate information from the neighbors of a node and use this to update the node embeddings.'
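
The two steps repeated in the cells above (retrieve, then answer) can be wrapped in a small helper. The function name `ask` and its `k` parameter are assumptions made for this sketch; it only calls `similarity_search` and `run`, the same methods used above, and `k=4` is assumed to match the number of chunks LangChain returns by default.

```python
def ask(question, docsearch, chain, k=4):
    """Retrieve the k most similar chunks, then let the chain answer from them."""
    docs = docsearch.similarity_search(question, k=k)
    return chain.run(input_documents=docs, question=question)


# Usage with the objects built earlier in this notebook:
# ask("What is a graph neural network?", docsearch, chain)
```

Raising `k` gives the model more context to work with, but it also makes the prompt longer and therefore more expensive.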

## 5. Just the Similarity Search

In [39]:
question = "What is a transformer?"
#first do similarity search
docs = docsearch.similarity_search(question)
docs

[Document(page_content='represent any word but serve to provide long-distance connections.\n12.10 T ransformers for images\nT ransformers were initially developed for text data. Their enormous success in this area\nled to experimentation on images. This was not obviously a promising idea for two\nThis work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.12.10\nT ransformers for images 229\nreasons.\nFirst, there are many more pixels in an image than words in a sentence, so the\nquadratic complexity of self-attention poses a practical bottleneck. Second, convolutional\nnets have a good inductive bias because each layer is equivariant to spatial translation,\nand they take into account the 2D structure of the image. However, this must be learned\nin a transformer network.\nRegardless of these apparent disadvantages, transformer networks for images have\nnow eclipsed the performance of convolutional networks for image classification and other', metadata={}),
 Document(