    Name: Joshua Soteras 
    Date: 5/23/23
   Custom Chat Notebook


<h3>Read Me</h3>
I have included all my findings and research in order to understand this Final Project. 

- The Final Project Code can be found at the last section of this document. 
- Any notes down below are all proof of my time spent with this project (I had fun learning and diving into something new). 
- Images and all results will be within a folder and the document as requested: if images in doc aren't clearly visible, I have uploaded individual screenshots to be viewed.

<h3>What I Learned and My Experiences</h3>

<b>Large Language Models</b>
- LLMs predict the next word in a sequence, such as a sentence or paragraph 
- they are pre-trained models, trained with self-supervised learning 
- they can be tuned for language tasks: reading, summarizing, and translating text 
	
- GPT (Generative Pre-trained Transformer) by OpenAI
    - generative AI: produces new text based on patterns learned from existing data  
    - monitors errors and specific information through the moderation API
- BERT (Bidirectional Encoder Representations from Transformers) by Google
    - built on the encoder part of the Transformer
	
<b> Embeddings </b> 
- are vectors that capture the meaning of words, so related words end up close to each other 
- I believe that's why OpenAI is needed in this project: we use their existing embedding model to embed the document and then query it
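
To make "related words end up close to each other" concrete, here is a minimal sketch of cosine similarity, a measure commonly used to compare embeddings. The three-dimensional vectors are made-up toy values chosen for illustration, not real OpenAI embeddings, which have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, for illustration only).
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}


def most_similar(word, embeddings):
    """Return the other word whose vector points in the most similar direction."""
    target = embeddings[word]
    others = {w: v for w, v in embeddings.items() if w != word}
    return max(others, key=lambda w: cosine_similarity(target, others[w]))


print(most_similar("king", toy_embeddings))
```

A vector store such as Pinecone applies the same idea at scale: the query is embedded, and the stored chunks with the most similar vectors are returned.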

<b> Grasping ChatGPT and OpenAI</b> 

- What are Transformers?
    - a neural network architecture 
    - an effective type of model for analyzing complicated/intricate data: i.e. images, video, audio 
        - there are different types of neural networks for each type of data
    - used for models that translate, write, etc. 
    
- RNN (recurrent neural network: pre-Transformers) 
    - processes words sequentially, one at a time 
    - hard to train 
			
- How these things work (the example used is language, i.e. communication) 
    - positional encoding
        - Example: translating from English to French
        - Instead of processing words sequentially, each word is tagged with a number for its position. Word order is stored in the data itself rather than in the structure of the neural network. After training on this data, the model learns how to interpret the positional encodings.
    - attention
        - which words to pay attention to 
            - learned from data 
    - self-attention 
        - understanding a word by looking at the context of the words around it 
        - builds an underlying understanding of a language, so one network can do a number of tasks 
        - naturally learns the rules of grammar, gender, tenses, etc.
        - server (waiter) vs. server (computer hosting data)
            - learns the difference by looking at the context of the surrounding words 
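
The two ideas above can be sketched in a few lines of plain Python. This is a simplified illustration, not how GPT or BERT are actually implemented: it uses the sinusoidal positional encoding from the original Transformer design, and the attention step skips the learned projection matrices a real model would apply. The word vectors are hypothetical toy values.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding of a token position as a vector of length d_model."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention using the inputs as queries, keys and values."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        # Score every word against the current word, then normalize.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted mix of all word vectors (the "context").
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical 4-dimensional word vectors for a 3-word sentence.
words = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]

# Add position information to each word, then let every word attend to the others.
with_positions = [
    [w + p for w, p in zip(word, positional_encoding(pos, 4))]
    for pos, word in enumerate(words)
]
contextual = self_attention(with_positions)
```

Because the position is added to the word vector itself, all words can be processed at once instead of one after another as in an RNN.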

<b> Pinecone.from_texts() </b>
- I was curious when you stated there could be a better way of doing this.
- I discovered (as seen in the next section) that the API reference is made by LangChain and not Pinecone. 
- I attempted to find an alternate way to upsert the data into the Pinecone index; however, I had trouble understanding what values to pass in and whether the data I had was in the correct format (vectors). 
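
For reference, here is a rough sketch of the data shape a direct upsert would need, as far as I understand the Pinecone Python client (2.x): each record is an `(id, values, metadata)` tuple, where `values` is the embedding vector. The helper name `build_upsert_batches`, the `chunk-` id prefix, and the sample vectors are my own assumptions for illustration. Storing the chunk under the `text` metadata key is meant to mirror what the LangChain wrapper does by default, to the best of my knowledge.

```python
def build_upsert_batches(texts, vectors, batch_size=100, id_prefix="chunk"):
    """Pair each text with its vector as (id, values, metadata) tuples, in batches."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    records = [
        (f"{id_prefix}-{i}", list(vector), {"text": text})
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


# Stand-in data. In the notebook the vectors would come from
# embeddings.embed_documents(texts) and match the index's dimension.
texts = ["We the People", "of the United States", "in Order to form"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
batches = build_upsert_batches(texts, vectors, batch_size=2)

# With a live index, each batch would then be sent roughly like this:
#   index = pinecone.Index(index_name)
#   for batch in batches:
#       index.upsert(vectors=batch)
```

Batching keeps each request small; the batch size of 100 is an arbitrary default, not a documented limit.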


<b> Experiences with the results and personal thoughts </b> 
- The questions asked have to be specific, e.g. "Who signed the Constitution?" 
    - This shows the limitations of AI 
- There is a lot to grasp and learn; I want to understand vectors and embeddings better. 
- There are a lot of tools to use, but understanding what data to pass in is what matters. 

<h1 style="text-align: center;">References to API guides for libraries used + Notes </h1>

<h3>References</h3>

- https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/pdf.html 

<h4>Understanding LangChain: PyPDFLoader</h4>

- https://python.langchain.com/en/latest/reference/modules/document_loaders.html?highlight=PyPDfLoader()#langchain.document_loaders.PyPDFLoader

<h4> LangChain: RecursiveCharacterTextSplitter </h4>

- API: https://python.langchain.com/en/latest/reference/modules/text_splitter.html?highlight=text_splitter#module-langchain.text_splitter
- Notes:
    - RecursiveCharacterTextSplitter: an implementation of text splitting that looks at characters 
    - recursively tries to split by different characters to find one that works 
    - chunk size is the maximum size of each chunk; by default it is measured in characters rather than tokens (this notebook sets it to 400)
    - chunk overlap is the number of overlapping characters between adjacent chunks; it should be smaller than the chunk size
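
To see how chunk size and chunk overlap interact, here is a deliberately simplified fixed-window splitter. It is not the LangChain algorithm (which prefers to break on separators such as paragraphs and newlines); the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(sample, chunk_size=10, chunk_overlap=3))
```

With an overlap of 3, the last three characters of one chunk are repeated at the start of the next, which helps keep a sentence that straddles a boundary readable in at least one chunk. With an overlap of 0, as used later in this notebook, chunks share nothing.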

<h4> Pinecone </h4> 
- https://docs.pinecone.io/docs/python-client
- https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/pinecone.html
    - Trying to understand the Pinecone.from_texts() method
    - it comes from the LangChain API 

<h1 style="text-align: center;">Custom Chat Notebook</h1>

<h3>Libraries and Dependencies</h3> 

In [None]:
# Using LangChain to load the PDF and split it into chunks at the character level; the loader also stores page numbers
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

#Used for upserting our vector embeddings to Pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone
import os

#Question answering over documents, using large language models 
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

#Keys needed to use the APIs from OpenAI and Pinecone (placeholders)
OPENAI_API_KEY = "temp"
PINECONE_API_KEY = "temp"
PINECONE_API_ENV = "temp"
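
As an alternative to pasting keys into the notebook, they could be read from environment variables. This is only a sketch: the helper name `read_key` and the environment variable names are my own assumptions, and the `"temp"` fallback just mirrors the placeholders above.

```python
import os


def read_key(name, default="temp"):
    """Return the environment variable `name`, or a placeholder if it is unset."""
    return os.environ.get(name, default)


# Variable names are assumptions; use whatever names your environment defines.
OPENAI_API_KEY = read_key("OPENAI_API_KEY")
PINECONE_API_KEY = read_key("PINECONE_API_KEY")
PINECONE_API_ENV = read_key("PINECONE_API_ENV")
```

This keeps real keys out of the saved notebook, which matters if the file is shared or submitted.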

<h3>Loading PDF and Splitting Text: Creating our Vectors</h3>

In [5]:
#File path to the PDF document: Declaration of Independence
#(raw string so the backslashes are not treated as escape sequences)
pdf = r"U:\AnacondaZone\Final\decIndependence.pdf"

#loading Document 
loader = PyPDFLoader(pdf) #reads the PDF with pypdf
document = loader.load() #loads the file as pages: returns a list of Document objects

#Splitting text: recursive splitter: splitting text by looking at characters
recursiveSplit = RecursiveCharacterTextSplitter(chunk_size = 400 , chunk_overlap = 0)
chunked_Text = recursiveSplit.split_documents(document) #takes a list of Documents, returns a list of smaller Documents (chunks) 

#Displaying how many chunks were produced
print( "chunks: " + str(len(chunked_Text)) )   


chunks: 979


<h3> Creating our Embedding and using Pinecone</h3>

In [None]:
#creating embeddings using OpenAI API
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

#Initialize Pinecone
pinecone.init( api_key=PINECONE_API_KEY , environment=PINECONE_API_ENV )

#Index Name is that of my Index in Pinecone
index_name = "primis"

#Upserting our data into our index on Pinecone
docSearch = Pinecone.from_documents(chunked_Text, embeddings, index_name=index_name)

#If there is an already existing index in Pinecone, use this instead
#docSearch = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)

In [7]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm)
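
Every question cell below repeats the same two calls, so they could be wrapped in a small helper. This is a sketch under a few assumptions: the function name `ask` is hypothetical, `k=4` is an arbitrary number of chunks to retrieve, and the `page` metadata key is what I understand PyPDFLoader stores for each page. The fake classes exist only so the example runs without API keys; in the notebook, `docSearch` and `chain` would be passed instead.

```python
def ask(store, qa_chain, question, k=4):
    """Retrieve the k most similar chunks, answer from them, and report their pages."""
    docs = store.similarity_search(question, k=k)
    answer = qa_chain.run(input_documents=docs, question=question)
    pages = sorted({doc.metadata["page"] for doc in docs if "page" in doc.metadata})
    return answer.strip(), pages


# Tiny stand-ins for the vector store and the QA chain (illustration only).
class FakeDoc:
    def __init__(self, text, page):
        self.page_content = text
        self.metadata = {"page": page}


class FakeStore:
    def similarity_search(self, query, k=4):
        return [FakeDoc("We the People...", 0), FakeDoc("Article I...", 1)][:k]


class FakeChain:
    def run(self, input_documents, question):
        return f" Answered '{question}' from {len(input_documents)} chunks."


answer, pages = ask(FakeStore(), FakeChain(), "What is the Constitution about?")
print(answer, pages)
```

Returning the page numbers alongside the answer makes it easier to check a response against the PDF, which is useful when an answer looks doubtful.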

<h3> Questions</h3> 

<h4> Question 1 </h4> 

In [10]:
#Question 1 
query = "When was this document written?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' This document was written on July 25, 2007.'

<h4> Question 2 </h4> 

In [9]:
#Question 2 
query = "who signed the constitution of the united states"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The people who signed the Constitution of the United States were George Read, John Langdon, Gunning Bedford Jun, Nicholas Gilman, John Dickinson, Richard Bassett, Jaco: Broom, Nathaniel Gorham, Rufus King, James McHenry, Dan of Stthosjenifer, Danlcarroll, Wm. Saml. Johnson, Roger Sherman, John Blair, James Madison Jr., Alexander Hamilton, Wmblount, Wil: Livingston, Richd. Dobbs Spaight, David Brearley, Huwilliamson, Wm. Paterson, Jona: Dayton, J. Rutledge, Charles Cotesworth Pinckney, B F Franklin, Charles Pinckney, Thomas Mifflin, Pierce Butler, Robtmorris, Geo. Clymer, Thos. Fitzsimons, Jared Ingersoll, William Few, James Wilson, Abrbaldwin, Gouv Morris, and William Jackson.'

<h4> Question 3 </h4> 

In [11]:
#Question 3
query = "What is the index to the constitution and amendments?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The index to the constitution and amendments is located on page 29.'

<h4> Question 4 </h4> 

In [12]:
#Question 4
query = "What is the Constitution about?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The Constitution of the United States is about forming a more perfect Union, establishing Justice, insuring domestic Tranquility, providing for the common defense, promoting the general Welfare, and securing the Blessings of Liberty to ourselves and our Posterity.'

<h4> Question 5</h4> 

In [16]:
#Question 5
query = "What is Article II about?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Article II is about the President, Vice President, and all civil officers being removed on impeachment for and conviction of treason, bribery, or other high crimes and misdemeanors.'

<h4> Question 6</h4> 

In [17]:
#Question 6
query = "What is Article I Section 1 about?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)


' Article I Section 1 of the Constitution of the United States of America states that all legislative powers shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.'

<h4> Question 7</h4> 

In [20]:
#Question 7
query = "What is will the House of Representatives do?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The House of Representatives will have the sole power of impeachment.'

<h4> Question 8</h4> 

In [21]:
#Question 8
query = "Give me a summary of the Constitution of the United States?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)


' The Constitution of the United States was established by the people of the United States in order to form a more perfect union, establish justice, insure domestic tranquility, provide for the common defense, promote the general welfare, and secure the blessings of liberty to ourselves and our posterity. Article I of the Constitution outlines the legislative branch of the government.'

<h4> Question 9</h4> 

In [22]:
#Question 9 
query = "What are the rules for the Vice President?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The Vice President of the United States shall be President of the Senate, but shall have no Vote, unless they be equally divided.'

<h4> Question 10</h4> 

In [23]:
#Question 10
query = "What is the power of the President?"
docs = docSearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

" The President has the power to set taxes and duties that are uniform throughout the United States. They also have the power to appoint and remove officials from office. In the case of the President's death, resignation, or inability, the powers and duties of the office shall devolve on the Vice President."