<a href="https://colab.research.google.com/github/CDAC-lab/isie2023/blob/main/tutorial-notebook-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Expose a question and answer database on top of your personal document repositories.

This notebook demonstrates an end-to-end pipeline for processing and querying files stored in a Google Drive folder. We begin by accessing the folder and reading the files, then split their text into chunks and convert the chunks into embeddings using OpenAI. The embeddings are stored in a ChromaDB vector database, which we then query using Langchain.

## Table of Contents

1. [Introduction and Setting Up](#section1)
    - Introduction to the Notebook
    - Installing Necessary Libraries
    - Importing Libraries and Dependencies
2. [Accessing Google Drive](#section2)
    - Connecting to Google Drive
    - Reading Files from a Google Drive Folder
3. [Processing and Embedding with OpenAI](#section3)
    - Introduction to OpenAI's API
    - Processing and Embedding Files
4. [Storing Embeddings in ChromaDB](#section4)
    - Introduction to ChromaDB
    - Storing Vector Embeddings
5. [Querying with Langchain](#section5)
    - Introduction to Langchain
    - Setting Up Langchain for Querying
    - Formulating and Executing Queries
6. [Conclusion and Possible Extensions](#section6)
    - Summary of Achievements
    - Potential Future Work
7. [References and Additional Resources](#section7)

# Introduction and Setting Up

## Introduction to the Notebook
Welcome to our notebook! This project aims to process and query files stored in a Google Drive folder using OpenAI, ChromaDB, and Langchain.

## Installing Necessary Libraries
In this section, we'll guide you through the installation process for all the necessary libraries that we'll use throughout this notebook. This includes libraries for interacting with Google Drive and OpenAI, and for storing and querying data with ChromaDB and Langchain.

## Importing Libraries and Dependencies
Here, we'll import all the required Python libraries and dependencies. This includes standard libraries for data handling and manipulation, as well as libraries specific to our pipeline such as the API wrappers for Google Drive, OpenAI, ChromaDB, and Langchain.


## Install Libraries

In [None]:
!pip install openai
!pip install langchain
!pip install pypdf
!pip install tiktoken
!pip install chromadb
!pip install python-magic
!pip install pdf2image
!apt-get install -y poppler-utils

## Import Libraries

In [None]:
# Libraries for Google Drive authentication
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI, VectorDBQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
import matplotlib.pyplot as plt
# import magic
import os
import nltk
# import pytesseract
import textwrap
from pdf2image import convert_from_path

## Set OpenAI Key

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "give your key"

# Accessing Google Drive

## Connecting to Google Drive
In this section, we'll guide you through the process of connecting to Google Drive from this notebook. This involves authenticating with your Google account and setting up the necessary permissions.

## Reading Files from a Google Drive Folder
Once we're connected to Google Drive, we'll show you how to access a specific folder and read the files within it. We'll also discuss how to handle different types of files and any potential issues that might arise.
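
As a rough illustration of handling different file types, the sketch below picks a loader by file extension and sets aside files it cannot handle. The mapping, `pick_loader`, and `partition_files` are hypothetical names introduced here; this notebook itself only loads a single PDF with `PyPDFLoader`.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a loader name.
# Only the PDF entry is used in this notebook; ".txt" is an assumed extra.
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".txt": "TextLoader",
}


def pick_loader(filename):
    """Return the loader name for a file, or None if the type is unsupported."""
    suffix = Path(filename).suffix.lower()
    return LOADER_BY_EXTENSION.get(suffix)


def partition_files(filenames):
    """Split filenames into (supported, skipped) lists, preserving order."""
    supported, skipped = [], []
    for name in filenames:
        if pick_loader(name) is not None:
            supported.append(name)
        else:
            skipped.append(name)
    return supported, skipped
```

Reporting the skipped files, rather than silently ignoring them, makes it easier to spot documents that never reached the vector store.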

### Download documents

In [None]:
# Authenticate with your Google Drive credentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# File ID of the data set; this downloads the data file from the shared location
file_id = '1F4ujHU6hzj4mIJOBmkSLE7srYbWAhRb4'
sample_data = drive.CreateFile({'id':file_id})
sample_data.GetContentFile('prompt_engineering.pdf')

### Load documents

In [None]:
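# Load the PDF into per-page documents (long pages may be split further)
loader = PyPDFLoader("prompt_engineering.pdf")
pages = loader.load_and_split()
# Split the pages into chunks of roughly 1000 characters with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(pages)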

In [None]:
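def display_document(path, cols=4):
  # Render each PDF page to an image (requires poppler-utils)
  images = convert_from_path(path)
  print("Total number of pages:", len(images))
  # Use just enough rows to fit every page
  rows = max(1, -(-len(images) // cols))
  fig, axs = plt.subplots(rows, cols, figsize=(30, 6 * rows), squeeze=False)
  axs = axs.flatten()
  for ax in axs:
    ax.axis("off")  # hide ticks and frames, including on unused cells
  for img, ax in zip(images, axs):
    ax.imshow(img)
  fig.tight_layout()
  plt.show()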

In [None]:
# Display the PDF
display_document("prompt_engineering.pdf")


## Chunking documents

In [None]:
char_text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
doc_texts = char_text_splitter.split_documents(docs)
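
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size character windows. It is not how `CharacterTextSplitter` works internally (that class splits on a separator and merges the pieces); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text, chunk_size=2000, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so the window
    start advances by chunk_size - chunk_overlap each step.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With `chunk_overlap=0`, as used in this notebook, the chunks do not share any text. A non-zero overlap repeats the tail of each chunk at the start of the next, which can help when an answer straddles a chunk boundary, at the cost of more embeddings to compute and store.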

# Processing and Embedding with OpenAI

## Introduction to OpenAI's API
OpenAI's embeddings API converts text into vector embeddings: fixed-length lists of numbers in which texts with similar meaning end up close together. In this section, we'll provide a brief introduction to the API and explain how we'll use it in our pipeline.

## Processing and Embedding Files
Here, we'll walk you through the process of sending our document chunks to the OpenAI API, receiving vector embeddings in return, and preparing these embeddings for storage in ChromaDB.
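
Embeddings are useful because they can be compared numerically. The sketch below shows cosine similarity, one common way to compare two embedding vectors. The function name and the tiny example vectors are illustrative assumptions; real OpenAI embeddings have far more dimensions, and the vector store may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Values near 1.0 mean the vectors point the same way; 0.0 means orthogonal.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity([1.0, 0.0], [0.0, 1.0])` evaluates to `0.0`, because the two vectors are orthogonal.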

## Extract OpenAI embeddings to document chunks

In [None]:
openAI_embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])

## Create vector store

In [None]:
vStore = Chroma.from_documents(doc_texts, openAI_embeddings)

## Initialize VectorDBQA Chain from LangChain

In [None]:
model = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=vStore)



# Storing Embeddings in ChromaDB

## Introduction to ChromaDB
ChromaDB is an open-source vector database that we'll use to store our embeddings and retrieve them by similarity. In this section, we'll explain what ChromaDB is and why it's useful in our pipeline.

## Storing Vector Embeddings
Once we have our vector embeddings, it's time to store them in ChromaDB. We'll show you how to send the embeddings to ChromaDB, ensuring that they're properly indexed and ready for querying.
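
To illustrate what a vector store does conceptually, the toy class below keeps (text, vector) pairs in memory and returns the texts whose vectors are most similar to a query vector. `TinyVectorStore` is a hypothetical teaching aid, not ChromaDB's API or implementation; a real vector database adds persistence and faster indexing.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: brute-force cosine-similarity search."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        """Store a text together with its embedding vector."""
        self._items.append((text, list(vector)))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector, k=2):
        """Return the k stored texts most similar to the query vector."""
        scored = [(self._cosine(vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

A short usage example with made-up two-dimensional vectors:

```python
store = TinyVectorStore()
store.add("cats", [1.0, 0.0])
store.add("dogs", [0.9, 0.1])
store.add("cars", [0.0, 1.0])
nearest = store.query([1.0, 0.0], k=2)  # the two texts closest to the query
```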

# Querying with Langchain

## Introduction to Langchain
Langchain provides a natural language interface for querying our vector data. In this section, we'll provide an introduction to Langchain and explain how it fits into our pipeline.

## Setting Up Langchain for Querying
Before we can start querying, we need to set up Langchain. This section will guide you through the process of setting up Langchain to work with our vectorized data.

## Formulating and Executing Queries
With Langchain set up, we can now formulate and execute queries on our data. We'll walk you through the process of creating a query, sending it to Langchain, and interpreting the results.
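
The chain in this notebook is created with `chain_type="stuff"`, which places all retrieved chunks into a single prompt for the language model. The sketch below shows that idea in plain Python. `build_stuff_prompt` and the prompt wording are assumptions made for illustration, not Langchain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Combine retrieved text chunks and a question into one prompt string."""
    # Join the chunks with blank lines so each one stays visually separate
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because every retrieved chunk goes into one prompt, this approach is simple but limited by the model's context window, which is one reason chunk size matters earlier in the pipeline.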

## Question Answering

In [None]:
question = "What is prompt engineering?"
response = model.run(question)
print(response)

 Prompt engineering is a process of creating a set of prompts, or questions, that are used to guide the user toward a desired outcome. It is an effective tool for designers to create user experiences that are easy to use and intuitive. This method is often used in interactive design and software development, as it allows users to easily understand how to interact with a system or product.


In [None]:
question = "List 4 elements of a prompt and explain"
response = model.run(question)
print(response)

 Elements of a prompt include instructions, context, input data, and output indicator. Instructions tell the user what they need to do, context provides information the user needs to complete the task, input data is the information the user needs to provide, and output indicator tells the user what the result of their input should be.


# Conclusion and Possible Extensions

## Summary of Achievements
In this section, we'll summarize what we've achieved in this notebook, from reading files in a Google Drive folder to querying the content using Langchain.

## Potential Future Work
The pipeline we've built has many potential extensions and improvements. Here, we'll discuss some possibilities for future work, such as refining the processing and embedding process or expanding the types of queries we can handle.

# References and Additional Resources
To wrap up the notebook, we'll provide a list of references and additional resources that you can use to further explore the topics covered in this notebook.