# Expose a question and answer database on top of your personal document repositories.

This notebook demonstrates an end-to-end pipeline for processing and querying files stored in a Google Drive folder. We'll begin by downloading the files from the folder, then split them into chunks and convert the chunks into embeddings using OpenAI. These embeddings will be stored in a ChromaDB vector database, which we'll then query through LangChain.

## Table of Contents

1. [Introduction and Setting Up](#section1)
    - Introduction to the Notebook
    - Installing Necessary Libraries
    - Importing Libraries and Dependencies
2. [Accessing Google Drive](#section2)
    - Connecting to Google Drive
    - Reading Files from a Google Drive Folder
3. [Processing and Embedding with OpenAI](#section3)
    - Introduction to OpenAI's API
    - Processing and Embedding Files
4. [Storing Embeddings in ChromaDB](#section4)
    - Introduction to ChromaDB
    - Storing Vector Embeddings
5. [Querying with LangChain](#section5)
    - Introduction to LangChain
    - Setting Up LangChain for Querying
    - Formulating and Executing Queries
6. [Conclusion and Possible Extensions](#section6)
    - Summary of Achievements
    - Potential Future Work
7. [References and Additional Resources](#section7)

# Introduction and Setting Up

## Introduction to the Notebook
Welcome to our notebook! This project processes and queries files stored in a Google Drive folder using OpenAI, ChromaDB, and LangChain.

## Installing Necessary Libraries
In this section, we'll install the libraries used throughout this notebook: those for interacting with Google Drive and OpenAI, and those for storing and querying data with ChromaDB and LangChain.

## Importing Libraries and Dependencies
Here, we'll import the required Python libraries and dependencies: standard libraries for data handling, plus the libraries specific to our pipeline, such as the API wrappers for Google Drive, OpenAI, ChromaDB, and LangChain.


## Install Libraries

In [1]:
!pip install --upgrade openai langchain chromadb beautifulsoup4 -q
!pip install git+https://github.com/julian-r/python-magic.git
!pip install unstructured -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q
!apt-get install poppler-utils
!pip install tiktoken -q
!pip install pytesseract
!sudo apt install tesseract-ocr


## Import Libraries

In [1]:
# Libraries for Google Drive authentication
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI, VectorDBQA
from langchain.document_loaders import DirectoryLoader
import magic
import os
import nltk
import pytesseract

## Set OpenAI Key

In [2]:
import os
from getpass import getpass
# Prompt for the key at runtime; never hardcode a secret in a shared notebook.
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

# Accessing Google Drive

## Connecting to Google Drive
In this section, we'll guide you through the process of connecting to Google Drive from this notebook. This involves authenticating with your Google account and setting up the necessary permissions.

## Reading Files from a Google Drive Folder
Once we're connected to Google Drive, we'll show you how to access a specific folder and read the files within it. We'll also discuss how to handle different types of files and any potential issues that might arise.

In [3]:
# Authenticate with your Google Drive credentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# File ID of the data set; the calls below download it from the shared location as data.zip
file_id = '1et2mE2SoNus-LlX3WitlRucuSjJSbY1-'
sample_data = drive.CreateFile({'id':file_id})
sample_data.GetContentFile('data.zip')

# Unzip the downloaded archive
!unzip data.zip

Archive:  data.zip
replace Data/Prompt-Engineering-Lecture-Elvis.pdf? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: Data/Prompt-Engineering-Lecture-Elvis.pdf  


## Load files from a directory

In [4]:
loader = DirectoryLoader("Data/")
docs = loader.load()
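
Since the loader reads whatever it finds in the folder, it can help to check which file types are present before loading. The following is a minimal sketch using only the standard library; `count_by_extension` is a hypothetical helper, not part of LangChain, and walking the folder recursively is an assumption about what you want to inspect.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(folder: str) -> Counter:
    """Count regular files under folder (recursively), keyed by lowercase extension."""
    counts: Counter = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


# Only inspect the folder if it exists, so the cell is safe to run anywhere.
if Path("Data").is_dir():
    print(count_by_extension("Data"))
```

An unexpected extension in the result is a hint that the loader may need extra dependencies for that file type, or that the file should be removed before loading.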

In [5]:
print(len(docs))

1


In [6]:
print(docs[0].page_content)

Prompt Engineering A lecture by DAIR.AI

Elvis Saravia

Prerequisites & Objectives

Prerequisites:

Python • Knowledge of language models • Basic understanding of deep learning / ML concepts

Objectives

Present an introduction to prompt engineering • Present an overview of the latest prompting techniques • Provide demonstrations and exercises to practice different

prompting techniques

Agenda

Introduction to Prompt Engineering

Advanced Techniques for Prompt Engineering

Applications & Tools

Conclusion & Future Directions

Part 1

Introduction to Prompt Engineering

What are prompts?

Prompts involve instructions and context passed to a

language model to achieve a desired task

Prompt engineering is the practice of developing and optimizing prompts to efficiently use language models (LMs) for a variety of applications

Prompt engineering is a useful skill for AI engineers and

researchers to improve and efficiently use language models

What is prompt engineering?

Prompt engineeri

## Chunking documents

In [7]:
char_text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
doc_texts = char_text_splitter.split_documents(docs)
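
To make the effect of `chunk_size` concrete, here is a simplified sketch of separator-based chunking with no overlap. It is not LangChain's implementation: `split_text` is a hypothetical helper, and it assumes that pieces are separated by blank lines and that a piece longer than `chunk_size` is kept whole.

```python
def split_text(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new (possibly oversized) chunk.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "What are prompts?\n\nPrompts involve instructions and context.\n\nAgenda"
for chunk in split_text(sample, chunk_size=40):
    print(len(chunk), repr(chunk))
```

Smaller chunks give more precise retrieval but less context per chunk, so the value is a trade-off rather than a fixed rule.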

# Processing and Embedding with OpenAI

## Introduction to OpenAI's API
OpenAI's API allows us to process our files and convert them into vector embeddings. In this section, we'll provide a brief introduction to the API and explain how we'll use it in our pipeline.

## Processing and Embedding Files
Here, we'll walk you through sending our document chunks to the OpenAI API, receiving vector embeddings in return, and preparing these embeddings for storage in ChromaDB.

## Create OpenAI embeddings for document chunks

In [8]:
openAI_embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
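
An embedding is a vector of numbers, and texts with related meaning tend to get vectors that point in similar directions. A common way to compare two embeddings is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have far more dimensions, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings (assumption for illustration).
query = [0.9, 0.1, 0.0]
chunk_about_prompts = [0.8, 0.2, 0.1]
chunk_about_agenda = [0.0, 0.2, 0.9]

print(cosine_similarity(query, chunk_about_prompts))
print(cosine_similarity(query, chunk_about_agenda))
```

The first score is higher than the second, which is the signal a vector database uses to decide which chunk is most relevant to a query.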

## Create vector store

In [9]:
vStore = Chroma.from_documents(doc_texts, openAI_embeddings)
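
Conceptually, a vector store keeps each chunk next to its embedding and returns the chunks whose embeddings are closest to the query's embedding. The toy in-memory version below only illustrates that idea. It is not Chroma's API or implementation: `ToyVectorStore` and `toy_embed` are hypothetical names, and the letter-count "embedding" is a crude stand-in for the OpenAI embeddings used in this notebook.

```python
import math
from dataclasses import dataclass, field


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: counts of each letter a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToyVectorStore:
    texts: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, text: str) -> None:
        """Store the text together with its embedding."""
        self.texts.append(text)
        self.vectors.append(toy_embed(text))

    def query(self, question: str, k: int = 1) -> list[str]:
        """Return up to k stored texts, most similar to the question first."""
        q = toy_embed(question)
        ranked = sorted(
            range(len(self.texts)),
            key=lambda i: _cosine(q, self.vectors[i]),
            reverse=True,
        )
        return [self.texts[i] for i in ranked[:k]]


store = ToyVectorStore()
store.add("prompt engineering")
store.add("zzz qqq xxx")
print(store.query("engineering prompts", k=1))
```

A real vector database adds persistence and approximate nearest-neighbor indexes so that the lookup stays fast with many chunks; the scan over every stored vector here is only reasonable for a toy example.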

## Initialize VectorDBQA Chain from LangChain

In [10]:
model = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=vStore)
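
With `chain_type="stuff"`, the retrieved chunks are placed ("stuffed") together into a single prompt alongside the question. The sketch below shows the general shape of that step. The template wording is an assumption for illustration, not LangChain's actual prompt, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Chunks as they might come back from the vector store (text taken from the sample document).
retrieved = [
    "Prompts involve instructions and context passed to a language model to achieve a desired task",
    "Prompt engineering is the practice of developing and optimizing prompts",
]
print(build_stuff_prompt("What is prompt engineering?", retrieved))
```

Because every retrieved chunk lands in the same prompt, the chunks plus the question must fit in the model's context window, which is one reason the chunk size chosen earlier matters.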



# Storing Embeddings in ChromaDB

## Introduction to ChromaDB
ChromaDB is an open-source vector database that we'll use to store our embeddings and to look up the chunks most similar to a query. In this section, we'll explain what ChromaDB is and why it's useful in our pipeline.

## Storing Vector Embeddings
Once we have our vector embeddings, it's time to store them in ChromaDB. We'll show you how to send the embeddings to ChromaDB, ensuring that they're properly indexed and ready for querying.

# Querying with LangChain

## Introduction to LangChain
LangChain is a framework for building applications around language models. In our pipeline, it retrieves the most relevant chunks from the vector store and passes them to an OpenAI model to answer natural-language questions.

## Setting Up LangChain for Querying
Before we can start querying, we need to set up LangChain. This section will guide you through the process of setting up LangChain to work with our vectorized data.

## Formulating and Executing Queries
With LangChain set up, we can now formulate and execute queries on our data. We'll walk you through the process of creating a query, sending it to LangChain, and interpreting the results.

## Question Answering

In [12]:
question = "What is prompt engineering?"
response = model.run(question)
print(response)

AuthenticationError: ignored

In [13]:
question = "List 4 elements of a prompt and explain"
response = model.run(question)
print(response)

AuthenticationError: ignored
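
An `AuthenticationError` like the ones above usually points to a problem with the API key. As a minimal sketch, a check like the following can be run before building the chain. It only catches a missing or empty key and cannot tell whether the key is actually valid; `require_env` is a hypothetical helper.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; set it before building the chain.")
    return value


try:
    require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY is set")
except RuntimeError as err:
    print(err)
```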

# Conclusion and Possible Extensions

## Summary of Achievements
In this section, we'll summarize what we've achieved in this notebook, from reading files in a Google Drive folder to querying the content using LangChain.

## Potential Future Work
The pipeline we've built has many potential extensions and improvements. Here, we'll discuss some possibilities for future work, such as refining the processing and embedding process or expanding the types of queries we can handle.

# References and Additional Resources
To wrap up the notebook, we'll provide a list of references and additional resources that you can use to further explore the topics covered in this notebook.