# Expose a question-and-answer database on top of your personal document repositories

This notebook demonstrates an end-to-end pipeline for processing and querying files stored in a Google Drive folder. We mount the folder, load the files and split them into chunks, convert the chunks into embeddings with OpenAI, store those embeddings in a ChromaDB vector database, and query them through LangChain.

## Table of Contents

1. [Introduction and Setting Up](#section1)
    - Introduction to the Notebook
    - Installing Necessary Libraries
    - Importing Libraries and Dependencies
2. [Accessing Google Drive](#section2)
    - Connecting to Google Drive
    - Reading Files from a Google Drive Folder
3. [Processing and Embedding with OpenAI](#section3)
    - Introduction to OpenAI's API
    - Processing and Embedding Files
4. [Storing Embeddings in ChromaDB](#section4)
    - Introduction to ChromaDB
    - Storing Vector Embeddings
5. [Querying with LangChain](#section5)
    - Introduction to LangChain
    - Setting Up LangChain for Querying
    - Formulating and Executing Queries
6. [Conclusion and Possible Extensions](#section6)
    - Summary of Achievements
    - Potential Future Work
7. [References and Additional Resources](#section7)

# Introduction and Setting Up

## Introduction to the Notebook
Welcome to our notebook! This project aims to process and query files stored in a Google Drive folder using OpenAI, ChromaDB, and LangChain.

## Installing Necessary Libraries
This section installs the libraries used throughout the notebook: the clients for OpenAI, ChromaDB, and LangChain, plus the document-parsing and OCR tools (`unstructured`, `detectron2`, `poppler-utils`, `tesseract-ocr`) needed to read files of different types.

## Importing Libraries and Dependencies
Here, we'll import the required Python libraries and dependencies. This includes standard modules such as `os`, as well as the LangChain wrappers specific to our pipeline: OpenAI embeddings, the Chroma vector store, the text splitter, and the directory loader.


## Install Libraries

In [None]:
!pip install --upgrade openai langchain chromadb beautifulsoup4 -q
!pip install git+https://github.com/julian-r/python-magic.git
!pip install unstructured -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q
!apt-get install poppler-utils
!pip install tiktoken -q
!pip install pytesseract
!sudo apt install tesseract-ocr

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/julian-r/python-magic.git
  Cloning https://github.com/julian-r/python-magic.git to /tmp/pip-req-build-zweqg3_2
  Running command git clone --filter=blob:none --quiet https://github.com/julian-r/python-magic.git /tmp/pip-req-build-zweqg3_2
  Resolved https://github.com/julian-r/python-magic.git to commit 6029e2d43ce0ee9f268c1f112c70e5417493190f
  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
poppler-utils is already the newest version (0.86.1-0ubuntu1.1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information...

## Import Libraries

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI, VectorDBQA
from langchain.document_loaders import DirectoryLoader
import magic
import os
import nltk
import pytesseract

## Set OpenAI Key

In [None]:
import os
from getpass import getpass
# Prompt for the key at runtime; never hard-code or commit a real API key.
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

# Accessing Google Drive

## Connecting to Google Drive
In this section, we'll guide you through the process of connecting to Google Drive from this notebook. This involves authenticating with your Google account and setting up the necessary permissions.

## Reading Files from a Google Drive Folder
Once we're connected to Google Drive, we'll show you how to access a specific folder and read the files within it. We'll also discuss how to handle different types of files and any potential issues that might arise.
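Before loading a folder, it can help to see which file types it contains, since each type needs a different parser. The following is a minimal sketch using only the standard library; the helper name `count_files_by_extension` is hypothetical and not part of LangChain or the Google Drive tooling.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(folder):
    """Return a Counter mapping lowercase file extension to file count."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Files without a suffix are grouped under a placeholder label.
            counts[path.suffix.lower() or "<none>"] += 1
    return counts
```

Calling it on the mounted Drive folder, for example `count_files_by_extension("/content/drive/MyDrive/ChatGPT/Resources/Data/")`, gives a quick inventory of what the loader will encounter.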

## Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load files from a directory

In [None]:
loader = DirectoryLoader("/content/drive/MyDrive/ChatGPT/Resources/Data/")
docs = loader.load()

In [None]:
print(len(docs))

1


In [None]:
print(docs[0].page_content)

Basic understanding of deep learning / ML

Present an overview of the latest prompting techniques

demonstrations and exercises to practice techniques

Conclusion & Future Directions

Prompt engineering is a useful skill for AI engineers and researchers to improve and efficiently use language models

Important for research, discoveries, and advancement

Prompt Engineer andLibrarian

Compensation and Benefits*

SAN FRANCISCO, CA/ PRODUCT/ FULL-TIME / HYBRID

committed to pay fairness and aim for these three elements collectively to be highly competitive

Anthropic’s mission is to create reliable, interpretable, and steerable Al systems. We want Al to ksafe for our customers and for society as a whole.

Salary - The expected salary range for this position is $250k - $335k.

Anthropic’s Al technology is amongst the most capable and safe in the world. However, largelanguage models are a new type of intelligence, and the art of instructing them in a way thatdelivers the best results is stil

## Chunking documents

In [None]:
char_text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
doc_texts = char_text_splitter.split_documents(docs)
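To build intuition for what the chunking step does, here is a simplified sketch of separator-based chunking with no overlap. It is an illustration only, not LangChain's actual implementation: the function name `split_into_chunks` is hypothetical, and the blank-line separator is an assumption.

```python
def split_into_chunks(text, chunk_size=2000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # Skip empty pieces produced by repeated separators.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The next piece does not fit, so close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Larger chunks give the model more context per retrieved passage, while smaller chunks make retrieval more precise; the value 2000 used in this notebook is one point on that trade-off.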

# Processing and Embedding with OpenAI

## Introduction to OpenAI's API
OpenAI's API allows us to process our files and convert them into vector embeddings. In this section, we'll provide a brief introduction to the API and explain how we'll use it in our pipeline.

## Processing and Embedding Files
Here, we'll walk you through the process of sending our document chunks to the OpenAI API, receiving vector embeddings in return, and preparing these embeddings for storage in ChromaDB.
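Embeddings are useful because texts with similar meaning map to vectors that point in similar directions. One common way to measure this is cosine similarity. The sketch below uses plain Python lists as stand-ins for embeddings; real embeddings have far more dimensions, and whether a given vector store uses cosine similarity or another distance depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

A value near 1 means the two vectors point the same way, a value near 0 means they are unrelated, and a negative value means they point in opposite directions.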

## Extract OpenAI embeddings for document chunks

In [None]:
openAI_embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])


## Create vector store

In [None]:
vStore = Chroma.from_documents(doc_texts, openAI_embeddings)



## Initialize VectorDBQA Chain from LangChain

In [None]:
model = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=vStore)

# Storing Embeddings in ChromaDB

## Introduction to ChromaDB
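ChromaDB is an open-source vector database that we'll use to store our embeddings. In this section, we'll explain what ChromaDB is and why it's useful in our pipeline: it keeps each chunk's text alongside its embedding and returns the chunks whose embeddings are closest to a query embedding.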

## Storing Vector Embeddings
Once we have our vector embeddings, it's time to store them in ChromaDB. We'll show you how to send the embeddings to ChromaDB, ensuring that they're properly indexed and ready for querying.
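To make "indexed and ready for querying" concrete, here is a toy in-memory vector store. It is a hypothetical stand-in written for illustration and does not reflect ChromaDB's API or internals; the class name `TinyVectorStore` and its methods are assumptions. It normalizes vectors when they are added so that a plain dot product ranks results at query time.

```python
import math


class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database (not ChromaDB's API)."""

    def __init__(self):
        self._vectors = []
        self._texts = []

    @staticmethod
    def _normalize(embedding):
        norm = math.sqrt(sum(x * x for x in embedding))
        if norm == 0:
            raise ValueError("embedding must be non-zero")
        return [x / norm for x in embedding]

    def add(self, text, embedding):
        # Store unit-length vectors so a dot product equals cosine similarity.
        self._vectors.append(self._normalize(embedding))
        self._texts.append(text)

    def query(self, embedding, k=1):
        """Return the k stored texts most similar to the query embedding."""
        unit = self._normalize(embedding)
        scored = [
            (sum(a * b for a, b in zip(unit, vec)), text)
            for vec, text in zip(self._vectors, self._texts)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

This version scans every stored vector on each query, which is fine for a handful of chunks; dedicated vector databases use specialized index structures to keep lookups fast on large collections.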

# Querying with LangChain

## Introduction to LangChain
LangChain ties the pieces of our pipeline together: it retrieves the chunks most relevant to a question from the vector store and passes them to an OpenAI model, which generates the answer.

## Setting Up LangChain for Querying
Before we can start querying, we need a LangChain question-answering chain that connects the language model to the vector store. In this notebook, that is the `VectorDBQA` chain created with `chain_type="stuff"`.

## Formulating and Executing Queries
With LangChain set up, we can now formulate and execute queries on our data. We pass a natural-language question to the chain's `run` method and print the answer it returns.
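The chain in this notebook uses `chain_type="stuff"`, meaning the retrieved chunks are placed ("stuffed") together into a single prompt. The sketch below shows the general idea; the function name `build_stuff_prompt` and the prompt wording are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one prompt, as a 'stuff'-style chain does."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because every retrieved chunk goes into one prompt, this approach is simple but limited by the model's context window, which is one reason chunk size matters earlier in the pipeline.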

## Question Answering

In [None]:
question = "What is prompt engineering?"
response = model.run(question)
print(response)

 Prompt engineering is a skill used by AI engineers and researchers to improve and efficiently use language models. It is important for research, discoveries, and advancement, as it involves creating and using prompts to generate desired results with language models. Prompt engineering involves collecting demonstration data to train supervised policies, collecting comparison data to train reward models, and optimizing policies against the reward models using reinforcement learning algorithms.


In [None]:
question = "List 4 elements of a prompt and explain"
response = model.run(question)
print(response)



1. Instructions: this is the action or task that the prompt is asking you to do.

2. Context: this is the background information necessary to understand the instructions or task.

3. Input data: this is the data that is used to complete the instructions or task.

4. Output indicator: this is the result or expected outcome of the instructions or task.


# Conclusion and Possible Extensions

## Summary of Achievements
In this section, we'll summarize what we've achieved in this notebook, from reading files in a Google Drive folder to querying their content using LangChain.

## Potential Future Work
The pipeline we've built has many potential extensions and improvements. Here, we'll discuss some possibilities for future work, such as refining the processing and embedding process or expanding the types of queries we can handle.

# References and Additional Resources
To wrap up the notebook, we'll provide a list of references and additional resources that you can use to further explore the topics covered in this notebook.