<a href="https://colab.research.google.com/github/JohnnyKrup/chatgpt_with_pdf_data/blob/main/Chat_GPT_with_your_own_PDF_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use ChatGPT with your own data
This Jupyter Notebook lets you use ChatGPT with your own PDF files.
Upload one or more PDF files into a folder on your Google Drive, and ChatGPT will answer your questions based on their content and name the source file in which it found the information.

## Install

In [None]:
!pip install langchain
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken

## Load Required Packages

In [None]:
# The two langchain imports that we need for the PDFs and the Indexing of the data
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

## OpenAI API Key

In [None]:
# Get your API key from OpenAI.
# You will need an account. Create one here: https://openai.com/
# Create the API key in your account on the OpenAI platform; the billing overview is here: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = "YOUR-OPENAI-API-KEY"
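
Hard-coding the key in a cell makes it easy to leak when the notebook is shared or committed. A minimal sketch, assuming you would rather be prompted for the key at runtime; `resolve_api_key` is a hypothetical helper and not part of the original notebook, `langchain`, or `openai`:

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key from `env` if it is set, otherwise ask for it once.

    Hypothetical helper for illustration; not part of langchain or openai.
    """
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so it does not appear in the cell output
        key = prompt(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} must not be empty")
        # store it so libraries that read the environment variable can find it
        env[name] = key
    return key
```

Calling `resolve_api_key()` in a cell could then stand in for the hard-coded assignment, since the key still ends up in `os.environ["OPENAI_API_KEY"]`.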

## Connect your Google Drive

In [None]:
# Connect your Google Drive.
# The default Drive folder is named "My Drive"; I did not rename mine. Adjust root_dir if yours differs.
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
root_dir = "/content/drive/My Drive/Colab Notebooks"

Mounted at /content/drive/


In [None]:
# Add a folder (I called mine "data") to the folder that contains your Colab notebook.
# I added an AI Index Report (24 MB, English) and a German document (4.5 MB) to my Drive to test different scenarios.
# Remember: the larger your documents are, the more expensive the embedding process is.
pdf_folder_path = f'{root_dir}/data/'
os.listdir(pdf_folder_path)

['AI-Index-Report_2023.pdf', 'obligationenrecht.pdf']
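
Embedding is billed by the amount of text, not by file size, so a rough estimate needs the extracted text rather than the megabytes on disk. The sketch below is only an approximation: `estimate_embedding_cost` is a hypothetical helper, the default of 4 characters per token is a rule of thumb for English text, and the price is a parameter you must take from OpenAI's current pricing page (no price is assumed here).

```python
def estimate_embedding_cost(texts, price_per_1k_tokens, chars_per_token=4.0):
    """Roughly estimate token count and cost for embedding `texts`.

    Hypothetical helper. `chars_per_token=4.0` is a rule of thumb for English
    text; other languages and tokenizers can differ noticeably.
    `price_per_1k_tokens` must be supplied by the caller.
    """
    total_chars = sum(len(t) for t in texts)
    est_tokens = total_chars / chars_per_token
    est_cost = est_tokens / 1000 * price_per_1k_tokens
    return est_tokens, est_cost
```

For an exact count you would tokenize the text with `tiktoken`, which the install step above already provides.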

## Loading the PDF Files

In [None]:
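# Iterate over the files in your folder and create one UnstructuredPDFLoader per PDF.
# Files without a .pdf extension are skipped so that stray files in the folder do not break loading.
loaders = [
    UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn))
    for fn in os.listdir(pdf_folder_path)
    if fn.lower().endswith('.pdf')
]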

In [None]:
# check if the loaders were created
loaders

[<langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x79672fe739a0>,
 <langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x796760db2590>]

## Vector Store
Chroma is used as the vector store to index and search the embeddings.

`VectorstoreIndexCreator` performs three steps:

- Splitting documents into chunks

- Creating embeddings for each chunk

- Storing documents and embeddings in a vectorstore
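
The three steps above can be illustrated with a toy, dependency-free sketch. This is not how LangChain or Chroma implement them: the word-count "embedding" stands in for the dense vectors an embedding model would return, and every name here (`split_into_chunks`, `embed`, `ToyVectorStore`) is hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=6, overlap=2):
    """Step 1: cut the text into overlapping windows of `chunk_size` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def embed(text):
    """Step 2: toy embedding = word counts (a real one is a dense model vector)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Step 3: keep (chunk, vector) pairs and return the closest chunks."""

    def __init__(self):
        self.rows = []

    def add(self, chunks):
        self.rows.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


text = "Chroma stores vectors on disk. Python skills appear in many job postings."
store = ToyVectorStore()
store.add(split_into_chunks(text))
best = store.search("job postings", k=1)
```

In the real pipeline, the retrieved chunks are passed to the language model as context for the question, which is why the answer can cite the file a chunk came from.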

In [None]:
# start the splitting, embedding and storing of your data
index = VectorstoreIndexCreator().from_loaders(loaders)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
# check if the index was created
index

VectorStoreIndexWrapper(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x79672fe73910>)

## Asking Questions
You can query the index to get just the answer:
`index.query('Which gender is more open towards AI?')`
or query it and also get the source of the answer, as in the following cell.

In [None]:
index.query_with_sources("Which 3 specialized skills were found in job postings in the US in 2022 and what's the difference in demand compared to 2010-2012?")

{'question': "Which 3 specialized skills were found in job postings in the US in 2022 and what's the difference in demand compared to 2010-2012?",
 'answer': ' The top 3 specialized skills found in job postings in the US in 2022 were Python (Programming Language), Computer Science, and SQL (Programming Language). The demand for these skills increased by 592%, 63%, and 153% respectively compared to 2010-2012.\n',
 'sources': '/content/drive/My Drive/Colab Notebooks/data/AI-Index-Report_2023.pdf'}
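
The returned dictionary is easier to read once the answer and the sources are separated. A minimal sketch, assuming the `answer` and `sources` keys seen in the output above and assuming that multiple sources, if any, arrive as one comma-separated string; `format_result` is a hypothetical helper, not a LangChain function:

```python
def format_result(result):
    """Turn a query_with_sources-style dict into a short printable report.

    Assumption: 'sources' is a single string, with several sources separated
    by commas. File paths that contain commas would be split incorrectly.
    """
    answer = result.get("answer", "").strip()
    sources = [s.strip() for s in result.get("sources", "").split(",") if s.strip()]
    lines = [answer, "Sources:"]
    lines += [f"- {s}" for s in sources] or ["- (none reported)"]
    return "\n".join(lines)
```

You could then wrap the call from the cell above, for example `print(format_result(index.query_with_sources(...)))` with your own question.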

## Disclaimer:
Note: Calling the API in the way demonstrated in this example will incur monetary charges. Refer to OpenAI's pricing information for details.
Here you can check your usage: https://platform.openai.com/account/usage

Be aware that information, such as files used to train OpenAI's LLM, can become public if applied in the way this demo demonstrates. Refer to OpenAI's usage policy for details.

Do not use for actual tax filing purposes. This demo is for educational purposes only and for demonstrating machine learning methods. The author makes no claims that the outcomes shown here or any outcomes that could be produced by this method are accurate or reliable.