<a href="https://colab.research.google.com/github/BeratE/GoogleColab/blob/main/PDFRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dependency Management

In [3]:
# Install utilities
!apt-get install -y poppler-utils tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  poppler-utils tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 4 newly installed, 0 to remove and 45 not upgraded.
Need to get 5,002 kB of archives.
After this operation, 16.3 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.3 [186 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 5,002 kB in 1s (5,164 kB/s)
Selecting previously unselected package popp

In [17]:
# Install dependencies
!pip install -q langchain huggingface_hub sentence_transformers faiss-cpu chromadb Cython tiktoken "unstructured[local-inference]"
!pip install torch==2.2.1 wrapt==1.11.0 pillow==10.1.0
# Session restart required after this cell so the pinned versions above are picked up.



# Language Model

In [1]:
import os
import getpass
from google.colab import userdata

# Prefer the Colab secret; fall back to an interactive prompt if it is unavailable.
try:
    token = userdata.get('HF_TOKEN')
except Exception:  # secret missing or notebook access not granted
    token = None
if not token:
    token = getpass.getpass("Please provide your HuggingFaceHub API Token: ")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = token

In [2]:
from langchain_community.llms import HuggingFaceEndpoint
llm = HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-Instruct-v0.2", temperature=0.1)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Mount Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


# Document Embedding

In [4]:
from pathlib import Path
from langchain_community.document_loaders import UnstructuredPDFLoader
root = "/content/gdrive/My Drive/Books/GameAIPro"
loaders = [UnstructuredPDFLoader(str(fn)) for fn in Path(root).glob('**/*.pdf')]
loaders

[<langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9972f80>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9973040>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9973220>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c99732b0>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9973370>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9973400>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9973460>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c99734f0>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9973580>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9973610>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader at 0x7a99c9973670>,
 <langchain_community.document_loaders.pdf.UnstructuredPDFLoader 

In [5]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings

# Split each PDF into chunks of about 1000 characters, embed them, and build the index.
index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
).from_loaders(loaders)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
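
The index cell above passes `chunk_size=1000` and `chunk_overlap=0` to the text splitter. As a rough illustration of what those two parameters control, here is a simplified pure-Python sketch. It is not LangChain's implementation (`CharacterTextSplitter` splits on a separator and merges pieces up to the size limit); `split_fixed` is a hypothetical helper that cuts fixed-size character windows.

```python
def split_fixed(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Each window shares chunk_overlap characters with the previous one.
    Hypothetical helper for illustration only, not a LangChain API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Small sizes so the effect of the overlap is easy to see.
chunks = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With `chunk_overlap=0`, as in this notebook, consecutive chunks share no text, so a sentence that straddles a chunk boundary is split between two embeddings.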


In [6]:
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=index.vectorstore.as_retriever(),
                                    input_key="question")
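
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. A minimal sketch of that idea follows; `build_stuff_prompt` and the prompt wording are assumptions for illustration, and the template `RetrievalQA` actually uses differs.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question.

    Hypothetical helper; the wording is illustrative, not LangChain's template.
    """
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt_text = build_stuff_prompt(
    "What is a behavior tree?",
    ["Chunk one. ", " Chunk two."],
)
print(prompt_text)
```

Because every retrieved chunk goes into one prompt, the combined context has to fit within the model's context window.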

# QA

In [13]:
import textwrap

def wrap_text(text, width=110):
    """Wrap each line of text to the given width, preserving existing line breaks."""
    lines = text.split('\n')
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]
    return '\n'.join(wrapped_lines)
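
The helper above wraps line by line instead of calling `textwrap.fill` once on the whole answer. The short comparison below sketches why, using a made-up two-paragraph string as the sample input.

```python
import textwrap

# Made-up sample standing in for a model answer with two paragraphs.
text = "First paragraph of an answer.\n\nSecond paragraph."

# Filling the whole string treats newlines as ordinary whitespace,
# so the blank line between the paragraphs is lost.
collapsed = textwrap.fill(text, width=110)

# Filling line by line keeps the original line breaks.
preserved = "\n".join(textwrap.fill(line, width=110) for line in text.split("\n"))

print(repr(collapsed))
print(repr(preserved))
```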

In [19]:
# Simple question loop; type "quit" to stop.
prompt = input('How can I help you?\n\n')
while prompt != "quit":
    answer = chain.invoke(prompt)
    print("\n")
    print(wrap_text(answer["result"]))
    prompt = input("\n")

How can I help you?

Hello


 Hi there! (Assuming the character's %greeting% is "Hi there!")

Explanation:

The context provided explains that character-specific phrases are used to fill in dialog templates in social
exchanges. These phrases are defined in a character definition and are used to make the social exchange dialog
specific to that character. In this example, the character's %greeting% is "Hi there!". Therefore, when a
social exchange includes the %greeting% tag, it will be replaced by the character's specific phrase, "Hi
there!". This makes the social exchange feel more natural and personalized to the character.

quit
