# LLM chatbot using the Pinecone vector database
* In this notebook we build a demo chatbot that answers questions from the PDFs we have.
* The chatbot is expected to generate its answers from the PDF content for the questions asked.
* The documents are preprocessed and split into smaller chunks.
* These chunks are then converted into embeddings and stored in our vector database.
* The embeddings come from pretrained models (OpenAI and Sentence Transformers); no model is trained in this notebook.


In [5]:
# Installing libraries.

%pip install --upgrade --quiet pinecone-client langchain-openai tiktoken langchain

!pip install pypdf -q
!pip install unstructured==0.7.12 -q

!pip install "unstructured[local-inference]" -q
# System package for PDF handling (assumed to be poppler-utils).
!apt-get install -y poppler-utils


In [36]:
!pip install sentence-transformers


Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=efb42457a931ea6be8088c292acd135ceec34c5565375154860406f7c2e079b5
  Stored in directory: 

In [30]:
# Importing all the necessary libraries.

import pinecone
import langchain
import openai
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Pinecone
import os

## Data Preprocessing
* We load every PDF file in our folder; the loader returns one document per page.


In [31]:
# Loading data from folders.
directory = '/content/drive/MyDrive/chat_data'

def load_docs(directory):
  loader = PyPDFDirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)

901
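
Because the loader returns one document per PDF page, a per-file page count is a quick way to confirm that every PDF was picked up. The following is a minimal sketch, assuming each document's `metadata` carries a `source` path (as in the retrieval output of this notebook). Plain dicts stand in for the loaded documents, and the file names are hypothetical.

```python
from collections import Counter
import os

def pages_per_source(metadatas):
    """Count how many page-level documents came from each source file."""
    return Counter(os.path.basename(m["source"]) for m in metadatas)

# Hypothetical metadata records standing in for `[d.metadata for d in documents]`.
sample = [
    {"source": "/data/a.pdf", "page": 0},
    {"source": "/data/a.pdf", "page": 1},
    {"source": "/data/b.pdf", "page": 0},
]
counts = pages_per_source(sample)
print(counts.most_common())
```

With the real data, the counts should add up to `len(documents)`.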

In [32]:
# Splitting the text with RecursiveCharacterTextSplitter, which breaks each page into smaller chunks.

def text_splitter(documents, chunk_size = 500, chunk_overlap = 50):
  # chunk_size and chunk_overlap are measured in characters because length_function is len.
  split_ = RecursiveCharacterTextSplitter(chunk_size = chunk_size, chunk_overlap = chunk_overlap, length_function = len, is_separator_regex = False)
  docs = split_.split_documents(documents)
  return docs

doc = text_splitter(documents)
print(len(doc))

5034
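
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only a sketch of the overlap idea, not the library's algorithm: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraphs and spaces, so its chunk boundaries will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is short enough.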


In [33]:
print(doc[520].page_content)

fees, incurred by us subsequent to the termination or expiration of this Agreement in obtaining injunctive 
or other relief for enforceme nt of any provisions of this Section 12.  
 
  (g) You shall immediately turn over to us all materials including all manuals, records, 
files, instructions, correspondence, all materials related to operating the Franchised Business, including, 
without limitati on, brochures, agreements, invoices, Disclosure Documents, and any and all other materials


## Text to Embeddings
* In this section we convert our chunks into embeddings. We set up OpenAIEmbeddings first, then generate vectors with a Sentence Transformers model.

In [23]:

embeddings = OpenAIEmbeddings(api_key = 'YOUR_OPENAI_API_KEY')
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7c17d7e582e0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7c17d7e587c0>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='<redacted>', openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None)
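
Displaying the embeddings object echoes its whole configuration, including the API key field, so it is safer to keep keys out of the notebook and out of cell outputs. A minimal sketch, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; the helpers `require_env` and `mask_secret` are hypothetical names introduced here.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell")
    return value

def mask_secret(value, visible=4):
    """Hide all but the last few characters of a secret before printing it."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

# Usage (not executed here):
# embeddings = OpenAIEmbeddings(api_key = require_env("OPENAI_API_KEY"))
print(mask_secret("abcdefgh"))  # ****efgh
```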

In [39]:
# Generating vectors.
sentence_embeddings = SentenceTransformerEmbeddings(model_name = 'all-MiniLM-L6-v2')
vectors = sentence_embeddings.embed_query('Hello World')
len(vectors)

384
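
Each text becomes a 384-dimensional vector, and retrieval works by comparing the query vector with the stored ones. Cosine similarity is one common comparison; whether the `chatbot` index uses it is not stated in this notebook, so treat the following as an illustration with toy vectors rather than a description of the index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; real embeddings here would have 384 entries.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction, close to 1
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal, 0
```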

## VectorDB (Pinecone)
* In this section we push our embeddings to the Pinecone vector database.

In [56]:
# Initialize Pinecone
pinecone.init(
    api_key = 'YOUR_API_KEY',
    environment = 'gcp-starter'
)

# The 'chatbot' index must already exist with dimension 384 to match all-MiniLM-L6-v2.
index_name = 'chatbot'
index = Pinecone.from_documents(doc, sentence_embeddings, index_name = index_name)
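
Two things tend to go wrong when pushing vectors: a vector whose length differs from the index dimension, and uploads that are too large for one request. The sketch below checks the dimension and groups records into batches before any upload. The helper `make_batches` and the batch size of 100 are assumptions for illustration; the code does not call Pinecone.

```python
def make_batches(records, dimension, batch_size=100):
    """Yield lists of (id, vector) records after checking each vector's length."""
    batch = []
    for record_id, vector in records:
        if len(vector) != dimension:
            raise ValueError(
                f"{record_id}: expected {dimension} dimensions, got {len(vector)}"
            )
        batch.append((record_id, vector))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch

# Hypothetical records: 250 zero vectors of length 384.
records = [(f"chunk-{i}", [0.0] * 384) for i in range(250)]
batches = list(make_batches(records, dimension=384, batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```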

In [67]:
# Running our chatbot: retrieve the k chunks most similar to the query (with scores if score=True).
def similarity_score(query, k=10, score=False):
  if score:
    similar_docs = index.similarity_search_with_score(query, k=k)
  else:
    similar_docs = index.similarity_search(query, k=k)

  return similar_docs

query = 'What is BIGGBY COFFEE?'
output = similarity_score(query)
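
The search returns chunks, not an answer. To get an answer, the retrieved chunks would typically be passed to a language model as context. The sketch below only assembles such a prompt; the wording, the `max_chars` limit, and the name `build_prompt` are assumptions, and no model is called.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Assemble a prompt that asks a model to answer only from the given chunks."""
    context_parts = []
    used = 0
    for text in chunks:
        if used + len(text) > max_chars:
            break  # stop before the context grows past the limit
        context_parts.append(text)
        used += len(text)
    context = "\n---\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Placeholder texts standing in for `[d.page_content for d in output]`.
sample_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_prompt("What is BIGGBY COFFEE?", sample_chunks)
print(prompt)
```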

In [69]:
output

[Document(page_content='BIGGBY® or BIGGBY  COFFEE® or any designation indicating or tending to indicate that \nFranchise Owner is an authorized franchise owner of the Company;  \n   b. promptly surrender to the Company, or transfer to the buyer, any signs, \nstationery, letterhead, forms, printed matter and advertising conta ining the BIGGBY\n® \nmarks, all similar names or marks, any name or mark containing the designation BIGGBY® \nor BIGGBY C OFEE® or any designa tion indicating or tending to indicate that Franchise', metadata={'page': 157.0, 'source': '/content/drive/MyDrive/chat_data/02-Biggby Coffee April 30, 2021 FDD-clean-final v2.pdf'}),
 Document(page_content='COFFEE  Stores and the pr oduct s and services sold by BIG GBY® COFFEE  Store s and the \ngroup purchasing power of  BIGGBY® COFFEE  Stores , you mu st purcha se all pr oducts and \nservices used in  the design, development , construction,  and operation of you r Store in \naccordance with our specifi cations and only f