<a href="https://colab.research.google.com/github/Duxst/RAG_Company_Documents/blob/main/RAG%20Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### What is RAG anyway?


![withoutRAG](https://github.com/user-attachments/assets/649d6101-b63a-4750-997a-b6abc25e5609)

![withRAG](https://github.com/user-attachments/assets/e6dd9c46-0bf9-4c31-bd72-a27939ef82b8)

Retrieval-Augmented Generation (RAG) is a technique used in GenAI applications to improve the quality and accuracy of text generated by LLMs. It combines two key processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system passes it to the model along with the question. The model (for example, GPT) then writes new text (answers, explanations, etc.) grounded in the retrieved information.
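
The two steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not the pipeline built later in this notebook: word overlap stands in for embedding similarity, the "generation" step stops at building the prompt an LLM would receive, and the names `KNOWLEDGE_BASE`, `tokenize`, `retrieve`, and `build_prompt` are hypothetical.

```python
# Toy illustration of retrieve-then-generate. Word overlap stands in for
# embedding similarity, and the "generation" step only builds the prompt
# that would be sent to an LLM.

KNOWLEDGE_BASE = [
    "Invoice 10742 was issued to Elizabeth Lincoln in Tsawassen, Canada.",
    "Purchase order 10871 was placed by Laurence Lebihan on 2018-02-05.",
    "Purchase order 10901 was placed by Carlos Hernández on 2018-02-23.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of words, dropping punctuation."""
    return {word.strip(".,?!'\"").lower() for word in text.split()} - {""}


def retrieve(query, documents, top_k=1):
    """Return the top_k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, contexts):
    """Combine retrieved passages and the question into one prompt string."""
    context_block = "\n".join(contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}"


question = "Who placed purchase order 10871?"
prompt = build_prompt(question, retrieve(question, KNOWLEDGE_BASE))
print(prompt)
```

In a real system the overlap score would be replaced by vector similarity over embeddings, and the prompt would be sent to a language model.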

# Install libraries

In [1]:
! pip install langchain langchain-community openai groq tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference sentence-transformers


In [2]:
from langchain_community.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os
from groq import Groq



# Initialize the HuggingFace Embeddings client

In [3]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


In [4]:
text = "Hello my name is Abdalla"

query_result = embeddings.embed_query(text)

In [5]:
query_result

[0.06602814048528671,
 0.005066658835858107,
 -0.007188325747847557,
 -0.03836479410529137,
 0.04990890622138977,
 0.009791253134608269,
 0.06027050316333771,
 -0.009924323298037052,
 -0.024751506745815277,
 -0.01658320426940918,
 0.016217662021517754,
 -0.132093146443367,
 0.061138298362493515,
 0.006684721447527409,
 0.02685139700770378,
 -0.015643121674656868,
 0.016745170578360558,
 -0.06255047768354416,
 0.04465644806623459,
 -0.004165337421000004,
 0.030995329841971397,
 0.038594625890254974,
 -0.034895189106464386,
 0.04415293037891388,
 -0.0041908929124474525,
 -0.024685004726052284,
 -0.02261386439204216,
 -0.021270744502544403,
 0.005908634513616562,
 0.10629647225141525,
 0.04189879819750786,
 0.010589459910988808,
 0.03507106751203537,
 -0.013485178351402283,
 1.5476588259843993e-06,
 -0.05610208213329315,
 -0.029625384137034416,
 0.0037746569141745567,
 -0.05571063980460167,
 -0.05018450319766998,
 0.04794641211628914,
 0.01768467016518116,
 -0.000853199977427721,
 -0.0056

# Calculating sentence similarity with embeddings

In [6]:
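from functools import lru_cache


@lru_cache(maxsize=None)
def load_model(model_name):
    # Load each SentenceTransformer once; reloading it on every call is slow
    return SentenceTransformer(model_name)


def get_huggingface_embeddings(text, model_name="sentence-transformers/all-mpnet-base-v2"):
    # Encode the text into a dense vector (768 dimensions for all-mpnet-base-v2)
    model = load_model(model_name)
    return model.encode(text)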


def cosine_similarity_between_sentences(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_huggingface_embeddings(sentence1))
    embedding2 = np.array(get_huggingface_embeddings(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like walking to the park"
sentence2 = "I like running to the office"


similarity = cosine_similarity_between_sentences(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")

Embedding for Sentence 1: [[-5.18316701e-02  5.11823222e-02  1.72798848e-03 -1.36199668e-02
  -1.06868555e-03  2.96393596e-02 -4.72495705e-02 -2.11421214e-02
   5.48423491e-02  2.37766840e-02 -8.88854358e-03  1.03983447e-01
   1.87567454e-02 -6.70851534e-03 -3.84318568e-02 -7.80755132e-02
  -5.44625567e-03  6.69372454e-03 -1.80737358e-02  3.50140929e-02
  -3.07590067e-02  3.44667174e-02 -5.48805622e-03 -2.29204036e-02
   9.91364010e-03 -1.50746563e-02  1.37100592e-02 -3.11790481e-02
   7.79692158e-02  3.52224931e-02 -1.94614213e-02 -1.78903006e-02
   2.13377643e-02 -1.85624808e-02  1.29274201e-06  7.14494567e-03
  -7.68514466e-04  1.04230279e-02  3.67814638e-02 -3.46986540e-02
   3.50453444e-02  1.30667305e-02  1.00722983e-02 -4.18641744e-03
   2.04598270e-02 -2.74207480e-02  3.01959552e-02  2.14188565e-02
  -6.43194094e-02  1.04756653e-02 -4.66440897e-03 -4.05049063e-02
  -5.80140129e-02  1.99005492e-02 -2.49037729e-03  8.85134488e-02
   6.04227372e-02  1.96583532e-02  5.06717972e-02 
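
To make the similarity score less opaque, here is a minimal sketch of the formula that `cosine_similarity` evaluates: the dot product of the two vectors divided by the product of their lengths. The function name `cosine_similarity_manual` and the 3-dimensional vectors are made up for illustration; real sentence embeddings from this model have 768 dimensions.

```python
import math


def cosine_similarity_manual(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors, for illustration only
same_direction = cosine_similarity_manual([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_manual([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity_manual([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(f"{same_direction:.4f} {orthogonal:.4f} {opposite:.4f}")
```

The three results should come out close to 1, 0, and -1: vectors pointing the same way score high regardless of their length, unrelated directions score near zero, and opposite directions score negative.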

# Load in the Data

Learn more about the dataset [here](https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset)

In [7]:
! kaggle datasets download -d ayoubcherguelaine/company-documents-dataset
! unzip company-documents-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
License(s): apache-2.0
Downloading company-documents-dataset.zip to /content
 86% 8.00M/9.34M [00:00<00:00, 79.7MB/s]
100% 9.34M/9.34M [00:00<00:00, 89.6MB/s]
Archive:  company-documents-dataset.zip
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_1.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_2.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_3.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_4.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_5.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_6.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Cat

In [8]:
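def process_directory(directory_path):
    """Load every PDF under directory_path and return a list of {"File", "Data"} dicts."""
    data = []
    for root, _, files in os.walk(directory_path):
        for file in files:
            # Skip anything that is not a PDF; PyPDFLoader cannot parse other formats
            if not file.lower().endswith(".pdf"):
                continue

            file_path = os.path.join(root, file)
            print(f"Processing file: {file_path}")
            loader = PyPDFLoader(file_path)
            # loader.load() returns one Document per page
            data.append({"File": file_path, "Data": loader.load()})

    return data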

directory_path = "/content/CompanyDocuments"
documents = process_directory(directory_path)


Processing file: /content/CompanyDocuments/Shipping orders/order_10412.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10714.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10308.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10944.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10542.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10562.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10306.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10519.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10427.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10480.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10366.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10295.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10671.pdf
Processing file: /content

# Setting up Pinecone
**1. Create an account on [Pinecone.io](https://app.pinecone.io/)**

**2. Create a new index called "rag-workshop" and set the dimensions to 768, which matches the size of the vectors produced by `all-mpnet-base-v2`. Leave the rest of the settings as they are.**

![Screenshot 2024-11-28 at 12 01 30 AM](https://github.com/user-attachments/assets/548657af-ad75-4767-9bcf-41998e01a33e)


**3. Create an API Key for Pinecone**

![Screenshot 2024-11-24 at 10 44 37 PM](https://github.com/user-attachments/assets/e7feacc6-2bd1-472a-82e5-659f65624a88)


**4. Store your Pinecone API Key within Google Colab's secrets section, and then enable access to it (see the blue checkmark)**


![Screenshot 2024-11-24 at 10 45 25 PM](https://github.com/user-attachments/assets/eaf73083-0b5f-4d17-9e0c-eab84f91b0bc)
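
Step 2 can also be done from code instead of the web console. The sketch below is one possible way to do it, not part of the original workshop: `ensure_index` is a hypothetical helper, the `cosine` metric is an assumption about the console default, and the cloud and region in the usage comment are placeholders to replace with your own. It relies on the Pinecone client's `list_indexes()` and `create_index()` methods.

```python
def ensure_index(pc, spec, name="rag-workshop", dimension=768, metric="cosine"):
    """Create the index if it does not exist yet; return True if it was created.

    `pc` is a pinecone.Pinecone client and `spec` a pinecone.ServerlessSpec
    (or PodSpec). The dimension must equal the embedding model's output size.
    The metric default is an assumption, not something the workshop specifies.
    """
    if name in pc.list_indexes().names():
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True


# Usage in the notebook (cloud and region are assumptions, pick your own):
# from pinecone import Pinecone, ServerlessSpec
# pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))
# ensure_index(pc, ServerlessSpec(cloud="aws", region="us-east-1"))
```

Passing the client and spec in as arguments keeps the helper free of credentials and makes it safe to re-run the cell: a second call finds the index and does nothing.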




In [9]:
pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

index_name = "rag-workshop"

namespace = "company-documents"

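# Pass the namespace so reads and writes through this store target "company-documents"
vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings, namespace=namespace)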

# Insert Data into Pinecone

In [10]:
for document in documents:
    print(document['File'])
    print(document['Data'])
    print("\n")

Streaming output truncated to the last 5000 lines.
/content/CompanyDocuments/PurchaseOrders/purchase_orders_10871.pdf
[Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10871.pdf', 'page': 0}, page_content="Purchase Orders\nOrder ID Order Date Customer Name\n10871 2018-02-05 Laurence Lebihan\nProducts\nProduct ID: Product: Quantity: Unit Price:\n6 Grandma's Boysenberry Spread 50 25\n16 Pavlova 12 17.45\n17 Alice Mutton 16 39\nPage 1")]


/content/CompanyDocuments/PurchaseOrders/purchase_orders_10901.pdf
[Document(metadata={'source': '/content/CompanyDocuments/PurchaseOrders/purchase_orders_10901.pdf', 'page': 0}, page_content="Purchase Orders\nOrder ID Order Date Customer Name\n10901 2018-02-23 Carlos Hernández\nProducts\nProduct ID: Product: Quantity: Unit Price:\n41 Jack's New England Clam Chowder 30 9.65\n71 Flotemysost 30 21.5\nPage 1")]


/content/CompanyDocuments/PurchaseOrders/purchase_orders_10909.pdf
[Document(metadata={'sourc

In [11]:
document_data = []

for document in documents:

    document_source = document['File']
    # Join every page so multi-page PDFs are not cut off after the first page
    document_content = "\n".join(page.page_content for page in document['Data'])

    doc = Document(
        metadata= {
            "source": document_source
        },
        page_content=f"Source: {document_source}\n{document_content}"
    )

    document_data.append(doc)

    print(doc)
    print("\n")


Streaming output truncated to the last 5000 lines.
Country: Sweden
Phone: 0921-12 34 65
Fax: 0921-12 34 67
Product Details:
Product ID Product Name Quantity Unit Price
39 Chartreuse verte 6 14.4
54 Tourtière 15 5.9
  TotalPrice 174.9
Page 1' metadata={'source': '/content/CompanyDocuments/invoices/invoice_10445.pdf'}


page_content='Source: /content/CompanyDocuments/invoices/invoice_10742.pdf
Invoice
Order ID: 10742
Customer ID: BOTTM
Order Date: 2017-11-14
Customer Details:
Contact Name: Elizabeth Lincoln
Address: 23 Tsawassen Blvd.
City: Tsawassen
Postal Code: T2F 8M4
Country: Canada
Phone: (604) 555-4729
Fax: (604) 555-3745
Product Details:
Product ID Product Name Quantity Unit Price
3 Aniseed Syrup 20 10.0
60 Camembert Pierrot 50 34.0
72 Mozzarella di Giovanni 35 34.8
  TotalPrice 3118.0
Page 1' metadata={'source': '/content/CompanyDocuments/invoices/invoice_10742.pdf'}


page_content='Source: /content/CompanyDocuments/invoices/invoice_10938.pdf
Invoice
Order ID: 10938

In [12]:
for idx, document in enumerate(document_data):
    print("Processing document:", idx)
    vectorstore_from_documents = PineconeVectorStore.from_documents(
        [document],
        embeddings,
        index_name=index_name,
        namespace=namespace
    )


Processing document: 0
Processing document: 1
Processing document: 2
Processing document: 3
Processing document: 4
Processing document: 5
Processing document: 6
Processing document: 7
Processing document: 8
Processing document: 9
Processing document: 10
Processing document: 11
Processing document: 12
Processing document: 13
Processing document: 14
Processing document: 15
Processing document: 16
Processing document: 17
Processing document: 18
Processing document: 19
Processing document: 20
Processing document: 21
Processing document: 22
Processing document: 23
Processing document: 24
Processing document: 25
Processing document: 26
Processing document: 27
Processing document: 28
Processing document: 29
Processing document: 30
Processing document: 31
Processing document: 32
Processing document: 33
Processing document: 34
Processing document: 35
Processing document: 36
Processing document: 37
Processing document: 38
Processing document: 39
Processing document: 40
Processing document: 41
Pr
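
The loop above sends one document per call, which means one round trip per file. A possible alternative, sketched here under assumptions, is to upload the documents in small groups. `batched` is a hypothetical helper, the batch size of 3 is arbitrary, and the sample list of strings stands in for `document_data`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for `document_data`; real entries would be Document objects
sample_documents = [f"doc-{i}" for i in range(7)]

for batch_number, batch in enumerate(batched(sample_documents, batch_size=3)):
    print(f"Batch {batch_number}: {len(batch)} documents")
    # In the notebook, each batch could replace `[document]` in the call above:
    # PineconeVectorStore.from_documents(batch, embeddings,
    #                                    index_name=index_name, namespace=namespace)
```

Since `from_documents` already accepts a list, only the size of that list changes; the embedding model and index settings stay the same.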

# Initialize the Groq client

1. Get your Groq API Key [here](https://console.groq.com/keys)

2. Paste your Groq API Key into your Google Colab secrets, and make sure to enable permissions for it

![Screenshot 2024-11-25 at 12 00 16 AM](https://github.com/user-attachments/assets/e5525d29-bca6-4dbd-892b-cc770a6b281d)

In [13]:
groq_api_key = userdata.get("GROQ_API_KEY")
os.environ['GROQ_API_KEY'] = groq_api_key

groq_client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Perform RAG

In [14]:
# Initialize Pinecone
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))

# Connect to your Pinecone index
pinecone_index = pc.Index(index_name)

In [17]:
def perform_rag(query):
    # Embed the query with the same model that was used to index the documents
    raw_query_embedding = get_huggingface_embeddings(query)

    query_embedding = np.array(raw_query_embedding)

    # Retrieve the 10 most similar documents from the Pinecone namespace
    top_matches = pinecone_index.query(vector=query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    # Wrap the retrieved texts in a <CONTEXT> block, followed by the question
    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = """You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, inventory reports and postal codes.

    Answer any questions I have, based on the data provided. Always consider all parts of the context provided when forming a response.
    """

    res = groq_client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return res.choices[0].message.content
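
The prompt-assembly step inside `perform_rag` can be tried out without calling Pinecone or Groq. The sketch below is a hypothetical helper, `build_augmented_query`, that works on plain dictionaries shaped like the `matches` entries used above; whether the client's own response objects behave identically is an assumption. It also skips matches that have no `text` metadata instead of raising a `KeyError`. The sample scores are invented.

```python
def build_augmented_query(query, matches, max_contexts=10):
    """Assemble the <CONTEXT> block and question from Pinecone-style matches.

    Matches whose metadata has no "text" field are skipped rather than
    raising a KeyError.
    """
    contexts = [
        match["metadata"]["text"]
        for match in matches[:max_contexts]
        if match.get("metadata", {}).get("text")
    ]
    context_block = "\n\n-------\n\n".join(contexts)
    return f"<CONTEXT>\n{context_block}\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n{query}"


# Hand-written sample matches; the scores are invented for illustration
sample_matches = [
    {"score": 0.71, "metadata": {"text": "Invoice\nOrder ID: 10742\nContact Name: Elizabeth Lincoln"}},
    {"score": 0.65, "metadata": {}},  # no text stored, so it is skipped
    {"score": 0.60, "metadata": {"text": "Purchase Orders\nOrder ID Order Date Customer Name\n10871 2018-02-05 Laurence Lebihan"}},
]

print(build_augmented_query("Who is the contact for order 10742?", sample_matches))
```

Printing the assembled prompt like this is a quick way to check what the model actually sees before adjusting `top_k` or the system prompt.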

In [20]:
response = perform_rag("What is Fran Wilson's address?")

print(response)

Based on the provided data from invoices for order IDs 10867 and 10662, Fran Wilson's address is:

89 Chiaroscuro Rd.
City: Portland
Postal Code: 97219
Country: USA
