![Img](https://app.theheadstarter.com/static/hs-logo-opengraph.png)

# Headstarter RAG Workshop

- Follow along with the [Google Doc here](https://docs.google.com/document/d/1RF-_JdPRMKL7JQgKa5R54L9LtNEofuPZJ1SX31d2Xik/edit?usp=sharing)

- **Skills: HuggingFace, LangChain, Pinecone**




### What is RAG anyway?


![withoutRAG](https://github.com/user-attachments/assets/649d6101-b63a-4750-997a-b6abc25e5609)

![withRAG](https://github.com/user-attachments/assets/e6dd9c46-0bf9-4c31-bd72-a27939ef82b8)

Retrieval-Augmented Generation (RAG) is a technique used in GenAI applications to improve the quality and accuracy of text generated by LLMs. It combines two key processes: retrieval and generation.

### Breaking It Down:
#### Retrieval:

- Before generating a response, the system first looks up relevant information from a large database or knowledge base. This is like searching through a library or the internet to find the most useful facts, articles, or data related to the question or topic.

#### Generation:

- Once the relevant information is retrieved, the system then uses it to help generate a response. This is where the model, like GPT, creates new text (answers, explanations, etc.) based on the retrieved information.
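
The two steps above can be sketched in a few lines of plain Python. This is a toy illustration only: it ranks documents by shared words instead of embeddings, the knowledge base is made up, and the helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical (they are not part of any library used in this workshop). The prompt it prints is what would be handed to an LLM in the generation step.

```python
import re

def tokenize(text):
    """Lowercase a string and return the set of alphanumeric words in it."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, knowledge_base, top_k=2):
    """Retrieval step: rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(doc)), doc) for doc in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the best matches that share at least one word with the question
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(question, retrieved_docs):
    """Input to the generation step: retrieved text placed in front of the question."""
    context = "\n".join(retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Made-up knowledge base, for illustration only
knowledge_base = [
    "Pinecone is a vector database.",
    "LangChain helps build LLM applications.",
    "Bananas are rich in potassium.",
]

question = "What is Pinecone?"
docs = retrieve(question, knowledge_base)
prompt = build_prompt(question, docs)
print(prompt)
```

The rest of this notebook follows the same shape, with word overlap replaced by embedding similarity in Pinecone and the final prompt sent to an LLM.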

# Install libraries

In [None]:
! pip install langchain langchain-community openai groq tiktoken pinecone-client langchain_pinecone unstructured pdfminer==20191125 pdfminer.six==20221105 pillow_heif unstructured_inference sentence-transformers

Collecting langchain-community
  Downloading langchain_community-0.3.8-py3-none-any.whl.metadata (2.9 kB)
Collecting groq
  Downloading groq-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting langchain_pinecone
  Downloading langchain_pinecone-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting unstructured
  Downloading unstructured-0.16.8-py3-none-any.whl.metadata (24 kB)
Collecting pdfminer==20191125
  Downloading pdfminer-20191125.tar.gz (4.2 MB)
  Preparing metadata (setup.py) ... done
Collecting pdfminer.six==20221105
  Downloading pdfminer.six-20221105-py3-none-any.whl.metadata (4.0 kB)
Collecting pillow_heif
  Downloading p

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, WebBaseLoader, YoutubeLoader, DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone
from openai import OpenAI
import numpy as np
import tiktoken
import os
from groq import Groq



# Initialize the HuggingFace Embeddings client

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
text = "Hello my name is Faizan"

query_result = embeddings.embed_query(text)

In [None]:
query_result

[0.046228375285863876,
 0.002608109498396516,
 -0.027609243988990784,
 -0.0006826698081567883,
 0.038012392818927765,
 0.023305563256144524,
 0.06972605735063553,
 0.013894548639655113,
 0.022783074527978897,
 0.024456676095724106,
 0.015527884475886822,
 -0.0898384377360344,
 0.06828375905752182,
 -0.016088610514998436,
 0.03347540646791458,
 -0.0521082803606987,
 0.03958836942911148,
 -0.05087178945541382,
 0.04236622899770737,
 0.009597674012184143,
 0.05166830122470856,
 0.008004860952496529,
 -0.0196862630546093,
 0.04176325350999832,
 -0.03037300705909729,
 -0.023506667464971542,
 -0.01719014160335064,
 -0.026379821822047234,
 0.031206166371703148,
 0.07272737473249435,
 0.039661552757024765,
 -0.015026912093162537,
 0.02082163468003273,
 0.012617732398211956,
 1.5921107205940643e-06,
 -0.027857964858412743,
 -0.0007204359280876815,
 -0.0074823894537985325,
 -0.026543758809566498,
 -0.035991035401821136,
 0.016994791105389595,
 0.021291909739375114,
 -0.03716849535703659,
 -0.003

# Calculating sentence similarity with embeddings

In [None]:
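_embedding_models = {}  # Cache loaded models so they are not re-loaded on every call

def get_huggingface_embeddings(text, model_name="sentence-transformers/all-mpnet-base-v2"):
    if model_name not in _embedding_models:
        _embedding_models[model_name] = SentenceTransformer(model_name)
    return _embedding_models[model_name].encode(text)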


def cosine_similarity_between_sentences(sentence1, sentence2):
    # Get embeddings for both sentences
    embedding1 = np.array(get_huggingface_embeddings(sentence1))
    embedding2 = np.array(get_huggingface_embeddings(sentence2))

    # Reshape embeddings for cosine_similarity function
    embedding1 = embedding1.reshape(1, -1)
    embedding2 = embedding2.reshape(1, -1)

    print("Embedding for Sentence 1:", embedding1)
    print("\nEmbedding for Sentence 2:", embedding2)

    # Calculate cosine similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity[0][0]


# Example usage
sentence1 = "I like walking to the park"
sentence2 = "I like running to the office"


similarity = cosine_similarity_between_sentences(sentence1, sentence2)
print(f"\n\nCosine similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")

Embedding for Sentence 1: [[-5.18317223e-02  5.11822924e-02  1.72791979e-03 -1.36199202e-02
  -1.06869487e-03  2.96393428e-02 -4.72495109e-02 -2.11421009e-02
   5.48422784e-02  2.37766728e-02 -8.88856407e-03  1.03983462e-01
   1.87567491e-02 -6.70846319e-03 -3.84319052e-02 -7.80754834e-02
  -5.44624683e-03  6.69373479e-03 -1.80737115e-02  3.50141115e-02
  -3.07590049e-02  3.44667286e-02 -5.48802782e-03 -2.29204204e-02
   9.91370343e-03 -1.50746480e-02  1.37100741e-02 -3.11791096e-02
   7.79691711e-02  3.52224708e-02 -1.94613449e-02 -1.78903583e-02
   2.13377569e-02 -1.85624994e-02  1.29274099e-06  7.14496849e-03
  -7.68434315e-04  1.04230363e-02  3.67814861e-02 -3.46986540e-02
   3.50453630e-02  1.30667230e-02  1.00722872e-02 -4.18642862e-03
   2.04598345e-02 -2.74207480e-02  3.01958937e-02  2.14188918e-02
  -6.43193796e-02  1.04757305e-02 -4.66440478e-03 -4.05048616e-02
  -5.80140166e-02  1.99005734e-02 -2.49033840e-03  8.85135308e-02
   6.04227521e-02  1.96583439e-02  5.06717786e-02 

In [None]:
print(len(get_huggingface_embeddings(sentence1)))

768
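
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it with the standard library on tiny hand-written vectors. The function name `cosine_similarity_manual` is hypothetical; the notebook itself uses scikit-learn's `cosine_similarity`, which computes the same quantity.

```python
import math

def cosine_similarity_manual(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity close to 1
same_direction = cosine_similarity_manual([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: similarity close to 0
orthogonal = cosine_similarity_manual([1.0, 0.0], [0.0, 1.0])
# Opposite direction: similarity close to -1
opposite = cosine_similarity_manual([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

The same calculation applies unchanged to the 768-dimensional sentence embeddings used in this notebook.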


# Load in the Data

Learn more about the dataset [here](https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset)

In [None]:
! kaggle datasets download -d ayoubcherguelaine/company-documents-dataset
! unzip company-documents-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
License(s): apache-2.0
Downloading company-documents-dataset.zip to /content
 86% 8.00M/9.34M [00:00<00:00, 72.9MB/s]
100% 9.34M/9.34M [00:00<00:00, 77.2MB/s]
Archive:  company-documents-dataset.zip
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_1.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_2.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_3.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_4.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_5.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Category/monthly-Category/StockReport_2016-07_6.pdf  
  inflating: CompanyDocuments/Inventory Report/monthly-Cat

In [None]:
def process_directory(directory_path):
    data = []
    for root, _, files in os.walk(directory_path):
        for file in files:

            file_path = os.path.join(root, file)
            print(f"Processing file: {file_path}")
            loader = PyPDFLoader(file_path)
            data.append({"File": file_path, "Data": loader.load()})

    return data

directory_path = "/content/CompanyDocuments"
documents = process_directory(directory_path)


Processing file: /content/CompanyDocuments/Shipping orders/order_10640.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10466.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10712.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10510.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10459.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10716.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_11066.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_11013.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10674.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10746.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10355.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10552.pdf
Processing file: /content/CompanyDocuments/Shipping orders/order_10727.pdf
Processing file: /content

# Setting up Pinecone
**1. Create an account on [Pinecone.io](https://app.pinecone.io/)**

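**2. Create a new index called "hs-rag-workshop" (the name used in the code below) and set the dimensions to 768, which matches the length of the embeddings produced by `all-mpnet-base-v2`. Leave the rest of the settings as they are.**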

![Screenshot 2024-11-28 at 12 01 30 AM](https://github.com/user-attachments/assets/548657af-ad75-4767-9bcf-41998e01a33e)


**3. Create an API Key for Pinecone**

![Screenshot 2024-11-24 at 10 44 37 PM](https://github.com/user-attachments/assets/e7feacc6-2bd1-472a-82e5-659f65624a88)


**4. Store your Pinecone API Key within Google Colab's secrets section, and then enable access to it (see the blue checkmark)**


![Screenshot 2024-11-24 at 10 45 25 PM](https://github.com/user-attachments/assets/eaf73083-0b5f-4d17-9e0c-eab84f91b0bc)
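
The index dimension has to match the length of the vectors you insert, which is why step 2 uses 768: that is the embedding length printed earlier for `all-mpnet-base-v2`. A minimal sketch of a guard you could run before inserting anything is shown below. The helper `check_dimension` and the constant `INDEX_DIMENSION` are hypothetical names, and 384 is just an arbitrary wrong size used for the demonstration.

```python
INDEX_DIMENSION = 768  # Assumed to equal the dimension chosen when creating the index

def check_dimension(vector, expected=INDEX_DIMENSION):
    """Raise a clear error if a vector's length differs from the index dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"Vector has {len(vector)} dimensions, but the index expects {expected}."
        )
    return vector

# A vector of the right size passes through unchanged
ok_vector = check_dimension([0.0] * 768)

# A vector of the wrong size is rejected before it reaches the index
try:
    check_dimension([0.0] * 384)
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)
```

In the notebook you could call this on `embeddings.embed_query("test")` once, right after creating the index.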




In [None]:
pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

index_name = "hs-rag-workshop"

namespace = "company-documents"

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings, namespace=namespace)

# Insert Data into Pinecone

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain_pinecone import PineconeVectorStore
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
import os

pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

index_name = "hs-rag-workshop"
namespace = "company-documents"

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings, namespace=namespace)





In [None]:
import multiprocessing

def process_directory(directory_path):
    """Loads every PDF under directory_path and returns a flat list of page Documents."""
    data = []
    for root, _, files in os.walk(directory_path):
        for file in files:
            file_path = os.path.join(root, file)
            loader = PyPDFLoader(file_path)
            loaded_documents = loader.load()  # Load documents using the loader
            for doc in loaded_documents:
                # Record which file each loaded document came from
                doc.metadata['source'] = file_path
                data.append(doc)

    return data

def add_documents_batch(batch):
    """Adds a batch of documents to the Pinecone index."""
    vectorstore.add_documents(batch)

def process_directory_parallel(directory_path, num_processes=None):
    """Processes a directory of PDF files and adds them to Pinecone in parallel."""
    data = process_directory(directory_path)

    if num_processes is None:
        num_processes = multiprocessing.cpu_count()  # Use all available cores

    # Use at least 1 so the range() step below is never 0
    batch_size = max(1, len(data) // num_processes)
    batches = [data[i : i + batch_size] for i in range(0, len(data), batch_size)]

    with multiprocessing.Pool(processes=num_processes) as pool:
        pool.map(add_documents_batch, batches)

    return data

# Usage
directory_path = "/content/CompanyDocuments"
documents = process_directory_parallel(directory_path)

# Batch processing (sequential alternative)
# Assuming 'vectorstore' is already defined
# BATCH_SIZE = 1  # Experiment with batch sizes like 16, 32, or 64
# for i in range(0, len(documents), BATCH_SIZE):
#     batch = documents[i : i + BATCH_SIZE]
#     print(f"Inserting batch {i // BATCH_SIZE + 1} of {len(documents) // BATCH_SIZE + 1}")
#     vectorstore.add_documents(batch)
#     print(f"Inserted batch {i // BATCH_SIZE + 1} of {len(documents) // BATCH_SIZE + 1}")

print(f"Added {len(documents)} documents to Pinecone")

Inserting batch 1 of 3355
Inserted batch 1 of 3355
Inserting batch 2 of 3355
Inserted batch 2 of 3355
Inserting batch 3 of 3355
Inserted batch 3 of 3355
Inserting batch 4 of 3355
Inserted batch 4 of 3355
Inserting batch 5 of 3355
Inserted batch 5 of 3355
Inserting batch 6 of 3355
Inserted batch 6 of 3355
Inserting batch 7 of 3355
Inserted batch 7 of 3355
Inserting batch 8 of 3355
Inserted batch 8 of 3355
Inserting batch 9 of 3355
Inserted batch 9 of 3355
Inserting batch 10 of 3355
Inserted batch 10 of 3355
Inserting batch 11 of 3355
Inserted batch 11 of 3355
Inserting batch 12 of 3355
Inserted batch 12 of 3355
Inserting batch 13 of 3355
Inserted batch 13 of 3355
Inserting batch 14 of 3355
Inserted batch 14 of 3355
Inserting batch 15 of 3355
Inserted batch 15 of 3355
Inserting batch 16 of 3355
Inserted batch 16 of 3355
Inserting batch 17 of 3355
Inserted batch 17 of 3355
Inserting batch 18 of 3355
Inserted batch 18 of 3355
Inserting batch 19 of 3355
Inserted batch 19 of 3355
Inserting b
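
The batch loop in this section reports the total as `len(documents) // BATCH_SIZE + 1`, which is one too many whenever the number of documents divides evenly by the batch size. A minimal sketch of the batching logic with a ceiling division is shown below. The helper `make_batches` is a hypothetical name, and the list of integers stands in for the loaded documents.

```python
import math

def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]

items = list(range(10))  # Stand-in for the list of documents
batch_size = 4

batches = make_batches(items, batch_size)
total = math.ceil(len(items) / batch_size)  # Correct count even when evenly divisible

for number, batch in enumerate(batches, start=1):
    # In the notebook, this is where vectorstore.add_documents(batch) would go
    print(f"Inserting batch {number} of {total} ({len(batch)} items)")
```

Larger batches mean fewer round trips to Pinecone, so a batch size of 1 is usually the slowest option.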

# Initialize the Groq client

1. Get your Groq API Key [here](https://console.groq.com/keys)

2. Paste your Groq API Key into your Google Colab secrets, and make sure to enable permissions for it

![Screenshot 2024-11-25 at 12 00 16 AM](https://github.com/user-attachments/assets/e5525d29-bca6-4dbd-892b-cc770a6b281d)

In [None]:
groq_api_key = userdata.get("GROQ_API_KEY")
os.environ['GROQ_API_KEY'] = groq_api_key

groq_client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Perform RAG

In [None]:
# Initialize Pinecone
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))

# Connect to your Pinecone index
pinecone_index = pc.Index(index_name)

In [None]:
def build_query(name):
    # Build the question to ask about a given customer.
    # Named build_query so the 'query' variable below does not overwrite the function.
    return (
        f"What are some items that {name} is likely to buy next? "
        "What incentives can I put in place to ensure he or she orders more?"
    )

In [None]:
name = "Pirkko Koskitalo"

In [None]:
query = build_query(name)
raw_query_embedding = get_huggingface_embeddings(query)

In [None]:
top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)

In [None]:
contexts = [item['metadata']['text'] for item in top_matches['matches']]

In [None]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
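
The string concatenation above is easier to reuse and test when wrapped in a small function. The sketch below reproduces the same layout: a `<CONTEXT>` block with chunks separated by dashed lines, followed by the question. The function name `build_augmented_query` and its `max_contexts` and `separator` parameters are hypothetical, and the sample chunks are placeholders rather than real retrieved text.

```python
def build_augmented_query(contexts, question, max_contexts=10, separator="\n\n-------\n\n"):
    """Wrap the retrieved chunks in a <CONTEXT> block and append the question."""
    selected = contexts[:max_contexts]  # Keep at most max_contexts chunks
    context_block = separator.join(selected)
    return (
        "<CONTEXT>\n"
        + context_block
        + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n"
        + question
    )

# Placeholder chunks, for illustration only
example = build_augmented_query(
    ["first chunk", "second chunk", "third chunk"],
    "What changed?",
    max_contexts=2,
)
print(example)
```

Lowering `max_contexts` is one way to shorten the prompt if the retrieved chunks are long.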

In [None]:
print(augmented_query)

<CONTEXT>
Unit Price: 12.0
Total: 540.0
--------------------------------------------------------------------------------------------------
Product: Zaanse koeken
Quantity: 10
Unit Price: 9.5
Total: 95.0
--------------------------------------------------------------------------------------------------
Product: Gnocchi di nonna Alice
Quantity: 45
Unit Price: 38.0
Total: 1710.0
--------------------------------------------------------------------------------------------------
Product: Camembert Pierrot
Quantity: 30
Unit Price: 34.0
Total: 1020.0
Total Price:
Total Price: 4371.6


-------

Purchase Orders
Order ID Order Date Customer Name
11077 2018-05-06 Paula Wilson
Products
Product ID: Product: Quantity: Unit Price:
2 Chang 24 19
3 Aniseed Syrup 4 10
4 Chef Anton's Cajun Seasoning 1 22
6 Grandma's Boysenberry Spread 1 25
7 Uncle Bob's Organic Dried Pears 1 30
8 Northwoods Cranberry Sauce 2 40
10 Ikura 1 31
12 Queso Manchego La Pastora 2 38
13 Konbu 4 6
14 Tofu 1 23.25
16 Pavlova 2 17.45


In [None]:
system_prompt = f"""You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.

Answer any questions I have, based on the data provided. Always consider all of the context provided when forming a response.
"""

llm_response = groq_client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": augmented_query}
    ]
)

response = llm_response.choices[0].message.content

# Putting it all together

In [None]:
# To predict probability of customer buying product based on all the sales history we have

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def predict_purchase_probability(sales_history_filepath):
    """
    Predicts the probability of a customer buying a product based on sales history.

    Args:
        sales_history_filepath: Path to the CSV file containing sales history data.
                                 Assumed format: CustomerID, ProductID, Purchase (0 or 1)

    Returns:
        A trained Logistic Regression model, or None if there's an error.
    """

    try:
        # Load the sales data into a Pandas DataFrame
        sales_data = pd.read_csv(sales_history_filepath)

        # Prepare the data for model training
        X = sales_data[['CustomerID', 'ProductID']]  # Features (customer and product IDs)
        y = sales_data['Purchase']  # Target variable (purchase or not)

        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Initialize and train the logistic regression model
        model = LogisticRegression(max_iter=1000)  # Increased max_iter for convergence
        model.fit(X_train, y_train)

        # Make predictions on the test set
        y_pred = model.predict(X_test)

        # Evaluate the model's accuracy
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Model Accuracy: {accuracy}")

        return model

    except FileNotFoundError:
        print(f"Error: Sales history file not found at '{sales_history_filepath}'")
        return None
    except KeyError as e:
        print(f"Error: Missing column in the dataset: {e}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

# Example usage (for sales data in 'sales_history.csv')
# sales_history_filepath = 'sales_history.csv'
# trained_model = predict_purchase_probability(sales_history_filepath)

# If the model trained successfully, you can use it to predict probabilities for new customers/products:

# if trained_model:
#   new_customer = pd.DataFrame({'CustomerID': [101], 'ProductID': [2]})
#   probability = trained_model.predict_proba(new_customer)[:, 1]
#   print(f"Probability of purchase for new customer: {probability[0]}")

In [None]:
def perform_rag(query):
    raw_query_embedding = get_huggingface_embeddings(query)

    query_embedding = np.array(raw_query_embedding)

    top_matches = pinecone_index.query(vector=query_embedding.tolist(), top_k=10, include_metadata=True, namespace=namespace)

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = f"""You are an expert at understanding and analyzing company data - particularly shipping orders, purchase orders, invoices, and inventory reports.

    Answer any questions I have, based on the data provided. Always consider all parts of the context provided when forming a response.
    """

    res = groq_client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return res.choices[0].message.content

In [None]:

response = perform_rag(f"What are some trends with {name} purchase orders?")

print(response)

Based on the provided context, several trends can be observed in Pirkko Koskitalo's purchase orders:

1. **Frequency of orders**: Pirkko Koskitalo has placed a total of 7 purchase orders within a period of about 1.5 years (from 2016-10-08 to 2018-04-15). This indicates a relatively frequent ordering pattern.

2. **Product diversity**: Pirkko Koskitalo has ordered a diverse range of products across multiple orders. This suggests that she may be looking to supply a variety of items to her customers.

3. **Recurring product purchases**: Some products, such as Gnocchi di nonna Alice (Product ID: 56), appear in multiple orders (10526, 10781, and no direct purchases in other customer orders but there are other customer orders with this product). However, orders for this product are spaced out over several months.

4. **Product ID: 1 (Chai) and Product ID: 13 (Konbu)**: These products have been ordered by Pirkko Koskitalo in two separate instances (10526 and 11025) indicating possible recurri