## Description

We'll build an **AI fashion assistant**: we index a fashion shop dataset from HuggingFace, then set up a RAG chain that processes user queries and generates responses.

## Retrieval-Augmented Generation (RAG)

RAG is a technique that enhances the knowledge of language models by integrating additional data.

A RAG application has two main components:

### 1. Indexing:

Ingest and index data from a specific source, typically done offline.


### 2. Retrieval and Generation:

During runtime, process the user's query, retrieve relevant data, and generate a response.
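
The two phases can be sketched in plain Python. This is a toy illustration only: it ranks documents by word overlap instead of real embeddings, and every name in it (`build_index`, `retrieve`, `build_prompt`, the sample FAQ strings) is hypothetical rather than part of this notebook.

```python
# Toy illustration of the two RAG phases. Word overlap stands in for
# embeddings, and all names and data here are made up for the example.

def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())

def build_index(documents):
    """Indexing phase: precompute a searchable representation per document."""
    return [(doc, tokenize(doc)) for doc in documents]

def retrieve(index, query, k=1):
    """Retrieval: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(index, key=lambda item: len(item[1] & query_words), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(context_docs, query):
    """Generation input: combine the retrieved context with the user's question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

faq = [
    "A navy suit works well for a formal wedding dinner",
    "Linen shirts are comfortable for a beach holiday",
]
user_query = "what to wear for a wedding dinner"

index = build_index(faq)                  # done once, offline
top_docs = retrieve(index, user_query)    # done per query, at runtime
prompt = build_prompt(top_docs, user_query)
print(prompt)
```

In the real application the prompt is sent to a language model, which produces the final answer.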



## Prerequisites

### 1. OpenAI API Key
We use an OpenAI embedding model to generate embeddings and an OpenAI chat model to generate responses.

### 2. MongoDB Atlas connection string
We store the embeddings in MongoDB Atlas, which serves as our vector store.

### 3. LangChain API key
We pull the RAG prompt (`rlm/rag-prompt`) from LangChain Hub and enable LangChain tracing.

## Library dependencies and configuration setup

### Install project dependencies
Run the cell below to install the required dependencies.

In [None]:
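# Quote the version specifier so the shell does not treat ">" as a redirect.
# argparse is part of the Python standard library, so it is not installed here.
!pip install "langchain>=0.0.231" langchain-openai langchain-community \
langchain-mongodb pymongo langsmith \
openai tiktoken datasets pandas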

### Credential setup, MongoDB and embedding model config

First, add these credentials to your Google Colab's Secrets:
- OPENAI_API_KEY
- LANGCHAIN_API_KEY
- MONGODB_CONN_STRING

In [None]:
import os
from google.colab import userdata

# LangChain libraries read these environment variables
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = userdata.get('LANGCHAIN_API_KEY')

# MongoDB connection string, database, and collection
mongodb_conn_string = userdata.get('MONGODB_CONN_STRING')
db_name = "fashion_shop_faq"
collection_name = "faq_assistant"

# OpenAI model used for embeddings
ai_model = "text-embedding-3-small"

# embedding vector dimension
vector_dimension = 512

## 1. Indexing Phase

In this phase, we load the fashion dataset from HuggingFace, generate embeddings for it, and store them in MongoDB Atlas.

### Connect to MongoDB Atlas

In [None]:
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient
from datasets import load_dataset
import pandas as pd
import tiktoken

# Connect to MongoDB Atlas
client = MongoClient(mongodb_conn_string)
db = client[db_name]
collection = db[collection_name]

print(client)

### Set up OpenAI Embedding model
Use OpenAI's `text-embedding-3-small` as the embedding model, and set the vector dimension.

This model generates the embeddings for the dataset.

In [None]:
# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(model=ai_model, dimensions=vector_dimension)

print(embeddings)
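
`text-embedding-3-small` returns 1536-dimensional vectors by default, and the `dimensions` argument asks the API for a shorter vector. OpenAI's embeddings guide describes manual shortening as truncation followed by L2 normalization. The sketch below shows that arithmetic on a made-up vector, assuming this is a fair mental model; it is not a claim about how the API computes its output internally. `shorten_embedding` is a hypothetical helper.

```python
import math

def shorten_embedding(vector, target_dim):
    """Keep the first target_dim values and rescale the result to unit length."""
    truncated = vector[:target_dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated  # avoid dividing by zero for an all-zero vector
    return [x / norm for x in truncated]

full = [3.0, 4.0, 12.0]             # made-up 3-dimensional "embedding"
short = shorten_embedding(full, 2)  # keep 2 dimensions, then renormalize
print(short)
```

In practice, passing `dimensions` when creating the embedding (as this notebook does) avoids the manual step.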

### Load the dataset from HuggingFace, and prepare it for embedding generation
The dataset is loaded from HuggingFace and converted into a pandas DataFrame.

Then the "Question" and "Answer" fields are combined into a single field for embedding generation.

In [None]:
# Load dataset
dataset = load_dataset("Quangnguyen711/Fashion_Shop_Consultant", split="train")

# Convert dataset to a pandas DataFrame
df = pd.DataFrame(dataset)

print(df.head(5))

# Only keep records where the Question and Answer fields are not null
df = df[df["Question"].notna() & df["Answer"].notna()]

# Combine Question and Answer fields into a single text field
# axis=1 applies the function to each row
df["text"] = df.apply(lambda row: f"[Question]{row['Question']}[Answer]{row['Answer']}", axis=1)

# Convert the combined text column to a list of strings
texts = df["text"].tolist()

### Generate embeddings and store them in the vector store
Generate embeddings in batches, and store each one together with the original `question + answer` string in MongoDB Atlas.

`tiktoken` is used to count the tokens consumed when generating the embeddings.

In [None]:
# Initialize the tokenizer for the specific model
tokenizer = tiktoken.encoding_for_model(ai_model)

# Initialize a variable to keep track of the total tokens used
total_tokens_used = 0

# Define a reasonable batch size
batch_size = 50

# Process the dataset in batches
for i in range(0, len(texts), batch_size):
    batch_texts = texts[i:i + batch_size]

    # Calculate total tokens used for the current batch
    batch_tokens_used = sum(len(tokenizer.encode(text)) for text in batch_texts)
    total_tokens_used += batch_tokens_used

    # Generate embeddings for the current batch
    embeddings_list = embeddings.embed_documents(batch_texts)

    # Prepare documents with embeddings to insert into MongoDB.
    # batch_texts and embeddings_list are in the same order, so zip pairs them up.
    documents = [
        {"text": text, "embedding": embedding}
        for text, embedding in zip(batch_texts, embeddings_list)
    ]

    # Insert the batch of documents into MongoDB
    collection.insert_many(documents)

    # Print total tokens used in the current batch
    print(f"Processed and inserted batch {i // batch_size + 1}, tokens used : {batch_tokens_used}")

print(f"Embeddings generated and stored in MongoDB! Total tokens used: {total_tokens_used}")
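
As a small illustration of the batching step, the sketch below shows how stepping through a list with `range(0, len(items), batch_size)` produces consecutive slices, with a shorter final batch when the list does not divide evenly. `batched` and `sample_texts` are hypothetical names used only for this example.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sample_texts = [f"text {n}" for n in range(7)]  # 7 made-up items
batch_lengths = [len(batch) for batch in batched(sample_texts, 3)]
print(batch_lengths)  # [3, 3, 1]
```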

### Close MongoDB connection

In [None]:
# Close the MongoDB connection
client.close()

## 2. Retrieval and Generation Phase

Here we set up a **fashion assistant** that processes the user's query.

The retriever fetches relevant data from MongoDB Atlas based on the user's query. First, we need to create a vector search index for our collection in MongoDB, as follows:

**MongoDB Atlas -> Atlas Vector Search -> JSON Editor:**

Select your database and collection, set the Index Name to "vector_index", and configure the JSON value as below. Note that `numDimensions` must match the `vector_dimension` (512) used when generating the embeddings:

```
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 512,
      "similarity": "cosine"
    }
  ]
}
```
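
As an alternative to the Atlas UI, the index could probably be created from Python. The sketch below is an assumption-laden outline, not a tested part of this notebook: it assumes a recent PyMongo release in which `SearchIndexModel` accepts `type="vectorSearch"`, and an Atlas cluster that supports vector search. `vector_index_definition` and `create_vector_index` are hypothetical helper names.

```python
def vector_index_definition(num_dimensions, path="embedding", similarity="cosine"):
    """Build the same index definition as the JSON entered in the Atlas editor."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }

def create_vector_index(collection, num_dimensions, name="vector_index"):
    """Create the index through PyMongo (assumes a recent PyMongo and Atlas)."""
    # Imported here so the definition helper above works without PyMongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition(num_dimensions),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model)

definition = vector_index_definition(512)
print(definition)
```

In this notebook the call would look like `create_vector_index(collection, vector_dimension)`, which keeps `numDimensions` tied to the value used for the embeddings.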



### Connect to MongoDB Atlas

In [None]:
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Connect to MongoDB Atlas
client = MongoClient(mongodb_conn_string)
db = client[db_name]
collection = db[collection_name]

print(client)

### Set up OpenAI Embedding model

This is used to generate the embedding for the user's query.

In [None]:
# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(model=ai_model, dimensions=vector_dimension)

print(embeddings)

### Set up MongoDB Atlas vector search as the retriever

By default, the retriever returns the 4 closest matches to the user's query from the vector store.

In [None]:
index_name = "vector_index"

# Initialize MongoDBAtlasVectorSearch with correct keys
vectorStore = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=embeddings,        # Your embedding model
    text_key="text",             # Field in MongoDB for the text you want to retrieve
    embedding_key="embedding",   # Field in MongoDB for the stored embeddings
    index_name=index_name,       # Name of Vector Index in MongoDB Atlas
    relevance_score_fn="cosine"  # Use cosine similarity
)

print(vectorStore)

retriever = vectorStore.as_retriever()
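
To make "closest matches" concrete, here is a brute-force sketch of cosine-similarity ranking in plain Python. It is only a conceptual model: Atlas runs its own vector search on the server, and the vectors and labels below are made up. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=4):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

stored_vectors = [
    ("formal wear", [1.0, 0.0]),
    ("beach wear", [0.0, 1.0]),
    ("smart casual", [1.0, 1.0]),
]
closest = top_k([1.0, 0.1], stored_vectors, k=2)
print(closest)
```

With LangChain retrievers, the number of results is usually changed through `search_kwargs`, for example `vectorStore.as_retriever(search_kwargs={"k": 2})`.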

### Enter your query to the fashion assistant

In [None]:
query = "outfit suggestion for wedding dinner" # @param {"type":"string","placeholder":"What to wear for dinner?"}


### RAG chain to process the user's query

In [None]:
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    print("\nRetriever - relevant docs:")
    print("--------------------------------------")

    for i, doc in enumerate(docs):
        print(f"{i+1}: {doc.page_content}")

    # Join the retrieved documents into one context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke(query)

print("\nRAG Chain Response:")
print("-------------------")
print(response)
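
The data flow through the chain can be read as ordinary function calls. The sketch below mirrors the four steps with stand-in functions so it runs without any API calls. It is a simplified mental model, not how LangChain implements runnables, and all names in it are hypothetical.

```python
def run_rag(question, retrieve, format_docs, build_prompt, generate):
    """Plain-Python view of the data flow through the chain."""
    # The dict step: "context" comes from retriever | format_docs,
    # while "question" is the input passed through unchanged.
    inputs = {"context": format_docs(retrieve(question)), "question": question}
    prompt_text = build_prompt(inputs)  # corresponds to `| prompt`
    raw_output = generate(prompt_text)  # corresponds to `| llm`
    return str(raw_output)              # corresponds to `| StrOutputParser()`

# Stand-in components so the sketch runs offline (all made up).
def fake_retrieve(question):
    return ["A navy suit suits a wedding dinner."]

def fake_format(docs):
    return "\n\n".join(docs)

def fake_prompt(inputs):
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}"

def fake_llm(prompt_text):
    return f"ANSWER BASED ON -> {prompt_text}"

answer = run_rag("What to wear?", fake_retrieve, fake_format, fake_prompt, fake_llm)
print(answer)
```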

### Close MongoDB connection

In [None]:
# Close the MongoDB connection
client.close()