In [10]:
!pip install langchain_community arxiv pymupdf langchain_mistralai



In [11]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableMap
from langchain_mistralai import ChatMistralAI
from langchain_core.retrievers import BaseRetriever
from langchain_community.retrievers import ArxivRetriever
from langchain_core.messages import AIMessage, HumanMessage

In [12]:
from getpass import getpass
import os

# Prompt for the Mistral API key without echoing it in the notebook
MISTRAL_API_KEY = getpass()
os.environ["MISTRAL_API_KEY"] = MISTRAL_API_KEY



## No query generator

In [6]:
history = []

# Prompt template with added history
prompt = ChatPromptTemplate.from_template(
    """You are an assistant who answers scientific questions using data from an articles' database.
    This data will be given to you each time, and it is called context.
    Answer the user's question based only on this context provided.

Context: {context}

Conversation history (include recent exchanges):
{history}

User's current question: {question}"""
)

llm = ChatMistralAI(
    model="mistral-large-latest",
    temperature=0,
    max_retries=2,
    # other params...
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Build the chain: route each input field to the matching prompt variable
def build_chain(retriever):
    return (
        {
            # Retrieve with the question text only, then join the documents into one string
            "context": (lambda x: x["question"]) | retriever | format_docs,
            "question": lambda x: x["question"],  # Pass the question text only
            "history": lambda x: x["history"],    # Pass the formatted conversation history
        }
        | prompt  # Fill the prompt template with context, history, and question
        | llm     # Send the filled prompt to the LLM
        | StrOutputParser()  # Extract the response text from the LLM output
    )

# Example interaction
def handle_user_input(question: str, retriever: BaseRetriever):
    global history
    chain = build_chain(retriever)

    # Format the previous turns; the current question is passed separately
    user_input = f"User: {question}"
    formatted_history = "\n".join(history)
    response = chain.invoke({"question": question, "history": formatted_history})

    # Append both user input and response to history
    assistant_response = f"Assistant: {response}"
    history.append(user_input)
    history.append(assistant_response)

    return response

# Example use
retriever = ArxivRetriever(
    top_k_results=5,
    get_full_documents=True,
    doc_content_chars_max=10000000000
)
response = handle_user_input(
    "How does ImageBind model bind multiple modalities into a single embedding space? Tell me in detail",
    retriever
)
print(response)

The ImageBind model binds multiple modalities into a single embedding space by leveraging a shared representation space across six different modalities: image/video, text, audio, depth, thermal images, and IMU data. This is achieved through a process called contrastive learning, where the model learns to align the embeddings from different modalities to a common space.

The key steps in this process are:

1. **Image Encoding**: The model uses a Vision Transformer (ViT) to encode images into embeddings.
2. **Text Encoding**: Text is encoded using a text encoder that maps textual descriptions to embeddings.
3. **Audio Encoding**: Audio is converted into mel-spectrograms and then encoded using a Vision Transformer, treating the spectrogram as a 2D signal.
4. **Depth and Thermal Encoding**: Depth and thermal images are treated as single-channel images and encoded similarly to RGB images.
5. **IMU Encoding**: IMU data is encoded using a simple feedforward network.
6. **Contrastive Learning*

In [7]:
response = handle_user_input(
    "Nice, can you tell me more about the third step? And please also write source article of your answer this time.",
    retriever
)
print(response)

The third step in the process of binding multiple modalities into a single embedding space involves contrastive learning. This step is crucial for aligning the embeddings from different modalities into a common space. Here's how it works:

1. **Contrastive Learning**: The model is trained to minimize the distance between embeddings of paired data from different modalities while maximizing the distance between embeddings of unrelated data. This is achieved using a contrastive loss function, which encourages the embeddings of semantically similar data from different modalities to be close to each other in the embedding space, while pushing the embeddings of dissimilar data apart.

2. **Alignment**: By aligning these embeddings into a common space, the model can perform tasks such as cross-modal retrieval and zero-shot classification, where it can recognize and categorize data from modalities it wasn't explicitly trained on. This alignment allows the model to understand and generate respo

In [8]:
response = handle_user_input(
    "Do you remember your first answer? Please write the third step from your first answer",
    retriever
)
print(response)

The third step from the first answer is:

**Audio Encoding**: Audio is converted into mel-spectrograms and then encoded using a Vision Transformer, treating the spectrogram as a 2D signal.


In [9]:
response = handle_user_input(
    "Great. Now tell me more about this step",
    retriever
)
print(response)

The third step, **Audio Encoding**, involves converting audio signals into mel-spectrograms. A mel-spectrogram is a visual representation of the spectrum of frequencies in a sound signal as they vary with time. This conversion allows the audio data to be treated as a 2D signal, similar to an image.

Once the audio is converted into a mel-spectrogram, it is then encoded using a Vision Transformer (ViT). The Vision Transformer treats the spectrogram as a 2D image, enabling the model to process audio data in a manner similar to how it processes image data. This approach leverages the strengths of Vision Transformers in handling spatial data, making it effective for encoding audio information into embeddings that can be aligned with embeddings from other modalities in a common space.

For more detailed information, you can refer to the following sources:

1. **Source Article**: The source article for this information is "ImageBind: Binding Images and Language with a Single Embedding Space"
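
In the cells above, `history` grows by two lines per turn and is joined into the prompt in full, even though the template asks for recent exchanges only. A minimal sketch of one way to cap it, assuming a hypothetical helper `format_recent_history` and a hypothetical `max_turns` parameter (neither is part of the notebook or of LangChain):

```python
def format_recent_history(history, max_turns=3):
    """Join only the last `max_turns` exchanges (two lines per exchange)."""
    if max_turns <= 0:
        # Guard: history[-0:] would return the whole list, not an empty one
        return ""
    recent = history[-2 * max_turns:]
    return "\n".join(recent)


# Example history in the same "User: ..." / "Assistant: ..." format used above
example_history = [
    "User: q1", "Assistant: a1",
    "User: q2", "Assistant: a2",
    "User: q3", "Assistant: a3",
]
print(format_recent_history(example_history, max_turns=2))
```

Inside `handle_user_input`, the line that builds `formatted_history` could call this helper instead of joining the full list. The trade-off is that older turns, such as the first answer, would no longer be visible to the model.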

## With query generator

In [25]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

In [92]:
contextualize_q_system_prompt = (
    "Given a chat history and the latest user's question "
    "which might reference context in the chat history, "
    "formulate a question based on the user's question and "
    "the chat history. Do NOT answer the question, "
    "just reformulate it if needed. Otherwise return it as is."
)

llm = ChatMistralAI(
    model="mistral-large-latest",
    temperature=0,
    max_retries=2,
    # other params...
)

retriever = ArxivRetriever(
    top_k_results=3,
    get_full_documents=True,
    doc_content_chars_max=10000000
)

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)
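
The history-aware retriever is understood to behave roughly as follows: with an empty `chat_history` the raw `input` goes straight to the retriever, otherwise the LLM first rewrites the question using the history and the rewritten query is retrieved. The sketch below mimics only that branching in plain Python. `history_aware_retrieve`, `fake_rewrite`, and `fake_search` are hypothetical stand-ins, not LangChain functions.

```python
def history_aware_retrieve(inputs, rewrite_query, search):
    """Rewrite the question only when there is chat history, then search."""
    if not inputs.get("chat_history"):
        query = inputs["input"]        # first turn: use the question as is
    else:
        query = rewrite_query(inputs)  # later turns: reformulate the question first
    return search(query)


# Hypothetical stand-ins so the sketch runs without any API calls
def fake_rewrite(inputs):
    return f"{inputs['input']} [rewritten with {len(inputs['chat_history'])} messages]"


def fake_search(query):
    return [f"doc for: {query}"]


first = history_aware_retrieve(
    {"input": "What is an ImageBind model?", "chat_history": []},
    fake_rewrite, fake_search,
)
follow_up = history_aware_retrieve(
    {"input": "For which applications can it be used?", "chat_history": ["q", "a"]},
    fake_rewrite, fake_search,
)
print(first)
print(follow_up)
```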

In [93]:
system_prompt = (
    "You are an assistant who answers scientific questions "
    "using data from an articles' database. "
    "This data will be given to you each time, "
    "and it is called context. "
    "Keep your answers concise. Answer the user's question based only on the provided context. "
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

# Use qa_prompt (defined above), not the earlier `prompt`, which expects different variables
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
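
The dictionary returned by the retrieval chain is expected to carry the retrieved documents under `context` next to `answer`, so source titles can be read from document metadata rather than requested from the model. This is a sketch under assumptions: `list_sources` and the `Doc` stand-in are hypothetical, and the `"Title"` metadata key is assumed to be what the arXiv retriever sets.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def list_sources(result):
    """Collect unique titles from the documents under result["context"]."""
    titles = []
    for doc in result.get("context", []):
        title = doc.metadata.get("Title", "<untitled>")  # "Title" key is an assumption
        if title not in titles:
            titles.append(title)
    return titles


# Fake chain result with placeholder titles, for illustration only
fake_result = {
    "answer": "...",
    "context": [
        Doc("text a", {"Title": "Paper A"}),
        Doc("text b", {"Title": "Paper B"}),
        Doc("text c", {"Title": "Paper A"}),
        Doc("text d"),
    ],
}
print(list_sources(fake_result))
```

With the real chain, the same function would be called on the value returned by `rag_chain.invoke(...)`.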

In [94]:
chat_history = []

question = "What is an ImageBind model?"
ai_msg_1 = rag_chain.invoke({"input": question, "chat_history": chat_history})
chat_history.extend(
    [
        HumanMessage(content=question),
        AIMessage(content=ai_msg_1["answer"]),
    ]
)
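
The invoke-then-extend pattern used in this section can be wrapped in one function, which avoids recording the wrong question in the history by accident. A minimal sketch, assuming a hypothetical helper `ask` and a hypothetical `EchoChain` stand-in for `rag_chain`. It stores messages as `(role, content)` tuples, which LangChain message placeholders are believed to accept; `HumanMessage` and `AIMessage` can be used instead, as in the notebook.

```python
def ask(chain, question, chat_history):
    """Invoke the chain, then record the question that was actually asked."""
    result = chain.invoke({"input": question, "chat_history": chat_history})
    chat_history.extend([("human", question), ("ai", result["answer"])])
    return result["answer"]


class EchoChain:
    """Hypothetical stand-in for rag_chain so the sketch runs offline."""

    def invoke(self, inputs):
        seen = len(inputs["chat_history"])
        return {"answer": f"echo: {inputs['input']} ({seen} prior messages)"}


chat = []
print(ask(EchoChain(), "first", chat))
print(ask(EchoChain(), "second", chat))
```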

In [95]:
ai_msg_1['answer']

'An ImageBind model is a type of machine learning model designed to align and bind various modalities (such as images, text, audio, depth, thermal, and IMU data) into a single joint embedding space. This alignment allows the model to perform cross-modal retrieval and zero-shot recognition tasks across different modalities. The model is trained using large-scale web data and specific paired datasets for each modality, enabling it to learn a shared representation space where embeddings from different modalities can be directly compared and utilized for various applications.'

In [96]:
second_question = "For which applications can it be used?"
ai_msg_2 = rag_chain.invoke({"input": second_question, "chat_history": chat_history})

chat_history.extend(
    [
        HumanMessage(content=second_question),
        AIMessage(content=ai_msg_2["answer"]),
    ]
)

In [97]:
print(ai_msg_2["answer"])

IMAGEBIND is a method that aligns multiple modalities (images, text, audio, depth, thermal, and IMU) into a single embedding space. This alignment allows for novel multimodal capabilities, such as cross-modal retrieval, zero-shot recognition tasks across modalities, and compositional tasks. It can be applied in various domains where multimodal data is prevalent, such as robotics, autonomous driving, and augmented reality.


In [98]:
third_question = "Which AI models can be used in robotics?"
ai_msg_3 = rag_chain.invoke({"input": third_question, "chat_history": chat_history})

chat_history.extend(
    [
        HumanMessage(content=third_question),
        AIMessage(content=ai_msg_3["answer"]),
    ]
)

In [99]:
print(ai_msg_3["answer"])

Based on the provided context, several AI models can be used in robotics:

- **Reinforcement Learning (RL) Models**: These models can help robots learn from their environment and improve their performance over time. Examples include Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC).

- **Imitation Learning Models**: These models allow robots to learn from demonstrations provided by humans or other agents. Examples include Behavioral Cloning and Generative Adversarial Imitation Learning (GAIL).

- **Supervised Learning Models**: These models can be used for various tasks in robotics, such as object recognition and grasping. Examples include Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

- **Model-Based Reinforcement Learning**: These models use a learned or given model of the environment to plan and make decisions. Examples include Model-Predictive Control (MPC) and MuZero.

- **Multi-Agent Reinforcement Learning (MARL)**: 

In [102]:
fourth_question = "Tell me more about RL models' use in robotics"
ai_msg_4 = rag_chain.invoke({"input": fourth_question, "chat_history": chat_history})

chat_history.extend(
    [
        HumanMessage(content=fourth_question),
        AIMessage(content=ai_msg_4["answer"]),
    ]
)

In [103]:
print(ai_msg_4["answer"])

Robotics is a field where Reinforcement Learning (RL) models have found significant applications due to their ability to learn and optimize behaviors through interaction with an environment. Here are some key points about the use of RL models in robotics:

1. **Learning Complex Tasks**:
   - RL models are particularly useful in robotics for learning complex tasks that are difficult to program explicitly. Robots can learn to perform tasks such as navigation, manipulation, and locomotion by interacting with their environment and receiving rewards or penalties based on their actions.

2. **Adaptability**:
   - RL allows robots to adapt to changing environments and tasks. By continuously learning from their experiences, robots can improve their performance over time and handle variations and uncertainties in their environment.

3. **Policy Optimization**:
   - RL models optimize the policy (the strategy or behavior) that the robot uses to make decisions. This optimization is done to maximi