In [46]:
from langchain.chains.router import MultiPromptChain
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.llms import Ollama

In [47]:
# LLM
llm = Ollama(model='phi3')

### Query Routing

In [48]:
# Define specialized prompts for different types of queries
cv_prompt = PromptTemplate(
    template="You are an expert in Computer Vision. Answer the query: {input}",
    input_variables=["input"]
)
sa_prompt = PromptTemplate(
    template="You are an expert in System Architecture. Answer the query: {input}",
    input_variables=["input"]
)
empty_prompt = PromptTemplate(
    template="You are a helpful AI Assistant. Answer the query: {input}",
    input_variables=["input"]
)

In [49]:
# Create LLMChains for each specialization
cv_chain = LLMChain(llm=llm, prompt=cv_prompt)
sa_chain = LLMChain(llm=llm, prompt=sa_prompt)
no_chain = LLMChain(llm=llm, prompt=empty_prompt)

In [50]:
# Define a router prompt that asks the LLM to classify the query
router_prompt = PromptTemplate(
    template=(
        "Classify the query as 'Computer Vision', 'System Architecture', or 'Other'. "
        "Reply with the label only.\nQuery: {input}"
    ),
    input_variables=["input"]
)
router_chain = LLMChain(llm=llm, prompt=router_prompt)

In [51]:
def route_query(query):
    """Classify the query with router_chain, then dispatch to the matching chain."""
    # Lowercase the result so 'computer vision' and 'Computer Vision' both match
    classification_result = router_chain.run(query).lower()

    # Route based on classification; fall back to the general-purpose chain
    if "computer vision" in classification_result:
        return cv_chain.run(query)
    elif "system architecture" in classification_result:
        return sa_chain.run(query)
    else:
        return no_chain.run(query)
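
Substring matching on free-form LLM output can misroute when the model mentions more than one label in its reply. A minimal sketch of a stricter parsing step follows; `parse_route` and the route keys `"cv"`, `"sa"`, and `"default"` are hypothetical names, not part of LangChain, and picking the first-mentioned label is only a heuristic.

```python
def parse_route(classification_result: str) -> str:
    """Map free-form classifier output to a route key (hypothetical helper)."""
    text = classification_result.lower()
    labels = {"cv": "computer vision", "sa": "system architecture"}

    # Index of each label's first occurrence; -1 means the label is absent
    positions = {key: text.find(label) for key, label in labels.items()}
    found = {key: pos for key, pos in positions.items() if pos != -1}

    if not found:
        return "default"
    # If several labels appear, pick the one mentioned first (a heuristic)
    return min(found, key=found.get)
```

The returned key could then index a dictionary of chains, for example `{"cv": cv_chain, "sa": sa_chain, "default": no_chain}`, instead of an `if`/`elif` ladder.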

In [64]:
query = "What is CLIP?"
basic_response = no_chain.run(query)
routed_response = route_query(query)


In [65]:
basic_response

" CLIP (Contrastive Language–Image Pre-Training) is an artificial intelligence model developed by researchers at Salesforce Labs, which has been shown to effectively understand and interpret both visuals and natural language simultaneously without separate image or text representations. It's a multi-modal pre-trained transformer that can perform tasks like zero-shot classification of images based on the descriptions provided in English sentences with remarkable accuracy.\n\nThe key features and abilities of CLIP include:\n1. Multi-task learning capability, where it simultaneously learns to recognize visual objects from a large set (ImageNet) and understand textual content describing these object categories or attributes using contrastive language-image pre-training technique. This is done by mapping both images and their corresponding descriptions into the same space of latent vectors shared between them in an unsupervised manner, enabling CLIP to learn high-quality image representatio

In [66]:
routed_response

" Cut (Contrastive Language-Image Pre-training) was introduced by researchers from Salesforce Labs and Stanford University as a novel approach to understanding both visual content and textual descriptions simultaneously within a unified framework, referred to as 'CLIP' or Contrastive Language-Image Pre-training. The primary aim is facilitating zero-shot learning where the model can make accurate predictions about images even without any prior training on such examples through their understanding of related visual and textual content.\n\nAt its core, CLIP incorporates a diverse dataset comprising millions of image/text pairs sourced from various internet websites to learn correlations between different types of imagery across multiple domains (e.g., landscapes, animals) alongside the context that describes them in natural language sentences or captions provided by humans for those images on web platforms like Flickr and Wikipedia. CLIP employs a contrastive loss function called 'MixMatc

### Query Rewriting

In [31]:
# Define a query rewriting prompt
rewrite_prompt = PromptTemplate(
    template="Rewrite the query to include references at the end: {input}",
    input_variables=["input"]
)

query_rewriter = LLMChain(llm=llm, prompt=rewrite_prompt)

In [32]:
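# The rewriter is applied to routed_response (the generated answer), not to the original query string
rewritten_query = query_rewriter.run(routed_response)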
print(rewritten_query)

 The revised query to include references at the end could be:

CLIP, which stands for Contrastive Language-Image Pre-training (also known as Cross-lingual and Multimodal pre-trained models), is a versatile neural network model developed by researchers at DeepMind with significant contributions from OpenAI. Drawing inspiration from vision networks like VGG or ResNets, alongside transformer architectures such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), CLIP uniquely combines visual perception with natural language understanding for multimodal tasks involving image-text pairs. It pretrains a vision encoder to understand images in conjunction with the textual context of those same visuals, using millions of paired examples from ImageNet and LSUN RGB datasets along with their corresponding captions (in different languages). This robust representation learning allows CLIP to achieve zero-shot classification tasks without additional fine-tuning or specialized training for un

### Agentic Experimentation

In [33]:
from langchain.agents import initialize_agent, Tool

In [36]:
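def echo_rewrite(input_text):
    # Placeholder tool: it only prefixes its input, so any actual rewriting comes from the agent's LLM
    return f"Rewritten query: {input_text}"

# Use a distinct name for the function so the Tool object does not shadow it
rewrite_tool = Tool(name="QueryRewriter", func=echo_rewrite, description="Rewrite the query to include references at the end")
agent = initialize_agent([rewrite_tool], llm, agent="zero-shot-react-description")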
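
The tool function above only prefixes its input, so the tool itself performs no rewriting. One way to make the tool do real work is to have it delegate to a text-to-text callable. The sketch below assumes this design; `make_rewrite_func` and `fake_rewriter` are hypothetical names, and the fake rewriter stands in for an LLM call so the example runs without a model.

```python
from typing import Callable


def make_rewrite_func(rewriter: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any text-to-text callable so it can serve as a tool function (hypothetical helper)."""
    def rewrite(input_text: str) -> str:
        # Delegate the rewriting and trim whitespace around the result
        return rewriter(input_text).strip()
    return rewrite


# Stand-in for an LLM-backed rewriter (assumption: a real one would call a model)
def fake_rewriter(text: str) -> str:
    return f" {text} Please cite references at the end. "


rewrite_func = make_rewrite_func(fake_rewriter)
print(rewrite_func("What is CLIP?"))
```

In this notebook, `query_rewriter.run` could presumably be passed in place of `fake_rewriter`, with the resulting function supplied as the `func` argument of `Tool`; that combination is untested here.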


In [37]:
agent_rewritten_query = agent.run(routed_response)
print(agent_rewritten_query)

Contrastive Language-Image Pre-training (CLIP) is an advanced multimodal AI model created with the purpose of unifying visual perception, similar to that found in models like VGG networks or ResNets, and natural language understanding inspired by GPT and BERT. This integration allows CLIP not only to understand images but also their corresponding textual contexts effectively. 

During its pretraining process on millions of image-text pairs from ImageNet and LSUN RGB datasets as well as a diverse array of languages, it learns robust representations that are cross-lingual—meaning CLIP can work with various language inputs related to the images without being restricted or biased towards any specific one.

CLIP's core components include: 
1. Vision Encoder - It employs ResNet as its backbone and processes raw pixel values into more abstract, universal visual features via CNN-based techniques (or variants like ViT). This encoder allows CLIP to grasp the essence of images without being depen

### Query Expansion

In [38]:
expansion_prompt = PromptTemplate(
    template="Expand the following query by adding similar models: {input}",
    input_variables=["input"]
)

query_expander = LLMChain(llm=llm, prompt=expansion_prompt)

In [39]:
expanded_query = query_expander.run(routed_response)
print(expanded_query)

 CLIP can be extended to incorporate similar models that have their unique features, adaptability, and architectural variations in handling multimodal data:

1. **ViT (Vision Transformer):** This model uses transformers for vision processing rather than CNNs like ResNets or VGG networks used by CLIP's encoder. ViT treats images as sequences of patches and processes them using self-attention mechanisms similar to those in NLP, allowing it to capture global dependencies between image regions effectively.

2. **SimCLR (Self-Supervised Learning Framework):** While not a single model like CLIP or transformer models used for language processing, SimCLR is another approach that learns visual representations by maximizing agreement with augmented versions of the same images while simultaneously minimizing disagreement among different examples in an unsupervised manner.

3. **LXMERT (Language-Image Crossmodal Transformer):** LXMERT combines convolutional neural networks for processing image inp

### Custom Experimentation

In [76]:
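def custom_query_expansion(query):
    """Append related terms from a hand-written map; return the query unchanged if none match."""
    expansion_map = {
        "CV": ["Object Detection", "Object Classification", "Image Segmentation"]
    }
    terms = expansion_map.get(query, [])
    if not terms:
        # Avoid returning the query with a trailing space when there is nothing to add
        return query
    return query + " " + ", ".join(terms)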
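
The lookup above matches only when the whole query equals a map key exactly, so a query such as "What is cv?" would not be expanded. A minimal sketch of a token-level, case-insensitive variant follows; `EXPANSION_MAP` and `expand_query_tokens` are hypothetical names, and the parenthesized output format is an arbitrary choice.

```python
EXPANSION_MAP = {
    "cv": ["Object Detection", "Object Classification", "Image Segmentation"],
}


def expand_query_tokens(query: str) -> str:
    """Expand every known abbreviation found among the query's whitespace-separated tokens."""
    extra_terms = []
    for token in query.split():
        # Strip common trailing punctuation and ignore case before the lookup
        key = token.strip("?.,!").lower()
        for term in EXPANSION_MAP.get(key, []):
            if term not in extra_terms:
                extra_terms.append(term)

    if not extra_terms:
        return query
    return f"{query} ({', '.join(extra_terms)})"
```

Queries without a known abbreviation, such as "What is CLIP?", pass through unchanged.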

In [77]:
ai_response = no_chain.run("CV")
expanded_query = custom_query_expansion("CV")
expanded_ai_response = no_chain.run(expanded_query)

In [78]:
ai_response

' I\'m sorry, but it seems like you might have made an error in your input since "CV" typically stands for "CurriculCT," which is related to education or academic qualifications rather than being a standalone term that can be explained directly without context. \n\nIf the intention was asking about what \'curriculum\' (abbreviated as Cv) refers to, here it is an educational program consisting of subjects and learning outcomes designed by institutions for teaching students within their organization or society:\n\nCurriculum design involves deciding on a sequence of topics that aligns with the goals, mission, standards set by education bodies like IBO (International Baccalaureate), NCLB (No Child Left Behind in US policy context) and other relevant educational policies. It also includes choosing appropriate teaching methods to help students learn effectively from these materials over a certain period of time.\n\nCurriculum can be divided into four main components: content, learning exper

In [79]:
expanded_ai_response

" In computer vision and artificial intelligence (AI), tasks such as object detection, classification, and segmentation play crucial roles in interpreting visual information accurately and efficiently to support various applications like self-driving cars, medical image analysis, surveillance systems, etc. Here's a quick overview of these techniques:\n\n1. Object Detection: The process involves identifying different objects present within an image or video frame with their respective locations often represented by bounding boxes (rectangles). In object detection algorithms like YOLO and SSD, the aim is to output not only class labels but also spatial coordinates of each detected object in a given scene.\n\n2. Object Classification: This task focuses on assigning one or more classes/labels to an entire image based on its visual content without explicitly locating objects within it (unlike detection). The classification can be done at various levels, such as identifying whether the overa