# Explanation of Key Components


**This version does not use embeddings.**


### Data Collection Pipeline:

Uses arXiv API to search for papers based on research queries               
Prioritizes HTML versions of papers (available for papers since Dec 2023)               
Falls back to abstracts for older papers without HTML versions          
Caches paper information to avoid redundant processing            
Extracts clean text content using BeautifulSoup for HTML papers     

### Query Enhancement:

Employs few-shot prompting with domain-specific examples        
Expands original queries to include related concepts and terminology         
Extracts key terms from expanded queries for more effective searching         
Adapts to specific research domains with customized examples          
Handles both general research and specialized topics (like Vision Transformers)    
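
The key-term step can be sketched as follows, assuming a fixed list of known phrases that is matched case-insensitively against the expanded text. `extract_key_terms` and `build_search_query` are hypothetical helper names used only for this illustration.

```python
from typing import Iterable, List


def extract_key_terms(expanded_query: str, known_phrases: Iterable[str]) -> List[str]:
    """Keep the known phrases that occur (case-insensitively) in the expanded text."""
    text = expanded_query.lower()
    return [p for p in known_phrases if p.lower() in text]


def build_search_query(query: str, key_terms: List[str]) -> str:
    """Combine the quoted original query with the extracted terms using OR."""
    if not key_terms:
        return f'"{query}"'
    joined = " OR ".join(f'"{t}"' for t in key_terms)
    return f'"{query}" OR ({joined})'


phrases = ["Vision Transformer", "self-attention", "image patches"]
expanded = "Vision Transformers rely on self-attention over a sequence of patch embeddings."
terms = extract_key_terms(expanded, phrases)
print(build_search_query("what are vision transformers?", terms))
```

Substring matching is simple but coarse: it finds "Vision Transformer" inside "Vision Transformers", and it misses paraphrases such as "patch embeddings" for "image patches".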

### LangGraph RAG Workflow:

Implements a structured workflow with defined nodes and transitions       
Maintains comprehensive state throughout the research process          
Four-stage pipeline: query expansion → paper search → analysis → response generation         
Each node updates specific parts of the state without losing information          
Handles the full research journey from question to comprehensive analysis      
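
The idea that each node returns only the keys it changed, while the rest of the state carries over, can be illustrated without LangGraph. The sketch below is a simplified stand-in that assumes plain overwrite-per-key merging; `run_pipeline`, `expand`, and `search` are hypothetical names, and this is not LangGraph's actual implementation.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_pipeline(initial: State, nodes: List[Node]) -> State:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial)
    for node in nodes:
        update = node(state)
        state = {**state, **update}  # keys absent from the update are carried over
    return state


def expand(state: State) -> State:
    # Only touches "expanded_query"
    return {"expanded_query": state["query"] + " (expanded)"}


def search(state: State) -> State:
    # Only touches "search_results"
    return {"search_results": [f"paper about {state['expanded_query']}"]}


final = run_pipeline(
    {"query": "vision transformers", "expanded_query": "", "search_results": []},
    [expand, search],
)
print(final["query"], "|", final["search_results"])
```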

### Paper Analysis Capabilities:

Generates in-depth analyses of retrieved papers using few-shot learning       
Identifies connections between papers and research question           
Extracts key contributions, methodologies, and technical details           
Provides research context through carefully selected examples          
Synthesizes information across multiple papers for comprehensive understanding          
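
As a rough sketch of how a few-shot analysis prompt might be assembled from retrieved papers: the helpers `format_papers` and `build_analysis_prompt` are assumptions for illustration, and the limits (5 papers, 200 characters per abstract) mirror the values used later in this notebook.

```python
from typing import Dict, List


def format_papers(papers: List[Dict[str, str]], limit: int = 5, max_chars: int = 200) -> str:
    """Number the first `limit` papers and truncate long abstracts."""
    lines = []
    for i, p in enumerate(papers[:limit], start=1):
        abstract = p["abstract"]
        # Only add an ellipsis when something was actually cut off
        snippet = abstract if len(abstract) <= max_chars else abstract[:max_chars] + "..."
        lines.append(f'{i}. "{p["title"]}" - {snippet}')
    return "\n".join(lines)


def build_analysis_prompt(query: str, examples: str, papers: List[Dict[str, str]]) -> str:
    """Place the worked examples before the new papers, ending with an open 'Analysis:' slot."""
    return (
        f'Based on the examples below, analyze the papers related to "{query}".\n\n'
        f"{examples}\n\n"
        f"Papers:\n{format_papers(papers)}\n\n"
        "Analysis:"
    )


demo_papers = [
    {"title": "Paper A", "abstract": "A short abstract."},
    {"title": "Paper B", "abstract": "x" * 250},
]
print(build_analysis_prompt("vision transformers", "Example 1: ...", demo_papers))
```

Ending the prompt with "Analysis:" follows the same pattern as the examples, which nudges the model to continue in that format.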

### User Interaction:

Provides formatted Markdown output for easy reading in notebooks      
Displays expanded query to show understanding of research needs         
Presents comprehensive research analysis with insights and connections          
Lists top papers with titles, authors, publication dates, and abstracts       
Includes direct links to original papers on arXiv          
Simple interface for entering research queries and viewing results          

## Customizable Parameters in the Pipeline

Here are the key parameters to modify to experiment with results:

**Model Parameters**

* Model Version: Change `model = genai.GenerativeModel('models/gemini-1.5-pro-latest')` to use a different Gemini model.
* Temperature: Pass a `generation_config` to control creativity, either when creating the model (e.g., `model = genai.GenerativeModel('models/gemini-1.5-pro-latest', generation_config={"temperature": 0.2})`) or per call to `generate_content()`. `GenerativeModel` has no standalone `temperature` argument. `expand_research_query()` and `analyze_papers()` already pass their own per-call values (1.0 and 0.7).

**Search Parameters**

* Max Results: Modify `max_results=20` in `arxiv_api_search()` to return more or fewer papers.
* Sort Criteria: Change `sort_by=arxiv.SortCriterion.Relevance` to `arxiv.SortCriterion.SubmittedDate` or `arxiv.SortCriterion.LastUpdatedDate`.
* Category Filter: Customize the category filter logic in `arxiv_api_search()` (currently adds `cat:cs.CV` when the query mentions "vision" or "image").

**Content Processing**

* Papers Analyzed: Change the number of papers analyzed in `analyze_papers_node()` and `analyze_papers()` (both currently use the top 5).
* Few-Shot Examples: Modify the examples in `expand_research_query()` and `analyze_papers()` to better fit your domain.

**Display Settings**

* Display Limit: Change the slice in `results["search_results"][:5]` in `display_langgraph_results()` to show more papers.
* Abstract Length: Adjust `paper['abstract'][:300]` to show more or less text.

**Workflow Configuration**

* Node Ordering: Rearrange the workflow by modifying the edges in `create_research_workflow()`.
* Initial State: Add fields to the initial state in `research_arxiv_langgraph()`, and declare them in `RAGState` as well.
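
One way to keep these knobs in a single place is a small settings object. This is a hypothetical sketch: `PipelineSettings` does not exist in the notebook, and its defaults are assumptions that simply mirror values appearing in the code below (the 0.7 temperature is the one used by `analyze_papers()`).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical container for the tunable values listed above."""
    model_name: str = "models/gemini-1.5-pro-latest"
    temperature: float = 0.7
    max_results: int = 20
    papers_analyzed: int = 5
    display_limit: int = 5
    abstract_chars: int = 300

    def generation_config(self) -> Dict[str, float]:
        # Shape of the dict the notebook passes as generation_config
        return {"temperature": self.temperature}


settings = PipelineSettings(temperature=0.2, max_results=10)
print(settings.generation_config(), settings.max_results)
```

Making the dataclass frozen prevents a cell further down the notebook from silently changing a value mid-experiment.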


In [41]:
#!pip install -U google-generativeai langchain-google-genai

## Imports and Configuration

In [None]:
# ============= Imports and Configuration =============

import os
import arxiv
import requests
import numpy as np
from typing import TypedDict, List, Dict, Any, Sequence
from bs4 import BeautifulSoup
from IPython.display import Markdown, display

# LangGraph imports
from langgraph.graph import StateGraph

# LLM and message handling imports
import google.generativeai as genai
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage

"""
langgraph  Version: 0.3.34
google-generativeai Version: 0.8.5
"""

## State and Configuration 

In [46]:
class RAGState(TypedDict):
    """State for the RAG workflow.
    
    Attributes:
        query: The original user query
        expanded_query: Query after expansion with the LLM
        context: List of contextualized information from papers
        messages: List of chat messages in the conversation
        search_results: List of papers retrieved from arXiv
        analysis: Generated analysis of the papers
    """
    query: str
    expanded_query: str
    context: List[str]
    messages: List[Dict[str, Any]]
    search_results: List[Dict[str, Any]]
    analysis: str


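# API Configuration: read the key from the environment; never hard-code secrets in a notebook
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise RuntimeError("Set the GOOGLE_API_KEY environment variable before running this notebook.")
genai.configure(api_key=GOOGLE_API_KEY)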

# Create a model instance
model = genai.GenerativeModel('models/gemini-1.5-pro-latest')

# Papers database cache
papers_db = {}


## Query Expansion and Analysis 

In [52]:
def expand_research_query(query: str) -> str:
    """Expand a research query using few-shot prompting.
    
    Uses domain-specific examples to help the model generate
    a comprehensive expansion of the original query.
    
    Args:
        query: The original research query
        
    Returns:
        An expanded version of the query with additional concepts and terms
    """
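    # Vision transformer specific example if the query is about vision transformers.
    # Match "vit"/"vits" as a whole word so queries such as "gravity" or "activity" do not trigger it.
    import re  # local import keeps this cell self-contained
    if "vision transformer" in query.lower() or re.search(r"\bvits?\b", query.lower()):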
        few_shot_examples = """
        Example 1:
        Query: "vision transformer architecture"
        Expanded: The query is about Vision Transformer (ViT) architectures for computer vision tasks. Key aspects to explore include: original ViT design and patch-based image tokenization; comparison with CNN architectures; attention mechanisms specialized for vision; hierarchical and pyramid vision transformers; efficiency improvements like token pruning and sparse attention; distillation techniques for vision transformers; adaptations for different vision tasks including detection and segmentation; recent innovations addressing quadratic complexity and attention saturation.
        
        Example 2: 
        Query: "how do vision transformers process images"
        Expanded: The query focuses on the internal mechanisms of how Vision Transformers process visual information. Key areas to investigate include: patch embedding processes; position embeddings for spatial awareness; self-attention mechanisms for global context; the role of MLP blocks in feature transformation; how class tokens aggregate information; patch size impact on performance and efficiency; multi-head attention design in vision applications; information flow through vision transformer layers; differences from convolutional approaches to feature extraction.
        """
    else:
        few_shot_examples = """
        Example 1:
        Query: "transformer models for NLP"
        Expanded: The query is about transformer architecture models used in natural language processing. Key aspects to explore include: BERT, GPT, T5, and other transformer variants; attention mechanisms; self-supervision and pre-training approaches; fine-tuning methods; performance on NLP tasks like translation, summarization, and question answering; efficiency improvements like distillation and pruning; recent innovations in transformer architectures.
        
        Example 2:
        Query: "reinforcement learning for robotics"
        Expanded: The query concerns applying reinforcement learning methods to robotic systems. Important areas to investigate include: policy gradient methods; Q-learning variants for continuous control; sim-to-real transfer; imitation learning; model-based RL for robotics; sample efficiency techniques; multi-agent RL for coordinated robots; safety constraints in robotic RL; real-world applications and benchmarks; hierarchical RL for complex tasks.
        
        Example 3:
        Query: "graph neural networks applications"
        Expanded: The query focuses on practical applications of graph neural networks. Key dimensions to explore include: GNN architectures (GCN, GAT, GraphSAGE); applications in chemistry and drug discovery; recommender systems using GNNs; traffic and transportation network modeling; social network analysis; knowledge graph completion; GNNs for computer vision tasks; scalability solutions for large graphs; theoretical foundations of graph representation learning.
        """
    
    prompt = f"""Based on the examples below, expand my research query to identify key concepts, relevant subtopics, and specific areas to explore:

    {few_shot_examples}

    Query: "{query}"
    Expanded:"""
    
    generation_config = {"temperature": 1.0}
    
    response = model.generate_content(prompt, generation_config=generation_config)
    
    return response.text


def analyze_papers(query: str, papers: List[Dict[str, Any]]) -> str:
    """Analyze papers using few-shot prompting with domain-specific examples.
    
    Generates a research analysis based on the retrieved papers and
    the original query using domain-specific examples.
    
    Args:
        query: The original research query
        papers: List of paper dictionaries containing metadata and content
        
    Returns:
        A comprehensive analysis of the papers in relation to the query
    """
    few_shot_examples = """
    Example 1:
    Papers:
    1. "Attention Is All You Need" - Introduced the transformer architecture relying entirely on attention mechanisms without recurrence or convolutions.
    2. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" - Proposed bidirectional training for transformers using masked language modeling.
    
    Analysis:
    These papers represent seminal work in transformer architectures for NLP. "Attention Is All You Need" established the foundation with the original transformer design using multi-head self-attention. BERT built upon this by introducing bidirectional context modeling and masked language modeling for pre-training, significantly advancing performance on downstream tasks. Key themes include attention mechanisms, pre-training objectives, and the importance of training methodology.
    
    Example 2:
    Query: "How does the Less-Attention Vision Transformer architecture address the computational inefficiencies and saturation problems of traditional Vision Transformers?"
    Papers: 
    1. "You Only Need Less Attention at Each Stage in Vision Transformers" - Proposed reusing early-layer attention scores through linear transformations to reduce computational costs.
    
    Analysis:
    Less-Attention Vision Transformer reduces ViT's quadratic attention cost by reusing early-layer attention scores through linear transformations. It also mitigates attention saturation using residual downsampling and a custom loss to preserve attention structure. This approach addresses two key limitations of traditional Vision Transformers: computational inefficiency due to quadratic complexity of self-attention, and the saturation problem where attention maps become increasingly similar in deeper layers.
    """
    
    # Format paper information
    paper_info = "\n".join([
        f"{i+1}. \"{p['title']}\" - {p['abstract'][:200]}..." 
        for i, p in enumerate(papers[:5])
    ])
    
    prompt = f"""Based on the examples below, analyze the following research papers related to "{query}" to identify key technical contributions, methodologies, and how they address specific challenges:

    {few_shot_examples}
    
    Papers:
    {paper_info}
    
    Analysis:"""
    
    response = model.generate_content(prompt, generation_config={"temperature": 0.7})
    
    return response.text


## arXiv Data Collection Functions 

In [53]:
def arxiv_api_search(query: str, max_results: int = 20) -> List[Any]:
    """Search arXiv for papers matching the query.
    
    Uses the arXiv API to find relevant papers based on the query,
    with optional filtering for specific categories.
    
    Args:
        query: The search query
        max_results: Maximum number of results to return (default: 20)
        
    Returns:
        List of arxiv.Result objects representing papers
    """
    # Use category filter for more relevant results
    category_filter = "cat:cs.CV" if "vision" in query.lower() or "image" in query.lower() else ""
    search_query = f"{query} {category_filter}".strip()
    
    # Create a Client instance and use it for search
    client = arxiv.Client()
    search = arxiv.Search(
        query=search_query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance,
        sort_order=arxiv.SortOrder.Descending
    )
    return list(client.results(search))


def check_html_available(paper_id: str) -> bool:
    """Check if HTML version is available for a paper.
    
    Tests if the paper has an HTML version on arXiv by
    sending a HEAD request to the HTML URL.
    
    Args:
        paper_id: The arXiv paper ID
        
    Returns:
        True if HTML version is available, False otherwise
    """
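    html_url = f"https://arxiv.org/html/{paper_id}"
    try:
        # Time out rather than hang the whole search on one slow request
        response = requests.head(html_url, timeout=10)
    except requests.RequestException:
        return False
    return response.status_code == 200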


def get_html_content(paper_id: str) -> str:
    """Get HTML content of a paper if available.
    
    Fetches and parses the HTML version of a paper, removing
    irrelevant elements and extracting the main content.
    
    Args:
        paper_id: The arXiv paper ID
        
    Returns:
        Extracted text content from the HTML version or None if unavailable
    """
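    html_url = f"https://arxiv.org/html/{paper_id}"
    try:
        # Time out rather than hang on a slow download
        response = requests.get(html_url, timeout=30)
    except requests.RequestException:
        return None
    if response.status_code == 200: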
        soup = BeautifulSoup(response.text, 'html.parser')
        # Remove scripts, styles, and navigation elements
        for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
            tag.decompose()
        # Get main content
        main_content = soup.find('main') or soup.find('body')
        if main_content:
            return main_content.get_text(separator='\n', strip=True)
    return None

## LangGraph Workflow Nodes 

In [54]:
def query_expansion_node(state: RAGState) -> RAGState:
    """LangGraph node that expands the original query.
    
    Takes the original query and generates an expanded version
    with additional concepts and search terms.
    
    Args:
        state: The current LangGraph state
        
    Returns:
        Updated state with expanded query
    """
    query = state["query"]
    
    # Use a very explicit prompt to avoid the model repeating our instructions
    prompt = f"""
    Please expand the following research query about vision transformers:
    
    Query: "{query}"
    
    Provide a detailed expansion that identifies key concepts, 
    terminology, and relevant subtopics. Do not include phrases like
    "Query:" or "Expanded:" in your response. Just provide the expanded content.
    """
    
    response = model.generate_content(prompt)
    expanded_query = response.text.strip()
    
    print("EXPANSION NODE - Output expanded query:", expanded_query[:100] + "...")
    
    return {"expanded_query": expanded_query}


def search_papers_node(state: RAGState) -> RAGState:
    """LangGraph node that searches for papers based on the query.
    
    Uses the original and expanded query to search arXiv for relevant
    papers, processes them, and stores the results.
    
    Args:
        state: The current LangGraph state
        
    Returns:
        Updated state with search results
    """
    print("SEARCH NODE - Input state keys:", list(state.keys()))
    
    query = state["query"]
    expanded_query = state["expanded_query"]
    
    # Extract actual search terms from expanded query
    # We need to clean up the expanded query to just get the keywords
    # Instead of using the full text, let's extract key concepts
    
    # First, check if expanded query starts with "Query:" (which would indicate our formatting issue)
    if "Query:" in expanded_query and "Expanded:" in expanded_query:
        # Extract just the expansion part
        expanded_query = expanded_query.split("Expanded:")[1].strip()
    
    # Now extract key concepts - focus on noun phrases
    key_terms = []
    important_phrases = ["Vision Transformer", "ViT", "image patches", 
                        "self-attention", "transformer encoder", 
                        "multi-head attention", "computer vision"]
    
    # Add any terms from our list that appear in the expanded query
    for phrase in important_phrases:
        if phrase.lower() in expanded_query.lower():
            key_terms.append(phrase)
    
    # Ensure we have the core terms at minimum
    if not any(term for term in key_terms if "vision transformer" in term.lower() or "vit" in term.lower()):
        key_terms.append("Vision Transformer")
    
    # Join terms with OR
    expanded_terms = " OR ".join(f'"{term}"' for term in key_terms)
    
    # Create a clean search query
    search_query = f'"{query}" OR ({expanded_terms})'
    print(f"Clean search query: {search_query}")
    
    papers = arxiv_api_search(search_query)
    
    print(f"Found {len(papers)} papers")
    
    # Process and store papers
    results = []
    for paper in papers:
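        paper_id = paper.entry_id.split('/')[-1]
        # Reuse cached papers instead of re-fetching and re-parsing them
        if paper_id in papers_db:
            results.append(papers_db[paper_id])
            continue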
        paper_info = {
            "paper_id": paper_id,
            "title": paper.title,
            "authors": ", ".join(author.name for author in paper.authors),
            "published": paper.published.strftime("%Y-%m-%d"),
            "url": paper.entry_id,
            "abstract": paper.summary,
            "has_html": check_html_available(paper_id)
        }
        
        # Get HTML content if available; fall back to the abstract if
        # there is no HTML version or extraction returns nothing
        content = get_html_content(paper_id) if paper_info["has_html"] else None
        paper_info["content"] = content or paper_info["abstract"]
            
        papers_db[paper_id] = paper_info
        results.append(paper_info)
    
    print(f"Processed {len(results)} papers for state")
    
    # Return updated state
    updated_state = {"search_results": results}
    print("SEARCH NODE - Output state keys:", list(updated_state.keys()))
    return updated_state


def analyze_papers_node(state: RAGState) -> RAGState:
    """LangGraph node that analyzes papers.
    
    Takes the search results and generates an analysis
    of the papers in relation to the original query.
    
    Args:
        state: The current LangGraph state
        
    Returns:
        Updated state with paper analysis and context
    """
    query = state["query"]
    search_results = state["search_results"]
    
    analysis = analyze_papers(query, search_results)
    
    # Extract relevant content for context
    context = []
    for paper in search_results[:5]:
        context.append(f"Title: {paper['title']}\nAuthors: {paper['authors']}\nAbstract: {paper['abstract']}")
    
    return {
        "analysis": analysis,
        "context": context
    }


def generate_response_node(state: RAGState) -> RAGState:
    """LangGraph node that generates the final response.
    
    Formats the analysis and adds it to the message history
    in the state.
    
    Args:
        state: The current LangGraph state
        
    Returns:
        Updated state with messages
    """
    analysis = state["analysis"]
    
    # Format the message as an AI response
    message = {
        "role": "assistant",
        "content": analysis
    }
    
    # Build a new list rather than mutating the incoming state in place
    messages = list(state.get("messages", [])) + [message]
    
    return {"messages": messages}

## LangGraph Workflow 

In [55]:
def create_research_workflow():
    """Create a LangGraph workflow for research.
    
    Defines the workflow graph with nodes for query expansion,
    paper search, analysis, and response generation.
    
    Returns:
        A compiled LangGraph workflow
    """
    # Initialize the workflow with the RAGState
    workflow = StateGraph(RAGState)
    
    # Add nodes to the graph
    workflow.add_node("query_expansion", query_expansion_node)
    workflow.add_node("search_papers", search_papers_node)
    workflow.add_node("analyze_papers", analyze_papers_node)
    workflow.add_node("generate_response", generate_response_node)
    
    # Add edges to connect the nodes
    workflow.add_edge("query_expansion", "search_papers")
    workflow.add_edge("search_papers", "analyze_papers")
    workflow.add_edge("analyze_papers", "generate_response")
    
    # Set the entry point
    workflow.set_entry_point("query_expansion")
    
    # Compile the workflow
    return workflow.compile()

## Interface Functions

In [56]:
def research_arxiv_langgraph(query: str) -> Dict[str, Any]:
    """Research arXiv papers using the LangGraph workflow.
    
    Main function that executes the full research pipeline
    on a given query.
    
    Args:
        query: The research query
        
    Returns:
        The final state with all results
    """
    # Create the workflow
    workflow = create_research_workflow()
    
    # Initialize the state
    initial_state = {
        "query": query,
        "expanded_query": "",
        "context": [],
        "messages": [],
        "search_results": [],
        "analysis": ""
    }
    
    # Execute the workflow
    final_state = workflow.invoke(initial_state)
    
    # Debug print state
    print("Final state keys:", list(final_state.keys()))
    
    # Check if search_results exists and has content
    if "search_results" not in final_state or not final_state["search_results"]:
        print("WARNING: No search results found in final state!")
        # If the search_results got lost, we should check if it's available in our papers_db
        if papers_db:
            print(f"Found {len(papers_db)} papers in papers_db, using those instead")
            final_state["search_results"] = list(papers_db.values())
    
    return final_state

## Display Functions 

In [57]:
def display_langgraph_results(results: Dict[str, Any]) -> None:
    """Display the research results in a formatted way.
    
    Creates formatted Markdown outputs for the expanded query,
    research analysis, and top papers.
    
    Args:
        results: The final state from the research workflow
    """
    from IPython.display import display, Markdown, HTML
    
    display(Markdown("### QUERY EXPANSION"))
    display(Markdown(results["expanded_query"]))
    
    display(Markdown("### RESEARCH ANALYSIS"))
    display(Markdown(results["analysis"]))
    
    display(Markdown("### TOP PAPERS"))
    
    # Debug info
    display(Markdown(f"**Debug:** State keys: {list(results.keys())}"))
    
    if "search_results" not in results or not results["search_results"]:
        display(Markdown("**No papers found in search results.**"))
    else:
        display(Markdown(f"**Found {len(results['search_results'])} papers.**"))
        for i, paper in enumerate(results["search_results"][:5]):
            paper_md = f"""
**{i+1}. {paper['title']}**

*Authors:* {paper['authors']}

*Published:* {paper['published']}

*URL:* {paper['url']}

*Abstract:* {paper['abstract'][:300]}...

---
"""
            display(Markdown(paper_md))

In [58]:
# ============= Main Execution =============

if __name__ == "__main__":
    query = input("Enter your research query: ")
    results = research_arxiv_langgraph(query)
    display_langgraph_results(results)

EXPANSION NODE - Output expanded query: Vision Transformers (ViTs) represent a groundbreaking shift in computer vision, applying the Transfo...
SEARCH NODE - Input state keys: ['query', 'expanded_query', 'context', 'messages', 'search_results', 'analysis']
Clean search query: "what are vision transformers?" OR ("Vision Transformer" OR "ViT" OR "self-attention" OR "computer vision")
Found 20 papers
Processed 20 papers for state
SEARCH NODE - Output state keys: ['search_results']
Final state keys: ['query', 'expanded_query', 'context', 'messages', 'search_results', 'analysis']


### QUERY EXPANSION

Vision Transformers (ViTs) represent a groundbreaking shift in computer vision, applying the Transformer architecture, originally designed for natural language processing, to image recognition tasks.  Understanding ViTs involves exploring several key areas:

**1. The Core Architecture:**

* **Self-Attention Mechanism:**  This is the heart of the Transformer. It allows the model to weigh the importance of different parts of an image (or sequence in NLP) in relation to each other, capturing long-range dependencies.  Key concepts here include query, key, and value embeddings, attention matrices, and multi-head self-attention.
* **Transformer Blocks:** These are the building blocks of the ViT. Each block typically consists of a multi-head self-attention layer, followed by a feed-forward network, both wrapped in residual connections and layer normalization.
* **Image Patch Embeddings:**  Unlike convolutional networks, ViTs treat images as sequences of patches.  Each image patch is flattened and linearly projected to an embedding vector, analogous to word embeddings in NLP. Positional embeddings are also added to retain spatial information.
* **Classification Head:**  After processing through the Transformer blocks, a classification token (CLS token) is used for the final image classification.

**2. Comparison to Convolutional Neural Networks (CNNs):**

* **Inductive Bias:** CNNs have a built-in inductive bias for locality and translation equivariance. ViTs, however, have a weaker inductive bias, relying more on data to learn these properties. This can be both an advantage and a disadvantage.
* **Computational Complexity:** The computational cost of self-attention scales quadratically with the input sequence length. This can be a limitation for high-resolution images.  Various approaches are being developed to mitigate this, such as sparse attention and hierarchical architectures.
* **Performance:** ViTs have demonstrated comparable or superior performance to CNNs on various image recognition benchmarks, particularly with large datasets.

**3. Variants and Advancements:**

* **Data-efficient ViTs:**  Techniques like data augmentation, self-supervised learning, and knowledge distillation are being employed to train ViTs with less labeled data. Examples include DeiT (Data-efficient Image Transformers).
* **Hybrid Architectures:** Some models combine the strengths of CNNs and ViTs, using convolutional layers for early feature extraction and transformers for global reasoning.
* **Hierarchical ViTs:**  These models process images at multiple scales, offering better computational efficiency and handling larger images effectively.  Swin Transformer and Pyramid Vision Transformer (PVT) are prominent examples.
* **Vision Transformer Applications:**  Beyond image classification, ViTs are being applied to various tasks like object detection, image segmentation, video understanding, and image generation.

**4. Training and Optimization:**

* **Large-scale datasets:** ViTs typically benefit from training on massive datasets.
* **Optimizer choices:** AdamW and other optimizers with weight decay are commonly used.
* **Learning rate schedules:**  Techniques like linear warm-up and cosine decay are often employed.


By understanding these concepts and subtopics, one can gain a comprehensive understanding of Vision Transformers and their implications for the future of computer vision.

### RESEARCH ANALYSIS

These papers explore various aspects of Vision Transformers (ViTs) and address some of their limitations.  "Vision Transformer: ViT and its Derivatives" provides a general overview of ViTs and their evolution, setting the stage for subsequent papers. "A Unified Pruning Framework for Vision Transformers" tackles the high computational cost and data requirements of ViTs by proposing a pruning method to reduce model size and complexity.  "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding" focuses on improving the robustness of ViTs, specifically addressing architectural differences compared to CNNs and potentially enhancing performance in noisy or adversarial scenarios by modifying the normalization strategy.  "Vision Transformer with Progressive Sampling" aims to optimize the computational efficiency of ViTs, likely by strategically sampling input tokens or features, thereby reducing the computational burden of attention mechanisms. Finally, "Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification" applies ViTs to the medical imaging domain and enhances robustness through a self-ensembling technique, potentially improving performance and reliability in challenging medical image classification tasks.

Key themes emerging from these papers include:

* **Computational Efficiency:**  Several papers directly address the high computational cost associated with ViTs, using techniques like pruning and progressive sampling.
* **Robustness:**  Improving the robustness of ViTs against noise, adversarial attacks, and data variations is another significant area of focus.
* **Application Specificity:**  The application of ViTs to specific domains, such as medical imaging, highlights the adaptability and potential of these architectures.
* **Architectural Modifications:** Papers explore modifications to the core ViT architecture, including normalization strategies and embedding methods, to enhance performance and address specific limitations.


In summary, these papers demonstrate the ongoing research and development efforts to refine and optimize ViTs, addressing challenges like computational complexity, robustness, and domain-specific application.


### TOP PAPERS

**Debug:** State keys: ['query', 'expanded_query', 'context', 'messages', 'search_results', 'analysis']

**Found 20 papers.**


**1. Vision Transformer: Vit and its Derivatives**

*Authors:* Zujun Fu

*Published:* 2022-05-12

*URL:* http://arxiv.org/abs/2205.11239v2

*Abstract:* Transformer, an attention-based encoder-decoder architecture, has not only
revolutionized the field of natural language processing (NLP), but has also
done some pioneering work in the field of computer vision (CV). Compared to
convolutional neural networks (CNNs), the Vision Transformer (ViT) relies...

---



**2. A Unified Pruning Framework for Vision Transformers**

*Authors:* Hao Yu, Jianxin Wu

*Published:* 2021-11-30

*URL:* http://arxiv.org/abs/2111.15127v1

*Abstract:* Recently, vision transformer (ViT) and its variants have achieved promising
performances in various computer vision tasks. Yet the high computational costs
and training data requirements of ViTs limit their application in
resource-constrained settings. Model compression is an effective method to
spe...

---



**3. Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding**

*Authors:* Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Dong Gu Lee, Wonseok Jeong, Sang Woo Kim

*Published:* 2021-11-16

*URL:* http://arxiv.org/abs/2111.08413v1

*Abstract:* Vision transformers (ViTs) have recently demonstrated state-of-the-art
performance in a variety of vision tasks, replacing convolutional neural
networks (CNNs). Meanwhile, since ViT has a different architecture than CNN, it
may behave differently. To investigate the reliability of ViT, this paper
st...

---



**4. Vision Transformer with Progressive Sampling**

*Authors:* Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, Dahua Lin

*Published:* 2021-08-03

*URL:* http://arxiv.org/abs/2108.01684v1

*Abstract:* Transformers with powerful global relation modeling abilities have been
introduced to fundamental computer vision tasks recently. As a typical example,
the Vision Transformer (ViT) directly applies a pure transformer architecture
on image classification, by simply splitting images into tokens with a...

---



**5. Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification**

*Authors:* Faris Almalik, Mohammad Yaqub, Karthik Nandakumar

*Published:* 2022-08-04

*URL:* http://arxiv.org/abs/2208.02851v1

*Abstract:* Vision Transformers (ViT) are competing to replace Convolutional Neural
Networks (CNN) for various computer vision tasks in medical imaging such as
classification and segmentation. While the vulnerability of CNNs to adversarial
attacks is a well-known problem, recent works have shown that ViTs are a...

---


In [None]:
what are vision transformers?