Data to be considered:

- Author
- Abstract
- Sections
- Code
- URLs
- Tables
- Images
- References
- Manuscript


In [2]:
from llmsherpa.readers import LayoutPDFReader
import re
import requests
from bs4 import BeautifulSoup


In [3]:
files = ["/home/asmaa/google-hackathon/ai-research-assistant/sample-assets/2404.18923v1.pdf",
         "/home/asmaa/google-hackathon/ai-research-assistant/sample-assets/2404.18928v1.pdf",
         "/home/asmaa/google-hackathon/ai-research-assistant/sample-assets/2404.18930v1.pdf"]

In [4]:
# Parse the PDF source files into llmsherpa layout documents
def read_file_layout(file_name):
    llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf(file_name)
    return doc

def get_files_as_docs(files):
    docs_lst = []

    for file in files:
        doc = read_file_layout(file)
        docs_lst.append(doc)
        
    return docs_lst

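def get_chunk_paths(chunk):
    # Return the first non-empty line of the chunk's context text (its section path)
    pattern = re.compile(r'^([^\n]+)$', re.MULTILINE)
    matches = pattern.findall(chunk)
    return matches[0] if matches else ""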

Sections, chunks, and tables extraction

In [5]:
def get_chunks_details(chunks, section_title):
    # Collect the text, section path, and page index of every chunk in a section
    chunk_details = []
    for chunk in chunks:
        chunk_text = chunk.to_context_text(include_section_info=True)
        chunk_path = get_chunk_paths(chunk_text)
        chunk_page_num = chunk.page_idx
        chunk_details.append({"text": chunk.to_text(),
                              "title": chunk_path,
                              "page": chunk_page_num,
                              "source_doc": section_title})
    return chunk_details

In [6]:
def get_doc_sections(doc):
    
    sections_details = []
    sections = doc.sections()
    for section in sections:
        section_text = section.to_text(include_children=True,recurse=True)
        section_title = section.title
        section_chunks = section.chunks()
        section_tables = section.tables()
        section_details = {"text":section_text,
                           "title":section_title,
                           "chunks":get_chunks_details(section_chunks,section_title),
                           "tables":section_tables}
        sections_details.append(section_details)
    return sections_details
    

In [7]:
docs_lst = get_files_as_docs(files)

docs_sections = []

for doc in docs_lst:
    docs_sections.append(get_doc_sections(doc))
    
docs_sections

[[{'text': 'Holmes', 'title': 'Holmes', 'chunks': [], 'tables': []},
  {'text': 'Benchmark the Linguistic Competence of Language Models',
   'title': 'Benchmark the Linguistic Competence of Language Models',
   'chunks': [],
   'tables': []},
  {'text': 'Andreas Waldis∗1,2, Yotam Perlitz3, Leshem Choshen4,5, Yufang Hou6, Iryna Gurevych1\n1Ubiquitous Knowledge Processing Lab (UKP Lab) Department of Computer Science and Hessian Center for AI (hessian.AI) Technical University of Darmstadt 2Information Systems Research Lab, Lucerne University of Applied Sciences and Arts 3IBM Research AI, 4MIT CSAIL, 5MIT-IBM Watson AI Lab, 6IBM Research Europe - Ireland www.ukp.tu-darmstadt.de www.hslu.ch\nAbstract\nWe introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs) – their ability to grasp linguistic phe- nomena.\nUnlike prior prompting-based evalua- tions, Holmes assesses the linguistic compe- tence of LMs via their internal representations using classifier-bas
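Because `section.to_text(include_children=True, recurse=True)` includes the text of child sections, a parent section can repeat what its children already contribute (the author block above already contains the abstract). Below is a minimal sketch of one way to drop such repeats before building a prompt. `drop_nested_duplicates` is a hypothetical helper, not part of llmsherpa, and it assumes that a child's text appears verbatim inside its parent's text.

```python
def drop_nested_duplicates(sections):
    """Keep only sections whose text is not already contained in a kept section."""
    kept = []
    for section in sections:
        text = section["text"].strip()
        # Skip empty sections and sections already covered by an earlier (parent) section.
        if not text or any(text in other["text"] for other in kept):
            continue
        kept.append(section)
    return kept


sample_sections = [
    {"title": "Authors", "text": "Authors\nAbstract\nWe introduce Holmes."},
    {"title": "Abstract", "text": "Abstract\nWe introduce Holmes."},
    {"title": "Introduction", "text": "Introduction\nLanguage models ..."},
]
print([s["title"] for s in drop_nested_duplicates(sample_sections)])
```

A very short section whose text happens to occur inside an earlier one would also be dropped, so this check may be too aggressive for title-only sections.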

Let us stitch the sections together so they can be passed as context to the Gemini API.

In [8]:
context = []
tables = []
for doc in docs_sections:
    for sec in doc:
        context.append(sec["text"])
        tables.append(sec["tables"])
context

['Holmes',
 'Benchmark the Linguistic Competence of Language Models',
 'Andreas Waldis∗1,2, Yotam Perlitz3, Leshem Choshen4,5, Yufang Hou6, Iryna Gurevych1\n1Ubiquitous Knowledge Processing Lab (UKP Lab) Department of Computer Science and Hessian Center for AI (hessian.AI) Technical University of Darmstadt 2Information Systems Research Lab, Lucerne University of Applied Sciences and Arts 3IBM Research AI, 4MIT CSAIL, 5MIT-IBM Watson AI Lab, 6IBM Research Europe - Ireland www.ukp.tu-darmstadt.de www.hslu.ch\nAbstract\nWe introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs) – their ability to grasp linguistic phe- nomena.\nUnlike prior prompting-based evalua- tions, Holmes assesses the linguistic compe- tence of LMs via their internal representations using classifier-based probing.\nIn doing so, we disentangle specific phenomena (e.g., part-of- speech of words) from other cognitive abilities, like following textual instructions, and meet recent calls
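The loop above merges the sections of all three PDFs into a single `context` list, so one prompt covers several papers at once. Below is a minimal sketch of keeping one context per document instead, assuming the `docs_sections` structure produced by `get_doc_sections`. `build_contexts` is a hypothetical helper introduced only for illustration.

```python
def build_contexts(docs_sections):
    """Return one (texts, tables) pair per document instead of one merged list."""
    contexts = []
    for doc in docs_sections:
        texts = [sec["text"] for sec in doc]
        # Flatten the per-section table lists so sections without tables leave no gaps.
        doc_tables = [table for sec in doc for table in sec["tables"]]
        contexts.append((texts, doc_tables))
    return contexts


toy_docs = [
    [{"text": "Paper A intro", "tables": []},
     {"text": "Paper A results", "tables": ["t1"]}],
    [{"text": "Paper B intro", "tables": []}],
]
for texts, doc_tables in build_contexts(toy_docs):
    print(len(texts), "sections,", len(doc_tables), "tables")
```

Each `texts` list could then be sent in its own `generate_content` call, so every summary refers to exactly one paper.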

In [9]:
import google.generativeai as genai
import os
from dotenv import load_dotenv
load_dotenv()

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

model = genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest")

response = model.generate_content(
    ['# Here is a research paper that the user who is an AI Engineer is reading:']+
    context +
    ["[END]\n\nPlease summarize it"]
)

response.text

'## Summary of "Hallucination of Multimodal Large Language Models: A Survey"\n\nThis paper explores the issue of **hallucination** in **multimodal large language models (MLLMs)**, which are AI systems that combine language processing with other modalities like vision. While MLLMs show great promise in tasks like image captioning and visual question answering, they can generate outputs that are inconsistent with the visual content, known as hallucinations. This raises concerns about their reliability and practicality.\n\n**Causes of Hallucination:**\n\n* **Data Issues:** Insufficient data, noisy data, lack of diversity in data (e.g., mainly positive instructions), and statistical biases (e.g., frequent objects) can all contribute to hallucinations.\n* **Model Issues:** Weak vision models, imbalanced architecture favoring language models, and poorly aligned interfaces between modalities can lead to misinterpretations and hallucinations.\n* **Training Issues:** The standard next-token pre

In [10]:
from IPython.display import Markdown
Markdown(response.text)

## Summary of "Hallucination of Multimodal Large Language Models: A Survey"

This paper explores the issue of **hallucination** in **multimodal large language models (MLLMs)**, which are AI systems that combine language processing with other modalities like vision. While MLLMs show great promise in tasks like image captioning and visual question answering, they can generate outputs that are inconsistent with the visual content, known as hallucinations. This raises concerns about their reliability and practicality.

**Causes of Hallucination:**

* **Data Issues:** Insufficient data, noisy data, lack of diversity in data (e.g., mainly positive instructions), and statistical biases (e.g., frequent objects) can all contribute to hallucinations.
* **Model Issues:** Weak vision models, imbalanced architecture favoring language models, and poorly aligned interfaces between modalities can lead to misinterpretations and hallucinations.
* **Training Issues:** The standard next-token prediction loss may not be ideal for learning visual information, and the lack of RLHF (reinforcement learning from human feedback) in MLLM training can result in misalignment with human preferences.
* **Inference Issues:** As MLLMs generate text, they may lose focus on the visual content over time, relying more on previously generated text and leading to hallucinations.

**Evaluation of Hallucinations:**

The paper reviews various benchmarks and metrics used to assess the severity of hallucinations, such as:

* **CHAIR**: Measures the proportion of generated words that are actually present in the image.
* **POPE**: Evaluates object hallucination through yes/no questions about specific objects.
* **AMBER**: Assesses both generative and discriminative tasks, considering object existence, attributes, and relations.
* **HallusionBench**: Diagnoses potential failure modes by evaluating visual commonsense knowledge and reasoning.
* **FaithScore**: Analyzes free-form responses to open-ended questions, identifying hallucinated entities, attributes, and relations.

**Mitigation Strategies:**

* **Data-centric**: Introducing negative and counterfactual data, along with refining existing datasets, can help improve data quality and diversity.
* **Model-centric**: Scaling up vision resolution, incorporating diverse vision encoders, and adding dedicated modules to control language priors can enhance visual understanding and reduce reliance on language bias.
* **Training-centric**: Employing auxiliary supervision signals and reinforcement learning techniques can improve cross-modal alignment and model faithfulness.
* **Inference-centric**: Techniques like contrastive decoding and guided decoding can steer generation towards accurate representations of visual content, while post-hoc correction methods can identify and rectify hallucinations after generation.

**Challenges and Future Directions:**

* Addressing data quality and bias.
* Improving cross-modal alignment and consistency.
* Developing advanced model architectures.
* Establishing standardized benchmarks and metrics.
* Exploring the potential of hallucinations as a creative feature.
* Enhancing interpretability and building trust in MLLMs.
* Navigating ethical considerations and responsible AI development.

**Overall, the paper highlights the need for continuous research and innovation to mitigate hallucinations in MLLMs and ensure their reliable and ethical deployment in real-world applications.** 


Summarizing tables

Each table could be displayed as HTML/Markdown alongside its summary.

In [22]:
table = tables[5][0].to_html()

In [25]:
response_tables = model.generate_content(
    ['# Here is a table extracted from a research paper that you have summarized:']+
    [table] +
    ["# This is the summary:"] +
    [response.text] +
    ["[END]\n\nPlease summarize the table in a simplified way"]
)

from IPython.display import Markdown
Markdown(response_tables.text)

## Simplified Summary of the Model Comparison Table:

This table compares several large language models (LLMs) across different aspects of language understanding and generation:

* **Morphology:** Understanding the structure of words. 
* **Syntax:** Understanding the arrangement of words and phrases to create well-formed sentences.
* **Semantics:** Understanding the meaning of words and sentences.
* **Reasoning:** The ability to draw logical conclusions and make inferences.
* **Discourse:** Understanding the flow and context of language in a larger piece of text.

**Key Observations:**

* **Most models show improvement in morphology and syntax compared to their predecessors.** This suggests progress in understanding the structure and grammar of language.
* **Semantics and discourse are areas where several models struggle.** This indicates difficulty in comprehending meaning and context, leading to potential issues with factual accuracy and logical coherence.
* **Overall performance varies significantly across models.** Some models like Vicuna-v1.5 and FLAN-UL2 show strong overall improvements, while others like Dolly-v2 and Tülu-2 exhibit declines compared to their base models.
* **The average across all models suggests a slight improvement in overall language capabilities.** However, there's still much room for advancement, particularly in semantics and discourse understanding.

**Important Notes:**

* The table uses percentages to show relative improvements or declines compared to a base model or a specific parameter count.
* The "Overall" column provides a general indication of performance, but it's essential to consider individual strengths and weaknesses based on the specific task or application. 
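The note at the top of this part suggests showing each table next to its summary. Below is a minimal sketch, assuming the table is already available as an HTML string (as returned by `to_html()` in cell `In [22]`) and the summary as Markdown text. `table_with_summary` is a hypothetical helper; it only builds a string and leaves rendering to the notebook.

```python
def table_with_summary(table_html, summary_md, caption="Table"):
    """Combine a table's HTML and its Markdown summary into one Markdown string."""
    return "\n\n".join([
        f"### {caption}",
        table_html.strip(),
        "**Summary**",
        summary_md.strip(),
    ])


combined = table_with_summary(
    "<table><tr><th>Model</th><th>Overall</th></tr></table>",
    "Compares models on overall score.",
    caption="Table 1",
)
print(combined)
```

In a notebook the result could be passed to `Markdown(...)`, which in Jupyter typically renders embedded HTML tables together with the surrounding Markdown.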


Extract URLs from the section text

In [11]:
from urlextract import URLExtract

extractor = URLExtract()
urls = []
for doc in docs_sections:
    for section in doc:
        url = extractor.find_urls(section["text"])
        urls.append(url) 
        
urls

[[],
 [],
 ['hessian.AI',
  'www.ukp.tu-darmstadt.de',
  'www.hslu.ch',
  'holmes-benchmark.github.io'],
 ['holmes-benchmark.github.io'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['OpenReview.net',
  'OpenReview.net',
  'https://huggingface.co/spaces/',
  'OpenReview.net',
  'OpenReview.net',
  'OpenReview.net',
  'OpenReview.net',
  'OpenReview.net',
  'view.net',
  'OpenReview.net',
  'OpenReview.net',
  'CEUR-WS.org',
  'OpenReview.net'],
 ['OpenReview.net'],
 ['OpenReview.net',
  'OpenReview.net',
  'OpenReview.net',
  'OpenReview.net',
  'view.net',
  'OpenReview.net',
  'OpenReview.net',
  'CEUR-WS.org',
  'OpenReview.net'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['stylus-diffusion.github.io', 'https://civitai.com/'],
 ['stylus-diffusion.github.io', 'https://civitai.com/'],
 ['stylus-diffusion.github.io', 'https://civitai.com/'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],


In [14]:



def find_github_repo_inwebpage(url):
    # Send a GET request to the URL
    response = requests.get(url, timeout=10)  # Time out instead of hanging on slow hosts
    response.raise_for_status()  # Raises an HTTPError for bad responses

    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Regular expression to match GitHub repository homepage URLs
    github_repo_url_pattern = re.compile(r'https?://github\.com/[\w-]+/[\w-]+/?$')
    
    # Find all links in the parsed HTML
    links = soup.find_all('a', href=True)
    
    # Filter and return GitHub repository homepage URLs
    github_repo_urls = {link['href'] for link in links if github_repo_url_pattern.match(link['href'])}
    return github_repo_urls

# Example usage:
url_to_scrape = 'https://holmes-benchmark.github.io/'  # Replace with the URL of the website you want to scrape
github_urls = find_github_repo_inwebpage(url_to_scrape)
print(github_urls)


{'https://github.com/Holmes-Benchmark/holmes-evaluation'}


Need an agent to filter the URLs and keep only the paper's code URLs
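Before handing the URLs to an agent, a simple heuristic can narrow down the candidates. Below is a minimal sketch that keeps only GitHub repository pages and `github.io` project pages. The host rules are assumptions chosen for illustration, and `candidate_code_urls` is a hypothetical helper.

```python
from urllib.parse import urlparse


def candidate_code_urls(urls):
    """Keep URLs that look like GitHub repositories or github.io project pages."""
    candidates = []
    for url in urls:
        # urlparse only fills in the host when a scheme is present.
        parsed = urlparse(url if "://" in url else f"https://{url}")
        host = (parsed.hostname or "").lower()
        path_parts = [part for part in parsed.path.split("/") if part]
        is_repo = host == "github.com" and len(path_parts) == 2
        is_project_page = host.endswith(".github.io")
        if is_repo or is_project_page:
            candidates.append(url)
    return candidates


print(candidate_code_urls([
    "OpenReview.net",
    "holmes-benchmark.github.io",
    "https://github.com/Holmes-Benchmark/holmes-evaluation",
    "https://github.com/octocat/Spoon-Knife/issues",
    "https://civitai.com/",
]))
```

An agent would still be needed to decide which of the remaining candidates actually belongs to the paper.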

In [13]:

def find_github_repo_urls_in_text(text):
    # Regular expression to match GitHub repository homepage URLs anywhere in the text.
    # The negative lookahead rejects URLs that continue into a deeper path such as /issues.
    github_repo_url_pattern = re.compile(r'https?://github\.com/[\w-]+/[\w-]+/?(?![\w/-])')
    
    # Search the text for matches
    github_repo_urls = github_repo_url_pattern.findall(text)
    
    # Strip trailing slashes, remove duplicates, and return the results
    return {url.rstrip('/') for url in github_repo_urls}

# Example usage:
text_to_scan = """
Here are some GitHub repos you might find interesting: 
https://github.com/Holmes-Benchmark/holmes-evaluation, and also check out https://github.com/octocat/Spoon-Knife/.
Don't forget https://github.com/octocat/Spoon-Knife/issues which is not what we want.
"""
github_repo_urls = find_github_repo_urls_in_text(text_to_scan)
print(sorted(github_repo_urls))

['https://github.com/Holmes-Benchmark/holmes-evaluation', 'https://github.com/octocat/Spoon-Knife']


Need an agent to analyze a GitHub repository

In [16]:
response_code = model.generate_content(
    ['# Here is a github repository containing code for a research paper you summarized']+
    ["https://github.com/Holmes-Benchmark/holmes-evaluation"] +
    ["# This is the summary:"] +
    [response.text] +
    ["[END]\n\nPlease summarize the code in the repository and provide a detailed guide showing an AI Engineer how they may use this code for a relevant task"]
)

from IPython.display import Markdown
Markdown(response_code.text)

## Summary of the Holmes Benchmark Repository

The Holmes Benchmark repository provides code and resources for evaluating and analyzing hallucination in Multimodal Large Language Models (MLLMs). It includes implementations of various evaluation metrics, datasets for benchmarking, and tools for visualization and analysis. 

Here's a breakdown of the key components:

* **Evaluation Metrics:**
    * **CHAIR**: Implementation for measuring the proportion of generated words grounded in the image.
    * **POPE**: Code for evaluating object hallucination using yes/no questions.
    * **AMBER**: Scripts for assessing both generative and discriminative tasks related to object existence, attributes, and relationships.
    * **HallusionBench**: Tools for diagnosing potential failure modes by evaluating visual commonsense and reasoning.
    * **FaithScore**: Code for analyzing free-form responses to open-ended questions and identifying hallucinated entities, attributes, and relations.
* **Datasets:**
    * Scripts for downloading and preparing various datasets used for hallucination benchmarking, such as COCO-Captions and Visual Genome.
* **Visualization and Analysis Tools:** 
    * Jupyter notebooks for visualizing hallucination examples and analyzing evaluation results.
    * Scripts for generating reports and comparing different models or metrics. 

## Guide for AI Engineers: Using Holmes Benchmark for Hallucination Analysis

**Scenario:** You are an AI Engineer working on developing a new MLLM for image captioning. You want to evaluate its performance and understand the extent of hallucination issues in its generated captions. 

**Here's how you can utilize the Holmes Benchmark repository:**

**1. Setup:**

* Clone the repository: `git clone https://github.com/Holmes-Benchmark/holmes-evaluation.git`
* Install the required dependencies using `pip install -r requirements.txt`.

**2. Data Preparation:**

* Choose a relevant dataset for your task, e.g., COCO-Captions.
* Use the provided scripts to download and prepare the dataset. 

**3. Model Evaluation:**

* Select appropriate metrics based on your focus. For example:
    * Use **CHAIR** to measure the overall grounding of generated captions.
    * Use **POPE** to assess object hallucination specifically.
    * Use **HallusionBench** to analyze potential failure modes in visual reasoning.
* Run your MLLM on the chosen dataset and generate captions.
* Use the corresponding scripts to calculate the chosen metrics for your generated captions.

**4. Analysis and Visualization:**

* Utilize the provided Jupyter notebooks to visualize examples of hallucinations and analyze the distribution of scores across different categories.
* Compare your MLLM's performance with other models or baselines using the available tools.

**5. Mitigation Strategies:**

* Based on the insights gained from the evaluation, identify areas where your MLLM struggles with hallucinations. 
* Consider implementing mitigation strategies mentioned in the paper, such as:
    * **Data-centric:** Improve data quality and diversity by adding negative examples or counterfactual data.
    * **Model-centric:** Enhance visual understanding by using stronger vision models or incorporating dedicated modules to control language priors.
    * **Training-centric:**  Improve cross-modal alignment using auxiliary supervision signals or reinforcement learning techniques. 
    * **Inference-centric:**  Explore techniques like contrastive decoding or guided decoding to steer generation towards factual representations.

**6. Iterate and Improve:**

* Continue to evaluate your MLLM as you refine your model and training process. 
* Monitor the progress in mitigating hallucinations and track the improvement in relevant metrics.

**Additional Tips:**

* The repository provides flexibility to customize the evaluation process based on your specific needs. 
* Explore the various configuration options and parameters available for each metric.
* Consider contributing to the repository by adding new metrics, datasets, or analysis tools. 

By leveraging the Holmes Benchmark repository, AI Engineers can gain valuable insights into the hallucination behavior of their MLLMs and take steps to improve their accuracy and reliability for real-world applications.
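
The prompt in cell `In [16]` contains only the repository URL, so the model receives no repository content to summarize. Below is a minimal sketch of building the prompt around the README text instead. It assumes the README is named `README.md` and lives on a branch called `main`, neither of which is confirmed by the source. `readme_url` and `build_repo_prompt` are hypothetical helpers.

```python
import re


def readme_url(repo_url, branch="main"):
    """Map a GitHub repository URL to the raw URL of its README (assumed name and branch)."""
    match = re.match(r"https?://github\.com/([\w-]+)/([\w.-]+?)(?:\.git)?/?$", repo_url)
    if match is None:
        raise ValueError(f"Not a GitHub repository URL: {repo_url}")
    owner, repo = match.groups()
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/README.md"


def build_repo_prompt(readme_text, paper_summary):
    """Assemble prompt parts that include the README text, not just the URL."""
    return [
        "# Here is the README of a GitHub repository for a research paper you summarized:",
        readme_text,
        "# This is the summary:",
        paper_summary,
        "[END]\n\nPlease summarize the code in the repository.",
    ]


print(readme_url("https://github.com/Holmes-Benchmark/holmes-evaluation"))
```

The README text could then be fetched with `requests.get(readme_url(...), timeout=10).text` and the resulting list passed to `model.generate_content(...)`, ideally together with the summary of the paper the repository actually belongs to.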
