# Information extraction with llama-parse + Mistral

In this notebook, we are going to extract information from some pages of a paper. Relevant information that needs to be extracted has been annotated by our LA.

We are going to focus on pages 1, 3, 4 and 5. llama_parse numbers pages from zero, so they appear below as pages 0, 2, 3 and 4.

- Page 0: contains all the relevant information about authors, title and journal, plus the abstract, in text format
- Page 2: is dense with information relevant to the LA, in text format
- Page 3: there is a mix of text and a table
- Page 4: there is a mix of text and an image (which contains text)
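
Since the prose counts pages from one while `target_pages` counts from zero, a small helper can make the conversion explicit. This is only a minimal sketch: `to_target_pages` is a hypothetical helper name and is not part of llama_parse.

```python
def to_target_pages(page_numbers):
    """Convert 1-based page numbers into a zero-based, comma-separated string."""
    if any(p < 1 for p in page_numbers):
        raise ValueError("page numbers are 1-based and must be >= 1")
    # Deduplicate and sort so the string is stable regardless of input order
    return ",".join(str(p - 1) for p in sorted(set(page_numbers)))


print(to_target_pages([1, 3, 4, 5]))  # prints 0,2,3,4
```

The resulting string has the same shape as the value passed to `target_pages` in Step 1.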

### Step 1. We use llama_parse to extract those pages and convert them into `md` format.

I chose this format because it is easy for a human to read and it preserves the structure of formulas and tables better than `txt`.

In [2]:
# this is needed in a notebook because llama_parse runs asynchronous code while the notebook already has a running event loop -- more on this in their documentation
import nest_asyncio
nest_asyncio.apply()

In [3]:
# set up your API key -- you can get one for free via LlamaCloud
import os
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-.."

In [41]:
from llama_parse import LlamaParse

document_with_instruction = LlamaParse(
    result_type="markdown", # set the output format
    language='en', # set the language -- this is useful only for tables and figures
    target_pages="0,2,3,4", # pages that we want to extract
    # below you can set up a parsing prompt. I found that formulas and tables are reported better
    # when the instructions state that this is a research paper
    parsing_instruction= """ 
    This is a research paper. 
    """).load_data("test-papers/highlighted pdf_vch.pdf")

Started parsing the file under job_id 13d4eaeb-e054-43d3-9a76-07b1a591e1f6


Take a look at the result:

In [38]:
# take a look at the result
print(document_with_instruction[0].text[:1000])

# The Seasonality of Diarrheal Pathogens: A Retrospective Study

#
# The Seasonality of Diarrheal Pathogens: A Retrospective Study of Seven Sites Over Three Years

Authors: Dennis L. Chao1*, Anna Roose2, Min Roh1, Karen L. Kotloff2, Joshua L. Proctor1

1Institute for Disease Modeling, Bellevue, Washington, United States of America

2Center for Vaccine Development and Global Health, University of Maryland, Baltimore, Maryland, United States of America

Email: dennisc@idmod.org

# Abstract

This is a research article that investigates the seasonal prevalence of diarrheal pathogens among children with moderate-to-severe diarrhea (MSD) over three years from seven sites of the Global Enteric Multicenter Study (GEMS). The study characterizes the seasonality of different pathogens, their association with site-specific weather patterns, and consistency across study sites.

# Background

Pediatric diarrhea can be caused by a wide variety of pathogens, from bacteria to viruses to protozoa. Patho

### Step 2. Load pydantic schemas

We created these schemas ad hoc to accommodate the information that the LA wants to extract. Note that they are all pydantic classes nested inside `Bibliography`. The structure is fairly complex; it probably does not cover everything the LA wants to extract, but it is a good starting point for experimenting.

In [17]:
from pydantic_schemas import Measurement, WeatherData, Methodology, StudyScope, StudyDuration, ParticipantAgeGroup, Bibliography

### Step 3. Access Mistral's API

More on how to get access to Mistral here: https://docs.mindmac.app/how-to.../add-api-key/create-mistral-ai-api-key -- you can get a free API key for 2 weeks

Note that we access Mistral through `langchain` because it makes setup and usage easier.

In [12]:
import getpass
import os

os.environ["MISTRAL_API_KEY"] = getpass.getpass()

from langchain_mistralai import ChatMistralAI

llm = ChatMistralAI(model="mistral-large-latest")

 ········


### Step 4. Set up a prompt

Here we set up a prompt that tells the model to follow our pydantic schemas when extracting the information and to return it as JSON.

In [14]:
# PydanticOutputParser allows us to use our predefined pydantic schemas
from langchain_core.output_parsers import PydanticOutputParser
# ChatPromptTemplate allows us to build a prompt
from langchain_core.prompts import ChatPromptTemplate

In [18]:
parser = PydanticOutputParser(pydantic_object=Bibliography)

In [19]:
# Note that the predefined pydantic schemas wrapped in Bibliography reach the prompt through the parser's `get_format_instructions` method

# we get some warnings here
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}",
        ),
        ("human", "{query}"),
    ]
).partial(format_instructions=parser.get_format_instructions())

### Step 5. Set up the query

Now we put into `query` the text that we would like the model to process.

To check more easily what is going on, I prefer to give the model one page at a time.

In [20]:
query = (f"You are provided with an excerpt of a scientific paper: {document_with_instruction[0].text}")
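
Since the same query is rebuilt by hand for each page later on, the per-page queries could also be built in one go. The sketch below is only an illustration: `build_queries` and `QUERY_TEMPLATE` are hypothetical names, and `SimpleNamespace` objects stand in for the parsed pages so that the example runs without llama_parse.

```python
from types import SimpleNamespace

QUERY_TEMPLATE = "You are provided with an excerpt of a scientific paper: {text}"


def build_queries(documents):
    """Return one query string per parsed page, in page order."""
    return [QUERY_TEMPLATE.format(text=doc.text) for doc in documents]


# Stand-ins for the parsed pages; in the notebook this would be document_with_instruction
fake_pages = [
    SimpleNamespace(text="# Abstract ..."),
    SimpleNamespace(text="# Methods ..."),
]
queries = build_queries(fake_pages)
print(len(queries))  # prints 2
```

Each element of `queries` can then be passed to a chain one at a time, which keeps the page-by-page inspection used in the rest of the notebook.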

### Step 6. Set up a chain to parse output into a human-readable json

To get the output we build a 'chain' that wraps the model's reply in a consistent structure, so that it is easier to save. We also define a function that parses the reply into JSON, which gives us nicely indented output as well.

In [25]:
from langchain_core.messages import AIMessage
from typing import List
import re
import json

def extract_json(message: AIMessage) -> List[dict]:
    """Extracts JSON content from a message where JSON is embedded between ```json and ``` tags.

    Parameters:
        message (AIMessage): The model reply whose `content` contains the JSON blocks.

    Returns:
        list: A list of parsed JSON objects, one per matched block.
    """
    text = message.content
    # Define the regular expression pattern to match JSON blocks
    pattern = r"```json(.*?)```"

    # Find all non-overlapping matches of the pattern in the string
    matches = re.findall(pattern, text, re.DOTALL)

    # Parse each matched block after stripping any leading or trailing whitespace
    try:
        return [json.loads(match.strip()) for match in matches]
    except Exception as exc:
        # Chain the original error so the cause of the failure stays visible
        raise ValueError(f"Failed to parse: {message}") from exc
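
To see what this kind of extraction does without calling the model, the same regular-expression approach can be tried on a hand-written reply. This is a self-contained sketch: `extract_json_blocks` is a hypothetical stand-alone variant that takes plain text, the reply content is invented for illustration, and a `SimpleNamespace` stands in for the `AIMessage`.

```python
import json
import re
from types import SimpleNamespace

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def extract_json_blocks(text):
    """Parse every fenced json block found in text into a Python object."""
    pattern = FENCE + r"json(.*?)" + FENCE
    blocks = re.findall(pattern, text, re.DOTALL)
    return [json.loads(block.strip()) for block in blocks]


# Invented reply; only the .content attribute is used, as in the function above
reply = SimpleNamespace(
    content="Here is the result:\n"
    + FENCE
    + 'json\n{"title": "Example", "study_type": null}\n'
    + FENCE
)

records = extract_json_blocks(reply.content)
# indent=2 gives the pretty indentation; ensure_ascii=False keeps characters such as "ç" readable
print(json.dumps(records, indent=2, ensure_ascii=False))
```

The same `json.dumps` call, or `json.dump` with a file handle, could be used to save the records to disk.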

### Look at page 1

In [23]:
chain = prompt | llm | parser

chain.invoke({"query": query})

Bibliography(article_type='Research article', study_type='retrospective study', title='The Seasonality of Diarrheal Pathogens: A Retrospective Study of Seven Sites Over Three Years', study_scope=StudyScope(study_sites=[], description=[], locations=['Institute for Disease Modeling, Bellevue, Washington, United States of America', 'Center for Vaccine Development and Global Health, University of Maryland, Baltimore, Maryland, United States of America']), methodology=[Methodology(name='signal processing', description='Using traditional methodologies from signal processing', methods=['Fast fourier transformation'], goals=['Characterize the seasonality of different pathogens']), Methodology(name='machine-learning methodologies', description='Applying modern machine-learning methodologies to site-specific weather data', methods=[], goals=['Define seasons associated with pathogen prevalence'])], study_duration=StudyDuration(duration='three years', details=['The study ran over three years']), p

In [26]:
chain2 = prompt | llm | extract_json

chain2.invoke({"query": query})

[{'article_type': 'Research article',
  'study_type': 'retrospective study',
  'title': 'The Seasonality of Diarrheal Pathogens: A Retrospective Study',
  'study_scope': {'study_sites': ['Seven sites of the Global Enteric Multicenter Study (GEMS)'],
   'description': ['The study characterizes the seasonality of different pathogens, their association with site-specific weather patterns, and consistency across study sites.'],
   'locations': ['Seven sites of the Global Enteric Multicenter Study (GEMS)']},
  'methodology': [{'name': 'signal processing',
    'description': 'Using traditional methodologies from signal processing, we found that certain pathogens peaked at the same time every year, but not at all sites.',
    'methods': ['signal processing'],
    'goals': ['Identifying the seasonally-dependent prevalence for diarrheal pathogens']}],
  'study_duration': {'duration': 'three years',
   'details': ['The study was conducted over three years.']},
  'participant_age_group': {'age_ra

Informal evaluation: the only thing it did not capture is 'multi-site, multi-continent study', which was highlighted by the LA.
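
This kind of informal check could be made a little more systematic by listing the phrases the LA highlighted and testing which ones appear nowhere in the extracted record. The sketch below assumes the annotations are available as a plain list of strings, which is not the case in this notebook; `iter_strings` and `missing_annotations` are hypothetical helpers, and case-insensitive substring matching is a deliberately crude criterion.

```python
def iter_strings(value):
    """Yield every string found anywhere inside nested dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def missing_annotations(extracted, annotations):
    """Return the annotated phrases that appear in none of the extracted strings."""
    haystack = " ".join(iter_strings(extracted)).lower()
    return [phrase for phrase in annotations if phrase.lower() not in haystack]


# A small excerpt of an extracted record, plus an illustrative list of annotations
extracted = {
    "study_type": "retrospective study",
    "study_duration": {"duration": "three years", "details": []},
}
annotations = ["retrospective study", "three years", "multi-site, multi-continent study"]
print(missing_annotations(extracted, annotations))
# prints ['multi-site, multi-continent study']
```

A paraphrased extraction would count as missing under this criterion, so the result is a list of candidates to review by hand rather than a score.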

### Look at page 3

- Page 3: is dense with information relevant to the LA, in text format. We just need to adjust the query.

In [42]:
query = (f"You are provided with an excerpt of a scientific paper: {document_with_instruction[1].text}")

In [43]:
chain_page2 = prompt | llm | extract_json

chain_page2.invoke({"query": query})

[{'article_type': 'Research article',
  'study_type': 'retrospective study',
  'title': 'Seasonality of Diarrheal Pathogens',
  'study_scope': {'study_sites': ['Basse, The Gambia',
    'Bamako, Mali',
    'Nyaza Province, Kenya',
    'Manhiça, Mozambique',
    'Karachi [Bin Qasim Town], Pakistan',
    'Kolkata, India',
    'Mirzapur, Bangladesh'],
   'description': None,
   'locations': ['Basse, The Gambia',
    'Bamako, Mali',
    'Nyaza Province, Kenya',
    'Manhiça, Mozambique',
    'Karachi [Bin Qasim Town], Pakistan',
    'Kolkata, India',
    'Mirzapur, Bangladesh']},
  'methodology': [{'name': 'signal processing',
    'description': 'Traditional signal processing analysis methods to detect periodicity in time series data',
    'methods': ['Lomb–Scargle periodogram', 'Fast Fourier transforms'],
    'goals': ['estimate the annual, biannual, and other seasonal signals for each pathogen at each site']},
   {'name': 'machine-learning',
    'description': 'Machine-learning methodol

Informal evaluation: much of the information highlighted by the LA is not captured. This might be due to the structure of the JSON schema, and better descriptions of the fields might help. Everything that was extracted is correct.

### Look at page 4

- Page 4: there is a mix of text and a table

In [33]:
file_name = "test-papers/txt/highlighted-paper-with-instr-table.md"
with open(file_name, 'w') as file:
    file.write(document_with_instruction[2].text)

In [31]:
query = (f"You are provided with an excerpt of a scientific paper: {document_with_instruction[2].text}")

In [32]:
chain_page3 = prompt | llm | extract_json

chain_page3.invoke({"query": query})

[{'article_type': 'Research article',
  'study_type': 'retrospective study',
  'title': 'Seasonality of Diarrheal Pathogens',
  'study_scope': {'study_sites': ['The Gambia',
    'Mali',
    'Kenya',
    'Mozambique',
    'Pakistan',
    'India',
    'Bangladesh'],
   'locations': ['The Gambia',
    'Mali',
    'Kenya',
    'Mozambique',
    'Pakistan',
    'India',
    'Bangladesh']},
  'methodology': [{'name': 'Estimating Pathogen Prevalence',
    'description': 'We estimated the number and percentage of eligible children positive for each pathogen based on the proportion of tested cases who were positive.',
    'methods': ['Assuming equal proportion of pathogen-positive children among enrollees and eligible children within the same age strata and fortnight of clinic visit.'],
    'goals': ['To estimate the number of children who would have tested positive for a pathogen had all eligible children been enrolled.']}],
  'study_duration': {'duration': '2007–2011',
   'details': ['The stu

Informal evaluation: here the information is again all correct, except for author, affiliation, citation, github_repo and corresponding author. The table has been extracted by the parser, but its content is not reported in the output. This is not an error per se: in hindsight, none of the pydantic classes has a field where this information could go. So this experiment also tells us that refining the pydantic schema is a challenge in itself.
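
Before deciding how the schema should accommodate tables, it can help to look at the table content as structured rows. The sketch below reads pipe-style markdown tables with the standard library only. It assumes the parsed page contains tables in that style, `parse_markdown_tables` is a hypothetical helper, and the sample table is invented for illustration, so it is not the table from the paper.

```python
def parse_markdown_tables(markdown):
    """Return each pipe table in the text as a list of row dicts keyed by header."""
    tables, block = [], []
    for line in markdown.splitlines() + [""]:  # the trailing "" flushes the last block
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            block.append([cell.strip() for cell in stripped.strip("|").split("|")])
            continue
        if len(block) >= 2:
            header, rows = block[0], block[1:]
            # Drop the |---|---| separator row if present
            rows = [r for r in rows if not all(c and set(c) <= set("-: ") for c in r)]
            tables.append([dict(zip(header, r)) for r in rows])
        block = []
    return tables


sample = """Some text before the table.

| Site | Country |
|------|---------|
| Basse | The Gambia |
| Bamako | Mali |

Some text after."""

tables = parse_markdown_tables(sample)
print(tables[0])
# prints [{'Site': 'Basse', 'Country': 'The Gambia'}, {'Site': 'Bamako', 'Country': 'Mali'}]
```

This does not handle escaped pipes or merged cells; it is only meant to show what a table field in the schema would need to hold.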

### Look at page 5

This page contains a large figure with embedded text.

In [34]:
file_name = "test-papers/txt/highlighted-paper-with-instr-fig.md"
with open(file_name, 'w') as file:
    file.write(document_with_instruction[3].text)

The text in the figure is not extracted. Even if it were, it would be extremely hard to decide which information should be retained and how, so this is not a big deal in this experiment, but it is definitely something to discuss with the LA if he thinks anything should be taken from a picture like this one. From the LA's annotations, it seems that nothing in the figure itself must be extracted; the relevant information is in the caption instead. Let's see the result.

In [35]:
query = (f"You are provided with an excerpt of a scientific paper: {document_with_instruction[3].text}")

In [36]:
chain_page5 = prompt | llm | extract_json

chain_page5.invoke({"query": query})

[{'article_type': 'Research article',
  'study_type': 'retrospective study',
  'title': 'Seasonality of Diarrheal Pathogens',
  'study_scope': {'study_sites': [],
   'description': [],
   'locations': ['Bangladesh']},
  'methodology': [{'name': 'Statistical Analysis',
    'description': 'Statistical methods used to analyze the data.',
    'methods': ['Power spectral density analysis',
     'Principal component analysis (PCA)',
     'k-means clustering'],
    'goals': ['Estimate rotavirus prevalence',
     'Identify seasons using weather variables']}],
  'study_duration': {'duration': None, 'details': []},
  'participant_age_group': {'age_range': None, 'details': []},
  'weather_data': {'time_period': None,
   'details': [],
   'measurement': [{'measurement_type': 'Specific Humidity',
     'description': 'Specific humidity was estimated using statistical methods.',
     'formula': ['P = 101.3 * e^(L/8200)',
      'Es = 0.622 * e^(17.502*T)/(T + 240.97)'],
     'variables': ['P', 'L', 'E