# experiments

In [None]:
pip install pandas

## Getting to know LLMs and LangChain

A foundational element of using LangChain as a wrapper for large
language models (LLMs) like GPT-4 is the prompt. A prompt in the LLM
context is much like a prompt in the everyday sense: a thoughtfully
designed question meant to elicit a response. If you want to use an LLM
to explore text, it is critically important to design an effective
prompt that helps the model generate accurate and helpful responses.
We are exploring the use of LLMs to help us “read” undergraduate student
research papers in marine science and determine whether each paper
contains a species occurrence. That is, did the student observe or
collect a given species at a given place during the course of their
research? If they did, that information is a species occurrence
(species + place + date).
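One way to make that definition concrete is as a small record type. The sketch below is illustrative only: the class name `SpeciesOccurrence`, its field names, and the example values are assumptions made for this example, not part of any pipeline we ran.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SpeciesOccurrence:
    """Hypothetical record for one occurrence: species + place + date."""

    species: Optional[str] = None
    place: Optional[str] = None
    observed_on: Optional[date] = None

    def is_complete(self) -> bool:
        # All three parts must be present; if any is missing, the paper
        # did not give enough detail to record an occurrence.
        return None not in (self.species, self.place, self.observed_on)


# Made-up example values, for illustration only.
full = SpeciesOccurrence("Pisaster ochraceus", "Example Cove", date(2020, 7, 14))
partial = SpeciesOccurrence(species="Pisaster ochraceus")
print(full.is_complete(), partial.is_complete())
```

Framing the target this way suggests that a useful prompt should ask for each of the three parts separately, so that a missing place or date is visible rather than silently skipped.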

To explore the potential for using LLMs in this work, we selected a few
online tools designed to let users ask questions about text they provide
to the application. We picked a few student papers at random (all open
access) and iterated on a series of questions to learn how to engineer
prompts that might give us the information we need to determine whether
a paper includes a species occurrence. This section describes our
process and what we found.
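To give a sense of what iterating on a prompt can look like in code, here is a minimal sketch in plain Python. The template wording, the `build_prompt` helper, and the sample excerpt are all hypothetical; they are not the exact prompts we gave to the tools, and no LLM is called here.

```python
# Hypothetical template: the wording is an assumption, not a tested prompt.
PROMPT_TEMPLATE = (
    "You are reading an undergraduate research paper in marine science.\n"
    "Paper excerpt:\n{excerpt}\n\n"
    "Question: {question}\n"
    "Answer using only the excerpt. If the excerpt does not say, "
    "reply 'not stated'."
)


def build_prompt(excerpt: str, question: str) -> str:
    """Fill the template with one excerpt and one question."""
    return PROMPT_TEMPLATE.format(
        excerpt=excerpt.strip(), question=question.strip()
    )


# Each question is one iteration; compare the answers and refine the wording.
questions = [
    "What is this paper about?",
    "Did the author observe or collect a species at a specific place?",
]
excerpt = "We surveyed tide pools at Example Cove in July."  # made-up text

for q in questions:
    print(build_prompt(excerpt, q))
    print("-" * 40)
```

Keeping the template separate from the questions makes it easier to change one thing at a time and see which change affected the answer.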

## Tools we tried

### Chat My Data 📝 ChatPDF 📝 Ask My PDF

-   [Chat My
    Data](https://blog.langchain.dev/tutorial-chatgpt-over-your-data)
-   [ChatPDF](https://www.chatpdf.com)
-   [Ask My PDF](https://ask-my-pdf.streamlit.app)

The first question we gave to each chat tool was, “*What is this paper
about?*”

## Results

In [6]:

import pandas as pd

In [13]:
df = pd.read_csv('data.csv', nrows=3, usecols=[0, 1, 2, 3])
# Chain both style calls so header width and left alignment apply to the same Styler.
styled_df = (
    df.style
    .set_table_styles([dict(selector="th", props=[('max-width', '100px')])])
    .set_properties(**{'text-align': 'left'})
)
display(styled_df)


## Testing Architectures

Having experimented with prompts across some pre-made conversational
tools, the next step is to explore different combinations of tools and
methods for each stage of the pipeline: (1) load, (2) transform (text
splitting), (3) embed, (4) store, and (5) retrieve (vector store query).
We came up with four main options (below), with some possible variations
(see the yellow arrows).

In [None]:
neato`
digraph {
    nodesep=0.5;
    labelloc = "b"
    fontname = Arial
    node [
        shape = rectangle
        width = 1.5
        color= lightgray
        style = filled
        fontname="Helvetica,Arial,sans-serif"
    ]
    edge [
        len = 2
        penwidth = 1.5
        arrowhead = open
        color = darkgray
    ]
    start = regular
    normalize = 0

    subgraph cluster_0 {
        style=filled;
        color= deepskyblue;
        node [style=filled,color=white];
        SentenceTransformers -> SentenceTransformerEmbeddings -> Annoy -> MultiQueryRetriever;
        label = "Option #1";
    }

    subgraph cluster_1 {
        style=filled;
        color= yellowgreen;
        node [style=filled,color=white];
        TikToken -> OpenAIEmbeddings -> FAISS -> ContextualCompression;
        label = "Option #2";
    }

subgraph cluster_2 {
        style=filled;
        color= orange;
        node [style=filled,color=white];
        NLTK -> LlamaCCP -> Chroma -> ChromaSelfQuerying;
        label = "Option #3";
    }

subgraph cluster_3 {
        style=filled;
        color= deeppink;
        node [style=filled,color=white];
        SpaCY -> SpaCYEmbeddings -> Chroma2 -> ChromaSelfQuerying2;
        label = "Option #4";
    }

subgraph cluster_4 {
        margin=40
        style=filled;
        color= gray;
        node [style=filled,color=white];
        rankdir=LR;
        Stuffing, Refine, MapReduce, MapReRank;
    }


    source -> PyMuPDF;
  PyMuPDF -> SentenceTransformers;
  PyMuPDF -> TikToken;
  PyMuPDF -> NLTK;
  PyMuPDF -> SpaCY;
  MultiQueryRetriever -> Stuffing [color = lightblue]
  MultiQueryRetriever -> Refine [color = lightblue]
  MultiQueryRetriever -> MapReduce [color = lightblue]
  MultiQueryRetriever -> MapReRank [color = lightblue]
  ContextualCompression -> Stuffing [color = lightblue]
  ContextualCompression -> Refine [color = lightblue]
  ContextualCompression -> MapReduce [color = lightblue]
  ContextualCompression -> MapReRank [color = lightblue]
  ChromaSelfQuerying -> Stuffing [color = lightblue]
  ChromaSelfQuerying -> Refine [color = lightblue]
  ChromaSelfQuerying -> MapReduce [color = lightblue]
  ChromaSelfQuerying -> MapReRank [color = lightblue]
  ChromaSelfQuerying2 -> Stuffing [color = lightblue]
  ChromaSelfQuerying2 -> Refine [color = lightblue]
  ChromaSelfQuerying2 -> MapReduce [color = lightblue]
  ChromaSelfQuerying2 -> MapReRank [color = lightblue]
    FAISS -> MultiQueryRetriever [color = yellow]
  Annoy -> ContextualCompression [color = yellow]
  LlamaCCP -> FAISS [color = yellow]
  OpenAIEmbeddings -> Annoy [color = yellow] 

    source [shape=Msquare];
}
`

In [None]:
neato = require("@observablehq/graphviz@0.2")
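Whichever option is chosen, the five stages stay the same. The sketch below walks through them in plain Python so the data flow is visible. It is a toy under stated assumptions: the word-count `embed` function stands in for a real embedding model, a Python list stands in for a vector store, a hard-coded made-up string stands in for text extracted from a PDF, and all function names are hypothetical. It does not use LangChain or any of the libraries named in the diagram.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=10):
    """Transform: split text into chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def embed(text):
    """Embed: toy bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(store, query, k=1):
    """Retrieve: return the k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# 1. Load: made-up text standing in for the contents of a paper.
document = (
    "Sea stars were counted in tide pools at the north site. "
    "Water temperature was logged every hour with a sensor. "
    "Statistical analysis used a linear model in R."
)
# 2. Transform: split into chunks.
chunks = split_text(document, chunk_size=10)
# 3. Embed and 4. Store: keep (chunk, vector) pairs in a list.
store = [(chunk, embed(chunk)) for chunk in chunks]
# 5. Retrieve: find the chunk closest to the question.
print(retrieve(store, "Where were sea stars counted?"))
```

Each option in the diagram swaps a different real component into one of these slots, which is why the splitter, embedding model, store, and retriever can be varied independently.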