# BioRxivist

BioRxivist is a platform for exploring the possibilities of combining Large Language Models (LLMs) with full-text, pre-publication scientific articles from [BioRxiv](https://biorxiv.org). Use this tool to obtain datasets from BioRxiv publications, build knowledge graphs using [Neo4J](https://neo4j.com/), and facilitate the use of Docker-run databases and other tools.



<a id="tocz"></a>
## Table of Contents
1. ### [Using BioRxivist to find and load text from papers](#loading-text)
2. ### [BioRxivistPaper Objects](#paper-object)
3. ### [Building Vector embeddings](#embeddings)
4. ### [Connecting to an existing vector database](#vectordatabase)

In [1]:
import os
import sys
sys.version

'3.10.13 (main, Nov 17 2023, 08:59:57) [GCC 9.4.0]'

In [2]:
from dotenv import find_dotenv, load_dotenv
from os import environ
import openai
from warnings import warn
# import classes from BioRxivist
from biorxivist.webtools import BioRxivPaper
from biorxivist.webtools import BioRxivDriver
from biorxivist.webtools import SearchResult
from langchain.vectorstores import Neo4jVector

In [3]:
# load_environment variables like our API key and Neo4J credentials
load_dotenv(find_dotenv())

True

# Setting Up Your Environment

Your environment should include the following:

```
OPENAI_API_KEY=<YOUR_SECRET_KEY>
NEO4J_URL=neo4j://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=MyPa$$word!
```
Get an [OpenAI API key](https://openai.com/).

If you are installing this package in a development environment, you might choose to set these variables in a `.env` file.
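Before running the rest of the notebook, it can help to confirm that all four variables are actually visible to Python. The following is a minimal sketch, not part of BioRxivist: the helper name `missing_env_vars` is hypothetical, and the variable names are simply the ones listed in the example environment above.

```python
import os

# Variable names taken from the example environment above.
REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_URL", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```

Call it after `load_dotenv(find_dotenv())` so that values from a `.env` file are included in the check.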

<a id='loading-text'></a>
# Using BioRxivist to find and load text from papers

The first thing we'll want to do is find a paper of interest. We'll use three objects from this package: `BioRxivDriver` gives us access to BioRxiv's search utility, `SearchResult` helps us manage the results of our search, and `BioRxivPaper` manages how we access text from the papers.


[Table Of Contents](#tocz)

In [4]:
driver = BioRxivDriver()

In [5]:
r = driver.search_biorxiv('TGF-Beta 1 Signaling in monocyte maturation')

In [6]:
r.response.request.url

'https://www.biorxiv.org/search/TGF-Beta+1+Signaling+in+monocyte+maturation%20numresults:75'
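The request URL above suggests how the query string ends up in the search path: spaces become `+` and a `numresults` suffix is appended. The sketch below only reconstructs a URL of that shape with the standard library. It is an assumption about the URL format read off the output above, not the code `BioRxivDriver` uses, and `build_search_url` is a hypothetical name.

```python
from urllib.parse import quote_plus

# Base URL and suffix are read off the request URL shown above.
SEARCH_BASE = "https://www.biorxiv.org/search/"


def build_search_url(query, num_results=75):
    """Reconstruct a search URL with the same shape as the one shown above (hypothetical helper)."""
    return f"{SEARCH_BASE}{quote_plus(query)}%20numresults:{num_results}"


url = build_search_url("TGF-Beta 1 Signaling in monocyte maturation")
print(url)
```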

In [7]:
type(r)

biorxivist.webtools.SearchResult

In [8]:
len(r.results)

75

In [9]:
# load more results
r.more()

In [10]:
len(r.results)

150

In [11]:
# Results are a list of BioRxivPaper objects
r.results[0:5]

[[Shear Stress Induces a Time-Dependent Inflammatory Response in Human Monocyte-Derived Macrophages](https://biorxiv.org/content/10.1101/2022.12.08.519590v3),
 [Cross-species analysis identifies conserved transcriptional mechanisms of neutrophil maturation](https://biorxiv.org/content/10.1101/2022.11.28.518146v1),
 [Melanogenic Activity Facilitates Dendritic Cell Maturation via FMOD](https://biorxiv.org/content/10.1101/2022.05.14.491976v2),
 [Microglia integration into human midbrain organoids leads to increased neuronal maturation and functionality](https://biorxiv.org/content/10.1101/2022.01.21.477192v1),
 [Long-term culture of fetal monocyte precursors in vitro allowing the generation of bona fide alveolar macrophages in vivo](https://biorxiv.org/content/10.1101/2021.06.04.447115v2)]

<a id='paper-object'></a>
# BioRxivPaper Objects:
BioRxivPaper objects link us to BioRxiv resources related to individual papers. Once instantiated, a paper loads only a minimal amount of information into memory: the URI of the paper's homepage and its title. Other features, like the paper's abstract and full text, are lazy-loaded properties: they are fetched only when we access them for the first time. After that they are available from memory, so we don't have to hit the URL again.

They also feed directly into our LangChain pipeline. In the next section we will make use of their `BioRxivPaper.langchain_doc` attribute.
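The lazy-loading behaviour described above can be illustrated with `functools.cached_property` from the standard library. This is a toy sketch under stated assumptions: `LazyPaper` and `fake_fetch` are hypothetical stand-ins, and nothing here claims to be how `BioRxivPaper` is implemented.

```python
from functools import cached_property


class LazyPaper:
    """Toy stand-in for a paper object; not the BioRxivPaper implementation."""

    def __init__(self, uri, title, fetch):
        self.uri = uri
        self.title = title
        self._fetch = fetch  # callable that takes a URI and returns text

    @cached_property
    def abstract(self):
        # Runs on first access only; the result is then stored on the instance.
        return self._fetch(self.uri)


calls = []


def fake_fetch(uri):
    """Record each fetch so we can count how often the 'network' is hit."""
    calls.append(uri)
    return f"abstract for {uri}"


paper = LazyPaper("https://example.org/paper", "A title", fake_fetch)
first = paper.abstract   # triggers the fetch
second = paper.abstract  # served from memory
print(len(calls))  # 1
```

The second access does not call `fake_fetch` again, which is the property that keeps repeated reads from hitting the URL.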

[Table of Contents](#tocz)

# Interact with individual papers:
The results are a collection of BioRxivPaper objects.  We can interact with them by indexing into the list or we can pull them out and interact with them here:

In [12]:
paper3 = r.results[3]

In [13]:
paper3.title

'Microglia integration into human midbrain organoids leads to increased neuronal maturation and functionality'

In [14]:
paper3.abstract

'The human brain is a complex, three-dimensional structure. To better recapitulate brain complexity, recent efforts have focused on the development of human specific midbrain organoids. Human iPSC-derived midbrain organoids consist of differentiated and functional neurons, which contain active synapses, as well as astrocytes and oligodendrocytes. However, the absence of microglia, with their ability to remodel neuronal networks and phagocytose apoptotic cells and debris, represents a major disadvantage for the current midbrain organoid systems. Additionally, neuro-inflammation related disease modeling is not possible in the absence of microglia. So far, no studies about the effects of human iPSC-derived microglia on midbrain organoid neural cells have been published. Here we describe an approach to derive microglia from human iPSCs and integrate them into iPSC-derived midbrain organoids. Using single nuclear RNA Sequencing, we provide a detailed characterization of microglia in midbrai

# Accessing the paper's text

There are a few ways to access the paper's text:

In [15]:
# through the paper.text property:
# print(paper3.text)
# or by slicing the paper object directly, which returns a slice of its text:
print(f'...{paper3[392:1000]}...')

...r ability to remodel neuronal networks and phagocytose apoptotic cells and debris, represents a major disadvantage for the current midbrain organoid systems. Additionally, neuro-inflammation related disease modeling is not possible in the absence of microglia. So far, no studies about the effects of human iPSC-derived microglia on midbrain organoid neural cells have been published. Here we describe an approach to derive microglia from human iPSCs and integrate them into iPSC-derived midbrain organoids. Using single nuclear RNA Sequencing, we provide a detailed characterization of microglia in midbrain...
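Slicing an object like `paper3[392:1000]` works in Python when the class defines `__getitem__`. The sketch below shows one way a text-backed object could support both `print()` and slicing. `TextBacked` is a hypothetical class used for illustration, and whether `BioRxivPaper` does it this way is an assumption.

```python
class TextBacked:
    """Toy class whose instances can be printed and sliced like their text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def __getitem__(self, key):
        # Delegate integer indexes and slices to the underlying string.
        return self.text[key]

    def __len__(self):
        return len(self.text)


doc = TextBacked("The human brain is a complex, three-dimensional structure.")
print(doc[4:15])  # human brain
```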


<a id='embeddings'></a>
# Build vector Embeddings

You can use OpenAI to build vector embeddings from the paper(s) you extract from BioRxiv. These embeddings form the basis of all the tools that follow.


[Table of Contents](#tocz)

In [16]:
openai.api_key = environ['OPENAI_API_KEY']

In [17]:
from langchain.embeddings.openai import OpenAIEmbeddings

In [18]:
len(paper3.langchain_doc)

Created a chunk of size 1473, which is longer than the specified 1000
Created a chunk of size 1440, which is longer than the specified 1000
Created a chunk of size 1038, which is longer than the specified 1000
Created a chunk of size 1232, which is longer than the specified 1000
Created a chunk of size 1149, which is longer than the specified 1000
Created a chunk of size 1177, which is longer than the specified 1000
Created a chunk of size 1144, which is longer than the specified 1000
Created a chunk of size 1029, which is longer than the specified 1000
Created a chunk of size 1039, which is longer than the specified 1000
Created a chunk of size 1408, which is longer than the specified 1000
Created a chunk of size 1071, which is longer than the specified 1000
Created a chunk of size 1400, which is longer than the specified 1000
Created a chunk of size 1863, which is longer than the specified 1000
Created a chunk of size 1268, which is longer than the specified 1000
Created a chunk of s

78
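The "Created a chunk of size ..." messages above appear when a single piece of text is already longer than the target chunk size, so it cannot be merged into a smaller chunk. The sketch below is a simplified greedy merge that shows the idea. It is not LangChain's splitter: the function name, the `"\n\n"` separator, and the return shape are all assumptions made for illustration.

```python
def merge_splits(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is kept whole and its length is
    reported, similar in spirit to the messages shown above.
    """
    chunks, oversized = [], []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = piece
        if len(piece) > chunk_size:
            oversized.append(len(piece))
    if current:
        chunks.append(current)
    return chunks, oversized


sample = "\n\n".join(["a" * 400, "b" * 400, "c" * 1200, "d" * 100])
chunks, oversized = merge_splits(sample)
print([len(c) for c in chunks], oversized)  # [802, 1200, 100] [1200]
```

The first two paragraphs fit together under the limit, while the 1200-character paragraph becomes its own oversized chunk.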

In [19]:
# TODO: let's make an object in BioRxivist that does this
vec = Neo4jVector.from_documents(
    paper3.langchain_doc,
    OpenAIEmbeddings()
)
# If your NEO4J environment variables are set, there is no need to pass them here.

In [20]:
type(vec)

langchain.vectorstores.neo4j_vector.Neo4jVector

In [21]:
docs_with_score = vec.similarity_search_with_score('Providing the reasonging complete the following the function of CSF2 is <BLANK>?', k=5)

In [22]:
docs_with_score[0]

(Document(page_content='In addition to the homeostatic function, AM play an essential role in protecting influenza virus-infected mice from morbidity by maintaining lung integrity through the removal of dead cells and excess surfactant (Schneider et al, 2014). To assess the functional capacity of CSF2-cFLiMo-derived AM during pulmonary virus infection, we reconstituted Csf2ra-/- neonates with CSF2-cFLiMo and infected adults 10 weeks later with influenza virus PR8 (Fig. 5A). Without transfer, Csf2ra-/- mice succumbed to infection due to lung failure (Fig. 5B-E), as reported previously (Schneider et al, 2017). Notably, the presence of CSF2-cFLiMo-derived-AM protected Csf2ra-/- mice from severe morbidity (Fig. 5B, C) and completely restored viability (Fig. 5D) and O2 saturation (Fig. 5E) compared to infected WT mice.', metadata={'title': 'Long-term culture of fetal monocyte precursors in vitro allowing the generation of bona fide alveolar macrophages in vivo', 'source': 'https://biorxiv.o

In [23]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.9044170379638672
In addition to the homeostatic function, AM play an essential role in protecting influenza virus-infected mice from morbidity by maintaining lung integrity through the removal of dead cells and excess surfactant (Schneider et al, 2014). To assess the functional capacity of CSF2-cFLiMo-derived AM during pulmonary virus infection, we reconstituted Csf2ra-/- neonates with CSF2-cFLiMo and infected adults 10 weeks later with influenza virus PR8 (Fig. 5A). Without transfer, Csf2ra-/- mice succumbed to infection due to lung failure (Fig. 5B-E), as reported previously (Schneider et al, 2017). Notably, the presence of CSF2-cFLiMo-derived-AM protected Csf2ra-/- mice from severe morbidity (Fig. 5B, C) and completely restored viability (Fig. 5D) and O2 saturation (Fig. 5E) compared to infected WT mice.
--------------------------------------------------------------------------------
---------
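The scores above come from the vector index. To give an intuition for what a similarity score between a query and a chunk means, here is a small pure-Python sketch of cosine similarity and top-k ranking. It assumes cosine similarity as the measure, and the database may rescale its scores, so the numbers here are not meant to match the output above. The function names and the two-dimensional vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
docs = {"chunk_a": [1.0, 0.0], "chunk_b": [0.7, 0.7], "chunk_c": [0.0, 1.0]}
ranked = top_k([1.0, 0.1], docs, k=2)
for name, score in ranked:
    print(name, round(score, 3))
```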

<a id='vectordatabase'></a>
# Connect to an existing vector store

Once you have a vector store in your environment, BioRxivist can simplify how you access and interact with that data using its `Neo4JDatabase` object.


[Table of Contents](#tocz)

## From Existing Index:

In [24]:
index_name = "vector"  # default index name

store = Neo4jVector.from_existing_index(
    OpenAIEmbeddings(),
    index_name=index_name,
)
# consumes the NEO4J environment variables

In [25]:
type(store)

langchain.vectorstores.neo4j_vector.Neo4jVector

In [26]:
result = store.similarity_search_with_score('CSF2', k=5)

In [27]:
result

[(Document(page_content='CSF2-cFLiMo generated from wild-type or gene-deficient mice could be used as a high-throughput screening system to study AM development in vitro and in vivo. Our model is suitable to study the relationship between AM and lung tissue, as well as the roles of specific genes or factors in AM development and function. Furthermore, CSF2-cFLiMo can overcome the limitation in macrophage precursor numbers and be used as a therapeutic approach for PAP disease or in other macrophage-based cell therapies including lung emphysema, lung fibrosis, lung infectious disease and lung cancer (Byrne et al, 2016; Lee et al, 2016; Wilson et al, 2010). Finally, genetically modified and transferred CSF2-cFLiMo might facilitate the controlled expression of specific therapeutic proteins in the lung for disease treatment, and therefore, could represent an attractive alternative to non-specific gene delivery by viral vectors.', metadata={'title': 'Long-term culture of fetal monocyte precu

## From Graph

In [28]:
# Now we initialize from existing graph
existing_graph = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    index_name="vector",
    node_label="Chunk",
    text_node_properties=["text", "title"], # not all the properties, only those that contain text.
    embedding_node_property="embedding",
)



In [29]:
type(existing_graph)

langchain.vectorstores.neo4j_vector.Neo4jVector

In [30]:
existing_graph.retrieve_existing_index()

1536

In [31]:
existing_graph.similarity_search('macrophage', k=1)

[Document(page_content='\ntext: Tissue-resident macrophages (MFTR) are heterogeneous cell populations, present in almost all tissues and play multiple tissue-specific functions in homeostasis and diseases (Davies et al, 2013; Hoeffel & Ginhoux, 2015). MF-based therapies have been proposed as potential strategies in various diseases (Duan & Luo, 2021; Mass & Lachmann, 2021; Moroni et al, 2019; Peng et al, 2020).\ntitle: Long-term culture of fetal monocyte precursors in vitro allowing the generation of bona fide alveolar macrophages in vivo', metadata={'source': 'https://biorxiv.org/content/10.1101/2021.06.04.447115v2.full-text'})]
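Note the shape of `page_content` in the result above: each property named in `text_node_properties` shows up as a `key: value` line, each preceded by a newline. The sketch below reproduces that shape for a plain dictionary. It is based only on the output shown here, `format_node_text` is a hypothetical helper, and the node values are made-up placeholders.

```python
def format_node_text(properties, text_properties):
    """Join the selected node properties as newline-prefixed 'key: value' lines."""
    return "".join(f"\n{key}: {properties.get(key, '')}" for key in text_properties)


# Placeholder node; the values are illustrative, not taken from the database.
node = {
    "text": "Example chunk text.",
    "title": "Example paper title",
    "embedding": [0.1, 0.2],
}
content = format_node_text(node, ["text", "title"])
print(repr(content))  # '\ntext: Example chunk text.\ntitle: Example paper title'
```

Properties that are not listed, such as `embedding`, are left out of the text.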

<a id='loading-docs'></a>
# Loading more Documents:


In [32]:
for p in r.results[4:13]:
    try:
        store.add_documents(p.langchain_doc)
    except TypeError as e:
        warn(f'No HTML: {e}')

Created a chunk of size 1197, which is longer than the specified 1000
Created a chunk of size 1274, which is longer than the specified 1000
Created a chunk of size 1647, which is longer than the specified 1000
Created a chunk of size 1902, which is longer than the specified 1000
Created a chunk of size 1529, which is longer than the specified 1000
Created a chunk of size 1496, which is longer than the specified 1000
Created a chunk of size 1123, which is longer than the specified 1000
Created a chunk of size 1078, which is longer than the specified 1000
Created a chunk of size 1122, which is longer than the specified 1000
Created a chunk of size 1342, which is longer than the specified 1000
Created a chunk of size 1292, which is longer than the specified 1000
Created a chunk of size 1133, which is longer than the specified 1000
Created a chunk of size 1627, which is longer than the specified 1000
Created a chunk of size 1483, which is longer than the specified 1000
Created a chunk of s

# Set up `Neo4jVector` as a retriever

In [33]:
retriever = store.as_retriever()


## Question Answering with Sources

In [34]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import ChatOpenAI

In [35]:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), retriever=retriever
)

In [36]:
answer = chain(
    {"question": "What is the role of TGF-beta 1 in monocyte maturation?"},
    return_only_outputs=True,
)

In [37]:
answer

{'answer': 'TGF-beta 1 plays a role in the development of cLP tissue-resident macrophages and is linked to the transition of monocytes to macrophages. Loss of TGFβ-Receptor on macrophages resulted in a minor impairment of macrophage differentiation.\n',
 'sources': 'https://biorxiv.org/content/10.1101/601963v1.full-text, https://biorxiv.org/content/10.1101/2021.06.04.447115v2.full-text'}

In [38]:
answer = chain(
    {"question": "How are mice used in TGF-bet 1 research?"},
    return_only_outputs=True,
)

In [39]:
answer

{'answer': 'Mice were infected with Helicobacter hepaticus (Hh) in TGF-bet 1 research.\n',
 'sources': 'https://biorxiv.org/content/10.1101/601963v1.full-text'}

In [40]:
answer = chain(
    {"question": "What factors have been found to impact the activity of alveolar macrophages?"},
    return_only_outputs=True,
)

In [41]:
answer

{'answer': 'Factors that impact the activity of alveolar macrophages include the presence of GM-CSF-induced PPARγ, absence of PPARγ, GM-CSF, or GM-CSFR subunits Csf2ra and Csf2rb, and mutations in CSF2RA and CSF2RB genes. Additionally, the differentiation of M-CSF-derived BMM or GM-CSF-derived BMM can lead to the development of alveolar macrophages.\n',
 'sources': 'https://biorxiv.org/content/10.1101/2021.06.04.447115v2.full-text'}

In [42]:
answer = chain(
    {"question": "What factors have been found to induce anti-inflamatory states in maturing monocytes?"},
    return_only_outputs=True,
)

In [43]:
answer

{'answer': 'The factors that have been found to induce anti-inflammatory states in maturing monocytes are TGFβ and IL10 cytokines.\n',
 'sources': 'https://biorxiv.org/content/10.1101/601963v1.full-text'}

In [44]:
answer = chain(
    {"question": "How have monocytes been linked to CXCL4 signaling?"},
    return_only_outputs=True,
)

In [45]:
answer

{'answer': 'Monocytes have been linked to CXCL4 signaling through the differentiation of monocyte-derived dendritic cells (moDCs) in the presence of CXCL4. This process was studied to understand the effects of CXCL4 on the trajectory of monocyte differentiation into moDCs and moDC maturation.\n',
 'sources': 'https://biorxiv.org/content/10.1101/807230v1.full-text'}
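In the answers above, `sources` is a single comma-separated string. If you want to work with the source URLs individually, a small helper can split them into a list. This is a minimal sketch that assumes the dictionary shape shown in the outputs above. `split_sources` is a hypothetical name, and the `answer` text in the example is a placeholder.

```python
def split_sources(answer):
    """Return the 'sources' field of an answer dict as a list of URLs (hypothetical helper)."""
    raw = answer.get("sources", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


# Same shape as the chain output above; the answer text is a placeholder.
example = {
    "answer": "placeholder answer text",
    "sources": "https://biorxiv.org/content/10.1101/601963v1.full-text, "
               "https://biorxiv.org/content/10.1101/2021.06.04.447115v2.full-text",
}
for url in split_sources(example):
    print(url)
```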