# Overview

This notebook shows how to:

- save documents to a Chroma database
- load the documents back from the Chroma database

## Model and embedding

In [1]:
from langchain_community.llms import Ollama
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.embeddings import OllamaEmbeddings

In [2]:
model_name = "mistral"

In [3]:
llm = Ollama(
    model=model_name,
    callbacks=[StreamingStdOutCallbackHandler()],
)

In [4]:
ollama_emb = OllamaEmbeddings(
    model=model_name,
)

## Chroma

In [5]:
import glob
from langchain_community.document_loaders import PyPDFLoader

In [6]:
pdf_files = glob.glob("../data/*.pdf")

In [7]:
pages = []

# Load every PDF and collect the resulting Document objects in one list
for pdf_file in pdf_files:
    loader = PyPDFLoader(pdf_file)
    pages += loader.load_and_split()

len(pages)

14

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of at most 1024 characters, sharing up to 100 characters with their neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
all_splits = text_splitter.split_documents(pages)
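To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break on separators such as paragraph and line breaks, while this hypothetical `chunk_text` helper just slides a fixed-size window over the characters.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Made-up sample text of 2500 characters, using the same settings as the cell above
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1024, chunk_overlap=100)
print([len(c) for c in chunks])
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.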

In [9]:
from langchain_community.vectorstores import Chroma

In [10]:
chroma_db = Chroma.from_documents(
    all_splits,
    ollama_emb,
    persist_directory="./chroma_db",  # This is where the database will be stored
)

Let's create another database instance that loads from the persisted Chroma database.

In [11]:
chroma_db_reloaded = Chroma(
    persist_directory="./chroma_db",
    embedding_function=ollama_emb,
)

### Test evaluation

In [12]:
target_source_chunks = 4  # number of chunks to retrieve per query
retriever = chroma_db_reloaded.as_retriever(
    search_kwargs={"k": target_source_chunks},
)
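Conceptually, a retriever with `k=4` returns the four stored chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with a brute-force search in plain Python. Everything in it is an assumption for illustration: the three-dimensional vectors and chunk texts are made up, cosine similarity is chosen for simplicity, and `top_k` is a hypothetical helper. Chroma uses its own index, and its distance metric depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k entries most similar to query_vec (brute force)."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy store of (chunk text, embedding) pairs; the vectors are invented, not real embeddings
toy_store = [
    ("neutron instruments", [1.0, 0.0, 0.0]),
    ("data reduction software", [0.9, 0.1, 0.0]),
    ("cafeteria menu", [0.0, 0.0, 1.0]),
    ("analysis techniques", [0.7, 0.3, 0.0]),
    ("parking rules", [0.1, 1.0, 0.0]),
]

results = top_k([1.0, 0.0, 0.0], toy_store, k=4)
print(results)
```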

In [13]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)
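The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that prompt to the model in one call. A rough sketch of the idea follows. The prompt wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual template; the chunk texts are placeholders.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one context block, then append the question.

    Hypothetical template for illustration; LangChain ships its own prompt.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for what the retriever would return
prompt = build_stuff_prompt(
    "What is the neutron data project?",
    ["Chunk A text.", "Chunk B text."],
)
print(prompt)
```

Because every chunk goes into one prompt, the combined length of `k` chunks plus the question has to fit within the model's context window, which is one reason to keep `chunk_size` and `k` modest.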

In [14]:
qa.invoke("What is the neutron data project?")

 The Neutron Data Project is a program that provides software capabilities needed by users of ORNL's neutron facilities (SNS and HFIR). It focuses primarily on software for reducing and analyzing neutron data generated from instruments. The project supports over 30 instruments across SNS and HFIR, as well as eight different neutron analysis techniques used with those instruments. The program offers infrastructure that grants users access to various software packages, and is responsible for developing and managing some of the software while also making other packages available to users. The complexity of the project arises from its organization and the sharing of limited development resources.

{'query': 'What is the neutron data project?',
 'result': " The Neutron Data Project is a program that provides software capabilities needed by users of ORNL's neutron facilities (SNS and HFIR). It focuses primarily on software for reducing and analyzing neutron data generated from instruments. The project supports over 30 instruments across SNS and HFIR, as well as eight different neutron analysis techniques used with those instruments. The program offers infrastructure that grants users access to various software packages, and is responsible for developing and managing some of the software while also making other packages available to users. The complexity of the project arises from its organization and the sharing of limited development resources."}

In [15]:
llm.invoke("What is the neutron data project?")

 The Neutron Data Project is an international collaboration aimed at collecting, processing, and analyzing high-precision data from neutron scattering experiments. Neutrons are subatomic particles similar to electrons but with a neutral charge. They interact differently with matter than electrons, allowing for unique insights into the structure and properties of materials at the atomic and molecular level.

The project focuses on providing open access to high-quality neutron data from various sources, promoting its sharing among researchers around the world. By making this data readily available, scientists can reanalyze it using advanced techniques or apply new theories that were not previously considered when the original experiments were conducted. This collaboration enhances the scientific community's ability to learn from each other's work and accelerate discoveries in various fields such as materials science, chemistry, biology, and condensed matter physics.

" The Neutron Data Project is an international collaboration aimed at collecting, processing, and analyzing high-precision data from neutron scattering experiments. Neutrons are subatomic particles similar to electrons but with a neutral charge. They interact differently with matter than electrons, allowing for unique insights into the structure and properties of materials at the atomic and molecular level.\n\nThe project focuses on providing open access to high-quality neutron data from various sources, promoting its sharing among researchers around the world. By making this data readily available, scientists can reanalyze it using advanced techniques or apply new theories that were not previously considered when the original experiments were conducted. This collaboration enhances the scientific community's ability to learn from each other's work and accelerate discoveries in various fields such as materials science, chemistry, biology, and condensed matter physics."