# Zotero RAG 
## Chat with Your Zotero Library

## Setup Instructions

Before running this notebook, you need to create a `.env` file in the root directory of the project with your Zotero credentials:

```
ZOTERO_API_KEY=your_zotero_api_key_here
ZOTERO_USER_ID=your_zotero_user_id_here
LIBRARY_TYPE=user
GROQ_API_KEY=your_groq_api_key_here
```

### How to get your Zotero credentials:
1. **API Key**: Go to https://www.zotero.org/settings/keys/new and create a new private key
2. **User ID**: Found in your Zotero account settings or in the URL when you visit your library (e.g., https://www.zotero.org/users/YOUR_USER_ID/)
3. **Library Type**: Use `user` for personal library or `group` for group library

### How to get your Groq API key (only needed if you use Groq instead of the local Ollama option):
1. Visit https://console.groq.com/
2. Sign up or log in
3. Navigate to API Keys section and create a new key

In [24]:
# from pyzotero import zotero
# from dotenv import load_dotenv
# import os

# load_dotenv()  # by default, looks for `.env` in the working dir

# ZOTERO_API_KEY = os.getenv("ZOTERO_API_KEY", "").strip()
# ZOTERO_USER_ID = os.getenv("ZOTERO_USER_ID", "").strip()
# LIBRARY_TYPE = os.getenv("LIBRARY_TYPE", "").strip()

# if not ZOTERO_API_KEY or not ZOTERO_USER_ID or not LIBRARY_TYPE:
#     raise ValueError("Zotero API key, user ID, and library type must be set in the environment variables.")

# print(f"Library Type: '{LIBRARY_TYPE}'")  # Debug print to verify the value
# print(f"User ID: '{ZOTERO_USER_ID}'")  # Debug print to verify the value

# # Initialize Zotero client
# zot = zotero.Zotero(ZOTERO_USER_ID, LIBRARY_TYPE, ZOTERO_API_KEY)

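# # Get all attachment items; zot.everything() follows pagination,
# # whereas a single zot.items() call returns only one page of results
# items = zot.everything(zot.items(itemType='attachment'))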

# download_folder = "../documents/zotero_pdfs"
# os.makedirs(download_folder, exist_ok=True)

# for item in items:
#     if 'application/pdf' in item['data'].get('contentType', ''):
#         title = item['data']['title']
#         key = item['key']
#         try:
#             file = zot.file(key)
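#             # Avoid a doubled extension when the title already ends in ".pdf"
#             filename = title if title.lower().endswith(".pdf") else title + ".pdf"
#             path = os.path.join(download_folder, filename)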
#             with open(path, "wb") as f:
#                 f.write(file)
#             print(f"Downloaded: {title}")
#         except Exception as e:
#             print(f"Failed to download {title}: {e}")


## Download Zotero PDFs

The cell above downloads PDFs from your Zotero library. Uncomment and run it once.

**Note:** Some PDFs may fail to download if they are:
- Linked files stored locally on your computer (not synced to Zotero cloud)
- PDFs without proper access permissions
- Attachments that were deleted or moved

If any PDFs failed to download, you can manually add them to the `../documents/zotero_pdfs/` folder to include them in your RAG system.
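The download step and the RAG pipeline are independent: the pipeline reads whatever PDFs are in that folder, regardless of how they got there.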
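
Linked files can be filtered out before a download is attempted. The following is a minimal sketch, assuming each attachment item carries the `linkMode` field reported by the Zotero Web API (`imported_file`, `imported_url`, `linked_file`, `linked_url`). The helper names `is_downloadable_pdf` and `safe_pdf_name` and the sample items are hypothetical and not part of pyzotero.

```python
import re

# Link modes whose files live in Zotero storage (assumption: values as
# reported by the Zotero Web API for attachment items).
STORED_LINK_MODES = {"imported_file", "imported_url"}


def is_downloadable_pdf(item):
    """Return True if the attachment is a PDF stored in Zotero storage."""
    data = item.get("data", {})
    return (
        data.get("contentType") == "application/pdf"
        and data.get("linkMode") in STORED_LINK_MODES
    )


def safe_pdf_name(title):
    """Build a file name without path separators or a doubled extension."""
    stem = title[:-4] if title.lower().endswith(".pdf") else title
    stem = re.sub(r'[\\/:*?"<>|]+', "_", stem).strip()
    return (stem or "untitled") + ".pdf"


# Hypothetical items shaped like the dictionaries pyzotero returns.
items = [
    {"key": "AAAA1111", "data": {"title": "Paper A.pdf",
                                 "contentType": "application/pdf",
                                 "linkMode": "imported_file"}},
    {"key": "BBBB2222", "data": {"title": "Paper B",
                                 "contentType": "application/pdf",
                                 "linkMode": "linked_file"}},
    {"key": "CCCC3333", "data": {"title": "Snapshot",
                                 "contentType": "text/html",
                                 "linkMode": "imported_url"}},
]

downloadable = [item for item in items if is_downloadable_pdf(item)]
skipped = [item["data"]["title"] for item in items if not is_downloadable_pdf(item)]
```

Printing `skipped` after a run gives a list of the attachments to copy into the folder by hand.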

---

In [25]:
import warnings
warnings.filterwarnings("ignore")

# Use Ollama (local, no API token quota). Groq's token limits were too restrictive for a normal-sized Zotero library.
# First install Ollama: https://ollama.com/download
# Then run in terminal: ollama pull llama3.2

from langchain_community.llms import Ollama
llm = Ollama(
    model="llama3.2",
    temperature=0
)

# Alternative: Use Groq (cloud-based, fast but has token limits)
# Uncomment the lines below and comment out the Ollama lines above to use Groq
# 
# from langchain_groq import ChatGroq
# llm = ChatGroq(
#     model="llama3-8b-8192",
#     temperature=0,
#     max_tokens=None,
#     timeout=None,
#     max_retries=2
# )

In [29]:
# Load all PDFs from Zotero folder
from langchain_community.document_loaders import PyPDFLoader
import glob

# Get all PDF files from the zotero_pdfs folder
pdf_files = glob.glob("../documents/zotero_pdfs/*.pdf")

if not pdf_files:
    raise ValueError("No PDF files found in ../documents/zotero_pdfs/. Please download PDFs first.")

# Load all documents from all PDFs
all_documents = []
for pdf_path in pdf_files:
    loader = PyPDFLoader(file_path=pdf_path)
    all_documents.extend(loader.load())

print(f"Loaded {len(all_documents)} pages from {len(pdf_files)} PDF files")

Loaded 231 pages from 10 PDF files


In [30]:
# Split documents into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents, chunk_size=800, chunk_overlap=80):
    """
    Split documents into chunks of at most `chunk_size` characters,
    with up to `chunk_overlap` characters shared between neighbouring chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = text_splitter.split_documents(documents=documents)
    return chunks

zotero_chunks = split_documents(all_documents)
print(f"Created {len(zotero_chunks)} chunks from the documents")

Created 1378 chunks from the documents
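
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch. It uses fixed character windows, which is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks). The function `sliding_window_chunks` is a hypothetical illustration only.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one. With the defaults used in this notebook (800 and 80), the same idea keeps a sentence that straddles a chunk boundary retrievable from either side.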


In [31]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

def create_embedding_vector_db(chunks, db_name):
    """
    Embed the chunks with the open-source sentence-transformers model
    all-mpnet-base-v2 (loaded through HuggingFaceEmbeddings) and store the
    vectors in a FAISS vector database, which allows for efficient
    similarity search. The database is saved under ../vector_databases/.
    """
    # instantiate embedding model
    embedding = HuggingFaceEmbeddings(
        model_name='sentence-transformers/all-mpnet-base-v2'
    )
    # create the vector store 
    vectorstore = FAISS.from_documents(
        documents=chunks,
        embedding=embedding
    )
    # save vector database locally
    vectorstore.save_local(f"../vector_databases/vector_db_{db_name}")

In [32]:
# Create embeddings and save vector database
create_embedding_vector_db(chunks=zotero_chunks, db_name="zotero")
print("Vector database created and saved successfully!")

Vector database created and saved successfully!


### Retrieve from Vector Database

In [33]:
def retrieve_from_vector_db(vector_db_path):
    """
    Load a local FAISS vector database and return a retriever object for it.
    """
    # instantiate embedding model
    embeddings = HuggingFaceEmbeddings(
        model_name='sentence-transformers/all-mpnet-base-v2'
    )
    zotero_vectorstore = FAISS.load_local(
        folder_path=vector_db_path,
        embeddings=embeddings,
        allow_dangerous_deserialization=True
    )
    # Retrieve the 8 most similar chunks per query; a local Ollama model has no
    # API token quota, although its context window still bounds the prompt size
    retriever = zotero_vectorstore.as_retriever(search_kwargs={"k": 8})
    return retriever

In [34]:
zotero_retriever = retrieve_from_vector_db("../vector_databases/vector_db_zotero")
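
Conceptually, the retriever embeds the question and returns the `k` chunks whose vectors lie closest to it. The sketch below shows that idea with plain Python lists and Euclidean distance. The distance metric is an assumption about the default configuration, the two-dimensional vectors are made up for illustration, and `top_k` is a hypothetical helper, not a FAISS or LangChain function.

```python
import math


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
doc_vecs = [
    [1.0, 0.0],  # chunk 0
    [0.9, 0.1],  # chunk 1
    [0.0, 1.0],  # chunk 2
    [0.5, 0.5],  # chunk 3
]
query_vec = [1.0, 0.0]

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)  # [0, 1]
```

A vector index avoids comparing the query against every stored vector one by one in Python, which matters once the library grows beyond a few thousand chunks.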

### Generation

In [35]:
from langchain import hub
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

In [36]:
def connect_chains(retriever):
    """
    Build a retrieval chain: the retriever fetches the relevant chunks and the
    stuff-documents chain inserts them into the prompt that is sent to the LLM.
    """
    stuff_documents_chain = create_stuff_documents_chain(
        llm=llm,
        prompt=hub.pull("langchain-ai/retrieval-qa-chat")
    )
    retrieval_chain = create_retrieval_chain(
        retriever=retriever,
        combine_docs_chain=stuff_documents_chain
    )
    return retrieval_chain

In [37]:
zotero_retrieval_chain = connect_chains(zotero_retriever)

In [38]:
def print_output(
    inquiry,
    retrieval_chain=zotero_retrieval_chain
):
    result = retrieval_chain.invoke({"input": inquiry})
    print(result['answer'].strip("\n"))
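
Because a small local model can misattribute claims, it helps to see which files and pages an answer was drawn from. The sketch below assumes that the chain result contains the retrieved documents under the `"context"` key and that `PyPDFLoader` stores a `source` path and a 0-based `page` number in each document's metadata. `FakeDocument`, `summarize_sources`, and the file names are hypothetical stand-ins, so the example runs without LangChain installed.

```python
import os
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(docs):
    """Group retrieved chunks by file name and list the 1-based pages they came from."""
    pages_by_file = defaultdict(set)
    for doc in docs:
        source = os.path.basename(doc.metadata.get("source", "unknown"))
        pages = pages_by_file[source]  # registers the file even without a page
        page = doc.metadata.get("page")
        if page is not None:
            pages.add(page + 1)  # assumption: stored page numbers are 0-based
    return {source: sorted(pages) for source, pages in pages_by_file.items()}


# Hypothetical result shaped like the output of retrieval_chain.invoke(...).
result = {
    "answer": "...",
    "context": [
        FakeDocument("...", {"source": "../documents/zotero_pdfs/lora.pdf", "page": 0}),
        FakeDocument("...", {"source": "../documents/zotero_pdfs/lora.pdf", "page": 3}),
        FakeDocument("...", {"source": "../documents/zotero_pdfs/hardware_lottery.pdf", "page": 1}),
    ],
}

sources = summarize_sources(result["context"])
for name, pages in sources.items():
    print(f"{name}: pages {pages}")
```

In the notebook, the same function could be called on `result["context"]` inside `print_output` to print the sources under each answer.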

### Chat with Your Zotero Library

In [39]:
# Example: Ask a question about your Zotero library
print_output("Give me a summary of the main topics covered in the papers.")

Based on the provided context, I can summarize the main topics covered in the papers:

1. **Language Models**: The papers discuss various aspects of language models, including their performance, limitations, and potential societal impacts. For example, "LoRA: Low-Rank Adaptation of Large Language Models" explores a method for adapting large language models to specific tasks, while "The Automation Charade" critiques the increasing reliance on AI in automation.

2. **AI Research and Ethics**: The papers touch on the importance of ethics in AI research, including the need for transparency, accountability, and human-centered design. For instance, "Ethically aligned design: A vision for prioritizing human well-being with artificial intelligence and autonomous systems" proposes a framework for designing AI systems that prioritize human well-being.

3. **AI Governance and Standards**: The papers discuss the need for governance and standards in AI development, including the importance of ensur

In [40]:
# Ask another question
print_output("What are the key findings or conclusions from the papers?")

Based on the provided context, here are the key findings and conclusions from the papers:

1. "The Hardware Lottery" by Sara Hooker (2020):
The paper discusses the challenges of developing large language models (LLMs) that can be deployed on a variety of hardware platforms. The author argues that the current approach to LLM development is like playing a "hardware lottery," where the model's performance depends on the specific hardware configuration.

Key finding: The author highlights the need for more research on how to develop LLMs that can adapt to different hardware environments.

2. "LoRA: Low-Rank Adaptation of Large Language Models" by Edward J. Hu et al. (2021):
The paper proposes a new method called LoRA, which adapts large language models to specific hardware platforms using low-rank matrix factorization. The authors demonstrate that LoRA can improve the performance of LLMs on various hardware configurations.

Key finding: The authors show that LoRA can reduce the computation

In [41]:
# Ask any custom question about your Zotero library
# print_output("Your question here")