# Personal Notekeeper

In this use case, we will guide you through the steps of storing your notes in various document types and using Ollama to process them.

Whether your notes contain sensitive information or not, your data never leaves your device. 

This project will enable you to ask questions about your notes, PDFs, articles, and meeting summaries in a private environment.

A key advantage of Ollama is that it processes data locally, with no cloud infrastructure and no provider-imposed storage limits. You can work with as much data as your machine can handle, which is particularly useful for large collections of PDFs, DOCX documents, MP3s, and similar files. Processing may take longer depending on the size of your data and your hardware, but running everything privately and offline is a significant benefit if you need to keep your data secure and stay in control of your resources.

## Step 1: Install the required libraries and pull the required models

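Install the required Python libraries with pip. The `ollama` package provides the Python client imported later, and `openpyxl` and `xlrd` are the engines pandas uses to read XLSX and XLS files.

In [None]:
!pip install chromadb ollama pandas openpyxl xlrd PyPDF2 openai-whisper python-docx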

Using the Ollama CLI, pull `mxbai-embed-large` for generating embeddings, plus a model of your choice for generating the responses.

In [None]:
!ollama pull mxbai-embed-large

### Disclaimer

While this project demonstrates the power of local AI processing, it’s important to note that **model and dataset sizes can be quite large**, potentially exceeding the capacity of a standard local machine (e.g., 16GB RAM). For optimal performance when working with larger models and datasets, access to a machine with a **powerful GPU and at least 64GB of RAM** is recommended.

If you're experimenting with this project on a smaller setup, consider models with **1B, 3B, 7B, or 16B parameters**, which strike a good balance between capability and resource requirements. Avoid excessively large models unless you have the necessary hardware. This keeps the process practical and accessible for demonstration and educational purposes.
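
As a rough guide to what fits on a given machine, the memory needed for the weights alone is the parameter count times the bits stored per parameter. The sketch below assumes about 4 bits per parameter, which is an assumption about quantization rather than something this notebook specifies. Real model files are somewhat larger, and running a model needs extra memory on top of the weights.

```python
def estimate_weight_memory_gb(num_params_billion: float, bits_per_param: float) -> float:
    """Rough size of the model weights alone, in decimal gigabytes."""
    bytes_total = num_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# Assumed 4-bit quantization; adjust bits_per_param for other formats
for size in (1, 3, 7, 16):
    print(f"{size}B parameters at 4-bit: ~{estimate_weight_memory_gb(size, 4):.1f} GB")
```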


In our case, we will use the Llama 3.1 model with 8B parameters, since its download is only about 4.9 GB.

In [None]:
!ollama pull llama3.1:8b

## Step 2: Define functions for file processing

We need to define several functions to process different types of files such as PDF, TXT, DOCX, MP3/MP4, and Excel files. These functions will extract the text content from the files, which will later be used to generate embeddings.

In [None]:
import os
import pandas as pd
from PyPDF2 import PdfReader
import whisper
import docx
import ollama
import chromadb

Process PDF Files

In [None]:
from PyPDF2 import PdfReader

def process_pdf(file_path):
    """
    Extracts text from PDF files.
    This function uses PyPDF2 to read the content of the PDF file and extracts the text from each page.
    """
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        # extract_text() may yield nothing for image-only pages; keep a page break
        text += (page.extract_text() or "") + "\n"
    return text

Process TXT Files

In [None]:
def process_txt(file_path):
    """
    Reads and returns text from TXT files.
    This function simply opens a TXT file and returns its content as a string.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

Process DOCX Files

In [None]:
import docx

def process_docx(file_path):
    """
    Extracts text from DOCX files.
    This function uses the docx library to open the DOCX file and retrieves all the text from the document.
    """
    doc = docx.Document(file_path)
    return " ".join([para.text for para in doc.paragraphs])

Process video and audio files (MP3/MP4)

In [None]:
import whisper

def transcribe_audio(file_path, model="base"):
    '''
    Uses OpenAI's Whisper model to transcribe speech from audio or video files.
    '''
    audio_model = whisper.load_model(model)
    result = audio_model.transcribe(file_path)
    return result['text']
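
Whisper decodes audio by calling the `ffmpeg` command-line tool, so transcription fails if it is not installed. A minimal sketch of a pre-flight check follows; the helper name `find_missing_tools` is hypothetical and not part of any library.

```python
import shutil

def find_missing_tools(tools=("ffmpeg",)):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools()
if missing:
    print(f"Install these before transcribing: {', '.join(missing)}")
```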

Process Excel (XLS/XLSX) files

In [None]:
import pandas as pd

def process_excel(file_path):
    '''
    Reads the first sheet of an XLS/XLSX file with pandas and returns it as plain text.
    '''
    df = pd.read_excel(file_path)
    return df.to_string()

### Compile all methods into a single "preprocess" function

In [None]:
def preprocess(file_path):
    ext = file_path.split('.')[-1].lower()
    if ext == "pdf":
        return process_pdf(file_path)
    elif ext == "txt":
        return process_txt(file_path)
    elif ext == "docx":  # python-docx cannot read legacy .doc files
        return process_docx(file_path)
    elif ext in ["mp3", "mp4"]:
        return transcribe_audio(file_path)
    elif ext in ["xls", "xlsx"]:
        return process_excel(file_path)
    else:
        return "Unsupported file type"
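
As the number of supported formats grows, the if/elif chain can be swapped for a lookup table. The sketch below is one possible alternative, not the notebook's method: `HANDLERS`, `extract`, and `read_text` are hypothetical names, and `read_text` stands in for the real extractors so the example runs on its own. It returns `None` for unknown extensions instead of a sentinel string.

```python
from pathlib import Path

def read_text(path):
    """Stand-in handler for plain-text files."""
    return Path(path).read_text(encoding="utf-8")

# Hypothetical registry: extension -> extractor. In the notebook the values
# would be process_pdf, process_docx, transcribe_audio, and so on.
HANDLERS = {".txt": read_text}

def extract(path, handlers=HANDLERS):
    """Return the extracted text, or None when no handler is registered."""
    handler = handlers.get(Path(path).suffix.lower())
    return handler(path) if handler else None
```

Returning `None` avoids confusing a file whose text happens to equal the sentinel with an unsupported file, at the cost of changing the check in the calling code.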

## Step 3: Add your data to a folder named "notes"

Ensure the notebook you are running is at the same level as the notes folder:

In [None]:
Personal_Private_Chatbot_using_Ollama.ipynb
notes/
|--- file1.pdf
|--- file2.docx
|--- audio1.mp3
|--- spreadsheet1.xlsx
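
Before processing, it can help to confirm that the folder is where the notebook expects it and to see which file types it contains. This is a small sketch using only the standard library; the function name `summarize_folder` is hypothetical.

```python
from collections import Counter
from pathlib import Path

def summarize_folder(folder="notes"):
    """Count the files directly inside `folder`, grouped by lowercase extension."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Expected a folder named {folder!r} next to the notebook")
    # Subdirectories are ignored; files without an extension are grouped as "(none)"
    return Counter(p.suffix.lower() or "(none)" for p in root.iterdir() if p.is_file())
```

Calling `summarize_folder("notes")` from the notebook's directory returns a count per extension, which makes unsupported file types easy to spot early.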


## Step 4: Connect to ChromaDB and Create a Collection

We use ChromaDB to store and query document embeddings. Let's initialize the client and create a collection.

In [None]:
import chromadb

# Connect to ChromaDB and get (or create) a collection for storing embeddings
client = chromadb.Client()
collection = client.get_or_create_collection(name="docs")

ChromaDB is a vector database that allows you to store and query embeddings efficiently.

`client.get_or_create_collection()` returns the collection named `docs`, creating it on first use, so re-running the cell does not fail because the collection already exists. This is where we will store our document embeddings.
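
To build intuition for what querying embeddings means, here is a toy nearest-neighbor search in plain Python. The three-dimensional vectors are made up for illustration, and real embeddings have far more dimensions. ChromaDB's own index and distance metric may differ from this brute-force cosine version.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, stored):
    """Return the id whose stored vector is most similar to `query`."""
    return max(stored, key=lambda doc_id: cosine_similarity(query, stored[doc_id]))

# Made-up vectors standing in for document embeddings
stored = {
    "meeting.txt": [0.9, 0.1, 0.0],
    "recipe.pdf": [0.0, 0.2, 0.9],
}
print(nearest([1.0, 0.0, 0.1], stored))
```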

## Step 5: Process Files and Store Embeddings in ChromaDB


In [None]:
data_folder = "notes"

for file_name in os.listdir(data_folder):
    file_path = os.path.join(data_folder, file_name)
    if not os.path.isfile(file_path):  # Skip subdirectories and other non-file entries
        print(f"Skipping non-file entry: {file_name}")
        continue

    try:
        # Preprocess file to extract content
        content = preprocess(file_path)
        if content != "Unsupported file type":
            # Get embeddings
            response = ollama.embeddings(model="mxbai-embed-large", prompt=content)
            embedding = response["embedding"]

            # Add to ChromaDB collection
            collection.add(
                ids=[file_name],  # Use file name as ID
                embeddings=[embedding],
                documents=[content]
            )
            print(f"Processed and stored embeddings for: {file_name}")
        else:
            print(f"Unsupported file type: {file_name}")
    except Exception as e:
        print(f"Error processing {file_name}: {e}")
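
The loop above stores one embedding per file. Embedding models accept a limited amount of input, and a single vector for a long document tends to blur its topics together, so a common refinement is to split the text into overlapping chunks and embed each one. The sketch below shows only the splitting step; the function name and the character sizes are assumptions, not values from this notebook.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into chunks of at most max_chars, each sharing `overlap` characters with the previous one."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # The last chunk already reaches the end of the text
        start += step
    return chunks
```

Each chunk could then be embedded and added under its own ID, for example the file name plus a chunk index, so that a query retrieves the relevant passage rather than the whole file.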

## Step 6: Query the database and generate responses

Once the documents are processed and stored, you can query the database with a prompt. ChromaDB retrieves relevant documents, and Ollama generates a response based on the context of the retrieved documents.

In [None]:
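while True:
    prompt = input("> ").strip()
    if prompt.lower() in {"exit", "quit"}:  # Type exit or quit to leave the loop
        break
    if not prompt:  # Ignore empty input
        continue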

    # Get the embedding for the query
    response = ollama.embeddings(
        prompt=prompt,
        model="mxbai-embed-large"
    )

    # Query the collection for similar documents
    results = collection.query(
        query_embeddings=[response["embedding"]],
        n_results=1  # Retrieve the most relevant document
    )
    
    # Retrieve the document for the query
    data = results['documents'][0][0]

    # Generate a response using the retrieved document and the query
    output = ollama.generate(
        model="llama3.1:8b",  # Same tag that was pulled in Step 1
        prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
    )

    # Print the response
    print(output['response'])
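
If `n_results` is raised above 1, `results['documents'][0]` holds several documents, and they need to be combined into one prompt. The sketch below shows one possible way to do that; `build_prompt` is a hypothetical helper, and the `max_chars` cap is an assumed safeguard against overly long prompts, not a limit taken from Ollama.

```python
def build_prompt(question, documents, max_chars=4000):
    """Join retrieved documents into one context block, trimmed to max_chars."""
    context = "\n\n---\n\n".join(documents)[:max_chars]
    return f"Using this data:\n{context}\n\nRespond to this prompt: {question}"

# Made-up documents standing in for results['documents'][0]
print(build_prompt("When is the next review?", ["Review moved to Friday.", "Budget notes."]))
```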
