Download sample photos:

In [None]:
!pip install gdown

import gdown, os
destination_folder = "./photos"
os.makedirs(destination_folder, exist_ok=True)
drive_folder_url = "https://drive.google.com/drive/folders/156JvLn-CLIdUGmRxumr2ZtbPGFND_k0d"
gdown.download_folder(url=drive_folder_url, output=destination_folder, quiet=False, use_cookies=False)

# Setup - Files, Ollama and Packages

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Files Upload

First, let's add our photos. In the notebook, create a folder named '**photos**' and upload your files into it.

If you don't have any photos handy, [get some from here.](https://drive.google.com/drive/folders/156JvLn-CLIdUGmRxumr2ZtbPGFND_k0d?usp=sharing)


## Ollama Installation

Let's continue by installing Ollama. To do so, we first install the colab-xterm package. It gives us a terminal (CLI) inside the notebook, from which we can run the required commands.

In [None]:
!pip install colab-xterm
%load_ext colabxterm

What we need to do now is:

1.   Open a terminal
2.   Install and run Ollama
3.   Pull the models

We must execute some commands in the terminal.

To install Ollama:

`curl -fsSL https://ollama.ai/install.sh | sh`

To run Ollama:

`ollama serve &`

To pull the vision model:

`ollama pull llava:7b`

We also need to pull an **embedding model**.

`ollama pull mxbai-embed-large`

Model links for reference:

[llava](https://ollama.com/library/llava)

[mxbai-embed-large](https://ollama.com/library/mxbai-embed-large)


In [None]:
%xterm

If needed, run this:

In [None]:
%reload_ext colabxterm

Let's confirm our installation. In the terminal window, type:

`ollama run llava:7b`

Ollama will load and run our model. Say hello and the model will respond. To exit the chat, type:

`/bye`
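
If you would rather confirm the setup from a notebook cell, the sketch below asks the Ollama server which models it has available. It assumes Ollama is listening on its default address, `http://localhost:11434`, and uses the `/api/tags` endpoint; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

REQUIRED_MODELS = ["llava:7b", "mxbai-embed-large"]


def fetch_local_models(base_url: str = "http://localhost:11434") -> dict:
    """Ask the Ollama server for its locally available models."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


def missing_models(payload: dict, required: list) -> list:
    """Return the required models that do not appear in the server's list."""
    available = [model["name"] for model in payload.get("models", [])]
    # A pulled model may be listed with an explicit tag, e.g. "name:latest"
    return [
        name for name in required
        if not any(a == name or a.startswith(f"{name}:") for a in available)
    ]
```

In a cell, `missing_models(fetch_local_models(), REQUIRED_MODELS)` should return an empty list once both pulls have finished.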


## Packages

Finally, let's install the packages we require.

(Please note that we will also install the **OpenAI** package - this is a backup plan, just in case GPU resources are not available in Colab).

In [None]:
!pip install langchain langchain-chroma langchain-ollama pillow langchain-openai

## Test our installation

Now let's try to query our model with LangChain.

To do this, we need to configure a **chain**.

A typical chain is made of three parts:

`prompt -> model -> output formatting`
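
To build some intuition for how the `|` operator strings these parts together, here is a minimal pure-Python sketch of pipe-style composition. The `Step` class is hypothetical and only illustrates the idea; it is not how LangChain implements its runnables.

```python
class Step:
    """Hypothetical stand-in for a chain component; wraps a plain function."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new Step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Three toy stages that mirror prompt -> model -> output formatting
toy_prompt = Step(lambda data: f"User asks: {data['query']}")
toy_model = Step(lambda text: {"content": text.upper()})  # pretend model reply
toy_parser = Step(lambda message: message["content"])

toy_chain = toy_prompt | toy_model | toy_parser
result = toy_chain.invoke({"query": "hello"})
print(result)  # prints: USER ASKS: HELLO
```

Each stage only needs to accept the previous stage's output, which is why swapping one model or parser for another is cheap.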

Let's break those down one by one.

### Prompt

We will create a [ChatPromptTemplate](https://python.langchain.com/api_reference/core/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html) object. In our case, this is nothing more than a [SystemMessage](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.system.SystemMessage.html) and a [HumanMessage](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.human.HumanMessage.html).

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage

template = ChatPromptTemplate([
    ('system', 'You are a helpful AI assistant.'),
    ('human', '{query}')])

### Model

Now let's set up our model.
This is very simple, and swapping in another model is just as easy.

In [None]:
from langchain_ollama import ChatOllama
llm = ChatOllama(model='llava:7b')
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model='gpt-4o', api_key='<your API key>')

Just to show you that we could invoke the model directly:

In [None]:
test_response = llm.invoke('A penguin and an astronaut walk into a bar. Fill in the rest, make it short.')
print(test_response)

However, we want to do this properly, so we will proceed with the more nuanced ChatPromptTemplate.

### Output

We also need to add the [output parser](https://python.langchain.com/docs/concepts/output_parsers/) we want. The output parser dictates the format of the output we get from the model. There are several options here. In our example, we just get a string back. More complex outputs, such as JSON, are also supported, with the caveat that you will most likely need a larger model to produce them reliably.

In [None]:
from langchain_core.output_parsers import StrOutputParser

### Create and Invoke the chain

Now we have all the pieces to create our example chain.

In [None]:
chain = template | llm | StrOutputParser()

Time to ask something:

In [None]:
user_query = 'A penguin and an astronaut walk into a bar. Fill in the rest, keep it short and make it a bit weird.'
response = chain.invoke({'query':user_query})
print(response)

Well done!

# Plan of attack



1.   Set up our embedding function
2.   Set up our vector database
3.   Set up our vision chain
4.   For each photo: analyze the photo and store the result in the database

Some of those steps are a bit more involved, so let's take those one by one.


# Steps 1 & 2 - Embedding Function and Vector Database

We will use the `mxbai-embed-large` embedding model that we pulled into Ollama previously. We will also create a Chroma database with that embedding function.


In [None]:
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

embedding_function = OllamaEmbeddings(model='mxbai-embed-large')
db = Chroma(
    collection_name='photo_collection',
    embedding_function=embedding_function,
    persist_directory='./db_photos')

# Step 3 - Chain Setup

Now we need to configure our chains. This is going to be a bit more complicated than in the previous simple text example. There are several prerequisites for this, because now we will work with images instead of text.

The way this works is that we convert the photo bytes to a base64 string, and we craft a special HumanMessage, containing both the image data and the user's request for the model.
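
As a small illustration of the encoding step, the sketch below turns raw bytes into a base64 string and wraps it in a data URL. The byte string is a made-up placeholder rather than a real photo, and `to_data_url` is a hypothetical helper name.

```python
import base64


def to_data_url(raw_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode bytes as base64 and wrap them in a data URL."""
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


fake_image_bytes = b"not a real photo"  # placeholder data for the demo
data_url = to_data_url(fake_image_bytes)

# Decoding the payload gives back the original bytes
payload = data_url.split(",", 1)[1]
assert base64.b64decode(payload) == fake_image_bytes
```

Base64 makes the image safe to embed in a text message, at the cost of roughly a third more data than the raw bytes.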

Let's get the simple things out of the way first.



## System and Human prompts

In [None]:
def get_vision_system_message()->str:
    system_message_text = '''
You're an expert image and photo analyzer.
You are very perceptive in analyzing images and photos.
You possess excellent vision.
Do not read any text unless it is the most prominent in the image.
Your description should be neutral in tone.
'''
    return system_message_text


This function simply returns the system prompt text; it is just a convenient wrapper. You could keep the text in a constant instead if you prefer.

For our use case, let's also create a similar function that will return the HumanMessage's text.

**Important Note**: This would not be the case if we were building, for example, a chatbot, where the human message would be typed by an actual human. In that case, we would retrieve the user input and add it dynamically. Here, however, it is always going to be the same, so let's make our lives a bit easier and wrap this into a function too.

In [None]:
def get_vision_human_message()->str:
    human_message_text = 'Describe the image in as much detail as possible. Do not try to read any text.'
    return human_message_text



---



## Utility Function - Convert a PIL image to a base64 string

Now let's create a utility function that will convert an image to a base64 string. We will use Pillow for that.

In [None]:
from PIL import Image
from io import BytesIO
import base64

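def convert_to_base64(pil_image)->str:
    # JPEG has no alpha channel or palette, so convert other modes (e.g. RGBA, P) to RGB first
    if pil_image.mode != "RGB":
        pil_image = pil_image.convert("RGB")
    buffered = BytesIO()
    pil_image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str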

## Prompt Function

Now comes the tricky part: constructing the prompt that we will use. As mentioned earlier, we will create a SystemMessage and a HumanMessage.

The SystemMessage contains the base directions for the model. We have those already in the `get_vision_system_message()` function.

The HumanMessage is a bit trickier: we need to include both the base64 data and the user's prompt.

To do this, we will create two **Parts**. A Part is just a Python dictionary, containing the **type** and the **value** of the part. For example, a text part would be something like:



```
{
    'type': 'text',
    'text': 'The actual value of the part'
}
```

While an image part would be something like:

```
# Ollama
{
    'type': 'image_url',
    'image_url': f'data:image/jpeg;base64,<the base64 data>'
}
```

For OpenAI it is slightly different:

```
# OpenAI
{
    'type': 'image_url',
    'image_url': {
      'url':f'data:image/jpeg;base64,<the base64 data>'
    }
}
```
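
If you expect to switch between the two providers, one option is a small helper that returns the right shape for each. This is only a sketch; `build_image_part` and its `provider` argument are hypothetical names, not part of LangChain.

```python
def build_image_part(imageb64: str, provider: str = "ollama") -> dict:
    """Return an image content part in the shape each provider expects."""
    url = f"data:image/jpeg;base64,{imageb64}"
    if provider == "ollama":
        # Ollama: the data URL is passed directly as a string
        return {"type": "image_url", "image_url": url}
    if provider == "openai":
        # OpenAI: the data URL is nested under a 'url' key
        return {"type": "image_url", "image_url": {"url": url}}
    raise ValueError(f"Unknown provider: {provider}")


ollama_part = build_image_part("AAAA")
openai_part = build_image_part("AAAA", provider="openai")
```

Keeping the difference in one place means the rest of the prompt code does not need to know which backend is in use.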





Let's build our prompt function:

In [None]:
def prompt_func(data):
    imageb64 = data["imageb64"]
    system_message = SystemMessage(content=get_vision_system_message())
    text_part = {"type": "text", "text": get_vision_human_message()}
    # Ollama
    image_part = {
        'type': 'image_url',
        'image_url': f'data:image/jpeg;base64,{imageb64}',
    }
    # OpenAI
    # image_part = {
    #    'type': 'image_url',
    #     'image_url': {'url':f'data:image/jpeg;base64,{imageb64}'},
    # }
    content_parts = []
    content_parts.append(image_part)
    content_parts.append(text_part)
    human_message = HumanMessage(content=content_parts)
    return [system_message, human_message]


What we are doing here is:

1.   We extract the base64 data from the dictionary
2.   We create our SystemMessage object
3.   We create a text part containing the human message prompt text
4.   We create an image part containing the base64 data (our actual image)
5.   We add both parts to a list
6.   We construct the HumanMessage object from this list
7.   We return the combined System and Human messages that make up our prompt










Let's do a quick test - just replace the photo's file name/path.

In [None]:
###############################################
# TEST

vision_chain = prompt_func | llm | StrOutputParser()

# just replace the file name with one of your files
file_name = './photos/100_3148.JPG'

with Image.open(file_name) as img:
  imageb64 = convert_to_base64(img)
  response = vision_chain.invoke({'imageb64':imageb64})
  print(response)

# Step 4 - Main Loop

Let's think a bit about what our main loop would look like:

Retrieve the available photo paths and filenames


```
For each file:
  convert to base64
  get the model's analysis
  create a Langchain document
  store the Document in the database
```

Well, this looks good, but it is a bit simplistic and not very useful. We should do at least a couple more things.

We should add some metadata to the document. We should at least store the full path and name of the file. Optionally we can add more information that would be useful to us - for example we could add some photographic metadata, like the photo's date or even the location if it is available.

By storing the full path and name of the file, we can check whether a photo has already been processed and avoid costly LLM calls to reprocess it.

Our revised loop should look like:


```
For each file:
  IF there is NO database entry:
    convert to base64
    get the model's analysis
    create a Langchain document
    store the Document in the database
```
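
The revised loop can be sketched in plain Python before wiring in the real pieces. Here `analyze` is a hypothetical stand-in for the vision chain, and an in-memory dictionary plays the role of the database.

```python
def analyze(path: str) -> str:
    """Hypothetical stand-in for the (slow, costly) vision model call."""
    return f"description of {path}"


def process_new_files(paths: list, store: dict) -> list:
    """Analyze only the files that have no entry in the store yet."""
    newly_processed = []
    for path in paths:
        if path in store:
            continue  # already processed, skip the expensive call
        store[path] = analyze(path)
        newly_processed.append(path)
    return newly_processed


store = {"a.jpg": "description of a.jpg"}  # pretend a.jpg was done earlier
first_run = process_new_files(["a.jpg", "b.jpg"], store)
second_run = process_new_files(["a.jpg", "b.jpg"], store)
```

On the first run only `b.jpg` is analyzed; on the second run nothing is, which is exactly the saving we are after.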





## File Retrieval

We will use glob to get a list of our files.

In [None]:
import glob
photo_files = glob.glob("./photos/*")
for file in photo_files:
  print(file)
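
Note that `./photos/*` matches every entry in the folder, including any non-image files. A possible refinement, sketched below, keeps only files with JPEG extensions. The extension set and the `list_photo_files` name are assumptions; adapt them to your own library.

```python
import glob
import os

IMAGE_EXTENSIONS = {".jpg", ".jpeg"}  # assumed set, adjust as needed


def list_photo_files(folder: str) -> list:
    """Return sorted paths of regular files in folder with an image extension."""
    return sorted(
        path
        for path in glob.glob(os.path.join(folder, "*"))
        if os.path.isfile(path)
        and os.path.splitext(path)[1].lower() in IMAGE_EXTENSIONS
    )
```

Calling `list_photo_files("./photos")` keeps the same `./photos/...` path format that `glob.glob("./photos/*")` produces, so the paths stay comparable with ones stored earlier.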

## Final pieces to the puzzle

We have reached the point where only two things are missing:



1.   Store our analysis and metadata to the vector database
2.   Check if a photo has already been processed.

Once we have those in place, the only thing left to do is to piece everything together.



### Storing into the vector database

LangChain makes it easy to move data in and out of a vector database through the Document class. We instantiate a Document object, passing in the text we want to embed (in our case the model's description of the photo) and any additional metadata we want to include.

For example:

```
document = Document(
    page_content="Hello, world!",
    metadata={"source": "https://example.com"}
)
```

Our document is created, and now we should store it in our database. We have already initialized our vector database, `db`, and attached an embedding model (`mxbai-embed-large`) to it.

```
db.add_documents([document])
```

Note that add_documents takes a list.

This creates the embedding vector for the text we provide in the **page_content** parameter, and attaches the additional metadata to the same database record.


For more information check [Langchain Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html)

To use this, we need to import the Document class:

In [None]:
from langchain_core.documents import Document

### Vector Metadata

Let's talk about metadata. For every vector we store, we can accompany it with various metadata. In our case, we should definitely include the file name, because we are going to use it to check whether we have already processed a photo or not.

While handy for our program, this is not very useful for our problem. Why not include additional information that may be useful? For example, we could include the date and time the photo was taken, plus any number of interesting facts about the photo, such as the Make and the Model of the camera used.

This is all handy, because Langchain provides a very easy way to filter through the metadata before searching through the vector space.

Let's see how we will do this - fortunately Pillow is quite good for this:

In [None]:
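from PIL import Image
from PIL.ExifTags import TAGS

EXIF_IFD_POINTER = 0x8769  # tag pointing to the Exif sub-IFD (holds DateTimeOriginal etc.)

def get_exif_data(img:Image.Image)->dict:
    # getexif() is Pillow's public API; it returns an empty mapping when there is no EXIF data
    exif_data = img.getexif()
    exif_dict = {}
    for tag, value in exif_data.items():
        tag_name = TAGS.get(tag, tag)
        exif_dict[tag_name] = value
    # the base tags do not include the Exif sub-IFD, so merge it in explicitly
    for tag, value in exif_data.get_ifd(EXIF_IFD_POINTER).items():
        exif_dict[TAGS.get(tag, tag)] = value
    return exif_dict

def get_metadata(file_name:str, img:Image.Image)->dict: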
  exif = get_exif_data(img)
  make = exif.get('Make', '')
  model = exif.get('Model', '')
  dt = exif.get('DateTime', '') or exif.get('DateTimeOriginal', '')
  return {
      'file_name':file_name,
      'make':make,
      'model':model,
      'dt':dt
  }

Let's try this:

In [None]:
###############################################
# TEST

file_name = './photos/20220814172719_IMG_9572.JPG'
with Image.open(file_name) as img:
  test = get_metadata(file_name, img)
  print(test)
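
EXIF timestamps are typically stored as strings in the form `YYYY:MM:DD HH:MM:SS`. If you later want to filter by date range, it may help to also store a numeric value. The sketch below assumes that string format; `exif_dt_to_timestamp` is a hypothetical helper and the sample value is purely illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def exif_dt_to_timestamp(dt: str) -> Optional[float]:
    """Convert an EXIF 'YYYY:MM:DD HH:MM:SS' string to a Unix timestamp.

    The string carries no time zone, so UTC is assumed for the conversion.
    Returns None when the string is empty or not in the expected format.
    """
    try:
        parsed = datetime.strptime(dt.strip(), "%Y:%m:%d %H:%M:%S")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc).timestamp()


ts = exif_dt_to_timestamp("2022:08:14 17:27:19")  # illustrative value
```

If you add this to the metadata, it is probably safest to include the key only when the result is not `None`.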

## Get files that have already been processed

To get the names of the files we have already processed, we simply retrieve the corresponding metadata field, in our case `file_name`, from the database. To do this, we can query the database directly.

With a `persist_directory` set, Chroma keeps its data in a SQLite file - very handy, right?

In [None]:
from contextlib import closing
import sqlite3

def get_processed_files():
    with closing(sqlite3.connect("./db_photos/chroma.sqlite3")) as connection:
        sql = "select string_value from embedding_metadata where key='file_name'"
        rows = connection.execute(sql).fetchall()
        processed_files = [file_name for file_name, in rows]
    return processed_files
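
Reading the SQLite file directly depends on Chroma's internal table layout, which may change between versions. As an alternative sketch, the vector store's own `get` method can return the stored metadata. This assumes `get(include=['metadatas'])` returns a dictionary with a `metadatas` list, as the `langchain_chroma` wrapper does; the function name is hypothetical.

```python
def get_processed_files_via_api(store) -> set:
    """Collect the file_name values stored in the vector store's metadata.

    `store` is expected to be the Chroma object created earlier (`db`).
    """
    records = store.get(include=["metadatas"])
    return {
        metadata["file_name"]
        for metadata in records.get("metadatas") or []
        if metadata and "file_name" in metadata
    }
```

You would call it as `get_processed_files_via_api(db)`. Returning a set also makes the later `in` checks faster than scanning a list.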

## The Loop

All the pieces are there. Time to assemble them:

In [None]:
processed_files = get_processed_files()
for file_name in photo_files:
  if file_name not in processed_files:
    with Image.open(file_name) as img:
      print(f'Processing: {file_name}')
      imageb64 = convert_to_base64(img)
      llm_analysis = vision_chain.invoke({'imageb64':imageb64})
      metadata = get_metadata(file_name, img)
      document = Document(page_content=llm_analysis, metadata=metadata)
      db.add_documents([document])
      print('OK')
  else:
    print(f'File {file_name} has already been processed')
print('DONE!')

## Retrieval

The simplest way to retrieve our documents is by using similarity search:

In [None]:
results = db.similarity_search('jack', k=2)
print(results)

In [None]:
import os
for doc in results:
  print(doc.metadata['file_name'])
  print(doc)
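
Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The toy sketch below uses hand-made 3-dimensional vectors and cosine similarity to show the idea. The vectors and file names are invented, and Chroma's actual distance metric and indexing may differ.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list, stored: dict, k: int = 2) -> list:
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda name: cosine_similarity(query_vec, stored[name]),
        reverse=True,
    )
    return ranked[:k]


# Invented toy "embeddings" keyed by file name
toy_vectors = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "puppy.jpg": [0.2, 0.8, 0.2],
}
closest = top_k([0.0, 1.0, 0.1], toy_vectors, k=2)
```

The two dog-like vectors rank first because they point in nearly the same direction as the query; real embeddings work the same way, just with many more dimensions.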

We can also create a retriever:

In [None]:
retriever = db.as_retriever()
results = retriever.invoke('jack', k=2)

We can also initialize the retriever with a search type and a score threshold:

In [None]:
retriever = db.as_retriever(search_type='similarity_score_threshold', search_kwargs={'score_threshold': 0.2})
results = retriever.invoke('jack', k=2)

We can even filter our metadata:

In [None]:
retriever = db.as_retriever(search_kwargs={'filter': {'make':'Canon'}})
results = retriever.invoke('', k=2)
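
To combine several metadata conditions, Chroma generally expects them wrapped in an operator such as `$and` rather than listed as multiple keys in one dictionary. The helper below is a hypothetical sketch that builds such a filter from simple equality conditions; the camera model value is purely illustrative.

```python
def build_metadata_filter(conditions: dict) -> dict:
    """Build a Chroma-style metadata filter from simple equality conditions.

    A single condition is returned as-is; several are combined with $and.
    """
    clauses = [{key: value} for key, value in conditions.items()]
    if not clauses:
        raise ValueError("At least one condition is required")
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


single = build_metadata_filter({"make": "Canon"})
combined = build_metadata_filter({"make": "Canon", "model": "Canon EOS 80D"})
```

The result can then be passed along the same route as before, for example `db.as_retriever(search_kwargs={'filter': combined})`.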

# Where to go from here

First of all, you can check out the full example on my [Medium](https://medium.com/@dimosdennis/personal-photo-library-with-langchain-ollama-llava-fully-local-e82edfe07f54).

You can take those basic principles and apply them to your own use cases. RAG is the prime example - you can process some data (media or not), then do semantic retrieval and further generation. You could also extract entities and store them in a graph database. The world is your oyster.

AI is expensive - while we examined local execution, processing time quickly adds up, and time is a valuable commodity.

Enjoy!
