# kdb.ai Classification Network Template

This document is a template for using kdb.ai as a classification network on top of a pre-trained model.

The document is split into three sections. Section 1 creates embeddings for a data set, using a model that you have already created, and stores them in kdb.ai. Section 2 creates embeddings for your test images and is mandatory. Section 3 performs classification on the test images.

## Section 0: Setup

This section imports the required modules and defines the helper functions used throughout the document. Always run it before any other section.

In [1]:
import os

In [2]:
# ignore TensorFlow warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [3]:
# force tensorflow to use CPU only
os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [4]:
# download data
from zipfile import ZipFile

In [5]:
# embeddings
from tensorflow.keras.utils import image_dataset_from_directory
from huggingface_hub import from_pretrained_keras
from PIL import Image
import numpy as np
import pandas as pd
import tensorflow as tf

In [6]:
# timing
from tqdm.auto import tqdm

In [7]:
# vector DB
import kdbai_client as kdbai
from getpass import getpass
import time

In [8]:
from pathlib import Path
import imghdr

In [10]:
# connect to a kdb.ai instance (kdbai_client is already imported above; adjust the endpoint to your setup)
session = kdbai.Session(endpoint='http://localhost:8082')

In [11]:
import math
import statistics

### Defining Helper Functions:

In [12]:
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

In [13]:
def extract_file_paths_from_folder(parent_dir: str) -> dict:
    image_paths = {}
    for sub_folder in os.listdir(parent_dir):
        sub_dir = os.path.join(parent_dir, sub_folder)
        image_paths[sub_folder] = [
            os.path.join(sub_dir, file) for file in os.listdir(sub_dir)
        ]
    return image_paths

## Section 1: Creating Embeddings for Dataset

This section creates embeddings for your data set and stores them in kdb.ai. If you have already stored your embeddings in a kdb.ai table, you may skip to Section 2.

### IMPORTANT

The following cell searches your data folder for files that are incompatible with TensorFlow and deletes them from the data folder. If you want to keep these images, make sure the data set is backed up elsewhere first.

In [14]:
data_dir = "data/"
image_extensions = [".png", ".jpg", ".jpeg"]  # add all of your image file extensions here

img_type_accepted_by_tf = ["bmp", "gif", "jpeg", "png"]
for filepath in Path(data_dir).rglob("*"):
    if filepath.suffix.lower() in image_extensions:
        img_type = imghdr.what(filepath)
        if img_type is None:
            print(f"{filepath} is not an image")
            print(f"Deleting {filepath} from data folder")
            os.remove(filepath)  # pass the path itself, not the literal string "{filepath}"
        elif img_type not in img_type_accepted_by_tf:
            print(f"{filepath} is a {img_type}, not accepted by TensorFlow")
            print(f"Deleting {filepath} from data folder")
            os.remove(filepath)
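`imghdr` has been deprecated since Python 3.11 and was removed in Python 3.13, so the `import imghdr` line in the setup section fails on newer interpreters. The sketch below is one possible stand-in, not part of the original template: it checks only the leading bytes of each file against the four formats in `img_type_accepted_by_tf`, and it reports the files it would flag (a dry run) instead of deleting them. The names `sniff_image_type` and `find_unreadable_images` are hypothetical.

```python
from pathlib import Path
import tempfile

# Leading "magic" bytes of the four formats listed in img_type_accepted_by_tf.
SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
    "bmp": (b"BM",),
}

def sniff_image_type(path):
    """Return 'png', 'jpeg', 'gif' or 'bmp' based on the file header, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    for img_type, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):
            return img_type
    return None

def find_unreadable_images(data_dir, extensions=(".png", ".jpg", ".jpeg")):
    """Dry run: list image-named files whose header matches none of the signatures."""
    return sorted(
        p for p in Path(data_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions and sniff_image_type(p) is None
    )

# Demo on a throwaway folder: one file with a PNG header, one text file named .jpg.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "Healthy" / "leaf.png"
    bad = Path(tmp) / "Healthy" / "notes.jpg"
    good.parent.mkdir()
    good.write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    bad.write_bytes(b"this is not an image")
    good_type = sniff_image_type(good)
    flagged = [p.name for p in find_unreadable_images(tmp)]

print(good_type, flagged)
```

Unlike `imghdr.what`, this sketch does not recognise other image formats, so it cannot distinguish "not an image" from "an image type TensorFlow does not accept". Reviewing the flagged list before deleting anything keeps the step reversible.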

### Loading Image Data

In [15]:
image_paths_map = extract_file_paths_from_folder("data")

In [16]:
dataset = image_dataset_from_directory(
    "data",
    labels="inferred",
    label_mode="categorical",
    shuffle=False,
    seed=1,
    image_size=(224, 224),
    batch_size=1,
)

Found 2396 files belonging to 5 classes.


### Creating Vector Embeddings

In [17]:
model = tf.keras.models.load_model('saved_model/your_model')

In [17]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 224, 224, 3)]        0         []                            
                                                                                                  
 regnetx064_prestem_rescali  (None, 224, 224, 3)          0         ['input_1[0][0]']             
 ng (Rescaling)                                                                                   
                                                                                                  
 regnetx064_stem_conv (Conv  (None, 112, 112, 32)         864       ['regnetx064_prestem_rescaling
 2D)                                                                [0][0]']                      
                                                                                              

In [18]:
# create empty arrays to store the embeddings and labels
embeddings = np.empty([len(dataset), 2048])  # 2048 must match the length of your model's output vector
labels = np.empty([len(dataset), 5])  # replace 5 with the number of classes in your data set

In [19]:
# for each image in dataset, get its embedding and class label
for i, image in tqdm(enumerate(dataset), total=len(dataset)):
    embeddings[i, :] = model.predict(image[0], verbose=0)
    labels[i, :] = image[1]



### Defining Class Labels

In [20]:
sorted(image_paths_map.keys())

['.ipynb_checkpoints', 'Healthy', 'Mosaic', 'RedRot', 'Rust', 'Yellow']

If the list contains entries that are not class names (such as '.ipynb_checkpoints' above), remove them with the following cell:

In [21]:
del image_paths_map['.ipynb_checkpoints']
sorted(image_paths_map.keys())

['Healthy', 'Mosaic', 'RedRot', 'Rust', 'Yellow']

And then continue from here:

In [22]:
# list the classification types in sorted order
classification_types = sorted(image_paths_map.keys())

In [23]:
# for each one-hot label vector, look up the class name at the index of its largest value
class_labels = [classification_types[label.argmax()] for label in labels]
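To make the lookup above concrete: with `label_mode="categorical"` each label is a one-hot vector, so the position of its largest value indexes into the sorted class list. The following is a small plain-Python sketch with made-up label rows; `label_from_one_hot` is a hypothetical helper, not part of the template.

```python
# Class names in sorted order, as produced by sorted(image_paths_map.keys()).
classification_types = ["Healthy", "Mosaic", "RedRot", "Rust", "Yellow"]

# Made-up one-hot rows standing in for the `labels` array.
example_labels = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]

def label_from_one_hot(row, class_names):
    """Return the class name at the position of the largest value in row."""
    return class_names[row.index(max(row))]

example_class_labels = [
    label_from_one_hot(row, classification_types) for row in example_labels
]
print(example_class_labels)
```

The mapping is only correct if `classification_types` is in the same order as the class indices assigned when the dataset was loaded, which is why stray entries such as '.ipynb_checkpoints' must be removed first.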

### Defining Image Filepaths

In [24]:
# get a single list of all paths
all_paths = []
for _, image_paths in image_paths_map.items():
    all_paths += image_paths

In [25]:
# sort the file paths in alphanumeric order (the dataset was loaded with shuffle=False, which also sorts alphanumerically)
sorted_all_paths = sorted(all_paths)

### Defining Embedding Dataframe

In [32]:
embedded_df = pd.DataFrame(
    {
        "source": sorted_all_paths,
        "class": class_labels,
        "embedding": embeddings.tolist(),
    }
)

If the previous cell raises an error stating that the arrays must all be the same length, `sorted_all_paths` probably contains a '.ipynb_checkpoints' entry for each class folder. The following cells remove these entries; make sure there is one cell for each class in your data set:

In [27]:
sorted_all_paths.remove('data/Healthy/.ipynb_checkpoints')

In [28]:
sorted_all_paths.remove('data/Mosaic/.ipynb_checkpoints')

In [29]:
sorted_all_paths.remove('data/Rust/.ipynb_checkpoints')

In [30]:
sorted_all_paths.remove('data/RedRot/.ipynb_checkpoints')

In [31]:
sorted_all_paths.remove('data/Yellow/.ipynb_checkpoints')
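As an alternative to one `remove` call per class, the checkpoint entries can be filtered out in a single pass, which also avoids a `ValueError` when a class folder has no such entry. This is a minimal sketch; `drop_checkpoint_paths` is a hypothetical helper and the example paths are made up.

```python
from pathlib import PurePath

def drop_checkpoint_paths(paths):
    """Keep only paths that have no '.ipynb_checkpoints' component."""
    return [p for p in paths if ".ipynb_checkpoints" not in PurePath(p).parts]

# Made-up stand-in for sorted_all_paths.
example_paths = [
    "data/Healthy/.ipynb_checkpoints",
    "data/Healthy/healthy (121).jpeg",
    "data/Mosaic/.ipynb_checkpoints",
    "data/Mosaic/example.jpeg",
]
cleaned_paths = drop_checkpoint_paths(example_paths)
print(cleaned_paths)
```

In the notebook this would be applied as `sorted_all_paths = drop_checkpoint_paths(sorted_all_paths)` before building the DataFrame.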

Once these entries have been removed, re-run the DataFrame cell above and then continue from here:

In [33]:
show_df(embedded_df)

(2396, 3)


| | source | class | embedding |
|---|---|---|---|
| 0 | data/Healthy/healthy (121).jpeg | Healthy | [0.9535317420959473, 0.038117386400699615, 0.0... |
| 1 | data/Healthy/healthy (122).jpeg | Healthy | [0.8989417552947998, 0.06690378487110138, 0.00... |
| 2 | data/Healthy/healthy (123).jpeg | Healthy | [0.5552961826324463, 0.3637266755104065, 0.025... |
| 3 | data/Healthy/healthy (124).jpeg | Healthy | [0.9178681373596191, 0.03479386866092682, 0.00... |
| 4 | data/Healthy/healthy (125).jpeg | Healthy | [0.9047070741653442, 0.036002758890390396, 0.0... |


### Defining Vector DB Schema

In [34]:
image_schema = {
    "columns": [
        {"name": "source", "pytype": "str"},
        {"name": "class", "pytype": "str"},
        {
            "name": "embedding",
            "vectorIndex": {"dims": 2048, "metric": "L2", "type": "hnsw"},
        },
    ]
}
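The `vectorIndex` entry asks for an HNSW index over 2048-dimensional vectors compared with the L2 (Euclidean) metric. To illustrate what that metric ranks by, the sketch below performs an exact brute-force nearest-neighbour search in plain Python. It is not how kdb.ai searches (HNSW is an approximate graph-based index); the 3-dimensional toy vectors and the name `nearest_l2` are made up for the example.

```python
import math

def nearest_l2(query, vectors, n):
    """Return the indices of the n vectors closest to query by Euclidean distance."""
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(distances)[:n]]

# Toy stored embeddings (the real ones have 2048 dimensions).
stored = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
]
query = [0.88, 0.12, 0.0]

closest = nearest_l2(query, stored, n=2)
print(closest)
```

The `dims` value in the schema has to equal the length of every embedding inserted into the table, so it should be changed together with the embedding array width if a different model is used.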

### Creating Vector DB Table

In [35]:
# ensure the table does not already exist
try:
    session.table("regnetx064").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [36]:
table = session.create_table("regnetx064", image_schema)

### Adding embedded data to the table

This stage may require extra steps, depending on how large your embedded data set is. The `insert` command used here can only accept a limited number of bytes per call; as a rule of thumb, 10 MB is the most that can be inserted at once.

The following cell gives a rough estimate of the size of your embedded data set in megabytes:

In [37]:
# convert bytes to MB
embedded_df.memory_usage(deep=True).sum() / (1024**2)

37.92246055603027

If the data set is comfortably below the 10 MB limit, you should be able to insert the embeddings into the table in one step using the following:

In [None]:
table.insert(embedded_df)

If the data set is larger than 10 MB, you will need to divide it into smaller blocks. This can be done using the following cells.

First, get a rough estimate of how many rows each block can hold:

In [38]:
megab = embedded_df.memory_usage(deep=True).sum() / (1024**2)  # total size in MB
megabs = megab/10  # number of 10 MB blocks needed
blocks = len(embedded_df)/megabs  # rows per 10 MB block
math.floor(blocks)

631
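The arithmetic behind this estimate can be checked by hand with the numbers shown above: 37.92 MB divided by the 10 MB limit gives about 3.79 blocks, and 2396 rows divided by 3.79 gives about 631 rows per block. The sketch below wraps the same calculation in a hypothetical function, `rows_per_block`, using those figures.

```python
import math

def rows_per_block(total_bytes, n_rows, limit_mb=10):
    """Estimate how many rows fit in one block of at most limit_mb megabytes."""
    total_mb = total_bytes / (1024**2)
    n_blocks = total_mb / limit_mb
    return math.floor(n_rows / n_blocks)

# Figures taken from the cell outputs above: 37.92 MB across 2396 rows.
total_bytes = 37.92246055603027 * 1024**2
estimate = rows_per_block(total_bytes, n_rows=2396)
print(estimate)
```

Because `memory_usage` measures the in-memory size of the DataFrame rather than the size of the data actually sent by `insert`, the result is only a starting point, and choosing a block size below it leaves some margin.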

The block-size estimate is a rough upper limit on the number of rows in each block. The `divide_chunks` cell below splits the data set into blocks of a chosen size `n`; start with a value at or below the estimate.

In [39]:
# yield successive n-sized chunks from l
def divide_chunks(l, n):
    # step through l in strides of n
    for i in range(0, len(l), n):
        yield l[i:i + n]

# number of rows each chunk should have
n = 500

In [40]:
embedded_df_split = list(divide_chunks(embedded_df, n))

Now that the data set has been split into smaller blocks, it can be inserted into the KDB.AI table using the following cell:

In [41]:
for chunk in embedded_df_split:
    table.insert(chunk)



If you still receive an error, reduce the block size and try again; repeat until the blocks are small enough to be inserted into the table.
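The retry-with-smaller-blocks advice can also be automated. The sketch below halves the block size whenever an insert fails and continues from the first row that has not yet been inserted. It is a minimal illustration under assumptions: `insert_with_backoff` and `fake_insert` are hypothetical names, and a generic `Exception` is caught because the source does not say which error the client raises for an oversized insert.

```python
def insert_with_backoff(rows, insert_fn, chunk_size, min_chunk_size=1):
    """Insert rows in chunks, halving chunk_size each time insert_fn raises."""
    start = 0
    inserted_sizes = []
    while start < len(rows):
        chunk = rows[start:start + chunk_size]
        try:
            insert_fn(chunk)
        except Exception:
            if chunk_size <= min_chunk_size:
                raise  # cannot shrink any further, surface the error
            chunk_size = max(min_chunk_size, chunk_size // 2)
            continue  # retry the same rows with a smaller chunk
        inserted_sizes.append(len(chunk))
        start += len(chunk)
    return inserted_sizes

# Stand-in for table.insert that rejects any chunk larger than 3 rows.
stored_rows = []

def fake_insert(chunk):
    if len(chunk) > 3:
        raise ValueError("chunk too large")
    stored_rows.extend(chunk)

sizes = insert_with_backoff(list(range(10)), fake_insert, chunk_size=8)
print(sizes, stored_rows)
```

In the notebook the call would look like `insert_with_backoff(embedded_df, table.insert, chunk_size=500)`, assuming that a failed insert raises an exception and stores none of the rows in the rejected block.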

You can now verify that the data has been inserted into the table with the following cell:

In [42]:
table.query()

| | source | class | embedding |
|---|---|---|---|
| 0 | data/Healthy/healthy (121).jpeg | Healthy | [0.9535317420959473, 0.038117386400699615, 0.0... |
| 1 | data/Healthy/healthy (122).jpeg | Healthy | [0.8989417552947998, 0.06690378487110138, 0.00... |
| 2 | data/Healthy/healthy (123).jpeg | Healthy | [0.5552961826324463, 0.3637266755104065, 0.025... |
| 3 | data/Healthy/healthy (124).jpeg | Healthy | [0.9178681373596191, 0.03479386866092682, 0.00... |
| 4 | data/Healthy/healthy (125).jpeg | Healthy | [0.9047070741653442, 0.036002758890390396, 0.0... |
| ... | ... | ... | ... |
| 2391 | data/Yellow/yellow (95).jpeg | Yellow | [0.36971867084503174, 0.18064580857753754, 0.0... |
| 2392 | data/Yellow/yellow (96).jpeg | Yellow | [0.004562804941087961, 0.011000572703778744, 0... |
| 2393 | data/Yellow/yellow (97).jpeg | Yellow | [0.015184606425464153, 0.04275369271636009, 0.... |
| 2394 | data/Yellow/yellow (98).jpeg | Yellow | [0.0012440424179658294, 0.002134703565388918, ... |


## Section 2: Creating Embeddings for Test Image

Next, embeddings must be created for the images that you want to classify. This is done in the same way as for the data set, but using the "search" folder rather than the data folder.

If you have already created a table in kdb.ai and inserted data into it, you can retrieve it using the following cell. This saves you from recreating the embeddings each time the model is used.

In [None]:
table = session.table("yourTable")

### IMPORTANT

The following cell checks the search folder and deletes any files that are incompatible with TensorFlow. Back up any images you do not want to lose.

In [43]:
data_dir = "search/"
image_extensions = [".png", ".jpg", ".jpeg"]  # add all of your image file extensions here

img_type_accepted_by_tf = ["bmp", "gif", "jpeg", "png"]
for filepath in Path(data_dir).rglob("*"):
    if filepath.suffix.lower() in image_extensions:
        img_type = imghdr.what(filepath)
        if img_type is None:
            print(f"{filepath} is not an image")
            print(f"Deleting {filepath} from search folder")
            os.remove(filepath)  # pass the path itself, not the literal string "{filepath}"
        elif img_type not in img_type_accepted_by_tf:
            print(f"{filepath} is a {img_type}, not accepted by TensorFlow")
            print(f"Deleting {filepath} from search folder")
            os.remove(filepath)

### Loading Search Images

In [45]:
search_image_paths_map = extract_file_paths_from_folder("search")

In [46]:
search_dataset = image_dataset_from_directory(
    "search",
    labels="inferred",
    label_mode="categorical",
    shuffle=False,
    seed=1,
    image_size=(224, 224),
    batch_size=1,
)

Found 125 files belonging to 1 classes.


### Create Search Embeddings

In [47]:
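# create empty arrays to store the search embeddings and labels
search_embeddings = np.empty([len(search_dataset), 2048])  # 2048 must match the length of your model's output vector
search_labels = np.empty([len(search_dataset), 1])  # the search folder holds a single class folder, so each label has length 1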

In [48]:
# for each image in dataset, get its embedding and class label
for i, image in tqdm(enumerate(search_dataset), total=len(search_dataset)):
    search_embeddings[i, :] = model.predict(image[0], verbose=0)
    search_labels[i, :] = image[1]


### Defining Test Classification Name

In [49]:
# placeholder class name assigned to every search image
search_classification_types = "test"

In [50]:
search_class_labels = [search_classification_types for search_label in search_labels]

### Defining Image Filepaths

In [51]:
search_paths = []
for _, image_paths in search_image_paths_map.items():
    search_paths += image_paths

In [52]:
sorted_search_paths = sorted(search_paths)

### Defining Embedding Dataframe

In [53]:
search_embedded_df = pd.DataFrame(
    {
        "source": sorted_search_paths,
        "class": search_class_labels,
        "embedding": search_embeddings.tolist(),
    }
)

You may need to remove the '.ipynb_checkpoints' entry here too. If the previous cell raises a length error, run the following cell and then re-run the DataFrame cell above:

In [None]:
sorted_search_paths.remove('search/test/.ipynb_checkpoints')

In [54]:
show_df(search_embedded_df)

(125, 3)


| | source | class | embedding |
|---|---|---|---|
| 0 | search/test/healthy (1).jpeg | test | [0.6843213438987732, 0.22339944541454315, 0.02... |
| 1 | search/test/healthy (10).jpeg | test | [0.3495194613933563, 0.4964156150817871, 0.059... |
| 2 | search/test/healthy (100).jpeg | test | [0.3710249960422516, 0.4820334315299988, 0.052... |
| 3 | search/test/healthy (101).jpeg | test | [0.548697292804718, 0.21445149183273315, 0.096... |
| 4 | search/test/healthy (102).jpeg | test | [0.6019561886787415, 0.35139545798301697, 0.01... |


## Section 3: Classifying the image

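In [None]:
# take the embedding (third column) of the first search image
test_embedding = search_embedded_df.iloc[0,2]

In [None]:
# retrieve the 400 nearest stored embeddings
results_1 = table.search([test_embedding], n=400)

In [None]:
# one query vector was passed, so take the first (and only) result set
results_2 = results_1[0]

In [None]:
# the most common class among the neighbours is the predicted class
statistics.mode(results_2.iloc[:,1])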
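The cells above amount to a nearest-neighbour vote: fetch the 400 closest stored embeddings and take the most common class among them. The sketch below isolates the voting step on made-up neighbour labels. `majority_vote` is a hypothetical helper that also returns the winner's share of the votes, which may be useful as a rough indication of how clear-cut the result is.

```python
from collections import Counter

def majority_vote(neighbour_classes):
    """Return (winning class, share of votes) for a list of neighbour class labels."""
    counts = Counter(neighbour_classes)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(neighbour_classes)

# Made-up class column of a search result with 10 neighbours.
neighbour_classes = ["Healthy"] * 6 + ["Mosaic"] * 3 + ["Rust"] * 1

label, share = majority_vote(neighbour_classes)
print(label, share)
```

In the notebook the input would be the class column of the result, for example `list(results_2.iloc[:,1])`.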

### Alternate: Classifying multiple images

The following section can be used to classify a list of images rather than just one:

In [55]:
pd.set_option('display.max_rows', None)

In [56]:
classifications = []
for i in range(len(search_embedded_df)):
    query_embedding = search_embedded_df.iloc[i,2]
    neighbours = table.search([query_embedding], n=400)[0]
    # majority vote over the classes of the nearest neighbours
    classifications.append(statistics.mode(neighbours.iloc[:,1]))

In [57]:
classification_list = pd.DataFrame(
    {
        "source": sorted_search_paths,
        "classification": classifications,
    }
)

In [58]:
classification_list

| | source | classification |
|---|---|---|
| 0 | search/test/healthy (1).jpeg | Healthy |
| 1 | search/test/healthy (10).jpeg | Healthy |
| 2 | search/test/healthy (100).jpeg | Healthy |
| 3 | search/test/healthy (101).jpeg | Healthy |
| 4 | search/test/healthy (102).jpeg | Healthy |
| 5 | search/test/healthy (103).jpeg | Healthy |
| 6 | search/test/healthy (104).jpeg | Healthy |
| 7 | search/test/healthy (105).jpeg | Healthy |
| 8 | search/test/healthy (106).jpeg | Healthy |
| 9 | search/test/healthy (107).jpeg | Healthy |
