# kdb.ai Classification Network Template

This document is a template for converting kdb.ai into a classification network based on pre-defined models.

This document is split into three sections. Section 1 creates embeddings for a data set, using a model that you have already created, and stores them in kdb.ai. Section 2 creates embeddings for your test images and is mandatory. Section 3 performs classification on the test images.

## Section 0: Setup

This section imports the required modules and defines the helper functions used throughout this document. Always run it before using the rest of the document.

In [2]:
import os

In [3]:
# ignore TensorFlow warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [4]:
# force tensorflow to use CPU only
os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [5]:
# download data
from zipfile import ZipFile

In [6]:
# embeddings
from tensorflow.keras.utils import image_dataset_from_directory
from huggingface_hub import from_pretrained_keras
from PIL import Image
import numpy as np
import pandas as pd
import tensorflow as tf

In [7]:
# timing
from tqdm.auto import tqdm

In [8]:
# vector DB
import kdbai_client as kdbai
from getpass import getpass
import time

In [9]:
from pathlib import Path
import imghdr

In [10]:
# connect to the kdb.ai server (kdbai_client is already imported above)
session = kdbai.Session(endpoint='http://localhost:8082')

In [11]:
import math
import statistics

### Defining Helper Functions:

In [12]:
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

In [13]:
def extract_file_paths_from_folder(parent_dir: str) -> dict:
    image_paths = {}
    for sub_folder in os.listdir(parent_dir):
        sub_dir = os.path.join(parent_dir, sub_folder)
        image_paths[sub_folder] = [
            os.path.join(sub_dir, file) for file in os.listdir(sub_dir)
        ]
    return image_paths
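
Later cells in this document remove '.ipynb_checkpoints' entries by hand. As a minimal sketch of an alternative, assuming that every hidden entry (any name starting with a dot) should be ignored, the helper could skip such entries while listing files. The name `extract_visible_file_paths` is hypothetical and not part of the original notebook.

```python
import os


def extract_visible_file_paths(parent_dir: str) -> dict:
    """Map each class sub-folder to its files, skipping hidden entries."""
    image_paths = {}
    for sub_folder in sorted(os.listdir(parent_dir)):
        sub_dir = os.path.join(parent_dir, sub_folder)
        # skip hidden entries (e.g. .ipynb_checkpoints) and stray files
        if sub_folder.startswith(".") or not os.path.isdir(sub_dir):
            continue
        image_paths[sub_folder] = [
            os.path.join(sub_dir, name)
            for name in sorted(os.listdir(sub_dir))
            if not name.startswith(".")
        ]
    return image_paths
```

It takes the same argument as `extract_file_paths_from_folder` and returns a dictionary of the same shape, so it could be swapped in wherever that helper is called.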

## Section 1: Creating Embeddings for Dataset

The following section should be used to create embeddings and store them in kdb.ai. If you have already stored your embeddings in a kdb.ai session, you may skip to section 2.

### IMPORTANT

The following cell searches your data folder for files that are incompatible with TensorFlow and deletes them from the data folder. If you want to keep these images, make sure the data set is backed up elsewhere before running the cell.

In [None]:
data_dir = "data/"
image_extensions = [".png", ".jpg", ".jpeg"]  # add any other image file extensions you use

img_type_accepted_by_tf = ["bmp", "gif", "jpeg", "png"]
for filepath in Path(data_dir).rglob("*"):
    if filepath.suffix.lower() in image_extensions:
        img_type = imghdr.what(filepath)
        if img_type is None:
            print(f"{filepath} is not an image")
            print(f"Deleting {filepath} from data folder")
            os.remove(filepath)  # pass the path itself, not the literal string "{filepath}"
        elif img_type not in img_type_accepted_by_tf:
            print(f"{filepath} is a {img_type}, not accepted by TensorFlow")
            print(f"Deleting {filepath} from data folder")
            os.remove(filepath)
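
Note that `imghdr` was deprecated in Python 3.11 and removed in Python 3.13, so the import in Section 0 may fail on recent interpreters. The following is a minimal sketch of one alternative, assuming that checking the leading bytes of each file is an acceptable substitute for `imghdr.what`. The names `SIGNATURES`, `sniff_image_type` and `find_unreadable_images` are hypothetical and not part of the original notebook. Unlike the cell above, the sketch only lists suspect files and leaves any deletion to you.

```python
from pathlib import Path

# leading bytes ("magic numbers") of the formats listed in img_type_accepted_by_tf
SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
    "bmp": (b"BM",),
}


def sniff_image_type(path):
    """Return 'png', 'jpeg', 'gif' or 'bmp' based on the file header, else None."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    for image_type, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):
            return image_type
    return None


def find_unreadable_images(data_dir, extensions=(".png", ".jpg", ".jpeg")):
    """List files with an image extension whose header matches no known format."""
    return [
        path
        for path in sorted(Path(data_dir).rglob("*"))
        if path.is_file()
        and path.suffix.lower() in extensions
        and sniff_image_type(path) is None
    ]
```

A header check is weaker than decoding the whole file, so a file that passes it may still fail to load. Calling `find_unreadable_images("data/")` and reviewing the returned list before deleting anything keeps the clean-up step reversible.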

### Loading Image Data

In [None]:
image_paths_map = extract_file_paths_from_folder("data")

In [None]:
dataset = image_dataset_from_directory(
    "data",
    labels="inferred",
    label_mode="categorical",
    shuffle=False,
    seed=1,
    image_size=(224, 224),
    batch_size=1,
)

### Creating Vector Embeddings

In [None]:
model = tf.keras.models.load_model('saved_model/your_model')

In [None]:
model.summary()

In [None]:
# create empty arrays to store the embeddings and labels
# 2048 must match the length of the vector returned by your model
embeddings = np.empty([len(dataset), 2048])
labels = np.empty([len(dataset), 5])  # replace 5 with the number of classes in your data set

In [None]:
# for each image in dataset, get its embedding and class label
for i, image in tqdm(enumerate(dataset), total=len(dataset)):
    embeddings[i, :] = model.predict(image[0], verbose=0)
    labels[i, :] = image[1]

### Defining Class Labels

In [None]:
sorted(image_paths_map.keys())

If the list contains entries that are not class names, such as '.ipynb_checkpoints', remove them with the following cell:

In [None]:
del image_paths_map['.ipynb_checkpoints']
sorted(image_paths_map.keys())

And then continue from here:

In [None]:
# list the classification types in sorted order
classification_types = sorted(image_paths_map.keys())

In [None]:
# for each vector, save the classification type given by the index of the highest label value
class_labels = [classification_types[label.argmax()] for label in labels]

### Defining Image Filepaths

In [None]:
# get a single list of all paths
all_paths = []
for _, image_paths in image_paths_map.items():
    all_paths += image_paths

In [None]:
# sort the source_files in alphanumeric order
sorted_all_paths = sorted(all_paths)

### Defining Embedding Dataframe

In [None]:
embedded_df = pd.DataFrame(
    {
        "source": sorted_all_paths,
        "class": class_labels,
        "embedding": embeddings.tolist(),
    }
)

If the previous cell raises an error stating that the arrays must all be of the same length, you may need to remove the '.ipynb_checkpoints' entry for each classification from the list of paths. The following cells do this. Repeat the cell once for every classification in your data set:

In [None]:
sorted_all_paths.remove('data/Classname1/.ipynb_checkpoints')

In [None]:
sorted_all_paths.remove('data/Classname2/.ipynb_checkpoints')

After removing the entry for each classification, re-run the cell that defines embedded_df and then continue from here:

In [None]:
show_df(embedded_df)

### Defining Vector DB Schema

In [None]:
image_schema = {
    "columns": [
        {"name": "source", "pytype": "str"},
        {"name": "class", "pytype": "str"},
        {
            "name": "embedding",
            "vectorIndex": {"dims": 2048, "metric": "L2", "type": "hnsw"},
        },
    ]
}

### Creating Vector DB Table

In [None]:
# ensure the table does not already exist
# replace "yourTable" with your own table name, and use the same name in the next cell
try:
    session.table("yourTable").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [None]:
table = session.create_table("yourTable", image_schema)

### Adding embedded data to the table

This stage may require extra steps, depending on how large your embedding data set is. The "insert" command used here can only insert a limited number of bytes at once. As a rule of thumb, 10 MB is the maximum amount of data that can be inserted in one call.

The following cell gives a rough estimate of the size of your embedding data set in megabytes:

In [None]:
# convert bytes to MB
embedded_df.memory_usage(deep=True).sum() / (1024**2)

If the data set is comfortably below the 10 MB limit, you should be able to insert the embeddings into the table in one step using the following:

In [None]:
table.insert(embedded_df)

If the data set is larger than 10 MB, you will need to divide it into smaller blocks. This can be done using the following cells.

First, get a rough estimate of how many rows should go in each block. This can be done with the following cell:

In [None]:
megab = embedded_df.memory_usage(deep=True).sum() / (1024**2)  # total size in MB
megabs = megab/10  # number of 10 MB blocks needed
blocks = len(embedded_df)/megabs  # rows per block
math.floor(blocks)

The previous cell provides a rough upper limit for the number of rows in each block. The following cell defines a function that breaks the data set into blocks of a specified size. Try it with the estimated block size.

In [None]:
# yield successive n-row chunks from l
def divide_chunks(l, n):
    # step through l in strides of n
    for i in range(0, len(l), n):
        yield l[i:i + n]

# number of rows in each chunk; replace 500 with your estimated block size
n = 500

In [None]:
embedded_df_split = list(divide_chunks(embedded_df, n))

Now that the data set has been split into smaller blocks, it can be inserted into the kdb.ai table using the following cell:

In [None]:
for chunk in embedded_df_split:
    table.insert(chunk)

If you still receive an error, split the data set into smaller blocks and try again. Repeat until the blocks are small enough to be inserted into the table.
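
The estimate-then-retry procedure described above can be sketched as a pair of helpers. This is a minimal sketch under stated assumptions: the names `rows_per_block` and `insert_in_blocks` are hypothetical, `insert_fn` stands in for `table.insert`, a failed insert is assumed to write nothing, and any exception is treated as "block too large" because the specific error type is not given in this document.

```python
import math


def rows_per_block(total_bytes, n_rows, limit_mb=10):
    """Estimate how many rows fit in one block of at most limit_mb megabytes."""
    total_mb = total_bytes / (1024 ** 2)
    if total_mb <= limit_mb:
        return n_rows
    return max(1, math.floor(n_rows * limit_mb / total_mb))


def insert_in_blocks(rows, insert_fn, block_size):
    """Insert rows block by block, halving the block size whenever insert_fn raises."""
    start = 0
    while start < len(rows):
        block = rows[start:start + block_size]
        try:
            insert_fn(block)
        except Exception:
            if block_size == 1:
                raise  # a single row still fails, so a smaller block cannot help
            block_size = max(1, block_size // 2)
            continue  # retry the same position with a smaller block
        start += len(block)
    return block_size  # the block size that finally worked
```

With the objects from this notebook, a call might look like `insert_in_blocks(embedded_df, table.insert, rows_per_block(embedded_df.memory_usage(deep=True).sum(), len(embedded_df)))`, since slicing a DataFrame with `df[a:b]` selects rows in the same way as `divide_chunks` does.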

You can now verify that the data has been inserted into the table with the following cell:

In [None]:
table.query()

## Section 2: Creating Embeddings for Test Image

Next, embeddings have to be created for the images that you want to classify. This is done in the same way as for the previous embeddings, but the images are read from the "search" folder rather than the data folder.

If you have already created a table in kdb.ai and inserted data into it, you can retrieve it using the following cell. This is useful because you do not need to recreate the embeddings each time the model is used.

In [15]:
# replace "regnetx064" with the name of the table you created in Section 1
table = session.table("regnetx064")

Loading in the model:

In [16]:
model = tf.keras.models.load_model('saved_model/your_model')

### IMPORTANT

The following cell checks the search folder for files that are incompatible with TensorFlow and deletes them. Please back up any images you do not want to lose.

In [17]:
data_dir = "search/"
image_extensions = [".png", ".jpg", ".jpeg"]  # add any other image file extensions you use

img_type_accepted_by_tf = ["bmp", "gif", "jpeg", "png"]
for filepath in Path(data_dir).rglob("*"):
    if filepath.suffix.lower() in image_extensions:
        img_type = imghdr.what(filepath)
        if img_type is None:
            print(f"{filepath} is not an image")
            print(f"Deleting {filepath} from search folder")
            os.remove(filepath)  # pass the path itself, not the literal string "{filepath}"
        elif img_type not in img_type_accepted_by_tf:
            print(f"{filepath} is a {img_type}, not accepted by TensorFlow")
            print(f"Deleting {filepath} from search folder")
            os.remove(filepath)

### Loading Search Images

In [18]:
search_image_paths_map = extract_file_paths_from_folder("search")

In [19]:
search_dataset = image_dataset_from_directory(
    "search",
    labels="inferred",
    label_mode="categorical",
    shuffle=False,
    seed=1,
    image_size=(224, 224),
    batch_size=1,
)

Found 125 files belonging to 1 classes.


### Create Search Embeddings

In [20]:
search_embeddings = np.empty([len(search_dataset), 2048])
search_labels = np.empty([len(search_dataset), 1])

In [21]:
# for each image in dataset, get its embedding and class label
for i, image in tqdm(enumerate(search_dataset), total=len(search_dataset)):
    search_embeddings[i, :] = model.predict(image[0], verbose=0)
    search_labels[i, :] = image[1]


### Defining Test Classification Name

In [22]:
search_classification_types = "test"

In [23]:
search_class_labels = [search_classification_types for search_label in search_labels]

### Defining Image Filepaths

In [24]:
search_paths = []
for _, image_paths in search_image_paths_map.items():
    search_paths += image_paths

In [25]:
sorted_search_paths = sorted(search_paths)

### Defining Embedding Dataframe

In [26]:
search_embedded_df = pd.DataFrame(
    {
        "source": sorted_search_paths,
        "class": search_class_labels,
        "embedding": search_embeddings.tolist(),
    }
)

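You may need to remove the '.ipynb_checkpoints' entry here too. If the previous cell raises an error about array lengths, run the following cell and then re-run the cell that defines search_embedded_df:

In [None]:
sorted_search_paths.remove('search/test/.ipynb_checkpoints')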

Then continue from here:

In [27]:
show_df(search_embedded_df)

(125, 3)


|   | source | class | embedding |
|---|---|---|---|
| 0 | search/test/healthy (1).jpeg | test | [0.6843213438987732, 0.22339944541454315, 0.02... |
| 1 | search/test/healthy (10).jpeg | test | [0.3495194613933563, 0.4964156150817871, 0.059... |
| 2 | search/test/healthy (100).jpeg | test | [0.3710249960422516, 0.4820334315299988, 0.052... |
| 3 | search/test/healthy (101).jpeg | test | [0.548697292804718, 0.21445149183273315, 0.096... |
| 4 | search/test/healthy (102).jpeg | test | [0.6019561886787415, 0.35139545798301697, 0.01... |


## Section 3: Classifying the image

In [None]:
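# take the embedding (third column) of the first search image
test_embedding = search_embedded_df.iloc[0,2]

In [None]:
# retrieve the 400 nearest neighbours of the test embedding
results_1 = table.search([test_embedding], n=400)

In [None]:
# keep the results for the first (and only) query vector
results_2 = results_1[0]

In [None]:
# classify the image as the most common class (second column) among its neighbours
statistics.mode(results_2.iloc[:,1])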
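
The cells above classify an image by majority vote over the classes of its nearest neighbours. As a minimal sketch, assuming you also want to see how strong the vote was, the same vote can be taken with `collections.Counter`. The function name `majority_vote` and the class names in the usage note are hypothetical.

```python
from collections import Counter


def majority_vote(neighbour_classes):
    """Return the most common class, its share of the votes, and all vote counts."""
    if not neighbour_classes:
        raise ValueError("at least one neighbour is required")
    counts = Counter(neighbour_classes)
    # on a tie, most_common keeps the class that was encountered first
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(neighbour_classes), dict(counts)
```

For example, `majority_vote(["healthy", "rust", "healthy"])` picks "healthy" with two of the three votes. With the notebook's objects, the input would be `list(results_2.iloc[:,1])`.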

### Alternate: Classifying multiple images

The following section can be used to classify a list of images rather than just one:

In [None]:
pd.set_option('display.max_rows', None)

In [None]:
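classifications=[]
for i in range(len(search_embedded_df)):
    embedding = search_embedded_df.iloc[i,2]  # embedding of the i-th search image
    neighbours = table.search([embedding], n=400)[0]  # its 400 nearest neighbours
    classifications.append(statistics.mode(neighbours.iloc[:,1]))  # most common class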

In [None]:
classification_list = pd.DataFrame(
    {
        "source": sorted_search_paths,
        "classification": classifications,
    }
)

In [None]:
classification_list

### Testing accuracy

In [None]:
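# the true class is taken from the first word of each file name,
# e.g. "healthy (1).jpeg" gives "healthy"
real_classifications=[]
for i in range(len(search_embedded_df)):
    w = search_embedded_df.iloc[i,0]
    x = os.path.basename(w)
    y = x.split()
    z = y[0]
    real_classifications.append(z)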

In [None]:
match=[]
for i in range(len(search_embedded_df)):
    if real_classifications[i] == (classifications[i]).lower():
        match.append('yes')
    else:
        match.append('no')  

In [None]:
count = 0
for i in range(len(search_embedded_df)):
    if match[i] == 'yes':
        count += 1

In [None]:
accuracy_percentage = (count/(len(search_embedded_df)))*100
accuracy_percentage
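
The accuracy cells above can be condensed into two small functions. This is a minimal sketch that assumes, as the cells above do, that the true class is the first whitespace-separated word of each file name and that predictions are compared in lower case. The names `label_from_path` and `compute_accuracy` are hypothetical.

```python
import os


def label_from_path(path):
    """Take the first whitespace-separated word of the file name as the true class."""
    return os.path.basename(path).split()[0]


def compute_accuracy(paths, predictions):
    """Percentage of predictions that match the label encoded in each file name."""
    matches = sum(
        label_from_path(path) == prediction.lower()
        for path, prediction in zip(paths, predictions)
    )
    return 100 * matches / len(paths)
```

With the notebook's objects, the call would be `compute_accuracy(sorted_search_paths, classifications)`.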

### Testing the best number of search results for accuracy

In [28]:
test_length=[]
accuracies=[]

# j is the number of search results used for each classification
for j in range(1, 479):
    classifications=[]
    for i in range(len(search_embedded_df)):
        w = search_embedded_df.iloc[i,2]
        x = table.search([w], n=j, index_options={'efSearch': j})
        y = x[0]
        z = statistics.mode(y.iloc[:,1])
        classifications.append(z)
    classification_list = pd.DataFrame(
    {
        "source": sorted_search_paths,
        "classification": classifications,
    }
    )
    real_classifications=[]
    for i in range(len(search_embedded_df)):
        w = search_embedded_df.iloc[i,0]
        x = os.path.basename(w)
        y = x.split()
        z = y[0]
        real_classifications.append(z)
    match=[]
    for i in range(len(search_embedded_df)):
        if real_classifications[i] == (classifications[i]).lower():
            match.append('yes')
        else:
            match.append('no')
    count = 0
    for i in range(len(search_embedded_df)):
        if match[i] == 'yes':
            count += 1
    accuracy_percentage = (count/(len(search_embedded_df)))*100
    test_length.append(j)
    accuracies.append(accuracy_percentage) 
    print(j)

1
2
3
...
277


In [29]:
test_accuracies = pd.DataFrame(
    {
        "no. of results": test_length,
        "accuracy": accuracies,
    }
)

In [30]:
max_row = test_accuracies[test_accuracies['accuracy'] == test_accuracies['accuracy'].max()]
max_row

|   | no. of results | accuracy |
|---|---|---|
| 6 | 7 | 81.6 |
