# imageSearch

## A Python console application for biometric identification, specifically designed for wildlife biologists

## Table of Contents

[About this application](#About-this-application)

* [Motivation](#Motivation)
* [What you can do with this application](#What-you-can-do-with-this-application)
* [A brief presentation of the algorithm](#A-brief-presentation-of-the-algorithm)
* [The importance of image preprocessing](#The-importance-of-image-preprocessing)

[Tutorial](#Tutorial)
* [Data for this tutorial](#Data-for-this-tutorial)
* [Instantiating and loading a DB object](#Instantiating-and-loading-a-DB-object)
* [Reading your data into the register](#Reading-your-data-into-the-register)
* [Managing your register](#Managing-your-register)
* * [Deleting entries](#Deleting)
* * [Assigning an ID to an image](#Assigning-an-ID-to-an-image)
* * [Exploring your data](#Exploring-your-data)
* [Verifying identities and matching images](#Verifying-identities-and-matching-images)
* [Saving and reloading your database](#Saving-and-reloading-your-DB) 
* [Training and testing your model](#Training-and-testing-your-model)

[A few suggestions for optimal use](#A-few-suggestions-for-optimal-use)
* [Use the help function](#Use-the-help-function)
* [Similarity score and ranking](#Similarity-score-and-ranking)
* [Picking the right threshold](#Picking-the-right-threshold)
* [Looking at the ranking ratio](#Looking-at-the-ranking-ratio)
* [Performance and computing time: parameter tuning](#Performance-and-computing-time:-parameter-tuning)
* * [Resolution](#Resolution)
* * [Blurring](#Blurring)
* * [Other parameters for advanced users](#Other-parameters-for-advanced-users)

## About this application

### Motivation 

Wildlife biologists who study a given species often capture individuals and release them later. When capturing individuals on site, it is paramount to determine whether they have *already been captured* previously and, if they have, what their identity is.

To be able to identify individuals in case of recapture, biologists often tag them before releasing them. However, tagging can be problematic in some cases: tags can impair individuals' survival in the wild, and they can also be damaged or lost. A possible solution to this problem is image-based biometric identification. 

This challenge is faced by many other applications involving human faces. For example, when trying to enter a secured facility, access will be granted if a photo, taken on the spot, can match with one of the indexed images of the staff members (one-to-many matching). 

Using neural networks, fast and accurate models have been developed for face identification. Thanks to massive amounts of data, these models have learned to extract light, highly distinguishable representations from face images. In wildlife research, however, data is often insufficient to train such models. In addition, there is a need for a versatile model, able to accommodate a variety of different-looking species.

This console Python application proposes biometric identification based on keypoint detection and feature matching. Its model can be trained with small amounts of data, and preliminary testing has demonstrated its ability to accurately identify species as different as toads and humans.



### What you can do with this application

* Manage a register of images for your population.
* Authenticate and identify new images against the indexed images of your register. 
* Train a model to accommodate your species and your type of images.

### A brief presentation of the algorithm

The proposed program is based on [detection and description of keypoints (or features) in images](https://en.wikipedia.org/wiki/Feature_detection_(computer_vision)), a well-studied topic for which many implementations exist. Here we used the (patented) SURF algorithm, implemented in OpenCV. There are several alternatives to the SURF algorithm, such as ORB or FAST, which can yield good results with a little adjustment. In short, keypoints are areas in the image where pixel intensities differ markedly, and for which descriptors containing information about neighboring pixels, scale, orientation, etc. are computed ([details on the specific OpenCV implementation can be found here](https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_surf_intro/py_surf_intro.html)). For more details, a good overview of image descriptors in OpenCV can be found in chapter 6 of J. Howse's and J. Minichino's [Learning OpenCV 4 Computer Vision with Python 3](https://www.amazon.com/Learning-OpenCV-Computer-Vision-Python-ebook/dp/B084ZH43LV).

*Here is an example of keypoints (i.e. blobs of contrasting pixels) found using SURF:* 
<img src="files/surf_ex.jpg" alt="Drawing" style="width: 400px;">

Descriptors of keypoints across any pair of images A and B can therefore be compared, and the distance between the arrays of descriptors can be calculated (here, the **norm** of the difference between the descriptor arrays, not to be confused with the **Euclidean distance** between the locations of the keypoints). Among the best matches, we filter out false matches with Lowe's ratio test, which is a good indicator of match quality: the distance score for the best match of a keypoint has to be smaller than a threshold fraction (say 0.8) of the distance score for its second-best match. 
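
As an illustration of the ratio test, here is a minimal pure-Python sketch. It is not the application's own code: the function names, the brute-force search and the toy 2-D descriptors are assumptions made for the example (real SURF descriptors have 64 or 128 dimensions).

```python
import math

def l2_distance(a, b):
    """Norm of the difference between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test(descriptors_a, descriptors_b, ratio=0.8):
    """Return (index_a, index_b, distance) for matches passing Lowe's ratio test."""
    good = []
    for i, desc_a in enumerate(descriptors_a):
        # Distances from this descriptor to every descriptor of image B, closest first.
        ranked = sorted((l2_distance(desc_a, desc_b), j)
                        for j, desc_b in enumerate(descriptors_b))
        if len(ranked) < 2:
            continue  # the test needs a second-best candidate
        (best, j), (second, _) = ranked[0], ranked[1]
        # Keep the match only if it is clearly better than the runner-up.
        if best < ratio * second:
            good.append((i, j, best))
    return good

# Hypothetical toy descriptors, for illustration only.
desc_a = [(0.0, 0.0), (5.0, 5.0)]
desc_b = [(0.1, 0.0), (3.0, 3.0), (7.0, 7.0)]
matches = ratio_test(desc_a, desc_b)
print(matches)
```

The first descriptor has one clear nearest neighbour and is kept; the second is equally close to two candidates, so its match is ambiguous and is rejected.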

*Here is an example of matching keypoints (colored circles) between two different images of the same toad (left and right):*
<img src="files/matches.jpg" alt="Drawing" style="width: 600px;">

The algorithm further filters one-to-many matches between keypoints, as these are typically spurious. 
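
How exactly one-to-many matches are filtered is not detailed here. One possible reading, sketched below as an assumption, is to discard every match whose target keypoint is claimed by more than one source keypoint. The `(index_a, index_b, distance)` tuple layout and the function name are hypothetical.

```python
from collections import Counter

def drop_one_to_many(matches):
    """Keep only matches whose target keypoint (index_b) is used exactly once."""
    target_counts = Counter(index_b for _, index_b, _ in matches)
    return [m for m in matches if target_counts[m[1]] == 1]

# Keypoint 4 of image B is claimed twice, so both of its matches are dropped.
matches = [(0, 4, 0.10), (1, 4, 0.12), (2, 7, 0.08)]
unique_matches = drop_one_to_many(matches)
print(unique_matches)
```

A softer variant would keep the closest of the competing matches instead of dropping them all.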

Using the best matches, we find the best [homography transformation](https://en.wikipedia.org/wiki/Homography), i.e. the reprojection of image A according to the camera view of image B. With the RANSAC algorithm for outlier detection, we can identify matches that are not consistent with the image reprojection. This not only allows for a better selection of matches, but also helps assess the overall similarity between the images.


*In the left-hand column of the figure below, we see image A before and after being warped by the homography matrix (top and bottom); the right-hand column represents image B, the destination. We can see that image A is warped so as to fit the perspective of image B around the matched keypoints.*
<img src="files/homography.jpg" alt="Drawing" style="width: 400px;">
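
To make the consistency check concrete, the sketch below maps keypoint locations of image A through a given 3x3 homography and flags the matches whose reprojection lands far from the matched location in image B. This is not RANSAC itself (in OpenCV, `cv2.findHomography` with the `cv2.RANSAC` flag estimates the matrix and an inlier mask in one call); the matrix, the points and the 3-pixel tolerance are assumptions chosen for the example.

```python
import math

def apply_homography(H, point):
    """Map an (x, y) point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)  # back from homogeneous to image coordinates

def inlier_mask(H, matched_points, max_error=3.0):
    """True for each (point_a, point_b) pair consistent with the homography."""
    mask = []
    for point_a, point_b in matched_points:
        px, py = apply_homography(H, point_a)
        error = math.hypot(px - point_b[0], py - point_b[1])
        mask.append(error <= max_error)
    return mask

# Hypothetical homography: a pure translation by (+10, -5) pixels.
H = [[1.0, 0.0, 10.0], [0.0, 1.0, -5.0], [0.0, 0.0, 1.0]]
matched_points = [
    ((0.0, 0.0), (10.0, -5.0)),   # exact
    ((2.0, 3.0), (12.5, -2.0)),   # half a pixel off
    ((4.0, 4.0), (30.0, 30.0)),   # inconsistent with the reprojection
]
mask = inlier_mask(H, matched_points)
print(mask)
```

The share of inliers among the matches is one simple way to summarise how coherent the matches are with a single change of viewpoint.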

In addition to the distance between the descriptors (i.e. how dissimilar they are), we further measure the Euclidean distance between the *locations* of the best-matching keypoints. Combined with the homography data, these help get a good idea of the overall coherence of the matches between the images.

Thus, for a given pair of images with a varying number of matches, we obtain (a) the number of good matches, (b) the distances between matched descriptors, of which we take the mean and the standard deviation, and (c) the coefficients of the homography matrix. We feed these data into a stacking model consisting of a [*support vector machine*](https://en.wikipedia.org/wiki/Support_vector_machine) with a radial basis function kernel and a [*random forest classifier*](https://en.wikipedia.org/wiki/Random_forest), which assigns a probability of similarity to pairs of images. Thus, the model can output a similarity score between two images for authentication (one-to-one matching), determine whether an individual exists in the register and, if it does, which one it is (one-to-many matching).
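
The point of these summary statistics is that any number of matches is reduced to a fixed-length vector that a classifier can consume. A minimal sketch, assuming a hypothetical `match_features` helper, a population standard deviation and a row-major flattening of the matrix (the application may order or scale its features differently):

```python
import statistics

def match_features(distances, homography):
    """Fixed-length feature vector for one pair of images.

    distances  : descriptor distances of the good matches (any length)
    homography : 3x3 matrix as nested lists
    """
    n_matches = len(distances)
    mean_distance = statistics.mean(distances) if distances else 0.0
    std_distance = statistics.pstdev(distances) if n_matches > 1 else 0.0
    coefficients = [value for row in homography for value in row]
    return [n_matches, mean_distance, std_distance] + coefficients

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
features = match_features([0.1, 0.2, 0.3], identity)
print(len(features), features[:3])
```

Whatever the number of matches, the vector has 12 entries: 3 summary statistics and 9 homography coefficients.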

### The importance of image preprocessing
The efficiency of the algorithm described above in matching between images of the same individual depends strongly on the preprocessing of the images. 

First, image denoising can speed up the process of matching, as small noisy areas of the image can otherwise be detected as keypoints. Here we apply a Gaussian blur to denoise the image. 

Second, the contrast is equalized locally using the [CLAHE technique](https://en.wikipedia.org/wiki/Adaptive_histogram_equalization#Contrast_Limited_AHE), as photos are taken under varying luminance conditions. Indeed, we need to equalize brightness and contrast, as matching can be impaired by regions of the photographed individual that are darker in one image than in another. 

Last, images are converted to grayscale to ensure better and faster processing. 


## Tutorial 

### Data for this tutorial

This tutorial shows many examples from real data. The application comes with a dataset of green toads *Bufotes viridis* (see README for the download link). Let's download it into the project folder (i.e. where the scripts are) and name the folder `data`.

### Instantiating and loading a DB object
First, we import the `database` module, as well as the `utils` module that comes with it:

In [1]:
import database
import utils

To instantiate a new `DB` object, we need to pass the size of the images that will be read into the register. E.g., for 500x600 images:

In [2]:
foo = database.DB(target_size = (500,600))

Alternatively, we can load a previously created `DB` object with the `load_DB(path)` function from the `utils` module (see [Saving and reloading your DB](#Saving-and-reloading-your-DB)).

### Reading your data into the register

Once your `DB` object exists, you can read your data into the register. Let's examine four cases: 
1. You have **multiple images** of captured individuals, to which *an ID can be assigned with certainty*. This can be the case when you go to a new site, and all captured individuals are certainly new. Thus, you want to assign each of them a new ID. 

    The application supports reading multiple image files organized in subfolders, where each subfolder corresponds to an individual. By calling `DB.add_entries_from_folder(path)`, the application infers the ID that you want to assign to an individual from the subfolder's name. 

    E.g., we have two individuals, Alice and Bob, with one or multiple images for each. We organize our data in a folder `data` with subfolders `Bob` and `Alice`:
    ```
    data
    |-- Bob
        |-- img_of_bob.jpg
    |-- Alice
        |-- img_of_alice_0.jpg
        |-- img_of_alice_1.jpg
    ```
    
    The dataset of green toads (see README for the download link) is organized in subfolders as required. We will use it to illustrate our point.   
    

In [3]:
foo.add_entries_from_folder("data/toads")


Registering images...
[####################] 100% 
 


The new data have now been registered: each image has been assigned a unique key and the identity of the individual that it represents. 

We can view the register (or parts of it) with the method `DB.view_register()`. By passing a list of keys, we can specify the keys that we want to view.

In [4]:
foo.view_register([1,13])


*Key*   *Assigned_id*   *Images_for_id*   *Path*                           
1       001             [0, 1, 2]         data/toads/001/Ind_001_002.JPG   
13      007             [13, 14]          data/toads/007/Ind_007_001.JPG   

2. You want to register **multiple images**, whose **ID is unknown** (you don't know whether they have been captured before). You can use the same method, but specify that the ID should not be inferred from the subfolders' names:  `DB.add_entries_from_folder(path, id_from_folder = False)`. Entries will be assigned an NA value as ID.


3. You want to register a **single image**, **whose ID you know**:  `DB.add_entry(path, identity = 'my_id')`.


4. You want to register a **single image**, of which **you don't know the ID**: `DB.add_entry(path)`. The image will be assigned an NA value as ID.
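
To illustrate how an ID can be inferred from a subfolder name (case 1), here is a small standalone sketch. It does not reproduce the internals of `DB.add_entries_from_folder`; the helper name `ids_from_folder` and the list of accepted file extensions are assumptions.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions

def ids_from_folder(root):
    """Map each image found one level below `root` to its subfolder's name."""
    entries = {}
    for path in sorted(Path(root).glob("*/*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            entries[path.relative_to(root).as_posix()] = path.parent.name
    return entries

# Rebuild the Alice/Bob layout from the example in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Bob/img_of_bob.jpg",
                 "Alice/img_of_alice_0.jpg",
                 "Alice/img_of_alice_1.jpg"):
        target = Path(tmp) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()  # empty placeholder files are enough here
    inferred = ids_from_folder(tmp)

print(inferred)
```

Each image ends up associated with `Alice` or `Bob`, which is why the folder layout matters when the ID is read from it.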

### Managing your register

The application offers basic functionalities for managing your register.
#### Deleting
You can delete a specific entry by calling `DB.delete_entry(entry_key)`. If you want to delete all images associated with an ID, call `DB.delete_id(id)`.
#### Assigning an ID to an image
If you want to (re)assign an ID to a certain image, call `DB.assign_id(entry_key, id)`. Pass as arguments the key of the image you want to assign an ID to and the ID (a string).
#### Exploring your data
Apart from `DB.view_register()` introduced before, you can explore the data in your register using `DB.image_for_id(id)`. This method outputs all image keys associated with an id. Let's see what images are associated with ID `011` and `test`. The latter are images of various individuals which will be used for testing later. 

In [5]:
foo.image_for_id(['011', 'test'])

{'011': [22, 23, 24], 'test': [141, 142, 143, 144, 145, 146, 147]}

### Verifying identities and matching images

The core functionality of the application is image verification. For a new image whose ID is unknown, we can check whether similar images exist, thus suggesting that the individual probably exists in the register. 
Let's illustrate this functionality by matching images labelled as "test":

In [6]:
foo.image_for_id('test')

{'test': [141, 142, 143, 144, 145, 146, 147]}

Photos 141-147 are test images, which belong to different individuals. Each image has two 'siblings' labelled with their true identity. The program does not know which they are; only the file name tells us the true identity of the test images:

In [7]:
foo.view_register(foo.image_for_id('test')['test'])


*Key*   *Assigned_id*   *Images_for_id*                       *Path*                            
141     test            [141, 142, 143, 144, 145, 146, 147]   data/toads/test/Ind_001_004.JPG   
142     test            [141, 142, 143, 144, 145, 146, 147]   data/toads/test/Ind_011_004.JPG   
143     test            [141, 142, 143, 144, 145, 146, 147]   data/toads/test/Ind_016_004.JPG   
144     test            [141, 142, 143, 144, 145, 146, 147]   data/toads/test/Ind_024_004.JPG   
145     test            [141, 142, 143, 144, 145, 146, 147]   data/toads/test/Ind_030_004.JPG   
146     test            [141, 142, 143, 144, 145, 146, 147]   data/toads/test/Ind_042_004.JPG   
147     test            [141, 142, 143, 144, 145, 146, 147]   data/toads/test/Ind_058_004.JPG   

Let's take, for instance, image # 141. We call `DB.verify_with_all(key, n)` to match the image registered under `key` pairwise against the images in the register and to display the `n` best matches.

In [8]:
foo.verify_with_all(141, n = 5)

[(0.9819, '001', 2),
 (0.9777, '001', 0),
 (0.9607, '001', 1),
 (0.0016, '029', 67),
 (0.0014, '002', 3),
 (0.0014, '002', 4)]

We see that the model correctly detected images 2, 0 and 1, belonging to individual '001', as being most similar to image # 141, with a match level > 90%, against < 1% for other images. We can safely reassign '001' as ID to image # 141:

In [9]:
foo.assign_id(141, '001')

picture 141 was assigned the id 001


We can reassign the IDs for the rest of the test images following the same procedure. 

In case of doubt, we can inspect the match between two images with `DB.match(pair, show = True, force_rematch = True)`.  We pass (147,140) as `pair`, and we choose to `show` the matching images, and to `force_rematch`, as the matching operation was already carried out once. 

In [4]:
foo.match((147,140), show = True, force_rematch = True)

A 'manual' inspection would have allowed us to determine with certainty that images # 140 and # 147 are siblings, even if the model was uncertain about it.

### Saving and reloading your DB
The DB class has a method `DB.save(path)` for saving the database. Reloading the database can be done with the function `load_DB(path)` of the `utils` module. Since OpenCV keypoints can't be saved directly, these functions convert them to Python lists of arrays when saving and back to OpenCV keypoints when loading. Therefore, loading can take some time. 

In [5]:
foo.save("foo.pkl")
foo = utils.load_DB("foo.pkl")

loading database...
creating keypoints...
[####################] 100% 
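
The conversion idea can be sketched as follows. To keep the snippet runnable without OpenCV, a plain stand-in class with the same attributes as `cv2.KeyPoint` (`pt`, `size`, `angle`, `response`, `octave`, `class_id`) is used; the helper names are hypothetical, and a real `cv2.KeyPoint` would have to be rebuilt through its own constructor.

```python
import pickle
from dataclasses import dataclass

@dataclass
class FakeKeyPoint:
    """Hypothetical stand-in exposing the same attributes as cv2.KeyPoint."""
    pt: tuple
    size: float
    angle: float
    response: float
    octave: int
    class_id: int

FIELDS = ("pt", "size", "angle", "response", "octave", "class_id")

def keypoint_to_tuple(keypoint):
    """Flatten a keypoint into plain Python values that pickle can store."""
    return tuple(getattr(keypoint, name) for name in FIELDS)

def tuple_to_keypoint(values, factory=FakeKeyPoint):
    """Rebuild a keypoint object from the stored values."""
    return factory(**dict(zip(FIELDS, values)))

keypoints = [FakeKeyPoint((12.5, 40.0), 8.0, 90.0, 0.7, 1, -1)]
payload = pickle.dumps([keypoint_to_tuple(kp) for kp in keypoints])
restored = [tuple_to_keypoint(values) for values in pickle.loads(payload)]
print(restored == keypoints)
```

Rebuilding one object per keypoint is the step that makes loading slower than a plain unpickling.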


### Training and testing your model

To output a similarity score between images, the program relies on a feature-based machine learning classification model. Optimal decision boundaries might differ across problem types. The default model (`model.pkl`) was trained to optimally verify *toad images*. Other types of images might require retraining the model. To show how you can do it, let's consider a toy example with human faces (there are, of course, many far more effective models for face verification). The [Utrecht face dataset](http://pics.stir.ac.uk/2D_face_sets.htm) is a simple dataset consisting (mainly) of pairs of images of the same person, smiling in one and with a neutral face in the other. We have prepared the data so that it fits the folder organization required by the program. The link to the data is in the README file. 

We call a new DB instance and read the images in:

In [6]:
test = database.DB((150,200))
test.add_entries_from_folder("data/utrecht")


Registering images...
[####################] 100% 
 


We decide to split our data into two equal parts for training and testing. 

Now we can call `DB.train_model(train_prop, clf = None, save_model = None)`. We save our model as `utrecht_face_model.pkl`. If you want this model to be your default model, save it as `model.pkl`. You can pass your favorite sklearn classifier as argument (`clf`), but by default, the stacking classifier mentioned in the presentation of the algorithm will be used.

Training time depends on the image size and, of course, on the number of samples, and can be very long.

In [7]:
my_model = test.train_model(train_prop = 0.5, save_model = "utrecht_face_model.pkl")


 Matching pairs of images...

[####################] 100% 
Confusion matrix for training dataset
[[1736    0]
 [   0   34]]


There are many ways to test the performance of such a model. Here we use the following method: each image is matched with one image from each labelled individual in the register. For each image, we rank the matched images by similarity score and search through the ranking for the (or a) twin image (i.e. an image of the same individual). The higher the position of the twin image in the ranking, the better the model. 
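
The ranking-based evaluation can be sketched in a few lines. The functions and the toy scores below are illustrative assumptions, not the implementation behind `DB.test_model()`; ranks are 0-based, so a rank of 0 means that the twin image came first.

```python
def twin_rank(scored_candidates, true_id):
    """0-based rank of the first candidate carrying `true_id`, or None."""
    ranked = sorted(scored_candidates, key=lambda item: item[0], reverse=True)
    for rank, (_, candidate_id) in enumerate(ranked):
        if candidate_id == true_id:
            return rank
    return None  # no twin image among the candidates

def top_k_rate(ranks, k):
    """Share of test images whose twin appears within the first k places."""
    hits = sum(1 for rank in ranks if rank is not None and rank < k)
    return hits / len(ranks)

# Hypothetical (score, id) candidates for two test images, keyed by true ID.
results = {
    "001": [(0.98, "001"), (0.0016, "029"), (0.0014, "002")],
    "029": [(0.30, "002"), (0.25, "029"), (0.01, "001")],
}
ranks = [twin_rank(candidates, true_id) for true_id, candidates in results.items()]
print(ranks, top_k_rate(ranks, 1), top_k_rate(ranks, 3))
```

Here the first twin is ranked first and the second twin is ranked second, so half of the test images are matched in first place and all of them within the first three.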

The program assumes by default that the test individuals are those previously set aside by `DB.train_model()`, as in this example. If you want to choose them yourself, or if you didn't train the model at all, you can pass to `DB.test_model()` either a list of IDs to test on, or a float corresponding to the proportion of individuals to be randomly assigned to testing. 

Here, we have pairs of images and the testing IDs were assigned by the training function, so we just have to call `DB.test_model()`. The method returns a 2D array with three columns: keys, scores (i.e. the rank of the corresponding twin image), and (one of) the true twin(s). 

In [10]:
res = test.test_model()

[####################] 100% 


To see how well the model performed, we can check:

In [11]:
import numpy as np
print("% of right matches in first places of ranking: {:2.2%}".format(np.mean(res[:,1]==0)))
print("% of right matches in first 3 places of ranking: {:2.2%}".format(np.mean(res[:,1]<=2)))
print("% of right matches in first 5 places of ranking: {:2.2%}".format(np.mean(res[:,1]<=4)))

% of right matches in first places of ranking: 100.00%
% of right matches in first 3 places of ranking: 100.00%
% of right matches in first 5 places of ranking: 100.00%


Once our model is saved and performs to our satisfaction, we can load it at any time after creating a DB object with `DB.load_model(path)`. 

## A few suggestions for optimal use

### Use the help function

Not all the features are covered in this tutorial. You can use the Python `help` function on a method to access its full documentation. For instance:

In [12]:
import numpy as np
import database
import utils
example = database.DB((300,400))
help(example.match)

Help on method match in module database:

match(key_tuple, show=False, force_rematch=False, verbose=True) method of database.DB instance
    Matches descriptors of keypoints in two images, and creates open-cv match objects,
    selected keypoints and a homography matrix, from which
    match similarity features are extracted. They are subsequently used to assess
    degree of identity between the images.
    
    Parameters
    ----------
    key_tuple : tuple of two INT.
        Keys of the image to be matched.
    show : BOOL, optional
        Whether to show the pair of images with the keypoints and drawn matches.
        The default is False
    force_rematch : BOOL, optional
        If match already exsists whether to rematch. The default is False.
    
    Returns
    -------
    None.



### Similarity score and ranking

The testing method proposed here is based on ranking. Ranking is more robust to changes in image characteristics, such as contrast or resolution, since the level of similarity is **relative**. However, when you don't know *whether* a twin image actually exists in the register, you need to rely on the similarity score rather than on ranking, and set a decision threshold above which a pair of images is considered to belong to the same individual. When the model was trained on a different type of images, such changes can affect the magnitude of the similarity score. 

**It is therefore preferable to retrain your model whenever dealing with new types of data** (different resolution, contrast, picture frame, and of course species). 

### Picking the right threshold

If you can't retrain the model, be careful about the decision threshold. In an optimally trained model, the decision threshold should be at 50%. When importing a model trained on data with different characteristics, the model can still suit your needs, but it might output probabilities such that you need to set your decision threshold at a different level. For example, when all images have a lower resolution or lower contrast, the number of good matches between keypoints will be lower, and therefore a model trained on high-resolution images might predict lower similarity probabilities. It is therefore important to test the model with a few labelled images to adjust the decision threshold to the new type of images. 

To illustrate this point, let's try to use the default model, trained on 500x600 toad images, on the Utrecht face database, reducing images to a 300x400 resolution. We will take 10 pairs of data, for which the identity is known, to test out our model on this new type of images:

In [13]:
import numpy as np
import database
import utils
example = database.DB((300,400))
example.load_model("utrecht_face_model.pkl")
example.add_entries_from_folder("data/utrecht")
keys_subsample = np.arange(20,40)
#we test every other image, one per pair
for key in keys_subsample[::2]:
    print("Twin keys of {} are {}".format(key, example.view_twin_keys(key)))
    print(example.verify_with_batch(key,keys_subsample, return_ratio = True)[:3])


Registering images...
[####################] 100% 
 
Twin keys of 20 are [19]
[(0.0028, '19', 22, 1.037), (0.0027, '19', 21, 1.0), (0.0027, '2', 23, 1.0)]
Twin keys of 22 are [21]
[(0.41, '19', 21, 146.4286), (0.0028, '18', 20, 1.0), (0.0028, '2', 24, 1.0)]
Twin keys of 24 are [23]
[(0.0967, '2', 23, 34.5357), (0.0028, '19', 21, 1.0), (0.0028, '19', 22, 1.0)]
Twin keys of 26 are [25]
[(0.3034, '20', 25, 108.3571), (0.0028, '19', 21, 1.0), (0.0028, '19', 22, 1.037)]
Twin keys of 28 are [27]
[(0.1893, '21', 27, 67.6071), (0.0028, '19', 21, 1.0), (0.0028, '19', 22, 1.0)]
Twin keys of 30 are [29]
[(0.0876, '22', 29, 31.2857), (0.0028, '19', 22, 1.037), (0.0027, '18', 20, 1.0)]
Twin keys of 32 are [31]
[(0.1445, '23', 31, 51.6071), (0.0028, '24', 33, 1.037), (0.0027, '18', 20, 1.0)]
Twin keys of 34 are [33]
[(0.1185, '24', 33, 42.3214), (0.0028, '19', 22, 1.037), (0.0027, '18', 20, 1.0)]
Twin keys of 36 are [35, 37]
[(0.6505, '25', 35, 34.418), (0.0189, '25', 37, 6.75), (0.0028, '19', 22, 

As we can see, the model is still rather accurate (the true twin image appears mostly at the first place), although it was trained with a different resolution. Setting the threshold at 1% or 2% will still yield good results, but it would be very dangerous to set the threshold at 50%.  
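
One simple way to adjust the threshold from a handful of labelled pairs is to try the midpoints between consecutive observed scores and keep the one that classifies the labelled pairs best. The sketch below is an assumption about how this could be done, not a feature of the application; the scores are a subset of those printed above.

```python
def pick_threshold(labelled_scores):
    """Return the midpoint threshold that best separates twins from non-twins.

    labelled_scores : list of (similarity_score, is_twin) pairs
    """
    scores = sorted({score for score, _ in labelled_scores})
    candidates = [(low + high) / 2 for low, high in zip(scores, scores[1:])]
    if not candidates:
        raise ValueError("need at least two distinct scores")

    def accuracy(threshold):
        correct = sum((score >= threshold) == is_twin
                      for score, is_twin in labelled_scores)
        return correct / len(labelled_scores)

    return max(candidates, key=accuracy)

labelled_scores = [
    (0.41, True), (0.0967, True), (0.3034, True), (0.0876, True),
    (0.0028, False), (0.0027, False), (0.0028, False),
]
threshold = pick_threshold(labelled_scores)
print(round(threshold, 4))
```

On this subset the selected threshold lies between the highest non-twin score and the lowest twin score, far below 50%. With more labelled pairs, the choice should also reflect which kind of error is more costly for your study.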


### Looking at the ranking ratio

The ranking ratio is the ratio between the similarity scores ranked i and i+1. A low ranking ratio (close to one) should ring an alarm bell, since two images have been assigned close similarity scores. It can mean, of course, that the image has several twin images in the register, but also that the model performs poorly and that the image fed in fails to produce meaningful information. In the example above, key # 20 has a ranking ratio of 1.04 (the first place has a score of 0.0028 and the second 0.0027). Indeed, the predicted twin key is not correct: the model did not rank the twin image in first place.
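
The ratio itself is straightforward to compute from a ranked list of scores. A minimal sketch, with a hypothetical helper name and rounding to four decimals as in the output above:

```python
def ranking_ratios(scores):
    """Ratio between each score and the next one in a descending ranking."""
    ranked = sorted(scores, reverse=True)
    return [round(high / low, 4) if low > 0 else float("inf")
            for high, low in zip(ranked, ranked[1:])]

clear_match = ranking_ratios([0.41, 0.0028, 0.0028])     # scores listed for key 22
doubtful_match = ranking_ratios([0.0028, 0.0027, 0.0027])  # scores listed for key 20
print(clear_match)
print(doubtful_match)
```

A large first ratio means that the top candidate stands out from the rest, whereas a first ratio close to one means that the ranking carries little information.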

### Performance and computing time: parameter tuning

#### Resolution

High-resolution images usually yield better results, since they contain details that can be used for matching. However, higher resolution considerably increases computing time. When designing a model and a data collection protocol, it is worth playing around with a few data samples to find a good trade-off between the two. 

#### Blurring 
Up to a certain point, blurring can improve performance, as it denoises the images and prevents the algorithm from finding useless keypoints. Consequently, it can also significantly reduce computing time. However, above a certain level, aggressive blurring leads to a loss of information and a weaker performance. As for the other parameters, the right trade-off depends on the nature of the images, and some experimenting will help you find the sweet spot that best suits your needs. 

By default, blurring is done with a square Gaussian kernel of the following side size:

kernel_size = blurring_factor*(image_height + image_width)/2.

The default blurring factor is set at 0.03. You can set it before reading the images in with:

`DB.blurring_factor = value`
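
As a worked example of the formula, the sketch below computes the kernel side for two image sizes used in this tutorial. OpenCV's Gaussian blur expects an odd, positive kernel size; how the application rounds the raw value is not stated here, so rounding up to the next odd integer is an assumption, as is the function name.

```python
def gaussian_kernel_size(image_height, image_width, blurring_factor=0.03):
    """Kernel side from the formula above, adjusted to an odd integer >= 1."""
    raw = blurring_factor * (image_height + image_width) / 2
    size = max(1, int(round(raw)))
    # Assumed rounding rule: bump even sizes to the next odd integer.
    return size if size % 2 == 1 else size + 1

toad_kernel = gaussian_kernel_size(500, 600)   # raw value: 16.5
face_kernel = gaussian_kernel_size(150, 200)   # raw value: 5.25
print(toad_kernel, face_kernel)
```

Because the kernel scales with the image size, the same blurring factor produces a comparable amount of smoothing at different resolutions.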

#### Other parameters for advanced users

Keypoint detection is implemented using the SURF algorithm. You can set the [parameters of the algorithm](https://docs.opencv.org/3.4/d5/df7/classcv_1_1xfeatures2d_1_1SURF.html) to improve performance or reduce computing time, for instance:
`DB.hessianThreshold = 500`, `DB.extended = 0`, etc.
