
ReadMe update, RANSAC Matrix filtering initial, SavePicture refactor #18

Merged
merged 5 commits into from Apr 24, 2019
43 changes: 37 additions & 6 deletions README.md
@@ -12,7 +12,8 @@ A lot of information collected or processed by CIRCL are related to images (at l

Three main applications are envisioned :
- Match new screenshots to a known baseline. E.g.: matching screenshots from a phishing-like website to a known legitimate website.
- Match pictures or objects of pictures together. E.g.: AIL; crawled websites (mainly Tor hidden services) from interpreted screenshots, crawled websites - image extraction from the DOM (e.g. picture matching and inference when there is no text related to the picture).
- Match pictures or objects of pictures together. E.g.: AIL; crawled websites (mainly Tor hidden services) from interpreted screenshots,
- Match pictures or objects of pictures together. E.g.: crawled websites or image extraction from the DOM (e.g. picture matching and inference when there is no text related to the picture).


Other picture sources :
@@ -21,20 +22,29 @@ Other picture sources :

## Getting Started

* Review existing algorithms, techniques and libraries for calculating distances between images, State Of The Art : [MarkDown](./SOTA/SOTA.md) | [PDF version](./SOTA/SOTA.pdf)
* Review of existing algorithms, techniques and libraries for calculating distances between images, State Of The Art : [MarkDown](./SOTA/SOTA.md) | [PDF version](./SOTA/SOTA.pdf)

### Questions
- Do we want a YES/NO algorithm output, which may not deliver any result, or do we want a "top N" algorithm, which tries to match the best pictures it has?
The first case requires some kind of threshold at some point. The second case is just a ranking algorithm.
- **_Do we want a "YES they are the same"/"NO they're not" algorithm output, which can deliver an empty set of results (threshold at some point), OR do we want a "top N" algorithm, which tries to match the best pictures it has? (ranking algorithm)_**

It depends on the use case. MISP would need a clear correlation for automation, whereas other applications may only need a "best match" output.
It depends on the use case. MISP would need a clear correlation for automation. The "best match" output is mainly useful for quality evaluation of different algorithms. However, some applications could use it as a production output.

- Is it a similarity search (global picture, then) or an object search (1 object -> where is it within a scene, OR one scene -> many objects -> where is each within another scene)?
The final goal of this library is to map all matches into one of three categories: accepted pictures, to-review pictures, and rejected pictures.
Therefore, the goal will be to shrink the "to review" category, so that human labor is needed for edge cases only.

- **_Is it about a similarity search (global picture matching) or an object search (1 object -> where is it within a scene, OR one scene -> many objects -> where is each object within another scene)?_**

For a first iteration, we are focusing on picture-to-picture matching. Depending on the problems we face and the use cases we add, the project may be extended to object-to-picture matching.
However, the matching principles are quite similar, and the extension may be trivial.

- **_Can I use the library in its current state for production?_**

Not yet. For now, the library has no "core element" that you can use atomically in your own software. It is currently mainly a test bench to evaluate algorithms on your own dataset.

### Prerequisites

See requirements.txt

(...)

### Installing
@@ -45,6 +55,26 @@ For a first iteration, we are focusing on picture-to-picture matching. Given pro

(...)

## Running the benchmark evaluation

In /lib_testing, you just have to launch `python3 ./launcher.py`.
Parameters are hardcoded in launcher.py (a hypothetical sketch is given after this list), such as:
- Path to the pictures folder
- Output folder to store results
- Requested outputs (result graphs, statistics, LaTeX export, threshold evaluation, similarity matrix, ...)
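
As a purely hypothetical sketch of such a hardcoded parameter block (the actual variable names and values in launcher.py may differ):

```python
# Hypothetical parameter block; the real names and values in launcher.py may differ.
from pathlib import Path

SOURCE_PICTURES_DIR = Path("./datasets/raw_phishing/")  # path to the pictures folder
OUTPUT_DIR = Path("./output/benchmark_run/")            # output folder to store results

# Requested outputs
GENERATE_RESULT_GRAPHS = True
GENERATE_STATISTICS = True
EXPORT_LATEX = False
EVALUATE_THRESHOLD = True
BUILD_SIMILARITY_MATRIX = True
```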

This currently works on most configurations and will explore the following algorithms for matching:
- ImageHash algorithms (A-hash, P-hash, D-hash, W-hash, ...) - a minimal usage sketch is given after this list
- TLSH (be sure to have BMP pictures, or at least an uncompressed format; a function to convert pictures is available in /utility/manual.py)
- ORB (and its parameter space)
- ORB Bag-Of-Words / Bag-Of-Features (and its parameter space, including the size of the "bag"/dictionary)
- ORB RANSAC (with/without homography matrix filtering)
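
For the ImageHash family, a minimal sketch with the `imagehash` Python package (illustration only, not the benchmark's own code; file names are placeholders):

```python
# Minimal sketch of the ImageHash algorithms listed above (A-hash, P-hash,
# D-hash, W-hash). File names are placeholders.
from PIL import Image
import imagehash

img_a = Image.open("picture_a.png")
img_b = Image.open("picture_b.png")

for name, algo in [("a-hash", imagehash.average_hash),
                   ("p-hash", imagehash.phash),
                   ("d-hash", imagehash.dhash),
                   ("w-hash", imagehash.whash)]:
    # Subtracting two hashes returns their Hamming distance:
    # 0 means identical hashes, larger values mean more different pictures.
    distance = algo(img_a) - algo(img_b)
    print(f"{name}: {distance}")
```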

You can also manually generate modified datasets from your original dataset (a sketch is given after this list):
- Text detector and hider (deep learning, Tesseract, ...)
- Edge detector (deep learning, Canny, ...)
- PNG/BMP versions of pictures (compressed/uncompressed)
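
As an illustration only (not the repository's own scripts), edge-detected and BMP variants of a dataset could be generated along these lines, with placeholder paths:

```python
# Sketch of dataset variant generation: a Canny edge version and a BMP version.
# Paths are placeholders; this is not the repository's own script.
from pathlib import Path
import cv2

SRC = Path("./datasets/raw/")
DST_EDGES = Path("./datasets/edges/")
DST_BMP = Path("./datasets/bmp/")
DST_EDGES.mkdir(parents=True, exist_ok=True)
DST_BMP.mkdir(parents=True, exist_ok=True)

for picture in SRC.glob("*.png"):
    img = cv2.imread(str(picture), cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, 100, 200)                           # edge-detected variant
    cv2.imwrite(str(DST_EDGES / picture.name), edges)
    cv2.imwrite(str(DST_BMP / (picture.stem + ".bmp")), img)   # uncompressed variant (e.g. for TLSH)
```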

### For Developers

(...)
@@ -62,3 +92,4 @@ For a first iteration, we are focusing on picture-to-picture matching. Given pro
* [Bibliography](https://www.zotero.org/groups/2296751/carl-hauser/items)

## Contributing
PRs are welcome.
68 changes: 57 additions & 11 deletions SOTA/SOTA.md
@@ -295,6 +295,7 @@ Estimation of the homography, searches for the best relative pose between images
“RANSAC loop involves selecting four feature pairs randomly. It computes Homography H (mapping between any two points with the same center of projection). For each key point, there may be more than one candidate matches in the other processed image. We choose the best matching based on the distance between their descriptors” from \[1\]

A good illustration video is available at <https://youtu.be/1YNjMxxXO-E?t=95>
**RANSAC** can also be used to filter matches. Only inliers are kept and evaluated as “good matches”.
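
A minimal Python sketch of this filtering step with OpenCV (the standard tutorial pattern, not necessarily this project's exact code), assuming `kp1`, `kp2` and `matches` have already been computed by a detector/matcher such as ORB:

```python
# Keep only the matches that RANSAC considers inliers of the estimated homography.
# kp1, kp2 (keypoints) and matches (a list of cv2.DMatch) are assumed to exist already;
# at least 4 matches are required to estimate a homography.
import numpy as np
import cv2

src_pts = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# 5.0 px is the maximum reprojection error for a point to be counted as an inlier.
homography, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)

good_matches = [m for m, inlier in zip(matches, mask.ravel()) if inlier]
```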

#### Least Median

@@ -871,13 +872,13 @@ Is a corner detector, based on machine learning. More accuracy, kept with high s

<span>Kind of descriptor ?</span>

Feature descriptor
------------------

#### Bag Of Features

<span>To detail. Not sure it’s a feature detector!</span>

Feature descriptor
------------------

Feature detector and descriptor
-------------------------------

@@ -1105,9 +1106,32 @@ Remove outliers and bad matches :

- **CrossCheck** is an alternative to the ratio test. Cross-check matches two sets of descriptors D1 and D2 in both directions (D1 to D2 and D2 to D1), retaining only the matches that exist in both.

// C++ version of the code. Python version can directly be found in the library.
// (BFMatcher replaces the OpenCV 2.x BruteForceMatcher<L2<float>> of the original sample.)
#include <vector>
#include <opencv2/features2d.hpp>
using namespace cv;
using namespace std;
// Cross-check: keep a forward match D1[i] -> D2[j] only if the best backward match of D2[j] is D1[i].
// descriptors1 and descriptors2 are assumed to be already computed (float descriptors).
BFMatcher descriptorMatcher(NORM_L2);
vector<DMatch> filteredMatches12, matches12, matches21;
descriptorMatcher.match(descriptors1, descriptors2, matches12);
descriptorMatcher.match(descriptors2, descriptors1, matches21);
for (size_t i = 0; i < matches12.size(); i++)
{
    DMatch forward  = matches12[i];
    DMatch backward = matches21[forward.trainIdx];   // best match of D2[forward.trainIdx] back into D1
    if (backward.trainIdx == forward.queryIdx)
        filteredMatches12.push_back(forward);
}

- A **ratio test** can be performed on each k-tuple of matches. Repetitive patterns are detected when one descriptor of the target picture lies at (almost) the same distance from the two best descriptors of the candidate picture: two points of the candidate picture match one point of the target picture.

**RANSAC** can also be used to filter matches. (TO CHECK)
// C++ version of the code. Python version can directly be found in the library.
#include <vector>
#include <opencv2/core.hpp>
using namespace cv;
using namespace std;
struct RatioCheck { static void filter(vector<vector<DMatch>>& nmatches, double RatioT = 0.8); };
// Lowe's ratio test: keep a k-NN match (k = 2) only when its best distance is clearly below the second-best one.
void RatioCheck::filter(vector<vector<DMatch>>& nmatches, double RatioT) {
    vector<vector<DMatch>> knmatches;
    for (size_t i = 0; i < nmatches.size(); i++) {
        if (nmatches[i].size() == 1 ||
            (nmatches[i].size() >= 2 &&
             nmatches[i][0].distance / nmatches[i][1].distance < RatioT)) {
            knmatches.push_back(nmatches[i]);
        }
    }
    nmatches = knmatches;
}
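
A minimal Python equivalent with OpenCV (a sketch, not the library's own implementation), assuming `descriptors1` and `descriptors2` are ORB descriptors already computed:

```python
# Ratio test in Python: match with k = 2 nearest neighbours, then keep a match
# only if it is clearly better than its runner-up.
import cv2

bf = cv2.BFMatcher(cv2.NORM_HAMMING)  # NORM_HAMMING for binary descriptors such as ORB
knn_matches = bf.knnMatch(descriptors1, descriptors2, k=2)

ratio_t = 0.8
good = [pair[0] for pair in knn_matches
        if len(pair) == 1
        or (len(pair) >= 2 and pair[0].distance < ratio_t * pair[1].distance)]
```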

Print nice graphs :

@@ -1129,7 +1153,7 @@ Distance can be computed in many ways. Here is an analysis of each of the used d

- Mean of matches: it makes use of the “internal distance of a match”: the distance of one descriptor to the other. The distance between two pictures is then computed as the mean of the match distances. This doesn’t work well, for the same reason as min-length: if one of the two pictures has a low number of descriptors, it acts as an “attractor”: this picture has very few matches with the others, but this small set of matches has very “good” distances. (A toy sketch of this distance is given below.)
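
A toy sketch of such a picture-to-picture distance (illustration only; `matches` is an assumed list of OpenCV DMatch objects between the two pictures):

```python
# Picture-to-picture distance as the mean of the match distances.
def mean_match_distance(matches):
    if not matches:
        return float("inf")  # no match at all: treat as maximally distant
    return sum(m.distance for m in matches) / len(matches)
```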

A question remains about “how to select good and bad matches”. A ratio approach (as in SIFT) is for example usable. A simple threshold can be used:
A question remains about “how to select good and bad matches”. A ratio approach (as in SIFT) is for example usable. A simple threshold can be used, see Figure \[fig:generalized-matching\]

![Threshold to use, from \[70\]<span data-label="fig:generalized-matching"></span>](sota-ressources/threeshold.png)

@@ -1145,12 +1169,6 @@ A question remains, about “How to select good and batch matches”. Ratio appr

<span>0.8</span> <img src="sota-ressources/outputs-evaluation/orb_max/swedish_bank_good_match.png" title="fig:" alt="Results - ORB - max version" />

<span>0.51</span> <img src="sota-ressources/outputs-evaluation/orb_max/KBC_perfect_match.png" title="fig:" alt="Results - ORB - max version" />

<span>0.48</span> <img src="sota-ressources/outputs-evaluation/orb_max/Microsoft_good_match_threeshold.png" title="fig:" alt="Results - ORB - max version" />

<span>0.8</span> <img src="sota-ressources/outputs-evaluation/orb_max/swedish_bank_good_match.png" title="fig:" alt="Results - ORB - max version" />

The following pictures show drawbacks of the ORB algorithm. As a first “general overview” of some matching results, a few examples with commentary are given below. Tests were performed in the “ratio” configuration; however, the drawbacks generalize to the other submethods.
A few tips to analyze the following pictures:

@@ -1182,6 +1200,34 @@ Few tips to analyze following pictures :

<span>0.48</span> <img src="sota-ressources/outputs-evaluation/orb_matches/whitepagetextproblem.png" title="fig:" alt="Results - ORB - drawbacks examples 3/3" />

RANSAC outputs a homography between the request picture (the one which should be labelled) and each picture of the database. Please note that ’transformation’ refers to the transformation matrix that RANSAC outputs, which determines which matches are inliers and which are outliers. A ’strong transformation’ is a significant rotation/translation/scale-up or -down/deformation. A ’light transformation’ is a near-direct translation, without rotation, scaling or deformation. See Figures \[fig:ransac1\] and \[fig:ransac2\].

The transformation matrix is equivalent to the transformation that should be applied to the request picture to “fit” the current picture of the database it is compared to. Displaying the request picture with its transformation applied gives an idea of “how much” the request picture should be transformed. If the transformation is strong (high distortion), then the match is probably weak. See Figure \[fig:matrixtransformation\].
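
A sketch of that visualisation (assuming `homography` is the matrix returned by RANSAC, and `request_img` / `database_img` are the pictures already loaded):

```python
# Apply the RANSAC homography to the request picture to visualise "how much"
# it has to be deformed to fit the database picture.
import cv2

h, w = database_img.shape[:2]
warped_request = cv2.warpPerspective(request_img, homography, (w, h))

# A near-identity warp suggests a plausible match; a heavily distorted result
# suggests the match should be rejected or sent to review.
cv2.imwrite("warped_request.png", warped_request)
```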

<span>0.45</span> <img src="sota-ressources/outputs-evaluation/RANSAC_ORB/good_match.png" title="fig:" alt="Results - ORB - RANSAC Filtering - No matrix filter" />

<span>0.54</span> <img src="sota-ressources/outputs-evaluation/RANSAC_ORB/Outlook96.png" title="fig:" alt="Results - ORB - RANSAC Filtering - No matrix filter" />

<span>0.5</span> <img src="sota-ressources/outputs-evaluation/RANSAC_ORB/Outlook_impressive_bis.png" title="fig:" alt="Results - ORB - RANSAC Filtering - No matrix filter" />

<span>0.48</span> <img src="sota-ressources/outputs-evaluation/RANSAC_ORB/threshold_clear.png" title="fig:" alt="Results - ORB - RANSAC Filtering - No matrix filter" />

<span>1</span> <img src="sota-ressources/outputs-evaluation/RANSAC_ORB/avoidable_example.png" title="fig:" alt="Issues - ORB - RANSAC Filtering - No matrix filter" />

<span>1</span> <img src="sota-ressources/outputs-evaluation/RANSAC_ORB/problem_ransac_block_text.png" title="fig:" alt="Issues - ORB - RANSAC Filtering - No matrix filter" />

<span>1</span> <img src="sota-ressources/outputs-evaluation/RANSAC_ORB/very_good_matching_ransac.png" title="fig:" alt="Issues - ORB - RANSAC Filtering - No matrix filter" />

<span>0.9</span> <img src="sota-ressources/outputs-evaluation/RANSAC_Matrix_ORB/easy_match_low_distortion_microsoft.png" title="fig:" alt="Matrix transformation visualisation - ORB - RANSAC Filtering - Visualisation of transformation matrix applied to request picture. From left to right : database picture (example), target picture (request), deformed target picture thanks to RANSAC transformation matrix " />

<span>0.9</span> <img src="sota-ressources/outputs-evaluation/RANSAC_Matrix_ORB/medium_distortion.png" title="fig:" alt="Matrix transformation visualisation - ORB - RANSAC Filtering - Visualisation of transformation matrix applied to request picture. From left to right : database picture (example), target picture (request), deformed target picture thanks to RANSAC transformation matrix " />

<span>0.9</span> <img src="sota-ressources/outputs-evaluation/RANSAC_Matrix_ORB/flipped_picture.png" title="fig:" alt="Matrix transformation visualisation - ORB - RANSAC Filtering - Visualisation of transformation matrix applied to request picture. From left to right : database picture (example), target picture (request), deformed target picture thanks to RANSAC transformation matrix " />

<span>0.9</span> <img src="sota-ressources/outputs-evaluation/RANSAC_Matrix_ORB/butterfly_configuration.png" title="fig:" alt="Matrix transformation visualisation - ORB - RANSAC Filtering - Visualisation of transformation matrix applied to request picture. From left to right : database picture (example), target picture (request), deformed target picture thanks to RANSAC transformation matrix " />

<span>0.9</span> <img src="sota-ressources/outputs-evaluation/RANSAC_Matrix_ORB/obvious_mismatch_text.png" title="fig:" alt="Matrix transformation visualisation - ORB - RANSAC Filtering - Visualisation of transformation matrix applied to request picture. From left to right : database picture (example), target picture (request), deformed target picture thanks to RANSAC transformation matrix " />

<img src="sota-ressources/memory_consumption/ORB_only_PNG_1/ORB_Memory.png" alt="Memory consumption of ORB. 84 Mo of pictures are loaded and kept in memory for debug and output purposes. Each spike is a different ORB configuration trial. About 60 configurations are successfully tested. Overhead of the framework is displayed page Figure [fig:frameworkmemory]" />

#### BRIEF – Binary Robust Independent Elementary Features
Binary file modified SOTA/SOTA.pdf
Binary file not shown.
Binary file added SOTA/sota-ressources/Bow_explanations.jpeg
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
2 changes: 2 additions & 0 deletions lib_testing_area/OpenCV/bow.py
@@ -35,6 +35,8 @@ def __init__(self, conf: configuration.BoW_ORB_default_configuration):
self.Local_Picture_class_ref = Local_Picture
self.conf = conf

# self.printer = Custom_printer(self.conf)

# ===================================== ALGORITHM TYPE =====================================
self.algo = cv2.ORB_create(nfeatures=conf.ORB_KEYPOINTS_NB)
# SIFT, BRISK, SURF, .. # Available to change nFeatures=1000 for example. Limited to 500 by default