# <div align="center">Selective Search</div>
---------------------------------------------------------------------

You can find me on GitHub:
> ###### [ GitHub](https://github.com/lev1khachatryan)
  
  
<img src="asset/7_1/main.jpg" />

This notebook addresses the problem of ***generating possible object locations*** for use in object recognition. We introduce ***Selective Search***
which combines the strength of both an ***exhaustive search*** and ***segmentation***. Like segmentation, we use the image structure to guide
our sampling process. Like exhaustive search, we aim to capture
all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a
variety of complementary image partitionings to deal with as many
image conditions as possible. Our Selective Search results in a
small set of data-driven, class-independent, high quality locations,
yielding 99% recall and a Mean Average Best Overlap of 0.879 at
10,097 locations. The reduced number of locations compared to
an exhaustive search enables the use of stronger machine learning
techniques and stronger appearance models for object recognition.
In this notebook we show that selective search enables the use of
the powerful ***Bag-of-Words model for recognition***. The Selective
Search software is made publicly available.

# <div align="center">1 Introduction</div>
---------------------------------------------------------------------

For a long time, objects were sought to be delineated before their
identification. This gave rise to segmentation, which aims for
a unique partitioning of the image through a generic algorithm,
where there is one part for all object silhouettes in the image. Research on this topic has yielded tremendous progress over the past
years. But images are intrinsically hierarchical: In
Figure 1a the salad and spoons are inside the salad bowl, which in
turn stands on the table. Furthermore, depending on the context the
term table in this picture can refer to only the wood or include everything on the table. Therefore both the nature of images and the
different uses of an object category are hierarchical. This prohibits
the unique partitioning of objects for all but the most specific purposes. Hence for most tasks multiple scales in a segmentation are a
necessity. This is most naturally addressed by using a hierarchical
partitioning, as done for example by Arbelaez et al.

Besides the requirement that a segmentation be hierarchical, a generic solution for segmentation using a single strategy may not exist at all.
There are many conflicting reasons why a region should be grouped
together: In Figure 1b the cats can be separated using colour, but
their texture is the same. Conversely, in Figure 1c the chameleon is similar to its surrounding leaves in terms of colour, yet its texture differs. Finally, in Figure 1d, the wheels are wildly different
from the car in terms of both colour and texture, yet are enclosed
by the car. Individual visual features therefore cannot resolve the
ambiguity of segmentation.

<img src='asset/7_1/1.png'>

Figure 1: There is a high variety of reasons that an image region
forms an object. In (b) the cats can be distinguished by colour, not
texture. In (c) the chameleon can be distinguished from the surrounding leaves by texture, not colour. In (d) the wheels can be part
of the car because they are enclosed, not because they are similar
in texture or colour. Therefore, to find objects in a structured way
it is necessary to use a variety of diverse strategies. Furthermore,
an image is intrinsically hierarchical as there is no single scale for
which the complete table, salad bowl, and salad spoon can be found
in (a).

And, finally, there is a more fundamental problem. Regions with
very different characteristics, such as a face over a sweater, can
only be combined into one object after it has been established that
the object at hand is a human. Hence without prior recognition it is
hard to decide that a face and a sweater are part of one object.
This has led to the opposite of the traditional approach: to do
localisation through the identification of an object. This recent approach in object recognition has made enormous progress in less
than a decade. With an appearance model learned
from examples, an exhaustive search is performed where every location within the image is examined so as not to miss any potential
object location.

However, the exhaustive search itself has several drawbacks.
Searching every possible location is computationally infeasible.
The search space has to be reduced by using a regular grid, fixed
scales, and fixed aspect ratios. In most cases the number of locations to visit remains huge, so much that alternative restrictions
need to be imposed. The classifier is simplified and the appearance
model needs to be fast. Furthermore, a uniform sampling yields
many boxes for which it is immediately clear that they are not supportive of an object. Rather than sampling locations blindly using
an exhaustive search, a key question is: Can we steer the sampling
by a data-driven analysis?
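To get a feel for why the search space stays large even on a regular grid, the following sketch counts the windows a sliding-window search would visit. The function name, the base window size, and the example scales, aspect ratios, and stride are all illustrative assumptions, not values from the original work.

```python
def count_sliding_windows(img_w, img_h, base_size, scales, aspect_ratios, stride):
    """Count the windows visited by a grid-based sliding-window search.

    aspect_ratios are width / height; stride is the grid step in pixels.
    """
    total = 0
    for scale in scales:
        for ratio in aspect_ratios:
            # Window size for this scale and aspect ratio (area stays scale-driven).
            win_w = int(round(base_size * scale * ratio ** 0.5))
            win_h = int(round(base_size * scale / ratio ** 0.5))
            if win_w > img_w or win_h > img_h:
                continue  # this window does not fit inside the image
            steps_x = (img_w - win_w) // stride + 1
            steps_y = (img_h - win_h) // stride + 1
            total += steps_x * steps_y
    return total


if __name__ == "__main__":
    # Hypothetical settings for a 500 x 375 image.
    n = count_sliding_windows(500, 375, 64, [1, 2, 4], [0.5, 1.0, 2.0], 8)
    print("windows visited:", n)
```

Halving the stride roughly quadruples the count, which is why grid-based searches are usually paired with cheap features and simple classifiers.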

In this notebook, we aim to combine the best of the intuitions of segmentation and exhaustive search and propose a data-driven selective search. Inspired by bottom-up segmentation, we aim to exploit
the structure of the image to generate object locations. Inspired by
exhaustive search, we aim to capture all possible object locations.
Therefore, instead of using a single sampling technique, we aim
to diversify the sampling techniques to account for as many image
conditions as possible. Specifically, we use a data-driven grouping-based strategy where we increase diversity by using a variety of
complementary grouping criteria and a variety of complementary
colour spaces with different invariance properties. The set of locations is obtained by combining the locations of these complementary partitionings. Our goal is to generate a class-independent,
data-driven, selective search strategy that generates a small set of
high-quality object locations.


Our application domain of selective search is object recognition.
We therefore evaluate on the most commonly used dataset for this
purpose, the Pascal VOC detection challenge which consists of 20
object classes. The size of this dataset yields computational constraints for our selective search. Furthermore, the use of this dataset
means that the quality of locations is mainly evaluated in terms of
bounding boxes. However, our selective search applies to regions
as well and is also applicable to concepts such as “grass”.
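Since location quality is measured on bounding boxes, and the abstract reports it as a Mean Average Best Overlap, a small sketch may help make the measure concrete. It assumes boxes are given as `(x0, y0, x1, y1)` corner tuples and uses the Pascal-style overlap (intersection area divided by union area). The function names are hypothetical.

```python
def overlap(box_a, box_b):
    """Pascal-style overlap: intersection area divided by union area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def average_best_overlap(ground_truth, proposals):
    """Average, over ground-truth boxes, of the best overlap with any proposal."""
    best = [max(overlap(gt, p) for p in proposals) for gt in ground_truth]
    return sum(best) / len(best)


if __name__ == "__main__":
    gt_boxes = [(0, 0, 10, 10), (20, 20, 40, 40)]
    proposal_boxes = [(0, 0, 10, 10), (25, 20, 45, 40), (100, 100, 110, 110)]
    print(average_best_overlap(gt_boxes, proposal_boxes))
```

Averaging this quantity over all object classes would give the mean version of the measure.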

In this notebook we propose selective search for object recognition.
Our main research questions are: 
* What are good diversification strategies for adapting segmentation as a selective search strategy?


* How effective is selective search in creating a small set of high-quality locations within an image?


* Can we use selective search to employ more powerful classifiers and appearance models for object recognition?


# <div align="center">2 Related Work</div>
---------------------------------------------------------------------

We confine the related work to the domain of object recognition
and divide it into three categories: 

* Exhaustive search, 


* segmentation, and 


* other sampling strategies that do not fall in either category.


## <div align="center">2.1 Exhaustive Search</div>
---------------------------------------------------------------------

As an object can be located at any position and scale in the image,
it is natural to search everywhere. However, the visual
search space is huge, making an exhaustive search computationally
expensive. This imposes constraints on the evaluation cost per location and/or the number of locations considered. Hence most of
these sliding window techniques use a coarse search grid and fixed
aspect ratios, using weak classifiers and economic image features such as HOG (Histogram of oriented gradients). This method is often used as a preselection step in a cascade of classifiers.

Related to the sliding window technique is the highly successful
part-based object localisation method of Felzenszwalb et al.
Their method also performs an exhaustive search using a linear
SVM and HOG features. However, they search for objects and
object parts, whose combination results in an impressive object detection performance.

Lampert et al. proposed using the appearance model to
guide the search. This both alleviates the constraints of using a
regular grid, fixed scales, and fixed aspect ratios, and at the same
time reduces the number of locations visited. This is done by directly searching for the optimal window within the image using a
branch and bound technique. While they obtain impressive results
for linear classifiers, it has been found that for non-linear classifiers the
method in practice still visits over 100,000 windows per image.

Instead of a blind exhaustive search or a branch and bound
search, we propose selective search. We use the underlying image structure to generate object locations. In contrast to the discussed methods, this yields a completely class-independent set of
locations. Furthermore, because we do not use a fixed aspect ratio, our method is not limited to objects but should be able to find
stuff like “grass” and “sand” as well. Finally, we hope to generate fewer locations, which should make the
problem easier as the variability of samples becomes lower. And
more importantly, it frees up computational power which can be
used for stronger machine learning techniques and more powerful
appearance models.

## <div align="center">2.2 Segmentation</div>
---------------------------------------------------------------------

Both ***Carreira and Sminchisescu*** and ***Endres and Hoiem*** propose to generate a set of class independent object hypotheses using
segmentation. Both methods generate multiple foreground/background segmentations, learn to predict the likelihood that a foreground segment is a complete object, and use this to rank the segments. Both algorithms show a promising ability to accurately
delineate objects within images, as confirmed by follow-up work that achieves
state-of-the-art results on pixel-wise image classification.
As common in segmentation, both methods rely on a single strong
algorithm for identifying good regions. They obtain a variety of
locations by using many randomly initialised foreground and background seeds. In contrast, we explicitly deal with a variety of image
conditions by using different grouping criteria and different representations. This means a lower computational investment as we do
not have to invest in the single best segmentation strategy, such as
using the excellent yet expensive contour detector of [3]. Furthermore, as we deal with different image conditions separately, we
expect our locations to have a more consistent quality. Finally, our
selective search paradigm dictates that the most interesting question is not how our regions compare to those of these methods, but rather how they
can complement each other.

Gu et al. address the problem of carefully segmenting and
recognizing objects based on their parts. They first generate a set
of part hypotheses using a grouping method based on Arbelaez et al. Each part hypothesis is described by both appearance and
shape features. Then, an object is recognized and carefully delineated by using its parts, achieving good results for shape recognition. In their work, the segmentation is hierarchical and yields segments at all scales. However, they use a single grouping strategy whose power of discovering parts or objects is left unevaluated. In
this work, we use multiple complementary strategies to deal with
as many image conditions as possible. We include the locations
generated in our evaluation.

## <div align="center">2.3 Other Sampling Strategies</div>
---------------------------------------------------------------------

Alexe et al. address the problem of the large sampling space
of an exhaustive search by proposing to search for any object, independent of its class. In their method they train a classifier on the
object windows of those objects which have a well-defined shape
(as opposed to stuff like “grass” and “sand”). Then instead of a full
exhaustive search they randomly sample boxes to which they apply
their classifier. The boxes with the highest “objectness” measure
serve as a set of object hypotheses. This set is then used to greatly
reduce the number of windows evaluated by class-specific object
detectors. We compare our method with their work.


Another strategy is to use visual words of the Bag-of-Words
model to predict the object location. Vedaldi et al. use jumping
windows, in which the relation between individual visual words
and the object location is learned to predict the object location in
new images. Maji and Malik combine multiple of these relations to predict the object location using a Hough-transform, after
which they randomly sample windows close to the Hough maximum. In contrast to learning, we use the image structure to sample
a set of class-independent object hypotheses.

To summarize, our novelty is as follows. Instead of an exhaustive search we use segmentation as selective search
yielding a small set of class-independent object locations. In contrast to segmentation approaches, instead of focusing on the best
segmentation algorithm, we use a variety of strategies to deal
with as many image conditions as possible, thereby severely reducing computational costs while potentially capturing more objects
accurately. Instead of learning an objectness measure on randomly
sampled boxes, we use a bottom-up grouping procedure to generate good object locations.

# <div align="center">3 Selective Search</div>
---------------------------------------------------------------------

In this section we detail the selective search algorithm for object
recognition and present a variety of diversification strategies to deal
with as many image conditions as possible. A selective search algorithm is subject to the following design considerations:

<img src='asset/7_1/2.png'>

Figure 2: Two examples of our selective search showing the necessity of different scales. On the left we find many objects at different
scales. On the right we necessarily find the objects at different scales as the girl is contained by the TV.

***Capture All Scales***. Objects can occur at any scale within the image. Furthermore, some objects have less clear boundaries
than other objects. Therefore, in selective search all object
scales have to be taken into account, as illustrated in Figure 2. This is most naturally achieved by using a hierarchical
algorithm.

***Diversification***. There is no single optimal strategy to group regions together. As observed earlier in Figure 1, regions may
form an object because of only colour, only texture, or because
parts are enclosed. Furthermore, lighting conditions such as
shading and the colour of the light may influence how regions
form an object. Therefore instead of a single strategy which
works well in most cases, we want to have a diverse set of
strategies to deal with all cases.


***Fast to Compute***. The goal of selective search is to yield a set of
possible object locations for use in a practical object recognition framework. The creation of this set should not become a
computational bottleneck, hence our algorithm should be reasonably fast.


## <div align="center">3.1 Selective Search by Hierarchical Grouping</div>
---------------------------------------------------------------------

We take a hierarchical grouping algorithm to form the basis of our
selective search. Bottom-up grouping is a popular approach to segmentation, hence we adapt it for selective search. Because
the process of grouping itself is hierarchical, we can naturally generate locations at all scales by continuing the grouping process until
the whole image becomes a single region. This satisfies the condition of capturing all scales.

As regions can yield richer information than pixels, we want to
use region-based features whenever possible. To get a set of small
starting regions which ideally do not span multiple objects, we use the fast method of Felzenszwalb and Huttenlocher, which
has been found well-suited for this purpose.

Our grouping procedure now works as follows. We first use the method of Felzenszwalb and Huttenlocher (***Efficient Graph-Based Image Segmentation***) to create initial regions. Then we use a greedy algorithm to iteratively group regions together: first, the similarities between all
neighbouring regions are calculated. The two most similar regions
are grouped together, and new similarities are calculated between
the resulting region and its neighbours. The process of grouping
the most similar regions is repeated until the whole image becomes
a single region. The general method is detailed in Algorithm 1.

<img src='asset/7_1/3.png'>
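The greedy grouping loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: regions are stored in a dict keyed by integer id, adjacency is passed in as explicit pairs, and `similarity` and `merge` are caller-supplied functions. The toy regions carry only a size and a mean intensity, which is a hypothetical stand-in for real region features.

```python
def hierarchical_grouping(regions, neighbours, similarity, merge):
    """Greedily merge the most similar neighbouring regions until none remain.

    regions:    dict mapping region id (int) -> region features
    neighbours: iterable of (id_a, id_b) pairs of adjacent regions
    similarity: function(features_a, features_b) -> float
    merge:      function(features_a, features_b) -> merged features
    Returns every region seen: initial ones first, then merged ones in order.
    """
    regions = dict(regions)  # work on a copy
    # Similarity set: one entry per pair of neighbouring regions.
    sims = {frozenset(pair): similarity(regions[pair[0]], regions[pair[1]])
            for pair in neighbours}
    hierarchy = list(regions.values())
    next_id = max(regions) + 1
    while sims:
        # Pick the two most similar neighbouring regions.
        a, b = max(sims, key=sims.get)
        merged = merge(regions[a], regions[b])
        # Remove similarities involving either region, remembering their neighbours.
        touching = set()
        for pair in [p for p in sims if a in p or b in p]:
            touching.update(pair)
            del sims[pair]
        touching -= {a, b}
        del regions[a], regions[b]
        regions[next_id] = merged
        # Compute similarities between the new region and its neighbours.
        for n in touching:
            sims[frozenset((next_id, n))] = similarity(merged, regions[n])
        hierarchy.append(merged)
        next_id += 1
    return hierarchy


def mean_similarity(a, b):
    """Toy similarity: regions with closer mean intensity are more similar."""
    return -abs(a["mean"] - b["mean"])


def merge_means(a, b):
    """Toy merge: size-weighted mean, computed from region features only."""
    size = a["size"] + b["size"]
    mean = (a["size"] * a["mean"] + b["size"] * b["mean"]) / size
    return {"size": size, "mean": mean}


if __name__ == "__main__":
    initial = {0: {"size": 1, "mean": 0.0},
               1: {"size": 1, "mean": 0.1},
               2: {"size": 1, "mean": 1.0}}
    for region in hierarchical_grouping(initial, [(0, 1), (1, 2)],
                                        mean_similarity, merge_means):
        print(region)
```

Every region in the returned list, not just the final one, is a candidate object location, which is how the hierarchy yields proposals at all scales.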

For the similarity $s(r_i, r_j)$ between regions $r_i$ and $r_j$ we want a variety of complementary measures under the constraint that they are
fast to compute. In effect, this means that the similarities should be
based on features that can be propagated through the hierarchy, i.e.
when merging regions $r_i$ and $r_j$ into $r_t$, the features of region $r_t$ need
to be calculated from the features of $r_i$ and $r_j$ without accessing the
image pixels.
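One way such propagation can work is with normalised histograms: the histogram of a merged region is the size-weighted average of the histograms of its two parts, so no pixel access is needed. The sketch below assumes each region stores a normalised histogram and a pixel count. The function name is hypothetical.

```python
def merge_histograms(hist_i, size_i, hist_j, size_j):
    """Combine two normalised region histograms without touching image pixels.

    Returns the histogram and size of the merged region.
    """
    size_t = size_i + size_j
    hist_t = [(size_i * h_i + size_j * h_j) / size_t
              for h_i, h_j in zip(hist_i, hist_j)]
    return hist_t, size_t


if __name__ == "__main__":
    # A 1-pixel region and a 3-pixel region with two-bin histograms.
    print(merge_histograms([1.0, 0.0], 1, [0.0, 1.0], 3))
```

Because the inputs are normalised and the weights sum to one, the merged histogram stays normalised.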

## <div align="center">3.2 Diversification Strategies</div>
---------------------------------------------------------------------

The second design criterion for selective search is to diversify the
sampling and create a set of complementary strategies whose locations are combined afterwards. We diversify our selective search
(1) by using a variety of colour spaces with different invariance
properties, (2) by using different similarity measures $s_{ij}$, and (3)
by varying our starting regions.
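To illustrate what "different invariance properties" means, the sketch below computes a few colour representations of a single pixel using only the standard-library `colorsys` module. The particular set of representations and the function name are illustrative assumptions. The point is only that some representations change when the light gets dimmer while others do not.

```python
import colorsys


def colour_representations(r, g, b):
    """Return several representations of one pixel; r, g, b are in [0, 1]."""
    total = r + g + b
    # Normalised rg: divides out the overall intensity.
    norm_rg = (r / total, g / total) if total > 0 else (0.0, 0.0)
    return {
        "RGB": (r, g, b),                          # sensitive to intensity
        "I": total / 3.0,                          # intensity only
        "HSV": colorsys.rgb_to_hsv(r, g, b),       # hue is unaffected by scaling
        "rg": norm_rg,                             # unaffected by scaling
    }


if __name__ == "__main__":
    bright = colour_representations(0.8, 0.4, 0.2)
    dark = colour_representations(0.4, 0.2, 0.1)  # same surface, half the light
    for name in bright:
        print(name, bright[name], dark[name])
```

Running the grouping once per representation and pooling the resulting locations is the kind of diversification the text describes.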

# <div align="center">4 Use Selective search for region proposals</div>
---------------------------------------------------------------------

As mentioned above, in selective search we start with many tiny initial regions and use a greedy algorithm to grow them. First we locate the two most similar regions and merge them. The similarity $S$ between regions $a$ and $b$ is defined as:

$S(a,b)=S_{texture}(a,b)+S_{size}(a,b)$

where $S_{texture}(a,b)$ measures the ***visual similarity***, and $S_{size}(a,b)$ prefers ***merging smaller regions together*** to prevent a single region from gobbling up all others one by one.
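A minimal sketch of this two-term similarity follows. It assumes the definitions used in the selective search paper: $S_{texture}$ as the intersection of normalised texture histograms, and $S_{size}$ as one minus the fraction of the image the two regions jointly occupy. The dict layout of a region (`"hist"`, `"size"`) is a hypothetical choice for illustration.

```python
def texture_similarity(hist_a, hist_b):
    """Histogram intersection of two normalised texture histograms."""
    return sum(min(x, y) for x, y in zip(hist_a, hist_b))


def size_similarity(size_a, size_b, image_size):
    """Close to 1 for two small regions, close to 0 when they fill the image."""
    return 1.0 - (size_a + size_b) / image_size


def similarity(region_a, region_b, image_size):
    """S(a, b) = S_texture(a, b) + S_size(a, b)."""
    return (texture_similarity(region_a["hist"], region_b["hist"])
            + size_similarity(region_a["size"], region_b["size"], image_size))


if __name__ == "__main__":
    a = {"hist": [0.5, 0.5], "size": 10}
    b = {"hist": [0.25, 0.75], "size": 10}
    print(similarity(a, b, image_size=100))
```

With identical histograms, two small regions score higher than a small and a large one, so small regions tend to be merged first.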

We continue merging regions until everything is combined into a single region. The first row shows how the regions grow, and the blue rectangles in the second row show all the region proposals made during the merging. The green rectangles are the target objects that we want to detect.

<img src='asset/7_2/4.png'>

# From Selective Search to R-CNN (Introduction)

### Warping

For every region proposal, we use a CNN to extract features. Since the CNN takes a fixed-size input, we warp each proposed region into a 227 x 227 RGB image.
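Warping here means stretching the proposal to the target size regardless of its aspect ratio. The sketch below does this with nearest-neighbour sampling on an image stored as nested lists. The function name and the `(x0, y0, x1, y1)` box layout (with exclusive `x1`, `y1`) are assumptions. A real pipeline would typically use an image library with better interpolation, and the original R-CNN also pads the box with some surrounding context, which is omitted here.

```python
def warp_region(image, box, out_size=227):
    """Crop `box` from `image` and stretch it to out_size x out_size.

    image: list of rows, each a list of pixels (e.g. (r, g, b) tuples)
    box:   (x0, y0, x1, y1), with x1 and y1 exclusive
    Uses nearest-neighbour sampling, so the aspect ratio is not preserved.
    """
    x0, y0, x1, y1 = box
    box_w, box_h = x1 - x0, y1 - y0
    warped = []
    for out_y in range(out_size):
        src_y = y0 + out_y * box_h // out_size
        row = []
        for out_x in range(out_size):
            src_x = x0 + out_x * box_w // out_size
            row.append(image[src_y][src_x])
        warped.append(row)
    return warped


if __name__ == "__main__":
    # Tiny 6 x 4 test image whose pixels record their own coordinates.
    img = [[(x, y) for x in range(6)] for y in range(4)]
    patch = warp_region(img, (1, 1, 4, 3))
    print(len(patch), len(patch[0]), patch[0][0], patch[-1][-1])
```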

<img src='asset/7_2/5.jpg'>

### Extracting features with a CNN

The warped image is then processed by a CNN to extract a 4096-dimensional feature vector:

<img src='asset/7_2/6.jpg'>

### Classification

We then apply an SVM classifier to identify the object:
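As a rough sketch of this step, assume one linear SVM per class, each scoring the feature vector as a weighted sum plus a bias. The function name, the toy weights, and the rule of labelling a region "background" when no class scores above zero are simplifying assumptions for illustration. Real features would have 4096 dimensions and learned weights.

```python
def classify_region(feature, class_weights, class_biases):
    """Score a feature vector with one linear SVM per class.

    Returns (label, scores). The label is "background" when no class
    scores above zero (a simplifying assumption of this sketch).
    """
    scores = {}
    for name, weights in class_weights.items():
        scores[name] = sum(w * x for w, x in zip(weights, feature)) + class_biases[name]
    best = max(scores, key=scores.get)
    if scores[best] <= 0:
        return "background", scores
    return best, scores


if __name__ == "__main__":
    # Toy 2-dimensional features and hand-picked weights.
    weights = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
    biases = {"cat": -0.5, "dog": -0.5}
    print(classify_region([1.0, 2.0], weights, biases))
    print(classify_region([0.0, 0.0], weights, biases))
```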

<img src='asset/7_2/7.jpg'>

### Putting it together

<img src='asset/7_2/8.png'>

### Bounding box regressor

The original bounding box proposal may need further refinement. We apply a regressor to calculate the final red box from the initial blue proposal region.
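The refinement can be sketched as applying four predicted offsets to the proposal. The sketch assumes the parameterisation used in the R-CNN paper: boxes are given as centre, width, and height; the centre shift is scaled by the proposal size; and the width and height change is predicted in log space. The function name is hypothetical, and the regressor that would produce the deltas is not shown.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Refine a proposal box with regressed offsets.

    proposal: (centre_x, centre_y, width, height)
    deltas:   (dx, dy, dw, dh) as predicted by a box regressor
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px          # shift the centre, scaled by box width
    gy = ph * dy + py          # shift the centre, scaled by box height
    gw = pw * math.exp(dw)     # rescale the width in log space
    gh = ph * math.exp(dh)     # rescale the height in log space
    return gx, gy, gw, gh


if __name__ == "__main__":
    blue_proposal = (50.0, 50.0, 20.0, 10.0)
    print(apply_box_deltas(blue_proposal, (0.5, 0.0, 0.1, -0.1)))
```

Zero deltas leave the proposal unchanged, so the regressor only has to learn small corrections.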

<img src='asset/7_2/9.png'>

<img src='asset/7_2/10.jpg'>

Here, the R-CNN classifies objects in a picture and produces the corresponding bounding boxes.

<img src='asset/7_2/11.png'>