# <center> Reinforcement Learning in Computer Vision</center>

## Contents

* We will look at three Computer Vision tasks:
    
    * Object Detection
    * Action Detection
    * Visual Tracking

* For each task we will try to answer these questions:
    * What is the task?
    * Can we identify the RL components?
        * The MDP formulation
            * State Space
            * Action Space
            * Reward System
        * Network Architecture
    * Why use RL for this task?

## <center> Task 1: Object Detection </center>


<center><img src="img/1_1.png" alt="Example1"/></center>

<center><img src="img/1.png" alt="Example1"/></center>

## What is the task?

<center><img src="img/5.png" alt="Example1"/></center>

Some key points:
* This is one of the fundamental Computer Vision tasks.
* It has so far been treated as a supervised learning problem.
* Some popular approaches:
    * Selective Search
    * CPMC
    * Edge Boxes (based on sliding windows)
    * R-CNN
    * Fast R-CNN
    * Faster R-CNN
    * YOLO
    * ...
    

## What's the idea here?

* A class-specific active detection model that learns to **localize target objects** known by the system. 
* The model follows a **top-down search strategy**: it starts by analyzing the whole scene and then narrows down to the correct location of each object.
* This is achieved by applying a **sequence of transformations** to a box that initially covers a large region of the image and is finally reduced to a tight bounding box.


## Example Output
<center><img src="img/1_2.png" alt="Example1"/></center>

## How to look at it as an RL problem?

**Question 1**: How to think of this as a **Sequential Decision Making** Problem?


**Answer 1**: At each time step, the agent should decide in which region of the image to focus its attention so that it can find objects in a few steps.

**Question 2**: How to cast this as a **Markov Decision Process**?

**Answer 2**: We cast object localization as a Markov decision process (MDP) because this setting provides a formal framework for modeling an agent that makes a sequence of decisions. We then identify the components of the MDP: a set of actions A, a set of states S, and a reward function R.

## Identifying MDP Parameters

### Action Space?

The **set of actions A** is composed of **eight transformations** that can be applied to the box and one action to terminate the search process.
<center><img src="img/1_3.png" alt="Example1"/></center>
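The action set can be sketched in code as follows. This is a minimal illustration, not the original implementation: the action names, the `(x1, y1, x2, y2)` box layout, and the relative step size `ALPHA = 0.2` are assumptions made for this example, and the sketch does not clip the box to the image bounds.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout

# Eight box transformations plus one terminating action (names are hypothetical).
ACTIONS = ("right", "left", "up", "down",
           "bigger", "smaller", "fatter", "taller", "trigger")
ALPHA = 0.2  # assumed step size, relative to the current box width/height


def apply_action(box: Box, action: str, alpha: float = ALPHA) -> Box:
    """Return the box obtained by applying one action to `box`."""
    x1, y1, x2, y2 = box
    dw = alpha * (x2 - x1)  # horizontal step, proportional to box width
    dh = alpha * (y2 - y1)  # vertical step, proportional to box height
    if action == "right":
        x1, x2 = x1 + dw, x2 + dw
    elif action == "left":
        x1, x2 = x1 - dw, x2 - dw
    elif action == "up":
        y1, y2 = y1 - dh, y2 - dh
    elif action == "down":
        y1, y2 = y1 + dh, y2 + dh
    elif action == "bigger":    # grow in both directions around the centre
        x1, x2, y1, y2 = x1 - dw / 2, x2 + dw / 2, y1 - dh / 2, y2 + dh / 2
    elif action == "smaller":   # shrink in both directions around the centre
        x1, x2, y1, y2 = x1 + dw / 2, x2 - dw / 2, y1 + dh / 2, y2 - dh / 2
    elif action == "fatter":    # reduce height, so the box becomes wider in ratio
        y1, y2 = y1 + dh / 2, y2 - dh / 2
    elif action == "taller":    # reduce width, so the box becomes taller in ratio
        x1, x2 = x1 + dw / 2, x2 - dw / 2
    elif action == "trigger":   # terminate the search; the box is unchanged
        pass
    else:
        raise ValueError(f"unknown action: {action}")
    return (x1, y1, x2, y2)
```

Making each step proportional to the current box size means the moves get finer as the box shrinks, which suits a top-down search that starts from a large region.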

### State Space?

* State representation = tuple (o, h)
* o = feature vector of the observed region
    * The fc6 layer output gives a 4,096-dimensional feature vector representing the region's content
* h = vector with the history of taken actions
    * The history vector encodes the 10 most recent actions
    * Each action is encoded as a 9-dimensional binary vector (90 dimensions in total)
    
Motivation behind the history vector: it helps stabilize search trajectories that might otherwise get stuck in repetitive cycles, improving average precision by approximately 3 percentage points.
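To make the dimensions concrete, here is a small sketch of how such a state could be assembled. The class and function names are hypothetical, the feature vector is passed in as a plain list (the feature extractor itself is out of scope), and the one-hot reading of the "binary vector" and the ordering of the history (most recent action last) are assumptions of this example.

```python
from collections import deque
from typing import List, Sequence

NUM_ACTIONS = 9      # eight transformations + trigger
HISTORY_LEN = 10     # number of past actions remembered
FEATURE_DIM = 4096   # size of the fc6 feature vector


def one_hot(action_index: int, size: int = NUM_ACTIONS) -> List[float]:
    """Binary vector with a single 1 at the position of the taken action."""
    if not 0 <= action_index < size:
        raise ValueError("action index out of range")
    vec = [0.0] * size
    vec[action_index] = 1.0
    return vec


class ActionHistory:
    """Fixed-length record of the most recent actions (oldest dropped first)."""

    def __init__(self) -> None:
        empty = [[0.0] * NUM_ACTIONS for _ in range(HISTORY_LEN)]
        self._steps = deque(empty, maxlen=HISTORY_LEN)

    def push(self, action_index: int) -> None:
        self._steps.append(one_hot(action_index))

    def vector(self) -> List[float]:
        # Flatten 10 x 9 entries into a single 90-dimensional vector.
        return [v for step in self._steps for v in step]


def build_state(features: Sequence[float], history: ActionHistory) -> List[float]:
    """Concatenate region features o and history h into one state vector."""
    if len(features) != FEATURE_DIM:
        raise ValueError("expected a 4096-dimensional feature vector")
    return list(features) + history.vector()
```

With these sizes the state has 4,096 + 10 × 9 = 4,186 entries.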

### Rewards?

Motivation:

* Reward function R is proportional to the improvement that the agent makes to localize an object after selecting a particular action
* Measured using the Intersection-over-Union (IoU) between the target object and the predicted box at any given time


<center><img src="img/1_10.png" alt="Example1"/></center>
<center><img src="img/1_11.png" alt="Example1"/></center>
<center><img src="img/1_12.png" alt="Example1"/></center>

where

b = the box of the observed region, and

g = the ground-truth box for a target object

Explanation:
* Given a state s, actions that result in a **higher IoU with the ground truth** are **rewarded**; otherwise they are penalised.
* This **reward scheme is binary**, r ∈ {−1, +1}, and applies to any action that transforms the box.
* For the trigger action, the reward is positive if the final IoU with the ground truth is greater than a certain threshold, and negative otherwise.
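The reward rules above can be sketched as follows. This is an illustrative version under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, an action that leaves the IoU unchanged is treated as penalised, and the trigger threshold `tau` and trigger reward magnitude `eta` are placeholder values chosen for the example, not values given in this text.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def area(box: Box) -> float:
    """Area of a box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(b: Box, g: Box) -> float:
    """Intersection-over-Union between box b and ground-truth box g."""
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(b) + area(g) - inter
    return inter / union if union > 0 else 0.0


def transform_reward(b: Box, b_next: Box, g: Box) -> float:
    """Binary reward for a box-transforming action.

    +1 if the IoU with the ground truth improved, -1 otherwise
    (treating "no change" as -1 is an assumption of this sketch).
    """
    return 1.0 if iou(b_next, g) > iou(b, g) else -1.0


def trigger_reward(b: Box, g: Box, tau: float = 0.6, eta: float = 3.0) -> float:
    """Reward for the terminating action: +eta if the final IoU reaches the
    threshold tau, -eta otherwise (tau and eta are assumed values)."""
    return eta if iou(b, g) >= tau else -eta
```

Because the transformation reward depends only on the sign of the IoU change, the agent gets a clear signal at every step even when a single move improves the overlap only slightly.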

## Network Architecture
<center><img src="img/1_4.png" alt="Example1"/></center>

## Some Quantitative Results

<center><img src="img/1_5.png" alt="Example1"/></center>

## Some Qualitative Results


<center><img src="img/1_8.png" alt="Example1"/></center>

<center><img src="img/1_9.png" alt="Example1"/></center>

## Why use RL here?

<center><img src="img/1_6.png" alt="Example1"/></center>

<center><img src="img/1_7.png" alt="Example1"/></center>