# *Classification of Natural vs. Computer Generated Images*

### -------------------------------------------------------------------------------------------------------------------

## Project Overview:

* Two sets of training data are given: natural images and computer-generated (synthetic) images
* Statistics-based detection approach to classification
* Generalized detection approach:
    * feature detection
    * data fitting
    * threshold testing

## Strategies Implemented:

### I) Edge Detection
### II) Grayscale Intensity
### III) RGB Peak Density


### -------------------------------------------------------------------------------------------------------------------

### I) Edge Detection

Edge detection is considered as a means of distinguishing synthetic from natural images.
Natural images, which have greater sources of distortion and imperfect lighting, are expected to have a greater number of pixels detectable as edges.
This section will present the basic approach used including the training of models for hypothesis testing.
Subsequently the likelihood ratio test for image discrimination is presented and the correct classification rate is discussed.

#### Approach
Gradient-based edge detection is used.
Pixels for which the gradient exceeds a fixed threshold are classified as edges.
The threshold is determined using a training set of 23 synthetic and 23 natural images.
The following figure shows normalized histograms of the pixel gradients across all images in the data sets.
Based on the histogram, a threshold value of 4 was chosen.
All pixels whose gradients exceed the threshold are classified as edges.

<img style="float: left;" src="img1.png">

The threshold is applied to the same training set.
For uniformity, each image is resized to $(540 \times 960)$ before it is processed, ensuring that differences in edge counts are not due to differences in size.
The number of edge pixels, as determined by the threshold on the gradient, is then totalled.
Applying this metric to the synthetic and natural images separately, a histogram for the number of edge pixels in the training images is obtained.
These histograms are shown in the following figure.

<img style="float: left;" src="img2.png">
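The edge count described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the gradient is the magnitude of NumPy's central-difference gradient on a grayscale array that has already been resized to $540 \times 960$. The source does not state which gradient operator or intensity scale was used, and the function name `count_edge_pixels` is hypothetical.

```python
import numpy as np

def count_edge_pixels(gray, threshold=4.0):
    """Count pixels whose gradient magnitude exceeds `threshold`.

    `gray` is a 2-D array of grayscale intensities, assumed to be
    already resized to the common 540 x 960 shape.
    """
    gray = np.asarray(gray, dtype=float)
    # np.gradient returns the derivative along rows, then along columns.
    d_row, d_col = np.gradient(gray)
    magnitude = np.hypot(d_row, d_col)
    return int(np.count_nonzero(magnitude > threshold))

# Synthetic check images: a flat image and one with a single vertical step.
flat = np.full((540, 960), 128.0)
step = flat.copy()
step[:, 480:] = 200.0

print(count_edge_pixels(flat), count_edge_pixels(step))
```

The flat image has no pixels above the threshold, while the step image produces a thin band of edge pixels along the discontinuity.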

The histograms are coarse because of the small size of the training sets.
However, they provide a means for determining a plausible probability density function (PDF) for the synthetic and natural image sets.
The histograms have very wide tails, so Cauchy distributions are fit to the data.
The Cauchy distributions for natural and synthetic images have the following parameters:

\begin{align}
\mathrm{scenes}: x_0 = 131700, \gamma = 52984.7 \\
\mathrm{synthetic}: x_0 = 35760.8, \gamma = 5377.61
\end{align}

Here, $x_0$ indicates the location parameter (the peak and median of the distribution; a Cauchy distribution has no defined mean) and $\gamma$ the scale parameter.
Having determined approximate distributions for the number of edges in the synthetic and natural images, it is a straightforward matter to apply a likelihood ratio test for a new candidate image to classify it as synthetic or natural.
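The source does not say how the Cauchy parameters were fitted. One simple option, shown here purely as an assumption, uses the fact that the quartiles of a Cauchy distribution lie at $x_0 \pm \gamma$: the median estimates $x_0$ and half the interquartile range estimates $\gamma$. The helper names below are hypothetical.

```python
import numpy as np

def fit_cauchy_quantiles(samples):
    """Estimate Cauchy (x0, gamma) from the median and half the IQR."""
    samples = np.asarray(samples, dtype=float)
    q25, q50, q75 = np.percentile(samples, [25, 50, 75])
    return float(q50), float((q75 - q25) / 2.0)

def cauchy_pdf(x, x0, gamma):
    """Cauchy probability density with location x0 and scale gamma."""
    return 1.0 / (np.pi * gamma * (1.0 + ((x - x0) / gamma) ** 2))

# Deterministic check: evenly spaced quantiles of a known Cauchy distribution.
p = (np.arange(1001) + 0.5) / 1001
demo = 100.0 + 10.0 * np.tan(np.pi * (p - 0.5))
x0_hat, gamma_hat = fit_cauchy_quantiles(demo)
print(x0_hat, gamma_hat)
```

Quantile-based estimates are robust to the heavy tails that motivated the Cauchy model; a maximum-likelihood fit would be an alternative.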

#### Likelihood ratio test and performance

The likelihood ratio test compares the likelihood of the measured datum, $z$, under the two candidate hypotheses.
A Bayesian framework is used, so we can arbitrarily choose "synthetic" as the null hypothesis and "natural" as the test hypothesis.
In the Bayesian framework, the likelihood ratio is as follows:

\begin{equation}
\frac{p_0(z)}{p_1(z)} \underset{H_1}{\overset{H_0}{\gtrless}} \frac{Pr(H_1)}{Pr(H_0)}
\end{equation}

$Pr(H_i)$ indicates the probability of a hypothesis $H_i$ and $p_i(z)$ is the associated probability density function for the test statistic.
For simplicity, synthetic and natural images are treated as equally likely, so the likelihood ratio is compared to one.
The following figure shows the likelihood ratios for the synthetic and natural image sets.
The likelihood threshold based on the priors is plotted for comparison.
The test is conservative with respect to synthetic images: it does not misclassify them, but it fails to detect some of the natural scenes.
Better performance could likely be obtained with larger training sets and a more refined edge detection scheme.

<img style="float: left;" src="img3.png">

Using equal priors, there are zero false positives out of ninety-nine synthetic images and seven false negatives out of fifty-six natural scenes.
This is a total error rate of just about 4.5% for the whole data set.
Note that the test set includes the images used to fit the Cauchy distributions, so this error rate is likely optimistic.
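As a sketch of the test with equal priors, the snippet below evaluates the ratio $p_0(z)/p_1(z)$ using the Cauchy parameters quoted earlier and compares it to one. It then reproduces the error-rate arithmetic from the counts reported above. The function names are hypothetical, and this is not the project's original code.

```python
import math

# Cauchy parameters quoted in the text, as (x0, gamma).
SYNTHETIC = (35760.8, 5377.61)   # H0
NATURAL = (131700.0, 52984.7)    # H1

def cauchy_pdf(z, x0, gamma):
    return 1.0 / (math.pi * gamma * (1.0 + ((z - x0) / gamma) ** 2))

def likelihood_ratio(z):
    """p0(z) / p1(z); values above one favor the synthetic hypothesis."""
    return cauchy_pdf(z, *SYNTHETIC) / cauchy_pdf(z, *NATURAL)

def classify(edge_count, threshold=1.0):
    """Equal priors correspond to a threshold of one."""
    return "synthetic" if likelihood_ratio(edge_count) > threshold else "natural"

# Error rates from the reported counts.
false_positives, n_synthetic = 0, 99
false_negatives, n_natural = 7, 56
fpr = false_positives / n_synthetic
fnr = false_negatives / n_natural
total = (false_positives + false_negatives) / (n_synthetic + n_natural)
print(f"FPR {fpr:.1%}, FNR {fnr:.1%}, total {total:.1%}")
```

An image whose edge count sits at the synthetic location parameter is classified as synthetic, and one at the natural location parameter as natural.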

### -------------------------------------------------------------------------------------------------------------------

### II) Grayscale Intensity

#### Approach: 
Compute the sum of the differences between neighboring bins of the grayscale histogram. This sum describes how smooth the histogram is.
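A minimal sketch of such a smoothness metric follows. It assumes 256 bins, a histogram normalized to sum to one, and absolute differences between neighboring bins (signed differences would cancel out to the last bin minus the first). None of these choices is confirmed by the source, and the function name is hypothetical.

```python
import numpy as np

def histogram_roughness(gray, bins=256):
    """Sum of absolute differences between neighboring histogram bins."""
    gray = np.asarray(gray)
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    hist = hist / hist.sum()          # normalize so image size does not matter
    return float(np.abs(np.diff(hist)).sum())

# A single-intensity image gives a spiky histogram; a uniform ramp gives a flat one.
spiky = np.full((64, 64), 100, dtype=np.uint8)
ramp = np.tile(np.arange(256, dtype=np.uint8), (64, 1))
print(histogram_roughness(spiky), histogram_roughness(ramp))
```

Larger values indicate a rougher histogram; a perfectly flat histogram scores zero.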

#### Simple illustration of natural image histogram

<img src="Images in the project/Image_codeline3.png">

#### Fitted Image Histograms

<img src="Images in the project/Image_codeline9.png">
<img src="Images in the project/Image_codeline9(2).png">
<img src="Images in the project/Image_codeline9(3).png">

#### Result
* False Positive Rate: 9.6%
* False Negative Rate: 26.5%
* Total Error Rate: 18%

### -------------------------------------------------------------------------------------------------------------------

### III) RGB Peak Density

#### Approach

* Feature of interest: sharpness of peak values within the RGB histograms
* Feature metric: sum over the three channels of the squared maximum difference within a subhistogram around each channel's maximum


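* Code Snippet:
~~~~
    import numpy as np

    # R/G/Bsubhist: histogram bins around each channel's peak (computed earlier)
    Rmaxdiff = np.diff(Rsubhist).max()
    Gmaxdiff = np.diff(Gsubhist).max()
    Bmaxdiff = np.diff(Bsubhist).max()

    # Feature metric: sum of the squared per-channel maximum differences
    metric = Rmaxdiff**2 + Gmaxdiff**2 + Bmaxdiff**2
~~~~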
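The snippet above relies on subhistograms that are built elsewhere. One way they might be constructed is sketched below, assuming a normalized 256-bin histogram per channel and a window of a few bins on either side of the peak. The window size `half_width`, the normalization, and the function names are assumptions, not details taken from the project.

```python
import numpy as np

def max_jump_near_peak(channel, half_width=5, bins=256):
    """Largest increase between adjacent bins in a window around the peak."""
    hist, _ = np.histogram(channel, bins=bins, range=(0, 256))
    hist = hist / hist.sum()
    peak = int(np.argmax(hist))
    lo = max(0, peak - half_width)          # clip the window at the histogram ends
    hi = min(bins, peak + half_width + 1)
    subhist = hist[lo:hi]
    return float(np.diff(subhist).max())

def rgb_peak_metric(image, half_width=5):
    """Sum of squared per-channel maximum differences (image shape: H x W x 3)."""
    image = np.asarray(image)
    return sum(max_jump_near_peak(image[..., c], half_width) ** 2 for c in range(3))

# A single-color image has the sharpest possible peaks; a uniform ramp has none.
single_color = np.full((4, 4, 3), 128, dtype=np.uint8)
ramp = np.repeat(np.arange(256, dtype=np.uint8)[None, :, None], 3, axis=2)
print(rgb_peak_metric(single_color), rgb_peak_metric(ramp))
```

The absolute scale of the metric depends on the normalization, so values from this sketch are not directly comparable to the threshold reported in this section.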

#### Example of Scene RGB Histogram:

<img src="sceneRhist.png">

#### Example of Synthetic RGB Histogram:

<img src="synthRhist.png">

#### Determine Threshold Rule

##### Method Explanation:

* Distribution fit: after several attempts to fit different distributions to the metric histograms, Cauchy CDFs were chosen
* Threshold: the decision regions for the two hypotheses were set by thresholding the log-likelihood ratio. The corresponding threshold on the feature metric value was found to be 3.8e-5
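To illustrate how a threshold on the log-likelihood ratio translates into a threshold on the feature metric, the sketch below bisects for the point where the log-likelihood ratio of two Cauchy densities crosses zero. The fitted parameters for this metric are not given in the text, so the values used here are hypothetical and chosen only for illustration. They are not expected to reproduce the 3.8e-5 threshold.

```python
import math

def cauchy_logpdf(x, x0, gamma):
    return -math.log(math.pi * gamma * (1.0 + ((x - x0) / gamma) ** 2))

def llr(x, params_a, params_b):
    """Log-likelihood ratio log p_a(x) - log p_b(x)."""
    return cauchy_logpdf(x, *params_a) - cauchy_logpdf(x, *params_b)

def llr_crossing(params_a, params_b, lo, hi, iterations=100):
    """Bisect for llr == 0 on [lo, hi]; requires a sign change on the interval."""
    f_lo = llr(lo, params_a, params_b)
    if f_lo * llr(hi, params_a, params_b) > 0:
        raise ValueError("log-likelihood ratio does not change sign on [lo, hi]")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = llr(mid, params_a, params_b)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Hypothetical (x0, gamma) pairs, for illustration only.
CLASS_A = (1e-5, 5e-6)
CLASS_B = (8e-5, 3e-5)
threshold = llr_crossing(CLASS_A, CLASS_B, CLASS_A[0], CLASS_B[0])
print(threshold)
```

Because two Cauchy densities with different scales can cross more than once, the search interval is restricted to the span between the two location parameters.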

#### Feature Metric Histograms:

<img src="metrichistogram.png">

#### Fitted Distribution Plot:

<img src="fitteddists.png">

#### Performance

* False Positive Rate:    *0.415*
* False Negative Rate:    *0.163*

* Overall Error Rate:     *0.289*

### -------------------------------------------------------------------------------------------------------------------

## Results Summarized:

### I) Edge Detection
#### Total Error Rate: 4.5%

### II) Grayscale Intensity
#### Total Error Rate: 18%

### III) RGB Peak Density
#### Total Error Rate: 28.9%


### -------------------------------------------------------------------------------------------------------------------

## Conclusions:

* The intensity and RGB methods work well for computer-generated images that are dominated by a few colors; however, this reliance makes them more limited than the other methods.
* Analyzing this problem in the Fourier domain seems more robust than analyzing color or intensity sharpness. Several images in the scene set would have been very difficult to distinguish based on predominant color.
* Analysis of spatial variation also seems more robust than the intensity/RGB methods. Natural photos tend to contain more spatial noise. Derivative-like transformations amplify this noise, so it could serve as a distinguishing feature for a wider range of photos.