# Digit Detector for Real World Images

## Project Intro/Objective
The purpose of this project is to develop a multi-class classifier for digit detection and recognition in real-world images. The classifier should perform two tasks: 1) detect digits in real-world images; 2) if a digit is detected, recognize it as one of 0 to 9. The classifier needs to be invariant to scale, location, font, pose, lighting, and noise, and robust to complex scene backgrounds.

### Methods Used
* Maximally Stable Extremal Regions (MSERs)
* Connected component splitting
* Convolutional Neural Network (CNN)

### Libraries
* PyTorch 
* OpenCV
* NumPy

## Project Description
The project followed a pipeline as shown in Fig.1:

![image info](./figs/project_pipeline.png)

<div align="center"><b> Fig.1: Project Pipeline </b></div>

### MSERs Pyramid
The MSERs pyramid extends MSERs to be scale-invariant. It is built by detecting Regions of Interest (ROIs), which are MSERs in this case, on rescaled copies of the image. Non-Maximum Suppression (NMS) based on the Intersection over Union (IoU) score is then applied to the detected ROIs.
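
The merging step can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation: the `(x1, y1, x2, y2)` box format, the per-box score, the 0.5 IoU threshold, and the sample detections are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def to_original_coords(box, scale):
    """Map a box found on an image resized by `scale` back to the original image."""
    return tuple(v / scale for v in box)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: (box on the scaled image, pyramid scale, score)
detections = [
    ((10, 10, 50, 50), 1.0, 0.9),
    ((6, 6, 24, 24), 0.5, 0.8),      # same digit, found on the half-size image
    ((100, 20, 130, 60), 1.0, 0.7),  # a different digit
]
boxes = [to_original_coords(box, scale) for box, scale, _ in detections]
scores = [score for _, _, score in detections]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

Boxes are mapped back to the original image before suppression so that detections of the same digit at different pyramid levels can be compared directly.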

### Connected Component Splitting
The performance of MSERs depends on the MSERs margin. A higher margin gives good precision but poor recall, while a lower margin gives better recall but produces error-connected components (ROIs that contain multiple characters). To balance precision and recall, a method that incorporates a CNN and the MSERs tree structure with a sliding window was employed [[1]](#1).
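
The sliding-window part of this idea can be sketched as below. This is a simplified illustration only: it ignores the MSERs tree structure used in [[1]](#1), and `score_fn` is a hypothetical stand-in for the CNN's digit confidence on a window. The stride ratio and threshold are assumed values.

```python
def sliding_windows(box, stride_ratio=0.25):
    """Yield square windows (x1, y1, x2, y2) sweeping a wide box left to right."""
    x1, y1, x2, y2 = box
    side = y2 - y1
    stride = max(1, int(side * stride_ratio))
    x = x1
    while x + side <= x2:
        yield (x, y1, x + side, y2)
        x += stride


def split_component(box, score_fn, threshold=0.5):
    """Keep windows whose score is a local maximum and at least `threshold`."""
    windows = list(sliding_windows(box))
    scores = [score_fn(w) for w in windows]
    picked = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if s >= threshold and s > left and s >= right:
            picked.append(windows[i])
    return picked


def toy_score(window):
    """Hypothetical confidence: peaks when the window is centred on x=16 or x=48."""
    centre = (window[0] + window[2]) / 2
    return 1.0 - min(abs(centre - 16), abs(centre - 48)) / 16


# A 64x32 ROI that contains two side-by-side characters
parts = split_component((0, 0, 64, 32), toy_score)
print(parts)  # [(0, 0, 32, 32), (32, 0, 64, 32)]
```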

## Data
This project used the <a href="http://ufldl.stanford.edu/housenumbers/">Street View House Numbers</a> (SVHN) dataset [[2]](#2). The dataset was originally built for digit recognition and therefore only contains digit images. To be used for digit detection, it must also include non-digit images, i.e. negative training samples.

Let's create negative training samples and add them to the original dataset. Then we split the dataset into three sets: train, validation, and test.

```python
from data_create import create_train_val_test_data
create_train_val_test_data(
    'train_32x32.mat', 'digitStruct.json', './train', 
    'test_32x32.mat', 'digitStruct_test.json', './test'
)
```

This process will output three files: 
* train_data.mat
* val_data.mat
* test_data.mat
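
One way negative samples could be drawn is to pick random crops that do not touch any annotated digit box. The sketch below illustrates that idea under stated assumptions and is not the logic inside `create_train_val_test_data`: the function names, the `(x1, y1, x2, y2)` box format, and the 32-pixel crop size are hypothetical.

```python
import random


def overlaps(a, b):
    """True if boxes a and b, given as (x1, y1, x2, y2), share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_negative_boxes(image_size, digit_boxes, n, size=32, max_tries=1000, seed=0):
    """Sample up to n square crops of side `size` that avoid every digit box."""
    width, height = image_size
    rng = random.Random(seed)
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == n:
            break
        x = rng.randint(0, width - size)
        y = rng.randint(0, height - size)
        candidate = (x, y, x + size, y + size)
        if not any(overlaps(candidate, d) for d in digit_boxes):
            negatives.append(candidate)
    return negatives


# Hypothetical 200x100 image with one annotated digit
digit_boxes = [(80, 30, 120, 70)]
negatives = sample_negative_boxes((200, 100), digit_boxes, n=5)
```

The returned boxes can then be cropped from the image and labelled as the non-digit class.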

## Model
This project considered two models: Text-Attentional CNN [[3]](#3) and VGG16 [[4]](#4). 

### Text-Attentional CNN
Text-Attentional CNN is a deep neural network that particularly focuses on extracting text-related features from an image. It is a model for multi-task feature learning (MTL). Its architecture is shown in Fig.2:
![image info](./figs/Text_Attentional_CNN.png)

<p style="text-align: center;">Fig.2: The Architecture of Text-Attentional CNN</p>

### VGG16
VGG16 is a very deep CNN with 16 weight layers. Since the image size and number of classes used by the original VGG are much larger than in this project, the model is adapted to a smaller image size and 11 classes (ten digits plus a non-digit class). The adaptation shrinks the output of the average pooling layer before the fully-connected layers and decreases the number of units in the fully-connected layers.

```python
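import torch.nn as nn
from torchvision import models


class FineTuneVGG(nn.Module):
    def __init__(self, freeze=True):
        super(FineTuneVGG, self).__init__()

        # load VGG16 (with batch normalization) pretrained on ImageNet
        self.model = models.vgg16_bn(pretrained=True)
        if freeze:
            # freeze pretrained weights; layers replaced below stay trainable
            for param in self.model.parameters():
                param.requires_grad = False

        # change output size and class number
        # pool to 1x1 instead of 7x7, so flattened features shrink by 49
        self.model.avgpool = nn.AdaptiveAvgPool2d(output_size=(1, 1))

        for i, layer in enumerate(self.model.classifier):
            if isinstance(layer, nn.Linear):
                self.model.classifier[i] = nn.Linear(
                    in_features=(layer.in_features // 49),
                    out_features=(layer.out_features // 49)
                )
        # final layer: 4096 // 49 = 83 inputs, 11 output classes
        self.model.classifier[6] = nn.Linear(83, 11)

    def forward(self, x):
        # delegate to the modified VGG16
        return self.model(x)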
```
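
The divisor 49 comes from the pooling change: torchvision's VGG16 pools its 512-channel feature map to 7×7 before flattening, and the class above pools to 1×1 instead. The short check below works through the resulting layer sizes. It assumes 32×32 inputs, as in the SVHN `.mat` files.

```python
# torchvision's VGG16: the last conv block outputs 512 channels, pooled to 7x7
channels, pooled = 512, 7 * 7
original = [(channels * pooled, 4096), (4096, 4096), (4096, 1000)]

# A 32x32 input is halved by each of the five 2x2 max-pool layers
side = 32
for _ in range(5):
    side //= 2
print(side)  # 1, so the conv feature map is already 1x1

# Divide every fully-connected size by 49, then set the output to 11 classes
shrunk = [(n_in // pooled, n_out // pooled) for n_in, n_out in original]
shrunk[-1] = (shrunk[-1][0], 11)
print(shrunk)  # [(512, 83), (83, 83), (83, 11)]
```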

## Training
The training is run for 2.5 × 10<sup>5</sup> iterations. In each iteration, the weights are updated by Stochastic Gradient Descent (SGD) with momentum. The best weights are chosen by accuracy on the validation set: over all iterations, the set of weights that achieves the highest validation accuracy is regarded as the best and is saved. Training runs to the end, and no early stopping is applied. See lines 39 to 93 in train.py for details.
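
The procedure can be sketched as follows. This is a minimal illustration rather than the code in train.py: the function names, the momentum value of 0.9, the cross-entropy loss, and the `eval_every` validation interval are assumptions.

```python
import copy

import torch
from torch import nn


def accuracy(model, batches):
    """Fraction of correctly classified samples over an iterable of (x, y) batches."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in batches:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


def train_keep_best(model, train_batches, val_batches, iterations,
                    lr=0.001, momentum=0.9, eval_every=100):
    """Run SGD with momentum for a fixed number of iterations (no early stopping)
    and return the best validation accuracy with a copy of the matching weights."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    criterion = nn.CrossEntropyLoss()
    best_acc, best_state = -1.0, None
    step = 0
    while step < iterations:
        for x, y in train_batches:  # must be re-iterable, e.g. a DataLoader
            model.train()
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            step += 1
            if step % eval_every == 0 or step == iterations:
                acc = accuracy(model, val_batches)
                if acc > best_acc:
                    best_acc = acc
                    best_state = copy.deepcopy(model.state_dict())
            if step >= iterations:
                break
    return best_acc, best_state
```

The returned state dict can be written to disk with `torch.save` and restored later with `model.load_state_dict`.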

## Hyperparameter Tuning
Several experiments on learning rate, mini-batch size, and network variations were performed to choose appropriate values and the CNN architecture.

```python
    # batch size experiment
    batch_sizes = [8, 16, 32, 64]
    epoches = [10, 20, 40, 80]
    batch_size_experiment(trainset, valset, zip(batch_sizes, epoches))

    # learning rate experiment
    lrs = [0.001, 0.005, 0.01]
    learning_rate_experiment(trainset, valset, lrs)

    # compare Text-Attentional CNN and VGG
    network_variation_experiment(trainset, valset, False)

    # compare pretrained and retrained weights of VGG
    free_weights_experiement(trainset, valset, True)

    # compare single-task mode and multi-task mode of Text-Attentional CNN
    multi_task_experiment(trainset, valset, True)
```

These variations are evaluated according to their best accuracy scores over validation set. The results of the experiments are shown in Fig.3.

![image info](./figs/lr.png)
![image info](./figs/bs.png)
![image info](./figs/nv.png)

<p style="text-align: center;">Fig.3: The Results of Hyperparameter Tuning Experiments</p>

The best accuracy score for each variation is summarized as follows (LR: learning rate, BS: mini-batch size):

Variation | Training accuracy | Validation accuracy
--- | --- | ---
LR0.001 | 99.66% | 97.43% 
LR0.005 | 98.96% | 97.23%
LR0.01 | 99.47% | 97.30%
BS8 | 98.90% | 97.24% 
BS16 | 99.66% | 97.43%
BS32 | 99.88% | 97.52%
BS64 | 99.91% | 97.40%
VGG-baseline | 99.66% | 97.43%
TextCNN-baseline | 97.81% | 94.90%

## Results

The final model is VGG16 with learning rate α = 0.001 and mini-batch size 16. Tested on SVHN-test, the model achieves 97.54% accuracy, which is very close to human performance (98% accuracy) on this dataset. The F-measure achieved is 69%, with 73.7% precision and 64.5% recall.
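
The reported F-measure can be checked against the precision and recall. The snippet below assumes the F-measure is the balanced F1 score, i.e. the harmonic mean of the two. The `f_measure` helper is written for this example only.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_measure(0.737, 0.645)
print(f"{f1:.1%}")  # 68.8%
```

The result rounds to the 69% quoted above.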

The classifier is able to detect and recognize digits with different scales, locations, orientations, and illuminations, as shown in Fig. 4a. However, complicated text such as closely connected digits or digits with similar patterns, as shown in Fig. 4b, goes undetected or is misclassified because of incorrect features extracted by the MSERs detector.

![image info](./figs/result.png)

<p style="text-align: center;">Fig.4: Performance of Digit Detector for Real World Images</p>

## References
<a id="1">[1]</a>
Huang, W., Qiao, Y., & Tang, X. (2014).
Robust scene text detection with convolution neural network induced MSER trees.
In <em>European Conference on Computer Vision</em> (pp. 497-511). Springer, Cham.<br/>
<a id="2">[2]</a>
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011).
Reading digits in natural images with unsupervised feature learning.
<em>NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011</em>.<br/>
<a id="3">[3]</a>
He, T., Huang, W., Qiao, Y., & Yao, J. (2016).
Text-attentional convolutional neural network for scene text detection.
<em>IEEE Transactions on Image Processing</em>, 25(6), 2529-2541.<br/>
<a id="4">[4]</a>
Simonyan, K., & Zisserman, A. (2015).
Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.