# An overview of text detection in seven days of OCR classes


## 1. Text detection

The text detection task is to locate text regions in an image or video. It is closely related to target (object) detection, which solves not only the localization problem but also the classification of each target.

Text in an image can be regarded as a kind of 'target', so generic target detection approaches also apply to text detection. In terms of the task itself:

- Target detection: given an image or video, find the location (box) of each target and give its category;
- Text detection: given an input image or video, find the areas of text, either single character positions or entire lines of text.



<center><img src="https://ai-studio-static-online.cdn.bcebos.com/af2d8eca913a4d5a968945ae6cac180b009c6cc94abc43bfbaf1ba6a3de98125" width="400" ></center>

<br><center>Figure1 Target detection</center>

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/400b9100573b4286b40b0a668358bcab9627f169ab934133a1280361505ddd33" width="1000" ></center>

<br><center>Figure2 Text detection</center>

Target detection and text detection are both "localization" problems. However, text detection does not require target classification, and text shapes are complex and varied.

Text detection today generally refers to natural scene text detection, whose main difficulties are:

1. Diversity of text in natural scenes: text detection is affected by text color, size, font, shape, orientation, language, and length;
2. Complex backgrounds and interference: text detection is affected by image distortion, blur, low resolution, shadows, brightness, etc.;
3. Dense or even overlapping text;
4. Local consistency of text: a small part of a text line can itself be regarded as independent text.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/072f208f2aff47e886cf2cf1378e23c648356686cf1349c799b42f662d8ced00"
width="1000" ></center>

<br><center>Figure3 Text detection scenarios</center>

In response to these problems, many deep learning-based text detection algorithms have been proposed for natural scene text detection. They can be divided into regression-based and segmentation-based methods.

The next section briefly describes classical deep learning-based text detection algorithms.

## 2. Introduction to text detection methods


Deep learning-based text detection algorithms have emerged in recent years, and these approaches can be broadly classified into two categories:
1. Regression-based text detection methods
2. Segmentation-based text detection methods


This section surveys commonly used text detection methods from 2017 to 2021, grouped into the two categories as shown in the figure below.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/22314238b70b486f942701107ffddca48b87235a473c4d8db05b317f132daea0"
width="600" ></center>
<br><center>Figure4 Text detection algorithm</center>


### 2.1 Regression-based text detection

Regression-based text detection methods resemble target detection algorithms, except that there are only two categories: text in the image is treated as the target to be detected, and everything else as background.

#### 2.1.1 Horizontal text detection

Early deep learning-based text detection algorithms were adapted from target detection methods and support horizontal text detection. For example, TextBoxes is built on the SSD algorithm, and CTPN is built on the two-stage target detection algorithm Faster R-CNN.

TextBoxes [1] adapts the one-stage target detector SSD to text: the default boxes are changed to shapes that fit the orientation and aspect ratio of text, which gives an end-to-end trainable text detector that needs no complex post-processing. The main changes are:
- Pre-selected (default) boxes with larger aspect ratios
- Convolution kernels changed from 3x3 to 1x5, which suits long text better
- Multi-scale input
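
The effect of larger aspect ratios on the pre-selected boxes can be illustrated with a minimal sketch. It assumes SSD-style default boxes that keep a constant area while the width-to-height ratio varies; the function name `default_boxes`, the scale, and the list of ratios are hypothetical values chosen for illustration, not the settings of TextBoxes itself.

```python
import math

def default_boxes(cx, cy, scale, aspect_ratios):
    """Return (x_min, y_min, x_max, y_max) default boxes centred on (cx, cy).

    Each box has area scale**2 and width / height equal to the aspect ratio.
    """
    boxes = []
    for ar in aspect_ratios:
        w = scale * math.sqrt(ar)
        h = scale / math.sqrt(ar)
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# Hypothetical ratios: wider boxes match long horizontal text lines.
wide = default_boxes(cx=50.0, cy=50.0, scale=20.0, aspect_ratios=[1, 2, 3, 5])
for box in wide:
    print([round(v, 1) for v in box])
```

As the ratio grows, the boxes become wider and flatter, which is closer to the shape of a typical text line than the near-square boxes used for generic targets.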

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/3864ccf9d009467cbc04225daef0eb562ac0c8c36f9b4f5eab036c319e5f05e7" width="1000" ></center>
<br><center>Figure5 TextBoxes framework diagram</center>

CTPN [3] is based on the Faster R-CNN algorithm. It extends the RPN module and adds a CRNN-based module so that the whole network can detect text sequences from convolutional features; as a two-stage approach, it obtains more accurate feature localization through ROI Pooling. However, TextBoxes and CTPN support only horizontal text detection.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/452833c2016e4cf7b35291efd09740c13c4bfb8f7c56446b8f7a02fc7eb3e901" width="1000" ></center>
<br><center>Figure6 CTPN Framework Diagram</center>

#### 2.1.2 Arbitrary angle text detection

TextBoxes++ [2] improves on TextBoxes to support text at any angle. Structurally, it differs from TextBoxes in three ways. First, the aspect ratios of the pre-selected boxes are adjusted to 1, 2, 3, 5, 1/2, 1/3, and 1/5. Second, the $1 \times 5$ convolution kernel is changed to $3 \times 5$ to better learn the features of tilted text. Finally, TextBoxes++ outputs the representation of rotated boxes.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/ae96e3acbac04be296b6d54a4d72e5881d592fcc91f44882b24bc7d38b9d2658"
width="1000" ></center>
<br><center>Figure7 TextBoxes++ framework diagram</center>


EAST [4] proposed a two-stage text detection method, consisting of an FCN feature extraction stage and an NMS stage, for localizing skewed text. EAST introduced a new pipeline structure for text detection that can be trained end-to-end, supports arbitrarily oriented text, and combines a simple structure with high performance. The FCN supports output of both skewed rectangular boxes and horizontal boxes, and the output format can be freely chosen.
- If the output shape is RBox, the network outputs the box rotation angle and the AABB geometry, where AABB denotes the distances to the top, right, bottom, and left edges of the text box. RBox can describe rotated rectangular text.
- If the output is a four-point box, the last dimension of the output holds 8 numbers: the coordinate offsets to the four corner vertices of the quadrilateral. This format can describe text in irregular quadrilaterals.
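
The two output formats can be decoded into corner points roughly as follows. This is a minimal sketch rather than the EAST implementation: the function names, the corner order, and the sign convention of the angle are assumptions made for illustration.

```python
import math

def decode_rbox(x, y, top, right, bottom, left, angle):
    """Return the four corners of a rotated box predicted at pixel (x, y).

    top/right/bottom/left are distances from the pixel to the box edges
    (the AABB part); angle is the rotation in radians about the pixel.
    The sign convention of the angle is an assumption of this sketch.
    """
    # Corners relative to the pixel before rotation, starting at the top-left.
    local = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(x + dx * cos_a - dy * sin_a, y + dx * sin_a + dy * cos_a)
            for dx, dy in local]

def decode_quad(x, y, offsets):
    """Return four corners from 8 offsets (dx1, dy1, ..., dx4, dy4)."""
    return [(x + offsets[i], y + offsets[i + 1]) for i in range(0, 8, 2)]

# With a zero angle the RBox reduces to an axis-aligned box.
print(decode_rbox(10, 10, top=2, right=5, bottom=2, left=5, angle=0.0))
print(decode_quad(10, 10, [-5, -2, 5, -2, 5, 2, -5, 2]))
```

The RBox format needs only five numbers per pixel but is limited to rectangles, while the eight-number quadrilateral format can also follow perspective-distorted text.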

The text boxes output by the FCN are highly redundant: neighboring pixels of the same text region generate boxes with high overlap, whereas boxes generated by different text instances overlap little. EAST therefore first merges the predicted boxes row by row, and then filters the remaining quadrilaterals with standard NMS.
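
The merge-then-filter idea can be sketched as below. To stay short, the sketch assumes axis-aligned boxes `(x1, y1, x2, y2, score)` instead of quadrilaterals, assumes the boxes arrive in row-major order, and uses an illustrative IoU threshold of 0.3; all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, score) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def weighted_merge(a, b):
    """Average the coordinates weighted by score; the scores are summed."""
    total = a[4] + b[4]
    coords = [(a[i] * a[4] + b[i] * b[4]) / total for i in range(4)]
    return (*coords, total)

def standard_nms(boxes, thresh):
    """Keep the highest-scoring box among each group of overlapping boxes."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= thresh for k in kept):
            kept.append(box)
    return kept

def locality_aware_nms(boxes, thresh=0.3):
    """Merge consecutive overlapping boxes, then run standard NMS."""
    merged, last = [], None
    for box in boxes:
        if last is not None and iou(box, last) > thresh:
            last = weighted_merge(box, last)
        else:
            if last is not None:
                merged.append(last)
            last = box
    if last is not None:
        merged.append(last)
    return standard_nms(merged, thresh)

boxes = [(0, 0, 10, 4, 0.9), (1, 0, 11, 4, 0.8), (30, 0, 40, 4, 0.7)]
result = locality_aware_nms(boxes)
print(len(result))
```

Because neighboring pixels tend to predict nearly identical boxes, the first pass shrinks the candidate set considerably before the more expensive pairwise NMS runs.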

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/d7411ada08714adab73fa0edf7555a679327b71e29184446a33d81cdd910e4fc"
width="1000" ></center>
<br><center>Figure8 EAST framework diagram</center>           


MOST [15] proposed the TFAM module to dynamically adjust the receptive field of the coarse detection results, and PA-NMS to merge reliable predictions based on position information. In addition, an instance-wise IoU loss is used during training to balance text instances at different scales. The method can be combined with EAST and performs better on text with extreme aspect ratios and varying scales.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/73052d9439714bba86ffe4a959d58c523b07baf3f1d74882b4517e71f5a645fe"
width="1000" ></center>
<br><center>Figure9 MOST framework diagram</center>


#### 2.1.3 Curved text detection

A simple idea to solve the detection problem of curved text using regression is to describe the boundary polygons of curved text with multi-point coordinates, and then directly predict the coordinates of the vertices of the polygons.

CTD [6] proposed to directly predict a boundary polygon of 14 vertices for curved text. A Bi-LSTM [13] layer is used in the network to refine the predicted vertex coordinates, yielding a regression-based approach to curved text detection.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/6e33d76ebb814cac9ebb2942b779054af160857125294cd69481680aca2fa98a"
width="600" ></center>
<br><center>Figure10 CTD framework diagram</center>



LOMO [19] proposed iteratively refining text localization features to obtain finer localization for long and curved text. The method consists of three parts: the coordinate regression module DR, the iterative optimization module IRM, and the arbitrary shape representation module SEM. They are used, respectively, to generate approximate text regions, to iteratively refine the text localization features, and to predict text regions, text center lines, and text boundaries. Iterative refinement of text features handles long text better and yields more accurate text region localization.
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/e90adf3ca25a45a0af0b84a181fbe2c4954be1fcca8f4049957128548b7131ef"
width="1000" ></center>
<br><center>Figure11 LOMO framework diagram</center>


ContourNet [18] models text contour points to obtain curved text detection boxes. The method first uses Adaptive-RPN to obtain proposal features for text regions. A locally orthogonal texture-aware module (LOTM) then learns texture features in the horizontal and vertical directions and represents them as contour points. Finally, by considering the feature responses in both orthogonal directions simultaneously, the Point Re-Scoring algorithm filters out predictions with strong one-directional or weak orthogonal activation, so that the final text contour is represented by a set of high-quality contour points.
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/1f59ab5db899412f8c70ba71e8dd31d4ea9480d6511f498ea492c97dd2152384"
width="600" ></center>
<br><center>Figure12 Contournet framework diagram</center>


PCR [14] proposed progressive coordinate regression for curved text detection. The method has three stages: it first roughly detects the text region to obtain a text box; it then predicts the corner coordinates of the minimum enclosing box of the text with the designed Contour Localization Mechanism (CLM); finally, it stacks multiple CLM and RCLM modules to obtain the curved text contour. This method not only suppresses the influence of redundant noise points on the coordinate regression, but also locates the text region more accurately by aggregating text contour information into a rich contour feature representation.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/c677c4602cee44999ae4b38bd780b69795887f2ae10747968bb084db6209b6cc"
width="600" ></center>
<br><center>Figure13 PCR framework diagram</center>



### 2.2 Segmentation-based text detection

Although regression-based methods have achieved good results in text detection, they often struggle to produce smooth enclosing curves for curved text, and their models are more complex without a performance advantage. Researchers therefore proposed text detection methods based on image segmentation, which first classify at the pixel level whether each pixel belongs to a text target, obtain a probability map of the text region, and then obtain the enclosing curve of the segmented text region through post-processing.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/fb9e50c410984c339481869ba11c1f39f80a4d74920b44b084601f2f8a23099f"
width="600" ></center>
<br><center>Figure14 Schematic diagram of text segmentation algorithm</center>


Segmentation-based methods have a natural advantage for irregularly shaped text. Their main idea is to obtain the text regions in the image through segmentation, and then use post-processing, such as OpenCV and polygon operations, to obtain the minimum enclosing curve of each text region.
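
The post-processing step can be sketched in plain Python as follows. The sketch assumes a fixed threshold of 0.5, 4-connectivity, and axis-aligned bounding boxes as the output; real pipelines usually rely on library routines for contours and polygons instead, and the function name here is hypothetical.

```python
from collections import deque

def text_boxes_from_prob_map(prob_map, thresh=0.5):
    """Binarise a probability map and return one box per 4-connected region.

    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates, inclusive.
    """
    h, w = len(prob_map), len(prob_map[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if prob_map[y][x] < thresh or seen[y][x]:
                continue
            # Breadth-first search over one connected text region.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0, y0, x1, y1 = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                x0, y0 = min(x0, cx), min(y0, cy)
                x1, y1 = max(x1, cx), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < w and 0 <= ny < h and not seen[ny][nx]
                            and prob_map[ny][nx] >= thresh):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes

prob_map = [
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.9],
]
print(text_boxes_from_prob_map(prob_map))  # [(0, 0, 1, 1), (4, 1, 4, 2)]
```

Two regions that touch would be merged into a single component here, which is exactly the adjacent-text weakness that later methods in this section try to fix.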


PixelLink [7] uses segmentation to solve text detection: pixels in the same text line (or word) are linked together to segment the text, and the text bounding box is extracted directly from the segmentation result, which matches the effect of regression-based text detection without any position regression. Wu and Natarajan [8] proposed learning the text boundary positions while segmenting the text, so that text regions can be separated better. In addition, Tian et al. [9] proposed mapping pixels into an embedding space in which the vectors of pixels from the same text are close to each other and the vectors of pixels from different texts are far apart.
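
The linking idea behind PixelLink can be sketched with a union-find structure. The inputs are assumed to be already thresholded network outputs: a set of text pixels and a list of pixel pairs predicted as linked. The function name and the input format are hypothetical simplifications.

```python
def group_pixels_by_links(text_pixels, links):
    """Group text pixels into instances using predicted positive links.

    text_pixels: set of (x, y) pixels classified as text.
    links: iterable of ((x1, y1), (x2, y2)) pairs predicted as linked.
    Returns a list of sets, one per text instance.
    """
    parent = {p: p for p in text_pixels}

    def find(p):
        # Find the root of p, shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in links:
        # A link only counts when both end points are text pixels.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = {}
    for p in text_pixels:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())

# Pixels (1, 0) and (2, 0) are adjacent but not linked, so two instances result.
pixels = {(0, 0), (1, 0), (2, 0), (3, 0)}
links = [((0, 0), (1, 0)), ((2, 0), (3, 0))]
instances = group_pixels_by_links(pixels, links)
print(sorted(sorted(g) for g in instances))
```

Unlike plain connected-component labelling, adjacency alone does not join two pixels; a predicted link is required, which is what lets touching text instances stay separate.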

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/462b5e1472824452a2c530939cda5e59ada226b2d0b745d19dd56068753a7f97"
width="600" ></center>
<br><center>Figure15 PixelLink framework diagram</center>

MSR [20] addresses the multi-scale problem in text detection by extracting features of the same image at multiple scales, fusing them, and upsampling them to the original image size. The network finally predicts the text center region and, for each point in it, the x-coordinate and y-coordinate offsets to the nearest boundary point, from which the set of contour coordinates of the text region is obtained.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/9597efd68a224d60b74d7c51c99f7ff0ba9939e5cdb84fb79209b7e213f7d039"
width="600" ></center>
<br><center>Figure16 MSR framework diagram</center>
  
To address the difficulty that segmentation-based algorithms have in separating adjacent text, PSENet [10] proposed a progressive scale expansion network. It predicts text regions at several shrinkage ratios and then expands the detected regions step by step. This is essentially a variant of the boundary learning method, and it effectively solves the detection of adjacent text of arbitrary shape.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/fa870b69a2a5423cad7422f64c32e0645dfc31a4ecc94a52832cf8742cded5ba"
width="1000" ></center>
<br><center>Figure17 PSENet framework diagram</center>

Suppose PSENet post-processing uses three kernels of different scales, shown as s1, s2, and s3 in the figure above. Starting from the smallest kernel s1, the connected components of the text segmentation region are computed, giving (b). Each connected component is then expanded upward, downward, leftward, and rightward, and the pixels in the expanded region that belong to s2 but not to s1 are assigned to a component; when a pixel is claimed by more than one component (a conflict point), the "first come, first served" principle applies. The scale expansion is then repeated. Finally, independent segmentation regions are obtained for the different text lines.
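
The expansion procedure described above can be sketched as a breadth-first search. The sketch assumes the kernels are given as binary masks ordered from smallest to largest and uses 4-connectivity; the function name is hypothetical and this is not the PSENet reference implementation.

```python
from collections import deque

def progressive_scale_expansion(kernels):
    """Expand labels from the smallest kernel to the largest.

    kernels: list of binary masks (lists of lists of 0/1), ordered from the
    smallest kernel s1 to the largest; all masks share one shape.
    Returns a label map where 0 is background and 1..n are text instances.
    """
    h, w = len(kernels[0]), len(kernels[0][0])
    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    labels = [[0] * w for _ in range(h)]

    # Step 1: label the connected components of the smallest kernel.
    count = 0
    for y in range(h):
        for x in range(w):
            if kernels[0][y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(x, y)])
                while queue:
                    cx, cy = queue.popleft()
                    for dx, dy in neighbours:
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and kernels[0][ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((nx, ny))

    # Step 2: grow every label into each larger kernel, breadth first.
    for kernel in kernels[1:]:
        queue = deque((x, y) for y in range(h) for x in range(w) if labels[y][x])
        while queue:
            cx, cy = queue.popleft()
            for dx, dy in neighbours:
                nx, ny = cx + dx, cy + dy
                # A labelled pixel keeps its label: first come, first served.
                if (0 <= nx < w and 0 <= ny < h
                        and kernel[ny][nx] and not labels[ny][nx]):
                    labels[ny][nx] = labels[cy][cx]
                    queue.append((nx, ny))
    return labels

s1 = [[0, 1, 0, 0, 0, 1, 0]]
s2 = [[1, 1, 1, 1, 1, 1, 1]]
print(progressive_scale_expansion([s1, s2]))  # [[1, 1, 1, 1, 2, 2, 2]]
```

In the example, the larger kernel s2 is a single connected region, yet the two seeds from s1 keep the two text instances apart; the contested middle pixel goes to whichever label reaches it first.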


SegLink++ [17] targets curved and dense text. It characterizes the attraction and repulsion relationships between text block units, uses a minimum spanning tree algorithm to combine the units into the final text detection boxes, and proposes an instance-aware loss function that makes the method trainable end-to-end.

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/1a16568361c0468db537ac25882eed096bca83f9c1544a92aee5239890f9d8d9"
width="1000" ></center>
<br><center>Figure18 Seglink++ framework diagram</center>

Although segmentation methods solve the detection of curved text, their complex post-processing logic and prediction speed still need to be optimized.

PAN [11] addresses slow text detection and prediction by improving both the network design and the post-processing. First, PAN uses a lightweight ResNet18 as the backbone, and additionally designs a lightweight feature enhancement module FPEM and a feature fusion module FFM to enhance the features extracted by the backbone. In post-processing, a pixel clustering method starts from the center of each predicted text instance (the kernel) and merges the surrounding pixels whose distance from the kernel is less than a threshold d. PAN achieves high accuracy together with faster prediction speed.
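
The clustering step can be approximated by the sketch below. It is a simplification under stated assumptions: each kernel is summarized by one mean feature vector, every text pixel is compared against all kernels rather than grown outward from the kernel, and the default threshold `d=6.0` is only an illustrative value. The function name and input format are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def aggregate_pixels(kernel_vectors, pixel_vectors, d=6.0):
    """Assign each text pixel to the nearest kernel in feature space.

    kernel_vectors: {kernel_id: mean feature vector of that kernel}.
    pixel_vectors: {(x, y): feature vector of a text pixel outside the kernels}.
    A pixel joins the closest kernel only when the distance is below d;
    otherwise it stays unassigned (None).
    """
    assignment = {}
    for pixel, vec in pixel_vectors.items():
        best_id, best_dist = None, d
        for kernel_id, centre in kernel_vectors.items():
            dist = math.dist(vec, centre)
            if dist < best_dist:
                best_id, best_dist = kernel_id, dist
        assignment[pixel] = best_id
    return assignment

kernels = {1: (0.0, 0.0), 2: (10.0, 0.0)}
pixels = {(3, 4): (1.0, 0.0), (7, 4): (9.0, 1.0), (5, 9): (5.0, 20.0)}
print(aggregate_pixels(kernels, pixels))
```

A single pass of distance comparisons replaces the repeated expansion rounds of earlier methods, which is where the speed-up in post-processing comes from.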


<center><img src="https://ai-studio-static-online.cdn.bcebos.com/a76771f91db246ee8be062f96fa2a8abc7598dd87e6d4755b63fac71a4ebc170"
width="1000" ></center>
<br><center>Figure19 PAN framework diagram</center>

DBNet [12] addresses the problem that segmentation-based methods need threshold binarization, which makes post-processing time-consuming. It proposes learnable thresholds together with a binarization function that approximates a step function, so that the segmentation network can learn the thresholds for text segmentation end-to-end during training. The automatic adjustment of thresholds not only improves accuracy, but also simplifies post-processing and improves text detection performance.
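
The approximate step function can be written as $\hat{B} = \frac{1}{1 + e^{-k(P - T)}}$, where $P$ is the predicted probability, $T$ is the learned threshold, and $k$ is an amplifying factor. The sketch below evaluates it for single values; the function names are hypothetical, and the default `k=50.0` should be treated as an assumption rather than a required setting.

```python
import math

def differentiable_binarization(p, t, k=50.0):
    """Approximate step function: 1 / (1 + exp(-k * (p - t)))."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))

def hard_binarization(p, t):
    """Standard thresholding, which has zero gradient almost everywhere."""
    return 1.0 if p >= t else 0.0

for p in (0.1, 0.29, 0.3, 0.31, 0.9):
    soft = differentiable_binarization(p, t=0.3)
    hard = hard_binarization(p, t=0.3)
    print(f"p={p:.2f}  soft={soft:.4f}  hard={hard:.0f}")
```

Far from the threshold the two functions agree, but near it the soft version changes smoothly, so gradients can flow to both the probability map and the threshold map during training.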

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/0d6423e3c79448f8b09090cf2dcf9d0c7baa0f6856c645808502678ae88d2917"
width="1000" ></center>
<br><center>Figure20 DB framework diagram</center>

FCENet [16] proposed representing the enclosing curve of text with Fourier coefficients. Since a Fourier coefficient representation can theoretically fit any closed curve, a model designed to predict this representation for arbitrarily shaped text improves detection accuracy on highly curved text instances in natural scenes.
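
How a closed contour is recovered from Fourier coefficients can be sketched as follows. The sketch assumes the contour is a complex-valued function $f(t) = \sum_k c_k e^{2\pi i k t}$ with $t \in [0, 1)$, where the real part is x and the imaginary part is y; the function name, the dictionary input, and the number of sample points are hypothetical choices.

```python
import cmath

def reconstruct_contour(coeffs, num_points=8):
    """Sample a closed contour from Fourier coefficients.

    coeffs: {k: complex c_k} for integer frequencies k (for example -K..K).
    The contour is f(t) = sum_k c_k * exp(2j * pi * k * t) for t in [0, 1),
    with the real part as x and the imaginary part as y.
    """
    points = []
    for n in range(num_points):
        t = n / num_points
        z = sum(c * cmath.exp(2j * cmath.pi * k * t) for k, c in coeffs.items())
        points.append((z.real, z.imag))
    return points

# c_0 sets the centre; c_1 alone traces a circle of radius |c_1|.
circle = reconstruct_contour({0: complex(10, 20), 1: complex(5, 0)}, num_points=4)
print([(round(x, 6), round(y, 6)) for x, y in circle])
```

Adding higher-frequency coefficients bends the circle into more complex shapes, so a small fixed-length vector can describe a highly curved text outline.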

<center><img src="https://ai-studio-static-online.cdn.bcebos.com/45e9a374d97145689a961977f896c8f9f470a66655234c1498e1c8477e277954"
width="1000" ></center>
<br><center>Figure21 FCENet framework diagram</center>



## 3. Conclusion

This section introduced recent developments in text detection, covering regression-based and segmentation-based methods and outlining the ideas behind some classic papers in each category. The next section takes the PaddleOCR open-source library as an example and introduces the principle of the DBNet algorithm and its core code implementation in detail.

## References
1. Liao, Minghui, et al. "Textboxes: A fast text detector with a single deep neural network." Thirty-first AAAI conference on artificial intelligence. 2017.
2. Liao, Minghui, Baoguang Shi, and Xiang Bai. "Textboxes++: A single-shot oriented scene text detector." IEEE transactions on image processing 27.8 (2018): 3676-3690.
3. Tian, Zhi, et al. "Detecting text in natural image with connectionist text proposal network." European conference on computer vision. Springer, Cham, 2016.
4. Zhou, Xinyu, et al. "East: an efficient and accurate scene text detector." Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.
5. Wang, Fangfang, et al. "Geometry-aware scene text detection with instance transformation network." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
6. Yuliang, Liu, et al. "Detecting curve text in the wild: New dataset and new solution." arXiv preprint arXiv:1712.02170 (2017).
7. Deng, Dan, et al. "Pixellink: Detecting scene text via instance segmentation." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.
8. Wu, Yue, and Prem Natarajan. "Self-organized text detection with minimal post-processing via border learning." Proceedings of the IEEE International Conference on Computer Vision. 2017.
9. Tian, Zhuotao, et al. "Learning shape-aware embedding for scene text detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
10. Wang, Wenhai, et al. "Shape robust text detection with progressive scale expansion network." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
11. Wang, Wenhai, et al. "Efficient and accurate arbitrary-shaped text detection with pixel aggregation network." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
12. Liao, Minghui, et al. "Real-time scene text detection with differentiable binarization." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.
13. Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
14. Dai, Pengwen, et al. "Progressive Contour Regression for Arbitrary-Shape Scene Text Detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
15. He, Minghang, et al. "MOST: A Multi-Oriented Scene Text Detector with Localization Refinement." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
16. Zhu, Yiqin, et al. "Fourier contour embedding for arbitrary-shaped text detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
17. Tang, Jun, et al. "Seglink++: Detecting dense and arbitrary-shaped scene text by instance-aware component grouping." Pattern recognition 96 (2019): 106954.
18. Wang, Yuxin, et al. "Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
19. Zhang, Chengquan, et al. "Look more than once: An accurate detector for text of arbitrary shapes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
20. Xue, Chuhui, Shijian Lu, and Wei Zhang. "MSR: Multi-scale shape regression for scene text detection." arXiv preprint arXiv:1901.02596 (2019).
