# 📖 👆🏻 Printed Links Detection Using TensorFlow 2 Object Detection API

![Links Detector Cover](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/01-banner.png)

## 📃 TL;DR

_In this article we will start solving the problem of making printed links (e.g. in a book or a magazine) clickable via your smartphone camera._

We will use the TensorFlow 2 [Object Detection API](https://github.com/tensorflow/models/tree/master/research/object_detection) to train a custom object detector model to find the positions and bounding boxes of sub-strings like `https://` in an image of text (e.g. in a smartphone camera stream).

The text of each link (the right-side continuation of the `https://` bounding box) will be recognized using the [Tesseract](https://tesseract.projectnaptha.com/) library. The recognition part will not be covered in this article, but you may find the complete code example of the application in the [links-detector repository](https://github.com/trekhleb/links-detector).

> 🚀 [**Launch Links Detector demo**](https://trekhleb.github.io/links-detector/) from your smartphone to see the final result.

> 📝 [**Open links-detector repository**](https://github.com/trekhleb/links-detector) on GitHub to see the complete source code of the application.

Here is what the final solution will look like:

![Links Detector Demo](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/03-links-detector-demo.gif)

> ⚠️ Currently the application is in an _experimental_ _Alpha_ stage and has [many issues and limitations](https://github.com/trekhleb/links-detector/issues?q=is%3Aopen+is%3Aissue+label%3Aenhancement). So don't raise your expectations too high until these issues are resolved 🤷🏻‍. Also, the purpose of this article is more about learning how to work with the TensorFlow 2 Object Detection API than about coming up with a production-ready model.

## 🤷🏻‍♂️ The Problem

I work as a software engineer and on my own time I learn Machine Learning as a hobby. But this is not the problem yet.

I recently bought a printed book about Machine Learning, and while reading through the first several chapters I encountered many printed links in the text that looked like `https://tensorflow.org/` or `https://some-url.com/which/may/be/even/longer?and_with_params=true`.

![Printed Links](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/02-printed-links.jpg)

I saw all these links but I couldn't click on them since they were printed (thanks, cap!). To visit these links I needed to type them character by character into the browser's address bar, which was pretty annoying and error-prone.

## 💡 Possible Solution

So, what if, similarly to QR-code detection, we try to "teach" the smartphone to _(1)_ _detect_ and _(2)_ _recognize_ printed links for us and also to make them _clickable_? This way you'll do just one click instead of multiple keystrokes. The operational complexity goes from `O(N)` to `O(1)`.

This is what the final workflow will look like:

![Links Detector Demo](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/03-links-detector-demo.gif)

## 📝 Solution Requirements

As I've mentioned earlier, I'm just studying Machine Learning as a hobby. Thus the purpose of this article is more about _learning_ how to work with the TensorFlow 2 Object Detection API than about coming up with a production-ready application.

With that being said, I simplified the solution requirements to the following:

1. The detection and recognition processes should have **close-to-real-time** performance (i.e. `0.5-1` frames per second) on a device like the iPhone X. It means that the whole _detection + recognition_ process should take up to `2` seconds (pretty bearable for an amateur project).
2. Only **English** links should be supported.
3. Only **dark text** (i.e. black or dark-grey) on **light background** (i.e. white or light-grey) should be supported.
4. Only `https://` links should be supported for now (it is ok if our model does not recognize `http://`, `ftp://`, `tcp://` or other types of links).
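
To make the first requirement concrete, here is a minimal sketch that converts a target frame rate into a per-frame time budget. The helper names (`frameBudgetMs`, `fitsBudget`) are made up for illustration and are not part of the Links Detector code.

```typescript
// Convert a target frame rate into the time budget available
// for one full "detection + recognition" pass.
const frameBudgetMs = (framesPerSecond: number): number => {
  if (framesPerSecond <= 0) {
    throw new RangeError('framesPerSecond must be positive');
  }
  return 1000 / framesPerSecond;
};

// The two ends of the 0.5-1 frames per second range from requirement 1.
const slowestBudgetMs = frameBudgetMs(0.5); // 2000 ms per frame
const fastestBudgetMs = frameBudgetMs(1); // 1000 ms per frame

// Hypothetical helper: does a measured pass still fit the slowest allowed budget?
const fitsBudget = (measuredMs: number): boolean => measuredMs <= slowestBudgetMs;
```

So anything that takes longer than `2` seconds per frame falls outside the "close-to-real-time" range defined above.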

## 🧩 Solution Breakdown

### High-level breakdown

Let's see how we could approach the problem on a high level.

#### Option 1: Detection model on the back-end

**The flow:**

1. Get camera stream (frame by frame) on the client side.
2. Send each frame one by one over the network to the back-end.
3. Do links detection and recognition on the back-end and send the response back to the client.
4. Client draws the detection boxes with the clickable links.

![Model on the back-end](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/04-frontend-backend.jpg)

**Pros:**

💚 The detection performance is not limited by the client's device. We may speed the detection up by scaling the service horizontally (adding more instances) and vertically (adding more cores/GPUs).

💚 The model might be bigger since there is no need to upload it to the client side. Downloading a `~10Mb` model on the client side may be ok, but loading a `~100Mb` model might be a big issue for the client's network and for the application UX (user experience).

💚 It is possible to control who is using the model. The model is guarded behind the API, so we would have complete control over its callers/clients.
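
To get a feeling for why the model size matters on the client side, here is a rough back-of-the-envelope sketch. It reads `Mb` above as megabytes and assumes a `10` Mbit/s connection; both are assumptions made only for this illustration, and `estimateDownloadSeconds` is a hypothetical helper.

```typescript
// Rough download time estimate: megabytes -> megabits -> seconds.
const estimateDownloadSeconds = (
  modelSizeMegabytes: number,
  bandwidthMbitPerSecond: number,
): number => {
  if (bandwidthMbitPerSecond <= 0) {
    throw new RangeError('bandwidthMbitPerSecond must be positive');
  }
  return (modelSizeMegabytes * 8) / bandwidthMbitPerSecond;
};

// Assumed mobile bandwidth, chosen only for illustration.
const ASSUMED_BANDWIDTH_MBIT_PER_SECOND = 10;

const smallModelSeconds = estimateDownloadSeconds(10, ASSUMED_BANDWIDTH_MBIT_PER_SECOND); // 8 s
const bigModelSeconds = estimateDownloadSeconds(100, ASSUMED_BANDWIDTH_MBIT_PER_SECOND); // 80 s
```

Under these assumptions the smaller model is a short wait, while the bigger one keeps the user waiting for more than a minute before the first detection.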

**Cons:**

💔 System complexity grows. The application tech stack grows from just `JavaScript` to, let's say, `JavaScript + Python`. We also need to take care of autoscaling.

💔 Offline mode for the app is not possible since it needs an internet connection to work.

💔 Too many HTTP requests between the client and the server may become a bottleneck at some point. Imagine that we want to improve the performance of the detection, let's say, from `1` to `10+` frames per second. This means that each client will send `10+` requests per second. For `10` simultaneous clients it is already `100+` requests per second. `HTTP/2` bidirectional streaming and `gRPC` might be useful in this case, but then we're back to increased system complexity.

💔 System becomes more expensive. Almost all points from the Pros section need to be paid for.
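
The request-rate estimate from the list above is plain multiplication, sketched below. It assumes one HTTP request per camera frame, and `requestsPerSecond` is a name made up for this example.

```typescript
// Server load estimate, assuming every client sends one request per frame.
const requestsPerSecond = (
  simultaneousClients: number,
  framesPerSecondPerClient: number,
): number => simultaneousClients * framesPerSecondPerClient;

const currentLoad = requestsPerSecond(10, 1); // 10 requests per second
const improvedFpsLoad = requestsPerSecond(10, 10); // 100 requests per second
```

The load grows with both the number of clients and the target frame rate, so improving the detection speed makes the back-end problem harder, not easier.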

#### Option 2: Detection model on the front-end

**The flow:**

1. Get camera stream (frame by frame) on the client side.
2. Do links detection and recognition on the client side (without sending anything to the back-end).
3. Client draws the detection boxes with the clickable links.

![Model on the front-end](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/05-frontend-only.jpg)

**Pros:**

💚 System is less complex. We don't need to set up the servers, build the API and introduce an additional Python stack to the system.

💚 Offline mode is possible. The app doesn't need an internet connection to work since the model is loaded on the device. So the Progressive Web Application ([PWA](https://web.dev/progressive-web-apps/)) might be built to support that.

💚 System is "kind of" scaling automatically. The more clients you have, the more cores and GPUs they bring. This is not a proper scaling solution though (more about that in the Cons section below).

💚 System is cheaper. We only need a server for static assets (`HTML`, `JS`, `CSS`, model files etc.). This may be done for free, let's say, on GitHub.

💚 No issue with the growing number of HTTP requests per second to the server side.

**Cons:**

💔 Only horizontal scaling is possible (each client will have its own CPU/GPU). Vertical scaling is not possible since we can't influence the client's device performance. As a result, we can't guarantee fast detection on low-performance devices.

💔 It is not possible to guard the model usage and control the callers/clients of the model. Everyone could download the model and re-use it.

💔 Battery consumption of the client's device might become an issue. For the model to work it needs computational resources. So clients might not be happy with their iPhone getting warmer and warmer while the app is working.

#### High-level conclusion

Since the purpose of the project was more about learning and not about coming up with a production-ready solution, _I decided to go with the second option of serving the model from the client side_. This made the whole project much cheaper (actually, with GitHub it was free to host it) and I could focus more on Machine Learning than on the autoscaling back-end infrastructure.


### Lower level breakdown

Ok, so we've decided to go with the serverless solution. And now we have an image from the camera stream as an input that looks something like this:

![Printed Links Input](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/06-printed-links-clean.jpg)

We need to solve two sub-tasks for this image:

1. Links **detection** (finding the position and bounding boxes of the links)
2. Links **recognition** (recognizing the text of the links)

#### Option 1: Tesseract based solution

The first and the most obvious approach would be to solve the _Optical Character Recognition_ ([OCR](https://en.wikipedia.org/wiki/Optical_character_recognition)) task by recognizing the whole text of the image using, let's say, the [Tesseract.js](https://github.com/naptha/tesseract.js) library. As a pleasant bonus, it returns the bounding boxes of the paragraphs, text lines and text blocks along with the recognized text.

![Recognized text with bounding boxes](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/07-printed-links-boxes.jpg)

Then we may try to extract the links from the recognized text lines or text blocks with a regular expression like [this one](https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url):

```typescript
const URL_REG_EXP = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_+.~#?&/=]*)/gi;

const extractLinkFromText = (text: string): string | null => {
  const urls: string[] | null = text.match(URL_REG_EXP);
  if (!urls || !urls.length) {
    return null;
  }
  return urls[0];
};
```

💚 Seems like the issue is solved in a pretty straightforward and simple way:

- We know the bounding boxes of the links
- And we also know the text of the links to make them clickable
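
As a sketch of how the two points above could fit together, the snippet below pairs each recognized text line with the first URL found in it. The `RecognizedLine` shape is a simplified assumption about what an OCR library returns, not the exact Tesseract.js type, and `findLinkCandidates` is a hypothetical helper.

```typescript
// Same pattern as URL_REG_EXP above, repeated to keep the sketch self-contained.
const LINK_PATTERN = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_+.~#?&/=]*)/gi;

// Assumed, simplified shape of one recognized text line.
interface RecognizedLine {
  text: string;
  bbox: { x0: number; y0: number; x1: number; y1: number };
}

interface LinkCandidate {
  url: string;
  bbox: RecognizedLine['bbox'];
}

const findLinkCandidates = (lines: RecognizedLine[]): LinkCandidate[] => {
  const candidates: LinkCandidate[] = [];
  lines.forEach((line) => {
    const urls = line.text.match(LINK_PATTERN);
    if (urls && urls.length > 0) {
      // Note: the box covers the whole text line, not only the link itself.
      candidates.push({ url: urls[0], bbox: line.bbox });
    }
  });
  return candidates;
};
```

With line-level boxes the clickable area is wider than the link, which is acceptable for a first approximation.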

💔 The thing is that the _recognition + detection_ time may vary from `2` to `20+` seconds depending on the size of the text, on the amount of "something that looks like text" in the image, on the image quality and on other factors. So it will be really hard to achieve those `0.5-1` frames per second to make the user experience at least _close_ to real-time.

💔 Also, if we think about it, we're asking the library to recognize the **whole** text from the image for us even though it might contain only one or two links (i.e. only ~10% of the text might be useful for us) or it may not contain any links at all. In this case it sounds like a waste of computational resources.

#### Option 2: Tesseract + TensorFlow based solution

We could make Tesseract work faster if we used some _additional "adviser" algorithm_ prior to the links text recognition. This "adviser" algorithm should detect, but not recognize, _the leftmost position_ of each link in the image, if there are any. This will allow us to speed up the recognition part by following these rules:

1. If the image does not contain any link we should not call Tesseract detection/recognition at all.
2. If the image does have links, then we need to ask Tesseract to recognize only those parts of the image that contain the links. We're not interested in spending time on recognizing irrelevant text that doesn't contain any links.
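
The two rules above can be sketched as a small gating function. The detector and the recognizer are passed in as plain callbacks; both are hypothetical stand-ins (a real detector and a real OCR call would be asynchronous), so this shows only the control flow and not the application's actual code.

```typescript
interface AdviserRegion {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical callback types: the "adviser" only detects regions,
// the recognizer reads the text inside one region.
type RegionDetector<Frame> = (frame: Frame) => AdviserRegion[];
type RegionRecognizer<Frame> = (frame: Frame, region: AdviserRegion) => string;

function recognizeLinksWithAdviser<Frame>(
  frame: Frame,
  detect: RegionDetector<Frame>,
  recognize: RegionRecognizer<Frame>,
): string[] {
  const regions = detect(frame);
  // Rule 1: no detected links means the expensive recognizer is never called.
  if (regions.length === 0) {
    return [];
  }
  // Rule 2: recognize only the regions that are expected to contain links.
  return regions.map((region) => recognize(frame, region));
}
```

The recognition cost now scales with the number of detected links rather than with the amount of text in the frame.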

The "adviser" algorithm that will run before Tesseract should work in constant time regardless of the image quality or the presence/absence of text in the image. It should also be pretty fast and detect the leftmost positions of the links in less than `1s` so that we can satisfy the "close-to-real-time" requirement (i.e. on the iPhone X).

> 💡 So what if we use another object detection model to help us find all occurrences of the `https://` substrings (every secure link has this prefix, doesn't it?) in the image? Then, having these `https://` bounding boxes in the text, we may extract the right-side continuation of them and send them to Tesseract for text recognition.

Take a look at the picture below:

![Tesseract and TensorFlow based solution](https://raw.githubusercontent.com/trekhleb/links-detector/master/articles/printed_links_detection/assets/08-tesseract-vs-tensorflow.jpg)

You may notice that Tesseract needs to do **much less** work if it has some hints about where the links might be located (compare the number of blue boxes in both pictures).
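
The "right-side continuation" of a detected `https://` box can be sketched as simple rectangle arithmetic. This is an illustration under assumptions: boxes are in pixel coordinates, the link stays on one horizontal text line, and the region starts at the prefix itself so that the OCR step sees the whole URL. The names `PixelBox` and `rightContinuation` are hypothetical.

```typescript
interface PixelBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Build the strip that starts at the detected prefix and runs to the right image edge.
const rightContinuation = (
  prefixBox: PixelBox,
  imageWidth: number,
  imageHeight: number,
  verticalPadding: number = 0,
): PixelBox => {
  // Clamp the padded strip to the image borders.
  const top = Math.max(0, prefixBox.y - verticalPadding);
  const bottom = Math.min(imageHeight, prefixBox.y + prefixBox.height + verticalPadding);
  return {
    x: prefixBox.x,
    y: top,
    width: Math.max(0, imageWidth - prefixBox.x),
    height: bottom - top,
  };
};

const sampleStrip = rightContinuation({ x: 100, y: 50, width: 80, height: 20 }, 640, 640, 4);
// sampleStrip is { x: 100, y: 46, width: 540, height: 28 }
```

A small vertical padding is a design choice that gives the recognizer some room if the detected box is slightly too tight around the text.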

So the question now is which object detection model we should choose and how to re-train it to support the detection of the custom `https://` objects (to do the [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning)).  

> Finally! We've got closer to the TensorFlow part of the article 😀


## 🤖 Selecting the object detection model

> I'm not a machine learning expert so please forgive me some future inaccuracies in this section in advance. Feel free to add corrections in the comments section under the article.

`ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8`

This "adviser algorithm that will take place before the Tesseract" reminds me of the [Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325), or SSD (I haven't read the paper though). The cool part about it is that it does the object detection for the whole image and for all possible objects in it **in one go**, regardless of the image content!



## 📝 Creating the Dataset Manually

- Making pictures of the book
- What tools to use to add bounding boxes
- How to convert to protobuf
- Issues with custom dataset (fonts, colors, bolds, underlined, etc.)
- Train/test split approach
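
As an illustration of the train/test split item, here is a minimal sketch of one common approach: shuffle the labeled images and hold out a fixed fraction for testing. This is not necessarily the split used for the Links Detector dataset, and the `splitTrainTest` helper and its parameters are made up for this example.

```typescript
// Shuffle a copy of the items (Fisher-Yates) and hold out a fraction for testing.
function splitTrainTest<T>(
  items: readonly T[],
  testFraction: number,
  random: () => number = Math.random,
): { train: T[]; test: T[] } {
  if (testFraction < 0 || testFraction > 1) {
    throw new RangeError('testFraction must be between 0 and 1');
  }
  const shuffled = items.slice();
  for (let i = shuffled.length - 1; i > 0; i -= 1) {
    const j = Math.floor(random() * (i + 1));
    const swapped = shuffled[i];
    shuffled[i] = shuffled[j];
    shuffled[j] = swapped;
  }
  const testSize = Math.round(shuffled.length * testFraction);
  return {
    test: shuffled.slice(0, testSize),
    train: shuffled.slice(testSize),
  };
}
```

Passing a custom `random` function makes the split reproducible, which helps when comparing training runs on the same data.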

### 🌅 Preprocessing the data

- Data preprocessing: resize, crop square, color adjustment

### 🔖 Labeling the dataset

- How to use LabelImg

### 🗜 Exporting the dataset

- Protobuf (the way of storing the dataset)

## 📚 Generating the Dataset Automatically (?)

- Automated way of generating the dataset
- Train/test split approach

## 📖 Exploring the Dataset

- Preview images with detection boxes
- Number of images (why is this enough)
- Do we need to preprocess the images

## 🛠 Installing Object Detection API 

- What is object detection API
- Why it will simplify our lives
- How it may be used

```bash
git clone --depth 1 https://github.com/tensorflow/models
```


```bash
ls -la models
```

```text
total 72
drwxr-xr-x  8 root root  4096 Nov 21 17:22 ./
drwxr-xr-x  1 root root  4096 Nov 21 17:24 ../
-rw-r--r--  1 root root   337 Nov 21 17:22 AUTHORS
-rw-r--r--  1 root root  1015 Nov 21 17:22 CODEOWNERS
drwxr-xr-x  2 root root  4096 Nov 21 17:22 community/
-rw-r--r--  1 root root   390 Nov 21 17:22 CONTRIBUTING.md
drwxr-xr-x  8 root root  4096 Nov 21 17:22 .git/
drwxr-xr-x  3 root root  4096 Nov 21 17:22 .github/
-rw-r--r--  1 root root  1104 Nov 21 17:22 .gitignore
-rw-r--r--  1 root root  1115 Nov 21 17:22 ISSUES.md
-rw-r--r--  1 root root 11405 Nov 21 17:22 LICENSE
drwxr-xr-x 12 root root  4096 Nov 21 17:22 official/
drwxr-xr-x  3 root root  4096 Nov 21 17:22 orbit/
-rw-r--r--  1 root root  3668 Nov 21 17:22 README.md
drwxr-xr-x 23 root root  4096 Nov 21 17:22 research/
```


```bash
cd ./models/research
protoc object_detection/protos/*.proto --python_out=.
```

```bash
cd ./models/research
cp ./object_detection/packages/tf2/setup.py .
pip install . --quiet
```

```text
ERROR: multiprocess 0.70.10 has requirement dill>=0.3.2, but you'll have dill 0.3.1.1 which is incompatible.
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.0 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: apache-beam 2.25.0 has requirement avro-python3!=1.9.2,<1.10.0,>=1.8.1; python_version >= "3.0", but you'll have avro-python3 1.10.0 which is incompatible.
```


## ⬇️ Downloading Pre-Trained Model

- Model detection Zoo review
- What models we could use possibly
- Why I've picked the MobileNet model
- Diagram of the model architecture

## 🏄🏻‍♂️ Trying the Model (Inference)

- Show that model works for general purpose classes
- Show that model doesn't work for custom objects (links)

## 📈 Setting Up TensorBoard

- Why do we need it (for debugging)
- What we will monitor

## 👨‍🎓 Transfer Learning

- What is transfer learning
- Why don't we train the model from scratch
- Allows us to use small dataset

### ⚙️ Configuring the Detection Pipeline

- Performance issues: batch size
- Starting not from scratch: checkpoints

### 🏋🏻‍♂️ Model Training

- Error prone: saving checkpoints
- How many epochs
- Monitoring the performance while training

### 🚀 Evaluating the Model

- Checking how accurate our model is on test dataset
- Are we good with performance, should we save the model?
- It is not a general purpose anymore, does it recognize our custom objects?

## 🗜 Exporting the Model

- Saving the model to the file for further re-use
- Show the list of files, what the model looks like on disc
- What is the size of the model

## 🚀 Evaluating the Exported Model

- Example of how to use the trained model

## 🗜 Converting the Model for Web

- What formats are suitable for the web
- A few words about TensorFlow.js
- Show the list of exported files: what the model looks like on disc
- What is the size of the model
- Why it is split into chunks and how they are connected (via model.json)

```bash
pip install tensorflowjs --quiet
```
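
To illustrate how the chunks are connected via `model.json`, here is a minimal sketch. A converted TensorFlow.js model has a `weightsManifest` section in `model.json` that lists the binary shard files, and the shards are resolved relative to the location of `model.json`. The interfaces below describe only the fields used here, and the `listWeightShardUrls` helper and the example URL are made up for this illustration.

```typescript
// Only the parts of model.json that matter for locating the weight shards.
interface WeightsManifestGroup {
  paths: string[];
  weights: { name: string; shape: number[]; dtype: string }[];
}

interface ModelJson {
  modelTopology: unknown;
  weightsManifest: WeightsManifestGroup[];
}

// Resolve every shard path relative to the folder that contains model.json.
const listWeightShardUrls = (modelJsonUrl: string, modelJson: ModelJson): string[] => {
  const baseUrl = modelJsonUrl.slice(0, modelJsonUrl.lastIndexOf('/') + 1);
  const shardUrls: string[] = [];
  modelJson.weightsManifest.forEach((group) => {
    group.paths.forEach((path) => {
      shardUrls.push(baseUrl + path);
    });
  });
  return shardUrls;
};
```

Splitting the weights into several files lets the browser fetch and cache them as separate requests, while `model.json` stays the single entry point.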

[?25l[K     |█████▎                          | 10kB 26.6MB/s eta 0:00:01[K     |██████████▌                     | 20kB 12.7MB/s eta 0:00:01[K     |███████████████▊                | 30kB 9.5MB/s eta 0:00:01[K     |█████████████████████           | 40kB 8.3MB/s eta 0:00:01[K     |██████████████████████████▏     | 51kB 4.7MB/s eta 0:00:01[K     |███████████████████████████████▍| 61kB 5.3MB/s eta 0:00:01[K     |████████████████████████████████| 71kB 4.1MB/s 
[?25h[?25l[K     |███▏                            | 10kB 23.4MB/s eta 0:00:01[K     |██████▍                         | 20kB 20.6MB/s eta 0:00:01[K     |█████████▌                      | 30kB 16.4MB/s eta 0:00:01[K     |████████████▊                   | 40kB 14.5MB/s eta 0:00:01[K     |███████████████▉                | 51kB 11.1MB/s eta 0:00:01[K     |███████████████████             | 61kB 11.1MB/s eta 0:00:01[K     |██████████████████████▏         | 71kB 7.4MB/s eta 0:00:01[K     |██████████████████████

## 🤔 Conclusions

- I'm just an amateur
- Links to demo app
- Issues and limitations of this approach
- Links to my ML repositories that they might like