CAPTCHAs_solver_CRNN

(image: intro)

A deep learning model that solves CAPTCHA codes with perfect accuracy using a CRNN (Convolutional Recurrent Neural Network).

Description

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. CAPTCHAs are tools used to differentiate between real users and automated ones, such as bots, by presenting challenges that are difficult for computers but relatively easy for humans.

However, these wavy, distorted text images are sometimes difficult even for humans to decipher.

Therefore, it would be great to have a highly accurate machine learning model that reveals the correct text every time without fail.

Requirement

  • tensorflow 2.0+
  • scikit-learn
  • opencv-python
  • editdistance

Dataset

The dataset is generated by the most popular WordPress CAPTCHA plugin, with nearly 8 million downloads (https://wordpress.org/plugins/really-simple-captcha/).

(image: dataset sample)

It generates 9,955 images of 4-letter CAPTCHAs using a random mix of four different fonts.
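Assuming each image file is named after its 4-letter answer (a common convention for shared versions of this dataset, e.g. a file named 2b8n.png contains the text "2b8n"; the paths below are hypothetical), the labels can be read directly from the filenames:

```python
import os

def label_from_filename(path):
    # Assumes each image is named after its 4-letter answer,
    # e.g. "samples/2b8n.png" -> "2b8n" (hypothetical path)
    return os.path.splitext(os.path.basename(path))[0]

print(label_from_filename("samples/2b8n.png"))  # -> 2b8n
```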

Architecture

Ideally, we want to detect text from a text image:

(image: architecture)

However, character segmentation is not practical because:

  • Too time-consuming
  • Too expensive
  • Impossible in most cases

For example, the character segmentation above is fine, but the one below is challenging. In fact, the traditional method runs into trouble whenever two or more characters are too close to each other, like this:

This project uses the state-of-the-art CRNN model, a combination of CNN, RNN, and CTC loss for image-based sequence recognition tasks, especially OCR (Optical Character Recognition), which makes it a perfect fit for CAPTCHAs.

(image: architecture)

This approach is far superior to the traditional one because it does not require any bounding-box detection for individual characters (character segmentation).

In this model, the image is dissected into a fixed number of timesteps in the RNN layers. As long as each character is split across two or three of these slices, to be processed and decoded later, the spacing between characters is irrelevant, like so:

(image: architecture)

Here are more details of the CRNN architecture:

(image: architecture)

As the diagram shows, the last CNN layer produces a feature map of shape 4×8×4. We then flatten the first and third dimensions into 16 and keep the second dimension unchanged, producing a 16×8 tensor. This effectively cuts the original image into 8 vertical slices (red lines), each containing 16 feature values. Since the CNN output gives us 8 slices to process, we also choose 8 as the number of timesteps in the LSTM layer. After stacked LSTM layers with a softmax (SM) activation function, CTC loss optimizes the resulting probability table.
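The shape bookkeeping above can be sketched in Keras. The 32×64 input size, filter counts, alphabet size, and LSTM width here are illustrative assumptions, not the notebook's exact values; three 2×2 poolings shrink the input to the 4×8×4 feature map, which is then rearranged into 8 timesteps of 16 features, and CTC loss would be attached on top of the softmax output:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(num_classes=36):  # 36 = a-z + 0-9, an assumption
    inp = layers.Input(shape=(32, 64, 1))
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)                      # -> 16 x 32
    x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)                      # -> 8 x 16
    x = layers.Conv2D(4, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)                      # -> 4 x 8 x 4, as in the diagram
    # Keep the width (8) as timesteps; merge height and channels into 16 features
    x = layers.Permute((2, 1, 3))(x)                   # -> (8, 4, 4)
    x = layers.Reshape((8, 16))(x)                     # -> 8 timesteps x 16 features
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    # +1 output class for the CTC blank symbol
    return tf.keras.Model(inp, layers.Dense(num_classes + 1, activation="softmax")(x))

model = build_crnn()
print(model.output_shape)  # (None, 8, 37)
```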

More details of the implementation can be found in the Jupyter notebook in this repository.

Result

We need the right evaluation metrics for an OCR task, computed with the edit distance library.

This is inspired by https://github.com/arthurflor23/handwritten-text-recognition/blob/master/src/data/evaluation.py

This helps calculate three evaluation metrics for any OCR task:

  • CER (Character Error Rate)
  • WER (Word Error Rate)
  • SER (Sequence Error Rate)
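Here is a minimal self-contained sketch of these metrics, using a pure-Python Levenshtein distance standing in for the editdistance package; since each CAPTCHA is a single 4-letter word, WER coincides with SER in this setting:

```python
def edit_distance(a, b):
    # Minimal Levenshtein distance (the editdistance package is a faster drop-in)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(pred, truth):
    # Character Error Rate: character edits per ground-truth character
    return edit_distance(pred, truth) / max(len(truth), 1)

def ser(preds, truths):
    # Sequence Error Rate: fraction of predictions that are not exact matches
    # (for single-word CAPTCHAs, WER gives the same number)
    return sum(p != t for p, t in zip(preds, truths)) / len(truths)

print(cer("8n5p", "8n5d"))                      # 0.25
print(ser(["8n5p", "2b8n"], ["8n5d", "2b8n"]))  # 0.5
```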

Here is my result for a test set:

(image: result)

This is an easy dataset, so I got an absolutely perfect score on the 200 images of the test set! Not even a challenge for the CRNN:

  • Character Error Rate: 0.0
  • Word Error Rate: 0.0
  • Sequence Error Rate: 0.0

Afterthoughts:

  • CRNN + CTC is not that challenging; just make sure to follow the above process step by step, as in the notebook.
  • Keeping the height and width a power of 2 (or at least an even number) makes it much easier to halve them repeatedly (this is not critical, since it only concerns model design and preprocessing).
  • The bidirectional LSTM width should be larger than the number of timesteps, since half the biLSTM width is the hidden size of each individual LSTM.
  • The max label length should equal the number of timesteps, although some people report that setting it slightly lower helps. You should stick with the basics, though!
  • The data is super clean and all images share the same dimensions. For other datasets, a bit of noise cleaning and binarization may help!
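To make the relationship between timesteps and max label length concrete, here is a toy CTC loss call using Keras's built-in helper. The shapes follow the architecture above (8 timesteps, 37 classes including the blank, label length 4); the probability table and labels are dummies:

```python
import tensorflow as tf

batch, timesteps, num_classes, label_len = 2, 8, 37, 4
# Random probability table standing in for the CRNN's softmax output
y_pred = tf.nn.softmax(tf.random.uniform((batch, timesteps, num_classes)))
labels = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8]])  # dummy 4-letter labels
input_length = tf.fill((batch, 1), timesteps)       # all 8 timesteps are valid
label_length = tf.fill((batch, 1), label_len)       # every label is 4 long
loss = tf.keras.backend.ctc_batch_cost(labels, y_pred, input_length, label_length)
print(loss.shape)  # one loss value per sample: (2, 1)
```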

Things to improve for other datasets:

  • Resize-image logic for multiple image sizes (perhaps as follows):
    • find min, max of height and width
    • resize to a fixed height you want
    • calculate the max width of all resized images
    • padding to all images to that max width
  • Combine the preprocessing logic for the train set and test set
  • Convert them to a tf.data pipeline (note that this is challenging since OpenCV won't work with tensors)

License

This project is licensed under the MIT License - see the LICENSE.md file for details

🏆 Author
