Project Abandoned

I will no longer be working on this project, mostly out of guilt. There are about 1 million posts per day, so roughly 500k per day when split between the blue and orange boards. Google's reCAPTCHA costs $1000 per 1M posts, so they probably end up paying Google $500 per day, or about $182.5k per year, just for captcha. One less Google engineer means three more 4chan engineers. And maybe jannies will get paid this time.

4chan Captcha Bypass (OCR)

This repo only contains OCR. For cleaning/splitting or scraping/posting, please refer to the 'Related' section.

Background

There is a new captcha that 4chan recently adopted (July 5th). Supposedly it will improve its security or something. I don't know the reasoning for this sudden change.

However, what I do know is that new changes mean new exploits. And as an autist, I cannot let this sit unexploited.

Project Summary

In this project I'll show you how you can generate a .traineddata that can later be used in your Tesseract project.

Tools

Tesseract Ocr (4.0.0 rc3)

For this we'll use a slightly older version of Tesseract. You can try a newer version, though I'm not sure it will work.

You can download 'tesseract-ocr-w32-setup-v4.0.0-rc4.20181024.exe' archived from: https://digi.bib.uni-mannheim.de/tesseract/

JTessBoxEditor (Optional)

You don't really need this unless you want to create your own .traineddata. Instructions are at the bottom if you want them.

Download the jTessBoxEditor-2.3.1.zip: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

We use this editor to train the data

QT Box Editor (Optional)

You don't really need this unless you want to create your own .traineddata. Instructions are at the bottom if you want them.

We'll use this to generate .box data

This contains the fix if you can't open the stable release version: https://github.com/zdenop/qt-box-editor/releases/tag/v1.12rc1

Instructions (Simple)

Download the .zip file which contains "chan.traineddata" in the 'tessdata' folder. You don't need anything else (unless you want to train your own data, then instructions are below)
Copy 'chan.traineddata' to 'C:\Program Files (x86)\Tesseract-OCR\tessdata'
Open cmd and type:

    cd "C:\Program Files (x86)\Tesseract-OCR"

    tesseract "C:\Path\To\Captcha.png" "C:\Path\To\Output" -l chan

You should see this in Output.txt (Tesseract appends .txt to the output path you gave):

0_Clean.png (No clean, No Split):

0_Clean.png (Clean, No split):

0_Split.png (Clean, Split):

As you can see, the best results are the ones that are fully cleaned and have gaps between the letters.
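If you would rather call Tesseract from a script than from cmd, the command above can be wrapped roughly like this. This is only a sketch: `build_command` and `run_ocr` are hypothetical helper names, and the executable path assumes the default install folder used in the steps above.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumed install location, taken from the steps above; adjust if yours differs.
TESSERACT_EXE = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


def build_command(image: str, output_base: str, lang: str = "chan",
                  exe: str = TESSERACT_EXE) -> List[str]:
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return [exe, image, output_base, "-l", lang]


def run_ocr(image: str, output_base: str, lang: str = "chan",
            exe: str = TESSERACT_EXE) -> str:
    """Run Tesseract on one image and return the recognized text.

    Tesseract appends ".txt" to the output base by itself, so the
    result is read back from "<output_base>.txt".
    """
    subprocess.run(build_command(image, output_base, lang, exe), check=True)
    return Path(output_base + ".txt").read_text(encoding="utf-8").strip()
```

Calling `run_ocr(r"C:\Path\To\Captcha.png", r"C:\Path\To\Output")` should be equivalent to the cmd example, provided 'chan.traineddata' is already in the tessdata folder.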

Instructions (Advanced)

The included 'chan.traineddata' was generated from a relatively small dataset. To improve accuracy, it is best to add your own dataset to what was included.

Create your datasheet.png

The data sheet just contains a grid of the characters you want to train. Nothing too fancy. I include a 'chan.xcd' that you can open with GIMP if you would like to see it.

You can add your characters here

Create new folder

It's very important that you put the .png you made inside a new folder.

Creating .box file

Drag and drop your .png into QT Box Editor

Click Yes to generate the .box data

QT Box Editor will generate the .box data automatically

Edit the letters to make sure it's correct

Save it
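Before training, it can help to sanity-check the saved .box file. The sketch below assumes the standard Tesseract box format, one character per line as `<symbol> <left> <bottom> <right> <top> <page>` with the origin at the bottom-left of the image. `Box`, `parse_box_line` and `find_suspect_boxes` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of a .box file into a Box."""
    # Split from the right so the symbol itself is kept whole.
    symbol, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def find_suspect_boxes(text: str) -> List[Box]:
    """Return boxes whose width or height is zero or negative."""
    boxes = [parse_box_line(line) for line in text.splitlines() if line.strip()]
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]
```

Any box this returns is probably worth fixing in QT Box Editor before you move on to training.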

Train your data

Open jTessBoxEditor (make sure you have Java installed)

Make sure you have the right input

'Tesseract Executables' should point to "C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe"
'Training Data' should point to your .box file
'Language' should be the file name of your .box and .png
Set option to 'Train with Existing Box'
Hit Run

It should take less than a second to train since the dataset is very small.

It will generate a bunch of files, but you can delete everything except the tessdata folder and the .box and .png files. The only thing that matters is the .traineddata.

You're done! Now scroll back up and follow Instructions (Simple) to use your .traineddata.
