I will no longer be working on this project mostly out of guilt. There is 1 million post per day so roughly 500k per day split between blue and orange board. Google's reCaptcha cost $1000 per 1M post so they proably end up paying google $500 / day just for captcha or $185k/year just for captcha. 1 less google engineer mean 3 more 4chan engineer. And maybe jannies will get paid this time.
This repo only contains OCR. For cleaning/spliting or scraping/posting please refer to the 'Related' section
For Cleaning/Spliting Captcha refer to: https://github.com/14AwooTard88/4Pass-CleanSplit
For Scraping Captcha and posting refer to: https://github.com/14AwooTard88/4Pass-Scrape
There is a new captcha that 4chan just recently adopted (July 5th). And supposedly it will help make it's security better or something. I don't know the reasoning for this sudden change.
However, what i do know is that new changes means new exploits. And as an autist, I cannot let this sit unexploited.
In this project I'll show you how you can generate a .traineddata that can later be used in your Tesseract project.
For this we'll use a slightly older version of Tesseract. You can try newer version tho i'm not sure if it will work
You can download 'tesseract-ocr-w32-setup-v4.0.0-rc4.20181024.exe' archived from: https://digi.bib.uni-mannheim.de/tesseract/
You don't really need this unless you want create your own .trainneddata Instructions are on the bottom if you want it
Download the jTessBoxEditor-2.3.1.zip: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/
We use this editor to train the data
You don't really need this unless you want create your own .trainneddata Instructions are on the bottom if you want it
We'll use this to generate .box data
This contains the fix if you can't open the stable release version: https://github.com/zdenop/qt-box-editor/releases/tag/v1.12rc1
-
Download the .zip file which contains "chan.traineddata" in the 'tessdata' folder. You don't need anything else (unless you want to train your own data, then instructions are below)
-
Copy 'chan.traineddata' to 'C:\Program Files (x86)\Tesseract-OCR\tessdata'
-
Open cmd and type:
cd "C:\Program Files (x86)\Tesseract-OCR"
tesseract "C:\Path\To\Captcha.png" "C:\Path\To\Output" -l chan
You should see this in the out.txt:
0_Clean.png (No clean, No Split):
0_Clean.png (Clean, No split):
0_Split.png (Clean, Split):
As you can see, the best result that the ones that are fully cleaned and has gap in between the letters
The 'chan.traineddata' that was included was generated with realitively small dataset. To improve accuracy, it is best if you add your own dataset in addition to what was included.
- Create your datasheet.png
Data sheet just contains grid the characters you want to train. Nothing too fancy. I include a 'chan.xcd' that you can open with Gimp if you would like to see it.
You can add your character here
-
Create new folder It's very important that you put the .png you made inside a new folder
-
Creating .box file Drag and drop your .png into QT Box Editor
Click Yes To generate .box data
QT Editor Will Generate .box data automatically
Edit the letters to make sure it's correct
Save it
- Train your data
Open JTessBoxeditor (Make sure you have java installed)
Make sure you have the right input
- 'Tesseract Executables' should point to "C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe"
- 'Training Data' should point to your .box file
- 'Language' Should be the file name of your .box and .png
- Set option to 'Train with Existing Box'
- Hit Run
It should take less than a second to train since there is very little dataset
It will generate a bunch of files but you can delete everything except the tessdata folder, .box and .png file. The only thing that matters is the .trainneddata
- Your Done! Now scroll back up and follow the Instruction (Simple) on how to use your .traineddata