Faircent Internship

Arjun Sanghi

This repository contains the following files:

stage0Prelim.py pulls files from a given folder and converts them all to a single format. For ease of conversion, speed, and quality, I convert everything to PNG. A PDF can contain more than one page, so each page is converted to its own PNG file with Page_n in the filename, where n is the page number in the original PDF. The code saves the output to the same folder and deletes the original PDF/JPG files from it.
Stage1CheckForNamed.py checks for files whose names already make their contents obvious and renames them to a set format. For example, if a filename contains the word PAN or a similar variation, we can reasonably assume that the file is an image or scan of a PAN card. This check runs first because OCR is resource-hungry and slow, so obviously named files should skip it. The code renames such files in the same folder.
stage1Unnamed.py is one way to run OCR, using Tesseract. However, Tesseract could not process most files correctly, so using this code is not recommended without major changes to the model.
Stage1UnnamedAlt.py is my recommended way to run OCR. It uses MMOCR to pull details and categorize the files. In its current state, it can only accurately pull Aadhaar and PAN numbers.
stage1IdentifyFilesUsingNumbers.py checks whether the numbers obtained from MMOCR are valid and double-checks that the files are accurately named.
driverCode.py obtains the current date, imports other data from a Google Sheets spreadsheet connected to Google Forms, and inserts it all into a pre-set MySQL database.
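
The filename check and the number validation described above can be sketched roughly as follows. This is a minimal illustration and not the code in the repository: the keyword lists, function names, and return labels are hypothetical. The patterns rely only on the public formats of the two IDs (a PAN is five letters, four digits, and one letter, with the fourth letter encoding the holder type; an Aadhaar number is twelve digits and never starts with 0 or 1). The Aadhaar checksum digit is not verified here.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical keyword lists; the real script may match other variations.
NAME_KEYWORDS = {
    "PAN": ("pan", "pancard"),
    "AADHAAR": ("aadhaar", "aadhar", "adhaar", "adhar"),
}

# PAN: AAAAA9999A, where the fourth letter is one of the holder-type codes.
PAN_RE = re.compile(r"[A-Z]{3}[ABCFGHJLPT][A-Z][0-9]{4}[A-Z]")
# Aadhaar: twelve digits, first digit 2-9 (format check only, no checksum).
AADHAAR_RE = re.compile(r"[2-9][0-9]{11}")


def guess_type_from_filename(filename: str) -> Optional[str]:
    """Return a document label if the filename contains an obvious keyword."""
    # Split the stem into tokens so "company.png" does not match "pan".
    tokens = set(re.split(r"[^a-z0-9]+", Path(filename).stem.lower()))
    for doc_type, keywords in NAME_KEYWORDS.items():
        if tokens & set(keywords):
            return doc_type
    return None


def is_plausible_pan(text: str) -> bool:
    """Check that OCR output has the shape of a PAN."""
    return PAN_RE.fullmatch(text.strip().upper()) is not None


def is_plausible_aadhaar(text: str) -> bool:
    """Check that OCR output has the shape of an Aadhaar number."""
    digits = re.sub(r"[\s-]", "", text)  # OCR often keeps the 4-4-4 spacing
    return AADHAAR_RE.fullmatch(digits) is not None


if __name__ == "__main__":
    for name in ("PAN_card_scan.png", "my aadhar front.png", "IMG_0042.png"):
        print(name, "->", guess_type_from_filename(name))
```

Files that the filename check cannot label would go on to OCR, and a number that passes one of the format checks can then be used to confirm or correct the label.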

The code is fully modular, and all dependencies are listed in requirements.txt, so it is easy to work on.
MMOCR runs better on GPU than on CPU, but its CUDA support on Windows is poor. For that reason, I have tested on CPU only. For faster inference, download the CUDA versions of the applicable libraries and run this on a supported NVIDIA GPU.
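
One way to keep the same code working on both setups is to choose the device at runtime. The sketch below is an assumption about how this could be done, not something the repository currently does; pick_device is a hypothetical helper. It relies on the fact that MMOCR runs on PyTorch, and it falls back to CPU when PyTorch or a usable CUDA device is missing.

```python
def pick_device() -> str:
    """Return "cuda:0" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # MMOCR is built on PyTorch
    except ImportError:
        # No PyTorch installed at all, so only CPU-side work is possible.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Running OCR on:", pick_device())
```

How the resulting string is handed to MMOCR depends on the MMOCR version pinned in requirements.txt, so that step is left out here.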

Goals:

Increase accuracy to 90+%, possibly by training an in-house model on data from 10 lakh+ (1 million+) customers.
Increase inference speed by using the CUDA Toolkit and platform-specific builds instead of the current platform-agnostic versions.

The MySQL database is named faircent1 and has the following columns:

All paths are set for my system. I have included comments wherever a path needs to be changed for another system.
My System: Windows 11, Ryzen 9 5900HS, RTX 3070, 24GB RAM, 1TB SSD.

Folders and files:

.github/workflows
.idea
Test1Stage0
TestForMMOCR
Test_imagesStage0
data/wildreceipt
README.md
Rule for fourth letter of PAN card.txt
SECURITY.md
Screenshot 2022-06-28 121701.png
Stage1CheckForNamed.py
Stage1UnnamedAlt.py
Stage2.py
dataSourcesToOCRText.py
driverCode.py
driverForStage0.py
driverForStage1CheckForNamed.py
listOfLoanIDTaken.json
loanIDtaken.py
requirements.txt
stage0Prelim.py
stage1IdentifyFilesUsingNumbers.py
stage1Unnamed.py
tesserocr-2.5.2-cp39-cp39-win_amd64.whl
testofmmocr.py

Dataset credit: M. Sohail
