SummerHack2023 - Tabular Data Extractor

Overview

This project, submitted by Team Cabbage for SummerHack2023 by CISSA, is a web application that simplifies data collection for data scientists. The app extracts tabular data from PNG, JPG, or PDF files uploaded by the user and converts it into a downloadable CSV file. Data scientists spend much of their time and effort collecting, cleaning, and preparing data for analysis. Our project aims to minimise that overhead.

Watch the video at https://youtu.be/hb62_uenFzI

Key Features

Extracts tabular data from multiple file formats (PNG, JPG, PDF)
Outputs data as a CSV file that is easily accessible and compatible with most data analysis tools
Minimizes the time and effort spent on data collection and preparation for analysis
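
As a rough illustration of the CSV output step, the sketch below serialises a table that has already been extracted as rows of strings. The function name rows_to_csv and the sample data are hypothetical and are not taken from the project's code; only Python's standard csv module is used.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise a list of rows (lists of strings) into CSV text.

    Hypothetical helper for illustration; not the project's actual code.
    """
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output identical across platforms.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample table; the comma inside "Lin, Wei" forces quoting.
table = [["Name", "Score"], ["Ada", "91"], ["Lin, Wei", "88"]]
csv_text = rows_to_csv(table)
print(csv_text)
```

Using the csv module rather than joining strings by hand means cells that contain commas or quotes are escaped correctly, which matters for OCR output that may include stray punctuation.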

Technologies Used

The web application was built with React JS for the front-end, and Express.js and Python for the back-end. The image scanning feature was implemented in Python with the following technologies:

Tesseract OCR

Tesseract OCR is an open-source optical character recognition (OCR) engine, originally developed at Hewlett-Packard and later sponsored by Google. In our web application, we used Tesseract OCR to recognize text in images. This lets us extract meaningful data from images of text and convert it into a machine-readable format.
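
To show how recognized words can become table rows, here is a minimal sketch that groups word boxes by vertical position. It assumes the OCR step yields (text, left, top) tuples, which is the kind of information a wrapper such as pytesseract's image_to_data reports; whether the project uses that wrapper or this grouping strategy is not stated, and the words_to_rows helper and row_tolerance parameter are hypothetical.

```python
def words_to_rows(words, row_tolerance=10):
    """Group (text, left, top) word boxes into rows of cell text.

    Words whose top coordinate lies within row_tolerance pixels of the
    first word in the current row are placed in that row. Each row is
    then ordered left to right. Illustrative only.
    """
    rows = []  # each entry: (top of first word in row, [(left, text), ...])
    for text, left, top in sorted(words, key=lambda w: w[2]):
        if rows and abs(top - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append((top, [(left, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Made-up OCR output for a 2x2 table, deliberately out of order.
ocr_words = [
    ("Score", 120, 12),
    ("Name", 10, 10),
    ("Ada", 10, 40),
    ("91", 120, 42),
]
table = words_to_rows(ocr_words)
print(table)
```

A fixed pixel tolerance is the simplest possible choice; real scans with skew or varying font sizes would likely need something more robust.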

OpenCV

OpenCV is an open-source computer vision library that provides numerous functions for image and video processing. In our web application, we used OpenCV for contour detection, image preprocessing, and image editing. Contour detection identifies and extracts the regions of an image that are relevant to our application. Image preprocessing prepares the image for OCR by removing noise and improving image quality. Image editing applies transformations that make the image more suitable for OCR.

NumPy

NumPy is a Python library for scientific computing that provides support for large, multi-dimensional arrays and matrices. In our web application, we used NumPy to apply sampling methods for binarizing and upscaling images. Binarizing an image means converting it to pure black and white, which can improve OCR performance. Upscaling an image means increasing its resolution, which can improve the accuracy of OCR processing. NumPy lets us perform both operations efficiently on whole arrays.
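
A minimal sketch of both operations, using only NumPy, is shown below. The fixed threshold and nearest-neighbour sampling are assumptions chosen for simplicity; the source does not say which sampling methods the project applies, and the function names are hypothetical.

```python
import numpy as np


def binarize(gray, threshold=128):
    """Map pixels above threshold to 255 (white) and the rest to 0 (black)."""
    gray = np.asarray(gray)
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


def upscale_nearest(image, factor):
    """Enlarge an image by an integer factor using nearest-neighbour sampling.

    Every pixel is repeated factor times along both axes.
    """
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)


# Tiny made-up grayscale image.
gray = np.array([[10, 200], [90, 250]], dtype=np.uint8)
binary = binarize(gray)           # 2x2 image of 0s and 255s
big = upscale_nearest(binary, 2)  # 4x4 image
print(binary)
print(big)
```

Nearest-neighbour sampling preserves the hard black and white edges of a binarized image, whereas interpolating methods would reintroduce grey values.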

How to Install and Run the Project

Download the code from GitHub to your local machine.
Go to the code directory.
Run pip install -r requirements.txt in the /SummerHack2023 directory.
Change directory into /SummerHack2023/client and run npm install.
Change directory into /SummerHack2023/server and run npm install.
If you are a macOS user, change directory into /SummerHack2023/scanner and comment out line 152 of img_plumber.py.
Run npm run start in both the client and server folders to start the front-end and back-end, respectively.
Visit http://localhost:5173/ and upload the images you want to extract data from.

Limitations

Data extraction may take a long time, depending on the size and number of images.
Although user-supplied images are upscaled, the application may still have trouble extracting data from low-resolution images.

Future Plans

Making the data extraction API available.
Implementing a log-in system with authentication and databases.

Using Docker

We have a branch called jensen/docker-things where we attempted to run the application with Docker. Check it out if you're curious. The branch's README has instructions for running the application in Docker containers (assuming you already have a Docker setup).

Credits

Haruki Koh (GitHub | UniMelb student): Front-end implementation with React JS and TypeScript
Euan Lim (GitHub | Monash U student): Back-end implementation with Express.js (server) and Python (scanner)
Zoe Tay (GitHub | Monash U student): CSS styling and website design
Jensen Kau (GitHub | Monash U student): Debugging, technology research, and Docker
