We decided to tackle this project because, as college students, most of us spend much of our time reading large numbers of documents. Following the project guidelines, we chose to build a smart PDF reader that, given a PDF or TXT file, offers features to help the reader understand the document more fully.
Clippy takes a PDF and displays its contents, a summary, and its headings in a straightforward user interface. The summaries are generated using tokenization, count vectorization, TF-IDF, and multinomial Naive Bayes classification. The program also predicts the category of the given text (see summarizer.py for more information).
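To make the summarization steps more concrete, here is a minimal sketch of extractive summarization with tokenization, per-sentence word counts, and TF-IDF scoring. It is an illustration only, not the code in summarizer.py: it uses only the Python standard library rather than nltk or sklearn, it leaves out the Naive Bayes category prediction, and the function names (`tokenize`, `split_sentences`, `summarize`) and the smoothed IDF formula are assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Count vectorization: one bag-of-words counter per sentence.
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    # Document frequency: the number of sentences containing each term.
    df = Counter(term for c in counts for term in c)

    scores = []
    for c in counts:
        total = sum(c.values())
        if total == 0:
            scores.append(0.0)
            continue
        # Mean TF-IDF weight of the sentence, using a smoothed IDF.
        score = sum(
            (tf / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, tf in c.items()
        )
        scores.append(score)

    # Rank by score (ties go to the earlier sentence), then restore order.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))[:max_sentences]
    return [sentences[i] for i in sorted(ranked)]
```

Sentences made of terms that are rare across the document score higher, so repeated filler sentences tend to be dropped first. A real pipeline would likely also remove stop words and use a more robust sentence tokenizer.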
Using your preferred shell and the GitHub CLI (gh), the steps are as follows:
➊ Create and move to a new directory.
mkdir clippy-clone
cd clippy-clone
➋ Clone the repo using the GitHub CLI (gh).
gh repo clone jwc524/clippy
- fpdf
- matplotlib
- nltk
- pdfplumber
- pdfminer
- pymupdf (requires version 1.18.17)
- pypdf2
- sklearn
- ssl
- textwrap
- tkinter/ttk
- tkpdfviewer
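Note that ssl, textwrap, and tkinter/ttk are part of the Python standard library and do not need to be installed with pip, and that sklearn is published on PyPI as scikit-learn. To install each remaining dependency, use the following structure: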
pip install <package>
However, as noted in the dependency list, pymupdf must be pinned to version 1.18.17:
pip install pymupdf==1.18.17
Alternatively:
python3 -m pip install -U pymupdf==1.18.17
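Because the program depends on this exact PyMuPDF version, it can help to confirm the pin after installing. The following is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper names `installed_version` and `pymupdf_is_pinned` are hypothetical and are not part of Clippy.

```python
from importlib import metadata

REQUIRED_PYMUPDF = "1.18.17"  # version stated in the dependency list


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pymupdf_is_pinned(required=REQUIRED_PYMUPDF):
    """True only if PyMuPDF is installed at exactly the required version."""
    return installed_version("PyMuPDF") == required


if __name__ == "__main__":
    found = installed_version("PyMuPDF")
    if pymupdf_is_pinned():
        print(f"PyMuPDF {found} matches the required version.")
    else:
        print(f"Expected PyMuPDF {REQUIRED_PYMUPDF}, found {found}.")
```

Running the script before the first launch gives a quick yes/no answer without starting the GUI.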
For help with repository cloning, refer to Quickstart ⏩.
The pdfs/ directory contains sample PDFs to use with Clippy.
The reader/ directory contains the main Python scripts for the program.
The future/ directory contains work-in-progress scripts of upcoming features.
Headings parses the PDF for its headings, using the document's outline if one already exists. It primarily functions as a GUI class.
Main makes up the bulk of the program, handling the user interface and calls to the other modules.
Merging handles the PDF merging calls from main.py. It primarily functions as a GUI class.
Rotating handles PDF rotation as controlled by the user. It primarily functions as a GUI class.
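As an illustration of reusing an existing outline, the sketch below reads a PDF's table of contents with PyMuPDF's `Document.get_toc()`, which returns a list of `[level, title, page]` entries (an empty list when the PDF has no outline). This is not the project's actual headings code; `format_outline` and `load_outline` are hypothetical helper names, and the fallback heading detection for PDFs without an outline is not shown.

```python
def format_outline(toc):
    """Render [level, title, page] entries as an indented text outline."""
    lines = []
    for level, title, page in toc:
        indent = "  " * (level - 1)  # level 1 entries are not indented
        lines.append(f"{indent}{title} (p. {page})")
    return lines


def load_outline(pdf_path):
    """Return the PDF's outline entries, or an empty list if it has none."""
    import fitz  # PyMuPDF; imported lazily so format_outline works without it

    doc = fitz.open(pdf_path)
    try:
        return doc.get_toc()
    finally:
        doc.close()
```

For example, `format_outline(load_outline("paper.pdf"))` would give one display line per outline entry, where `paper.pdf` stands in for any PDF that has an embedded outline.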
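One way user-controlled rotation could work is sketched below, assuming PyMuPDF's `Page.rotation` property and `Page.set_rotation()` method. PDF page rotation must be a multiple of 90 degrees, so the helper normalizes the angle first. The names `rotated_angle` and `rotate_pdf` are hypothetical, and the project's own rotation module may work differently.

```python
def rotated_angle(current, step=90):
    """Return the new page rotation in degrees, normalized to 0, 90, 180 or 270."""
    if step % 90 != 0:
        raise ValueError("rotation step must be a multiple of 90 degrees")
    return (current + step) % 360


def rotate_pdf(src_path, dst_path, step=90):
    """Rotate every page of src_path by `step` degrees and save to dst_path."""
    import fitz  # PyMuPDF; imported lazily so rotated_angle works without it

    doc = fitz.open(src_path)
    try:
        for page in doc:
            page.set_rotation(rotated_angle(page.rotation, step))
        doc.save(dst_path)
    finally:
        doc.close()
```

A negative step rotates counterclockwise, so `rotated_angle(0, -90)` gives 270.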
Summarizer parses the PDF and generates a summary using NLP methods. It also generates a number of graphs based on the extracted text.
Because this project was created in a limited amount of time, there are several improvements still to be made:
- Creating a more responsive, fully featured GUI
- Improving the data mining features
- Implementing more user-friendly features
- Extracting images and data tables for easy access
- Integrating with the Google Scholar API and JSTOR
This project was written by Ryan Truong, Tony Nguyen, and Jonathan Cole.
The application takes a long time to start up the first time it is run.
The program will not run correctly without the correct version of PyMuPDF (1.18.17).
This project was completed in fulfillment of the requirements of CSC 3400 at Belmont University. Special thanks to Dr. Esteban Parra Rodriguez.