Top2VecApp

Top2VecApp is a desktop application that offers topic modelling capabilities based on the Top2Vec algorithm.

Video demo

How the application works

At its core, Top2VecApp is a web application embedded into a desktop application.

When the user launches Top2VecApp, they are directed to the file upload page, where they can upload a CSV file, select the column containing the text corpus and optionally select two other columns for grouping the text corpus. Each time the file upload page is visited via a GET request (the default), the storage folder within the application is cleared of its contents. When the user submits the CSV file and the selected columns, the CSV file is temporarily stored in the storage folder within the application.
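The storage behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the application's actual code: the function names (`clear_storage`, `store_upload`, `handle_upload_page`) and the folder layout are assumptions made for the example, and no web framework is involved.

```python
import shutil
import tempfile
from pathlib import Path


def clear_storage(storage_dir: Path) -> None:
    """Delete everything inside storage_dir but keep the folder itself."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    for entry in storage_dir.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def store_upload(storage_dir: Path, filename: str, data: bytes) -> Path:
    """Write the uploaded CSV bytes into storage_dir and return the saved path."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a crafted name cannot escape the folder.
    target = storage_dir / Path(filename).name
    target.write_bytes(data)
    return target


def handle_upload_page(method, storage_dir, filename=None, data=None):
    """Hypothetical stand-in for the upload route: GET clears, POST stores."""
    if method == "GET":
        clear_storage(storage_dir)
        return None
    return store_upload(storage_dir, filename, data)


if __name__ == "__main__":
    storage = Path(tempfile.mkdtemp()) / "storage"
    saved = handle_upload_page("POST", storage, "responses.csv", b"id,text\n1,hello\n")
    print("stored:", saved.name)
    handle_upload_page("GET", storage)
    print("after GET:", list(storage.iterdir()))
```

Clearing on every GET keeps at most one uploaded file on disk at a time, which fits the temporary nature of the stored CSV.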

Once the CSV file is stored, a job is submitted to the worker thread and the user is directed to the progress page. Every second, the main application thread sends a GET request to the worker thread to query the job status. Once the main application thread receives the signal from the worker thread that the job is completed, the user is directed to the download page, where they can download the zipped job results.
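The submit-then-poll pattern can be illustrated as follows. This is a sketch under stated assumptions: `JobWorker` and `wait_for_job` are hypothetical names, and the status check is a direct method call rather than the GET request the application uses.

```python
import threading
import time
import uuid


class JobWorker:
    """Runs jobs on background threads and exposes their status for polling."""

    def __init__(self):
        self._status = {}  # job id -> "running" | "completed" | "failed"
        self._lock = threading.Lock()

    def submit(self, func, *args):
        """Start func(*args) on a background thread and return a job id."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._status[job_id] = "running"
        thread = threading.Thread(
            target=self._run, args=(job_id, func, args), daemon=True
        )
        thread.start()
        return job_id

    def _run(self, job_id, func, args):
        try:
            func(*args)
            outcome = "completed"
        except Exception:
            outcome = "failed"
        with self._lock:
            self._status[job_id] = outcome

    def status(self, job_id):
        with self._lock:
            return self._status.get(job_id, "unknown")


def wait_for_job(worker, job_id, interval=1.0, timeout=60.0):
    """Poll the job status every `interval` seconds until it stops running."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = worker.status(job_id)
        if state != "running":
            return state
        time.sleep(interval)
    return "timeout"


if __name__ == "__main__":
    worker = JobWorker()
    job = worker.submit(time.sleep, 0.2)  # stand-in for the real pipeline
    print(wait_for_job(worker, job, interval=0.05))
```

Running the job off the main thread keeps the interface responsive while the pipeline works, and polling at a fixed interval keeps the progress page logic simple.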

Under the hood, the job processed by the worker thread consists of the following stages.

Once the job is submitted, the CSV file is loaded into memory and its text corpus is tokenized (i.e. individual words are broken up into smaller subwords).
The tokenized text corpus is then fed through the pretrained transformer-based natural language processing (NLP) model, all-MiniLM-L6-v2, which outputs the text embeddings for the input text corpus (i.e. numerical vector representations of the text).
The text embeddings are then passed through the Uniform Manifold Approximation and Projection (UMAP) algorithm for dimensionality reduction.
Thereafter, the compressed text embeddings are clustered into topics using the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm.
An extractive summary of each topic is obtained by taking the 5 text responses closest to the topic centroid in terms of cosine distance.
The clustering results and extractive summaries are then written into two separate Excel files, which are zipped together into a single folder for download.
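The extractive summary stage can be illustrated with a small pure-Python sketch. It assumes the embeddings and cluster labels are already available as plain lists, and that noise points carry the label -1 (the convention HDBSCAN uses). Whether the application computes centroids in the original or the reduced embedding space, and how it treats noise points, is not stated here, so both choices below are assumptions. The function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Return the component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def extractive_summary(texts, embeddings, labels, top_k=5):
    """Map each topic label to the top_k texts closest to its centroid."""
    members_by_topic = {}
    for index, label in enumerate(labels):
        if label == -1:  # assumption: noise points are left out of summaries
            continue
        members_by_topic.setdefault(label, []).append(index)

    summaries = {}
    for label, members in members_by_topic.items():
        center = centroid([embeddings[i] for i in members])
        ranked = sorted(
            members, key=lambda i: cosine_distance(embeddings[i], center)
        )
        summaries[label] = [texts[i] for i in ranked[:top_k]]
    return summaries


if __name__ == "__main__":
    texts = ["slow delivery", "late parcel", "delivery delay", "great price"]
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    labels = [0, 0, 0, 1]
    print(extractive_summary(texts, embeddings, labels, top_k=2))
```

Because the summary reuses actual responses rather than generating new text, every sentence in it can be traced back to a row in the uploaded CSV file.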

Installation Instructions

Run from executable file

Download Top2VecApp from here and unzip it

Run from source code

Clone this repository

    $ git clone https://github.com/AddChew/cs50x.git
    $ cd cs50x

Install the required dependencies

    $ pip install -r requirements.txt

Usage Instructions

Run from executable file

Launch Top2VecApp by double-clicking Top2VecApp.exe
Upload the CSV file containing the text that you want to cluster and follow the on-screen instructions

Run from source code

Navigate to the app folder inside the cs50x folder

    $ cd app

Launch Top2VecApp

    $ python app.py

Upload the CSV file containing the text that you want to cluster and follow the on-screen instructions

Design Considerations

The following issues were considered when building the application.

Desktop vs web application

Top2VecApp was initially conceived as a web application, because a web application requires no prior setup on the user's end (i.e. the user need not install anything to run the application; all they need is internet access and a web browser). However, resource constraints (specifically, insufficient RAM) on the Heroku cloud platform made this infeasible.

As such, Top2VecApp pivoted from being a web application to being a desktop application. Local machines typically have more computing power and memory than what was available on the cloud platform (Heroku), and hence can better handle the workload incurred by Top2VecApp. Another reason for the pivot is that user input cannot be trusted. For instance, the user could navigate to URLs within the web application in unintended ways, or refresh the page even when told not to, and this would cause the web application to behave unexpectedly. Designing Top2VecApp as a desktop application allows for fine-grained control over the application widgets (i.e. which widgets to include and what they do). This helps to prevent users from using Top2VecApp in unintended ways. For example, because Top2VecApp has no URL bar widget, users cannot navigate to URLs in unauthorised ways.

Packaging of application

Top2VecApp is bundled and distributed as a single executable file that contains all the dependencies required for the application to run. This minimises the setup required on the user's end: all the user needs to do is download the executable file and launch it.

Project Navigation

The app folder contains the source code for Top2VecApp

desktop folder

gui.py
- Contains helper classes for creating the desktop GUI

models folder

encoder.py
- Contains helper classes for loading the model and running model inference
pipeline.py
- Contains helper classes for processing the input text corpus through the Top2Vec algorithm and then saving the results to Excel files
tokenizer.py
- Contains helper functions and classes to tokenize text
top2vec.py
- A lightweight version of the original Top2Vec library
vocab.txt
- Contains the tokens used for tokenizing text
sent-transformer.onnx
- The natural language processing (NLP) model used for obtaining the text embeddings
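To illustrate how a vocabulary file such as vocab.txt can drive subword tokenization, here is a minimal sketch of greedy longest-match-first WordPiece splitting, the scheme used by BERT-style tokenizers like the one that accompanies all-MiniLM-L6-v2. The function name and the toy vocabulary are assumptions for the example; this is not the code in tokenizer.py, which may differ in details such as text normalisation and length limits.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


if __name__ == "__main__":
    # Toy vocabulary standing in for the contents of a real vocab file.
    toy_vocab = {"topic", "model", "##ling", "##s"}
    for word in ["modelling", "topics", "xyz"]:
        print(word, wordpiece_tokenize(word, toy_vocab))
```

Splitting rare words into known pieces lets a fixed vocabulary cover text it has never seen in full, at the cost of longer token sequences.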

static folder

appscripts folder
- Contains the JavaScript libraries and scripts used in Top2VecApp
images folder
- Contains the Top2VecApp desktop application icon
styles folder
- Contains the stylesheets used for styling Top2VecApp

templates folder

Contains the HTML templates used in Top2VecApp

app.py

Contains the backend logic of Top2VecApp

config.py

Contains the application configurations for Top2VecApp
