Private, secure, low-resource file sorting assistant powered by a self-learning local algorithm.
Themis is cross-platform software that helps users reorganize files on their own computer. It analyzes file names, proposes a new folder structure, shows the proposed changes before anything is moved, and lets the user edit categories before applying the plan.
The project originally worked as an LDA-only topic-based sorter. The current version keeps LDA as the first discovery layer (1.0), but adds a local Naive Bayes learning layer (2.1). This means Themis can now improve over time from the categories approved by the user, without uploading files, file names, or training data to any external service.
2.4 update note: This release improves both the graphical interface and the terminal/C++ workflow with a stronger, more persistent Bayes training system. Bayes now keeps a local memory file, themis_bayes_memory.jsonl, next to the launcher or executable, and can be trained from an already sorted folder without moving any existing files. The destination folder can also be used as a training source during analysis, which makes it easier to add unsorted files to an existing sorted repository. The move history is now always written to themis_history.jsonl, while Bayes training can be enabled or skipped independently for each move. The interface also adds clearer controls, safer text wrapping, and dedicated Import plan / Export plan buttons for CSV plans. The CLI and C++ TUI were updated to match the same 2.4 behavior, including optional Bayes training, persistent memory, sorted-folder training import, and consistent plan import/export support.
| Property | Meaning |
|---|---|
| Name | Themis |
| License | GNU Affero General Public License |
| Privacy model | Local-first. File names, paths, categories, and training history stay on the user’s machine. |
| Network use | No network connection is required for normal use. |
| Data sent to third parties | None by design. |
| Main input | File names and paths selected by the user. |
| Main output | A proposed file-moving plan and, after confirmation, moved files. |
| Destructive operations | No deletion is performed by design. |
| Default behavior | Preview-first: Themis proposes a plan before moving files. |
| Original algorithm | Lightweight LDA-inspired topic model over file-name tokens. |
| Current algorithm | Hybrid workflow: LDA discovers initial groups, then local Naive Bayes takes over when enough approved training history exists. |
| Training model | Local, incremental-by-history training from user-approved categories stored in themis_history.jsonl. |
| Resource usage | Lightweight. It uses file-name tokens, not full file contents. |
| Cross-platform | Windows, macOS, and GNU/Linux. |
| Interfaces | GUI and CLI. |
| Target users | People who want to reorganize local folders safely and privately. |
Thémis is named after Themis, the ancient Greek personification of divine order, law, fairness, and proper arrangement.
The name was chosen because the software’s purpose is to restore order in chaotic folders.
Themis helps sort local files by analyzing their file names.
For example, a folder containing:
invoice_client_alpha_2025.pdf
invoice_client_beta_2024.pdf
holiday_family_photo.jpg
project_report_ai.docx
meeting_notes_budget.txt
may be reorganized into folders such as:
Themis_Sorted/
    invoices/
        invoice_client_alpha_2025.pdf
        invoice_client_beta_2024.pdf
    photos/
        holiday_family_photo.jpg
    reports/
        project_report_ai.docx
    meetings/
        meeting_notes_budget.txt
At the very beginning, Themis may create LDA-style topic folders such as:
Themis_Sorted/
    01_invoice_client_2025/
    02_photo_holiday_family/
    03_project_report_ai/
After the user reviews and approves categories, the local Bayes model learns from that history. Future scans can then use clearer user-defined categories such as:
invoices
photos
reports
meetings
courses
archives
Themis does not:
- read the full content of files;
- upload files to a server;
- require an online account;
- train a remote model;
- send file names, paths, or categories to third parties;
- delete files;
- guarantee perfect classification;
- replace human validation;
- understand private business context unless that context appears in file names or approved categories;
- decrypt, inspect, or modify document contents.
The program is designed as a local assistant, not as an autonomous document manager.
Themis follows a review-first workflow.
Select folders
↓
Scan file names
↓
Tokenize names
↓
Run LDA discovery model
↓
Load local Bayes history if available
↓
Use Bayes when trained and confident
↓
Generate proposed categories and destinations
↓
Show editable plan
↓
User validates or edits categories
↓
Move selected files
↓
Write local training history
↓
Bayes improves for the next scan
No move is applied until the user confirms the operation.
The current version uses a hybrid approach.
LDA = discovery layer
Bayes = learned classification layer
The project started as an LDA-only sorter. That was useful for discovering groups without any predefined categories. However, LDA does not truly learn the user’s preferred folder names. The new Bayes layer solves that by learning from locally approved categories.
A file name such as:
invoice_client_alpha_2025.pdf
is transformed into tokens:
invoice, client, alpha, 2025
The extension may be used as fallback information if the name contains no useful token.
Common words such as the, and, de, le, la, document, file, or copy are ignored because they usually do not help classification.
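The tokenization step described above can be sketched as follows. This is a minimal illustration, not the code in scanner.py: the function name `tokenize_name`, the splitting rule, and the short `STOPWORDS` set (copied from the examples in the text) are assumptions.

```python
import re
from pathlib import Path

# Assumption: a tiny stopword set built from the examples in the text.
# The real list (built-in or NLTK-based) is larger.
STOPWORDS = {"the", "and", "de", "le", "la", "document", "file", "copy"}


def tokenize_name(path: str) -> list[str]:
    """Split a file name into lowercase tokens and drop stopwords."""
    p = Path(path)
    # Split the name (without extension) on anything that is not a letter or digit.
    tokens = [t for t in re.split(r"[\W_]+", p.stem.lower()) if t]
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Fall back to the extension when the name yields no useful token.
    if not tokens and p.suffix:
        tokens = [p.suffix.lstrip(".").lower()]
    return tokens


print(tokenize_name("invoice_client_alpha_2025.pdf"))
print(tokenize_name("copy.pdf"))
```

The first call yields the tokens shown above; the second shows the extension fallback, because the only token in the name is a stopword.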
When there is no training history, Themis uses the lightweight LDA model to group file-name tokens into topics.
Example LDA topic labels:
invoice_client_2025
photo_holiday_family
project_report_ai
This is useful for first-time use because it does not require predefined categories.
After the user reviews the proposed plan and applies selected moves, Themis writes approved category decisions into:
themis_history.jsonl
On the next scan, Themis reads this local history and trains a small Multinomial Naive Bayes classifier from it.
Bayes is used only when it has enough local examples and enough confidence.
Default activation conditions:
minimum approved examples: 3
minimum distinct categories: 2
Bayes confidence threshold: 0.68
If Bayes is confident enough, it proposes the category. If not, Themis falls back to LDA.
If Bayes is not trained:
use LDA
If Bayes is trained but confidence is too low:
use LDA
If Bayes is trained and confidence is high enough:
use Bayes
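The fallback rule above can be written as a small function. This is a sketch under assumptions: the names `Proposal` and `choose_category` are hypothetical, and whether the real comparison against the threshold is inclusive is not stated in this README.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults taken from the activation conditions listed above.
MIN_EXAMPLES = 3
MIN_CATEGORIES = 2
BAYES_THRESHOLD = 0.68


@dataclass
class Proposal:
    model: str        # "bayes" or "lda"
    category: str
    confidence: float


def choose_category(
    lda_label: str,
    lda_confidence: float,
    bayes_label: Optional[str],
    bayes_confidence: float,
    n_examples: int,
    n_categories: int,
    threshold: float = BAYES_THRESHOLD,
) -> Proposal:
    """Pick Bayes only when it is trained and confident; otherwise use LDA."""
    trained = n_examples >= MIN_EXAMPLES and n_categories >= MIN_CATEGORIES
    # Assumption: a confidence equal to the threshold counts as confident enough.
    if trained and bayes_label is not None and bayes_confidence >= threshold:
        return Proposal("bayes", bayes_label, bayes_confidence)
    return Proposal("lda", lda_label, lda_confidence)


print(choose_category("invoice_client_2025", 0.55, "invoices", 0.84, 12, 4).model)
print(choose_category("invoice_client_2025", 0.55, "invoices", 0.84, 2, 2).model)
```

The second call falls back to LDA because only two approved examples exist, even though the Bayes score is high.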
The user can edit categories before applying moves. Manual corrections are important because they become high-quality local training data for Bayes.
The Bayes model is trained locally from the user’s own approved history.
Training examples are stored in:
themis_history.jsonl
This file is written inside the selected target root when moves are applied.
Each applied move records information such as:
{"selected": true, "source": "/old/invoice_alpha.pdf", "destination": "/sorted/invoices/invoice_alpha.pdf", "category": "invoices", "model": "manual", "confidence": 0.8125, "applied_at": "2026-05-22T21:30:00"}
Bayes learns associations between file-name tokens and approved categories.
Example:
invoice, alpha, 2025 -> invoices
meeting, budget -> meetings
holiday, family, photo -> photos
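To make the token-to-category associations concrete, here is a generic Multinomial Naive Bayes sketch with Laplace smoothing, trained on the three examples above. It is not the Themis implementation: the class name `TinyNaiveBayes`, the smoothing value, the choice to ignore unseen tokens, and the use of a normalized posterior as the confidence score are all assumptions.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal Multinomial Naive Bayes over file-name tokens."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                         # Laplace smoothing (assumed value)
        self.class_counts = Counter()              # approved examples per category
        self.token_counts = defaultdict(Counter)   # token counts per category
        self.vocab = set()

    def fit(self, examples):
        for tokens, category in examples:
            self.class_counts[category] += 1
            self.token_counts[category].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        if not self.class_counts:
            raise ValueError("no training examples")
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for category, count in self.class_counts.items():
            denom = sum(self.token_counts[category].values()) + self.alpha * vocab_size
            score = math.log(count / total)  # log prior
            for token in tokens:
                if token in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((self.token_counts[category][token] + self.alpha) / denom)
            log_scores[category] = score
        # Normalize the log scores into a posterior to obtain a confidence in (0, 1].
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


model = TinyNaiveBayes().fit([
    (["invoice", "alpha", "2025"], "invoices"),
    (["meeting", "budget"], "meetings"),
    (["holiday", "family", "photo"], "photos"),
])
label, confidence = model.predict(["invoice", "2025"])
print(label, round(confidence, 2))
```

With only one example per category the confidence stays modest, which is why a threshold and a minimum number of approved examples are useful before Bayes overrides LDA.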
The original LDA-only version could discover patterns, but it could not reliably remember the user’s preferred categories. The Bayes layer adds local personalization:
- the more the user validates, the better Bayes becomes;
- the training data remains on the user’s machine;
- the model adapts to the user’s naming habits;
- no cloud service is required;
- no file contents are read.
Thémis/
    README.md
    run_themis.py
    create.cpp
    tte.cpp
    themis/
        __init__.py
        cli.py
        gui.py
        lda_model.py
        scanner.py
Note: create.cpp is a C++ program used to simulate a chaotic file structure. To change the number of generated files, edit the loop bound in for (int i = 0; i < 100; i++) {. You can compile it on Linux with:
g++ create.cpp -o chaos -std=c++17
Note²: The C++ Tree Exporter (tte.cpp) is especially useful for old hard drives because it can create a local inventory of the disk without reading full file contents or moving anything. It scans the folder tree, records paths and basic metadata into a CSV file, and lets Themis or the user review the structure safely before sorting, cleaning, or migrating data.
g++ tte.cpp -o chaos -std=c++17
Note³: The C++ Rollback Engine (rollback.cpp) restores files moved by Themis by reading the local themis_history.jsonl file. It checks each recorded move, processes the history in reverse order, and moves files back from their destination to their original source path. By default, it runs in dry-run mode for safety, so users can preview the rollback before applying it with --apply.
g++ rollback.cpp -o chaos -std=c++17
Note⁴: The C++ Safe Move Validator (safe_move_validator.cpp) checks a Themis CSV plan before files are moved. It validates sources, destinations, duplicate targets, missing files, invalid parents, long paths, and other risky cases. It does not move or modify files; it only reports warnings and errors.
g++ safe_move_validator.cpp -o safe-move-validator -std=c++17 -O2
Note⁵: The C++ Clean tool (clean_duplicates.cpp) combines an empty-folder scanner with a duplicate-file candidate detector. Empty folder removal is dry-run by default and only happens with --apply-empty-clean; duplicate detection never deletes files. It can produce CSV reports for empty folders and duplicate groups.
g++ clean_duplicates.cpp -o clean-duplicates -std=c++17 -O2
run_themis.py is a convenience launcher for the graphical interface. Equivalent to:
python -m themis gui
themis/__init__.py defines basic package metadata such as application name and version.
themis/lda_model.py contains the lightweight LDA model.
Main responsibilities:
- store topic counts;
- store word-topic counts;
- run Gibbs sampling;
- compute document-topic distributions;
- find the dominant topic of each file;
- generate human-readable topic labels.
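The responsibilities listed above map onto a collapsed Gibbs sampler for LDA. The sketch below is a generic textbook version, not the contents of lda_model.py: the function names, the hyperparameters `alpha` and `beta`, the iteration count, and the "misc" fallback label are assumptions.

```python
import random


def lda_gibbs(docs, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized file names."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # document-topic counts
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word-topic counts
    topic_total = [0] * n_topics                           # topic counts
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = word_id[w], assignments[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    # Smoothed document-topic distributions; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word, vocab


def dominant_topic(theta_row):
    """Index of the most probable topic for one file."""
    return max(range(len(theta_row)), key=theta_row.__getitem__)


def topic_label(word_counts, vocab, top_n=3):
    """Join the most frequent words of a topic into a readable label."""
    ranked = sorted(range(len(vocab)), key=lambda v: word_counts[v], reverse=True)
    words = [vocab[v] for v in ranked[:top_n] if word_counts[v] > 0]
    return "_".join(words) or "misc"


docs = [["invoice", "client"], ["invoice", "alpha"], ["photo", "holiday"], ["photo", "family"]]
theta, topic_word, vocab = lda_gibbs(docs, n_topics=2, iters=50)
for doc, row in zip(docs, theta):
    k = dominant_topic(row)
    print(doc, "->", k, topic_label(topic_word[k], vocab))
```

The exact grouping depends on the random seed and the number of iterations, so the printed labels are indicative only.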
themis/scanner.py contains the file scanning, planning, LDA fallback, Bayes training, and move logic.
Main responsibilities:
- recursively list files;
- ignore hidden files unless requested;
- tokenize file names;
- remove stopwords;
- run LDA when there is no reliable Bayes prediction;
- train local Naive Bayes from approved history;
- choose Bayes categories when confidence is high enough;
- build the proposed move plan;
- create safe destination paths;
- write and read CSV plans;
- apply selected file moves;
- write local training history.
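One way to "create safe destination paths" is to never propose a path that already exists. The sketch below illustrates that idea only; the function name `safe_destination` and the numeric-suffix scheme are assumptions, and the real logic may differ.

```python
from pathlib import Path


def safe_destination(target_root: str, category: str, source: str) -> Path:
    """Return a destination under target_root/category that does not exist yet."""
    folder = Path(target_root) / category
    src = Path(source)
    candidate = folder / src.name
    counter = 1
    # Assumption: on a name clash, append _1, _2, ... before the extension
    # instead of overwriting the existing file.
    while candidate.exists():
        candidate = folder / f"{src.stem}_{counter}{src.suffix}"
        counter += 1
    return candidate


print(safe_destination("Themis_Sorted", "invoices", "/old/invoice_alpha.pdf"))
```

The function only computes a path. It does not create folders or move anything, which keeps it usable during the preview step.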
themis/gui.py contains the Tkinter graphical interface.
Main responsibilities:
- add source folders;
- choose target root;
- select LDA topic count;
- set Bayes confidence threshold;
- run analysis;
- display the proposed plan;
- show whether LDA, Bayes, or manual correction produced the category;
- edit categories efficiently;
- apply one category to multiple selected rows;
- filter and review proposals;
- apply selected moves after confirmation;
- update local Bayes history.
themis/cli.py contains the command-line interface.
Main commands:
python -m themis gui
python -m themis scan
python -m themis apply
python -m themis categories
- Python 3.10 or newer is recommended.
- Tkinter is required for GUI mode.
- No mandatory third-party Python package is required for the current version.
NLTK can be installed to improve stopword handling:
pip install nltk
If NLTK or its stopword corpus is unavailable, Thémis automatically falls back to a built-in stopword list.
To compile standalone executables, install PyInstaller:
pip install pyinstaller
Important: PyInstaller is not a true cross-compiler. Build Windows binaries on Windows, macOS binaries on macOS, and Linux binaries on Linux.
git clone https://github.com/Malwprotector/themis.git
cd themis
If you downloaded a ZIP archive, extract it and open a terminal inside the extracted folder.
python -m venv .venv
Windows PowerShell:
.\.venv\Scripts\Activate.ps1
Windows Command Prompt:
.venv\Scripts\activate.bat
macOS / Linux:
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install nltk
GUI:
python run_themis.py
CLI:
python -m themis --help
Again, this doesn't show new 2.1 version features.
python run_themis.py
The main window opens with the title:
Themis - Guided LDA + Bayes File Sorting
Click:
Add folder
Choose one or more folders containing files to sort.
Use the LDA topics input.
| Number Of Files | Suggested LDA Topics |
|---|---|
| 10–50 | 3–6 |
| 50–500 | 6–12 |
| 500+ | 10–25 |
A higher topic count creates more specific LDA groups. A lower topic count creates broader LDA groups.
This setting mostly matters when Bayes is not trained yet or when Bayes confidence is too low.
Use the Bayes threshold input.
Default:
0.68
A higher value makes Bayes more cautious. A lower value lets Bayes override LDA more often.
You can choose a target root folder. If no target root is selected, Themis creates a default folder inside the first selected directory:
Themis_Sorted
Click:
Analyze
Themis scans file names, runs LDA, loads Bayes history if available, and proposes categories and destinations.
The table shows:
| Column | Meaning |
|---|---|
| Selected | Whether the file will be moved. |
| Model | lda, bayes, or manual. Shows which decision source produced the current category. |
| Category | The category that will be used as Bayes training data after applying moves. |
| Source | Current file path. |
| Destination | Proposed new path. |
| Topic | Numeric LDA topic identifier. |
| Topic Label | LDA-generated label. |
| Confidence | Confidence of the chosen model. |
| Bayes Label | Bayes suggestion, when available. |
| Bayes Confidence | Bayes confidence score. |
| Reason | Explanation based on tokens and model state. |
You can:
- double-click a row to edit its category;
- select multiple rows and apply one category to all of them;
- use the fast category editor in the right panel;
- accept the Bayes suggestion when available;
- filter rows to review a subset of files;
- right-click a row to open the context menu.
Manual category corrections are saved as training examples when moves are applied.
Click:
Apply moves + train Bayes
Confirm the operation. Themis moves only selected files and writes approved categories to themis_history.jsonl.
Ctrl+A select all
Space toggle move flag
Enter edit category
Ctrl+R analyze again
Ctrl+S apply moves
Thémis can also be run through a terminal-based interface written in C++, which offers better performance than the Python version. To use it, compile the program and then run it. The options are the same as in the GUI, but are entered directly in the terminal.
g++ themis.cpp -o themis-cpp -std=c++17
./themis-cpp
python -m themis --help
Expected command structure:
usage: themis [-h] {gui,scan,apply,categories} ...
python -m themis gui
python -m themis scan ~/Downloads ~/Documents --topics 8 --output themis_plan.csv
This scans the folders and writes a CSV plan. It does not move files.
python -m themis scan ~/Downloads --target ~/SortedFiles --topics 6 --output plan.csv
python -m themis scan ~/Downloads --target ~/SortedFiles --bayes-threshold 0.75 --output plan.csv
python -m themis scan ~/Downloads --min-bayes-examples 5 --output plan.csv
python -m themis scan ~/Downloads --no-recursive --output plan.csv
Include Hidden Files
python -m themis scan ~/Downloads --include-hidden --output plan.csv
python -m themis apply plan.csv --target ~/SortedFiles
python -m themis categories --target ~/SortedFiles
Use with caution:
python -m themis scan ~/Downloads --topics 6 --apply
The safer workflow is to generate a CSV, review categories, then apply it.
The generated CSV contains:
selected,source,destination,topic,topic_label,confidence,reason,model,category,bayes_label,bayes_confidence
Example row:
true,/home/user/Downloads/invoice_alpha.pdf,/home/user/SortedFiles/invoices/invoice_alpha.pdf,1,invoice_alpha,0.8462,"Bayes used approved history: 12 examples, 4 categories. Tokens: invoice, alpha",bayes,invoices,invoices,0.8462
Before applying a plan, you may edit:
- selected: set to true or false;
- destination: change the destination path;
- category: change the category that Bayes should learn;
- other columns are mostly informational and should normally remain unchanged.
The category field is especially important. It is the label used to train Bayes after the plan is applied.
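Reviewing a plan programmatically can look like the sketch below, which parses the CSV columns listed above with Python's standard csv module and keeps only the rows marked for moving. The helper names `read_plan` and `selected_moves` are hypothetical, and the second sample row is made up for illustration.

```python
import csv
import io


def read_plan(fileobj):
    """Parse a Themis CSV plan into dictionaries, with 'selected' as a bool."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["selected"] = row["selected"].strip().lower() == "true"
        rows.append(row)
    return rows


def selected_moves(rows):
    """Return (source, destination, category) for rows that will be moved."""
    return [(r["source"], r["destination"], r["category"]) for r in rows if r["selected"]]


# First data row: the example row from this README.
# Second data row: hypothetical, added to show a deselected file.
SAMPLE = '''selected,source,destination,topic,topic_label,confidence,reason,model,category,bayes_label,bayes_confidence
true,/home/user/Downloads/invoice_alpha.pdf,/home/user/SortedFiles/invoices/invoice_alpha.pdf,1,invoice_alpha,0.8462,"Bayes used approved history: 12 examples, 4 categories. Tokens: invoice, alpha",bayes,invoices,invoices,0.8462
false,/home/user/Downloads/notes.txt,/home/user/SortedFiles/meetings/notes.txt,2,meeting_budget,0.41,"Hypothetical row",lda,meetings,,0.0
'''

rows = read_plan(io.StringIO(SAMPLE))
for source, destination, category in selected_moves(rows):
    print(category, ":", source, "->", destination)
```

Using csv.DictReader matters here because the reason column contains commas inside quotes, which a plain split on commas would break.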
Applied moves are logged in:
themis_history.jsonl
Each line is a JSON object containing the source, destination, category, model, confidence, Bayes information, and timestamp.
Example:
{"selected": true, "source": "/old/file.pdf", "destination": "/new/invoices/file.pdf", "topic": 1, "topic_label": "invoice_client", "confidence": 0.8125, "reason": "Manual category correction. This row will train Bayes after Apply.", "model": "manual", "category": "invoices", "bayes_label": "invoices", "bayes_confidence": 0.74, "applied_at": "2026-05-22T21:30:00"}
The history file is used as local training data for Bayes during later scans.
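Because each history line is a standalone JSON object, the file can be read line by line with the standard json module. The sketch below shows one way to turn it into (source, category) training pairs; the function name `load_training_examples` and the decision to skip malformed or incomplete lines silently are assumptions.

```python
import json


def load_training_examples(lines):
    """Collect (source, category) pairs from JSONL history lines."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: damaged lines are ignored rather than fatal
        if not isinstance(record, dict):
            continue
        source = record.get("source")
        category = record.get("category")
        if source and category:
            examples.append((source, category))
    return examples


history = [
    '{"source": "/old/file.pdf", "destination": "/new/invoices/file.pdf", "category": "invoices", "model": "manual"}',
    "",
    "not valid json",
    '{"source": "/old/untagged.txt"}',
]
print(load_training_examples(history))
```

The same function accepts an open file object, for example `load_training_examples(open("themis_history.jsonl", encoding="utf-8"))`, since iterating over a text file also yields lines.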
This section explains how to package Thémis as a standalone application using PyInstaller.
Install PyInstaller:
pip install pyinstaller
Recommended build modes:
| Mode | Command Option | Description |
|---|---|---|
| One-folder | --onedir | Creates a folder containing the executable and dependencies. Recommended for reliability. |
| One-file | --onefile | Creates a single executable. Easier to distribute, but startup can be slower. |
| Windowed | --windowed | Hides the terminal window for GUI builds. |
| Named app | --name Themis | Sets the output executable or app name. |
Recommended entry point:
run_themis.py
Install Python 3.10 or newer from the official Python website or Microsoft Store.
During installation, enable:
Add Python to PATH
Go to the project directory:
cd path\to\themis
python -m venv .venv
.\.venv\Scripts\Activate.ps1
If script execution is blocked, run PowerShell as administrator or use:
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
python -m pip install --upgrade pip setuptools wheel
pip install pyinstaller
Optional:
pip install nltk
python run_themis.py
pyinstaller --noconfirm --clean --onedir --windowed --name Themis run_themis.py
Output:
dist\Themis\Themis.exe
pyinstaller --noconfirm --clean --onefile --windowed --name Themis run_themis.py
Output:
dist\Themis.exe
.\dist\Themis\Themis.exe
or for one-file mode:
.\dist\Themis.exe
Install Python 3.10 or newer.
Using Homebrew:
brew install python
cd /path/to/themis
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install pyinstaller
Optional:
pip install nltk
python run_themis.py
pyinstaller --noconfirm --clean --windowed --name Themis run_themis.py
Output:
dist/Themis.app
open dist/Themis.app
pyinstaller --noconfirm --clean --onefile --windowed --name Themis run_themis.py
Output:
dist/Themis
Unsigned macOS applications may be blocked by Gatekeeper. For local testing, you can right-click the app and choose Open.
For public distribution, use proper Apple code signing and notarization.
Debian / Ubuntu:
sudo apt update
sudo apt install python3 python3-venv python3-pip python3-tk binutils
Fedora:
sudo dnf install python3 python3-pip python3-tkinter binutils
Arch Linux:
sudo pacman -S python python-pip tk binutils
cd /path/to/themis
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install pyinstaller
Optional:
pip install nltk
python run_themis.py
pyinstaller --noconfirm --clean --onedir --windowed --name Themis run_themis.py
Output:
dist/Themis/Themis
Run:
./dist/Themis/Themis
pyinstaller --noconfirm --clean --onefile --windowed --name Themis run_themis.py
Output:
dist/Themis
Run:
./dist/Themis
PyInstaller creates binaries, not full native packages. To create Linux packages, use an additional packaging tool after building:
- AppImage: appimagetool
- Debian package: dpkg-deb, fpm, or Debian packaging tools
- RPM package: rpmbuild or fpm
Keep the AGPL license file and corresponding source code with any distributed package.
python run_themis.py
python -m themis scan ./test_files --topics 5 --output plan.csv
python -m themis apply plan.csv
Install Tkinter:
sudo apt install python3-tk
Use:
python -m PyInstaller --version
or reinstall:
pip install --upgrade pyinstaller
Build separately on each target operating system. Do not expect a Windows executable built on Linux to work as a native Windows build.
Possible causes:
- not enough approved history yet;
- fewer than two distinct categories in history;
- Bayes confidence is below the threshold;
- file names are too short or too generic;
- the target root does not point to the history file used previously.
Try:
- applying a few manually corrected categories;
- using clearer category names;
- lowering --bayes-threshold slightly;
- making sure the same target root is used across scans.
Try:
- correcting categories manually and applying moves to train Bayes;
- reducing the number of LDA topics;
- increasing the number of LDA topics;
- renaming unclear files before scanning;
- sorting a smaller folder first;
- using more descriptive file names.
Possible causes:
- missing permissions;
- file currently open in another program;
- source file removed after plan creation;
- destination drive unavailable;
- synchronized folder conflict.
- File classification is based mainly on file names.
- Very short or generic names are difficult to classify.
- LDA works better when there are enough files and repeated naming patterns.
- Bayes requires approved local history before it can outperform LDA.
- Bayes quality depends on the quality and consistency of user-approved categories.
- The current Bayes training is history-based, not a persistent binary model file.
- The software does not yet provide a full rollback button, although moves are logged.
- Packaging and signing must be handled separately for production distribution.
Possible future improvements:
- full undo / rollback interface;
- history deduplication and history cleanup tools;
- explicit Bayes training dashboard;
- training statistics per category;
- drag-and-drop directory selection;
- richer preview tree;
- extension-aware sorting rules;
- date-aware sorting rules;
- file metadata support;
- duplicate detection;
- export to JSON and YAML;
- saved sorting profiles;
- optional advanced NLP backend;
- signed installers for Windows and macOS;
- AppImage, .deb, and .rpm builds for Linux.
Thémis is licensed under:
GNU Affero General Public License
Thémis is provided without warranty. Use it carefully, especially on important folders. Always review the proposed plan before applying file moves.
The license summary in this README is provided for convenience and is not legal advice. The actual license text controls.

