A hairy, tangled pile of experimental code to demonstrate various approaches to document classification.
Given a giant pile of Reddit posts tagged with flair text, this project demonstrates several ways of categorizing and classifying those posts using different AI tools: Natural Language Processing, Machine Learning, and LLMs. Currently, four approaches are implemented:
- Train a simple word-based classifier
- Generate vector embeddings and match posts to "proximate" topics
- Generate vector embeddings and search for clusters of similar posts
- Describe the desired categories in an LLM prompt, and ask it to categorize each post
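The embedding-based approaches above boil down to nearest-neighbor matching in vector space. A minimal sketch of the idea, with illustrative names (the real project generates embeddings with multiple models and stores them in Postgres):

```typescript
// Illustrative: match a post's embedding to the nearest label embedding
// using cosine similarity. Vectors here are stand-ins for real model output.
type Embedded = { name: string; vector: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function closestLabel(post: number[], labels: Embedded[]): string {
  let best = labels[0].name;
  let bestScore = -Infinity;
  for (const label of labels) {
    const score = cosineSimilarity(post, label.vector);
    if (score > bestScore) {
      bestScore = score;
      best = label.name;
    }
  }
  return best;
}
```

Clustering works on the same vectors but skips the labels entirely, grouping posts that sit near each other and leaving the naming of those groups to a human (or an LLM).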
No claim is made that this code is good, just that it's serviceable. Our goal is a rough comparison of the output quality, resource demands, and approachability of each technique for technically adept users who aren't actual ML/AI specialists.
No, seriously. It's bad code.
The project looks for several local environment variables when accessing AI API providers. `env.ANTHROPIC_API_KEY`, `env.GOOGLE_API_KEY`, and `env.OPENAI_API_KEY` will all be respected. If they aren't set, AI models from those providers won't be usable in the tests.
If you're using Ollama on a separate machine, setting `env.OLLAMA_HOST` will get things wired up.
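In a Node project, a check like this can report which providers are usable before the tests run. The function and return shape here are illustrative, not the project's actual code; only the variable names come from this README:

```typescript
// Illustrative: return the list of providers whose API keys are present.
// The provider names and the shape of this helper are assumptions.
function enabledProviders(env: Record<string, string | undefined>): string[] {
  const providers: Record<string, string | undefined> = {
    anthropic: env.ANTHROPIC_API_KEY,
    google: env.GOOGLE_API_KEY,
    openai: env.OPENAI_API_KEY,
  };
  return Object.entries(providers)
    .filter(([, key]) => key !== undefined && key !== "")
    .map(([name]) => name);
}

// Typical usage: pass process.env and skip tests for missing providers.
const available = enabledProviders(process.env);
```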
Finally, data is stored in a Postgres database. `env.POSTGRES_URL` can be set to control the server used; if you want to spin something up with Docker, a `docker-compose.yaml` file has been included. If no server is set, the project falls back to PGLite, an embedded, SQLite-like implementation of Postgres that stores its data on the local filesystem. It's slower, but it will get you there.
These scripts are how we ran our tests over and over as we iterated on our process and gathered our data. They're still dirty, but we'll be doing a bit of cleanup shortly.
- `db:setup`: Sets up a fresh Postgres database and imports the example Reddit posts, tags, and list of models used in testing.
- `test:ollama`: Checks whether Ollama is installed and the necessary AI models have been downloaded.
- `vector:embed`: Generates vector embeddings for every post and label using multiple models.
- `vector:locate`: Uses vector embeddings and simple distance calculations to find the "closest" label for each post.
- `vector:docsearch`: Specialized version of `vector:locate` for the MixedBread and Nomic-Embed models.
- `vector:cluster`: Uses kmeans or dbscan clustering to suggest new post categories.
- `vector:project`: Generates 2D projections of high-dimensionality embeddings for visualization.
- `prompt:categorize`: Uses a standardized system prompt to classify individual posts.
- `prompt:describe-clusters`: Uses an LLM to generate descriptions of the clusters created by `vector:cluster`.
- `nlp:wordcloud`: Generates data for the ObservableHQ wordcloud visualization.
- `report:cluster`: Generates JSON data files for the Vega-Lite embedding visualization.
- `report:movement`: Generates JSON data files for the Vega-Lite old/new category movement visualizations.
- `report:accuracy`: Displays a simple histogram of 'known' tags and proposed tags. This data is used for the Vega-Lite Model Accuracy chart.
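The accuracy comparison at the end of that pipeline is conceptually just a tally of known tags against proposed tags. A sketch of that kind of tally, with assumed data shapes (the project's actual schema and output format are not shown here):

```typescript
// Illustrative: count, per known tag, how often the proposed tag matched.
// The Prediction shape is an assumption for this sketch.
type Prediction = { known: string; proposed: string };

function accuracyHistogram(
  predictions: Prediction[],
): Map<string, { hits: number; total: number }> {
  const byTag = new Map<string, { hits: number; total: number }>();
  for (const { known, proposed } of predictions) {
    const bucket = byTag.get(known) ?? { hits: 0, total: 0 };
    bucket.total += 1;
    if (proposed === known) bucket.hits += 1;
    byTag.set(known, bucket);
  }
  return byTag;
}
```

A tally like this can then be dumped as JSON and fed to a charting tool such as Vega-Lite.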