A hairy, tangled pile of experimental code to demonstrate various approaches to document classification.
Given a giant pile of Reddit posts tagged with flair text, this project demonstrates several ways of categorizing and classifying those posts using different AI tools: Natural Language Processing, Machine Learning, and LLMs. Currently, four approaches are implemented:
- Train a simple word-based classifier
- Generate vector embeddings and match posts to "proximate" topics
- Generate vector embeddings and search for clusters of similar posts
- Describe the desired categories in an LLM prompt, and ask it to categorize each post
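The embedding-based approaches above boil down to nearest-neighbor matching in vector space. A minimal sketch of the idea, with illustrative names (the real project generates embeddings with multiple models and stores them in Postgres):

```typescript
// Illustrative: match a post's embedding to the nearest label embedding
// using cosine similarity. Vectors here are stand-ins for real model output.
type Embedded = { name: string; vector: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function closestLabel(post: number[], labels: Embedded[]): string {
  let best = labels[0].name;
  let bestScore = -Infinity;
  for (const label of labels) {
    const score = cosineSimilarity(post, label.vector);
    if (score > bestScore) {
      bestScore = score;
      best = label.name;
    }
  }
  return best;
}
```

Clustering works on the same vectors but skips the labels entirely, grouping posts that sit near each other and leaving the naming of those groups to a human (or an LLM).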
No claim is made that this code is good, just that it's serviceable. Our goal is a rough comparison of the output quality, resource demands, and approachability of each technique for technically adept users who aren't actual ML/AI specialists.
No, seriously. It's bad code.
The project looks for several local environment variables when accessing AI API providers. `env.ANTHROPIC_API_KEY`, `env.GOOGLE_API_KEY`, and `env.OPENAI_API_KEY` will all be respected. If they aren't set, AI models from those providers won't be usable in the tests.
If you're using Ollama on a separate machine, setting `env.OLLAMA_HOST` will get things wired up.
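In a Node project, a check like this can report which providers are usable before the tests run. The function and return shape here are illustrative, not the project's actual code; only the variable names come from this README:

```typescript
// Illustrative: return the list of providers whose API keys are present.
// The provider names and the shape of this helper are assumptions.
function enabledProviders(env: Record<string, string | undefined>): string[] {
  const providers: Record<string, string | undefined> = {
    anthropic: env.ANTHROPIC_API_KEY,
    google: env.GOOGLE_API_KEY,
    openai: env.OPENAI_API_KEY,
  };
  return Object.entries(providers)
    .filter(([, key]) => key !== undefined && key !== "")
    .map(([name]) => name);
}

// Typical usage: pass process.env and skip tests for missing providers.
const available = enabledProviders(process.env);
```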
Finally, data is stored in a Postgres database. `env.POSTGRES_URL` can be set to control the server used; if you want to spin something up with Docker, a `docker-compose.yaml` file has been included. If no server is set, the project falls back to PGLite, an embedded, SQLite-like implementation of Postgres that stores its data on the local filesystem. It's slower, but it will get you there.
These scripts are how we ran our tests over and over as we iterated on our process and gathered our data. They're still dirty, but we'll be doing a bit of cleanup shortly.
- `db:setup`: Sets up a fresh Postgres database and imports the example Reddit posts, tags, and list of models used in testing.
- `test:ollama`: Checks whether Ollama is installed and the necessary AI models have been downloaded.
- `vector:embed`: Generates vector embeddings for every post and label using multiple models.
- `vector:locate`: Uses vector embeddings and simple distance calculations to find the "closest" label for each post.
- `vector:docsearch`: Specialized version of `vector:locate` for the MixedBread and Nomic-Embed models.
- `vector:cluster`: Uses kmeans or dbscan clustering to suggest new post categories.
- `vector:project`: Generates 2D projections of high-dimensionality embeddings for visualization.
- `prompt:categorize`: Uses a standardized system prompt to classify individual posts.
- `prompt:describe-clusters`: Uses an LLM to generate descriptions of the clusters created by `vector:cluster`.
- `nlp:wordcloud`: Generates data for the ObservableHQ wordcloud visualization.
- `report:cluster`: Generates JSON data files for the Vega-Lite embedding visualization.
- `report:movement`: Generates JSON data files for the Vega-Lite old/new category movement visualizations.
- `report:accuracy`: Displays a simple histogram of 'known' tags and proposed tags. This data is used for the Vega-Lite Model Accuracy chart.
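The accuracy comparison at the end of that pipeline is conceptually just a tally of known tags against proposed tags. A sketch of that kind of tally, with assumed data shapes (the project's actual schema and output format are not shown here):

```typescript
// Illustrative: count, per known tag, how often the proposed tag matched.
// The Prediction shape is an assumption for this sketch.
type Prediction = { known: string; proposed: string };

function accuracyHistogram(
  predictions: Prediction[],
): Map<string, { hits: number; total: number }> {
  const byTag = new Map<string, { hits: number; total: number }>();
  for (const { known, proposed } of predictions) {
    const bucket = byTag.get(known) ?? { hits: 0, total: 0 };
    bucket.total += 1;
    if (proposed === known) bucket.hits += 1;
    byTag.set(known, bucket);
  }
  return byTag;
}
```

A tally like this can then be dumped as JSON and fed to a charting tool such as Vega-Lite.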