Pro-GenAI/ShortLang

ShortLang: Compressed Text for efficient LLMs

The future of text representation and processing

AI | LLMs | Python | License: CC BY 4.0

Overview

ShortLang is a framework for minimal-length, meaning-preserving text representation, designed to make language model reasoning and training more efficient and to reduce storage requirements. It compresses natural language into a concise symbolic form while retaining core meaning, as measured by embedding similarity.
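To make "retaining core meaning as measured by embedding similarity" concrete, here is a minimal sketch of a cosine-similarity check between two embedding vectors. The function names and the 0.85 threshold are illustrative assumptions, not values taken from this repository, and the vectors would come from whichever embedding model you use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as sharing no meaning.
        return 0.0
    return dot / (norm_a * norm_b)


def retains_meaning(
    original_vec: Sequence[float],
    compressed_vec: Sequence[float],
    threshold: float = 0.85,  # hypothetical cutoff, tune for your embedding model
) -> bool:
    """Return True if the compressed text's embedding stays close to the original's."""
    return cosine_similarity(original_vec, compressed_vec) >= threshold
```

A score near 1.0 means the compressed text points in almost the same direction as the original in embedding space; a suitable threshold depends on the embedding model and the task.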

Features

  • Rule-Based Compression: Deterministic methods to remove stopwords, abbreviate entities, and eliminate redundancy.
  • Model-Based Compression: Uses fine-tuned language models for nuanced semantic compression.
  • Hybrid Approach: Combines rule-based preprocessing with model-based compression.
  • Embedding Validation: Objective assessment of semantic retention using cosine similarity.
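As a rough illustration of the rule-based step described above, the sketch below abbreviates a few known phrases, drops stopwords, and removes immediately repeated words. The stopword set, the abbreviation table, and the function name `rule_based_compress` are assumptions made for this example; they are not the rules shipped in `short_lang/shortlang.py`.

```python
import re

# Illustrative stopword list (assumption); a real system would use a fuller set.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "was", "were", "of", "to", "in",
    "on", "and", "that", "this", "it", "for", "with", "as", "by", "be",
})

# Illustrative entity abbreviations (assumption).
ABBREVIATIONS = {
    "artificial intelligence": "AI",
    "large language model": "LLM",
    "united states": "US",
}


def rule_based_compress(text: str) -> str:
    """Deterministically shorten text: abbreviate, drop stopwords, dedupe repeats."""
    # 1. Abbreviate known multi-word entities before tokenizing.
    for phrase, short in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(phrase)}\b", short, text, flags=re.IGNORECASE)

    # 2. Tokenize into words; punctuation is discarded in this simple sketch.
    tokens = re.findall(r"[A-Za-z0-9']+", text)

    kept = []
    for tok in tokens:
        # 3. Remove stopwords.
        if tok.lower() in STOPWORDS:
            continue
        # 4. Eliminate immediate repetition ("very very" becomes "very").
        if kept and kept[-1].lower() == tok.lower():
            continue
        kept.append(tok)
    return " ".join(kept)


if __name__ == "__main__":
    sample = "The artificial intelligence model is very very fast"
    print(rule_based_compress(sample))
```

Because every step is deterministic, the same input always yields the same output, which makes this stage easy to test and a natural preprocessing pass before a model-based compressor in the hybrid approach.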

Installation

  1. Clone the repository:

    git clone https://github.com/Pro-GenAI/ShortLang.git
    cd ShortLang

  2. Install dependencies:

    pip install -r requirements.txt

  3. Set up environment variables in ".env" based on ".env.example".

Usage

Run "short_lang/shortlang.py" from the repository root, for example with "python short_lang/shortlang.py".

Applications

  • Reasoning Optimization
  • Training Data Compression
  • Efficient Chunking for Vector Embedding
  • Vector Database Storage and Retrieval
  • Multi-Agent and Multi-Step Systems