LLM Tools

A comprehensive toolkit for Large Language Models (LLMs) and embedding models, supporting chat, embeddings, and reranking with flexible configuration options and multi-engine support. Whether calls are synchronous or asynchronous, LLM Tools handles AI model integration efficiently.

English | 中文

Features

  • LLM Chat: Interact with large language models (e.g., GPT) with support for streaming output and formatted responses
  • Embedding Models: Support for single and multi-sentence embedding generation for semantic analysis and retrieval
  • Reranking Models: Rank documents by query similarity with support for single and multi-sentence inputs
  • Highly Configurable: Flexibly adjust parameters through YAML configuration files
  • Multi-Engine Support: Support for Azure OpenAI, local models, and various embedding engines
  • Async Support: Provides async interfaces for enhanced performance
  • Memory Management: Built-in chat memory management with customizable history length
  • Response Caching: Optional LLM response caching for improved efficiency

Supported Models

  • LLM Engines: Any model served through an OpenAI-SDK-compatible API
  • Embedding Models: m3e-base, bge-m3, and other embedding models
  • Reranking Models: bge-reranker-large and other reranking models

Installation

Using pip

# Clone the repository
git clone https://github.com/LLMSystems/llm_tools.git
cd llm_tools

# Install the package
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"

Configuration

1. Model Configuration

Create a configs/models.yaml file based on example_configs/models.yaml and configure the model parameters:

params:
    default:
        temperature: 0.2
        max_tokens: 1000
        top_p: 1
        frequency_penalty: 1.4
        presence_penalty: 0

LLM_engines:
    gpt-4o:
        model: "gpt-4o"
        azure_api_base: "your_azure_api_base_url"
        azure_api_key: "your_azure_api_key"
        azure_api_version: "your_azure_api_version"
    Qwen2-7B-Instruct:
        model: "Qwen2-7B-Instruct"
        local_api_key: "Empty"
        local_base_url: "http://localhost:8887/v1"
        translate_to_cht: true  # Optional: Translate to Traditional Chinese

embedding_models:
    m3e-base:
        model: "m3e-base"
        local_api_key: "Empty"
        local_base_url: "http://localhost:8887/v1"

reranking_models:
    bge-reranker-large:
        model: "bge-reranker-large"
        local_api_key: "Empty"
        local_base_url: "http://localhost:8887/v1"

2. Configuration Parameters

  • params.default: Default sampling parameters: temperature, max_tokens, top_p, frequency_penalty, presence_penalty
  • Azure OpenAI: Configure azure_api_base, azure_api_key, azure_api_version (Note: Usage may incur costs)
  • Local Models: Configure local_api_key and local_base_url
  • translate_to_cht: When set to true, automatically translates responses to Traditional Chinese

Quick Start

Basic Chat Usage

from llm_chat import LLMChat

# Initialize LLM chat
llmchat = LLMChat(model="gpt-4o", config_path="./configs/models.yaml")

# Simple chat
response, history = llmchat.chat(query="Hello, how are you?")
print(response)

# Interactive chat with history
history = []
while True:
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break
    response, history = llmchat.chat(query=user_input, history=history)
    print(f"AI: {response}")

Streaming Chat

from llm_chat import LLMChat

llmchat = LLMChat(model="gpt-4o", config_path="./configs/models.yaml")

# Streaming response
for chunk in llmchat.chat(query="Tell me a story", stream=True):
    print(chunk, end="", flush=True)
print()

Chat Memory Management

from llm_chat import LLMChat
from memory import ChatMemory

# Initialize chat memory
system_prompt = "You are a professional assistant who answers questions in Traditional Chinese."
chat_memory = ChatMemory(system_prompt=system_prompt, max_len=1000)

llmchat = LLMChat(model="gpt-4o", config_path="./configs/models.yaml")

# Streaming with memory
while True:
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break
    
    text = ''
    for chunk in llmchat.chat(query=user_input, history=chat_memory.get_history(), stream=True):
        text += chunk
        print(chunk, end="", flush=True)
    print()
    
    chat_memory.add_user_message(user_input)
    chat_memory.add_system_response(text)
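
The ChatMemory interface used above (a system_prompt and max_len in the constructor, plus add_user_message, add_system_response, and get_history) can be pictured roughly as follows. This is a hypothetical sketch, not the library's actual implementation; in particular, it assumes max_len counts messages, which may differ from the real class:

```python
class ChatMemorySketch:
    """Illustrative stand-in for memory.ChatMemory: a system prompt plus a
    rolling message history trimmed to at most max_len messages."""

    def __init__(self, system_prompt="", max_len=1000):
        self.system_prompt = system_prompt
        self.max_len = max_len
        self.messages = []  # list of {"role": ..., "content": ...} dicts

    def _trim(self):
        # Drop the oldest messages once the history exceeds max_len
        if len(self.messages) > self.max_len:
            self.messages = self.messages[-self.max_len:]

    def add_user_message(self, content):
        self.messages.append({"role": "user", "content": content})
        self._trim()

    def add_system_response(self, content):
        self.messages.append({"role": "assistant", "content": content})
        self._trim()

    def get_history(self):
        # System prompt first, then the rolling conversation
        return [{"role": "system", "content": self.system_prompt}] + self.messages

mem = ChatMemorySketch(system_prompt="Be helpful.", max_len=4)
mem.add_user_message("Hi")
mem.add_system_response("Hello!")
print(len(mem.get_history()))  # 3
```

The key design point is the trim step: bounding the history keeps the prompt within the model's context window as conversations grow.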

Embeddings and Reranking

import numpy as np
from embed_rerank_model import EmbeddingModel, RerankingModel

# Embedding generation
embed_model = EmbeddingModel(embedding_model="m3e-base", config_path="./configs/models.yaml")
query_embedding = np.array(embed_model.embed_query("The food is delicious."))
print(f"Embedding shape: {query_embedding.shape}")

# Document embedding
documents = ["The food is great.", "The service is excellent.", "The atmosphere is nice."]
doc_embeddings = embed_model.embed_documents(documents)
print(f"Document embeddings: {len(doc_embeddings)} vectors")

# Document reranking
rerank_model = RerankingModel(reranking_model="bge-reranker-large", config_path="./configs/models.yaml")
query = "Tell me about the food quality"
ranked_docs = rerank_model.rerank_documents(documents, query)
print(f"Reranked documents: {ranked_docs}")
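A related idea is ranking documents directly by embedding similarity, without a dedicated reranker. A minimal pure-Python sketch of cosine-similarity ranking (independent of this toolkit; the vectors below are toy values standing in for embed_query / embed_documents output):

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs, documents):
    # Sort documents by descending similarity to the query vector
    scored = [(cosine_similarity(query_vec, v), d) for v, d in zip(doc_vecs, documents)]
    return [d for _, d in sorted(scored, key=lambda p: p[0], reverse=True)]

# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions
query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.1], [0.0, 1.0]]
docs = ["about food", "about service"]
print(rank_by_similarity(query_vec, doc_vecs, docs))  # ['about food', 'about service']
```

Embedding similarity is cheap and works well for first-pass retrieval; a cross-encoder reranker such as bge-reranker-large is typically more accurate for the final ordering.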

Async Usage

import asyncio
from async_llm_chat import AsyncLLMChat

async def async_chat_example():
    # Initialize async LLM chat
    async_llm = AsyncLLMChat(model="gpt-4o", config_path="./configs/models.yaml")
    
    # Concurrent requests
    async def query_a():
        response, _ = await async_llm.chat(query="What is artificial intelligence?")
        return response
    
    async def query_b():
        response, _ = await async_llm.chat(query="What is machine learning?")
        return response
    
    # Execute concurrently
    responses = await asyncio.gather(query_a(), query_b())
    for i, response in enumerate(responses):
        print(f"Response {i+1}: {response}")

# Run async example
asyncio.run(async_chat_example())
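
When fanning out many concurrent requests, it is often worth capping how many are in flight at once so the backend is not flooded. A minimal sketch using asyncio.Semaphore; fake_chat below is a stand-in for the real async_llm.chat call:

```python
import asyncio

async def fake_chat(query):
    # Stand-in for async_llm.chat(query=...); replace with the real call
    await asyncio.sleep(0.01)
    return f"answer to: {query}"

async def bounded_gather(queries, limit=3):
    # Allow at most `limit` requests in flight at any one time
    sem = asyncio.Semaphore(limit)

    async def one(q):
        async with sem:
            return await fake_chat(q)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(q) for q in queries))

responses = asyncio.run(bounded_gather([f"question {i}" for i in range(5)]))
print(responses[0])  # answer to: question 0
```

A sensible limit depends on the provider's rate limits; start low and raise it until throughput stops improving.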

Additional Features

Response Caching

from async_llm_chat import AsyncLLMChat

# Enable caching
cache_config = {
    'enable': True,
    'cache_file': './cache/llm_cache.json'
}

async_llm = AsyncLLMChat(
    model="gpt-4o", 
    config_path="./configs/models.yaml",
    cache_config=cache_config
)
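
The cache_config keys above (enable, cache_file) come from the example configuration; the library's internal behavior is not documented here, but a JSON-file cache of that shape can be sketched roughly as follows. This is illustrative only, not this library's implementation:

```python
import hashlib
import json
import os

class SimpleLLMCache:
    """Illustrative JSON-file cache keyed by a hash of (model, query)."""

    def __init__(self, cache_file):
        self.cache_file = cache_file
        self.cache = {}
        if os.path.exists(cache_file):
            with open(cache_file, "r", encoding="utf-8") as f:
                self.cache = json.load(f)

    def _key(self, model, query):
        # Hash model and query together so different models don't collide
        return hashlib.sha256(f"{model}\x00{query}".encode("utf-8")).hexdigest()

    def get(self, model, query):
        # Returns the cached response, or None on a cache miss
        return self.cache.get(self._key(model, query))

    def put(self, model, query, response):
        self.cache[self._key(model, query)] = response
        os.makedirs(os.path.dirname(self.cache_file) or ".", exist_ok=True)
        with open(self.cache_file, "w", encoding="utf-8") as f:
            json.dump(self.cache, f, ensure_ascii=False)

cache = SimpleLLMCache("./cache/llm_cache.json")
cache.put("gpt-4o", "Hello", "Hi there!")
print(cache.get("gpt-4o", "Hello"))  # Hi there!
```

Caching like this only pays off for repeated identical queries (e.g., batch evaluation runs); it does nothing for novel user input.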

Project Structure

llm_tools/
├── llm_chat.py              # Synchronous LLM chat functionality
├── async_llm_chat.py        # Asynchronous LLM chat functionality
├── embed_rerank_model.py    # Embedding and reranking models
├── memory.py                # Chat memory management
├── llm_response_cache.py    # Response caching functionality
├── tutorial.py              # Tutorial examples
├── tutorial.ipynb           # Jupyter notebook tutorial
├── example_configs/         # Configuration examples
│   └── models.yaml          # Model configuration template
├── pyproject.toml           # Project configuration
├── README_zh-CN.md          # Chinese README
└── README.md                # This file (English)

Examples and Tutorials

For detailed usage examples, please refer to:

  • tutorial.py - Python script examples
  • tutorial.ipynb - Jupyter notebook with interactive examples

License

This project is licensed under the MIT License.
