Data docs (#1418)
levkk committed Apr 24, 2024
1 parent 2123430 commit 82dc23f
Showing 36 changed files with 734 additions and 253 deletions.
4 changes: 4 additions & 0 deletions packages/pgml-rds-proxy/ec2/.gitignore
@@ -0,0 +1,4 @@
.terraform
*.lock.hcl
*.tfstate
*.tfstate.backup
7 changes: 7 additions & 0 deletions packages/pgml-rds-proxy/ec2/README.md
@@ -0,0 +1,7 @@
# Terraform configuration for pgml-rds-proxy on EC2

This is a sample Terraform deployment for running pgml-rds-proxy on EC2. It spins up an EC2 instance
with a public IP and a working security group, and installs the community Docker runtime.

Once the instance is running, you can connect to it using the root key and run the pgml-rds-proxy Docker container
with the correct PostgresML `DATABASE_URL`.
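Before passing a `DATABASE_URL` to the container, it can help to check that the URL is well formed. The following is a minimal sketch using only the Python standard library; the function name `check_database_url` and the example host are hypothetical and not part of pgml-rds-proxy.

```python
from urllib.parse import urlsplit


def check_database_url(url: str) -> dict:
    """Split a PostgreSQL connection URL and reject obviously incomplete ones."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("missing host")
    database = parts.path.lstrip("/")
    if not database:
        raise ValueError("missing database name")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # fall back to PostgreSQL's default port
        "user": parts.username,
        "database": database,
    }
```

Calling `check_database_url("postgres://user:secret@db.example.com:6432/pgml_db")` returns the host, port, user and database as a dictionary, and raises `ValueError` when the scheme, host or database name is missing.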
84 changes: 84 additions & 0 deletions packages/pgml-rds-proxy/ec2/ec2-deployment.tf
@@ -0,0 +1,84 @@
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      version = "~> 5.46"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "us-west-2"
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_security_group" "pgml-rds-proxy" {
  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  ingress {
    from_port = 6432
    to_port = 6432
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  ingress {
    from_port = 22
    to_port = 22
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }
}

resource "aws_instance" "pgml-rds-proxy" {
  ami = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
  key_name = var.root_key

  root_block_device {
    volume_size = 30
    delete_on_termination = true
  }

  vpc_security_group_ids = [
    aws_security_group.pgml-rds-proxy.id,
  ]

  associate_public_ip_address = true
  user_data = file("${path.module}/user_data.sh")
  user_data_replace_on_change = false

  tags = {
    Name = "pgml-rds-proxy"
  }
}

variable "root_key" {
  type = string
  description = "The name of the SSH Root Key you'd like to assign to this EC2 instance. Make sure it's a key you have access to."
}
21 changes: 21 additions & 0 deletions packages/pgml-rds-proxy/ec2/user_data.sh
@@ -0,0 +1,21 @@
#!/bin/bash
#
# Cloud init script to install Docker on an EC2 instance running Ubuntu 22.04.
#

sudo apt-get update
# -y keeps the install non-interactive, since cloud-init runs without a terminal.
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# The docker-ce package usually creates this group already; -f makes that case a no-op.
sudo groupadd -f docker
sudo usermod -aG docker ubuntu
1 change: 1 addition & 0 deletions pgml-cms/.gitignore
@@ -0,0 +1 @@
*.md.bak
Binary file modified pgml-cms/docs/.gitbook/assets/architecture.png
Binary file added pgml-cms/docs/.gitbook/assets/fdw_1.png
Binary file added pgml-cms/docs/.gitbook/assets/vpc_1.png
58 changes: 35 additions & 23 deletions pgml-cms/docs/README.md
@@ -4,38 +4,50 @@ description: The key concepts that make up PostgresML.

# Overview

PostgresML is a complete MLOps platform built on PostgreSQL.
PostgresML is a complete MLOps platform built on PostgreSQL. Our operating principle is:

> _Move the models to the database, rather than continuously moving the data to the models._
> _Move the models to the database, rather than constantly moving the data to the models._
The data for ML & AI systems is inherently larger and more dynamic than the models. It's more efficient, manageable and reliable to move the models to the database, rather than continuously moving the data to the models. PostgresML allows you to take advantage of the fundamental relationship between data and models, by extending the database with the following capabilities and goals:
The data for ML & AI systems is inherently larger and more dynamic than the models. It's more efficient, manageable and reliable to move the models to the database, rather than continuously moving data to the models.

* **Model Serving** - _**GPU accelerated**_ inference engine for interactive applications, with no additional networking latency or reliability costs.
* **Model Store** - Download _**open-source**_ models including state of the art LLMs from HuggingFace, and track changes in performance between versions.
* **Model Training** - Train models with _**your application data**_ using more than 50 algorithms for regression, classification or clustering tasks. Fine tune pre-trained models like LLaMA and BERT to improve performance.
* **Feature Store** - _**Scalable**_ access to model inputs, including vector, text, categorical, and numeric data. Vector database, text search, knowledge graph and application data all in one _**low-latency**_ system.
## AI engine

<figure><img src=".gitbook/assets/ml_system.svg" alt="Machine Learning Infrastructure (2.0) by a16z"><figcaption><p>PostgresML handles all of the functions typically performed by a cacophony of services, <a href="https://a16z.com/emerging-architectures-for-modern-data-infrastructure/">described by a16z</a></p></figcaption></figure>
PostgresML allows you to take advantage of the fundamental relationship between data and models, by extending the database with the following capabilities:

These capabilities are primarily provided by two open-source software projects, that may be used independently, but are designed to be used with the rest of the Postgres ecosystem, including trusted extensions like pgvector and pg\_partman.
* **Model Serving** - GPU accelerated inference engine for interactive applications, with no additional networking latency or reliability costs
* **Model Store** - Access to open-source models including state of the art LLMs from HuggingFace, and track changes in performance between versions
* **Model Training** - Train models with your application data using more than 50 algorithms for regression, classification or clustering tasks; fine tune pre-trained models like LLaMA and BERT to improve performance
* **Feature Store** - Scalable access to model inputs, including vector, text, categorical, and numeric data: vector database, text search, knowledge graph and application data all in one low-latency system

* **pgml** is an open source extension for PostgreSQL. It adds support for GPUs and the latest ML & AI algorithms _**inside**_ the database with a SQL API and no additional infrastructure, networking latency, or reliability costs.
* **PgCat** is an open source proxy pooler for PostgreSQL. It abstracts the scalability and reliability concerns of managing a distributed cluster of Postgres databases. Client applications connect only to the proxy, which handles load balancing and failover, _**outside**_ of any single database.
<figure><img src=".gitbook/assets/ml_system.svg" alt="Machine Learning Infrastructure (2.0) by a16z"><figcaption class="mt-2"><p>PostgresML handles all of the functions <a href="https://a16z.com/emerging-architectures-for-modern-data-infrastructure/">described by a16z</a></p></figcaption></figure>

<figure><img src=".gitbook/assets/architecture.png" alt="PostgresML architectural diagram" width="275"><figcaption><p>A PostgresML deployment at scale</p></figcaption></figure>
These capabilities are primarily provided by two open-source software projects, that may be used independently, but are designed to be used with the rest of the Postgres ecosystem:

In addition, PostgresML provides [native language SDKs](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml) to implement best practices for common ML & AI applications. The JavaScript and Python SDKs are generated from the core Rust SDK, to provide the same API, correctness and efficiency across all application runtimes.
* **pgml** - an open source extension for PostgreSQL. It adds support for GPUs and the latest ML & AI algorithms _inside_ the database with a SQL API and no additional infrastructure, networking latency, or reliability costs
* **PgCat** - an open source pooler for PostgreSQL. It abstracts the scalability and reliability concerns of managing a distributed cluster of Postgres databases. Client applications connect only to the pooler, which handles load balancing, sharding, and failover, outside of any single database server.

SDK clients can perform advanced machine learning tasks in a single SQL request, without having to transfer additional data, models, hardware or dependencies to the client application. For example:
<figure><img src=".gitbook/assets/architecture.png" alt="PostgresML architectural diagram"><figcaption></figcaption></figure>

* Chat with streaming response support from the latest LLMs
* Search with both keywords and embedding vectors
* Text Generation with RAG in a single request
* Translate text between hundreds of language pairs
* Summarization to distil complex documents
* Forecasting timeseries data for key metrics with complex metadata
* Fraud and anomaly detection with application data
## Client SDK

Our goal is to provide access to Open Source AI for everyone. PostgresML is under continuous development to keep up with the rapidly evolving use cases for ML & AI, and we release non breaking changes with minor version updates in accordance with SemVer. We welcome contributions to our [open source code and documentation](https://github.com/postgresml).
The PostgresML team also provides [native language SDKs](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml) which implement best practices for common ML & AI applications. The JavaScript and Python SDKs are generated from a core Rust library, which provides a uniform API, correctness and efficiency across all environments.

We can host your AI database in our cloud, or you can run our Docker image locally with PostgreSQL, pgml, pgvector and NVIDIA drivers included.
While using the SDK is completely optional, SDK clients can perform advanced machine learning tasks in a single SQL request, without having to transfer additional data, models, hardware or dependencies to the client application.

Use cases include:

* Chat with streaming responses from state-of-the-art open source LLMs
* Semantic search with keywords and embeddings
* RAG in a single request without using any third-party services
* Text translation between hundreds of languages
* Text summarization to distill complex documents
* Forecasting timeseries data for key metrics with complex metadata
* Anomaly detection using application data

## Our mission

PostgresML strives to provide access to open source AI for everyone. We are continuously developing PostgresML to keep up with the rapidly evolving use cases for ML & AI, but we remain committed to never breaking user-facing APIs. We welcome contributions to our [open source code and documentation](https://github.com/postgresml) from the community.

## Managed cloud

While our extension and pooler are open source, we also offer a managed cloud database service for production deployments of PostgresML. You can [sign up](https://postgresml.org/signup) for an account and get a free Serverless database in seconds.
12 changes: 7 additions & 5 deletions pgml-cms/docs/SUMMARY.md
@@ -6,9 +6,11 @@
* [Getting Started](introduction/getting-started/README.md)
* [Create your database](introduction/getting-started/create-your-database.md)
* [Connect your app](introduction/getting-started/connect-your-app.md)
* [Import your data](introduction/getting-started/import-your-data/README.md)
* [CSV](introduction/getting-started/import-your-data/csv.md)
* [Foreign Data Wrapper](introduction/getting-started/import-your-data/foreign-data-wrapper.md)
* [Import your data](introduction/getting-started/import-your-data/README.md)
* [Logical replication](introduction/getting-started/import-your-data/logical-replication/README.md)
* [Foreign Data Wrappers](introduction/getting-started/import-your-data/foreign-data-wrappers.md)
* [Move data with COPY](introduction/getting-started/import-your-data/copy.md)
* [Migrate with pg_dump](introduction/getting-started/import-your-data/pg-dump.md)

## API

@@ -51,7 +53,7 @@
## Product

* [Cloud Database](product/cloud-database/README.md)
* [Serverless databases](product/cloud-database/serverless-databases.md)
* [Serverless](product/cloud-database/serverless.md)
* [Dedicated](product/cloud-database/dedicated.md)
* [Enterprise](product/cloud-database/plans.md)
* [Vector Database](product/vector-database.md)
@@ -79,7 +81,7 @@
## Resources

* [FAQs](resources/faqs.md)
* [Data Storage & Retrieval](resources/data-storage-and-retrieval/README.md)
* [Data Storage & Retrieval](resources/data-storage-and-retrieval/tabular-data.md)
* [Tabular data](resources/data-storage-and-retrieval/tabular-data.md)
* [Documents](resources/data-storage-and-retrieval/documents.md)
* [Partitioning](resources/data-storage-and-retrieval/partitioning.md)
2 changes: 1 addition & 1 deletion pgml-cms/docs/api/client-sdk/README.md
@@ -1,4 +1,4 @@
# Client SDKs
# Client SDK

### Key Features

4 changes: 2 additions & 2 deletions pgml-cms/docs/api/client-sdk/getting-started.md
@@ -18,7 +18,7 @@ pip install pgml

## Example

Once the SDK is installed, you an use the following example to get started.
Once the SDK is installed, you can use the following example to get started.

### Create a collection

@@ -85,7 +85,7 @@ await collection.add_pipeline(pipeline)
{% endtab %}
{% endtabs %}

#### Explanation:
#### Explanation

* The code constructs a pipeline called `"sample_pipeline"` and adds it to the collection we initialized above. This pipeline automatically generates chunks and embeddings for the `text` key for every upserted document.

12 changes: 6 additions & 6 deletions pgml-cms/docs/introduction/getting-started/README.md
@@ -4,14 +4,14 @@ description: Setup a database and connect your application to PostgresML

# Getting Started

A PostgresML deployment consists of multiple components working in concert to provide a complete Machine Learning platform. We provide a fully managed solution in our cloud.
A PostgresML deployment consists of multiple components working in concert to provide a complete Machine Learning platform. We provide a fully managed solution in [our cloud](create-your-database), and document a self-hosted installation in [Developer Docs](/docs/resources/developer-docs/quick-start-with-docker).

* A PostgreSQL database, with pgml and pgvector extensions installed, including backups, metrics, logs, replicas and high availability configurations
* A PgCat pooling proxy to provide secure access and model load balancing across tens of thousands of clients
* A web application to manage deployed models and host SQL notebooks
* PostgreSQL database, with `pgml`, `pgvector` and many other extensions installed, including backups, metrics, logs, replicas and high availability
* PgCat pooler to provide secure access and model load balancing across thousands of clients
* A web application to manage deployed models and share experiments and analysis in SQL notebooks

<figure><img src="../../.gitbook/assets/architecture.png" alt=""><figcaption></figcaption></figure>
<figure class="m-3"><img src="../../.gitbook/assets/architecture.png" alt="PostgresML architecture"><figcaption></figcaption></figure>

By building PostgresML on top of a mature database, we get reliable backups for model inputs and proven scalability without reinventing the wheel, so that we can focus on providing access to the latest developments in open source machine learning and artificial intelligence.

This guide will help you get started with a generous free account, that includes access to GPU accelerated models and 5GB of storage, or you can skip to our Developer Docs to see how to run PostgresML locally with our Docker image.
This guide will help you get started with a generous free account, that includes access to GPU accelerated models and 5 GB of storage, or you can skip to our [Developer Docs](/docs/resources/developer-docs/quick-start-with-docker) to see how to run PostgresML locally with our Docker image.
22 changes: 13 additions & 9 deletions pgml-cms/docs/introduction/getting-started/connect-your-app.md
@@ -4,16 +4,16 @@ description: PostgresML is compatible with all standard PostgreSQL clients

# Connect your app

You can connect to your database from any Postgres compatible client. PostgresML is intended to serve in the traditional role of an application database, along with it's extended role as an MLOps platform to make it easy to build and maintain AI applications.
You can connect to your PostgresML database from any PostgreSQL-compatible client. PostgresML can serve in the traditional role of an application database, along with its extended role as an MLOps platform, to make it easy to build and maintain AI applications together with your application data.

## Application SDKs
## Client SDK

We provide client SDKs for JavaScript, Python and Rust apps that manage connections to the Postgres database and make it easy to construct efficient queries for AI use cases, like managing a document collection for RAG, or building a chatbot. All of the ML & AI still happens in the database, with centralized operations, hardware and dependency management.

These SDKs are under rapid development to add new features and use cases, but we release non breaking changes with minor version updates in accordance with SemVer. It's easy to install into your existing application.
We provide a client SDK for JavaScript, Python and Rust. The SDK manages connections to the database, and makes it easy to construct efficient queries for AI use cases, like managing RAG document collections, or building chatbots. All of the ML & AI still happens inside the database, with centralized operations, hardware and dependency management.

### Installation

The SDK is available from npm and PyPI:

{% tabs %}
{% tab title="JavaScript" %}
```bash
@@ -28,8 +28,12 @@ pip install pgml
{% endtab %}
{% endtabs %}

Our SDK comes with zero additional dependencies. The core of the SDK is written in Rust, and we provide language bindings and native packaging & distribution.

### Test the connection

Once you have installed our SDK into your environment, you can test connectivity to our cloud with just a few lines of code:

{% tabs %}
{% tab title="JavaScript" %}
```javascript
@@ -80,9 +84,9 @@ async def main():
{% endtab %}
{% endtabs %}

## Native Language Bindings
## Native PostgreSQL libraries

You can also connect directly to the database with your favorite bindings or ORM:
Using the SDK is completely optional. If you're comfortable with writing SQL, you can connect directly to the database using your favorite PostgreSQL client library or ORM:

* C++: [libpqxx](https://www.tutorialspoint.com/postgresql/postgresql\_c\_cpp.htm)
* C#: [Npgsql](https://github.com/npgsql/npgsql), [Dapper](https://github.com/DapperLib/Dapper), or [Entity Framework Core](https://github.com/dotnet/efcore)
@@ -101,9 +105,9 @@ You can also connect directly to the database with your favorite bindings or ORM
* Rust: [postgres](https://crates.io/crates/postgres), [SQLx](https://github.com/launchbadge/sqlx) or [Diesel](https://github.com/diesel-rs/diesel)
* Swift: [PostgresNIO](https://github.com/vapor/postgres-nio) or [PostgresClientKit](https://github.com/codewinsdotcom/PostgresClientKit)
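
As an illustration of connecting without the SDK, here is a minimal Python sketch. It assumes the third-party `psycopg` (version 3) driver is installed and that the connection string is stored in a `DATABASE_URL` environment variable; both are assumptions made for this example, and the helper names are hypothetical.

```python
import os


def database_url_from_env(env=None, key="DATABASE_URL"):
    """Read the connection string from the environment (variable name is an assumption)."""
    env = os.environ if env is None else env
    url = env.get(key)
    if not url:
        raise RuntimeError(f"{key} is not set")
    return url


def fetch_one(database_url, sql="SELECT 1"):
    """Run a single query and return the first column of the first row."""
    # Imported lazily so the helper above works without the driver installed.
    import psycopg

    with psycopg.connect(database_url) as conn:
        return conn.execute(sql).fetchone()[0]
```

With the driver installed and the variable set, `fetch_one(database_url_from_env())` runs `SELECT 1` as a basic connectivity check; any other SQL statement can be passed as the second argument.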

## SQL Editors
## SQL editors

Use any of these popular tools to execute SQL queries directly against the database:
If you need to write ad-hoc queries, you can use any of these popular tools to execute SQL queries directly on your database:

* [Apache Superset](https://superset.apache.org/)
* [DBeaver](https://dbeaver.io/)