diff --git a/pgml-cms/docs/.gitbook/assets/fdw_1.png b/pgml-cms/docs/.gitbook/assets/fdw_1.png index 0dbce2380..c19ed86f6 100644 Binary files a/pgml-cms/docs/.gitbook/assets/fdw_1.png and b/pgml-cms/docs/.gitbook/assets/fdw_1.png differ diff --git a/pgml-cms/docs/.gitbook/assets/logical_replication_1.png b/pgml-cms/docs/.gitbook/assets/logical_replication_1.png index ac15be858..171959b62 100644 Binary files a/pgml-cms/docs/.gitbook/assets/logical_replication_1.png and b/pgml-cms/docs/.gitbook/assets/logical_replication_1.png differ diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index 3588b3f1d..fd87411ad 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -41,7 +41,6 @@ * [Zero-shot Classification](api/sql-extension/pgml.transform/zero-shot-classification.md) * [pgml.tune()](api/sql-extension/pgml.tune.md) * [Client SDK](api/client-sdk/README.md) - * [Overview](api/client-sdk/getting-started.md) * [Collections](api/client-sdk/collections.md) * [Pipelines](api/client-sdk/pipelines.md) * [Vector Search](api/client-sdk/search.md) diff --git a/pgml-cms/docs/api/apis.md b/pgml-cms/docs/api/apis.md index 146d83be8..70f3b1ed0 100644 --- a/pgml-cms/docs/api/apis.md +++ b/pgml-cms/docs/api/apis.md @@ -1,28 +1,47 @@ -# Overview +--- +description: Overview of the PostgresML SQL API and SDK. +--- -## Introduction +# API overview -PostgresML adds extensions to the PostgreSQL database, as well as providing separate Client SDKs in JavaScript and Python that leverage the database to implement common ML & AI use cases. +PostgresML is a PostgreSQL extension that adds SQL functions to the database where it's installed. The functions work with modern machine learning algorithms and the latest open source LLMs while maintaining a stable API signature. They can be used by any application that connects to the database. -The extensions provide all of the ML & AI functionality via SQL APIs, like training and inference. They are designed to be used directly for all ML practitioners who implement dozens of different use cases on their own machine learning models. +In addition to the SQL API, we build and maintain a client SDK for JavaScript, Python and Rust. The SDK uses the same extension functionality to implement common ML & AI use cases, like retrieval-augmented generation (RAG), chatbots, and semantic & hybrid search engines. -We also provide Client SDKs that implement the best practices on top of the SQL APIs, to ease adoption and implement common application use cases in applications, like chatbots or search engines. +Using the SDK is optional, and you can implement the same functionality with standard SQL queries. If you feel more comfortable using a programming language, the SDK can help you get started quickly. -## SQL Extension +## [SQL extension](sql-extension/) -PostgreSQL is designed to be _**extensible**_. This has created a rich open-source ecosystem of additional functionality built around the core project. Some [extensions](https://www.postgresql.org/docs/current/contrib.html) are include in the base Postgres distribution, but others are also available via the [PostgreSQL Extension Network](https://pgxn.org/).\
-There are 2 foundational extensions included in a PostgresML deployment that provide functionality inside the database through SQL APIs.
+The PostgreSQL extension provides all of the ML & AI functionality, like training models and running inference, via SQL functions.
+The functions are designed for ML practitioners who want to train models with dozens of ML algorithms and run real-time inference on live application data. Additionally, the extension provides access to the latest Hugging Face transformers for a wide range of NLP tasks.

-* **pgml** - provides Machine Learning and Artificial Intelligence APIs with access to more than 50 ML algorithms to train classification, clustering and regression models on your own data, or you can perform dozens of tasks with thousands of models downloaded from HuggingFace.
-* **pgvector** - provides indexing and search functionality on vectors, in addition to the traditional application database storage, including JSON and plain text, provided by PostgreSQL.
+### Functions

-Learn more about developing with the [sql-extension](sql-extension/ "mention")
+The following functions are implemented and maintained by the PostgresML extension:

-## Client SDK
+| Function name | Description |
+|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [pgml.embed()](sql-extension/pgml.embed) | Generate embeddings inside the database using open source embedding models from Hugging Face. |
+| [pgml.transform()](sql-extension/pgml.transform/) | Download and run the latest Hugging Face transformer models, like Llama and Mixtral, to perform NLP tasks like text generation, summarization, and sentiment analysis. |
+| [pgml.train()](sql-extension/pgml.train/) | Train a machine learning model on data from a Postgres table or view. Supports XGBoost, LightGBM, CatBoost and all Scikit-learn algorithms. |
+| [pgml.deploy()](sql-extension/pgml.deploy) | Deploy a version of the model created with pgml.train(). |
+| [pgml.predict()](sql-extension/pgml.predict/) | Perform real-time inference using a model trained with pgml.train() on live application data. |
+| [pgml.tune()](sql-extension/pgml.tune) | Run LoRA fine-tuning on an open source model from Hugging Face using data from a Postgres table or view. |

-PostgresML provides a client SDK that streamlines ML & AI use cases in both JavaScript and Python. With this SDK, you can seamlessly manage various database tables related to documents, text chunks, text splitters, LLM (Language Model) models, and embeddings. By leveraging the SDK's capabilities, you can efficiently index LLM embeddings using pgvector with HNSW for fast and accurate queries.
+Together with standard database functionality provided by PostgreSQL, these functions allow you to create and manage the entire life cycle of a machine learning application.
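+
+As a quick taste of the API, here is a minimal sketch of calling two of these functions. The embedding model is a real open source model, but the project name and feature values are hypothetical placeholders:
+
+```postgresql
+-- Generate an embedding for a piece of text with an open source model.
+SELECT pgml.embed('intfloat/e5-small', 'This is a sample document.');
+
+-- Run real-time inference with a model previously trained with pgml.train().
+-- 'my_project' and the feature values are hypothetical.
+SELECT pgml.predict('my_project', ARRAY[0.1, 2.0, 5.0]);
+```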
 
-The SDK delegates all work to the extension running in the database, which minimizes software and hardware dependencies that need to be maintained at the application layer, as well as securing data and models inside the data center. Our SDK minimizes data transfer to maximize performance, efficiency, security and reliability.
+## [Client SDK](client-sdk/)
 
-Learn more about developing with the [client-sdk](client-sdk/ "mention")
+The client SDK implements best practices and common use cases using the PostgresML SQL functions and standard PostgreSQL features. The SDK core is written in Rust, which manages creating and running queries, connection pooling, and error handling.
+For each additional language we support (currently JavaScript and Python), we create and publish language-native bindings. This architecture ensures all programming languages we support have identical APIs and similar performance when interacting with PostgresML.
+
+### Use cases
+
+The SDK currently implements the following use cases:
+
+| Use case | Description |
+|----------|---------|
+| [Collections](client-sdk/collections) | Manage documents, embeddings, full text and vector search indexes, and more, using one simple interface. |
+| [Pipelines](client-sdk/pipelines) | Easily build complex queries to interact with collections using a programmable interface. |
+| [Vector search](client-sdk/search) | Implement semantic search using in-database generated embeddings and ANN vector indexes. |
+| [Document search](client-sdk/document-search) | Implement hybrid full text search using in-database generated embeddings and PostgreSQL tsvector indexes. |
diff --git a/pgml-cms/docs/api/client-sdk/README.md b/pgml-cms/docs/api/client-sdk/README.md
index 01286e9cb..881be3046 100644
--- a/pgml-cms/docs/api/client-sdk/README.md
+++ b/pgml-cms/docs/api/client-sdk/README.md
@@ -1,24 +1,247 @@
+---
+description: The PostgresML client SDK for JavaScript, Python and Rust implements common use cases and manages connections to PostgresML.
+---
+
 # Client SDK
 
-### Key Features
+The client SDK can be installed using standard package managers for JavaScript, Python, and Rust. Since the SDK is written in Rust, the JavaScript and Python packages come with no additional dependencies.
+
+## Installation
+
+Installing the SDK into your project is as simple as:
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```bash
+npm i pgml
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```bash
+pip install pgml
+```
+{% endtab %}
+{% endtabs %}
+
+## Getting started
+
+The SDK uses the database to perform most of its functionality. Before continuing, make sure you've created a [PostgresML database](https://postgresml.org/signup) and have your `DATABASE_URL` connection string handy.
+
+### Connect to PostgresML
+
+The SDK automatically manages connections to PostgresML. The connection string can be specified as an argument to the collection constructor, or as an environment variable.
+
+If your app follows the twelve-factor convention, we recommend you configure the connection in the environment using the `PGML_DATABASE_URL` variable:
+
+```bash
+export PGML_DATABASE_URL=postgres://user:password@sql.cloud.postgresml.org:6432/pgml_database
+```
+
+### Create a collection
+
+The SDK is asynchronous, so it needs to run inside an async runtime. Both Python and JavaScript support async functions natively.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```javascript
+const pgml = require("pgml");
+
+const main = async () => {
+  const collection = pgml.newCollection("sample_collection");
+}
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+from pgml import Collection, Pipeline
+import asyncio
+
+async def main():
+    collection = Collection("sample_collection")
+```
+{% endtab %}
+{% endtabs %}
+
+The above example imports the `pgml` module and creates a collection object. By itself, the collection only tracks document contents and identifiers, but once we add a pipeline, we can instruct the SDK to perform additional tasks when documents are inserted and retrieved.
+
+### Create a pipeline
+
+Continuing the example, we will create a pipeline called `sample_pipeline`, which will use in-database embedding generation to automatically chunk and embed documents:
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```javascript
+// Add this code to the end of the main function from the above example.
+const pipeline = pgml.newPipeline("sample_pipeline", {
+  text: {
+    splitter: { model: "recursive_character" },
+    semantic_search: {
+      model: "intfloat/e5-small",
+    },
+  },
+});
+
+await collection.add_pipeline(pipeline);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Add this code to the end of the main function from the above example.
+pipeline = Pipeline(
+    "sample_pipeline",
+    {
+        "text": {
+            "splitter": { "model": "recursive_character" },
+            "semantic_search": {
+                "model": "intfloat/e5-small",
+            },
+        },
+    },
+)
+
+await collection.add_pipeline(pipeline)
+```
+{% endtab %}
+{% endtabs %}
+
+The pipeline configuration is a key/value object, where the key is the name of a column in a document, and the value is the action the SDK should perform on that column.
+
+In this example, the documents contain a `text` column. We instruct the SDK to split its contents into chunks with the recursive character splitter, and to embed those chunks with the Hugging Face `intfloat/e5-small` embedding model.
+
+### Add documents
+
+Once the pipeline is configured, we can start adding documents:
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```javascript
+// Add this code to the end of the main function from the above example.
+const documents = [
+  {
+    id: "Document One",
+    text: "document one contents...",
+  },
+  {
+    id: "Document Two",
+    text: "document two contents...",
+  },
+];
+
+await collection.upsert_documents(documents);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Add this code to the end of the main function in the above example.
+documents = [
+    {
+        "id": "Document One",
+        "text": "document one contents...",
+    },
+    {
+        "id": "Document Two",
+        "text": "document two contents...",
+    },
+]
+
+await collection.upsert_documents(documents)
+```
+{% endtab %}
+{% endtabs %}
+
+If the same document `id` is used, the SDK computes the difference between existing and new documents and only updates the chunks that have changed.
+
+### Search documents
+
+Now that the documents are stored, chunked and embedded, we can start searching the collection:
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```javascript
+// Add this code to the end of the main function in the above example.
+const results = await collection.vector_search(
+  {
+    query: {
+      fields: {
+        text: {
+          query: "Something about a document...",
+        },
+      },
+    },
+    limit: 2,
+  },
+  pipeline,
+);
+
+console.log(results);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Add this code to the end of the main function in the above example.
+results = await collection.vector_search(
+    {
+        "query": {
+            "fields": {
+                "text": {
+                    "query": "Something about a document...",
+                },
+            },
+        },
+        "limit": 2,
+    },
+    pipeline,
+)
+
+print(results)
+```
+{% endtab %}
+{% endtabs %}
+
+We are using built-in vector search, powered by embeddings and the PostgresML [pgml.embed()](../sql-extension/pgml.embed) function, which embeds the `query` argument, compares it to the embeddings stored in the database, and returns the top two results, ranked by cosine similarity.
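+
+If you prefer SQL, the same kind of search can be expressed directly against the database. This is an illustrative sketch only; the table and column names below are made up, not the SDK's internal schema:
+
+```postgresql
+-- Illustrative sketch: embed the query, then rank stored chunk embeddings
+-- by cosine similarity using the pgvector <=> distance operator.
+SELECT chunk,
+       1 - (embedding <=> pgml.embed('intfloat/e5-small', 'Something about a document...')::vector) AS score
+FROM chunks
+ORDER BY score DESC
+LIMIT 2;
+```
+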
+ +### Run the example -* **Automated Database Management**: You can easily handle the management of database tables related to documents, text chunks, text splitters, LLM models, and embeddings. This automated management system simplifies the process of setting up and maintaining your vector search application's data structure. -* **Embedding Generation from Open Source Models**: Provides the ability to generate embeddings using hundreds of open source models. These models, trained on vast amounts of data, capture the semantic meaning of text and enable powerful analysis and search capabilities. -* **Flexible and Scalable Vector Search**: Build flexible and scalable vector search applications. PostgresML seamlessly integrates with PgVector, a PostgreSQL extension specifically designed for handling vector-based indexing and querying. By leveraging these indices, you can perform advanced searches, rank results by relevance, and retrieve accurate and meaningful information from your database. +Since the SDK is using async code, both JavaScript and Python need a little bit of code to run it correctly: -### Use Cases +{% tabs %} +{% tab title="JavaScript" %} +```javascript +main().then(() => { + console.log("SDK example complete"); +}); +``` +{% endtab %} -* Search: Embeddings are commonly used for search functionalities, where results are ranked by relevance to a query string. By comparing the embeddings of query strings and documents, you can retrieve search results in order of their similarity or relevance. -* Clustering: With embeddings, you can group text strings by similarity, enabling clustering of related data. By measuring the similarity between embeddings, you can identify clusters or groups of text strings that share common characteristics. -* Recommendations: Embeddings play a crucial role in recommendation systems. By identifying items with related text strings based on their embeddings, you can provide personalized recommendations to users. -* Anomaly Detection: Anomaly detection involves identifying outliers or anomalies that have little relatedness to the rest of the data. Embeddings can aid in this process by quantifying the similarity between text strings and flagging outliers. -* Classification: Embeddings are utilized in classification tasks, where text strings are classified based on their most similar label. By comparing the embeddings of text strings and labels, you can classify new text strings into predefined categories. +{% tab title="Python" %} +```python +if __name__ == "__main__": + asyncio.run(main()) +``` +{% endtab %} +{% endtabs %} -### How the SDK Works +Once you run the example, you should see something like this in the terminal: -SDK streamlines the development of vector search applications by abstracting away the complexities of database management and indexing. Here's an overview of how the SDK works: +```bash +[ + { + "chunk": "document one contents...", + "document": {"id": "Document One", "text": "document one contents..."}, + "score": 0.9034339189529419, + }, + { + "chunk": "document two contents...", + "document": {"id": "Document Two", "text": "document two contents..."}, + "score": 0.8983734250068665, + }, +] +``` -* **Automatic Document and Text Chunk Management**: The SDK provides a convenient interface to manage documents and pipelines, automatically handling chunking and embedding for you. You can easily organize and structure your text data within the PostgreSQL database. 
-* **Open Source Model Integration**: With the SDK, you can seamlessly incorporate a wide range of open source models to generate high-quality embeddings. These models capture the semantic meaning of text and enable powerful analysis and search capabilities. -* **Embedding Indexing**: The Python SDK utilizes the PgVector extension to efficiently index the embeddings generated by the open source models. This indexing process optimizes search performance and allows for fast and accurate retrieval of relevant results. -* **Querying and Search**: Once the embeddings are indexed, you can perform vector-based searches on the documents and text chunks stored in the PostgreSQL database. The SDK provides intuitive methods for executing queries and retrieving search results. diff --git a/pgml-cms/docs/api/client-sdk/getting-started.md b/pgml-cms/docs/api/client-sdk/getting-started.md deleted file mode 100644 index fd2f590ae..000000000 --- a/pgml-cms/docs/api/client-sdk/getting-started.md +++ /dev/null @@ -1,249 +0,0 @@ -# Overview - -## Installation - -{% tabs %} -{% tab title="JavaScript " %} -```bash -npm i pgml -``` -{% endtab %} - -{% tab title="Python " %} -```bash -pip install pgml -``` -{% endtab %} -{% endtabs %} - -## Example - -Once the SDK is installed, you can use the following example to get started. - -### Create a collection - -{% tabs %} -{% tab title="JavaScript " %} -```javascript -const pgml = require("pgml"); - -const main = async () => { // Open the main function - collection = pgml.newCollection("sample_collection"); -``` -{% endtab %} - -{% tab title="Python" %} -```python -from pgml import Collection, Pipeline -import asyncio - -async def main(): # Start of the main function - collection = Collection("sample_collection") -``` -{% endtab %} -{% endtabs %} - -**Explanation:** - -* The code imports the pgml module. -* It creates an instance of the Collection class which we will add pipelines and documents onto - -### Create a pipeline - -Continuing with `main` - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const pipeline = pgml.newPipeline("sample_pipeline", { - text: { - splitter: { model: "recursive_character" }, - semantic_search: { - model: "intfloat/e5-small", - }, - }, -}); -await collection.add_pipeline(pipeline); -``` -{% endtab %} - -{% tab title="Python" %} -```python -pipeline = Pipeline( - "test_pipeline", - { - "text": { - "splitter": { "model": "recursive_character" }, - "semantic_search": { - "model": "intfloat/e5-small", - }, - }, - }, -) -await collection.add_pipeline(pipeline) -``` -{% endtab %} -{% endtabs %} - -#### Explanation - -* The code constructs a pipeline called `"sample_pipeline"` and adds it to the collection we Initialized above. This pipeline automatically generates chunks and embeddings for the `text` key for every upserted document. - -### Upsert documents - -Continuing with `main` - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const documents = [ - { - id: "Document One", - text: "document one contents...", - }, - { - id: "Document Two", - text: "document two contents...", - }, -]; -await collection.upsert_documents(documents); -``` -{% endtab %} - -{% tab title="Python" %} -```python -documents = [ - { - "id": "Document One", - "text": "document one contents...", - }, - { - "id": "Document Two", - "text": "document two contents...", - }, -] -await collection.upsert_documents(documents) -``` -{% endtab %} -{% endtabs %} - -**Explanation** - -* This code creates and upserts some filler documents. 
-* As mentioned above, the pipeline added earlier automatically runs and generates chunks and embeddings for each document.
-
-### Query documents
-
-Continuing with `main`
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```javascript
-const results = await collection.vector_search(
-  {
-    query: {
-      fields: {
-        text: {
-          query: "Something about a document...",
-        },
-      },
-    },
-    limit: 2,
-  },
-  pipeline,
-);
-
-console.log(results);
-
-await collection.archive();
-
-} // Close the main function
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-results = await collection.vector_search(
-    {
-        "query": {
-            "fields": {
-                "text": {
-                    "query": "Something about a document...",
-                },
-            },
-        },
-        "limit": 2,
-    },
-    pipeline,
-)
-
-print(results)
-
-await collection.archive()
-
-# End of the main function
-```
-{% endtab %}
-{% endtabs %}
-
-**Explanation:**
-
-* The `query` method is called to perform a vector-based search on the collection. The query string is `Something about a document...`, and the top 2 results are requested
-* The search results are printed to the screen
-* Finally, the `archive` method is called to archive the collection
-
-Call `main` function.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```javascript
-main().then(() => {
-  console.log("Done with PostgresML demo");
-});
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-{% endtab %}
-{% endtabs %}
-
-### **Running the Code**
-
-Open a terminal or command prompt and navigate to the directory where the file is saved.
-
-Execute the following command:
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```bash
-node vector_search.js
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```bash
-python3 vector_search.py
-```
-{% endtab %}
-{% endtabs %}
-
-You should see the search results printed in the terminal.
-
-```bash
-[
-  {
-    "chunk": "document one contents...",
-    "document": {"id": "Document One", "text": "document one contents..."},
-    "score": 0.9034339189529419,
-  },
-  {
-    "chunk": "document two contents...",
-    "document": {"id": "Document Two", "text": "document two contents..."},
-    "score": 0.8983734250068665,
-  },
-]
-```
diff --git a/pgml-cms/docs/introduction/getting-started/README.md b/pgml-cms/docs/introduction/getting-started/README.md
index cde0c6d3a..df15a1dee 100644
--- a/pgml-cms/docs/introduction/getting-started/README.md
+++ b/pgml-cms/docs/introduction/getting-started/README.md
@@ -1,5 +1,5 @@
 ---
-description: Setup a database and connect your application to PostgresML
+description: Getting started with PostgresML, a GPU-powered machine learning database.
 ---
 
 # Getting started
diff --git a/pgml-cms/docs/introduction/getting-started/connect-your-app.md b/pgml-cms/docs/introduction/getting-started/connect-your-app.md
index c0b003220..642b32597 100644
--- a/pgml-cms/docs/introduction/getting-started/connect-your-app.md
+++ b/pgml-cms/docs/introduction/getting-started/connect-your-app.md
@@ -1,5 +1,5 @@
 ---
-description: PostgresML is compatible with all standard PostgreSQL clients
+description: Connect your application to PostgresML using our SDK or any standard PostgreSQL client.
--- # Connect your app diff --git a/pgml-cms/docs/introduction/getting-started/create-your-database.md b/pgml-cms/docs/introduction/getting-started/create-your-database.md index 01e8c53f4..c20568059 100644 --- a/pgml-cms/docs/introduction/getting-started/create-your-database.md +++ b/pgml-cms/docs/introduction/getting-started/create-your-database.md @@ -1,6 +1,6 @@ --- description: >- - You can create a GPU powered database in less than a minute using our hosted + Create a GPU powered database in less than a minute using our hosted cloud. --- diff --git a/pgml-cms/docs/introduction/getting-started/import-your-data/README.md b/pgml-cms/docs/introduction/getting-started/import-your-data/README.md index 49d2cd15e..0ab10669c 100644 --- a/pgml-cms/docs/introduction/getting-started/import-your-data/README.md +++ b/pgml-cms/docs/introduction/getting-started/import-your-data/README.md @@ -1,3 +1,7 @@ +--- +description: Import your data into PostgresML using one of many supported methods. +--- + # Import your data AI needs data, whether it's generating text with LLMs, creating embeddings, or training regression or classification models on customer data. @@ -12,7 +16,7 @@ If your intention is to use PostgresML as your primary database, your job here i If your primary database is hosted elsewhere, for example AWS RDS, or Azure Postgres, you can get your data replicated to PostgresML in real time using logical replication. -
Logical replication
+
Logical replication
Having access to your data immediately is very useful to accelerate your machine learning use cases and removes the need for moving data multiple times between microservices. Latency-sensitive applications should consider using this approach. @@ -21,7 +25,7 @@ accelerate your machine learning use cases and removes the need for moving data Foreign data wrappers are a set of PostgreSQL extensions that allow making direct connections from inside the database directly to other databases, even if they aren't running on Postgres. For example, Postgres has foreign data wrappers for MySQL, S3, Snowflake and many others. -
Foreign data wrappers
+
Foreign data wrappers
FDWs are useful when data access is infrequent and not latency-sensitive. For many use cases, like offline batch workloads and not very busy websites, this approach is suitable and easy to get started with. diff --git a/pgml-cms/docs/introduction/getting-started/import-your-data/copy.md b/pgml-cms/docs/introduction/getting-started/import-your-data/copy.md index 1e590cb87..29b22b684 100644 --- a/pgml-cms/docs/introduction/getting-started/import-your-data/copy.md +++ b/pgml-cms/docs/introduction/getting-started/import-your-data/copy.md @@ -1,3 +1,7 @@ +--- +description: Move data into PostgresML from data files using COPY and CSV. +--- + # Move data with COPY Data that changes infrequently can be easily imported into PostgresML (and any other Postgres database) using `COPY`. All you have to do is export your data as a file, create a table in Postgres to store it, and import it using the command line (or your IDE of choice). diff --git a/pgml-cms/docs/introduction/getting-started/import-your-data/foreign-data-wrappers.md b/pgml-cms/docs/introduction/getting-started/import-your-data/foreign-data-wrappers.md index e6d068e88..27c9d9227 100644 --- a/pgml-cms/docs/introduction/getting-started/import-your-data/foreign-data-wrappers.md +++ b/pgml-cms/docs/introduction/getting-started/import-your-data/foreign-data-wrappers.md @@ -1,8 +1,12 @@ +--- +description: Connect your production database to PostgresML using Foreign Data Wrappers. +--- + # Foreign Data Wrappers Foreign data wrappers are a set of Postgres extensions that allow making direct connections to other databases from inside your PostgresML database. Other databases can be your production Postgres database on RDS or Azure, or another database engine like MySQL, Snowflake, or even an S3 bucket. -
+
Foreign data wrappers
## Getting started diff --git a/pgml-cms/docs/introduction/getting-started/import-your-data/logical-replication/README.md b/pgml-cms/docs/introduction/getting-started/import-your-data/logical-replication/README.md index 11de28b51..d5371b391 100644 --- a/pgml-cms/docs/introduction/getting-started/import-your-data/logical-replication/README.md +++ b/pgml-cms/docs/introduction/getting-started/import-your-data/logical-replication/README.md @@ -1,8 +1,12 @@ +--- +description: Stream data from your primary database to PostgresML in real time using logical replication. +--- + # Logical replication Logical replication allows your PostgresML database to copy data from your primary database to PostgresML in real time. As soon as your customers make changes to their data on your website, those changes will become available in PostgresML. -
+
Logical replication
## Getting started diff --git a/pgml-cms/docs/introduction/getting-started/import-your-data/logical-replication/inside-a-vpc.md b/pgml-cms/docs/introduction/getting-started/import-your-data/logical-replication/inside-a-vpc.md index 4c45db575..55da8bafb 100644 --- a/pgml-cms/docs/introduction/getting-started/import-your-data/logical-replication/inside-a-vpc.md +++ b/pgml-cms/docs/introduction/getting-started/import-your-data/logical-replication/inside-a-vpc.md @@ -3,7 +3,7 @@ If your database doesn't have Internet access, PostgresML will need a service to proxy connections to your database. Any TCP proxy will do, and we also provide an nginx-based Docker image than can be used without any additional configuration. -
+
VPC
## PostgresML IPs by region diff --git a/pgml-cms/docs/introduction/getting-started/import-your-data/pg-dump.md b/pgml-cms/docs/introduction/getting-started/import-your-data/pg-dump.md index 61cf688f6..b6e13b183 100644 --- a/pgml-cms/docs/introduction/getting-started/import-your-data/pg-dump.md +++ b/pgml-cms/docs/introduction/getting-started/import-your-data/pg-dump.md @@ -1,3 +1,7 @@ +--- +description: Migrate your PostgreSQL database to PostgresML using pg_dump. +--- + # Migrate with pg_dump _pg_dump_ is a command-line PostgreSQL tool that can move data between PostgreSQL databases. If you're planning a migration from your database to PostgresML, _pg_dump_ is a good tool to get you going quickly. diff --git a/pgml-cms/docs/product/pgcat/README.md b/pgml-cms/docs/product/pgcat/README.md index 326252032..805422e97 100644 --- a/pgml-cms/docs/product/pgcat/README.md +++ b/pgml-cms/docs/product/pgcat/README.md @@ -1,5 +1,5 @@ --- -description: Nextgen PostgreSQL Pooler +description: PgCat, the PostgreSQL connection pooler and proxy with support for sharding, load balancing, failover, and many more features. --- # PgCat pooler diff --git a/pgml-cms/docs/product/pgcat/configuration.md b/pgml-cms/docs/product/pgcat/configuration.md index c7e14db72..0fe2c4e54 100644 --- a/pgml-cms/docs/product/pgcat/configuration.md +++ b/pgml-cms/docs/product/pgcat/configuration.md @@ -1,4 +1,8 @@ -# Configuration +--- +description: PgCat configuration settings & recommended default values. +--- + +# PgCat configuration PgCat offers many features out of the box, and comes with good default values for most of its configuration options, but some minimal configuration is required before PgCat can start serving PostgreSQL traffic. diff --git a/pgml-cms/docs/product/pgcat/features.md b/pgml-cms/docs/product/pgcat/features.md index df09649cb..f00ff7fb4 100644 --- a/pgml-cms/docs/product/pgcat/features.md +++ b/pgml-cms/docs/product/pgcat/features.md @@ -1,3 +1,7 @@ +--- +description: PgCat features like sharding, load balancing and failover. +--- + # PgCat features PgCat has many features currently in various stages of readiness and development. Most of its features are used in production and at scale. diff --git a/pgml-cms/docs/product/pgcat/installation.md b/pgml-cms/docs/product/pgcat/installation.md index 07248ba4d..b3b151bc4 100644 --- a/pgml-cms/docs/product/pgcat/installation.md +++ b/pgml-cms/docs/product/pgcat/installation.md @@ -1,3 +1,7 @@ +--- +description: PgCat installation instructions from source, Aptitude repository and using Docker. +--- + # PgCat installation If you're using our [cloud](https://postgresml.org/signup), you're already using PgCat. All databases are using the latest and greatest PgCat version, with automatic updates and monitoring. You can connect directly with your PostgreSQL client libraries and applications, and PgCat will take care of the rest. diff --git a/pgml-cms/docs/product/vector-database.md b/pgml-cms/docs/product/vector-database.md index 71db1684f..5db1582bc 100644 --- a/pgml-cms/docs/product/vector-database.md +++ b/pgml-cms/docs/product/vector-database.md @@ -1,5 +1,5 @@ --- -description: Store, index and query vectors, with pgvector +description: Use PostgreSQL as your vector database to store, index and search vectors with the pgvector extension. 
--- # Vector database diff --git a/pgml-dashboard/src/api/cms.rs b/pgml-dashboard/src/api/cms.rs index e376d7e9a..608273ef0 100644 --- a/pgml-dashboard/src/api/cms.rs +++ b/pgml-dashboard/src/api/cms.rs @@ -29,7 +29,6 @@ lazy_static! { "Blog", true, HashMap::from([ - ("the-1.0-sdk-is-here", "the-1.0-sdk-is-here"), ("announcing-hnsw-support-in-our-sdk", "speeding-up-vector-recall-5x-with-hnsw"), ("backwards-compatible-or-bust-python-inside-rust-inside-postgres/", "backwards-compatible-or-bust-python-inside-rust-inside-postgres"), ("data-is-living-and-relational/", "data-is-living-and-relational"), @@ -63,6 +62,7 @@ lazy_static! { ("transformers/fine_tuning/", "api/sql-extension/pgml.tune"), ("guides/predictions/overview", "api/sql-extension/pgml.predict/"), ("machine-learning/supervised-learning/data-pre-processing", "api/sql-extension/pgml.train/data-pre-processing"), + ("api/client-sdk/getting-started", "api/client-sdk/"), ]) ); } @@ -115,6 +115,7 @@ pub struct Document { // url to thumbnail for social share pub thumbnail: Option, pub url: String, + pub ignore: bool, } // Gets document markdown @@ -189,7 +190,7 @@ impl Document { }; // parse meta section - let (description, image, featured, tags) = match meta { + let (description, image, featured, tags, ignore) = match meta { Some(meta) => { let description = if meta["description"].is_badvalue() { None @@ -234,9 +235,15 @@ impl Document { tags }; - (description, image, featured, tags) + let ignore = if meta["ignore"].is_badvalue() { + false + } else { + meta["ignore"].as_bool().unwrap_or(false) + }; + + (description, image, featured, tags, ignore) } - None => (None, Some(default_image_path.clone()), false, Vec::new()), + None => (None, Some(default_image_path.clone()), false, Vec::new(), false), }; let thumbnail = match &image { @@ -300,6 +307,7 @@ impl Document { doc_type, thumbnail, url, + ignore, }; Ok(document) } @@ -328,6 +336,38 @@ impl Document { html } + + pub fn ignore(&self) -> bool { + self.ignore + } +} + +#[derive(Debug, Clone)] +pub struct ContentPath { + path: PathBuf, + canonical: String, + redirected: bool, +} + +impl ContentPath { + /// Should we issue a 301 redirect instead. + pub fn redirect(&self) -> bool { + self.redirected + } + + pub fn path(&self) -> PathBuf { + self.path.clone() + } + + pub fn canonical(&self) -> String { + self.canonical.clone() + } +} + +impl From for PathBuf { + fn from(path: ContentPath) -> PathBuf { + path.path + } } /// A Gitbook collection of documents @@ -373,37 +413,56 @@ impl Collection { } pub async fn get_asset(&self, path: &str) -> Option { - info!("get_asset: {} {path}", self.name); + debug!("get_asset: {} {path}", self.name); NamedFile::open(self.asset_dir.join(path)).await.ok() } - pub async fn get_content_path(&self, mut path: PathBuf, origin: &Origin<'_>) -> (PathBuf, String) { - info!("get_content: {} | {path:?}", self.name); + /// Get the actual path on disk to the content being requested. + /// + /// # Arguments + /// + /// * `path` - The path to the content being requested. + /// * `origin` - The HTTP origin of the request. 
+ /// + pub async fn get_content_path(&self, mut path: PathBuf, origin: &Origin<'_>) -> ContentPath { + debug!("get_content: {} | {path:?}", self.name); - let mut redirected = false; match self .redirects .get(path.as_os_str().to_str().expect("needs to be a well formed path")) { Some(redirect) => { - warn!("found redirect: {} <- {:?}", redirect, path); - redirected = true; // reserved for some fallback path - path = PathBuf::from(redirect); + debug!("found redirect: {} <- {:?}", redirect, path); + + return ContentPath { + redirected: true, + path: PathBuf::from(redirect), + canonical: "".into(), + }; } - None => {} - }; + None => (), + } + let canonical = format!( "https://postgresml.org{}/{}", self.url_root.to_string_lossy(), path.to_string_lossy() ); - if origin.path().ends_with("/") && !redirected { + + if origin.path().ends_with("/") { path = path.join("README"); } + let path = self.root_dir.join(format!("{}.md", path.to_string_lossy())); - (path, canonical) + let path = ContentPath { + path, + canonical, + redirected: false, + }; + + path } /// Create an index of the Collection based on the SUMMARY.md from Gitbook. @@ -605,7 +664,7 @@ impl Collection { path: &'a PathBuf, canonical: &str, cluster: &Cluster, - ) -> Result { + ) -> Result { match Document::from_path(&path).await { Ok(doc) => { let head = crate::components::layouts::Head::new() @@ -626,7 +685,7 @@ impl Collection { article.is_careers() }; - Ok(ResponseOk(layout.render(article))) + Ok(Response::ok(layout.render(article))) } // Return page not found on bad path _ => { @@ -758,9 +817,16 @@ async fn get_blog( path: PathBuf, cluster: &Cluster, origin: &Origin<'_>, -) -> Result { - let (doc_file_path, canonical) = BLOG.get_content_path(path.clone(), origin).await; - BLOG.render(&doc_file_path, &canonical, cluster).await +) -> Result { + let content_path = BLOG.get_content_path(path, origin).await; + + if content_path.redirect() { + let redirect = Path::new("/blog/").join(content_path.path()).display().to_string(); + return Ok(Response::redirect(redirect)); + } + + let canonical = content_path.canonical(); + BLOG.render(&content_path.into(), &canonical, cluster).await } #[get("/careers/", rank = 5)] @@ -768,9 +834,16 @@ async fn get_careers( path: PathBuf, cluster: &Cluster, origin: &Origin<'_>, -) -> Result { - let (doc_file_path, canonical) = CAREERS.get_content_path(path.clone(), origin).await; - CAREERS.render(&doc_file_path, &canonical, cluster).await +) -> Result { + let content_path = CAREERS.get_content_path(path, origin).await; + + if content_path.redirect() { + let redirect = Path::new("/blog/").join(content_path.path()).display().to_string(); + return Ok(Response::redirect(redirect)); + } + + let canonical = content_path.canonical(); + CAREERS.render(&content_path.into(), &canonical, cluster).await } #[get("/careers/apply/", rank = 4)] @@ -789,33 +862,35 @@ async fn get_docs( path: PathBuf, cluster: &Cluster, origin: &Origin<'_>, -) -> Result<ResponseOk, crate::responses::NotFound> { - let (doc_file_path, canonical) = DOCS.get_content_path(path.clone(), origin).await; +) -> Result<Response, crate::responses::NotFound> { + use crate::components::{layouts::Docs, pages::docs::Article}; + + let content_path = DOCS.get_content_path(path, origin).await; - match Document::from_path(&doc_file_path).await { - Ok(doc) => { + if content_path.redirect() { + let redirect = Path::new("/docs/").join(content_path.path()).display().to_string(); + return Ok(Response::redirect(redirect)); + } + + if let Ok(doc) = 
Document::from_path(&content_path.clone().into()).await { + if !doc.ignore() { let index = DOCS.open_index(&doc.path); - let layout = crate::components::layouts::Docs::new(&doc.title, Some(cluster)) + let layout = Docs::new(&doc.title, Some(cluster)) .index(&index) .image(&doc.thumbnail) - .canonical(&canonical); + .canonical(&content_path.canonical()); - let page = crate::components::pages::docs::Article::new(&cluster) - .toc_links(&doc.toc_links) - .content(&doc.html()); + let page = Article::new(&cluster).toc_links(&doc.toc_links).content(&doc.html()); - Ok(ResponseOk(layout.render(page))) + return Ok(Response::ok(layout.render(page))); } - // Return page not found on bad path - _ => { - let layout = crate::components::layouts::Docs::new("404", Some(cluster)).index(&DOCS.index); + } - let page = crate::components::pages::docs::Article::new(&cluster).document_not_found(); + let layout = crate::components::layouts::Docs::new("404", Some(cluster)).index(&DOCS.index); + let page = crate::components::pages::docs::Article::new(&cluster).document_not_found(); - Err(crate::responses::NotFound(layout.render(page))) - } - } + Err(crate::responses::NotFound(layout.render(page))) } #[get("/blog")] diff --git a/pgml-dashboard/src/utils/markdown.rs b/pgml-dashboard/src/utils/markdown.rs index 0b42a9121..26b39155f 100644 --- a/pgml-dashboard/src/utils/markdown.rs +++ b/pgml-dashboard/src/utils/markdown.rs @@ -517,7 +517,8 @@ pub fn get_toc<'a>(root: &'a AstNode<'a>) -> anyhow::Result<Vec<TocLink>> { let text = if let NodeValue::Text(text) = &sibling.data.borrow().value { Some(text.clone()) } else if let NodeValue::Link(_link) = &sibling.data.borrow().value { - let text = sibling.children() + let text = sibling + .children() .into_iter() .map(|child| { if let NodeValue::Text(text) = &child.data.borrow().value { @@ -1378,6 +1379,10 @@ impl SiteSearch { let documents: Vec<Document> = documents .into_iter() .filter(|f| { + if f.ignore() { + return false; + } + !EXCLUDED_DOCUMENT_PATHS .iter() .any(|p| f.path == config::cms_dir().join(p)) diff --git a/pgml-dashboard/static/css/bootstrap-theme.scss b/pgml-dashboard/static/css/bootstrap-theme.scss index 212a7a47f..7bc03ad0c 100644 --- a/pgml-dashboard/static/css/bootstrap-theme.scss +++ b/pgml-dashboard/static/css/bootstrap-theme.scss @@ -90,6 +90,8 @@ @import 'scss/components/images'; @import 'scss/components/code'; @import 'scss/components/forms'; +@import 'scss/components/modals'; + // pages @import 'scss/pages/docs'; @import 'scss/pages/notebooks'; diff --git a/pgml-dashboard/static/css/scss/components/_modals.scss b/pgml-dashboard/static/css/scss/components/_modals.scss index 6b1d6efdd..6c6837c20 100644 --- a/pgml-dashboard/static/css/scss/components/_modals.scss +++ b/pgml-dashboard/static/css/scss/components/_modals.scss @@ -26,3 +26,7 @@ border: none; } } + +.modal-backdrop { + --bs-backdrop-opacity: 0.9; +}