Backend: Schema & APIs for AI Training Dataset (MongoDB + Vector + Upload) #5

@Anjali3366

Description

We need to implement backend schemas and APIs to store, manage, and retrieve AI training samples.
These samples will be used to train, fine-tune, and improve the AI model using structured datasets.
The feature includes new Mongoose schemas, vector indexing, CRUD endpoints, dataset upload & parsing, and embedding generation.


Requirements

Training Sample Fields

Each training sample must support:

  • question

  • type: one of qa | snippet | doc | faq | other

  • Structured answer JSON:

    {
      "greeting": "Hi {{userName}}! 👋 How can I help you today?",
      "answer": "<concise plain-text answer>",
      "sections": [
        {
          "title": "Section title",
          "content": "Plain text content here"
        }
      ],
      "suggestions": [
        "Suggestion 1",
        "Suggestion 2",
        "Suggestion n"
      ]
    }
  • Optional codeSnippet

  • embedding (vector for semantic retrieval)

  • File metadata:

    • filePath
    • fileMimeType
    • fileSizeInBytes
  • sourceType: one of manual | dataset

  • datasetId (reference to uploaded dataset)

  • tags

  • language

  • isActive


Schema (Mongoose Example)

// models/trainingSample.model.ts
import { Schema, model, Types } from "mongoose";

const SectionSchema = new Schema(
  {
    title: { type: String, required: true },
    content: { type: String, required: true }
  },
  { _id: false }
);

const AnswerTemplateSchema = new Schema(
  {
    greeting: { type: String, required: false },
    answer: { type: String, required: true },
    sections: { type: [SectionSchema], default: [] },
    suggestions: { type: [String], default: [] }
  },
  { _id: false }
);

const TrainingSampleSchema = new Schema(
  {
    question: { type: String, required: true },
    type: {
      type: String,
      enum: ["qa", "snippet", "doc", "faq", "other"],
      default: "qa"
    },
    answerTemplate: { type: AnswerTemplateSchema, required: true },
    codeSnippet: { type: String },

    // Note: "vector" is not a valid path-level index type in Mongoose/MongoDB.
    // The Atlas Vector Search index must be defined separately
    // (e.g. via createSearchIndex or the Atlas UI).
    embedding: {
      type: [Number],
      required: true
    },

    filePath: { type: String },
    fileMimeType: { type: String },
    fileSizeInBytes: { type: Number },

    sourceType: {
      type: String,
      enum: ["manual", "dataset"],
      default: "manual"
    },
    datasetId: { type: Types.ObjectId, ref: "DatasetFile" },

    tags: [{ type: String }],
    language: { type: String, default: "en" },

    isActive: { type: Boolean, default: true }
  },
  { timestamps: true }
);

export const TrainingSample = model("TrainingSample", TrainingSampleSchema);
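Since Atlas Vector Search indexes are created outside the Mongoose schema, the index definition for `embedding` could be registered at deploy time. A sketch, assuming a cosine-similarity index and a 1536-dimension embedding model (both assumptions; match them to the actual provider), with assumed index name `training_sample_embedding`:

```typescript
// Builds the Atlas Vector Search index definition for the embedding field.
// numDimensions must match the embedding model's output size (assumed here).
function buildVectorIndexDefinition(numDimensions: number) {
  return {
    name: "training_sample_embedding", // assumed index name
    type: "vectorSearch",
    definition: {
      fields: [
        {
          type: "vector",
          path: "embedding",
          numDimensions,
          similarity: "cosine" // suits normalized text embeddings
        },
        // Filter fields allow $vectorSearch to pre-filter on these paths.
        { type: "filter", path: "type" },
        { type: "filter", path: "tags" },
        { type: "filter", path: "isActive" }
      ]
    }
  };
}

// With a connected model it could be registered like:
// await TrainingSample.collection.createSearchIndex(buildVectorIndexDefinition(1536));
```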

APIs Needed

Training Data CRUD

  • POST /api/v1/training-samples
    Create sample + generate embedding
  • GET /api/v1/training-samples
    Filters: type, tags, isActive, sourceType
  • GET /api/v1/training-samples/:id
  • PUT /api/v1/training-samples/:id
    Re-generate embedding if content changes
  • DELETE /api/v1/training-samples/:id
    Soft delete → isActive = false
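The create and update endpoints share one step: assemble the document and (re)generate the embedding whenever the question/answer content changes. A minimal sketch of that assembly, with the embedding provider abstracted behind a function (the `embed` signature and the text-concatenation strategy are assumptions, not fixed APIs):

```typescript
// Shape of the structured answer from the requirements above.
interface AnswerTemplate {
  greeting?: string;
  answer: string;
  sections?: { title: string; content: string }[];
  suggestions?: string[];
}

interface TrainingSampleInput {
  question: string;
  type?: "qa" | "snippet" | "doc" | "faq" | "other";
  answerTemplate: AnswerTemplate;
  codeSnippet?: string;
  tags?: string[];
  language?: string;
}

// Concatenates the text the embedding should represent: question, answer,
// section contents, and the optional code snippet.
function embeddingText(input: TrainingSampleInput): string {
  const sections = (input.answerTemplate.sections ?? [])
    .map((s) => `${s.title}\n${s.content}`)
    .join("\n");
  return [input.question, input.answerTemplate.answer, sections, input.codeSnippet ?? ""]
    .filter(Boolean)
    .join("\n");
}

// Validates required fields, applies schema defaults, and attaches the
// generated embedding. `embed` is a stand-in for the embedding provider.
async function buildTrainingSampleDoc(
  input: TrainingSampleInput,
  embed: (text: string) => Promise<number[]>
) {
  if (!input.question?.trim()) throw new Error("question is required");
  if (!input.answerTemplate?.answer?.trim()) {
    throw new Error("answerTemplate.answer is required");
  }
  return {
    ...input,
    type: input.type ?? "qa",
    language: input.language ?? "en",
    sourceType: "manual" as const,
    isActive: true,
    embedding: await embed(embeddingText(input))
  };
}
```

The PUT handler can reuse `embeddingText` to decide whether the content changed and only then call `embed` again.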

Vector Search

  • POST /api/v1/training-samples/search
    Body:

    { "query": "", "topK": 5, "filters": {} }
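On Atlas, this endpoint can run a `$vectorSearch` aggregation. A sketch of the pipeline builder; the index name and the `numCandidates` over-retrieval heuristic are assumptions:

```typescript
// Builds an Atlas $vectorSearch aggregation pipeline for the search endpoint.
// `filters` maps indexed filter paths (type, tags, isActive) to exact values.
function buildVectorSearchPipeline(
  queryVector: number[],
  topK = 5,
  filters: Record<string, unknown> = {}
) {
  return [
    {
      $vectorSearch: {
        index: "training_sample_embedding", // assumed index name
        path: "embedding",
        queryVector,
        limit: topK,
        numCandidates: Math.max(topK * 10, 100), // common over-retrieval heuristic
        ...(Object.keys(filters).length ? { filter: filters } : {})
      }
    },
    {
      $project: {
        question: 1,
        answerTemplate: 1,
        codeSnippet: 1,
        tags: 1,
        score: { $meta: "vectorSearchScore" }
      }
    }
  ];
}
```

The controller would embed `query` first, then run something like `TrainingSample.aggregate(buildVectorSearchPipeline(vector, topK, filters))`.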

Dataset Upload

  • POST /api/v1/training-datasets/upload
    Accept CSV, JSON, TXT, MD
  • POST /api/v1/training-datasets/:id/process
    Parse file → create TrainingSample records
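The process step has to turn each supported file type into `{ question, answer }` rows before creating TrainingSample records. A minimal dispatch sketch — the expected column/key names (`question`, `answer`), the blank-line pairing for TXT/MD, and the naive CSV handling (no quoted commas) are all assumptions; a production parser would use a proper CSV library such as `csv-parse`:

```typescript
interface ParsedRow {
  question: string;
  answer: string;
}

// Naive CSV: header row + comma-split lines. Does NOT handle quoted fields.
function parseCsv(text: string): ParsedRow[] {
  const [header, ...lines] = text.trim().split(/\r?\n/);
  const cols = header.split(",").map((c) => c.trim().toLowerCase());
  const qi = cols.indexOf("question");
  const ai = cols.indexOf("answer");
  if (qi === -1 || ai === -1) throw new Error("CSV needs question,answer columns");
  return lines
    .filter((l) => l.trim())
    .map((l) => {
      const cells = l.split(",");
      return { question: cells[qi]?.trim() ?? "", answer: cells[ai]?.trim() ?? "" };
    });
}

// JSON: expects an array of { question, answer } objects.
function parseJson(text: string): ParsedRow[] {
  const data = JSON.parse(text);
  if (!Array.isArray(data)) throw new Error("JSON dataset must be an array");
  return data.map((r) => ({ question: String(r.question), answer: String(r.answer) }));
}

// TXT/MD: treat blank-line-separated blocks as "question\nanswer" pairs.
function parsePlainText(text: string): ParsedRow[] {
  return text
    .split(/\n\s*\n/)
    .map((block) => block.trim())
    .filter(Boolean)
    .map((block) => {
      const [question, ...rest] = block.split("\n");
      return { question: question.trim(), answer: rest.join("\n").trim() };
    });
}

// Dispatch on MIME type (the fileMimeType field from the schema above).
function parseDataset(mimeType: string, text: string): ParsedRow[] {
  switch (mimeType) {
    case "text/csv":
      return parseCsv(text);
    case "application/json":
      return parseJson(text);
    case "text/plain":
    case "text/markdown":
      return parsePlainText(text);
    default:
      throw new Error(`Unsupported dataset type: ${mimeType}`);
  }
}
```

Each parsed row would then flow through the same embedding path as manual creation, with `sourceType: "dataset"` and `datasetId` set to the uploaded file's id.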

Acceptance Criteria

  • TrainingSample schema created with all fields

  • Vector index configured on embedding

  • CRUD APIs implemented with validation

  • Dataset upload + parser supports CSV/JSON/TXT/MD

  • README includes:

    • Example answerTemplate JSON
    • Steps to create a training sample
    • Steps to upload & process dataset
