Description
We need to implement backend schemas and APIs to store, manage, and retrieve AI training samples.
These samples will be used to train, fine-tune, and improve the AI model using structured datasets.
The feature includes new Mongoose schemas, vector indexing, CRUD endpoints, dataset upload & parsing, and embedding generation.
Requirements
Training Sample Fields
Each training sample must support:
- `question`
- `type` → qa | snippet | doc | faq | other
- Structured answer JSON:

  ```json
  {
    "greeting": "Hi {{userName}}! 👋 How can I help you today?",
    "answer": "<concise plain-text answer>",
    "sections": [
      { "title": "Section title", "content": "Plain text content here" }
    ],
    "suggestions": ["Suggestion 1", "Suggestion 2", "Suggestion n"]
  }
  ```

- Optional `codeSnippet`
- `embedding` (vector for semantic retrieval)
- File metadata: `filePath`, `fileMimeType`, `fileSizeInBytes`
- `sourceType` → manual | dataset
- `datasetId` (reference to uploaded dataset)
- `tags`
- `language`
- `isActive`
Schema (Mongoose Example)
```ts
// models/trainingSample.model.ts
import { Schema, model, Types } from "mongoose";

const SectionSchema = new Schema(
  {
    title: { type: String, required: true },
    content: { type: String, required: true }
  },
  { _id: false }
);

const AnswerTemplateSchema = new Schema(
  {
    greeting: { type: String, required: false },
    answer: { type: String, required: true },
    sections: { type: [SectionSchema], default: [] },
    suggestions: { type: [String], default: [] }
  },
  { _id: false }
);

const TrainingSampleSchema = new Schema(
  {
    question: { type: String, required: true },
    type: {
      type: String,
      enum: ["qa", "snippet", "doc", "faq", "other"],
      default: "qa"
    },
    answerTemplate: { type: AnswerTemplateSchema, required: true },
    codeSnippet: { type: String },
    // Note: "vector" is not a valid Mongoose field index type; the vector
    // index on this field must be created separately (e.g., as an Atlas
    // Vector Search index on the collection).
    embedding: {
      type: [Number],
      required: true
    },
    filePath: { type: String },
    fileMimeType: { type: String },
    fileSizeInBytes: { type: Number },
    sourceType: {
      type: String,
      enum: ["manual", "dataset"],
      default: "manual"
    },
    datasetId: { type: Types.ObjectId, ref: "DatasetFile" },
    tags: [{ type: String }],
    language: { type: String, default: "en" },
    isActive: { type: Boolean, default: true }
  },
  { timestamps: true }
);

export const TrainingSample = model("TrainingSample", TrainingSampleSchema);
```

APIs Needed
Training Data CRUD

- POST `/api/v1/training-samples` - create sample + generate embedding
- GET `/api/v1/training-samples` - filters: `type`, `tags`, `isActive`, `sourceType`
- GET `/api/v1/training-samples/:id`
- PUT `/api/v1/training-samples/:id` - re-generate embedding if content changes
- DELETE `/api/v1/training-samples/:id` - soft delete (`isActive = false`)
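The create and update endpoints above need payload validation before a sample is saved. A minimal, framework-agnostic sketch (the function name and error-message format are assumptions, not part of this spec):

```typescript
// Allowed enum values, mirroring the Mongoose schema
const SAMPLE_TYPES = ["qa", "snippet", "doc", "faq", "other"] as const;
const SOURCE_TYPES = ["manual", "dataset"] as const;

export interface AnswerTemplate {
  greeting?: string;
  answer: string;
  sections?: { title: string; content: string }[];
  suggestions?: string[];
}

export interface TrainingSamplePayload {
  question: string;
  type?: (typeof SAMPLE_TYPES)[number];
  answerTemplate: AnswerTemplate;
  sourceType?: (typeof SOURCE_TYPES)[number];
}

// Returns a list of human-readable problems; an empty list means the payload is valid.
export function validateTrainingSample(body: Partial<TrainingSamplePayload>): string[] {
  const errors: string[] = [];
  if (!body.question?.trim()) errors.push("question is required");
  if (body.type && !SAMPLE_TYPES.includes(body.type)) {
    errors.push(`type must be one of: ${SAMPLE_TYPES.join(", ")}`);
  }
  if (!body.answerTemplate?.answer?.trim()) errors.push("answerTemplate.answer is required");
  if (body.sourceType && !SOURCE_TYPES.includes(body.sourceType)) {
    errors.push(`sourceType must be one of: ${SOURCE_TYPES.join(", ")}`);
  }
  return errors;
}
```

A route handler would call this first and return `400` with the error list before attempting embedding generation.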
Vector Search

- POST `/api/v1/training-samples/search` - body: `{ "query": "", "topK": 5, "filters": {} }`
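The search endpoint can be backed by an Atlas `$vectorSearch` aggregation stage. A sketch of the pipeline builder, assuming a separately created Atlas Vector Search index (the index name `training_samples_vector_index`, the oversampling heuristic, and the projected fields are assumptions):

```typescript
// Builds the aggregation pipeline for semantic retrieval.
// `queryEmbedding` must come from the same embedding model used at ingest time.
export function buildVectorSearchPipeline(
  queryEmbedding: number[],
  topK = 5,
  filters: Record<string, unknown> = {}
): Record<string, any>[] {
  return [
    {
      $vectorSearch: {
        index: "training_samples_vector_index", // assumed Atlas index name
        path: "embedding",
        queryVector: queryEmbedding,
        numCandidates: topK * 10, // oversample candidates for better recall
        limit: topK,
        // Pre-filter on indexed fields; isActive excludes soft-deleted samples
        filter: { isActive: true, ...filters }
      }
    },
    {
      $project: {
        question: 1,
        answerTemplate: 1,
        tags: 1,
        score: { $meta: "vectorSearchScore" }
      }
    }
  ];
}
```

Note that any field used in `filter` (e.g. `type`, `isActive`) must be declared as a filter field in the Atlas index definition.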
Dataset Upload

- POST `/api/v1/training-datasets/upload` - accept CSV, JSON, TXT, MD
- POST `/api/v1/training-datasets/:id/process` - parse file → create `TrainingSample` records
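The `process` step has to turn an uploaded file into `TrainingSample` records. A minimal sketch for the JSON and CSV cases (the question/answer field mapping is an assumption; the CSV split is naive and does not handle quoted commas, and TXT/MD parsing is omitted):

```typescript
interface ParsedSample {
  question: string;
  answer: string;
}

// Parses a dataset file's text content into question/answer pairs.
// JSON files are expected to be an array of { question, answer } objects;
// CSV files are expected to have a header row containing question,answer.
export function parseDataset(content: string, mimeType: string): ParsedSample[] {
  if (mimeType === "application/json") {
    const rows = JSON.parse(content) as ParsedSample[];
    return rows.filter((r) => r.question && r.answer);
  }
  if (mimeType === "text/csv") {
    const [header, ...lines] = content.trim().split("\n");
    const cols = header.split(",").map((c) => c.trim().toLowerCase());
    const qi = cols.indexOf("question");
    const ai = cols.indexOf("answer");
    if (qi === -1 || ai === -1) throw new Error("CSV needs question and answer columns");
    return lines
      .map((line) => line.split(",")) // naive split: no quoted-comma handling
      .map((cells) => ({
        question: cells[qi]?.trim() ?? "",
        answer: cells[ai]?.trim() ?? ""
      }))
      .filter((r) => r.question && r.answer);
  }
  throw new Error(`Unsupported mime type: ${mimeType}`);
}
```

Each parsed pair would then be saved as a `TrainingSample` with `sourceType: "dataset"` and the uploading file's `datasetId`, followed by embedding generation.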
Acceptance Criteria

- `TrainingSample` schema created with all fields
- Vector index configured on `embedding`
- CRUD APIs implemented with validation
- Dataset upload + parser supports CSV/JSON/TXT/MD
- README includes:
  - Example `answerTemplate` JSON
  - Steps to create a training sample
  - Steps to upload & process a dataset