Backend: Schema & APIs for AI Training Dataset (MongoDB + Vector + Upload) #5

@Anjali3366

Description

We need to implement backend schemas and APIs to store, manage, and retrieve AI training samples.
These samples will be used to train, fine-tune, and improve the AI model using structured datasets.
The feature includes new Mongoose schemas, vector indexing, CRUD endpoints, dataset upload & parsing, and embedding generation.


Requirements

Training Sample Fields

Each training sample must support:

  • question

  • type: one of qa | snippet | doc | faq | other

  • Structured answer JSON:

    {
      "greeting": "Hi {{userName}}! 👋 How can I help you today?",
      "answer": "<concise plain-text answer>",
      "sections": [
        {
          "title": "Section title",
          "content": "Plain text content here"
        }
      ],
      "suggestions": [
        "Suggestion 1",
        "Suggestion 2",
        "Suggestion n"
      ]
    }
  • Optional codeSnippet

  • embedding (vector for semantic retrieval)

  • File metadata:

    • filePath
    • fileMimeType
    • fileSizeInBytes
  • sourceType: one of manual | dataset

  • datasetId (reference to uploaded dataset)

  • tags

  • language

  • isActive


Schema (Mongoose Example)

// models/trainingSample.model.ts
import { Schema, model, Types } from "mongoose";

const SectionSchema = new Schema(
  {
    title: { type: String, required: true },
    content: { type: String, required: true }
  },
  { _id: false }
);

const AnswerTemplateSchema = new Schema(
  {
    greeting: { type: String, required: false },
    answer: { type: String, required: true },
    sections: { type: [SectionSchema], default: [] },
    suggestions: { type: [String], default: [] }
  },
  { _id: false }
);

const TrainingSampleSchema = new Schema(
  {
    question: { type: String, required: true },
    type: {
      type: String,
      enum: ["qa", "snippet", "doc", "faq", "other"],
      default: "qa"
    },
    answerTemplate: { type: AnswerTemplateSchema, required: true },
    codeSnippet: { type: String },

    // Note: "vector" is not a valid path-level index type in Mongoose/MongoDB.
    // The Atlas Vector Search index must be defined separately
    // (e.g. via createSearchIndex or the Atlas UI).
    embedding: {
      type: [Number],
      required: true
    },

    filePath: { type: String },
    fileMimeType: { type: String },
    fileSizeInBytes: { type: Number },

    sourceType: {
      type: String,
      enum: ["manual", "dataset"],
      default: "manual"
    },
    datasetId: { type: Types.ObjectId, ref: "DatasetFile" },

    tags: [{ type: String }],
    language: { type: String, default: "en" },

    isActive: { type: Boolean, default: true }
  },
  { timestamps: true }
);

export const TrainingSample = model("TrainingSample", TrainingSampleSchema);
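Since Atlas Vector Search indexes are created outside the Mongoose schema, the index definition for `embedding` could be registered at deploy time. A sketch, assuming a cosine-similarity index and a 1536-dimension embedding model (both assumptions; match them to the actual provider), with assumed index name `training_sample_embedding`:

```typescript
// Builds the Atlas Vector Search index definition for the embedding field.
// numDimensions must match the embedding model's output size (assumed here).
function buildVectorIndexDefinition(numDimensions: number) {
  return {
    name: "training_sample_embedding", // assumed index name
    type: "vectorSearch",
    definition: {
      fields: [
        {
          type: "vector",
          path: "embedding",
          numDimensions,
          similarity: "cosine" // suits normalized text embeddings
        },
        // Filter fields allow $vectorSearch to pre-filter on these paths.
        { type: "filter", path: "type" },
        { type: "filter", path: "tags" },
        { type: "filter", path: "isActive" }
      ]
    }
  };
}

// With a connected model it could be registered like:
// await TrainingSample.collection.createSearchIndex(buildVectorIndexDefinition(1536));
```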

APIs Needed

Training Data CRUD

  • POST /api/v1/training-samples
    Create sample + generate embedding
  • GET /api/v1/training-samples
    Filters: type, tags, isActive, sourceType
  • GET /api/v1/training-samples/:id
  • PUT /api/v1/training-samples/:id
    Re-generate embedding if content changes
  • DELETE /api/v1/training-samples/:id
    Soft delete → isActive = false
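The create and update endpoints share one step: assemble the document and (re)generate the embedding whenever the question/answer content changes. A minimal sketch of that assembly, with the embedding provider abstracted behind a function (the `embed` signature and the text-concatenation strategy are assumptions, not fixed APIs):

```typescript
// Shape of the structured answer from the requirements above.
interface AnswerTemplate {
  greeting?: string;
  answer: string;
  sections?: { title: string; content: string }[];
  suggestions?: string[];
}

interface TrainingSampleInput {
  question: string;
  type?: "qa" | "snippet" | "doc" | "faq" | "other";
  answerTemplate: AnswerTemplate;
  codeSnippet?: string;
  tags?: string[];
  language?: string;
}

// Concatenates the text the embedding should represent: question, answer,
// section contents, and the optional code snippet.
function embeddingText(input: TrainingSampleInput): string {
  const sections = (input.answerTemplate.sections ?? [])
    .map((s) => `${s.title}\n${s.content}`)
    .join("\n");
  return [input.question, input.answerTemplate.answer, sections, input.codeSnippet ?? ""]
    .filter(Boolean)
    .join("\n");
}

// Validates required fields, applies schema defaults, and attaches the
// generated embedding. `embed` is a stand-in for the embedding provider.
async function buildTrainingSampleDoc(
  input: TrainingSampleInput,
  embed: (text: string) => Promise<number[]>
) {
  if (!input.question?.trim()) throw new Error("question is required");
  if (!input.answerTemplate?.answer?.trim()) {
    throw new Error("answerTemplate.answer is required");
  }
  return {
    ...input,
    type: input.type ?? "qa",
    language: input.language ?? "en",
    sourceType: "manual" as const,
    isActive: true,
    embedding: await embed(embeddingText(input))
  };
}
```

The PUT handler can reuse `embeddingText` to decide whether the content changed and only then call `embed` again.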

Vector Search

  • POST /api/v1/training-samples/search
    Body:

    { "query": "", "topK": 5, "filters": {} }
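On Atlas, this endpoint can run a `$vectorSearch` aggregation. A sketch of the pipeline builder; the index name and the `numCandidates` over-retrieval heuristic are assumptions:

```typescript
// Builds an Atlas $vectorSearch aggregation pipeline for the search endpoint.
// `filters` maps indexed filter paths (type, tags, isActive) to exact values.
function buildVectorSearchPipeline(
  queryVector: number[],
  topK = 5,
  filters: Record<string, unknown> = {}
) {
  return [
    {
      $vectorSearch: {
        index: "training_sample_embedding", // assumed index name
        path: "embedding",
        queryVector,
        limit: topK,
        numCandidates: Math.max(topK * 10, 100), // common over-retrieval heuristic
        ...(Object.keys(filters).length ? { filter: filters } : {})
      }
    },
    {
      $project: {
        question: 1,
        answerTemplate: 1,
        codeSnippet: 1,
        tags: 1,
        score: { $meta: "vectorSearchScore" }
      }
    }
  ];
}
```

The controller would embed `query` first, then run something like `TrainingSample.aggregate(buildVectorSearchPipeline(vector, topK, filters))`.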

Dataset Upload

  • POST /api/v1/training-datasets/upload
    Accept CSV, JSON, TXT, MD
  • POST /api/v1/training-datasets/:id/process
    Parse file → create TrainingSample records
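The process step has to turn each supported file type into `{ question, answer }` rows before creating TrainingSample records. A minimal dispatch sketch — the expected column/key names (`question`, `answer`), the blank-line pairing for TXT/MD, and the naive CSV handling (no quoted commas) are all assumptions; a production parser would use a proper CSV library such as `csv-parse`:

```typescript
interface ParsedRow {
  question: string;
  answer: string;
}

// Naive CSV: header row + comma-split lines. Does NOT handle quoted fields.
function parseCsv(text: string): ParsedRow[] {
  const [header, ...lines] = text.trim().split(/\r?\n/);
  const cols = header.split(",").map((c) => c.trim().toLowerCase());
  const qi = cols.indexOf("question");
  const ai = cols.indexOf("answer");
  if (qi === -1 || ai === -1) throw new Error("CSV needs question,answer columns");
  return lines
    .filter((l) => l.trim())
    .map((l) => {
      const cells = l.split(",");
      return { question: cells[qi]?.trim() ?? "", answer: cells[ai]?.trim() ?? "" };
    });
}

// JSON: expects an array of { question, answer } objects.
function parseJson(text: string): ParsedRow[] {
  const data = JSON.parse(text);
  if (!Array.isArray(data)) throw new Error("JSON dataset must be an array");
  return data.map((r) => ({ question: String(r.question), answer: String(r.answer) }));
}

// TXT/MD: treat blank-line-separated blocks as "question\nanswer" pairs.
function parsePlainText(text: string): ParsedRow[] {
  return text
    .split(/\n\s*\n/)
    .map((block) => block.trim())
    .filter(Boolean)
    .map((block) => {
      const [question, ...rest] = block.split("\n");
      return { question: question.trim(), answer: rest.join("\n").trim() };
    });
}

// Dispatch on MIME type (the fileMimeType field from the schema above).
function parseDataset(mimeType: string, text: string): ParsedRow[] {
  switch (mimeType) {
    case "text/csv":
      return parseCsv(text);
    case "application/json":
      return parseJson(text);
    case "text/plain":
    case "text/markdown":
      return parsePlainText(text);
    default:
      throw new Error(`Unsupported dataset type: ${mimeType}`);
  }
}
```

Each parsed row would then flow through the same embedding path as manual creation, with `sourceType: "dataset"` and `datasetId` set to the uploaded file's id.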

Acceptance Criteria

  • TrainingSample schema created with all fields

  • Vector index configured on embedding

  • CRUD APIs implemented with validation

  • Dataset upload + parser supports CSV/JSON/TXT/MD

  • README includes:

    • Example answerTemplate JSON
    • Steps to create a training sample
    • Steps to upload & process dataset
