# Engineering Knowledge AI Agent Test

## 1. REST API vs MCP in the context of AI

### REST API:
- A standard architectural style for integrating systems over HTTP
- Stateless: every request stands on its own
- The AI model is deployed as a service endpoint
- The client sends a request → the server processes it → the server returns a response

### MCP (Model Context Protocol):
- A standard protocol designed for AI agents to interact with tools and resources
- **Key advantage**: any MCP-compatible AI agent can use an MCP server without significant code changes; in most cases **adjusting the prompt is enough**
- The agent automatically discovers the tools an MCP server exposes and uses them as needed
- Example: an MCP server exposes database access, a file system, and a calculator; the agent picks whichever tool the current task requires
- This standardization speeds up development: an MCP server is built once and any MCP-compatible agent can use it
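
To make the tool-discovery idea concrete, here is a minimal sketch of the two JSON-RPC 2.0 messages an agent exchanges with an MCP server: `tools/list` to discover tools and `tools/call` to invoke one. The `ToyMCPServer` class, the `calculator` tool, and its schema are hypothetical stand-ins written for illustration, not a real MCP SDK; a real server also handles initialization, transports, and richer error reporting.

```python
class ToyMCPServer:
    """Hypothetical in-memory stand-in for an MCP server (not a real SDK)."""

    def __init__(self):
        # Each tool has a description, a JSON-Schema-style input spec, and a handler.
        self._tools = {
            "calculator": {
                "description": "Add two numbers.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                    "required": ["a", "b"],
                },
                "handler": lambda args: args["a"] + args["b"],
            }
        }

    def handle(self, request: dict) -> dict:
        """Dispatch a JSON-RPC 2.0 request to the matching method."""
        method = request["method"]
        params = request.get("params", {})
        if method == "tools/list":
            # Discovery: describe every tool so the agent can choose one.
            result = {"tools": [
                {"name": name, "description": tool["description"],
                 "inputSchema": tool["inputSchema"]}
                for name, tool in self._tools.items()
            ]}
        elif method == "tools/call":
            # Invocation: run the named tool with the supplied arguments.
            tool = self._tools[params["name"]]
            value = tool["handler"](params.get("arguments", {}))
            result = {"content": [{"type": "text", "text": str(value)}]}
        else:
            return {"jsonrpc": "2.0", "id": request["id"],
                    "error": {"code": -32601, "message": "Method not found"}}
        return {"jsonrpc": "2.0", "id": request["id"], "result": result}


server = ToyMCPServer()

# Step 1: the agent asks which tools exist.
listing = server.handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
tool_names = [tool["name"] for tool in listing["result"]["tools"]]

# Step 2: the agent calls the tool it selected.
call = server.handle({
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "calculator", "arguments": {"a": 2, "b": 3}},
})
print(tool_names, call["result"]["content"][0]["text"])  # ['calculator'] 5
```

Because the agent learns the tool names and schemas at runtime, adding a new tool to the server needs no change in the agent's code, which is the point made in the list above.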

---

## 2. How REST API & MCP improve AI use cases

### REST API:
- Deploy the model as a scalable microservice
- Easy to integrate into existing applications
- Load balancing to handle high traffic
- Easier model versioning

### MCP:
- The AI agent can access databases, file systems, and external APIs dynamically
- Improves accuracy by supplying real-time context
- Agents can combine multiple tools within a single task
- Reduces hallucination by grounding answers in actual data

---

## 3. How to make sure an AI agent answers correctly

### 1. Prompt Engineering:
- Write clear, specific instructions
- Provide examples and the expected output format
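
As an illustration of these two points, the sketch below assembles a prompt from an instruction, a format specification, and a few worked examples. The function name `build_prompt`, the `Q:`/`A:` layout, and the sentiment task are assumptions chosen for the example, not a prescribed template.

```python
def build_prompt(instruction, examples, output_format, question):
    """Assemble a prompt with an instruction, a format spec, and few-shot examples."""
    lines = [instruction.strip(), "", f"Answer format: {output_format}", ""]
    # Few-shot examples show the model the expected input-output pattern.
    for example_question, example_answer in examples:
        lines += [f"Q: {example_question}", f"A: {example_answer}", ""]
    # The real question comes last, with the answer slot left open.
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)


prompt = build_prompt(
    instruction="Classify the sentiment of the review.",
    examples=[("Great food!", "positive"), ("Cold and late.", "negative")],
    output_format="one word, either positive or negative",
    question="The service was friendly.",
)
print(prompt)
```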

### 2. RAG (Retrieval Augmented Generation):
- Ground answers in data or documents we know to be valid
- Reduces hallucination because the model answers from retrieved context instead of making things up
- The source of truth is explicit
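
A minimal sketch of the retrieve-then-ground flow, under simplifying assumptions: retrieval here is plain token overlap rather than embeddings, and `DOCUMENTS`, `retrieve`, and `build_grounded_prompt` are hypothetical names with made-up sample documents. The output is the prompt that would be sent to the model; the model call itself is left out.

```python
import re
from collections import Counter

# Hypothetical trusted documents that act as the source of truth.
DOCUMENTS = [
    "Refunds are processed within 7 business days.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Shipping to Indonesia takes 3 to 5 days.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, top_k=1):
    """Rank documents by how many tokens they share with the query."""
    query_tokens = Counter(tokenize(query))
    scored = []
    for doc in documents:
        overlap = sum((query_tokens & Counter(tokenize(doc))).values())
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents with no overlap so irrelevant text is never used as context.
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_grounded_prompt(query, documents):
    """Build a prompt that restricts the model to the retrieved context."""
    context = retrieve(query, documents)
    context_block = "\n".join(f"- {doc}" for doc in context) or "- (no relevant document found)"
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say you do not know.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}"
    )


print(build_grounded_prompt("How long do refunds take?", DOCUMENTS))
```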

### 3. Human Evaluation & Monitoring:
- Users rate the output directly
- Track satisfaction metrics (thumbs up/down, ratings)
- Monitor model performance using feedback from real users
- Improve continuously based on these metrics
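
One way to turn thumbs up/down feedback into a metric that can be tracked is sketched below. The event fields `model_version` and `thumbs_up` and the sample log are hypothetical; a real system would read these events from its own feedback store.

```python
from collections import defaultdict


def satisfaction_by_version(feedback):
    """Return the share of thumbs-up votes for each model version."""
    ups = defaultdict(int)
    totals = defaultdict(int)
    for event in feedback:
        totals[event["model_version"]] += 1
        if event["thumbs_up"]:
            ups[event["model_version"]] += 1
    return {version: ups[version] / totals[version] for version in totals}


# Hypothetical feedback events collected from real users.
feedback_log = [
    {"model_version": "v1", "thumbs_up": True},
    {"model_version": "v1", "thumbs_up": False},
    {"model_version": "v2", "thumbs_up": True},
    {"model_version": "v2", "thumbs_up": True},
]
rates = satisfaction_by_version(feedback_log)
print(rates)  # {'v1': 0.5, 'v2': 1.0}
```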

---

## 4. Docker/Containers in the context of AI

### Use cases:
- **Reproducible Environment**: Package all dependencies (CUDA, Python libs) into one image
- **Model Serving**: Deploy models with clear isolation
- **Development Consistency**: The same environment from the developer's laptop to the production server
- **GPU Support**: NVIDIA Container Runtime for GPU access
- **Versioning**: Run multiple model versions side by side
- **Resource Limits**: Control CPU/memory/GPU usage per container
- **Portability**: Deploy anywhere Docker is supported

---

## 5. Fine-tuning an LLM with Post-Training

Based on my experience, I use a **Post-Training** approach that consists of 2 stages:

### Stage 1: Unsupervised Training (Domain Adaptation)
- Train on data from the **same domain** as the target use case
- **Goal**: Increase the model's knowledge of the specific domain
- **Format**: Raw text/corpus from that domain
- **Example**: For a medical target domain, train on medical journals and textbooks

### Stage 2: Supervised Training (Task Alignment)
- Train on data from a **different domain** but with the **same task**
- **Goal**: Teach the model to understand the task and its input-output mapping
- **Format**: Instruction-response pairs or Q&A
- The model gains a better grasp of what the task is for and how to map inputs to outputs

### Efficiency:
- Use **LoRA** or other parameter-efficient methods
- The full model does not need to be updated, so training is lighter
- Resource requirements are lower while results generally remain good
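
To give a rough sense of why LoRA is lighter, the sketch below compares trainable parameter counts for a single weight matrix. LoRA freezes the original `d_out x d_in` matrix and trains two low-rank factors of shapes `d_out x rank` and `rank x d_in`. The hidden size 4096 and rank 8 are illustrative assumptions, not values from a specific model.

```python
def full_params(d_out, d_in):
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


hidden_size = 4096  # hypothetical layer width
rank = 8            # hypothetical LoRA rank

full = full_params(hidden_size, hidden_size)
lora = lora_trainable_params(hidden_size, hidden_size, rank)
print(full, lora, f"{lora / full:.4%}")  # 16777216 65536 0.3906%
```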

**Benefit of this approach**: The model gains domain-specific knowledge plus the ability to execute the task well, thanks to exposure to varied task patterns.

In [1]:
import pandas as pd

# 1 Parse Small Files

In [None]:
df = pd.read_csv('data/customers-100000.csv')

In [3]:
df.shape

(100000, 12)

# 2 Parse Large Files

In [5]:
import time

In [4]:
def read_large_files(filepath:str, chunk_size:int):
    # With chunksize set, read_csv returns an iterator of DataFrames instead of one DataFrame
    chunks = pd.read_csv(filepath, chunksize=chunk_size)
    return chunks

In [6]:
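# Compare read_csv with and without chunksize
filepath = 'data/customers-2000000.csv'


# With chunksize, read_csv returns a lazy reader: this times only its creation,
# not the parsing of the file, which happens when the chunks are iterated
start_time = time.time()
chunks = read_large_files(filepath, chunk_size=10000)
duration = time.time()-start_time
print(f'USING CHUNK TIME : {duration}')

# Without chunksize, the whole file is parsed into memory at once
start_time = time.time()
df = pd.read_csv(filepath)
duration = time.time()-start_time
print(f'USING NON CHUNK : {duration}')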

USING CHUNK TIME : 0.004730939865112305
USING NON CHUNK : 12.08965516090393


In [None]:
# Print only the first chunk, then stop
for chunk in chunks:
    print(chunk)
    break

      Index      Customer Id First Name  Last Name  \
0         1  4962fdbE6Bfee6D        Pam     Sparks   
1         2  9b12Ae76fdBc9bE       Gina      Rocha   
2         3  39edFd2F60C85BC    Kristie      Greer   
3         4  Fa42AE6a9aD39cE     Arthur     Fields   
4         5  F5702Edae925F1D   Michelle    Blevins   
...     ...              ...        ...        ...   
9995   9996  0eA8b60A83fDB0B     Marvin     Deleon   
9996   9997  0B426BAc82F4de7     Carlos   Jennings   
9997   9998  eAFCeD87EE37Dd3    Maureen        May   
9998   9999  bC8D48359ba1e57      Sandy     Horton   
9999  10000  AA8464EbF1Dd2FA       Kyle  Blanchard   

                         Company               City  \
0                   Patel-Deleon         Blakemouth   
1        Acosta, Paul and Barber   East Lynnchester   
2                      Ochoa PLC        West Pamela   
3                     Moyer-Wang       East Belinda   
4                  Shah and Sons         West Jared   
...                  

# 3 Difference Between Reading Large and Small Files

The difference between reading a large file and a small file is the use of chunks when reading with pandas. The file is then not loaded all at once but chunk by chunk, and each chunk is retrieved later as needed. The same can be done without pandas, as shown in this [link](https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file).

In [9]:
import csv

In [None]:
def parse_large_file(filepath):
    # newline='' lets the csv module handle line endings itself
    with open(filepath, mode='r', newline='') as file:
        csv_file = csv.reader(file)
        for row in csv_file:
            # do something with row
            yield row

In [22]:
i = 0
for row in parse_large_file(filepath):
    i+=1
    print(row)
    if i==3:
        break

['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website']
['1', '4962fdbE6Bfee6D', 'Pam', 'Sparks', 'Patel-Deleon', 'Blakemouth', 'British Indian Ocean Territory (Chagos Archipelago)', '267-243-9490x035', '480-078-0535x889', 'nicolas00@faulkner-kramer.com', '2020-11-29', 'https://nelson.com/']
['2', '9b12Ae76fdBc9bE', 'Gina', 'Rocha', 'Acosta, Paul and Barber', 'East Lynnchester', 'Costa Rica', '027.142.0940', '+1-752-593-4777x07171', 'yfarley@morgan.com', '2021-01-03', 'https://pineda-rogers.biz/']


With `csv.reader`, the CSV is parsed one row at a time through a generator (`yield`). The generator holds only the current row in memory rather than the whole file, so it suits processing a large file row by row instead of all at once.
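
The row-by-row generator can also be adapted to yield small batches, which mirrors the pandas `chunksize` behaviour without pandas. This is a sketch under assumptions: `iter_row_batches` is a hypothetical helper, and the in-memory sample stands in for the real file so the example is self-contained.

```python
import csv
import io
from itertools import islice


def iter_row_batches(file_obj, batch_size):
    """Yield lists of up to batch_size rows (as dicts) from an open CSV file."""
    reader = csv.DictReader(file_obj)
    while True:
        # islice pulls at most batch_size rows, so only one batch is in memory.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory sample used in place of the large file on disk.
sample = io.StringIO("Index,First Name\n1,Pam\n2,Gina\n3,Kristie\n")
batch_sizes = [len(batch) for batch in iter_row_batches(sample, batch_size=2)]
print(batch_sizes)  # [2, 1]
```

For a real file, the same function can be called inside `with open(filepath, newline='') as f:`.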

# 4 Make Vector DB

The assumption is that this vector DB will back the UI platform in the next question, so each record in it looks like this:

| description    | quantity | price | date |
| -------------- | -------- | ----- | ---- |
| burger         | 1        | 1000  | 05-11-2025 |

In [17]:
from pydantic import BaseModel
from typing_extensions import List
import os
from dotenv import load_dotenv

class Receipt(BaseModel):
    description : str
    quantity : int
    price : float
    date : str
    vendor : str

class Receipts(BaseModel):
    receipts : List[Receipt]

In [18]:
load_dotenv()

True

In [19]:
from langchain_openai import OpenAIEmbeddings
from typing_extensions import List, Dict
import json

from src.be.constant import VECTOR_DB_JSON, VECTOR_DB_EMBEDDING_DIM, VECTOR_DB_ADD_BATCH_SIZE


class VectorDB:
    def __init__(self, embedding_dimension:int=VECTOR_DB_EMBEDDING_DIM, add_batch_size:int=VECTOR_DB_ADD_BATCH_SIZE):
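        try:
            with open(VECTOR_DB_JSON, 'r') as f:
                self.table = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            # No usable table on disk yet, so start with an empty one
            print('Initializing Empty Table')
            self.table = []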
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            dimensions=embedding_dimension
        )
        self.add_batch_size = add_batch_size

    def add_item(self, receipts:List[Dict]):
        for i in range(0, len(receipts), self.add_batch_size):
            batch_receipt = receipts[i:i+self.add_batch_size]
            batch_description = [receipt['description'] for receipt in batch_receipt]
            batch_embeddings = self.make_embedding(batch_description)
            # Store each receipt together with its embedding
            self.table.extend(
                {**receipt, 'embeddings': embedding}
                for receipt, embedding in zip(batch_receipt, batch_embeddings)
            )
        with open(VECTOR_DB_JSON, 'w') as fp:
            json.dump(self.table, fp)

    def get_item(self):
        return [{k:v for k,v in data.items() if k!='embeddings'} for data in self.table]

    def make_embedding(self, texts:List[str]):
        return self.embeddings.embed_documents(texts) 
    
    def dot_product(self, l1, l2):
        assert len(l1)==len(l2), f"Array size not same, len l1 {len(l1)} len l2 {len(l2)}"
        dot_result = 0
        for i in range(len(l1)):
            dot_result+=(l1[i]*l2[i])
        return dot_result
    
    def magnitude(self, l):
        sum_l = 0
        for i in range(len(l)):
            sum_l+=l[i]**2
        return sum_l**0.5
    
    def calculate_distance(self, embedding_1:List, embedding_2:List):
        # Despite the name, this is cosine similarity: a higher value means a closer match
        return self.dot_product(embedding_1, embedding_2)/(self.magnitude(embedding_1)*self.magnitude(embedding_2))
    
    def search(self, query:str, top_n:int):
        if top_n<=0:
            return []
        query_embedding = self.make_embedding([query])[0]
        top_n_results = []
        for i,receipt in enumerate(self.table):
            distance = self.calculate_distance(query_embedding, receipt['embeddings'])
            if len(top_n_results)>=top_n:
                # The list is full: replace the weakest match if this one is closer
                if distance>top_n_results[-1]['distance']:
                    top_n_results[-1] = {'index':i, 'distance':distance}
                    top_n_results = sorted(top_n_results, key=lambda x:x['distance'], reverse=True)
            else:
                top_n_results.append({'index':i, 'distance':distance})
                top_n_results = sorted(top_n_results, key=lambda x:x['distance'], reverse=True)
        print(top_n_results)
        # Return the matched receipts without their embedding vectors
        top_n_results = [{k:v for k,v in self.table[result['index']].items() if k!='embeddings'} for result in top_n_results]
        return top_n_results

In [None]:
r1 = Receipt(description='burger', quantity=1, price=60000.0, date='2025-11-04', vendor='bmb')
r2 = Receipt(description='nasi goreng', quantity=1, price=12000.0, date='2025-11-03', vendor='nasi goreng mafia')
r3 = Receipt(description='mie ayam', quantity=1, price=17000.0, date='2025-11-02', vendor='bakmie gm')
r4 = Receipt(description='bakso', quantity=1, price=15000.0, date='2025-11-01', vendor='bakso lapangan tembak senayan')
r5 = Receipt(description='sate', quantity=1, price=18000.0, date='2025-11-01', vendor='sate hj budi')
r6 = Receipt(description='iga bakar', quantity=1, price=150000.0, date='2025-11-02', vendor='daeng tata')
r7 = Receipt(description='kentang goreng', quantity=1, price=20000.0, date='2025-10-30', vendor='mcd')
r8 = Receipt(description='es krim', quantity=1, price=8000.0, date='2025-10-27', vendor='mcd')
r9 = Receipt(description='jus', quantity=1, price=15000.0, date='2025-10-15', vendor='jus kode')
r10 = Receipt(description='es teh', quantity=1, price=5000.0, date='2025-10-29', vendor='solaria')
receipts = Receipts(receipts=[
    r1,r2,r3,r4,r5,r6,r7,r8,r9,r10
])

In [21]:
vector_db = VectorDB()

Initializing Empty Table


In [22]:
vector_db.add_item(receipts.model_dump()['receipts'])

In [23]:
vector_db.search('burger', 2)

[{'index': 0, 'distance': 0.9999999999999998}, {'index': 3, 'distance': 0.4241921288922651}]


[{'description': 'burger',
  'quantity': 1,
  'price': 60000.0,
  'date': '2025-11-04',
  'vendor': 'bmb'},
 {'description': 'bakso',
  'quantity': 1,
  'price': 15000.0,
  'date': '2025-11-01',
  'vendor': 'bakso lapangan tembak senayan'}]
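
The `search` method above keeps its candidate list ordered by re-sorting after every change. A possible alternative, sketched here with the standard library only, is to score every row and let `heapq.nlargest` pick the best matches. The 2-dimensional vectors in `toy_table` are made-up stand-ins for real embeddings, and `top_n_similar` is a hypothetical helper, not part of the `VectorDB` class.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_similar(query_vec, rows, n):
    """Return the indices of the n rows most similar to query_vec, best first."""
    scored = (
        (cosine_similarity(query_vec, row["embeddings"]), index)
        for index, row in enumerate(rows)
    )
    return [index for _, index in heapq.nlargest(n, scored)]


# Toy 2-dimensional "embeddings", invented for illustration.
toy_table = [
    {"description": "burger", "embeddings": [1.0, 0.0]},
    {"description": "bakso", "embeddings": [0.6, 0.8]},
    {"description": "es teh", "embeddings": [0.0, 1.0]},
]
top_indices = top_n_similar([1.0, 0.1], toy_table, n=2)
print([toy_table[i]["description"] for i in top_indices])  # ['burger', 'bakso']
```

Both approaches scan every row, so they suit a small table like this one; larger collections usually rely on an approximate nearest-neighbour index instead.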