# HOW TO RUN

### Google Colab Setup

### Creation of Qdrant Vector DB from uploaded code in `data/examples`

In [5]:
!python ../scripts/ingest.py  \
    --input_glob "../data/examples/**/*.py"   \
    --model sentence-transformers/all-MiniLM-L12-v2  \
    --qdrant-host qdrant \
    --qdrant-port 6333   \
    --qdrant-collection codes \
    # --qdrant-recreate 

Prepared 12 chunks
Ingested 12 chunks into Qdrant collection 'codes'.
Tip: you can filter by payload fields, e.g., lang=='python' or kind in ['function','class'].
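
The payload filter mentioned in the tip can be sketched as a request body for Qdrant's REST search endpoint (`POST /collections/codes/points/search`). This is a minimal illustration, not code from `ingest.py` or `serve_api.py`: the helper name `build_search_body` is hypothetical, the payload keys `lang` and `kind` are taken from the tip above, and the three-element vector is a placeholder for a real query embedding.

```python
import json


def build_search_body(query_vector, limit=3, lang=None, kinds=None):
    """Build a JSON body for a Qdrant search request with an optional payload filter.

    Hypothetical helper for illustration; `lang` and `kind` are the payload
    keys named in the ingest tip, not verified against the scripts.
    """
    must = []
    if lang is not None:
        # Exact match on one value, e.g. lang == "python".
        must.append({"key": "lang", "match": {"value": lang}})
    if kinds:
        # Match any value from a list, e.g. kind in ["function", "class"].
        must.append({"key": "kind", "match": {"any": list(kinds)}})

    body = {"vector": list(query_vector), "limit": limit, "with_payload": True}
    if must:
        # All conditions under "must" have to hold for a point to be returned.
        body["filter"] = {"must": must}
    return body


# Placeholder vector; a real query would use the embedding model's output.
body = build_search_body([0.1, 0.2, 0.3], lang="python", kinds=["function", "class"])
print(json.dumps(body, indent=2))
```

Building the body as a plain dictionary keeps the sketch independent of any client library; the same conditions could be expressed through a Qdrant client if one is already in use.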


### Serving of API to Qdrant DB

In [7]:
!python ../scripts/serve_api.py \
    --backend qdrant \
    --model sentence-transformers/all-MiniLM-L12-v2 \
    --qdrant-host qdrant \
    --qdrant-port 6333 \
    --qdrant-collection codes 

[32mINFO[0m:     Started server process [[36m2947[0m]
[32mINFO[0m:     Waiting for application startup.
[32mINFO[0m:     Application startup complete.
[32mINFO[0m:     Uvicorn running on [1mhttp://0.0.0.0:8000[0m (Press CTRL+C to quit)
^C
[32mINFO[0m:     Shutting down
[32mINFO[0m:     Waiting for application shutdown.
[32mINFO[0m:     Application shutdown complete.
[32mINFO[0m:     Finished server process [[36m2947[0m]


### Evaluation of code retrieval system

In [None]:
!python ../scripts/evaluate.py \
    --model "sentence-transformers/all-MiniLM-L12-v2" \
    --qdrant-host "qdrant" \
    --qdrant-port 6333 \
    --qdrant-collection "cosqa_test_bodies" \
    --K 3 

### Fine-tuning of the Model

In [1]:
!python ../scripts/train.py \
    --model "sentence-transformers/all-MiniLM-L12-v2" \
    --finetune-dir "../models" \
    --checkpoint-path "../checkpoint" \
    --assets-dir "../results/assets" \
    --qdrant-host "qdrant" \
    --qdrant-port 6333 \
    --qdrant-collection "cosqa_test_bodies" \
    --qdrant-collection-ft "cosqa_test_ft" \
    --K 3 \
    --batch-size 1 \
    --epochs 1 \
    --lr 2e-5 \
    --max-steps-per-epoch 0 \
    --seed 69 

[Data] Train: 19604 q / 19604 docs

[Train] Fine-tuning with MultipleNegativesRankingLoss (in-batch negatives)…
[Info] data in pairs ['python code to write bool value 1', 'def writeBoolean(self, n):\n        """\n        Writes a Boolean to the stream.\n        """\n        t = TYPE_BOOL_TRUE\n\n        if n is False:\n            t = TYPE_BOOL_FALSE\n\n        self.stream.write(t)']
[Info] data in pairs ['"python how to manipulate clipboard"', 'def paste(xsel=False):\n    """Returns system clipboard contents."""\n    selection = "primary" if xsel else "clipboard"\n    try:\n        return subprocess.Popen(["xclip", "-selection", selection, "-o"], stdout=subprocess.PIPE).communicate()[0].decode("utf-8")\n    except OSError as why:\n        raise XclipNotFound']
[Info] data in pairs ['python colored output to html', 'def _format_json(data, theme):\n    """Pretty print a dict as a JSON, with colors if pygments is present."""\n    output = json.dumps(data, indent=2, sort_keys=True)\n\n   
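
The log above names MultipleNegativesRankingLoss with in-batch negatives. The idea can be sketched in plain Python: for each (query, code) pair, the paired code is the positive and every other code snippet in the same batch is a negative, scored with cross-entropy over scaled cosine similarities. This is an illustrative re-implementation, not the library's code or the contents of `train.py`; the function name and toy 2-d vectors are made up, and the scale of 20 follows what appears to be the sentence-transformers default.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def in_batch_negatives_loss(queries, docs, scale=20.0):
    """Mean cross-entropy where docs[i] is the positive for queries[i]
    and every other doc in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, d) for d in docs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: each query points the same way as its paired doc.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [0.0, 1.0]]

aligned = in_batch_negatives_loss(queries, docs)
swapped = in_batch_negatives_loss(queries, docs[::-1])
print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.6f}")

# With a single pair there is nothing to contrast against, so the loss is 0.0
# even when the query and doc are orthogonal.
print(in_batch_negatives_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

One consequence worth checking: if `--batch-size 1` is passed straight through to the data loader (an assumption about `train.py`), each batch holds a single pair, so there are no in-batch negatives and a loss of this form stays at zero. A larger batch size would give the loss something to rank against.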

### Evaluation of Fine-tuned Model

In [2]:
!python ../scripts/evaluate.py \
    --model "../models" \
    --qdrant-host "qdrant" \
    --qdrant-port 6333 \
    --qdrant-collection "cosqa_test_bodies" \
    --K 3 

[Info] Prepared test split: 500 corpus docs, 500 queries. Missing labels: 0
Batches: 100%|████████████████████████████████████| 8/8 [02:13<00:00, 16.74s/it]
Batches: 100%|████████████████████████████████████| 8/8 [00:03<00:00,  2.14it/s]
[Timing] Encoded corpus in 133.95s, queries in 3.78s with ../models (dim=384).
[Index] Upserted 500 vectors into Qdrant collection 'cosqa_test_bodies' at qdrant:6333

=== CoSQA (test) — Qdrant Retrieval Metrics ===
Model: ../models
K: 3
Recall@3: 0.8940
MRR@3:    0.7870
nDCG@3:   0.8147
(Retrieval time for 500 queries: 0.82s)

--- Sample retrieved doc_ids for first few queries ---
Query q20105 -> hits: [(0, 'd20105'), (225, 'd20330'), (317, 'd20422'), (320, 'd20425'), (12, 'd20117')] ; relevant: [0]
Query q20106 -> hits: [(1, 'd20106'), (377, 'd20482'), (25, 'd20130'), (487, 'd20592'), (237, 'd20342')] ; relevant: [1]
Query q20107 -> hits: [(224, 'd20329'), (2, 'd20107'), (279, 'd20384'), (31, 'd20136'), (448, 'd20553')] ; relevant: [2]
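
To make the reported metrics concrete, the sketch below recomputes Recall@3, MRR@3, and nDCG@3 for just the three sample queries listed above, using their top-3 hits. It assumes each query has exactly one relevant document (consistent with the single-element `relevant` lists) and uses the standard binary-relevance definitions; whether `evaluate.py` computes them in exactly this way is not confirmed here, and the function name is hypothetical.

```python
import math


def metrics_at_k(ranked_ids, relevant_id, k):
    """Return (recall, reciprocal rank, nDCG) at k for a query with one relevant doc."""
    top_k = ranked_ids[:k]
    if relevant_id not in top_k:
        return 0.0, 0.0, 0.0
    rank = top_k.index(relevant_id) + 1  # 1-based position of the relevant doc
    # With a single relevant doc the ideal DCG is 1, so nDCG = 1 / log2(rank + 1).
    return 1.0, 1.0 / rank, 1.0 / math.log2(rank + 1)


# Top-3 hits and the relevant doc for each sample query in the output above.
samples = {
    "q20105": (["d20105", "d20330", "d20422"], "d20105"),
    "q20106": (["d20106", "d20482", "d20130"], "d20106"),
    "q20107": (["d20329", "d20107", "d20384"], "d20107"),
}

K = 3
per_query = [metrics_at_k(hits, rel, K) for hits, rel in samples.values()]
recall = sum(m[0] for m in per_query) / len(per_query)
mrr = sum(m[1] for m in per_query) / len(per_query)
ndcg = sum(m[2] for m in per_query) / len(per_query)

print(f"Recall@{K}: {recall:.4f}")  # 1.0000
print(f"MRR@{K}:    {mrr:.4f}")     # 0.8333
print(f"nDCG@{K}:   {ndcg:.4f}")    # 0.8770
```

These three queries all have their relevant document in the top 3, so their averages sit above the figures reported over the full 500-query test split.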
