I'm trying HNM via hn_mine.py, but the hard negatives are gibberish.

Hi, I'm trying to run hard negative mining (HNM) with hn_mine.py. My dataset looks like this:
```python
# sample.jsonl (120k rows)
{
    "query": "사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의가 있어야 하나, 법원 ...(omitted)
    "pos": "아닙니다. 사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의 ...(omitted)
}
```

```bash
# Embedder: bge-m3, downloaded from Hugging Face via git clone (BAAI/bge-m3).
# --embedder_model_class: tried both encoder-only-m3 and leaving the flag unset.
python hn_mine.py \
--input_file sample.jsonl \
--output_file sample_output.jsonl \
--range_for_sampling 2-30 \
--negative_number 5 \
--use_gpu_for_searching \
--embedder_name_or_path .../models/bge-m3 \
--embedder_model_class encoder-only-m3

However, the mined output contains only single-character hard negatives:
```python
{
    "query": "사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의가 있어야 하나, 법원 ...(omitted)
    "pos": "아닙니다. 사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의 ...(omitted)
    "neg": [
        "初",
        "샹",
        "듭",
        "試",
        "ち"
    ]
}
```

My dataset consists entirely of natural-language text, so these negatives cannot be real passages from it. How can I fix this?
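One thing that may be worth checking: in the sample above, `"pos"` is a bare string, while the FlagEmbedding fine-tuning data format describes `pos` (and `neg`) as lists of strings. If hn_mine.py builds its candidate corpus by extending a list with each row's `pos` (an assumption, not verified here), a bare string would be split into individual characters, which would match the single-character negatives. A minimal sketch of normalizing the file under that assumption; the helper names and the output filename in the usage note are hypothetical:

```python
import json


def wrap_as_list(value):
    """Return value as a list; a bare string becomes a one-item list."""
    if isinstance(value, str):
        return [value]
    return list(value)


def normalize_row(row):
    """Copy a row, making sure 'pos' (and 'neg', if present) are lists."""
    fixed = dict(row)
    fixed["pos"] = wrap_as_list(row["pos"])
    if "neg" in row:
        fixed["neg"] = wrap_as_list(row["neg"])
    return fixed


def normalize_file(src_path, dst_path):
    """Rewrite a JSONL file so every row has list-valued 'pos'/'neg'."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                row = normalize_row(json.loads(line))
                dst.write(json.dumps(row, ensure_ascii=False) + "\n")


# Why a bare string matters: extending a list with a string
# adds one item per character instead of one whole passage.
corpus = []
corpus.extend("사채권자")
```

Running `normalize_file("sample.jsonl", "sample_fixed.jsonl")` and then mining from the rewritten file would show whether the field type is the cause.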

I'm trying HNM via hn_mine.py, but the hard negatives are gibberish. #1388
