token_length calculation in LazySupervisedDataset

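The token_length calculation in LazySupervisedDataset looks wrong to me. When token_length is computed online, the value used the first time a given string length is seen is just the length of input_ids, but the value saved into the conv2length dictionary is the input_ids length plus the number of image tokens. As a result, the first sample appended to self.length does not include the image tokens, while later samples that hit the cache for the same string length do.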
```
if self.group_by_length:
    self.conv2length = {}  # Using a dictionary to speed up token length calculation
    self.length = []
    for data_item in self.raw_data:
        data_item = json.loads(data_item)
        if 'length' in data_item:
            token_length = data_item['length']  # Use precomputed length if available
        else:
            # Compute token length using the tokenizer
            conversations = '\n'.join([temp['value'] for temp in data_item['conversations']])
            str_length = len(conversations)
            if str_length not in self.conv2length:
                # Cache miss: token_length here is the text-only input_ids length
                token_length = tokenizer(
                    conversations, return_tensors='pt', padding=False, truncation=False,
                ).input_ids.size(1)
                # The cached value additionally includes the image tokens
                self.conv2length[str_length] = token_length + num_image_token * (
                            max_dynamic_patch + use_thumbnail)
            else:
                # Cache hit: token_length includes the image tokens
                token_length = self.conv2length[str_length]
        self.length.append(token_length)
```
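
One possible way to make the two paths agree is sketched below. It assumes the intended behavior is for every online-computed length to include the image tokens; the maintainers may have meant something else. `compute_lengths`, `count_tokens`, and `whitespace_token_count` are hypothetical names used only for illustration and are not part of the repository; the stand-in tokenizer just counts whitespace-separated words so the example runs without a model.

```python
import json


def compute_lengths(raw_data, count_tokens, num_image_token,
                    max_dynamic_patch, use_thumbnail):
    """Hypothetical helper: return one length per sample, always including
    image tokens when the length is computed online."""
    conv2length = {}  # cache keyed by string length, as in the original snippet
    lengths = []
    image_tokens = num_image_token * (max_dynamic_patch + int(use_thumbnail))
    for line in raw_data:
        item = json.loads(line)
        if 'length' in item:
            token_length = item['length']  # precomputed length is used as-is
        else:
            conversations = '\n'.join(t['value'] for t in item['conversations'])
            str_length = len(conversations)
            if str_length not in conv2length:
                # Store text tokens plus image tokens in the cache ...
                conv2length[str_length] = count_tokens(conversations) + image_tokens
            # ... and read the value back from the cache on both paths,
            # so cache misses and cache hits yield the same number.
            token_length = conv2length[str_length]
        lengths.append(token_length)
    return lengths


def whitespace_token_count(text):
    """Stand-in for a real tokenizer (assumption for this sketch)."""
    return len(text.split())


raw = [
    json.dumps({'conversations': [{'value': 'a b c'}]}),
    json.dumps({'conversations': [{'value': 'd e f'}]}),
    json.dumps({'length': 7, 'conversations': []}),
]
lengths = compute_lengths(raw, whitespace_token_count,
                          num_image_token=2, max_dynamic_patch=3,
                          use_thumbnail=True)
print(lengths)  # [11, 11, 7]
```

With the logic quoted in this issue, the same input would give `[3, 11, 7]`: the first sample misses the cache and gets the text-only length, while the second hits the cache and gets the length with image tokens added.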

Issue #1205: token_length calculation in LazySupervisedDataset

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

LazySupervisedDataset的token_length计算 #1205

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions