From ad4d5162857ade2b6b7a309614e5ba58a5f0e468 Mon Sep 17 00:00:00 2001
From: xiangan
Date: Tue, 30 Dec 2025 19:28:04 +0800
Subject: [PATCH] docs: update README.md and add data card for OneVision
 Encoder training data

---
 README.md                          | 8 ++++----
 docs/{datacard.md => data_card.md} | 5 ++++-
 2 files changed, 8 insertions(+), 5 deletions(-)
 rename docs/{datacard.md => data_card.md} (94%)

diff --git a/README.md b/README.md
index a3fa235b..c70b0d67 100644
--- a/README.md
+++ b/README.md
@@ -14,9 +14,9 @@
 📝 **[Homepage](https://www.lmms-lab.com/onevision-encoder/index.html)**
 🤗 **[Models](https://huggingface.co/lmms-lab-encoder/onevision-encoder-large)** |
-🤗 **[Datasets](coming)** |
 📄 **[Tech Report (coming)]()** |
-📋 **[Model Card](docs/model_card.md)**
+📋 **[Model Card](docs/model_card.md)** |
+📊 **[Data Card](docs/data_card.md)**
@@ -283,7 +283,7 @@ cd eval_encoder
 Then run the following command:

 ```bash
-bash eval_encoder/shells_eval_ap/eval_ov_encoder_large_16frames.sh
+bash shells_eval_ap/eval_ov_encoder_large_16frames.sh
 ```

 **Sampling-Specific Parameters:**
@@ -320,8 +320,8 @@ torchrun --nproc_per_node=8 --master_port=29512 attentive_probe_codec.py \
 **Codec-Specific Parameters:**

 - `K_keep`: Number of patches to keep.
-- `cache_dir`: Directory for cached codec patches. This is where the codec-selected patches will be stored/loaded.
 - `mv_compensate`: Motion vector compensation method (e.g., `median`).
+- `cache_dir` (optional): Directory for cached codec patches. Use this to specify where codec-selected patches are stored/loaded when you want to persist or reuse them.

 #### Shared Parameters

diff --git a/docs/datacard.md b/docs/data_card.md
similarity index 94%
rename from docs/datacard.md
rename to docs/data_card.md
index 2aae3ea7..f017a659 100644
--- a/docs/datacard.md
+++ b/docs/data_card.md
@@ -1,8 +1,11 @@
 # Data Card: OneVision Encoder Training Data

+> **📦 Data Availability Notice:** The training data requires approximately **200TB** of storage. We are currently looking for suitable storage solutions. If you need access to the data immediately, please contact [anxiangsir@outlook.com](mailto:anxiangsir@outlook.com).
+
+
 ## Overview

-This document describes the datasets used for training OneVision Encoder. The training data consists of both image and video datasets, totaling approximately 754 million samples.
+This document describes the datasets used for training OneVision Encoder. The training data consists of both image and video datasets.

 ## Dataset Summary