Skip to content

THU-BPM/MR2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Obtain the Dataset

Usage

You can find the baselines and guidance at https://aistudio.baidu.com/datasetdetail/230144.

Notebook baseline: https://aistudio.baidu.com/projectdetail/6371316?sUid=2739660&shared=1&ts=1689818284921 baseline: https://aistudio.baidu.com/clusterprojectdetail/6549711

Data

Organization Structure

The organization structure of the dataset is as follows

.
├── dataset_items_test.json
├── dataset_items_train.json
├── dataset_items_val.json
├── test
├── train
└── val

We have merged the English and Chinese datasets together and you can split them if necessary

Labels 0, 1, 2 represent non-rumor, rumor, and unverified, respectively.

File Format

Images for each claim are stored in the img folder.

The img_html_news folder contains web pages and images retrieved based on the caption of each claim. The folder includes a direct_annotation.json file containing the following information:

{
      "img_link": "Link to the retrieved related image",
      "page_link": "Link to the retrieved web page",
      "domain": "Domain of the retrieved web page",
      "snippet": "Brief summary of the retrieved web page",
      "image_path": "Path to the retrieved image",
      "html_path": "Path to the retrieved web page",
      "page_title": "Title of the retrieved web page"
}

The inverse_search folder contains web pages found based on the images in the claims. The folder includes an inverse_annotation.json file containing the following information:

{
"entities": "Entities in the image of the claim", 
"entities_scores": "Scores of the entities in the image of the claim", 
"best_guess_lbl": "The most likely description of the image in the claim", 
"all_fully_matched_captions": "", 
"all_partially_matched_captions": "",
"fully_matched_no_text": "",
//The above three fields are values of the found web pages, which are a list. Each element in the list is a dictionary, formatted as follows:
	{
	"page_link": "Link to the retrieved web page", 
	"image_link": "Link to the retrieved image", 
	"html_path": "Path to the retrieved web page", 
	"title": "Title of the retrieved web page"
	}
}

Dataset Copyright

Our datasets are open sourced for commercial and academic use and licensed under CC-BY-SA 4.0 license.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published