# Crawling IROS 2024 from IEEE

## Environment setup

- Import the required libraries and modules.
- Set the ICRA link and the page count.


In [1]:
# Install dependencies from within Jupyter Notebook (requests is imported in the next cell)
!pip install selenium beautifulsoup4 webdriver-manager requests



In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
import json
import requests

# Entry page information
BASE_URL = "https://ieeexplore.ieee.org/xpl/conhome/10609961/proceeding"
ISNUMBER = "10609862"

# Maximum number of pages
MAX_PAGES = 50

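## Collecting paper titles and abstracts

- Set up WebDriver


IEEE loads its paper listings dynamically through AJAX requests, so the crawler can send the same kind of request directly to the server (emulating a Node.js-style client) and receive the data as JSON.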

In [3]:
# ICRA 2024 table-of-contents endpoint
url = "https://ieeexplore.ieee.org/rest/search/pub/10609961/issue/10609862/toc"
headers = {
    "accept": "application/json, text/plain, */*",
    "content-type": "application/json",
    "x-security-request": "required",
    "Referer": "https://ieeexplore.ieee.org/xpl/conhome/10609961/proceeding",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
all_papers = []
page = 1

# Stop at MAX_PAGES so the loop cannot run indefinitely
while page <= MAX_PAGES:
    data = {
        "sortType": "vol-only-seq",
        "punumber": "10609961",
        "isnumber": 10609862,
        "pageNumber": page
    }
    print(f"Fetching page {page}...")
    # The timeout keeps a stalled connection from hanging the cell
    response = requests.post(url, headers=headers, json=data, timeout=30)
    response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    json_data = response.json()
    records = json_data.get("records", [])
    if not records:
        print("Reached the last page, stopping.")
        break
    for record in records:
        title = record.get("articleTitle", "")
        abstract = record.get("abstract", "")
        all_papers.append({"title": title, "abstract": abstract})
    page += 1
    time.sleep(2)  # avoid being rate-limited

print(f"Done. Collected {len(all_papers)} papers in total.")


Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...


KeyboardInterrupt: 
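
The run above was interrupted by hand. As a minimal sketch, the same pagination logic can be separated from the network call so that it is capped at a page limit and can be exercised offline. The names `collect_papers` and `fake_fetch` are hypothetical and not part of the original notebook; `fake_fetch` only imitates the `records` / `articleTitle` / `abstract` fields that the cell above reads from the response.

```python
from typing import Callable, Dict, List


def collect_papers(fetch_page: Callable[[int], dict], max_pages: int) -> List[Dict[str, str]]:
    """Collect title/abstract pairs until a page is empty or max_pages is reached."""
    papers = []
    for page in range(1, max_pages + 1):
        records = fetch_page(page).get("records", [])
        if not records:
            break  # an empty page is treated as the end of the listing
        for record in records:
            papers.append({
                "title": record.get("articleTitle", ""),
                "abstract": record.get("abstract", ""),
            })
    return papers


# Hypothetical stand-in for the real POST request:
# two pages with one record each, then empty pages.
def fake_fetch(page: int) -> dict:
    pages = {
        1: {"records": [{"articleTitle": "Paper A", "abstract": "About a dataset."}]},
        2: {"records": [{"articleTitle": "Paper B"}]},
    }
    return pages.get(page, {"records": []})


demo = collect_papers(fake_fetch, max_pages=50)
print(len(demo))
```

In the notebook, `fetch_page` could be a small wrapper around the `requests.post` call from the cell above that returns `response.json()`.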

In [None]:
with open("all_papers.json", "w", encoding="utf-8") as f:
    json.dump(all_papers, f, ensure_ascii=False, indent=2)

Extract the papers that contain the keyword dataset from the JSON file.

In [None]:
with open('all_papers.json', 'r', encoding='utf-8') as f:
    papers = json.load(f)

results = []
for paper in papers:
    # Fall back to an empty string when a field is missing or null
    title = (paper.get('title') or '').lower()
    abstract = (paper.get('abstract') or '').lower()
    if 'dataset' in title or 'dataset' in abstract:
        results.append(paper)

with open('dataset_papers.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print(f"Found {len(results)} papers containing 'dataset'.")

for paper in results:
    print(f"Title: {paper.get('title')}")
    print(f"Abstract: {paper.get('abstract')}\n")

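## Crawling ICRA/IROS 2025 papers from arXiv

Because the ICRA/IROS 2025 papers have not yet been published on IEEE, the related papers are first crawled from arXiv.

## Environment setup

- Import packages and functions
- Set global variables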

In [None]:
!pip install arxiv

In [3]:
import arxiv
import json

# Search keywords (fuzzy match)
query = 'IROS AND 2025'

# Define the search
search = arxiv.Search(
    query=query,
    max_results=1000,  # raise this or paginate if more results are needed
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending,
)

# Collect results; Client.results replaces the deprecated Search.results
client = arxiv.Client()
results = []
for result in client.results(search):
    results.append({
        "title": result.title.strip(),
        "abstract": result.summary.strip().replace("\n", " "),
        "url": result.entry_id
    })

# Save as a JSON file
with open("/home2/qrchen/embodied-datasets/scripts/xintong/iros_2025_arxiv.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print(f"Crawled {len(results)} results, saved as iros_2025_arxiv.json")


Crawled 371 results, saved as iros_2025_arxiv.json


Filter these results down to the papers related to dataset/manipulation.

In [4]:
import json
import re

# Manipulation-related terms
manipulation_keywords = [
    "manipulation", "manipulations", "manipulate", "manipulated", "manipulating", "manipulative",
    "robot", "robotics",
    # "handle", "handling", "handled", "handles",
    # "grasp", "grasping", "grasped",
    # "pick", "picking", "picked",
    # "rearrange", "rearrangement", "rearranged",
    # "place", "placement", "placing",
    # "interact", "interaction", "interactive"
]

# Dataset-related terms
dataset_keywords = [
    "dataset", "datasets"
]

# Build whole-word, case-insensitive patterns; re.escape guards against regex metacharacters in keywords
manipulation_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, manipulation_keywords)) + r')\b', re.IGNORECASE)
dataset_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, dataset_keywords)) + r')\b', re.IGNORECASE)

# Read the JSON file
with open('/home2/qrchen/embodied-datasets/scripts/xintong/iros_2025_arxiv.json', 'r', encoding='utf-8') as f:
    articles = json.load(f)

# Keep articles whose title or abstract matches both keyword groups
filtered = []
for article in articles:
    title = article.get('title', '')
    abstract = article.get('abstract', '')
    text = title + ' ' + abstract
    if manipulation_pattern.search(text) and dataset_pattern.search(text):
        filtered.append(article)

# Write the filtered results
with open('/home2/qrchen/embodied-datasets/scripts/xintong/iros_2025_filtered_articles_dataset.json', 'w', encoding='utf-8') as f:
    json.dump(filtered, f, ensure_ascii=False, indent=2)

print(f"Filtered {len(filtered)} articles, saved to iros_2025_filtered_articles_dataset.json")

Filtered 37 articles, saved to iros_2025_filtered_articles_dataset.json
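
To make the behavior of the whole-word filter concrete, here is a minimal sketch that applies the same kind of pattern to a few made-up titles. The helper names `build_pattern` and `is_relevant` and the sample strings are hypothetical; the keyword lists are shortened versions of the ones in the cell above.

```python
import re

manipulation_keywords = ["manipulation", "manipulate", "robot", "robotics"]
dataset_keywords = ["dataset", "datasets"]


def build_pattern(keywords):
    """Compile a case-insensitive pattern that matches any keyword as a whole word."""
    return re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b', re.IGNORECASE)


manipulation_pattern = build_pattern(manipulation_keywords)
dataset_pattern = build_pattern(dataset_keywords)


def is_relevant(text):
    """Return True only if the text matches both keyword groups."""
    return bool(manipulation_pattern.search(text) and dataset_pattern.search(text))


samples = [
    "A Dataset for Robot Grasping",                     # both groups match
    "Robotic arms and a new dataset",                   # "Robotic" is not a listed whole word
    "Dataset distillation for image classification",    # no manipulation keyword
    "Robot navigation benchmark",                       # no dataset keyword
]
for text in samples:
    print(is_relevant(text), text)
```

Because of the `\b` word boundaries, a form such as "robotic" is not matched unless it is added to the keyword list explicitly, so the list may need more variants depending on how strict the filter should be.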
