feat: add generic support for converting HTML/JSON to RSS #12882

Merged
merged 13 commits into from
Aug 3, 2023
74 changes: 74 additions & 0 deletions docs/en/other.md
@@ -316,6 +316,80 @@ please refer to the [Notion API documentation](https://developers.notion.com/ref

<RouteEn author="sbilly" example="/sans/summit_archive" path="/sans/summit_archive" />

## Transformation

Pass a URL and transformation rules to convert HTML/JSON into RSS.

### HTML

Specify options (in query string format) in the `routeParams` parameter to extract data from HTML.

| Key             | Meaning                                                            | Accepted Values | Default                  |
| --------------- | ------------------------------------------------------------------ | --------------- | ------------------------ |
| `title`         | The title of the RSS feed                                          | `string`        | Extracted from `<title>` |
| `item`          | CSS selector for the HTML elements used as `item`                  | `string`        | html                     |
| `itemTitle`     | CSS selector, within `item`, for the element used as `title`       | `string`        | `item` element           |
| `itemTitleAttr` | Attribute of the `title` element to use as the title               | `string`        | Element text             |
| `itemLink`      | CSS selector, within `item`, for the element used as `link`        | `string`        | `item` element           |
| `itemLinkAttr`  | Attribute of the `link` element to use as the link                 | `string`        | `href`                   |
| `itemDesc`      | CSS selector, within `item`, for the element used as `description` | `string`        | `item` element           |
| `itemDescAttr`  | Attribute of the `description` element to use as the description   | `string`        | Element html             |

<RouteEn author="ttttmr" example="/rsshub/transform/html/https%3A%2F%2Fwechat2rss.xlab.app%2Fposts%2Flist%2F/item=div%5Bclass%3D%27post%2Dcontent%27%5D%20p%20a" path="/rsshub/transform/html/:url/:routeParams" :paramsDesc="['URL address, encoded with `encodeURIComponent`', 'Transformation rules, URL-encoded']" selfhost="1">

The parameters in the above example parse as follows:

| Parameter     | Value                                     |
| ------------- | ----------------------------------------- |
| `url`         | `https://wechat2rss.xlab.app/posts/list/` |
| `routeParams` | `item=div[class='post-content'] p a`      |

The `routeParams` parameter parses as follows:

| Parameter | Value                           |
| --------- | ------------------------------- |
| `item`    | `div[class='post-content'] p a` |

</RouteEn>
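
A minimal sketch of how a route path like the example above could be assembled. `buildTransformPath` is a hypothetical helper and not part of RSSHub; it only assumes the `key=value` query string shape shown in the tables.

```javascript
// Hypothetical helper, not part of RSSHub: assembles a transform route path.
function buildTransformPath(kind, targetUrl, rules) {
    // Encode each rule value on its own so the `key=value&key=value` shape is preserved.
    const routeParams = Object.entries(rules)
        .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
        .join('&');
    // The target URL has to fit into a single path segment, so it is encoded as a whole.
    return `/rsshub/transform/${kind}/${encodeURIComponent(targetUrl)}/${routeParams}`;
}

const htmlPath = buildTransformPath('html', 'https://wechat2rss.xlab.app/posts/list/', {
    item: "div[class='post-content'] p a",
});
console.log(htmlPath);
```

`encodeURIComponent` leaves `'` and `-` unescaped, so the printed path is not character-for-character identical to the documented example (which uses `%27` and `%2D`), but both decode to the same selector.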

### JSON

Specify options (in query string format) in the `routeParams` parameter to extract data from JSON.

| Key         | Meaning                                           | Accepted Values | Default                                           |
| ----------- | ------------------------------------------------- | --------------- | ------------------------------------------------- |
| `title`     | The title of the RSS feed                         | `string`        | Extracted from the home page of the current domain |
| `item`      | JSON Path of the value used as `item` elements    | `string`        | Entire JSON response                              |
| `itemTitle` | JSON Path, within `item`, of the `title`          | `string`        | None                                              |
| `itemLink`  | JSON Path, within `item`, of the `link`           | `string`        | None                                              |
| `itemDesc`  | JSON Path, within `item`, of the `description`    | `string`        | None                                              |

::: tip Note

JSON Path only supports the form `a.b.c`. To access arrays, for example `a[0].b`, write it as `a.0.b`.

:::
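
The dotted form works because an array index is just another key. A small sketch of such a lookup follows; `getByPath` and the sample object are made up for illustration, and the route's own lookup is the `jsonGet` function in `lib/v2/rsshub/transform/json.js`.

```javascript
// Hypothetical helper for illustration only.
function getByPath(data, path) {
    // Walk one key at a time; array indices are ordinary keys such as "0".
    return path
        .split('.')
        .reduce((current, key) => (current === null || current === undefined ? undefined : current[key]), data);
}

const sample = { a: [{ b: 'first' }, { b: 'second' }] }; // made-up data

console.log(getByPath(sample, 'a.0.b')); // first
console.log(getByPath(sample, 'a.1.b')); // second
console.log(getByPath(sample, 'a.5.b')); // undefined, because the index does not exist
```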

<RouteEn author="ttttmr" example="/rsshub/transform/json/https%3A%2F%2Fapi.github.com%2Frepos%2Fginuerzh%2Fgost%2Freleases/title=Gost%20releases&itemTitle=tag_name&itemLink=html_url&itemDesc=body" path="/rsshub/transform/json/:url/:routeParams" :paramsDesc="['URL address, encoded with `encodeURIComponent`', 'Transformation rules, URL-encoded']" selfhost="1">

The parameters in the above example parse as follows:

| Parameter     | Value                                                                    |
| ------------- | ------------------------------------------------------------------------ |
| `url`         | `https://api.github.com/repos/ginuerzh/gost/releases`                    |
| `routeParams` | `title=Gost releases&itemTitle=tag_name&itemLink=html_url&itemDesc=body` |

The `routeParams` parameter parses as follows:

| Parameter   | Value           |
| ----------- | --------------- |
| `title`     | `Gost releases` |
| `itemTitle` | `tag_name`      |
| `itemLink`  | `html_url`      |
| `itemDesc`  | `body`          |

</RouteEn>
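
The second table can be reproduced with the standard `URLSearchParams` class, which is also what the route handlers in this PR use to read `routeParams`. A short sketch, using the decoded string from the example:

```javascript
// The decoded `routeParams` value from the example above.
const routeParams = new URLSearchParams('title=Gost releases&itemTitle=tag_name&itemLink=html_url&itemDesc=body');

// Collect every key/value pair into a plain object for inspection.
const rules = Object.fromEntries(routeParams);
console.log(rules);

// Keys that were not supplied come back as null, which is when the defaults apply.
console.log(routeParams.get('item'));
```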

## Trending Search Keyword Aggregator

### Aggregated Keyword Tracker
74 changes: 74 additions & 0 deletions docs/other.md
@@ -1121,6 +1121,80 @@ When type is all, the category parameter does not support cost and free

<Route author="Fatpandac" example="/ems/apple/EZ319397281CN" path="/ems/apple/:id" :paramsDesc="['Apple mail item number']"/>

## Transformation

Pass a URL and transformation rules to convert HTML/JSON into RSS.

### HTML

Specify options in query string format in the `routeParams` parameter to control how data is extracted.

| Key             | Meaning                                                            | Accepted Values | Default                                          |
| --------------- | ------------------------------------------------------------------ | --------------- | ------------------------------------------------ |
| `title`         | The title of the RSS feed                                          | `string`        | Extracted from the `<title>` of the current page |
| `item`          | CSS selector for the HTML elements used as `item`                  | `string`        | html                                             |
| `itemTitle`     | CSS selector, within `item`, for the element used as `title`       | `string`        | `item` element                                   |
| `itemTitleAttr` | Attribute of the `title` element to use as the title               | `string`        | Element text                                     |
| `itemLink`      | CSS selector, within `item`, for the element used as `link`        | `string`        | `item` element                                   |
| `itemLinkAttr`  | Attribute of the `link` element to use as the link                 | `string`        | `href`                                           |
| `itemDesc`      | CSS selector, within `item`, for the element used as `description` | `string`        | `item` element                                   |
| `itemDescAttr`  | Attribute of the `description` element to use as the description   | `string`        | Element html                                     |

<Route author="ttttmr" example="/rsshub/transform/html/https%3A%2F%2Fwechat2rss.xlab.app%2Fposts%2Flist%2F/item=div%5Bclass%3D%27post%2Dcontent%27%5D%20p%20a" path="/rsshub/transform/html/:url/:routeParams" :paramsDesc="['URL address, must be URL-encoded', 'Transformation rules, must be URL-encoded']" selfhost="1">

The parameters in the above example parse as follows:

| Parameter      | Value                                     |
| -------------- | ----------------------------------------- |
| `:url`         | `https://wechat2rss.xlab.app/posts/list/` |
| `:routeParams` | `item=div[class='post-content'] p a`      |

The `routeParams` parameter parses as follows:

| Parameter | Value                           |
| --------- | ------------------------------- |
| `item`    | `div[class='post-content'] p a` |

</Route>

### JSON

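Specify options in query string format in the `routeParams` parameter to control how data is extracted.

| Key         | Meaning                                        | Accepted Values | Default                                                          |
| ----------- | ---------------------------------------------- | --------------- | ---------------------------------------------------------------- |
| `title`     | The title of the RSS feed                      | `string`        | Extracted from the `<title>` of the current domain's root page   |
| `item`      | JSON Path of the value used as `item` elements | `string`        | Entire JSON response                                             |
| `itemTitle` | JSON Path, within `item`, of the title         | `string`        | None                                                             |
| `itemLink`  | JSON Path, within `item`, of the link          | `string`        | None                                                             |
| `itemDesc`  | JSON Path, within `item`, of the description   | `string`        | None                                                             |

::: tip Note

JSON Path currently only supports the form `a.b.c`. To read from an array, for example `a[0].b`, write it as `a.0.b`.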

:::

<Route author="ttttmr" example="/rsshub/transform/json/https%3A%2F%2Fapi.github.com%2Frepos%2Fginuerzh%2Fgost%2Freleases/title=Gost%20releases&itemTitle=tag_name&itemLink=html_url&itemDesc=body" path="/rsshub/transform/json/:url/:routeParams" :paramsDesc="['URL address, must be URL-encoded', 'Transformation rules, must be URL-encoded']" selfhost="1">

The parameters in the above example parse as follows:

| Parameter      | Value                                                                    |
| -------------- | ------------------------------------------------------------------------ |
| `:url`         | `https://api.github.com/repos/ginuerzh/gost/releases`                    |
| `:routeParams` | `title=Gost releases&itemTitle=tag_name&itemLink=html_url&itemDesc=body` |

The `routeParams` parameter parses as follows:

| Parameter   | Value           |
| ----------- | --------------- |
| `title`     | `Gost releases` |
| `itemTitle` | `tag_name`      |
| `itemLink`  | `html_url`      |
| `itemDesc`  | `body`          |

</Route>

## Ziroom

### Listings
2 changes: 1 addition & 1 deletion lib/maintainer.js
@@ -5,7 +5,7 @@ const { join } = require('path');
 // Presence Check
 for (const dir of fs.readdirSync(dirname)) {
     const dirPath = join(dirname, dir);
-    if (!fs.existsSync(join(dirPath, 'maintainer.js'))) {
+    if (fs.existsSync(join(dirPath, 'router.js')) && !fs.existsSync(join(dirPath, 'maintainer.js'))) {
         throw Error(`No maintainer.js in "${dirPath}".`);
     }
 }
13 changes: 13 additions & 0 deletions lib/v2/altervista/radar.js
@@ -0,0 +1,13 @@
module.exports = {
    'altervista.org': {
        _name: 'Altervista',
        hyp3rlinx: [
            {
                title: 'hyp3rlinx blog',
                docs: 'https://docs.rsshub.app/',
                source: ['/'],
                target: '/rsshub/transform/html/http%3A%2F%2Fhyp3rlinx.altervista.org%2F/item=table[border=%221%22]%20tr%20td%20a',
            },
        ],
    },
};
2 changes: 2 additions & 0 deletions lib/v2/rsshub/maintainer.js
@@ -1,4 +1,6 @@
 module.exports = {
     '/routes/:lang?': ['DIYgod'],
     '/rsshub/sponsors': ['DIYgod'],
+    '/transform/html/:url/:routeParams': ['ttttmr'],
+    '/transform/json/:url/:routeParams': ['ttttmr'],
 };
4 changes: 2 additions & 2 deletions lib/v2/rsshub/router.js
@@ -1,6 +1,6 @@
module.exports = (router) => {
    router.get('/rss', require('./routes')); // deprecated

    router.get('/routes/:lang?', require('./routes'));
    router.get('/sponsors', require('./sponsors'));
    router.get('/transform/html/:url/:routeParams', require('./transform/html'));
    router.get('/transform/json/:url/:routeParams', require('./transform/json'));
};
75 changes: 75 additions & 0 deletions lib/v2/rsshub/transform/html.js
@@ -0,0 +1,75 @@
const got = require('@/utils/got');
const cheerio = require('cheerio');
const config = require('@/config').value;

module.exports = async (ctx) => {
    if (!config.feature.allow_user_supply_unsafe_domain) {
        ctx.throw(403, `This RSS is disabled unless 'ALLOW_USER_SUPPLY_UNSAFE_DOMAIN' is set to 'true'.`);
    }
    const { url } = ctx.params;
    const response = await got({
        method: 'get',
        url,
    });

    const routeParams = new URLSearchParams(ctx.params.routeParams);
    const $ = cheerio.load(response.data);
    const rssTitle = routeParams.get('title') ? routeParams.get('title') : $('title').text();
    const item = routeParams.get('item') ? routeParams.get('item') : 'html';
    const items = $(item)
        .toArray()
        .map((item) => {
            try {
                item = $(item);

                let title;
                const titleEle = routeParams.get('itemTitle') ? item.find(routeParams.get('itemTitle')) : item;
                if (routeParams.get('itemTitleAttr')) {
                    title = titleEle.attr(routeParams.get('itemTitleAttr'));
                } else {
                    title = titleEle.text();
                }

                let link;
                const linkEle = routeParams.get('itemLink') ? item.find(routeParams.get('itemLink')) : item;
                if (routeParams.get('itemLinkAttr')) {
                    link = linkEle.attr(routeParams.get('itemLinkAttr'));
                } else {
                    if (linkEle.is('a')) {
                        link = linkEle.attr('href');
                    } else {
                        link = linkEle.find('a').attr('href');
                    }
                }
                // Complete relative links into absolute ones.
                // trim() throws when no link was found; the catch below then drops the item.
                link = link.trim();
                if (link && !link.startsWith('http')) {
                    link = `${new URL(url).origin}${link}`;
                }

                let desc;
                const descEle = routeParams.get('itemDesc') ? item.find(routeParams.get('itemDesc')) : item;
                if (routeParams.get('itemDescAttr')) {
                    desc = descEle.attr(routeParams.get('itemDescAttr'));
                } else {
                    desc = descEle.html();
                }

                return {
                    title,
                    link,
                    description: desc,
                };
            } catch (e) {
                return null;
            }
        })
        .filter(Boolean);

    ctx.state.data = {
        title: rssTitle,
        link: url,
        description: `Proxy ${url}`,
        item: items,
    };
};
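
Links extracted from a page are often relative and have to be made absolute before they go into the feed. As a hedged sketch of that step, the standard WHATWG `URL` constructor can resolve a link against the page it came from. `resolveLink` and the example URLs are hypothetical and not part of this PR.

```javascript
// Hypothetical helper for illustration; not the code used by the route.
function resolveLink(link, base) {
    // new URL(input, base) resolves relative input against base and leaves absolute input unchanged.
    return new URL(link.trim(), base).href;
}

const pageUrl = 'https://example.com/list/'; // made-up source page

console.log(resolveLink('/posts/1', pageUrl)); // https://example.com/posts/1
console.log(resolveLink('posts/1', pageUrl)); // https://example.com/list/posts/1
console.log(resolveLink(' https://other.example/x ', pageUrl)); // https://other.example/x
```

Resolving against the full page URL rather than only its origin also covers links without a leading slash. The constructor throws on input it cannot parse, so a caller would need to handle that case.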
57 changes: 57 additions & 0 deletions lib/v2/rsshub/transform/json.js
@@ -0,0 +1,57 @@
const got = require('@/utils/got');
const cheerio = require('cheerio');
const config = require('@/config').value;

function jsonGet(obj, attr) {
    if (typeof attr !== 'string') {
        return obj;
    }
    // Dot-separated path: a.b.c
    // Array access uses the same form: a.b[0].c => a.b.0.c
    attr.split('.').forEach((key) => {
        obj = obj[key];
    });
    return obj;
}

module.exports = async (ctx) => {
    if (!config.feature.allow_user_supply_unsafe_domain) {
        ctx.throw(403, `This RSS is disabled unless 'ALLOW_USER_SUPPLY_UNSAFE_DOMAIN' is set to 'true'.`);
    }
    const { url } = ctx.params;
    const response = await got({
        method: 'get',
        url,
    });

    const routeParams = new URLSearchParams(ctx.params.routeParams);
    let rssTitle = routeParams.get('title');
    if (!rssTitle) {
        // No title supplied: fall back to the <title> of the domain's home page
        const resp = await got({
            method: 'get',
            url: new URL(url).origin,
        });
        const $ = cheerio.load(resp.data);
        rssTitle = $('title').text();
    }

    const items = jsonGet(response.data, routeParams.get('item')).map((item) => {
        // `itemLink` is optional; without it jsonGet returns the item itself, which has no trim()
        const rawLink = routeParams.get('itemLink') ? jsonGet(item, routeParams.get('itemLink')) : '';
        let link = typeof rawLink === 'string' ? rawLink.trim() : '';
        // Complete relative links into absolute ones
        if (link && !link.startsWith('http')) {
            link = `${new URL(url).origin}${link}`;
        }
        return {
            title: jsonGet(item, routeParams.get('itemTitle')),
            link,
            description: routeParams.get('itemDesc') ? jsonGet(item, routeParams.get('itemDesc')) : '',
        };
    });

    ctx.state.data = {
        title: rssTitle,
        link: url,
        description: `Proxy ${url}`,
        item: items,
    };
};
13 changes: 13 additions & 0 deletions lib/v2/sec/radar.js
@@ -0,0 +1,13 @@
module.exports = {
    'sec.today': {
        _name: '每日安全', // "Daily Security"
        '.': [
            {
                title: '动态', // "Updates"
                docs: 'https://docs.rsshub.app/',
                source: ['/pulses', '/'],
                target: '/rsshub/transform/html/https%3A%2F%2Fsec.today%2Fpulses%2F/item=div[class="card-body"]',
            },
        ],
    },
};