Merge pull request #1 from Pingid/major-refactor
Major refactor
Showing 25 changed files with 2,025 additions and 2,990 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
name: Publish package | ||
|
||
on: | ||
release: | ||
- created | ||
|
||
jobs: | ||
npm: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- name: checkout | ||
uses: actions/checkout@v2 | ||
- name: setup node version | ||
uses: actions/setup-node@v2 | ||
with: | ||
cache: 'yarn' | ||
node-version: '16.x' | ||
registry-url: 'https://registry.npmjs.org' | ||
- name: install dependencies | ||
run: yarn install | ||
- name: transpile typescript | ||
run: yarn build | ||
- name: Get current version | ||
id: current_version | ||
run: echo "::set-output name=version::$(node -e 'console.log(require(`./package.json`).version)')" | ||
- name: Get latest version | ||
id: latest_version | ||
run: echo "::set-output name=version::$(npm dist-tag ls | cut -d ' ' -f 2 | xargs echo)" | ||
- run: npm publish | ||
if: ${{ steps.current_version.outputs.version != steps.latest_version.outputs.version }} | ||
env: | ||
NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }} |
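The publish step above is gated on the `package.json` version differing from what `npm dist-tag ls` reports. The following is a minimal sketch of that comparison, assuming the usual `tag: version` output format (one tag per line). `parseDistTags`, `shouldPublish`, and `workflowLatestVersion` are hypothetical helper names and are not part of this commit. The last helper roughly mimics the shell pipeline in the workflow, which joins the versions of all dist-tags into one string, so the workflow's comparison is only meaningful while the package has a single dist-tag.

```typescript
// Hypothetical helpers for illustration; none of these exist in the commit.

/** Parse `npm dist-tag ls` output ("latest: 1.2.3" per line) into a tag -> version record. */
function parseDistTags(output: string): Record<string, string> {
  const tags: Record<string, string> = {}
  for (const line of output.split('\n')) {
    const match = /^(\S+):\s+(\S+)$/.exec(line.trim())
    if (match) tags[match[1]] = match[2]
  }
  return tags
}

/** Publish only when the local version differs from the version tagged `latest`. */
function shouldPublish(currentVersion: string, distTagOutput: string): boolean {
  return parseDistTags(distTagOutput)['latest'] !== currentVersion
}

/** Roughly mimics `cut -d ' ' -f 2 | xargs echo`: second field of each line, space-joined. */
function workflowLatestVersion(distTagOutput: string): string {
  return distTagOutput
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => line.split(' ')[1] ?? '')
    .join(' ')
}
```

Looking up the `latest` tag explicitly keeps the check stable if a second tag such as `beta` is ever added.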
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
name: Create release | ||
|
||
on: | ||
push: | ||
branches: | ||
- main | ||
|
||
jobs: | ||
create_release: | ||
name: Create release | ||
runs-on: ubuntu-20.04 | ||
steps: | ||
- name: checkout | ||
uses: actions/checkout@v2 | ||
with: | ||
persist-credentials: false | ||
- name: setup node version | ||
uses: actions/setup-node@v2 | ||
with: | ||
cache: 'yarn' | ||
node-version: '16' | ||
- name: install dependencies | ||
run: yarn install | ||
- name: transpile typescript | ||
run: yarn build | ||
- name: run tests | ||
run: yarn test | ||
- name: Get latest release tag | ||
id: latest_release_tag | ||
uses: InsonusK/get-latest-release@v1.0.1 | ||
with: | ||
myToken: ${{ github.token }} | ||
view_top: 1 | ||
- name: Get current version | ||
id: current_version | ||
run: echo "::set-output name=version::$(node -e 'console.log(require(`./package.json`).version)')" | ||
- name: Create a new release | ||
uses: softprops/action-gh-release@v1 | ||
if: ${{ steps.latest_release_tag.outputs.tag_name != steps.current_version.outputs.version }} | ||
with: | ||
tag_name: ${{ steps.current_version.outputs.version }} | ||
token: ${{ secrets.RELEASE_TOKEN }} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
name: Test | ||
|
||
on: | ||
push: | ||
branches: | ||
- '*' | ||
- '!master' | ||
|
||
jobs: | ||
run_tests: | ||
runs-on: ubuntu-20.04 | ||
steps: | ||
- name: checkout | ||
uses: actions/checkout@v2 | ||
- name: setup node version | ||
uses: actions/setup-node@v2 | ||
with: | ||
cache: 'yarn' | ||
node-version: '16' | ||
- name: install dependencies | ||
run: yarn install | ||
- name: transpile typescript | ||
run: yarn build | ||
- name: run tests | ||
run: yarn test |
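The branch filter in the test workflow combines a `*` pattern with a negated `!master` pattern. The following is a minimal sketch of how such filters are evaluated, assuming the documented GitHub Actions rules that `*` does not match `/`, `**` matches any characters, and a later matching pattern overrides an earlier one. `patternToRegExp` and `branchMatches` are hypothetical names, and only the `*`, `**`, and `!` syntax is modelled.

```typescript
// Hypothetical model of branch filter matching; a simplified subset, not GitHub's implementation.

/** Convert a filter pattern to a RegExp: `**` matches anything, `*` stops at `/`. */
function patternToRegExp(pattern: string): RegExp {
  // Escape regex metacharacters other than `*`.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&')
  const source = escaped.replace(/\*\*|\*/g, (m) => (m === '**' ? '.*' : '[^/]*'))
  return new RegExp(`^${source}$`)
}

/** Evaluate patterns in order; the last pattern that matches decides the outcome. */
function branchMatches(branch: string, patterns: string[]): boolean {
  let included = false
  for (const pattern of patterns) {
    const negated = pattern.charAt(0) === '!'
    const body = negated ? pattern.slice(1) : pattern
    if (patternToRegExp(body).test(branch)) included = !negated
  }
  return included
}

const filters = ['*', '!master']
console.log(branchMatches('feature', filters)) // true
console.log(branchMatches('master', filters)) // false
console.log(branchMatches('feature/login', filters)) // false: `*` does not cross `/`
console.log(branchMatches('main', filters)) // true
```

Under these assumptions the filter still includes `main`, the branch the release workflow builds from, and skips branch names that contain a slash.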
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
lib | ||
es6 | ||
node_modules | ||
tsconfig.tsbuildinfo |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,6 @@ | ||
.github | ||
node_modules | ||
src | ||
.vscode | ||
tsconfig.tsbuildinfo | ||
tsconfig.json | ||
tsconfig.json | ||
node_modules | ||
src |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
# Shears | ||
|
||
A declarative web scraping library that aims to provide an extendable set of tools for building complex type-safe queries and web crawlers. | ||
|
||
```typescript | ||
import sh, { resolveP } from 'shears' | ||
|
||
const article = sh({ | ||
title: sh('h1', sh.text), | ||
content: sh('h1', sh.text), | ||
image: sh('img', sh.atr('src')) | ||
}) | ||
|
||
const article_list = sh('#content', ['ul > li'], article) | ||
|
||
await resolveP(article_list, '<html><...') | ||
// [{ title: '...', content: '...', image: '...' }, ...] | ||
``` | ||
|
||
The library works best in combination with fp-ts. For standalone use, `resolveP` returns a `Promise` instead of the `TaskEither` returned by the `run` function, although fp-ts is still required as a peer dependency. | ||
|
||
## Usage | ||
|
||
### Selectors | ||
|
||
The `sh` function is the main building block. It accepts any number of arguments of type `string`, `[string]`, or `Shear<Node, Node>`, and the final argument can be of type `Shear<Node, T>`. | ||
|
||
```typescript | ||
sh('body > h1') // Shear<Node | Node[], Node<h1>> | ||
sh('body > h1', 'span') // Shear<Node | Node[], Node<span>> | ||
sh('body > ul', ['li']) // Shear<Node | Node[], Node<li>[]> | ||
sh(['body > ul'], sh.text) // Shear<Node | Node[], string[]> | ||
sh(['body > ul'], ['li'], sh.text) // Shear<Node | Node[], string[][]> | ||
sh({ foo: sh.text }) // Shear<Node | Node[], { foo: string }> | ||
sh([sh.text, sh.text]) // Shear<Node | Node[], [string, string]> | ||
``` | ||
|
||
- `string`: Accepts a CSS query and returns the first matching DOM node, like `document.querySelector`. | ||
- `[string]`: Accepts a CSS query and returns all matching DOM nodes, like `document.querySelectorAll`. | ||
|
||
Each query in the argument list operates on the part of the DOM returned by the previous query. Arguments that follow a `[string]` query operate on each item of the returned list. | ||
|
||
### Customizing Shears | ||
|
||
A "Shear" extends the fp-ts `ReaderTaskEither` type, so you can build your own selectors with its combinators. | ||
|
||
```typescript | ||
import { map } from 'fp-ts/ReaderTaskEither' | ||
import { pipe } from 'fp-ts/function' | ||
|
||
import sh, { run } from 'shears' | ||
|
||
const trimText = pipe( | ||
sh.text, | ||
map((y) => y.trim()) | ||
) | ||
|
||
run(sh('body > h1', trimText), `<h1> foo </h1>`) // TaskEither<never, 'foo'> | ||
``` | ||
|
||
### Crawling | ||
|
||
The library provides a few shears to help with queries across multiple pages. | ||
|
||
```typescript | ||
import { map } from 'fp-ts/ReaderTaskEither' | ||
import { pipe } from 'fp-ts/function' | ||
import axios from 'axios' | ||
|
||
import sh, { run, connect, goTo } from 'shears' | ||
|
||
const connection = connect((url) => axios.get(url).then((x) => x.data)) | ||
|
||
run( | ||
sh({ | ||
posts: sh( | ||
['#post'], | ||
goTo( | ||
sh('a', sh.atr('href')), | ||
sh({ | ||
title: sh('h1', sh.text) | ||
}), | ||
{ connection } | ||
) | ||
) | ||
}) | ||
) | ||
``` | ||
|
||
You will often want to follow relative links on a website, in which case the connection needs to know the current hostname. Shears provides a mechanism for passing context; we only need to change our connection implementation. | ||
|
||
```typescript | ||
const connection = connect( | ||
(url, ctx) => | ||
(/^http/.test(url) ? axios.get(url) : axios.get(ctx.hostname + url)).then((x) => [ | ||
x.data, | ||
{ hostname: x.response.hostname } | ||
]), // We return a tuple [{ html string }, { new context }] | ||
{ hostname: '' } // Initial context state | ||
) | ||
``` | ||
|
||
We can pass our connection to the `run` function, which provides it in the shear context, so we don't need to pass it to `goTo`. | ||
|
||
```typescript | ||
run( | ||
goTo( | ||
'http://foo.bar', | ||
sh({ | ||
posts: sh( | ||
['#post'], | ||
goTo( | ||
sh('a', sh.atr('href')), | ||
sh({ | ||
title: sh('h1', sh.text) | ||
}) | ||
) | ||
) | ||
}) | ||
), | ||
{ connection } | ||
) | ||
``` |
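The chaining rule described in the README's Selectors section (a `string` query yields the first match, a `[string]` query yields all matches, and later queries apply to each item of a list) can be modelled without the library. The following is a minimal sketch, assuming a simplified tree matched by tag name instead of CSS selectors. `TinyNode`, `select`, and `textOf` are hypothetical names and are not the Shears API.

```typescript
// Hypothetical model of the chaining rule; it does not use or reproduce the Shears API.

// Simplified stand-in for a DOM node.
interface TinyNode { tag: string; text: string; children: TinyNode[] }

// 'li' means "first match", ['li'] means "all matches".
type Query = string | [string]

/** All descendants with the given tag name, in document order. */
function findAll(root: TinyNode, tag: string): TinyNode[] {
  const out: TinyNode[] = []
  for (const child of root.children) {
    if (child.tag === tag) out.push(child)
    out.push(...findAll(child, tag))
  }
  return out
}

/** Apply one query; when the current value is a list, apply it to each item. */
function step(current: unknown, query: Query): unknown {
  if (current === undefined) return undefined
  if (Array.isArray(current)) return current.map((item) => step(item, query))
  const node = current as TinyNode
  return typeof query === 'string' ? findAll(node, query)[0] : findAll(node, query[0])
}

/** Run the queries left to right, each on the result of the previous one. */
function select(root: TinyNode, ...queries: Query[]): unknown {
  return queries.reduce<unknown>((acc, query) => step(acc, query), root)
}

/** Extract text from a node, or from every node in a (nested) list. */
function textOf(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => textOf(item))
  return (value as TinyNode | undefined)?.text
}

const el = (tag: string, text: string, children: TinyNode[] = []): TinyNode => ({ tag, text, children })
const body = el('body', '', [
  el('ul', '', [el('li', 'a'), el('li', 'b')]),
  el('ul', '', [el('li', 'c')])
])

console.log(JSON.stringify(textOf(select(body, 'ul', ['li'])))) // ["a","b"]
console.log(JSON.stringify(textOf(select(body, ['ul'], ['li'])))) // [["a","b"],["c"]]
console.log(JSON.stringify(textOf(select(body, ['ul'], 'li')))) // ["a","c"]
```

Each `[string]` query adds one level of array nesting to the result, which matches the `string[][]` type shown for `sh(['body > ul'], ['li'], sh.text)` in the README.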