Merge pull request #1 from Pingid/major-refactor
Major refactor
Pingid committed Oct 16, 2021
2 parents 28401c9 + d558de1 commit 4baca8e
Showing 25 changed files with 2,013 additions and 2,990 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/create-release.yml
@@ -0,0 +1,37 @@
name: Create release

on:
  push:
    branches:
      - main

jobs:
  create_release:
    name: Create release
    runs-on: ubuntu-20.04
    steps:
      - name: checkout
        uses: actions/checkout@v2
      - name: setup node version
        uses: actions/setup-node@v2
        with:
          cache: 'yarn'
          node-version: '16'
      - name: install dependencies
        run: yarn install
      - name: transpile typescript
        run: yarn build
      - name: run tests
        run: yarn test
      - name: Get latest release tag
        id: latest_release_tag
        uses: InsonusK/get-latest-release@v1.0.1
        with:
          myToken: ${{ github.token }}
          view_top: 1
      - name: Get current version
        id: current_version
        run: echo "::set-output name=version::$(node -e 'console.log(require(`./package.json`).version)')"
      - name: Create a new release
        uses: softprops/action-gh-release@v1
        if: ${{ steps.latest_release_tag.outputs.tag_name != steps.current_version.outputs.version }}
25 changes: 25 additions & 0 deletions .github/workflows/plublish.yml
@@ -0,0 +1,25 @@
name: Publish package

on:
  release:
    types: [created]

jobs:
  npm:
    runs-on: ubuntu-latest
    steps:
      - name: checkout
        uses: actions/checkout@v2
      - name: setup node version
        uses: actions/setup-node@v2
        with:
          cache: 'yarn'
          node-version: '16.x'
          registry-url: 'https://registry.npmjs.org'
      - name: install dependencies
        run: yarn install
      - name: transpile typescript
        run: yarn build
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
18 changes: 0 additions & 18 deletions .github/workflows/publish-npm.yml

This file was deleted.

25 changes: 25 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,25 @@
name: Test

on:
  push:
    branches:
      - '*'
      - '!master'

jobs:
  run_tests:
    runs-on: ubuntu-20.04
    steps:
      - name: checkout
        uses: actions/checkout@v2
      - name: setup node version
        uses: actions/setup-node@v2
        with:
          cache: 'yarn'
          node-version: '16'
      - name: install dependencies
        run: yarn install
      - name: transpile typescript
        run: yarn build
      - name: run tests
        run: yarn test
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
lib
es6
node_modules
tsconfig.tsbuildinfo
7 changes: 4 additions & 3 deletions .npmignore
@@ -1,5 +1,6 @@
.github
node_modules
src
.vscode
tsconfig.tsbuildinfo
tsconfig.json
tsconfig.json
node_modules
src
123 changes: 123 additions & 0 deletions README.md
@@ -0,0 +1,123 @@
# Shears

A declarative web scraping library that aims to provide an extensible set of tools for building complex data queries and web crawlers.

```typescript
import sh, { resolveP } from 'shears'

const article = sh({
  title: sh('h1', sh.text),
  content: sh('h1', sh.text),
  image: sh('img', sh.atr('src'))
})

const article_list = sh('#content', ['ul > li'], article)

await resolveP(article_list, '<html><...')
// [{ title: '...', content: '...', image: '...' }, ...]
```

The library works best in combination with fp-ts. However, `resolveP` returns a `Promise` instead of the `TaskEither` returned by the `run` function, which allows standalone use, although fp-ts is still required as a peer dependency.

## Usage

### Selectors

The `sh` function is the main building block. It accepts any number of arguments of type `string`, `[string]`, or `Shear<Node, Node>`, where the final argument may instead be of type `Shear<Node, T>`.

```typescript
sh('body > h1') // Shear<Node | Node[], Node<h1>>
sh('body > h1', 'span') // Shear<Node | Node[], Node<span>>
sh('body > ul', ['li']) // Shear<Node | Node[], Node<li>[]>
sh(['body > ul'], sh.text) // Shear<Node | Node[], string[]>
sh(['body > ul'], ['li'], sh.text) // Shear<Node | Node[], string[][]>
sh({ foo: sh.text }) // Shear<Node | Node[], { foo: string }>
sh([sh.text, sh.text]) // Shear<Node | Node[], [string, string]>
```

- `string`: Accepts a CSS query and returns the first matching DOM node, like `document.querySelector`.
- `[string]`: Accepts a CSS query and returns all matching DOM nodes, like `document.querySelectorAll`.

Each query in the argument list operates on the part of the DOM returned by the previous query. Arguments that follow a `[string]` query operate on each item in the returned list.
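
As a rough mental model of this chaining rule (not the library's implementation), the sketch below uses a hypothetical `El` tree type and hypothetical `first` and `all` helpers. They match on tag name only and stand in for real CSS queries.

```typescript
// Hypothetical, simplified model: elements are matched by tag name only.
type El = { tag: string; text?: string; children: El[] }

// Depth-first list of every descendant of a node.
const descendants = (node: El): El[] =>
  node.children.reduce<El[]>((acc, child) => acc.concat(child, descendants(child)), [])

// Like a `[string]` query: every matching descendant.
const all = (tag: string) => (node: El): El[] => descendants(node).filter((n) => n.tag === tag)

// Like a `string` query: the first matching descendant, if any.
const first = (tag: string) => (node: El): El | undefined => all(tag)(node)[0]

const doc: El = {
  tag: 'body',
  children: [
    {
      tag: 'ul',
      children: [
        { tag: 'li', text: 'one', children: [] },
        { tag: 'li', text: 'two', children: [] }
      ]
    }
  ]
}

// Roughly sh('ul', ['li'], sh.text): narrow to the first <ul>,
// then apply the final step to each <li> in the returned list.
const ul = first('ul')(doc)
const items = ul ? all('li')(ul).map((li) => li.text ?? '') : []
console.log(items) // ['one', 'two']
```

The point of the model is that a single-result step narrows the scope for the next step, while a list step causes everything after it to be mapped over each match.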

### Customizing Shears

A `Shear` is built on the fp-ts `ReaderTaskEither` type, so you can build your own selectors with its combinators.

```typescript
import { map } from 'fp-ts/ReaderTaskEither'
import { pipe } from 'fp-ts/function'

import sh, { run } from 'shears'

const trimText = pipe(
  sh.text,
  map((y) => y.trim())
)

run(sh('body > h1', trimText), `<h1> foo </h1>`) // TaskEither<never, 'foo'>
```
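
To illustrate why `map` composes this way, here is a dependency-free sketch. The `MiniShear` type and the `map` and `text` helpers are hypothetical stand-ins (a reader function that returns a promise), not the fp-ts or shears definitions, and error handling via `Either` is left out.

```typescript
// Hypothetical stand-in for a shear: reads from an input R and resolves to A.
type MiniShear<R, A> = (input: R) => Promise<A>

// Transform the result of a shear without changing how it reads its input.
const map =
  <A, B>(f: (a: A) => B) =>
  <R>(shear: MiniShear<R, A>): MiniShear<R, B> =>
  (input) =>
    shear(input).then(f)

// Stand-in for sh.text: here the "node" is just an object with a text field.
const text: MiniShear<{ text: string }, string> = (node) => Promise.resolve(node.text)

// Same shape as the trimText example: a new shear derived from an existing one.
const trimText = map((s: string) => s.trim())(text)

trimText({ text: ' foo ' }).then((result) => console.log(result)) // 'foo'
```

Because the derived shear still takes the same input, it can be dropped in anywhere the original one was accepted.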

### Crawling

The library provides a few shears to help with queries across multiple pages.

```typescript
import { map } from 'fp-ts/ReaderTaskEither'
import { pipe } from 'fp-ts/function'
import axios from 'axios'

import sh, { run, connect, goTo } from 'shears'

const connection = connect((url) => axios.get(url).then((x) => x.data))

run(
  sh({
    posts: sh(
      ['#post'],
      goTo(
        sh('a', sh.atr('href')),
        sh({
          title: sh('h1', sh.text)
        }),
        { connection }
      )
    )
  })
)
```

Often you want to follow relative links on a website, in which case the connection needs to know the current hostname. Shears provides a mechanism for passing context; we just need to change our connection implementation.

```typescript
const connection = connect(
  (url, ctx) =>
    (/^http/.test(url) ? axios.get(url) : axios.get(ctx.hostname + url)).then((x) => [
      x.data,
      { hostname: x.response.hostname }
    ]), // We return a tuple [{ html string }, { new context }]
  { hostname: '' } // Initial context state
)
```
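
The absolute-versus-relative check can also be expressed with the standard WHATWG `URL` constructor, which resolves a link against a base URL. This is a minimal sketch built around a hypothetical `resolveLink` helper that is not part of shears. It assumes the base is an absolute URL including its scheme.

```typescript
// Hypothetical helper: resolve a scraped href against the page it came from.
// Absolute hrefs come back unchanged; relative ones are joined to the base.
const resolveLink = (href: string, base: string): string => new URL(href, base).href

console.log(resolveLink('/posts/1', 'http://foo.bar')) // 'http://foo.bar/posts/1'
console.log(resolveLink('http://other.example/a', 'http://foo.bar')) // 'http://other.example/a'
```

A connection could keep the last fetched URL in its context and pass it as `base`, which also covers links such as `../page` that plain string concatenation would handle incorrectly.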

We can also pass our connection to the `run` function, which provides it in the shear context, so we don't need to pass it to each `goTo`.

```typescript
run(
  goTo(
    'http://foo.bar',
    sh({
      posts: sh(
        ['#post'],
        goTo(
          sh('a', sh.atr('href')),
          sh({
            title: sh('h1', sh.text)
          })
        )
      )
    })
  ),
  { connection }
)
```
78 changes: 57 additions & 21 deletions package.json
@@ -1,25 +1,68 @@
{
"name": "shears",
"version": "0.0.0-alpha.7.1",
"version": "0.0.0-alpha.8",
"main": "lib/index.js",
"license": "Public",
"module": "es6/index.js",
"typings": "lib/index.d.ts",
"license": "MIT",
"author": "Dan Beaven <dm.beaven@gmail.com>",
"description": "Functional web scraping in typescript",
"sideEffects": false,
"repository": {
"type": "git",
"url": "https://github.com/Pingid/shears.git"
},
"bugs": {
"url": "https://github.com/Pingid/shears/issues"
},
"homepage": "https://github.com/Pingid/shears",
"tags": [
"functional-programming",
"typescript",
"selector",
"crawler",
"scraper",
"parser",
"fp-ts",
"html"
],
"keywords": [
"functional-programming",
"typescript",
"selector",
"crawler",
"scraper",
"parser",
"fp-ts",
"html"
],
"scripts": {
"build": "tsc",
"watch": "yarn build --watch",
"build": "yarn build:es5 && yarn build:es6",
"build:es5": "tsc",
"build:es6": "tsc -p ./tsconfig-es6.json",
"test": "jest",
"prettier": "prettier --write \"src/**/*.{js,jsx,ts,tsx}\"",
"clean": "rm tsconfig.tsbuildinfo || true && rm -R ./lib || true",
"postversion": "git push --tags",
"postinstall": "yarn build"
"clean": "tsc --build --clean",
"format": "prettier --write ./src/*.ts",
"postversion": "git push --tags"
},
"dependencies": {
"css-select": "^4.1.3",
"domhandler": "^4.2.2",
"domutils": "^2.8.0",
"htmlparser2": "^7.1.2"
},
"devDependencies": {
"@types/domhandler": "^2.4.1",
"@types/jest": "^26.0.9",
"@types/node": "^14.6.0",
"jest": "^26.1.0",
"ts-jest": "^26.3.0",
"typescript": "^4.1.3"
"@types/domhandler": "^2.4.2",
"@types/jest": "^27.0.2",
"@types/node": "^16.11.0",
"fp-ts": "^2.11.4",
"jest": "^27.2.5",
"prettier": "^2.4.1",
"ts-jest": "^27.0.6",
"typescript": "^4.4.4"
},
"peerDependencies": {
"fp-ts": "^2.11.4"
},
"jest": {
"preset": "ts-jest",
@@ -30,12 +73,5 @@
"singleQuote": true,
"printWidth": 120,
"trailingComma": "none"
},
"dependencies": {
"css-select": "^3.1.2",
"domhandler": "^4.0.0",
"domutils": "^2.4.4",
"fp-ts": "^2.9.5",
"htmlparser2": "^6.0.0"
}
}
