Easy Customizable Scraper

An easily customizable scraping starter kit.

A general-purpose web scraping tool with built-in text analysis.

The following features help users get started with development.

Easy setup
Customizability
Text analysis function (tagging / visualization)

Features

Web scraping
Automatic language detection
Morphological analysis
Feature tagging algorithm (original)
2D map visualization technology (original)

Demo

https://mockers.io/scanner

Dependency

Docker

Install

Note that the build takes 1-2 hours.

docker build -t scanner .

or

./build.sh

Run

docker run --rm -it -v "$PWD":/usr/src/app \
--name scanner \
-e 'ENTRY_URL=http://recipe.hacarus.com/' \
-e 'ALLOW_RULE=/recipe/' \
-e 'IMAGE_XPATH=//*[@id="root"]/div/div/section/div/div/div[1]/figure/img' \
-e 'DOCUMENT_XPATH=//td/text()|//p/text()' \
-e 'PAGE_LIMIT=2000' \
-e 'EXCLUDE_REG=\d(年|月|日|時|分|秒|ｇ|\u4eba|\u672c|cm|ml|g|\u5206\u679a\u5ea6)|hacarusinc|allrightsreserved' \
scanner:latest /usr/src/app/entrypoint.sh

or

./run.sh

Parameters

Set the following environment variables on the Docker container.

If at least ENTRY_URL is set, the scraper automatically scans the pages and extracts their text. With no other options specified, it is tuned for curated media sites and can run fully automatically, for example to extract article text.
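As a rough illustration of that rule, the sketch below reads the variables from the environment and fails only when ENTRY_URL is missing. The `load_settings` helper and its fallback values are assumptions made for this example; the actual handling in entrypoint.sh and src/scraper.py is not shown in this README.

```python
import os


def load_settings(env=None):
    """Collect scraper settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    entry_url = env.get("ENTRY_URL")
    if not entry_url:
        # ENTRY_URL is the only variable this README marks as required.
        raise ValueError("ENTRY_URL must be set")

    return {
        "entry_url": entry_url,
        # Assumed fallback: None means "option not specified".
        "allow_rule": env.get("ALLOW_RULE"),
        "deny_rule": env.get("DENY_RULE"),
        "image_xpath": env.get("IMAGE_XPATH"),
        "document_xpath": env.get("DOCUMENT_XPATH"),
        "exclude_reg": env.get("EXCLUDE_REG"),
        # Assumed fallback: -1, which this README defines as "unlimited".
        "page_limit": int(env.get("PAGE_LIMIT", "-1")),
    }


# Passing a plain dict instead of os.environ keeps the example reproducible.
settings = load_settings({"ENTRY_URL": "http://recipe.hacarus.com/"})
```

Accepting a mapping as an argument makes the helper easy to exercise without touching the real process environment.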

Environment Variable	Description
ENTRY_URL	(Required) Top URL of the site where scanning starts. All pages are scanned automatically.
ALLOW_RULE	Allow filter rule for target URLs.
DENY_RULE	Deny filter rule for target URLs. Takes precedence over ALLOW_RULE.
IMAGE_XPATH	XPath of the image to retrieve from each page.
DOCUMENT_XPATH	XPath of the top node in the page from which text is extracted.
PAGE_LIMIT	Maximum number of pages to scan. -1 means unlimited.
EXCLUDE_REG	Regular expression for words to exclude from the morphological analysis output.
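The filter parameters can be pictured with a small sketch. It assumes that ALLOW_RULE, DENY_RULE, and EXCLUDE_REG are applied as regular-expression searches and that the deny rule is checked before the allow rule; the function names and the two sample URLs are hypothetical, and the real logic in src/scraper.py and src/tokenizer.py may differ.

```python
import re


def should_scan(url, allow_rule=None, deny_rule=None):
    """Decide whether a URL is scanned (assumed re.search semantics)."""
    if deny_rule and re.search(deny_rule, url):
        return False  # deny is checked first, so it wins over allow
    if allow_rule:
        return re.search(allow_rule, url) is not None
    return True  # no allow rule: every non-denied URL is scanned


def drop_excluded(words, exclude_reg=None):
    """Remove words that match EXCLUDE_REG from a list of tokens."""
    if not exclude_reg:
        return list(words)
    pattern = re.compile(exclude_reg)
    return [w for w in words if not pattern.search(w)]


# Values copied from the run command in this README.
ALLOW_RULE = "/recipe/"
EXCLUDE_REG = r"\d(年|月|日|時|分|秒|ｇ|\u4eba|\u672c|cm|ml|g|\u5206\u679a\u5ea6)|hacarusinc|allrightsreserved"

# Hypothetical URLs on the example site.
recipe_ok = should_scan("http://recipe.hacarus.com/recipe/12", ALLOW_RULE)
about_ok = should_scan("http://recipe.hacarus.com/about", ALLOW_RULE)

# Quantities such as "15分" or "200ml" and the site name are dropped.
kept = drop_excluded(["大根", "15分", "200ml", "hacarusinc", "トマト"], EXCLUDE_REG)
```

With these inputs only the recipe URL passes the allow rule, and only the ingredient names survive the exclusion pattern.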

Result

result/res.json
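A minimal sketch for inspecting the result file follows. It relies only on the `scatter` and `user_meta` keys that the custom/_finalizer.py example in this README uses; the rest of the structure of res.json is not documented here, and `load_user_meta` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_user_meta(path="result/res.json"):
    """Return the user_meta dict of every page listed in the result file."""
    res = json.loads(Path(path).read_text(encoding="utf-8"))
    # Missing keys fall back to empty values instead of raising.
    return [page.get("user_meta", {}) for page in res.get("scatter", [])]


# Only read the file when a scan has already produced it.
if Path("result/res.json").exists():
    for meta in load_user_meta():
        print(meta)
```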

Project structure

File	Description
src/scraper.py	Main scraping logic.
src/categorizer.py	Main algorithm to tag and visualize passages.
src/tokenizer.py	Main algorithm for morphological analysis.

Customize

custom/_formatter.py

Edit the XPath expressions to select the HTML nodes you need, as in the example below.

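def formatter(sel):
    # Custom per-page fields are collected in this dict.
    res = {}

    # Number of how-to-make steps: count the matching <td> text nodes and halve it.
    n_howtomake = len(sel.xpath('//*[@id="root"]/div/div/section/div/div/div[2]/div[1]/table[2]/tbody/tr/td/text()').extract()) // 2
    res["n_howtomake"] = n_howtomake

    return res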
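One way to check a formatter without running a full scan is to feed it a stand-in selector. The sketch below is only an illustration: `StubSelector` and `StubResult` are hypothetical classes that mimic the two calls the formatter makes (`xpath` and `extract`), and the six text nodes are invented sample data.

```python
HOWTO_XPATH = '//*[@id="root"]/div/div/section/div/div/div[2]/div[1]/table[2]/tbody/tr/td/text()'


def formatter(sel):
    """Same logic as the example formatter, with the XPath in a constant."""
    res = {}
    res["n_howtomake"] = len(sel.xpath(HOWTO_XPATH).extract()) // 2
    return res


class StubResult:
    """Mimics only the .extract() call used by formatter()."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        return list(self._values)


class StubSelector:
    """Hypothetical stand-in for the selector the scraper passes in."""

    def __init__(self, nodes_by_xpath):
        self._nodes = nodes_by_xpath

    def xpath(self, query):
        # Unknown queries yield an empty result, like a non-matching XPath.
        return StubResult(self._nodes.get(query, []))


# Six invented text nodes, i.e. three steps at two cells per step.
sel = StubSelector({HOWTO_XPATH: ["1", "Cut", "2", "Mix", "3", "Serve"]})
result = formatter(sel)
```

Here `result` holds a single `n_howtomake` entry with the value 3, and a selector with no matching nodes would give 0.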

custom/_finalizer.py

Edit the post-processing step to generate the output you need, as in the example below.

import pandas as pd

def finalizer(res):
    # Collect the custom fields (user_meta) of every scanned page.
    pages = [page["user_meta"] for page in res["scatter"]]
    df = pd.DataFrame(pages)

    # Pairwise correlation between the numeric custom fields.
    corr_df = df.loc[:, ["time", "n_howtomake", "n_components"]].corr()

    # Store the result under a new "analyzed" key.
    res["analyzed"] = {}
    res["analyzed"]["correlation"] = {}
    res["analyzed"]["correlation"]["time-n_howtomake"] = corr_df.loc["time", "n_howtomake"]

    return res
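To make the expected input and output of the finalizer concrete, here is a pandas-free sketch. The three pages are invented sample data, and `pearson` is a hypothetical standard-library helper standing in for `DataFrame.corr()`, whose default method is the Pearson coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented pages in the shape the finalizer reads: res["scatter"][i]["user_meta"].
res = {
    "scatter": [
        {"user_meta": {"time": 15, "n_howtomake": 3, "n_components": 6}},
        {"user_meta": {"time": 45, "n_howtomake": 5, "n_components": 8}},
        {"user_meta": {"time": 120, "n_howtomake": 8, "n_components": 12}},
    ]
}

pages = [page["user_meta"] for page in res["scatter"]]
r = pearson([p["time"] for p in pages], [p["n_howtomake"] for p in pages])

# Same output location as the finalizer example.
res["analyzed"] = {"correlation": {"time-n_howtomake": r}}
```

In this made-up data longer recipes have more steps, so the coefficient comes out close to 1.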

Contribution for example

http://recipe.hacarus.com/

If you cannot access the site, try opening it in a private (incognito) browser window.

Sample Result

Automatic tagging

result/tagged.csv

title	tag1	tag2	tag3
なすとトマトの中華和え(１５分)	なす	トマト	大葉
ぶりの照り焼き(45分)	両面	照り焼き	水気
おでん風煮(2時間)	大根	こんにゃく	竹輪
大根とツナのサラダ(15分)	ツナ	大根	わかめ
鶏の照り焼き丼(20分)	片栗粉	にんにく	れんこん
筑前煮(６０分)	れんこん	ごぼう	こんにゃく
白菜とわかめの酢の物(15分)	白菜	わかめ	しめじ
鮭のホイル焼き(25分)	玉ねぎ	しめじ	ピーマン
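The tagging output can be summarized with a few lines of Python. This sketch assumes that result/tagged.csv is comma-separated with the header row title, tag1, tag2, tag3, as the table above suggests; `count_tags` is a hypothetical helper, and the inline sample reuses three rows from that table.

```python
import csv
import io
from collections import Counter


def count_tags(lines):
    """Count how often each tag appears across the tag1..tag3 columns."""
    counts = Counter()
    for row in csv.DictReader(lines):
        for column in ("tag1", "tag2", "tag3"):
            tag = (row.get(column) or "").strip()
            if tag:
                counts[tag] += 1
    return counts


# Three rows copied from the sample table, in the assumed CSV layout.
sample = io.StringIO(
    "title,tag1,tag2,tag3\n"
    "おでん風煮(2時間),大根,こんにゃく,竹輪\n"
    "大根とツナのサラダ(15分),ツナ,大根,わかめ\n"
    "筑前煮(６０分),れんこん,ごぼう,こんにゃく\n"
)
counts = count_tags(sample)
```

For the real file, pass an open file object instead, for example `open("result/tagged.csv", newline="", encoding="utf-8")`.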

makotunes/easy-customizable-scraper
