DParsers-Grab-Core (v2.91)

Common core for site parsing with the Python Grab framework.

Install Python (pre-install)

Install Python 3.9

wget -O python.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash python.sh -b
rm python.sh
export PATH=~/miniconda3/bin:$PATH

Install pipenv (pip install pipenv)

Project install

Clone the project
In the project directory, run pipenv install
[Optional for Windows] Download and install curl

Running

Run pipenv shell
Run python main.py {SITE_CONFIG_FILE_NAME}

Base config .env description

All values must be strings:

APP_WORK_MODE: dev sets DEBUG level for all loggers; info is also supported; any other value sets ERROR level
APP_CAN_OUTPUT: True allows some important messages to be printed with print(...)
APP_LOG_FORMAT: log format (in Python logging format)
APP_LOG_DIR: log directory name
APP_LOG_DEBUG_FILE: log file name (output of the project's own code only)
APP_LOG_GRAB_FILE: log file name (Grab library output only)
APP_LOG_HTML_ERR: write the page HTML to the log when any exception occurs
APP_CACHE_ENABLED: enable page caching in your db (any value enables it)
APP_CACHE_DB_HOST: db host
APP_CACHE_DB_PORT: db port (default = 3306)
APP_CACHE_DB_TYPE: db type (supports mysql, mongo and some others; see the Grab docs)
APP_CACHE_DB_USER: db user
APP_CACHE_DB_PASS: db password
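
Because every value is a string, the code that reads them has to convert them itself. The following is only a minimal sketch of one way to interpret these settings, assuming they arrive as a plain dict of strings (for example os.environ). The function names resolve_log_level and read_cache_settings are hypothetical and not part of this project. Mapping info to the INFO level and treating an empty APP_CACHE_ENABLED as disabled are both assumptions.

```python
import logging


def resolve_log_level(work_mode):
    """Map APP_WORK_MODE to a logging level (hypothetical helper)."""
    # Assumption: 'info' maps to logging.INFO; anything unknown falls back to ERROR.
    levels = {"dev": logging.DEBUG, "info": logging.INFO}
    return levels.get(work_mode, logging.ERROR)


def read_cache_settings(env):
    """Return cache settings from a dict of strings, or None when caching is off."""
    # Assumption: a missing or empty APP_CACHE_ENABLED means caching is disabled.
    if not env.get("APP_CACHE_ENABLED"):
        return None
    return {
        "host": env.get("APP_CACHE_DB_HOST"),
        # Values are strings, so the port needs an explicit int conversion.
        "port": int(env.get("APP_CACHE_DB_PORT", "3306")),
        "type": env.get("APP_CACHE_DB_TYPE"),
        "user": env.get("APP_CACHE_DB_USER"),
        "password": env.get("APP_CACHE_DB_PASS"),
    }
```

Calling resolve_log_level(os.environ.get("APP_WORK_MODE", "")) would then give a value that can be passed to logging.basicConfig(level=...).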

Base config {site}.env description

APP_PARSER: name of the file that stores the parser logic (a class extending Spider)
APP_THREAD_COUNT: number of threads for grab.spider
APP_TRY_LIMIT: how many times the app can retry a failed task
APP_SAVER_CLASS: save in CSV or JSON format (or write your own saver) [CSV can crash on nested dicts]
APP_OUTPUT_CAT: file save mode: '' (empty, or property not defined) writes a single file; 'test' separates the result data into individual files by the 'test' result field
APP_OUTPUT_DIR: output dir
APP_OUTPUT_ENC: output encoding [default 'utf-8']
APP_SAVE_FIELDS_{NUMBER}: names of the fields to save to the file (other fields are dropped, even if parsed)
APP_COOKIE_NAME and APP_COOKIE_VALUE (both optional): set this cookie before all requests
SITE_URL_{NUMBER}: site URLs to parse
INPUT_URLS_FILENAME: *.txt file with a list of URLs (newline-separated) to load into self.links_todo in the parser class (for parsing from a simple list of links instead of dynamic XPath rules)
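
The {NUMBER} keys and the URL list file could be read roughly as follows. This is a sketch only, not the project's actual loader: collect_numbered and load_urls are hypothetical names, and it assumes that {NUMBER} is an integer suffix and that values should be ordered by that number.

```python
import re


def collect_numbered(env, prefix):
    """Return values of keys like PREFIX_1, PREFIX_2, ... ordered by their number."""
    # Assumption: {NUMBER} is a plain integer suffix after an underscore.
    pattern = re.compile(re.escape(prefix) + r"_(\d+)$")
    found = []
    for key, value in env.items():
        match = pattern.match(key)
        if match:
            found.append((int(match.group(1)), value))
    # Sort numerically so that SITE_URL_10 comes after SITE_URL_2.
    return [value for _, value in sorted(found)]


def load_urls(path, encoding="utf-8"):
    """Read a newline-separated URL list, skipping blank lines."""
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, collect_numbered(os.environ, "SITE_URL") would gather the start URLs, and load_urls(os.environ["INPUT_URLS_FILENAME"]) would produce a list suitable for self.links_todo.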
