Common core for site parsing with the Python Grab framework.
- Install Python 3.9
wget -O python.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash python.sh -b
rm python.sh
export PATH=~/miniconda3/bin:$PATH
- Install pipenv
pip install pipenv
- Clone project
- In the project directory, install the dependencies
pipenv install
- [Optional for Windows] Download and install curl
- Run
pipenv shell
- Run
python main.py {SITE_CONFIG_FILE_NAME}
All values must be strings:
- APP_WORK_MODE: 'dev' sets DEBUG mode for all loggers; 'info' is also supported; any other value sets the error level
- APP_CAN_OUTPUT: 'True' allows the app to print(...) some important messages
- APP_LOG_FORMAT: log format (in Python logger format)
- APP_LOG_DIR: log directory name
- APP_LOG_DEBUG_FILE: log file name (output of the project's own code only)
- APP_LOG_GRAB_FILE: log file name (output of the Grab library only)
- APP_LOG_HTML_ERR: write the page HTML to the log when any exception occurs
- APP_CACHE_ENABLED: enable page caching to your db (any value enables it)
- APP_CACHE_DB_HOST: db host
- APP_CACHE_DB_PORT: db port (default = 3306)
- APP_CACHE_DB_TYPE: db type (supports mysql, mongo and some others; see the Grab docs)
- APP_CACHE_DB_USER: db user
- APP_CACHE_DB_PASS: db password
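The logging-related values above could be interpreted roughly as follows. This is a minimal sketch, assuming the settings have already been loaded into a plain dict of strings; the helper names resolve_log_level and can_output are hypothetical and are not taken from the project.

```python
import logging


def resolve_log_level(settings: dict) -> int:
    """Map APP_WORK_MODE to a logging level (hypothetical helper)."""
    mode = settings.get("APP_WORK_MODE", "")
    if mode == "dev":
        return logging.DEBUG
    if mode == "info":
        return logging.INFO
    # Any other value (or a missing key) falls back to the error level.
    return logging.ERROR


def can_output(settings: dict) -> bool:
    """All values are strings, so compare against the literal 'True'."""
    return settings.get("APP_CAN_OUTPUT") == "True"


# Example settings; the format string is only an illustration.
settings = {
    "APP_WORK_MODE": "dev",
    "APP_CAN_OUTPUT": "True",
    "APP_LOG_FORMAT": "%(asctime)s %(levelname)s %(message)s",
}

logging.basicConfig(
    level=resolve_log_level(settings),
    format=settings["APP_LOG_FORMAT"],
)
if can_output(settings):
    print("logging configured")
```

Because every value is a string, the check compares against 'True' directly instead of relying on truthiness; a non-empty string such as 'False' would otherwise count as enabled.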
- APP_PARSER: name of the file that stores the parser logic (a class extending Spider)
- APP_THREAD_COUNT: number of threads for grab.spider
- APP_TRY_LIMIT: how many times the app can retry a failed task
- APP_SAVER_CLASS: save in CSV or JSON format (or write your own saver) [CSV can crash on nested dicts]
- APP_OUTPUT_CAT: file saving mode: '' (empty, or the property left undefined) writes a single file; 'test' splits the result data into separate files by the 'test' result field
- APP_OUTPUT_DIR: output directory
- APP_OUTPUT_ENC: output encoding [default 'utf-8']
- APP_SAVE_FIELDS_{NUMBER}: names of the fields to save to the file (other fields are dropped, even if parsed)
- APP_COOKIE_NAME and APP_COOKIE_VALUE (both optional): set this cookie before all requests
- SITE_URL_{NUMBER}: site URLs to parse
- INPUT_URLS_FILENAME: *.txt file with a list of URLs (newline-separated) to load into self.links_todo in a parser class (for parsing from a simple list of links instead of dynamic XPath rules)
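
The numbered keys (SITE_URL_{NUMBER}, APP_SAVE_FIELDS_{NUMBER}) and the URL list file could be handled along these lines. This is only a sketch, assuming the settings are available as a dict of strings; numbered_values, keep_fields and load_links are hypothetical helpers, and ordering the values by their numeric suffix is an assumption rather than documented behaviour.

```python
import re
from pathlib import Path


def numbered_values(settings: dict, prefix: str) -> list:
    """Collect values of keys like PREFIX_1, PREFIX_2, ... ordered by number."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    found = []
    for key, value in settings.items():
        match = pattern.match(key)
        if match:
            found.append((int(match.group(1)), value))
    found.sort(key=lambda pair: pair[0])
    return [value for _, value in found]


def keep_fields(record: dict, fields: list) -> dict:
    """Drop every parsed field that is not listed for saving."""
    return {name: record[name] for name in fields if name in record}


def load_links(filename: str, encoding: str = "utf-8") -> list:
    """Read a newline-separated URL list, skipping blank lines."""
    text = Path(filename).read_text(encoding=encoding)
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example settings; the URLs and field names are placeholders.
settings = {
    "SITE_URL_2": "https://example.com/b",
    "SITE_URL_1": "https://example.com/a",
    "APP_SAVE_FIELDS_1": "title",
    "APP_SAVE_FIELDS_2": "price",
}

urls = numbered_values(settings, "SITE_URL")
fields = numbered_values(settings, "APP_SAVE_FIELDS")
record = keep_fields(
    {"title": "Item", "price": "10", "raw_html": "<p>...</p>"}, fields
)
```

In a parser class, the result of load_links could then be assigned to self.links_todo when INPUT_URLS_FILENAME is set.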