update doc for v0.3, add zombie_clean.py @ring04h

Qianlitp · May 11, 2020 · 2e68d01
1 parent b7d5ce4
commit 2e68d01
Showing 3 changed files with 115 additions and 23 deletions.
diff --git a/README.md b/README.md
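@@ -14,6 +14,7 @@ crawlergo currently supports the following features:
 * Smart URL de-duplication that removes most duplicate requests
 * Comprehensive analysis and collection, covering JavaScript file contents, page comments, the robots.txt file, and fuzzing of common paths
 * Host binding support, with the Referer header added automatically
+* Request proxy support, and active pushing of crawl results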
 
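 Precompiled binaries are currently available for everyone to use. The project is part of a commercial product, so the source code cannot be open-sourced for now.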
 
@@ -34,12 +35,22 @@ crawlergo currently supports the following features:
 
 ### Go!
 
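-Assuming Chromium is installed in `/tmp/chromium/`, crawl the AWVS test site with at most 20 tabs open:
+Assuming Chromium is installed in `/tmp/chromium/`, crawl the AWVS test site with at most 10 tabs open: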
 
 ```shell
-./crawlergo -c /tmp/chromium/chrome -t 20 http://testphp.vulnweb.com/
+./crawlergo -c /tmp/chromium/chrome -t 10 http://testphp.vulnweb.com/
 ```
 
+
+
+### Using a proxy
+
+```shell
+./crawlergo -c /tmp/chromium/chrome -t 10 --request-proxy socks5://127.0.0.1:7891 http://testphp.vulnweb.com/
+```
+
+
+
 ### Calling crawlergo from a program
 
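 By default crawlergo prints the requests for the current domain, but in most cases we want to consume the results it returns from another program, so set the output mode to `json`. The following example calls it from Python and collects the results: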
@@ -76,7 +87,9 @@ if __name__ == '__main__':
 * `all_domain_list`: list of all domains discovered.
 * `sub_domain_list`: list of discovered subdomains of the task target.
 
-## Parameters
+
+
+## Full parameter reference
 
 crawlergo offers flexible parameter configuration. The options are described in detail below:
 
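@@ -86,10 +99,12 @@ crawlergo offers flexible parameter configuration. The options are described in detail below:
 * `--max-crawled-count Number, -m Number`   Maximum number of crawl tasks, which avoids long, pointless crawling caused by pseudo-static pages.
 * `--filter-mode Mode, -f Mode`   Filter mode. Simple: filters only static resources and exact duplicate requests. Smart: can also filter pseudo-static pages. Strict: applies stricter pseudo-static filtering rules.
 * `--output-mode value, -o value`   Result output mode. `console`: prints the results for the current domain. `json`: prints all results as a JSON-serialized string that can be deserialized directly. `none`: prints nothing.
+* `--output-json filepath` Serializes the crawl results to JSON and writes them to the given file.
 * `--incognito-context, -i`   Launches the browser in incognito mode.
 * `--max-tab-count Number, -t Number`   Maximum number of tabs the crawler keeps open at once, that is, the number of pages crawled concurrently.
 * `--fuzz-path`  Fuzzes the target with common paths to find more entry points.
-* `--robots-path` Parses paths from the robots.txt file to find more entry points.
+* `--robots-path` Parses paths from the /robots.txt file to find more entry points.
+* `--request-proxy proxyAddress` Supports **socks5** proxies. All network requests from crawlergo and the Chrome browser are sent through the proxy.
 * `--tab-run-timeout Timeout`   Maximum run time for a single tab.
 * `--wait-dom-content-loaded-timeout Timeout`  Maximum time the crawler waits for a page to finish loading.
 * `--event-trigger-interval Interval` Interval between automatically triggered events. Typically used when a slow target network or conflicting DOM updates cause URLs to be missed.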
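@@ -100,21 +115,9 @@ crawlergo offers flexible parameter configuration. The options are described in detail below:
 * `--push-pool-max` Maximum concurrency when sending crawl results to the listening address.
 * `--log-level` Log level. One of debug, info, warn, error, and fatal.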
 
-## Bypass headless detect
-
-https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
-
-![](./imgs/bypass.png)
-
-## About 360天相
 
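-crawlergo is a submodule of [**360天相**](https://skp.360.cn/). 天相 is an **asset management and threat detection system** developed in-house by 360. It focuses on strong asset identification and a comprehensive analysis framework, and offers high-efficiency scanning. Its core technology is supported by [360 0KeeTeam](https://0kee.360.cn/) and [360 RedTeam](http://redteam.360.cn/).
-
-![](./imgs/skp.png)
-
-For details, visit: [https://skp.360.cn/](https://skp.360.cn/)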
 
-## Recommended usage
+## Usage examples
 
 crawlergo returns the full set of request and URL information, which can be used in several ways:
 
@@ -136,11 +139,7 @@ crawlergo returns the full set of request and URL information, which can be used in several ways:
 
 * Scanning with cookies  [(see example)](https://github.com/0Kee-Team/crawlergo/blob/master/examples/request_with_cookie.py)
 
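-## // TODO
-
-* Support target inputs with different hosts
-* Support reading requests from a file as input
-* Output results to a message queue
+* Periodically cleaning up zombie processes left behind by crawlergo invocations [(see example)](https://github.com/0Kee-Team/crawlergo/blob/master/examples/zombie_clean.py), contributed by @ring04h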
 
 ## Trouble Shooting
 
@@ -166,7 +165,28 @@ crawlergo returns the full set of request and URL information, which can be used in several ways:
   sudo yum update nss -y
   ```
 
-
+
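+* A run reports a **navigation timeout** / the browser cannot be found / you do not know the correct **browser executable path**
+
+  Make sure the configured browser executable path is correct. Enter `chrome://version` in the address bar and look for the executable path: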
+
+  ![](./imgs/chrome_path.png)
+
+## Bypass headless detect
+
+https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
+
+![](./imgs/bypass.png)
+
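+## About 360天相
+
+crawlergo is a submodule of [**360天相**](https://skp.360.cn/). 天相 is an **asset management and threat detection system** developed in-house by 360. It focuses on strong asset identification and a comprehensive analysis framework, and offers high-efficiency scanning. Its core technology is supported by [360 0KeeTeam](https://0kee.360.cn/) and [360 RedTeam](http://redteam.360.cn/).
+
+![](./imgs/skp.png)
+
+For details, visit: [https://skp.360.cn/](https://skp.360.cn/)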
+
+
 
 ## Follow me
 

diff --git a/examples/zombie_clean.py b/examples/zombie_clean.py
@@ -0,0 +1,72 @@
+#!/usr/bin/python3
+# coding: utf-8
+
+"""
+    author: 猪猪侠 https://github.com/ring04h
+
+"""
+
+import logging
+import subprocess
+
+logging.basicConfig(level=logging.DEBUG)
+
+# 
+# (crontab -l;echo '0 2 * * * /usr/local/bin/python3 /data/script/zombie_clean.py') | crontab -
+# 
+
+def is_timeout(etime):
+    if '-' in etime:
+        day, hour = etime.split('-')
+        return True if int(day) >= 1 else False
+    else:
+        return False
+
+
+def cmdprocess(cmdline):
+
+    pipe = subprocess.Popen(cmdline, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+    output, stderr = pipe.communicate()
+    return_code = pipe.returncode
+    stderr = stderr.decode(errors='replace')
+    output = output.decode(errors='replace')
+    return output, stderr, return_code
+
+
+
+def main():
+
+    cmdline = "ps -ef | grep crawlergo | grep -v grep | awk '{print $2}'"
+    output, stderr, return_code = cmdprocess(cmdline)
+
+    if return_code != 0:
+        return
+
+    zombie_pids = output.splitlines()
+
+    for zombie_pid in zombie_pids:
+
+        cmdline = f"ps -o pid=,etime= -p {zombie_pid}"  # exact PID lookup; grep also matched substrings
+        ps_output, ps_stderr, ps_return_code = cmdprocess(cmdline)
+
+        if ps_return_code != 0:
+            continue
+
+        for line in ps_output.splitlines():
+
+            pid, etime = line.split()
+
+            status = is_timeout(etime)
+            logging.debug(f"PID: {pid:<8} ETIME: {etime:<15} TIMEOUT: {status}")
+
+            if not status: 
+                continue
+
+            kill_cmdline = f"kill -9 {pid}"
+            logging.debug(f"call kill : [{kill_cmdline}]")
+
+            cmdprocess(kill_cmdline)
+
+if __name__ == "__main__":
+    main()
+
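
The `is_timeout` helper in `zombie_clean.py` treats any `etime` value containing a `-` as at least one day old. As a minimal sketch of a more general check, assuming the `[[DD-]hh:]mm:ss` layout that `ps` documents for its `etime` column, the value can be converted to seconds and compared against an arbitrary threshold. The names `etime_to_seconds` and `is_older_than` are hypothetical and are not part of this commit.

```python
import re

# Layout documented by ps for the etime column: [[DD-]hh:]mm:ss
_ETIME_RE = re.compile(r"^(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)$")


def etime_to_seconds(etime: str) -> int:
    """Convert a ps etime string such as '2-03:04:05' to seconds."""
    match = _ETIME_RE.match(etime.strip())
    if match is None:
        raise ValueError(f"unrecognised etime value: {etime!r}")
    # Missing day or hour fields come back as None and count as zero.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def is_older_than(etime: str, max_age_seconds: int) -> bool:
    """Return True when the process has run for at least max_age_seconds."""
    return etime_to_seconds(etime) >= max_age_seconds
```

With a threshold of `86400` seconds, `is_older_than` behaves like the one-day cutoff used by the script, while smaller thresholds allow shorter-lived stuck processes to be caught as well.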
diff --git a/imgs/chrome_path.png b/imgs/chrome_path.png
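
The README changes in this commit document the `--request-proxy` and `--output-json` options. The following is a hedged sketch of how a Python caller might assemble such a command line. The helper names, the `./crawlergo` binary location, and the default values are assumptions for illustration. Only the flags themselves come from the README.

```python
import subprocess
from typing import List, Optional


def build_crawlergo_args(
    target: str,
    chrome_path: str,
    max_tabs: int = 10,
    proxy: Optional[str] = None,
    output_json: Optional[str] = None,
    binary: str = "./crawlergo",  # assumed location of the compiled binary
) -> List[str]:
    """Build an argument list using only flags listed in the README."""
    args = [binary, "-c", chrome_path, "-t", str(max_tabs)]
    if proxy:
        # The README states that socks5 proxies are supported.
        args += ["--request-proxy", proxy]
    if output_json:
        args += ["--output-json", output_json]
    args.append(target)  # the target URL comes last, as in the README examples
    return args


def run_crawl(args: List[str], timeout: Optional[float] = None) -> int:
    """Run crawlergo without a shell and return its exit code."""
    completed = subprocess.run(args, timeout=timeout)
    return completed.returncode
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the target URL or proxy address contains special characters.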