<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Time渐行渐远</title>
<subtitle>Coding Changing The World</subtitle>
<link href="/atom.xml" rel="self"/>
<link href="http://dmlcoding.com/"/>
<updated>2017-12-12T03:43:50.000Z</updated>
<id>http://dmlcoding.com/</id>
<author>
<name>Hushiwei</name>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>Elasticsearch: using Curator alongside Marvel to delete old Marvel indices automatically</title>
<link href="http://dmlcoding.com/2017/EsDeleteMarvel/"/>
<id>http://dmlcoding.com/2017/EsDeleteMarvel/</id>
<published>2017-12-11T01:24:00.000Z</published>
<updated>2017-12-12T03:43:50.000Z</updated>
<content type="html"><![CDATA[<p>Marvel is standard equipment for almost every Elasticsearch user. The price of keeping its monitoring data is that,<br>by default, it creates a new index every day, named along the lines of .marvel-2017-12-10.<br>The indices Marvel creates amount to roughly 500 MB of data per day, so they keep piling up and take more and more disk space.<br>Is there a way to make them expire automatically, for example by keeping only the last two days of monitoring data and discarding the rest?</p><p>There is: curator can do it for you.<br><a id="more"></a></p><h1 id="curator是什么?"><a href="#curator是什么?" class="headerlink" title="curator是什么?"></a>What is curator?</h1><p>It is a command-line tool that helps you manage the indices in Elasticsearch: it can delete, close<br>and open them. That is only a partial description; the full documentation is at:<br><a href="https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html" target="_blank" rel="external">https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html</a></p><h1 id="实践"><a href="#实践" class="headerlink" title="实践"></a>In practice</h1><p>Our cluster runs Elasticsearch 2.1.1.<br>Following the official site, I installed the latest 5.x release of curator, and it reported a version mismatch.<br>Following the blog post at <a href="http://blog.csdn.net/hereiskxm/article/details/47423715" target="_blank" rel="external">http://blog.csdn.net/hereiskxm/article/details/47423715</a> I installed version 3.3.0.<br>That reported a mismatch as well.</p><p>After some searching, it seemed I needed a version in between, so I installed 4.0.0:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">pip install elasticsearch-curator==4.0.0</div></pre></td></tr></table></figure></p><p>Then I looked at the options this version provides:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line"> curator --help</div><div class="line">Usage: curator [OPTIONS] ACTION_FILE</div><div class="line"></div><div class="line">  Curator for Elasticsearch indices.</div><div class="line"></div><div class="line">  See http://elastic.co/guide/en/elasticsearch/client/curator/current</div><div 
class="line"></div><div class="line">Options:</div><div class="line">  --config PATH  Path to configuration file. Default: ~/.curator/curator.yml</div><div class="line">  --dry-run      Do not perform any changes.</div><div class="line">  --version      Show the version and exit.</div><div class="line">  --help         Show this message and exit.</div></pre></td></tr></table></figure><p>These look the same as the options of the latest 5.x release I had installed earlier. I also happened to find a configuration recipe on this page:<br><a href="https://stackoverflow.com/questions/33430055/removing-old-indices-in-elasticsearch/42268400#42268400" target="_blank" rel="external">https://stackoverflow.com/questions/33430055/removing-old-indices-in-elasticsearch/42268400#42268400</a></p><p>The 3.3.0 version from the earlier blog post, by contrast, is not compatible with this interface.</p><h2 id="用法"><a href="#用法" class="headerlink" title="用法"></a>Usage</h2><blockquote><p>Goal: delete the indices whose names start with .marvel and that are more than 2 days old.</p></blockquote><p>Create the directory /opt/curator:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">~ pwd</div><div class="line">/opt/curator</div><div class="line">~ ll</div><div class="line">total 12</div><div class="line">-rw-r--r-- 1 root root 184 Dec 12 10:48 config_file.yml</div><div class="line">-rw-r--r-- 1 root root 1311 Dec 12 10:37 delete_marvel_indices.yml</div><div class="line">drwxr-xr-x 2 root root 4096 Dec 12 10:49 logs</div></pre></td></tr></table></figure><h2 id="config-file-yml"><a href="#config-file-yml" class="headerlink" title="config_file.yml"></a>config_file.yml</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div 
class="line">13</div></pre></td><td class="code"><pre><div class="line"># Remember: this logfile has to be created in advance, otherwise curator fails on startup.</div><div class="line">vim config_file.yml</div><div class="line"></div><div class="line">---</div><div class="line">client:</div><div class="line">  hosts:</div><div class="line">    - 10.10.25.217</div><div class="line">  port: 9200</div><div class="line">logging:</div><div class="line">  loglevel: INFO</div><div class="line">  logfile: "/opt/curator/logs/actions.log"</div><div class="line">  logformat: default</div><div class="line">  blacklist: ['elasticsearch', 'urllib3']</div></pre></td></tr></table></figure><h2 id="delete-marvel-indices-yml"><a href="#delete-marvel-indices-yml" class="headerlink" title="delete_marvel_indices.yml"></a>delete_marvel_indices.yml</h2><p>Delete the indices that carry the .marvel prefix and are more than 2 days old:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div 
class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div></pre></td><td class="code"><pre><div class="line">---</div><div class="line"># Remember, leave a key empty if there is no value. None will be a string,</div><div class="line"># not a Python "NoneType"</div><div class="line">#</div><div class="line"># Also remember that all examples have 'disable_action' set to True. If you</div><div class="line"># want to use this action as a template, be sure to set this to False after</div><div class="line"># copying it.</div><div class="line">actions:</div><div class="line">  1:</div><div class="line">    action: delete_indices</div><div class="line">    description: >-</div><div class="line">      Delete indices older than 30 days (based on index name), for rc- prefixed indices.</div><div class="line">    options:</div><div class="line">      ignore_empty_list: True</div><div class="line">      timeout_override:</div><div class="line">      continue_if_exception: False</div><div class="line">      disable_action: False</div><div class="line">    filters:</div><div class="line">    - filtertype: pattern</div><div class="line">      kind: prefix</div><div class="line">      value: rc-</div><div class="line">      exclude:</div><div class="line">    - filtertype: age</div><div class="line">      source: name</div><div class="line">      direction: older</div><div class="line">      timestring: '%Y.%m.%d'</div><div class="line">      unit: days</div><div class="line">      unit_count: 30</div><div class="line">      exclude:</div><div class="line">  2:</div><div class="line">    action: delete_indices</div><div class="line">    description: >-</div><div class="line"></div><div class="line">      Delete indices older than 2 days (based on index name), for .marvel prefixed indices.</div><div class="line">    options:</div><div class="line">      ignore_empty_list: True</div><div class="line">      timeout_override:</div><div class="line">      
continue_if_exception: False</div><div class="line">      disable_action: False</div><div class="line">    filters:</div><div class="line">    - filtertype: pattern</div><div class="line">      kind: prefix</div><div class="line">      value: .marvel</div><div class="line">      exclude:</div><div class="line">    - filtertype: age</div><div class="line">      source: name</div><div class="line">      direction: older</div><div class="line">      timestring: '%Y.%m.%d'</div><div class="line">      unit: days</div><div class="line">      unit_count: 2</div><div class="line">      exclude:</div></pre></td></tr></table></figure></p><p>The configuration is done.</p><h2 id="执行命令"><a href="#执行命令" class="headerlink" title="执行命令"></a>Running the command</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">curator --config config_file.yml [--dry-run] delete_marvel_indices.yml</div></pre></td></tr></table></figure><p>Notes:</p><ol><li>--dry-run is optional. With it, nothing is actually deleted; curator only runs through the logic, and you can check the log to judge whether the result is what you expect.<br>Once it looks right, drop --dry-run and run the command again to perform the real deletion.</li><li>If the logfile parameter is not set in config_file.yml, the log is printed to the console.</li></ol><h1 id="配置日常任务"><a href="#配置日常任务" class="headerlink" title="配置日常任务"></a>Scheduling a daily job</h1><p>Clearly this should be automated so that it runs every day: write a script (saved as /opt/curator/delete_marvel_daily.sh, the path the crontab entry uses) and let crontab call it daily.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">#!/bin/bash</div><div class="line">set -e  # abort on failure, so "delete success" is only printed when curator succeeded</div><div class="line">curator --config /opt/curator/config_file.yml /opt/curator/delete_marvel_indices.yml</div><div class="line"></div><div class="line">echo "delete success"</div></pre></td></tr></table></figure><p>Configure crontab:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"># Run the delete script every day at 02:00</div><div class="line">0 2 * * * source /etc/profile;bash 
/opt/curator/delete_marvel_daily.sh > /opt/curator/delete.log 2>&1</div></pre></td></tr></table></figure>]]></content>
<summary type="html">
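<p>Marvel is standard equipment for almost every Elasticsearch user. The price of keeping its monitoring data is that,<br>by default, it creates a new index every day, named along the lines of .marvel-2017-12-10.<br>The indices Marvel creates amount to roughly 500 MB of data per day, so they keep piling up and take more and more disk space.<br>Is there a way to make them expire automatically, for example by keeping only the last two days of monitoring data and discarding the rest?</p>
<p>There is: curator can do it for you.<br>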
</summary>
<category term="elasticsearch" scheme="http://dmlcoding.com/categories/elasticsearch/"/>
<category term="elasticsearch" scheme="http://dmlcoding.com/tags/elasticsearch/"/>
<category term="bigdata" scheme="http://dmlcoding.com/tags/bigdata/"/>
</entry>
<entry>
<title>Python virtualenv: a summary</title>
<link href="http://dmlcoding.com/2017/PythonVirtualenv/"/>
<id>http://dmlcoding.com/2017/PythonVirtualenv/</id>
<published>2017-11-29T07:24:00.000Z</published>
<updated>2017-12-01T03:28:48.000Z</updated>
<content type="html"><![CDATA[<p>My computer is a MacBook Pro, and I have both Python 2.7 and Python 3.6 installed on it.</p><p>After installing virtualenv with pip, how do I get an environment that uses a particular Python version?</p><a id="more"></a><h1 id="virtualenv"><a href="#virtualenv" class="headerlink" title="virtualenv"></a>virtualenv</h1><h2 id="安装"><a href="#安装" class="headerlink" title="安装"></a>Installation</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">pip install virtualenv</div></pre></td></tr></table></figure><h2 id="virtualenv的参数"><a href="#virtualenv的参数" class="headerlink" title="virtualenv的参数"></a>virtualenv options</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div></pre></td><td class="code"><pre><div class="line">hushiwei@hsw ~/virtual virtualenv</div><div class="line">You must provide a DEST_DIR</div><div class="line">Usage: virtualenv [OPTIONS] DEST_DIR</div><div class="line"></div><div class="line">Options:</div><div class="line"> 
--version show program's version number and exit</div><div class="line"> -h, --help show this help message and exit</div><div class="line"> -v, --verbose Increase verbosity.</div><div class="line"> -q, --quiet Decrease verbosity.</div><div class="line"> -p PYTHON_EXE, --python=PYTHON_EXE</div><div class="line"> The Python interpreter to use, e.g.,</div><div class="line"> --python=python2.5 will use the python2.5 interpreter</div><div class="line"> to create the new environment. The default is the</div><div class="line"> interpreter that virtualenv was installed with</div><div class="line"> (/usr/local/opt/python3/bin/python3.6)</div><div class="line"> --clear Clear out the non-root install and start from scratch.</div><div class="line"> --no-site-packages DEPRECATED. Retained only for backward compatibility.</div><div class="line"> Not having access to global site-packages is now the</div><div class="line"> default behavior.</div><div class="line"> --system-site-packages</div><div class="line"> Give the virtual environment access to the global</div><div class="line"> site-packages.</div><div class="line"> --always-copy Always copy files rather than symlinking.</div><div class="line"> --unzip-setuptools Unzip Setuptools when installing it.</div><div class="line"> --relocatable Make an EXISTING virtualenv environment relocatable.</div><div class="line"> This fixes up scripts and makes all .pth files</div><div class="line"> relative.</div><div class="line"> --no-setuptools Do not install setuptools in the new virtualenv.</div><div class="line"> --no-pip Do not install pip in the new virtualenv.</div><div class="line"> --no-wheel Do not install wheel in the new virtualenv.</div><div class="line"> --extra-search-dir=DIR</div><div class="line"> Directory to look for setuptools/pip distributions in.</div><div class="line"> This option can be used multiple times.</div><div class="line"> --download Download preinstalled packages from PyPI.</div><div class="line"> 
--no-download, --never-download</div><div class="line"> Do not download preinstalled packages from PyPI.</div><div class="line"> --prompt=PROMPT Provides an alternative prompt prefix for this</div><div class="line"> environment.</div><div class="line"> --distribute DEPRECATED. Retained only for backward compatibility.</div><div class="line"> This option has no effect.</div></pre></td></tr></table></figure><p>As the help text shows, the --python option selects the Python version of the virtual environment, and the default here is Python 3.6.</p><h2 id="创建虚拟环境"><a href="#创建虚拟环境" class="headerlink" title="创建虚拟环境"></a>Creating virtual environments</h2><h3 id="virtual安装Python2-7"><a href="#virtual安装Python2-7" class="headerlink" title="virtual安装Python2.7"></a>A Python 2.7 environment with virtualenv</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">virtualenv --python=python python2env</div></pre></td></tr></table></figure><h3 id="virtual安装Python3-6环境"><a href="#virtual安装Python3-6环境" class="headerlink" title="virtual安装Python3.6环境"></a>A Python 3.6 environment with virtualenv</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">virtualenv --python=python3 python3env</div></pre></td></tr></table></figure><h2 id="激活进入虚拟环境"><a href="#激活进入虚拟环境" class="headerlink" title="激活进入虚拟环境"></a>Activating a virtual environment</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">source python2env/bin/activate</div><div class="line">source python3env/bin/activate</div></pre></td></tr></table></figure><h2 id="退出虚拟环境"><a href="#退出虚拟环境" class="headerlink" title="退出虚拟环境"></a>Leaving the virtual environment</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">deactivate</div></pre></td></tr></table></figure><h2 id="删除虚拟环境"><a href="#删除虚拟环境" class="headerlink" 
title="删除虚拟环境"></a>Deleting a virtual environment</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"># Removing the directory with rm is all it takes to delete the environment</div><div class="line">rm -rf VIRTUALENV_DIR_NAME</div></pre></td></tr></table></figure><h2 id="激活虚拟环境-将全部依赖写入文件"><a href="#激活虚拟环境-将全部依赖写入文件" class="headerlink" title="激活虚拟环境,将全部依赖写入文件"></a>Activate the environment and write all dependencies to a file</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">pip freeze > requirements.txt</div></pre></td></tr></table></figure><p>Inside the project, install all the dependencies:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">pip install -r requirements.txt</div></pre></td></tr></table></figure><h1 id="virtualenvwrapper"><a href="#virtualenvwrapper" class="headerlink" title="virtualenvwrapper"></a>virtualenvwrapper</h1><blockquote><p>virtualenvwrapper is an extension package for virtualenv that makes managing virtual environments more convenient.</p></blockquote><p>It can:</p><ol><li>Keep all virtual environments together in one directory</li><li>Manage (create, delete, copy) virtual environments</li><li>Switch between virtual environments</li></ol><h2 id="安装-1"><a href="#安装-1" class="headerlink" title="安装"></a>Installation</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">pip install virtualenvwrapper</div></pre></td></tr></table></figure><h2 id="使用方法"><a href="#使用方法" class="headerlink" title="使用方法"></a>Usage</h2><p>1. Initial configuration</p><p>By default virtualenvwrapper is installed under /usr/local/bin, but it only works once the virtualenvwrapper.sh file has been sourced;</p><p>so a little configuration is needed first:</p><p>1.1 Create the directory that will hold the virtual environments:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mkdir -p $HOME/.local/virtualenvs</div></pre></td></tr></table></figure><p>1.2 Add the following lines to ~/.bash_profile:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div 
class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line">export VIRTUALENV_USE_DISTRIBUTE=1 # always use pip/distribute</div><div class="line">export WORKON_HOME=$HOME/.local/virtualenvs # directory where all virtual environments are stored</div><div class="line">if [ -e $HOME/.local/bin/virtualenvwrapper.sh ];then</div><div class="line">    source $HOME/.local/bin/virtualenvwrapper.sh</div><div class="line">else if [ -e /usr/local/bin/virtualenvwrapper.sh ];then</div><div class="line">    source /usr/local/bin/virtualenvwrapper.sh</div><div class="line">fi</div><div class="line"> fi</div><div class="line">export PIP_VIRTUALENV_BASE=$WORKON_HOME</div><div class="line">export PIP_RESPECT_VIRTUALENV=true</div></pre></td></tr></table></figure><p>2. Usage</p><p>All commands can be listed with <code>virtualenvwrapper --help</code>; here are a few common ones:</p><ul><li><p>Create a basic environment: mkvirtualenv [environment name]</p></li><li><p>Delete an environment: rmvirtualenv [environment name]</p></li><li><p>Activate an environment: workon [environment name]</p></li><li><p>Leave the environment: deactivate</p></li><li><p>List all environments: workon or lsvirtualenv -b</p></li><li><p>When running mkvirtualenv, the -p option selects which Python interpreter to use</p></li></ul><p>3. Examples</p><p>Create a Python 2.7 environment:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mkvirtualenv -p python python2env</div></pre></td></tr></table></figure><p>Create a Python 3.6 environment:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mkvirtualenv -p python3 python3env</div></pre></td></tr></table></figure><p>List the virtual environments that now exist:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">hushiwei@hsw ~ workon</div><div 
class="line">python2env</div><div class="line">python3env</div></pre></td></tr></table></figure><p>Every command accepts the <code>--help</code> option for detailed usage. Enjoy it!</p>]]></content>
<summary type="html">
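<p>My computer is a MacBook Pro, and I have both Python 2.7 and Python 3.6 installed on it.</p>
<p>After installing virtualenv with pip, how do I get an environment that uses a particular Python version?</p>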
</summary>
<category term="python" scheme="http://dmlcoding.com/categories/python/"/>
<category term="python" scheme="http://dmlcoding.com/tags/python/"/>
</entry>
<entry>
<title>Machine Learning in Action: Naive Bayes</title>
<link href="http://dmlcoding.com/2017/NaiveBayes/"/>
<id>http://dmlcoding.com/2017/NaiveBayes/</id>
<published>2017-11-26T02:24:00.000Z</published>
<updated>2017-11-29T08:42:12.000Z</updated>
<content type="html"><![CDATA[<p>Naive Bayes uses prior knowledge to work out posterior probabilities. From the training set we already know the probability of each word within class 0 and class 1, i.e. p(w|c);<br>we use that knowledge to decide whether, given that a particular combination of words appears, the class is more likely to be 0 or 1, i.e. p(c|w), which by Bayes' rule is proportional to p(w|c)p(c).<br>With few training samples the estimate of p(w|c) is more likely to be inaccurate, so the more samples we have, the more we trust p(w|c).<br><a id="more"></a></p><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> os</div><div class="line"><span class="keyword">import</span> sys</div><div class="line"><span class="keyword">from</span> numpy <span class="keyword">import</span> *</div><div class="line">sys.path.append(os.getcwd())</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> bayes</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Load the sample posts and their class labels (abusive vs. not abusive)</span></div><div class="line">listOPosts,listClasses=bayes.loadDataSet()</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Build the vocabulary</span></div><div class="line"><span class="comment"># by de-duplicating the words found in the sample posts</span></div><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">createVocabList</span><span class="params">(dataSet)</span>:</span></div><div class="line">    vocabSet=set([])</div><div class="line">    <span class="keyword">for</span> document <span class="keyword">in</span> 
dataSet:</div><div class="line">        vocabSet=vocabSet|set(document)</div><div class="line">    <span class="keyword">return</span> list(vocabSet)</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">myVocabList=createVocabList(listOPosts)</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Convert words into a feature vector,</span></div><div class="line"><span class="comment"># i.e. turn each sample row into a feature vector.</span></div><div class="line"><span class="comment"># Each element is 1 or 0, indicating whether the corresponding vocabulary word appears in the input document</span></div><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">setOfWordsVec</span><span class="params">(vocabList,inputSet)</span>:</span></div><div class="line">    returnVec=[<span class="number">0</span>]*len(vocabList)</div><div class="line">    <span class="keyword">for</span> word <span class="keyword">in</span> inputSet:</div><div class="line">        <span class="keyword">if</span> word <span class="keyword">in</span> vocabList:</div><div class="line">            returnVec[vocabList.index(word)]=<span class="number">1</span></div><div class="line">        <span class="keyword">else</span>:<span class="keyword">print</span> <span class="string">"the word:%s is not in my Vocabulary!"</span> % word</div><div class="line">    <span class="keyword">return</span> returnVec</div></pre></td></tr></table></figure><h1 id="训练算法-从词向量计算概率"><a href="#训练算法-从词向量计算概率" class="headerlink" title="训练算法:从词向量计算概率"></a>Training the algorithm: computing probabilities from word vectors</h1><p>The training function of the naive Bayes classifier:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><div 
class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Inputs: the document matrix and the vector of class labels, one per document</span></div><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">trainNB0</span><span class="params">(trainMatrix,trainCategory)</span>:</span></div><div class="line">    <span class="comment"># Total number of documents, i.e. the number of 1-D arrays (rows)</span></div><div class="line">    numTrainDocs=len(trainMatrix)</div><div class="line">    <span class="comment"># Number of words per document vector, i.e. the length of each 1-D array</span></div><div class="line">    numWords=len(trainMatrix[<span class="number">0</span>])</div><div class="line">    <span class="comment"># There are only two classes, 0 and 1, so summing the label list gives the number of class-1 documents;</span></div><div class="line">    <span class="comment"># dividing by numTrainDocs, the total number of documents, gives the probability of that class</span></div><div class="line">    pAbusive=sum(trainCategory)/float(numTrainDocs)</div><div class="line">    <span class="comment"># The next two lines initialise the counts (ones and 2.0 rather than zeros, so no word ends up with probability 0)</span></div><div class="line">    p0Num=ones(numWords);p1Num=ones(numWords)</div><div class="line">    p0Denom=<span class="number">2.0</span>;p1Denom=<span class="number">2.0</span></div><div class="line"></div><div class="line">    <span class="comment"># Loop over all documents in turn</span></div><div class="line">    <span class="keyword">for</span> i <span class="keyword">in</span> range(numTrainDocs):</div><div 
class="line">        <span class="comment"># Check which class this document belongs to</span></div><div class="line">        <span class="keyword">if</span> trainCategory[i]==<span class="number">1</span>:</div><div class="line">            <span class="comment"># Array plus array: this counts how often each word occurs in this class</span></div><div class="line">            p1Num+=trainMatrix[i]</div><div class="line">            <span class="comment"># Count the total number of word occurrences in this class</span></div><div class="line">            p1Denom+=sum(trainMatrix[i])</div><div class="line">        <span class="keyword">else</span>:</div><div class="line">            p0Num+=trainMatrix[i]</div><div class="line">            p0Denom+=sum(trainMatrix[i])</div><div class="line">    <span class="comment"># Take logarithms to avoid numerical underflow</span></div><div class="line">    p1Vect=log(p1Num/p1Denom)</div><div class="line">    p0Vect=log(p0Num/p0Denom)</div><div class="line">    <span class="keyword">return</span> p0Vect,p1Vect,pAbusive</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Feature vectors of all documents</span></div><div class="line">trainMat=[]</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Convert each post into a word vector and append it to trainMat</span></div><div class="line"><span class="keyword">for</span> postinDoc <span class="keyword">in</span> listOPosts:</div><div class="line">    trainMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">p0V,p1V,pAb=trainNB0(trainMat,listClasses)</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div 
class="line">pAb</div></pre></td></tr></table></figure><pre><code>0.5</code></pre><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">p0V</div></pre></td></tr></table></figure><pre><code>array([-2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654, -2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.56494936, -2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654, -2.15948425, -3.25809654, -3.25809654, -2.56494936, -3.25809654, -2.56494936, -2.56494936, -3.25809654, -2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.56494936, -3.25809654, -2.56494936, -2.56494936, -1.87180218])</code></pre><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">p1V</div></pre></td></tr></table></figure><pre><code>array([-3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526, -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526, -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526, -2.35137526, -2.35137526, -2.35137526, -3.04452244, -1.94591015, -3.04452244, -2.35137526, -2.35137526, -3.04452244, -1.94591015, -3.04452244, -1.65822808, -3.04452244, -2.35137526, -3.04452244, -3.04452244, -3.04452244])</code></pre><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># Naive Bayes classification function</span></div><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">classifyNB</span><span class="params">(vec2classify,p0Vec,p1Vec,pClass1)</span>:</span></div><div class="line"> <span class="comment"># Element-wise multiplication</span></div><div 
class="line"> p1=sum(vec2classify*p1Vec)+log(pClass1)</div><div class="line"> p0=sum(vec2classify*p0Vec)+log(<span class="number">1.0</span>-pClass1)</div><div class="line"> <span class="keyword">if</span> p1>p0:</div><div class="line"> <span class="keyword">return</span> <span class="number">1</span></div><div class="line"> <span class="keyword">else</span>:</div><div class="line"> <span class="keyword">return</span> <span class="number">0</span></div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div></pre></td><td class="code"><pre><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">testingNB</span><span class="params">()</span>:</span></div><div class="line"> listOPosts,listClasses=bayes.loadDataSet()</div><div class="line"> myVocabList=createVocabList(listOPosts)</div><div class="line"> trainMat=[]</div><div class="line"> <span class="keyword">for</span> postinDoc <span class="keyword">in</span> listOPosts:</div><div class="line"> trainMat.append(setOfWordsVec(myVocabList,postinDoc))</div><div class="line"></div><div class="line"> p0V,p1V,pAb=trainNB0(array(trainMat),array(listClasses))</div><div class="line"></div><div class="line"> testEntry=[<span class="string">'love'</span>,<span class="string">'my'</span>,<span class="string">'dalmation'</span>]</div><div class="line"> thisDoc=array(setOfWordsVec(myVocabList,testEntry))</div><div class="line"> <span class="keyword">print</span> testEntry,<span class="string">'classified as : '</span>,classifyNB(thisDoc,p0V,p1V,pAb)</div><div 
class="line"></div><div class="line"> testEntry=[<span class="string">'stupid'</span>,<span class="string">'garbage'</span>]</div><div class="line"> thisDoc=array(setOfWordsVec(myVocabList,testEntry))</div><div class="line"> <span class="keyword">print</span> testEntry,<span class="string">'classified as : '</span>,classifyNB(thisDoc,p0V,p1V,pAb)</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">testingNB()</div></pre></td></tr></table></figure><pre><code>['love', 'my', 'dalmation'] classified as : 0
['stupid', 'garbage'] classified as : 1</code></pre><h1 id="使用朴素贝叶斯过滤垃圾邮件"><a href="#使用朴素贝叶斯过滤垃圾邮件" class="headerlink" title="使用朴素贝叶斯过滤垃圾邮件"></a>Filtering spam email with naive Bayes</h1><ul><li>Collect data: text files are provided</li><li>Prepare data: parse the text files into token vectors</li><li>Analyse data: inspect the tokens to make sure parsing is correct</li><li>Train the algorithm: use the trainNB0() function built earlier</li><li>Test the algorithm: use classifyNB()</li></ul><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">mySent=<span class="string">'This book is the best book on Python or M.L. 
I have ever laid eyes upon.'</span></div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="comment"># File parsing and the complete spam test function</span></div><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">textParse</span><span class="params">(bigString)</span>:</span></div><div class="line"> <span class="keyword">import</span> re</div><div class="line"> listOfTokens=re.split(<span class="string">r'\W+'</span>,bigString)</div><div class="line"> <span class="keyword">return</span> [tok.lower() <span class="keyword">for</span> tok <span class="keyword">in</span> listOfTokens <span class="keyword">if</span> len(tok)><span class="number">2</span>]</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div 
class="line">38</div></pre></td><td class="code"><pre><div class="line"><span class="function"><span class="keyword">def</span> <span class="title">spamTest</span><span class="params">()</span>:</span></div><div class="line"> docList=[];classList=[];fullText=[]</div><div class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> range(<span class="number">1</span>,<span class="number">26</span>):</div><div class="line"> <span class="comment"># Load the email text and parse it into tokens</span></div><div class="line"> wordList=textParse(open(<span class="string">'email/spam/%d.txt'</span> %i).read())</div><div class="line"> docList.append(wordList)</div><div class="line"> fullText.extend(wordList)</div><div class="line"> classList.append(<span class="number">1</span>)</div><div class="line"></div><div class="line"> wordList=textParse(open(<span class="string">'email/ham/%d.txt'</span> %i).read())</div><div class="line"> docList.append(wordList)</div><div class="line"> fullText.extend(wordList)</div><div class="line"> classList.append(<span class="number">0</span>)</div><div class="line"> <span class="comment"># Build the vocabulary</span></div><div class="line"> vocabList=createVocabList(docList)</div><div class="line"></div><div class="line"> <span class="comment"># Randomly split the documents into a training set and a test set</span></div><div class="line"> trainingSet=range(<span class="number">50</span>);testSet=[]</div><div class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> range(<span class="number">10</span>):</div><div class="line"> randIndex=int(random.uniform(<span class="number">0</span>,len(trainingSet)))</div><div class="line"> testSet.append(trainingSet[randIndex])</div><div class="line"> <span class="keyword">del</span>(trainingSet[randIndex])</div><div class="line"></div><div class="line"> <span class="comment"># Build the feature vectors of the training set</span></div><div class="line"> trainMat=[];trainClasses=[]</div><div class="line"> <span class="keyword">for</span> docIndex <span class="keyword">in</span> 
trainingSet:</div><div class="line"> trainMat.append(setOfWordsVec(vocabList,docList[docIndex]))</div><div class="line"> trainClasses.append(classList[docIndex])</div><div class="line"></div><div class="line"> p0V,p1V,pSpam=trainNB0(trainMat,trainClasses)</div><div class="line"></div><div class="line"> <span class="comment"># Classify the test set and measure the error rate</span></div><div class="line"> errorCount=<span class="number">0</span></div><div class="line"> <span class="keyword">for</span> docIndex <span class="keyword">in</span> testSet:</div><div class="line"> wordVector=setOfWordsVec(vocabList,docList[docIndex])</div><div class="line"> <span class="keyword">if</span> classifyNB(wordVector,p0V,p1V,pSpam)!=classList[docIndex]:</div><div class="line"> errorCount+=<span class="number">1</span></div><div class="line"> <span class="keyword">print</span> <span class="string">'the error rate is : '</span>,float(errorCount)/len(testSet)</div></pre></td></tr></table></figure><figure class="highlight python"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">spamTest()</div></pre></td></tr></table></figure><pre><code>the error rate is : 0.2</code></pre><h1 id="使用朴素贝叶斯分类器从个人广告中获取区域倾向"><a href="#使用朴素贝叶斯分类器从个人广告中获取区域倾向" class="headerlink" title="使用朴素贝叶斯分类器从个人广告中获取区域倾向"></a>Using a naive Bayes classifier to reveal regional attitudes from personal ads</h1><ul><li>Collect data: gather content from RSS feeds; this requires building an interface to the feeds</li><li>Prepare data: parse the text files into token vectors</li><li>Analyse data: inspect the tokens to make sure parsing is correct</li><li>Train the algorithm: use the trainNB0() function built earlier</li><li>Test the algorithm: check the error rate to make sure the classifier is usable. The tokenizer can be modified to lower the error rate and improve the classification results.</li><li>Use the algorithm: build a complete program that wraps everything together. Given two RSS feeds, it displays the most common words shared by both.</li></ul><p>Next, we train a classifier on ads from different cities and look at how well it performs. The goal is not to use the classifier for classification, but to inspect the words and their conditional probabilities to find content associated with a particular city.</p>]]></content>
<summary type="html">
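<p>Naive Bayes uses prior knowledge to compute a posterior probability: from the training set we already know the probability of each word in class 0 and in class 1, i.e. p(w|c),<br>and we use that knowledge to decide whether, given the combination of words that appears, the class is more likely 0 or 1, i.e. p(c|w).<br>If the training set is small, p(w|c) is more likely to be inaccurate, so the more samples we have, the more we can trust p(w|c).<br>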
</summary>
<category term="ml" scheme="http://dmlcoding.com/categories/ml/"/>
<category term="ml" scheme="http://dmlcoding.com/tags/ml/"/>
<category term="python" scheme="http://dmlcoding.com/tags/python/"/>
</entry>
<entry>
<title>[Scikit-Learn Chinese Documentation] Installing scikit-learn | ApacheCN</title>
<link href="http://dmlcoding.com/2017/Scikit-Learn-Chinese/"/>
<id>http://dmlcoding.com/2017/Scikit-Learn-Chinese/</id>
<published>2017-11-20T01:24:00.000Z</published>
<updated>2017-11-21T06:58:07.000Z</updated>
<content type="html"><![CDATA[<ul><li>1. Installing scikit-learn (Chinese documentation): <a href="http://blog.csdn.net/u012185296/article/details/78582711" target="_blank" rel="external">http://blog.csdn.net/u012185296/article/details/78582711</a></li><li>2. An introduction to machine learning with scikit-learn: <a href="http://blog.csdn.net/u012185296/article/details/78583115" target="_blank" rel="external">http://blog.csdn.net/u012185296/article/details/78583115</a></li><li>3. Generalized linear models: <a href="http://blog.csdn.net/u012185296/article/details/78583436" target="_blank" rel="external">http://blog.csdn.net/u012185296/article/details/78583436</a></li><li>4. Linear and quadratic discriminant analysis: <a href="http://blog.csdn.net/u012185296/article/details/78584918" target="_blank" rel="external">http://blog.csdn.net/u012185296/article/details/78584918</a></li><li>5. Kernel ridge regression: <a href="http://blog.csdn.net/u012185296/article/details/78584989" target="_blank" rel="external">http://blog.csdn.net/u012185296/article/details/78584989</a></li></ul>]]></content>
<summary type="html">
<ul>
<li>1. Installing scikit-learn (Chinese documentation): <a href="http://blog.csdn.net/u012185296/article/details/78582711" target="_blank" rel="external">http://blo
</summary>
<category term="scikit-learn" scheme="http://dmlcoding.com/categories/scikit-learn/"/>
<category term="ml" scheme="http://dmlcoding.com/tags/ml/"/>
<category term="python" scheme="http://dmlcoding.com/tags/python/"/>
<category term="scikit-learn" scheme="http://dmlcoding.com/tags/scikit-learn/"/>
</entry>
<entry>
<title>Presto study notes</title>
<link href="http://dmlcoding.com/2017/PrestoLearnning/"/>
<id>http://dmlcoding.com/2017/PrestoLearnning/</id>
<published>2017-10-11T02:00:00.000Z</published>
<updated>2017-12-01T03:20:40.000Z</updated>
<content type="html"><![CDATA[<blockquote><p>Presto study notes</p></blockquote><h1 id="是什么-可以做什么"><a href="#是什么-可以做什么" class="headerlink" title="是什么?可以做什么?"></a>What is it? What can it do?</h1><ol><li>Presto is an open-source distributed SQL query engine for interactive analytic queries, handling data sizes from gigabytes to petabytes.</li><li>Presto queries data where it lives, including Hive, Cassandra, relational databases and proprietary data stores. A single Presto query can combine data from multiple sources, allowing analytics across an entire organisation.</li><li>As an alternative to Hive and Pig (both of which query HDFS data through pipelines of MapReduce jobs), Presto can access not only HDFS but also other data sources, including relational databases and stores such as Cassandra.</li><li>Query results are paged automatically, which is a nice touch.</li></ol><a id="more"></a><h1 id="源码编译"><a href="#源码编译" class="headerlink" title="源码编译"></a>Building from source</h1><p>Download the <strong>presto</strong> source package from: <a href="https://github.com/prestodb/presto/releases" target="_blank" rel="external">https://github.com/prestodb/presto/releases</a></p><p>Installation documentation (note that this Chinese documentation covers version 0.100):</p><p>Notes:</p><ul><li>JDK 1.8 or later is required</li><li>I used Presto 0.161</li></ul><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">tar -xzvf presto-0.161.tar.gz</div><div class="line"></div><div class="line"># Build</div><div class="line">./mvnw clean install -DskipTests</div></pre></td></tr></table></figure><p>Add the Aliyun repository to pom.xml to speed up dependency downloads</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><repositories></div><div class="line"> <repository></div><div class="line"> <id>nexus-aliyun</id></div><div class="line"> <name>Nexus aliyun</name></div><div class="line"> <url>http://maven.aliyun.com/nexus/content/groups/public</url></div><div class="line"> </repository></div><div class="line"></repositories></div></pre></td></tr></table></figure><p>The build fails with this error</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div 
class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div></pre></td><td class="code"><pre><div class="line">[ERROR] Failed to execute goal pl.project13.maven:git-commit-id-plugin:2.1.13:revision (default) on project presto-spi: .git directory could not be found! Please specify a valid [dotGitDirectory] in your pom.xml -> [Help 1]</div><div class="line">org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal pl.project13.maven:git-commit-id-plugin:2.1.13:revision (default) on project presto-spi: .git directory could not be found! Please specify a valid [dotGitDirectory] in your pom.xml</div><div class="line">at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)</div><div class="line">at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)</div><div class="line">at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)</div><div class="line">at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)</div><div class="line">at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185)</div><div class="line">at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181)</div><div class="line">at java.util.concurrent.FutureTask.run(FutureTask.java:266)</div><div class="line">at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)</div><div class="line">at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)</div><div class="line">at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)</div><div class="line">at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)</div><div class="line">at java.lang.Thread.run(Thread.java:745)</div><div class="line">Caused by: org.apache.maven.plugin.MojoExecutionException: .git directory could not be found! Please specify a valid [dotGitDirectory] in your pom.xml</div><div class="line">at pl.project13.maven.git.GitCommitIdMojo.throwWhenRequiredDirectoryNotFound(GitCommitIdMojo.java:432)</div><div class="line">at pl.project13.maven.git.GitCommitIdMojo.execute(GitCommitIdMojo.java:337)</div><div class="line">at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)</div><div class="line">at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)</div></pre></td></tr></table></figure><p>Solution</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">Add this plugin configuration to the pom file so that the git-commit-id plugin is skipped</div><div class="line"><pluginManagement></div><div class="line"> <plugins></div><div class="line"> <plugin></div><div class="line"> <groupId>pl.project13.maven</groupId></div><div class="line"> <artifactId>git-commit-id-plugin</artifactId></div><div class="line"> <configuration></div><div class="line"> <skip>true</skip></div><div class="line"> </configuration></div><div class="line"> </plugin></div><div class="line"> </plugins></div><div class="line"> </pluginManagement></div></pre></td></tr></table></figure><h1 id="部署包安装"><a href="#部署包安装" 
class="headerlink" title="部署包安装"></a>Installing from the deployment package</h1><h2 id="下载地址"><a href="#下载地址" class="headerlink" title="下载地址"></a>Download links</h2><p>Download presto-cli (be sure to pick presto-cli-xxxx-executable.jar): <a href="https://repo1.maven.org/maven2/com/facebook/presto/presto-cli" target="_blank" rel="external">https://repo1.maven.org/maven2/com/facebook/presto/presto-cli</a></p><p>Download the server deployment package from: <a href="https://repo1.maven.org/maven2/com/facebook/presto/presto-server/" target="_blank" rel="external">https://repo1.maven.org/maven2/com/facebook/presto/presto-server/</a></p><h2 id="服务器说明"><a href="#服务器说明" class="headerlink" title="服务器说明"></a>Servers</h2><p>There are three machines: U006, U007 and U008.</p><ul><li>coordinator<ul><li>U007</li></ul></li><li>discovery<ul><li>U007</li></ul></li><li>worker<ul><li>U006</li><li>U008</li></ul></li></ul><p>The coordinator and the workers share the same configuration except for config.properties; the differences are described in the configuration notes below.</p><h1 id="配置说明"><a href="#配置说明" class="headerlink" title="配置说明"></a>Configuration</h1><p>All of the configuration below is for Presto 0.161. If you run a different version and startup fails, the parameters in the configuration files may not match your version; look up the configuration for that version.</p><p>Create an etc directory under presto-server-0.161; all of the configuration files below live in it.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">[druid@U007 presto-server-0.161]$ tree etc/</div><div class="line">etc/</div><div class="line">├── catalog</div><div class="line">│ ├── hive.properties</div><div class="line">│ └── jmx.properties</div><div class="line">├── config.properties</div><div class="line">├── jvm.config</div><div class="line">├── log.properties</div><div class="line">└── node.properties</div></pre></td></tr></table></figure><h2 id="jvm-config"><a href="#jvm-config" class="headerlink" 
title="jvm.config"></a>jvm.config</h2><blockquote><p>Contains the command-line options used to launch the JVM. The file is a list of options, one per line. These options are not interpreted by a shell, so an option that contains spaces or other separators is not split up; it is passed to the JVM as a single command-line option. For example:</p></blockquote><p>The JVM system property <code>HADOOP_USER_NAME</code> sets the Hadoop user name.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">-server</div><div class="line">-Xmx16G</div><div class="line">-XX:+UseG1GC</div><div class="line">-XX:G1HeapRegionSize=32M</div><div class="line">-XX:+UseGCOverheadLimit</div><div class="line">-XX:+ExplicitGCInvokesConcurrent</div><div class="line">-XX:+HeapDumpOnOutOfMemoryError</div><div class="line">-XX:OnOutOfMemoryError=kill -9 %p</div><div class="line">-DHADOOP_USER_NAME=hdfs</div></pre></td></tr></table></figure><h2 id="log-properties"><a href="#log-properties" class="headerlink" title="log.properties"></a>log.properties</h2><blockquote><p>This file lets you set different log levels for different logger hierarchies. Each logger has a name (usually the fully qualified name of the class that uses the logger). 
Loggers use the “.” in their names to express hierarchy and inheritance, as shown below:</p></blockquote><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">com.facebook.presto=INFO</div></pre></td></tr></table></figure><ul><li>Sets the log level, similar to log4j. There are four levels: DEBUG, INFO, WARN and ERROR.</li></ul><h2 id="node-properties"><a href="#node-properties" class="headerlink" title="node.properties"></a>node.properties</h2><blockquote><p>Contains configuration specific to each node. A node is a single Presto instance installed on a machine. <strong>etc/node.properties</strong> must contain at least the following settings</p></blockquote><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">node.environment=production</div><div class="line"># node.id must be different on every node</div><div class="line">node.id=ffffffff-ffff-ffff-ffff-ffffffffffff</div><div class="line"># Scratch directory for computation; Presto needs read and write permission on it</div><div class="line">node.data-dir=/home/druid/data/presto</div></pre></td></tr></table></figure><p>Notes:</p><ol><li>node.environment: the name of the environment (cluster). All Presto nodes in the same cluster must have the same environment name.</li><li>node.id: the unique identifier of this Presto node. It must be unique for every node and must stay the same across restarts and upgrades of Presto. If several Presto instances run on one machine (i.e. multiple nodes on the same machine), each one must have its own unique node.id.</li><li>node.data-dir: the location (filesystem path) of the data directory. Presto stores logs and other data here.</li></ol><h2 id="config-properties"><a href="#config-properties" class="headerlink" title="config.properties"></a>config.properties</h2><h3 id="coordinator-主节点配置"><a href="#coordinator-主节点配置" class="headerlink" title="coordinator 主节点配置"></a>Coordinator (master node) configuration</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">coordinator=true</div><div class="line">node-scheduler.include-coordinator=false</div><div class="line">http-server.http.port=8585</div><div 
class="line">query.max-memory=10GB</div><div class="line">discovery-server.enabled=true</div><div class="line">discovery.uri=http://U007:8585</div></pre></td></tr></table></figure><p>Notes:</p><ol><li>coordinator indicates whether this node acts as a coordinator. A node can be a worker, or a worker and a coordinator at the same time, but for performance reasons large clusters should keep the two roles separate.</li></ol><ol><li>If coordinator is set to true, this node acts as a coordinator.</li><li>If node-scheduler.include-coordinator is set to true, the coordinator also acts as a worker; both settings can be true, giving the node both roles. Running the coordinator and a worker in the same Presto server degrades query performance: the worker takes most of the machine's resources, leaving too little for the critical work of scheduling, managing and monitoring query execution.</li><li>http-server.http.port: the port of the HTTP server. Presto uses HTTP for all internal and external communication.</li><li>query.max-memory=10GB: the maximum amount of distributed memory that a single query may use across the whole cluster. It bounds memory-hungry operations such as the number of groups in a GROUP BY, the size of the right-hand table of a JOIN, the number of rows in an ORDER BY and the number of rows processed by a window function. Tune it to the number of concurrent queries and their complexity: if it is set too low, many queries will fail; if it is set too high, the JVM can run out of memory.</li><li>discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance registers itself with the Discovery service on startup. To simplify deployment and avoid running an extra service, the Presto coordinator can run an embedded version of the Discovery service, which shares the HTTP server with Presto and therefore uses the same port.</li><li>discovery.uri: the URI of the Discovery server. Because the Discovery service embedded in the coordinator is enabled, this is the URI of the Presto coordinator. Note: the URI must not end with a “/”.</li></ol><h3 id="worker节点配置"><a href="#worker节点配置" class="headerlink" title="worker节点配置"></a>Worker node configuration</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">coordinator=false</div><div class="line">node-scheduler.include-coordinator=true</div><div class="line">http-server.http.port=8585</div><div class="line">query.max-memory=5GB</div><div class="line">query.max-memory-per-node=1GB</div><div class="line">discovery.uri=http://U007:8585</div></pre></td></tr></table></figure><h2 id="catalog"><a href="#catalog" class="headerlink" 
title="catalog"></a>catalog</h2><p>hive.properties (configuration of the Hive connector)</p><p>Connecting to Hive</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"># Create hive.properties under etc/catalog and fill in the Hive settings</div><div class="line">[druid@U006 catalog]$ pwd</div><div class="line">/home/druid/presto-server-0.161/etc/catalog</div><div class="line">[druid@U006 catalog]$ more hive.properties</div><div class="line">connector.name=hive-cdh5</div><div class="line">hive.metastore.uri=thrift://U006:9083</div><div class="line">hive.config.resources=/etc/hadoop/conf.cloudera.yarn/core-site.xml,/etc/hadoop/conf.cloudera.yarn/hdfs-site.xml</div></pre></td></tr></table></figure><p>Make sure Presto has read permission on core-site.xml and hdfs-site.xml on every node.</p><h1 id="启动停止presto"><a href="#启动停止presto" class="headerlink" title="启动停止presto"></a>Starting and stopping Presto</h1><h2 id="单节点启动"><a href="#单节点启动" class="headerlink" title="单节点启动"></a>Starting a single node</h2><p>Run the launcher script on each node in turn</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line"># Run in the background</div><div class="line">bin/launcher start</div><div class="line"># Run in the foreground</div><div class="line">bin/launcher run</div><div class="line"></div><div class="line"># Restart Presto</div><div class="line">bin/launcher restart</div><div class="line"></div><div class="line"># Stop Presto</div><div class="line">bin/launcher stop</div></pre></td></tr></table></figure><h2 id="批量启动停止脚本"><a href="#批量启动停止脚本" class="headerlink" 
title="批量启动停止脚本"></a>Batch start/stop script</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"># ${i} is assumed to be the loop variable of an enclosing loop over the host names (U006, U007, U008)</div><div class="line">ssh -t ${i} -C '. /usr/local/bin/env.sh && /usr/local/presto-server-0.161/bin/launcher restart'</div></pre></td></tr></table></figure><h1 id="监控presto"><a href="#监控presto" class="headerlink" title="监控presto"></a>Monitoring Presto</h1><p>Once startup has finished, open this address in a browser:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">http://U007:8585</div></pre></td></tr></table></figure><p>This is the same address as the coordinator's discovery.uri.</p><h1 id="cli连接"><a href="#cli连接" class="headerlink" title="cli连接"></a>Connecting with the CLI</h1><h2 id="连接器注意说明"><a href="#连接器注意说明" class="headerlink" title="连接器注意说明"></a>Notes on connectors</h2><ul><li>CLI download (pick the version that matches yours): <a href="https://repo1.maven.org/maven2/com/facebook/presto/presto-cli" target="_blank" rel="external">https://repo1.maven.org/maven2/com/facebook/presto/presto-cli</a></li></ul><ul><li>A connector configuration file must end with the <code>.properties</code> suffix; the file name before the suffix is the connector's catalog name</li><li>Whenever you add a connector configuration file, add the same file on every Presto machine and then restart</li><li>Download the CLI that matches your server version, otherwise it may not work. The file is named like <code>presto-cli-0.161-executable.jar</code></li></ul><h2 id="hive连接器"><a href="#hive连接器" class="headerlink" title="hive连接器"></a>Hive connector</h2><p>Configuration notes (the Hive connector was already configured in the catalog section above; refer back to it):</p><ul><li>connector.name=hive-cdh5 (choose according to your Hive version)</li><li>hive.metastore.uri=thrift://U006:9083 (address of the Hive metastore)</li><li>hive.config.resources=/etc/hadoop/conf.cloudera.yarn/core-(location of the configuration files)</li></ul><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">chmod +x presto-cli-0.161-executable.jar</div><div class="line">./presto-cli-0.161-executable.jar --server U007:8585 --catalog hive --schema default</div><div class="line">or: mv 
presto-cli-0.161-executable.jar presto (either works)</div><div class="line">./presto --server U007:8585 --catalog hive --schema default</div><div class="line">Then, in the Presto shell, run show tables to list the tables in Hive's default database. If the expected tables appear, the installation is verified.</div></pre></td></tr></table></figure><h2 id="jmx连接器"><a href="#jmx连接器" class="headerlink" title="jmx连接器"></a>JMX connector</h2><ul><li>JMX exposes information about the Java virtual machine and the software running inside it</li><li>The JMX connector is used to query JMX information from the Presto servers</li></ul><p>Create <strong>jmx.properties</strong> under etc/catalog</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">connector.name=jmx</div></pre></td></tr></table></figure><p>Now connect with the Presto CLI to use the JMX connector</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">[druid@U007 presto-server-0.161]$ ./presto --server U007:8585 --catalog jmx --schema jmx</div><div class="line">presto:jmx> show schemas from jmx;</div><div class="line"> Schema</div><div class="line">--------------------</div><div class="line"> current</div><div class="line"> history</div><div class="line"> information_schema</div><div class="line">(3 rows)</div><div class="line"></div><div class="line">Query 20171012_063601_00020_yuhat, FINISHED, 2 nodes</div><div class="line">Splits: 2 total, 2 done (100.00%)</div><div class="line">0:00 [3 rows, 47B] [39 rows/s, 614B/s]</div></pre></td></tr></table></figure><h2 id="MySQL连接器"><a href="#MySQL连接器" class="headerlink" title="MySQL连接器"></a>MySQL connector</h2><p>vim etc/catalog/mysql.properties</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div 
class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">connector.name=mysql</div><div class="line">connection-url=jdbc:mysql://10.10.25.13:3306</div><div class="line">connection-user=root</div><div class="line">connection-password=wankatest***</div></pre></td></tr></table></figure><p>The value passed to --schema is the name of the MySQL database:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div></pre></td><td class="code"><pre><div class="line">[druid@U007 presto-server-0.161]$ ./presto --server U007:8585 --catalog mysql --schema test</div><div class="line">presto:test> show tables;</div><div class="line">Query 20171012_071844_00002_iz4q8 failed: No worker nodes available</div><div class="line"></div><div class="line">presto:test> show tables;</div><div class="line"> Table</div><div class="line">--------------------------</div><div class="line"> dmp_summary_daily_report</div><div class="line"> tb_dmp_stat_appboot</div><div class="line"> tb_dmp_stat_asdk_detail</div><div class="line"> tb_dmp_stat_device</div><div class="line"> tb_leidian1</div><div class="line"> tb_leidian2</div><div class="line">(6 rows)</div><div class="line"></div><div class="line">Query 20171012_072024_00003_iz4q8, FINISHED, 2 nodes</div><div class="line">Splits: 2 total, 2 done (100.00%)</div><div class="line">0:01 [6 rows, 190B] [6 rows/s, 196B/s]</div></pre></td></tr></table></figure><h2 id="kafka连接器"><a href="#kafka连接器" class="headerlink" title="Kafka connector"></a>Kafka connector</h2><p>No need for this one yet, so it is untested.</p><h2 
id="系统连接器"><a href="#系统连接器" class="headerlink" title="System connector"></a>System connector</h2><ul><li>The system connector exposes information and metrics about the running Presto cluster.</li><li>That information can therefore be queried with plain SQL.</li><li>The system connector is built in and needs no configuration; it is available as the catalog named <code>system</code>.</li></ul><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line">[druid@U007 presto-server-0.161]$ ./presto --server U007:8585 --catalog system</div><div class="line">presto> show schemas from system;</div><div class="line"> Schema</div><div class="line">--------------------</div><div class="line"> information_schema</div><div class="line"> jdbc</div><div class="line"> metadata</div><div class="line"> runtime</div><div class="line">(4 rows)</div><div class="line"></div><div class="line">Query 20171012_082437_00019_iz4q8, FINISHED, 2 nodes</div><div class="line">Splits: 2 total, 2 done (100.00%)</div><div class="line">0:00 [4 rows, 57B] [70 rows/s, 997B/s]</div></pre></td></tr></table></figure><p>List the nodes in the cluster:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">presto> SELECT * FROM system.runtime.nodes;</div><div class="line"> node_id | http_uri | node_version | coord</div><div class="line">-------------------------------------------+-------------------------+--------------+------</div><div class="line"> 
ffffffff-ffff-ffff-ffff-ffffffffffff-u006 | http://10.10.25.13:8585 | 0.161 | false</div><div class="line"> ffffffff-ffff-ffff-ffff-ffffffffffff-u007 | http://10.10.25.14:8585 | 0.161 | true</div><div class="line"> ffffffff-ffff-ffff-ffff-ffffffffffff-u008 | http://10.10.25.15:8585 | 0.161 | false</div><div class="line">(3 rows)</div><div class="line"></div><div class="line">Query 20171012_082952_00023_iz4q8, FINISHED, 2 nodes</div><div class="line">Splits: 2 total, 2 done (100.00%)</div><div class="line">4:08 [3 rows, 228B] [0 rows/s, 0B/s]</div></pre></td></tr></table></figure><h2 id="JDBC接口"><a href="#JDBC接口" class="headerlink" title="JDBC interface"></a>JDBC interface</h2><h3 id="依赖下载安装"><a href="#依赖下载安装" class="headerlink" title="Downloading and installing the dependency"></a>Downloading and installing the dependency</h3><ul><li>Download (pick the version that matches your server): <a href="https://repo1.maven.org/maven2/com/facebook/presto/presto-jdbc" target="_blank" rel="external">https://repo1.maven.org/maven2/com/facebook/presto/presto-jdbc</a></li><li><code>presto-jdbc-0.161.jar</code>: after downloading the jar, add it to the classpath of your Java application.</li><li>I prefer not to manage jars by hand; the alternative is to add the presto-jdbc dependency to your pom file.</li><li>Find the matching version at <a href="http://mvnrepository.com/artifact/com.facebook.presto/presto-jdbc" target="_blank" rel="external">http://mvnrepository.com/artifact/com.facebook.presto/presto-jdbc</a></li></ul><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"><!-- https://mvnrepository.com/artifact/com.facebook.presto/presto-jdbc --></div><div class="line"><dependency></div><div class="line"> <groupId>com.facebook.presto</groupId></div><div class="line"> <artifactId>presto-jdbc</artifactId></div><div class="line"> <version>0.161</version></div><div class="line"></dependency></div></pre></td></tr></table></figure><p>Presto accepts the following JDBC URL formats:</p><figure class="highlight plain"><table><tr><td 
class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">jdbc:presto://host:port</div><div class="line">jdbc:presto://host:port/catalog</div><div class="line">jdbc:presto://host:port/catalog/schema</div></pre></td></tr></table></figure><p>For example, the following URL connects to the test schema of the mysql catalog on the Presto server running on U007, port 8585:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">jdbc:presto://10.10.25.14:8585/mysql/test</div></pre></td></tr></table></figure><p>And this URL connects to the default schema of the hive catalog:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">jdbc:presto://10.10.25.14:8585/hive/default</div></pre></td></tr></table></figure><h3 id="Java代码"><a href="#Java代码" class="headerlink" title="Java code"></a>Java code</h3><p><strong>List all tables in MySQL's test database</strong></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div></pre></td><td class="code"><pre><div class="line">import java.sql.Connection;</div><div class="line">import java.sql.DriverManager;</div><div class="line">import java.sql.ResultSet;</div><div class="line">import java.sql.SQLException;</div><div class="line">import java.sql.Statement;</div><div class="line"></div><div class="line">public class PrestoJdbcDemo {</div><div class="line"></div><div class="line">    public static void main(String[] args) throws SQLException, ClassNotFoundException {</div><div class="line">        Class.forName("com.facebook.presto.jdbc.PrestoDriver");</div><div class="line">        Connection connection = DriverManager</div><div class="line">                .getConnection("jdbc:presto://10.10.25.14:8585/mysql/test", "root", "wankatest***");</div><div class="line">        Statement stmt = connection.createStatement();</div><div class="line">        ResultSet rs 
= stmt.executeQuery("show tables");</div><div class="line">        while (rs.next()) {</div><div class="line">            System.out.println(rs.getString(1));</div><div class="line">        }</div><div class="line">        rs.close();</div><div class="line">        stmt.close();</div><div class="line">        connection.close();</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure><p><strong>List all tables in Hive's default schema</strong></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div></pre></td><td class="code"><pre><div class="line">import java.sql.Connection;</div><div class="line">import java.sql.DriverManager;</div><div class="line">import java.sql.ResultSet;</div><div class="line">import java.sql.SQLException;</div><div class="line">import java.sql.Statement;</div><div class="line"></div><div class="line">public class PrestoJdbcDemo {</div><div class="line"></div><div class="line">    public static void main(String[] args) throws SQLException, ClassNotFoundException {</div><div class="line">        // Register the Presto JDBC driver</div><div class="line">        Class.forName("com.facebook.presto.jdbc.PrestoDriver");</div><div class="line">        // try-with-resources closes the ResultSet, Statement and Connection, even on errors</div><div class="line">        try (Connection connection = DriverManager</div><div class="line">                .getConnection("jdbc:presto://10.10.25.14:8585/hive/default", "root", null);</div><div class="line">             Statement stmt = connection.createStatement();</div><div class="line">             ResultSet rs = stmt.executeQuery("show tables")) {</div><div class="line">            while (rs.next()) {</div><div class="line">                System.out.println(rs.getString(1));</div><div class="line">            }</div><div class="line">        }</div><div class="line">    }</div><div class="line">}</div></pre></td></tr></table></figure><h1 id="不同数据源之间的join"><a href="#不同数据源之间的join" class="headerlink" 
title="Joins across data sources"></a>Joins across data sources</h1><p>One of Presto's key features is that it can join tables from different data sources.</p><p>When starting the Presto CLI you can also leave out the catalog.</p><p>Each data source is then addressed by its catalog name, followed by the schema (database) name and the table name.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">./presto --server U007:8585</div><div class="line"> show tables from mysql.dsp_test;</div></pre></td></tr></table></figure><h1 id="presto提供的函数和运算符"><a href="#presto提供的函数和运算符" class="headerlink" title="Functions and operators provided by Presto"></a>Functions and operators provided by Presto</h1><p>Reference: <a href="http://prestodb-china.com/docs/current/functions.html" target="_blank" rel="external">http://prestodb-china.com/docs/current/functions.html</a></p><h1 id="看日志"><a href="#看日志" class="headerlink" title="Reading the logs"></a>Reading the logs</h1><h2 id="日志路径"><a href="#日志路径" class="headerlink" title="Log location"></a>Log location</h2><p><strong>node.properties</strong> sets node.data-dir=/home/druid/data/presto, so the logs live under that directory:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line">[druid@U007 presto]$ tree</div><div class="line">.</div><div class="line">├── etc -> /home/druid/presto-server-0.161/etc</div><div class="line">├── plugin -> /home/druid/presto-server-0.161/plugin</div><div class="line">└── var</div><div class="line"> ├── log</div><div class="line"> │ ├── http-request.log</div><div class="line"> │ ├── launcher.log</div><div class="line"> │ └── server.log</div><div class="line"> └── run</div><div class="line"> └── launcher.pid</div><div class="line"></div><div class="line">5 directories, 4 
files</div></pre></td></tr></table></figure><p>When something goes wrong, the logs under var/log are the first place to look.</p><p>If the service misbehaves, find the error in server.log and work from there.</p><h1 id="常见错误"><a href="#常见错误" class="headerlink" title="Common errors"></a>Common errors</h1><h2 id="连接不上连接器"><a href="#连接不上连接器" class="headerlink" title="Connector cannot be loaded"></a>Connector cannot be loaded</h2><p>Errors like these:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">No factory for connector mysql</div><div class="line">No factory for connector hive</div></pre></td></tr></table></figure><p>If the server reports errors like these, check that each connector's configuration file is written correctly and that it has been deployed to every Presto server.</p><h1 id="Presto的实现原理"><a href="#Presto的实现原理" class="headerlink" title="How Presto works"></a>How Presto works</h1><p><img src="http://oz6wyfxp0.bkt.clouddn.com/1512097348.png?imageMogr2/thumbnail/!70p" alt="Presto architecture"></p><ul><li>Presto architecture</li><li>How Presto executes a query</li></ul><p>In short:</p><ol><li>The CLI client sends the query to the coordinator. The coordinator parses the SQL, builds an execution plan, and distributes tasks to the worker nodes, so the workers do the actual query execution.</li><li>Each worker registers with the discovery server when it starts, so the coordinator can ask the discovery server which workers are available.</li></ol><p>To go deeper:</p><ol><li>Read related blog posts.</li><li>Buy a book.</li><li>Read the source code!</li></ol><h1 id="参考文档"><a href="#参考文档" class="headerlink" title="References"></a>References</h1><ul><li><p><a href="https://prestodb.io" target="_blank" rel="external">Presto official site</a></p></li><li><p><a href="https://github.com/prestodb/presto" target="_blank" rel="external">Presto on GitHub</a></p></li><li><p><a href="http://prestodb-china.com" target="_blank" rel="external">Presto Chinese documentation (maintained by JD.com)</a></p></li><li><p><a href="https://tech.meituan.com/presto.html" target="_blank" rel="external">How Presto works and how Meituan uses it</a></p></li><li><p><a href="http://getindata.com/tutorial-using-presto-to-combine-data-from-hive-and-mysql-in-one-sql-like-query/" target="_blank" rel="external">Presto-Hive-Mysql-InteractiveQuery</a></p></li></ul>]]></content>
<summary type="html">
<blockquote>
<p>Presto study notes</p>
</blockquote>
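<h1 id="是什么-可以做什么"><a href="#是什么-可以做什么" class="headerlink" title="What is it? What can it do?"></a>What is it? What can it do?</h1><ol>
<li>Presto is an open-source distributed SQL query engine for interactive analytic queries, over data ranging from gigabytes to petabytes.</li>
<li>Presto queries data where it lives, including Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can combine data from multiple sources, enabling analysis across an entire organization.</li>
<li>As an alternative to Hive and Pig (both of which query HDFS data through pipelines of MapReduce jobs), Presto can access HDFS and also work with other data sources, including relational databases and stores such as Cassandra.</li>
<li>Query results are paginated automatically, which is handy.</li>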
</ol>
</summary>
<category term="presto" scheme="http://dmlcoding.com/categories/presto/"/>
<category term="bigdata" scheme="http://dmlcoding.com/tags/bigdata/"/>
<category term="presto" scheme="http://dmlcoding.com/tags/presto/"/>
</entry>
<entry>
<title>The 80/20 rule and the long tail</title>
<link href="http://dmlcoding.com/2017/erbachangwei/"/>
<id>http://dmlcoding.com/2017/erbachangwei/</id>
<published>2017-09-19T06:24:00.000Z</published>
<updated>2017-10-13T02:16:52.000Z</updated>
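<content type="html"><![CDATA[<p>When I first joined the advertising industry, I sometimes heard colleagues talk about "the long tail" and never quite understood what it meant. A book I have been reading recently mentions the 80/20 rule.<br>Out of curiosity I searched for the 80/20 rule, and the results kept bringing up the long tail as well. So this is a good opportunity to understand both properly.</p><a id="more"></a><h1 id="什么是二八定律"><a href="#什么是二八定律" class="headerlink" title="What is the 80/20 rule"></a>What is the 80/20 rule</h1><p>Here is the explanation from the wiki.</p><p>The Pareto principle, also known as the 80/20 rule, states that for many phenomena 80% of the outcomes come from 20% of the causes,<br>and it is widely applied in many fields. For example, 80% of the results of one's work come from 20% of the early effort.<br>The rule goes back to the Italian economist Vilfredo Pareto, who observed in 1906 that 20% of Italy's population owned 80% of the property;<br>the management thinker Joseph Juran and others later generalized it as the Pareto principle.</p><p>So, put simply, the most important…</p><h1 id="什么是长尾理论"><a href="#什么是长尾理论" class="headerlink" title="What is the long tail"></a>What is the long tail</h1><p>To be continued.</p>]]></content>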
<summary type="html">
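<p>When I first joined the advertising industry, I sometimes heard colleagues talk about "the long tail" and never quite understood what it meant. A book I have been reading recently mentions the 80/20 rule.<br>Out of curiosity I searched for the 80/20 rule, and the results kept bringing up the long tail as well. So this is a good opportunity to understand both properly.</p>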
</summary>
<category term="computer" scheme="http://dmlcoding.com/categories/computer/"/>
<category term="computer" scheme="http://dmlcoding.com/tags/computer/"/>
</entry>
<entry>
<title>Setting up and using ELK</title>
<link href="http://dmlcoding.com/2017/ELKBuild/"/>
<id>http://dmlcoding.com/2017/ELKBuild/</id>
<published>2017-09-17T01:24:00.000Z</published>
<updated>2017-11-01T07:11:07.000Z</updated>
<content type="html"><![CDATA[<blockquote><p>Notes on setting up ELK, plus learning resources.</p></blockquote><h1 id="ELK学习资料"><a href="#ELK学习资料" class="headerlink" title="ELK learning resources"></a>ELK learning resources</h1><ul><li><a href="https://www.gitbook.com/book/chenryn/elk-stack-guide-cn/details" target="_blank" rel="external">ELKstack guide (Chinese)</a></li><li><a href="https://www.gitbook.com/book/looly/elasticsearch-the-definitive-guide-cn/details" target="_blank" rel="external">Elasticsearch: The Definitive Guide (Chinese edition)</a></li><li>…..</li></ul><h1 id="ELK下载"><a href="#ELK下载" class="headerlink" title="Downloading ELK"></a>Downloading ELK</h1><p>Past releases: <a href="https://www.elastic.co/downloads/past-releases" target="_blank" rel="external">https://www.elastic.co/downloads/past-releases</a></p><ul><li>elasticsearch : 2.4.0</li><li>logstash : 2.4.0</li><li>kibana : 4.6.2<a id="more"></a></li></ul><h1 id="安装准备"><a href="#安装准备" class="headerlink" title="Preparing the installation"></a>Preparing the installation</h1><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div></pre></td><td class="code"><pre><div class="line">[root@U006 opt]# cd /opt</div><div class="line">[root@U006 opt]# mkdir elk</div><div class="line">[root@U006 opt]# chown -R hadoop:hadoop elk</div><div class="line">[root@U006 opt]# su hadoop</div><div class="line">[hadoop@U006 opt]$ cd /opt/elk/</div><div class="line">[hadoop@U006 elk]$ mkdir jars</div><div class="line">[hadoop@U006 jars]$ pwd</div><div class="line">/opt/elk/jars</div><div class="line"></div><div class="line"># upload the installation packages</div><div class="line"></div><div class="line">[hadoop@U006 jars]$ ll</div><div class="line">total 116996</div><div class="line">-rw-r--r-- 1 hadoop 
hadoop 27364449 Sep 18 09:36 elasticsearch-2.4.0.tar.gz</div><div class="line">-rw-r--r-- 1 hadoop hadoop 34125464 Sep 18 09:37 kibana-4.6.2-linux-x86_64.tar.gz</div><div class="line">-rw-r--r-- 1 hadoop hadoop 58310656 Sep 18 09:41 logstash-2.4.0.tar.gz</div></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">[hadoop@U006 elk]$ ll</div><div class="line">total 16</div><div class="line">drwxrwxr-x 6 hadoop hadoop 4096 Sep 18 09:45 elasticsearch-2.4.0</div><div class="line">drwxrwxr-x 2 hadoop hadoop 4096 Sep 18 09:40 jars</div><div class="line">drwxrwxr-x 11 hadoop hadoop 4096 Oct 21 2016 kibana-4.6.2-linux-x86_64</div><div class="line">drwxrwxr-x 5 hadoop hadoop 4096 Sep 18 09:47 logstash-2.4.0</div></pre></td></tr></table></figure><h1 id="安装elasticsearch"><a href="#安装elasticsearch" class="headerlink" title="Installing Elasticsearch"></a>Installing Elasticsearch</h1><blockquote><p>This is a test environment, so a single node is used as the example.</p></blockquote><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"># unpack elasticsearch</div><div class="line">tar -zxvf elasticsearch-2.4.0.tar.gz -C /opt/elk/</div></pre></td></tr></table></figure><h2 id="修改配置文件config-elasticsearch-yml"><a href="#修改配置文件config-elasticsearch-yml" class="headerlink" title="Editing config/elasticsearch.yml"></a>Editing config/elasticsearch.yml</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div 
class="line">12</div><div class="line">13</div><div class="line">14</div></pre></td><td class="code"><pre><div class="line">[hadoop@U006 config]$ pwd</div><div class="line">/opt/elk/elasticsearch-2.4.0/config</div><div class="line"></div><div class="line">[hadoop@U006 config]$ ll</div><div class="line">total 8</div><div class="line">-rw-rw-r-- 1 hadoop hadoop 3192 Aug 24 2016 elasticsearch.yml</div><div class="line">-rw-rw-r-- 1 hadoop hadoop 2571 Aug 24 2016 logging.yml</div><div class="line"></div><div class="line">[hadoop@U006 elasticsearch-2.4.0]$ vim config/elasticsearch.yml</div><div class="line"></div><div class="line"># uncomment these three settings</div><div class="line"> cluster.name: test_elasticsearch # name of the es cluster</div><div class="line"> node.name: node1 # name of this node within the cluster</div><div class="line"> network.host: 0.0.0.0 # listen on all interfaces so other hosts can connect</div></pre></td></tr></table></figure><h2 id="启动es"><a href="#启动es" class="headerlink" title="Starting es"></a>Starting es</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">./bin/elasticsearch</div><div class="line">./bin/elasticsearch -d # run in the background</div></pre></td></tr></table></figure><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div></pre></td><td class="code"><pre><div class="line">[hadoop@U006 elasticsearch-2.4.0]$ ./bin/elasticsearch</div><div class="line">[2017-09-18 10:08:50,300][WARN ][bootstrap ] unable to install syscall filter: seccomp unavailable: requires kernel 3.5+ with CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER compiled in</div><div 
class="line">[2017-09-18 10:08:51,697][INFO ][node ] [node1] version[2.4.0], pid[3098], build[ce9f0c7/2016-08-29T09:14:17Z]</div><div class="line">[2017-09-18 10:08:51,697][INFO ][node ] [node1] initializing ...</div><div class="line">[2017-09-18 10:08:52,332][INFO ][plugins ] [node1] modules [lang-groovy, reindex, lang-expression], plugins [], sites []</div><div class="line">[2017-09-18 10:08:52,363][INFO ][env ] [node1] using [1] data paths, mounts [[/home (/dev/mapper/VolGroup-lv_home)]], net usable_space [195.3gb], net total_space [857.4gb], spins? [possibly], types [ext4]</div><div class="line">[2017-09-18 10:08:52,363][INFO ][env ] [node1] heap size [989.8mb], compressed ordinary object pointers [true]</div><div class="line">[2017-09-18 10:08:54,472][INFO ][node ] [node1] initialized</div><div class="line">[2017-09-18 10:08:54,473][INFO ][node ] [node1] starting ...</div><div class="line">[2017-09-18 10:08:54,675][INFO ][transport ] [node1] publish_address {10.10.25.13:9300}, bound_addresses {[::]:9300}</div><div class="line">[2017-09-18 10:08:54,685][INFO ][discovery ] [node1] test_elasticsearch/oAbAZR_tTaGAJU-5yd3BqQ</div><div class="line">[2017-09-18 10:08:57,757][INFO ][cluster.service ] [node1] new_master {node1}{oAbAZR_tTaGAJU-5yd3BqQ}{10.10.25.13}{10.10.25.13:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)</div><div class="line">[2017-09-18 10:08:57,798][INFO ][http ] [node1] publish_address {10.10.25.13:9200}, bound_addresses {[::]:9200}</div><div class="line">[2017-09-18 10:08:57,799][INFO ][node ] [node1] started</div><div class="line">[2017-09-18 10:08:57,811][INFO ][gateway ] [node1] recovered [0] indices into cluster_state</div></pre></td></tr></table></figure><p>Open <a href="http://10.10.25.13:9200/" target="_blank" rel="external">http://10.10.25.13:9200/</a> and you should see the response below. It includes the configured cluster.name and node.name, the es version, and similar details.<br><img src="../images/elk/es-startresult.png" alt="es-result"></p><h2 id="安装插件"><a href="#安装插件" class="headerlink" 
title="Installing plugins"></a>Installing plugins</h2><h3 id="elasticsearch-head-插件安装"><a href="#elasticsearch-head-插件安装" class="headerlink" title="Installing the elasticsearch-head plugin"></a>Installing the elasticsearch-head plugin</h3><h4 id="下载安装"><a href="#下载安装" class="headerlink" title="Download and install"></a>Download and install</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line"># change into the bin directory</div><div class="line">[hadoop@U006 bin]$ ./plugin install mobz/elasticsearch-head</div><div class="line">-> Installing mobz/elasticsearch-head...</div><div class="line">Trying https://github.com/mobz/elasticsearch-head/archive/master.zip ...</div><div class="line">Downloading ..........DONE</div><div class="line">Verifying https://github.com/mobz/elasticsearch-head/archive/master.zip checksums if available ...</div><div class="line">NOTE: Unable to verify checksum for downloaded plugin (unable to find .sha1 or .md5 file to verify)</div><div class="line">Installed head into /opt/elk/elasticsearch-2.4.0/plugins/head</div></pre></td></tr></table></figure><h4 id="使用说明"><a href="#使用说明" class="headerlink" title="Usage"></a>Usage</h4><blockquote><p>head is a plugin for interacting with an ES cluster from the browser: you can inspect the cluster state, browse documents, run searches, and send ordinary REST requests.</p></blockquote><p>Open this address directly in a browser: <a href="http://10.10.25.13:9200/_plugin/head/" target="_blank" 
rel="external">http://10.10.25.13:9200/_plugin/head/</a> to see the state of the es cluster.<br><img src="../images/elk/es-plugin-head.png" alt="es-head"></p><h3 id="安装Marvel插件"><a href="#安装Marvel插件" class="headerlink" title="Installing the Marvel plugin"></a>Installing the Marvel plugin</h3><h4 id="下载安装-1"><a href="#下载安装-1" class="headerlink" title="Download and install"></a>Download and install</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div></pre></td><td class="code"><pre><div class="line"># ./bin/plugin install license</div><div class="line"># ./bin/plugin install marvel-agent</div><div class="line"></div><div class="line">[hadoop@U006 elasticsearch-2.4.0]$ ./bin/plugin install license</div><div class="line">-> Installing license...</div><div class="line">Trying 
https://download.elastic.co/elasticsearch/release/org/elasticsearch/plugin/license/2.4.0/license-2.4.0.zip ...</div><div class="line">Downloading .......DONE</div><div class="line">Verifying https://download.elastic.co/elasticsearch/release/org/elasticsearch/plugin/license/2.4.0/license-2.4.0.zip checksums if available ...</div><div class="line">Downloading .DONE</div><div class="line">Installed license into /opt/elk/elasticsearch-2.4.0/plugins/license</div><div class="line">[hadoop@U006 elasticsearch-2.4.0]$ ./bin/plugin install marvel-agent</div><div class="line">-> Installing marvel-agent...</div><div class="line">Trying https://download.elastic.co/elasticsearch/release/org/elasticsearch/plugin/marvel-agent/2.4.0/marvel-agent-2.4.0.zip ...</div><div class="line">Downloading ..........DONE</div><div class="line">Verifying https://download.elastic.co/elasticsearch/release/org/elasticsearch/plugin/marvel-agent/2.4.0/marvel-agent-2.4.0.zip checksums if available ...</div><div class="line">Downloading .DONE</div><div class="line">@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@</div><div class="line">@ WARNING: plugin requires additional permissions @</div><div class="line">@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@</div><div class="line">* java.lang.RuntimePermission setFactory</div><div class="line">* javax.net.ssl.SSLPermission setHostnameVerifier</div><div class="line">See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html</div><div class="line">for descriptions of what these permissions allow and the associated risks.</div><div class="line"></div><div class="line">Continue with installation? 
[y/N]y</div><div class="line">Installed marvel-agent into /opt/elk/elasticsearch-2.4.0/plugins/marvel-agent</div><div class="line"></div><div class="line"></div><div class="line"># installed plugins all end up in the es plugins directory</div><div class="line">[hadoop@U006 elasticsearch-2.4.0]$ ll</div><div class="line">total 56</div><div class="line">drwxrwxr-x 2 hadoop hadoop 4096 Sep 18 09:45 bin</div><div class="line">drwxrwxr-x 3 hadoop hadoop 4096 Sep 18 10:05 config</div><div class="line">drwxrwxr-x 3 hadoop hadoop 4096 Sep 18 09:59 data</div><div class="line">drwxrwxr-x 2 hadoop hadoop 4096 Sep 18 09:45 lib</div><div class="line">-rw-rw-r-- 1 hadoop hadoop 11358 Aug 24 2016 LICENSE.txt</div><div class="line">drwxrwxr-x 2 hadoop hadoop 4096 Sep 18 09:59 logs</div><div class="line">drwxrwxr-x 5 hadoop hadoop 4096 Aug 29 2016 modules</div><div class="line">-rw-rw-r-- 1 hadoop hadoop 150 Aug 24 2016 NOTICE.txt</div><div class="line">drwxrwxr-x 3 hadoop hadoop 4096 Sep 18 10:15 plugins</div><div class="line">-rw-rw-r-- 1 hadoop hadoop 8700 Aug 24 2016 README.textile</div><div class="line">[hadoop@U006 elasticsearch-2.4.0]$ cd plugins/</div><div class="line">[hadoop@U006 plugins]$ ll</div><div class="line">total 12</div><div class="line">drwxrwxr-x 6 hadoop hadoop 4096 Sep 18 10:15 head</div><div class="line">drwxrwxr-x 2 hadoop hadoop 4096 Sep 18 10:34 license</div><div class="line">drwxrwxr-x 2 hadoop hadoop 4096 Sep 18 10:34 marvel-agent</div></pre></td></tr></table></figure><h4 id="使用说明-1"><a href="#使用说明-1" class="headerlink" title="Usage"></a>Usage</h4><blockquote><p>Marvel is a management and monitoring tool for Elasticsearch that is free to use in development. It includes an interactive console called Sense,<br>which lets you talk to Elasticsearch directly from the browser.<br>The Marvel plugin is mainly used together with Kibana; as shown later, Kibana needs the Marvel plugin installed as well.</p></blockquote><h2 id="注意"><a href="#注意" class="headerlink" title="Notes"></a>Notes</h2><ul><li>If you did not change network.host in config/elasticsearch.yml earlier, es is only reachable through localhost or 127.0.0.1.</li><li>In configuration files ending in yml, every colon must be followed by a space.</li></ul><h1 id="安装kibana"><a href="#安装kibana" class="headerlink" 
title="Installing Kibana"></a>Installing Kibana</h1><h2 id="解压安装"><a href="#解压安装" class="headerlink" title="Unpack and install"></a>Unpack and install</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">tar -zxvf kibana-4.6.2-linux-x86_64.tar.gz -C /opt/elk/</div></pre></td></tr></table></figure><h2 id="修改config-kibana-yml的elasticsearch-url属性即可。"><a href="#修改config-kibana-yml的elasticsearch-url属性即可。" class="headerlink" title="Set the elasticsearch.url property in config/kibana.yml"></a>Set the elasticsearch.url property in config/kibana.yml</h2><p><img src="../images/elk/kibana-config.png" alt="kibana-config"></p><h2 id="安装插件-1"><a href="#安装插件-1" class="headerlink" title="Installing plugins"></a>Installing plugins</h2><h3 id="安装Marvel插件-1"><a href="#安装Marvel插件-1" class="headerlink" title="Installing the Marvel plugin"></a>Installing the Marvel plugin</h3><blockquote><p>The Marvel plugin was already installed into es above; now install it into Kibana as well.</p></blockquote><h4 id="下载"><a href="#下载" class="headerlink" title="Download"></a>Download</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">[hadoop@U006 kibana-4.6.2-linux-x86_64]$ bin/kibana plugin --install elasticsearch/marvel/latest</div><div class="line">Installing marvel</div><div class="line">Attempting to transfer from https://download.elastic.co/elasticsearch/marvel/marvel-latest.tar.gz</div><div class="line">.....</div><div class="line">Transfer complete</div><div class="line">Extracting plugin archive</div><div class="line">Extraction complete</div><div class="line">Optimizing and caching browser bundles...</div><div class="line">Plugin installation complete</div></pre></td></tr></table></figure><h4 id="启动验证"><a href="#启动验证" class="headerlink" title="Start and verify"></a>Start and verify</h4><figure class="highlight 
plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">bin/elasticsearch</div><div class="line">bin/kibana</div></pre></td></tr></table></figure><p>Open the <a href="http://10.10.25.13:5601/app/marvel" target="_blank" rel="external">http://10.10.25.13:5601/app/marvel</a> page:<br><img src="../images/elk/kibana-marvel1.png" alt="kibana-marvel1"><br><img src="../images/elk/kibana-marvel2.png" alt="kibana-marvel2"></p><h3 id="安装sense插件"><a href="#安装sense插件" class="headerlink" title="安装sense插件"></a>Installing the Sense plugin</h3><blockquote><p>Sense is an Elasticsearch query console that runs as a Kibana app.<br>It supports auto-completion for the Elasticsearch query language and for index structure, two themes, query history, and keyboard shortcuts.</p></blockquote><h4 id="下载-1"><a href="#下载-1" class="headerlink" title="下载"></a>Download</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">[hadoop@U006 kibana-4.6.2-linux-x86_64]$ ./bin/kibana plugin --install elastic/sense</div><div class="line">Installing sense</div><div class="line">Attempting to transfer from https://download.elastic.co/elastic/sense/sense-latest.tar.gz</div><div class="line">.....</div><div class="line">Transfer complete</div><div class="line">Extracting plugin archive</div><div class="line">Extraction complete</div><div class="line">Optimizing and caching browser bundles...</div><div class="line">Plugin installation complete</div></pre></td></tr></table></figure><h4 id="使用说明-2"><a href="#使用说明-2" class="headerlink" title="使用说明"></a>Usage notes</h4><p>Start Elasticsearch and Kibana,<br>then open the <a href="http://10.10.25.13:5601/app/sense" target="_blank" rel="external">http://10.10.25.13:5601/app/sense</a> page:<br><img src="../images/elk/kibana-sense.png" alt="kibana-sense"></p><h1 id="安装logstash"><a href="#安装logstash" 
class="headerlink" title="安装logstash"></a>Installing Logstash</h1><h2 id="解压安装即可"><a href="#解压安装即可" class="headerlink" title="解压安装即可"></a>Extract to install</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">tar -zxvf logstash-2.4.0.tar.gz -C /opt/elk/</div></pre></td></tr></table></figure><h2 id="配置logstash的配置文件"><a href="#配置logstash的配置文件" class="headerlink" title="配置logstash的配置文件"></a>Writing the Logstash configuration file</h2><p>In this file the input receives logs from log4j, and the output sends them to the Elasticsearch cluster for searching.</p><p>See the official documentation for what each field means: <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-log4j.html" target="_blank" rel="external">https://www.elastic.co/guide/en/logstash/current/plugins-inputs-log4j.html</a></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div></pre></td><td class="code"><pre><div class="line">[hadoop@U006 logstash-2.4.0]$ vim logstash_log4j_to_es.conf</div><div class="line">input {</div><div class="line"> log4j {</div><div class="line"> mode => "server"</div><div class="line"> host => "10.10.25.13"</div><div class="line"> port => 4567</div><div class="line"> type => "log4j"</div><div class="line"> }</div><div class="line">}</div><div class="line">output{</div><div class="line"> elasticsearch{</div><div class="line"> action => "index"</div><div class="line"> hosts => "10.10.25.13:9200"</div><div class="line"> index => "test_log"</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure><p><img src="../images/elk/logstash-file1.png" alt="logstash-file1"></p><h2 id="启动logstash"><a 
href="#启动logstash" class="headerlink" title="启动logstash"></a>Starting Logstash</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line">bin/logstash agent -f logstash_log4j_to_es.conf</div><div class="line"># or</div><div class="line">bin/logstash -f logstash_log4j_to_es.conf</div><div class="line"></div><div class="line">[hadoop@U006 logstash-2.4.0]$ bin/logstash -f logstash_log4j_to_es.conf</div><div class="line">Settings: Default pipeline workers: 24</div><div class="line">log4j:WARN No appenders could be found for logger (org.apache.http.client.protocol.RequestAuthCache).</div><div class="line">log4j:WARN Please initialize the log4j system properly.</div><div class="line">log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.</div><div class="line">Pipeline main started</div></pre></td></tr></table></figure><h1 id="ELK框架的综合实际使用"><a href="#ELK框架的综合实际使用" class="headerlink" title="ELK框架的综合实际使用"></a>Using the ELK stack end to end</h1><blockquote><p>At work we use ELK to collect and analyse business logs.<br>Logstash collects the project's log4j logs and outputs them to Elasticsearch for indexing and search, and Kibana then provides visual search and display.</p></blockquote><h2 id="启动elk"><a href="#启动elk" class="headerlink" title="启动elk"></a>Starting ELK</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">bin/elasticsearch</div><div class="line">bin/kibana</div><div class="line">bin/logstash -f logstash_log4j_to_es.conf</div></pre></td></tr></table></figure><p>Open: <a href="http://10.10.25.13:5601" target="_blank" rel="external">http://10.10.25.13:5601</a><br><img src="../images/elk/kibana-index.png" 
alt="kibana-index"></p><p>As the screenshot shows, there is a warning: no default index pattern exists, and one must be created to continue.<br>So let's create one.</p><h2 id="创建索引"><a href="#创建索引" class="headerlink" title="创建索引"></a>Creating an index pattern</h2><p>The Kibana index pattern can only be configured after the first log entry has reached Elasticsearch through Logstash.</p><p>1. Under "Index name or pattern", enter an Elasticsearch index name, i.e. the value of index in the output block of the Logstash configuration file; here that means changing "logstash-*" to "test_log".<br>2. For "Time-field name", keep the default: "@timestamp".<br>3. Click "Create".</p><p><img src="../images/elk/kibana-index2.png" alt="kibana-index2"></p><h2 id="log4j日志接入"><a href="#log4j日志接入" class="headerlink" title="log4j日志接入"></a>Hooking up log4j logs</h2><h3 id="编写log4j的测试代码"><a href="#编写log4j的测试代码" class="headerlink" title="编写log4j的测试代码"></a>Writing log4j test code</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div></pre></td><td class="code"><pre><div class="line">// Assumed imports: JUnit 4 and SLF4J (bound to log4j 1.2)</div><div class="line">import org.junit.Test;</div><div class="line">import org.slf4j.Logger;</div><div class="line">import org.slf4j.LoggerFactory;</div><div class="line"></div><div class="line">public class TestFunc {</div><div class="line"></div><div class="line">    Logger logger = LoggerFactory.getLogger(TestFunc.class);</div><div class="line"></div><div class="line">    @Test</div><div class="line">    public void testlog4j() throws Exception {</div><div class="line">        while (true) {</div><div class="line">            long s_time = System.currentTimeMillis();</div><div class="line">            logger.info("当前时间戳: "+s_time+" i am info info hadoop");</div><div class="line">            logger.warn("当前时间戳: "+s_time+" i am info warn spark hushiwei nice");</div><div class="line">            logger.error("当前时间戳: "+s_time+" i am info error elk");</div><div class="line">            Thread.sleep(1000L);</div><div class="line"></div><div class="line">        }</div><div class="line">    }</div><div class="line"> 
}</div></pre></td></tr></table></figure><h3 id="log4j的配置"><a href="#log4j的配置" class="headerlink" title="log4j的配置"></a>log4j configuration</h3><p>log4j.properties<br>RemoteHost is the address of the Logstash server, i.e. the host given in the input block.<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">log4j.rootLogger=INFO,socket</div><div class="line">log4j.appender.socket=org.apache.log4j.net.SocketAppender</div><div class="line">log4j.appender.socket.RemoteHost=10.10.25.13</div><div class="line">log4j.appender.socket.Port=4567</div><div class="line">log4j.appender.socket.LocationInfo=true</div></pre></td></tr></table></figure></p><h2 id="执行代码-输出log4j日志-观察kibana页面变化"><a href="#执行代码-输出log4j日志-观察kibana页面变化" class="headerlink" title="执行代码,输出log4j日志,观察kibana页面变化"></a>Run the code to emit log4j logs and watch the Kibana page change</h2><p>The indexed log entries:<br><img src="../images/elk/kibana-index3.png" alt="kibana-index3"></p>]]></content>
<summary type="html">
<blockquote>
<p>Notes on setting up ELK, plus learning resources.</p>
</blockquote>
<h1 id="ELK学习资料"><a href="#ELK学习资料" class="headerlink" title="ELK学习资料"></a>ELK learning resources</h1><ul>
<li><a href="https://www.gitbook.com/book/chenryn/elk-stack-guide-cn/details">ELKstack 中文指南</a></li>
<li><a href="https://www.gitbook.com/book/looly/elasticsearch-the-definitive-guide-cn/details">Elasticsearch权威指南(中文版)</a></li>
<li>…..</li>
</ul>
<h1 id="ELK下载"><a href="#ELK下载" class="headerlink" title="ELK下载"></a>ELK downloads</h1><p>Past releases: <a href="https://www.elastic.co/downloads/past-releases">https://www.elastic.co/downloads/past-releases</a></p>
<ul>
<li>elasticsearch : 2.4.0</li>
<li>logstash : 2.4.0</li>
<li>kibana : 4.6.2
</summary>
<category term="elk" scheme="http://dmlcoding.com/categories/elk/"/>
<category term="elasticsearch" scheme="http://dmlcoding.com/tags/elasticsearch/"/>
<category term="logstash" scheme="http://dmlcoding.com/tags/logstash/"/>
<category term="kibana" scheme="http://dmlcoding.com/tags/kibana/"/>
</entry>
<entry>
<title>Detecting a file's encoding and reading its contents in Python</title>
<link href="http://dmlcoding.com/2017/PythonReadFiles/"/>
<id>http://dmlcoding.com/2017/PythonReadFiles/</id>
<published>2017-09-13T07:24:00.000Z</published>
<updated>2017-09-14T07:18:57.000Z</updated>
<content type="html"><![CDATA[<blockquote><p>How do you read a file's contents correctly when you do not know its encoding?</p></blockquote><h1 id="报错日志"><a href="#报错日志" class="headerlink" title="报错日志"></a>Error log</h1><p>Reading a file blindly without knowing its encoding sometimes fails like this:<br><a id="more"></a></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">UnicodeDecodeError: 'gbk' codec can't decode byte</div><div class="line">......</div></pre></td></tr></table></figure><h1 id="解决办法"><a href="#解决办法" class="headerlink" title="解决办法"></a>Solution</h1><p>The cause is a wrong encoding: first detect the file's encoding, then decode the raw bytes with it, and finally return the text.<br>The third-party library <code>chardet</code> can do the detection.<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line"># Install chardet</div><div class="line">pip install chardet</div></pre></td></tr></table></figure></p><h1 id="正确实例"><a href="#正确实例" class="headerlink" title="正确实例"></a>Working example</h1><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">import chardet</div><div class="line">with open("your_file", 'rb') as fp:</div><div class="line">    file_data = fp.read()</div><div class="line">    result = chardet.detect(file_data)</div><div class="line">    # chardet reports None when it cannot guess; the UTF-8 fallback is an assumption</div><div class="line">    encoding = result['encoding'] or 'utf-8'</div><div class="line">    file_content = file_data.decode(encoding)</div></pre></td></tr></table></figure>]]></content>
<summary type="html">
<blockquote>
<p>How do you read a file's contents correctly when you do not know its encoding?</p>
</blockquote>
<h1 id="报错日志"><a href="#报错日志" class="headerlink" title="报错日志"></a>Error log</h1><p>Reading a file blindly without knowing its encoding sometimes fails like this:<br>
</summary>
<category term="python" scheme="http://dmlcoding.com/categories/python/"/>
<category term="python" scheme="http://dmlcoding.com/tags/python/"/>
</entry>
<entry>
<title>Problems encountered in Spark development</title>
<link href="http://dmlcoding.com/2017/SparkBug/"/>
<id>http://dmlcoding.com/2017/SparkBug/</id>
<published>2017-09-12T02:00:00.000Z</published>
<updated>2017-10-13T02:30:51.000Z</updated>
<content type="html"><![CDATA[<h1 id="spark连接mysql"><a href="#spark连接mysql" class="headerlink" title="spark连接mysql"></a>Connecting Spark to MySQL</h1><h2 id="问题描述"><a href="#问题描述" class="headerlink" title="问题描述"></a>Problem</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">Errors such as "no suitable driver", or ones mentioning jdbc.mysql.driver, keep appearing</div></pre></td></tr></table></figure><a id="more"></a><h2 id="解决办法1"><a href="#解决办法1" class="headerlink" title="解决办法1"></a>Solution 1</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">1. Pass this when submitting the job to point at the MySQL connector jar manually</div><div class="line"> SPARK_CLASSPATH=/usr/local/spark-1.4.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.38.jar ./bin/spark-submit --class sparkDemo /root/data/demon-parent-1.0-SNAPSHOT-jar-with-dependencies.jar hdfs://192.168.119.100:9000/examples/custom.txt</div></pre></td></tr></table></figure><h2 id="解决办法2"><a href="#解决办法2" class="headerlink" title="解决办法2"></a>Solution 2</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">Add this line to SPARK_HOME/conf/spark-env.sh and the error goes away</div><div class="line"></div><div class="line">export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/hdp/2.4.0.0-169/spark/lib/mysql-connector-java-5.1.38.jar</div></pre></td></tr></table></figure><h1 id="在spark中使用hive抛出错误"><a href="#在spark中使用hive抛出错误" class="headerlink" title="在spark中使用hive抛出错误"></a>Errors when using Hive from Spark</h1><h1 id="报错日志"><a href="#报错日志" class="headerlink" title="报错日志"></a>Error log</h1><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div 
class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line">17/08/09 12:11:51 WARN DataNucleus.Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator</div><div class="line">ClassLoaderResolver for class "" gave error on creation : {1}</div><div class="line">org.datanucleus.exceptions.NucleusUserException: ClassLoaderResolver for class "" gave error on creation : {1}</div><div class="line">at org.datanucleus.NucleusContext.getClassLoaderResolver(NucleusContext.java:1087)</div><div class="line">at org.datanucleus.PersistenceConfiguration.validatePropertyValue(PersistenceConfiguration.java:797)</div><div class="line">at org.datanucleus.PersistenceConfiguration.setProperty(PersistenceConfiguration.java:714)</div><div class="line">at org.datanucleus.PersistenceConfiguration.setPersistenceProperties(PersistenceConfiguration.java:693)</div><div class="line">at org.datanucleus.NucleusContext.<init>(NucleusContext.java:273)</div><div class="line">at org.datanucleus.NucleusContext.<init>(NucleusContext.java:247)</div><div class="line">at org.datanucleus.NucleusContext.<init>(NucleusContext.java:225)</div><div class="line">at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:416)</div><div class="line">at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:301)</div><div class="line">at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)</div></pre></td></tr></table></figure><h1 id="问题分析"><a href="#问题分析" class="headerlink" title="问题分析"></a>Analysis</h1><p>The log suggests that some Hive-related jars are missing; a web search pointed to the following ones:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div 
class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">[hadoop@U007 lib]$ pwd</div><div class="line">/opt/spark-1.6.0/lib</div><div class="line">[hadoop@U007 lib]$ ll</div><div class="line">total 305220</div><div class="line">-rw-r--r-- 1 hadoop hadoop 339666 Apr 15 2016 datanucleus-api-jdo-3.2.6.jar</div><div class="line">-rw-r--r-- 1 hadoop hadoop 1890075 Apr 15 2016 datanucleus-core-3.2.10.jar</div><div class="line">-rw-r--r-- 1 hadoop hadoop 1809447 Apr 15 2016 datanucleus-rdbms-3.2.9.jar</div><div class="line">...</div></pre></td></tr></table></figure></p><p>So when submitting the Spark job, add these jars to the classpath.</p><h1 id="解决办法"><a href="#解决办法" class="headerlink" title="解决办法"></a>Solution</h1><p>Add these jars and the hive-site.xml file to the script that submits the Spark job,<br>as follows:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">nohup spark-submit \</div><div class="line"> --master yarn \</div><div class="line"> --deploy-mode cluster \</div><div class="line"> --class ${className} \</div><div class="line"> --driver-memory 4g \</div><div class="line"> --executor-memory 2g \</div><div class="line"> --executor-cores 4 \</div><div class="line"> --num-executors 4 \</div><div class="line"> --jars ./lib/datanucleus-api-jdo-3.2.6.jar,./lib/datanucleus-core-3.2.10.jar,./lib/datanucleus-rdbms-3.2.9.jar \</div><div class="line"> --files ./lib/hive-site.xml \</div><div class="line"> ./app-jar-with-dependencies.jar \</div></pre></td></tr></table></figure></p><p>Adding --jars and --files is all that is needed.</p><h1 
id="在spark中将数据插入hive动态分区"><a href="#在spark中将数据插入hive动态分区" class="headerlink" title="在spark中将数据插入hive动态分区"></a>Inserting data into Hive dynamic partitions from Spark</h1><h2 id="问题描述-1"><a href="#问题描述-1" class="headerlink" title="问题描述"></a>Problem</h2><p>Submitting the job in standalone or yarn-client mode raises no error, but in yarn-cluster mode it sometimes fails with the error below.</p><h2 id="报错日志-1"><a href="#报错日志-1" class="headerlink" title="报错日志"></a>Error log</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div></pre></td><td class="code"><pre><div class="line">17/08/09 10:08:01 ERROR scheduler.JobScheduler: Error running job streaming job 1502188440000 ms.0</div><div class="line">java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(org.apache.hadoop.fs.Path, java.lang.String, java.util.Map, boolean, int, boolean, boolean)</div><div class="line">at java.lang.Class.getMethod(Class.java:1670)</div><div class="line">at org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:114)</div><div class="line">at org.apache.spark.sql.hive.client.Shim_v0_12.loadDynamicPartitionsMethod$lzycompute(HiveShim.scala:168)</div><div class="line">at org.apache.spark.sql.hive.client.Shim_v0_12.loadDynamicPartitionsMethod(HiveShim.scala:167)</div><div class="line">at 
org.apache.spark.sql.hive.client.Shim_v0_12.loadDynamicPartitions(HiveShim.scala:261)</div><div class="line">at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:560)</div><div class="line">at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:560)</div><div class="line">at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:560)</div><div class="line">at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)</div><div class="line">at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)</div><div class="line">at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)</div><div class="line">at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)</div><div class="line">at org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:559)</div><div class="line">at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)</div><div class="line">at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)</div><div class="line">at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)</div><div class="line">at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)</div><div class="line">at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)</div><div class="line">at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)</div><div class="line">at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)</div><div class="line">at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)</div><div class="line">at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)</div><div class="line">at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)</div><div class="line">at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)</div><div class="line">at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)</div><div class="line">at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)</div></pre></td></tr></table></figure><h2 id="分析"><a href="#分析" class="headerlink" title="分析"></a>Analysis</h2><p>In client mode the job was submitted from node 13 and ran without problems.<br>In cluster mode it failed only some of the time.<br>That raised the question: why would cluster mode fail only intermittently?</p><p>I then submitted in client mode from node 14 and, as expected, nearly every job threw the error.<br>That narrowed the problem down: nodes other than 13 were missing some jars.<br>The exception is java.lang.NoSuchMethodException, which points to the expected Hive jars being absent from the classpath (or a different version being loaded).<br>It also explains the intermittent cluster-mode failures: the job succeeded whenever the driver happened to run on node 13.</p><p>What remained was to find out which jars the other machines lacked.</p><p>In the Spark configuration I then found this setting:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">spark.sql.hive.metastore.jars :/usr/lib/hive/lib/*:/opt/spark-1.6.0/lib/spark-assembly-1.6.0-hadoop2.4.0.jar</div></pre></td></tr></table></figure></p><p>Node 13 has the /usr/lib/hive/lib/* path,<br>but nodes 14 and 15 do not.<br>That was the cause.</p><h2 id="解决办法1-1"><a href="#解决办法1-1" class="headerlink" title="解决办法1"></a>Solution 1</h2><p>Copy /usr/lib/hive/lib/* from node 13 to nodes 14 and 15 so that each has its own copy. The jars can then be found wherever the driver runs.<br>That fixed it.<br>The lesson: when you hit a problem, analyse it step by step instead of flailing around.<br>None of the fixes I found online solved this one. Each problem needs its own analysis; once the root cause is found, the bug is easy to fix.</p><h2 id="解决办法2-1"><a href="#解决办法2-1" class="headerlink" title="解决办法2"></a>Solution 2</h2><p>Remove <code>spark.sql.hive.metastore.jars</code> from the Spark configuration. Copying Hive's dependencies to every node does not scale, and after a Hive upgrade the jars would have to be replaced everywhere, so the approach below is better.</p><p>Add the Hive dependencies to the pom file:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div 
class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div></pre></td><td class="code"><pre><div class="line"><!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 --></div><div class="line"><dependency></div><div class="line"> <groupId>org.apache.spark</groupId></div><div class="line"> <artifactId>spark-core_2.10</artifactId></div><div class="line"> <version>1.6.0</version></div><div class="line"></dependency></div><div class="line"><!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10 --></div><div class="line"><dependency></div><div class="line"> <groupId>org.apache.spark</groupId></div><div class="line"> <artifactId>spark-mllib_2.10</artifactId></div><div class="line"> <version>1.6.0</version></div><div class="line"> <scope>provided</scope></div><div 
class="line"></dependency></div><div class="line"></div><div class="line"><dependency></div><div class="line"> <groupId>org.apache.spark</groupId></div><div class="line"> <artifactId>spark-streaming_2.10</artifactId></div><div class="line"> <version>1.6.0</version></div><div class="line"></dependency></div><div class="line"><dependency></div><div class="line"> <groupId>org.apache.spark</groupId></div><div class="line"> <artifactId>spark-streaming-kafka_2.10</artifactId></div><div class="line"> <version>1.6.0</version></div><div class="line"></dependency></div><div class="line"><dependency></div><div class="line"> <groupId>org.apache.spark</groupId></div><div class="line"> <artifactId>spark-sql_2.10</artifactId></div><div class="line"> <version>1.6.0</version></div><div class="line"></dependency></div><div class="line"><dependency></div><div class="line"> <groupId>org.apache.spark</groupId></div><div class="line"> <artifactId>spark-hive_2.10</artifactId></div><div class="line"> <version>1.6.0</version></div><div class="line"></dependency></div><div class="line"><dependency></div><div class="line"> <groupId>mysql</groupId></div><div class="line"> <artifactId>mysql-connector-java</artifactId></div><div class="line"> <version>5.1.32</version></div><div class="line"></dependency></div><div class="line"><dependency></div><div class="line"> <groupId>org.apache.hive</groupId></div><div class="line"> <artifactId>hive-jdbc</artifactId></div><div class="line"> <version>0.13.1</version></div><div class="line"></dependency></div><div class="line"><!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec --></div><div class="line"><dependency></div><div class="line"> <groupId>org.apache.hive</groupId></div><div class="line"> <artifactId>hive-exec</artifactId></div><div class="line"> <version>0.13.1</version></div><div class="line"></dependency></div></pre></td></tr></table></figure><h1 id="sparkstreaming读取kafka数据"><a href="#sparkstreaming读取kafka数据" class="headerlink" 
title="sparkstreaming读取kafka数据"></a>Reading Kafka data with Spark Streaming</h1><h2 id="问题描述Couldn’t-find-leaders-for-Set"><a href="#问题描述Couldn’t-find-leaders-for-Set" class="headerlink" title="问题描述Couldn’t find leaders for Set"></a>Problem: Couldn’t find leaders for Set</h2><p>A Spark Streaming job reading from Kafka threw the exception named in the heading while running.<br>Monitoring showed that one broker had gone down. But the topic's replication factor was 2, so a replica should have taken over when one broker died.<br>It turned out that some partition replicas had not kept in sync with the leader, in other words they had "fallen behind".</p><p>To check whether a topic has replicas that have fallen behind:<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">kafka-topics.sh --describe --zookeeper XXX --topic XXX</div></pre></td></tr></table></figure></p><h2 id="报错日志-2"><a href="#报错日志-2" class="headerlink" title="报错日志"></a>Error log</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line">17/10/13 09:41:13 ERROR DirectKafkaInputDStream: ArrayBuffer(java.nio.channels.ClosedChannelException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([dsp_request_event,2]))</div><div class="line">17/10/13 09:41:13 ERROR StreamingContext: Error starting the context, marking it as stopped</div><div class="line">org.apache.spark.SparkException: ArrayBuffer(java.nio.channels.ClosedChannelException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([dsp_request_event,2]))</div><div class="line">at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)</div><div class="line">at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)</div><div class="line">at 
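org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)</div><div class="line">at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)</div></pre></td></tr></table></figure><h2 id="解决办法-1"><a href="#解决办法-1" class="headerlink" title="解决办法"></a>Solution</h2><p>In the output of the describe command, compare Replicas with Isr: if Isr lists fewer brokers than Replicas, that partition has a replica that has fallen behind.<br>Fix:<br>Increase num.replica.fetchers. This broker setting is the number of threads replicas use to fetch data from the leader; the default is 1, and raising it increases replication I/O parallelism. In testing, replicas no longer fell behind after the increase.<br>How to confirm the problem is solved:<br>Start the affected Spark Streaming job and, while it is computing normally, kill any one broker and watch what happens. Before the fetcher thread count was increased, Spark Streaming threw the same exception after the kill; after the increase the job kept running normally, so the problem is solved.</p><p>Reference: <a href="http://blog.csdn.net/yanshu2012/article/details/53995159" target="_blank" rel="external">http://blog.csdn.net/yanshu2012/article/details/53995159</a></p>]]></content>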
<summary type="html">
<h1 id="spark连接mysql"><a href="#spark连接mysql" class="headerlink" title="spark连接mysql"></a>Connecting Spark to MySQL</h1><h2 id="问题描述"><a href="#问题描述" class="headerlink" title="问题描述"></a>Problem</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">Errors such as "no suitable driver", or ones mentioning jdbc.mysql.driver, keep appearing</div></pre></td></tr></table></figure>
</summary>
<category term="spark" scheme="http://dmlcoding.com/categories/spark/"/>
<category term="kafka" scheme="http://dmlcoding.com/tags/kafka/"/>
<category term="spark" scheme="http://dmlcoding.com/tags/spark/"/>
<category term="mysql" scheme="http://dmlcoding.com/tags/mysql/"/>
<category term="hive" scheme="http://dmlcoding.com/tags/hive/"/>
</entry>
<entry>
<title>Book of Songs (诗经): Fa Tan (伐檀)</title>
<link href="http://dmlcoding.com/2017/fatan/"/>
<id>http://dmlcoding.com/2017/fatan/</id>
<published>2017-08-23T12:30:00.000Z</published>
<updated>2017-08-24T13:00:16.000Z</updated>
<content type="html"><![CDATA[<p><img src="/images/beautifulPic/fatan.png" alt="fatan"></p><a id="more"></a><blockquote><p>抄一首诗,换个心情…</p></blockquote><h1 id="伐檀"><a href="#伐檀" class="headerlink" title="伐檀"></a>伐檀</h1><p>坎坎伐檀兮,置之河之干兮,河水清且涟猗。<br>不稼不穑,胡取禾三百廛[1]兮?<br>不狩不猎,胡瞻尔庭有县[2]貆兮?<br>彼君子兮,不素餐兮!</p><p>坎坎伐辐兮,置之河之侧兮,河水清且直猗。<br>不稼不穑,胡取禾三百亿兮?<br>不狩不猎,胡瞻尔庭有县特[3]兮?<br>彼君子兮,不素食兮!</p><p>坎坎伐轮兮,置之河之漘[4]兮,河水清且沦猗。<br>不稼不穑,胡取禾三百囷[5]兮?<br>不狩不猎,胡瞻尔庭有县鹑兮?<br>彼君子兮,不素飧兮!</p><h1 id="伐檀白话"><a href="#伐檀白话" class="headerlink" title="伐檀白话"></a>伐檀白话</h1><p>砍伐檀树声坎坎啊,<br>棵棵放倒堆河边啊,<br>河水清清微波转哟。<br>不播种来不收割,<br>为何三百捆禾往家搬啊?<br>不冬狩来不夜猎,<br>为何见你庭院猪獾悬啊?<br>那些老爷君子啊,<br>不会白吃闲饭啊!</p><p>砍下檀树做车辐啊,<br>放在河边堆一处啊。<br>河水清清直流注哟。<br>不播种来不收割,<br>为何三百捆禾要独取啊?<br>不冬狩来不夜猎,<br>为何见你庭院兽悬柱啊?<br>那些老爷君子啊,<br>不会白吃饱腹啊!</p><p>砍下檀树做车轮啊,<br>棵棵放倒河边屯啊。<br>河水清清起波纹啊。<br>不播种来不收割,<br>为何三百捆禾要独吞啊?<br>不冬狩来不夜猎,<br>为何见你庭院挂鹌鹑啊?<br>那些老爷君子啊,<br>可不白吃腥荤啊!</p>]]></content>
<summary type="html">
<p><img src="/images/beautifulPic/fatan.png" alt="fatan"></p>
</summary>
<category term="book" scheme="http://dmlcoding.com/categories/book/"/>
<category term="book" scheme="http://dmlcoding.com/tags/book/"/>
<category term="think" scheme="http://dmlcoding.com/tags/think/"/>
</entry>
<entry>
<title>Using ssh in iTerm2 to access Linux (including bastion hosts)</title>
<link href="http://dmlcoding.com/2017/MacSsh/"/>
<id>http://dmlcoding.com/2017/MacSsh/</id>
<published>2017-08-17T02:00:00.000Z</published>
<updated>2017-08-18T04:01:28.000Z</updated>
<content type="html"><![CDATA[<p>macOS has no Xshell, and although SecureCRT is available, it really is too ugly. I prefer to use iTerm2 for remote connections.<br>That inevitably means dealing with remote passwords, and typing one in every time is too much trouble.<br><a id="more"></a></p><h1 id="Iterm2下使用ssh访问Linux"><a href="#Iterm2下使用ssh访问Linux" class="headerlink" title="Using ssh in iTerm2 to access Linux"></a>Using ssh in iTerm2 to access Linux</h1><p>Normally, iTerm2 reaches a remote Linux host with the ssh command, like this:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">ssh <username>@<ip></div></pre></td></tr></table></figure></p><p>Then enter the password to log in. If the remote SSH port is not the default 22, add the <code>-p</code> option followed by the remote port.<br>Obviously, having to type the password every time is quite inconvenient during development.</p><p>There are two ways to log in without typing the password. Both combine iTerm2's Profiles feature with a script.</p><h2 id="方式1-使用spawn脚本文件"><a href="#方式1-使用spawn脚本文件" class="headerlink" title="Method 1: use a spawn script file"></a>Method 1: use a spawn script file</h2><p>Write the remote login details into a script, then call it from a Profile.<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">cd /Users/hushiwei/.ssh/</div><div class="line">$ touch filename</div></pre></td></tr></table></figure></p><p>Script contents<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">#!/usr/bin/expect -f</div><div class="line"> set user <username></div><div class="line"> set host <ip address></div><div class="line"> set password <password></div><div class="line"> set timeout -1</div><div class="line"></div><div class="line"> spawn ssh $user@$host</div><div class="line"> expect "*assword:*"</div><div class="line"> send "$password\r"</div><div class="line"> interact</div><div class="line"> expect eof</div></pre></td></tr></table></figure></p><p>How is it called?<br>Put the command in the profile's Command field; the screenshot further down shows where that field is.<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">expect <full path to the saved script></div></pre></td></tr></table></figure></p><h2 id="方式2-使用sshpass-推荐方式"><a href="#方式2-使用sshpass-推荐方式" class="headerlink" title="Method 2: use sshpass (recommended)"></a>Method 2: use sshpass (recommended)</h2><p>Install sshpass with brew<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">brew install https://raw.githubusercontent.com/kadwanev/bigboybrew/master/Library/Formula/sshpass.rb</div></pre></td></tr></table></figure></p><p>Then write the password into a file<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">hushiwei@localhost ~/sshpass pwd</div><div class="line">/Users/hushiwei/sshpass</div><div class="line">hushiwei@localhost ~/sshpass more pass</div><div class="line">passwd123</div></pre></td></tr></table></figure></p><p>Configure the profile as shown in the screenshot<br><img src="/images/pics/sshpass.png" alt="sshpass"></p><p>In the Command field, enter<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">/usr/local/bin/sshpass -f /Users/hushiwei/sshpass/pass ssh -p22 <username>@<ip></div></pre></td></tr></table></figure></p><p>Then choose Profiles in the iTerm2 menu bar and click the profile you just configured to log in to the server automatically, without typing a password.</p><ul><li>Note: log in once from the command line first (this lets you accept the server's host key, which sshpass cannot do for you)</li></ul><h1 id="iterm2登录堡垒机"><a href="#iterm2登录堡垒机" class="headerlink" title="Logging in to a bastion host with iTerm2"></a>Logging in to a bastion host with iTerm2</h1><p>Log in to a server (possibly a bastion host) over SSH with a key file (.pem format)</p><h2 id="首先修改下密钥文件权限"><a href="#首先修改下密钥文件权限" class="headerlink" title="First, fix the key file permissions"></a>First, fix the key file permissions</h2><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">sudo chmod 600 /Users/hushiwei/sshpass/Jumpserver/hushiwei.pem</div></pre></td></tr></table></figure><h2 id="其次,终端可直接命令连接"><a href="#其次,终端可直接命令连接" class="headerlink" title="Next, connect directly from the terminal"></a>Next, connect directly from the terminal</h2><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">ssh -i /Users/hushiwei/sshpass/Jumpserver/hushiwei.pem hushiwei@xxx.xxx.xxx.xxx</div></pre></td></tr></table></figure><p>Note: on the first connection a prompt for the key file's passphrase pops up; you can enter it and save it.</p><p>Besides connecting directly from the command line, you can also set this up with the Profiles feature described above and call it from a Profile. A simple script follows:</p><h2 id="配置Profile脚本自动登录堡垒机"><a href="#配置Profile脚本自动登录堡垒机" class="headerlink" title="Configuring a Profile script to log in to the bastion host automatically"></a>Configuring a Profile script to log in to the bastion host automatically</h2><p>Script file (vim jumpserver)<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">hushiwei@localhost ~/sshpass/Jumpserver more jumpserver</div><div class="line">#!/usr/bin/expect -f</div><div class="line"> set user hushiwei</div><div class="line"> set host xxx.xxx.xxx.xxx</div><div class="line"> set empath /Users/hushiwei/sshpass/Jumpserver/hushiwei.pem</div><div class="line"> set timeout -1</div><div class="line"></div><div class="line"> spawn ssh -i $empath $user@$host</div><div class="line"> interact</div><div class="line"> expect eof</div></pre></td></tr></table></figure></p><p>Run it from the command line<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">hushiwei@localhost ~/sshpass/Jumpserver expect /Users/hushiwei/sshpass/Jumpserver/jumpserver</div><div class="line">spawn ssh -i /Users/hushiwei/sshpass/Jumpserver/hushiwei.pem hushiwei@xxx.xxx.xxx.xxx</div><div class="line">Last login: Fri Aug 18 11:06:16 2017 from xxx.xxx.xxx.xxx</div><div class="line"></div><div class="line">### 欢迎使用Jumpserver开源跳板机系统 ###</div><div class="line"></div><div class="line"> 1) 输入 ID 直接登录 或 输入部分 IP,主机名,备注 进行搜索登录(如果唯一).</div><div class="line"> 2) 输入 / + IP, 主机名 or 备注 搜索. 如: /ip</div><div class="line"> 3) 输入 P/p 显示您有权限的主机.</div><div class="line"> 4) 输入 G/g 显示您有权限的主机组.</div><div class="line"> 5) 输入 G/g + 组ID 显示该组下主机. 如: g1</div><div class="line"> 6) 输入 E/e 批量执行命令.</div><div class="line"> 7) 输入 U/u 批量上传文件.</div><div class="line"> 8) 输入 D/d 批量下载文件.</div><div class="line"> 9) 输入 H/h 帮助.</div><div class="line"> 0) 输入 Q/q 退出.</div><div class="line"></div><div class="line">Opt or ID>:</div></pre></td></tr></table></figure></p><p>Set it up as in the Profile instructions above, then simply call it from the Profile<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line"># Enter the following in the Command field</div><div class="line">expect /Users/hushiwei/sshpass/Jumpserver/jumpserver</div></pre></td></tr></table></figure></p>]]></content>
<summary type="html">
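<p>macOS has no Xshell, and although SecureCRT is available, it really is too ugly. I prefer to use iTerm2 for remote connections.<br>That inevitably means dealing with remote passwords, and typing one in every time is too much trouble.</p>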
</summary>
<category term="mac" scheme="http://dmlcoding.com/categories/mac/"/>
<category term="mac" scheme="http://dmlcoding.com/tags/mac/"/>
<category term="ssh" scheme="http://dmlcoding.com/tags/ssh/"/>
<category term="iterm2" scheme="http://dmlcoding.com/tags/iterm2/"/>
</entry>
<entry>
<title>A script for monitoring Spark Streaming applications</title>
<link href="http://dmlcoding.com/2017/MonitorSparkStreamingOnYarn/"/>
<id>http://dmlcoding.com/2017/MonitorSparkStreamingOnYarn/</id>
<published>2017-08-15T02:00:00.000Z</published>
<updated>2018-01-16T02:26:33.230Z</updated>
<content type="html"><![CDATA[<p>Spark on YARN is very stable and rarely runs into problems, but our Spark Streaming programs run continuously to produce real-time reports.<br>So we have to monitor them and restart them promptly whenever one exits.<br>For this, I decided to call YARN's REST API to list the applications submitted to YARN.</p><a id="more"></a><h1 id="思路"><a href="#思路" class="headerlink" title="Approach"></a>Approach</h1><p>Call the REST API provided by YARN to get all running applications<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">curl --compressed -H "Accept: application/json" -X GET "http://master:8088/ws/v1/cluster/apps?states=RUNNING"</div></pre></td></tr></table></figure></p><p>If you are interested in the other endpoints, see the official documentation.</p><ul><li><a href="https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html" target="_blank" rel="external">https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html</a></li><li><a href="http://www.winseliu.com/blog/2014/12/07/hadoop-mr-rest-api/" target="_blank" rel="external">http://www.winseliu.com/blog/2014/12/07/hadoop-mr-rest-api/</a></li></ul><h1 id="脚本"><a href="#脚本" class="headerlink" title="Script"></a>Script</h1><p>The script is simple; a quick read is enough to follow it.</p><figure class="highlight plain"><table><tr><td class="code"><pre><div
class="line"># -*- coding: utf-8 -*-</div><div
class="line">'''</div><div
class="line">    Created by hushiwei on 2018/1/5.</div><div
class="line">    Monitors SparkStreaming programs (Python 2 script).</div><div
class="line">    If one has died, restart it and send email and WeChat alerts.</div><div
class="line">'''</div><div
class="line"></div><div
class="line">import os</div><div
class="line">import subprocess</div><div
class="line">import json</div><div
class="line">import logging</div><div
class="line">import time</div><div
class="line">import urllib2</div><div
class="line">import smtplib</div><div
class="line">from email.mime.text import MIMEText</div><div
class="line">from email.mime.multipart import MIMEMultipart</div><div
class="line">from email.header import Header</div><div
class="line"></div><div
class="line">wechats = "HuShiwei"</div><div
class="line">sendEmails = ['hsw_v5@163.com', 'xxxx@gm825.com']</div><div
class="line"></div><div
class="line">urlRun = 'curl --compressed -H "Accept: application/json" -X GET "http://u007:8089/ws/v1/cluster/apps?states=RUNNING"'</div><div
class="line">urlAcc = 'curl --compressed -H "Accept: application/json" -X GET "http://u007:8089/ws/v1/cluster/apps?states=ACCEPTED"'</div><div
class="line"></div><div
class="line"># Application name on YARN -> start script used to relaunch it</div><div
class="line">monitorPrograms = {</div><div
class="line">    "com.xxxx.streaming.ADXStreaming": "/home/hadoop/statistics/ad/adxstreaming/start_adx_streaming_yarn.sh",</div><div
class="line">    "com.xxxx.online.streaming.DSPStreaming": "/home/hadoop/statistics/ad/dsp_ad_puton/dsp_ad_puton_streaming/start_dsp_streaming_yarn_test.sh",</div><div
class="line">    "com.xxxx.streaming.CPDAppStreaming": "/home/hadoop/statistics/ad/dsp_app_promotion/start_dsp_app_promotion_yarn.sh"</div><div
class="line">}</div><div
class="line"></div><div
class="line"></div><div
class="line">class WeChat(object):</div><div
class="line">    '''</div><div
class="line">    Utility class for sending WeChat messages</div><div
class="line">    '''</div><div
class="line"></div><div
class="line">    def __init__(self, corpid, corpsecret, tokenpath):</div><div
class="line">        self.corpid = corpid</div><div
class="line">        self.corpsecret = corpsecret</div><div
class="line">        self.tokenpath = tokenpath</div><div
class="line">        self.logger = logging.getLogger('wechat')</div><div
class="line"></div><div
class="line">    def saveToken(self):</div><div
class="line">        '''</div><div
class="line">        :return: the cached token, or a fresh one from the API if the cache file is missing or too short</div><div
class="line">        '''</div><div
class="line">        try:</div><div
class="line">            with open(self.tokenpath, 'r') as f:</div><div
class="line">                token = f.read()</div><div
class="line">            if len(token) < 10:</div><div
class="line">                token = self.getToken()</div><div
class="line">                self.logger.info("Can not get token from %s,prepare to get token on api which token is %s" % (</div><div
class="line">                    self.tokenpath, token))</div><div
class="line">                return token</div><div
class="line">            else:</div><div
class="line">                return token</div><div
class="line">        except IOError:</div><div
class="line">            token = self.getToken()</div><div
class="line">            self.logger.info(</div><div
class="line">                "Can not get token from %s,prepare to get token on api which token is %s" % (self.tokenpath, token))</div><div
class="line">            return token</div><div
class="line"></div><div
class="line">    def getToken(self):</div><div
class="line">        Url = 'https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=%s&corpsecret=%s' % (self.corpid, self.corpsecret)</div><div
class="line">        req = urllib2.Request(Url)</div><div
class="line">        result = urllib2.urlopen(req)</div><div
class="line">        json_access_token = json.loads(result.read())</div><div
class="line">        access_token = json_access_token['access_token']</div><div
class="line"></div><div
class="line">        # Cache the token on disk for later runs</div><div
class="line">        with open(self.tokenpath, 'w') as f:</div><div
class="line">            f.write(access_token)</div><div
class="line">        return access_token</div><div
class="line"></div><div
class="line">    def setMessage(self, wechatids, text):</div><div
class="line">        token = self.saveToken()</div><div
class="line">        message = self.makeMessage(text)</div><div
class="line">        submiturl = 'https://qyapi.weixin.qq.com/cgi-bin/message/send?access_token={0}'.format(token)</div><div
class="line">        data = {"touser": wechatids, "msgtype": "text", "agentid": "1000002", "text": {"content": message}, "safe": "0"}</div><div
class="line">        data = json.dumps(data, ensure_ascii=False)</div><div
class="line"></div><div
class="line">        send_request = urllib2.Request(submiturl, data)</div><div
class="line"></div><div
class="line">        self.logger.info("Send wechat %s" % text)</div><div
class="line"></div><div
class="line">        response = json.loads(urllib2.urlopen(send_request).read())</div><div
class="line"></div><div
class="line">        # On these token error codes, drop the cached token and retry with a new one</div><div
class="line">        if response['errcode'] == 42001 or response['errcode'] == 40014:</div><div
class="line">            self.logger.info("Send wechat errorcode : %s" % response['errcode'])</div><div
class="line">            os.remove(self.tokenpath)</div><div
class="line">            self.setMessage(wechatids, text)</div><div
class="line"></div><div
class="line">    def makeMessage(self, text):</div><div
class="line">        def date():</div><div
class="line">            date = time.strftime('%m-%d %H:%M:%S', time.localtime())</div><div
class="line">            return date</div><div
class="line"></div><div
class="line">        return "%s \nCall Time:%s" % (text, date())</div><div
class="line"></div><div
class="line"></div><div
class="line">class Message(object):</div><div
class="line">    '''</div><div
class="line">    Builds the content of the email to send</div><div
class="line">    '''</div><div
class="line"></div><div
class="line">    def format_str(self, strs):</div><div
class="line">        if not isinstance(strs, unicode):</div><div
class="line">            strs = unicode(strs)</div><div
class="line">        return strs</div><div
class="line"></div><div
class="line">    def __init__(self, from_user, to_user, subject, content, with_attach=False):</div><div
class="line">        '''</div><div
class="line"></div><div
class="line">        :param from_user: sender of the email</div><div
class="line">        :param to_user: recipient of the email</div><div
class="line">        :param subject: email subject</div><div
class="line">        :param content: email body</div><div
class="line">        :param with_attach: whether the email carries attachments</div><div
class="line">        '''</div><div
class="line"></div><div
class="line">        if with_attach:</div><div
class="line">            self._message = MIMEMultipart()</div><div
class="line">            self._message.attach(MIMEText(content, 'plain', 'utf-8'))</div><div
class="line">        else:</div><div
class="line">            self._message = MIMEText(content, 'plain', 'utf-8')</div><div
class="line"></div><div
class="line">        self._message['Subject'] = Header(subject, 'utf-8')</div><div
class="line">        self._message['From'] = Header(self.format_str(from_user), 'utf-8')</div><div
class="line">        self._message['To'] = Header(self.format_str(to_user), 'utf-8')</div><div
class="line">        self._with_attach = with_attach</div><div
class="line"></div><div
class="line">    def attach(self, file_path):</div><div
class="line">        if not self._with_attach:</div><div
class="line">            print "Please init the Message with attr 'with_attach = True'"</div><div
class="line">            exit(1)</div><div
class="line">        if not os.path.isfile(file_path):</div><div
class="line">            print "The file doesn't exist!"</div><div
class="line">            exit(1)</div><div
class="line">        atta = MIMEText(open(file_path, 'rb').read(), 'base64', 'utf-8')</div><div
class="line">        atta['Content-Type'] = 'application/octet-stream'</div><div
class="line">        atta['Content-Disposition'] = 'attachment; filename="%s"' % Header(os.path.basename(file_path), 'utf-8')</div><div
class="line">        self._message.attach(atta)</div><div
class="line"></div><div
class="line">    def getMessage(self):</div><div
class="line">        return self._message.as_string()</div><div
class="line"></div><div
class="line"></div><div
class="line">class SMTPClient(object):</div><div
class="line">    '''</div><div
class="line">    Utility class for sending email</div><div
class="line">    '''</div><div
class="line"></div><div
class="line">    def __init__(self, hostname, port, user, passwd):</div><div
class="line">        '''</div><div
class="line">        Initialize the connection parameters</div><div
class="line">        :param hostname: for QQ Mail: smtp.qq.com</div><div
class="line">        :param port: QQ Mail SSL port: 465</div><div
class="line">        :param user: QQ Mail account</div><div
class="line">        :param passwd: QQ Mail authorization code, obtained from the QQ Mail web interface</div><div
class="line">        '''</div><div
class="line">        self._HOST = hostname</div><div
class="line">        self._PORT = port</div><div
class="line">        self._USER = user</div><div
class="line">        self._PASS = passwd</div><div
class="line"></div><div
class="line">    def send(self, receivers, msg):</div><div
class="line">        '''</div><div
class="line">        Send an email</div><div
class="line">        :param receivers: recipients, as a list; may contain several addresses</div><div
class="line">        :param msg: the email to send (a Message instance)</div><div
class="line">        :return: (1, message) on success, (0, error message) on failure</div><div
class="line">        '''</div><div
class="line">        if not isinstance(msg, Message):</div><div
class="line">            print "Error Message Instance!"</div><div
class="line">            exit(1)</div><div
class="line">        try:</div><div
class="line">            smtpObj = smtplib.SMTP_SSL(self._HOST, self._PORT)</div><div
class="line">            smtpObj.connect(self._HOST)</div><div
class="line">            smtpObj.login(self._USER, self._PASS)</div><div
class="line">            smtpObj.sendmail(self._USER, receivers, msg.getMessage())</div><div
class="line">            return (1, "Email sent successfully")</div><div
class="line">        except smtplib.SMTPException as e:</div><div
class="line">            return (0, "Error: unable to send email %s" % e)</div><div
class="line"></div><div
class="line"></div><div
class="line">def run_it(cmd):</div><div
class="line">    '''</div><div
class="line">    Run a shell command from Python</div><div
class="line">    :param cmd: the command line to run</div><div
class="line">    :return: the command's stdout</div><div
class="line">    '''</div><div
class="line">    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True,</div><div
class="line">                         stderr=subprocess.PIPE)</div><div
class="line">    # print ('running:%s' % cmd)</div><div
class="line">    out, err = p.communicate()</div><div
class="line">    if p.returncode != 0:</div><div
class="line">        print ("Non zero exit code:%s executing: %s \nerror cause ---> %s" % (p.returncode, cmd, err))</div><div
class="line">    return out</div><div
class="line"></div><div
class="line"></div><div
class="line">def reStartSparkScript(scriptPath):</div><div
class="line">    '''</div><div
class="line">    Run the Spark start script</div><div
class="line">    1. cd into the directory that contains the script</div><div
class="line">    2. run the script from that directory</div><div
class="line">    :param scriptPath: full path of the start script</div><div
class="line">    :return:</div><div
class="line">    '''</div><div
class="line">    logger = logging.getLogger("Main")</div><div
class="line">    scriptDir, script = os.path.split(scriptPath)</div><div
class="line">    os.chdir(scriptDir)</div><div
class="line">    run_it("sh %s" % script)</div><div
class="line">    logger.info("exec [ %s ] on [ %s ] " % (script, scriptDir))</div><div
class="line"></div><div
class="line"></div><div
class="line">def collectMonitorStatus(yarnRestApi):</div><div
class="line">    '''</div><div
class="line">    Get the state of the monitored programs from YARN's RUNNING or ACCEPTED endpoint</div><div
class="line">    :param yarnRestApi: curl command for YARN's RUNNING or ACCEPTED endpoint</div><div
class="line">    :return: list of (name, state) tuples for the monitored programs</div><div
class="line">    '''</div><div
class="line">    strUrl = run_it(yarnRestApi)</div><div
class="line">    result = []</div><div
class="line">    obj = json.loads(strUrl)</div><div
class="line">    if obj['apps'] is None:</div><div
class="line">        return result</div><div
class="line">    else:</div><div
class="line">        apps = obj['apps']['app']</div><div
class="line">        result = [(app['name'], app['state']) for app in apps if app['name'] in monitorPrograms]</div><div
class="line">        return result</div><div
class="line"></div><div
class="line"></div><div
class="line">def checkMonitorApps():</div><div
class="line">    '''</div><div
class="line">    Call YARN's RUNNING and ACCEPTED endpoints,</div><div
class="line">    check whether the Spark programs we monitor are listed there,</div><div
class="line">    and alert and restart any that are missing</div><div
class="line">    :return:</div><div
class="line">    '''</div><div
class="line"></div><div
class="line">    logging.basicConfig(level=logging.INFO,</div><div
class="line">                        format='%(asctime)s - %(message)s',</div><div
class="line">                        datefmt='%Y-%m-%d %H:%M:%S')</div><div
class="line"></div><div
class="line">    logger = logging.getLogger("Main")</div><div
class="line"></div><div
class="line">    smtp_client = SMTPClient('smtp.qq.com', 465, '694244330@qq.com', 'xxxxxx')</div><div
class="line">    wechat_client = WeChat('xxxxxxxxxxxxxx', 'xxxxxxxxxxxxxxxxxxxxxxxxx', '/tmp/token.txt')</div><div
class="line"></div><div
class="line">    runningStatus = collectMonitorStatus(urlRun)</div><div
class="line">    acceptStatus = collectMonitorStatus(urlAcc)</div><div
class="line"></div><div
class="line">    runningAcceptApps = dict(runningStatus + acceptStatus)</div><div
class="line"></div><div
class="line">    logger.info("SparkStreaming ON Yarn Running And Accept ===>%s " % str(runningAcceptApps))</div><div
class="line"></div><div
class="line">    for monitor in monitorPrograms:</div><div
class="line">        if monitor not in runningAcceptApps:</div><div
class="line">            logger.info("[ %s ] is not running or accept,prepare to restart!" % monitor)</div><div
class="line">            msg = Message("694244330@qq.com", "hushiwei", monitor, '%s is failed, prepare to restart! -- %s' % (</div><div
class="line">                monitor, time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())))</div><div
class="line">            smtp_client.send(sendEmails, msg)</div><div
class="line">            wechat_client.setMessage(wechats, "%s is not running or accept,prepare to restart!" % monitor)</div><div
class="line">            reStartSparkScript(monitorPrograms[monitor])</div><div
class="line"></div><div
class="line"></div><div
class="line">def main():</div><div
class="line">    checkMonitorApps()</div><div
class="line"></div><div
class="line"></div><div
class="line">if __name__ == '__main__':</div><div
class="line">    main()</div></pre></td></tr></table></figure>]]></content>
<summary type="html">
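<p>Spark on YARN is very stable and rarely runs into problems, but our Spark Streaming programs run continuously to produce real-time reports.<br>So we have to monitor them and restart them promptly whenever one exits.<br>For this, I decided to call YARN's REST API to list the applications submitted to YARN.</p>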
</summary>
<category term="spark" scheme="http://dmlcoding.com/categories/spark/"/>
<category term="spark" scheme="http://dmlcoding.com/tags/spark/"/>
<category term="yarn" scheme="http://dmlcoding.com/tags/yarn/"/>
</entry>
<entry>
<title>YARN resource allocation configuration parameters</title>
<link href="http://dmlcoding.com/2017/SparkOnYarn/"/>
<id>http://dmlcoding.com/2017/SparkOnYarn/</id>
<published>2017-08-08T02:00:00.000Z</published>
<updated>2017-08-16T07:36:12.000Z</updated>
<content type="html"><![CDATA[<p>Whether it is a MapReduce program or a Spark program, once it is submitted to YARN for resource management and allocation, it runs inside YARN Containers.<br>A Container is YARN's encapsulation of memory and CPU resources; encapsulating and allocating other resources such as network I/O is not supported yet. During development and tuning we inevitably have to allocate memory, so it matters which YARN configuration parameter controls which part of the allocation, and that is worth writing down.<br><a id="more"></a></p>
<h1 id="内存资源"><a href="#内存资源" class="headerlink" title="内存资源"></a>Memory resources</h1>
<h2 id="ResourceManager"><a href="#ResourceManager" class="headerlink" title="ResourceManager"></a>ResourceManager</h2>
<table><thead><tr><th style="text-align:left">Parameter</th><th style="text-align:left">Description</th><th style="text-align:left">Notes</th></tr></thead><tbody>
<tr><td style="text-align:left">yarn.scheduler.minimum-allocation-mb</td><td style="text-align:left">Minimum amount of physical memory a single task can request; the default is 1024 (MB). If a task requests less than this value, the request is raised to this value.</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">yarn.scheduler.maximum-allocation-mb</td><td style="text-align:left">Maximum amount of physical memory a single task can request; the default is 8192 (MB).</td></tr></tbody></table>
<p>Note:<br>In other words, these are the minimum and maximum memory of a Container that the ResourceManager will allocate.</p>
<h2 id="nodemanager"><a href="#nodemanager" class="headerlink" title="nodemanager"></a>NodeManager</h2>
<table><thead><tr><th style="text-align:left">Parameter</th><th style="text-align:left">Description</th><th style="text-align:left">Notes</th></tr></thead><tbody>
<tr><td style="text-align:left">yarn.nodemanager.resource.memory-mb</td><td style="text-align:left">Maximum memory on the node that YARN may use; the default is 8192 (MB).</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">yarn.nodemanager.vmem-pmem-ratio</td><td style="text-align:left">Virtual-to-physical memory ratio: the maximum amount of virtual memory a task may use for each 1 MB of physical memory; the default is 2.1.</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">yarn.nodemanager.pmem-check-enabled</td><td style="text-align:left">Whether to start a thread that checks the physical memory each task is using and kills the task if it exceeds its allocation; the default is true.</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">yarn.nodemanager.vmem-check-enabled</td><td style="text-align:left">Whether to start a thread that checks the virtual memory each task is using and kills the task if it exceeds its allocation; the default is true.</td></tr></tbody></table>
<p>Notes:</p><ol><li>On CentOS/RHEL 6, virtual memory allocation is fairly aggressive, so you can raise yarn.nodemanager.vmem-pmem-ratio or disable yarn.nodemanager.vmem-check-enabled.</li><li>Otherwise you will sometimes see a container reported as running beyond its memory limits and then killed.</li><li>See, for example, this question: <a href="https://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits" target="_blank" rel="external">https://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits</a></li></ol>
<h2 id="ApplicationMaster"><a href="#ApplicationMaster" class="headerlink" title="ApplicationMaster"></a>ApplicationMaster</h2>
<table><thead><tr><th style="text-align:left">Parameter</th><th style="text-align:left">Description</th><th style="text-align:left">Notes</th></tr></thead><tbody>
<tr><td style="text-align:left">mapreduce.map.memory.mb</td><td style="text-align:left">Memory allocated to a Map Container; set as needed at run time.</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">mapreduce.reduce.memory.mb</td><td style="text-align:left">Memory allocated to a Reduce Container; set as needed at run time.</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">mapreduce.map.java.opts</td><td style="text-align:left">JVM options for Map tasks, such as -Xmx and -Xms.</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">mapreduce.reduce.java.opts</td><td style="text-align:left">JVM options for Reduce tasks, such as -Xmx and -Xms.</td></tr></tbody></table>
<h1 id="CPU资源"><a href="#CPU资源" class="headerlink" title="CPU资源"></a>CPU resources</h1>
<table><thead><tr><th style="text-align:left">Parameter</th><th style="text-align:left">Description</th><th style="text-align:left">Notes</th></tr></thead><tbody>
<tr><td style="text-align:left">yarn.nodemanager.resource.cpu-vcores</td><td style="text-align:left">Number of virtual CPU cores on this node that YARN may use; the default is 8.</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">yarn.scheduler.minimum-allocation-vcores</td><td style="text-align:left">Minimum number of virtual CPU cores a single task can request; the default is 1.</td><td style="text-align:left"></td></tr>
<tr><td style="text-align:left">yarn.scheduler.maximum-allocation-vcores</td><td style="text-align:left">Maximum number of virtual CPU cores a single task can request; the default is 32.</td></tr></tbody></table>]]></content>
<summary type="html">
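<p>Whether it is a MapReduce program or a Spark program, once it is submitted to YARN for resource management and allocation, it runs inside YARN Containers.<br>A Container is YARN's encapsulation of memory and CPU resources; encapsulating and allocating other resources such as network I/O is not supported yet. During development and tuning we inevitably have to allocate memory, so it matters which YARN configuration parameter controls which part of the allocation, and that is worth writing down.<br>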
</summary>
<category term="yarn" scheme="http://dmlcoding.com/categories/yarn/"/>
<category term="yarn" scheme="http://dmlcoding.com/tags/yarn/"/>
</entry>
<entry>
<title>Understanding the Java Virtual Machine in Depth: JDK Visual Tools (Part 2)</title>
<link href="http://dmlcoding.com/2017/JDKOrderConsole/"/>
<id>http://dmlcoding.com/2017/JDKOrderConsole/</id>
<published>2017-08-06T12:00:00.000Z</published>
<updated>2017-08-09T04:23:55.000Z</updated>
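<content type="html"><![CDATA[<p>Besides the JDK command-line tools, there are several powerful JDK visual tools. Hopefully studying them will improve our ability to track down bugs.<br><a id="more"></a></p>
<h1 id="JConsole-Java监视与管理控制台"><a href="#JConsole-Java监视与管理控制台" class="headerlink" title="JConsole:Java监视与管理控制台"></a>JConsole: Java Monitoring and Management Console</h1>
<blockquote><p>JConsole (Java Monitoring and Management Console)<br>JConsole is a virtual machine monitoring tool that has shipped with the JDK since JDK 1.5.<br>JConsole is a JMX-based visual monitoring and management tool; its management features operate on JMX MBeans.</p></blockquote>
<h2 id="启动JConsole"><a href="#启动JConsole" class="headerlink" title="启动JConsole"></a>Starting JConsole</h2>
<ul><li>1. Run <code>jconsole</code> from the bin directory of the JDK installation</li><li>2. If <strong>JAVA_HOME</strong> is configured (with its bin directory on the PATH), simply type <code>jconsole</code></li></ul>
<p>Once started, JConsole automatically finds all virtual machine processes running on the local machine, so there is no need to look them up with jps. As shown below, double-click a process to start monitoring it. You can also use the <em>Remote Process</em> option to connect to a remote server and monitor a remote virtual machine.<br><img src="/images/jdk/jconsole1.png" alt="jconsole1"></p>
<p>Stack memory, heap memory<br>-Xms100m -Xmx100m -XX:+UseSerialGC</p>]]></content>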
<summary type="html">
<p>Besides the JDK command-line tools, there are several powerful JDK visual tools. Hopefully studying them will improve our ability to track down bugs.<br>
</summary>
<category term="java" scheme="http://dmlcoding.com/categories/java/"/>
<category term="java" scheme="http://dmlcoding.com/tags/java/"/>
<category term="jdk" scheme="http://dmlcoding.com/tags/jdk/"/>
</entry>
<entry>
<title>Understanding the Java Virtual Machine in Depth: Java Memory Areas and Memory Overflow</title>
<link href="http://dmlcoding.com/2017/JDKMemory/"/>
<id>http://dmlcoding.com/2017/JDKMemory/</id>
<published>2017-08-03T08:00:00.000Z</published>
<updated>2017-08-07T01:23:28.000Z</updated>
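<content type="html"><![CDATA[<p>For Java programmers, the virtual machine's automatic memory management means we can write code without knowing how memory is allocated. But if you do not understand what the virtual machine actually does, you can neither locate problems quickly nor become an excellent programmer.</p><a id="more"></a>
<h1 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h1>
<p>With the help of the virtual machine's automatic memory management, Java programmers no longer need to write a matching delete/free for every new operation (as C/C++ requires), and memory leaks and memory overflows are less likely, so letting the virtual machine manage memory looks ideal. However, precisely because Java programmers hand control of memory to the Java virtual machine, once a memory leak or overflow does occur, troubleshooting becomes extremely difficult without an understanding of how the virtual machine uses memory.<br>This article covers the following parts; understanding them is the first step toward mastering virtual machine memory management.</p>
<ul><li>The memory areas of the Java virtual machine</li><li>What each area is for, what it serves, and what problems can arise in it</li></ul>
<h1 id="运行时数据区域"><a href="#运行时数据区域" class="headerlink" title="运行时数据区域"></a>Runtime data areas</h1>
<ul><li>While executing a Java program, the Java virtual machine divides the memory it manages into several different data areas.</li><li>Each area has its own purpose and its own creation and destruction time: some exist for as long as the virtual machine process runs, while others are created and destroyed as user threads start and end.<br><strong>According to the Java Virtual Machine Specification, the memory managed by the Java virtual machine includes the following runtime data areas</strong><br><img src="/images/jdk/jdkquyu.png" alt="java运行时区域"><h2 id="程序计数器"><a href="#程序计数器" class="headerlink" title="程序计数器"></a>Program counter</h2><blockquote><p>The program counter (Program Counter Register) is a small memory area that can be regarded as the line-number indicator of the bytecode being executed by the current thread.</p></blockquote></li></ul>
<p>The bytecode interpreter works by changing the value of this counter to select the next bytecode instruction to execute; basic functions such as branching, looping, jumping, exception handling, and thread resumption all depend on it. (Think about the Java code we write every day: everything happens for a reason.)</p>
<p>In a multithreaded program each thread executes its own instructions. So that a thread can resume at the correct position after a thread switch, every thread needs its own independent program counter; the counters of different threads are stored separately and do not affect each other. Memory areas of this kind are called <strong>thread-private</strong> memory.</p>
<h2 id="Java虚拟机栈"><a href="#Java虚拟机栈" class="headerlink" title="Java虚拟机栈"></a>Java virtual machine stack</h2>
<blockquote><p>Like the program counter, the Java virtual machine stack (Java Virtual Machine Stacks) is thread-private, and its lifetime is the same as that of its thread.</p></blockquote>
<p>What we usually call stack memory is this virtual machine stack. So what exactly is it?<br>The virtual machine stack describes the memory model of Java method execution:</p>
<ul><li>Each time a method is executed, a <strong>stack frame (Stack Frame)</strong> is created to store the <code>local variable table</code>, <code>operand stack</code>, <code>dynamic linking</code>, <code>method exit</code>, and other information.</li><li>The process from a method being invoked to its completion corresponds to a <strong>stack frame</strong> being pushed onto and popped off the virtual machine stack.</li></ul>
<p>A note on the local variable table:</p>
<ul><li>The local variable table holds the primitive types known at compile time (boolean, byte, char, short, int, float, long, double) and object references (the reference type, which is not the object itself).</li><li>The memory needed by the local variable table is determined at compile time: when a method is entered, the amount of local variable space it needs in its frame is fully determined, and the size of the table does not change while the method runs.</li><li>The virtual machine stack area can raise two kinds of errors<ul><li>If the stack depth requested by a thread exceeds the depth the virtual machine allows, a StackOverflowError is thrown.</li><li>If the virtual machine stack can be extended dynamically and enough memory cannot be obtained during extension, an OutOfMemoryError is thrown.</li></ul></li></ul>
<h2 id="本地方法栈"><a href="#本地方法栈" class="headerlink" title="本地方法栈"></a>Native method stack</h2>
<ul><li>The native method stack (Native Method Stacks) plays a role very similar to that of the virtual machine stack</li><li>The difference between the native method stack and the virtual machine stack<ul><li>The virtual machine stack serves the Java methods (that is, bytecode) executed by the virtual machine</li><li>The native method stack serves the native methods used by the virtual machine</li></ul></li><li>Some virtual machines (such as Sun HotSpot) simply combine the native method stack and the virtual machine stack into one</li></ul>
<p>What is a native method?<br>Reference: <a href="http://blog.csdn.net/wike163/article/details/6635321" target="_blank" rel="external">http://blog.csdn.net/wike163/article/details/6635321</a></p>
<h2 id="Java堆"><a href="#Java堆" class="headerlink" title="Java堆"></a>Java heap</h2>
<p><strong>Characteristics of heap memory</strong></p>
<ul><li>For most applications, the Java heap is the largest piece of memory managed by the Java virtual machine.</li><li>The Java heap is a memory area shared by all threads and is <strong>created when the virtual machine starts</strong>.</li><li>The sole purpose of the Java heap is to hold object instances; almost all object instances are allocated here.</li><li>The Java heap is the main area managed by the garbage collector. With so many instances allocated on the heap, memory must be reclaimed promptly once instances are no longer used so that new instances can be given enough memory.</li><li>The Java heap may reside in physically non-contiguous memory as long as it is logically contiguous, just like disk space.</li><li>If no memory is left in the heap to complete an instance allocation and the heap cannot be extended any further, an OutOfMemoryError is thrown.</li></ul>
<p><strong>Subdividing heap memory further</strong></p>
<ul><li>From the garbage collection perspective<ul><li>Because current garbage collectors generally use generational collection algorithms, the Java heap can be further divided into the <code>young generation</code>, the <code>old generation</code>, and so on.</li><li>This will be covered in detail when we get to garbage collection.</li></ul></li><li>From the memory allocation perspective<ul><li>The thread-shared Java heap may be divided into multiple thread-private allocation buffers (Thread Local Allocation Buffer, TLAB)</li></ul></li><li>In terms of implementation<ul><li>The heap can be implemented with a fixed size or as extensible.</li><li>Current mainstream virtual machines all implement it as extensible</li><li><code>-Xmx</code> sets the maximum heap size of the program</li><li><code>-Xms</code> sets the initial heap size of the program (the thread stack size is set with <code>-Xss</code>)</li></ul></li></ul>
<h2 id="方法区"><a href="#方法区" class="headerlink" title="方法区"></a>Method area</h2>
<blockquote><p>Method area (Method Area)<br><strong>Characteristics of the method area</strong></p><ul><li>Like the Java heap, the method area is a memory area shared by all threads.</li><li>It stores class information loaded by the virtual machine, constants, static variables, code compiled by the just-in-time compiler, and similar data.</li><li>Although the Java Virtual Machine Specification describes the method area as a logical part of the heap, it has the alias Non-Heap, presumably to distinguish it from the Java heap.</li><li>Some people call the method area the "Permanent Generation" because, on HotSpot, generational GC is extended to the method area, which is implemented using the permanent generation.</li><li>When the method area cannot satisfy a memory allocation request, an OutOfMemoryError is thrown.</li></ul></blockquote>
<h2 id="运行时常量池"><a href="#运行时常量池" class="headerlink" title="运行时常量池"></a>Runtime constant pool</h2>
<p>The runtime constant pool is part of the method area. Besides descriptive information such as the class version, fields, methods, and interfaces, a Class file also contains the <strong>constant pool (Constant Pool Table)</strong>,<br>which holds the literals and symbolic references generated at compile time; this content is placed in the runtime constant pool of the method area after the class is loaded.</p>
<p>The Java language does not require constants to be produced only at compile time. In other words, the runtime constant pool is not limited to the constant pool content preset in the Class file; new constants can also be added to the pool at run time.</p>
<h2 id="直接内存"><a href="#直接内存" class="headerlink" title="直接内存"></a>Direct memory</h2>
<p>Direct memory (Direct Memory) is not part of the virtual machine's runtime data areas, nor is it a memory area defined in the Java Virtual Machine Specification. It is nevertheless used frequently and can also cause an OutOfMemoryError. So what exactly is direct memory?</p>
<p>JDK 1.4 added the NIO (New Input/Output) classes, which introduced an I/O approach based on channels (Channel) and buffers (Buffer). NIO can use native libraries to allocate off-heap memory directly and then operate on it through a DirectByteBuffer object stored in the Java heap that acts as a reference to that memory.</p>
<p>This can improve performance significantly in some scenarios because it avoids copying data back and forth between the Java heap and the native heap.</p>
<h1 id="对象访问-问题-在Java语言中-对象访问是如何进行的"><a href="#对象访问-问题-在Java语言中-对象访问是如何进行的" class="headerlink" title="对象访问(问题:在Java语言中,对象访问是如何进行的?)"></a>Object access (question: how does object access work in the Java language?)</h1>
<blockquote><p>Object access is everywhere in Java and is the most ordinary program behavior, yet even the simplest access involves the relationship among the three most important memory areas: the Java stack, the Java heap, and the method area.<br>The runtime data areas were introduced above in a rather abstract way, so let us look at a concrete question.<br><strong>How does object access work in the Java language?</strong></p></blockquote>
<p>Consider this simplest line of code and let us explain it.<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">Object obj=new Object();</div></pre></td></tr></table></figure></p>
<p>Assuming this statement appears in a method body:</p>
<ul><li>The semantics of <strong>Object obj</strong> are reflected in the <strong>local variable table of the Java stack</strong>, where it appears as a reference-type datum.</li><li>The semantics of <strong>new Object()</strong> are reflected in the <strong>Java heap</strong>, forming a block of structured memory that stores all the instance data values of the Object type.<ul><li>The length of this block is not fixed</li><li>In addition, the heap must also contain address information that can locate the object's type data (such as the object type, superclass, implemented interfaces, and methods); the type data itself is stored in the method area.</li></ul></li></ul>]]></content>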
<summary type="html">
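<p>For Java programmers, the virtual machine's automatic memory management means we can write code without knowing how memory is allocated. But if you do not understand what the virtual machine actually does, you can neither locate problems quickly nor become an excellent programmer.</p>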
</summary>
<category term="java" scheme="http://dmlcoding.com/categories/java/"/>
<category term="java" scheme="http://dmlcoding.com/tags/java/"/>
<category term="jdk" scheme="http://dmlcoding.com/tags/jdk/"/>
</entry>
<entry>
<title>Understanding the Java Virtual Machine in Depth: JDK Command-Line Tools (Part 1)</title>
<link href="http://dmlcoding.com/2017/JDKOrder/"/>
<id>http://dmlcoding.com/2017/JDKOrder/</id>
<published>2017-08-02T12:00:00.000Z</published>
<updated>2017-08-09T02:55:59.000Z</updated>
<content type="html"><![CDATA[<p>The JDK command-line tools are Java's gift to us; how could we turn the gift down?</p><a id="more"></a>
<h1 id="jps-虚拟机进程状况工具"><a href="#jps-虚拟机进程状况工具" class="headerlink" title="jps:虚拟机进程状况工具"></a>jps: virtual machine process status tool</h1>
<blockquote><p>jps (JVM Process Status)<br>Lists the running virtual machine processes and shows the main class each one is executing (the class containing the main() function), along with each process's Local Virtual Machine Identifier (LVMID).<br>For a local virtual machine process, the LVMID is the same as the operating system process ID (PID, Process Identifier).</p></blockquote>
<h2 id="jps命令格式"><a href="#jps命令格式" class="headerlink" title="jps命令格式"></a>jps command format</h2>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">jps [options] [hostid]</div></pre></td></tr></table></figure>
<h2 id="jps工具主要选项"><a href="#jps工具主要选项" class="headerlink" title="jps工具主要选项"></a>Main jps options</h2>
<table><thead><tr><th style="text-align:center">Option</th><th style="text-align:left">Purpose</th></tr></thead><tbody>
<tr><td style="text-align:center">-q</td><td style="text-align:left">Print only the LVMID, omitting the main class name</td></tr>
<tr><td style="text-align:center">-m</td><td style="text-align:left">Print the arguments passed to the main class's main() function when the process started</td></tr>
<tr><td style="text-align:center">-l</td><td style="text-align:left">Print the fully qualified name of the main class, or the JAR path if the process is running a JAR</td></tr>
<tr><td style="text-align:center">-v</td><td style="text-align:left">Print the JVM arguments used when the process started</td></tr></tbody></table>
<h2 id="jps命令样例1"><a href="#jps命令样例1" class="headerlink" title="jps命令样例1"></a>jps sample 1</h2>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">[hadoop@U006 ~]$ jps -l</div><div class="line">3183 sun.tools.jps.Jps</div><div class="line">17831 org.apache.spark.executor.CoarseGrainedExecutorBackend</div><div class="line">10004 org.apache.spark.deploy.worker.Worker</div><div class="line">17659 org.apache.spark.deploy.SparkSubmit</div><div class="line">10254 org.apache.spark.deploy.worker.Worker</div><div class="line">9830 org.apache.spark.deploy.master.Master</div></pre></td></tr></table></figure>
<h1 id="jstat-虚拟机统计信息监视工具"><a href="#jstat-虚拟机统计信息监视工具" class="headerlink" title="jstat:虚拟机统计信息监视工具"></a>jstat: virtual machine statistics monitoring tool</h1>
<blockquote><p>jstat (JVM Statistics Monitoring Tool)<br>jstat is a command-line tool for monitoring the various runtime states of a virtual machine.<br>It can display class loading, memory, garbage collection, JIT compilation, and other runtime data for local or remote virtual machine processes.<br>On servers that have no GUI and offer only a plain-text console, it is the tool of choice for locating virtual machine performance problems at run time.</p></blockquote>
<h2 id="jstat命令格式"><a href="#jstat命令格式" class="headerlink" title="jstat命令格式"></a>jstat command format</h2>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">jstat [option vmid [interval[s|ms] [count]]]</div></pre></td></tr></table></figure>
<p><strong>Note</strong>:</p>
<ul><li>A word on VMID and LVMID in the command: for a local virtual machine process, the VMID and the LVMID are the same.</li><li>For a remote virtual machine process, the VMID format is: <code>[protocol:][//]lvmid[@hostname[:port]/servername]</code></li><li>The interval and count parameters give the query interval and the number of queries. If both are omitted, a single query is made.</li></ul>
<p>Suppose we need to query the garbage collection status of process 2764 every 250 milliseconds, 20 times in total; the command is:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">jstat -gc 2764 250 20</div></pre></td></tr></table></figure></p>
<p>Query the garbage collection status of the Spark process every 2 milliseconds, 10 times:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">[hadoop@U006 ~]$ jstat -gc 17659 2 10</div><div class="line"> S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div><div class="line">349184.0 349184.0 0.0 0.0 2098176.0 954275.3 5592576.0 246723.4 116224.0 115802.6 52 5.099 48 18.956 24.056</div></pre></td></tr></table></figure></p>
<h2 id="jstat工具主要选项"><a href="#jstat工具主要选项" class="headerlink" title="jstat工具主要选项"></a>Main jstat options</h2>
<table><thead><tr><th style="text-align:center">Option</th><th style="text-align:left">Purpose</th></tr></thead><tbody>
<tr><td style="text-align:center">-class</td><td style="text-align:left">Monitor class loading and unloading counts, total space, and time spent on class loading</td></tr>
<tr><td style="text-align:center">-gc</td><td style="text-align:left">Monitor the Java heap, including the capacity and used space of the Eden space, the two survivor spaces, the old generation, and the permanent generation, plus total GC time and similar information</td></tr>
<tr><td style="text-align:center">-gccapacity</td><td style="text-align:left">Monitors much the same as -gc, but the output focuses on the maximum and minimum space used by each area of the Java heap</td></tr>
<tr><td style="text-align:center">-gcutil</td><td style="text-align:left">Monitors much the same as -gc, but the output focuses on used space as a percentage of total space</td></tr>
<tr><td style="text-align:center">…</td><td style="text-align:left">…</td></tr></tbody></table>
<h2 id="jstat执行样例"><a href="#jstat执行样例" class="headerlink" title="jstat执行样例"></a>jstat sample run</h2>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">[hadoop@U006 ~]$ jstat -gcutil 17659</div><div class="line"> S0 S1 E O P YGC YGCT FGC FGCT GCT</div><div class="line"> 0.00 0.00 37.59 4.41 99.68 77 7.439 73 27.473 34.912</div></pre></td></tr></table></figure>
<p>The result shows that, for this process:</p>
<ul><li>The young generation Eden space (E, for Eden) is 37.59% used.</li><li>The two Survivor spaces (S0 and S1, for Survivor0 and Survivor1) are both empty.</li><li>The old generation (O, for Old) and the permanent generation (P, for Permanent) are 4.41% and 99.68% used, respectively.</li><li>Since the program started there have been 77 Minor GCs (YGC, for Young GC), taking 7.439 seconds in total.</li><li>There have been 73 Full GCs (FGC, for Full GC), and the total Full GC time (FGCT, for Full GC Time) is 27.473 seconds.</li><li>The total time of all GCs (GCT, for GC Time) is 34.912 seconds.</li></ul>
<p><strong>Summary</strong>: Monitoring virtual machine state with jstat in plain text is less intuitive than using a visual monitoring tool, but to me it feels more like the geek way.</p>
<h1 id="jinfo-Java配置信息工具"><a href="#jinfo-Java配置信息工具" class="headerlink" title="jinfo:Java配置信息工具"></a>jinfo: Java configuration information tool</h1>
<blockquote><p>jinfo (Configuration Info for Java)<br>jinfo is used to view and adjust virtual machine parameters in real time.</p></blockquote>
<h2 id="jinfo命令格式"><a href="#jinfo命令格式" class="headerlink" title="jinfo命令格式"></a>jinfo command format</h2>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">jinfo [option] pid</div></pre></td></tr></table></figure>
<h2 id="使用方式"><a href="#使用方式" class="headerlink" title="使用方式"></a>Usage</h2>
<p>The -v option of jps was described earlier; it shows the arguments that were explicitly specified when the virtual machine started, for example:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">[hadoop@U006 ~]$ jps -v</div><div class="line">17831 CoarseGrainedExecutorBackend -Xms4096M -Xmx4096M -Dspark.driver.port=42195 -XX:MaxPermSize=256m</div><div class="line">5146 Jps -Dapplication.home=/usr/java/jdk1.7.0_71 -Xms8m</div><div class="line">10004 Worker -Xms1g -Xmx1g -XX:MaxPermSize=256m</div><div class="line">17659 SparkSubmit -Xms8g -Xmx8g -XX:MaxPermSize=256m</div><div class="line">10254 Worker -Xms1g -Xmx1g -XX:MaxPermSize=256m</div><div class="line">9830 Master -Xms1g -Xmx1g -XX:MaxPermSize=256m</div></pre></td></tr></table></figure></p>
<p>To find the default value of a parameter that was not explicitly specified, besides looking it up in documentation, you can query it with the <code>-flag</code> option of <code>jinfo</code>.</p>
<p>For example, to query the value of the CMSInitiatingOccupancyFraction parameter:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">[hadoop@U006 ~]$ jinfo -flag CMSInitiatingOccupancyFraction 17659</div><div class="line">-XX:CMSInitiatingOccupancyFraction=-1</div></pre></td></tr></table></figure></p>
<h1 id="jmap-Java内存映像工具"><a href="#jmap-Java内存映像工具" class="headerlink" title="jmap:Java内存映像工具"></a>jmap: Java memory map tool</h1>
<blockquote><p>jmap (Memory Map for Java)<br>jmap is used to generate heap dump snapshots (usually called heapdump or dump files).</p></blockquote>
<h1 id="jstack-Java堆栈跟踪工具"><a href="#jstack-Java堆栈跟踪工具" class="headerlink" title="jstack:Java堆栈跟踪工具"></a>jstack: Java stack trace tool</h1>
<blockquote><p>jstack (Stack Trace for Java)<br>The jstack command generates a thread snapshot of the virtual machine at the current moment (usually called a threaddump or javacore file).</p></blockquote>
<p><strong>Thread snapshot:</strong> the collection of method stacks being executed by every thread in the virtual machine at that moment. The main purpose of generating a thread snapshot is to <em>locate the cause of long thread pauses</em>.</p>
<p>Common causes of long thread pauses:<br>1. Deadlock between threads, infinite loops<br>2. Long waits caused by requests to external resources<br>3. And so on</p>
<p>When a thread pauses, using jstack to view the call stack of each thread shows what the unresponsive thread is actually doing in the background, or which resource it is waiting for.</p>
<h2 id="jstack命令格式"><a href="#jstack命令格式" class="headerlink" title="jstack命令格式"></a>jstack command format</h2>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">jstack [option] vmid</div></pre></td></tr></table></figure>
<h2 id="jstack工具的主要选项"><a href="#jstack工具的主要选项" class="headerlink" title="jstack工具的主要选项"></a>Main jstack options</h2>
<table><thead><tr><th style="text-align:center">Option</th><th style="text-align:left">Purpose</th></tr></thead><tbody>
<tr><td style="text-align:center">-F</td><td style="text-align:left">Force a thread stack dump when a normal request gets no response</td></tr>
<tr><td style="text-align:center">-l</td><td style="text-align:left">Show additional information about locks besides the stacks</td></tr>
<tr><td style="text-align:center">-m</td><td style="text-align:left">Show C/C++ stack frames as well when native methods are called</td></tr></tbody></table>]]></content>
<summary type="html">
<p>The JDK command-line tools are Java's gift to us; how could we turn the gift down?</p>
</summary>
<category term="java" scheme="http://dmlcoding.com/categories/java/"/>
<category term="java" scheme="http://dmlcoding.com/tags/java/"/>
<category term="jdk" scheme="http://dmlcoding.com/tags/jdk/"/>
</entry>
<entry>
<title>Machine Vision Processing and an Introduction to Tesseract</title>
<link href="http://dmlcoding.com/2017/TesseractBasic/"/>
<id>http://dmlcoding.com/2017/TesseractBasic/</id>
<published>2017-07-27T01:23:46.000Z</published>
<updated>2017-07-27T01:44:00.000Z</updated>
<content type="html"><![CDATA[<p>Python has always been an excellent language for tasks such as reading and processing images, image-related machine learning, and creating images. Although many libraries can process images, so far I have only worked with Tesseract.</p><a id="more"></a>
<h1 id="Tesseract"><a href="#Tesseract" class="headerlink" title="Tesseract"></a>Tesseract</h1>
<p>Tesseract is an OCR library currently sponsored by Google (a company that is itself well known for its OCR and machine learning technology). Tesseract is widely regarded as one of the best and most accurate open-source OCR systems. Besides its high accuracy, Tesseract is also very flexible: it can be trained to recognize any font, and it can recognize any Unicode character.</p>
<h1 id="安装Tesseract"><a href="#安装Tesseract" class="headerlink" title="安装Tesseract"></a>Installing Tesseract</h1>
<h2 id="Windows-系统"><a href="#Windows-系统" class="headerlink" title="Windows 系统"></a>Windows</h2>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">Download the installer from https://code.google.com/p/tesseract-ocr/downloads/list and run it.</div></pre></td></tr></table></figure>
<h2 id="Linux-系统"><a href="#Linux-系统" class="headerlink" title="Linux 系统"></a>Linux</h2>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">Install with apt-get: $ sudo apt-get install tesseract-ocr</div></pre></td></tr></table></figure>
<h2 id="Mac-OS-X系统"><a href="#Mac-OS-X系统" class="headerlink" title="Mac OS X系统"></a>Mac OS X</h2>
<p>It is easy to install with a third-party package manager such as Homebrew (<a href="http://brew.sh/" target="_blank" rel="external">http://brew.sh/</a>):<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">brew install tesseract</div></pre></td></tr></table></figure></p>
<p>To use Tesseract features such as training it to recognize letters (as in the later examples), first set a new environment variable, $TESSDATA_PREFIX, so that Tesseract knows where the training data files are stored, and then get a copy of the tessdata data files and place it under the Tesseract directory.</p>
<p>On most Linux systems and on Mac OS X, you can set it like this:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">$ export TESSDATA_PREFIX=/usr/local/share/Tesseract</div></pre></td></tr></table></figure></p>
<p>Or:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">hushiwei@localhost ~ more ~/.bash_profile</div><div class="line">alias l='ls -lF'</div><div class="line">alias ll='ls -alF'</div><div class="line">JAVA_HOME=`/usr/libexec/java_home`</div><div class="line">SCALA_HOME=/Users/hushiwei/devApps/scala-2.10.5</div><div class="line">MAVEN_HOME=/Users/hushiwei/devApps/maven-3.3.9</div><div class="line">TESSDATA_PREFIX=/Users/hushiwei/devApps/Tesseract</div></pre></td></tr></table></figure></p>
<p>Windows is similar; you can set the environment variable with the following command (the path is quoted because it contains spaces):<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">#setx TESSDATA_PREFIX "C:\Program Files\Tesseract OCR\Tesseract"</div></pre></td></tr></table></figure></p>
<h1 id="安装pytesseract"><a href="#安装pytesseract" class="headerlink" title="安装pytesseract"></a>Installing pytesseract</h1>
<p>Tesseract is a command-line tool, not a library that is imported with an import statement. After installation it is run with the tesseract command outside Python, but a Python wrapper for Tesseract can be installed with pip:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">pip install pytesseract</div></pre></td></tr></table></figure></p>
<h1 id="简单示例"><a href="#简单示例" class="headerlink" title="简单示例"></a>A simple example</h1>
<p>For now we only deal with well-formatted text. So what counts as <code>well-formatted</code>?<br>Well-formatted text has the following characteristics:</p>
<ul><li>It uses a standard font (no handwriting, cursive, or very "fancy" fonts)</li><li>Even if it has been photocopied or photographed, the characters are still crisp, with no extra marks or smudges</li><li>It is neatly aligned, with no slanted characters</li><li>It does not run beyond the image, is not cut off, and does not sit tight against the edge of the image</li></ul>
<p>An example of a well-formatted image:<br><img src="/images/python/test.png" alt="test图片"></p>
<h2 id="命令行方式"><a href="#命令行方式" class="headerlink" title="命令行方式"></a>From the command line</h2>
<p>Let us try Tesseract and see how it does. It is very simple to use: read the image and write the result to a text file.<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">hushiwei@localhost ~/Desktop tesseract test.png text</div><div class="line">Tesseract Open Source OCR Engine v3.05.01 with Leptonica</div><div class="line">Warning. Invalid resolution 0 dpi. Using 70 instead.</div></pre></td></tr></table></figure></p>
<p>Then open the text file to see the result:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">hushiwei@localhost ~/Desktop more text.txt</div><div class="line">This is some text, written in Arial, that will be read by</div><div class="line">Tesseract. Here are some symbols: !@#$%"&'()</div></pre></td></tr></table></figure></p>
<p>Apart from one small symbol that was misrecognized, essentially all the characters were recognized correctly.</p>
<h2 id="python代码方式进行识别"><a href="#python代码方式进行识别" class="headerlink" title="python代码方式进行识别"></a>Recognition from Python code</h2>
<p>With the <code>pytesseract</code> module installed earlier, we can easily get the same result.</p>
<figure class="highlight plain"><table><tr><td class="code"><pre><div class="line"></div><div class="line">import pytesseract</div><div class="line"></div><div class="line">from PIL import Image</div><div class="line"></div><div class="line"># Open an image</div><div class="line">image=Image.open('test.png')</div><div class="line"></div><div class="line"># Call pytesseract's image_to_string to recognize the text in the image; it returns the recognized text</div><div class="line">text=pytesseract.image_to_string(image)</div><div class="line"></div><div class="line"># Print the text to check the result</div><div class="line">print(text)</div></pre></td></tr></table></figure>
<p>Output:<br><figure class="highlight plain"><table><tr><td class="code"><pre><div class="line">This is some text, written in Arial, that will be read by</div><div class="line">Tesseract. Here are some symbols: !@#$%"&'()</div><div class="line"></div><div class="line">Process finished with exit code 0</div></pre></td></tr></table></figure></p>]]></content>
<summary type="html">
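<p>Python has always been an excellent language for tasks such as reading and processing images, image-related machine learning, and creating images. Although many libraries can process images, so far I have only worked with Tesseract.</p>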
</summary>
<category term="python" scheme="http://dmlcoding.com/categories/python/"/>
<category term="python" scheme="http://dmlcoding.com/tags/python/"/>
<category term="tesseract" scheme="http://dmlcoding.com/tags/tesseract/"/>
</entry>
<entry>
<title>Python Visualization Package: Matplotlib</title>
<link href="http://dmlcoding.com/2017/MatplotlibBasic/"/>
<id>http://dmlcoding.com/2017/MatplotlibBasic/</id>
<published>2017-07-26T15:24:00.000Z</published>
<updated>2017-07-27T06:53:13.000Z</updated>
<content type="html"><![CDATA[<p>Matplotlib是Python中最常用的可视化工具之一,可以非常方便地创建海量类型地2D图表和一些基本的3D图表。</p><a id="more"></a><h1 id="安装方式"><a href="#安装方式" class="headerlink" title="安装方式"></a>安装方式</h1><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line"># ubuntu上安装</div><div class="line">sudo apt install python-matplotlib</div><div class="line"># mac上安装</div><div class="line">pip install matplotlib</div></pre></td></tr></table></figure><h1 id="快速入门"><a href="#快速入门" class="headerlink" title="快速入门"></a>快速入门</h1><h2 id="快速入门小例子1之画单个图"><a href="#快速入门小例子1之画单个图" class="headerlink" title="快速入门小例子1之画单个图"></a>快速入门小例子1之画单个图</h2><p>我们只要有x轴的数和y轴的数,那么就可以在坐标轴上画出图来了.<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div></pre></td><td class="code"><pre><div class="line">import numpy as np</div><div class="line">import matplotlib as mpl</div><div class="line">import matplotlib.pyplot as plt</div><div class="line"></div><div 
class="line"># 通过rcParams设置全局横纵轴字体大小</div><div class="line">mpl.rcParams['xtick.labelsize']=24</div><div class="line">mpl.rcParams['ytick.labelsize']=24</div><div class="line"></div><div class="line"># x轴的点</div><div class="line">x1 = np.arange(11)</div><div class="line"># y轴的点</div><div class="line">y1 = [</div><div class="line"> 1.0847275042134147E-4,</div><div class="line"> 2.0106877828356476E-4,</div><div class="line"> 1.1836360644802181E-4,</div><div class="line"> 0.043453404423487926,</div><div class="line"> 0.03113001646083574,</div><div class="line"> 0.06,</div><div class="line"> 0.012709253496067191,</div><div class="line"> 0.06,</div><div class="line"> 3.284899860591644E-4,</div><div class="line"> 0.015235253124714847,</div><div class="line"> 0.0034946847451197242,</div><div class="line">]</div><div class="line"></div><div class="line"># 创建一个图,名字为ctr</div><div class="line">plt.figure("ctr")</div><div class="line"># 在图上绘制</div><div class="line">plt.plot(x1,y1)</div><div class="line"></div><div class="line"># 将当前figure的图像保存到文件result.png</div><div class="line">plt.savefig('result.pn</div><div class="line">g')</div><div class="line"># 一定要加上这句才能让画好的图显示在屏幕上</div><div class="line">plt.show()</div></pre></td></tr></table></figure></p><p>如图所示:<br><img src="/images/python/ctr1.png" alt="ctr1"></p><p>看上面就没有几行代码,但是就画出了一个图.所以用Matplotlib可以非常方便的绘制我们想要的图形.这里这是用最简单的例子说明一下.</p><h2 id="快速入门小例子2之把两组坐标画在一个图上进行比较"><a href="#快速入门小例子2之把两组坐标画在一个图上进行比较" class="headerlink" title="快速入门小例子2之把两组坐标画在一个图上进行比较"></a>快速入门小例子2之把两组坐标画在一个图上进行比较</h2><p>这里我们有两组数据,希望能够方便的比较这两组数据的差异,那么我们就可以把趋势都画在一个图上<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div 
class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div class="line">65</div></pre></td><td class="code"><pre><div class="line">import numpy as np</div><div class="line">import matplotlib as mpl</div><div class="line">import matplotlib.pyplot as plt</div><div class="line"></div><div class="line"># Set the global tick-label font size for both axes via rcParams</div><div class="line">mpl.rcParams['xtick.labelsize']=24</div><div class="line">mpl.rcParams['ytick.labelsize']=24</div><div class="line"></div><div class="line"># x-axis points</div><div class="line">x1 = np.arange(11)</div><div class="line"># y-axis points</div><div class="line">y1 = [</div><div class="line"> 1.0847275042134147E-4,</div><div class="line"> 2.0106877828356476E-4,</div><div class="line"> 
1.1836360644802181E-4,</div><div class="line"> 0.043453404423487926,</div><div class="line"> 0.03113001646083574,</div><div class="line"> 0.06,</div><div class="line"> 0.012709253496067191,</div><div class="line"> 0.06,</div><div class="line"> 3.284899860591644E-4,</div><div class="line"> 0.015235253124714847,</div><div class="line"> 0.0034946847451197242,</div><div class="line">]</div><div class="line"></div><div class="line"># Create a figure named ctr</div><div class="line">#plt.figure("ctr")</div><div class="line"># Plot on the figure</div><div class="line">#plt.plot(x1,y1)</div><div class="line"></div><div class="line"></div><div class="line">x2 = np.arange(11)</div><div class="line"></div><div class="line">y2 = [</div><div class="line"> 3.529088519807792E-5,</div><div class="line"> 1.1895968858318187E-4,</div><div class="line"> 0.0013049292594645469,</div><div class="line"> 0.046417845349992326,</div><div class="line"> 0.03282177644291713,</div><div class="line"> 0.06,</div><div class="line"> 0.013313023920004725,</div><div class="line"> 0.06,</div><div class="line"> 3.554547063283854E-4,</div><div class="line"> 0.014309633417956262,</div><div class="line"> 0.0034946847451197242,</div><div class="line">]</div><div class="line"></div><div class="line"></div><div class="line">#plt.figure("ctrEstimate")</div><div class="line">#plt.plot(x2,y2,'k')</div><div class="line"></div><div class="line"></div><div class="line"># Draw both curves on the same figure</div><div class="line">plt.figure('ctr & ctrEstimate')</div><div class="line">plt.plot(x1, y1)</div><div class="line"></div><div class="line"># scatter makes it easy to draw a scatter plot</div><div class="line"># plt.scatter(x1,y11,c='red',marker='v')</div><div class="line"></div><div class="line"># plt.scatter(x2,y22,marker='^')</div><div class="line"></div><div class="line"># 'r' means draw the line in red</div><div class="line">plt.plot(x2, y2, 'r')</div><div class="line"></div><div class="line">plt.show()</div></pre></td></tr></table></figure></p><p>The result is shown below:<br><img src="/images/python/ctr2.png" alt="ctr2"></p><h2 
id="入门小例子3之多布局"><a href="#入门小例子3之多布局" class="headerlink" title="入门小例子3之多布局"></a>Introductory example 3: multiple layouts</h2><p>Build several layouts on a single figure to draw multiple plots.<br><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div></pre></td><td class="code"><pre><div class="line">import matplotlib as mpl</div><div class="line">import matplotlib.pyplot as plt</div><div class="line"></div><div class="line"># Set the global tick-label font size for both axes via rcParams</div><div class="line">mpl.rcParams['xtick.labelsize']=24</div><div class="line">mpl.rcParams['ytick.labelsize']=24</div><div class="line"></div><div class="line">x=range(10)</div><div class="line"></div><div class="line">y=[5,4,3,2,1,6,7,8,9,0]</div><div class="line"></div><div class="line">fig=plt.figure("one figure many subplot")</div><div class="line">ax=fig.add_subplot(131)</div><div class="line">ax.set_title('Histogram')</div><div class="line">ax.bar(x,y)</div><div class="line"></div><div class="line">ax=fig.add_subplot(132)</div><div class="line">ax.set_title('line chart')</div><div class="line">ax.plot(x,y)</div><div class="line"></div><div class="line">ax=fig.add_subplot(133)</div><div class="line">ax.set_title(u'Scatter plot')</div><div class="line">ax.scatter(x,y)</div><div class="line"></div><div class="line">plt.show()</div></pre></td></tr></table></figure></p><p>The result is shown below:<br><img src="/images/python/figure.png" alt="figure"></p><h1 
id="matplotlib画图api解释"><a href="#matplotlib画图api解释" class="headerlink" title="matplotlib画图api解释"></a>Matplotlib plotting API explained</h1><p>Explaining the API on its own is dull, so we go straight to plotting code; the comments in the code explain each step in detail.</p><h2 id="画2维的柱图和饼图"><a href="#画2维的柱图和饼图" class="headerlink" title="画2维的柱图和饼图"></a>Drawing 2D bar charts and pie charts</h2><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div 
class="line">65</div><div class="line">66</div><div class="line">67</div><div class="line">68</div><div class="line">69</div><div class="line">70</div><div class="line">71</div><div class="line">72</div><div class="line">73</div><div class="line">74</div><div class="line">75</div><div class="line">76</div></pre></td><td class="code"><pre><div class="line"></div><div class="line">import numpy as np</div><div class="line"></div><div class="line">import matplotlib as mpl</div><div class="line">import matplotlib.pyplot as plt</div><div class="line"></div><div class="line">mpl.rcParams['axes.titlesize'] = 20</div><div class="line">mpl.rcParams['xtick.labelsize'] = 16</div><div class="line">mpl.rcParams['ytick.labelsize'] = 16</div><div class="line">mpl.rcParams['axes.labelsize'] = 16</div><div class="line">mpl.rcParams['xtick.major.size'] = 0</div><div class="line">mpl.rcParams['ytick.major.size'] = 0</div><div class="line"></div><div class="line"># Top running speeds of the dog, cat, and cheetah, with the color used to plot each</div><div class="line">speed_map = {</div><div class="line"> 'dog': (48, '#7199cf'),</div><div class="line"> 'cat': (45, '#4fc4aa'),</div><div class="line"> 'cheetah': (120, '#e1a7a2')</div><div class="line">}</div><div class="line"></div><div class="line"># Name of the overall figure</div><div class="line">fig=plt.figure('Bar chart & Pie chart')</div><div class="line"></div><div class="line"># Add a subplot to the figure; 121 means the first cell of a 1-row, 2-column grid</div><div class="line">ax=fig.add_subplot(121)</div><div class="line">ax.set_title("Running speed - bar chart")</div><div class="line"></div><div class="line"># Positions of the elements on the x-axis: 0 1 2</div><div class="line">xticks=np.arange(3)</div><div class="line"></div><div class="line"># Width of each bar in the bar chart</div><div class="line">bar_width=0.5</div><div class="line"></div><div class="line"># Animal names (as a list, so it can be reused and indexed under Python 3)</div><div class="line">animals=list(speed_map.keys())</div><div class="line"></div><div class="line"># Running speeds</div><div class="line">speeds=[x[0] for x in speed_map.values()]</div><div class="line"></div><div class="line"># 
Matching colors</div><div class="line">colors=[x[1] for x in speed_map.values()]</div><div class="line"></div><div class="line"># Draw the bar chart: x is the label positions, y is the speeds; set the bar width and make the bar edges transparent; align='edge' anchors each bar's left edge at its x position</div><div class="line">bars=ax.bar(xticks,speeds,width=bar_width,edgecolor='none',align='edge')</div><div class="line"></div><div class="line"># Set the y-axis label</div><div class="line">ax.set_ylabel('Speed(km/h)')</div><div class="line"></div><div class="line"># Put each x tick label at the center of its bar</div><div class="line">ax.set_xticks(xticks+bar_width/2)</div><div class="line"></div><div class="line"># Set the text of each tick label</div><div class="line">ax.set_xticklabels(animals)</div><div class="line"></div><div class="line"># Set the x-axis range</div><div class="line">ax.set_xlim([bar_width/2-0.5,3-bar_width/2])</div><div class="line"></div><div class="line"># Set the y-axis range</div><div class="line">ax.set_ylim([0,125])</div><div class="line"></div><div class="line"># Give each bar its assigned color</div><div class="line">for bar,color in zip(bars,colors):</div><div class="line"> bar.set_color(color)</div><div class="line"></div><div class="line"></div><div class="line"># Add a new subplot at position 122</div><div class="line">ax=fig.add_subplot(122)</div><div class="line">ax.set_title('Running speed - pie chart')</div><div class="line">labels=['{}\n{} km/h'.format(animal,speed) for animal,speed in zip(animals,speeds)]</div><div class="line"></div><div class="line"># Draw the pie chart with the given labels and matching colors</div><div class="line">ax.pie(speeds,labels=labels,colors=colors)</div><div class="line"># ax.plot(speeds)</div><div class="line"></div><div class="line"></div><div class="line">plt.show()</div></pre></td></tr></table></figure><p><img src="/images/python/subplot.png" alt="subplot"></p>]]></content>
<summary type="html">
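    <p>Matplotlib is one of the most widely used visualization tools in Python. It makes it easy to create a huge variety of 2D charts as well as some basic 3D charts.</p>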
</summary>
<category term="python" scheme="http://dmlcoding.com/categories/python/"/>
<category term="python" scheme="http://dmlcoding.com/tags/python/"/>
<category term="Matplotlib" scheme="http://dmlcoding.com/tags/Matplotlib/"/>
</entry>
<entry>
    <title>HBase Notes</title>
<link href="http://dmlcoding.com/2017/HbaseNotes/"/>
<id>http://dmlcoding.com/2017/HbaseNotes/</id>
<published>2017-07-24T02:00:00.000Z</published>
<updated>2017-07-25T06:07:03.000Z</updated>
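    <content type="html"><![CDATA[<p>Problems I ran into while developing with HBase, plus a summary of a few key points.<br><a id="more"></a></p><h1 id="hbase的内存分配"><a href="#hbase的内存分配" class="headerlink" title="hbase的内存分配"></a>HBase memory allocation</h1><blockquote><p>By default, HBase gives 40% of the heap to the BlockCache and 40% to the MemStore.<br>In HBase, two in-memory structures consume most of the heap: the BlockCache caches HFile blocks for reads, and the MemStore buffers recent writes.</p></blockquote><ul><li>hfile.block.cache.size (for read-heavy workloads, increase this value moderately)</li><li>hbase.regionserver.global.memstore.upperLimit (for write-heavy workloads, increase this value moderately; newer HBase releases name this setting hbase.regionserver.global.memstore.size)</li></ul>]]></content>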
<summary type="html">
    <p>Problems I ran into while developing with HBase, plus a summary of a few key points.</p>
</summary>
<category term="hbase" scheme="http://dmlcoding.com/categories/hbase/"/>
<category term="bigdata" scheme="http://dmlcoding.com/tags/bigdata/"/>
<category term="hbase" scheme="http://dmlcoding.com/tags/hbase/"/>
</entry>
<entry>
    <title>Mid-Year Review</title>
<link href="http://dmlcoding.com/2017/WhatIWriteIsShit/"/>
<id>http://dmlcoding.com/2017/WhatIWriteIsShit/</id>
<published>2017-07-16T02:20:00.000Z</published>
<updated>2017-07-18T02:01:08.000Z</updated>
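    <content type="html"><![CDATA[<p><img src="/images/beautifulPic/1.jpg" alt="风景"><br>So many sentences written, and not one of them interesting;<br>so many posts written, and not one of them profound;<br>not meant as a running log, yet it still reads like one.</p><a id="more"></a><h1 id="我干了些啥"><a href="#我干了些啥" class="headerlink" title="我干了些啥"></a>What I have been up to</h1><blockquote><p>Half of 2017 has already passed.</p></blockquote>]]></content>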
<summary type="html">
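    <p><img src="/images/beautifulPic/1.jpg" alt="风景"><br>So many sentences written, and not one of them interesting;<br>so many posts written, and not one of them profound;<br>not meant as a running log, yet it still reads like one.</p>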
</summary>
<category term="think" scheme="http://dmlcoding.com/categories/think/"/>
<category term="think" scheme="http://dmlcoding.com/tags/think/"/>
</entry>
</feed>