# About: Hadoop - Confirm the services are alive

---

This Notebook checks whether the following services are running in a Hadoop environment.

- ZooKeeper
- HDFS
- YARN
- HBase
- Spark

## *Operation Note*

*This is a cell for your own recording. Describe the background of this operation here.*

# Setting the target cluster

Enter **the name of the cluster you want to check** in the cell below. The cluster name is the Ansible group name defined in the *Set! Inventory Notebook* and has the form `hadoop_all_{{Cluster Name}}`.

In [1]:
target_group = 'hadoop_all_testcluster'

Confirm that Ansible can ping every host in the target cluster.

In [2]:
!ansible -m ping {target_group}

XXX.XXX.XXX.72 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
XXX.XXX.XXX.71 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
XXX.XXX.XXX.70 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
XXX.XXX.XXX.112 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
XXX.XXX.XXX.73 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
XXX.XXX.XXX.113 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
XXX.XXX.XXX.114 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}


# Checking ZooKeeper status

Is the service running?

The check passes if every node where ZooKeeper is installed reports `zookeeper-server is running`.

In [3]:
!ansible hadoop_zookeeperserver -b -a 'service zookeeper-server status' -l {target_group}

XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
zookeeper-server is running

XXX.XXX.XXX.72 | SUCCESS | rc=0 >>
zookeeper-server is running

XXX.XXX.XXX.71 | SUCCESS | rc=0 >>
zookeeper-server is running

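The same "rc=0 and the expected message" check recurs throughout this Notebook, so it can help to automate it. The following is a minimal sketch, assuming the ad-hoc output has the `<host> | <STATUS> | rc=<n> >>` header format shown above. The function names `parse_adhoc_output` and `hosts_not_running` are hypothetical helpers, not part of Ansible.

```python
import re

# ANSI colour sequences that Ansible may emit when writing to a terminal.
ANSI = re.compile(r'\x1b\[[0-9;]*m')
# Per-host header line, e.g. "10.0.0.1 | SUCCESS | rc=0 >>".
HEADER = re.compile(r'^(\S+) \| (\S+) \| rc=(-?\d+) >>')


def parse_adhoc_output(lines):
    """Return {host: (rc, body_text)} for `ansible -a ...` output lines."""
    results = {}
    host = None
    for raw in lines:
        line = ANSI.sub('', raw)
        m = HEADER.match(line)
        if m:
            host = m.group(1)
            results[host] = (int(m.group(3)), [])
        elif host is not None:
            # Everything up to the next header belongs to the current host.
            results[host][1].append(line)
    return {h: (rc, "\n".join(body).strip()) for h, (rc, body) in results.items()}


def hosts_not_running(lines, expected="is running"):
    """List hosts whose return code is non-zero or whose output lacks `expected`."""
    parsed = parse_adhoc_output(lines)
    return sorted(h for h, (rc, body) in parsed.items()
                  if rc != 0 or expected not in body)


# Made-up sample output; the addresses are placeholders.
sample = [
    "10.0.0.1 | SUCCESS | rc=0 >>",
    "zookeeper-server is running",
    "",
    "10.0.0.2 | FAILED | rc=3 >>",
    "zookeeper-server is not running",
    "",
]
print(hosts_not_running(sample, "zookeeper-server is running"))
```

In a Notebook cell, the output of `!ansible ...` can be assigned to a variable and passed to `hosts_not_running` directly. Hosts that Ansible cannot reach at all are reported with a different header and are not covered by this sketch.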

# Checking HDFS status

Check the status of the NameNode, DataNode, and JournalNode.

## Checking NameNode status

Is the service running? (rc=0 means the service is running.)

In [4]:
# for NameNode-HA
print("ZKFC:")
!ansible hadoop_namenode -b -a 'service hadoop-hdfs-zkfc status' -l {target_group}

print("NameNode:")
!ansible hadoop_namenode -b -a 'service hadoop-hdfs-namenode status' -l {target_group}

ZKFC:
XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
Hadoop zkfc is running[  OK  ]

XXX.XXX.XXX.71 | SUCCESS | rc=0 >>
Hadoop zkfc is running[  OK  ]

NameNode:
XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
Hadoop namenode is running[  OK  ]

XXX.XXX.XXX.71 | SUCCESS | rc=0 >>
Hadoop namenode is running[  OK  ]


In an HA configuration, is the service state appropriate? (Is one of the NameNodes active?)

In [5]:
!ansible hadoop_namenode -b --become-user hdfs -m shell \
         -a 'timeout 15 hdfs haadmin -getServiceState $(hostname)' -l {target_group}

XXX.XXX.XXX.71 | SUCCESS | rc=0 >>
standby

XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
active

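The HA state check can also be done programmatically instead of by eye. The sketch below assumes that each host header line is followed by a single state word, as in the output above. `parse_ha_states` and `find_active` are hypothetical helper names, and treating "exactly one active node" as the healthy condition is an assumption of this sketch.

```python
def parse_ha_states(lines):
    """Map each host to the state printed on the line after its header."""
    states = {}
    host = None
    for line in lines:
        text = line.strip()
        if text.endswith(">>") and "|" in text:
            # Header line such as "10.0.0.1 | SUCCESS | rc=0 >>".
            host = text.split()[0]
        elif host is not None and text:
            # First non-empty line after the header carries the state.
            states[host] = text.split()[0]
            host = None
    return states


def find_active(states):
    """Return the single active host, or raise if there is not exactly one."""
    active = [h for h, s in states.items() if s == "active"]
    if len(active) != 1:
        raise ValueError("expected exactly one active node, found %d" % len(active))
    return active[0]


# Made-up sample output; the addresses are placeholders.
sample = [
    "10.0.0.2 | SUCCESS | rc=0 >>",
    "standby",
    "",
    "10.0.0.1 | SUCCESS | rc=0 >>",
    "active",
    "",
]
print(find_active(parse_ha_states(sample)))
```

Raising an error when no node or more than one node is active makes a failed failover visible immediately, rather than leaving it to a later cell to fail.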

## Checking DataNode status

Is the service running? (rc=0 means the service is running.)

In [6]:
!ansible hadoop_slavenode -b -a 'service hadoop-hdfs-datanode status' -l {target_group}

XXX.XXX.XXX.112 | SUCCESS | rc=0 >>
Hadoop datanode is running[  OK  ]

XXX.XXX.XXX.73 | SUCCESS | rc=0 >>
Hadoop datanode is running[  OK  ]

XXX.XXX.XXX.113 | SUCCESS | rc=0 >>
Hadoop datanode is running[  OK  ]

XXX.XXX.XXX.114 | SUCCESS | rc=0 >>
Hadoop datanode is running[  OK  ]


Are the DataNodes recognized as live nodes?

The check passes if the `Live datanodes` line shows the expected number of DataNodes.

In [7]:
!ansible hadoop_client -b --become-user hdfs -a 'hdfs dfsadmin -report' -l {target_group}

XXX.XXX.XXX.72 | SUCCESS | rc=0 >>
Configured Capacity: 422216597504 (393.22 GB)
Present Capacity: 389015651549 (362.30 GB)
DFS Remaining: 388823590109 (362.12 GB)
DFS Used: 192061440 (183.16 MB)
DFS Used%: 0.05%
Under replicated blocks: 10
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (4):

Name: XXX.XXX.XXX.114:50010 (testvm007)
Hostname: testvm007
Decommission Status : Normal
Configured Capacity: 105554149376 (98.30 GB)
DFS Used: 3772416 (3.60 MB)
Non DFS Used: 8199401223 (7.64 GB)
DFS Remaining: 97350975737 (90.67 GB)
DFS Used%: 0.00%
DFS Remaining%: 92.23%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 7
Last contact: Fri Aug 19 11:47:02 JST 2016


Name: XXX.XXX.XXX.73:50010 (testvm004)
Hostname: testvm004
Decommission Status : Normal
Conf
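
To compare the number of live DataNodes with the number you expect, the count can be extracted from the report. This is a minimal sketch, assuming the `Live datanodes (N):` line format shown in the output above. `live_datanode_count` is a hypothetical helper name.

```python
import re


def live_datanode_count(report_lines):
    """Return N from the "Live datanodes (N):" line, or None if it is absent."""
    for line in report_lines:
        m = re.search(r'Live datanodes \((\d+)\)', line)
        if m:
            return int(m.group(1))
    return None


# Abbreviated sample in the same format as the report above.
sample = [
    "Missing blocks: 0",
    "",
    "-------------------------------------------------",
    "Live datanodes (4):",
    "",
]
print(live_datanode_count(sample))
```

In the Notebook, the report lines would come from assigning the `!ansible ... 'hdfs dfsadmin -report'` output to a variable, and the expected value is the number of hosts in the `hadoop_slavenode` group for the target cluster.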

## Checking JournalNode status (HA configuration only)

Is the service running? (rc=0 means the service is running.)

In [8]:
!ansible hadoop_journalnode -b -a 'service hadoop-hdfs-journalnode status' -l {target_group}

XXX.XXX.XXX.72 | SUCCESS | rc=0 >>
Hadoop journalnode is running[  OK  ]

XXX.XXX.XXX.71 | SUCCESS | rc=0 >>
Hadoop journalnode is running[  OK  ]

XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
Hadoop journalnode is running[  OK  ]


## Checking the HDFS Web UI

Confirm that the Web UI can be opened at the URL printed below.

In [11]:
## for Single Node Cluster
# import re
# ping_stdout = !ansible hadoop_namenode -b --become-user hdfs -m ping -l {target_group}
# namenode_addr = [m.group(1) for m in (re.match(r'^(\S+)\s*\|\s*SUCCESS\s+', l) for l in ping_stdout) if m][0]

# for NameNode-HA
haadmin_stdout = !ansible hadoop_namenode -b --become-user hdfs -m shell -a 'timeout 15 hdfs haadmin -getServiceState $(hostname)' -l {target_group}
# First token of each non-empty line: a host address (header line) followed by its state
haadmin_result = [line.split()[0] for line in haadmin_stdout if line.strip()]
# The address of the active NameNode is the token just before "active"
namenode_addr = haadmin_result[haadmin_result.index("active") - 1]

print("http://%s:50070/" % namenode_addr)

http://XXX.XXX.XXX.70:50070/


# Checking YARN status

Check the status of the ResourceManager, NodeManager, and MapReduce HistoryServer.

## Checking ResourceManager status

Is the service running? (rc=0 means the service is running.)

In [12]:
!ansible hadoop_resourcemanager -b -a 'service hadoop-yarn-resourcemanager status' -l {target_group}

XXX.XXX.XXX.71 | SUCCESS | rc=0 >>
Hadoop resourcemanager is running[  OK  ]
/etc/default/hadoop-yarn-resourcemanager: line 21: unexpected EOF while looking for matching `"'
/etc/default/hadoop-yarn-resourcemanager: line 23: syntax error: unexpected end of file

XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
Hadoop resourcemanager is running[  OK  ]
/etc/default/hadoop-yarn-resourcemanager: line 21: unexpected EOF while looking for matching `"'
/etc/default/hadoop-yarn-resourcemanager: line 23: syntax error: unexpected end of file


In an HA configuration, is the service state appropriate? (Is one of the ResourceManagers active?)

In [13]:
!ansible hadoop_resourcemanager -b --become-user yarn -m shell \
         -a 'timeout 15 yarn rmadmin -getServiceState $(hostname)' -l {target_group}

XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
active

XXX.XXX.XXX.71 | SUCCESS | rc=0 >>
standby


## Checking NodeManager status

Is the service running? (rc=0 means the service is running.)

In [14]:
!ansible hadoop_slavenode -b -a 'service hadoop-yarn-nodemanager status' -l {target_group}

XXX.XXX.XXX.112 | SUCCESS | rc=0 >>
Hadoop nodemanager is running[  OK  ]

XXX.XXX.XXX.113 | SUCCESS | rc=0 >>
Hadoop nodemanager is running[  OK  ]

XXX.XXX.XXX.73 | SUCCESS | rc=0 >>
Hadoop nodemanager is running[  OK  ]

XXX.XXX.XXX.114 | SUCCESS | rc=0 >>
Hadoop nodemanager is running[  OK  ]


Are the NodeManagers recognized by the ResourceManager?

In [15]:
## for Single Node Cluster
# import re
# ping_stdout = !ansible hadoop_resourcemanager -b --become-user hdfs -m ping -l {target_group}
# resourcemanager_addr = [m.group(1) for m in (re.match(r'^(\S+)\s*\|\s*SUCCESS\s+', l) for l in ping_stdout) if m][0]

# for ResourceManager-HA (limit to the target cluster with -l, as in the other cells)
rmadmin_stdout = !ansible hadoop_resourcemanager -b --become-user yarn -m shell -a 'timeout 15 yarn rmadmin -getServiceState $(hostname)' -l {target_group}
rmadmin_result = [line.split()[0] for line in rmadmin_stdout if line.strip()]
# The address of the active ResourceManager is the token just before "active"
resourcemanager_addr = rmadmin_result[rmadmin_result.index("active") - 1]

!ansible {resourcemanager_addr} -b --become-user yarn -a "timeout 15 yarn node -list"

XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
Total Nodes:4
         Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
 testvm007:45454	        RUNNING	   testvm007:8042	                           0
 testvm006:45454	        RUNNING	   testvm006:8042	                           0
 testvm005:45454	        RUNNING	   testvm005:8042	                           0
 testvm004:45454	        RUNNING	   testvm004:8042	                           0
16/08/19 11:48:57 INFO impl.TimelineClientImpl: Timeline service address: http://testvm003:8188/ws/v1/timeline/

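To confirm that every NodeManager is in the `RUNNING` state without reading the table by eye, the node list can be parsed. The sketch below assumes the four-column layout of `yarn node -list` shown above and recognizes data rows by the `host:port` form of the Node-Id. `parse_node_list` is a hypothetical helper name.

```python
def parse_node_list(lines):
    """Return {node_id: node_state} for the data rows of `yarn node -list`."""
    nodes = {}
    for line in lines:
        fields = line.split()
        # Data rows look like: "<host>:<port> <STATE> <host>:<http-port> <containers>".
        # The header row, log lines and the Ansible header have no ":" in field 0.
        if len(fields) >= 4 and ":" in fields[0] and fields[1].isupper():
            nodes[fields[0]] = fields[1]
    return nodes


# Made-up sample in the same layout; host names are placeholders.
sample = [
    "10.0.0.1 | SUCCESS | rc=0 >>",
    "Total Nodes:2",
    "         Node-Id\t     Node-State\tNode-Http-Address\tNumber-of-Running-Containers",
    " nodea:45454\t        RUNNING\t   nodea:8042\t                           0",
    " nodeb:45454\t      UNHEALTHY\t   nodeb:8042\t                           0",
]
nodes = parse_node_list(sample)
not_running = sorted(n for n, state in nodes.items() if state != "RUNNING")
print(not_running)
```

An empty `not_running` list, together with a node count that matches the number of hosts in `hadoop_slavenode`, corresponds to the manual check described above.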

## Checking MapReduce HistoryServer status

If you use the MapReduce HistoryServer: is the service running? (rc=0 means the service is running.)

In [16]:
!ansible hadoop_mapreduce_historyserver -b -a 'service hadoop-mapreduce-historyserver status' -l {target_group}

XXX.XXX.XXX.72 | SUCCESS | rc=0 >>
Hadoop historyserver is running[  OK  ]


## Checking the YARN Web UI

In [18]:
## for Single Node Cluster
# import re
# ping_stdout = !ansible hadoop_resourcemanager -b --become-user hdfs -m ping -l {target_group}
# resourcemanager_addr = [m.group(1) for m in (re.match(r'^(\S+)\s*\|\s*SUCCESS\s+', l) for l in ping_stdout) if m][0]

# for ResourceManager-HA (limit to the target cluster with -l, as in the other cells)
rmadmin_stdout = !ansible hadoop_resourcemanager -b --become-user yarn -m shell -a 'timeout 15 yarn rmadmin -getServiceState $(hostname)' -l {target_group}
rmadmin_result = [line.split()[0] for line in rmadmin_stdout if line.strip()]
# The address of the active ResourceManager is the token just before "active"
resourcemanager_addr = rmadmin_result[rmadmin_result.index("active") - 1]

print("http://%s:8088/" % resourcemanager_addr)

http://XXX.XXX.XXX.70:8088/


# Checking HBase status

Check the status of the Master and RegionServer.

## Checking Master status

Is the service running? (rc=0 means the service is running.)

In [19]:
!ansible hadoop_hbase_master -l {target_group} -b -a 'service hbase-master status'

XXX.XXX.XXX.71 | SUCCESS | rc=0 >>
HBase master daemon is running[  OK  ]

XXX.XXX.XXX.70 | SUCCESS | rc=0 >>
HBase master daemon is running[  OK  ]


## Checking RegionServer status

Is the service running?

In [20]:
!ansible hadoop_hbase_regionserver -l {target_group} -b -a 'service hbase-regionserver status'

XXX.XXX.XXX.113 | SUCCESS | rc=0 >>
hbase-regionserver is running

XXX.XXX.XXX.73 | SUCCESS | rc=0 >>
hbase-regionserver is running

XXX.XXX.XXX.112 | SUCCESS | rc=0 >>
hbase-regionserver is running

XXX.XXX.XXX.114 | SUCCESS | rc=0 >>
hbase-regionserver is running


## Checking the HBase Web UI

In [26]:
hosts_stdout = !ansible {target_group} -b -a 'cat /etc/hosts'
# Drop the Ansible per-host header lines ("<host> | SUCCESS | rc=0 >>")
fields = [l.split() for l in hosts_stdout if not l.strip().endswith('>>')]
# Keep only "<ip> <hostname>" entries and map hostname -> IP address
machines = {f[1]: f[0] for f in fields if len(f) == 2}
machines

{'testvm001': 'XXX.XXX.XXX.70',
 'testvm002': 'XXX.XXX.XXX.71',
 'testvm003': 'XXX.XXX.XXX.72',
 'testvm004': 'XXX.XXX.XXX.73',
 'testvm005': 'XXX.XXX.XXX.112',
 'testvm006': 'XXX.XXX.XXX.113',
 'testvm007': 'XXX.XXX.XXX.114'}
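
The cell above keeps only `/etc/hosts` lines that have exactly two fields. If the hosts files in your cluster contain aliases or comments, a slightly more tolerant parser may be needed. The following is a sketch of such a variant, not the parser used by this Notebook. `parse_hosts` is a hypothetical helper name, and skipping loopback entries is an assumption of this sketch.

```python
def parse_hosts(lines):
    """Build {hostname_or_alias: ip} from /etc/hosts lines in ad-hoc output."""
    mapping = {}
    for line in lines:
        if line.rstrip().endswith(">>"):
            # Ansible per-host header, e.g. "10.0.0.1 | SUCCESS | rc=0 >>".
            continue
        # Strip trailing comments, then split into "<ip> <name> [alias...]".
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2:
            continue
        ip, names = entry[0], entry[1:]
        if ip.startswith("127.") or ip == "::1":
            continue  # loopback entries do not identify a cluster node
        for name in names:
            # Keep the first address seen for each name.
            mapping.setdefault(name, ip)
    return mapping


# Made-up sample; addresses and the example.com alias are placeholders.
sample = [
    "10.0.0.1 | SUCCESS | rc=0 >>",
    "127.0.0.1 localhost localhost.localdomain",
    "10.0.0.1 testvm001",
    "10.0.0.2 testvm002 testvm002.example.com  # standby",
    "",
    "# comment only",
]
print(parse_hosts(sample))
```

Because every host in the group prints its own copy of the file, `setdefault` keeps the first mapping and ignores the duplicates from later hosts.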

In [27]:
zknode_stdout = !ansible -m ping -l {target_group} hadoop_zookeeperserver
zknodes = sorted([l.split()[0] for l in zknode_stdout if 'SUCCESS' in l])

from kazoo.client import KazooClient
zk = KazooClient(hosts='%s:2181' % zknodes[0], read_only=True)
zk.start()
try:
    # The znode data is a byte string that contains the active Master's hostname
    (master_result, v) = zk.get("/hbase/master")
finally:
    zk.stop()
hbase_master_host = None
for host, ip in machines.items():
    # Compare as bytes so that the check also works on Python 3
    if host.encode('utf-8') in master_result:
        hbase_master_host = ip
print("http://%s:60010" % hbase_master_host)

http://XXX.XXX.XXX.70:60010
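
The lookup above relies on a substring match between known hostnames and the znode data. One caveat is that a short hostname can match inside a longer one. The sketch below shows one way to guard against that by preferring the longest matching name. `find_master_ip` is a hypothetical helper, and the byte payload in the sample is invented for illustration rather than a real HBase znode.

```python
def find_master_ip(znode_data, machines):
    """Return the IP of the host whose name appears in `znode_data` (bytes).

    When several names match, the longest one wins, so "testvm001" does not
    shadow "testvm0010". Returns None if no known hostname is found.
    """
    matches = [h for h in machines if h.encode("utf-8") in znode_data]
    if not matches:
        return None
    return machines[max(matches, key=len)]


# Invented payload and addresses, for illustration only.
machines_sample = {"testvm001": "10.0.0.1", "testvm0010": "10.0.0.10"}
payload = b"\x00\x01header\x00testvm0010\x00trailer"
print(find_master_ip(payload, machines_sample))
```

Returning `None` instead of leaving the variable undefined makes it easier to tell "no Master registered in ZooKeeper" apart from a coding error.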


# Spark HistoryServer

Is the service running? (rc=0 means the service is running.)

In [33]:
!ansible hadoop_spark_history -l {target_group} -b --become-user spark -m shell \
         -a '[ -s ${{SPARK_PID_DIR}}/spark-spark-org.apache.spark.deploy.history.HistoryServer-1.pid ] && [ -x /proc/$(cat ${{SPARK_PID_DIR}}/spark-spark-org.apache.spark.deploy.history.HistoryServer-1.pid ) ]'

XXX.XXX.XXX.72 | SUCCESS | rc=0 >>

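The shell test above succeeds when the PID file is non-empty and a `/proc/<pid>` entry exists for the PID it contains. The same logic can be written in Python, which may be easier to read and reuse. This is a minimal sketch under that reading of the command. `pid_file_alive` and its `proc_root` parameter are hypothetical names, and `proc_root` exists only so that the function can be exercised without a real process.

```python
import os
import tempfile


def pid_file_alive(pid_path, proc_root="/proc"):
    """Return True if `pid_path` holds a PID that has an entry under `proc_root`."""
    try:
        with open(pid_path) as f:
            pid = int(f.read().strip())
    except (IOError, OSError, ValueError):
        # Missing, unreadable, empty or malformed PID file.
        return False
    return os.path.isdir(os.path.join(proc_root, str(pid)))


# Self-contained demonstration using a fake proc directory.
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, "4242"))        # pretend PID 4242 is running
demo_pid_file = os.path.join(demo_dir, "history.pid")
with open(demo_pid_file, "w") as f:
    f.write("4242\n")
print(pid_file_alive(demo_pid_file, proc_root=demo_dir))
```

On a real node the function would be called with the actual PID file path under `SPARK_PID_DIR` and the default `proc_root`, which works only on systems that provide `/proc`.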