## Storing and Retrieving Data with HappyBase (a Python API for Apache HBase)
This notebook shows how we save the final result to HBase after the POS-tagging step, and how we read it back.

Rows are inserted into HBase one at a time in a loop, and retrieved the same way by scanning the table row by row.

In [1]:
from pyspark.sql import SparkSession
import os

In [2]:
os.environ["PYSPARK_PYTHON"]="/home/pc/g5_env/bin/python39"

spark = SparkSession.builder.master("local[5]")\
            .appName("ReadWrite HBase")\
            .config('spark.executor.memory', '10g')\
            .config('spark.driver.maxResultSize', '5g')\
            .config('spark.driver.memory', '10g')\
            .getOrCreate()
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/11 14:27:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/11 14:27:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Save the POS-tag result to an HBase table

This example uses the Malay Wikipedia POS-tag result.<br>
Choose the table that matches your POS-tag dataset from<br>
`[ postag_wiki_en , postag_wiki_ms , postag_wiki_zh , postag_social ]`

In [3]:
# Libraries
import happybase
# Connect to HBase through its Thrift gateway
CDH6_HBASE_THRIFT_VER='0.92'  # kept for reference; not passed to the connection below
connection = happybase.Connection('g5.bigtop.it')

In [3]:
# Available datasets: en_wiki_final, ms_wiki_final, social_media_final
# (this run loads social_media_final; use the dataset that matches your target table)
df = spark.read.parquet("hdfs://g5.bigtop.it:8020/user/root/social_media_final")

In [4]:
from pyspark.sql.functions import monotonically_increasing_id
df1 = df.withColumn(
    "index", monotonically_increasing_id())
df1.show(10)

+-------------------+------------------+--------------------+------+-----+
|           sentence|          language|             pos_tag|n-gram|index|
+-------------------+------------------+--------------------+------+-----+
|     购买 大爱 产品|        ZH, ZH, ZH|VERB, ADJ, VERB, ...|     4|    0|
| 询问 了 卖家 他 跟|ZH, ZH, ZH, ZH, ZH|VERB, UL, NOUN, P...|     5|    1|
|     到货 谢谢 卖家|        ZH, ZH, ZH|    VERB, NOUN, NOUN|     3|    2|
|       所以 听 卖家|        ZH, ZH, ZH|    CONJ, VERB, NOUN|     3|    3|
|第一次 购买 可是 是|    ZH, ZH, ZH, ZH|NUM, VERB, CONJ, ...|     4|    4|
| 跟 这 卖家 下单 了|ZH, ZH, ZH, ZH, ZH|IN, PRON, NOUN, N...|     5|    5|
|   第三次 购买 看到|        ZH, ZH, ZH|     NUM, VERB, VERB|     3|    6|
|    再 回 购买 好评|    ZH, ZH, ZH, ZH|ADV, VERB, VERB, ...|     4|    7|
|     一起 购买 孩子|        ZH, ZH, ZH|     NUM, VERB, NOUN|     3|    8|
|     卖家 服务 极差|        ZH, ZH, ZH|NOUN, NOUN_VERB, ...|     3|    9|
+-------------------+------------------+--------------------+------+-----+
only showing top 10 rows
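
A note on using the `index` column as a row key: `monotonically_increasing_id` guarantees unique, increasing IDs but not consecutive ones, and HBase orders row keys as byte strings, so `'10'` sorts before `'2'`. If scan order matters, one option is to zero-pad the key. The sketch below is an illustration only; the helper name `make_row_key` and the width of 19 digits (enough for any 63-bit ID) are assumptions, not part of the original notebook.

```python
def make_row_key(index: int, width: int = 19) -> str:
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return str(index).zfill(width)


# IDs from different Spark partitions can be far apart, e.g. 2, 10, 8589934592
keys = [make_row_key(i) for i in (2, 10, 8589934592)]
# The padded keys are already in lexicographic order
print(sorted(keys) == keys)
```

With unpadded keys, the same three IDs would sort as `'10'`, `'2'`, `'8589934592'`.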

In [6]:
df1.count()

10000

In [7]:
# Declare the table name
table_name = 'postag_wiki_ms'

In [9]:
### Create the HBase table
connection.open()

# Table schema: a single column family named 'result'
families = {
    'result': dict(),  # use defaults
}
# Create the table
connection.create_table(table_name, families)

connection.close()

#### If this cell raises an error, the table already exists in HBase. Delete it first (for example with `connection.delete_table(table_name, disable=True)`) and run the cell again.
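
To avoid the error when the table already exists, the creation step can check first. This is a minimal sketch, assuming a HappyBase-style connection that exposes `tables()` and `create_table()`; the helper name `ensure_table` is hypothetical and not part of HappyBase.

```python
def ensure_table(connection, name, families):
    """Create `name` with `families` unless it already exists.

    Returns True if the table was created, False if it was already there.
    """
    # Connection.tables() returns table names as bytes under Python 3
    existing = {
        t.decode("utf-8") if isinstance(t, bytes) else t
        for t in connection.tables()
    }
    if name in existing:
        return False
    connection.create_table(name, families)
    return True


# Usage with the objects from this notebook (needs a reachable HBase Thrift server):
# ensure_table(connection, table_name, {'result': dict()})
```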

In [10]:
table = connection.table(table_name)

In [11]:
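connection.open()
# Access fields by column name; positional indexes break when the column order changes
for row in df1.rdd.collect():
    table.put(str(row['index']),  # row key: the unique index column
                {'result:sentence': row['sentence'],
                 'result:pos_tag': row['pos_tag'],
                 'result:ngram': str(row['n-gram']),
                })
connection.close()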
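
Writing one row per `put` call costs a round trip to the Thrift server for every row. HappyBase's `Table.batch()` buffers mutations and sends them in groups, which may be faster for larger datasets. The sketch below is one possible way to use it; the helper name `write_rows` and the default batch size of 1000 are assumptions for illustration.

```python
def write_rows(table, rows, batch_size=1000):
    """Write an iterable of (row_key, data) pairs through a HappyBase batch.

    Returns the number of rows handed to the batch.
    """
    count = 0
    # Table.batch() buffers puts and sends them once `batch_size` mutations
    # have accumulated; leaving the `with` block sends whatever is still buffered.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, data in rows:
            batch.put(row_key, data)
            count += 1
    return count


# Usage with the objects from this notebook (column names as shown by df1.show()):
# rows = ((str(r['index']),
#          {'result:sentence': r['sentence'],
#           'result:pos_tag': r['pos_tag'],
#           'result:ngram': str(r['n-gram'])})
#         for r in df1.toLocalIterator())
# write_rows(table, rows)
```

`toLocalIterator()` streams rows to the driver one partition at a time, so the whole DataFrame does not need to fit in driver memory the way it does with `collect()`.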

## Retrieve the POS-tag result from an HBase table

Continuing from the section above, the Malay Wikipedia POS-tag result is now retrieved from HBase.<br>
Choose the table that matches your POS-tag dataset from<br>
`[ postag_wiki_en , postag_wiki_ms , postag_wiki_zh , postag_social ]`

In [12]:
# Libraries
import happybase
# Connect to HBase through its Thrift gateway
CDH6_HBASE_THRIFT_VER='0.92'  # kept for reference; not passed to the connection below
connection = happybase.Connection('g5.bigtop.it')

In [13]:
connection.open()
table_name = 'postag_wiki_ms'
table = connection.table(table_name)
list_hbase = []
# Scan the whole table; pass limit=20 to table.scan() to preview only the first rows
for key, row in table.scan():
    # Column names and values come back as bytes, so decode the values to str
    sentence = row[b'result:sentence'].decode("utf-8")
    pos_tag = row[b'result:pos_tag'].decode("utf-8")
    ngram = row[b'result:ngram'].decode("utf-8")
    list_hbase.append([sentence, pos_tag, ngram])
connection.close()
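
The decoding in the scan loop can also be written once and reused for any column in the `result` family. The following is a small sketch under that assumption; `decode_row` and `scan_rows` are hypothetical helper names, while `limit` is an existing keyword argument of HappyBase's `Table.scan()`.

```python
def decode_row(row, family=b"result"):
    """Turn {b'result:sentence': b'...'} into {'sentence': '...'}."""
    prefix = family + b":"
    return {
        col[len(prefix):].decode("utf-8"): value.decode("utf-8")
        for col, value in row.items()
        if col.startswith(prefix)
    }


def scan_rows(table, limit=None):
    """Yield (row_key, decoded_columns) pairs from a HappyBase table scan."""
    for key, row in table.scan(limit=limit):
        yield key.decode("utf-8"), decode_row(row)


# Usage with the objects from this notebook:
# preview = list(scan_rows(table, limit=20))
```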

In [14]:
columns = ["sentence","pos-tag","ngram"]
df = spark.createDataFrame(data=list_hbase, schema = columns)
df.printSchema()
df.show(truncate=False)

root
 |-- sentence: string (nullable = true)
 |-- pos-tag: string (nullable = true)
 |-- ngram: string (nullable = true)

+----------------------------------------------+--------------------------------+-----+
|sentence                                      |pos-tag                         |ngram|
+----------------------------------------------+--------------------------------+-----+
|yang terletak di jlmetro                      |PRON, VERB, ADP, NOUN           |4    |
|seven terletak                                |PROPN, X                        |2    |
|iaitu di kawasan sukau bukit                  |CCONJ, ADP, NOUN, NOUN, NOUN    |5    |
|bosniaherzegovina kawasan pergunungan merentas|PROPN, NOUN, NOUN, VERB         |4    |
|kini malaysia juara                           |SCONJ, PROPN, PROPN             |3    |
|maria montez terletak pada kedudukan          |NOUN, PROPN, VERB, ADP, NOUN    |5    |
|terbang mariquita terletak pada               |NOUN, PROPN, VERB, ADP          |4    |


In [15]:
df.count()

10000