<a href="https://colab.research.google.com/github/ChetanKnowIt/BDT_Notes/blob/main/HBase_Pyspark_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center" > PySpark⚡ HBase🐬 connectivity </h1>

<hr />

## Requirements/Prerequisites: 
### 1. Fully distributed mode installation (multi-node Hadoop environment + HBase), following the [Guru99 installation guide](https://www.guru99.com/hbase-installation-guide.html)
### 2. pip install pyspark 
### 3. pip install happybase

## Flow: 


In [9]:
from IPython.display import Image
Image(url="https://drive.google.com/u/0/uc?id=12AWXzV0JAdP0uZeo7lrdt6_v3Po19bL9")

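### 1. Start by reading a file from HDFS, processing it in pandas, and writing it to HBase
HappyBase talks to HBase through the HBase Thrift server, so that server must be running on the port used here (9090). A single write looks like `table.put('2', {'f1': 'hey'})`; see the [HappyBase user guide](https://happybase.readthedocs.io/en/latest/user.html) for more.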
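The write in this step assumes that `my_table` already exists with a `my_cf` column family. A minimal sketch of creating it from Python is shown here, relying on HappyBase's `Connection.tables()` and `Connection.create_table()`. The helper name `ensure_table` is hypothetical and not part of HappyBase.

```python
def ensure_table(connection, table_name, families):
    """Create `table_name` with the given column families if it is missing.

    `connection` is assumed to be a happybase.Connection (or any object
    exposing the same tables() / create_table() methods).
    Returns True if the table was created, False if it already existed.
    """
    # happybase returns table names as bytes, so normalise before comparing
    existing = {
        name.decode('utf-8') if isinstance(name, bytes) else name
        for name in connection.tables()
    }
    if table_name in existing:
        return False
    # Each family maps to a dict of options; an empty dict keeps HBase defaults
    connection.create_table(table_name, {family: dict() for family in families})
    return True
```

With a live connection this could be called as `ensure_table(connection, 'my_table', ['my_cf', 'results'])`, so that both column families used in this notebook exist before any write.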

In [None]:
import happybase
import pandas as pd

# Load data into a Pandas DataFrame
data = pd.read_csv('my_data.csv')

# Create a connection object to HBase
connection = happybase.Connection('localhost', port=9090)

# Get a handle to an existing table (this call does not create the table in HBase)
table_name = 'my_table'
column_family = 'my_cf'
table = connection.table(table_name)

# Iterate over the rows in the Pandas DataFrame
for _, row in data.iterrows():
    # Extract the row key; convert to str because numeric keys cannot be sent as-is
    row_key = str(row['my_key_column'])
    data_dict = {
        f'{column_family}:{column}': str(row[column])
        for column in data.columns
        if column != 'my_key_column'
    }
    
    # Write the row to HBase
    table.put(row_key, data_dict)

# Release the Thrift connection once all rows are written
connection.close()
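Calling `table.put` inside the loop sends one request per row. As a sketch, the same loop could go through HappyBase's `Table.batch()`, which buffers mutations and sends them in groups. The helper name `write_rows` and the default `batch_size` of 1000 are illustrative assumptions, not values from this notebook.

```python
def write_rows(table, records, key_column, column_family, batch_size=1000):
    """Write an iterable of dict-like records to HBase through a batch.

    `table` is assumed to be a happybase Table.
    Returns the number of rows queued for writing.
    """
    count = 0
    # The batch is flushed every `batch_size` mutations and once more on exit
    with table.batch(batch_size=batch_size) as batch:
        for record in records:
            row_key = str(record[key_column])
            cells = {
                f'{column_family}:{column}': str(value)
                for column, value in record.items()
                if column != key_column
            }
            batch.put(row_key, cells)
            count += 1
    return count
```

With the DataFrame from the cell above, this could be called as `write_rows(table, data.to_dict('records'), 'my_key_column', 'my_cf')`.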


### 2. Then retrieve the values from HBase and load them into a Spark DataFrame, as in this example

In [None]:
import happybase
from pyspark.sql import SparkSession

# Create a HBase connection object
connection = happybase.Connection('localhost', port=9090)

# Create a table object
table = connection.table('my_table')

# Retrieve data from HBase table
data = []
for key, values in table.scan():
    row = {}
    row['key'] = key.decode('utf-8')  # row keys come back as bytes
    for column, value in values.items():
        row[column.decode('utf-8')] = value.decode('utf-8')
    data.append(row)

# Convert HBase data to PySpark DataFrame
spark = SparkSession.builder.appName('HBase to PySpark').getOrCreate()
df = spark.createDataFrame(data)

# Print PySpark DataFrame
df.show()
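The decoding loop in this step can be pulled into a small function so that it is easy to check without a running cluster. This is a sketch that assumes keys and values are UTF-8 text, which holds for data written as strings in step 1; `decode_row` is a hypothetical helper name.

```python
def decode_row(key, values, encoding='utf-8'):
    """Turn one (key, values) pair from table.scan() into a dict of str."""
    row = {'key': key.decode(encoding)}
    for column, value in values.items():
        # Column names arrive as b'family:qualifier'
        row[column.decode(encoding)] = value.decode(encoding)
    return row
```

One way to use it is `data = [decode_row(k, v) for k, v in table.scan(columns=[b'my_cf'])]`. Restricting the scan to the `my_cf` family is a deliberate choice here: cells that hold binary payloads, such as the pickled results written in step 4, are not valid UTF-8 text and would likely fail to decode.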


### 3. Use this Spark DataFrame for model training and evaluation
### 4. Write the results back to HBase, as in this example

In [None]:
import happybase
import pickle
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

# Train a logistic regression model in Spark
spark = SparkSession.builder.appName('Logistic Regression').getOrCreate()
data = spark.read.csv('my_data.csv', header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=['feature1', 'feature2'], outputCol='features')
data = assembler.transform(data)
train, test = data.randomSplit([0.7, 0.3])
# Defaults are featuresCol='features' and labelCol='label', so the CSV needs a 'label' column
lr = LogisticRegression()
model = lr.fit(train)
predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator()
auc = evaluator.evaluate(predictions)

# Serialize the results
results = {'auc': auc}
serialized_results = pickle.dumps(results)

# Create a connection object to HBase
connection = happybase.Connection('localhost', port=9090)

# Get a handle to the table; it must already have a 'results' column family
table_name = 'my_table'
table = connection.table(table_name)

# Write the results to HBase
row_key = 'my_row'
column_family = 'results'
column_qualifier = 'logistic_regression'
table.put(row_key, {f'{column_family}:{column_qualifier}': serialized_results})
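To read the stored metrics back, the pickled bytes have to be unpickled again. The following is a minimal sketch, assuming the row is fetched with HappyBase's `table.row()`, which returns a dict with bytes keys and values; `load_results` is a hypothetical helper name.

```python
import pickle


def load_results(row_data, column=b'results:logistic_regression'):
    """Unpickle the metrics stored in one HBase cell.

    `row_data` is assumed to be the dict returned by happybase's table.row().
    Returns None if the cell is absent.
    """
    payload = row_data.get(column)
    if payload is None:
        return None
    # Only unpickle data you wrote yourself: pickle can run code on load
    return pickle.loads(payload)
```

With a live table, `load_results(table.row(b'my_row'))` would be expected to return the `{'auc': ...}` dictionary written above.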
