# ST446 Distributed Computing for Big Data

## Week 2 class: Google Bigtable

### LT 2020

In this exercise, you will work with tables on Google Cloud Bigtable. 
Bigtable is the technology behind many of Google's services, such as Gmail and YouTube (see [presentation](https://www.youtube.com/watch?v=KaRbKdMInuc)).

The tables run on the Google Cloud Platform (GCP), and you will access them through a Python API from this Jupyter notebook, which runs on your computer. This will let you see how some of the system concepts that characterize Bigtable, such as those discussed in the lecture, are implemented in practice.

You will first go through the basic steps: creating a table, creating a column family, writing and updating table cells, reading rows, and deleting a table.

You will then be asked to connect to an existing Bigtable instance and enter information about your laptop system properties in [the second activity](google_bigtable_class_activity_2.ipynb).

### Initial steps

Before running the code below for the first time, complete the following steps locally, that is, on your computer. Once you have installed the Python libraries listed below, you will be able to manipulate Bigtable tables on GCP from this Jupyter notebook.

1. Set up your credentials by running `gcloud auth application-default login`. More details are available at https://developers.google.com/identity/protocols/application-default-credentials

2. Install the Google Bigtable Python modules: use the `pip install` command to install the following modules, if you have not done so already:
   * ```google-cloud-bigtable==0.28.1```
   * ```google-cloud-core==0.28.0```
   * ```oauth2client```
   
You might need to install `cython` and `pyhamcrest` as well.
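
Before moving on, it can help to confirm that the modules are importable from the Python environment that runs this notebook. The following is a minimal sketch, not part of the original exercise: the helper name `missing_modules` is hypothetical, and the import names `google.cloud.bigtable` and `oauth2client` are taken from the import statements used later in this notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (e.g. `google`) is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Import names used by the cells in this notebook.
required = ["google.cloud.bigtable", "oauth2client"]
print("Missing modules:", missing_modules(required))
```

An empty list suggests the installation step succeeded; any name printed here is a candidate for `pip install`.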

 
    
**Notes**: 
* Creating a Bigtable instance: One way to set up a Bigtable instance is by following the instructions here https://cloud.google.com/bigtable/docs/creating-instance. Note that you *do not* need to create an instance yet, because we are going to do it via a Python API below.
* **Computing resources:** As you conduct this exercise, you may create some new Bigtable instances. Please make sure to delete them after you no longer need them. They are expensive, charged per hour, and we don't have an unlimited budget of credits at Google Cloud Platform. In this notebook, this will be done using the API.

## 1. Working with Bigtable via a Python API

### 1.A. Creating a Bigtable instance

We first create an instance in the Python code below. See [instance-api](https://googlecloudplatform.github.io/google-cloud-python/latest/bigtable/instance-api.html) for more details.
If you have trouble importing `bigtable`, this could be due to Python dependencies. Please ask the teaching staff about it (you might also have to run `pip install -U protobuf`).

In [5]:
from oauth2client.client import GoogleCredentials
from google.cloud import bigtable

credentials = GoogleCredentials.get_application_default()
# print(credentials.to_json())

project_id = "st446-lent"

client = bigtable.Client(project=project_id, admin=True)
instances = client.list_instances()

# print out all instances
#for i in instances[0]:
#    print (i.name)



In [6]:
# create an instance (if it is not created yet)
instance_id = "st446-bigtable-instance-milan" # feel free to change this to your own name
display_name = "st446-bigtable-milan"

location_id = "europe-west2-a"

instance = client.instance(instance_id, location_id, display_name=display_name)
instance.create() 

<google.api_core.operation.Operation at 0x1045c9790>

You may check on the Google Cloud Console that the Bigtable instance has been created:

<img src="./bigtable-instance.png" style="width: 1000px;" alt="Bigtable instance">

### 1.B. Create a table

In [7]:
table_id = "master-table"

table = instance.table(table_id)
table.create()

### Create a column family

In [8]:
cf_sysinfo = "sysinfo"

cf1 = table.column_family(cf_sysinfo)
cf1.create()

### 1.C. List tables in a Bigtable instance

The following code shows you how to list existing tables as well as how to list column families of a table. See [table-api](https://googlecloudplatform.github.io/google-cloud-python/latest/bigtable/table-api.html) for more information.

In [9]:
print(instance.list_tables())
print(instance.list_tables()[0].name)

[<google.cloud.bigtable.table.Table object at 0x104627f50>]
projects/st446-lent/instances/st446-bigtable-instance-milan/tables/master-table


### 1.D. List column families of a table

In [10]:
print(table.list_column_families())
print(table.list_column_families().keys())

{'sysinfo': <google.cloud.bigtable.column_family.ColumnFamily object at 0x10479dbd0>}
dict_keys(['sysinfo'])


### 1.E. Writing rows

We now show how to write a row in a table. See [data-api](https://googlecloudplatform.github.io/google-cloud-python/latest/bigtable/data-api.html) for more information. What kind of information are we using here? Can you use the terminal to obtain the same information about your own computer?

In [11]:
row_key = 'yourname' 

# column-value pairs to be added to a row with row_key `yourname`
d = {
    'num_of_cpu'.encode('utf-8'): '4'.encode('utf-8'),
    'num_physical_cpu'.encode('utf-8'): '2'.encode('utf-8'),
    'memory_size'.encode('utf-8'): '17179869184'.encode('utf-8'),
    'os'.encode('utf-8'): 'Mac OS Sierra'.encode('utf-8'),
    'processor_type'.encode('utf-8'): 'x86_64h (Intel x86-64h Haswell)'.encode('utf-8'),
    'cpu_frequency'.encode('utf-8'): '3300000000'.encode('utf-8'),
    'disk_size'.encode('utf-8'): '931GB'.encode('utf-8'),
    'kernel_version'.encode('utf-8'): 'Darwin Kernel Version 16.7.0: Mon Nov 13 21:56:25 PST 2017; root:xnu-3789.72.11~1/RELEASE_X86_64'.encode('utf-8'),
    'free_disk_space'.encode('utf-8'): '679GB'.encode('utf-8')
    }

# note that we use encode('utf-8') to convert strings to bytes
# note also that we could have used integers as values instead

# create a row
row = table.row(row_key)

# add the row information
for col_id, val in d.items():
    row.set_cell(cf_sysinfo, col_id, val)

row.commit()

# different versions of cell values

row.set_cell(cf_sysinfo, 'os'.encode('utf-8'), 'Mac'.encode('utf-8'))
row.set_cell(cf_sysinfo, 'os'.encode('utf-8'), 'Mac os'.encode('utf-8'))

row.commit()

# note that setting row cell to 'Mac' is not committed to the table, because of the subsequent update to 'Mac os'
# see later for a proof of this claim when we read and display the table cell values
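
The values in the dictionary `d` above were typed in by hand. As a minimal sketch, assuming only the Python standard library, some comparable properties can be gathered programmatically and encoded in the same bytes-to-bytes form. The function name `collect_sysinfo` is hypothetical. Memory size, CPU frequency and the number of physical cores are omitted because the standard library does not expose them portably.

```python
import os
import platform
import shutil


def collect_sysinfo(path=None):
    """Gather a few system properties as {column_bytes: value_bytes}."""
    if path is None:
        # Root of the current drive, e.g. "/" on macOS and Linux.
        path = os.path.abspath(os.sep)
    usage = shutil.disk_usage(path)
    info = {
        'num_of_cpu': os.cpu_count(),  # logical CPUs; may be None
        'os': '{} {}'.format(platform.system(), platform.release()),
        'processor_type': platform.machine(),
        'kernel_version': platform.version(),
        'disk_size': usage.total,       # in bytes
        'free_disk_space': usage.free,  # in bytes
    }
    # Store everything as UTF-8 text, matching the dictionary `d` above.
    return {k.encode('utf-8'): str(v).encode('utf-8') for k, v in info.items()}


sysinfo = collect_sysinfo()
for col_id, val in sysinfo.items():
    print('{}: {}'.format(col_id.decode('utf-8'), val.decode('utf-8')))
```

The resulting dictionary has the same shape as `d`, so it could be passed through the same `row.set_cell` loop.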

### 1.F. Reading rows

First, we read cell values for a given row key.

In [12]:
table1 = instance.table(table_id)

key='yourname'
cf='sysinfo'
col='os'.encode('utf-8')

row = table1.read_row(key.encode('utf-8'))

print(row.cells[cf][col])
print()

for cell in row.cells[cf][col]:
    value = cell.value
    ts = cell.timestamp
    print('\t {}: ({}: {}) {}'.format(key, col.decode('utf-8'), value.decode('utf-8'), ts))

[<google.cloud.bigtable.row_data.Cell object at 0x1047a0790>, <google.cloud.bigtable.row_data.Cell object at 0x1047ad3d0>]

	 yourname: (os: Mac os) 2020-01-30 19:39:45.445000+00:00
	 yourname: (os: Mac OS Sierra) 2020-01-30 19:39:45.190000+00:00
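
The output above lists two versions of the `os` cell, newest first, and the intermediate value `Mac` does not appear. The toy in-memory model below illustrates one way such behaviour can arise: versions of a cell are keyed by timestamp, and a write with an existing timestamp replaces the stored value. This is an illustration only, not the Bigtable client or its storage engine. The class `ToyRow` is hypothetical, and the assumption that both writes in the second commit received the same timestamp is ours, not something the output proves.

```python
from datetime import datetime, timezone


class ToyRow:
    """In-memory stand-in for a single multi-version row (illustration only)."""

    def __init__(self):
        # Layout: {family: {column: {timestamp: value}}}
        self._cells = {}

    def set_cell(self, family, column, value, timestamp):
        versions = self._cells.setdefault(family, {}).setdefault(column, {})
        # A write with an existing timestamp replaces that version.
        versions[timestamp] = value

    def read(self, family, column):
        """Return (timestamp, value) pairs, newest version first."""
        versions = self._cells.get(family, {}).get(column, {})
        return sorted(versions.items(), key=lambda kv: kv[0], reverse=True)


row = ToyRow()
t1 = datetime(2020, 1, 30, 19, 39, 45, 190000, tzinfo=timezone.utc)
t2 = datetime(2020, 1, 30, 19, 39, 45, 445000, tzinfo=timezone.utc)

row.set_cell('sysinfo', b'os', b'Mac OS Sierra', t1)  # first commit
row.set_cell('sysinfo', b'os', b'Mac', t2)            # second commit (assumed same timestamp)
row.set_cell('sysinfo', b'os', b'Mac os', t2)         # replaces 'Mac'

for ts, value in row.read('sysinfo', b'os'):
    print('os: {} ({})'.format(value.decode('utf-8'), ts))
```

Under these assumptions the toy model keeps two versions, `Mac os` and `Mac OS Sierra`, in the same order as the real output.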


We next read all rows.

In [13]:
partial_rows = table1.read_rows()
partial_rows.consume_all()

# result will be used later to create a dataframe and show table values by using Python dataframe API
result = {}

col_name = None

for row_key, row in partial_rows.rows.items():

    key = row_key.decode('utf-8')
    cells = row.cells[cf_sysinfo] # get all cells in the same col family

    # get the col names
    if col_name is None:
        col_name = [k.decode('utf-8') for k in cells.keys()]

    # store one row 
    one_row_result = []

    print(cells)
    print()
    
    for col_key, col_val in cells.items():
#        value = int.from_bytes(col_val[0].value, byteorder='big') # if original type of value is an integer
        value = col_val[0].value
        one_row_result.append(value.decode('utf-8'))
        ts = col_val[0].timestamp
        print('{}: {} \t({})'.format(col_key.decode('utf-8'), value.decode('utf-8'), ts))
    result[key] = one_row_result
    

{b'cpu_frequency': [<google.cloud.bigtable.row_data.Cell object at 0x1047ad1d0>], b'disk_size': [<google.cloud.bigtable.row_data.Cell object at 0x1047ad190>], b'free_disk_space': [<google.cloud.bigtable.row_data.Cell object at 0x1047ad110>], b'kernel_version': [<google.cloud.bigtable.row_data.Cell object at 0x1047b9d10>], b'memory_size': [<google.cloud.bigtable.row_data.Cell object at 0x1047b9fd0>], b'num_of_cpu': [<google.cloud.bigtable.row_data.Cell object at 0x1047a0bd0>], b'num_physical_cpu': [<google.cloud.bigtable.row_data.Cell object at 0x104614f10>], b'os': [<google.cloud.bigtable.row_data.Cell object at 0x104614450>, <google.cloud.bigtable.row_data.Cell object at 0x104614250>], b'processor_type': [<google.cloud.bigtable.row_data.Cell object at 0x104603b90>]}

cpu_frequency: 3300000000 	(2020-01-30 19:39:45.190000+00:00)
disk_size: 931GB 	(2020-01-30 19:39:45.190000+00:00)
free_disk_space: 679GB 	(2020-01-30 19:39:45.190000+00:00)
kernel_version: Darwin Kernel Version 16.7.0: M
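
The commented-out `int.from_bytes` line in the cell above hints at the alternative of storing integers as raw bytes rather than as text. A minimal sketch of that round trip follows, assuming an 8-byte big-endian layout. The width and the helper names `encode_int` and `decode_int` are assumptions made for illustration, so check the client documentation for how it actually serialises integer values.

```python
def encode_int(n, width=8):
    """Encode an integer as fixed-width big-endian bytes (assumed layout)."""
    return n.to_bytes(width, byteorder='big', signed=True)


def decode_int(raw):
    """Decode big-endian bytes back into an integer."""
    return int.from_bytes(raw, byteorder='big', signed=True)


memory_size = 17179869184

as_text = str(memory_size).encode('utf-8')  # approach used in this notebook
as_int = encode_int(memory_size)            # binary alternative

print(len(as_text), 'bytes as text,', len(as_int), 'bytes as integer')
print(int(as_text.decode('utf-8')) == decode_int(as_int))
```

Text values are easy to inspect but need parsing before arithmetic, whereas the binary form has a fixed width and must be decoded with the same byte order that was used to write it.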

Display the table cell values:

In [14]:
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict(result, orient='index')
df.columns = col_name
df

| | cpu_frequency | disk_size | free_disk_space | kernel_version | memory_size | num_of_cpu | num_physical_cpu | os | processor_type |
|---|---|---|---|---|---|---|---|---|---|
| yourname | 3300000000 | 931GB | 679GB | Darwin Kernel Version 16.7.0: Mon Nov 13 21:56... | 17179869184 | 4 | 2 | Mac os | x86_64h (Intel x86-64h Haswell) |
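
The read loop above takes the column names from the first row and appends values by position, which works as long as every row has the same columns. When rows may differ, for example when several people write their own rows, a sketch along the following lines keeps each value attached to its column name. The data here is made up: `fake_rows`, `alice` and `bob` are hypothetical stand-ins for what `read_rows()` returns, with each column mapped to a list of values, newest first.

```python
def row_to_record(cells):
    """Map each column name to the decoded value of its newest version.

    `cells` has the form {column_bytes: [newest_value_bytes, older_value_bytes, ...]}.
    """
    return {col.decode('utf-8'): versions[0].decode('utf-8')
            for col, versions in cells.items() if versions}


# Hypothetical rows with different sets of columns.
fake_rows = {
    b'alice': {b'os': [b'Mac os', b'Mac OS Sierra'], b'num_of_cpu': [b'4']},
    b'bob': {b'os': [b'Linux'], b'disk_size': [b'512GB']},
}

records = {key.decode('utf-8'): row_to_record(cells)
           for key, cells in fake_rows.items()}

# Union of all column names, in a stable order.
columns = sorted({col for rec in records.values() for col in rec})

print(columns)
for key, rec in records.items():
    print(key, [rec.get(col) for col in columns])
```

A dictionary of dictionaries like `records` can also be passed to `pd.DataFrame.from_dict(records, orient='index')`, which aligns values by column name and leaves missing entries as `NaN`.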


### 1.G. Delete the table and the Bigtable instance that you created

In [15]:
table.delete()
instance.delete()

## References

* [Google Cloud Bigtable docs](https://cloud.google.com/bigtable/docs/)
* Google Cloud Bigtable docs: [Designing Your Schema](https://cloud.google.com/bigtable/docs/schema-design)
* Google Cloud Bigtable docs: [Managing Tables](https://cloud.google.com/bigtable/docs/managing-tables), how to manage Bigtable using the `cbt` command-line tool
* [Google Cloud Bigtable: Python](http://gcloud-python-bigtable.readthedocs.io/en/data-api-complete/index.html)
* [Google Cloud Bigtable examples](https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/)