## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in an individual cell by starting it with the `%LANGUAGE` magic command (for example, `%sql`). Python, Scala, SQL, and R are all supported.

In [2]:
# File location and type
file_location = "/FileStore/tables/workCountData.txt"
file_type = "txt"
lines = sc.textFile(file_location)

In [3]:
# Split each line into words, then pair every word with an initial count of 1
pairRDD = lines.flatMap(lambda line : line.split(" ")).map(lambda word : (word,1))

In [4]:
pairRDD.collect()

In [5]:
words = ["Hadoop","Spark","Hive","Spark"]  # not named `list`, which would shadow the built-in
rdd = sc.parallelize(words)
pairRDD = rdd.map(lambda word : (word,1))
pairRDD.collect()

In [6]:
pairRDD.reduceByKey(lambda a, b: a+b).collect()
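
To make the semantics of `reduceByKey` in the cell above concrete, here is a minimal plain-Python sketch that runs without a cluster. The helper `reduce_by_key` is hypothetical and only models the result; it is not how Spark implements the operation, which combines values per partition before shuffling.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Hypothetical local model: group values by key, then fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

# Same input as the cell above: every word paired with a count of 1
pairs = [(word, 1) for word in ["Hadoop", "Spark", "Hive", "Spark"]]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)  # {'Hadoop': 1, 'Spark': 2, 'Hive': 1}
```

In Spark the order of the collected pairs is not guaranteed, so compare results as a set or dictionary rather than as an ordered list.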

In [7]:
pairRDD1 = sc.parallelize([('spark',1),('spark',2),('hadoop',3),('hadoop',5)])
pairRDD2 = sc.parallelize([('spark','fast')])
pairRDD1.join(pairRDD2).collect()
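
The `join` in the cell above is an inner join on the key, so keys that appear in only one RDD (here `hadoop`) are dropped. The following plain-Python sketch models that behavior locally; `inner_join` is a hypothetical helper for illustration, not part of the Spark API.

```python
def inner_join(left, right):
    """Hypothetical local model of an inner join over two lists of (key, value) pairs."""
    return [(k1, (v1, v2)) for k1, v1 in left for k2, v2 in right if k1 == k2]

left = [('spark', 1), ('spark', 2), ('hadoop', 3), ('hadoop', 5)]
right = [('spark', 'fast')]
joined = inner_join(left, right)
print(joined)  # [('spark', (1, 'fast')), ('spark', (2, 'fast'))]
```

Each matching pair of values produces one output record, which is why `spark` appears twice in the result.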

In [8]:
rdd = sc.parallelize([("spark",2),("hadoop",6),("hadoop",4),("spark",6)])
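# Pair each value with a count of 1, sum (value, count) per key, then divide to get the per-key mean
rdd.mapValues(lambda x: (x,1)) \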
  .reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1])) \
  .mapValues(lambda x: (x[0] / x[1])).collect()

In [9]:
rdd.mapValues(lambda x: (x,1)).collect()
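
The two cells above compute a per-key average by carrying a `(sum, count)` tuple through the reduction. A minimal plain-Python sketch of the same arithmetic, assuming the data fits in memory, looks like this:

```python
records = [("spark", 2), ("hadoop", 6), ("hadoop", 4), ("spark", 6)]

# Accumulate (running_sum, running_count) per key, mirroring mapValues + reduceByKey
totals = {}
for key, value in records:
    running_sum, running_count = totals.get(key, (0, 0))
    totals[key] = (running_sum + value, running_count + 1)

# Final mapValues step: divide each sum by its count
averages = {key: s / c for key, (s, c) in totals.items()}
print(averages)  # {'spark': 4.0, 'hadoop': 5.0}
```

Carrying the count alongside the sum matters because averages cannot be merged directly; sums and counts can.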

In [10]:
host = 'localhost'
table = 'student'
conf = {
  "hbase.zookeeper.quorum": host, 
  "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
  "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
  "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
  "org.apache.hadoop.hbase.client.Result",
  keyConverter=keyConv,valueConverter=valueConv,conf=conf)
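hbase_rdd.cache()  # mark the RDD for caching before the first action so count() populates the cache
count = hbase_rdd.count()
output = hbase_rdd.collect()  # served from the cache instead of rescanning HBase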
for (k, v) in output:
  print(k, v)

In [11]:
host = 'localhost'
table = 'student'
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
conf = {
  "hbase.zookeeper.quorum": host,
  "hbase.mapred.outputtable": table,
  "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
  "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
  "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
rawData = ['3,info,name,Rongcheng','4,info,name,Guanhua']
sc.parallelize(rawData) \
  .map(lambda x: (x.split(',')[0], x.split(','))) \
  .saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
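
As a local check of the record shape sent to HBase, the sketch below builds the `(row key, field list)` pairs in plain Python. It assumes that each line follows the layout `row key, column family, qualifier, value`, which is how the sample data reads; `to_put_record` is a hypothetical helper name.

```python
raw_data = ['3,info,name,Rongcheng', '4,info,name,Guanhua']

def to_put_record(line):
    """Hypothetical helper: split a CSV line into (row_key, [row_key, family, qualifier, value])."""
    fields = line.split(',')
    return (fields[0], fields)

records = [to_put_record(line) for line in raw_data]
for row_key, fields in records:
    print(row_key, fields)
```

Taking the first field after splitting, rather than the first character of the line, keeps multi-digit row keys such as `12` intact.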