# Purpose:

```
   Create a Databricks notebook (with Python as the base language)
   
   Investigate the Databricks DBFS 
   
   Create and read Parquet files, and understand them better 
   
   Create and read JSON files, display etc 
   
   ```

<br>

### Introduction:

#### Spark SQL can both read and write Parquet files, automatically preserving the schema of the original data.

In [6]:
# Parquet is a columnar format that is supported by many other data processing systems. 

# Spark SQL provides support for both reading and writing Parquet files that automatically 
# preserves the schema of the original data. 

# When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

<br>

In [8]:
# let's confirm we have a Spark context created
# I always run these first couple of steps
sc

In [9]:
# list all the methods for the spark context, in case you want to look around
dir(sc)   # then you just enter sc.<the method>

In [10]:
# what is my Apache Spark version?
sc.version

In [11]:
# what is my Python version? (sc.pythonVer reports major.minor only)
sc.pythonVer

In [12]:
# what is my spark context app name ? 
sc.appName

In [13]:
# URL of the Spark UI, where you can inspect jobs, stages, etc.
sc.uiWebUrl  # other attributes can be read the same way

<br>

# Let's look at Parquet Files:

The format specification, as far as I know:
*  https://github.com/apache/parquet-format

In [17]:
# side note:
# upload an image to Databricks DBFS
# it will show up under DBFS /FileStore/tables/<yourfilename>
# from there, IF you want to reference the file in HTML (an image view), use the /files/ prefix:
#  /files/svm.jpg
# but for writing through DBFS, the path would be:
#  /FileStore/svm.jpg
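
The path mapping in the side note can be sketched as a small helper. This is only an illustration: `filestore_to_url` is a hypothetical name, not a Databricks API, and it assumes the convention described in the note, where a file stored under `/FileStore/` is served to the browser under `/files/`.

```python
def filestore_to_url(dbfs_path: str) -> str:
    """Map a DBFS FileStore path to the relative URL used in HTML (hypothetical helper)."""
    # Accept both "dbfs:/FileStore/..." and "/FileStore/..." spellings.
    path = dbfs_path[len("dbfs:"):] if dbfs_path.startswith("dbfs:") else dbfs_path
    prefix = "/FileStore/"
    if not path.startswith(prefix):
        raise ValueError(f"not a FileStore path: {dbfs_path!r}")
    return "/files/" + path[len(prefix):]


print(filestore_to_url("dbfs:/FileStore/tables/parquet.gif"))
```

The printed value is the kind of path you would place in an `<img src=...>` attribute passed to `displayHTML`.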

Parquet is a columnar file format that provides optimizations to speed up queries and is a far more efficient file format than CSV or JSON.

Parquet stores binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression. It is especially good for queries which read particular columns from a “wide” (with many columns) table since only needed columns are read and IO is minimized.
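
To make the row-versus-column distinction concrete, here is a toy sketch in plain Python. It is not how Parquet is implemented (real Parquet adds row groups, pages, encodings, and compression codecs); the sample records and the `run_length_encode` helper are made up for illustration.

```python
from itertools import groupby

# Row-oriented layout: each record is stored together (like CSV or JSON lines).
rows = [
    {"key": "a", "group": "vowels", "value": 1},
    {"key": "b", "group": "consonants", "value": 2},
    {"key": "c", "group": "consonants", "value": 3},
    {"key": "d", "group": "consonants", "value": 4},
    {"key": "e", "group": "vowels", "value": 5},
]

# Column-oriented layout: all values of one column are stored adjacently.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query that needs only "value" touches one list instead of every record.
total = sum(columns["value"])


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


# Adjacent, similar values compress well; here the repeated group names collapse.
encoded = run_length_encode(columns["group"])

print(total)
print(encoded)
```

Reading `columns["value"]` alone is the in-memory analogue of a query that selects one column from a wide table.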

displayHTML("<img src ='/files/tables/parquet.gif'>")

In [20]:
%scala 

// let's use Scala to manually create a tiny dataset and save it as an Apache Parquet file

case class MyCaseClass(key: String, group: String, value: Int, someints: Seq[Int], somemap: Map[String, Int])

// parallelize an array of data, then call the toDF() method

val dataframe = sc.parallelize(Array(MyCaseClass("a", "vowels", 1, Array(1), Map("a" -> 1)),
  MyCaseClass("b", "consonants", 2, Array(2, 2), Map("b" -> 2)),
  MyCaseClass("c", "consonants", 3, Array(3, 3, 3), Map("c" -> 3)),
  MyCaseClass("d", "consonants", 4, Array(4, 4, 4, 4), Map("d" -> 4)),
  MyCaseClass("e", "vowels", 5, Array(5, 5, 5, 5, 5), Map("e" -> 5)))
).toDF()


// now write it to disk * AS parquet * 
// use:  dataframe method .write 
dataframe.write.mode("overwrite").parquet("/tmp/testParquet")  


// your output will include text like:
//  defined class MyCaseClass
//  dataframe: org.apache.spark.sql.DataFrame = [key: string, group: string ... 3 more fields]

In [21]:
%scala
// i want to see my raw data please 
display(dataframe)

key,group,value,someints,somemap
a,vowels,1,List(1),Map(a -> 1)
b,consonants,2,"List(2, 2)",Map(b -> 2)
c,consonants,3,"List(3, 3, 3)",Map(c -> 3)
d,consonants,4,"List(4, 4, 4, 4)",Map(d -> 4)
e,vowels,5,"List(5, 5, 5, 5, 5)",Map(e -> 5)


In [22]:
%scala

// see how i use sqlContext to read the actual file ? 

val data = sqlContext.read.parquet("/tmp/testParquet")
// note:  use sqlContext.read.parquet to access 

// run this in databricks so you can see some of the cool outputs under the View Stage data

display(data)

key,group,value,someints,somemap
b,consonants,2,"List(2, 2)",Map(b -> 2)
c,consonants,3,"List(3, 3, 3)",Map(c -> 3)
d,consonants,4,"List(4, 4, 4, 4)",Map(d -> 4)
e,vowels,5,"List(5, 5, 5, 5, 5)",Map(e -> 5)
a,vowels,1,List(1),Map(a -> 1)


In [23]:
# You have spark context running, but you also have sqlContext as well...
sqlContext  # confirm it is running...

In [24]:
# now lets use Python (not Scala), and read the parquet file you created

data2 = sqlContext.read.parquet("/tmp/testParquet")

display(data2)

# note: outside Databricks, where display() is not available, you can use data2.show() instead

key,group,value,someints,somemap
b,consonants,2,"List(2, 2)",Map(b -> 2)
c,consonants,3,"List(3, 3, 3)",Map(c -> 3)
d,consonants,4,"List(4, 4, 4, 4)",Map(d -> 4)
e,vowels,5,"List(5, 5, 5, 5, 5)",Map(e -> 5)
a,vowels,1,List(1),Map(a -> 1)


In [25]:
data2.sql_ctx

In [26]:
# I want to know how many rows are in my data2 DataFrame:
data2.count()

In [27]:
print(type(data2))  # this is a DataFrame fyi 

In [28]:
# please list out my dataframe columns:
data2.columns

In [29]:
# i wish to output the column of group only:
display(data2.select('group'))

group
consonants
consonants
consonants
vowels
vowels


In [30]:
data2.schema

##### Use SQL commands directly, which I really like:

*Create a temporary table and then we can query it*

In [33]:
%sql 
    CREATE TEMPORARY TABLE scalaTable
    USING parquet
    OPTIONS (
      path "/tmp/testParquet"
    )

In [34]:
%sql 
-- i love how you can use SQL directly in Databricks, its just freaking excellent
SELECT * FROM scalaTable

key,group,value,someints,somemap
b,consonants,2,"List(2, 2)",Map(b -> 2)
c,consonants,3,"List(3, 3, 3)",Map(c -> 3)
d,consonants,4,"List(4, 4, 4, 4)",Map(d -> 4)
e,vowels,5,"List(5, 5, 5, 5, 5)",Map(e -> 5)
a,vowels,1,List(1),Map(a -> 1)


In [35]:
%sql 
-- i want the data record row where the value is '2':
SELECT * FROM scalaTable WHERE value = 2

key,group,value,someints,somemap
b,consonants,2,"List(2, 2)",Map(b -> 2)


<br>
<br>

# Parquet Example #2

In [38]:
# lets look at a much larger file
# File uploaded to /FileStore/tables/userdata1.parquet

A good source of some parquet files:
* https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet

In [40]:
display(dbutils.fs.ls("dbfs:/FileStore/tables"))  # you can see your file there

path,name,size
dbfs:/FileStore/tables/Cbp/,Cbp/,0
dbfs:/FileStore/tables/README.md,README.md,2033
dbfs:/FileStore/tables/beer_data.csv,beer_data.csv,3143
dbfs:/FileStore/tables/boston_train.csv,boston_train.csv,24462
dbfs:/FileStore/tables/json.json,json.json,243
dbfs:/FileStore/tables/parquet.gif,parquet.gif,43589
dbfs:/FileStore/tables/pledge_of_allegiance.txt,pledge_of_allegiance.txt,174
dbfs:/FileStore/tables/userdata1.parquet,userdata1.parquet,113629


In [41]:
dataP = sqlContext.read.parquet("dbfs:/FileStore/tables/userdata1.parquet")

# -> Number of rows in the file: 1000
# -> Column details:
# column#		column_name		hive_datatype
# =====================================================
# 1		registration_dttm 	    timestamp
# 2		id 			            int
# 3		first_name 		        string
# 4		last_name 		        string
# 5		email 			        string
# 6		gender 			        string
# 7		ip_address 		        string
# 8		cc 			            string
# 9		country 		        string
# 10	birthdate 		        string
# 11	salary 			        double
# 12	title 			        string
# 13	comments 		        string



In [42]:
display(dataP)

registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title,comments
2016-02-03T07:55:29.000+0000,1,Amanda,Jordan,ajordan0@com.com,Female,1.197.201.2,6759521864920116.0,Indonesia,3/8/1971,49756.53,Internal Auditor,1E+02
2016-02-03T17:04:03.000+0000,2,Albert,Freeman,afreeman1@is.gd,Male,218.111.175.34,,Canada,1/16/1968,150280.17,Accountant IV,
2016-02-03T01:09:31.000+0000,3,Evelyn,Morgan,emorgan2@altervista.org,Female,7.161.136.94,6767119071901597.0,Russia,2/1/1960,144972.51,Structural Engineer,
2016-02-03T00:36:21.000+0000,4,Denise,Riley,driley3@gmpg.org,Female,140.35.109.83,3576031598965625.0,China,4/8/1997,90263.05,Senior Cost Accountant,
2016-02-03T05:05:31.000+0000,5,Carlos,Burns,cburns4@miitbeian.gov.cn,,169.113.235.40,5602256255204850.0,South Africa,,,,
2016-02-03T07:22:34.000+0000,6,Kathryn,White,kwhite5@google.com,Female,195.131.81.179,3583136326049310.0,Indonesia,2/25/1983,69227.11,Account Executive,
2016-02-03T08:33:08.000+0000,7,Samuel,Holmes,sholmes6@foxnews.com,Male,232.234.81.197,3582641366974690.0,Portugal,12/18/1987,14247.62,Senior Financial Analyst,
2016-02-03T06:47:06.000+0000,8,Harry,Howell,hhowell7@eepurl.com,Male,91.235.51.73,,Bosnia and Herzegovina,3/1/1962,186469.43,Web Developer IV,
2016-02-03T03:52:53.000+0000,9,Jose,Foster,jfoster8@yelp.com,Male,132.31.53.61,,South Korea,3/27/1992,231067.84,Software Test Engineer I,1E+02
2016-02-03T18:29:47.000+0000,10,Emily,Stewart,estewart9@opensource.org,Female,143.28.251.245,3574254110301671.0,Nigeria,1/28/1997,27234.28,Health Coach IV,


In [43]:
# I can poll the schema directly
dataP.schema

In [44]:
# feeling lazy ? 

# upload a parquet file, read it into a DataFrame in Python, and then use
# Databricks' 'Download CSV' export button (the downward arrow at the far right) to
# download and view the CSV form on your laptop...

<br>

# Let's examine JSON

In [47]:
# i manually uploaded this file called json.json
# sidenote: If your cluster is running Databricks Runtime 4.0 and above, you can read JSON 
# files in single-line or multi-line mode. In single-line mode, a file can be split into 
# many parts and read in parallel.

# lets confirm i can see it in my storage: 
display(dbutils.fs.ls("dbfs:/FileStore/tables"))

path,name,size
dbfs:/FileStore/tables/Cbp/,Cbp/,0
dbfs:/FileStore/tables/README.md,README.md,2033
dbfs:/FileStore/tables/beer_data.csv,beer_data.csv,3143
dbfs:/FileStore/tables/boston_train.csv,boston_train.csv,24462
dbfs:/FileStore/tables/json.json,json.json,243
dbfs:/FileStore/tables/pledge_of_allegiance.txt,pledge_of_allegiance.txt,174


In [48]:
# this already exists
spark

In [49]:
# this already exists
sqlContext

In [50]:
# list public attributes and methods, skipping names that start with an underscore

for i in dir(sqlContext):
  if not i.startswith("_"): print(i)

In [51]:
randomDF = sqlContext.read.json("dbfs:/FileStore/tables/json.json")

In [52]:
# this is my json.json file i uploaded:
#  {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
#  {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
#  {"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}


# but the cool part is that Spark infers the schema automatically
randomDF.printSchema()
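
Schema inference over JSON lines can be pictured as taking the union of the keys seen across all records, so a record that lacks `extra_key` ends up with a null in that field of the nested `dict` column. The sketch below mimics only that key-union step in plain Python; it does not reproduce Spark's type inference, and `infer_fields` is a hypothetical helper name.

```python
import json

# The three records from json.json, one JSON object per line.
lines = [
    '{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}',
    '{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}',
    '{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}',
]
records = [json.loads(line) for line in lines]


def infer_fields(dicts):
    """Union of keys across all records, sorted by name (hypothetical helper)."""
    return sorted({key for d in dicts for key in d})


top_level = infer_fields(records)
dict_fields = infer_fields(r["dict"] for r in records)

# Fill in None where a record lacks a field, as Spark fills in null.
dict_rows = [[r["dict"].get(f) for f in dict_fields] for r in records]

print(top_level)
print(dict_fields)
print(dict_rows)
```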


In [53]:
display(randomDF)

array,dict,int,string
"List(1, 2, 3)","List(null, value1)",1,string1
"List(2, 4, 6)","List(null, value2)",2,string2
"List(3, 6, 9)","List(extra_value3, value3)",3,string3


<br>

#### Another JSON example:

In [56]:
import json
import requests

# You’ll need to make an API request to the JSONPlaceholder service, so just use the requests 
# package to do the heavy lifting.
response = requests.get("https://jsonplaceholder.typicode.com/todos")
todos = json.loads(response.text)


In [57]:
response.json()

In [58]:
todos == response.json()

In [59]:
type(todos)

In [60]:
todos[:4]

In [61]:
display(todos)

completed,id,title,userId
False,1,delectus aut autem,1
False,2,quis ut nam facilis et officia qui,1
False,3,fugiat veniam minus,1
True,4,et porro tempora,1
False,5,laboriosam mollitia et enim quasi adipisci quia provident illum,1
False,6,qui ullam ratione quibusdam voluptatem quia omnis,1
False,7,illo expedita consequatur quia in,1
True,8,quo adipisci enim quam ut ab,1
False,9,molestiae perspiciatis ipsa,1
True,10,illo est ratione doloremque quia maiores aut,1
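
The parsed `todos` value is an ordinary Python list of dicts, so standard tools apply to it. A minimal offline sketch, using the first four records shown above as stand-in response text (`response_text`, `todos_sample`, and `completed_per_user` are names made up for this example):

```python
import json
from collections import Counter

# First four records from the output above, as the raw JSON text an API would return.
response_text = """[
  {"userId": 1, "id": 1, "title": "delectus aut autem", "completed": false},
  {"userId": 1, "id": 2, "title": "quis ut nam facilis et officia qui", "completed": false},
  {"userId": 1, "id": 3, "title": "fugiat veniam minus", "completed": false},
  {"userId": 1, "id": 4, "title": "et porro tempora", "completed": true}
]"""

todos_sample = json.loads(response_text)  # a list of dicts, like `todos`

# Count completed items per user.
completed_per_user = Counter(t["userId"] for t in todos_sample if t["completed"])

print(type(todos_sample).__name__, len(todos_sample))
print(dict(completed_per_user))
```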


#### another JSON example

In [63]:
dbutils.fs.put("/tmp/test.json", """
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
""", True)

In [64]:
testJsonData = sqlContext.read.json("/tmp/test.json")

display(testJsonData)

array,dict,int,string
"List(1, 2, 3)","List(null, value1)",1,string1
"List(2, 4, 6)","List(null, value2)",2,string2
"List(3, 6, 9)","List(extra_value3, value3)",3,string3


In [65]:
%sql 
    CREATE TEMPORARY TABLE jsonTable
    USING json
    OPTIONS (
      path "/tmp/test.json"
    )

In [66]:
%sql SELECT * FROM jsonTable


array,dict,int,string
"List(1, 2, 3)","List(null, value1)",1,string1
"List(2, 4, 6)","List(null, value2)",2,string2
"List(3, 6, 9)","List(extra_value3, value3)",3,string3


<br>

# Databricks Delta Examination

>  *lets look at a deeper examination of parquet and delta, and if we can make some improvements*

In [70]:
# read the 2008 airline CSV, treating the first row as a header and letting Spark infer column types

flights = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("/databricks-datasets/asa/airlines/2008.csv")

# the schema is inferred from the data; you can check it with flights.printSchema()

In [71]:
# write a Parquet-based table using this flights data
# I specifically partition by Origin
# this splits the output into one sub-folder per Origin value, each holding its own part files

flights.write.format("parquet").mode("overwrite").partitionBy("Origin").save("/tmp/flights_parquet")

In [72]:
# your output files
display(dbutils.fs.ls("dbfs:/tmp/flights_parquet"))  # you can see your file there

path,name,size
dbfs:/tmp/flights_parquet/Origin=ABE/,Origin=ABE/,0
dbfs:/tmp/flights_parquet/Origin=ABI/,Origin=ABI/,0
dbfs:/tmp/flights_parquet/Origin=ABQ/,Origin=ABQ/,0
dbfs:/tmp/flights_parquet/Origin=ABY/,Origin=ABY/,0
dbfs:/tmp/flights_parquet/Origin=ACK/,Origin=ACK/,0
dbfs:/tmp/flights_parquet/Origin=ACT/,Origin=ACT/,0
dbfs:/tmp/flights_parquet/Origin=ACV/,Origin=ACV/,0
dbfs:/tmp/flights_parquet/Origin=ACY/,Origin=ACY/,0
dbfs:/tmp/flights_parquet/Origin=ADK/,Origin=ADK/,0
dbfs:/tmp/flights_parquet/Origin=ADQ/,Origin=ADQ/,0
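
The `Origin=ABE/` folder names follow the Hive-style `column=value` partition layout. Below is a plain-Python sketch of that naming scheme, with made-up sample rows and hypothetical helper names (`partition_dir`, `parse_partition`); it illustrates the layout only, not how Spark writes the files.

```python
from collections import defaultdict

base_path = "/tmp/flights_parquet"

# Made-up sample rows; only the partition column matters for the layout.
sample_flights = [
    {"Origin": "DFW", "Dest": "ORD", "Month": 2},
    {"Origin": "ABE", "Dest": "ATL", "Month": 1},
    {"Origin": "DFW", "Dest": "IAH", "Month": 1},
]


def partition_dir(base, column, value):
    """Build the directory a row lands in under Hive-style partitioning."""
    return f"{base}/{column}={value}/"


def parse_partition(path):
    """Recover (column, value) from the last 'column=value' path segment."""
    segment = path.rstrip("/").split("/")[-1]
    column, _, value = segment.partition("=")
    return column, value


# Group rows by target directory. The partition value is encoded in the path,
# so the remaining columns are all that need to go into the data files.
layout = defaultdict(list)
for row in sample_flights:
    rest = {k: v for k, v in row.items() if k != "Origin"}
    layout[partition_dir(base_path, "Origin", row["Origin"])].append(rest)

for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

A filter on `Origin` can then be answered by opening only the matching folder, which is the main payoff of partitioning by a column you filter on often.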


In [73]:
# 8 core files per origin, example shown:
display(dbutils.fs.ls("dbfs:/tmp/flights_parquet/Origin=DFW/"))  # you can see your file there

path,name,size
dbfs:/tmp/flights_parquet/Origin=DFW/_SUCCESS,_SUCCESS,0
dbfs:/tmp/flights_parquet/Origin=DFW/_committed_1227084317432497623,_committed_1227084317432497623,832
dbfs:/tmp/flights_parquet/Origin=DFW/_started_1227084317432497623,_started_1227084317432497623,0
dbfs:/tmp/flights_parquet/Origin=DFW/part-00000-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-421-79.c000.snappy.parquet,part-00000-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-421-79.c000.snappy.parquet,399190
dbfs:/tmp/flights_parquet/Origin=DFW/part-00001-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-422-79.c000.snappy.parquet,part-00001-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-422-79.c000.snappy.parquet,724010
dbfs:/tmp/flights_parquet/Origin=DFW/part-00002-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-423-80.c000.snappy.parquet,part-00002-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-423-80.c000.snappy.parquet,403831
dbfs:/tmp/flights_parquet/Origin=DFW/part-00003-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-424-80.c000.snappy.parquet,part-00003-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-424-80.c000.snappy.parquet,772614
dbfs:/tmp/flights_parquet/Origin=DFW/part-00004-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-425-79.c000.snappy.parquet,part-00004-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-425-79.c000.snappy.parquet,410121
dbfs:/tmp/flights_parquet/Origin=DFW/part-00005-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-426-80.c000.snappy.parquet,part-00005-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-426-80.c000.snappy.parquet,688253
dbfs:/tmp/flights_parquet/Origin=DFW/part-00006-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-427-79.c000.snappy.parquet,part-00006-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-427-79.c000.snappy.parquet,411121


In [74]:
# seeing display command in action, side note:
# from pyspark.sql.functions import avg
# diamonds_df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")
# display(diamonds_df.select("color","price").groupBy("color").agg(avg("price")))

In [75]:
# lets look at one single subFile

curiousDF = sqlContext.read.parquet("dbfs:/tmp/flights_parquet/Origin=DFW/part-00000-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-421-79.c000.snappy.parquet")

print("This individual dataset has the following number of rows: ", curiousDF.count())

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,2,9,6,926.0,933,1130.0,1154,OO,5927,N758SK,124.0,141,105.0,-24.0,-7.0,ORD,802,5.0,14.0,0,,0,,,,,
2008,1,28,1,839.0,820,949.0,927,XE,2696,N12530,70.0,67,51.0,22.0,19.0,IAH,224,5.0,14.0,0,,0,2.0,0.0,3.0,0.0,17.0
2008,2,9,6,1958.0,2007,2104.0,2129,OO,6210,N740SK,186.0,202,168.0,-25.0,-9.0,LAX,1235,6.0,12.0,0,,0,,,,,
2008,1,24,4,1935.0,1930,2307.0,2258,XE,3069,N15980,152.0,148,129.0,9.0,5.0,CLE,1021,6.0,17.0,0,,0,,,,,
2008,2,9,6,1228.0,1234,1348.0,1356,OO,6229,N740SK,200.0,202,166.0,-8.0,-6.0,LAX,1235,17.0,17.0,0,,0,,,,,
2008,1,23,3,1301.0,1305,1420.0,1420,XE,2600,N14940,79.0,75,50.0,0.0,-4.0,IAH,224,17.0,12.0,0,,0,,,,,
2008,2,10,7,1315.0,1315,1610.0,1630,OO,1980,N817SK,115.0,135,92.0,-20.0,0.0,ATL,732,9.0,14.0,0,,0,,,,,
2008,1,26,6,1156.0,1200,1303.0,1310,XE,2905,N14938,67.0,70,45.0,-7.0,-4.0,IAH,224,7.0,15.0,0,,0,,,,,
2008,2,10,7,1822.0,1759,2125.0,2115,OO,1998,N812SK,123.0,136,99.0,10.0,23.0,CVG,812,6.0,18.0,0,,0,,,,,
2008,1,9,3,1920.0,1930,2245.0,2258,XE,3069,N15983,145.0,148,125.0,-13.0,-10.0,CLE,1021,3.0,17.0,0,,0,,,,,


In [76]:
#

Once step 1 completes, the "flights" data (details of US flights for one year) has been written out as a Parquet table partitioned by origin.

Next, in step 2, we run a query that returns the top 20 month and origin combinations with the highest total flights on the first day of the week.

In [78]:
# so when you open up the tmp folder, you see the data files grouped into
# one folder per origin airport, such as DFW.

# eight files per origin

# ex: /tmp/flights_parquet/Origin=DFW/part-00000-tid-1227084317432497623-9bdbeb42-5722-4f94-9755-6801a612974e-421-79.c000.snappy.parquet

#### step 2

In [80]:
# lets run a query:

from pyspark.sql.functions import count

# loading the top-level folder reads every Origin partition into one large DataFrame
flights_parquet = spark.read.format("parquet").load("/tmp/flights_parquet")

# keep only DayOfWeek = 1, group by Month and Origin, count the flights, and take the top 20
display(flights_parquet.filter("DayOfWeek=1").groupBy("Month","Origin").agg(count("*").alias("TotalFlights")).orderBy("TotalFlights", ascending=False).limit(20))

# this took 1.72 minutes!
# I ran it again, and then it took 2.46 minutes


Month,Origin,TotalFlights
6,ATL,6046
3,ATL,6019
12,ATL,5800
9,ATL,5722
6,ORD,5241
3,ORD,5072
9,ORD,4931
7,ATL,4894
8,ATL,4821
4,ATL,4798
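
For readers new to the DataFrame API, the query above can be mirrored step by step in plain Python. The sketch below runs on a made-up miniature dataset (not the real flights data) and uses `collections.Counter` as a stand-in for `groupBy(...).agg(count("*"))`.

```python
from collections import Counter

# Made-up miniature of the flights data: (DayOfWeek, Month, Origin).
sample = [
    (1, 6, "ATL"), (1, 6, "ATL"), (1, 6, "ATL"),
    (1, 3, "ORD"), (1, 3, "ORD"),
    (2, 6, "ATL"),  # not day 1, so it is filtered out
    (1, 9, "DFW"),
]

# filter("DayOfWeek=1") -> groupBy("Month", "Origin") -> count("*")
counts = Counter((month, origin) for day, month, origin in sample if day == 1)

# orderBy("TotalFlights", ascending=False) -> limit(20)
top = counts.most_common(20)

for (month, origin), total_flights in top:
    print(month, origin, total_flights)
```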


In [81]:
print(type(flights_parquet))

In [82]:
flights_parquet.count()
# that's over 7 million rows...

In [83]:
# very very good command for seeing the schema you have in place for the DF
#############################
flights_parquet.printSchema()
#############################

#### step 3

Once step 2 completes, you can observe the latency with the standard "flights_parquet" table.

In step 3 and step 4, we do the same with a Databricks Delta table. This time, before running the query, we run the OPTIMIZE command with ZORDER to ensure data is optimized for faster retrieval.
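
Why might clustering by a column speed up a filtered query? One ingredient is data skipping: if each file records the minimum and maximum of a column, files whose range cannot match the filter need not be read. The sketch below is a toy model in plain Python, not Delta's implementation. Plain sorting stands in for Z-ordering as a simplifying assumption, and the data, file size, and helper names are made up for illustration.

```python
import random

random.seed(0)
# Made-up DayOfWeek values (1..7) for 7,000 rows.
days = [random.randint(1, 7) for _ in range(7000)]


def split_into_files(values, rows_per_file=1000):
    """Chunk values into 'files' and record per-file min/max statistics."""
    files = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [{"min": min(f), "max": max(f), "rows": f} for f in files]


def files_to_read(files, wanted):
    """Keep only files whose [min, max] range could contain the wanted value."""
    return [f for f in files if f["min"] <= wanted <= f["max"]]


unclustered = split_into_files(days)        # values arrive in arbitrary order
clustered = split_into_files(sorted(days))  # values clustered by DayOfWeek

print(len(files_to_read(unclustered, 1)), "of", len(unclustered), "files read when unclustered")
print(len(files_to_read(clustered, 1)), "of", len(clustered), "files read when clustered")
```

In the unclustered case nearly every file spans the full 1 to 7 range, so the statistics rule nothing out; after clustering, only the first file or two can contain `DayOfWeek = 1`.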

In [86]:

# Step 3: Write a Databricks Delta based table using flights data
flights.write.format("delta").mode("overwrite").partitionBy("Origin").save("/tmp/flights_delta")


In [87]:

# Step 3 Continued: OPTIMIZE the Databricks Delta table

display(spark.sql("DROP TABLE  IF EXISTS flights"))

display(spark.sql("CREATE TABLE flights USING DELTA LOCATION '/tmp/flights_delta'"))
                  
display(spark.sql("OPTIMIZE flights ZORDER BY (DayofWeek)"))


path
""


In [88]:
# Step 4 : Rerun the query from Step 2 and observe the latency

flights_delta = spark.read.format("delta").load("/tmp/flights_delta")

display(flights_delta.filter("DayOfWeek=1").groupBy("Month","Origin").agg(count("*").alias("TotalFlights")).orderBy("TotalFlights", ascending=False).limit(20))

# its faster now 


Month,Origin,TotalFlights
6,ATL,6046
3,ATL,6019
12,ATL,5800
9,ATL,5722
6,ORD,5241
3,ORD,5072
9,ORD,4931
7,ATL,4894
8,ATL,4821
4,ATL,4798


In [89]:

#  Result:

#  the Parquet query took:  1.72 minutes (and then 2.46 minutes on a second run)
#  the Delta query took  :  about 42.35 seconds
#  roughly 2.4x to 3.5x faster (single runs, so treat the numbers as indicative)
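
The speedup figure is simple arithmetic on the timings quoted above, once the minutes are converted to seconds. A short sketch (the `speedup` helper is just a name chosen for this example):

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the second measurement is than the first."""
    return before_seconds / after_seconds


parquet_runs_s = [1.72 * 60, 2.46 * 60]  # the two Parquet timings, in seconds
delta_run_s = 42.35                       # the Delta timing, in seconds

for t in parquet_runs_s:
    print(f"{t:.1f} s -> {delta_run_s} s: {speedup(t, delta_run_s):.2f}x faster")
```

Because each configuration was timed only once or twice, caching and cluster load could account for part of the difference.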


<br>

## Appendix:  Access DBFS and looking around at functionality

<br>

In [93]:
display(dbutils.fs)

In [94]:

#  write a temp file to DBFS with Python I/O APIs,
#  then print it out
#  all Python based (the notebook's base language is Python)

# write the file
with open("/dbfs/tmp/test_dbfs.txt", 'w') as f:
  f.write("Apache Spark is the bomb\n")
  f.write("Tom Bresee was the bomb\n")
  f.write("Now Databricks is the bomb\n")
  f.write("Goodbye.")

# read the file back, line by line
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
  for line in f_read:
    print(line, end="")  # each line already ends with a newline
    
# df.write.text("/tmp/foo.txt")


In [95]:
# When you’re using Spark APIs, you reference files with "/mnt/training/file.csv" or 
# "dbfs:/mnt/training/file.csv". If you’re using local file APIs, you must provide 
# the path under /dbfs, for example: "/dbfs/mnt/training/file.csv". Do not use 
# the /dbfs prefix when using Spark APIs.
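
The two path conventions differ only in their prefix, so converting between them is a string operation. A minimal sketch with hypothetical helper names (`to_local_path`, `to_spark_path`); these are not part of `dbutils` or Spark.

```python
def to_local_path(spark_path: str) -> str:
    """'dbfs:/mnt/x' or '/mnt/x' (Spark API form) -> '/dbfs/mnt/x' (local file API form)."""
    path = spark_path[len("dbfs:"):] if spark_path.startswith("dbfs:") else spark_path
    return "/dbfs" + path


def to_spark_path(local_path: str) -> str:
    """'/dbfs/mnt/x' (local file API form) -> 'dbfs:/mnt/x' (Spark API form)."""
    if not local_path.startswith("/dbfs/"):
        raise ValueError(f"not a /dbfs path: {local_path!r}")
    return "dbfs:" + local_path[len("/dbfs"):]


print(to_local_path("dbfs:/mnt/training/file.csv"))
print(to_spark_path("/dbfs/tmp/test_dbfs.txt"))
```

For example, a file written with Python's `open("/dbfs/tmp/test_dbfs.txt", "w")` would be addressed as `dbfs:/tmp/test_dbfs.txt` from `dbutils.fs` or a Spark reader.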

In [96]:
%scala
// Now read the file you just created from scala

import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"  // what you just created 

for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}

In [97]:
# create a directory called foobar
dbutils.fs.mkdirs("/foobar/")

# now if you go to your main menu, and then click Import and Explore Data,
# you will see this new folder you just created ! 

In [98]:
# put some text into the file
dbutils.fs.put("/foobar/baz.txt", "Hello, World!")

In [99]:
# read the text you just wrote
dbutils.fs.head("/foobar/baz.txt")

In [100]:
# remove the file you just created
dbutils.fs.rm("/foobar/baz.txt")

In [101]:
# list out all the folders you have in DBFS ! 
# this is really a 'dir' level command ! 

display(dbutils.fs.ls("dbfs:/"))
# see how you can see the folder 'foobar' you created earlier ? 

path,name,size
dbfs:/FileStore/,FileStore/,0
dbfs:/cbp.vds/,cbp.vds/,0
dbfs:/databricks-datasets/,databricks-datasets/,0
dbfs:/databricks-results/,databricks-results/,0
dbfs:/delta/,delta/,0
dbfs:/kdd/,kdd/,0
dbfs:/local_disk0/,local_disk0/,0
dbfs:/ml/,ml/,0
dbfs:/mnt/,mnt/,0
dbfs:/tmp/,tmp/,0


In [102]:
# lets look around
display(dbutils.fs.ls("dbfs:/tmp"))

path,name,size
dbfs:/tmp/hive/,hive/,0
dbfs:/tmp/test.json,test.json,243
dbfs:/tmp/testParquet/,testParquet/,0
dbfs:/tmp/test_dbfs.txt,test_dbfs.txt,84


In [103]:
# lets look around into the Parquet folder you created earlier ! ! ! 

display(dbutils.fs.ls("dbfs:/tmp/testParquet"))

path,name,size
dbfs:/tmp/testParquet/_SUCCESS,_SUCCESS,0
dbfs:/tmp/testParquet/_committed_193233552751229673,_committed_193233552751229673,606
dbfs:/tmp/testParquet/_started_193233552751229673,_started_193233552751229673,0
dbfs:/tmp/testParquet/part-00000-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-0-1-c000.snappy.parquet,part-00000-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-0-1-c000.snappy.parquet,807
dbfs:/tmp/testParquet/part-00001-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-1-1-c000.snappy.parquet,part-00001-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-1-1-c000.snappy.parquet,1606
dbfs:/tmp/testParquet/part-00003-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-3-1-c000.snappy.parquet,part-00003-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-3-1-c000.snappy.parquet,1667
dbfs:/tmp/testParquet/part-00004-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-4-1-c000.snappy.parquet,part-00004-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-4-1-c000.snappy.parquet,1667
dbfs:/tmp/testParquet/part-00006-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-6-1-c000.snappy.parquet,part-00006-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-6-1-c000.snappy.parquet,1667
dbfs:/tmp/testParquet/part-00007-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-7-1-c000.snappy.parquet,part-00007-tid-193233552751229673-96f8d3da-04a1-4c5e-a717-783a105255d0-7-1-c000.snappy.parquet,1631
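
In this listing, the `part-*.snappy.parquet` entries are the data files, while `_SUCCESS`, `_committed_*`, and `_started_*` appear to be job bookkeeping markers rather than data. One way to tell a Parquet file from anything else is its framing: per the Parquet format, a file starts and ends with the 4-byte magic `PAR1`. The sketch below checks for that; `looks_like_parquet` is a hypothetical helper, and the demo uses stand-in files rather than real Parquet data.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path: str) -> bool:
    """True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demo on stand-in files (not real Parquet data, just the magic-byte framing).
with tempfile.TemporaryDirectory() as tmp:
    fake_part = os.path.join(tmp, "part-00000.snappy.parquet")
    marker = os.path.join(tmp, "_SUCCESS")
    with open(fake_part, "wb") as f:
        f.write(PARQUET_MAGIC + b"...payload and footer..." + PARQUET_MAGIC)
    open(marker, "wb").close()
    results = {
        name: looks_like_parquet(os.path.join(tmp, name))
        for name in sorted(os.listdir(tmp))
    }

print(results)
```

On Databricks you could point the helper at a local-file-API path under `/dbfs/tmp/testParquet/` to check one of the part files listed above.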


In [104]:
# want to know something odd?
# you CAN'T put comments into the same cell as a filesystem magic command!
# it will error out, so the magic command has to be in a standalone cell!

In [105]:
%fs rm -r foobar

In [106]:
%fs ls

path,name,size
dbfs:/FileStore/,FileStore/,0
dbfs:/cbp.vds/,cbp.vds/,0
dbfs:/databricks-datasets/,databricks-datasets/,0
dbfs:/databricks-results/,databricks-results/,0
dbfs:/delta/,delta/,0
dbfs:/kdd/,kdd/,0
dbfs:/local_disk0/,local_disk0/,0
dbfs:/ml/,ml/,0
dbfs:/mnt/,mnt/,0
dbfs:/tmp/,tmp/,0


In [107]:

try:
    import jira.client
    JIRA_IMPORTED = True

except ImportError:
    JIRA_IMPORTED = False