<div style="text-align: left; line-height: 0; padding-top: 9px; padding-left:150px">
  <img src="https://static1.squarespace.com/static/5bce4071ab1a620db382773e/t/5d266c78abb6d10001e4013e/1562799225083/appliedazuredatabricks3.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>
# *Applied Azure Databricks*. Presented by <a href="https://www.advancinganalytics.co.uk">Advancing Analytics</a>
Course delivered by Simon Whiteley - <a href="mailto:simon@advancinganalytics.co.uk">simon@advancinganalytics.co.uk</a>

## Widgets
First things first: we need some parameters, so let's create a widget.

In [3]:
dbutils.widgets.removeAll()
dbutils.widgets.text("fileName", "Product", "AdventureWorks Table")

### Read current widget value
We can now use dbutils.widgets to get the current value of our widget parameter

In [5]:
fileName = dbutils.widgets.get("fileName")
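
`dbutils` only exists inside a Databricks notebook, so the cell above fails if the same code is run anywhere else. A minimal sketch of one way around that, assuming a hypothetical helper called `get_widget_or_default` (it is not part of Databricks; only `dbutils.widgets.get` is a real call):

```python
def get_widget_or_default(name, default):
    """Return the widget value in a Databricks notebook, else the default.

    Hypothetical helper for illustration; only dbutils.widgets.get is a real API.
    """
    try:
        # dbutils is injected into the notebook's global scope by Databricks
        return dbutils.widgets.get(name)
    except NameError:
        # Not running in a notebook, so there is no dbutils to ask
        return default


fileName = get_widget_or_default("fileName", "Product")
print(fileName)
```

Inside a notebook this behaves like the plain `dbutils.widgets.get` call; elsewhere it quietly uses the default, which is handy for local testing.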

### Read the schema json for our selected file
I've stored a schema file for each of the data files in my lake. I can pick up the right file for the dataset selected by my widget

In [7]:
#Load the relevant libraries to build schemas and read JSON
from pyspark.sql.types import StructType, StringType
import json

#Inject our filename into the lake path
schemaLocation = f"/mnt/dblake/RAW/Public/Adventureworks/SalesLT.{fileName}.json"

#Read the json file contents (the reader returns one dataframe row per line of text)
jschemadf = spark.read.text(schemaLocation)

#Pull out the first value (the schema is stored on a single line, so the first row holds all of it)
jschema = jschemadf.first().value

#Convert our JSON schema into a pyspark Struct which can be applied directly to a dataframe
newSchema = StructType.fromJson(json.loads(jschema))
newSchema
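
For reference, the JSON that `StructType.fromJson` expects is the same layout Spark produces when a schema is serialised with `df.schema.json()`. The sketch below uses only the standard library to show that layout; the column names are assumptions for illustration and are not taken from the real schema file.

```python
import json

# Hypothetical contents of a schema file such as SalesLT.Product.json;
# the column names below are assumptions, not the real file
jschema = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "ProductID", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "Name", "type": "string", "nullable": True, "metadata": {}},
    ],
})

# json.loads gives back the dictionary form of the schema
schema_dict = json.loads(jschema)
column_names = [field["name"] for field in schema_dict["fields"]]
print(column_names)

# In the notebook, this same dictionary is what gets passed to
# StructType.fromJson(schema_dict) to build the pyspark schema
```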

### We have a schema, now we need to create a dataframe
We can derive the path of our dataset in the same way as we did with the schema. We then combine schema and data location in a new dataframe

We're also going to use "_corrupt_record". This is a system field that is only populated when a row fails to parse into the structure we've provided

In [9]:
#Inject the filename into our lake path for the dataset
dataLocation = f"/mnt/dblake/RAW/Public/Adventureworks/SalesLT.{fileName}/"

#Add some magic - add a new column to our structure for the _corrupt_record system field
newSchema.add("_corrupt_record", StringType(), True)

# Now let's load the data. With badRecordsPath set, Databricks writes rows that fail to parse
# to the _reject folder instead of failing the load, so df holds only the good rows
df = (spark
       .read
       .schema(newSchema)
       .option("badRecordsPath", f"{dataLocation}_reject")
       .csv(dataLocation)
     )
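
The schema, data, reject and output locations all follow the same naming pattern, so they can be derived in one place. A minimal sketch, assuming a hypothetical helper called `lake_paths`; the root folders are the ones used in this notebook, everything else is illustrative:

```python
import posixpath

# Root folders taken from the paths used in this notebook
RAW_ROOT = "/mnt/dblake/RAW/Public/Adventureworks"
BASE_ROOT = "/mnt/dblake/BASE/Public/Adventureworks"


def lake_paths(file_name):
    """Build every lake location the notebook needs for one table (hypothetical helper)."""
    dataset = f"SalesLT.{file_name}"
    data_location = posixpath.join(RAW_ROOT, dataset) + "/"
    return {
        "schema": posixpath.join(RAW_ROOT, f"{dataset}.json"),
        "data": data_location,
        # posixpath.join will not double the slash that data_location already ends with
        "reject": posixpath.join(data_location, "_reject"),
        "output": posixpath.join(BASE_ROOT, dataset) + "/",
    }


paths = lake_paths("Product")
for name, location in paths.items():
    print(name, location)
```

Keeping the pattern in one function means a change to the lake layout only has to be made once.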

### Write the Good rows to a Parquet directory
Parquet is a columnar format that compresses a lot better than CSV, and it has the schema built in. It is a STRUCTURED file type

In [11]:
outputLocation = f"/mnt/dblake/BASE/Public/Adventureworks/SalesLT.{fileName}/"
df.write.mode('overwrite').format("parquet").save(outputLocation)

spark.sql(f"create table if not exists Denmark2020.{fileName} using PARQUET location '{outputLocation}'")

### Return results back to the parent caller
dbutils.notebook.exit() returns a message to whatever called the notebook - this can send a message back to ADF!

In [13]:
#Count every row in the raw CSV, then the rows that parsed into our schema
processedRows = spark.read.csv(dataLocation).count()
goodRows = df.count()
rejectedRows = processedRows - goodRows

dbutils.notebook.exit(json.dumps({"processedRows":processedRows, "goodRows":goodRows, "rejectedRows":rejectedRows, "status":"Succeeded"}))

{"processedRows": 848, "goodRows": 848, "rejectedRows": 0, "status": "Succeeded"}
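
The exit message is just a string, so the caller has to parse it before it can act on the counts. The sketch below shows both sides with the standard library only; `build_exit_payload` is a hypothetical helper, and the row counts are the ones from the output above. In a parent notebook the string would arrive as the return value of `dbutils.notebook.run`, and in ADF it should be available on the Notebook activity's output as `runOutput`.

```python
import json


def build_exit_payload(processed_rows, good_rows, status="Succeeded"):
    """Serialise the row counts in the shape the notebook returns (hypothetical helper)."""
    return json.dumps({
        "processedRows": processed_rows,
        "goodRows": good_rows,
        "rejectedRows": processed_rows - good_rows,
        "status": status,
    })


# Child notebook side: this string is what would be handed to dbutils.notebook.exit()
payload = build_exit_payload(848, 848)

# Parent side: the caller receives the string and turns it back into a dictionary
result = json.loads(payload)
if result["rejectedRows"] > 0:
    print(f"{result['rejectedRows']} rows were rejected")
```

Returning JSON rather than a free-text message lets the caller branch on `status` or `rejectedRows` without any string matching.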