
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Just Enough Python for Databricks SQL

## Learning Objectives
By the end of this lesson, you should be able to:
* Leverage **`if`** / **`else`**
* Describe how errors impact notebook execution
* Write simple tests with **`assert`**
* Use **`try`** / **`except`** to handle errors

## if/else

**`if`** / **`else`** clauses are common in many programming languages.

Note that SQL has the **`CASE WHEN ... ELSE`** construct, which is similar.

<strong>If you're seeking to evaluate conditions within your tables or queries, use **`CASE WHEN`**.</strong>

Python control flow should be reserved for evaluating conditions outside of your query.

More on this later. First, an example with **`"beans"`**.

In [0]:
food = "beans"

Working with **`if`** and **`else`** is all about evaluating whether or not certain conditions are true in your execution environment.

Note that in Python, we have the following comparison operators:

| Syntax | Operation |
| --- | --- |
| **`==`** | equals |
| **`>`** | greater than |
| **`<`** | less than |
| **`>=`** | greater than or equal |
| **`<=`** | less than or equal |
| **`!=`** | not equal |
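As a small illustration of the table above, each comparison is an expression that evaluates to **`True`** or **`False`**. The variable **`x`** and the value **`5`** are assumptions chosen only for this sketch.

```python
x = 5  # hypothetical value, chosen only for illustration

# Each comparison operator produces a boolean result.
comparisons = {
    "x == 5": x == 5,
    "x > 5": x > 5,
    "x < 5": x < 5,
    "x >= 5": x >= 5,
    "x <= 5": x <= 5,
    "x != 5": x != 5,
}

for expression, outcome in comparisons.items():
    print(f"{expression} -> {outcome}")
```

These boolean results are what an **`if`** clause checks when deciding which branch to run.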

If you read the sentence below out loud, you will be describing the control flow of your program.

In [0]:
if food == "beans":
    print(f"I love {food}")
else:
    print(f"I don't eat {food}")

I love beans


As expected, because the variable **`food`** holds the string **`"beans"`**, the condition in the **`if`** statement evaluated to **`True`** and the first print statement ran.

Let's assign a different value to the variable.

In [0]:
food = "beef"

Now the first condition will evaluate as **`False`**. 

What do you think will happen when you run the following cell?

In [0]:
if food == "beans":
    print(f"I love {food}")
else:
    print(f"I don't eat {food}")

I don't eat beef


Note that each time we assign a new value to a variable, the new value completely replaces the old one.

In [0]:
food = "potatoes"
print(food)

potatoes


The Python keyword **`elif`** (short for **`else`** + **`if`**) allows us to evaluate multiple conditions.

Note that conditions are evaluated from top to bottom. Once a condition evaluates to true, no further conditions will be evaluated.

**`if`** / **`else`** control flow patterns:
1. Must contain an **`if`** clause
1. Can contain any number of **`elif`** clauses
1. Can contain at most one **`else`** clause
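Because only the first matching condition runs, the order of the clauses matters when conditions overlap. The sketch below uses a hypothetical function, **`describe_value`**, with made-up thresholds to show this.

```python
def describe_value(value):
    """Return a label for value; conditions are checked from top to bottom."""
    if value > 100:
        return "large"
    elif value > 10:
        # Only reached when value is not greater than 100.
        return "medium"
    elif value > 0:
        return "small"
    else:
        return "zero or negative"

# 500 also satisfies value > 10 and value > 0, but only the first match is used.
print(describe_value(500))
print(describe_value(50))
```

If the **`value > 0`** check were listed first, every positive number would be labeled as small and the later clauses would never run.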

In [0]:
if food == "beans":
    print(f"I love {food}")
elif food == "potatoes":
    print(f"My favorite vegetable is {food}")
elif food != "beef":
    print(f"Do you have any good recipes for {food}?")
else:
    print(f"I don't eat {food}")

My favorite vegetable is potatoes


By encapsulating the above logic in a function, we can reuse this logic and formatting with arbitrary arguments rather than referencing globally-defined variables.

In [0]:
def foods_i_like(food):
    if food == "beans":
        print(f"I love {food}")
    elif food == "potatoes":
        print(f"My favorite vegetable is {food}")
    elif food != "beef":
        print(f"Do you have any good recipes for {food}?")
    else:
        print(f"I don't eat {food}")

Here, we pass the string **`"bread"`** to the function.

In [0]:
foods_i_like("bread")

Do you have any good recipes for bread?


As we evaluate the function, we locally assign the string **`"bread"`** to the **`food`** variable, and the logic behaves as expected.

Note that we don't overwrite the value of the **`food`** variable as previously defined in the notebook.

In [0]:
food

Out[9]: 'potatoes'

## try/except

While **`if`** / **`else`** clauses allow us to define conditional logic by evaluating conditions, **`try`** / **`except`** focuses on providing robust error handling.

Let's begin by considering a simple function.

In [0]:
def three_times(number):
    return number * 3

Let's assume that the desired use of this function is to multiply an integer value by 3.

The below cell demonstrates this behavior.

In [0]:
three_times(2)

Out[11]: 6

Note what happens if a string is passed to the function.

In [0]:
three_times("2")

Out[12]: '222'

In this case, we don't get an error, but we also do not get the desired outcome.

**`assert`** statements allow us to run simple tests of Python code. If the expression following **`assert`** evaluates to **`True`**, nothing happens.

If it evaluates to **`False`**, an **`AssertionError`** is raised.

Run the following cell to assert that the number **`2`** is an integer.

In [0]:
assert type(2) == int

Uncomment the following cell and then run it to assert that the string **`"2"`** is an integer.

It should throw an **`AssertionError`**.

In [0]:
assert type("2") == int

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<command-2841292000076668> in <module>
----> 1 assert type("2") == int

AssertionError: 

As expected, the string **`"2"`** is not an integer.
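An **`assert`** statement can also carry a message after a comma, which is shown when the assertion fails. The following is a minimal sketch; the function name **`check_is_int`** is hypothetical, and **`isinstance()`** is used here as a common alternative to comparing **`type()`** results.

```python
def check_is_int(value):
    # The text after the comma becomes the AssertionError message on failure.
    assert isinstance(value, int), f"Expected an int but got {type(value).__name__}"

check_is_int(2)  # passes silently

# check_is_int("2") would raise:
# AssertionError: Expected an int but got str
```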

Python strings have a method, **`isnumeric()`**, that reports whether every character in the string is numeric, which gives a quick check for simple numeric strings as seen below.

In [0]:
assert "2".isnumeric()
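Note that **`isnumeric()`** checks characters rather than whether a conversion will succeed. The sketch below, using sample strings chosen for illustration, shows a few cases where the two disagree.

```python
# "\u00bd" is the single character for the fraction one half.
samples = ["2", "-2", "2.5", "", "\u00bd"]

flags = [text.isnumeric() for text in samples]

for text, flag in zip(samples, flags):
    print(f"{text!r}: isnumeric() is {flag}")

# int("-2") succeeds even though "-2".isnumeric() is False,
# because the minus sign is not a numeric character.
print(int("-2"))
```

The fraction character reports **`True`** for **`isnumeric()`**, yet **`int()`** raises a **`ValueError`** for it, so attempting the conversion is the more reliable test.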

String numbers are common; you may see them as results from an API query, raw records in a JSON or CSV file, or returned by a SQL query.

**`int()`** and **`float()`** are two common functions for casting values to numeric types.

An **`int`** is always a whole number, while a **`float`** can hold a fractional part (for example, **`2.5`**).

In [0]:
int("2")

Out[17]: 2
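For comparison, here is a short sketch of how **`int()`** and **`float()`** treat a few inputs. The variable names are chosen only for this illustration.

```python
whole = int("2")            # string of digits to int
as_float = float("2")       # the same string as a float
fractional = float("2.5")   # a decimal string to float
truncated = int(2.9)        # float to int drops the fractional part

print(whole, as_float, fractional, truncated)

# int("2.5") would raise a ValueError, because the string is not a whole number.
```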

While Python will gladly cast a string containing numeric characters to a numeric type, it will not allow you to change other strings to numbers.

Uncomment the following cell and give it a try:

In [0]:
int("two")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-2841292000076674> in <module>
----> 1 int("two")

ValueError: invalid literal for int() with base 10: 'two'

Note that errors will stop the execution of a notebook script; all cells after an error will be skipped when a notebook is scheduled as a production job.

If we enclose code that might throw an error in a **`try`** statement, we can define alternate logic when an error is encountered.

Below is a simple function that demonstrates this.

In [0]:
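def try_int(num_string):
    try:
        # int() raises an error if the value cannot be converted
        int(num_string)
        result = f"{num_string} is a number."
    except (ValueError, TypeError):
        # Catch only conversion errors rather than every possible exception
        result = f"{num_string} is not a number!"

    print(result)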

When a numeric string is passed, the function prints a message confirming that it is a number.

In [0]:
try_int("2")

2 is a number.


When a non-numeric string is passed, an informative message is printed out.

**NOTE**: An error is **not** raised, even though an error occurred, and no value was returned. Implementing logic that suppresses errors can lead to logic silently failing.

In [0]:
try_int("two")

two is not a number!


Below, our earlier function is updated to include logic for handling errors to return an informative message.

In [0]:
def three_times(number):
    try:
        # Convert first so that numeric strings such as "2" are handled
        return int(number) * 3
    except ValueError:
        print(f"You passed the string variable '{number}'.\n")
        print("Try passing an integer instead.")
        return None

Now our function can process numbers passed as strings.

In [0]:
three_times("2")

Out[24]: 6

It also prints an informative message when a non-numeric string is passed.

In [0]:
three_times("two")

You passed the string variable 'two'.

Try passing an integer instead.


Note that as implemented, this logic would only be useful for interactive execution: the message isn't currently being logged anywhere, and the code will not return the data in the desired format, so human intervention would be required to act upon the printed message.
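One possible way to address this, sketched under assumptions, is to record the problem with Python's standard **`logging`** module and re-raise the error so that the caller (or a scheduled job) can see the failure. The function name **`three_times_logged`** and the logger name are hypothetical, and where log output appears depends on how your environment is configured.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("three_times_demo")  # hypothetical logger name

def three_times_logged(number):
    """Multiply a number (or numeric string) by 3; log and re-raise on bad input."""
    try:
        return int(number) * 3
    except ValueError:
        # Record what went wrong, then let the error propagate to the caller.
        logger.warning("Could not convert %r to an integer", number)
        raise

print(three_times_logged("2"))
```

Re-raising keeps the failure visible rather than silently returning **`None`**, which matters when no one is watching the printed output.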

## Applying Python Control Flow for SQL Queries

While the above examples demonstrate the basic principles of using these constructs in Python, the goal of this lesson is to learn how to apply these concepts to executing SQL logic on Databricks.

Let's revisit converting a SQL cell to execute in Python.

**NOTE**: The following setup script ensures an isolated execution environment.

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW demo_tmp_vw(id, name, value) AS VALUES
  (1, "Yve", 1.0),
  (2, "Omar", 2.5),
  (3, "Elia", 3.3);

Run the SQL cell below to preview the contents of this temp view.

In [0]:
%sql
SELECT * FROM demo_tmp_vw

id,name,value
1,Yve,1.0
2,Omar,2.5
3,Elia,3.3


Running SQL in a Python cell simply requires passing the string query to **`spark.sql()`**.

In [0]:
query = "SELECT * FROM demo_tmp_vw"
spark.sql(query)

Out[26]: DataFrame[id: int, name: string, value: decimal(2,1)]

But recall that executing a query with **`spark.sql()`** returns the results as a DataFrame rather than displaying them; below, the code is augmented to capture the result and display it.

In [0]:
query = "SELECT * FROM demo_tmp_vw"
result = spark.sql(query)
display(result)

id,name,value
1,Yve,1.0
2,Omar,2.5
3,Elia,3.3


Using a simple **`if`** clause with a function allows us to execute arbitrary SQL queries, optionally displaying the results, and always returning the resultant DataFrame.

In [0]:
def simple_query_function(query, preview=True):
    query_result = spark.sql(query)
    if preview:
        display(query_result)
    return query_result

In [0]:
result = simple_query_function(query)

id,name,value
1,Yve,1.0
2,Omar,2.5
3,Elia,3.3


Below, we execute a different query and set preview to **`False`**, as the purpose of the query is to create a temp view rather than return a preview of data.

In [0]:
new_query = "CREATE OR REPLACE TEMP VIEW id_name_tmp_vw AS SELECT id, name FROM demo_tmp_vw"

simple_query_function(new_query, preview=False)

Out[30]: DataFrame[]

We now have a simple extensible function that could be further parameterized depending on the needs of our organization.

For example, suppose we want to protect our company from malicious SQL, like the query below.

In [0]:
injection_query = "SELECT * FROM demo_tmp_vw; DROP DATABASE prod_db CASCADE; SELECT * FROM demo_tmp_vw"

We can use the **`find()`** method to test for multiple SQL statements by looking for a semicolon.

In [0]:
injection_query.find(";")

Out[32]: 25

If the substring is not found, **`find()`** returns **`-1`**.

In [0]:
injection_query.find("x")

Out[33]: -1
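When only a yes/no answer is needed, Python's **`in`** operator is a common alternative to comparing the result of **`find()`** against **`-1`**. The query string below is a hypothetical example.

```python
sample_query = "SELECT 1; SELECT 2"  # hypothetical query string

# "in" answers whether the character is present at all.
has_semicolon = ";" in sample_query

# find() additionally reports where the first match is located.
position = sample_query.find(";")

print(has_semicolon, position)
```

**`find()`** remains useful when the position of the match is needed, for example in an error message.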

With that knowledge, we can define a simple search for a semicolon in the query string and raise a custom error message if one is found (that is, if the result is not **`-1`**).

In [0]:
def injection_check(query):
    semicolon_index = query.find(";")
    if semicolon_index >= 0:
        raise ValueError(f"Query contains semi-colon at index {semicolon_index}\nBlocking execution to avoid SQL injection attack")

**NOTE**: The example shown here is not sophisticated, but seeks to demonstrate a general principle. 

Always be wary of allowing untrusted users to pass text that will be passed to SQL queries. 

Also note that **`spark.sql()`** executes only one statement per call, so text containing multiple semicolon-separated statements will throw an error.

Uncomment the following cell and give it a try:

In [0]:
injection_check(injection_query)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-2841292000076711> in <module>
----> 1 injection_check(injection_query)

<command-2841292000076708> in injection_check(query)
      2     semicolon_index = query.find(";")
      3     if semicolon_index >= 0:
----> 4         raise ValueError(f"Query contains semi-colon at index {semicolon_index}\nBlocking execution to avoid SQL injection attack")

ValueError: Quer

If we add this function to our earlier query function, we now have a more robust function that will assess each query for potential threats before execution.

In [0]:
def secure_query_function(query, preview=True):
    injection_check(query)
    query_result = spark.sql(query)
    if preview:
        display(query_result)
    return query_result

As expected, we see normal behavior with a safe query.

In [0]:
secure_query_function(query)

id,name,value
1,Yve,1.0
2,Omar,2.5
3,Elia,3.3


Out[38]: DataFrame[id: int, name: string, value: decimal(2,1)]

But execution is prevented when a query containing bad logic is passed.
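The blocked case can be sketched without a Spark session as follows. This is a minimal illustration, not the lesson's own cell: **`guarded_run`** and **`fake_run`** are hypothetical names, **`fake_run`** stands in for **`spark.sql()`**, and **`injection_check`** is repeated so the example runs on its own.

```python
def injection_check(query):
    # Same check as defined earlier in the lesson, repeated so this runs on its own.
    semicolon_index = query.find(";")
    if semicolon_index >= 0:
        raise ValueError(f"Query contains semi-colon at index {semicolon_index}\nBlocking execution to avoid SQL injection attack")

def guarded_run(query, run_query):
    """Call run_query(query) only if injection_check passes; otherwise return None."""
    try:
        injection_check(query)
    except ValueError as error:
        print(f"Query blocked: {error}")
        return None
    return run_query(query)

executed = []  # records which queries were actually run

def fake_run(query):
    # Hypothetical stand-in for spark.sql(); it only records the query text.
    executed.append(query)
    return f"ran: {query}"

guarded_run("SELECT * FROM demo_tmp_vw", fake_run)
guarded_run("SELECT 1; DROP DATABASE prod_db CASCADE", fake_run)
print(executed)
```

Only the first query reaches **`fake_run`**. Catching the error here is a choice made for interactive use; in a scheduled job it may be preferable to let the **`ValueError`** propagate so the run fails visibly.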

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>