# Programming Spark with Python

Below you'll find links to the PySpark API documentation, followed by some of the common Python idioms, expressions and language features used when programming Spark. We assume you are already comfortable with the basics of Python, so we cover only the intermediate language features used in Spark.

## Spark Programming Guides
* <a href="http://spark.apache.org/docs/1.6.1/programming-guide.html#resilient-distributed-datasets-rdds" target="_blank">RDDs</a>
* <a href="http://spark.apache.org/docs/1.6.1/sql-programming-guide.html" target="_blank">DataFrames</a>
* <a href="http://spark.apache.org/docs/1.6.1/streaming-programming-guide.html" target="_blank">Streaming</a>
* <a href="http://spark.apache.org/docs/1.6.1/ml-guide.html" target="_blank">Machine Learning Pipelines</a>

## PySpark API Docs

* <a href="http://spark.apache.org/docs/1.6.1/api/python/" target="_blank">PySpark API Docs Home</a>
* <a href="http://spark.apache.org/docs/1.6.1/api/python/#core-classes" target="_blank">Core classes in PySpark</a>
* <a href="https://spark.apache.org/docs/1.6.1/api/python/pyspark.html#pyspark.RDD" target="_blank">RDDs</a>
* <a href="http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame" target="_blank">DataFrames</a>
* <a href="http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#module-pyspark.sql.functions" target="_blank">SQL Functions</a> (used on DataFrame columns)
* <a href="http://spark.apache.org/docs/1.6.1/api/python/pyspark.streaming.html" target="_blank">Streaming</a>
* <a href="http://spark.apache.org/docs/1.6.1/api/python/pyspark.ml.html" target="_blank">Machine Learning Pipelines</a>


## Python Language Documentation
* <a href="https://docs.python.org/2/library/functions.html" target="_blank">Python built-in functions</a>
* <a href="https://docs.python.org/2/library/" target="_blank">Python standard library</a> (included with Python itself)
* <a href="https://docs.python.org/2/library/stdtypes.html#string-methods" target="_blank">Python string methods</a> (often useful in ETL)

# Passing Functions

In Python, everything is an object... including functions. You can pass a function as an argument to another function. This is commonly used in big data programming, and Spark is no different.

In [4]:
# Here is some data...
numbers = [2, 8, -1, 3, 4, -12, 7]

# And here are some functions.
def double(number):
  return number * 2
def is_positive(number):
  return number > 0
def add(a, b):
  return a + b

# We can use these with mapping operations.
# Notice we write "double", not "double()" - no parentheses!
# (Wrapping the result in list() makes it print as a list on Python 3 as well,
# where map and filter return lazy iterators.)
print("Doubling the numbers:")
print(list(map(double, numbers)))

# You can, of course, use built-in functions too. For example, abs(), for absolute value.
print("Absolute values:")
print(list(map(abs, numbers)))

# Filter operations require a function that, when called, returns True or False.
print("Positive numbers:")
print(list(filter(is_positive, numbers)))

# Reduce operations take a combination function and a sequence, and return a derived value.
# Unlike map and filter, reduce requires a function taking *two* arguments, not one.
# On Python 3, reduce lives in functools; this import also works on Python 2.6+.
from functools import reduce
print("Sum of numbers:")
print(reduce(add, numbers))
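
To see what "passing a function" means from the receiving side, here is a minimal sketch of a hand-written mapping function. The name `apply_to_each` is our own, chosen only for illustration; it is not part of Python or Spark.

```python
# Hypothetical helper (not a built-in): a simplified, eager version of map.
def apply_to_each(func, items):
    """Call func on every item and collect the results in a new list."""
    results = []
    for item in items:
        # func is just a variable that happens to hold a function object.
        results.append(func(item))
    return results

def double(number):
    return number * 2

numbers = [2, 8, -1, 3, 4, -12, 7]

# We pass the function objects themselves, without calling them.
doubled = apply_to_each(double, numbers)
absolute = apply_to_each(abs, numbers)

print(doubled)   # [4, 16, -2, 6, 8, -24, 14]
print(absolute)  # [2, 8, 1, 3, 4, 12, 7]
```

Inside `apply_to_each`, the parameter `func` is called like any other function; the caller decides which function that is.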

# Lambdas
Sometimes we want to use a function object just once, and would rather not define it somewhere else. For convenience and (sometimes) improved readability, Python lets you create anonymous functions, or *lambdas*.

The syntax looks like:

    lambda n: n + 2

Notice that:

* You start with the keyword `lambda`.
* There is no return statement.
* This entire expression evaluates to a function object. It can be passed to map, filter, etc.
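
To make the last point concrete, here is a minimal sketch. The names `add_two` and `add_two_named` are our own, used only for illustration. Binding a lambda to a name is done here purely to inspect it; in real code you would normally pass the lambda directly, or write a `def`.

```python
# A lambda expression evaluates to a function object, which we can
# store in a variable (illustrative only) and call like any function.
add_two = lambda n: n + 2

# The equivalent named function, with an explicit return statement.
def add_two_named(n):
    return n + 2

print(add_two(5))         # 7
print(add_two_named(5))   # 7
print(callable(add_two))  # True

# A lambda can also be called immediately, without ever being named.
result = (lambda n: n + 2)(10)
print(result)             # 12
```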

Let's see how this is used in code.

In [6]:
# Again, the same numbers...
numbers = [2, 8, -1, 3, 4, -12, 7]

# You use the lambda expression in the same place you would
# normally use a function.
# (list() makes the result print as a list on Python 3 as well.)
print("Doubling the numbers:")
print(list(map(lambda number: number * 2, numbers)))

# Remember, Python lambdas have no return statement. The part to the right
# of the : is automatically returned.
print("Positive numbers:")
print(list(filter(lambda number: number > 0, numbers)))

# Lambdas can take several arguments.
# On Python 3, reduce lives in functools; this import also works on Python 2.6+.
from functools import reduce
print("Sum of numbers:")
print(reduce(lambda a, b: a + b, numbers))

In Python, lambdas are syntactically limited to a single expression; they cannot contain statements such as assignments or loops (in contrast to Scala, for example, which lets lambdas be of any length). The opinion of Python's creators is that large anonymous functions quickly become unreadable. Whether or not you agree, it's worth considering readability when choosing between a lambda and a separate function. For short expressions, lambdas are often at least as readable; for longer, more complex logic, a separate function may be a better choice.
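
As a rough illustration of that trade-off, the sketch below writes the same logic twice. The function `describe` and its labels are hypothetical, invented only for this comparison.

```python
numbers = [2, 8, -1, 3, 4, -12, 7]

# As a lambda, the logic must be squeezed into one expression.
# It works, but it is hard to read.
describe_lambda = lambda n: ("negative" if n < 0 else "zero" if n == 0 else "positive") + " " + ("even" if n % 2 == 0 else "odd")

# As a named function (hypothetical name), each step gets its own line.
def describe(number):
    """Return a short label such as 'positive even' for a number."""
    if number < 0:
        sign = "negative"
    elif number == 0:
        sign = "zero"
    else:
        sign = "positive"
    parity = "even" if number % 2 == 0 else "odd"
    return sign + " " + parity

labels = list(map(describe, numbers))
print(labels[:3])  # ['positive even', 'positive even', 'negative odd']

# Both versions compute the same result.
print(labels == list(map(describe_lambda, numbers)))  # True
```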

# Named Tuples

As you likely know, Python includes a tuple type, which is like a list, but immutable.

    # (name, gpa, major)
    student_info = ("John Doe", 3.8, "chemistry")
    # You can write student_info[0], but not student_info.append(...)

This lets us conveniently work with records as immutable, ordered fields of data. In PySpark programming, we often find it valuable to use an extension called a [namedtuple](https://docs.python.org/2/library/collections.html#collections.namedtuple). It is in the Python `collections` module:

    from collections import namedtuple
    
This works much like a tuple, but also lets us reference the fields by readable names, instead of obscure numeric indices.

In [9]:
# First, import it.
from collections import namedtuple

# To use it, we first create a new namedtuple type, giving it a specific name.
# The first argument is a string, and is normally the same as the name of the variable we assign the type to.
# The second argument is a list of strings, which (in order) are the field names.
Student = namedtuple('Student', ['name', 'gpa', 'major'])

# This lets us create Student objects, with the same syntax as if we had defined a Student class.
student_john = Student("John Doe", 3.8, "chemistry")

# We can reference its fields by name:
print("Using namedtuple fields...")
print("Student name: " + student_john.name)
print("Student GPA: " + str(student_john.gpa))
print("Student major: " + student_john.major)

# Since it's a tuple, we can also reference the fields by index if we need to:
print("\nUsing numeric indices...")
print("Student name: " + student_john[0])
print("Student GPA: " + str(student_john[1]))
print("Student major: " + student_john[2])
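
Because a namedtuple is still a tuple, it is immutable and can be unpacked. The sketch below illustrates both points, and uses the standard `_replace` method to build a modified copy. The variable name `student_john_updated` is our own.

```python
from collections import namedtuple

Student = namedtuple('Student', ['name', 'gpa', 'major'])
student_john = Student("John Doe", 3.8, "chemistry")

# Like any tuple, it can be unpacked into separate variables.
name, gpa, major = student_john
print(name)  # John Doe

# Fields cannot be reassigned in place; this raises AttributeError.
try:
    student_john.gpa = 4.0
except AttributeError:
    print("Cannot modify a namedtuple field in place.")

# _replace returns a new Student with the given fields changed.
student_john_updated = student_john._replace(gpa=3.9)
print(student_john_updated.gpa)  # 3.9
print(student_john.gpa)          # 3.8 (the original is unchanged)
```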

# String Operations

Especially when loading and transforming data, you will often need to munge some text. Python's string type (called `str`) has
<a href="https://docs.python.org/2/library/stdtypes.html#string-methods" target="_blank">many built-in methods</a>. Note that strings are immutable, so none of these methods modifies the original string; each instead creates and returns a new string. Here are some that you may find particularly useful:

In [11]:
poem = "  Beauty is truth; truth, beauty. \t "
underscored = "__Beauty is truth; truth, beauty.__"
advice = "Genius without education is like silver in the mine."

# strip: With no arguments, removes whitespace (spaces, tabs, newlines) from both ends of the string.
print("Stripped:")
print('<' + poem.strip() + '>')

# You can also use .lstrip() and .rstrip() to just strip one side.
print("Left-stripped:")
print('<' + poem.lstrip() + '>')
print("Right-stripped:")
print('<' + poem.rstrip() + '>')

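# Pass in an argument to strip specific characters instead. The argument is treated
# as a set of characters to remove from both ends, not as a prefix or suffix.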
print("Stripped of underscores:")
print(underscored.strip("_"))

# To split a string into a list of words, use .split().
# By default it splits on runs of whitespace.
print("Split:")
print(advice.split())
# Pass in an argument to split on a different separator:
print("Split by semicolon:")
print(poem.split(";"))

# To go the opposite direction, use .join().
pets = ["dog", "cat", "bird", "goat", "llama"]
print("Pets:")
print(", ".join(pets))

# .upper() and .lower() are often useful for normalizing data.
print("whispered advice:")
print(advice.lower())
print("SHOUTED ADVICE:")
print(advice.upper())
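
These methods are often chained together when cleaning raw records. The sketch below combines `split`, `strip` and `lower` with a namedtuple to parse text lines. The semicolon-separated line format, the sample data and the `parse_student` helper are all hypothetical, invented only for illustration.

```python
from collections import namedtuple

Student = namedtuple('Student', ['name', 'gpa', 'major'])

# Hypothetical helper: assumes each line looks like "name;gpa;major".
def parse_student(line):
    """Turn one raw text line into a Student record."""
    # Split on the separator, then trim stray whitespace from each field.
    fields = [field.strip() for field in line.split(";")]
    name, gpa, major = fields
    # Normalize the major to lowercase so "Chemistry" and "CHEMISTRY" match.
    return Student(name, float(gpa), major.lower())

# Made-up sample data with inconsistent spacing and capitalization.
raw_lines = [
    "  John Doe ; 3.8 ; Chemistry \n",
    "Jane Roe;3.9;PHYSICS",
]

students = [parse_student(line) for line in raw_lines]
for student in students:
    print(student)

# String methods return new strings; the raw input is left untouched.
print(raw_lines[0].startswith("  "))  # True
```

A function like `parse_student` is the kind of single-argument function you would pass to a mapping operation.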