## Integrating Spark and Jupyter Notebook

#### One Method

```
PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
```

#### Another Method 

```
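# Open the shell profile and add the export line below to it
nano ~/.bash_profile
export SPARK_HOME="/usr/local/Cellar/apache-spark/2.2.0/libexec/"

# Install findspark so a notebook can locate the Spark installation
pip install findspark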
```

## To Spark or not to Spark, that is the question

In [None]:
import findspark
findspark.init('/usr/local/Cellar/apache-spark/2.2.0/libexec')

In [None]:
import pyspark
sc = pyspark.SparkContext()

In [None]:
raw_hamlet = sc.textFile('hamlet.txt')

In [None]:
raw_hamlet.take(5)

In [None]:
split_hamlet = raw_hamlet.map(lambda line: line.split('\t'))

In [None]:
split_hamlet.take(5)

### Lambda functions are great for writing quick functions we can pass into PySpark methods with simple logic. 

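A function that produces a sequence of values in PySpark (rather than the single Boolean that filter() requires) typically uses a yield statement to specify the values that should be pulled later. Returning a list or another iterable also works, but yield avoids building the whole sequence in memory first.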

### `yield` is a Python technique that allows the interpreter to generate data on the fly and pull it when necessary, instead of storing it to memory immediately.

Finally, not all functions require us to use yield; only the ones that generate a custom sequence of data do. For map() or filter(), we use return to return a value for every single element in the RDD we're running the functions on.
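
To make the difference between yield and return concrete, here is a minimal plain-Python sketch that needs no Spark at all. The function name `first_words` and the sample lines are made up for illustration; the point is only that a generator produces nothing until a value is pulled from it.

```python
def first_words(lines):
    """Generator: yields the first word of each non-empty line, one at a time."""
    for line in lines:
        words = line.split()
        if words:               # empty lines produce no output at all
            yield words[0]

# Hypothetical sample data, not taken from hamlet.txt
sample_lines = ["HAMLET speaks", "", "OPHELIA answers"]

gen = first_words(sample_lines)  # creates the generator; no line has been processed yet
first = next(gen)                # pulls exactly one value
rest = list(gen)                 # pulls whatever values remain

print(first, rest)
```

Note that three lines went in but only two values came out, which is the behavior we rely on when a function is passed to flatMap().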

In [None]:
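def hamlet_speaks(line):
    # The first field of each split line is its unique line ID.
    # (Named line_id to avoid shadowing Python's built-in id().)
    line_id = line[0]

    # Yield a tuple only when "HAMLET" is one of the fields;
    # for every other line the generator yields nothing.
    if "HAMLET" in line:
        yield line_id, "hamlet speaketh!"

hamlet_spoken = split_hamlet.flatMap(hamlet_speaks)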
hamlet_spoken.take(10)

### flatMap() is different from map() because it doesn't require an output for every element in the RDD.

The flatMap() method is useful whenever we want to generate a sequence of values from an RDD.

In this case, we want an RDD object that contains tuples of the unique line IDs and the text "hamlet speaketh!", but only for the elements in the RDD that have "HAMLET" in one of the values. We can't use the map() method for this because it requires a return value for every element in the RDD.
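
The contrast between map() and flatMap() can be sketched in plain Python, without Spark, by treating a list as a stand-in for the RDD. This is only an emulation under stated assumptions: the rows and their ID format (`"id-0"` and so on) are hypothetical, and `itertools.chain.from_iterable` plays the role of the flattening step.

```python
from itertools import chain

def hamlet_speaks(line):
    """Yield (line_id, message) only when "HAMLET" is one of the fields."""
    line_id = line[0]
    if "HAMLET" in line:
        yield line_id, "hamlet speaketh!"

# Hypothetical rows standing in for the elements of split_hamlet
rows = [
    ["id-0", "HAMLET", "To be, or not to be"],
    ["id-1", "OPHELIA", "Good my lord"],
    ["id-2", "HAMLET", "Words, words, words"],
]

# map-like: exactly one output per input, so non-matching rows give an empty list
mapped = [list(hamlet_speaks(row)) for row in rows]

# flatMap-like: the per-row sequences are flattened, so empty ones vanish
flat_mapped = list(chain.from_iterable(hamlet_speaks(row) for row in rows))

print(mapped)
print(flat_mapped)
```

The map-like version keeps a placeholder for the OPHELIA row, while the flatMap-like version contains only the two HAMLET tuples.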

Let's use filter() with a named function to extract the original lines where Hamlet spoke. The functions we pass into filter() must return values, which will be either True or False.

In [None]:
def filter_hamlet_speaks(line):
    return "HAMLET" in line

hamlet_spoken_lines = split_hamlet.filter(filter_hamlet_speaks)
hamlet_spoken_lines.take(5)

### Spark has two kinds of methods, transformations and actions. 

While we've explored some of the transformations, we haven't used any actions other than take().

Whenever we use an action method, Spark forces the evaluation of lazy code. If we only chain together transformation methods and print the resulting RDD object, we'll see the type of RDD (e.g. a PythonRDD or PipelinedRDD object), but not the elements within it. That's because the computation hasn't actually happened yet.

Even though Spark simplifies chaining lots of transformations together, it's good practice to use actions to observe the intermediate RDD objects between those transformations. This will let you know whether your transformations are working the way you expect them to.
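
As a rough analogy (not Spark itself), Python's built-in map() and filter() are also lazy: chaining them does no work until something consumes the result. The sketch below records each call to a hypothetical `split_line` helper so we can see when the computation actually happens; `list()` plays the role of the action.

```python
calls = []  # records every line that split_line actually processes

def split_line(line):
    calls.append(line)
    return line.split("\t")

# Hypothetical tab-separated lines, not taken from hamlet.txt
lines = ["id-0\tHAMLET\tTo be", "id-1\tOPHELIA\tMy lord"]

# Chain two lazy steps, like chaining transformations
pipeline = filter(lambda row: "HAMLET" in row, map(split_line, lines))

calls_before = len(calls)  # nothing has been computed yet
print(pipeline)            # shows a filter object, not the elements

result = list(pipeline)    # like an action: forces the evaluation
calls_after = len(calls)

print(calls_before, calls_after, result)
```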

### count()

The count() method returns the number of elements in an RDD. count() is useful when we want to make sure the result of a transformation contains the right number of elements.

In [None]:
hamlet_spoken_lines.count()

### collect()

We've used take() to preview the first few elements of an RDD, similar to the way we've used head() in pandas. But what about returning all of the elements in a collection? We need to do this to write an RDD to a CSV, for example. It's also useful for running some basic Python code over a collection without going through PySpark.

Running .collect() on an RDD returns a list representation of it.
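
Because the result is an ordinary Python list, the standard library can take over from there. Here is a minimal sketch of the CSV case using the `csv` module; the `collected` rows are hypothetical stand-ins for the output of collect(), and an in-memory buffer is used instead of a real file so the example has no side effects.

```python
import csv
import io

# Hypothetical stand-in for hamlet_spoken_lines.collect()
collected = [
    ["id-0", "HAMLET", "To be"],
    ["id-2", "HAMLET", "Words, words"],
]

buffer = io.StringIO()         # swap for open("some_file.csv", "w", newline="") to write to disk
writer = csv.writer(buffer)
writer.writerows(collected)    # one CSV row per collected element

csv_text = buffer.getvalue()
print(csv_text)
```

Keep in mind that collect() brings every element back to the driver, so this approach only suits results small enough to fit in memory.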

In [None]:
hamlet_spoken_lines.collect()

In [None]:
# collect() already returns a list, so we can index it directly.
# Index 100 is the 101st element.
spoken_101 = hamlet_spoken_lines.collect()[100]
spoken_101