**Lecture 4: support material**

The code in the cells below reproduces the examples presented in the Lecture 4 slides. Refer to the slides for comments and further information.

##### Things to consider:
If an error occurs while executing Cell 1 and the error is related to pyspark (e.g., module not found), work through the questions below:

###### Q1: Are you missing pyspark?

$ python <br>
import pyspark <br>
Traceback (most recent call last): <br>
  File "<stdin>", line 1, in <module> <br>
ModuleNotFoundError: No module named 'pyspark' <br>
exit() <br>

Install pyspark (in a terminal on Linux and macOS, or in the Anaconda Prompt on Windows): <br>
$ pip install pyspark
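
The same check can be run from any Python session or notebook cell without triggering a traceback. This is a minimal sketch that assumes only the standard library; the helper name `module_available` is made up for illustration.

```python
import importlib.util


def module_available(name):
    """Return True if the module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if module_available("pyspark"):
    print("pyspark is installed for this interpreter")
else:
    print("pyspark is missing: run 'pip install pyspark'")
```

If a notebook cell and a terminal session give different answers, they are probably using different Python installations.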

Now follow the steps in Q2.

###### Q2: Is your environment set up properly?

Test pyspark in a plain Python session (not Jupyter): <br>
$ python <br>
 from pyspark import SparkContext <br>
 sc = SparkContext.getOrCreate()  <br>

Find pyspark: <br>
$ which pyspark <br>
/Users/taufer/opt/anaconda3/bin/pyspark

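Set your environment by editing your shell profile: <br>
$ emacs .bash_profile <br>

Add these two lines, replacing the path with the directory that contains your own pyspark (as reported by `which pyspark`): <br>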

export PATH="/Users/taufer/opt/anaconda3/bin:$PATH" <br>
export PYTHONPATH="/Users/taufer/opt/anaconda3/bin" <br>

Make sure the new paths are picked up by your shell; if they are not, reload the profile: <br>

$ source .bash_profile <br>
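
Because a terminal and a Jupyter kernel can end up using different Python installations, it can help to print what the current interpreter actually sees. This is a minimal sketch using only the standard library; `describe_environment` is a hypothetical helper name, and the values printed depend entirely on your machine.

```python
import os
import shutil
import sys


def describe_environment():
    """Collect the paths that decide which Python and pyspark get used."""
    return {
        # Interpreter that is running this code
        "python": sys.executable,
        # Same lookup as `which pyspark`; None if pyspark is not on PATH
        "pyspark": shutil.which("pyspark"),
        # None if the variable is not set
        "PYTHONPATH": os.environ.get("PYTHONPATH"),
    }


for name, value in describe_environment().items():
    print(f"{name}: {value}")
```

Running this once in plain Python and once in a notebook cell shows whether both use the same installation.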

###### Q3: Is your Python version too new (e.g., 3.8)?
If your Python version is newer than 3.7, do the following:

Install python 3.7 <br>
Install virtualenv <br>

Create a new virtual environment using python 3.7 <br>
$ virtualenv --python=/path/to/python37 ./spark_env <br>

Activate environment <br>
$ source ./spark_env/bin/activate <br>

Install pyspark <br>
$ pip install pyspark <br>

Install jupyter <br>
$ pip install jupyter <br>

Run jupyter <br>
$ ./spark_env/bin/jupyter notebook <br>

Deactivate environment <br>
$ deactivate <br>
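
To find out whether this question applies, the interpreter version can be checked from Python itself. This is a small sketch; the 3.7 threshold is taken from the text above, and `is_newer_than_37` is an illustrative name.

```python
import sys


def is_newer_than_37(version_info=sys.version_info):
    """Return True if the (major, minor) version is above 3.7."""
    return tuple(version_info[:2]) > (3, 7)


print("Running Python", sys.version.split()[0])
if is_newer_than_37():
    print("Newer than 3.7: consider the virtual environment described above")
else:
    print("Python 3.7 or older: this question does not apply")
```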

###### Q4: Are you getting Java-related errors?
If you have Java errors, make sure you are running Java 8. <br>

Spark 2.x does not support newer Java versions (Spark 3.0 and later also run on Java 11). <br>

$ java -version <br>
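
The Java version can also be checked from Python, which is convenient inside a notebook. The sketch below assumes that `java` is on the PATH and that the version banner contains a quoted version string such as `"1.8.0_281"` or `"11.0.2"` (sample strings for illustration, not output from your machine); `java_major_version` is a hypothetical helper.

```python
import re
import shutil
import subprocess


def java_major_version(banner):
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and older report themselves as 1.x (e.g. 1.8.0_281 is Java 8)
    if first == "1" and second is not None:
        return int(second)
    return int(first)


if shutil.which("java") is None:
    print("java not found on PATH")
else:
    # `java -version` writes its banner to stderr, so read both streams
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("Java major version:", java_major_version(result.stderr + result.stdout))
```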

###### Q5: (Windows) Is Java installed, and you still have a "Java gateway" error?

Restart Anaconda and Jupyter, especially if you installed Java while Anaconda was open.

If you still have the error, check whether the name of your personal user folder in Windows contains a space. If so, you will need to reinstall Anaconda for all users instead of just for you.
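
Whether the user folder contains a space can be checked quickly from Python. This is a minimal sketch; `path_has_space` is an illustrative name.

```python
import os


def path_has_space(path):
    """Return True if `path` contains a space character anywhere."""
    return " " in str(path)


home = os.path.expanduser("~")
if path_has_space(home):
    print(home, "contains a space")
else:
    print(home, "contains no space")
```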

###### Q6: (Windows) Is winutils not found?

This happens because some files are not included in your Spark installation by default. To get them:

Clone this repository: https://github.com/cdarlint/winutils<br>

Open the advanced system settings in Windows. <br>

Click the Environment Variables button, near the bottom of that dialog. <br>

In the user variables section, click New... <br>

Name the variable HADOOP_HOME, click the Browse Directory button, and choose the folder in the cloned repository whose version matches your pyspark. For example, if you have pyspark 3.0.1, select hadoop-3.0.1. Do not select hadoop-x.x.x/bin: HADOOP_HOME must point to the folder that contains bin. <br>

Click OK, then select the user variable called "Path" and choose Edit... <br>

Click New in this new dialog box. Enter %HADOOP_HOME%\ in the new entry, then click OK in each open dialog to close them all. <br>

Restart Anaconda and Jupyter, then run again.
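
After restarting, a quick check from Python can confirm that the variable points where expected. The sketch assumes that `winutils.exe` sits in the `bin` subfolder of the chosen Hadoop folder, as it does in the repository linked above; `find_winutils` is a hypothetical helper.

```python
import os
from pathlib import Path


def find_winutils(hadoop_home):
    """Return the path to bin/winutils.exe under `hadoop_home`, or None if absent."""
    if not hadoop_home:
        return None
    candidate = Path(hadoop_home) / "bin" / "winutils.exe"
    return candidate if candidate.is_file() else None


hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)
print("winutils.exe:", find_winutils(hadoop_home) or "not found")
```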

In [None]:
# import pyspark
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sc.setLogLevel("WARN")
# Load list of words
lines = sc.textFile('FoxInSocks.txt')
# Count the number of items in this RDD
print(lines.count())

In [None]:
# First item in this RDD, i.e. first line of FoxInSocks.txt
print(lines.first())

In [None]:
# sc = SparkContext.getOrCreate()
numbers = sc.parallelize([1, 2, 3, 3], 4)
squared = numbers.map(lambda x: x * x)
squared.collect()

In [None]:
# sc = SparkContext.getOrCreate()

lines = sc.parallelize(["hello world", "hi"])
words = lines.flatMap(lambda line: line.split(" "))
words.collect()

In [None]:
lines = sc.textFile('FoxInSocks.txt')

def hasWhen(line):
    return 'when' in line

whenLines = lines.filter(hasWhen)
whenLines.collect()

In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
lines = sc.textFile('FoxInSocks.txt')

whenLines = lines.filter(lambda line: 'when' in line)
whenLines.collect()

In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

numbers = sc.parallelize([1, 2, 3, 4])
squared = numbers.map(lambda x: x * x).collect()
for num in squared:
    print(num)

In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

lines = sc.parallelize(["hello world", "hi"])
words = lines.flatMap(lambda line: line.split(" "))
words.first() # returns "hello"

In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
lines = sc.textFile('FoxInSocks.txt')

# Key each line by its first word: (first_word, line)
pairs = lines.map(lambda x: (x.split(" ")[0], x))

In [None]:
pairs.collect()

In [None]:
# Keep only the pairs whose line (the value) is shorter than 28 characters
results = pairs.filter(lambda x: len(x[1]) < 28)

In [None]:
results.collect()


In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
lines = sc.textFile('FoxInSocks.txt')

words = lines.flatMap(lambda x: x.split(" "))
pairs = words.map(lambda x: (x, 1))
pairs.collect()

In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

lines = sc.parallelize(["hello world", "hi"])
# map yields one element per input line, so the result is a list of lists
words = lines.map(lambda line: line.split(" "))
words.collect()


In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

lines = sc.parallelize(["hello world", "hi"])
# flatMap flattens the per-line lists into a single RDD of words
words = lines.flatMap(lambda line: line.split(" "))
words.collect()


In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

lines = sc.textFile('FoxInSocks.txt')

words = lines.flatMap(lambda x: x.split(" "))

In [None]:
words.collect()

In [None]:
pairs = words.map(lambda x: (x, 1))


In [None]:
pairs.collect()

In [None]:
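# reduce adds the (word, 1) tuples, and adding tuples concatenates them,
# so this yields one long tuple; per-word counts need reduceByKey instead
results = pairs.reduce(lambda x, y: x + y)
print(results)
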

In [None]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
squared = numbers.map(lambda x: x * x)
squared.collect()
squared.reduce(lambda x, y: x+y)

In [None]:
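# reduceByKey needs (key, value) pairs and returns a new RDD, so key each
# square first (by parity here, an illustrative choice) and collect the result
keyed = squared.map(lambda x: (x % 2, x))
keyed.reduceByKey(lambda x, y: x+y).collect()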

**More:** Add your own examples below.
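
As one possible starting point, the sketch below mimics `map`, `flatMap`, and `reduceByKey` on an ordinary Python list, so the results can be inspected without starting Spark. It is a plain-Python analogy, not the Spark API: the helper names `local_map`, `local_flat_map`, and `local_reduce_by_key` are invented for this example, the input list is a made-up extension of the `["hello world", "hi"]` example above, and nothing here runs in parallel.

```python
from itertools import chain


def local_map(func, items):
    """One output element per input element, like RDD.map."""
    return [func(item) for item in items]


def local_flat_map(func, items):
    """Apply func, then flatten the per-element results, like RDD.flatMap."""
    return list(chain.from_iterable(func(item) for item in items))


def local_reduce_by_key(func, pairs):
    """Combine the values of each key with func, like RDD.reduceByKey."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = func(grouped[key], value) if key in grouped else value
    # Sorted only to make the printed result deterministic
    return sorted(grouped.items())


lines = ["hello world", "hi", "hello"]

# map keeps the nesting: one list of words per line
print(local_map(lambda line: line.split(" "), lines))

# flatMap produces a single flat list of words
words = local_flat_map(lambda line: line.split(" "), lines)
print(words)

# (word, 1) pairs reduced per key give the word counts
pairs = local_map(lambda word: (word, 1), words)
print(local_reduce_by_key(lambda x, y: x + y, pairs))
```

The same three steps on an RDD would use `lines.flatMap(...)`, `words.map(...)`, and `pairs.reduceByKey(...)`, followed by `collect()` to bring the result back to the notebook.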