Part 0: Spark Installation
--------------------------

Follow these steps for installing PySpark on your laptop.

1. Go to this [link](http://spark.apache.org/downloads.html). 

2. Select `Pre-built for Hadoop 2.4` or earlier under `Choose a
   package type:`. (Note: This is important. Versions after Hadoop 2.4
   have a bug and don't work with Amazon S3.)

3. Download the tar package for `spark-1.4.1-bin-hadoop1.tgz`. If you
   are not sure, pick the latest version.

4. Make sure you are downloading the binary version, not the source
   version.

5. Extract the archive and place it in your home directory.

6. Include the following lines in your `~/.bash_profile` file on Mac
   (without the brackets).

   ```
   export SPARK_HOME=[FULL-PATH-TO-SPARK-FOLDER]
   export PYTHONPATH=[FULL-PATH-TO-SPARK-FOLDER]/python:$PYTHONPATH
   ```

7. Install py4j using `sudo pip install py4j`

8. Open a new terminal window.

9. Start an `ipython` console and type `import pyspark as ps`. If this
   does not raise an error, your installation was successful.

10. Start `ipython notebook` from the new terminal window.

11. If PySpark throws errors about Java, you might need to download the
    newest version of the
    [JDK](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html).
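
The checks in steps 6 to 9 can also be scripted. The following is a minimal sketch, not part of the Spark distribution: the helper name `check_pyspark_setup` is hypothetical, and it only reports whether `SPARK_HOME` is set and whether `pyspark` and `py4j` can be imported.

```python
import os


def check_pyspark_setup():
    """Return a dict describing the local PySpark setup (hypothetical helper)."""
    report = {'SPARK_HOME': os.environ.get('SPARK_HOME')}
    for module in ('pyspark', 'py4j'):
        try:
            __import__(module)
            report[module] = True
        except ImportError:
            report[module] = False
    return report


print(check_pyspark_setup())
```

A value of `None` for `SPARK_HOME`, or `False` for either module, suggests revisiting steps 6 and 7.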
    
    
Part 1: RDD and Spark Basics
----------------------------

Let's get familiar with the basics of Spark (PySpark). We will
be using Spark in local mode.

1. Initiate a `SparkContext`. A `SparkContext` specifies where your
   cluster is, i.e. the resources for all your distributed
   computation. Specify your `SparkContext` as follows.
   
   ```python
   import pyspark as ps
   # 'local[4]' runs Spark locally with 4 worker threads
   sc = ps.SparkContext('local[4]') 
   ```

In [1]:
import pyspark as ps

In [2]:
sc = ps.SparkContext('local[4]')

2. Spark keeps your data in **Resilient Distributed Datasets (RDDs)**.
   **An RDD is a collection of data partitioned across machines**.
   Each group of records that is processed by a single thread (*task*) on a
   single machine is called a *partition*.

   Using RDDs Spark can process your data in parallel across
   the cluster. 
   
   You can create an RDD from a list, from a file or from an existing
   RDD.
   
   Let's create an RDD from a Python list.
   
   ```python
   list_rdd = sc.parallelize([1, 2, 3])
   ```
   
   Read an RDD in from a text file. **By default, the RDD will treat
   each line as an item and read it in as a string.**
   
   ```python
   file_rdd = sc.textFile('data/toy_data.txt')
   ```

In [3]:
list_rdd = sc.parallelize([1, 2, 3])

In [4]:
file_rdd = sc.textFile('/Users/Alexander/Documents/6007_Data_Engineering/Data/toy_data.txt')
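
To make the idea of a partition concrete, here is a pure-Python illustration that needs no Spark installation. The helper `split_into_partitions` is hypothetical and only shows one way a list could be divided into chunks; Spark's own slicing logic may differ. On a real RDD, `rdd.getNumPartitions()` and `rdd.glom().collect()` let you inspect the actual layout.

```python
def split_into_partitions(items, num_partitions):
    """Split `items` into `num_partitions` contiguous, roughly equal chunks."""
    n = len(items)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


partitions = split_into_partitions(list(range(10)), 4)

# Each chunk could be handed to a separate task and processed independently,
# with the partial results combined at the end.
partial_sums = [sum(chunk) for chunk in partitions]

print(partitions)
print(sum(partial_sums))
```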

3. RDDs are lazy, so they do not load data from disk until it is
   needed. Each RDD knows what it has to do when it is asked to
   produce data. It also holds a pointer to its parent RDD,
   to a file, or to an in-memory list.

   When you use `take()` or `first()` to inspect an RDD, Spark does not
   load the entire file. It loads only the partitions it needs to
   produce the results.
 
   ```python
   file_rdd.first() # Views the first entry
   file_rdd.take(2) # Views the first two entries
   ```
  

In [5]:
file_rdd.first() # Views the first entry

u'{"Jane": "2"}'

In [6]:
file_rdd.take(2) # Views the first two entries

[u'{"Jane": "2"}', u'{"Jane": "1"}']
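
Python generators behave in a similar way and make a convenient analogy for lazy evaluation. The sketch below is not Spark code: the `read_lines` generator and the `lines_read` list are hypothetical and exist only to show that taking two items consumes two lines of input rather than the whole source.

```python
from itertools import islice

lines_read = []  # records every line that is actually produced


def read_lines(source):
    """Lazily yield lines, noting each one that is really read."""
    for line in source:
        lines_read.append(line)
        yield line


source = ['{"Jane": "2"}', '{"Jane": "1"}', '{"Pete": "20"}', '{"Tyler": "3"}']

lazy = (line.upper() for line in read_lines(source))  # nothing is read yet
first_two = list(islice(lazy, 2))                     # pulls only two lines

print(first_two)
print(len(lines_read))
```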

  
4. To send all the data from the partitions back to the driver, use
   `collect()`. However, if your dataset is large, this can exhaust the
   driver's memory and crash it. Only do this when you are developing
   with a small test dataset.
   
   ```python
   file_rdd.collect()
   list_rdd.collect()
   ```

In [7]:
file_rdd.collect()

[u'{"Jane": "2"}',
 u'{"Jane": "1"}',
 u'{"Pete": "20"}',
 u'{"Tyler": "3"}',
 u'{"Duncan": "4"}',
 u'{"Yuki": "5"}',
 u'{"Duncan": "6"}',
 u'{"Duncan": "4"}',
 u'{"Duncan": "5"}']

In [8]:
list_rdd.collect()

[1, 2, 3]
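
When the dataset may be large, a bounded preview is a safer way to inspect it than `collect()`. The helper below is a sketch: `preview` is a hypothetical name, and it relies only on the `count()` and `take()` methods of an RDD.

```python
def preview(rdd, n=5):
    """Return the record count and at most `n` records from `rdd`.

    Only `n` records are sent back to the driver, unlike `collect()`.
    """
    return {'count': rdd.count(), 'sample': rdd.take(n)}


# Usage with the RDDs defined above (requires a running SparkContext):
# preview(file_rdd, n=2)
```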

Part 2: Transformations and Actions
-----------------------------------

Use
<http://real-chart.finance.yahoo.com/table.csv?s=AAPL&g=d&ignore=.csv>
to download the most recent stock prices of AAPL, and save it to
`aapl.csv`.

        import urllib2
        url = 'http://real-chart.finance.yahoo.com/table.csv?s=AAPL&g=d&ignore=.csv'
        csv = urllib2.urlopen(url).read()
        with open('aapl.csv','w') as f: f.write(csv)        

In [10]:
import urllib2
url = 'http://real-chart.finance.yahoo.com/table.csv?s=AAPL&g=d&ignore=.csv'
csv = urllib2.urlopen(url).read()
with open('aapl.csv','w') as f: f.write(csv)

The data is in CSV format and has these columns.

Date        |Open    |High    |Low     |Close   |Volume      |Adj Close
----        |----    |----    |---     |-----   |------      |---------
11-18-2014  |113.94  |115.69  |113.89  |115.47  |44,200,300  |115.47
11-17-2014  |114.27  |117.28  |113.30  |113.99  |46,746,700  |113.99

In [27]:
client = sc.textFile('aapl.csv')

In [28]:
client.top(3)

[u'Date,Open,High,Low,Close,Volume,Adj Close',
 u'2015-09-10,110.269997,113.279999,109.900002,112.57,62675200,112.57',
 u'2015-09-09,113.760002,114.019997,109.769997,110.150002,84344400,110.150002']
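
The answers below unpack each row inside a `lambda`, which relies on Python 2 tuple-parameter syntax. A named parser is an alternative that also runs on Python 3. This is a sketch: `StockRow` and `parse_line` are hypothetical names, and the column order is taken from the header line shown above.

```python
from collections import namedtuple

StockRow = namedtuple(
    'StockRow', ['date', 'open', 'high', 'low', 'close', 'volume', 'adj_close'])


def parse_line(line):
    """Parse one CSV data line into a StockRow with numeric fields."""
    date, open_, high, low, close, volume, adj_close = line.split(',')
    return StockRow(date, float(open_), float(high), float(low),
                    float(close), int(volume), float(adj_close))


row = parse_line(
    '2015-09-10,110.269997,113.279999,109.900002,112.57,62675200,112.57')
print('{} {}'.format(row.date, row.close))

# With Spark, the same function can be passed to map(), for example:
# client.filter(lambda line: not line.startswith('Date')).map(parse_line)
```

Named fields such as `row.adj_close` also make it harder to mix up the `Close` and `Adj Close` columns.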

Q: How many records are there in this CSV?

In [29]:
print "Total number of records {}".format(client\
                                 .filter(lambda line: not line.startswith("Date"))\
                                 .count() ) 

Total number of records 8762


Q: Find the average *adjusted close* price of the stock. Also find the
min, max, variance, and standard deviation.

### Question about saving objects
    How does saving objects affect memory?
    Would it be better to minimize saved objects for speed?
    What's the best practice?

In [88]:
adj_close = client\
      .filter(lambda line: not line.startswith("Date"))\
      .map(lambda x: x.split(","))\
      .map(lambda (Date,Open,High,Low,Close,Volume,AdjClose): float(AdjClose))

print "average  ACP {}".format(adj_close.mean())
print "min      ACP {}".format(adj_close.min())
print "max      ACP {}".format(adj_close.max())
print "stdev    ACP {}".format(adj_close.stdev())
print "variance ACP {}".format(adj_close.variance())

average  ACP 14.3410010144
min      ACP 0.167662
max      ACP 131.942761
stdev    ACP 27.6474014678
variance ACP 764.378807924
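
On the question of saving objects: each action (`mean()`, `min()`, and so on) re-runs the whole chain of transformations unless the RDD has been persisted, for example with `cache()`. One way to avoid repeated passes is `stats()`, which returns a single summary object. The sketch below assumes an RDD of floats such as `adj_close`; the helper name `summarize` is hypothetical.

```python
def summarize(rdd):
    """Compute several statistics of a numeric RDD with one stats() call."""
    stats = rdd.stats()  # one job instead of one job per statistic
    return {
        'mean': stats.mean(),
        'min': stats.min(),
        'max': stats.max(),
        'stdev': stats.stdev(),
        'variance': stats.variance(),
    }


# Usage (requires a running SparkContext):
# adj_close.cache()            # keep the parsed values in memory for reuse
# print(summarize(adj_close))
```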


Q: Find the dates of the 3 highest adjusted close prices.

In [117]:
client.filter(lambda line: not line.startswith("Date"))\
      .map(lambda x: x.split(","))\
      .map(lambda (Date,Open,High,Low,Close,Volume,AdjClose): (Date,float(AdjClose)))\
      .sortBy(lambda (Date,AdjClose):AdjClose,ascending=False)\
      .take(3)

[(u'2015-05-22', 131.942761),
 (u'2015-02-23', 131.849954),
 (u'2015-04-27', 131.502974)]

Q: Find the dates of the 3 lowest adjusted close prices.

In [118]:
client.filter(lambda line: not line.startswith("Date"))\
      .map(lambda x: x.split(","))\
      .map(lambda (Date,Open,High,Low,Close,Volume,AdjClose): (Date,float(AdjClose)))\
      .sortBy(lambda (Date,AdjClose):AdjClose,ascending=True)\
      .take(3)

[(u'1982-07-08', 0.167662),
 (u'1982-07-09', 0.173378),
 (u'1982-07-07', 0.175283)]
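
A full sort does more work than needed when only three rows are wanted. The pure-Python sketch below uses `heapq` on the six values shown above to illustrate selecting the top and bottom N by a key. On an RDD, `top(3, key=...)` and `takeOrdered(3, key=...)` play a similar role. The names `pairs`, `by_price`, and `pairs_rdd` are illustrative.

```python
import heapq

# (date, adjusted close) pairs copied from the results above
pairs = [
    ('2015-05-22', 131.942761),
    ('1982-07-08', 0.167662),
    ('2015-02-23', 131.849954),
    ('1982-07-09', 0.173378),
    ('2015-04-27', 131.502974),
    ('1982-07-07', 0.175283),
]


def by_price(pair):
    """Key function: order pairs by adjusted close."""
    return pair[1]


highest = heapq.nlargest(3, pairs, key=by_price)
lowest = heapq.nsmallest(3, pairs, key=by_price)
print(highest)
print(lowest)

# Spark counterparts (require a running SparkContext and an RDD of pairs):
# pairs_rdd.top(3, key=by_price)
# pairs_rdd.takeOrdered(3, key=by_price)
```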

Q: Find the number of days on which the stock price fell, i.e. the
close price was lower than the open.

In [129]:
client.filter(lambda line: not line.startswith("Date"))\
      .map(lambda x: x.split(","))\
      .map(lambda (Date,Open,High,Low,Close,Volume,AdjClose): (float(Open),float(AdjClose)))\
      .filter(lambda (Open,AdjClose): Open > AdjClose )\
      .count()

8699

Q: Find the number of days on which the stock price rose.

In [130]:
client.filter(lambda line: not line.startswith("Date"))\
      .map(lambda x: x.split(","))\
      .map(lambda (Date,Open,High,Low,Close,Volume,AdjClose): (float(Open),float(AdjClose)))\
      .filter(lambda (Open,AdjClose): Open < AdjClose )\
      .count()

63

Q: Find the number of days on which the stock price neither fell nor
  rose.

#### The answer directly depends on the number of significant figures!

In [137]:
client.filter(lambda line: not line.startswith("Date"))\
      .map(lambda x: x.split(","))\
      .map(lambda (Date,Open,High,Low,Close,Volume,AdjClose): (float(Open),float(AdjClose)))\
      .filter(lambda (Open,AdjClose): int(Open) == int(AdjClose) )\
      .count()

49
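
The questions compare the open with the close, while the cells above pass `AdjClose` to the comparison, so the counts may reflect split and dividend adjustments rather than intraday moves. The sketch below shows one way to classify a day from the raw `Open` and `Close` values, with an optional tolerance for the "neither" case. `classify_day`, `tol`, and `open_close_rdd` are hypothetical names; the first two sample rows come from the preview of `aapl.csv`, and the third is made up.

```python
from collections import Counter


def classify_day(open_price, close_price, tol=0.0):
    """Label a day 'rose', 'fell', or 'flat' from its open and close."""
    change = close_price - open_price
    if abs(change) <= tol:
        return 'flat'
    return 'rose' if change > 0 else 'fell'


days = [
    (110.269997, 112.57),      # 2015-09-10 (Open, Close) from the preview
    (113.760002, 110.150002),  # 2015-09-09 (Open, Close) from the preview
    (100.0, 100.0),            # made-up row to show the 'flat' case
]
counts = Counter(classify_day(o, c) for o, c in days)
print('{} {} {}'.format(counts['rose'], counts['fell'], counts['flat']))

# With Spark, one pass over a hypothetical RDD of (open, close) pairs
# can produce all three counts:
# open_close_rdd.map(lambda oc: classify_day(oc[0], oc[1])).countByValue()
```

Because every day gets exactly one label, the three counts add up to the number of records, which is a quick consistency check.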

#### Find out why we need to take log differences

Q: To find out how much the stock price changed on a particular day,
convert the close and the open prices to natural log values using
`math.log()` and then take the difference between the close and the
open. This gives you the log change in the price. Find the 3 days on
which the price increased the most.

In [144]:
from math import log

In [148]:
client.filter(lambda line: not line.startswith("Date"))\
      .map(lambda x: x.split(","))\
      .map(lambda (Date,Open,High,Low,Close,Volume,AdjClose): (float(Open),float(AdjClose)))\
      .map(lambda (Open,Close): log(Close) - log(Open))\
      .sortBy(lambda diff :diff,ascending=False)\
      .take(3)

[0.08338582269624606, 0.02700694933363046, 0.023988658595421875]

Q: The log change price lets you calculate the average change by
taking the average of the log changes. Calculate the average change in
log price over the entire range of prices.

In [151]:
client.filter(lambda line: not line.startswith("Date"))\
      .map(lambda x: x.split(","))\
      .map(lambda (Date,Open,High,Low,Close,Volume,AdjClose): (float(Open),float(AdjClose)))\
      .map(lambda (Open,Close): log(Close) - log(Open))\
      .mean()

-2.972334830894987
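
One reason for working with log differences is that they add up: over a chain of consecutive prices, the sum of the log changes equals the log change from the first price to the last, which is not true of simple percentage changes. The sketch below checks this on a made-up price series; the numbers are illustrative and do not come from `aapl.csv`.

```python
from math import exp, log

prices = [100.0, 110.0, 99.0, 120.0]  # made-up series

# Log change between each pair of consecutive prices
log_changes = [log(b) - log(a) for a, b in zip(prices, prices[1:])]

total = sum(log_changes)

# The summed log changes match the log of the overall price ratio
print(abs(total - log(prices[-1] / prices[0])) < 1e-12)

# Converting the total back gives the overall percentage change
print(round(exp(total) - 1, 4))
```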

Part 3: Extra Credit
--------------------

Q: Write a function that, given a date string, returns the weekday.
Here is code that calculates the weekday for 2015/05/05. This returns
an integer. `0` is Monday, `1` is Tuesday, etc.
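
A minimal sketch of such a calculation using the standard `datetime` module follows. The function name `weekday_of` and its `fmt` parameter are assumptions, not the original course code.

```python
from datetime import datetime


def weekday_of(date_string, fmt='%Y/%m/%d'):
    """Return the weekday of `date_string` as an int (0 = Monday)."""
    return datetime.strptime(date_string, fmt).weekday()


print(weekday_of('2015/05/05'))              # 2015/05/05 was a Tuesday
print(weekday_of('2015-09-10', '%Y-%m-%d'))  # date format used in aapl.csv
```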

Q: Using this function, calculate the weekday for all the stock prices,
and the log change in the price on that day. Convert the log change
back to percentage change. To convert a log change to a percentage,
use `percent_change = math.exp(log_change) - 1`.

Q: Does the price change more on some days and less on others?
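
One way to approach the last two questions is to group the log changes by weekday, average them, and convert each average back to a percentage. The sketch below does this in plain Python on the two data rows previewed in Part 2; with Spark, the grouping could instead be done on an RDD of `(weekday, log_change)` pairs. All variable names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from math import exp, log

# (date, open, close) taken from the two data rows previewed in Part 2
rows = [
    ('2015-09-10', 110.269997, 112.57),
    ('2015-09-09', 113.760002, 110.150002),
]

by_weekday = defaultdict(list)
for date, open_price, close_price in rows:
    weekday = datetime.strptime(date, '%Y-%m-%d').weekday()  # 0 = Monday
    by_weekday[weekday].append(log(close_price) - log(open_price))

# Mean log change per weekday, converted back to a percentage change
percent_change = {
    day: exp(sum(changes) / len(changes)) - 1
    for day, changes in by_weekday.items()
}

for day in sorted(percent_change):
    print('{} {:.4f}'.format(day, percent_change[day]))
```

With the full dataset, comparing the spread of log changes per weekday (not only the mean) would help answer whether some days move more than others.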