### Install fastavro

```bash 
sudo pip install fastavro
```

In [1]:
import os
from cStringIO import StringIO
import fastavro
import boto
import pyspark

In [2]:
sc = pyspark.SparkContext()

### Load data, and use avro schema to map to JSON
We must first acquire the data. In this case, we will load a series of .avro files from AWS S3. Note:
- in a binaryFiles() call, a "?" character in the path is a glob wildcard that matches exactly one character (it is not a regular expression).
- for access to the S3 bucket, you need to set your AWS credentials as environment variables.
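
The behaviour of "?" can be checked locally. The sketch below uses Python's `fnmatch` module as a stand-in, assuming Spark's path globbing follows the same convention for "?"; the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only.
names = ["data.avro", "data2.avro", "data10.avro"]

# "?" must match exactly one character, so "data.avro" (zero characters)
# and "data10.avro" (two characters) are both rejected.
one_char = [n for n in names if fnmatchcase(n, "data?.avro")]

# "*" matches any run of characters, including an empty one.
any_run = [n for n in names if fnmatchcase(n, "data*.avro")]

print(one_char)
print(any_run)
```

If the files to load have numeric suffixes of varying width, "*" is probably the wildcard you want.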

In the few lines below, we:
- distribute the S3 keys of the files into an RDD,
- map each key to its binary contents, wrapped in a file-like object using StringIO,
- read each file-like object with fastavro and combine (flatMap) the resulting records into a single RDD of JSON-like records.
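
To see what the map and flatMap steps do without a cluster, here is a plain-Python analogy. It is only a sketch: `read_records` is a hypothetical stand-in for `fastavro.reader` that parses one JSON object per line, and the file contents are invented (the real files are binary Avro).

```python
import json
from io import StringIO
from itertools import chain

def read_records(buf):
    """Stand-in for fastavro.reader: yield one dict per line of JSON."""
    for line in buf:
        if line.strip():
            yield json.loads(line)

# Hypothetical file contents; the real files are binary Avro.
raw_files = [
    u'{"id": 1}\n{"id": 2}\n',
    u'{"id": 3}\n',
]

# map: wrap each file's contents in a file-like object.
buffers = [StringIO(raw) for raw in raw_files]

# flatMap: read every buffer and chain the records into one flat sequence.
records = list(chain.from_iterable(read_records(b) for b in buffers))
print(records)
```

A plain map would instead give one reader per file; flatMap flattens those readers into a single collection of records.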

**An important note on distributed processing**: Imagine we're dealing with files that are ~500TB put together. We can't process that locally, which is where Spark's distributed framework comes in. By loading the data into an RDD, we ensure we're using Spark's core strength to process these large data sets at scale.

**An important note on lazy evaluation in Spark**: As noted during the Spark lecture, Spark applies _lazy evaluation_: transformations like map() and flatMap() are only evaluated when an action such as <code>.take()</code> or <code>.collect()</code> explicitly requests their results.
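
Python generators behave in a similar way, which makes for a quick local illustration. This is an analogy in plain Python, not Spark itself; `double` and `calls` are names invented for the sketch.

```python
calls = []

def double(x):
    # Record each call so we can see when the work actually happens.
    calls.append(x)
    return 2 * x

# Building the generator does no work yet, much like a transformation.
lazy = (double(x) for x in range(5))
calls_before = list(calls)

# Asking for the results forces evaluation, much like an action.
result = list(lazy)

print(calls_before)
print(result)
```

Nothing is recorded in `calls` until `list(lazy)` runs, just as Spark does not touch the data until an action is called.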

In [27]:
# connect to AWS; with no arguments, boto reads the credentials from the
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
s3 = boto.connect_s3()

In [4]:
# access relevant bucket for lab
dsci = s3.get_bucket("dsci")

In [5]:
# get all keys for the files in the "dsci" bucket
# NOTE: for testing purposes, only use the data2 file
file_keys = dsci.get_all_keys(prefix="6007/data/SuperWebAnalytics/new_data/data2")

In [6]:
keys = sc.parallelize(file_keys)

In [7]:
keys.take(2)

[<Key: dsci,6007/data/SuperWebAnalytics/new_data/data2.avro>]

In [8]:
avro_data = keys.map(lambda key: StringIO(key.get_contents_as_string()))

In [9]:
# This RDD can't be inspected directly (e.g. with take()),
# presumably because its StringIO values can't be serialized back to the driver

In [10]:
json_data = avro_data.flatMap(fastavro.reader)

In [33]:
json_data.count()

1000000

## Data Exploration
Before working with a data set, it is useful to explore it a bit and see what we are working with. In this case, our data follows an Avro schema for a graph schema we have worked with before. The data consists of records, each describing either a property of a node or an edge.

Let's use .top(1) to grab a single record:

In [44]:
print json_data.top(1)

[{u'dataunit': {u'person_property': {u'property': {u'location': {u'city': None, u'state': None, u'country': u'US'}}, u'id': {u'user_id': 9999}}}, u'pedigree': {u'true_as_of_secs': 1438381448}}]


## Partitioning the data
Now that we have the data, we need to divide it into pieces according to the partitioning scheme outlined in the Lab specs. Our data is stored as a JSON array of objects. We'll take each record and map it to a 2-tuple containing the datatype and the actual datum. By dynamically generating the partition name, our code will be able to handle any new node properties or edge types that might be added later.

In [45]:
datum = json_data.top(1)

In [49]:
# map() passes each record to the function individually, not the enclosing list,
# so here we index into the list to get the same kind of input
datum[0]

{u'dataunit': {u'person_property': {u'id': {u'user_id': 9999},
   u'property': {u'location': {u'city': None,
     u'country': u'US',
     u'state': None}}}},
 u'pedigree': {u'true_as_of_secs': 1438381448}}

In [53]:
# pulls up the contents of 'dataunit'
datum[0]['dataunit']

{u'person_property': {u'id': {u'user_id': 9999},
  u'property': {u'location': {u'city': None,
    u'country': u'US',
    u'state': None}}}}

In [57]:
# identifies the first key
# in this case, the first key is 'person_property'
datum[0]['dataunit'].keys()[0]

u'person_property'

In [58]:
datatype = datum[0]['dataunit'].keys()[0]

In [60]:
# identifies the name of the actual property under the meta label of 'property'
# in this case it's 'location'
# NOTE: these operations assume that the dataunit is, in fact, atomic!
# If it is NOT atomic, I foresee errors!
datum[0]['dataunit'][datatype]['property'].keys()[0]

u'location'

In [62]:
# Finally, we arrive at the vertical partitioning of the data
'/'.join((datatype, datum[0]['dataunit'][datatype]['property'].keys()[0]))

u'person_property/location'

In [1]:
# Output is a tuple
# (file partition path, original datum)
'/'.join((datatype, datum[0]['dataunit'][datatype]['property'].keys()[0])), datum[0]

In [None]:
# The else branch is there so that the function
# can pass through edge dataunits, keyed by their datatype alone

In [11]:
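def partition_data(datum):
    # each dataunit is assumed to hold exactly one key: the datatype
    datatype = datum['dataunit'].keys()[0]
    if datatype.endswith('property'):
        # properties are partitioned one level deeper, by property name
        return '/'.join((datatype, datum['dataunit'][datatype]['property'].keys()[0])), datum
    else:
        return datatype, datum  # edge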
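
The partitioning logic can also be checked locally on plain dicts, without Spark. The sketch below is a variant under a few assumptions: `partition_key` is a hypothetical helper that returns only the partition name, it uses `next(iter(...))` instead of `.keys()[0]` so that it also runs on Python 3, and it raises an error when a dataunit is not atomic. The edge record's contents are invented.

```python
def partition_key(datum):
    """Return the partition path for one record (hypothetical helper)."""
    dataunit = datum['dataunit']
    if len(dataunit) != 1:
        raise ValueError("expected exactly one key in dataunit")
    datatype = next(iter(dataunit))
    if datatype.endswith('property'):
        # properties are partitioned one level deeper, by property name
        prop = next(iter(dataunit[datatype]['property']))
        return '/'.join((datatype, prop))
    return datatype  # edge: no property level

# Record shaped like the one explored above.
location = {
    'dataunit': {'person_property': {
        'id': {'user_id': 9999},
        'property': {'location': {'city': None, 'state': None, 'country': 'US'}},
    }},
    'pedigree': {'true_as_of_secs': 1438381448},
}

# Edge record with made-up contents, for illustration only.
edge = {'dataunit': {'page_view': {}}, 'pedigree': {'true_as_of_secs': 0}}

print(partition_key(location))
print(partition_key(edge))
```

Failing loudly on a non-atomic dataunit is one possible design choice; silently taking the first key, as the notebook code does, would put such records in an arbitrary partition.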

In [12]:
partitioned_json = json_data.map(partition_data)

In [13]:
print partitioned_json.take(3)

[(u'page_property/page_views', {u'dataunit': {u'page_property': {u'property': {u'page_views': 3}, u'id': {u'url': u'http://mysite.com/'}}}, u'pedigree': {u'true_as_of_secs': 1438381257}}), (u'person_property/location', {u'dataunit': {u'person_property': {u'property': {u'location': {u'city': None, u'state': None, u'country': u'US'}}, u'id': {u'user_id': 2528}}}, u'pedigree': {u'true_as_of_secs': 1438381257}}), (u'person_property/location', {u'dataunit': {u'person_property': {u'property': {u'location': {u'city': None, u'state': None, u'country': u'US'}}, u'id': {u'cookie': u'FGHIJ'}}}, u'pedigree': {u'true_as_of_secs': 1438381257}})]


In [38]:
# cache the data in RAM for quick access:
# we will run one filter per partition name (properties + edges)
partitioned_json.cache()

PythonRDD[26] at RDD at PythonRDD.scala:43

In [14]:
# collect the distinct partition names; they drive the save loop further down
# and double as a sanity check that only the desired partitions were created
partition_names = partitioned_json.map(lambda t: t[0]).distinct().collect()

In [15]:
# count the records in each partition
partitioned_json.countByKey()

defaultdict(<type 'int'>, {u'person_property/location': 149929, u'page_property/page_views': 125713, u'page_view': 599271, u'equiv': 125087})

[u'page_property/page_views',
 u'person_property/location',
 u'page_view',
 u'equiv']
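
For intuition, countByKey() on a pair RDD gives the same kind of result as counting keys with `collections.Counter` in plain Python. A minimal local sketch, with placeholder records:

```python
from collections import Counter

# A handful of (partition name, record) pairs; the records are placeholders.
pairs = [
    ('page_view', {}),
    ('equiv', {}),
    ('page_view', {}),
    ('person_property/location', {}),
]

# Count how many times each key occurs, ignoring the values.
counts = Counter(key for key, _ in pairs)
print(dict(counts))
```

Like countByKey(), this returns the counts to the local process, so it only suits cases where the number of distinct keys is small.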

In [None]:
# TODO: need to gracefully handle when dir/file already exists

# This loop creates a new folder for each partition name.
# For this data set, there are 4 partition names.
# NOTE: each folder will contain one part file per partition of the RDD being saved,
# because Spark writes every RDD partition out as its own part-* file

for p in partition_names:
    path = "../SuperWebAnalytics/master/{}".format(p)
    if os.path.exists(path):
        print "{} exists".format(path)
    else:
        partitioned_json.filter(lambda t: t[0] == p).values().saveAsPickleFile(path)
#         #  line below does avro:
#         partitioned_json.filter(lambda t: t[0] == p).values().mapPartitions(avro_writer).saveAsTextFile(path)
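
The loop above works because each saveAsPickleFile() call is an action that runs immediately, while p still holds the current name. If the filters were built first and evaluated later, every lambda would see the last value of p. A plain-Python sketch of the pitfall and one common fix, using made-up rows:

```python
names = ['equiv', 'page_view']  # two of the partition names
rows = [('equiv', 1), ('page_view', 2), ('page_view', 3)]

late = []
bound = []
for p in names:
    late.append(lambda t: t[0] == p)         # looks up p when called
    bound.append(lambda t, p=p: t[0] == p)   # freezes the current p

# Evaluate the filters only after the loop has finished.
late_counts = [len([r for r in rows if f(r)]) for f in late]
bound_counts = [len([r for r in rows if f(r)]) for f in bound]

print(late_counts)
print(bound_counts)
```

Binding p as a default argument keeps each filter tied to its own partition name, which is a safe habit if the save step is ever refactored to run lazily.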

In [67]:
!tree *_property

*_property [error opening dir]

0 directories, 0 files
