# Getting Meta with Big Data Malaysia
Scraping the Big Data Malaysia Facebook group for fun. Profit unlikely.

## Get all the data
This notebook assumes you have already written a flattened JSON file to `all_the_data.json`, which you would have done by:
* Writing your oauth token into `oauth_file` according to the instructions in `pull_feed.py`.
* Running `python pull_feed.py` to pull down the feed pages into the BigDataMyData directory.
* Running `python flatten_saved_data.py > all_the_data.json`.

In [19]:
import json
INPUT_FILE = "all_the_data.json"
with open(INPUT_FILE, "r") as big_data_fd:
    big_data = json.load(big_data_fd)

## Is it big enough?
Now we have all our data loaded into the variable `big_data`, but can we really say it's Big Data?

In [20]:
# parenthesised form runs under both Python 2 and Python 3
print("We have {} posts".format(len(big_data)))

We have 1946 posts


Wow! So data! Very big!

Seriously though... it's not big. In fact it's rather small. How small is small? Here's a clue...

In [21]:
import os
# st_size is the size of the file on disk, in bytes
print("The source file is {} bytes. Pathetic.".format(os.stat(INPUT_FILE).st_size))

The source file is 3773450 bytes. Pathetic.


But size doesn't matter. It's *variety* that counts.

## Fields of gold

Now we know how many elements (rows I guess?) we have, but how much variety do we have in this data? One measure of this may be the number of distinct field names that appear across all of those items:

In [40]:
import itertools
# iterating a dict yields its keys, so this collects every top-level field name
all_the_fields = set(itertools.chain.from_iterable(big_data))
print("We have {} different field names:".format(len(all_the_fields)))
print(all_the_fields)

We have 30 different field names:
set([u'application', u'actions', u'likes', u'created_time', u'message', u'id', u'story', u'from', u'subscribed', u'privacy', u'comments', u'shares', u'to', u'story_tags', u'type', u'status_type', u'picture', u'description', u'object_id', u'link', u'properties', u'icon', u'name', u'message_tags', u'with_tags', u'updated_time', u'caption', u'place', u'source', u'is_hidden'])
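
Knowing which field names exist doesn't say how often each one is actually used. As a rough sketch (not part of the original notebook), the snippet below counts how many items carry each top-level field. The function name `count_field_usage` and the `sample_posts` list are made up for illustration; in the notebook you would pass `big_data` instead.

```python
from collections import Counter


def count_field_usage(posts):
    """Count how many posts contain each top-level field name."""
    usage = Counter()
    for post in posts:
        # a dict's keys are unique, so each field counts once per post
        usage.update(post.keys())
    return usage


# Hypothetical stand-in for big_data, just to keep the example self-contained
sample_posts = [
    {"id": "1", "message": "hello", "type": "status"},
    {"id": "2", "link": "http://example.com", "type": "link"},
    {"id": "3", "message": "hi again", "type": "status"},
]

field_usage = count_field_usage(sample_posts)
for field, count in field_usage.most_common():
    print("{:<12} {}".format(field, count))
```

Fields that show up in only a handful of items are probably worth eyeballing before relying on them.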


Are we missing anything? A good way to sanity check things is to actually inspect the data, so let's look at a random item:

In [12]:
import random
import pprint
# re-run this as much as you like to inspect different items
pprint.pprint(random.choice(big_data))

{u'actions': [{u'link': u'https://www.facebook.com/497068793653308/posts/1071576142869234',
               u'name': u'Comment'},
              {u'link': u'https://www.facebook.com/497068793653308/posts/1071576142869234',
               u'name': u'Like'},
              {u'link': u'/groups/bigdatamy/', u'name': u'Create Group Chat'}],
 u'caption': u'youtube.com',
 u'comments': [{u'data': [{u'can_remove': True,
                           u'created_time': u'2015-04-04T01:14:40+0000',
                           u'from': {u'id': u'10203861474726148',
                                     u'name': u'Dickson Lukose'},
                           u'id': u'1072127426147439',
                           u'like_count': 0,
                           u'message': u'Hi Peter, what data formats are supported in this initiative?',
                           u'user_likes': False},
                          {u'can_remove': True,
                           u'created_time': u'2015-04-04T01:43:27+0000',
       

From that you should be able to sense that we are missing some things. A flat list of field names doesn't fully describe each item, because some of those fields have data hierarchies beneath them. For example:

In [45]:
pprint.pprint(big_data[234])

{u'actions': [{u'link': u'https://www.facebook.com/497068793653308/posts/1032324310127751',
               u'name': u'Comment'},
              {u'link': u'https://www.facebook.com/497068793653308/posts/1032324310127751',
               u'name': u'Like'},
              {u'link': u'/groups/bigdatamy/', u'name': u'Create Group Chat'}],
 u'comments': [{u'data': [{u'can_remove': True,
                           u'created_time': u'2015-02-02T14:20:46+0000',
                           u'from': {u'id': u'10203864949854090',
                                     u'name': u'Teuku Faruq'},
                           u'id': u'1033140356712813',
                           u'like_count': 1,
                           u'message': u'Interesting startup, all the best!',
                           u'user_likes': False},
                          {u'can_remove': True,
                           u'created_time': u'2015-02-04T07:45:13+0000',
                           u'from': {u'id': u'10203477707997024',


TODO:
* Get a cumulative activity timeline
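
One possible starting point for the timeline item, as a sketch only: bucket items by the date part of `created_time` and keep a running total. It assumes every timestamp looks like `2015-02-02T14:20:46+0000` with the same UTC offset, so the first ten characters can be used as the day. The function name `cumulative_posts_per_day` and the sample data are hypothetical.

```python
from collections import Counter


def cumulative_posts_per_day(posts):
    """Return (day, running_total) pairs, oldest day first."""
    # assumes created_time is an ISO-style string starting with YYYY-MM-DD
    per_day = Counter(
        post["created_time"][:10] for post in posts if "created_time" in post
    )
    timeline = []
    running_total = 0
    # ISO dates sort chronologically when sorted as plain strings
    for day in sorted(per_day):
        running_total += per_day[day]
        timeline.append((day, running_total))
    return timeline


# Hypothetical stand-in for big_data
sample_posts = [
    {"id": "1", "created_time": "2015-02-04T07:45:13+0000"},
    {"id": "2", "created_time": "2015-02-02T14:20:46+0000"},
    {"id": "3", "created_time": "2015-02-04T09:00:00+0000"},
    {"id": "4"},  # items without a timestamp are skipped
]

timeline = cumulative_posts_per_day(sample_posts)
for day, total in timeline:
    print("{} {}".format(day, total))
```

This only counts top-level items; comments have their own `created_time` and would need to be walked separately if they should count as activity.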