# Analysis of Facebook Archives

First we need to identify what files are available, and set up some other tools. My notebook runs in a directory with downloaded archives in a subdirectory named `archive_dir`.

In [1]:
# Replace this path with the path to your own extracted Facebook archive
# directory, relative to the Jupyter Notebook directory
FACEBOOK_ARCHIVES_PATH = "./archive_dir/facebook-caitelatte-20180528"

In [2]:
import os
import pprint
import re

# The dot before each extension is escaped so it only matches a literal "."
FILE_FORMAT_RE = r".*(\..*)$"
JSON_DATA_RE = r"(.*)\.json$"
NO_DATA_RE = r"^no-data\.txt$"

# Collect all available categories and the paths to them.
available_categories = {}
with os.scandir(FACEBOOK_ARCHIVES_PATH) as categories:
    for cat_dir in categories:
        if cat_dir.is_dir():
            available_categories[cat_dir.name] = {"path": cat_dir.path}


# Collect the relative path of every JSON data file under each category!
available_datafiles = {}
for category in available_categories:
#     print("Category '{0}'".format(category))
    available_datafiles[category] = []
    for (root, dirs, files) in os.walk(available_categories[category]["path"]):
        data_files = [f for f in files if re.match(JSON_DATA_RE, f) is not None]
        if len(data_files) > 0:
#             print(" ", os.path.relpath(root, FACEBOOK_ARCHIVES_PATH), data_files)
            for d_file in data_files:
                available_datafiles[category].append(
                    os.path.join(
                        os.path.relpath(root, available_categories[category]["path"]),
                        d_file
                    )
                )

print("The available categories under {path} and the number of json files within:".format(path=FACEBOOK_ARCHIVES_PATH))
for category in available_categories:
    print(" {category}: {count}".format(category=category, count=len(available_datafiles[category])))
# # pprint.pprint is a nice way to make data structures look nicer!
# pprint.pprint(available_datafiles) 

The available categories under ./archive_dir/facebook-caitelatte-20180528 and the number of json files within:
 about_you: 2
 ads: 4
 apps_and_websites: 2
 calls_and_messages: 1
 comments: 1
 events: 3
 following_and_followers: 2
 friends: 5
 groups: 3
 likes_and_reactions: 3
 location_history: 0
 marketplace: 1
 messages: 1106
 network_information: 0
 other_activity: 2
 pages: 1
 payment_history: 1
 photos: 18
 posts: 2
 profile_information: 2
 saved_items: 1
 search_history: 1
 security_and_login_information: 8
 videos: 1
 your_places: 1
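
The same count can also be sketched with `pathlib` instead of `os.scandir` plus `os.walk`. This is only a compact alternative, not a replacement for the cell above; the helper name `count_json_files` is made up for illustration, and it assumes every category is a top-level directory of the archive.

```python
from pathlib import Path


def count_json_files(archive_root):
    """Return {category_name: number of .json files anywhere below it}."""
    root = Path(archive_root)
    counts = {}
    for cat_dir in sorted(root.iterdir()):
        if cat_dir.is_dir():
            # rglob searches the category directory recursively
            counts[cat_dir.name] = sum(
                1 for p in cat_dir.rglob("*.json") if p.is_file()
            )
    return counts
```

Calling `count_json_files(FACEBOOK_ARCHIVES_PATH)` should give the same numbers as the printout above, with categories that only contain `no-data.txt` showing up as 0.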


## Output 1: Reacts!

**Goal:** Show a bar chart of how many reacts of each type per month.

### Notes:

-   file is `likes_and_reactions/posts_and_comments.json` and an example entry is:
    
    ```json
    {
      "timestamp": 1527326127,
      "data": [
        {
          "reaction": {
            "reaction": "LOVE",
            "actor": "Caitlin Chai Macleod"
          }
        }
      ],
      "title": "Caitlin Chai Macleod reacted to Whittaker's Chocolate Lovers's video."
    },
    ```
-   Reacts other than "LIKE" were only introduced fairly recently, so the oldest entries should all be "LIKE"
-   I expect a lot of "LOVE" reacts

In [3]:
"""Load the reaction JSON into `reacts`!"""
import json


file_path = os.path.join(FACEBOOK_ARCHIVES_PATH, "likes_and_reactions/posts_and_comments.json")

with open(file_path) as target_file:
    reacts = json.load(target_file)
    
print("Loaded {count} likes and reactions!".format(count=len(reacts["reactions"])))
print("The first entry looks like:", reacts["reactions"][0])

Loaded 33078 likes and reactions!
The first entry looks like: {'timestamp': 1527509031, 'data': [{'reaction': {'reaction': 'WOW', 'actor': 'Caitlin Chai Macleod'}}], 'title': "Caitlin Chai Macleod reacted to Eva Penrose's post in patto ur catto ã\x80\x8eÊ\x9cá´\x80á´\x8dÊ\x99postimgã\x80\x8f."}
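
The group name at the end of that title is garbled. It looks like the export stores UTF-8 bytes as if each byte were a Latin-1 character, so one possible repair is to encode back to Latin-1 and decode as UTF-8. This is a sketch under that assumption; `fix_facebook_text` is a hypothetical helper, not part of the archive tooling.

```python
def fix_facebook_text(text):
    """Undo the mojibake: re-encode as Latin-1, then decode as UTF-8."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Leave strings that do not fit the pattern untouched
        return text


# The garbled fragment from the output above, written with escapes
garbled = "\xe3\x80\x8e\xca\x9c\xe1\xb4\x80\xe1\xb4\x8d\xca\x99postimg\xe3\x80\x8f"
print(fix_facebook_text(garbled))
```

Plain ASCII strings such as `"LOVE"` pass through unchanged, so the helper could be applied to every `title` without special-casing.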


In [4]:
"""Extract just the timestamps and reaction type!

From `reacts` into `reacts_times_types`, a list of tuples `(timestamp, reactname)`.
"""

import matplotlib.dates as mdates

# an entry of reacts['reactions'] looks like:
# {
#   'timestamp': 1527509031,
#   'data': [
#     {'reaction':
#       {'reaction': 'WOW', 'actor': 'Caitlin Chai Macleod'}
#     }
#   ],
#   'title': "Caitlin Chai Macleod reacted to Eva Penrose's post in patto ur catto ã\x80\x8eÊ\x9cá´\x80á´\x8dÊ\x99postimgã\x80\x8f."
# }

reacts_times_types = [
    (reac['timestamp'], reac['data'][0]['reaction']['reaction'])
    for reac in reacts["reactions"]
]

# ensure the list of tuples is sorted in ascending date order (oldest first)
reacts_times_types = sorted(reacts_times_types, key=lambda pair: pair[0])
print(reacts_times_types[:10])

[(1244720004, 'LIKE'), (1245578286, 'LIKE'), (1246180871, 'LIKE'), (1246330006, 'LIKE'), (1246330086, 'LIKE'), (1246332975, 'LIKE'), (1248491167, 'LIKE'), (1248988022, 'LIKE'), (1249376921, 'LIKE'), (1249376946, 'LIKE')]
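
With `(timestamp, reactname)` tuples in hand, the per-month counts needed for the bar chart could be sketched like this. It assumes the timestamps are Unix epoch seconds and reads them as UTC; `count_reacts_per_month` is a made-up name, and the sample uses three of the tuples printed above.

```python
from collections import Counter
from datetime import datetime, timezone


def count_reacts_per_month(times_types):
    """Map (year-month, react type) to a count, given (timestamp, type) pairs."""
    counts = Counter()
    for timestamp, react_type in times_types:
        # Timestamps are assumed to be Unix epoch seconds, read here as UTC
        month = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")
        counts[(month, react_type)] += 1
    return counts


sample = [(1244720004, 'LIKE'), (1245578286, 'LIKE'), (1248491167, 'LIKE')]
print(count_reacts_per_month(sample))
```

Each `(month, type)` key maps to one bar, so the result could be handed to a plotting library more or less directly.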


In [5]:
"""Calculate and print the frequency of various reactions!"""

import numpy as np

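# zip(*pairs) unpacks the list so that zip() gets every tuple as its own
# argument, which splits the pairs back into two tuples:
# https://docs.python.org/3/library/functions.html#zip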
reacts_times, reacts_types = zip(*reacts_times_types)

# took this from the twitter thing for usernames
(uniq_reacts, count_reacts) = np.unique(reacts_types, return_counts=True)
tuple_freqs_reacts = sorted(zip(uniq_reacts, count_reacts), key=lambda zipzip: zipzip[-1], reverse=True)

print("My reaction counts are:")
for (reac, count) in tuple_freqs_reacts:
    print(" {reac}: {count}".format(reac=reac, count=count))

My reaction counts are:
 LIKE: 21611
 LOVE: 6616
 WOW: 1948
 HAHA: 1801
 SORRY: 840
 ANGER: 236
 DOROTHY: 26


### My first `LOVE`  💖

That's pretty good! But I'm pretty sure that I've used the `LOVE` react way more often than the `LIKE`, and it's only losing because it was implemented pretty recently, in May 2017 (thanks [Wikipedia List of Facebook Features](https://en.wikipedia.org/wiki/List_of_Facebook_features#Likes_and_Reactions)!).

**New updated goal:** Count the number of reactions since the *first non-`LIKE` react*

#### Notes:

-   I copied a lot of things from previous attempts here. New is the use of the `index()` method (`reacts_types` is a tuple, and `tuple.index()` works like `list.index()`) to find the first instance.
-   I had to adjust this a little after realising that the counts of the HAHA react were slightly different in the new numbers. Turns out I did a few HAHA reacts before a LOVE react, which is off brand, but I'm leaving the names and everything as referring to LOVE because it's funnier 😊
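
To avoid guessing which react came first, the lookup could search for the first entry that is not `LIKE` rather than for a specific name. A minimal sketch, with a made-up helper `first_index_not_like` and invented sample data:

```python
def first_index_not_like(react_types):
    """Return the index of the first react that is not "LIKE", or None."""
    return next(
        (i for i, react in enumerate(react_types) if react != "LIKE"),
        None,
    )


sample_types = ("LIKE", "LIKE", "HAHA", "LIKE", "LOVE")
print(sample_types.index("LOVE"))
print(first_index_not_like(sample_types))
```

In the sample, `index("LOVE")` lands after the earlier `HAHA`, which is the same kind of mismatch described in the note above.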

In [6]:
"""Print the frequency of reactions since the first non-LIKE react!"""
import numpy as np
import matplotlib.dates as mdates

# zip(*pairs) splits the list of pairs back into two tuples:
# https://docs.python.org/3/library/functions.html#zip
reacts_times, reacts_types = zip(*reacts_times_types)

# filter out reacts before the first non-LIKE react here
# index(x) returns the index of the first occurrence of x
# "HAHA" turned out to be my first non-LIKE react (see the notes above)
# `_al` stands for After Love
# print(reacts_types)
first_love = reacts_types.index("HAHA")
reacts_times_al = reacts_times[first_love:]
reacts_types_al = reacts_types[first_love:]

# took that from the twitter thing
(uniq_reacts, count_reacts) = np.unique(reacts_types_al, return_counts=True)
tuple_freqs_reacts = sorted(zip(uniq_reacts, count_reacts), key=lambda zipzip: zipzip[-1], reverse=True)

print("Reactions other than LIKE were introduced in May 2017.")
print("I sent {count} reactions before other reactions were introduced, and {new_count} since.".format(
    count=len(reacts_types)-len(reacts_types_al),
    new_count=len(reacts_types_al)
))
print("Here are my reactions Post-LOVE react:")
for (reac, count) in tuple_freqs_reacts:
    print(" {reac}: {count}".format(reac=reac, count=count))

Reactions other than LIKE were introduced in May 2017.
I sent 1983 reactions before other reactions were introduced, and 31095 since.
Here are my reactions Post-LOVE react:
 LIKE: 19628
 LOVE: 6616
 WOW: 1948
 HAHA: 1801
 SORRY: 840
 ANGER: 236
 DOROTHY: 26


### Output 1 review

I'm not sure if I believe that I only gave about 2000 post likes before May 2017... some help reviewing this would be amazing!
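
One way to sanity-check this would be to look up the date of the earliest non-`LIKE` react and compare it with May 2017. If it falls well before that, the split is happening at a different point than the printed message claims. This is a sketch only: it assumes the timestamps are Unix epoch seconds read as UTC, `first_non_like_date` is a hypothetical helper, and the sample reuses timestamps that appear earlier in this notebook.

```python
from datetime import datetime, timezone


def first_non_like_date(times_types):
    """Return the UTC datetime of the earliest non-LIKE react, or None."""
    non_like_times = [ts for ts, react in times_types if react != "LIKE"]
    if not non_like_times:
        return None
    return datetime.fromtimestamp(min(non_like_times), tz=timezone.utc)


sample = [(1244720004, 'LIKE'), (1527326127, 'LOVE'), (1527509031, 'WOW')]
print(first_non_like_date(sample))
```

On the real data this would be called as `first_non_like_date(reacts_times_types)`.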

## Output 2: Messenger group memberships

**TODO**

## Notes

In [7]:
# import re
# help(re)
import json
help(json)

Help on package json:

NAME
    json

MODULE REFERENCE
    https://docs.python.org/3.6/library/json
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    JSON (JavaScript Object Notation) <http://json.org> is a subset of
    JavaScript syntax (ECMA-262 3rd edition) used as a lightweight data
    interchange format.
    
    :mod:`json` exposes an API familiar to users of the standard library
    :mod:`marshal` and :mod:`pickle` modules.  It is derived from a
    version of the externally maintained simplejson library.
    
    Encoding basic Python object hierarchies::
    
        >>> import json
        >>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
        '["foo", {"bar": ["baz", null, 1.0,