QuantifiedMe
============

**Created by:** Erik Bjäreholt   ([GitHub](https://github.com/ErikBjare), [Twitter](https://twitter.com/ErikBjare), [LinkedIn](https://www.linkedin.com/in/erikbjareholt/))

**View the latest built version at: Link not yet available (TODO)**
<br>
**Get the code at: https://github.com/ErikBjare/quantifiedme**

Tools using rich data, decent metrics, and pretty visualizations, for measuring and managing behavior, productivity, health, and life in general.


# You are viewing a **code-only** notebook; there won't be any output or visualizations!

A notebook that includes output and visualizations will be created "soon"!

~~**Do you just want to see pretty visualizations? Scroll down to the "Visualize" section!**~~


# Introduction

The phrase *"What gets measured gets managed"* is sometimes thrown around in professional contexts. While often taken at face value, it's actually an important observation that today drives practically the entire world. Companies measure performance and financial results, engineering teams keep track of their tasks and resources, and scientists measure everything from health outcomes to the trajectory of interstellar objects that could threaten the planet.

Indeed, collecting and analysing data is the foundation for all of science, or as Lord Kelvin put it:

> *I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.*
>
>   ***–  William Thomson*** (Lord Kelvin), Lecture on "Electrical Units of Measurement" (1883)

**However**, although measurement is common practice in professional contexts, where words like "data-driven" get thrown around, it's less common in our personal lives. We generally don't see problems in our personal lives as matters that could be solved through measuring and analysing. This is probably because we don't know which questions to ask, because collecting and analysing the data seems too difficult when we're unaware of the tools to get the job done, or simply because we compartmentalize the scientific method as a "work thing", or as something to be left to "real scientists".

So what if we had good open-source tools to easily ask questions and explore data about our personal lives? What if people shared the data with each other, and together worked on common personal problems (productivity, mental & physical health, work/life balance) in a truly scientific way? I think that seems worthy of exploring.

I've built some of those tools over the past years, among them the open-source time tracker [ActivityWatch](https://activitywatch.net/), and here's a little showcase of some of my work over that time. I've used a notebook like this one almost every week for almost a year to explore my behavior (many of the things didn't make the cut, sorry). It's been both fascinating and rich in insights about how I spend my time, and how I could do better in big and small ways. But the inquiry has just started, there's a lot more to come.

Now, dear reader, I've blabbered enough. Enjoy my work, I hope you find it of interest (and use!). Be sure to check out some of the links at the end for more stuff like this!

# Table of contents

- [Setup](#Setup)
- [Load data](#Load-data)
  - [Generate fake data](#Generate-fake-data)
  - [Load ActivityWatch data](#Load-ActivityWatch-data)
  - [Load SmarterTime data](#Load-SmarterTime-data)
  - [Load Toggl data](#Load-Toggl-data)
  - [Annotate data](#Annotate-data)
- [Visualize](#Visualize)
  - [Daily time plot](#Daily-time-plot)
  - [Category sunburst](#Category-sunburst)
  - [Fictional wage plot](#Fictional-wage-plot)
  - [Uncategorized](#Uncategorized)
  
TODO: Build the table of contents automatically?

# Setup

First we do some imports, and set some variables used in the rest of the program.

In [None]:
from datetime import datetime, time, date, timezone, timedelta
from pathlib import Path
import random
from collections import Counter, defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pytz
from IPython.utils import io
from IPython.display import display, HTML

import aw_core
from aw_core.models import Event
import aw_research, aw_research.classify
from aw_research.classify import _union_no_overlap
from aw_research import verify_no_overlap, split_into_weeks, split_into_days

# Use XKCD-style plots
# FIXME: Causes the day trend plots to take forever for some unknown reason
# plt.xkcd(scale=0.8, randomness=1)

## Configuration

**Modify these to your liking!**

In [None]:
# Set this to your timezone
your_timezone = pytz.timezone('Europe/Stockholm')
tz_offset = your_timezone.utcoffset(datetime.now())

# Use personal data, not fake data
personal = False

# Choose where to fetch data from
# Valid options: 'activitywatch', 'smartertime', 'toggl', 'fake'
if personal:
    # List of sources from which to fetch data when running in personal mode
    datasources = ['activitywatch', 'smartertime'] #, 'toggl']
else:
    # Use fake data if not in personal mode
    datasources = ['fake']
    
# Days of history to use
days_back = 360

The below sets the window title to something more descriptive so that ActivityWatch can track that I'm working on this specific notebook (since the default isn't informative in JupyterLab).

In [None]:
%%javascript
document.title='QuantifiedMe - Jupyter'

### Set current time

In [None]:
# Now let's just set the current time and our query interval and we're ready to load data!
# If not running in personal mode, use a fixed datetime to make the notebook reproducible.
# NOTE: .astimezone() on a naive datetime interprets it in the machine's local timezone.
now = datetime.now(tz=timezone.utc) if personal else datetime(2019, 6, 1).astimezone(timezone.utc)
day_offset = timedelta(hours=4)
today = datetime.combine(now.date(), time()).astimezone(timezone.utc) + day_offset
since = today - timedelta(days=days_back)

print(f"Today:  {today.date()}")
print(f"Start:  {since}")
print(f"End:    {now}")

# Load data

We will load data from all sources into the `events` variable. 

Each consecutive source fills any gaps left by the previous sources (to prevent overlap), by using `_union_no_overlap`.
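
To make the gap-filling idea concrete, here is a minimal sketch on plain `(start, end)` tuples. It is not the implementation in `aw_research`; the name `union_no_overlap_sketch` and the tuple representation are assumptions made for illustration, and the primary list is assumed to be free of overlaps already.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end), a hypothetical stand-in for an event


def union_no_overlap_sketch(primary: List[Interval], secondary: List[Interval]) -> List[Interval]:
    """Keep all of `primary` and add only the parts of `secondary` that fall in its gaps."""
    primary = sorted(primary)
    result = list(primary)
    for start, end in sorted(secondary):
        cursor = start
        for p_start, p_end in primary:
            if p_end <= cursor:
                continue  # this primary interval ends before the cursor
            if p_start >= end:
                break  # this primary interval starts after the secondary interval ends
            if p_start > cursor:
                result.append((cursor, p_start))  # uncovered gap before the primary interval
            cursor = max(cursor, p_end)
        if cursor < end:
            result.append((cursor, end))  # remainder after the last overlapping interval
    return sorted(result)


t0 = datetime(2019, 6, 1, 12, tzinfo=timezone.utc)
minutes = lambda n: t0 + timedelta(minutes=n)
merged = union_no_overlap_sketch(
    primary=[(minutes(0), minutes(10)), (minutes(20), minutes(30))],
    secondary=[(minutes(5), minutes(25))],
)
```

In this example the secondary interval is trimmed down to the 10 to 20 minute gap, so the primary source always wins where both have data.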

In [None]:
events = []

## Generate fake data

So I can show you the plots in this notebook without sacrificing my privacy!

In [None]:
data_weights = [
    (100, None),
    (2, {'title': 'Uncategorized'}),
    (5, {'title': 'ActivityWatch'}),
    (4, {'title': 'Thankful'}),
    (3, {'title': 'QuantifiedMe'}),
    (3, {'title': 'FMAA01 - Analysis in One Variable'}),
    (3, {'title': 'EDAN95 - Applied Machine Learning'}),
    (2, {'title': 'Stack Overflow'}),
    (2, {'title': 'phone: Brilliant'}),
    (2, {'url': 'youtube.com', 'title': 'YouTube'}),
    (1, {'url': 'reddit.com'}),
    (1, {'url': 'facebook.com'}),
    (1, {'title': 'Plex'}),
    (1, {'title': 'Spotify'}),
    (1, {'title': 'Fallout 4'}),
]

def create_fake_events(start: datetime, end: datetime):
    # First set RNG seeds to make the notebook reproducible
    random.seed(0)  
    np.random.seed(0)
    
    pareto_alpha = 0.5
    pareto_mode = 5
    time_passed = timedelta()
    while start + time_passed < end:
        duration = timedelta(seconds=np.random.pareto(pareto_alpha) * pareto_mode)
        duration = min([timedelta(hours=1), duration])
        timestamp = start + time_passed
        data = random.choices([d[1] for d in data_weights], [d[0] for d in data_weights])[0]
        if data:
            yield Event(timestamp=timestamp, duration=duration, data=data)
        time_passed += duration
    
if 'fake' in datasources:
    fake_events = list(create_fake_events(start=since.astimezone(timezone.utc), end=now.astimezone(timezone.utc)))
    events += fake_events

## Load ActivityWatch data

Retrieve events from aw-server. The query combines active-window events with browser history and filters them by AFK/audible status.

In [None]:
if 'activitywatch' in datasources:
    # Split the query into weeks, to take advantage of caching
    # TODO: Split up into whole days
    events_aw = []
    for dtstart, dtend in split_into_weeks(since, now):
        events_aw += aw_research.classify.get_events(since=dtstart, end=dtend, include_smartertime=False, include_toggl=False)
        print(len(events_aw))
    for e in events_aw:
        e.data['$source'] = 'activitywatch'

    events = _union_no_overlap(events, events_aw)
    verify_no_overlap(events)
    
# The above code does caching using joblib, use the following if you want to clear the cache:
# aw_research.classify.memory.clear()

## Load SmarterTime data

[SmarterTime](https://play.google.com/store/apps/details?id=com.smartertime&hl=en) is an Android app that tracks app usage. I primarily used it before I got ActivityWatch working on Android (it's still used here for old data).

The code loads an ActivityWatch bucket that I converted from the app export (so there's one step here I haven't shown).

In [None]:
def load_smartertime():
    events_smartertime = []
    for smartertime_awbucket_path in [
        'data/smartertime/smartertime_export_erb-a1_2019-02-18_bb7f26aa.awbucket.json',
        'data/smartertime/smartertime_export_erb-f1-miui_2019-10-17_6465fafb.awbucket.json'
    ]:
        new_events = aw_research.classify._get_events_smartertime(since, filepath=smartertime_awbucket_path)
        events_smartertime = _union_no_overlap(events_smartertime, new_events)
    for e in events_smartertime:
        e.data['$source'] = 'smartertime'
    return events_smartertime

if 'smartertime' in datasources:
    events_smartertime = load_smartertime()
    verify_no_overlap(events_smartertime)
    events = _union_no_overlap(events, events_smartertime)
    verify_no_overlap(events)

## Load Toggl data

[Toggl](https://toggl.com/) is a web, desktop, and mobile app that lets you track time manually.

In [None]:
from aw_research import load_toggl
        
if 'toggl' in datasources:
    events_toggl = load_toggl(since, now)
    print(f"Oldest: {min(events_toggl, key=lambda e: e.timestamp).timestamp}")
    verify_no_overlap(events_toggl)
    events = _union_no_overlap(events, events_toggl)
    verify_no_overlap(events)

## Verify data
Just to make sure there are no bugs in underlying code.

In [None]:
# Verify that no events are older than `since`
assert all([since <= e.timestamp for e in events])

# Verify that no events take place in the future
# FIXME: Doesn't work with fake data, atm
if 'fake' not in datasources:
    assert all([e.timestamp + e.duration <= now for e in events])

# Verify that no events overlap
verify_no_overlap(events)

In [None]:
e1 = Event(**{'data': {'title': 'event 1'},
              'duration': timedelta(seconds=1, microseconds=599000),
              'timestamp': datetime(2018, 12, 15, 16, 42, 0, 906000, tzinfo=timezone.utc)})
e2 = Event(**{'data': {'title': 'event 2'},
              'duration': timedelta(seconds=269, microseconds=602000),
              'timestamp': datetime(2018, 12, 15, 16, 42, 0, 964000, tzinfo=timezone.utc)})

es = _union_no_overlap([e1], [e2])
verify_no_overlap(es)

In [None]:
# Inspect the distribution of event duration
fig, ax = plt.subplots()
xlim = 50
pd.Series([e.duration.total_seconds() for e in events if e.duration.total_seconds() <= xlim]).plot.hist(bins=10, bottom=1)
ax.set_xlabel('Seconds')
ax.set_ylabel('# of events')
ax.set_xlim(0, xlim)
#ax.set_yscale('log')

#df = pd.DataFrame(pd.Series([e.duration.total_seconds() for e in events]))
#df["dur"] = (df[0] // 10) * 10
#df["logdur"] = log((df[0] * 1).round())
#df[df["dur"] > 10]["dur"].plot.hist()
#df.groupby("dur").mean() * df.groupby("dur").count()

In [None]:
total_events = len(events)
short_thres = 5
short_events = len([e for e in events if e.duration.total_seconds() < short_thres])
print(f"# of total events:  {total_events}")
print(f"# of events <{short_thres}s:    {short_events} ({round(100 * short_events/total_events)}%)")

In [None]:
# TODO: Include sleep for improved coverage
tracking_cov = sum((e.duration for e in events), timedelta()) / (now - since)
print(f"Tracking coverage: {100 * tracking_cov:.3}%")

## Annotate data

Now we want to annotate our data with tags and categories.
To do so we first need to specify tagging and classification rules.

### Define tagging rules

First we need to specify rules used in categorization and tagging.

To clarify:

 - An event can have **many** tags
 - An event can have **only one** category, but will also belong to that category's parent categories (creating a category hierarchy)

The rules are specified by a list of tuples in the format `(regex, category, parent_category)`. You can write them within the notebook or load them from a file.
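
To illustrate how rules in this format can map a window title to a category and its parents, here is a small sketch. It is not how `aw_research.classify` works internally; `classify_text`, `category_hierarchy`, and the first-match-wins behaviour are assumptions for illustration. The `' -> '` separator follows the hierarchy strings used later in this notebook.

```python
import re
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, str, Optional[str]]  # (regex, category, parent_category)

# A small subset of rules, in the tuple format described above
example_rules: List[Rule] = [
    (r'subcategory without matching', 'Video', 'Media'),
    (r'YouTube|youtube\.com', 'YouTube', 'Video'),
    (r'github\.com|stackoverflow\.com', 'Programming', 'Work'),
    (r'[Aa]ctivity[Ww]atch', 'ActivityWatch', 'Programming'),
]


def category_hierarchy(category: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Walk up the parent links, returning the path from the root down to `category`."""
    path = [category]
    while parents.get(path[0]) is not None:
        path.insert(0, parents[path[0]])
    return path


def classify_text(text: str, rules: List[Rule]) -> Optional[str]:
    """Return the hierarchy of the first rule whose regex matches (assumed: first match wins)."""
    parents = {category: parent for _, category, parent in rules}
    for pattern, category, _ in rules:
        if re.search(pattern, text):
            return ' -> '.join(category_hierarchy(category, parents))
    return None


youtube_hierarchy = classify_text('YouTube - some video', example_rules)
unmatched = classify_text('Fallout 4', example_rules)
```

Here `youtube_hierarchy` resolves through two parents (`Media`, then `Video`), while a title that matches no rule yields `None`.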

In [None]:
classes = [
    # (Social) Media
    (r'Facebook|facebook.com', 'Social Media', 'Media'),
    (r'Reddit|reddit.com', 'Social Media', 'Media'),
    (r'Spotify|spotify.com', 'Music', 'Media'),
    (r'subcategory without matching', 'Video', 'Media'),
    (r'YouTube|youtube.com', 'YouTube', 'Video'),
    (r'Plex|plex.tv', 'Plex', 'Video'),
    (r'Fallout 4', 'Games', 'Media'),
    
    # Work
    (r'github.com|stackoverflow.com', 'Programming', 'Work'),
    (r'[Aa]ctivity[Ww]atch|aw-.*', 'ActivityWatch', 'Programming'),
    (r'[Qq]uantified[Mm]e', 'QuantifiedMe', 'Programming'),
    (r'[Tt]hankful', 'Thankful', 'Programming'),
    
    # School
    (r'subcategory without matching', 'School', 'Work'),
    (r'Duolingo|Brilliant|Khan Academy', 'Self-directed', 'School'),
    (r'Analysis in One Variable', 'Maths', 'School'),
    (r'Applied Machine Learning', 'CS', 'School'),
    (r'Advanced Web Security', 'CS', 'School'),
]

# Now load the classes from within the notebook, or from a file.
load_from_file = personal
if load_from_file:
    aw_research.classify._init_classes(filename="./aw-research-sym/categories.toml")
else:
    aw_research.classify._init_classes(new_classes=classes)

Now to actually annotate the events with our defined tags/categories we will use the `classify(events)` function which categorizes events by adding the fields `$tags` and `$category_hierarchy` to the event data.

In [None]:
events = aw_research.classify.classify(events)

# Visualize

We now have events loaded from a variety of sources, annotated with categories and tags. **Here comes the fun part!**

Here are some visualizations I've found useful for showing your activity today, your activity over many days, and how much time you've spent in each category.

TODO: Add calendar heatmap plot

## Today plot

Bar chart of which hours you've been active today.
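
Since the y-axis of this plot runs from 0 to 1, each bar presumably shows the fraction of that hour spent active. The sketch below shows one way to compute such fractions by splitting events at hour boundaries. It is an illustration only, not the code behind `categorytime_during_day`; the function name and the `(timestamp, duration)` pairs are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Tuple


def active_fraction_per_hour(
    intervals: Iterable[Tuple[datetime, timedelta]]
) -> Dict[datetime, float]:
    """Split each (timestamp, duration) pair at hour boundaries and sum the time per hour."""
    seconds_per_hour: Dict[datetime, float] = defaultdict(float)
    for timestamp, duration in intervals:
        end = timestamp + duration
        cursor = timestamp
        while cursor < end:
            hour_start = cursor.replace(minute=0, second=0, microsecond=0)
            # Never let a chunk cross into the next hour
            chunk_end = min(end, hour_start + timedelta(hours=1))
            seconds_per_hour[hour_start] += (chunk_end - cursor).total_seconds()
            cursor = chunk_end
    return {hour: seconds / 3600 for hour, seconds in sorted(seconds_per_hour.items())}


start = datetime(2019, 6, 1, 13, 45, tzinfo=timezone.utc)
fractions = active_fraction_per_hour([(start, timedelta(minutes=30))])
```

A 30-minute event starting at 13:45 ends up as a quarter of the 13:00 hour and a quarter of the 14:00 hour.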

In [None]:
from aw_research import split_event_on_hour, categorytime_per_day, categorytime_during_day, start_of_day, end_of_day
    
def plot_categorytime_during_day(events, category, color='teal'):
    df = categorytime_during_day(events, category, today)
    
    # FIXME: This will make the first and last hour to always be 0
    df[start_of_day(today) + day_offset - tz_offset] = 0 
    df[end_of_day(today) + day_offset - tz_offset] = 0
    df = df.sort_index().asfreq('H')
    
    fig = plt.figure(figsize=(18, 3))
    ax = df.plot(kind='bar', color=color, rot=60)
    ax.set_ylim(0, 1)
    plt.title(category or "All activity")
    
    def label_format_hour(label):
        """
        Convert time label to the format of pandas line plot
        Based on: https://stackoverflow.com/a/53995225/965332
        """
        label = label.replace(tzinfo=your_timezone)
        label = label + label.utcoffset()
        return f"{label.hour}:{label.minute:02d}"  # if label.hour % 2 == 0 else ''

    ax.set_xticklabels([label_format_hour(ts) for ts in df.index])
    plt.tight_layout()

In [None]:
plot_categorytime_during_day(events, "")
plot_categorytime_during_day(events, "Work", color='green')

## Daily time plot

Useful to see how much you've engaged in a particular activity over time.

In [None]:
def plot_category(cat, big=False):
    fig = plt.figure(figsize=(18, 5 if big else 3))
    #aw_research.classify._plot_category_daily_trend(events, [cat])
    try:
        ts  = categorytime_per_day(events, cat)
    except Exception as e:
        print(f"Error for category '{cat}': {e}")
        return
    ts.plot(label=f"{cat}: daily", legend=True)
    ts.rolling(7, min_periods=4).mean().plot(label="7d SMA", legend=True)
    ts.rolling(30, min_periods=14).mean().plot(label="30d SMA", legend=True)
    plt.legend(loc='upper left')
    plt.title(cat)
    plt.xlim(pd.Timestamp(since), pd.Timestamp(now))
    plt.ylim(0)
    plt.grid(linestyle='--')
    plt.tight_layout()
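
The "7d SMA" and "30d SMA" lines above are simple moving averages that only start once enough days are available (`min_periods`). As a rough, dependency-free sketch of that idea (the function name is made up, and unlike pandas it returns `None` rather than `NaN` and does not handle missing values):

```python
from typing import List, Optional


def simple_moving_average(
    values: List[float], window: int, min_periods: int
) -> List[Optional[float]]:
    """Trailing mean over the last `window` values; None until `min_periods` values exist."""
    result: List[Optional[float]] = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        result.append(sum(recent) / len(recent) if len(recent) >= min_periods else None)
    return result


smoothed = simple_moving_average([1, 2, 3, 4, 5], window=3, min_periods=2)
```

A lower `min_periods` makes the average appear earlier in the plot, at the cost of being noisier for the first few days.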

In [None]:
# All logged activity
plot_category('', big=True)

In [None]:
# Work-related
plot_category('Work', big=True)
plot_category('Programming')
plot_category('ActivityWatch')
plot_category('QuantifiedMe')
plot_category('Thankful')

In [None]:
# School-related
plot_category('School')
plot_category('Self-directed')
plot_category('Maths')

In [None]:
# Entertainment
plot_category('Media', big=True)
plot_category('Social Media')
plot_category('Video')
plot_category('Music')
plot_category('Games')

## Category sunburst

Uses the category hierarchy to create an overview of how time has been spent during a given period.
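
A sunburst needs a total for every ring, so time spent in a subcategory also has to count towards its parents. Here is a minimal sketch of that aggregation, assuming hierarchies are given as strings like `'Work -> Programming'`. The function name is hypothetical and this is not the plotting code in `aw_research`.

```python
from collections import defaultdict
from datetime import timedelta
from typing import Dict, Iterable, Tuple


def time_per_hierarchy_level(
    items: Iterable[Tuple[str, timedelta]], separator: str = ' -> '
) -> Dict[str, timedelta]:
    """Credit each duration to its category and to every ancestor category."""
    totals: Dict[str, timedelta] = defaultdict(timedelta)
    for hierarchy, duration in items:
        parts = hierarchy.split(separator)
        for depth in range(1, len(parts) + 1):
            totals[separator.join(parts[:depth])] += duration
    return dict(totals)


totals = time_per_hierarchy_level([
    ('Work -> Programming', timedelta(hours=2)),
    ('Work -> School', timedelta(hours=1)),
    ('Media', timedelta(minutes=30)),
])
```

Each key in `totals` corresponds to one wedge: `'Work'` covers both of its children, while `'Work -> Programming'` only covers itself.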

In [None]:
events_today = [e for e in events if today < e.timestamp]

In [None]:
def plot_sunburst(events):
    plt.figure(figsize=(6, 6))
    aw_research.classify._plot_category_hierarchy_sunburst(events)
    display(HTML(f"<h2>Duration: {sum((e.duration for e in events), timedelta(0))}</h2>"))

In [None]:
plot_sunburst(events_today)

In [None]:
plot_sunburst([e for e in events if today - timedelta(days=30) < e.timestamp])

In [None]:
plot_sunburst(events)

## Fictional wage plot

Prioritizing things in life can be hard, and it's not uncommon to want to maximize how much you earn. But how much is working on project X actually worth to you in monetary terms? What about project Y?

By assigning hourly wages to different categories we can plot which activities we've earned the most (fictional) money from! This can help you identify how much you expect to have earned both from different activities and in total.
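
The arithmetic is just hours times hourly wage, summed over categories. A tiny sketch with made-up numbers (the hours, wages, and the name `fictional_earnings` are all invented for illustration):

```python
from datetime import date
from typing import Dict

# Hypothetical hours per day and category (made-up numbers)
hours_per_day: Dict[date, Dict[str, float]] = {
    date(2019, 5, 29): {'ActivityWatch': 2.0, 'School': 1.0},
    date(2019, 5, 30): {'ActivityWatch': 0.5, 'School': 4.0},
    date(2019, 5, 31): {'ActivityWatch': 3.0, 'Media': 2.0},
}
example_wages = {'ActivityWatch': 300, 'School': 600}


def fictional_earnings(hours_by_category: Dict[str, float], wages: Dict[str, float]) -> float:
    """Sum hours * hourly wage over the categories that have a wage assigned."""
    return sum(hours * wages[cat] for cat, hours in hours_by_category.items() if cat in wages)


earnings_per_day = {
    day: fictional_earnings(hours, example_wages) for day, hours in hours_per_day.items()
}
```

Categories without a wage (like `Media` here) simply contribute nothing to the total.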

In [None]:
category_wages = {
    #"Work": 200,
    "ActivityWatch": 300,
    "QuantifiedMe": 300,
    "Thankful": 400,
    "School": 600,
    #"Finance": 1000,
    #"Maths": 400,
    #"Control": 400,
}

def plot_wages(events, category_wages):
    df = pd.DataFrame()
    for cat, wage in category_wages.items():
        df[cat] = wage * categorytime_per_day(events, cat)
    df.plot.area(label='total', stacked=True, legend=True, figsize=(16, 5))
    df.sum(axis=1).rolling(7).mean().plot(label='Total 7d SMA', legend=True)
    df.sum(axis=1).rolling(30).mean().plot(label='Total 30d SMA', legend=True)
    plt.xlim(pd.Timestamp(since), pd.Timestamp(now))
    plt.grid(linestyle='-.')
    plt.tight_layout()
    
plot_wages(events, category_wages)

## Uncategorized

In [None]:
def time_per_keyval(events, key):
    vals = defaultdict(lambda: timedelta(0))
    for e in events:
        if key in e.data:
            vals[e.data[key]] += e.duration
        else:
            vals[f'key {key} did not exist'] += e.duration
    return vals

def print_time_per_keyval(events, key):
    from tabulate import tabulate
    l = sorted([(v, k) for k, v in time_per_keyval(events, key).items()], reverse=True)
    print(tabulate(l[:20], headers=['time', 'val']))
    
events_uncategorized = [e for e in events if 'Uncategorized' in e.data['$tags']]
print_time_per_keyval(events_uncategorized, 'title')

In [None]:
events_uncategorized_today = [e for e in events_uncategorized if e.timestamp > today]
print_time_per_keyval(events_uncategorized_today, 'title')

In [None]:
events_programming = [e for e in events if 'Work -> Programming' == e.data['$category_hierarchy']]
print_time_per_keyval(events_programming, 'title')

In [None]:
print_time_per_keyval(events, '$source')

## That's the end of the notebook!

Thank you for checking it out! 

I hope you'll upvote and/or comment wherever you saw it to help it get seen!

 - TODO: Add "cover image" to give some sort of preview

### Did you like it? Consider supporting us so we can keep building!

 - TODO: Add link/image/button to Patreon
 - Like ActivityWatch on AlternativeTo! (TODO: ...and ProductHunt)
 - Post about it on Twitter!

### Run it yourself!

You can run this notebook with your own data, it's a lot more fun! Check out the repo for details: https://github.com/ErikBjare/quantifiedme

 - TODO: Actually add details for how to run it in the README
 - TODO: Add prominent link/button to download ActivityWatch
 
### Other interesting links

 - [Memento Labs](https://mementolabs.io/), a platform for self-study using quantified self data.

### Thanks to

 - [Johan Bjäreholt](https://github.com/johan-bjareholt), for his amazing contributions, and for working on ActivityWatch with me for so long. **This wouldn't be possible without him.**
 - TODO: The reviewers
 - All the other contributors, whose [stats are listed here](http://activitywatch.net/contributors/).
 - Our Patrons/backers/supporters, your financial contribution means a lot!
 - [@karpathy](https://twitter.com/karpathy) for creating [ulogme](https://github.com/karpathy/ulogme), a spiritual ancestor of ActivityWatch
 - Our users, you motivate us to keep working!
 
 
 ## TODO: Post to
 
  - Reddit
  - Hacker News
  - Twitter
  - ActivityWatch Forum (under the 'Projects' category)