# Email Wrapped: Template

This notebook is a generic template for analyzing email metadata and summarizing activity. All code is unchanged from the original; only the narrative text has been simplified and de-personalized.

Use it as a starting point for generating your own annual "Wrapped" summary. I translated it from Dutch to English as part of a small project to create a "Wrapped" of an email thread with some friends, so a few Dutch artifacts may remain here and there.

## Load mailbox data

Parse the mailbox archive to extract message metadata (sender, recipients, subject, date, and size). The code reads from a local `.mbox` file and builds a DataFrame for analysis.

In [None]:
import sys
sys.path.append("..")

import pandas as pd
from src.load_mbox import load_mbox
df = load_mbox("../data/wrapped")
len(df)
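
`load_mbox` is imported from `src.load_mbox` and is not shown in this notebook. As a rough idea of what such a loader might do, here is a minimal sketch built on the standard-library `mailbox` module. The function name `mbox_to_records` and the dictionary keys are assumptions for illustration, not the project's actual implementation; the real loader also provides the `weekday`, `hour`, `opening`, `body`, and `date_local` columns used further down.

```python
import mailbox
from email.utils import getaddresses, parseaddr, parsedate_to_datetime


def mbox_to_records(path):
    """Hypothetical loader: return one metadata dict per message in an mbox file."""
    box = mailbox.mbox(path, create=False)  # raise instead of creating a missing file
    records = []
    try:
        for msg in box:
            raw_date = msg.get("Date")
            try:
                date = parsedate_to_datetime(raw_date) if raw_date else None
            except (TypeError, ValueError):
                date = None  # malformed Date header
            name, address = parseaddr(str(msg.get("From", "")))
            to_headers = [str(h) for h in msg.get_all("To", [])]
            records.append(
                {
                    "sender": name or address,  # fall back to the address if no display name
                    "recipients": [addr for _, addr in getaddresses(to_headers)],
                    "subject": str(msg.get("Subject", "")),
                    "date": date,
                    "size": len(msg.as_bytes()),  # size of the serialized message in bytes
                }
            )
    finally:
        box.close()
    return records
```

Passing the result to `pd.DataFrame(records)` would give a table with one row per message.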

In [None]:
leaderboard = (
    df["sender"]
    .value_counts()
    .rename_axis("Colleague")  # name the index explicitly; works across pandas versions
    .reset_index(name="No. of Mails")
)

leaderboard.style.hide(axis="index")

## Activity by time

Visualize messages by weekday and hour to show when activity is highest. The charts are generic and carry no personal annotations.
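
The cells in this section rely on `weekday` and `hour` columns that the loader already provides. If your own DataFrame only has a timestamp, the two columns could be derived roughly as follows. This is a sketch on toy data; it assumes the timestamp column is called `date_local` (the name used later in this notebook) and holds pandas datetimes.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (three made-up timestamps).
toy = pd.DataFrame(
    {
        "date_local": pd.to_datetime(
            ["2024-01-01 08:15", "2024-01-02 17:40", "2024-01-06 23:05"]
        )
    }
)

toy["weekday"] = toy["date_local"].dt.day_name()  # English day names, e.g. "Monday"
toy["hour"] = toy["date_local"].dt.hour           # integer hour of the day, 0-23
```

Note that the weekday chart reindexes against a Monday-to-Friday list, so a Saturday message like the third one here would not appear in it unless that list is extended.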

In [None]:
from IPython.display import HTML, display
import matplotlib.pyplot as plt
import base64
from io import BytesIO

# Working days only: weekend messages are dropped by the reindex below
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

weekdays = (
    df["weekday"]
    .value_counts()
    .reindex(order)
    .rename("No. of Mails")
    .to_frame()
)

# plot → base64 img
fig, ax = plt.subplots(figsize=(5,3))
weekdays.plot(kind="bar", legend=False, ax=ax)
ax.set_title("Per day")
ax.set_xlabel("Day")
ax.set_ylabel("No. of Mails")
plt.tight_layout()

buf = BytesIO()
plt.savefig(buf, format="png")
plt.close(fig)
img = base64.b64encode(buf.getvalue()).decode()

html = f"""
<div style="display:flex; gap:40px; align-items:flex-start">
  <div>{weekdays.to_html()}</div>
  <div><img src="data:image/png;base64,{img}"/></div>
</div>
"""

display(HTML(html))


In [None]:
hours = (
    df["hour"]
    .value_counts()
    .sort_index()
    .rename("No. of Mails")
    .to_frame()
)

# plot → base64 img
fig, ax = plt.subplots(figsize=(5,3))
hours.plot(kind="bar", legend=False, ax=ax)
ax.set_title("Per hour")
ax.set_xlabel("Hour of the day")
ax.set_ylabel("No. of Mails")
plt.tight_layout()

buf = BytesIO()
plt.savefig(buf, format="png")
plt.close(fig)
img = base64.b64encode(buf.getvalue()).decode()

html = f"""
<div style="display:flex; gap:40px; align-items:flex-start">
  <div>{hours.to_html()}</div>
  <div><img src="data:image/png;base64,{img}"/></div>
</div>
"""

display(HTML(html))



## Early bird and late worker

Who sends emails earliest and latest in the day, on average?

In [None]:
avg_hour = (
    df.groupby("sender")["hour"]
    .mean()
    .round(1)
    .sort_values()
    .rename("Avg. hour")
    .to_frame()
)

early_bird = avg_hour.index[0], avg_hour.iloc[0,0]

early_bird[0]

In [None]:
late_worker = avg_hour.index[-1], avg_hour.iloc[-1,0]
late_worker[0]
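
A plain mean per sender can be dominated by someone who sent a single email at an unusual hour. One possible refinement, sketched here on toy data, is to require a minimum number of messages before a sender is eligible. The threshold `min_mails` and the toy names are assumptions for illustration only.

```python
import pandas as pd

# Toy data: Cas sent a single very early email.
toy = pd.DataFrame(
    {
        "sender": ["Ann", "Ann", "Ann", "Bob", "Bob", "Bob", "Cas"],
        "hour": [9, 10, 11, 14, 15, 16, 6],
    }
)

min_mails = 3  # hypothetical threshold; tune to the size of your thread

stats = toy.groupby("sender")["hour"].agg(["mean", "count"])
eligible = stats[stats["count"] >= min_mails].sort_values("mean")

early_bird = eligible.index[0]    # lowest average hour among eligible senders
late_worker = eligible.index[-1]  # highest average hour among eligible senders
```

Without the threshold, Cas would be the early bird on the strength of one message.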

## Most common openers



In [None]:
openings = (
    df["opening"]
    .value_counts()
    .head(15)
    .rename("Count")
    .to_frame()
)

openings

Some of the rarest openers:

In [None]:
openings = (
    df["opening"]
    .value_counts()
    .tail(10)
    .rename("Count")
    .to_frame()
)

openings
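
The `opening` column comes from the loader, so how it is computed is not visible here. One simple way to approximate it is to take the first non-empty line of each body and normalize it. The helper name `extract_opening` and the normalization rules below are assumptions, not the loader's actual logic.

```python
import re


def extract_opening(body):
    """Hypothetical helper: first non-empty body line, lowercased, trailing punctuation removed."""
    if not isinstance(body, str):
        return None  # missing body (None or NaN)
    for line in body.splitlines():
        line = line.strip()
        if line:
            # strip trailing commas, exclamation marks, etc. so "Hi all," and "Hi all!" match
            return re.sub(r"[\s,!.:;]+$", "", line).lower()
    return None
```

With a `body` column in place, `df["body"].apply(extract_opening)` would produce a column comparable to `opening`.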

## Wordcloud

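The most common topics discussed. Bodies are cleaned first so that only the text of each email itself remains; quoted replies, forwarded headers, links, and addresses are stripped.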

In [None]:
import re
import nltk
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords

nltk.download("stopwords")

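def clean_for_wordcloud(text, max_lines=60, drop_last=10):
    """Return the body text without quoted replies, headers, links, or addresses."""
    if not isinstance(text, str) or not text:
        return ""  # also covers missing bodies (None / NaN)

    lines = text.splitlines()[:max_lines]

    out = []
    for ln in lines:
        s = ln.strip()
        # stop at the first quoted line or reply/forward header (English or Dutch)
        if s.startswith(">"):
            break
        if re.search(r"^On .*wrote:", ln) or re.search(r"^Op .*schreef", ln):
            break
        if re.search(r"^(From|Sent|To|Subject|Van|Verzonden|Aan|Onderwerp):\s", ln, flags=re.IGNORECASE):
            break
        out.append(ln)

    # heuristic for longer mails: drop the greeting (first 2 lines)
    # and the signature (last `drop_last` lines)
    if drop_last > 0 and len(out) > drop_last:
        out = out[2:-drop_last]

    text = " ".join(out)
    # remove mailto links, URLs, and bare email addresses, then collapse whitespace
    text = re.sub(r"\bmailto:\S+\b", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\bhttps?://\S+\b", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\b\S+@\S+\b", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text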

# clean bodies
clean_body = df["body"].apply(clean_for_wordcloud)
text = " ".join(clean_body.tolist())

# stopwords
stopwords_wc = set(STOPWORDS)
stopwords_wc.update(stopwords.words("dutch"))
stopwords_wc.update(["re", "fw", "fwd"])

# optional: remove names
stopwords_wc.update(
    w.lower()
    for w in " ".join(df["sender"])
    .replace(",", " ")
    .replace("(", " ")
    .replace(")", " ")
    .split()
)

# wordcloud
wc = WordCloud(
    width=1200,
    height=600,
    background_color="white",
    stopwords=stopwords_wc
).generate(text)

plt.figure(figsize=(12,6))
plt.imshow(wc)
plt.axis("off")
plt.show()


## Most common words used

Words of four or more letters, with stop words filtered out.

In [None]:
from collections import Counter
import re

words = []
for t in clean_body:
    ws = re.findall(r"[A-Za-zÀ-ÿ']{4,}", t.lower())
    words.extend(w for w in ws if w not in stopwords_wc)

Counter(words).most_common(20)


## Further fun data

### Longest email drought

Longest period without an email on the thread:

In [None]:
df_sorted = df.sort_values("date_local").reset_index(drop=True)
df_sorted["gap"] = df_sorted["date_local"].diff()

i = df_sorted["gap"].idxmax()

gap = df_sorted.loc[i, "gap"]
left_hanging = df_sorted.loc[i-1, "sender"]
breaker = df_sorted.loc[i, "sender"]
start = df_sorted.loc[i-1, "date_local"]
end = df_sorted.loc[i, "date_local"]

print(
    f"Longest drought: {gap.days} days\n"
    f"Left hanging: {left_hanging} ({start.strftime('%d %B %Y %H:%M')})\n"
    f"Silence broken by: {breaker} ({end.strftime('%d %B %Y %H:%M')})"
)



### Longest mail on thread

Sender and word count of the longest email, measured on the cleaned body.

In [None]:
df["clean_body"] = df["body"].apply(lambda t: clean_for_wordcloud(t, max_lines=80, drop_last=5))
df["clean_words"] = df["clean_body"].str.split().str.len()
i = df["clean_words"].idxmax()

name = df.loc[i, "sender"]
words = int(df.loc[i, "clean_words"])

print(f"{name}: {words} words")

### Most mentioned colleague

Number of times each participant's first name is mentioned in the cleaned email bodies.

In [None]:
def first_token(name):
    s = re.sub(r"[^\w\sÀ-ÿ]", " ", str(name).lower())
    return s.split()[0] if s.split() else None

tokens = [first_token(x) for x in df["sender"].unique()]
tokens = [t for t in tokens if t]

mentions = {}
text = df["clean_body"].str.lower().fillna("")

for t in tokens:
    mentions[t] = int(text.str.count(rf"\b{re.escape(t)}\b").sum())

mentions_df = (
    pd.DataFrame(
        sorted(mentions.items(), key=lambda x: x[1], reverse=True),
        columns=["Colleague", "Mentions"]
    )
)

mentions_df.style.hide(axis="index")
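
The `\b` word boundaries in the pattern above matter: without them, a short name would also be counted inside longer words. A small sketch on made-up text shows the difference (the sample sentences and the name `jan` are purely illustrative).

```python
import re

import pandas as pd

bodies = pd.Series(["thanks jan, see the january plan", "jan and janneke agreed"])
name = "jan"

# whole-word matches only, as in the mentions count
whole_word = int(bodies.str.count(rf"\b{re.escape(name)}\b").sum())

# naive substring count also hits "january" and "janneke"
substring = int(bodies.str.count(re.escape(name)).sum())

print(whole_word, substring)
```

Names that double as ordinary words will still be over-counted, so the resulting table is best read as an approximation.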

