"[The Garden of Forking Paths](https://en.wikipedia.org/wiki/The_Garden_of_Forking_Paths)" (original Spanish title: "El jardín de senderos que se bifurcan") is a 1941 short story by Argentine writer and poet Jorge Luis Borges with a theme that has been said to foreshadow the many worlds interpretation of quantum mechanics.
Borges's vision of "forking paths" has also been cited as inspiration by numerous new media scholars, in particular within the field of hypertext fiction.

The story describes a book in which different versions of reality play out with people swapping roles between them.
In a weird way this reminded me of the Kaggle forums where an increasing number of users write about how it seems unfair that Notebook forks often outnumber votes.
Because the [general forums](https://www.kaggle.com/general) have many topics and an infinite scroll mechanism with slow loading times, people don't tend to see the <mark>past</mark> messages on this subject, so they write their own, resulting in this flow of messages all rephrasing the same thing!

Of course "forks" are mentioned in a lot of other contexts too: here they are, **all** mentions of the word "fork" on the Kaggle forums, a hypertext adventure where you can inspect the <mark>past</mark>, click around and jump into old threads, upvote the sentiments you agree with, or add new points of view from a 2021 perspective!

Feel free to fork it to try other search terms e.g. [shake-up](https://www.kaggle.com/jtrotman/shakeup-the-story-so-far) :-)

And feel free to mention this notebook on the forums: if you do, your comment will appear below in <mark>future</mark> runs of this notebook itself.

________

#### Alternative Titles

- Fast Forum Search

- What Do We Say When We Talk About Forking?

- Fork Talk

- How To Generate an Index of Past Discussions That Took Place on the Online Forums of the Kaggle Data Science Platform on the Subject of Copying and Editing Notebooks, Scripts or Kernels, using the Generously Provided [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) Dataset



In [1]:
%%HTML
<style>
.gold { background-color:#ffcc55; font-weight:bold; }
.silver { background-color:#99ccee; font-weight:bold; }
.bronze { background-color:#bbaabb; font-weight:bold; }
.bagel { background-color:#ffffff; color: #dddddd; }
</style>

# Settings

You can alter the search word, the highlight patterns and the display options below.
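
As a rough illustration of what an entry in the `replacements` list does, the sketch below applies the commented-out shake-up pattern with plain `re` (the notebook itself passes each entry to pandas `str.replace`). The sample strings are made up.

```python
import re

# Same pattern as the commented-out example entry: join "shake" and "up"
# when they are separated by whitespace and/or hyphens.
pattern = re.compile(r'(shake)[\s-]+(up)', flags=re.IGNORECASE)

samples = ['shake up', 'Shake-Up', 'shakeup', 'shake - up']  # made-up inputs
normalised = [pattern.sub(r'\1\2', s) for s in samples]
print(normalised)
```

Normalising spelling variants this way lets a single `target_word` such as `shakeup` catch all of them in the later search.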

In [2]:
target_word = 'fork'
highlights = {
    r'(\w*vote|\w*voting)' : '#ffff99',
    r'(notebook|kernel|script)' : '#99ffff',
}

replacements = [
    # example usage:
    #dict(pat=r'(shake)[\s-]+(up)', repl=r'\1\2', case=False, regex=True)
]

# Number of characters to show either side of the central search column
n_context_chars = 40

# Must specify these in the <pre> tag.
# (Without it you get white text on white background.)
background = 'white'
color = 'black'

# Import

In [3]:
import re
import html
import calendar
import unidecode
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from bs4 import BeautifulSoup
from collections import Counter
from IPython.display import HTML, display

MK = Path('../input/meta-kaggle')
NROWS = None

pd.options.display.max_rows = 200

Settings you can try altering, though they are probably best left as they are

In [4]:
digits = '⓪①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳㉑㉒㉓㉔㉕㉖㉗㉘㉙㉚㉛㉜㉝㉞㉟㊱㊲㊳㊴㊵㊶㊷㊸㊹㊺㊻㊼㊽㊾㊿🌟'
digit_codes = np.asarray(list(digits))
medal_classes = np.asarray([ 'bagel', 'gold', 'silver', 'bronze' ])
pad = n_context_chars * ' '
pre_style = f'background:{background}; color:{color}; white-space: pre;'
medal_mark_tags = np.asarray([f'<mark class="{c}">' for c in medal_classes])

In [5]:
# Highlight words in `series` that match the patterns in `highlights`.
# Using <mark> instead of <font> ruins the text alignment in Firefox.
def highlight(series):
    for term, highlight_color in highlights.items():
        pattern = r'(\b%s\w*)' % (term,)
        frep = lambda m: f'<font style="background-color:{highlight_color};">{m.group(1)}</font>'
        # regex=True is required for a callable replacement (the default became False in pandas 2.0)
        series = series.str.replace(pattern, frep, case=False, regex=True)
    return series


# clean for use in html attr="text" using double quotes
def clean_for_html_attr(txt):
    return html.escape(txt, quote=True)


# replace unicode, remove quotes and normalise spaces
def clean_for_line_display(txt):
    txt = unidecode.unidecode(txt)
    txt = re.sub(r'\[quote.*\[/quote\]', ' ', txt, flags=re.S)
    txt = re.sub(r'\s+', ' ', txt)
    return txt


def parse_html(r):
    txt = BeautifulSoup(r, 'html').text
    return clean_for_html_attr(txt), clean_for_line_display(txt)


def read_csv(csv, **kwargs):
    return pd.read_csv(MK / csv, **kwargs)

# Read

In [6]:
comps = read_csv('Competitions.csv')
comps = comps.dropna(subset=['ForumId'])
comps = comps.drop_duplicates(subset=['ForumId'], keep='last')
comps['ForumId'] = comps['ForumId'].astype(int)
comps = comps.set_index('ForumId')

kdf = read_csv('Kernels.csv').set_index('Id')

users = read_csv('Users.csv').set_index('Id')
kdf = kdf.join(users[['UserName']].add_prefix('Kernel'), on='AuthorUserId')

forums = read_csv('Forums.csv').set_index('Id')

topics = read_csv('ForumTopics.csv').set_index('Id')

msgs = read_csv('ForumMessages.csv', nrows=NROWS)
msgs = msgs.dropna(subset=['Message'])
msgs = msgs.set_index('Id')
msgs = msgs.sort_index()

In [7]:
for opt in replacements:
    msgs['Message'] = msgs.Message.str.replace(**opt)

See [Issue: ForumMessageVotes contents duplicated](https://www.kaggle.com/kaggle/meta-kaggle/discussion/181883): the votes file contains duplicated rows, so `drop_duplicates` must be used before counting votes.
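
A minimal sketch of why the de-duplication matters, using made-up toy frames in place of the real CSV files (only the `Id` and `ForumMessageId` column names are taken from the notebook):

```python
import pandas as pd

# Toy stand-in for ForumMessageVotes.csv: vote Id 2 is listed twice.
votes = pd.DataFrame({'Id': [1, 2, 2, 3],
                      'ForumMessageId': [10, 10, 10, 11]})

# Toy stand-in for the messages table, indexed by message Id.
msgs = pd.DataFrame({'Message': ['a', 'b', 'c']},
                    index=pd.Index([10, 11, 12], name='Id'))

naive_counts = votes.ForumMessageId.value_counts()   # counts the duplicate too

votes = votes.drop_duplicates(subset=['Id']).set_index('Id')

# value_counts() is indexed by message Id, so assignment aligns on msgs.index;
# messages without votes come out as NaN and are filled with 0.
msgs['Votes'] = votes.ForumMessageId.value_counts()
msgs['Votes'] = msgs['Votes'].fillna(0).astype(int)
print(msgs)
```

Without `drop_duplicates`, message 10 in this toy data would be credited with three votes instead of two.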

In [8]:
votes = read_csv('ForumMessageVotes.csv')
votes = votes.drop_duplicates(subset=['Id'])
votes = votes.set_index('Id')

msgs['Votes'] = votes.ForumMessageId.value_counts()
msgs['Medal'] = msgs['Medal'].fillna(0).astype(int)
msgs['Votes'] = msgs['Votes'].fillna(0).astype(int)
msgs['VoteIcon'] = msgs['Votes'].clip(0, 50) # use 51 for the 🌟 icon

# Search

First: a fast scan

In [9]:
idx = msgs.Message.str.contains(target_word, case=False)
idx.sum()

Run `parse_html` only on messages that contain our key search word, to avoid parsing every message
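
The scan-then-parse pattern can be sketched on toy data as follows. `strip_tags` is a hypothetical, much cruder stand-in for `parse_html`, and the messages are made up.

```python
import re
import pandas as pd

msgs = pd.DataFrame({'Message': ['<p>Please <b>fork</b> it</p>',
                                 '<p>Nice work</p>',
                                 '<p>Forked!</p>']})

def strip_tags(markup):
    """Hypothetical stand-in for parse_html: drop anything that looks like a tag."""
    return re.sub(r'<[^>]+>', '', markup)

# Cheap substring scan over the raw markup first ...
idx = msgs.Message.str.contains('fork', case=False)

# ... then the expensive step only on the matching rows.
msgs.loc[idx, 'Clean'] = msgs.loc[idx, 'Message'].apply(strip_tags)
print(msgs)
```

Rows that did not match are left as missing values in the new column, so they drop out of the later text search.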

In [10]:
%%time
parsed = msgs.loc[idx, 'Message'].apply(parse_html)
msgs.loc[idx, 'Preview'] = parsed.str[0]
msgs.loc[idx, 'Clean'] = parsed.str[1]

Regex for actual text search

In [11]:
term = '(.{,%d})(%s)(.{,%d})' % (n_context_chars, target_word, n_context_chars)
term

Do the search - `res` has one row per hit
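
To show what one hit looks like, here is a small sketch that applies the same three-group pattern to a single made-up sentence with plain `re` (the notebook uses pandas `str.extractall`, which applies the pattern to every message). The context width is shortened to 10 for readability.

```python
import re

n_context_chars = 10          # shortened for the example
target_word = 'fork'

# Three groups: left context, the search word, right context.
term = '(.{,%d})(%s)(.{,%d})' % (n_context_chars, target_word, n_context_chars)
pattern = re.compile(term, re.IGNORECASE)

text = 'I will fork this notebook and upvote it. Forks often outnumber votes.'
hits = [m.groups() for m in pattern.finditer(text)]
for left, word, right in hits:
    print(repr(left), repr(word), repr(right))
```

Each tuple corresponds to one row of `res`. Matches cannot overlap, so a second occurrence that falls inside the right-hand context of an earlier hit is shown as context rather than reported as its own hit.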

In [12]:
res = msgs.Clean.str.extractall(term, re.IGNORECASE)
res = res.fillna('').reset_index()
res.shape

In [13]:
# Sneak preview
res.head()

# Tend

Fix widths of context

In [14]:
res[0] = (pad + res[0]).str[-n_context_chars:]
res[2] = (res[2] + pad).str[:n_context_chars]

Replace special chars with html entities, which will render as one character
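
The order of these two steps matters: widths are fixed on the raw text first, and only then are characters escaped, so an entity such as `&amp;` still occupies a single column when rendered. A small sketch with a made-up context string:

```python
import html

n_context_chars = 12
pad = n_context_chars * ' '

left = 'R&D teams '                        # made-up left context, 10 characters
fixed = (pad + left)[-n_context_chars:]    # right-aligned to exactly 12 characters
escaped = html.escape(fixed)               # '&' becomes '&amp;'

print(len(fixed), len(escaped))
print(html.unescape(escaped) == fixed)     # what the browser displays
```

The escaped string is longer in the markup, but it displays at the fixed width, which keeps the central column aligned inside the `<pre>` block.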

In [15]:
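# Series.map per column works across pandas versions (DataFrame.applymap is deprecated in newer releases).
res[[0, 2]] = res[[0, 2]].apply(lambda col: col.map(html.escape))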

Add fields into results

In [16]:
res = res.join(msgs, on='Id')

In [17]:
res = res.join(users, on='PostUserId')

In [18]:
res = res.join(topics, on='ForumTopicId')

In [19]:
res = res.join(comps.rename(columns={'Title': 'CompetitionTitle', 'Id': 'CompetitionId'}), on='ForumId')

In [20]:
res = res.join(kdf[['KernelUserName', 'CurrentUrlSlug']], on='KernelId')

In [21]:
res.Title = res.Title.fillna('<missing>').apply(clean_for_html_attr)

In [22]:
res.PostDate = pd.to_datetime(res.PostDate)

# Link

Making links to actual contents:

- If `KernelId` is set: `KernelUserName/CurrentUrlSlug/comments`
- If `ForumId` is in `comps`: `c/Slug/discussion/[topicID]`
- If `ParentForumId==9`: a custom URL prefix for each forum. The `general_forums` dict defined below holds the de facto URL parts for discussions in the *general* forums.
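
The priority of these rules can be sketched for a single hit as a plain function. `build_url` is a hypothetical helper, not part of the notebook, and the ids, slugs and user names in the example rows are made up. The notebook applies the same rules column-wise with pandas.

```python
# Subset of the notebook's forum-id mapping, repeated so the sketch is self-contained.
general_forums = {15: 'general', 208: 'getting-started'}

def build_url(row, comp_slugs):
    """Hypothetical helper: path after https://www.kaggle.com/ for one hit."""
    if row.get('KernelId') is not None:
        # Comment on a notebook: link to its comments page.
        return f"{row['KernelUserName']}/{row['CurrentUrlSlug']}/comments"
    if row['ForumId'] in comp_slugs:
        # Competition forum topic.
        return f"c/{comp_slugs[row['ForumId']]}/discussion/{row['ForumTopicId']}"
    # General forums, falling back to 'data' as the notebook does.
    prefix = general_forums.get(row['ForumId'], 'data')
    return f"{prefix}/{row['ForumTopicId']}"

comp_slugs = {1234: 'some-competition'}   # made-up ForumId -> Slug

kernel_hit = {'KernelId': 1, 'KernelUserName': 'someuser',
              'CurrentUrlSlug': 'some-notebook', 'ForumId': 99, 'ForumTopicId': 5}
comp_hit = {'KernelId': None, 'ForumId': 1234, 'ForumTopicId': 6}
general_hit = {'KernelId': None, 'ForumId': 15, 'ForumTopicId': 7}

urls = [build_url(hit, comp_slugs) for hit in (kernel_hit, comp_hit, general_hit)]
print(urls)
```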

In [23]:
general_forums = {
    15: 'general',
    208: 'getting-started',
    809: 'product-feedback',
    2239: 'questions-and-answers',
    2241: 'data',
    17686: 'learn-forum',
}

res['Forum'] = res.ForumId.map(general_forums)

In [24]:
res['Url'] = (res.ForumId.map(general_forums).fillna('data') + '/' +
              res.ForumTopicId.map(str))

### Competition URLs

In [25]:
idx = res.ForumId.isin(comps.index)
idx.sum()

In [26]:
res.loc[idx, 'Url'] = ('c/' + res.loc[idx, 'Slug'] + '/discussion/' +
                       res.ForumTopicId.map(str))

### Kernel URLs

In [27]:
idx = ~res.KernelId.isnull()
idx.sum()

In [28]:
res.loc[idx, 'Url'] = (res.loc[idx, 'KernelUserName'] + '/' +
                       res.loc[idx, 'CurrentUrlSlug'] + '/comments')

Key for the medal colors:


In [29]:
color_key_list = [f'<mark class={c}>{c}</mark>' for c in medal_classes]
HTML('Medal Colors: ' + ' '.join(color_key_list))

# Format

Display all matches with a link to the actual post and verbose "title" attribute that summarises it:

- Topic Title
- Competition | Kernel | Dataset
- User
- Date
- Message itself!
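
Optional lines in the tooltip rely on missing values propagating through string concatenation: a prefix added to a missing value stays missing, and `fillna('')` then removes the whole line. A minimal sketch with made-up titles (only `prep` is copied from the notebook):

```python
import numpy as np
import pandas as pd

def prep(series):
    # Same helper as in the notebook: append a newline, blank out missing rows.
    return (series + '\n').fillna('')

title = pd.Series(['First topic', 'Second topic'])
competition = pd.Series(['Some Competition', np.nan])   # second hit has no competition

tooltip = prep('Topic: ' + title) + prep('Competition: ' + competition)
for text in tooltip:
    print(repr(text))
```

The second tooltip contains only the topic line, with no dangling `Competition:` prefix.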


In [30]:
def prep(series):
    return (series + '\n').fillna('')


# Make a list of arrays and pd.Series; each entry is part of a line.
# Some fields are optional e.g. res.CompetitionTitle will be N/A for some rows.
# So add prefix and newline and do fillna('') separately - N/A values result in no line
parts = [
    # First entry has to be a series
    pd.Series(medal_mark_tags[res.Medal]),
    pd.Series(digit_codes[res.VoteIcon]),
    '</mark> ',
    highlight(res[0]),  # context: left
    '<a title="',
    prep('Topic: ' + res.Title),
    prep('Forum: ' + res.Forum),
    prep('Competition: ' + res.CompetitionTitle),
    prep('Kernel: ' + res.CurrentUrlSlug + ' by ' + res.KernelUserName),
    prep('User: ' + res.DisplayName + ' [' + res.UserName + ']'),
    prep('Date: ' + res.PostDate.dt.strftime('%c')),
    '\n',
    prep('Message:\n' + res.Preview),
    '" href="https://www.kaggle.com/',
    prep(res['Url'] + '#' + res.Id.map(str)),  # message Id for link anchor
    '">',
    res[1],  # middle: search term
    '</a>',
    highlight(res[2]),  # context: right
]

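# This joins the series & arrays into one series, adding the parts left to right.
# (A plain loop avoids building a ragged object array from the list,
# which newer NumPy versions may reject in np.add.reduce(parts).)
lines = parts[0]
for part in parts[1:]:
    lines = lines + part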
lines = lines.to_frame('Text')
lines['Date'] = res.PostDate
lines['Year'] = res.PostDate.dt.year
lines['Month'] = res.PostDate.dt.month

# Results

Mouse over the central word for a preview of the full message.

In [31]:
markup = ''
for year, year_df in lines.groupby('Year'):
    markup += f'<h1 id="{year}">{year}</h1>'
    if len(year_df) >= 20:
        # separate month headings
        for month, month_df in year_df.groupby('Month'):
            contents = '\n'.join(month_df['Text'])
            markup += f'<h1 id="{year}-{month}">{year} &mdash; {calendar.month_name[month]}</h1>'
            markup += f'<pre style="{pre_style}">{contents}</pre>'
    else:
        # year as one block
        contents = '\n'.join(year_df['Text'])
        markup += f'<pre style="{pre_style}">{contents}</pre>'
HTML(markup)

# Stats

In [32]:
plt.rc('figure', figsize=(12, 9))
plt.rc('font', size=14)

In [33]:
uniq = res.drop_duplicates(subset=['Id'])

In [34]:
uniq.PostDate.dt.year.value_counts().sort_index()

In [35]:
uniq.PostDate.hist(bins=50)
plt.title(f'Mentions of "{target_word}" on Kaggle Forums');

In [36]:
gb = uniq.groupby('PostDate')
votes = gb.Votes.sum().cumsum()
posts = gb.size().cumsum().rename('Posts')

fig, axl = plt.subplots()
axr = axl.twinx()
votes.plot(title=f'Cumulative Mentions of "{target_word}" on Kaggle Forums', ax=axl)
posts.plot(ax=axl)
(votes/posts).rolling(20).mean().plot(label='Votes per post', ax=axr, c='g')
axl.grid()
axl.set_ylabel('Counts')
axr.set_ylabel('Vote rate')
axl.legend(loc='lower left')
axr.legend(loc='lower right');

So there you have it, a literal pathway of "forks" through the Kaggle forums.

(Now that the Fork button has the label "Copy and Edit" we might miss some of these discussions in the future!)

________

#### See Also

First Fork!

https://www.kaggle.com/jtrotman/shakeup-the-story-so-far


In [37]:
_ = """
Re-run to include recent competitions:

    2021-06-23 | Slug:iwildcam2021-fgvc8
    2021-06-28 | Slug:coleridgeinitiative-show-us-the-data
    2021-07-05 | Slug:tabular-playground-series-jun-2021
    2021-08-10 | Slug:google-smartphone-decimeter-challenge
    2021-08-11 | Slug:commonlitreadabilityprize


"""