# [[Meta Kaggle] Meta Kaggle] Meta Kaggle

[Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) has been around a while now: let's use [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) to *introspect itself!*

## Background: Meta Kaggle

Quoting from the [Meta Kaggle Dataset page](https://www.kaggle.com/kaggle/meta-kaggle):

__________

<div style="background:#4fc3f7; padding: 2em;"><h2>Explore our public data on competitions, datasets, kernels and more</h2>
<p>Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.  </p>
<p>Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.  </p>
<p><img alt="Kaggle Leaderboard Performance" src="https://imgur.com/2Egeb8R.png"></p>
<p>This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.  </p>
<p>Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.</p></div>

__________


## Contents

 * [Dataset Size](#Dataset-Size)
 * [CreatorUserId](#CreatorUserId)
 * [Updates Over Time](#Updates-Over-Time)
 * [Time of Day for Updates](#Time-of-Day-for-Updates)
 * [Time Gaps](#Time-Gaps)
 * [Recurrence Plot](#Recurrence-Plot)
 * [Predicting Dates of Future Updates](#Predicting-Dates-of-Future-Updates)
 * [Notebooks](#Notebooks)
 * [Gold Notebooks 🥇](#Gold-Notebooks-🥇)
 * [Silver Notebooks 🥈](#Silver-Notebooks-🥈)
 * [Bronze Notebooks 🥉](#Bronze-Notebooks-🥉)
 * [Further Notebooks](#Further-Notebooks)
 * [Conclusions](#Conclusions)


In [1]:
from jt_mk_utils import *

In [2]:
import os, sys, re, time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from IPython.display import HTML, Image, display
import seaborn as sns
from datetime import datetime

In [3]:
MK = 'Meta Kaggle'
title = MK + ' Updates'

plt.rc('figure', figsize=(15, 9))
plt.rc('font', size=14)

In [4]:
ver = read_dataset_versions(index_col=0)
ver.shape

In [5]:
mk = ver.query("DatasetId==9").sort_index()

In [6]:
mk.nunique()

In [7]:
mk[['Title', 'Slug', 'Subtitle', 'CreationDate', 'VersionNotes']].tail(1)

In [8]:
users = read_users(filter=("Id", mk.CreatorUserId)).set_index("Id")
users.shape

# Dataset Size

It looks like `TotalUncompressedBytes` is always zero.

In [9]:
mk.TotalUncompressedBytes.value_counts()

In [10]:
mk.TotalCompressedBytes.value_counts()

# CreatorUserId

The automated updates are made by kaggleteam, and two regular users appear to make occasional manual versions.

In [11]:
mk.CreatorUserId.value_counts()

In [12]:
mk.CreatorUserId.value_counts().to_frame(MK + ' Updates').join(users).sort_index()

In [13]:
mk.plot.scatter('CreationDate', 'CreatorUserId', title=MK + ' CreatorUserId');

# Updates Over Time

From mid-2018, updates became daily.

In [14]:
mk.CreationDate.plot(title=title);
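
One way to put a date on "updates became daily" is to count versions per calendar month and look for the first month with roughly one version per day. The sketch below uses made-up timestamps rather than the real `mk.CreationDate`, and the threshold of 28 versions per month is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for mk.CreationDate: a few sparse manual versions,
# followed by two months of daily versions. Not the real Meta Kaggle data.
sparse = pd.Series(pd.to_datetime(["2017-01-10", "2017-06-02", "2018-02-20"]))
daily = pd.Series(pd.date_range("2018-07-01", "2018-08-31", freq="D"))
creation = pd.concat([sparse, daily], ignore_index=True)

# Count versions per calendar month.
per_month = creation.dt.to_period("M").value_counts().sort_index()

# Assumed rule of thumb: at least 28 versions in a month counts as "daily".
is_daily = per_month >= 28
first_daily_month = is_daily.idxmax() if is_daily.any() else None

print(per_month)
print("First daily month:", first_daily_month)
```

On the real data the same idea would apply to `mk.CreationDate` directly, provided the column is parsed as datetimes.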

In [15]:
hour_of_day = mk.CreationDate.dt.hour + (mk.CreationDate.dt.minute / 60)
mk = mk.assign(VersionCount=np.arange(len(mk))+1)
mk = mk.assign(Time=hour_of_day)
mk = mk.assign(TimeText=mk.CreationDate.dt.time)
mk = mk.assign(DateText=mk.CreationDate.dt.strftime("%c"))
mk = mk.assign(HasVersionNotes=~mk.VersionNotes.isnull())

# Time of Day for Updates



In [16]:
mk.plot.scatter('CreationDate', 'Time', title=title);

In [17]:
show = mk.fillna("<NA>")

***2018***
Updates become daily.

***2019***
The system was clearly changed on 17 December 2019: the update times changed and VersionNotes were added.

***2020***
It looks like it takes longer and longer to create the dataset: the CreationDate drifts later and later in the day.

***2021***
Something exploded on 4 March? The curve dies off, then the pattern changes completely.
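
The 2020 drift could be quantified by fitting a straight line to the time of day against the date. This is only a sketch on synthetic data: the drift rate, noise level and column values are invented, and a linear fit only makes sense while the times do not wrap past midnight.

```python
import numpy as np
import pandas as pd

# Synthetic data: one version per day whose time of day creeps later by
# 0.01 hours per day, plus a little noise. These are not the real values.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=300, freq="D")
hours = 6.0 + 0.01 * np.arange(300) + rng.normal(0, 0.05, 300)
toy = pd.DataFrame({"CreationDate": dates, "Time": hours})

# Days elapsed since the first version, as floats.
days = ((toy.CreationDate - toy.CreationDate.iloc[0]) / pd.Timedelta(days=1)).to_numpy()

# Least-squares line through (days elapsed, hour of day).
slope, intercept = np.polyfit(days, toy.Time.to_numpy(), deg=1)
print(f"Drift: {slope * 60:.2f} minutes later per day")
```

Applied to the notebook's data, the fit would need to be restricted to the 2020 portion of `mk`, since the other periods follow different patterns.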


In [18]:
px.scatter(show,
           'CreationDate',
           'Time',
           color='HasVersionNotes',
           hover_data=[
               'VersionNumber',
               'VersionCount',
               'DateText',
               'VersionNotes',
               'DatasourceVersionId',
           ],
           title=title)

# Time Gaps

In [19]:
gaps = mk.CreationDate.diff().dropna()

In [20]:
one_day = pd.Timedelta(days=1)

In [21]:
# Dividing a timedelta Series by a Timedelta yields float days
gaps_days = gaps / one_day

In [22]:
gaps_days[gaps_days <= 2].plot(title=MK + ': Days Between Versions');

# Recurrence Plot

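https://en.wikipedia.org/wiki/Recurrence_plot

Strictly speaking, the scatter here plots each gap against the next one, which is a lag plot (also known as a Poincaré plot), a simpler relative of the recurrence plot.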

In [23]:
subset = gaps_days[gaps_days <= 3.5]
plt.scatter(subset[:-1], subset[1:], c=range(len(subset[1:])))
plt.xlabel("Gap")
plt.ylabel("Next Gap")
plt.title(MK + ' Recurrence Plot');
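
For comparison, a recurrence matrix in the sense of the Wikipedia article marks every pair of positions whose values lie close together. The sketch below builds one with NumPy from made-up gaps; the threshold `eps` is an arbitrary assumption, and the gap values are not taken from Meta Kaggle.

```python
import numpy as np

# Hypothetical gaps between versions, in days (not the real data).
gaps = np.array([1.0, 1.02, 0.98, 2.0, 1.0, 1.01, 3.1, 1.0])
eps = 0.05  # assumed closeness threshold, in days

# R[i, j] is True when gap i and gap j are within eps of each other.
R = np.abs(gaps[:, None] - gaps[None, :]) <= eps

# The (gap, next gap) pairs used by the scatter plot, for contrast.
lag_pairs = np.column_stack([gaps[:-1], gaps[1:]])

print(R.astype(int))
```

The matrix could be displayed with `plt.imshow(R, cmap="binary", origin="lower")`; regular daily updates would show up as large filled blocks.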

# Predicting Dates of Future Updates

    TODO *



<div align=right><sub>*(not really!)</sub></div>


# Notebooks

In [24]:
sources = read_kernel_version_dataset_sources(filter=("SourceDatasetVersionId", mk.index))
sources.shape

In [25]:
use = [
    'Id', 'ScriptId', 'ParentScriptVersionId', 'ScriptLanguageId',
    'AuthorUserId', 'CreationDate', 'VersionNumber', 'Title'
]
ver = read_kernel_versions(filter=("Id", sources.KernelVersionId), usecols=use).set_index("Id")
ver.shape

In [26]:
# Data Issue: Some Titles in KernelVersions.csv are NaN
# Forward-fill within each notebook; fillna(method=...) is deprecated in recent pandas
titles = ver.groupby('ScriptId').Title.ffill()
ver.loc[ver.Title.isna(), 'Title'] = titles

In [27]:
kernels = read_kernels(filter=("Id", ver.ScriptId)).set_index("Id")
kernels.shape

In [28]:
UIDS = {v for v in kernels.AuthorUserId if v not in {859104}}
users = read_users(filter=("Id", UIDS)).set_index("Id")
users.shape

In [29]:
kernels.Medal.value_counts(dropna=False)

In [30]:
votes = read_kernel_votes().set_index("Id").sort_index()
votes.shape

In [31]:
votes = votes.join(ver[['ScriptId']], on='KernelVersionId', how='inner')

In [32]:
gb = votes.groupby('VoteDate')
counts = gb.VoteDate.transform('count')
interpolated_time_of_day = (gb.cumcount() / counts) * pd.Timedelta(1, 'd')
votes['VoteDateTime'] = votes['VoteDate'] + interpolated_time_of_day
votes['DateText'] = votes.VoteDate.dt.strftime("%c")
votes['VoteCount'] = votes.groupby("ScriptId").cumcount() + 1
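
The cell above appears to compensate for `VoteDate` holding only a date: votes that share a date are spread evenly across that day in row order, so the cumulative vote lines rise smoothly instead of jumping. A toy illustration with made-up votes (the `ScriptId` values are invented):

```python
import pandas as pd

# Made-up votes: three on one date and one on the next, with no time of day.
toy = pd.DataFrame({
    "VoteDate": pd.to_datetime(["2021-01-01"] * 3 + ["2021-01-02"]),
    "ScriptId": [7, 7, 8, 7],
})

gb = toy.groupby("VoteDate")
counts = gb.VoteDate.transform("count")  # number of votes on the same date

# Position within the day as a fraction (0, 1/3, 2/3, ...) times one day.
offset = (gb.cumcount() / counts) * pd.Timedelta(days=1)
toy["VoteDateTime"] = toy["VoteDate"] + offset

# Running vote total per notebook.
toy["VoteCount"] = toy.groupby("ScriptId").cumcount() + 1
print(toy)
```

The interpolated times are a plotting convenience only; they say nothing about when during the day a vote was actually cast.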

In [33]:
votes.shape

### Power Law? The top TWO Notebooks have over 40% of the total Notebook votes of the whole dataset

In [34]:
N = 10
vc = votes.ScriptId.value_counts().head(N)
vcn = votes.ScriptId.value_counts(normalize=True).head(N)

In [35]:
BC = '#c0c0c0'
pd.concat((vc.to_frame("Votes"),
           vcn.to_frame("Proportion"),
           vcn.cumsum().to_frame("Cumulative")), axis=1).style.bar(subset=['Votes'], color=BC, width=85)
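
A rough way to probe the "Power Law?" question is to check whether vote counts fall on a straight line when plotted against rank on log-log axes. The sketch below uses invented vote counts, and a roughly linear log-log relation is suggestive at best, not proof of a power law.

```python
import numpy as np

# Hypothetical votes per notebook, sorted descending (not the real data).
votes_per_notebook = np.array([400, 250, 60, 40, 30, 25, 20, 15, 10, 10, 5, 5, 3, 2, 1])

# Share of all votes held by the top two notebooks.
share = votes_per_notebook / votes_per_notebook.sum()
top_two_share = share[:2].sum()

# Under a power law, log(count) is roughly linear in log(rank).
ranks = np.arange(1, len(votes_per_notebook) + 1)
slope, intercept = np.polyfit(np.log(ranks), np.log(votes_per_notebook), deg=1)

print(f"Top two share: {top_two_share:.1%}, log-log slope: {slope:.2f}")
```

With the notebook's data, `votes.ScriptId.value_counts()` would supply the sorted counts in place of the invented array.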

In [36]:
NOW = datetime.now()
SHOW = ['TotalVotes', 'TotalViews', 'ViewsPerVote', 'Age', 'DisplayNameLink', 'TitleLink']

def user_name_link(r):
    return (f'<a href="https://www.kaggle.com/{r.UserName}" '
            f' title="Tier: {r.PerformanceTier}\n'
            f'RegisterDate: {r.RegisterDate}\n'
            f'UID: {r.AuthorUserId}">'
            f'{r.DisplayName}</a>')

def notebook_link(r):
    return (f'<a href="https://www.kaggle.com/{r.UserName}/{r.CurrentUrlSlug}" '
            f' title="VersionNumber: {r.VersionNumber}\n'
            f'TotalViews: {r.TotalViews}\n'
            f'TotalComments: {r.TotalComments}\n'
            f'CreationDate: {r.CreationDate}\n'
            f'MedalAwardDate: {r.MedalAwardDate}">'
            f'{r.Title}</a>')

def fmt_df(df):
    df = df.join(users, on="AuthorUserId")
    df = df.join(ver.drop(columns=['AuthorUserId', 'CreationDate', 'ScriptId']), on="CurrentKernelVersionId")
    df = df.dropna(subset=['UserName', 'Title'])
    
    counts = df.groupby("Title").cumcount()
    idx = counts>0
    df.loc[idx, "Title"] = df.loc[idx, "Title"] + counts.apply(lambda s: f" ({s+1})")
    
    df['ViewsPerVote'] = (df['TotalViews'] / df['TotalVotes']).astype(int)
    df['DisplayNameLink'] = df.apply(user_name_link, axis=1)
    df['TitleLink'] = df.apply(notebook_link, axis=1)
    df["Age"] = (NOW - df.CreationDate).dt.days
    return df

def show_df(df):
    return df[SHOW].set_index('TitleLink').style.bar(color=BC, width=85)

# Gold Notebooks 🥇


In [37]:
TYPE, BC = "Gold", "#c9b037"
subset = fmt_df(kernels.query("Medal==1"))
subset.shape

In [38]:
show_df(subset)

Show the table again without the (unsurpassable?) top two, because they skew the bars.

In [39]:
show_df(subset.tail(-2))

In [40]:
votes_df = votes.join(subset, on='ScriptId', how='inner')
votes_df.shape

In [41]:
px.line(votes_df,
        'VoteDateTime',
        'VoteCount',
        hover_name='Title',
        hover_data=['DisplayName', 'UserName', 'DateText'],
        line_group='ScriptId',
        color='ScriptId',
        title=f'{MK} {TYPE} Notebooks')

# Silver Notebooks 🥈

In [42]:
TYPE, BC = "Silver", "#b4b4b4"
subset = fmt_df(kernels.query("Medal==2"))
subset.shape

In [43]:
show_df(subset)

In [44]:
votes_df = votes.join(subset, on='ScriptId', how='inner')
votes_df.shape

In [45]:
px.line(votes_df,
        'VoteDateTime',
        'VoteCount',
        hover_name='Title',
        hover_data=['DisplayName', 'UserName', 'DateText'],
        line_group='ScriptId',
        color='ScriptId',
        title=f'{MK} {TYPE} Notebooks')

# Bronze Notebooks 🥉


In [46]:
TYPE, BC = "Bronze", "#ad8a56"
subset = fmt_df(kernels.query("Medal==3"))
subset.shape

In [47]:
show_df(subset)

In [48]:
votes_df = votes.join(subset, on='ScriptId', how='inner')
votes_df.shape

In [49]:
px.line(votes_df,
        'VoteDateTime',
        'VoteCount',
        hover_name='Title',
        hover_data=['DisplayName', 'UserName', 'DateText'],
        line_group='ScriptId',
        color='ScriptId',
        title=f'{MK} {TYPE} Notebooks')

# Further Notebooks

In [50]:
TYPE, BC = "Closest to gaining a medal", "#4fc3f7"
subset = fmt_df(kernels[kernels.Medal.isnull()].sort_values("TotalVotes", ascending=False).head(100).sort_index())
subset.shape

In [51]:
show_df(subset)

In [52]:
votes_df = votes.join(subset, on='ScriptId', how='inner')
votes_df.shape

In [53]:
px.line(votes_df,
        'VoteDateTime',
        'VoteCount',
        hover_name='Title',
        hover_data=['DisplayName', 'UserName', 'DateText'],
        line_group='ScriptId',
        color='ScriptId',
        title=f'{MK} {TYPE} Notebooks')

# Conclusions

Well, [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) updates are no longer as regular as they used to be and look somewhat chaotic, but I hope they continue!

Why the title ***[[Meta Kaggle] Meta Kaggle] Meta Kaggle*** ?

There's a loose convention of putting the dataset or competition in the Notebook title in \[brackets\] to give some context if the Notebook appears in the global listings, and the subject of this Notebook is [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) itself, so it should be ***[Meta Kaggle] Meta Kaggle***.

But this Notebook also inspects the Notebooks with medals listed in [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle).
If this Notebook eventually receives a medal (this is entirely up to ***YOU***), it will appear ***in itself***, hence ***[[Meta Kaggle] Meta Kaggle] Meta Kaggle***!