# Meta Kaggle: Count User Activities

This notebook simply counts appearances of UserId in the various Meta Kaggle data tables.

Some highlights from the data (these may drift a bit from the time of writing):

 - Nearly 7M users in total (the so-called *vanity* metric)
 - Around 71% of users (over 4.8M) have not done anything at all on the site
 - A further 1M have accepted rules for a competition but nothing else.
 - So, in millions, borrowing [Hans Rosling's pincode idea][2], approximately a 5:1:1 ratio, Unacted:Dormant:Active

See also: the 90-9-1 rule, aka [the 1% Rule][1]:

<blockquote>
"In Internet culture, the 1% rule is a rule of thumb pertaining to participation in an internet community, stating that only 1% of the users of a website actively create new content, while the other 99% of the participants only lurk. Variants include the 1–9–90 rule (sometimes 90–9–1 principle or the 89:10:1 ratio), which states that in a collaborative website such as a wiki, 90% of the participants of a community only view content, 9% of the participants edit content, and 1% of the participants actively create new content."

</blockquote>


For the users with the highest activity counts, the league tables show:

 - The top 50 "submittors" are maintaining an average of *over* one submission every day.
 - Around 10 of the top users are gaining 3-7 followers on average per day.
 - A few users vote for 5+ Notebooks every day.
 - The top 50 "Notebookers" are running at least one new Notebook version every day.
 - Chris Deotte has over 20 discussion votes for every day since he registered!
 - Marília Prata is averaging about 14 new forum posts per day!
 - "City of San Jose Data" creates about 75 dataset versions per day

## Contents

 * [League Tables](#League-Tables)
 * [ForumMessageVotes - Ratio](#ForumMessageVotes---Ratio)
 * [Followers - Ratio](#Followers---Ratio)
 * [Average: Votes per Discussion Post](#Average:-Votes-per-Discussion-Post)
 * [Average: Submissions per Competition](#Average:-Submissions-per-Competition)
 * [Summarize Total Activity](#Summarize-Total-Activity)
 * [Correlation Between Different Activity Counts](#Correlation-Between-Different-Activity-Counts)
 * [Hyper Kagglers](#Hyper-Kagglers)
 * [Hyper Kagglers 2](#Hyper-Kagglers-2)
 * [Columnwise Counts](#Columnwise-Counts)
 * [Counting Separate Activities](#Counting-Separate-Activities)
 * [Specialised Counts](#Specialised-Counts)
 * [Signup Rate](#Signup-Rate)
 * [Tiers](#Tiers)
 * [The End](#The-End)

 [1]: https://en.wikipedia.org/wiki/1%25_rule_(Internet_culture)
 [2]: https://www.gapminder.org/answers/where-do-people-live/
 

In [1]:
import gc, os, sys, time
import pandas as pd, numpy as np
from itertools import combinations
from IPython.display import HTML, display
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_rows = 200

In [2]:
plt.rc('figure', figsize=(12, 8))

In [3]:
IN_DIR = '../input/meta-kaggle'
if not os.path.isdir(IN_DIR):
    IN_DIR = '../input'
len(os.listdir(IN_DIR))

In [4]:
users = pd.read_csv(os.path.join(IN_DIR, 'Users.csv'), parse_dates=['RegisterDate'])
idx = users.DisplayName.str.len() <= 1
users.loc[idx, 'DisplayName'] = users.loc[idx, 'UserName']
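# assign back instead of using inplace=True on an attribute access,
# which newer pandas versions may apply to a temporary copy
users['UserName'] = users.UserName.fillna('')
users['DisplayName'] = users.DisplayName.fillna('[deleted user]')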
users.DisplayName = users.DisplayName.str[:32]
users = users.set_index('Id')
users.shape

In [5]:
one_day = pd.Timedelta(1, 'd')
latest = users.RegisterDate.max() + one_day
latest

In [6]:
users['Age'] = ((latest - users.RegisterDate) / one_day).astype('int32')

In [7]:
EXCLUDE_USERS = [2080166] # Kaggle Kerneler - very high stats that distort the league tables!

users.loc[EXCLUDE_USERS].T

In [8]:
users = users.drop(EXCLUDE_USERS)
users.shape

In [9]:
users.head()

In [10]:
def columns(fn):
    df = pd.read_csv(fn, nrows=5)
    return df.columns

def user_columns(fn):
    return [c for c in columns(fn) if 'UserId' in c]

For each csv:

 - read user id columns
 - add counts to main users df - format is Count_Table_Column, e.g. Count_Kernels_AuthorUserId is how many Kernels they have authored.
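
The counting step can be illustrated on a tiny invented example. This is only a sketch: the frames `toy_users` and `toy_kernels` and their ids are made up for illustration and are not Meta Kaggle data. It shows how `value_counts` on a user-id column is mapped back onto the user index, with users that never appear filled in as 0.

```python
import pandas as pd

# Hypothetical users table indexed by user Id (toy data, not Meta Kaggle).
toy_users = pd.DataFrame({'UserName': ['ann', 'bob', 'cat']}, index=[1, 2, 3])
toy_users.index.name = 'Id'

# Hypothetical activity table: one row per kernel, with its author's user id.
toy_kernels = pd.DataFrame({'Id': [10, 11, 12], 'AuthorUserId': [1, 1, 3]})

# Count rows per user id, then look each user up in those counts.
vc = toy_kernels['AuthorUserId'].value_counts()
counts = toy_users.index.map(vc)  # NaN where a user never appears

toy_users['Count_Kernels_AuthorUserId'] = counts.fillna(0).astype('int32')
print(toy_users)
```

Mapping through the index keeps one row per user, so every count column lines up with the main users table without a merge.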

In [11]:
nrows = None
for f in sorted(os.listdir(IN_DIR)):
    if '.csv' in f:
        csv = os.path.join(IN_DIR, f)
        cols = user_columns(csv)
        if len(cols) < 1:
            continue
        table = f.replace('.csv', '')
        df = pd.read_csv(csv, usecols=['Id'] + cols, nrows=nrows)
        # ForumMessageVotes contents duplicated
        # https://www.kaggle.com/kaggle/meta-kaggle/discussion/181883
        # must use drop_duplicates
        df = df.drop_duplicates(subset=['Id'])
        for col in cols:
            tag = f'Count_{table}_{col}'
            print(tag)
            vc = df[col].value_counts()
            ser = users.index.map(vc)
            users[tag] = ser.fillna(0).astype('int32')

Here are some rough meanings:

In [12]:
MEANINGS = {
    "Datasets_CreatorUserId": "create a dataset",
    "Datasets_OwnerUserId": "own a dataset",
    "DatasetVersions_CreatorUserId": "create a dataset version",
    "DatasetVotes_UserId": "vote for a dataset",
    "Datasources_CreatorUserId": "create a datasource",
    "ForumMessages_PostUserId": "post a forum message",
    "ForumMessageVotes_FromUserId": "vote for a forum message",
    "ForumMessageVotes_ToUserId": "receive a forum vote",
    "Kernels_AuthorUserId": "author a kernel",
    "KernelVersions_AuthorUserId": "run a new version of a Notebook",
    "KernelVotes_UserId": "vote for a Notebook",
    "Submissions_SubmittedUserId": "submit to a competition",
    "TeamMemberships_UserId": "enter a competition (agree to rules)",
    "UserAchievements_UserId": "reach a new achievement milestone",
    "UserFollowers_UserId": "follow a user",
    "UserFollowers_FollowingUserId": "get followed by a user",
    "UserOrganizations_UserId": "add an organization",
    "Total_Activities": "appear in any activity",
}

pd.Series(MEANINGS)

# League Tables

Show the users with the highest counts for each column.

One note: unlimited submissions to competitions can be made after the deadline.
It's possible to see this distinction in the data, but the field is not read by my overly-concise code above.
This affects the **Submissions SubmittedUserId** table: a few users make thousands of post-deadline submissions, pushing their numbers way up.
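
If post-deadline submissions should be left out, one option is to filter before counting. The sketch below uses invented rows and assumes the submissions table exposes a boolean `IsAfterDeadline` column; that column name is an assumption here, so check the actual header of `Submissions.csv` before relying on it.

```python
import pandas as pd

# Toy stand-in for Submissions.csv; `IsAfterDeadline` is an assumed column name.
toy_subs = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'SubmittedUserId': [7, 7, 7, 8, 8],
    'IsAfterDeadline': [False, True, True, False, False],
})

# Keep only submissions made while the competition was still open.
in_time = toy_subs.loc[~toy_subs['IsAfterDeadline']]

all_counts = toy_subs['SubmittedUserId'].value_counts()
in_time_counts = in_time['SubmittedUserId'].value_counts()
print(pd.DataFrame({'All': all_counts, 'BeforeDeadline': in_time_counts}))
```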

In [13]:
N_SHOW = 50

In [14]:
tier_names = np.asarray(['novice', 'contributor', 'expert', 'master', 'grandmaster', 'staff'])
tier_colors = np.asarray(["#2ECB99", "#00BFF9", "#9A5289", "#FF6337", "#DFA848", "#000000"])
tier_html = np.asarray([f'<font color={c}>{n}</font>' for c, n in zip(tier_colors, tier_names)])
bar_color = '#20beff'

def twoDP(v):
    return f'{v:.2f}'

def threeDP(v):
    return f'{v:.3f}'

def user_name_link(r):
    return f'<a href="https://www.kaggle.com/{r.UserName}">{r.DisplayName}</a>'


def setup_user(df):
    uid = df.apply(user_name_link, axis=1)
    df.pop('UserName')
    df.pop('DisplayName')
    df['Tier'] = tier_html[df.PerformanceTier]
    df['DisplayName'] = uid


def league_table(col, src_df=users):
    name = col.replace('_', ' ')
    h1 = f"<H1 id={col}>{name}</H1>"
    h2 = f"<P>How many times did user <i>{MEANINGS[col]}</i>?"
    display(HTML(h1+h2))
    #
    col = "Count_" + col
    df = src_df.sort_values(col, ascending=False).head(N_SHOW)
    setup_user(df)
    df['PerDay'] = (df[col] / df['Age'])
    df['Rank'] = df[col].rank(method='min', ascending=False).astype(int)
    use = ['Rank', 'DisplayName', 'Tier', col, 'PerDay']
    return df[use].style.bar(subset=[col, 'PerDay'], vmin=0, width=85, color=bar_color).format({'PerDay': twoDP})


def ratio_league_table(a, b, src_df=users):
    df = src_df.sort_values(a, ascending=False).head(N_SHOW)
    setup_user(df)
    df['Ratio'] = (df[b] / df[a]).round(2)
    df['Rank'] = df[a].rank(method='min', ascending=False).astype(int)
    use = ['Rank', 'DisplayName', 'Tier', a, b, 'Ratio']
    return df[use].style.bar(subset=[a, b], vmin=0, width=85, color=bar_color).format({'Ratio': twoDP})


def average_rate_league_table(a, b, src_df=users):
    df = src_df
    df = df.assign(Average=df.eval(f'({a}+1)/({b}+1)'))
    df = df.sort_values('Average', ascending=False).head(N_SHOW)
    setup_user(df)
    df['Rank'] = df['Average'].rank(method='min', ascending=False).astype(int)
    use = ['Rank', 'DisplayName', 'Tier', a, b, 'Average']
    return df[use].style.bar(subset=[a, b], vmin=0, width=85, color=bar_color).format({'Average': twoDP})

# for c in activity_sums.index: print(f'league_table("{c}")')

In [15]:
league_table("TeamMemberships_UserId")

In [16]:
league_table("Submissions_SubmittedUserId")

In [17]:
league_table("KernelVersions_AuthorUserId")

In [18]:
league_table("Kernels_AuthorUserId")

In [19]:
league_table("KernelVotes_UserId")

In [20]:
league_table("ForumMessages_PostUserId")

In [21]:
league_table("ForumMessageVotes_ToUserId")

In [22]:
league_table("ForumMessageVotes_FromUserId")

In [23]:
league_table("DatasetVersions_CreatorUserId")

In [24]:
league_table("Datasets_CreatorUserId")

In [25]:
league_table("Datasources_CreatorUserId")

In [26]:
league_table("Datasets_OwnerUserId")

In [27]:
league_table("DatasetVotes_UserId")

In [28]:
league_table("UserFollowers_FollowingUserId")

In [29]:
league_table("UserFollowers_UserId")

In [30]:
league_table("UserOrganizations_UserId")

# ForumMessageVotes - Ratio

A quick ratio of forum votes given to forum votes received, for the users who have received the most.
*(Votes by Kaggle staff are not in the data so I filter them out based on PerformanceTier.)*

In [31]:
# ratio_league_table("Count_ForumMessageVotes_ToUserId", "Count_ForumMessageVotes_FromUserId")
ratio_league_table(
    "VotesReceived", "VotesGiven", 
    users.rename(
        columns={
            "Count_ForumMessageVotes_FromUserId": "VotesGiven",
            "Count_ForumMessageVotes_ToUserId": "VotesReceived"
        }).query("PerformanceTier<5")
)

# Followers - Ratio

How much "following back" is there?
(The 'follow' feature used to control which content was seen on the homepage [kaggle.com](https://www.kaggle.com/) but not any more?!)


In [32]:
ratio_league_table(
    "Followed", "Following", 
    users.rename(
        columns={
            "Count_UserFollowers_UserId": "Following",
            "Count_UserFollowers_FollowingUserId": "Followed"
        }))

# Average: Votes per Discussion Post

The Kaggle UI has for a long time shown *Votes/Post* on a user's discussion page - but who has the highest rate?

Quite a few users have fewer than 10 posts, one of which received hundreds of votes - so I put in a 10 post cut-off.

(Note: the average is smoothed `(votes+1)/(posts+1)`)
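
To see what the add-one smoothing does, here is a minimal sketch on invented numbers (the user labels and counts are hypothetical). It compares the raw ratio with the smoothed one: smoothing keeps the value defined when a user has no posts and damps averages that rest on very few posts.

```python
import pandas as pd

# Invented users: one viral post, a steady poster, and someone with no posts.
toy = pd.DataFrame(
    {'Votes': [300, 40, 0], 'Posts': [1, 19, 0]},
    index=['one_hit', 'regular', 'silent'],
)

raw = toy['Votes'] / toy['Posts']                   # 0/0 gives NaN for 'silent'
smoothed = (toy['Votes'] + 1) / (toy['Posts'] + 1)  # always defined
print(pd.DataFrame({'Raw': raw, 'Smoothed': smoothed}))
```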

In [33]:
average_rate_league_table(
    "Votes", "Posts", 
    users.rename(
        columns={
            "Count_ForumMessageVotes_ToUserId": "Votes",
            "Count_ForumMessages_PostUserId": "Posts"
        }).query("Posts>=10"))

# Average: Submissions per Competition

It is not necessarily a good thing to make lots of submissions but this is one of the only other comparisons that makes sense ;-)

In [34]:
average_rate_league_table(
    "Subs", "Teams", 
    users.rename(
        columns={
            "Count_Submissions_SubmittedUserId": "Subs",
            "Count_TeamMemberships_UserId": "Teams"
        }).query("Teams>=10"))

# Summarize Total Activity


In [35]:
all_col_counts = users.columns[users.columns.str.startswith('Count_')]
len(all_col_counts)

Count_UserAchievements_UserId - all users have 3 or 0 achievements, not much to see here, let's ignore that count.

(Note this [Bug: Users missing from UserAchievements][1] - users are supposed to have 3 achievements listed, even if just "achieved Novice".
All users that signed up after 2019-10-18 are missing.)

[1]: https://www.kaggle.com/kaggle/meta-kaggle/discussion/181048

In [36]:
users.Count_UserAchievements_UserId.value_counts()

In [37]:
count_cols = [c for c in all_col_counts if c != 'Count_UserAchievements_UserId']
len(count_cols)

Simply **sum** up all activities on Kaggle: over four million users have yet to *do* anything!

In [38]:
users[count_cols].sum(1).value_counts().head()

Fork & put your username here to see the sum of all your activities on Kaggle.

In [39]:
users.query('UserName=="jtrotman"').T

Sum the columns as booleans to count the *variety* of things each user has done.
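
The difference between summing counts and summing boolean flags is easy to see on a toy frame. This is a sketch with invented values; only the column naming follows the `Count_Table_Column` convention used above.

```python
import pandas as pd

# Invented counts for three hypothetical users.
toy = pd.DataFrame({
    'Count_Kernels_AuthorUserId': [5, 0, 1],
    'Count_ForumMessages_PostUserId': [120, 0, 0],
    'Count_Submissions_SubmittedUserId': [0, 0, 3],
}, index=['u1', 'u2', 'u3'])

total = toy.sum(axis=1)          # volume of activity
variety = (toy > 0).sum(axis=1)  # number of distinct activity types
print(pd.DataFrame({'Total': total, 'Variety': variety}))
```

Here `u1` and `u3` have the same variety (2) even though their totals differ widely.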

In [40]:
users['Sum_Activity_Flags'] = (users[count_cols]>0).sum(1)

### We see:

 - most (4.8 million) have not done anything (Sum==0) - about 71% of all users
 - there are currently 16 activities at most
 - about 60 have done them **ALL**!


In [41]:
df = users.groupby('Sum_Activity_Flags').size().to_frame('Count')
df['Proportion'] = df['Count'] / df['Count'].sum()
df

In [42]:
users.Sum_Activity_Flags.plot.hist(bins=len(df), log=True)
plt.title('Count of activity types over all Kaggle users');

# Correlation Between Different Activity Counts

Compute correlations between activity counts.
I'm omitting users with no activity as that creates spurious correlations.
Also using Spearman to compare ranks of values, as some counts go very high.
Other options would be to use log1p() counts or to compare active/non-active flags (count>0).

Here we can see four of the dataset columns are essentially the same thing:

- Count Datasets CreatorUserId
- Count Datasets OwnerUserId
- Count DatasetVersions CreatorUserId
- Count Datasources CreatorUserId

It's interesting how generally competing is *negatively* correlated with casting dataset votes.
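
On the choice of Spearman above: a small sketch with invented numbers shows why ranks help when one count is extreme. Spearman correlation is Pearson correlation computed on ranks, so it is written out that way here; the column names `Posts` and `Votes` are purely illustrative.

```python
import pandas as pd

# Invented counts: 'Votes' rises with 'Posts', but one 'Posts' value is extreme.
toy = pd.DataFrame({
    'Posts': [1, 2, 3, 4, 1000],
    'Votes': [2, 3, 5, 7, 11],
})

pearson = toy.corr(method='pearson').loc['Posts', 'Votes']

# Spearman is Pearson applied to ranks, so only the ordering matters.
spearman = toy.rank().corr(method='pearson').loc['Posts', 'Votes']

print(f'pearson={pearson:.3f} spearman={spearman:.3f}')
```

Because the ordering of the two columns is identical, the rank-based value is 1, while the outlier pulls the Pearson value below that.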

In [43]:
counts_df = (users.query('Sum_Activity_Flags>0')[count_cols])
counts_df.shape

In [44]:
counts_df.head().T

In [45]:
# regex=True is required: pandas 2.x treats the pattern as a literal string by default
counts_df.columns = counts_df.columns.str.replace('^Count_', '', regex=True)
counts_df.columns = counts_df.columns.str.replace('_', '\n')

In [46]:
plt.figure(figsize=(14, 12))
sns.heatmap(counts_df.corr(method='spearman'), vmin=-1, cmap='RdBu', annot=True, linewidths=1)
plt.title('Kaggle User Activity Counts - Spearman Correlation');

# Hyper Kagglers

How many have done them all? These are the ***hyper Kagglers*** 😎 



In [47]:
idx = users.Sum_Activity_Flags==users.Sum_Activity_Flags.max()
idx.sum()

In [48]:
show = ['UserName', 'DisplayName', 'RegisterDate', 'PerformanceTier']

In order of Id, so oldest accounts first:

In [49]:
users[idx][show]

# Hyper Kagglers 2

Another way: who has interacted with the site the most? (This is simply counting how many times a user's ID appears in any table, which means it is adding *some* things that a user has minimal control over like being followed ...)

In [50]:
league_table('Total_Activities', users.assign(Count_Total_Activities=users[count_cols].sum(1)))

# Columnwise Counts

Sum up which activities are most popular.

Over 1M (23% ish) enter/sign up to one or more competitions but of those only 250k or so (5% ish of total users) get around to making a submission.

Remember [the 1% Rule][1]?
Kaggle is doing better than 1% for most activities :)
**It is the dataset related activities where participation is under 1%.**

 [1]: https://en.wikipedia.org/wiki/1%25_rule_(Internet_culture)


In [51]:
n_users = len(users)
n_users

In [52]:
activity_sums = (users[count_cols]>0).sum(0).to_frame("UserCount")
activity_sums["PercentageOfUsers"] = ((activity_sums["UserCount"] / n_users) * 100).round(2)
# regex=True is required: pandas 2.x treats the pattern as a literal string by default
activity_sums.index = activity_sums.index.str.replace("^Count_", "", regex=True).map(MEANINGS.get)
activity_sums.sort_values("UserCount", ascending=False)

# Counting Separate Activities

Utility function to count users based on their activities.
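
As a cross-check on this kind of counting, the same information can be obtained in one pass by grouping on the boolean flags. The sketch below uses invented columns (`Count_A`, `Count_B`, `Count_C` are hypothetical) and is an alternative view, not the notebook's method: each distinct pattern of active columns becomes one group.

```python
import pandas as pd

# Invented activity counts for five hypothetical users.
toy = pd.DataFrame({
    'Count_A': [0, 1, 2, 0, 3],
    'Count_B': [0, 0, 1, 0, 1],
    'Count_C': [0, 0, 0, 4, 0],
})

# One boolean flag per activity, then one group per observed combination.
flags = toy > 0
combo_sizes = flags.groupby(list(flags.columns)).size()
print(combo_sizes.sort_values(ascending=False))
```

Only combinations that actually occur are listed, which avoids enumerating every possible subset of columns.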

In [53]:
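def users_with_n_activities(n, min_count=1):
    # users whose number of distinct activity types is exactly n
    bi_sum = users.Sum_Activity_Flags==n
    for cols in combinations(count_cols, n):
        idx = bi_sum
        count = 0  # separate name, so the parameter n is not shadowed
        for c in cols:
            idx = (idx & (users[c]>0))
            count = idx.sum()
            if count<min_count:
                break
        if count>=min_count:
            yield (count,) + cols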

def users_with_n_activities_df(n, min_count=1):
    df = pd.DataFrame.from_records(
        users_with_n_activities(n, min_count),
        columns=['Count'] + list(range(n))
    )
    return df

Of users with 1 activity, the largest group is over 400k users: they have accepted rules for a competition, but not submitted, or posted messages, or voted etc...

Next is Count_DatasetVotes_UserId: they have only voted for a dataset. (Hmmm. Who are they? Any sockpuppets in there? Do they have long streaks of consecutive user Ids?)

Then Count_KernelVotes_UserId: they have only voted for a kernel. (Hmmm.)

Currently 4 have managed to submit to a competition as their only action - i.e. without being in a team(?)

In [54]:
def show_df(df):
    df = df.sort_values('Count', ascending=False)
    df = df.reset_index(drop=True)
    return df.style.bar(subset=['Count'], color=bar_color)

In [55]:
show_df(users_with_n_activities_df(1))

Of users with 2 activities, the largest group is currently Count_Submissions_SubmittedUserId and Count_TeamMemberships_UserId: those who've entered a competition AND submitted...

In [56]:
show_df(users_with_n_activities_df(2))

Users with 3 activities...

In [57]:
show_df(users_with_n_activities_df(3, min_count=2000))

# Specialised Counts

Count users who are totally dormant, or have *at most* entered & submitted to competitions. (Following this thread actually leads to [interesting clusters](https://www.kaggle.com/jtrotman/elo-probey-mcprobeface) that *appear* to be sockpuppet accounts used for submission probing in past competitions...)

In [58]:
users.shape

In [59]:
entered = users.Count_TeamMemberships_UserId>0
submitted = users.Count_Submissions_SubmittedUserId>0

idx = (
 (users.Sum_Activity_Flags==0)
 | 
 ((users.Sum_Activity_Flags==1) & (entered))
 | 
 ((users.Sum_Activity_Flags==2) & (entered) & (submitted))
)

Quite a lot...

In [60]:
idx.sum()

In [61]:
idx.mean()

How many have done more than enter competitions?

In [62]:
len(users) - idx.sum()

In [63]:
plt.rc('font', size=14)

# Signup Rate

Counting yet again; count the signups per day.

In [64]:
users.RegisterDate.value_counts().sort_index().plot()
plt.title('Kaggle User Registrations over Time')
plt.grid();

Looks like exponential growth - heading to 8 billion! Having multiple accounts is a **serious no-no** on Kaggle, so if user IDs ever get to over 8 billion we will know, for a fact, that it's *not a good thing!*

The double yearly peaks are interesting (matching the academic year?), as are the prominent dips around New Year.

In [65]:
users.RegisterDate.value_counts().sort_index().tail(365 * 8).plot(logy=True)
plt.title('Kaggle User Registrations over Time - log scale')
plt.grid();

With smoothing, early 2015 jumps out (Otto challenge?)
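
One caveat when smoothing daily counts: `value_counts` only lists days that have at least one registration, so a rolling window spans observed rows, not necessarily calendar days. This probably matters little for recent years, but a hedged sketch with made-up dates shows how missing days could be filled with zeros first.

```python
import pandas as pd

# Made-up registration dates with a gap on 3 and 4 January.
toy_dates = pd.Series(pd.to_datetime([
    '2015-01-01', '2015-01-01', '2015-01-02', '2015-01-05',
]))

per_day = toy_dates.value_counts().sort_index()

# Reindex onto a full daily calendar so silent days count as zero.
calendar = pd.date_range(per_day.index.min(), per_day.index.max(), freq='D')
per_day_full = per_day.reindex(calendar, fill_value=0)

smoothed = per_day_full.rolling(3).mean()  # 3-day window for this tiny example
print(pd.DataFrame({'Signups': per_day_full, 'Rolling3': smoothed}))
```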

In [66]:
users.RegisterDate.value_counts().sort_index().rolling(14).mean().tail(365 * 8).plot(logy=True)
plt.title('Kaggle User Registrations over Time - log scale (14-day rolling mean)')
plt.grid();

## Dormancy Rate

We discovered above that ~70% of accounts are inactive - what about over time?

It looks like *inactivity* is growing over time but remember:
 - it takes time to get started
 - data on active competitions (submissions, forums etc) are not in Meta Kaggle.

Many new users will register specifically to do an active competition, so there's a lag factor in play.

Also, more generally: beware that accounts registered on the left side have simply had more time to record activities.
On the right of the plot the dormancy rate is **certain** to be overestimated, because new accounts have not yet had much of a window of time to act.

So I've shaded the last 180 days; *some* of those registrations will later act and pull the line down.

It appears to be a recent rising trend but the plot will evolve with future runs of the notebook.
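
The weekly grouping used for this plot can be sketched on a few invented registrations (dates and flags below are made up). Since `dayofweek` is 0 for Monday, subtracting that many days floors each date to the Monday of its week; the mean of the boolean dormant flag per week is then the dormancy rate.

```python
import pandas as pd

# Invented registrations: two in the week of 21 June 2021, two in the next week.
toy = pd.DataFrame({
    'RegisterDate': pd.to_datetime(
        ['2021-06-21', '2021-06-23', '2021-06-28', '2021-07-04']),
    'Sum_Activity_Flags': [0, 2, 0, 0],
})

dormant = toy['Sum_Activity_Flags'] == 0

# dayofweek is 0 for Monday, so subtracting it floors each date to its Monday.
week_start = toy['RegisterDate'] - pd.to_timedelta(
    toy['RegisterDate'].dt.dayofweek, unit='D')

weekly_rate = dormant.groupby(week_start).mean()
print(weekly_rate)
```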

In [67]:
dormant = users.Sum_Activity_Flags == 0
reg_date = users.RegisterDate
max_date = reg_date.max()
# floor dates to 1 week resolution for smoothing
reg_week = reg_date - (reg_date.dt.dayofweek * one_day)
dormant.groupby(reg_week).mean().plot()
plt.axvspan(max_date - (180 * one_day), max_date, color='k', alpha=0.2)
plt.title('Kaggle Rate of Dormant Accounts over Time')
plt.grid();

# Tiers

Show counts of tiers.

In [68]:
users.PerformanceTier.value_counts()

In [69]:
vc = users.PerformanceTier.value_counts()

In [70]:
vc.index = tier_names[vc.index]

In [71]:
vc

How have tier counts changed over time?
It's natural for higher tiers to have lower counts for more recent years: it *should* take some time to achieve the top two tiers.
At time of writing the 2018 signups have not kept up the trend, but their stories are still being written!

An exact breakdown by type (competition, discussion etc) might be interesting (though I think others have done this already).

(Also note that until 2016 there was just "Kaggler" and "Master" and those were for competition performance only.)
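
A year-by-tier table of this kind can also be built with `pd.crosstab`. The sketch below uses invented rows, and renames tier codes by lookup instead of by position, which stays correct when a tier is absent from the data (as could happen on a filtered subset). It is an alternative sketch, not the notebook's code.

```python
import pandas as pd

tier_names = ['novice', 'contributor', 'expert', 'master', 'grandmaster', 'staff']

# Invented users: only tiers 0 and 3 occur here.
toy = pd.DataFrame({
    'RegisterDate': pd.to_datetime(
        ['2015-03-01', '2015-07-09', '2016-01-20', '2016-11-02']),
    'PerformanceTier': [0, 3, 0, 0],
})

table = pd.crosstab(toy['RegisterDate'].dt.year, toy['PerformanceTier'])

# Rename by lookup so the labels stay correct even when a tier is absent.
table = table.rename(columns=dict(enumerate(tier_names)))
print(table)
```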

In [72]:
ty = users.groupby([users.RegisterDate.dt.year, users.PerformanceTier]).size()
ty = ty.unstack()
ty = ty.fillna(0)
ty = ty.astype(int)
ty.columns = tier_names

In [73]:
ty.style.background_gradient(axis=0)

Same trick for tier vs the count of different activities a user has done.

This flags a data quality issue - how can you be a contributor without having done anything?
Perhaps they are deleted accounts that have not been cleaned out of the users table yet?

In [74]:
tier_sums = users.groupby([users.Sum_Activity_Flags, users.PerformanceTier]).size()
tier_sums = tier_sums.unstack()
tier_sums = tier_sums.fillna(0)
tier_sums = tier_sums.astype(int)
tier_sums.columns = tier_names

In [75]:
tier_sums.style.background_gradient(axis=0)

No, their profiles mostly read *Competitions contributor* / *Joined 9 years ago · last seen 9 years ago*.

In [76]:
qstr = 'PerformanceTier==1 and Sum_Activity_Flags==0'
# users.query(qstr) # use this to see who they are
users.query(qstr).RegisterDate.dt.year.value_counts() # or this to just summarize

In [77]:
# GM with least activities!
# no need to point this out publicly
# try other queries if you like...

# users.query('PerformanceTier==4 and Sum_Activity_Flags<=4')

In [78]:
# Novices who've nevertheless done everything

# users.query('PerformanceTier==0 and Sum_Activity_Flags==16').T

One last trick: group by whether the user has *done* each of the main three activities.

In [79]:
tier_sums = users.groupby([
    users.Count_Submissions_SubmittedUserId > 0,
    users.Count_Kernels_AuthorUserId > 0,
    users.Count_ForumMessages_PostUserId > 0,
    users.PerformanceTier
]).size().unstack().fillna(0).astype(int)
tier_sums.columns = tier_names

In [80]:
tier_sums.index.names = ["Submit", "Kernel", "Post"]

All grandmasters have posted a message, but four (at time of writing) have not yet submitted to a competition?!

In [81]:
tier_sums.style.background_gradient(axis=0)

# The End

Save file for further analysis... e.g. load in a spreadsheet & colorize & sort columns - what stands out?
Say, sort by total activity: which masters have done the *least* to reach that tier?!


In [82]:
active_users = users.loc[users.Sum_Activity_Flags > 0]

In [83]:
active_users.to_csv("ActiveUsers.csv")

In [84]:
active_users.query("PerformanceTier>=3").to_csv("ActiveUsers_Masters.csv")

In [85]:
_ = """
Re-run to include recent competitions:

    2021-06-23 | Slug:iwildcam2021-fgvc8
    2021-06-28 | Slug:coleridgeinitiative-show-us-the-data
    2021-07-05 | Slug:tabular-playground-series-jun-2021
    2021-08-10 | Slug:google-smartphone-decimeter-challenge
    2021-08-11 | Slug:commonlitreadabilityprize


"""