# Charting User Progress - Discussions

This notebook plots the progress of the top Kaggle users in the discussion rankings: the top 100, plus all discussion Masters and Grandmasters.
It uses the forum posts HTML source available in [Meta Kaggle][1].

One graph per user shows:

- the cumulative count of gold and total discussion medals on the way to 50 gold / 500 total
- a dashed vertical line at the date of the latest tier achievement (if available)
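
The cumulative medal counts can be sketched on a toy table as below. This is a minimal illustration rather than the notebook's full pipeline: the column names follow Meta Kaggle's forum messages table as used later in this notebook, the rows are made up, and the medal coding for silver and bronze (2 and 3) is an assumption; only `Medal == 1` is plotted as gold here.

```python
import pandas as pd

# Toy stand-in for the forum messages table; values are invented.
msgs = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "PostUserId": [10, 10, 20, 10, 20, 10],
    "PostDate": pd.to_datetime([
        "2020-01-01", "2020-01-05", "2020-01-07",
        "2020-02-01", "2020-02-03", "2020-03-01",
    ]),
    # Assumed coding: 1 = gold, 2 = silver, 3 = bronze, NaN = no medal
    "Medal": [1, None, 3, 2, 1, 1],
}).sort_values("Id")

# Running count of gold medals per user, in posting order
msgs["gold"] = (msgs["Medal"] == 1).groupby(msgs["PostUserId"]).cumsum()
# Running count of medals of any colour per user
msgs["total"] = msgs["Medal"].notna().groupby(msgs["PostUserId"]).cumsum()

# One user's progress, indexed by post date and ready to plot
user = msgs[msgs["PostUserId"] == 10].set_index("PostDate")
print(user[["gold", "total"]])
```

Each row then carries the user's medal totals as of that post, so plotting the columns against `PostDate` gives the step-like progress curves.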

_________

The idea comes from [Marília Prata](https://www.kaggle.com/mpwolke) here:

https://www.kaggle.com/jtrotman/meta-kaggle-count-user-activities/comments#1032992

<blockquote>
    
What would be the proportion of "doing nothing" after reaching some tier position? Do Kagglers keep their productivity after achieving any kind of goal? Maybe in another meta Kaggle Notebook of yours that questions could be answered.

</blockquote>

_________

For now the notebook just plots the data. It is interesting to see that some users do stop posting as soon as a milestone is reached, though they could of course just be "resting" and planning their return :-)

The majority do not stop, and in some cases seem to *accelerate*!

The different proportions of gold & total medals over time are also interesting to see.
Note that the total:gold axis ratio is fixed at 10:1; change `GOLD_RATIO` if you want to vary it :-)
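
The fixed axis ratio can be sketched as a small helper. This is an illustration under assumptions, not the notebook's exact code: `equalise_limits` is a hypothetical name, and in the notebook the two inputs would be the upper y-limits of the gold and total axes (e.g. `ax.get_ylim()[1]`).

```python
def equalise_limits(gold_top, total_top, gold_ratio=10):
    """Return (gold_top, total_top) with total_top == gold_ratio * gold_top.

    Only one limit is ever raised, never lowered, so neither curve is clipped.
    """
    if total_top <= gold_top * gold_ratio:
        # Relatively many golds: stretch the total axis
        total_top = gold_top * gold_ratio
    else:
        # Relatively few golds: stretch the gold axis
        gold_top = total_top / gold_ratio
    return gold_top, total_top


print(equalise_limits(12, 90))   # total axis is stretched
print(equalise_limits(5, 200))   # gold axis is stretched
```

With both axes tied to the same ratio, a gold line sitting above the total line means the user earns gold medals at better than one in `gold_ratio`.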

[1]: https://www.kaggle.com/kaggle/meta-kaggle "Meta Kaggle"


In [1]:
# Utility script: supplies the read_* Meta Kaggle loaders used below
from jt_mk_utils import *

In [2]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import HTML, Image, display
from sklearn import preprocessing
from pathlib import Path
from tqdm.notebook import tqdm

In [3]:
USER_RANKS_CSV = Path('../input/kaggle-discussion-user-rankings')

HOST = 'https://www.kaggle.com'
TIER_COLORS = ["#2ECB99", "#00BFF9", "#9A5289", "#FF6337", "#DFA848", "#000000"]

# Equalise axes so that total medals is always GOLD_RATIO * gold_count
GOLD_RATIO = 10

# Read top ranked discussion users
users_df = pd.read_csv(USER_RANKS_CSV / 'DiscussionRankings.csv')
users_df = users_df.set_index('UserId').sort_index()
users_df['RegisterDate'] = pd.to_datetime(users_df['RegisterDate'])
display(HTML(f'users_df: {users_df.shape}'))

# Add TierAchievementDate from UserAchievements.csv
df = read_user_achievements(filter=('UserId', users_df.index))
df = df.query('AchievementType=="Discussion"')
df = df.set_index('UserId')
users_df = users_df.join(df[['TierAchievementDate', 'HighestRanking']])
display(HTML(f'users_df: {users_df.shape}'))

# Read messages to find medal dates
msgs = read_forum_messages()
msgs['Message'] = msgs['Message'].fillna("")
msgs = msgs.sort_values("Id")
msgs['m1'] = (msgs.Medal == 1).groupby(msgs.PostUserId).cumsum()
msgs['m2'] = (msgs.Medal == 2).groupby(msgs.PostUserId).cumsum()
msgs['m3'] = (msgs.Medal == 3).groupby(msgs.PostUserId).cumsum()
msgs['mT'] = (msgs.eval('m1+m2+m3'))
# msgs['TotalPosts'] = (msgs.Medal.isnull()).groupby(msgs.PostUserId).cumsum()
display(HTML(f'messages: {msgs.shape}'));

# Read forums to find most active areas user posts in
forums = read_forums(index_col="Id")
forum_topics = read_forum_topics(index_col=0, usecols=[0, 1])
msgs.insert(0, 'ForumId', msgs.ForumTopicId.map(forum_topics.ForumId))
msgs.insert(0, 'ParentForumId', msgs.ForumId.map(forums.ParentForumId))
display(HTML(f'messages: {msgs.shape}'));

In [4]:
plt.rc("figure", figsize=(10, 5))
plt.rc("font", size=13)

In [5]:
IMGS = {
    2: "https://www.kaggle.com/static/images/tiers/expert@48.png",
    3: "https://www.kaggle.com/static/images/tiers/master@48.png",
    4: "https://www.kaggle.com/static/images/tiers/grandmaster@48.png"
}

SHOW = [
    'Points',
    'CurrentRanking',
    'HighestRanking',
    'TotalGold',
    'TotalSilver',
    'TotalBronze',
]

now = pd.to_datetime('now')
users_subset = users_df.query('Tier>=3 or CurrentRanking<=100')

for uid, row in users_subset.sort_values('CurrentRanking').iterrows():
    sub = msgs.query('PostUserId==@uid')
    days = sub.PostDate.dt.date.nunique()
    chars = sub.Message.str.len().sum()
    sub = sub.set_index('PostDate').sort_index()

    # extend data to current time by repeating the last row at `now`
    # (DataFrame.append was removed in pandas 2.0)
    latest = sub.iloc[[-1]].copy()
    latest.index = [now]
    sub = pd.concat([sub, latest])

    table_stats = (row.to_frame('Field').T[SHOW].style.format('{:.0f}').set_caption("Stats:"))
    # Forums this user posts in most often
    topf = forums.Title[sub.ForumId.dropna()].value_counts()
    topf = topf.to_frame("Messages")
    topf.index.name = "Forum"
    table_forums = (topf.head(5).style.format('{:.0f}').set_caption("Posts in:"))

    html = (
        f'<img src="{IMGS[row.Tier]}" style="display:inline;" /> &nbsp;'
        f' <h1 style="display: inline;" id="{row.UserName}">'
        f'#{row.CurrentRanking:.0f} {row.DisplayName}</h1>'
        f' (@{row.UserName}) -'
        f' Joined {row.RegisterDate.strftime("%A %-d %b %Y")}'
        f'<ul>'
        f'<li><a href="{HOST}/{row.UserName}/discussion">Discussion index</a>'
        f'<li>Posted in {sub.ForumTopicId.nunique()} unique topics'
        f'<li>{days} unique days;'
        f'    {(sub.shape[0] / days):.1f} posts per day'
        f'<li>{sub.shape[0]} messages;'
        f'    {chars} raw characters;'
        f'    {int((chars / sub.shape[0]))} chars per message'
        f'</ul>'
        f'{table_stats.to_html()}'
        f'{table_forums.to_html()}'
    )
    display(HTML(html))
    # Gold medals on the left axis, total medals on the right axis
    ax1 = sub.m1.plot(c='r')
    ax1.set_ylabel('Gold', color='r')
    ax2 = ax1.twinx()
    ax2.plot(sub.mT, c='b')
    ax2.set_ylabel('Total', color='b')
    ax2.yaxis.set_label_position("right")
    ax2.yaxis.tick_right()
    ngold = ax1.get_ylim()[1]
    ntotal = ax2.get_ylim()[1]
    gold_rate = ntotal / ngold
    if gold_rate <= GOLD_RATIO:
        ax2.set_ylim(top=ngold*GOLD_RATIO)
    else:
        ax1.set_ylim(top=ntotal/GOLD_RATIO)
    d = row.TierAchievementDate
    if not pd.isna(d):
        ax2.plot((d, d), (0, ax2.get_ylim()[1]), TIER_COLORS[row.Tier], linestyle='dashed')
    plt.title(f"{row.DisplayName} (@{row.UserName}) Discussion Medals")
    plt.show()