# Simple EDA of kaggle Grandmasters
As far as I can tell (and please correct me if I am wrong), the Grandmaster rank was introduced in 2016 along with the promotion of 71 Competitions GMs and one Discussions GM (see, for example, the topic ["Revamping Kaggle Profiles and the User Ranking System"](https://www.kaggle.com/general/20695)). This notebook makes use of the [Meta Kaggle dataset](https://www.kaggle.com/datasets/kaggle/meta-kaggle) in conjunction with the ["Meta Kaggle-Master Achievements Snapshot"](https://www.kaggle.com/steubk/meta-kagglemaster-achievements-snapshot) created by [steubk](https://www.kaggle.com/steubk).

**Notes:** 
* The dataset does not include Grandmasters who have gone on to become kaggle staff.
* Because this notebook draws on several datasets, it may take a few days for changes to filter through. If you have just become a GM (or $n$xGM), well done: your data will appear here soon!

In [1]:
import pandas as pd
import datetime
pd.set_option('display.max_rows', None)
from IPython.display import Markdown

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})

# read in the data

# old dataset:
#GM_individual_Tiers = pd.read_csv("../input/list-of-kaggle-grandmasters/GM_individual_Tiers.csv")
#GM_individual_Tiers = GM_individual_Tiers.rename(columns = {'Comp_Tier':'Competitions', 
#                                                            'Dset_Tier':'Datasets',
#                                                            'Note_Tier':'Notebooks',
#                                                            'Disc_Tier':'Discussions'})

MasterAchievements = pd.read_csv("../input/MasterAchievements.csv")
MasterProfiles = pd.read_csv("../input/MasterProfiles.csv")
# merge the two files together
Master_file = MasterAchievements.merge(MasterProfiles, on = 'UserName')
# select the Grandmasters from the dataset
GM_individual_Tiers = Master_file.query(" Competitions == 'grandmaster' | Datasets == 'grandmaster'| Notebooks == 'grandmaster'| Discussion == 'grandmaster'").copy()
GM_individual_Tiers['GM_tier_count'] = GM_individual_Tiers[GM_individual_Tiers.astype(str) == 'grandmaster'].count(axis=1)

n_GMs  = GM_individual_Tiers.shape[0]
n_4xGM = GM_individual_Tiers.query("GM_tier_count == 4").shape[0]
n_3xGM = GM_individual_Tiers.query("GM_tier_count == 3").shape[0]
n_2xGM = GM_individual_Tiers.query("GM_tier_count == 2").shape[0]

# convert User Names into Display Names
Users        = pd.read_csv("../input/Users.csv")
# select the GM users
Users_GM     = Users.query("PerformanceTier == 4")
# mapping from UserName to DisplayName, used by make_clickable_link below
name_mapping = dict(Users_GM[['UserName','DisplayName']].values)

def make_clickable_link(UserName):
    DisplayName = name_mapping.get(UserName)
    return DisplayName

# # print a summary
# now = datetime.datetime.now()
# display(Markdown('This notebook is refreshed almost daily; latest run ' + now.strftime('%A %B %d, %Y') ))


# display(Markdown('### Quick summary'))
# display(Markdown('There are  ' + f'{n_GMs:,}' + '  kaggle Grandmasters.'))
# display(Markdown('Out of these there are ' + f'{n_4xGM:,}' + ' quadruple Grandmasters, '
#                                            + f'{n_3xGM:,}' + ' triple Grandmasters, and '
#                                            + f'{n_2xGM:,}' + ' double Grandmasters.'))
# display(Markdown(""))


# # take a look
# GM_individual_Tiers["User"] = GM_individual_Tiers.UserName.map(lambda x: make_clickable_link(x))
# GM_individual_Tiers_styled  = GM_individual_Tiers[["User","Competitions","Datasets","Notebooks","Discussion","Country","GM_tier_count"]].sort_values(by='GM_tier_count', ascending=False).style.bar(subset=['GM_tier_count'], vmin=0, color='#ddaa17')
# GM_individual_Tiers_styled.hide_index()
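
As a minimal sketch of how a per-user Grandmaster count like `GM_tier_count` can be computed, the toy frame below counts the `'grandmaster'` entries in each row. The tier column names mirror those used in the cell above; the user names and tiers are made up for illustration. Note that this sketch restricts the comparison to the four tier columns, which is a slight variation on the whole-frame mask used above.

```python
import pandas as pd

# Hypothetical toy data: user names and tiers are made up for illustration
toy = pd.DataFrame({
    "UserName":     ["alice", "bob", "carol"],
    "Competitions": ["grandmaster", "master", "grandmaster"],
    "Datasets":     ["expert", "grandmaster", "grandmaster"],
    "Notebooks":    ["grandmaster", "contributor", "grandmaster"],
    "Discussion":   ["master", "expert", "grandmaster"],
})

tier_columns = ["Competitions", "Datasets", "Notebooks", "Discussion"]

# Boolean mask of 'grandmaster' cells, summed across each row
toy["GM_tier_count"] = (toy[tier_columns] == "grandmaster").sum(axis=1)

# Number of users holding exactly four GM titles
n_4xGM = int((toy["GM_tier_count"] == 4).sum())
print(toy[["UserName", "GM_tier_count"]])
```

Restricting the mask to the tier columns avoids accidentally counting the string `'grandmaster'` if it were ever to appear in an unrelated column.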



In [2]:
# display(Markdown('#### As a table:'))
# display(Markdown(""))

# summary = GM_individual_Tiers[['Competitions','Datasets','Notebooks','Discussion']].apply(pd.Series.value_counts)
# summary.loc[['grandmaster','master','expert','contributor']].style.background_gradient(cmap='YlOrBr', axis=None)

and as an alluvial diagram, where we shall take the liberty of colouring the streams according to the most numerous group of Grandmasters, the Competitions Grandmasters:

In [3]:
# import plotly.graph_objects as go
# # see 
# # https://plotly.github.io/plotly.py-docs/generated/plotly.graph_objects.Parcats.html

# tier_colours = {'grandmaster':'#ddaa17','master':'#f76629','expert':'#96508e','contributor':'#20beff'}
# # create a new column basing the colours on the Competitions GM's.
GM_individual_Tiers["colour_column"] = GM_individual_Tiers["Competitions"]
# replace the tier with the tier colour
# with_colours_df = GM_individual_Tiers.replace({"colour_column": tier_colours})
# color = with_colours_df.colour_column;

# # for a simple plot:
# #fig = px.parallel_categories(with_colours_df, 
# #                             dimensions=['Competitions','Datasets','Notebooks','Discussions'],
# #                             color=with_colours_df["colour_column"])

# # Create dimensions
# Competitions_dim = go.parcats.Dimension(
#     values=GM_individual_Tiers.Competitions,
#     categoryorder='array', categoryarray=['grandmaster','master','expert','contributor'], 
#     label="Competitions")

# Datasets_dim = go.parcats.Dimension(
#     values=GM_individual_Tiers.Datasets,
#     categoryorder='array', categoryarray=['grandmaster','master','expert','contributor'], 
#     label="Datasets")

# Notebooks_dim = go.parcats.Dimension(
#     values=GM_individual_Tiers.Notebooks,
#     categoryorder='array', categoryarray=['grandmaster','master','expert','contributor'], 
#     label="Notebooks")

# Discussions_dim = go.parcats.Dimension(
#     values=GM_individual_Tiers.Discussion,
#     categoryorder='array', categoryarray=['grandmaster','master','expert','contributor'], 
#     label="Discussions")

# fig = go.Figure(data = [go.Parcats(dimensions=[Competitions_dim, Datasets_dim, Notebooks_dim, Discussions_dim],
#         line={'color': color,
#               'shape': 'hspline'},
#         #labelfont={'size': 12, 'family': 'Times'},
#         #tickfont={'size': 12, 'family': 'Times'},
#         arrangement='freeform')])

# fig.show();

In [4]:
# display(Markdown('# Country ranking'))
# display(Markdown('Number of Grandmasters by country'))
# display(Markdown(""))

GM_countries = GM_individual_Tiers["Country"].value_counts().to_frame()
GM_countries['Ranking'] = GM_countries.rank(method="min",ascending=False).astype('int')
GM_countries.sort_values(by='Country', ascending=False)
# .style.bar(subset=['Country'], vmin=0, color='#ddaa17')

Country,Number of GMs,Ranking
United States,85,1
Japan,55,2
China,41,3
India,32,4
Russia,26,5
United Kingdom,17,6
France,12,7
Ukraine,11,8
Canada,10,9
Turkey,9,10


# GM 'influencers'
This is a rather arbitrary definition of 'influencer', but here we shall create a ranking of the GMs who have the most GM followers.

In [5]:
# Create a list of GM Id
GM_id_list = Users_GM['Id'].tolist()
# read in the Meta Kaggle followers file
followers = pd.read_csv("../input/UserFollowers.csv")
# filter by GM users. Note: "FollowingUserId" is the Id of the person being followed
GM_users = followers[followers['FollowingUserId'].isin(GM_id_list)]
# now filter by GM followers
GM_user_follower = GM_users[GM_users['UserId'].isin(GM_id_list)]
# remove self-following entries 
GM_user_follower = GM_user_follower[GM_user_follower['UserId'] != GM_user_follower['FollowingUserId']]

Id_mapping = dict(Users_GM[['Id','UserName']].values)
name_mapping = dict(Users_GM[['UserName','DisplayName']].values)

def make_clickable_link_by_Id(Id):
    UserName = Id_mapping.get(Id)
    DisplayName = name_mapping.get(UserName)
    return DisplayName

GM_influencers = GM_user_follower["FollowingUserId"].value_counts().to_frame().reset_index()
# rename the columns
GM_influencers.columns = ['GM_Id', 'number_GM_followers']
# create a "Ranking" column
GM_influencers['Ranking'] = GM_influencers.number_GM_followers.rank(method="min",ascending=False).astype('int')
# replace GM_Id by display name
GM_influencers["User"] = GM_influencers.GM_Id.map(lambda x: make_clickable_link_by_Id(x))
GM_influencers_styled  = GM_influencers[["Ranking","User","number_GM_followers"]].sort_values(by='number_GM_followers', ascending=False).style.bar(subset=['number_GM_followers'], vmin=0, color='#ddaa17')
# GM_influencers_styled.hide_index()
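
The filtering and ranking steps in the cell above can be sketched on a toy followers table. The column names `UserId` and `FollowingUserId` follow the ones used above; the Ids themselves are hypothetical. The sketch keeps only rows where both the follower and the followed user are GMs, drops self-follows, and ranks by follower count, with ties sharing the same rank.

```python
import pandas as pd

# Hypothetical toy data: Ids 1-3 play the role of GMs, Id 9 is not a GM
gm_ids = [1, 2, 3]
toy_followers = pd.DataFrame({
    "UserId":          [1, 2, 3, 9, 1, 2],   # the follower
    "FollowingUserId": [2, 1, 1, 1, 1, 3],   # the person being followed
})

mask = (
    toy_followers["UserId"].isin(gm_ids)               # follower is a GM
    & toy_followers["FollowingUserId"].isin(gm_ids)    # followed user is a GM
    & (toy_followers["UserId"] != toy_followers["FollowingUserId"])  # no self-follows
)

# Number of GM followers per followed GM
counts = toy_followers.loc[mask, "FollowingUserId"].value_counts()

# method="min" gives tied users the same (lowest) rank
ranking = counts.rank(method="min", ascending=False).astype(int)
print(pd.DataFrame({"number_GM_followers": counts, "Ranking": ranking}))
```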

and a totally and utterly useless (at the moment) graph made using [NetworkX](https://networkx.org/) 

In [6]:
# import networkx as nx
# fig, ax = plt.subplots(figsize=(20,12))
# G = nx.from_pandas_edgelist(GM_user_follower, 'UserId', 'FollowingUserId', create_using=nx.Graph())
# nx.draw(G, with_labels=False, node_color="#ddaa17")

In [None]:
followers.info()
Users.info()
MasterAchievements.info()
MasterProfiles.info()

# Time series
Number of GMs over time

(Note: temporarily commented out due to a possible bug in the `UserAchievements.csv` file, as per https://www.kaggle.com/discussions/product-feedback/438260)

In [7]:
UserAchievements = pd.read_csv('../input/UserAchievements.csv', nrows = 30000000)
# Quite a few kaggle staff in this section as Tier 4 (hence PerformanceTier == 5)
# .copy() so that the column assignments below act on an independent frame
GM_Achievements = UserAchievements.query('(Tier == 4) or (Tier == 5)').copy()

# UserAchievements is big, so now delete
del UserAchievements

GM_Achievements["TierAchievementDate"] = pd.to_datetime(GM_Achievements["TierAchievementDate"])
GM_Achievements = GM_Achievements.sort_values(by='TierAchievementDate', ascending=True)
GM_Achievements.set_index('TierAchievementDate', inplace=True)
GM_Achievements["one"] = 1

Users_GM     = Users.query('(PerformanceTier == 4) or (PerformanceTier == 5)')
# new name mapping to now include kaggle staff
name_mapping = dict(Users_GM[['UserName','DisplayName']].values)

Competitions = GM_Achievements.query("AchievementType == 'Competitions'")
Datasets     = GM_Achievements.query("AchievementType == 'Datasets'")
Scripts      = GM_Achievements.query("AchievementType == 'Scripts'")
Discussion   = GM_Achievements.query("AchievementType == 'Discussion'")

UserAchievements_GM_any  = GM_Achievements.drop_duplicates(subset=['UserId'])
# sum only the numeric "one" column to get a daily count, then accumulate
UserAchievements_GM_any  = UserAchievements_GM_any[["one"]].resample('D').sum().cumsum()
UserAchievements_GM_comp = Competitions[["one"]].resample('D').sum().cumsum()
UserAchievements_GM_data = Datasets[["one"]].resample('D').sum().cumsum()
UserAchievements_GM_note = Scripts[["one"]].resample('D').sum().cumsum()
UserAchievements_GM_disc = Discussion[["one"]].resample('D').sum().cumsum()

# fig, ax = plt.subplots(figsize=(20, 7))
# sns.lineplot(data=UserAchievements_GM_comp, x=UserAchievements_GM_comp.index,  y=UserAchievements_GM_comp["one"],  linewidth = 3, color='magenta',label="Competitions")
# sns.lineplot(data=UserAchievements_GM_any, x=UserAchievements_GM_any.index,  y=UserAchievements_GM_any["one"],  linewidth = 3, color='#ddaa17',label="Total GM")
# sns.lineplot(data=UserAchievements_GM_note, x=UserAchievements_GM_note.index,  y=UserAchievements_GM_note["one"],  linewidth = 3, color='green',label="Notebooks")
# sns.lineplot(data=UserAchievements_GM_disc, x=UserAchievements_GM_disc.index,  y=UserAchievements_GM_disc["one"],  linewidth = 3, color='red',label="Discussions")
# sns.lineplot(data=UserAchievements_GM_data, x=UserAchievements_GM_data.index,  y=UserAchievements_GM_data["one"],  linewidth = 3, color='blue',label="Datasets")
# ax.set(xlabel='Year', ylabel='Number of GM')
# # reordering the labels
# handles, labels = plt.gca().get_legend_handles_labels()
# # specify order
# order = [1,0,2,3,4]
# # pass handle & labels lists along with order as below
# plt.legend([handles[i] for i in order], [labels[i] for i in order]);
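
As a minimal sketch of the cumulative count built in the cell above, the toy example below tags each achievement with a `one` column, sums it per day with `resample('D')`, and then accumulates with `cumsum()`. The dates are made up for illustration; the column names follow those used above.

```python
import pandas as pd

# Hypothetical achievement dates, made up for illustration
toy = pd.DataFrame({
    "TierAchievementDate": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-03", "2020-01-04"]
    ),
})
toy["one"] = 1

# Daily number of new GMs; days with no new GM get a count of 0
daily = toy.set_index("TierAchievementDate")[["one"]].resample("D").sum()

# Running total of GMs over time
cumulative = daily.cumsum()
print(cumulative)
```

Resampling before accumulating fills in the days on which nobody was promoted, so the resulting curve has one point per calendar day.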



# Who's got the most Gold?
These tables include kaggle staff.
## Competitions "Top 10"

In [8]:
n_top = 10

Competitions_Top_n = Competitions.sort_values(by='TotalGold', ascending=False).head(n_top).copy()
Competitions_Top_n  = pd.merge(Competitions_Top_n, Users_GM, left_on = ['UserId'], right_on = ['Id'], how='left')

# replace GM_Id by display name
Competitions_Top_n["User"] = Competitions_Top_n.UserName.map(lambda x: make_clickable_link(x))
Competitions_Top_n_styled  = Competitions_Top_n[["User","TotalGold"]].sort_values(by='TotalGold', ascending=False).style.bar(subset=['TotalGold'], vmin=0, color='#ddaa17')
Competitions_Top_n_styled.hide(axis="index")



User,TotalGold
Giba,63
bestfitting,42
Μαριος Μιχαηλιδης KazAnova,41
Dieter,36
Psi,33
Stanislav Semenov,28
Guanshuo Xu,25
CPMP,25
ZFTurbo,24
Leustagos,23


## Datasets "Top 10"

In [9]:
Datasets_Top_n = Datasets.sort_values(by='TotalGold', ascending=False).head(n_top).copy()
Datasets_Top_n  = pd.merge(Datasets_Top_n, Users_GM, left_on = ['UserId'], right_on = ['Id'], how='left')

# replace GM_Id by display name
Datasets_Top_n["User"] = Datasets_Top_n.UserName.map(lambda x: make_clickable_link(x))
Datasets_Top_n_styled  = Datasets_Top_n[["User","TotalGold"]].sort_values(by='TotalGold', ascending=False).style.bar(subset=['TotalGold'], vmin=0, color='#ddaa17')
Datasets_Top_n_styled.hide(axis="index")



User,TotalGold
Kaggle Team,18
Ruchi Bhatia,14
Abhishek Thakur,14
Sourav Banerjee,14
Chris Crawford,13
Tensor Girl,13
Larxel,12
SRK,11
Chris Deotte,11
Paul Mooney,9


## Notebooks "Top 10"

In [10]:
Scripts_Top_n = Scripts.sort_values(by='TotalGold', ascending=False).head(n_top).copy()
Scripts_Top_n  = pd.merge(Scripts_Top_n, Users_GM, left_on = ['UserId'], right_on = ['Id'], how='left')

# replace GM_Id by display name
Scripts_Top_n["User"] = Scripts_Top_n.UserName.map(lambda x: make_clickable_link(x))
Scripts_Top_n_styled  = Scripts_Top_n[["User","TotalGold"]].sort_values(by='TotalGold', ascending=False).style.bar(subset=['TotalGold'], vmin=0, color='#ddaa17')
Scripts_Top_n_styled.hide(axis="index")



User,TotalGold
Chris Deotte,95
Abhishek Thakur,67
DanB,43
Andrew Lukyanenko,43
Andrada,41
AmbrosM,38
Y.Nakama,37
Awsaf,36
Rachael Tatman,35
Prashant Banerjee,35


## Discussions "Top 10"

In [11]:
Discussion_Top_n = Discussion.sort_values(by='TotalGold', ascending=False).head(n_top).copy()
Discussion_Top_n  = pd.merge(Discussion_Top_n, Users_GM, left_on = ['UserId'], right_on = ['Id'], how='left')

# replace GM_Id by display name
Discussion_Top_n["User"] = Discussion_Top_n.UserName.map(lambda x: make_clickable_link(x))
Discussion_Top_n_styled  = Discussion_Top_n[["User","TotalGold"]].sort_values(by='TotalGold', ascending=False).style.bar(subset=['TotalGold'], vmin=0, color='#ddaa17')
Discussion_Top_n_styled.hide(axis="index")



User,TotalGold
CPMP,496
Chris Deotte,463
hengck23,457
inversion,249
Addison Howard,234
Psi,161
Bojan Tunguz,154
Will Cukierski,151
Abhishek Thakur,151
Marília Prata,147


# Related kaggle notebooks
* ["Kaggle Grand/Masters Map"](https://www.kaggle.com/steubk/kaggle-grand-masters-map) written by [steubk](https://www.kaggle.com/steubk)
* ["Kaggle Grand/Masters Map (Notebooks)"](https://www.kaggle.com/code/steubk/kaggle-grand-masters-map-notebooks) written by [steubk](https://www.kaggle.com/steubk)
* ["Kaggle Grand/Masters Map (Competitions)"](https://www.kaggle.com/code/steubk/kaggle-grand-masters-map-competitions) written by [steubk](https://www.kaggle.com/steubk)
* ["Kaggle Grand/Masters Map (Discussion)"](https://www.kaggle.com/code/steubk/kaggle-grand-masters-map-discussion) written by [steubk](https://www.kaggle.com/steubk)
* ["Kaggle Grand/Masters Map (Datasets)"](https://www.kaggle.com/code/steubk/kaggle-grand-masters-map-datasets) written by [steubk](https://www.kaggle.com/steubk)
* ["Kaggle Grandmasters Map"](https://www.kaggle.com/code/gpreda/kaggle-grandmasters-map) written by [Gabriel Preda](https://www.kaggle.com/gpreda)
* ["Kaggle in Numbers"](https://www.kaggle.com/carlmcbrideellis/kaggle-in-numbers)
* ["Active veteran kaggle users"](https://www.kaggle.com/carlmcbrideellis/active-veteran-kaggle-users)
* ["A Meta Kaggler's Guide To Kaggle"](https://www.kaggle.com/steubk/a-meta-kaggler-s-guide-to-kaggle) written by [steubk](https://www.kaggle.com/steubk) 
* ["Meta Kaggle-Master Achievements Snapshot"](https://www.kaggle.com/steubk/meta-kagglemaster-achievements-snapshot) dataset by [steubk](https://www.kaggle.com/steubk) 
* ["Become GrandMaster"](https://www.kaggle.com/raenish/become-grandmaster) written by Raenish David
* ["EDA: Who will be the Next Kaggle Grandmaster?"](https://www.kaggle.com/japandata509/eda-who-will-be-the-next-kaggle-grandmaster) written by [Kaito](https://www.kaggle.com/japandata509)
* ["Meet The Grandmasters"](https://www.kaggle.com/sahidvelji/meet-the-grandmasters) written by [Sahid Velji](https://www.kaggle.com/sahidvelji)
* ["Meta Kaggle: Count User Activities"](https://www.kaggle.com/code/jtrotman/meta-kaggle-count-user-activities) written by [James Trotman](https://www.kaggle.com/jtrotman)
* ["Meta Kaggle Prize Money"](https://www.kaggle.com/datasets/jpmiller/meta-kaggle-moneyboard) - Details on competition prizes and the Kagglers who won them, a dataset by [JohnM](https://www.kaggle.com/jpmiller)