# Meta Kaggle - Preview CSV Files

An index of the CSV files in the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset. I use this as a reference to work out how to match up information from different files.

With smaller datasets you might want to try [Pandas Profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/index.html), but on the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset it would take much longer to run than this notebook. This is a simpler, lightweight alternative.


## Notes

All files have `Id` as the first column. Where `Id` appears as part of another column's name, it generally refers to the `Id` column of another file, and that file can usually be guessed from the column name. For example, `Id` in `Users.csv` is referred to by these other files (file name : column name):

 - DatasetVersions : CreatorUserId
 - DatasetVotes : UserId
 - Datasets : CreatorUserId
 - Datasets : OwnerUserId
 - Datasources : CreatorUserId
 - ForumMessageVotes : FromUserId
 - ForumMessageVotes : ToUserId
 - ForumMessages : PostUserId
 - KernelVersions : AuthorUserId
 - KernelVotes : UserId
 - Kernels : AuthorUserId
 - Submissions : SubmittedUserId
 - TeamMemberships : UserId
 - UserAchievements : UserId
 - UserFollowers : UserId
 - UserFollowers : FollowingUserId
 - UserOrganizations : UserId
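
The "guessable" naming convention in the list above can be sketched as a small heuristic. This is only an illustration, not part of the notebook: `guess_target` is a hypothetical helper, and the rule it applies (strip the trailing `Id`, then look for a table whose singular name ends the remainder) is an assumption that will not cover every column.

```python
from typing import Optional, Sequence


def guess_target(column: str, tables: Sequence[str]) -> Optional[str]:
    """Guess which table an `...Id` column points at (hypothetical heuristic)."""
    if not column.endswith("Id") or column == "Id":
        return None
    stem = column[:-2]  # e.g. 'CreatorUserId' -> 'CreatorUser'
    # Try longer table names first so the most specific match wins.
    for table in sorted(tables, key=len, reverse=True):
        singular = table[:-1] if table.endswith("s") else table
        if stem.endswith(singular):
            return table
    return None


# Table names taken from the file names mentioned in this notebook.
tables = ["Users", "Datasets", "Kernels", "ForumMessages", "Organizations"]
for col in ["CreatorUserId", "ForumMessageId", "FollowingUserId", "ScriptId"]:
    print(f"{col} -> {guess_target(col, tables)}")
```

Columns that use a legacy name, such as `ScriptId`, find no match under this rule and need a manual mapping.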

*Notebooks* were previously called *Kernels*, and before that *Scripts*, so `ScriptId` in *KernelVersions* actually refers to the `Id` in the *Kernels* table.

Essentially: `ScriptId` in *KernelVersions* can be renamed `KernelId`.
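
A minimal sketch of that rename followed by a join, using two toy DataFrames in place of the real files. The row values are made up; only the column names `Id`, `ScriptId` and `AuthorUserId` come from the files listed above.

```python
import pandas as pd

# Toy stand-ins for Kernels.csv and KernelVersions.csv (values are made up).
kernels = pd.DataFrame({"Id": [1, 2], "AuthorUserId": [10, 20]})
kernel_versions = pd.DataFrame(
    {"Id": [100, 101, 102], "ScriptId": [1, 1, 2], "AuthorUserId": [10, 10, 20]}
)

# Rename the legacy column so the join key is self-explanatory.
kernel_versions = kernel_versions.rename(columns={"ScriptId": "KernelId"})

# Attach each version to its parent kernel; suffixes disambiguate shared names.
merged = kernel_versions.merge(
    kernels,
    left_on="KernelId",
    right_on="Id",
    suffixes=("_version", "_kernel"),
)
print(merged[["Id_version", "KernelId", "AuthorUserId_kernel"]])
```

Both tables have `Id` and `AuthorUserId` columns, so the suffixes keep the version-level and kernel-level values apart after the merge.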

## Tips

Change the constants in the first cell and you can adapt this notebook to create an index listing any set of CSV files.

In [1]:
import os, sys, re, time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path, PosixPath
from IPython.display import HTML, display

CSV_PATH = Path('../input/meta-kaggle')
NA_HTML = '<font color=#c0c0c0>?</font>'  # grey '?' means N/A
NROWS = 30_000_000 # due to KernelVersionOutputFiles.csv

In [2]:
plt.rc("figure", figsize=(12, 9))
plt.rc("font", size=14)


# Could use DataFrame.style but it would interpret forum message html.
# (ForumMessages.Message Field)
# I want to show that field has raw html...
def df_to_html(df):
    nan_str = '__missing_value_na__'
    html = df.to_html(na_rep=nan_str, notebook=True)
    # to_html escapes html chars, so do a replacement afterwards
    html = html.replace(nan_str, NA_HTML)
    return html


def make_stats(df: pd.DataFrame):
    stats = df.describe(include='all').T
    stats['count'] = stats['count'].astype(int)
    if 'freq' not in stats.columns:
        stats.insert(1, 'freq', np.nan)
    if 'top' not in stats.columns:
        stats.insert(1, 'top', np.nan)
    if 'unique' not in stats.columns:
        stats.insert(1, 'unique', np.nan)
    stats['unique'] = df.nunique()
    stats.insert(0, 'dtype', df.dtypes)
    # add 'top' and 'freq' for numerical columns too
    for c in df.select_dtypes(['number', bool]).columns:
        vc = df[c].value_counts(dropna=False)
        # only show if truly most frequent
        if vc.values[0] > 1:
            stats.loc[c, 'top'] = vc.index[0]
            stats.loc[c, 'freq'] = vc.values[0]
    return stats


# Shows stats and a small sample of a DataFrame loaded from CSV.
# Works on any CSV file, not just Meta Kaggle.
def preview(csv: PosixPath):
    name = csv.with_suffix("").name
    df = pd.read_csv(csv, nrows=NROWS, low_memory=False, dtype={'Id': 'int32'})
    stats = make_stats(df)
    lines = []
    write = lines.append
    write(f'<h1 id="{name}">{name}</h1>')
    write('<h2>Stats</h2>')
    write(f'<p>Rows: {df.shape[0]}')
    write(f'<br/>Columns: {df.shape[1]}')
    write(f'<br/>Memory usage: {df.memory_usage().sum()/(1024**2):,.3f} MiB</p>')
    write(df_to_html(stats))
    write(f'<h2>{name} &mdash; Sample</h2>')
    # sample() raises if n exceeds the row count, so cap it for tiny files
    write(df_to_html(df.sample(n=min(5, len(df)), random_state=42).T))
    write(f'<hr/>')
    display(HTML('\n'.join(lines)))

In [3]:
def list_all_ids(csv: PosixPath):
    df = pd.read_csv(csv, nrows=5)
    name = csv.with_suffix('').name
    print()
    for c in df.columns:
        if 'Id' in c:
            print(f'{name} : {c}')

In [4]:
csvs = sorted(CSV_PATH.glob('*.csv'))

# File Listing

See the file sizes and dates.

In [5]:
!ls -l {CSV_PATH}

# Line Counts

Including header row. Beware some files may have multi-line (quoted) fields. KernelVersionOutputFiles.csv is **huge**!

In [6]:
!wc -l {CSV_PATH}/*.csv
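
The caveat about multi-line quoted fields can be seen on a tiny in-memory example. This is a sketch with made-up data, using only the standard library `csv` module: it counts newline characters (what `wc -l` reports) and compares that with the number of parsed records.

```python
import csv
import io

# A tiny CSV with one quoted field that spans two physical lines (made-up data).
text = 'Id,Message\n1,"first line\nsecond line"\n2,plain\n'

physical_lines = text.count("\n")           # what `wc -l` would report
rows = list(csv.reader(io.StringIO(text)))  # parsed records, header included

print(physical_lines, len(rows))
```

Here the text has four physical lines but only three records, so line counts are an upper bound on row counts rather than an exact figure.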

# Id Scheme

These are the `Id` columns of each file. The first one refers to the entity in the file itself; the other `Id`s are references to other entities (usually in other files):

In [7]:
for csv in csvs:
    list_all_ids(csv)

# All CSVs

Here are previews of the tables.

In [8]:
for csv in csvs:
    preview(csv)

# Anomalies

2020-12-23: Kaggle has fixed the ForumMessageVotes duplicate entries problem! (See [version 3][1])

[1]: https://www.kaggle.com/jtrotman/meta-kaggle-preview-csv-files?scriptVersionId=47238482

# UserAchievements - Users Missing

The previews above demonstrate another [issue](https://www.kaggle.com/kaggle/meta-kaggle/discussion/181048):

The maximum `UserId` in `UserAchievements.csv` is 3.88282e+06.

In `Users.csv` there are 5960127 unique Ids and the max Id is 6.42404e+06.

This suggests that ~7% of users have been removed or have deleted their accounts.

In [9]:
5960127 / 6.42404e+06

In [10]:
1 - 5960127 / 6.42404e+06

In [11]:
userAchievements = pd.read_csv(CSV_PATH / 'UserAchievements.csv')
userAchievements.shape

In [12]:
userAchievements['TierAchievementDate'] = pd.to_datetime(
    userAchievements['TierAchievementDate'], format="%m/%d/%Y", cache=True)

The very last `TierAchievementDate` value in the *file* is 2019-10-18.

In [13]:
userAchievements.tail()

Not all of the top 100 users are there.

In [14]:
userAchievements.query("CurrentRanking<=100").groupby("AchievementType").size()

***Many*** fairly useless rows? If points are zero they could be left out to save storage space...

In [15]:
userAchievements.query("Points==0").groupby("AchievementType").size()

In [16]:
colors = {'Competitions':'r', 'Discussion':'g', 'Scripts':'b'}

In [17]:
userAchievements['Color'] = userAchievements['AchievementType'].map(colors)

In [18]:
df = userAchievements  # .sample(n=50000).sort_index()
df.shape

It appears that new `UserId`s are appended to the end of the file.

In [19]:
pargs = dict(c=df.Color, s=1, alpha=0.3)

In [20]:
df.plot.scatter('Id', 'UserId', title='UserIds in UserAchievements', **pargs);
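
A numeric complement to the scatter plot, as a rough sketch: it measures how often a row's `UserId` is at least as large as every `UserId` before it. The toy frame `ua` and its values are made up; on the real file the same two lines could be applied to `df`, assuming it is still in file order.

```python
import pandas as pd

# Toy stand-in for UserAchievements.csv in file order (made-up values).
ua = pd.DataFrame({"Id": [1, 2, 3, 4, 5, 6],
                   "UserId": [5, 5, 9, 9, 7, 12]})

# A row is "at the frontier" if its UserId is at least the largest seen so far.
running_max = ua["UserId"].cummax()
at_frontier = ua["UserId"] >= running_max
print(f"{at_frontier.mean():.0%} of rows are at or above the running maximum")
```

A share close to 100% would support the idea that the file mostly grows by appending new users; a low share would suggest rows are inserted in some other order.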

Newer dates *are* in there

In [21]:
df.plot.scatter('Id', 'TierAchievementDate', title='TierAchievementDates in UserAchievements', **pargs);

When a newer date does appear in the file, it is for an older user.

In [22]:
df.plot.scatter('TierAchievementDate', 'UserId', title='UserIds by TierAchievementDate in UserAchievements', **pargs);

So essentially, only users with an `Id` over 3882819 are missing: apparently all new users since 2019-10-18.
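
One way this conclusion could be checked directly is an anti-join between the two tables. The sketch below uses toy frames with made-up values in place of `Users.csv` and `UserAchievements.csv`; it is an illustration of the check, not a result from the real data.

```python
import pandas as pd

# Toy stand-ins for Users.csv and UserAchievements.csv (made-up values).
users = pd.DataFrame({"Id": [1, 2, 3, 4, 5]})
achievements = pd.DataFrame({"UserId": [1, 1, 2, 3, 3]})

# Users with no row at all in the achievements table.
missing = users.loc[~users["Id"].isin(achievements["UserId"]), "Id"]

max_present = achievements["UserId"].max()
print("max UserId present:", max_present)
print("missing Ids:", missing.tolist())
print("all missing Ids above the max:", bool((missing > max_present).all()))
```

If the last line printed `False` on the real files, some users below the cut-off would also be absent, and the "only new users are missing" reading would need revisiting.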

# Further Observations

Why so many user exclusions? Perhaps one reason: in [version 1][2], the sample shown of the Organizations table has two genuine rows and three rows of hacking attempts and/or spam, which shows why the *Organizations* feature was disabled. *Life, uhh... finds a way*.

[2]: https://www.kaggle.com/jtrotman/meta-kaggle-preview-csv-files?scriptVersionId=42387571