# Don't Mention It!?

____

Don't mention what? In other words: which words *never* appear on the Kaggle Forums?

To answer, we need:
1. some external idea of words and how often they *should* appear
2. to scrape all the Kaggle forums!

For #1 I will use the [GloVe: Global Vectors for Word Representation](https://www.kaggle.com/rtatman/glove-global-vectors-for-word-representation) dataset, and #2 has been done for us!
Kaggle include the HTML for their forums in the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset :-)

____


### Alternative Titles

- What Are You ***Not*** Talking About? [\*](https://www.kaggle.com/jtrotman/what-are-you-talking-about)
- Kagglers Don't Say “❋❋❋❋❋❋”?
- Best Left Unsaid
- Words That Do Not Appear

____

## Contents

 * [Read Forums](#Read-Forums)
 * [Language Model - GloVe](#Language-Model---GloVe)
 * [Parse Forums - Spacy](#Parse-Forums---Spacy)
 * [Plot Word Counts](#Plot-Word-Counts)
 * [Zipf's Law - Log Rank vs Log Count](#Zipf's-Law---Log-Rank-vs-Log-Count)
 * [Regression](#Regression)
 * [Regression Result](#Regression-Result)
 * [Unknown Words](#Unknown-Words)
 * [The Most @mentioned Users!](#The-Most-@mentioned-Users!)
 * [Over Represented](#Over-Represented)
 * [Under Represented](#Under-Represented)
 * [Words That Do Not Appear](#Words-That-Do-Not-Appear)
 * [Scatter Plot Embeddings](#Scatter-Plot-Embeddings)
 * [Conclusions](#Conclusions)
 * [See Also](#See-Also)

In [1]:
!pip install beautifulsoup4 stylecloud -q -q

In [2]:
import re, os, sys
import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd
import numpy as np
import stylecloud
from pathlib import Path
from IPython.display import HTML, Image, display
from bs4 import BeautifulSoup
from collections import Counter
from sklearn.isotonic import IsotonicRegression
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import spacy

MK = Path('../input/meta-kaggle')

glove_file = '../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt'

def log_log_plot():
    plt.yscale('log')
    plt.xscale('log')
    plt.ylim(bottom=0.5)
    plt.xlim(left=0.5)

def parse_html(r):
    bs = BeautifulSoup(r, 'html')
    for block in bs('code'):
        block.decompose()
    txt = bs.get_text()
    txt = re.sub(r'\[quote.*\[/quote\]', ' ', txt, flags=re.S)
    txt = re.sub(r'\s+', ' ', txt)
    txt = txt.strip()
    return txt

In [3]:
# Thanks to Sam Stoltenberg: https://skelouse.github.io/styling_a_jupyter_notebook
display(HTML("""<style>.container { max-width:100% !important; }
.output_result { max-width:100% !important; }
.output_area { max-width:100% !important; }
.input_area { max-width:100% !important; }
h1 {
  border: 3px solid #333;
  padding: 8px 12px;
  background-image: linear-gradient(180deg, rgb(160, 147, 147), #fff);
  position: static;
}
</style>"""))

plt.rcParams["figure.figsize"] = (12, 9)
plt.rcParams["figure.facecolor"] = "#FFFFFF"
plt.rcParams["axes.facecolor"] = "#E0E0E0"
plt.rcParams["font.size"] = 14
plt.rcParams["axes.grid"] = True
plt.rcParams["grid.alpha"] = 0.5
plt.rcParams["grid.linestyle"] = "--"

# Read Forums

Topic titles and messages are stored in separate tables.

In [4]:
topics = pd.read_csv(MK / 'ForumTopics.csv')
topics = topics.dropna(subset=['Title'])
topics = topics.set_index('Id')
topics.shape

In [5]:
msgs = pd.read_csv(MK / 'ForumMessages.csv')
msgs = msgs.dropna(subset=['Message'])
msgs = msgs.set_index('Id')
msgs.shape

In [6]:
%%time
text1 = ('<html>' + msgs['Message'].str.lower() + '</html>').apply(parse_html)
text2 = ('<html>' + topics['Title'].str.lower() + '</html>').apply(parse_html)
# Series.append was removed in pandas 2.0; pd.concat works on old and new versions
text = pd.concat([text1, text2])

# Language Model - GloVe
____

https://nlp.stanford.edu/projects/glove/

The words are frequency ordered, most common first.

The words come from:
 -  <a href="http://dumps.wikimedia.org/enwiki/20140102/">Wikipedia 2014</a> + <a href="https://catalog.ldc.upenn.edu/LDC2011T07">Gigaword 5</a> (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, &amp; 300d vectors, 822 MB download): <a href="http://nlp.stanford.edu/data/glove.6B.zip">glove.6B.zip</a>

Why use an old model? See the conclusion for details...
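
Each line of a GloVe text file is a token followed by its vector components, separated by spaces. Here is a minimal sketch of the parsing on a made-up three-line sample; the 4-dimensional values are invented for illustration and are not real GloVe weights.

```python
import io
import numpy as np

# Hypothetical sample in GloVe text format: token, then space-separated floats.
sample = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    ", 0.5 0.6 0.7 0.8\n"
    "of 0.9 1.0 1.1 1.2\n"
)

toy_embeddings = {}
for line in sample:
    values = line.split()
    toy_embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

# Dicts keep insertion order, so file position doubles as a frequency rank.
toy_rank = {word: i + 1 for i, word in enumerate(toy_embeddings)}
```

Because the file is frequency ordered, the rank is the only "how often should this word appear" signal we get from the model.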

In [7]:
embedding_dict = {}
with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], 'float32')
        embedding_dict[word] = vectors

In [8]:
words = list(embedding_dict.keys())
words[:10]

In [9]:
len(words)

In [10]:
word_set = set(words)

# Parse Forums - Spacy

Simply count all words.
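
As a rough sketch of the counting step, here is the same `Counter.update` pattern on three invented posts, with a plain whitespace split standing in for the spaCy tokenizer (an assumption made to keep the example dependency-free).

```python
from collections import Counter

# Hypothetical forum posts, already lower-cased.
toy_posts = ["great kernel thanks", "thanks for sharing", "great thanks"]

toy_counts = Counter()
for post in toy_posts:
    # update() adds one to the tally of every token in the list
    toy_counts.update(post.split())
```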

In [11]:
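# The 'en' shortcut was removed in spaCy 3; the full model name works in both v2 and v3
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'tagger'])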

In [12]:
counts = Counter()

In [13]:
%%time
# n_threads has been deprecated since spaCy 2.1 and is not accepted by spaCy 3
docs = nlp.pipe(text)
for doc in docs:
    counts.update([w.text for w in doc])

In [14]:
counts.most_common(10)

Create a DataFrame called `oom` which stands for *out of model*

In [15]:
oom = pd.Series({w:c for w, c in counts.most_common() if w not in word_set})
oom.index.name = 'word'
oom = oom.to_frame('count')
oom = oom.reset_index()
oom.shape

Main DataFrame for storing results - count just the words from the language model.
The `index` column is the order in which they appear in the GloVe embeddings.
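
The step relies on pandas index alignment: assigning a Series built from the counter keeps only words in the model vocabulary and leaves `NaN` for vocabulary words that were never counted. A small sketch with an invented vocabulary and invented counts:

```python
from collections import Counter
import pandas as pd

toy_vocab = ["the", "of", "tennis"]                      # hypothetical model vocabulary, in file order
toy_counts = Counter({"the": 5, "of": 2, "kaggle": 9})   # hypothetical forum counts

toy_df = pd.DataFrame(index=pd.Index(toy_vocab, name="word"))
toy_df["count"] = pd.Series(toy_counts)   # aligns on the index; "kaggle" is dropped
toy_df.insert(0, "index", list(range(1, len(toy_df) + 1)))

# Share of vocabulary words that never appeared
unseen_share = toy_df["count"].isnull().mean()
```

Here `tennis` ends up with a `NaN` count, which is exactly what the later "words that do not appear" section filters on.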

In [16]:
df = pd.DataFrame(index=pd.Index(words, name='word'))

In [17]:
df['count'] = pd.Series(counts)

In [18]:
df.insert(0, 'index', 1 + np.arange(len(df)))

In [19]:
df.head()

In [20]:
df.describe().T

In [21]:
df = df.reset_index()

In [22]:
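# Restrict to the numeric columns: recent pandas raises on the string 'word' column
df[['index', 'count']].corr(method='spearman').style.background_gradient()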

About 80% of the 400k tokens are yet to appear on Kaggle forums

In [23]:
df['count'].isnull().mean()

# Plot Word Counts

The early words with a low index appear much more frequently.

In [24]:
df.plot.scatter('index',
                'count',
                logy=True,
                alpha=0.05,
                title='GloVe word counts within Kaggle Forums');

# Zipf's Law - Log Rank vs Log Count

It's best seen on a log-log plot.
In log-log space it looks like a linear relationship: this effect is known as [Zipf's law][1].
We don't have the original counts of words from the training (Wikipedia) corpus, but we could approximate them.

If we fit a line to the data shown here, then predictions of *count* (using the line) compared to the real *count* tell you if the word is over or under represented.

[1]: https://en.wikipedia.org/wiki/Zipf%27s_law
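
To illustrate why a straight line is a sensible model here, this sketch builds synthetic counts that follow Zipf's law exactly (count = C / rank^s, with constants chosen arbitrarily for illustration) and recovers the exponent from a line fit in log-log space. It uses `np.polyfit` purely as a demonstration; it is not the fitting method used in this notebook.

```python
import numpy as np

# Synthetic counts that obey Zipf's law exactly: count = C / rank**s
ranks = np.arange(1, 1001)
C, s = 1_000_000.0, 1.0   # assumed constants, for illustration only
zipf_counts = C / ranks**s

# In log-log space the relation is linear: log(count) = log(C) - s * log(rank)
slope, intercept = np.polyfit(np.log(ranks), np.log(zipf_counts), deg=1)
```

The fitted slope is the negated exponent and `exp(intercept)` is the constant, so a word sitting far above or below the line is over or under represented relative to its rank.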


In [25]:
df.plot.scatter('index',
                'count',
                alpha=0.3,
                s=2,
                c='r',
                title='GloVe word counts within Kaggle Forums - log/log')
log_log_plot();

In [26]:
df['count'].fillna(0).cumsum().plot(logx=True, title='Cumulative GloVe word counts within Kaggle Forums')
plt.ylabel('Cumulative Count')
plt.xlabel('Word Index');

# Regression

There are many ways to fit a line through that data; here's just one of them (chosen because it's quick and easy and needs no tuning).

Using [IsotonicRegression][1] we can fit a non-increasing line from the index to the count: as the index goes up, the line either stays the same or drops to a lower count.


 [1]: https://en.wikipedia.org/wiki/Isotonic_regression
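
A toy sketch of what the non-increasing fit does, on eight invented points: wherever a count is higher than the one before it, the neighbouring values are pooled into their average so the fitted line never rises.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy data: noisy counts that broadly fall as the index rises.
toy_index = np.arange(1, 9)
toy_count = np.array([90.0, 100.0, 40.0, 50.0, 10.0, 0.0, 5.0, 0.0])

iso = IsotonicRegression(increasing=False)
toy_fit = iso.fit(toy_index, toy_count).predict(toy_index)

# The fit is a step function: flat over each pooled block, dropping between blocks.
is_non_increasing = bool(np.all(np.diff(toy_fit) <= 0))
```

This is why the result looks like a staircase rather than a smooth curve, and why it needs no tuning: there are no parameters beyond the direction of the constraint.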


In [27]:
basis = df['index'].values
targets = df['count'].fillna(0).values

In [28]:
%%time
r = IsotonicRegression(increasing=False)
r.fit(basis, targets)
preds = r.predict(basis)

In [29]:
def set_predictions(p):
    df['expected'] = p
    # avoid zeros
    was_zero = (df['expected'] == 0)
    min_nonzero = df.loc[~was_zero, 'expected'].min()
    df.loc[was_zero, 'expected'] = min_nonzero
    print('Zeros', was_zero.sum(), 'Min', min_nonzero)
    df['ratio'] = df['count'] / df['expected']

In [30]:
set_predictions(preds)

Check stats...

In [31]:
df.describe().T

Plot real counts against predictions of count

In [32]:
df.plot.scatter('count',
                'expected',
                alpha=0.1,
                title='Word counts vs prediction');
coords = np.arange(df['count'].max())
plt.plot(coords, coords, linestyle='--', linewidth=1, color='k', alpha=0.5)
log_log_plot();

# Regression Result

Here is our line through the data.
Not a lovely smooth mathematical line but a wiggly approximation - good enough, as an engineer would say :D

We just want an approximation to the original distribution of the Wikipedia word counts.
Clearly some points are way over and some way under - which is the result of interest.
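
A small sketch of how the "way over" and "way under" words fall out of the ratio of real count to fitted count. The words, counts and expected values below are all invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "word": ["kernel", "minister", "overfitting", "tennis"],
    "count": [800.0, 2.0, 300.0, np.nan],    # NaN: never seen on the forums
    "expected": [20.0, 200.0, 3.0, 50.0],    # invented fitted values
})
toy["ratio"] = toy["count"] / toy["expected"]

# Highest ratio first; sort_values puts NaN ratios (unseen words) last by default.
over = toy.sort_values("ratio", ascending=False)["word"].tolist()
```

Unseen words have no ratio at all, which is why they get their own section, ranked by the expected count instead.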

In [33]:
df.plot.scatter('index',
                'count',
                alpha=0.2,
                s=2,
                c='r',
                title='GloVe word counts within Kaggle Forums - log/log')
plt.plot(df['index'], df['expected'])
log_log_plot();

In [34]:
N_SHOW = 500


def fmt_exp(x):
    return ('<mark title="#{rank}\n'
            'Expected: {expected:.6f}">'
            '{word}'
            '</mark>').format(**x)

def fmt_count(x):
    return ('<mark title="#{rank}\n'
            'Count: {count:.0f}">'
            '{word}'
            '</mark>').format(**x)

def fmt_full(x):
    return ('<mark title="#{rank}\n'
            'Count: {count:.0f}\n'
            'Expected: {expected:.6f}\n'
            'Ratio: {ratio:.6f}" '
            'id="{word}">'
            '{word}'
            '</mark>').format(**x)

def html_report(df, formatter):
    rank = range(1, len(df) + 1)
    df = df.assign(rank=rank)
    return ', '.join(df.apply(formatter, 1))

# Unknown Words

First a table, then the top words: **hover over them to see the appearance count**.

The most prevalent words in the forum that are not in the GloVe language model. We won't see these in the later sections.

It's surprising that <mark>kaggle</mark> did not make the cut, but the model is from 2014...
It's not surprising that <mark>pytorch</mark> is not there because it was released in 2016 :)

[1]: https://en.wikipedia.org/w/index.php?title=Kaggle&oldid=427204425
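
The hover effect is plain HTML: each word is wrapped in a `<mark>` tag whose `title` attribute holds the tooltip text. A minimal sketch follows, with a hypothetical helper `fmt_hover` and invented counts. Unlike the formatters in this notebook, it escapes the word with `html.escape`, which is an addition that keeps tokens such as `<3` from being read as markup.

```python
import html

def fmt_hover(word, count):
    """Wrap a word in <mark>, with the count as a hover tooltip (hypothetical helper)."""
    return '<mark title="Count: {count:.0f}">{word}</mark>'.format(
        count=count, word=html.escape(word))

# Invented (word, count) pairs
snippet = ", ".join(fmt_hover(w, c) for w, c in [("kaggle", 1234), ("<3", 56)])
```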

In [35]:
oom.head()

In [36]:
HTML(html_report(oom.head(N_SHOW), fmt_count))

### Single characters league table!

***Kaggle is a Happy Place #1*** : The most frequent emoticons are overwhelmingly cheery!

Gold is the rarest medal and bronze the commonest, but in number of mentions: 🎉>🔥>🏆>🥇>🥈>🥉>🍩, and nothing beats a 👍

In [37]:
HTML(html_report(oom[oom.word.str.len() == 1], fmt_count))

# The Most @mentioned Users!

This is quite correlated to the discussion rankings, but note how @cpmp would be #1 or #2 but *many* times people use his display name instead of his user name.
Similarly for @heng instead of @hengck23 and others.

In [38]:
HTML(html_report(oom[oom.word.str.startswith('@')].head(100), fmt_count))

# Over Represented

First a table, then the top words: **hover over them to see the count, the expected, and the ratio**.

Words appearing on Kaggle more often than we'd expect based on the GloVe data.
Do these words conjure up the 'essence' of Kaggle in *your* mind?

Just some brief highlights...

<a href="#quora"><mark>quora</mark></a>
and
<a href="#lyft"><mark>lyft</mark></a>
deserve to be high up in the list, each having run two competitions.

Kaggle/ML legends:
<a href="#giba"><mark>giba</mark></a>,
<a href="#srk"><mark>srk</mark></a>,
<a href="#tatman"><mark>tatman</mark></a>,
<a href="#scirpus"><mark>scirpus</mark></a>,
<a href="#triskelion"><mark>triskelion</mark></a>,
<a href="#marília"><mark>marília</mark></a>,
<a href="#shivam"><mark>shivam</mark></a>,
<a href="#serigne"><mark>serigne</mark></a> and
<a href="#chollet"><mark>chollet</mark></a>

#### Kaggle is a Happy Place #2

The majority of emoticons are happy ones, and the first to appear:
<mark>:-)</mark>
<mark>:)</mark>
<mark>;)</mark>
<mark>:P</mark>

are all ranked higher than the first sad one:
<mark>:(</mark>


In [39]:
top = df.sort_values('ratio', ascending=False)
top.head()

In [40]:
HTML(html_report(top.head(N_SHOW), fmt_full))

In [41]:
FONT_PATH = '/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf'


stylecloud.gen_stylecloud(
    text=top.head(200).set_index('word')['count'].to_dict(),
    icon_name='fas fa-comment-alt',
    background_color='#303030',
    colors=["#2ECB99", "#00BFF9", "#9A5289", "#FF6337", "#DFA848"],
    size=600,
    font_path=FONT_PATH,
    stopwords=False,
    random_state=42)

display(HTML('<h3>Word Cloud</h3>'))
Image('stylecloud.png')

# Under Represented


Words that have appeared but with far less frequency than "expected".

In [42]:
top = df.sort_values('ratio', ascending=True)
top.head()

In [43]:
HTML(html_report(top.head(N_SHOW), fmt_full))

# Words That Do Not Appear

This was the original idea: which words are not there at all?

The top words are an artefact of using a different parser from GloVe's; words containing punctuation would perhaps be better off removed.

I can see definite politics / warfare / tennis themes.

It may be that these words have appeared but the post was reported/deleted.
It could also be the case that messages in Meta Kaggle are filtered, from: https://www.kaggle.com/kaggle/meta-kaggle
 - *Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.*

Seeing
<a href="#iranians"><mark>iranians</mark></a> and
<a href="#cuba"><mark>cuba</mark></a>
in the list jogged my memory about the general competition rules:

<blockquote>
COMPETITIONS ARE OPEN TO RESIDENTS OF THE UNITED STATES AND WORLDWIDE, EXCEPT THAT IF YOU ARE A RESIDENT OF CRIMEA, <b>CUBA, IRAN</b>, SYRIA, NORTH KOREA, SUDAN, OR ARE SUBJECT TO U.S. EXPORT CONTROLS OR SANCTIONS, YOU MAY NOT ENTER THE COMPETITION.

...
</blockquote>
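
On the punctuation point above: one possible filter (a sketch, not something this notebook applies) keeps only tokens made up entirely of letters. The candidate list is invented; note that this simple rule also drops hyphenated words, which may or may not be wanted.

```python
# Hypothetical tokens of the kind that differ between tokenizers
candidates = ["u.s.", "tennis", "'s", "n't", "cuba", "non-profit"]

# str.isalpha() is True only when every character is a letter.
letters_only = [w for w in candidates if w.isalpha()]
```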


In [44]:
top = df.loc[df['count'].isnull()].sort_values('expected', ascending=False)
top.head()

In [45]:
HTML(html_report(top.head(N_SHOW), fmt_exp))

# Scatter Plot Embeddings

We can use the word embeddings for the words that do not appear: reduce them to two dimensions with t-SNE, then visualise them to see the natural clusters. These are essentially topics that have *never* arisen amongst the Kaggle crowd in the 10 years the site has been running.
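
A toy sketch of the same two steps on random vectors instead of real embeddings. As in the cells below, the clustering runs on the original high-dimensional vectors and t-SNE is used only to get plotting coordinates. The data, the cluster count and the `perplexity`, `init` and `random_state` settings are assumptions picked to suit a 60-point example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three hypothetical "topics": 20 random 10-d vectors around each of three centres.
centres = rng.normal(size=(3, 10)) * 5
toy_vecs = np.vstack([c + rng.normal(size=(20, 10)) for c in centres])

# perplexity must be smaller than the number of samples (60 here).
toy_2d = TSNE(metric="cosine", init="random", perplexity=10,
              random_state=0).fit_transform(toy_vecs)

# Cluster in the original space; the labels are only used to colour the 2-d plot.
toy_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_vecs)
```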

In [46]:
top = top.head(1000)
x = np.vstack([embedding_dict[w] for w in top.word])
x.shape

In [47]:
tsne = TSNE(metric='cosine')
x2 = tsne.fit_transform(x)

In [48]:
km = KMeans(n_clusters=20)
km.fit(x)

In [49]:
pdf = pd.DataFrame(x2, index=top.word)
pdf = pdf.add_prefix('tsne')
pdf['cluster'] = km.labels_
pdf['cluster'] = 'C' + pdf['cluster'].astype(str)
pdf = pdf.reset_index()

In [50]:
fig = px.scatter(pdf,
                 title='Unseen Words',
                 x='tsne0',
                 y='tsne1',
                 hover_name='word',
                 color='cluster')
fig.update_layout(showlegend=False)

### Let the game commence :)

There are many words listed.
What insights did I miss out above?
Hmmm.
If you comment under this notebook that you are surprised a word does not appear - it will then be on the forums, it will go into [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) within 24 hours and so ***disappear*** from this list the next time I run the notebook 😂

____

Maybe there are a few words you might think are genuinely underrepresented but worthy topics.
You could fork the notebook and show a much bigger list.
Is there an appetite out there for more tennis stats and discussion?
Perhaps this will inspire some new datasets?

The counts are saved below in case you want to dig deeper!

In [51]:
df.to_csv('glove_word_forum_counts.csv', index=False)
oom.to_csv('out_of_vocabulary_words.csv', index=False)

# Conclusions

Why did I use GloVe 6B?
I tried different options, but with larger vocabulary models on different training texts the 'unseen' words were nearly all obscenities!
(You can fork this notebook and try the [glove.840B.300d.txt](https://www.kaggle.com/takuok/glove840b300dtxt) embeddings to see the effect - but ***please do not make it public!***)

So above are *some* words we are not seeing on the forums, but step back a level: in a very *meta* way, we are not seeing *these* words in this Notebook:

<mark>a&#42;&#42;&#42;&#42;&#42;&#42;s</mark>,
<mark>d&#42;&#42;&#42;&#42;e</mark>,
<mark>f&#42;&#42;&#42;s</mark>,
<mark>t&#42;&#42;s</mark>,
<mark>t&#42;&#42;&#42;&#42;o</mark>,
<mark>y&#42;&#42;&#42;e</mark>,
<mark>w&#42;&#42;&#42;&#42;&#42;t</mark>,
<mark>w&#42;&#42;&#42;t</mark>,
<mark>d&#42;&#42;&#42;s</mark>,
<mark>b&#42;&#42;&#42;&#42;&#42;s</mark>,
<mark>b&#42;&#42;&#42;r</mark>,
<mark>m&#42;&#42;&#42;&#42;&#42;&#42;&#42;&#42;&#42;&#42;r</mark>,
<mark>c&#42;&#42;&#42;&#42;m</mark>,
<mark>w&#42;&#42;&#42;&#42;&#42;s</mark>,
<mark>c&#42;&#42;&#42;&#42;&#42;s</mark>,
<mark>t&#42;&#42;&#42;&#42;&#42;s</mark>,
<mark>o&#42;&#42;&#42;&#42;m</mark>,
etc
etc

Perhaps that's the main result of interest; Kaggle is overwhelmingly polite, decent and upbeat 😄 😊 😃 (if a little vote obsessed)

Remember, our 'expected' word counts are based on word ranks from the GloVe training corpus (Wikipedia 2014 plus Gigaword 5).
This is a kind of cross-tabulation of word frequencies between Kaggle and that corpus.
It's not like these missing words *should* appear; let's stick to data science and machine learning ;)

<!--
Coming in version 2? plotting the word vectors?
-->

# See Also

[Don't upvote this, seriously][1] - a forum EDA by @lucabasa that sets out to answer good questions, among them: is there too much pleading for upvotes?



[1]: https://www.kaggle.com/lucabasa/don-t-upvote-this-seriously
