# What are the strategies to earn discussion medals?

Thanks for **UPVOTING** this kernel! Trying to become a Kernel Expert. 👍

> Check out other interesting projects by Pavlo Fesenko:
- [(kernel) How to create interactive Titanic dashboard using Bokeh?](https://www.kaggle.com/pavlofesenko/interactive-titanic-dashboard-using-bokeh)

---
## Table of contents:

1. [Introduction](#1.-Introduction)
2. [Preprocessing of forum messages](#2.-Preprocessing-of-forum-messages)
3. [Vectorization and visualization of forum messages](#3.-Vectorization-and-visualization-of-forum-messages)  
3.1. [Bronze medals](#3.1.-Bronze-medals)  
3.2. [Silver and gold medals](#3.2.-Silver-and-gold-medals)  
4. [Distribution of medals in forums](#4.-Distribution-of-medals-in-forums)
5. [Conclusion](#5.-Conclusion)

## 1. Introduction

In my experience, silver and gold discussion medals (5 and 10 upvotes respectively) are much harder to get than bronze discussion medals (1 upvote). So I began wondering: **how do forum messages with silver/gold medals differ from messages with bronze medals?**

In this kernel I present my analysis, for which I use a number of Python libraries:

- `pandas` - working with DataFrames
- `BeautifulSoup` and `re` - cleaning messages from HTML tags
- `spacy` - natural language processing (NLP)
- `TSNE` and `KMeans` from `sklearn` - dimension reduction and clustering
- `seaborn` - visualization
- `Counter` from `collections` - counting words
- `bokeh` - interactive visualization

In [None]:
import pandas as pd

from bs4 import BeautifulSoup
import re

import spacy

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

import seaborn as sns

from collections import Counter

from bokeh.plotting import output_notebook, figure, show
from bokeh.models import ColumnDataSource, Select, CustomJS
from bokeh.layouts import column
from bokeh.transform import factor_cmap, linear_cmap

output_notebook(hide_banner=True)

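To write Bokeh callback functions in Python, the module `pscript` must also be installed. Note that `CustomJS.from_py_func`, used later in this kernel, exists only in older Bokeh releases (to my knowledge it was removed in Bokeh 2.0), so the code assumes a Bokeh 1.x environment.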

In [None]:
!pip install pscript

## 2. Preprocessing of forum messages

Before NLP techniques can be applied to the forum messages, the messages need to be preprocessed. The raw messages are loaded from the file `ForumMessages.csv` in the [Kaggle Meta dataset](https://www.kaggle.com/kaggle/meta-kaggle). Note that this dataset is updated every day, so the output will differ if you fork and run this kernel yourself. Rows with empty messages are filtered out and the rest are sorted by `PostDate`. For the sorting to work properly, the column `PostDate` needs to be converted to the `datetime` type. The parameter `infer_datetime_format=True` infers the format from the first entry and reuses it for the rest of the column, which significantly speeds up parsing. This is especially useful for this dataset because it contains several hundred thousand messages.

In [None]:
messages = pd.read_csv('../input/ForumMessages.csv')
messages = messages[messages.Message.notna()]
messages['PostDate'] = pd.to_datetime(messages['PostDate'], infer_datetime_format=True)
messages = messages.sort_values('PostDate')
messages.tail()
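
The filtering, conversion and sorting steps can be tried on a tiny made-up table first. The sketch below is only an illustration: the DataFrame `toy` and its three rows are invented, and the `MM/DD/YYYY HH:MM:SS` layout is an assumption about how `PostDate` is stored. It passes an explicit `format` string instead of `infer_datetime_format=True`, which is one alternative in newer pandas versions where that parameter is deprecated.

```python
import pandas as pd

# Invented stand-in for ForumMessages.csv (one empty message, dates as text).
toy = pd.DataFrame({
    'Message': ['second', None, 'first'],
    'PostDate': ['02/15/2019 10:00:00', '01/01/2019 09:00:00', '01/20/2019 08:30:00'],
})

# Drop rows without a message, as in the cell above.
toy = toy[toy['Message'].notna()].copy()

# Parse the text into real timestamps; the format string is an assumption.
toy['PostDate'] = pd.to_datetime(toy['PostDate'], format='%m/%d/%Y %H:%M:%S')

# Sort chronologically, oldest first.
toy = toy.sort_values('PostDate')
print(toy)
```

Sorting the raw strings would compare them character by character, so the conversion is what makes the order chronological.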

To strip HTML tags from the messages I use the module `BeautifulSoup`. Calling it for each individual message, however, is very time consuming. Therefore, I join all forum messages into one string, separated by a unique separator ` |sep| `.

Before applying `BeautifulSoup` I remove the code snippets that are enclosed by the tag `<code>`. Some messages include code snippets as plain text (without the tag `<code>`), which can become a problem. For example, `BeautifulSoup` will consider the symbol `<-` used in R as the beginning of an HTML tag and will remove all text up to the nearest closing bracket `>`. To avoid this, the symbol `<-` is also removed in advance.

After `BeautifulSoup` parses raw messages, one can get the text without HTML tags using the method `get_text()`.

URLs and Kaggle usernames starting with @ can also be removed because they don't bring much semantic information.

Finally, the string of all forum messages is split back into a list and assigned to the DataFrame `messages`. Note that if something goes wrong during preprocessing and the number of processed messages doesn't match the number of original messages, this assignment will fail.

In [None]:
messages_str = ' |sep| '.join(messages.Message.tolist())

messages_str = re.sub(r'<code>.*?</code>', '', messages_str, flags=re.DOTALL)
messages_str = re.sub('<-', '', messages_str)

messages_str = BeautifulSoup(messages_str, 'lxml').get_text()

messages_str = re.sub(r'http\S+', '', messages_str)
messages_str = re.sub(r'@\S+', '', messages_str)

messages['Message'] = messages_str.split(' |sep| ')

messages.tail()
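
The join, clean, split round trip can be checked on a few made-up messages. This is a minimal sketch under stated assumptions: the helper `clean_joined` and the list `raw` are hypothetical, and the `BeautifulSoup` step is left out so that only the standard library is needed. It mainly shows the sanity check on the number of messages before and after cleaning.

```python
import re

SEP = ' |sep| '  # same separator as in the cell above

def clean_joined(message_list):
    """Join messages, apply the regex cleaning steps, and split them again."""
    joined = SEP.join(message_list)
    joined = re.sub(r'<code>.*?</code>', '', joined, flags=re.DOTALL)  # code snippets
    joined = joined.replace('<-', '')                                  # R assignment symbol
    joined = re.sub(r'http\S+', '', joined)                            # URLs
    joined = re.sub(r'@\S+', '', joined)                               # usernames
    return joined.split(SEP)

# Invented example messages.
raw = [
    'Nice <code>x = 1</code> kernel',
    'See http://example.com @user thanks',
    'y <- 2 in R',
]
cleaned = clean_joined(raw)

# The number of messages must survive the round trip.
assert len(cleaned) == len(raw)
print(cleaned)
```

One case worth watching for: an unclosed `<code>` tag in one message would let the non-greedy pattern run on to a `</code>` in a later message, swallowing the separators in between and changing the count.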

## 3. Vectorization and visualization of forum messages

Just like in other machine learning tasks, the objects of interest have to be transformed into vectors (*vectorization*). There are many different methods of text vectorization, for example, word counts, TF-IDF, etc. In this kernel I will be using the vectorization method called *word embedding*. It encodes each word with a vector taken from a large dictionary of word-vector pairs. These dictionaries are obtained from neural networks trained on large collections of documents, and can easily be found online. The big advantage of the word embedding method is that contextually similar words have similar word vectors. This allows us to compare forum messages with each other.
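
To make "similar words have similar vectors" concrete, here is a small sketch with cosine similarity. The three 3-dimensional vectors in `toy_vectors` are invented for illustration (real embeddings are learned and have hundreds of dimensions), and the helper `cosine` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for the same direction, about 0.0 for unrelated vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy embeddings, not taken from any real model.
toy_vectors = {
    'kernel':   [0.9, 0.1, 0.0],
    'notebook': [0.8, 0.2, 0.1],
    'banana':   [0.0, 0.1, 0.9],
}

sim_close = cosine(toy_vectors['kernel'], toy_vectors['notebook'])
sim_far = cosine(toy_vectors['kernel'], toy_vectors['banana'])
print(sim_close > sim_far)
```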

### 3.1. Bronze medals

Let's start with the forum messages that got bronze medals. Since text vectorization can be time consuming, I take only the 1000 most recent messages with bronze medals.

In [None]:
corpus = messages[messages.Medal == 3].Message.tolist()[-1000:]
corpus[-5:]

In this kernel I will be using the NLP module called SpaCy. I find it more convenient than the classic NLP module NLTK thanks to its large number of built-in features. First, one has to load a language model. Here I am using the SpaCy model for the English language `'en_core_web_lg'` that can assign word vectors, POS tags, dependency parses and named entities. Since named entity recognition `'ner'` is not required in this kernel, I have disabled it to reduce processing time.

In [None]:
nlp = spacy.load('en_core_web_lg', disable=['ner'])

The first step is tokenization. To process a collection of documents faster, it is better to use the method `nlp.pipe()` rather than calling `nlp()` on each document individually. When `nlp.pipe()` or `nlp()` is applied to text, SpaCy automatically performs tokenization, POS tagging and dependency parsing under the hood (for more information about how SpaCy works check their awesome guide ["Get Started"](https://spacy.io/usage)). This makes it extremely easy to extract tokens in the desired form. Here I take the lemmatized, lower-case form of each token that consists only of letters, has a word vector and is not a stop word. For each document I create one string with all of its tokens separated by spaces. Documents that end up with no tokens are dropped.

In [None]:
batch = nlp.pipe(corpus)
corpus_tok = []
for doc in batch:
    tokens = [token.lemma_.lower() for token in doc if token.is_alpha and token.has_vector and not token.is_stop]
    tokens_str = ' '.join(tokens)
    if tokens_str != '':
        corpus_tok.append(tokens_str)

corpus_tok[-5:]
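
The filtering logic can be mimicked without SpaCy to see what survives. This is only a rough sketch: `STOP_WORDS` is a tiny hypothetical stop list, the text is split on whitespace instead of being tokenized properly, and there is no lemmatization or word-vector check, so the output differs from what SpaCy would produce.

```python
# Hypothetical miniature stop list; SpaCy ships a much larger one.
STOP_WORDS = {'the', 'for', 'is', 'a', 'this'}

def filter_tokens(text, stop_words=STOP_WORDS):
    """Keep lower-cased alphabetic words that are not stop words."""
    words = [w.lower() for w in text.split()]
    return ' '.join(w for w in words if w.isalpha() and w not in stop_words)

docs = ['Thanks for the great kernel 123 !', '123 !!!']

# Documents that end up empty are skipped, as in the cell above.
kept = [s for s in map(filter_tokens, docs) if s != '']
print(kept)
```

Unlike a real tokenizer, whitespace splitting keeps punctuation attached to words, so a word such as `kernel!` would be dropped here.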

The second step is vectorization. Using SpaCy one can get a vector for each token or for the whole document. The latter is simply the average of vectors for all tokens in the document. Here I apply vectorization to the whole document.

In [None]:
batch_tok = nlp.pipe(corpus_tok)
X = []
for doc in batch_tok:
    X.append(doc.vector)
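
The averaging behind a document vector can be written out by hand. In this sketch the helper `mean_vector` and the 2-dimensional `token_vectors` are made up for illustration; it only shows the component-wise mean described above, not how SpaCy computes it internally.

```python
def mean_vector(vectors):
    """Component-wise average of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Invented token vectors for a three-token document.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
doc_vector = mean_vector(token_vectors)
print(doc_vector)
```

One consequence of averaging is that word order is lost: two messages with the same words in a different order get the same vector.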

The next step is dimensionality reduction. Since the SpaCy vectors are 300-dimensional, they cannot be visualized directly. Therefore, I apply the dimensionality reduction technique called t-SNE to get a 2-dimensional representation. To get reproducible results from run to run, the parameter `random_state=0` is fixed. The resulting coordinates are then stored in a DataFrame so that they can be conveniently plotted later using Seaborn.

In [None]:
X_emb = TSNE(random_state=0).fit_transform(X)
df = pd.DataFrame(X_emb, columns=['x', 'y'])
df.tail()

Finally, the messages are visualized on the scatter plot.

In [None]:
# Pass x and y as keywords; newer seaborn versions reject them as positional arguments.
sns.scatterplot(x='x', y='y', data=df, edgecolor='none', alpha=0.5)

It looks like there are at least 3 clusters present here. Let's use `KMeans()` to identify these clusters and add the cluster labels to the DataFrame. I have also added message tokens to the DataFrame to analyze the most frequent words in the clusters.

In [None]:
model = KMeans(n_clusters=3)
df['Label'] = model.fit_predict(X_emb)
df['Tokens'] = corpus_tok
df.tail()

The scatter plot of the labeled clusters is shown below.

In [None]:
palette = sns.color_palette(n_colors=3)
# Pass x and y as keywords; newer seaborn versions reject them as positional arguments.
sns.scatterplot(x='x', y='y', data=df, edgecolor='none', alpha=0.5, hue='Label', palette=palette)

Since `KMeans` assigns labels in random order, they might be completely different after committing the kernel. Therefore, I print the most common words of all 3 clusters below and discuss them in no particular order.

In one of the clusters the top words are "kernel", "thank", "great", "upvote", "share", etc. One could think of possible phrases from these words such as "great kernel", "thanks for sharing", etc. These messages probably show appreciation of kernels.

In another cluster the top words are "thank", "work", "nice", "great", "share", etc. These are very similar to the words from the previous cluster with the exception of "kernel". These messages probably show appreciation in general.

In another cluster the top words are "datum" (lemmatized form of "data"), "model", "thank", "kernel", "good", etc. This is probably a mix of messages that discuss models/kernels and show appreciation. Another possible reason for this mix is that the clusters are not separated very well on the plot and `KMeans` might mistakenly assign a part of one cluster to the other.

In [None]:
cluster0 = ' '.join(df[df.Label == 0].Tokens.tolist())
words0 = Counter(cluster0.split())
words0.most_common(10)

In [None]:
cluster1 = ' '.join(df[df.Label == 1].Tokens.tolist())
words1 = Counter(cluster1.split())
words1.most_common(10)

In [None]:
cluster2 = ' '.join(df[df.Label == 2].Tokens.tolist())
words2 = Counter(cluster2.split())
words2.most_common(10)
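
The three cells above repeat the same counting for each label. As a sketch of a more compact variant, the hypothetical helper `top_words_by_label` below does the counting in a single pass; the `labels` and `token_strings` lists are invented stand-ins for `df.Label` and `df.Tokens`.

```python
from collections import Counter, defaultdict

def top_words_by_label(labels, token_strings, n=10):
    """Return the n most common words for every cluster label."""
    counters = defaultdict(Counter)
    for label, tokens in zip(labels, token_strings):
        counters[label].update(tokens.split())
    return {label: counter.most_common(n) for label, counter in counters.items()}

# Invented example data.
labels = [0, 1, 0, 1]
token_strings = ['great kernel thank', 'model datum score', 'thank share kernel', 'model feature']

top = top_words_by_label(labels, token_strings, n=2)
print(top)
```

With the real data this would be called as `top_words_by_label(df.Label, df.Tokens)`, assuming `df` has the columns created earlier.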

Using Bokeh I made an interactive visualization of the messages and their tokens. When you hover over a point on the plot below, the tokens of that message are shown in a tooltip. You can also select a token from the list to see in which messages it was mentioned at least once. Try to choose, for example, the following words: "kernel", "thank", "datum", "model". Does the previous analysis of the most frequent words in clusters match what you see on the plot?

In [None]:
s = ColumnDataSource(df)

p = figure(plot_width=600, plot_height=400, toolbar_location=None, tools=['hover'], tooltips='@Tokens')

cmap = linear_cmap('Label', palette=palette.as_hex(), low=df.Label.min(), high=df.Label.max())
p.circle('x', 'y', source=s, color=cmap)

tokens_all = ' '.join(df.Tokens.tolist()).split()
options = sorted(set(tokens_all))
options.insert(0, 'Please choose...')
select = Select(value='Please choose...', options=options)

def callback(s=s, window=None):
    indices = [i for i, x in enumerate(s.data['Tokens']) if cb_obj.value in x]
    s.selected.indices = indices
    s.change.emit()
    
select.js_on_change('value', CustomJS.from_py_func(callback))
    
show(column(select, p))
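
One detail of the selection logic is worth illustrating. In plain Python, `word in x` with a string `x` is a substring test, so a word can also match inside a longer token. The sketch below contrasts this with whole-token matching; the helper `matching_indices` and the example strings are hypothetical, and how the callback behaves once translated by `pscript` is not something this sketch verifies.

```python
def matching_indices(token_strings, word, whole_token=True):
    """Indices of messages that contain the word."""
    if whole_token:
        # Compare against the individual tokens of each message.
        return [i for i, s in enumerate(token_strings) if word in s.split()]
    # Substring test on the whole string, as `word in x` does for strings.
    return [i for i, s in enumerate(token_strings) if word in s]

# Invented token strings.
token_strings = ['great model', 'remodel house', 'thank']

print(matching_indices(token_strings, 'model', whole_token=True))
print(matching_indices(token_strings, 'model', whole_token=False))
```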

### 3.2. Silver and gold medals

Let's repeat the same for the messages that got silver and gold medals. The code here is mostly a copy-paste from the previous section without changing the variable names.

In [None]:
corpus = messages[(messages.Medal == 1) | (messages.Medal == 2)].Message.tolist()[-1000:]

batch = nlp.pipe(corpus)
corpus_tok = []
for doc in batch:
    tokens = [token.lemma_.lower() for token in doc if token.is_alpha and token.has_vector and not token.is_stop]
    tokens_str = ' '.join(tokens)
    if tokens_str != '':
        corpus_tok.append(tokens_str)

batch_tok = nlp.pipe(corpus_tok)
X = []
for doc in batch_tok:
    X.append(doc.vector)
    
X_emb = TSNE(random_state=0).fit_transform(X)

df = pd.DataFrame(X_emb, columns=['x', 'y'])
df['Tokens'] = corpus_tok

# Pass x and y as keywords; newer seaborn versions reject them as positional arguments.
sns.scatterplot(x='x', y='y', data=df, edgecolor='none', alpha=0.5)

It looks like there might be at least 2 clusters here.

In [None]:
model = KMeans(n_clusters=2)
df['Label'] = model.fit_predict(X_emb)

palette = sns.color_palette(n_colors=2)
# Pass x and y as keywords; newer seaborn versions reject them as positional arguments.
sns.scatterplot(x='x', y='y', data=df, edgecolor='none', alpha=0.5, hue='Label', palette=palette)

In one of the clusters the most frequent words are "competition", "kernel", "submission", "time", "score", etc. These messages probably discuss competition scores and related kernels.

In another cluster the most frequent words are "model", "feature", "score", "lb" (short for "leaderboard"), "datum", etc. These messages probably focus on studying model features and improving scores. Note that the word "competition" is also quite frequent in this cluster which means that these messages are also related to competitions.

In [None]:
cluster0 = ' '.join(df[df.Label == 0].Tokens.tolist())
words0 = Counter(cluster0.split())
words0.most_common(10)

In [None]:
cluster1 = ' '.join(df[df.Label == 1].Tokens.tolist())
words1 = Counter(cluster1.split())
words1.most_common(10)

The interactive Bokeh plot is shown below. Try to choose, for example, the following words: "competition", "kernel", "model", "feature". Does the previous analysis of the most frequent words in clusters match what you see on the plot?

In [None]:
s = ColumnDataSource(df)

p = figure(plot_width=600, plot_height=400, toolbar_location=None, tools=['hover'], tooltips='@Tokens')

cmap = linear_cmap('Label', palette=palette.as_hex(), low=df.Label.min(), high=df.Label.max())
p.circle('x', 'y', source=s, color=cmap)

tokens_all = ' '.join(df.Tokens.tolist()).split()
options = sorted(set(tokens_all))
options.insert(0, 'Please choose...')
select = Select(value='Please choose...', options=options)

def callback(s=s, window=None):
    indices = [i for i, x in enumerate(s.data['Tokens']) if cb_obj.value in x]
    s.selected.indices = indices
    s.change.emit()
    
select.js_on_change('value', CustomJS.from_py_func(callback))
    
show(column(select, p))

As an interim conclusion, the messages with silver/gold medals tend to focus more on improving competition scores and brainstorming new models, while the messages with bronze medals are mostly about appreciation of other people's work.

## 4. Distribution of medals in forums

To dig deeper into the above hypothesis, let's investigate the distribution of medals in different forums. For this purpose I load the forum topic titles and the forum titles themselves from the same Meta Kaggle dataset.

In [None]:
topics = pd.read_csv('../input/ForumTopics.csv').rename(columns={'Title': 'TopicTitle'})
topics.head()

In [None]:
forums = pd.read_csv('../input/Forums.csv').rename(columns={'Title': 'ForumTitle'})
forums.head()

Then I merge selected features from `messages` and `topics` based on `ForumTopicId`.

In [None]:
df1 = pd.merge(messages[['ForumTopicId', 'PostDate', 'Medal']], topics[['Id', 'ForumId', 'TopicTitle']], left_on='ForumTopicId', right_on='Id')
df1 = df1.drop(['ForumTopicId', 'Id'], axis=1)
df1.head()

And then I merge the resulting DataFrame and `forums` based on `ForumId`. 

In [None]:
df2 = pd.merge(df1, forums[['Id', 'ForumTitle']], left_on='ForumId', right_on='Id')
df2 = df2.drop(['ForumId', 'Id'], axis=1)
df2.head()
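
The two-step merge can be followed on three tiny made-up tables. In this sketch the DataFrames `messages_toy`, `topics_toy` and `forums_toy` are invented, and the `validate='many_to_one'` argument is my addition: it makes pandas raise an error if a topic or forum `Id` appears more than once on the right-hand side.

```python
import pandas as pd

# Invented miniature versions of the three tables.
messages_toy = pd.DataFrame({'ForumTopicId': [10, 10, 11], 'Medal': [3, 1, 3]})
topics_toy = pd.DataFrame({'Id': [10, 11], 'ForumId': [100, 200], 'TopicTitle': ['a', 'b']})
forums_toy = pd.DataFrame({'Id': [100, 200], 'ForumTitle': ['Kernels', 'General']})

# Step 1: attach the topic (and its forum id) to every message.
step1 = messages_toy.merge(
    topics_toy, left_on='ForumTopicId', right_on='Id', validate='many_to_one'
).drop(columns=['ForumTopicId', 'Id'])

# Step 2: attach the forum title via the forum id.
step2 = step1.merge(
    forums_toy, left_on='ForumId', right_on='Id', validate='many_to_one'
).drop(columns=['ForumId', 'Id'])

print(step2)
```

Because `pd.merge` performs an inner join by default, messages whose topic or forum is missing from the other table are silently dropped.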

Let's extract all messages from 2019 that got bronze medals and the ones that got silver/gold medals. Due to the large number of forum titles, only the top-10 forums with most medals are selected.

In [None]:
bronze = df2[(df2.Medal == 3) & (df2.PostDate > '2019-01-01 00:00:00')]
bronze_gr = bronze.groupby('ForumTitle').count()
bronze_ind = bronze_gr.sort_values('Medal')[-10:].index.values
bronze = bronze[bronze.ForumTitle.isin(bronze_ind)]

silver_gold = df2[((df2.Medal == 1) | (df2.Medal == 2)) & (df2.PostDate > '2019-01-01 00:00:00')]
silver_gold_gr = silver_gold.groupby('ForumTitle').count()
silver_gold_ind = silver_gold_gr.sort_values('Medal')[-10:].index.values
silver_gold = silver_gold[silver_gold.ForumTitle.isin(silver_gold_ind)]
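
The "top forums by number of medals" selection can also be expressed with `value_counts()`. This is a sketch of an alternative under stated assumptions, not the method used above: the DataFrame `toy` is invented and only the top 2 titles are kept so that the effect is visible on six rows.

```python
import pandas as pd

# Invented medal rows, one row per medal-winning message.
toy = pd.DataFrame({
    'ForumTitle': ['Kernels', 'Kernels', 'Kernels', 'General', 'General', 'Elo'],
    'Medal': [3, 3, 3, 3, 3, 3],
})

# Count rows per forum title and keep the titles with the most rows.
top_titles = toy['ForumTitle'].value_counts().nlargest(2).index

# Keep only the messages posted in those forums.
top = toy[toy['ForumTitle'].isin(top_titles)]
print(top['ForumTitle'].value_counts())
```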

The number of bronze medals for each forum title is shown below. Notice that bronze medals are predominantly granted in "Kernels". This is in line with the previous observation that the messages with bronze medals mostly express appreciation of other people's work. Indeed, this happens a lot in the comments section of kernels.

In [None]:
sns.countplot(y='ForumTitle', data=bronze)

The distribution of silver/gold medals is significantly different. The majority of medals are granted in recent Kaggle competitions such as "Microsoft Malware Prediction", "Quora Insincere Questions Classification", "Elo Merchant Category Recommendation", etc. This is in line with the previous observation that the messages with silver/gold medals mostly focus on competitions.

In [None]:
sns.countplot(y='ForumTitle', data=silver_gold)

## 5. Conclusion

In this kernel I used NLP techniques to analyze how Kaggle forum messages with silver/gold medals differ from the ones with bronze medals. In particular, I used word embedding for vectorization and t-SNE for visualization. Clustering the messages then revealed several common topics:

- Messages with bronze medals mostly show appreciation of other people's work.
- Messages with silver/gold medals mostly focus on improving competition results.

Additional analysis of forum topic titles for different medals confirmed these observations.

Therefore, a possible strategy to earn silver/gold discussion medals could be to start with the most recent competitions and help others to improve their models. Let me know in the comments below if you agree with this strategy or not. 😉
