# Modelling with Word Vectors

I will be using [spaCy](https://spacy.io/) alongside sklearn in this notebook.

In [1]:
import os
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import warnings

In [3]:
import bay12_solution_eposts as solution

## Load data

In [4]:
post, thread = solution.prepare.load_dfs('train')

In [5]:
post.head(2)

Unnamed: 0,thread_num,user,text,quotes
0,45016,Mephansteras,"Basically, this is where we talk about what ga...",[]
1,45016,dakarian,The currently running or about to run games (i...,[]


In [6]:
thread.head(2)

Unnamed: 0,thread_num,thread_name,thread_label,thread_replies,thread_label_id
0,45016,Games Threshold Discussion and List [Vote for ...,other,5703,8
1,88720,New Player's Guide to the Subforum - New to Ma...,other,961,8


I will set the thread number to be the index, to simplify matching in the future:

In [7]:
thread = thread.set_index('thread_num')
thread.head(2)

thread_num,thread_name,thread_label,thread_replies,thread_label_id
45016,Games Threshold Discussion and List [Vote for ...,other,5703,8
88720,New Player's Guide to the Subforum - New to Ma...,other,961,8


We'll load the label map as well, which tells us which index goes with which label:

In [8]:
label_map = solution.prepare.load_label_map()
label_map

type_name
bastard             0
beginners-mafia     1
byor                2
classic             3
closed-setup        4
cybrid              5
kotm                6
non-mafia-game      7
other               8
paranormal          9
supernatural       10
vanilla            11
vengeful           12
Name: type_id, dtype: int64
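The confusion matrices and the classification report further down show numeric ids only, so an id-to-name lookup can help when reading them. The sketch below is a minimal illustration, assuming `label_map` is a pandas Series like the one printed above; only a few of its entries are re-created here by hand.

```python
import pandas as pd

# Hand-made subset of the label map printed above (not loaded from the dataset)
label_map = pd.Series(
    {"bastard": 0, "beginners-mafia": 1, "other": 8, "vengeful": 12},
    name="type_id",
)
label_map.index.name = "type_name"

# Swap index and values to get an id -> name lookup
id_to_name = pd.Series(label_map.index, index=label_map.values, name="type_name")

print(id_to_name[8])
```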

## Vectorize our text features

### Load a spaCy model to get word/sentence vectors

I'll be using the large English model (~800 MB size) as shown [here](https://spacy.io/usage/models).

In [9]:
import en_core_web_lg
nlp = en_core_web_lg.load()

In [10]:
ex_name = thread['thread_name'].iloc[0]
doc = nlp(ex_name)
doc

Games Threshold Discussion and List [Vote for games now!]

In [11]:
# Average vector for the entire name
doc.vector[:4]

array([-0.07033058,  0.06838092, -0.00824608, -0.09143116], dtype=float32)
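To make "average vector" concrete, here is a small sketch with made-up 3-dimensional word vectors (the real ones have 300 dimensions). The `toy_vectors` table and the `average_vector` helper are hypothetical stand-ins, not part of spaCy.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (real spaCy vectors here have 300 dimensions)
toy_vectors = {
    "mafia": np.array([1.0, 0.0, 2.0]),
    "game": np.array([0.0, 2.0, 0.0]),
    "now": np.array([2.0, 1.0, 1.0]),
}

def average_vector(tokens, vectors):
    """Average the vectors of the given tokens (a stand-in for doc.vector)."""
    return np.mean([vectors[t] for t in tokens], axis=0)

doc_vector = average_vector(["mafia", "game", "now"], toy_vectors)
print(doc_vector)
```

Because the result is a plain average, word order is ignored and long texts tend to blur towards a "generic" vector.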

### Get documents for 'thread_name', 'first' (first post), maybe even 'join' (joining of all posts) 

**NOTE**: Not using the "whole thread" text because it takes a long time to calculate. 
Feel free to use the first line below, instead of the second, to add the 'join' column.

In [12]:
# thread_texts = post.groupby('thread_num')['text'].agg(['first', ' '.join])
thread_texts = post.groupby('thread_num')['text'].agg(['first'])
thread_texts = pd.concat(
    [
        thread[['thread_name']], 
        thread_texts
    ], 
    axis='columns'
)
thread_texts.head()

thread_num,thread_name,first
14868,Mafia game,Mafia is a fun forum game! Read more hereWe ne...
27839,Mafia - The End,I have finally decided to add another game to ...
28183,Pirate Mafia ARRGH!!! - It's over.,"Description\r\r\n\r\r\nYes you heard right, In..."
28239,Space Mafia,This is the third installment of the Mafia gam...
28682,"CCS Rampage in Liberal City, USA---A Mafia Sty...",QUOTED_SECTION \r\r\nCCS Rampage is your stan...


In [13]:
import itertools 

text_feature_names = [
    '%s_%s' % (col, num) 
    for col, num 
    in itertools.product(thread_texts.columns, range(300)) 
]

def vectorize_row(row, cols=thread_texts.columns, text_feature_names=text_feature_names):
    """Vectorizes a row of texts."""
    
    res = np.array([])
    for col in cols:
        txt = row.loc[col][:100000]  # spaCy's default limit (nlp.max_length) is 10x bigger, but we want to be safe :)
        res = np.r_[res, nlp(txt).vector]
    # v0 = nlp(row.loc['thread_name']).vector
    # v1 = nlp(row.loc['first']).vector
    # v2 = nlp(row.loc['join']).vector
    return pd.Series(res, text_feature_names)


In [14]:
thread_text_vectors = thread_texts.apply(vectorize_row, axis='columns')
thread_text_vectors.head()

thread_num,thread_name_0,thread_name_1,thread_name_2,thread_name_3,thread_name_4,thread_name_5,thread_name_6,thread_name_7,thread_name_8,thread_name_9,...,first_290,first_291,first_292,first_293,first_294,first_295,first_296,first_297,first_298,first_299
14868,-0.134918,-0.081951,0.127861,-0.172295,0.190115,-0.173446,0.036431,0.05202,0.45838,1.641215,...,-0.259338,-0.065182,-0.064679,-0.119324,0.027151,0.044259,-0.07882,-0.187372,-0.012146,0.149975
27839,-0.00933,0.176693,-0.041203,0.203682,0.243933,-0.048978,0.046361,0.12897,0.04878,1.726635,...,-0.147977,0.019893,-0.036932,-0.083081,-0.012876,0.028734,-0.064797,-0.07712,0.034178,0.035782
28183,-0.138806,0.162982,-0.037537,-0.14809,0.067772,-0.027928,0.068096,-0.051705,-0.025569,1.36374,...,-0.160191,0.035747,-0.049968,-0.099663,-0.015113,0.053394,-0.078697,-0.062419,0.003403,0.022516
28239,0.180485,-0.180085,0.141012,0.08426,0.177981,0.156179,-0.212108,-0.28414,0.166125,1.473865,...,-0.186458,0.020265,-0.009883,-0.069076,-0.039644,0.029223,-0.058409,-0.047814,0.006705,0.041114
28682,-0.038636,0.009004,0.022752,-0.029003,0.166535,0.01975,-0.122702,0.032978,0.083143,1.566071,...,-0.130157,0.004898,-0.01304,-0.064136,-0.050408,0.003048,-0.074252,-0.072235,7.1e-05,0.00943
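The column layout above (one block of features per text column) can be reproduced without loading a language model. The sketch below swaps `nlp(text).vector` for a hypothetical `fake_embed` function and uses a toy dimension of 4 instead of 300; it only illustrates how the names and the concatenation line up.

```python
import itertools
import numpy as np
import pandas as pd

DIM = 4  # toy dimension; the notebook uses 300

def fake_embed(text, dim=DIM):
    """Hypothetical stand-in for nlp(text).vector: a constant vector based on text length."""
    return np.full(dim, float(len(text)))

texts = pd.DataFrame(
    {"thread_name": ["Mafia game", "Space Mafia"], "first": ["abc", "abcde"]},
    index=pd.Index([14868, 28239], name="thread_num"),
)
names = ["%s_%s" % (col, i) for col, i in itertools.product(texts.columns, range(DIM))]

def vectorize_row(row):
    # Concatenate one vector per text column, in column order
    return pd.Series(np.concatenate([fake_embed(row[c]) for c in texts.columns]), index=names)

vectors = texts.apply(vectorize_row, axis="columns")
print(vectors.shape)
```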


## Create "final" dataset

We only have one non-text feature: the number of posts (replies plus one). We'll use it and its log. They are left unscaled in this notebook; scaling them to `[-1, 1]` may be needed for some models.

In [15]:
thread_numeric_vectors = pd.DataFrame({
    'posts': (thread['thread_replies'] + 1), 
    'posts_log': np.log(thread['thread_replies'] + 1), 
})
thread_numeric_vectors.head(2)

thread_num,posts,posts_log
45016,5704,8.648923
88720,962,6.869014
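The cell above does not rescale the two columns. If scaling to `[-1, 1]` is wanted, one possible approach is a linear min-max mapping, sketched below. The `scale_to_unit_range` helper is hypothetical, and the three rows are a toy frame built from reply counts visible in the tables of this notebook.

```python
import numpy as np
import pandas as pd

# Reply counts taken from the tables in this notebook; the frame itself is a toy example
replies = pd.Series([5703, 961, 11], index=[45016, 88720, 14868], name="thread_replies")
numeric = pd.DataFrame({"posts": replies + 1, "posts_log": np.log(replies + 1)})

def scale_to_unit_range(df, lo=None, hi=None):
    """Linearly map each column to [-1, 1] using the given (or observed) min and max."""
    lo = df.min() if lo is None else lo
    hi = df.max() if hi is None else hi
    return 2 * (df - lo) / (hi - lo) - 1

scaled = scale_to_unit_range(numeric)
print(scaled)
```

When a train/validation split is involved, `lo` and `hi` would normally be computed on the training rows only and then passed in for the validation rows.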


In [16]:
text_features = thread_text_vectors.columns
numeric_features = thread_numeric_vectors.columns

X = pd.concat([thread_numeric_vectors, thread_text_vectors], axis='columns')
X.head()

thread_num,posts,posts_log,thread_name_0,thread_name_1,thread_name_2,thread_name_3,thread_name_4,thread_name_5,thread_name_6,thread_name_7,...,first_290,first_291,first_292,first_293,first_294,first_295,first_296,first_297,first_298,first_299
14868,12,2.484907,-0.134918,-0.081951,0.127861,-0.172295,0.190115,-0.173446,0.036431,0.05202,...,-0.259338,-0.065182,-0.064679,-0.119324,0.027151,0.044259,-0.07882,-0.187372,-0.012146,0.149975
27839,308,5.7301,-0.00933,0.176693,-0.041203,0.203682,0.243933,-0.048978,0.046361,0.12897,...,-0.147977,0.019893,-0.036932,-0.083081,-0.012876,0.028734,-0.064797,-0.07712,0.034178,0.035782
28183,185,5.220356,-0.138806,0.162982,-0.037537,-0.14809,0.067772,-0.027928,0.068096,-0.051705,...,-0.160191,0.035747,-0.049968,-0.099663,-0.015113,0.053394,-0.078697,-0.062419,0.003403,0.022516
28239,13,2.564949,0.180485,-0.180085,0.141012,0.08426,0.177981,0.156179,-0.212108,-0.28414,...,-0.186458,0.020265,-0.009883,-0.069076,-0.039644,0.029223,-0.058409,-0.047814,0.006705,0.041114
28682,253,5.533389,-0.038636,0.009004,0.022752,-0.029003,0.166535,0.01975,-0.122702,0.032978,...,-0.130157,0.004898,-0.01304,-0.064136,-0.050408,0.003048,-0.074252,-0.072235,7.1e-05,0.00943


Our targets are the same as in the second model:

In [17]:
y = thread['thread_label_id']

y_aux = y.apply(lambda x: 0 if (x==label_map['other']) else 1).rename('is_game')

pd.concat([y, y_aux], axis='columns').head()

thread_num,thread_label_id,is_game
45016,8,0
88720,8,0
39338,8,0
34959,8,0
64229,8,0


### Review

So, what have I done so far? Let's list.

- Selected title, first post, and maybe a concatenation of all posts as our "documents".
- Turned each "document" into a vector, using pre-trained word vectors (the document vector is the average of the word vectors).
- Added number of posts (and its log) as additional features.

Note that we would probably need to scale the latter two for some models, because the components of the word vectors are much smaller in magnitude than the post counts.

## Split dataset into "training" and "validation"

In order to check the quality of our model in a more realistic setting, we will split all our input (training) data into a "training set" (which our model will see and learn from) and a "validation set" (where we see how well our model generalized). [Relevant link](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
# NOTE: setting the `random_state` lets you get the same results with the pseudo-random generator
validation_pct = 0.25
X_train, X_val = train_test_split(X, test_size=validation_pct, random_state=99)

In [20]:
idx_train = X_train.index
idx_val = X_val.index

X_train.shape, X_val.shape

((268, 602), (90, 602))
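Only `X` is split here; the targets are later picked up with `reindex` on the resulting index. A toy sketch of that pattern follows (the eight-row frame and its index values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 8 "threads" with an arbitrary index, mimicking thread_num
idx = pd.Index([101, 205, 309, 412, 518, 623, 777, 890], name="thread_num")
X_toy = pd.DataFrame({"posts": np.arange(8)}, index=idx)
y_toy = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], index=idx, name="is_game")

X_tr, X_va = train_test_split(X_toy, test_size=0.25, random_state=99)

# Targets are looked up by index label, so they stay aligned with the shuffled rows
y_tr, y_va = y_toy.reindex(X_tr.index), y_toy.reindex(X_va.index)
print(X_tr.shape, X_va.shape)
```

Passing `X` and `y` to `train_test_split` together would give the same alignment; the `reindex` route is just what this notebook uses.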

## Fit first (auxiliary) model

In [21]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [22]:
from sklearn.ensemble import RandomForestClassifier

In [23]:
cls1 = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42)

In [24]:
cls1.fit(X_train, y_aux.reindex(idx_train))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [25]:
# In-sample and out-of-sample predictions for auxiliary target
y_aux_train_pred = pd.Series(cls1.predict(X_train), index=idx_train)
y_aux_val_pred = pd.Series(cls1.predict(X_val), index=idx_val)

In [26]:
y_t, y_p = y_aux.reindex(idx_train), y_aux_train_pred
print("Aux train:")
print("Accuracy:", accuracy_score(y_t, y_p))
print("Confusion:", confusion_matrix(y_t, y_p), sep="\n")

Aux train:
Accuracy: 0.9402985074626866
Confusion:
[[141   6]
 [ 10 111]]


In [27]:
y_t, y_p = y_aux.reindex(idx_val), y_aux_val_pred 
print("Aux validation:")
print("Accuracy:", accuracy_score(y_t, y_p))
print("Confusion:", confusion_matrix(y_t, y_p), sep="\n")

Aux validation:
Accuracy: 0.8666666666666667
Confusion:
[[52  2]
 [10 26]]


## Fit second (game) model

In [28]:
# Our training index is: training threads where we know we have games (that is, y_aux == 1)
idx_game_train = y_aux[y_aux == 1].reindex(idx_train).dropna().index
# CHECK
(y_aux[idx_game_train] == 1).all()

True

In [29]:
# Our validation index is: validation threads where we PREDICTED we have games (that is, y_aux_val_pred == 1)
idx_game_val = y_aux_val_pred[y_aux_val_pred == 1].dropna().index
# CHECK
(y_aux_val_pred[idx_game_val] == 1).all()

True

In [30]:
cls2 = RandomForestClassifier(n_estimators=200, max_depth=3, max_leaf_nodes=10, random_state=68)

In [31]:
cls2.fit(X_train.reindex(idx_game_train), y.reindex(idx_game_train))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=68, verbose=0, warm_start=False)

In [32]:
# In-sample and out-of-sample predictions for game target
y_game_train_pred = pd.Series(cls2.predict(X_train.reindex(idx_game_train)), index=idx_game_train)
y_game_val_pred = pd.Series(cls2.predict(X_val.reindex(idx_game_val)), index=idx_game_val)

In [33]:
y_t, y_p = y.reindex(idx_game_train), y_game_train_pred
print("Game train:")
print("Accuracy:", accuracy_score(y_t, y_p))
print("Confusion:", confusion_matrix(y_t, y_p), sep="\n")

Game train:
Accuracy: 0.9256198347107438
Confusion:
[[10  0  0  0  1  0  0  0  0  0  0  0]
 [ 0 19  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 11  0  0  0  0  0  0  0  0  0]
 [ 0  0  0 14  0  0  0  0  0  0  0  0]
 [ 0  0  0  0 26  0  0  0  0  0  0  0]
 [ 0  0  0  0  1  0  0  0  0  0  0  0]
 [ 0  0  0  0  1  0  0  0  0  0  0  0]
 [ 0  0  0  0  2  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0 18  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  3  0  0]
 [ 0  2  0  0  1  0  0  0  0  0  5  0]
 [ 0  1  0  0  0  0  0  0  0  0  0  6]]


In [34]:
y_t, y_p = y.reindex(idx_game_val), y_game_val_pred
print("Game validation:")
print("Accuracy:", accuracy_score(y_t, y_p))
print("Confusion:", confusion_matrix(y_t, y_p), sep="\n")

Game validation:
Accuracy: 0.5357142857142857
Confusion:
[[0 0 0 0 1 0 0 0 0 0 0 0]
 [0 2 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 5 0 0 0 0 0 0 0]
 [0 0 0 0 9 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 2 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 2]]


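Well, the model has clearly collapsed onto the majority class (the fifth column, which has 26 on the diagonal of the training matrix and 9 on the diagonal of the validation matrix): most validation predictions land in that column, and accuracy drops from 0.93 in-sample to 0.54 out-of-sample.

We will tune hyperparameters in another notebook. For now, this is good enough to continue.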
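The concentration on one class can be quantified directly from the game-stage validation matrix above by summing its columns. This is a small sketch on the printed numbers, not a re-run of the model.

```python
import numpy as np

# Game-stage validation confusion matrix from above (rows: true, columns: predicted)
cm_val = np.array([
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
])

predicted_counts = cm_val.sum(axis=0)          # how often each column was predicted
majority_share = predicted_counts.max() / cm_val.sum()
accuracy = np.trace(cm_val) / cm_val.sum()

print(predicted_counts, majority_share, round(accuracy, 4))
```

Three quarters of the 28 second-stage predictions fall into the fifth column, while only 9 of those 21 are correct.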

## Score the resulting model

Our model consists of two parts - let's see how well we did altogether:

In [35]:
# Fill with "other", and when an actual game - fill with the game :)
y_train_pred = pd.Series(label_map['other'], index=idx_train)
y_train_pred[idx_game_train] = y_game_train_pred

In [36]:
# Same with the validation, because our index is dynamic :)
y_val_pred = pd.Series(label_map['other'], index=idx_val)
y_val_pred[idx_game_val] = y_game_val_pred
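The "fill with other, then overwrite" step is easiest to see on a tiny example. The index values and second-stage predictions below are made up; only the id 8 for "other" comes from the label map above.

```python
import pandas as pd

OTHER = 8  # id of the "other" label in the label map above

# Toy validation index and hypothetical second-stage predictions for two of the threads
idx_val_toy = pd.Index([11, 22, 33, 44], name="thread_num")
game_pred_toy = pd.Series([4, 1], index=pd.Index([22, 44], name="thread_num"))

# Start with "other" everywhere, then overwrite the threads the first stage flagged as games
combined = pd.Series(OTHER, index=idx_val_toy)
combined.loc[game_pred_toy.index] = game_pred_toy

print(combined.tolist())
```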

In [37]:
y_t, y_p = y.reindex(idx_train), y_train_pred
print("Total train:")
print("Accuracy:", accuracy_score(y_t, y_p))
print("Confusion:", confusion_matrix(y_t, y_p), sep="\n")

Total train:
Accuracy: 0.9664179104477612
Confusion:
[[ 10   0   0   0   1   0   0   0   0   0   0   0   0]
 [  0  19   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0  11   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0  14   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0  26   0   0   0   0   0   0   0   0]
 [  0   0   0   0   1   0   0   0   0   0   0   0   0]
 [  0   0   0   0   1   0   0   0   0   0   0   0   0]
 [  0   0   0   0   2   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0 147   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0  18   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   3   0   0]
 [  0   2   0   0   1   0   0   0   0   0   0   5   0]
 [  0   1   0   0   0   0   0   0   0   0   0   0   6]]


In [38]:
y_t, y_p = y.reindex(idx_val), y_val_pred
print("Total val:")
print("Accuracy:", accuracy_score(y_t, y_p))
print("Confusion:", confusion_matrix(y_t, y_p), sep="\n")
print(classification_report(y_t, y_p))

Total val:
Accuracy: 0.7444444444444445
Confusion:
[[ 0  0  0  0  1  0  0  2  0  0  0  0]
 [ 0  2  0  0  0  0  0  2  0  0  0  0]
 [ 0  0  0  0  1  0  0  1  0  0  0  0]
 [ 0  0  0  0  5  0  0  2  0  0  0  0]
 [ 0  0  0  0  9  0  0  1  0  0  0  0]
 [ 0  0  0  0  1  0  0  1  0  0  0  0]
 [ 0  0  0  0  1  0  0  0  0  0  0  0]
 [ 0  1  0  0  1  0  0 52  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  2  0  0  0]
 [ 0  0  0  0  1  0  0  0  0  0  0  0]
 [ 0  0  0  0  1  0  0  1  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  2]]




              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.67      0.50      0.57         4
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00         7
           4       0.43      0.90      0.58        10
           5       0.00      0.00      0.00         2
           6       0.00      0.00      0.00         1
           8       0.84      0.96      0.90        54
           9       1.00      1.00      1.00         2
          10       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         2
          12       1.00      1.00      1.00         2

   micro avg       0.74      0.74      0.74        90
   macro avg       0.33      0.36      0.34        90
weighted avg       0.62      0.74      0.67        90



About 0.74 accuracy! That's pretty good for a slightly-tuned model, and significantly better than the baseline of ~0.55!

We still have some classes that are never predicted in the validation set (quite a few, actually: 7 out of 12), which is obviously pretty bad.
However, on the training set we did predict something for all but 3 of the classes (and those 3 had only 4 threads in total... so...).
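How the ~0.55 baseline was obtained is not shown in this notebook. One plausible reading, which is an assumption on my part, is the score of always predicting "other". The sketch below computes that from counts visible in the outputs above and puts it next to the model's validation accuracy.

```python
# Counts read off the outputs above
other_train, n_train = 147, 268   # "other" threads in the training split
other_val, n_val = 54, 90         # "other" threads in the validation split

baseline_val = other_val / n_val
baseline_all = (other_train + other_val) / (n_train + n_val)
model_val = 67 / 90               # correct validation predictions: diagonal sum of the total matrix

print(round(baseline_val, 3), round(baseline_all, 3), round(model_val, 3))
```

On the validation split alone the always-"other" baseline is somewhat higher than over all threads, so the margin over the baseline depends on which of the two is used.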

Let's freeze this model for now, and move to the next notebook. I won't predict on the test set, because I can see public *and* private scores, but here is one place where I would suggest you do it yourself. ;)