# Modelling with Word Vectors

I will be using [spaCy](https://spacy.io/) alongside scikit-learn in this notebook.

We will also be finding good hyperparameters.

In [1]:
import os
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import warnings

In [3]:
import bay12_solution_eposts as solution

## Load data

In [4]:
post, thread = solution.prepare.load_dfs('train')

In [5]:
post.head(2)

Unnamed: 0,thread_num,user,text,quotes
0,45016,Mephansteras,"Basically, this is where we talk about what ga...",[]
1,45016,dakarian,The currently running or about to run games (i...,[]


In [6]:
thread.head(2)

Unnamed: 0,thread_num,thread_name,thread_label,thread_replies,thread_label_id
0,45016,Games Threshold Discussion and List [Vote for ...,other,5703,8
1,88720,New Player's Guide to the Subforum - New to Ma...,other,961,8


In [7]:
label_map = solution.prepare.load_label_map()
label_map

type_name
bastard             0
beginners-mafia     1
byor                2
classic             3
closed-setup        4
cybrid              5
kotm                6
non-mafia-game      7
other               8
paranormal          9
supernatural       10
vanilla            11
vengeful           12
Name: type_id, dtype: int64

## Vectorize our text features

### Load a Spacy model to get word/sentence vectors

I'll be using the large English model (~800 MB size) as shown [here](https://spacy.io/usage/models).

In [8]:
import en_core_web_lg
nlp = en_core_web_lg.load()

In [9]:
len(nlp('a').vector)

300

In [10]:
ex_name = thread['thread_name'].iloc[0]
doc = nlp(ex_name)
doc

Games Threshold Discussion and List [Vote for games now!]

In [11]:
# Average vector for the entire name
doc.vector[:4]

array([-0.07033058,  0.06838092, -0.00824608, -0.09143116], dtype=float32)
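
By default, spaCy computes `Doc.vector` as the average of the token vectors. Here is a minimal sketch of that averaging, using a made-up three-word vocabulary with 4-dimensional vectors (the `toy_vectors` table and `average_vector` helper are hypothetical; the real model uses 300 dimensions):

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; the real spaCy vectors have 300 dimensions.
toy_vectors = {
    "vote": np.array([1.0, 0.0, 2.0, 0.0]),
    "for": np.array([0.0, 1.0, 0.0, 0.0]),
    "games": np.array([2.0, 2.0, 1.0, 3.0]),
}


def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors of the given tokens."""
    return np.mean([vectors[t] for t in tokens], axis=0)


doc_vector = average_vector(["vote", "for", "games"], toy_vectors)
print(doc_vector)  # [1. 1. 1. 1.]
```

Averaging discards word order, so two titles with the same words in a different order get the same vector.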

## Create "final" dataset

**NOTE**: Not using the "whole thread" text because it takes a long time to calculate. 
Feel free to set `agg_post=['first', ' '.join]` to add the 'join' column.

In [12]:
import itertools

def vectorize_dataset(thread, post, nlp, agg_post=['first']): 
    # Prepare thread, posts to use 'thread_num' as the index 
    thread = thread.set_index('thread_num')
    post_texts = pd.concat(
        [
            thread[['thread_name']], 
            post.groupby('thread_num')['text'].agg(agg_post)
        ], 
        axis='columns'
    )
     
    # Get text features
    vec_size = len(nlp('a').vector)
    text_feature_names = [
        '%s_%s' % (col, num) 
        for col, num 
        in itertools.product(post_texts.columns, range(vec_size)) 
    ]
    
    def vectorize_row(row, cols=post_texts.columns):
        """Vectorizes a row of texts."""

        res = np.array([])
        for col in cols:
            txt = row.loc[col][:100000]  # limit is 10x bigger, but we want to be safe :)
            res = np.r_[res, nlp(txt).vector]
        # v0 = nlp(row.loc['thread_name']).vector
        # v1 = nlp(row.loc['first']).vector
        # v2 = nlp(row.loc['join']).vector
        return pd.Series(res, text_feature_names)
    
    thread_text_vectors = post_texts.apply(vectorize_row, axis='columns')
    
    # Add numeric features 
    thread_numeric_vectors = pd.DataFrame({
        'posts': (thread['thread_replies'] + 1), 
        'posts_log': np.log(thread['thread_replies'] + 1), 
    })
    
    # Bring it together
    X = pd.concat([thread_numeric_vectors, thread_text_vectors], axis='columns').astype('float')
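    try:
        y = thread['thread_label_id']
        # Auxiliary target: 0 for 'other', 1 for any game type
        y_aux = y.apply(lambda x: 0 if (x==label_map['other']) else 1).rename('is_game')
    except KeyError:
        # Unlabelled data (e.g. the test set) has no 'thread_label_id' column
        y = y_aux = None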

    return X, y, y_aux

`z` is our auxiliary objective: 1 if the thread is a game of any type, 0 if it belongs to the majority class `other`.
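
As a small illustration of how the auxiliary target relates to the main one, the sketch below derives an `is_game` flag from a handful of label ids. The `other_id` value of 8 comes from the label map above; the example labels are made up.

```python
import numpy as np

other_id = 8  # id of 'other' in the label map above
y_example = np.array([8, 4, 8, 1, 12])  # made-up label ids

# 1 for any game type, 0 for the majority class 'other'
z_example = (y_example != other_id).astype(int)
print(z_example)  # [0 1 0 1 1]
```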

In [13]:
X, y, z = vectorize_dataset(thread, post, nlp, agg_post=['first'])  # ['first', ' '.join]

In [14]:
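# Empty (all-NaN) series that will hold the predictions for y and z
y1 = pd.Series(index=y.index, name=y.name, dtype='float64')
z1 = pd.Series(index=z.index, name=z.name, dtype='float64')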

To keep things readable, `train` and `test` below hold index labels (thread numbers) rather than data; `test` plays the role of a validation set ('val').

In [15]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(X.index, test_size=0.25, random_state=99)
train.shape, test.shape

((268,), (90,))

### Review

So, what have I done so far? Let's list.

- Selected title, first post, and maybe a concatenation of all posts as our "documents".
- Turned each "document" into a vector, using pre-trained word vectors (the document vector is the average of the word vectors).
- Added number of posts (and its log) as additional features.

Note that we probably need to scale the latter two for some models, because the word-vector components are on a much smaller scale.
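
To make the scale difference concrete, here is a minimal sketch using the two reply counts shown in `thread.head(2)` above plus one made-up value. It standardizes each column to zero mean and unit variance, which is the same transformation `StandardScaler` applies. Note that `np.log1p(x)` is equivalent to `np.log(x + 1)`.

```python
import numpy as np

# 5703 and 961 come from thread.head(2); 12 is a made-up small thread
replies = np.array([5703.0, 961.0, 12.0])

# Same two numeric features as in vectorize_dataset: posts and log(posts)
features = np.column_stack([replies + 1, np.log1p(replies)])

# Standardize each column: subtract the mean, divide by the population std
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(features.max(axis=0))  # raw columns live on very different scales
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Tree-based models such as random forests are insensitive to this kind of rescaling, so the step matters mostly if the classifier is later swapped for a linear or distance-based one.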

## Fit models (pipelines)

In [16]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

In [17]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

#### For Z

In [18]:
pipe_z = Pipeline([
    ('scale', StandardScaler()), 
    ('cls', RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42))
])

In [19]:
pipe_z.fit(X.loc[train], z.loc[train])

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('cls', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False))])

In [20]:
z1.loc[train] = pipe_z.predict(X.loc[train])
z1.loc[test] = pipe_z.predict(X.loc[test])

In [21]:
print(
"""Aux target (z).
Train:
   Accuracy: {}
   Confusion:
{}

Test
   Accuracy: {}
   Confusion:
{}
""".format(
    accuracy_score(z.loc[train], z1.loc[train]),
    confusion_matrix(z.loc[train], z1.loc[train]),
    accuracy_score(z.loc[test], z1.loc[test]),
    confusion_matrix(z.loc[test], z1.loc[test]),
))


Aux target (z).
Train:
   Accuracy: 0.9402985074626866
   Confusion:
[[141   6]
 [ 10 111]]

Test
   Accuracy: 0.8666666666666667
   Confusion:
[[52  2]
 [10 26]]



#### For Y

In [22]:
train_p2 = z.loc[train][z.loc[train] == 1].dropna().index
test_p2 = z1.loc[test][z1.loc[test] == 1].dropna().index

In [23]:
pipe_y = Pipeline([
    ('scale', StandardScaler()), 
    ('cls', RandomForestClassifier(n_estimators=200, max_depth=3, max_leaf_nodes=10, random_state=68))
])

In [24]:
pipe_y.fit(X.loc[train_p2], y.loc[train_p2])

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('cls', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=68, verbose=0, warm_start=False))])

In [25]:
y1.loc[train] = label_map['other']
y1.loc[train_p2] = pipe_y.predict(X.loc[train_p2])

y1.loc[test] = label_map['other']
y1.loc[test_p2] = pipe_y.predict(X.loc[test_p2])
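
The two-stage logic above (default everything to `other`, then overwrite the rows that stage one flags as games) can be wrapped in a small helper. This is only a sketch: `two_stage_predict` is a hypothetical function, and `toy_gate` and `toy_classifier` are made-up stand-ins for `pipe_z.predict` and `pipe_y.predict`.

```python
import numpy as np


def two_stage_predict(stage1_predict, stage2_predict, X, default_label):
    """Combine a binary gate (stage 1) with a multiclass model (stage 2).

    Rows the gate marks as 0 get `default_label`; the rest go to stage 2.
    """
    X = np.asarray(X)
    is_game = np.asarray(stage1_predict(X)) == 1
    pred = np.full(len(X), default_label)
    if is_game.any():  # many estimators reject empty input
        pred[is_game] = stage2_predict(X[is_game])
    return pred


def toy_gate(X):
    """Stand-in for pipe_z.predict: 'game' when the first feature is positive."""
    return (X[:, 0] > 0).astype(int)


def toy_classifier(X):
    """Stand-in for pipe_y.predict: label 4 or 1 depending on the second feature."""
    return np.where(X[:, 1] > 0, 4, 1)


X_toy = np.array([[-1.0, 5.0], [2.0, 3.0], [2.0, -3.0]])
toy_pred = two_stage_predict(toy_gate, toy_classifier, X_toy, default_label=8)
print(toy_pred)  # [8 4 1]
```

Any mistake made by the gate propagates: a game thread that stage one labels as `other` never reaches stage two.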

In [26]:
print(
"""Aux target (y).
Train:
   Accuracy: {}
   Confusion:
{}

Test
   Accuracy: {}
   Confusion:
{}
""".format(
    accuracy_score(y.loc[train_p2], y1.loc[train_p2]),
    confusion_matrix(y.loc[train_p2], y1.loc[train_p2]),
    accuracy_score(y.loc[test_p2], y1.loc[test_p2]),
    confusion_matrix(y.loc[test_p2], y1.loc[test_p2]),
))


Aux target (y).
Train:
   Accuracy: 0.9256198347107438
   Confusion:
[[10  0  0  0  1  0  0  0  0  0  0  0]
 [ 0 19  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 11  0  0  0  0  0  0  0  0  0]
 [ 0  0  0 14  0  0  0  0  0  0  0  0]
 [ 0  0  0  0 26  0  0  0  0  0  0  0]
 [ 0  0  0  0  1  0  0  0  0  0  0  0]
 [ 0  0  0  0  1  0  0  0  0  0  0  0]
 [ 0  0  0  0  2  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0 18  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  3  0  0]
 [ 0  2  0  0  1  0  0  0  0  0  5  0]
 [ 0  1  0  0  0  0  0  0  0  0  0  6]]

Test
   Accuracy: 0.5357142857142857
   Confusion:
[[0 0 0 0 1 0 0 0 0 0 0 0]
 [0 2 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 5 0 0 0 0 0 0 0]
 [0 0 0 0 9 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 2 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 2]]



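Well, the model clearly overfits (0.93 accuracy on train vs. 0.54 on test) and leans heavily toward the most frequent game class, `closed-setup` (the fifth column, with '26' on the diagonal of the training matrix).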
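
One common way to counter this kind of imbalance, which is not used in this notebook, is to weight classes inversely to their frequency. scikit-learn's `class_weight='balanced'` option uses the formula `n_samples / (n_classes * count)`. The sketch below applies that formula to the row totals of the training confusion matrix above, purely to illustrate how uneven the weights would be.

```python
import numpy as np

# Row totals (true class counts) of the training confusion matrix above
class_counts = np.array([11, 19, 11, 14, 26, 1, 1, 2, 18, 3, 8, 7])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# Formula behind class_weight='balanced' in scikit-learn
balanced_weights = n_samples / (n_classes * class_counts)

print(n_samples)  # 121
print(balanced_weights.round(2))
```

With only one or two threads in some classes, the rarest classes would get weights dozens of times larger than the most frequent one, so reweighting alone is unlikely to fix them.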

## Score the resulting model

In [27]:
print(
"""Final target (y).
Train:
   Accuracy: {}
   Confusion:
{}

Test
   Accuracy: {}
   Confusion:
{}
""".format(
    accuracy_score(y.loc[train], y1.loc[train]),
    confusion_matrix(y.loc[train], y1.loc[train]),
    accuracy_score(y.loc[test], y1.loc[test]),
    confusion_matrix(y.loc[test], y1.loc[test]),
))


Final target (y).
Train:
   Accuracy: 0.9664179104477612
   Confusion:
[[ 10   0   0   0   1   0   0   0   0   0   0   0   0]
 [  0  19   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0  11   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0  14   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0  26   0   0   0   0   0   0   0   0]
 [  0   0   0   0   1   0   0   0   0   0   0   0   0]
 [  0   0   0   0   1   0   0   0   0   0   0   0   0]
 [  0   0   0   0   2   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0 147   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0  18   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   3   0   0]
 [  0   2   0   0   1   0   0   0   0   0   0   5   0]
 [  0   1   0   0   0   0   0   0   0   0   0   0   6]]

Test
   Accuracy: 0.7444444444444445
   Confusion:
[[ 0  0  0  0  1  0  0  2  0  0  0  0]
 [ 0  2  0  0  0  0  0  2  0  0  0  0]
 [ 0  0  0  0  1  0  0  1  0  0  0  0]
 [ 0  0  0  0  5  0  0  2  0  0  0  0]
 [ 0 

Our model consists of two parts, and the scores above show how well they do together.

About 0.74 accuracy on the held-out set! That's pretty good for a slightly tuned model, and significantly better than the baseline of ~0.55 (always predicting the majority class, `other`)!
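
For reference, the majority-class baseline can be read off the confusion matrices of the auxiliary target printed earlier: the first row of each holds the threads whose true label is `other`. A quick sketch of that arithmetic (`majority_baseline` is a hypothetical helper):

```python
import numpy as np

# Confusion matrices for the auxiliary target, copied from the output above
# (rows: true class, columns: predicted class; class 0 is 'other')
cm_train = np.array([[141, 6], [10, 111]])
cm_test = np.array([[52, 2], [10, 26]])


def majority_baseline(cm):
    """Accuracy of always predicting the most frequent true class."""
    return cm.sum(axis=1).max() / cm.sum()


print(round(majority_baseline(cm_train), 4))  # 0.5485
print(round(majority_baseline(cm_test), 4))   # 0.6
```

On the held-out split the baseline is 0.6, so the gain over it there is roughly 0.14.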

We still have some classes that are never predicted on the validation set (quite a few, actually: 7 out of 12 get no predictions at all), which is obviously bad.
However, on the training set all but 3 classes received at least one prediction, and those 3 account for only 4 threads in total.
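
Counts like these can be read directly off a confusion matrix: a class is never predicted when its column sums to zero. A small sketch, using the training matrix for the game-only model printed earlier (rows are true classes, columns are predictions):

```python
import numpy as np

# Training confusion matrix for the game-only model, copied from the output above
cm = np.array([
    [10, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 14, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 26, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 18, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0],
    [0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 5, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6],
])

# Columns that sum to zero are classes the model never predicted
never_predicted = np.flatnonzero(cm.sum(axis=0) == 0)
# Row sums give the number of true threads in those classes
threads_affected = cm.sum(axis=1)[never_predicted].sum()

print(never_predicted)   # [5 6 7]
print(threads_affected)  # 4
```

Positions 5, 6 and 7 correspond to `cybrid`, `kotm` and `non-mafia-game`, since `other` (id 8) is excluded from this matrix and sorts after them.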

### Fit on full set, predict on output set

In [28]:
post2, thread2 = solution.prepare.load_dfs('test')

In [29]:
X2, _, _ = vectorize_dataset(thread2, post2, nlp, agg_post=['first'])  # ['first', ' '.join]

Note that we are not actually retraining on the full set here; we reuse the pipelines fitted on the training split:

In [30]:
# Predict z2
z2 = pd.Series(pipe_z.predict(X2), X2.index)
pred_p2 = z2[z2 == 1].dropna().index
z2.shape, pred_p2.shape

((236,), (129,))

In [32]:
# Predict y2
y2 = pd.Series(label_map['other'], index=X2.index, name=y.name)
y2.loc[pred_p2] = pipe_y.predict(X2.loc[pred_p2])

In [33]:
result = y2.reset_index()[['thread_num', 'thread_label_id']]
result.head()

Unnamed: 0,thread_num,thread_label_id
0,89396,8
1,89665,4
2,89865,4
3,91240,1
4,91413,4


Exporting:

In [34]:
# NOTE: Exporting next to the notebooks - the files are small, but usually you don't want to do this.
out_dir = os.path.abspath('4_output')
os.makedirs(out_dir, exist_ok=True)

In [35]:
result.to_csv(
    os.path.join(out_dir, 'anatoly_m4_predict.csv'),
    index=False, header=True, encoding='utf-8', 
)