# Initial Modelling notebook

In [1]:
import os
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import warnings

In [3]:
import bay12_solution_eposts as solution

## Load data

In [4]:
post, thread = solution.prepare.load_dfs('train')

In [5]:
post.head(2)

Unnamed: 0,thread_num,user,text,quotes
0,45016,Mephansteras,"Basically, this is where we talk about what ga...",[]
1,45016,dakarian,The currently running or about to run games (i...,[]


In [6]:
thread.head(2)

Unnamed: 0,thread_num,thread_name,thread_label,thread_replies,thread_label_id
0,45016,Games Threshold Discussion and List [Vote for ...,other,5703,8
1,88720,New Player's Guide to the Subforum - New to Ma...,other,961,8


I will set the thread number to be the index, to simplify matching in the future:

In [7]:
thread = thread.set_index('thread_num')
thread.head(2)

Unnamed: 0_level_0,thread_name,thread_label,thread_replies,thread_label_id
thread_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
45016,Games Threshold Discussion and List [Vote for ...,other,5703,8
88720,New Player's Guide to the Subforum - New to Ma...,other,961,8


We'll also load the label map, which tells us which integer id corresponds to which label:

In [8]:
label_map = solution.prepare.load_label_map()
label_map

type_name
bastard             0
beginners-mafia     1
byor                2
classic             3
closed-setup        4
cybrid              5
kotm                6
non-mafia-game      7
other               8
paranormal          9
supernatural       10
vanilla            11
vengeful           12
Name: type_id, dtype: int64
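
Since predictions come back as integer ids, it can be handy to invert this map. Here is a minimal sketch, assuming `label_map` is a Series of ids indexed by name as printed above. `label_map_demo` is a hand-built stand-in with only three of the entries, not the real object:

```python
import pandas as pd

# Hand-built stand-in for a few rows of the label map printed above
label_map_demo = pd.Series({"bastard": 0, "closed-setup": 4, "other": 8}, name="type_id")
label_map_demo.index.name = "type_name"

# Swap index and values to get an id -> name lookup
id_to_name = pd.Series(label_map_demo.index, index=label_map_demo.values)

# Translate some (made-up) predicted ids back to readable names
pred_ids = pd.Series([8, 4, 8, 0])
pred_names = pred_ids.map(id_to_name)
print(pred_names.tolist())
```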

## Create features from thread dataframe

We will fit a CountVectorizer, a simple transformation that counts how many times each vocabulary word appears in each thread name.

The parameter `min_df` sets the minimum number of thread names a word must appear in before it joins our vocabulary.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 1), min_df=3)
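
To see what `min_df` does, here is a tiny sketch on made-up titles (the titles and the name `cv_demo` are illustrative only). With `min_df=2`, only words that appear in at least two titles survive:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up thread titles
titles = ["mafia game one", "mafia game two", "zombie mafia"]

# Keep only words found in at least 2 of the titles
cv_demo = CountVectorizer(min_df=2)
counts = cv_demo.fit_transform(titles)

# Columns are ordered alphabetically by word
vocab = sorted(cv_demo.vocabulary_)
print(vocab)
print(counts.toarray())
```

The words "one", "two" and "zombie" each occur in a single title, so they are dropped from the vocabulary.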

In [10]:
word_vectors_raw = cv.fit_transform(thread['thread_name'])

To save space, this outputs a sparse matrix:

In [11]:
word_vectors_raw

<358x152 sparse matrix of type '<class 'numpy.int64'>'
	with 1513 stored elements in Compressed Sparse Row format>

However, since the rest of our features live in a DataFrame, we convert it into a dense pandas DataFrame:

In [12]:
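# NOTE: on scikit-learn >= 1.0 use cv.get_feature_names_out() wherever get_feature_names() is called;
# get_feature_names() was removed in scikit-learn 1.2.
word_df = pd.DataFrame(word_vectors_raw.toarray(), columns=cv.get_feature_names(), index=thread.index)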
word_df.head()

Unnamed: 0_level_0,10,12,13,14,15,18,19,alien,all,an,...,why,win,wins,winter,with,wizard,world,you,your,zombie
thread_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
45016,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
88720,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39338,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34959,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
64229,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The only other feature we have from our thread data is the number of replies. Let's add one (for the opening post) to get the number of posts. Let's also use the logarithm of the post count, just for fun.

We'll concatenate those into our X dataframe (Note that I'm renaming the columns, to keep track more easily):

In [13]:
X = pd.concat([
        (thread['thread_replies'] + 1).rename('posts'), 
        np.log(thread['thread_replies'] + 1).rename('log_posts'), 
        word_df,
    ], axis='columns')
X.head()

Unnamed: 0_level_0,posts,log_posts,10,12,13,14,15,18,19,alien,...,why,win,wins,winter,with,wizard,world,you,your,zombie
thread_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
45016,5704,8.648923,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
88720,962,6.869014,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39338,80,4.382027,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34959,1720,7.45008,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
64229,308,5.7301,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
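
As a side note, `np.log(x + 1)` can also be written as `np.log1p(x)`, which gives the same values. A small sketch using three of the reply counts shown above (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

# Reply counts of three threads from the table above
replies = pd.Series([5703, 961, 79], name="thread_replies")

log_posts_a = np.log(replies + 1)  # what the notebook does
log_posts_b = np.log1p(replies)    # equivalent spelling

print(log_posts_a.round(6).tolist())
```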


Our target is the category number. Remember that this isn't a regression task - there is no actual order between these categories! Also, our `y` is one-dimensional, so we'll keep it as a Series (even though it prints less prettily).

In [14]:
y = thread['thread_label_id']
y.head()

thread_num
45016    8
88720    8
39338    8
34959    8
64229    8
Name: thread_label_id, dtype: int64

## Split dataset into "training" and "validation"

In order to check the quality of our model in a more realistic setting, we will split all our input (training) data into a "training set" (which our model will see and learn from) and a "validation set" (where we see how well our model generalized). [Relevant link](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
# NOTE: setting the `random_state` lets you get the same results with the pseudo-random generator
validation_pct = 0.25
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=validation_pct, random_state=99)

In [17]:
X_train.shape, y_train.shape

((268, 154), (268,))

In [18]:
X_val.shape, y_val.shape

((90, 154), (90,))
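
The sizes above can be checked on dummy data. This is only a sketch: `rows` stands in for our 358 threads, and my understanding is that scikit-learn rounds a fractional validation size up, which is why we get 90 rather than 89 rows.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples, validation_pct = 358, 0.25
rows = np.arange(n_samples)  # dummy stand-in for the 358 threads

# Same random_state twice -> identical split
train_a, val_a = train_test_split(rows, test_size=validation_pct, random_state=99)
train_b, val_b = train_test_split(rows, test_size=validation_pct, random_state=99)

n_val_expected = math.ceil(n_samples * validation_pct)
print(len(train_a), len(val_a), n_val_expected)
```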

## Fit a model

Since we are fitting a multiclass model, [this scikit-learn link](https://scikit-learn.org/stable/modules/multiclass.html) is very relevant. To simplify things, we will be using an algorithm that is inherently multi-class.

In [19]:
from sklearn.tree import DecisionTreeClassifier

# Just using default parameters... what can go wrong?
cls = DecisionTreeClassifier(random_state=1337)

In [20]:
# Fit
cls.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1337,
            splitter='best')

In [21]:
# In-sample and out-of-sample predictions
# NOTE: we wrap the predictions in a Series so they keep the thread_num index
y_train_pred = pd.Series(
    cls.predict(X_train), 
    index=X_train.index, 
)
y_val_pred = pd.Series(
    cls.predict(X_val), 
    index=X_val.index, 
)

In [22]:
y_val_pred.head()

thread_num
47778    0
47530    4
45499    8
82626    1
86473    8
dtype: int64

## Score the model

To find out how well the model did, we'll use the [model evaluation functionality of sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html); specifically, the [multiclass classification metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics).

In [23]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

The [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) shows how our predictions differ from the actual values.

It's important to note how strongly our in-sample (training) and out-of-sample (validation/test) metrics differ.
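
Before reading the matrices, it helps to pin down the orientation that `confusion_matrix` uses: rows are the actual labels and columns are the predicted labels. A toy check with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Three samples are truly class 0, one is truly class 1
y_true_demo = [0, 0, 0, 1]
# The model predicts class 0 once and class 1 three times
y_pred_demo = [0, 1, 1, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
print(cm)
print("row sums (actual counts):", cm.sum(axis=1))
print("column sums (predicted counts):", cm.sum(axis=0))
```

A column that is all zeros therefore means the class was never predicted, while a row that is all zeros means the class never occurred in the data.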

In [24]:
def confusion_df(y_actual, y_pred):
    # confusion_matrix puts the actual labels on the rows and the predictions on the columns
    res = pd.DataFrame(
        confusion_matrix(y_actual, y_pred, labels=label_map.values),
        index=label_map.index.rename('actual'),
        columns=label_map.index.rename('predicted'),
    )
    return res

In [25]:
confusion_df(y_train, y_train_pred).style.highlight_max()

predicted,bastard,beginners-mafia,byor,classic,closed-setup,cybrid,kotm,non-mafia-game,other,paranormal,supernatural,vanilla,vengeful
actual,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
bastard,9,0,0,0,0,0,0,0,0,0,0,0,0
beginners-mafia,0,19,0,0,0,0,0,0,0,0,0,0,0
byor,0,0,11,0,0,0,0,0,0,0,0,0,0
classic,0,0,0,13,0,0,0,0,0,0,0,0,0
closed-setup,0,0,0,0,29,0,0,0,0,0,0,0,0
cybrid,0,0,0,0,0,2,0,0,0,0,0,0,0
kotm,0,0,0,0,0,0,1,0,0,0,0,0,0
non-mafia-game,0,0,0,0,0,0,0,2,0,0,0,0,0
other,0,0,0,0,0,0,0,0,146,0,0,0,0
paranormal,0,0,0,0,0,0,0,0,0,17,0,0,0


In [26]:
confusion_df(y_val, y_val_pred).style.highlight_max()

predicted,bastard,beginners-mafia,byor,classic,closed-setup,cybrid,kotm,non-mafia-game,other,paranormal,supernatural,vanilla,vengeful
actual,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
bastard,2,0,0,0,1,0,0,0,2,0,0,0,0
beginners-mafia,0,2,0,0,1,0,0,0,1,0,0,0,0
byor,0,0,0,1,0,0,0,0,1,0,0,0,0
classic,0,0,0,2,4,0,0,0,1,0,0,0,1
closed-setup,0,0,0,0,5,0,0,1,1,0,0,0,0
cybrid,0,0,0,0,1,0,0,0,0,0,0,0,0
kotm,0,0,0,0,1,0,0,0,0,0,0,0,0
non-mafia-game,0,0,0,0,0,0,0,0,0,0,0,0,0
other,0,0,3,2,1,1,0,0,46,2,0,0,0
paranormal,0,0,0,0,0,0,0,0,0,3,0,0,0


Oh boy. That's pretty bad - we didn't predict anything for several columns! 

Let's look at the metrics to confirm that it is indeed bad.

In [27]:
print("Training accuracy:", accuracy_score(y_train, y_train_pred))
print("Validation accuracy:", accuracy_score(y_val, y_val_pred))

Training accuracy: 1.0
Validation accuracy: 0.6888888888888889


In [28]:
report = classification_report(y_val, y_val_pred, labels=label_map.values, target_names=label_map.index)
print(report)

                 precision    recall  f1-score   support

        bastard       1.00      0.40      0.57         5
beginners-mafia       0.67      0.50      0.57         4
           byor       0.00      0.00      0.00         2
        classic       0.40      0.25      0.31         8
   closed-setup       0.33      0.71      0.45         7
         cybrid       0.00      0.00      0.00         1
           kotm       0.00      0.00      0.00         1
 non-mafia-game       0.00      0.00      0.00         0
          other       0.88      0.84      0.86        55
     paranormal       0.60      1.00      0.75         3
   supernatural       0.00      0.00      0.00         0
        vanilla       0.00      0.00      0.00         1
       vengeful       0.67      0.67      0.67         3

      micro avg       0.69      0.69      0.69        90
      macro avg       0.35      0.34      0.32        90
   weighted avg       0.73      0.69      0.69        90



  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


Well, that's pretty bad. We seriously overfit our training set... which is sort-of what I expected. Oh well.

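By the way, the warnings at the bottom say that precision and F-score are undefined for classes we never predicted, and that recall is undefined for classes with no true samples in the validation set. The report shows these as 0.00.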
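
To put the validation accuracy in context, we can compare it with a "model" that always answers `other`. This is a sketch that rebuilds stand-in validation labels from the support column of the report above (`y_val_demo` is not the real `y_val`, only the same class counts):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Per-class support copied from the classification report (label ids 0..12)
support = [5, 4, 2, 8, 7, 1, 1, 0, 55, 3, 0, 1, 3]
y_val_demo = np.repeat(np.arange(13), support)

# Always predict the most common class, "other" (id 8)
y_baseline = np.full_like(y_val_demo, 8)
baseline_acc = accuracy_score(y_val_demo, y_baseline)
print(len(y_val_demo), round(baseline_acc, 3))
```

Since 55 of the 90 validation threads are `other`, this trivial baseline already reaches about 0.61, so the tree's 0.69 is only a modest improvement.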

# Predict with the model

Here, we will predict on the test set (the predictions to submit), then save the results and the model.

**IMPORTANT NOTE**: In reality, you should re-train the same model on the entire training set (training plus validation) before predicting! However, I'm just using the same model as before, as it will be bad anyway. ;)
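
For reference, re-training could look roughly like the sketch below. It uses random stand-in data (`X_demo`, `y_demo`) rather than our real `X` and `y`, and `sklearn.base.clone` to get an unfitted copy with the same hyperparameters. In this notebook the vectorizer was already fitted on all training titles, so only the classifier would need refitting.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data, NOT the real features
rng = np.random.RandomState(0)
X_demo = rng.rand(40, 3)
y_demo = rng.randint(0, 3, size=40)

# The model as evaluated: fitted on the training portion only
cls_demo = DecisionTreeClassifier(random_state=1337).fit(X_demo[:30], y_demo[:30])

# clone() copies the hyperparameters but not the fitted state,
# so we can fit a fresh copy on all available rows
final_cls = clone(cls_demo).fit(X_demo, y_demo)
print(final_cls.predict(X_demo).shape)
```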

In [29]:
post_test, thread_test = solution.prepare.load_dfs('test')

In [30]:
thread_test = thread_test.set_index('thread_num')
thread_test.head(2)

Unnamed: 0_level_0,thread_name,thread_replies
thread_num,Unnamed: 1_level_1,Unnamed: 2_level_1
126856,"Mafia Tools and Utilities (lurkertracker, etc)...",38
132415,Mafia Theory,211


We need to attach a `thread_label_id` column, as given in the training set:

In [31]:
thread.head(2)

Unnamed: 0_level_0,thread_name,thread_label,thread_replies,thread_label_id
thread_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
45016,Games Threshold Discussion and List [Vote for ...,other,5703,8
88720,New Player's Guide to the Subforum - New to Ma...,other,961,8


Use the fitted CountVectorizer and other features to make our X dataframe:

In [32]:
word_vectors_raw_test = cv.transform(thread_test['thread_name'])
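
Note that we call `transform` here, not `fit_transform`: the vocabulary learned from the training titles is reused, so the test matrix has the same columns and words never seen in training are simply ignored. A toy sketch with made-up titles:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_titles = ["beginner mafia game", "vanilla mafia game"]
test_titles = ["zombie mafia", "completely unseen words"]

cv_demo = CountVectorizer()
train_counts = cv_demo.fit_transform(train_titles)  # learns the vocabulary
test_counts = cv_demo.transform(test_titles)        # reuses it unchanged

print(sorted(cv_demo.vocabulary_))
print(test_counts.toarray())
```

The second test title shares no words with the training titles, so its row is all zeros.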

In [33]:
word_df_test = pd.DataFrame(word_vectors_raw_test.toarray(), columns=cv.get_feature_names(), index=thread_test.index)
word_df_test.head()

Unnamed: 0_level_0,10,12,13,14,15,18,19,alien,all,an,...,why,win,wins,winter,with,wizard,world,you,your,zombie
thread_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
126856,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
132415,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
134482,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
133728,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
134270,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
X_test = pd.concat([
        (thread_test['thread_replies'] + 1).rename('posts'), 
        np.log(thread_test['thread_replies'] + 1).rename('log_posts'), 
        word_df_test,
    ], axis='columns')
X_test.head()

Unnamed: 0_level_0,posts,log_posts,10,12,13,14,15,18,19,alien,...,why,win,wins,winter,with,wizard,world,you,your,zombie
thread_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
126856,39,3.663562,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
132415,212,5.356586,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
134482,475,6.163315,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
133728,564,6.335054,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
134270,11,2.397895,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we predict with our model, then attach the predictions to a copy of `thread_test` as column `thread_label_id`.

In [35]:
y_test_pred = pd.Series(
    cls.predict(X_test), 
    index=X_test.index, 
)
y_test_pred.head()

thread_num
126856    8
132415    8
134482    8
133728    1
134270    8
dtype: int64

In [36]:
result = thread_test.copy()
result['thread_label_id'] = y_test_pred
result.head()

Unnamed: 0_level_0,thread_name,thread_replies,thread_label_id
thread_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
126856,"Mafia Tools and Utilities (lurkertracker, etc)...",38,8
132415,Mafia Theory,211,8
134482,"Iron Diadem, Night One: Things Said Behind Bar...",474,8
133728,Beginner's Mafia XLIV: The Court of Colors | R...,563,1
134270,Mod Use #2,10,8


We need to reshape to conform to the submission format specified [here](https://www.kaggle.com/c/ni-mafia-gametype#evaluation).

In [37]:
result = result.reset_index()[['thread_num', 'thread_label_id']]
result.head()

Unnamed: 0,thread_num,thread_label_id
0,126856,8
1,132415,8
2,134482,8
3,133728,1
4,134270,8


# Export predictions, model

Our model consists of the text vectorizer `cv` and the classifier `cls`. We already formatted our results; we just need to make sure not to write an extra index column.

In [38]:
# NOTE: Exporting next to the notebooks - the files are small, but usually you don't want to do this.
out_dir = os.path.abspath('1_output')
os.makedirs(out_dir, exist_ok=True)

In [39]:
result.to_csv(
    os.path.join(out_dir, 'baseline_predict.csv'),
    index=False, header=True, encoding='utf-8', 
)

In [40]:
import joblib

joblib.dump(cv, os.path.join(out_dir, 'cv.joblib'))
joblib.dump(cls, os.path.join(out_dir, 'cls.joblib'))
print("Done. :)")

Done. :)
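
To use the saved objects later, `joblib.load` reads them back. A small round-trip sketch on stand-in data, writing to a temporary directory instead of `1_output` (the file name `cls_demo.joblib` is just for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data and a small fitted model
rng = np.random.RandomState(0)
X_demo = rng.rand(20, 2)
y_demo = rng.randint(0, 2, size=20)
cls_demo = DecisionTreeClassifier(random_state=1337).fit(X_demo, y_demo)

# Dump to disk, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cls_demo.joblib")
    joblib.dump(cls_demo, path)
    cls_loaded = joblib.load(path)

# The loaded model should predict exactly like the original
same_predictions = bool((cls_loaded.predict(X_demo) == cls_demo.predict(X_demo)).all())
print(same_predictions)
```

Keep in mind that such files are generally only safe to load with the same (or a compatible) scikit-learn version that wrote them.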


# Final Remarks

I'd like to mention that the above notebook is here JUST TO GET YOU STARTED. Feel free to change anything or everything above.

It may be a good idea to keep a piece of paper with you, and draw out your entire pipeline there, to keep organized.

This model is severely overfit, partly because of the huge number of features derived from the thread names (152 word columns for only 268 training rows) and partly because the tree is grown with no depth limit. Some ways to combat this are lowering dimensionality (e.g. with PCA), increasing regularization, using a more feature-limited classifier, etc. You can also split this into two sub-problems: one classifier to tell whether a thread is a game or `"other"`, then a second one to classify the game type if it is a game.
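
As a starting point, here is a rough sketch of two of these ideas on random stand-in data. Everything in it is an assumption for illustration: the data, the parameter values `max_depth=4` and `min_samples_leaf=5`, and the names `small_tree`, `stage_one` and `stage_two`. It shows the mechanics only and says nothing about what will actually score well.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data: 3 label ids, with "other" (8) the most common
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 20)
y_demo = pd.Series(rng.choice([1, 4, 8], size=200, p=[0.2, 0.2, 0.6]))

# Idea 1: regularize the tree by limiting its depth and leaf size
small_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=1337)
small_tree.fit(X_demo, y_demo)
print("depth:", small_tree.get_depth())

# Idea 2: two stages. First decide game vs. "other" ...
OTHER_ID = 8
is_game = (y_demo != OTHER_ID)
stage_one = DecisionTreeClassifier(max_depth=4, random_state=1337).fit(X_demo, is_game)

# ... then classify the game type using only the game rows
game_rows = is_game.to_numpy()
stage_two = DecisionTreeClassifier(max_depth=4, random_state=1337)
stage_two.fit(X_demo[game_rows], y_demo[game_rows])
print("stage two classes:", stage_two.classes_.tolist())
```

At prediction time you would run `stage_one` first, and only pass the rows it flags as games on to `stage_two`.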