# Multi-class Classification Task on the R8 dataset

## Goal of this Notebook

The goal of this notebook is to demonstrate how to use the TW-IDF model implemented in the [gowpy library](https://github.com/GuillaumeDD/gowpy) for a multi-class classification task. More precisely, TW-IDF is compared to a standard TF-IDF model on the R8 dataset.

In short: on the R8 test set, the TW-IDF model outperforms a standard TF-IDF model on accuracy and micro-F1 (0.974 vs. 0.969), macro-F1 (0.94 vs. 0.92) and MCC (0.960 vs. 0.952).

The R8 dataset is the preprocessed Reuters dataset restricted to its top 8 classes. It contains 5,485 training documents and 2,189 test documents, with 8 different labels. Preprocessing consists of tokenization, stop-word removal and stemming of the original texts. The version of the dataset used here comes from this [GitHub repository](https://github.com/Nath-B/Graph-Of-Words).

## Python Environment

Preparing the Python environment:
```bash
pip install gowpy spacy pandas
python -m spacy download en
```

In [1]:
import spacy
import re

import pandas as pd

import pickle

## Loading the dataset

In [2]:
df_train = pd.read_csv('datasets/r8/r8-train-stemmed.txt',
                       header=None,
                       sep='\t',
                       names=['label', 'document'])
X_train = df_train['document']
y_train = df_train['label']

X_train[0], y_train[0]

('champion product approv stock split champion product inc board director approv two for stock split common share for sharehold record april compani board vote recommend sharehold annual meet april increas author capit stock mln mln share reuter',
 'earn')

In [3]:
y_train.value_counts()

earn        2840
acq         1596
crude        253
trade        251
money-fx     206
interest     190
ship         108
grain         41
Name: label, dtype: int64

In [4]:
df_test = pd.read_csv('datasets/r8/r8-test-stemmed.txt',
                      header=None,
                      sep='\t',
                      names=['label', 'document'])
X_test = df_test['document']
y_test = df_test['label']

y_test.value_counts()

earn        1083
acq          696
crude        121
money-fx      87
interest      81
trade         75
ship          36
grain         10
Name: label, dtype: int64

## Classification task

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

from sklearn.model_selection import GridSearchCV
from pprint import pprint
from time import time

from sklearn.metrics import classification_report, matthews_corrcoef, accuracy_score, f1_score

### Hyperparameter optimisation metric

In [6]:
from sklearn.metrics import matthews_corrcoef, make_scorer

# Wrap the Matthews correlation coefficient (MCC) so that GridSearchCV can use it as its score
scorer_mcc = make_scorer(matthews_corrcoef)
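
As a rough illustration of how MCC behaves on imbalanced labels such as those of R8, the pure-Python sketch below computes the multi-class MCC from label counts. The helper `multiclass_mcc` and the toy labels are illustrative assumptions, not part of the notebook; in practice `sklearn.metrics.matthews_corrcoef` does this job.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient (hypothetical helper)."""
    s = len(y_true)                                   # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_k = Counter(y_true)                             # true occurrences per class
    p_k = Counter(y_pred)                             # predictions per class
    numerator = c * s - sum(p_k[k] * t_k[k] for k in t_k)
    denominator = sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    # Degenerate case (e.g. a single predicted class): define MCC as 0.
    return numerator / denominator if denominator else 0.0


# Toy imbalanced sample: 9 'earn' documents and 1 'grain' document.
y_true = ['earn'] * 9 + ['grain']
always_majority = ['earn'] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(accuracy)                                  # 0.9
print(multiclass_mcc(y_true, always_majority))   # 0.0
print(multiclass_mcc(y_true, y_true))            # 1.0
```

A classifier that always predicts the majority class reaches 90% accuracy on this toy sample but an MCC of 0, which is why MCC can be a stricter selection criterion when classes are imbalanced.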

### TF-IDF model

#### Hyperparameter search and cross-validation score

In [7]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('svm', LinearSVC()),
])

parameters = {
    'vect__min_df': [5, 10, 20],
    'vect__max_df': [0.85, 0.9, 0.95],
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams only, or unigrams + bigrams
    #
    'svm__C' : [1, 10, 100, 1000],
    'svm__class_weight' : [None, 'balanced']
}
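
To make the `'balanced'` option of `svm__class_weight` more concrete, the sketch below applies scikit-learn's documented heuristic, `n_samples / (n_classes * count)`, to the training label counts printed earlier in this notebook. The variable names are illustrative; scikit-learn computes these weights internally when `class_weight='balanced'` is set.

```python
# Training label counts, copied from the y_train.value_counts() output above.
train_counts = {
    'earn': 2840, 'acq': 1596, 'crude': 253, 'trade': 251,
    'money-fx': 206, 'interest': 190, 'ship': 108, 'grain': 41,
}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# 'balanced' heuristic: rare classes get proportionally larger weights.
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label:<9} {weight:6.2f}")
```

With these counts, `grain` (41 documents) receives a weight roughly 70 times larger than `earn` (2,840 documents), so errors on rare classes cost more during training.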

In [8]:
# find the best parameters for both the feature extraction and the
# classifier
grid_search = GridSearchCV(pipeline, parameters, 
                           cv=10,
                           scoring=scorer_mcc,
                           n_jobs=-1, 
                           verbose=10)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()
grid_search.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'svm']
parameters:
{'svm__C': [1, 10, 100, 1000],
 'svm__class_weight': [None, 'balanced'],
 'vect__max_df': [0.85, 0.9, 0.95],
 'vect__min_df': [5, 10, 20],
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 10 folds for each of 144 candidates, totalling 1440 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    8.0s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.2s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   12.4s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   14.8s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:   18.1s
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:   20.5s
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:   24.7s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:   29.0s
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:   33.8s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   37.7s
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:   

done in 600.469s

Best score: 0.962
Best parameters set:
	svm__C: 1
	svm__class_weight: 'balanced'
	vect__max_df: 0.85
	vect__min_df: 20
	vect__ngram_range: (1, 2)


#### Fitting the final TF-IDF model

In [9]:
#
# /!\ best parameters copied manually from the grid search output above
#
pipeline_tfidf = Pipeline([
    ('vect', TfidfVectorizer(
        min_df=20,
        max_df=0.85,
        ngram_range=(1, 2),
    )),
    ('svm', LinearSVC(
        C=1,
        class_weight='balanced',
    )),
])

pipeline_tfidf.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 TfidfVectorizer(max_df=0.85, min_df=20, ngram_range=(1, 2))),
                ('svm', LinearSVC(C=1, class_weight='balanced'))])

#### Evaluation on the test set

In [10]:
y_pred = pipeline_tfidf.predict(X_test)
y_true = y_test

In [13]:
print(classification_report(y_true, y_pred))

mcc = matthews_corrcoef(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='micro')
print(f"mcc={mcc} ; accuracy={accuracy} ; f1-micro={f1}")

              precision    recall  f1-score   support

         acq       0.97      0.98      0.97       696
       crude       0.95      0.93      0.94       121
        earn       0.99      0.99      0.99      1083
       grain       1.00      0.90      0.95        10
    interest       0.89      0.86      0.87        81
    money-fx       0.88      0.79      0.84        87
        ship       0.84      0.89      0.86        36
       trade       0.89      0.99      0.94        75

    accuracy                           0.97      2189
   macro avg       0.93      0.92      0.92      2189
weighted avg       0.97      0.97      0.97      2189

mcc=0.9519401384452915 ; accuracy=0.9689355870260393 ; f1-micro=0.9689355870260393
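
The printed micro-F1 is identical to the accuracy, which is expected for single-label multi-class predictions; macro-F1 is the average that reacts to weak performance on small classes. The pure-Python sketch below illustrates this on toy labels. The helper `f1_scores` and the toy data are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro-F1, micro-F1) for single-label predictions (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    true_count = Counter(y_true)   # TP + FN per class
    pred_count = Counter(y_pred)   # TP + FP per class

    # Per-class F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (pred_count + true_count)
    per_class = [
        2 * tp[label] / (pred_count[label] + true_count[label])
        for label in labels
    ]
    macro = sum(per_class) / len(labels)

    # Micro-F1 pools the counts over all classes before computing F1.
    micro = 2 * sum(tp.values()) / (len(y_pred) + len(y_true))
    return macro, micro


# Toy sample: one of the two 'grain' documents is predicted as 'earn'.
y_true = ['earn'] * 8 + ['grain'] * 2
y_pred = ['earn'] * 8 + ['earn', 'grain']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"accuracy={accuracy:.3f} micro-F1={micro_f1:.3f} macro-F1={macro_f1:.3f}")
```

To report macro-F1 alongside micro-F1, `f1_score` can be called with `average='macro'`.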


### TW-IDF model

In [14]:
from gowpy.feature_extraction.gow import TwidfVectorizer
from gowpy.feature_extraction.gow.tw_vectorizer import TERM_WEIGHT_DEGREE
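
For intuition about graph-of-words term weighting: a document is turned into a graph whose nodes are its terms and whose edges link terms that co-occur within a sliding window, and a term's weight can then be its degree in that graph rather than its raw frequency. The sketch below is a simplified pure-Python illustration of undirected degree weighting only. The function name `degree_term_weights` and the window convention (each token is linked to the following `window_size - 1` tokens) are assumptions and may differ from gowpy's actual implementation; length normalisation and the IDF factor are left out.

```python
from collections import defaultdict


def degree_term_weights(tokens, window_size=2):
    """Degree of each term in an undirected graph-of-words (illustrative sketch)."""
    neighbours = defaultdict(set)
    for i, source in enumerate(tokens):
        # Assumed convention: link each token to the next (window_size - 1) tokens.
        for target in tokens[i + 1:i + window_size]:
            if source != target:
                neighbours[source].add(target)
                neighbours[target].add(source)
    # Keep first-occurrence order of the terms; isolated terms get degree 0.
    return {term: len(neighbours[term]) for term in dict.fromkeys(tokens)}


# First tokens of the first training document shown earlier in this notebook.
tokens = "champion product approv stock split champion product inc".split()
weights = degree_term_weights(tokens, window_size=2)
print(weights)
# {'champion': 2, 'product': 3, 'approv': 2, 'stock': 2, 'split': 2, 'inc': 1}
```

Here `champion` and `product` both occur twice, yet `product` gets a higher weight because it has three distinct neighbours; repeated co-occurrences with the same neighbour do not add to the degree.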

#### Hyperparameter search and cross-validation score

In [15]:
pipeline = Pipeline([
    ('gow', TwidfVectorizer()),
    ('svm', LinearSVC()),
])

parameters = {
    'gow__window_size' : [2, 4, 8, 16],
    'gow__b' : [0.0, 0.003],
    'gow__directed' : [False, True],
    'gow__term_weighting' : [TERM_WEIGHT_DEGREE],
#
    'gow__min_df' : [5, 10, 20],
    'gow__max_df' : [0.85, 0.9, 0.95],
#
    'svm__C' : [1, 10, 100, 1000],
    'svm__class_weight' : [None, 'balanced'],
}
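
The cost of an exhaustive grid search grows multiplicatively with each added parameter. As a quick sanity check, the sketch below counts the candidates of the two grids used in this notebook and the resulting number of fits with 10-fold cross-validation. The helper `n_candidates` is illustrative, and `'degree'` stands in for the `TERM_WEIGHT_DEGREE` constant so that the block runs without gowpy.

```python
from math import prod

tfidf_grid = {
    'vect__min_df': [5, 10, 20],
    'vect__max_df': [0.85, 0.9, 0.95],
    'vect__ngram_range': ((1, 1), (1, 2)),
    'svm__C': [1, 10, 100, 1000],
    'svm__class_weight': [None, 'balanced'],
}

twidf_grid = {
    'gow__window_size': [2, 4, 8, 16],
    'gow__b': [0.0, 0.003],
    'gow__directed': [False, True],
    'gow__term_weighting': ['degree'],
    'gow__min_df': [5, 10, 20],
    'gow__max_df': [0.85, 0.9, 0.95],
    'svm__C': [1, 10, 100, 1000],
    'svm__class_weight': [None, 'balanced'],
}


def n_candidates(grid):
    """Number of parameter combinations in an exhaustive grid."""
    return prod(len(values) for values in grid.values())


cv_folds = 10
for name, grid in [('TF-IDF', tfidf_grid), ('TW-IDF', twidf_grid)]:
    candidates = n_candidates(grid)
    print(f"{name}: {candidates} candidates, {candidates * cv_folds} fits")
# TF-IDF: 144 candidates, 1440 fits
# TW-IDF: 1152 candidates, 11520 fits
```

The TW-IDF grid is eight times larger than the TF-IDF one, so trimming even one parameter's list of values shortens the search noticeably.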

In [16]:
grid_search = GridSearchCV(pipeline, 
                           parameters, 
                           cv=10,
                           scoring=scorer_mcc,
                           n_jobs=-1, 
                           verbose=10)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()
grid_search.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['gow', 'svm']
parameters:
{'gow__b': [0.0, 0.003],
 'gow__directed': [False, True],
 'gow__max_df': [0.85, 0.9, 0.95],
 'gow__min_df': [5, 10, 20],
 'gow__term_weighting': ['degree'],
 'gow__window_size': [2, 4, 8, 16],
 'svm__C': [1, 10, 100, 1000],
 'svm__class_weight': [None, 'balanced']}
Fitting 10 folds for each of 1152 candidates, totalling 11520 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    8.6s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   13.8s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   17.1s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   26.1s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   34.0s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   46.0s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   55.1s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:  3

[Parallel(n_jobs=-1)]: Done 8970 tasks      | elapsed: 404.3min
[Parallel(n_jobs=-1)]: Done 9105 tasks      | elapsed: 409.0min
[Parallel(n_jobs=-1)]: Done 9240 tasks      | elapsed: 420.0min
[Parallel(n_jobs=-1)]: Done 9377 tasks      | elapsed: 427.4min
[Parallel(n_jobs=-1)]: Done 9514 tasks      | elapsed: 435.4min
[Parallel(n_jobs=-1)]: Done 9653 tasks      | elapsed: 446.3min
[Parallel(n_jobs=-1)]: Done 9792 tasks      | elapsed: 452.8min
[Parallel(n_jobs=-1)]: Done 9933 tasks      | elapsed: 465.8min
[Parallel(n_jobs=-1)]: Done 10074 tasks      | elapsed: 470.8min
[Parallel(n_jobs=-1)]: Done 10217 tasks      | elapsed: 483.5min
[Parallel(n_jobs=-1)]: Done 10360 tasks      | elapsed: 489.8min
[Parallel(n_jobs=-1)]: Done 10505 tasks      | elapsed: 500.1min
[Parallel(n_jobs=-1)]: Done 10650 tasks      | elapsed: 508.9min
[Parallel(n_jobs=-1)]: Done 10797 tasks      | elapsed: 517.5min
[Parallel(n_jobs=-1)]: Done 10944 tasks      | elapsed: 528.6min
[Parallel(n_jobs=-1)]: Done 11093

done in 34067.734s

Best score: 0.960
Best parameters set:
	gow__b: 0.0
	gow__directed: False
	gow__max_df: 0.85
	gow__min_df: 5
	gow__term_weighting: 'degree'
	gow__window_size: 2
	svm__C: 1
	svm__class_weight: 'balanced'


#### Fitting the final TW-IDF model

In [17]:
#
# /!\ best parameters copied manually from the grid search output above
#
pipeline_gow = Pipeline([
    ('gow', TwidfVectorizer(
        b=0.0,
        directed=False,
        min_df=5,
        max_df=0.85,
        window_size=2,
        term_weighting=TERM_WEIGHT_DEGREE
    )),
    ('svm', LinearSVC(
        C=1,
        class_weight='balanced',
    )),
])

pipeline_gow.fit(X_train, y_train)

Pipeline(steps=[('gow',
                 TwidfVectorizer(directed=False, max_df=0.85, min_df=5,
                                 tokenizer=<function default_tokenizer at 0x7fd04029a950>,
                                 window_size=2)),
                ('svm', LinearSVC(C=1, class_weight='balanced'))])

#### Evaluation on the test set

In [18]:
y_pred = pipeline_gow.predict(X_test)
y_true = y_test

In [20]:
print(classification_report(y_true, y_pred))

mcc = matthews_corrcoef(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='micro')
print(f"mcc={mcc} ; accuracy={accuracy} ; f1-micro={f1}")

              precision    recall  f1-score   support

         acq       0.98      0.98      0.98       696
       crude       0.95      0.95      0.95       121
        earn       0.99      0.99      0.99      1083
       grain       1.00      1.00      1.00        10
    interest       0.91      0.84      0.87        81
    money-fx       0.88      0.85      0.87        87
        ship       0.91      0.86      0.89        36
       trade       0.90      0.99      0.94        75

    accuracy                           0.97      2189
   macro avg       0.94      0.93      0.94      2189
weighted avg       0.97      0.97      0.97      2189

mcc=0.9597248748927998 ; accuracy=0.9739607126541799 ; f1-micro=0.9739607126541799
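
To put the two test accuracies side by side, the sketch below converts them into numbers of misclassified documents, using the accuracies printed in this notebook and the test set size of 2,189. Variable names are illustrative.

```python
n_test = 2189

# Test accuracies copied from the evaluation outputs of the two models.
accuracy_tfidf = 0.9689355870260393
accuracy_twidf = 0.9739607126541799

errors_tfidf = round(n_test * (1 - accuracy_tfidf))
errors_twidf = round(n_test * (1 - accuracy_twidf))
relative_reduction = (errors_tfidf - errors_twidf) / errors_tfidf

print(errors_tfidf, errors_twidf)      # 68 57
print(f"{relative_reduction:.1%}")     # 16.2%
```

TW-IDF misclassifies 11 fewer test documents than TF-IDF on this split. This is a single train/test split, and no significance test is run in this notebook.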
