# TEXT MINING Project 2022-202: Multi-label text classification of scientific papers

**Author**: Vincenzo Collura

**Mail**: vincenzo.collura2@studio.unibo.it

# SECTION 0: Initial part and pre-processing


In [None]:
! pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
! pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
! pip install pytorch_lightning

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Libraries

In [None]:
import os
import string
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
import gc
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from warnings import simplefilter
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from collections import OrderedDict
from typing import List
from tqdm import tqdm
import copy
from copy import deepcopy
from google.colab import drive

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report

from sklearn.metrics import precision_score, multilabel_confusion_matrix, roc_auc_score, recall_score, accuracy_score, f1_score, precision_recall_fscore_support as prfs
from datasets import Dataset, DatasetDict

from transformers import AutoModelForSequenceClassification, AutoTokenizer, RobertaForSequenceClassification, BertForSequenceClassification, DataCollatorWithPadding, EvalPrediction, TrainerCallback, TrainingArguments, Trainer

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
import torchmetrics
#from torchinfo import summary
from torch.utils.data import DataLoader
from pytorch_lightning import Callback
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

from collections import defaultdict

# Glove embeddings
import gensim
import gensim.downloader as gloader

from transformers import BertTokenizer, BertModel
import shutil

# Plotting libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Input, Conv1D, Bidirectional, GlobalAveragePooling1D
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dropout
from tensorflow.keras.optimizers import RMSprop

# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
nltk.download("stopwords")
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
drive.mount('/content/drive')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Read data into a dataframe
Data from:

https://datahack.analyticsvidhya.com/contest/janatahack-independence-day-2020-ml-hackathon/True/#About

In [None]:
seed = 42

df = pd.read_csv('/content/drive/MyDrive/data/train.csv', sep=',')

## Quick look at the data

In [None]:
print('df train shape: ', df.shape)

df train shape:  (20972, 9)


In [None]:
df.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0


## Pre-processing

In [None]:
df['ABSTRACT'][0]

"  Predictive models allow subject-specific inference when analyzing disease\nrelated alterations in neuroimaging data. Given a subject's data, inference can\nbe made at two levels: global, i.e. identifiying condition presence for the\nsubject, and local, i.e. detecting condition effect on each individual\nmeasurement extracted from the subject's data. While global inference is widely\nused, local inference, which can be used to form subject-specific effect maps,\nis rarely used because existing models often yield noisy detections composed of\ndispersed isolated islands. In this article, we propose a reconstruction\nmethod, named RSM, to improve subject-specific detections of predictive\nmodeling approaches and in particular, binary classifiers. RSM specifically\naims to reduce noise due to sampling error associated with using a finite\nsample of examples to train classifiers. The proposed method is a wrapper-type\nalgorithm that can be used with different binary classifiers in a diagn

#### The abstracts have an evident problem: '\n' and '-' concatenate many words that should be separate.

In [None]:
df['ABSTRACT'] = df['ABSTRACT'].str.replace('\\n', ' ', regex=True)
df['ABSTRACT'] = df['ABSTRACT'].str.replace('-', ' ', regex=True)
print(df['ABSTRACT'][0])

  Predictive models allow subject specific inference when analyzing disease related alterations in neuroimaging data. Given a subject's data, inference can be made at two levels: global, i.e. identifiying condition presence for the subject, and local, i.e. detecting condition effect on each individual measurement extracted from the subject's data. While global inference is widely used, local inference, which can be used to form subject specific effect maps, is rarely used because existing models often yield noisy detections composed of dispersed isolated islands. In this article, we propose a reconstruction method, named RSM, to improve subject specific detections of predictive modeling approaches and in particular, binary classifiers. RSM specifically aims to reduce noise due to sampling error associated with using a finite sample of examples to train classifiers. The proposed method is a wrapper type algorithm that can be used with different binary classifiers in a diagnostic manner,

### TITLE and ABSTRACT concatenation

In [None]:
df["paper"] = df["TITLE"] + df["ABSTRACT"]
df = df.drop(["TITLE", "ABSTRACT"], axis=1)
df.head()

Unnamed: 0,ID,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,paper
0,1,1,0,0,0,0,0,Reconstructing Subject-Specific Effect Maps P...
1,2,1,0,0,0,0,0,Rotation Invariance Neural Network Rotation i...
2,3,0,0,1,0,0,0,Spherical polyharmonics and Poisson kernels fo...
3,4,0,0,1,0,0,0,A finite element approximation for the stochas...
4,5,1,0,0,1,0,0,Comparative study of Discrete Wavelet Transfor...


### Label distribution

In [None]:
labels = df.drop(['ID', 'paper'], axis=1).columns.to_list()
CLASS_NUM = len(labels)

In [None]:
label_counts = pd.concat(
    [
        df[labels].sum(),
    ],
    axis=0
)
label_counts = label_counts.reset_index()
label_counts = label_counts.rename({'level_0': 'Split', 'level_1': 'Value', 0: 'Count'}, axis=1)

In [None]:
fig = px.histogram(
    label_counts.sort_values('Count', ascending=False),
    x='index',
    y='Count',
    title='Distribution of labels',
    barmode='group',
    histnorm='percent',
)
fig.update_layout(yaxis_title="Number of samples (%)", xaxis_title="labels", xaxis_tickangle=-45,)
fig.show()

The dataset is imbalanced with respect to the labels: one label is more prominent than the rest, and two labels are very rare and account for only a few samples. For this reason we will focus on the macro-averaged F1-score, which weights every class equally and is therefore informative on imbalanced problems, where one class can dominate the dataset.

As did the authors of this paper, for example: https://aclanthology.org/2022.acl-long.306/
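To make the choice of the macro average concrete, here is a minimal pure-Python sketch on hypothetical toy data (two labels, the second one rare and never predicted). The helper names `counts` and `f1` are ours and are not used elsewhere in the notebook; `sklearn.metrics.f1_score` with `average='macro'` and `average='micro'` should give the same numbers on this input.

```python
# Toy multi-label data (hypothetical): label 0 is frequent, label 1 is rare.
y_true = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
# A classifier that never predicts the rare label.
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 0], [0, 0]]


def counts(y_true, y_pred, label):
    """Return (tp, fp, fn) for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 0)
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn), set to 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_label = [counts(y_true, y_pred, j) for j in range(2)]

# Macro: average of the per-label F1 scores, every label counts the same.
macro_f1 = sum(f1(*c) for c in per_label) / len(per_label)

# Micro: pool tp/fp/fn over all labels first, so frequent labels dominate.
micro_f1 = f1(*[sum(col) for col in zip(*per_label)])

print('macro F1:', macro_f1)
print('micro F1:', micro_f1)
```

The micro average stays high even though the rare label is missed completely, while the macro average drops sharply, which is the behaviour we want to monitor here.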

### Text processing pipeline
lower case -> tokenization -> remove stopwords -> lemmatization (or stemming) -> remove punctuation

In [None]:
stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()
# build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words("english"))

def preprocess(document, stem=True):
    'lower-cases, tokenizes, removes stop words, stems or lemmatizes, and strips punctuation'

    # change sentence to lower case
    document = document.lower()

    # tokenize into words
    words = word_tokenize(document)

    # remove stop words
    words = [word for word in words if word not in stop_words]

    # if stem is True use the Porter stemmer, otherwise the WordNet lemmatizer (verb POS)
    if stem:
        words = [stemmer.stem(word) for word in words]
    else:
        words = [wordnet_lemmatizer.lemmatize(word, pos='v') for word in words]

    # join words to make a sentence and strip punctuation characters
    document = " ".join(words).translate(str.maketrans('', '', string.punctuation))

    return document

In [None]:
# assign the whole column at once: element-wise chained assignment
# (df['paper'][i] = ...) triggers pandas' SettingWithCopyWarning
df['paper'] = df['paper'].apply(preprocess, stem=False)



### Result of pre-processing

In [None]:
df['paper'][2000]

'autonomy interactive music system vivo interactive music systems  ims  introduce new world music make modalities  really say create music  true autonomous creation  discuss video interactive vst orchestra  vivo   ims consider extra musical information adopt simple salience base model user system interaction simulate intentionality automatic music generation  key feature theoretical framework  brief overview pilot research  case study provide validation model present  research demonstrate meaningful usersystem interplay establish define reflexive multidominance '

## Raw and label split

In [None]:
df_raw = df.drop(list(df.columns[:7]), axis=1)
df_label = df.drop(['ID', 'paper'], axis=1)

## Train, validation and test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_raw, df_label, test_size=0.15, random_state=seed)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=seed)

In [None]:
X_train = X_train.reset_index().drop('index', axis=1)
X_val = X_val.reset_index().drop('index', axis=1)
X_test = X_test.reset_index().drop('index', axis=1)

In [None]:
y_train = y_train.reset_index().drop('index', axis=1)
y_val = y_val.reset_index().drop('index', axis=1)
y_test = y_test.reset_index().drop('index', axis=1)

In [None]:
print('train shape: ', X_train.shape)
print('test shape: ', X_test.shape)
print('val shape: ', X_val.shape)

train shape:  (15152, 1)
test shape:  (3146, 1)
val shape:  (2674, 1)
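The split sizes above can be checked by hand. The sketch below reproduces them with plain arithmetic, assuming (as we understand `train_test_split` to behave for a float `test_size`) that the held-out size is rounded up and the remainder goes to the other part. The variable names are ours.

```python
import math

n_samples = 20972   # rows of train.csv, from df.shape above
test_size = 0.15    # fraction used in both calls to train_test_split

# First split: hold out 15% of all samples as the test set.
n_test = math.ceil(test_size * n_samples)
n_trainval = n_samples - n_test

# Second split: hold out 15% of what is left as the validation set.
n_val = math.ceil(test_size * n_trainval)
n_train = n_trainval - n_val

print('train:', n_train, 'val:', n_val, 'test:', n_test)
print('train fraction:', round(n_train / n_samples, 4))
```

Because the second split is taken from the remaining 85%, the validation set is about 12.75% of the full data and the training set about 72.25%.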


# SECTION 1: TF-IDF and BOW in combination with SVC, (Multinomial) Naive Bayes and Logistic Regression

## Bag of words and tf-idf models

In [None]:
# bag of words model
vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(X_train['paper'].values.tolist())
bow_model_test = vectorizer.transform(X_test['paper'].values.tolist())

In [None]:
pd.DataFrame(bow_model.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50683,50684,50685,50686,50687,50688,50689,50690,50691,50692
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15147,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15148,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15149,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15150,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# tf-idf model
vectorizer = TfidfVectorizer()
tfidf_model = vectorizer.fit_transform(X_train['paper'].values.tolist())
tfidf_model_test = vectorizer.transform(X_test['paper'].values.tolist())

In [None]:
pd.DataFrame(tfidf_model.toarray()) # , columns = vectorizer.get_feature_names()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50683,50684,50685,50686,50687,50688,50689,50690,50691,50692
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15147,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15149,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15150,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
tfidf_model.shape

(15152, 50693)
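To show what the two matrices contain, here is a minimal pure-Python sketch on three hypothetical documents. It assumes the defaults of `TfidfVectorizer` as we understand them (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) and uses a plain whitespace split instead of scikit-learn's tokenizer, so it is an illustration rather than a re-implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already pre-processed.
docs = ["neural network model", "neural model", "stochastic equation"]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Bag of words: one row per document, one column per term, raw counts.
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency and smoothed inverse document frequency per term.
n_docs = len(docs)
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of zeros are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


# tf-idf: counts re-weighted by idf, then each row normalised.
tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in bow]

print(vocab)
print(bow[0])
print([round(v, 3) for v in tfidf[0]])
```

In the first document, "network" (which appears in one document only) receives a larger weight than "model" (which appears in two), whereas the bag-of-words row gives both the same count.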

## Model Selection

We will benchmark the following three models:

1. (Multinomial) Naive Bayes
2. Linear Support Vector Machine
3. Logistic Regression

In [None]:
def train_classifier(X_train, y_train):
    'fits one binary classifier per label (one-vs-rest) for each of the three models'

    model_svc = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)
    model_nb = OneVsRestClassifier(MultinomialNB()).fit(X_train, y_train)
    # note: despite the `_rf` suffix, this model is a Logistic Regression
    model_rf = OneVsRestClassifier(LogisticRegression(penalty = 'l2', C = 4, max_iter = 10000)).fit(X_train, y_train)

    return model_svc, model_nb, model_rf

## Train

In [None]:
classifier_tfidf_svc, classifier_tfidf_nb, classifier_tfidf_rf = train_classifier(tfidf_model, y_train)
classifier_bow_svc, classifier_bow_nb, classifier_bow_rf = train_classifier(bow_model, y_train)


Liblinear failed to converge, increase the number of iterations.



## Prediction

In [None]:
predicted_labels_tfidf_svc = classifier_tfidf_svc.predict(tfidf_model_test)
predicted_labels_tfidf_nb = classifier_tfidf_nb.predict(tfidf_model_test)
predicted_scores_tfidf_rf = classifier_tfidf_rf.predict(tfidf_model_test)

predicted_labels_bow_svc = classifier_bow_svc.predict(bow_model_test)
predicted_labels_bow_nb = classifier_bow_nb.predict(bow_model_test)
predicted_scores_bow_rf = classifier_bow_rf.predict(bow_model_test)

### Evaluation

In [None]:
def print_evaluation_scores(y_test, predicted):

    f1 = f1_score(y_test, predicted, average='macro')
    precision = precision_score(y_test, predicted, average='macro')
    recall = recall_score(y_test, predicted, average='macro')

    print('F1-score macro: ', f1)
    print('Precision macro: ', precision)
    print('Recall macro: ', recall)

    return [f1, precision, recall]

In [None]:
print('\nTfidf\n')
print('Linear Support Vector Machine\n')
tfidf_svc_scores = print_evaluation_scores(y_test, predicted_labels_tfidf_svc)
print('\n(Multinomial) Naive Bayes\n')
tfidf_nb_scores = print_evaluation_scores(y_test, predicted_labels_tfidf_nb)
print('\nLogistic Regression\n')
tfidf_rf_scores = print_evaluation_scores(y_test, predicted_scores_tfidf_rf)
print('\nBag-of-words\n')
print('Linear Support Vector Machine\n')
bow_svc_scores = print_evaluation_scores(y_test, predicted_labels_bow_svc)
print('\n(Multinomial) Naive Bayes\n')
bow_nb_scores = print_evaluation_scores(y_test, predicted_labels_bow_nb)
print('\nLogistic Regression\n')
bow_rf_scores = print_evaluation_scores(y_test, predicted_scores_bow_rf)


Tfidf

Linear Support Vector Machine

F1-score macro:  0.6928522711007767
Precision macro:  0.8063790940303628
Recall macro:  0.6339715551726862

(Multinomial) Naive Bayes

F1-score macro:  0.4217004463067089
Precision macro:  0.6174578053889225
Recall macro:  0.3507095994916168

Logistic Regression

F1-score macro:  0.6428341351694895
Precision macro:  0.8463318853573817
Recall macro:  0.580712926164605

Bag-of-words

Linear Support Vector Machine

F1-score macro:  0.6635526044949581
Precision macro:  0.7189122355175552
Recall macro:  0.6284775799153192

(Multinomial) Naive Bayes

F1-score macro:  0.6333431911905568
Precision macro:  0.7765951485739877
Recall macro:  0.6361488072321594

Logistic Regression

F1-score macro:  0.6706607179653253
Precision macro:  0.755915132640718
Recall macro:  0.6237276236674131



Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



In [None]:
print('TF-IDF: SVC\n')
print(classification_report(y_test, predicted_labels_tfidf_svc, target_names=labels))
cr_tfidf_svc = classification_report(y_test, predicted_labels_tfidf_svc, target_names=labels, output_dict=True)
print('\nTF-IDF: Naive Bayes\n')
print(classification_report(y_test, predicted_labels_tfidf_nb, target_names=labels))
cr_tfidf_nb = classification_report(y_test, predicted_labels_tfidf_nb, target_names=labels, output_dict=True)
print('\nTF-IDF: Logistic Regression\n')
print(classification_report(y_test, predicted_scores_tfidf_rf, target_names=labels))
cr_tfidf_rf = classification_report(y_test, predicted_scores_tfidf_rf, target_names=labels, output_dict=True)
print('\nBoW: SVC\n')
print(classification_report(y_test, predicted_labels_bow_svc, target_names=labels))
cr_bow_svc = classification_report(y_test, predicted_labels_bow_svc, target_names=labels, output_dict=True)
print('\nBoW: Naive Bayes\n')
print(classification_report(y_test, predicted_labels_bow_nb, target_names=labels))
cr_bow_nb = classification_report(y_test, predicted_labels_bow_nb, target_names=labels, output_dict=True)
print('\nBoW: Logistic Regression\n')
print(classification_report(y_test, predicted_scores_bow_rf, target_names=labels))
cr_bow_rf = classification_report(y_test, predicted_scores_bow_rf, target_names=labels, output_dict=True)

TF-IDF: SVC

                      precision    recall  f1-score   support

    Computer Science       0.81      0.82      0.82      1282
             Physics       0.92      0.83      0.87       932
         Mathematics       0.83      0.76      0.80       843
          Statistics       0.78      0.71      0.75       798
Quantitative Biology       0.56      0.22      0.32        89
Quantitative Finance       0.94      0.45      0.61        38

           micro avg       0.83      0.77      0.80      3982
           macro avg       0.81      0.63      0.69      3982
        weighted avg       0.83      0.77      0.80      3982
         samples avg       0.81      0.81      0.79      3982


TF-IDF: Naive Bayes

                      precision    recall  f1-score   support

    Computer Science       0.82      0.81      0.82      1282
             Physics       0.98      0.65      0.78       932
         Mathematics       0.96      0.47      0.63       843
          Statistics       0.94


Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.



### Score plot

In [None]:
metrics = ['F1-score', 'precision', 'recall']

fig = go.Figure(data=[
    go.Bar(name='SVC - TFIDF', x=metrics, y=tfidf_svc_scores),
    go.Bar(name='Naive Bayes - TFIDF', x=metrics, y=tfidf_nb_scores),
    go.Bar(name='Logistic Regression - TFIDF', x=metrics, y=tfidf_rf_scores),
    go.Bar(name='SVC - BOW', x=metrics, y=bow_svc_scores),
    go.Bar(name='Naive Bayes - BOW', x=metrics, y=bow_nb_scores),
    go.Bar(name='Logistic Regression - BOW', x=metrics, y=bow_rf_scores),
])

# Change the bar mode
fig.update_layout(barmode='group', title_text='Metrics bar plot for each combination')
fig.show()

Models with BOW input do better on average, although the best macro F1-score is obtained by SVC with TF-IDF. The most substantial gain comes from using BOW instead of TF-IDF with Multinomial Naive Bayes, which is the worst model with TF-IDF input. For the remaining two models (SVC and Logistic Regression), the difference between BOW and TF-IDF input is small.

### F1-score for each class and for each model

In [None]:
tfidf_svc_for_labels = []
tfidf_nb_for_labels = []
tfidf_rf_for_labels = []
bow_svc_for_labels = []
bow_nb_for_labels = []
bow_rf_for_labels = []

for i in labels:
  tfidf_svc_for_labels.append(cr_tfidf_svc[i]['f1-score'])
  tfidf_nb_for_labels.append(cr_tfidf_nb[i]['f1-score'])
  tfidf_rf_for_labels.append(cr_tfidf_rf[i]['f1-score'])
  bow_svc_for_labels.append(cr_bow_svc[i]['f1-score'])
  bow_nb_for_labels.append(cr_bow_nb[i]['f1-score'])
  bow_rf_for_labels.append(cr_bow_rf[i]['f1-score'])

In [None]:
fig = go.Figure()
# Create and style traces
fig.add_trace(go.Scatter(x=labels, y=tfidf_svc_for_labels, name='SVC - TFIDF', line=dict(color='red', width=2), marker = dict(symbol = "square", size = 10)))
fig.add_trace(go.Scatter(x=labels, y=tfidf_nb_for_labels, name = 'Naive Bayes - TFIDF', line=dict(color='green', width=2), marker = dict(symbol = "circle", size = 10)))
fig.add_trace(go.Scatter(x=labels, y=tfidf_rf_for_labels, name='Logistic Regression - TFIDF', line=dict(color='blue', width=2), marker = dict(symbol = "triangle-up", size = 10)))
fig.add_trace(go.Scatter(x=labels, y=bow_svc_for_labels, name='SVC - BOW', line = dict(color='goldenrod', width=2), marker = dict(symbol = "triangle-down", size = 10)))
fig.add_trace(go.Scatter(x=labels, y=bow_nb_for_labels, name='Naive Bayes - BOW', line = dict(color='orange', width=2), marker = dict(symbol = "diamond", size = 10)))
fig.add_trace(go.Scatter(x=labels, y=bow_rf_for_labels, name='Logistic Regression - BOW', line = dict(color='grey', width=2), marker = dict(symbol = "star", size = 10)))

# Edit the layout
fig.update_layout(title='F1-score for each class and for each model',
                   xaxis_title='labels',
                   yaxis_title='f1-score')


fig.show()

As we can see, the worst predicted classes are 'Quantitative Biology' and 'Quantitative Finance', as we expected from the label distribution. The worst model, Naive Bayes with TF-IDF, did not match them even once.
The best predicted class is 'Physics', although it is less frequent than 'Computer Science'. An interesting observation is that our best model overall (SVC TF-IDF) is only slightly worse than Logistic Regression TF-IDF, which, however, is much worse on the least represented classes.

# SECTION 2: GloVe Embeddings with simple neural architectures BiLSTM

### Input lengths

In this section the lengths of the inputs are measured and a **MAX_SEQ_LENGTH** is chosen, to be used later with the proposed neural architectures.

In [None]:
# note: len(x) counts characters; use len(x.split()) to measure the length in tokens
papers_lenght = [len(x) for x in X_train['paper']]

fig = px.histogram(
    papers_lenght,
    x=0,
    title='Distribution of the length of each row',
    histnorm='percent',
    labels={
        "variable": "Set"
    },
)
cut = np.ceil(np.percentile(papers_lenght, 99.5))
fig.add_vline(x=cut + 0.5, line_width=1, line_color="red")
fig.update_layout(yaxis_title="Number of samples (%)", xaxis_title="Length")
fig.show()

It is possible to cut the sequences in order to reduce the computation needed for both training and inference. To perform the cut, the 99.5th percentile of the lengths has been selected.
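As an illustration of the percentile cut computed in the cell above, here is a minimal pure-Python sketch on hypothetical lengths. The `percentile` helper is ours and uses linear interpolation between neighbouring ranks, which we assume matches the default behaviour of `np.percentile`.

```python
import math


def percentile(values, q):
    """q-th percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


# Hypothetical lengths: 200 ordinary samples and one very long outlier.
lengths = list(range(1, 201)) + [5000]

cut = math.ceil(percentile(lengths, 99.5))
max_seq_length = int(cut)

# Samples at or below the cut are kept whole, longer ones will be truncated.
kept = sum(1 for n in lengths if n <= max_seq_length)

print('max sequence length:', max_seq_length)
print('samples kept whole:', kept, 'of', len(lengths))
```

A single extreme outlier no longer dictates the padded length: only the longest 0.5% of the samples are truncated.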

In [None]:
MAX_SEQ_LENGTH = int(cut)
print('Max sequence length: ' + str(MAX_SEQ_LENGTH))

Max sequence length: 1552


### OOV terms checking

Since GloVe pre-trained embeddings will be used to embed terms in the "simple architectures" section, it is worth checking how many Out-Of-Vocabulary (OOV) terms are present in the data set. Note that OOV terms can appear in any split, so the counts below are reported for the training, validation and test sets.

https://aclanthology.org/D14-1162/
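The OOV check is a set difference between the words of the corpus and the words known to the embedding model. A minimal sketch with hypothetical data, where `known_words` stands in for the GloVe vocabulary:

```python
# Stand-in for the vocabulary of the embedding model (hypothetical).
known_words = {"the", "model", "network", "neural"}

# Hypothetical pre-processed documents.
corpus = ["neural network model", "usersystem interplay model"]

# Unique words of the corpus.
corpus_words = {tok for doc in corpus for tok in doc.split()}

# OOV terms are the corpus words that the embedding model does not know.
oov = sorted(corpus_words - known_words)
oov_ratio = len(oov) / len(corpus_words)

print('OOV terms:', oov)
print('OOV ratio:', oov_ratio)
```

Merged tokens such as "usersystem", which appears in the pre-processed example shown earlier after punctuation removal, are a typical source of OOV terms.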

In [None]:
# set dimension of the embedding
EMBEDDING_SIZE = 300

# load pre-trained glove vectors
download_path = f"glove-wiki-gigaword-{EMBEDDING_SIZE}"
glove = gloader.load(download_path)

In [None]:
def get_OOV(embedding_model: gensim.models.keyedvectors.KeyedVectors, word_listing):
    """
    find and list OOV words.

    Parameters
    ----------
    embedding_model : gensim.models.keyedvectors.KeyedVectors
         embedding model
    word_listing : pandas series
        column of token of a pandas dataframe
    """
    oovs = list(set(word_listing).difference(embedding_model.index2word))

    return oovs

In [None]:
# function to flatten a matrix
def flatten(l):
    return [item for sublist in l for item in sublist]

In [None]:
oov_train = get_OOV(glove, flatten([x.split() for x in X_train['paper']]))
oov_val = get_OOV(glove, flatten([x.split() for x in X_val['paper']]))
oov_test = get_OOV(glove, flatten([x.split() for x in X_test['paper']]))

print('Train words number: ', len(set(flatten([x.split() for x in X_train['paper']]))))
print('Validation words number: ', len(set(flatten([x.split() for x in X_val['paper']]))))
print('Test words number: ', len(set(flatten([x.split() for x in X_test['paper']]))))
print()
print('OOV in the training set: ', len(oov_train))
print('OOV in the validation set: ', len(oov_val))
print('OOV in the test set: ', len(oov_test))
print()
print('total OOVs: ', len(set(oov_train + oov_val + oov_test)))

Train words number:  50787
Validation words number:  19448
Test words number:  21046

OOV in the training set:  22683
OOV in the validation set:  5277
OOV in the test set:  6132

total OOVs:  29943


### Building Vocabulary

In [None]:
class Vocabulary:
    """ Stores a vocabulary for NLP tasks
    """

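    def __init__(self, word2idx=None, idx2word=None, curr_idx=2, oov_index=1, oov_token='-OOV-'):
        # avoid mutable default arguments, which would be shared across instances
        self.word2idx = OrderedDict() if word2idx is None else word2idx
        self.idx2word = OrderedDict() if idx2word is None else idx2word
        self.curr_idx = curr_idx  # 0-1 reserved
        self.oov_index = oov_index
        self.oov_token = oov_token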

    def import_gensim(self, embedding_model: gensim.models.keyedvectors.KeyedVectors) -> None:
        """ Import an existing vocabulary from a gensim model

        Parameters
        ----------
        embedding_model : gensim.models.keyedvectors.KeyedVectors
            Embedding model
        """
        self.idx2word.update(OrderedDict(enumerate(embedding_model.index2word, 2))) # start from 2, as 0 and 1 are reserved
        self.word2idx = {v: k for k, v in self.idx2word.items()}
        self.curr_idx = len(self.idx2word) + 2

    def build_from_list(self, sentences: List[List[str]]) -> None:
        """ Adds words to vocabulary starting from a list of documents

        Parameters
        ----------
        sentences : List[List[str]]
            Expected format: [document][sentence][token]
        """
        for sentence in tqdm(sentences):
            for token in sentence:
                if token not in self.word2idx:
                    self.word2idx[token] = self.curr_idx
                    self.idx2word[self.curr_idx] = token
                    self.curr_idx += 1

    def encode(self, sentences: List[List[str]]) -> List[List[int]]:
        """ Encode a sequence

        Parameters
        ----------
        sentences : List[List[str]]
            Expected format: [document][sentence][token]

        Returns
        -------
        List[List[int]]
            Return format: [document][sentence][encoded_token]
        """
        encoded = [[self.word2idx[token] if token in self.word2idx.keys() else self.oov_index for token in sentence] for sentence in sentences]

        return encoded

    def decode(self, sentences: List[List[int]]) -> List[List[str]]:
        """ Decode a sequence

        Parameters
        ----------
        sentences : List[List[int]]
            Expected format: [document][sentence][encoded_token]

        Returns
        -------
        List[List[str]]
            Return format: [document][sentence][token]
        """
        decoded = [[self.idx2word[encoded_id] if encoded_id in self.idx2word.keys() else self.oov_token for encoded_id in sentence] for sentence in sentences]

        return decoded

    def get_OOV(self, sentences: List[List[str]]):
        """ Find and list OOV words

        Parameters
        ----------
        sentences : List[List[str]]
            Expected format: [document][token]
        """
        ret = [item for sublist in sentences for item in sublist]
        oovs = list(set(ret).difference(self.idx2word.values()))

        return oovs


    def copy(self):
        """ Returns a (deep) copy of itself

        Returns
        -------
        Vocabulary
            Deep copy of instance
        """
        return copy.deepcopy(self)

In [None]:
voc_glove = Vocabulary()
voc_glove.import_gensim(glove)

voc_train = voc_glove.copy()
voc_train.build_from_list(X_train.squeeze().str.split())

voc_val = voc_train.copy()
voc_val.build_from_list(X_val.squeeze().str.split())

voc_test = voc_train.copy()
voc_test.build_from_list(X_test.squeeze().str.split())

100%|██████████| 15152/15152 [00:00<00:00, 42844.06it/s]
100%|██████████| 2674/2674 [00:00<00:00, 36958.48it/s]
100%|██████████| 3146/3146 [00:00<00:00, 41566.62it/s]


## Encode sequences

In [None]:
# text to sequences
X_train_enc = voc_train.encode(X_train.squeeze().str.split())
X_val_enc = voc_val.encode(X_val.squeeze().str.split())
X_test_enc = voc_test.encode(X_test.squeeze().str.split())

In [None]:
print(f"Tokenized sentence example:\n {X_train_enc[0]}")

Tokenized sentence example:
 [24088, 96008, 19293, 1425, 19293, 1425, 951, 2066, 5865, 391, 6620, 7334, 37361, 40409, 569, 87, 673, 4290, 2360, 2370, 1629, 6285, 535, 522, 4105, 490, 935, 41481, 1801, 236, 3941, 2442, 522, 63392, 44195, 851, 47096, 1961, 13188, 1629, 430, 16149, 30065, 4278, 5658, 19293, 21237, 904, 3606, 1589, 2370, 19293, 1425, 5658, 16149, 28938, 2529, 107427, 5243, 24088, 817, 107427, 12860, 1713, 14537, 4887, 1582, 994, 1425, 1309, 13595, 410, 57939, 8739, 12858, 24088, 57341, 1825, 3472, 1656, 3139, 22906, 6933, 2529, 21532, 8739, 57341, 63844, 5930, 4105, 490, 28749, 2370, 16149, 30065, 30410, 195709, 47096, 763, 1569, 5865, 6420]
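The encoding step is a dictionary lookup with a fallback index for unknown tokens. The sketch below uses a hypothetical three-word vocabulary and follows the convention of the `Vocabulary` class above, where index 0 is reserved for padding and index 1 for OOV tokens. The names `toy_word2idx` and `encode_tokens` are ours.

```python
OOV_INDEX = 1  # index 0 is reserved for padding, index 1 for OOV tokens

# Hypothetical vocabulary: real words start from index 2.
toy_word2idx = {"neural": 2, "network": 3, "model": 4}


def encode_tokens(tokens):
    """Map each token to its index, falling back to the OOV index."""
    return [toy_word2idx.get(token, OOV_INDEX) for token in tokens]


encoded = encode_tokens("neural usersystem model".split())
print(encoded)
```

In the notebook the fallback is rarely hit, because each vocabulary is extended with the words of its own split before encoding.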


### Truncation & Padding

For the input to be fed into the models, all sequences must have the same length. In this step, sequences longer than the maximum allowed length are truncated and shorter sequences are padded with zeros.
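
As a small illustration of the "post" truncation and "post" padding used in this notebook, here is a sketch for a single sequence. The helper `pad_or_truncate` is hypothetical and much simpler than the Keras function reproduced below:

```python
import numpy as np


def pad_or_truncate(seq, max_len, pad_value=0):
    """Post-truncate `seq` to `max_len` items, then post-pad with `pad_value`."""
    out = np.full(max_len, pad_value, dtype=np.int32)
    kept = seq[:max_len]      # truncating="post": drop the tail
    out[:len(kept)] = kept    # padding="post": the pad values go at the end
    return out


short = pad_or_truncate([7, 8], max_len=4)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4)
print(short)  # [7 8 0 0]
print(long)   # [1 2 3 4]
```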

In [None]:
# Source: https://github.com/keras-team/keras/blob/e6784e4302c7b8cd116b74a784f4b78d60e83c26/keras/utils/data_utils.py#L965
def pad_sequences(
    sequences,
    maxlen=None,
    dtype="int32",
    padding="pre",
    truncating="pre",
    value=0.0,
):
    """Pads sequences to the same length.
    This function transforms a list (of length `num_samples`)
    of sequences (lists of integers)
    into a 2D Numpy array of shape `(num_samples, num_timesteps)`.
    `num_timesteps` is either the `maxlen` argument if provided,
    or the length of the longest sequence in the list.
    Sequences that are shorter than `num_timesteps`
    are padded with `value` until they are `num_timesteps` long.
    Sequences longer than `num_timesteps` are truncated
    so that they fit the desired length.
    The position where padding or truncation happens is determined by
    the arguments `padding` and `truncating`, respectively.
    Pre-padding or removing values from the beginning of the sequence is the
    default.
    >>> sequence = [[1], [2, 3], [4, 5, 6]]
    >>> tf.keras.preprocessing.sequence.pad_sequences(sequence)
    array([[0, 0, 1],
           [0, 2, 3],
           [4, 5, 6]], dtype=int32)
    >>> tf.keras.preprocessing.sequence.pad_sequences(sequence, value=-1)
    array([[-1, -1,  1],
           [-1,  2,  3],
           [ 4,  5,  6]], dtype=int32)
    >>> tf.keras.preprocessing.sequence.pad_sequences(sequence, padding='post')
    array([[1, 0, 0],
           [2, 3, 0],
           [4, 5, 6]], dtype=int32)
    >>> tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=2)
    array([[0, 1],
           [2, 3],
           [5, 6]], dtype=int32)
    Args:
        sequences: List of sequences (each sequence is a list of integers).
        maxlen: Optional Int, maximum length of all sequences. If not provided,
            sequences will be padded to the length of the longest individual
            sequence.
        dtype: (Optional, defaults to `"int32"`). Type of the output sequences.
            To pad sequences with variable length strings, you can use `object`.
        padding: String, "pre" or "post" (optional, defaults to `"pre"`):
            pad either before or after each sequence.
        truncating: String, "pre" or "post" (optional, defaults to `"pre"`):
            remove values from sequences larger than
            `maxlen`, either at the beginning or at the end of the sequences.
        value: Float or String, padding value. (Optional, defaults to 0.)
    Returns:
        Numpy array with shape `(len(sequences), maxlen)`
    Raises:
        ValueError: In case of invalid values for `truncating` or `padding`,
            or in case of invalid shape for a `sequences` entry.
    """
    if not hasattr(sequences, "__len__"):
        raise ValueError("`sequences` must be iterable.")
    num_samples = len(sequences)

    lengths = []
    sample_shape = ()
    flag = True

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.

    for x in sequences:
        try:
            lengths.append(len(x))
            if flag and len(x):
                sample_shape = np.asarray(x).shape[1:]
                flag = False
        except TypeError as e:
            raise ValueError(
                "`sequences` must be a list of iterables. "
                f"Found non-iterable: {str(x)}"
            ) from e

    if maxlen is None:
        maxlen = np.max(lengths)

    is_dtype_str = np.issubdtype(dtype, np.str_) or np.issubdtype(
        dtype, np.unicode_
    )
    if isinstance(value, str) and dtype != object and not is_dtype_str:
        raise ValueError(
            f"`dtype` {dtype} is not compatible with `value`'s type: "
            f"{type(value)}\nYou should set `dtype=object` for variable length "
            "strings."
        )

    x = np.full((num_samples, maxlen) + sample_shape, value, dtype=dtype)
    for idx, s in enumerate(sequences):
        if not len(s):
            continue  # empty list/array was found
        if truncating == "pre":
            trunc = s[-maxlen:]
        elif truncating == "post":
            trunc = s[:maxlen]
        else:
            raise ValueError(f'Truncating type "{truncating}" not understood')

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError(
                f"Shape of sample {trunc.shape[1:]} of sequence at "
                f"position {idx} is different from expected shape "
                f"{sample_shape}"
            )

        if padding == "post":
            x[idx, : len(trunc)] = trunc
        elif padding == "pre":
            x[idx, -len(trunc) :] = trunc
        else:
            raise ValueError(f'Padding type "{padding}" not understood')
    return x

In [None]:
# pad X training, validation and test
X_train_padded = pad_sequences(X_train_enc, maxlen=MAX_SEQ_LENGTH, padding="post", truncating="post")
X_val_padded = pad_sequences(X_val_enc, maxlen=MAX_SEQ_LENGTH, padding="post", truncating="post")
X_test_padded = pad_sequences(X_test_enc, maxlen=MAX_SEQ_LENGTH, padding="post", truncating="post")

# X_train, X_val and X_test are not rebound to the padded arrays: Section 3 still needs the raw text DataFrames

print(f"Padding example:\n {X_train_padded[0]}")

Padding example:
 [24088 96008 19293 ...     0     0     0]


## Embedding matrix with GloVe

In this section we build the embedding matrix that is used as the weights of the embedding layer. Out-of-vocabulary (OOV) terms are assigned a random vector whose values are drawn from a uniform distribution.
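
A toy version of this construction may make the OOV handling easier to follow. It is only a sketch: the `pretrained` dictionary stands in for the GloVe vectors, and `toy_word2idx`, `dim` and the seed are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
dim = 4  # toy embedding size (the notebook uses EMBEDDING_SIZE)

# Hypothetical pre-trained vectors; "gluon" is deliberately missing (OOV).
pretrained = {"graph": np.ones(dim), "network": np.full(dim, 2.0)}
toy_word2idx = {"graph": 1, "network": 2, "gluon": 3}

# Row 0 stays all-zero and is left for the padding index.
matrix = np.zeros((len(toy_word2idx) + 1, dim))
for word, idx in toy_word2idx.items():
    if word in pretrained:
        matrix[idx, :] = pretrained[word]
    else:
        # OOV term: draw a small random vector, written to this row only.
        matrix[idx, :] = rng.uniform(low=-0.05, high=0.05, size=dim)

print(matrix.shape)  # (4, 4)
```

Writing to `matrix[idx, :]` touches a single row. A slice such as `matrix[idx:]` would instead overwrite every row from `idx` to the end of the matrix.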

In [None]:
voc = voc_val

VOCABULARY_SIZE = len(voc.idx2word) + 2

# create an empty embedding matrix
embedding_weights = np.zeros((VOCABULARY_SIZE, EMBEDDING_SIZE))

# create a word to index dictionary mapping
word2idx = voc.word2idx

emb_voc = {}

for word, index in word2idx.items():
    try:
        embedding_weights[index, :] = glove[word]
    except (KeyError, TypeError):
        # OOV term: draw (and cache) a random vector for this word
        if word not in emb_voc:
            embedding_vector = np.random.uniform(low=-0.05, high=0.05, size=EMBEDDING_SIZE)
            emb_voc[word] = embedding_vector

        # assign to this row only; `[index:]` would overwrite all following rows
        embedding_weights[index, :] = emb_voc[word]

In [None]:
y_train_np = y_train.to_numpy()
y_val_np = y_val.to_numpy()
y_test_np = y_test.to_numpy()

X_train_padded_lstm = np.concatenate((X_train_padded, X_val_padded), axis = 0)
y_train_padded_lstm = np.concatenate((y_train_np, y_val_np), axis = 0)
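
The `fit` call further down passes `validation_split=0.15` on these concatenated arrays. Keras takes that fraction from the end of the arrays, before shuffling, so the order of the concatenation matters. A quick check, assuming the progress-bar totals above (15152 and 2674) are row counts and that the split index is the floor of `n * (1 - validation_split)`:

```python
import math

# Sample counts taken from the progress bars above (assumed to be row counts).
n_train, n_val = 15152, 2674
n_total = n_train + n_val

validation_split = 0.15
# Assumption: the split index is the floor of n_total * (1 - validation_split).
split_at = math.floor(n_total * (1.0 - validation_split))
n_held_out = n_total - split_at

print(split_at, n_held_out)  # 15152 2674
```

Under these assumptions the held-out tail is exactly the original validation set, because the validation rows were appended after the training rows.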

## BiLSTM Model definition

Following the article linked below, a CNN and a BiLSTM are combined into a single multi-label classifier: the CNN is used as a feature extractor and the BiLSTM as the sequence learner that produces the desired output.

https://medium.com/star-gazers/multilabel-text-classification-using-cnn-and-bi-lstm-ce561c88e8d

In [None]:
sequence_input = Input(shape=(MAX_SEQ_LENGTH, ))
x = Embedding(embedding_weights.shape[0], EMBEDDING_SIZE, weights=[embedding_weights],trainable = False)(sequence_input)
x = SpatialDropout1D(0.2)(x)  # drops entire 1D feature maps rather than individual elements
x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)
x = Bidirectional(LSTM(128, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)
avg_pool = GlobalAveragePooling1D()(x)
x = Dense(128, activation='relu')(avg_pool)
x = Dropout(0.1)(x)
preds = Dense(6, activation="sigmoid")(x)
bilstm_model = Model(sequence_input, preds)
bilstm_model.compile(loss='binary_crossentropy',optimizer=RMSprop(learning_rate=1e-3),metrics=['accuracy'])
print(bilstm_model.summary())



Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1552)]            0         
                                                                 
 embedding (Embedding)       (None, 1552, 300)         127812900 
                                                                 
 spatial_dropout1d (SpatialD  (None, 1552, 300)        0         
 ropout1D)                                                       
                                                                 
 conv1d (Conv1D)             (None, 1550, 64)          57664     
                                                                 
 bidirectional (Bidirectiona  (None, 1550, 256)        197632    
 l)                                                              
                                                                 
 global_average_pooling1d (G  (None, 256)              0     
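
The parameter counts and output shapes in the summary above can be checked by hand. This is a sketch using the standard formulas for embedding, 1D convolution and LSTM layers; the number of vocabulary rows is derived from the summary (127,812,900 / 300), not read from the code.

```python
# Shapes read from the model summary above.
vocab_rows, emb_dim = 426_043, 300   # rows derived as 127,812,900 / 300
conv_filters, kernel_size = 64, 3
lstm_units = 128
seq_len = 1552

embedding_params = vocab_rows * emb_dim
# Conv1D: one kernel of shape (kernel_size, in_channels) per filter, plus a bias.
conv_params = kernel_size * emb_dim * conv_filters + conv_filters
# LSTM: 4 gates, each with input weights, recurrent weights and a bias;
# the bidirectional wrapper doubles the count.
lstm_one_direction = 4 * (lstm_units * (conv_filters + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction
# Conv1D with "valid" padding shortens the sequence by kernel_size - 1.
conv_out_len = seq_len - (kernel_size - 1)

print(embedding_params, conv_params, bilstm_params, conv_out_len)
# 127812900 57664 197632 1550
```

Almost all parameters sit in the frozen embedding layer, so the trainable part of the network is comparatively small.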





### Train

In [None]:
history = bilstm_model.fit(X_train_padded_lstm, y_train_padded_lstm, batch_size=128, epochs=21, verbose=1, validation_split=0.15)

Epoch 1/21
Epoch 2/21
Epoch 3/21
Epoch 4/21
Epoch 5/21
Epoch 6/21
Epoch 7/21
Epoch 8/21
Epoch 9/21
Epoch 10/21
Epoch 11/21
Epoch 12/21
Epoch 13/21
Epoch 14/21
Epoch 15/21
Epoch 16/21
Epoch 17/21
Epoch 18/21
Epoch 19/21
Epoch 20/21
Epoch 21/21



### Predict

In [None]:
pred_lstm = bilstm_model.predict(X_test_padded)



### Thresholding the predictions

In [None]:
def thresholding_nn(l, threshold):
    new_l = []
    new_l1 = []
    for l1 in l:
        new_l1 = []
        for element in l1:
            if element >= threshold:
                new_l1.append(1)
            else:
                new_l1.append(0)
        new_l.append(new_l1)
    return new_l

In [None]:
pred_lstm_thresh = thresholding_nn(pred_lstm, 0.5)

In [None]:
print('Thresholding predictions example:\n')
print(pred_lstm_thresh[4])
print(y_test_np[4].tolist())

Thresholding predictions example:

[1, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0]
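
The same thresholding can also be written as one vectorized NumPy comparison. A minimal sketch, where `threshold_probs` and the toy probabilities are hypothetical:

```python
import numpy as np


def threshold_probs(probs, threshold=0.5):
    """Return a 0/1 integer array marking the entries of `probs` >= `threshold`."""
    return (np.asarray(probs) >= threshold).astype(int)


toy_probs = [[0.91, 0.07, 0.50], [0.20, 0.65, 0.49]]
toy_labels = threshold_probs(toy_probs)
print(toy_labels.tolist())  # [[1, 0, 1], [0, 1, 0]]
```

Unlike `thresholding_nn`, this returns a NumPy array rather than a list of lists; call `.tolist()` if a list is needed.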


## Evaluation

In [None]:
metric_bilstm = prfs(y_test, pred_lstm_thresh, average='macro', zero_division= 0)

print(f"{f'[test] Model bilstm macro precision:':<40} {metric_bilstm[0]:>10}")
print(f"{f'[test] Model bilstm macro recall:':<40} {metric_bilstm[1]:>10}")
print(f"{f'[test] Model bilstm macro f1_score:':<40} {metric_bilstm[2]:>10}")

[test] Model bilstm macro precision:     0.8495129560832333
[test] Model bilstm macro recall:        0.6739606325130714
[test] Model bilstm macro f1_score:      0.7361820980484373
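
In the classification report that follows, macro recall (0.67) sits well below micro recall (0.79). Macro averaging gives every label the same weight, so the two rare classes pull it down. A toy sketch with one frequent and one rare label shows the effect; the arrays are invented for illustration and the metrics are computed by hand rather than with scikit-learn:

```python
import numpy as np

# Column 0: frequent label, column 1: rare label (toy data).
y_true_toy = np.array([[1, 0], [1, 0], [1, 1], [1, 1], [0, 0]])
y_pred_toy = np.array([[1, 0], [1, 0], [1, 1], [1, 0], [0, 0]])

tp = (y_true_toy & y_pred_toy).sum(axis=0)        # true positives per label
fn = (y_true_toy & (1 - y_pred_toy)).sum(axis=0)  # false negatives per label

per_label_recall = tp / (tp + fn)
macro_recall = per_label_recall.mean()            # every label counts equally
micro_recall = tp.sum() / (tp.sum() + fn.sum())   # every positive counts equally

print(per_label_recall.tolist())  # [1.0, 0.5]
print(macro_recall)               # 0.75
```

Here the micro recall is 5/6, higher than the macro value, because the frequent label dominates the pooled counts.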


In [None]:
print(classification_report(y_test, pred_lstm_thresh, target_names=labels))
cr_lstm = classification_report(y_test, pred_lstm_thresh, target_names=labels, output_dict=True)

                      precision    recall  f1-score   support

    Computer Science       0.80      0.89      0.84      1282
             Physics       0.90      0.84      0.87       932
         Mathematics       0.88      0.69      0.78       843
          Statistics       0.77      0.73      0.75       798
Quantitative Biology       0.74      0.39      0.51        89
Quantitative Finance       1.00      0.50      0.67        38

           micro avg       0.83      0.79      0.81      3982
           macro avg       0.85      0.67      0.74      3982
        weighted avg       0.84      0.79      0.81      3982
         samples avg       0.84      0.83      0.82      3982




Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.





In [None]:
bilstm_for_labels = []

for i in labels:
  bilstm_for_labels.append(cr_lstm[i]['f1-score'])

### Multilabel confusion matrix

In [None]:
bilstm_dfpred = pd.DataFrame(pred_lstm_thresh)
cm = multilabel_confusion_matrix(y_test_np.tolist(), bilstm_dfpred.to_numpy())
fig = make_subplots(2, 3, subplot_titles=labels)

for i in range(2):
    for j in range(3):
        # cm[k] is laid out as [[TN, FP], [FN, TP]] for label k
        current_map = cm[i * 3 + j]
        TN = current_map[0][0]
        FN = current_map[1][0]
        TP = current_map[1][1]
        FP = current_map[0][1]
        # rows follow y (actual class), columns follow x (predicted class)
        cells = [[FP, TN], [TP, FN]]
        fig.add_trace(
            go.Heatmap(
                z = cells,
                x = ['Pos', 'Neg'],
                y = ['Neg', 'Pos'],
                text = cells,  # annotate each cell with the value it is colored by
                texttemplate="%{text}",
                textfont={"size":20}), (i+1), (j+1))
fig.update_traces(showscale=False)
fig.update_layout(height=1200, width=1200, title_text='bilstm Confusion matrix')
fig.show()
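
The plotting code above relies on each per-label matrix being laid out as `[[TN, FP], [FN, TP]]`, which is the layout scikit-learn documents for `multilabel_confusion_matrix`. The sketch below rebuilds that layout by hand with NumPy on toy data, so the indexing can be verified without plotting; `per_label_confusion` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np


def per_label_confusion(y_true, y_pred):
    """Stack one [[TN, FP], [FN, TP]] matrix per label column."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    tn = (~y_true & ~y_pred).sum(axis=0)
    top = np.stack([tn, fp], axis=1)     # first row of each matrix
    bottom = np.stack([fn, tp], axis=1)  # second row of each matrix
    return np.stack([top, bottom], axis=1)  # shape: (n_labels, 2, 2)


toy_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
toy_pred = [[1, 0], [0, 1], [0, 1], [1, 0]]
toy_cm = per_label_confusion(toy_true, toy_pred)
print(toy_cm.tolist())  # [[[1, 1], [1, 1]], [[2, 0], [0, 2]]]
```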

### Score plot

In [None]:
metrics=['precision', 'recall', 'F1-score']

fig = go.Figure(data=[
    go.Bar(name='BiLSTM', x=metrics, y=metric_bilstm),
])

# Change the bar mode
fig.update_layout(barmode='group', title_text='Metrics bar plot for BiLSTM + GloVe')
fig.show()

As expected, the GloVe embeddings + BiLSTM architecture works better than all the previous models and could still improve with further training. I did not train the network any longer for hardware reasons (8 hours of training).

# SECTION 2.5: Word2Vec + BiLSTM

In [None]:
from gensim.models import KeyedVectors

word2vec = KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
# create an empty embedding matrix
embedding_matrix = np.zeros((VOCABULARY_SIZE, EMBEDDING_SIZE))

emb_voc = {}

for word, index in word2idx.items():
    try:
        embedding_matrix[index, :] = word2vec[word]
    except (KeyError, TypeError):
        # OOV term: draw (and cache) a random vector for this word
        if word not in emb_voc:
            embedding_vector = np.random.uniform(low=-0.05, high=0.05, size=EMBEDDING_SIZE)
            emb_voc[word] = embedding_vector

        # assign to this row only; `[index:]` would overwrite all following rows
        embedding_matrix[index, :] = emb_voc[word]

In [None]:
sequence_input = Input(shape=(MAX_SEQ_LENGTH, ))
x = Embedding(embedding_matrix.shape[0], EMBEDDING_SIZE, weights=[embedding_matrix],trainable = False)(sequence_input)
x = SpatialDropout1D(0.2)(x)  # drops entire 1D feature maps rather than individual elements
x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)
x = Bidirectional(LSTM(128, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)
avg_pool = GlobalAveragePooling1D()(x)
x = Dense(128, activation='relu')(avg_pool)
x = Dropout(0.1)(x)
preds = Dense(6, activation="sigmoid")(x)
bilstm_model_w2v = Model(sequence_input, preds)
bilstm_model_w2v.compile(loss='binary_crossentropy',optimizer=RMSprop(learning_rate=1e-3),metrics=['accuracy'])
print(bilstm_model_w2v.summary())

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1552)]            0         
                                                                 
 embedding_1 (Embedding)     (None, 1552, 300)         127812900 
                                                                 
 spatial_dropout1d_1 (Spatia  (None, 1552, 300)        0         
 lDropout1D)                                                     
                                                                 
 conv1d_1 (Conv1D)           (None, 1550, 64)          57664     
                                                                 
 bidirectional_1 (Bidirectio  (None, 1550, 256)        197632    
 nal)                                                            
                                                                 
 global_average_pooling1d_1   (None, 256)              0   

In [None]:
history_w2v = bilstm_model_w2v.fit(X_train_padded_lstm, y_train_padded_lstm, batch_size=128, epochs=21, verbose=1, validation_split=0.15)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
Epoch 9/21
Epoch 10/21
Epoch 11/21
Epoch 12/21
Epoch 13/21
Epoch 14/21
Epoch 15/21
Epoch 16/21
Epoch 17/21
Epoch 18/21
Epoch 19/21
Epoch 20/21
Epoch 21/21


In [None]:
pred_lstm_w2v = bilstm_model_w2v.predict(X_test_padded)



In [None]:
pred_lstm_w2v_thresh = thresholding_nn(pred_lstm_w2v, 0.5)

print('Thresholding predictions example:\n')
print(pred_lstm_w2v_thresh[4])
print(y_test_np[4].tolist())

Thresholding predictions example:

[1, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0]


## Evaluation

In [None]:
pred_lstm_w2v_thresh = thresholding_nn(pred_lstm_w2v, 0.5)
metric_bilstm_w2v = prfs(y_test, pred_lstm_w2v_thresh, average='macro', zero_division= 0)

print(f"{f'[test] Model bilstm macro precision:':<40} {metric_bilstm_w2v[0]:>10}")
print(f"{f'[test] Model bilstm macro recall:':<40} {metric_bilstm_w2v[1]:>10}")
print(f"{f'[test] Model bilstm macro f1_score:':<40} {metric_bilstm_w2v[2]:>10}")

[test] Model bilstm macro precision:     0.8684228569821342
[test] Model bilstm macro recall:        0.6628714425131824
[test] Model bilstm macro f1_score:      0.7272921970485282


In [None]:
print(classification_report(y_test, pred_lstm_w2v_thresh, target_names=labels))
cr_lstm_w2v = classification_report(y_test, pred_lstm_w2v_thresh, target_names=labels, output_dict=True)

                      precision    recall  f1-score   support

    Computer Science       0.81      0.88      0.84      1282
             Physics       0.91      0.85      0.87       932
         Mathematics       0.89      0.69      0.76       843
          Statistics       0.78      0.72      0.73       798
Quantitative Biology       0.75      0.38      0.48        89
Quantitative Finance       1.00      0.51      0.64        38

           micro avg       0.84      0.78      0.80      3982
           macro avg       0.86      0.66      0.73      3982
        weighted avg       0.83      0.78      0.80      3982
         samples avg       0.83      0.82      0.81      3982


## Comparison with GloVe

In [None]:
metrics=['precision', 'recall', 'F1-score']

fig = go.Figure(data=[
    go.Bar(name='BiLSTM + GloVe', x=metrics, y=metric_bilstm),
    go.Bar(name='BiLSTM + W2V', x=metrics, y=metric_bilstm_w2v),
])
# Change the bar mode
fig.update_layout(barmode='group', title_text='Metrics bar plot for both models')
fig.show()

As expected, the two combinations give about the same results: the difference is very small, about +0.01 in macro F1 on GloVe's side. GloVe + BiLSTM is therefore the model chosen for the final comparison with BERT and with the best model from Section 1.
Another thing to note is that building the W2V embedding matrix took about twice as long as building GloVe's. Everything else was kept equal: same network and same training time.
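
To put a number on the gap, the macro scores printed in the two evaluation cells above can be subtracted directly. This is just arithmetic on the reported values; the dictionary names are chosen for the example.

```python
# Macro scores copied from the two evaluation outputs above.
glove_scores = {
    "precision": 0.8495129560832333,
    "recall": 0.6739606325130714,
    "f1": 0.7361820980484373,
}
w2v_scores = {
    "precision": 0.8684228569821342,
    "recall": 0.6628714425131824,
    "f1": 0.7272921970485282,
}

# Positive values favor GloVe, negative values favor Word2Vec.
deltas = {name: glove_scores[name] - w2v_scores[name] for name in glove_scores}
for name, delta in deltas.items():
    print(f"{name:<10} {delta:+.4f}")
```

GloVe leads on macro recall and macro F1 by roughly 0.01, while Word2Vec has the slightly higher macro precision.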

In [None]:
bilstm_w2v_for_labels = []

for i in labels:
  bilstm_w2v_for_labels.append(cr_lstm_w2v[i]['f1-score'])

In [None]:
fig = go.Figure()
# Create and style traces

fig.add_trace(go.Scatter(x=labels, y=bilstm_for_labels, name = 'BiLSTM + GloVe', line=dict(color='green', width=2), marker = dict(symbol = "circle", size = 10)))
fig.add_trace(go.Scatter(x=labels, y=bilstm_w2v_for_labels, name='BiLSTM + W2V', line=dict(color='red', width=2), marker = dict(symbol = "square", size = 10)))

# Edit the layout
fig.update_layout(title='F1-score for each class and for each model',
                   xaxis_title='labels',
                   yaxis_title='f1-score')


fig.show()


Here, too, GloVe is slightly better in almost all classes, so it is the model chosen for the comparisons.

# SECTION 3: Transformers BERT-Base

In [None]:
MAX_LEN = 256
TRAIN_BATCH_SIZE = 32
VAL_BATCH_SIZE = 32
TEST_BATCH_SIZE = 32
EPOCHS = 3
LEARNING_RATE = 1e-05

### Construct a BERT tokenizer

https://huggingface.co/bert-base-uncased

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

### Custom dataset

In [None]:
class CustomDataset(torch.utils.data.Dataset):
  def __init__(self, df, tokenizer, max_len):
    self.df = df
    self.tokenizer = tokenizer
    self.max_len = max_len
    self.title = self.df['paper']
    self.targets = self.df[labels].values

  def __len__(self):
    return len(self.title)

  def __getitem__(self, index):
    title = str(self.title[index])
    title = " ".join(title.split())

    inputs = self.tokenizer.encode_plus(
        title,
        None,
        add_special_tokens = True,
        max_length = self.max_len,
        padding = 'max_length',
        return_token_type_ids = True,
        truncation = True,
        return_attention_mask = True,
        return_tensors = 'pt'
    )

    return {
        'input_ids': inputs['input_ids'].flatten(),
        'attention_mask': inputs['attention_mask'].flatten(),
        'token_type_ids': inputs['token_type_ids'].flatten(),
        'targets': torch.FloatTensor(self.targets[index])
    }

In [None]:
#train_size = 0.8
train_df = pd.concat([X_train, y_train], axis=1).reset_index()
val_df = pd.concat([X_val, y_val], axis=1).reset_index()
test_df = pd.concat([X_test, y_test], axis=1).reset_index()

In [None]:
train_df = train_df.drop(['index'], axis=1)
val_df = val_df.drop(['index'], axis=1)
test_df = test_df.drop(['index'], axis=1)

In [None]:
train_dataset = CustomDataset(train_df, tokenizer, MAX_LEN)
val_dataset = CustomDataset(val_df, tokenizer, MAX_LEN)
test_dataset = CustomDataset(test_df, tokenizer, MAX_LEN)

### Data loaders

In [None]:
train_data_loader = torch.utils.data.DataLoader(
    train_dataset,
    shuffle = True,
    batch_size = TRAIN_BATCH_SIZE,
    num_workers = 0
)

val_data_loader = torch.utils.data.DataLoader(
    val_dataset,
    shuffle = False,  # evaluation does not need shuffling
    batch_size = VAL_BATCH_SIZE,
    num_workers = 0
)

test_data_loader = torch.utils.data.DataLoader(
    test_dataset,
    shuffle = False,  # keeps predictions in the same order as test_df
    batch_size = TEST_BATCH_SIZE,
    num_workers = 0
)

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

### Checkpoint

In [None]:
import shutil

def load_ckp(checkpoint_fpath, model, optimizer):
  # restore model and optimizer state from a checkpoint written by save_ckp
  checkpoint = torch.load(checkpoint_fpath)
  model.load_state_dict(checkpoint['state_dict'])
  optimizer.load_state_dict(checkpoint['optimizer'])
  # train_model stores the validation loss as a plain Python float, so no .item() is needed
  valid_loss_min = checkpoint['valid_loss_min']
  return model, optimizer, checkpoint['epoch'], valid_loss_min

def save_ckp(state, is_best, checkpoint_path, best_model_path):
  f_path = checkpoint_path
  torch.save(state, f_path)
  # keep a separate copy of the best model so far
  if is_best:
    best_fpath = best_model_path
    shutil.copyfile(f_path, best_fpath)


## Bert class


In [None]:
from transformers import BertModel

class BERTClass(nn.Module):
  def __init__(self):
    super(BERTClass, self).__init__()
    self.bert_model = BertModel.from_pretrained('bert-base-uncased', return_dict = True)
    self.dropout = nn.Dropout(0.3)
    self.linear = nn.Linear(768, 6)

  def forward(self, input_ids, attention_mask, token_type_ids):
    output = self.bert_model(input_ids, attention_mask, token_type_ids)
    output_dropout = self.dropout(output.pooler_output)
    output = self.linear(output_dropout)
    return output

model = BERTClass()
model.to(device)

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BERTClass(
  (bert_model): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

In [None]:
def loss_fn(outputs, targets):
  return nn.BCEWithLogitsLoss()(outputs, targets)

optimizer = torch.optim.Adam(params=model.parameters(), lr = LEARNING_RATE)

### Training

In [None]:
from collections import defaultdict

def train_model(n_epochs, training_loader, validation_loader, model, optimizer, checkpoint_path, best_model_path):
  history = defaultdict(list)

  for epoch in range(1, n_epochs+1):

    print(f'Epoch {epoch}/{n_epochs}')
    print('-' * 10)

    train_loss = 0
    valid_loss = 0
    model.train()

    # training loop
    for index, batch in enumerate(training_loader):
      input_ids = batch['input_ids'].to(device, dtype=torch.long)
      attention_mask = batch['attention_mask'].to(device, dtype=torch.long)
      token_type_ids = batch['token_type_ids'].to(device, dtype=torch.long)
      targets = batch['targets'].to(device, dtype=torch.float)
      optimizer.zero_grad()
      outputs = model(input_ids, attention_mask, token_type_ids)
      loss = loss_fn(outputs, targets)
      loss.backward()
      optimizer.step()
      # running mean of the batch losses
      train_loss = train_loss + ((1/(index+1))*(loss.item() - train_loss))

    print(f'Train loss {train_loss}')
    print()

    # validation loop (runs once per epoch)
    model.eval()
    with torch.no_grad():
      for index, batch in enumerate(validation_loader):
        input_ids = batch['input_ids'].to(device, dtype=torch.long)
        attention_mask = batch['attention_mask'].to(device, dtype=torch.long)
        token_type_ids = batch['token_type_ids'].to(device, dtype=torch.long)
        targets = batch['targets'].to(device, dtype=torch.float)
        outputs = model(input_ids, attention_mask, token_type_ids)
        loss = loss_fn(outputs, targets)
        valid_loss = valid_loss + ((1/(index+1))*(loss.item()-valid_loss))

    checkpoint = {
        'epoch': epoch+1,
        'valid_loss_min': valid_loss,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict()
    }

    save_ckp(checkpoint, False, checkpoint_path, best_model_path)

    print(f'Val   loss {valid_loss}')
    print()

    history['train_loss'].append(train_loss)
    history['val_loss'].append(valid_loss)

  return model, history

In [None]:
trained_model, history = train_model(20, train_data_loader, val_data_loader, model, optimizer, "/models", "/models")

Epoch 1/20
----------
Epoch 2/20
----------
Epoch 3/20
----------
Epoch 4/20
----------
Epoch 5/20
----------
Epoch 6/20
----------
Epoch 7/20
----------
Epoch 8/20
----------
Epoch 9/20
----------
Epoch 10/20
----------
Epoch 11/20
----------
Epoch 12/20
----------
Epoch 13/20
----------
Epoch 14/20
----------
Epoch 15/20
----------
Epoch 16/20
----------
Epoch 17/20
----------
Epoch 18/20
----------
Epoch 19/20
----------
Epoch 20/20
----------
Train loss 0.011367219330778323

Val   loss 0.36643756247524706
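
The training and validation losses reported here are running means over the batches, updated with `loss = loss + (1/(index+1)) * (batch_loss - loss)`. A short sketch, with invented batch losses and a hypothetical helper name, shows that this update yields the ordinary arithmetic mean:

```python
def update_running_mean(mean_so_far, index, new_value):
    """Fold the `index`-th value (0-based) into the running mean."""
    return mean_so_far + (new_value - mean_so_far) / (index + 1)


batch_losses = [0.9, 0.5, 0.4, 0.2]  # toy values, not taken from the run above
running = 0.0
for i, batch_loss in enumerate(batch_losses):
    running = update_running_mean(running, i, batch_loss)

plain_mean = sum(batch_losses) / len(batch_losses)
print(abs(running - plain_mean) < 1e-12)  # True
```

The incremental form avoids storing every batch loss, which is why it is convenient inside a training loop.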



### Predictions

In [None]:
def get_predictions(model, data_loader):
  model = model.eval()

  predictions = []
  prediction_probs = []
  real_values = []

  with torch.no_grad():
    for d in data_loader:
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)
      token_type_ids = d["token_type_ids"].to(device)

      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids
      )

      # raw logits, one per label
      preds = outputs

      # multi-label task: each label gets an independent sigmoid probability
      # (softmax would force the six probabilities to sum to 1)
      probs = torch.sigmoid(outputs)

      predictions.extend(preds)
      prediction_probs.extend(probs)
      real_values.extend(targets)

  predictions = torch.stack(predictions).cpu()
  prediction_probs = torch.stack(prediction_probs).cpu()
  real_values = torch.stack(real_values).cpu()
  return predictions, prediction_probs, real_values

In [None]:
y_pred, y_pred_probs, y_test = get_predictions(
  trained_model,
  test_data_loader
)

### Thresholding the predictions

In [None]:
def thresholding_transformers(l, threshold):
    new_l = []
    new_l1 = []
    for l1 in l:
        new_l1 = []
        for element in l1:
            if element >= threshold:
                new_l1.append(1)
            else:
                new_l1.append(0)
        new_l.append(new_l1)
    return new_l

In [None]:
# y_pred holds raw logits, so a threshold of 0 is equivalent to a sigmoid probability of 0.5
y_pred_thresh = thresholding_transformers(y_pred, 0)

In [None]:
print('Thresholding predictions example:\n')
print(y_pred_thresh[200])
print(y_test[200].tolist())

Thresholding predictions example:

[1, 1, 0, 1, 0, 0]
[1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
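
The model outputs raw logits, and the threshold used above is 0 rather than 0.5. The two are equivalent because the sigmoid of 0 is 0.5. The NumPy sketch below checks this on invented logits and also shows why a softmax is not a suitable substitute in a multi-label setting; the helper functions are written here for illustration only.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


toy_logits = np.array([[2.0, 1.5, -3.0], [-0.2, 0.0, 4.0]])

by_logit = (toy_logits >= 0).astype(int)              # threshold the logits at 0
by_prob = (sigmoid(toy_logits) >= 0.5).astype(int)    # threshold probabilities at 0.5
by_softmax = (softmax(toy_logits) >= 0.5).astype(int)

print(by_logit.tolist())    # [[1, 1, 0], [0, 1, 1]]
print(by_softmax.tolist())  # [[1, 0, 0], [0, 0, 1]]
```

Because softmax probabilities compete and sum to 1 per row, at most one label per sample can reach 0.5 (barring exact ties), so the second positive label in each row is lost.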


### Evaluation

In [None]:
metric_bert = prfs(y_test, y_pred_thresh, average='macro', zero_division= 0)

In [None]:
print(f"{f'[test] Model Bert-base macro precision:':<40} {metric_bert[0]:>10}")
print(f"{f'[test] Model Bert-base macro recall:':<40} {metric_bert[1]:>10}")
print(f"{f'[test] Model Bert-base macro f1_score:':<40} {metric_bert[2]:>10}")

[test] Model Bert-base macro precision:  0.7549163269065676
[test] Model Bert-base macro recall:     0.7552461024737672
[test] Model Bert-base macro f1_score:   0.7513980801647039


In [None]:
print(classification_report(y_test, y_pred_thresh, target_names=labels))
cr_bert = classification_report(y_test, y_pred_thresh, target_names=labels, output_dict=True)

                      precision    recall  f1-score   support

    Computer Science       0.79      0.85      0.82      1282
             Physics       0.88      0.89      0.89       932
         Mathematics       0.79      0.79      0.79       843
          Statistics       0.73      0.81      0.77       798
Quantitative Biology       0.51      0.58      0.54        89
Quantitative Finance       0.82      0.61      0.70        38

           micro avg       0.79      0.83      0.81      3982
           macro avg       0.75      0.76      0.75      3982
        weighted avg       0.80      0.83      0.81      3982
         samples avg       0.84      0.86      0.83      3982




Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.





In [None]:
bert_for_labels = []

for i in labels:
  bert_for_labels.append(cr_bert[i]['f1-score'])

### Confusion matrix

In [None]:
bert_dfpred = pd.DataFrame(y_pred_thresh)

In [None]:
cm = multilabel_confusion_matrix(y_test.tolist(), bert_dfpred.to_numpy())
fig = make_subplots(2, 3, subplot_titles=labels)

for i in range(2):
    for j in range(3):
        # cm[k] is laid out as [[TN, FP], [FN, TP]] for label k
        current_map = cm[i * 3 + j]
        TN = current_map[0][0]
        FN = current_map[1][0]
        TP = current_map[1][1]
        FP = current_map[0][1]
        # rows follow y (actual class), columns follow x (predicted class)
        cells = [[FP, TN], [TP, FN]]
        fig.add_trace(
            go.Heatmap(
                z = cells,
                x = ['Pos', 'Neg'],
                y = ['Neg', 'Pos'],
                text = cells,  # annotate each cell with the value it is colored by
                texttemplate="%{text}",
                textfont={"size":20}), (i+1), (j+1))
fig.update_traces(showscale=False)
fig.update_layout(height=1200, width=1200, title_text='Bert Confusion matrix')
fig.show()
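
The heatmaps above depend on the layout of the array returned by `multilabel_confusion_matrix`: one 2x2 matrix per label, ordered as `[[TN, FP], [FN, TP]]`. A small sketch on toy data (the arrays are illustrative assumptions, not the test set) makes that layout easy to check:

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Toy indicator matrices: 4 samples, 2 labels (illustrative only)
toy_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
toy_pred = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

toy_cm = multilabel_confusion_matrix(toy_true, toy_pred)

for k, matrix in enumerate(toy_cm):
    # ravel() flattens [[TN, FP], [FN, TP]] row by row
    tn, fp, fn, tp = matrix.ravel()
    print(f"label {k}: TN={tn} FP={fp} FN={fn} TP={tp}")
```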

### Score plot

In [None]:
metrics=['precision', 'recall', 'F1-score']

fig = go.Figure(data=[
    go.Bar(name='Bert-Base', x=metrics, y=metric_bert),
])

# Change the bar mode
fig.update_layout(barmode='group', title_text='Metrics bar plot')
fig.show()

After several rounds of hyperparameter tuning, this is the result for Bert-base. It is much better than the models seen in Section 1 and, surprisingly, better than BiLSTM by only 0.2 in F1-score, the metric given the most weight for this task.
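
As a sanity check on the F1 figures, the macro and weighted averages can be recomputed from the per-class rows of the classification report printed earlier. The sketch below uses the rounded values from that report, so it only agrees to two decimals; the variable names are illustrative.

```python
# Per-class F1 and support, copied from the Bert-base classification report
report_f1 = {
    "Computer Science": 0.82,
    "Physics": 0.89,
    "Mathematics": 0.79,
    "Statistics": 0.77,
    "Quantitative Biology": 0.54,
    "Quantitative Finance": 0.70,
}
report_support = {
    "Computer Science": 1282,
    "Physics": 932,
    "Mathematics": 843,
    "Statistics": 798,
    "Quantitative Biology": 89,
    "Quantitative Finance": 38,
}

# Macro average: every class counts equally, however rare it is
macro_f1 = sum(report_f1.values()) / len(report_f1)

# Weighted average: each class counts in proportion to its support
total_support = sum(report_support.values())
weighted_f1 = sum(report_f1[c] * report_support[c] for c in report_f1) / total_support

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The macro average is lower than the weighted one because the two rarest classes, which have the lowest F1, weigh as much as the large ones.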

# SECTION 4: Final Conclusions



### Score plot

In [None]:
metrics=['precision', 'recall', 'F1-score']

fig = go.Figure(data=[
    go.Bar(name='Bert-Base', x=metrics, y=metric_bert),
    go.Bar(name='BiLSTM', x=metrics, y=metric_bilstm),
    go.Bar(name='SVC - TFIDF', x=metrics, y=tfidf_svc_scores),
])

# Change the bar mode
fig.update_layout(barmode='group', title_text='Metrics bar plot')
fig.show()

Here the best model from Section 1 (SVC with TF-IDF), the simple neural architecture (BiLSTM) and the transformer (Bert-Base) are compared. As expected, SVC is the worst and Bert is the best, although the BiLSTM, which sits in between, comes surprisingly close to Bert.

### F1-score for each class and for each model

In [None]:
fig = go.Figure()
# Create and style traces
fig.add_trace(go.Scatter(x=labels, y=bert_for_labels, name='Bert-Base', line=dict(color='blue', width=2), marker = dict(symbol = "triangle-up", size = 10)))
fig.add_trace(go.Scatter(x=labels, y=bilstm_for_labels, name = 'BiLSTM', line=dict(color='green', width=2), marker = dict(symbol = "circle", size = 10)))
fig.add_trace(go.Scatter(x=labels, y=tfidf_svc_for_labels, name='SVC - TFIDF', line=dict(color='red', width=2), marker = dict(symbol = "square", size = 10)))

# Edit the layout
fig.update_layout(title='F1-score for each class and for each model',
                   xaxis_title='labels',
                   yaxis_title='f1-score')


fig.show()

This graph confirms that Bert is better than BiLSTM overall, but the two are very close, and on the first class BiLSTM is even better than Bert. The most surprising result is SVC TF-IDF, which is better than BiLSTM on the most represented classes and sits just below Bert. It falls behind only on the two least represented classes, especially 'Quantitative Biology', where the neural network and the transformer are significantly more accurate.
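
Each curve in the plot is a list of per-class F1 scores. A minimal sketch of one way to build such a list from `classification_report(..., output_dict=True)`, as was done for `bert_for_labels`; the helper `per_class_f1` and the toy data are hypothetical, and it is only an assumption that the BiLSTM and SVC lists were produced the same way.

```python
import numpy as np
from sklearn.metrics import classification_report


def per_class_f1(report, class_names):
    """Return the F1 score of each class, in the order given by class_names."""
    return [report[name]["f1-score"] for name in class_names]


# Toy multi-label data with two classes (illustrative only)
toy_names = ["A", "B"]
toy_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
toy_pred = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

toy_report = classification_report(
    toy_true, toy_pred, target_names=toy_names, output_dict=True, zero_division=0
)
toy_f1 = per_class_f1(toy_report, toy_names)
print(dict(zip(toy_names, toy_f1)))
```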

## Conclusion

All the models in Section 1 had roughly the same performance, but the best was SVC with TF-IDF, which was therefore chosen for comparison with the next two architectures.
Both BiLSTM and Bert-Base exploit pre-trained embeddings. The simpler neural architecture and all the combinations in Section 1 proved less effective than Bert-Base; however, the performance of BiLSTM is very close to that of the transformer. Bert showed clear signs of overfitting in the early stages, and reducing the learning rate resulted in slower learning and longer training times with no significant improvement. The BiLSTM model performed significantly better than all the solutions in Section 1 and almost at Bert-Base levels. This method shows great potential, but it scales poorly and lacks the generalisation provided by large language models; it is therefore an interesting point of comparison with large language models, but should not be preferred given its size.