# Exploratory Data Analysis (EDA)

**Important note:** This is the main notebook file. It depends on supplementary code and data files, which have been moved into the folder "Supplementary files". Copy them into this notebook's folder before proceeding.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

#### 1. Econbiz dataset

The [EconBiz dataset](https://www.kaggle.com/datasets/hsrobo/titlebased-semantic-subject-indexing) was compiled from a meta-data export provided by ZBW - Leibniz Information Centre for Economics from July 2017. The annotations were selected by human annotators from the Standard Thesaurus Wirtschaft (STW), which contains approximately 5,700 labels.

In [None]:
# Load the data
df = pd.read_csv('econbiz.csv')
df.head()

In [None]:
# Check the data types of the columns
df.dtypes

In [None]:
# Describe the data
df.describe()

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Get maximum length of document in the 'title' column
df.title.str.len().max()

In [None]:
# Plot histogram of document length in the 'title' column
df.title.str.len().hist(bins=100)

In [None]:
# Plot histogram of number of labels in the 'labels' column
df.labels.str.split('\t').str.len().hist(bins=100)

In [None]:
# Get the vocabulary size of the 'title' column by fitting a CountVectorizer on it
vectorizer = CountVectorizer()
vectorizer.fit(df.title)
len(vectorizer.vocabulary_)
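
To make explicit what "vocabulary size" measures here, the following is a rough pure-Python sketch of the count. It assumes scikit-learn's default `CountVectorizer` settings (lowercasing, and tokens of two or more word characters); the function name `vocabulary_size` and the toy titles are made up for illustration and are not part of the notebook.

```python
import re

# Token pattern used by CountVectorizer by default: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def vocabulary_size(titles):
    """Count the distinct lowercase tokens across an iterable of title strings."""
    vocabulary = set()
    for title in titles:
        vocabulary.update(TOKEN_PATTERN.findall(title.lower()))
    return len(vocabulary)


# Hypothetical toy titles; single-character tokens such as "a" are dropped.
toy_titles = ["The cat sat", "A cat, a hat!"]
print(vocabulary_size(toy_titles))
```

On the real data the fitted vectorizer remains the reference; the sketch only illustrates why single-character tokens and case differences do not add to the count.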

#### 2. Pubmed dataset

The [PubMed dataset](https://www.kaggle.com/datasets/hsrobo/titlebased-semantic-subject-indexing) was compiled from the training set of the 5th BioASQ challenge on large-scale semantic subject indexing of biomedical articles, which were all in English. Again, we removed duplicates by checking for same title and labels. In total, approximately 12.8 million publications remain.
The labels are so called MeSH terms. In our data, approximately 28k of them are used.

In [None]:
# Load the data
df = pd.read_csv('pubmed.csv')
df.head()

In [None]:
# Check the data types of the columns
df.dtypes

In [None]:
# Describe the data
df.describe()

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Get maximum length of document in the 'title' column
df.title.str.len().max()

In [None]:
# Plot histogram of document length in the 'title' column
df.title.str.len().hist(bins=100)

In [None]:
# Plot histogram of number of labels in the 'labels' column
df.labels.str.split('\t').str.len().hist(bins=100)

In [None]:
# Get the vocabulary size of the 'title' column by fitting a CountVectorizer on it
vectorizer = CountVectorizer()
vectorizer.fit(df.title)
len(vectorizer.vocabulary_)

# Using Omikuji (Rust implementation of Parabel) for eXtreme Multi-label Classification (XMLC)

The steps below follow the installation instructions for [Rust](https://doc.rust-lang.org/cargo/getting-started/installation.html) and [Omikuji](https://github.com/tomtung/omikuji).

#### 1. Prepare data and install Rust, Omikuji

In [30]:
from process_data import create_parabel_data_files

In [31]:
# Define constants
DATASET = 'econbiz'
RAW_DATA = DATASET + '/econbiz.csv'
RESULTS_DIR = DATASET + '/Results'
MODEL_DIR = DATASET + '/Model'

PRED_FILE = RESULTS_DIR + '/{}_pred.txt'.format(DATASET)

In [None]:
# Create train and test .txt data files according to the Parabel data format mentioned here (https://github.com/tomtung/omikuji#data-format). 
train_fname, test_fname = create_parabel_data_files(dataset=DATASET, raw_data_file=RAW_DATA)
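
The implementation of `create_parabel_data_files` lives in the supplementary `process_data` module and is not shown here. As a rough illustration of the target format only (a header line with the number of examples, features and labels, then one line per example with comma-separated labels followed by space-separated `feature:value` pairs), a minimal sketch could look like this. The helper names `format_example` and `format_dataset` and the toy data are hypothetical.

```python
def format_example(labels, features):
    """Format one example as 'l1,l2 f1:v1 f2:v2' with sorted indices."""
    label_part = ",".join(str(label) for label in sorted(labels))
    feature_part = " ".join(
        f"{index}:{value:.6g}" for index, value in sorted(features.items())
    )
    return f"{label_part} {feature_part}"


def format_dataset(examples, n_features, n_labels):
    """Build the full file content: header line plus one line per example."""
    header = f"{len(examples)} {n_features} {n_labels}"
    lines = [format_example(labels, features) for labels, features in examples]
    return "\n".join([header] + lines) + "\n"


# Hypothetical toy data: (label indices, {feature index: value}) per example.
toy_examples = [([0, 2], {1: 0.5, 4: 1.0}), ([1], {0: 0.25})]
print(format_dataset(toy_examples, n_features=5, n_labels=3), end="")
```

In practice the feature values would come from a text vectorizer applied to the titles, and the label indices from a mapping of thesaurus terms to integers.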

In [None]:
# Install Omikuji with Cargo, which must already be installed at this point.
!cargo install omikuji --features cli --locked

In [None]:
# Set this variable to see full backtrace of the error
%env RUST_BACKTRACE=full

In [None]:
# Check help to see command options for Omikuji
!omikuji train --help

#### 2. Train the model

Don't forget to clear the model directory before training. Otherwise, you'll face an error.
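
One way to do the clearing from Python is sketched below. The helper name `clear_directory` is made up, and whether Omikuji needs the directory to be absent or merely empty is not stated here; this sketch leaves an empty directory, so drop the `mkdir` call if the directory must not exist.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory(path):
    """Remove `path` and everything under it, then recreate it empty."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return path


# Demonstration on a throwaway directory rather than the real model folder.
demo_root = Path(tempfile.mkdtemp())
demo_model_dir = demo_root / "Model"
demo_model_dir.mkdir()
(demo_model_dir / "stale.bin").write_text("old model")
clear_directory(demo_model_dir)
print(list(demo_model_dir.iterdir()))
shutil.rmtree(demo_root)
```

In this notebook the call would be `clear_directory(MODEL_DIR)`, run just before the training cell.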

In [None]:
# Train the model by specifying the train data file path and the model path
!omikuji train $train_fname --model_path $MODEL_DIR

#### 3. Evaluate the model

In [None]:
# Finally, evaluate the model on the test data and write the predictions to PRED_FILE
!omikuji test $MODEL_DIR $test_fname --out_path $PRED_FILE

As can be seen at the end of the testing, the precision with default parameters is Precision@[1, 3, 5] = [73.78, 54.11, 41.08]
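
For reference, Precision@k can be reproduced by hand. The sketch below assumes the standard definition (the fraction of the k highest-ranked predicted labels that are correct, averaged over all test examples); the function name `precision_at_k` and the toy data are illustrative and are not taken from Omikuji.

```python
def precision_at_k(ranked_predictions, true_labels, k):
    """Mean over examples of (correct labels among the top k predictions) / k."""
    total = 0.0
    for ranked, truth in zip(ranked_predictions, true_labels):
        truth = set(truth)
        hits = sum(1 for label in ranked[:k] if label in truth)
        total += hits / k
    return total / len(true_labels)


# Hypothetical toy data: predictions are ordered from highest to lowest score.
ranked = [["a", "b", "c"], ["d", "e", "f"]]
truth = [["a", "c"], ["e"]]
for k in (1, 3):
    print(f"P@{k} = {precision_at_k(ranked, truth, k):.2f}")
```

Because the denominator is always k, Precision@k tends to fall as k grows past the typical number of true labels per document, which matches the pattern in the reported numbers.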

**This is the end of the code that is being submitted as part of the assignment.**

The code below covers the other experiments tried during this assignment. I tried implementing the BaseMLP model from the paper referenced in the assignment, but ran into compatibility issues with the Keras package.

Similarly, the authors of the Parabel paper have generously provided the complete source code and binaries for testing their algorithm. Since the code is in C++ with some Matlab scripts, I installed both, but data format issues again prevented me from continuing.

# Using MLP for eXtreme Multi-label Classification

#### 1. Test the template code shared by one of the paper's authors on Kaggle.

In [1]:
from sklearn.metrics import f1_score
import numpy as np  # needed by evaluate() for np.mean
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.multiclass import OneVsRestClassifier

In [None]:
SINGLE_FOLD = True
ALL_TITLES = False
RAW_CSV_FILE = "econbiz/econbiz.csv"

In [2]:
def load_dataset(dataset_path, fold_i, all_titles=False):
    df = pd.read_csv(dataset_path)
    if not all_titles:
        df = df[df["fold"].isin(range(0, 10))]

    labels = df["labels"].values
    labels = [[l for l in label_string.split()] for label_string in labels]
    multilabel_binarizer = MultiLabelBinarizer(sparse_output=True)
    multilabel_binarizer.fit(labels)

    def to_indicator_matrix(some_df):
        some_df_labels = some_df["labels"].values
        some_df_labels = [[l for l in label_string.split()] for label_string in some_df_labels]
        return multilabel_binarizer.transform(some_df_labels)

    test_df = df[df["fold"] == fold_i]
    X_test = test_df["title"].values
    y_test = to_indicator_matrix(test_df)

    train_df = df[df["fold"] != fold_i]
    X_train = train_df["title"].values
    y_train = to_indicator_matrix(train_df)

    return X_train, y_train, X_test, y_test


In [None]:
# for demonstration, employ TFIDF with binary relevance logistic regression
clf = Pipeline(
    [("vectorizer", TfidfVectorizer(max_features=25000)),
     ("classifier", OneVsRestClassifier(LogisticRegression(), n_jobs=4))])

In [None]:
def evaluate(dataset):
    scores = []
    for i in range(0, 10):
        train_df, y_train, test_df, y_test = load_dataset(dataset, i, all_titles=ALL_TITLES)
        print('Shapes of X_train, y_train, X_test, y_test', train_df.shape, y_train.shape, test_df.shape, y_test.shape)
        clf.fit(train_df, y_train)
        y_pred = clf.predict(test_df)

        scores.append(f1_score(y_test, y_pred, average="samples"))

        if SINGLE_FOLD:
            break
    return np.mean(scores)

In [None]:
print("EconBiz average F-1 score:", evaluate(RAW_CSV_FILE))

#### 2. Implement the BaseMLP model mentioned in the paper.

In [12]:
import mlp_for_xmlc as mlx
import importlib

importlib.reload(mlx)

<module 'mlp_for_xmlc' from 'C:\\Users\\nikhi\\PycharmProjects\\Assignments\\CE807\\Assignment2\\mlp_for_xmlc.py'>

In [None]:
x_train, Y_train, x_test, Y_test = load_dataset(RAW_CSV_FILE, fold_i=0, all_titles=ALL_TITLES)
print('Shapes of X_train, y_train, X_test, y_test', x_train.shape, Y_train.shape, x_test.shape, Y_test.shape)

In [None]:
mlp = mlx.MLP(verbose=1)
tp = mlx.ThresholdingPredictor(mlp, alpha=1.0, stepsize=0.01, verbose=1)
tp.fit(x_train, Y_train)

In [None]:
y_pred = tp.predict(x_test)
print("Mean F1 score:", f1_score(Y_test, y_pred, average="samples"))

# Using Parabel paper code

#### 1. Prepare data in the format required by the Parabel binaries.

In [None]:
dataset = 'econbiz'
data_dir = dataset + '/Data'
results_dir = dataset + '/Results'
model_dir = dataset + '/Model'

trn_ft_file = data_dir + '/trn_X_Xf.txt'
trn_lbl_file = data_dir + '/trn_X_Y.txt'
tst_ft_file = data_dir + '/tst_X_Xf.txt'
tst_lbl_file = data_dir + '/tst_X_Y.txt'
score_file = results_dir + '/score_mat.txt'

In [None]:
from process_data import create_parabel_data_files_v1

create_parabel_data_files_v1(dataset, RAW_CSV_FILE)

In [None]:
# training
# Reads training features (in %trn_ft_file%), training labels (in %trn_lbl_file%), and writes Parabel model to %model_dir%
!parabel_train $trn_ft_file $trn_lbl_file $model_dir -T 1 -s 0 -t 3 -b 1.0 -c 1.0 -m 100 -tcl 0.1 -tce 0 -e 0.0001 -n 20 -k 0 -q 0

In [None]:
# testing
# Reads test features (in %tst_ft_file%), Parabel model (in %model_dir%), and writes test label scores to %score_file%
!parabel_predict $tst_ft_file $model_dir $score_file -t 3

#### 2. Evaluate the performance of the model.

##### Example based
The metrics are computed per datapoint: for each datapoint, a score is computed from its predicted labels, and these scores are then averaged over all datapoints.

Let $n$ be the number of datapoints, $Y_i$ the set of true labels of datapoint $i$, and $h(x_i)$ the set of labels predicted for it.

$$\text{Precision} = \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap h(x_i)|}{|h(x_i)|}$$

This is the fraction of the predicted labels that are correct. The numerator counts the labels that the prediction has in common with the ground truth, and dividing by the number of predicted labels gives the share of predictions that are actually in the ground truth.

$$\text{Recall} = \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap h(x_i)|}{|Y_i|}$$

This is the fraction of the actual labels that were predicted. The numerator is the same as above, and dividing by the number of actual labels gives the share of the ground truth that was recovered.

Other example-based metrics exist as well.
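
A small worked example of example-based precision and recall is sketched below. The function name and the toy label sets are made up, and the sketch assumes that a datapoint with an empty prediction (or empty ground truth) contributes 0 to the respective sum, which is only one possible convention.

```python
def example_based_precision_recall(predicted, actual):
    """Average per-datapoint precision and recall over all datapoints."""
    n = len(actual)
    precision_sum = 0.0
    recall_sum = 0.0
    for h, y in zip(predicted, actual):
        common = len(set(h) & set(y))
        if h:  # empty prediction contributes 0 to precision (assumption)
            precision_sum += common / len(h)
        if y:  # empty ground truth contributes 0 to recall (assumption)
            recall_sum += common / len(y)
    return precision_sum / n, recall_sum / n


# Hypothetical toy data with two datapoints.
predicted = [{"a", "b"}, {"c"}]
actual = [{"a"}, {"c", "d", "e"}]
precision, recall = example_based_precision_recall(predicted, actual)
# Precision terms: 1/2 and 1/1; recall terms: 1/1 and 1/3.
print(f"precision={precision:.4f} recall={recall:.4f}")
```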

##### Label based
Here the computation is done label-wise. For each label, the metrics (e.g. precision, recall) are computed over the entire dataset, as in binary classification (each label has a binary assignment), and these label-wise metrics are then aggregated.

This is an extension of the standard multi-class equivalent and is easiest to present in its general form:

$$\text{Macro averaged: } \frac{1}{q}\sum_{j=1}^{q} B(TP_j, FP_j, TN_j, FN_j)$$

$$\text{Micro averaged: } B\left(\sum_{j=1}^{q} TP_j, \sum_{j=1}^{q} FP_j, \sum_{j=1}^{q} TN_j, \sum_{j=1}^{q} FN_j\right)$$

Here $TP_j$, $FP_j$, $TN_j$, $FN_j$ are the true positive, false positive, true negative and false negative counts for the $j$th label only, and $q$ is the number of labels.

$B$ stands for any confusion-matrix based metric, such as the standard precision and recall formulas. For the macro average, the metric is applied to each label's counts and the results are averaged; for the micro average, the counts are summed over all labels first and the metric is applied once to the totals.
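
The difference between the two averages is easiest to see on a tiny example. The sketch below plugs precision in as the metric $B$, which only needs the TP and FP counts; the function names and the per-label counts are invented for illustration.

```python
def precision(tp, fp):
    """Standard precision from true positive and false positive counts."""
    return tp / (tp + fp) if tp + fp else 0.0


def macro_average(metric, counts):
    """Apply the metric to each label's counts, then average the results."""
    return sum(metric(tp, fp) for tp, fp in counts) / len(counts)


def micro_average(metric, counts):
    """Sum the counts over all labels, then apply the metric once."""
    total_tp = sum(tp for tp, _ in counts)
    total_fp = sum(fp for _, fp in counts)
    return metric(total_tp, total_fp)


# Hypothetical (TP_j, FP_j) counts: one frequent label, one rare label.
counts = [(90, 10), (1, 9)]
print("macro:", macro_average(precision, counts))
print("micro:", micro_average(precision, counts))
```

With these counts the macro average weights both labels equally, while the micro average is dominated by the frequent label. This matters for XMLC datasets, where most labels are rare.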

In [None]:
import numpy as np


def get_labels_from_txt_file(txt_file, prob_threshold=0.5):
    """
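    Reads the labels from a txt file of space-separated `label:score` pairs (one datapoint per line).
    :param prob_threshold: minimum score a label needs in order to be kept
    :param txt_file: path to the file; its first line is skipped
    :return: list of lists of labels for each datapoint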
    """
    assert 1.0 >= prob_threshold >= 0.0, "prob_threshold must be between 0 and 1"
    with open(txt_file) as f:
        lines = f.readlines()
    labels = []
    i = 0
    # max_prob = 0
    for line in lines[1:]:  # skip the first line
        line = line.strip()
        # split on space, gives a list of strings each in the format of label:score
        lbls_and_scores = line.split()

        # split on : and take the first part, which is the label
        sub_labels = [x.split(':')[0] for x in lbls_and_scores if float(x.split(':')[1]) >= prob_threshold]

        # sub_label_scores = [float(x.split(':')[1]) for x in lbls_and_scores]
        # if len(sub_label_scores) > 0:
        #     max_prob = max(max_prob, max(sub_label_scores))

        labels.append(sub_labels)

        if len(sub_labels) == 0 and len(line) > 0:
            i += 1
            # print("Empty label list for line:", line, i, "in file:", txt_file)
    print('Number of datapoints with zero predicted labels:', i, 'for file:', txt_file)
    # print('Max prob:', max_prob)
    return labels


def calculate_performance_metrics(lbl_pred, lbl_true):
    """
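    Calculates precision, recall and f1-score using the example-based method.
    :param lbl_pred: list of predicted label lists, one per datapoint
    :param lbl_true: list of ground-truth label lists, one per datapoint
    :return: None; the metrics are printed as percentages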
    """
    assert len(lbl_pred) == len(lbl_true)

    ratios = np.array(
        [(len(set(h) & set(y)) / len(h), len(set(h) & set(y)) / len(y)) for h, y in zip(lbl_pred, lbl_true) if
         len(h) > 0 and len(y) > 0])  # datapoints with an empty prediction or empty ground truth are skipped
    precision = np.mean(ratios[:, 0])
    recall = np.mean(ratios[:, 1])
    f1 = 2 * precision * recall / (precision + recall)

    print('Precision:', round(precision * 100, 2), '%')
    print('Recall:', round(recall * 100, 2), '%')
    print('F1-score:', round(f1 * 100, 2), '%')

In [None]:
y_pred = get_labels_from_txt_file(score_file, prob_threshold=0.1)
y_true = get_labels_from_txt_file(tst_lbl_file)
calculate_performance_metrics(y_pred, y_true)