
Dataset: DeliciousMIL

Source and description: https://archive.ics.uci.edu/ml/datasets/DeliciousMIL%3A+A+Data+Set+for+Multi-Label+Multi-Instance+Learning+with+Instance+Labels

In [1]:
import re
import warnings
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.metrics import (
    classification_report,
    zero_one_loss,
    coverage_error,
    label_ranking_loss,
    label_ranking_average_precision_score,
    silhouette_score,
)

warnings.filterwarnings('ignore')

<h2>Part A</h2>

<h3>Dataset loading and preprocessing</h3>

<font size="3">Below we read the four files that hold the training and test data and labels. Because each document is stored as a sequence of word indices, we vectorize the features with TfidfVectorizer.<br>
Each document ends up represented as a bag of words, paired with its 20-dimensional label vector.</font>

In [2]:
train_data = []
train_label = []
test_data = []
test_label = []

with open("data/train-data.dat", "r") as f:
    lines = f.readlines()
    for line in lines:
        # Remove the <n> markers so only the word indices remain
        line = re.sub(r'<\d+>', '', line).strip()
        line = re.sub('  +', ' ', line)
        train_data.append(line)
        
with open("data/train-label.dat", "r") as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip()
        values = line.split(' ')
        for value in values:
            train_label.append(int(value))

with open("data/test-data.dat", "r") as f:
    lines = f.readlines()
    for line in lines:
        # Remove the <n> markers so only the word indices remain
        line = re.sub(r'<\d+>', '', line).strip()
        line = re.sub('  +', ' ', line)
        test_data.append(line)
        
with open("data/test-label.dat", "r") as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip()
        values = line.split(' ')
        for value in values:
            test_label.append(int(value))
            
vectorizer = TfidfVectorizer()
X_train_A = vectorizer.fit_transform(train_data)
X_test_A = vectorizer.transform(test_data)

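# One row of 20 labels per document; the row count follows the data read above
y_train_A = np.array(train_label).reshape((len(train_data), 20))
y_test_A = np.array(test_label).reshape((len(test_data), 20))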

print(f"X_train_A shape: {X_train_A.shape}\n" +
    f"X_test_A shape: {X_test_A.shape}\n" +
    f"y_train_A shape: {y_train_A.shape}\n" +
    f"y_test_A shape: {y_test_A.shape}")

X_train_A shape: (8251, 8510)
X_test_A shape: (3983, 8510)
y_train_A shape: (8251, 20)
y_test_A shape: (3983, 20)
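
One detail worth keeping in mind when vectorizing word indices: the default `token_pattern` of TfidfVectorizer is `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. The sketch below applies that same regular expression with Python's `re` module to a made-up line (the sample indices are hypothetical, not taken from the dataset) to show that single-digit word indices would be dropped under the default pattern.

```python
import re

# Default token pattern used by scikit-learn's text vectorizers
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# A more permissive pattern that also keeps one-character tokens
PERMISSIVE_PATTERN = r"(?u)\b\w+\b"

# Hypothetical cleaned document line: word indices separated by spaces
sample_line = "6705 7 5997 3 8310"

default_tokens = re.findall(DEFAULT_PATTERN, sample_line)
permissive_tokens = re.findall(PERMISSIVE_PATTERN, sample_line)

print(default_tokens)     # single-digit indices 7 and 3 are missing
print(permissive_tokens)  # every index is kept
```

If single-digit indices should count as features, passing `token_pattern=r"(?u)\b\w+\b"` to TfidfVectorizer would keep them.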


<h3>Model training and evaluation</h3>

<font size="3">We define three base estimators (Logistic Regression, Random Forest, and a linear SVM) and use each of them with both the Binary Relevance (MultiOutputClassifier) and the Classifier Chain techniques.</font>

In [3]:
names = ['Logistic Regression', 'Random Forest', 'Linear SVM']
estimators = [LogisticRegression(class_weight='balanced', random_state=1361),
             RandomForestClassifier(criterion = 'entropy', class_weight='balanced',  random_state = 1361),
             SVC(kernel='linear', class_weight='balanced', max_iter = 10000, probability = True, random_state = 1361)]

In [4]:
for name, est in zip(names, estimators):
    print(f"Binary Relevance with {name} estimator:\n")
    clf = MultiOutputClassifier(est)
    clf.fit(X_train_A, y_train_A)
    y_pred = clf.predict(X_test_A)
    # predict_proba returns one (n_samples, 2) array per label; keep the
    # positive-class column of each and stack them into (n_samples, n_labels)
    y_proba = np.column_stack([p[:, 1] for p in clf.predict_proba(X_test_A)])
    
    print(classification_report(y_test_A, y_pred, zero_division='warn'))
    print(f"Subset Accuracy: {1-zero_one_loss(y_test_A, y_pred):.2f}")
    print(f"Coverage Error: {coverage_error(y_test_A, y_proba):.2f}")
    print(f"Ranking Loss: {label_ranking_loss(y_test_A, y_proba):.2f}")
    print(f"Average Precision: {label_ranking_average_precision_score(y_test_A, y_proba):.2f}")
    print("----------------------------------------------------------------------------\n")
    

Binary Relevance with Logistic Regression estimator:

              precision    recall  f1-score   support

           0       0.70      0.80      0.75       977
           1       0.39      0.62      0.48       228
           2       0.57      0.61      0.59      1558
           3       0.53      0.72      0.61       372
           4       0.56      0.68      0.61      1050
           5       0.34      0.58      0.43       537
           6       0.38      0.68      0.49       702
           7       0.54      0.62      0.58      1079
           8       0.49      0.65      0.56       803
           9       0.49      0.64      0.55       483
          10       0.44      0.62      0.52       507
          11       0.42      0.59      0.49       478
          12       0.36      0.60      0.45       509
          13       0.37      0.59      0.45       355
          14       0.42      0.66      0.51       392
          15       0.37      0.63      0.47       441
          16       0.34    

<font size="3">Binary Relevance evaluation based on macro-averaged F1 score: 
  * BR with Logistic Regression: 0.52
  * BR with Random Forest: 0.21
  * BR with SVC: 0.51 
<br><br>
Logistic Regression and SVC produce similar results, while Random Forest performs much worse.</font>
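
For reference, each f1-score in the report is the harmonic mean of that label's precision and recall, and the macro average is the unweighted mean of the per-label scores. A minimal sketch follows, using the rounded precision and recall of labels 0 and 1 from the Logistic Regression report above; the helper names `f1` and `macro_average` are illustrative and not part of scikit-learn.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores):
    """Unweighted mean: every label counts equally, whatever its support."""
    return sum(scores) / len(scores)


# (precision, recall) for labels 0 and 1, as printed in the report
rows = [(0.70, 0.80), (0.39, 0.62)]
per_label_f1 = [f1(p, r) for p, r in rows]

print([round(s, 2) for s in per_label_f1])
print(round(macro_average(per_label_f1), 2))
```

Because the inputs are already rounded to two decimals, values recomputed this way can differ from the report in the last digit for some labels.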

In [5]:
for name, est in zip(names, estimators):
    print(f"Classifier Chain with {name} estimator:\n")
    clf = ClassifierChain(est)
    clf.fit(X_train_A, y_train_A)
    y_pred = clf.predict(X_test_A)
    y_proba = clf.predict_proba(X_test_A)
    
    print(classification_report(y_test_A, y_pred, zero_division='warn'))
    print(f"Subset Accuracy: {1-zero_one_loss(y_test_A, y_pred):.2f}")
    print(f"Coverage Error: {coverage_error(y_test_A, y_proba):.2f}")
    print(f"Ranking Loss: {label_ranking_loss(y_test_A, y_proba):.2f}")
    print(f"Average Precision: {label_ranking_average_precision_score(y_test_A, y_proba):.2f}")
    print("----------------------------------------------------------------------------\n")

Classifier Chain with Logistic Regression estimator:

              precision    recall  f1-score   support

           0       0.70      0.80      0.75       977
           1       0.40      0.63      0.49       228
           2       0.57      0.61      0.59      1558
           3       0.43      0.77      0.55       372
           4       0.56      0.66      0.61      1050
           5       0.33      0.51      0.40       537
           6       0.36      0.70      0.48       702
           7       0.49      0.64      0.56      1079
           8       0.46      0.66      0.54       803
           9       0.41      0.65      0.50       483
          10       0.37      0.64      0.47       507
          11       0.31      0.68      0.42       478
          12       0.32      0.57      0.41       509
          13       0.27      0.66      0.38       355
          14       0.33      0.71      0.45       392
          15       0.28      0.64      0.39       441
          16       0.23    

<font size="3">Classifier Chain evaluation based on macro-averaged F1 score: 
  * CC with Logistic Regression: 0.47
  * CC with Random Forest: 0.27
  * CC with SVC: 0.50 
<br><br>
Logistic Regression and SVC again produce similar results, while Random Forest performs much worse.</font>
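
The subset accuracy printed in both evaluation loops is `1 - zero_one_loss`, which for multi-label targets counts a document as correct only when all 20 of its labels are predicted correctly. The toy example below (hypothetical labels, three labels instead of 20) re-implements that definition in plain Python to show how strict it is compared with per-label accuracy.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    total = sum(len(t) for t in y_true)
    correct = sum(a == b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return correct / total


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

print(subset_accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```

Two of the four toy documents have a single wrong label, so they count as fully wrong for subset accuracy even though most individual label decisions are correct.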

<h2>Part B</h2>

<h3>Dataset loading and preprocessing</h3>

<font size=3>Out of the 20 labels we isolate the most frequent one (based on the training set), which turns the multi-label problem into a binary classification problem.</font>

In [6]:
most_freq_class = np.argmax(np.sum(y_train_A, axis=0))

y_train_Β = y_train_A[:,most_freq_class]
y_test_Β = y_test_A[:,most_freq_class]

<font size=3>In this approach, each document is a bag of sentences. We create two dataframes (one for the training set and one for the test set) to better visualize the problem representation.</font>

In [7]:
doc_count = 0
index_count = 0
bag_of_sentences = {}

with open("data/train-data.dat", "r") as f:
    lines = f.readlines()
    for line, label in zip(lines, y_train_Β):
        # Split the document into sentences at each <n> marker
        line = re.split(r'<\d+>', line)
        for sentence in line:
            sentence = sentence.strip()
            
            if sentence:
                bag_of_sentences[index_count] = (doc_count, sentence.strip(), label)
                index_count += 1
                
        doc_count += 1
        
data_train_B_DF = pd.DataFrame.from_dict(bag_of_sentences, columns = ['Bag', 'Sentence', 'Target'], orient = 'index')

In [8]:
doc_count = 0
index_count = 0
bag_of_sentences = {}

with open("data/test-data.dat", "r") as f:
    lines = f.readlines()
    for line, label in zip(lines, y_test_Β):
        # Split the document into sentences at each <n> marker
        line = re.split(r'<\d+>', line)
        for sentence in line:
            sentence = sentence.strip()
            
            if sentence:
                bag_of_sentences[index_count] = (doc_count, sentence.strip(), label)
                index_count += 1
                
        doc_count += 1
        
data_test_B_DF = pd.DataFrame.from_dict(bag_of_sentences, columns = ['Bag', 'Sentence', 'Target'], orient = 'index')    
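
The parsing loop above can be illustrated on a single line. The sketch below assumes, as the parsing code does, that `<n>` markers separate the sentences of a document; the sample line and the helper name `split_into_sentences` are made up for illustration.

```python
import re


def split_into_sentences(line):
    """Split a raw document line on <n> markers and drop empty pieces."""
    pieces = re.split(r'<\d+>', line)
    return [piece.strip() for piece in pieces if piece.strip()]


# Hypothetical raw line: markers followed by word indices
raw_line = "<2> <4> 12 7 340 9 <3> 55 12 8\n"

# Build (bag id, sentence, label) rows like the dataframe in this section
doc_id, label = 0, 1
rows = [(doc_id, sentence, label) for sentence in split_into_sentences(raw_line)]

for row in rows:
    print(row)
```

Every sentence inherits the label of its document, which is why all rows of one bag share the same `Target` value.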

The head and tail of the two dataframes are shown below.

In [20]:
data_train_B_DF.head(10)

Unnamed: 0,Bag,Sentence,Target
0,0,6705 5997 8310 3606 674 8058 5044 4836,1
1,0,4312 5154 8310 4225,1
2,1,1827 1037 8482 483,1
3,1,3567 6172 6172 2892 1362 787 399 777 1332,1
4,1,318 769 4621 3199 1480 6213 971 6890,1
5,1,5909 15 3445 2475,1
6,1,324 4138 3404 6176,1
7,1,65 2926 1375 7705,1
8,1,709 1323 1652,1
9,1,5735 7439 3445 2475,1


In [22]:
data_train_B_DF.tail(10)

Unnamed: 0,Bag,Sentence,Target
149915,8248,3700 2415 6171 2374 4711 5280 5071 1319 6559,0
149916,8248,793 114 246 114 5071 5378 2738,0
149917,8249,7658 8174 3492 246 4015 764 327,1
149918,8249,2023 874 6309 235 7102 8132,1
149919,8249,568 2179 1620 4403 1035 6651 1035 7845,1
149920,8249,1386 384 4282 2229 5349 7139 5663 5742 4282,1
149921,8249,6008 1758 5682 2263 7699 4700,1
149922,8250,6072 1632 6587 2623 1178 6078 345 2651,0
149923,8250,1281 3041 2797 6144 2276 5149 4621 1890 2276 5506,0
149924,8250,8082 2514 5110 1319 5154 8334 2044 677,0


In [21]:
data_test_B_DF.head(10)

Unnamed: 0,Bag,Sentence,Target
0,0,5282 4641 3031 536 5366 1759,1
1,0,4855 1037 7752 2287 1090,1
2,0,1921 6213 3292 5750 6068 5648 1444,1
3,0,6157 1574 6955 2287 3816,1
4,0,5553 568 6955 5523 2793 4312 2033 4217 7593,1
5,0,6955 965 7553,1
6,0,7464 2651 8283 1426 5741 4032 740,1
7,0,470 2413 4767 6629 4551 7859 1007 6629,1
8,0,945 7553 4551 6955 568 7981,1
9,0,4386 6166 539 8115 6183 7440 1137 6 1137,1


In [23]:
data_test_B_DF.tail(10)

Unnamed: 0,Bag,Sentence,Target
73353,3980,682 674 3648 971 3664 980 1564 1551 8487,0
73354,3980,3031 7752 1914 3994 2833,0
73355,3980,8424 1145 1657 2975 7195,0
73356,3980,3292 3997 4812 345,0
73357,3981,2781 3368 3672 704 5667,0
73358,3981,5978 3031 4466 483 3405,0
73359,3981,4466 4081 4621 474 5970 1259,0
73360,3982,1209 6858 1137 4466,1
73361,3982,859 444 1037 859 444 8482 4466 8482,1
73362,3982,7679 2287 2568 3896 2035 728 5817,1


<font size = 3>We collect every instance (sentence) from the dataframes into a list, fit a TfidfVectorizer on the training sentences, and use it to transform both the training and the test sentences.</font>

In [11]:
# One entry per sentence (instance), in dataframe order
train_bags = data_train_B_DF['Sentence'].tolist()
test_bags = data_test_B_DF['Sentence'].tolist()
    
vectorizer = TfidfVectorizer()
train_bags_transformed = vectorizer.fit_transform(train_bags)
test_bags_transformed = vectorizer.transform(test_bags)

<font size=3>We use the silhouette method to determine the number of clusters: we compute the mean silhouette score (with cosine distance) for every k from 2 to 40 and keep the k with the highest score.</font>

In [12]:
sil_score = []
kmax = 40
x = train_bags_transformed

for k in range(2, kmax + 1):
    kmeans = KMeans(n_clusters = k, random_state = 1361).fit(x)
    labels = kmeans.labels_
    sil_score.append(silhouette_score(x, labels, metric = 'cosine'))
    
best_k = np.argmax(sil_score) + 2
print(best_k)

39
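
To make the selection criterion concrete: a point's silhouette is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The score is the mean over all points. The sketch below re-implements this in plain Python on four hypothetical 1-D points with absolute difference as the distance (the notebook itself uses cosine distance on TF-IDF vectors), so it illustrates the idea rather than reproducing `silhouette_score`.

```python
def silhouette(points, labels):
    """Mean silhouette of 1-D points, using |x - y| as the distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a singleton cluster contributes a silhouette of 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(point - other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(point - other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette(points, [0, 0, 1, 1])   # tight, well separated clusters
bad = silhouette(points, [0, 1, 0, 1])    # clusters that interleave

print(round(good, 3), round(bad, 3))
```

In the notebook, `np.argmax(sil_score) + 2` adds 2 because the first entry of `sil_score` corresponds to k = 2.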


<font size=3>We perform k-means clustering with k=39. Those 39 clusters will become our features for the next step</font>

In [13]:
model = KMeans(n_clusters = best_k, random_state = 1361)
model.fit(train_bags_transformed)

train_predictions = model.predict(train_bags_transformed)
test_predictions = model.predict(test_bags_transformed)

<font size=3>For every instance of every bag, we find the cluster it belongs to and add 1 to the corresponding feature of that bag. Each bag thus becomes a vector of cluster counts, one per cluster, and these vectors form our new training and test sets.</font>

In [14]:
X_train_B = np.zeros((8251, best_k), dtype=np.int32)
X_test_B = np.zeros((3983, best_k), dtype=np.int32)

train_bag_num = data_train_B_DF['Bag'].to_numpy()
test_bag_num = data_test_B_DF['Bag'].to_numpy()

for bag, prediction in zip(train_bag_num, train_predictions):
    X_train_B[bag, prediction] += 1
    
for bag, prediction in zip(test_bag_num, test_predictions):
    X_test_B[bag, prediction] += 1
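
A small worked example of this counting step, with hypothetical bag ids and cluster assignments (2 bags, 3 clusters) in place of the real arrays:

```python
# Hypothetical inputs: one entry per sentence
bag_ids = [0, 0, 1, 1, 1]       # which document each sentence belongs to
cluster_ids = [2, 2, 0, 1, 2]   # cluster assigned to each sentence
n_bags, n_clusters = 2, 3

# One row per bag, one column per cluster, filled with sentence counts
features = [[0] * n_clusters for _ in range(n_bags)]
for bag, cluster in zip(bag_ids, cluster_ids):
    features[bag][cluster] += 1

for row in features:
    print(row)
```

Each row sums to the number of sentences in that bag.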

<font size=3>We train a Random Forest classifier twice, once on the multi-instance features and once on the TF-IDF features from <b>Part A</b>, and evaluate both models on the test set using only the most frequent label (binary classification).</font>

In [26]:
clf = RandomForestClassifier(criterion = 'entropy', class_weight='balanced',  random_state = 1361)

print(f"Random Forest with k-means (multi-instance)\n")
clf.fit(X_train_B, y_train_Β)
y_pred = clf.predict(X_test_B)

print(classification_report(y_test_Β, y_pred))
print("----------------------------------------------------------------------------\n")

print(f"Random Forest\n")
clf.fit(X_train_A, y_train_Β)
y_pred = clf.predict(X_test_A)

print(classification_report(y_test_Β, y_pred))
print("----------------------------------------------------------------------------\n")

Random Forest with k-means (multi-instance)

              precision    recall  f1-score   support

           0       0.65      0.82      0.73      2425
           1       0.54      0.32      0.40      1558

    accuracy                           0.63      3983
   macro avg       0.59      0.57      0.56      3983
weighted avg       0.61      0.63      0.60      3983

----------------------------------------------------------------------------

Random Forest

              precision    recall  f1-score   support

           0       0.67      0.90      0.77      2425
           1       0.67      0.31      0.43      1558

    accuracy                           0.67      3983
   macro avg       0.67      0.61      0.60      3983
weighted avg       0.67      0.67      0.63      3983

----------------------------------------------------------------------------



<font size=3>Evaluation (precision, recall, and f1-score are macro averages): 
  * Multi-instance RF accuracy: 0.63
  * Multi-instance RF precision: 0.59
  * Multi-instance RF recall: 0.57
  * Multi-instance RF f1-score: 0.56
<br><br>
  * RF accuracy: 0.67
  * RF precision: 0.67
  * RF recall: 0.61
  * RF f1-score: 0.60  
<br><br>
<font size=4>On all four of these metrics the Random Forest classifier performed worse on the multi-instance data.</font></font>