ADVANCED TOPICS IN MACHINE LEARNING

Assignment - 2

Ιπποκράτης Κοτσάνης - 131

Φιλίτσα-Ιωάννα Κουσκουβέλη - 125



----------------------------------------------------------------------------------------------------------------------------------------------------------------
PART A

There are several techniques for handling multi-label classification:

1. Binary Relevance: This approach transforms the multi-label problem into multiple binary classification problems. Each label is treated as a separate binary classification task, and a separate classifier is trained for each label independently.

2. Classifier Chains: In this approach, a sequence of binary classifiers is created, where each classifier takes into account the predictions of the previous classifiers in the chain as additional features. The order of the classifiers in the chain can be defined randomly or based on the label dependencies.

3. Label Powerset: This technique transforms the multi-label problem into a multi-class problem, where each unique combination of labels represents a separate class. This approach requires training a classifier capable of multi-class classification, such as a Decision Tree or Random Forest.

4. Adapted Algorithm: Some machine learning algorithms, such as k-Nearest Neighbors (k-NN) or Support Vector Machines (SVM), can be directly extended to handle multi-label classification by modifying their original formulations.
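
As a small illustration of the Label Powerset transformation described in item 3, the sketch below maps each distinct label combination to a single class id and back. The helper names (`powerset_encode`, `powerset_decode`) and the toy label matrix are hypothetical and only meant to show the idea; any multi-class classifier could then be trained on the resulting class ids.

```python
# Label Powerset sketch: every distinct label combination becomes one class.
# The helper names and the toy label matrix are illustrative assumptions.

def powerset_encode(Y):
    """Return (class_ids, combos): a class id per row and the id -> combination table."""
    combo_to_id = {}
    class_ids = []
    for row in Y:
        key = tuple(row)
        if key not in combo_to_id:
            combo_to_id[key] = len(combo_to_id)  # next unused class id
        class_ids.append(combo_to_id[key])
    combos = [None] * len(combo_to_id)
    for key, idx in combo_to_id.items():
        combos[idx] = list(key)
    return class_ids, combos


def powerset_decode(class_ids, combos):
    """Turn (predicted) class ids back into label rows."""
    return [combos[c] for c in class_ids]


Y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
class_ids, combos = powerset_encode(Y)
print(class_ids)  # [0, 1, 0, 2]
print(powerset_decode(class_ids, combos) == Y)  # True
```

The number of classes grows with the number of distinct combinations seen in training, and a combination that never appears in the training data can never be predicted.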
-----------------------------------------------------------------------


*  The MultiOutputClassifier is used to wrap a base classifier to create a multi-label classifier using the Binary Relevance approach.

*  Classifier Chains extend Binary Relevance by linking the per-label classifiers in a sequence. Each classifier in the chain predicts one specific label, using the original features plus the labels handled by the preceding classifiers as additional inputs. In scikit-learn's ClassifierChain, the default order (order=None) follows the column order of the label matrix.

*  Note that Classifier Chains can be computationally expensive for a large number of labels: one classifier is trained per label, and because each link needs the outputs of the earlier links at prediction time, the chain has to be evaluated sequentially.
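
To make the chaining concrete, the following sketch builds the inputs that each link of a chain would be trained on. It is not scikit-learn's implementation; the function name `chain_training_inputs` and the toy data are assumptions. It mirrors the default behaviour of ClassifierChain (cv=None), where the true labels of earlier links are appended to the features during fitting, while predicted labels are used at prediction time.

```python
# Sketch of how a classifier chain augments the features for each link.
# `chain_training_inputs` and the toy data are hypothetical, for illustration only.

def chain_training_inputs(X, Y, order=None):
    """For each chain position return (label_index, augmented_rows).

    augmented_rows[i] is X[i] followed by the true values of the labels
    that come earlier in the chain.
    """
    n_labels = len(Y[0])
    if order is None:
        order = list(range(n_labels))  # default: column order of Y
    links = []
    for pos, label in enumerate(order):
        previous = order[:pos]
        rows = [list(x) + [y[k] for k in previous] for x, y in zip(X, Y)]
        links.append((label, rows))
    return links


X = [[0.2, 0.7], [0.9, 0.1]]
Y = [[1, 0, 1], [0, 1, 1]]
for label, rows in chain_training_inputs(X, Y):
    print(label, rows)
# 0 [[0.2, 0.7], [0.9, 0.1]]
# 1 [[0.2, 0.7, 1], [0.9, 0.1, 0]]
# 2 [[0.2, 0.7, 1, 0], [0.9, 0.1, 0, 1]]
```

The last link sees two extra columns, which is how label dependencies can enter the model; a different `order` changes which labels each link can condition on.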

In [None]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, zero_one_loss
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def txts_to_boW(fileName):
    """
    Read the data file `fileName`, in which each line corresponds to one text.
    Each text comprises a variable number of sentences, and each sentence a
    variable number of words (numeric tokens), introduced by a `<number>` tag.

    Return a list of strings, one per text, holding the words of all its
    sentences separated by spaces.
    """

    # Read data from file
    with open(fileName, 'r') as f:
        lines = f.readlines()

    dataset = []
    # Capture the run of digits and whitespace that follows each `<number>` tag
    pattern = r'<\d+>\s([\d\s]+)'
    for line in lines:
        txt_in_sentences = re.findall(pattern, line)
        # Concatenate the sentences; each capture keeps the whitespace before the next tag
        txt = ''.join(txt_in_sentences)
        dataset.append(txt.rstrip())  # strip trailing whitespace (including the newline)
    return dataset


def get_clfs():
    # get a list of models to evaluate
    clfs_dict = dict()

    # Logistic Regression Classifier with MultiOutputClassifier
    lr_params = {'penalty': ['l2'], 'C': [0.1, 1, 10, 100]}
    lr_grid = GridSearchCV(LogisticRegression(max_iter=10000), lr_params, cv=5)
    lr_clf = MultiOutputClassifier(lr_grid)
    clfs_dict['lrMOC'] = lr_clf

    # Logistic Regression Classifier with ClassifierChain
    lr_params = {'penalty': ['l2'], 'C': [0.1, 1, 10, 100]}
    lr_grid = GridSearchCV(LogisticRegression(max_iter=10000), lr_params, cv=5)
    lr_clf = ClassifierChain(lr_grid)
    clfs_dict['lrCC'] = lr_clf
    # --------------------------------------------------------------------------
    # Linear SVM Classifier with MultiOutputClassifier
    lsvm_params = {'C': [0.1, 1, 10, 100]}
    lsvm_grid = GridSearchCV(LinearSVC(max_iter=100000), lsvm_params, cv=5)
    lsvm_clf = MultiOutputClassifier(lsvm_grid)
    clfs_dict['lsvcMOC'] = lsvm_clf

    # Linear SVM Classifier with ClassifierChain
    lsvm_params = {'C': [0.1, 1, 10, 100]}
    lsvm_grid = GridSearchCV(LinearSVC(max_iter=100000), lsvm_params, cv=5)
    lsvm_clf = ClassifierChain(lsvm_grid)
    clfs_dict['lsvcCC'] = lsvm_clf

    return clfs_dict


def apply_clf_models(clfs_dict, X_train, y_train, X_test, y_test):
    # Helper kept for reference; main() runs the same steps through a Pipeline.
    for clf_name, clf in clfs_dict.items():
        # Wrapped and adapted classifiers share the same fit/predict interface
        clf.fit(X_train, y_train)
        pred_labels = clf.predict(X_test)

        print('\nEvaluating {}'.format(clf_name))
        print(classification_report(y_test, pred_labels))
        print('Subset accuracy = {}'.format((1 - zero_one_loss(y_test, pred_labels))))


def main():
    x_train_fileName, x_test_fileName = 'train-data.dat', 'test-data.dat'  # X
    y_train_fileName, y_test_fileName = 'train-label.dat', 'test-label.dat'  # y

    vectorizer = TfidfVectorizer()

    X_train = txts_to_boW(x_train_fileName)
    X_test = txts_to_boW(x_test_fileName)

    # Create the pipeline
    pipeline = Pipeline([
        ('vectorizer', vectorizer),
        ('clf', None)  # Placeholder for the classifier
    ])

    # Read y_train data
    df_y_train = pd.read_csv(y_train_fileName, sep=r"\s+", header=None)
    # print(df_y_train)

    # Read y_test data
    df_y_test = pd.read_csv(y_test_fileName, sep=r"\s+", header=None)
    # print(df_y_test)

    clfs_dict = get_clfs()

    for clf_name, clf in clfs_dict.items():
        pipeline.set_params(clf=clf)

        # Fit the pipeline on training data
        pipeline.fit(X_train, df_y_train)

        # Predict on the test data
        pred_labels = pipeline.predict(X_test)

        print('\nEvaluating {}'.format(clf_name))
        print(classification_report(df_y_test, pred_labels))
        print('Subset accuracy = {}'.format((1 - zero_one_loss(df_y_test, pred_labels))))


if __name__ == "__main__":
    main()


Evaluating lrMOC
              precision    recall  f1-score   support

           0       0.79      0.65      0.71       977
           1       0.78      0.30      0.44       228
           2       0.66      0.40      0.49      1558
           3       0.78      0.52      0.62       372
           4       0.71      0.37      0.49      1050
           5       0.63      0.11      0.19       537
           6       0.59      0.12      0.19       702
           7       0.78      0.33      0.46      1079
           8       0.77      0.24      0.36       803
           9       0.70      0.39      0.50       483
          10       0.67      0.35      0.46       507
          11       0.65      0.28      0.39       478
          12       0.71      0.11      0.19       509
          13       0.65      0.28      0.40       355
          14       0.73      0.38      0.50       392
          15       0.55      0.23      0.33       441
          16       0.63      0.25      0.36       269
         




Evaluating lrCC
              precision    recall  f1-score   support

           0       0.79      0.65      0.71       977
           1       0.77      0.30      0.43       228
           2       0.65      0.40      0.49      1558
           3       0.76      0.52      0.62       372
           4       0.70      0.39      0.50      1050
           5       0.50      0.14      0.22       537
           6       0.62      0.12      0.20       702
           7       0.75      0.29      0.42      1079
           8       0.71      0.30      0.42       803
           9       0.70      0.38      0.49       483
          10       0.73      0.21      0.33       507
          11       0.72      0.25      0.37       478
          12       0.61      0.10      0.18       509
          13       0.63      0.26      0.37       355
          14       0.68      0.40      0.50       392
          15       0.59      0.21      0.31       441
          16       0.62      0.24      0.34       269
          




Evaluating lsvcMOC
              precision    recall  f1-score   support

           0       0.85      0.58      0.69       977
           1       0.78      0.32      0.45       228
           2       0.66      0.38      0.49      1558
           3       0.79      0.52      0.63       372
           4       0.72      0.36      0.48      1050
           5       0.67      0.09      0.15       537
           6       0.61      0.09      0.15       702
           7       0.80      0.31      0.45      1079
           8       0.79      0.23      0.35       803
           9       0.70      0.40      0.51       483
          10       0.64      0.36      0.46       507
          11       0.62      0.28      0.38       478
          12       0.77      0.08      0.15       509
          13       0.64      0.29      0.40       355
          14       0.70      0.38      0.49       392
          15       0.54      0.24      0.33       441
          16       0.59      0.27      0.37       269
       




Evaluating lsvcCC
              precision    recall  f1-score   support

           0       0.85      0.58      0.69       977
           1       0.77      0.32      0.45       228
           2       0.67      0.37      0.48      1558
           3       0.76      0.53      0.63       372
           4       0.72      0.36      0.48      1050
           5       0.55      0.12      0.20       537
           6       0.65      0.09      0.15       702
           7       0.77      0.27      0.40      1079
           8       0.73      0.26      0.38       803
           9       0.69      0.39      0.50       483
          10       0.79      0.21      0.33       507
          11       0.66      0.25      0.37       478
          12       0.68      0.08      0.14       509
          13       0.70      0.22      0.33       355
          14       0.66      0.39      0.49       392
          15       0.59      0.21      0.31       441
          16       0.63      0.25      0.36       269
        



**General - Comments:**

The use of MultiOutputClassifier and ClassifierChain allows us to extend single-label classifiers to handle multi-label classification problems. MultiOutputClassifier treats each label independently and trains separate classifiers for each label. ClassifierChain, on the other hand, takes into account the correlation between labels by incorporating the predictions of previous classifiers as additional features for subsequent classifiers.

By using GridSearchCV, we are performing a grid search to find the best combination of hyperparameters for each model. This can help optimize the performance of the models by selecting the most suitable hyperparameters.
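
Because the GridSearchCV object is placed inside the multi-label wrapper, a separate search runs for every label, so the selected hyperparameters can differ from label to label. The sketch below shows one way to inspect them through the wrapper's `estimators_` attribute. The synthetic data, the reduced grid, and the variable names are assumptions made for illustration and are unrelated to the assignment's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data (assumption): label j is on when feature j plus noise is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
Y = (X[:, :3] + 0.5 * rng.normal(size=(120, 3)) > 0).astype(int)

# One grid search per label, as in get_clfs(), but with a smaller grid
grid = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}, cv=3)
clf = MultiOutputClassifier(grid).fit(X, Y)

# Each fitted estimator is its own GridSearchCV with its own best_params_
best_c = [est.best_params_['C'] for est in clf.estimators_]
print(best_c)
```

A fitted ClassifierChain exposes its per-label estimators through `estimators_` as well, so the same inspection applies to the chained models.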

**Metrics - Comments:**

Regarding the four models, we can conclude that:
*    MultiOutputClassifier with Logistic Regression:
        1. Precision: The average precision across all classes is 0.70. It ranges from 0.55 to 0.82 for individual classes. Precision measures the proportion of true positive predictions among the total predicted positives.
        2. Recall: The average recall across all classes is 0.31. It ranges from 0.11 to 0.65 for individual classes. Recall measures the proportion of true positive predictions among the actual positives.
        3. F1-score: The average F1-score across all classes is 0.42. It ranges from 0.15 to 0.71 for individual classes. The F1-score combines precision and recall into a single metric, providing a balance between the two.
        4. Support: The number of samples in each class.
        5. Micro avg: The micro-averaged metrics pool the true positives, false positives, and false negatives of all classes and compute each metric once from those totals, so frequent classes weigh more. The micro-averaged precision, recall, and F1-score are 0.71, 0.33, and 0.45, respectively.
        6. Macro avg: The macro-averaged metrics calculate the metrics independently for each class and then take the average. The macro-averaged precision, recall, and F1-score are 0.70, 0.31, and 0.42, respectively.
        7. Weighted avg: The weighted-averaged metrics calculate the metrics for each class, weighted by the number of samples in each class, and then take the average. The weighted-averaged precision, recall, and F1-score are 0.70, 0.33, and 0.44, respectively.
        8. Samples avg: The sample-based metrics calculate metrics for each instance and then take the average. The sample-based precision, recall, and F1-score are 0.49, 0.33, and 0.36, respectively.
        9. Subset accuracy: The subset accuracy is the ratio of samples where all labels are correctly predicted to the total number of samples. It is 0.0999.

*    ClassifierChain with Logistic Regression:

        The evaluation metrics are close to those of the MultiOutputClassifier with Logistic Regression, with slightly lower performance on several classes (for example, recall on class 10 drops from 0.35 to 0.21).

*    MultiOutputClassifier with LinearSVC:

        The precision, recall, and F1-score values are similar to the MultiOutputClassifier with Logistic Regression; on class 0, for example, precision rises from 0.79 to 0.85 while recall falls from 0.65 to 0.58.

*    ClassifierChain with LinearSVC:

        The evaluation metrics are also similar to the MultiOutputClassifier with Logistic Regression and ClassifierChain with Logistic Regression.


Overall, these results indicate that the classifiers struggle with this multi-label classification task. Precision is moderate (about 0.70 macro-averaged for the MultiOutputClassifier with Logistic Regression), but recall is low (about 0.31), so the models miss many of the true labels and the F1-scores stay low. The subset accuracy of roughly 0.10 shows that the complete label set is predicted exactly for only about one sample in ten.
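
The difference between macro averaging, micro averaging, and subset accuracy discussed above can be checked on a tiny hand-made example. The sketch below is plain Python with made-up labels and predictions (not the assignment's data), and the function names are hypothetical.

```python
# Toy check of macro vs. micro recall and subset accuracy.
# Data and function names are illustrative assumptions.

def per_label_counts(Y_true, Y_pred):
    """Return a (tp, fp, fn) tuple for every label column."""
    counts = []
    for j in range(len(Y_true[0])):
        tp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def macro_micro_recall(Y_true, Y_pred):
    counts = per_label_counts(Y_true, Y_pred)
    # Macro: average the per-label recalls, every label counts equally
    macro = sum(recall(tp, fn) for tp, _, fn in counts) / len(counts)
    # Micro: pool the counts first, so frequent labels dominate
    micro = recall(sum(tp for tp, _, _ in counts), sum(fn for _, _, fn in counts))
    return macro, micro


def subset_accuracy(Y_true, Y_pred):
    """Fraction of samples whose whole label row is predicted exactly."""
    return sum(1 for t, p in zip(Y_true, Y_pred) if t == p) / len(Y_true)


Y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
Y_pred = [[1, 0], [1, 0], [0, 0], [1, 0]]

print(macro_micro_recall(Y_true, Y_pred))  # (0.375, 0.6)
print(subset_accuracy(Y_true, Y_pred))     # 0.5
```

The rare second label is never predicted, which pulls the macro recall down to 0.375, while the micro recall stays at 0.6 because the frequent first label dominates the pooled counts.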


-------------------------------------------------------------------------------
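To make the assignment more complete, we also evaluated two additional models: Gradient Boosting wrapped in MultiOutputClassifier, and a Random Forest used as an adapted algorithm, since scikit-learn's RandomForestClassifier accepts a multi-label target directly.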

In [None]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, zero_one_loss
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def txts_to_boW(fileName):
    """
    Read the data file `fileName`, in which each line corresponds to one text.
    Each text comprises a variable number of sentences, and each sentence a
    variable number of words (numeric tokens), introduced by a `<number>` tag.

    Return a list of strings, one per text, holding the words of all its
    sentences separated by spaces.
    """

    # Read data from file
    with open(fileName, 'r') as f:
        lines = f.readlines()

    dataset = []
    # Capture the run of digits and whitespace that follows each `<number>` tag
    pattern = r'<\d+>\s([\d\s]+)'
    for line in lines:
        txt_in_sentences = re.findall(pattern, line)
        # Concatenate the sentences; each capture keeps the whitespace before the next tag
        txt = ''.join(txt_in_sentences)
        dataset.append(txt.rstrip())  # strip trailing whitespace (including the newline)
    return dataset


def get_clfs():
    # get a list of models to evaluate
    clfs_dict = dict()

    # Gradient Boosting with MultiOutputClassifier
    gb = GradientBoostingClassifier()
    gb_clf = MultiOutputClassifier(gb)
    clfs_dict['gbMOC'] = gb_clf

    # Adapted algorithms
    adapted_rf = RandomForestClassifier()
    clfs_dict['adaptedRF'] = adapted_rf

    return clfs_dict


def apply_clf_models(clfs_dict, X_train, y_train, X_test, y_test):
    # Helper kept for reference; main() runs the same steps through a Pipeline.
    for clf_name, clf in clfs_dict.items():
        # Wrapped and adapted classifiers share the same fit/predict interface
        clf.fit(X_train, y_train)
        pred_labels = clf.predict(X_test)

        print('\nEvaluating {}'.format(clf_name))
        print(classification_report(y_test, pred_labels))
        print('Subset accuracy = {}'.format((1 - zero_one_loss(y_test, pred_labels))))


def main():
    x_train_fileName, x_test_fileName = 'train-data.dat', 'test-data.dat'  # X
    y_train_fileName, y_test_fileName = 'train-label.dat', 'test-label.dat'  # y

    vectorizer = TfidfVectorizer()

    X_train = txts_to_boW(x_train_fileName)
    X_test = txts_to_boW(x_test_fileName)

    # Create the pipeline
    pipeline = Pipeline([
        ('vectorizer', vectorizer),
        ('clf', None)  # Placeholder for the classifier
    ])

    # Read y_train data
    df_y_train = pd.read_csv(y_train_fileName, sep=r"\s+", header=None)
    # print(df_y_train)

    # Read y_test data
    df_y_test = pd.read_csv(y_test_fileName, sep=r"\s+", header=None)
    # print(df_y_test)

    clfs_dict = get_clfs()

    for clf_name, clf in clfs_dict.items():
        pipeline.set_params(clf=clf)

        # Fit the pipeline on training data
        pipeline.fit(X_train, df_y_train)

        # Predict on the test data
        pred_labels = pipeline.predict(X_test)

        print('\nEvaluating {}'.format(clf_name))
        print(classification_report(df_y_test, pred_labels))
        print('Subset accuracy = {}'.format((1 - zero_one_loss(df_y_test, pred_labels))))


if __name__ == "__main__":
    main()


Evaluating gbMOC
              precision    recall  f1-score   support

           0       0.80      0.53      0.64       977
           1       0.68      0.25      0.37       228
           2       0.64      0.25      0.36      1558
           3       0.77      0.50      0.61       372
           4       0.71      0.33      0.45      1050
           5       0.57      0.09      0.16       537
           6       0.64      0.13      0.21       702
           7       0.74      0.27      0.40      1079
           8       0.72      0.24      0.36       803
           9       0.75      0.27      0.40       483
          10       0.68      0.35      0.47       507
          11       0.68      0.15      0.25       478
          12       0.64      0.11      0.18       509
          13       0.72      0.19      0.31       355
          14       0.80      0.28      0.41       392
          15       0.66      0.11      0.19       441
          16       0.57      0.10      0.17       269
         




Evaluating adaptedRF
              precision    recall  f1-score   support

           0       0.87      0.41      0.55       977
           1       0.94      0.07      0.13       228
           2       0.71      0.11      0.20      1558
           3       0.96      0.13      0.24       372
           4       0.82      0.12      0.20      1050
           5       0.70      0.01      0.03       537
           6       0.70      0.02      0.04       702
           7       0.74      0.07      0.13      1079
           8       0.71      0.06      0.11       803
           9       0.90      0.06      0.11       483
          10       0.84      0.05      0.10       507
          11       0.70      0.03      0.06       478
          12       0.43      0.01      0.01       509
          13       0.91      0.06      0.11       355
          14       0.95      0.05      0.09       392
          15       0.42      0.01      0.02       441
          16       0.67      0.01      0.03       269
     



**Metrics - Comments:**

*    Evaluating gbMOC:

        1. Precision, recall, and F1-score: Precision is the proportion of predicted positive labels that are correct, recall is the proportion of actual positive labels that are retrieved, and the F1-score is the harmonic mean of the two. Among the classes shown in the report, precision is moderate (0.57 to 0.80) but recall is low (0.09 to 0.53), which keeps the F1-scores low. Classes 0 and 3 score best (F1 of 0.64 and 0.61), while classes 5, 12, 15, and 16 score worst (F1 below 0.20).
        2. Support: The number of samples in each class.
        3. Micro avg, macro avg, weighted avg, samples avg: These are aggregate metrics calculated across all classes. Micro avg calculates the metrics globally by counting the total true positives, false negatives, and false positives, while macro avg calculates the metrics independently for each class and then takes the average. Weighted avg weights each class by its support (number of samples) when averaging. Samples avg calculates the metrics for each sample and then averages them.
        4. Subset accuracy: Subset accuracy measures the proportion of samples where all the labels are predicted correctly. In this case, the subset accuracy is quite low at 0.088, indicating that the model struggles to predict the full label set of a sample.

*    Evaluating adaptedRF:

        1. Precision, recall, and F1-score: Precision is high for many of the classes shown (up to 0.96, although it falls to 0.42 for class 15), but recall is very low: apart from class 0 (0.41), no class exceeds 0.13. As a result the F1-scores are low; class 0 reaches 0.55, every other class shown stays at or below 0.24, and classes 5, 12, and 15 are close to zero.
        2. Support: The number of samples in each class.
        3. Micro avg, macro avg, weighted avg, samples avg: These aggregate metrics are calculated in the same way as for gbMOC.
        4. Subset accuracy: The subset accuracy is even lower at 0.064, indicating poorer performance in predicting the full label set of a sample.
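
Since the F1-score is the harmonic mean of precision and recall, a high precision cannot compensate for a very low recall. The sketch below recomputes the F1-score from the two-decimal precision and recall values of a few rows copied from the gbMOC and adaptedRF reports above; the function name `f1_from` is hypothetical, and small differences from the reported values are expected because the inputs are already rounded.

```python
# Recompute F1 from rounded precision/recall values taken from the reports above.
# Differences of about 0.01 are expected, since the inputs are rounded.

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1)
rows = {
    'gbMOC class 0': (0.80, 0.53, 0.64),
    'adaptedRF class 0': (0.87, 0.41, 0.55),
    'adaptedRF class 1': (0.94, 0.07, 0.13),
}

for name, (p, r, reported) in rows.items():
    print('{}: computed {:.3f}, reported {:.2f}'.format(name, f1_from(p, r), reported))
```

For adaptedRF class 1, a precision of 0.94 combined with a recall of 0.07 still yields an F1-score of only about 0.13.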



Overall, both algorithms, gbMOC and adaptedRF, have relatively low performance in predicting multiple labels. 

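**Compared with the linear models above, we can conclude that the linear models perform better: they reach higher recall and F1-scores on most of the classes shown, and the MultiOutputClassifier with Logistic Regression has a higher subset accuracy (0.0999) than gbMOC (0.088) and adaptedRF (0.064).**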
