![alt text](https://www.auth.gr/sites/default/files/banner-horizontal-282x100.png)
# Advanced Topics in Machine Learning - Assignment 2 - Part B


## Multi Instance Learning

In this part we tackle a multi-instance learning problem using the "Delicious" dataset (DeliciousMIL) from the "MLTM" repository listed below.

The approach is to treat each line of the dataset as a bag of instances and to transform the data into a standard supervised classification problem: we first cluster all instances of all bags with the K-Means algorithm, and then use the cluster assignments as the feature set for an SVM classifier.

#### Useful library documentation, references, and resources used in this assignment:

* DeliciousMIL dataset: <https://github.com/hsoleimani/MLTM/tree/master/Data/Delicious>
* scikit-learn ML library (aka *sklearn*): <http://scikit-learn.org/stable/documentation.html>
* K-Means clustering: <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html>
* SVM classifier: <https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html>


# 0. __Install packages - Import necessary libraries__

In [0]:
import warnings
warnings.filterwarnings("ignore")
import os
import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics, cluster
from sklearn import svm
from pandas import DataFrame

# 1. __Set desired configuration parameters__

### Define the run configuration parameters

In [0]:
# Whether it should download the required dataset from github or not.
# If set to True, the DATA_PATH must be "MLTM/Data/Delicious/"
SHOULD_DOWNLOAD_DATASET = False

# If the dataset is downloaded from the internet, 
# the DATA_PATH must be "MLTM/Data/Delicious/"
# Otherwise it can point to any other local folder that contains the data sets.
#DATA_PATH = "MLTM/Data/Delicious/"
DATA_PATH = "raw_data"

# The number of clusters to use in the K-Means algorithm.
N_CLUSTERS = 25

# The number of documents to keep from the training data.
# If set to -1, the whole set is used.
N_TRAIN_PORTION = 1000

# The number of documents to keep from the test data.
# If set to -1, the whole set is used.
N_TEST_PORTION = 1000

# 2. __Define the required dataset__
We will use the *DeliciousMIL* dataset from its GitHub repository that is mentioned above. The dataset consists of 4 separate data files and is optionally downloaded from GitHub based on the configuration parameters above.

### Download the datasets from the Internet (optional)

In [0]:
if SHOULD_DOWNLOAD_DATASET:
  !git clone https://github.com/hsoleimani/MLTM.git

### Define the paths to the target files that contain the train and test data and labels

In [0]:
train_data_filename = os.path.join(DATA_PATH, 'train-data.dat')
train_labels_filename = os.path.join(DATA_PATH, 'train-label.dat')
test_data_filename = os.path.join(DATA_PATH, 'test-data.dat')
test_labels_filename = os.path.join(DATA_PATH, 'test-label.dat')

# 3. Function definitions

In [None]:
def preprocess_data_file(filename):
    """
    Preprocess a file that contains data so that each line becomes a plain list of instances.

    :param filename: The path to the target file.
    :return: an array that contains the cleaned version of the target file.
    """
    clean_file = []
    # Use a context manager so the file handle is closed after reading
    with open(filename) as raw_file:
        for line in raw_file:
            # Remove the first two <##> entries
            line = re.sub(r'<[0-9]+>', '', line, count=2).strip()
            # Split the rest of the line by the remaining <##> entries
            line = [x.strip() for x in re.split(r'<[0-9]+>', line)]
            clean_file.append(line)

    return clean_file

def preprocess_labels_file(filename):
    """
    Preprocess a file that contains labels for many classes in order to keep only those that correspond
    to the most frequent class.

    :param filename: The path to the target file.
    :return: a list that contains the labels of the most frequent class.
    """
    data_frame = pd.read_csv(filename, delimiter=' ', header=None)
    # Find the column with the largest sum (the most frequent class) and keep only that one.
    labels_of_top_class = data_frame[data_frame.sum(axis=0).idxmax()]

    return list(labels_of_top_class)

def get_portion(size, array_a, array_b):
    """
    Extracts only a randomly selected portion of specified size from the two input arrays.

    :param size: The size of the portion to extract. If the number -1 is passed, then the two arrays are not manipulated
    and the whole data for each of them is returned.
    :param array_a: The contents of the first array. It is expected to represent the contents of the data file.
    :param array_b: The contents of the second array. It is expected to represent the contents of the labels file.
    :return: two arrays where each one corresponds to the extracted portion of the relevant input arrays.
    """
    # Compare by value; "is" tests object identity and is unreliable for integers
    if size == -1:
        return (array_a, array_b)

    # Randomly select (without replacement) the indices of samples to keep
    keep_idx = sorted(np.random.choice(len(array_a), size=size, replace=False))

    array_a_portion = [array_a[x] for x in keep_idx]
    array_b_portion = [array_b[x] for x in keep_idx]

    return (array_a_portion, array_b_portion)

def create_dataframe(bags, classes):
    """
    Creates a pandas DataFrame from an array where each line is considered to contain a bag of instances.

    :param bags: Array which has m lines, that are supposed to be the bags and each line has n arrays which
    are supposed to be the instances.
    :param classes: Array which contains the target class for each m line of the bags array.
    :return: A pandas DataFrame that contains the columns: [<bagIndex>, <instance>, <class>].
    """
    data = []

    for bagIndex, bag in enumerate(bags):
        for instance in bag:
            # One row per instance; avoid reusing the name "bag" for the row
            row = {'bagIndex': bagIndex, 'instance': instance, 'class': classes[bagIndex]}
            data.append(row)

    return DataFrame(data)

def calculate_clusters(instances):
    """
    Calculates the clusters of the input array by first computing the TF-IDF vector of each line and
    then uses K-Means algorithm by leveraging that vector.

    :param instances: The array of data to perform clustering against.
    :return: An array that contains the cluster that each line of the input array has been assigned to.
    """
    vectorizer = TfidfVectorizer()

    vectorized_data = vectorizer.fit_transform(instances)
    model = cluster.KMeans(n_clusters=N_CLUSTERS, random_state=0)
    clusters = model.fit_predict(vectorized_data)

    return clusters

def add_clusters_to_dataframe(clusters, target_data_frame):
    """
    Adds a new column to the target pandas DataFrame that contains the assigned cluster of each line.

    :param clusters: The calculated clusters.
    :param target_data_frame: The target pandas DataFrame to append the new column.
    """
    target_data_frame['cluster'] = pd.Series(clusters, index=target_data_frame.index)

def get_features_from_cluster(cluster_index):
    """
    Creates an array of zeros of size N_CLUSTERS and assigns the value 1 only to the index that is specified
    in the incoming parameter.

    :param cluster_index: The index of the array to set to 1. It is supposed to reflect the cluster index that an
    instance is assigned to.
    :return: The array like this example: [0, 1, 0, 0]
    """
    features = [0] * N_CLUSTERS
    features[cluster_index] = 1

    return features

def add_features_to_dataframe(target_data_frame):
    """
    Adds a new column to the target pandas DataFrame that contains the features of this line.

    :param target_data_frame: The target pandas DataFrame to append the new column.
    """
    target_data_frame['features'] = target_data_frame.apply(lambda row: get_features_from_cluster(row.cluster), axis=1)
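
To make the parsing step in `preprocess_data_file` concrete, here is a minimal sketch on a single made-up line. The sample line and the helper name `parse_bag_line` are assumptions for illustration; the layout (two leading `<##>` entries followed by `<##>`-separated instances) is inferred from the comments in that function, not from the dataset documentation.

```python
import re

TAG = re.compile(r'<[0-9]+>')

def parse_bag_line(line):
    """Split one raw line into a list of instance strings (hypothetical helper)."""
    # Drop the first two <##> entries
    body = TAG.sub('', line, count=2).strip()
    # Each remaining <##> entry separates two instances
    return [part.strip() for part in TAG.split(body)]

# Made-up sample line, not copied from the dataset
sample_line = "<3> <4> 12 7 99 3 <2> 45 7 <3> 8 8 101\n"
bag = parse_bag_line(sample_line)
print(bag)  # ['12 7 99 3', '45 7', '8 8 101']
```

Each string in the resulting list is one instance of the bag, which is what `create_dataframe` later expands into one row per instance.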

# 4. Execute the required operations to perform the classification


### Preprocess the raw data in order to create the necessary train and test data sets.

In [0]:
X_train = preprocess_data_file(train_data_filename)
y_train = preprocess_labels_file(train_labels_filename)
X_test = preprocess_data_file(test_data_filename)
y_test = preprocess_labels_file(test_labels_filename)

### To speed up the procedure, keep only a portion of the data sets.

In [0]:
X_train_portion, y_train_portion = get_portion(N_TRAIN_PORTION, X_train, y_train)
X_test_portion, y_test_portion = get_portion(N_TEST_PORTION, X_test, y_test)
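
`get_portion` draws a different random subset on every run, so the scores can vary between runs. If repeatable numbers are wanted, one option is to sample with a fixed seed. The sketch below uses only the standard library and made-up data; the helper name `sample_aligned` and the seed value are assumptions, not part of the notebook.

```python
import random

def sample_aligned(size, data, labels, seed=0):
    """Pick the same random positions from two parallel lists (hypothetical helper)."""
    if size == -1:
        return data, labels
    rng = random.Random(seed)  # fixed seed -> same subset on every run
    keep_idx = sorted(rng.sample(range(len(data)), size))
    return [data[i] for i in keep_idx], [labels[i] for i in keep_idx]

# Made-up bags and labels
bags = [["a b"], ["c"], ["d e", "f"], ["g"], ["h i"]]
labels = [1, 0, 0, 1, 0]
bags_small, labels_small = sample_aligned(3, bags, labels)
```

Because both lists are indexed with the same positions, every kept bag stays paired with its own label.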

### Create a pandas DataFrame from the train data set that contains the calculated clusters as well as their conversion to features.

In [0]:
train_dataframe = create_dataframe(X_train_portion, y_train_portion)
add_clusters_to_dataframe(calculate_clusters(train_dataframe.instance), train_dataframe)
add_features_to_dataframe(train_dataframe)

### Create a pandas DataFrame from the test data set that contains the calculated clusters as well as their conversion to features.

In [0]:
test_dataframe = create_dataframe(X_test_portion, y_test_portion)
add_clusters_to_dataframe(calculate_clusters(test_dataframe.instance), test_dataframe)
add_features_to_dataframe(test_dataframe)
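
The cell above fits a fresh TF-IDF vocabulary and a fresh K-Means model on the test instances, so cluster index *k* in the test set need not describe the same group as cluster index *k* in the train set. One possible refinement, shown here as a sketch on made-up instances, is to fit both models on the train instances only and reuse them for the test instances. The variable names and toy strings are assumptions; `fit_transform`, `transform`, `fit_predict` and `predict` are the standard scikit-learn methods.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up instances standing in for train_dataframe.instance / test_dataframe.instance
train_instances = ["10 11 12", "10 11 13", "20 21 22", "20 21 23"]
test_instances = ["10 12 13", "21 22 23"]

# Fit the vocabulary and the centroids on the train instances only
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_instances)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
train_clusters = kmeans.fit_predict(train_vectors)

# Reuse the same vocabulary and centroids, so cluster ids keep their meaning
test_vectors = vectorizer.transform(test_instances)
test_clusters = kmeans.predict(test_vectors)
```

With this arrangement the one-hot features built from `test_clusters` index the same clusters that the classifier saw during training.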

### Create an SVM classifier.

In [0]:
svmModel = svm.SVC(C=0.1, kernel='poly', degree=2)
svmModel.fit(train_dataframe.features.tolist(), train_dataframe['class'])
y_predicted = svmModel.predict(test_dataframe.features.tolist())
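
Note that the classifier above is trained on one one-hot row per instance, so it predicts a label for each instance rather than for each bag. A common bag-level alternative, sketched here on made-up values, is to sum the one-hot rows of each bag into a cluster histogram and train on one row per bag. The function name `bag_histograms` and the toy inputs are assumptions, and this is not what the notebook does.

```python
from collections import defaultdict

def bag_histograms(bag_indices, cluster_ids, n_clusters):
    """Count, for every bag, how many of its instances fall in each cluster."""
    histograms = defaultdict(lambda: [0] * n_clusters)
    for bag_index, cluster_id in zip(bag_indices, cluster_ids):
        histograms[bag_index][cluster_id] += 1
    # One feature row per bag, ordered by bag index
    return [histograms[i] for i in sorted(histograms)]

# Made-up columns, analogous to the 'bagIndex' and 'cluster' columns
bag_indices = [0, 0, 0, 1, 1]
cluster_ids = [2, 2, 0, 1, 2]
features = bag_histograms(bag_indices, cluster_ids, n_clusters=3)
print(features)  # [[1, 0, 2], [0, 1, 1]]
```

Each row would then be paired with the single label of its bag before being passed to the classifier.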

### Calculate the metrics of our classification process.

In [0]:
accuracy = metrics.accuracy_score(test_dataframe['class'].tolist(), y_predicted)
recall = metrics.recall_score(test_dataframe['class'].tolist(), y_predicted, average="macro")
precision = metrics.precision_score(test_dataframe['class'].tolist(), y_predicted, average="macro")
f1 = metrics.f1_score(test_dataframe['class'].tolist(), y_predicted, average="macro")

### Print the metrics results

In [88]:
print("Accuracy: %f" % accuracy)
print("Recall: %f" % recall)
print("Precision: %f" % precision)
print("F1: %f" % f1)

Accuracy: 0.589995
Recall: 0.500000
Precision: 0.294998
F1: 0.371067
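
The pattern in these numbers can be reproduced by hand. The sketch below computes macro-averaged scores for a hypothetical two-class problem in which the classifier always predicts the majority class. The label counts (59 vs. 41) are made up to mirror the accuracy above and are not taken from the dataset; `macro_scores` is an illustrative helper, not a scikit-learn function.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro recall, macro precision, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    recalls, precisions, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # A score with an empty denominator is counted as 0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, sum(recalls) / n, sum(precisions) / n, sum(f1s) / n

# Hypothetical labels: 59 samples of class 0, 41 of class 1, constant prediction 0
y_true = [0] * 59 + [1] * 41
y_pred = [0] * 100
accuracy, recall, precision, f1 = macro_scores(y_true, y_pred)
print("Accuracy: %f" % accuracy)
print("Recall: %f" % recall)
print("Precision: %f" % precision)
print("F1: %f" % f1)
```

With a constant prediction, macro recall is 0.5 and macro precision is half the accuracy, which is the same relationship seen in the scores reported above.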


# 5. Conclusions

In this experiment we transformed the multi-instance classification problem into a regular supervised classification problem, using K-Means clustering and an SVM classifier. To keep the run time short, we used only a portion of the train and test data sets.

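The metrics show a moderate _accuracy_ of around 60% in our tests, but poor values for _recall_ (~50%), _precision_ (~30%) and _f1_ (~37%). With macro averaging over two classes, a recall of exactly 0.5 together with a precision equal to half the accuracy is consistent with the classifier assigning every instance to a single class.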

We experimented with the K value of the K-Means algorithm but did not manage to achieve better results, regardless of the portion of the train data set we used.

This suggests that the transformation we used is probably not the best possible. Another kind of transformation, for example one based on k-medoids, or an approach that needs no transformation at all, such as the citation-KNN algorithm, may perform better.