![alt text](https://www.auth.gr/sites/default/files/banner-horizontal-282x100.png)
# Advanced Topics in Machine Learning - Assignment 2 - Part A


## Class Imbalanced Dataset

#### Useful library documentation, references, and resources used in this assignment:

* Scikit-learn ML library (aka *sklearn*): <https://scikit-learn.org/stable/documentation.html>
* Scikit-multilearn multi-label classification library: <https://github.com/scikit-multilearn/scikit-multilearn>
* Logistic Regression classifier: <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>


#### Instructions about the dataset:
To run the whole script, first download the dataset from the following link.
* Delicious Dataset : <https://github.com/hsoleimani/MLTM/tree/master/Data>

After downloading it, place it in the same folder as the project and name it raw_data, or place it anywhere else and change the paths given below accordingly.

In [21]:
import re
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

# Read the files and clean them

def process_raw_data(data_fnames, labels_fnames):

    # Process data files: each file becomes a list of cleaned documents
    clean_files = []
    for datafile in data_fnames:

        # Read the file from disk, one document per line
        with open(datafile) as f:
            lines = f.readlines()

        clean_doc = []
        for doc in lines:

            # Remove <##> patterns
            doc = re.sub(r'<[0-9]+>', '', doc).strip()

            # Collapse multiple whitespace characters into a single space
            doc = re.sub(r'\s+', ' ', doc).strip()
            clean_doc.append(doc)
        clean_files.append(clean_doc)

    # Datasets to return
    train_data = clean_files[0]
    train_labels = pd.read_csv(labels_fnames[0], delimiter=' ', header=None)
    test_data_total = clean_files[1]
    test_labels_total = pd.read_csv(labels_fnames[1], delimiter=' ', header=None)

    return train_data, train_labels, test_data_total, test_labels_total
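
The effect of the two regular-expression substitutions in the cleaning step can be checked on a single line. This is a minimal sketch: the helper name `clean_doc_line` and the sample line are made up for illustration and are not taken from the dataset.

```python
import re

def clean_doc_line(line):
    """Strip <number> markers, then collapse runs of whitespace."""
    line = re.sub(r'<[0-9]+>', '', line).strip()
    return re.sub(r'\s+', ' ', line).strip()

# Hypothetical raw line, invented for this example
raw = '<2> <3> 12 45 7 <2> 8 19\n'
cleaned = clean_doc_line(raw)
print(cleaned)
```

The cleaned line keeps only the space-separated tokens, which is the form `CountVectorizer` expects later on.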


We split the dataset into training and test sets according to the given files.


In [22]:
# Filenames of files holding the data

data_filenames = [
        'raw_data/train-data.dat',
        'raw_data/test-data.dat'
        ]

# Filenames of files holding the labels

labels_filenames = [
            'raw_data/train-label.dat',
            'raw_data/test-label.dat'
            ]

# Process raw data to create necessary train and test data sets

X_train, y_train, X_test, y_test = process_raw_data(data_filenames, labels_filenames)

# Total accuracy results for all trained classifiers

total_results = pd.DataFrame()


Examine Labels' Distribution

In [23]:
y_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,1,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
2,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0
4,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [24]:
total = y_train.shape[0]
for i in range(20):
    cur = (y_train[i] == 1).sum()
    print('Train samples assigned to label %d: %d (%.1f%%)' %(i+1, cur, 100*cur/total))

Train samples assigned to label 1: 2050 (24.8%)
Train samples assigned to label 2: 479 (5.8%)
Train samples assigned to label 3: 3181 (38.6%)
Train samples assigned to label 4: 799 (9.7%)
Train samples assigned to label 5: 2203 (26.7%)
Train samples assigned to label 6: 1211 (14.7%)
Train samples assigned to label 7: 1471 (17.8%)
Train samples assigned to label 8: 2221 (26.9%)
Train samples assigned to label 9: 1559 (18.9%)
Train samples assigned to label 10: 1004 (12.2%)
Train samples assigned to label 11: 1034 (12.5%)
Train samples assigned to label 12: 939 (11.4%)
Train samples assigned to label 13: 1049 (12.7%)
Train samples assigned to label 14: 725 (8.8%)
Train samples assigned to label 15: 830 (10.1%)
Train samples assigned to label 16: 898 (10.9%)
Train samples assigned to label 17: 598 (7.2%)
Train samples assigned to label 18: 1001 (12.1%)
Train samples assigned to label 19: 411 (5.0%)
Train samples assigned to label 20: 224 (2.7%)


#### Results

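We see that the labels are not perfectly balanced: their frequencies range from 2.7% (label 20) to 38.6% (label 3). However, every label has at least 224 training samples, so no label is missing from the training data and we do not expect problems in the training phase.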
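
In the counts above, label 3 occurs 3181 times and label 20 occurs 224 times, a ratio of roughly 14 to 1. Such ratios can be computed directly from the label matrix. The sketch below uses a small made-up matrix (`toy_labels` is hypothetical) in place of `y_train`, and the summary names are illustrative choices.

```python
import pandas as pd

# Toy label matrix (hypothetical values): 6 documents, 3 labels
toy_labels = pd.DataFrame([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
])

# Share of documents carrying each label
label_freq = toy_labels.mean(axis=0)

# How many times rarer each label is than the most frequent one
imbalance_ratio = label_freq.max() / label_freq

# Average number of labels per document
label_cardinality = toy_labels.sum(axis=1).mean()

print(imbalance_ratio)
print(label_cardinality)
```

The same three lines should work on `y_train` unchanged, assuming it holds only 0/1 label columns.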

# BINARY RELEVANCE

#### Try one method to treat multi-label dataset
There are many methods for handling a multi-label problem; in this assignment we use one of them, Binary Relevance (BR). This method learns one binary classifier for each of the 20 labels and then outputs the union of their predictions. Its main disadvantage is that it does not consider the relationships between the labels; nevertheless, the results are quite good. We handle the dataset as a bag of words: we vectorize the words of each document and classify using Logistic Regression. The best result is for label 20, with 96.9% accuracy. After presenting the accuracy for each label, we report the mean accuracy to get a more general picture of our model.
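
As a point of comparison, scikit-learn's `OneVsRestClassifier` also fits one independent binary model per label column when it is given a 0/1 label matrix, which matches the Binary Relevance scheme described here. The sketch below is not the code used in this assignment; it runs on synthetic data from `make_multilabel_classification`, and the sizes and `max_iter` value are arbitrary assumptions.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data (assumption: stands in for the vectorized documents)
X, Y = make_multilabel_classification(
    n_samples=200, n_features=30, n_classes=5, random_state=0)

# One independent logistic regression model per label column
br = OneVsRestClassifier(LogisticRegression(max_iter=1000))
br.fit(X[:150], Y[:150])

# Predictions form a 0/1 matrix with one column per label
Y_pred = br.predict(X[150:])
print(Y_pred.shape)
print(len(br.estimators_))
```

Writing the loop by hand, as done in this notebook, gives the same kind of model while keeping per-label accuracies easy to print.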

In [25]:
# Vectorize the independent variables to use them in sklearn algorithms

vectorizer = CountVectorizer()
X_train_v = vectorizer.fit_transform(X_train)
X_test_v = vectorizer.transform(X_test)

# Train one binary logistic regression classifier per label
# and store its predictions on the test set

clf = LogisticRegression()
accuracies = []

print('\nTraining one binary classifier for each label (one against all)\n')

for i in range(20):
    ytrain = y_train[i]
    ytest = y_test[i]
    # Fit the classifier for the current label
    clf.fit(X_train_v, ytrain)

    # Predict once and reuse the predictions
    total_results[i] = clf.predict(X_test_v)
    acc = accuracy_score(ytest, total_results[i])

    print('\tAccuracy of classifier on label %d: %0.4f' %(i+1, acc))

    accuracies.append(acc)

# Accuracy on predicting each document's labels:
# fraction of (document, label) pairs predicted correctly

matches = y_test.iloc[:, :20].values == total_results.iloc[:, :20].values
mean_acc = 100*matches.mean()

print('\nMean accuracy on predicting each document\'s labels: %0.2f%%' %mean_acc)



Training one binary classifier for each label (one against all)

	Accuracy of classifier on label 1: 0.8386
	Accuracy of classifier on label 2: 0.9453
	Accuracy of classifier on label 3: 0.6257
	Accuracy of classifier on label 4: 0.9350
	Accuracy of classifier on label 5: 0.7562
	Accuracy of classifier on label 6: 0.8250
	Accuracy of classifier on label 7: 0.7921
	Accuracy of classifier on label 8: 0.7401
	Accuracy of classifier on label 9: 0.7904
	Accuracy of classifier on label 10: 0.8787
	Accuracy of classifier on label 11: 0.8737
	Accuracy of classifier on label 12: 0.8740
	Accuracy of classifier on label 13: 0.8536
	Accuracy of classifier on label 14: 0.9048
	Accuracy of classifier on label 15: 0.9106
	Accuracy of classifier on label 16: 0.8775
	Accuracy of classifier on label 17: 0.9232
	Accuracy of classifier on label 18: 0.8802
	Accuracy of classifier on label 19: 0.9571
	Accuracy of classifier on label 20: 0.9686

Mean accuracy on predicting each document's labels: 85.75%
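
The mean accuracy reported above is the fraction of (document, label) pairs that are predicted correctly, so it equals one minus the Hamming loss. The sketch below checks this identity with `sklearn.metrics.hamming_loss` on small made-up arrays; the values are hypothetical and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Toy ground truth and predictions (hypothetical values): 3 documents, 4 labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Fraction of matching cells in the label matrix
cellwise_accuracy = (y_true == y_pred).mean()

# Fraction of mismatching cells in the label matrix
ham = hamming_loss(y_true, y_pred)

print(cellwise_accuracy, 1 - ham)
```

Because most cells are 0 in a sparse label matrix, this measure can stay high even when rare labels are seldom predicted, so it may be worth reading alongside per-label results.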


#### Statistical analysis of results
After using the ground-truth file to measure accuracy, we perform a statistical analysis of the predictions. First we count how many documents each label is assigned to, and we express that count as a percentage of all test documents. The most frequent label is label 3, which is assigned to 1465 documents (36.8%). We then check how many labels each document received. There were 2 documents with 10 labels and no documents with 11 labels, while most documents had 2 labels.

In [26]:
# Statistical analysis of the results

print('\nStatistical analysis (label based): \n')

for i in range(20):

    # The column of predictions for each label
    col = total_results[i]

    # Count how many times this label was assigned
    counts = col.sum()
    print('\tLabel %d assigned to %d documents  (%.1f%%)'
          %(i+1, counts, 100*counts/len(col)))

print('\nStatistical analysis (document based): \n')

# Add an extra column holding the total number of labels assigned to each document
# (only the 20 label columns are summed, so the cell can be re-run safely)

total_results['doc_#_labels'] = total_results.iloc[:, :20].sum(axis=1)
distr = total_results['doc_#_labels'].value_counts().sort_index()
for n_labels, n_docs in distr.items():
    print('\t%d documents classified to %d labels (%.1f%%)'
          %(n_docs, n_labels, 100*n_docs/total_results.shape[0]))


Statistical analysis (label based): 

	Label 1 assigned to 884 documents  (22.2%)
	Label 2 assigned to 162 documents  (4.1%)
	Label 3 assigned to 1465 documents  (36.8%)
	Label 4 assigned to 289 documents  (7.3%)
	Label 5 assigned to 983 documents  (24.7%)
	Label 6 assigned to 506 documents  (12.7%)
	Label 7 assigned to 562 documents  (14.1%)
	Label 8 assigned to 958 documents  (24.1%)
	Label 9 assigned to 660 documents  (16.6%)
	Label 10 assigned to 382 documents  (9.6%)
	Label 11 assigned to 422 documents  (10.6%)
	Label 12 assigned to 346 documents  (8.7%)
	Label 13 assigned to 384 documents  (9.6%)
	Label 14 assigned to 276 documents  (6.9%)
	Label 15 assigned to 300 documents  (7.5%)
	Label 16 assigned to 313 documents  (7.9%)
	Label 17 assigned to 209 documents  (5.2%)
	Label 18 assigned to 330 documents  (8.3%)
	Label 19 assigned to 120 documents  (3.0%)
	Label 20 assigned to 76 documents  (1.9%)

Statistical analysis (document based): 

	418 documents classified to 0 labels (1

# OVERALL RESULTS

- All binary classifiers showed satisfactory learning outcomes in terms of the accuracy of their final predictions.
- One indication of a good training method is that all labels appear in the final predictions, even the rarest of them.
- One indication of the good performance of our overall classification system is that the distribution of the predicted labels in the test set is similar to the distribution of the labels in the training set.
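
The similarity between the two distributions can be quantified from the percentages printed earlier. The sketch below copies the training-set shares and the predicted test-set shares from those outputs and compares them. The choice of Pearson correlation and of the percentage-point gap as summaries is an assumption made for illustration.

```python
import numpy as np

# Percentages copied from the outputs above (labels 1..20)
train_pct = np.array([24.8, 5.8, 38.6, 9.7, 26.7, 14.7, 17.8, 26.9, 18.9, 12.2,
                      12.5, 11.4, 12.7, 8.8, 10.1, 10.9, 7.2, 12.1, 5.0, 2.7])
pred_pct = np.array([22.2, 4.1, 36.8, 7.3, 24.7, 12.7, 14.1, 24.1, 16.6, 9.6,
                     10.6, 8.7, 9.6, 6.9, 7.5, 7.9, 5.2, 8.3, 3.0, 1.9])

# Percentage-point gap per label (positive: label predicted less often)
diff = train_pct - pred_pct

# Pearson correlation between the two distributions
corr = np.corrcoef(train_pct, pred_pct)[0, 1]

# Label (1-based) with the largest absolute gap
worst = int(np.argmax(np.abs(diff))) + 1

print('Largest gap: label %d (%.1f points)' % (worst, diff[worst - 1]))
print('Correlation between the two distributions: %.3f' % corr)
```

With these numbers the predicted share is lower than the training share for every label, which may suggest that the classifiers lean slightly towards the negative class. The true test-set label frequencies would be needed to confirm this.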