
# Lab 6 - Getting familiar with assignment data

In this lab we will focus on the first graded assignment. You can find the task at this link: \
https://knowledgepit.ml/iml2022project1/

Enroll in the task with the code:
IML2022project1

Please start by reading the assignment task description.

Your job is to create an active learning batch selection algorithm that maximizes the improvement of the predefined model for 3 different batch sizes:
50, 200, and 500 samples.

The predefined model is XGBoost with 100 learning rounds.

You are given an initial labeled pool that will be used to train the model and a data pool from which you should choose all of the samples.
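
To make the task concrete, here is a minimal sketch of one common baseline, uncertainty sampling. It is not the assignment's reference solution. It assumes you already have per-label probabilities predicted by the model for every sample in the data pool; the function name `select_batch` and the random `pool_probs` array are hypothetical stand-ins.

```python
import numpy as np

def select_batch(probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Return indices of the `batch_size` pool samples with the least certain predictions.

    probs: array of shape (n_pool, n_labels) with per-label probabilities in [0, 1].
    """
    # Distance of each label probability from 0.5; a small distance means high uncertainty.
    margin = np.abs(probs - 0.5)
    # Score each sample by its average margin over all labels.
    scores = margin.mean(axis=1)
    # Keep the samples with the smallest average margin.
    return np.argsort(scores, kind="stable")[:batch_size]

# Hypothetical stand-in for model predictions on the data pool.
rng = np.random.default_rng(0)
pool_probs = rng.random((1000, 11))

for size in (50, 200, 500):
    chosen = select_batch(pool_probs, size)
    print(size, chosen[:5])
```

Selecting purely by uncertainty tends to pick similar samples, so you may want to combine it with a diversity criterion.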

Let's start by loading the data with the scipy library. Note that `mmread` returns a sparse matrix here, not a dense numpy array.

See:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.mmread.html

In [11]:
from scipy.io import mmread
import numpy as np

initial_batch = mmread('initial_batch_data.mtx')
print(f"Init batch shape {initial_batch.shape}")
print(f"Number of non zero values {initial_batch.getnnz()}")
print(f"Average number of values per row {np.mean(initial_batch.getnnz(axis=1))}")

Init batch shape (2000, 11436)
Number of non zero values 122993
Average number of values per row 61.4965


In [6]:
# Please note that because the matrix is read in COOrdinate (COO) sparse format, we cannot run the following command
initial_batch[:10, :10]

TypeError: ignored

In [15]:
# If we wish to slice the matrix, we should convert it to another format, e.g. CSR
initial_batch = initial_batch.tocsr()
initial_batch[:10, :10].todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.26352642, 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.29295912, 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.15879782],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.37624408, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.23856982, 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.    
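
The conversion step can also be tried on a tiny matrix without the assignment files. The sketch below builds a toy COO matrix by hand (the values are made up), converts it to CSR, and then slices it and picks rows by index, which is the operation you will need when selecting a batch from the pool.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy data: four non-zero entries in a 3 x 4 matrix.
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 1, 3])
vals = np.array([0.5, 0.3, 0.2, 0.7])
toy = coo_matrix((vals, (rows, cols)), shape=(3, 4))

# CSR supports slicing and fancy indexing over rows.
toy_csr = toy.tocsr()
block = toy_csr[:2, :3].toarray()
picked = toy_csr[[2, 0]].toarray()

print(block)
print(picked)
```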

The rows of the data matrix are documents represented in **tf-idf** (**term frequency–inverse document frequency**) format.

You can read more about the format at https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In practice, we can compute the values as follows.

In [26]:
# Let's consider the following set of documents
doc1 = "Alice has a cat"
doc2 = "The cat would like to eat the mouse"
doc3 = """How much wood could a woodchuck chuck
If a woodchuck could chuck wood?
As much wood as a woodchuck could chuck,
If a woodchuck could chuck wood."""

corpus = [doc1, doc2, doc3]

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus)
word_counts.todense()

matrix([[1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1, 0, 0, 1],
        [0, 2, 0, 4, 4, 0, 0, 1, 2, 0, 0, 2, 0, 0, 4, 4, 0]])

In [33]:
count_vectorizer.vocabulary_

{'alice': 0,
 'as': 1,
 'cat': 2,
 'chuck': 3,
 'could': 4,
 'eat': 5,
 'has': 6,
 'how': 7,
 'if': 8,
 'like': 9,
 'mouse': 10,
 'much': 11,
 'the': 12,
 'to': 13,
 'wood': 14,
 'woodchuck': 15,
 'would': 16}

In [35]:
from sklearn.feature_extraction.text import TfidfTransformer
tfid = TfidfTransformer().fit_transform(word_counts)
tfid.todense()

matrix([[0.62276601, 0.        , 0.4736296 , 0.        , 0.        ,
         0.        , 0.62276601, 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.24573525, 0.        , 0.        ,
         0.32311233, 0.        , 0.        , 0.        , 0.32311233,
         0.32311233, 0.        , 0.64622465, 0.32311233, 0.        ,
         0.        , 0.32311233],
        [0.        , 0.22792115, 0.        , 0.45584231, 0.45584231,
         0.        , 0.        , 0.11396058, 0.22792115, 0.        ,
         0.        , 0.22792115, 0.        , 0.        , 0.45584231,
         0.45584231, 0.        ]])
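
To see where these numbers come from, the sketch below recomputes them by hand with numpy. It assumes the default settings of `TfidfTransformer` (smoothed idf, i.e. `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and starts from the count matrix printed earlier.

```python
import numpy as np

# Word counts copied from the CountVectorizer output above.
counts = np.array([
    [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1, 0, 0, 1],
    [0, 2, 0, 4, 4, 0, 0, 1, 2, 0, 0, 2, 0, 0, 4, 4, 0],
], dtype=float)

n_docs = counts.shape[0]
# Document frequency: in how many documents each term appears.
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
# Weight the raw counts and normalize every row to unit L2 length.
weighted = counts * idf
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(tfidf_manual[0], 8))
```

The term 'cat' appears in two documents, so its idf is lower than that of 'alice' or 'has', which is why it gets a smaller weight in the first row.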

Or you can do this in one step using TfidfVectorizer \
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer
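
A minimal sketch of the one-step variant, using the same three documents and default parameters, which should give the same values as the two-step pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc1 = "Alice has a cat"
doc2 = "The cat would like to eat the mouse"
doc3 = """How much wood could a woodchuck chuck
If a woodchuck could chuck wood?
As much wood as a woodchuck could chuck,
If a woodchuck could chuck wood."""
corpus = [doc1, doc2, doc3]

# Tokenizing, counting and tf-idf weighting happen in a single fit_transform call.
vectorizer = TfidfVectorizer()
tfidf_direct = vectorizer.fit_transform(corpus)

alice_col = vectorizer.vocabulary_["alice"]
print(tfidf_direct.shape, tfidf_direct[0, alice_col])
```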

Let's now load the labels.

In [39]:
labels = []

with open('initial_batch_labels.txt', 'r') as f:
  for line in f:
    labels.append(line.strip().split(','))


In [44]:
from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()

labels_binary = lb.fit_transform(labels)
labels_binary[:2], labels[:2]

(array([[0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]]), [['F', 'G', 'I'], ['F', 'G']])
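
The column order of the binary matrix follows `classes_`, and `inverse_transform` maps rows back to label sets. The sketch below shows both on a toy label list (the two rows printed above). Note that a binarizer fitted on only these two rows knows just three classes, unlike the 11 columns obtained from the full label file.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labels: the first two rows shown above.
toy_labels = [['F', 'G', 'I'], ['F', 'G']]

toy_lb = MultiLabelBinarizer()
toy_binary = toy_lb.fit_transform(toy_labels)

# classes_ gives the label that each column stands for, in sorted order.
print(toy_lb.classes_)
print(toy_binary)

# inverse_transform recovers the label sets as tuples.
decoded = toy_lb.inverse_transform(toy_binary)
print(decoded)
```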

Try to create an active learning batch selection algorithm for the assignment.

Tips:
1. You can split the initial_batch into 3 parts: use one part for training, one for evaluation purposes, and one as a pool.
2. When designing an algorithm, you can first test its generalization performance on other fully labeled datasets. It may be a good idea to use some NLP problem in order to work on a similar representation.
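
The first tip can be sketched as a random three-way split of row indices. The names `train_idx`, `pool_idx` and `eval_idx` are hypothetical, and the equal-sized split is only one possible choice; the row count of 2000 is taken from the shape printed when loading the data.

```python
import numpy as np

n_samples = 2000  # number of rows in initial_batch
rng = np.random.default_rng(42)

# Shuffle the row indices and cut them into three nearly equal parts.
shuffled = rng.permutation(n_samples)
train_idx, pool_idx, eval_idx = np.array_split(shuffled, 3)

print(len(train_idx), len(pool_idx), len(eval_idx))
```

With the matrix in CSR format, each part can then be taken as, for example, `initial_batch[train_idx]`, with the matching rows of `labels_binary` selected in the same way. A selection algorithm is then scored by training on the training part, adding the samples it picked from the pool part, and comparing performance on the evaluation part before and after.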