# Exam for evaluating ML skills needed for Trantor: Exercise II

### Below are a number of examples and exercises. The goal of the exam is to complete as many of the exercises as possible. Candidates may create an auxiliary .py file and import it from the notebook to avoid excessive text.
### It is highly recommended to write modular code so that it can be reused across the different exercises. The ability to write modular, self-explanatory, and clean code that can be used across tasks will be highly appreciated.
### Short comments may be added to explain the choice of ML model or algorithm, along with references to papers in which a similar solution is used for a related problem.

In this part of the test, you are given a set of data and asked to propose a supervised learning approach that fits the problem at hand. The specification of the problem follows:

In [1]:
import pickle
import numpy as np
import pandas as pd

In [2]:
with open("data.pickle", "rb") as f:  # This pickle file contains the data that can be used to predict values
    data = pickle.load(f)

In [3]:
print(len(data))  # We visualize the shape of the data
for i in range(len(data)):
    print(data[i].shape)

47
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 5)
(1002, 10)
(1002, 10)
(1002, 10)
(1002, 10)
(1002, 10)
(1002, 10)
(1002, 2)
(1002, 3)
(1002, 3)
(1002, 10)
(1002, 10)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)
(1002, 1)


Each position of this file holds 1002 groups of elements that belong to the same category. For example, position 4 (`data[4]`, zero-indexed) holds 1002 groups of ten elements related to singers/music bands.

In [7]:
print(data[4])
print(data[4].shape)

[['Beastie Boys' 'Depeche Mode' 'Ice T' ... 'Fat Boy Slim' 'Eminem'
  'Missy Elliot']
 ['Mana' 'Wolfredo Vargas' 'El Combo show ' ... 'nan' 'nan' 'nan']
 ['Antonio orozco ' 'Pablo alboran' 'Meléndi' ... 'nan' 'nan' 'nan']
 ...
 ['Love of lesbian' 'Niños mutantes ' 'León Benavente ' ... 'nan' 'nan'
  'nan']
 ['Txarango' 'Els catarres' 'Sau' ... 'María Jiménez ' 'Estopa' 'nan']
 ['Rolling stones' 'Love id lesbians' 'Springsteen' ... 'nan' 'nan' 'nan']]
(1002, 10)


Or, in position 6 (`data[6]`), film/saga titles.

In [8]:
print(data[6])
print(data[6].shape)

[['Airbag' 'La Guerra de las Galaxias' 'Seven' ... 'Canta' 'nan' 'nan']
 ['Mad Max' 'Batman' 'Vengadores de Marvel' ... 'nan' 'nan' 'nan']
 ['Harry poter' 'Saw' 'La milla verde ' ... 'nan' 'nan' 'nan']
 ...
 ['El padrino' 'Indiana Jones' 'La vida es bella' ... 'nan' 'nan' 'nan']
 ['Revenge' 'La casa de papel' 'Juego de tronos' ... 'Titanic'
  'Mamma mia' 'nan']
 ['Bethoben' 'avatar' 'Diario de noa' ... 'nan' 'nan' 'nan']]
(1002, 10)


As you can see in these examples, even though there is a significant amount of information in these groups, in most cases the groups are not "complete". In the first group related to movie titles, for example, there are eight titles in ten possible positions.

In [9]:
print(data[6][0])

['Airbag' 'La Guerra de las Galaxias' 'Seven' 'Fantasia' 'Interstellar'
 'El Club de la Lucha' 'Regreso al Futuro' 'Canta' 'nan' 'nan']


In the last group, we only find three elements.

In [10]:
print(data[6][-1])

['Bethoben' 'avatar' 'Diario de noa' 'nan' 'nan' 'nan' 'nan' 'nan' 'nan'
 'nan']


In other instances, the proportion of missing information is much larger. As an example, in the first category (places in which clothing can be bought), which consists of groups of a single element, the number of non-'nan' elements (and therefore of non-empty groups) is very low: 14 out of 1002.

In [11]:
print([x for x in data[0] if x != ['nan']])
print(len([x for x in data[0] if x != ['nan']]))

[array(['Mercadillo '], dtype='<U19'), array(['Primark'], dtype='<U19'), array(['Donde puedo'], dtype='<U19'), array(['Mercados ambulantes'], dtype='<U19'), array(['No compro'], dtype='<U19'), array(['oulets'], dtype='<U19'), array(['MI FABRICA'], dtype='<U19'), array(['Cualquier '], dtype='<U19'), array(['Mercadillo'], dtype='<U19'), array(['Mercadillos '], dtype='<U19'), array(['Mercadillo'], dtype='<U19'), array(['Mercadillo'], dtype='<U19'), array(['Primark'], dtype='<U19'), array(['mercadillo'], dtype='<U19')]
14
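
The same check can be applied to any category, not only the single-element ones. The following is a minimal sketch, assuming each category is a 2-D array of strings in which missing entries are the literal string 'nan' (as in the outputs above); `filled_fraction` and the toy array are illustrative names, not part of the provided files.

```python
import numpy as np

def filled_fraction(category, missing="nan"):
    """Fraction of groups that contain at least one non-missing item.

    `category` is assumed to be a (n_groups, n_slots) array of strings.
    """
    arr = np.char.strip(np.asarray(category, dtype=str))
    has_item = (arr != missing).any(axis=1)  # one boolean per group
    return float(has_item.mean())

# Toy stand-in for a single-element category such as data[0]
toy = np.array([["Primark"], ["nan"], ["Mercadillo "], ["nan"]])
print(filled_fraction(toy))  # 0.5
```

Calling the function on every position of `data` would give a quick overview of which categories carry enough information to be useful.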


Besides that, we have a set of five dependent variables:

In [12]:
with open("variables.pickle", "rb") as f:
    variables = pickle.load(f)

In [13]:
print(variables.shape)
print(variables)

(1002, 5)
[[30. 55. 97. 85. 50.]
 [ 2. 60. 90. 85. 95.]
 [65. 96. 90. 65. 90.]
 ...
 [80. 20. 50. 15. 15.]
 [99. 10. 15.  5.  1.]
 [40. 50. 85. 65.  5.]]


As can be seen, in a similar manner to the original data, each of the variables has 1002 recorded values. The task consists of predicting the five corresponding values given a single element from a group.

For example, given an element (2) from a group (32) in the film title category (6), predict the five dependent values. As many models as necessary can be used, e.g., you can use one model for each of the five values.

In [14]:
print(data[6][32][2])
print(variables[32])

Bad boys
[90. 10. 45. 25. 15.]


This is not an easy task, as it involves written text, which is not the optimal way of presenting the data to a "common" model. To address that issue, we propose the following steps to solve the exercise:

# EXERCISE 2

2.1 Perform the necessary transformations of the data so that it consists of six columns. The first column would contain each element of each group in the data, and the remaining five columns would contain its corresponding variables to be predicted. Following the example two cells above, one line in the dataset would be:


In [16]:
[data[6][32][2]] + variables[32].tolist()

['Bad boys', 90.0, 10.0, 45.0, 25.0, 15.0]

Because the same variable values correspond to all the items in that grouping, the next line in the dataset could be:

In [17]:
[data[6][32][3]] + variables[32].tolist()

['Infiltrados', 90.0, 10.0, 45.0, 25.0, 15.0]

Another line, using an item from the group with elements related to music:

In [19]:
[data[4][32][0]] + variables[32].tolist()

['Sfdk', 90.0, 10.0, 45.0, 25.0, 15.0]

Note that the variables are the same, as the position (32) in which the data was found has not changed.

Perform any modifications that you may find fitting to this dataset, e.g., treat 'nan's differently, or any other change. Explain and justify whatever transformation you perform.
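
One possible way of building these rows is sketched below. It assumes that `data` is a sequence of 2-D string arrays and `variables` a 2-D numeric array with one row per group position, as shown above. Dropping 'nan' entries and stripping whitespace are choices made for this sketch, not requirements; `build_rows` and the toy inputs are hypothetical names.

```python
import numpy as np

def build_rows(data, variables, missing="nan"):
    """Flatten all categories into rows of [item, v0, ..., v4].

    Also returns the group position of every row, which is useful later
    for keeping items that share the same targets together.
    """
    rows, group_ids = [], []
    for category in data:
        for group_idx, (group, targets) in enumerate(zip(category, variables)):
            for item in group:
                text = str(item).strip()
                if text and text.lower() != missing:  # skip empty slots
                    rows.append([text] + [float(t) for t in targets])
                    group_ids.append(group_idx)
    return rows, group_ids

# Toy stand-ins for data and variables (two categories, two group positions)
toy_data = [
    np.array([["Bad boys", "Infiltrados ", "nan"], ["avatar", "nan", "nan"]]),
    np.array([["Sfdk", "nan"], ["nan", "nan"]]),
]
toy_variables = np.array([[90.0, 10.0, 45.0, 25.0, 15.0],
                          [40.0, 50.0, 85.0, 65.0, 5.0]])

rows, group_ids = build_rows(toy_data, toy_variables)
print(rows[0])    # ['Bad boys', 90.0, 10.0, 45.0, 25.0, 15.0]
print(group_ids)  # [0, 0, 1, 0]
```

Keeping the group position alongside each row does not change the six-column dataset, but it makes it possible to split or aggregate by position afterwards.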

2.2 Transform the elements to a numeric representation using word embeddings (WE). We recommend the gensim library, but feel free to use any other. Transform the dataset again, this time combining the values obtained from the WE with the values to be predicted. This way, the dataset will now have n+5 columns, where n is the number of dimensions of the WE chosen for the transformation. Note that many items consist of multiple words, which many WE cannot handle directly. To that end, figure out a way of "combining" the multiple words of one element (e.g., the mean of the different values, or any other approach you come up with).
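
The mean-of-words idea mentioned above could look like the sketch below. To stay self-contained it uses a plain dictionary as the embedding lookup; any object that supports `key in vectors` and `vectors[key]` (such as gensim's `KeyedVectors`, to the best of our knowledge) should work the same way. Lowercasing, whitespace tokenization, and returning a zero vector for items with no known word are assumptions of this sketch, and `embed_item` and `toy_vectors` are hypothetical names.

```python
import numpy as np

def embed_item(text, vectors, dim):
    """Average the embeddings of the words of `text` that are in `vectors`.

    Returns a zero vector of length `dim` when no word is known.
    """
    tokens = text.lower().split()
    found = [np.asarray(vectors[t], dtype=float) for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Toy 3-dimensional embedding table (made-up values, for illustration only)
toy_vectors = {
    "bad": np.array([1.0, 0.0, 0.0]),
    "boys": np.array([0.0, 1.0, 0.0]),
}

print(embed_item("Bad boys", toy_vectors, dim=3))  # [0.5 0.5 0. ]
print(embed_item("Sfdk", toy_vectors, dim=3))      # [0. 0. 0.]
```

Items that map to the zero vector carry no information, so dropping them or flagging them with an extra column are both defensible alternatives.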

2.3 Because the representation obtained from the WE may not be optimal for a supervised learning task, we ask you to build new features. To that end, we ask you to perform the following steps:

2.3.1: Choose a set of "pivot" values. These are vectors of the same dimension as the one of the chosen WE. The values of the pivots are arbitrary. You can choose random values, zeros, ones, twos, ..., even an item which has been transformed into its vectorized form in the previous step can be used as a pivot. You have to choose 10 pivots.

2.3.2: Next, you will have to compute a distance (e.g., MSE or any other that you may find more suited to this problem) from each vectorized element, to each pivot.

2.3.3: These 10 values now represent each element. Append the variables to them, and you will have a dataset consisting of 15 columns. The first ten will contain the distances from the vectorized version of the elements to each of the pivots, and the last five, the variables to be predicted.

In [None]:
# for example
d = 100  # assuming that the dimension of the WE is 100
pivot0 = np.random.rand(d)  # random values in [0, 1)
pivot1 = np.zeros(d)
pivot2 = np.zeros(d) + 2
pivot3 = None  # placeholder: replace with the vectorized form of an element

In [None]:
# Assuming that 'vec' contains the vectorized version of an element; MSE is used here as one possible distance
v0 = np.mean((pivot0 - vec) ** 2)
v1 = np.mean((pivot1 - vec) ** 2)
# v0, v1, ..., v9 will be the values representing the element vec
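
For a whole dataset, the distances to all pivots can be computed at once with broadcasting. The sketch below assumes MSE as the distance and uses random vectors as stand-ins for the embedded items; `pivot_features` and the particular pivot choice are illustrative only.

```python
import numpy as np

def pivot_features(vectors, pivots):
    """MSE between every item vector (n, d) and every pivot (k, d) -> (n, k)."""
    diff = vectors[:, None, :] - pivots[None, :, :]  # shape (n, k, d)
    return np.mean(diff ** 2, axis=2)

rng = np.random.default_rng(0)
d = 100                                  # assumed WE dimension
vectors = rng.normal(size=(6, d))        # stand-in for six embedded items

pivots = np.vstack([
    np.zeros(d),           # pivot 0: zeros
    np.ones(d),            # pivot 1: ones
    np.full(d, 2.0),       # pivot 2: twos
    vectors[0],            # pivot 3: an already vectorized item
    rng.random((6, d)),    # pivots 4-9: random values
])

features = pivot_features(vectors, pivots)
print(features.shape)  # (6, 10)
```

Note that constant pivots are partly redundant under MSE: the distance to a constant c equals mean(v^2) - 2c mean(v) + c^2, so the zeros, ones, and twos pivots together carry only two independent quantities. Pivots taken from embedded items or random draws tend to add more distinct information.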

Once the dataset has been constructed, you are required to use a neural network-based model (preferably implemented in TensorFlow or PyTorch) that tries to map each set of values (10) to the variables to be predicted (5). Remember that you can use 5 different models if you want.
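
To make the shape of the mapping concrete, here is a dependency-free numpy stand-in for such a network: one hidden tanh layer trained with full-batch gradient descent on MSE. It is not a substitute for a TensorFlow or PyTorch implementation, where the same architecture takes a couple of layer definitions. The random inputs, the hyperparameters, and the rescaling of the targets (assumed to lie in [0, 100], as the printed values suggest) are assumptions of this sketch.

```python
import numpy as np

def train_mlp(X, Y, hidden=16, lr=0.05, epochs=300, seed=0):
    """Fit a one-hidden-layer tanh network to Y by gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    d_in, d_out = X.shape[1], Y.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, d_out))
    b2 = np.zeros(d_out)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)       # hidden activations, shape (n, hidden)
        err = H @ W2 + b2 - Y          # prediction error, shape (n, d_out)
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / err.size   # gradient of the MSE w.r.t. predictions
        g_hid = (g_out @ W2.T) * (1.0 - H ** 2)  # back through tanh
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses

def predict(params, X):
    """Forward pass of the trained network."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                       # stand-in for standardized pivot distances
Y = rng.uniform(0.0, 100.0, size=(200, 5)) / 100.0   # targets rescaled to [0, 1]

params, losses = train_mlp(X, Y)
pred = predict(params, X) * 100.0                    # back to the original scale
print(pred.shape)  # (200, 5)
```

A single network with five outputs shares its hidden layer across targets; training five separate single-output networks is the alternative the statement allows, at the cost of more parameters per target.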

Moreover, we ask you to evaluate the developed approach using tools that fit the problem as you have defined it. We would also like you to extract a set of conclusions: for example, the most difficult part, the one with the largest room for improvement, or the one in which you would invest more time if you could.
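
One evaluation detail worth considering: since all items found at the same position share the same five target values, a random row-wise split would place rows with identical targets in both the training and the test set. The sketch below splits by group position instead and reports the per-variable mean absolute error of a trivial baseline (predicting the training mean), against which any model can be compared. The function names, the 80/20 split, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def group_split(group_ids, test_fraction=0.2, seed=0):
    """Boolean train/test masks such that no group appears in both."""
    group_ids = np.asarray(group_ids)
    rng = np.random.default_rng(seed)
    unique = np.unique(group_ids)
    rng.shuffle(unique)
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_mask = np.isin(group_ids, unique[:n_test])
    return ~test_mask, test_mask

def mae_per_target(y_true, y_pred):
    """Mean absolute error of each target column."""
    return np.mean(np.abs(y_true - y_pred), axis=0)

# Synthetic stand-in: 50 group positions, 4 items each, 5 targets per position
rng = np.random.default_rng(0)
group_ids = np.repeat(np.arange(50), 4)
y = rng.uniform(0.0, 100.0, size=(50, 5))[group_ids]

train, test = group_split(group_ids)
baseline = y[train].mean(axis=0)               # predict the training mean
baseline_mae = mae_per_target(y[test], baseline)
print(baseline_mae.shape)  # (5,)
```

If a trained model does not beat this baseline on held-out groups, that is a meaningful finding in itself and a natural starting point for the conclusions.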

This is not an easy problem, and you will probably not obtain good results. The goal of this task is to test your capacity for critical thinking and for justifying and defending the proposed approach. Because of this, we ask you to submit your work whether the results are positive or not.