# Exercise 3: Recommendation loop, Data, Baselines

## From a Customer to a Recommender System
Below we take a look at an (overly) simplified process model of how a recommender system is (or should be) adopted.

<br>
<br>

### Pipeline:

**0)** Unless it is a research project aiming to demonstrate new concepts or reach new performance heights, a **recommender system is created for a customer**. The customer sees the system as a means to achieve some goal (e.g. earn more money).

```For example, an online shop wants to sell more products or attract attention to a larger part of its product range.```

```A recommender system is just one of many possible solutions.```

**1)** **Data.** We start from the data and see what is available to us. At the very least we should have a log of purchases, like this:


| Meaningless but Unique<br>Purchase Id | Products List | Quantities | Total | Date |
| ---    |   ---  |   ---  |   ---  |   ---  |
| 002Ax4gf... | [909] | [2]  | 26€ | 13.04.08 |
| 9f2D4jKx... | [909, 117, 3] | [1, 1, 2]  | 102€ | 01.02.09 |
| 3g6lP89qs.. | [3, 4] | [2, 2] | 7€ | 11.10.10 |
| ... | ... | ... | ... | ... |

A list of items should be easily accessible as well. If we are lucky, we also have some data about registered users, which can help us, for example, to identify that the first two purchases were made by the same person. Demographic information, such as location and age, or user-to-item ratings also help a great deal.

**2)** **Start simple.** As much as it is tempting to start using the data to design an effective recommendation system, we should not forget about our **customer and their interests**:
* Customers like simple things because those are cheaper and are delivered quicker;
* Once you design a system you need to compare it to something to confirm that it is worth spending time and money on;

Therefore it is important to decide on baselines and agree on success metrics:
* Baseline: some simple, easily explainable recommendation strategy. E.g. recommend the same set of most popular items to every user, or recommend random items to every user;
* Success metric: discuss with your customer which performance metric reflects their goals best. NDCG? F1 score?
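
To make the idea of a success metric more concrete, here is a minimal sketch of NDCG@k with binary relevance (an item is either relevant or not). The helper name `ndcg_at_k` and its arguments are assumptions made for illustration; they are not part of the exercise.

```python
import numpy as np

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance (hypothetical helper, not part of the exercise)."""
    relevant = set(relevant)
    top = list(recommended)[:k]
    # A hit at 0-based rank r contributes a gain of 1 discounted by log2(r + 2)
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(top) if item in relevant)
    # The ideal ranking places all relevant items at the top of the list
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Both relevant items are ranked first, so the score is perfect
print(ndcg_at_k([1, 2, 3], relevant=[1, 2], k=3))  # 1.0
```

A score of 1 means the relevant items occupy the top of the list; pushing a relevant item further down lowers the score, which is the kind of intuitive explanation a customer can follow.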

**HINT:** It can happen that a simple baseline solution is already enough.

**3)** **Test.** Test the solution you have under consideration and provide an intuitive explanation of what the performance metrics mean. If the performance is lacking -- it is time to revise your design. Otherwise -- well done!

**4)** **(re)Design.** Time to apply everything you have learned about designing recommender systems. Define problems and look for solutions, then go back to step **(3)**.

**5)** **Deployment & Maintenance.** You might be asked to take part in these after-design activities as well.

<br>
<br>

### What we know:
* We have already taken a look at various recommendation algorithms and talked about evaluation metrics (steps **3** and **4**);

* Steps **0** and **5** are out of the scope of the course. We can always assume that we are our own customer. All we need is to formulate our goals clearly;

* **It is time to get a better idea about those initial stages of the project: data preparation and setting baselines...**


# <font color='red'>TASKS</font>:

In this exercise you will write a couple of functions and call them to produce some results, saving them in the given variables. See each task's description for details.

For this exercise we'll work with a tiny sample of the yet-to-be-released LFM2B dataset. It contains listening histories of Last.fm users.

These are the files that should be placed in the same folder as your notebook (find them on Moodle):

* 'sampled_1000_items_inter.txt' - data about user-item interactions;
* 'sampled_1000_items_tracks.txt' - track-related information;
* 'sampled_1000_items_demo.txt' - user-related information;

The modules already imported are enough to complete the task. You are free to import more at your own risk (works relying on modules requiring installation will be ignored).


### Data format clarifications:

    'sampled_1000_items_inter.txt'
User-item interaction:
    
| User Id | Track Id | Number of Interactions | 
| ---    |   ---  |   ---  |
| 0 | 0 | 3  |
| 0 | 6 | 5 |
| 2 | 17 | 8 |
| ... | ... | ... |

    'sampled_1000_items_tracks.txt'
Track-related information (line index, starting from zero, is the **Track ID**):

| Artist | Track Name |
| ---    |   ---  |
| Helstar | Harsh Reality |
| Carpathian Forest | Dypfryst / Dette Er Mitt Helvete |
| Cantique Lépreux | Tourments Des Limbes Glacials |
| ... | ... |

    'sampled_1000_items_demo.txt'
User-related information (line index, starting from zero, is the **User ID**):

| Location | Age | Gender | Reg. Date |
|   ---  |   ---  |   ---  |   ---  |
| RU | 25  | m | 2007-10-12 18:42:00 |
| UK | 27 | m | 2006-11-17 16:51:56 |
| US | 22 | m | 2010-02-02 22:30:15 |
| ... | ... | ... | ... |

All files are in <font color='red'>.tsv (tab '**\t**' separated values)</font> format.
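
Because the files have no header row and the line index doubles as the ID, it matters how they are read. The sketch below shows one way to do it with pandas, using an in-memory stand-in for the user file built from the first two rows of the table above. The column names are illustrative labels, not something the files themselves contain.

```python
import io
import pandas as pd

# Stand-in for 'sampled_1000_items_demo.txt' (tab-separated, no header row)
demo_tsv = "RU\t25\tm\t2007-10-12 18:42:00\nUK\t27\tm\t2006-11-17 16:51:56\n"

# header=None keeps the first line as data; the names are illustrative labels
users = pd.read_csv(io.StringIO(demo_tsv), sep='\t', header=None,
                    names=['location', 'age', 'gender', 'reg_date'])

# The default index starts at 0, so it coincides with the User ID
print(users.loc[1, 'location'])  # UK
print(len(users))                # 2
```

Reading the file without `header=None` would treat the first line as column names and silently shift every ID by one.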

In [1]:
import pandas as pd
import numpy as np
import random as rnd

## <font color='red'>TASK 1/4</font>: Interaction Matrix 
### Method (2 points):
Write a function that receives three file names as input and returns a 2-dimensional numpy array with the corresponding interaction matrix, where **0** means no interaction, **1** means interaction took place (any non-zero number of times).

The first dimension should correspond to users, the second to items.

Insert your solution into the signature below. Please, don't change the name or the argument set, even if they are not beautiful.
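
As a small worked example of the expected result, the sketch below builds a binary matrix from the three interaction rows shown in the data format table. The matrix size (3 users, 18 items) is an assumption chosen to be just large enough for these rows; the real dimensions come from the user and track files.

```python
import numpy as np

# First three rows of the interaction table: (user id, track id, count)
triples = np.array([[0, 0, 3],
                    [0, 6, 5],
                    [2, 17, 8]])

n_users, n_items = 3, 18  # assumed sizes for this toy example only
toy = np.zeros((n_users, n_items))
# Integer-array indexing sets one cell per (user, track) pair
toy[triples[:, 0], triples[:, 1]] = 1

print(toy.shape)        # (3, 18)
print(toy[0, 6])        # 1.0
print(toy.sum(axis=1))  # [2. 0. 1.]
```

User 1 has no interactions in this toy input, so the corresponding row stays all zeros.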

In [2]:
def inter_matr_binary(usr_path = 'sampled_1000_items_demo.txt',
                      itm_path = 'sampled_1000_items_tracks.txt',
                      inter_path = 'sampled_1000_items_inter.txt'):
    '''
    usr_path - string path to the file with users data;
    itm_path - string path to the file with item data;
    inter_path - string path to the file with interaction data;
    
    returns - 2D np.array, rows - users, columns - items;
    '''

    # The files have no header row, so header=None keeps the first line as data
    tracks = pd.read_csv(itm_path, delimiter='\t', header=None)
    users = pd.read_csv(usr_path, delimiter='\t', header=None)
    # Use the inter_path argument instead of a hard-coded file name
    inter = pd.read_csv(inter_path, delimiter='\t', header=None)

    # The line index is the ID, so the row counts give the matrix dimensions
    res = np.zeros(shape=(len(users), len(tracks)))

    # Columns of inter: 0 - user id, 1 - track id, 2 - number of interactions
    res[inter[0].to_numpy(), inter[1].to_numpy()] = 1

    return res


### Application (1 point):
Using the files we discussed above, create an interaction matrix corresponding to the requirements of the first part of the task and assign it to the variable **_interaction_matrix_test**. You can use your function to do so:

In [3]:
_interaction_matrix_test = inter_matr_binary(usr_path = 'sampled_1000_items_demo.txt',
                      itm_path = 'sampled_1000_items_tracks.txt',
                      inter_path = 'sampled_1000_items_inter.txt')


## <font color='red'>TASK 2/4</font> <font color='darkblue'>(BONUS): Interaction Matrix 2</font>
<font color='darkblue'> This task will only grant points (1-2, or about 10% of one whole exercise) to those who didn't get full points on both Exercise 1 and the Test. It will be checked regardless though. </font>

<font color='darkblue'>Write a function that receives three file names as input and returns a 2-dimensional numpy array with the corresponding interaction matrix, **based on playcounts** (number of interactions). Here **0** means no interaction, and a value from **(0,1]** means interaction took place. **For every user the sum of the corresponding row should be equal to 1**!

$u$ - User\
$i$ - Item\
$I$ - Full set of Items\
$res_{u,i}$ - element of the resulting interaction matrix for user $u$ and item $i$;\
$C_{u,i}$ - Playcount for user $u$ and item $i$\
<br>
$res_{u,i} = \frac{C_{u,i}}{\sum \limits _{t \in I} C_{u,t}}$

The first dimension should correspond to users, the second to items.

Insert your solution into the signature below. Please, don't change the name or the argument set, even if they are not beautiful.</font>
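
The formula above amounts to dividing every row of the playcount matrix by its row sum. The following sketch applies it to made-up playcounts. How to treat a user without any interactions is not specified in the task; leaving such a row at zero is an assumption made here to avoid a division by zero.

```python
import numpy as np

# Made-up playcounts: 3 users x 4 items; the last user has no interactions
counts = np.array([[3., 0., 5., 2.],
                   [0., 8., 0., 0.],
                   [0., 0., 0., 0.]])

row_sums = counts.sum(axis=1, keepdims=True)
# Divide only where the row sum is positive; all-zero rows stay zero
probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

print(probs[0])           # [0.3 0.  0.5 0.2]
print(probs.sum(axis=1))  # [1. 1. 0.]
```

For the first user the total playcount is 10, so the entries become 3/10, 5/10 and 2/10, and the row sums to 1 as required.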

In [4]:
def inter_matr_prob(usr_path = 'sampled_1000_items_demo.txt',
                    itm_path = 'sampled_1000_items_tracks.txt',
                    inter_path = 'sampled_1000_items_inter.txt'):
    '''
    usr_path - string path to the file with users data;
    itm_path - string path to the file with item data;
    inter_path - string path to the file with interaction data;
    
    returns - 2D np.array, rows - users, columns - items;
    '''
    # The files have no header row, so header=None keeps the first line as data
    tracks = pd.read_csv(itm_path, delimiter='\t', header=None)
    users = pd.read_csv(usr_path, delimiter='\t', header=None)
    # Use the inter_path argument instead of a hard-coded file name
    inter = pd.read_csv(inter_path, delimiter='\t', header=None)

    # The line index is the ID, so the row counts give the matrix dimensions
    res = np.zeros(shape=(len(users), len(tracks)))

    # Columns of inter: 0 - user id, 1 - track id, 2 - number of interactions
    res[inter[0].to_numpy(), inter[1].to_numpy()] = inter[2].to_numpy()

    # Normalise every row by its total playcount; rows without
    # interactions are left at zero to avoid dividing by zero
    row_sums = res.sum(axis=1, keepdims=True)
    res = np.divide(res, row_sums, out=np.zeros_like(res), where=row_sums > 0)

    return res


## <font color='red'>TASK 3/4</font>: Most popular Items
### (1 point)
Write some code to put a list of the top 10 most popular items **(those that the largest number of different users interacted with)** into the variable **_top_pop_10** (sorted in order of descending popularity).
The variable should contain a list or a 1D numpy array of length 10:

In [5]:
# Track the 10 largest popularity values seen so far and the items they belong to
top_values = [0] * 10
top_it = [0] * 10

for ind in range(_interaction_matrix_test.shape[1]):
    # popularity = number of different users who interacted with the item
    current_sum = _interaction_matrix_test[:, ind].sum()
    smallest_value = min(top_values)
    if smallest_value < current_sum:
        # replace the currently weakest entry of the top list
        smallest_ind = top_values.index(smallest_value)
        top_values[smallest_ind] = current_sum
        top_it[smallest_ind] = ind

# Sort ascending by popularity, then reverse to get descending order
pairs = sorted(zip(top_values, top_it), key=lambda x: x[0])
_top_pop_10 = [int(index) for _, index in reversed(pairs)]
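
A more compact way to obtain such a ranking is to sum the matrix columns and sort the sums. The sketch below does this on a made-up matrix. Note that it breaks ties in favour of the lower item index, which is an assumption and may order equally popular items differently from the loop above.

```python
import numpy as np

# Made-up binary interaction matrix: 4 users x 5 items
toy = np.array([[1, 0, 1, 0, 0],
                [1, 0, 1, 0, 1],
                [0, 0, 1, 0, 0],
                [1, 0, 1, 1, 0]])

# Column sums count how many different users interacted with each item
popularity = toy.sum(axis=0)  # [3 0 4 1 1]
# Stable sort on negated counts: descending popularity, ties keep the lower index first
top_3 = np.argsort(-popularity, kind='stable')[:3].tolist()
print(top_3)  # [2, 0, 3]
```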


## <font color='red'>TASK 4/4</font>: POP Recommender
### (3 points)
Write a function that recommends K most popular items to a given user, **making sure that the user hasn't seen any of the recommended items before.**

The function should take three arguments: the np.array from task 1 (your prepared data), a user ID (int) and K (int > 0).
Expected return: a list or a 1D array of length K (sorted in order of descending popularity).

Insert your solution into the signature below. Please, don't change the name or the argument set, even if they are not beautiful.

In [6]:
def recTopKPop(prepaired_data: np.array,
               user: int,
               top_k: int) -> np.array:
    '''
    prepaired_data - np.array from the task 1;
    user - user_id, integer;
    top_k - expected length of the resulting list;
    
    returns - list/array of top K popular items that the user has never seen
              (sorted in the order of descending popularity);
    '''
    
    user_row = prepaired_data[user]

    # Track the top_k largest popularity values seen so far and their items
    top_values = [0] * top_k
    top_it = [0] * top_k

    for ind in range(prepaired_data.shape[1]):
        # popularity = number of different users who interacted with the item
        current_sum = prepaired_data[:, ind].sum()
        smallest_value = min(top_values)
        # only items the user has not interacted with are candidates
        if smallest_value < current_sum and user_row[ind] == 0:
            smallest_ind = top_values.index(smallest_value)
            top_values[smallest_ind] = current_sum
            top_it[smallest_ind] = ind

    # Sort ascending by popularity, then reverse to get descending order
    pairs = sorted(zip(top_values, top_it), key=lambda x: x[0])
    pop_res = [int(index) for _, index in reversed(pairs)]

    return pop_res
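
The same idea can also be sketched in vectorised form by masking out the items the user has already seen before ranking. The function name `rec_top_k_pop_masked` and its behaviour in edge cases are assumptions: unlike the loop above, it may return items nobody has interacted with, and it returns fewer than `top_k` items when the user has too few unseen items left.

```python
import numpy as np

def rec_top_k_pop_masked(inter_matrix, user, top_k):
    """Hypothetical vectorised variant; name and tie-breaking are assumptions."""
    popularity = inter_matrix.sum(axis=0).astype(float)
    # Items the user already interacted with must never be recommended
    popularity[inter_matrix[user] > 0] = -np.inf
    # Descending popularity, ties keep the lower item index first
    ranked = np.argsort(-popularity, kind='stable')
    # Drop the masked (seen) items from the ranking
    ranked = ranked[np.isfinite(popularity[ranked])]
    return ranked[:top_k].tolist()

# Made-up binary interaction matrix: 4 users x 5 items
toy = np.array([[1, 0, 1, 0, 0],
                [1, 0, 1, 0, 1],
                [0, 0, 1, 0, 0],
                [1, 0, 1, 1, 0]])

# User 2 has only seen item 2, the most popular one, so it is skipped
print(rec_top_k_pop_masked(toy, user=2, top_k=2))  # [0, 3]
```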


### Final check
* Remove all the code you don't need;
* In your notebook window, do [Kernel] -> [Restart & Run All];
* Make sure the cell below prints sensible values;
* Don't forget to rename the notebook before submission;


In [7]:
print('\n',
      'Task 1.1: ',inter_matr_binary(),'\n',
      'Task 1.2: ',_interaction_matrix_test, '\n',
      'Task 2.0: ',inter_matr_prob(),'\n',
      'Task 3.0: ',_top_pop_10, '\n',
      'Task 4.0: ',recTopKPop(_interaction_matrix_test, 0, 10))


 Task 1.1:  [[1. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]] 
 Task 1.2:  [[1. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]] 
 Task 2.0:  [[0.13043478 0.13043478 0.2173913  ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.1        0.         ... 0.         0.         0.        ]] 
 Task 3.0:  [42, 43, 51, 96, 105, 151, 12, 104, 68, 150] 
 Task 4.0:  [42, 43, 51, 96, 105, 151, 12, 104, 68, 150]


In [8]:
# Leave this cell the way it is, please.