# Installation with pip
Every dependency needed by the framework will be downloaded and installed automatically.

In [None]:
!pip install clayrs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting clayrs
  Downloading clayrs-0.4.0.tar.gz (225 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting wn~=0.0.23
  Downloading wn-0.0.23.tar.gz (31.6 MB)
Collecting pywsd~=1.2.4
  Downloading pywsd-1.2.5-py3-none-any.whl (26.9 MB)
Collecting mysql~=0.0.3
  Downloading mysql-0.0.3-py3-none-any.whl (1.2 kB)
Collecting pyaml~=21.10.1
  Downloading pyaml-21.10.1-py2.py3-none-any.whl (24 kB)
Collecting sentence-transformers~=1.2.0
  Downloading sentence-transformers-1.2.1.tar.gz (80 kB)
Collecting tqd

# **! RESTART RUNTIME !**

In [None]:
# for reproducibility but it's not perfect:
# some environment variables should be set before starting
# the python interpreter

import numpy
import random
numpy.random.seed(42)
random.seed(42)
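
As a small illustration of what seeding provides, the sketch below reseeds Python's `random` module and checks that the same sequence is drawn twice. The helper name `draw_sample` is hypothetical. The environment variables mentioned in the comment above plausibly include `PYTHONHASHSEED`, which only takes effect if it is set before the interpreter starts; that reading is an assumption.

```python
import random


def draw_sample(seed, n=5):
    """Seed the global generator and draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 100) for _ in range(n)]


first_run = draw_sample(42)
second_run = draw_sample(42)

# Same seed, same sequence: this is what makes a rerun repeatable.
print(first_run == second_run)
```

Seeding only covers the generators that are explicitly seeded; any library with its own source of randomness needs its own seed.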

# Correct order log and prints for IPython
This is necessary only for IPython environments (Colab, Jupyter, etc.), since they can mess up the order of `print` and `logging` output

```python
# EXAMPLE of the issue
import logging
print("Should go first")
logging.warning("Should go second")

WARNING:root:Should go second
Should go first
```

In [None]:
import functools
print = functools.partial(print, flush=True)
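
The ordering problem likely arises because `print` writes to standard output while `logging` writes to standard error by default, and the two streams are flushed independently. The sketch below is an illustration rather than part of the framework setup: it sends both to a single in-memory stream to show that the order is preserved when they share one destination. The logger name `order_demo` is hypothetical.

```python
import io
import logging

# A single shared stream: when print() and logging write to the same
# stream, their relative order is preserved.
shared_stream = io.StringIO()

logger = logging.getLogger("order_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the message away from the root logger
handler = logging.StreamHandler(shared_stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(handler)

print("Should go first", file=shared_stream, flush=True)
logger.warning("Should go second")

captured = shared_stream.getvalue()
print(captured, end="")
```

The `flush=True` approach used in the cell above is simpler in a notebook, since it leaves the logging configuration untouched.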

# Import and datasets download

The framework is made of three modules:
> 1.   Content Analyzer
> 2.   Recommender System
> 3.   Evaluation

We import every module as a library and use classes and methods by using the dot notation:

In [None]:
from clayrs import content_analyzer as ca
from clayrs import recsys as rs
from clayrs import evaluation as eva

# Usage:
# ...
# ca.Ratings()
# rs.ContentBasedRS()
# eva.EvalModel()
# ...

We use **Movielens 100k** as the dataset, with item information expanded with data from IMDb

***POSSIBLE TO DO***: custom class with several built-in toy datasets?

In [None]:
import requests

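def dl_file(url, output):
    # the timeout avoids hanging forever on an unresponsive server
    r = requests.get(url, allow_redirects=True, timeout=60)
    # fail loudly instead of saving an HTTP error page as the dataset
    r.raise_for_status()
    with open(output, "wb") as handler:
        handler.write(r.content)

    print(f"Downloaded {output}!")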

# Dataset: Movielens-100k

# download items_info
url_items_info = "https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/items_info.json"
dl_file(url_items_info, "items_info.json")

# download users_info
url_users_info = "https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/users_info.csv"
dl_file(url_users_info, "users_info.csv")

# download ratings
url_ratings = "https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/ratings.csv"
dl_file(url_ratings, "ratings.csv")

Downloaded items_info.json!
Downloaded users_info.csv!
Downloaded ratings.csv!


### Check items file
In this example, the file containing items info is a JSON where every entry corresponds to a movie.

Each movie comes with several pieces of information, such as *genres, directors, cast, etc.*
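
Since the file is plain JSON, each entry can also be loaded as a Python dictionary instead of being read line by line. The sketch below is self-contained: it writes a one-entry sample (a shortened copy of the Toy Story record, in a hypothetical file named `items_sample.json`) to a temporary directory and parses it back.

```python
import json
import os
import tempfile

# A shortened, illustrative entry in the same shape as items_info.json.
sample_items = [
    {
        "movielens_id": "1",
        "title": "Toy Story",
        "genres": "Animation, Adventure, Comedy, Family, Fantasy",
        "year": "1995",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "items_sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample_items, f)

    # json.load parses the whole file into a list of dictionaries.
    with open(sample_path, "r", encoding="utf-8") as f:
        items = json.load(f)

first_item = items[0]
# Multi-valued fields are stored as comma-separated strings.
genres = [g.strip() for g in first_item["genres"].split(",")]
print(first_item["title"], genres)
```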

In [None]:
with open("items_info.json", "r") as f:
  # 25 lines but in these 25 lines there are only 2 entries:
  # 'Toy Story', and 'Golden Eye'
  for _ in range(25):
    print(f.readline(), end='')


[
    {
        "movielens_id": "1",
        "imdb_id": "0114709",
        "title": "Toy Story",
        "plot": "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
        "genres": "Animation, Adventure, Comedy, Family, Fantasy",
        "year": "1995",
        "rating": "8.3",
        "directors": "John Lasseter",
        "cast": "Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn, John Ratzenberger, Annie Potts, John Morris, Erik von Detten, Laurie Metcalf, R. Lee Ermey, Sarah Rayne, Penn Jillette, Jack Angel, Spencer Aste, Greg Berg, Lisa Bradley, Kendall Cunningham, Debi Derryberry, Cody Dorkin, Bill Farmer, Craig Good, Gregory Grudt, Danielle Judovits, Sam Lasseter, Brittany Levenbrown, Sherry Lynn, Scott McAfee, Mickie McGowan, Ryan O'Donohue, Jeff Pidgeon, Patrick Pinney, Phil Proctor, Jan Rabson, Joe Ranft, Andrew Stanton, Shane Sweet, Wayne Allwine, Tony Anselmo, Jonathan Benair, Anthony Burch, 

### Check users file
In this example, the file containing users info is a CSV file where the first column is the *user id*, while the other columns are side information for that user (*age, gender, occupation, zip code*)

In [None]:
with open("users_info.csv", "r") as f:

  # print the header and the first 2 entries
  for _ in range(3):
    print(f.readline(), end='')

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
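
To access the side information by column name rather than by position, the rows can be read with the standard `csv` module. The sketch below uses the three lines printed above as an in-memory sample, so it does not depend on the downloaded file.

```python
import csv
import io

# Header and first two entries, copied from the output above.
sample_csv = (
    "user_id,age,gender,occupation,zip_code\n"
    "1,24,M,technician,85711\n"
    "2,53,F,other,94043\n"
)

# DictReader uses the header row as keys; every value is read as a string.
reader = csv.DictReader(io.StringIO(sample_csv))
users = {row["user_id"]: row for row in reader}

print(users["1"]["occupation"])
```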


<a name="cell-id"></a>
### Check ratings
In this example, the file containing the interactions between the users and the movies is a CSV, where every interaction is a rating on the **[1, 5]** Likert scale

In [None]:
import pandas as pd

pd.read_csv('ratings.csv')

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156
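
A quick sanity check on the ratings is to confirm that every score falls on the [1, 5] scale and to look at how the scores are distributed. The sketch below does this on the first five rows shown above; the header layout (no separate index column in the file itself) is an assumption.

```python
import csv
import io
from collections import Counter

# First five interactions from the output above (header layout assumed).
sample_ratings = (
    "user_id,item_id,rating,timestamp\n"
    "196,242,3,881250949\n"
    "186,302,3,891717742\n"
    "22,377,1,878887116\n"
    "244,51,2,880606923\n"
    "166,346,1,886397596\n"
)

rows = list(csv.DictReader(io.StringIO(sample_ratings)))
scores = [int(row["rating"]) for row in rows]

# Every rating should lie on the [1, 5] Likert scale.
all_in_scale = all(1 <= s <= 5 for s in scores)
distribution = Counter(scores)

print(all_in_scale, sorted(distribution.items()))
```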


# Content Analyzer: representation of Items
In order to define the *item representation*, the following parameters should be defined:
*   ***source***: the path of the file containing items info
*   ***id***: the field that uniquely identifies an item
*   ***output_directory***: the path where serialized representations are saved



In [None]:
# Configuration of item representation 
movies_ca_config = ca.ItemAnalyzerConfig(
    source=ca.JSONFile('items_info.json'),
    id='movielens_id',
    output_directory='movies_codified/',
)

<a name="ca_id"></a>
Each item can be represented using a set of fields.
Every field can be **represented** using several techniques, such as *'tfidf'*, *'entity linking'*, *'embeddings'*, etc.

It is possible to process the content of each field using a **Natural Language Processing (NLP) pipeline**.  
It is also possible to assign a **custom id** for each generated representation, in order to allow a simpler reference in the recommendation phase. Both NLP pipeline and custom id are optional parameters.

> In the following example, we process the *'plot'* field by performing **lemmatization** and **stopwords removal**, and we represent it using **tfidf**.

In [None]:
movies_ca_config.add_single_config(
    'plot',
    ca.FieldConfig(ca.SkLearnTfIdf(),
                   preprocessing=ca.NLTK(stopwords_removal=True, lemmatization=True),
                   id='tfidf')  # Custom id
)
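
To give an intuition of what a tf-idf representation contains, here is a minimal pure-Python sketch on a toy corpus. It uses the smoothed idf and L2 normalisation that scikit-learn documents as its defaults; whether `ca.SkLearnTfIdf` keeps those defaults is an assumption, and the toy documents are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus, already tokenised and lower-cased (illustrative only).
corpus = [
    ["cowboy", "doll", "toy", "boy"],
    ["spy", "agent", "weapon"],
    ["toy", "robot", "boy", "boy"],
]

n_docs = len(corpus)
# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in corpus for word in set(doc))


def tfidf_vector(doc):
    """Return an L2-normalised tf-idf vector as {word: weight}."""
    counts = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}


vectors = [tfidf_vector(doc) for doc in corpus]
first = vectors[0]
print(first)
```

Words that appear in fewer documents (here `cowboy` and `doll`) receive a higher weight than words shared across documents (`toy`, `boy`), and words absent from a document have no entry at all.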

At the end of the configuration step, we provide the configuration to the *'Content Analyzer'* and call the `fit()` method:

*   The Content Analyzer will **represent** and **serialize** every item.



In [None]:
ca.ContentAnalyzer(config=movies_ca_config).fit()

INFO - ***********   Processing field: plot   ***********
INFO - Computing tf-idf with SkLearnTfIdf
Serializing contents:  100%|██████████| 1682/1682 [00:08<00:00]


Let's load one of the processed items to see the output of the Content Analyzer!

In [None]:
# Configuration of item representation 
movies_ca_config = ca.ItemAnalyzerConfig(
    source=ca.JSONFile('items_info.json'),
    id='movielens_id',
    output_directory='movies_codified_duplicate/',
)

movies_ca_config.add_single_config(
    'plot',
    ca.FieldConfig(ca.SkLearnTfIdf(),
                   preprocessing=ca.NLTK(stopwords_removal=True, lemmatization=True),
                   id='tfidf')  # Custom id
)

ca.ContentAnalyzer(config=movies_ca_config).fit()

INFO - ***********   Processing field: plot   ***********
INFO - Computing tf-idf with SkLearnTfIdf
Serializing contents:  100%|██████████| 1682/1682 [00:09<00:00]


In [None]:
from clayrs.utils import load_content_instance

item = load_content_instance("movies_codified", "1")

print(item)

Content: 1

Exogenous representations:

No representation found for the Content!


Field: plot 
                                                            representation
internal_id external_id                                                   
0           tfidf          (0, 1024)\t0.1719365702112526\n  (0, 1729)\t...

##############################


As we can see we have **no** exogenous representation (as expected) and **one** representation for the `plot` field!

* The tfidf is saved as a *scipy sparse matrix*, where each column index (the second number in each pair) corresponds to a word in the dictionary of the corpus!

Let's see more closely the representation for *content 1*:

In [None]:
item.get_field_representation("plot", "tfidf")

  (0, 1024)	0.1719365702112526
  (0, 1729)	0.31336848582007987
  (0, 2160)	0.3010691777558571
  (0, 2721)	0.25782007590969463
  (0, 3752)	0.2915290932564375
  (0, 4769)	0.1484604944785959
  (0, 5439)	0.31336848582007987
  (0, 5932)	0.2618948840931845
  (0, 6417)	0.3307033869191101
  (0, 6655)	0.3307033869191101
  (0, 6853)	0.25410006749357394
  (0, 6909)	0.2445599829941543
  (0, 6941)	0.31336848582007987
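
Each line of the output above pairs a `(row, column)` coordinate with a weight, and only non-zero weights are stored. The sketch below mimics that layout with a plain dictionary (a stand-in, not the scipy data structure itself) and maps column indices back to words. The vocabulary and the weights are hypothetical.

```python
# Hypothetical vocabulary: word -> column index (the real one is built from the corpus).
vocabulary = {"boy": 0, "cowboy": 1, "doll": 2, "room": 3, "toy": 4}

# One sparse row stored as (row, column) -> weight; zero entries are simply absent.
sparse_row = {(0, 1): 0.62, (0, 2): 0.62, (0, 4): 0.47}

# Invert the vocabulary to translate column indices back into words.
index_to_word = {index: word for word, index in vocabulary.items()}
weights_by_word = {index_to_word[col]: weight for (_, col), weight in sparse_row.items()}

# The equivalent dense row has one slot per vocabulary word.
dense_row = [sparse_row.get((0, col), 0.0) for col in range(len(vocabulary))]

print(weights_by_word)
print(dense_row)
```

With a vocabulary of several thousand words and only a dozen non-zero weights per plot, the sparse layout saves most of the storage a dense row would need.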

# [Optional] Content Analyzer: representation of Users
To define the *'user representation'*, we could follow the same process used for the *'item representation'*. In this case we don't need a complex representation of users, so this step is completely optional.

In this example, the ID for users is the column `user_id`.

In [None]:
#Configuration of user representation
users_ca_config = ca.UserAnalyzerConfig(
    ca.CSVFile('users_info.csv'),
    id='user_id',
    output_directory='users_codified/',
)

# Since no complex representation for users is needed, the fit() method is called immediately
ca.ContentAnalyzer(config=users_ca_config).fit()

Serializing contents:  100%|██████████| 943/943 [00:03<00:00]


Again, let's load one of the serialized users and let's check that it doesn't hold any complex representation:

In [None]:
user = load_content_instance("users_codified", "1")

print(user)

Content: 1

Exogenous representations:

No representation found for the Content!

Field representations:

No representation found for the Content fields!
##############################


# Recommender System: centroid vector algorithm

The Recommender System module needs information about users, items and ratings. 

The **Ratings** class imports ratings from a source file (or from an existing DataFrame) into a custom object. **If** the source file contains users (U), items (I) and ratings (R) in this order, no additional parameters are needed; **otherwise** the mapping must be explicitly specified using:

*   **'user_id'** column,
*   **'item_id'** column,
*   **'score'** column





In [None]:
ratings = ca.Ratings(ca.CSVFile('ratings.csv'))

print(ratings)

Importing ratings:  100%|██████████| 100000/100000 [00:00<00:00]


      user_id item_id  score
0         196     242    3.0
1         196     393    4.0
2         196     381    4.0
3         196     251    3.0
4         196     655    5.0
...       ...     ...    ...
99995     941     919    5.0
99996     941     273    3.0
99997     941       1    5.0
99998     941     294    4.0
99999     941    1007    4.0

[100000 rows x 3 columns]


In [None]:
# (mapping by index) EQUIVALENT:
#
# ratings = ca.Ratings(
#     ca.CSVFile('ratings.csv'),
#     user_id_column=0,
#     item_id_column=1,
#     score_column=2
# )
#
# print(ratings)

In [None]:
# (mapping by column name) EQUIVALENT:

# ratings = ca.Ratings(
#     ca.CSVFile('ratings.csv'),
#     user_id_column='user_id',
#     item_id_column='item_id',
#     score_column='rating'
# )
#
# print(ratings)

The Recommender System also needs an algorithm that ranks items, or predicts their scores, for each user. In the following example we use the **CentroidVector** algorithm:

*   It computes the centroid vector of the features of items *liked by the user*
*   It computes the similarity between the centroid vector and unrated items

The items liked by a user are those with a rating greater than or equal to a specific **threshold**. If the threshold is not specified, the average of all the ratings given by the user is used.
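
The two steps can be sketched in a few lines of plain Python. This is an illustration of the idea under stated assumptions, not the ClayRS implementation: the item vectors, item ids and ratings are hypothetical, and the threshold falls back to the user's mean rating.

```python
import math

# Hypothetical item vectors (e.g. tf-idf of the plot), keyed by item id.
item_vectors = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.8, 0.2, 0.0],
    "C": [0.0, 1.0, 0.0],
    "D": [0.9, 0.1, 0.0],
    "E": [0.0, 0.1, 0.9],
}
# Ratings given by one user; items D and E are unrated.
user_ratings = {"A": 5, "B": 4, "C": 1}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# No explicit threshold: fall back to the user's mean rating.
threshold = sum(user_ratings.values()) / len(user_ratings)
liked = [item for item, score in user_ratings.items() if score >= threshold]

# Centroid = component-wise mean of the liked item vectors.
centroid = [sum(col) / len(liked) for col in zip(*(item_vectors[i] for i in liked))]

# Rank unrated items by similarity to the centroid, most similar first.
unrated = [item for item in item_vectors if item not in user_ratings]
ranking = sorted(unrated, key=lambda item: cosine(centroid, item_vectors[item]), reverse=True)

print(liked, ranking)
```

Item `D` ranks first because its vector points in almost the same direction as the centroid of the liked items, while `E` shares almost no features with them.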

The Recommender System leverages the representations defined by the Content Analyzer. In the current example, we use the representation of the field 'plot'. More representations could be adopted for a single field.


```python
# Example with multiple representations for a single field
{
  'plot': ['tfidf', 'word_embedding'],
  'genre': 'doc_embedding',
  ...
}
```

Representations can be referenced using the **external id** (if specified, see [here](#ca_id)) or the **internal id**:


```
For the field 'plot':
First representation created -> internal_id = 0
Second representation created -> internal_id = 1
...
Nth representation created -> internal_id = n-1
```

In [None]:
centroid_vec = rs.CentroidVector(
    {'plot': 'tfidf'},  # EQUIVALENT TO {'plot': 0}
    similarity=rs.CosineSimilarity()
)

# no threshold parameter specified, the average rating given by
# the user will be used

Before we can instantiate the recommender system, we need to split the dataset: let's perform a **KFold with 2 splits**

*   The partitioning module outputs two lists: one containing the train sets (**two**, in this case), the other containing the test sets (**two**, in this case)





In [None]:
train_list, test_list = rs.KFoldPartitioning(n_splits=2, random_state=42).split_all(ratings)

Performing KFoldPartitioning:  100%|██████████| 943/943 [00:00<00:00]
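
The progress bar counts 943 steps, one per user, which suggests that the split is applied to each user's ratings separately; that reading is an assumption. Under it, the idea behind a 2-fold split for a single user can be sketched as follows, with the hypothetical helper `kfold_split` and a handful of item ids as sample data.

```python
import random


def kfold_split(items, n_splits=2, seed=42):
    """Yield (train, test) pairs; each item lands in a test set exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into n_splits folds, round-robin.
    folds = [shuffled[i::n_splits] for i in range(n_splits)]
    for i, test_fold in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test_fold


# Item ids rated by a single user (sample data).
rated_items = ["242", "393", "381", "251", "655"]

toy_train_list, toy_test_list = [], []
for train, test in kfold_split(rated_items):
    toy_train_list.append(train)
    toy_test_list.append(test)

print(len(toy_train_list), len(toy_test_list))
```

As with `train_list` and `test_list`, the element at position `i` of one list pairs with the element at position `i` of the other.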


The Recommender System needs the following parameters: the recommendation algorithm, the train set, and the path of the items serialized by the Content Analyzer:

*   At the moment let's use the first train set



In [None]:
first_train = train_list[0]

cbrs = rs.ContentBasedRS(centroid_vec, first_train, 'movies_codified/')

Now the ***cbrs*** must be fit before we can compute the rank:

*   We could do this in two separate steps, by first calling the `fit(...)` method and then the `rank(...)` method

*   Or by calling the `fit_rank(...)` method directly, which performs both in one step

We use the second approach and compute the **top-3** items for *user 8*, *user 2* and *user 1*.

*   The first test set produced by the split is used



In [None]:
first_test = test_list[0]

rank = cbrs.fit_rank(first_test, user_id_list=['8', '2', '1'], n_recs=3)

INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 2:  100%|██████████| 3/3 [00:00<00:00]


Let's print the rank just computed

In [None]:
print(rank)

  user_id item_id     score
0       8     174  0.114191
1       8     229  0.110301
2       8      89  0.082602
3       1      24  0.154034
4       1      74  0.129982
5       1     246  0.124737
6       2     297  0.218399
7       2     305  0.069739
8       2     285  0.068666


Let's now compute the **top-10** rank for all users of our train set, using both train sets and both test sets obtained with the KFold technique

*   We will save the two computed ranks in a list, and we will evaluate them in the next step

To compute a rank for all users, simply omit the *user_id_list* parameter

In [None]:
result_list = []

for train_set, test_set in zip(train_list, test_list):
  
  cbrs = rs.ContentBasedRS(centroid_vec, train_set, 'movies_codified/')
  rank_to_append = cbrs.fit_rank(test_set)  # by default n_recs=10

  result_list.append(rank_to_append)

INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 842:  100%|██████████| 943/943 [00:28<00:00]
INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 842:  100%|██████████| 943/943 [00:27<00:00]


# Evaluation module

Recommendations can be evaluated using several metrics. In the following example, we use:

*   ***Precision***
*   ***Recall***
*   ***F1 - computed using macro average***
*   ***F1 - computed using micro average***

The Evaluation module needs the following parameters:

*   A list of computed ranks/predictions (a list, since multiple splits may need to be evaluated)
*   A list of truths (one per split)
*   A list of metrics to compute

The list of computed ranks/predictions and the list of truths must have the same length: the rank/prediction in position $i$ is compared with the truth in position $i$

In [None]:
em = eva.EvalModel(
    result_list,
    test_list,
    metric_list=[
        eva.Precision(),  # by default sys_average='macro'
        eva.Recall(),     # by default sys_average='macro'
        eva.FMeasure(sys_average='macro'),
        eva.FMeasure(sys_average='micro')
    ]
)
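
The difference between the two F1 averages can be illustrated with the usual definitions: the macro average computes F1 per user and then averages, while the micro average pools hits and misses over all users before computing a single F1. The sketch below follows those textbook definitions on hypothetical data; the exact conventions used by ClayRS (for example how relevant items are selected) may differ.

```python
# Hypothetical top-3 recommendations and relevant test items for two users.
recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": {"a", "b", "x", "y"}, "u2": {"d"}}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


per_user_f1 = []
tp_total = fp_total = fn_total = 0
for user, recs in recommended.items():
    tp = len(set(recs) & relevant[user])   # recommended and relevant
    fp = len(recs) - tp                    # recommended but not relevant
    fn = len(relevant[user]) - tp          # relevant but not recommended
    per_user_f1.append(f1(tp / (tp + fp), tp / (tp + fn)))
    tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn

# Macro: average of the per-user scores.
macro_f1 = sum(per_user_f1) / len(per_user_f1)
# Micro: one score computed from the pooled counts.
micro_f1 = f1(tp_total / (tp_total + fp_total), tp_total / (tp_total + fn_total))

print(round(macro_f1, 4), round(micro_f1, 4))  # 0.5357 0.5455
```

The macro average gives every user the same weight, whereas the micro average gives more weight to users with many relevant items.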

The `fit()` method returns two pandas DataFrames: the first one contains the metrics aggregated for the system, while the second contains the metrics computed for each user (where possible)

In [None]:
sys_result, users_result = em.fit()

INFO - Performing evaluation on metrics chosen
Performing F1 - micro:  100%|██████████| 4/4 [00:01<00:00]


For the DataFrame which contains system results, the results are also grouped by splits

In [None]:
sys_result

user_id,Precision - macro,Recall - macro,F1 - macro,F1 - micro
sys - fold1,0.571686,0.397652,0.405851,0.294872
sys - fold2,0.579745,0.405215,0.410138,0.298678
sys - mean,0.575716,0.401433,0.407995,0.296775


In [None]:
users_result

user_id,Precision - macro,Recall - macro,F1 - macro,F1 - micro
1,0.65,0.079645,0.141902,0.141902
10,0.30,0.108225,0.157539,0.157539
100,0.50,0.285714,0.360215,0.360215
101,0.50,0.361111,0.415584,0.415584
102,0.45,0.074631,0.128019,0.128019
...,...,...,...,...
95,0.55,0.075758,0.133091,0.133091
96,0.70,0.502564,0.584348,0.584348
97,0.70,0.324561,0.442191,0.442191
98,0.55,0.845238,0.665441,0.665441


# Your turn!

## Different partitioning technique and eval metrics

1. Try to apply the `Bootstrap` partitioning technique instead of the `KFold` as seen in the example and compute the top-10 recommendation for each user
2. Evaluate the recs generated with `Precision@1`, `MAP`, `Catalog coverage`

### Answer to 1

### Answer to 2

## Different recommendation algorithm

Try to apply a **classifier** as the recommendation algorithm and find one that lets you surpass the `0.6` precision wall

* Again, use the `Bootstrap` partitioning technique and evaluate the produced recommendations with the metrics used in the [previous point](#scrollTo=27THl2vFT64U) (`Precision@1`, `MAP`, `Catalog Coverage`)

### Answer

## Different representation of contents

1. Represent the *plot* field of each item by training the `Word2Vec` model on the whole corpus, after it has been preprocessed by **removing stopwords**, **punctuation** and after applying **lemmatization**
2. Represent the *genres* field by using the pre-trained model **glove-wiki-gigaword-100** provided by the *Gensim* library. **Remove punctuation** as preprocessing operation
3. Compute recommendations using both the representation for the *plot* field and the representation for the *genres* field with the **best classifier** you found in the [previous point](#scrollTo=3QXMG9w9T64V) and evaluate them on the usual metrics (`Precision@1`, `MAP`, `Catalog coverage`)

### Answer to 1 and 2

### Answer to 3