# Installation with pip
Every dependency needed by the framework is downloaded and installed automatically.

In [None]:
!pip install clayrs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting clayrs
  Downloading clayrs-0.4.0.tar.gz (225 kB)
     |████████████████████████████████| 225 kB 12.3 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting transformers~=4.15.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
     |████████████████████████████████| 3.4 MB 58.1 MB/s 
Collecting distex~=0.7.1
  Downloading distex-0.7.2-py3-none-any.whl (19 kB)
Collecting wn~=0.0.23
  Downloading wn-0.0.23.tar.gz (31.6 MB)
     |████████████████████████████████| 31.6 MB 127 kB/s 
Collecting sentence-transformers~=1.2.0
  Downloading sentence-transformers-1.2.1.tar.gz (80 kB)
     |████████████████████████████████| 80 kB 9.2 MB/s 
Collecting mysql~=0.0.3
  Downloading mysql-0.0.3-py3-none-any.whl (1.2 kB)
Collecting my

# **! RESTART RUNTIME !**

In [None]:
# Seed the random number generators for reproducibility.
# This is not perfect: some environment variables should be set
# before starting the Python interpreter

import numpy
import random
numpy.random.seed(42)
random.seed(42)
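
The comments in the cell above mention environment variables that must be set before the interpreter starts. One such variable is `PYTHONHASHSEED` (an assumption about what the comment refers to, not something stated in this notebook). The following is a minimal sketch, with a hypothetical helper name `seed_everything`, that seeds the available generators and reports whether that variable is set:

```python
import os
import random


def seed_everything(seed=42):
    """Seed the stdlib RNG and, if it is installed, NumPy's global RNG."""
    random.seed(seed)
    try:
        import numpy  # optional: skipped when NumPy is not available
        numpy.random.seed(seed)
    except ImportError:
        pass
    # PYTHONHASHSEED only takes effect if it is set before Python starts,
    # so here it can only be inspected, not fixed after the fact.
    return os.environ.get("PYTHONHASHSEED")


seed_everything(42)
first_draw = [random.random() for _ in range(3)]
seed_everything(42)
second_draw = [random.random() for _ in range(3)]
print(first_draw == second_draw)
```

Re-seeding with the same value reproduces the same sequence of draws, which is what the seeding cell relies on.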

# Correct ordering of logs and prints in IPython
This is necessary only in IPython environments (Colab, Jupyter, etc.), since they can display the output of `print` and `logging` in the wrong order.

```python
# EXAMPLE of the issue
import logging
print("Should go first")
logging.warning("Should go second")

# Output in IPython (wrong order):
# WARNING:root:Should go second
# Should go first
```

In [None]:
import functools
print = functools.partial(print, flush=True)
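
A possible alternative to flushing `print` is to send log records to the same stream that `print` writes to, so that both are emitted in call order. The sketch below is only an illustration of that idea: it uses an in-memory buffer as a stand-in for `sys.stdout`, and the logger name `order_demo` is hypothetical.

```python
import io
import logging

# Stand-in for sys.stdout so the result can be inspected; in a notebook
# one would pass sys.stdout to StreamHandler instead.
shared_stream = io.StringIO()

logger = logging.getLogger("order_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the root logger's handlers out of the picture

handler = logging.StreamHandler(shared_stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger.addHandler(handler)

print("Should go first", file=shared_stream, flush=True)
logger.warning("Should go second")

captured_lines = shared_stream.getvalue().splitlines()
```

Because both writers share one stream, `captured_lines` keeps the order in which the two calls were made.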

# Import and datasets download

The framework is made of three modules:
> 1.   Content Analyzer
> 2.   Recommender System
> 3.   Evaluation

We import every module as a library and use classes and methods by using the dot notation:

In [None]:
from clayrs import content_analyzer as ca
from clayrs import recsys as rs
from clayrs import evaluation as eva

# Usage:
# ...
# ca.Ratings()
# rs.ContentBasedRS()
# eva.EvalModel()
# ...

We use **MovieLens 100k** as the dataset, with item information expanded with data from IMDb.

***POSSIBLE TO DO***: custom class with several built-in toy datasets?

In [None]:
import requests

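def dl_file(url, output):
    # Download `url` and write the response body to the `output` path
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # fail early on HTTP errors instead of saving an error page
    with open(output, "wb") as handler:
        handler.write(r.content)

    print(f"Downloaded {output}!")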

# Dataset: Movielens-100k

# download items_info
url_items_info = "https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/items_info.json"
dl_file(url_items_info, "items_info.json")

# download users_info
url_users_info = "https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/users_info.csv"
dl_file(url_users_info, "users_info.csv")

# download ratings
url_ratings = "https://raw.githubusercontent.com/swapUniba/clayrs/master/datasets/ml-100k/ratings.csv"
dl_file(url_ratings, "ratings.csv")

Downloaded items_info.json!
Downloaded users_info.csv!
Downloaded ratings.csv!
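
For larger files, a variant that streams the response to disk instead of holding it in memory may be preferable. The following is a sketch using only the standard library. The name `dl_file_streamed` is hypothetical, and the usage example copies a local file through a `file://` URL so that it runs offline.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def dl_file_streamed(url, output):
    """Copy the response body to `output` in chunks instead of loading it whole."""
    with urllib.request.urlopen(url) as response, open(output, "wb") as handler:
        shutil.copyfileobj(response, handler)
    return Path(output)


# Offline usage example: a local file exposed through a file:// URL
with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "source.csv"
    source.write_text("user_id,item_id,rating\n196,242,3\n")
    target = dl_file_streamed(source.as_uri(), Path(tmp_dir) / "copy.csv")
    copied_text = target.read_text()
```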


### Check items file
In this example, the file containing item information is a JSON file in which every entry corresponds to a movie.

Each movie comes with several pieces of information, such as *genres, directors, cast, etc.*
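
Since the file is a JSON array of objects, it can also be parsed with the standard `json` module rather than read line by line. The sketch below works on an inline sample that keeps only a few fields of the first entry of `items_info.json`. With the real file, one would call `json.load` on the open file handle instead.

```python
import json

# Abbreviated copy of the first entry of items_info.json
# (only a few of its fields are kept here).
sample_json = """
[
    {"movielens_id": "1",
     "title": "Toy Story",
     "year": "1995",
     "genres": "Animation, Adventure, Comedy, Family, Fantasy"}
]
"""

movies = json.loads(sample_json)
for movie in movies:
    # Multi-valued fields are stored as comma-separated strings
    genre_list = [genre.strip() for genre in movie["genres"].split(",")]
    print(movie["movielens_id"], movie["title"], genre_list)
```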

In [None]:
with open("items_info.json", "r") as f:
  # 25 lines but in these 25 lines there are only 2 entries:
  # 'Toy Story', and 'Golden Eye'
  for _ in range(25):
    print(f.readline(), end='')


[
    {
        "movielens_id": "1",
        "imdb_id": "0114709",
        "title": "Toy Story",
        "plot": "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
        "genres": "Animation, Adventure, Comedy, Family, Fantasy",
        "year": "1995",
        "rating": "8.3",
        "directors": "John Lasseter",
        "cast": "Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn, John Ratzenberger, Annie Potts, John Morris, Erik von Detten, Laurie Metcalf, R. Lee Ermey, Sarah Rayne, Penn Jillette, Jack Angel, Spencer Aste, Greg Berg, Lisa Bradley, Kendall Cunningham, Debi Derryberry, Cody Dorkin, Bill Farmer, Craig Good, Gregory Grudt, Danielle Judovits, Sam Lasseter, Brittany Levenbrown, Sherry Lynn, Scott McAfee, Mickie McGowan, Ryan O'Donohue, Jeff Pidgeon, Patrick Pinney, Phil Proctor, Jan Rabson, Joe Ranft, Andrew Stanton, Shane Sweet, Wayne Allwine, Tony Anselmo, Jonathan Benair, Anthony Burch, 

### Check users file
In this example, the file containing user information is a CSV file where the first column is the *user id*, while the other columns are side information for that user (*age, gender, occupation, zip code*).

In [None]:
with open("users_info.csv", "r") as f:

  # print the header and the first 2 entries
  for _ in range(3):
    print(f.readline(), end='')

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
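
To access the side information by column name, the same rows can be read with `csv.DictReader`. This is a small sketch on the three lines printed above, and every value is kept as a string, which is usually what one wants for zip codes.

```python
import csv
import io

# The header and the first two entries shown above
sample_csv = """user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
"""

reader = csv.DictReader(io.StringIO(sample_csv))
# Index the rows by user id for direct lookup
users = {row["user_id"]: row for row in reader}

print(users["1"]["occupation"])
print(users["2"]["zip_code"])
```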


<a name="cell-id"></a>
### Check ratings
In this example, the file containing the interactions between the users and the movies is a CSV file, where every interaction is a rating on a **[1, 5]** Likert scale.

In [None]:
import pandas as pd

pd.read_csv('ratings.csv')

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156
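
A quick sanity check on the rating scale can be sketched as follows. It uses only the first five rows shown above (without the unnamed index column) and counts how many ratings fall on each score and whether any fall outside [1, 5].

```python
import csv
import io
from collections import Counter

# First five interactions shown above
sample_ratings = """user_id,item_id,rating,timestamp
196,242,3,881250949
186,302,3,891717742
22,377,1,878887116
244,51,2,880606923
166,346,1,886397596
"""

rows = list(csv.DictReader(io.StringIO(sample_ratings)))
scores = [int(row["rating"]) for row in rows]

out_of_scale = [score for score in scores if not 1 <= score <= 5]
distribution = Counter(scores)

print(dict(distribution), out_of_scale)
```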


# Content Analyzer: representation of Items
In order to define the *item representation*, the following parameters should be defined:
*   ***source***: the path of the file containing items info
*   ***id***: the field that uniquely identifies an item
*   ***output_directory***: the path where serialized representations are saved



In [None]:
# Configuration of item representation 
movies_ca_config = ca.ItemAnalyzerConfig(
    source=ca.JSONFile('items_info.json'),
    id='movielens_id',
    output_directory='movies_codified/',
)

<a name="ca_id"></a>
Each item can be represented using a set of fields.
Every field can be **represented** using several techniques, such as *'tfidf'*, *'entity linking'*, *'embeddings'*, etc.

It is possible to process the content of each field using a **Natural Language Processing (NLP) pipeline**.  
It is also possible to assign a **custom id** for each generated representation, in order to allow a simpler reference in the recommendation phase. Both NLP pipeline and custom id are optional parameters.

> In the following example, we process the *'plot'* field by performing **lemmatization** and **stopwords removal**, and we represent it using both **tfidf** and **sbert**, the latter producing an embedding for each sentence of the document.

In [None]:
movies_ca_config.add_multiple_config(
    'plot',
    [ca.FieldConfig(ca.SkLearnTfIdf(),
                   preprocessing=ca.NLTK(stopwords_removal=True, lemmatization=True),
                   id='tfidf'),

     ca.FieldConfig(ca.SentenceEmbeddingTechnique(ca.Sbert()),
                   preprocessing=ca.NLTK(stopwords_removal=True, lemmatization=True),
                   id='sbert'),
     ]
)
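
To give an intuition of what a *tfidf* representation contains, here is a minimal pure-Python sketch of the textbook formulation, `tf * ln(N / df)`, with a tiny stopword list. It is not what `ca.SkLearnTfIdf` or `ca.NLTK` do internally: the weighting formula and the stopword list are simplifying assumptions, lemmatization is omitted, and the plots are made-up toy sentences.

```python
import math
from collections import Counter

STOPWORDS = {"a", "the", "is", "in", "and", "of"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up toy plots
plots = [
    "a cowboy doll is jealous of a spaceman",
    "a spy chases a stolen satellite",
    "a cowboy rides in the desert",
]
weights = tfidf(plots)
```

In this toy corpus, "doll" appears in one plot only and therefore receives a higher weight in the first document than "cowboy", which appears in two.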

At the end of the configuration step, we provide the configuration to the *'Content Analyzer'* and call the `fit()` method:

*   The Content Analyzer will **represent** and **serialize** every item.



In [None]:
ca.ContentAnalyzer(config=movies_ca_config).fit()

[39mINFO[0m - ***********   Processing field: plot   ***********
[39mINFO[0m - Computing tf-idf with SkLearnTfIdf
Processing and producing contents with Sbert:  100%|██████████| 1682/1682 [04:57<00:00]
Serializing contents:  100%|██████████| 1682/1682 [00:10<00:00]


# Recsys phase and Eval phase: Experiment class

Up until now, after building a rich representation of each item, we have performed the same set of operations:
1. Split the dataset;
2. Choose the algorithm;
3. Compute the ranking for each user;
4. Evaluate the recommendations produced against a set of chosen metrics.
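
As an illustration of the first step in the list above, here is a minimal sketch of a per-user hold-out split. It only conveys the idea and is not necessarily how `rs.HoldOutPartitioning` is implemented. The function name `holdout_per_user` and the toy ratings are hypothetical.

```python
import random
from collections import defaultdict


def holdout_per_user(ratings, train_set_size=0.75, random_state=99):
    """Split (user, item, score) tuples into train and test sets, user by user."""
    rng = random.Random(random_state)
    by_user = defaultdict(list)
    for rating in ratings:
        by_user[rating[0]].append(rating)

    train, test = [], []
    for user_ratings in by_user.values():
        shuffled = user_ratings[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_set_size)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Made-up toy ratings: 8 interactions for u1 and 4 for u2
toy_ratings = [("u1", f"i{n}", 3.0) for n in range(8)] + \
              [("u2", f"i{n}", 4.0) for n in range(4)]
train, test = holdout_per_user(toy_ratings)
print(len(train), len(test))
```

Splitting per user, rather than over the whole ratings list, keeps every user represented in both the train and the test set.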

What if we want to compare **several** algorithms? Should we perform each step manually? Can we ***automate*** this process?

Yes, we can, with the ***Experiment*** class!

Let's first load the original dataset:

In [None]:
ratings = ca.Ratings(ca.CSVFile('ratings.csv'))

print(ratings)

Importing ratings:  100%|██████████| 100000/100000 [00:00<00:00]


      user_id item_id  score
0         196     242    3.0
1         196     393    4.0
2         196     381    4.0
3         196     251    3.0
4         196     655    5.0
...       ...     ...    ...
99995     941     919    5.0
99996     941     273    3.0
99997     941       1    5.0
99998     941     294    4.0
99999     941    1007    4.0

[100000 rows x 3 columns]


And now let's choose several algorithms to compare. In this case, we will compute recommendations with:

* The **KNN** classifier using only the *tfidf* representation of the *'plot'* field
* The **SVC** classifier using both the *tfidf* and *sbert* representations of the *'plot'* field
* The **LinearRegression** regressor using only the *tfidf* representation of the *'plot'* field
    * This last one can also perform *score prediction*, in addition to computing *recommendation lists*

In [None]:
alg1 = rs.ClassifierRecommender({'plot': 'tfidf'}, classifier=rs.SkKNN())
alg2 = rs.ClassifierRecommender({'plot': ['tfidf', 'sbert']}, classifier=rs.SkSVC(random_state=42))
alg3 = rs.LinearPredictor({'plot': 'tfidf'}, regressor=rs.SkLinearRegression())

The last thing to do is to instantiate the `ContentBasedExperiment` class and specify everything we want it to do!

In [None]:
rs.ContentBasedExperiment(
    ratings,
    items_directory="movies_codified",
    partitioning_technique=rs.HoldOutPartitioning(train_set_size=0.75, random_state=99),
    algorithm_list=[alg1, alg2, alg3],
    metric_list=[
        eva.PrecisionAtK(k=5),
        eva.RecallAtK(k=5),
        eva.FMeasureAtK(k=5, sys_average='micro')
    ]
).rank(methodology=rs.TestRatingsMethodology())

Performing HoldOutPartitioning:  100%|██████████| 943/943 [00:00<00:00]

[39mINFO[0m - ******* Processing alg ClassifierRecommender *******
[39mINFO[0m - Don't worry if it looks stuck at first
[39mINFO[0m - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 624:  100%|██████████| 943/943 [00:27<00:00]
[39mINFO[0m - Performing evaluation on metrics chosen
Performing F1@5 - micro:  100%|██████████| 3/3 [00:00<00:00]

[39mINFO[0m - Results saved in 'experiment_result/ClassifierRecommender_1'

[39mINFO[0m - ******* Processing alg ClassifierRecommender *******
[39mINFO[0m - Don't worry if it looks stuck at first
[39mINFO[0m - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 624:  100%|██████████| 943/943 [01:54<00:00]
[39mINFO[0m - Performing evaluation on metrics chosen
Performing F1@5 - micro:  100%|██████████| 3/3 [00:00<00:00]

[39mINFO[0m - Results saved in 'experiment_result/ClassifierRec

In [None]:
print("KNN results:")

pd.read_csv("experiment_result/ClassifierRecommender_1/eva_sys_results.csv")

KNN results:


Unnamed: 0,user_id,Precision@5 - macro,Recall@5 - macro,F1@5 - micro
0,sys - fold1,0.585154,0.390567,0.296396
1,sys - mean,0.585154,0.390567,0.296396


In [None]:
print("SVC results:")

pd.read_csv("experiment_result/ClassifierRecommender_2/eva_sys_results.csv")

SVC results:


Unnamed: 0,user_id,Precision@5 - macro,Recall@5 - macro,F1@5 - micro
0,sys - fold1,0.541888,0.360356,0.27448
1,sys - mean,0.541888,0.360356,0.27448


In [None]:
print("Linear regression results:")

pd.read_csv("experiment_result/LinearPredictor_1/eva_sys_results.csv")

Linear regression results:


Unnamed: 0,user_id,Precision@5 - macro,Recall@5 - macro,F1@5 - micro
0,sys - fold1,0.604454,0.399843,0.306172
1,sys - mean,0.604454,0.399843,0.306172
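
The result tables mix *macro* and *micro* averages. As a rough guide to reading them, the sketch below computes both on made-up toy data, using textbook definitions: macro averages the per-user metric, while micro pools the counts of all users first. The exact definitions used by ClayRS (for instance, how relevant items are determined) may differ, so this is an illustration rather than a re-implementation.

```python
def hits_at_k(ranked_items, relevant_items, k=5):
    """Count how many of the top-k recommended items are relevant."""
    return sum(1 for item in ranked_items[:k] if item in relevant_items)


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


K = 5
# Made-up toy data: user -> (ranked recommendations, relevant test items)
toy_users = {
    "u1": (["a", "b", "c", "d", "e"], {"a", "c", "z"}),
    "u2": (["f", "g", "h", "i", "j"], {"f"}),
}

# Macro: compute the metric per user, then average over users
precisions = [hits_at_k(r, rel, K) / K for r, rel in toy_users.values()]
recalls = [hits_at_k(r, rel, K) / len(rel) for r, rel in toy_users.values()]
macro_precision = sum(precisions) / len(precisions)
macro_recall = sum(recalls) / len(recalls)

# Micro: pool the counts of all users, then compute the metric once
total_hits = sum(hits_at_k(r, rel, K) for r, rel in toy_users.values())
micro_precision = total_hits / (K * len(toy_users))
micro_recall = total_hits / sum(len(rel) for _, rel in toy_users.values())
micro_f1 = f1(micro_precision, micro_recall)

print(f"macro P@{K}={macro_precision:.3f}  "
      f"macro R@{K}={macro_recall:.3f}  micro F1@{K}={micro_f1:.3f}")
```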


# Your turn!

## Experiment report

Have the `Experiment` class also produce the *report*, in which all important parameters are saved to ensure replicability.

### Answer

## Statistical significance

Compute the **Ttest** on the per-user evaluation results of the three content-based algorithms.
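
As background for this exercise, the sketch below shows what a *paired* t statistic measures when two algorithms are scored on the same users. It assumes paired samples and uses made-up scores. The test variant applied by the framework may differ, and the sketch does not use the ClayRS API.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (stdev(d) / sqrt(n)), with d the per-user differences."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    # Note: undefined when all differences are identical (stdev is zero)
    return statistics.mean(differences) / (statistics.stdev(differences) / math.sqrt(n))


# Made-up per-user Precision@5 values for two hypothetical algorithms
alg_a = [0.6, 0.4, 0.8, 0.6]
alg_b = [0.4, 0.4, 0.6, 0.4]
t_value = paired_t_statistic(alg_a, alg_b)
print(round(t_value, 3))
```

A larger absolute value of the statistic indicates that the per-user differences are consistently in one direction relative to their spread.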

### Answer