# Introduction

In this notebook, we explore using the library [LibRecommender](https://librecommender.readthedocs.io/) to train a recommendation model on top of the dataset.

## The dataset

We'll use the [WordBank](http://wordbank.stanford.edu/) datasets.

We'll use these datasets the *Full Child-by-Word* and *By-Child Summary* datasets.

We won't use the *By-Word Summary* dataset.

## Full Child-by-Word

Contains the words spoken by each child.
 
It is available in http://wordbank.stanford.edu/data?name=instrument_data.

## By-Child Summary

Contains information about each child, such as *age* and *gender*.
 
It is available in http://wordbank.stanford.edu/data?name=admin_data.

## What is the LibRecommender library

The [LibRecommender](https://librecommender.readthedocs.io/) library contains models that can be trained for recommendation tasks.

This is what is stated on the library's webpage:

> LibRecommender is an easy-to-use recommender system focused on end-to-end recommendation process. It contains a training(libreco) and serving(libserving) module to let users quickly train and deploy different kinds of recommendation models.

## Types of models in LibRecommender

There are two types of models provided by LibRecommender:

1. Pure models: These models take into account only the interactions between users and items. They are based on the [Collaborative Filtering paper](https://ieeexplore.ieee.org/document/7176109).
2. Feat models: These models also take into account features of items and users. Examples are (1) the *age* of the user, and (2) genre of a movie.

There are multiple algorithms available at the library, such as [LightGCN](https://arxiv.org/pdf/2002.02126.pdf) (a *pure* model) and [Wide & Deep](https://arxiv.org/pdf/1606.07792.pdf) (a *feat* model).

# Modeling the task

The task to be implemented is to recommend words for a child given a set of words.

For example: Given that my child speaks the words "ball", "dad", and "mom", what other words would the system recommend for this child to also learn so that their cognitive development meets the learning curve of words for children.

Recommendation models use the terms *item* and *user*, which in our case are respectively *words* and *children*.

## Restriction: New children recommendations

A characteristic of recommendation systems is that they're trained offline and then used for inference. They calculate all recommended items for a given user. Calculating the recommendations is a slow and compute-intensive process. These systems are usually trained every few days (at night) and are used the following day.

The problem is that in our task we want to provide recommendations for a child given a set of words. This set of words won't be in the training dataset, i.e., it is a new user with new items, which the recommendation system hasn't seen yet.

Let's give an example so that the problem is more clear:

| child id | word  |
|----------|-------|
| 1        | daddy |
| 1        | mommy |
| 1        | ball  |
| 1        | love  |
| 2        | daddy |
| 2        | mommy |

In this example we can provide recommendations for both children `1` and `2`. However, if we want to give a recommendation for a children that speaks the words `daddy`, `mommy` and `ball`, we can't. This is because the new child speaks a different set of words than all other children. 

This is similar, but not the same, as the **cold-start problem**. It is different because a cold-start problem means that we don't have data about the new user yet, so we need to "initialize" the interactions between `user` and `item`. In our case we **have** data about the new user, however, it is not yet available to the recommendation system because it hasn't yet been trained on the new data.

The **solution** we found is to retrain the model with only the new data for the new child. The LibRecommender library makes this possible, so this can be also be thought as "fine-tuning". The [documentation](https://librecommender.readthedocs.io/en/latest/user_guide/model_retrain.html) of LibRecommender explains how this can be done. The only "hack" we had to do was to create a "fake" child (with id `-1`) so that we can "reuse" this user when retraining the model.

## Modeling with a "pure" algorithm

Given that these model only the interactions between child/words, it is a matter of preparing the dataset with the right column names and types and feed it to LibRecommender.

Under the hood, these algorithms prepare a matrix of (user x item) where the value of each item in this matrix is the "interaction strength".

In the classic examples of movies recommendations, the value used is the *rating* given by the user to a movie. It could also be used the number of times a user has watched a movie, or a combination of both. In our case, we don't have this information, so we user the value `1.0`.

## Modeling with a "feat" algorithm

TODO

---

# Notebook setup

Installs required libraries and add `import` statements.

We'll use Pandas to handle the dataset: data cleaning and data preparation.

In [1]:
import sys

In [2]:
sys.executable

'/home/gustavo/PycharmProjects/tici-turing/ss23-talk-a-palooza/.venv/bin/python'

In [3]:
!{sys.executable} -m pip install LibRecommender tensorflow pandas scikit-learn pyarrow ipywidgets torch

Collecting LibRecommender
  Obtaining dependency information for LibRecommender from https://files.pythonhosted.org/packages/c5/0d/7b12f6b4f6136c6d8217a2e6f0ec73129e3d1633327d45300bcd6cadb558/LibRecommender-1.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached LibRecommender-1.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (28 kB)
Collecting tensorflow
  Obtaining dependency information for tensorflow from https://files.pythonhosted.org/packages/81/16/3aaaf911d8309b9afb29bff97e819c52b011d4ab184c7b01cec92abd018a/tensorflow-2.14.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached tensorflow-2.14.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/bc/7e/a9e11bd272e3135108892b6230a115568f477864276181eada3a35d03237/pandas-2.1.1-cp39-cp39-manylinux_2_17_x86_64.manylin

In [4]:
sys.version

'3.9.18 (main, Oct  1 2023, 17:59:15) \n[GCC 13.2.1 20230801]'

In [9]:
import pandas as pd
import numpy as np

Load the Full Child-by-Word dataset.
 
We'll use the parquet format so that it is smaller on disk. This dataset size in CSV is ~500MB, while in Parquet it is ~5MB.

In [10]:
from pathlib import Path

def load_df(dataset_filename: str) -> pd.DataFrame:
    file = Path(dataset_filename)
    dataset_file_parquet = file.with_suffix(".parquet")
    if not dataset_file_parquet.exists():
        dataset_file_csv = file.with_suffix(".csv")
        if not dataset_file_csv.exists():
            # This URL might not work for everyone. If it doesn't, then download it manually by following the link available at the Introduction of this notebook.
            !curl -o {dataset_file_csv} http://52.26.82.213/instrument_data/_w_8a927d8be09f64164925949e01d5961a35c4a2199f9395eb/session/620431f5bceb2a1e6a87dc53a0ab3c20/download/download_data?w=8a927d8be09f64164925949e01d5961a35c4a2199f9395eb
        df = pd.read_csv(dataset_file_csv) 
        df.to_parquet(dataset_file_parquet)
    else:
        df = pd.read_parquet(dataset_file_parquet)
    return df

In [46]:
df = load_df("wordbank_instrument_data_full_child_by_word_englishAmerican_WS.parquet")

In [47]:
df

Unnamed: 0,downloaded,data_id,item_kind,category,item_id,item_definition,english_gloss,uni_lemma,child_id,age,value
0,2023-09-07,245518,word,sounds,item_1,baa baa,baa baa,baa baa,1,28,produces
1,2023-09-07,245518,word,sounds,item_2,choo choo,choo choo,choo choo,1,28,
2,2023-09-07,245518,word,sounds,item_3,cockadoodledoo,cockadoodledoo,cockadoodledoo,1,28,
3,2023-09-07,245518,word,sounds,item_4,grrr,grrr,grrr,1,28,produces
4,2023-09-07,245518,word,sounds,item_5,meow,meow,meow,1,28,
...,...,...,...,...,...,...,...,...,...,...,...
6057992,2023-09-07,255023,complexity,,item_793,lookit / lookit what I got,lookit / lookit what I got,,86615,23,simple
6057993,2023-09-07,255023,complexity,,item_794,where's my dolly / where's my dolly name Sam,where's my dolly / where's my dolly name Sam,,86615,23,simple
6057994,2023-09-07,255023,complexity,,item_795,we made this / me and Paul made this,we made this / me and Paul made this,,86615,23,simple
6057995,2023-09-07,255023,complexity,,item_796,I sing song / I sing song for you,I sing song / I sing song for you,,86615,23,simple


In [48]:
# Merge the training dataset with the words we want to use (wordbanks.json)

df_wordbanks = pd.read_json("wordbanks.json")
df_wordbanks["wordBankId"] = df_wordbanks["_id"].apply(lambda d: d["$oid"])
df_wordbanks.rename(columns={"name": "word"}, inplace=True)
df_wordbanks = df_wordbanks[["wordBankId", "word"]]
df = df.merge(df_wordbanks, left_on="item_definition", right_on="word")
df

Unnamed: 0,downloaded,data_id,item_kind,category,item_id,item_definition,english_gloss,uni_lemma,child_id,age,value,wordBankId,word
0,2023-09-07,245518,word,sounds,item_1,baa baa,baa baa,baa baa,1,28,produces,651de3dbf3a9be0887dd1d86,baa baa
1,2023-09-07,245519,word,sounds,item_1,baa baa,baa baa,baa baa,2,22,,651de3dbf3a9be0887dd1d86,baa baa
2,2023-09-07,245520,word,sounds,item_1,baa baa,baa baa,baa baa,3,26,produces,651de3dbf3a9be0887dd1d86,baa baa
3,2023-09-07,245521,word,sounds,item_1,baa baa,baa baa,baa baa,4,27,produces,651de3dbf3a9be0887dd1d86,baa baa
4,2023-09-07,245522,word,sounds,item_1,baa baa,baa baa,baa baa,5,19,produces,651de3dbf3a9be0887dd1d86,baa baa
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4963448,2023-09-07,255019,word,connecting_words,item_680,then,then,then,86611,22,,651de3dbf3a9be0887dd202d,then
4963449,2023-09-07,255020,word,connecting_words,item_680,then,then,then,86612,29,produces,651de3dbf3a9be0887dd202d,then
4963450,2023-09-07,255021,word,connecting_words,item_680,then,then,then,86613,22,,651de3dbf3a9be0887dd202d,then
4963451,2023-09-07,255022,word,connecting_words,item_680,then,then,then,86614,28,,651de3dbf3a9be0887dd202d,then


Drop some columns which we will not use because they're metadata, so we're not interested 

In [50]:
df.drop(["downloaded", "data_id"], axis="columns", inplace=True)
df

KeyError: "['downloaded', 'data_id'] not found in axis"

The `value` column, when `None`, means that child doesn't do anything with the specific word.
 
We'll discard the rows where this column is empty and the entire column because it doesn't add anything of value:  

In [51]:
df.dropna(subset="value", inplace=True)
df.drop(columns=["value"], inplace=True)
df

Unnamed: 0,item_kind,category,item_id,item_definition,english_gloss,uni_lemma,child_id,age,wordBankId,word
0,word,sounds,item_1,baa baa,baa baa,baa baa,1,28,651de3dbf3a9be0887dd1d86,baa baa
2,word,sounds,item_1,baa baa,baa baa,baa baa,3,26,651de3dbf3a9be0887dd1d86,baa baa
3,word,sounds,item_1,baa baa,baa baa,baa baa,4,27,651de3dbf3a9be0887dd1d86,baa baa
4,word,sounds,item_1,baa baa,baa baa,baa baa,5,19,651de3dbf3a9be0887dd1d86,baa baa
5,word,sounds,item_1,baa baa,baa baa,baa baa,6,30,651de3dbf3a9be0887dd1d86,baa baa
...,...,...,...,...,...,...,...,...,...,...
4963438,word,connecting_words,item_680,then,then,then,86601,29,651de3dbf3a9be0887dd202d,then
4963441,word,connecting_words,item_680,then,then,then,86604,30,651de3dbf3a9be0887dd202d,then
4963444,word,connecting_words,item_680,then,then,then,86607,25,651de3dbf3a9be0887dd202d,then
4963446,word,connecting_words,item_680,then,then,then,86609,29,651de3dbf3a9be0887dd202d,then


The `item_id` has this prefix `item_` which doesn't help and makes the data confusing. Let's just turn it into an `int`.

In [52]:
if df.dtypes["item_id"] != int:
    df["item_id"] = df["item_id"].apply(lambda v: int(v.removeprefix("item_")))
assert df.dtypes["item_id"] == int
df

Unnamed: 0,item_kind,category,item_id,item_definition,english_gloss,uni_lemma,child_id,age,wordBankId,word
0,word,sounds,1,baa baa,baa baa,baa baa,1,28,651de3dbf3a9be0887dd1d86,baa baa
2,word,sounds,1,baa baa,baa baa,baa baa,3,26,651de3dbf3a9be0887dd1d86,baa baa
3,word,sounds,1,baa baa,baa baa,baa baa,4,27,651de3dbf3a9be0887dd1d86,baa baa
4,word,sounds,1,baa baa,baa baa,baa baa,5,19,651de3dbf3a9be0887dd1d86,baa baa
5,word,sounds,1,baa baa,baa baa,baa baa,6,30,651de3dbf3a9be0887dd1d86,baa baa
...,...,...,...,...,...,...,...,...,...,...
4963438,word,connecting_words,680,then,then,then,86601,29,651de3dbf3a9be0887dd202d,then
4963441,word,connecting_words,680,then,then,then,86604,30,651de3dbf3a9be0887dd202d,then
4963444,word,connecting_words,680,then,then,then,86607,25,651de3dbf3a9be0887dd202d,then
4963446,word,connecting_words,680,then,then,then,86609,29,651de3dbf3a9be0887dd202d,then


# By-Child dataset

Enhance the dataframe with children information.

In [53]:
df_children = load_df("wordbank_administration_data")

In [54]:
df_children

Unnamed: 0,downloaded,language,form,dataset_name,child_id,age,comprehension,production,is_norming,birth_order,...,race,sex,birth_weight,born_early_or_late,gestational_age,zygosity,language_exposures,health_conditions,monolingual,typically_developing
0,2023-09-28,Croatian,WG,CLEX,18186,13,293,88,True,,...,,Female,,,,,,,True,True
1,2023-09-28,Croatian,WG,CLEX,18187,16,122,12,True,,...,,Male,,,,,,,True,True
2,2023-09-28,Croatian,WG,CLEX,18188,9,3,0,True,,...,,Female,,,,,,,True,True
3,2023-09-28,Croatian,WG,CLEX,18189,12,0,0,True,,...,,Female,,,,,,,True,True
4,2023-09-28,Croatian,WG,CLEX,18190,12,44,0,True,,...,,Female,,,,,,,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90459,2023-09-28,Portuguese (European),WG,Cadime,66175,14,162,11,False,,...,,Female,,,,,,,True,True
90460,2023-09-28,Portuguese (European),WG,Cadime,66176,15,314,47,False,,...,,Female,,,,,,,True,True
90461,2023-09-28,Portuguese (European),WG,Cadime,66177,15,103,15,False,,...,,Female,,,,,,,True,True
90462,2023-09-28,Portuguese (European),WG,Cadime,66178,15,256,24,False,,...,,Male,,,,,,,True,True


### Words dataframe

Create a dataframe only with the words/sentences contents. This way we have an index of words which we can use later.

In [55]:
df_words = df.drop_duplicates("item_id").drop(["child_id", "age"], axis="columns")
df_words.set_index("item_id", inplace=True)
df_words

Unnamed: 0_level_0,item_kind,category,item_definition,english_gloss,uni_lemma,wordBankId,word
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,word,sounds,baa baa,baa baa,baa baa,651de3dbf3a9be0887dd1d86,baa baa
2,word,sounds,choo choo,choo choo,choo choo,651de3dbf3a9be0887dd1d87,choo choo
3,word,sounds,cockadoodledoo,cockadoodledoo,cockadoodledoo,651de3dbf3a9be0887dd1d88,cockadoodledoo
4,word,sounds,grrr,grrr,grrr,651de3dbf3a9be0887dd1d89,grrr
5,word,sounds,meow,meow,meow,651de3dbf3a9be0887dd1d8a,meow
...,...,...,...,...,...,...,...
676,word,connecting_words,because,because,because,651de3dbf3a9be0887dd2029,because
677,word,connecting_words,but,but,but,651de3dbf3a9be0887dd202a,but
678,word,connecting_words,if,if,if,651de3dbf3a9be0887dd202b,if
679,word,connecting_words,so,so,so,651de3dbf3a9be0887dd202c,so


# The `item_kind` column

This refers the type of the entry

In [56]:
# How many entries for each child do we have
df["item_kind"].value_counts()

item_kind
word    2013822
Name: count, dtype: int64

In [57]:
# Look at some of the data
df_words.groupby("item_kind").sample(n=1)

Unnamed: 0_level_0,item_kind,category,item_definition,english_gloss,uni_lemma,wordBankId,word
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
97,word,food_drink,cereal,cereal,cereal,651de3dbf3a9be0887dd1de6,cereal


As can be seen, the entries where `item_kind` is one of:

- `combine`
- `word_endings`
- `word_endings_nouns`
- `word_endings_verbs`
- `how_use_words`

These entries are confusing from a "standalone word" point of view. They don't fit our task of recommending words. For example: We can't use `sockses` because it is not a "word", it is the ending of a word.

Let's drop entries from our `df` with these values:

In [58]:
mask = df["item_kind"].isin(["combine", "word_endings", "word_endings_nouns", "word_endings_verbs", "how_use_words"])
df.drop(index=df[mask].index, inplace=True)
df

Unnamed: 0,item_kind,category,item_id,item_definition,english_gloss,uni_lemma,child_id,age,wordBankId,word
0,word,sounds,1,baa baa,baa baa,baa baa,1,28,651de3dbf3a9be0887dd1d86,baa baa
2,word,sounds,1,baa baa,baa baa,baa baa,3,26,651de3dbf3a9be0887dd1d86,baa baa
3,word,sounds,1,baa baa,baa baa,baa baa,4,27,651de3dbf3a9be0887dd1d86,baa baa
4,word,sounds,1,baa baa,baa baa,baa baa,5,19,651de3dbf3a9be0887dd1d86,baa baa
5,word,sounds,1,baa baa,baa baa,baa baa,6,30,651de3dbf3a9be0887dd1d86,baa baa
...,...,...,...,...,...,...,...,...,...,...
4963438,word,connecting_words,680,then,then,then,86601,29,651de3dbf3a9be0887dd202d,then
4963441,word,connecting_words,680,then,then,then,86604,30,651de3dbf3a9be0887dd202d,then
4963444,word,connecting_words,680,then,then,then,86607,25,651de3dbf3a9be0887dd202d,then
4963446,word,connecting_words,680,then,then,then,86609,29,651de3dbf3a9be0887dd202d,then


In [59]:
df["item_kind"].value_counts()

item_kind
word    2013822
Name: count, dtype: int64

That's still enough data for our model.

In [62]:
# Serialize the `df_words` because we'll use it when doing inference with the API
mask = df_words["item_kind"].isin(["combine", "word_endings", "word_endings_nouns", "word_endings_verbs", "how_use_words"])
df_words.drop(index=df_words[mask].index, inplace=True)
df_words[["word", "wordBankId"]].to_parquet("words.parquet")

In [63]:
df_words

Unnamed: 0_level_0,item_kind,category,item_definition,english_gloss,uni_lemma,wordBankId,word
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,word,sounds,baa baa,baa baa,baa baa,651de3dbf3a9be0887dd1d86,baa baa
2,word,sounds,choo choo,choo choo,choo choo,651de3dbf3a9be0887dd1d87,choo choo
3,word,sounds,cockadoodledoo,cockadoodledoo,cockadoodledoo,651de3dbf3a9be0887dd1d88,cockadoodledoo
4,word,sounds,grrr,grrr,grrr,651de3dbf3a9be0887dd1d89,grrr
5,word,sounds,meow,meow,meow,651de3dbf3a9be0887dd1d8a,meow
...,...,...,...,...,...,...,...
676,word,connecting_words,because,because,because,651de3dbf3a9be0887dd2029,because
677,word,connecting_words,but,but,but,651de3dbf3a9be0887dd202a,but
678,word,connecting_words,if,if,if,651de3dbf3a9be0887dd202b,if
679,word,connecting_words,so,so,so,651de3dbf3a9be0887dd202c,so


# Prepare dataframe for LibRecommender

From the Github readme:

> JUST normal data format, each line represents a sample.
> One thing is important, the model assumes that user, item, and label column index are 0, 1, and 2, respectively.
> You may wish to change the column order if that's not the case.

In [64]:
data = df.rename(columns={"child_id": "user", "item_id": "item"})
data.insert(loc=2, column="label", value=np.nan)

other_columns = set(data.columns).difference(["user", "item", "label"])
data = data[["user", "item", "label"] + list(other_columns)]
data.reset_index(drop=True, inplace=True)
data

Unnamed: 0,user,item,label,word,english_gloss,age,wordBankId,item_kind,uni_lemma,item_definition,category
0,1,1,,baa baa,baa baa,28,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
1,3,1,,baa baa,baa baa,26,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
2,4,1,,baa baa,baa baa,27,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
3,5,1,,baa baa,baa baa,19,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
4,6,1,,baa baa,baa baa,30,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
...,...,...,...,...,...,...,...,...,...,...,...
2013817,86601,680,,then,then,29,651de3dbf3a9be0887dd202d,word,then,then,connecting_words
2013818,86604,680,,then,then,30,651de3dbf3a9be0887dd202d,word,then,then,connecting_words
2013819,86607,680,,then,then,25,651de3dbf3a9be0887dd202d,word,then,then,connecting_words
2013820,86609,680,,then,then,29,651de3dbf3a9be0887dd202d,word,then,then,connecting_words


## The value to use for the `label` column

The column `label` denote how much "interaction" a children had with the word.

In the classic recommendation example, where we're recommending movies for users, this is the **rating** that a user has given a movie.

Unfortunately we don't have anything in our dataset that denotes this value, so we'll simply use the value `1.0`.   

In [65]:
data["label"] = 1.0

In [66]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013822 entries, 0 to 2013821
Data columns (total 11 columns):
 #   Column           Dtype  
---  ------           -----  
 0   user             int64  
 1   item             int64  
 2   label            float64
 3   word             object 
 4   english_gloss    object 
 5   age              int64  
 6   wordBankId       object 
 7   item_kind        object 
 8   uni_lemma        object 
 9   item_definition  object 
 10  category         object 
dtypes: float64(1), int64(3), object(7)
memory usage: 169.0+ MB


In [67]:
# Create a "fake" user so that we can retrain the model later
data = pd.concat([data, pd.DataFrame(
    {"user": -1,
     "item": -1,
     "label": 0.0,
     "category": "unknown",
     "item_kind": "unknown",
     "uni_lemma": "unknown",
     "item_definition": "unknown",
     "english_gloss": "unknown",
     "age": 0,
     },
    index=[len(data)]
)])
data

Unnamed: 0,user,item,label,word,english_gloss,age,wordBankId,item_kind,uni_lemma,item_definition,category
0,1,1,1.0,baa baa,baa baa,28,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
1,3,1,1.0,baa baa,baa baa,26,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
2,4,1,1.0,baa baa,baa baa,27,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
3,5,1,1.0,baa baa,baa baa,19,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
4,6,1,1.0,baa baa,baa baa,30,651de3dbf3a9be0887dd1d86,word,baa baa,baa baa,sounds
...,...,...,...,...,...,...,...,...,...,...,...
2013818,86604,680,1.0,then,then,30,651de3dbf3a9be0887dd202d,word,then,then,connecting_words
2013819,86607,680,1.0,then,then,25,651de3dbf3a9be0887dd202d,word,then,then,connecting_words
2013820,86609,680,1.0,then,then,29,651de3dbf3a9be0887dd202d,word,then,then,connecting_words
2013821,86612,680,1.0,then,then,29,651de3dbf3a9be0887dd202d,word,then,then,connecting_words


In [68]:
data[data["user"] == -1]

Unnamed: 0,user,item,label,word,english_gloss,age,wordBankId,item_kind,uni_lemma,item_definition,category
2013822,-1,-1,0.0,,unknown,0,,unknown,unknown,unknown,unknown


Following the [tutorial](https://librecommender.readthedocs.io/en/latest/tutorial.html).

In [69]:
from libreco.data import random_split

# split data into three folds for training, evaluating and testing
train_data, eval_data, test_data = random_split(data, multi_ratios=[0.8, 0.1, 0.1], seed=42)

In [70]:
train_data

Unnamed: 0,user,item,label,word,english_gloss,age,wordBankId,item_kind,uni_lemma,item_definition,category
628112,824,176,1.0,slipper,slipper,28,651de3dbf3a9be0887dd1e35,word,slipper,slipper,clothing
872975,62818,246,1.0,purse,purse,24,651de3dbf3a9be0887dd1e7b,word,purse,purse,household
1398736,3187,422,1.0,drive,drive,20,651de3dbf3a9be0887dd1f2b,word,drive,drive,action_words
448204,3582,116,1.0,hamburger,hamburger,28,651de3dbf3a9be0887dd1df9,word,hamburger,hamburger,food_drink
1192385,2889,359,1.0,friend,friend,25,651de3dbf3a9be0887dd1eec,word,friend,friend,people
...,...,...,...,...,...,...,...,...,...,...,...
182221,84627,42,1.0,owl,owl,20,651de3dbf3a9be0887dd1daf,word,owl,owl,animals
1230595,3841,374,1.0,teacher,teacher,28,651de3dbf3a9be0887dd1efb,word,teacher,teacher,people
412478,3717,104,1.0,cookie,cookie,28,651de3dbf3a9be0887dd1ded,word,cookie,cookie,food_drink
844717,1343,237,1.0,mop,mop,26,651de3dbf3a9be0887dd1e72,word,mop (object),mop,household


In [71]:
from pathlib import Path
def save_model(model, data_info, model_name: str):
    model_type_name = type(model).__name__
    Path(f"models/{model_type_name}/data-info/").mkdir(parents=True, exist_ok=True)
    data_info.save(f"models/{model_type_name}/data-info/", model_name=model_name)
    model.save(f"models/{model_type_name}-retrain/weights/", model_name=model_name, manual=True, inference_only=False)
    model.save(f"models/{model_type_name}-inference/weights/", model_name=model_name, inference_only=True)

# Using a "Pure" model

We'll use the [LightGCN](https://librecommender.readthedocs.io/en/latest/api/algorithms/lightgcn.html) model because it is what we have from the tutorial.

## Training

The next cells train a LightGCN model.

Skip to the next section ("inference") if you don't want to wait for the training of the model. It takes ~ 10 minutes to train.  

In [72]:
from libreco.algorithms import LightGCN  # pure data, algorithm LightGCN
from libreco.data import DatasetPure
from libreco.evaluation import evaluate

In [73]:
train_data_pure, data_info_pure = DatasetPure.build_trainset(train_data)
eval_data_pure = DatasetPure.build_evalset(eval_data)
test_data_pure = DatasetPure.build_testset(test_data)

In [74]:
data_info_pure

n_users: 6357, n_items: 654, data density: 38.7508 %

In [75]:
lightgcn = LightGCN(
    task="ranking",
    data_info=data_info_pure,
    loss_type="bpr",
    embed_size=16,
    n_epochs=3,
    lr=1e-3,
    batch_size=2048,
    num_neg=1,
    device="cuda",
)

# TODO: Experiment with the hyperparameters to make metrics better. 
# lightgcn = LightGCN(
#     task="ranking",
#     data_info=data_info,
#     loss_type="bpr",
#     embed_size=16,
#     n_epochs=3,
#     lr=1e-1,
#     batch_size=2048,
#     num_neg=1,
#     device="cuda",
# )

In [76]:
# monitor metrics on eval_data during training
lightgcn.fit(
    train_data_pure,
    neg_sampling=True,  # sample negative items for train and eval data
    verbose=2,
    eval_data=eval_data_pure,
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
    num_workers=8,
)

# do final evaluation on test data
print(
    "evaluate_result: ",
    evaluate(
        model=lightgcn,
        data=test_data_pure,
        neg_sampling=True,  # sample negative items for test data
        metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
    ),
)

Training start time: [35m2023-10-06 22:41:33[0m


train: 100%|██████████| 787/787 [04:24<00:00,  2.97it/s]


Epoch 1 elapsed: 264.969s
	 [32mtrain_loss: 0.6612[0m


eval_pointwise: 100%|██████████| 50/50 [00:00<00:00, 720.44it/s]
eval_listwise: 100%|██████████| 516/516 [00:00<00:00, 885.14it/s]


	 eval log_loss: 0.7551
	 eval roc_auc: 0.6247
	 eval precision@10: 0.3274
	 eval recall@10: 0.1585
	 eval ndcg@10: 0.6465


train: 100%|██████████| 787/787 [04:32<00:00,  2.89it/s]


Epoch 2 elapsed: 272.140s
	 [32mtrain_loss: 0.6295[0m


eval_pointwise: 100%|██████████| 50/50 [00:00<00:00, 547.37it/s]
eval_listwise: 100%|██████████| 516/516 [00:00<00:00, 853.70it/s]


	 eval log_loss: 0.7390
	 eval roc_auc: 0.6182
	 eval precision@10: 0.3304
	 eval recall@10: 0.1587
	 eval ndcg@10: 0.6474


train: 100%|██████████| 787/787 [04:27<00:00,  2.94it/s]


Epoch 3 elapsed: 267.450s
	 [32mtrain_loss: 0.6235[0m


eval_pointwise: 100%|██████████| 50/50 [00:00<00:00, 640.16it/s]
eval_listwise: 100%|██████████| 516/516 [00:00<00:00, 841.09it/s]


	 eval log_loss: 0.7347
	 eval roc_auc: 0.6159
	 eval precision@10: 0.3320
	 eval recall@10: 0.1588
	 eval ndcg@10: 0.6474


eval_pointwise: 100%|██████████| 50/50 [00:00<00:00, 722.43it/s]
eval_listwise: 100%|██████████| 516/516 [00:00<00:00, 888.50it/s]


evaluate_result:  {'loss': 0.7346561560966327, 'roc_auc': 0.6154107970435512, 'precision': 0.32975420439844766, 'recall': 0.1604368666406779, 'ndcg': 0.650183924354682}


Save the model so we don't need to retrain. Also save an "inference-only" version in case someone exploring the notebook wants to only look at results.

In [78]:
save_model(lightgcn, data_info_pure, "lightgcn")

file folder models/LightGCN-retrain/weights/ doesn't exists, creating a new one...
file folder models/LightGCN-inference/weights/ doesn't exists, creating a new one...


## Inference

Let's load the model and look at results:

In [37]:
from libreco.data import DataInfo
from libreco.algorithms import LightGCN 

try:
    lightgcn
except NameError:
    data_info = DataInfo.load("models/LightGCN/data-info/", model_name="epochs=3")
    lightgcn = LightGCN.load("models/LightGCN-inference/weights/", model_name="epochs=3", data_info=data_info)

## Looking at the model results

First let's look at the results for children that are **already** in the original dataset:

In [79]:
from ipywidgets import interact, interactive, fixed, interact_manual, widgets
from IPython.display import display
from typing import Dict

df_children = df[["child_id", "age"]].drop_duplicates("child_id")
df_children.set_index("child_id", inplace=True)

@interact(x=widgets.BoundedIntText(
    value=0,
    min=0,
    max=len(df_children),
    step=1,
    description='Child:',
    disabled=False
))
def foo(x):
    child_id = df_children.index[x]
    print(f"-> Child id is: {child_id}")
    print(f"-> Child age: {df_children.loc[child_id].age}")
    words = set(df[df["child_id"] == child_id]["word"].unique())
    print(f"-> Words spoken by this child: {len(words)}")
    display(" | ".join(sorted(words)))
    
    recommendation: Dict[int, np.ndarray] = lightgcn.recommend_user(user=child_id, n_rec=100)
    word_ids = recommendation[child_id]
    
    scores = lightgcn.predict(user=child_id, item=word_ids)
    
    display("-> Recommended words:")
    df_result = df_words.loc[word_ids]
    
    # Add column to know if the child already speaks such word (from the original dataset) 
    df_result["speaks?"] = df_result["word"].apply(lambda w: w in words)
    df_result["score"] = scores
    display(df_result[df_result["speaks?"] == False].iloc[:6])

interactive(children=(BoundedIntText(value=0, description='Child:', max=6359), Output()), _dom_classes=('widge…

## Return words given an arbitrary set of words from the original dataset

In this part we retrain the model for every request. Afterwards we ask for a recommendation of the child.

We use the "fake" `child_id == -1` as explained in the Introduction. 

In [80]:
# Sample some random words, just for the sake of trying to make the model work
words = df_words.sample(n=10)["word"].tolist()
words

['bring',
 'beans',
 'not',
 'beach',
 'necklace',
 'eat',
 'flag',
 'pool',
 'lamb',
 'cup']

In [81]:
from typing import List
from libreco.data import DataInfo

def predict(words: List[str]):
    child_id = -1
    # Train the model with the words spoken by the new child
    words_ids = df_words[df_words["word"].isin(words)].index.tolist()
    df_train = pd.DataFrame({"user": child_id, "item": words_ids, "label": 1.0})
    
    old_data_info = DataInfo.load("models/LightGCN/data-info", model_name="lightgcn")
    data, data_info = DatasetPure.merge_trainset(df_train, old_data_info)
    model = LightGCN(
        task="ranking",
        data_info=data_info,
        loss_type="bpr",
        embed_size=16,
        n_epochs=1,
        lr=1e-3,
        batch_size=2048,
        num_neg=1,
        device="cuda",
    )
    model.rebuild_model("models/LightGCN-retrain/weights", model_name="lightgcn")
    
    model.fit(
        data,
        neg_sampling=True,  # sample negative items for train and eval data
        verbose=2,
    )
    
    # Predict the words for this child
    recommendation: Dict[int, np.ndarray] = lightgcn.recommend_user(user=child_id, n_rec=100)
    predicted_word_ids = recommendation[child_id]
    scores = lightgcn.predict(user=child_id, item=predicted_word_ids)
    
    df_result = df_words.loc[predicted_word_ids]
    df_result["score"] = scores
    df_result["speaks?"] = df_result["word"].apply(lambda w: w in words)
    display(df_result[~df_result["speaks?"]].iloc[:6])
    
predict(words)

Training start time: [35m2023-10-06 22:58:04[0m


train: 100%|██████████| 1/1 [00:00<00:00,  3.03it/s]

Epoch 1 elapsed: 0.332s
	 [32mtrain_loss: 0.8323[0m





Unnamed: 0_level_0,item_kind,category,item_definition,english_gloss,uni_lemma,wordBankId,word,score,speaks?
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
643,word,quantifiers,each,each,each,651de3dbf3a9be0887dd2008,each,0.99958,False
329,word,places,country,country,country,651de3dbf3a9be0887dd1ece,country,0.999574,False
674,word,helping_verbs,would,would,would,651de3dbf3a9be0887dd2027,would,0.999552,False
603,word,pronouns,yourself,yourself,2SG.REFL,651de3dbf3a9be0887dd1fe0,yourself,0.999191,False
678,word,connecting_words,if,if,if,651de3dbf3a9be0887dd202b,if,0.999153,False
658,word,helping_verbs,could,could,could,651de3dbf3a9be0887dd2017,could,0.999124,False


# "Feat" algorithm

We'll use the Wide & Deep model.

This model is based on the paper [Wide & Deep Learning for Recommender Systems](https://arxiv.org/pdf/1606.07792.pdf).

The LibRecommender [documentation](https://librecommender.readthedocs.io/en/latest/api/algorithms/wide_deep.html) contains more information on how to use this class.

There is also a [tutorial](https://librecommender.readthedocs.io/en/latest/tutorial.html) at the LibRecommender website showing how to use this model.

In [None]:
# We already have the data split into train/test/eval. We'll just build the model   

In [103]:
from libreco.algorithms import WideDeep
from libreco.data import DatasetFeat
from libreco.evaluation import evaluate

From the tutorial:
> 
> In LibRecommender we use `sparse_col` to represent categorical features and `dense_col` to represent numerical features.

In [113]:
sparse_col = [
    "item_kind",
    # "category",  # TODO: Fill with "unknown"?
    # Maybe we could use "race" here too.
]
dense_col = ["age"]
user_col = ["age"]
item_col = [
    "item_kind",
    # "category"
]

train_data_feat, data_info_feat = DatasetFeat.build_trainset(train_data, user_col, item_col, sparse_col, dense_col)
eval_data_feat = DatasetFeat.build_evalset(eval_data)
test_data_feat = DatasetFeat.build_testset(test_data)

In [131]:
data_info_feat

n_users: 6365, n_items: 743, data density: 39.7250 %

In [134]:
import tensorflow as tf

# Need to call this otherwise this cell can only run once
tf.compat.v1.reset_default_graph()

wide_deep = WideDeep(
    task="ranking",
    data_info=data_info_feat,
    embed_size=16,
    n_epochs=10,
    loss_type="cross_entropy",
    lr={"wide": 0.05, "deep": 7e-4},
    batch_size=2048,
    use_bn=True,
    hidden_units=(128, 64, 32),
)

wide_deep.fit(
    train_data_feat,
    neg_sampling=True,  # perform negative sampling on training and eval data
    verbose=2,
    shuffle=True,
    eval_data=eval_data_feat,
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)

Training start time: [35m2023-09-28 23:12:29[0m
total params: [33m140,195[0m | embedding params: [33m121,215[0m | network params: [33m18,980[0m


  net = tf.layers.batch_normalization(net, training=is_training)
  net = tf.layers.batch_normalization(net, training=is_training)
train: 100%|██████████| 1835/1835 [00:11<00:00, 161.19it/s]


Epoch 1 elapsed: 11.386s
	 [32mtrain_loss: 0.6996[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 159.48it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 167.76it/s]


	 eval log_loss: 0.6421
	 eval roc_auc: 0.6582
	 eval precision@10: 0.3597
	 eval recall@10: 0.1640
	 eval ndcg@10: 0.6775


train: 100%|██████████| 1835/1835 [00:11<00:00, 161.64it/s]


Epoch 2 elapsed: 11.354s
	 [32mtrain_loss: 0.6368[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 179.65it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 167.89it/s]


	 eval log_loss: 0.6386
	 eval roc_auc: 0.6639
	 eval precision@10: 0.3831
	 eval recall@10: 0.1846
	 eval ndcg@10: 0.6939


train: 100%|██████████| 1835/1835 [00:11<00:00, 164.89it/s]


Epoch 3 elapsed: 11.130s
	 [32mtrain_loss: 0.6343[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 181.49it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 168.36it/s]


	 eval log_loss: 0.6384
	 eval roc_auc: 0.6650
	 eval precision@10: 0.3796
	 eval recall@10: 0.1825
	 eval ndcg@10: 0.6919


train: 100%|██████████| 1835/1835 [00:11<00:00, 163.27it/s]


Epoch 4 elapsed: 11.241s
	 [32mtrain_loss: 0.6331[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 180.31it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 166.99it/s]


	 eval log_loss: 0.6381
	 eval roc_auc: 0.6660
	 eval precision@10: 0.3970
	 eval recall@10: 0.1888
	 eval ndcg@10: 0.7058


train: 100%|██████████| 1835/1835 [00:11<00:00, 164.94it/s]


Epoch 5 elapsed: 11.127s
	 [32mtrain_loss: 0.6316[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 181.73it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 166.46it/s]


	 eval log_loss: 0.6374
	 eval roc_auc: 0.6668
	 eval precision@10: 0.4079
	 eval recall@10: 0.1946
	 eval ndcg@10: 0.7070


train: 100%|██████████| 1835/1835 [00:11<00:00, 165.08it/s]


Epoch 6 elapsed: 11.118s
	 [32mtrain_loss: 0.6298[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 175.72it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 169.18it/s]


	 eval log_loss: 0.6365
	 eval roc_auc: 0.6675
	 eval precision@10: 0.4095
	 eval recall@10: 0.1951
	 eval ndcg@10: 0.7083


train: 100%|██████████| 1835/1835 [00:11<00:00, 164.12it/s]


Epoch 7 elapsed: 11.182s
	 [32mtrain_loss: 0.6283[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 179.98it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 167.54it/s]


	 eval log_loss: 0.6357
	 eval roc_auc: 0.6677
	 eval precision@10: 0.3991
	 eval recall@10: 0.1925
	 eval ndcg@10: 0.7057


train: 100%|██████████| 1835/1835 [00:11<00:00, 165.06it/s]


Epoch 8 elapsed: 11.119s
	 [32mtrain_loss: 0.627[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 179.11it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 167.82it/s]


	 eval log_loss: 0.6358
	 eval roc_auc: 0.6692
	 eval precision@10: 0.4095
	 eval recall@10: 0.1966
	 eval ndcg@10: 0.7111


train: 100%|██████████| 1835/1835 [00:11<00:00, 164.76it/s]


Epoch 9 elapsed: 11.139s
	 [32mtrain_loss: 0.6257[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 180.01it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 167.76it/s]


	 eval log_loss: 0.6373
	 eval roc_auc: 0.6684
	 eval precision@10: 0.4071
	 eval recall@10: 0.1966
	 eval ndcg@10: 0.7096


train: 100%|██████████| 1835/1835 [00:11<00:00, 164.78it/s]


Epoch 10 elapsed: 11.138s
	 [32mtrain_loss: 0.6248[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 156.71it/s]
eval_listwise: 100%|██████████| 572/572 [00:03<00:00, 163.63it/s]


	 eval log_loss: 0.6364
	 eval roc_auc: 0.6688
	 eval precision@10: 0.3992
	 eval recall@10: 0.1923
	 eval ndcg@10: 0.7015


In [152]:
save_model(wide_deep, data_info_feat, "epochs=10")

In [158]:
from libreco.algorithms import (
    DIN,
    FM,
    AutoInt,
    DeepFM,
    GraphSage,
    GraphSageDGL,
    PinSage,
    PinSageDGL,
    TwoTower,
    WideDeep,
    YouTubeRanking,
    YouTubeRetrieval,
)

metrics = [
    "loss",
    "balanced_accuracy",
    "roc_auc",
    "pr_auc",
    "precision",
    "recall",
    "map",
    "ndcg",
]

tf.compat.v1.reset_default_graph()

ytb_ranking = YouTubeRanking(
    "ranking",
    data_info_feat,
    loss_type="cross_entropy",
    embed_size=16,
    n_epochs=3,
    lr=1e-4,
    lr_decay=False,
    reg=None,
    batch_size=2048,
    num_neg=1,
    use_bn=False,
    dropout_rate=None,
    hidden_units=(128, 64, 32),
    tf_sess_config=None,
)
ytb_ranking.fit(
    train_data_feat,
    neg_sampling=True,
    verbose=2,
    shuffle=True,
    eval_data=eval_data_feat,
    metrics=metrics,
)

Training start time: [35m2023-09-28 23:27:10[0m
total params: [33m134,609[0m | embedding params: [33m114,097[0m | network params: [33m20,512[0m


train: 100%|██████████| 1835/1835 [00:58<00:00, 31.14it/s]


Epoch 1 elapsed: 58.924s
	 [32mtrain_loss: 0.6743[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 251.49it/s]
eval_listwise: 100%|██████████| 572/572 [00:02<00:00, 257.68it/s]


	 eval log_loss: 0.6588
	 eval balanced_accuracy: 0.6019
	 eval roc_auc: 0.6402
	 eval pr_auc: 0.6054
	 eval precision@10: 0.3354
	 eval recall@10: 0.1352
	 eval map@10: 0.5019
	 eval ndcg@10: 0.6419


train: 100%|██████████| 1835/1835 [00:58<00:00, 31.23it/s]


Epoch 2 elapsed: 58.754s
	 [32mtrain_loss: 0.656[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 268.41it/s]
eval_listwise: 100%|██████████| 572/572 [00:02<00:00, 253.34it/s]


	 eval log_loss: 0.6492
	 eval balanced_accuracy: 0.6019
	 eval roc_auc: 0.6512
	 eval pr_auc: 0.6341
	 eval precision@10: 0.3502
	 eval recall@10: 0.1569
	 eval map@10: 0.5203
	 eval ndcg@10: 0.6616


train: 100%|██████████| 1835/1835 [00:58<00:00, 31.26it/s]


Epoch 3 elapsed: 58.694s
	 [32mtrain_loss: 0.6471[0m


eval_pointwise: 100%|██████████| 58/58 [00:00<00:00, 270.84it/s]
eval_listwise: 100%|██████████| 572/572 [00:02<00:00, 257.02it/s]


	 eval log_loss: 0.6448
	 eval balanced_accuracy: 0.6003
	 eval roc_auc: 0.6535
	 eval pr_auc: 0.6398
	 eval precision@10: 0.3586
	 eval recall@10: 0.1575
	 eval map@10: 0.5167
	 eval ndcg@10: 0.6532


In [165]:
from ipywidgets import interact, interactive, fixed, interact_manual, widgets
from IPython.display import display
from typing import Dict

df_children = df[["child_id", "age"]].drop_duplicates("child_id")
df_children.set_index("child_id", inplace=True)

# @interact(child_index=widgets.BoundedIntText(
#     value=0,
#     min=0,
#     max=len(df_children),
#     step=1,
#     description='Child:',
#     disabled=False
# ))
def foo(child_index, model, n_rec=1):
    child_id = df_children.index[child_index]
    print(f"-> Child id is: {child_id}")
    age = df_children.loc[child_id].age
    print(f"-> Child age: {age}")
    words = set(df[df["child_id"] == child_id]["item_definition"].unique())
    print(f"-> Words spoken by this child: {len(words)}")
    display(" | ".join(sorted(words)))
    
    for _ in range(10):
        recommendation: Dict[int, np.ndarray] = model.recommend_user(user=child_id, n_rec=n_rec, filter_consumed=True)
        if len(recommendation) > 1:
            display(recommendation)
            break
        
    word_ids = recommendation[child_id]
    
    scores = model.predict(user=child_id, item=word_ids)
    
    display("-> Recommended words:")
    df_result = df_words.loc[word_ids]
    
    # Add column to know if the child already speaks such word (from the original dataset) 
    df_result["speaks?"] = df_result["item_definition"].apply(lambda w: w in words)
    df_result["score"] = scores
    display(df_result[df_result["speaks?"] == False].iloc[:6])
    
foo(5, ytb_ranking, 10)

-> Child id is: 6
-> Child age: 30
-> Words spoken by this child: 569


"I | I fall down / I fell down | I like read stories / I like to read stories | I make tower / I making tower | I no do it / I can't do it | I sing song / I sing song for you | I want that / I want that one you got | TV | a | a lot | airplane | all | alligator | am | and | animal | ant | any | apple | applesauce | are | arm | around | asleep | ate | aunt | awake | baa baa | baby | baby blanket / baby's blanket | baby crying / baby crying cuz she's sad | baby crying / baby is crying | baby want eat / baby want to eat | back | backyard | bad | ball | balloon | banana | basket | bat | bath | bathroom | bathtub | beach | beans | bear | because | bed | bedroom | bee | behind | belly button | belt | bicycle | big | bird | bite | black | blanket | blew | block | blow | blue | boat | book | boots | bottle | bought | bowl | box | boy | bread | break | broke | broken | broom | brother | brush | bubbles | bucket | bug | bunny | bus | but | butter | butterfly | buttocks/bottom* | button | buy | by

InvalidArgumentError: Graph execution error:

Detected at node 'concat' defined at (most recent call last):
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/traitlets/config/application.py", line 1046, in launch_instance
      app.start()
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 736, in start
      self.io_loop.start()
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 195, in start
      self.asyncio_loop.run_forever()
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
      self._run_once()
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
      handle._run()
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/asyncio/events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 516, in dispatch_queue
      await self.process_one()
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 505, in process_one
      await dispatch(*args)
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 412, in dispatch_shell
      await result
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 740, in execute_request
      reply_content = await reply_content
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
      res = shell.run_cell(
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 546, in run_cell
      return super().run_cell(*args, **kwargs)
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3024, in run_cell
      result = self._run_cell(
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3079, in _run_cell
      result = runner(coro)
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3284, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3466, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "/tmp/ipykernel_20088/4289332625.py", line 45, in <module>
      ytb_ranking.fit(
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/libreco/bases/tf_base.py", line 128, in fit
      self.build_model()
    File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/libreco/algorithms/youtube_ranking.py", line 211, in build_model
      concat_embed = tf.concat(self.concat_embed, axis=1)
Node: 'concat'
ConcatOp : Dimension 0 in both shapes must be equal: shape[0] = [1,16] vs. shape[1] = [10,16]
	 [[{{node concat}}]]

Original stack trace for 'concat':
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/traitlets/config/application.py", line 1046, in launch_instance
    app.start()
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 736, in start
    self.io_loop.start()
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 516, in dispatch_queue
    await self.process_one()
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 505, in process_one
    await dispatch(*args)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 412, in dispatch_shell
    await result
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 740, in execute_request
    reply_content = await reply_content
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
    res = shell.run_cell(
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 546, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3024, in run_cell
    result = self._run_cell(
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3079, in _run_cell
    result = runner(coro)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3284, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3466, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_20088/4289332625.py", line 45, in <module>
    ytb_ranking.fit(
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/libreco/bases/tf_base.py", line 128, in fit
    self.build_model()
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/libreco/algorithms/youtube_ranking.py", line 211, in build_model
    concat_embed = tf.concat(self.concat_embed, axis=1)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
    return fn(*args, **kwargs)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler
    return dispatch_target(*args, **kwargs)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/tensorflow/python/ops/array_ops.py", line 1650, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1278, in concat_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/home/gustavo/.conda/envs/tici-39/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal
    ret = Operation.from_node_def(


In [None]:
from typing import List
from libreco.data import DataInfo

def predict(words: List[str]):
    child_id = -1
    # Train the model with the words spoken by the new child
    words_ids = df_words[df_words["item_definition"].isin(words)].index.tolist()
    df_train = pd.DataFrame({"user": child_id, "item": words_ids, "label": 1.0})
    
    old_data_info = DataInfo.load("models//data-info", model_name="label-is-all-1-epochs=3")
    data, data_info = DatasetFeat.merge_trainset(df_train, old_data_info)
    model = LightGCN(
        task="ranking",
        data_info=data_info,
        loss_type="bpr",
        embed_size=16,
        n_epochs=1,
        lr=1e-3,
        batch_size=2048,
        num_neg=1,
        device="cuda",
    )
    model.rebuild_model("models/lightgcn-retrain/weights", model_name="label-is-all-1-epochs=3")
    
    model.fit(
        data,
        neg_sampling=True,  # sample negative items for train and eval data
        verbose=2,
    )
    
    # Predict the words for this child
    recommendation: Dict[int, np.ndarray] = lightgcn.recommend_user(user=child_id, n_rec=100)
    predicted_word_ids = recommendation[child_id]
    scores = lightgcn.predict(user=child_id, item=predicted_word_ids)
    
    df_result = df_words.loc[predicted_word_ids]
    df_result["score"] = scores
    df_result["speaks?"] = df_result["item_definition"].apply(lambda w: w in words)
    display(df_result[~df_result["speaks?"]].iloc[:6])
    
predict(words)