# Welcome to day 2 of the Nuclio workshop
## 15 January 2025

<img src="./Pictures/CONTENT_BASED_FILTERING.png" width="500">

[Source: Content Based](https://developers.google.com/machine-learning/recommendation/content-based/basics)

---

# Notebook objectives

In this notebook we will build an anime recommender from scratch, using the **"Content Based Recommendation System"** technique.

To do so, we will use the Python library `pandas` to process our dataset and do the necessary cleaning.

If you need a refresher on `pandas`, you can use the notebook [PANDAS.ipynb](./PANDAS.ipynb) to review everything you need.

---

By the end of the notebook, we expect all attendees to:
1. Understand what a **Content Based Recommendation System** (CBRS) is.
1. Know how to prepare the dataset in the format our recommender needs.
1. Run a couple of recommendations and check the results.
1. Know the strengths and weaknesses of CBRS.


# Let's go!

---

We import the main libraries.

In [1]:
import os
import pandas as pd

With the following line of code we detect the path the notebook is running from (`os.getcwd()`), and then use `os.path.join` to append the `input` folder, where our files and datasets live.
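As a quick illustration of how the two calls combine, the sketch below builds the same kind of path and checks whether it exists. The folder name `input` comes from the notebook; the variable names `current_dir` and `input_dir` are only illustrative.

```python
import os

# Directory the Python process (here, the notebook kernel) is running from.
current_dir = os.getcwd()

# os.path.join inserts the path separator that is correct for the operating system.
input_dir = os.path.join(current_dir, "input")

print(input_dir)
print(os.path.isdir(input_dir))  # True only if the folder actually exists
```

If the second line prints `False`, the notebook is probably being run from a different folder than the one that contains `input`.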

In [2]:
PATH_INPUT_FOLDER = os.path.join(os.getcwd(), "input")

We use pandas to load the two files we need into Python.

In [3]:
rating = pd.read_parquet(os.path.join(PATH_INPUT_FOLDER, 'cf_rating.parquet.gzip'))
anime = pd.read_parquet(os.path.join(PATH_INPUT_FOLDER, 'cf_anime.parquet.gzip'))

As the outputs below show, many users have a rating of `-1`. This most likely means the rating is missing (null), so we should remove these rows from our dataset.

There are several ways to do this; here we will use a `boolean mask`.
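To make the idea of a boolean mask concrete, here is a minimal sketch on a toy DataFrame. The values are invented for illustration; only the column names match the notebook.

```python
import pandas as pd

# Toy ratings: -1 plays the role of "no rating given".
toy_rating = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [20, 8074, 24, 11771],
    "rating": [-1, 10, -1, 8],
})

# The comparison yields a boolean Series, one True/False per row.
mask = toy_rating["rating"] != -1

# Indexing with the mask keeps only the rows where it is True.
toy_clean = toy_rating[mask]
print(toy_clean)
```

Note that the surviving rows keep their original index labels (1 and 3 here), so the index is no longer consecutive after filtering.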

In [4]:
rating.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [5]:
rating.groupby(["rating"]).size()

rating
-1     1476496
 1       16649
 2       23150
 3       41453
 4      104291
 5      282806
 6      637775
 7     1375287
 8     1646019
 9     1254096
 10     955715
dtype: int64

We apply our `boolean mask` and can check that the nulls have disappeared from our **DataFrame**.

In [6]:
rating = rating[rating["rating"] != -1]

In [7]:
rating.head()

Unnamed: 0,user_id,anime_id,rating
47,1,8074,10
81,1,11617,10
83,1,11757,10
101,1,15451,10
153,2,11771,10


In [8]:
rating.groupby(["rating"]).size()

rating
1       16649
2       23150
3       41453
4      104291
5      282806
6      637775
7     1375287
8     1646019
9     1254096
10     955715
dtype: int64

The `anime` **DataFrame** also has many columns, but we only need **name** and **genre** (plus `anime_id`, which we keep so we can join with `rating`).

We will use pandas to drop all the other columns.

In [9]:
anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [10]:
anime = anime[["anime_id", "name", "genre"]]

In [11]:
anime.head()

Unnamed: 0,anime_id,name,genre
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural"
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili..."
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S..."
3,9253,Steins;Gate,"Sci-Fi, Thriller"
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S..."


Before going any further, let's talk a bit more about our recommender.

A **Content Based Recommendation System** relies on two things:
1. Users have used our service and left a "footprint"; in our case, a review. If we look closely, the **DataFrame** called `rating` has a column also called `rating`, which holds the score each user gave to the animes they have watched.
1. Each anime belongs to one or more genres. In our case, the `anime` **DataFrame** has a `genre` column listing every genre the anime belongs to.

Basically, what we can do is the following:

1. "Expand" the `genre` column (right now it is comma-separated text) to get one column per genre.
1. Join the `anime` **DataFrame** with the `rating` **DataFrame** on `anime_id`. This brings in each user's score, so we know which genres they have rated and how much they actually like them.
1. "Weight" the genres by the scores the users have given.
1. With all this information, we can start making recommendations.
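The steps above can be sketched on a tiny, made-up dataset. This is only an illustration under stated assumptions: the data is invented, and the expansion uses `Series.str.get_dummies` as a shortcut rather than the split/melt/pivot chain used in the notebook.

```python
import pandas as pd

# Toy data (hypothetical values, same column names as the notebook).
toy_anime = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", "Comedy", "Action, Drama"],
})
toy_rating = pd.DataFrame({
    "user_id": [10, 10, 20],
    "anime_id": [1, 2, 3],
    "rating": [10, 5, 8],
})

# Step 1: expand the comma-separated text into one 0/1 column per genre.
genres = toy_anime.set_index("anime_id")["genre"].str.get_dummies(sep=", ")

# Step 2: bring in each user's rating through anime_id.
merged = toy_rating.merge(genres, left_on="anime_id", right_index=True, how="inner")

# Step 3: weight every genre flag by the rating and add it up per user.
genre_cols = genres.columns
weighted = merged[genre_cols].multiply(merged["rating"], axis=0)
per_user = weighted.groupby(merged["user_id"]).sum()

# Normalise each row so that a user's genre weights sum to 1.
profile = per_user.divide(per_user.sum(axis=1), axis=0)
print(profile)
```

User 10 rated an Action/Comedy anime with 10 and a Comedy anime with 5, so their profile ends up as 0.4 Action and 0.6 Comedy.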

In [12]:
ORDER_OF_COLUMNS = [
    'user_id',
    'name',
    'rating',
    'Action',
    'Adventure',
    'Cars',
    'Comedy',
    'Dementia',
    'Demons',
    'Drama',
    'Ecchi',
    'Fantasy',
    'Game',
    'Harem',
    'Hentai',
    'Historical',
    'Horror',
    'Josei',
    'Kids',
    'Magic',
    'Martial Arts',
    'Mecha',
    'Military',
    'Music',
    'Mystery',
    'Parody',
    'Police',
    'Psychological',
    'Romance',
    'Samurai',
    'School',
    'Sci-Fi',
    'Seinen',
    'Shoujo',
    'Shoujo Ai',
    'Shounen',
    'Shounen Ai',
    'Slice of Life',
    'Space',
    'Sports',
    'Super Power',
    'Supernatural',
    'Thriller',
    'Vampire',
    'Yaoi',
    'Yuri'
]

In [13]:
def multiply_genre_by_rating(df):
    '''
    Multiplies all the genre columns in the df by the rating given by the user.
    '''
    df.iloc[:, 3::] = df.iloc[:, 3::].multiply(df.iloc[:, 2], axis = 0)
    return df

In [14]:
def calculate_ponderated_score(df):
    '''
    Normalizes each row (user) so that its genre scores sum to 1.
    '''
    return df.divide(df.sum(axis = 1), axis = 0)
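The two helpers are easier to follow on a small example. The sketch below repeats their logic on a hypothetical toy frame. It also shows why the column order matters: `multiply_genre_by_rating` works by position, so it assumes the layout `user_id`, `name`, `rating`, then the genre columns, which is what `ORDER_OF_COLUMNS` enforces.

```python
import pandas as pd

def multiply_genre_by_rating(df):
    # Same logic as the notebook helper: column 2 is the rating,
    # columns 3 onwards are the genre flags.
    df.iloc[:, 3:] = df.iloc[:, 3:].multiply(df.iloc[:, 2], axis=0)
    return df

def calculate_ponderated_score(df):
    # Divide every row by its own total so that each row sums to 1.
    return df.divide(df.sum(axis=1), axis=0)

# Hypothetical toy frame laid out as user_id, name, rating, then genres.
toy = pd.DataFrame({
    "user_id": [1, 1],
    "name": ["A", "B"],
    "rating": [10, 6],
    "Action": [1.0, 0.0],
    "Comedy": [1.0, 1.0],
})

weighted = multiply_genre_by_rating(toy.copy())
totals = weighted.drop(columns=["name", "rating"]).groupby("user_id").sum()
scores = calculate_ponderated_score(totals)
print(scores)
```

Here the user accumulates 10 points for Action and 16 for Comedy, so the normalized scores are 10/26 and 16/26.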

In [15]:
ponderated_score = (
    anime                                  #  1. start from our df
    .set_index(["anime_id", "name"])       #  2. move anime_id and name to the index so we don't lose them
    ["genre"]                              #  3. select the genre column (text separated by ", ")
    .str.split(", ", expand = True)        #  4. split with expand = True to get one column per piece
    .reset_index()                         #  5. bring anime_id and name back from the index
    .melt(id_vars = ["anime_id", "name"])  #  6. melt to get rid of the 0, 1, 2, 3 ... column labels
    [["anime_id", "name", "value"]]        #  7. keep only the columns of interest
    .dropna()                              #  8. drop any nulls
    .pivot_table(                          #  9. pivot_table to put every possible genre in its own column
        index = ["anime_id", "name"],      #  9. Note: an anime does not have every genre, so the
        columns = "value",                 #  9. result will contain nulls
        aggfunc = len
    )
    .fillna(0)                             # 10. fill the nulls for genres an anime does not have with 0
    .reset_index()                         # 11. bring anime_id and name back from the index (because of pivot_table)
    .merge(                                # 12. bring in each user's ratings
        right = rating,                    # 12. merging on the anime_id column
        how = 'inner',
        on = "anime_id"
    )
    .pipe(lambda df: df[ORDER_OF_COLUMNS]) # 13. put the columns in the order we need
    .pipe(multiply_genre_by_rating)        # 14. multiply each genre by the rating the user gave the anime
    .drop(columns = ["name", "rating"])    # 15. drop the name and rating columns
    .groupby(["user_id"])                  # 16. group our df by user_id
    .sum()                                 # 17. add up the scores per genre for each user
    .pipe(calculate_ponderated_score)      # 18. normalize each user's scores
)

In [17]:
ponderated_score.head(10)

user_id,Action,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
1,0.136364,0.045455,0.0,0.090909,0.0,0.090909,0.0,0.136364,0.045455,0.045455,...,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
3,0.133255,0.104363,0.0,0.08579,0.0,0.007665,0.068101,0.011498,0.116156,0.016509,...,0.0,0.011498,0.0,0.025354,0.020637,0.053656,0.009434,0.001769,0.0,0.0
5,0.085026,0.055432,0.002819,0.156665,0.000117,0.009278,0.047798,0.029477,0.049207,0.004815,...,0.0,0.038638,0.003406,0.027481,0.020082,0.050029,0.005989,0.003641,0.0,0.0
7,0.081957,0.040153,0.0,0.125587,0.002434,0.015123,0.032331,0.046237,0.054841,0.009995,...,0.0,0.031636,0.0,0.007648,0.022771,0.050408,0.007996,0.005823,0.0,0.0
8,0.133455,0.047532,0.0,0.107861,0.0,0.0,0.051188,0.025594,0.060329,0.016453,...,0.0,0.016453,0.0,0.0,0.047532,0.043876,0.010969,0.0,0.0,0.0
9,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.175,0.175,0.0,0.0,0.0,0.0,0.0625,0.0,0.175,0.1125,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,0.067928,0.080373,0.0,0.093855,0.00363,0.019704,0.106041,0.009593,0.090226,0.004148,...,0.002333,0.021519,0.001815,0.004148,0.009074,0.052113,0.009334,0.00337,0.0,0.0
12,0.127329,0.061077,0.0,0.07764,0.0,0.009317,0.048654,0.006211,0.071429,0.009317,...,0.0,0.009317,0.0,0.0,0.092133,0.016563,0.020704,0.0,0.0,0.0


# Inference

We have now computed, for each user, the genres they like most.

Now we can start building their recommendations.
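The scoring idea used in the cells that follow is a dot product between a user's genre profile and the 0/1 genre flags of each candidate anime. Here is a minimal sketch with invented numbers; `user_profile` and `candidates` are hypothetical stand-ins for one row of `ponderated_score` and a genre matrix indexed by `anime_id`.

```python
import pandas as pd

# Hypothetical user profile: share of the user's weighted ratings per genre.
user_profile = pd.Series({"Action": 0.6, "Comedy": 0.3, "Drama": 0.1})

# Hypothetical candidates: one row per anime, one 0/1 column per genre.
candidates = pd.DataFrame(
    {"Action": [1, 0, 1], "Comedy": [0, 1, 1], "Drama": [0, 1, 0]},
    index=pd.Index([101, 102, 103], name="anime_id"),
)

# Dot product: for each anime, add up the user's weights of its genres.
scores = candidates.dot(user_profile).sort_values(ascending=False)
print(scores)
```

Anime 103 comes first because it combines the user's two strongest genres. Since the genre flags are not normalized, every extra genre tag can only add to an anime's score, which is worth keeping in mind when reading the ranking.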

In [18]:
random_user = pd.Series(rating["user_id"].unique()).sample().iloc[0]

In [19]:
random_user

61145

In [20]:
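# `anime_genre` is never defined earlier in this notebook. Assumption: it is the 0/1 genre matrix indexed by anime_id.
anime_genre = anime.set_index("anime_id")["genre"].str.get_dummies(sep = ", ")[ORDER_OF_COLUMNS[3:]]
random_animes = anime_genre.reset_index()["anime_id"].sample(10).values.tolist()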

In [None]:
random_animes

In [None]:
random_animes_genres = anime_genre[anime_genre.index.isin(random_animes)]

In [None]:
random_animes_genres

In [None]:
random_user_ponderated_score = ponderated_score[ponderated_score.index.isin([random_user])]

In [None]:
random_user_ponderated_score

In [None]:
r = (
    random_user_ponderated_score.dot(random_animes_genres.T)
    .T
    .rename(columns = {random_user:"score"})
    .sort_values(by = "score", ascending = False)
    # .pipe(lambda df: pd.concat([df, random_animes_genres], axis = 1))
)

In [None]:
r

In [None]:
(
    random_user_ponderated_score
    .melt()
    .sort_values("value", ascending = False)
)

In [None]:
i = 1
(
    random_animes_genres.loc[r.index[i]][random_animes_genres.loc[r.index[i]] != 0]
)