# Deep Learning Recommendation System

We will build a recommendation system in two parts: first, two Content-based Filtering models generate candidates, and then a Collaborative Filtering model built with Keras ranks them.

### Content-based Filtering:  
- Model 1 - Most popular content by category: \
For each user, we find the n categories they watched most and select the 20 most viewed content items in each category.

- Model 2 - Most popular content from closer users: \
[Nearest Neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) is used to find users who watched similar content, using content features as input (e.g. category, duration, director, language).

The results of these two models form a list in which each row holds the candidate recommendations for one user.  
Example:
```
user_id   n_content_ids
0         [2040.0, 1800.0, 774.0, 2299.0, 2178.0, 604.0,...
1         [2012.0, 2942.0, 1462.0, 1573.0, 2152.0, 1957....
2         [2040.0, 4133.0, 2012.0, 3900.0, 2942.0, 3353....
...         ...                        
113879    [2040.0, 774.0, 2299.0, 2178.0, 604.0, 387.0, ...
113880    [2040.0, 1800.0, 774.0, 2299.0, 2178.0, 604.0,...
```

These results are used as input to the next model.

### Collaborative Filtering:  
For this, we build a model that takes the users, the content they watched, and a "score" (generated from the data) that tries to capture whether or not a user liked the content they watched.

> The score:  
> We derive it from 2 fields:
> - min_watching: "time the user spent watching the content"
> - run_time_min: "total duration of the content"
>
> The score is the percentage of the content watched, scaled to the range 1 to 10, where 10 is the highest (the user watched the entire content).
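
As a minimal sketch of this score in plain Python: the function name `watch_score` is hypothetical, the capping and truncation steps follow the notebook's preprocessing cell, and `run_time_min` is assumed to be positive.

```python
def watch_score(min_watching: float, run_time_min: float) -> int:
    """Map watched minutes to an integer score between 1 and 10."""
    # Cap the watched time at the full duration of the content.
    watched = min(min_watching, run_time_min)
    # Fraction of the content watched, scaled to 0..10.
    raw = watched / run_time_min * 10
    # Scores below 1 are raised to 1, then truncated to an integer.
    return int(max(raw, 1))


print(watch_score(24.0, 65.0))  # 3
print(watch_score(9.0, 65.0))   # 1
print(watch_score(80.0, 65.0))  # 10
```

The first two calls use values from the sample rows shown later in the notebook (24 and 9 minutes watched of a 65-minute episode).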

We will use this model to rank the candidates generated by the previous baselines.

- Keras model: \
The neural network consists of 2 `Embedding` layers that reduce the dimensionality of the users and the content. Their outputs are passed through a `Dense` layer of 10 neurons, which connects to a 1-neuron output layer with a **softmax** activation function that returns the **score** a user would give to a piece of content.  
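
One detail worth checking in the output layer: a softmax over a single unit always returns 1, so on its own it cannot produce a graded score. A common alternative, shown here only as a hedged sketch and not necessarily what the notebook implements, is a sigmoid rescaled to the score range. The helper names `softmax` and `scaled_sigmoid` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_sigmoid(z, min_rating=1.0, max_rating=10.0):
    """Sigmoid output rescaled into [min_rating, max_rating]."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (max_rating - min_rating) + min_rating


# With one output unit, softmax is constant regardless of the logit.
print(softmax([-3.0]), softmax([0.0]), softmax([5.0]))  # [1.0] [1.0] [1.0]

# The rescaled sigmoid varies with the logit and stays inside the range.
print(scaled_sigmoid(0.0))  # 5.5
```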


### Deliverable:
With the candidates generated and the network trained, we rank those candidates with the network and keep the 20 with the highest score.
We then recommend that content to each user.
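
The ranking step can be sketched as follows. This is an illustration only: `predict_score` is a hypothetical stand-in for the trained network's prediction, and `rank_candidates` and `toy_score` are not functions from the notebook.

```python
def rank_candidates(user_id, candidates, predict_score, top_k=20):
    """Return the top_k candidates ordered by predicted score, highest first."""
    # Drop repeated candidates while keeping their first position.
    unique = list(dict.fromkeys(candidates))
    # sorted() is stable, so ties keep the candidate-generation order.
    ranked = sorted(unique, key=lambda c: predict_score(user_id, c), reverse=True)
    return ranked[:top_k]


def toy_score(user_id, content_id):
    """Deterministic placeholder for a model prediction."""
    return (content_id * 7 + user_id) % 10


print(rank_candidates(0, [2040, 1800, 774, 2040, 604], toy_score, top_k=3))
```

With a real model, scoring all candidates of a user in one batched prediction call would usually be faster than scoring them one by one.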


In [40]:
# Mount Drive to read the data from there

from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [41]:
import pickle
import datetime as dt
from itertools import chain
from collections import Counter

import numpy as np
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

from keras.layers import (
    Input, Reshape, Dot, Add, Activation, Lambda, Concatenate, Dense, Dropout,
    Embedding,  # public import path; keras.layers.embeddings may be missing in newer Keras versions
)
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2

np.random.seed(16)
np.set_printoptions(precision=2)
pd.set_option('display.max_columns', None)

pd.set_option('display.precision', 2)

In [42]:
import tensorflow as tf
tf.config.list_physical_devices('GPU')

[]

## Functions

- Some functions to read the data, save the content recommendations, and evaluate our model

In [43]:
def create_dfs(sample_data=None, ret_test=True, clean=False):
    """
    Read and join the datasets, optionally split them into train and test,
    and format the test set as the content viewed by each account_id.
    """
    if clean:
        df = pd.read_csv('/gdrive/MyDrive/Colab Notebooks/data/data_cleaned_joined.csv', parse_dates=['tunein', 'end_vod_date'])
    else:
        df_train = pd.read_csv('/gdrive/MyDrive/Colab Notebooks/data/train_cleaned.csv', parse_dates=['tunein', 'tuneout'])
        df_meta = pd.read_csv('/gdrive/MyDrive/Colab Notebooks/data/metadata_cleaned.csv', sep=',',
                              parse_dates=['end_vod_date'])

        df = pd.merge(df_train, df_meta, how='left', on='asset_id')

    if sample_data is not None:
        df = df.sample(n=sample_data)

    if ret_test:
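        filter_date = dt.datetime(2021, 3, 10, 0, 0, 0).date()
        # Test set: views on or after filter_date (">=" so that day is not dropped from both sets)
        df_test = df[df.tunein.dt.date >= filter_date].copy()
        # Train set: views strictly before filter_date
        df = df[df.tunein.dt.date < filter_date]
        # For each account, list the content viewed in the test period, most viewed first
        df_test = df_test.groupby(['account_id'])['content_id'].agg(lambda X: X.value_counts().index.values.tolist())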
    else:
        df_test = None
    
    return df, df_test


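def write_submit(submit, file=None):
    """
    Check that every row holds exactly 20 recommendations, cast the ids to int,
    and write the result to a CSV file ('submit.csv' when no file name is given).
    """
    assert len(submit[submit.map(len) != 20]) == 0, "Row with a number of recommendations other than 20"
    # assert len(submit[submit.map(lambda x: len(set(x)) != 20)]) == 0, "Row with repeated recommendations"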

    submit = submit.map(lambda xs: [int(i) for i in xs])
    if file is None:
        submit.to_csv('submit.csv', header=False)
    else:
        submit.to_csv(file, header=False)


def avg_precision(y_true, y_pred):
    """
    Compute the average precision of each row from the pandas Series y_true and y_pred
    """

    def __get_ap(y_true_, y_pred_):
        positions = [i+1 for i, pred in enumerate(y_pred_) if pred in y_true_]
        if positions:
            return sum([(i+1) / pos for i, pos in enumerate(positions)]) / len(positions)
        return 0
    
    y_true.name = "y_true"
    y_pred.name = "y_pred"
    df_preds = pd.merge(y_true, y_pred, how='inner', left_index=True, right_index=True)

    return df_preds.apply(lambda row: __get_ap(row["y_true"], row["y_pred"]), axis=1)


def mean_avg_precision(y_true, y_pred):
    """
    Compute the mean average precision over the accounts present in both y_true and y_pred
    """
    assert len(y_pred[y_pred.map(len) != 20]) == 0, "Row with a number of recommendations other than 20"
    # assert len(y_pred[y_pred.map(lambda x: len(set(x)) != 20)]) == 0, "Row with repeated recommendations"
    
    ap_list = avg_precision(y_true, y_pred)
    return sum(ap_list) / len(ap_list)
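
To make the metric concrete, here is a small worked example in plain Python that mirrors the inner `__get_ap` logic above. The standalone name `average_precision` and the toy ids are illustrative. Note that this variant divides by the number of hits, not by the number of relevant items.

```python
def average_precision(y_true, y_pred):
    """Average precision of one ranked list, normalised by the number of hits."""
    # 1-based ranks of the predictions that appear in the ground truth.
    positions = [i + 1 for i, pred in enumerate(y_pred) if pred in y_true]
    if not positions:
        return 0.0
    # The (i+1)-th hit at rank pos contributes a precision of (i+1) / pos.
    return sum((i + 1) / pos for i, pos in enumerate(positions)) / len(positions)


watched = [10, 30]
recommended = [10, 20, 30, 40]

# Hits at ranks 1 and 3: (1/1 + 2/3) / 2 = 0.8333...
print(round(average_precision(watched, recommended), 4))  # 0.8333
```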

## Read Data

In [44]:
# Variables
is_test = False
sample_data = None
data_clean = True

if is_test:
    filter_date = dt.datetime(2021, 3, 10, 0, 0, 0).date()
else:
    filter_date = dt.datetime(2021, 4, 1, 0, 0, 0).date()

# List of the binary columns that represent the content genres
categories_list = [
                   'accion', 'animacion', 'animales', 'aventura', 'belico', 'biografia', 'ciencia',
                   'ciencia ficcion', 'cocina', 'comedia', 'competencia', 'crimen', 'cultura', 'deporte',
                   'dibujos animados', 'documental', 'drama', 'entretenimiento', 'entrevistas', 'espectaculo',
                   'familia', 'fantasia', 'historia', 'humor', 'infantil', 'interes general', 'investigacion',
                   'magazine', 'moda', 'musica', 'naturaleza', 'periodistico', 'policial', 'politico', 'reality',
                   'religion', 'restauracion', 'romance', 'suspenso', 'teatro', 'terror', 'viajes', 'western'
                  ]

In [45]:
if is_test:
    df, df_test = create_dfs(sample_data=sample_data, clean=data_clean)  # to test
else:
    df, _ = create_dfs(sample_data=sample_data, ret_test=False, clean=data_clean)  # to submit
df.tail(3)

Unnamed: 0,customer_id,account_id,device_type,asset_id,tunein,resume,min_watching,content_id,released_year,description,cast_first_name,credits_first_name,audience,made_for_tv,pay_per_view,pack_premium_1,pack_premium_2,end_vod_date,run_time_min,show_type,country_of_origin,accion,animacion,animales,aventura,belico,biografia,ciencia,ciencia ficcion,cocina,comedia,competencia,crimen,cultura,deporte,dibujos animados,documental,drama,entretenimiento,entrevistas,espectaculo,familia,fantasia,historia,humor,infantil,interes general,investigacion,magazine,moda,musica,naturaleza,periodistico,policial,politico,reality,religion,restauracion,romance,suspenso,teatro,terror,viajes,western,title,keywords,ranking
3089035,92007,113877,STATIONARY,27139.0,2021-03-31 20:21:00,0,24.0,2091.0,2017.0,"Defred, una de las pocas mujeres en edad férti...","Elisabeth Moss, Yvonne Strahovski, Joseph Fien...",Mike Barker,Mujeres,0,0,0,0,2022-06-14,65.0,Serie,US,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,the handmaid's tale (dob),"mujeres,abusos,secuestros,de libros,feminismo,...",3
3089036,92007,113877,STATIONARY,27139.0,2021-03-31 20:46:00,0,24.0,2091.0,2017.0,"Defred, una de las pocas mujeres en edad férti...","Elisabeth Moss, Yvonne Strahovski, Joseph Fien...",Mike Barker,Mujeres,0,0,0,0,2022-06-14,65.0,Serie,US,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,the handmaid's tale (dob),"mujeres,abusos,secuestros,de libros,feminismo,...",3
3089037,92007,113877,STATIONARY,27139.0,2021-03-31 21:12:00,0,9.0,2091.0,2017.0,"Defred, una de las pocas mujeres en edad férti...","Elisabeth Moss, Yvonne Strahovski, Joseph Fien...",Mike Barker,Mujeres,0,0,0,0,2022-06-14,65.0,Serie,US,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,the handmaid's tale (dob),"mujeres,abusos,secuestros,de libros,feminismo,...",1


In [46]:
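if not data_clean:
    # This column indicates how long a piece of content was watched, in minutes.
    # total_seconds() is used because .dt.seconds ignores the days component of a timedelta.

    df.loc[:, 'min_watching'] = (df.tuneout - df.tunein).dt.total_seconds() / 60

    # Cap the watched time at the duration of the content

    df.loc[df.min_watching > df.run_time_min, 'min_watching'] = df.run_time_min

    # Score: fraction of the content watched, scaled to 0-10

    df.loc[:, 'ranking'] = (df.min_watching / df.run_time_min) * 10

    # Scores below 1 are raised to 1

    df.loc[df.ranking < 1, 'ranking'] = 1

    # Truncate to an integer score (plain column assignment so the dtype actually changes)

    df['ranking'] = df.ranking.astype(int)

    df.tail(3)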

## baselines

### 1) Most popular content by category.
- Most popular content in the categories that the user viewed the most\
Steps:
    - Find the top `n_categories` categories that each user watched the most
    - Find the most viewed content in each category
    - Return the top content for the 1st, 2nd, 3rd, ... category; if fewer than 20 items remain, pad with the overall most popular content
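
The steps above can be sketched in plain Python. This is a simplified variant under stated assumptions, not the notebook's implementation: the function name `recommend_by_category` and the toy data are hypothetical, and unlike the cells that follow it also removes duplicates and skips already-seen content while padding.

```python
def recommend_by_category(user_cats, top_by_cat, seen, popular, n=20):
    """Collect candidates category by category, then pad with popular content.

    user_cats:  the user's categories, most watched first
    top_by_cat: dict mapping category -> content ids, most viewed first
    seen:       set of content ids the user has already watched
    popular:    content ids ordered by overall popularity (used for padding)
    """
    candidates = []
    for cat in user_cats:
        for cid in top_by_cat.get(cat, []):
            if cid not in seen and cid not in candidates:
                candidates.append(cid)
    # Pad with the overall most popular content until n items are reached.
    for cid in popular:
        if len(candidates) >= n:
            break
        if cid not in seen and cid not in candidates:
            candidates.append(cid)
    return candidates[:n]


top_by_cat = {"drama": [1, 2, 3], "comedia": [3, 4]}
popular = [1, 5, 6, 2]
print(recommend_by_category(["drama", "comedia"], top_by_cat, {2}, popular, n=5))
# [1, 3, 4, 5, 6]
```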

In [47]:
n_categories = 6  # number of top categories kept per user
n_top = 10        # number of top content items kept per category

In [48]:
# Count the content seen in each category by user

df_account_seen = df.drop_duplicates(subset=['asset_id', 'account_id'])\
                    .groupby(['account_id'])[categories_list].sum()

df_account_seen.tail(3)

account_id,accion,animacion,animales,aventura,belico,biografia,ciencia,ciencia ficcion,cocina,comedia,competencia,crimen,cultura,deporte,dibujos animados,documental,drama,entretenimiento,entrevistas,espectaculo,familia,fantasia,historia,humor,infantil,interes general,investigacion,magazine,moda,musica,naturaleza,periodistico,policial,politico,reality,religion,restauracion,romance,suspenso,teatro,terror,viajes,western
113878,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
113879,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
113880,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [49]:
# Find "n_categories" categories most viewed by user

n_cat_most_seen_by_user = df_account_seen.apply(lambda s: s[s>0].nlargest(n_categories).index.tolist(), axis=1)

n_cat_most_seen_by_user

account_id
0         [comedia, accion, animacion, drama, infantil, ...
1                          [accion, drama, cocina, reality]
2         [drama, romance, aventura, fantasia, accion, c...
3         [accion, drama, comedia, ciencia ficcion, aven...
4         [comedia, infantil, drama, suspenso, accion, d...
                                ...                        
113876                                              [drama]
113877                                              [drama]
113878           [biografia, documental, drama, naturaleza]
113879                                   [comedia, familia]
113880                           [comedia, drama, fantasia]
Length: 113881, dtype: object

In [50]:
# Find the content viewed by the most distinct profiles in each category that remains available after "filter_date"

top_content_by_cat_diff_user = {}

# Build a dictionary with the list of the top "n_top" content items per category
for category in categories_list:

    df_content = df[(df[category] != 0) & (df['end_vod_date'].dt.date > filter_date)]
    contents = df_content.drop_duplicates(subset=['asset_id', 'account_id'])["content_id"].value_counts()

    # Select content ordered by "seen by the most distinct profiles"
    top_content_by_cat_diff_user[category] = contents.index.astype(int).values[:n_top]


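# Array with all content that remains available after "filter_date", ordered from most to least viewed
# (the count now uses the filtered series; previously the filter was computed and then discarded)
all_content = df[(df['end_vod_date'].dt.date > filter_date)].content_id
all_content = np.array([x[0] for x in Counter(all_content.values).most_common()])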

In [51]:
# Create the recommendation list for each user based on their categories and the top content per category

recomendations = n_cat_most_seen_by_user.map(lambda xs: 
    np.fromiter(chain(*[ top_content_by_cat_diff_user[x] for x in xs ]), dtype=int)
)

recomendations.name = 'recomendations'
recomendations.iloc[:5]

account_id
0    [2040, 1800, 774, 2299, 2178, 604, 387, 3806, ...
1    [2012, 2942, 1462, 1573, 2152, 1957, 3711, 155...
2    [2040, 4133, 2012, 3900, 2942, 3353, 3712, 209...
3    [2012, 2942, 1462, 1573, 2152, 1957, 3711, 155...
4    [2040, 1800, 774, 2299, 2178, 604, 387, 3806, ...
Name: recomendations, dtype: object

In [52]:
# Find the content already seen by each profile

content_seen = df.groupby(['account_id'])['content_id'].agg(lambda x: list(set(x)))
content_seen = content_seen.map(np.array)

content_seen.name = "content_seen"

content_seen.tail(3)

account_id
113878            [2892.0, 4343.0]
113879                    [1800.0]
113880    [2183.0, 3810.0, 3663.0]
Name: content_seen, dtype: object

In [53]:
# Create a df to filter content_seen from recomendations

df_content_recommended = pd.merge(recomendations, content_seen, how="left", left_index=True, right_index=True)
df_content_recommended.tail(4)

account_id,recomendations,content_seen
113877,"[2040, 4133, 2012, 3900, 2942, 3353, 3712, 209...",[2091.0]
113878,"[3382, 3433, 3863, 3057, 1999, 3832, 3780, 365...","[2892.0, 4343.0]"
113879,"[2040, 1800, 774, 2299, 2178, 604, 387, 3806, ...",[1800.0]
113880,"[2040, 1800, 774, 2299, 2178, 604, 387, 3806, ...","[2183.0, 3810.0, 3663.0]"


In [54]:
# Build the submission by removing the content each user has already seen
# (np.isin replaces np.in1d, which is deprecated in recent NumPy versions)
submit_most_common = df_content_recommended.apply(
    lambda row: row['recomendations'][~np.isin(row['recomendations'], row['content_seen'])],
    axis=1)


# Pad with the overall most popular content if fewer than 20 recommendations remain
submit_most_common = submit_most_common.map(lambda xs:
           xs if len(xs) >= 20
           else np.concatenate([xs, all_content[:20]])
)

submit_most_common.iloc[:5]

account_id
0    [2040, 1800, 774, 2299, 2178, 604, 387, 3806, ...
1    [2012, 2942, 1462, 1573, 2152, 1957, 3711, 155...
2    [2040, 4133, 2012, 3900, 2942, 3353, 3712, 209...
3    [2012, 2942, 1462, 1573, 2152, 1957, 3711, 155...
4    [2040, 1800, 774, 2299, 604, 387, 3806, 2901, ...
dtype: object

In [55]:
submit_most_common.map(len).min(), submit_most_common.map(len).max()

(20, 60)

### 2) Most popular content from closer users.
- Most popular content viewed by the 10 closest users\
Steps:
  - Find the most popular content per user
  - Find the n most viewed categories for each user
  - Find the closest users with KNN
  - Select content
    - 50% from non-seen categories
    - 50% from their main categories
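
A dependency-free sketch of the neighbor-based idea follows. It is a stand-in under stated assumptions: the notebook uses scikit-learn's `NearestNeighbors` on standardized, PCA-reduced user profiles, while this example uses a brute-force Euclidean search on toy vectors. The names `nearest_users` and `recommend_from_neighbors` are hypothetical, and the 50/50 category split listed above is not covered.

```python
import math
from collections import Counter


def nearest_users(profiles, user, k):
    """Return the k users closest to `user` by Euclidean distance (brute force)."""
    target = profiles[user]
    dists = [(math.dist(target, vec), other)
             for other, vec in profiles.items() if other != user]
    dists.sort()
    return [other for _, other in dists[:k]]


def recommend_from_neighbors(profiles, watched, user, k=2, n=3):
    """Recommend the content most watched by the k nearest users and unseen by `user`."""
    neighbors = nearest_users(profiles, user, k)
    # Count how many neighbors watched each content item.
    counts = Counter(cid for nb in neighbors for cid in watched[nb])
    seen = set(watched[user])
    return [cid for cid, _ in counts.most_common() if cid not in seen][:n]


profiles = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.2), "d": (5.0, 5.0)}
watched = {"a": [1], "b": [1, 2, 3], "c": [2, 4], "d": [9]}

print(nearest_users(profiles, "a", 2))                 # ['b', 'c']
print(recommend_from_neighbors(profiles, watched, "a"))  # [2, 3, 4]
```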

In [56]:
n_closer_users = 10

In [57]:
# Find the most viewed content per user, and the subset that remains available after "filter_date"

top_content_by_user = df.groupby('account_id')['content_id'].agg(lambda xs: [x[0] for x in Counter(xs).most_common()])


top_content_by_user_keep = df[(df['end_vod_date'].dt.date > filter_date)
                             ].groupby('account_id')['content_id'].agg(lambda xs: [x[0] for x in Counter(xs).most_common()])

# Add the users with no remaining available content, using an empty list "[]"
# (pd.concat is used because Series.append was removed in pandas 2.0)
top_content_by_user_keep = pd.concat([
    top_content_by_user_keep,
    top_content_by_user[~top_content_by_user.index.isin(top_content_by_user_keep.index)].map(lambda xs: []),
])


top_content_by_user.iloc[:5]

account_id
0             [3438.0, 2866.0, 3498.0, 1503.0, 3845.0]
1                             [1020.0, 1220.0, 1761.0]
2    [183.0, 6.0, 1099.0, 557.0, 1582.0, 1443.0, 43...
3    [2344.0, 3900.0, 3769.0, 3206.0, 3790.0, 563.0...
4    [2178.0, 1139.0, 2341.0, 1008.0, 1971.0, 289.0...
Name: content_id, dtype: object

In [58]:
# Find the "n_categories" most viewed categories by each user

# n_categories = 10
# n_cat_most_seen_by_user = df.groupby('account_id')[categories_list
#                                                   ].sum().apply(lambda s: s[s>0].nlargest(n_categories).index.tolist(), axis=1)

n_cat_most_seen_by_user.iloc[:5]

account_id
0    [comedia, accion, animacion, drama, infantil, ...
1                     [accion, drama, cocina, reality]
2    [drama, romance, aventura, fantasia, accion, c...
3    [accion, drama, comedia, ciencia ficcion, aven...
4    [comedia, infantil, drama, suspenso, accion, d...
dtype: object

In [59]:
# Build a user profile

df_user_profile = pd.get_dummies(df, columns=['device_type', 'audience', 'show_type', 'country_of_origin'])

df_user_profile.loc[:, 'tunein_hour'] = df_user_profile.tunein.dt.hour

df_user_profile.drop(columns=['title', 'keywords', 'description', 'cast_first_name', 'credits_first_name', 'resume',
                           'end_vod_date', 'tunein', 'asset_id', 'customer_id', 'content_id'],
                     inplace=True)

# Build a dictionary with the aggregation functions for each column

agg_functs = {col: 'sum' for col in df_user_profile.drop(columns='account_id').columns}
agg_functs.update({
        'min_watching':  ['mean', lambda xs: xs.std() if len(xs) > 2 else 0],
        'released_year': ['mean'],
        'run_time_min': ['mean',  lambda xs: xs.std() if len(xs) > 2 else 0],
        'tunein_hour': ['mean'],
        'ranking': ['max', 'mean']
})

# Build the user profile dataframe
df_user_profile = df_user_profile.groupby('account_id').agg(agg_functs)

# Join multi level columns
df_user_profile.columns = df_user_profile.columns.map('{0[0]}_{0[1]}'.format)

df_user_profile.head()

account_id,min_watching_mean,min_watching_<lambda_0>,released_year_mean,made_for_tv_sum,pay_per_view_sum,pack_premium_1_sum,pack_premium_2_sum,run_time_min_mean,run_time_min_<lambda_0>,accion_sum,animacion_sum,animales_sum,aventura_sum,belico_sum,biografia_sum,ciencia_sum,ciencia ficcion_sum,cocina_sum,comedia_sum,competencia_sum,crimen_sum,cultura_sum,deporte_sum,dibujos animados_sum,documental_sum,drama_sum,entretenimiento_sum,entrevistas_sum,espectaculo_sum,familia_sum,fantasia_sum,historia_sum,humor_sum,infantil_sum,interes general_sum,investigacion_sum,magazine_sum,moda_sum,musica_sum,naturaleza_sum,periodistico_sum,policial_sum,politico_sum,reality_sum,religion_sum,restauracion_sum,romance_sum,suspenso_sum,teatro_sum,terror_sum,viajes_sum,western_sum,ranking_max,ranking_mean,device_type_CLOUD_CLIENT_sum,device_type_PHONE_sum,device_type_STATIONARY_sum,device_type_STB_sum,device_type_TABLET_sum,audience_Familiar_sum,audience_Gaming_sum,audience_General_sum,audience_Hombres_sum,audience_Juvenil_sum,audience_Mujeres_sum,audience_NIños_sum,audience_Niños_sum,audience_Preescolar_sum,audience_Teens_sum,show_type_Gaming_sum,show_type_Película_sum,show_type_Rolling_sum,show_type_Serie_sum,show_type_TV_sum,show_type_Web_sum,country_of_origin_AR_sum,country_of_origin_AT_sum,country_of_origin_AU_sum,country_of_origin_BE_sum,country_of_origin_BG_sum,country_of_origin_BR_sum,country_of_origin_CA_sum,country_of_origin_CF_sum,country_of_origin_CH_sum,country_of_origin_CL_sum,country_of_origin_CN_sum,country_of_origin_CO_sum,country_of_origin_CZ_sum,country_of_origin_DE_sum,country_of_origin_DK_sum,country_of_origin_DO_sum,country_of_origin_EE_sum,country_of_origin_ES_sum,country_of_origin_FI_sum,country_of_origin_FM_sum,country_of_origin_FR_sum,country_of_origin_Francia_sum,country_of_origin_GB_sum,country_of_origin_HK_sum,country_of_origin_HU_sum,country_of_origin_IE_sum,country_of_origin_IL_sum,country_of_origin_IN_sum,country_of_origin_IR_sum,country_of_origin_IS_sum,country_of_origin_IT_sum,country_of_origin_JP_sum,country_of_origin_KR_sum,country_of_origin_MN_sum,country_of_origin_MU_sum,country_of_origin_MX_sum,country_of_origin_MY_sum,country_of_origin_NL_sum,country_of_origin_NO_sum,country_of_origin_NR_sum,country_of_origin_NZ_sum,country_of_origin_PE_sum,country_of_origin_PH_sum,country_of_origin_PL_sum,country_of_origin_PY_sum,country_of_origin_RS_sum,country_of_origin_RU_sum,country_of_origin_SE_sum,country_of_origin_SY_sum,country_of_origin_TR_sum,country_of_origin_UK_sum,country_of_origin_US_sum,country_of_origin_USA_sum,country_of_origin_UY_sum,country_of_origin_VE_sum,country_of_origin_ZA_sum,tunein_hour_mean
0,66.2,41.96,2012.0,0,1,1,0,86.4,30.66,1,1,0,0,0,0,0,0,0,3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,10,7.6,1.0,0.0,0.0,4.0,0.0,1.0,0,4.0,0,0,0.0,0,0.0,0.0,0.0,0,4.0,0,0.0,1.0,0,0.0,0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0.0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0,0,0,0,0,11.6
1,38.33,41.63,2007.67,0,0,1,1,97.67,42.44,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,7,4.33,2.0,1.0,0.0,0.0,0.0,1.0,0,2.0,0,0,0.0,0,0.0,0.0,0.0,0,1.0,1,0.0,1.0,0,1.0,0,0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0.0,0,0.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,0,0,0,0,20.0
2,42.65,25.88,2018.52,0,0,1,4,60.96,18.42,2,1,0,6,0,1,0,0,0,0,0,2,0,0,0,0,15,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,14,0,0,0,0,0,10,6.57,0.0,0.0,7.0,1.0,15.0,0.0,0,18.0,0,4,0.0,0,0.0,0.0,1.0,0,3.0,0,18.0,2.0,0,1.0,0,0,0,0,14.0,0,0,0,0,0,0,0,0,0,0,0,0.0,0,0,0.0,0,4.0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,0,0,0,0,14.78
3,77.84,40.88,2012.84,0,0,29,4,101.74,27.44,28,0,0,10,0,0,0,15,0,21,0,8,0,0,0,1,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,1,0,5,5,0,1,0,0,10,7.26,70.0,0.0,0.0,0.0,0.0,0.0,0,50.0,9,8,3.0,0,0.0,0.0,0.0,0,60.0,0,9.0,1.0,0,11.0,0,1,1,0,0.0,1,0,0,0,0,0,0,0,0,0,0,0.0,0,0,3.0,0,6.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,46.0,0,0,0,0,10.37
4,26.09,25.09,2009.74,20,0,6,6,63.85,39.89,4,2,0,2,0,0,0,1,0,27,0,2,0,0,3,1,14,0,0,0,0,0,0,0,24,0,0,0,0,1,0,0,0,1,2,0,0,3,8,0,3,0,0,10,4.98,0.0,3.0,6.0,45.0,0.0,2.0,0,25.0,4,0,0.0,0,2.0,0.0,21.0,0,26.0,0,5.0,23.0,0,8.0,0,0,0,0,0.0,0,0,0,0,2,0,0,1,0,0,0,1.0,0,0,0.0,0,3.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,37.0,0,0,0,0,13.46


In [60]:
# Scale data and train KNN

scaler = StandardScaler()
X = scaler.fit_transform(df_user_profile.values)

pca = PCA(n_components=23)
X = pca.fit_transform(X)

# p=1: "manhattan_distance", p=2: "euclidean_distance"
knn = NearestNeighbors(n_neighbors=n_closer_users, p=2, n_jobs=-1)

knn.fit(X)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=-1, n_neighbors=10, p=2,
                 radius=1.0)

In [61]:
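def get_neighbors(model, account_id, X_users, k=None):
    """Return the index labels of the users closest to account_id according to model"""
    # Use the model passed as an argument instead of the global knn
    neighbors = model.kneighbors([X_users.loc[account_id].values], n_neighbors=k,
                                 return_distance=False)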
    closer_users = X_users.iloc[neighbors[0], :].index
    return closer_users


def make_recomendation(model, user, X_users, user_content_keep, user_content, user_cat, n=20, k=None):
    """
    For a given user, return content that the closest users watched and the user did not
    
    :param model: fitted NearestNeighbors model
    :param user: account_id of the user to recommend for
    :param X_users: matrix with user profile vectors
    :param user_content_keep: content most seen by each user that remains available
    :param user_content: content most seen by each user
    :param user_cat: categories most seen by each user (currently unused)
    :param n: number of recommendations to return
    :param k: number of neighbors to use
    """
    # Get closer users
    closer_users = get_neighbors(model, user, X_users, k=k)

    # Get content seen by closer users
    recomendation = list(chain(*user_content_keep[closer_users].values))

    # Remove content that the user has already seen and select the top n
    recomendation = np.array(recomendation)
    recomendation = recomendation[~np.isin(recomendation, user_content[user])][:n]
    return recomendation


def create_submit(model, X_users, user_content_keep, user_content, user_cat):
    """
    For every user, return the content that their closest users watched and they did not
    
    :param model: fitted NearestNeighbors model
    :param X_users: matrix with user profile vectors
    :param user_content_keep: content most seen by each user that remains available
    :param user_content: content most seen by each user
    :param user_cat: categories most seen by each user (currently unused)
    """

    # Get closer users
#     print('Getting neighbors...')
    closer_users = model.kneighbors(X_users, return_distance=False)
    closer_users = list(map(lambda xs: X_users.iloc[xs].index.values, closer_users))
    
    # Get content seen by closer users
#     print('Mapping neighbors to content...')
    recomendation = list(map(lambda xs: list(chain(*user_content_keep[xs].values)),
                             closer_users))

    # Remove repeated content, ordering by frequency and keeping at most 60 items
#     print('Deleting repeating content...')
    recomendation = list(map(lambda xs: [x[0] for x in Counter(xs).most_common()][:60], recomendation))

    # Remove content that each user has already seen
    recomendation = list(map(np.array, recomendation))
    
#     print('Clean content already seen...')
    submit = list(map(lambda xs, user: xs[~np.isin(xs, user_content[user])],
                      recomendation, X_users.index))
    
#     print('Creating submit...')
    submit = pd.Series(dict(zip(X_users.index.values, submit)))
    
    all_content = np.array([x[0] for x in Counter(chain(*user_content_keep.values)).most_common()])

    submit = submit.map(lambda xs:
               xs if len(xs) >= 20 
               else np.concatenate([xs, all_content[:20]])
    )
    
#     print('Done!\n\n')
    return submit

In [62]:
%%time

X = pd.DataFrame(X, index=df_user_profile.index.values)

submit_knn = create_submit(knn, X,
                           top_content_by_user_keep, top_content_by_user, n_cat_most_seen_by_user)

submit_knn.iloc[:5]

CPU times: user 7min 13s, sys: 4.24 s, total: 7min 17s
Wall time: 4min 13s


In [63]:
submit_knn.map(len).min(), submit_knn.map(len).max()

(20, 60)

#### Join Baselines

In [64]:
submit_most_common.name = "most_common"
submit_knn.name = "knn"

recomendations = pd.merge(submit_most_common, submit_knn, left_index=True, right_index=True)
recomendations = recomendations.apply(lambda row: np.array(list(row['most_common']) + list(row['knn'])), axis=1)

recomendations

account_id
0         [2040.0, 1800.0, 774.0, 2299.0, 2178.0, 604.0,...
1         [2012.0, 2942.0, 1462.0, 1573.0, 2152.0, 1957....
2         [2040.0, 4133.0, 2012.0, 3900.0, 2942.0, 3353....
3         [2012.0, 2942.0, 1462.0, 1573.0, 2152.0, 1957....
4         [2040.0, 1800.0, 774.0, 2299.0, 604.0, 387.0, ...
                                ...                        
113876    [2040.0, 4133.0, 2012.0, 3900.0, 2942.0, 3353....
113877    [2040.0, 4133.0, 2012.0, 3900.0, 2942.0, 3353....
113878    [3382.0, 3433.0, 3863.0, 3057.0, 1999.0, 3832....
113879    [2040.0, 774.0, 2299.0, 2178.0, 604.0, 387.0, ...
113880    [2040.0, 1800.0, 774.0, 2299.0, 2178.0, 604.0,...
Length: 113881, dtype: object

In [65]:
recomendations.map(len).min(), recomendations.map(len).max()

(40, 118)
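
Because both baselines can propose the same content, the concatenated list may contain repeats (the minimum length of 40 is simply 20 + 20). If duplicates are unwanted before ranking, one option, sketched here as an assumption and not as a step the notebook performs, is an order-preserving de-duplication. The helper name `dedupe_keep_order` is hypothetical.

```python
def dedupe_keep_order(items):
    """Remove repeated ids while keeping the first occurrence of each."""
    # dict preserves insertion order, so the first position of each id wins.
    return list(dict.fromkeys(items))


merged = [2040.0, 1800.0, 774.0] + [2040.0, 604.0, 774.0]
print(dedupe_keep_order(merged))  # [2040.0, 1800.0, 774.0, 604.0]
```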

In [None]:
# recomendations.map(lambda xs: list(chain(*[[x[0]]*x[1] for x in Counter(xs).most_common()]))[:20] )

account_id
0         {2040.0: 2, 1800.0: 1, 774.0: 1, 2299.0: 1, 21...
1         {2012.0: 3, 2942.0: 2, 1462.0: 1, 1573.0: 1, 2...
2         {2040.0: 2, 3900.0: 1, 2012.0: 2, 2942.0: 3, 3...
3         {2012.0: 3, 2942.0: 3, 1462.0: 2, 1573.0: 1, 2...
4         {2040.0: 2, 1800.0: 1, 774.0: 1, 2299.0: 1, 60...
                                ...                        
112271    {2040.0: 3, 1800.0: 1, 774.0: 1, 2299.0: 1, 21...
112278    {2160.0: 3, 1139.0: 2, 680.0: 2, 20.0: 3, 712....
112298    {2040.0: 1, 3900.0: 1, 2012.0: 1, 2942.0: 1, 3...
112348    {1462.0: 1, 1611.0: 2, 2177.0: 1, 2190.0: 1, 3...
112356    {2040.0: 3, 3900.0: 3, 2012.0: 2, 2942.0: 3, 3...
Length: 103622, dtype: object
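
The commented-out line above sorts each user's candidates by how many times they appear across the two baselines, repeating each id as many times as it was counted. A minimal sketch of a related variant, which keeps each id only once while preserving that frequency ordering, could look like this. The helper name `order_by_frequency` and the toy ids are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def order_by_frequency(candidates, n=20):
    """Return up to n unique candidates, most frequent first.

    Ties keep the order in which the ids were first seen, because
    Counter.most_common() is stable for equal counts.
    """
    counts = Counter(candidates)
    return [content_id for content_id, _ in counts.most_common(n)]


# Toy candidate list: 2012.0 appears three times, 2040.0 twice, 1800.0 once.
example = [2040.0, 1800.0, 2012.0, 2040.0, 2012.0, 2012.0]
top_two = order_by_frequency(example, n=2)
```

Applied per user, this would be called as `recomendations.map(order_by_frequency)`, under the assumption that candidates proposed by both baselines deserve priority.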

## Neural Network to rank recommendations

In [66]:
df_views = df.groupby(['account_id', 'content_id'], as_index=False)['ranking'].max()

df_views.tail()

        account_id  content_id  ranking
971561      113878      4343.0        1
971562      113879      1800.0        4
971563      113880      2183.0        1
971564      113880      3663.0        1
971565      113880      3810.0        1


- We create a class that wraps our model and lets us train it and obtain recommendations

In [67]:
# Thin wrapper around the Keras Embedding layer that reshapes its (1, n_factors) output into a flat (n_factors,) vector
class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors
    
    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x


class CollaborativeFilterKeras:
    def __init__(self, users, content, rating, embedding_size=50):
        
        self.embedding_size = embedding_size
        
        self.n_users = users.nunique()
        self.n_content = content.nunique()
        self.min_rating = min(rating)
        self.max_rating = max(rating)
        # Encode ids
        self.user_enc = LabelEncoder()
        self.content_enc = LabelEncoder()
        
        self.users = self.user_enc.fit_transform(users.values)
        self.content = self.content_enc.fit_transform(content.values)

        self.rating = rating.values.astype(np.float32)

        self.model = None
        self.history = None

    def compile_mode(self):
        user = Input(shape=(1,))
        u = EmbeddingLayer(self.n_users, self.embedding_size)(user)

        movie = Input(shape=(1,))
        m = EmbeddingLayer(self.n_content, self.embedding_size)(movie)

        x = Concatenate()([u, m])
        x = Dropout(0.1)(x)

        # x = Dense(10, kernel_initializer='he_normal')(x)
        # x = Activation('relu')(x)
        # x = Dropout(0.3)(x)

        x = Dense(50, kernel_initializer='he_normal')(x)
        x = Activation('relu')(x)
        x = Dropout(0.3)(x)

        x = Dense(1, kernel_initializer='he_normal')(x)
        x = Activation('sigmoid')(x)
        x = Lambda(lambda x: x * (self.max_rating - self.min_rating) + self.min_rating)(x)

        model = Model(inputs=[user, movie], outputs=x, name="Flow")
        model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.01))
        self.model = model

    def summary(self):
        return self.model.summary()

    def fit_model(self, batch_size=64, epochs=10):
        self.history = self.model.fit(x=[self.users, self.content], y=self.rating, batch_size=batch_size, epochs=epochs, verbose=1)
    
    def predict_all(self, recomendations, n_top=20):
        """
        Predict the rating of each user for each of their candidate contents.
        Return, for each user, the n_top candidates with the highest predicted rating.
        """
        submit = {}
        users_ids_enc = self.user_enc.transform(recomendations.index.values)

        print(f'make predictions over: {len(recomendations)} users')
        idx_prints = np.linspace(1, len(recomendations)-1, num=15).astype(int)

        for idx, user_id in enumerate(recomendations.index.values):
            if idx in idx_prints:
                print(f'{idx} users ready...')
            # candidate content for this user (no filtering is applied here)
            content_ids = recomendations[user_id]

            # Make prediction
            preds = self.model.predict([  np.array([users_ids_enc[idx]] * len(content_ids)),
                                          self.content_enc.transform(content_ids)
                                      ])
            # Select the n_top candidates with the highest predicted rating
            top20 = content_ids[np.argsort(preds.flatten())[-n_top:][::-1]]

            # Save result with real ids
            submit[user_id] = top20

        return submit
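
The last two layers of `compile_mode` squash the network output with a sigmoid and then rescale it to the rating range, so predictions can never fall outside `[min_rating, max_rating]`. The sketch below reproduces that arithmetic in plain Python. The default range of 1 to 10 is an assumption taken from the score description at the top of the notebook; the class itself reads the bounds from the data.

```python
import math


def scaled_sigmoid(z, min_rating=1.0, max_rating=10.0):
    """Map a raw network output z into the open interval (min_rating, max_rating)."""
    s = 1.0 / (1.0 + math.exp(-z))  # sigmoid, always in (0, 1)
    return s * (max_rating - min_rating) + min_rating


# A raw output of 0 lands exactly in the middle of the rating range.
midpoint = scaled_sigmoid(0.0)
low = scaled_sigmoid(-5.0)
high = scaled_sigmoid(5.0)
```

One consequence of this design is that the extreme ratings are only reached asymptotically, which may matter if many scores sit exactly at the bounds.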

In [68]:
flow_model = CollaborativeFilterKeras(users=df_views.account_id, 
                                      content=df_views.content_id, 
                                      rating=df_views.ranking)

In [69]:
flow_model.compile_mode()
flow_model.summary()

Model: "Flow"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 1, 50)        5694050     input_3[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 1, 50)        203200      input_4[0][0]                    
_______________________________________________________________________________________________
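
The `Param #` values of the two embedding layers are consistent with one row of `embedding_size` weights per encoded id. A quick arithmetic check, using only numbers printed in this notebook (the content count is inferred from the summary, not stated anywhere else):

```python
def embedding_params(n_items, n_factors):
    """Number of trainable weights in an embedding table of n_items rows."""
    return n_items * n_factors


n_factors = 50       # embedding_size default of CollaborativeFilterKeras
n_users = 113881     # length of the recomendations Series

user_params = embedding_params(n_users, n_factors)

# Inferred: the content embedding has 203200 weights, so 203200 / 50 rows.
n_content_inferred = 203200 // n_factors
```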

In [70]:
%%time

flow_model.fit_model(batch_size=512, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
CPU times: user 44min 30s, sys: 3min 40s, total: 48min 11s
Wall time: 28min 23s


In [71]:
%%time

submit_final = flow_model.predict_all(recomendations, n_top=40)

make predictions over: 113881 users
1 users ready
8135 users ready
16269 users ready
24403 users ready
32537 users ready
40672 users ready
48806 users ready
56940 users ready
65074 users ready
73208 users ready
81343 users ready
89477 users ready
97611 users ready
105745 users ready
113880 users ready
CPU times: user 1h 35min 18s, sys: 1min 56s, total: 1h 37min 14s
Wall time: 1h 40min 32s
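
Most of this wall time likely comes from calling `model.predict` once per user. A possible alternative, sketched below under stated assumptions, is to flatten all (user, content) pairs, score them in a single call, and split the scores back per user. The function `rank_candidates_batched` and the `predict_fn` callable are hypothetical; with the real model, `predict_fn` could wrap `flow_model.model.predict([u, c], batch_size=...)`. A stand-in scorer is used here so the example runs without Keras.

```python
import numpy as np


def rank_candidates_batched(user_codes, candidates, predict_fn, n_top=20):
    """Score every (user, content) pair in one call and rank per user.

    user_codes: 1-D int array with one encoded user per entry of `candidates`.
    candidates: list of 1-D arrays of encoded content ids, one per user.
    predict_fn: callable (users, contents) -> array of scores (assumed).
    """
    lengths = np.array([len(c) for c in candidates])
    users_flat = np.repeat(user_codes, lengths)      # user code repeated per candidate
    content_flat = np.concatenate(candidates)
    scores = np.asarray(predict_fn(users_flat, content_flat)).ravel()

    # Split the flat score vector back into one chunk per user.
    bounds = np.cumsum(lengths)[:-1]
    ranked = []
    for cands, user_scores in zip(candidates, np.split(scores, bounds)):
        order = np.argsort(user_scores)[-n_top:][::-1]  # best score first
        ranked.append(np.asarray(cands)[order])
    return ranked


# Stand-in scorer: the score is simply the content code.
fake_predict = lambda users, contents: contents.astype(float)

ranked = rank_candidates_batched(
    user_codes=np.array([0, 1]),
    candidates=[np.array([3, 1, 2]), np.array([5, 4])],
    predict_fn=fake_predict,
    n_top=2,
)
```

For the full dataset the flattened arrays hold up to roughly 113881 × 118 pairs, so processing users in chunks may be needed if memory is tight.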


In [72]:
submit_final = pd.Series(submit_final)

submit_final

0         [491.0, 496.0, 852.0, 1139.0, 393.0, 2160.0, 2...
1         [2394.0, 2629.0, 4233.0, 3738.0, 2722.0, 3805....
2         [1139.0, 3409.0, 2901.0, 3752.0, 2091.0, 2091....
3         [2178.0, 2901.0, 2901.0, 3752.0, 3674.0, 2040....
4         [491.0, 491.0, 496.0, 496.0, 2160.0, 2160.0, 2...
                                ...                        
113876    [491.0, 1139.0, 2160.0, 2160.0, 712.0, 2040.0,...
113877    [491.0, 1139.0, 2160.0, 2160.0, 712.0, 1983.0,...
113878    [2160.0, 186.0, 185.0, 184.0, 2040.0, 2040.0, ...
113879    [491.0, 1139.0, 2160.0, 2160.0, 2178.0, 2180.0...
113880    [2160.0, 2178.0, 2040.0, 2040.0, 2040.0, 3023....
Length: 113881, dtype: object

In [75]:
write_submit(submit_final.map(lambda xs: xs[:20]), file="/gdrive/MyDrive/Colab Notebooks/data/submit_final.csv")
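
The ranked lists shown above contain repeated ids (for example `491.0, 491.0` for user 4), because the joined candidate lists include the same content from both baselines. If repeated ids are not wanted in the submission, one option is to drop duplicates while keeping the ranking order before truncating to 20. This is a minimal sketch; `top_unique` is a hypothetical helper, not a function from the notebook.

```python
def top_unique(ranked_ids, n=20):
    """Keep the first occurrence of each id, preserving order, then take the first n."""
    return list(dict.fromkeys(ranked_ids))[:n]


# Toy ranked list with repeats, similar in shape to the output for user 4.
deduplicated = top_unique([491.0, 491.0, 496.0, 496.0, 2160.0, 2160.0], n=3)
```

It could replace the slicing lambda as `submit_final.map(top_unique)`. Since `predict_all` was called with `n_top=40`, a user may still end up with fewer than 20 ids after deduplication.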