# Recommendation system

This notebook merges the datasets and builds a score that measures how compatible one game is with others. It then builds an item-to-item recommendation system, which ranks games by how similar they are to a given game; similarity is measured with cosine similarity.

### Importing libraries

These libraries let us load and manipulate the data and compute similarities.

In [31]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import sys
sys.path.insert(0, '../')
import Herramientas as Herr

### Loading the data

We read the files to prepare them for the recommendation system.

In [32]:
data_review = pd.read_csv('../datasets/australian_reviews_listo.csv')
data_output= pd.read_csv('../datasets/output.csv')

We inspect the contents of both DataFrames (data_review and data_output).

In [33]:
data_review

Unnamed: 0,user_id,user_url,posted,item_id,recommend,sentiment_analysis
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,2011,1250,True,1
1,js41637,http://steamcommunity.com/id/js41637,2014,251610,True,1
2,evcentric,http://steamcommunity.com/id/evcentric,2014,248820,True,1
3,doctr,http://steamcommunity.com/id/doctr,2013,250320,True,2
4,maplemage,http://steamcommunity.com/id/maplemage,2014,211420,True,1
...,...,...,...,...,...,...
59270,BonnieMTD,http://steamcommunity.com/id/BonnieMTD,2015,400,True,2
59271,amillionlemons,http://steamcommunity.com/id/amillionlemons,2015,313120,True,1
59272,keepit1hunid,http://steamcommunity.com/id/keepit1hunid,2014,17410,True,2
59273,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,2014,304930,True,1


In [34]:
data_output

Unnamed: 0,publisher,genres,app_name,title,release_date,price,early_access,item_id,developer
0,Kotoshiro,Action,Lost Summoner Kitty,Lost Summoner Kitty,2018,4.99,False,761140,Kotoshiro
1,Kotoshiro,Casual,Lost Summoner Kitty,Lost Summoner Kitty,2018,4.99,False,761140,Kotoshiro
2,Kotoshiro,Indie,Lost Summoner Kitty,Lost Summoner Kitty,2018,4.99,False,761140,Kotoshiro
3,Kotoshiro,Simulation,Lost Summoner Kitty,Lost Summoner Kitty,2018,4.99,False,761140,Kotoshiro
4,Kotoshiro,Strategy,Lost Summoner Kitty,Lost Summoner Kitty,2018,4.99,False,761140,Kotoshiro
...,...,...,...,...,...,...,...,...,...
68053,Laush Studio,Indie,Russian Roads,Russian Roads,2018,1.99,False,610660,Laush Dmitriy Sergeevich
68054,Laush Studio,Racing,Russian Roads,Russian Roads,2018,1.99,False,610660,Laush Dmitriy Sergeevich
68055,Laush Studio,Simulation,Russian Roads,Russian Roads,2018,1.99,False,610660,Laush Dmitriy Sergeevich
68056,SIXNAILS,Casual,EXIT 2 - Directions,EXIT 2 - Directions,2017,4.99,False,658870,"xropi,stev3ns"


### Data preprocessing

We merge the two DataFrames on item_id, keeping only the columns needed to build the score column and, later, the recommendation system.

In [35]:
df_recommend = pd.merge(data_output[['app_name','item_id']],data_review[['item_id','recommend','sentiment_analysis','user_id']],on='item_id')
df_recommend

Unnamed: 0,app_name,item_id,recommend,sentiment_analysis,user_id
0,Carmageddon Max Pack,282010,True,1,InstigatorAU
1,Carmageddon Max Pack,282010,True,1,InstigatorAU
2,Carmageddon Max Pack,282010,True,1,InstigatorAU
3,Half-Life,70,True,1,EizanAratoFujimaki
4,Half-Life,70,True,1,76561198020928326
...,...,...,...,...,...
110785,Counter-Strike: Condition Zero,80,False,1,76561198023508728
110786,Counter-Strike: Condition Zero,80,True,2,Lone_walker
110787,Counter-Strike: Condition Zero,80,True,2,virex4
110788,Counter-Strike: Condition Zero,80,True,2,KILLERamateur
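
Note that data_output holds one row per game and genre, so the merge repeats each review once per genre (see the three identical `Carmageddon Max Pack` rows above). The following is a minimal sketch of one way to avoid this. It uses toy data and the hypothetical names `games`, `reviews`, `unique_games` and `merged`, and it is not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for data_output and data_review (values are illustrative only)
games = pd.DataFrame({
    'app_name': ['Game A', 'Game A', 'Game B'],
    'genres': ['Action', 'Indie', 'Casual'],
    'item_id': [10, 10, 20],
})
reviews = pd.DataFrame({
    'item_id': [10, 20],
    'recommend': [True, False],
    'sentiment_analysis': [1, 2],
    'user_id': ['u1', 'u2'],
})

# Keep one row per game before merging so each review appears only once
unique_games = games[['app_name', 'item_id']].drop_duplicates()
merged = pd.merge(unique_games, reviews, on='item_id')
print(merged)
```

Because `pivot_table` averages repeated (game, user) pairs by default, the duplicates should not change the pivoted scores, but they do inflate any row counts computed on the merged DataFrame.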


We check how the merged data looks.

In [36]:
Herr.analizar_datos(df_recommend)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,app_name,[<class 'str'>],100.0,0.0,0
1,item_id,[<class 'int'>],100.0,0.0,0
2,recommend,[<class 'bool'>],100.0,0.0,0
3,sentiment_analysis,[<class 'int'>],100.0,0.0,0
4,user_id,[<class 'str'>],100.0,0.0,0


We define a function to apply to the "df_recommend" DataFrame that combines each user's recommendation and sentiment analysis into a single value.

In [37]:
def Score(filas):
    if (filas['sentiment_analysis'] == 0) & (filas['recommend'] == False):
        return 0
    elif (filas['sentiment_analysis'] == 1) & (filas['recommend'] == False):
        return 1
    elif (filas['sentiment_analysis'] == 2) & (filas['recommend'] == False):
        return 2
    elif (filas['sentiment_analysis'] == 0) & (filas['recommend'] == True):
        return 3
    elif (filas['sentiment_analysis'] == 1) & (filas['recommend'] == True):
        return 4
    elif (filas['sentiment_analysis'] == 2) & (filas['recommend'] == True):
        return 5
    return None
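
The mapping above amounts to adding 3 when the review recommends the game and then adding the sentiment value. As a sketch only, assuming `recommend` is boolean and `sentiment_analysis` only takes the values 0, 1 and 2, the same score can be computed without a row-wise `apply`. The toy DataFrame `df` is hypothetical.

```python
import pandas as pd

# Toy frame with the same two columns used by Score (values are illustrative)
df = pd.DataFrame({
    'recommend': [False, False, True, True],
    'sentiment_analysis': [0, 2, 0, 2],
})

# recommend contributes 0 or 3, sentiment_analysis contributes 0, 1 or 2
df['Score'] = df['recommend'].astype(int) * 3 + df['sentiment_analysis']
print(df['Score'].tolist())  # [0, 2, 3, 5]
```

Unlike `Score`, this version does not return None for unexpected values, so it is only equivalent when the assumptions above hold.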

We create a "Score" column to replace the "recommend" and "sentiment_analysis" columns by applying the function defined above.

In [38]:
df_recommend['Score']= df_recommend.apply(Score,axis=1)

For reference, we look at the distribution of "Score" values produced by the function. Most reviews recommend the game and carry a neutral comment (score 4).

In [39]:
df_recommend['Score'].value_counts()

Score
4    66556
5    26025
1     9052
3     5429
0     2858
2      870
Name: count, dtype: int64

We drop the columns that the recommendation system does not need, including the ones used to build the "Score" column.

In [40]:
df_recommend = df_recommend.drop(columns=['item_id','recommend','sentiment_analysis'])

Final check of the "df_recommend" DataFrame.

In [41]:
df_recommend

Unnamed: 0,app_name,user_id,Score
0,Carmageddon Max Pack,InstigatorAU,4
1,Carmageddon Max Pack,InstigatorAU,4
2,Carmageddon Max Pack,InstigatorAU,4
3,Half-Life,EizanAratoFujimaki,4
4,Half-Life,76561198020928326,4
...,...,...,...
110785,Counter-Strike: Condition Zero,76561198023508728,1
110786,Counter-Strike: Condition Zero,Lone_walker,5
110787,Counter-Strike: Condition Zero,virex4,5
110788,Counter-Strike: Condition Zero,KILLERamateur,5


Next, we process the "df_recommend" DataFrame and prepare it for a function that should return the 5 games most similar to the one given, based on the scores. The data is also left ready to be consumed by the API.

### Recomendacion_juego

def recomendacion_juego( product id ): Given a product id, it should return a list of 5 recommended games similar to the one entered.

#### Data processing

We start with a pivot table that uses 'app_name' as the index, the user IDs ('user_id') as columns, and 'Score' as the values.

In [43]:
matriz_pivot = df_recommend.pivot_table(index='app_name',columns='user_id',values='Score')
matriz_pivot

user_id,--000--,--ace--,--ionex--,-2SV-vuLB-Kg,-Azsael-,-Beave-,-I_AM_EPIC-,-Kenny,-Mad-,-PRoSlayeR-,...,zuilde,zukuta,zunbae,zuzuga2003,zv_odd,zvanik,zwanzigdrei,zy0705,zynxgameth,zyr0n1c
! That Bastard Is Trying To Steal Our Gold !,,,,,,,,,,,...,,,,,,,,,,
//N.P.P.D. RUSH//- The milk of Ultraviolet,,,,,,,,,,,...,,,,,,,,,,
0RBITALIS,,,,,,,,,,,...,,,,,,,,,,
10000000,,,,,,,,,,,...,,,,,,,,,,
100% Orange Juice,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sZone-Online,,,,,,,,,,,...,,,,,,,,,,
the static speaks my name,,,,,,,,,,,...,,,,,,,,,,
theHunter Classic,,,,,,,,,,,...,,,,1.0,,,,,,
theHunter: Primal,,,,,,,,,,,...,,,,,,,,,,
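
To make the shape of this table concrete, here is a minimal sketch on toy data (the `scores` DataFrame and its values are hypothetical). It also shows that `pivot_table` averages repeated (game, user) pairs by default and leaves NaN where a user has not reviewed a game.

```python
import pandas as pd

# Toy long-format scores; ('Game A', 'u1') appears twice on purpose
scores = pd.DataFrame({
    'app_name': ['Game A', 'Game A', 'Game A', 'Game B'],
    'user_id': ['u1', 'u1', 'u2', 'u1'],
    'Score': [4, 4, 2, 5],
})

# Games as rows, users as columns, mean Score as the cell value
pivot = scores.pivot_table(index='app_name', columns='user_id', values='Score')
print(pivot)
```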


We subtract each row's mean from its values so that the scores are centered on each game's average. The result is stored in the DataFrame "matriz_pivot_prom".

In [44]:
matriz_pivot_prom = matriz_pivot.sub(matriz_pivot.mean(axis=1),axis='index')
matriz_pivot_prom

user_id,--000--,--ace--,--ionex--,-2SV-vuLB-Kg,-Azsael-,-Beave-,-I_AM_EPIC-,-Kenny,-Mad-,-PRoSlayeR-,...,zuilde,zukuta,zunbae,zuzuga2003,zv_odd,zvanik,zwanzigdrei,zy0705,zynxgameth,zyr0n1c
! That Bastard Is Trying To Steal Our Gold !,,,,,,,,,,,...,,,,,,,,,,
//N.P.P.D. RUSH//- The milk of Ultraviolet,,,,,,,,,,,...,,,,,,,,,,
0RBITALIS,,,,,,,,,,,...,,,,,,,,,,
10000000,,,,,,,,,,,...,,,,,,,,,,
100% Orange Juice,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sZone-Online,,,,,,,,,,,...,,,,,,,,,,
the static speaks my name,,,,,,,,,,,...,,,,,,,,,,
theHunter Classic,,,,,,,,,,,...,,,,-1.269231,,,,,,
theHunter: Primal,,,,,,,,,,,...,,,,,,,,,,
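
The centering step can be sketched on a toy matrix (the `ratings` DataFrame below is hypothetical). Row means ignore NaN, and a game whose available scores are all equal ends up with zeros in every rated cell.

```python
import numpy as np
import pandas as pd

# Toy game-by-user matrix; NaN marks a missing review
ratings = pd.DataFrame(
    {'u1': [4.0, 5.0], 'u2': [2.0, np.nan], 'u3': [np.nan, 5.0]},
    index=['Game A', 'Game B'],
)

# Subtract each game's mean score (computed over its non-null cells)
centered = ratings.sub(ratings.mean(axis=1), axis='index')
print(centered)
```

Here `Game A` has mean 3, so its scores become 1 and -1, while `Game B` has mean 5, so both of its scores become 0.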


Cosine similarity cannot be computed on null values. We therefore fill the nulls with zeros and then apply the cosine similarity function.

In [45]:
matriz_pivot_prom = matriz_pivot_prom.fillna(0)
similar_cosine = cosine_similarity(matriz_pivot_prom)
similar_cosine

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
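
Some diagonal entries in the output above are 0 rather than 1. A likely explanation is that a game whose centered scores are all zero is a zero vector, and `cosine_similarity` returns 0 for it, even against itself. The sketch below illustrates this on a hypothetical toy matrix `X`. It also shows that mean-centered rows can produce negative similarities.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy mean-centered rows (one row per game, one column per user)
X = np.array([
    [1.0, -1.0, 0.0],   # game with varied scores
    [2.0, -2.0, 0.0],   # same direction as the first row
    [0.0, 0.0, 0.0],    # game whose centered scores are all zero
    [-1.0, 1.0, 0.0],   # opposite direction to the first row
])

sim = cosine_similarity(X)
print(sim.round(2))
```

In this toy case the first two games have similarity 1, the first and last have similarity -1, and the all-zero game has similarity 0 with every game, including itself.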

We convert the cosine similarity matrix to a DataFrame so that it is easier to read. The index of "matriz_pivot_prom" (the game names) is used for both the rows and the columns. We then inspect the result.

In [49]:
data_similar_cosine = pd.DataFrame(similar_cosine,index=matriz_pivot_prom.index,columns=matriz_pivot_prom.index)
data_similar_cosine

app_name,! That Bastard Is Trying To Steal Our Gold !,//N.P.P.D. RUSH//- The milk of Ultraviolet,0RBITALIS,"10,000,000",100% Orange Juice,100% Orange Juice - Krila & Kae Character Pack,1001 Spikes,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,12 is Better Than 6,...,liteCam Game: 100 FPS Game Capture,nail'd,oO,planetarian ~the reverie of a little planet~,resident evil 4 / biohazard 4,sZone-Online,the static speaks my name,theHunter Classic,theHunter: Primal,Астролорды: Облако Оорта
! That Bastard Is Trying To Steal Our Gold !,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
//N.P.P.D. RUSH//- The milk of Ultraviolet,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0RBITALIS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100% Orange Juice,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sZone-Online,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
the static speaks my name,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
theHunter Classic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
theHunter: Primal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


### Exporting the data

We export the DataFrame in Parquet format, which is both more compact and more efficient to read than CSV.

In [48]:
data_similar_cosine.to_parquet('../Data_parquet/data_similar_cosine.parquet')

### Function

The function receives the game's name as its parameter (the rows and columns of "data_similar_cosine" are labeled by 'app_name', not by a numeric ID). It sorts the games in descending order of their similarity to that game, skips the first entry, which is expected to be the game itself, and prints the next five, from most to least similar.

In [None]:
def Recomendacion_juego(id_juego):
    # id_juego is the game's name (app_name), which labels the rows and columns
    # Sort all games by similarity to id_juego; position 0 is assumed to be the game itself
    orden = data_similar_cosine.sort_values(by=id_juego,ascending=False).index[1:6]
    print(f'The 5 games most similar to {id_juego} are:\n ')
    for nro, game in enumerate(orden,start=1):
        print(f'No. {nro}: {game}')

Test run of the function.

In [None]:
Recomendacion_juego('the static speaks my name')

The 5 games most similar to the static speaks my name are:
 
No. 1: Call of Duty®: Modern Warfare® 2
No. 2: Layers of Fear
No. 3: Team Fortress 2
No. 4: Five Nights at Freddy's
No. 5: Unturned
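
For use in the API, a variant that returns a list instead of printing may be more convenient. The following is only a sketch: the helper `recomendar`, its parameters, and the toy matrix `sim` are hypothetical. It drops the queried game explicitly instead of assuming it sorts first, which matters for games whose self-similarity is 0.

```python
import pandas as pd

# Toy similarity matrix; in the notebook this role is played by data_similar_cosine
names = ['Game A', 'Game B', 'Game C', 'Game D']
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.5],
     [0.8, 1.0, 0.3, 0.2],
     [0.1, 0.3, 1.0, 0.0],
     [0.5, 0.2, 0.0, 1.0]],
    index=names, columns=names,
)

def recomendar(nombre_juego, similitudes, n=5):
    """Return up to n game names most similar to nombre_juego (hypothetical helper)."""
    if nombre_juego not in similitudes.columns:
        return []  # unknown game: nothing to recommend
    # Remove the game itself, then keep the n highest similarities
    scores = similitudes[nombre_juego].drop(nombre_juego)
    return scores.sort_values(ascending=False).head(n).index.tolist()

print(recomendar('Game A', sim, n=2))  # ['Game B', 'Game D']
```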


### Conclusion

This notebook used two DataFrames, "reviews" and "output". By merging and transforming them, we built a table of cosine similarity scores between games, where 0 means no similarity and 1 means perfect similarity (because the scores are mean-centered, negative values down to -1 are also possible and indicate opposite rating patterns). Finally, we designed and tested the function for use in the API. To open the last notebook, which contains the functions, click [here](../Notebooks/Data_funciones.ipynb).