In this file, I develop the second recommender system required for this project.
It works as follows:
![alt text](../img/user_item.jpg "Title")

Suppose we have a user 1 and a user 2. User 1 likes the Rubik's Cube, dice, and a card game. User 2 also likes the Rubik's Cube and dice, but has not played the card game. Put very simply, the user-item recommender system would recognize that these two users share similar tastes, and it would recommend the card game to user 2 because it is played by another user with similar tastes.

The algorithm rests on the idea that two users are similar when they like (in this case) the same games. So if there is a game that user 2 has not played but user 1 has, and the two users are similar, user 2 will probably like that game.
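
As a rough illustration of this idea, the sketch below encodes the two users from the example as like/not-played vectors and compares them. The game labels and the 0/1 encoding are assumptions made for this toy case only; the notebook itself works with review scores.

```python
import numpy as np

# Hypothetical toy data: one row per user, one column per game.
# 1 = the user likes the game, 0 = the user has not played it.
games = ["rubiks_cube", "dice", "card_game"]
likes = np.array([
    [1.0, 1.0, 1.0],  # user 1 likes all three
    [1.0, 1.0, 0.0],  # user 2 has not played the card game
])
user_1, user_2 = likes

# Cosine similarity between the two preference vectors.
similarity = user_1 @ user_2 / (np.linalg.norm(user_1) * np.linalg.norm(user_2))

# Games that user 1 likes and user 2 has not played are candidates for user 2.
candidates = [g for g, a, b in zip(games, user_1, user_2) if a == 1 and b == 0]

print(round(float(similarity), 3))
print(candidates)
```

The similarity is high because the two vectors overlap on two of three games, and the only candidate is the card game, matching the example in the figure.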

We start by importing the libraries we need.

In [1]:
# Data processing
import pandas as pd
import numpy as np
import scipy.stats as stats

# Data visualization
import seaborn as sns

# Similarity
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
juegos = pd.read_parquet("../Datasets/steam_games_complete.parquet")
juegos.head()

Unnamed: 0,item_id,item_name,developer,genres,tags,specs,release_date,price
88310,761140,Lost Summoner Kitty,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]","[Strategy, Action, Indie, Casual, Simulation]",[Single-player],2018-01-04,4.99
88311,643980,Ironbound,Secret Level SRL,"[Free to Play, Indie, RPG, Strategy]","[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",2018-01-04,0.0
88312,670290,Real Pool 3D - Poolians,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]","[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",2017-07-24,0.0
88313,767400,弹炸人2222,彼岸领域,"[Action, Adventure, Casual]","[Action, Adventure, Casual]",[Single-player],2017-12-07,0.99
88315,772540,Battle Royale Trainer,Trickjump Games Ltd,"[Action, Adventure, Simulation]","[Action, Adventure, Simulation, FPS, Shooter, ...","[Single-player, Steam Achievements]",2018-01-04,3.99


In [3]:
reseñas = pd.read_parquet("../Datasets/reviews_con_puntaje.parquet")
reseñas.head()

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,puntaje
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,76561197970982479,2
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,76561197970982479,2
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479,2
3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,js41637,2
4,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,js41637,2


In [4]:
user_items = pd.read_parquet("../EDA/user_items_complete.parquet")
user_items.head()

Unnamed: 0,item_id,item_name,playtime_forever,playtime_2weeks,user_id
0,10,Counter-Strike,6,0,76561197970982479
1,20,Team Fortress Classic,0,0,76561197970982479
2,30,Day of Defeat,7,0,76561197970982479
3,40,Deathmatch Classic,0,0,76561197970982479
4,50,Half-Life: Opposing Force,0,0,76561197970982479


Next, let's look at the size of these three datasets.

In [5]:
print(len(juegos))
print()
print(len(reseñas))
print()
print(len(user_items))


22530

59305

5153209


Let's start by filtering the user_items dataset, since it is by far the largest.

In [6]:
# First, look at the maximum value of the playtime_forever column in user_items
user_items["playtime_forever"].max()

642773

My reasoning is that this value is measured in minutes. If it were hours, 642773 / 24 is about 26782 days played, and 26782 / 365 is about 73 years played. That is absurd, so hours cannot be the unit. If we assume minutes instead: 642773 / 60 is about 10712 hours played, 10712 / 24 is about 446 days played, and 446 / 365 is a little over one year. That makes much more sense.
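
The unit check above can be reproduced in a few lines. This is only a restatement of the arithmetic; the variable names are mine.

```python
# Maximum of playtime_forever, copied from the cell output above.
max_playtime = 642773

# Interpretation 1: the value is in hours.
years_if_hours = max_playtime / 24 / 365

# Interpretation 2: the value is in minutes.
years_if_minutes = max_playtime / 60 / 24 / 365

print(f"{years_if_hours:.1f} years if the unit is hours")
print(f"{years_if_minutes:.2f} years if the unit is minutes")
```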


I will keep only the user-game records with more than 10 hours (600 minutes) of total playtime in the given game. Let's see how much this shrinks the dataset.

In [7]:
len(user_items[user_items["playtime_forever"] > 600])

983916

That takes the dataset from about 5.15 million rows down to 983916. That is still a lot of rows, but this filter shrinks the dataset by roughly 81% of its original size.
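
For reference, the size of the reduction can be checked directly from the two row counts printed above (the variable names here are illustrative):

```python
rows_before = 5153209  # len(user_items) before filtering
rows_after = 983916    # rows with playtime_forever > 600

# Fraction of rows removed by the filter.
reduction = 1 - rows_after / rows_before

print(f"{reduction:.1%} of the rows are removed")
```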

In [8]:
# Apply the filter to the dataset
user_items = user_items[user_items["playtime_forever"] > 600]
user_items

Unnamed: 0,item_id,item_name,playtime_forever,playtime_2weeks,user_id
8,300,Day of Defeat: Source,4733,0,76561197970982479
9,240,Counter-Strike: Source,1853,0,76561197970982479
16,6910,Deus Ex: Game of the Year Edition,2685,0,76561197970982479
17,7670,BioShock,633,0,76561197970982479
19,220,Half-Life 2,696,0,76561197970982479
...,...,...,...,...,...
5152671,370240,NBA 2K16,1533,19,76561198319916652
5152676,346330,BrainBread 2,756,0,76561198320038728
5153000,730,Counter-Strike: Global Offensive,4557,1698,ArkPlays7
5153001,346110,ARK: Survival Evolved,623,0,ArkPlays7


Next, I will take into account only those games that have more than 100 reviews.

In [9]:
# Boolean Series indexed by item_id: True when the game has more than 100 reviews
juegos_con_mas_de_100_reseñas = reseñas["item_id"].value_counts() > 100
# Spot check for one game (item_id values are strings)
juegos_con_mas_de_100_reseñas["240"]

True

In [10]:
len(reseñas[reseñas["item_id"] == "300"])

33

In [11]:
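# Build the mask one review at a time. The ~ negates each flag, so an entry is
# True when the review's game has 100 reviews or fewer.
lista_booleana_para_mascara = []
for item_id in reseñas["item_id"]:
    lista_booleana_para_mascara.append(~juegos_con_mas_de_100_reseñas[item_id])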
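
As a possible alternative to the explicit loop, the same kind of mask can be built in a vectorized way with `value_counts` and `isin`. The sketch below uses made-up reviews and a threshold of 2 so the effect is visible; all names in it are hypothetical. Note that, as written, the loop above negates each flag with `~`, so its mask corresponds to `other_reviews` here, while the stated goal (games with more reviews than the threshold) corresponds to `popular_reviews`.

```python
import pandas as pd

# Hypothetical toy reviews; the threshold is lowered to 2 for illustration.
reviews = pd.DataFrame({
    "item_id": ["10", "10", "10", "20", "30", "30"],
    "user_id": ["a", "b", "c", "a", "b", "c"],
})
threshold = 2

# Number of reviews per game, indexed by item_id.
review_counts = reviews["item_id"].value_counts()

# item_ids of the games whose review count exceeds the threshold.
popular_ids = review_counts[review_counts > threshold].index

# Vectorized mask: True where the review belongs to one of those games.
is_popular = reviews["item_id"].isin(popular_ids)

popular_reviews = reviews[is_popular]   # games above the threshold
other_reviews = reviews[~is_popular]    # games at or below the threshold

print(popular_reviews["item_id"].unique())
print(len(other_reviews))
```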


In [12]:
lista_booleana_para_mascara

[False,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 ...]
 

In [13]:
juegos_con_mas_de_100_reseñas.value_counts()

count
False    3583
True       99
Name: count, dtype: int64

In [14]:
reseñas = reseñas[lista_booleana_para_mascara]

In [26]:
# Shift the review score up by 1, so negative reviews score 1, neutral ones 2, and positive ones 3.
reseñas["puntaje"] = reseñas["puntaje"] + 1
reseñas

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,puntaje
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,76561197970982479,3
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479,3
3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,js41637,3
5,,"Posted November 29, 2013.",,239030,1 of 4 people (25%) found this review helpful,True,Very fun little game to play when your bored o...,js41637,3
7,,"Posted December 4, 2015.","Last edited December 5, 2015.",370360,No ratings yet,True,"""Run for fun? What the hell kind of fun is that?""",evcentric,1
...,...,...,...,...,...,...,...,...,...
59298,,Posted July 21.,,233270,No ratings yet,True,this is a very fun and nice 80s themed shooter...,76561198312638244,3
59299,,Posted July 10.,,130,No ratings yet,True,if you liked Half life i would really recommen...,76561198312638244,3
59300,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...,76561198312638244,3
59301,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...,76561198312638244,3


In [27]:
len(reseñas["item_id"].unique())

3583

In [28]:
matrix = reseñas.pivot_table(index='user_id',columns='item_id',values='puntaje')
matrix.head()

item_id,10,10090,10130,10140,10150,10180,10220,102500,102600,102700,...,98800,9900,99100,9930,99300,9940,99400,99700,99810,99910
user_id,,,,,,,,,,,,,,,,,,,,,
-2SV-vuLB-Kg,,,,,,,,,,,...,,,,,,,,,,
-Azsael-,,,,,,,,,,,...,,,,,,,,,,
-Mad-,,,,,,,,,,,...,,,,,,,,,,
-PRoSlayeR-,,,,,,,,,,,...,,,,,,,,,,
-SEVEN-,,,,,,,,,,,...,,,,,,,,,,
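
The matrix above is almost entirely NaN because each user has reviewed only a handful of games. A small sketch with invented data shows how `pivot_table` produces this shape; the three users and two games are assumptions for illustration.

```python
import pandas as pd

# Hypothetical reviews in long format: one row per (user, game) score.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "item_id": ["10", "20", "10", "20"],
    "puntaje": [3, 1, 3, 2],
})

# One row per user, one column per game; pairs with no review become NaN.
toy_matrix = toy.pivot_table(index="user_id", columns="item_id", values="puntaje")

print(toy_matrix)
```

Here u2 never reviewed game 20 and u3 never reviewed game 10, so those two cells are NaN.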


In [29]:
matrix_norm = matrix.subtract(matrix.mean(axis=1), axis = 'rows')
matrix_norm.head()

item_id,10,10090,10130,10140,10150,10180,10220,102500,102600,102700,...,98800,9900,99100,9930,99300,9940,99400,99700,99810,99910
user_id,,,,,,,,,,,,,,,,,,,,,
-2SV-vuLB-Kg,,,,,,,,,,,...,,,,,,,,,,
-Azsael-,,,,,,,,,,,...,,,,,,,,,,
-Mad-,,,,,,,,,,,...,,,,,,,,,,
-PRoSlayeR-,,,,,,,,,,,...,,,,,,,,,,
-SEVEN-,,,,,,,,,,,...,,,,,,,,,,
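
The normalization step subtracts each user's mean score from that user's row, so the values express how far a score is from the user's own average. A minimal sketch on invented data (the users and scores are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-game score matrix with missing reviews as NaN.
toy_matrix = pd.DataFrame(
    {"10": [3.0, 3.0, np.nan], "20": [1.0, np.nan, 2.0]},
    index=pd.Index(["u1", "u2", "u3"], name="user_id"),
)

# Row means ignore NaN, then each row is shifted by its own mean.
toy_norm = toy_matrix.subtract(toy_matrix.mean(axis=1), axis="index")

print(toy_norm)
```

In this toy case u1 ends up with 1 and -1, while u2 and u3, who each have a single review, end up with a single 0 and a NaN. A user with only one review therefore carries no information after centering.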


In [30]:
matrix_norm

item_id,10,10090,10130,10140,10150,10180,10220,102500,102600,102700,...,98800,9900,99100,9930,99300,9940,99400,99700,99810,99910
user_id,,,,,,,,,,,,,,,,,,,,,
-2SV-vuLB-Kg,,,,,,,,,,,...,,,,,,,,,,
-Azsael-,,,,,,,,,,,...,,,,,,,,,,
-Mad-,,,,,,,,,,,...,,,,,,,,,,
-PRoSlayeR-,,,,,,,,,,,...,,,,,,,,,,
-SEVEN-,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zukuta,,,,,,,,,,,...,,,,,,,,,,
zuzuga2003,,,,,,,,,,,...,,,,,,,,,,
zv_odd,,,,,,,,,,,...,,,,,,,,,,
zyr0n1c,,,,,,,,,,,...,,,,,,,,,,


In [31]:
# Fill NaN with 0s: zero entries add nothing to the dot product used by the cosine
matrix_norm_filled = matrix_norm.fillna(0)

# Compute the cosine similarity between users (rows)
similitud_del_coseno = cosine_similarity(matrix_norm_filled)

# Inspect the raw similarity array
similitud_del_coseno

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
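
Some diagonal entries in the array above are 0 rather than 1. A likely explanation is that those users have an all-zero row after centering and filling (for example, users with a single review), and `cosine_similarity` returns 0 for a zero vector, even against itself. The sketch below illustrates this on invented rows.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical centered, zero-filled rows (one per user).
centered = np.array([
    [1.0, -1.0, 0.0],  # user with varied scores
    [0.5, -0.5, 0.0],  # same direction as the first user, smaller magnitude
    [0.0, 0.0, 0.0],   # user whose centered scores are all zero
])

sim = cosine_similarity(centered)

print(sim)
```

The first two users point in the same direction, so their similarity is 1 regardless of magnitude, while every entry involving the all-zero user is 0, including its own diagonal entry.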

In [33]:
# Convert to a DataFrame labeled by user_id for readability
similitud_de_usuarios_coseno = pd.DataFrame(similitud_del_coseno, index=matrix_norm.index, columns=matrix_norm.index)

# Force the diagonal to 1 (users with an all-zero row get 0 even against themselves)
np.fill_diagonal(similitud_de_usuarios_coseno.values, 1)
similitud_de_usuarios_coseno

user_id,-2SV-vuLB-Kg,-Azsael-,-Mad-,-PRoSlayeR-,-SEVEN-,-Ultrix,-_PussyDestroyer_-,0-3-0,00000000000000000001227,0099654321891111,...,zraicis,zrustz16,zsharoarkbr,zucchin1,zuilde,zukuta,zuzuga2003,zv_odd,zyr0n1c,zzoptimuszz
user_id,,,,,,,,,,,,,,,,,,,,,
-2SV-vuLB-Kg,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-Azsael-,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-Mad-,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-PRoSlayeR-,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-SEVEN-,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zukuta,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
zuzuga2003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
zv_odd,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
zyr0n1c,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [None]:
# Pick an arbitrary user to find similar users for; in my case:
usuario_elegido = "-Azsael-"
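
One way the lookup for a chosen user could proceed is sketched below. It is not the notebook's own code: the helper `most_similar_users`, its parameters, and the 4-user matrix are all hypothetical stand-ins, with the matrix playing the role of `similitud_de_usuarios_coseno`.

```python
import pandas as pd

# Hypothetical 4-user similarity matrix standing in for the real one.
users = ["u1", "u2", "u3", "u4"]
similarity = pd.DataFrame(
    [[1.0, 0.8, 0.0, 0.3],
     [0.8, 1.0, 0.1, 0.0],
     [0.0, 0.1, 1.0, 0.5],
     [0.3, 0.0, 0.5, 1.0]],
    index=users, columns=users,
)

def most_similar_users(similarity, user, n=2, min_similarity=0.0):
    """Return the n users most similar to `user`, excluding the user itself."""
    scores = similarity[user].drop(index=user)   # remove the self-similarity
    scores = scores[scores > min_similarity]     # discard unrelated users
    return scores.sort_values(ascending=False).head(n)

print(most_similar_users(similarity, "u1"))
```

Games that these neighbors scored highly and the chosen user has not reviewed would then be natural candidates to recommend.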
