## Machine learning model
### **Recommender System (RS)**

This document prepares the file that serves as input to the item-item recommender function (`recomendacion_juego{product id}`). The algorithm predicts which games to recommend for a given item: based on how similar that item is to all the others, it recommends the 5 most similar games.

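### **Technique: Cosine similarity**

Similarity between items is measured with cosine similarity. Each game is represented as a vector of 0/1 features, and the similarity of two games is the cosine of the angle between their vectors: 1 when their feature sets coincide and 0 when they share nothing.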
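As a small worked example of the measure, the sketch below computes the cosine similarity of two games directly from their genre sets. For 0/1 vectors the dot product is the number of shared genres and each norm is the square root of the number of genres, so no matrix is needed. The helper name `cosine_similarity_sets` is an assumption made for illustration; the genre sets are the ones listed in this dataset for Lost Summoner Kitty (761140) and Ironbound (643980).

```python
import math

def cosine_similarity_sets(a, b):
    """Cosine similarity between two items described by sets of binary features."""
    if not a or not b:
        return 0.0  # an item without genres is a zero vector; treat its similarity as 0
    # shared features / (sqrt(|a|) * sqrt(|b|))
    return len(a & b) / math.sqrt(len(a) * len(b))

lost_summoner_kitty = {"Action", "Casual", "Indie", "Simulation", "Strategy"}
ironbound = {"Free to Play", "Indie", "RPG", "Strategy"}

similarity = cosine_similarity_sets(lost_summoner_kitty, ironbound)
print(round(similarity, 6))  # 2 shared genres / sqrt(5 * 4) -> 0.447214
```

The result agrees with the 761140 / 643980 entry of the similarity matrix computed later in this notebook.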



### Libraries

In [139]:
# pip install pandas
# pip install scikit-learn
#
import pandas as pd
#from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Load files

In [140]:
# Load the CSV into a DataFrame df
df = pd.read_csv("steam_games_final.csv")
print(df.shape)
df.head()

(29964, 8)


Unnamed: 0,item_id,developer,release_year,app_name,tags,specs,genres,price
0,761140,Kotoshiro,2018,Lost Summoner Kitty,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim...",['Single-player'],"['Action', 'Casual', 'Indie', 'Simulation', 'S...",4.99
1,643980,Secret Level SRL,2018,Ironbound,"['Free to Play', 'Strategy', 'Indie', 'RPG', '...","['Single-player', 'Multi-player', 'Online Mult...","['Free to Play', 'Indie', 'RPG', 'Strategy']",0.0
2,670290,Poolians.com,2017,Real Pool 3D - Poolians,"['Free to Play', 'Simulation', 'Sports', 'Casu...","['Single-player', 'Multi-player', 'Online Mult...","['Casual', 'Free to Play', 'Indie', 'Simulatio...",
3,767400,彼岸领域,2017,弹炸人2222,"['Action', 'Adventure', 'Casual']",['Single-player'],"['Action', 'Adventure', 'Casual']",0.99
4,772540,Trickjump Games Ltd,2018,Battle Royale Trainer,"['Action', 'Adventure', 'Simulation', 'FPS', '...","['Single-player', 'Steam Achievements']","['Action', 'Adventure', 'Simulation']",3.99


In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29964 entries, 0 to 29963
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   item_id       29964 non-null  int64  
 1   developer     28779 non-null  object 
 2   release_year  29964 non-null  int64  
 3   app_name      29963 non-null  object 
 4   tags          29803 non-null  object 
 5   specs         29295 non-null  object 
 6   genres        28730 non-null  object 
 7   price         27607 non-null  float64
dtypes: float64(1), int64(2), object(5)
memory usage: 1.8+ MB


In [142]:
# Replace NaN with empty lists in the list-like columns
df['tags'] = df['tags'].apply(lambda x: [] if pd.isna(x) else x)
df['specs'] = df['specs'].apply(lambda x: [] if pd.isna(x) else x)
df['genres'] = df['genres'].apply(lambda x: [] if pd.isna(x) else x)
df.head(3)

Unnamed: 0,item_id,developer,release_year,app_name,tags,specs,genres,price
0,761140,Kotoshiro,2018,Lost Summoner Kitty,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim...",['Single-player'],"['Action', 'Casual', 'Indie', 'Simulation', 'S...",4.99
1,643980,Secret Level SRL,2018,Ironbound,"['Free to Play', 'Strategy', 'Indie', 'RPG', '...","['Single-player', 'Multi-player', 'Online Mult...","['Free to Play', 'Indie', 'RPG', 'Strategy']",0.0
2,670290,Poolians.com,2017,Real Pool 3D - Poolians,"['Free to Play', 'Simulation', 'Sports', 'Casu...","['Single-player', 'Multi-player', 'Online Mult...","['Casual', 'Free to Play', 'Indie', 'Simulatio...",


In [143]:
# Prepare the data for building the dummy variables: strip brackets and quotes
df['specs'] = df['specs'].apply(lambda x: str(x).replace('[', '').replace(']', '').replace("'", ''))
df['tags'] = df['tags'].apply(lambda x: str(x).replace('[', '').replace(']', '').replace("'", ''))
df['genres'] = df['genres'].apply(lambda x: str(x).replace('[', '').replace(']', '').replace("'", ''))

df.head(3)

Unnamed: 0,item_id,developer,release_year,app_name,tags,specs,genres,price
0,761140,Kotoshiro,2018,Lost Summoner Kitty,"Strategy, Action, Indie, Casual, Simulation",Single-player,"Action, Casual, Indie, Simulation, Strategy",4.99
1,643980,Secret Level SRL,2018,Ironbound,"Free to Play, Strategy, Indie, RPG, Card Game,...","Single-player, Multi-player, Online Multi-Play...","Free to Play, Indie, RPG, Strategy",0.0
2,670290,Poolians.com,2017,Real Pool 3D - Poolians,"Free to Play, Simulation, Sports, Casual, Indi...","Single-player, Multi-player, Online Multi-Play...","Casual, Free to Play, Indie, Simulation, Sports",
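The `replace` chain above removes every apostrophe, including any that belongs to a label. A hedged alternative, assuming each cell holds a Python list literal as the `head()` output suggests, is to parse the cell with `ast.literal_eval` and join the labels afterwards. The helper names `parse_list_cell` and `to_dummy_string` and the label `Player's Choice` are hypothetical and only serve the illustration.

```python
import ast

def parse_list_cell(value):
    """Turn a cell like "['Action', 'Indie']" into a list; missing values become []."""
    # NaN is a float and the only value that is not equal to itself
    if value is None or (isinstance(value, float) and value != value):
        return []
    if isinstance(value, list):
        return value
    return ast.literal_eval(value)

def to_dummy_string(value, sep=", "):
    """Join the parsed labels with the separator later passed to str.get_dummies."""
    return sep.join(parse_list_cell(value))

print(to_dummy_string("['Single-player', 'Multi-player']"))
print(repr(to_dummy_string(float("nan"))))
# Hypothetical label containing an apostrophe, which survives the parsing
print(to_dummy_string("[\"Player's Choice\", 'Action']"))
```

With this approach a missing cell still ends up as an empty string, so the corresponding row of dummy variables is all zeros, as in the original cells.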


In [144]:
# Create dummy variables
dummy_df1= df['specs'].str.get_dummies(', ') # categories are separated by a comma and a space (', ')
dummy_df1

Unnamed: 0,Captions available,Co-op,Commentary available,Cross-Platform Multiplayer,Downloadable Content,Full controller support,Game demo,In-App Purchases,Includes Source SDK,Includes level editor,...,Single-player,Stats,Steam Achievements,Steam Cloud,Steam Leaderboards,Steam Trading Cards,Steam Turn Notifications,Steam Workshop,SteamVR Collectibles,Valve Anti-Cheat enabled
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,1,0,0,...,1,0,1,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,1,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29959,0,0,0,0,0,0,0,0,0,0,...,1,0,1,1,0,0,0,0,0,0
29960,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
29961,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,0,0,0,0,0
29962,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,1,0,0,0,0
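To make the behaviour of `str.get_dummies` concrete, here is a minimal sketch on a three-row Series. The first two `specs` strings are the ones shown above for items 761140 and 772540; the third row, with id 0, is a made-up item with an empty string, standing in for a game whose `specs` was missing.

```python
import pandas as pd

specs = pd.Series(
    ["Single-player", "Single-player, Steam Achievements", ""],
    index=pd.Index([761140, 772540, 0], name="item_id"),  # id 0 is a placeholder
)

# One 0/1 column per distinct label; labels are split on the ', ' separator
dummies = specs.str.get_dummies(sep=", ")
print(dummies)
```

The empty string produces a row of zeros, which is how games without any label are encoded before the similarity step.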


In [145]:
# Create dummy variables
dummy_df2= df['genres'].str.get_dummies(', ') # categories are separated by a comma and a space (', ')
dummy_df2

Unnamed: 0,Accounting,Action,Adventure,Animation &amp; Modeling,Audio Production,Casual,Design &amp; Illustration,Early Access,Education,Free to Play,...,Photo Editing,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing
0,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,1,0,0,0
2,0,0,0,0,0,1,0,0,0,1,...,0,0,0,1,0,1,0,0,0,0
3,0,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29959,0,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29960,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
29961,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
29962,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0


In [146]:
# Fix the column names that contain an HTML entity, for readability

new_names_cols = {'Animation &amp; Modeling':'Animation & Modeling', 'Design &amp; Illustration':'Design & Illustration'}
dummy_df2 = dummy_df2.rename(columns=new_names_cols)
dummy_df2.head()

Unnamed: 0,Accounting,Action,Adventure,Animation & Modeling,Audio Production,Casual,Design & Illustration,Early Access,Education,Free to Play,...,Photo Editing,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing
0,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,1,0,0,0
2,0,0,0,0,0,1,0,0,0,1,...,0,0,0,1,0,1,0,0,0,0
3,0,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [147]:
# Create dummy variables
#dummy_df3= df['tags'].str.get_dummies(', ') # categories are separated by a comma and a space (', ')
#dummy_df3
#len(dummy_df3.columns)

In [148]:
columnas_comunes = set(dummy_df1.columns) & set(dummy_df2.columns)

if columnas_comunes:
    print("There are common columns:", columnas_comunes)
else:
    print("No common columns.")

No common columns.


The recommender computes cosine similarity on genres only (the `genres` variable). `specs`, `tags` or other variables could also be used, but to keep the model and its computational cost manageable we focus on genre.

In [149]:
# Use item_id as the index
dummy_df2 = dummy_df2.set_index(pd.Index(df['item_id']))
dummy_df2

item_id,Accounting,Action,Adventure,Animation & Modeling,Audio Production,Casual,Design & Illustration,Early Access,Education,Free to Play,...,Photo Editing,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing
761140,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
643980,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,1,0,0,0
670290,0,0,0,0,0,1,0,0,0,1,...,0,0,0,1,0,1,0,0,0,0
767400,0,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
772540,0,1,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745400,0,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
773640,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
733530,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
610660,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0


In [150]:
# Takes about 2 minutes.
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(dummy_df2)

In [151]:
# Convert the cosine similarity matrix to a DataFrame for easier inspection
cosine_sim_df = pd.DataFrame(cosine_sim, columns=dummy_df2.index, index=dummy_df2.index)
cosine_sim_df

item_id,761140,643980,670290,767400,772540,774276,774277,774278,768800,770380,...,761480,771810,767590,747320,769330,745400,773640,733530,610660,658870
761140,1.000000,0.447214,0.600000,0.516398,0.516398,0.447214,0.447214,0.447214,0.670820,0.600000,...,0.316228,0.516398,0.632456,0.316228,0.632456,0.670820,0.894427,0.774597,0.516398,0.632456
643980,0.447214,1.000000,0.447214,0.000000,0.000000,0.500000,0.500000,0.500000,0.250000,0.447214,...,0.353553,0.288675,0.353553,0.707107,0.353553,0.250000,0.500000,0.577350,0.288675,0.353553
670290,0.600000,0.447214,1.000000,0.258199,0.258199,0.894427,0.894427,0.894427,0.670820,0.400000,...,0.316228,0.258199,0.632456,0.316228,0.632456,0.447214,0.670820,0.516398,0.516398,0.632456
767400,0.516398,0.000000,0.258199,1.000000,0.666667,0.000000,0.000000,0.000000,0.288675,0.774597,...,0.408248,0.666667,0.408248,0.000000,0.408248,0.866025,0.288675,0.333333,0.000000,0.408248
772540,0.516398,0.000000,0.258199,0.666667,1.000000,0.288675,0.288675,0.288675,0.288675,0.516398,...,0.408248,0.666667,0.000000,0.000000,0.000000,0.577350,0.288675,0.000000,0.333333,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745400,0.670820,0.250000,0.447214,0.866025,0.577350,0.250000,0.250000,0.250000,0.500000,0.894427,...,0.707107,0.866025,0.707107,0.353553,0.707107,1.000000,0.500000,0.577350,0.288675,0.707107
773640,0.894427,0.500000,0.670820,0.288675,0.288675,0.500000,0.500000,0.500000,0.750000,0.447214,...,0.353553,0.288675,0.707107,0.353553,0.707107,0.500000,1.000000,0.866025,0.577350,0.707107
733530,0.774597,0.577350,0.516398,0.333333,0.000000,0.288675,0.288675,0.288675,0.577350,0.516398,...,0.408248,0.333333,0.816497,0.408248,0.816497,0.577350,0.866025,1.000000,0.333333,0.816497
610660,0.516398,0.288675,0.516398,0.000000,0.333333,0.577350,0.577350,0.577350,0.866025,0.258199,...,0.408248,0.333333,0.408248,0.408248,0.408248,0.288675,0.577350,0.333333,1.000000,0.408248
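The full matrix above has 29964 × 29964 entries, which is roughly 7 GB if stored as 64-bit floats, although only the few highest scores per row are needed. The sketch below is one possible way to avoid materialising it: it normalises the rows, multiplies one block of rows at a time, and keeps the `k` best column positions per row. It uses plain NumPy rather than `cosine_similarity`; the function name `top_k_similar` and the `block_size` parameter are assumptions, not part of the original notebook.

```python
import numpy as np

def top_k_similar(features, k=5, block_size=1024):
    """Return, for each row, the positions of the k most similar other rows."""
    X = np.asarray(features, dtype=np.float32)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero, so their similarity is 0
    X = X / norms

    n = X.shape[0]
    k = min(k, n - 1)
    if k < 1:
        return np.empty((n, 0), dtype=np.int64)

    result = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        sims = X[start:stop] @ X.T  # (block, n) slice of the cosine matrix
        rows = np.arange(stop - start)
        sims[rows, np.arange(start, stop)] = -np.inf  # never recommend the item itself
        # Pick the k best columns (unordered), then sort those k by similarity
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        order = np.argsort(-sims[rows[:, None], part], axis=1, kind="stable")
        result[start:stop] = part[rows[:, None], order]
    return result

# Toy check: rows 0 and 1 are identical, so each is the other's nearest neighbour
toy = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
neighbours = top_k_similar(toy, k=1, block_size=2)
print(neighbours.ravel())
```

The returned values are row positions; with the notebook's data they would be mapped back to ids through `dummy_df2.index`.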


In [152]:
# Function that returns the list of recommended games for a given item_id

def recom_item_id(item_id):
    # Take the item's row of similarities, keep the 6 highest scores and skip the first.
    # Note: this assumes the item itself ranks first, which ties at 1.0 do not guarantee.
    max_items = cosine_sim_df.loc[item_id].nlargest(6)[1:6].index.to_list()
    
    # Build the list of recommended game names (returned in df order, not by similarity)
    # rec_titles = [df.iloc[i[0]]['app_name'] for i in max_items]
    rec_titles = df['app_name'][df['item_id'].isin(max_items)].to_list()
    return rec_titles
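The slice `nlargest(6)[1:6]` relies on the queried item being the first of the six results. When another game has exactly the same genres, both score 1.0 and the queried item may not come first, so it can end up in its own recommendations. A possible variant, sketched here under the assumption that `item_id` values are unique, drops the item explicitly and keeps the titles ordered by similarity. The name `recommend_titles` and the toy ids and titles are hypothetical.

```python
import pandas as pd

def recommend_titles(item_id, sim_df, games, n=5):
    """Return up to n app names, most similar first, never including item_id itself."""
    scores = sim_df.loc[item_id].drop(labels=item_id)  # remove the item's own score
    top_ids = scores.nlargest(n).index
    names = games.drop_duplicates("item_id").set_index("item_id")["app_name"]
    return names.reindex(top_ids).tolist()  # reindex keeps the similarity order

# Toy data: items 10 and 20 have identical feature vectors (similarity 1.0)
ids = [10, 20, 30]
toy_sim = pd.DataFrame(
    [[1.0, 1.0, 0.5], [1.0, 1.0, 0.2], [0.5, 0.2, 1.0]], index=ids, columns=ids
)
toy_games = pd.DataFrame({"item_id": ids, "app_name": ["A", "B", "C"]})

print(recommend_titles(20, toy_sim, toy_games, n=2))
```

For item 20 the sketch returns the titles of items 10 and 30, in that order, and never the title of item 20 itself.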

In [153]:
#cosine_sim_df.loc[761140].nlargest(6)[1:6].index.to_list()

In [154]:
# Convert 'item_id' to integer type
# df["item_id"] = df["item_id"].astype(int)

In [155]:
# Keep only the required columns
df.drop(columns=['developer', 'release_year','tags', 'specs','genres','price'], inplace=True)

In [156]:
# Takes about 3 minutes.

# Apply the function to the 'item_id' column and create the 'Recommended_Games' column
df['Recommended_Games'] = df['item_id'].apply(recom_item_id)
df

Unnamed: 0,item_id,app_name,Recommended_Games
0,761140,Lost Summoner Kitty,"[Pixel Puzzles 2: Anime, World of Cinema - Dir..."
1,643980,Ironbound,"[Shadow Hunter, Immortal Empire, Immortal Empi..."
2,670290,Real Pool 3D - Poolians,"[Pixel Puzzles Ultimate - Puzzle Pack: Rio, Pi..."
3,767400,弹炸人2222,"[Atomic Adam: Episode 1, Biozone, Luxor: 5th P..."
4,772540,Battle Royale Trainer,"[The Tomorrow War, Beyond Space Remastered Edi..."
...,...,...,...
29959,745400,Kebab it Up!,"[Foul Play, Bloody Trapland, BattleBlock Theat..."
29960,773640,Colony On Mars,"[Fate of the World: Tipping Point, Fate of the..."
29961,733530,LOGistICAL: South Africa,"[Puzzler World 2, iBomber Defense Pacific, Bum..."
29962,610660,Russian Roads,"[Try Hard Parking, Car Mechanic Simulator 2015..."


In [157]:
# valor_recomendado = df.loc[0, 'Recommended_Games']
# print(valor_recomendado)

In [160]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29964 entries, 0 to 29963
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   item_id            29964 non-null  int64 
 1   app_name           29963 non-null  object
 2   Recommended_Games  29964 non-null  object
dtypes: int64(1), object(2)
memory usage: 702.4+ KB


In [161]:
# Save the DataFrame
df.to_csv('recomendacion_juego.csv', index=False)
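When a column of Python lists is written to CSV, each list is stored as its string representation, so the consumer has to parse it back. The sketch below shows one way the lookup side might read the file, using the `converters` argument of `pd.read_csv` with `ast.literal_eval`. The body of `recomendacion_juego` is an assumption (only its name appears in this document), and an in-memory buffer with placeholder titles stands in for `recomendacion_juego.csv`.

```python
import ast
import io

import pandas as pd

# Toy frame standing in for the saved file (ids and titles are placeholders)
toy = pd.DataFrame({
    "item_id": [1, 2],
    "app_name": ["Game A", "Game B"],
    "Recommended_Games": [["Game B"], ["Game A"]],
})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# The converter turns the stringified list back into a real list while reading
loaded = pd.read_csv(buffer, converters={"Recommended_Games": ast.literal_eval})

def recomendacion_juego(item_id, table=loaded):
    """Look up the stored recommendations; unknown ids return an empty list."""
    match = table.loc[table["item_id"] == item_id, "Recommended_Games"]
    return match.iloc[0] if not match.empty else []

print(recomendacion_juego(1))
print(recomendacion_juego(99))
```

One caveat: `app_name` has a missing value in this dataset, and a list containing it would be written as `nan`, which `ast.literal_eval` cannot parse, so such rows may need cleaning before saving.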

Recommendation: when measuring the similarity between one item_id and the others, additional variables could be included in the cosine similarity, for example:
- the similarity between an item's name and the names of other games
- the similarity of the tags, which hold keywords associated with an item_id
- the similarity of the specs, which hold the specifications that characterise an item_id
- a combination of one or more of the above
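As a rough illustration of the last option, the sketch below concatenates the dummy variables of two columns side by side and computes cosine similarity on the combined matrix. Column prefixes keep labels from different sources apart. The ids and rows are made up, and the similarity is computed with NumPy (normalise the rows, then take dot products) instead of `cosine_similarity` so that the example has few dependencies.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "genres": ["Action, Indie", "Action, Indie", "Strategy"],
        "specs": ["Single-player", "Multi-player", "Single-player"],
    },
    index=pd.Index([101, 102, 103], name="item_id"),  # hypothetical ids
)

# One block of 0/1 columns per source column, prefixed to avoid name clashes
blocks = [
    toy[col].str.get_dummies(sep=", ").add_prefix(f"{col}_")
    for col in ["genres", "specs"]
]
features = pd.concat(blocks, axis=1)

X = features.to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # guard against items with no labels at all
X = X / norms
sim = pd.DataFrame(X @ X.T, index=features.index, columns=features.index)
print(sim.round(3))
```

On genres alone, items 101 and 102 would be indistinguishable (similarity 1.0); adding `specs` lowers their similarity to 2/3, which shows how extra variables can break ties. Each added block also widens the feature matrix, so the computational cost mentioned earlier grows accordingly.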