<h1>Machine Learning

This notebook performs the transformations required for the recommendation model, which is based on cosine similarity. Cosine similarity is a measure used in machine learning to evaluate how similar two vectors are. It is especially common in natural language processing (NLP) and information retrieval problems.
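
As a quick illustration of the measure itself, the following minimal sketch computes cosine similarity by hand with NumPy. The helper name `cosine_sim` and the sample vectors are assumptions made for this example and are not part of the notebook, which relies on scikit-learn's `cosine_similarity` instead.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2, 3], [2, 4, 6])  # parallel vectors, close to 1
orthogonal = cosine_sim([1, 0], [0, 1])            # perpendicular vectors, 0
opposite = cosine_sim([1, -1], [-1, 1])            # opposite directions, close to -1
```

The result depends only on the angle between the vectors, not on their length, which is why scaling a vector does not change its similarity to another.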

<h2>Imports

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.metrics.pairwise import cosine_similarity
import pyarrow as pa
import pyarrow.parquet as pq

The data is loaded and the corresponding DataFrame is created.

In [3]:
machineLearningDf = pd.read_parquet('../data/machineLearning.parquet')

In [3]:
machineLearningDf.head()

app_name,! That Bastard Is Trying To Steal Our Gold !,//N.P.P.D. RUSH//- The milk of Ultraviolet,0RBITALIS,"10,000,000",100% Orange Juice,100% Orange Juice - Krila & Kae Character Pack,1001 Spikes,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,12 is Better Than 6,...,nail'd,oO,planetarian ~the reverie of a little planet~,resident evil 4 / biohazard 4,sZone-Online,the static speaks my name,theBlu,theHunter Classic,theHunter: Primal,Астролорды: Облако Оорта
user_id,,,,,,,,,,,,,,,,,,,,,
--000--,,,,,,,,,,,...,,,,,,,,,,
--ace--,,,,,,,,,,,...,,,,,,,,,,
--ionex--,,,,,,,,,,,...,,,,,,,,,,
-2SV-vuLB-Kg,,,,,,,,,,,...,,,,,,,,,,
-Azsael-,,,,,,,,,,,...,,,,,,,,,,


The DataFrame values are normalized by subtracting each user's mean rating and dividing the result by the difference between that user's maximum and minimum ratings. This centers each user's ratings on zero and scales them by their range. Users who gave only one rating, or who rated every game the same way, are removed in this process: their range is zero, so their normalized values are undefined, get filled with 0, and the resulting all-zero columns are dropped.

In [4]:
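# Center each user's ratings on their mean and scale by their range (max - min)
machineLearningDf = machineLearningDf.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1)
# Unrated games and users with a zero range (0/0) become 0
machineLearningDf = machineLearningDf.fillna(0)
# Transpose so that rows are games and columns are users
machineLearningDf = machineLearningDf.T
# Drop users whose column contains only zeros
machineLearningDf = machineLearningDf.loc[:, (machineLearningDf != 0).any(axis=0)]
machineLearningDf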

user_id,-2SV-vuLB-Kg,-PRoSlayeR-,-SEVEN-,-_PussyDestroyer_-,0-3-0,00000000000000000001227,00454211432342,00True,01189958889189157253,04061993,...,zombieskiler6969,zomgieee,zoozles,zp3413,zrustz16,zsharoarkbr,zucchin1,zuzuga2003,zv_odd,zyr0n1c
app_name,,,,,,,,,,,,,,,,,,,,,
! That Bastard Is Trying To Steal Our Gold !,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
//N.P.P.D. RUSH//- The milk of Ultraviolet,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
0RBITALIS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
"10,000,000",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
100% Orange Juice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
the static speaks my name,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
theBlu,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
theHunter Classic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.666667,0.0,0.0
theHunter: Primal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
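
To make the effect of the normalization step concrete, here is a small sketch on invented data. The toy DataFrame, the user and game labels, and the helper `normalize_row` are all assumptions for illustration. The example applies the same operations as the cell above and shows which users survive the filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are games, NaN = not rated
toy = pd.DataFrame(
    {
        "game_a": [1.0, 1.0, np.nan],
        "game_b": [0.0, 1.0, np.nan],
        "game_c": [np.nan, 1.0, 1.0],
    },
    index=["varied_user", "constant_user", "single_rating_user"],
)

def normalize_row(row):
    # Mean-center, then divide by the range; a zero range yields NaN (0/0)
    return (row - row.mean()) / (row.max() - row.min())

normalized = toy.apply(normalize_row, axis=1).fillna(0).T
# Keep only the users (columns) that have at least one nonzero value
normalized = normalized.loc[:, (normalized != 0).any(axis=0)]
```

In this toy case only `varied_user` remains: the other two users have a rating range of zero, so every value in their columns ends up as 0 and the columns are dropped.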


We convert the matrix to a sparse format, which stores only the nonzero values, to reduce memory usage and improve efficiency.

In [5]:
machineLearningSparseDf = sp.sparse.csr_matrix(machineLearningDf.values)
machineLearningSparseDf

<3195x7368 sparse matrix of type '<class 'numpy.float64'>'
	with 29040 stored elements in Compressed Sparse Row format>
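
The benefit of the sparse format can be estimated from the density, that is, the fraction of entries that are actually stored. The sketch below uses a small invented matrix and then applies the same ratio to the shape and stored-element count printed above. The variable names are assumptions for this example.

```python
import numpy as np
from scipy import sparse

# Hypothetical 4x5 matrix with only three nonzero entries
dense = np.zeros((4, 5))
dense[0, 1] = 0.5
dense[2, 3] = -0.5
dense[3, 0] = 1.0

csr = sparse.csr_matrix(dense)
stored = csr.nnz  # number of explicitly stored entries
density = stored / (csr.shape[0] * csr.shape[1])

# Same ratio for the 3195x7368 matrix with 29040 stored elements
notebook_density = 29040 / (3195 * 7368)
```

For the notebook's matrix the ratio comes out at roughly 0.1% of the entries, which is why a CSR representation is a reasonable choice here.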

The similarity matrix is built using cosine similarity, which measures how similar two vectors are in a multidimensional space. For this model, it lets us determine how similar two games are based on the users' recommendations.

In [6]:
itemSimilarity = cosine_similarity(machineLearningSparseDf)

In [7]:
itemSimDf = pd.DataFrame(itemSimilarity, index = machineLearningDf.index, columns = machineLearningDf.index)

In [9]:
itemSimDf.head()

app_name,! That Bastard Is Trying To Steal Our Gold !,//N.P.P.D. RUSH//- The milk of Ultraviolet,0RBITALIS,"10,000,000",100% Orange Juice,100% Orange Juice - Krila & Kae Character Pack,1001 Spikes,12 Labours of Hercules,12 Labours of Hercules II: The Cretan Bull,12 is Better Than 6,...,nail'd,oO,planetarian ~the reverie of a little planet~,resident evil 4 / biohazard 4,sZone-Online,the static speaks my name,theBlu,theHunter Classic,theHunter: Primal,Астролорды: Облако Оорта
app_name,,,,,,,,,,,,,,,,,,,,,
! That Bastard Is Trying To Steal Our Gold !,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.371768,0.0,0.0,0.0,0.0
//N.P.P.D. RUSH//- The milk of Ultraviolet,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0RBITALIS,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"10,000,000",0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100% Orange Juice,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
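
The structure of the item similarity matrix is easier to read on a tiny case. The following sketch uses an invented matrix of normalized ratings (rows are games, columns are users, all values are assumptions) and calls the same `cosine_similarity` function on a sparse input.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical normalized ratings: rows are games, columns are users
ratings = sparse.csr_matrix(np.array([
    [0.5, -0.5, 0.0],   # game 0
    [0.5, -0.5, 0.0],   # game 1: rated exactly like game 0
    [-0.5, 0.5, 0.0],   # game 2: rated the opposite way
    [0.0, 0.0, 0.5],    # game 3: rated only by a different user
]))

# One row and one column per game; entry (i, j) compares game i with game j
sim = cosine_similarity(ratings)
```

Games rated alike get a similarity near 1, games rated in opposite ways get a value near -1, and games with no raters in common get 0, which matches the many zeros visible in the output above.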


Once the transformations are finished, we can save the result as a Parquet file.

In [11]:
itemSimDf.to_parquet('../data/itemSim.parquet')

With the relationships between the games established, we develop code that makes recommendations based on cosine similarity. It takes the name of a game as input and sorts the column for that game in the item similarity matrix (itemSimDf) in descending order, so that the most similar games come first. It then selects the 5 most similar games (excluding the input game) and returns them as a list of games similar to the one specified. The corresponding function can be found in the [Funciones_API](https://github.com/JBSosa/MLOps-Steam/blob/main/Notebooks/Funciones_API.ipynb) notebook.
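
The steps described above could be sketched roughly as follows. This is not the function from the Funciones_API notebook: the name `recommend_games`, its parameters, and the toy similarity matrix are assumptions made for illustration only.

```python
import pandas as pd

def recommend_games(game, item_sim_df, n=5):
    """Hypothetical helper: return the n games most similar to `game`."""
    if game not in item_sim_df.columns:
        raise KeyError(f"Unknown game: {game}")
    # Sort the game's similarity column so the most similar games come first
    ranked = item_sim_df[game].sort_values(ascending=False)
    # Exclude the input game itself from the candidates
    ranked = ranked.drop(labels=game, errors="ignore")
    return ranked.head(n).index.tolist()

# Toy similarity matrix with invented values
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1, -0.4],
     [0.9, 1.0, 0.3, 0.0],
     [0.1, 0.3, 1.0, 0.2],
     [-0.4, 0.0, 0.2, 1.0]],
    index=["A", "B", "C", "D"],
    columns=["A", "B", "C", "D"],
)

top_two = recommend_games("A", toy_sim, n=2)
```

Dropping the input game by label, rather than skipping the first row after sorting, avoids returning the game itself when another game ties with it at the top.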