<a href="https://colab.research.google.com/github/FreddyPinto/recsys-steam-games/blob/feature/notebooks/2.1-feature-engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering

The goal of this notebook is to improve the performance of the API and the predictive performance of the Machine Learning model behind our recommendation system, while reducing compute and data requirements and making the results easier to interpret.

To do this, we use feature engineering: creating, selecting, or transforming the variables used by the API endpoints and by the model.

Some of the tasks we will carry out are:

- Determine which features matter most.

- Apply sentiment analysis with NLP.

- Encode high-cardinality categorical variables.

- Reduce the dimensionality of the data with principal component analysis while preserving most of the useful information.

- Produce a single, more efficient dataset for each endpoint.

## 0 Global Settings and Imports

In this section, we import the libraries and modules needed for the feature engineering process and set any global configuration required.

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import scipy as sp
import textblob
import sklearn
from textblob import TextBlob
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_distances
from sklearn.preprocessing import StandardScaler

print(f"System version: {sys.version}")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"scipy version: {sp.__version__}")
print(f"textblob version: {textblob.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")

System version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
pandas version: 1.5.3
numpy version: 1.23.5
scipy version: 1.11.3
textblob version: 0.17.1
scikit-learn version: 1.2.2


## 1 Extraction

In this section, we extract the data from the `steam_games`, `user_items`, and `user_reviews` files, which are stored in parquet format.

### 1.1 Data extraction

We create a function that reads each file from its directory and loads it into a `pandas` DataFrame.

In [2]:
# Load each parquet file into a DataFrame, keyed by file name
def read_parquet_files(parquet_files):
    dataframes = {}
    for name in parquet_files:
        dataframes[name] = pd.read_parquet(f'{name}.parquet', engine='pyarrow')
    return dataframes

parquet_files = ['steam_games', 'user_items', 'user_reviews']
dataframes = read_parquet_files(parquet_files)

# Bind each DataFrame to its own variable
df_steam_games = dataframes['steam_games']
df_user_items = dataframes['user_items']
df_user_reviews = dataframes['user_reviews']

## 2 Sentiment Analysis

In this section, we enrich the `user_reviews` dataset with a new column called `sentiment_analysis`. It holds the result of applying NLP sentiment analysis to the game reviews written by users, which lets us explore what users think of each game.

Sentiment analysis assigns a numeric label to each review according to the tone or attitude of the text. We use the following scale:

* 0: the review is **negative**, that is, the user expresses dissatisfaction, dislike, or disappointment with the game.
* 1: the review is **neutral**, that is, the user expresses indifference, objectivity, or no emotion about the game.
* 2: the review is **positive**, that is, the user expresses satisfaction, enjoyment, or admiration for the game.

### 2.1 The `sentiment_analysis` function

To run sentiment analysis on the game reviews, we write a function based on the TextBlob library, which is simple and intuitive to use. We rely on polarity, a numeric measure of whether the text is negative or positive. Polarity ranges from -1 to 1, where -1 is very negative, 0 is neutral, and 1 is very positive.

In [3]:
def sentiment_analysis(review):
    # A missing review is treated as neutral (1)
    if pd.isnull(review):
        return 1

    # Compute the polarity of the review with TextBlob
    polarity = TextBlob(review).sentiment.polarity

    # 0 (negative) if polarity < 0, 2 (positive) if polarity > 0, otherwise 1 (neutral)
    if polarity < 0:
        return 0
    elif polarity > 0:
        return 2
    else:
        return 1
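
The thresholding step can be separated from TextBlob so that it is easy to test on its own. The sketch below is a hypothetical helper: `polarity_to_label` and its `neutral_band` parameter are not part of the notebook. With the default band of 0 it reproduces the rule used in `sentiment_analysis`; a wider band would treat weak polarities as neutral.

```python
def polarity_to_label(polarity, neutral_band=0.0):
    """Map a polarity in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive).

    `neutral_band` is a hypothetical tolerance: polarities whose absolute
    value does not exceed it are treated as neutral.
    """
    if polarity is None:
        return 1  # missing review -> neutral, as in sentiment_analysis
    if polarity < -neutral_band:
        return 0
    if polarity > neutral_band:
        return 2
    return 1


# Default band: same result as the strict < 0 / > 0 rule
labels = [polarity_to_label(p) for p in (-0.4, 0.0, 0.05, 0.8)]
print(labels)

# A wider band pushes weak polarities to neutral
print(polarity_to_label(0.05, neutral_band=0.1))
```

Keeping the mapping in its own function makes it possible to tune the neutral band later without recomputing polarities.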

* We apply the function to the `review` column.

In [4]:
df_user_reviews['sentiment_analysis'] = df_user_reviews['review'].apply(sentiment_analysis)

* Let's look at a few examples:

In [5]:
df_user_reviews[['review','sentiment_analysis']].sample(5)

Unnamed: 0,review,sentiment_analysis
36437,Can you do the work shop on hotline miami?,1
44660,"Weapon balance is terrible, tripwire need to f...",2
51192,Me and my friends have played over 500 hours t...,2
42251,it is so ADDICTIVE i cant stop playing (not in...,0
3065,11/10 The best suicidal baby game ever made! T...,2


### 2.2 Dropping the `review` column

The new `sentiment_analysis` column replaces the `review` column in the `user_reviews` dataset, which simplifies both the machine learning models and the data analysis.

In [6]:
df_user_reviews.drop('review', axis=1, inplace=True)
df_user_reviews.head()

Unnamed: 0,item_id,recommend,user_id,posted_year,sentiment_analysis
0,1250,True,76561197970982479,2011,2
1,22200,True,76561197970982479,2011,2
2,43110,True,76561197970982479,2011,2
3,251610,True,js41637,2014,2
4,227300,True,js41637,2013,0


## 3 Design and structure of the datasets for the API endpoints

In this section, we build several datasets that act as a pseudo database for the functions behind the API endpoints. This gives each endpoint fast access to just the data it needs, without loading everything, which improves API performance.

### 3.1 Endpoints 1 and 2

These two endpoints share most of their information, so a single dataset can serve both.

#### 3.1.1 Endpoint 1

+ def **PlayTimeGenre( *`genero` : str* )**:
    Returns the release year with the most hours played for the given genre.

Example return value:

``` js
{
   "Año de lanzamiento con más horas jugadas para Género X": 2013
}
```



#### 3.1.2 Endpoint 2

+ def **UserForGenre( *`genero` : str* )**:
    Returns the user with the most accumulated hours played for the given genre, along with a list of hours played per year.

Example return value:
```js
{
   "Usuario con más horas jugadas para Género X":"us213ndjss09sdf",
   "Horas jugadas":[
      {
         "Año":2013,
         "Horas":203
      },
      {
         "Año":2012,
         "Horas":100
      },
      {
         "Año":2011,
         "Horas":23
      }
   ]
}
```

#### 3.1.3 Pseudo Database 1

To build a single dataset that serves as a pseudo database for these endpoints, we combine `df_steam_games` with `df_user_items` so that all the required information lives in one place. We only need the columns `item_id`, `genres`, and `release_year` from `steam_games`, and `item_id`, `user_id`, and `playtime_forever` from `user_items`.

* First, we list the columns we need:

In [7]:
steam_games_columns = ['item_id','genres','release_year']
user_items_columns = ['item_id','user_id', 'playtime_forever']

* Second, we create subsets of the DataFrames with only those columns:

In [8]:
df_games_subset = df_steam_games[steam_games_columns]
df_items_subset = df_user_items[user_items_columns]

* Then, we merge `steam_games` and `user_items` on the `item_id` column.

In [9]:
df_pseudo_db1 = pd.merge(df_games_subset, df_items_subset, on='item_id')
df_pseudo_db1.head()

Unnamed: 0,item_id,genres,release_year,user_id,playtime_forever
0,282010,Racing,1997,UTNerd24,0.083333
1,282010,Racing,1997,I_DID_911_JUST_SAYING,0.0
2,282010,Racing,1997,76561197962104795,0.0
3,282010,Racing,1997,r3ap3r78,0.0
4,282010,Racing,1997,saint556,0.216667


In [10]:
df_pseudo_db1.shape

(15255072, 5)

- To save resources, we keep only the records that meet all of the following conditions: the game has a valid `release_year`, its `genres` value is popular (measured by relative frequency), and it has been played at least once.

In [11]:
# Relative frequency of each genre
df_pseudo_db1.genres.value_counts()/len(df_pseudo_db1)

Action                       0.229263
Adventure                    0.182320
Indie                        0.127814
Strategy                     0.098006
RPG                          0.087477
Simulation                   0.068177
Casual                       0.062629
Free to Play                 0.058383
Massively Multiplayer        0.051663
Racing                       0.010949
Sports                       0.008968
Early Access                 0.008278
Education                    0.004132
Utilities                    0.000933
Video Production             0.000321
Web Publishing               0.000259
Software Training            0.000170
unknown                      0.000129
Audio Production             0.000077
Photo Editing                0.000039
Animation & Modeling         0.000009
Design & Illustration        0.000005
Name: genres, dtype: float64

- Based on this output, we select the 10 most frequent genres, excluding categories that are not really genres but describe, for example, whether the game is free (`Free to Play`) or in early access (`Early Access`).

In [12]:
# The 10 most frequent genres
top_10_popular_genres = ['Action', 'Adventure', 'Indie', 'Strategy', 'RPG', 'Simulation', 'Casual', 'Massively Multiplayer', 'Racing', 'Sports']

# Keep only the rows that satisfy all three conditions
df_pseudo_db1 = df_pseudo_db1[
    (df_pseudo_db1['release_year'] != 'unknown')
    & (df_pseudo_db1['playtime_forever'] > 0)
    & (df_pseudo_db1['genres'].isin(top_10_popular_genres))
].reset_index(drop=True)
df_pseudo_db1.shape

(9753061, 5)

- To reduce memory usage at deployment, we convert the columns to data types that match their content.

In [13]:
df_pseudo_db1['release_year'] = df_pseudo_db1['release_year'].astype('int16')
df_pseudo_db1['playtime_forever'] = df_pseudo_db1['playtime_forever'].astype('float32')
df_pseudo_db1.memory_usage(deep=True)

Index                     128
item_id             609650394
genres              629483569
release_year         19506122
user_id             686349338
playtime_forever     39012244
dtype: int64
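
The `memory_usage` output shows that the string columns (`item_id`, `genres`, `user_id`) dominate the footprint. One option the notebook does not use, shown here only as a sketch on made-up data, is the pandas `category` dtype, which stores each distinct string once plus a small integer code per row.

```python
import pandas as pd

# Toy frame with repeated string values (made-up data, not the Steam dataset)
df = pd.DataFrame({
    "genres": ["Action", "Indie", "Action", "RPG"] * 1000,
    "user_id": ["user_a", "user_b"] * 2000,
})

before = df.memory_usage(deep=True).sum()

# Convert the repeated string columns to categoricals
df_cat = df.astype({"genres": "category", "user_id": "category"})
after = df_cat.memory_usage(deep=True).sum()

print(f"string columns: {before} bytes, category columns: {after} bytes")
```

Whether this helps on the real data depends on how many distinct values each column has; it pays off most for columns such as `genres` with few distinct values.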

* Finally, we create a pivot table with `user_id` and `release_year` as the index, `genres` as the columns, and the sum of `playtime_forever` as the values.

In [14]:
df_pseudo_db1 = df_pseudo_db1.pivot_table(index=['user_id', 'release_year'], columns='genres', values='playtime_forever', aggfunc='sum', fill_value=0)
df_pseudo_db1

Unnamed: 0_level_0,genres,Action,Adventure,Casual,Indie,Massively Multiplayer,RPG,Racing,Simulation,Sports,Strategy
user_id,release_year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
--000--,2006,15.416667,15.416667,0.000000,15.416667,15.416667,0.000000,0.000000,15.416667,0.000000,0.000000
--000--,2009,88.816666,88.816666,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
--000--,2010,0.366667,0.000000,0.000000,0.366667,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
--000--,2011,108.699997,108.699997,0.000000,30.616665,46.049999,62.649998,46.049999,11.083333,0.000000,11.083333
--000--,2012,1822.516724,37.150002,30.016666,37.700001,10.500000,29.516666,0.000000,0.000000,7.683333,1796.400024
...,...,...,...,...,...,...,...,...,...,...,...
zzzmidmiss,2010,7.783334,0.166667,3.916667,7.950000,0.000000,0.000000,0.000000,3.233333,3.233333,3.400000
zzzmidmiss,2011,38.366665,38.366665,1.250000,1.750000,0.266667,37.599998,0.266667,0.000000,0.000000,1.150000
zzzmidmiss,2012,98.366669,61.650005,6.083333,51.316666,8.016666,45.500000,0.000000,6.450000,0.000000,15.383334
zzzmidmiss,2013,1.633333,1.750000,0.283333,1.750000,0.166667,0.166667,0.000000,0.000000,0.000000,1.466667


In [15]:
df_pseudo_db1.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 660003 entries, ('--000--', 2006) to ('zzzmidmiss', 2014)
Data columns (total 10 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Action                 660003 non-null  float32
 1   Adventure              660003 non-null  float32
 2   Casual                 660003 non-null  float32
 3   Indie                  660003 non-null  float32
 4   Massively Multiplayer  660003 non-null  float32
 5   RPG                    660003 non-null  float32
 6   Racing                 660003 non-null  float32
 7   Simulation             660003 non-null  float32
 8   Sports                 660003 non-null  float32
 9   Strategy               660003 non-null  float32
dtypes: float32(10)
memory usage: 28.8+ MB
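
As a rough sketch of how the two endpoints could read this table, the toy frame below mimics the `(user_id, release_year)` index and genre columns of `df_pseudo_db1`. The function bodies and the names `play_time_genre` and `user_for_genre` are assumptions for illustration, not the API's actual code.

```python
import pandas as pd

# Toy stand-in for df_pseudo_db1 (made-up hours)
index = pd.MultiIndex.from_tuples(
    [("ana", 2011), ("ana", 2012), ("bob", 2011), ("bob", 2013)],
    names=["user_id", "release_year"],
)
db1 = pd.DataFrame(
    {"Action": [10.0, 5.0, 2.0, 30.0], "RPG": [0.0, 8.0, 1.0, 0.0]},
    index=index,
)


def play_time_genre(db, genre):
    """Release year with the most hours played for `genre`."""
    hours_by_year = db[genre].groupby(level="release_year").sum()
    return int(hours_by_year.idxmax())


def user_for_genre(db, genre):
    """User with the most hours for `genre`, plus that user's hours per year."""
    top_user = db[genre].groupby(level="user_id").sum().idxmax()
    per_year = db[genre].xs(top_user, level="user_id")
    per_year = per_year[per_year > 0].sort_index(ascending=False)
    hours = [{"Año": int(y), "Horas": float(h)} for y, h in per_year.items()]
    return top_user, hours


print(play_time_genre(db1, "Action"))
print(user_for_genre(db1, "Action"))
```

Because the table is already aggregated per user and year, each endpoint only needs one `groupby` over a single genre column.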


### 3.2 Endpoints 3, 4, and 5

#### 3.2.1 Endpoint 3

+ def **UsersRecommend( *`año` : int* )**:
   Returns the top 3 games MOST recommended by users for the given year (`reviews.recommend = True` and positive or neutral comments).

Example return value:
```js
[
   {
      "Puesto 1":"X"
   },
   {
      "Puesto 2":"Y"
   },
   {
      "Puesto 3":"Z"
   }
]
```


#### 3.2.2 Endpoint 4

+ def **UsersWorstDeveloper( *`año` : int* )**:
   Returns the top 3 developers with the LEAST recommended games for the given year (`reviews.recommend = False` and negative comments).

Example return value:
```js
[
   {
      "Puesto 1":"X"
   },
   {
      "Puesto 2":"Y"
   },
   {
      "Puesto 3":"Z"
   }
]
```

#### 3.2.3 Endpoint 5

+ def **sentiment_analysis( *`empresa desarrolladora` : str* )**:
    Given a developer, returns a dictionary with the developer's name as the key and, as the value, a list with the total number of user reviews in each sentiment category.

Example return value:
```js
{
   "Valve":[
      Negative = 182,
      Neutral = 120,
      Positive = 278
   ]
}
```


#### 3.2.4 Pseudo Database 2

- To build a single dataset that serves as a pseudo database for these endpoints, we combine `df_steam_games` with `df_user_reviews` so that all the required information lives in one place. We only need the columns `item_id`, `item_name`, and `developer` from `steam_games`, and `item_id`, `recommend`, `sentiment_analysis`, and `posted_year` from `user_reviews`.

- First, we list the columns we need:

In [16]:
steam_games_columns = ['item_id', 'item_name', 'developer']
user_reviews_columns = ['item_id', 'recommend','sentiment_analysis','posted_year']

* Second, we create subsets of the DataFrames with only those columns:

In [17]:
df_games_subset = df_steam_games[steam_games_columns]
df_reviews_subset = df_user_reviews[user_reviews_columns]

* Then, we merge the `steam_games` and `user_reviews` subsets on the `item_id` column.

In [18]:
df_pseudo_db2 = pd.merge(df_games_subset, df_reviews_subset, on='item_id')
df_pseudo_db2.head()

Unnamed: 0,item_id,item_name,developer,recommend,sentiment_analysis,posted_year
0,282010,Carmageddon Max Pack,Stainless Games Ltd,True,1,unknown
1,282010,Carmageddon Max Pack,Stainless Games Ltd,True,1,unknown
2,282010,Carmageddon Max Pack,Stainless Games Ltd,True,1,unknown
3,70,Half-Life,Valve,True,0,2015
4,70,Half-Life,Valve,True,0,2011


- To save resources, we keep only the records that have a review with a valid `posted_year` and a known `developer`. We also drop the `item_id` column, since it is no longer needed.

In [19]:
df_pseudo_db2 = df_pseudo_db2[(df_pseudo_db2['posted_year'] != 'unknown') & (df_pseudo_db2['developer'] != 'unknown') ].reset_index(drop=True)
df_pseudo_db2.drop('item_id',axis=1, inplace=True)
df_pseudo_db2.head()

Unnamed: 0,item_name,developer,recommend,sentiment_analysis,posted_year
0,Half-Life,Valve,True,0,2015
1,Half-Life,Valve,True,0,2011
2,Half-Life,Valve,True,0,2014
3,Half-Life,Valve,True,2,2013
4,Half-Life,Valve,True,0,2013


- Finally, to reduce memory usage when the API is deployed, we convert the columns to data types that match their content.

In [20]:
df_pseudo_db2['sentiment_analysis'] = df_pseudo_db2['sentiment_analysis'].astype('int8')
df_pseudo_db2['posted_year'] = df_pseudo_db2['posted_year'].astype('int16')
df_pseudo_db2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152619 entries, 0 to 152618
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   item_name           152619 non-null  object
 1   developer           152619 non-null  object
 2   recommend           152619 non-null  bool  
 3   sentiment_analysis  152619 non-null  int8  
 4   posted_year         152619 non-null  int16 
dtypes: bool(1), int16(1), int8(1), object(2)
memory usage: 2.9+ MB
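
To illustrate how endpoints 3 and 5 could query this table, here is a minimal sketch on a toy frame with the same columns as `df_pseudo_db2`. The function names `users_recommend` and `sentiment_by_developer` and their bodies are assumptions, and the sentiment counts are returned as a plain dictionary rather than in the list notation of the example above.

```python
import pandas as pd

# Toy stand-in for df_pseudo_db2 (made-up rows)
db2 = pd.DataFrame({
    "item_name": ["A", "A", "B", "C", "C", "C", "D"],
    "developer": ["Dev1", "Dev1", "Dev1", "Dev2", "Dev2", "Dev2", "Dev2"],
    "recommend": [True, True, True, True, True, True, False],
    "sentiment_analysis": [2, 1, 2, 2, 2, 1, 0],
    "posted_year": [2014] * 7,
})


def users_recommend(db, year):
    """Top 3 games for `year`, counting recommended reviews that are not negative."""
    mask = (db["posted_year"] == year) & db["recommend"] & (db["sentiment_analysis"] >= 1)
    top = db.loc[mask, "item_name"].value_counts().head(3)
    return [{f"Puesto {i}": name} for i, name in enumerate(top.index, start=1)]


def sentiment_by_developer(db, developer):
    """Number of reviews per sentiment label for one developer."""
    counts = (
        db.loc[db["developer"] == developer, "sentiment_analysis"]
        .value_counts()
        .reindex([0, 1, 2], fill_value=0)  # make sure all three labels appear
    )
    return {developer: {"Negative": int(counts[0]),
                        "Neutral": int(counts[1]),
                        "Positive": int(counts[2])}}


print(users_recommend(db2, 2014))
print(sentiment_by_developer(db2, "Dev2"))
```

Endpoint 4 would follow the same pattern as `users_recommend`, filtering on `recommend == False` and negative sentiment and counting by `developer` instead of `item_name`.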


## 4 Recommendation Engine / System

In this section, we use Collaborative Filtering to build the **recommendation system**. There are two options:

1. User-based:
  * Identify similar users.
  * Recommend new items to a user based on the ratings given by similar users.

2. Item-based:
  * Compute the similarity between items.
  * Find the most similar items that the user has not rated yet and recommend them.
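
The difference between the two options comes down to which axis of the ratings matrix is compared. The sketch below uses a made-up 3 x 4 matrix and a small hypothetical helper, `cosine_sim`, written with NumPy only: comparing rows gives user-user similarity, and comparing columns (the transpose) gives item-item similarity.

```python
import numpy as np


def cosine_sim(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = matrix / norms
    return unit @ unit.T


# Toy ratings: 3 users (rows) x 4 games (columns), 0 = not rated (made-up data)
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [5.0, 4.0, 1.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
])

user_sim = cosine_sim(ratings)    # user-based: compare rows
item_sim = cosine_sim(ratings.T)  # item-based: compare columns

print(user_sim.shape, item_sim.shape)
print(user_sim.round(2))
```

In this toy data the first two users rate the same games similarly, so their similarity is close to 1, while the third user shares no rated game with the first and gets a similarity of 0.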

### Computing the ratings

Since we do not **actually** have a 1-to-5 rating (as we would, for example, with movie ratings), we derive one from the sentiment analysis and the user's recommendation. Sentiment analysis is the primary factor and the recommendation is the secondary one, giving a range from 1 to 5:

| sentiment_analysis | recommend | rating |
|--------------------|-----------|--------|
| 0 (negative)       | False     | 1      |
| 0 (negative)       | True      | 1      |
| 1 (neutral)        | False     | 2      |
| 1 (neutral)        | True      | 3      |
| 2 (positive)       | False     | 4      |
| 2 (positive)       | True      | 5      |


* We join the three datasets to get a single dataset with all the relevant information about users: the games they bought or played, the sentiment analysis, and the recommendation. We use the pandas `merge` function for this.

* First, we list the columns we need:

In [21]:
steam_games_columns = ['item_id', 'item_name']
user_reviews_columns = ['item_id','user_id', 'recommend','sentiment_analysis']
user_items_columns = ['user_id','item_id']

* Second, we create subsets of the DataFrames with only those columns:

In [22]:
df_games_subset = df_steam_games[steam_games_columns]
df_reviews_subset = df_user_reviews[user_reviews_columns]
df_items_subset = df_user_items[user_items_columns]

* Then, we merge `steam_games` and `user_items` on the `item_id` column.

In [23]:
df_user_games = pd.merge(df_games_subset, df_items_subset, on='item_id')
df_user_games.head()

Unnamed: 0,item_id,item_name,user_id
0,282010,Carmageddon Max Pack,UTNerd24
1,282010,Carmageddon Max Pack,I_DID_911_JUST_SAYING
2,282010,Carmageddon Max Pack,76561197962104795
3,282010,Carmageddon Max Pack,r3ap3r78
4,282010,Carmageddon Max Pack,saint556


In [24]:
df_features = pd.merge(df_user_games, df_reviews_subset, on=["user_id", "item_id"])
df_features.drop_duplicates(inplace=True)
df_features.reset_index(drop=True, inplace=True)
df_features.head()

Unnamed: 0,item_id,item_name,user_id,recommend,sentiment_analysis
0,282010,Carmageddon Max Pack,InstigatorAU,True,1
1,70,Half-Life,EizanAratoFujimaki,True,0
2,70,Half-Life,GamerFag,True,0
3,70,Half-Life,76561198020928326,True,0
4,70,Half-Life,Bluegills,True,2


In [25]:
# Group by item_name to count the reviews per game
conteo = df_features.groupby("item_name").size()
conteo


item_name
! That Bastard Is Trying To Steal Our Gold !     1
//N.P.P.D. RUSH//- The milk of Ultraviolet       2
10,000,000                                       2
100% Orange Juice                               14
12 Labours of Hercules                           2
                                                ..
the static speaks my name                       15
theBlu                                           1
theHunter Classic                               53
theHunter: Primal                               15
Астролорды: Облако Оорта                         1
Length: 2736, dtype: int64

In [26]:
# Keep only the games with more than 50 reviews.
df_features = df_features.loc[df_features["item_name"].isin(conteo[conteo > 50].index), :]
# Show the filtered DataFrame
df_features


Unnamed: 0,item_id,item_name,user_id,recommend,sentiment_analysis
1,70,Half-Life,EizanAratoFujimaki,True,0
2,70,Half-Life,GamerFag,True,0
3,70,Half-Life,76561198020928326,True,0
4,70,Half-Life,Bluegills,True,2
5,70,Half-Life,76561198071955492,True,0
...,...,...,...,...,...
39959,220,Half-Life 2,decplayz,True,1
39960,220,Half-Life 2,jacobval99,True,2
39961,220,Half-Life 2,chikens,True,1
39962,220,Half-Life 2,johnshere,True,2


In [27]:
def get_rating(sentiment_analysis, recommend):
  """
  Returns the rating according to the criteria in the table above.

  Args:
    sentiment_analysis: The sentiment analysis value: 0, 1, or 2.
    recommend: The recommendation value: True or False.

  Returns:
    The rating, an integer between 1 and 5.
  """
  rating = max(1, min(5, 2 * sentiment_analysis + (1 if recommend else 0)))

  return rating


In [28]:
ratings = df_features.apply(lambda row: get_rating(row['sentiment_analysis'], row['recommend']), axis=1)

# Build df_ratings with the new rating column
df_ratings = df_features[['item_id', 'item_name', 'user_id']].assign(rating=ratings)
df_ratings.head()

Unnamed: 0,item_id,item_name,user_id,rating
1,70,Half-Life,EizanAratoFujimaki,1
2,70,Half-Life,GamerFag,1
3,70,Half-Life,76561198020928326,1
4,70,Half-Life,Bluegills,5
5,70,Half-Life,76561198071955492,1
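
Applying `get_rating` row by row works, but the same formula can be evaluated on whole columns at once. The sketch below is an alternative the notebook does not use; it runs on a made-up frame containing the six combinations from the rating table and relies on `np.clip` to keep the result between 1 and 5.

```python
import numpy as np
import pandas as pd

# The six combinations from the rating table
df = pd.DataFrame({
    "sentiment_analysis": [0, 0, 1, 1, 2, 2],
    "recommend": [False, True, False, True, False, True],
})

# 2 * sentiment + recommend (as 0/1), clipped to the 1..5 range
rating = np.clip(2 * df["sentiment_analysis"] + df["recommend"].astype(int), 1, 5)
print(rating.tolist())
```

The clipping only matters for the first row, where the raw score is 0 and the table assigns a rating of 1.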


In [29]:
df_ratings.groupby(["rating"])["user_id"].count()

rating
1     5580
2      511
3     5534
4      682
5    14127
Name: user_id, dtype: int64

In [30]:
# Standardize the ratings to zero mean and unit variance
scaler = StandardScaler()
rating_array = df_ratings['rating'].values.reshape(-1, 1)
normalized_rating = scaler.fit_transform(rating_array)

df_norm = df_ratings.copy()
df_norm['rating'] = normalized_rating.ravel()
df_norm.head()

Unnamed: 0,item_id,item_name,user_id,rating
1,70,Half-Life,EizanAratoFujimaki,-1.645211
2,70,Half-Life,GamerFag,-1.645211
3,70,Half-Life,76561198020928326,-1.645211
4,70,Half-Life,Bluegills,0.835191
5,70,Half-Life,76561198071955492,-1.645211
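
With its default settings, `StandardScaler` subtracts the mean and divides by the population standard deviation. The sketch below reproduces that calculation with NumPy on a few made-up ratings, so the transformation can be checked by hand.

```python
import numpy as np

ratings = np.array([1, 1, 3, 5, 5, 5], dtype=float)  # made-up ratings

# z-score using the population standard deviation (ddof=0, NumPy's default)
z = (ratings - ratings.mean()) / ratings.std()
print(z.round(3))
```

One consequence worth keeping in mind: after standardization a value of 0 corresponds to the average rating, so filling missing entries with 0 treats an unrated game as if it had received the average rating.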


### Build the user/rating matrix

In [31]:
df_matrix = df_norm.pivot_table(index=['user_id'], columns=['item_name'], values='rating').fillna(0)
df_matrix

item_name,APB Reloaded,ARK: Survival Evolved,Ace of Spades: Battle Builder,AdVenture Capitalist,Age of Empires II HD,AirMech Strike,Arma 2: Operation Arrowhead,Arma 3,Awesomenauts - the 2D moba,Bad Rats: the Rats' Revenge,...,Total War™: ROME II - Emperor Edition,Trove,Undertale,Unturned,Verdun,War Thunder,Warface,Warframe,XCOM: Enemy Unknown,theHunter Classic
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
--000--,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
--ace--,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
--ionex--,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
-2SV-vuLB-Kg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.835191,0.000000
-Beave-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zuzuga2003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,-1.025111
zv_odd,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
zvanik,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.835191,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
zynxgameth,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.835191,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000


Sparsity

In [32]:
def get_sparsity(df):
  """
  Returns the sparsity of a matrix stored as a DataFrame.

  Args:
    df: The matrix whose sparsity we want to compute.

  Returns:
    The percentage of entries in `df` that are zero.
  """

  num_zeros = (df == 0).sum().sum()
  num_elements = df.size

  sparsity = num_zeros / num_elements * 100

  return f'Sparsity: {round(sparsity, 2)}%'

In [33]:
get_sparsity(df_matrix)

'Sparsity: 98.86%'

In [34]:
df_matrix_sparse = sp.sparse.csr_matrix(df_matrix.values)
df_matrix_sparse

<15518x150 sparse matrix of type '<class 'numpy.float64'>'
	with 26434 stored elements in Compressed Sparse Row format>

Train/test split

In [35]:
ratings = df_matrix.values

In [36]:
ratings_train, ratings_test = train_test_split(ratings, test_size = 0.3, random_state=123)
print(ratings_train.shape)
print(ratings_test.shape)

(10862, 150)
(4656, 150)


Similarity matrix: cosine similarity

In [37]:
sim_matrix = 1 - cosine_distances(ratings)
print(sim_matrix.shape)

(15518, 15518)
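
For reference, the cosine similarity between the rating vectors of two users, and the similarity-weighted prediction that the cells further down compute as `sim_matrix.dot(ratings)` divided by the row sums of the absolute similarities, can be written as follows. The symbols $r_u$, $s_{uv}$ and $\hat{r}_{ui}$ are notation introduced here, not names from the notebook.

```latex
s_{uv} = \frac{r_u \cdot r_v}{\lVert r_u \rVert \, \lVert r_v \rVert},
\qquad
\hat{r}_{ui} = \frac{\sum_{v} s_{uv}\, r_{vi}}{\sum_{v} \lvert s_{uv} \rvert}
```

Here $r_u$ is the row of the ratings matrix for user $u$, $r_{vi}$ is the rating of user $v$ for game $i$, and the sums run over all users in the similarity matrix.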


In [38]:
df_sim_matrix = pd.DataFrame(sim_matrix)
df_sim_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15508,15509,15510,15511,15512,15513,15514,15515,15516,15517
0,1.0,0.0,0.000000,0.000000,0.0,-1.0,0.0,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000
1,0.0,1.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000
2,0.0,0.0,1.000000,0.500000,0.0,0.0,0.0,-0.191328,0.707107,0.000000,...,0.000000,0.0,0.0,0.0,-0.707107,0.365555,0.0,-0.574406,0.0,-0.424620
3,0.0,0.0,0.500000,1.000000,0.0,0.0,0.0,-0.191328,0.707107,0.000000,...,0.000000,0.0,0.0,0.0,-0.707107,0.365555,0.0,-0.574406,0.0,-0.424620
4,0.0,0.0,0.000000,0.000000,1.0,0.0,0.0,0.000000,0.000000,0.000000,...,0.000000,1.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15513,0.0,0.0,0.365555,0.365555,0.0,0.0,0.0,-0.493933,0.516973,0.516973,...,0.000000,0.0,0.0,0.0,-0.516973,1.000000,0.0,-0.419954,0.0,-0.386867
15514,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.0,0.000000,0.000000,1.0,0.000000,0.0,0.000000
15515,0.0,0.0,-0.574406,-0.574406,0.0,0.0,0.0,0.219800,-0.812333,0.000000,...,-0.319705,0.0,0.0,0.0,0.812333,-0.419954,0.0,1.000000,0.0,0.487808
15516,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,...,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,1.0,0.000000


In [39]:
df_sim_matrix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15518 entries, 0 to 15517
Columns: 15518 entries, 0 to 15517
dtypes: float64(15518)
memory usage: 1.8 GB


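In [40]:
# train_test_split shuffles the rows, so fixed slices of sim_matrix do not line up
# with ratings_train / ratings_test. Compute the similarities from the split itself.
sim_matrix_train = 1 - cosine_distances(ratings_train)
sim_matrix_test = 1 - cosine_distances(ratings_test, ratings_train)

print(sim_matrix_train.shape)
print(sim_matrix_test.shape)

(10862, 10862)
(4656, 10862)


In [41]:
# Similarity-weighted average of the training ratings for every training user
users_predictions = sim_matrix_train.dot(ratings_train) / np.abs(sim_matrix_train).sum(axis=1, keepdims=True)
users_predictions.shape

(10862, 150)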

In [None]:
# Example user: any user_id present in df_matrix works here
user_id = df_matrix.index[0]
usuario_ver = df_matrix.index.get_loc(user_id)

# Predict over the full matrix so that row positions match df_matrix.index
all_predictions = sim_matrix.dot(ratings) / np.abs(sim_matrix).sum(axis=1, keepdims=True)
user0 = all_predictions[usuario_ver].argsort()

# Show the three games with the highest predicted score for this user
for i in user0[-3:][::-1]:
    print(df_matrix.columns[i], 'score:', all_predictions[usuario_ver][i])

In [None]:
def recomendacion_juego(item_name):
    # Item-item similarity: transpose so that each row is a game
    item_sim = 1 - cosine_distances(df_matrix.values.T)
    # Position of the game among the columns of df_matrix
    item_index = df_matrix.columns.get_loc(item_name)
    # Rank all games by similarity, most similar first, skipping the game itself
    order = item_sim[item_index].argsort()[::-1]
    top_5 = [i for i in order if i != item_index][:5]
    # Return the names of the recommended games
    return [df_matrix.columns[i] for i in top_5]


In [None]:
recomendacion_juego('Half-Life 2')

### Validating the error

In [None]:
def get_mse(preds, actuals):
    if preds.shape[1] != actuals.shape[1]:
        actuals = actuals.T
    preds = preds[actuals.nonzero()].flatten()
    actuals = actuals[actuals.nonzero()].flatten()
    return mean_squared_error(preds, actuals)

get_mse(users_predictions, ratings_train)

In [None]:
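# Predictions for the test set: similarity of each test user to the training users
sim_test_train = 1 - cosine_distances(ratings_test, ratings_train)
weights = np.abs(sim_test_train).sum(axis=1, keepdims=True)
weights[weights == 0] = 1  # avoid dividing by zero for test users with no similar training user
users_predictions_test = sim_test_train.dot(ratings_train) / weights

get_mse(users_predictions_test, ratings_test)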

## 5 Load

Finally, in this section we write the transformed data for the API endpoints to its final destination. We store it in parquet format with snappy compression to reduce its size on disk.

In [None]:
# DataFrames to save and their file names
dfs = [df_pseudo_db1, df_pseudo_db2, df_user_reviews]
names = ['pseudo-db1.parquet', 'pseudo-db2.parquet', 'user_sentiment_analysis.parquet']

# Output directory; create it if it does not exist
folder_path = '../data/processed/'
os.makedirs(folder_path, exist_ok=True)

for df, n in zip(dfs, names):
    # Full path of the output file
    path = os.path.join(folder_path, n)

    # Save the DataFrame as a snappy-compressed parquet file
    df.to_parquet(path, engine='pyarrow', compression='snappy')

    print(f"'{n}' was saved to '{folder_path}'")