# Introduction, objectives, and contents

This notebook covers the ETL (Extraction, Transformation and Loading) phase. The goal of this phase is to produce clean datasets that are ready to be used in later phases of the project.

Contents:
* Library imports
* Data loading
* Data preparation for each dataset
    * Feature engineering
    * Data type verification
    * Duplicate values
    * Null values
* Export of the clean datasets
* Building and exporting DataFrames for the API

# Library imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt
from math import factorial
from scipy import stats as st
import json
import gzip
import ast
from pandas import json_normalize
from textblob import TextBlob
import re

# Data loading

We have three datasets in compressed JSON format, so each one is loaded separately in order to handle its particular structure.

GAMES dataset: this file could be loaded as uncompressed JSON, so the loading code is simple.

In [2]:
steam_games = pd.read_json('steam_games.json', lines=True)

REVIEWS dataset: this dataset has a less standardized structure, so it is loaded with code that parses each line of the file individually. The parsed lines are collected in a list, which is then converted into a DataFrame.

In [3]:
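dataset_list_reviews = []
with gzip.open('user_reviews.json.gz', 'rb') as file:
    for line in file:
        # Each line is a Python dict literal, so it is parsed with ast.literal_eval
        dataset_list_reviews.append(ast.literal_eval(line.decode('utf-8')))
user_reviews = pd.DataFrame(dataset_list_reviews)
# The "with" block already closes the file, so no explicit close() is needed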

ITEMS dataset: like REVIEWS, this dataset has a less standardized structure, so it is loaded the same way: each line is parsed individually, collected in a list, and the list is converted into a DataFrame.

In [4]:
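dataset_list_items = []
with gzip.open('users_items.json.gz', 'rb') as file:
    for line in file:
        # Each line is a Python dict literal, so it is parsed with ast.literal_eval
        dataset_list_items.append(ast.literal_eval(line.decode('utf-8')))
user_items = pd.DataFrame(dataset_list_items)
# The "with" block already closes the file, so no explicit close() is needed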
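
The line-by-line loading described above can be tried without the original files. The following is a minimal sketch, assuming two made-up records (`raw_lines`) written as Python dict literals with single quotes, which is the kind of input a strict JSON parser rejects and `ast.literal_eval` accepts. The gzip payload is built in memory only to keep the example self-contained.

```python
import ast
import gzip
import io

import pandas as pd

# Hypothetical records written as Python dict literals (single quotes).
raw_lines = [
    "{'user_id': 'u1', 'items_count': 2}",
    "{'user_id': 'u2', 'items_count': 0}",
]

# Build a small gzip payload in memory so the sketch needs no file on disk.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as gz:
    gz.write("\n".join(raw_lines))
buffer.seek(0)

# Read it back line by line; "rt" mode decodes the bytes to text.
records = []
with gzip.open(buffer, "rt", encoding="utf-8") as gz:
    for line in gz:
        records.append(ast.literal_eval(line))

df_demo = pd.DataFrame(records)
print(df_demo)
```

`ast.literal_eval` only evaluates literals (dicts, lists, strings, numbers), so it is a safer choice than `eval` for this kind of input.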

# Data preparation

## GAMES dataset

### Feature engineering - GAMES dataset

In [5]:
df_games = steam_games
df_games.sample(2)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
116487,Jetdogs Studios,"[Adventure, Casual, Indie]",Sinister City,Sinister City,http://store.steampowered.com/app/326180/Sinis...,2014-10-18,"[Casual, Adventure, Hidden Object, Indie, Poin...",http://steamcommunity.com/app/326180/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",0.99,0.0,326180.0,Jetdogs Studios
98233,Trion Worlds,"[Action, Free to Play, Massively Multiplayer, ...",Devilian - Fallen Nightmare Pack,Devilian - Fallen Nightmare Pack,http://store.steampowered.com/app/642160/Devil...,2017-05-25,"[RPG, Massively Multiplayer, Violent, Sexual C...",http://steamcommunity.com/app/642160/reviews/?...,"[Multi-player, MMO, Co-op, Downloadable Conten...",99.99,0.0,642160.0,Bluehole Ginno Games


In [6]:
# Rename the "id" field
df_games.rename(columns={'id': 'item_id'}, inplace=True)

In [7]:
# Explode the fields whose values are lists
df_games = df_games.explode('genres')
df_games = df_games.explode('tags')
df_games = df_games.explode('specs')
df_games.sample(5)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer
107558,HypeTrain Digital,Simulation,The Wild Eight,The Wild Eight,http://store.steampowered.com/app/526160/The_W...,2017-02-08,Action,http://steamcommunity.com/app/526160/reviews/?...,Online Co-op,19.99,1.0,526160.0,HypeTrain Digital
91745,Vostok Games,Free to Play,Survarium,Survarium,http://store.steampowered.com/app/355840/Surva...,2015-04-02,First-Person,http://steamcommunity.com/app/355840/reviews/?...,MMO,Free to Play,1.0,355840.0,Vostok Games
82969,,,,,,,,,,,,,
97703,,Action,Cluckles' Adventure Soundtrack,Cluckles' Adventure Soundtrack,http://store.steampowered.com/app/615300/Cluck...,2017-04-10,Action,http://steamcommunity.com/app/615300/reviews/?...,Steam Leaderboards,0.99,0.0,615300.0,@lukasinspace
110187,Home Net Games,Strategy,The Pirate: Caribbean Hunt,The Pirate: Caribbean Hunt,http://store.steampowered.com/app/512470/The_P...,2016-08-24,Indie,http://steamcommunity.com/app/512470/reviews/?...,Online Multi-Player,Free To Play,0.0,512470.0,Home Net Games
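
Exploding several list columns in sequence produces one row per combination of values, which is worth keeping in mind when reading the duplicate counts later in this notebook. A minimal sketch on a made-up single-game frame (the names `games`, `exploded` and `genres_only` are illustrative only):

```python
import pandas as pd

# One hypothetical game with 2 genres and 3 tags.
games = pd.DataFrame({
    "item_id": [1],
    "genres": [["Action", "Indie"]],
    "tags": [["FPS", "Co-op", "Indie"]],
})

# Chained explodes yield the cross product: 2 genres x 3 tags = 6 rows.
exploded = games.explode("genres").explode("tags")
print(len(exploded))

# Once "tags" is dropped, most of those rows are exact duplicates.
print(exploded[["item_id", "genres"]].duplicated().sum())

# Exploding only the column that is actually needed avoids the blow-up.
genres_only = games[["item_id", "genres"]].explode("genres")
print(len(genres_only))
```

Selecting the needed columns before exploding is one possible way to keep the intermediate frame small; it is shown here as an option, not as what the cell above does.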


In [8]:
# Add the "year" field
default_date = pd.to_datetime('1900-01-01')  # Sentinel value used in place of invalid 'release_date' values
# Convert 'release_date' to datetime; unparseable values become NaT and are replaced by the sentinel
df_games['release_date'] = pd.to_datetime(df_games['release_date'], errors='coerce').fillna(default_date)

df_games['year'] = df_games['release_date'].dt.year
df_games = df_games[df_games['year'] != 1900]   # Discard the rows that received the sentinel date
df_games.sample(5)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer,year
119883,TrueThought,Casual,Everyday Genius: SquareLogic,Everyday Genius: SquareLogic,http://store.steampowered.com/app/32150/Everyd...,2009-10-09,Relaxing,http://steamcommunity.com/app/32150/reviews/?b...,Steam Cloud,4.99,0.0,32150.0,TrueThought,2009
105656,Anti Gravity Game Studios,Action,Hell Warders,Hell Warders,http://store.steampowered.com/app/628710/Hell_...,2017-06-06,Casual,http://steamcommunity.com/app/628710/reviews/?...,Online Co-op,14.99,1.0,628710.0,Anti Gravity Game Studios,2017
112791,VDO Games,RPG,Cally's Caves 3,Cally's Caves 3,http://store.steampowered.com/app/418120/Cally...,2016-01-05,Female Protagonist,http://steamcommunity.com/app/418120/reviews/?...,Single-player,6.99,0.0,418120.0,VDO Games,2016
104712,,Racing,RC Plane 3 - Stealth Plane,RC Plane 3 - Stealth Plane,http://store.steampowered.com/app/675821/RC_Pl...,2017-08-07,Simulation,http://steamcommunity.com/app/675821/reviews/?...,Partial Controller Support,2.99,0.0,675821.0,FrozenPepper S.R.L,2017
109094,Out of the Park Developments,Strategy,Franchise Hockey Manager 3,Franchise Hockey Manager 3,http://store.steampowered.com/app/465660/Franc...,2016-10-31,Indie,http://steamcommunity.com/app/465660/reviews/?...,Steam Achievements,19.99,0.0,465660.0,Out of the Park Developments,2016
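
The effect of `errors='coerce'` on invalid dates can be seen on a few hand-written values. This is a minimal sketch with made-up strings; it drops the unparseable entries directly with `dropna` instead of going through a sentinel date, which is an alternative formulation and not the cell's exact code.

```python
import pandas as pd

# Hypothetical release_date values, including one non-date string and one missing value.
dates = pd.Series(["2014-10-18", "Soon..", None, "2017-05-25"])

# Invalid or missing entries become NaT instead of raising an error.
parsed = pd.to_datetime(dates, errors="coerce")
print(parsed.isna().sum())

# Keep only the rows with a valid date and extract the year as an integer.
years = parsed.dt.year.dropna().astype(int)
print(years.tolist())
```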


In [9]:
# Drop the fields that will not be used
#df_games_eliminarcampos = ['url', 'title','release_date', 'reviews_url', 'specs', ]
# df_games = df_games.drop(df_games_eliminarcampos, axis=1)

In [10]:
# Select the fields to use
df_games = df_games[['item_id', 'app_name', 'genres', 'year', 'price', 'developer']]
df_games.sample(5)

Unnamed: 0,item_id,app_name,genres,year,price,developer
100787,724520.0,Magilore,RPG,2017,Free,Cole Kitroser
98279,612480.0,Arma 3 DLC Bundle 2,Action,2017,24.99,Bohemia Interactive
114367,369420.0,9 Clues 2: The Ward,Casual,2015,9.99,Tap It Games
89209,204180.0,Waveform,Indie,2012,1.99,Eden Industries
116271,262470.0,Rollers of the Realm,Action,2014,9.99,Phantom Compass


### Data type verification - GAMES dataset

In [11]:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1983295 entries, 88310 to 120443
Data columns (total 6 columns):
 #   Column     Dtype  
---  ------     -----  
 0   item_id    float64
 1   app_name   object 
 2   genres     object 
 3   year       int32  
 4   price      object 
 5   developer  object 
dtypes: float64(1), int32(1), object(4)
memory usage: 98.4+ MB


In [12]:
# Data type conversion

df_games['price'] = pd.to_numeric(df_games['price'], errors='coerce')  # Convert to numeric, coercing errors to NaN
#df_games['early_access'] = df_games['early_access'].astype(bool)       # Convert to boolean
df_games['item_id'] = pd.to_numeric(df_games['item_id'], errors='coerce')        # Convert to numeric, coercing errors to NaN
df_games['item_id'] = df_games['item_id'].fillna(0)                              # Replace missing ids with 0 so the integer cast succeeds
df_games['item_id'] = df_games['item_id'].astype(int)
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1983295 entries, 88310 to 120443
Data columns (total 6 columns):
 #   Column     Dtype  
---  ------     -----  
 0   item_id    int64  
 1   app_name   object 
 2   genres     object 
 3   year       int32  
 4   price      float64
 5   developer  object 
dtypes: float64(1), int32(1), int64(1), object(3)
memory usage: 98.4+ MB
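
Coercing `price` to numeric turns text labels such as "Free to Play" into NaN, which is one source of the null prices counted further down. The sketch below uses made-up values; the second step, mapping labels that contain "free" to 0.0, is an assumption shown only as a possible refinement and is not part of the notebook.

```python
import pandas as pd

# Hypothetical price values mixing numbers, numeric strings and text labels.
prices = pd.Series([4.99, "Free to Play", "Free", None, "19.99"])

# Non-numeric labels and missing values all become NaN.
numeric = pd.to_numeric(prices, errors="coerce")
print(numeric.isna().sum())

# Possible refinement (assumption): treat labels containing "free" as a price of 0.0.
free_mask = prices.astype(str).str.contains("free", case=False)
numeric_free = numeric.mask(free_mask, 0.0)
print(numeric_free.tolist())
```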


In [13]:
df_games.sample(5)

Unnamed: 0,item_id,app_name,genres,year,price,developer
93026,415730,STAR-BOX: Dark Hack,Indie,2015,0.99,Michael Flynn
92682,375290,Superstatic,Action,2015,4.99,Sleepy Studios
92610,361160,Steamalot: Epoch's Journey,Strategy,2015,4.99,Risen Phoenix Studios
117501,271499,Rocksmith® 2014 – Spin Doctors - “Two Princes”,Simulation,2014,2.99,Ubisoft - San Francisco
88961,39160,Dungeon Siege III,Action,2011,14.99,Obsidian Entertainment


### Duplicate value verification - GAMES dataset

In [14]:
df_games.duplicated().sum()

1911358

In [15]:
df_games = df_games.drop_duplicates().reset_index(drop=True)
df_games = df_games.drop_duplicates(subset=['item_id'])     # Drop rows whose "item_id" is duplicated (keeps the first occurrence)
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29782 entries, 0 to 71935
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   item_id    29782 non-null  int64  
 1   app_name   29781 non-null  object 
 2   genres     28548 non-null  object 
 3   year       29782 non-null  int32  
 4   price      27210 non-null  float64
 5   developer  28532 non-null  object 
dtypes: float64(1), int32(1), int64(1), object(3)
memory usage: 1.5+ MB


### Null value verification - GAMES dataset

In [16]:
df_games.isnull().sum()

item_id         0
app_name        1
genres       1234
year            0
price        2572
developer    1250
dtype: int64

It is difficult to identify a pattern in the null values that would explain why they exist. Given what this dataset will be used for, we drop the records that have null values in its main fields (genres and app_name).

In [17]:
df_games = df_games.dropna(subset=['genres', 'app_name'])

In [18]:
df_games.isnull().sum()

item_id         0
app_name        0
genres          0
year            0
price        2502
developer     169
dtype: int64

## REVIEWS dataset

### Feature engineering - REVIEWS dataset

In [19]:
df_reviews = user_reviews
df_reviews.sample(2)

Unnamed: 0,user_id,user_url,reviews
12896,LordMemestar,http://steamcommunity.com/id/LordMemestar,"[{'funny': '1 person found this review funny',..."
7547,76561198088587850,http://steamcommunity.com/profiles/76561198088...,"[{'funny': '', 'posted': 'Posted August 8, 201..."


In [20]:
# Explode the "reviews" field

df_reviews = user_reviews.explode('reviews')
df_reviews = pd.concat([df_reviews.drop(['reviews'], axis=1), df_reviews['reviews'].apply(pd.Series)], axis=1)
df_reviews.sample(5)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0
5149,76561198043497735,http://steamcommunity.com/profiles/76561198043...,,"Posted June 28, 2014.",,49600,No ratings yet,True,My eyes won't stop bleeding. 11/4,
9883,76561198097281049,http://steamcommunity.com/profiles/76561198097...,,"Posted January 27, 2015.",,223710,No ratings yet,True,La verdad este juego no lo encuentro en otro l...,
11662,educatedpotato,http://steamcommunity.com/id/educatedpotato,,"Posted April 3, 2015.",,248820,No ratings yet,True,Fantastic game; the multiplayer is tricky 2 se...,
19818,Ilikesalad,http://steamcommunity.com/id/Ilikesalad,,Posted March 28.,,238960,No ratings yet,True,"Cheers, Rian",
6470,waytogoidiot,http://steamcommunity.com/id/waytogoidiot,,Posted June 25.,,322330,No ratings yet,True,Dont Starve is a brilliant game and playing wi...,
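
The stray column named `0` in the output above appears because `apply(pd.Series)` also expands the rows that have no review. One way to avoid it is to flatten only the rows that actually hold a dict. This is a minimal sketch on made-up data, using `pd.json_normalize` as an alternative to the approach in the cell; the frame `users` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical users; the second one has no reviews.
users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reviews": [
        [{"item_id": "10", "recommend": True, "review": "great"},
         {"item_id": "20", "recommend": False, "review": "meh"}],
        [],
    ],
})

# One row per review; an empty list becomes a single NaN row.
exploded = users.explode("reviews").reset_index(drop=True)

# Flatten only the rows that contain a review dict.
has_review = exploded["reviews"].notna()
flat = pd.json_normalize(exploded.loc[has_review, "reviews"].tolist())
flat.insert(0, "user_id", exploded.loc[has_review, "user_id"].to_numpy())
print(flat)
```

Unlike the cell above, this drops users without reviews at this step rather than in the later null check.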


In [21]:
# Add the "date" and "year" fields

def extract_posted_date(posted_str):         # Extract the date from the "posted" field
    pattern = r'Posted (\w+ \d{1,2}, \d{4})' # Pattern observed in the data
    match = re.search(pattern, posted_str)
    if match:
        return match.group(1)
    else:
        return None

# Apply the function to extract the date from the "posted" field
df_reviews['posted_date'] = df_reviews['posted'].apply(lambda x: np.nan if pd.isna(x) else extract_posted_date(x))

df_reviews['posted_date'] = pd.to_datetime(df_reviews['posted_date'])
df_reviews['year_review'] = df_reviews['posted_date'].dt.year
df_reviews['year_review'] = df_reviews['year_review'].fillna(0)
df_reviews['year_review'] = df_reviews['year_review'].astype(int)

df_reviews.sample(2)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0,posted_date,year_review
11423,Doctor_Smooth,http://steamcommunity.com/id/Doctor_Smooth,,"Posted December 11, 2015.",,730,0 of 1 people (0%) found this review helpful,True,"This game, man, THIS game... is REALLY good! T...",,2015-12-11,2015
14404,mixadance,http://steamcommunity.com/id/mixadance,,"Posted February 7, 2013.",,220440,3 of 6 people (50%) found this review helpful,True,I love this game ^-^,,2013-02-07,2013
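
The pattern only matches dates that include a year. Values such as "Posted July 15." (visible in earlier samples) do not match, so those reviews end up with `year_review` equal to 0. A minimal sketch of the same regular expression on two strings taken from the outputs above:

```python
import re

# Same pattern as in the cell: month name, day, comma, four-digit year.
PATTERN = re.compile(r"Posted (\w+ \d{1,2}, \d{4})")

def extract_posted_date(posted):
    """Return the date text when a year is present, otherwise None."""
    match = PATTERN.search(posted)
    return match.group(1) if match else None

print(extract_posted_date("Posted December 11, 2015."))  # December 11, 2015
print(extract_posted_date("Posted July 15."))            # None
```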


In [22]:
# Sentiment analysis based on the "review" field

df_reviews['review'] = df_reviews['review'].astype(str)
df_reviews['polarity'] = df_reviews['review'].apply(lambda text: TextBlob(text).sentiment.polarity)
df_reviews['sentiment'] = pd.cut(df_reviews['polarity'], bins=[-float('inf'), -0.001, 0.0, float('inf')], labels=[0, 1, 2])

In [23]:
df_reviews.sample(5)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0,posted_date,year_review,polarity,sentiment
9532,Koreanpony,http://steamcommunity.com/id/Koreanpony,,Posted July 15.,,417990,0 of 3 people (0%) found this review helpful,False,pls god Translation korean I don`t no fxxk th...,,NaT,0,0.0,1
10339,Bobyzola,http://steamcommunity.com/id/Bobyzola,1 person found this review funny,"Posted November 9, 2015.",,271590,No ratings yet,True,Winning a race on a pink faggio and beating pe...,,2015-11-09,2015,0.2,2
8318,Cybs,http://steamcommunity.com/id/Cybs,1 person found this review funny,"Posted February 1, 2015.",,239140,2 of 3 people (67%) found this review helpful,True,https://www.youtube.com/watch?v=WA77HSGk0FAZom...,,2015-02-01,2015,0.036837,2
14976,76561198102409708,http://steamcommunity.com/profiles/76561198102...,,"Posted January 26, 2015.",,313120,2 of 2 people (100%) found this review helpful,True,The game sofar has been enjoyable although the...,,2015-01-26,2015,0.122559,2
23973,awesomepunk39,http://steamcommunity.com/id/awesomepunk39,,"Posted December 15, 2013.",,50130,0 of 1 people (0%) found this review helpful,True,HAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...,,2013-12-15,2013,0.0,1
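
The binning step maps polarity to three codes: 0 for negative, 1 for neutral and 2 for positive. Because `pd.cut` uses right-closed intervals by default, a polarity of exactly 0.0 falls in the neutral bin. A minimal sketch with hand-picked polarity values (no TextBlob call is needed to see the binning):

```python
import pandas as pd

# Hand-picked polarity values around the bin edges.
polarity = pd.Series([-0.75, -0.0005, 0.0, 0.2])

# Bins: (-inf, -0.001] -> 0, (-0.001, 0.0] -> 1, (0.0, inf] -> 2
sentiment = pd.cut(
    polarity,
    bins=[-float("inf"), -0.001, 0.0, float("inf")],
    labels=[0, 1, 2],
)
print(sentiment.astype(int).tolist())  # [0, 1, 1, 2]
```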


In [24]:
# Select the fields to use
df_reviews = df_reviews[['item_id', 'user_id', 'recommend', 'year_review', 'polarity', 'sentiment']]
df_reviews.sample(5)

Unnamed: 0,item_id,user_id,recommend,year_review,polarity,sentiment
1196,440,ias19112000,False,2013,-0.75,0
18574,218620,76561197991329759,True,2013,-0.04,0
15038,65980,76561198069214448,True,2014,0.275,2
9547,570,cumtasteslikejelley,True,2014,-0.12,0
14652,333600,pouyf,True,2015,-0.4,0


### Data type verification - REVIEWS dataset

In [25]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59333 entries, 0 to 25798
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   item_id      59305 non-null  object  
 1   user_id      59333 non-null  object  
 2   recommend    59305 non-null  object  
 3   year_review  59333 non-null  int64   
 4   polarity     59333 non-null  float64 
 5   sentiment    59333 non-null  category
dtypes: category(1), float64(1), int64(1), object(3)
memory usage: 2.8+ MB


In [26]:
# Data type conversion
df_reviews['item_id'] = pd.to_numeric(df_reviews['item_id'], errors='coerce')
df_reviews['recommend'] = df_reviews['recommend'].astype(bool)
df_reviews['sentiment'] = pd.to_numeric(df_reviews['sentiment'], errors='coerce')
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59333 entries, 0 to 25798
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   item_id      59305 non-null  float64
 1   user_id      59333 non-null  object 
 2   recommend    59333 non-null  bool   
 3   year_review  59333 non-null  int64  
 4   polarity     59333 non-null  float64
 5   sentiment    59333 non-null  int64  
dtypes: bool(1), float64(2), int64(2), object(1)
memory usage: 2.8+ MB


### Duplicate value verification - REVIEWS dataset

In [27]:
df_reviews.duplicated().sum()

874

In [28]:
# Drop duplicates
df_reviews = df_reviews.drop_duplicates().reset_index(drop=True)
df_reviews.duplicated().sum()

0

### Null value verification - REVIEWS dataset

In [29]:
df_reviews.isnull().sum()

item_id        28
user_id         0
recommend       0
year_review     0
polarity        0
sentiment       0
dtype: int64

In [30]:
df_reviews = df_reviews.dropna(subset=['item_id'])
df_reviews.isnull().sum()

item_id        0
user_id        0
recommend      0
year_review    0
polarity       0
sentiment      0
dtype: int64

## USAGE dataset

### Feature engineering - USAGE dataset

In [31]:
df_usage = user_items
df_usage.sample(2)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
43553,76561197974400211,10,76561197974400211,http://steamcommunity.com/profiles/76561197974...,"[{'item_id': '220', 'item_name': 'Half-Life 2'..."
29311,76561198059653682,116,76561198059653682,http://steamcommunity.com/profiles/76561198059...,"[{'item_id': '3910', 'item_name': 'Sid Meier's..."


In [32]:
# Explode the "items" field

df_usage = user_items.explode('items')

df_usage = df_usage.reset_index(drop=True)
def obtener_elemento(diccionario, clave_busqueda):
    # Return the value stored under the key, or the input itself (e.g. NaN) when it is not a dict
    if isinstance(diccionario, dict):
        return diccionario.get(clave_busqueda)
    else:
        return diccionario

# Each field is extracted separately to avoid excessive processing time
df_usage['item_id'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'item_id'))
df_usage['item_name'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'item_name'))
df_usage['playtime_forever'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'playtime_forever'))
df_usage['playtime_2weeks'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'playtime_2weeks'))

df_usage.sample(2)

Unnamed: 0,user_id,items_count,steam_id,user_url,items,item_id,item_name,playtime_forever,playtime_2weeks
1120416,76561198036657265,108,76561198036657265,http://steamcommunity.com/profiles/76561198036...,"{'item_id': '468700', 'item_name': 'NVIDIA® VR...",468700,NVIDIA® VR Funhouse,26.0,0.0
1551236,Sad0Panda,198,76561198062734284,http://steamcommunity.com/id/Sad0Panda,"{'item_id': '207400', 'item_name': 'eXceed 3rd...",207400,eXceed 3rd - Jade Penetrate Black Package,755.0,0.0
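
The same extraction can also be done in a single pass over the exploded column. The following is a minimal sketch on made-up data (the frame `usage`, the second game name and the playtime values are hypothetical); it is an alternative formulation and makes no claim about being faster on the full dataset.

```python
import pandas as pd

# Hypothetical users; the second one owns no items.
usage = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "items": [
        [{"item_id": "220", "item_name": "Half-Life 2", "playtime_forever": 26},
         {"item_id": "9999", "item_name": "Example Game", "playtime_forever": 755}],
        [],
    ],
})
exploded = usage.explode("items").reset_index(drop=True)

# Build all requested fields at once; non-dict rows (NaN) yield empty values.
fields = ["item_id", "playtime_forever"]
extracted = pd.DataFrame(
    [{k: d.get(k) for k in fields} if isinstance(d, dict) else dict.fromkeys(fields)
     for d in exploded["items"]],
    columns=fields,
)
result = pd.concat([exploded[["user_id"]], extracted], axis=1)
print(result)
```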


In [33]:
# Select the fields to use

df_usage = df_usage[['item_id', 'user_id', 'playtime_forever']]
df_usage.sample(2)

Unnamed: 0,item_id,user_id,playtime_forever
3440760,202990,76561198009729872,335.0
556916,463150,CSMisBeast,211.0


### Data type verification - USAGE dataset


In [34]:
df_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5170015 entries, 0 to 5170014
Data columns (total 3 columns):
 #   Column            Dtype  
---  ------            -----  
 0   item_id           object 
 1   user_id           object 
 2   playtime_forever  float64
dtypes: float64(1), object(2)
memory usage: 118.3+ MB


In [35]:
# Data type conversion
df_usage['item_id'] = pd.to_numeric(df_usage['item_id'], errors='coerce')
df_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5170015 entries, 0 to 5170014
Data columns (total 3 columns):
 #   Column            Dtype  
---  ------            -----  
 0   item_id           float64
 1   user_id           object 
 2   playtime_forever  float64
dtypes: float64(2), object(1)
memory usage: 118.3+ MB


### Duplicate value verification - USAGE dataset

In [36]:
df_usage.duplicated().sum()

59209

In [37]:
# Drop duplicates
df_usage = df_usage.drop_duplicates().reset_index(drop=True)
df_usage.duplicated().sum()

0

### Null value verification - USAGE dataset

In [38]:
df_usage.isnull().sum()

item_id             16714
user_id                 0
playtime_forever    16714
dtype: int64

In [39]:
df_usage = df_usage.dropna(subset=['item_id'])
df_usage.isnull().sum()

item_id             0
user_id             0
playtime_forever    0
dtype: int64

# Export of the clean datasets

In [40]:
df_games.to_csv('df_games.csv', index=False)
df_reviews.to_csv('df_reviews.csv', index=False)
df_usage.to_csv('df_usage.csv', index=False)

# Building and exporting DataFrames for the API

## Samples

In [41]:
df_games.sample(2)

Unnamed: 0,item_id,app_name,genres,year,price,developer
47990,529200,Streamline Original Sound Track,Action,2016,8.99,Proletariat Inc.
28940,767230,Trainz Route: Cornish Mainline & Branches,Simulation,2017,,N3V Games


In [42]:
df_reviews.sample(2)

Unnamed: 0,item_id,user_id,recommend,year_review,polarity,sentiment
53808,440.0,DEATHFLINGER,True,2014,0.0,1
3516,311210.0,dietrich1998,False,2015,-0.089792,0


In [43]:
df_usage.sample(2)

Unnamed: 0,item_id,user_id,playtime_forever
484379,221640.0,76561198050156481,149.0
958171,333930.0,76561198067903499,1390.0


## Endpoint 1

def PlayTimeGenre(genre: str): Must return the release year with the most hours played for the given genre.

Example return: {"Año de lanzamiento con más horas jugadas para Género X" : 2013}

In [44]:
# Build the DataFrame for the function
df_e1 = pd.merge(df_games, df_usage, on="item_id", how="inner")
df_e1 = df_e1[['genres', 'year','playtime_forever']]

# Series holding, for each genre, the index of the row with the maximum playtime
df_e1_indmax = df_e1.groupby('genres')['playtime_forever'].idxmax() 

# Use those indices to get the corresponding years
df_e1 = df_e1.loc[df_e1_indmax, ['genres', 'year', 'playtime_forever']] 

# Keep the year of the maximum playtime_forever per genre
df_e1 = df_e1[['genres', 'year']]
df_e1.to_csv('df_e1.csv', index=False)
df_e1

Unnamed: 0,genres,year
847814,Action,2012
1363800,Adventure,2015
1346922,Animation &amp; Modeling,2015
2389134,Audio Production,2014
3381479,Casual,2011
2905284,Design &amp; Illustration,2012
2273948,Early Access,2014
1054235,Education,2014
3253181,Free to Play,2012
27418,Indie,2006


In [45]:
# Test the DataFrame for the function
df_func1 = df_e1[df_e1['genres'] == 'Strategy']
df_func1

Unnamed: 0,genres,year
411178,Strategy,2010


In [46]:
# Define the function
def PlayTimeGenre(input_genre: str):
    try:
        df_e1 = pd.read_csv("df_e1.csv")                # Read the DataFrame
        df_e1 = df_e1[df_e1["genres"] == input_genre]   # Filter by the input genre
        
        output_year = df_e1.loc[df_e1['year'].idxmax(), 'year']
    
        return {f"Año de lanzamiento con más horas jugadas para {input_genre}": output_year}
    except Exception as e:
        return {"error": str(e)}

In [47]:
# Test the function
input_genre = "Strategy"
print(PlayTimeGenre(input_genre))

{'Año de lanzamiento con más horas jugadas para Strategy': 2010}
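
The DataFrame built for this endpoint keeps, for each genre, the year of the single row with the largest `playtime_forever`. If the endpoint is read as "the release year whose games accumulate the most playtime", the hours would need to be summed per genre and year first. The sketch below contrasts both readings on made-up numbers; the frame `merged` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical merged games/usage rows.
merged = pd.DataFrame({
    "genres": ["Strategy", "Strategy", "Strategy", "Action"],
    "year": [2010, 2012, 2012, 2015],
    "playtime_forever": [500.0, 300.0, 400.0, 50.0],
})

# Reading 1: year of the single largest playtime row per genre.
single_max = merged.loc[merged.groupby("genres")["playtime_forever"].idxmax()]
single_year = dict(zip(single_max["genres"], single_max["year"]))

# Reading 2: sum playtime per genre and year, then take the best year per genre.
hours_by_year = merged.groupby(["genres", "year"], as_index=False)["playtime_forever"].sum()
top_rows = hours_by_year.loc[hours_by_year.groupby("genres")["playtime_forever"].idxmax()]
top_year = dict(zip(top_rows["genres"], top_rows["year"]))

print(single_year["Strategy"], top_year["Strategy"])  # 2010 2012
```

In this toy data the two readings disagree for Strategy: one user's 500 in 2010 is the largest single row, while 2012 has the largest total (700).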


## Endpoint 2

def UserForGenre( genero : str ): Must return the user with the most accumulated hours played for the given genre, plus a list of the hours played accumulated per year.

Example return: {"Usuario con más horas jugadas para Género X" : us213ndjss09sdf, "Horas jugadas":[{Año: 2013, Horas: 203}, {Año: 2012, Horas: 100}, {Año: 2011, Horas: 23}]}

In [48]:
df_e2 = pd.merge(df_games, df_usage, on="item_id", how="inner") # Merge the datasets
df_e2 = df_e2[['genres', 'year', 'user_id','playtime_forever']] # Select the fields

# Get the index of the row with the maximum playtime_forever for each genre
df_e2_indmax = df_e2.groupby('genres')['playtime_forever'].idxmax()
# Use those indices to get the corresponding years
# The release year is used as the year of the playtime
df_e2 = df_e2.loc[df_e2_indmax, ['genres', 'year', 'user_id', 'playtime_forever']]

In [49]:
df_e2_users = df_e2[['genres', 'user_id']]
df_e2_users.to_csv('df_e2_users.csv', index=False)
df_e2_users

Unnamed: 0,genres,user_id
847814,Action,Evilutional
1363800,Adventure,idonothack
1346922,Animation &amp; Modeling,ScottyG555
2389134,Audio Production,Lickidactyl
3381479,Casual,tsunamitad
2905284,Design &amp; Illustration,76561198035718256
2273948,Early Access,76561198084846677
1054235,Education,SeedyDog
3253181,Free to Play,76561198063368177
27418,Indie,wolop


In [50]:
df_e2_playtime = df_e2[['genres', 'year', 'user_id', 'playtime_forever']]
df_e2_playtime.to_csv('df_e2_playtime.csv', index=False)
df_e2_playtime

Unnamed: 0,genres,year,user_id,playtime_forever
847814,Action,2012,Evilutional,635295.0
1363800,Adventure,2015,idonothack,333482.0
1346922,Animation &amp; Modeling,2015,ScottyG555,168314.0
2389134,Audio Production,2014,Lickidactyl,109916.0
3381479,Casual,2011,tsunamitad,600068.0
2905284,Design &amp; Illustration,2012,76561198035718256,102554.0
2273948,Early Access,2014,76561198084846677,1241.0
1054235,Education,2014,SeedyDog,3082.0
3253181,Free to Play,2012,76561198063368177,439912.0
27418,Indie,2006,wolop,642773.0


## Endpoint 3

def UsersRecommend( año : int ): Returns the top 3 games MOST recommended by users for the given year (reviews.recommend = True and positive/neutral comments).

Example return: [{"Puesto 1" : X}, {"Puesto 2" : Y},{"Puesto 3" : Z}]

In [51]:
# Build the DataFrame for the function
df_e3 = df_reviews[(df_reviews['recommend'] == True) & (df_reviews['sentiment'] >= 0)].copy()
df_e3['year_review'] = pd.to_numeric(df_e3['year_review'], errors='coerce')
df_e3 = df_e3[df_e3['year_review'] != 0]   # Discard reviews whose year could not be parsed
df_e3 = df_e3.groupby(['year_review', 'item_id'])['recommend'].count().reset_index()
df_e3.rename(columns={'recommend': 'recommend_count'}, inplace=True)
df_e3 = pd.merge(df_e3, df_games, on="item_id", how="left").sort_values(by='recommend_count', ascending=False)
df_e3 = df_e3.dropna(subset=['app_name'])
df_e3 = df_e3[['year_review','app_name', 'recommend_count']]

df_e3.to_csv('df_e3.csv', index=False)
df_e3

Unnamed: 0,year_review,app_name,recommend_count
2847,2014,Team Fortress 2,1547
4320,2015,Counter-Strike: Global Offensive,1527
2853,2014,Counter-Strike: Global Offensive,1068
2159,2013,Team Fortress 2,790
2887,2014,Garry's Mod,757
...,...,...,...
2434,2013,Post Apocalyptic Mayhem,1
2436,2013,SpaceChem,1
2437,2013,Dinner Date,1
2438,2013,Jamestown,1


In [52]:
# Test the DataFrame for the function
df_func3 = df_e3[df_e3['year_review'] == 2015].head(3)
df_func3

Unnamed: 0,year_review,app_name,recommend_count
4320,2015,Counter-Strike: Global Offensive,1527
4314,2015,Team Fortress 2,634
4353,2015,Garry's Mod,357


In [53]:
# Define the function
def UsersRecommend(input_year: int):
    try:
        df_e3 = pd.read_csv("df_e3.csv")
        output_top3 = df_e3[df_e3['year_review'] == input_year].head(3)
        # One dict per position ({"Puesto N": game}), matching the expected return format
        output_top3_list = [{"Puesto {}".format(i+1): game} for i, game in enumerate(output_top3['app_name'])]
        return output_top3_list
    except Exception as e:
        return {"error": str(e)}

In [54]:
# Test the function
input_year = 2015
print(UsersRecommend(input_year))

[{'Puesto 1': 'Counter-Strike: Global Offensive'}, {'Puesto 2': 'Team Fortress 2'}, {'Puesto 3': "Garry's Mod"}]


## Endpoint 4

def UsersNotRecommend( año : int ): Returns the top 3 games LEAST recommended by users for the given year (reviews.recommend = False and negative comments).

Example return: [{"Puesto 1" : X}, {"Puesto 2" : Y},{"Puesto 3" : Z}]

In [55]:
# Build the DataFrame for the function
df_e4 = df_e3 # Note that this is the same DataFrame used for the previous function
df_e4.to_csv('df_e4.csv', index=False)
df_e4

Unnamed: 0,year_review,app_name,recommend_count
2847,2014,Team Fortress 2,1547
4320,2015,Counter-Strike: Global Offensive,1527
2853,2014,Counter-Strike: Global Offensive,1068
2159,2013,Team Fortress 2,790
2887,2014,Garry's Mod,757
...,...,...,...
2434,2013,Post Apocalyptic Mayhem,1
2436,2013,SpaceChem,1
2437,2013,Dinner Date,1
2438,2013,Jamestown,1


In [56]:
# Test the DataFrame for the function
df_func4 = df_e3[df_e3['year_review'] == 2015].tail(3)
df_func4

Unnamed: 0,year_review,app_name,recommend_count
5894,2015,Cards and Castles,1
5358,2015,Colin McRae Rally,1
6133,2015,The Quest,1


In [57]:
# Define the function
def UsersNotRecommend(input_year: int):
    try:
        df_e4 = pd.read_csv("df_e4.csv")
        output_last3 = df_e4[df_e4['year_review'] == input_year].tail(3)
        # One dict per position ({"Puesto N": game}), matching the expected return format
        output_last3_list = [{"Puesto {}".format(i+1): game} for i, game in enumerate(output_last3['app_name'])]
        return output_last3_list
    except Exception as e:
        return {"error": str(e)}

In [58]:
# Test the function
input_year = 2015
print(UsersNotRecommend(input_year))

[{'Puesto 1': 'Cards and Castles'}, {'Puesto 2': 'Colin McRae Rally'}, {'Puesto 3': 'The Quest'}]
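
The endpoint descriptions for UsersRecommend and UsersNotRecommend filter on both `recommend` and the sentiment code (0 negative, 1 neutral, 2 positive). A literal reading of those descriptions could be sketched as follows, on made-up reviews. The helper `top3`, the `names` lookup and the data are hypothetical, and the ranking criterion (number of matching reviews) is an assumption.

```python
import pandas as pd

# Hypothetical reviews for a single year.
reviews = pd.DataFrame({
    "item_id": [1, 1, 2, 2, 3, 3, 3],
    "recommend": [True, True, True, False, False, False, True],
    "year_review": [2015] * 7,
    "sentiment": [2, 1, 0, 0, 0, 0, 2],
})
names = {1: "Game A", 2: "Game B", 3: "Game C"}

def top3(df, year, recommend, sentiments):
    """Rank games by the number of reviews matching the filters."""
    mask = (
        (df["year_review"] == year)
        & (df["recommend"] == recommend)
        & (df["sentiment"].isin(sentiments))
    )
    counts = df[mask].groupby("item_id").size().sort_values(ascending=False).head(3)
    return [{f"Puesto {i + 1}": names[item]} for i, item in enumerate(counts.index)]

# Most recommended: recommend = True with neutral or positive sentiment.
most = top3(reviews, 2015, recommend=True, sentiments=[1, 2])
# Least recommended: recommend = False with negative sentiment.
least = top3(reviews, 2015, recommend=False, sentiments=[0])
print(most)
print(least)
```

Under this reading, the "least recommended" ranking is built from the negative reviews themselves rather than from the tail of the positive ranking.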


## Endpoint 5

def sentiment_analysis( año : int ): Based on the release year, returns a list with the number of user review records categorized under each sentiment-analysis label.

Example return: {Negative = 182, Neutral = 120, Positive = 278}

In [59]:
# Build the DataFrame for the function
df_e5 = df_reviews[df_reviews['year_review'] != 0].copy()
# Create the 'negative', 'neutral' and 'positive' columns
# Sentiment codes assigned in the pd.cut step: 0 = negative, 1 = neutral, 2 = positive
df_e5['negative'] = (df_e5['sentiment'] == 0).astype(int)
df_e5['neutral'] = (df_e5['sentiment'] == 1).astype(int)
df_e5['positive'] = (df_e5['sentiment'] == 2).astype(int)

# Since results are not requested per item_id, year_review is used in place of the release year, because what is requested is the sum of the sentiments
# Group by year_review and count the values of each sentiment
df_e5 = df_e5.groupby('year_review')[['negative', 'neutral', 'positive']].sum()
df_e5 = df_e5.rename_axis('year').reset_index()

df_e5.to_csv('df_e5.csv', index=False)
df_e5

Unnamed: 0,year,negative,neutral,positive
0,2010,11,8,47
1,2011,94,70,366
2,2012,206,215,780
3,2013,1230,1362,4121
4,2014,4389,4736,12709
5,2015,4149,4257,9748


In [60]:
# Test the DataFrame for the function
year = 2015
df_func5 = df_e5[df_e5['year'] == year]
df_func5

Unnamed: 0,year,negative,neutral,positive
5,2015,4149,4257,9748


In [61]:
# Define the function
def SentimentAnalysis(input_year: int):
    try:
        df_e5 = pd.read_csv("df_e5.csv")
        df_year = df_e5[df_e5['year'] == input_year]   # Filter by the input year

        value_negative = df_year['negative'].values[0]
        value_neutral = df_year['neutral'].values[0]
        value_positive = df_year['positive'].values[0]

        output_sentiment_list = f"Para el año {input_year} se registran los siguientes valores: negative: {value_negative}, neutral: {value_neutral}, positive: {value_positive}"       
        return output_sentiment_list
    except Exception as e:
        return {"error": str(e)}

In [62]:
# Test the function
input_year = 2015
print(SentimentAnalysis(input_year))

Para el año 2015 se registran los siguientes valores: negative: 4149, neutral: 4257, positive: 9748
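
The example return for this endpoint is a dictionary keyed by sentiment label. A minimal sketch of producing that shape directly from the sentiment codes, assuming the mapping 0 = negative, 1 = neutral, 2 = positive from the binning step; the helper `sentiment_counts`, the `LABELS` mapping and the sample data are hypothetical.

```python
import pandas as pd

# Assumed mapping from sentiment code to label.
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Hypothetical reviews.
reviews = pd.DataFrame({
    "year_review": [2014, 2014, 2015, 2015, 2015],
    "sentiment": [0, 2, 1, 2, 2],
})

def sentiment_counts(df, year):
    """Count the reviews of a given year per sentiment label."""
    counts = df.loc[df["year_review"] == year, "sentiment"].value_counts()
    # Labels with no reviews are reported as 0 instead of being omitted.
    return {name: int(counts.get(code, 0)) for code, name in LABELS.items()}

print(sentiment_counts(reviews, 2015))  # {'Negative': 0, 'Neutral': 1, 'Positive': 2}
```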
