# Introduction, objectives, and contents

This notebook covers the ETL (Extract, Transform, Load) phase. Its goal is to produce clean datasets that are ready to be used in later phases of the project.

Contents:
* Library imports
* Data loading
* Data preparation for each dataset
    * Feature engineering
    * Data type verification
    * Duplicate values
    * Null values
* Export of the clean datasets
* Building and exporting dataframes for the API

# Library imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt
from math import factorial
from scipy import stats as st
import json
import gzip
import ast
from pandas import json_normalize
from textblob import TextBlob
import re

# Data loading

There are 3 datasets in total, all in compressed JSON format, so each one is loaded separately to apply the handling it requires.

Dataset GAMES: This file could be loaded as uncompressed JSON, so its loading code is simple.

In [2]:
steam_games = pd.read_json('steam_games.json', lines=True)

Dataset REVIEWS: This dataset has a less standardized structure, so it is loaded with code that parses each line of the file individually. The parsed lines are collected in a list, which is then converted into a dataframe.

In [3]:
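dataset_list_reviews = []
with gzip.open('user_reviews.json.gz', 'rb') as file:
    for line in file:
        # Each line is parsed as a Python literal with ast.literal_eval
        dataset_list_reviews.append(ast.literal_eval(line.decode('utf-8')))
user_reviews = pd.DataFrame(dataset_list_reviews)
# No explicit close() is needed: the with block already closes the file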

Dataset ITEMS: Like REVIEWS, this dataset has a less standardized structure, so it is loaded with the same line-by-line parsing. The parsed lines are collected in a list, which is then converted into a dataframe.

In [4]:
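dataset_list_items = []
with gzip.open('users_items.json.gz', 'rb') as file:
    for line in file:
        # Each line is parsed as a Python literal with ast.literal_eval
        dataset_list_items.append(ast.literal_eval(line.decode('utf-8')))
user_items = pd.DataFrame(dataset_list_items)
# No explicit close() is needed: the with block already closes the file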
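
Both compressed files are read with the same loop, so it can be factored into a single helper. The sketch below assumes that every line is a Python literal (for example a single-quoted dict) rather than strict JSON, which is why `ast.literal_eval` is used. The helper name `load_literal_lines` and the demo file are hypothetical and only serve to make the example runnable on its own.

```python
import ast
import gzip
import os
import tempfile

import pandas as pd


def load_literal_lines(path):
    """Read a gzip file holding one Python-literal dict per line into a DataFrame."""
    records = []
    # Text mode decodes each line; the context manager closes the file on exit.
    with gzip.open(path, 'rt', encoding='utf-8') as fh:
        for line in fh:
            records.append(ast.literal_eval(line))
    return pd.DataFrame(records)


# Tiny demo file with made-up rows, written only so the sketch can run
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as fh:
    fh.write("{'user_id': 'u1', 'items_count': 2}\n")
    fh.write("{'user_id': 'u2', 'items_count': 0}\n")

demo_df = load_literal_lines(demo_path)
print(demo_df)
```

With such a helper, each dataset would be loaded with one call per file.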

# Data preparation

## Dataset GAMES

### Feature engineering - Dataset GAMES

In [5]:
df_games = steam_games
df_games.sample(2)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
106964,,"[Indie, RPG, Strategy]",Fantasy Grounds - Compass Point 04 - Grunk's L...,Fantasy Grounds - Compass Point 04 - Grunk's L...,http://store.steampowered.com/app/608150/Fanta...,2017-03-20,"[RPG, Indie, Strategy, Utilities, Turn-Based, ...",http://steamcommunity.com/app/608150/reviews/?...,"[Multi-player, Co-op, Cross-Platform Multiplay...",5.99,0.0,608150.0,"SmiteWorks USA, LLC"
102182,Artifex Mundi,"[Adventure, Casual]",Eventide 2: Sorcerer's Mirror - Artbook & Soun...,Eventide 2: Sorcerer's Mirror - Artbook &amp; ...,http://store.steampowered.com/app/709300/Event...,2017-11-22,"[Adventure, Casual, Great Soundtrack, Soundtrack]",http://steamcommunity.com/app/709300/reviews/?...,"[Single-player, Downloadable Content]",2.99,0.0,709300.0,The House of Fables


In [6]:
# Rename the "id" field
df_games.rename(columns={'id': 'item_id'}, inplace=True)

In [7]:
# Explode the fields whose values are lists
df_games = df_games.explode('genres')
df_games = df_games.explode('tags')
df_games = df_games.explode('specs')
df_games.sample(5)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer
102771,Hoobalugalar_X,Free to Play,Codename CURE,Codename CURE,http://store.steampowered.com/app/355180/Coden...,2017-10-31,Massively Multiplayer,http://steamcommunity.com/app/355180/reviews/?...,Partial Controller Support,Free,0.0,355180.0,Hoobalugalar_X
93087,CivilSavages,Indie,Dispatcher,Dispatcher,http://store.steampowered.com/app/341980/Dispa...,2015-11-12,RPG,http://steamcommunity.com/app/341980/reviews/?...,Steam Achievements,4.99,0.0,341980.0,CivilSavages
94585,Pixelbomb Games,Strategy,Beyond Flesh and Blood,Beyond Flesh and Blood,http://store.steampowered.com/app/391550/Beyon...,2016-06-01,Adventure,http://steamcommunity.com/app/391550/reviews/?...,Full controller support,14.99,1.0,391550.0,Pixelbomb Games
117102,NEOTOKYO [MOD],Free to Play,NEOTOKYO,NEOTOKYO,http://store.steampowered.com/app/244630/NEOTO...,2014-05-01,Sci-fi,http://steamcommunity.com/app/244630/reviews/?...,Multi-player,Free,0.0,244630.0,STUDIO RADI-8
100188,Klabater,Massively Multiplayer,Heliborne - Search and Rescue Camouflage Pack,Heliborne - Search and Rescue Camouflage Pack,http://store.steampowered.com/app/730490/Helib...,2017-10-12,Action,http://steamcommunity.com/app/730490/reviews/?...,Cross-Platform Multiplayer,3.99,0.0,730490.0,JetCat Games
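
Exploding several list columns one after another produces every combination of their values for each game, which helps explain the very large row count and the number of duplicates found later in this notebook. The toy catalogue below is invented for illustration. It compares exploding two columns with exploding only `genres`, the one list column that is kept after the field selection.

```python
import pandas as pd

# Hypothetical two-game catalogue with list-valued columns
toy = pd.DataFrame({
    'item_id': [1, 2],
    'genres': [['Indie', 'RPG'], ['Action']],
    'tags': [['RPG', 'Indie', 'Strategy'], ['Action', 'FPS']],
})

# Exploding both columns yields every genre/tag combination per game
both = toy.explode('genres').explode('tags')
print(len(both))            # 2*3 + 1*2 rows

# Exploding only the column that is kept later gives one row per genre
genres_only = toy[['item_id', 'genres']].explode('genres')
print(len(genres_only))     # 2 + 1 rows

# Once 'tags' is dropped, both approaches hold the same distinct rows
same = both[['item_id', 'genres']].drop_duplicates().reset_index(drop=True)
print(same.equals(genres_only.reset_index(drop=True)))
```

Under these assumptions, exploding only the columns that survive the later field selection would reach the same distinct rows with far fewer intermediate rows.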


In [8]:
# Add the "year" field
default_date = pd.to_datetime('1900-01-01')  # Sentinel value imputed in place of invalid 'release_date' values
df_games['release_date'] = pd.to_datetime(df_games['release_date'], errors='coerce').fillna(default_date)

df_games['release_date'] = pd.to_datetime(df_games['release_date'])     # Ensure 'release_date' holds datetime objects
df_games['year'] = df_games['release_date'].dt.year
df_games = df_games[df_games['year'] != 1900]
df_games.sample(5)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer,year
106595,Interplay Entertainment Corp.,Strategy,M.A.X.: Mechanized Assault & Exploration,M.A.X.: Mechanized Assault &amp; Exploration,http://store.steampowered.com/app/615250/MAX_M...,1998-07-31,Turn-Based,http://steamcommunity.com/app/615250/reviews/?...,Single-player,9.99,0.0,615250.0,Interplay Entertainment Corp.,1998
104842,Tonguç Bodur,Indie,The Hunting God,The Hunting God,http://store.steampowered.com/app/679190/The_H...,2017-07-28,Dragons,http://steamcommunity.com/app/679190/reviews/?...,Single-player,3.99,0.0,679190.0,Tonguç Bodur,2017
104978,Forever Entertainment S. A.,Strategy,Pastry Lovers,Pastry Lovers,http://store.steampowered.com/app/568750/Pastr...,2017-07-21,Romance,http://steamcommunity.com/app/568750/reviews/?...,Steam Achievements,4.99,0.0,568750.0,橙光游戏,2017
108644,GamersHype Productions,Casual,Box Maze - Everyday People Skins Pack,Box Maze - Everyday People Skins Pack,http://store.steampowered.com/app/562790/Box_M...,2016-11-28,Strategy,http://steamcommunity.com/app/562790/reviews/?...,Stats,0.99,0.0,562790.0,GamersHype Productions,2016
113016,Petroglyph,Adventure,Mytheon,Mytheon,http://store.steampowered.com/app/413030/Mytheon/,2010-07-13,Multiplayer,http://steamcommunity.com/app/413030/reviews/?...,Co-op,9.99,0.0,413030.0,Petroglyph,2010


In [9]:
# Drop the fields that will not be used
#df_games_eliminarcampos = ['url', 'title','release_date', 'reviews_url', 'specs', ]
# df_games = df_games.drop(df_games_eliminarcampos, axis=1)

In [10]:
# Select the fields to use
df_games = df_games[['item_id', 'app_name', 'genres', 'year', 'price', 'developer']]
df_games.sample(5)

Unnamed: 0,item_id,app_name,genres,year,price,developer
116467,329480.0,Snow Light,Action,2014,9.99,West Dragon Productions DR
108626,307940.0,Radiation Island,Adventure,2016,2.99,Atypical Games
103398,349510.0,Hanako: Honor & Blade,Indie,2017,9.99,"+Mpact Games, LLC."
105036,654660.0,Pharmakon,Casual,2017,9.99,Visumeca Games
99804,713270.0,UnnyWorld - Founder's Pack,RPG,2017,9.99,Unnyhog


### Data type verification - Dataset GAMES

In [11]:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1983295 entries, 88310 to 120443
Data columns (total 6 columns):
 #   Column     Dtype  
---  ------     -----  
 0   item_id    float64
 1   app_name   object 
 2   genres     object 
 3   year       int32  
 4   price      object 
 5   developer  object 
dtypes: float64(1), int32(1), object(4)
memory usage: 98.4+ MB


In [12]:
# Data type conversion

df_games['price'] = pd.to_numeric(df_games['price'], errors='coerce')  # Convert to numeric, coercing errors (e.g. 'Free') to NaN
#df_games['early_access'] = df_games['early_access'].astype(bool)       # Convert to boolean
df_games['item_id'] = pd.to_numeric(df_games['item_id'], errors='coerce')        # Convert to numeric, coercing errors to NaN
df_games['item_id'] = df_games['item_id'].fillna(0)                              # Replace missing ids with 0 so the column can be cast
df_games['item_id'] = df_games['item_id'].astype(int)                            # Cast to integer
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1983295 entries, 88310 to 120443
Data columns (total 6 columns):
 #   Column     Dtype  
---  ------     -----  
 0   item_id    int64  
 1   app_name   object 
 2   genres     object 
 3   year       int32  
 4   price      float64
 5   developer  object 
dtypes: float64(1), int32(1), int64(1), object(3)
memory usage: 98.4+ MB


In [13]:
df_games.sample(5)

Unnamed: 0,item_id,app_name,genres,year,price,developer
102271,754630,Tanki X: Steam Pack,Massively Multiplayer,2017,11.99,AlternativaPlatform
103233,717190,Super Dungeon Master,RPG,2017,,Dave Gumble
118650,231200,Kentucky Route Zero,Adventure,2013,24.99,Cardboard Computer
120093,17530,D.I.P.R.I.P. Warm Up,Indie,2008,,EXOR Studios
110703,395180,Arma 3 Apex,Strategy,2016,34.99,Bohemia Interactive


### Duplicate value verification - Dataset GAMES

In [14]:
df_games.duplicated().sum()

1911358

In [15]:
df_games = df_games.drop_duplicates().reset_index(drop=True)
df_games = df_games.drop_duplicates(subset=['item_id', 'app_name'])     # Drop rows that repeat the same "item_id" and "app_name" pair
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29782 entries, 0 to 71935
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   item_id    29782 non-null  int64  
 1   app_name   29781 non-null  object 
 2   genres     28548 non-null  object 
 3   year       29782 non-null  int32  
 4   price      27210 non-null  float64
 5   developer  28532 non-null  object 
dtypes: float64(1), int32(1), int64(1), object(3)
memory usage: 1.5+ MB


### Null value verification - Dataset GAMES

In [16]:
df_games.isnull().sum()

item_id         0
app_name        1
genres       1234
year            0
price        2572
developer    1250
dtype: int64

It is hard to identify a pattern in these null values that would explain why they occur. Given the purposes this dataset serves, we drop the records that have null values in any of its main fields.

In [17]:
df_games = df_games.dropna(subset=['genres', 'app_name', 'price', 'developer'])

In [18]:
df_games.isnull().sum()

item_id      0
app_name     0
genres       0
year         0
price        0
developer    0
dtype: int64

## Dataset REVIEWS

### Feature engineering - Dataset REVIEWS

In [19]:
df_reviews = user_reviews
df_reviews.sample(2)

Unnamed: 0,user_id,user_url,reviews
22584,76561198073648533,http://steamcommunity.com/profiles/76561198073...,"[{'funny': '', 'posted': 'Posted December 4, 2..."
17241,76561198029043080,http://steamcommunity.com/profiles/76561198029...,"[{'funny': '', 'posted': 'Posted April 24, 201..."


In [20]:
# Explode the "reviews" field

df_reviews = user_reviews.explode('reviews')
df_reviews = pd.concat([df_reviews.drop(['reviews'], axis=1), df_reviews['reviews'].apply(pd.Series)], axis=1)
df_reviews.sample(5)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0
5172,76561198075491202,http://steamcommunity.com/profiles/76561198075...,2 people found this review funny,Posted May 3.,,231430,2 of 5 people (40%) found this review helpful,True,Build giant red army.Thousands of conscripts a...,
1814,76561198058473246,http://steamcommunity.com/profiles/76561198058...,,"Posted December 24, 2013.",,221100,No ratings yet,True,Best $30 i ever spent :),
5703,lozone,http://steamcommunity.com/id/lozone,,"Posted May 11, 2014.",,440,No ratings yet,True,"""Started wid a Gibus now we're here""Seriously....",
14883,ilikeflip,http://steamcommunity.com/id/ilikeflip,,"Posted May 3, 2014.","Last edited December 1, 2015.",730,0 of 1 people (0%) found this review helpful,True,Got bored of it but still a great game.,
15054,nelllllly,http://steamcommunity.com/id/nelllllly,,"Posted April 9, 2015.",,218620,No ratings yet,True,"Do You Like Buying DLCs,Fixing Drills,Buying M...",


In [21]:
# Add the "posted_date" and "year_review" fields

def extract_posted_date(posted_str):         # Extract the date from the "posted" field
    pattern = r'Posted (\w+ \d{1,2}, \d{4})' # Pattern observed in the data
    match = re.search(pattern, posted_str)
    if match:
        return match.group(1)
    else:
        return None

# Apply the function to extract the date from the "posted" field
df_reviews['posted_date'] = df_reviews['posted'].apply(lambda x: np.nan if pd.isna(x) else extract_posted_date(x))

df_reviews['posted_date'] = pd.to_datetime(df_reviews['posted_date'])
df_reviews['year_review'] = df_reviews['posted_date'].dt.year
df_reviews['year_review'] = df_reviews['year_review'].fillna(0)
df_reviews['year_review'] = df_reviews['year_review'].astype(int)

df_reviews.sample(2)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0,posted_date,year_review
3173,YouAreDedNottaBigSoupPrice,http://steamcommunity.com/id/YouAreDedNottaBig...,,"Posted December 27, 2014.",Last edited January 3.,300,No ratings yet,False,"It might be realistic, but the community is ut...",,2014-12-27,2014
23641,iTzTDawGGa,http://steamcommunity.com/id/iTzTDawGGa,,"Posted October 5, 2014.",,236370,1 of 2 people (50%) found this review helpful,True,fantastic it's really fun and is still in beta...,,2014-10-05,2014
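
The pattern used by `extract_posted_date` requires a four-digit year. Strings without one, such as the `Posted May 3.` value visible in an earlier sample, do not match, so those reviews end up with `year_review` equal to 0. The snippet below reproduces the function in compact form to illustrate both cases; the input strings are taken from the sample outputs.

```python
import re

pattern = r'Posted (\w+ \d{1,2}, \d{4})'


def extract_posted_date(posted_str):
    # Return the "Month day, year" part, or None when the year is missing
    match = re.search(pattern, posted_str)
    return match.group(1) if match else None


print(extract_posted_date('Posted December 24, 2013.'))  # December 24, 2013
print(extract_posted_date('Posted May 3.'))              # None
```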


In [22]:
# Sentiment analysis based on the "review" field (labels: 0 = negative, 1 = neutral, 2 = positive)

df_reviews['review'] = df_reviews['review'].astype(str)
df_reviews['polarity'] = df_reviews['review'].apply(lambda text: TextBlob(text).sentiment.polarity)
df_reviews['sentiment'] = pd.cut(df_reviews['polarity'], bins=[-float('inf'), -0.001, 0.0, float('inf')], labels=[0, 1, 2])
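
To make the label assignment explicit, the sketch below applies the same `pd.cut` bins to a few hand-picked polarity values (the values are illustrative, not taken from the dataset). Because the intervals are closed on the right, polarities in `(-0.001, 0.0]` fall in the middle bin, so very small negative values are labelled neutral along with exact zeros.

```python
import pandas as pd

# Illustrative polarity values: clearly negative, barely negative, zero, positive
polarity = pd.Series([-0.91, -0.0005, 0.0, 0.05])

# Same bins and labels as the notebook: 0 = negative, 1 = neutral, 2 = positive
sentiment = pd.cut(polarity, bins=[-float('inf'), -0.001, 0.0, float('inf')], labels=[0, 1, 2])
print(sentiment.tolist())  # [0, 1, 1, 2]
```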

In [23]:
df_reviews.sample(5)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0,posted_date,year_review,polarity,sentiment
1263,91221301,http://steamcommunity.com/id/91221301,,"Posted December 5, 2014.",,250340,No ratings yet,True,"wow, so buildated 100/10",,2014-12-05,2014,0.1,2
19207,76561198030535256,http://steamcommunity.com/profiles/76561198030...,,"Posted September 4, 2014.",,113200,No ratings yet,True,geate game,,2014-09-04,2014,-0.4,0
19285,76561198033266228,http://steamcommunity.com/profiles/76561198033...,,"Posted June 29, 2014.",,250180,No ratings yet,True,"Mother flipping Metal Slug m8, its the arcade ...",,2014-06-29,2014,0.155556,2
18403,hammyhamm,http://steamcommunity.com/id/hammyhamm,,"Posted February 3, 2014.","Last edited February 3, 2014.",222880,2 of 2 people (100%) found this review helpful,True,Having played Insurgency many years ago as a S...,,2014-02-03,2014,0.052736,2
5484,Joshua12345195,http://steamcommunity.com/id/Joshua12345195,,"Posted July 16, 2015.",,239350,No ratings yet,True,makes me want to die more than I already have,,2015-07-16,2015,0.5,2


In [24]:
# Select the fields to use
df_reviews = df_reviews[['item_id', 'user_id', 'recommend', 'year_review', 'polarity', 'sentiment']]
df_reviews.sample(5)

Unnamed: 0,item_id,user_id,recommend,year_review,polarity,sentiment
12281,238460,MrHamilton,True,2014,0.098182,2
11255,236390,captainerickub,True,2014,0.0,1
2223,212680,spacemonkey,True,2013,0.05,2
7396,296150,ohyeahpunk,False,2014,-0.91,0
6553,211420,76561198060029466,True,2014,-0.404167,0


### Data type verification - Dataset REVIEWS

In [25]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59333 entries, 0 to 25798
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   item_id      59305 non-null  object  
 1   user_id      59333 non-null  object  
 2   recommend    59305 non-null  object  
 3   year_review  59333 non-null  int64   
 4   polarity     59333 non-null  float64 
 5   sentiment    59333 non-null  category
dtypes: category(1), float64(1), int64(1), object(3)
memory usage: 2.8+ MB


In [26]:
# Data type conversion
df_reviews['item_id'] = pd.to_numeric(df_reviews['item_id'], errors='coerce')
df_reviews['recommend'] = df_reviews['recommend'].astype(bool)
df_reviews['sentiment'] = pd.to_numeric(df_reviews['sentiment'], errors='coerce')
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59333 entries, 0 to 25798
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   item_id      59305 non-null  float64
 1   user_id      59333 non-null  object 
 2   recommend    59333 non-null  bool   
 3   year_review  59333 non-null  int64  
 4   polarity     59333 non-null  float64
 5   sentiment    59333 non-null  int64  
dtypes: bool(1), float64(2), int64(2), object(1)
memory usage: 2.8+ MB
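
One detail of the conversion above: casting an object column to `bool` turns missing values into `True`, which is why `recommend` shows no nulls after the cast even though it had some before. The rows affected here are likely the same ones that lack an `item_id` and are dropped later, but the behaviour is worth knowing. The sketch below uses a made-up three-value series and shows pandas' nullable `boolean` dtype as one possible alternative.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
recommend = pd.Series([True, False, np.nan], dtype=object)

# Plain bool cast: the missing value becomes True
print(recommend.astype(bool).tolist())

# Nullable boolean dtype: the missing value stays missing
as_nullable = recommend.astype('boolean')
print(as_nullable.isna().tolist())
```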


### Duplicate value verification - Dataset REVIEWS

In [27]:
df_reviews.duplicated().sum()

874

In [28]:
# Drop duplicates
df_reviews = df_reviews.drop_duplicates().reset_index(drop=True)
df_reviews.duplicated().sum()

0

### Null value verification - Dataset REVIEWS

In [29]:
df_reviews.isnull().sum()

item_id        28
user_id         0
recommend       0
year_review     0
polarity        0
sentiment       0
dtype: int64

In [30]:
df_reviews = df_reviews.dropna(subset=['item_id'])
df_reviews.isnull().sum()

item_id        0
user_id        0
recommend      0
year_review    0
polarity       0
sentiment      0
dtype: int64

## Dataset USAGE

### Feature engineering - Dataset USAGE

In [31]:
df_usage = user_items
df_usage.sample(2)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
7882,ceiling_phantom,66,76561198020172496,http://steamcommunity.com/id/ceiling_phantom,"[{'item_id': '240', 'item_name': 'Counter-Stri..."
74840,76561198082397421,25,76561198082397421,http://steamcommunity.com/profiles/76561198082...,"[{'item_id': '240', 'item_name': 'Counter-Stri..."


In [32]:
# Explode the "items" field

df_usage = user_items.explode('items')

df_usage = df_usage.reset_index(drop=True)
def obtener_elemento(diccionario, clave_busqueda):
    if isinstance(diccionario, dict):
        return diccionario.get(clave_busqueda)
    else:
        return diccionario

# Each field is extracted separately to avoid excessive processing times
df_usage['item_id'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'item_id'))
df_usage['item_name'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'item_name'))
df_usage['playtime_forever'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'playtime_forever'))
df_usage['playtime_2weeks'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'playtime_2weeks'))

df_usage.sample(2)

Unnamed: 0,user_id,items_count,steam_id,user_url,items,item_id,item_name,playtime_forever,playtime_2weeks
2751248,76561198045212550,112,76561198045212550,http://steamcommunity.com/profiles/76561198045...,"{'item_id': '230410', 'item_name': 'Warframe',...",230410,Warframe,75.0,0.0
454609,76561198053738378,63,76561198053738378,http://steamcommunity.com/profiles/76561198053...,"{'item_id': '23530', 'item_name': 'Earth Defen...",23530,Earth Defense Force: Insect Armageddon,231.0,0.0


In [33]:
# Select the fields to use
df_usage = df_usage[['item_id', 'user_id', 'playtime_forever']]
df_usage.sample(2)

Unnamed: 0,item_id,user_id,playtime_forever
1498146,8190,Ruzzbug,940.0
1198600,427270,76561198064107236,64.0


### Data type verification - Dataset USAGE


In [34]:
df_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5170015 entries, 0 to 5170014
Data columns (total 3 columns):
 #   Column            Dtype  
---  ------            -----  
 0   item_id           object 
 1   user_id           object 
 2   playtime_forever  float64
dtypes: float64(1), object(2)
memory usage: 118.3+ MB


In [35]:
# Data type conversion
df_usage['item_id'] = pd.to_numeric(df_usage['item_id'], errors='coerce')
df_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5170015 entries, 0 to 5170014
Data columns (total 3 columns):
 #   Column            Dtype  
---  ------            -----  
 0   item_id           float64
 1   user_id           object 
 2   playtime_forever  float64
dtypes: float64(2), object(1)
memory usage: 118.3+ MB


### Duplicate value verification - Dataset USAGE

In [36]:
df_usage.duplicated().sum()

59209

In [37]:
# Drop duplicates
df_usage = df_usage.drop_duplicates().reset_index(drop=True)
df_usage.duplicated().sum()

0

### Null value verification - Dataset USAGE

In [38]:
df_usage.isnull().sum()

item_id             16714
user_id                 0
playtime_forever    16714
dtype: int64

In [39]:
df_usage = df_usage.dropna(subset=['item_id'])
df_usage.isnull().sum()

item_id             0
user_id             0
playtime_forever    0
dtype: int64

# Export of the clean datasets

In [40]:
df_games.to_csv('df_games.csv', index=False)
df_reviews.to_csv('df_reviews.csv', index=False)
df_usage.to_csv('df_usage.csv', index=False)

# Building and exporting dataframes for the API

## Samples

In [41]:
df_games.sample(2)

Unnamed: 0,item_id,app_name,genres,year,price,developer
19545,314610,Vincere Totus Astrum,Strategy,2017,3.99,Gamesare
6913,360420,Fantasy Grounds - D&D Monster Pack - Beasts,Indie,2015,4.99,"SmiteWorks USA, LLC"


In [42]:
df_reviews.sample(2)

Unnamed: 0,item_id,user_id,recommend,year_review,polarity,sentiment
32234,206420.0,76561198053674558,True,2015,0.041667,2
38833,391540.0,max_banana23,True,2015,0.0,1


In [43]:
df_usage.sample(2)

Unnamed: 0,item_id,user_id,playtime_forever
2726267,496920.0,gordan_freeman,8.0
48636,113400.0,76561198026258913,33.0


## Endpoint 1

def PlayTimeGenre(genre: str): Must return the year with the most hours played for the given genre.

Example return: {"Año de lanzamiento con más horas jugadas para Género X" : 2013}

In [44]:
# Build the dataframe for the function
df_e1 = pd.merge(df_games, df_usage, on="item_id", how="inner")
df_e1 = df_e1[['genres', 'year','playtime_forever']]

# Series holding, for each genre, the index of the row with the maximum playtime
df_e1_indmax = df_e1.groupby('genres')['playtime_forever'].idxmax()

# Use those indices to retrieve the corresponding years
df_e1 = df_e1.loc[df_e1_indmax, ['genres', 'year', 'playtime_forever']]

# Keep the year of the maximum playtime_forever row for each genre
df_e1 = df_e1[['genres', 'year']]
df_e1.to_csv('df_e1.csv', index=False)
df_e1

Unnamed: 0,genres,year
3160546,Action,2004
956218,Adventure,2014
921209,Animation &amp; Modeling,2013
1762950,Audio Production,2014
1599612,Casual,2014
2165742,Design &amp; Illustration,2012
1689391,Early Access,2014
947205,Education,2014
2133429,Free to Play,2013
27418,Indie,2006


In [45]:
# Test the dataframe for the function
df_func1 = df_e1[df_e1['genres'] == 'Action']
df_func1

Unnamed: 0,genres,year
3160546,Action,2004


In [46]:
# Define the function
def PlayTimeGenre(input_genre: str):
    try:
        df_e1 = pd.read_csv("df_e1.csv")                # Read the dataframe
        df_e1 = df_e1[df_e1["genres"] == input_genre]   # Filter by the input genre
        
        output_year = df_e1.loc[df_e1['year'].idxmax(), 'year']
    
        return {f"Año de lanzamiento con más horas jugadas para {input_genre}": output_year}
    except Exception as e:
        return {"error": str(e)}

In [47]:
# Test the function
input_genre = "Strategy"
print(PlayTimeGenre(input_genre))

{'Año de lanzamiento con más horas jugadas para Strategy': 2010}


## Endpoint 2

def UserForGenre( genero : str ): Must return the user who accumulates the most hours played for the given genre, together with a list of the hours played accumulated per year.

Example return: {"Usuario con más horas jugadas para Género X" : us213ndjss09sdf, "Horas jugadas":[{Año: 2013, Horas: 203}, {Año: 2012, Horas: 100}, {Año: 2011, Horas: 23}]}

In [48]:
df_e2 = pd.merge(df_games, df_usage, on="item_id", how="inner") # Join the datasets
df_e2 = df_e2[['genres', 'year', 'user_id','playtime_forever']] # Select fields

# Get the index of the row with the maximum playtime_forever for each genre
df_e2_indmax = df_e2.groupby('genres')['playtime_forever'].idxmax()
# Use those indices to retrieve the corresponding rows
# The release year is used as the year of the playtime
df_e2 = df_e2.loc[df_e2_indmax, ['genres', 'year', 'user_id', 'playtime_forever']]

In [49]:
df_e2_users = df_e2[['genres', 'user_id']]
df_e2_users.to_csv('df_e2_users.csv', index=False)
df_e2_users

Unnamed: 0,genres,user_id
3160546,Action,76561197977470391
956218,Adventure,DONTFUCKINGCLICKTHIS
921209,Animation &amp; Modeling,76561198059330972
1762950,Audio Production,Lickidactyl
1599612,Casual,76561198101480347
2165742,Design &amp; Illustration,76561198035718256
1689391,Early Access,76561198084846677
947205,Education,SeedyDog
2133429,Free to Play,Cow666
27418,Indie,wolop


In [50]:
df_e2_playtime = df_e2[['genres', 'year', 'user_id', 'playtime_forever']]
df_e2_playtime.to_csv('df_e2_playtime.csv', index=False)
df_e2_playtime

Unnamed: 0,genres,year,user_id,playtime_forever
3160546,Action,2004,76561197977470391,493791.0
956218,Adventure,2014,DONTFUCKINGCLICKTHIS,134223.0
921209,Animation &amp; Modeling,2013,76561198059330972,65427.0
1762950,Audio Production,2014,Lickidactyl,109916.0
1599612,Casual,2014,76561198101480347,74433.0
2165742,Design &amp; Illustration,2012,76561198035718256,102554.0
1689391,Early Access,2014,76561198084846677,1241.0
947205,Education,2014,SeedyDog,3082.0
2133429,Free to Play,2013,Cow666,32987.0
27418,Indie,2006,wolop,642773.0
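
The notebook does not define a function for this endpoint. As a hedged sketch, the helper below shows one way to answer it from a merged dataframe with the columns `genres`, `year`, `user_id` and `playtime_forever`: it sums playtime per user within the genre, picks the top user, and lists that user's totals per release year. The name `user_for_genre` and the toy rows are hypothetical, and the totals are reported in whatever unit `playtime_forever` uses.

```python
import pandas as pd


def user_for_genre(df, genre):
    """Hypothetical helper: top user by total playtime in a genre, plus yearly totals."""
    subset = df[df['genres'] == genre]
    if subset.empty:
        return {"error": f"no rows for genre {genre!r}"}
    # User with the largest accumulated playtime in this genre
    top_user = subset.groupby('user_id')['playtime_forever'].sum().idxmax()
    # That user's playtime summed per release year, most recent year first
    per_year = (subset[subset['user_id'] == top_user]
                .groupby('year')['playtime_forever'].sum()
                .sort_index(ascending=False))
    hours = [{"Año": int(y), "Horas": float(h)} for y, h in per_year.items()]
    return {f"Usuario con más horas jugadas para {genre}": top_user, "Horas jugadas": hours}


# Invented rows for illustration
toy = pd.DataFrame({
    'genres': ['Action', 'Action', 'Action', 'Action'],
    'year': [2012, 2013, 2013, 2012],
    'user_id': ['a', 'a', 'b', 'b'],
    'playtime_forever': [100.0, 203.0, 250.0, 10.0],
})
result = user_for_genre(toy, 'Action')
print(result)
```

Note that this sketch aggregates per user before choosing the maximum, whereas the tables above keep only the single row with the largest playtime for each genre.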


## Endpoint 3

def UsersRecommend( año : int ): Returns the top 3 games MOST recommended by users for the given year. (reviews.recommend = True and positive/neutral comments)

Example return: [{"Puesto 1" : X}, {"Puesto 2" : Y},{"Puesto 3" : Z}]

In [51]:
# Build the dataframe for the function
# Sentiment labels are 0 = negative, 1 = neutral, 2 = positive, so the ">= 0" condition keeps all three classes
df_e3 = df_reviews[(df_reviews['recommend'] == True) & (df_reviews['sentiment'] >= 0)].copy()
df_e3['year_review'] = pd.to_numeric(df_e3['year_review'], errors='coerce')
df_e3 = df_e3.groupby(['year_review', 'item_id'])['recommend'].count().reset_index()
df_e3.rename(columns={'recommend': 'recommend_count'}, inplace=True)
df_e3 = pd.merge(df_e3, df_games, on="item_id", how="left").sort_values(by='recommend_count', ascending=False)
df_e3 = df_e3.dropna(subset=['app_name'])
df_e3 = df_e3[['year_review','app_name', 'recommend_count']]
df_e3 = df_e3.loc[df_e3['year_review'] != 0]           # Discard reviews without a parsed year
df_e3.to_csv('df_e3.csv', index=False)
df_e3

Unnamed: 0,year_review,app_name,recommend_count
4320,2015,Counter-Strike: Global Offensive,1527
2853,2014,Counter-Strike: Global Offensive,1068
2887,2014,Garry's Mod,757
3822,2014,Rust,389
3494,2014,DayZ,383
...,...,...,...
2434,2013,Post Apocalyptic Mayhem,1
2436,2013,SpaceChem,1
2437,2013,Dinner Date,1
2438,2013,Jamestown,1


In [52]:
# Test the dataframe for the function
df_func3 = df_e3[df_e3['year_review'] == 2015].head(3)
df_func3

Unnamed: 0,year_review,app_name,recommend_count
4320,2015,Counter-Strike: Global Offensive,1527
4353,2015,Garry's Mod,357
5265,2015,Grand Theft Auto V,215


In [53]:
# Define the function
def UsersRecommend(input_year: int):
    try:
        df_e3 = pd.read_csv("df_e3.csv")
        output_top3 = df_e3[df_e3['year_review'] == input_year].head(3)   # Rows are already sorted by recommend_count
        # Build one {"Puesto N": game} dict per position
        output_top3_list = [{"Puesto {}".format(i+1): game} for i, game in enumerate(output_top3['app_name'])]
        return output_top3_list
    except Exception as e:
        return {"error": str(e)}

In [54]:
# Test the function
input_year = 2015
print(UsersRecommend(input_year))

[{'Puesto 1': 'Counter-Strike: Global Offensive'}, {'Puesto 2': "Garry's Mod"}, {'Puesto 3': 'Grand Theft Auto V'}]


## Endpoint 4

def UsersNotRecommend( año : int ): Returns the top 3 games LEAST recommended by users for the given year. (reviews.recommend = False and negative comments)

Example return: [{"Puesto 1" : X}, {"Puesto 2" : Y},{"Puesto 3" : Z}]

In [55]:
# Build the dataframe for the function
#df_e4 = df_e3 
#df_e4.to_csv('df_e4.csv', index=False)
#df_e4

In [56]:
# Build the dataframe for the function
# Keep reviews that do not recommend the game and carry a negative sentiment (label 0)
df_e4 = df_reviews[(df_reviews['recommend'] == False) & (df_reviews['sentiment'] == 0)].copy()
df_e4['year_review'] = pd.to_numeric(df_e4['year_review'], errors='coerce')
df_e4 = df_e4.groupby(['year_review', 'item_id'])['recommend'].count().reset_index()
df_e4.rename(columns={'recommend': 'recommend_count'}, inplace=True)
df_e4 = pd.merge(df_e4, df_games, on="item_id", how="left").sort_values(by='recommend_count', ascending=False)
df_e4 = df_e4.dropna(subset=['app_name'])
df_e4 = df_e4[['year_review','app_name', 'recommend_count']]
df_e4 = df_e4.loc[df_e4['year_review'] != 0]           # Discard reviews without a parsed year
df_e4.to_csv('df_e4.csv', index=False)
df_e4

Unnamed: 0,year_review,app_name,recommend_count
978,2015,Counter-Strike: Global Offensive,60
1125,2015,DayZ,43
740,2014,DayZ,34
1219,2015,Rust,26
539,2014,Counter-Strike: Global Offensive,21
...,...,...,...
549,2014,"Star Wars: Battlefront 2 (Classic, 2005)",1
548,2014,S.T.A.L.K.E.R.: Shadow of Chernobyl,1
544,2014,Alpha Prime,1
542,2014,The Ship: Murder Party,1


In [57]:
# Test the dataframe for the function
df_func4 = df_e4[df_e4['year_review'] == 2015].head(3)
df_func4

Unnamed: 0,year_review,app_name,recommend_count
978,2015,Counter-Strike: Global Offensive,60
1125,2015,DayZ,43
1219,2015,Rust,26


In [58]:
# Define the function
def UsersNotRecommend(input_year: int):
    try:
        df_e4 = pd.read_csv("df_e4.csv")
        output_last3 = df_e4[df_e4['year_review'] == input_year].head(3)  # Rows are already sorted by recommend_count
        # Build one {"Puesto N": game} dict per position
        output_last3_list = [{"Puesto {}".format(i+1): game} for i, game in enumerate(output_last3['app_name'])]
        return output_last3_list
    except Exception as e:
        return {"error": str(e)}

In [59]:
# Test the function
input_year = 2015
print(UsersNotRecommend(input_year))

[{'Puesto 1': 'Counter-Strike: Global Offensive'}, {'Puesto 2': 'DayZ'}, {'Puesto 3': 'Rust'}]


## Endpoint 5

def sentiment_analysis( año : int ): Given the release year, returns a list with the number of user review records that fall under each sentiment analysis category.

Example return: {Negative = 182, Neutral = 120, Positive = 278}

In [60]:
# Build the dataframe for the function
df_e5 = df_reviews[df_reviews['year_review'] != 0].copy()
# Create the 'negative', 'neutral' and 'positive' columns (sentiment labels: 0 = negative, 1 = neutral, 2 = positive)
df_e5['negative'] = (df_e5['sentiment'] == 0).astype(int)
df_e5['neutral'] = (df_e5['sentiment'] == 1).astype(int)
df_e5['positive'] = (df_e5['sentiment'] == 2).astype(int)

# Results are not requested per item_id, so year_review is used as the year, since what is requested is the sum of the sentiments
# Group by year_review and sum the counts
df_e5 = df_e5.groupby('year_review')[['negative', 'neutral', 'positive']].sum()
df_e5 = df_e5.rename_axis('year').reset_index()

df_e5.to_csv('df_e5.csv', index=False)
df_e5

Unnamed: 0,year,negative,neutral,positive
0,2010,11,8,47
1,2011,94,70,366
2,2012,206,215,780
3,2013,1230,1362,4121
4,2014,4389,4736,12709
5,2015,4149,4257,9748


In [61]:
# Test the dataframe for the function
year = 2015
df_func5 = df_e5[df_e5['year'] == year]
df_func5

Unnamed: 0,year,negative,neutral,positive
5,2015,4149,4257,9748


In [62]:
# Define the function
def SentimentAnalysis(input_year: int):
    try:
        df_e5 = pd.read_csv("df_e5.csv")
        row = df_e5[df_e5['year'] == input_year].iloc[0]   # Row for the requested year

        value_negative = row['negative']
        value_neutral = row['neutral']
        value_positive = row['positive']

        output_sentiment_list = f"Para el año {input_year} se registran los siguientes valores: negative: {value_negative}, neutral: {value_neutral}, positive: {value_positive}"
        return output_sentiment_list
    except Exception as e:
        return {"error": str(e)}

In [63]:
# Test the function
input_year = 2015
print(SentimentAnalysis(input_year))

Para el año 2015 se registran los siguientes valores: negative: 4149, neutral: 4257, positive: 9748
