# Introduction, objectives, and contents

This notebook covers the ETL (Extraction, Transformation and Loading) phase. Its goal is to produce clean datasets that are ready to be used in later phases of the project.

Contents:
* Library imports
* Data loading
* Data preparation for each dataset
    * Feature engineering
    * Data type verification
    * Duplicate values
    * Null values
* Export of the clean datasets
* Building and exporting dataframes for the API

# Library imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt
from math import factorial
from scipy import stats as st
import json
import gzip
import ast
from pandas import json_normalize
from textblob import TextBlob
import re

# Data loading

We have a total of 3 datasets in compressed JSON format, so we load each one separately in order to handle its particularities.

Dataset GAMES: This file could be loaded as uncompressed JSON, so the code is simple.

In [2]:
steam_games = pd.read_json('steam_games.json', lines=True)

Dataset REVIEWS: Because this dataset has a less standardized structure, it had to be loaded with code that parses each line of the file individually. The parsed lines are collected in a list, which is then converted into a dataframe.

In [3]:
dataset_list_reviews = []
with gzip.open('user_reviews.json.gz', 'rb') as file:   # The "with" block closes the file automatically
    for line in file:
        dataset_list_reviews.append(ast.literal_eval(line.decode('utf-8')))   # Parse each line as a Python literal
user_reviews = pd.DataFrame(dataset_list_reviews)

Dataset ITEMS: This dataset has the same less standardized structure, so it is loaded in the same way: each line is parsed individually, collected in a list, and the list is converted into a dataframe.

In [4]:
dataset_list_items = []
with gzip.open('users_items.json.gz', 'rb') as file:   # The "with" block closes the file automatically
    for line in file:
        dataset_list_items.append(ast.literal_eval(line.decode('utf-8')))   # Parse each line as a Python literal
user_items = pd.DataFrame(dataset_list_items)
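
The line-by-line loading used for REVIEWS and ITEMS can be wrapped in a small helper. The following is only a sketch: `load_literal_lines`, `demo_path`, and the two demo records are hypothetical names and made-up content, not part of the original notebook or of the real datasets.

```python
import ast
import gzip
import os
import tempfile


def load_literal_lines(path):
    """Parse a gzip file that holds one Python-literal dict per line."""
    records = []
    # Text mode ('rt') decodes each line, so no manual .decode() is needed.
    with gzip.open(path, 'rt', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:  # skip blank lines
                records.append(ast.literal_eval(line))
    return records


# Tiny demonstration file with made-up content (not the real dataset).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as file:
    file.write("{'user_id': 'u1', 'items': [{'item_id': '10'}]}\n")
    file.write("{'user_id': 'u2', 'items': []}\n")

demo_records = load_literal_lines(demo_path)
print(len(demo_records), demo_records[0]['user_id'])
```

The resulting list can be passed to `pd.DataFrame(...)` exactly as in the cells above.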

# Data preparation

## Dataset GAMES

### Feature engineering - Dataset GAMES

In [5]:
df_games = steam_games
df_games.sample(2)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
68247,,,,,,,,,,,,,
35446,,,,,,,,,,,,,


In [6]:
# Rename the "id" field
df_games.rename(columns={'id': 'item_id'}, inplace=True)

In [7]:
# Explode the fields whose values are lists
df_games = df_games.explode('genres')
df_games = df_games.explode('tags')
df_games = df_games.explode('specs')
df_games.sample(5)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer
99655,ArcaneRaise,Casual,Nixon (Character for Occult RERaise),Nixon (Character for Occult RERaise),http://store.steampowered.com/app/698358/Nixon...,2017-09-05,Casual,http://steamcommunity.com/app/698358/reviews/?...,Single-player,2.99,0.0,698358.0,Arcane Raise
110734,"Psyonix, Inc.",Action,"Rocket League: Official Game Soundtrack, Vol. 2","Rocket League: Official Game Soundtrack, Vol. 2",http://store.steampowered.com/app/457161/Rocke...,2016-07-08,Racing,http://steamcommunity.com/app/457161/reviews/?...,Downloadable Content,4.99,0.0,457161.0,"Psyonix, Inc."
110932,Magic Pixel Kft.,Casual,Zaccaria Pinball - Zankor Table,Zaccaria Pinball - Zankor Table,http://store.steampowered.com/app/478047/Zacca...,2016-06-16,Casual,http://steamcommunity.com/app/478047/reviews/?...,Steam Cloud,1.99,0.0,478047.0,Magic Pixel Kft.
118703,SEGA,Racing,Sonic & All-Stars Racing Transformed,Sonic &amp; All-Stars Racing Transformed,http://store.steampowered.com/app/212480/Sonic...,2013-01-31,Action,http://steamcommunity.com/app/212480/reviews/?...,Multi-player,19.99,0.0,212480.0,Sumo Digital
102306,Fishing Planet LLC,Massively Multiplayer,Sport Topwater Night Pack,Sport Topwater Night Pack,http://store.steampowered.com/app/591983/Sport...,2017-11-17,Massively Multiplayer,http://steamcommunity.com/app/591983/reviews/?...,Cross-Platform Multiplayer,4.99,0.0,591983.0,Fishing Planet LLC
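
Exploding several list columns in sequence multiplies the rows: each game ends up with one row per combination of genre, tag, and spec, which is consistent with the large number of duplicates found later. A minimal sketch on made-up data (`demo`, `exploded`, and `per_genre` are hypothetical names) shows the effect and one possible way to avoid it when only `genres` is needed:

```python
import pandas as pd

# One made-up game with 2 genres and 3 tags.
demo = pd.DataFrame({
    'item_id': [1],
    'genres': [['Action', 'Indie']],
    'tags': [['Action', 'Racing', 'Casual']],
})

# Sequential explodes produce every genre/tag combination.
exploded = demo.explode('genres').explode('tags')
print(len(exploded))  # 2 genres x 3 tags = 6 rows

# Selecting the needed columns first keeps one row per genre.
per_genre = demo[['item_id', 'genres']].explode('genres')
print(len(per_genre))  # 2 rows
```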


In [8]:
# Add the "year" field
default_date = pd.to_datetime('1900-01-01')  # Sentinel value imputed in place of invalid 'release_date' values
df_games['release_date'] = pd.to_datetime(df_games['release_date'], errors='coerce').fillna(default_date)   # Convert to datetime, replacing invalid values with the sentinel

df_games['year'] = df_games['release_date'].dt.year
df_games = df_games[df_games['year'] != 1900]   # Drop the rows that received the sentinel date
df_games.sample(5)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer,year
93035,,Utilities,ePic Character Generator - Season #2: Male Sci-fi,ePic Character Generator - Season #2: Male Sci-fi,http://store.steampowered.com/app/409257/ePic_...,2015-11-05,Photo Editing,http://steamcommunity.com/app/409257/reviews/?...,Single-player,5.99,0.0,409257.0,Overhead Games,2015
117628,Klei Entertainment,Indie,Don't Starve: Reign of Giants,Don't Starve: Reign of Giants,http://store.steampowered.com/app/282470/Dont_...,2014-04-30,Difficult,http://steamcommunity.com/app/282470/reviews/?...,Steam Cloud,4.99,0.0,282470.0,Klei Entertainment,2014
104153,"Beijing New Era Network Technology Co., Ltd.",RPG,红石遗迹 - Red Obsidian Remnant,红石遗迹 - Red Obsidian Remnant,http://store.steampowered.com/app/610960/__Red...,2017-09-04,Anime,http://steamcommunity.com/app/610960/reviews/?...,Steam Cloud,8.99,0.0,610960.0,Red Obsidian Studio,2017
92536,Trazzy Entertainment,Massively Multiplayer,Wind of Luck: Arena - Mediterranean Captain pack,Wind of Luck: Arena - Mediterranean Captain pack,http://store.steampowered.com/app/384340/Wind_...,2015-08-20,Massively Multiplayer,http://steamcommunity.com/app/384340/reviews/?...,Co-op,,0.0,384340.0,Trazzy Entertainment,2015
117247,Krillbite Studio,Indie,Among the Sleep - Enhanced Edition,Among the Sleep - Enhanced Edition,http://store.steampowered.com/app/250620/Among...,2017-11-02,Short,http://steamcommunity.com/app/250620/reviews/?...,Steam Cloud,16.99,0.0,250620.0,Krillbite Studio,2017
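
The same result can be sketched without a sentinel date by dropping the rows whose date could not be parsed. This is only an illustration on made-up values; `demo` is a hypothetical dataframe, not the real `df_games`.

```python
import pandas as pd

# Made-up release dates: one valid, one unparsable, one missing.
demo = pd.DataFrame({'release_date': ['2017-09-05', 'Soon..', None]})

# Unparsable values become NaT instead of raising an error.
demo['release_date'] = pd.to_datetime(demo['release_date'], errors='coerce')

# Drop the rows without a valid date, then derive the year.
demo = demo.dropna(subset=['release_date']).copy()
demo['year'] = demo['release_date'].dt.year
print(demo['year'].tolist())
```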


In [9]:
# Drop the fields that will not be used
#df_games_eliminarcampos = ['url', 'title','release_date', 'reviews_url', 'specs', ]
# df_games = df_games.drop(df_games_eliminarcampos, axis=1)

In [10]:
# Select the fields to use
df_games = df_games[['item_id', 'app_name', 'genres', 'year', 'price', 'developer']]
df_games.sample(5)

Unnamed: 0,item_id,app_name,genres,year,price,developer
100967,298610.0,Ylands,Adventure,2017,15.0,Bohemia Interactive
93627,429200.0,Super Helmets on Fire DX Ultra Edition Plus Alpha,Action,2016,0.99,Ripper Games
98482,236110.0,Dungeon Defenders II,Strategy,2017,Free to Play,Trendy Entertainment
107175,397060.0,Faeria,Massively Multiplayer,2017,Free to Play,Abrakam SA
117038,255070.0,Abyss Odyssey,Adventure,2014,14.99,ACE Team


### Data type verification - Dataset GAMES

In [11]:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1983295 entries, 88310 to 120443
Data columns (total 6 columns):
 #   Column     Dtype  
---  ------     -----  
 0   item_id    float64
 1   app_name   object 
 2   genres     object 
 3   year       int32  
 4   price      object 
 5   developer  object 
dtypes: float64(1), int32(1), object(4)
memory usage: 98.4+ MB


In [12]:
# Data type conversion

df_games['price'] = pd.to_numeric(df_games['price'], errors='coerce')      # Convert to numeric, coercing invalid values (e.g. "Free to Play") to NaN
#df_games['early_access'] = df_games['early_access'].astype(bool)          # Convert to boolean
df_games['item_id'] = pd.to_numeric(df_games['item_id'], errors='coerce')  # Convert to numeric, coercing invalid values to NaN
df_games['item_id'] = df_games['item_id'].fillna(0).astype(int)            # Fill missing ids with 0, then cast to integer
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1983295 entries, 88310 to 120443
Data columns (total 6 columns):
 #   Column     Dtype  
---  ------     -----  
 0   item_id    int64  
 1   app_name   object 
 2   genres     object 
 3   year       int32  
 4   price      float64
 5   developer  object 
dtypes: float64(1), int32(1), int64(1), object(3)
memory usage: 98.4+ MB


In [13]:
df_games.sample(5)

Unnamed: 0,item_id,app_name,genres,year,price,developer
94820,462990,Tomoyo After ~It's a Wonderful Life~ English E...,Adventure,2016,19.99,VisualArts/Key
116737,307170,Borealis,Indie,2014,4.99,Conrad Nelson
118470,228260,Fallen Enchantress: Legendary Heroes,RPG,2013,24.99,Stardock Entertainment
114386,372150,Yasai Ninja,Indie,2015,1.99,Recotechnology S.L.
93680,262120,Toy Soldiers: Complete,Indie,2016,14.99,"Signal Studios,Krome Studios"


### Duplicate value verification - Dataset GAMES

In [14]:
df_games.duplicated().sum()

1911358

In [15]:
df_games = df_games.drop_duplicates().reset_index(drop=True)
df_games = df_games.drop_duplicates(subset=['item_id', 'app_name'])     # Keep only the first row for each ("item_id", "app_name") pair; because "genres" was exploded, each game retains a single genre
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29782 entries, 0 to 71935
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   item_id    29782 non-null  int64  
 1   app_name   29781 non-null  object 
 2   genres     28548 non-null  object 
 3   year       29782 non-null  int32  
 4   price      27210 non-null  float64
 5   developer  28532 non-null  object 
dtypes: float64(1), int32(1), int64(1), object(3)
memory usage: 1.5+ MB


### Null value verification - Dataset GAMES

In [16]:
df_games.isnull().sum()

item_id         0
app_name        1
genres       1234
year            0
price        2572
developer    1250
dtype: int64

In this setting it is hard to identify a pattern in the null values that would explain why they occur. Given the purposes this dataset serves, we drop the records whose main fields contain null values.

In [17]:
df_games = df_games.dropna(subset=['genres', 'app_name', 'price', 'developer'])

In [18]:
df_games.isnull().sum()

item_id      0
app_name     0
genres       0
year         0
price        0
developer    0
dtype: int64

## Dataset REVIEWS

### Feature engineering - Dataset REVIEWS

In [19]:
df_reviews = user_reviews
df_reviews.sample(2)

Unnamed: 0,user_id,user_url,reviews
4945,76561198054691172,http://steamcommunity.com/profiles/76561198054...,"[{'funny': '', 'posted': 'Posted May 1, 2014.'..."
6525,76561198054200662,http://steamcommunity.com/profiles/76561198054...,"[{'funny': '', 'posted': 'Posted July 13, 2014..."


In [20]:
# Explode the "reviews" field

df_reviews = user_reviews.explode('reviews')
df_reviews = pd.concat([df_reviews.drop(['reviews'], axis=1), df_reviews['reviews'].apply(pd.Series)], axis=1)
df_reviews.sample(5)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0
21889,keaty100,http://steamcommunity.com/id/keaty100,,"Posted June 27, 2014.",,244850,No ratings yet,True,It's an early access game so I'll give it that...,
23717,76561198085183415,http://steamcommunity.com/profiles/76561198085...,,"Posted December 23, 2015.",,730,No ratings yet,False,bem loko,
4895,76561198020361485,http://steamcommunity.com/profiles/76561198020...,,"Posted December 23, 2014.",,233130,1 of 1 people (100%) found this review helpful,True,I saw bunnies bumping naughtlySlice up said bu...,
23021,Wortzong,http://steamcommunity.com/id/Wortzong,,"Posted April 21, 2015.",,304030,1 of 1 people (100%) found this review helpful,False,ONe of the biggest P2W games ive ever seen! yo...,
17567,OfficialvnmZz,http://steamcommunity.com/id/OfficialvnmZz,1 person found this review funny,"Posted November 24, 2015.",,730,2 of 2 people (100%) found this review helpful,True,"+rep, ez skins ez life",


In [21]:
# Add the "posted_date" and "year_review" fields

def extract_posted_date(posted_str):         # Extract the date from the "posted" field
    pattern = r'Posted (\w+ \d{1,2}, \d{4})' # Pattern observed in the data, e.g. "Posted May 1, 2014."
    match = re.search(pattern, posted_str)
    if match:
        return match.group(1)
    else:
        return None

# Apply the function to extract the date from the "posted" field
df_reviews['posted_date'] = df_reviews['posted'].apply(lambda x: np.nan if pd.isna(x) else extract_posted_date(x))

df_reviews['posted_date'] = pd.to_datetime(df_reviews['posted_date'])
df_reviews['year_review'] = df_reviews['posted_date'].dt.year
df_reviews['year_review'] = df_reviews['year_review'].fillna(0)     # Reviews without a parsable date get year 0
df_reviews['year_review'] = df_reviews['year_review'].astype(int)

df_reviews.sample(2)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0,posted_date,year_review
12508,76561198083594288,http://steamcommunity.com/profiles/76561198083...,,"Posted December 9, 2014.",,294160,No ratings yet,True,A beautiful and addicting simple game where yo...,,2014-12-09,2014
526,jonjon4351,http://steamcommunity.com/id/jonjon4351,1 person found this review funny,"Posted December 7, 2013.",,220240,No ratings yet,True,Awesome game. It is really fun! surviving the ...,,2013-12-07,2013
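
The date extraction can be checked in isolation with the standard library. In this sketch, `POSTED_PATTERN` and `parse_posted` are hypothetical names, and the string without a year is a made-up example of an input that does not match the pattern; month names are parsed assuming an English locale.

```python
import re
from datetime import datetime

# Same pattern as in the cell above.
POSTED_PATTERN = re.compile(r'Posted (\w+ \d{1,2}, \d{4})')


def parse_posted(posted_str):
    """Return a datetime for 'Posted <Month> <day>, <year>.' or None."""
    if not isinstance(posted_str, str):
        return None  # missing values (None, NaN) are not parsed
    match = POSTED_PATTERN.search(posted_str)
    if match is None:
        return None  # no full date (with year) found
    return datetime.strptime(match.group(1), '%B %d, %Y')


print(parse_posted('Posted May 1, 2014.'))
print(parse_posted('Posted May 1.'))
```

Strings that do not match return `None`, which corresponds to the rows that end up with `year_review` equal to 0 in the dataframe.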


In [22]:
# Sentiment analysis based on the "review" field

df_reviews['review'] = df_reviews['review'].astype(str)
df_reviews['polarity'] = df_reviews['review'].apply(lambda text: TextBlob(text).sentiment.polarity)
df_reviews['sentiment'] = pd.cut(df_reviews['polarity'], bins=[-float('inf'), -0.001, 0.0, float('inf')], labels=[0, 1, 2])
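
The binning above maps polarity to three labels: 0 (negative), 1 (neutral), and 2 (positive). A minimal sketch with made-up polarity values shows where the bin edges fall; `polarity` and `sentiment` here are standalone demonstration variables, not the dataframe columns.

```python
import pandas as pd

# Made-up polarity values around the bin edges.
polarity = pd.Series([-0.4, -0.0005, 0.0, 0.15])

# Bins are right-inclusive: (-inf, -0.001], (-0.001, 0.0], (0.0, inf)
sentiment = pd.cut(
    polarity,
    bins=[-float('inf'), -0.001, 0.0, float('inf')],
    labels=[0, 1, 2],
)
print(sentiment.tolist())  # [0, 1, 1, 2]
```

Note that very small negative polarities (between -0.001 and 0) are labeled neutral under these bins.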

In [23]:
df_reviews.sample(5)

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0,posted_date,year_review,polarity,sentiment
20718,atacca,http://steamcommunity.com/id/atacca,,"Posted June 27, 2014.",,97000,No ratings yet,True,A good good game. It can be a little repetitiv...,,2014-06-27,2014,0.17225,2
25406,76561198128220819,http://steamcommunity.com/profiles/76561198128...,,"Posted October 2, 2015.",,49520,No ratings yet,True,i r8 8/8 m8's why? easy something even better ...,,2015-10-02,2015,0.544444,2
1392,76561198120937791,http://steamcommunity.com/profiles/76561198120...,,"Posted September 30, 2014.",,304930,No ratings yet,True,Perfect game but it is hard to start,,2014-09-30,2014,0.102778,2
8363,swag0swagger,http://steamcommunity.com/id/swag0swagger,,"Posted December 29, 2013.",,221100,1 of 2 people (50%) found this review helpful,True,Best Early Access Game around. If you liked th...,,2013-12-29,2013,0.333333,2
9525,KentuckyFriedSpy,http://steamcommunity.com/id/KentuckyFriedSpy,,"Posted December 21, 2015.",,428880,7 of 8 people (88%) found this review helpful,True,"Great game, played the original on iOS for hou...",,2015-12-21,2015,0.217969,2


In [24]:
# Select the fields to use
df_reviews = df_reviews[['item_id', 'user_id', 'recommend', 'year_review', 'polarity', 'sentiment']]
df_reviews.sample(5)

Unnamed: 0,item_id,user_id,recommend,year_review,polarity,sentiment
3582,346110,Priv4tejeej,True,2015,0.15,2
21173,730,76561198061388888,True,2015,1.0,2
24147,72850,URMOTHERISASPY,True,2014,0.2,2
2828,236390,ArchangelItherael,True,2015,0.04,2
184,440,redsncrimson,True,2012,0.170707,2


### Data type verification - Dataset REVIEWS

In [25]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59333 entries, 0 to 25798
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   item_id      59305 non-null  object  
 1   user_id      59333 non-null  object  
 2   recommend    59305 non-null  object  
 3   year_review  59333 non-null  int64   
 4   polarity     59333 non-null  float64 
 5   sentiment    59333 non-null  category
dtypes: category(1), float64(1), int64(1), object(3)
memory usage: 2.8+ MB


In [26]:
# Data type conversion
df_reviews['item_id'] = pd.to_numeric(df_reviews['item_id'], errors='coerce')
df_reviews['recommend'] = df_reviews['recommend'].astype(bool)
df_reviews['sentiment'] = pd.to_numeric(df_reviews['sentiment'], errors='coerce')
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59333 entries, 0 to 25798
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   item_id      59305 non-null  float64
 1   user_id      59333 non-null  object 
 2   recommend    59333 non-null  bool   
 3   year_review  59333 non-null  int64  
 4   polarity     59333 non-null  float64
 5   sentiment    59333 non-null  int64  
dtypes: bool(1), float64(2), int64(2), object(1)
memory usage: 2.8+ MB
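
The non-null count of `recommend` rises from 59305 to 59333 after the conversion because `astype(bool)` turns missing values into `True`. The sketch below reproduces this on a made-up series (`recommend` here is a hypothetical standalone variable); in this notebook those rows are later removed when the null `item_id` values are dropped.

```python
import numpy as np
import pandas as pd

# Made-up object column with one missing value.
recommend = pd.Series([True, False, np.nan], dtype=object)

# NaN is truthy, so the missing value becomes True.
as_bool = recommend.astype(bool)
print(as_bool.tolist())  # [True, False, True]

# Counting missing values before casting makes the effect visible.
print(int(recommend.isna().sum()))  # 1
```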


### Duplicate value verification - Dataset REVIEWS

In [27]:
df_reviews.duplicated().sum()

874

In [28]:
# Drop duplicates
df_reviews = df_reviews.drop_duplicates().reset_index(drop=True)
df_reviews.duplicated().sum()

0

### Null value verification - Dataset REVIEWS

In [29]:
df_reviews.isnull().sum()

item_id        28
user_id         0
recommend       0
year_review     0
polarity        0
sentiment       0
dtype: int64

In [30]:
df_reviews = df_reviews.dropna(subset=['item_id'])
df_reviews.isnull().sum()

item_id        0
user_id        0
recommend      0
year_review    0
polarity       0
sentiment      0
dtype: int64

## Dataset USAGE

### Feature engineering - Dataset USAGE

In [31]:
df_usage = user_items
df_usage.sample(2)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
5740,76561198075694490,38,76561198075694490,http://steamcommunity.com/profiles/76561198075...,"[{'item_id': '24960', 'item_name': 'Battlefiel..."
87650,76561198172394339,72,76561198172394339,http://steamcommunity.com/profiles/76561198172...,"[{'item_id': '4000', 'item_name': 'Garry's Mod..."


In [32]:
# Explode the "items" field

df_usage = user_items.explode('items')

df_usage = df_usage.reset_index(drop=True)
def obtener_elemento(diccionario, clave_busqueda):
    if isinstance(diccionario, dict):
        return diccionario.get(clave_busqueda)
    else:
        return diccionario

# Extract each field separately to avoid excessive processing times
df_usage['item_id'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'item_id'))
df_usage['item_name'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'item_name'))
df_usage['playtime_forever'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'playtime_forever'))
df_usage['playtime_2weeks'] = df_usage['items'].apply(lambda x: obtener_elemento(x, 'playtime_2weeks'))

df_usage.sample(2)

Unnamed: 0,user_id,items_count,steam_id,user_url,items,item_id,item_name,playtime_forever,playtime_2weeks
1974526,76561198056986052,25,76561198056986052,http://steamcommunity.com/profiles/76561198056...,"{'item_id': '246280', 'item_name': 'Happy Wars...",246280,Happy Wars,0.0,0.0
5124637,76561198110965836,108,76561198110965836,http://steamcommunity.com/profiles/76561198110...,"{'item_id': '211600', 'item_name': 'Thief Gold...",211600,Thief Gold,0.0,0.0


In [33]:
# Select the fields to use
df_usage = df_usage[['item_id', 'user_id', 'playtime_forever']]
df_usage.sample(2)

Unnamed: 0,item_id,user_id,playtime_forever
4920747,297020,arthurjudok,0.0
3957755,239140,76561197997821630,6050.0


### Data type verification - Dataset USAGE


In [34]:
df_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5170015 entries, 0 to 5170014
Data columns (total 3 columns):
 #   Column            Dtype  
---  ------            -----  
 0   item_id           object 
 1   user_id           object 
 2   playtime_forever  float64
dtypes: float64(1), object(2)
memory usage: 118.3+ MB


In [35]:
# Data type conversion
df_usage['item_id'] = pd.to_numeric(df_usage['item_id'], errors='coerce')
df_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5170015 entries, 0 to 5170014
Data columns (total 3 columns):
 #   Column            Dtype  
---  ------            -----  
 0   item_id           float64
 1   user_id           object 
 2   playtime_forever  float64
dtypes: float64(2), object(1)
memory usage: 118.3+ MB


### Duplicate value verification - Dataset USAGE

In [36]:
df_usage.duplicated().sum()

59209

In [37]:
# Drop duplicates
df_usage = df_usage.drop_duplicates().reset_index(drop=True)
df_usage.duplicated().sum()

0

### Null value verification - Dataset USAGE

In [38]:
df_usage.isnull().sum()

item_id             16714
user_id                 0
playtime_forever    16714
dtype: int64

In [39]:
df_usage = df_usage.dropna(subset=['item_id'])
df_usage.isnull().sum()

item_id             0
user_id             0
playtime_forever    0
dtype: int64

# Export of the clean datasets

In [40]:
df_games.to_csv('df_games.csv', index=False)
df_reviews.to_csv('df_reviews.csv', index=False)
df_usage.to_csv('df_usage.csv', index=False)

# Building and exporting dataframes for the API

## Samples

In [41]:
df_games.sample(2)

Unnamed: 0,item_id,app_name,genres,year,price,developer
206,12140,Max Payne,Action,2001,9.99,Remedy Entertainment
25368,698880,Slash/Dots.,Action,2017,1.99,NASPAPA GAMES


In [42]:
df_reviews.sample(2)

Unnamed: 0,item_id,user_id,recommend,year_review,polarity,sentiment
25218,221100.0,TheBetacrowisawesome,False,2015,-0.108333,0
29804,311210.0,n4afusionz,False,0,-0.4,0


In [43]:
df_usage.sample(2)

Unnamed: 0,item_id,user_id,playtime_forever
3099509,400.0,Obiwanjezz,185.0
1555815,431240.0,SneakyPete9,87.0


## Endpoint 1

def PlayTimeGenre(genre: str): Must return the release year with the most hours played for the given genre.

Example return value: {"Año de lanzamiento con más horas jugadas para Género X" : 2013}

In [44]:
# Build the dataframe for the function
df_e1 = pd.merge(df_games, df_usage, on="item_id", how="inner")
df_e1 = df_e1[['genres', 'year','playtime_forever']]

# Series holding, for each genre, the index of the row with the highest playtime_forever
df_e1_indmax = df_e1.groupby('genres')['playtime_forever'].idxmax() 

# Use those indices to look up the corresponding release years
df_e1 = df_e1.loc[df_e1_indmax, ['genres', 'year', 'playtime_forever']] 

# Keep the release year of the maximum-playtime row for each genre
df_e1 = df_e1[['genres', 'year']]
df_e1.to_csv('df_e1.csv', index=False)
df_e1

Unnamed: 0,genres,year
3160546,Action,2004
956218,Adventure,2014
921209,Animation &amp; Modeling,2013
1762950,Audio Production,2014
1599612,Casual,2014
2165742,Design &amp; Illustration,2012
1689391,Early Access,2014
947205,Education,2014
2133429,Free to Play,2013
27418,Indie,2006


In [45]:
# Test the dataframe for the function
df_func1 = df_e1[df_e1['genres'] == 'Action']
df_func1

Unnamed: 0,genres,year
3160546,Action,2004


In [46]:
# Define the function
def PlayTimeGenre(input_genre: str):
    try:
        df_e1 = pd.read_csv("df_e1.csv")                # Read the dataframe
        df_e1 = df_e1[df_e1["genres"] == input_genre]   # Filter by the input genre
        
        output_year = int(df_e1.loc[df_e1['year'].idxmax(), 'year'])   # Cast to a plain int so the value serializes cleanly
    
        return {f"Año de lanzamiento con más horas jugadas para {input_genre}": output_year}
    except Exception as e:
        return {"error": str(e)}

In [47]:
# Test the function
input_genre = "Strategy"
print(PlayTimeGenre(input_genre))

{'Año de lanzamiento con más horas jugadas para Strategy': 2010}
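
The dataframe built for this endpoint keeps, for each genre, the release year of the single row with the highest `playtime_forever`. If the intent is the year with the most accumulated playtime, one possible alternative is to sum per genre and year first. This is only a sketch on made-up data; `demo`, `totals`, `best_rows`, and `best_year` are hypothetical names.

```python
import pandas as pd

# Made-up rows: the largest single row is in 2004,
# but 2012 accumulates more playtime overall.
demo = pd.DataFrame({
    'genres': ['Action', 'Action', 'Action', 'Indie'],
    'year': [2004, 2012, 2012, 2006],
    'playtime_forever': [500.0, 300.0, 400.0, 50.0],
})

# Total playtime per (genre, release year).
totals = demo.groupby(['genres', 'year'], as_index=False)['playtime_forever'].sum()

# For each genre, pick the row with the largest total.
best_rows = totals.loc[totals.groupby('genres')['playtime_forever'].idxmax()]
best_year = {g: int(y) for g, y in zip(best_rows['genres'], best_rows['year'])}
print(best_year)
```

With these values the per-row maximum would select 2004 for Action, while the accumulated total selects 2012.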


## Endpoint 2

def UserForGenre( genero : str ): Must return the user with the most accumulated hours played for the given genre, together with a list of the accumulated hours played per year.

Example return value: {"Usuario con más horas jugadas para Género X" : us213ndjss09sdf, "Horas jugadas":[{Año: 2013, Horas: 203}, {Año: 2012, Horas: 100}, {Año: 2011, Horas: 23}]}

In [48]:
df_e2 = pd.merge(df_games, df_usage, on="item_id", how="inner") # Join the datasets
df_e2 = df_e2[['genres', 'year', 'user_id','playtime_forever']] # Select the fields

# Get the index of the row with the maximum playtime_forever for each genre
df_e2_indmax = df_e2.groupby('genres')['playtime_forever'].idxmax()
# Use those indices to get the corresponding rows
# The release year is used as the year to which the playtime is attributed
df_e2 = df_e2.loc[df_e2_indmax, ['genres', 'year', 'user_id', 'playtime_forever']]

In [49]:
df_e2_users = df_e2[['genres', 'user_id']]
df_e2_users.to_csv('df_e2_users.csv', index=False)
df_e2_users

Unnamed: 0,genres,user_id
3160546,Action,76561197977470391
956218,Adventure,DONTFUCKINGCLICKTHIS
921209,Animation &amp; Modeling,76561198059330972
1762950,Audio Production,Lickidactyl
1599612,Casual,76561198101480347
2165742,Design &amp; Illustration,76561198035718256
1689391,Early Access,76561198084846677
947205,Education,SeedyDog
2133429,Free to Play,Cow666
27418,Indie,wolop


In [50]:
df_e2_playtime = df_e2[['genres', 'year', 'user_id', 'playtime_forever']]
df_e2_playtime.to_csv('df_e2_playtime.csv', index=False)
df_e2_playtime

Unnamed: 0,genres,year,user_id,playtime_forever
3160546,Action,2004,76561197977470391,493791.0
956218,Adventure,2014,DONTFUCKINGCLICKTHIS,134223.0
921209,Animation &amp; Modeling,2013,76561198059330972,65427.0
1762950,Audio Production,2014,Lickidactyl,109916.0
1599612,Casual,2014,76561198101480347,74433.0
2165742,Design &amp; Illustration,2012,76561198035718256,102554.0
1689391,Early Access,2014,76561198084846677,1241.0
947205,Education,2014,SeedyDog,3082.0
2133429,Free to Play,2013,Cow666,32987.0
27418,Indie,2006,wolop,642773.0
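
The notebook exports the dataframes for this endpoint but does not define the function itself. A possible shape for it, assuming playtime is accumulated per user and attributed to the release year as above, is sketched below on made-up data. `user_for_genre` and `demo` are hypothetical names, and this is not the author's implementation.

```python
import pandas as pd

# Made-up usage rows for a single genre.
demo = pd.DataFrame({
    'genres': ['Action'] * 4,
    'year': [2012, 2013, 2013, 2012],
    'user_id': ['u1', 'u1', 'u2', 'u2'],
    'playtime_forever': [100.0, 250.0, 300.0, 20.0],
})


def user_for_genre(df, genre):
    rows = df[df['genres'] == genre]
    if rows.empty:
        return {'error': f'no data for genre {genre}'}
    # User with the largest accumulated playtime in the genre.
    top_user = rows.groupby('user_id')['playtime_forever'].sum().idxmax()
    # That user's playtime per year, most recent year first.
    per_year = (rows[rows['user_id'] == top_user]
                .groupby('year')['playtime_forever'].sum()
                .sort_index(ascending=False))
    hours = [{'Año': int(y), 'Horas': float(h)} for y, h in per_year.items()]
    return {f'Usuario con más horas jugadas para {genre}': top_user,
            'Horas jugadas': hours}


result = user_for_genre(demo, 'Action')
print(result)
```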


## Endpoint 3

def UsersRecommend( año : int ): Returns the top 3 games MOST recommended by users for the given year. (reviews.recommend = True and positive/neutral comments)

Example return value: [{"Puesto 1" : X}, {"Puesto 2" : Y},{"Puesto 3" : Z}]

In [51]:
# Build the dataframe for the function
# Sentiment is coded 0 = negative, 1 = neutral, 2 = positive, so "sentiment >= 0" keeps every recommended review;
# restricting to positive/neutral comments would require "sentiment >= 1"
df_e3 = df_reviews[(df_reviews['recommend'] == True) & (df_reviews['sentiment'] >= 0)].copy()
df_e3['year_review'] = pd.to_numeric(df_e3['year_review'], errors='coerce')
df_e3 = df_e3.groupby(['year_review', 'item_id'])['recommend'].count().reset_index()
df_e3.rename(columns={'recommend': 'recommend_count'}, inplace=True)
df_e3 = pd.merge(df_e3, df_games, on="item_id", how="left").sort_values(by='recommend_count', ascending=False)
df_e3 = df_e3.dropna(subset=['app_name'])
df_e3 = df_e3[['year_review','app_name', 'recommend_count']]
df_e3 = df_e3.loc[df_e3['year_review'] != 0]   # Discard reviews whose year could not be parsed
df_e3.to_csv('df_e3.csv', index=False)
df_e3

Unnamed: 0,year_review,app_name,recommend_count
4320,2015,Counter-Strike: Global Offensive,1527
2853,2014,Counter-Strike: Global Offensive,1068
2887,2014,Garry's Mod,757
3822,2014,Rust,389
3494,2014,DayZ,383
...,...,...,...
2434,2013,Post Apocalyptic Mayhem,1
2436,2013,SpaceChem,1
2437,2013,Dinner Date,1
2438,2013,Jamestown,1


In [52]:
# Test the dataframe for the function
df_func3 = df_e3[df_e3['year_review'] == 2015].head(3)
df_func3

Unnamed: 0,year_review,app_name,recommend_count
4320,2015,Counter-Strike: Global Offensive,1527
4353,2015,Garry's Mod,357
5265,2015,Grand Theft Auto V,215


In [53]:
# Define the function
def UsersRecommend(input_year: int):
    try:
        df_e3 = pd.read_csv("df_e3.csv")
        output_top3 = df_e3[df_e3['year_review'] == input_year].head(3)   # Rows are already sorted by recommend_count, descending
        output_top3_list = [{"Puesto {}".format(i+1): game} for i, game in enumerate(output_top3['app_name'])]   # One {rank: game} dict per position
        return output_top3_list
    except Exception as e:
        return {"error": str(e)}

In [54]:
# Test the function
input_year = 2015
print(UsersRecommend(input_year))

[{'Puesto 1': 'Counter-Strike: Global Offensive'}, {'Puesto 2': "Garry's Mod"}, {'Puesto 3': 'Grand Theft Auto V'}]


## Endpoint 4

def UsersNotRecommend( año : int ): Returns the top 3 games LEAST recommended by users for the given year. (reviews.recommend = False and negative comments)

Example return value: [{"Puesto 1" : X}, {"Puesto 2" : Y},{"Puesto 3" : Z}]

In [55]:
# Build the dataframe for the function
#df_e4 = df_e3 
#df_e4.to_csv('df_e4.csv', index=False)
#df_e4

In [56]:
# Build the dataframe for the function
df_e4 = df_reviews[(df_reviews['recommend'] == False) & (df_reviews['sentiment'] == 0)].copy()   # Not recommended and negative sentiment (0)
df_e4['year_review'] = pd.to_numeric(df_e4['year_review'], errors='coerce')
df_e4 = df_e4.groupby(['year_review', 'item_id'])['recommend'].count().reset_index()
df_e4.rename(columns={'recommend': 'recommend_count'}, inplace=True)
df_e4 = pd.merge(df_e4, df_games, on="item_id", how="left").sort_values(by='recommend_count', ascending=False)
df_e4 = df_e4.dropna(subset=['app_name'])
df_e4 = df_e4[['year_review','app_name', 'recommend_count']]
df_e4 = df_e4.loc[df_e4['year_review'] != 0]   # Discard reviews whose year could not be parsed
df_e4.to_csv('df_e4.csv', index=False)
df_e4

Unnamed: 0,year_review,app_name,recommend_count
978,2015,Counter-Strike: Global Offensive,60
1125,2015,DayZ,43
740,2014,DayZ,34
1219,2015,Rust,26
539,2014,Counter-Strike: Global Offensive,21
...,...,...,...
549,2014,"Star Wars: Battlefront 2 (Classic, 2005)",1
548,2014,S.T.A.L.K.E.R.: Shadow of Chernobyl,1
544,2014,Alpha Prime,1
542,2014,The Ship: Murder Party,1


In [57]:
# Test the dataframe for the function
df_func4 = df_e4[df_e4['year_review'] == 2015].head(3)
df_func4

Unnamed: 0,year_review,app_name,recommend_count
978,2015,Counter-Strike: Global Offensive,60
1125,2015,DayZ,43
1219,2015,Rust,26


In [58]:
# Define the function
def UsersNotRecommend(input_year: int):
    try:
        df_e4 = pd.read_csv("df_e4.csv")
        output_last3 = df_e4[df_e4['year_review'] == input_year].head(3)   # Rows are already sorted by recommend_count, descending
        output_last3_list = [{"Puesto {}".format(i+1): game} for i, game in enumerate(output_last3['app_name'])]   # One {rank: game} dict per position
        return output_last3_list
    except Exception as e:
        return {"error": str(e)}

In [59]:
# Test the function
input_year = 2015
print(UsersNotRecommend(input_year))

[{'Puesto 1': 'Counter-Strike: Global Offensive'}, {'Puesto 2': 'DayZ'}, {'Puesto 3': 'Rust'}]


## Endpoint 5

def sentiment_analysis( año : int ): Given the release year, returns a list with the number of user review records categorized under each sentiment label.

Example return value: {Negative = 182, Neutral = 120, Positive = 278}

In [60]:
# Build the dataframe for the function
df_e5 = df_reviews[df_reviews['year_review'] != 0].copy()
# Create the 'negative', 'neutral' and 'positive' columns (sentiment is coded 0 = negative, 1 = neutral, 2 = positive)
df_e5['negative'] = (df_e5['sentiment'] == 0).astype(int)
df_e5['neutral'] = (df_e5['sentiment'] == 1).astype(int)
df_e5['positive'] = (df_e5['sentiment'] == 2).astype(int)

# Since results are not requested per item_id, year_review is used in place of the release year, because what is requested is the sentiment totals
# Group by year_review and sum the counts
df_e5 = df_e5.groupby('year_review')[['negative', 'neutral', 'positive']].sum()
df_e5 = df_e5.rename_axis('year').reset_index()

df_e5.to_csv('df_e5.csv', index=False)
df_e5

Unnamed: 0,year,negative,neutral,positive
0,2010,11,8,47
1,2011,94,70,366
2,2012,206,215,780
3,2013,1230,1362,4121
4,2014,4389,4736,12709
5,2015,4149,4257,9748


In [61]:
# Test the dataframe for the function
year = 2015
df_func5 = df_e5[df_e5['year'] == year]
df_func5

Unnamed: 0,year,negative,neutral,positive
5,2015,4149,4257,9748


In [62]:
# Define the function
def SentimentAnalysis(input_year: int):
    try:
        df_e5 = pd.read_csv("df_e5.csv")
        row = df_e5[df_e5['year'] == input_year].iloc[0]   # Row for the requested year (raises IndexError if absent)

        value_negative = int(row['negative'])
        value_neutral = int(row['neutral'])
        value_positive = int(row['positive'])

        output_sentiment_list = f"Para el año {input_year} se registran los siguientes valores: negative: {value_negative}, neutral: {value_neutral}, positive: {value_positive}"       
        return output_sentiment_list
    except Exception as e:
        return {"error": str(e)}

In [63]:
# Test the function
input_year = 2015
print(SentimentAnalysis(input_year))

Para el año 2015 se registran los siguientes valores: negative: 4149, neutral: 4257, positive: 9748
