## Preparing the datasets for the functions

This notebook prepares the files that the endpoint functions will use. Building these files in advance lets each endpoint read a small file that contains only the data it needs, which keeps the queries fast.

The notebook covers the following:
1. PlayTimeGenre
2. UserForGenre
3. UsersRecommend
4. UsersWorstDeveloper
5. sentiment_analysis
6. recomendacion_juego
7. recomendacion_usuario (optional)

At the end there will be 5 files ready for the 5 functions, plus 1 file for the recommendation system function.

#### Import libraries

In [1]:
import pandas as pd
import ast

#### Load the 3 files produced by the ETL of the 3 JSON files

In [318]:
df_User_Items = pd.read_csv("user_items_final.csv")
df_Steam_Games = pd.read_csv('steam_games_final.csv')
df_User_Reviews = pd.read_csv('user_reviews_final.csv')


#### 1. PlayTimeGenre

File for the PlayTimeGenre function

The data this function needs are 'Hours_played', 'release_year' and 'genres'.

In [None]:
df_User_Items = pd.read_csv("user_items_final.csv")
df_Steam_Games = pd.read_csv('steam_games_final.csv')
df_User_Reviews = pd.read_csv('user_reviews_final.csv')

In [123]:
df_User_Items.head(1)

Unnamed: 0,item_id,item_name,user_id,Hours_played
0,10,Counter-Strike,76561197970982479,0.1


In [124]:
df_Steam_Games.head(1)

Unnamed: 0,item_id,developer,release_year,app_name,tags,specs,genres,price
0,761140.0,Kotoshiro,2018,Lost Summoner Kitty,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim...",['Single-player'],"['Action', 'Casual', 'Indie', 'Simulation', 'S...",4.99


In [125]:
# Rows and columns of each DataFrame
print(df_User_Items.shape)
print(df_Steam_Games.shape)

(3246375, 4)
(29965, 8)


In [126]:
# Use 'merge' to combine df_User_Items and df_Steam_Games on 'item_id', the column both DataFrames share
merged_data = pd.merge(df_User_Items, df_Steam_Games[['item_id', 'release_year', 'genres']], on='item_id', how='inner')


In [127]:
merged_data.shape

(2771259, 6)
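
The inner join above shrinks the table from 3,246,375 to 2,771,259 rows, so some play records have no matching game. A minimal sketch of how that loss could be inspected with the `indicator` option of `merge`; the small frames below are made-up stand-ins for `df_User_Items` and `df_Steam_Games`, not the real data.

```python
import pandas as pd

# Toy stand-ins for df_User_Items and df_Steam_Games (values are made up).
items = pd.DataFrame({
    "item_id": [10, 10, 20, 99],
    "user_id": ["a", "b", "a", "c"],
    "Hours_played": [0.1, 1.5, 3.0, 2.0],
})
games = pd.DataFrame({
    "item_id": [10, 20, 30],
    "release_year": [2000, 2003, 2010],
    "genres": ["['Action']", "['Action', 'Indie']", "['Casual']"],
})

# A left merge with indicator=True keeps every play record and labels its origin.
check = items.merge(games, on="item_id", how="left", indicator=True)

# Rows marked 'left_only' are the ones an inner join would drop.
unmatched = check[check["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(items)} play records have no matching game")
```

Listing the unmatched `item_id` values is a quick way to decide whether the dropped rows matter for the endpoint.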

In [128]:
merged_data.head()

Unnamed: 0,item_id,item_name,user_id,Hours_played,release_year,genres
0,10,Counter-Strike,76561197970982479,0.1,2000,['Action']
1,10,Counter-Strike,doctr,1.55,2000,['Action']
2,10,Counter-Strike,corrupted_soul,1.8,2000,['Action']
3,10,Counter-Strike,WeiEDKrSat,5.467,2000,['Action']
4,10,Counter-Strike,death-hunter,104.583,2000,['Action']


In [129]:
# Group by 'release_year' and 'genres' and sum 'Hours_played' for each combination
merged_data2 = merged_data.groupby(['release_year', 'genres'])['Hours_played'].sum().reset_index()

In [131]:
merged_data2.shape

(1746, 3)

In [132]:
# Put each value of 'genres' on its own row

# Convert the string to a list with ast.literal_eval
merged_data2['genres'] = merged_data2['genres'].apply(ast.literal_eval)

# Expand the list into new rows
merged_data3 = merged_data2.explode('genres').reset_index(drop=True)

In [133]:
merged_data3.shape

(5581, 3)

In [134]:
merged_data3.head()

Unnamed: 0,release_year,genres,Hours_played
0,1983,Action,57.887
1,1983,Adventure,57.887
2,1983,Casual,57.887
3,1984,Action,6.4
4,1984,Adventure,6.4
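
Because the sum is taken per genre list before the explode, the exploded table can hold several rows for the same (release_year, genre) pair, one for each genre combination that contains that genre. A hedged sketch of a second aggregation that would leave one row per pair; the toy values are made up and the variable names are illustrative only.

```python
import ast
import pandas as pd

# Toy data: genres stored as strings, as they come from the CSV (values are made up).
grouped = pd.DataFrame({
    "release_year": [2000, 2000, 2001],
    "genres": ["['Action']", "['Action', 'Indie']", "['Indie']"],
    "Hours_played": [10.0, 5.0, 2.0],
})

# Parse the string into a list, then give each genre its own row.
grouped["genres"] = grouped["genres"].apply(ast.literal_eval)
exploded = grouped.explode("genres").reset_index(drop=True)

# 'Action' now appears twice for 2000; a second groupby collapses the repeats.
per_genre = exploded.groupby(
    ["release_year", "genres"], as_index=False
)["Hours_played"].sum()
print(per_genre)
```

This mirrors the regrouping the UserForGenre section performs after its own explode.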


In [135]:
# Export the result to CSV
merged_data3.to_csv('PlayTimeGenre.csv', index=False)

In [94]:
'''
# To speed up the lookup of the most hours played per genre
max_indices = merged_data3.groupby('genres')['Hours_played'].idxmax()

# Select the rows at those indices
df_resultante = merged_data3.loc[max_indices]

df_resultante.shape
'''

In [99]:
df_resultante

Unnamed: 0,release_year,genres,Hours_played
1330,2012,Action,13212080.0
985,2011,Adventure,2822710.0
3625,2015,Animation &amp; Modeling,19140.17
2406,2014,Audio Production,6502.749
5263,2017,Casual,838899.7
1395,2012,Design &amp; Illustration,32172.34
1499,2013,Early Access,1335169.0
1715,2013,Education,2928.084
1574,2013,Free to Play,2104059.0
447,2006,Indie,7364517.0
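
For illustration, here is how an endpoint might query a table shaped like PlayTimeGenre.csv. This is only a sketch: the function name `play_time_genre`, its return value, and the assumption that the endpoint wants the release year with the most hours for a genre are inferred from the cell above, not taken from the project's code. The toy rows are made up.

```python
import pandas as pd

def play_time_genre(df: pd.DataFrame, genre: str):
    """Hypothetical helper: release year with the most hours played for `genre`."""
    rows = df[df["genres"].str.lower() == genre.lower()]
    if rows.empty:
        return None
    # Sum first, so repeated (year, genre) rows are counted together.
    totals = rows.groupby("release_year")["Hours_played"].sum()
    return int(totals.idxmax())

# Toy table with the same columns as PlayTimeGenre.csv (values are made up).
toy = pd.DataFrame({
    "release_year": [2011, 2012, 2012, 2006],
    "genres": ["Action", "Action", "Action", "Indie"],
    "Hours_played": [5.0, 3.0, 4.0, 9.0],
})

print(play_time_genre(toy, "action"))
```

Summing inside the helper keeps the answer correct even when the file holds more than one row per year and genre.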


#### 2. UserForGenre

File for the UserForGenre function

In [None]:
df_User_Items = pd.read_csv("user_items_final.csv")
df_Steam_Games = pd.read_csv('steam_games_final.csv')
df_User_Reviews = pd.read_csv('user_reviews_final.csv')

In [181]:
df_User_Items.shape

(3246375, 4)

In [182]:
df_Steam_Games.shape

(29964, 8)

In [183]:
# Merge the DataFrames on 'item_id' to get release_year and genres
df_combined = pd.merge(df_User_Items, df_Steam_Games[['item_id', 'release_year', 'genres']], on='item_id', how='left')
df_combined.shape

In [185]:
df_combined.head(2)

Unnamed: 0,item_id,item_name,user_id,Hours_played,release_year,genres
0,10,Counter-Strike,76561197970982479,0.1,2000.0,['Action']
1,30,Day of Defeat,76561197970982479,0.117,2003.0,['Action']


In [186]:
# Group by genres, user_id and release_year, and sum the hours
df_final = df_combined.groupby(['genres', 'user_id', 'release_year'], as_index=False)['Hours_played'].sum()
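
One detail worth keeping in mind here: the left merge keeps play records that have no matching game, with NaN in 'release_year' and 'genres', but `groupby` drops rows whose keys are NaN by default. The sketch below, on made-up toy data, shows that default next to `dropna=False`; it assumes a pandas version that supports the `dropna` argument of `groupby`.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like df_combined; the last row has no matching game.
combined = pd.DataFrame({
    "genres": ["['Action']", "['Action']", np.nan],
    "user_id": ["u1", "u1", "u2"],
    "release_year": [2000.0, 2000.0, np.nan],
    "Hours_played": [1.0, 2.0, 7.0],
})
keys = ["genres", "user_id", "release_year"]

# Default behaviour: groups with a NaN key are silently dropped.
default = combined.groupby(keys, as_index=False)["Hours_played"].sum()

# dropna=False keeps them as their own group.
kept = combined.groupby(keys, as_index=False, dropna=False)["Hours_played"].sum()

print(len(default), len(kept))
```

With the default, the result matches what an inner merge would have produced.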


In [187]:
df_final.head()

Unnamed: 0,genres,user_id,release_year,Hours_played
0,"['Action', 'Adventure', 'Casual', 'Free to Pla...",12das,2015.0,0.133
1,"['Action', 'Adventure', 'Casual', 'Free to Pla...",666999661,2015.0,15.55
2,"['Action', 'Adventure', 'Casual', 'Free to Pla...",76561198022148624,2015.0,14.5
3,"['Action', 'Adventure', 'Casual', 'Free to Pla...",76561198030206184,2015.0,2.85
4,"['Action', 'Adventure', 'Casual', 'Free to Pla...",76561198030347257,2015.0,0.717


In [188]:
# Put each value of 'genres' on its own row

# Convert the string to a list with ast.literal_eval
df_final['genres'] = df_final['genres'].apply(ast.literal_eval)

# Expand the list into new rows
df_final2 = df_final.explode('genres').reset_index(drop=True)

In [189]:
# Group again after the explode

# Group by 'genres', 'user_id' and 'release_year' and sum 'Hours_played'
df_final3 = df_final2.groupby(['genres', 'user_id', 'release_year'], as_index=False)['Hours_played'].sum()


In [124]:
# Spot checks
# df_final3[(df_final3['genres'] == 'Action') & (df_final3['user_id'] == 'zzzmidmiss')]
# df_final3[(df_final3['genres'] == 'Action') & (df_final3['user_id'] == 'stopgovtcorruption')]


In [190]:
df_final3.shape

(2851017, 4)

In [191]:
# Additional checks

# Convert the year to a nullable integer type
df_final3['release_year'] = df_final3['release_year'].astype('Int64')

# Cast 'user_id' and 'genres' to text (str)
df_final3['user_id'] = df_final3['user_id'].astype(str)
df_final3['genres'] = df_final3['genres'].astype(str)

# Round the values to two decimals
df_final3['Hours_played'] = df_final3['Hours_played'].round(2)

In [192]:
# Save the file
df_final3.to_csv('UserForGenre.csv', index=False)
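
As an illustration of how this file might be consumed, the sketch below finds the user with the most hours for a genre and that user's hours per release year. The function name `user_for_genre` and the shape of its return value are assumptions made for this example; the toy rows are made up.

```python
import pandas as pd

def user_for_genre(df: pd.DataFrame, genre: str):
    """Hypothetical helper: top user for `genre` and that user's hours per release year."""
    rows = df[df["genres"] == genre]
    if rows.empty:
        return None
    # Total hours per user across all years, then pick the largest.
    top_user = rows.groupby("user_id")["Hours_played"].sum().idxmax()
    by_year = (
        rows[rows["user_id"] == top_user]
        .groupby("release_year")["Hours_played"]
        .sum()
    )
    return {
        "user": top_user,
        "hours_by_year": {int(y): float(h) for y, h in by_year.items()},
    }

# Toy table with the same columns as UserForGenre.csv (values are made up).
toy = pd.DataFrame({
    "genres": ["Action", "Action", "Action"],
    "user_id": ["u1", "u1", "u2"],
    "release_year": [2000, 2003, 2000],
    "Hours_played": [10.0, 5.0, 12.0],
})

result = user_for_genre(toy, "Action")
print(result)
```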


#### 3. UsersRecommend

File for the UsersRecommend function

In [None]:
df_User_Items = pd.read_csv("user_items_final.csv")
df_Steam_Games = pd.read_csv('steam_games_final.csv')
df_User_Reviews = pd.read_csv('user_reviews_final.csv')

In [319]:
# Drop duplicate (user_id, item_id) pairs, keeping the first
df_User_Reviews.drop_duplicates(subset=['user_id', 'item_id'], keep='first', inplace=True)


In [320]:
# Create a column that flags whether the review counts as RECOMMENDED
df_User_Reviews['Recommended'] = (df_User_Reviews['recommend'] == True) & df_User_Reviews['sentiment_analysis'].isin([1, 2])
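
A small sketch of what this mask does, together with the complementary flag used later for UsersWorstDeveloper. It assumes the encoding that the sentiment_analysis section uses (0 negative, 1 neutral, 2 positive); the toy rows are made up.

```python
import pandas as pd

# Toy reviews (values are made up).
reviews = pd.DataFrame({
    "recommend": [True, True, False, False],
    "sentiment_analysis": [2, 0, 1, 0],
})

# Recommended: the user recommends the game AND the text is neutral or positive.
reviews["Recommended"] = reviews["recommend"] & reviews["sentiment_analysis"].isin([1, 2])

# Not recommended: the user does not recommend it AND the text is negative.
reviews["Not_Recommended"] = ~reviews["recommend"] & (reviews["sentiment_analysis"] == 0)

print(reviews)
```

Reviews with mixed signals, such as a recommendation with negative text, end up in neither group.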


In [321]:
# Keep only the recommended reviews and only the columns 'user_id', 'item_id', 'year', 'Recommended'
df_User_Reviews_Rec = df_User_Reviews[df_User_Reviews['Recommended'] == True][['user_id', 'item_id', 'year', 'Recommended']].copy()

df_User_Reviews_Rec

Unnamed: 0,user_id,item_id,year,Recommended
0,76561197970982479,1250,2011,True
1,76561197970982479,22200,2011,True
2,76561197970982479,43110,2011,True
3,js41637,251610,2014,True
4,js41637,227300,2013,True
...,...,...,...,...
45093,76561198312638244,233270,2014,True
45094,76561198312638244,130,2015,True
45095,76561198312638244,70,2014,True
45096,76561198312638244,362890,2015,True


In [322]:
# Count how many users recommend each 'item_id' per year

# Group by 'year' and 'item_id' and count the 'user_id' values in each group.
df_NumberRec_year_item = df_User_Reviews_Rec.groupby(['year', 'item_id'])['user_id'].count().reset_index(name='number_user_id_recom')
df_NumberRec_year_item

Unnamed: 0,year,item_id,number_user_id_recom
0,2010,240,1
1,2010,300,1
2,2010,400,1
3,2010,440,7
4,2010,550,1
...,...,...,...
4448,2015,421630,1
4449,2015,422400,3
4450,2015,423120,1
4451,2015,423880,9


In [323]:
# Sort by year (ascending) and by number of recommendations (descending)
df_NumberRec_year_item.sort_values(by=['year', 'number_user_id_recom'], ascending=[True, False], inplace=True)
df_NumberRec_year_item

Unnamed: 0,year,item_id,number_user_id_recom
3,2010,440,7
5,2010,630,4
6,2010,1250,4
20,2010,22600,2
0,2010,240,1
...,...,...,...
4444,2015,418300,1
4446,2015,418910,1
4448,2015,421630,1
4450,2015,423120,1


In [324]:
# Add the game name to df_NumberRec_year_item, matching on item_id
df_Merged = df_NumberRec_year_item.merge(df_Steam_Games[['item_id', 'app_name']], on='item_id', how='left')


In [325]:
# Reorder the columns
df_Merged = df_Merged[['year', 'item_id', 'app_name', 'number_user_id_recom']]

In [326]:
df_Merged

Unnamed: 0,year,item_id,app_name,number_user_id_recom
0,2010,440,Team Fortress 2,7
1,2010,630,Alien Swarm,4
2,2010,1250,Killing Floor,4
3,2010,22600,Worms Reloaded,2
4,2010,240,Counter-Strike: Source,1
...,...,...,...,...
4448,2015,418300,Wick,1
4449,2015,418910,Idle Civilization,1
4450,2015,421630,A Study in Steampunk: Choice by Gaslight,1
4451,2015,423120,Community College Hero: Trial by Fire,1


In [317]:
# Save the result to a CSV file
df_Merged.to_csv('UsersRecommend.csv', index=False)
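
A possible way to read this file from an endpoint is sketched below. The helper name `top_recommended` and the choice of a top 3 are assumptions for the example; the rows reuse the 2010 values shown in the output above.

```python
import pandas as pd

def top_recommended(df: pd.DataFrame, year: int, n: int = 3):
    """Hypothetical helper: names of the n most recommended games for `year`."""
    rows = df[df["year"] == year]
    top = rows.nlargest(n, "number_user_id_recom")
    return top["app_name"].tolist()

# Rows copied from the 2010 output above.
toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2010, 2010],
    "item_id": [440, 630, 1250, 22600, 240],
    "app_name": ["Team Fortress 2", "Alien Swarm", "Killing Floor",
                 "Worms Reloaded", "Counter-Strike: Source"],
    "number_user_id_recom": [7, 4, 4, 2, 1],
})

top3 = top_recommended(toy, 2010)
print(top3)
```

Since the file is already sorted by year and count, `df[df["year"] == year].head(n)` would be an equivalent shortcut as long as that ordering is preserved.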

#### 4. UsersWorstDeveloper

File for the UsersWorstDeveloper function

In [385]:
df_User_Items = pd.read_csv("user_items_final.csv")
df_Steam_Games = pd.read_csv('steam_games_final.csv')
df_User_Reviews = pd.read_csv('user_reviews_final.csv')

In [386]:
# Drop duplicate (user_id, item_id) pairs, keeping the first
df_User_Reviews.drop_duplicates(subset=['user_id', 'item_id'], keep='first', inplace=True)


In [387]:
# Create a column that flags the review as NOT RECOMMENDED; it is True when the game is not recommended and the sentiment is negative
df_User_Reviews['Not_Recommended'] = (df_User_Reviews['recommend'] == False) & (df_User_Reviews['sentiment_analysis']==0)
df_User_Reviews

Unnamed: 0,user_id,item_id,recommend,year,language,sentiment_analysis,Not_Recommended
0,76561197970982479,1250,True,2011,en,2,False
1,76561197970982479,22200,True,2011,en,2,False
2,76561197970982479,43110,True,2011,en,2,False
3,js41637,251610,True,2014,en,2,False
4,js41637,227300,True,2013,en,2,False
...,...,...,...,...,...,...,...
45093,76561198312638244,233270,True,2014,en,2,False
45094,76561198312638244,130,True,2015,en,2,False
45095,76561198312638244,70,True,2014,en,2,False
45096,76561198312638244,362890,True,2015,en,2,False


In [388]:
# Keep only the NOT recommended reviews and only the columns 'user_id', 'item_id', 'year', 'Not_Recommended'
df_User_Reviews_NoRec = df_User_Reviews[df_User_Reviews['Not_Recommended'] == True][['user_id', 'item_id', 'year', 'Not_Recommended']].copy()

df_User_Reviews_NoRec

Unnamed: 0,user_id,item_id,year,Not_Recommended
53,76561198066046412,359320,2015,True
97,iamthekingofbrowntown,344760,2015,True
136,Nozomikat,437220,2014,True
170,76561198073784601,299740,2015,True
182,AVATAR715,48240,2014,True
...,...,...,...,...
44978,danebuchanan,311210,2015,True
44996,laislabonita75,305920,2015,True
45027,76561198209894493,570,2014,True
45035,76561198222628548,370240,2015,True


In [389]:
# Count how many users do not recommend each 'item_id' per year

# Group by 'year' and 'item_id' and count the 'user_id' values in each group.
df_NumberNoRec_year_item = df_User_Reviews_NoRec.groupby(['year', 'item_id'])['user_id'].count().reset_index(name='number_user_id_norecom')
df_NumberNoRec_year_item

Unnamed: 0,year,item_id,number_user_id_norecom
0,2011,440,1
1,2011,18700,2
2,2011,33460,1
3,2011,63940,1
4,2011,91310,1
...,...,...,...
1112,2015,410210,1
1113,2015,412400,1
1114,2015,417860,5
1115,2015,418340,1


In [390]:
# Sort by year and by number of negative reviews (both ascending)
df_NumberNoRec_year_item.sort_values(by=['year', 'number_user_id_norecom'], ascending=[True, True], inplace=True)
df_NumberNoRec_year_item

Unnamed: 0,year,item_id,number_user_id_norecom
0,2011,440,1
2,2011,33460,1
3,2011,63940,1
4,2011,91310,1
5,2011,105400,1
...,...,...,...
964,2015,311210,21
766,2015,221100,30
1028,2015,346110,35
758,2015,218620,49


In [391]:
# Add the game name and the developer to df_NumberNoRec_year_item, matching on item_id
df_Merged = df_NumberNoRec_year_item.merge(df_Steam_Games[['item_id', 'app_name','developer']], on='item_id', how='left')


In [392]:
# Select and reorder the columns ('app_name' is dropped)
df_Merged = df_Merged[['year', 'item_id', 'developer', 'number_user_id_norecom']]

In [393]:
df_Merged

Unnamed: 0,year,item_id,developer,number_user_id_norecom
0,2011,440,Valve,1
1,2011,33460,Ubisoft Montpellier,1
2,2011,63940,1C Company,1
3,2011,91310,,1
4,2011,105400,,1
...,...,...,...,...
1112,2015,311210,Treyarch,21
1113,2015,221100,Bohemia Interactive,30
1114,2015,346110,"Studio Wildcard,Instinct Games,Efecto Studios,...",35
1115,2015,218620,,49


In [394]:
# Save the result to a CSV file
df_Merged.to_csv('UsersWorstDeveloper.csv', index=False)
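
The saved file has one row per game, and some rows have no developer (NaN). A hedged sketch of how an endpoint could roll the counts up to developers for a given year; the helper name `worst_developers` and the decision to skip rows without a developer are assumptions for this example, and the toy values are partly made up.

```python
import numpy as np
import pandas as pd

def worst_developers(df: pd.DataFrame, year: int, n: int = 3):
    """Hypothetical helper: the n developers with the most negative reviews in `year`."""
    rows = df[df["year"] == year].dropna(subset=["developer"])
    totals = rows.groupby("developer")["number_user_id_norecom"].sum()
    return totals.nlargest(n).index.tolist()

# Toy table with the same columns as UsersWorstDeveloper.csv (values partly made up).
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2015, 2015],
    "item_id": [311210, 221100, 218620, 1, 2],
    "developer": ["Treyarch", "Bohemia Interactive", np.nan, "Valve", "Treyarch"],
    "number_user_id_norecom": [21, 30, 49, 2, 5],
})

worst = worst_developers(toy, 2015)
print(worst)
```

Dropping the NaN rows is a design choice: the alternative is to go back to the ETL and fill in the missing developer names.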

#### 5. sentiment_analysis

File for the sentiment_analysis function

In [480]:
df_User_Items = pd.read_csv("user_items_final.csv")
df_Steam_Games = pd.read_csv('steam_games_final.csv')
df_User_Reviews = pd.read_csv('user_reviews_final.csv')

In [481]:
# Drop duplicate (user_id, item_id) pairs, keeping the first
df_User_Reviews.drop_duplicates(subset=['user_id', 'item_id'], keep='first', inplace=True)


In [482]:
# Add the developer to df_User_Reviews, matching on item_id
df_w_developer = df_User_Reviews.merge(df_Steam_Games[['item_id','developer']], on='item_id', how='left')


In [483]:
df_w_developer.shape

(45097, 7)

In [484]:
df_w_developer.head(2)

Unnamed: 0,user_id,item_id,recommend,year,language,sentiment_analysis,developer
0,76561197970982479,1250,True,2011,en,2,Tripwire Interactive
1,76561197970982479,22200,True,2011,en,2,ACE Team


In [485]:
# Keep only the columns the function needs
df_w_developer = df_w_developer[['user_id', 'item_id', 'sentiment_analysis', 'developer']]

In [486]:
df_w_developer.head(2)

Unnamed: 0,user_id,item_id,sentiment_analysis,developer
0,76561197970982479,1250,2,Tripwire Interactive
1,76561197970982479,22200,2,ACE Team


In [426]:
#dd1=df_w_developer[df_w_developer['developer']=='11 bit studios']
#dd1['sentiment_analysis'].value_counts()

In [487]:
# Count the rows for each combination of 'developer' and 'sentiment_analysis'
df_counts = df_w_developer.groupby(['developer', 'sentiment_analysis']).size().reset_index(name='count')


In [488]:
df_counts.head(10)

Unnamed: 0,developer,sentiment_analysis,count
0,07th Expansion,0,1
1,07th Expansion,1,1
2,"10th Art Studio,Adventure Productions",2,1
3,10tons Ltd,2,1
4,11 bit studios,0,28
5,11 bit studios,1,5
6,11 bit studios,2,21
7,14° East,0,1
8,14° East,2,1
9,16bit Nights,0,1


In [489]:
# Pivot to get one column per 'sentiment_analysis' value; reset_index turns the 'developer' index back into a regular column
df_col_counts = df_counts.pivot(index='developer', columns='sentiment_analysis', values='count').reset_index()
df_col_counts.head()

sentiment_analysis,developer,0,1,2
0,07th Expansion,1.0,1.0,
1,"10th Art Studio,Adventure Productions",,,1.0
2,10tons Ltd,,,1.0
3,11 bit studios,28.0,5.0,21.0
4,14° East,1.0,,1.0


In [490]:
# Fill NaN with zeros
df_col_counts.fillna(0, inplace=True)

In [491]:
df_col_counts.rename(columns={0: 'Negative', 1: 'Neutral', 2: 'Positive'}, inplace=True)

df_col_counts

sentiment_analysis,developer,Negative,Neutral,Positive
0,07th Expansion,1.0,1.0,0.0
1,"10th Art Studio,Adventure Productions",0.0,0.0,1.0
2,10tons Ltd,0.0,0.0,1.0
3,11 bit studios,28.0,5.0,21.0
4,14° East,1.0,0.0,1.0
...,...,...,...,...
1999,xXarabongXx,1.0,0.0,0.0
2000,△○□× (Miwashiba),0.0,0.0,5.0
2001,"インレ,Inre",2.0,2.0,1.0
2002,橘子班,1.0,0.0,1.0


In [492]:
# Cast the counts to integers
df_col_counts[['Negative', 'Neutral', 'Positive']] = df_col_counts[['Negative', 'Neutral', 'Positive']].astype(int)
df_col_counts

sentiment_analysis,developer,Negative,Neutral,Positive
0,07th Expansion,1,1,0
1,"10th Art Studio,Adventure Productions",0,0,1
2,10tons Ltd,0,0,1
3,11 bit studios,28,5,21
4,14° East,1,0,1
...,...,...,...,...
1999,xXarabongXx,1,0,0
2000,△○□× (Miwashiba),0,0,5
2001,"インレ,Inre",2,2,1
2002,橘子班,1,0,1


In [493]:
# Save the result to a CSV file
df_col_counts.to_csv('sentiment_analysis.csv', index=False)
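
The count, pivot, fill and cast steps above could also be written as one chain. The sketch below is an alternative, not the notebook's method: `unstack(fill_value=0)` fills the missing combinations with zero directly, so the counts stay integers and no `fillna` or `astype` is needed. The toy rows are made up.

```python
import pandas as pd

# Toy reviews with a developer and a sentiment code (values are made up).
reviews = pd.DataFrame({
    "developer": ["A", "A", "A", "B", "B"],
    "sentiment_analysis": [0, 2, 2, 1, 1],
})

counts = (
    reviews.groupby(["developer", "sentiment_analysis"])
    .size()
    .unstack(fill_value=0)                         # one column per sentiment code
    .reindex(columns=[0, 1, 2], fill_value=0)      # guarantee all three columns exist
    .rename(columns={0: "Negative", 1: "Neutral", 2: "Positive"})
    .reset_index()
)
counts.columns.name = None  # drop the leftover 'sentiment_analysis' axis label

print(counts)
```

The `reindex` call is there so the output keeps the same three columns even if one sentiment code never occurs in the data.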