### Feature Engineering 

Starting from the already cleaned CSV files, we prepare the datasets that the API will query.

In [1]:
import pandas as pd
import tools 
import pyarrow as pa 
import pyarrow.parquet as pq 
import warnings 
warnings.filterwarnings("ignore") 

df_user_reviews = pd.read_csv("user_reviews_cleaned.csv", encoding="utf-8")
df_users_items = pd.read_csv("users_items_cleaned.csv", encoding="utf-8")
df_steam_games = pd.read_csv("steam_games_cleaned.csv", encoding="utf-8")

### Sentiment analysis

One of the project requirements is to replace the "reviews_review" column with a new one called "sentiment_analysis". The new column scores each review as follows:

* 0 if it is negative,
* 1 if it is neutral or there is no review,
* 2 if it is positive.


We run a basic sentiment analysis with TextBlob, a natural language processing (NLP) library for Python, since the goal of this project is a proof of concept that delivers a minimum viable product. TextBlob lets us assign a numeric value to a text according to the sentiment users express about each game.
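The helper `tools.analisis_sentimiento` is not shown in this notebook. As a minimal sketch of the mapping described above, assuming a polarity score in the range [-1, 1] and a hypothetical cut-off of 0.2 (both the function name `sentiment_class` and the threshold are illustrative, not the project's actual values):

```python
import math

def sentiment_class(review, polarity_fn, threshold=0.2):
    """Map a review to 0 (negative), 1 (neutral or missing) or 2 (positive)."""
    # Missing, NaN or blank reviews count as neutral
    if review is None:
        return 1
    if isinstance(review, float) and math.isnan(review):
        return 1
    if not str(review).strip():
        return 1
    polarity = polarity_fn(str(review))  # expected to lie in [-1.0, 1.0]
    if polarity < -threshold:
        return 0
    if polarity > threshold:
        return 2
    return 1

# With TextBlob installed, polarity_fn could be:
#   from textblob import TextBlob
#   polarity_fn = lambda text: TextBlob(text).sentiment.polarity
```

Passing the polarity function as a parameter keeps the class mapping easy to check without the NLP library, and makes the threshold an explicit choice.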

In [2]:
df_user_reviews["sentiment_analysis"] = df_user_reviews["reviews_review"].apply(tools.analisis_sentimiento)
df_user_reviews.head()

Unnamed: 0,user_id,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,year,sentiment_analysis
0,--000--,218230,No ratings yet,True,This game is just pure awesome. One of the be...,2014,2
1,--000--,233610,1 of 1 people (100%) found this review helpful,True,Distance is a wonderful racing game. The seco...,2015,1
2,--000--,210770,No ratings yet,True,Sanctum 2 is one really the only FPS/Tower Def...,2015,1
3,--000--,208650,3 of 5 people (60%) found this review helpful,False,I have played this game for an extensive amoun...,2014,1
4,--ace--,730,No ratings yet,True,One of the best FPS competetive games i hae pl...,2014,1


We now check the distribution of the sentiment classes and then a few example reviews for each class.

In [3]:

conteo_reviews = df_user_reviews["sentiment_analysis"].value_counts()

reviews_vacias = df_user_reviews["reviews_review"].isnull().sum()

total_reviews = len(df_user_reviews)

porcentaje_reviews = (conteo_reviews / total_reviews * 100).round(2)
porcentaje_reviews_vacias = (reviews_vacias / total_reviews * 100).round(2)



resumen_sentimientos = pd.DataFrame({
    "Conteo": conteo_reviews,
    "Porcentaje": porcentaje_reviews.round(2).astype(str) + '%'
})

# Sort the DataFrame by count, from highest to lowest
resumen_sentimientos = resumen_sentimientos.sort_values(by="Conteo", ascending=False)

# Print the results
print("\nResumen de análisis de sentimientos:")
print(resumen_sentimientos)
print("\nConteo de reviews en blanco: ", reviews_vacias, " Porcentaje: ", porcentaje_reviews_vacias.round(2).astype(str) + '%')





Resumen de análisis de sentimientos:
                    Conteo Porcentaje
sentiment_analysis                   
1                    35241     61.43%
2                    17100     29.81%
0                     5026      8.76%

Conteo de reviews en blanco:  0  Porcentaje:  0.0%


In [4]:
tools.review_por_sentimiento(df_user_reviews["reviews_review"], df_user_reviews["sentiment_analysis"])


Para la categoría de análisis de sentimiento 0 se tienen estos ejemplos de reviews:
Review 1: This game is rubbish, a samurai sword takes just as long to kill them as punching them, I have been riped off by this stupid fn game and the creators should be lined up and shot
Review 2: fkn ♥♥♥♥ you garry ♥♥♥♥ ♥♥♥♥ing idiot ♥♥♥♥♥♥ ♥♥♥♥♥ ♥♥♥♥♥♥ ♥♥♥♥er
Review 3: The core concept is neat but it is so broken it took me over an hour to get a game with my friend and another 2 hours disconnecting and reconnecting... The game was just posted on steam and then never improved.


Para la categoría de análisis de sentimiento 1 se tienen estos ejemplos de reviews:
Review 1: Distance is a wonderful racing game.  The second game by the developers following Nitronic Rush.  This game is amazing in so many ways.  From the high speeds you can reach, to the tricks and stunts you can perform and the different game modes.  There is an incredibly detailed and awesome level editor, while the TrackMania series has t

With the percentages and examples confirmed, we drop the "reviews_review" column.

In [5]:
df_user_reviews = df_user_reviews.drop(columns=["reviews_review"])
df_user_reviews.columns

Index(['user_id', 'reviews_item_id', 'reviews_helpful', 'reviews_recommend',
       'year', 'sentiment_analysis'],
      dtype='object')

In [6]:
tools.ver_tipo_datos(df_user_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,reviews_item_id,[<class 'int'>],100.0,0.0,0
2,reviews_helpful,[<class 'str'>],100.0,0.0,0
3,reviews_recommend,[<class 'bool'>],100.0,0.0,0
4,year,[<class 'int'>],100.0,0.0,0
5,sentiment_analysis,[<class 'int'>],100.0,0.0,0


### userData 

Next we work out how to use the dataframes loaded at the top of this notebook to calculate how much money each user spent on video games and which items they bought, the latter taken from the item data.

In [7]:
df_dinero_gastado = df_users_items[["items_count","user_id", "item_id"]]
df_dinero_gastado

Unnamed: 0,items_count,user_id,item_id
0,277,76561197970982479,10
1,277,76561197970982479,30
2,277,76561197970982479,300
3,277,76561197970982479,240
4,277,76561197970982479,3830
...,...,...,...
3246370,7,76561198329548331,304930
3246371,7,76561198329548331,227940
3246372,7,76561198329548331,388490
3246373,7,76561198329548331,521570


Now we build an auxiliary dataframe with the price of each game, to merge it later with the previous dataframe.

In [8]:
df_precio_juegos = df_steam_games[["price", "item_id"]]
df_precio_juegos


Unnamed: 0,price,item_id
0,4.99,761140
1,4.99,761140
2,4.99,761140
3,4.99,761140
4,4.99,761140
...,...,...
71546,1.99,610660
71547,1.99,610660
71548,1.99,610660
71549,4.99,658870


In [9]:
# Drop duplicate rows, keeping the first price seen for each item_id
df_precio_juegos = df_precio_juegos.drop_duplicates(subset = "item_id", keep="first")
df_precio_juegos


Unnamed: 0,price,item_id
0,4.99,761140
5,0.00,643980
9,0.00,670290
14,0.99,767400
17,3.99,772540
...,...,...
71535,1.99,745400
71539,1.99,773640
71543,4.99,733530
71546,1.99,610660


Now we merge the two previous dataframes into a single one.

In [10]:
df_dinero_gastado = df_dinero_gastado.merge(df_precio_juegos, on="item_id", how="left")
df_dinero_gastado

Unnamed: 0,items_count,user_id,item_id,price
0,277,76561197970982479,10,9.99
1,277,76561197970982479,30,4.99
2,277,76561197970982479,300,9.99
3,277,76561197970982479,240,19.99
4,277,76561197970982479,3830,9.99
...,...,...,...,...
3246370,7,76561198329548331,304930,0.00
3246371,7,76561198329548331,227940,0.00
3246372,7,76561198329548331,388490,0.00
3246373,7,76561198329548331,521570,0.00


In [11]:
tools.ver_tipo_datos(df_dinero_gastado)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,items_count,[<class 'int'>],100.0,0.0,0
1,user_id,[<class 'str'>],100.0,0.0,0
2,item_id,[<class 'int'>],100.0,0.0,0
3,price,[<class 'float'>],85.31,14.69,476857


The "price" column contains nulls, so we take a closer look at it.

In [12]:
df_dinero_gastado[df_dinero_gastado["price"].isnull()]

Unnamed: 0,items_count,user_id,item_id,price
16,277,76561197970982479,9340,
31,277,76561197970982479,23120,
34,277,76561197970982479,35010,
40,277,76561197970982479,24860,
41,277,76561197970982479,39530,
...,...,...,...,...
3246337,36,76561198312638244,202990,
3246338,36,76561198312638244,212910,
3246360,5,76561198320038728,39000,
3246365,321,76561198320136420,496920,


So far, 14.69% of the rows (476,857) have no price. We assume they have no price because the items are free, so we fill them with 0.0.
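Before settling on the "free" assumption, it can help to see whether the nulls come from items that are missing from the price table altogether. The sketch below uses toy stand-ins for `df_dinero_gastado` and `df_precio_juegos` (the names `owned` and `prices` and all values are illustrative) and the `indicator` option of `merge`:

```python
import pandas as pd

# Toy stand-ins for df_dinero_gastado and df_precio_juegos
owned = pd.DataFrame({"user_id": ["a", "a", "b"], "item_id": [10, 20, 30]})
prices = pd.DataFrame({"item_id": [10, 30], "price": [9.99, 0.0]})

# indicator=True adds a "_merge" column telling where each row came from
merged = owned.merge(prices, on="item_id", how="left", indicator=True)

# Rows marked "left_only" had no match in the price table at all
unmatched = (merged["_merge"] == "left_only").sum()
null_prices = merged["price"].isna().sum()
print(unmatched, null_prices)

# Fill the gaps directly, without a drop + concat round trip
merged["price"] = merged["price"].fillna(0.0)
```

If the two counts match, every null price belongs to an item absent from the catalogue, so the price is unknown rather than recorded as free. Filling with 0.0 is then a modelling choice, not a fact about those items.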

In [13]:
df_gratuito = df_dinero_gastado["price"].fillna(0.0)
# Drop the original column and concatenate the filled column back onto the dataframe
df_gasto_items = pd.concat([df_dinero_gastado.drop("price", axis=1), df_gratuito], axis=1)
df_gasto_items

Unnamed: 0,items_count,user_id,item_id,price
0,277,76561197970982479,10,9.99
1,277,76561197970982479,30,4.99
2,277,76561197970982479,300,9.99
3,277,76561197970982479,240,19.99
4,277,76561197970982479,3830,9.99
...,...,...,...,...
3246370,7,76561198329548331,304930,0.00
3246371,7,76561198329548331,227940,0.00
3246372,7,76561198329548331,388490,0.00
3246373,7,76561198329548331,521570,0.00


With the dataframe merged, we can drop the "item_id" column.

In [14]:
df_gasto_items = df_gasto_items.drop("item_id", axis = 1)
df_gasto_items

Unnamed: 0,items_count,user_id,price
0,277,76561197970982479,9.99
1,277,76561197970982479,4.99
2,277,76561197970982479,9.99
3,277,76561197970982479,19.99
4,277,76561197970982479,9.99
...,...,...,...
3246370,7,76561198329548331,0.00
3246371,7,76561198329548331,0.00
3246372,7,76561198329548331,0.00
3246373,7,76561198329548331,0.00


We add up what each user spent on the platform, grouping by user id.


In [15]:
df_gasto_items_total = df_gasto_items.groupby("user_id")["price"].sum().reset_index()
df_gasto_items_total

Unnamed: 0,user_id,price
0,--000--,182.84
1,--ace--,122.89
2,--ionex--,99.93
3,-2SV-vuLB-Kg,234.69
4,-404PageNotFound-,1154.47
...,...,...
68398,zzonci,0.00
68399,zzoptimuszz,4.99
68400,zzydrax,99.94
68401,zzyfo,484.73


Now we build a dataframe with the number of items each user has, keeping a single row per user id.

In [16]:
df_cantidad_items = df_gasto_items[["items_count", "user_id"]]
df_cantidad_items = df_cantidad_items.drop_duplicates(subset="user_id", keep="first")
df_cantidad_items

Unnamed: 0,items_count,user_id
0,277,76561197970982479
198,888,js41637
717,137,evcentric
821,328,Riot-Punch
951,541,doctr
...,...,...
3246360,5,76561198320038728
3246365,321,76561198320136420
3246367,4,ArkPlays7
3246369,22,76561198323066619


Now we join the last two dataframes so the result is ready for Endpoint 2.

In [17]:
df_user_data = df_cantidad_items.merge(df_gasto_items_total, on="user_id", how="right")
df_user_data

Unnamed: 0,items_count,user_id,price
0,58,--000--,182.84
1,44,--ace--,122.89
2,23,--ionex--,99.93
3,68,-2SV-vuLB-Kg,234.69
4,149,-404PageNotFound-,1154.47
...,...,...,...
68398,5,zzonci,0.00
68399,61,zzoptimuszz,4.99
68400,13,zzydrax,99.94
68401,84,zzyfo,484.73
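The three steps above (sum the price per user, keep one `items_count` per user, merge) can also be sketched as a single `groupby` with named aggregation. The toy frame `gasto` stands in for `df_gasto_items`, and the sketch assumes `items_count` is the same on every row of a given user:

```python
import pandas as pd

# Toy stand-in for df_gasto_items (one row per user/item pair)
gasto = pd.DataFrame({
    "user_id": ["a", "a", "b"],
    "items_count": [2, 2, 1],
    "price": [9.99, 0.0, 4.99],
})

# items_count repeats on every row of a user, so "first" recovers it;
# price is summed to get the total spent
user_data = (
    gasto.groupby("user_id")
    .agg(items_count=("items_count", "first"), price=("price", "sum"))
    .reset_index()
)
print(user_data)
```

This avoids the intermediate dataframes and the final merge, at the cost of relying on the repeated-value assumption.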


### UserForGenre

We want to know how much time each user played a given genre. To do so, we build auxiliary dataframes and then merge them.

In [18]:
df_playtime_usuario_item = df_users_items[["played_hours", "user_id", "item_id"]]
df_playtime_usuario_item

Unnamed: 0,played_hours,user_id,item_id
0,0.10,76561197970982479,10
1,0.12,76561197970982479,30
2,78.88,76561197970982479,300
3,30.88,76561197970982479,240
4,5.55,76561197970982479,3830
...,...,...,...
3246370,11.28,76561198329548331,304930
3246371,0.72,76561198329548331,227940
3246372,0.05,76561198329548331,388490
3246373,0.07,76561198329548331,521570


Now we extract information from the "df_steam_games" dataframe.

In [19]:
df_genero = df_steam_games[["genres", "item_id", "release_year"]]

df_genero

Unnamed: 0,genres,item_id,release_year
0,Action,761140,2018
1,Casual,761140,2018
2,Indie,761140,2018
3,Simulation,761140,2018
4,Strategy,761140,2018
...,...,...,...
71546,Indie,610660,2018
71547,Racing,610660,2018
71548,Simulation,610660,2018
71549,Casual,658870,2017


Now we join the two dataframes.

In [20]:
df_tiempo_genero = df_playtime_usuario_item.merge(df_genero, on="item_id")
df_tiempo_genero

Unnamed: 0,played_hours,user_id,item_id,genres,release_year
0,0.10,76561197970982479,10,Action,2000
1,1.55,doctr,10,Action,2000
2,1.80,corrupted_soul,10,Action,2000
3,5.47,WeiEDKrSat,10,Action,2000
4,104.58,death-hunter,10,Action,2000
...,...,...,...,...,...
6845688,0.02,76561198146468235,367090,Action,2015
6845689,0.02,76561198146468235,367090,Indie,2015
6845690,0.02,76561198146468235,367090,Sports,2015
6845691,0.02,massimo23,448540,Adventure,2016


We group the dataframe by genre, release year and user, and sum the playtime for each group. This gives us the data for Endpoint 3.

In [21]:
# Group by genre, release year and user, then sum the played hours
df_userforgenre = df_tiempo_genero.groupby(["genres", "release_year","user_id"])["played_hours"].sum().reset_index()

df_userforgenre

Unnamed: 0,genres,release_year,user_id,played_hours
0,Action,1983,2Ta4,0.30
1,Action,1983,76561197966936422,5.52
2,Action,1983,76561197968887720,0.02
3,Action,1983,76561197969020980,1.63
4,Action,1983,76561197971401137,0.55
...,...,...,...,...
2893382,Web Publishing,2017,Eosoforcus,0.97
2893383,Web Publishing,2017,N47H4NI3L,27.25
2893384,Web Publishing,2017,dirklah,13.27
2893385,Web Publishing,2017,kushziller,4.18
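How the endpoint consumes this table is not shown here. As one possible sketch, assuming the endpoint should return the user with the most hours in a genre together with that user's hours per release year (the function `top_user_for_genre` and the toy frame `ufg` are hypothetical):

```python
import pandas as pd

# Toy stand-in for df_userforgenre
ufg = pd.DataFrame({
    "genres": ["Action", "Action", "Action", "Indie"],
    "release_year": [2000, 2001, 2000, 2015],
    "user_id": ["a", "a", "b", "b"],
    "played_hours": [10.0, 5.0, 12.0, 3.0],
})

def top_user_for_genre(df, genre):
    """Return the user with the most hours in `genre` and their hours per release year."""
    subset = df[df["genres"] == genre]
    if subset.empty:
        return None
    # Total hours per user within the genre, then the user with the maximum
    top_user = subset.groupby("user_id")["played_hours"].sum().idxmax()
    per_year = (
        subset[subset["user_id"] == top_user]
        .groupby("release_year")["played_hours"]
        .sum()
        .to_dict()
    )
    return {"user_id": top_user, "hours_by_year": per_year}

result = top_user_for_genre(ufg, "Action")
```

Because the table is already aggregated by genre, release year and user, the query reduces to a filter plus two small group-bys.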


### Developer

Here we build a dataframe with the items each developer has created, along with the release year and price of each one. This gives us the data for Endpoint 1.

In [22]:
df_steam_games.columns

Index(['genres', 'price', 'early_access', 'item_id', 'release_year',
       'publisher', 'name', 'title', 'developer'],
      dtype='object')

In [32]:
df_item_developer_year = df_steam_games[["price", "release_year", "developer", "item_id"]]
# Drop duplicate rows
df_item_developer_year = df_item_developer_year.drop_duplicates()
df_item_developer_year

Unnamed: 0,price,release_year,developer,item_id
0,4.99,2018,Kotoshiro,761140
5,0.00,2018,Secret Level SRL,643980
9,0.00,2017,Poolians.com,670290
14,0.99,2017,彼岸领域,767400
17,3.99,2018,Trickjump Games Ltd,772540
...,...,...,...,...
71535,1.99,2018,Bidoniera Games,745400
71539,1.99,2018,"Nikita ""Ghost_RUS""",773640
71543,4.99,2018,Sacada,733530
71546,1.99,2018,Laush Dmitriy Sergeevich,610660


We add each game's release year to the "df_user_reviews" dataframe. We extract the item id and release year columns, drop the duplicates and then merge the result with the user_reviews dataframe. The merge uses pandas' default inner join, so reviews whose item does not appear in df_steam_games are dropped.

In [24]:
df_item_release_year = df_steam_games[['item_id', 'release_year']]

df_item_release_year = df_item_release_year.rename(columns={'item_id':'reviews_item_id'})

df_item_release_year = df_item_release_year.drop_duplicates()
df_item_release_year

Unnamed: 0,reviews_item_id,release_year
0,761140,2018
5,643980,2018
9,670290,2017
14,767400,2017
17,772540,2018
...,...,...
71535,745400,2018
71539,773640,2018
71543,733530,2018
71546,610660,2018


In [25]:
df_user_reviews = df_user_reviews.merge(df_item_release_year, on='reviews_item_id')
df_user_reviews


Unnamed: 0,user_id,reviews_item_id,reviews_helpful,reviews_recommend,year,sentiment_analysis,release_year
0,--000--,218230,No ratings yet,True,2014,2,2012
1,112asdasfasdasd,218230,No ratings yet,True,2013,1,2012
2,1234567890192837465,218230,No ratings yet,True,2014,1,2012
3,2828838282,218230,No ratings yet,True,2012,1,2012
4,2sd31,218230,No ratings yet,True,2013,1,2012
...,...,...,...,...,...,...,...
48797,yougotblehed,315340,15 of 32 people (47%) found this review helpful,False,2014,1,2014
48798,yougoyu,424370,5 of 11 people (45%) found this review helpful,True,2014,1,2016
48799,zayyntt,479260,12 of 14 people (86%) found this review helpful,True,2014,1,2016
48800,zayyntt,463550,2 of 3 people (67%) found this review helpful,True,2014,1,2016


### best_developer_year  / developer_reviews_analysis

We want to find the top developers, those with the most recommended games per year. We also want to look up a developer and count the positive and negative reviews it has received.
Since both queries need similar information, we decided to build a single dataframe that serves both.

In [26]:
# Rename the column so it matches the column name in the other dataframe
df_user_reviews.rename(columns={'reviews_item_id': 'item_id'}, inplace=True)
df_user_reviews

Unnamed: 0,user_id,item_id,reviews_helpful,reviews_recommend,year,sentiment_analysis,release_year
0,--000--,218230,No ratings yet,True,2014,2,2012
1,112asdasfasdasd,218230,No ratings yet,True,2013,1,2012
2,1234567890192837465,218230,No ratings yet,True,2014,1,2012
3,2828838282,218230,No ratings yet,True,2012,1,2012
4,2sd31,218230,No ratings yet,True,2013,1,2012
...,...,...,...,...,...,...,...
48797,yougotblehed,315340,15 of 32 people (47%) found this review helpful,False,2014,1,2014
48798,yougoyu,424370,5 of 11 people (45%) found this review helpful,True,2014,1,2016
48799,zayyntt,479260,12 of 14 people (86%) found this review helpful,True,2014,1,2016
48800,zayyntt,463550,2 of 3 people (67%) found this review helpful,True,2014,1,2016


We extract the columns we need into separate dataframes and inspect the data.

In [27]:
df_1 = df_user_reviews[["item_id", "reviews_recommend", "sentiment_analysis", "year" ]]
tools.ver_tipo_datos(df_1)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,item_id,[<class 'int'>],100.0,0.0,0
1,reviews_recommend,[<class 'bool'>],100.0,0.0,0
2,sentiment_analysis,[<class 'int'>],100.0,0.0,0
3,year,[<class 'int'>],100.0,0.0,0


In [28]:
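df_2 = df_steam_games[["item_id","developer"]]
# This call inspects df_1 again, so the summary below describes df_1; pass df_2 to check the developer columns
tools.ver_tipo_datos(df_1)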

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,item_id,[<class 'int'>],100.0,0.0,0
1,reviews_recommend,[<class 'bool'>],100.0,0.0,0
2,sentiment_analysis,[<class 'int'>],100.0,0.0,0
3,year,[<class 'int'>],100.0,0.0,0


Now we merge df_1 and df_2 into a single dataframe.

In [29]:
df_best_developer = df_1.merge(df_2, on='item_id')
df_best_developer

Unnamed: 0,item_id,reviews_recommend,sentiment_analysis,year,developer
0,218230,True,2,2014,Daybreak Game Company
1,218230,True,2,2014,Daybreak Game Company
2,218230,True,2,2014,Daybreak Game Company
3,218230,True,1,2013,Daybreak Game Company
4,218230,True,1,2013,Daybreak Game Company
...,...,...,...,...,...
122014,479260,True,1,2014,"Roman Anatolevich,Denis Ovsyannikov"
122015,463550,True,1,2014,dobro_slon
122016,463550,True,1,2014,dobro_slon
122017,383270,True,1,2014,Fiddlesticks Games
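The merged output repeats the same review several times (the first three rows are identical) and has 122,018 rows against 48,801 reviews. This suggests that `df_2` still carries one row per genre of each game. A sketch of one way to avoid the multiplication, using toy data and assuming each item has a single developer:

```python
import pandas as pd

# Toy stand-ins: the games table has one row per (item, genre),
# so selecting only item_id and developer leaves repeated rows
reviews = pd.DataFrame({"item_id": [1, 1, 2], "sentiment_analysis": [2, 1, 0]})
games = pd.DataFrame({
    "item_id": [1, 1, 1, 2],
    "genres": ["Action", "Indie", "RPG", "Casual"],
    "developer": ["Dev A", "Dev A", "Dev A", "Dev B"],
})

# Merging without deduplicating multiplies each review by its item's genre count
naive = reviews.merge(games[["item_id", "developer"]], on="item_id")

# Keep one row per item; validate raises MergeError if the right side still repeats keys
lookup = games[["item_id", "developer"]].drop_duplicates(subset="item_id")
deduped = reviews.merge(lookup, on="item_id", validate="many_to_one")

print(len(reviews), len(naive), len(deduped))
```

With the duplicates left in, any count of recommendations or sentiments per developer is inflated by the number of genres of each game, which would skew both endpoints that read this table.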


## Saving the dataframes

In [30]:
dfs = [df_user_reviews, df_user_data, df_userforgenre, 
       df_item_developer_year, df_best_developer]
# Names matching each DataFrame
names = ["df_user_reviews", "df_user_data","df_userforgenre", 
         "df_item_developer_year", "df_best_developer"]

for df, name in zip(dfs, names):
    archivo = f"{name}_FE.csv"
    df.to_csv(archivo, index=False, encoding="utf-8")
    print(f"DataFrame '{name}' guardado como '{archivo}'")

DataFrame 'df_user_reviews' guardado como 'df_user_reviews_FE.csv'
DataFrame 'df_user_data' guardado como 'df_user_data_FE.csv'
DataFrame 'df_userforgenre' guardado como 'df_userforgenre_FE.csv'
DataFrame 'df_item_developer_year' guardado como 'df_item_developer_year_FE.csv'
DataFrame 'df_best_developer' guardado como 'df_best_developer_FE.csv'


We also save the dataframes in Parquet format, which optimizes how the data is stored for deployment.

In [31]:
for df, name in zip(dfs, names):
    archivo = f"{name}.parquet"
    pq.write_table(pa.Table.from_pandas(df), archivo)
    print(f"DataFrame '{name}' guardado como '{archivo}'")

DataFrame 'df_user_reviews' guardado como 'df_user_reviews.parquet'
DataFrame 'df_user_data' guardado como 'df_user_data.parquet'
DataFrame 'df_userforgenre' guardado como 'df_userforgenre.parquet'
DataFrame 'df_item_developer_year' guardado como 'df_item_developer_year.parquet'
DataFrame 'df_best_developer' guardado como 'df_best_developer.parquet'
