# Data Extraction, Transformation, and Cleaning (ETL):

The "Revised_Data" folder stores, in CSV format, the datasets that must be cleaned and transformed before building the functions for the API endpoints and the recommendation model.

### 1. Import the libraries needed for the ETL:

In [1]:
# To open the files and apply transformations:

import pandas as pd                                         # To open and transform the data
import ast                                                  # To parse the files with nested data
from pandas import json_normalize                           # To expand the nested data

#----------------------------------------------------------------------------------------------------------------------------------------

# For the sentiment analysis model on "df_User_Review":

import re                                                   # To remove special characters
from nltk.corpus import stopwords                           # To remove common English words
from nltk.tokenize import word_tokenize                     # To tokenize
from nltk.stem import WordNetLemmatizer                     # To lemmatize
from nltk import download                                   # To fetch the NLTK resources that do not ship with the library
from nltk.sentiment import SentimentIntensityAnalyzer       # For sentiment analysis of the reviews
from datetime import datetime                               # To fix the date format

#----------------------------------------------------------------------------------------------------------------------------------------

# To silence warnings:

import warnings
warnings.filterwarnings("ignore")



### 2. Steam_Games ETL:

We preview the data in the "Steam_Games" dataframe:

In [2]:
df_Steam_Games = pd.read_csv('Revised_Data/Steam_Games.csv')
df_Steam_Games.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,Kotoshiro,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim...",http://steamcommunity.com/app/761140/reviews/?...,['Single-player'],4.99,False,761140.0,Kotoshiro
1,"Making Fun, Inc.","['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"['Free to Play', 'Strategy', 'Indie', 'RPG', '...",http://steamcommunity.com/app/643980/reviews/?...,"['Single-player', 'Multi-player', 'Online Mult...",Free To Play,False,643980.0,Secret Level SRL
2,Poolians.com,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"['Free to Play', 'Simulation', 'Sports', 'Casu...",http://steamcommunity.com/app/670290/reviews/?...,"['Single-player', 'Multi-player', 'Online Mult...",Free to Play,False,670290.0,Poolians.com
3,彼岸领域,"['Action', 'Adventure', 'Casual']",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"['Action', 'Adventure', 'Casual']",http://steamcommunity.com/app/767400/reviews/?...,['Single-player'],0.99,False,767400.0,彼岸领域
4,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"['Action', 'Indie', 'Casual', 'Sports']",http://steamcommunity.com/app/773570/reviews/?...,"['Single-player', 'Full controller support', '...",2.99,False,773570.0,


### 2.1. Impute null values in "app_name" with "title" and drop the "title" column:

In [3]:
# Fill the empty values in 'app_name' with the values from 'title'
df_Steam_Games['app_name'] = df_Steam_Games['app_name'].fillna(df_Steam_Games['title'])

# Drop the 'title' column
df_Steam_Games.drop(columns=['title'], inplace=True)

# Show the resulting DataFrame
df_Steam_Games.head(3)


Unnamed: 0,publisher,genres,app_name,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,Kotoshiro,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim...",http://steamcommunity.com/app/761140/reviews/?...,['Single-player'],4.99,False,761140.0,Kotoshiro
1,"Making Fun, Inc.","['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"['Free to Play', 'Strategy', 'Indie', 'RPG', '...",http://steamcommunity.com/app/643980/reviews/?...,"['Single-player', 'Multi-player', 'Online Mult...",Free To Play,False,643980.0,Secret Level SRL
2,Poolians.com,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"['Free to Play', 'Simulation', 'Sports', 'Casu...",http://steamcommunity.com/app/670290/reviews/?...,"['Single-player', 'Multi-player', 'Online Mult...",Free to Play,False,670290.0,Poolians.com
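
The fill-from-another-column step can be tried on a toy frame. This is a minimal sketch with made-up rows (only the column names come from the dataset); it assigns the result back to the column, an alternative to calling `fillna(..., inplace=True)` on a column selection, which newer pandas releases flag as chained assignment.

```python
import pandas as pd

# Toy frame with hypothetical values: the second row has no 'app_name'
toy = pd.DataFrame({
    "app_name": ["Ironbound", None],
    "title": ["Ironbound", "Log Challenge"],
})

# Take 'title' wherever 'app_name' is missing, and assign the result back
toy["app_name"] = toy["app_name"].fillna(toy["title"])

# 'title' is now redundant
toy = toy.drop(columns=["title"])
print(toy)
```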


### 2.2. Drop the "url" and "reviews_url" columns:

In [4]:
# Drop the specified columns
columnas_a_eliminar = ['url', 'reviews_url']
df_Steam_Games.drop(columns=columnas_a_eliminar, inplace=True)

# Show the resulting DataFrame
df_Steam_Games.head(3)

Unnamed: 0,publisher,genres,app_name,release_date,tags,specs,price,early_access,id,developer
0,Kotoshiro,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,2018-01-04,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim...",['Single-player'],4.99,False,761140.0,Kotoshiro
1,"Making Fun, Inc.","['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,2018-01-04,"['Free to Play', 'Strategy', 'Indie', 'RPG', '...","['Single-player', 'Multi-player', 'Online Mult...",Free To Play,False,643980.0,Secret Level SRL
2,Poolians.com,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,2017-07-24,"['Free to Play', 'Simulation', 'Sports', 'Casu...","['Single-player', 'Multi-player', 'Online Mult...",Free to Play,False,670290.0,Poolians.com


### 2.3. Impute null values in "genres" from the contents of "tags":

In [5]:
# Get the unique values of the 'tags' column to guide the imputation of "genres", and of "price" when the game is "Free to Play"
valores_unicos_tags = df_Steam_Games['tags'].unique()

# Show the unique values
print("Unique values in the 'tags' column:")
print(valores_unicos_tags)


Unique values in the 'tags' column:
["['Strategy', 'Action', 'Indie', 'Casual', 'Simulation']"
 "['Free to Play', 'Strategy', 'Indie', 'RPG', 'Card Game', 'Trading Card Game', 'Turn-Based', 'Fantasy', 'Tactical', 'Dark Fantasy', 'Board Game', 'PvP', '2D', 'Competitive', 'Replay Value', 'Character Customization', 'Female Protagonist', 'Difficult', 'Design & Illustration']"
 "['Free to Play', 'Simulation', 'Sports', 'Casual', 'Indie', 'Multiplayer']"
 ... "['Action', 'Indie', 'Casual', 'Violent', 'Adventure']"
 "['Indie', 'Casual', 'Puzzle', 'Singleplayer', 'Atmospheric', 'Relaxing']"
 "['Early Access', 'Adventure', 'Indie', 'Action', 'Simulation', 'VR']"]


In [6]:
# Where 'genres' is null, fill it with the contents of 'tags'
df_Steam_Games['genres'] = df_Steam_Games['genres'].fillna(df_Steam_Games['tags'])

# Show the resulting DataFrame
df_Steam_Games.head(3)


Unnamed: 0,publisher,genres,app_name,release_date,tags,specs,price,early_access,id,developer
0,Kotoshiro,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,2018-01-04,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim...",['Single-player'],4.99,False,761140.0,Kotoshiro
1,"Making Fun, Inc.","['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,2018-01-04,"['Free to Play', 'Strategy', 'Indie', 'RPG', '...","['Single-player', 'Multi-player', 'Online Mult...",Free To Play,False,643980.0,Secret Level SRL
2,Poolians.com,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,2017-07-24,"['Free to Play', 'Simulation', 'Sports', 'Casu...","['Single-player', 'Multi-player', 'Online Mult...",Free to Play,False,670290.0,Poolians.com


### 2.4. Convert each column to its appropriate data type:

In [7]:
# Convert to string data type
columnas_a_string = ['publisher', 'genres', 'app_name', 'tags', 'specs', 'developer']
df_Steam_Games[columnas_a_string] = df_Steam_Games[columnas_a_string].astype(str)

# Replace "Free to Play" with 0 in 'price' (exact match only; other spellings such as "Free To Play"
# are coerced to NaN below and imputed in step 2.5 when 'tags' contains 'Free to Play')
df_Steam_Games['price'] = df_Steam_Games['price'].replace('Free to Play', 0)

# Convert 'release_date' to datetime; unparseable values become NaT
df_Steam_Games['release_date'] = pd.to_datetime(df_Steam_Games['release_date'], errors='coerce')

# Create the 'release_year' column by extracting the year, using 0 where the date is missing
df_Steam_Games['release_year'] = df_Steam_Games['release_date'].dt.year.fillna(0).astype(int)

# Convert 'price' to float; non-numeric values become NaN
df_Steam_Games['price'] = pd.to_numeric(df_Steam_Games['price'], errors='coerce')

# astype(str) turned missing values into the literal string 'nan'; restore them as real missing values
df_Steam_Games = df_Steam_Games.replace('nan', pd.NA)

# Show the first rows of the DataFrame
df_Steam_Games.head(3)



Unnamed: 0,publisher,genres,app_name,release_date,tags,specs,price,early_access,id,developer,release_year
0,Kotoshiro,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,2018-01-04,"['Strategy', 'Action', 'Indie', 'Casual', 'Sim...",['Single-player'],4.99,False,761140.0,Kotoshiro,2018
1,"Making Fun, Inc.","['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,2018-01-04,"['Free to Play', 'Strategy', 'Indie', 'RPG', '...","['Single-player', 'Multi-player', 'Online Mult...",,False,643980.0,Secret Level SRL,2018
2,Poolians.com,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,2017-07-24,"['Free to Play', 'Simulation', 'Sports', 'Casu...","['Single-player', 'Multi-player', 'Online Mult...",0.0,False,670290.0,Poolians.com,2017
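
The empty price for "Ironbound" in the output above comes from its label being spelled "Free To Play", which an exact-match replace does not catch. The sketch below reproduces that effect on a toy Series and shows one possible alternative, a case-insensitive match, which is an assumption for illustration and not what the notebook does. The sample values are made up.

```python
import pandas as pd

# Toy price column mixing numbers, two spellings of the free label, and a missing value
prices = pd.Series(["4.99", "Free To Play", "Free to Play", "0.99", None])

# Exact-match replace: only the exact spelling "Free to Play" is converted,
# so "Free To Play" is coerced to NaN by to_numeric
exact = pd.to_numeric(prices.replace("Free to Play", "0"), errors="coerce")

# Hypothetical alternative: match the label regardless of letter case before coercing
is_free = prices.str.contains("free to play", case=False, na=False)
normalized = pd.to_numeric(prices.mask(is_free, "0"), errors="coerce")

print(pd.DataFrame({"raw": prices, "exact": exact, "normalized": normalized}))
```

With the exact match, the remaining gaps have to be filled in a later step, as the notebook does in 2.5 using "tags".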


### 2.5. Impute null "price" values with 0 when "tags" contains "Free to Play", then drop the "tags" column:

In [8]:
# Impute null values in 'price' with 0 when 'tags' contains 'Free to Play'
# (na=False treats rows with missing 'tags' as non-matching)
df_Steam_Games.loc[df_Steam_Games['price'].isnull() & df_Steam_Games['tags'].str.contains('Free to Play', na=False), 'price'] = 0

# Show the first rows of the DataFrame
df_Steam_Games[['app_name', 'price']].head()


Unnamed: 0,app_name,price
0,Lost Summoner Kitty,4.99
1,Ironbound,0.0
2,Real Pool 3D - Poolians,0.0
3,弹炸人2222,0.99
4,Log Challenge,2.99


In [9]:
# Drop the "tags" column:
df_Steam_Games.drop(columns='tags', inplace=True)

# Show the resulting DataFrame
df_Steam_Games.head(3)

Unnamed: 0,publisher,genres,app_name,release_date,specs,price,early_access,id,developer,release_year
0,Kotoshiro,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,2018-01-04,['Single-player'],4.99,False,761140.0,Kotoshiro,2018
1,"Making Fun, Inc.","['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,2018-01-04,"['Single-player', 'Multi-player', 'Online Mult...",0.0,False,643980.0,Secret Level SRL,2018
2,Poolians.com,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,2017-07-24,"['Single-player', 'Multi-player', 'Online Mult...",0.0,False,670290.0,Poolians.com,2017


### 2.6. Export the clean file to the Clean_Data folder for an EDA geared toward the model and the endpoints:

In [10]:
# Export the DataFrame to CSV
ruta_salida = 'Clean_Data/SGames_CD.csv'
df_Steam_Games.to_csv(ruta_salida, index=False)

print(f"DataFrame exported to {ruta_salida}")

DataFrame exported to Clean_Data/SGames_CD.csv


Note: From the outset, the "genres" and "specs" columns are known to hold lists that should eventually be expanded. This ETL leaves them as they are, because the recommendation model may require expanding the lists and applying one-hot encoding to these variables, and that decision is deferred until the model is designed. For the endpoints, it may be necessary to normalize the tables to avoid multiplying records (creating auxiliary "genres" and "specs" tables that relate to the game id).
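
Both options mentioned in the note can be sketched on a toy frame. This is only an illustration under assumptions: the rows are made up, `genres_table` and `one_hot` are hypothetical names, and rows with missing "genres" would have to be handled before parsing.

```python
import ast
import pandas as pd

# Toy games table: 'genres' holds list literals stored as strings, as in the CSV
games = pd.DataFrame({
    "id": [761140, 643980],
    "genres": ["['Action', 'Indie']", "['Indie', 'RPG', 'Strategy']"],
})

# Parse each string into a real Python list
games["genres"] = games["genres"].apply(ast.literal_eval)

# Auxiliary table for the endpoints: one (id, genre) row per pair
genres_table = (
    games[["id", "genres"]]
    .explode("genres")
    .rename(columns={"genres": "genre"})
    .reset_index(drop=True)
)

# One-hot encoding for the model: one row per game, one 0/1 column per genre
one_hot = pd.crosstab(genres_table["id"], genres_table["genre"])

print(genres_table)
print(one_hot)
```

Keeping the genres in a separate table leaves the main games table at one row per game, which is the record multiplication the note wants to avoid.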

### 3. Users_Items ETL:

### 3.1. Open the file and parse the nested data with ast.literal_eval:

In [11]:
# Read the CSV file
df_User_Items = pd.read_csv('Revised_Data\\User_Items.csv')

# Apply ast.literal_eval to the column holding lists of dictionaries
df_User_Items['items'] = df_User_Items['items'].apply(ast.literal_eval)

# Show the first rows of the DataFrame
df_User_Items.head()


Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."
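
After `read_csv`, each "items" cell is a plain string that merely looks like a list. The sketch below, on a single made-up row, shows why `ast.literal_eval` fits here: the cells use Python literal syntax with single quotes, which a JSON parser rejects.

```python
import ast
import json
import pandas as pd

# One toy row: the nested data arrives from the CSV as text
raw = pd.DataFrame({
    "user_id": ["u1"],
    "items": ["[{'item_id': '10', 'item_name': 'Counter-Strike', "
              "'playtime_forever': 6, 'playtime_2weeks': 0}]"],
})

# json.loads fails because JSON requires double-quoted strings
try:
    json.loads(raw.loc[0, "items"])
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval safely evaluates the Python literal into a list of dicts
raw["items"] = raw["items"].apply(ast.literal_eval)
first_item = raw.loc[0, "items"][0]

print(parsed_as_json, type(raw.loc[0, "items"]).__name__, first_item["item_name"])
```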


### 3.2. Drop the 'steam_id', 'items_count', and 'user_url' columns:

In [12]:
# Drop the unnecessary columns from the df:
columnas= ['steam_id','items_count','user_url']
df_User_Items.drop(columns=columnas, inplace=True)

# Make a copy of the df to work on:
df_User_Items_Copy = df_User_Items.copy()

# Show the resulting DataFrame
df_User_Items_Copy.head(3)

Unnamed: 0,user_id,items
0,76561197970982479,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."


### 3.3. Expand (unnest) the "items" column:

In [13]:
# Apply json_normalize to the records, expanding 'items' and keeping 'user_id':
df_expanded = pd.json_normalize(df_User_Items_Copy.to_dict('records'), 'items', ['user_id'])

# Reorder the columns (copy so that later in-place edits do not act on a slice of df_expanded):
df_User_Items_Copy2 = df_expanded[['user_id', 'item_id', 'item_name', 'playtime_forever', 'playtime_2weeks']].copy()

# Show the new DataFrame:
df_User_Items_Copy2


Unnamed: 0,user_id,item_id,item_name,playtime_forever,playtime_2weeks
0,76561197970982479,10,Counter-Strike,6,0
1,76561197970982479,20,Team Fortress Classic,0,0
2,76561197970982479,30,Day of Defeat,7,0
3,76561197970982479,40,Deathmatch Classic,0,0
4,76561197970982479,50,Half-Life: Opposing Force,0,0
...,...,...,...,...,...
5153204,76561198329548331,346330,BrainBread 2,0,0
5153205,76561198329548331,373330,All Is Dust,0,0
5153206,76561198329548331,388490,One Way To Die: Steam Edition,3,3
5153207,76561198329548331,521570,You Have 10 Seconds 2,4,4
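
The unnesting call can be followed on a small input. This is a minimal sketch with made-up records: `record_path` names the list to expand and `meta` names the parent fields to repeat on every expanded row. It also includes a user with an empty list, to show that such a user produces no rows at all.

```python
import pandas as pd

# Toy records shaped like df.to_dict('records'): one dict per user
records = [
    {"user_id": "u1", "items": [
        {"item_id": "10", "playtime_forever": 6},
        {"item_id": "20", "playtime_forever": 0},
    ]},
    {"user_id": "u2", "items": [
        {"item_id": "10", "playtime_forever": 3},
    ]},
    {"user_id": "u3", "items": []},  # user with no items
]

# One output row per element of 'items', with 'user_id' carried along
flat = pd.json_normalize(records, record_path="items", meta=["user_id"])
print(flat)
```

Because users with an empty list simply disappear from the result, comparing the number of distinct `user_id` values before and after the expansion is a useful complement to a null check.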


Check that there are no null values; any nulls would indicate an error when exploding "items":

In [14]:
# Show the NaN count per column:
print(df_User_Items_Copy2.isna().sum())
print('----------------------------------------------|')
print(df_User_Items_Copy2.info())

user_id             0
item_id             0
item_name           0
playtime_forever    0
playtime_2weeks     0
dtype: int64
----------------------------------------------|
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5153209 entries, 0 to 5153208
Data columns (total 5 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   user_id           object
 1   item_id           object
 2   item_name         object
 3   playtime_forever  int64 
 4   playtime_2weeks   int64 
dtypes: int64(2), object(3)
memory usage: 196.6+ MB
None


Drop item_name to optimize the tables:

In [15]:
df_User_Items_Copy2.drop(columns=['item_name'], inplace=True)
df_User_Items_Copy2

Unnamed: 0,user_id,item_id,playtime_forever,playtime_2weeks
0,76561197970982479,10,6,0
1,76561197970982479,20,0,0
2,76561197970982479,30,7,0
3,76561197970982479,40,0,0
4,76561197970982479,50,0,0
...,...,...,...,...
5153204,76561198329548331,346330,0,0
5153205,76561198329548331,373330,0,0
5153206,76561198329548331,388490,3,3
5153207,76561198329548331,521570,4,4


In [16]:
df_User_Items_Copy2.iloc[88297]

user_id             76561198081987102
item_id                        387340
playtime_forever                   70
playtime_2weeks                     0
Name: 88297, dtype: object

### 3.4. Export the clean file to the Clean_Data folder for an EDA geared toward the model and the endpoints:

In [17]:
# Export the DataFrame to CSV
ruta_salida = 'Clean_Data/UItems_CD.csv'
df_User_Items_Copy2.to_csv(ruta_salida, index=False)

print(f"DataFrame exported to {ruta_salida}")

DataFrame exported to Clean_Data/UItems_CD.csv


### 4. Users_Review ETL:

### 4.1. Open the file and parse the nested data with ast.literal_eval:

In [18]:
# Read the CSV file
df_User_Review = pd.read_csv('Revised_Data\\User_Review.csv')

# Apply ast.literal_eval to the column holding lists of dictionaries
df_User_Review['reviews'] = df_User_Review['reviews'].apply(ast.literal_eval)

# Show the first rows of the DataFrame
df_User_Review.head()

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


### 4.2. Drop the 'user_url' column:

In [19]:
# Drop the 'user_url' column:
df_User_Review.drop(columns=['user_url'], inplace=True)

# Make a copy of the df to work on:
df_User_Review_Copy = df_User_Review.copy()

# Show the resulting DataFrame
df_User_Review_Copy.head(3)

Unnamed: 0,user_id,reviews
0,76561197970982479,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."


### 4.3. Expand (unnest) the "reviews" column:

Find the keys that will serve as columns:

In [20]:
# Take the 'reviews' value of the first row only, to obtain its keys:
ejemplo_dato=df_User_Review_Copy['reviews'].iloc[:1]
ejemplo_dato=dict(ejemplo_dato)

# Get the keys of each dictionary in the list
claves = [list(diccionario.keys()) for diccionario in ejemplo_dato[0]]

# Print the keys
print(claves[0])


['funny', 'posted', 'last_edited', 'item_id', 'helpful', 'recommend', 'review']


Expand the data:

In [21]:
# Apply json_normalize to the records, expanding 'reviews' and keeping 'user_id':
df_expanded2 = pd.json_normalize(df_User_Review_Copy.to_dict('records'), 'reviews', ['user_id'])

# Reorder the columns (copy so that later in-place edits do not act on a slice of df_expanded2):
df_User_Review_Copy2 = df_expanded2[['user_id','funny', 'posted', 'last_edited', 'item_id', 'helpful', 'recommend', 'review']].copy()

# Show the new DataFrame:
df_User_Review_Copy2

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...,...
59300,76561198312638244,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...
59301,76561198312638244,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...
59302,LydiaMorley,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
59303,LydiaMorley,,Posted July 20.,,730,No ratings yet,True,:D


Check that there are no null values; any nulls would indicate an error when exploding "reviews":

In [22]:
# Show the NaN count per column:
print(df_User_Review_Copy2.isna().sum())
print('----------------------------------------------|')
print(df_User_Review_Copy2.info())

user_id        0
funny          0
posted         0
last_edited    0
item_id        0
helpful        0
recommend      0
review         0
dtype: int64
----------------------------------------------|
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59305 entries, 0 to 59304
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      59305 non-null  object
 1   funny        59305 non-null  object
 2   posted       59305 non-null  object
 3   last_edited  59305 non-null  object
 4   item_id      59305 non-null  object
 5   helpful      59305 non-null  object
 6   recommend    59305 non-null  bool  
 7   review       59305 non-null  object
dtypes: bool(1), object(7)
memory usage: 3.2+ MB
None


### 4.4. Clean and normalize the review text before running sentiment analysis:

In [23]:
# Download the required resources (only needs to run once)
download('punkt')
download('stopwords')
download('wordnet')

# Build the stopword set and the lemmatizer once, instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function to clean, tokenize, normalize, and lemmatize the text
def preprocess_text(text):
    # Remove special characters and convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', str(text).lower())

    # Tokenize
    words = word_tokenize(text)

    # Remove stopwords
    words = [word for word in words if word not in stop_words]

    # Lemmatize
    words = [lemmatizer.lemmatize(word) for word in words]

    # Join the processed words into a single string
    return ' '.join(words)

# Apply the preprocessing function to the 'review' column
df_User_Review_Copy2['processed_review'] = df_User_Review_Copy2['review'].apply(preprocess_text)

# Show the DataFrame with the new processed column
print(df_User_Review_Copy2[['review', 'processed_review']])


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\davin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\davin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\davin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                                  review  \
0      Simple yet with great replayability. In my opi...   
1                   It's unique and worth a playthrough.   
2      Great atmosphere. The gunplay can be a bit chu...   
3      I know what you think when you see this title ...   
4      For a simple (it's actually not all that simpl...   
...                                                  ...   
59300  a must have classic from steam definitely wort...   
59301  this game is a perfect remake of the original ...   
59302  had so much fun plaing this and collecting res...   
59303                                                 :D   
59304                                     so much fun :D   

                                        processed_review  
0      simple yet great replayability opinion zombie ...  
1                               unique worth playthrough  
2      great atmosphere gunplay bit chunky time end d...  
3      know think see title barbie dreamhou

### 4.5. Apply sentiment analysis to the new processed column "processed_review":

In [24]:
# Download the required resource (only needs to run once)
download('vader_lexicon')

# Create a SentimentIntensityAnalyzer instance
sid = SentimentIntensityAnalyzer()

# Function that assigns the sentiment label from the compound score
def analyze_sentiment(text):
    sentiment_score = sid.polarity_scores(str(text))['compound']
    
    if sentiment_score >= 0.05:
        return 2  # Positive
    elif sentiment_score <= -0.05:
        return 0  # Negative
    else:
        return 1  # Neutral

# Apply the function to the 'processed_review' column and create the new 'sentiment_analysis' column
df_User_Review_Copy2['sentiment_analysis'] = df_User_Review_Copy2['processed_review'].apply(analyze_sentiment)

# Show the resulting DataFrame
df_User_Review_Copy2


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\davin\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review,processed_review,sentiment_analysis
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,simple yet great replayability opinion zombie ...,2
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,unique worth playthrough,2
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,great atmosphere gunplay bit chunky time end d...,2
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,know think see title barbie dreamhouse party i...,2
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,simple actually simple truck driving simulator...,2
...,...,...,...,...,...,...,...,...,...,...
59300,76561198312638244,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...,must classic steam definitely worth buying,2
59301,76561198312638244,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...,game perfect remake original half life persona...,2
59302,LydiaMorley,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,much fun plaing collecting resource xd first t...,2
59303,LydiaMorley,,Posted July 20.,,730,No ratings yet,True,:D,,1
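
The labelling rule itself can be checked in isolation, without downloading the VADER lexicon. The sketch below extracts only the threshold logic into a helper; `label_from_compound` is a hypothetical name, and the scores passed to it are made-up inputs, not values computed from the dataset.

```python
def label_from_compound(compound: float) -> int:
    """Map a compound score in [-1, 1] to 0 (negative), 1 (neutral), or 2 (positive)."""
    if compound >= 0.05:
        return 2  # Positive
    if compound <= -0.05:
        return 0  # Negative
    return 1      # Neutral: scores strictly between -0.05 and 0.05

# Made-up scores covering both thresholds and the neutral band
for score in (0.6, 0.05, 0.0, -0.049, -0.05):
    print(score, label_from_compound(score))
```

Both thresholds are inclusive, so only scores strictly inside the band are labelled neutral. In the output above, the ":D" review is empty after preprocessing and ends up with label 1.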


### 4.6. Drop "review" and rename "processed_review" so that only one definitive column remains:

In [25]:
# Drop the 'review' column
df_User_Review_Copy2.drop('review', axis=1, inplace=True)

# Rename 'processed_review' to 'review'
df_User_Review_Copy2.rename(columns={'processed_review': 'review'}, inplace=True)

# Show the resulting DataFrame
df_User_Review_Copy2.head(3)


Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review,sentiment_analysis
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,simple yet great replayability opinion zombie ...,2
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,unique worth playthrough,2
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,great atmosphere gunplay bit chunky time end d...,2


### 4.7. Create a year column to be used in the endpoint queries:

In [26]:
# Create the auxiliary column 'posted_year' by extracting the four-digit year that follows the comma in 'posted'
# (dates without a year, such as "Posted July 10.", yield NaN)
df_User_Review_Copy2['posted_year'] = df_User_Review_Copy2['posted'].str.extract(r', (\d{4})')

# Drop rows with nulls in 'posted' or 'posted_year', then drop the 'posted' column:
df_User_Review_Copy2.dropna(subset=['posted', 'posted_year'], inplace=True)
df_User_Review_Copy2.drop('posted', axis=1, inplace=True)

# Show the resulting DataFrame
print(df_User_Review_Copy2[['posted_year']].head(3))


  posted_year
0        2011
1        2011
2        2011
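
The year extraction only matches dates that include a year after a comma, which is why reviews posted as "Posted July 10." drop out of the table. A minimal sketch on three sample strings in the dataset's format shows the behaviour; the conversion through `pd.to_numeric` is one possible way to reach a nullable integer column and is an assumption, not the notebook's own step.

```python
import pandas as pd

# Sample 'posted' strings in the two formats seen in the data
posted = pd.Series([
    "Posted November 5, 2011.",
    "Posted July 10.",          # no year in the text
    "Posted June 24, 2014.",
])

# Capture the four digits that follow ", "; rows without a match become NaN
years = posted.str.extract(r", (\d{4})")[0]

# Convert to a nullable integer dtype so missing years stay missing instead of becoming floats
years_int = pd.to_numeric(years, errors="coerce").astype("Int16")

print(pd.DataFrame({"posted": posted, "posted_year": years_int}))
```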


In [27]:
# Convert the year to a numeric type:
df_User_Review_Copy2['posted_year'] = df_User_Review_Copy2['posted_year'].astype('Int16')

# Check the dtype of the year column:
print(df_User_Review_Copy2.info())

# Show the updated DataFrame
df_User_Review_Copy2.head(3)

<class 'pandas.core.frame.DataFrame'>
Index: 49186 entries, 0 to 59276
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   user_id             49186 non-null  object
 1   funny               49186 non-null  object
 2   last_edited         49186 non-null  object
 3   item_id             49186 non-null  object
 4   helpful             49186 non-null  object
 5   recommend           49186 non-null  bool  
 6   review              49186 non-null  object
 7   sentiment_analysis  49186 non-null  int64 
 8   posted_year         49186 non-null  Int16 
dtypes: Int16(1), bool(1), int64(1), object(6)
memory usage: 3.2+ MB
None


Unnamed: 0,user_id,funny,last_edited,item_id,helpful,recommend,review,sentiment_analysis,posted_year
0,76561197970982479,,,1250,No ratings yet,True,simple yet great replayability opinion zombie ...,2,2011
1,76561197970982479,,,22200,No ratings yet,True,unique worth playthrough,2,2011
2,76561197970982479,,,43110,No ratings yet,True,great atmosphere gunplay bit chunky time end d...,2,2011


### 4.8. Drop "last_edited":

In [28]:
df_User_Review_Copy2.drop(columns=['last_edited'], inplace=True)

### 4.9. Drop "review", since its signal is already captured by the sentiment analysis, and drop "helpful" and "funny", since these columns will not be used:

In [29]:
df_User_Review_Copy2.drop(columns=['review', 'helpful', 'funny'], inplace=True)

### 4.10. Export the clean file to the Clean_Data folder for an EDA geared toward the model and the endpoints:

In [30]:
# Export the DataFrame to CSV
ruta_salida = 'Clean_Data/UReviews_CD.csv'
df_User_Review_Copy2.to_csv(ruta_salida, index=False)

print(f"DataFrame exported to {ruta_salida}")

DataFrame exported to Clean_Data/UReviews_CD.csv
