# ETL (Extract Transform Load)

Extract, Transform, Load.
ETL is a process that moves data from a source (Extract), modifies that data as needed (Transform), and loads the resulting data into the desired destination (Load).
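
The three stages can be sketched as three small functions. This is only an illustration with made-up data: the function names `extract`, `transform`, and `load`, the `SOURCE` text, and the list used as a destination are assumptions for the example, not part of this notebook's pipeline.

```python
import csv
import io

# Hypothetical source: a tiny CSV held in memory
SOURCE = "item_id,playtime_forever\n10,6\n20,0\n"

def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast the fields to int and keep rows with some playtime."""
    typed = [
        {"item_id": int(r["item_id"]), "playtime_forever": int(r["playtime_forever"])}
        for r in rows
    ]
    return [r for r in typed if r["playtime_forever"] > 0]

def load(rows, destination):
    """Load: append the cleaned rows to the destination (a list here)."""
    destination.extend(rows)
    return destination

warehouse = load(transform(extract(SOURCE)), [])
print(warehouse)
```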

***This file contains all the required transformations and is the continuation of the ETL file.*** 
***Items file***

## 3. Data preparation

### 3.1 Importing libraries

In [1]:
import pandas as pd  # type: ignore
import numpy as np # type: ignore
import seaborn as sns  # type: ignore
import matplotlib.pyplot as plt # type: ignore
import ast
import warnings 
warnings.filterwarnings('ignore')

### 3.2 Initial data load

In [2]:
def load_json_lines(file_path):
    data = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            data.append(ast.literal_eval(line))
    return pd.DataFrame(data)
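
The loader parses each line with `ast.literal_eval` rather than `json.loads`. A likely reason, assuming the lines are written as Python literals with single quotes (as the later output suggests), is that such lines are not valid JSON. The sample line below is made up for the illustration.

```python
import ast
import json

# Hypothetical line in the style of the source file (single-quoted keys and values)
line = "{'user_id': 'js41637', 'items_count': 1, 'items': [{'item_id': '10', 'playtime_forever': 6}]}"

try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings, so this line is rejected
    parsed_as_json = False

# literal_eval safely evaluates Python literals (dicts, lists, strings, numbers)
record = ast.literal_eval(line)
print(parsed_as_json, record["items"][0]["item_id"])
```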

In [3]:
# Load the file
df = load_json_lines("Data/australian_users_items.json")

In [4]:
# Keep a restart point
df_items = df
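
A plain assignment only gives the same DataFrame a second name. It works as a restart point here because the later steps rebind `df_items` to new objects instead of modifying it in place. If in-place edits were ever added, `DataFrame.copy()` would be the safer choice. A minimal sketch with toy data (the names `alias` and `snapshot` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

alias = df            # same object under a second name
snapshot = df.copy()  # independent copy of the data

# An in-place change through the alias is visible in df, but not in the copy
alias.loc[0, "a"] = 99
print(df.loc[0, "a"], snapshot.loc[0, "a"])
```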

### 3.3 Data preparation

***Thanks to the EDA we already know our data, so we start by unnesting the items column.*** 

In [5]:
# Use explode so that each element of the list gets its own row (each element is still a nested dict at this point)
df_items = df_items.explode('items').reset_index()
df_items = pd.concat([df_items.drop(columns="items"), pd.json_normalize(df_items["items"])], axis=1)
df_items.head(4)

Unnamed: 0,index,user_id,items_count,steam_id,user_url,item_id,item_name,playtime_forever,playtime_2weeks
0,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,10,Counter-Strike,6.0,0.0
1,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,20,Team Fortress Classic,0.0,0.0
2,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,30,Day of Defeat,7.0,0.0
3,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,40,Deathmatch Classic,0.0,0.0
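
The two unnesting steps are easier to follow on a toy frame. This is a sketch with made-up records: `explode` turns each list element into its own row, and `pd.json_normalize` spreads the dict keys into columns. Passing a list of dicts via `.tolist()` is a choice made for this example.

```python
import pandas as pd

# Hypothetical miniature of the users/items structure
toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "items": [
        [{"item_id": "10", "playtime_forever": 6}, {"item_id": "20", "playtime_forever": 0}],
        [{"item_id": "30", "playtime_forever": 7}],
    ],
})

# Step 1: one row per list element (each element is still a dict)
exploded = toy.explode("items").reset_index(drop=True)

# Step 2: expand the dict keys into columns and attach them to the user columns
flat = pd.concat(
    [exploded.drop(columns="items"), pd.json_normalize(exploded["items"].tolist())],
    axis=1,
)
print(flat)
```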


In [12]:
# Check the shape of the resulting dataset
df_items.shape

(5153209, 9)

Now we check whether there are missing or duplicated values.

In [3]:
def null(df, decimales=2):
    """
    Report how many missing values each column of the dataset has
    """
    df_nulos = pd.DataFrame({
        "Numeros de nulos" : df.isnull().sum(),
        "Porcentaje de nulos" : (df.isnull().sum() / df.shape[0]) * 100.0
    })
    df_nulos['Porcentaje de nulos'] = df_nulos['Porcentaje de nulos'].round(decimales).astype(str) + "%"
    return df_nulos

In [7]:
null(df_items, decimales=2)

Unnamed: 0,Numeros de nulos,Porcentaje de nulos
index,0,0.0%
user_id,0,0.0%
items_count,0,0.0%
steam_id,0,0.0%
user_url,0,0.0%
item_id,16806,0.33%
item_name,16806,0.33%
playtime_forever,16806,0.33%
playtime_2weeks,16806,0.33%
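
The percentage column can also be obtained directly, because the mean of a boolean mask is the fraction of `True` values. A small sketch on toy data (not the notebook's dataset):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, None, 3, 4], "b": ["x", "y", "z", "w"]})

# isnull() gives a boolean frame; its column-wise mean is the share of nulls
pct_nulls = (toy.isnull().mean() * 100).round(2)
print(pct_nulls)
```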


In [4]:
def suma_duplicados(df, decimales=2):
    # Select the rows that are exact duplicates of an earlier row
    duplicados = df[df.duplicated()]
    
    # Dictionaries that collect the results per column
    num_duplicados = {}
    porcentaje_duplicados = {}
    
    # Iterate over every column
    for col in df.columns:
        # Count the non-null values of this column among the duplicated rows
        num_duplicados[col] = duplicados[col].count()
        
        # Express that count as a percentage of all rows
        porcentaje_duplicados[col] = (num_duplicados[col] / df.shape[0]) * 100.0
    
    # Build a DataFrame with the results
    df_duplicados = pd.DataFrame({
        "Numero de Duplicados": pd.Series(num_duplicados),
        "Porcentaje de Duplicados": pd.Series(porcentaje_duplicados)
    })
    
    # Convert the percentage to a string and append the percent sign
    df_duplicados["Porcentaje de Duplicados"] = df_duplicados["Porcentaje de Duplicados"].round(decimales).astype(str) + "%"
        
    return df_duplicados

In [10]:
suma_duplicados(df_items)

Unnamed: 0,Numero de Duplicados,Porcentaje de Duplicados
index,0,0.0%
user_id,0,0.0%
items_count,0,0.0%
steam_id,0,0.0%
user_url,0,0.0%
item_id,0,0.0%
item_name,0,0.0%
playtime_forever,0,0.0%
playtime_2weeks,0,0.0%
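
Since `duplicated()` flags whole rows, the per-column counts reported above are the same number repeated whenever the duplicated rows contain no nulls. A shorter check, sketched on toy data, is to sum the boolean mask once:

```python
import pandas as pd

toy = pd.DataFrame({"user_id": ["a", "a", "b"], "item_id": [1, 1, 2]})

# duplicated() marks every repeat of an earlier row (keep='first' by default)
mask = toy.duplicated()
n_duplicates = int(mask.sum())
pct_duplicates = round(n_duplicates / len(toy) * 100, 2)
print(n_duplicates, pct_duplicates)
```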


In [11]:
# Drop the rows whose item fields are missing (NaN)
df_items = df_items.dropna(subset=['item_id', 'item_name', 'playtime_forever', 'playtime_2weeks'])

In [15]:
df_items.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5153209 entries, 0 to 5170013
Data columns (total 9 columns):
 #   Column            Dtype  
---  ------            -----  
 0   index             int64  
 1   user_id           object 
 2   items_count       int64  
 3   steam_id          object 
 4   user_url          object 
 5   item_id           object 
 6   item_name         object 
 7   playtime_forever  float64
 8   playtime_2weeks   float64
dtypes: float64(2), int64(2), object(5)
memory usage: 393.2+ MB


Next we apply some transformations.

In [27]:
df_items = df_items.drop(columns="index")

In [28]:
convert_dict ={
    'user_id'           : 'string',
    'items_count'       : 'int32',
    'steam_id'          : 'int64',
    'user_url'          : 'string',
    'item_id'           : 'int64',
    'item_name'         : 'string',
    'playtime_forever'  : 'int32',
    'playtime_2weeks'   : 'int32'
}
df_items = df_items.astype(convert_dict)
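
The order of the steps matters here: a float column that still contains `NaN` cannot be cast to a NumPy integer type, so the null rows have to be removed before `astype`. A minimal sketch on a toy Series:

```python
import pandas as pd

playtime = pd.Series([6.0, None])  # float64 with one missing value

try:
    playtime.astype("int32")
    cast_failed = False
except ValueError:
    # pandas refuses to convert NaN to an integer
    cast_failed = True

# After dropping the missing value the cast succeeds
cleaned = playtime.dropna().astype("int32")
print(cast_failed, cleaned.tolist())
```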

Some of the endpoints we will build report the most hours played, so we drop every row where both playtime values are zero.

In [29]:
# Keep the rows where at least one of 'playtime_forever' and 'playtime_2weeks' is not 0
df_items = df_items[~(df_items[['playtime_forever', 'playtime_2weeks']] ==0).all(axis=1)]
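
The filter can be read in two steps: build a mask that is `True` only when both columns are zero, then negate it. A sketch on toy data (the variable names `both_zero` and `kept` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "playtime_forever": [0, 5, 0],
    "playtime_2weeks":  [0, 0, 3],
})

# True only for rows where every selected column equals 0
both_zero = (toy[["playtime_forever", "playtime_2weeks"]] == 0).all(axis=1)

# ~ negates the mask, so rows with any non-zero playtime are kept
kept = toy[~both_zero]
print(kept)
```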

In [30]:
# Check how many rows are left
print(f"{df_items.shape[0]} rows remain")

3285249 rows remain


In [32]:
# Check for duplicates now that the index column has been dropped
suma_duplicados(df_items)

Unnamed: 0,Numero de Duplicados,Porcentaje de Duplicados
user_id,38871,1.18%
items_count,38871,1.18%
steam_id,38871,1.18%
user_url,38871,1.18%
item_id,38871,1.18%
item_name,38871,1.18%
playtime_forever,38871,1.18%
playtime_2weeks,38871,1.18%
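
Duplicates only show up at this point, presumably because the `index` column (the position of the original user row) differed between otherwise identical rows. The effect can be reproduced on a toy frame:

```python
import pandas as pd

# Two rows that differ only in the helper "index" column
toy = pd.DataFrame({
    "index": [0, 1],
    "user_id": ["a", "a"],
    "item_id": [10, 10],
})

dups_with_index = int(toy.duplicated().sum())
dups_without_index = int(toy.drop(columns="index").duplicated().sum())
print(dups_with_index, dups_without_index)
```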


In [36]:
df_items = df_items.drop_duplicates()

In [37]:
# Verify that no duplicates remain
suma_duplicados(df_items)

Unnamed: 0,Numero de Duplicados,Porcentaje de Duplicados
user_id,0,0.0%
items_count,0,0.0%
steam_id,0,0.0%
user_url,0,0.0%
item_id,0,0.0%
item_name,0,0.0%
playtime_forever,0,0.0%
playtime_2weeks,0,0.0%


In [38]:
df_items.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3246378 entries, 0 to 5170013
Data columns (total 8 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   user_id           string
 1   items_count       int32 
 2   steam_id          int64 
 3   user_url          string
 4   item_id           int64 
 5   item_name         string
 6   playtime_forever  int32 
 7   playtime_2weeks   int32 
dtypes: int32(3), int64(2), string(3)
memory usage: 185.8 MB


In [39]:
# Export the dataset to Parquet and CSV for SQL.
# Parquet 
#df_items.to_parquet("steam_items.parquet")
# CSV
#df_items.to_csv("steam_items.csv", index=False)

***Reviews file***

### 2.4 Initial data load

In [5]:
# Load the file
df = load_json_lines("Data/australian_user_reviews.json")

In [6]:
# Keep a restart point
df_reviews = df

### 2.5 Data preparation

***Thanks to the EDA we already know our data, so we start by unnesting the reviews column.*** 

In [7]:
df_reviews.shape

(25799, 3)

In [8]:
# Look at the dataset before unnesting
df_reviews.head(3)

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."


In [9]:
# Unnest the reviews column
df_reviews = df_reviews.explode("reviews").reset_index()
df_reviews = pd.concat([df_reviews.drop(columns="reviews"), df_reviews["reviews"].apply(pd.Series)], axis=1)
# Show the dataset after unnesting
df_reviews

Unnamed: 0,index,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0
0,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,
1,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,
2,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,
3,1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,
4,1,js41637,http://steamcommunity.com/id/js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,
...,...,...,...,...,...,...,...,...,...,...,...
59328,25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...,
59329,25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...,
59330,25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,
59331,25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,,Posted July 20.,,730,No ratings yet,True,:D,
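
Rows with no review data are what `explode` produces for users whose review list is empty, which is the likely origin of the null rows and of the extra all-null column `0` in this output. An alternative sketch, on toy data, drops those rows before expanding the dicts. It uses `pd.json_normalize` in place of `apply(pd.Series)`, which is a substitution made for this example, not the notebook's method.

```python
import pandas as pd

# Hypothetical miniature: the second user has no reviews
toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reviews": [[{"item_id": "1250", "recommend": True}], []],
})

# An empty list becomes a single NaN row after explode
exploded = toy.explode("reviews").reset_index(drop=True)
n_missing = int(exploded["reviews"].isna().sum())

# Drop those rows first, then expand the remaining dicts into columns
with_reviews = exploded.dropna(subset=["reviews"]).reset_index(drop=True)
flat = pd.concat(
    [with_reviews.drop(columns="reviews"), pd.json_normalize(with_reviews["reviews"].tolist())],
    axis=1,
)
print(n_missing, list(flat.columns))
```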


In [10]:
# Inspect the data types
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59333 entries, 0 to 59332
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   index        59333 non-null  int64  
 1   user_id      59333 non-null  object 
 2   user_url     59333 non-null  object 
 3   funny        59305 non-null  object 
 4   posted       59305 non-null  object 
 5   last_edited  59305 non-null  object 
 6   item_id      59305 non-null  object 
 7   helpful      59305 non-null  object 
 8   recommend    59305 non-null  object 
 9   review       59305 non-null  object 
 10  0            0 non-null      float64
dtypes: float64(1), int64(1), object(9)
memory usage: 5.0+ MB


In [11]:
# Check for missing values
null(df_reviews, decimales=2)

Unnamed: 0,Numeros de nulos,Porcentaje de nulos
index,0,0.0%
user_id,0,0.0%
user_url,0,0.0%
funny,28,0.05%
posted,28,0.05%
last_edited,28,0.05%
item_id,28,0.05%
helpful,28,0.05%
recommend,28,0.05%
review,28,0.05%


In [12]:
suma_duplicados(df_reviews)

Unnamed: 0,Numero de Duplicados,Porcentaje de Duplicados
index,0,0.0%
user_id,0,0.0%
user_url,0,0.0%
funny,0,0.0%
posted,0,0.0%
last_edited,0,0.0%
item_id,0,0.0%
helpful,0,0.0%
recommend,0,0.0%
review,0,0.0%


We transform the data by dropping the unnecessary columns and the rows with nulls.

In [13]:
# Drop the unnecessary columns
df_reviews = df_reviews.drop(columns=[ "index","funny", "last_edited", "helpful", 0])

In [14]:
# Drop the rows with missing values
df_reviews = df_reviews.dropna(subset=['recommend', 'review', 'posted', 'item_id'])

In [15]:
# Check again for nulls.
null(df_reviews, decimales=2)

Unnamed: 0,Numeros de nulos,Porcentaje de nulos
user_id,0,0.0%
user_url,0,0.0%
posted,0,0.0%
item_id,0,0.0%
recommend,0,0.0%
review,0,0.0%


In [16]:
# Check again for duplicates.
suma_duplicados(df_reviews)

Unnamed: 0,Numero de Duplicados,Porcentaje de Duplicados
user_id,874,1.47%
user_url,874,1.47%
posted,874,1.47%
item_id,874,1.47%
recommend,874,1.47%
review,874,1.47%


In [17]:
# Drop the duplicates
df_reviews = df_reviews.drop_duplicates()

In [18]:
df_reviews.head(4)

Unnamed: 0,user_id,user_url,posted,item_id,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.
2,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,http://steamcommunity.com/id/js41637,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...


In [19]:
# Inspect the data types before converting them
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58431 entries, 0 to 59332
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_id    58431 non-null  object
 1   user_url   58431 non-null  object
 2   posted     58431 non-null  object
 3   item_id    58431 non-null  object
 4   recommend  58431 non-null  object
 5   review     58431 non-null  object
dtypes: object(6)
memory usage: 3.1+ MB


In [20]:
df_reviews.columns

Index(['user_id', 'user_url', 'posted', 'item_id', 'recommend', 'review'], dtype='object')

In [21]:
convert_dict ={
    'user_id'           : 'string',
    'user_url'          : 'string',
    'posted'            : 'string',
    'item_id'           : 'int64',
    'recommend'          : 'bool',
    'review'            : 'string'
}
df_reviews = df_reviews.astype(convert_dict)

In [22]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58431 entries, 0 to 59332
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_id    58431 non-null  string
 1   user_url   58431 non-null  string
 2   posted     58431 non-null  string
 3   item_id    58431 non-null  int64 
 4   recommend  58431 non-null  bool  
 5   review     58431 non-null  string
dtypes: bool(1), int64(1), string(4)
memory usage: 2.7 MB


***Data preprocessing***
First we preprocess the text so that the analysis is accurate and effective.

In [24]:
# Import the libraries
import nltk 
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the required resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...


True

In [29]:
lema = WordNetLemmatizer()
# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

# Preprocessing function
def proceso(text):
    text = text.lower() # Convert to lowercase
    words = word_tokenize(text) # Tokenize
    # Remove stop words and non-alphanumeric tokens, then lemmatize
    words = [lema.lemmatize(word) for word in words if word not in stop_words and word.isalnum()]
    return ' '.join(words) # Join the tokens back into a single string instead of returning a list
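
The order of the steps (lowercase, tokenize, filter, join) can be tried without downloading any NLTK resources. The sketch below is a simplified stand-in, not the function above: it uses a regular expression instead of `word_tokenize`, a tiny hypothetical stop-word set instead of NLTK's English list, and it skips lemmatization.

```python
import re

# Hypothetical, tiny stop-word set; NLTK's English list is much longer
STOP_WORDS = {"a", "and", "it", "s", "the", "is"}

def proceso_sketch(text):
    # Lowercase, then take runs of letters/digits as crude tokens
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop the stop words
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Join the surviving tokens back into one string
    return " ".join(kept)

cleaned = proceso_sketch("It's unique and worth a playthrough.")
print(cleaned)  # unique worth playthrough
```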

In [35]:
# Apply the preprocessing
df_reviews['Processed_review'] = df_reviews['review'].apply(proceso)

In [37]:
# Compare the column before and after preprocessing
df_reviews[['review', 'Processed_review']].head(4)

Unnamed: 0,review,Processed_review
0,Simple yet with great replayability. In my opi...,simple yet great replayability opinion zombie ...
1,It's unique and worth a playthrough.,unique worth playthrough
2,Great atmosphere. The gunplay can be a bit chu...,great atmosphere gunplay bit chunky time end d...
3,I know what you think when you see this title ...,know think see title barbie dreamhouse party i...


In [45]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58431 entries, 0 to 59332
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   user_id           58431 non-null  string
 1   user_url          58431 non-null  string
 2   posted            58431 non-null  string
 3   item_id           58431 non-null  int64 
 4   recommend         58431 non-null  bool  
 5   review            58431 non-null  string
 6   Processed_review  58431 non-null  object
dtypes: bool(1), int64(1), object(1), string(4)
memory usage: 3.2+ MB


In [46]:
# Cast the column to str
df_reviews['Processed_review'] = df_reviews['Processed_review'].astype(str)

***NLP - Natural Language Processing***
We create a new column with the sentiment analysis.

In [39]:
# Import the libraries
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [49]:
# Class that wraps the sentiment analysis
class Sentiment:
    def __init__(self):
        self.sia = SentimentIntensityAnalyzer()
    
    def analysis(self, text):
        if not text: # A missing or empty review counts as neutral
            return 1 # Neutral
        score = self.sia.polarity_scores(text)['compound']
        return self.sentiment_category(score)
    
    def sentiment_category(self, score):
        if score <= -0.05:
            return 0 # Negative
        elif -0.05 < score < 0.05:
            return 1 # Neutral
        else:
            return 2 # Positive
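
The mapping from the compound score to the three labels can be checked on its own, without VADER. The standalone function below copies the thresholds of `sentiment_category`, and the sample scores are made up to exercise the boundaries.

```python
def sentiment_category(score):
    """Map a compound score in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive)."""
    if score <= -0.05:
        return 0
    elif score < 0.05:
        return 1
    else:
        return 2

# Hypothetical scores, including both boundary values
scores = [-0.6, -0.05, 0.0, 0.049, 0.05, 0.8]
labels = [sentiment_category(s) for s in scores]
print(labels)  # [0, 0, 1, 1, 2, 2]
```

Both boundaries are inclusive on the outer side: exactly -0.05 is labelled negative and exactly 0.05 is labelled positive.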

In [50]:
# Instantiate the class
sentiment_analyzer = Sentiment()
# Apply it to the processed reviews
df_reviews['sentiment_analysis'] = df_reviews['Processed_review'].apply(sentiment_analyzer.analysis)

In [55]:
# Result
df_reviews.head(4)

Unnamed: 0,user_id,user_url,posted,item_id,recommend,sentiment_analysis
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted November 5, 2011.",1250,True,2
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted July 15, 2011.",22200,True,2
2,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted April 21, 2011.",43110,True,2
3,js41637,http://steamcommunity.com/id/js41637,"Posted June 24, 2014.",251610,True,2


In [53]:
# Drop the review text columns
df_reviews = df_reviews.drop(columns=['review', 'Processed_review'])

In [54]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58431 entries, 0 to 59332
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   user_id             58431 non-null  string
 1   user_url            58431 non-null  string
 2   posted              58431 non-null  string
 3   item_id             58431 non-null  int64 
 4   recommend           58431 non-null  bool  
 5   sentiment_analysis  58431 non-null  int64 
dtypes: bool(1), int64(2), string(3)
memory usage: 2.7 MB


In [56]:
# Export the dataset to Parquet and CSV for SQL.
# Parquet 
# df_reviews.to_parquet("steam_reviews.parquet")
# CSV
# df_reviews.to_csv("steam_reviews.csv", index=False)