# ETL USER REVIEWS

## Loading the dataset

In [2]:
import pandas as pd
import ast

# Build the DataFrame: each line of the file is parsed as a Python literal (a dict)
rows = []
with open("Datasets OPS\\australian_user_reviews.json", encoding='MacRoman') as f:
    for line in f:
        rows.append(ast.literal_eval(line))

df = pd.DataFrame(rows)
df.head()

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
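
The cell above parses each line with `ast.literal_eval` rather than `json.loads`. The preview shows single-quoted keys and strings, which are valid Python literals but not valid JSON, so that is presumably the reason. A minimal sketch of the difference, using a made-up sample line (its contents are an assumption, not copied from the file):

```python
import ast
import json

# Hypothetical line in the same style as the dataset preview:
# single quotes and a bare True are Python syntax, not JSON.
line = "{'user_id': 'js41637', 'reviews': [{'funny': '', 'recommend': True}]}"

try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# literal_eval only evaluates literals (dicts, lists, strings, numbers, booleans),
# so it does not execute arbitrary code the way eval() would.
record = ast.literal_eval(line)
print(parsed_as_json, record["reviews"][0]["recommend"])
```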


## Data transformation

#### Split out the reviews: generate one row per review in the `reviews` column, then expand each review's fields into columns and concatenate them with `user_id`.

In [3]:
df_reviws = df.explode('reviews', ignore_index=True)
df_reviws = pd.concat([df_reviws['user_id'], df_reviws['reviews'].apply(pd.Series)], axis=1)
df_reviws.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59333 entries, 0 to 59332
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_id      59333 non-null  object 
 1   funny        59305 non-null  object 
 2   posted       59305 non-null  object 
 3   last_edited  59305 non-null  object 
 4   item_id      59305 non-null  object 
 5   helpful      59305 non-null  object 
 6   recommend    59305 non-null  object 
 7   review       59305 non-null  object 
 8   0            0 non-null      float64
dtypes: float64(1), object(8)
memory usage: 4.1+ MB
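
The all-null column `0` and the 28 rows without review fields most likely come from users whose `reviews` list is empty: `explode` turns an empty list into a single `NaN` row, and expanding that `NaN` does not produce the review keys. A minimal sketch on toy data (the rows below are invented for illustration) that drops those rows before expanding, so no placeholder column appears:

```python
import pandas as pd

# Toy data: user "b" has no reviews at all.
toy = pd.DataFrame({
    "user_id": ["a", "b"],
    "reviews": [
        [{"item_id": "10", "recommend": True},
         {"item_id": "20", "recommend": False}],
        [],
    ],
})

# One row per review; the empty list becomes a single NaN row.
exploded_raw = toy.explode("reviews", ignore_index=True)

# Drop users without reviews, then build the columns from the list of dicts.
exploded = exploded_raw.dropna(subset=["reviews"]).reset_index(drop=True)
expanded = pd.concat(
    [exploded["user_id"], pd.DataFrame(exploded["reviews"].tolist())],
    axis=1,
)
print(expanded)
```

Building the frame from a list of dicts also avoids calling `pd.Series` once per row, which tends to be slow on tens of thousands of rows.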


#### Drop fully null rows and columns, and irrelevant data.

In [4]:
df_reviws.dropna(subset=['funny','posted', 'last_edited','item_id','helpful','recommend','review'] ,how='all', inplace=True)
df_reviws.dropna(axis=1 ,how='all', inplace=True)
df_reviws.dropna(axis=0 ,how='all', inplace=True)
df_reviws

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...,...
59328,76561198312638244,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...
59329,76561198312638244,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...
59330,LydiaMorley,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
59331,LydiaMorley,,Posted July 20.,,730,No ratings yet,True,:D
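
As a small illustration of how `how='all'` interacts with `subset` in the cell above, here is a sketch on invented toy data: a row is dropped only when every listed column is null, and a column is dropped only when all of its values are null.

```python
import numpy as np
import pandas as pd

# Toy frame: row "b" has no review fields; column 0 is entirely null.
toy = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "item_id": ["10", np.nan, "30"],
    "review": ["fun", np.nan, np.nan],
    0: [np.nan, np.nan, np.nan],
})

# Row "b" is dropped (all subset columns null); row "c" stays (item_id is set).
cleaned = toy.dropna(subset=["item_id", "review"], how="all")

# Column 0 is dropped because every value in it is null.
cleaned = cleaned.dropna(axis=1, how="all")
print(cleaned)
```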


#### Drop columns that add no information

In [5]:
# Criterion: a column is kept only if the requested functions use it.
df_reviws.drop(columns=['funny','posted','last_edited','helpful'], inplace=True)
df_reviws.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59305 entries, 0 to 59332
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_id    59305 non-null  object
 1   item_id    59305 non-null  object
 2   recommend  59305 non-null  object
 3   review     59305 non-null  object
dtypes: object(4)
memory usage: 2.3+ MB


## Sentiment analysis

### I will use the nltk library

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # VADER is lexicon- and rule-based, so it is ready to use without training.

# Download the required resources (SentimentIntensityAnalyzer itself only needs 'vader_lexicon')
nltk.download(['vader_lexicon', 'stopwords', 'punkt', 'names'])

# Create a sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

#### Add the requested score to the dataset

In [8]:
# The analyzer returns {'neg': value, 'neu': value, 'pos': value, 'compound': value}. From what I read, compound is the overall score for the whole text and ranges from -1 to 1.

# I use compound to choose the category: negative (x < -0.5), neutral (-0.5 <= x <= 0.5) or positive (x > 0.5), by rounding it to an integer (-1, 0, 1). Python's round() sends exact halves to the nearest even integer, so -0.5 and 0.5 both count as neutral. I then add 1 so the result matches the encoding requested in the exercise: negative 0, neutral 1, positive 2.

df_reviws['sentiment_analysis'] = df_reviws['review'].map(lambda x: int(round(analyzer.polarity_scores(x)['compound'],0) + 1))
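
The rounding trick is compact but hides the thresholds. A minimal sketch of the same mapping with the cutoffs written out; `rounded_label` and `compound_to_label` are hypothetical helper names introduced only for this illustration, and the sample scores are made up rather than produced by VADER:

```python
def rounded_label(compound: float) -> int:
    """Mapping used in the cell above: round to -1/0/1, then shift to 0/1/2."""
    return int(round(compound, 0) + 1)


def compound_to_label(compound: float) -> int:
    """Same mapping with the thresholds spelled out."""
    if compound < -0.5:
        return 0  # negative
    if compound > 0.5:
        return 2  # positive
    return 1  # neutral, including exactly -0.5 and 0.5


# Invented compound scores covering both boundaries.
samples = [-1.0, -0.51, -0.5, 0.0, 0.5, 0.51, 1.0]
labels = [compound_to_label(c) for c in samples]
print(labels)
```

Either function could be applied to `analyzer.polarity_scores(text)['compound']`; the explicit version makes it easier to move the cutoffs later if a band of ±0.5 turns out to be too wide.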

## Final adjustments

In [9]:
df_reviws.drop(columns='review', inplace=True)
df_reviws

Unnamed: 0,user_id,item_id,recommend,sentiment_analysis
0,76561197970982479,1250,True,2
1,76561197970982479,22200,True,1
2,76561197970982479,43110,True,2
3,js41637,251610,True,2
4,js41637,227300,True,2
...,...,...,...,...
59328,76561198312638244,70,True,2
59329,76561198312638244,362890,True,2
59330,LydiaMorley,273110,True,2
59331,LydiaMorley,730,True,2


#### Export the dataset for later use in the API

In [10]:
df_reviws.to_json('../Datasets/User_Reviews_Limpio.json.gz', compression='gzip')
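
When the API loads this file, note that `pd.read_json` infers dtypes by default, so ID strings made only of digits may come back as numbers. A minimal round-trip sketch on invented toy rows, written to a temporary path (not the real output path) and read back with `dtype=False`, which is meant to switch that inference off:

```python
import os
import tempfile

import pandas as pd

# Toy rows in the shape of the cleaned dataset.
toy = pd.DataFrame({
    "user_id": ["76561197970982479", "js41637"],
    "item_id": ["1250", "251610"],
    "recommend": [True, True],
    "sentiment_analysis": [2, 1],
})

# Temporary file, so the sketch does not touch the real export.
path = os.path.join(tempfile.mkdtemp(), "toy_reviews.json.gz")
toy.to_json(path, compression="gzip")

# dtype=False asks pandas not to infer column dtypes on load.
restored = pd.read_json(path, compression="gzip", dtype=False)
print(restored)
```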