### Import the libraries needed to start the ETL process

In [31]:
import pandas as pd
import ast
import json
import numpy as np
from ast import literal_eval
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob
import herramientas
import warnings
warnings.filterwarnings("ignore")

***dataset: australian_user_reviews***


#### Data extraction and initial exploration

The data is read from the file 'australian_user_reviews.json', converted into a DataFrame, and its contents are inspected.

In [32]:
# List that stores the dictionary parsed from each line
review = []

# Path to the JSON file
file_path = 'australian_user_reviews.json'

# Open the file and process it line by line
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        try:
            # Use ast.literal_eval to convert the line into a dictionary
            json_data = ast.literal_eval(line)
            review.append(json_data)
        except (ValueError, SyntaxError):
            # literal_eval raises SyntaxError as well as ValueError on malformed lines
            print(f"Error en la línea: {line}")
            continue

# Create a DataFrame from the list of dictionaries
df_reviews = pd.DataFrame(review)
df_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
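
The loading cell relies on `ast.literal_eval` rather than `json.loads`. A minimal sketch of why that matters, assuming the lines of the file look like the records displayed above (single-quoted keys and Python's `True` literal); the sample line is made up for illustration:

```python
import ast
import json

# Hypothetical line in the style of the file: single quotes and a Python boolean
line = "{'user_id': 'js41637', 'reviews': [{'item_id': '251610', 'recommend': True}]}"

# Strict JSON requires double-quoted strings, so json.loads rejects this line
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval reads the line as a Python literal instead
record = ast.literal_eval(line)

print(parsed_as_json)
print(record['reviews'][0]['recommend'])
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts, booleans, `None`), so it does not execute arbitrary code from the file.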


The data types of each column and the number of nulls are reviewed.

In [33]:
herramientas.verifica_tipo_y_nulos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews,[<class 'list'>],100.0,0.0,0
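
The source of `herramientas.verifica_tipo_y_nulos` is not shown in this notebook. The following is a rough sketch of a function that would produce a summary with the same columns; the name `resumen_tipos_y_nulos` and its internals are assumptions, not the author's implementation:

```python
import pandas as pd

def resumen_tipos_y_nulos(df):
    """Hypothetical stand-in for herramientas.verifica_tipo_y_nulos."""
    filas = []
    for col in df.columns:
        nulos = int(df[col].isna().sum())
        # Guard against division by zero on an empty DataFrame
        pct_nulos = round(nulos / len(df) * 100, 2) if len(df) else 0.0
        filas.append({
            'nombre_campo': col,
            # Python types actually present in the column (NaN shows up as float)
            'tipo_datos': list(df[col].apply(type).unique()),
            'no_nulos_%': round(100 - pct_nulos, 2),
            'nulos_%': pct_nulos,
            'nulos': nulos,
        })
    return pd.DataFrame(filas)

# Toy data: one missing review out of four rows
df = pd.DataFrame({
    'user_id': ['a', 'b', 'c', 'd'],
    'review': ['ok', float('nan'), 'bad', 'fine'],
})
resumen = resumen_tipos_y_nulos(df)
print(resumen)
```

Listing the element types, rather than the column dtype, is what makes mixed columns such as `str` plus `float` (NaN) visible.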


We check whether the 'user_id' column contains duplicates.

In [34]:
filas_duplicadas = herramientas.verifica_duplicados_por_columna(df_reviews, 'user_id')
filas_duplicadas

Unnamed: 0,user_id,user_url,reviews
12888,05041129,http://steamcommunity.com/id/05041129,"[{'funny': '', 'posted': 'Posted May 18, 2015...."
5250,05041129,http://steamcommunity.com/id/05041129,"[{'funny': '', 'posted': 'Posted May 18, 2015...."
3133,111222333444555666888,http://steamcommunity.com/id/11122233344455566...,"[{'funny': '', 'posted': 'Posted December 22, ..."
3134,111222333444555666888,http://steamcommunity.com/id/11122233344455566...,"[{'funny': '', 'posted': 'Posted December 22, ..."
4139,29123,http://steamcommunity.com/id/29123,"[{'funny': '', 'posted': 'Posted March 26.', '..."
...,...,...,...
2721,xXAussieRockXx,http://steamcommunity.com/id/xXAussieRockXx,"[{'funny': '', 'posted': 'Posted July 17, 2015..."
2680,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
17916,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
5855,zeroblade,http://steamcommunity.com/id/zeroblade,"[{'funny': '', 'posted': 'Posted November 30, ..."
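
The helper `herramientas.verifica_duplicados_por_columna` is also not shown. A minimal sketch of a function with the same observable behaviour (duplicated rows sorted by the column, or the message 'No hay duplicados'); the name `duplicados_por_columna` and the implementation are assumptions:

```python
import pandas as pd

def duplicados_por_columna(df, columna):
    """Hypothetical stand-in for herramientas.verifica_duplicados_por_columna."""
    # keep=False marks every member of a duplicated group, not only the repeats
    duplicados = df[df.duplicated(subset=columna, keep=False)]
    if duplicados.empty:
        return 'No hay duplicados'
    return duplicados.sort_values(by=columna)

# Toy data: user 'b' appears twice
df = pd.DataFrame({'user_id': ['b', 'a', 'b', 'c'], 'n': [1, 2, 3, 4]})

dup = duplicados_por_columna(df, 'user_id')
sin_dup = duplicados_por_columna(
    df.drop_duplicates(subset='user_id', keep='first'), 'user_id'
)
print(dup)
print(sin_dup)
```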


There are 623 duplicated rows in the 'user_id' column. Before removing them, we check whether the nested reviews in 'df_reviews' are duplicated as well, or whether only the 'user_id' repeats because the user wrote more than one review.

In [35]:
# Inspect one example user
user_id = '29123'
user_reviews = filas_duplicadas[filas_duplicadas['user_id'] == user_id]['reviews']

for review_list in user_reviews:
    for review in review_list:
        print(review['review'])
    print('-' * 40)

Can't play after the updates, people who doesn't know how to play the game, after playing the game crashes and The ♥♥♥♥ing black screen
5/10 lots of bugs and bad servers.BTW this game needs a better training facility
it needs blood when get hit, and some others changes, 9/10
What can i say? Nice graphics and nice story and charactersThe only thing wrong: it should be playable for all graphics cards even the lowest ones like 12 fps or 6But nvm nice game 8/10
Well, this game is the best i ever played but when i was downloading this game (again) i was like in 30%, but i connected today and it was 0% well if valve can fix this things i will be more happy and i will play more this game.
----------------------------------------
Can't play after the updates, people who doesn't know how to play the game, after playing the game crashes and The ♥♥♥♥ing black screen
5/10 lots of bugs and bad servers.BTW this game needs a better training facility
it needs blood when get hit, and some others change

The reviews are exactly the same in each record, so the duplicates are dropped, keeping the first occurrence of each record.

In [36]:
df_reviews = df_reviews.drop_duplicates(subset='user_id', keep='first')
herramientas.verifica_duplicados_por_columna(df_reviews, 'user_id')

'No hay duplicados'
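
The check above looks at a single example user. As a sketch of how the same comparison could be extended to every duplicated `user_id` before dropping rows, the snippet below compares a string form of each nested list per user. The toy data and variable names are illustrative assumptions:

```python
import pandas as pd

# Toy data: 'a' is a true duplicate, 'b' repeats the id with different reviews
df = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b', 'c'],
    'reviews': [
        [{'item_id': '1', 'review': 'good'}],
        [{'item_id': '1', 'review': 'good'}],
        [{'item_id': '2', 'review': 'ok'}],
        [{'item_id': '3', 'review': 'different'}],
        [{'item_id': '4', 'review': 'fine'}],
    ],
})

# Lists are not hashable, so compare a string representation of each value
firma = df['reviews'].apply(repr)

# Number of distinct review lists recorded for each user
variantes = firma.groupby(df['user_id']).nunique()

# Users whose repeated rows do not carry identical reviews
usuarios_con_diferencias = variantes[variantes > 1].index.tolist()
print(usuarios_con_diferencias)
```

An empty result on the real data would support dropping duplicates by `user_id` alone; a non-empty one would list the users whose rows need merging instead.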

We inspect the type of data stored in the 'reviews' column.

In [37]:
# Look at the content of the first 'reviews' entry
df_reviews['reviews'][0]

[{'funny': '',
  'posted': 'Posted November 5, 2011.',
  'last_edited': '',
  'item_id': '1250',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Simple yet with great replayability. In my opinion does "zombie" hordes and team work better than left 4 dead plus has a global leveling system. Alot of down to earth "zombie" splattering fun for the whole family. Amazed this sort of FPS is so rare.'},
 {'funny': '',
  'posted': 'Posted July 15, 2011.',
  'last_edited': '',
  'item_id': '22200',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': "It's unique and worth a playthrough."},
 {'funny': '',
  'posted': 'Posted April 21, 2011.',
  'last_edited': '',
  'item_id': '43110',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game is definitely worth it and I hope they do a sequel...so buy the game so I get a sequel!'}]

The dataset initially has 3 columns and 25799 rows, with no nulls. Its columns are:
- **'user_id'**: unique user identifier.
- **'user_url'**: URL of the user's profile.
- **'reviews'**: the user's reviews in a JSON-like format, stored as a list of dictionaries. Each user has one or more dictionaries, one per review. Each dictionary contains:
    - **funny**: text indicating how many users marked the review as funny.
    - **posted**: date the review was posted, in the format Posted April 21, 2011.
    - **last_edited**: date of the last edit.
    - **item_id**: unique identifier of the item (the game id).
    - **helpful**: statistic where other users indicate whether the review was helpful.
    - **recommend**: boolean indicating whether the user recommends the game.
    - **review**: string with the user's comments about the game.


#### Analysis of the 'reviews' column

The 'reviews' column is nested: each value is a list with one or more dictionaries. Each dictionary becomes its own row, and each of its keys becomes a column.

The code below explodes 'reviews' so that every list element gets its own row, expands each element into multiple columns, and concatenates the new columns with the exploded DataFrame.

In [38]:
# Explode the nested 'reviews' column into rows, expand each dictionary into columns, and concatenate the result
df_reviews1 = df_reviews.explode(['reviews'])
df_reviews2 = df_reviews1['reviews'].apply(pd.Series)
df_reviews3 = pd.concat([df_reviews1, df_reviews2], axis=1)
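
The sketch below reproduces the explode-and-expand step on toy data and shows one possible variation: dropping entries with no review before expanding. It assumes that the null rows reported by the null check that follows come from users whose `reviews` list is empty, which `explode` turns into NaN; that would also be a plausible origin of the extra column named `0` in the output. All names and data here are illustrative:

```python
import pandas as pd

# Toy data: user 'a' has two reviews, user 'b' has an empty list
df = pd.DataFrame({
    'user_id': ['a', 'b'],
    'reviews': [
        [{'item_id': '1250', 'recommend': True},
         {'item_id': '22200', 'recommend': False}],
        [],
    ],
})

# One row per list element; an empty list becomes a single NaN row
exploded = df.explode('reviews')
n_vacios = int(exploded['reviews'].isna().sum())

# Drop the empty entries, then build the columns from the remaining dictionaries
validos = exploded.dropna(subset=['reviews']).reset_index(drop=True)
campos = pd.DataFrame(validos['reviews'].tolist())
plano = pd.concat([validos.drop(columns='reviews'), campos], axis=1)
print(plano)
```

Building the columns with `pd.DataFrame(list_of_dicts)` is usually faster than `apply(pd.Series)` on large frames, at the cost of needing a clean, NaN-free list first.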

In [39]:
# Assign df_reviews3 back to df_reviews
df_reviews = df_reviews3
# Display the dataset to verify the previous step
df_reviews

Unnamed: 0,user_id,user_url,reviews,funny,posted,last_edited,item_id,helpful,recommend,review,0
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...",,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted July 15, 2011....",,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted April 21, 2011...",,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,
1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....",,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,
1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted September 8, 2...",,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,
...,...,...,...,...,...,...,...,...,...,...,...
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"{'funny': '', 'posted': 'Posted July 10.', 'la...",,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...,
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"{'funny': '', 'posted': 'Posted July 8.', 'las...",,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...,
25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,"{'funny': '1 person found this review funny', ...",1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,
25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,"{'funny': '', 'posted': 'Posted July 20.', 'la...",,Posted July 20.,,730,No ratings yet,True,:D,


We check the number of nulls.

In [40]:
herramientas.verifica_tipo_y_nulos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews,"[<class 'dict'>, <class 'float'>]",99.95,0.05,28
3,funny,"[<class 'str'>, <class 'float'>]",99.95,0.05,28
4,posted,"[<class 'str'>, <class 'float'>]",99.95,0.05,28
5,last_edited,"[<class 'str'>, <class 'float'>]",99.95,0.05,28
6,item_id,"[<class 'str'>, <class 'float'>]",99.95,0.05,28
7,helpful,"[<class 'str'>, <class 'float'>]",99.95,0.05,28
8,recommend,"[<class 'bool'>, <class 'float'>]",99.95,0.05,28
9,review,"[<class 'str'>, <class 'float'>]",99.95,0.05,28


The columns that are not needed are dropped: reviews, funny, last_edited and user_url, together with the trailing column `0` produced by the expansion step. The result is displayed to verify the change and stored back in df_reviews.

In [41]:
# Drop the unneeded named columns in a single call
df_reviews_clean = df_reviews.drop(columns=['reviews', 'funny', 'last_edited', 'user_url'])
# Drop the trailing column (named 0) left over from the expansion
df_reviews_clean = df_reviews_clean.drop(df_reviews_clean.columns[-1], axis=1)

df_reviews=df_reviews_clean 
df_reviews

Unnamed: 0,user_id,posted,item_id,helpful,recommend,review
0,76561197970982479,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
0,76561197970982479,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.
0,76561197970982479,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
1,js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
1,js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...
25797,76561198312638244,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...
25797,76561198312638244,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...
25798,LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
25798,LydiaMorley,Posted July 20.,730,No ratings yet,True,:D


We check for null values.

In [42]:
herramientas.verifica_tipo_y_nulos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,posted,"[<class 'str'>, <class 'float'>]",99.95,0.05,28
2,item_id,"[<class 'str'>, <class 'float'>]",99.95,0.05,28
3,helpful,"[<class 'str'>, <class 'float'>]",99.95,0.05,28
4,recommend,"[<class 'bool'>, <class 'float'>]",99.95,0.05,28
5,review,"[<class 'str'>, <class 'float'>]",99.95,0.05,28


Rows with null values in the columns posted, item_id, helpful, recommend and review are removed.

In [43]:
# Drop rows that have a null in any of the listed columns
df_reviews_clean = df_reviews.dropna(subset=['posted', 'item_id', 'helpful', 'recommend', 'review'])

df_reviews=df_reviews_clean 
df_reviews

Unnamed: 0,user_id,posted,item_id,helpful,recommend,review
0,76561197970982479,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
0,76561197970982479,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.
0,76561197970982479,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
1,js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
1,js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...
25797,76561198312638244,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...
25797,76561198312638244,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...
25798,LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
25798,LydiaMorley,Posted July 20.,730,No ratings yet,True,:D


To mark each column as belonging to ***dataset: australian_user_reviews***, the prefix 'reviews_' is added to every column name.

In [44]:
df_reviews_clean = df_reviews_clean.add_prefix('reviews_')

df_reviews= df_reviews_clean
df_reviews

Unnamed: 0,reviews_user_id,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
0,76561197970982479,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.
0,76561197970982479,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
1,js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
1,js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...
25797,76561198312638244,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...
25797,76561198312638244,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...
25798,LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
25798,LydiaMorley,Posted July 20.,730,No ratings yet,True,:D


We verify that the null values found earlier were removed and review the data types again.

In [45]:
herramientas.verifica_tipo_y_nulos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,reviews_user_id,[<class 'str'>],100.0,0.0,0
1,reviews_posted,[<class 'str'>],100.0,0.0,0
2,reviews_item_id,[<class 'str'>],100.0,0.0,0
3,reviews_helpful,[<class 'str'>],100.0,0.0,0
4,reviews_recommend,[<class 'bool'>],100.0,0.0,0
5,reviews_review,[<class 'str'>],100.0,0.0,0


#### Changing data types where needed

- ***'reviews_posted'***

The date on which the review was posted needs to be in YYYY-MM-DD format, so the text has to be parsed to extract the relevant parts. Regular expressions are used to find and capture the year, month and day within the string.

In [46]:
df_reviews['reviews_date'] = df_reviews['reviews_posted'].apply(herramientas.convertir_fecha)
df_reviews['reviews_date']

0              2011-11-05
0              2011-07-15
0              2011-04-21
1              2014-06-24
1              2013-09-08
               ...       
25797    Formato inválido
25797    Formato inválido
25798    Formato inválido
25798    Formato inválido
25798    Formato inválido
Name: reviews_date, Length: 58430, dtype: object
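
The implementation of `herramientas.convertir_fecha` is not included here. As a sketch of the regular-expression approach described above, the function below (hypothetical name `convertir_fecha_sketch`) reproduces the two behaviours visible in the output: full dates become YYYY-MM-DD and dates without a year become 'Formato inválido'. The pattern and the month table are assumptions:

```python
import re

# English month names mapped to month numbers (avoids locale-dependent parsing)
MESES = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12,
}

# Captures month name, day and four-digit year, e.g. "Posted November 5, 2011."
PATRON = re.compile(r'Posted (\w+) (\d{1,2}), (\d{4})')

def convertir_fecha_sketch(texto):
    """Hypothetical stand-in for herramientas.convertir_fecha."""
    coincidencia = PATRON.search(texto)
    if coincidencia is None:
        # Strings such as "Posted July 10." carry no year
        return 'Formato inválido'
    mes, dia, anio = coincidencia.groups()
    if mes not in MESES:
        return 'Formato inválido'
    return f'{int(anio):04d}-{MESES[mes]:02d}-{int(dia):02d}'

print(convertir_fecha_sketch('Posted November 5, 2011.'))
print(convertir_fecha_sketch('Posted July 10.'))
```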

There are 9932 records whose date format differs from the rest: they do not include the year of the post, so the function labelled them 'Formato inválido'. These records cannot be queried by date from the API, but their other columns still provide useful information.

In [47]:
df_reviews[df_reviews['reviews_date'] == 'Formato inválido']

Unnamed: 0,reviews_user_id,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_date
2,evcentric,Posted February 3.,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...,Formato inválido
6,76561198079601835,Posted May 20.,730,0 of 1 people (0%) found this review helpful,True,ZIKA DO BAILE,Formato inválido
7,MeaTCompany,Posted July 24.,730,No ratings yet,True,BEST GAME IN THE BLOODY WORLD,Formato inválido
9,76561198156664158,Posted June 16.,252950,0 of 1 people (0%) found this review helpful,True,love it,Formato inválido
10,76561198077246154,Posted June 11.,440,No ratings yet,True,mt bom,Formato inválido
...,...,...,...,...,...,...,...
25797,76561198312638244,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...,Formato inválido
25797,76561198312638244,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...,Formato inválido
25798,LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,Formato inválido
25798,LydiaMorley,Posted July 20.,730,No ratings yet,True,:D,Formato inválido
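
Because 'Formato inválido' is stored as a string, `reviews_date` stays a text column. A possible alternative, sketched here on toy data, is to count the sentinel values and then convert the column to real dates, letting unparseable entries become `NaT`. This is an optional variation, not part of the original pipeline:

```python
import pandas as pd

# Toy column mixing valid dates with the sentinel used by the notebook
fechas = pd.Series(['2011-11-05', 'Formato inválido', '2014-06-24', 'Formato inválido'])

# Number of records without a usable date
n_invalidas = int((fechas == 'Formato inválido').sum())

# errors='coerce' turns anything that does not match the format into NaT
fechas_dt = pd.to_datetime(fechas, format='%Y-%m-%d', errors='coerce')

# Year of each valid date, e.g. for per-year queries
anios = fechas_dt.dt.year.dropna().astype(int).tolist()
print(n_invalidas, anios)
```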


The 'reviews_posted' column is dropped, since the date it contains is now available in 'reviews_date'.

In [48]:
df_reviews = df_reviews.drop('reviews_posted', axis=1)
df_reviews.columns

Index(['reviews_user_id', 'reviews_item_id', 'reviews_helpful',
       'reviews_recommend', 'reviews_review', 'reviews_date'],
      dtype='object')

We verify that the 'reviews_posted' column was removed correctly.

In [49]:
herramientas.verifica_tipo_y_nulos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,reviews_user_id,[<class 'str'>],100.0,0.0,0
1,reviews_item_id,[<class 'str'>],100.0,0.0,0
2,reviews_helpful,[<class 'str'>],100.0,0.0,0
3,reviews_recommend,[<class 'bool'>],100.0,0.0,0
4,reviews_review,[<class 'str'>],100.0,0.0,0
5,reviews_date,[<class 'str'>],100.0,0.0,0


In [50]:
df_reviews

Unnamed: 0,reviews_user_id,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_date
0,76561197970982479,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011-11-05
0,76561197970982479,22200,No ratings yet,True,It's unique and worth a playthrough.,2011-07-15
0,76561197970982479,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,2011-04-21
1,js41637,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014-06-24
1,js41637,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013-09-08
...,...,...,...,...,...,...
25797,76561198312638244,70,No ratings yet,True,a must have classic from steam definitely wort...,Formato inválido
25797,76561198312638244,362890,No ratings yet,True,this game is a perfect remake of the original ...,Formato inválido
25798,LydiaMorley,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,Formato inválido
25798,LydiaMorley,730,No ratings yet,True,:D,Formato inválido


### Sentiment analysis
The task is to create a new column called 'sentiment_analysis' that replaces 'reviews_review' and holds a sentiment analysis of the comments on the following scale:

- 0 if the review is negative,
- 1 if it is neutral or there is no review,
- 2 if it is positive.

Since the goal of this project is a proof of concept that yields a minimum viable product, a basic sentiment analysis is performed with TextBlob, a natural language processing (NLP) library for Python. The aim of this methodology is to assign a numeric value to a text, in this case the comments users left for a given game, representing whether the sentiment it expresses is negative, neutral or positive.


Null values in the 'reviews_review' column are removed.

In [51]:
df_reviews_clean = df_reviews.dropna(subset=['reviews_review'])

df_reviews=df_reviews_clean 
df_reviews

Unnamed: 0,reviews_user_id,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_date
0,76561197970982479,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011-11-05
0,76561197970982479,22200,No ratings yet,True,It's unique and worth a playthrough.,2011-07-15
0,76561197970982479,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,2011-04-21
1,js41637,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014-06-24
1,js41637,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013-09-08
...,...,...,...,...,...,...
25797,76561198312638244,70,No ratings yet,True,a must have classic from steam definitely wort...,Formato inválido
25797,76561198312638244,362890,No ratings yet,True,this game is a perfect remake of the original ...,Formato inválido
25798,LydiaMorley,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,Formato inválido
25798,LydiaMorley,730,No ratings yet,True,:D,Formato inválido


We check null values and data types.

In [52]:
herramientas.verifica_tipo_y_nulos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,reviews_user_id,[<class 'str'>],100.0,0.0,0
1,reviews_item_id,[<class 'str'>],100.0,0.0,0
2,reviews_helpful,[<class 'str'>],100.0,0.0,0
3,reviews_recommend,[<class 'bool'>],100.0,0.0,0
4,reviews_review,[<class 'str'>],100.0,0.0,0
5,reviews_date,[<class 'str'>],100.0,0.0,0


The methodology is as follows: each review text is taken as input, TextBlob computes its sentiment polarity, and the review is classified as negative, neutral or positive according to that polarity. The thresholds used are -0.2 and 0.2: polarities below -0.2 are negative, polarities above 0.2 are positive, and anything in between is neutral.

In [53]:
df_reviews['sentiment_analysis'] = df_reviews['reviews_review'].apply(herramientas.analisis_sentimiento)
df_reviews.head()

Unnamed: 0,reviews_user_id,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_date,sentiment_analysis
0,76561197970982479,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011-11-05,1
0,76561197970982479,22200,No ratings yet,True,It's unique and worth a playthrough.,2011-07-15,2
0,76561197970982479,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,2011-04-21,1
1,js41637,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014-06-24,1
1,js41637,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013-09-08,1
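
`herramientas.analisis_sentimiento` is not shown in the notebook. The sketch below illustrates the threshold rule described above, separating the classification from the TextBlob call. The function names are hypothetical, and treating empty or missing reviews as neutral follows the scale stated earlier:

```python
def clasificar_polaridad(polaridad, umbral=0.2):
    """Map a polarity in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive)."""
    if polaridad < -umbral:
        return 0
    if polaridad > umbral:
        return 2
    return 1

def analisis_sentimiento_sketch(review):
    """Hypothetical stand-in for herramientas.analisis_sentimiento."""
    # Missing or empty reviews count as neutral
    if not isinstance(review, str) or not review.strip():
        return 1
    # Imported here so the threshold logic can be used without TextBlob installed
    from textblob import TextBlob
    return clasificar_polaridad(TextBlob(review).sentiment.polarity)

print(clasificar_polaridad(-0.5), clasificar_polaridad(0.0), clasificar_polaridad(0.75))
```

Keeping the thresholds in a separate function makes them easy to tune and to test without running the NLP model.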


A few example reviews are inspected for each sentiment class.

In [54]:
herramientas.ejemplos_review_por_sentimiento(df_reviews['reviews_review'], df_reviews['sentiment_analysis'])

Para la categoría de análisis de sentimiento 0 se tienen estos ejemplos de reviews:
Review 1: Random drops and random quests, with stat points.  Animation style reminiscent of the era before the Voodoo card.
Review 2: The ending to this game is.... ♥♥♥♥♥♥♥.... Just buy it, you'll be invested, im automatically preordering season two of the walking dead game.
Review 3: This game is Marvellous.


Para la categoría de análisis de sentimiento 1 se tienen estos ejemplos de reviews:
Review 1: Simple yet with great replayability. In my opinion does "zombie" hordes and team work better than left 4 dead plus has a global leveling system. Alot of down to earth "zombie" splattering fun for the whole family. Amazed this sort of FPS is so rare.
Review 2: Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game is definitely worth it and I hope they do a sequel...so buy the game so I get a sequel!
Review 3: I know what you think when you see this title "Barbie Dr

The 'rating' column is created; it will be used later on.

In [55]:
# Apply the helper function to add the 'rating' column
df_reviews['rating'] = df_reviews.apply(herramientas.columna_rating, axis=1)
df_reviews

Unnamed: 0,reviews_user_id,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_date,sentiment_analysis,rating
0,76561197970982479,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011-11-05,1,2.0
0,76561197970982479,22200,No ratings yet,True,It's unique and worth a playthrough.,2011-07-15,2,2.0
0,76561197970982479,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,2011-04-21,1,2.0
1,js41637,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014-06-24,1,2.0
1,js41637,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013-09-08,1,2.0
...,...,...,...,...,...,...,...,...
25797,76561198312638244,70,No ratings yet,True,a must have classic from steam definitely wort...,Formato inválido,2,2.0
25797,76561198312638244,362890,No ratings yet,True,this game is a perfect remake of the original ...,Formato inválido,1,2.0
25798,LydiaMorley,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,Formato inválido,1,2.0
25798,LydiaMorley,730,No ratings yet,True,:D,Formato inválido,2,2.0


We check nulls and data types.

In [56]:
herramientas.verifica_tipo_y_nulos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,reviews_user_id,[<class 'str'>],100.0,0.0,0
1,reviews_item_id,[<class 'str'>],100.0,0.0,0
2,reviews_helpful,[<class 'str'>],100.0,0.0,0
3,reviews_recommend,[<class 'bool'>],100.0,0.0,0
4,reviews_review,[<class 'str'>],100.0,0.0,0
5,reviews_date,[<class 'str'>],100.0,0.0,0
6,sentiment_analysis,[<class 'int'>],100.0,0.0,0
7,rating,[<class 'float'>],85.46,14.54,8497
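
The summary reports 8497 nulls in `rating`. The logic of `herramientas.columna_rating` is not shown, so the sketch below does not try to reproduce it; it only illustrates one way to see which combinations of `reviews_recommend` and `sentiment_analysis` end up without a rating. The toy values are made up and say nothing about the real mapping:

```python
import pandas as pd

# Toy data with invented rating values, two of them missing
df = pd.DataFrame({
    'reviews_recommend': [True, True, False, False, True],
    'sentiment_analysis': [2, 1, 0, 2, 0],
    'rating': [2.0, 2.0, 0.0, float('nan'), float('nan')],
})

# Rows where no rating was assigned
sin_rating = df[df['rating'].isna()]

# Count the missing ratings per (recommend, sentiment) combination
combinaciones = (
    sin_rating.groupby(['reviews_recommend', 'sentiment_analysis'])
    .size()
    .reset_index(name='filas')
)
print(combinaciones)
```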


We review the changes.

In [57]:
df_reviews

Unnamed: 0,reviews_user_id,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_date,sentiment_analysis,rating
0,76561197970982479,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011-11-05,1,2.0
0,76561197970982479,22200,No ratings yet,True,It's unique and worth a playthrough.,2011-07-15,2,2.0
0,76561197970982479,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,2011-04-21,1,2.0
1,js41637,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014-06-24,1,2.0
1,js41637,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013-09-08,1,2.0
...,...,...,...,...,...,...,...,...
25797,76561198312638244,70,No ratings yet,True,a must have classic from steam definitely wort...,Formato inválido,2,2.0
25797,76561198312638244,362890,No ratings yet,True,this game is a perfect remake of the original ...,Formato inválido,1,2.0
25798,LydiaMorley,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,Formato inválido,1,2.0
25798,LydiaMorley,730,No ratings yet,True,:D,Formato inválido,2,2.0


#### Loading the dataset: australian_user_reviews

The transformed dataset is saved as user_review_limpio.

In [58]:
df_reviews_limpio = 'data/user_review_limpio.csv'
df_reviews.to_csv(df_reviews_limpio, index=False, encoding='utf-8')
print(f'Se guardó el archivo {df_reviews_limpio}')

Se guardó el archivo data/user_review_limpio.csv
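
One thing to keep in mind when reading the CSV back: columns stored as strings of digits, such as `reviews_item_id`, are inferred as integers by default, and an all-digit id with a leading zero would lose it. The sketch below shows the effect on toy data using an in-memory buffer instead of the real file; passing `dtype` on load is a suggestion, not something the notebook does:

```python
import io
import pandas as pd

# Toy data: ids stored as strings, one with a leading zero
df = pd.DataFrame({
    'reviews_user_id': ['05041129', '29123'],
    'reviews_item_id': ['730', '1250'],
    'sentiment_analysis': [1, 2],
})

# Write to an in-memory buffer the same way the notebook writes to disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default load: digit-only columns are inferred as integers
buffer.seek(0)
por_defecto = pd.read_csv(buffer)

# Explicit dtypes keep the ids as text
buffer.seek(0)
como_texto = pd.read_csv(
    buffer, dtype={'reviews_user_id': str, 'reviews_item_id': str}
)
print(por_defecto['reviews_user_id'].tolist())
print(como_texto['reviews_user_id'].tolist())
```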
