 # ETL de **"user_reviews"**
 -------------

En este jupyter notebook se desarrolla la extracción, transformación y carga del conjunto del dataset `"user_reviews"`

-------------
### Tabla de contenido

- 1.- Introducción
- 2.- Importación de las librerías y funciones personalizadas para el ETL
    - 2.1.- Librerías
    - 2.2.- Funciones Personalizadas
- 3.- Desarrollo del proceso de ETL
    - 3.1.- Extracción de datos
        - 3.1.1.- Carga de archivos "user_reviews.json.gz"
    - 3.2.- Transformación de datos
        - 3.2.1.- Aspectos generales del Dataframe
        - 3.2.2.- Exploración y tratamiendo de datos vacíos
        - 3.2.3.- Eliminación de datos duplicados
        - 3.2.4.- Datos incorrectos o irrelevantes
        - 3.2.5.- Datos categóricos
    - 3.3.- Carga de Datos
        - 3.3.1 Generación del archivo limpio
-------------

## **<span style="color: #d8572a;">1.- Introducción</span>**

Este trabajo ETL tiene como objetivo procesar e integrar un conjunto de datos que contiene información sobre videojuegos de Steam. El conjunto de datos, titulado "User_review", consta de 25799 filas × 3 columnas, luego de un proceso de limpieza exhaustivo para eliminar filas con valores NaN. Los datos extraídos se transformarán y cargarán en un sistema de destino para su posterior análisis y utilización.

## **<span style="color: #d8572a;">2.- Importación de las librerías y funciones personalizadas para el ETL</span>**

### **<span style="color: #f7b538;">2.1.- Librerías</span>**

In [56]:
import pandas as pd
import numpy as np

### **<span style="color: #f7b538;">2.2.- Funciones Personalizadas</span>**

In [57]:
import Func_personalizadas.FPersonalizadas as FP

## **<span style="color: #d8572a;">3.- Desarrollo del proceso de ETL</span>**

### **<span style="color: #f7b538;">3.1.- Extracción de datos</span>**

#### 3.1.1.- Carga de archivos "user_reviews.json.gz"

In [58]:
ruta_archivo= r"C:\Users\USUARIO\OneDrive\6.- Data Science\1.- Experiencia Soy Henry Bootcamp\Labs\PI 1\ML_DevOps_Steam_Project\Datasets\user_reviews.json.gz"

In [59]:
df_user_reviews = FP.open_json_path_wlist(ruta_archivo)
df_user_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."


### **<span style="color: #f7b538;">**3.2.- Transformación de datos**</span>**

#### 3.2.1.- Aspectos generales del Dataframe

In [60]:
# Dimensiones del dataframe
df_user_reviews.shape

(25799, 3)

In [61]:
df_user_reviews.columns

Index(['user_id', 'user_url', 'reviews'], dtype='object')

In [62]:
df_user_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25799 non-null  object
 1   user_url  25799 non-null  object
 2   reviews   25799 non-null  object
dtypes: object(3)
memory usage: 604.8+ KB


In [63]:
df_user_reviews.id.value_counts()

AttributeError: 'DataFrame' object has no attribute 'id'

### 3.2.2.- Exploración y tratamiendo de datos vacíos

Eliminamos todas las filas que contengan datos vacíos en todos los campos

In [None]:
## Eliminamos las filas vacías
df_user_reviews = df_user_reviews.dropna(how='all').reset_index(drop=True)
## Evaluamos dimensiones
df_user_reviews.shape

(25799, 3)

Revisamos los tipos de datos y los nulos por cada columna

In [None]:
FP.tabla_tipo_datos(df_user_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews,[<class 'list'>],100.0,0.0,0


Revisamos a detalles las columnas que tienen estructura tipo lista, para entender su estructura

In [None]:
col_tipo_list = FP.identificar_columnas_con_listas(df_user_reviews)

In [None]:
for col in col_tipo_list:    
    print(f"Columna {col} \n Tipo de Dato: ",df_user_reviews[col][5])

Columna reviews 
 Tipo de Dato:  [{'funny': '', 'posted': 'Posted May 5, 2014.', 'last_edited': '', 'item_id': '249130', 'helpful': '7 of 8 people (88%) found this review helpful', 'recommend': True, 'review': 'This game is Marvellous.'}, {'funny': '', 'posted': 'Posted December 24, 2012.', 'last_edited': 'Last edited November 25, 2013.', 'item_id': '207610', 'helpful': '1 of 1 people (100%) found this review helpful', 'recommend': True, 'review': 'It reminds me of that TV Show called "The Walking Dead".'}, {'funny': '1 person found this review funny', 'posted': 'Posted October 21, 2012.', 'last_edited': 'Last edited November 25, 2013.', 'item_id': '550', 'helpful': '1 of 1 people (100%) found this review helpful', 'recommend': True, 'review': 'This game is fantastic if you are looking to DEADicate some time to it.'}, {'funny': '', 'posted': 'Posted March 20, 2012.', 'last_edited': 'Last edited June 22, 2014.', 'item_id': '65800', 'helpful': '1 of 1 people (100%) found this review help

In [None]:
# Se observa el tipo de dato que contiene 'review'
df_user_reviews['reviews'][5]

[{'funny': '',
  'posted': 'Posted May 5, 2014.',
  'last_edited': '',
  'item_id': '249130',
  'helpful': '7 of 8 people (88%) found this review helpful',
  'recommend': True,
  'review': 'This game is Marvellous.'},
 {'funny': '',
  'posted': 'Posted December 24, 2012.',
  'last_edited': 'Last edited November 25, 2013.',
  'item_id': '207610',
  'helpful': '1 of 1 people (100%) found this review helpful',
  'recommend': True,
  'review': 'It reminds me of that TV Show called "The Walking Dead".'},
 {'funny': '1 person found this review funny',
  'posted': 'Posted October 21, 2012.',
  'last_edited': 'Last edited November 25, 2013.',
  'item_id': '550',
  'helpful': '1 of 1 people (100%) found this review helpful',
  'recommend': True,
  'review': 'This game is fantastic if you are looking to DEADicate some time to it.'},
 {'funny': '',
  'posted': 'Posted March 20, 2012.',
  'last_edited': 'Last edited June 22, 2014.',
  'item_id': '65800',
  'helpful': '1 of 1 people (100%) found th

### Alcance preliminar del diccionario de datos de "User reviews"

**Descripción:**

Este conjunto de datos contiene información sobre videojuegos. Tras la limpieza de filas no se evidencia valores tipo NaN, el conjunto final consta de 25799 filas y 3 columnas. A continuación, se detalla la descripción de cada columna:

**Columnas:**

* **user_id**: es un identificador único para el usuario.
* **user_url**: es la url del perfil del usuario en streamcommunity.
* **reviews**: contiene una lista de diccionarios. Para cada usuario se tiene uno o mas diccionario con el review. Cada diccionario contiene:
    * **funny**: indica si alguien puso emoticón de gracioso al review.
    * **posted**: es la fecha de posteo del review en formato Posted April 21, 2011.
    * **last_edited**: es la fecha de la última edición.
    * **item_id**: es el identificador único del item, es decir, del juego.
    * **helpful**: es la estadística donde otros usuarios indican si fue útil la información.
    * **recommend**: es un booleano que indica si el usuario recomienda o no el juego.
    * **review**: es una sentencia string con los comentarios sobre el juego.


**Notas:**

* Las columnas `reviews` contienen listas de valores separados por comas.

### 3.2.3.- Eliminación de datos duplicados

Se analizan si hay duplicados teniendo en cuenta la columna del user_id.

In [None]:
elementos_duplicados = FP.verifica_duplicados_por_columna(df_user_reviews, 'user_id')
elementos_duplicados

Unnamed: 0,user_id,user_url,reviews
12888,05041129,http://steamcommunity.com/id/05041129,"[{'funny': '', 'posted': 'Posted May 18, 2015...."
5250,05041129,http://steamcommunity.com/id/05041129,"[{'funny': '', 'posted': 'Posted May 18, 2015...."
3133,111222333444555666888,http://steamcommunity.com/id/11122233344455566...,"[{'funny': '', 'posted': 'Posted December 22, ..."
3134,111222333444555666888,http://steamcommunity.com/id/11122233344455566...,"[{'funny': '', 'posted': 'Posted December 22, ..."
4139,29123,http://steamcommunity.com/id/29123,"[{'funny': '', 'posted': 'Posted March 26.', '..."
...,...,...,...
2721,xXAussieRockXx,http://steamcommunity.com/id/xXAussieRockXx,"[{'funny': '', 'posted': 'Posted July 17, 2015..."
2680,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
17916,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
5855,zeroblade,http://steamcommunity.com/id/zeroblade,"[{'funny': '', 'posted': 'Posted November 30, ..."


Identificación de Items Duplicados:

Se han detectado 623 items duplicados en el conjunto de datos, sin embargo:

* **Primero:** Se revisan si los review dentro de los datos anidados de 'review' la información
* **Segundo:** Corroborar si se encuentra duplicada o si solo se duplica el 'user_id' porque hay mas de un comentario realizado por ese usuario.

En primer lugar se busca por el 'developer' si este juego está ya registrado.

In [None]:
# Se revisa un usuario de ejemplo
user_id = 'yolofaceguy'
user_reviews = elementos_duplicados[elementos_duplicados['user_id'] == user_id]['reviews']

for review_list in user_reviews:
    for review in review_list:
        print(review['review'])
    print('-' * 40)

from the creaters of the walking dead, i present to you, the wolf among us. a twisted an unhappy place where crimes are made in the town of the fables. Ecperience one of the most mind-bending, jaw-dropping twists you will see in all of gaming history...well, some of gaming history. SPOILERS: i really liked how bigby turns into his true form to fight all the bloody marys... its just like when neo fought all the agent smiths in the matrix. what i didnt really like is when i chose to lock the crooked man up it glitches my game and i start from the near begining where bigby fought the crooked man's crew, exept the guy who runs a strip club wasnt there. so i was really confused. but anyway i really recommend this game if you want twists and ♥♥♥♥ed up scenes.
this game is awesome,this game is ♥♥♥♥ed up and this game is so sad and depressing. if you want these types of games i would strongly reccomend youto buy it. its worth your money
----------------------------------------
from the creater

Se puede ver que los review son exactamente los mismos para cada registro, por lo que se decide borrar los duplicados, dejando la primer ocurrencia de los registros.

In [None]:
df_user_reviews = df_user_reviews.drop_duplicates(subset='user_id', keep='first')
FP.verifica_duplicados_por_columna(df_user_reviews, 'user_id')

'No hay duplicados'

### 3.2.4.- Datos incorrectos o irrelevantes

* En una primera instancia, hasta el momento no hay columnas por eliminar.
* La columna reviews al ser una columna anidada, podría contener información valiosa como también irrelevante, por lo que requiere un análisis más profundo que se abordará en el siguiente punto.

### 3.2.5.- Datos categóricos

#### 3.2.5.1 Columna 'reviews'

La columna 'reviews' se presenta anidada, siendo una lista con uno o mas diccionarios como elementos. Se busca generar una columna por cada diccionario para posteriormente hacer un registro por cada diccionario.

In [None]:
# Se transforma a columnas cada elemento de las listas
df_review_lista = pd.json_normalize(df_user_reviews['reviews'])
df_review_lista.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,


En esta transformación, se pierde el 'user_id' y 'user_url' al que pertenece cada diccionario, pero aún mantiene la misma posición. Para volver a tener la trazabilidad del usuario, se concatena con el dataframe anterior.

In [None]:
# Se agrega el 'user_id' y 'user_url' a las columnas separadas 
df_review_lista = pd.concat([df_user_reviews[['user_id', 'user_url']], df_review_lista], axis=1)
df_review_lista.head()

Unnamed: 0,user_id,user_url,0,1,2,3,4,5,6,7,8,9
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,evcentric,http://steamcommunity.com/id/evcentric,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,doctr,http://steamcommunity.com/id/doctr,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,maplemage,http://steamcommunity.com/id/maplemage,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,


Ahora que se tienen los diccionarios por columnas, con el usuario que genera dicha información, se genera un registro por cada diccionario, manteniendo en cada caso el usuario que lo genera.

In [None]:
# Se utiliza pd.melt para transformar las columnas en filas conservando el 'user_id' y 'user_url'
df_review_lista = pd.melt(df_review_lista, id_vars=['user_id', 'user_url'], 
                       value_vars=list(range(9)),
                       value_name='reviews')
df_review_lista.head()

Unnamed: 0,user_id,user_url,variable,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,js41637,http://steamcommunity.com/id/js41637,0,"{'funny': '', 'posted': 'Posted June 24, 2014...."
2,evcentric,http://steamcommunity.com/id/evcentric,0,"{'funny': '', 'posted': 'Posted February 3.', ..."
3,doctr,http://steamcommunity.com/id/doctr,0,"{'funny': '', 'posted': 'Posted October 14, 20..."
4,maplemage,http://steamcommunity.com/id/maplemage,0,"{'funny': '3 people found this review funny', ..."


Al hacer esto último se puede ver que quedan registros None. Esto ocurre porque hay usuarios que hicieron mas reviews que otros. En este ejemplo se puede ver este caso

In [None]:
df_review_lista[df_review_lista['user_id']=='js41637']

Unnamed: 0,user_id,user_url,variable,reviews
1,js41637,http://steamcommunity.com/id/js41637,0,"{'funny': '', 'posted': 'Posted June 24, 2014...."
25800,js41637,http://steamcommunity.com/id/js41637,1,"{'funny': '', 'posted': 'Posted September 8, 2..."
51599,js41637,http://steamcommunity.com/id/js41637,2,"{'funny': '', 'posted': 'Posted November 29, 2..."
77398,js41637,http://steamcommunity.com/id/js41637,3,
103197,js41637,http://steamcommunity.com/id/js41637,4,
128996,js41637,http://steamcommunity.com/id/js41637,5,
154795,js41637,http://steamcommunity.com/id/js41637,6,
180594,js41637,http://steamcommunity.com/id/js41637,7,
206393,js41637,http://steamcommunity.com/id/js41637,8,


Finalmente se buscará eliminar los registros que tienen None en "reviews"

In [None]:
df_review_lista = df_review_lista.dropna()

Ahora podemos convertir los diccionarios dentro de la fila en columnas independientes

In [None]:
# Se separan por columnas cada una de las claves de 'reviews'
df_user_reviews = df_review_lista['reviews'].apply(pd.Series, dtype='object')
df_user_reviews = df_user_reviews.add_prefix('reviews_')
df_user_reviews.head()

Unnamed: 0,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud


En el procesamiento anterior, se puede ver que la columna de 'user_id' y 'user_url' se perdió nuevamente, por lo que se vuelve a concatenar.

In [None]:
# Se une con el 'user_id' y 'user_url'
df_user_reviews = pd.concat([df_review_lista[['user_id', 'user_url']], df_user_reviews], axis=1)
df_user_reviews.head()

Unnamed: 0,user_id,user_url,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud


**Importante:** Se observa que hay valores faltantes en algunas columnas, pero no estan como nulos, se inspeccionará a detalle

In [None]:
df_user_reviews['reviews_last_edited'][3]

''

Verificamos que se trata de datos que contienen un espacio, procedemos a reemplazarlo con None

In [None]:
df_user_reviews.replace('', None, inplace=True)
df_user_reviews.head()

Unnamed: 0,user_id,user_url,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud


Damos una última verificación para asegurar el correcto tratamiento a la columna 'user_reviews'

In [None]:
FP.tabla_tipo_datos(df_user_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews_funny,"[<class 'NoneType'>, <class 'str'>]",13.76,86.24,49498
3,reviews_posted,[<class 'str'>],100.0,0.0,0
4,reviews_last_edited,"[<class 'NoneType'>, <class 'str'>]",10.28,89.72,51499
5,reviews_item_id,[<class 'str'>],100.0,0.0,0
6,reviews_helpful,[<class 'str'>],100.0,0.0,0
7,reviews_recommend,[<class 'bool'>],100.0,0.0,0
8,reviews_review,"[<class 'str'>, <class 'NoneType'>]",99.95,0.05,30


#### 3.2.5.2 Columnas 'reviews_funny' y 'reviews_last_edited'

**De la tabla anterior,** se observa entre un 86 a 89% de faltantes de datos en las columnas 'reviews_funny' y 'reviews_last_edited' por lo que se decide eliminar estas columnas.

In [None]:
# Se eliminan las columnas 'reviews_funny' y 'reviews_last_edited'
df_user_reviews = df_user_reviews.drop(columns=['reviews_funny', 'reviews_last_edited'])
df_user_reviews.columns

Index(['user_id', 'user_url', 'reviews_posted', 'reviews_item_id',
       'reviews_helpful', 'reviews_recommend', 'reviews_review'],
      dtype='object')

#### 3.2.5.3 Columnas 'reviews_posted'

La estrategia a tomar aquí es un cambio de formato, requerimos un formato de `YYYY-MM-DD`, pero se encuentra como `Posted June 24, 2014`. Por lo que vamos a recurrir con una transformación de la expresión regular en una columna nueva llamada `reviews_date`.

In [None]:
df_user_reviews['reviews_date'] = df_user_reviews['reviews_posted'].apply(FP.convertir_fecha)
df_user_reviews['reviews_date']

0               2011-11-05
1               2014-06-24
2         Formato inválido
3               2013-10-14
4               2014-04-15
                ...       
231291          2014-08-15
231293          2014-08-02
231419          2015-07-31
231499          2015-12-20
231501    Formato inválido
Name: reviews_date, Length: 57397, dtype: object

In [None]:
df_user_reviews['reviews_date'].value_counts()

reviews_date
Formato inválido    9771
2014-06-21           218
2014-06-20           187
2014-06-23           171
2014-06-27           167
                    ... 
2012-06-11             1
2012-05-20             1
2011-05-02             1
2011-03-25             1
2011-07-16             1
Name: count, Length: 1639, dtype: int64

Se puede observar que hay 9771 registros que contienen un formato inválido distinto a los demas registros. En este caso, no contiene el año del posteo, pero con la función se imputó como 'Formato inválido'. Estos registros no se podrán consultar desde la API, pero las demás columnas serán útiles para aportar información.

In [None]:
df_user_reviews[df_user_reviews['reviews_date'] == 'Formato inválido']

Unnamed: 0,user_id,user_url,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_date
2,evcentric,http://steamcommunity.com/id/evcentric,Posted February 3.,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...,Formato inválido
6,76561198079601835,http://steamcommunity.com/profiles/76561198079...,Posted May 20.,730,0 of 1 people (0%) found this review helpful,True,ZIKA DO BAILE,Formato inválido
7,MeaTCompany,http://steamcommunity.com/id/MeaTCompany,Posted July 24.,730,No ratings yet,True,BEST GAME IN THE BLOODY WORLD,Formato inválido
9,76561198156664158,http://steamcommunity.com/profiles/76561198156...,Posted June 16.,252950,0 of 1 people (0%) found this review helpful,True,love it,Formato inválido
10,76561198077246154,http://steamcommunity.com/profiles/76561198077...,Posted June 11.,440,No ratings yet,True,mt bom,Formato inválido
...,...,...,...,...,...,...,...,...
223569,76561198040184950,http://steamcommunity.com/profiles/76561198040...,Posted April 12.,394690,No ratings yet,True,I cannot say much right now due to the game no...,Formato inválido
226105,76561198046474248,http://steamcommunity.com/profiles/76561198046...,Posted March 28.,234140,No ratings yet,True,"Oh what a day .., What a lovely day to play th...",Formato inválido
228109,dmitry_who,http://steamcommunity.com/id/dmitry_who,Posted May 17.,376210,10 of 28 people (36%) found this review helpful,True,░░░░░░░░░░░█▀▀░░█░░░░░░░░░░░▄▀▀▀▀░░░░░█▄▄░░░░░...,Formato inválido
229231,76561198079507136,http://steamcommunity.com/profiles/76561198079...,Posted January 3.,730,No ratings yet,False,got VACed,Formato inválido


Una vez realizada la transformación debemos, eliminamos al columna `'reviews_posted'`

In [None]:
df_user_reviews = df_user_reviews.drop('reviews_posted', axis=1)
df_user_reviews.columns

Index(['user_id', 'user_url', 'reviews_item_id', 'reviews_helpful',
       'reviews_recommend', 'reviews_review', 'reviews_date'],
      dtype='object')

#### 3.2.5.4 Columnas 'reviews_review'

Esta columna tiene un 5% de valores nulos, por lo que se eliminan.

In [None]:
df_user_reviews = df_user_reviews.dropna(subset=['reviews_review'])

Verificación de limpieza correcta de datos nulos

In [None]:
FP.tabla_tipo_datos(df_user_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews_item_id,[<class 'str'>],100.0,0.0,0
3,reviews_helpful,[<class 'str'>],100.0,0.0,0
4,reviews_recommend,[<class 'bool'>],100.0,0.0,0
5,reviews_review,[<class 'str'>],100.0,0.0,0
6,reviews_date,[<class 'str'>],100.0,0.0,0


**Se confirma la limpieza exitosa del dataset**

### **<span style="color: #f7b538;">**3.3.- Carga de datos**</span>**

#### 3.3.1 Generación del archivo limpio

Se guarda el dataframe transformado como `user_reviews_limpio`

In [None]:
nombre_archivo_limpio = 'Datasets\df_user_reviews_limpio.csv'
df_user_reviews.to_csv(nombre_archivo_limpio, index=False, encoding='utf-8')
print(f'Se guardó el archivo {nombre_archivo_limpio}')

  nombre_archivo_limpio = 'Datasets\df_user_reviews_limpio.csv'


Se guardó el archivo Datasets\df_user_reviews_limpio.csv
