<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

# Data polishing: 3 websites

## 0. Python imports

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy

from hdbscan import HDBSCAN
from spacy.lang.es.stop_words import STOP_WORDS
from spacy.lang.es import Spanish
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.utils import shuffle
from umap import UMAP

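# Silence library warnings for the rest of the notebook.
# (An empty catch_warnings() block restores the filters as soon as it exits.)
warnings.simplefilter('ignore')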

## 1. Data loading, merging and cleaning by site

### Gamereactor

In [2]:
gamereactor_df1 = pd.read_csv('../data/gamereactor_100l.csv')

In [3]:
gamereactor_df2 = pd.read_csv('../data/gamereactor_100_500l.csv')

In [4]:
gamereactor_df3 = pd.read_csv('../data/gamereactor_500_1249l.csv')

In [5]:
gamereactor_data = [gamereactor_df1, gamereactor_df2, gamereactor_df3]
gamereactor_df = pd.concat(gamereactor_data)

In [6]:
gamereactor_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1247 entries, 0 to 748
Data columns (total 9 columns):
site        1247 non-null object
url_link    1247 non-null object
author      1247 non-null object
game        1247 non-null object
company     1247 non-null object
genre       1247 non-null object
platform    1247 non-null object
text        1247 non-null object
score       1247 non-null float64
dtypes: float64(1), object(8)
memory usage: 97.4+ KB


In [7]:
gamereactor_df.describe()

Unnamed: 0,score
count,1247.0
mean,7.461107
std,1.296329
min,2.0
25%,7.0
50%,8.0
75%,8.0
max,10.0


In [10]:
gamereactor_df = gamereactor_df.reset_index(drop=True)
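
The `Int64Index: 1247 entries, 0 to 748` line in the `info()` output shows that `pd.concat` kept each part's own row labels, which is why the index is reset here. A minimal sketch of the same effect on toy frames (the data below is hypothetical, not taken from the CSV files), together with the `ignore_index=True` shortcut:

```python
import pandas as pd

# Toy stand-ins for the CSV parts (hypothetical data).
part1 = pd.DataFrame({'game': ['a', 'b'], 'score': [8.0, 7.0]})
part2 = pd.DataFrame({'game': ['c', 'd', 'e'], 'score': [6.0, 9.0, 7.5]})

# Plain concat keeps each part's own index, so labels repeat.
stacked = pd.concat([part1, part2])
print(stacked.index.tolist())      # [0, 1, 0, 1, 2]

# ignore_index=True builds a fresh 0..n-1 index in one step,
# equivalent to concat followed by reset_index(drop=True).
renumbered = pd.concat([part1, part2], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

Repeated labels matter later: `df.loc[0]` on the stacked frame returns one row per part instead of a single row.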

In [20]:
# Check for fully duplicated rows
gamereactor_df.duplicated().value_counts()

False    1247
dtype: int64

In [8]:
# Count rows holding the placeholder string 'None', per column (result after the dash)

# (gamereactor_df['game'] == 'None').value_counts() - False 1247
# (gamereactor_df['site'] == 'None').value_counts() - False 1247
# (gamereactor_df['author'] == 'None').value_counts() - False 1247
# (gamereactor_df['url_link'] == 'None').value_counts() - False 1247
# (gamereactor_df['company'] == 'None').value_counts() - False 1122 / True 125
# (gamereactor_df['genre'] == 'None').value_counts() - False 1233 / True 14
# (gamereactor_df['platform'] == 'None').value_counts() - False 1230 / True 17
# (gamereactor_df['score'] == 'None').value_counts() - False 1247


In [45]:
gamereactor_df.head(10)

Unnamed: 0,site,url_link,author,game,company,genre,platform,text,score
0,Gamereactor,https://www.gamereactor.es/squad-analisis/?sid...,Mike Holmes,Squad,,Acción,PC,Anda que no ha llovido desde que jugamos Squad...,8.0
1,Gamereactor,https://www.gamereactor.es/super-mario-bros-35...,Sergio Figueroa,Super Mario Bros. 35 - Battle Royale,Nintendo,Plataformas,Nintendo Switch,"No se habían olvidado de él, lo que pasaba es ...",7.0
2,Gamereactor,https://www.gamereactor.es/crash-bandicoot-4-i...,Eirik Hyldbakk Furu,Crash Bandicoot 4: It's About Time,Activision,Plataformas,"PS4, Xbox One",Aunque creo que Ford se pasó un poco con la ca...,8.0
3,Gamereactor,https://www.gamereactor.es/star-wars-squadrons...,Mike Holmes,Star Wars: Squadrons,Electronic Arts,Acción,"PC, PS4, Xbox One",Squadrons es una nueva esperanza de EA para sa...,7.0
4,Gamereactor,https://www.gamereactor.es/art-of-rally-analis...,Petter Hegevall,Art of Rally,Funselektor Labs Inc,Carreras,PC,Este estilo tan estilo con el que Art of Rally...,7.0
5,Gamereactor,https://www.gamereactor.es/kirby-fighters-2-an...,Alberto Garrido,Kirby Fighters 2,Nintendo,Acción,Nintendo Switch,Se filtró su lanzamiento unas horas antes de s...,6.0
6,Gamereactor,https://www.gamereactor.es/serious-sam-4-plane...,Kieran Harris,Serious Sam 4: Planet Badass,Devolver Digital,Acción,"PC, Stadia",La explosiva y ridículamente exagerada serie d...,4.0
7,Gamereactor,https://www.gamereactor.es/niche-analisis/?sid...,Clover Harker,Niche,,Simulación,"Android, iOS, Linux, Mac, PC","Allá por 2016, probamos suerte con Niche (cuan...",9.0
8,Gamereactor,https://www.gamereactor.es/port-royale-4-anali...,Marco Vrolijk,Port Royale 4,Kalypso Media,Estrategia,PC,Como amante de los juegos en los que toca leva...,7.0
9,Gamereactor,https://www.gamereactor.es/the-outer-worlds-pe...,Roy Woodhouse,The Outer Worlds: Peligro en Gorgona,Private Division,Acción,"Nintendo Switch, PC, PS4, Xbox One",Como os pasó a muchos en 2019 y a otros tantos...,7.0


*Observations:*
- `company` holds the placeholder string 'None' in 125 of 1247 rows (about 10%)
- `genre` and `platform` rarely do (14 and 17 rows, roughly 1% each)
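
The per-column checks above can be collapsed into one pass. The sketch below assumes, as the checks do, that missing values are stored as the literal string `'None'`; the helper name `none_string_report` and the toy rows are hypothetical.

```python
import pandas as pd

def none_string_report(df, placeholder='None'):
    """Count cells equal to the placeholder string, per column."""
    counts = df.eq(placeholder).sum()
    percent = (counts / len(df) * 100).round(1)
    return pd.DataFrame({'count': counts, 'percent': percent})

# Toy frame standing in for one of the site DataFrames (hypothetical rows).
toy = pd.DataFrame({
    'game': ['Squad', 'Niche', 'Kirby Fighters 2', 'Port Royale 4'],
    'company': ['None', 'None', 'Nintendo', 'Kalypso Media'],
    'score': [8.0, 9.0, 6.0, 7.0],
})

report = none_string_report(toy)
print(report)
```

Calling `none_string_report(gamereactor_df)` would produce the same counts as the commented checks, with the percentages used in the observations.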

### Meristation

In [22]:
meristation_df = pd.read_csv('../data/meristation_50p.csv')



In [24]:
meristation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1441 entries, 0 to 1440
Data columns (total 9 columns):
site        1441 non-null object
url_link    1441 non-null object
author      1441 non-null object
game        1441 non-null object
company     1441 non-null object
genre       1441 non-null object
platform    1441 non-null object
text        1441 non-null object
score       1441 non-null float64
dtypes: float64(1), object(8)
memory usage: 101.4+ KB


In [26]:
meristation_df.describe()

Unnamed: 0,score
count,1441.0
mean,7.552117
std,1.186901
min,1.0
25%,7.0
50%,7.7
75%,8.5
max,10.0


In [29]:
# Check for fully duplicated rows
meristation_df.duplicated().value_counts()

False    1441
dtype: int64

In [39]:
# Count rows holding the placeholder string 'None', per column (result after the dash)

# (meristation_df['game'] == 'None').value_counts()# - False 1441
# (meristation_df['site'] == 'None').value_counts()# - False 1441
# (meristation_df['author'] == 'None').value_counts()# - False 1308 / True 133
# (meristation_df['url_link'] == 'None').value_counts()# - False 1441
# (meristation_df['company'] == 'None').value_counts()# - False 1278 / True 163
# (meristation_df['genre'] == 'None').value_counts()# - False 1429 / True 12
# (meristation_df['platform'] == 'None').value_counts()# - False 1441
# (meristation_df['score'] == 'None').value_counts()# - False 1441

In [44]:
meristation_df.head(10)

Unnamed: 0,site,url_link,author,game,company,genre,platform,text,score
0,meristation,https://as.com/meristation/2020/10/05/analisis...,Carlos Forcada,OkunoKA Madness,Ignition Entertainment,Plataformas,XBO NSW PS4 PC,"\n\n Aunque parece que están ahí casi siempre,...",7.4
1,meristation,https://as.com/meristation/2020/10/03/analisis...,Cristian Ciuraneta,art of rally,Funselektor,Conducción,PC,Los fans de los videojuegos de carreras están ...,7.5
2,meristation,https://as.com/meristation/2020/10/01/analisis...,Sergio C. González\nSergio5Glez,Crash Bandicoot 4: It's About Time,Activision,Plataformas,PS4 XBO,\n\n Crash Bandicoot N. Sane Trilogy fue toda ...,8.3
3,meristation,https://as.com/meristation/2020/09/30/analisis...,David Arroyo,WWE 2K Battlegrounds,2K Games,Acción,NSW STD PS4 XBO PC,Superar una ruptura lleva tiempo. Nunca es fác...,5.7
4,meristation,https://as.com/meristation/2020/09/29/analisis...,César Otero,Maid of Sker,Wales Interactive,Aventura,PC NSW PS4 XBO,"""Un consejo, señor, no se acerque nunca al lag...",6.8
5,meristation,https://as.com/meristation/2020/09/29/analisis...,Azucena Ruíz,Pathfinder: Kingmaker,Deep Silver,"RPG, Acción",PC PS4 XBO,"Pathfinder: Kingmaker empezó, como muchos jueg...",7.5
6,meristation,https://as.com/meristation/2020/09/28/analisis...,Francisco J. Brenlla\nfranchuzas,The Outer Worlds: Peril on Gorgon,Private Division,"Acción, RPG",PC PS4 XBO,El anuncio de que Tim Cain y Leonard Boyarsky ...,7.5
7,meristation,https://as.com/meristation/2020/09/26/analisis...,Nacho Requena\nnachomol,Commandos 2 & Praetorians HD Remaster Double Pack,Kalypso Media,"Estrategia, Tiempo real",PC PS4 XBO,Hubo un tiempo en el que todo lo que tocaba o ...,7.0
8,meristation,https://as.com/meristation/2020/09/26/analisis...,Marta Oller\nmartaaax00,Here Be Dragons,Red Zero Games,"Estrategia, Por turnos",NSW PC IPH IPD AND,Como bien citaba el poeta José Espronceda lo d...,6.5
9,meristation,https://as.com/meristation/2020/09/25/analisis...,Jose Luis López de Garayo,Hades,Supergiant Games,"Aventura, Acción",PC NSW,Supergiant Games sigue un patrón muy claro. Di...,9.3


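*Observations:*
- `author` and `company` hold the placeholder string 'None' in 133 and 163 of 1441 rows (roughly 9% and 11%)
- `genre` rarely does (12 rows, under 1%)
- `platform` uses space-separated abbreviations (e.g. `XBO`, `NSW`) instead of the full platform names Gamereactor uses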
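
To see which platform abbreviations actually occur before deciding how to relabel them, the strings can be split and counted. A minimal sketch, using four platform values copied from the `head(10)` output above rather than the full column:

```python
import pandas as pd

# Platform strings copied from the Meristation head() output.
platform = pd.Series(['XBO NSW PS4 PC', 'PC', 'PS4 XBO', 'NSW STD PS4 XBO PC'])

# Split on whitespace, flatten to one code per row, count occurrences.
code_counts = platform.str.split().explode().value_counts()
print(code_counts)
```

On the real data the same expression would be applied to `meristation_df['platform']`.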

### Revogamers

In [40]:
revogamers_df1 = pd.read_csv('../data/revogamers_100l.csv')

In [41]:
revogamers_df2 = pd.read_csv('../data/revogamers_100_500l.csv')

In [42]:
revogamers_df3 = pd.read_csv('../data/revogamers_500_999l.csv')

In [43]:
revogamers_data = [revogamers_df1, revogamers_df2, revogamers_df3]
revogamers_df = pd.concat(revogamers_data)

In [47]:
revogamers_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 863 entries, 0 to 394
Data columns (total 9 columns):
site        863 non-null object
url_link    863 non-null object
author      863 non-null object
game        863 non-null object
company     863 non-null object
genre       863 non-null object
platform    863 non-null object
text        862 non-null object
score       863 non-null float64
dtypes: float64(1), object(8)
memory usage: 67.4+ KB


In [48]:
revogamers_df.describe()

Unnamed: 0,score
count,863.0
mean,7.164311
std,1.257691
min,0.0
25%,6.5
50%,7.5
75%,8.0
max,10.0


In [49]:
revogamers_df = revogamers_df.reset_index(drop=True)

In [50]:
# Check for fully duplicated rows
revogamers_df.duplicated().value_counts()

False    863
dtype: int64

In [59]:
# Count rows holding the placeholder string 'None', per column (result after the dash)

# (revogamers_df['game'] == 'None').value_counts()# - False 863
# (revogamers_df['site'] == 'None').value_counts()# - False 863
# (revogamers_df['author'] == 'None').value_counts()# - False 863
# (revogamers_df['url_link'] == 'None').value_counts()# - False 863
# (revogamers_df['company'] == 'None').value_counts()# - False 402 / True 461
# (revogamers_df['genre'] == 'None').value_counts()# - False 406 / True 457
# (revogamers_df['platform'] == 'None').value_counts()# - False 863
# (revogamers_df['score'] == 'None').value_counts()# - False 863


In [61]:
revogamers_df.head(10)

Unnamed: 0,site,url_link,author,game,company,genre,platform,text,score
0,revogamers,https://www.revogamers.net/analisis-w/analisis...,Carlos Firás,Going Under,,,Nintendo Switch,Llega a nuestras Nintendo Switch un juego que ...,8.0
1,revogamers,https://www.revogamers.net/analisis-w/analisis...,Javier Aranda,Kirby Fighters 2,HAL Laboratory,Lucha,Nintendo Switch,Kirby ha demostrado en más de una ocasión que ...,7.5
2,revogamers,https://www.revogamers.net/analisis-w/analisis...,Carlos Firás,Lost Ember,,,Nintendo Switch,Llega a nuestras Nintendo Switch un juego que ...,7.5
3,revogamers,https://www.revogamers.net/analisis-w/analisis...,Javier Aranda,RogueCube,Ratalaika Games,"Acción, Arcade",Nintendo Switch,Se hace ya raro no ver una semana sin un estre...,6.0
4,revogamers,https://www.revogamers.net/analisis-w/analisis...,Javier Aranda,RollerCoaster Tycoon 3 Complete Edition,Frontier,Simulación,Nintendo Switch,Uno de los simuladores más emblemáticos de hac...,8.0
5,revogamers,https://www.revogamers.net/analisis-w/analisis...,Javier Aranda,NBA 2K21,Take-Two Interactive,"Deportes, Simulación",Nintendo Switch,Los títulos deportivos no paran de llegar a la...,7.0
6,revogamers,https://www.revogamers.net/analisis-w/analisis...,Marcos Catalán,Análisis de MX vs ATV All Out,,,Nintendo Switch,La saga MX vs ATV no es nueva y nos ha dejado ...,4.5
7,revogamers,https://www.revogamers.net/analisis-w/analisis...,Javier Aranda,Análisis de MO:Astray,,,Nintendo Switch,Rayark es ya una experta en Nintendo Switch. H...,8.5
8,revogamers,https://www.revogamers.net/analisis-w/analisis...,Carlos Firás,Análisis de Mini Motor Racing X,,,Nintendo Switch,Los amantes de la velocidad arcade están de en...,7.0
9,revogamers,https://www.revogamers.net/analisis-w/analisis...,Javier Aranda,Clash Force,Ratalaika Games,"Acción, Plataformas",Nintendo Switch,"Battletoads, Street Sharks, Tortugas Ninja … h...",5.5


*Observations:*
- More than half of the `company` and `genre` values are the placeholder string 'None' (461 and 457 of 863 rows, about 53%)
- One row has a null `text`

## 2. Full data merge

In [70]:
sites_dataframes = [gamereactor_df, meristation_df, revogamers_df]

gr_meri_revo = pd.concat(sites_dataframes)

In [71]:
gr_meri_revo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3551 entries, 0 to 862
Data columns (total 9 columns):
site        3551 non-null object
url_link    3551 non-null object
author      3551 non-null object
game        3551 non-null object
company     3551 non-null object
genre       3551 non-null object
platform    3551 non-null object
text        3550 non-null object
score       3551 non-null float64
dtypes: float64(1), object(8)
memory usage: 277.4+ KB


gr_meri_revo.describe()

In [74]:
gr_meri_revo.dropna(inplace=True)

In [76]:
gr_meri_revo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3550 entries, 0 to 862
Data columns (total 9 columns):
site        3550 non-null object
url_link    3550 non-null object
author      3550 non-null object
game        3550 non-null object
company     3550 non-null object
genre       3550 non-null object
platform    3550 non-null object
text        3550 non-null object
score       3550 non-null float64
dtypes: float64(1), object(8)
memory usage: 277.3+ KB
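
Only one row disappears (3551 to 3550) because `dropna` removes true missing values, and the `'None'` placeholders counted earlier are ordinary strings. A small sketch on hypothetical rows shows the difference, and one way to make the placeholders visible to pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two 'None' placeholders and one real null.
toy = pd.DataFrame({
    'company': ['None', 'Nintendo', 'None'],
    'text': ['review a', np.nan, 'review c'],
})

# dropna() only removes the row with the real null.
print(len(toy.dropna()))   # 2

# mask() swaps every cell equal to 'None' for NaN.
as_missing = toy.mask(toy.eq('None'))
missing_per_column = as_missing.isna().sum()
print(missing_per_column)
```

Whether to convert the placeholders before `dropna` is a design choice: doing so on this dataset would discard every row with an unknown company or genre.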


In [78]:
gr_meri_revo = shuffle(gr_meri_revo)
gr_meri_revo.reset_index(drop=True, inplace=True)
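
As written, the shuffle produces a different row order on every run, so the exported file is not reproducible. A hedged sketch of a seeded shuffle using only pandas (the seed value 42 and the toy frame are arbitrary assumptions); `sklearn.utils.shuffle` accepts a `random_state` argument for the same purpose:

```python
import pandas as pd

# Hypothetical frame standing in for gr_meri_revo.
toy = pd.DataFrame({'game': list('abcdef'), 'score': range(6)})

# Sampling 100% of the rows without replacement is a shuffle;
# a fixed random_state gives the same order on every run.
first = toy.sample(frac=1, random_state=42).reset_index(drop=True)
second = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(first.equals(second))   # True
```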

In [89]:
# Count unique values per column

gr_meri_revo.nunique()

site           3
url_link    3550
author       273
game        2932
company      795
genre        282
platform     463
text        3546
score         50
dtype: int64
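
The counts show 3546 unique `text` values across 3550 rows, so a few reviews share identical text even though no full row is duplicated. A minimal sketch for listing such rows, on hypothetical data:

```python
import pandas as pd

# Hypothetical rows: two different games carrying the same review text.
toy = pd.DataFrame({
    'game': ['Going Under', 'Lost Ember', 'Kirby Fighters 2'],
    'text': ['same opening', 'same opening', 'different text'],
})

# Rows in excess of the number of distinct texts.
n_extra = len(toy) - toy['text'].nunique()

# keep=False flags every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated(subset='text', keep=False)]
print(n_extra)
print(dupes)
```

Inspecting the flagged rows in `gr_meri_revo` would show whether they are scraping errors or genuinely repeated reviews.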

## 3. Data standardization
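
This section has no cells in the notebook. The following is only a sketch of what a standardization step could look like, built from the observations in section 1. The function name `standardize`, the toy rows, and above all the `PLATFORM_NAMES` mapping are assumptions: the meaning of the Meristation codes is guessed, not confirmed by the source.

```python
import pandas as pd

# ASSUMPTION: guessed meanings for two Meristation platform codes.
PLATFORM_NAMES = {'XBO': 'Xbox One', 'NSW': 'Nintendo Switch'}

def expand_codes(platform):
    """Replace known codes in a space-separated string; keep unknown tokens."""
    if not isinstance(platform, str):
        return platform
    return ', '.join(PLATFORM_NAMES.get(tok, tok) for tok in platform.split())

def standardize(df):
    # Turn the 'None' placeholder string into a real missing value.
    out = df.mask(df.eq('None'))
    # Keep only the display name when the author is "Name\nhandle".
    out['author'] = out['author'].str.split('\n').str[0]
    # Drop the "Análisis de " prefix seen in some Revogamers titles.
    out['game'] = out['game'].str.replace(r'^Análisis de\s+', '', regex=True)
    # Only Meristation uses space-separated codes; leave other sites alone.
    is_meri = out['site'].eq('meristation')
    out.loc[is_meri, 'platform'] = out.loc[is_meri, 'platform'].apply(expand_codes)
    return out

# Two rows modelled on the head() outputs above.
toy = pd.DataFrame({
    'site': ['meristation', 'revogamers'],
    'author': ['Nacho Requena\nnachomol', 'Javier Aranda'],
    'game': ['Commandos 2 & Praetorians HD Remaster Double Pack',
             'Análisis de MO:Astray'],
    'company': ['Kalypso Media', 'None'],
    'platform': ['PC PS4 XBO', 'Nintendo Switch'],
})

clean = standardize(toy)
print(clean)
```

Codes without an entry in the mapping (for example `STD` or `IPH`) pass through unchanged, so the mapping can be extended once their meaning is verified.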

## 4. Data export

In [90]:
gr_meri_revo.to_csv('../data/gr_meri_revo_dataset_3550l.csv', index=False)

<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>