# Data Cleaning
---

## Now that we have web-scraped the user reviews for the top 100 games on each console, let's read in the data and combine it all into one master data frame.

In [5]:
import pandas as pd

In [246]:
ps4_reviews = pd.read_csv('../data/top_100_ps4_games.csv')
xboxone_reviews = pd.read_csv('../data/top_100_xboxone_games.csv')
switch_reviews = pd.read_csv('../data/top_100_switch_games.csv')
pc_reviews = pd.read_csv('../data/top_100_pc_games.csv')
xbox_series_x_reviews = pd.read_csv('../data/top_100_xbox-series-x_games.csv')
ps5_reviews = pd.read_csv('../data/top_100_ps5_games.csv')

In [247]:
all_console_reviews = pd.concat(
    [ps4_reviews, xboxone_reviews, switch_reviews, pc_reviews, xbox_series_x_reviews, ps5_reviews], 
    ignore_index=True)
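
The six `read_csv` calls share one file-name pattern, so the loading step could also be written as a loop. The sketch below is one possible arrangement rather than the notebook's own code: `CONSOLES` and `load_all_reviews` are hypothetical names, and the file-name pattern is assumed from the paths used above.

```python
from pathlib import Path

import pandas as pd

# Console slugs as they appear in the CSV file names used above.
CONSOLES = ["ps4", "xboxone", "switch", "pc", "xbox-series-x", "ps5"]


def load_all_reviews(data_dir, consoles=CONSOLES):
    """Read one CSV per console and stack them into a single data frame."""
    frames = [
        pd.read_csv(Path(data_dir) / f"top_100_{console}_games.csv")
        for console in consoles
    ]
    # ignore_index=True rebuilds a clean 0..n-1 index for the combined frame.
    return pd.concat(frames, ignore_index=True)
```

Calling `load_all_reviews('../data')` would then build the same kind of combined frame as `all_console_reviews`.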

### Now that we have a master data frame, let's look at the first few rows and check the shape to see how many rows and columns we are working with.

In [248]:
print(all_console_reviews.shape)
all_console_reviews.head()

(112345, 11)


Unnamed: 0,console,video_game_name,summary,developer,genre(s),num_players,esrb_rating,critic_score,avg_user_score,user_review,user_score
0,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,Rating: M,97,8.6,"\nThis site is a joke, this the first time whe...",9
1,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,Rating: M,97,8.6,Fair review of RDR2\r I'm almost 15% finished ...,7
2,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,Rating: M,97,8.6,I really wanted to love it. The over-world is ...,6
3,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,Rating: M,97,8.6,"\nBeautiful graphics, excellent voice acting, ...",7
4,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,Rating: M,97,8.6,This game is really overrated.\rThe amazing en...,7


## Now let's see how many null values are present and address them
---

In [249]:
all_console_reviews.isnull().sum()

console               0
video_game_name       0
summary               0
developer             0
genre(s)              0
num_players        8659
esrb_rating        6513
critic_score          0
avg_user_score     2729
user_review           0
user_score            0
dtype: int64
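
To put these counts in proportion, they can be divided by the total row count reported by `shape`. The sketch below only redoes that arithmetic with the numbers printed above; the variable names are illustrative. On the data frame itself, `all_console_reviews.isnull().mean()` would give the same fractions.

```python
# Row count and null counts copied from the outputs above.
total_rows = 112345
null_counts = {"num_players": 8659, "esrb_rating": 6513, "avg_user_score": 2729}

# Share of rows with a missing value, as a percentage rounded to two decimals.
null_share_pct = {
    column: round(100 * count / total_rows, 2)
    for column, count in null_counts.items()
}

for column, pct in null_share_pct.items():
    print(f"{column}: {pct}% missing")
```

Each of the three columns is missing in well under 10% of the rows.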

### Let's also look at what the datatypes are for the columns.

In [250]:
all_console_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112345 entries, 0 to 112344
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   console          112345 non-null  object 
 1   video_game_name  112345 non-null  object 
 2   summary          112345 non-null  object 
 3   developer        112345 non-null  object 
 4   genre(s)         112345 non-null  object 
 5   num_players      103686 non-null  object 
 6   esrb_rating      105832 non-null  object 
 7   critic_score     112345 non-null  int64  
 8   avg_user_score   109616 non-null  float64
 9   user_review      112345 non-null  object 
 10  user_score       112345 non-null  int64  
dtypes: float64(1), int64(2), object(8)
memory usage: 9.4+ MB


## First, let's take a look at the `num_players` column. We want to clean the values so that only the number remains, where applicable, and then address the null values.

In [251]:
all_console_reviews['num_players'].unique()

array(['# of players: Up to 32', '# of players: Up to 30',
       '# of players: No Online Multiplayer', '# of players: Up to 8',
       '# of players: Up to 16', '# of players: Up to 10',
       '# of players: 2', '# of players: Up to 5',
       '# of players: Up to 4', nan, '# of players: 1 Player',
       '# of players: Up to 12', '# of players: Up to 6',
       '# of players: Massively Multiplayer', '# of players: Up to 60',
       '# of players: Online Multiplayer', '# of players: Up to 64',
       '# of players: Up to 3', '# of players: Up to 22',
       '# of players: Up to 20', '# of players: Up to 24',
       '# of players: Up to more than 64', '# of players: 1-16',
       '# of players: 1-32', '# of players: Up to 18', '# of players:',
       '# of players: 1-8', '# of players: 1-2', '# of players: 1-4',
       '# of players: 64 Online', '# of players: 2 Online',
       '# of players: Up to 14', '# of players: Up to 40',
       '# of players: Friend System Only'], dtype=objec

### Replace text values to keep only the number of players. Games without a simple number need further investigation.

In [252]:
# Strip the label prefix and filler words; NaN (a float) is passed through unchanged.
all_console_reviews['num_players'] = \
[val.replace('# of players: ', '').replace('Up to ', '').replace(' Player', '').replace('more than ', '')
 if type(val) != float else val for val in all_console_reviews['num_players']]

In [253]:
# For ranges such as '1-16', keep the upper bound; all other values stay as they are.
vals = []
for val in all_console_reviews['num_players']:
    if type(val) != float:
        if '-' in val:
            vals.append(val.split('-')[1])
        else:
            vals.append(val)
    else:
        vals.append(val)

all_console_reviews['num_players'] = vals

In [254]:
all_console_reviews['num_players'].unique()

array(['32', '30', 'No Online Multiplayer', '8', '16', '10', '2', '5',
       '4', nan, '1', '12', '6', 'Massively Multiplayer', '60',
       'Online Multiplayer', '64', '3', '22', '20', '24', '18',
       '# of players:', '64 Online', '2 Online', '14', '40',
       'Friend System Only'], dtype=object)
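
The two cleaning cells above can be folded into one small function, which makes the rules easier to test on individual labels. This is a sketch of the same logic, not a replacement the notebook itself uses; `clean_num_players` is a hypothetical name.

```python
def clean_num_players(raw):
    """Reduce a raw '# of players: ...' label to its core value."""
    # Missing values arrive as float NaN; pass them through unchanged.
    if not isinstance(raw, str):
        return raw
    cleaned = raw
    # Same fragments, in the same order, as the chained replace calls above.
    for fragment in ("# of players: ", "Up to ", " Player", "more than "):
        cleaned = cleaned.replace(fragment, "")
    # For ranges such as '1-16', keep the upper bound.
    if "-" in cleaned:
        cleaned = cleaned.split("-")[1]
    return cleaned
```

It could be applied with `all_console_reviews['num_players'].map(clean_num_players)`. Note that the bare label `'# of players:'` has no trailing space, so it does not match the first fragment and survives this step, just as in the output above.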

### Now that we have done some initial cleaning, let's look more closely at the games with non-numeric `num_players` values:
- No Online Multiplayer
- Massively Multiplayer
- Online Multiplayer
- \# of players:
- Friend System Only

In [255]:
list(all_console_reviews[all_console_reviews['num_players'] == 'No Online Multiplayer']\
['video_game_name'].unique())[:10]

['Persona 5 Royal',
 'God of War',
 'The Last of Us Part II',
 'Persona 5',
 'Undertale',
 'The Witcher 3: Wild Hunt',
 'Shadow of the Colossus',
 'The Witcher 3: Wild Hunt - Blood and Wine',
 'Celeste',
 'NieR: Automata - Game of the YoRHa Edition']

### For 'No Online Multiplayer', the games sampled above are all single-player games, so we can change `num_players` to 1.

In [256]:
all_console_reviews['num_players'] = \
[val.replace('No Online Multiplayer', '1')
 if type(val) != float else val for val in all_console_reviews['num_players']]

In [257]:
all_console_reviews['num_players'].unique()

array(['32', '30', '1', '8', '16', '10', '2', '5', '4', nan, '12', '6',
       'Massively Multiplayer', '60', 'Online Multiplayer', '64', '3',
       '22', '20', '24', '18', '# of players:', '64 Online', '2 Online',
       '14', '40', 'Friend System Only'], dtype=object)

In [258]:
list(all_console_reviews[all_console_reviews['num_players'] == 'Massively Multiplayer']\
['video_game_name'].unique())[:]

['Final Fantasy XIV: Stormblood',
 'World of Warcraft',
 'World of Warcraft: Wrath of the Lich King',
 'World of Warcraft: The Burning Crusade',
 'World of Warcraft: Cataclysm',
 'Final Fantasy XIV: Endwalker']

### The 'Massively Multiplayer' games are MMOs (massively multiplayer online games), which can have player counts in the millions. To keep our values within a reasonable range, we will change these to a value of 150.

In [259]:
all_console_reviews['num_players'] = \
[val.replace('Massively Multiplayer', '150')
 if type(val) != float else val for val in all_console_reviews['num_players']]

In [260]:
all_console_reviews['num_players'].unique()

array(['32', '30', '1', '8', '16', '10', '2', '5', '4', nan, '12', '6',
       '150', '60', 'Online Multiplayer', '64', '3', '22', '20', '24',
       '18', '# of players:', '64 Online', '2 Online', '14', '40',
       'Friend System Only'], dtype=object)

In [261]:
list(all_console_reviews[all_console_reviews['num_players'] == 'Online Multiplayer']\
['video_game_name'].unique())[:]

['Dreams',
 'Divinity: Original Sin II - Definitive Edition',
 'Apex Legends',
 'Thronebreaker: The Witcher Tales',
 'Warframe',
 'Kingdom Two Crowns',
 'Half-Life',
 "Sid Meier's Civilization IV",
 'Microsoft Flight Simulator',
 'Spelunky 2',
 'Factorio',
 'The Pathless']

### The 'Online Multiplayer' games have varying numbers of players: some are MMOs, others are battle royales, and others merely have a simple deathmatch mode. For simplicity's sake we will change the value to 100, lower than 'Massively Multiplayer' since these games were not given that label.

In [262]:
all_console_reviews['num_players'] = \
[val.replace('Online Multiplayer', '100')
 if type(val) != float else val for val in all_console_reviews['num_players']]

In [263]:
all_console_reviews['num_players'].unique()

array(['32', '30', '1', '8', '16', '10', '2', '5', '4', nan, '12', '6',
       '150', '60', '100', '64', '3', '22', '20', '24', '18',
       '# of players:', '64 Online', '2 Online', '14', '40',
       'Friend System Only'], dtype=object)

In [264]:
list(all_console_reviews[all_console_reviews['num_players'] == '# of players:']\
['video_game_name'].unique())[:]

['The Sims']

### The Sims is a single-player game whose multiplayer aspects only revolve around downloading other players' creations. We will change this to a value of 1.

In [265]:
all_console_reviews['num_players'] = \
[val.replace('# of players:', '1')
 if type(val) != float else val for val in all_console_reviews['num_players']]

In [266]:
all_console_reviews['num_players'].unique()

array(['32', '30', '1', '8', '16', '10', '2', '5', '4', nan, '12', '6',
       '150', '60', '100', '64', '3', '22', '20', '24', '18', '64 Online',
       '2 Online', '14', '40', 'Friend System Only'], dtype=object)

In [267]:
list(all_console_reviews[all_console_reviews['num_players'] == 'Friend System Only']\
['video_game_name'].unique())[:]

['Control: Ultimate Edition']

### Control is a single-player game, so we change the value to 1.

In [268]:
all_console_reviews['num_players'] = \
[val.replace('Friend System Only', '1')
 if type(val) != float else val for val in all_console_reviews['num_players']]

In [269]:
all_console_reviews['num_players'].unique()

array(['32', '30', '1', '8', '16', '10', '2', '5', '4', nan, '12', '6',
       '150', '60', '100', '64', '3', '22', '20', '24', '18', '64 Online',
       '2 Online', '14', '40'], dtype=object)

### Now we can do a final cleanup of the values that contain 'Online', and then we are ready to inspect the null values.

In [270]:
all_console_reviews['num_players'] = \
[val.replace(' Online', '')
 if type(val) != float else val for val in all_console_reviews['num_players']]

In [271]:
all_console_reviews['num_players'].unique()

array(['32', '30', '1', '8', '16', '10', '2', '5', '4', nan, '12', '6',
       '150', '60', '100', '64', '3', '22', '20', '24', '18', '14', '40'],
      dtype=object)
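
The label-by-label replacements in this section could also be collected into a single lookup table, which keeps every judgment call (such as 150 for MMOs) in one place. This is a sketch of that alternative, not the notebook's code: `SPECIAL_LABELS` and `resolve_label` are hypothetical names, and the replacement values are the ones chosen above.

```python
# Replacement values chosen in the cells above, keyed by the exact label.
SPECIAL_LABELS = {
    "No Online Multiplayer": "1",
    "Massively Multiplayer": "150",
    "Online Multiplayer": "100",
    "# of players:": "1",
    "Friend System Only": "1",
}


def resolve_label(value):
    """Map a partially cleaned num_players value to a numeric string."""
    # Leave missing values (float NaN) for the null-handling step.
    if not isinstance(value, str):
        return value
    # Exact-match lookup, so the order of the entries does not matter.
    if value in SPECIAL_LABELS:
        return SPECIAL_LABELS[value]
    # '64 Online' and '2 Online' only need the suffix dropped.
    return value.replace(" Online", "")
```

Because the lookup matches whole labels rather than substrings, it avoids any dependence on the order in which the replacements run.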

In [272]:
all_console_reviews[all_console_reviews['num_players'].isnull()]['video_game_name'].unique()

array(['Final Fantasy XIV: Shadowbringers', 'INSIDE', 'Shovel Knight',
       'Tales From The Borderlands: Episode 5 - The Vault of the Traveler',
       'Monster Hunter: World - Iceborne', 'Rez Infinite', 'Bastion',
       'The Binding of Isaac: Rebirth',
       'Keep Talking and Nobody Explodes',
       'The Talos Principle: Deluxe Edition',
       'Nex Machina: Death Machine',
       'Guacamelee! Super Turbo Championship Edition',
       'Bloodborne: The Old Hunters', 'The Witness', 'Night in the Woods',
       'TowerFall Ascension', 'Psychonauts 2', 'NBA 2K17',
       'Destiny: The Taken King', 'Pro Evolution Soccer 2017',
       'Forza Horizon 3: Hot Wheels', 'DiRT Rally',
       'Killer Instinct Season 3', 'Destiny 2: Forsaken',
       'Dying Light: The Following', 'DiRT 4', 'Firewatch',
       'Tetris Effect: Connected', 'Chicory: A Colorful Tale',
       "Death's Door", 'DUSK', 'Portal 2', 'Warcraft III: Reign of Chaos',
       "Sid Meier's Alpha Centauri", 'Final Fantasy XIV: 

### The games above with a missing `num_players` value range from indie titles to massive AAA franchise games. A quick spot check of a few of them shows single-player games, so we will replace all null values in this column with 1. We will also convert the column to an int type.

In [273]:
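# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and which may not update the frame.
all_console_reviews['num_players'] = all_console_reviews['num_players'].fillna('1')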

In [274]:
all_console_reviews['num_players'] = all_console_reviews['num_players'].astype(int)
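
`astype(int)` raises a `ValueError` if any non-numeric label is still in the column, so it can help to list leftovers before casting. A minimal sketch, assuming the values are strings at this point; `non_numeric_labels` is a hypothetical helper, and the sample values are taken from intermediate outputs earlier in this section.

```python
def non_numeric_labels(values):
    """Return the distinct string values that are not plain digit strings."""
    return sorted({v for v in values if isinstance(v, str) and not v.isdigit()})


# Sample values as they looked partway through the cleaning.
sample = ["32", "1", "64 Online", "150", "Friend System Only", "2"]
print(non_numeric_labels(sample))
```

An empty list from `non_numeric_labels(all_console_reviews['num_players'])` would indicate that the cast is safe.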

In [275]:
all_console_reviews.isnull().sum()

console               0
video_game_name       0
summary               0
developer             0
genre(s)              0
num_players           0
esrb_rating        6513
critic_score          0
avg_user_score     2729
user_review           0
user_score            0
dtype: int64

## Now let's take a look at the `esrb_rating` column. We want to clean the values and then address the null values. For further information on ESRB ratings, see the [ESRB Rating Guide](https://www.esrb.org/ratings-guide/).

In [276]:
all_console_reviews['esrb_rating'].unique()

array(['Rating: M', 'Rating: T', 'Rating: E', 'Rating: E10+', nan,
       'Rating: K-A'], dtype=object)

### We want to extract only the rating itself and get rid of any extra text.

In [277]:
all_console_reviews['esrb_rating'] = \
[val.replace('Rating: ', '')
 if type(val) != float else val for val in all_console_reviews['esrb_rating']]

In [278]:
all_console_reviews['esrb_rating'].unique()

array(['M', 'T', 'E', 'E10+', nan, 'K-A'], dtype=object)
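
Because missing values are floats, the list comprehension needs the `type(val) != float` guard. pandas' vectorized string methods skip missing values on their own, so the same cleanup could be sketched as follows. The sample series is made up for illustration, and `regex=False` makes the pattern a literal string.

```python
import pandas as pd

# Small stand-in for the esrb_rating column, including one missing value.
ratings = pd.Series(["Rating: M", "Rating: E10+", float("nan"), "Rating: K-A"])

# .str.replace leaves missing entries as missing, so no type check is needed.
cleaned = ratings.str.replace("Rating: ", "", regex=False)

print(cleaned.tolist())
```

On the real data this would read `all_console_reviews['esrb_rating'].str.replace('Rating: ', '', regex=False)`.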

In [279]:
all_console_reviews[all_console_reviews['esrb_rating'].isna()]

Unnamed: 0,console,video_game_name,summary,developer,genre(s),num_players,esrb_rating,critic_score,avg_user_score,user_review,user_score
21860,ps4,Injustice 2: Legendary Edition,\nNetherRealm Studios,NetherRealm Studios,"Genre(s): Action, Fighting, 2D",2,,88,7.7,I normally do not buy fighting games but i did...,10
21861,ps4,Injustice 2: Legendary Edition,\nNetherRealm Studios,NetherRealm Studios,"Genre(s): Action, Fighting, 2D",2,,88,7.7,Probably the best DC superhero fighting game i...,9
21862,ps4,Injustice 2: Legendary Edition,\nNetherRealm Studios,NetherRealm Studios,"Genre(s): Action, Fighting, 2D",2,,88,7.7,\nOh how I long waited for this legendary edit...,10
21863,ps4,Injustice 2: Legendary Edition,\nNetherRealm Studios,NetherRealm Studios,"Genre(s): Action, Fighting, 2D",2,,88,7.7,\nSo this game looks amazing. My big problem i...,6
21864,ps4,Injustice 2: Legendary Edition,\nNetherRealm Studios,NetherRealm Studios,"Genre(s): Action, Fighting, 2D",2,,88,7.7,"Mudei minha nota, pois corrigiram diversos bug...",7
...,...,...,...,...,...,...,...,...,...,...,...
112340,ps5,Salt and Sacrifice,Salt and Sacrifice is the follow-up to Souls-l...,Ska Studios,"Genre(s): Role-Playing, Action RPG, Action, Pl...",6,,76,5.5,Edit: I got used to the controls and the game ...,7
112341,ps5,Salt and Sacrifice,Salt and Sacrifice is the follow-up to Souls-l...,Ska Studios,"Genre(s): Role-Playing, Action RPG, Action, Pl...",6,,76,5.5,"\nIf it ain't broke, don't fix it. That about ...",7
112342,ps5,Salt and Sacrifice,Salt and Sacrifice is the follow-up to Souls-l...,Ska Studios,"Genre(s): Role-Playing, Action RPG, Action, Pl...",6,,76,5.5,\nWould give it a higher rating but they made ...,7
112343,ps5,Salt and Sacrifice,Salt and Sacrifice is the follow-up to Souls-l...,Ska Studios,"Genre(s): Role-Playing, Action RPG, Action, Pl...",6,,76,5.5,\nUnfortunately a big step back from Salt and ...,0
