In [74]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Data understanding

Source: the dataset of IGN game reviews shared on r/datasets: https://www.reddit.com/r/datasets/comments/2awdgx/i_made_this_dataset_of_all_of_igns_game_reviews/

In [52]:
meta = pd.read_csv('./data/IGN_data.csv')
meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17534 entries, 0 to 17533
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Game        17534 non-null  object 
 1   Platform    17534 non-null  object 
 2   Score       17534 non-null  float64
 3   Genre       17534 non-null  object 
 4   Unnamed: 4  0 non-null      float64
dtypes: float64(2), object(3)
memory usage: 685.0+ KB


In [73]:
meta.isna().sum()

Game        0
Platform    0
Score       0
Genre       0
dtype: int64

In [53]:
meta.drop(columns='Unnamed: 4', inplace=True)

In [54]:
meta.Game.nunique()

12208

In [55]:
meta.Platform.nunique()

56

In [56]:
meta.Genre.nunique()

96

The dataset has 12208 distinct game titles, 56 different consoles/platforms, and 96 genre labels.
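
The three `nunique()` calls above can also be collapsed into one. A minimal sketch on a made-up toy frame (the `toy` rows are illustrative, not taken from the IGN data):

```python
import pandas as pd

# Toy frame standing in for `meta`; the rows are invented for illustration.
toy = pd.DataFrame({
    "Game": ["Cars", "Cars", "Tetris"],
    "Platform": ["PC", "Wii", "Wireless"],
    "Genre": ["Racing", "Racing", "Puzzle"],
})

# DataFrame.nunique() returns the distinct count of every selected column.
unique_counts = toy[["Game", "Platform", "Genre"]].nunique()
print(unique_counts)
```

On the real data the same call would be `meta[['Game', 'Platform', 'Genre']].nunique()`.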

In [57]:
meta['Platform'].head(20)

0             Xbox One
1                Wii U
2        PlayStation 3
3        PlayStation 4
4                   PC
5             Xbox One
6             Xbox 360
7                   PC
8     PlayStation Vita
9                   PC
10       PlayStation 4
11    PlayStation Vita
12                  PC
13       PlayStation 4
14                  PC
15       PlayStation 3
16              iPhone
17                  PC
18              iPhone
19       PlayStation 4
Name: Platform, dtype: object

The dataset includes phone games too.

In [58]:
meta['Platform'].value_counts()

PC                      3026
PlayStation 2           1683
Xbox 360                1582
Wii                     1347
PlayStation 3           1295
Nintendo DS             1040
PlayStation              952
Wireless                 905
Xbox                     822
iPhone                   815
PlayStation Portable     625
Game Boy Advance         620
GameCube                 509
Game Boy Color           356
Nintendo 64              301
Dreamcast                286
Nintendo DSi             255
Nintendo 3DS             173
PlayStation Vita         114
iPad                      94
Lynx                      82
Wii U                     78
Macintosh                 70
Genesis                   58
PlayStation 4             47
NES                       46
TurboGrafx-16             39
Xbox One                  35
Android                   33
NeoGeo Pocket Color       31
N-Gage                    30
Super NES                 28
Game Boy                  22
Sega 32X                  18
iPod          

I will drop games on the least popular consoles.


I will try to keep consoles with at least 20 games.

- iPad and iPod games should be compatible with the iPhone, so I will relabel those as iPhone.

- The main difference between the Nintendo DS and the DSi is the camera the DSi includes. Both run the same games, so DSi games are treated as DS games and I will relabel DSi as DS.


Not sure what the `Wireless` console is.

In [59]:
meta[meta['Platform'] == 'Wireless']

Unnamed: 0,Game,Platform,Score,Genre
2832,The Sims 3,Wireless,7.5,Simulation
5319,Fast &amp; Furious,Wireless,8.0,Racing
5476,Zombie Infection,Wireless,8.5,Action
5773,Castle of Magic,Wireless,8.5,Action
5786,Far Cry 2,Wireless,7.5,Shooter
...,...,...,...,...
13147,Lilo &amp; Stitch: Space Escape,Wireless,3.0,Shooter
13151,Defender,Wireless,2.0,Shooter
13209,Intellivision Astrosmash,Wireless,5.0,Shooter
13213,Tetris,Wireless,6.5,Puzzle


Per the IGN site ([example here](https://www.ign.com/games/lilo-stitch-space-escape)), these seem to be mobile games. That game was released in 2003, before Android or iOS, so the 905 `Wireless` entries are games for older mobile operating systems.

Game names show `&amp;` where `&` should appear; I will replace it with `&`.

In [60]:
meta[meta['Game'] == 'Football Manager 2014']

Unnamed: 0,Game,Platform,Score,Genre
285,Football Manager 2014,PC,8.0,"Sports, Simulation"
286,Football Manager 2014,Macintosh,8.0,"Sports, Simulation"
287,Football Manager 2014,Linux,8.0,"Sports, Simulation"


It seems each game shows up once per platform it was released on.
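
One way to check that hunch beyond a single title is to compare rows per game with distinct platforms per game. This is only a sketch on invented rows; the `toy` frame and the variable names are assumptions, not part of the notebook:

```python
import pandas as pd

# Toy rows mimicking the Football Manager 2014 pattern seen above.
toy = pd.DataFrame({
    "Game": ["Football Manager 2014"] * 3 + ["Tetris"],
    "Platform": ["PC", "Macintosh", "Linux", "Wireless"],
    "Score": [8.0, 8.0, 8.0, 6.5],
})

# A repeated (Game, Platform) pair would be a true duplicate row.
exact_dupes = int(toy.duplicated(subset=["Game", "Platform"]).sum())

# If every game has as many rows as distinct platforms,
# each row really is "one game on one platform".
per_game = toy.groupby("Game")["Platform"].agg(["count", "nunique"])
one_row_per_platform = bool((per_game["count"] == per_game["nunique"]).all())

print(exact_dupes, one_row_per_platform)
```

If `one_row_per_platform` came out `False` on the real data, some titles would have repeated platform entries worth inspecting before building any similarity matrix.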

In [64]:
meta['Game'].value_counts()

Cars                                                  10
Madden NFL 07                                         10
Brain Challenge                                        9
Madden NFL 08                                          9
Ratatouille                                            9
                                                      ..
Assault Heroes 2                                       1
Defend Your Castle                                     1
PixelJunk Monsters Encore                              1
LostWinds                                              1
The Walking Dead: The Game -- Episode 1: A New Day     1
Name: Game, Length: 12208, dtype: int64

In [70]:
meta['Score'].unique()

array([ 7.8,  9. ,  8.7,  7.5,  5.4,  7. ,  6.3,  6.4,  6.8,  7.4,  4.5,
        7.7,  3.8,  7.6,  5.8,  9.3,  6.9,  6. ,  9.6,  8. ,  8.6,  3.5,
        8.2,  7.1,  9.2,  7.3,  8.1,  5. ,  8.5,  7.9,  8.9,  3. ,  9.5,
        6.5,  5.7,  6.6,  7.2,  9.1,  5.2,  4. ,  5.6,  8.3,  5.5,  4.8,
        4.3,  2.7,  8.8,  5.9,  8.4,  9.4,  6.2,  4.7,  5.1,  4.4,  9.8,
       10. ,  6.7,  6.1,  4.9,  2.8,  5.3,  4.1,  4.2,  3.7,  3.4,  4.6,
        2.5,  2.3,  3.9,  2. ,  1. ,  1.5,  9.7,  3.1,  2.6,  3.3,  3.6,
        2.2,  2.1,  0.8,  1.9,  3.2,  1.4,  2.9,  1.7,  1.2,  2.4,  9.9,
        0.5,  1.1,  1.3,  0.7,  1.8])

IGN rates software on a scale from 0 to 10, with 10 being the best.

In [72]:
meta['Genre'].value_counts()

Action                  3628
Sports                  1854
Shooter                 1472
Racing                  1189
Strategy                1013
                        ... 
Sports, Fighting           1
Baseball                   1
Sports, Other              1
Adventure, Adventure       1
Sports, Editor             1
Name: Genre, Length: 96, dtype: int64

There is overlap here: some games have more than a single genre.
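
To see how often each individual genre occurs, the combined labels can be split apart. A rough sketch, assuming genres are always separated by a comma as in the output above (the `genres` Series is made up for illustration):

```python
import pandas as pd

# Invented labels in the same style as the Genre column.
genres = pd.Series(
    ["Action", "Sports, Simulation", "Adventure, Adventure", "Sports"],
    name="Genre",
)

# Split on commas, put each genre on its own row, and trim spaces.
exploded = genres.str.split(",").explode().str.strip()

# explode() keeps the original row index, so repeated genres inside
# one row (e.g. "Adventure, Adventure") can be dropped per row.
deduped = exploded.reset_index().drop_duplicates()

genre_counts = deduped["Genre"].value_counts()
print(genre_counts)
```

Whether to dedupe within a row is a judgment call; here a label like "Adventure, Adventure" is counted as a single Adventure game.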

## Data Preparation
**Cleaning**

In [61]:
#combining different platforms that share the same games
meta.loc[meta['Platform'] == 'iPad', 'Platform']='iPhone'
meta.loc[meta['Platform'] == 'iPod', 'Platform']='iPhone'
meta.loc[meta['Platform'] == 'Nintendo DSi', 'Platform']='Nintendo DS'
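
The same relabeling can be written as a single mapping, which is easier to extend if more platforms need merging later. A small sketch on toy data; `platform_aliases` is a hypothetical name, not something defined in the notebook:

```python
import pandas as pd

# Hypothetical alias table: platform label -> label to merge it into.
platform_aliases = {
    "iPad": "iPhone",
    "iPod": "iPhone",
    "Nintendo DSi": "Nintendo DS",
}

toy = pd.DataFrame({"Platform": ["iPad", "iPod", "Nintendo DSi", "PC", "iPhone"]})

# Series.replace with a dict swaps exact matching values and
# leaves every other label untouched.
toy["Platform"] = toy["Platform"].replace(platform_aliases)
print(toy["Platform"].tolist())
```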

In [62]:
# replacing '&amp;' with '&' in the Game column (literal match, not a regex)
meta['Game'] = meta['Game'].str.replace("&amp;", "&", regex=False)
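
If other HTML entities turn up in the titles, the standard library's `html.unescape` handles them all at once. This is an alternative to the replace above, shown on made-up titles rather than on `meta`:

```python
import html

import pandas as pd

# Invented sample of titles, two of them with an escaped ampersand.
games = pd.Series(["Fast &amp; Furious", "Lilo &amp; Stitch: Space Escape", "Tetris"])

# html.unescape converts entities such as &amp; back to characters.
# na_action="ignore" skips missing values instead of passing them in.
cleaned = games.map(html.unescape, na_action="ignore")
print(cleaned.tolist())
```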

In [63]:
# subsetting to consoles with at least 20 games (.copy() avoids chained-assignment warnings later)
meta_clean = meta[meta['Platform'].map(meta['Platform'].value_counts()) >= 20].copy()
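
For reference, the same kind of filter can be expressed with a groupby transform, which attaches each platform's row count to every row. A sketch on toy data; `min_games` is a hypothetical parameter, set low here to suit the tiny example:

```python
import pandas as pd

# Invented rows: PC has 3 games, Lynx has 2, iPod has 1.
toy = pd.DataFrame({
    "Game": list("abcdef"),
    "Platform": ["PC", "PC", "PC", "Lynx", "Lynx", "iPod"],
})

min_games = 2  # hypothetical threshold, lowered for the toy data

# transform("count") returns, for each row, the size of its platform group.
platform_sizes = toy.groupby("Platform")["Platform"].transform("count")

# Keep only rows whose platform meets the threshold.
toy_clean = toy[platform_sizes >= min_games].copy()
print(toy_clean["Platform"].value_counts())
```

After a filter like this, `value_counts()` on the result is a quick check that no platform below the threshold survived.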