# Video Game Sales Analysis: A Comparison between Professional and Audience Reviews

## Introduction

In the modern age, the internet has set down deep roots within society, letting people communicate their experiences and opinions across the globe. One interesting phenomenon that has reared its head over the last couple of decades is the gap between the opinions of professional critics and those of the average consumer when it comes to reviewing entertainment.

If we accept the theory that the quality of a piece of entertainment, as reflected in its reviews, correlates with its sales, then whose reviews track sales more closely: the critics' or the consumers'?

In this study, I am going to dive into the video game market and study whether there is a correlation between sales and the review scores of both consumers and critics.

I will be documenting each step of my study, for educational purposes.

***

## The Datasets

***Global Video Game Sales (vgsales)***: The original dataset only had games up to 2016. For this analysis, I will only be looking at games made up to 2016.

***Metacritic Dataset***: This dataset collates a list of games with their user review scores and Metacritic critic scores.



***

## Issues that could affect the overall analysis

It's important to keep

## The Analysis

Pull in your usual libraries.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Let's pull in the first dataset, the VGsales dataset.

Now, the datasets should already be clean, as they were highly rated on Kaggle. However, to be sure, and for my own practice, let's do some preliminary checks to make sure there's nothing abnormal about the set.

It's also worth noting that my analysis may require the dataset to be organised differently from how it is provided.

Let's explore the dataset to have a look.

In [3]:
vgsales_data = pd.read_csv('Data/vgsales.csv')
vgsales_data

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


So, our columns to work with are:

- Rank
- Name
- Platform
- Year
- Genre
- Publisher
- NA_Sales
- EU_Sales
- JP_Sales
- Other_Sales
- Global_Sales

Since we are looking at how critic reviews and audience reviews relate to a game's sales, a few columns raise interesting questions. These are also things to look at later on with the dataset.

1. Do platforms the games are released on heavily affect the reviews? Is there a large swing in the IQR between the platforms?
2. Do we want to take a more high-level overview of the data, and compile all platforms together? Will doing so lose the specific context provided by keeping the platforms separate?
3. Do games of the same name with a variety of release years affect review scores?
4. Do these reviews change from region to region?

I think for our first goal, it may be smart to look at the high level data, potentially losing the nuance between the platforms and the release dates. However, after we've analysed it from a high-level point of view, I will be interested in diving deeper into the data and looking into the nuances.

For now, let's continue with checking the integrity of this dataset.

In [34]:
## Check the beginning of the dataframe

vgsales_data.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [5]:
## Check the end of the dataframe

vgsales_data.tail()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.0,0.0,0.0,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.0,0.0,0.0,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.0,0.0,0.0,0.0,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.0,0.01,0.0,0.0,0.01
16597,16600,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,0.0,0.0,0.0,0.01


In [6]:
## Check if there are any missing values

vgsales_data.isnull().sum()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64
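The raw counts above are easier to judge as a share of the rows, and it can help to look at the affected rows directly. Here is a minimal sketch on a made-up frame; `toy`, its rows, and the names `missing_share` and `rows_with_gaps` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for vgsales_data; the rows are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, np.nan, 1985.0, 2010.0],
    "Publisher": ["Pub X", "Pub Y", None, "Pub X"],
})

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = toy.isnull().mean()

# Rows with at least one missing value, for a closer look.
rows_with_gaps = toy[toy.isnull().any(axis=1)]

print(missing_share)
print(rows_with_gaps)
```

The same two expressions should work on `vgsales_data` unchanged, since they only rely on `isnull()`.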

In [7]:
## Let's use df.describe to check for any outliers

vgsales_data.describe()

Unnamed: 0,Rank,Year,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
count,16598.0,16327.0,16598.0,16598.0,16598.0,16598.0,16598.0
mean,8300.605254,2006.406443,0.264667,0.146652,0.077782,0.048063,0.537441
std,4791.853933,5.828981,0.816683,0.505351,0.309291,0.188588,1.555028
min,1.0,1980.0,0.0,0.0,0.0,0.0,0.01
25%,4151.25,2003.0,0.0,0.0,0.0,0.0,0.06
50%,8300.5,2007.0,0.08,0.02,0.0,0.01,0.17
75%,12449.75,2010.0,0.24,0.11,0.04,0.04,0.47
max,16600.0,2020.0,41.49,29.02,10.22,10.57,82.74
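The summary above reports a maximum `Year` of 2020, while the analysis is meant to cover games up to 2016. Here is a minimal sketch of how such a cutoff could be applied, using a made-up frame with the same column names; `toy` and `CUTOFF_YEAR` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for vgsales_data; the rows are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, 2016.0, 2020.0, np.nan],
    "Global_Sales": [1.20, 0.50, 0.29, 0.10],
})

CUTOFF_YEAR = 2016  # hypothetical constant, taken from the dataset description

# NaN <= 2016 evaluates to False, so rows with a missing Year are dropped too.
up_to_cutoff = toy[toy["Year"] <= CUTOFF_YEAR]

# Keep the missing-year rows explicitly if they should survive the filter.
up_to_cutoff_or_unknown = toy[(toy["Year"] <= CUTOFF_YEAR) | toy["Year"].isna()]

print(up_to_cutoff["Name"].tolist())
print(up_to_cutoff_or_unknown["Name"].tolist())
```

Note that the plain comparison silently removes rows with no year, so the choice between the two filters depends on how the missing years are handled.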


In [8]:
## Check the data types in each column, and make sure they are the types they need to be.

vgsales_data.dtypes

Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object

The only discrepancy here is that years are generally better stored as integers than as floats. The column has become a float, however, because it contains NaNs.

So we need to think about the best practice here. Do we remove the rows with no year values? Is it perhaps better to insert a placeholder?

For now, let's shelve this issue in our "To-do list" and check the integrity of the rest of the dataset.
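For reference when we come back to this, here is a minimal sketch of two options, shown on a made-up frame rather than the real data. It assumes pandas' nullable integer dtype (`"Int64"`, with a capital I) is acceptable for the later analysis; `toy`, `nullable`, and `dropped` are names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Year column; the rows are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year": [2006.0, np.nan, 1985.0],
})

# Option 1: keep every row and use the nullable integer dtype.
# Missing years become pd.NA instead of forcing the column to float.
nullable = toy.copy()
nullable["Year"] = nullable["Year"].astype("Int64")

# Option 2: drop rows without a year, then use a plain int64 column.
dropped = toy.dropna(subset=["Year"]).copy()
dropped["Year"] = dropped["Year"].astype("int64")

print(nullable["Year"].dtype, dropped["Year"].dtype)
```

Option 1 keeps all rows for the sales analysis at the cost of a less common dtype. Option 2 is simpler but discards rows.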

In [9]:
## Count how many rows share each name.

vgsales_data['Name'].value_counts()

Name
Need for Speed: Most Wanted                12
Ratatouille                                 9
FIFA 14                                     9
LEGO Marvel Super Heroes                    9
Madden NFL 07                               9
                                           ..
Ar tonelico Qoga: Knell of Ar Ciel          1
Galaga: Destination Earth                   1
Nintendo Presents: Crossword Collection     1
TrackMania: Build to Race                   1
Know How 2                                  1
Name: count, Length: 11493, dtype: int64

This could be another issue. As we can see above, there are multiple rows with the same game name. We should explore some of these and see what the rest of each row says. I theorise that this is because each platform gets its own row, but we need to check.
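One way to test this theory without reading rows one by one is to compare, for each name, the number of rows with the number of distinct platforms. This is a sketch on made-up data; `toy`, `summary`, and `unexplained` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for vgsales_data; the rows are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game A", "Game B"],
    "Platform": ["PS2", "X360", "X360", "DS"],
    "Year": [2005, 2005, 2012, 2010],
})

# Rows per name versus distinct platforms per name.
summary = toy.groupby("Name").agg(
    rows=("Platform", "size"),
    platforms=("Platform", "nunique"),
)

# Names where rows > platforms have repeats that platform alone does not
# explain, for example a second release on the same console in a later year.
unexplained = summary[summary["rows"] > summary["platforms"]]
print(unexplained)
```

If `unexplained` is empty on the real data, platform alone accounts for every repeated name. Otherwise, the release year is a likely second factor to check.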

In [22]:
## Create a list of games whose name appears more than 1 time.
name_duplicates = vgsales_data['Name'].value_counts()
name_duplicates = name_duplicates[name_duplicates > 1].index.tolist()

## Create a copy of the main table, filtering by names that appear on the name_duplicates list above.
## We take an explicit copy to preserve the "raw" data in the dataset, in case we need to correct any mistakes.
vgsales_filtered_dups = vgsales_data[vgsales_data['Name'].isin(name_duplicates)].copy()
vgsales_filtered_dups

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
5,6,Tetris,GB,1989.0,Puzzle,Nintendo,23.20,2.26,4.22,0.58,30.26
16,17,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.40
17,18,Grand Theft Auto: San Andreas,PS2,2004.0,Action,Take-Two Interactive,9.43,0.40,0.41,10.57,20.81
18,19,Super Mario World,SNES,1990.0,Platform,Nintendo,12.78,3.75,3.54,0.55,20.61
...,...,...,...,...,...,...,...,...,...,...,...
16586,16589,Secret Files 2: Puritas Cordis,DS,2009.0,Adventure,Deep Silver,0.00,0.01,0.00,0.00,0.01
16591,16594,Myst IV: Revelation,PC,2004.0,Adventure,Ubisoft,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01


The issue is that we still have 7880 rows, which is too many to look through one by one. We could just output this to an Excel spreadsheet and check it manually, and if this weren't a training project, that's the step I would take. Instead, for the sake of practice, I will try to utilise Pandas more.

Let's start by looking at the games with the highest counts first: Need for Speed: Most Wanted, Ratatouille, and FIFA 14.

In [33]:
print(vgsales_filtered_dups[vgsales_filtered_dups['Name'] == "Need for Speed: Most Wanted"])
print("-------------------------------------------------")
print(vgsales_filtered_dups[vgsales_filtered_dups['Name'] == "Ratatouille"])
print("-------------------------------------------------")
print(vgsales_filtered_dups[vgsales_filtered_dups['Name'] == "FIFA 14"])

        Rank                         Name Platform    Year   Genre  \
252      253  Need for Speed: Most Wanted      PS2  2005.0  Racing   
498      499  Need for Speed: Most Wanted      PS3  2012.0  Racing   
1173    1175  Need for Speed: Most Wanted     X360  2012.0  Racing   
1530    1532  Need for Speed: Most Wanted     X360  2005.0  Racing   
1742    1744  Need for Speed: Most Wanted      PSV  2012.0  Racing   
2005    2007  Need for Speed: Most Wanted       XB  2005.0  Racing   
3585    3587  Need for Speed: Most Wanted       GC  2005.0  Racing   
5900    5902  Need for Speed: Most Wanted       PC  2005.0  Racing   
6149    6151  Need for Speed: Most Wanted     WiiU  2013.0  Racing   
6278    6280  Need for Speed: Most Wanted       DS  2005.0  Racing   
6492    6494  Need for Speed: Most Wanted      GBA  2005.0  Racing   
11676  11678  Need for Speed: Most Wanted       PC  2012.0  Racing   

             Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  \
252    Electronic Ar

It's as we theorised: each row holds the data for a different platform. This goes hand in hand with our task of fixing the years, since the same name can appear with different release years, and with the question of whether to convert the years to integers and remove the rows with null years. Again, let's add this to our To-do list, and make the list visible.

**To-do List**
- *Make a decision on how to deal with year data types*
- *Do we merge and combine all platforms into one, even with different years?*
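To make the second item concrete, here is a minimal sketch of what combining platforms could look like, on made-up data. The aggregation choices (earliest year, summed sales) and the names `toy`, `combined`, `First_Year`, and `Platforms` are assumptions for illustration, not decisions made in this analysis.

```python
import pandas as pd

# Toy stand-in for vgsales_data; the rows are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B"],
    "Platform": ["PS2", "X360", "DS"],
    "Year": [2005, 2012, 2010],
    "Global_Sales": [1.5, 0.5, 0.25],
})

# One row per name: earliest release year, platform count, total sales.
combined = (
    toy.groupby("Name", as_index=False)
       .agg(
           First_Year=("Year", "min"),
           Platforms=("Platform", "nunique"),
           Global_Sales=("Global_Sales", "sum"),
       )
)
print(combined)
```

Grouping on `Name` alone also folds together same-named releases from different years, such as the 2005 and 2012 rows for Need for Speed: Most Wanted above, so grouping on both name and year may be the safer variant.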