# Information provided

This dataset contains a list of video games with sales greater than 100,000 copies. Fields include:
- Rank - Ranking of overall sales
- Name - Name of the game
- Platform - Platform of the game (e.g. PC, PS4, XOne)
- Genre - Genre of the game
- ESRB Rating - ESRB rating of the game
- Publisher - Publisher of the game
- Developer - Developer of the game
- Critic Score - Critic score of the game, out of 10
- User Score - User score of the game, out of 10
- Total Shipped - Total shipped copies of the game
- Global_Sales - Total worldwide sales (in millions)
- NA_Sales - Sales in North America (in millions)
- PAL_Sales - Sales in Europe (in millions)
- JP_Sales - Sales in Japan (in millions)
- Other_Sales - Sales in the rest of the world (in millions)
- Year - Year of release of the game

# Remarks

https://en.wikipedia.org/wiki/Entertainment_Software_Rating_Board

The ESRB was officially launched on September 16, 1994; its system consisted of five age-based ratings; "Early Childhood", "Kids to Adults" (later renamed "Everyone" in 1998), "Teen", "Mature", and "Adults Only".

Former ratings: Early Childhood (EC), Kids to Adults (K–A)

Rating Pending (RP): This symbol is used in promotional materials for games which have not yet been assigned a final rating by the ESRB.

The VGChartz Network is a network of five video game websites - VGChartz, gamrFeed, gamrReview, gamrTV and gamrConnect. VGChartz itself sits at the centre of the VGChartz Network and is a video game sales tracking website that provides weekly sales figures of console software and hardware by region. The site was launched in June 2005 and is owned by Brett Walton.[1][2] Employing ten people,[3] VGChartz provides tools for worldwide data analysis and regular worldwide written analysis of the data it provides.

VGChartz provides tools for data analysis and charting and regular written analysis of the data referencing major news in the video gaming industry. Sales figures on VGChartz are based on estimates extrapolated from small retail samples.[4] While offering some information about their methodology through their website,[5] VGChartz does not publish any sources on how they get their data. Some sites, including Gamasutra and Wired News, have questioned the reliability of the information presented by the site.

VGChartz receives a large amount of criticism from websites and editors over how they obtain and verify their data. VGChartz' official statement is that they "use a number of proprietrary [sic] and ever-developing methods". Detractors note how the website is ad-supported and claims to track every country's individual hardware and software sales to a single digit, something not even Microsoft or Sony are always capable of achieving.[20]

https://en.wikipedia.org/wiki/VGChartz

- Total Shipped - Total shipped copies of the game, also in millions
- A Game Publisher finances a video game's development, handles the marketing and release of the game, and gets the returns on its sales.

# Data quality issues

The provider of the dataset states that "this dataset contains a list of video games with sales greater than 100,000 copies".

We cannot confirm that this statement holds for the dataset, i.e. that sales exceed 100,000 copies for every video game. It is also not clear what a given sales figure refers to: are video games with the same name and developer that were released on different platforms one game or several? Even if we allow for the possibility that such releases count as one game and only their overall sales exceed 100,000 copies, the statement cannot be confirmed. Some video games have no sales data whatsoever, and some clearly have fewer than 100,000 sales.

Several video games are listed as having sold exactly 100,000 copies. The first and smaller problem is that 100,000 is not "greater than" 100,000. The larger problem is that this figure calls the accuracy of the data into question. Is it plausible that several video games sold exactly 100,000 copies? Clearly not, which raises the question of what level of precision was used when the data were collected.

There are no video games with sales figures of 102,000 or 103,000 copies in the dataset. This suggests that the figures are rounded to 10,000 copies at best, which means that for a game near the 100,000 threshold the dataset cannot tell us whether actual sales were higher or lower than the stated figure by up to roughly 10%. For a video game developer, this ±10% could make the difference between a successful and a loss-making product. Software development has a very large share of fixed costs; in this industry the majority of costs are probably fixed. Once a product has been developed, it does not cost significantly more to produce and sell 100,000, 1 million or even 10 million copies, at least if the game is distributed as a download, for example on PC. (On other platforms, e.g. Nintendo consoles, each additional copy may cost some money, but this is negligible compared to the overall development cost.)
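
The granularity argument can be checked directly. The following is a minimal sketch rather than part of the original analysis: the helper name `share_on_grid` is hypothetical, and the sample values are a handful of figures taken from the summary statistics further down in this notebook. In practice one would pass the full `copies` column instead.

```python
def share_on_grid(values, step):
    """Return the fraction of non-missing values that are exact multiples of `step`."""
    values = [v for v in values if v is not None]
    if not values:
        return 0.0
    on_grid = sum(1 for v in values if round(v) % step == 0)
    return on_grid / len(values)


# Small illustrative sample: minimum, quartiles and maximum reported by describe(),
# plus the 100,000 figure discussed above.
sample_copies = [10_000, 50_000, 100_000, 160_000, 450_000, 82_860_000]

print(share_on_grid(sample_copies, 10_000))   # share of values on a 10,000-copy grid
print(share_on_grid(sample_copies, 100_000))  # share of values on a 100,000-copy grid
```

If nearly every value sits on the 10,000-copy grid, the figures cannot resolve differences smaller than that step.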

By definition, top-selling "blockbuster" products are rare, and this dataset shows it clearly.

One of the most fundamental problems of such pre-filtered datasets is survivorship bias. From the perspective of a video game publisher, the data on video games that are not in the dataset might be just as relevant as the data on those that are. We can also hypothesize that far more video games were left out than included, which means that we have no data on the majority of the market.

Naturally, the games that were left out were mostly failures, and we have no idea about them, even though these are the failures a game publisher would likely be trying to avoid. These missing data may also hide relevant facts about the market, facts that could easily turn insights derived from this dataset upside down.
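
A toy illustration of survivorship bias follows. All numbers are invented for the example and none come from the dataset; the point is only that filtering on a sales threshold raises the apparent average and hides the share of titles that fell short.

```python
from statistics import mean

# Hypothetical sales in copies for ten titles; purely illustrative values.
all_titles = [2_000, 5_000, 8_000, 12_000, 20_000,
              40_000, 90_000, 150_000, 600_000, 3_000_000]
THRESHOLD = 100_000

# Only titles above the threshold would appear in a pre-filtered dataset.
survivors = [s for s in all_titles if s > THRESHOLD]

mean_all = mean(all_titles)
mean_survivors = mean(survivors)
share_hidden = 1 - len(survivors) / len(all_titles)

print(f"mean of all titles:           {mean_all:,.0f}")
print(f"mean of surviving titles:     {mean_survivors:,.0f}")
print(f"share of titles filtered out: {share_hidden:.0%}")
```

An analyst who sees only the survivors would overestimate typical sales and would not know how many titles were excluded.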

According to this estimate, the number of currently available video games is well over 1 million:
https://gamingshift.com/how-many-video-games-exist/#:~:text=After%20doing%20some%20research%2C%20our,games%20for%20the%20Nintendo%20Switch.
Of course, every dataset has its own problems, and none will ever provide a complete view. This is the data that we have, and the task is to present it visually. But the considerations mentioned above must always take precedence, because the data left out could very easily and very plausibly turn any finding upside down, however "obvious" or "insightful" it may seem at first glance.
- One could assume that interesting data or insights could be drawn from the scores that a given game has received, but the relevant data are overwhelmingly missing for most games, even for the most popular ones.

Assumptions about our client
When making recommendations and visualizing data, one of the main factors to consider is the audience, along with its goals and needs.
Assumptions:
- the audience has no knowledge about the dataset
- the audience is at a professional level when interpreting visualizations


# Main findings
It is clear from the dataset that the gaming industry is extremely concentrated. This concentration is visible in the sales figures by video game, and even more so by publisher. Measured by sales, the overwhelming majority of games are failures.
It is also reasonable to assume that the most popular games are priced above average, which means that the financial figures might be even more extreme.
The data also suggest that most publishers manage the development and sales of video games like a portfolio: the most successful games not only generate the overwhelming majority of sales and profits but also finance the large number of unsuccessful ones.

So if our client is an existing, well-established publisher with a portfolio of games in the pipeline, the question becomes how this publisher should change its strategy for managing that portfolio.
There seems to be no recipe for success. For example, Nintendo enjoyed many top-selling games but also produced a number of failures.
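
One way to make "extremely concentrated" measurable is the share of total sales captured by the best-selling fraction of titles. The sketch below is an assumption about how this could be quantified, not the method behind the findings above; `top_share` is a hypothetical helper and the sample figures are invented.

```python
def top_share(sales, fraction):
    """Share of total sales held by the top `fraction` of items (at least one item)."""
    ordered = sorted(sales, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Invented figures in copies: a few hits and a long tail of small sellers.
sample = [5_000_000, 1_000_000, 400_000, 200_000, 100_000,
          80_000, 60_000, 50_000, 40_000, 30_000,
          20_000, 10_000, 10_000, 10_000, 10_000,
          10_000, 10_000, 10_000, 10_000, 10_000]

print(f"top 10% of titles hold {top_share(sample, 0.10):.1%} of sales")
```

The same function could be applied per publisher by first summing sales within each publisher.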


In [1]:
import numpy as np
import pandas as pd
import plotly
pd.options.plotting.backend = 'plotly'
pd.options.display.float_format = "{:,.1f}".format
import plotly.express as px
from pathlib import Path

In [2]:
file_path = Path.cwd() / r'data/vgsales-12-4-2019 (1).csv'

In [3]:
# first read

sales = pd.read_csv(filepath_or_buffer=file_path)
sales.head(10)

Unnamed: 0,Rank,Name,basename,Genre,ESRB_Rating,Platform,Publisher,Developer,VGChartz_Score,Critic_Score,...,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Last_Update,url,status,Vgchartzscore,img_url
0,1,Wii Sports,wii-sports,Sports,E,Wii,Nintendo,Nintendo EAD,,7.7,...,,,,,2006.0,,http://www.vgchartz.com/game/2667/wii-sports/?...,1,,/games/boxart/full_2258645AmericaFrontccc.jpg
1,2,Super Mario Bros.,super-mario-bros,Platform,,NES,Nintendo,Nintendo EAD,,10.0,...,,,,,1985.0,,http://www.vgchartz.com/game/6455/super-mario-...,1,,/games/boxart/8972270ccc.jpg
2,3,Mario Kart Wii,mario-kart-wii,Racing,E,Wii,Nintendo,Nintendo EAD,,8.2,...,,,,,2008.0,11th Apr 18,http://www.vgchartz.com/game/6968/mario-kart-w...,1,8.7,/games/boxart/full_8932480AmericaFrontccc.jpg
3,4,PlayerUnknown's Battlegrounds,playerunknowns-battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,...,,,,,2017.0,13th Nov 18,http://www.vgchartz.com/game/215988/playerunkn...,1,,/games/boxart/full_8052843AmericaFrontccc.jpg
4,5,Wii Sports Resort,wii-sports-resort,Sports,E,Wii,Nintendo,Nintendo EAD,,8.0,...,,,,,2009.0,,http://www.vgchartz.com/game/24656/wii-sports-...,1,8.8,/games/boxart/full_7295041AmericaFrontccc.jpg
5,6,Pokemon Red / Green / Blue Version,pokmon-red,Role-Playing,E,GB,Nintendo,Game Freak,,9.4,...,,,,,1998.0,,http://www.vgchartz.com/game/4030/pokemon-red-...,1,,/games/boxart/full_6442337AmericaFrontccc.png
6,7,New Super Mario Bros.,new-super-mario-bros,Platform,E,DS,Nintendo,Nintendo EAD,,9.1,...,,,,,2006.0,,http://www.vgchartz.com/game/1582/new-super-ma...,1,,/games/boxart/full_2916260AmericaFrontccc.jpg
7,8,Tetris,tetris,Puzzle,E,GB,Nintendo,Bullet Proof Software,,,...,,,,,1989.0,,http://www.vgchartz.com/game/4534/tetris/?regi...,1,,/games/boxart/3740960ccc.jpg
8,9,New Super Mario Bros. Wii,new-super-mario-bros-wii,Platform,E,Wii,Nintendo,Nintendo EAD,,8.6,...,,,,,2009.0,,http://www.vgchartz.com/game/35076/new-super-m...,1,9.1,/games/boxart/full_1410872AmericaFrontccc.jpg
9,10,Minecraft,minecraft,Misc,,PC,Mojang,Mojang AB,,10.0,...,,,,,2010.0,05th Aug 18,http://www.vgchartz.com/game/47724/minecraft/?...,1,,/games/boxart/full_minecraft_1AmericaFront.png


In [4]:
dtypes={'Genre':'category',
        'ESRB_Rating' : 'category',
        'Platform':'category'}

sales = pd.read_csv(filepath_or_buffer=file_path, parse_dates=['Last_Update'], dtype=dtypes,
                    # Year is read as text such as '2006.0'; strip the trailing '.0'
                    converters={'Year': lambda x: x.replace('.0', '')})

In [5]:
sales.head(10)

Unnamed: 0,Rank,Name,basename,Genre,ESRB_Rating,Platform,Publisher,Developer,VGChartz_Score,Critic_Score,...,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Last_Update,url,status,Vgchartzscore,img_url
0,1,Wii Sports,wii-sports,Sports,E,Wii,Nintendo,Nintendo EAD,,7.7,...,,,,,2006,NaT,http://www.vgchartz.com/game/2667/wii-sports/?...,1,,/games/boxart/full_2258645AmericaFrontccc.jpg
1,2,Super Mario Bros.,super-mario-bros,Platform,,NES,Nintendo,Nintendo EAD,,10.0,...,,,,,1985,NaT,http://www.vgchartz.com/game/6455/super-mario-...,1,,/games/boxart/8972270ccc.jpg
2,3,Mario Kart Wii,mario-kart-wii,Racing,E,Wii,Nintendo,Nintendo EAD,,8.2,...,,,,,2008,2018-04-11,http://www.vgchartz.com/game/6968/mario-kart-w...,1,8.7,/games/boxart/full_8932480AmericaFrontccc.jpg
3,4,PlayerUnknown's Battlegrounds,playerunknowns-battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,...,,,,,2017,2018-11-13,http://www.vgchartz.com/game/215988/playerunkn...,1,,/games/boxart/full_8052843AmericaFrontccc.jpg
4,5,Wii Sports Resort,wii-sports-resort,Sports,E,Wii,Nintendo,Nintendo EAD,,8.0,...,,,,,2009,NaT,http://www.vgchartz.com/game/24656/wii-sports-...,1,8.8,/games/boxart/full_7295041AmericaFrontccc.jpg
5,6,Pokemon Red / Green / Blue Version,pokmon-red,Role-Playing,E,GB,Nintendo,Game Freak,,9.4,...,,,,,1998,NaT,http://www.vgchartz.com/game/4030/pokemon-red-...,1,,/games/boxart/full_6442337AmericaFrontccc.png
6,7,New Super Mario Bros.,new-super-mario-bros,Platform,E,DS,Nintendo,Nintendo EAD,,9.1,...,,,,,2006,NaT,http://www.vgchartz.com/game/1582/new-super-ma...,1,,/games/boxart/full_2916260AmericaFrontccc.jpg
7,8,Tetris,tetris,Puzzle,E,GB,Nintendo,Bullet Proof Software,,,...,,,,,1989,NaT,http://www.vgchartz.com/game/4534/tetris/?regi...,1,,/games/boxart/3740960ccc.jpg
8,9,New Super Mario Bros. Wii,new-super-mario-bros-wii,Platform,E,Wii,Nintendo,Nintendo EAD,,8.6,...,,,,,2009,NaT,http://www.vgchartz.com/game/35076/new-super-m...,1,9.1,/games/boxart/full_1410872AmericaFrontccc.jpg
9,10,Minecraft,minecraft,Misc,,PC,Mojang,Mojang AB,,10.0,...,,,,,2010,2018-08-05,http://www.vgchartz.com/game/47724/minecraft/?...,1,,/games/boxart/full_minecraft_1AmericaFront.png


In [6]:
sales.shape

(55792, 23)

In [7]:
# Year and Genre are inconsistent even for the same Name:

sales[sales.Name=='Tour de France 2011']

Unnamed: 0,Rank,Name,basename,Genre,ESRB_Rating,Platform,Publisher,Developer,VGChartz_Score,Critic_Score,...,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Last_Update,url,status,Vgchartzscore,img_url
15224,15225,Tour de France 2011,tour-de-france-2011,Sports,E,X360,Unknown,Cyanide Studio,,,...,,0.0,,0.0,,NaT,http://www.vgchartz.com/game/52785/tour-de-fra...,1,,/games/boxart/full_tour-de-france-the-official...
21004,21005,Tour de France 2011,tour-de-france-2011,Sports,E,PS3,Unknown,Cyanide Studio,,6.0,...,,0.0,,,,NaT,http://www.vgchartz.com/game/52784/tour-de-fra...,1,,/games/boxart/full_tour-de-france-the-official...
49675,49676,Tour de France 2011,tour-de-france-2011,Racing,,PS3,Unknown,Focus Home Interactive,,,...,,,,,2002.0,NaT,http://www.vgchartz.com/game/53231/tour-de-fra...,1,,/games/boxart/default.jpg


In [8]:
sales[sales.Year=='']

Unnamed: 0,Rank,Name,basename,Genre,ESRB_Rating,Platform,Publisher,Developer,VGChartz_Score,Critic_Score,...,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Last_Update,url,status,Vgchartzscore,img_url
15224,15225,Tour de France 2011,tour-de-france-2011,Sports,E,X360,Unknown,Cyanide Studio,,,...,,0.0,,0.0,,NaT,http://www.vgchartz.com/game/52785/tour-de-fra...,1,,/games/boxart/full_tour-de-france-the-official...
15653,15654,The History Channel: Great Battles - Medieval,the-history-channel-great-battles-medieval,Strategy,,PS3,Unknown,Slitherine Software,,,...,,0.0,,0.0,,NaT,http://www.vgchartz.com/game/43186/the-history...,1,,/games/boxart/full_the-history-channel-great-b...
15791,15792,B.L.U.E.: Legend of Water,blue-legend-of-water,Adventure,,PS,Unknown,Unknown,,,...,,,0.0,0.0,,NaT,http://www.vgchartz.com/game/17261/blue-legend...,1,,/games/boxart/2406668ccc.jpg
20065,20066,Wii de Asobu: Metroid Prime 2: Dark Echoes,wii-de-asobu-metroid-prime-2-dark-echoes,Shooter,,Wii,Unknown,Retro Studios,,,...,,,0.0,,,NaT,http://www.vgchartz.com/game/35542/wii-de-asob...,1,,/games/boxart/full_2705729JapanFrontccc.jpg
20086,20087,Iron Master: The Legendary Blacksmith,iron-master-the-legendary-blacksmith,Action,RP,DS,Unknown,Barunson Creative,,,...,,,0.0,,,NaT,http://www.vgchartz.com/game/36571/iron-master...,1,,/games/boxart/full_7472386JapanFrontccc.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55766,55767,Yuppie Psycho,yuppie-psycho,Adventure,,XOne,Unknown,Baroque Decay,,,...,,,,,,2018-04-17,http://www.vgchartz.com/game/221879/yuppie-psy...,1,,/games/boxart/default.jpg
55767,55768,Yuppie Psycho,yuppie-psycho,Adventure,,PS4,Unknown,Baroque Decay,,,...,,,,,,2018-04-17,http://www.vgchartz.com/game/221880/yuppie-psy...,1,,/games/boxart/default.jpg
55769,55770,Zarvot,zarvot,Action,,NS,Unknown,Snowhydra Games,,,...,,,,,,2018-10-14,http://www.vgchartz.com/game/222952/zarvot/?re...,1,6.0,/games/boxart/full_809985AmericaFrontccc.png
55774,55775,Zombeer,zombeer,Action,M,WiiU,Unknown,Padaone Games,,,...,,,,,,2018-01-05,http://www.vgchartz.com/game/220347/zombeer/?r...,1,,/games/boxart/full_452226AmericaFrontccc.png


In [9]:
# VGChartz_Score has no non-null values, so it is irrelevant
sales['VGChartz_Score'].describe()

count   0.0
mean    NaN
std     NaN
min     NaN
25%     NaN
50%     NaN
75%     NaN
max     NaN
Name: VGChartz_Score, dtype: float64

In [10]:
# Last_Update: are only data until 2016 or 2017 comparable? 

sales['Last_Update'].describe()


count                    9186
unique                    430
top       2019-03-29 00:00:00
freq                      227
first     2017-11-28 00:00:00
last      2019-04-12 00:00:00
Name: Last_Update, dtype: object

In [11]:
sales['status'].describe()

count   55,792.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
Name: status, dtype: float64

In [12]:
# basename seems irrelevant (no info provided); the other dropped columns are empty, constant, or not needed here

sales.drop(columns=['basename', 'VGChartz_Score', 'status', 'url', 'img_url', 'Last_Update' ], inplace=True)

In [13]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55792 entries, 0 to 55791
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Rank           55792 non-null  int64   
 1   Name           55792 non-null  object  
 2   Genre          55792 non-null  category
 3   ESRB_Rating    23623 non-null  category
 4   Platform       55792 non-null  category
 5   Publisher      55792 non-null  object  
 6   Developer      55775 non-null  object  
 7   Critic_Score   6536 non-null   float64 
 8   User_Score     335 non-null    float64 
 9   Total_Shipped  1827 non-null   float64 
 10  Global_Sales   19415 non-null  float64 
 11  NA_Sales       12964 non-null  float64 
 12  PAL_Sales      13189 non-null  float64 
 13  JP_Sales       7043 non-null   float64 
 14  Other_Sales    15522 non-null  float64 
 15  Year           55792 non-null  object  
 16  Vgchartzscore  799 non-null    float64 
dtypes: category(3), float64(9), int

In [14]:
sales.value_counts('Publisher')

Publisher
Unknown               4891
Sega                  2085
Activision            1519
Ubisoft               1519
Electronic Arts       1498
                      ... 
Rich Olson               1
GolemLabs                1
Revolution (Japan)       1
Revolution               1
yeo                      1
Length: 3069, dtype: int64

In [15]:
sales.value_counts('Platform')

Platform
PC      10978
PS2      3564
DS       3292
PS       2703
XBL      2115
        ...  
TG16        3
S32X        3
BBCM        1
C128        1
Aco         1
Length: 74, dtype: int64

In [16]:
sales.value_counts('Platform').describe()

count       74.0
mean       753.9
std      1,450.1
min          1.0
25%         27.5
50%        261.0
75%      1,036.2
max     10,978.0
dtype: float64

In [17]:
print(sales.Year.unique())

['2006' '1985' '2008' '2017' '2009' '1998' '1989' '2010' '2007' '2005'
 '2000' '1991' '2013' '2014' '2004' '2011' '1990' '2003' '2002' '2016'
 '2015' '2001' '1999' '2018' '2012' '1996' '1992' '1993' '1997' '1994'
 '1982' '1988' '1987' '1995' '1981' '1986' '1978' '1983' '2020' '1984'
 '1977' '2019' '1980' '1970' '1979' '' '1975' '1973']


In [18]:
# 'Total_Shipped' and 'Global_Sales' seem to be complementary (no row has both),
# so merge them into one column and convert from millions to copies
sales['copies'] = sales['Total_Shipped'].combine_first(sales['Global_Sales']) * 1_000_000
assert sales['copies'].count() == sales['Total_Shipped'].count() + sales['Global_Sales'].count()
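
For reference, `Series.combine_first` keeps the caller's value where it is present and falls back to the other series only where the caller is missing. A count-based assertion of this kind therefore passes only if no row carries both figures. A small sketch with made-up values (not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up figures in millions; each row has at most one of the two values.
shipped = pd.Series([82.86, np.nan, np.nan, np.nan])
global_sales = pd.Series([np.nan, 1.11, 0.27, np.nan])

combined = shipped.combine_first(global_sales)

# Complementary columns: the non-null counts add up exactly.
assert combined.count() == shipped.count() + global_sales.count()

# If a row had both values, the counts would no longer add up,
# and the value from `shipped` would take precedence.
overlapping = pd.Series([80.00, 1.11, 0.27, np.nan])
both = shipped.combine_first(overlapping)
print(both.iloc[0])                                         # value taken from `shipped`
print(both.count(), shipped.count() + overlapping.count())  # counts now differ
```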

In [19]:
# 1380 rows have exactly "0" copies
print(sales[sales.copies == 0].count())
sales[sales.copies == 0].head()

Rank             1380
Name             1380
Genre            1380
ESRB_Rating       713
Platform         1380
Publisher        1380
Developer        1378
Critic_Score      155
User_Score          4
Total_Shipped       0
Global_Sales     1380
NA_Sales          187
PAL_Sales        1064
JP_Sales          252
Other_Sales      1012
Year             1380
Vgchartzscore       0
copies           1380
dtype: int64


Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,Total_Shipped,Global_Sales,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
19862,19863,Nari Kids Park: Ultraman R/B,Action,,NS,Bandai Namco Games,Bandai Namco Games,,,,0.0,,,0.0,,2018,,0.0
19863,19864,Nekketsu Inou Bukatsu: Trigger Kiss,Adventure,,PSV,Idea Factory,Idea Factory,,,,0.0,,,0.0,,2014,,0.0
19864,19865,Rampage Puzzle Attack,Puzzle,E,GBA,Midway Games,Ninai Games,,,,0.0,0.0,0.0,,0.0,2001,,0.0
19865,19866,Duke Nukem Trilogy: Critical Mass,Shooter,T,PSP,Unknown,FrontLine Studios,,,,0.0,0.0,,,0.0,2001,,0.0
19866,19867,The World of Golden Eggs: Nori Nori Uta Dekich...,Puzzle,,DS,AQ Interactive,AQ Interactive,,,,0.0,,,0.0,,2009,,0.0


In [20]:
sales_total_available = sales[sales.copies > 0]
sales_total_available.shape

(19862, 18)

In [21]:
# even in this filtered dataset, 25% of the rows do not exceed 50 thousand copies
sales_total_available.copies.describe()

count       19,862.0
mean       530,876.0
std      1,572,633.9
min         10,000.0
25%         50,000.0
50%        160,000.0
75%        450,000.0
max     82,860,000.0
Name: copies, dtype: float64

In [22]:
# only 12158 rows have sales greater than 100 thousand copies

sales[sales.copies > 100_000].shape


(12158, 18)

In [23]:
# records without Year data are not significant in sales, can be dropped:

sales[sales.Year=='']['copies'].sum()

130000.0

In [24]:
# filtering rows without Year
sales_total_available = sales_total_available[sales_total_available.Year!='']

# business decision: rows with sales greater than 0 are relevant, creating .csv file for (Tableau) visualization and further analysis
# columns dropped as irrelevant or sparse:
sparse_columns = ['ESRB_Rating', 'Developer', 'Critic_Score', 'User_Score', 'Total_Shipped', 'Global_Sales',
                  'NA_Sales', 'PAL_Sales', 'JP_Sales', 'Other_Sales', 'Vgchartzscore']

# reassign instead of inplace=True to avoid pandas' SettingWithCopyWarning on a filtered DataFrame
sales_total_available = sales_total_available.drop(columns=sparse_columns)


In [25]:
# commented out after creating the csv file
# output_path = Path.cwd() / 'output' / 'sales_total_available.csv'
# sales_total_available.to_csv(output_path, index=False)

In [26]:
regional_sales_available = sales.dropna(subset=['NA_Sales','PAL_Sales', 'JP_Sales', 'Other_Sales'], how='any')
global_sales_available=sales.dropna(subset=['Global_Sales'])
regional_sales_available.shape

(2416, 18)

In [27]:
global_sales_available.shape

(19415, 18)

In [28]:
global_and_regional_sales_available = sales.dropna(subset=['Global_Sales','NA_Sales','PAL_Sales', 'JP_Sales', 'Other_Sales'], how='any')
global_and_regional_sales_available.shape

(2416, 18)

In [29]:
sum_in_dataset = global_and_regional_sales_available['Global_Sales']
calculated_sum= global_and_regional_sales_available.loc[:,['NA_Sales','PAL_Sales', 'JP_Sales', 'Other_Sales']].sum(axis=1)

In [30]:
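# compare the figures element-wise; `.all() == .all()` would only compare two booleans
assert np.allclose(sum_in_dataset, calculated_sum, atol=0.02)  # atol is an assumed rounding tolerance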
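
When comparing a reported total with the sum of its parts, note that `Series.all()` returns a single boolean saying whether every element is truthy (non-zero), so comparing two `.all()` results does not compare the underlying figures. A sketch of the difference, using invented numbers and `numpy.isclose` with a tolerance that is chosen here as an assumption:

```python
import numpy as np
import pandas as pd

# Invented figures in millions: the second row's regions do not add up to the total.
reported = pd.Series([1.11, 0.50])
summed_regions = pd.Series([1.10, 0.20])

# Both series contain only non-zero values, so both .all() calls return True
# and the comparison passes although the figures differ.
print(reported.all() == summed_regions.all())

# Element-wise comparison with a tolerance for rounding to 0.01 million.
close = np.isclose(reported.to_numpy(), summed_regions.to_numpy(), atol=0.015)
print(close)
print(f"{(~close).sum()} row(s) differ beyond the tolerance")
```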

In [31]:
global_and_regional_sales_available[['NA_Sales','PAL_Sales', 'JP_Sales', 'Other_Sales']].sum()

NA_Sales      1,327.6
PAL_Sales     1,002.5
JP_Sales        281.9
Other_Sales     345.7
dtype: float64

In [32]:
sales.drop(columns=['Total_Shipped', 'Global_Sales'], inplace=True)

In [33]:
sales.copies.describe()

count       21,242.0
mean       496,387.3
std      1,526,308.9
min              0.0
25%         40,000.0
50%        140,000.0
75%        420,000.0
max     82,860,000.0
Name: copies, dtype: float64

In [34]:
sales.head(10)

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
0,1,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,,,,,2006,,82860000.0
1,2,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,,,,,1985,,40240000.0
2,3,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,,,,,2008,8.7,37140000.0
3,4,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,,,,,2017,,36600000.0
4,5,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,,,,,2009,8.8,33090000.0
5,6,Pokemon Red / Green / Blue Version,Role-Playing,E,GB,Nintendo,Game Freak,9.4,,,,,,1998,,31380000.0
6,7,New Super Mario Bros.,Platform,E,DS,Nintendo,Nintendo EAD,9.1,8.1,,,,,2006,,30800000.0
7,8,Tetris,Puzzle,E,GB,Nintendo,Bullet Proof Software,,,,,,,1989,,30260000.0
8,9,New Super Mario Bros. Wii,Platform,E,Wii,Nintendo,Nintendo EAD,8.6,9.2,,,,,2009,9.1,30220000.0
9,10,Minecraft,Misc,,PC,Mojang,Mojang AB,10.0,,,,,,2010,,30010000.0


In [35]:
sales.tail(10)

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
55782,55783,Ion Maiden,Shooter,,PC,3D Realms,Voidpoint,,,,,,,2019.0,,
55783,55784,Ion Maiden,Shooter,,PS4,3D Realms,Voidpoint,,,,,,,2019.0,,
55784,55785,Ion Maiden,Shooter,,XOne,3D Realms,Voidpoint,,,,,,,2019.0,,
55785,55786,Ion Maiden,Shooter,,NS,3D Realms,Voidpoint,,,,,,,2019.0,,
55786,55787,In the Valley of Gods,Adventure,,PC,Campo Santo,Campo Santo,,,,,,,2019.0,,
55787,55788,Indivisible,Role-Playing,,PC,505 Games,Lab Zero Games,,,,,,,2019.0,,
55788,55789,Lost Ember,Adventure,RP,PC,Mooneye Studios,Mooneye Studios,,,,,,,2019.0,,
55789,55790,Lost Ember,Adventure,RP,PS4,Mooneye Studios,Mooneye Studios,,,,,,,2019.0,,
55790,55791,Lost Ember,Adventure,RP,XOne,Mooneye Studios,Mooneye Studios,,,,,,,2019.0,,
55791,55792,Falcon Age,Action-Adventure,,PS4,Unknown,Outerloop Games,,,,,,,,,


In [36]:
sales.dropna(subset=['Critic_Score', 'User_Score'], how='all').shape

(6653, 16)

In [37]:
sales.dropna(subset=['Critic_Score', 'User_Score'], how='any').shape

(218, 16)
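
The two calls above differ only in `how`: `how='all'` drops a row only when both scores are missing, while `how='any'` drops it when either is missing. The counts reported in this notebook are consistent with each other by inclusion-exclusion, which the following sketch verifies. The helper name is hypothetical; the numbers are the ones printed by `sales.info()` and the two cells above.

```python
def rows_with_at_least_one(n_first, n_second, n_both):
    """Inclusion-exclusion: rows having the first value, the second, or both."""
    return n_first + n_second - n_both


critic_non_null = 6536   # Critic_Score non-null count from sales.info()
user_non_null = 335      # User_Score non-null count from sales.info()
both_present = 218       # rows kept by dropna(how='any')

at_least_one = rows_with_at_least_one(critic_non_null, user_non_null, both_present)
print(at_least_one)      # rows kept by dropna(how='all') → 6653
```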

In [38]:
sales.describe()

Unnamed: 0,Rank,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Vgchartzscore,copies
count,55792.0,6536.0,335.0,12964.0,13189.0,7043.0,15522.0,799.0,21242.0
mean,27896.5,7.2,8.3,0.3,0.2,0.1,0.0,7.4,496387.3
std,16105.9,1.5,1.4,0.5,0.4,0.2,0.1,1.4,1526308.9
min,1.0,1.0,2.0,0.0,0.0,0.0,0.0,2.6,0.0
25%,13948.8,6.4,7.8,0.1,0.0,0.0,0.0,6.8,40000.0
50%,27896.5,7.5,8.5,0.1,0.0,0.1,0.0,7.8,140000.0
75%,41844.2,8.3,9.1,0.3,0.1,0.1,0.0,8.5,420000.0
max,55792.0,10.0,10.0,9.8,9.8,2.7,3.1,9.6,82860000.0


In [39]:
sales[sales['copies']<=100000].shape

(9084, 16)

In [40]:
sales[sales['copies']==100000].shape

(484, 16)

In [41]:
sales['copies'].isna().sum()

34550

In [42]:
sales[sales['copies']<=100000].dropna(subset=['copies']).head()

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
12158,12159,Ennichi no Tatsujin,Misc,,Wii,Namco Bandai,Namco Bandai Games,,,,,0.1,,2006,,100000.0
12159,12160,The Godfather: Mob Wars,Action,M,PSP,Electronic Arts,Page 44,6.2,,0.1,0.0,,0.0,2006,,100000.0
12160,12161,Pac-Man and the Ghostly Adventures 2,Platform,E10,X360,Namco Bandai Games,Monkey Bar Games,,,0.1,0.0,,0.0,2014,,100000.0
12161,12162,Puzzle Chronicles,Puzzle,E10,PSP,Konami,Infinite Interactive,6.5,,0.1,0.0,,0.0,2010,,100000.0
12162,12163,Girls und Panzer: I Will Master Tankery,Action,,PSV,Namco Bandai Games,Namco Bandai Games,,,,,0.1,,2014,,100000.0


In [43]:
sales[sales['copies']<=100000].dropna(subset=['copies']).shape

(9084, 16)

In [44]:
copies_available = sales.dropna(subset=['copies'])
copies_available.head(10)

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
0,1,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,,,,,2006,,82860000.0
1,2,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,,,,,1985,,40240000.0
2,3,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,,,,,2008,8.7,37140000.0
3,4,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,,,,,2017,,36600000.0
4,5,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,,,,,2009,8.8,33090000.0
5,6,Pokemon Red / Green / Blue Version,Role-Playing,E,GB,Nintendo,Game Freak,9.4,,,,,,1998,,31380000.0
6,7,New Super Mario Bros.,Platform,E,DS,Nintendo,Nintendo EAD,9.1,8.1,,,,,2006,,30800000.0
7,8,Tetris,Puzzle,E,GB,Nintendo,Bullet Proof Software,,,,,,,1989,,30260000.0
8,9,New Super Mario Bros. Wii,Platform,E,Wii,Nintendo,Nintendo EAD,8.6,9.2,,,,,2009,9.1,30220000.0
9,10,Minecraft,Misc,,PC,Mojang,Mojang AB,10.0,,,,,,2010,,30010000.0


In [45]:
copies_available.shape

(21242, 16)

In [46]:
# df['cum_sum'] = df["val1"].cumsum()
# df['cum_perc'] = round(100*df.cum_sum/df["val1"].sum(),2)

In [47]:
sales[sales['Name']=='Submarine Titans']

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
16763,16764,Submarine Titans,Strategy,E,PC,Strategy First,Ellipse Studios,6.1,9.0,,,,,2000,,30000.0


In [49]:
sales['Name'].value_counts()

Plants vs. Zombies                          20
Monopoly                                    15
Double Dragon                               14
Samurai Shodown                             13
Pier Solar and the Great Architects         12
                                            ..
Arc Arena: Monster Tournament                1
Arbor Vitae                                  1
Arasuji de Oboeru Sokudoku no Susume DS      1
Arasuji de Kitaeru Sokumimi no Susume DS     1
Falcon Age                                   1
Name: Name, Length: 37102, dtype: int64

In [50]:
sales[sales['Name']=='Plants vs. Zombies']

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
1312,1313,Plants vs. Zombies,Strategy,E10,PC,PopCap Games,PopCap Games,9.0,,,,,,2009,8.4,1650000.0
2118,2119,Plants vs. Zombies,Strategy,E10,DS,PopCap Games,PopCap Games,8.1,,0.9,0.1,,0.1,2011,,1110000.0
7347,7348,Plants vs. Zombies,Strategy,E10,X360,PopCap Games,PopCap Games,8.9,,0.2,,,0.0,2010,,270000.0
8631,8632,Plants vs. Zombies,Strategy,E10,PS3,PopCap Games,PopCap,,,0.2,,,0.0,2011,,210000.0
41103,41104,Plants vs. Zombies,Strategy,E10,XBL,PopCap Games,PopCap Games,8.9,,,,,,2010,,
41104,41105,Plants vs. Zombies,Strategy,E10,PSN,Sony Online Entertainment,PopCap Games,,,,,,,2011,,
41105,41106,Plants vs. Zombies,Strategy,E10,DSiW,PopCap Games,PopCap Games,,,,,,,2011,,
41106,41107,Plants vs. Zombies,Misc,,PSV,Sony Computer Entertainment,Unknown,,,,,,,2012,,
41107,41108,Plants vs. Zombies,Misc,,PS3,Sony Computer Entertainment,Unknown,,,,,,,2011,,
41108,41109,Plants vs. Zombies,Misc,,DS,PopCap Games,Unknown,,,,,,,2011,,


In [51]:
sales[(sales['Name']=='Plants vs. Zombies') & (sales['Platform']=='PC')]

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
1312,1313,Plants vs. Zombies,Strategy,E10,PC,PopCap Games,PopCap Games,9.0,,,,,,2009,8.4,1650000.0
41110,41111,Plants vs. Zombies,Misc,,PC,PopCap Games,Unknown,,,,,,,2009,8.4,


In [52]:
sales[(sales['Name']=='Plants vs. Zombies') & (sales['Platform']=='PS3')]

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
8631,8632,Plants vs. Zombies,Strategy,E10,PS3,PopCap Games,PopCap,,,0.2,,,0.0,2011,,210000.0
41107,41108,Plants vs. Zombies,Misc,,PS3,Sony Computer Entertainment,Unknown,,,,,,,2011,,
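
The PC and PS3 filters above each return two rows for the same title and platform: one with sales figures and one nearly empty row tagged `Misc`. A minimal sketch of how such pairs could be located across the whole table, assuming the `Name` and `Platform` columns shown in the output. The helper name `find_name_platform_dupes` and the small `demo` frame are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd


def find_name_platform_dupes(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row whose (Name, Platform) pair occurs more than once."""
    # keep=False flags all members of a duplicate group, not only the later ones.
    mask = df.duplicated(subset=["Name", "Platform"], keep=False)
    return df[mask].sort_values(["Name", "Platform"])


# Hypothetical miniature of the rows printed above.
demo = pd.DataFrame({
    "Name": ["Plants vs. Zombies"] * 4,
    "Platform": ["PC", "PC", "PS3", "DS"],
    "Genre": ["Strategy", "Misc", "Strategy", "Strategy"],
    "copies": [1650000.0, None, 210000.0, 1110000.0],
})

dupes = find_name_platform_dupes(demo)
print(dupes)
```

On the real data the same call would be `find_name_platform_dupes(sales)`. Whether the emptier row of each pair should be dropped or merged is a separate decision that the output alone does not settle.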


In [53]:
sales[sales['Name']=='Monopoly']

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer,Critic_Score,User_Score,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Year,Vgchartzscore,copies
791,792,Monopoly,Misc,,PC,Hasbro Interactive,Unknown,,,1.5,0.8,,0.1,1995,,2390000.0
1087,1088,Monopoly,Misc,E,Wii,Electronic Arts,EA Bright Light,,,0.9,0.8,0.0,0.2,2008,,1890000.0
1466,1467,Monopoly,Misc,E,PS,Hasbro Interactive,Gremlin Interactive,,,1.2,0.3,,0.1,1997,,1510000.0
3879,3880,Monopoly,Misc,E,X360,Electronic Arts,EA Bright Light,,,0.3,0.2,,0.1,2008,,590000.0
5215,5216,Monopoly,Misc,E,PS2,Electronic Arts,EA Bright Light,,,0.2,0.2,,0.1,2008,,430000.0
6159,6160,Monopoly,Misc,E,PS3,Electronic Arts,EA Bright Light,,,0.2,0.1,,0.0,2008,,340000.0
7245,7246,Monopoly,Misc,E,DS,Electronic Arts,Electronic Arts,,,0.1,0.1,,0.0,2010,,270000.0
8567,8568,Monopoly,Misc,E,N64,Hasbro Interactive,Minds-Eye Productions,,,0.2,0.0,,0.0,1999,,210000.0
38247,38248,Monopoly,Misc,,GEN,Parker Bros.,Sculptured Software,,,,,,,1992,,
38248,38249,Monopoly,Misc,,SNES,Parker Bros.,Sculptured Software,,,,,,,1992,,


In [54]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55792 entries, 0 to 55791
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Rank           55792 non-null  int64   
 1   Name           55792 non-null  object  
 2   Genre          55792 non-null  category
 3   ESRB_Rating    23623 non-null  category
 4   Platform       55792 non-null  category
 5   Publisher      55792 non-null  object  
 6   Developer      55775 non-null  object  
 7   Critic_Score   6536 non-null   float64 
 8   User_Score     335 non-null    float64 
 9   NA_Sales       12964 non-null  float64 
 10  PAL_Sales      13189 non-null  float64 
 11  JP_Sales       7043 non-null   float64 
 12  Other_Sales    15522 non-null  float64 
 13  Year           55792 non-null  object  
 14  Vgchartzscore  799 non-null    float64 
 15  copies         21242 non-null  float64 
dtypes: category(3), float64(8), int64(1), object(4)
memory usage: 5.7+ MB
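
The non-null counts above show that several columns are mostly empty (for example, `User_Score` has 335 values out of 55792 rows). A minimal sketch that expresses this as a missing share per column. The helper name `missing_share` and the `demo` frame are hypothetical.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    # isna() yields booleans; their column mean is the share of missing cells.
    return df.isna().mean().sort_values(ascending=False)


# Hypothetical miniature with the same kind of gaps as the real table.
demo = pd.DataFrame({
    "Critic_Score": [9.0, None, None, None],
    "User_Score": [None, None, None, None],
    "Year": ["2009", "2011", "2010", "2011"],
})

share = missing_share(demo)
print(share)
```

On the real data this would be `missing_share(sales)`. Note that it only counts true missing values, so placeholders such as the string `Unknown` are not included.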


In [55]:
# dropna=False keeps rows with a missing rating in the tally (shown as NaN).
sales['ESRB_Rating'].value_counts(dropna=False)

NaN    32169
E      10811
T       6157
M       3314
E10     2897
RP       368
EC        54
AO        19
KA         3
Name: ESRB_Rating, dtype: int64
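
The three `KA` rows carry the former "Kids to Adults" label, which the remarks at the top note was renamed "Everyone" in 1998. If the two are to be treated as one rating (an assumption, since the notebook does not say so), a minimal sketch of the merge could look like this. The helper name `merge_former_ratings` and the `demo` series are hypothetical.

```python
import pandas as pd


def merge_former_ratings(ratings: pd.Series) -> pd.Series:
    """Fold the former 'KA' label into 'E' and return a categorical series."""
    # Work on plain objects so the replaced value need not be a declared category.
    obj = ratings.astype("object")
    merged = obj.mask(obj == "KA", "E")  # missing values compare False and stay missing
    return merged.astype("category")


# Hypothetical miniature of the ESRB_Rating column.
demo = pd.Series(["E", "KA", "T", None, "KA"], dtype="category")

merged = merge_former_ratings(demo)
print(merged.value_counts(dropna=False))
```

On the real data this would be `sales['ESRB_Rating'] = merge_former_ratings(sales['ESRB_Rating'])`.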

In [56]:
sales['Publisher'].value_counts(dropna=False)

Unknown               4891
Sega                  2085
Activision            1519
Ubisoft               1519
Electronic Arts       1498
                      ... 
Realtech VR              1
WorkJam                  1
Juicy Beast Studio       1
Imax                     1
Digital Tentacle         1
Name: Publisher, Length: 3069, dtype: int64

In [57]:
sales['Developer'].value_counts(dropna=False)

Unknown              4756
Konami                911
Sega                  817
Capcom                684
Namco                 425
                     ... 
Reality Bytes           1
H.I.C.                  1
Darkest Hour Team       1
Palestar                1
Outerloop Games         1
Name: Developer, Length: 8065, dtype: int64
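
In both the `Publisher` and `Developer` tallies the most frequent value is the placeholder `Unknown`, which pandas counts as a normal value rather than as missing. A minimal sketch of converting it to a true missing value, assuming `Unknown` is only ever a placeholder and never a real company name. The helper name `unknown_to_missing` and the `demo` frame are hypothetical.

```python
import pandas as pd


def unknown_to_missing(df: pd.DataFrame, columns, placeholder: str = "Unknown") -> pd.DataFrame:
    """Return a copy of df with the placeholder replaced by NaN in the given columns."""
    out = df.copy()
    for col in columns:
        # mask() without a replacement value inserts NaN where the condition holds.
        out[col] = out[col].mask(out[col] == placeholder)
    return out


# Hypothetical miniature of the two columns.
demo = pd.DataFrame({
    "Publisher": ["PopCap Games", "Unknown"],
    "Developer": ["Unknown", "Unknown"],
})

cleaned = unknown_to_missing(demo, ["Publisher", "Developer"])
print(cleaned.isna().sum())
```

On the real data this would be `unknown_to_missing(sales, ['Publisher', 'Developer'])`, after which `sales.info()` style non-null counts would reflect the placeholder rows as well.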

In [58]:
sales['Platform'].value_counts(dropna=False)

PC      10978
PS2      3564
DS       3292
PS       2703
XBL      2115
        ...  
TG16        3
S32X        3
BBCM        1
C128        1
Aco         1
Name: Platform, Length: 74, dtype: int64

In [59]:
sales['Genre'].value_counts(dropna=False)

Misc                9476
Action              7667
Adventure           5293
Sports              5244
Shooter             4586
Role-Playing        4551
Platform            3445
Strategy            3266
Puzzle              3162
Racing              3030
Simulation          2737
Fighting            2085
Action-Adventure     609
Visual Novel         260
Music                195
Party                 75
MMO                   74
Board Game            16
Education             12
Sandbox                9
Name: Genre, dtype: int64

In [60]:
sales['Year'].value_counts(dropna=False)

2009    4507
2010    3661
2011    3489
2008    2979
2014    2905
2007    2573
2006    2112
2005    1821
2003    1738
2002    1736
2013    1723
2015    1633
2004    1611
2000    1561
2012    1560
2001    1493
2017    1486
2018    1482
1999    1285
2016    1238
1996    1220
1994    1163
1995    1162
1998    1094
1993    1043
1997    1030
         979
1992     935
1991     786
2019     729
1990     666
1989     436
2020     331
1983     300
1988     282
1987     259
1982     214
1986     146
1970     104
1985      93
1984      88
1978      48
1981      37
1980      35
1977      12
1979       5
1975       1
1973       1
Name: Year, dtype: int64
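
`Year` has dtype `object` in the `sales.info()` output, and the tally above contains a blank label with 979 rows. A minimal sketch of turning the column into nullable integers, assuming the blank label is an empty or whitespace-only string. The helper name `parse_year` and the `demo` series are hypothetical.

```python
import pandas as pd


def parse_year(year: pd.Series) -> pd.Series:
    """Convert year labels to nullable integers; unparseable entries become <NA>."""
    cleaned = year.astype(str).str.strip()
    # errors="coerce" turns anything that is not a number (e.g. "") into NaN.
    numeric = pd.to_numeric(cleaned, errors="coerce")
    return numeric.astype("Int64")


# Hypothetical miniature of the Year column, including blank entries.
demo = pd.Series(["2009", " ", "1995", "", "2011"], dtype="object")

years = parse_year(demo)
print(years)
```

On the real data this would be `sales['Year'] = parse_year(sales['Year'])`. The nullable `Int64` dtype keeps the years as integers while still allowing the blank rows to be marked missing.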