# Video Games Sales

This notebook analyzes video game sales data.

Data: [Kaggle.com](https://www.kaggle.com/datasets/gregorut/videogamesales)

In [2]:
import pandas as pd

# Load the dataset from local storage and do the basic checks

First, we load the data into `df_main` and display some basic information about the data frame: its first rows (`head()`) and its column summary (`info()`).

In [3]:
df_main = pd.read_csv('../files/vgsales.csv')
df_main.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [4]:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


## Missing Values

There are a couple of columns with missing data:
1. Year
2. Publisher
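
The counts behind this list can also be read off directly with `isna().sum()`. The sketch below is a minimal illustration, not part of the original notebook: `df_sample` is a hypothetical three-row frame built from rows that appear in this notebook's outputs, standing in for `df_main`.

```python
import pandas as pd

# Hypothetical stand-in for df_main: three rows copied from this notebook's outputs.
df_sample = pd.DataFrame({
    'Name': ['Wii Sports', 'Madden NFL 2004', 'wwe Smackdown vs. Raw 2006'],
    'Year': [2006.0, None, None],
    'Publisher': ['Nintendo', 'Electronic Arts', None],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = df_sample.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Run against `df_main`, the same two lines should agree with the `info()` output: 271 missing `Year` values (16598 - 16327) and 58 missing `Publisher` values (16598 - 16540).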

Let's check which games are missing this data. If they are blockbusters, we need to fill in the missing values. If they are obscure titles for old platforms, we can drop the observations with missing values in these columns.

In [6]:
df_main[df_main['Year'].isna()]

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
179,180,Madden NFL 2004,PS2,,Sports,Electronic Arts,4.26,0.26,0.01,0.71,5.23
377,378,FIFA Soccer 2004,PS2,,Sports,Electronic Arts,0.59,2.36,0.04,0.51,3.49
431,432,LEGO Batman: The Videogame,Wii,,Action,Warner Bros. Interactive Entertainment,1.86,1.02,0.00,0.29,3.17
470,471,wwe Smackdown vs. Raw 2006,PS2,,Fighting,,1.57,1.02,0.00,0.41,3.00
607,608,Space Invaders,2600,,Shooter,Atari,2.36,0.14,0.00,0.03,2.53
...,...,...,...,...,...,...,...,...,...,...,...
16307,16310,Freaky Flyers,GC,,Racing,Unknown,0.01,0.00,0.00,0.00,0.01
16327,16330,Inversion,PC,,Shooter,Namco Bandai Games,0.01,0.00,0.00,0.00,0.01
16366,16369,Hakuouki: Shinsengumi Kitan,PS3,,Adventure,Unknown,0.01,0.00,0.00,0.00,0.01
16427,16430,Virtua Quest,GC,,Role-Playing,Unknown,0.01,0.00,0.00,0.00,0.01


As we can see, some titles have significant sales (like LEGO Batman: The Videogame for Wii or FIFA Soccer 2004 for PS2).

Others are not important, so let's filter for the titles with global sales greater than 0.5 (the dataset reports sales in millions of copies).

In [7]:
df_main[(df_main['Year'].isna()) & (df_main['Global_Sales'] > 0.5)]

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
179,180,Madden NFL 2004,PS2,,Sports,Electronic Arts,4.26,0.26,0.01,0.71,5.23
377,378,FIFA Soccer 2004,PS2,,Sports,Electronic Arts,0.59,2.36,0.04,0.51,3.49
431,432,LEGO Batman: The Videogame,Wii,,Action,Warner Bros. Interactive Entertainment,1.86,1.02,0.0,0.29,3.17
470,471,wwe Smackdown vs. Raw 2006,PS2,,Fighting,,1.57,1.02,0.0,0.41,3.0
607,608,Space Invaders,2600,,Shooter,Atari,2.36,0.14,0.0,0.03,2.53
624,625,Rock Band,X360,,Misc,Electronic Arts,1.93,0.34,0.0,0.21,2.48
649,650,Frogger's Adventures: Temple of the Frog,GBA,,Adventure,Konami Digital Entertainment,2.15,0.18,0.0,0.07,2.39
652,653,LEGO Indiana Jones: The Original Adventures,Wii,,Action,LucasArts,1.54,0.63,0.0,0.22,2.39
711,713,Call of Duty 3,Wii,,Shooter,Activision,1.19,0.84,0.0,0.23,2.26
782,784,Rock Band,Wii,,Misc,MTV Games,1.35,0.56,0.0,0.2,2.11


There are some significant releases here, so we need to find their release years.

I will use the [IGDB API](https://api-docs.igdb.com/#getting-started), which authenticates with Twitch credentials, to do it.

In [None]:
import os
import requests
from dotenv import load_dotenv

# load .env file
load_dotenv()

# read Twitch API credentials from .env file
twitch_client_id = os.getenv('TWITCH_CLIENT_ID')
twitch_client_secret = os.getenv('TWITCH_CLIENT_SECRET')

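# prepare request; passing the credentials via params lets requests URL-encode them
url = 'https://id.twitch.tv/oauth2/token'
params = {
    'client_id': twitch_client_id,
    'client_secret': twitch_client_secret,
    'grant_type': 'client_credentials',
}

# send request for token and fail early on an HTTP error
response = requests.post(url, params=params)
response.raise_for_status()

# extract token
access_token = response.json()['access_token']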

In [13]:
twitch_api_url = 'https://api.igdb.com/v4'

headers = {
    'Client-ID': twitch_client_id,
    'Authorization': f'Bearer {access_token}'
}

In [28]:
response = requests.post(f'{twitch_api_url}/platforms', headers=headers, data='fields *; where abbreviation = "PS2";')

print([x['id'] for x in response.json()])

[8]


In [12]:
query = 'search "Yakuza 4"; fields release_dates;'

response = requests.post(f'{twitch_api_url}/games', headers=headers, data=query)

print(response.json())

[{'id': 2062, 'release_dates': [16192, 107005, 107006, 107007]}, {'id': 103016, 'release_dates': [160342, 174640, 174641, 226587, 231154, 303456]}]


In [17]:
query='fields *; where id = 16192;'

response = requests.post(f'{twitch_api_url}/release_dates', headers=headers, data=query)

print(response.json()[0]['y'])

2010
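
The three requests above (search for the game, read its release date ids, fetch a year) can be chained into one helper. What follows is a rough sketch under stated assumptions, not the notebook's method: `earliest_release_year`, `post_json` and `stub_post_json` are hypothetical names, the multi-id filter `where id = (a,b,c);` and the `fields y;` selection are assumed to behave as described in the IGDB query documentation, and taking the first search hit is a simplification that can pick the wrong title. The helper receives the request function as an argument, so the example runs offline against a stub that replays the responses printed above.

```python
def earliest_release_year(name, post_json):
    """Return the earliest release year found for the first search hit, or None.

    post_json(endpoint, query) is a hypothetical callable that sends the query
    to the given IGDB endpoint and returns the decoded JSON list.
    """
    # Double quotes would end the search string early, so drop them from the name.
    safe_name = name.replace('"', '')
    games = post_json('games', f'search "{safe_name}"; fields release_dates;')
    if not games or not games[0].get('release_dates'):
        return None

    # Assumed IGDB syntax: a parenthesised list matches any of the listed ids.
    ids = ','.join(str(i) for i in games[0]['release_dates'])
    dates = post_json('release_dates', f'fields y; where id = ({ids});')

    # Some release dates may carry no year, so keep only the entries that have one.
    years = [d['y'] for d in dates if 'y' in d]
    return min(years) if years else None


def stub_post_json(endpoint, query):
    """Offline stand-in that replays the responses shown in this notebook."""
    if endpoint == 'games':
        return [{'id': 2062, 'release_dates': [16192, 107005, 107006, 107007]}]
    if endpoint == 'release_dates':
        return [{'id': 16192, 'y': 2010}]
    return []


print(earliest_release_year('Yakuza 4', stub_post_json))
```

In the notebook, `post_json` could be a thin wrapper such as `lambda endpoint, query: requests.post(f'{twitch_api_url}/{endpoint}', headers=headers, data=query).json()`, and the returned year could then be written into the `Year` column of the matching row.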
