## Main Question

Can a game be classified as a **Hit** or a **Flop** based on its global sales?

## Exploratory Data Analysis (EDA)

In [1]:
# Importing necessary packages!
import pandas as pd
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Importing or reading the dataset!
games_dataset = pd.read_csv("Dataset/Video_Games_Sales.csv")

In [3]:
# Printing the first five observations of the dataset to see which types of data or features it contains!
games_dataset.head()

Unnamed: 0,index,Rank,Game Title,Platform,Year,Genre,Publisher,North America,Europe,Japan,Rest of World,Global,Review
0,0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,40.43,28.39,3.77,8.54,81.12,76.28
1,1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,91.0
2,2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,14.5,12.22,3.63,3.21,33.55,82.07
3,3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,14.82,10.51,3.18,3.01,31.52,82.65
4,4,5,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26,88.0


In [4]:
# Basic statistical measures of the numeric columns in the dataset!
games_dataset.describe()

Unnamed: 0,index,Rank,Year,North America,Europe,Japan,Rest of World,Global,Review
count,1907.0,1907.0,1878.0,1907.0,1907.0,1907.0,1907.0,1907.0,1907.0
mean,953.0,954.0,2003.766773,1.258789,0.706675,0.317493,0.206471,2.48924,79.038977
std,550.6478,550.6478,5.895369,1.95656,1.148904,0.724945,0.343093,3.563159,10.616899
min,0.0,1.0,1983.0,0.0,0.0,0.0,0.0,0.83,30.5
25%,476.5,477.5,2000.0,0.51,0.23,0.0,0.06,1.11,74.0
50%,953.0,954.0,2005.0,0.81,0.44,0.02,0.13,1.53,81.0
75%,1429.5,1430.5,2008.0,1.375,0.81,0.3,0.22,2.54,86.23
max,1906.0,1907.0,2012.0,40.43,28.39,7.2,8.54,81.12,97.0


In [5]:
print(f"Number of features in the dataset is {games_dataset.shape[1]} and the number of observations/rows in the dataset is {games_dataset.shape[0]}")

Number of features in the dataset is 13 and the number of observations/rows in the dataset is 1907


In [6]:
games_dataset.isnull().sum()

index             0
Rank              0
Game Title        0
Platform          0
Year             29
Genre             0
Publisher         2
North America     0
Europe            0
Japan             0
Rest of World     0
Global            0
Review            0
dtype: int64

* The features Year and Publisher contain missing values!  
* The feature *Year* has 29 missing values, so instead of dropping those observations, I'm handling them by imputation, which means filling in the missing values!  
* The feature *Publisher* has 2 missing values. I chose to drop the observations here instead of filling them in!  
    * The reasoning behind this is that it may not make sense to fill in the publisher name with the most common name or any other names!

In [7]:
# Column 'Year' has 29 missing values!
games_dataset['Year'] = games_dataset['Year'].fillna(games_dataset['Year'].mode()[0]) 
# mode() returns the most common value(s), and [0] takes the first one! Therefore we are filling the missing values in the column 'Year' with the most common value in this column!

print("Number of rows/observations before drop: ", games_dataset.shape[0])

# Using dropna() to drop observations that have missing values in the *Publisher* column!
# Also setting the argument *inplace* to *True* so the changes are applied to the dataset itself instead of returning a new dataset!
games_dataset.dropna(subset=['Publisher'], inplace=True)

print("Number of rows/observations after drop: ", games_dataset.shape[0])

Number of rows/observations before drop:  1907
Number of rows/observations after drop:  1905


In [8]:
games_dataset.isnull().sum()

index            0
Rank             0
Game Title       0
Platform         0
Year             0
Genre            0
Publisher        0
North America    0
Europe           0
Japan            0
Rest of World    0
Global           0
Review           0
dtype: int64

## Feature Selection

**Target Class:** The *Global* sales feature is used to predict whether a game has been successful or not: 'Hit' or 'Flop'!  

Creating a new column in the dataset called 'Hit'. It has a value of 1 or 0 for 'Hit' or 'Flop'!  

For most games, selling 3.5K - 5K copies worldwide within 18 months is a solid success, according to "Boardgameweek.com",  
but for this dataset I'm taking the mean of the global sales as the threshold!  

The threshold for the target class is therefore the mean of *Global*, about 2.49, which means games with global sales equal to or above that value are considered a 'Hit', otherwise a 'Flop'!  

In [9]:
threshold = games_dataset['Global'].mean()
print(threshold)
# Creating a new column 'Hit': 1 if the global sales reach the threshold, otherwise 0
games_dataset['Hit'] = (games_dataset['Global'] >= threshold).astype(int)

2.4895118110236223


In [10]:
games_dataset.head()

Unnamed: 0,index,Rank,Game Title,Platform,Year,Genre,Publisher,North America,Europe,Japan,Rest of World,Global,Review,Hit
0,0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,40.43,28.39,3.77,8.54,81.12,76.28,1
1,1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,91.0,1
2,2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,14.5,12.22,3.63,3.21,33.55,82.07,1
3,3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,14.82,10.51,3.18,3.01,31.52,82.65,1
4,4,5,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26,88.0,1


In [11]:
games_dataset['Hit'].value_counts() # We have a class imbalance here!

Hit
0    1415
1     490
Name: count, dtype: int64
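
The imbalance can be expressed as class shares rather than raw counts. The following is a minimal sketch that rebuilds the label column from the counts printed above instead of reading the CSV. The names `hit`, `shares` and `majority_baseline` are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative stand-in for games_dataset['Hit'], rebuilt from the counts
# reported above: 1415 flops (0) and 490 hits (1).
hit = pd.Series([0] * 1415 + [1] * 490, name="Hit")

# normalize=True returns each class as a fraction of all observations
shares = hit.value_counts(normalize=True)
print(shares.round(3))

# A model that always predicts the majority class would reach this accuracy,
# so it is a useful reference point when judging a classifier later on.
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Roughly three out of four games fall into the 'Flop' class, so plain accuracy may look better than the model really is.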

**Input features**

The features that I have chosen as input features are: *‘Platform’, ‘Year’, ‘Genre’, ‘Publisher’* and *'Review'*!  

**Question: Why were other features not chosen?**
* **Data Leakage:** The *‘North America’, ‘Europe’, ‘Japan’*, and *‘Rest of World’* features are components of the *‘Global’* sales, so using them as inputs would reveal the target to the model!  
* **Feature Importance:** The features *'Rank'* and *'Game Title'* don't seem to be very relevant for answering the main question!  
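
The leakage argument can be checked on the rows shown earlier. This is a small sketch that copies the first five observations from the `head()` output and compares the sum of the regional columns with *Global*. The `sample` frame, the `Gap` column and the 0.02 tolerance are assumptions made for illustration.

```python
import pandas as pd

# First five rows copied from the head() output above
sample = pd.DataFrame({
    "Game Title": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
                   "Wii Sports Resort", "Tetris"],
    "North America": [40.43, 29.08, 14.50, 14.82, 23.20],
    "Europe": [28.39, 3.58, 12.22, 10.51, 2.26],
    "Japan": [3.77, 6.81, 3.63, 3.18, 4.22],
    "Rest of World": [8.54, 0.77, 3.21, 3.01, 0.58],
    "Global": [81.12, 40.24, 33.55, 31.52, 30.26],
})

regional_cols = ["North America", "Europe", "Japan", "Rest of World"]
regional_sum = sample[regional_cols].sum(axis=1)

# Absolute difference between the regional total and the reported Global value
sample["Gap"] = (regional_sum - sample["Global"]).abs()
print(sample[["Game Title", "Global", "Gap"]])

# The 0.02 tolerance is an assumption that absorbs rounding in the source figures
leaks = bool((sample["Gap"] <= 0.02).all())
print("Regional columns reconstruct Global:", leaks)
```

For these five rows the regional sales add up to *Global* within rounding, which is why they are left out of the input features.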


**One-Hot encoding on categorical features**

In [14]:
platform_encoded = pd.get_dummies(games_dataset['Platform'])
games_dataset = pd.concat([games_dataset, platform_encoded], axis=1)

games_dataset.head()

Unnamed: 0,index,Rank,Game Title,Platform,Year,Genre,Publisher,North America,Europe,Japan,...,PS3,PSP,PSV,SAT,SCD,SNES,Wii,WiiU,X360,XB
0,0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,40.43,28.39,3.77,...,False,False,False,False,False,False,True,False,False,False
1,1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,...,False,False,False,False,False,False,False,False,False,False
2,2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,14.5,12.22,3.63,...,False,False,False,False,False,False,True,False,False,False
3,3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,14.82,10.51,3.18,...,False,False,False,False,False,False,True,False,False,False
4,4,5,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,...,False,False,False,False,False,False,False,False,False,False


In [16]:
genre_encoded = pd.get_dummies(games_dataset['Genre'])
genre_encoded.head()

Unnamed: 0,Action,Adventure,Fighting,Misc,Platform,Puzzle,Racing,Role-Playing,Shooter,Simulation,Sports,Strategy
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,True,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,True,False,False,False,False,False,False
