# **Overview**

The gaming industry is certainly one of the thriving industries of the modern age and one of those that are most influenced by the advancement in technology. With the availability of technologies like AR/VR in consumer products like gaming consoles and even smartphones, the gaming sector shows great potential. In this hackathon, you as a data scientist must use your analytical skills to predict the sales of video games depending on given factors. Given are **8 distinguishing factors** that can influence the sales of a video game. Your objective as a data scientist is to build a machine learning model that can accurately predict the sales in millions of units for a given game.

Project dataset source link: [MachineHack Hackathon](https://machinehack.com/hackathon/video_game_sales_prediction_weekend_hackathon_10/data)

After registering for the hackathon, we receive three files: Sample Submission, Train.csv and Test.csv.

**Data Description**:-
The unzipped folder will have the following files.

Train.csv –  3506 observations.     
Test.csv –  1503 observations.    
Sample Submission – Sample format for the submission.    
**Target Variable**: SalesInMillions

Once the files are downloaded, run the code cell below and click the `Choose Files` button to upload them to Google Colab.

In [1]:
# from google.colab import files
# uploaded = files.upload()
# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

# File Imports

In [2]:
#Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings(action='ignore')

In [3]:
# Read the csv files
input = pd.read_csv("Train.csv")

In [4]:
# Preview the first five rows to understand the columns in the dataset
input.head()

|   | ID | CONSOLE | YEAR | CATEGORY | PUBLISHER | RATING | CRITICS_POINTS | USER_POINTS | SalesInMillions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2860 | ds | 2008 | role-playing | Nintendo | E | 2.833333 | 0.303704 | 1.779257 |
| 1 | 731 | wii | 2012 | simulation | Konami Digital Entertainment | E10+ | 13.2 | 1.64 | 0.21505 |
| 2 | 495 | pc | 2019 | shooter | Activision | M | 4.5625 | 0.00641 | 0.534402 |
| 3 | 2641 | ps2 | 2002 | sports | Electronic Arts | E | 4.181818 | 0.326923 | 1.383964 |
| 4 | 811 | ps3 | 2013 | action | Activision | M | 2.259259 | 0.032579 | 0.082671 |


# Data cleaning

In [5]:
input.isnull().sum()

ID                 0
CONSOLE            0
YEAR               0
CATEGORY           0
PUBLISHER          0
RATING             0
CRITICS_POINTS     0
USER_POINTS        0
SalesInMillions    0
dtype: int64

There are no null values in the dataset. So we can move to the next step of removing unnecessary columns.

From the dataset, we can observe that every column except `ID` plausibly plays a role in the final sales of a video game. The `ID` column can therefore be dropped.

In [6]:
input = input.drop(columns=['ID'])
train, test = train_test_split(input, test_size=0.2, random_state=42, shuffle=True)

# Descriptive Statistics

In [7]:
train.shape, test.shape

((2804, 8), (702, 8))
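
The shapes above can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remaining rows to the training set, which is consistent with the output shown. The variable names are illustrative only.

```python
from math import ceil

n_samples = 3506   # rows in Train.csv, per the data description
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test portion is rounded up, the rest goes to training
n_test = ceil(test_size * n_samples)   # 701.2 rounds up to 702
n_train = n_samples - n_test           # remaining 2804 rows

print(n_train, n_test)
```

Each split has 8 columns because `ID` was dropped from the original 9.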

In [8]:
train.nunique()

CONSOLE              17
YEAR                 23
CATEGORY             12
PUBLISHER           184
RATING                6
CRITICS_POINTS     1499
USER_POINTS        1875
SalesInMillions    2804
dtype: int64

In [9]:
# If you are seeing the output below for the first time, visit this link
# to understand what the values in each of these rows (mean, std, min, max)
# actually are: https://www.w3resource.com/pandas/dataframe/dataframe-describe.php
train.describe()

|   | YEAR | CRITICS_POINTS | USER_POINTS | SalesInMillions |
|---|---|---|---|---|
| count | 2804.0 | 2804.0 | 2804.0 | 2804.0 |
| mean | 2008.982168 | 3.748742 | 0.403144 | 2.184942 |
| std | 4.28669 | 3.101958 | 0.455677 | 2.578479 |
| min | 1997.0 | 0.568966 | 0.000341 | 0.001524 |
| 25% | 2006.0 | 1.73522 | 0.063171 | 0.952236 |
| 50% | 2009.0 | 2.745968 | 0.229331 | 1.863315 |
| 75% | 2012.0 | 4.555556 | 0.6 | 2.807032 |
| max | 2019.0 | 23.25 | 2.325 | 84.226041 |


From the above table, my first insight is that I can easily create bar charts of the **console**, **year**, **category** and **ratings** columns. For the other columns I might have to use some other visual representation, since the number of unique values is high.

*   From the **SalesInMillions** column we can see that average sales have been around 2 million, max sales have reached about 84 million🤩 and min sales were only around 1,500 units😔.
*   From the **year** column we can see that the data covers sales from 1997 to 2019.
*   **Critic points** range from about 0.57 to 23.25 while **user points** range from about 0.0003 to 2.325. We might need to normalise these values to the same scale; otherwise critic points could have a higher impact than user points on the final prediction, although in reality both of them should have equal importance.
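
One way to put the two score columns on the same scale is min-max scaling. The snippet below is a minimal sketch, not a step of the original notebook: the `scores` frame holds only the min, median and max values reported by `train.describe()`, and `min_max_scale` is a hypothetical helper name.

```python
import pandas as pd

# Illustrative values only: min, median and max taken from train.describe()
scores = pd.DataFrame({
    "CRITICS_POINTS": [0.568966, 2.745968, 23.25],
    "USER_POINTS": [0.000341, 0.229331, 2.325],
})

def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Linearly rescale every column to the [0, 1] interval."""
    return (frame - frame.min()) / (frame.max() - frame.min())

scaled = min_max_scale(scores)
print(scaled.round(3))
```

On real data the minimum and maximum would normally be computed on the training split and reused for the test split. Tree-based models such as the CatBoost regressor used later split on thresholds, so this rescaling mainly matters for scale-sensitive models.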



# EDA

I am first opting for an automated EDA package, pandas-profiling, to generate visualisations and their corresponding reports.

In [10]:
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip




In [11]:
# pandas-profiling has been renamed to ydata-profiling, so import from the new module name
from ydata_profiling import ProfileReport
report = ProfileReport(train, title="Report", html={'style': {'full_width':True}}, explorative=True, missing_diagrams={'bar': True})

In [None]:
report.to_notebook_iframe()

In [None]:
#Save the report in file
report.to_file("pandas_profiling_report.html")

From the above report we can gain the following insights:
*   Console column graph:   
<img src="https://res.cloudinary.com/dk22rcdch/image/upload/v1595439244/VideoGameDatasetAnalysisImages/Screenshot_2020-07-22_at_11.02.44_PM_nxz5cm.png" width=400>      
The sales of **PS2** were the highest in the data set

*   Years Column graph:   
<img src="https://res.cloudinary.com/dk22rcdch/image/upload/v1595439371/VideoGameDatasetAnalysisImages/Screenshot_2020-07-22_at_11.05.51_PM_ycn3nl.png" width=400>  
The sales were highest between the period **2005-2010**.

*   Game category column graph:   
<img src="https://res.cloudinary.com/dk22rcdch/image/upload/v1595439531/VideoGameDatasetAnalysisImages/Screenshot_2020-07-22_at_11.08.40_PM_ugwpdi.png" width=400>   
  **Action** category games are most popular

Now let's compare individual columns with the target column (`SalesInMillions`) to gain a few more insights into the data.

In [None]:
#Sales of games that happened corresponding to each console.
df = pd.DataFrame(train.groupby(['CONSOLE']).agg({'SalesInMillions': 'sum'}))


In [None]:
df.plot.bar(figsize=(12, 6))


**💡Insight**:  From the above graph we can see that sales were highest for PS3 platform followed by Xbox360

In [None]:
df = pd.DataFrame(train.groupby(['YEAR']).agg({'SalesInMillions': 'sum'}))


In [None]:
df.plot.bar(figsize=(12, 6))


**💡Insight**:  From the above graph we can see that sales were highest in the year 2010

In [None]:
df = pd.DataFrame(train.groupby(['CATEGORY']).agg({'SalesInMillions': 'sum'}))


In [None]:
df.plot.bar(figsize=(12, 6))


**💡Insight**:  From the above graph we can see that sales were highest for action genre

# Model training

In [None]:
!pip install catboost



In [None]:

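import catboost as cat

# Categorical columns that CatBoost should encode internally
cat_feat = ['CONSOLE', 'CATEGORY', 'PUBLISHER', 'RATING']
target = 'SalesInMillions'
# Keep the DataFrame's column order so the feature list is the same on every run
features = [col for col in train.columns if col != target]
model = cat.CatBoostRegressor(random_state=100, cat_features=cat_feat, verbose=0)
model.fit(train[features], train[target])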

# Model Accuracy

In [None]:
y_true= pd.DataFrame(data=test[target], columns=['SalesInMillions'])
test_temp = test.drop(columns=[target])

In [None]:
y_pred = model.predict(test_temp[features])

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(y_true, y_pred))
print(rmse)
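
The value printed above is the root mean squared error: the square root of the average squared difference between actual and predicted sales. As a minimal sketch of what that computation does, the plain-Python version below uses made-up numbers rather than the model's predictions, and the helper name `rmse_by_hand` is hypothetical.

```python
from math import sqrt

def rmse_by_hand(y_true, y_pred):
    """Square root of the mean squared difference of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sales figures in millions of units, for illustration only
actual = [1.0, 2.0, 4.0]
predicted = [1.5, 2.0, 3.0]

print(rmse_by_hand(actual, predicted))
```

Because the errors are squared before averaging, a single badly mispredicted blockbuster (such as the 84 million outlier) can dominate the score.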

In [None]:
import pickle
filename = 'finalized_model.sav'

In [None]:
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as model_file:
    pickle.dump(model, model_file)

In [None]:
# Use a context manager so the file handle is closed after reading
with open(filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

In [None]:
test_temp[features].head(1)

In [None]:
loaded_model.predict(test_temp[features].head(1))

In [None]:
from google.colab import drive
drive.mount('/content/drive')