# <p> <center style="background-color:#5c0707;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px">  Video games sales regression techniques   </center></p>


# Table of Contents

1. [Introduction](#introduction)  
2. [Data Description](#data-description)  
3. [Data Preprocessing](#data-preprocessing)  
4. [Data Visualizations](#4-visualizations-of-the-data-and-analysis)
5. [Model Training](#5-statistical-models-for-regression)
6. [Model Evaluation](#6-metrics-evaluation)
7. [Conclusion](#7-conclusion)  
8. [References](#8-references)  

# <p> <center style="background-color:#b76e79;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px"> 1. Introduction </center></p>

Forecasting video game sales is useful for publishers, developers, and investors in the gaming sector.  
Using features such as user ratings, critic reviews, and time since release, we predict a video game's worldwide sales with machine learning regression algorithms.

This project aims to:
- Examine how individual features affect sales performance.
- Apply and compare multiple regression models (Linear Regression, KNN, Decision Trees).
- Support reproducibility through code documentation and a controlled environment (fixed random seeds).
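
The comparison named in the second aim can be sketched as follows. This is a minimal illustration on synthetic data, not the video game dataset: the feature matrix, the target formula, `n_neighbors=5`, and the 5-fold setup are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, cross_val_score

SEED = 5
rng = np.random.default_rng(SEED)

# Synthetic stand-in data (assumption): a linear signal plus small noise
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(random_state=SEED),
}

# Same folds for every model so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)

cv_rmse = {}
for name, model in models.items():
    # scikit-learn returns negated MSE, so flip the sign before taking the root
    neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_rmse[name] = float(np.sqrt(-neg_mse).mean())

for name, rmse in cv_rmse.items():
    print(f"{name}: mean CV RMSE = {rmse:.3f}")
```

Because the synthetic target is linear by construction, Linear Regression should come out ahead here; that says nothing about which model fits the real sales data best.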

# <p> <center style="background-color:#b76e79;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px"> 2. Data Description </center></p>

The data comes from the [Kaggle](https://www.kaggle.com/code/yonatanrabinovich/video-games-sales-regression-techniques) notebook titled **“Video games sales regression techniques”** and is loaded here from the file `Video_Games_Sales_as_at_22_Dec_2016.csv`.

- **Rows**: 16,719 video games  
- **Columns**: 16 variables, including:
  - `Name`: Game title  
  - `Platform`: Console/platform (e.g., PS4, X360)  
  - `Year_of_Release`: Year the game was released  
  - `Genre`, `Publisher`: Category and publisher of the game  
  - `Critic_Score`, `User_Score`: Aggregated ratings from critics and users  
  - `NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`: Regional sales in millions  
  - `Global_Sales`: Total worldwide sales (target variable)
  - `Critic_Count`, `User_Count`: Number of critic reviews and number of user ratings behind each score  
  - `Developer`: Developer of the game  
  - `Rating`: ESRB rating  
  - `Game_Age`: Derived feature (not one of the 16 raw columns), calculated as `2016 - Year_of_Release` to measure how long a game has been on the market.



### Loading necessary packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


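import random, os
SEED = 5  # same seed as in the original project
random.seed(SEED)     # Python's built-in random module
np.random.seed(SEED)  # NumPy's global RNG, which scikit-learn uses when random_state is not set
# Setting PYTHONHASHSEED here only affects subprocesses; for the running
# interpreter it must be set before Python starts
os.environ['PYTHONHASHSEED'] = str(SEED)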
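
Global seeds depend on cells running in the same order every time. A more robust option, sketched below on a toy array, is to pass the seed explicitly through `random_state`. The array contents and `test_size=0.3` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 5

# Toy data (assumption): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same random_state produce the same split,
# regardless of what other random draws happened in between
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.3, random_state=SEED
)
np.random.rand(100)  # unrelated draw from the global RNG
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.3, random_state=SEED
)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Identical test sets:", same_split)
```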

# <p> <center style="background-color:#b76e79;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px"> 3. Data Preprocessing  </center></p>

In [2]:
# Load dataset
vgsales = pd.read_csv('Data/Video_Games_Sales_as_at_22_Dec_2016.csv', na_values=["", " ", "NA", "N/A"])

In [3]:
vgsales.head()

|   | Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | 8.45 | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | Nintendo | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | Nintendo | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | 2.95 | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | Nintendo | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 | NaN | NaN | NaN | NaN | NaN | NaN |


In [4]:
# Check structure
vgsales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB
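
The `info()` output lists `User_Score` as `object` rather than `float64`, so the column holds strings and cannot be used directly as a numeric feature. A minimal sketch of one way to convert it follows. It assumes that any non-numeric entries are placeholder strings such as `tbd`; that value is an assumption, since the notebook does not show the raw entries.

```python
import pandas as pd

# Toy column (assumption): numeric strings, a placeholder, and a missing value
toy = pd.DataFrame({"User_Score": ["8.0", "tbd", "7.5", None]})

# errors="coerce" turns anything that cannot be parsed into NaN
toy["User_Score"] = pd.to_numeric(toy["User_Score"], errors="coerce")

n_missing = int(toy["User_Score"].isna().sum())
print(toy["User_Score"].dtype, "with", n_missing, "missing values")
```

Applied to the real data, the same call would be `pd.to_numeric(vgsales['User_Score'], errors='coerce')`, run before the rows with missing values are dropped.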


In [11]:
# Table to count number of NAs 
na_count = vgsales.isna().sum()

na_count = pd.DataFrame(na_count, columns=['na_count'])

na_count


|   | na_count |
|---|---|
| Name | 2 |
| Platform | 0 |
| Year_of_Release | 269 |
| Genre | 2 |
| Publisher | 54 |
| NA_Sales | 0 |
| EU_Sales | 0 |
| JP_Sales | 0 |
| Other_Sales | 0 |
| Global_Sales | 0 |


In [13]:
# Drop every row that contains at least one NA value, as in the original project
vgsales = vgsales.dropna()
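
Dropping every incomplete row is simple but can discard a large share of the data: the summary table further down reports 6,825 rows, against 16,719 in the raw file. A minimal sketch of how to measure that loss, using a toy frame whose values are assumptions made for the illustration:

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): two of the four rows have at least one missing value
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Critic_Score": [76.0, np.nan, 82.0, 80.0],
    "Rating": ["E", None, "E", None],
})

rows_before = len(toy)
cleaned = toy.dropna()          # keeps only fully populated rows
rows_after = len(cleaned)
share_kept = rows_after / rows_before

print(f"Kept {rows_after} of {rows_before} rows ({share_kept:.0%})")
```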


In [25]:
# Drop all sales except global
vgsales = vgsales.drop(columns=['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales'])
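
The notebook does not say why the regional columns are dropped. One plausible reason is that they add up to `Global_Sales`, so keeping them as features would leak the target. The sketch below checks this on the first three rows of the raw `head()` output shown earlier; treating a gap of a few hundredths as rounding error is an assumption.

```python
import pandas as pd

# First three rows of the raw data, copied from the head() output
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii"],
    "NA_Sales": [41.36, 29.08, 15.68],
    "EU_Sales": [28.96, 3.58, 12.76],
    "JP_Sales": [3.77, 6.81, 3.79],
    "Other_Sales": [8.45, 0.77, 3.29],
    "Global_Sales": [82.53, 40.24, 35.52],
})

regional = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Absolute difference between the regional total and the reported global figure
gap = (sample[regional].sum(axis=1) - sample["Global_Sales"]).abs()
max_gap = float(gap.max())

print(f"Largest gap: {max_gap:.2f} million units")
```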


In [27]:
# Replace 'Year_of_Release' with the game's age in years, relative to 2016
vgsales['Game_Age'] = 2016 - vgsales['Year_of_Release']
vgsales = vgsales.drop(columns=['Year_of_Release'])


In [29]:
vgsales.head()

|   | Name | Platform | Genre | Publisher | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating | Game_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | Sports | Nintendo | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | Nintendo | E | 10.0 |
| 2 | Mario Kart Wii | Wii | Racing | Nintendo | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | Nintendo | E | 8.0 |
| 3 | Wii Sports Resort | Wii | Sports | Nintendo | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | Nintendo | E | 7.0 |
| 6 | New Super Mario Bros. | DS | Platform | Nintendo | 29.8 | 89.0 | 65.0 | 8.5 | 431.0 | Nintendo | E | 10.0 |
| 7 | Wii Play | Wii | Misc | Nintendo | 28.92 | 58.0 | 41.0 | 6.6 | 129.0 | Nintendo | E | 10.0 |


In [37]:
# Summary table
vgsales.describe(include='all')


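|   | Name | Platform | Genre | Publisher | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating | Game_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6825 | 6825 | 6825 | 6825 | 6825.0 | 6825.0 | 6825.0 | 6825.0 | 6825.0 | 6825 | 6825 | 6825.0 |
| unique | 4377 | 17 | 12 | 262 | NaN | NaN | NaN | 89.0 | NaN | 1289 | 7 | NaN |
| top | Need for Speed: Most Wanted | PS2 | Action | Electronic Arts | NaN | NaN | NaN | 7.8 | NaN | EA Canada | T | NaN |
| freq | 8 | 1140 | 1630 | 944 | NaN | NaN | NaN | 294.0 | NaN | 149 | 2377 | NaN |
| mean | NaN | NaN | NaN | NaN | 0.77759 | 70.272088 | 28.931136 | NaN | 174.722344 | NaN | NaN | 8.563223 |
| std | NaN | NaN | NaN | NaN | 1.963443 | 13.868572 | 19.224165 | NaN | 587.428538 | NaN | NaN | 4.211248 |
| min | NaN | NaN | NaN | NaN | 0.01 | 13.0 | 3.0 | NaN | 4.0 | NaN | NaN | 0.0 |
| 25% | NaN | NaN | NaN | NaN | 0.11 | 62.0 | 14.0 | NaN | 11.0 | NaN | NaN | 5.0 |
| 50% | NaN | NaN | NaN | NaN | 0.29 | 72.0 | 25.0 | NaN | 27.0 | NaN | NaN | 9.0 |
| 75% | NaN | NaN | NaN | NaN | 0.75 | 80.0 | 39.0 | NaN | 89.0 | NaN | NaN | 12.0 |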


In [39]:
# Save the cleaned data; the DataFrame index is written as the first, unnamed column
vgsales.to_csv('vgsales_cleaned.csv')
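
Because `to_csv` writes the index by default, reading the file back with a plain `read_csv` produces an extra `Unnamed: 0` column. A minimal sketch of the round trip follows. It uses an in-memory buffer instead of a file, and the two-row frame is an assumption made for the illustration.

```python
import io
import pandas as pd

# Toy frame (assumption) with a non-contiguous index, as left behind by dropna()
toy = pd.DataFrame(
    {"Name": ["Wii Sports", "Mario Kart Wii"], "Global_Sales": [82.53, 35.52]},
    index=[0, 2],
)

buffer = io.StringIO()
toy.to_csv(buffer)  # index is written as the first, unnamed column

buffer.seek(0)
naive = pd.read_csv(buffer)                 # index comes back as a data column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # first column becomes the index again

print(list(naive.columns))
print(list(restored.columns), list(restored.index))
```

Passing `index=False` to `to_csv` is the alternative when the original row labels are not needed later.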

# <p> <center style="background-color:#b76e79;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px"> 4. Visualizations of the data and analysis  </center></p>


## Univariate plots

## Multivariate plots

# <p> <center style="background-color:#b76e79;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px"> 5. Statistical Models For Regression  </center></p>

## KNN Regressor

## Linear Regression


## Decision Tree


# <p> <center style="background-color:#b76e79;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px"> 6. Metrics Evaluation   </center></p>

# <p> <center style="background-color:#b76e79;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px"> 7. Conclusion  </center></p>

# <p> <center style="background-color:#b76e79;font-family:Palatino Linotype;color:white;font-size:150%;text-align:center;border-radius:0px;padding:10px"> 8. References  </center></p>
