## Video games recommendation system

The aim of this notebook is to create a recommendation system that suggests products similar to the one the user chose. This is done with the help of a nearest neighbours model.
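To make the idea concrete, here is a minimal sketch of how a nearest neighbours model can pick a similar game. It is not the final implementation of this notebook: the tiny `games` table, the `recommend` helper, the choice of one-hot encoding and the cosine metric are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Tiny hypothetical catalogue; the real notebook uses a Kaggle dataset instead.
games = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C", "Game D"],
    "genre": ["Action", "Action", "Shooter", "Shooter"],
    "console": ["PS4", "PS4", "X360", "PS4"],
})

# One-hot encode the categorical features so distances can be computed.
features = pd.get_dummies(games[["genre", "console"]]).astype(float)

# Fit an unsupervised nearest neighbours index on the encoded features.
# The cosine metric is an assumption; other metrics could work as well.
model = NearestNeighbors(n_neighbors=2, metric="cosine").fit(features)

def recommend(title):
    """Return the title of the game most similar to `title`, excluding itself."""
    row = games.index[games["title"] == title][0]
    _, indices = model.kneighbors(features.iloc[[row]])
    # Drop the query game itself, if present, and keep the best remaining match.
    candidates = [i for i in indices[0] if i != row]
    return games.loc[candidates[0], "title"]

print(recommend("Game A"))
```

Games that share genre and console end up closest to each other, which is the behaviour we want from the recommender.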

The iterations of the project are kept in a Git repository:
- git link - https://git.fhict.nl/I509460/video-game-reommendation.git

The project was created and worked on by Mihail Kenarov.


In [35]:
import sklearn 
import pandas as pd
import seaborn as sns

print("scikit-learn version:", sklearn.__version__)     # 1.4.1.post1
print("pandas version:", pd.__version__)            # 2.2.0
print("seaborn version:", sns.__version__)          # 0.13.2

scikit-learn version: 1.4.1.post1
pandas version: 2.2.0
seaborn version: 0.13.2


# 📦 Data provisioning



### Data Requirement

We need to find a suitable dataset that contains video games along with data about them, such as genres, publishers, etc.

## Data Collection 

This data is available on Kaggle. We will use it as a CSV file, since it contains information that seems quite useful for us: https://www.kaggle.com/datasets/asaniczka/video-game-sales-2024


In [36]:
df = pd.read_csv('vgchartz-2024.csv')
df.head()

Unnamed: 0,img,title,console,genre,publisher,developer,critic_score,total_sales,na_sales,jp_sales,pal_sales,other_sales,release_date,last_update
0,/games/boxart/full_6510540AmericaFrontccc.jpg,Grand Theft Auto V,PS3,Action,Rockstar Games,Rockstar North,9.4,20.32,6.37,0.99,9.85,3.12,2013-09-17,
1,/games/boxart/full_5563178AmericaFrontccc.jpg,Grand Theft Auto V,PS4,Action,Rockstar Games,Rockstar North,9.7,19.39,6.06,0.6,9.71,3.02,2014-11-18,2018-01-03
2,/games/boxart/827563ccc.jpg,Grand Theft Auto: Vice City,PS2,Action,Rockstar Games,Rockstar North,9.6,16.15,8.41,0.47,5.49,1.78,2002-10-28,
3,/games/boxart/full_9218923AmericaFrontccc.jpg,Grand Theft Auto V,X360,Action,Rockstar Games,Rockstar North,,15.86,9.06,0.06,5.33,1.42,2013-09-17,
4,/games/boxart/full_4990510AmericaFrontccc.jpg,Call of Duty: Black Ops 3,PS4,Shooter,Activision,Treyarch,8.1,15.09,6.18,0.41,6.05,2.44,2015-11-06,2018-01-14


We see that we have a lot of data to work with. However, I do not believe all of the columns are needed when it comes to finding the most suitable game for the one a user has chosen in this Jupyter notebook. I would say that `last_update`, `img`, `pal_sales` and `other_sales` are currently unnecessary, so I will drop them.


In [37]:
columns_to_drop = ['last_update', 'img', 'pal_sales', 'other_sales']
cleaned_df = df.drop(columns_to_drop, axis=1)
cleaned_df.head(10)

Unnamed: 0,title,console,genre,publisher,developer,critic_score,total_sales,na_sales,jp_sales,release_date
0,Grand Theft Auto V,PS3,Action,Rockstar Games,Rockstar North,9.4,20.32,6.37,0.99,2013-09-17
1,Grand Theft Auto V,PS4,Action,Rockstar Games,Rockstar North,9.7,19.39,6.06,0.6,2014-11-18
2,Grand Theft Auto: Vice City,PS2,Action,Rockstar Games,Rockstar North,9.6,16.15,8.41,0.47,2002-10-28
3,Grand Theft Auto V,X360,Action,Rockstar Games,Rockstar North,,15.86,9.06,0.06,2013-09-17
4,Call of Duty: Black Ops 3,PS4,Shooter,Activision,Treyarch,8.1,15.09,6.18,0.41,2015-11-06
5,Call of Duty: Modern Warfare 3,X360,Shooter,Activision,Infinity Ward,8.7,14.82,9.07,0.13,2011-11-08
6,Call of Duty: Black Ops,X360,Shooter,Activision,Treyarch,8.8,14.74,9.76,0.11,2010-11-09
7,Red Dead Redemption 2,PS4,Action-Adventure,Rockstar Games,Rockstar Games,9.8,13.94,5.26,0.21,2018-10-26
8,Call of Duty: Black Ops II,X360,Shooter,Activision,Treyarch,8.4,13.86,8.27,0.07,2012-11-13
9,Call of Duty: Black Ops II,PS3,Shooter,Activision,Treyarch,8.0,13.8,4.99,0.65,2012-11-13


## Current thoughts

Currently I am not completely sure whether all of these columns are going to matter to our system, but I will keep them for now. Let us create a data dictionary for what we have.





## Data Dictionary

As a first attempt to improve the data requirements, I have tried to add units and ranges to the data definition. It can also help to check the consistency and validity of the data.

(0) Title: The name of the game

(1) Console: The console on which the game is played

(2) Genre: The genre of a game

(3) Publisher: The game publishers, which could be considered 'the big names' of the industry

(4) Developer: The studio that worked on the creation of the game

(5) Critic score: The score that is given to a game by a review outlet (like IGN)

(6) Total sales: The number of copies of the game sold worldwide, in millions

(7) NA sales: The number of copies of the game sold in North America, in millions

(8) Japan sales: The number of copies of the game sold in Japan, in millions

(9) Release date: The date on which the game was released
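Once ranges are part of the data dictionary, they can be turned into a simple validity check. The sketch below is only an illustration: the `EXPECTED_RANGES` mapping, the `count_out_of_range` helper and the made-up `sample` frame are hypothetical, and the 1 to 10 score scale and non-negative sales are assumptions about the dataset.

```python
import pandas as pd

# Hypothetical expected ranges (low, high); None means "no bound".
# The 1-10 score scale and non-negative sales are assumptions.
EXPECTED_RANGES = {
    "critic_score": (1.0, 10.0),
    "total_sales": (0.0, None),
    "na_sales": (0.0, None),
    "jp_sales": (0.0, None),
}

def count_out_of_range(frame, ranges):
    """Count non-missing values outside [low, high] for each listed column."""
    violations = {}
    for column, (low, high) in ranges.items():
        values = frame[column].dropna()
        nothing = pd.Series(False, index=values.index)
        too_low = values < low if low is not None else nothing
        too_high = values > high if high is not None else nothing
        violations[column] = int((too_low | too_high).sum())
    return violations

# Small made-up sample: one impossible score and one negative sales figure.
sample = pd.DataFrame({
    "critic_score": [9.4, 11.0, None],
    "total_sales": [20.32, -1.0, 0.5],
    "na_sales": [6.37, None, 0.1],
    "jp_sales": [0.99, 0.6, None],
})
print(count_out_of_range(sample, EXPECTED_RANGES))
```

In the notebook itself the same helper could be called with `cleaned_df` instead of `sample`; a result of all zeros would suggest the numeric columns are consistent with the dictionary.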



### For now the data seems alright, but let's check for missing values, what kind of data we are working with, and some general statistics of the dataset

In [38]:
cleaned_df.info() # to see the type of data we are working with

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64016 entries, 0 to 64015
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         64016 non-null  object 
 1   console       64016 non-null  object 
 2   genre         64016 non-null  object 
 3   publisher     64016 non-null  object 
 4   developer     63999 non-null  object 
 5   critic_score  6678 non-null   float64
 6   total_sales   18922 non-null  float64
 7   na_sales      12637 non-null  float64
 8   jp_sales      6726 non-null   float64
 9   release_date  56965 non-null  object 
dtypes: float64(4), object(6)
memory usage: 4.9+ MB


In [39]:
print(cleaned_df.isna().sum()) # to see the missing values

title               0
console             0
genre               0
publisher           0
developer          17
critic_score    57338
total_sales     45094
na_sales        51379
jp_sales        57290
release_date     7051
dtype: int64
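The raw counts are easier to compare as a share of all rows. For example, `critic_score` is missing in 57338 of 64016 rows, which is roughly 90%. A small sketch of that calculation follows; the `missing_share` helper and the made-up `sample` frame are hypothetical names used only to keep the example self-contained.

```python
import pandas as pd

def missing_share(frame):
    """Return the percentage of missing values per column, highest first."""
    return (frame.isna().mean() * 100).round(1).sort_values(ascending=False)

# Made-up stand-in for `cleaned_df`, just to keep the sketch runnable on its own.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "critic_score": [9.4, None, None, None],
    "total_sales": [20.32, 19.39, None, None],
})
print(missing_share(sample))
```

Calling `missing_share(cleaned_df)` in the notebook would give the same view for the real data.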


In [40]:
print(cleaned_df.describe()) # to get a bit of general knowledge about the data in the dataset currently

       critic_score   total_sales      na_sales     jp_sales
count   6678.000000  18922.000000  12637.000000  6726.000000
mean       7.220440      0.349113      0.264740     0.102281
std        1.457066      0.807462      0.494787     0.168811
min        1.000000      0.000000      0.000000     0.000000
25%        6.400000      0.030000      0.050000     0.020000
50%        7.500000      0.120000      0.120000     0.040000
75%        8.300000      0.340000      0.280000     0.120000
max       10.000000     20.320000      9.760000     2.130000


### Current thoughts 

There is still quite a lot of missing data in columns that I am not sure I completely need. After more research I will come to a better conclusion.
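One way to support that decision is to count how many rows would survive if a given column were required to be present. This is only a sketch of one possible approach, not a decision about the final cleaning step; the `rows_kept` helper and the made-up `sample` frame are hypothetical.

```python
import pandas as pd

def rows_kept(frame, required_columns):
    """Number of rows left after dropping rows with a missing value in any required column."""
    return len(frame.dropna(subset=required_columns))

# Made-up stand-in for `cleaned_df`.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "critic_score": [9.4, None, 8.1, None],
    "total_sales": [20.32, 19.39, None, None],
})

# Compare a few candidate sets of required columns.
for subset in (["critic_score"], ["total_sales"], ["critic_score", "total_sales"]):
    print(subset, rows_kept(sample, subset))
```

If requiring a column leaves only a small fraction of the rows, that column is probably better left out of the similarity features than used as a filter.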

## What we currently know


(5) Critic score: The maximum score is 10 and the minimum is 1 (out of 10)

(6) Total sales: The highest worldwide sales figure for a game is 20.32 million, while the lowest found is 0

(7) NA sales: The highest North American sales figure for a game is 9.76 million, while the lowest found is 0

(8) Japan sales: The highest Japanese sales figure for a game is 2.13 million, while the lowest found is 0
