##  Video games recommendation system

The aim of this notebook is to create a recommendation system that suggests products similar to the one the user chose.

The iterations of the project will be kept in a Git repository:
- git link - https://git.fhict.nl/I509460/video-game-reommendation.git

The project was created and is worked on by Mihail Kenarov.


In [22]:
import sklearn 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

print("scikit-learn version:", sklearn.__version__) # 1.4.1
print("pandas version:", pd.__version__)            # 2.2.0
print("seaborn version:", sns.__version__)          # 0.13.2

scikit-learn version: 1.4.1.post1
pandas version: 2.2.0
seaborn version: 0.13.2


# 📦 Data provisioning



### Data Requirement

We need a suitable dataset of video games that includes descriptive attributes such as genres and publishers.

## Data Collection 

This data is available on Kaggle. We will use it as a set of CSV files, which contain information that looks useful for our purpose: https://www.kaggle.com/datasets/gsimonx37/backloggd/data

The dataset also appears to be recent: at the time of writing it had been updated less than a week ago, which gives us some assurance that it is still relevant.

In [23]:
df_devs = pd.read_csv('developers.csv')
df_games = pd.read_csv('games.csv')
df_genres = pd.read_csv('genres.csv')
df_platforms = pd.read_csv('platforms.csv')
df_scores = pd.read_csv('scores.csv')

# Let's see what we are working with, starting with the sizes

print(f"The dataset for devs has the size of {df_devs.shape[0]} rows and {df_devs.shape[1]} columns.")
print(f"The dataset for games has the size of {df_games.shape[0]} rows and {df_games.shape[1]} columns.")
print(f"The dataset for genres has the size of {df_genres.shape[0]} rows and {df_genres.shape[1]} columns.")
print(f"The dataset for platforms has the size of {df_platforms.shape[0]} rows and {df_platforms.shape[1]} columns.")
print(f"The dataset for scores has the size of {df_scores.shape[0]} rows and {df_scores.shape[1]} columns.")



The dataset for devs has the size of 143454 rows and 2 columns.
The dataset for games has the size of 172512 rows and 10 columns.
The dataset for genres has the size of 286025 rows and 2 columns.
The dataset for platforms has the size of 261475 rows and 2 columns.
The dataset for scores has the size of 1725120 rows and 3 columns.
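
The five `read_csv` calls and five `print` calls above follow the same pattern, so they could also be driven from a list of table names. The following is only a sketch: the helper name `load_tables` is hypothetical, and the demonstration uses a throwaway directory with one tiny, made-up file rather than the real Kaggle files.

```python
from pathlib import Path
import tempfile

import pandas as pd

# File stems used in this notebook (developers.csv, games.csv, ...).
TABLE_NAMES = ["developers", "games", "genres", "platforms", "scores"]


def load_tables(data_dir, names=TABLE_NAMES):
    """Read <name>.csv for each name, skipping files that are not present.

    Hypothetical helper, not part of the original notebook.
    """
    tables = {}
    for name in names:
        path = Path(data_dir) / f"{name}.csv"
        if path.exists():
            tables[name] = pd.read_csv(path)
    return tables


# Demonstration on a temporary directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "genres.csv").write_text("id,genre\n1,Arcade\n1,Sport\n2,Strategy\n")
    tables = load_tables(tmp)

for name, frame in tables.items():
    print(f"The dataset for {name} has {frame.shape[0]} rows and {frame.shape[1]} columns.")
```

Keeping the frames in a dictionary makes it easy to apply the same inspection step to every table in a loop.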


## Let's look at the columns, the kinds of data they hold, and their values

In [24]:
print(f"The dataset for devs has {df_devs.columns} columns.")
print("")
print(f"The dataset for games has {df_games.columns} columns.")
print("")
print(f"The dataset for genres has {df_genres.columns} columns.")
print("")
print(f"The dataset for platforms has {df_platforms.columns} columns.")
print("")
print(f"The dataset for scores has {df_scores.columns} columns.")


The dataset for devs has Index(['id', 'developer'], dtype='object') columns.

The dataset for games has Index(['id', 'name', 'date', 'rating', 'reviews', 'plays', 'playing',
       'backlogs', 'wishlists', 'description'],
      dtype='object') columns.

The dataset for genres has Index(['id', 'genre'], dtype='object') columns.

The dataset for platforms has Index(['id', 'platform'], dtype='object') columns.

The dataset for scores has Index(['id', 'score', 'amount'], dtype='object') columns.


In [25]:
print("df_devs has these unique values")
df_devs.nunique()


df_devs has these unique values


id           101022
developer     30502
dtype: int64

In [26]:
print(f"df_games has these unique values")
df_games.nunique()

df_games has these unique values


id             172512
name           136313
date            12171
rating             47
reviews           834
plays            3236
playing           584
backlogs         1808
wishlists        1220
description    128142
dtype: int64

In [27]:
print(f"df_genres has these unique values")
df_genres.nunique()

df_genres has these unique values


id       147369
genre        23
dtype: int64

In [28]:
print(f"df_platforms has these unique values")
df_platforms.nunique()

df_platforms has these unique values


id          144033
platform       199
dtype: int64

In [29]:
print(f"df_scores has these unique values")
df_scores.nunique()

df_scores has these unique values


id        172512
score         10
amount      2276
dtype: int64
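
The counts above already say something about the keys: `df_games` has 172512 distinct `id` values for 172512 rows, so `id` behaves like a primary key there, while `df_devs` has only 101022 distinct ids spread over 143454 rows. In other words, some games have several developer rows and many have none. A minimal sketch of how this could be checked programmatically is shown below; the helper `key_report` and the toy frames are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for df_games and df_devs (all values are made up).
games = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["A", "B", "C", "D"]})
devs = pd.DataFrame({"id": [1, 1, 3], "developer": ["X", "Y", "Z"]})


def key_report(parent, child, key="id"):
    """Summarise how a child table's foreign key relates to the parent key."""
    parent_ids = set(parent[key])
    child_ids = set(child[key])
    return {
        # True when every parent row has its own key value.
        "parent_key_unique": bool(parent[key].is_unique),
        # Parent rows that have no matching row in the child table.
        "parents_without_child": len(parent_ids - child_ids),
        # Child keys that do not exist in the parent table.
        "orphan_child_ids": len(child_ids - parent_ids),
        # Largest number of child rows that share one key.
        "max_rows_per_key": int(child[key].value_counts().max()),
    }


report = key_report(games, devs)
print(report)
```

On the real data the same call would be `key_report(df_games, df_devs)`, and likewise for genres, platforms, and scores.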

## Data Dictionary

As a first attempt to improve the data requirements, I have tried to add units and ranges to the data definitions. This also helps when checking the consistency and validity of the data.

### 1.Games Dataset - basic data:

id - video game identifier (primary key);

name - name of the video game;

date - release date of the video game;

rating - average rating of the video game;

reviews - number of reviews;

plays - total number of players;

playing - number of users currently playing;

backlogs - the number of additions of a video game to the backlog;

wishlists - the number of times a video game has been added to “favorites”;

description - description of the video game.

### 2.Developers dataset - developers (publishers):

id - video game identifier (foreign key);

developer - developer (publisher) of a video game.

### 3.Platforms dataset - gaming platforms:

id - video game identifier (foreign key);

platform - gaming platform.

### 4.Genres dataset - game genres:

id - video game identifier (foreign key);

genre - video game genre.

### 5.Scores dataset - user ratings:

id - video game identifier (foreign key);

score - score (from 0.5 to 5 in increments of 0.5);

amount - number of users who gave this score.
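
Because the scores table stores a histogram (one row per game and score value, with `amount` as the count), an average score per game can be derived as a weighted mean, $\bar{s} = \sum_k s_k n_k / \sum_k n_k$, where $s_k$ is a score value and $n_k$ its `amount`. The sketch below uses made-up numbers. Whether the `rating` column in the games table was computed this way is not stated in the source, so treat any comparison between the two as an assumption.

```python
import pandas as pd

# Toy stand-in for df_scores (values are made up).
scores = pd.DataFrame({
    "id":     [1,   1,   1,   2,   2],
    "score":  [0.5, 3.0, 5.0, 4.0, 4.5],
    "amount": [1,   2,   1,   0,   0],
})

# Validity check from the data dictionary: scores run from 0.5 to 5 in steps of 0.5.
allowed = [0.5 * k for k in range(1, 11)]
valid_scores = scores["score"].isin(allowed).all()

# Weighted mean per game: sum(score * amount) / sum(amount).
weighted = scores.assign(weighted=scores["score"] * scores["amount"])
totals = weighted.groupby("id")[["weighted", "amount"]].sum()

# Games without any votes get NaN instead of a division by zero.
votes = totals["amount"].where(totals["amount"] > 0)
mean_score = totals["weighted"] / votes

print(valid_scores)
print(mean_score)
```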


## For now, let's work with the games dataset; if some data is missing, we can bring in the other datasets to get the information we need

In [30]:
df_games.head()

Unnamed: 0,id,name,date,rating,reviews,plays,playing,backlogs,wishlists,description
0,1000001,Cathode Ray Tube Amusement Device,1947-12-31,3.5,65.0,117.0,1.0,28.0,56.0,The cathode ray tube amusement device is the e...
1,1000002,Bertie the Brain,1950-08-25,2.5,11.0,24.0,0.0,6.0,12.0,Currently considered the first videogame in hi...
2,1000003,Nim,1951-12-31,1.8,2.0,11.0,0.0,2.0,6.0,The Nimrod was a special purpose computer that...
3,1000004,Draughts,1952-08-31,2.4,3.0,17.0,0.0,3.0,7.0,A game of draughts (a.k.a. checkers) written f...
4,1000005,OXO,1952-12-31,3.1,14.0,52.0,1.0,12.0,13.0,OXO was a computer game developed by Alexander...


In [31]:
df_games.shape

(172512, 10)

In [32]:
print(df_games.isna().sum()) # to see the missing values

id                  0
name                0
date            34781
rating         116943
reviews             1
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64
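
Absolute counts are easier to judge next to the table size: for example, 116943 missing `rating` values out of 172512 rows is roughly two thirds of the games. A small sketch of a summary that shows both the count and the share per column follows. The toy frame is made up; on the real data the same expressions would be applied to `df_games`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_games (values are made up).
games = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rating": [3.5, np.nan, np.nan, np.nan],
    "plays": [117.0, 24.0, np.nan, 11.0],
})

# Count and percentage of missing values per column, worst column first.
summary = pd.DataFrame({
    "missing": games.isna().sum(),
    "share_pct": (games.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```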


### There are some missing values that we will look at later

## For now I will leave them as they are and deal with them later. I would also like to add the publishers to each game, so let's start with that


### Let's start by putting all developers of a game on the same row

In [40]:
# Convert the 'developer' column to string so the values can be joined
df_devs['developer'] = df_devs['developer'].astype(str)

# Group by 'id' and join the developers into a single string
df_devs = df_devs.groupby('id')['developer'].apply(', '.join).reset_index()

print(df_devs.head(10))

        id                                          developer
0  1000002                                        Josef Kates
1  1000004                               Christopher Strachey
2  1000005  Alexander Shafto "Sandy" Douglas, University o...
3  1000007                                William Higinbotham
4  1000009           Steve Russel, Computer Recreations, Inc.
5  1000010                                        Mabel Addis
6  1000011                                        Namco, Sega
7  1000013                        Creative Computing Software
8  1000015                      Miso Ramen Group, Hudson Soft
9  1000016                                           Magnavox
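
One detail of the step above: `astype(str)` turns a missing value into the literal string `'nan'`, which would then be joined like a real developer name. Whether `developers.csv` actually contains missing names is not shown in this notebook, so the following is only a precautionary sketch on made-up data that drops missing names before joining.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_devs; the NaN row is a made-up example of a missing name.
devs = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "developer": ["Namco", "Sega", np.nan, "Magnavox"],
})

# Drop missing names first, then join the remaining names per game.
joined = (
    devs.dropna(subset=["developer"])
        .groupby("id")["developer"]
        .agg(", ".join)
        .reset_index()
)

print(joined)
```

Games whose only developer row was missing simply disappear from `joined`; a later left join leaves their `developer` value empty.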


### Let's merge them with the games

In [42]:
# Merge df_games and df_devs on the 'id' column using a left join
merged_df = pd.merge(df_games, df_devs, on='id', how='left')

# Display the first 16 rows of the merged DataFrame
merged_df.head(16)

Unnamed: 0,id,name,date,rating,reviews,plays,playing,backlogs,wishlists,description,developer
0,1000001,Cathode Ray Tube Amusement Device,1947-12-31,3.5,65.0,117.0,1.0,28.0,56.0,The cathode ray tube amusement device is the e...,
1,1000002,Bertie the Brain,1950-08-25,2.5,11.0,24.0,0.0,6.0,12.0,Currently considered the first videogame in hi...,Josef Kates
2,1000003,Nim,1951-12-31,1.8,2.0,11.0,0.0,2.0,6.0,The Nimrod was a special purpose computer that...,
3,1000004,Draughts,1952-08-31,2.4,3.0,17.0,0.0,3.0,7.0,A game of draughts (a.k.a. checkers) written f...,Christopher Strachey
4,1000005,OXO,1952-12-31,3.1,14.0,52.0,1.0,12.0,13.0,OXO was a computer game developed by Alexander...,"Alexander Shafto ""Sandy"" Douglas, University o..."
5,1000006,Pool,1954-06-26,3.0,5.0,20.0,0.0,2.0,4.0,A game of pool (billiards) developed by Willia...,
6,1000007,Tennis for Two,1958-10-18,3.0,41.0,100.0,0.0,18.0,29.0,Tennis for Two is often credited to be the wor...,William Higinbotham
7,1000008,Mouse in the Maze,1959-01-16,2.6,3.0,17.0,0.0,2.0,6.0,"A game where players place maze walls, bits of...",
8,1000009,Spacewar!,1962-04-30,3.0,25.0,124.0,0.0,23.0,36.0,Spacewar! is one of the earliest digital compu...,"Steve Russel, Computer Recreations, Inc."
9,1000010,The Sumerian Game,1964-12-31,2.6,3.0,17.0,0.0,7.0,7.0,The Sumerian Game is a text-based strategy vid...,Mabel Addis
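
A left join keeps every game and leaves `developer` empty where no match exists, as in rows 0, 2, 5 and 7 above. `pd.merge` has two optional parameters that make this easier to verify: `validate` raises an error if the key relationship is not the expected one, and `indicator` adds a `_merge` column that says where each row came from. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy stand-ins for df_games and the grouped df_devs (values are made up).
games = pd.DataFrame({"id": [1, 2, 3], "name": ["Game A", "Game B", "Game C"]})
devs = pd.DataFrame({"id": [2, 3], "developer": ["Dev X", "Dev Y"]})

# validate="one_to_one" checks that 'id' is unique on both sides;
# indicator=True adds the '_merge' column ('both' or 'left_only' here).
merged = pd.merge(games, devs, on="id", how="left",
                  validate="one_to_one", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)

# If the right-hand table still had several rows per game, the check would fail.
duplicated = pd.concat([devs, devs.iloc[[0]]], ignore_index=True)
try:
    pd.merge(games, duplicated, on="id", how="left", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("duplicate keys detected:", raised)
```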


Let's add the genres as well 


In [45]:
# Convert the 'genre' column to string
df_genres['genre'] = df_genres['genre'].astype(str)

# Group by 'id' and join the genres into a single string
df_genres = df_genres.groupby('id')['genre'].apply(', '.join).reset_index()

# Perform one-hot encoding
df_genres_encoded = df_genres['genre'].str.get_dummies(sep=', ')

# Join the encoded genres back to the 'id' column
df_genres_encoded = pd.concat([df_genres['id'], df_genres_encoded], axis=1)

df_genres_encoded.head(10)

Unnamed: 0,id,Adventure,Arcade,Brawler,Card & Board Game,Fighting,Indie,MOBA,Music,Pinball,...,RPG,Racing,Real Time Strategy,Shooter,Simulator,Sport,Strategy,Tactical,Turn Based Strategy,Visual Novel
0,1000001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1000002,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1000003,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,1000004,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1000005,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
5,1000006,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
6,1000007,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
7,1000008,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8,1000009,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
9,1000010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


### I believe that one-hot encoding will help us measure the similarity between games, which will be a key part of our system. Now, let's join them together
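
To illustrate why one-hot vectors are useful here, the sketch below compares toy genre vectors with cosine similarity from scikit-learn. The notebook has not chosen a similarity measure at this point, so cosine similarity, the helper `most_similar`, and the toy games are all assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy one-hot genre matrix (made-up games, genre names taken from the dataset).
genres = pd.DataFrame(
    {"Arcade": [1, 0, 0], "Sport": [1, 1, 0], "Strategy": [0, 0, 1]},
    index=["Game A", "Game B", "Game C"],
)

# Pairwise cosine similarity between all rows, labelled by game name.
sim = pd.DataFrame(
    cosine_similarity(genres.values),
    index=genres.index,
    columns=genres.index,
)


def most_similar(name, k=2):
    """Return the k games most similar to `name`, excluding the game itself."""
    return sim[name].drop(name).sort_values(ascending=False).head(k)


print(most_similar("Game A", k=1))
```

Games that share a genre get a score above zero, and games with no genre in common get zero. With 172512 games the full pairwise matrix would have roughly 30 billion entries, so on the real data it is probably more practical to compare one query game against all others at a time.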



In [48]:
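# Merge the developers and the one-hot encoded genres into df_games on the 'id' column using left joins,
# so the developer column from the previous step is not lost
merged_df = (
    df_games
    .merge(df_devs, on='id', how='left')
    .merge(df_genres_encoded, on='id', how='left')
)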

# Display the first 16 rows of the merged DataFrame
merged_df.head(16)

Unnamed: 0,id,name,date,rating,reviews,plays,playing,backlogs,wishlists,description,...,RPG,Racing,Real Time Strategy,Shooter,Simulator,Sport,Strategy,Tactical,Turn Based Strategy,Visual Novel
0,1000001,Cathode Ray Tube Amusement Device,1947-12-31,3.5,65.0,117.0,1.0,28.0,56.0,The cathode ray tube amusement device is the e...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1000002,Bertie the Brain,1950-08-25,2.5,11.0,24.0,0.0,6.0,12.0,Currently considered the first videogame in hi...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1000003,Nim,1951-12-31,1.8,2.0,11.0,0.0,2.0,6.0,The Nimrod was a special purpose computer that...,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1000004,Draughts,1952-08-31,2.4,3.0,17.0,0.0,3.0,7.0,A game of draughts (a.k.a. checkers) written f...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1000005,OXO,1952-12-31,3.1,14.0,52.0,1.0,12.0,13.0,OXO was a computer game developed by Alexander...,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
5,1000006,Pool,1954-06-26,3.0,5.0,20.0,0.0,2.0,4.0,A game of pool (billiards) developed by Willia...,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
6,1000007,Tennis for Two,1958-10-18,3.0,41.0,100.0,0.0,18.0,29.0,Tennis for Two is often credited to be the wor...,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,1000008,Mouse in the Maze,1959-01-16,2.6,3.0,17.0,0.0,2.0,6.0,"A game where players place maze walls, bits of...",...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
8,1000009,Spacewar!,1962-04-30,3.0,25.0,124.0,0.0,23.0,36.0,Spacewar! is one of the earliest digital compu...,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
9,1000010,The Sumerian Game,1964-12-31,2.6,3.0,17.0,0.0,7.0,7.0,The Sumerian Game is a text-based strategy vid...,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
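
The genre flags now show up as `0.0` and `1.0` instead of `0` and `1`. This is a side effect of the left join: the genres table covers 147369 distinct ids while there are 172512 games, so games without genre rows receive `NaN`, and a column that holds `NaN` becomes float. A minimal sketch of one way to restore integer flags follows. Treating "no genre rows" as "belongs to no genre" is a modelling assumption, not something the dataset states.

```python
import pandas as pd

# Toy stand-ins for df_games and df_genres_encoded (values are made up).
games = pd.DataFrame({"id": [1, 2, 3], "name": ["Game A", "Game B", "Game C"]})
genres_encoded = pd.DataFrame({"id": [1, 3], "Arcade": [1, 0], "Sport": [0, 1]})

merged = pd.merge(games, genres_encoded, on="id", how="left")

# Every column of the encoded table except the key is a genre flag.
genre_cols = genres_encoded.columns.drop("id")

# Game 2 has no genre rows, so its flags are NaN after the join; fill with 0 and cast back.
merged[genre_cols] = merged[genre_cols].fillna(0).astype(int)

print(merged)
```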


# A complete dataset has been created

### Next, I will look into the features of the dataset, examine the missing values, and then move on to building the recommendation algorithm
