# Reccomendation System for Steam Game - Content-Based Filtering

- Author    : Muhammad Aditya Bayhaqie
- Practice  : Machine Learning Terapan (Dicoding)
- Dataset   : [Steam Games Kaggle](https://www.kaggle.com/datasets/fronkongames/steam-games-dataset/data?select=games.csv)

## Data Understanding

Mari panggil library dan dataset yang akan digunakan.

In [3]:
# Import library
import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import time

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
! mkdir ~/.kaggle

In [6]:
!cp /content/drive/MyDrive/CollabData/kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [7]:
! chmod 600 ~/.kaggle/kaggle.json

In [8]:
!kaggle datasets download fronkongames/steam-games-dataset

Dataset URL: https://www.kaggle.com/datasets/fronkongames/steam-games-dataset
License(s): MIT
Downloading steam-games-dataset.zip to /content
 94% 225M/241M [00:00<00:00, 590MB/s] 
100% 241M/241M [00:00<00:00, 440MB/s]


In [9]:
!unzip steam-games-dataset.zip

Archive:  steam-games-dataset.zip
  inflating: games.csv               
  inflating: games.json              


### Data Assessment

In [10]:
games = pd.read_csv('/content/games.csv')

In [11]:
games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 111452 entries, 20200 to 3183790
Data columns (total 39 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AppID                       111446 non-null  object 
 1   Name                        111452 non-null  object 
 2   Release date                111452 non-null  object 
 3   Estimated owners            111452 non-null  int64  
 4   Peak CCU                    111452 non-null  int64  
 5   Required age                111452 non-null  float64
 6   Price                       111452 non-null  int64  
 7   DiscountDLC count           111452 non-null  int64  
 8   About the game              104969 non-null  object 
 9   Supported languages         111452 non-null  object 
 10  Full audio languages        111452 non-null  object 
 11  Reviews                     10624 non-null   object 
 12  Header image                111452 non-null  object 
 13  Website       

Dari data tersebut, ditarik kesimpulan bahwa:
- Terdapat 37 **Kolom**
- Terdapat 111452 **Baris**
- Beberapa Kolom memiliki **Non-null** yang sedikit (`Score rank`, `Metacritic url`, `Reviews`) dan perlu ditangani dengan beberapa metode berupa
  - Drop Kolom
  - Isi Kolom kosong

In [12]:
# Membaca dataset

df = games
df.head(5)

Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,DiscountDLC count,About the game,Supported languages,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
20200,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,0,0,Galactic Bowling is an exaggerated and stylize...,['English'],...,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
655370,Train Bandit,"Oct 12, 2017",0 - 20000,0,0,0.99,0,0,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",...,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1732930,Jolt Project,"Nov 17, 2021",0 - 20000,0,0,4.99,0,0,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",...,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy",,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1355720,Henosis™,"Jul 23, 2020",0 - 20000,0,0,5.99,0,0,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",...,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1139950,Two Weeks in Painland,"Feb 3, 2020",0 - 20000,0,0,0.0,0,0,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",...,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...


Dari data tersebut, Dapat disimpulkan bahwa Terjadi pergeseran konten data dari kolom `AppID` hingga `DiscountDLCcount`, `AppID` akan di drop karena data tersebut Insignifikan dan Kolom lainnya akan direname untuk memperbaiki konten data

### Data Preparation

In [13]:
# Rename columns
df = df.rename(columns={
    'Price': 'DiscountDLC count',
    'Required age' : 'Price',
    'Peak CCU': 'Required age',
    'Estimated owners': 'Peak CCU',
    'Release date': 'Estimated owners',
    'Name': 'Release date',
    'AppID': 'Name',
})

# Reindex the DataFrame
df = df.reset_index(drop=True)

df.head(5)

Unnamed: 0,Name,Release date,Estimated owners,Peak CCU,Required age,Price,DiscountDLC count,DiscountDLC count.1,About the game,Supported languages,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,0,0,Galactic Bowling is an exaggerated and stylize...,['English'],...,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1,Train Bandit,"Oct 12, 2017",0 - 20000,0,0,0.99,0,0,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",...,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
2,Jolt Project,"Nov 17, 2021",0 - 20000,0,0,4.99,0,0,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",...,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy",,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
3,Henosis™,"Jul 23, 2020",0 - 20000,0,0,5.99,0,0,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",...,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
4,Two Weeks in Painland,"Feb 3, 2020",0 - 20000,0,0,0.0,0,0,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",...,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...


In [14]:
# Drop the 8th column ('DiscountDLC count')
df = df.drop(df.columns[7], axis=1)

# Display the updated DataFrame (optional)
df.head(5)

Unnamed: 0,Name,Release date,Estimated owners,Peak CCU,Required age,Price,About the game,Supported languages,Full audio languages,Reviews,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,Galactic Bowling is an exaggerated and stylize...,['English'],[],,...,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1,Train Bandit,"Oct 12, 2017",0 - 20000,0,0,0.99,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",[],,...,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
2,Jolt Project,"Nov 17, 2021",0 - 20000,0,0,4.99,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",[],,...,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy",,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
3,Henosis™,"Jul 23, 2020",0 - 20000,0,0,5.99,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",[],,...,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
4,Two Weeks in Painland,"Feb 3, 2020",0 - 20000,0,0,0.0,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",[],,...,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...


## Exploratory Data Analysis

### Univariate Exploratory Data Analysis

`Game` Variable

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111452 entries, 0 to 111451
Data columns (total 37 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Name                        111446 non-null  object 
 1   Release date                111452 non-null  object 
 2   Estimated owners            111452 non-null  object 
 3   Peak CCU                    111452 non-null  int64  
 4   Required age                111452 non-null  int64  
 5   Price                       111452 non-null  float64
 6   About the game              104969 non-null  object 
 7   Supported languages         111452 non-null  object 
 8   Full audio languages        111452 non-null  object 
 9   Reviews                     10624 non-null   object 
 10  Header image                111452 non-null  object 
 11  Website                     46458 non-null   object 
 12  Support url                 50759 non-null   object 
 13  Support email 

Fitur `Release date` perlu diganti tipe datanya ke dates agar datanya dapat meproses masukan data lebih baik nantinya

In [16]:
print('Banyak data game yang terdaftar: ', len(df.Name.unique()))
print('Banyak game yang terdaftar: ', df.Name.unique())
print('List Game: ', df.Name.unique())

Banyak data game yang terdaftar:  110326
Banyak game yang terdaftar:  ['Galactic Bowling' 'Train Bandit' 'Jolt Project' ... 'MosGhost'
 'AccuBow VR' 'Defense Of Fort Burton']
List Game:  ['Galactic Bowling' 'Train Bandit' 'Jolt Project' ... 'MosGhost'
 'AccuBow VR' 'Defense Of Fort Burton']


Terdapat 110326 data Game yang unik dengan 37 Fitur yang dapat digunakan

## Data Preprocessing

## Data Preparation

### Taking care on Missing Values

In [17]:
# Mengecek missing value pada dataframe all_resto
df.isnull().sum()

Unnamed: 0,0
Name,6
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,6483
Supported languages,0
Full audio languages,0
Reviews,100828


#### `Name` Feature Treatment

Feature `Name` yang null akan didrop saja barisnya.

In [18]:
df.dropna(subset=['Name'], inplace=True)
df.isnull().sum()

Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,6478
Supported languages,0
Full audio languages,0
Reviews,100822


#### `About the game` Feature Treatment

Feature `About the game` akan diganti dengan data

```
No Description
```

In [19]:
df['About the game'] = df['About the game'].fillna('No Description')
df.isnull().sum()

Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Reviews,100822


#### `Reviews`, `Website`, `Support url`, `Support email`, `Metacritic url`, `Metacritic score`, `Average playtime two weeks`,`Median playtime two weeks`, `Score rank` and `Notes` Feature Treatment

Fitur `Reviews`, `Website`, `Support url`, `Support email`, `Metacritic url`, `Metacritic score`, `Average playtime two weeks`,`Median playtime two weeks`, `Score rank` dan `Notes` akan didrop saja kolomnya

*PS: Data ini dapat digunakan sebagai pelengkap deskripsi game yang akan kita rekomendasikan, namun untuk kali ini datanya akan didrop saja*

In [20]:
# Drop specified columns
columns_to_drop = ['Reviews', 'Website', 'Support url', 'Support email', 'Metacritic url' , 'Metacritic score' , 'Score rank', 'Notes', 'Average playtime two weeks', 'Median playtime two weeks']
df = df.drop(columns=columns_to_drop)

df.isnull().sum()

Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Header image,0


#### `Developers` Feature Treatment

Feature `Developers` yang null akan didrop saja barisnya.

In [21]:
df.dropna(subset=['Developers'], inplace=True)
df.isnull().sum()

Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Header image,0


#### `Publishers` Feature Treatment

Feature `Publishers` yang null akan kita samakan dengan developers

In [22]:
publishers_with_null = df[df['Publishers'].isnull()]
print("Publishers with null data:")
display(publishers_with_null[['Name','Developers', 'Publishers']])

Publishers with null data:


Unnamed: 0,Name,Developers,Publishers
23,Turtle Lu,Falco Software,
345,Borderless Gaming,"AndrewMD5,Codeusa",
515,Bunker Constructor,Tindalos Interactive,
659,Super Meat Boy,Team Meat,
748,Little Square Things,G.Reed,
...,...,...,...
99688,AnderKant,KlankeKlanke,
101299,Ancient Ruins,Byking Inc,
101396,Heritage - A Dragon's Tale,CGWorks_HeritageDev,
102415,Retail Mage,Jam & Tea Studios,


In [23]:
df['Publishers'].fillna(df['Developers'], inplace=True)
df.isnull().sum()

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Publishers'].fillna(df['Developers'], inplace=True)


Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Header image,0


#### `Categories` Feature Treatment

In [24]:
display(df.head())

Unnamed: 0,Name,Release date,Estimated owners,Peak CCU,Required age,Price,About the game,Supported languages,Full audio languages,Header image,...,Recommendations,Average playtime forever,Median playtime forever,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,Galactic Bowling is an exaggerated and stylize...,['English'],[],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1,Train Bandit,"Oct 12, 2017",0 - 20000,0,0,0.99,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",[],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
2,Jolt Project,"Nov 17, 2021",0 - 20000,0,0,4.99,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",[],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy",,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
3,Henosis™,"Jul 23, 2020",0 - 20000,0,0,5.99,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",[],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
4,Two Weeks in Painland,"Feb 3, 2020",0 - 20000,0,0,0.0,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",[],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...


In [25]:
display(df[df['Categories'].isnull()])

Unnamed: 0,Name,Release date,Estimated owners,Peak CCU,Required age,Price,About the game,Supported languages,Full audio languages,Header image,...,Recommendations,Average playtime forever,Median playtime forever,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
31,Home Office Tasker,"Sep 8, 2021",0 - 20000,0,0,0.99,You no longer need to go to special applicatio...,"['English', 'Russian', 'German', 'Spanish - Sp...",[],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,lonch.me,lonch.me,,Utilities,"Utilities,Time Management,Time Manipulation,So...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
145,Kooring VR Coding Adventure,Aug 2020,0 - 20000,0,0,8.49,Help Kooring get to the goal through the 3 dif...,"['English', 'Simplified Chinese', 'Korean', 'T...","['English', 'Simplified Chinese', 'Korean', 'T...",https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,VRANI inc.,VRANI inc.,,"Adventure,Casual,Indie,Strategy,Education","Education,Choose Your Own Adventure,Programmin...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
216,Maria Blanchard Virtual Gallery,"Jul 1, 2022",0 - 0,0,0,0.00,Maria Blanchard (1881-1932). She was a Franco-...,['Spanish - Spain'],['Spanish - Spain'],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,Virtual Video,Virtual Video,,"Design & Illustration,Education",,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
359,Gamefuel Driver Control,"Dec 10, 2015",20000 - 50000,0,0,29.99,The problem: You may have hardware or devices ...,['English'],[],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0,Auslogics Software,Console Classics,,Utilities,Utilities,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
450,Start10,"May 11, 2017",0 - 20000,0,0,4.99,Customize your Start menu for easy access to U...,"['English', 'French', 'Italian', 'German', 'Sp...",[],https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,1,1,Stardock,Stardock,,Utilities,"Utilities,Software",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110841,Desktop Lux,"Mar 23, 2025",0 - 20000,0,0,9.99,Desktop Lux is a program for decoration and ad...,"['English', 'Russian', 'Simplified Chinese']","['English', 'Russian', 'Simplified Chinese']",https://shared.akamai.steamstatic.com/store_it...,...,0,0,0,Pothos,Pothos,,"Animation & Modeling,Design & Illustration,Uti...",,https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
110899,Image2pixel-PixelArtGenerator,"Mar 14, 2025",0 - 20000,2,0,9.99,Image2pixel is a pixel painting generation too...,"['English', 'French', 'Italian', 'German', 'Sp...",[],https://shared.akamai.steamstatic.com/store_it...,...,0,0,0,PixelBearStudio,PixelBearStudio,,"Design & Illustration,Photo Editing","Photo Editing,Design & Illustration",https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
111118,MateEngine,"Apr 16, 2025",0 - 20000,59,0,3.99,"MateEngine MateEngine is a lightweight, custom...","['English', 'Spanish - Spain', 'Japanese', 'Ru...","['English', 'Japanese', 'Traditional Chinese']",https://shared.akamai.steamstatic.com/store_it...,...,0,0,0,Shinyflvres,Shinymoon,,"Animation & Modeling,Design & Illustration","Animation & Modeling,Design & Illustration,Anime",https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
111243,JWildfireSwan,"Mar 13, 2025",0 - 20000,1,0,24.99,"JWildfireSwan is the successor to JWildfire, a...",['English'],[],https://shared.akamai.steamstatic.com/store_it...,...,0,0,0,Andreas Maschke,Andreas Maschke,,"Animation & Modeling,Design & Illustration,Gam...","Design & Illustration,Early Access,Animation &...",https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...


Data null pada `Categories` akan kita drop saja

In [26]:
df.dropna(subset=['Categories'], inplace=True)
df.isnull().sum()

Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Header image,0


#### `Screenshots` , `Movies` and `Header image` Feature Treatment

Untuk fitur `Screenshots` and `Movies` akan didrop saja karena tidak relevan untuk keperluan sistem rekomendasi nantinya

*PS: Data ini dapat digunakan sebagai pelengkap deskripsi game yang akan kita rekomendasikan, namun untuk kali ini datanya akan didrop saja*

In [27]:
# Drop specified columns
columns_to_drop = ['Screenshots', 'Movies', 'Header image']
df = df.drop(columns=columns_to_drop)

df.isnull().sum()

Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Windows,0


#### `Genres` Feature Treatment

Untuk fitur `Genres` akan didrop saja baris Null karena jumlah data Null yang sedikit

In [28]:
df.dropna(subset=['Genres'], inplace=True)
df.isnull().sum()

Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Windows,0


#### `Release date` Feature Treatment

Untuk fitur `Release date` akan didrop saja baris Null karena jumlah data Null yang sedikit

In [29]:
df.dropna(subset=['Release date'], inplace=True)
df.isnull().sum()

Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Windows,0


#### `Tags` Feature Treatment

Untuk fitur `Tags` akan diisi dengan data pada `Genres`

In [30]:
df['Tags'].fillna(df['Genres'], inplace=True)
df.isnull().sum()

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Tags'].fillna(df['Genres'], inplace=True)


Unnamed: 0,0
Name,0
Release date,0
Estimated owners,0
Peak CCU,0
Required age,0
Price,0
About the game,0
Supported languages,0
Full audio languages,0
Windows,0


### Data Type Modification

In [31]:
# Change 'Release date' to datetime
df['Release date'] = pd.to_datetime(df['Release date'], format='%b %d, %Y', errors='coerce')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 103573 entries, 0 to 111451
Data columns (total 24 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   Name                      103573 non-null  object        
 1   Release date              103448 non-null  datetime64[ns]
 2   Estimated owners          103573 non-null  object        
 3   Peak CCU                  103573 non-null  int64         
 4   Required age              103573 non-null  int64         
 5   Price                     103573 non-null  float64       
 6   About the game            103573 non-null  object        
 7   Supported languages       103573 non-null  object        
 8   Full audio languages      103573 non-null  object        
 9   Windows                   103573 non-null  bool          
 10  Mac                       103573 non-null  bool          
 11  Linux                     103573 non-null  bool          
 12  User sc

Unnamed: 0,Name,Release date,Estimated owners,Peak CCU,Required age,Price,About the game,Supported languages,Full audio languages,Windows,...,Negative,Achievements,Recommendations,Average playtime forever,Median playtime forever,Developers,Publishers,Categories,Genres,Tags
0,Galactic Bowling,2008-10-21,0 - 20000,0,0,19.99,Galactic Bowling is an exaggerated and stylize...,['English'],[],True,...,11,30,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling"
1,Train Bandit,2017-10-12,0 - 20000,0,0,0.99,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",[],True,...,5,12,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc..."
2,Jolt Project,2021-11-17,0 - 20000,0,0,4.99,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",[],True,...,0,0,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy","Action,Adventure,Indie,Strategy"
3,Henosis™,2020-07-23,0 - 20000,0,0,5.99,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",[],True,...,0,0,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz..."
4,Two Weeks in Painland,2020-02-03,0 - 20000,0,0,0.0,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",[],True,...,8,17,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,..."


In [32]:
display(df)

Unnamed: 0,Name,Release date,Estimated owners,Peak CCU,Required age,Price,About the game,Supported languages,Full audio languages,Windows,...,Negative,Achievements,Recommendations,Average playtime forever,Median playtime forever,Developers,Publishers,Categories,Genres,Tags
0,Galactic Bowling,2008-10-21,0 - 20000,0,0,19.99,Galactic Bowling is an exaggerated and stylize...,['English'],[],True,...,11,30,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling"
1,Train Bandit,2017-10-12,0 - 20000,0,0,0.99,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",[],True,...,5,12,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc..."
2,Jolt Project,2021-11-17,0 - 20000,0,0,4.99,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",[],True,...,0,0,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy","Action,Adventure,Indie,Strategy"
3,Henosis™,2020-07-23,0 - 20000,0,0,5.99,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",[],True,...,0,0,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz..."
4,Two Weeks in Painland,2020-02-03,0 - 20000,0,0,0.00,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",[],True,...,8,17,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111447,Paragon Of Time,2025-04-10,0 - 20000,0,0,2.99,"You stand at the edge of time, trying to save ...",['English'],[],True,...,0,0,0,0,0,Webcess,Webcess,"Single-player,Full controller support,Steam Cl...","Action,Casual,Indie","Action Roguelike,Bullet Hell,Hack and Slash,Ro..."
111448,A Few Days With : Hazel,2025-04-11,0 - 20000,0,0,2.69,"Join Hazel, an attractive young lady, and enjo...","['English', 'French', 'Italian', 'German', 'Sp...",[],True,...,0,7,0,0,0,Hentai Panda,Hentai Panda,"Single-player,Steam Achievements,Steam Cloud,F...","Casual,Indie","Casual,Indie"
111449,MosGhost,2025-04-01,0 - 20000,0,0,7.99,Story : Andrei moved to Moscow for work and re...,"['English', 'Russian', 'French', 'Italian', 'G...",[],True,...,12,0,0,0,0,Sinka Games,"Sinka Games,Arkuda Inc.","Single-player,Family Sharing",Simulation,"Simulation,Walking Simulator,Idler,First-Perso..."
111450,AccuBow VR,2025-03-11,0 - 0,0,0,0.00,AccuBow VR: Master Realistic Archery in Immers...,['English'],['English'],True,...,0,0,0,0,0,AccuBow LLC,AccuBow LLC,"Single-player,Tracked Controller Support,VR On...","Action,Adventure,Free To Play","Action,Adventure,Free To Play"


### Value Modification

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 103573 entries, 0 to 111451
Data columns (total 24 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   Name                      103573 non-null  object        
 1   Release date              103448 non-null  datetime64[ns]
 2   Estimated owners          103573 non-null  object        
 3   Peak CCU                  103573 non-null  int64         
 4   Required age              103573 non-null  int64         
 5   Price                     103573 non-null  float64       
 6   About the game            103573 non-null  object        
 7   Supported languages       103573 non-null  object        
 8   Full audio languages      103573 non-null  object        
 9   Windows                   103573 non-null  bool          
 10  Mac                       103573 non-null  bool          
 11  Linux                     103573 non-null  bool          
 12  User sc

#### Data Modification on `Supported languages` and `Full audio languages`

Data yang mengandung

```
[]
```

Pada `Supported languages` dan `Full audio languages` akan diganti dengan

```
No Supporting Languages / No Full audio languages
```

In [34]:
# Function to check and replace empty lists, and count occurrences
def replace_empty_lists_and_count(df, column_name):
  empty_list_count = 0
  for index, value in df[column_name].items():
    if value == '[]':
      df.loc[index, column_name] = f'No {column_name}'
      empty_list_count += 1
  return df, empty_list_count

# Process 'Supported languages' column
df, supported_languages_count = replace_empty_lists_and_count(df, 'Supported languages')
print(f"Number of entries with '[]' in 'Supported languages': {supported_languages_count}")

# Process 'Full audio languages' column
df, full_audio_languages_count = replace_empty_lists_and_count(df, 'Full audio languages')
print(f"Number of entries with '[]' in 'Full audio languages': {full_audio_languages_count}")

# Display rows that originally contained '[]' in either column (now replaced)
display(df[
    (df['Supported languages'] == 'No Supported languages') |
    (df['Full audio languages'] == 'No Full audio languages')
])

Number of entries with '[]' in 'Supported languages': 82
Number of entries with '[]' in 'Full audio languages': 58402


Unnamed: 0,Name,Release date,Estimated owners,Peak CCU,Required age,Price,About the game,Supported languages,Full audio languages,Windows,...,Negative,Achievements,Recommendations,Average playtime forever,Median playtime forever,Developers,Publishers,Categories,Genres,Tags
0,Galactic Bowling,2008-10-21,0 - 20000,0,0,19.99,Galactic Bowling is an exaggerated and stylize...,['English'],No Full audio languages,True,...,11,30,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling"
1,Train Bandit,2017-10-12,0 - 20000,0,0,0.99,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",No Full audio languages,True,...,5,12,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc..."
2,Jolt Project,2021-11-17,0 - 20000,0,0,4.99,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",No Full audio languages,True,...,0,0,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy","Action,Adventure,Indie,Strategy"
3,Henosis™,2020-07-23,0 - 20000,0,0,5.99,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",No Full audio languages,True,...,0,0,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz..."
4,Two Weeks in Painland,2020-02-03,0 - 20000,0,0,0.00,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",No Full audio languages,True,...,8,17,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111444,Kafkaesque: The Nightmare Trial,2025-04-17,0 - 20000,0,0,2.39,Kafkaesque: The Nightmare Trial is a psycholog...,['English'],No Full audio languages,True,...,0,3,0,0,0,Dawn Fades,Dawn Fades,"Single-player,Steam Achievements,Full controll...","Action,Indie","Action,Indie"
111446,Starry Trace,2025-04-14,0 - 20000,0,0,1.21,Welcome to Starry Trace ! A relaxing yet brain...,"['English', 'Japanese', 'Simplified Chinese', ...",No Full audio languages,True,...,0,10,0,0,0,Pomegranate Games,Pomegranate Games,"Single-player,Steam Achievements,Full controll...","Casual,Indie,Strategy","Casual,Strategy,Puzzle,Relaxing,2D,Cute,Detect..."
111447,Paragon Of Time,2025-04-10,0 - 20000,0,0,2.99,"You stand at the edge of time, trying to save ...",['English'],No Full audio languages,True,...,0,0,0,0,0,Webcess,Webcess,"Single-player,Full controller support,Steam Cl...","Action,Casual,Indie","Action Roguelike,Bullet Hell,Hack and Slash,Ro..."
111448,A Few Days With : Hazel,2025-04-11,0 - 20000,0,0,2.69,"Join Hazel, an attractive young lady, and enjo...","['English', 'French', 'Italian', 'German', 'Sp...",No Full audio languages,True,...,0,7,0,0,0,Hentai Panda,Hentai Panda,"Single-player,Steam Achievements,Steam Cloud,F...","Casual,Indie","Casual,Indie"


In [35]:
df.head()

Unnamed: 0,Name,Release date,Estimated owners,Peak CCU,Required age,Price,About the game,Supported languages,Full audio languages,Windows,...,Negative,Achievements,Recommendations,Average playtime forever,Median playtime forever,Developers,Publishers,Categories,Genres,Tags
0,Galactic Bowling,2008-10-21,0 - 20000,0,0,19.99,Galactic Bowling is an exaggerated and stylize...,['English'],No Full audio languages,True,...,11,30,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling"
1,Train Bandit,2017-10-12,0 - 20000,0,0,0.99,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",No Full audio languages,True,...,5,12,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc..."
2,Jolt Project,2021-11-17,0 - 20000,0,0,4.99,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",No Full audio languages,True,...,0,0,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy","Action,Adventure,Indie,Strategy"
3,Henosis™,2020-07-23,0 - 20000,0,0,5.99,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",No Full audio languages,True,...,0,0,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz..."
4,Two Weeks in Painland,2020-02-03,0 - 20000,0,0,0.0,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",No Full audio languages,True,...,8,17,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,..."


### Feature Consideration

### Dropping Duplicates

Pada bagian ini, akan diidentifikasi dan dihapus baris-baris yang memiliki nilai duplikat pada kolom `Name`. Karena setiap nama game seharusnya unik dalam dataset, baris duplikat berdasarkan nama game menunjukkan entri yang redundant dan perlu dihapus untuk memastikan kebersihan data dan mencegah bias dalam analisis atau pemodelan.

Langkah-langkahnya adalah sebagai berikut:

1.  **Memeriksa jumlah data duplikat:** Dihitung berapa banyak baris yang memiliki nilai `Name` yang sama dengan baris sebelumnya menggunakan `.duplicated()`.
2.  **Menghapus data duplikat:** Dihapus baris duplikat berdasarkan kolom `Name` dengan menggunakan `.drop_duplicates()`. Parameter `keep='first'` (default) akan mempertahankan baris pertama yang muncul dan menghapus sisanya.
3.  **Memverifikasi jumlah data setelah penghapusan:** Ditampilkan jumlah baris dalam DataFrame setelah data duplikat dihapus untuk memastikan operasi berhasil.

In [36]:
print(f'Jumlah game duplikat: {df.Name.duplicated().sum()}')
print(f'Jumlah game sebelum menghapus duplikat: {len(df)}')

df = df.drop_duplicates('Name')
print(f'Jumlah game setelah menghapus duplikat: {len(df)}')

Jumlah game duplikat: 1074
Jumlah game sebelum menghapus duplikat: 103573
Jumlah game setelah menghapus duplikat: 102499


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 102499 entries, 0 to 111451
Data columns (total 24 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   Name                      102499 non-null  object        
 1   Release date              102375 non-null  datetime64[ns]
 2   Estimated owners          102499 non-null  object        
 3   Peak CCU                  102499 non-null  int64         
 4   Required age              102499 non-null  int64         
 5   Price                     102499 non-null  float64       
 6   About the game            102499 non-null  object        
 7   Supported languages       102499 non-null  object        
 8   Full audio languages      102499 non-null  object        
 9   Windows                   102499 non-null  bool          
 10  Mac                       102499 non-null  bool          
 11  Linux                     102499 non-null  bool          
 12  User sc

### Data Selection and Conversion

Setelah proses pembersihan dan penanganan nilai yang hilang selesai, tahap selanjutnya adalah memilih fitur-fitur yang relevan untuk digunakan dalam proses pemodelan sistem rekomendasi dan mengubahnya menjadi format yang sesuai.

Pada bagian ini, dilakukan hal berikut:

1.  **Pemilihan Kolom:** Kolom-kolom spesifik dari DataFrame `df` yang dianggap penting untuk membangun profil konten setiap game dipilih. Kolom-kolom ini meliputi `Name`, `Release date`, `Required age`, `Supported languages`, `Full audio languages`, `Windows`, `Mac`, `Linux`, `Average playtime forever`, `Categories`, dan `Tags`.
2.  **Konversi ke List:** Setiap kolom yang dipilih dikonversi menjadi Python list. Ini dilakukan untuk memudahkan penggunaan data dalam struktur data yang lebih sederhana dan mudah diakses, yang akan digunakan untuk membuat DataFrame baru yang lebih ringkas dan fokus pada fitur-fitur yang dipilih.
3.  **Verifikasi Panjang List:** Panjang (jumlah elemen) dari setiap list yang baru dibuat dicetak untuk memastikan bahwa proses konversi berhasil dan semua data dari kolom yang dipilih telah dimasukkan ke dalam list masing-masing.

In [38]:
Name = df['Name'].tolist()
print(len(Name))
Release_date = df['Release date'].tolist()
print(len(Release_date))
Required_age = df['Required age'].tolist()
print(len(Required_age))
Supported_languages = df['Supported languages'].tolist()
print(len(Supported_languages))
Full_audio_languages = df['Full audio languages'].tolist()
print(len(Full_audio_languages))
Windows = df['Windows'].tolist()
print(len(Windows))
Mac = df['Mac'].tolist()
print(len(Mac))
Linux = df['Linux'].tolist()
print(len(Linux))
Average_playtime_forever = df['Average playtime forever'].tolist()
print(len(Average_playtime_forever))
Categories = df['Categories'].tolist()
print(len(Categories))
Tags = df['Tags'].tolist()
print(len(Tags))

102499
102499
102499
102499
102499
102499
102499
102499
102499
102499
102499
102499


### Dictionary Making

Pada bagian ini, DataFrame baru bernama `games_df` dibuat. DataFrame ini disusun menggunakan list-list data yang telah dipilih dan dikonversi pada langkah sebelumnya, seperti `Name`, `Release date`, `Required age`, dan seterusnya.

Tujuannya adalah untuk mengorganisir fitur-fitur game yang relevan ke dalam satu struktur DataFrame yang ringkas dan siap untuk tahap pemrosesan selanjutnya. Setiap kolom dalam DataFrame ini merepresentasikan fitur spesifik dari game.

Setelah DataFrame `games_df` dibuat, lima baris pertamanya ditampilkan untuk memberikan gambaran awal tentang struktur dan isi data yang telah digabungkan.

In [39]:
games_df = pd.DataFrame({
    'Name': Name,
    'Release date': Release_date,
    'Required age': Required_age,
    'Supported languages': Supported_languages,
    'Full audio languages': Full_audio_languages,
    'Windows': Windows,
    'Mac': Mac,
    'Linux': Linux,
    'Average playtime forever': Average_playtime_forever,
    'Categories': Categories,
    'Tags': Tags
})

display(games_df)

Unnamed: 0,Name,Release date,Required age,Supported languages,Full audio languages,Windows,Mac,Linux,Average playtime forever,Categories,Tags
0,Galactic Bowling,2008-10-21,0,['English'],No Full audio languages,True,False,False,0,"Single-player,Multi-player,Steam Achievements,...","Indie,Casual,Sports,Bowling"
1,Train Bandit,2017-10-12,0,"['English', 'French', 'Italian', 'German', 'Sp...",No Full audio languages,True,True,False,0,"Single-player,Steam Achievements,Full controll...","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc..."
2,Jolt Project,2021-11-17,0,"['English', 'Portuguese - Brazil']",No Full audio languages,True,False,False,0,Single-player,"Action,Adventure,Indie,Strategy"
3,Henosis™,2020-07-23,0,"['English', 'French', 'Italian', 'German', 'Sp...",No Full audio languages,True,True,True,0,"Single-player,Full controller support","2D Platformer,Atmospheric,Surreal,Mystery,Puzz..."
4,Two Weeks in Painland,2020-02-03,0,"['English', 'Spanish - Spain']",No Full audio languages,True,True,False,0,"Single-player,Steam Achievements","Indie,Adventure,Nudity,Violent,Sexual Content,..."
...,...,...,...,...,...,...,...,...,...,...,...
102494,Paragon Of Time,2025-04-10,0,['English'],No Full audio languages,True,False,False,0,"Single-player,Full controller support,Steam Cl...","Action Roguelike,Bullet Hell,Hack and Slash,Ro..."
102495,A Few Days With : Hazel,2025-04-11,0,"['English', 'French', 'Italian', 'German', 'Sp...",No Full audio languages,True,False,False,0,"Single-player,Steam Achievements,Steam Cloud,F...","Casual,Indie"
102496,MosGhost,2025-04-01,0,"['English', 'Russian', 'French', 'Italian', 'G...",No Full audio languages,True,False,False,0,"Single-player,Family Sharing","Simulation,Walking Simulator,Idler,First-Perso..."
102497,AccuBow VR,2025-03-11,0,['English'],['English'],True,False,False,0,"Single-player,Tracked Controller Support,VR On...","Action,Adventure,Free To Play"


#### Dictionary Reduction

Pada bagian ini, ukuran DataFrame `games_df` dikurangi. Tujuannya adalah untuk mempercepat proses eksperimen dan pengembangan model dengan bekerja pada subset data yang lebih kecil, namun tetap merepresentasikan karakteristik data asli.

Langkah-langkah yang dilakukan:

1.  **Menghitung jumlah baris total:** Diambil jumlah total baris dalam DataFrame `games_df`.
2.  **Menghitung jumlah baris yang dikurangi:** Dihitung jumlah baris target setelah pengurangan. Dalam kasus ini, diambil 20% dari jumlah baris total.
3.  **Mengambil sampel acak:** DataFrame `games_df` diubah dengan mengambil sampel acak sebanyak jumlah baris yang telah dihitung. `random_state=42` digunakan untuk memastikan hasil sampel konsisten setiap kali kode dijalankan.
4.  **Reset Indeks:** Indeks DataFrame direset untuk membuat indeks baru yang berurutan setelah pengambilan sampel.
5.  **Menampilkan informasi pengurangan:** Dicetak pesan yang menunjukkan jumlah baris setelah pengurangan untuk memverifikasi proses.

In [40]:
total_rows = len(games_df)
reduced_rows = int(total_rows * 0.20)
games_df = games_df.sample(n=reduced_rows, random_state=42).reset_index(drop=True)
print(f"Reduced dataset to {len(games_df)} rows (approximately 20%).")

Reduced dataset to 20499 rows (approximately 20%).


##### `Categories` and `Tags` Value Normalization

Pada langkah ini, dilakukan normalisasi pada nilai-nilai dalam kolom `Categories` dan `Tags` untuk menyederhanakan dan mengekstrak informasi relevan.

Untuk kolom `Categories`, sebuah fungsi diterapkan untuk mengidentifikasi keberadaan kategori spesifik seperti 'Single-player', 'Multi-player', 'Steam Achievements', 'Family Sharing', dan 'Full controller support'. Berdasarkan kategori-kategori ini, dibuat fitur baru:
-   `Player based`: Menentukan apakah game berorientasi 'Single' (hanya single-player), 'Multi' (multi-player), atau keduanya.
-   `Steam Achievements`: Menunjukkan apakah game memiliki pencapaian Steam (Boolean).
-   `Family Sharing`: Menunjukkan apakah game mendukung Family Sharing (Boolean).
-   `Full controller support`: Menunjukkan apakah game mendukung kontroler penuh (Boolean).

Untuk kolom `Tags`, sebuah fungsi diterapkan untuk mengekstrak tiga tag pertama (jika ada). Ini menghasilkan tiga fitur baru:
-   `Tag 1`: Tag pertama.
-   `Tag 2`: Tag kedua.
-   `Tag 3`: Tag ketiga.

Selanjutnya, untuk kolom `Supported languages` dan `Full audio languages`, sebuah fungsi diterapkan untuk menghitung jumlah bahasa yang terdaftar. Nilai 'No Supported languages' atau 'No Full audio languages' diperlakukan sebagai 0 bahasa. Hasilnya adalah nilai numerik yang merepresentasikan jumlah bahasa yang didukung atau bahasa audio penuh.

Setelah fitur-fitur baru dibuat, kolom asli `Categories` dan `Tags` dihapus dari DataFrame karena informasi relevan telah diekstraksi dan dinormalisasi ke dalam kolom-kolom baru. DataFrame yang diperbarui kemudian ditampilkan.

In [41]:
def process_categories_refined(categories_str):
    categories = categories_str.split(',')
    single_player = 'Single-player' in categories
    multi_player = 'Multi-player' in categories

    # Determine Player based: Multi if both are present, Single if only Single-player, None otherwise
    if single_player and multi_player:
        player_based = 'Multi'
    elif single_player:
        player_based = 'Single'
    else:
        player_based = 'Single' # Or handle cases where neither is present

    steam_achievements = 'Steam Achievements' in categories
    family_sharing = 'Family Sharing' in categories
    full_controller_support = 'Full controller support' in categories

    return player_based, steam_achievements, family_sharing, full_controller_support

# Apply the refined function to the Categories column
games_df['Player based'] = games_df['Categories'].apply(lambda x: process_categories_refined(x)[0])
games_df['Steam Achievements'] = games_df['Categories'].apply(lambda x: process_categories_refined(x)[1])
games_df['Family Sharing'] = games_df['Categories'].apply(lambda x: process_categories_refined(x)[2])
games_df['Full controller support'] = games_df['Categories'].apply(lambda x: process_categories_refined(x)[3])

# Function to process Tags
def process_tags(tags_str):
    tags = tags_str.split(',')
    tag1 = tags[0].strip() if len(tags) > 0 else None
    tag2 = tags[1].strip() if len(tags) > 1 else None
    tag3 = tags[2].strip() if len(tags) > 2 else None
    return tag1, tag2, tag3

# Apply the function to the Tags column
games_df['Tag 1'] = games_df['Tags'].apply(lambda x: process_tags(x)[0])
games_df['Tag 2'] = games_df['Tags'].apply(lambda x: process_tags(x)[1])
games_df['Tag 3'] = games_df['Tags'].apply(lambda x: process_tags(x)[2])


# Display the updated DataFrame
display(games_df.head())

Unnamed: 0,Name,Release date,Required age,Supported languages,Full audio languages,Windows,Mac,Linux,Average playtime forever,Categories,Tags,Player based,Steam Achievements,Family Sharing,Full controller support,Tag 1,Tag 2,Tag 3
0,Golful,2023-09-04,0,['English'],No Full audio languages,True,False,False,0,Single-player,"Casual,Mini Golf,Puzzle-Platformer,Pixel Graph...",Single,False,False,False,Casual,Mini Golf,Puzzle-Platformer
1,Dungeon Village 2,2023-05-14,0,"['English', 'French', 'German', 'Thai', 'Portu...",No Full audio languages,True,False,False,0,"Single-player,Steam Achievements,Full controll...","Casual,Indie,RPG,Simulation,Strategy",Single,True,False,True,Casual,Indie,RPG
2,Push the Sheep,2022-03-14,0,"['English', 'German', 'Spanish - Spain', 'Port...",No Full audio languages,True,False,False,0,"Single-player,Includes level editor","Casual,Indie",Single,False,False,False,Casual,Indie,
3,Who is the hero of this Game,2022-07-24,0,"['English', 'Russian']",No Full audio languages,True,False,False,0,"Single-player,Steam Achievements","Adventure,Simulation,Arcade,Visual Novel,2D Pl...",Single,True,False,False,Adventure,Simulation,Arcade
4,Cornflower Corbin,2017-08-17,0,['English'],No Full audio languages,True,False,False,0,"Single-player,Multi-player,Co-op,Shared/Split ...","Action,Indie,Casual,2D,Singleplayer,Multiplaye...",Multi,True,False,True,Action,Indie,Casual


In [42]:
# Function to count languages, handling the specific "No" strings
def count_languages(language_str):
    if language_str in ['No Supported languages', 'No Full audio languages']:
        return 0
    # Assuming the language strings are comma-separated (and possibly have leading/trailing spaces)
    languages = [lang.strip() for lang in language_str.split(',') if lang.strip()]
    return len(languages)

# Apply the function to the 'Supported languages' and 'Full audio languages' columns
games_df['Supported languages'] = games_df['Supported languages'].apply(count_languages)
games_df['Full audio languages'] = games_df['Full audio languages'].apply(count_languages)

# Display the updated DataFrame with the new count columns
display(games_df.head())

# You can also display the info to see the new columns
games_df.info()

Unnamed: 0,Name,Release date,Required age,Supported languages,Full audio languages,Windows,Mac,Linux,Average playtime forever,Categories,Tags,Player based,Steam Achievements,Family Sharing,Full controller support,Tag 1,Tag 2,Tag 3
0,Golful,2023-09-04,0,1,0,True,False,False,0,Single-player,"Casual,Mini Golf,Puzzle-Platformer,Pixel Graph...",Single,False,False,False,Casual,Mini Golf,Puzzle-Platformer
1,Dungeon Village 2,2023-05-14,0,10,0,True,False,False,0,"Single-player,Steam Achievements,Full controll...","Casual,Indie,RPG,Simulation,Strategy",Single,True,False,True,Casual,Indie,RPG
2,Push the Sheep,2022-03-14,0,5,0,True,False,False,0,"Single-player,Includes level editor","Casual,Indie",Single,False,False,False,Casual,Indie,
3,Who is the hero of this Game,2022-07-24,0,2,0,True,False,False,0,"Single-player,Steam Achievements","Adventure,Simulation,Arcade,Visual Novel,2D Pl...",Single,True,False,False,Adventure,Simulation,Arcade
4,Cornflower Corbin,2017-08-17,0,1,0,True,False,False,0,"Single-player,Multi-player,Co-op,Shared/Split ...","Action,Indie,Casual,2D,Singleplayer,Multiplaye...",Multi,True,False,True,Action,Indie,Casual


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20499 entries, 0 to 20498
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Name                      20499 non-null  object        
 1   Release date              20466 non-null  datetime64[ns]
 2   Required age              20499 non-null  int64         
 3   Supported languages       20499 non-null  int64         
 4   Full audio languages      20499 non-null  int64         
 5   Windows                   20499 non-null  bool          
 6   Mac                       20499 non-null  bool          
 7   Linux                     20499 non-null  bool          
 8   Average playtime forever  20499 non-null  int64         
 9   Categories                20499 non-null  object        
 10  Tags                      20499 non-null  object        
 11  Player based              20499 non-null  object        
 12  Steam Achievements

Drop kolom utama

In [43]:
games_df = games_df.drop(columns=['Categories', 'Tags'])
display(games_df.head())

Unnamed: 0,Name,Release date,Required age,Supported languages,Full audio languages,Windows,Mac,Linux,Average playtime forever,Player based,Steam Achievements,Family Sharing,Full controller support,Tag 1,Tag 2,Tag 3
0,Golful,2023-09-04,0,1,0,True,False,False,0,Single,False,False,False,Casual,Mini Golf,Puzzle-Platformer
1,Dungeon Village 2,2023-05-14,0,10,0,True,False,False,0,Single,True,False,True,Casual,Indie,RPG
2,Push the Sheep,2022-03-14,0,5,0,True,False,False,0,Single,False,False,False,Casual,Indie,
3,Who is the hero of this Game,2022-07-24,0,2,0,True,False,False,0,Single,True,False,False,Adventure,Simulation,Arcade
4,Cornflower Corbin,2017-08-17,0,1,0,True,False,False,0,Multi,True,False,True,Action,Indie,Casual


In [44]:
display(games_df)

Unnamed: 0,Name,Release date,Required age,Supported languages,Full audio languages,Windows,Mac,Linux,Average playtime forever,Player based,Steam Achievements,Family Sharing,Full controller support,Tag 1,Tag 2,Tag 3
0,Golful,2023-09-04,0,1,0,True,False,False,0,Single,False,False,False,Casual,Mini Golf,Puzzle-Platformer
1,Dungeon Village 2,2023-05-14,0,10,0,True,False,False,0,Single,True,False,True,Casual,Indie,RPG
2,Push the Sheep,2022-03-14,0,5,0,True,False,False,0,Single,False,False,False,Casual,Indie,
3,Who is the hero of this Game,2022-07-24,0,2,0,True,False,False,0,Single,True,False,False,Adventure,Simulation,Arcade
4,Cornflower Corbin,2017-08-17,0,1,0,True,False,False,0,Multi,True,False,True,Action,Indie,Casual
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20494,Mondrian 99,2023-05-25,0,7,0,True,False,False,0,Single,False,False,True,Casual,Indie,
20495,Heaven's Ladder,2024-08-30,0,103,0,True,False,False,0,Single,False,False,False,Adventure,Casual,Strategy
20496,oOo: Ascension,2018-09-26,0,29,0,True,True,True,0,Multi,True,False,True,Racing,Action,Indie
20497,Ultimate Challenge,2023-09-27,0,10,2,True,False,False,0,Single,False,False,False,Action,Adventure,Casual


### Data Randomizing

Pada bagian ini, data dalam DataFrame `games_df` diacak secara acak. Tujuannya adalah untuk memastikan bahwa urutan data tidak bias dan siap untuk tahap pembagian data menjadi set pelatihan dan validasi pada langkah selanjutnya. Pengacakan ini penting untuk menghindari model belajar dari urutan data tertentu.

Langkah-langkah yang dilakukan:

1.  **Mengacak DataFrame:** DataFrame `games_df` diacak menggunakan `.sample(frac=1)`. Parameter `frac=1` memastikan bahwa seluruh data diambil sebagai sampel. `random_state=42` digunakan untuk membuat hasil pengacakan dapat direproduksi.
2.  **Memisahkan Fitur dan Target:** Kolom `Name` dipisahkan sebagai variabel target (`y`), dan semua kolom lainnya dianggap sebagai fitur (`X`).
3.  **Membagi Data Latih dan Validasi:** Data `X` dan `y` dibagi menjadi set pelatihan (80%) dan set validasi (20%) berdasarkan indeks baris yang telah diacak.
4.  **Menampilkan Bentuk Data:** Bentuk (jumlah baris dan kolom) dari DataFrame `X`, `y`, `X_train`, `X_val`, `y_train`, dan `y_val` dicetak untuk memverifikasi hasil pembagian data.

In [45]:
# Mengacak dataset
games_df = games_df.sample(frac=1, random_state=42)
games_df

Unnamed: 0,Name,Release date,Required age,Supported languages,Full audio languages,Windows,Mac,Linux,Average playtime forever,Player based,Steam Achievements,Family Sharing,Full controller support,Tag 1,Tag 2,Tag 3
16109,Angels Fall First,2015-10-01,0,1,1,True,False,False,127,Multi,False,False,False,Action,Indie,FPS
2457,Slime Farm,2024-10-22,0,9,0,True,False,False,0,Single,False,True,False,Creature Collector,Management,Farming Sim
1077,SPORE™,2008-12-19,0,15,0,True,False,False,813,Single,False,False,False,God Game,Open World,Exploration
11386,Cow Girls,2021-12-16,0,13,0,True,False,False,0,Single,False,False,False,Sexual Content,Nudity,Mature
15535,Rabi-Ribi,2016-01-28,0,8,0,True,False,False,4001,Single,True,False,True,Anime,Metroidvania,Bullet Hell
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11284,Mahjick - The Realm Taker,2023-09-19,0,3,0,True,False,False,0,Single,True,False,False,Mahjong,Singleplayer,Adventure
11964,Animal Lover,2017-02-14,0,1,0,True,True,False,401,Single,True,False,False,Indie,Casual,Simulation
5390,Sweet Dreams,2024-09-12,0,6,3,True,False,False,0,Single,True,True,False,Horror,Psychological Horror,Survival Horror
860,Alien Invasion 3d,2018-03-21,0,1,0,True,False,False,0,Single,False,False,False,Action,Indie,


In [46]:
# Mengambil semua kolom kecuali 'Name' sebagai fitur (X)
X = games_df.drop('Name', axis=1)
# Mengambil kolom 'Name' sebagai target (y)
y = games_df['Name']

# Membagi data menjadi 80% data latih dan 20% data validasi
train_indices = int(0.8 * games_df.shape[0])

X_train, X_val = (
    X[:train_indices],
    X[train_indices:]
)

y_train, y_val = (
    y[:train_indices],
    y[train_indices:]
)

print("Bentuk data X (fitur):", X.shape)
print("Bentuk data y (target):", y.shape)
print("Bentuk data X_train:", X_train.shape)
print("Bentuk data X_val:", X_val.shape)
print("Bentuk data y_train:", y_train.shape)
print("Bentuk data y_val:", y_val.shape)

Bentuk data X (fitur): (20499, 15)
Bentuk data y (target): (20499,)
Bentuk data X_train: (16399, 15)
Bentuk data X_val: (4100, 15)
Bentuk data y_train: (16399,)
Bentuk data y_val: (4100,)


In [47]:
print(X,y)

      Release date  Required age  Supported languages  Full audio languages  \
16109   2015-10-01             0                    1                     1   
2457    2024-10-22             0                    9                     0   
1077    2008-12-19             0                   15                     0   
11386   2021-12-16             0                   13                     0   
15535   2016-01-28             0                    8                     0   
...            ...           ...                  ...                   ...   
11284   2023-09-19             0                    3                     0   
11964   2017-02-14             0                    1                     0   
5390    2024-09-12             0                    6                     3   
860     2018-03-21             0                    1                     0   
15795   2024-01-22             0                   12                     0   

       Windows    Mac  Linux  Average playtime fore

## Model Development using Content-Based Filtering

Pada tahap ini, model menghitung skor kecocokan antar game berdasarkan kesamaan kontennya (kategori, tag, dll.). Pertama, kita melakukan proses vektorisasi (mengubah teks menjadi angka) pada fitur-fitur game. Selanjutnya, hitung kesamaan kosinus antar vektor game. Hasilnya adalah daftar game yang paling mirip dengan game yang diberikan sebagai masukan.

In [48]:
# Menggabungkan fitur teks yang relevan untuk Content-Based Filtering
games_df['features'] = games_df[['Release date', 'Required age', 'Supported languages',
                                 'Full audio languages', 'Windows', 'Mac', 'Linux',
                                 'Average playtime forever', 'Player based',
                                 'Steam Achievements', 'Family Sharing',
                                 'Full controller support', 'Tag 1', 'Tag 2', 'Tag 3']].astype(str).agg(' '.join, axis=1)

In [49]:
# Inisialisasi TfidfVectorizer
# Menghapus stop words umum bisa membantu, tetapi tergantung pada fitur teks yang digunakan
tfidf = TfidfVectorizer(stop_words='english')

# Melakukan fit dan transform pada fitur yang digabungkan
tfidf_matrix = tfidf.fit_transform(games_df['features'])

print("Bentuk matriks TF-IDF:", tfidf_matrix.shape)

Bentuk matriks TF-IDF: (20499, 1434)


In [50]:
# Menghitung kemiripan kosinus antar game
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print("Bentuk matriks kemiripan kosinus:", cosine_sim.shape)

# Membuat mapping dari nama game ke indeksnya dalam matriks kemiripan
indices = pd.Series(games_df.index, index=games_df['Name']).drop_duplicates()

Bentuk matriks kemiripan kosinus: (20499, 20499)


In [51]:
# Fungsi untuk mendapatkan rekomendasi berdasarkan nama game
def get_content_based_recommendations(game_name, cosine_sim=cosine_sim, games_df=games_df, indices=indices):
    # Mendapatkan indeks dari game yang namanya cocok
    idx = indices[game_name]

    # Mendapatkan skor kemiripan dari semua game dengan game ini
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Mengurutkan game berdasarkan skor kemiripan secara menurun
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Mengambil skor dari 10 game paling mirip (kecuali game itu sendiri)
    sim_scores = sim_scores[1:11]

    # Mendapatkan indeks game
    game_indices = [i[0] for i in sim_scores]

    # Mengembalikan nama game yang paling mirip
    return games_df['Name'].iloc[game_indices]

In [61]:
# Contoh penggunaan: Mendapatkan rekomendasi untuk game tertentu
game_to_recommend = 'MotoGP™15 Compact'

In [62]:
start_time = time.time()
recommended_games = get_content_based_recommendations(game_to_recommend)
end_time = time.time()

if game_to_recommend in games_df['Name'].values:
    print(f"Rekomendasi game yang mirip dengan '{game_to_recommend}':")
    print(get_content_based_recommendations(game_to_recommend))
else:
    print(f"Game '{game_to_recommend}' tidak ditemukan dalam dataset.")

matching_time = end_time - start_time

print(f"Time taken for content-based matching: {matching_time:.4f} seconds")

Rekomendasi game yang mirip dengan 'MotoGP™15 Compact':
4471          Twinkle Hunter
5841                Kanamono
7971           Schildmaid MX
3914          Rick Henderson
6556          Full Moon Rush
19548             PopSlinger
6641             Beast Modon
13135      FORTRESS DEFENDER
2519                Top Gang
60       Super Destronaut DX
Name: Name, dtype: object
Time taken for content-based matching: 0.0540 seconds


## Model Development using Deep Content Filtering

Meskipun dataset ini tidak memiliki data interaksi pengguna (seperti rating atau playtime per user) yang ideal untuk model seperti RecommenderNet, kita bisa mengadaptasi idenya.

RecommenderNet biasanya menggabungkan embedding dari pengguna dan item. Karena tidak ada pengguna, kita bisa mencoba membuat embedding hanya untuk item (game) berdasarkan fitur kontennya.

Ide:
1. Buat layer embedding untuk setiap fitur kategorikal (misalnya, Player based, Tag 1, Tag 2, Tag 3).
2. Buat layer input untuk fitur numerik (misalnya, Required age, Supported languages, dll.).
3. Gabungkan embedding dan input numerik.
4. Lewatkan melalui beberapa dense layer.
5. Output layer bisa berupa embedding vektor game. Kemiripan antar game kemudian dihitung dari embedding ini.

In [54]:
# Fitur kategorikal yang akan di-encode
categorical_features = ['Player based', 'Steam Achievements', 'Family Sharing',
                        'Full controller support', 'Tag 1', 'Tag 2', 'Tag 3',
                        'Windows', 'Mac', 'Linux'] # Include platform features

# Salinan data untuk encoding
games_encoded = games_df.copy()

# Dictionary untuk menyimpan encoder
label_encoders = {}

for col in categorical_features:
    le = LabelEncoder()
    # Handle potential NaN values that might appear after drops/preprocessing
    games_encoded[col] = games_encoded[col].astype(str) # Ensure all are strings before encoding
    games_encoded[col] = le.fit_transform(games_encoded[col])
    label_encoders[col] = le # Store the encoder

In [55]:
# Fitur numerik
numerical_features = ['Required age', 'Supported languages', 'Full audio languages', 'Average playtime forever']

# Normalisasi fitur numerik (Opsional tapi direkomendasikan)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
games_encoded[numerical_features] = scaler.fit_transform(games_encoded[numerical_features])

In [56]:
# Buat input layers
input_layers = []
embedding_layers = []
dense_inputs = []

# Input dan embedding untuk fitur kategorikal
for col in categorical_features:
    num_unique_values = len(label_encoders[col].classes_)
    embedding_dim = max(2, min(50, num_unique_values // 2)) # Heuristik sederhana untuk dimensi embedding

    input_layer = keras.Input(shape=(1,), name=f'input_{col}')
    # Modify the layer name to replace spaces with underscores
    cleaned_col_name = col.replace(" ", "_")
    embedding_layer = layers.Embedding(input_dim=num_unique_values, output_dim=embedding_dim, name=f'embedding_{cleaned_col_name}')(input_layer)
    flatten_layer = layers.Flatten()(embedding_layer)

    input_layers.append(input_layer)
    embedding_layers.append(flatten_layer)

# Input untuk fitur numerik
for col in numerical_features:
    input_layer = keras.Input(shape=(1,), name=f'input_{col}')
    input_layers.append(input_layer)
    dense_inputs.append(input_layer)

# Gabungkan semua embedding dan input numerik
# Pastikan embedding_layers dan dense_inputs tidak kosong
if embedding_layers and dense_inputs:
    concat_layer = layers.concatenate(embedding_layers + dense_inputs)
elif embedding_layers:
     concat_layer = layers.concatenate(embedding_layers)
elif dense_inputs:
    concat_layer = layers.concatenate(dense_inputs)
else:
    raise ValueError("Tidak ada fitur yang tersedia untuk model.")

# Dense layers (arsitektur mirip RecommenderNet)
x = layers.Dense(128, activation='relu')(concat_layer)
x = layers.Dropout(0.2)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(0.2)(x)

# Output layer: Representasi embedding game (misalnya, 32 dimensi)
game_embedding = layers.Dense(32, activation='linear', name='game_embedding')(x) # Linear activation for embedding

# Model
recommender_model = keras.Model(inputs=input_layers, outputs=game_embedding)

recommender_model.summary()

In [57]:
# Mendapatkan embedding untuk semua game
# Siapkan input data dalam format yang sesuai untuk model
model_inputs = {}
for col in categorical_features + numerical_features:
     # Ensure correct data type for model input
    if col in categorical_features:
        model_inputs[f'input_{col}'] = games_encoded[col].values
    else: # numerical_features
        model_inputs[f'input_{col}'] = games_encoded[col].values

# Ubah dictionary input menjadi list sesuai urutan input_layers di model
ordered_model_inputs = [model_inputs[input_layer.name] for input_layer in recommender_model.inputs]


game_embeddings = recommender_model.predict(ordered_model_inputs)

print("Bentuk matriks embedding game:", game_embeddings.shape)

[1m641/641[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step
Bentuk matriks embedding game: (20499, 32)


In [58]:
# Menghitung kemiripan kosinus antar embedding game
cosine_sim_deep = cosine_similarity(game_embeddings, game_embeddings)

print("Bentuk matriks kemiripan kosinus (Deep):", cosine_sim_deep.shape)

Bentuk matriks kemiripan kosinus (Deep): (20499, 20499)


In [59]:
# Fungsi untuk mendapatkan rekomendasi menggunakan embedding dari model "Deep"
def get_deep_content_based_recommendations(game_name, cosine_sim=cosine_sim_deep, games_df=games_df, indices=indices):
    # Mendapatkan indeks dari game yang namanya cocok
    # Pastikan indeks Series menggunakan index dari games_df, bukan games_encoded
    idx = indices[game_name]

    # Mendapatkan skor kemiripan dari semua game dengan game ini
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Mengurutkan game berdasarkan skor kemiripan secara menurun
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Mengambil skor dari 10 game paling mirip (kecuali game itu sendiri)
    sim_scores = sim_scores[1:11]

    # Mendapatkan indeks game
    game_indices = [i[0] for i in sim_scores]

    # Mengembalikan nama game yang paling mirip
    return games_df['Name'].iloc[game_indices]

In [63]:
# Contoh penggunaan: Mendapatkan rekomendasi menggunakan embedding dari model "Deep"
if game_to_recommend in games_df['Name'].values:
    print(f"\nRekomendasi game yang mirip dengan '{game_to_recommend}' menggunakan Deep Content-Based:")
    print(get_deep_content_based_recommendations(game_to_recommend))
else:
    print(f"Game '{game_to_recommend}' tidak ditemukan dalam dataset.")


Rekomendasi game yang mirip dengan 'MotoGP™15 Compact' menggunakan Deep Content-Based:
16661                         CHKN
7392                     Love, Sam
555      Eternal Hour: Golden Hour
5841                      Kanamono
12398     Showdown at Willow Creek
17590         Creatures Such as We
4762                    Oceanarium
8100        Sugy the Christmas elf
4566                     Gelldonia
11421                     dead run
Name: Name, dtype: object
