# Game Recommendation System



## Import Libraries and Data Collection

In [1]:
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("thedevastator/video-game-sales-and-ratings")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/thedevastator/video-game-sales-and-ratings?dataset_version_number=2...


100%|██████████| 540k/540k [00:00<00:00, 115MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/thedevastator/video-game-sales-and-ratings/versions/2





## Data Understanding

In [3]:
# Build the CSV location from the path returned by kagglehub instead of hard-coding the cache directory
df = pd.read_csv(f'{path}/Video_Games.csv')
df.head()

Unnamed: 0,index,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


The rows above are a few sample records from the dataframe.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16928 entries, 0 to 16927
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   index            16928 non-null  int64  
 1   Name             16926 non-null  object 
 2   Platform         16928 non-null  object 
 3   Year_of_Release  16655 non-null  float64
 4   Genre            16926 non-null  object 
 5   Publisher        16873 non-null  object 
 6   NA_Sales         16928 non-null  float64
 7   EU_Sales         16928 non-null  float64
 8   JP_Sales         16928 non-null  float64
 9   Other_Sales      16928 non-null  float64
 10  Global_Sales     16928 non-null  float64
 11  Critic_Score     8260 non-null   float64
 12  Critic_Count     8260 non-null   float64
 13  User_Score       10159 non-null  object 
 14  User_Count       7718 non-null   float64
 15  Developer        10240 non-null  object 
 16  Rating           10092 non-null  object 
dtypes: float64(9

In [None]:
print('Number of genres: ', len(df.Genre.unique()))
print('Genres: ', df.Genre.unique())

Number of genres:  13
Genres:  ['Sports' 'Platform' 'Racing' 'Role-Playing' 'Puzzle' 'Misc' 'Shooter'
 'Simulation' 'Action' 'Fighting' 'Adventure' 'Strategy' nan]


The dataframe has 16928 rows and 17 columns. The `Genre` column holds 12 distinct genres plus missing values (`nan`), which is why `unique()` reports 13 entries.

- **1 column of type int64**
  - `index`: Unique identifier for each entry in the dataset.

- **9 columns of type float64**
  - `Year_of_Release`: Release year of the game.
  - `NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`, `Global_Sales`: Sales figures for the game in each region.
  - `Critic_Score`, `Critic_Count`, `User_Count`: Critic score, number of critic reviews, and number of user reviews.

- **7 columns of type object**
  - `Name`: Title of the game.
  - `Platform`: Console or device the game runs on.
  - `Genre`: Category of the game.
  - `Publisher`: Name of the game's publisher.
  - `User_Score`: Score given by users.
  - `Developer`: Name of the game's developer.
  - `Rating`: Age rating assigned by a rating board.











## Data Preparation

In [None]:
df.duplicated('Name').sum()

5365

There are 5365 rows whose game name repeats an earlier row. To keep things simple, we drop the duplicates.

In [4]:
df.drop_duplicates(subset='Name', inplace=True)

In [None]:
df.duplicated('Name').sum()

0

After dropping them, no two rows share the same game name.
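As a minimal sketch of what this deduplication does, the toy dataframe below (made-up rows, not taken from the dataset) shows that `drop_duplicates(subset=...)` keeps the first occurrence of each name and discards the later ones, including any differing values in other columns.

```python
import pandas as pd

# Toy data: "Game A" appears twice, on two different platforms (illustrative values)
toy = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B"],
    "Platform": ["PS3", "X360", "Wii"],
    "Genre": ["Action", "Action", "Sports"],
})

# keep="first" is the default: the first row per Name survives
deduped = toy.drop_duplicates(subset="Name")
print(deduped)
```

The X360 row for "Game A" is gone after this step, so only one platform per title remains.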

In [5]:
df.isnull().sum()

Unnamed: 0,0
index,0
Name,1
Platform,0
Year_of_Release,164
Genre,1
Publisher,48
NA_Sales,0
EU_Sales,0
JP_Sales,0
Other_Sales,0


In [7]:
total_missing = df.isnull().any(axis=1).sum()
print(total_missing)

7444


There are still 7444 rows with at least one missing value. We drop all of them because they are not needed.

In [None]:
df.dropna(inplace=True)

In [None]:
df.isnull().sum()

Unnamed: 0,0
index,0
Name,0
Platform,0
Year_of_Release,0
Genre,0
Publisher,0
NA_Sales,0
EU_Sales,0
JP_Sales,0
Other_Sales,0


After dropping them, no missing values remain.
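The sketch below illustrates, on made-up rows, how `isnull().any(axis=1)` counts rows with at least one missing value and how a plain `dropna()` removes every such row. The `subset` variant is shown only as a hypothetical alternative, assuming that just `Name` and `Genre` matter downstream; it is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in different columns (illustrative values)
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", None, "Game D"],
    "Genre": ["Action", "Sports", "Racing", "Puzzle"],
    "Critic_Score": [80.0, np.nan, 70.0, np.nan],
})

# Number of rows that contain at least one missing value
rows_with_any_nan = toy.isnull().any(axis=1).sum()

# Strategy used in the notebook: drop a row if any column is missing
strict = toy.dropna()

# Hypothetical alternative: only require the columns used later
lenient = toy.dropna(subset=["Name", "Genre"])

print(rows_with_any_nan, len(strict), len(lenient))
```

The strict strategy is simpler but discards games that only lack review scores, which is worth keeping in mind when reading the row counts that follow.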

In [None]:
game_name=df['Name'].to_list()
game_genre=df['Genre'].to_list()
print(len(game_name))
print(len(game_genre))

4119
4119


We convert the Series to Python lists with to_list(). There are 4119 games left.

In [None]:
game_new = pd.DataFrame({
    'name': game_name,
    'genre': game_genre})
game_new

Unnamed: 0,name,genre
0,Wii Sports,Sports
1,Mario Kart Wii,Racing
2,Wii Sports Resort,Sports
3,New Super Mario Bros.,Platform
4,Wii Play,Misc
...,...,...
4114,Hospital Tycoon,Strategy
4115,Fast Racing Neo,Action
4116,E.T. The Extra-Terrestrial,Action
4117,Breach,Shooter


We build a new dataframe from a dictionary that maps the keys 'name' and 'genre' to the game_name and game_genre lists prepared above.

## Model Development

In [None]:
data = game_new
data.sample(5)

Unnamed: 0,name,genre
1818,Eternal Darkness: Sanity's Requiem,Adventure
504,Kingdom Hearts: Chain of Memories,Role-Playing
3171,Growlanser Generations,Role-Playing
3327,Little League World Series Baseball 2008,Sports
72,Big Brain Academy,Misc


We assign the dataframe from the previous step to the variable data and check a random sample of it.

In [None]:
# Initialize the TfidfVectorizer
tf = TfidfVectorizer()

# Compute the IDF values on the genre data
tf.fit(data['genre'])

# Map the integer feature indices to feature names
tf.get_feature_names_out()

array(['action', 'adventure', 'fighting', 'misc', 'platform', 'playing',
       'puzzle', 'racing', 'role', 'shooter', 'simulation', 'sports',
       'strategy'], dtype=object)

We use TfidfVectorizer from scikit-learn. Besides vectorizing, it also tokenizes the input, so no separate tokenization step is needed. Note that its default tokenizer splits on the hyphen, which is why 'Role-Playing' appears above as the two tokens 'role' and 'playing'.

In [None]:
# Fit, then transform the genres into a matrix
tfidf_matrix = tf.fit_transform(data['genre'])

# Check the shape of the TF-IDF matrix
tfidf_matrix.shape

(4119, 13)

TF-IDF is used in the recommendation system to build a feature representation of the genres in the data. The features are ['action', 'adventure', 'fighting', 'misc', 'platform', 'playing', 'puzzle', 'racing', 'role', 'shooter', 'simulation', 'sports', 'strategy']. The matrix has shape (4119, 13): 4119 is the number of games and 13 is the number of genre tokens. The 12 genres yield 13 tokens because 'Role-Playing' is split into 'role' and 'playing'.

In [None]:
tfidf_matrix.todense()

matrix([[0., 0., 0., ..., 0., 1., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 1., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 1.]])

To view the TF-IDF vectors as a dense matrix, we call todense().

In [None]:
# Build a dataframe to inspect the TF-IDF matrix
# Columns are the genre tokens
# Rows are the game names

pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tf.get_feature_names_out(),
    index=data.name
).sample(13, axis=1).sample(10, axis=0)

name,fighting,puzzle,playing,sports,strategy,adventure,misc,platform,shooter,racing,action,simulation,role
Meteos: Disney Magic,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Dynasty Warriors 3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Land of the Dead: Road to Fiddler's Green,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
NFL Blitz 20-02,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MTV Music Generator 2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
Sky Odyssey,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
Rugby World Cup 2011,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Tactics Ogre: The Knight of Lodis,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107
Ys: The Oath in Felghana,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107
The Wizard of Oz: Beyond The Yellow Brick Road,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107


From the TF-IDF matrix we can read off how each game relates to each genre token. Every value in the table is the TF-IDF weight of a token for a game. Because TfidfVectorizer L2-normalizes each row by default, a game whose genre is a single token gets a weight of 1.0 for that token, while a game whose genre is split into two tokens gets 0.707107 (that is, 1/√2) for each of them.

Patterns visible in the sample:

Games with a weight of **1.0** on one token have a single-word genre:
- **Puzzle**: *Meteos: Disney Magic*
- **Action**: *Dynasty Warriors 3*
- **Shooter**: *Land of the Dead: Road to Fiddler's Green*
- **Sports**: *NFL Blitz 20-02*, *Rugby World Cup 2011*
- **Misc**: *MTV Music Generator 2*
- **Simulation**: *Sky Odyssey*

Games with a weight of **0.707107** on both `role` and `playing` belong to the single genre **Role-Playing**, which the tokenizer split into two tokens. The lower weight is an effect of normalization and does not indicate a second genre:
- *Tactics Ogre: The Knight of Lodis*
- *Ys: The Oath in Felghana*
- *The Wizard of Oz: Beyond The Yellow Brick Road*

## Cosine Similarity

Having mapped each game to its genre tokens, we now compute the degree of similarity between games using cosine similarity.

In [None]:
# Compute cosine similarity on the TF-IDF matrix
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim

array([[1., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

The result is a square matrix in which entry (i, j) is the cosine similarity between game i and game j.
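As a minimal sketch of what cosine similarity measures, the toy vectors below (illustrative values, not rows from the dataset) are compared both with scikit-learn and with the textbook formula cos(a, b) = a·b / (‖a‖ ‖b‖). The helper `cosine_manual` is a hypothetical name introduced only for this check.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy genre vectors; columns stand for (racing, sports, role, playing)
vectors = np.array([
    [0.0, 1.0, 0.0, 0.0],               # a Sports game
    [0.0, 1.0, 0.0, 0.0],               # another Sports game
    [1.0, 0.0, 0.0, 0.0],               # a Racing game
    [0.0, 0.0, 0.7071068, 0.7071068],   # a Role-Playing game
])

# Pairwise similarity between all rows
sim = cosine_similarity(vectors)

def cosine_manual(a, b):
    """Cosine similarity of two vectors from the dot-product formula."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(sim.round(3))
print(cosine_manual(vectors[0], vectors[1]), cosine_manual(vectors[0], vectors[2]))
```

Games with identical genre vectors score 1.0, games whose vectors share no token score 0.0, and the matrix is symmetric with ones on the diagonal.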

In [None]:
# Build a dataframe from cosine_sim with game names as both rows and columns
cosine_sim_df = pd.DataFrame(cosine_sim, index=data['name'], columns=data['name'])
print('Shape:', cosine_sim_df.shape)

# Inspect a sample of the similarity matrix
cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (4119, 4119)


name,Resident Evil,Way of the Samurai 2,F1 2002,Company of Heroes: Opposing Fronts,Darksiders: Warmastered Edition
Spyro Orange: The Cortex Conspiracy,1.0,1.0,0.0,0.0,1.0
Rengoku II: The Stairway To H.E.A.V.E.N.,1.0,1.0,0.0,0.0,1.0
Way of the Samurai,1.0,1.0,0.0,0.0,1.0
UFC Undisputed 3,1.0,1.0,0.0,0.0,1.0
Frogger's Adventures: The Rescue,0.0,0.0,0.0,0.0,0.0
Spyro: Enter the Dragonfly,0.0,0.0,0.0,0.0,0.0
Lost in Blue: Shipwrecked,0.0,0.0,0.0,0.0,0.0
NBA Live 2003,0.0,0.0,0.0,0.0,0.0
Beaterator,0.0,0.0,0.0,0.0,0.0
Zone of the Enders: The Fist of Mars,0.0,0.0,0.0,0.0,0.0


Several patterns can be read from this sample of the cosine similarity matrix. Spyro Orange: The Cortex Conspiracy, Rengoku II: The Stairway To H.E.A.V.E.N., Way of the Samurai, and UFC Undisputed 3 have a similarity of 1.0 with Resident Evil, Way of the Samurai 2, and Darksiders: Warmastered Edition, and 0.0 with F1 2002 and Company of Heroes: Opposing Fronts. Because genre is the only feature, a score of 1.0 means two games carry the same genre label and 0.0 means their labels share no token. The score says nothing about gameplay mechanics or narrative.

The remaining rows (Frogger's Adventures: The Rescue, Spyro: Enter the Dragonfly, Lost in Blue: Shipwrecked, NBA Live 2003, Beaterator, and Zone of the Enders: The Fist of Mars) score 0.0 against all five sampled columns. This only shows that their genres differ from those of the five sampled games, not that they are dissimilar to every other game in the dataset. In short, cosine similarity on these vectors groups games by genre.

In [None]:
def game_recommendations(nama_game, similarity_data=cosine_sim_df, items=data[['name', 'genre']], k=10):
    """
    Recommend games based on the similarity dataframe.

    Parameters
    ---
    nama_game : str
                Name of the game (a label in the similarity dataframe)
    similarity_data : pd.DataFrame
                      Symmetric similarity dataframe with games as both
                      index and columns
    items : pd.DataFrame
            Contains the name and the other features used to define similarity
    k : int
        Number of recommendations to return
    ---


    For the given game, we take the k entries with the largest similarity
    values in its column of the similarity matrix.
    """


    # Use argpartition to partition the similarity values indirectly (it returns indices)
    # The dataframe column is first converted to a NumPy array
    # range(start, stop, step)
    index = similarity_data.loc[:,nama_game].to_numpy().argpartition(
        range(-1, -k, -1))

    # Take the entries with the largest similarity from the partitioned index
    closest = similarity_data.columns[index[-1:-(k+2):-1]]

    # Drop nama_game so the queried game does not appear in its own recommendations
    closest = closest.drop(nama_game, errors='ignore')

    return pd.DataFrame(closest).merge(items).head(k)

Here we define the game_recommendations function with the following parameters:

*   nama_game : Name of the game (a label in the similarity dataframe).
*   similarity_data : The similarity dataframe defined earlier.
*   items : The name and the features used to define similarity, in this case 'name' and 'genre'.
*   k : Number of recommendations to return. The default is 10.
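The selection step inside game_recommendations relies on NumPy's argpartition. The sketch below shows the same idea on a toy score array. `top_k_indices` is a hypothetical helper, not part of the notebook, and it sorts the k candidates explicitly instead of passing a range of positions to argpartition.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, highest first."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    # argpartition places the k largest values in the last k slots, in no particular order
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by score, descending
    order = np.argsort(scores[candidates])[::-1]
    return candidates[order]

# Toy similarity scores for five items (illustrative values)
scores = [0.1, 0.9, 0.3, 0.7, 0.5]
best = top_k_indices(scores, 3)
print(best)
```

Partitioning avoids sorting the whole array, which matters little for a few thousand games but is the reason argpartition is preferred over a full sort here. When many scores are tied, which of the tied items land in the top k is not guaranteed.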




## Getting Recommendations

In [None]:
data[data.name.eq('Mario Kart Wii')]

Unnamed: 0,name,genre
1,Mario Kart Wii,Racing


Let's look up recommendations for Mario Kart Wii.

In [None]:
# Get games similar to Mario Kart Wii
game_recommendations('Mario Kart Wii')

Unnamed: 0,name,genre
0,GRID 2,Racing
1,Pro Race Driver,Racing
2,RalliSport Challenge 2,Racing
3,Colin McRae Rally 2005,Racing
4,Test Drive Unlimited,Racing
5,Pure,Racing
6,MotoGP,Racing
7,Rock 'N Roll Racing,Racing
8,F1 Career Challenge,Racing
9,Midnight Club: Los Angeles,Racing


We get 10 recommended games similar to Mario Kart Wii, all in the Racing genre. Because genre is the only feature, every one of them ties at a similarity of 1.0 with Mario Kart Wii.