# Game Recommendation System Using Content-Based Filtering

# Business Questions
1. How can games be recommended to users based on a shared genre preference?
2. How can the genre-based game recommendation system be evaluated with accuracy-oriented metrics?

# Dataset Link
Kaggle: https://www.kaggle.com/datasets/uuratl/metacritic-games-12-23-2024

# Library Setup

In [None]:
import pandas as pd
import numpy as np
import kagglehub
import re
from kagglehub import KaggleDatasetAdapter
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Data Understanding

In [None]:
# Set the path to the file you'd like to load
file_path = "12_23_2024_detailed_data.csv"

df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "uuratl/metacritic-games-12-23-2024",
  file_path,
)
df.head()


Unnamed: 0.1,Unnamed: 0,name,metacritic_review_count,metacritic_review_score,user_review_count,user_review_score,developer,publisher,platforms,genres,esrb,must_play
0,0,The Legend of Zelda: Ocarina of Time,Based on 22 Critic Reviews,99,"Based on 10,273 User Ratings",9.1,Developer: Nintendo,Publisher: Nintendo,['Nintendo 64'],['Open-World Action'],Rated E for Everyone,1
1,1,SoulCalibur,Based on 24 Critic Reviews,98,Based on 574 User Ratings,7.8,Developer: Namco,Publisher: Namco,"['Dreamcast', 'iOS (iPhone/iPad)', 'Xbox 360']",['3D Fighting'],Rated T for Teen,1
2,2,Grand Theft Auto IV,Based on 86 Critic Reviews,98,"Based on 4,462 User Ratings",8.3,Developer: Rockstar North,Publisher: Rockstar Games,"['PlayStation 3', 'Xbox 360', 'PC']",['Open-World Action'],Rated M for Mature,1
3,3,Super Mario Galaxy,Based on 73 Critic Reviews,97,"Based on 4,027 User Ratings",9.1,Developer: Nintendo,Publisher: Nintendo,['Wii'],['3D Platformer'],Rated E for Everyone,1
4,4,Super Mario Galaxy 2,Based on 87 Critic Reviews,97,"Based on 2,857 User Ratings",9.0,Developer: Nintendo EAD Tokyo,Publisher: Nintendo,['Wii'],['3D Platformer'],Rated E for Everyone,1


The dataset is loaded with kagglehub, Kaggle's own library, so the project can run without supplying the dataset file manually.


---
KaggleDatasetAdapter.PANDAS tells the loader to return the file as a pandas DataFrame.
file_path is the name of the file to read from the dataset.


# Univariate Exploratory Data Analysis

## Checking the data dimensions

In [None]:
df.shape

(13390, 12)

Shows the number of rows and columns in the dataset.

## Viewing dataset information

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13390 entries, 0 to 13389
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               13390 non-null  int64 
 1   name                     13390 non-null  object
 2   metacritic_review_count  13384 non-null  object
 3   metacritic_review_score  13390 non-null  object
 4   user_review_count        11833 non-null  object
 5   user_review_score        13390 non-null  object
 6   developer                13380 non-null  object
 7   publisher                13380 non-null  object
 8   platforms                13390 non-null  object
 9   genres                   13390 non-null  object
 10  esrb                     11287 non-null  object
 11  must_play                13390 non-null  int64 
dtypes: int64(2), object(10)
memory usage: 1.2+ MB


The raw dataset has 13390 rows. Most columns are of type object, and two columns are of type int64.

### Checking for duplicates

In [None]:
df.duplicated().sum()

np.int64(0)

No rows are exact duplicates of another row.

---

The duplicated function flags rows that repeat an earlier row, and sum counts how many rows were flagged.
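
The behaviour of duplicated and sum can be seen on a small table. This is a minimal sketch on made-up rows (the frame `toy` is hypothetical and not taken from the Metacritic dataset):

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
toy = pd.DataFrame({
    "name": ["A", "B", "A"],
    "genres": ["['JRPG']", "['FPS']", "['JRPG']"],
})

dup_mask = toy.duplicated()         # True for each row that repeats an earlier row
n_duplicates = int(dup_mask.sum())  # True counts as 1, so the sum is the number of duplicates

print(dup_mask.tolist(), n_duplicates)
```

Only the later copy is flagged, so the first occurrence of a repeated row is not counted.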

### Checking the unique values of the genres column

In [None]:
df['genres'].unique()

array(["['Open-World Action']", "['3D Fighting']", "['3D Platformer']",
       "['Skating']", "['FPS']", "['Football Sim']",
       "['Linear Action Adventure']", "['Survival']", "['Western RPG']",
       "['Compilation']", "['Visual Novel']", "['Action RPG']",
       "['Auto Racing Sim']", "['Action Adventure']", "['2D Platformer']",
       "['JRPG']", "['Third Person Shooter']", "['4X Strategy']",
       "['Auto Racing']", "['2D Fighting']", "['Real-Time Strategy']",
       "['Arcade']", "['Management']", "['Command RTS']", "['Skiing']",
       "['Sandbox']", "['Soccer Sim']", "['Rhythm']",
       '["2D Beat-\'Em-Up"]', "['Basketball Sim']", "['Roguelike']",
       "['Future Racing']", "['MMORPG']", "['Application']",
       "['Hockey Sim']", "['Virtual Life']", "['Turn-Based Tactics']",
       "['Tactical FPS']", "['Tennis']", "['Arcade Racing']",
       "['Baseball Sim']", "['Metroidvania']", "['Party']",
       "['Aircraft Sim']", "['Card Battle']", "['Space Combat Sim']",
       

Lists the unique values of genres, as a reference for valid inputs to the recommendation function later.

# Data Preparation

## Handling missing values

In [None]:
df.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
name,0
metacritic_review_count,6
metacritic_review_score,0
user_review_count,1557
user_review_score,0
developer,10
publisher,10
platforms,0
genres,0


Five columns have missing values: metacritic_review_count, user_review_count, developer, publisher, and esrb.


---
isnull marks the cells that are null, and sum() counts the null values in each column.


In [None]:
df_clean = df.dropna()
df_clean

Unnamed: 0.1,Unnamed: 0,name,metacritic_review_count,metacritic_review_score,user_review_count,user_review_score,developer,publisher,platforms,genres,esrb,must_play
0,0,The Legend of Zelda: Ocarina of Time,Based on 22 Critic Reviews,99,"Based on 10,273 User Ratings",9.1,Developer: Nintendo,Publisher: Nintendo,['Nintendo 64'],['Open-World Action'],Rated E for Everyone,1
1,1,SoulCalibur,Based on 24 Critic Reviews,98,Based on 574 User Ratings,7.8,Developer: Namco,Publisher: Namco,"['Dreamcast', 'iOS (iPhone/iPad)', 'Xbox 360']",['3D Fighting'],Rated T for Teen,1
2,2,Grand Theft Auto IV,Based on 86 Critic Reviews,98,"Based on 4,462 User Ratings",8.3,Developer: Rockstar North,Publisher: Rockstar Games,"['PlayStation 3', 'Xbox 360', 'PC']",['Open-World Action'],Rated M for Mature,1
3,3,Super Mario Galaxy,Based on 73 Critic Reviews,97,"Based on 4,027 User Ratings",9.1,Developer: Nintendo,Publisher: Nintendo,['Wii'],['3D Platformer'],Rated E for Everyone,1
4,4,Super Mario Galaxy 2,Based on 87 Critic Reviews,97,"Based on 2,857 User Ratings",9.0,Developer: Nintendo EAD Tokyo,Publisher: Nintendo,['Wii'],['3D Platformer'],Rated E for Everyone,1
...,...,...,...,...,...,...,...,...,...,...,...,...
13384,13384,Ride to Hell: Retribution,Based on 14 Critic Reviews,19,Based on 468 User Ratings,1.4,Developer: Eutechnyx,Publisher: Deep Silver,"['PlayStation 3', 'PC', 'Xbox 360']",['Action Adventure'],Rated M for Mature,0
13385,13385,SPOGS Racing,Based on 7 Critic Reviews,18,Based on 10 User Ratings,6.1,Developer: Pronto Games,Publisher: D2C Games,"['Wii', 'PC']",['Auto Racing'],Rated E for Everyone,0
13386,13386,Double Dragon II: Wander of the Dragons,Based on 19 Critic Reviews,17,Based on 65 User Ratings,1.6,Developer: Gravity,Publisher: CFK,['Xbox 360'],"[""2D Beat-'Em-Up""]",Rated T for Teen,0
13387,13387,Vroom in the Night Sky,Based on 15 Critic Reviews,17,Based on 122 User Ratings,3.2,Developer: Poisoft,Publisher: Poisoft,['Nintendo Switch'],['Biking'],Rated E for Everyone,0


I dropped the rows that contain NaN values, so the model is built only from complete records.


---
The dropna function finds the rows that contain NaN values and removes them.
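
The effect of isnull().sum() and dropna() together can be sketched on a tiny frame. The data below is invented for illustration (the names `toy` and `toy_clean` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing developer and one missing esrb value.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "developer": ["Dev 1", np.nan, "Dev 3"],
    "esrb": ["Rated T for Teen", "Rated E for Everyone", np.nan],
})

missing_before = toy.isnull().sum()       # NaN count per column
toy_clean = toy.dropna()                  # keeps only rows with no NaN in any column
missing_after = toy_clean.isnull().sum()

print(missing_before.to_dict())
print(len(toy), "->", len(toy_clean))
```

Because dropna removes a row as soon as any column is missing, two rows are lost here even though each has only one gap. The same effect explains why the real dataset shrinks noticeably after this step.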


In [None]:
df_clean.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
name,0
metacritic_review_count,0
metacritic_review_score,0
user_review_count,0
user_review_score,0
developer,0
publisher,0
platforms,0
genres,0


No missing values remain in any column.

## Dropping a column

In [None]:
df_clean = df_clean.drop('Unnamed: 0', axis=1)

I dropped the Unnamed: 0 column, which was created unintentionally when the CSV file was loaded.

---
The drop function removes the column named in its argument, in this case 'Unnamed: 0'. axis=1 indicates that a column, not a row, is being dropped.
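
A column like this usually appears when a CSV was written together with its index. The sketch below reproduces that on an in-memory CSV (the frame `toy` is hypothetical) and shows two ways to deal with it, dropping the column or reading it back as the index:

```python
import io
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B"]})
csv_text = toy.to_csv()  # the index is written as an unnamed first column

reloaded = pd.read_csv(io.StringIO(csv_text))  # unnamed column becomes 'Unnamed: 0'
dropped = reloaded.drop("Unnamed: 0", axis=1)  # same approach as in the cell above

# Alternative: treat the first column as the index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(reloaded.columns.tolist(), dropped.columns.tolist(), indexed.columns.tolist())
```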


## Sorting by game genre

In [None]:
df_clean.sort_values(by='genres', inplace=True)

Sorts the rows by genre, which makes it easier to find games of the same genre.

---
The sort_values function sorts the rows by the selected column, in this case genres.


## Normalizing the data

In [None]:
# Normalize the metacritic_review_count column
df_clean['metacritic_review_count'] = df_clean['metacritic_review_count'].str.replace('Based on', '')
df_clean['metacritic_review_count'] = df_clean['metacritic_review_count'].str.replace('Critic Reviews', '')

# Normalize the user_review_count column
df_clean['user_review_count'] = df_clean['user_review_count'].str.replace('Based on', '')
df_clean['user_review_count'] = df_clean['user_review_count'].str.replace('User Ratings', '')

# Normalize the developer column
df_clean['developer'] = df_clean['developer'].str.replace('Developer: ', '')

# Normalize the publisher column
df_clean['publisher'] = df_clean['publisher'].str.replace('Publisher: ', '')

# Normalize the esrb column
df_clean['esrb'] = df_clean['esrb'].str.replace('Rated', '')

Removes the boilerplate text from these columns. The count columns are left with only the number, and developer, publisher, and esrb with only their actual value, which prepares them for analysis in case they are used later.

---
The str.replace function replaces one substring with another. Here it replaces the unwanted words with an empty string.
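
As a small illustration of this cleaning step, the sketch below applies the same replacements to two sample strings in the dataset's format. The added str.strip() call and the explicit regex=False are assumptions of this sketch, not part of the notebook cell:

```python
import pandas as pd

counts = pd.Series(["Based on 22 Critic Reviews", "Based on 10,273 User Ratings"])

cleaned = (
    counts
    .str.replace("Based on", "", regex=False)        # literal match, no regex
    .str.replace("Critic Reviews", "", regex=False)
    .str.replace("User Ratings", "", regex=False)
    .str.strip()                                     # drop the leftover spaces
)

print(cleaned.tolist())
```

The thousands separator is still present after this step, which is why the comma is removed before the column is cast to a number.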


## Converting data types

In [None]:
# Convert the type of metacritic_review_count
df_clean['metacritic_review_count']=df_clean['metacritic_review_count'].astype(int)

# Convert the type of metacritic_review_score
df_clean['metacritic_review_score']=df_clean['metacritic_review_score'].astype(int)

# Convert the type of user_review_count (remove the thousands separator first)
df_clean['user_review_count'] = df_clean['user_review_count'].str.replace(',', '').astype(float)

# Convert the type of user_review_score
df_clean['user_review_score']=df_clean['user_review_score'].astype(float)

Converts the columns to int and float so that the types match what the columns now contain.

---
The astype function casts a column to the requested type.
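
The conversion can be checked on a few already-cleaned sample strings. This is a minimal sketch with invented values (the frames `raw` and `converted` are hypothetical):

```python
import pandas as pd

# Hypothetical values in the shape the columns have after the text cleanup.
raw = pd.DataFrame({
    "metacritic_review_count": ["22", "86"],
    "user_review_count": ["10,273", "574"],
    "user_review_score": ["9.1", "7.8"],
})

converted = pd.DataFrame({
    "metacritic_review_count": raw["metacritic_review_count"].astype(int),
    # The comma must go first, otherwise "10,273" cannot be parsed as a number.
    "user_review_count": raw["user_review_count"].str.replace(",", "", regex=False).astype(float),
    "user_review_score": raw["user_review_score"].astype(float),
})

print(converted.dtypes)
```

astype raises an error on the first value it cannot parse. If non-numeric entries were expected, pd.to_numeric with errors="coerce" would be one alternative, turning them into NaN instead.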


In [None]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9902 entries, 7265 to 8704
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   name                     9902 non-null   object 
 1   metacritic_review_count  9902 non-null   int64  
 2   metacritic_review_score  9902 non-null   int64  
 3   user_review_count        9902 non-null   float64
 4   user_review_score        9902 non-null   float64
 5   developer                9902 non-null   object 
 6   publisher                9902 non-null   object 
 7   platforms                9902 non-null   object 
 8   genres                   9902 non-null   object 
 9   esrb                     9902 non-null   object 
 10  must_play                9902 non-null   int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 928.3+ KB


Shows the information for the cleaned dataset. Only 9902 rows remain.
metacritic_review_count and metacritic_review_score are now int.
user_review_count and user_review_score are now float.

## Creating the combined column


In [None]:
df_clean['combined_features'] = df_clean['name'].fillna('') + ' ' + df_clean['developer'].fillna('') + ' ' + df_clean['publisher'].fillna('') + ' ' + df_clean['genres'].fillna('') + ' ' + df_clean['esrb'].fillna('')

The combined column is the text from which cosine similarity is computed later.

---
The combined_features column holds the concatenation of the selected columns. fillna('') replaces any missing value with an empty string, and the values are joined with a single space so that words from neighbouring columns do not fuse into one token.
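
The role of the separator is easiest to see on a single row. The sketch below uses one made-up record (the frame `toy` is hypothetical) and compares joining with a space against joining with an empty string:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["SoulCalibur"],
    "developer": ["Namco"],
    "genres": ["['3D Fighting']"],
})

# Joined with a space: each column contributes its own tokens.
with_space = toy["name"].fillna("") + " " + toy["developer"].fillna("") + " " + toy["genres"].fillna("")

# Joined with an empty string: neighbouring values run together.
without_space = toy["name"].fillna("") + "" + toy["developer"].fillna("") + "" + toy["genres"].fillna("")

print(with_space.iloc[0])
print(without_space.iloc[0])
```

With an empty separator the name and developer merge into a single word, so a tokenizer can no longer match "Namco" across games.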


## Normalizing the combined column

In [None]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = re.sub(r'\d+', '', text)      # remove digits
    return text

df_clean['combined_features']=df_clean['combined_features'].apply(clean_text)

Normalizes the combined_features column by converting it to lower case and removing punctuation and digits, in preparation for TF-IDF.

---

text.lower converts the text to lower case.
re.sub removes the punctuation and the digits from the text.
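
To see what these rules do to a real-looking value, the sketch below runs the same clean_text logic on one sample string (the string `sample` is made up for illustration):

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove digits
    return text

sample = "SoulCalibur Namco ['3D Fighting'] T for Teen"
cleaned_sample = clean_text(sample)
print(cleaned_sample)
```

One side effect worth noting: removing digits turns "3D" into "d", and "2D" would become "d" as well, so the two can no longer be told apart. TfidfVectorizer's default tokenizer also ignores single-character tokens, so that "d" does not reach the vocabulary at all.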

## TF-IDF Vectorizer

In [None]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_clean['combined_features'])

Converts the column into a form the recommendation system can use, via TF-IDF.


---

**TfidfVectorizer(stop_words='english')** creates a TF-IDF vectorizer object from the sklearn library. **TF-IDF** measures how important a word is to each document relative to the whole collection, and **stop_words='english'** makes the model ignore common English stop words.
**fit** learns the vocabulary of the combined_features column, that is, its unique terms and how they are distributed across rows.
**transform** converts the text into numeric form.

# Model Development with Content-Based Filtering

## Computing cosine similarity

In [None]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print('Ukuran matriks cosine similarity:', cosine_sim.shape)

Ukuran matriks cosine similarity: (9902, 9902)


cosine_similarity computes the similarity between two arrays, and the result is stored in cosine_sim. Here it computes the cosine similarity between every pair of games.
**shape** gives the size of the matrix. It is square because tfidf_matrix is passed twice, so each game is compared with every other game.
The value 9902 is both the number of rows and the number of columns, and equals the number of games in the cleaned dataset.
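
A similarity matrix of this kind can be turned into title-based recommendations. The sketch below shows one possible way on three invented games. The helper `similar_games` and the frame `games` are hypothetical and are not part of the notebook, which recommends by genre instead:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue.
games = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "combined_features": [
        "nintendo openworld action",
        "nintendo openworld action adventure",
        "namco fighting",
    ],
})

tfidf_toy = TfidfVectorizer().fit_transform(games["combined_features"])
sim = cosine_similarity(tfidf_toy, tfidf_toy)  # sim[i, j] = similarity of game i and game j

def similar_games(title, top_n=2):
    """Hypothetical helper: rank the other games by cosine similarity to `title`."""
    names = games["name"].tolist()
    pos = names.index(title)                    # row position of the title in the matrix
    scores = pd.Series(sim[pos], index=names).drop(title)
    return scores.sort_values(ascending=False).head(top_n)

print(sim.shape)
print(similar_games("Game A"))
```

The lookup relies on matrix rows being in the same order as the table rows. In the notebook, df_clean keeps its original index labels after sorting, so a position-based lookup there would first need something like reset_index(drop=True).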

## Building the genre-based recommendation function

In [None]:
def recommend_games_by_genre(genre, top_n=5):
    # Filter the dataframe by the specified genre
    genre_games = df_clean[df_clean['genres'].str.lower().str.contains(genre.lower(), na=False)]

    if genre_games.empty:
        print(f"No games found for genre: {genre}")
        return pd.DataFrame(columns=['genres', 'name'])

    # Sort the games by metacritic_review_score in descending order
    recommended_games = genre_games.sort_values(by='metacritic_review_score', ascending=False).head(top_n)

    return recommended_games[['genres', 'name']]

The recommendation function takes two parameters: genre, which holds the input, and top_n, which defaults to 5 and sets how many top results are returned.
**str.lower** converts the values of the genres column to lower case.
**str.contains** checks whether the input string, also converted to lower case, appears in the genres column.
**na=False** treats NaN values as non-matches.
**if genre_games.empty** checks whether no game matched the input.
**recommended_games** sorts genre_games by the metacritic_review_score column in descending order and keeps the first top_n rows.
**return** returns the result, in this case the genres and name columns.
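
The filtering step can be tried on a short series of genre strings. In this minimal sketch the values are invented, and regex=False is an assumption added here so the query is matched literally:

```python
import numpy as np
import pandas as pd

genres = pd.Series(["['Open-World Action']", "['JRPG']", np.nan, "['Action RPG']"])

# Case-insensitive substring match; NaN counts as "no match" because of na=False.
mask = genres.str.lower().str.contains("action", na=False, regex=False)

print(mask.tolist())
```

Because this is a substring match, a short query such as "action" selects every genre whose label contains that word, not only an exact genre name.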

### Building the user preference list

In [None]:
user_preferences = {
    "Open-World Action": ["The Legend of Zelda: Ocarina of Time", "Grand Theft Auto IV"],
    "3D Platformer": ["Super Mario Galaxy"],
    "JRPG": ["Persona 5 Royal"],
    "Action RPG": ["Diablo III"],
    "MOBA":["Dota 2"],
    "Soccer":["FIFA 21"],
    "MMORPG":["World of Warcraft"],
    "Turn-Based Strategy":["Total War: Shogun 2"],
    "Virtual Life":["The Sims"]
}

user_preferences holds the user's preferred games, keyed by genre. The dictionary lists nine genres.

In [None]:
recommended_games_by_genre = {}

for genre, games in user_preferences.items():
    recommended_games_by_genre[genre] = recommend_games_by_genre(genre, top_n=5)

display(recommended_games_by_genre)

{'Open-World Action':                   genres                                     name
 0  ['Open-World Action']     The Legend of Zelda: Ocarina of Time
 2  ['Open-World Action']                      Grand Theft Auto IV
 9  ['Open-World Action']                       Grand Theft Auto V
 8  ['Open-World Action']                    Red Dead Redemption 2
 5  ['Open-World Action']  The Legend of Zelda: Breath of the Wild,
 '3D Platformer':                genres                  name
 4   ['3D Platformer']  Super Mario Galaxy 2
 12  ['3D Platformer']   Super Mario Odyssey
 3   ['3D Platformer']    Super Mario Galaxy
 54  ['3D Platformer']             Astro Bot
 98  ['3D Platformer']  Super Mario 3D World,
 'JRPG':        genres                  name
 46   ['JRPG']       Persona 5 Royal
 60   ['JRPG']          Chrono Cross
 50   ['JRPG']      Final Fantasy IX
 63   ['JRPG']  Metaphor: ReFantazio
 104  ['JRPG']             Persona 5,
 'Action RPG':              genres                       

The recommendation system works in 3 steps:
1. Call the recommendation function.
2. Loop over user_preferences to get each key and its items, where the key is the genre name and the items are the game names. The genre name is passed to the recommendation function. The game names are not used here and serve as the relevant items in the evaluation.
3. The top 5 recommended games for each genre are displayed.

# Evaluation

In [None]:
precision_at_n_scores = []
recall_at_n_scores = []
f1_scores = []

for genre, recommended_df in recommended_games_by_genre.items():
    recommended_game_names = recommended_df['name'].tolist()
    liked_games = user_preferences.get(genre, [])

    relevant_recommendations = len(set(recommended_game_names) & set(liked_games))
    n = len(recommended_game_names)
    total_relevant_items = len(liked_games)

    precision_at_n = relevant_recommendations / n if n > 0 else 0
    precision_at_n_scores.append(precision_at_n)

    recall_at_n = relevant_recommendations / total_relevant_items if total_relevant_items > 0 else 0
    recall_at_n_scores.append(recall_at_n)

    f1_score = (2 * precision_at_n * recall_at_n) / (precision_at_n + recall_at_n) if (precision_at_n + recall_at_n) > 0 else 0
    f1_scores.append(f1_score)

    print(f"Genre '{genre}':")
    print(f"  Precision@{n}: {precision_at_n:.4f}")
    print(f"  Recall@{n}: {recall_at_n:.4f}")
    print(f"  F1-score@{n}: {f1_score:.4f}") # Display F1-score
    print("-" * 20)


average_precision_at_n = sum(precision_at_n_scores) / len(precision_at_n_scores) if precision_at_n_scores else 0
average_recall_at_n = sum(recall_at_n_scores) / len(recall_at_n_scores) if recall_at_n_scores else 0
average_f1_score = sum(f1_scores) / len(f1_scores) if f1_scores else 0

print(f"\nAverage Precision at N across all genres: {average_precision_at_n:.4f}")
print(f"Average Recall at N across all genres: {average_recall_at_n:.4f}")
print(f"Average F1-score at N across all genres: {average_f1_score:.4f}")

Genre 'Open-World Action':
  Precision@5: 0.4000
  Recall@5: 1.0000
  F1-score@5: 0.5714
--------------------
Genre '3D Platformer':
  Precision@5: 0.2000
  Recall@5: 1.0000
  F1-score@5: 0.3333
--------------------
Genre 'JRPG':
  Precision@5: 0.2000
  Recall@5: 1.0000
  F1-score@5: 0.3333
--------------------
Genre 'Action RPG':
  Precision@5: 0.0000
  Recall@5: 0.0000
  F1-score@5: 0.0000
--------------------
Genre 'MOBA':
  Precision@5: 0.0000
  Recall@5: 0.0000
  F1-score@5: 0.0000
--------------------
Genre 'Soccer':
  Precision@5: 0.0000
  Recall@5: 0.0000
  F1-score@5: 0.0000
--------------------
Genre 'MMORPG':
  Precision@5: 0.2000
  Recall@5: 1.0000
  F1-score@5: 0.3333
--------------------
Genre 'Turn-Based Strategy':
  Precision@5: 0.2000
  Recall@5: 1.0000
  F1-score@5: 0.3333
--------------------
Genre 'Virtual Life':
  Precision@5: 0.2000
  Recall@5: 1.0000
  F1-score@5: 0.3333
--------------------

Average Precision at N across all genres: 0.1556
Average Recall at N ac

First, three lists are created to hold the precision, recall, and F1 scores.

Next, the loop goes through the recommendations stored in recommended_games_by_genre and looks up the liked games for the same genre in user_preferences.

Precision, recall, and F1-score are then computed for each genre.

The precision, recall, and F1-score of each genre are printed.

Finally, the averages of precision, recall, and F1-score across all genres are computed and printed.
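
The per-genre numbers can be reproduced by hand. The sketch below wraps the same formulas in a small function (`precision_recall_f1` is a hypothetical name) and applies it to the 'Open-World Action' recommendations and liked games shown earlier:

```python
def precision_recall_f1(recommended, liked):
    """Precision, recall and F1 of a recommendation list against the liked items."""
    hits = len(set(recommended) & set(liked))  # recommended games the user actually likes
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(liked) if liked else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

recommended = [
    "The Legend of Zelda: Ocarina of Time",
    "Grand Theft Auto IV",
    "Grand Theft Auto V",
    "Red Dead Redemption 2",
    "The Legend of Zelda: Breath of the Wild",
]
liked = ["The Legend of Zelda: Ocarina of Time", "Grand Theft Auto IV"]

p, r, f1 = precision_recall_f1(recommended, liked)
print(f"Precision@5: {p:.4f}  Recall@5: {r:.4f}  F1-score@5: {f1:.4f}")
```

Two of the five recommendations are liked, so precision is 2/5 and recall is 2/2. Since each genre lists at most two liked games while five are recommended, Precision@5 cannot exceed 0.4 in this setup, which should be kept in mind when reading the averages.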