# Project Recommender System
This is the Group 6 project on video game ratings. The objective is to predict a game's user rating from its characteristics.
This notebook is the starting point of the project and covers data cleaning and exploration.

### Outline of the Notebook

- Step 1: Setting Up the Environment
- Step 2: Data Cleaning
    - 2.1 Missing Values and Duplicates
    - 2.2 Fix Data Names and Types
- Step 3: Data Description
- Step 4: Save Data

## Step 1: Setting Up the Environment

In [None]:
# Libraries for data manipulation
import pandas as pd
import os

In [None]:
# Load and display the dataset

# Path to the dataset, to be changed according to your local setup
PATH = "/Users/agathecauhape/EMLyon 2024-25/Canada/Recommender System/projet/data/"

file_path = os.path.join(PATH, "video_game_reviews.csv")
df = pd.read_csv(file_path)
print(df.head(3))


           Game Title  User Rating Age Group Targeted  Price Platform  \
0  Grand Theft Auto V         36.4           All Ages  41.41       PC   
1          The Sims 4         38.3             Adults  57.56       PC   
2           Minecraft         26.8              Teens  44.93       PC   

  Requires Special Device   Developer        Publisher  Release Year  \
0                      No  Game Freak       Innersloth          2015   
1                      No    Nintendo  Electronic Arts          2015   
2                     Yes      Bungie           Capcom          2012   

       Genre Multiplayer  Game Length (Hours) Graphics Quality  \
0  Adventure          No                 55.3           Medium   
1    Shooter         Yes                 34.6              Low   
2  Adventure         Yes                 13.9              Low   

  Soundtrack Quality Story Quality  \
0            Average          Poor   
1               Poor          Poor   
2               Good       Average   



In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47774 entries, 0 to 47773
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Game Title               47774 non-null  object 
 1   User Rating              47774 non-null  float64
 2   Age Group Targeted       47774 non-null  object 
 3   Price                    47774 non-null  float64
 4   Platform                 47774 non-null  object 
 5   Requires Special Device  47774 non-null  object 
 6   Developer                47774 non-null  object 
 7   Publisher                47774 non-null  object 
 8   Release Year             47774 non-null  int64  
 9   Genre                    47774 non-null  object 
 10  Multiplayer              47774 non-null  object 
 11  Game Length (Hours)      47774 non-null  float64
 12  Graphics Quality         47774 non-null  object 
 13  Soundtrack Quality       47774 non-null  object 
 14  Story Quality         

This dataset contains 47,774 rated entries and 18 columns, so we have a lot of information about each game. Many of the columns have the object type, so we will likely need to convert data types. The column names also do not follow conventional naming, so we will rename them.

Below is a description of each variable:
- Game Title: Title of the game.
- User Rating: Rating of this game.
- Age Group Targeted: Age group the game is primarily aimed at.
- Price: Price of the game.
- Platform: Platform on which the game can be played.
- Requires Special Device: Whether the game requires a special device.
- Developer: Name of the developer.
- Publisher: Name of the publisher.
- Release Year: Release year of the game.
- Genre: Genre of the game.
- Multiplayer: Whether the game is multiplayer.
- Game Length (Hours): Average time, in hours, needed to finish the game.
- Graphics Quality: Quality of graphics.
- Soundtrack Quality: Quality of soundtracks.
- Story Quality: Quality of the story.
- User Review Text: Text review written by the user; it will be used for either sentiment analysis or RAG analysis.
- Game Mode: Whether the game is played online with other players or standalone.
- Min Number of Players: Minimum players required to play.


In [6]:
# Work on a copy (df_copy) so that the original df is not modified
df_copy = df.copy()

## Step 2: Data Cleaning

### 2.1 Missing Values and Duplicates

We need to check the data for missing values and duplicates. If there are too many of them, they can distort the results of any model.

In [7]:
df_copy.isna().sum()

Game Title                 0
User Rating                0
Age Group Targeted         0
Price                      0
Platform                   0
Requires Special Device    0
Developer                  0
Publisher                  0
Release Year               0
Genre                      0
Multiplayer                0
Game Length (Hours)        0
Graphics Quality           0
Soundtrack Quality         0
Story Quality              0
User Review Text           0
Game Mode                  0
Min Number of Players      0
dtype: int64

In [8]:
# Count duplicates
duplicate_count = df_copy.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Get percentage of duplicates
duplicate_percentage = (duplicate_count / len(df_copy)) * 100
print(f"Percentage of duplicates: {duplicate_percentage:.2f}%")

Number of duplicate rows: 0
Percentage of duplicates: 0.00%


This dataset has neither missing values nor duplicates, so no further cleaning is required in this respect.
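The two checks above can be bundled into one small report so they are easy to rerun after later transformations. This is a minimal sketch on toy data; the helper name `quality_report` and the toy DataFrame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing = frame.isna().sum()
    return pd.DataFrame({
        "missing_count": missing,
        "missing_pct": (missing / len(frame) * 100).round(2),
    })


# Hypothetical toy data: one missing title and one fully duplicated row
toy = pd.DataFrame({
    "Game Title": ["A", "B", "B", None],
    "Price": [19.99, 29.99, 29.99, 39.99],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())

print(report)
print(f"Number of duplicate rows: {n_duplicates}")
```

On the real data, `quality_report(df_copy)` would be expected to show zeros everywhere, in line with the outputs above.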

### 2.2 Fix Data Names and Types

The column names do not follow Python naming conventions. They should be in snake_case for easier manipulation, so we rename them.

In [None]:
# Change the column names
df_copy.columns = [
    'game_title',
    'user_rating',
    'age_group_targeted',
    'price',
    'platform',
    'requires_special_device',
    'developer',
    'publisher',
    'release_year',
    'genre',
    'multiplayer',
    'game_length_hours',
    'graphics_quality',
    'soundtrack_quality',
    'story_quality',
    'user_review_text',
    'game_mode',
    'min_number_of_players'
]
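As an alternative to listing every name by hand, the renaming could also be derived from the original column names. The sketch below is one possible approach, not the method used in this notebook; the helper `to_snake_case` is a hypothetical name, and it assumes that keeping only letters and digits is enough for these columns.

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a column name and join its alphanumeric runs with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.lower() for word in words)


# A few of the original column names from the dataset
original = ["Game Title", "Game Length (Hours)", "Min Number of Players"]
renamed = [to_snake_case(col) for col in original]
print(renamed)
```

Applied to the whole DataFrame, this would read `df_copy.columns = [to_snake_case(c) for c in df_copy.columns]`. The explicit list above remains easier to review, since every final name is visible.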


In [None]:
# Check the new column names
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47774 entries, 0 to 47773
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   game_title               47774 non-null  object 
 1   user_rating              47774 non-null  float64
 2   age_group_targeted       47774 non-null  object 
 3   price                    47774 non-null  float64
 4   platform                 47774 non-null  object 
 5   requires_special_device  47774 non-null  object 
 6   developer                47774 non-null  object 
 7   publisher                47774 non-null  object 
 8   release_year             47774 non-null  int64  
 9   genre                    47774 non-null  object 
 10  multiplayer              47774 non-null  object 
 11  game_length_hours        47774 non-null  float64
 12  graphics_quality         47774 non-null  object 
 13  soundtrack_quality       47774 non-null  object 
 14  story_quality         

Some columns are better represented as string rather than object, and others as category, so we convert them for later analysis.

In [11]:
# Fix the types of the columns
df_copy['game_title'] = df_copy['game_title'].astype('string')
df_copy['user_rating'] = df_copy['user_rating'].astype('float')
df_copy['age_group_targeted'] = df_copy['age_group_targeted'].astype('category')
df_copy['price'] = df_copy['price'].astype('float')
df_copy['platform'] = df_copy['platform'].astype('category')
df_copy['requires_special_device'] = df_copy['requires_special_device'].astype('string')
df_copy['developer'] = df_copy['developer'].astype('string')
df_copy['publisher'] = df_copy['publisher'].astype('string')
df_copy['release_year'] = df_copy['release_year'].astype('int')
df_copy['genre'] = df_copy['genre'].astype('category')
df_copy['multiplayer'] = df_copy['multiplayer'].astype('string')
df_copy['game_length_hours'] = df_copy['game_length_hours'].astype('float')
df_copy['graphics_quality'] = df_copy['graphics_quality'].astype('category')
df_copy['soundtrack_quality'] = df_copy['soundtrack_quality'].astype('category')
df_copy['story_quality'] = df_copy['story_quality'].astype('category')
df_copy['user_review_text'] = df_copy['user_review_text'].astype('string')
df_copy['game_mode'] = df_copy['game_mode'].astype('string')
df_copy['min_number_of_players'] = df_copy['min_number_of_players'].astype('int')
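The same conversions can be written more compactly by passing a column-to-dtype mapping to `DataFrame.astype`. This is a minimal sketch on a hypothetical toy DataFrame with only four of the columns; the name `dtype_map` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy frame using a few of the renamed columns
toy = pd.DataFrame({
    "game_title": ["Minecraft", "The Sims 4"],
    "platform": ["PC", "PC"],
    "price": [44.93, 57.56],
    "release_year": [2012, 2015],
})

# One mapping instead of one assignment per column
dtype_map = {
    "game_title": "string",
    "platform": "category",
    "price": "float",
    "release_year": "int",
}
toy = toy.astype(dtype_map)
print(toy.dtypes)
```

Keeping the mapping in one dictionary also makes it reusable, for example when reloading the cleaned file later.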


## Step 3: Data Description

Before going into a detailed analysis, we want an overview of the data to decide where to go next.

In [12]:
df_copy.describe()

       user_rating      price  release_year  game_length_hours  min_number_of_players
count      47774.0    47774.0       47774.0            47774.0                47774.0
mean     29.719329  39.951371   2016.480952          32.481672               5.116758
std       7.550131  11.520342      4.027276          15.872508               2.769521
min           10.1      19.99        2010.0                5.0                    1.0
25%           24.3      29.99        2013.0               18.8                    3.0
50%           29.7     39.845        2016.0               32.5                    5.0
75%           35.1    49.9575        2020.0               46.3                    7.0
max           49.5      59.99        2023.0               60.0                   10.0


In [13]:
# Display the count of unique values for each column
print(" Unique value counts per column:\n")
for col in df_copy.columns:
    print(f"{col}: {df_copy[col].nunique()} unique values")

 Unique value counts per column:

game_title: 40 unique values
user_rating: 392 unique values
age_group_targeted: 4 unique values
price: 4001 unique values
platform: 5 unique values
requires_special_device: 2 unique values
developer: 10 unique values
publisher: 9 unique values
release_year: 14 unique values
genre: 10 unique values
multiplayer: 2 unique values
game_length_hours: 551 unique values
graphics_quality: 4 unique values
soundtrack_quality: 4 unique values
story_quality: 4 unique values
user_review_text: 12 unique values
game_mode: 2 unique values
min_number_of_players: 9 unique values
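With only 40 unique titles across 47,774 rows, each title necessarily appears many times, so it can help to look at how rows are distributed per title. The sketch below shows the idea on hypothetical toy data; `DataFrame.nunique` gives the same per-column counts as the loop above in a single call.

```python
import pandas as pd

# Hypothetical toy data with a repeated title
toy = pd.DataFrame({
    "game_title": ["Minecraft", "Minecraft", "The Sims 4"],
    "genre": ["Adventure", "Adventure", "Shooter"],
})

# Unique values per column, equivalent to looping over the columns
unique_counts = toy.nunique()

# Number of rows for each title
rows_per_title = toy["game_title"].value_counts()

print(unique_counts)
print(rows_per_title)
```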


## Step 4: Save Data
We save the cleaned dataset df_copy to a CSV file so that it can be reused in later stages of the project.

In [14]:
# Verify the changes made to the DataFrame before saving
df_copy.head(3)

,game_title,user_rating,age_group_targeted,price,platform,requires_special_device,developer,publisher,release_year,genre,multiplayer,game_length_hours,graphics_quality,soundtrack_quality,story_quality,user_review_text,game_mode,min_number_of_players
0,Grand Theft Auto V,36.4,All Ages,41.41,PC,No,Game Freak,Innersloth,2015,Adventure,No,55.3,Medium,Average,Poor,"Solid game, but too many bugs.",Offline,1
1,The Sims 4,38.3,Adults,57.56,PC,No,Nintendo,Electronic Arts,2015,Shooter,Yes,34.6,Low,Poor,Poor,"Solid game, but too many bugs.",Offline,3
2,Minecraft,26.8,Teens,44.93,PC,Yes,Bungie,Capcom,2012,Adventure,Yes,13.9,Low,Good,Average,"Great game, but the graphics could be better.",Offline,5


In [15]:
# Save the cleaned dataset as a CSV file, without the index column
df_copy.to_csv(os.path.join(PATH, "video_game_clean.csv"), index=False)
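One point to keep in mind: a CSV file stores plain text only, so the `category` and `string` dtypes set in Step 2.2 are not kept when the file is read back. A possible workaround is to pass the dtypes again to `pd.read_csv`. The sketch below illustrates this on a hypothetical toy DataFrame, using an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Hypothetical toy frame with the dtypes used in this notebook
clean = pd.DataFrame({
    "game_title": pd.Series(["Minecraft", "The Sims 4"], dtype="string"),
    "platform": pd.Series(["PC", "PC"], dtype="category"),
    "price": [44.93, 57.56],
})

# Write to an in-memory buffer (stands in for the CSV file on disk)
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Reload without dtype information: the category dtype is lost
buffer.seek(0)
reloaded_plain = pd.read_csv(buffer)

# Reload with an explicit dtype mapping: the dtypes are restored
buffer.seek(0)
reloaded_typed = pd.read_csv(
    buffer,
    dtype={"game_title": "string", "platform": "category"},
)

print(reloaded_plain.dtypes)
print(reloaded_typed.dtypes)
```

If keeping dtypes automatically matters later in the project, a binary format such as Parquet or pickle would be another option, at the cost of files that are harder to inspect by hand.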