In [15]:
from src.cleaning.clean import clean_data
from src.cleaning.nlp import apply_preprocess_text
import pandas as pd

## The Dataset
This dataset is from [Kaggle](https://www.kaggle.com/datasets/deepcontractor/top-video-games-19952021-metacritic?resource=download) and contains review scores and summaries from Metacritic.com for video games released from 1995 to 2021. Each row has a `meta_score`, the average of critic reviews for the game, and a `user_review`, the average score among users.

In [16]:
df = pd.read_csv('data/all_games.csv')
df.head(10)

Unnamed: 0,name,platform,release_date,summary,meta_score,user_review
0,The Legend of Zelda: Ocarina of Time,Nintendo 64,"November 23, 1998","As a young boy, Link is tricked by Ganondorf, ...",99,9.1
1,Tony Hawk's Pro Skater 2,PlayStation,"September 20, 2000",As most major publishers' development efforts ...,98,7.4
2,Grand Theft Auto IV,PlayStation 3,"April 29, 2008",[Metacritic's 2008 PS3 Game of the Year; Also ...,98,7.7
3,SoulCalibur,Dreamcast,"September 8, 1999","This is a tale of souls and swords, transcendi...",98,8.4
4,Grand Theft Auto IV,Xbox 360,"April 29, 2008",[Metacritic's 2008 Xbox 360 Game of the Year; ...,98,7.9
5,Super Mario Galaxy,Wii,"November 12, 2007",[Metacritic's 2007 Wii Game of the Year] The u...,97,9.1
6,Super Mario Galaxy 2,Wii,"May 23, 2010","Super Mario Galaxy 2, the sequel to the galaxy...",97,9.1
7,Red Dead Redemption 2,Xbox One,"October 26, 2018",Developed by the creators of Grand Theft Auto ...,97,8.0
8,Grand Theft Auto V,Xbox One,"November 18, 2014",Grand Theft Auto 5 melds storytelling and game...,97,7.9
9,Grand Theft Auto V,PlayStation 3,"September 17, 2013","Los Santos is a vast, sun-soaked metropolis fu...",97,8.3


## Data Cleaning

In this section we apply a few basic cleaning steps and derive extra features that may help build a good recommender system. The `clean_data` function does the following:
- Cleans up whitespace
- Converts `release_date` to a datetime column and creates `year` and `decade` columns
- Creates the `platform_type` column which generalizes video game console platforms to a particular brand (e.g. Nintendo Switch = Nintendo)
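
The steps listed above could be sketched roughly as follows. This is not the code in `src.cleaning.clean`: the names `clean_data_sketch` and `PLATFORM_BRANDS`, the `"Other"` fallback, and the sample rows (including their stray whitespace) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical platform-to-brand mapping; the project's real mapping may differ.
PLATFORM_BRANDS = {
    "Nintendo 64": "Nintendo",
    "Wii": "Nintendo",
    "PlayStation": "PlayStation",
    "PlayStation 3": "PlayStation",
    "Xbox 360": "Xbox",
    "Xbox One": "Xbox",
    "Dreamcast": "Sega",
}


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in for clean_data; not the project's actual code."""
    out = df.copy()

    # Strip leading/trailing whitespace from the text columns.
    for col in ("name", "platform", "summary"):
        out[col] = out[col].str.strip()

    # Parse dates such as "November 23, 1998" and derive year and decade.
    out["release_date"] = pd.to_datetime(out["release_date"], format="%B %d, %Y")
    out["year"] = out["release_date"].dt.year
    out["decade"] = (out["year"] // 10) * 10

    # Map each platform to a brand, falling back to "Other" when unknown.
    out["platform_type"] = out["platform"].map(PLATFORM_BRANDS).fillna("Other")
    return out


# Made-up sample rows shaped like the dataset, with whitespace added on purpose.
sample = pd.DataFrame(
    {
        "name": ["The Legend of Zelda: Ocarina of Time", "SoulCalibur"],
        "platform": [" Nintendo 64", " Dreamcast"],
        "release_date": ["November 23, 1998", "September 8, 1999"],
        "summary": ["As a young boy, Link is tricked ... ", "This is a tale ..."],
    }
)
cleaned = clean_data_sketch(sample)
print(cleaned[["name", "year", "decade", "platform_type"]])
```

Working on a copy keeps the raw frame intact, so the cleaning cell can be re-run without reloading the CSV.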

In [17]:
df = clean_data(df)
df.head()

Unnamed: 0,name,platform,release_date,summary,meta_score,user_review,year,decade,platform_type
0,The Legend of Zelda: Ocarina of Time,Nintendo 64,1998-11-23,"As a young boy, Link is tricked by Ganondorf, ...",99,9.1,1998,1990,Nintendo
1,Tony Hawk's Pro Skater 2,PlayStation,2000-09-20,As most major publishers' development efforts ...,98,7.4,2000,2000,PlayStation
2,Grand Theft Auto IV,PlayStation 3,2008-04-29,[Metacritic's 2008 PS3 Game of the Year; Also ...,98,7.7,2008,2000,PlayStation
3,SoulCalibur,Dreamcast,1999-09-08,"This is a tale of souls and swords, transcendi...",98,8.4,1999,1990,Sega
4,Grand Theft Auto IV,Xbox 360,2008-04-29,[Metacritic's 2008 Xbox 360 Game of the Year; ...,98,7.9,2008,2000,Xbox


### NLP tasks

We don't have individual user reviews, only averages of critic and user scores, so we have to rely on item-to-item comparisons based on the written summary of each video game. That requires some natural language processing. We use the function `apply_preprocess_text`, which applies tokenization, stopword removal, and stemming to each summary.
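
As a rough illustration of what tokenization, stopword removal, and stemming do to a summary, here is a minimal standard-library sketch. It is not the implementation in `src.cleaning.nlp`: `preprocess_text_sketch`, `toy_stem`, the tiny stopword list, and the suffix rules are all assumptions, and the toy stemmer is far cruder than a real stemming algorithm. Unlike the notebook's output, it also drops punctuation.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "as", "by", "is", "of", "the", "this", "to"}

# Suffixes removed by the toy stemmer, longer ones checked first.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess_text_sketch(text: str) -> str:
    """Lowercase, tokenize, drop stopwords, and stem; returns a joined string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [toy_stem(tok) for tok in tokens if tok not in STOPWORDS]
    return " ".join(kept)


print(preprocess_text_sketch("This is a tale of souls and swords"))
# prints: tale soul sword
```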

In [18]:
df = apply_preprocess_text(df, "summary")
df.head(10)


Unnamed: 0,name,platform,release_date,summary,meta_score,user_review,year,decade,platform_type
0,The Legend of Zelda: Ocarina of Time,Nintendo 64,1998-11-23,"young boy , link trick ganondorf , king gerudo...",99,9.1,1998,1990,Nintendo
1,Tony Hawk's Pro Skater 2,PlayStation,2000-09-20,major publish ' develop effort shift number ne...,98,7.4,2000,2000,PlayStation
2,Grand Theft Auto IV,PlayStation 3,2008-04-29,[ metacrit 's 2008 ps3 game year ; also known ...,98,7.7,2008,2000,PlayStation
3,SoulCalibur,Dreamcast,1999-09-08,"tale soul sword , transcend world histori , to...",98,8.4,1999,1990,Sega
4,Grand Theft Auto IV,Xbox 360,2008-04-29,[ metacrit 's 2008 xbox 360 game year ; also k...,98,7.9,2008,2000,Xbox
5,Super Mario Galaxy,Wii,2007-11-12,[ metacrit 's 2007 wii game year ] ultim ninte...,97,9.1,2007,2000,Nintendo
6,Super Mario Galaxy 2,Wii,2010-05-23,"super mario galaxi 2 , sequel galaxy-hop origi...",97,9.1,2010,2010,Nintendo
7,Red Dead Redemption 2,Xbox One,2018-10-26,develop creator grand theft auto v red dead re...,97,8.0,2018,2010,Xbox
8,Grand Theft Auto V,Xbox One,2014-11-18,grand theft auto 5 meld storytel gameplay uniq...,97,7.9,2014,2010,Xbox
9,Grand Theft Auto V,PlayStation 3,2013-09-17,"lo santo vast , sun-soak metropoli full self-h...",97,8.3,2013,2010,PlayStation
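
With the summaries reduced to stemmed tokens, an item-to-item comparison can be sketched as a similarity score between two token strings. The example below uses plain bag-of-words cosine similarity from the standard library. The choice of metric, the function name `cosine_similarity`, and the three made-up summaries are assumptions for illustration, not necessarily what this project uses.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words counts of two token strings."""
    counts_a, counts_b = Counter(a.split()), Counter(b.split())
    dot = sum(counts_a[tok] * counts_b[tok] for tok in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up preprocessed summaries, for illustration only.
summaries = {
    "game_a": "grand theft auto open world crime",
    "game_b": "grand theft auto stori crime citi",
    "game_c": "super mario galaxi platform",
}

# Score every other game against the query and pick the closest one.
query = "game_a"
scores = {
    name: cosine_similarity(summaries[query], text)
    for name, text in summaries.items()
    if name != query
}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```

Raw counts treat every token equally, so weighting schemes such as TF-IDF are a common refinement when frequent words like "game" dominate the summaries.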
