# Video Game Recommendation System

## Business understanding

## Data understanding

In [354]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The dataset comes from Kaggle: [Amazon US Customer Reviews Dataset](https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset?select=amazon_reviews_us_Digital_Video_Games_v1_00.tsv)

In [10]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df_raw = pd.read_csv('./data/Digital_Video_Games.tsv', sep='\t', header=0, on_bad_lines='skip')

### Columns
- **marketplace**: Two-letter country code of the marketplace where the review was written.
- **customer_id**: Random identifier that can be used to aggregate reviews written by a single author.
- **review_id**: The unique ID of the review.
- **product_id**: The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same `product_id`.
- **product_parent**: Random identifier that can be used to aggregate reviews for the same product.
- **product_title**: Title of the product.
- **product_category**: Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).
- **star_rating**: The 1-5 star rating of the review.
- **helpful_votes**: Number of helpful votes.
- **total_votes**: Number of total votes the review received.
- **vine**: Review was written as part of the Vine program.
- **verified_purchase**: The review is on a verified purchase.
- **review_headline**: The title of the review.
- **review_body**: The review text.
- **review_date**: The date the review was written.

In [11]:
df_raw.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,21269168,RSH1OZ87OYK92,B013PURRZW,603406193,Madden NFL 16 - Xbox One Digital Code,Digital_Video_Games,2,2,3,N,N,A slight improvement from last year.,I keep buying madden every year hoping they ge...,2015-08-31
1,US,133437,R1WFOQ3N9BO65I,B00F4CEHNK,341969535,Xbox Live Gift Card,Digital_Video_Games,5,0,0,N,Y,Five Stars,Awesome,2015-08-31
2,US,45765011,R3YOOS71KM5M9,B00DNHLFQA,951665344,Command & Conquer The Ultimate Collection [Ins...,Digital_Video_Games,5,0,0,N,Y,Hail to the great Yuri!,If you are prepping for the end of the world t...,2015-08-31
3,US,113118,R3R14UATT3OUFU,B004RMK5QG,395682204,Playstation Plus Subscription,Digital_Video_Games,5,0,0,N,Y,Five Stars,Perfect,2015-08-31
4,US,22151364,RV2W9SGDNQA2C,B00G9BNLQE,640460561,Saints Row IV - Enter The Dominatrix [Online G...,Digital_Video_Games,5,0,0,N,Y,Five Stars,Awesome!,2015-08-31


In [12]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144724 entries, 0 to 144723
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   marketplace        144724 non-null  object
 1   customer_id        144724 non-null  int64 
 2   review_id          144724 non-null  object
 3   product_id         144724 non-null  object
 4   product_parent     144724 non-null  int64 
 5   product_title      144724 non-null  object
 6   product_category   144724 non-null  object
 7   star_rating        144724 non-null  int64 
 8   helpful_votes      144724 non-null  int64 
 9   total_votes        144724 non-null  int64 
 10  vine               144724 non-null  object
 11  verified_purchase  144724 non-null  object
 12  review_headline    144722 non-null  object
 13  review_body        144722 non-null  object
 14  review_date        144721 non-null  object
dtypes: int64(5), object(10)
memory usage: 16.6+ MB


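The last three columns contain a few null values: `review_headline` and `review_body` are each missing 2 entries, and `review_date` is missing 3.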
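The null counts can also be read off directly instead of being inferred from `info()`. A minimal sketch on a small made-up frame (`toy` and its values are hypothetical; on the real data the equivalent call would be `df_raw.isna().sum()`):

```python
import pandas as pd

# Hypothetical toy frame standing in for the last three columns of df_raw
toy = pd.DataFrame({
    "review_headline": ["Five Stars", None, "Awesome"],
    "review_body": ["Perfect", "ok", None],
    "review_date": ["2015-08-31", None, None],
})

# Count the missing values in each column
null_counts = toy.isna().sum()
print(null_counts)
```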

In [13]:
df_raw['marketplace'].unique()

array(['US'], dtype=object)

All reviews in the dataset come from the US marketplace.

In [14]:
df_raw['customer_id'].nunique()

112891

There are 112,891 distinct customers.

In [15]:
df_raw['review_id'].nunique()

144724

The number of unique review IDs equals the number of rows (144,724), so there are no duplicate reviews in the set.

In [16]:
df_raw['product_id'].nunique()

7939

There are 7,939 distinct products, but not all of them are video games. Row 3, for example, is a "Playstation Plus Subscription", so the dataset appears to include every digitally sold product related to video games, subscriptions included. I will look for other non-game products and drop them.

In [17]:
df_raw['product_parent'].nunique()

7755

- **product_parent**: Random identifier that can be used to aggregate reviews for the same product.

This looks like just another ID, but it has 184 fewer unique values than `product_id` (7,755 vs. 7,939).

In [18]:
df_raw['product_title'].nunique()

6946

With 6,946 distinct titles I would expect the same number of `product_id` values. I wonder whether dropping everything that is not a game, such as subscriptions, would fix this discrepancy.
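One way to see where the two counts diverge is to count how many distinct IDs each title maps to. A minimal sketch, assuming a frame with `product_id` and `product_title` columns; the IDs and titles below are made up for illustration:

```python
import pandas as pd

# Hypothetical data: "Game A" is listed under two different product IDs
toy = pd.DataFrame({
    "product_id": ["ID_A1", "ID_A2", "ID_B1", "ID_A1"],
    "product_title": ["Game A", "Game A", "Game B", "Game A"],
})

# Number of distinct product IDs behind each title
ids_per_title = toy.groupby("product_title")["product_id"].nunique()

# Titles that are shared by more than one product ID
shared_titles = ids_per_title[ids_per_title > 1]
print(shared_titles)
```

On the real data the same two lines would run against `df_raw`, and the titles in `shared_titles` would be the ones responsible for the gap.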

In [19]:
df_raw.iloc[3]['product_title']

'Playstation Plus Subscription'

The title doesn't specify whether the subscription is for 1, 3 or 12 months.

<br>

Below I use keywords to look for products that are not games, e.g.:
- Subscription
- Month
- Membership
- Card
- Network
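The keyword searches in the following cells can also be run in a single pass. This is only a sketch: `titles` holds a handful of titles copied from the outputs in this notebook, and on the real data it would be `df_raw['product_title']`.

```python
import pandas as pd

# A few sample titles taken from this notebook's outputs
titles = pd.Series([
    "Playstation Plus Subscription",
    "Xbox Live Gift Card",
    "Hoyle Card Games [Download]",
    "Madden NFL 16 - Xbox One Digital Code",
    "Playstation Network Card",
])

keywords = ["Subscription", "Month", "Membership", "Card", "Network"]

# Map each keyword to the distinct titles that contain it (plain substring match)
matches = {
    kw: sorted(titles[titles.str.contains(kw, regex=False)].unique())
    for kw in keywords
}

for kw, found in matches.items():
    print(kw, found)
```

Note that "Card" matches both gift cards and an actual card game, which is why that keyword needs separate handling.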



In [23]:
Game_subs = df_raw[df_raw['product_title'].str.contains('Subscription')]
Game_subs['product_title'].unique()

array(['Playstation Plus Subscription', 'Xbox Live Subscription',
       '1 Month Subscription: EVE Online [Instant Access]',
       'PlayStation Now Subscription Twister Parent',
       'Webkinz 12 Month Deluxe Subscription [Online Game Code]'],
      dtype=object)

In [24]:
Sub_months = df_raw[df_raw['product_title'].str.contains('Month')]
Sub_months['product_title'].unique()

array(['1 Month Subscription: EVE Online [Instant Access]',
       'Xbox Live 24 Month Gold Membership - Xbox 360 Digital Code',
       'Disney Club Penguin 12 Month Membership Code with Free Amazon Exclusive Bonus Item',
       'Xbox Live 3-Month Gold + $100 Xbox Gift Card - Xbox 360 Digital Code',
       'Xbox Live 3-Month Gold + $5 Xbox Gift Card - Xbox 360 Digital Code',
       'Webkinz 12 Month Deluxe Subscription [Online Game Code]',
       'Xbox 12 Month Gold for Black Ops II - Xbox 360 Digital Code',
       '3-Month PS Plus + $10 PS Gift Card - PS3 / PS4 [Digital Code]',
       '3-Month PS Plus + $20 PS Gift Card - PS3 / PS4 [Digital Code]',
       'Xbox LIVE 12 Month Gold for Call of Duty: Ghost - Xbox 360 Digital Code',
       'Curse Premium - 1 Month [Download]',
       'GameBattles Premium Access - 3 Months [Online Game Code]',
       '3-Month PS Plus + $50 PS Gift Card - PS3 / PS4 [Digital Code]',
       'Xbox Live 12-Month Gold + $25 Xbox Gift Card - Xbox 360 Digital Code

In [25]:
Sub_Membership = df_raw[df_raw['product_title'].str.contains('Membership')]
Sub_Membership['product_title'].unique()

array(['Xbox Live 24 Month Gold Membership - Xbox 360 Digital Code',
       'Disney Club Penguin 12 Month Membership Code with Free Amazon Exclusive Bonus Item',
       'Disney Club Penguin Membership',
       'Xbox Live 3 month Gold Membership + 1 bonus month [Online Game Code]',
       '1 Year Membership: AdventureQuest Worlds [Instant Access]',
       '3 Month Membership: AdventureQuest Worlds [Instant Access]',
       'Xbox LIVE 12 Month Gold Membership + 1 Bonus Month - Xbox 360 Digital Code',
       '6 Month Membership: AdventureQuest Worlds [Instant Access]'],
      dtype=object)

In [26]:
Sub_Card = df_raw[df_raw['product_title'].str.contains('Card')]
Sub_Card['product_title'].unique()

array(['Xbox Live Gift Card', 'Playstation Network Card',
       'Xbox 360 Live Points Card', 'Grand Theft Auto V Cash Cards',
       'Hoyle Card Games 2012 AMR',
       'Final Fantasy XIV Online: 60 Day Time Card [Online Game Code]',
       'Hoyle Card Games  [Download]',
       'Xbox $5 Gift Card - Xbox 360 Digital Code',
       "Hoyle Kid's Card Games [Download]",
       'Legends of Solitaire: The Lost Cards [Download]',
       'Xbox $15 Gift Card (Call of Duty Ghosts:\xa0Onslaught DLC) - Xbox 360 Digital Code',
       'Xbox Live 3-Month Gold + $100 Xbox Gift Card - Xbox 360 Digital Code',
       'Xbox Live 3-Month Gold + $5 Xbox Gift Card - Xbox 360 Digital Code',
       'Tripeaks Solitaire Multi (Tripeaks with Multiple Card-Sets and Multiple Layouts) [Download]',
       'Reel Deal Card Games 2011 [Download]',
       '2,013 Card, Mahjongg & Solitaire Games [Download]',
       'Hoyle Card Games [Mac Download]',
       'Sony Playstation Network Card - $10 [Online Game Code]',
       

In [27]:
Sub_Network = df_raw[df_raw['product_title'].str.contains('Network')]
Sub_Network['product_title'].unique()

array(['Playstation Network Card',
       'Sony Playstation Network Card - $10 [Online Game Code]'],
      dtype=object)

In [28]:
df_raw['product_category'].nunique()

1

In [380]:
df_raw['star_rating'].unique()

array([2, 5, 4, 1, 3])

In [372]:
df_raw['review_date'] = pd.to_datetime(df_raw['review_date'])
df_raw['review_date']

0        2015-08-31
1        2015-08-31
2        2015-08-31
3        2015-08-31
4        2015-08-31
            ...    
144719   2008-12-25
144720   2008-12-24
144721   2008-09-10
144722   2008-09-01
144723   2006-08-08
Name: review_date, Length: 144724, dtype: datetime64[ns]

The reviews span 2006 to 2015.
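The range can be checked explicitly rather than read from the truncated output. A small sketch with made-up values (`dates` is hypothetical; on the real data it would be `df_raw['review_date']` after the `pd.to_datetime` call):

```python
import pandas as pd

# Hypothetical sample, including one missing date like those in the real column
dates = pd.to_datetime(pd.Series(["2015-08-31", "2008-12-25", None, "2006-08-08"]))

# min() and max() skip missing values (NaT)
first, last = dates.min(), dates.max()
missing = dates.isna().sum()
print(first.date(), last.date(), missing)
```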

## Data Preparation

Dropping all rows whose `product_title` contains any of the strings Subscription, Month, Membership or Network.

In [413]:
df_raw_noSub = df_raw[~df_raw["product_title"].str.contains("Subscription|Month|month|Membership|Network|Season Pass|Xbox Music Pass|Virtual Currency")]

Season Pass, Xbox Music Pass and Virtual Currency are other kinds of non-game items in the dataset. I added them to the pattern above as I kept finding more during the process.
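Since the list of terms kept growing, it may be easier to maintain as a Python list that is joined into one pattern. The sketch below uses the standard `re` module on a few titles from this notebook; matching case-insensitively is my own assumption (it replaces the separate `Month|month` entries) and could also catch legitimate games whose titles contain one of these words.

```python
import re

# Terms that mark a product as something other than a game
non_game_terms = [
    "Subscription", "Month", "Membership", "Network",
    "Season Pass", "Xbox Music Pass", "Virtual Currency",
]

# Escape each term and join with "|" so any one of them triggers a match
pattern = re.compile("|".join(re.escape(t) for t in non_game_terms), flags=re.IGNORECASE)

titles = [
    "Xbox Live 3 month Gold Membership + 1 bonus month [Online Game Code]",
    "Curse Premium - 1 Month [Download]",
    "007 Legends [Download]",
    "Playstation Network Card",
]

# Keep only the titles that match none of the terms
games = [t for t in titles if not pattern.search(t)]
print(games)
```

With pandas, the same pattern string could be passed to `str.contains(pattern.pattern, case=False)`.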

Titles containing the string 'Card' need more care, since some games include it in their name.

In [414]:
Sub_Card2 = df_raw_noSub[df_raw_noSub['product_title'].str.contains('Card')]
Sub_Card2['product_title'].unique()

array(['Xbox Live Gift Card', 'Xbox 360 Live Points Card',
       'Grand Theft Auto V Cash Cards', 'Hoyle Card Games 2012 AMR',
       'Final Fantasy XIV Online: 60 Day Time Card [Online Game Code]',
       'Hoyle Card Games  [Download]',
       'Xbox $5 Gift Card - Xbox 360 Digital Code',
       "Hoyle Kid's Card Games [Download]",
       'Legends of Solitaire: The Lost Cards [Download]',
       'Xbox $15 Gift Card (Call of Duty Ghosts:\xa0Onslaught DLC) - Xbox 360 Digital Code',
       'Tripeaks Solitaire Multi (Tripeaks with Multiple Card-Sets and Multiple Layouts) [Download]',
       'Reel Deal Card Games 2011 [Download]',
       '2,013 Card, Mahjongg & Solitaire Games [Download]',
       'Hoyle Card Games [Mac Download]',
       '1-Year PS Plus + $10 PS Gift Card - PS3 / PS4 [Digital Code]',
       'Five Card Deluxe [Download]',
       '1-Year PS Plus + $50 PS Gift Card - PS3 / PS4 [Digital Code]',
       'Xbox Live $6 Gift Card - Xbox 360 Digital Code',
       '1-Year PS Plus + $

First I will drop rows where 'Gift Card' is found, then continue with other common word combinations such as 'Xbox Live' and 'Live Points Card'.

In [415]:
df_raw_noSub2 = df_raw_noSub[~df_raw_noSub["product_title"].str.contains("Gift Card|Xbox Live|Live Points Card|Xbox Live Gift Card|Grand Theft Auto V Cash Cards")]

In [416]:
Sub_Card3 = df_raw_noSub2[df_raw_noSub2['product_title'].str.contains('Card')]
Sub_Card3['product_title'].unique()

array(['Hoyle Card Games 2012 AMR',
       'Final Fantasy XIV Online: 60 Day Time Card [Online Game Code]',
       'Hoyle Card Games  [Download]',
       "Hoyle Kid's Card Games [Download]",
       'Legends of Solitaire: The Lost Cards [Download]',
       'Tripeaks Solitaire Multi (Tripeaks with Multiple Card-Sets and Multiple Layouts) [Download]',
       'Reel Deal Card Games 2011 [Download]',
       '2,013 Card, Mahjongg & Solitaire Games [Download]',
       'Hoyle Card Games [Mac Download]', 'Five Card Deluxe [Download]',
       "King's Collection: 6 Classic Card Games",
       '8 Card Game Pack [Download]', '5 Realms of Cards [Download]',
       '5 Card Slingo [Download]', 'Card Crazy! [Download]',
       'Rift 30 Day Game Time Card [Online Game Code]',
       'Bicylce Family Card Games [Download]',
       'Rift 60 Day Game Time Card [Online Game Code]',
       'High Stakes Poker: Connelly Card Club [Online Game Code]',
       'Strange Cases: The Tarot Card Mystery [Download]'], dt

Most of the remaining items with 'Card' in the title are actual games, although a few game time cards (for Final Fantasy XIV Online and Rift) are still in the list.

<br>

Let's look for DLCs.

In [417]:
DLC = df_raw_noSub2[df_raw_noSub2['product_title'].str.contains('DLC')]
DLC['product_title'].nunique()

212

In [418]:
len(df_raw_noSub2[df_raw_noSub2['product_title'].str.contains('DLC')])

1698

We have 212 different DLCs with 1,698 reviews between them. A copy of the base game is required to run a DLC, so I will drop these too. The filter below also matches titles containing 'Pack'.

In [419]:
df_raw_noSub_no_DLC = df_raw_noSub2[~df_raw_noSub2["product_title"].str.contains("DLC|Pack")]

In [420]:
df_raw_noSub_no_DLC['product_title'].nunique()

6206

In [421]:
df_raw_noSub_no_DLC['product_id'].nunique()

7081

The number of product IDs (7,081) still doesn't match the number of product titles (6,206).

In [422]:
sorted(df_raw_noSub_no_DLC['product_title'].unique())

['007 Legends [Download]',
 '1 Moment Of Time: Silentville [Download]',
 '1 PLEX: EVE Online [Instant Access]',
 '1 Penguin 100 Cases [Download]',
 '1 vs 100 [Download]',
 '10 Talismans [Download]',
 '100 % Hidden Objects 2 [Download]',
 '100% Hidden Object (Mac) [Download]',
 '100% Hidden Objects',
 '1001 Japanese Crosswords',
 '1001 Kidz Games [Download]',
 '1001 Mini-Golf Challenge [Download]',
 '1001 Nights: The Adventures of Sindbad',
 '1001 Tangram Puzzles',
 '101 - in - 1 Megamix [Online Game Code]',
 '1080° Snowboarding [Online Game Code]',
 '12 Labours of Hercules 3: Girl Power [Download]',
 '12 Labours of Hercules II: The Cretan Bull [Download]',
 '12 Labours of Hercules [Download]',
 '12 PLEX: EVE Online [Instant Access]',
 '12000 AC plus Game Booster Addon: AdventureQuest Worlds [Game Connect]',
 '15 Square Slider Puzzle [Download]',
 '16 Bit Arena [Download]',
 '18 Wheels of Steel American Long Haul [Download]',
 '18 Wheels of Steel American Long Haul [Online Game Code]',


Some titles include a bracketed tag after the name that describes how buyers access the game, for example:

<br>

`18 Wheels of Steel American Long Haul [Download]`

`18 Wheels of Steel American Long Haul [Online Game Code]`

<br>

Definitely the same game, just a different "delivery" method.


In [423]:
df_raw_noSub_no_DLC[df_raw_noSub_no_DLC['product_title'].str.contains('18 Wheels of Steel American Long Haul')].head(3)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
39224,US,36420328,R2LOZS1XJTWZNN,B004GHNG0E,614285704,18 Wheels of Steel American Long Haul [Download],Digital_Video_Games,4,0,0,N,Y,Four Stars,ok,2014-10-15
50242,US,17845931,R2TJLHZ9X7EG83,B004GHNG0E,614285704,18 Wheels of Steel American Long Haul [Download],Digital_Video_Games,4,0,0,N,Y,Four Stars,Like driving trucks.,2014-07-14
74152,US,21374226,R388BHLY4GIU5E,B00CLVZGPK,775610170,18 Wheels of Steel American Long Haul [Online ...,Digital_Video_Games,1,1,1,N,Y,"Doesn't work, no manufacturer support",The code I got does not work. Amazon referred ...,2013-12-29



<br>

The `[Download]` edition of the game (first two rows above) has a different `product_id` from its `[Online Game Code]` edition (third row above): it's the same game, tagged with a different ID.

<br>

I find it easier to modify the name than the ID, and I need to group those reviews together. The delivery method doesn't change how the software performs, so a review of one edition is still relevant for another.


At first sight, `[Instant Access]`, `[Download]`, `[Digital Code]`, `[Game Connect]` and `[Online Game Code]` seem to be the most frequent tags at the end of game titles. If I remove them from the title, I can use the cleaned titles in the recommender system instead of the ID, which was my first thought.
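The five tags could also be removed with one regular expression instead of one replacement per tag. This is a sketch using the standard `re` module, not the approach taken in the cells that follow; `TAG_RE` and `strip_delivery_tag` are names introduced here for illustration, and anchoring the pattern at the end of the title is my own choice.

```python
import re

# The delivery tags listed above, matched only at the very end of a title
TAG_RE = re.compile(
    r"\s*\[(?:Instant Access|Download|Digital Code|Game Connect|Online Game Code)\]\s*$"
)

def strip_delivery_tag(title: str) -> str:
    """Return the title without its trailing delivery tag, if it has one."""
    return TAG_RE.sub("", title)

for t in [
    "18 Wheels of Steel American Long Haul [Download]",
    "18 Wheels of Steel American Long Haul [Online Game Code]",
    "100% Hidden Objects",
]:
    print(strip_delivery_tag(t))
```

Both editions of "18 Wheels of Steel American Long Haul" collapse to the same string, which is the grouping needed here. In pandas, `Series.str.replace` accepts a compiled pattern when `regex=True`, so the same `TAG_RE` could be applied to the whole column.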

In [424]:
# regex=True is required on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['product_title'].str.replace(r" \[Instant Access\]", "", regex=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [425]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(r" \[Download\]", "", regex=True)


In [426]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(r" \[Game Connect\]", "", regex=True)


In [427]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(r" \[Online Game Code\]", "", regex=True)


In [428]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(r" \[Digital Code\]", "", regex=True)


The first title (index 0) has 'Xbox One Digital Code' in its name. I will check whether more games carry this substring, or a similar one for a platform other than Xbox.

In [430]:
df_raw_noSub_no_DLC[df_raw_noSub_no_DLC['product_title'].str.contains('Xbox One Digital Code')]

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,clean_title
0,US,21269168,RSH1OZ87OYK92,B013PURRZW,603406193,Madden NFL 16 - Xbox One Digital Code,Digital_Video_Games,2,2,3,N,N,A slight improvement from last year.,I keep buying madden every year hoping they ge...,2015-08-31,Madden NFL 16 - Xbox One Digital Code
104,US,36076198,RL9RFGWHGPJGO,B00TFVDR32,479126001,Ori and the Blind Forest - Xbox One Digital Code,Digital_Video_Games,5,0,0,N,Y,Exactly what the reviews said it would be,I bought this game after reading all the revie...,2015-08-31,Ori and the Blind Forest - Xbox One Digital Code
159,US,51447735,R253MSTC5ECVRS,B00TFVDR32,479126001,Ori and the Blind Forest - Xbox One Digital Code,Digital_Video_Games,5,0,0,N,Y,FANTASTIC!,I'd say it took me less than a week to finish ...,2015-08-30,Ori and the Blind Forest - Xbox One Digital Code
175,US,14884528,RWQI661DFFIF3,B00R6HA3XY,608413618,Child of Light - Xbox One Digital Code,Digital_Video_Games,5,3,3,N,Y,Great game - storybook gameplay and dreamy gra...,This is one of the best games I have played in...,2015-08-30,Child of Light - Xbox One Digital Code
263,US,2919462,R1XT34JC176BLS,B012P5WRQM,63259150,Gears of War: Ultimate Edition Deluxe Version ...,Digital_Video_Games,5,0,2,N,Y,Works perfectly,I bought the game and Amazon gave me the code ...,2015-08-29,Gears of War: Ultimate Edition Deluxe Version ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41070,US,39846241,RUUJU7BAQUPBU,B00NMO0IA8,567442793,Minecraft - Xbox One Digital Code,Digital_Video_Games,5,0,1,N,N,Games bomb,"Fun game, I like this better then the laptop c...",2014-09-30,Minecraft - Xbox One Digital Code
41122,US,44156652,R3P036FJ9S6P2A,B00NMO0IA8,567442793,Minecraft - Xbox One Digital Code,Digital_Video_Games,5,5,5,N,N,This code variant is perfect for anyone who wa...,Minecraft for the Xbox One is just amazing. It...,2014-09-29,Minecraft - Xbox One Digital Code
41279,US,13709778,RP50H0ZMHSWP7,B00NMO0IA8,567442793,Minecraft - Xbox One Digital Code,Digital_Video_Games,5,4,5,N,N,My kifs love it,This is a great version of Minecraft. Once yo...,2014-09-28,Minecraft - Xbox One Digital Code
41563,US,33074509,R1LVHPFL52JKSG,B00NMO0IA8,567442793,Minecraft - Xbox One Digital Code,Digital_Video_Games,1,3,24,N,Y,"Buggy right out of the ""gates""",The game freezes my son's Xbox1 console upon s...,2014-09-25,Minecraft - Xbox One Digital Code


Below I remove the substring using the same method as before.

In [431]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(" - Xbox One Digital Code", "")



I found the same suffix for the Xbox 360 console too.

In [450]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(" - Xbox 360 Digital Code", "")



Now I will look for something similar on a different platform.

In [451]:
df_raw_noSub_no_DLC[df_raw_noSub_no_DLC['product_title'].str.contains('PS4')]

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,clean_title
57,US,9095823,R2642823GCWLND,B011WI7U0G,864854550,Batman: Arkham Knight: A Matter Of Family - PS...,Digital_Video_Games,3,1,1,N,Y,Three Stars,It's easily better than the Harley Quinn dlc a...,2015-08-31,Batman: Arkham Knight: A Matter Of Family
76,US,51688000,R34Y0NX6RHTWH1,B012PRO97A,56796613,Sword Art Online Re: Hollow Fragment - PS4 [Di...,Digital_Video_Games,5,1,2,N,Y,Great gift option,Received code in seconds. No issues with the ...,2015-08-31,Sword Art Online Re: Hollow Fragment
81,US,17746031,R2BYOTSDJJY65O,B00DRKJBC8,29664234,Final Fantasy VII - PS4 [Digital Code],Digital_Video_Games,4,0,0,N,Y,An amazing game,"First of all, I never played the original Play...",2015-08-31,Final Fantasy VII
97,US,28034012,R2GFH6IGSW33MP,B00GMPJKDA,806044015,Trine 2: Complete Story - PS4 [Digital Code],Digital_Video_Games,4,1,1,N,Y,Four Stars,"Awesome brain candy, visually beautiful and in...",2015-08-31,Trine 2: Complete Story
122,US,3978884,R1GQER9Z4SUO2A,B00JAPIV84,841717561,Dead Nation Apocalypse Edition - PS4 [Digital ...,Digital_Video_Games,5,1,1,N,Y,Five Stars,very good,2015-08-30,Dead Nation Apocalypse Edition
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87719,US,44206257,R28CULMVKI3STF,B00DRKJBC8,29664234,Final Fantasy VII - PS4 [Digital Code],Digital_Video_Games,4,2,5,N,Y,I can't believe I found it,I have been a huge fan of FF since the first o...,2013-08-20,Final Fantasy VII
88499,US,18170799,R1WK8YSYXTH0ZW,B00DRKJBC8,29664234,Final Fantasy VII - PS4 [Digital Code],Digital_Video_Games,5,1,1,N,Y,Awesome,"I used to play this when I was younger, and is...",2013-08-13,Final Fantasy VII
89606,US,30522361,R2JM2FL74BQWUE,B00DRKJBC8,29664234,Final Fantasy VII - PS4 [Digital Code],Digital_Video_Games,3,1,19,N,Y,Best RPG ever!,The game is amazing and gets 5 stars. The down...,2013-08-03,Final Fantasy VII
90688,US,17058574,R1K7X74SN65PV4,B00DRKJBC8,29664234,Final Fantasy VII - PS4 [Digital Code],Digital_Video_Games,4,40,47,N,N,(Almost) Exactly as you remember it,...For better or worse.<br /><br />This is in ...,2013-07-24,Final Fantasy VII


I found `' - PS4'`, `' - PS3'`, `' - PS Vita / PS4 / PS3'` and `' - PS Vita'`.
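The order in which these suffixes are removed matters, because `' - PS Vita'` is a prefix of `' - PS Vita / PS4 / PS3'`. The sketch below shows the problem and one alternative: a single pattern anchored at the end of the title. `PLATFORM_RE`, `strip_platform` and the "Example Game" title are hypothetical names used only for illustration.

```python
import re

# Platform suffixes found so far, matched only at the end of the title
PLATFORM_RE = re.compile(
    r" - (?:PS Vita / PS4 / PS3|PS Vita|PS3|PS4"
    r"|Xbox One Digital Code|Xbox 360 Digital Code)$"
)

def strip_platform(title: str) -> str:
    """Return the title without its trailing platform suffix, if it has one."""
    return PLATFORM_RE.sub("", title)

# Removing the short suffix first leaves debris behind
broken = "Example Game - PS Vita / PS4 / PS3".replace(" - PS Vita", "")
print(repr(broken))

for t in [
    "Example Game - PS Vita / PS4 / PS3",
    "Final Fantasy VII - PS4",
    "Minecraft - Xbox One Digital Code",
]:
    print(strip_platform(t))
```

This assumes the bracketed delivery tag has already been removed, so that the platform suffix sits at the very end of the string.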

In [452]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(" - PS Vita / PS4 / PS3", "")



In [453]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(" - PS Vita", "")



In [454]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(" - PS3", "")



In [455]:
df_raw_noSub_no_DLC['clean_title'] = df_raw_noSub_no_DLC['clean_title'].str.replace(" - PS4", "")



### Clean DF

In [456]:
Df_clean = df_raw_noSub_no_DLC[['customer_id', 'clean_title', 'star_rating']]

In [457]:
Df_clean.reset_index(drop=True, inplace=True)

In [458]:
Df_clean['customer_id'].nunique()

76961

In [459]:
Df_clean['clean_title'].nunique()

6006

In [460]:
!ls data

Digital_Video_Games.tsv  GameRatings.csv          meta_Video_Games.json.gz


In [469]:
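# index=False keeps the row index out of the file, so it holds only the three rating columns
Df_clean.to_csv('./data/GameRatings.csv', index=False)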

In [470]:
Ratings = pd.read_csv('./data/GameRatings.csv')

In [404]:
Ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101228 entries, 0 to 101227
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   customer_id  101228 non-null  int64 
 1   clean_title  101228 non-null  object
 2   star_rating  101228 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 2.3+ MB


In [462]:
from surprise import Dataset, Reader

In [465]:
reader = Reader(line_format='user item rating',
                sep=',', skip_lines=1)

In [466]:
data = Dataset.load_from_file('./data/GameRatings.csv', reader=reader)

ValueError: could not convert string to float: 'She Wrote 2: Return to Cabot Cove"'
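A likely cause of this error, judging from the message rather than from inspecting the file, is that some titles contain commas. A CSV writer puts such titles in quotes, but a parser that simply splits each line on `,` cuts the title in two and then tries to read the second half as the rating. The sketch below reproduces the effect with the standard `csv` module; the customer ID and title are made up.

```python
import csv
import io

# Hypothetical row whose title contains a comma, written the way a CSV writer quotes it
buffer = io.StringIO()
csv.writer(buffer).writerow([12345, "Some Game, Part 2: The Sequel", 5])
line = buffer.getvalue().strip()
print(line)

# Naive splitting yields four fields; the third is a title fragment, not a rating
naive_fields = line.split(",")
print(naive_fields)

# A quote-aware parser recovers the original three fields
parsed_fields = next(csv.reader([line]))
print(parsed_fields)
```

Surprise also provides `Dataset.load_from_df`, which takes a DataFrame with user, item and rating columns together with a `Reader(rating_scale=(1, 5))`, so the ratings could be loaded straight from `Ratings` without any text parsing.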

## Modeling

I will train models using the collaborative filtering approach. For a first simple model I picked a memory-based one.

Memory-based models calculate the similarities between users / items based on user-item rating pairs.
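To make the memory-based idea concrete, here is a minimal user-based sketch in plain Python. It is an illustration under simplifying assumptions, not the model trained in this notebook: the users and ratings are made up, similarity is cosine similarity over the games both users rated, and the prediction is a similarity-weighted average.

```python
from math import sqrt

# Hypothetical ratings: user -> {game: stars}
ratings = {
    "u1": {"Minecraft": 5, "Final Fantasy VII": 4},
    "u2": {"Minecraft": 4, "Final Fantasy VII": 5, "Child of Light": 5},
    "u3": {"Minecraft": 1, "Child of Light": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the games rated by both users."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[g] * b[g] for g in common)
    norm_a = sqrt(sum(a[g] ** 2 for g in common))
    norm_b = sqrt(sum(b[g] ** 2 for g in common))
    return dot / (norm_a * norm_b)

def predict(user, game):
    """Similarity-weighted average of the other users' ratings for `game`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or game not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[game]
        den += sim
    return num / den if den else None

print(predict("u1", "Child of Light"))
```

One weakness is visible even in this toy data: two users with a single game in common always get a similarity of 1, however different their ratings are. With most customers in this dataset having written very few reviews, that is worth keeping in mind.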


Model-based models (admittedly, a weird name) use some sort of machine learning algorithm to estimate the ratings. A typical example is singular value decomposition of the user-item ratings matrix.
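The decomposition idea can be sketched with NumPy on a tiny made-up matrix. This is a simplification: unrated cells are filled with 0 here, whereas recommender libraries typically fit the latent factors on the observed ratings only, so the numbers below only illustrate the low-rank approximation itself.

```python
import numpy as np

# Hypothetical user x game rating matrix; 0 stands in for "not rated"
R = np.array([
    [5, 4, 0],
    [4, 5, 5],
    [1, 0, 2],
], dtype=float)

# Full decomposition: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k largest singular values to get a rank-k approximation
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(R_k, 2))
```

The entries of `R_k` in positions that were 0 in `R` play the role of estimated ratings in this toy setting.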

In [None]:
from surprise.model_selection import train_test_split

In [None]:
# `data` is the Surprise dataset loaded in the Data Preparation section
trainset, testset = train_test_split(data, test_size=0.2)

In [None]:
import surprise
from surprise.prediction_algorithms import *

In [None]:
from surprise import Dataset, SVD
from surprise.model_selection import cross_validate, GridSearchCV

## Evaluation

In [None]:
# TODO: make predictions

## Deployment

## Metadata

In [2]:
!ls data


Digital_Video_Games.tsv  meta_Video_Games.json.gz


In [3]:
import gzip

In [6]:
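import ast

def parse(path):
  # Each line is a Python dict literal; ast.literal_eval parses it without executing arbitrary code, unlike eval
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for l in g:
      yield ast.literal_eval(l)

def getDF(path):
  # Key each parsed record by its line position, then build the DataFrame row-wise
  df = {i: d for i, d in enumerate(parse(path))}
  return pd.DataFrame.from_dict(df, orient='index')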

meta_df = getDF('./data/meta_Video_Games.json.gz')