## Code to pull reviews for different products
* Import the data
    * Appliances
    * Automotive
    * CDs_and_Vinyl
    * Digital_Music
    * Gift_Cards
    * Handmade_Products
    * Musical_Instruments
    * Video_Games
* Filter the reviews I want to keep (quantity, rating balance, availability of features for XGBoost [e.g. price, ...])
    * Don't forget to set the random seed
* Save the data
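
The filtering step listed above (quantity, rating balance, fixed seed) could be sketched as follows. This is a minimal sketch on toy data, not the filter actually used in this notebook: the helper name `balanced_sample`, the `n_per_rating` parameter, and the seed value 42 are assumptions made for illustration.

```python
import pandas as pd

SEED = 42  # assumed value; any fixed integer makes the sample reproducible


def balanced_sample(df, n_per_rating, seed=SEED):
    """Draw at most n_per_rating reviews for each rating value."""
    parts = []
    for _, group in df.groupby("rating"):
        # Cap at the group size so rare ratings do not raise an error
        n = min(n_per_rating, len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).sort_index()


# Toy data: ratings are heavily skewed toward 5 stars
toy = pd.DataFrame({
    "rating": [5.0] * 6 + [1.0] * 2 + [3.0] * 3,
    "full_text": [f"review {i}" for i in range(11)],
})
sample = balanced_sample(toy, n_per_rating=2)
print(sample["rating"].value_counts().sort_index())
```

Passing the same seed to every `sample` call keeps the selection identical from one run to the next, which is the point of the "random seed" reminder.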

## Data pipeline


In [2]:
### Base packages
import numpy as np
import pandas as pd
from datasets import load_dataset

### Video Games
#### Import

In [3]:
## Load User Reviews
dataset_reviews = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Video_Games", split="full", trust_remote_code=True)

## Load Item Metadata (Video_Games)
dataset_items = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Video_Games", split="full", trust_remote_code=True)


#### Reviews

In [4]:
## Convert to a pandas DataFrame (more convenient for some manipulation and EDA)
df_reviews = pd.DataFrame(dataset_reviews)

In [5]:
### Data manipulation
## Concat title and text
df_reviews['full_text'] = "Title : " + df_reviews['title'].astype(str) + "\n Review : " + df_reviews['text'].astype(str)

## Create variable as_image (1 if the review has at least one image, else 0)
df_reviews['as_image'] = np.where(df_reviews['images'].str.len() == 0, 0, 1)

## Create variable as_helpful_vote (1 if the review has at least one helpful vote, else 0)
df_reviews['as_helpful_vote'] = np.where(df_reviews['helpful_vote'] == 0, 0, 1)
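
To make the two indicator variables concrete, here is a small sketch on made-up rows. Only the column names come from the cell above; the toy values and the `url` key are placeholders, not the dataset's actual content. It relies on `.str.len()` returning the number of elements of each list in `images`, so an empty list gives 0.

```python
import numpy as np
import pandas as pd

# Toy rows: the dict key "url" is a placeholder, not the dataset's real schema
toy = pd.DataFrame({
    "images": [[], [{"url": "a.jpg"}], []],
    "helpful_vote": [0, 3, 1],
})

# .str.len() works element-wise on lists, so an empty list has length 0
toy["as_image"] = np.where(toy["images"].str.len() == 0, 0, 1)
toy["as_helpful_vote"] = np.where(toy["helpful_vote"] == 0, 0, 1)
print(toy[["as_image", "as_helpful_vote"]])
```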

#### Items

In [6]:
## Convert to pandas df
df_items = pd.DataFrame(dataset_items)

In [8]:
## Main category (filter to keep only "Video Games")
print(df_items['main_category'].value_counts())
df_items_cat = df_items[df_items['main_category'] == "Video Games"]
print(df_items_cat['main_category'].value_counts())

Video Games                     81255
Computers                       17235
All Electronics                 14816
Cell Phones & Accessories        3884
Toys & Games                     2733
Software                         1511
Industrial & Scientific          1079
Amazon Home                       737
Home Audio & Theater              443
Tools & Home Improvement          369
Office Products                   295
Sports & Outdoors                 244
Buy a Kindle                      220
Movies & TV                       197
Books                             196
Musical Instruments               154
All Beauty                        126
Camera & Photo                    117
Portable Audio & Accessories      112
Digital Music                     104
Health & Personal Care             95
Automotive                         85
AMAZON FASHION                     54
Pet Supplies                       38
Grocery                            36
Baby                               26
Arts, Crafts

In [9]:
## Keep only the products that have a price
print(df_items_cat['price'].count())
print(df_items_cat[df_items_cat['price'] != 'None']['price'].count())
## .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
df_items_price = df_items_cat[df_items_cat['price'] != 'None'].copy()
df_items_price['price'].astype(float).describe()

81255
38580


count    38580.000000
mean        51.174594
std         89.392353
min          0.000000
25%         14.950000
50%         28.000000
75%         50.090000
max       3359.990000
Name: price, dtype: float64
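
The comparison against the string `'None'` works here because every remaining price can be cast to float. As a hedged alternative, assuming prices arrive as strings, `pd.to_numeric` with `errors='coerce'` turns anything that does not parse into `NaN`, which can then be dropped. The toy values below, including the unparseable `"from 12.00"`, are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy prices: two valid numbers, the 'None' marker, and a made-up bad value
toy = pd.DataFrame({"price": ["25.95", "None", "7.32", "from 12.00"]})

# Unparseable strings become NaN instead of raising an error
toy["price_num"] = pd.to_numeric(toy["price"], errors="coerce")

# Keep only rows with a usable numeric price
priced = toy.dropna(subset=["price_num"]).copy()
print(priced)
```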

In [10]:
## Overview of the categories
print(df_items_price['categories'].value_counts())


[Video Games, PC, Games]                                                                             5994
[]                                                                                                   2904
[Video Games, PlayStation 4, Games]                                                                  2015
[Video Games, Legacy Systems, PlayStation Systems, PlayStation 2, Games]                             1700
[Video Games, Nintendo Switch, Games]                                                                1516
                                                                                                     ... 
[Video Games, Legacy Systems, Xbox Systems, Xbox 360, Accessories, Thumb Grips]                         1
[Video Games, Legacy Systems, Nintendo Systems, Wii, Accessories, Cooling Systems]                      1
[Video Games, Legacy Systems, PlayStation Systems, Sony PSP, Accessories, Cables & Adapters]            1
[Video Games, Legacy Systems, PlayStation Syst

In [11]:
## Isolate the most specific category (last element of the list, None if the list is empty)
df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None])[-1])
#df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None, None])[1])

## Overview of the options
print(df_items_price['categories_single'].value_counts())

## To group
# Parts & Accessories
# Other


Games                                         24465
Accessories                                    2293
Consoles                                        983
Cases & Storage                                 715
Controllers                                     616
                                              ...  
2023 Most Anticipated                             1
Deals on Nominations 12/18/2022-12/23/2022        1
Bonus Offers - Up to $20 Credit                   1
Music Controllers                                 1
Commodore 64                                      1
Name: categories_single, Length: 136, dtype: int64




In [12]:
## Group the categories
df_items_price['categories_grp'] = np.where(df_items_price['categories_single'] == "Games", "Games", "Accessories")

df_items_price['categories_grp'].value_counts()



Games          24465
Accessories    14115
Name: categories_grp, dtype: int64
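
The two category steps can be checked on a few toy rows. This is a sketch under the assumption that `categories` holds plain Python lists; the three example lists are copied from the overview output earlier in the notebook. One consequence of the logic is that an item with an empty category list gets a missing last category and therefore falls into "Accessories".

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "categories": [
        ["Video Games", "PC", "Games"],
        [],
        ["Video Games", "Legacy Systems", "Xbox Systems", "Xbox 360",
         "Accessories", "Thumb Grips"],
    ]
})

# (x or [None])[-1] returns the last element, or None for an empty list
toy["categories_single"] = toy["categories"].apply(lambda x: (x or [None])[-1])

# Everything that is not exactly "Games" is grouped under "Accessories"
toy["categories_grp"] = np.where(
    toy["categories_single"] == "Games", "Games", "Accessories"
)
print(toy[["categories_single", "categories_grp"]])
```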

In [11]:
### Candidate features for XGBoost
## Reviews
# Create an image-presence variable (as_image)
# verified_purchase
# helpful_vote (as_helpful_vote)

## Items
# Certain: main_category, average_rating, rating_number, price
# categories_grp
# Potential:
#   XTitle? Anything to do with this?
#   *store name? Anything to do with this?
#   **categories (potentially something to do with this!) (extract from the dictionary!)
# rating_number (number of ratings behind the average)


#### Merge Items to Reviews

In [13]:
# Keep only necessary variables before merging
df_reviews_f = df_reviews[['rating', 'full_text', 'as_image', 'parent_asin', 'as_helpful_vote', 'helpful_vote', 'verified_purchase']]
df_items_f = df_items_price[['main_category', 'average_rating', 'rating_number', 'price', 'parent_asin', 'categories_grp']]

# Merge Items on Reviews
df_full = df_reviews_f.merge(df_items_f, on='parent_asin', how='left')

# Filter Price
df_full_price = df_full[df_full['price'] != 'None']

# Keep only necessary variables
df_final = df_full_price[['parent_asin', # both
               'rating', 'full_text', 'as_image', 'helpful_vote', 'as_helpful_vote', 'verified_purchase', # reviews
               'main_category', 'average_rating', 'rating_number', 'price', 'categories_grp']] # items

# Drop rows with missing values (reviews whose item was removed by the main_category or price filter)
df_final = df_final.dropna()
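
Because the merge is a left join, reviews whose item was filtered out come back with missing item columns, and `dropna()` then removes them. One way to see how many reviews are affected, sketched here on toy data, is to pass `indicator=True` and `validate='many_to_one'` to `merge`. The toy ASINs and values are made up, and `validate` assumes each `parent_asin` appears only once in the items table.

```python
import pandas as pd

reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "B2", "C3"],
    "rating": [5.0, 4.0, 1.0, 3.0],
})
items = pd.DataFrame({
    "parent_asin": ["A1", "B2"],
    "price": ["25.95", "7.32"],
})

# validate raises if an item appears twice; indicator adds a _merge column
merged = reviews.merge(
    items, on="parent_asin", how="left",
    validate="many_to_one", indicator=True,
)
print(merged["_merge"].value_counts())

# Keep only reviews that found a matching item
kept = merged[merged["_merge"] == "both"].drop(columns="_merge")
```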

In [14]:
df_final

| | parent_asin | rating | full_text | as_image | helpful_vote | as_helpful_vote | verified_purchase | main_category | average_rating | rating_number | price | categories_grp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | B07SRWRH5D | 5.0 | Title : Good. A bit slow\n Review : Nostalgic ... | 0 | 1 | 1 | False | Video Games | 4.8 | 9097.0 | 25.95 | Games |
| 2 | B07MFMFW34 | 5.0 | Title : ... an order for my kids & they have r... | 0 | 0 | 0 | True | Video Games | 3.0 | 31.0 | 29.99 | Games |
| 3 | B0BCHWZX95 | 5.0 | Title : Great alt to pro controller\n Review :... | 0 | 0 | 0 | True | Video Games | 4.6 | 19492.0 | 67.61 | Accessories |
| 19 | B07GJ5W7HV | 1.0 | Title : DONT BUY\n Review : NOT RECCOMENDED, B... | 0 | 0 | 0 | True | Video Games | 4.6 | 3874.0 | 98.9 | Accessories |
| 20 | B00T76ZD78 | 5.0 | Title : Great set at a fantastic price\n Revie... | 0 | 0 | 0 | True | Video Games | 4.2 | 251.0 | 7.32 | Accessories |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4624604 | B015NHBBOS | 5.0 | Title : I love it\n Review : Thank you for gre... | 0 | 0 | 0 | True | Video Games | 4.4 | 387.0 | 158.9 | Accessories |
| 4624605 | B015NHBBOS | 5.0 | Title : Five Stars\n Review : Expectations met | 0 | 0 | 0 | True | Video Games | 4.4 | 387.0 | 158.9 | Accessories |
| 4624606 | B015NHBBOS | 5.0 | Title : Five Stars\n Review : The controller i... | 0 | 0 | 0 | True | Video Games | 4.4 | 387.0 | 158.9 | Accessories |
| 4624607 | B015NHBBOS | 5.0 | Title : Good seller\n Review : Works great | 0 | 0 | 0 | True | Video Games | 4.4 | 387.0 | 158.9 | Accessories |


In [14]:
#df_final.iloc[1]['full_text']

### Save


In [15]:
## Save data
df_final.to_csv('./../data/video_games.csv')
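
By default `to_csv` also writes the DataFrame index as a first, unnamed column, and `read_csv` then loads it back as `Unnamed: 0`. If the index carries no information for the downstream model, passing `index=False` avoids that extra column. The sketch below illustrates the difference on toy data written to a temporary directory; the file name `toy.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"parent_asin": ["A1", "B2"], "rating": [5.0, 1.0]},
    index=[1, 19],  # non-contiguous index, as after filtering
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")  # hypothetical file name

    # Default: the index is written as an unnamed first column
    toy.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False: only the data columns are written
    toy.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```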