## Code for collecting reviews of different products
* Import the data
    * Appliances
    * Automotive
    * CDs_and_Vinyl
    * Digital_Music
    * Gift_Cards
    * Handmade_Products
    * Musical_Instruments
    * Video_Games
* Filter the reviews I want to keep (quantity, balance of ratings, availability of information for xgboost [e.g. price, ...])
    * Do not forget to set the random seed
* Save the data
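
The filtering step (balanced ratings, fixed seed) is not implemented in this notebook yet. A minimal sketch of what it might look like follows; the helper name `balanced_sample`, the seed value, and the toy data are all assumptions made for illustration, not part of the original pipeline.

```python
import pandas as pd

SEED = 42  # hypothetical seed value; any fixed integer gives reproducible draws


def balanced_sample(df: pd.DataFrame, label_col: str, n_per_class: int,
                    seed: int = SEED) -> pd.DataFrame:
    """Draw up to n_per_class rows for each value of label_col, reproducibly."""
    parts = []
    for _, group in df.groupby(label_col):
        # Keep the whole group when it is smaller than the requested size.
        n = min(n_per_class, len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).sort_index()


# Toy data standing in for the real reviews (not taken from the dataset).
toy = pd.DataFrame({
    "rating": [1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 3.0],
    "full_text": list("abcdefg"),
})
sample = balanced_sample(toy, "rating", n_per_class=2)
print(sample["rating"].value_counts().sort_index())
```

Passing `random_state` to `DataFrame.sample` plays the role of R's `set.seed`: the same seed returns the same rows on every run.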

## Data pipeline


In [1]:
### Base packages
import numpy as np
import pandas as pd
from datasets import load_dataset

### Gift Cards
#### Import

In [2]:
## Load User Reviews
dataset_reviews = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Gift_Cards", split="full", trust_remote_code=True)

## Load Item Metadata
dataset_items = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Gift_Cards", split="full", trust_remote_code=True)
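
To repeat this import for the other categories listed at the top, the two calls could be wrapped in a helper. The sketch below assumes the other categories follow the same `raw_review_<category>` / `raw_meta_<category>` config naming as the Gift_Cards calls; `config_names` and `load_category` are hypothetical names introduced here.

```python
CATEGORIES = [
    "Appliances", "Automotive", "CDs_and_Vinyl", "Digital_Music",
    "Gift_Cards", "Handmade_Products", "Musical_Instruments", "Video_Games",
]
REPO = "McAuley-Lab/Amazon-Reviews-2023"


def config_names(category: str) -> tuple[str, str]:
    """Build the (reviews, metadata) config names, assuming the Gift_Cards pattern."""
    return f"raw_review_{category}", f"raw_meta_{category}"


def load_category(category: str):
    """Download reviews and item metadata for one category (needs network access)."""
    # Imported here so the naming helper above can be used without `datasets`.
    from datasets import load_dataset

    review_cfg, meta_cfg = config_names(category)
    reviews = load_dataset(REPO, review_cfg, split="full", trust_remote_code=True)
    items = load_dataset(REPO, meta_cfg, split="full", trust_remote_code=True)
    return reviews, items


print(config_names("Gift_Cards"))
```

Calling `load_category("Video_Games")` would then return the two datasets for that category, provided the assumed config names exist on the hub.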


#### Reviews

In [3]:
## Convert to a pandas DataFrame (more convenient for some manipulations and EDA)
df_reviews = pd.DataFrame(dataset_reviews)

In [4]:
### Data manipulation
## Concat title and text
df_reviews['full_text'] = "Title : " + df_reviews['title'].astype(str) + "\n Review : " + df_reviews['text'].astype(str)

## Create indicator as_image (1 if the review has at least one image, else 0)
df_reviews['as_image'] = np.where(df_reviews['images'].str.len() == 0, 0, 1)

## Create indicator as_helpful_vote (1 if the review has at least one helpful vote, else 0)
df_reviews['as_helpful_vote'] = np.where(df_reviews['helpful_vote'] == 0, 0, 1)
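
One detail worth checking: `astype(str)` turns a missing title or text into the literal string "None" inside `full_text`. The sketch below shows one possible variant on toy data (invented here, including the placeholder `url` key) that replaces missing values with an empty string and builds the two indicators from boolean comparisons. It assumes `helpful_vote` is never negative.

```python
import pandas as pd

# Toy reviews; values are illustrative only.
toy = pd.DataFrame({
    "title": ["Great", None],
    "text": ["Works well", "Arrived late"],
    "images": [[], [{"url": "placeholder"}]],
    "helpful_vote": [0, 3],
})

# fillna("") avoids the literal "None" that astype(str) would produce.
toy["full_text"] = (
    "Title : " + toy["title"].fillna("") + "\n Review : " + toy["text"].fillna("")
)

# Boolean comparison cast to int gives the same 0/1 flags as np.where.
toy["as_image"] = (toy["images"].str.len() > 0).astype(int)
toy["as_helpful_vote"] = (toy["helpful_vote"] > 0).astype(int)

print(toy[["full_text", "as_image", "as_helpful_vote"]])
```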

#### Items

In [5]:
## Convert to pandas df
df_items = pd.DataFrame(dataset_items)

In [6]:
## Main category (filter to keep only "Gift Cards")
print(df_items['main_category'].value_counts())
df_items_cat = df_items[df_items['main_category'] == "Gift Cards"]

Gift Cards                926
Amazon Home                 9
Health & Personal Care      9
Office Products             8
Video Games                 3
Software                    2
Sports & Outdoors           2
Grocery                     2
Toys & Games                1
All Beauty                  1
AMAZON FASHION              1
All Electronics             1
Arts, Crafts & Sewing       1
Books                       1
Name: main_category, dtype: int64


In [7]:
## Filter to keep only the products with a price
print(df_items_cat['price'].count())
print(df_items_cat[df_items_cat['price'] != 'None']['price'].count())
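# .copy() makes an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
df_items_price = df_items_cat[df_items_cat['price'] != 'None'].copy()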
df_items_price['price'].astype(float).describe()

926
368


count     368.000000
mean       52.768207
std       109.047580
min         3.990000
25%        25.000000
50%        45.000000
75%        50.000000
max      2000.000000
Name: price, dtype: float64
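
The filter above relies on missing prices being stored as the string 'None'. A slightly more defensive sketch, on toy data invented for illustration, converts the column with `pd.to_numeric(..., errors="coerce")` so that any non-numeric entry becomes NaN; the column name `price_num` is an assumption made here.

```python
import pandas as pd

# Toy item metadata; prices are strings, as in the raw data.
toy_items = pd.DataFrame({
    "parent_asin": ["A1", "A2", "A3"],
    "price": ["25.0", "None", "49.99"],
})

# Non-numeric strings such as "None" are coerced to NaN.
toy_items["price_num"] = pd.to_numeric(toy_items["price"], errors="coerce")

# Keep only the rows with a usable numeric price.
priced = toy_items.dropna(subset=["price_num"]).copy()
print(priced)
```

A numeric column also removes the need for `astype(float)` before `describe()`.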

In [8]:
## Overview of the categories
print(df_items_price['categories'].value_counts())


[Gift Cards, Gift Card Categories, Restaurants]                      157
[Gift Cards, Gift Card Categories, Specialty Cards]                   63
[Gift Cards, Gift Card Categories, Clothing, Shoes & Accessories]     44
[Gift Cards, Gift Card Categories, Books, Movies & Music]             19
[Gift Cards, Gift Card Recipients, For Him]                            7
[Gift Cards, Gift Card Categories, Electronics & Office]               7
[Gift Cards, Gift Card Categories, Grocery, Gourmet & Floral]          7
[Gift Cards, Gift Card Categories, Travel & Leisure]                   7
[]                                                                     6
[Gift Cards, Gift Card Categories, Video Games & Online Games]         6
[Gift Cards, Gift Cards]                                               5
[Gift Cards, Gift Card Categories, Health & Beauty]                    3
[Gift Cards, Gift Card Categories, Home Improvement]                   3
[Gift Cards, Gift Card Categories, Automotive & Ind

In [9]:
## Isolate the key category (last element of each categories list)
df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None])[-1])

## Overview of the options
df_items_price['categories_single'].value_counts()

## To be grouped


Restaurants                                 157
Specialty Cards                              63
Clothing, Shoes & Accessories                44
Books, Movies & Music                        19
For Him                                       7
Electronics & Office                          7
Grocery, Gourmet & Floral                     7
Travel & Leisure                              7
Video Games & Online Games                    6
Gift Cards                                    5
Health & Beauty                               3
Home Improvement                              3
Automotive & Industrial                       3
Amazon Incentives Brand Guidelines            3
Christmas                                     3
Birthday                                      2
Home & Decor                                  2
Gift Cards for New Baby                       2
Toys, Kids & Baby                             2
Gift Cards: Amazon Shipping                   2
Sports, Outdoors & Fitness              

In [10]:
## Group the categories
df_items_price['categories_grp'] = np.where(
        df_items_price['categories_single'] == "Restaurants", "Restaurants",
        np.where(
            df_items_price['categories_single'] == "Specialty Cards", "Specialty Cards",
        np.where(
            np.isin(df_items_price['categories_single'], ["Clothing, Shoes & Accessories", "For Him"]), "Clothing",
        np.where(
            np.isin(df_items_price['categories_single'], ["Books, Movies & Music", "Electronics & Office", "Video Games & Online Games"]), "Office-Gaming",
            "Other"
        ))))

df_items_price['categories_grp'].value_counts()


Restaurants        157
Other               65
Specialty Cards     63
Clothing            51
Office-Gaming       32
Name: categories_grp, dtype: int64
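
The nested `np.where` calls can be hard to extend as more groups are added. One possible alternative, sketched here on a toy Series, is a dictionary lookup with `Series.map`, where anything not listed (including missing values) falls back to "Other". The `GROUPS` name is introduced for illustration; the mapping itself mirrors the grouping used above.

```python
import pandas as pd

# Same grouping as the nested np.where, written as a lookup table.
GROUPS = {
    "Restaurants": "Restaurants",
    "Specialty Cards": "Specialty Cards",
    "Clothing, Shoes & Accessories": "Clothing",
    "For Him": "Clothing",
    "Books, Movies & Music": "Office-Gaming",
    "Electronics & Office": "Office-Gaming",
    "Video Games & Online Games": "Office-Gaming",
}

toy = pd.Series(
    ["Restaurants", "For Him", "Christmas", None, "Electronics & Office"],
    name="categories_single",
)

# Unlisted categories and missing values map to NaN, then become "Other".
grouped = toy.map(GROUPS).fillna("Other")
print(grouped.value_counts())
```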

In [11]:
### Candidate features for XGBoost
## Reviews
# Image-presence indicator (as_image)
# verified_purchase
# helpful_vote (as_helpful_vote)

## Items
# Confirmed: main_category, average_rating, rating_number, price
# categories_grp
# Possible:
#   X title? Is there anything to do with it?
#   * store name? Is there anything to do with it?
#   ** categories (potentially something to do with this!) (extract from the dictionary!)
# rating_number (number of ratings behind the average)


#### Merge Items to Reviews

In [12]:
# Keep only necessary variables before merging
df_reviews_f = df_reviews[['rating', 'full_text', 'as_image', 'parent_asin', 'as_helpful_vote', 'helpful_vote', 'verified_purchase']]
df_items_f = df_items_price[['main_category', 'average_rating', 'rating_number', 'price', 'parent_asin', 'categories_grp']]

# Merge Items on Reviews
df_full = df_reviews_f.merge(df_items_f, on='parent_asin', how='left')

# Filter Price
df_full_price = df_full[df_full['price'] != 'None']

# Keep only necessary variables
df_final = df_full_price[['parent_asin', # both
               'rating', 'full_text', 'as_image', 'helpful_vote', 'as_helpful_vote', 'verified_purchase', # reviews
               'main_category', 'average_rating', 'rating_number', 'price', 'categories_grp']] # items

# Drop rows with missing values (reviews with no matching item after the main_category and price filters)
df_final = df_final.dropna()
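
Because the left merge is followed by `dropna()`, reviews whose item was filtered out disappear from the final table. The sketch below, on toy data invented for illustration, counts those unmatched reviews with `indicator=True` and uses `validate="many_to_one"` to check that each `parent_asin` appears only once on the item side. It assumes the review columns themselves have no missing values; otherwise `dropna()` removes more rows than an inner merge would.

```python
import pandas as pd

# Toy reviews and items; "A9" has no matching item.
reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "A2", "A9"],
    "rating": [5.0, 4.0, 1.0, 3.0],
})
items = pd.DataFrame({
    "parent_asin": ["A1", "A2"],
    "price": ["25.0", "40.0"],
})

# validate raises MergeError if parent_asin is duplicated in items.
left = reviews.merge(items, on="parent_asin", how="left",
                     validate="many_to_one", indicator=True)
unmatched = int((left["_merge"] == "left_only").sum())

# Left merge + dropna keeps the same reviews as an inner merge here.
left_then_drop = left.drop(columns="_merge").dropna()
inner = reviews.merge(items, on="parent_asin", how="inner", validate="many_to_one")

print(unmatched, len(left_then_drop), len(inner))
```

Logging the unmatched count makes it easier to see how many reviews are lost to the price and category filters.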

In [13]:
df_final

| | parent_asin | rating | full_text | as_image | helpful_vote | as_helpful_vote | verified_purchase | main_category | average_rating | rating_number | price | categories_grp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | B005S28ZES | 5.0 | Title : perfect gift\n Review : When you have ... | 0 | 27 | 1 | True | Gift Cards | 4.9 | 4918.0 | 25.0 | Clothing |
| 3 | B00ADR2LV6 | 5.0 | Title : Nice looking\n Review : The tin is a n... | 0 | 0 | 0 | False | Gift Cards | 4.9 | 185606.0 | 25.0 | Other |
| 4 | B00FTGTIOE | 1.0 | Title : Not $10 Gift Cards\n Review : I bought... | 0 | 2 | 1 | True | Gift Cards | 4.9 | 13066.0 | 40.0 | Other |
| 5 | B00ADR2LV6 | 5.0 | Title : Cute!\n Review : That snowman tin is a... | 0 | 0 | 0 | True | Gift Cards | 4.9 | 185606.0 | 25.0 | Other |
| 7 | B00ADR2LV6 | 5.0 | Title : Great gift\n Review : Super cute nice ... | 0 | 0 | 0 | True | Gift Cards | 4.9 | 185606.0 | 25.0 | Other |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 152394 | B00ADR2LV6 | 4.0 | Title : Super cute\n Review : Super cute | 0 | 0 | 0 | True | Gift Cards | 4.9 | 185606.0 | 25.0 | Other |
| 152398 | B077N4CNVJ | 5.0 | Title : Winner winner!\n Review : Boxes was a ... | 0 | 1 | 1 | True | Gift Cards | 4.9 | 104005.0 | 25.0 | Other |
| 152400 | B077N4CNVJ | 5.0 | Title : Perfect gift\n Review : Perfect gift | 0 | 0 | 0 | True | Gift Cards | 4.9 | 104005.0 | 25.0 | Other |
| 152404 | B00FTGEQCI | 5.0 | Title : Five Stars\n Review : Who doesn't love... | 0 | 1 | 1 | True | Gift Cards | 4.9 | 1013.0 | 30.0 | Restaurants |


In [14]:
#df_final.iloc[1]['full_text']

### Save


In [15]:
## Save data
df_final.to_csv('./../data/gift_cards.csv')
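
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read with `pd.read_csv`. If the original row numbers are not needed downstream, a possible variant is to pass `index=False`. The sketch below uses a temporary directory and toy data so that it runs anywhere; the real path would stay `./../data/gift_cards.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a non-contiguous index, like df_final after filtering.
toy = pd.DataFrame({"rating": [5.0, 1.0], "price": ["25.0", "40.0"]}, index=[2, 4])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gift_cards.csv")
    # index=False skips the row index, so no extra column appears on reload.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```

Note that string prices are parsed as floats on reload, so the column type changes across the round trip.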