## Code to collect reviews for different products
* Import the data
    * Appliances
    * Automotive
    * CDs_and_Vinyl
    * Digital_Music
    * Gift_Cards
    * Handmade_Products
    * Musical_Instruments
    * Video_Games
* Filter the reviews I want to keep (quantity, balance of ratings, availability of features for XGBoost [e.g. price, ...])
    * Don't forget to set the random seed
* Save the data
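The filtering step in the plan above (quantity, balance of ratings, fixed seed) could look roughly like the sketch below. It is not part of the original notebook: `balanced_sample`, `n_per_class` and the toy data are hypothetical, and pandas' `random_state` argument plays the role of the seed.

```python
import pandas as pd


def balanced_sample(df, label_col="rating", n_per_class=2, seed=42):
    """Keep at most n_per_class rows per rating, drawn reproducibly."""
    parts = []
    for _, group in df.groupby(label_col):
        k = min(n_per_class, len(group))  # small classes are kept whole
        parts.append(group.sample(n=k, random_state=seed))
    return pd.concat(parts).sort_index()


# Toy data standing in for a real review table (values are made up).
toy_reviews = pd.DataFrame({
    "rating": [5.0, 5.0, 5.0, 1.0, 1.0, 3.0],
    "full_text": ["a", "b", "c", "d", "e", "f"],
})
sample = balanced_sample(toy_reviews, n_per_class=2, seed=42)
print(sample["rating"].value_counts().sort_index())
```

Passing the same `seed` again returns the same rows, which is what makes the filtered dataset reproducible.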

## Data pipeline


In [1]:
### Base packages
import numpy as np
import pandas as pd
from datasets import load_dataset

### Handmade Products
#### Import

In [2]:
## Load User Reviews
dataset_reviews = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Handmade_Products", split="full", trust_remote_code=True)

## Load Item Metadata
dataset_items = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Handmade_Products", split="full", trust_remote_code=True)


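The same two calls can be repeated for the other categories in the plan. The sketch below assumes that every category follows the `raw_review_<Category>` / `raw_meta_<Category>` naming seen for Handmade_Products; `config_names` and `load_category` are hypothetical helpers, not part of the original notebook.

```python
CATEGORIES = [
    "Appliances", "Automotive", "CDs_and_Vinyl", "Digital_Music",
    "Gift_Cards", "Handmade_Products", "Musical_Instruments", "Video_Games",
]


def config_names(category):
    """Return the (reviews, metadata) config names for one category."""
    # Assumption: the naming pattern used for Handmade_Products holds everywhere.
    return f"raw_review_{category}", f"raw_meta_{category}"


def load_category(category, repo="McAuley-Lab/Amazon-Reviews-2023"):
    """Download the reviews and item metadata for one category."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    review_cfg, meta_cfg = config_names(category)
    reviews = load_dataset(repo, review_cfg, split="full", trust_remote_code=True)
    items = load_dataset(repo, meta_cfg, split="full", trust_remote_code=True)
    return reviews, items


# Only print the config names here; load_category() triggers large downloads.
for name in CATEGORIES:
    print(name, config_names(name))
```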

#### Reviews

In [3]:
## Convert to a DataFrame (more convenient for some manipulations and EDA)
df_reviews = pd.DataFrame(dataset_reviews)

In [4]:
### Data manipulation
## Concat title and text
df_reviews['full_text'] = "Title : " + df_reviews['title'].astype(str) + "\n Review : " + df_reviews['text'].astype(str)

## Create variable as_image
df_reviews['as_image'] = np.where(df_reviews['images'].str.len() == 0, 0, 1)

## Create variable as_helpful_vote
df_reviews['as_helpful_vote'] = np.where(df_reviews['helpful_vote'] == 0, 0, 1)
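To see what the two indicators do, here is a toy run of the same expressions. The rows are made up, and the `url` key inside the image entries is a placeholder rather than the dataset's real schema.

```python
import numpy as np
import pandas as pd

# Made-up rows: `images` holds a list per review, possibly empty.
toy = pd.DataFrame({
    "images": [[], [{"url": "placeholder"}], []],
    "helpful_vote": [0, 3, 1],
})

# .str.len() also works on list-valued columns: it returns each list's length.
toy["as_image"] = np.where(toy["images"].str.len() == 0, 0, 1)
toy["as_helpful_vote"] = np.where(toy["helpful_vote"] == 0, 0, 1)
print(toy[["as_image", "as_helpful_vote"]])
```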

#### Items

In [5]:
## Convert to pandas df
df_items = pd.DataFrame(dataset_items)

In [7]:
## Main category (filter to keep only "Handmade")
print(df_items['main_category'].value_counts())
df_items_cat = df_items[df_items['main_category'] == "Handmade"]
print(df_items_cat['main_category'].value_counts())

Handmade                  164765
Amazon Home                   11
AMAZON FASHION                 9
Office Products                6
Health & Personal Care         3
Pet Supplies                   3
All Beauty                     2
All Electronics                2
Arts, Crafts & Sewing          1
Amazon Devices                 1
Name: main_category, dtype: int64
Handmade    164765
Name: main_category, dtype: int64


In [8]:
## Filter to keep only products with a price
print(df_items_cat['price'].count())
print(df_items_cat[df_items_cat['price'] != 'None']['price'].count())
## .copy() so that columns added later do not trigger SettingWithCopyWarning
df_items_price = df_items_cat[df_items_cat['price'] != 'None'].copy()
df_items_price['price'].astype(float).describe()

164765
97617


count     97617.000000
mean         35.136140
std         329.921911
min           0.010000
25%          13.000000
50%          19.990000
75%          33.900000
max      100000.000000
Name: price, dtype: float64
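The cell above compares against the string `'None'`, which is how missing prices appear in this column. A possible alternative, sketched on made-up rows, is to convert the column to numbers and drop whatever fails to parse; `price_num` is a hypothetical column name.

```python
import pandas as pd

toy_items = pd.DataFrame({
    "parent_asin": ["A1", "A2", "A3", "A4"],
    "price": ["17.99", "None", "24.0", None],
})

# Anything that is not a number (the string 'None', a real None) becomes NaN.
toy_items["price_num"] = pd.to_numeric(toy_items["price"], errors="coerce")
toy_priced = toy_items.dropna(subset=["price_num"]).copy()
print(toy_priced)
```

This variant also leaves a float column behind, so no later `astype(float)` is needed.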

In [9]:
## Overview of the categories
print(df_items_price['categories'].value_counts())


[Handmade Products, Home & Kitchen, Artwork, Prints]                                  13731
[Handmade Products, Jewelry, Necklaces, Pendant]                                       6870
[Handmade Products, Jewelry, Earrings, Drop & Dangle]                                  4324
[Handmade Products, Home & Kitchen, Home Décor, Decorative Accessories, Ornaments]     3465
[Handmade Products, Home & Kitchen, Home Décor, Signs & Plaques]                       3003
                                                                                      ...  
[Handmade Products, Handmade Small Business Promotion, Midwest FBA]                       1
[Handmade Products, Beauty & Grooming, Shaving & Hair Removal]                            1
[Handmade Products, Southeast States, Florida]                                            1
[Handmade Products, Handmade Small Business Promotion, Rocky Mountain FBA]                1
[Handmade Products, Clothing]                                                   

In [10]:
## Isolate the main categories (second level of the category path)
#df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None])[-1])
## Return None when the path is missing or has fewer than two levels
df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : x[1] if x is not None and len(x) > 1 else None)

## Overview of the options
print(df_items_price['categories_single'].value_counts())

## To group
# Parts & Accessories
# Other


Home & Kitchen                                 40395
Jewelry                                        29962
Clothing, Shoes & Accessories                   9557
Stationery & Party Supplies                     7789
Beauty & Grooming                               2603
Sports & Outdoors                               1475
Last minute gifts                               1353
Electronics Accessories                         1160
Pet Supplies                                     923
Toys & Games                                     760
Baby                                             689
Health & Personal Care                           531
Handmade Gift Shop                                50
Handmade Small Business Promotion - Jewelry       49
Handmade_Prime_Test                               44
Handmade Small Business Promotion                 10
Prime-eligible products                           10
Southeast States                                   7
Midwest States                                



In [11]:
## Group the categories
df_items_price['categories_grp'] = np.where(
    df_items_price['categories_single'] == "Home & Kitchen", "Home & Kitchen", 
    np.where(
        np.isin(df_items_price['categories_single'], ["Jewelry", "Clothing, Shoes & Accessories", "Beauty & Grooming"]), "Looks",
    "Other"
    ))

df_items_price['categories_grp'].value_counts()



Looks             42122
Home & Kitchen    40395
Other             15100
Name: categories_grp, dtype: int64
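The nested `np.where` can get hard to read as groups are added. One possible alternative, shown here on a made-up Series, is a plain dictionary lookup with a default; `GROUPS` is a hypothetical name and its contents mirror the rule used in the cell above.

```python
import pandas as pd

# Mapping taken from the grouping rule above; anything unlisted becomes "Other".
GROUPS = {
    "Home & Kitchen": "Home & Kitchen",
    "Jewelry": "Looks",
    "Clothing, Shoes & Accessories": "Looks",
    "Beauty & Grooming": "Looks",
}

toy = pd.Series(["Jewelry", "Home & Kitchen", "Pet Supplies", None, "Beauty & Grooming"])
grouped = toy.map(GROUPS).fillna("Other")
print(grouped.tolist())
```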

In [11]:
### Candidate features for XGBoost
## Reviews
# Image-presence indicator (as_image)
# verified_purchase
# helpful_vote (as_helpful_vote)

## Items
# Sure: main_category, average_rating, rating_number, price
# categories_grp
# Potential:
#   XTitle? Anything to do with it?
#   *store name? Anything to do with it?
#   **categories (potentially something to do with it!) (extract from the dictionary!)
# rating_number (number of ratings behind the average)
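As a rough illustration of how these candidates could be turned into a numeric matrix, the sketch below one-hot encodes `categories_grp` and casts the remaining columns. The rows are made up, XGBoost itself is not called, and `main_category` is left out because it is constant once only "Handmade" items are kept.

```python
import pandas as pd

# Made-up rows using the candidate columns listed above.
toy = pd.DataFrame({
    "as_image": [0, 1, 0],
    "as_helpful_vote": [1, 0, 0],
    "verified_purchase": [True, False, True],
    "average_rating": [4.3, 5.0, 4.7],
    "rating_number": [1194.0, 4.0, 1214.0],
    "price": ["17.99", "13.49", "24.0"],
    "categories_grp": ["Looks", "Home & Kitchen", "Other"],
})

features = toy.copy()
features["price"] = features["price"].astype(float)  # price is stored as text
features["verified_purchase"] = features["verified_purchase"].astype(int)
# One-hot encode the grouped category so every column is numeric.
features = pd.get_dummies(features, columns=["categories_grp"], dtype=int)
print(features.dtypes)
```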


#### Merge Items to Reviews

In [12]:
# Keep only necessary variables before merging
df_reviews_f = df_reviews[['rating', 'full_text', 'as_image', 'parent_asin', 'as_helpful_vote', 'helpful_vote', 'verified_purchase']]
df_items_f = df_items_price[['main_category', 'average_rating', 'rating_number', 'price', 'parent_asin', 'categories_grp']]

# Merge Items on Reviews
df_full = df_reviews_f.merge(df_items_f, on='parent_asin', how='left')

# Filter Price
df_full_price = df_full[df_full['price'] != 'None']

# Keep only necessary variables
df_final = df_full_price[['parent_asin', # both
               'rating', 'full_text', 'as_image', 'helpful_vote', 'as_helpful_vote', 'verified_purchase', # reviews
               'main_category', 'average_rating', 'rating_number', 'price', 'categories_grp']] # items

# Drop rows with missing values (e.g. reviews whose item has no price and so found no match)
df_final = df_final.dropna()
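Because the merge is a left join, reviews whose item was filtered out earlier come back with missing item columns, and `dropna()` then removes them. A possible way to make that explicit is sketched below on made-up tables, using the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "A2", "A9"],
    "rating": [5.0, 4.0, 1.0, 3.0],
})
toy_items = pd.DataFrame({
    "parent_asin": ["A1", "A2"],
    "price": ["17.99", "13.49"],
})

merged = toy_reviews.merge(
    toy_items,
    on="parent_asin",
    how="left",
    validate="many_to_one",  # raises if an item appears twice in toy_items
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(merged["_merge"].value_counts())

# Keep only the reviews that found a matching item.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Counting the `left_only` rows shows how many reviews are lost to the price filter before they are dropped.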

In [13]:
df_final

| | parent_asin | rating | full_text | as_image | helpful_vote | as_helpful_vote | verified_purchase | main_category | average_rating | rating_number | price | categories_grp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B08GPJ1MSN | 5.0 | Title : Beautiful colors\n Review : I bought o... | 0 | 1 | 1 | True | Handmade | 4.3 | 1194.0 | 17.99 | Looks |
| 1 | B084TWHS7W | 5.0 | Title : You simply must order order more than ... | 0 | 0 | 0 | True | Handmade | 5.0 | 4.0 | 13.49 | Home & Kitchen |
| 2 | B07V3NRQC4 | 5.0 | Title : Great\n Review : As pictured. Used a f... | 0 | 0 | 0 | True | Handmade | 4.3 | 21.0 | 14.95 | Home & Kitchen |
| 3 | B071ZMDK26 | 5.0 | Title : Well made and so beautiful\n Review : ... | 0 | 2 | 1 | True | Handmade | 4.7 | 1214.0 | 24.0 | Looks |
| 5 | B09ZXTLVWP | 5.0 | Title : These are beautiful\n Review : I have ... | 0 | 29 | 1 | False | Handmade | 4.3 | 21.0 | 18.99 | Home & Kitchen |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 664157 | B0BPYCKN76 | 1.0 | Title : This can't be the real thing! GROSS!\n... | 0 | 0 | 0 | True | Handmade | 4.1 | 284.0 | 11.99 | Home & Kitchen |
| 664158 | B0843SG3C6 | 5.0 | Title : Great scrubby\n Review : Great quality... | 0 | 2 | 1 | True | Handmade | 4.4 | 208.0 | 15.95 | Home & Kitchen |
| 664159 | B01DTEP09O | 5.0 | Title : Five Stars\n Review : Beautiful notebo... | 0 | 0 | 0 | True | Handmade | 4.6 | 111.0 | 29.0 | Other |
| 664160 | B07GJ554VV | 5.0 | Title : This is a beautiful picture\n Review :... | 0 | 0 | 0 | True | Handmade | 4.5 | 514.0 | 12.99 | Home & Kitchen |


In [14]:
#df_final.iloc[1]['full_text']

### Save


In [None]:
#df_final = df_final.drop(df_final.index[[]])

In [14]:
## Save data
df_final.to_csv('./../data/handmade_products.csv')
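By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a made-up frame, writing to in-memory buffers instead of the real path; whether to keep the index is a choice for the downstream code.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"rating": [5.0, 1.0], "price": ["17.99", "11.99"]},
    index=[0, 5],  # gaps like the ones left by earlier filtering
)

with_index = io.StringIO()
toy.to_csv(with_index)  # default: the index is written as an unnamed column
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```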

In [None]:
#df_final.iloc[456879]['full_text']

"Title : Horrible\n Review : Horrible! Didn't look anything like the picture. Dull gold and look<br />like the gold was coming off. Looked pitted and black in spots.<br />Wouldn't tell anyone to buy from them."