## Code to collect reviews for different products
* Import the data
    * Appliances
    * Automotive
    * CDs_and_Vinyl
    * Digital_Music
    * Gift_Cards
    * Handmade_Products
    * Musical_Instruments
    * Video_Games
* Filter the ones I want to keep (quantity, balance of ratings, presence of information for XGBoost [e.g., price, ...])
    * Don't forget to set the random seed (set.seed)
* Save the data
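The filtering step above mentions balancing the ratings and fixing a seed. Below is a minimal sketch of one way this could be done in pandas, where `random_state` plays the role of R's `set.seed`. The helper name `sample_balanced`, the value of `SEED`, and the toy data are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

SEED = 42  # hypothetical seed value; any fixed integer works


def sample_balanced(df: pd.DataFrame, label_col: str, n_per_class: int,
                    seed: int = SEED) -> pd.DataFrame:
    """Draw up to n_per_class rows for each distinct value of label_col."""
    parts = []
    for _, group in df.groupby(label_col):
        # Cap the draw at the group size so small classes are kept whole
        n = min(n_per_class, len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)


# Toy data standing in for the review table
toy = pd.DataFrame({
    "rating": [5.0] * 6 + [1.0] * 3 + [3.0] * 2,
    "full_text": [f"review {i}" for i in range(11)],
})
balanced = sample_balanced(toy, "rating", n_per_class=2)
print(balanced["rating"].value_counts())
```

Because the seed is passed explicitly to every `sample` call, rerunning the cell should return the same rows.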

## Data pipeline


In [1]:
### Base packages
import numpy as np
import pandas as pd
from datasets import load_dataset

### Appliances
#### Import

In [2]:
## Load User Reviews
dataset_reviews = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Appliances", split="full", trust_remote_code=True)

## Load Item Metadata
dataset_items = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Appliances", split="full", trust_remote_code=True)


#### Reviews

In [3]:
## Convert to a pandas DataFrame (more convenient for some manipulations and EDA)
df_reviews = pd.DataFrame(dataset_reviews)

In [4]:
### Data manipulation
## Concat title and text
df_reviews['full_text'] = "Title : " + df_reviews['title'].astype(str) + "\n Review : " + df_reviews['text'].astype(str)

## Create indicator as_image (1 if the review has at least one image, else 0)
df_reviews['as_image'] = np.where(df_reviews['images'].str.len() == 0, 0, 1)

## Create indicator as_helpful_vote (1 if the review has at least one helpful vote, else 0)
df_reviews['as_helpful_vote'] = np.where(df_reviews['helpful_vote'] == 0, 0, 1)

#### Items

In [5]:
## Convert to pandas df
df_items = pd.DataFrame(dataset_items)

In [None]:
## Main category (keep only "Appliances" and "Tools & Home Improvement")
print(df_items['main_category'].value_counts())
df_items_cat = df_items[np.isin(df_items['main_category'], ["Appliances", "Tools & Home Improvement"])]
print(df_items_cat['main_category'].value_counts())

Tools & Home Improvement    42694
Appliances                  25572
Name: main_category, dtype: int64


In [10]:
## Keep only the products that have a price
print(df_items_cat['price'].count())
print(df_items_cat[df_items_cat['price'] != 'None']['price'].count())
# .copy() so that columns can be added later without a SettingWithCopyWarning
df_items_price = df_items_cat[df_items_cat['price'] != 'None'].copy()
df_items_price['price'].astype(float).describe()

68266
34431


count    34431.000000
mean        95.358969
std        349.050875
min          0.190000
25%         15.990000
50%         29.310000
75%         64.555000
max      21095.620000
Name: price, dtype: float64
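The price filter above compares against the string `'None'` and then casts with `astype(float)`. As a hedged alternative, the sketch below uses `pd.to_numeric(..., errors="coerce")`, which turns anything unparseable into `NaN` in one pass. The toy frame and the column name `price_num` are assumptions for illustration; the real column is only assumed to hold strings such as `"9.99"` and `"None"`.

```python
import pandas as pd

# Toy frame; the real column is assumed to hold strings such as "9.99" and "None"
toy_items = pd.DataFrame({
    "parent_asin": ["A1", "A2", "A3", "A4"],
    "price": ["9.99", "None", "399.0", None],
})

# Unparseable entries ("None", missing values) become NaN instead of raising
toy_items["price_num"] = pd.to_numeric(toy_items["price"], errors="coerce")

# Keep only the rows with a usable numeric price
toy_priced = toy_items.dropna(subset=["price_num"]).copy()
print(toy_priced)
```

This variant would also drop prices stored in any other non-numeric form, which may or may not be desirable depending on what the raw column contains.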

In [11]:
## Overview of the categories
print(df_items_price['categories'].value_counts())


[Appliances, Parts & Accessories]                                                                                                                      7511
[Appliances, Parts & Accessories, Dryer Parts & Accessories, Replacement Parts]                                                                        7062
[Appliances, Parts & Accessories, Refrigerator Parts & Accessories, Water Filters]                                                                     1840
[Appliances, Parts & Accessories, Washer Parts & Accessories]                                                                                          1418
[Appliances, Parts & Accessories, Cooktop Parts & Accessories]                                                                                         1005
                                                                                                                                                       ... 
[Appliances, Parts & Accessories, Kegerator Replacement Parts]  

In [None]:
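## Isolate the relevant category levels
# Most specific (last) level of the category list
df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None])[-1])
# Second level of the category list (None when fewer than two levels are listed)
df_items_price['categories_single_2'] = df_items_price['categories'].apply(lambda x : x[1] if x and len(x) > 1 else None)

## Overview of the options
print(df_items_price['categories_single'].value_counts())
print(df_items_price['categories_single_2'].value_counts())

## To group
# Parts & Accessories
# Other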


Parts & Accessories            7511
Replacement Parts              7062
Water Filters                  1840
Washer Parts & Accessories     1418
Cooktop Parts & Accessories    1005
                               ... 
LG Styler Steam Closets           1
Kegerator Replacement Parts       1
Ranges                            1
Permanent Filters                 1
Warming Drawers                   1
Name: categories_single, Length: 82, dtype: int64
Parts & Accessories                              30697
Refrigerators, Freezers & Ice Makers              1164
Ranges, Ovens & Cooktops                          1058
Laundry Appliances                                 598
Dishwashers                                        113
Coffee & Espresso Machine Parts & Accessories       26
Commercial Food Preparation Equipment               10
LG Styler Steam Closets                              1
Name: categories_single_2, dtype: int64
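The category lists run from general to specific, so picking a level by position can fail on short or empty lists. Here is a small sketch of a helper that returns `None` in those cases; the name `category_level` is hypothetical and the example list is taken from the overview output above.

```python
from typing import Optional, Sequence


def category_level(categories: Optional[Sequence[str]], level: int) -> Optional[str]:
    """Return the category at position `level`, or None if it does not exist."""
    if not categories:
        return None
    try:
        return categories[level]
    except IndexError:
        return None


example = ["Appliances", "Parts & Accessories",
           "Dryer Parts & Accessories", "Replacement Parts"]
print(category_level(example, 1))         # second level
print(category_level(example, -1))        # most specific level
print(category_level(["Appliances"], 1))  # list too short
print(category_level([], -1))             # empty list
```

It could be applied with `df_items_price['categories'].apply(lambda x: category_level(x, 1))`, assuming each cell holds a Python list.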



In [15]:
## Group the categories
df_items_price['categories_grp'] = np.where(df_items_price['categories_single'] == "Parts & Accessories", "Parts & Accessories", "Other")

df_items_price['categories_grp'].value_counts()


Other                  26920
Parts & Accessories     7511
Name: categories_grp, dtype: int64

In [11]:
### Candidate features for XGBoost
## Reviews
# Create an image-presence indicator (as_image)
# verified_purchase
# helpful_vote (as_helpful_vote)

## Items
# Certain: main_category, average_rating, rating_number, price
# categories_grp
# Potential:
#   XTitle? Anything to do with it?
#   *store name? Anything to do with it?
#   **categories (potentially something to do with it!) (extract from the dictionary!)
# rating_number (number of ratings behind the average)


#### Merge Items to Reviews

In [16]:
# Keep only necessary variables before merging
df_reviews_f = df_reviews[['rating', 'full_text', 'as_image', 'parent_asin', 'as_helpful_vote', 'helpful_vote', 'verified_purchase']]
df_items_f = df_items_price[['main_category', 'average_rating', 'rating_number', 'price', 'parent_asin', 'categories_grp']]

# Merge Items on Reviews
df_full = df_reviews_f.merge(df_items_f, on='parent_asin', how='left')

# Filter Price
df_full_price = df_full[df_full['price'] != 'None']

# Keep only necessary variables
df_final = df_full_price[['parent_asin', # both
               'rating', 'full_text', 'as_image', 'helpful_vote', 'as_helpful_vote', 'verified_purchase', # reviews
               'main_category', 'average_rating', 'rating_number', 'price', 'categories_grp']] # items

# Drop rows with missing values (e.g. reviews whose item has no main_category or price)
df_final = df_final.dropna()
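Since the left merge is followed by `dropna()`, reviews without a matching priced item disappear silently. The sketch below shows one way to see how many reviews matched, using the `indicator` and `validate` options of `DataFrame.merge`. The toy frames are assumptions for illustration only.

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "A2", "A9"],
    "rating": [5.0, 4.0, 1.0, 3.0],
})
toy_items = pd.DataFrame({
    "parent_asin": ["A1", "A2"],
    "price": [9.99, 22.99],
})

# validate="many_to_one" raises if an item appears more than once on the right
merged = toy_reviews.merge(
    toy_items, on="parent_asin", how="left",
    validate="many_to_one", indicator=True,
)

# "_merge" is "both" for matched reviews and "left_only" for unmatched ones
print(merged["_merge"].value_counts())

# Keep only the matched reviews and drop the helper column
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Checking the `_merge` counts before dropping rows gives a quick sense of how much data the price filter removes.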

In [17]:
df_final

| | parent_asin | rating | full_text | as_image | helpful_vote | as_helpful_vote | verified_purchase | main_category | average_rating | rating_number | price | categories_grp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B01N0TQ0OH | 5.0 | Title : Work great\n Review : work great. use ... | 0 | 0 | 0 | True | Tools & Home Improvement | 4.7 | 4939.0 | 9.99 | Other |
| 1 | B07DD37QPZ | 5.0 | Title : excellent product\n Review : Little on... | 0 | 0 | 0 | True | Tools & Home Improvement | 4.4 | 3186.0 | 22.99 | Other |
| 7 | B00AF7WZTM | 5.0 | Title : Five Stars\n Review : Part came quickl... | 0 | 0 | 0 | True | Appliances | 4.6 | 129.0 | 46.27 | Other |
| 11 | B09W5PMK5X | 5.0 | Title : so far so good\n Review : but i havent... | 0 | 2 | 1 | True | Appliances | 3.5 | 35.0 | 399.0 | Other |
| 15 | B08FDB6W59 | 5.0 | Title : great\n Review : worked great | 0 | 0 | 0 | True | Tools & Home Improvement | 4.2 | 5428.0 | 18.0 | Other |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2128582 | B0BVM8Z4JM | 5.0 | Title : Great value. Easy to install. Water ta... | 0 | 0 | 0 | True | Tools & Home Improvement | 4.7 | 1582.0 | 21.99 | Other |
| 2128583 | B08JVKQNT4 | 5.0 | Title : Good\n Review : Good | 0 | 0 | 0 | True | Tools & Home Improvement | 4.6 | 112.0 | 16.94 | Other |
| 2128596 | B07R4RHC4H | 1.0 | Title : Junk, doesn’t last.\n Review : Purchas... | 0 | 0 | 0 | True | Tools & Home Improvement | 4.4 | 9518.0 | 35.99 | Other |
| 2128597 | B07H7G4WB2 | 5.0 | Title : Broan Nutone 41000 Models S99110437 99... | 0 | 0 | 0 | True | Appliances | 4.8 | 612.0 | 10.98 | Other |


In [14]:
#df_final.iloc[1]['full_text']

### Save


In [None]:
## Save data
df_final.to_csv('./../data/appliances.csv')
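By default, `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below illustrates the difference with `index=False` on a toy frame, using in-memory buffers instead of the notebook's `./../data/` path. Whether the original row index should be kept is a choice left to the author.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"parent_asin": ["A1", "A2"], "rating": [5.0, 1.0]},
    index=[0, 7],  # non-contiguous index, as left behind by filtering
)

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)
```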