## Code to collect reviews for different products
* Import the data
    * Appliances
    * Automotive
    * CDs_and_Vinyl
    * Digital_Music
    * Gift_Cards
    * Handmade_Products
    * Musical_Instruments
    * Video_Games
* Filter the ones I want to keep (quantity, balance of ratings, availability of information for xgboost [e.g. price, ...])
    * Don't forget to set the random seed (set.seed)
* Save the data
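
The filtering step in the plan above (quantity, balance of ratings, fixed seed) could look roughly like the sketch below. The helper name `balanced_sample`, the toy data, and the seed value 42 are assumptions made for illustration; the notebook does not define them.

```python
import pandas as pd

def balanced_sample(df, label_col="rating", n_per_class=2, seed=42):
    """Draw the same number of rows for each rating value, reproducibly.

    Hypothetical helper: name, defaults and seed are illustrative only.
    """
    # Cap at the size of the smallest class so sampling without replacement works.
    n = int(min(n_per_class, df[label_col].value_counts().min()))
    return df.groupby(label_col).sample(n=n, random_state=seed)

# Toy stand-in for the merged review table.
toy = pd.DataFrame({
    "rating": [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0],
    "full_text": [f"review {i}" for i in range(9)],
})

sample = balanced_sample(toy, n_per_class=2, seed=42)
print(sample["rating"].value_counts().sort_index())
```

Passing the same `seed` returns the same rows on every run, which plays the role of `set.seed` in R.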

## Data pipeline


In [1]:
### Base packages
import numpy as np
import pandas as pd
from datasets import load_dataset

### Musical_Instruments
#### Import

In [2]:
## Load User Reviews
dataset_reviews = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Musical_Instruments", split="full", trust_remote_code=True)

## Load Item Metadata
dataset_items = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Musical_Instruments", split="full", trust_remote_code=True)
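
Since the same two calls have to be repeated for each product category, a small helper may reduce copy-paste errors. This is only a sketch: `CATEGORIES`, `config_names` and `load_category` are hypothetical names, and it assumes the other categories follow the same `raw_review_<category>` / `raw_meta_<category>` naming used for Musical_Instruments.

```python
CATEGORIES = [
    "Appliances", "Automotive", "CDs_and_Vinyl", "Digital_Music",
    "Gift_Cards", "Handmade_Products", "Musical_Instruments", "Video_Games",
]

def config_names(category):
    """Build the review and metadata config names for one category (assumed pattern)."""
    return f"raw_review_{category}", f"raw_meta_{category}"

def load_category(category, repo="McAuley-Lab/Amazon-Reviews-2023"):
    """Load reviews and item metadata for one category (downloads data when called)."""
    # Imported lazily so that building the names does not require the library.
    from datasets import load_dataset
    review_cfg, meta_cfg = config_names(category)
    reviews = load_dataset(repo, review_cfg, split="full", trust_remote_code=True)
    items = load_dataset(repo, meta_cfg, split="full", trust_remote_code=True)
    return reviews, items

# Only print the names here; calling load_category triggers large downloads.
for category in CATEGORIES:
    print(config_names(category))
```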



#### Reviews

In [3]:
## Convert to a pandas DataFrame (more convenient for some manipulations and EDA)
df_reviews = pd.DataFrame(dataset_reviews)

In [5]:
### Data manipulation
## Concat title and text
df_reviews['full_text'] = "Title : " + df_reviews['title'].astype(str) + "\n Review : " + df_reviews['text'].astype(str)

## Create variable as_image
df_reviews['as_image'] = np.where(df_reviews['images'].str.len() == 0, 0, 1)

## Create variable as_helpful_vote
df_reviews['as_helpful_vote'] = np.where(df_reviews['helpful_vote'] == 0, 0, 1)
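
To make the two indicator variables easier to check, here is a minimal sketch on made-up rows. It assumes `images` holds a list per review (empty when there is no image); the dictionary content is a placeholder, not the dataset's real schema.

```python
import numpy as np
import pandas as pd

# Toy rows: the dict inside the list is a made-up placeholder.
toy = pd.DataFrame({
    "images": [[], [{"url": "placeholder"}], []],
    "helpful_vote": [0, 3, 1],
})

# .str.len() also works on list-valued columns: it returns each list's length.
toy["as_image"] = np.where(toy["images"].str.len() == 0, 0, 1)

# Same result as np.where(helpful_vote == 0, 0, 1) for non-negative vote counts.
toy["as_helpful_vote"] = (toy["helpful_vote"] > 0).astype(int)

print(toy[["as_image", "as_helpful_vote"]])
```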

#### Items

In [4]:
## Convert to pandas df
df_items = pd.DataFrame(dataset_items)

In [7]:
## Main category (filter to keep only "Musical Instruments")
print(df_items['main_category'].value_counts())
df_items_cat = df_items[df_items['main_category'] == "Musical Instruments"]
print(df_items_cat['main_category'].value_counts())

Musical Instruments             183052
All Electronics                   6003
Industrial & Scientific           3930
Tools & Home Improvement          3265
Amazon Home                       2988
Home Audio & Theater              2099
AMAZON FASHION                    1689
Sports & Outdoors                 1123
Toys & Games                      1061
Computers                          746
Cell Phones & Accessories          715
Software                           641
Camera & Photo                     527
Office Products                    410
Health & Personal Care             402
Books                              382
Automotive                         363
Arts, Crafts & Sewing              195
All Beauty                         174
Car Electronics                    117
Pet Supplies                       100
Portable Audio & Accessories        56
Baby                                44
Video Games                         43
Grocery                             24
Appliances               

In [8]:
## Keep only products with a price
print(df_items_cat['price'].count())
print(df_items_cat[df_items_cat['price'] != 'None']['price'].count())
## .copy() so that columns added later do not trigger SettingWithCopyWarning
df_items_price = df_items_cat[df_items_cat['price'] != 'None'].copy()
df_items_price['price'].astype(float).describe()

183052
69492


count    69492.000000
mean       136.649016
std        325.308437
min          0.010000
25%         15.290000
50%         37.865000
75%        129.000000
max      15499.950000
Name: price, dtype: float64
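
The filter above relies on missing prices being stored as the string `'None'`. An alternative that also catches real missing values is sketched below on toy data; the column name `price_num` is an assumption, not something the notebook uses.

```python
import pandas as pd

# Toy prices: strings, the literal "None", and a real missing value.
toy = pd.DataFrame({"price": ["28.99", "None", "10.99", None]})

# errors="coerce" turns anything that is not a number into NaN.
toy["price_num"] = pd.to_numeric(toy["price"], errors="coerce")

# Keep only rows with a usable price; .copy() gives an independent DataFrame.
with_price = toy.dropna(subset=["price_num"]).copy()
print(with_price["price_num"].describe())
```

This keeps the numeric version next to the raw string, so the original column stays available for inspection.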

In [9]:
## Overview of the categories
print(df_items_price['categories'].value_counts())


[]                                                                                                                              3572
[Musical Instruments, Instrument Accessories, Guitar & Bass Accessories, Picks & Pick Holders, Picks]                           1804
[Musical Instruments, Instrument Accessories, Guitar & Bass Accessories, Electric Guitar Parts, Pickups & Pickup Covers]        1425
[Musical Instruments, Microphones & Accessories, Microphones, Wireless Microphones & Systems, Wireless Lavalier Microphones]    1144
[Musical Instruments, Instrument Accessories, Guitar & Bass Accessories, Straps & Strap Locks, Straps]                          1102
                                                                                                                                ... 
[Musical Instruments, Wind & Woodwind Instruments, Folk & World, Shenais]                                                          1
[Musical Instruments, Electronic Music, DJ & Karaoke, Karaoke Equipme

In [10]:
## Isolate the main categories
#df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None])[-1])
## Take the second level of the category path (None when the path is empty)
df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None, None])[1])

## Overview of the options
print(df_items_price['categories_single'].value_counts())

## To group
# Parts & Accessories
# Other


Instrument Accessories                         32146
Microphones & Accessories                       6007
Drums & Percussion                              5555
Live Sound & Stage                              4189
Guitars                                         3001
Amplifiers & Effects                            2977
Studio Recording Equipment                      2886
Electronic Music, DJ & Karaoke                  2326
Wind & Woodwind Instruments                     1932
Band & Orchestra                                1546
Keyboards & MIDI                                1245
Ukuleles, Mandolins & Banjos                    1144
Bass Guitars                                     376
Stringed Instruments                             276
Drum & Percussion Accessories                    227
Prime Card Bonus                                  39
Musical Instruments Outlet                         9
Musical Instruments Savings                        7
Musical Instruments Beginner Store            
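
The second-level extraction can be illustrated on a few hand-written category paths. The helper `second_level` is hypothetical; unlike the lambda in the cell, it also returns `None` for a path with a single element instead of raising an `IndexError`.

```python
import pandas as pd

def second_level(categories):
    """Return the second element of a category path, or None if it is too short."""
    if categories is None or len(categories) < 2:
        return None
    return categories[1]

toy = pd.Series([
    ["Musical Instruments", "Instrument Accessories", "Guitar & Bass Accessories"],
    [],                       # empty path
    ["Musical Instruments"],  # only the top level
])

print(toy.apply(second_level).tolist())
# ['Instrument Accessories', None, None]
```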



In [11]:
## Group the categories
df_items_price['categories_grp'] = np.where(df_items_price['categories_single'] == "Instrument Accessories", "Instrument Accessories", "Other")

df_items_price['categories_grp'].value_counts()



Other                     37346
Instrument Accessories    32146
Name: categories_grp, dtype: int64
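
Adding a column to a filtered DataFrame can trigger pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the filtered object is a view or a copy. A minimal sketch of the usual remedy, on toy data with made-up values, is to take an explicit `.copy()` right after filtering:

```python
import numpy as np
import pandas as pd

items = pd.DataFrame({
    "price": ["28.99", "None", "10.99"],
    "categories_single": ["Guitars", "Guitars", "Instrument Accessories"],
})

# Explicit copy: later assignments clearly target this new DataFrame.
priced = items[items["price"] != "None"].copy()

priced["categories_grp"] = np.where(
    priced["categories_single"] == "Instrument Accessories",
    "Instrument Accessories",
    "Other",
)

print(priced["categories_grp"].value_counts())
```

The original `items` frame is left untouched, which makes it safe to re-run the cell.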

In [11]:
### Candidate features for XGBoost
## Reviews
# Create the image-presence variable (as_image)
# verified_purchase
# helpful_vote (as_helpful_vote)

## Items
# Sure: main_category, average_rating, rating_number, price
# categories_grp
# Potential:
#   XTitle? Anything to do with it?
#   *store name? Anything to do with it?
#   **categories (potentially something to do with this!) (extract from the dictionary!)
# rating_number (number of ratings behind the average)


#### Merge Items to Reviews

In [59]:
# Keep only necessary variables before merging
df_reviews_f = df_reviews[['rating', 'full_text', 'as_image', 'parent_asin', 'as_helpful_vote', 'helpful_vote', 'verified_purchase']]
df_items_f = df_items_price[['main_category', 'average_rating', 'rating_number', 'price', 'parent_asin', 'categories_grp']]

# Merge Items on Reviews
df_full = df_reviews_f.merge(df_items_f, on='parent_asin', how='left')

# Filter Price
df_full_price = df_full[df_full['price'] != 'None']

# Keep only necessary variables
df_final = df_full_price[['parent_asin', # both
               'rating', 'full_text', 'as_image', 'helpful_vote', 'as_helpful_vote', 'verified_purchase', # reviews
               'main_category', 'average_rating', 'rating_number', 'price', 'categories_grp']] # items

# Drop rows with missing values (reviews without a matching item have no main_category or price)
df_final = df_final.dropna()
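
A left merge followed by `dropna()` keeps only the reviews whose `parent_asin` was found in the item table. The sketch below, on toy data with made-up identifiers, uses `indicator=True` to count unmatched reviews before they are dropped, and compares the result with an inner merge.

```python
import pandas as pd

reviews = pd.DataFrame({"parent_asin": ["A", "B", "C"], "rating": [5.0, 4.0, 3.0]})
items = pd.DataFrame({"parent_asin": ["A", "B"], "price": ["28.99", "10.99"]})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge.
merged = reviews.merge(items, on="parent_asin", how="left", indicator=True)
print(merged["_merge"].value_counts())

# Left merge + dropna keeps the same reviews as an inner merge in this toy case.
left_then_drop = merged.drop(columns="_merge").dropna()
inner = reviews.merge(items, on="parent_asin", how="inner")
print(len(left_then_drop), len(inner))
```

Note that `dropna()` also removes rows where any other kept column is missing, so on real data the two approaches can differ.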

In [13]:
df_final

Unnamed: 0,parent_asin,rating,full_text,as_image,helpful_vote,as_helpful_vote,verified_purchase,main_category,average_rating,rating_number,price,categories_grp
2,B0040FJ27S,4.0,Title : okay\n Review : pretty good overall. ...,0,0,0,True,Musical Instruments,4.5,1763.0,28.99,Other
3,B00WJ3HL5I,3.0,Title : Easy to return.\n Review : Too bad it ...,0,0,0,True,Musical Instruments,4.6,15012.0,10.99,Instrument Accessories
4,B07T9NM5QR,5.0,Title : Good product despite tight bolt.\n Rev...,0,0,0,False,Musical Instruments,4.7,2032.0,48.39,Other
7,B0C5D8X3JT,5.0,Title : Xlnt product\n Review : Easy to use. G...,0,0,0,True,Musical Instruments,4.6,6408.0,26.99,Instrument Accessories
8,B074VB2D2K,5.0,Title : Nice Karaoke machine\n Review : My gra...,0,3,1,True,Musical Instruments,4.5,861.0,44.88,Other
...,...,...,...,...,...,...,...,...,...,...,...,...
3017430,B0017T7XCG,4.0,Title : Four Stars\n Review : Thanks,0,0,0,True,Musical Instruments,4.6,321.0,219.28,Other
3017431,B0B4SHB5T3,3.0,"Title : Good harps, but\n Review : Good harps,...",0,3,1,True,Musical Instruments,4.4,687.0,53.39,Other
3017432,B0017T7XCG,5.0,Title : Five Stars\n Review : Great price for ...,0,0,0,True,Musical Instruments,4.6,321.0,219.28,Other
3017433,B0017T7XCG,5.0,Title : Great value\n Review : Meets my need,0,0,0,True,Musical Instruments,4.6,321.0,219.28,Other


In [14]:
#df_final.iloc[1]['full_text']

### Save


In [128]:
## Drop (by position) the rows flagged while debugging to_csv (see the OLD section below)
rows_to_drop = [821653, *range(1394260, 1394275)]
df_final = df_final.drop(df_final.index[rows_to_drop])

In [None]:
## Save data
df_final.to_csv('./../data/musical_instruments.csv')

#### OLD - finding the errors in to_csv

In [62]:
df_final.iloc[821654]['full_text']

'Title : Nice mixer for basic purposes.\n Review : No switch, no big deal for me. Do not have any heat issues some have reported. I use to mix electronic drums/guitar for jamming along as well as occasional DJ/Recording. Excellent sound, low noise, great for my purposes.'

In [124]:
test = pd.read_csv('./../data/musical_instruments.csv')

In [125]:
test.count()

Unnamed: 0           1394259
parent_asin          1394259
rating               1394259
full_text            1394259
as_image             1394259
helpful_vote         1394259
as_helpful_vote      1394259
verified_purchase    1394259
main_category        1394259
average_rating       1394259
rating_number        1394259
price                1394259
categories_grp       1394259
dtype: int64

In [112]:
test.iloc[1394270]['full_text']

IndexError: single positional indexer is out-of-bounds

In [106]:
df_final.iloc[1394271]['full_text']

'Title : Punching 2 to 3 times Its Price!\n Review : So I got this microphone with no particular expectations, but I did need a microphone to have at a secondary location. However, my budget would not allow me to get what I use for the content I normally correct which is Marantz MPM-2000u Microphone. The TLDR is I am beyond impressed with the sound quality of this microphone for the price of admission. So lets get into the list of pros and cons!<br /><br />Pros:<br />-Universal Plug and Play- Plug in and just works (tested on MacOS Catalina, Windows 10, and Manjaro Linux)<br />-Solid build quality (has a nice metal feel and weight to it even though its much smaller than my Marantz)<br />-Sound quality- I could rant and rave for hours about this but my expectation were low and I have now had to  readjust what my expectations of the sound quality of lower cost mics do to this one. ( I would put the sound quality just slightly behind the Blue Yeti)<br />-Price- The barrier to entry for Po

In [126]:
test2 = df_final.drop(df_final.index[[821653, 1394260, 1394261, 1394262, 1394263, 1394264, 1394265, 1394266, 1394267, 1394268, 1394269, 1394270, 1394271, 1394272, 1394273, 1394274]])

In [94]:
df_final.count()

parent_asin          1788161
rating               1788161
full_text            1788161
as_image             1788161
helpful_vote         1788161
as_helpful_vote      1788161
verified_purchase    1788161
main_category        1788161
average_rating       1788161
rating_number        1788161
price                1788161
categories_grp       1788161
dtype: int64

In [95]:
test2.count()

parent_asin          1788160
rating               1788160
full_text            1788160
as_image             1788160
helpful_vote         1788160
as_helpful_vote      1788160
verified_purchase    1788160
main_category        1788160
average_rating       1788160
rating_number        1788160
price                1788160
categories_grp       1788160
dtype: int64

In [97]:
df_final.iloc[821653]['full_text']

'Title : highly recommend!!!💞💞\n Review : 😃Very beautiful instrument. Me and my two kids 👧👦👩love our ukulele.  I start to learn strumming in first 3 days.🎶🎵🎼 It comes with everything  you need. Highly recommend it to anyone . Great buy !!<br />💖💖约3:16 \u3000神爱世人、甚至将他的独生子赐给他们、叫一切信他的、不至灭亡、反得永生。\x00<br />🌺🌺For God so loved the world, that he gave his one and only Son, that whoever believes in him should not perish, but have eternal life.'
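
The review printed above contains a NUL character (`\x00`). It is only an assumption that this character is what broke the CSV round trip, but stripping it before saving is cheap. The sketch below uses toy rows and an in-memory buffer instead of the real file, then checks that the number of rows survives a write and read.

```python
import io
import pandas as pd

# Toy rows; the second one carries a NUL character like the review above.
toy = pd.DataFrame({
    "full_text": [
        "Title : ok\n Review : fine",
        "Title : uke\n Review : nice\x00<br />more",
    ],
    "rating": [4.0, 5.0],
})

# Remove NUL characters (assumed culprit) without touching the original frame.
clean = toy.copy()
clean["full_text"] = clean["full_text"].str.replace("\x00", "", regex=False)

# Round trip through an in-memory CSV and compare row counts.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(len(clean), len(round_trip))
```

Comparing `len(df_final)` with the length of the re-read file after each save would flag this kind of silent row loss early.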

In [114]:
test2.iloc[1394270]['full_text']

'Title : Get for $40 what others charge $200 for\n Review : First off, this product was given to me for review. Now, onto the actual review! So I had the chance to review this mic over the last week or 2. I have used this to record 5 of my YouTube channel videos, and did a few livestreams with them as well.  So for me, the biggest thing that I require out of any piece of audio equipment is that it works on Linux. The reason for this is all my content is shot edited and recorded on Manjaro KDE, but I am also a Windows, and MacOS user. So it is imperative that my audio equipment play well in all 3. To which, I can say that the K031B does just that.<br /><br />Pros-<br />+UPnP Audio device<br />+Works in Linux,Windows, and MacOS<br />+Sound quality for the price is fantastic from both headset and Lav Mic<br />+Build of the USB receiver, headset mic and lapel mic, and transmitter is good<br />+Headset (great for livestreaming)<br />+Lav Mic<br />+Transmission range is quite good again for 