## Code to collect reviews of different products
* Import the data
    * Appliances
    * Automotive
    * CDs_and_Vinyl
    * Digital_Music
    * Gift_Cards
    * Handmade_Products
    * Musical_Instruments
    * Video_Games
* Filter the reviews I want to keep (quantity, balance of ratings, availability of information for XGBoost [e.g., price, ...])
    * Don't forget to set the random seed
* Save the data
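
The filtering step above (balanced ratings, fixed seed) could be sketched roughly as follows. This is only an illustration on toy data: the helper name `balanced_sample`, the seed value, and the choice to downsample every rating to the size of the rarest one are assumptions, not part of the notebook.

```python
import pandas as pd

SEED = 42  # hypothetical seed value, chosen only for this illustration


def balanced_sample(df: pd.DataFrame, label_col: str = "rating", seed: int = SEED) -> pd.DataFrame:
    """Return the same number of rows for every value of `label_col`."""
    # Size of the rarest class: every class is downsampled to this count.
    n_per_class = int(df[label_col].value_counts().min())
    return df.groupby(label_col).sample(n=n_per_class, random_state=seed)


# Toy data standing in for the review table (not real reviews).
toy = pd.DataFrame({
    "rating": [5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 1.0, 1.0, 1.0],
    "full_text": [f"review {i}" for i in range(9)],
})
sample = balanced_sample(toy)
print(sample["rating"].value_counts())
```

Passing the same `random_state` on every run keeps the selected rows reproducible, which is the point of the "set the random seed" reminder.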

## Data pipeline


In [1]:
### Base packages
import numpy as np
import pandas as pd
from datasets import load_dataset

### Digital Music
#### Import

In [2]:
## Load User Reviews
dataset_reviews = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Digital_Music", split="full", trust_remote_code=True)

## Load Item Metadata
dataset_items = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Digital_Music", split="full", trust_remote_code=True)


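
Since the same import is planned for several categories, the two configuration names could be derived from the category name. The sketch below assumes that every category follows the `raw_review_<Category>` / `raw_meta_<Category>` pattern seen in the two calls above; the helper `config_names` is hypothetical.

```python
CATEGORIES = [
    "Appliances", "Automotive", "CDs_and_Vinyl", "Digital_Music",
    "Gift_Cards", "Handmade_Products", "Musical_Instruments", "Video_Games",
]


def config_names(category: str) -> tuple[str, str]:
    """Build the (reviews, metadata) configuration names for one category."""
    return f"raw_review_{category}", f"raw_meta_{category}"


for category in CATEGORIES:
    review_config, meta_config = config_names(category)
    print(review_config, meta_config)
```

Each pair of names could then be passed as the second argument of `load_dataset`, as in the cell above.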

#### Reviews

In [None]:
## Convert to df (easier for some manipulations and EDA)
df_reviews = pd.DataFrame(dataset_reviews)
print(df_reviews.count())

In [4]:
### Data manipulation
## Concat title and text
df_reviews['full_text'] = "Title : " + df_reviews['title'].astype(str) + "\n Review : " + df_reviews['text'].astype(str)

## Create variable as_image
df_reviews['as_image'] = np.where(df_reviews['images'].str.len() == 0, 0, 1)

## Create variable as_helpful_vote
df_reviews['as_helpful_vote'] = np.where(df_reviews['helpful_vote'] == 0, 0, 1)
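
A small sketch of the same three derived columns on toy data, to make their meaning explicit. The toy values and the `url` key are placeholders. Unlike the cell above, this variant uses `fillna("")`, so a missing title becomes an empty string, whereas `astype(str)` turns it into the literal text "None".

```python
import pandas as pd

# Toy reviews standing in for df_reviews (values are made up).
toy_reviews = pd.DataFrame({
    "title": ["Five Stars", None],
    "text": ["Love the old stuff.", "Great live album!"],
    "images": [[], [{"url": "placeholder"}]],
    "helpful_vote": [0, 3],
})

# Missing titles become empty strings instead of the literal "None".
title = toy_reviews["title"].fillna("")
text = toy_reviews["text"].fillna("")
toy_reviews["full_text"] = "Title : " + title + "\n Review : " + text

# Binary flags: 1 when the review has at least one image / helpful vote.
toy_reviews["as_image"] = (toy_reviews["images"].str.len() > 0).astype(int)
toy_reviews["as_helpful_vote"] = (toy_reviews["helpful_vote"] > 0).astype(int)

print(toy_reviews[["full_text", "as_image", "as_helpful_vote"]])
```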

#### Items

In [5]:
## Convert to pandas df
df_items = pd.DataFrame(dataset_items)

In [7]:
## Main category (filter to keep only "Digital Music")
print(df_items['main_category'].value_counts())
df_items_cat = df_items[df_items['main_category'] == "Digital Music"]
print(df_items_cat['main_category'].value_counts())

Digital Music    70537
Name: main_category, dtype: int64
Digital Music    70537
Name: main_category, dtype: int64


In [8]:
## Filter to keep only the products with a price
print(df_items_cat['price'].count())
print(df_items_cat[df_items_cat['price'] != 'None']['price'].count())
df_items_price = df_items_cat[df_items_cat['price'] != 'None']
df_items_price['price'].astype(float).describe()

70537
40125


count    40125.000000
mean        40.199933
std         63.837289
min          0.010000
25%         12.990000
50%         23.900000
75%         42.850000
max       2200.000000
Name: price, dtype: float64
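
As a possible alternative to comparing against the string 'None', the price column could be converted once with `pd.to_numeric`. The sketch below runs on toy values. The malformed "from 9.99" entry is a hypothetical case added for illustration, not something observed in this data.

```python
import pandas as pd

# Toy price column: strings, with "None" marking a missing price.
# "from 9.99" is a hypothetical malformed entry, not taken from the data.
raw_price = pd.Series(["12.99", "None", "23.9", "from 9.99"], name="price")

# errors="coerce" turns anything that is not a number into NaN.
price = pd.to_numeric(raw_price, errors="coerce")

# Keep only the rows whose price could be parsed.
has_price = price.notna()
print(price[has_price].describe())
```

With this approach the numeric column is available for the later modelling step without a separate `astype(float)` call.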

In [10]:
## Overview of the categories
print(df_items_price['categories'].value_counts())


[]                                                       40119
[Digital Music, Music By Price, $5.00 to $5.99]              2
[Digital Music, International Music, Far East & Asia]        1
[Digital Music, Country]                                     1
[Digital Music, Music By Price, $8.00 to $8.99]              1
[Digital Music, Rock]                                        1
Name: categories, dtype: int64


In [11]:
## Isolate the important categories
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
df_items_price = df_items_price.copy()
#df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None])[-1])
df_items_price['categories_single'] = df_items_price['categories'].apply(lambda x : (x or [None, None])[1])

## Overview of the options
print(df_items_price['categories_single'].value_counts())

## To group
# Parts & Accessories
# Other


Music By Price         3
International Music    1
Country                1
Rock                   1
Name: categories_single, dtype: int64


In [12]:
## Group the categories
df_items_price['categories_grp'] = df_items_price['categories_single']

df_items_price['categories_grp'].value_counts()


Music By Price         3
International Music    1
Country                1
Rock                   1
Name: categories_grp, dtype: int64
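
The lambda `(x or [None, None])[1]` covers empty lists, but it would raise an `IndexError` on a one-element path such as `['Digital Music']`. No such path appears in the output above, so this is a precaution rather than a fix for an observed failure. A more defensive helper, named `second_level` here purely for illustration, might look like this:

```python
from typing import Optional, Sequence


def second_level(categories: Optional[Sequence[str]]) -> Optional[str]:
    """Return the second entry of a category path, or None if there is none."""
    if categories is None or len(categories) < 2:
        return None
    return categories[1]


print(second_level(["Digital Music", "Music By Price", "$5.00 to $5.99"]))
print(second_level(["Digital Music"]))
print(second_level([]))
```

It could be applied in the same way as the lambda, for example `df_items_price['categories'].apply(second_level)`.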

In [11]:
### Potential features for XGBoost
## Reviews
# Create image-presence variable (as_image)
# verified_purchase
# helpful_vote (as_helpful_vote)

## Items
# Sure: main_category, average_rating, rating_number, price
# categories_grp
# Potential:
#   XTitle? Anything to do with this?
#   *store name? Anything to do with this?
#   **categories (potentially something to do with this!) (extract from the dictionary!)
# rating_number (number of ratings behind the average)


#### Merge Items to Reviews

In [13]:
# Keep only necessary variables before merging
df_reviews_f = df_reviews[['rating', 'full_text', 'as_image', 'parent_asin', 'as_helpful_vote', 'helpful_vote', 'verified_purchase']]
df_items_f = df_items_price[['main_category', 'average_rating', 'rating_number', 'price', 'parent_asin', 'categories_grp']]

# Merge Items on Reviews
df_full = df_reviews_f.merge(df_items_f, on='parent_asin', how='left')

# Filter Price
df_full_price = df_full[df_full['price'] != 'None']

# Keep only necessary variables
df_final = df_full_price[['parent_asin', # both
               'rating', 'full_text', 'as_image', 'helpful_vote', 'as_helpful_vote', 'verified_purchase', # reviews
               'main_category', 'average_rating', 'rating_number', 'price', 'categories_grp']] # items

# Drop rows with missing values (e.g. reviews without a matching priced item, or items without a category)
df_final = df_final.dropna()
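
To see how many reviews actually find a matching item in the left merge, pandas offers the `validate` and `indicator` options. The sketch below uses toy tables; the ASINs and values are made up.

```python
import pandas as pd

# Toy tables standing in for df_reviews_f and df_items_f (values are made up).
reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "B2", "C3"],
    "rating": [5.0, 4.0, 5.0, 1.0],
})
items = pd.DataFrame({
    "parent_asin": ["A1", "B2"],
    "price": ["2.99", "7.97"],
})

merged = reviews.merge(
    items,
    on="parent_asin",
    how="left",
    validate="many_to_one",  # raises if an item appears twice in `items`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Reviews without a matching item show up as "left_only".
print(merged["_merge"].value_counts())
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Filtering on `_merge == "both"` makes the intent explicit, whereas `dropna()` also removes rows that are missing a value for any other reason.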

In [14]:
df_final

| | parent_asin | rating | full_text | as_image | helpful_vote | as_helpful_vote | verified_purchase | main_category | average_rating | rating_number | price | categories_grp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1424 | B00D6H7KTS | 5.0 | Title : In praise of Rahman!\n Review : My CD ... | 0 | 7 | 1 | True | Digital Music | 4.3 | 13.0 | 2.99 | International Music |
| 3503 | B00108JUAW | 5.0 | Title : Five Stars\n Review : Love the old stuff. | 0 | 0 | 0 | True | Digital Music | 4.3 | 84.0 | 1.98 | Music By Price |
| 10452 | B01LWRURZR | 5.0 | Title : Great live album!\n Review : This cd r... | 0 | 3 | 1 | True | Digital Music | 4.8 | 120.0 | 7.97 | Country |
| 20819 | B00D6H7KTS | 4.0 | Title : A Ranjhaa called Dhanush THE FILM REVI... | 0 | 1 | 1 | False | Digital Music | 4.3 | 13.0 | 2.99 | International Music |
| 21504 | B007D5C8QO | 5.0 | Title : Five Stars\n Review : Best massage mp3... | 0 | 0 | 0 | True | Digital Music | 4.8 | 14.0 | 9.49 | Music By Price |
| 22866 | B01LWRURZR | 5.0 | Title : Five Stars\n Review : Can't go wrong w... | 0 | 0 | 0 | True | Digital Music | 4.8 | 120.0 | 7.97 | Country |
| 28361 | B00108JUAW | 5.0 | Title : Love the music; love the memories it b... | 0 | 0 | 0 | True | Digital Music | 4.3 | 84.0 | 1.98 | Music By Price |
| 29625 | B01LWRURZR | 5.0 | Title : Five Stars\n Review : I liked this alb... | 0 | 0 | 0 | True | Digital Music | 4.8 | 120.0 | 7.97 | Country |
| 29817 | B07DWFM4JK | 5.0 | Title : THIS REVIEW IS FOR THE 2018 3 C... | 0 | 7 | 1 | True | Digital Music | 4.4 | 51.0 | 27.52 | Rock |
| 40555 | B01LWRURZR | 5.0 | Title : Five Stars\n Review : Outstanding! | 0 | 0 | 0 | True | Digital Music | 4.8 | 120.0 | 7.97 | Country |


In [14]:
#df_final.iloc[1]['full_text']

### Save


In [15]:
## Save data
df_final.to_csv('./../data/digital_music.csv')
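
Because `to_csv` writes the row index by default, reading the file back without options yields an extra `Unnamed: 0` column. The round trip below illustrates this on a toy frame, with an in-memory buffer standing in for the CSV file.

```python
import io

import pandas as pd

# Toy frame with a non-default index, like df_final after filtering.
toy_final = pd.DataFrame(
    {"parent_asin": ["B00D6H7KTS", "B00108JUAW"], "rating": [5.0, 5.0]},
    index=[1424, 3503],
)

buffer = io.StringIO()  # in-memory stand-in for the CSV file
toy_final.to_csv(buffer)

# Default read: the saved index comes back as a column named "Unnamed: 0".
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores the original row labels instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(naive.columns.tolist())
print(restored.index.tolist())
```

Alternatively, passing `index=False` to `to_csv` leaves the index out of the file when the original row numbers are not needed.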