### Link to the dataset

#### https://amazon-reviews-2023.github.io/index.html

### Data Fields

#### For User Reviews

| Field         | Type   | Explanation                                                                                                                                                                  |
|---------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| rating        | float  | Rating of the product (from 1.0 to 5.0).                                                                                                                                      |
| title         | str    | Title of the user review.                                                                                                                                                    |
| text          | str    | Text body of the user review.                                                                                                                                                 |
| images        | list   | Images that users post after they have received the product. Each image has different sizes (small, medium, large), represented by the small_image_url, medium_image_url, and large_image_url respectively. |
| asin          | str    | ID of the product.                                                                                                                                                           |
| parent_asin   | str    | Parent ID of the product. Note: Products with different colors, styles, sizes usually belong to the same parent ID. The “asin” in previous Amazon datasets is actually parent ID. Please use parent ID to find product meta. |
| user_id       | str    | ID of the reviewer.                                                                                                                                                          |
| timestamp     | int    | Time of the review (unix time).                                                                                                                                              |
| verified_purchase | bool | User purchase verification.                                                                                                                                               |
| helpful_vote  | int    | Helpful votes of the review.                                                                                                                                                 |
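
As a minimal sketch of how one review record with these fields could be read, the snippet below parses a single JSON Lines record with the standard library. The sample record, its values, and the helper name `review_time` are illustrative assumptions and are not taken from the dataset. It also assumes that the timestamp may be stored in milliseconds and treats anything above a hypothetical cutoff as milliseconds; a value that large would not fit in a 32-bit integer.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample line; the values are made up for illustration.
SAMPLE_LINE = json.dumps({
    "rating": 5.0,
    "title": "Great movie",
    "text": "Watched it twice.",
    "images": [],
    "asin": "B000000001",
    "parent_asin": "B000000000",
    "user_id": "USER_A",
    "timestamp": 1588687728923,
    "verified_purchase": True,
    "helpful_vote": 0,
})

INT32_MAX = 2**31 - 1
MS_CUTOFF = 10**11  # assumption: larger values are milliseconds, not seconds


def review_time(timestamp):
    """Convert a unix timestamp (seconds or milliseconds) to a UTC datetime."""
    seconds = timestamp / 1000 if timestamp > MS_CUTOFF else timestamp
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


review = json.loads(SAMPLE_LINE)
fits_in_int32 = review["timestamp"] <= INT32_MAX
print(review["parent_asin"], review_time(review["timestamp"]).year, fits_in_int32)
```

If the real files do use millisecond timestamps, a 64-bit integer type is the safer choice when declaring a schema for this field.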

#### For Item Metadata

| Field          | Type   | Explanation                                                                                                                                                                  |
|----------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| main_category  | str    | Main category (i.e., domain) of the product.                                                                                                                                 |
| title          | str    | Name of the product.                                                                                                                                                         |
| average_rating | float  | Rating of the product shown on the product page.                                                                                                                              |
| rating_number  | int    | Number of ratings for the product.                                                                                                                                            |
| features       | list   | Bullet-point format features of the product.                                                                                                                                  |
| description    | list   | Description of the product.                                                                                                                                                   |
| price          | float  | Price in US dollars (at time of crawling).                                                                                                                                     |
| images         | list   | Images of the product. Each image has different sizes (thumb, large, hi_res). The “variant” field shows the position of image.                                               |
| videos         | list   | Videos of the product including title and url.                                                                                                                                |
| store          | str    | Store name of the product.                                                                                                                                                    |
| categories     | list   | Hierarchical categories of the product.                                                                                                                                       |
| details        | dict   | Product details, including materials, brand, sizes, etc.                                                                                                                     |
| parent_asin    | str    | Parent ID of the product.                                                                                                                                                    |
| bought_together | list   | Recommended bundles from the websites.                                                                                                                                       |
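
Because both tables carry `parent_asin`, reviews can be matched to product metadata through that key. The following is a small pure-Python sketch of an inner join on `parent_asin`. The records and the function name `inner_join_on_parent_asin` are hypothetical, the sketch assumes at most one metadata record per `parent_asin`, and the `review_`/`product_` prefixes are just one way to keep the two `title` fields apart.

```python
# Hypothetical records; only a few fields are shown.
reviews = [
    {"parent_asin": "P1", "title": "Loved it", "rating": 5.0},
    {"parent_asin": "P2", "title": "Too long", "rating": 2.0},
    {"parent_asin": None, "title": "No key", "rating": 3.0},
]
metadata = [
    {"parent_asin": "P1", "title": "Some Movie (DVD)", "average_rating": 4.6},
]


def inner_join_on_parent_asin(reviews, metadata):
    """Keep only reviews whose parent_asin has a metadata record."""
    meta_by_id = {
        m["parent_asin"]: m for m in metadata if m["parent_asin"] is not None
    }
    joined = []
    for review in reviews:
        meta = meta_by_id.get(review["parent_asin"])
        if meta is None:
            continue  # no matching product (or missing key): dropped by an inner join
        row = {"parent_asin": review["parent_asin"]}
        # Prefix the remaining fields so the two "title" columns do not collide.
        row.update({f"review_{k}": v for k, v in review.items() if k != "parent_asin"})
        row.update({f"product_{k}": v for k, v in meta.items() if k != "parent_asin"})
        joined.append(row)
    return joined


joined = inner_join_on_parent_asin(reviews, metadata)
print(len(joined), joined[0]["review_title"], "|", joined[0]["product_title"])
```

In this sketch, reviews with a missing key or with no matching product are dropped, so no joined row can have an empty `parent_asin`.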


In [23]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType, BooleanType, ArrayType, MapType

reviews = "Movies_and_TV.jsonl"
meta_datas = "meta_Movies_and_TV.jsonl"

# Create a Spark session
spark = SparkSession.builder.appName("AmazonReviews").getOrCreate()

# Define the schema for the Movies_and_TV.jsonl file
schema_reviews = StructType([
    StructField("rating", FloatType(), True),
    StructField("title", StringType(), True),
    StructField("text", StringType(), True),
    StructField("images", ArrayType(StringType()), True),
    StructField("asin", StringType(), True),
    StructField("parent_asin", StringType(), True),
    StructField("user_id", StringType(), True),
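    # Caution: IntegerType is 32-bit. If the timestamps are stored in milliseconds,
    # they will overflow it; LongType (also in pyspark.sql.types) would be needed.
    StructField("timestamp", IntegerType(), True),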
    StructField("verified_purchase", BooleanType(), True),
    StructField("helpful_vote", IntegerType(), True)
])

# Define the schema for the meta_Movies_and_TV.jsonl file
schema_metadata = StructType([
    StructField("main_category", StringType(), True),
    StructField("title", StringType(), True),
    StructField("subtitle", StringType(), True),
    StructField("average_rating", FloatType(), True),
    StructField("rating_number", IntegerType(), True),
    StructField("features", ArrayType(StringType()), True),
    StructField("description", ArrayType(StringType()), True),
    StructField("price", FloatType(), True),
    StructField("images", ArrayType(MapType(StringType(), StringType())), True),
    StructField("videos", ArrayType(StringType()), True),
    StructField("store", StringType(), True),
    StructField("categories", ArrayType(StringType()), True),
    StructField("details", MapType(StringType(), StringType()), True),
    StructField("parent_asin", StringType(), True),
    StructField("bought_together", ArrayType(StringType()), True)
])

# Load the JSON data into Spark DataFrames using the schemas defined above
df_reviews_spark = spark.read.schema(schema_reviews).json(reviews)
df_metadata_spark = spark.read.schema(schema_metadata).json(meta_datas)

# Print the schemas to check the structure of the data
df_reviews_spark.printSchema()
df_metadata_spark.printSchema()

# Print the number of rows in each DataFrame
print(f"Number of rows in df_reviews_spark: {df_reviews_spark.count()}")
print(f"Number of rows in df_metadata_spark: {df_metadata_spark.count()}")

# Show the first 5 rows of each DataFrame
df_reviews_spark.show(5)
df_metadata_spark.show(5)

# Join on the 'parent_asin' column
# An inner join keeps only the rows whose parent_asin appears in both DataFrames; the others are dropped
df_merged_spark = df_reviews_spark.join(df_metadata_spark, 'parent_asin', 'inner')

# Print the number of rows in the merged DataFrame
print(f"Number of rows in the merged DataFrame: {df_merged_spark.count()}")

# Count missing 'parent_asin' values in the two source DataFrames (before the join)
missing_reviews = df_reviews_spark.filter(df_reviews_spark.parent_asin.isNull()).count()
missing_reviews_metadata = df_metadata_spark.filter(df_metadata_spark.parent_asin.isNull()).count()
print(f"Records with missing 'parent_asin' before the join: reviews={missing_reviews}, metadata={missing_reviews_metadata}")

# Count missing 'parent_asin' values after the join
# (an inner join never matches null keys, so this count is expected to be 0)
missing_after_join = df_merged_spark.filter(df_merged_spark.parent_asin.isNull()).count()
print(f"Number of reviews with missing 'parent_asin' after the join: {missing_after_join}")




root
 |-- rating: float (nullable = true)
 |-- title: string (nullable = true)
 |-- text: string (nullable = true)
 |-- images: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- asin: string (nullable = true)
 |-- parent_asin: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- timestamp: integer (nullable = true)
 |-- verified_purchase: boolean (nullable = true)
 |-- helpful_vote: integer (nullable = true)

root
 |-- main_category: string (nullable = true)
 |-- title: string (nullable = true)
 |-- subtitle: string (nullable = true)
 |-- average_rating: float (nullable = true)
 |-- rating_number: integer (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- description: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- price: float (nullable = true)
 |-- images: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |  