# Kaggle eCommerce Behavior Analysis
This notebook briefly analyzes the [Kaggle eCommerce Behavior dataset](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store).

## Data Gathering
### Data set download using Kaggle Python API
The first part of this notebook downloads the data set using the Kaggle Python API.

The files are downloaded only if they are not already in the local folder or if their size no longer matches the one reported by Kaggle.

In [1]:
import kaggle
from typing import Dict
import os
from pathlib import Path

kaggle.api.authenticate()

dataset_author = "mkechinov"
dataset_name = "ecommerce-behavior-data-from-multi-category-store"
dataset_full_name = f"{dataset_author}/{dataset_name}"

dataset_folder = "dataset"
os.makedirs(dataset_folder, exist_ok=True)

print(f"Listing local csv files in ./{dataset_folder}.")
local_files_dict: Dict[str, Path] = {}
for path in Path(dataset_folder).iterdir():
    if path.is_file() and path.suffix == ".csv":
        print(
            f"File {path.name} with size {path.stat().st_size} found in ./{dataset_folder}"
        )
        local_files_dict[path.name] = path

download_files = False
print(f"Listing files associated with Kaggle dataset {dataset_full_name}.")
for file in kaggle.api.datasets_list_files(dataset_author, dataset_name)[
    "datasetFiles"
]:
    file_name: str = file["name"]
    file_size: int = file["totalBytes"]
    print(f"File {file_name} with size {file_size} retrieved from Kaggle API.")
    if (
        file_name not in local_files_dict
        or file_size != local_files_dict[file_name].stat().st_size
    ):
        print(
            f"File {file_name} non existing locally or not having the same size."
        )
        download_files = True
        break

if download_files:
    # The API expects the full "owner/dataset" identifier, not the bare dataset name
    kaggle.api.dataset_download_files(
        dataset_full_name, path=dataset_folder, unzip=True, quiet=False
    )

Listing local csv files in ./dataset.
File 2019-Nov.csv with size 9006762395 found in ./dataset
File 2019-Oct.csv with size 5668612855 found in ./dataset
Listing files associated with Kaggle dataset mkechinov/ecommerce-behavior-data-from-multi-category-store.
File 2019-Nov.csv with size 9006762395 retrieved from Kaggle API.
File 2019-Oct.csv with size 5668612855 retrieved from Kaggle API.
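
The size comparison performed in the cell above could also be factored into a small helper, which makes it testable without calling the Kaggle API. The following is only a sketch: `needs_download` and its `remote_sizes` argument (a mapping from file name to size in bytes) are hypothetical names, and equal file sizes are taken as a heuristic for "up to date", as in the original cell.

```python
from pathlib import Path
from typing import Dict


def needs_download(remote_sizes: Dict[str, int], local_folder: Path) -> bool:
    """Return True if any remote file is missing locally or differs in size."""
    for name, remote_size in remote_sizes.items():
        local_path = local_folder / name
        # A missing file or a size mismatch both mean the local copy is stale.
        if not local_path.is_file() or local_path.stat().st_size != remote_size:
            return True
    return False
```

With the sizes listed above, a call such as `needs_download({"2019-Oct.csv": 5668612855}, Path("dataset"))` would decide whether the download step has to run.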


### PySpark init and data frame creation
Once the CSV files associated with the Kaggle data set are downloaded, we can read them inside a Spark session.

In [2]:
# Import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, isnan, percentile, when
from pyspark.sql.types import DateType

# Create SparkSession
spark = (
    SparkSession.builder.master("local[*]")
    .appName("e_commerce_behavior")
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/08/18 16:32:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Create a DataFrame from 2019-Oct.csv (despite its name, df_sept holds the October data)
# TODO: Use list of files retrieved from Kaggle instead of hard coding file name
df_sept = spark.read.csv(
    f"{dataset_folder}/2019-Oct.csv", header=True, escape='"'
)
df_sept_size = df_sept.count()

                                                                                

In [4]:
# Create a DataFrame from 2019-Nov.csv
df_nov = spark.read.csv(
    f"{dataset_folder}/2019-Nov.csv", header=True, escape='"'
)
df_nov_size = df_nov.count()
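
One possible way to avoid hard coding the file names is to collect the CSV paths from the dataset folder. The sketch below assumes a hypothetical helper named `list_csv_paths`; the commented usage relies on the fact that `spark.read.csv` accepts a list of paths, and it is not executed here.

```python
from pathlib import Path
from typing import List


def list_csv_paths(folder: str) -> List[str]:
    """Return the sorted paths of all .csv files directly inside `folder`."""
    return sorted(str(p) for p in Path(folder).glob("*.csv") if p.is_file())


# Hypothetical usage with the notebook's SparkSession (not executed here):
# csv_paths = list_csv_paths("dataset")
# df_all = spark.read.csv(csv_paths, header=True, escape='"')
```

Reading every file in one call yields a single data frame, so the separate per-month data frames would no longer be needed with this approach.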

                                                                                

## Data Assessment and Cleaning
In this part, we'll assess the data sets and clean them as issues come up.

In [5]:
df_sept.show()

+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand|  price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|2019-10-01 00:00:...|      view|  44600062|2103807459595387724|                NULL|shiseido|  35.79|541312140|72d76fde-8bb3-4e0...|
|2019-10-01 00:00:...|      view|   3900821|2053013552326770905|appliances.enviro...|    aqua|  33.20|554748717|9333dfbd-b87a-470...|
|2019-10-01 00:00:...|      view|  17200506|2053013559792632471|furniture.living_...|    NULL| 543.10|519107250|566511c2-e2e3-422...|
|2019-10-01 00:00:...|      view|   1307067|2053013558920217191|  computers.notebook|  lenovo| 251.74|550050854|7c90fc70-0e80-459...|
|2019-10-01 00:00:...|      view|   1004237|205301355563188265

In [6]:
df_sept.printSchema()

root
 |-- event_time: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_session: string (nullable = true)



In [7]:
df_nov.show()

+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand| price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|2019-11-01 00:00:...|      view|   1003461|2053013555631882655|electronics.smart...|  xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|
|2019-11-01 00:00:...|      view|   5000088|2053013566100866035|appliances.sewing...|  janome|293.65|530496790|8e5f4f83-366c-4f7...|
|2019-11-01 00:00:...|      view|  17302664|2053013553853497655|                NULL|   creed| 28.31|561587266|755422e7-9040-477...|
|2019-11-01 00:00:...|      view|   3601530|2053013563810775923|appliances.kitche...|      lg|712.87|518085591|3bfb58cd-7892-48c...|
|2019-11-01 00:00:...|      view|   1004775|2053013555631882655|elect

In [8]:
df_nov.printSchema()

root
 |-- event_time: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_session: string (nullable = true)



From this first visual assessment, we can see that the 2 data sets contain the same columns with the same data types. This means we can merge them with a union, which concatenates all the records while keeping the same set of columns in the merged data set.

### Merging the data sets
We'll use the PySpark union function to merge our 2 original data sets.

We'll then assert that the count of records of the merged data set is equal to the sum of records of the 2 original data sets.

In [9]:
# Merge the 2 data sets with a union to concatenate the rows since the columns are the same
df_merged = df_sept.union(df_nov)
df_merged_size = df_merged.count()

# Assert that the merged data frame has the records from both original data frames
assert df_merged_size == (df_sept_size + df_nov_size)
df_merged.printSchema()

24/08/18 16:32:59 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors

root
 |-- event_time: string (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_session: string (nullable = true)



                                                                                

In [10]:
df_merged.show()

+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand|  price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|2019-10-01 00:00:...|      view|  44600062|2103807459595387724|                NULL|shiseido|  35.79|541312140|72d76fde-8bb3-4e0...|
|2019-10-01 00:00:...|      view|   3900821|2053013552326770905|appliances.enviro...|    aqua|  33.20|554748717|9333dfbd-b87a-470...|
|2019-10-01 00:00:...|      view|  17200506|2053013559792632471|furniture.living_...|    NULL| 543.10|519107250|566511c2-e2e3-422...|
|2019-10-01 00:00:...|      view|   1307067|2053013558920217191|  computers.notebook|  lenovo| 251.74|550050854|7c90fc70-0e80-459...|
|2019-10-01 00:00:...|      view|   1004237|205301355563188265

From the merged data set's schema, we can see that every field (column) is a string. At least 2 fields would benefit from a data type cast:
* *price* should be a `double`
* *event_time* should be a date (`DateType`)

### Casting fields to their appropriate data types
To cast the columns to new data types, we'll use the PySpark `withColumn` method.

In [11]:
# Cast the price column to double and the event_time column to DateType
df_cleaned = df_merged.withColumn(
    "price", df_merged.price.cast("double")
).withColumn("event_time", df_merged.event_time.cast(DateType()))
df_cleaned.printSchema()

root
 |-- event_time: date (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_session: string (nullable = true)



In [12]:
df_cleaned.show()

+----------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|event_time|event_type|product_id|        category_id|       category_code|   brand|  price|  user_id|        user_session|
+----------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|2019-10-01|      view|  44600062|2103807459595387724|                NULL|shiseido|  35.79|541312140|72d76fde-8bb3-4e0...|
|2019-10-01|      view|   3900821|2053013552326770905|appliances.enviro...|    aqua|   33.2|554748717|9333dfbd-b87a-470...|
|2019-10-01|      view|  17200506|2053013559792632471|furniture.living_...|    NULL|  543.1|519107250|566511c2-e2e3-422...|
|2019-10-01|      view|   1307067|2053013558920217191|  computers.notebook|  lenovo| 251.74|550050854|7c90fc70-0e80-459...|
|2019-10-01|      view|   1004237|2053013555631882655|electronics.smart...|   apple|1081.98|535871217|c6bd7419-2748-4c5...|
|2019-10

From the displayed rows, we can see that the *brand* field contains `NULL` values. Since we don't have any simple way to impute the *brand* value, we'll drop all the records with `NULL` in *brand*.

### Dropping `NULL` from *brand*
To remove the records containing `NULL` in the *brand* column, we'll use the PySpark `na.drop` method.

In [13]:
# Drop records with NULL in brand
df_cleaned = df_cleaned.na.drop(subset=["brand"])
df_cleaned.show()

+----------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|event_time|event_type|product_id|        category_id|       category_code|   brand|  price|  user_id|        user_session|
+----------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|2019-10-01|      view|  44600062|2103807459595387724|                NULL|shiseido|  35.79|541312140|72d76fde-8bb3-4e0...|
|2019-10-01|      view|   3900821|2053013552326770905|appliances.enviro...|    aqua|   33.2|554748717|9333dfbd-b87a-470...|
|2019-10-01|      view|   1307067|2053013558920217191|  computers.notebook|  lenovo| 251.74|550050854|7c90fc70-0e80-459...|
|2019-10-01|      view|   1004237|2053013555631882655|electronics.smart...|   apple|1081.98|535871217|c6bd7419-2748-4c5...|
|2019-10-01|      view|   1480613|2053013561092866779|   computers.desktop|  pulser| 908.62|512742880|0d0d91c2-c9c2-4e8...|
|2019-10

In [14]:
# Assert that there is no NULL left in brand
assert (
    df_cleaned.select(
        count(when(col("brand").isNull(), "brand")).alias("brand")
    ).collect()[0][0]
    == 0
)

                                                                                

We'll now check whether we have any undesired values in *price*.

In [15]:
# Check if we have NaN in price
df_cleaned.select(
    count(when(isnan("price"), "price")).alias("price_NaN")
).show()



+---------+
|price_NaN|
+---------+
|        0|
+---------+



                                                                                

In [16]:
# Check price statistics
df_price_summary = df_cleaned.select("price").summary(
    "count", "min", "5%", "25%", "50%", "mean", "75%", "95%", "max"
)
df_price_summary.show()


+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|          94619500|
|    min|              0.77|
|     5%|             21.86|
|    25%|             72.85|
|    50%|            172.18|
|   mean|305.65739274913716|
|    75%|            383.54|
|    95%|           1019.06|
|    max|           2574.07|
+-------+------------------+



                                                                                

The summary shows that a small portion of the prices (those above the 95th percentile, 1019.06) are very high compared to the rest: the maximum is 2574.07 while the median is 172.18. This gives the price distribution a right skew, which is also visible in the mean (305.66) being well above the median.

We'll treat the values above the 95th percentile as outliers.

One way of dealing with outliers is to remove them from the data set.

### Removing outliers above the 95th percentile from *price*
We'll first extract the 95th percentile value from the price summary.

In [17]:
# Extract the 95th percentile from the price summary.
# summary() returns its statistics as strings, so convert the value to a float.
# Alternative: df_cleaned.select(percentile("price", 0.95).alias("percentile")).collect()[0][0]
price_95_percentiles = float(
    df_price_summary.where(col("summary") == "95%").head()["price"]
)
print(f"Price 95 percentile = {price_95_percentiles}")


Price 95 percentile = 1019.06


                                                                                

We can then remove all the records where *price* is greater than or equal to the 95th percentile.

In [18]:
# Keep only records where price is strictly lower than the 95th percentile value to remove outliers
df_cleaned = df_cleaned.where(df_cleaned.price < price_95_percentiles)
# Check price summary after cleaning
df_cleaned.select("price").summary(
    "count", "min", "5%", "25%", "50%", "mean", "75%", "95%", "max"
).show()


+-------+-----------------+
|summary|            price|
+-------+-----------------+
|  count|         89886155|
|    min|             0.77|
|     5%|            20.59|
|    25%|            69.46|
|    50%|           160.36|
|   mean|244.4629314853627|
|    75%|            334.6|
|    95%|           793.92|
|    max|          1019.05|
+-------+-----------------+



                                                                                

## Extracting insights
In this part, we'll extract insights from the merged and cleaned data set.

### Which brand is the most popular?
To answer this question, we'll group the data set by *brand*, count the number of occurrences of each *brand*, and then sort by count.

In [19]:
# Group by brand, count records and sort by descending count
df_cleaned.groupBy("brand").count().sort("count", ascending=False).show()



+--------+--------+
|   brand|   count|
+--------+--------+
| samsung|12657661|
|   apple| 7824278|
|  xiaomi| 7675572|
|  huawei| 2521150|
| lucente| 1840936|
|      lg| 1554545|
|   bosch| 1496318|
|    oppo| 1294585|
|    sony| 1167245|
|    acer| 1046914|
|cordiant| 1039948|
| respect| 1036498|
|   artel|  989032|
|  lenovo|  948701|
| redmond|  741768|
| philips|  732106|
|      hp|  723385|
| indesit|  718806|
|dauscher|  706026|
|   vitek|  633277|
+--------+--------+
only showing top 20 rows



                                                                                

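From this data frame grouped by *brand*, we can tell that **Samsung** is the most popular brand by number of recorded events, ahead of Apple and Xiaomi. Note that the count covers every row regardless of *event_type* (the rows displayed earlier are `view` events), so it measures overall activity rather than purchases alone.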