# Customer Analysis - Explore Customer Behavior

## Import

The packages below are required. PySpark is used for data handling and Plotly for the visualisations. Spark needs a Java
installation to run, so make sure Java is installed first.

The dataset is provided by https://rees46.com/de and is available at https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store.

In [1]:
import os
import pyspark
import pandas as pd
import pyspark.sql.functions as f
import plotly.express as px
import plotly.graph_objects as go

## Read

The data must be placed in `data/` as unzipped CSV files.

In [7]:
# read raw data
spark = pyspark.sql.SparkSession.builder.appName("app1").getOrCreate()
# sdf = spark.read.csv("data/*.csv", header=True, inferSchema=True)
#sdf_201911 = spark.read.csv("data/2019-Nov.csv", header=True, inferSchema=True)
#sdf_201910 = spark.read.csv("data/2019-Oct.csv", header=True, inferSchema=True)
sdf = spark.read.csv("data/test_data.csv", header=True, inferSchema=True)

In [8]:
# join both months together
#sdf = sdf_201910.union(sdf_201911)
# sdf = spark.read.csv("data/test_data.csv", header=True, inferSchema=True)
sdf.show()

+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand| price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|2019-11-01 00:00:...|      view|   1003461|2053013555631882655|electronics.smart...|  xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|
|2019-11-01 00:00:...|      view|   5000088|2053013566100866035|appliances.sewing...|  janome|293.65|530496790|8e5f4f83-366c-4f7...|
|2019-11-01 00:00:...|      view|  17302664|2053013553853497655|                null|   creed| 28.31|561587266|755422e7-9040-477...|
|2019-11-01 00:00:...|      view|   3601530|2053013563810775923|appliances.kitche...|      lg|712.87|518085591|3bfb58cd-7892-48c...|
|2019-11-01 00:00:...|      view|   1004775|2053013555631882655|elect

## Preparation

Prepare and enhance data for analysis and modelling.

In [9]:
# Datatypes
sdf = sdf.withColumn("event_time", sdf["event_time"].cast(pyspark.sql.types.TimestampType()))
sdf = sdf.withColumn("category_id", sdf["category_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("product_id", sdf["product_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("user_id", sdf["user_id"].cast(pyspark.sql.types.StringType()))

# Feature Splitting
sdf = sdf.withColumn("category_class", f.substring_index(sdf.category_code, '.', 1))

# f.split takes a regular expression, so the dot has to be escaped.
# sdf = sdf.withColumn("category_class", f.split(sdf["category_code"], "\\.").getItem(0))
# sdf = sdf.withColumn("category_sub_class", f.split(sdf["category_code"], "\\.").getItem(1))
# sdf = sdf.withColumn("category_sub_sub_class", f.split(sdf["category_code"], "\\.").getItem(2))

sdf = sdf.withColumn("year", f.year("event_time"))
sdf = sdf.withColumn("month", f.month("event_time"))
sdf = sdf.withColumn("weekofyear", f.weekofyear("event_time"))
sdf = sdf.withColumn("dayofyear", f.dayofyear("event_time"))
sdf = sdf.withColumn("dayofweek", f.dayofweek("event_time"))
sdf = sdf.withColumn("dayofmonth", f.dayofmonth("event_time"))

# None Handling
sdf = sdf.fillna(value="not defined")

sdf.printSchema()

root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = false)
 |-- product_id: string (nullable = false)
 |-- category_id: string (nullable = false)
 |-- category_code: string (nullable = false)
 |-- brand: string (nullable = false)
 |-- price: double (nullable = true)
 |-- user_id: string (nullable = false)
 |-- user_session: string (nullable = false)
 |-- category_class: string (nullable = false)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- weekofyear: integer (nullable = true)
 |-- dayofyear: integer (nullable = true)
 |-- dayofweek: integer (nullable = true)
 |-- dayofmonth: integer (nullable = true)
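
The date-part columns follow Spark's conventions, which differ from Python's in one place: `dayofweek` counts from Sunday = 1 to Saturday = 7, while `weekofyear` is the ISO 8601 week number. The following plain-Python sketch needs no Spark session and reproduces the derived values for a single timestamp so they can be checked by hand. The helper name `spark_style_date_parts` is made up for this illustration.

```python
from datetime import datetime


def spark_style_date_parts(ts: datetime) -> dict:
    """Return the date parts the way the Spark functions used above define them."""
    return {
        "year": ts.year,
        "month": ts.month,
        # Spark's weekofyear is the ISO 8601 week number.
        "weekofyear": ts.isocalendar()[1],
        "dayofyear": ts.timetuple().tm_yday,
        # Spark's dayofweek: 1 = Sunday ... 7 = Saturday.
        # Python's isoweekday: 1 = Monday ... 7 = Sunday.
        "dayofweek": ts.isoweekday() % 7 + 1,
        "dayofmonth": ts.day,
    }


parts = spark_style_date_parts(datetime(2019, 11, 1, 1, 0, 0))
print(parts)
```

For 2019-11-01, a Friday, this gives week 44, day 305 of the year and `dayofweek` 6.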



## Dataframe Creation

Create several DataFrames at different aggregation levels to answer different questions and tasks.

In [10]:
# raw
sdf_raw = sdf
sdf.createOrReplaceTempView("Data")
sdf_raw.show()

+-------------------+----------+----------+-------------------+--------------------+-----------+------+---------+--------------------+--------------+----+-----+----------+---------+---------+----------+
|         event_time|event_type|product_id|        category_id|       category_code|      brand| price|  user_id|        user_session|category_class|year|month|weekofyear|dayofyear|dayofweek|dayofmonth|
+-------------------+----------+----------+-------------------+--------------------+-----------+------+---------+--------------------+--------------+----+-----+----------+---------+---------+----------+
|2019-11-01 01:00:00|      view|   1003461|2053013555631882655|electronics.smart...|     xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|   electronics|2019|   11|        44|      305|        6|         1|
|2019-11-01 01:00:00|      view|   5000088|2053013566100866035|appliances.sewing...|     janome|293.65|530496790|8e5f4f83-366c-4f7...|    appliances|2019|   11|        44|      305|       

In [11]:
# aggregated customer
sdf_agg_cust = sdf.groupBy("user_id", "user_session", "event_type", "product_id").mean("price")
sdf_agg_cust.show()

+---------+--------------------+----------+----------+----------+
|  user_id|        user_session|event_type|product_id|avg(price)|
+---------+--------------------+----------+----------+----------+
|513998949|a7b196d9-afe5-4dc...|      view|  50600085|    113.93|
|565731881|5d8cb7aa-ca44-470...|      view|   4804194|     69.24|
|565731881|5d8cb7aa-ca44-470...|      view|   4804151|     51.22|
|550043341|200ebe4a-40e6-4c3...|      view|  16000004|     43.73|
|532647354|d2d3d2c6-631d-489...|      view|   1004258|    732.07|
|544896141|80a43be5-1e98-44e...|      view|   1005116|   1013.86|
|566143627|aa610ab3-5c60-455...|      view|   1004708|    151.99|
|566280567|8cd74350-34e7-423...|      view|   1004322|    334.37|
|515782589|f2081cf0-0ee3-4bf...|      view|   1800729|    289.33|
|517081324|bd1d99b3-0c06-4e1...|      view|  12301394|    226.83|
|566255262|173d7b72-1db7-463...|      view|  16700384|     36.78|
|520772685|816a59f3-f5ae-4cc...|      view|   1306894|    360.09|
|565098257

In [None]:
# aggregated session

In [None]:
# aggregated product


In [None]:
# aggregated class

## Field Explanations

Following fields are in the standard dataset:
- event_time
- event_type
- product_id
- category_id
- category_code
- brand
- price
- user_id
- user_session

### General

In [14]:
sdf_count_overview = spark.sql("SELECT COUNT(*) AS Row_Count, \
                                       COUNT(DISTINCT(product_id)) AS Product_ID, \
                                       COUNT(DISTINCT(category_class)) AS Category_Class, \
                                       COUNT(DISTINCT(category_code)) AS Category_Code, \
                                       COUNT(DISTINCT(category_id)) AS Category_ID, \
                                       COUNT(DISTINCT(brand)) AS Brand, \
                                       COUNT(DISTINCT(user_id)) AS User_ID, \
                                       COUNT(DISTINCT(user_session)) AS User_Session, \
                                       ROUND(MEAN(price),2) AS AVG_Price \
                                FROM Data")
sdf_count_overview.show()

+---------+----------+--------------+-------------+-----------+-----+-------+------------+---------+
|Row_Count|Product_ID|Category_Class|Category_Code|Category_ID|Brand|User_ID|User_Session|AVG_Price|
+---------+----------+--------------+-------------+-----------+-----+-------+------------+---------+
|      217|       188|             9|           35|         74|   86|    126|         126|   306.63|
+---------+----------+--------------+-------------+-----------+-----+-------+------------+---------+



This overview shows the total number of rows, the number of distinct product_ids, category_classes, category_codes, category_ids, brands, user_ids and user_sessions, and the average price over all events.
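
The difference between `COUNT(*)` and `COUNT(DISTINCT ...)` is easy to mix up, so here is a minimal plain-Python sketch of the two. The sample rows are invented for illustration and are not taken from the dataset.

```python
# Invented sample rows (event_type, product_id, user_id); not taken from the dataset.
rows = [
    ("view", "1003461", "u1"),
    ("view", "1003461", "u2"),
    ("cart", "1003461", "u2"),
    ("view", "5000088", "u1"),
]

# COUNT(*) counts every row, repeated values included.
row_count = len(rows)

# COUNT(DISTINCT col) counts each value once, which a set does naturally.
distinct_products = len({product_id for _, product_id, _ in rows})
distinct_users = len({user_id for _, _, user_id in rows})

print(row_count, distinct_products, distinct_users)
```

Four events touch only two products and two users here, which is why the row count in the overview is larger than every distinct count.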

### event_time

### event_type

The event_type describes the kind of interaction a user had with a product. The field takes one of three values: view, cart or purchase. The following plot shows how the events are distributed across these three values:

In [63]:
sdf_event_distribution = spark.sql("SELECT event_type, \
                                              Count(*) AS Count \
                                      FROM Data \
                                      GROUP BY event_type")
px.sunburst(sdf_event_distribution.toPandas(), path=['event_type'], values="Count")
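
The `GROUP BY` query above returns absolute counts. As a rough sketch of how those counts turn into the shares that the chart displays, the same logic in plain Python looks like this. The list of events is invented for illustration.

```python
from collections import Counter

# Invented sample of event_type values; the real data comes from the Data view.
events = ["view", "view", "view", "cart", "view", "purchase", "view", "cart"]

counts = Counter(events)
total = sum(counts.values())

# Share of each event type in percent, rounded to one decimal place.
shares = {event: round(100 * n / total, 1) for event, n in counts.items()}

for event, n in counts.most_common():
    print(f"{event:<8} {n:>3} {shares[event]:>5.1f}%")
```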

### product_id

The product_id is the unique identifier of a product. As the overview shows, the users interacted with [...] unique product_ids in the Oct-2019 and Nov-2019 datasets.

In [15]:
# GROUP BY already yields one row per product_id, so DISTINCT is not needed
sdf_count_per_product_id = spark.sql("SELECT product_id AS Product_ID, \
                                             COUNT(product_id) AS Count \
                                      FROM Data \
                                      GROUP BY product_id \
                                      ORDER BY Count DESC")
px.bar(sdf_count_per_product_id.limit(10).toPandas(), x='Product_ID', y='Count')


The plot above shows the ten product_ids with the most user interactions, together with the number of interactions.

### category_id

The category_id is a unique identifier for the category of a product. Every product is assigned to exactly one category_id, so a category_id groups many product_ids. This is based on the more detailed analysis in the file "product_analysis.ipynb". As the overview shows, there are [...] unique category_ids.

In [18]:
# GROUP BY already yields one row per category_id, so DISTINCT is not needed
sdf_count_per_category_id = spark.sql("SELECT category_id AS Category_ID, \
                                              COUNT(product_id) AS Count \
                                       FROM Data \
                                       GROUP BY category_id \
                                       ORDER BY Count DESC")
                                     
px.bar(sdf_count_per_category_id.limit(10).toPandas(), x="Category_ID", y="Count")                               

The plot above shows the ten category_ids with the most interactions.

### category_code


The category_code describes the category that a product_id and its category_id belong to. Every product_id and category_id is assigned to exactly one category_code, so a category_code groups many product_ids and category_ids. This is also based on the more detailed analysis in the file "product_analysis.ipynb". As the overview shows, there are ... unique category_codes.

In [20]:
# GROUP BY already yields one row per category_code, so DISTINCT is not needed
sdf_count_per_category_code = spark.sql("SELECT category_code AS Category_Code, \
                                                COUNT(product_id) AS Count \
                                         FROM Data \
                                         GROUP BY category_code \
                                         ORDER BY Count DESC")
px.bar(sdf_count_per_category_code.limit(10).toPandas(), x="Category_Code", y="Count")

The plot above shows the ten category_codes with the most interactions.

A category_code generally consists of two or three parts separated by dots, for example appliances.kitchen.washer or electronics.smartphone. It can therefore be split into three levels: category_class, category_sub_class and category_sub_sub_class.
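
The three-level split can be sketched in plain Python as follows. The helper name `split_category_code` and the padding of missing levels with the notebook's "not defined" placeholder are assumptions made for this illustration. The last two lines show a pitfall that also applies to `f.split` in PySpark, whose second argument is a regular expression: an unescaped dot matches any character.

```python
import re

MISSING = "not defined"  # same placeholder the notebook uses for nulls


def split_category_code(code):
    """Split a dotted category_code into (class, sub_class, sub_sub_class)."""
    if code is None or code == MISSING:
        return (MISSING, MISSING, MISSING)
    # Plain string split: the dot is taken literally here.
    # maxsplit=2 keeps any deeper levels inside the third part.
    parts = code.split(".", 2)
    # Pad codes that have fewer than three parts.
    parts += [MISSING] * (3 - len(parts))
    return tuple(parts)


print(split_category_code("appliances.kitchen.washer"))
print(split_category_code("electronics.smartphone"))

# Regex pitfall: an unescaped dot matches every character, so each
# character acts as a separator and only empty strings remain.
print(re.split(".", "a.b"))    # unescaped dot
print(re.split(r"\.", "a.b"))  # escaped dot
```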

#### Category_class

The category_class is the first part of the category_code. It groups many category_codes under one overarching class. As the overview shows, there are [...] unique category_classes.

In [23]:
# GROUP BY already yields one row per category_class, so DISTINCT is not needed
sdf_count_per_category_class = spark.sql("SELECT category_class AS Category_Class, \
                                                 COUNT(product_id) AS Count \
                                          FROM Data \
                                          GROUP BY category_class \
                                          ORDER BY Count DESC")
                                        
px.bar(sdf_count_per_category_class.limit(10).toPandas(), x="Category_Class", y="Count")

The plot above shows the category_classes with the most interactions.

#### Category_sub_class

#### Category_sub_sub_class

### brand


The brand column gives the brand of a product_id. It is independent of the categories, so a brand can appear in many category_classes. This is also based on the more detailed analysis in the file "product_analysis.ipynb". There are [...] unique brands in the dataset. The following plot shows the most popular brands:

In [24]:
# GROUP BY already yields one row per brand, so DISTINCT is not needed
sdf_count_per_brand = spark.sql("SELECT brand AS Brand, \
                                        COUNT(product_id) AS Count \
                                 FROM Data \
                                 GROUP BY brand \
                                 ORDER BY Count DESC")

px.histogram(sdf_count_per_brand.limit(10).toPandas(), x="Brand", y="Count")                               

### price


The price column gives the price of a product_id. The following table and plot show the overall distribution of prices in the dataset.

In [26]:
sdf_price_distribution = spark.sql("SELECT MAX(price) AS MAX, \
                                           MEAN(price) AS MEAN, \
                                           MIN(price) AS MIN \
                                    FROM Data")
sdf_price_distribution.show()

+-------+-----------------+----+
|    MAX|             MEAN| MIN|
+-------+-----------------+----+
|2496.59|306.6343778801844|1.09|
+-------+-----------------+----+



In [25]:
px.box(sdf_raw.toPandas(), y="price")
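
A box plot summarises more than minimum, mean and maximum: it also draws the quartiles and flags points far outside the box. The sketch below computes these figures with the standard library for a handful of prices picked from the outputs shown in this notebook (not the full column). The quartile method is an assumption and may differ slightly from the one Plotly uses.

```python
import statistics

# A few prices picked from the outputs above; not the full price column.
prices = [1.09, 28.31, 69.24, 151.99, 293.65, 489.07, 712.87, 2496.59]

summary = {
    "min": min(prices),
    "mean": round(statistics.mean(prices), 2),
    "median": statistics.median(prices),
    "max": max(prices),
}

# Quartiles that delimit the box (Plotly's exact method may differ).
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's rule: values more than 1.5 * IQR above the box are potential outliers.
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p > upper_fence]

print(summary)
print(outliers)
```

The mean sits well above the median in this sample because a single expensive item pulls it up, which is the kind of skew the box plot makes visible.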

### user_id


### user_session

## Exploration and Analysis

The following part analyses the connections and dependencies between the individual fields.

### Time Distribution

### Category and products

#### Connection between category_class, category_code, category_id, product_id and brand

Each product_id belongs to exactly one category_id, each category_id to exactly one category_code, and each category_code to exactly one category_class (product_id ⊂ category_id ⊂ category_code ⊂ category_class). The brand, on the other hand, is cross-class. This is based on the more detailed analysis in the file "product_analysis.ipynb".
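
One way to test such a hierarchy is to check that every child value occurs with exactly one parent value. The following plain-Python sketch does this on invented sample rows. The helper name `maps_to_single_parent` is hypothetical and this is not the check used in the referenced product analysis.

```python
from collections import defaultdict


def maps_to_single_parent(rows, child, parent):
    """Return True if every child value occurs with exactly one parent value."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return all(len(values) == 1 for values in parents.values())


# Invented sample rows for illustration only.
rows = [
    {"product_id": "p1", "category_id": "c1", "category_class": "electronics", "brand": "lg"},
    {"product_id": "p2", "category_id": "c1", "category_class": "electronics", "brand": "xiaomi"},
    {"product_id": "p3", "category_id": "c2", "category_class": "appliances", "brand": "lg"},
]

# The hierarchy holds for products and categories in this sample ...
print(maps_to_single_parent(rows, "product_id", "category_id"))
print(maps_to_single_parent(rows, "category_id", "category_class"))
# ... but not for brands, because "lg" appears in two classes.
print(maps_to_single_parent(rows, "brand", "category_class"))
```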

The following plot shows the distribution of product_ids, category_ids and category_codes within each category_class. Selecting a specific category_class, category_code or category_id opens a more detailed view.

In [42]:
sdf_category_distribution = spark.sql("SELECT category_class, \
                                              category_code, \
                                              category_id, \
                                              product_id, \
                                              Count(*) AS Count \
                                      FROM Data \
                                      GROUP BY category_class, category_code, category_id, product_id")
px.sunburst(sdf_category_distribution.toPandas(), path=['category_class', 'category_code', 'category_id', "product_id"], values="Count")

The distribution of the brand within the category_classes is represented in the following plot:

In [43]:
px.histogram(sdf_raw.toPandas(), x="category_class", color="brand")

#### Connection to the price

The following plots show the price distribution within the category_classes, category_codes, category_ids, product_ids and brands.

In [48]:
px.box(sdf_raw.toPandas(), x="category_class", y="price", title="Price ~ Category_class")

In [51]:
px.box(sdf_raw.toPandas(), x="category_code", y="price", title="Price ~ Category_code")

In [52]:
px.box(sdf_raw.toPandas(), x="category_id", y="price", title="Price ~ Category_id")

In [53]:
px.box(sdf_raw.toPandas(), x="brand", y="price", title="Price ~ brand")

In [57]:
sdf_price_per_product=spark.sql("SELECT DISTINCT(Product_ID), \
                                        Price \
                                 FROM Data \
                                 ORDER BY Price Desc")
px.bar(sdf_price_per_product.limit(10).toPandas(), x="Product_ID", y="Price", title="TOP 10 most expensive Product_IDs")

#### Connection to the event-type

The following plots show the event_type distribution within the category_classes, category_codes, category_ids, product_ids and brands.

In [65]:
sdf_category_class_event_distribution = spark.sql("SELECT category_class, \
                                                          event_type, \
                                                          Count(*) AS Count \
                                                    FROM Data \
                                                    GROUP BY category_class, event_type")
px.sunburst(sdf_category_class_event_distribution.toPandas(), path=['category_class','event_type'], values="Count")

In [66]:
sdf_category_code_event_distribution = spark.sql("SELECT category_code, \
                                                          event_type, \
                                                          Count(*) AS Count \
                                                    FROM Data \
                                                    GROUP BY category_code, event_type")
px.sunburst(sdf_category_code_event_distribution.toPandas(), path=['category_code','event_type'], values="Count")

In [67]:
sdf_category_id_event_distribution = spark.sql("SELECT category_id, \
                                                          event_type, \
                                                          Count(*) AS Count \
                                                    FROM Data \
                                                    GROUP BY category_id, event_type")
px.sunburst(sdf_category_id_event_distribution.toPandas(), path=['category_id','event_type'], values="Count")

In [68]:
sdf_product_id_event_distribution = spark.sql("SELECT product_id, \
                                                          event_type, \
                                                          Count(*) AS Count \
                                                    FROM Data \
                                                    GROUP BY product_id, event_type")
px.sunburst(sdf_product_id_event_distribution.toPandas(), path=['product_id','event_type'], values="Count")

In [69]:
sdf_brand_event_distribution = spark.sql("SELECT brand, \
                                                 event_type, \
                                                 Count(*) AS Count \
                                        FROM Data \
                                        GROUP BY brand, event_type")
px.sunburst(sdf_brand_event_distribution.toPandas(), path=['brand','event_type'], values="Count")

### User analysis

### Event_Type and Price

The following plot shows the price distribution for each event_type.

In [70]:
px.box(sdf_raw.toPandas(), x="event_type", y="price", title="Price ~ Event_Type")

### Customer Journey

### Correlation Matrix

The following correlation matrix shows which attributes have a strong influence on the event_type.
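
Because event_type is categorical, it has to be encoded numerically before a correlation can be computed. As a minimal sketch under that assumption, the example below encodes "purchase" as a 0/1 indicator and computes its Pearson correlation with the price. The sample rows are invented, and the choice of encoding and of Pearson's coefficient is only one possible approach, not necessarily the one intended for this section.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined when one sequence is constant (division by zero).
    return cov / math.sqrt(var_x * var_y)


# Invented sample rows (event_type, price).
rows = [("view", 500.0), ("view", 300.0), ("cart", 200.0),
        ("purchase", 100.0), ("view", 400.0), ("purchase", 150.0)]

# Encode the categorical event_type as a 0/1 purchase indicator.
is_purchase = [1.0 if event == "purchase" else 0.0 for event, _ in rows]
prices = [price for _, price in rows]

r = pearson(is_purchase, prices)
print(round(r, 3))
```

In this invented sample the coefficient is negative, because the purchased items are the cheaper ones. Repeating the calculation for every pair of numeric columns fills one cell of the matrix per pair.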

# Clustering