# Customer Analysis - Explore Customer Behavior

## Import

The required packages: PySpark handles the data management and Plotly the visualisations. Java must be installed
for Spark to work properly.
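
Whether a Java runtime is reachable can be checked before the Spark session is started. The following is a minimal sketch using only the standard library; the helper name `java_available` is hypothetical, and the check only looks at `JAVA_HOME` and the `PATH`, so it does not verify that the installed Java version is one Spark supports.

```python
import os
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be located (hypothetical helper)."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        # JAVA_HOME conventionally points at the JDK root, with the binary in bin/
        candidate = os.path.join(java_home, "bin", "java")
        if shutil.which(candidate) is not None:
            return True
    # Fall back to whatever `java` is found on the PATH
    return shutil.which("java") is not None


if not java_available():
    print("No Java runtime found; install a JDK or set JAVA_HOME before starting Spark.")
```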

The dataset comes from https://rees46.com/de and is available at https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store.

In [34]:
import os
import pyspark
import pandas as pd
import pyspark.sql.functions as f
import plotly.express as px
import plotly.graph_objects as go

## Read

The data must be placed in ```data/``` as unzipped CSV files.

In [74]:
# read raw data
spark = pyspark.sql.SparkSession.builder.appName("app1").getOrCreate()
# sdf = spark.read.csv("data/*.csv", header=True, inferSchema=True)
# sdf_201911 = spark.read.csv("data/2019-Nov.csv", header=True, inferSchema=True)
# sdf_201910 = spark.read.csv("data/2019-Oct.csv", header=True, inferSchema=True)

In [75]:
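# join both months together (only works when the monthly reads above are enabled)
# sdf = sdf_201910.union(sdf_201911)

# small sample file used in place of the full data
sdf = spark.read.csv("data/test_data.csv", header=True, inferSchema=True)
sdf.show()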

+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand|  price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|2019-10-01 00:00:...|      view|  44600062|2103807459595387724|                null|shiseido|  35.79|541312140|72d76fde-8bb3-4e0...|
|2019-10-01 00:00:...|      view|   3900821|2053013552326770905|appliances.enviro...|    aqua|   33.2|554748717|9333dfbd-b87a-470...|
|2019-10-01 00:00:...|      view|  17200506|2053013559792632471|furniture.living_...|    null|  543.1|519107250|566511c2-e2e3-422...|
|2019-10-01 00:00:...|      view|   1307067|2053013558920217191|  computers.notebook|  lenovo| 251.74|550050854|7c90fc70-0e80-459...|
|2019-10-01 00:00:...|      view|   1004237|205301355563188265

## Preparation

Prepare and enhance data for analysis and modelling.

In [76]:
# Datatypes
sdf = sdf.withColumn("event_time", sdf["event_time"].cast(pyspark.sql.types.TimestampType()))
sdf = sdf.withColumn("category_id", sdf["category_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("product_id", sdf["product_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("user_id", sdf["user_id"].cast(pyspark.sql.types.StringType()))

# Feature Splitting
# sdf = sdf.withColumn("category_class", f.substring_index(sdf.category_code, '.', 1))

# raw strings keep the regex escape "\." intact
sdf = sdf.withColumn("category_class", f.split(sdf["category_code"], r"\.").getItem(0))
sdf = sdf.withColumn("category_sub_class", f.split(sdf["category_code"], r"\.").getItem(1))
sdf = sdf.withColumn("category_sub_sub_class", f.split(sdf["category_code"], r"\.").getItem(2))

sdf = sdf.withColumn("year", f.year("event_time"))
sdf = sdf.withColumn("month", f.month("event_time"))
sdf = sdf.withColumn("weekofyear", f.weekofyear("event_time"))
sdf = sdf.withColumn("dayofyear", f.dayofyear("event_time"))
sdf = sdf.withColumn("dayofweek", f.dayofweek("event_time"))
sdf = sdf.withColumn("dayofmonth", f.dayofmonth("event_time"))
sdf = sdf.withColumn("hour", f.hour("event_time"))
# None Handling
# sdf = sdf.fillna(value="not defined")

sdf.printSchema()

root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_session: string (nullable = true)
 |-- category_class: string (nullable = true)
 |-- category_sub_class: string (nullable = true)
 |-- category_sub_sub_class: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- weekofyear: integer (nullable = true)
 |-- dayofyear: integer (nullable = true)
 |-- dayofweek: integer (nullable = true)
 |-- dayofmonth: integer (nullable = true)
 |-- hour: integer (nullable = true)



## Dataframe Creation

Create several dataframes at different aggregation levels to answer different questions and tasks.

In [77]:
# raw
sdf_raw = sdf
sdf_raw.show(1, vertical=True)

-RECORD 0--------------------------------------
 event_time             | 2019-10-01 02:00:00  
 event_type             | view                 
 product_id             | 44600062             
 category_id            | 2103807459595387724  
 category_code          | null                 
 brand                  | shiseido             
 price                  | 35.79                
 user_id                | 541312140            
 user_session           | 72d76fde-8bb3-4e0... 
 category_class         | null                 
 category_sub_class     | null                 
 category_sub_sub_class | null                 
 year                   | 2019                 
 month                  | 10                   
 weekofyear             | 40                   
 dayofyear              | 274                  
 dayofweek              | 3                    
 dayofmonth             | 1                    
 hour                   | 2                    
only showing top 1 row



In [78]:
# aggregated customer
sdf_agg_cust = sdf.groupBy("user_id", "user_session", "event_type", "product_id").mean("price")
sdf_agg_cust.show(5)


+---------+--------------------+----------+----------+------------------+
|  user_id|        user_session|event_type|product_id|        avg(price)|
+---------+--------------------+----------+----------+------------------+
|515483851|18ea1924-2c5a-4ac...|      view|  15700181|            214.16|
|549736688|29ab2a23-cfa8-411...|      view|   1004935|            167.03|
|477121012|413b498a-71d5-49f...|      view|   1004839|179.30666666666664|
|555279241|56616147-d002-47b...|      view|   1005115|            975.57|
|516692901|242696d6-672f-476...|      view|   1005115|            975.57|
+---------+--------------------+----------+----------+------------------+
only showing top 5 rows



In [79]:
# aggregated session

In [80]:
# aggregated product


In [81]:
# aggregated class

In [82]:
# aggregated time (weeks, dayofweeks, month)

sdf_time_dist_month = sdf.groupBy("event_type", "dayofmonth").count()
sdf_time_dist_month = sdf_time_dist_month.withColumnRenamed("count", "cnt")
sdf_time_dist_month = sdf_time_dist_month.sort("dayofmonth", "event_type")
sdf_time_dist_month.show()


+----------+----------+-------+
|event_type|dayofmonth|    cnt|
+----------+----------+-------+
|      cart|         1|  35804|
|  purchase|         1|  41765|
|      view|         1|2607456|
|      cart|         2|  36594|
|  purchase|         2|  41343|
|      view|         2|2668416|
|      cart|         3|  39353|
|  purchase|         3|  41306|
|      view|         3|2612149|
|      cart|         4|  65815|
|  purchase|         4|  53978|
|      view|         4|3090211|
|      cart|         5|  54746|
|  purchase|         5|  48361|
|      view|         5|2944444|
|      cart|         6|  51991|
|  purchase|         6|  47521|
|      view|         6|2916113|
|      cart|         7|  37517|
|  purchase|         7|  46261|
+----------+----------+-------+
only showing top 20 rows



In [83]:
sdf_time_dist_week = sdf.groupBy("event_type", "dayofweek").count()
sdf_time_dist_week = sdf_time_dist_week.withColumnRenamed("count", "cnt")
sdf_time_dist_week = sdf_time_dist_week.sort("dayofweek", "event_type")
sdf_time_dist_week.show()


+----------+---------+--------+
|event_type|dayofweek|     cnt|
+----------+---------+--------+
|      cart|        1|  720590|
|  purchase|        1|  353887|
|      view|        1|16326988|
|      cart|        2|  369597|
|  purchase|        2|  201448|
|      view|        2|12174247|
|      cart|        3|  371977|
|  purchase|        3|  211058|
|      view|        3|13298191|
|      cart|        4|  369438|
|  purchase|        4|  216415|
|      view|        4|13123564|
|      cart|        5|  464803|
|  purchase|        5|  210074|
|      view|        5|13859823|
|      cart|        6|  886709|
|  purchase|        6|  207588|
|      view|        6|17753073|
|      cart|        7|  772332|
|  purchase|        7|  259318|
+----------+---------+--------+
only showing top 20 rows



In [84]:
sdf_time_dist_day = sdf.groupBy("event_type", "hour").count()
sdf_time_dist_day = sdf_time_dist_day.withColumnRenamed("count", "cnt")
sdf_time_dist_day = sdf_time_dist_day.sort("hour", "event_type")
sdf_time_dist_day.show()


+----------+----+-------+
|event_type|hour|    cnt|
+----------+----+-------+
|      cart|   0|  16404|
|  purchase|   0|   6420|
|      view|   0| 581731|
|      cart|   1|  19607|
|  purchase|   1|   5508|
|      view|   1| 647710|
|      cart|   2|  38919|
|  purchase|   2|   7653|
|      view|   2|1151038|
|      cart|   3|  79145|
|  purchase|   3|  17638|
|      view|   3|2204048|
|      cart|   4| 138076|
|  purchase|   4|  42514|
|      view|   4|3358263|
|      cart|   5| 193031|
|  purchase|   5|  75571|
|      view|   5|4406869|
|      cart|   6| 225650|
|  purchase|   6|  95690|
+----------+----+-------+
only showing top 20 rows



## Field Explanations

The standard dataset contains the following fields:
- event_time
- event_type
- product_id
- category_id
- category_code
- brand
- price
- user_id
- user_session

### General

In [46]:
sdf_raw.show(1, vertical=True)
print(f"Number of total rows: {sdf_raw.count()}")

+-------------------+----------+----------+-------------------+--------------------+------+------+---------+--------------------+--------------+------------------+----------------------+----+-----+----------+---------+---------+----------+----+
|         event_time|event_type|product_id|        category_id|       category_code| brand| price|  user_id|        user_session|category_class|category_sub_class|category_sub_sub_class|year|month|weekofyear|dayofyear|dayofweek|dayofmonth|hour|
+-------------------+----------+----------+-------------------+--------------------+------+------+---------+--------------------+--------------+------------------+----------------------+----+-----+----------+---------+---------+----------+----+
|2019-11-01 01:00:00|      view|   1003461|2053013555631882655|electronics.smart...|xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|   electronics|        smartphone|                  null|2019|   11|        44|      305|        6|         1|   1|
|2019-11-01 01:00:00

### event_time

In [47]:
sdf_raw.select("event_time").show(5)
print(f"Number of distinct event_time rows: {sdf_raw.select('event_time').distinct().count()}")

+-------------------+
|         event_time|
+-------------------+
|2019-11-01 01:00:00|
|2019-11-01 01:00:00|
|2019-11-01 01:00:01|
|2019-11-01 01:00:01|
|2019-11-01 01:00:01|
+-------------------+
only showing top 5 rows

Number of distinct event_time rows: 84


### event_type

In [48]:
sdf_event_type_dist = sdf_raw.groupBy("event_type").count()
sdf_event_type_dist.show()

Unique Values:
+----------+
|event_type|
+----------+
|  purchase|
|      view|
|      cart|
+----------+

Distribution:
+----------+-----+
|event_type|count|
+----------+-----+
|  purchase|    2|
|      view|  240|
|      cart|    2|
+----------+-----+



In [49]:
# Plot Event Types
df = sdf_event_type_dist.toPandas()
fig = px.pie(df, values='count', names='event_type', title='Distribution of Customer Actions')
fig.show()

### product_id

In [50]:
sdf_raw.select("product_id").show(5)

+----------+
|product_id|
+----------+
|   1003461|
|   5000088|
|  17302664|
|   3601530|
|   1004775|
+----------+
only showing top 5 rows



In [51]:
print("Number of unique values:")
print(sdf_raw.select("product_id").distinct().count())

Number of unique values:
214


In [52]:
print("Number of products with no category_id:")
# count distinct products whose category_id is missing (null)
print(sdf_raw.where(f.col("category_id").isNull()).select("product_id").distinct().count())

Number of products with no category_id:


In [53]:
print("Distribution:")
sdf_product_id_dist = sdf_raw.groupBy("product_id").count()
sdf_product_id_dist.show()

Distribution:
+----------+-----+
|product_id|count|
+----------+-----+
|   1306571|    1|
|  25600085|    1|
|  26402378|    1|
|   3901174|    1|
|   5801656|    1|
|   1307012|    1|
|  26204072|    1|
|   3100034|    1|
|   2300214|    1|
|  15100148|    1|
|   4804194|    1|
|   3701016|    1|
|  12708306|    1|
|  16700690|    1|
|   1307322|    1|
|   2400216|    1|
|  41100055|    1|
|   1307115|    2|
|   4600541|    1|
|  15100337|    1|
+----------+-----+
only showing top 20 rows



In [54]:
# the 10 most frequent products (descending by count)
px.bar(sdf_product_id_dist.orderBy(f.desc("count")).limit(10).toPandas(), x='product_id', y='count')

### category_id

In [55]:
sdf_raw.select("category_id").show(5)
print(f"Number of unique categories: {sdf_raw.select('category_id').distinct().count()}")

+-------------------+
|        category_id|
+-------------------+
|2053013555631882655|
|2053013566100866035|
|2053013553853497655|
|2053013563810775923|
|2053013555631882655|
+-------------------+
only showing top 5 rows

Number of unique categories: 90


In [56]:
sdf_product_per_category = sdf_raw.groupBy("category_id").agg(f.countDistinct("product_id"))
sdf_product_per_category.show()

+-------------------+-----------------+
|        category_id|count(product_id)|
+-------------------+-----------------+
|2085718636156158307|                1|
|2152167773222993940|                1|
|2053013558190408249|                1|
|2146660887346282824|                1|
|2053013558031024687|                1|
|2090971680431145002|                1|
|2053013558282682943|                1|
|2053013556110033341|                1|
|2053013558920217191|               17|
|2134905044833666047|                2|
|2053013560530830019|                1|
|2106075662325383725|                1|
|2070747671722722162|                1|
|2137704922018218396|                2|
|2053013559792632471|                2|
|2053013554625249641|                1|
|2053013562183385881|                1|
|2053013554449088861|                1|
|2053013558433677895|                1|
|2134905045613805589|                1|
+-------------------+-----------------+
only showing top 20 rows



In [57]:
df = sdf_product_per_category.toPandas()
px.box(df, y="count(product_id)")

### category_code

In [58]:
sdf_raw.select("category_code").show(5)

print("Number of unique category codes:")
print(sdf_raw.select("category_code").distinct().count())

+--------------------+
|       category_code|
+--------------------+
|electronics.smart...|
|appliances.sewing...|
|                null|
|appliances.kitche...|
|electronics.smart...|
+--------------------+
only showing top 5 rows

Number of unique category codes:
42


### brand

In [59]:
sdf_raw.select("brand").show(5)

print("Number of unique brands:")
print(sdf_raw.select("brand").distinct().count())

+------+
| brand|
+------+
|xiaomi|
|janome|
| creed|
|    lg|
|xiaomi|
+------+
only showing top 5 rows

Number of unique brands:
99


### price

In [60]:
sdf_raw.describe("price").show()

+-------+-----------------+
|summary|            price|
+-------+-----------------+
|  count|              244|
|   mean|302.1450819672131|
| stddev|425.3893301363122|
|    min|             1.09|
|    max|          2496.59|
+-------+-----------------+



In [61]:
# only the price column is needed for the box plot
df = sdf_raw.select("price").toPandas()
px.box(df, y="price")


### user_id

In [62]:
sdf_raw.select("user_id").show(5)

print("Number of users:")
print(sdf_raw.select("user_id").distinct().count())

+---------+
|  user_id|
+---------+
|520088904|
|530496790|
|561587266|
|518085591|
|558856683|
+---------+
only showing top 5 rows

Number of users:
153


### user_session

In [63]:
# avg actions per session
sdf_cnt_action_per_session = sdf_raw.groupby("user_session").count()
sdf_cnt_action_per_session.describe().show()

+-------+--------------------+------------------+
|summary|        user_session|             count|
+-------+--------------------+------------------+
|  count|                 153|               153|
|   mean|                null|1.5947712418300655|
| stddev|                null|1.0287825488662496|
|    min|0110890b-96d6-4ae...|                 1|
|    max|ff868137-fb3e-4da...|                 6|
+-------+--------------------+------------------+



## Exploration and Analysis

### Time Distribution

In [85]:
# Timestamp Distribution (per event_type) over every day of month

df = sdf_time_dist_month.toPandas()

fig = px.bar(df, x="dayofmonth", y="cnt", color="event_type", barmode="stack")

fig.update_layout(title="Number of events over a month",
                  xaxis_title="Day of Month", yaxis_title="Number of Events")
fig.update_xaxes(type="category")
fig.show()

In [86]:
# Timestamp Distribution (per event_type) over every day of week

df = sdf_time_dist_week.toPandas()

fig = px.bar(df, x="dayofweek", y="cnt", color="event_type", barmode="stack")

fig.update_layout(title="Number of events over a week",
                  xaxis_title="Day of Week", yaxis_title="Number of Events")
fig.update_xaxes(type="category")
fig.show()
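
Spark's `dayofweek` numbers the days from 1 (Sunday) to 7 (Saturday), which matches the record shown earlier: 2019-10-01, a Tuesday, has `dayofweek` 3. The axis of the weekly plot is easier to read with weekday names, so a small lookup table can help. This is a minimal sketch; the names `SPARK_WEEKDAY_LABELS` and `weekday_label` are hypothetical.

```python
# Spark's dayofweek convention: 1 = Sunday, ..., 7 = Saturday
SPARK_WEEKDAY_LABELS = {
    1: "Sunday",
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
    7: "Saturday",
}


def weekday_label(dayofweek: int) -> str:
    """Translate a Spark dayofweek value into a weekday name (hypothetical helper)."""
    return SPARK_WEEKDAY_LABELS[dayofweek]
```

On the pandas side, the mapping could be applied with `df["weekday"] = df["dayofweek"].map(SPARK_WEEKDAY_LABELS)` before plotting.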

In [87]:
# Timestamp Distribution (per event_type) over every hour of a day

df = sdf_time_dist_day.toPandas()

fig = px.bar(df, x="hour", y="cnt", color="event_type", barmode="stack")

fig.update_layout(title="Number of events over a day",
                  xaxis_title="Hour of day", yaxis_title="Number of Events")
fig.update_xaxes(type="category")
fig.show()

In [71]:
sdf.groupBy("event_time").count().show()

+-------------------+-----+
|         event_time|count|
+-------------------+-----+
|2019-11-01 01:00:09|    3|
|2019-11-01 01:01:23|    3|
|2019-11-01 01:00:19|    5|
|2019-11-01 01:00:41|    7|
|2019-11-01 01:01:00|    2|
|2019-11-01 01:01:17|    2|
|2019-11-01 01:00:35|    3|
|2019-11-01 01:01:20|    3|
|2019-11-01 01:01:21|    3|
|2019-11-01 01:00:27|    5|
|2019-11-01 01:01:11|    4|
|2019-11-01 01:01:07|    1|
|2019-11-01 01:00:12|    5|
|2019-11-01 01:00:53|    2|
|2019-11-01 01:00:22|    2|
|2019-11-01 01:00:54|    1|
|2019-11-01 01:00:16|    4|
|2019-11-01 01:00:51|    3|
|2019-11-01 01:01:01|    3|
|2019-11-01 01:00:26|    4|
+-------------------+-----+
only showing top 20 rows



In [None]:
df = sdf.groupBy("event_time").count().toPandas()

fig = px.line(df, x="event_time", y="count")
fig.show()

### Category and products

### Price and Events

### User analysis

### Session analysis

A session-level dataframe will be created with the following fields:

- number of events
- session_start_time
- session_stop_time
- session_success
- products_bought
- products_viewed
- turnover


In [None]:
# Does buying behavior differ by time of day (e.g. more sales in the morning than in the evening)?
sdf_session = sdf.groupBy("user_id", "user_session", "event_type", "product_id", "price").agg(
    f.max("event_time"), f.min("event_time"), f.count("*")  # f.count needs a column argument
)

In [None]:
sdf_session.show()

### Customer Profiles

In preparation for clustering, a customer profile will be created with the following fields:

- customer_id
- number_of_view_events
- number_of_cart_events
- number_of_purchase_events
- total_turnover
- number_of_bought_items (resolve multiple purchasing events for quantity)
- avg_sold_cart
- avg_session_time
- avg_actions_per_session



In [None]:
# the time period simply depends on the input data

### Customer Journey

## Clustering