# Customer Analysis - Explore Customer Behavior

## Import

The packages below are required. PySpark is used for data handling and Plotly for visualisations. Make sure
Java is installed, otherwise Spark will not work properly.
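
Whether a Java runtime is reachable can be checked before starting Spark. The following is only a minimal sketch: the helper name `java_available` is made up for this example, and it assumes that a `java` executable on `PATH` is the runtime Spark would pick up.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and answers `-version`."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    try:
        # `java -version` exits with code 0 on a working installation
        result = subprocess.run([java_path, "-version"], capture_output=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0


if not java_available():
    print("No working Java runtime found on PATH; Spark will probably fail to start.")
```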

The dataset comes from https://rees46.com/de and is available at https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store.

In [1]:
import os
import pyspark
import pandas as pd
import pyspark.sql.functions as f
import plotly.express as px
import plotly.graph_objects as go

## Read

The data must be placed in ```data/``` as unzipped CSV files.
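
A quick check that the files are where the notebook expects them could look like the sketch below. The helper `find_csv_files` is a hypothetical name introduced only for this example; the folder name `data` is the one mentioned above.

```python
from pathlib import Path


def find_csv_files(data_dir="data"):
    """Return the sorted list of .csv files directly inside data_dir."""
    directory = Path(data_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.csv"))


csv_files = find_csv_files("data")
if not csv_files:
    print("No CSV files found in data/; unzip the downloaded archives first.")
```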

In [2]:
# read raw data
spark = pyspark.sql.SparkSession.builder.appName("app1").getOrCreate()
# sdf = spark.read.csv("data/*.csv", header=True, inferSchema=True)
# sdf_201911 = spark.read.csv("data/2019-Nov.csv", header=True, inferSchema=True)
# sdf_201910 = spark.read.csv("data/2019-Oct.csv", header=True, inferSchema=True)

In [3]:
# join both months together
# sdf = sdf_201910.union(sdf_201911)
sdf = spark.read.csv("data/test_data.csv", header=True, inferSchema=True)
sdf.show()

+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand| price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|2019-11-01 00:00:...|      view|   1003461|2053013555631882655|electronics.smart...|  xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|
|2019-11-01 00:00:...|      view|   5000088|2053013566100866035|appliances.sewing...|  janome|293.65|530496790|8e5f4f83-366c-4f7...|
|2019-11-01 00:00:...|      view|  17302664|2053013553853497655|                null|   creed| 28.31|561587266|755422e7-9040-477...|
|2019-11-01 00:00:...|      view|   3601530|2053013563810775923|appliances.kitche...|      lg|712.87|518085591|3bfb58cd-7892-48c...|
|2019-11-01 00:00:...|      view|   1004775|2053013555631882655|elect

## Preparation

Prepare and enhance data for analysis and modelling.

In [4]:
# Datatypes
sdf = sdf.withColumn("event_time", sdf["event_time"].cast(pyspark.sql.types.TimestampType()))
sdf = sdf.withColumn("category_id", sdf["category_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("product_id", sdf["product_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("user_id", sdf["user_id"].cast(pyspark.sql.types.StringType()))

# Feature Splitting
# sdf = sdf.withColumn("category_class", f.substring_index(sdf.category_code, '.', 1))

# raw strings avoid Python's invalid-escape warning for the regex pattern "\."
sdf = sdf.withColumn("category_class", f.split(sdf["category_code"], r"\.").getItem(0))
sdf = sdf.withColumn("category_sub_class", f.split(sdf["category_code"], r"\.").getItem(1))
sdf = sdf.withColumn("category_sub_sub_class", f.split(sdf["category_code"], r"\.").getItem(2))

sdf = sdf.withColumn("year", f.year("event_time"))
sdf = sdf.withColumn("month", f.month("event_time"))
sdf = sdf.withColumn("weekofyear", f.weekofyear("event_time"))
sdf = sdf.withColumn("dayofyear", f.dayofyear("event_time"))
sdf = sdf.withColumn("dayofweek", f.dayofweek("event_time"))
sdf = sdf.withColumn("dayofmonth", f.dayofmonth("event_time"))
sdf = sdf.withColumn("hour", f.hour("event_time"))
# None Handling
# sdf = sdf.fillna(value="not defined")

sdf.printSchema()

root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_session: string (nullable = true)
 |-- category_class: string (nullable = true)
 |-- category_sub_class: string (nullable = true)
 |-- category_sub_sub_class: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- weekofyear: integer (nullable = true)
 |-- dayofyear: integer (nullable = true)
 |-- dayofweek: integer (nullable = true)
 |-- dayofmonth: integer (nullable = true)
 |-- hour: integer (nullable = true)



## Dataframe Creation

Create several dataframes at different aggregation levels to answer different questions and tasks.

In [5]:
# raw
sdf_raw = sdf

In [6]:
sdf.createOrReplaceTempView("Data")

In [7]:
# aggregated customer
sdf_agg_cust = sdf.groupBy("user_id", "user_session", "event_type", "product_id").mean("price")


In [8]:
# aggregated session

In [9]:
# aggregated product


In [10]:
# aggregated class

In [11]:
# aggregated time (weeks, dayofweeks, month)

sdf_time_dist_month = sdf.groupBy("event_type", "dayofmonth").count()
sdf_time_dist_month = sdf_time_dist_month.withColumnRenamed("count", "cnt")
sdf_time_dist_month = sdf_time_dist_month.sort("dayofmonth", "event_type")


In [12]:
sdf_time_dist_week = sdf.groupBy("event_type", "dayofweek").count()
sdf_time_dist_week = sdf_time_dist_week.withColumnRenamed("count", "cnt")
sdf_time_dist_week = sdf_time_dist_week.sort("dayofweek", "event_type")


In [13]:
sdf_time_dist_day = sdf.groupBy("event_type", "hour").count()
sdf_time_dist_day = sdf_time_dist_day.withColumnRenamed("count", "cnt")
sdf_time_dist_day = sdf_time_dist_day.sort("hour", "event_type")


## Field Explanations

The following fields are in the standard dataset:
- event_time -> Timestamp of the event
- event_type -> Type of the event (e.g. view, purchase, cart)
- product_id -> unique ID of the product viewed/purchased
- category_id -> unique ID of the category the product belongs to
- category_code -> Code in written words for every category. Subcategories are split by "."
- brand -> brand of the product
- price -> sales price of the product
- user_id -> ID of the user who triggered the event
- user_session -> unique session ID from this user to connect entries with each other

In [None]:
#TODO list all new features
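
The calendar features added in the Preparation step follow Spark's conventions: `weekofyear` is the ISO 8601 week number, and `dayofweek` runs from 1 (Sunday) to 7 (Saturday). The sketch below reproduces them in plain Python for a single timestamp so the values can be checked by hand; the helper `time_features` is a made-up name and not part of the notebook.

```python
from datetime import datetime


def time_features(ts: datetime) -> dict:
    """Derive the same calendar features as the Spark functions used in Preparation."""
    _, iso_week, iso_weekday = ts.isocalendar()
    return {
        "year": ts.year,
        "month": ts.month,
        "weekofyear": iso_week,              # ISO 8601 week number
        "dayofyear": ts.timetuple().tm_yday,
        "dayofweek": iso_weekday % 7 + 1,    # Spark convention: 1 = Sunday, ..., 7 = Saturday
        "dayofmonth": ts.day,
        "hour": ts.hour,
    }


print(time_features(datetime(2019, 11, 1, 1, 0, 0)))
```

For 2019-11-01 01:00:00 this gives week 44, day of year 305 and day of week 6 (a Friday), which matches the record printed by `sdf_raw.show(1, vertical=True)`.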

### General
This overview shows the total number of rows, the number of unique product_ids, category_classes, category_codes, category_ids, brands, user_ids and user_sessions, as well as the average price of the products.

In [14]:
sdf_count_overview = spark.sql("SELECT COUNT(*) AS Row_Count, \
                                       COUNT(DISTINCT(product_id)) AS Product_ID, \
                                       COUNT(DISTINCT(category_class)) AS Category_Class, \
                                       COUNT(DISTINCT(category_code)) AS Category_Code, \
                                       COUNT(DISTINCT(category_id)) AS Category_ID, \
                                       COUNT(DISTINCT(brand)) AS Brand, \
                                       COUNT(DISTINCT(user_id)) AS User_ID, \
                                       COUNT(DISTINCT(user_session)) AS User_Session, \
                                       ROUND(MEAN(price),2) AS AVG_Price \
                                FROM Data")
sdf_count_overview.show()

+---------+----------+--------------+-------------+-----------+-----+-------+------------+---------+
|Row_Count|Product_ID|Category_Class|Category_Code|Category_ID|Brand|User_ID|User_Session|AVG_Price|
+---------+----------+--------------+-------------+-----------+-----+-------+------------+---------+
|      244|       214|             9|           41|         90|   98|    153|         153|   302.15|
+---------+----------+--------------+-------------+-----------+-----+-------+------------+---------+



In [15]:
sdf_raw.show(1, vertical=True)
print(f"Number of total rows: {sdf_raw.count()}")

-RECORD 0--------------------------------------
 event_time             | 2019-11-01 01:00:00  
 event_type             | view                 
 product_id             | 1003461              
 category_id            | 2053013555631882655  
 category_code          | electronics.smart... 
 brand                  | xiaomi               
 price                  | 489.07               
 user_id                | 520088904            
 user_session           | 4d3b30da-a5e4-49d... 
 category_class         | electronics          
 category_sub_class     | smartphone           
 category_sub_sub_class | null                 
 year                   | 2019                 
 month                  | 11                   
 weekofyear             | 44                   
 dayofyear              | 305                  
 dayofweek              | 6                    
 dayofmonth             | 1                    
 hour                   | 1                    
only showing top 1 row

Number of total 

### event_time

In [16]:
sdf_raw.select("event_time").show(5)
print(f"Number of distinct event_time rows: {sdf_raw.select('event_time').distinct().count()}")

+-------------------+
|         event_time|
+-------------------+
|2019-11-01 01:00:00|
|2019-11-01 01:00:00|
|2019-11-01 01:00:01|
|2019-11-01 01:00:01|
|2019-11-01 01:00:01|
+-------------------+
only showing top 5 rows

Number of distinct event_time rows: 84


### event_type
The event_type describes the kind of interaction a user had with a product. The field can take three values: view, cart and purchase. The distribution of these three values is shown in the following plot:

In [17]:
sdf_event_type_dist = sdf_raw.groupBy("event_type").count()
sdf_event_type_dist.show()

+----------+-----+
|event_type|count|
+----------+-----+
|  purchase|    2|
|      view|  240|
|      cart|    2|
+----------+-----+



In [18]:
# Plot Event Types
df = sdf_event_type_dist.toPandas()
fig = px.pie(df, values='count', names='event_type', title='Distribution of Customer Actions')
fig.show()
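
The pie chart shows relative shares; the same numbers can be computed directly from the counts. This is a small sketch in plain Python using the counts printed for the test data above; `event_shares` is a hypothetical helper name.

```python
def event_shares(counts):
    """Return the share of each event type as a fraction of all events."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {event: n / total for event, n in counts.items()}


# counts as printed by sdf_event_type_dist.show() for the test data
counts = {"purchase": 2, "view": 240, "cart": 2}
for event, share in event_shares(counts).items():
    print(f"{event}: {share:.1%}")
```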

### product_id
The product_id is the unique identifier of a product. As the overview shows, there are ... unique product_ids in the Oct-2019 and Nov-2019 datasets that users have interacted with.


In [19]:
sdf_count_per_product_id = spark.sql("SELECT DISTINCT(product_id) AS Product_ID, \
                                                COUNT(product_id) AS Count \
                                        FROM Data \
                                        GROUP BY product_id \
                                        ORDER BY Count DESC")
px.bar(sdf_count_per_product_id.limit(10).toPandas(), x='Product_ID', y='Count', title="Top 10 most interacted products")


In [20]:
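# TODO: delete this cell once it is confirmed to match the SQL version above
# The result differed because orderBy("count") sorts ascending; sort descending to get the top 10
# (products with equal counts may still appear in a different order)
sdf_product_id_dist = sdf_raw.groupBy("product_id").count().orderBy("count", ascending=False)

px.bar(sdf_product_id_dist.limit(10).toPandas(), x='product_id', y='count', title="Top 10 most interacted products")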

### category_id
The category_id is a unique identifier for the category of a product. Every product is assigned to a single category_id, so one category_id groups many product_ids. This is based on the more detailed analysis in the file "product_analysis.ipynb". As the overview shows, there are ... unique category_ids.

In [21]:
sdf_count_per_category_id=spark.sql("SELECT DISTINCT(category_id) AS Category_ID, \
                                            COUNT(*) AS Count \
                                     FROM Data \
                                     GROUP BY category_id \
                                     ORDER BY Count DESC")
                                     
px.bar(sdf_count_per_category_id.limit(10).toPandas(), x="Category_ID", y="Count", title="Top 10 category_ids most interacted with")

### category_code
The category_code describes the category that a product_id and a category_id are assigned to. Every product_id and category_id is assigned to a single category_code, so one category_code groups many product_ids and category_ids. This is also based on the more detailed analysis in the file "product_analysis.ipynb". As the overview shows, there are ... unique category_codes.

In [22]:
sdf_count_per_category_code=spark.sql("SELECT DISTINCT(category_code) AS Category_Code, \
                                              COUNT(product_id) AS Count \
                                        FROM Data \
                                        GROUP BY category_code \
                                        ORDER BY Count DESC")
px.bar(sdf_count_per_category_code.limit(10).toPandas(), x="Category_Code", y="Count", title="Top 10 category_code most interacted with")

### category_class and category_sub_class
The category_code generally consists of two or three parts separated by a dot, for example appliances.kitchen.washer or electronics.smartphone. The category code can therefore be split into three levels: category_class, category_sub_class and category_sub_sub_class.
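
The splitting rule can be illustrated in plain Python. This is only a sketch of the idea, not the Spark code used in the notebook; `split_category_code` is a made-up helper, and missing levels are filled with `None`, which corresponds to the `null` values seen in the Spark output.

```python
def split_category_code(category_code, levels=3):
    """Split 'a.b.c' into exactly `levels` parts, padding missing parts with None."""
    if category_code is None:
        return (None,) * levels
    parts = category_code.split(".")[:levels]
    return tuple(parts + [None] * (levels - len(parts)))


print(split_category_code("appliances.kitchen.washer"))
print(split_category_code("electronics.smartphone"))
```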

The category_class is the first part of the category_code. It can be used to group many category_codes under one overarching category_class. As the overview shows, there are ... unique category_classes.

In [23]:
sdf_agg_classes = sdf_raw.groupBy("category_class", "category_sub_class", "category_sub_sub_class").count().na.fill(value="not defined")
sdf_agg_classes = sdf_agg_classes.where(sdf_agg_classes["category_class"] != "not defined")
sdf_agg_classes.show() #TODO groupBy product_id

+--------------+------------------+----------------------+-----+
|category_class|category_sub_class|category_sub_sub_class|count|
+--------------+------------------+----------------------+-----+
|       apparel|             jeans|           not defined|    3|
|          auto|       accessories|            compressor|    2|
|       apparel|             shoes|           not defined|    1|
|  construction|             tools|                 drill|    4|
|   electronics|             audio|             headphone|    5|
|    appliances|           kitchen|               blender|    1|
|     furniture|           kitchen|                 table|    1|
|    appliances|    sewing_machine|           not defined|    2|
|    appliances|           kitchen|         refrigerators|    5|
|   accessories|               bag|           not defined|    1|
|     computers|       peripherals|               printer|    2|
|     furniture|          bathroom|                  bath|    1|
|    appliances|         

In [24]:
px.sunburst(sdf_agg_classes.toPandas(), path=["category_class", "category_sub_class", "category_sub_sub_class"], values="count", title="Category Classes and Subclasses (without data for class = 'not defined')")

### brand
The brand indicates the brand of a product_id. It is independent of the categories, so a brand can appear in many category_classes. This is also based on the more detailed analysis in the file "product_analysis.ipynb". There are ... unique brands in the dataset. The most popular brands are shown in the following plot:

In [25]:
sdf_count_per_brand=spark.sql("SELECT DISTINCT(brand) AS Brand, \
                                      COUNT(*) AS Count \
                                FROM Data \
                                GROUP BY brand \
                                ORDER BY Count DESC")

# the counts are already aggregated, so a bar chart is used as in the plots above
px.bar(sdf_count_per_brand.limit(10).toPandas(), x="Brand", y="Count", title="Top 10 brands most interacted with")

### price

In [26]:
sdf_raw.describe("price").show()

+-------+-----------------+
|summary|            price|
+-------+-----------------+
|  count|              244|
|   mean|302.1450819672131|
| stddev|425.3893301363122|
|    min|             1.09|
|    max|          2496.59|
+-------+-----------------+



In [27]:
px.box(sdf_raw.groupBy("product_id").avg("price").toPandas(), y="avg(price)", title="Price distribution for products in Store")


### user_id

In [28]:
sdf_raw.select("user_id").show(5)

print("Number of users:")
print(sdf_raw.select("user_id").distinct().count())

+---------+
|  user_id|
+---------+
|520088904|
|530496790|
|561587266|
|518085591|
|558856683|
+---------+
only showing top 5 rows

Number of users:
153


### user_session

In [29]:
# avg actions per session
sdf_cnt_action_per_session = sdf_raw.groupby("user_session").count()
sdf_cnt_action_per_session.describe().show()

+-------+--------------------+------------------+
|summary|        user_session|             count|
+-------+--------------------+------------------+
|  count|                 153|               153|
|   mean|                null|1.5947712418300655|
| stddev|                null|1.0287825488662496|
|    min|0110890b-96d6-4ae...|                 1|
|    max|ff868137-fb3e-4da...|                 6|
+-------+--------------------+------------------+



## Exploration and Analysis

### Time Distribution

In [30]:
# Timestamp Distribution (per event_type) over every day of month

df = sdf_time_dist_month.toPandas()

fig = px.bar(df, x = 'dayofmonth', y = 'cnt', color ='event_type', barmode = 'stack')

fig.update_layout(title = "Number of events over a month",
     xaxis_title = 'Day of Month', yaxis_title = 'Number of Events')
fig.update_xaxes(type="category")
fig.show()

In [31]:
# Timestamp Distribution (per event_type) over every day of week

df = sdf_time_dist_week.toPandas()

fig = px.bar(df, x = 'dayofweek', y = 'cnt', color ='event_type', barmode = 'stack')

fig.update_layout(title = "Number of events over a week",
     xaxis_title = 'Day of Week', yaxis_title = 'Number of Events')
fig.update_xaxes(type="category")
fig.show()

In [32]:
# Timestamp Distribution (per event_type) over every hour of a day

df = sdf_time_dist_day.toPandas()

fig = px.bar(df, x = 'hour', y = 'cnt', color ='event_type', barmode = 'stack')

fig.update_layout(title = "Number of events over a day",
     xaxis_title = 'Hour of day', yaxis_title = 'Number of Events')
fig.update_xaxes(type="category")
fig.show()

### Category and products

#### Connection between category_class, category_code, category_id, product_id and brand

Each product_id belongs to exactly one category_id, each category_id to exactly one category_code, and each category_code to exactly one category_class (product_id ⊂ category_id ⊂ category_code ⊂ category_class). The brand, on the other hand, spans several classes. This is based on the more detailed analysis in the file "product_analysis.ipynb".
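
Such a hierarchy claim can be checked by looking for children that map to more than one parent. The sketch below does this in plain Python on hypothetical sample rows (they are not taken from the dataset); the helper name `children_with_multiple_parents` is also made up for this example.

```python
from collections import defaultdict


def children_with_multiple_parents(rows, child, parent):
    """Return {child_value: set_of_parent_values} for children mapped to more than one parent."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return {c: p for c, p in parents.items() if len(p) > 1}


# Hypothetical sample rows, not taken from the dataset
rows = [
    {"product_id": "p1", "category_id": "c1", "category_code": "electronics.smartphone"},
    {"product_id": "p2", "category_id": "c1", "category_code": "electronics.smartphone"},
    {"product_id": "p3", "category_id": "c2", "category_code": "appliances.kitchen.washer"},
]

# An empty result means the hierarchy holds for this level
print(children_with_multiple_parents(rows, "product_id", "category_id"))
print(children_with_multiple_parents(rows, "category_id", "category_code"))
```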

The following plot shows the distribution of the product_id, category_id and category_code within the category_class. A more detailed view is available by selecting a specific category_class, category_code or category_id.

In [33]:
# alias the aggregate so the column name does not depend on the Spark version
sdf_product_per_category = sdf_raw.groupBy("category_id").agg(f.countDistinct("product_id").alias("product_count"))

df = sdf_product_per_category.toPandas()
px.box(df, y="product_count", title="Number of products per category_id")

In [34]:
sdf_agg_brand_category = sdf_raw.groupBy("category_class", "brand", "product_id").count().na.fill(value="not defined")
px.sunburst(sdf_agg_brand_category.toPandas(), path=["category_class", "brand"], values="count", title="Brands per Category_class")

#### Connection to the price

The following plots show the price distribution within the category_classes, category_codes, category_ids, product_ids and brands.

In [35]:
px.box(sdf_raw.toPandas(), x="category_class", y="price", title="Price ~ Category_class")

In [36]:
sdf_price_per_product=spark.sql("SELECT DISTINCT(Product_ID), \
                                        Price \
                                 FROM Data \
                                 ORDER BY Price Desc")
px.bar(sdf_price_per_product.limit(10).toPandas(), x="Product_ID", y="Price", title="TOP 10 most expensive Product_IDs")

#### Connection to the event-type
The following plots show the event_type distribution within the category_classes, category_codes, category_ids, product_ids and brands.

In [106]:
sdf_category_class_event_distribution = spark.sql("SELECT category_class, \
                                                          event_type, \
                                                          Count(*) AS Count \
                                                    FROM Data \
                                                    GROUP BY category_class, event_type")
px.sunburst(sdf_category_class_event_distribution.na.fill(value="not defined").toPandas(), path=['category_class','event_type'], values="Count")

### Event_Type and Price
The following plot shows the price distribution per event_type.

In [95]:
px.box(sdf_raw.toPandas(), x="event_type", y="price", title="Price ~ Event_Type")

In [114]:
# groupby product and price --> see if different prices for same product

sdf_prices = sdf.select("product_id", "price").distinct().groupBy("product_id").count()
sdf_prices.where(sdf_prices["count"] > 1).show()

+----------+-----+
|product_id|count|
+----------+-----+
+----------+-----+



### User analysis

### Session analysis

The following attributes are planned per session:

- number of events
- session_start_time
- session_stop_time
- session_success
- products_bought
- products_viewed
- turnover
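
To make the attributes in this list concrete, here is a minimal sketch in plain Python that summarizes the events of a single, non-empty session. It is an illustration under assumptions, not the notebook's Spark implementation: `summarize_session` and the sample events are hypothetical, and turnover is assumed to be the sum of the prices of purchase events.

```python
from datetime import datetime


def summarize_session(events):
    """Summarize one non-empty session given as a list of event dicts."""
    times = [e["event_time"] for e in events]
    purchases = [e for e in events if e["event_type"] == "purchase"]
    views = [e for e in events if e["event_type"] == "view"]
    return {
        "number_of_events": len(events),
        "session_start_time": min(times),
        "session_stop_time": max(times),
        "session_success": len(purchases) > 0,
        "products_bought": [e["product_id"] for e in purchases],
        "products_viewed": sorted({e["product_id"] for e in views}),
        # assumption: turnover is the summed price of purchase events only
        "turnover": sum(e["price"] for e in purchases),
    }


# Hypothetical events of one session
events = [
    {"event_type": "view", "product_id": "1003461", "price": 489.07,
     "event_time": datetime(2019, 11, 1, 1, 0, 0)},
    {"event_type": "cart", "product_id": "1003461", "price": 489.07,
     "event_time": datetime(2019, 11, 1, 1, 0, 30)},
    {"event_type": "purchase", "product_id": "1003461", "price": 489.07,
     "event_time": datetime(2019, 11, 1, 1, 1, 30)},
]
print(summarize_session(events))
```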


In [52]:
sdf_session = sdf.select("user_id", "user_session", "event_type", "product_id", "price", "event_time").orderBy("user_id", "user_session")

In [53]:
sdf_session = sdf_session.withColumn("views", f.when(sdf_session.event_type == "view", 1).otherwise(0))
sdf_session = sdf_session.withColumn("purchases", f.when(sdf_session.event_type == "purchase", 1).otherwise(0))
sdf_session = sdf_session.withColumn("carts", f.when(sdf_session.event_type == "cart", 1).otherwise(0))

sdf_session = sdf_session.withColumn("first_event", sdf_session.event_time)
sdf_session = sdf_session.withColumn("last_event", sdf_session.event_time)

In [95]:
sdf_session_agg = sdf_session.groupBy("user_id", "user_session").agg(f.avg("price"), f.sum("views"), f.sum("purchases"), f.sum("carts"), f.min("event_time"), f.max("event_time"))
sdf_session_agg = sdf_session_agg.withColumn("duration", (sdf_session_agg["max(event_time)"] - sdf_session_agg["min(event_time)"]))
sdf_session_agg = sdf_session_agg.withColumn("sum(events)", (sdf_session_agg["sum(views)"] + sdf_session_agg["sum(purchases)"] + sdf_session_agg["sum(carts)"]))
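# Note: turnover is approximated as number of purchases * average price of all events in the session
sdf_session_agg = sdf_session_agg.withColumn("turnover", f.when(sdf_session_agg["sum(purchases)"] > 0, (sdf_session_agg["sum(purchases)"] *  sdf_session_agg["avg(price)"])).otherwise(0))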

sdf_session_agg = sdf_session_agg.withColumn("successfull", f.when(sdf_session_agg["sum(purchases)"] > 0, 1).otherwise(0))

In [96]:
sdf_session_agg.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- user_session: string (nullable = true)
 |-- avg(price): double (nullable = true)
 |-- sum(views): long (nullable = true)
 |-- sum(purchases): long (nullable = true)
 |-- sum(carts): long (nullable = true)
 |-- min(event_time): timestamp (nullable = true)
 |-- max(event_time): timestamp (nullable = true)
 |-- duration: interval (nullable = true)
 |-- sum(events): long (nullable = true)
 |-- turnover: double (nullable = true)
 |-- successfull: integer (nullable = false)



In [97]:
sdf_session_agg.show()

+---------+--------------------+------------------+----------+--------------+----------+-------------------+-------------------+--------------------+-----------+--------+-----------+
|  user_id|        user_session|        avg(price)|sum(views)|sum(purchases)|sum(carts)|    min(event_time)|    max(event_time)|            duration|sum(events)|turnover|successfull|
+---------+--------------------+------------------+----------+--------------+----------+-------------------+-------------------+--------------------+-----------+--------+-----------+
|436701163|50de79b1-b0ec-42c...|            128.42|         1|             0|         0|2019-11-01 01:00:23|2019-11-01 01:00:23|           0 seconds|          1|     0.0|          0|
|512367687|09085a31-dc7d-46c...|            168.06|         1|             0|         0|2019-11-12 02:34:42|2019-11-12 02:34:42|           0 seconds|          1|     0.0|          0|
|512370912|daf0bf99-adf1-487...|            287.83|         1|             0|        

In [98]:
sdf_session_agg.describe().show()

+-------+-------------------+--------------------+------------------+------------------+--------------------+--------------------+------------------+------------------+--------------------+
|summary|            user_id|        user_session|        avg(price)|        sum(views)|      sum(purchases)|          sum(carts)|       sum(events)|          turnover|         successfull|
+-------+-------------------+--------------------+------------------+------------------+--------------------+--------------------+------------------+------------------+--------------------+
|  count|                153|                 153|               153|               153|                 153|                 153|               153|               153|                 153|
|   mean|       5.37992906E8|                null| 276.2533169934642|1.5686274509803921|0.013071895424836602|0.013071895424836602|1.5947712418300655|  5.08640522875817|0.013071895424836602|
| stddev|2.155841121697013E7|                null|

### Customer Profiles

In preparation for clustering, a customer profile will be created with the following attributes:

- customer_id
- number_of_view_events
- number_of_cart_events
- number_of_purchase_events
- total_turnover
- number_of_bought_items (resolve multiple purchasing events for quantity)
- avg_sold_cart
- avg_session_time
- avg_actions_per_session
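
How session-level values roll up into such a profile can be sketched in plain Python. Everything here is an assumption for illustration: the helper `build_customer_profile`, the input keys, and the reading of avg_sold_cart as average turnover per successful session. number_of_bought_items is left out because the quantity question noted above is still open.

```python
def build_customer_profile(customer_id, sessions):
    """Aggregate per-session dicts (views, carts, purchases, turnover, duration_seconds)."""
    n_sessions = len(sessions)
    total_events = sum(s["views"] + s["carts"] + s["purchases"] for s in sessions)
    total_turnover = sum(s["turnover"] for s in sessions)
    successful = [s for s in sessions if s["purchases"] > 0]
    total_duration = sum(s["duration_seconds"] for s in sessions)
    return {
        "customer_id": customer_id,
        "number_of_view_events": sum(s["views"] for s in sessions),
        "number_of_cart_events": sum(s["carts"] for s in sessions),
        "number_of_purchase_events": sum(s["purchases"] for s in sessions),
        "total_turnover": total_turnover,
        # assumption: average turnover over sessions that contain a purchase
        "avg_sold_cart": total_turnover / len(successful) if successful else 0.0,
        "avg_session_time": total_duration / n_sessions if n_sessions else 0.0,
        "avg_actions_per_session": total_events / n_sessions if n_sessions else 0.0,
    }


# Hypothetical sessions of one customer
sessions = [
    {"views": 2, "carts": 1, "purchases": 1, "turnover": 100.0, "duration_seconds": 60},
    {"views": 1, "carts": 0, "purchases": 0, "turnover": 0.0, "duration_seconds": 0},
]
print(build_customer_profile("520088904", sessions))
```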



In [104]:
sdf_customer_profile = sdf_session_agg.groupBy("user_id").agg(f.sum("sum(events)"), f.sum("sum(views)"), f.sum("sum(purchases)"), f.sum("sum(carts)"), f.sum("turnover"), f.count("user_session"), f.sum("successfull"))

sdf_customer_profile = sdf_customer_profile.withColumn("avg_turnover_per_session", (sdf_customer_profile["sum(turnover)"] / sdf_customer_profile["count(user_session)"]))
sdf_customer_profile = sdf_customer_profile.withColumn("avg_events_per_session", (sdf_customer_profile["sum(sum(events))"] / sdf_customer_profile["count(user_session)"]))

In [105]:
sdf_customer_profile.show(30)

+---------+----------------+---------------+-------------------+---------------+-------------+-------------------+----------------+------------------------+----------------------+
|  user_id|sum(sum(events))|sum(sum(views))|sum(sum(purchases))|sum(sum(carts))|sum(turnover)|count(user_session)|sum(successfull)|avg_turnover_per_session|avg_events_per_session|
+---------+----------------+---------------+-------------------+---------------+-------------+-------------------+----------------+------------------------+----------------------+
|512416379|               2|              2|                  0|              0|          0.0|                  1|               0|                     0.0|                   2.0|
|518045858|               2|              2|                  0|              0|          0.0|                  1|               0|                     0.0|                   2.0|
|566280399|               1|              1|                  0|              0|          0.0|      

In [None]:
# basic visualisations

In [None]:
# groupby product and price --> see unique

### Customer Journey

- Path to success (purchase)
- Path to failure (no purchase)
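
One possible way to compare the two kinds of paths is to turn each session into an ordered string of event types and count how often each string occurs. The sketch below is plain Python on hypothetical sessions; the helper names are made up, and event times are assumed to be strings in the `YYYY-MM-DD HH:MM:SS` format so that they sort chronologically.

```python
from collections import Counter


def session_path(events):
    """Return the event types of one session in chronological order, e.g. 'view > cart'."""
    ordered = sorted(events, key=lambda e: e["event_time"])
    return " > ".join(e["event_type"] for e in ordered)


def path_counts(sessions):
    """Count paths separately for sessions with and without a purchase."""
    success, failure = Counter(), Counter()
    for events in sessions.values():
        has_purchase = any(e["event_type"] == "purchase" for e in events)
        target = success if has_purchase else failure
        target[session_path(events)] += 1
    return success, failure


# Hypothetical sessions keyed by user_session
sessions = {
    "s1": [
        {"event_type": "purchase", "event_time": "2019-11-01 01:02:00"},
        {"event_type": "view", "event_time": "2019-11-01 01:00:00"},
        {"event_type": "cart", "event_time": "2019-11-01 01:01:00"},
    ],
    "s2": [{"event_type": "view", "event_time": "2019-11-01 02:00:00"}],
    "s3": [{"event_type": "view", "event_time": "2019-11-01 03:00:00"}],
}
success, failure = path_counts(sessions)
print(success.most_common())
print(failure.most_common())
```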

## Clustering

- Customer classification (h, m, l)
- Customer-related content/recommendation
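
As a placeholder for the classification idea, the sketch below labels customers by their rank in total turnover. This is only one possible reading: it assumes that h, m and l stand for high, medium and low, that turnover is the ranking criterion, and that equal thirds are a sensible split. None of this is fixed by the notebook, and the helper `classify_customers` is hypothetical.

```python
def classify_customers(turnover_by_customer):
    """Label each customer 'l', 'm' or 'h' by rank of turnover (lowest third = 'l')."""
    ranked = sorted(turnover_by_customer, key=turnover_by_customer.get)
    n = len(ranked)
    labels = {}
    for rank, customer in enumerate(ranked):
        if rank < n / 3:
            labels[customer] = "l"
        elif rank < 2 * n / 3:
            labels[customer] = "m"
        else:
            labels[customer] = "h"
    return labels


# Hypothetical turnover values per customer
turnover = {"a": 0.0, "b": 10.0, "c": 50.0, "d": 120.0, "e": 400.0, "f": 900.0}
print(classify_customers(turnover))
```

Because the split is rank-based, customers with identical turnover can end up in different groups; with many zero-turnover customers a threshold-based rule may be the better choice.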