# Customer Analysis - Explore Customer Behavior

## Import

The packages below are required: PySpark handles the data processing and Plotly the visualisations. Spark needs
Java to run, so make sure Java is installed first.

The dataset comes from https://rees46.com/de and is available at https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store.

In [11]:
import os
import pyspark
import pandas as pd
import pyspark.sql.functions as f
import plotly.express as px
import plotly.graph_objects as go

## Read

The data must be placed in ```data/``` as unzipped CSV files.
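Before starting Spark, it can help to confirm that the files are where the notebook expects them. The following is a minimal sketch using only the standard library; the file names are taken from the read cell below, and `missing_files` is a hypothetical helper, not part of the original notebook.

```python
import os

# File names as used in the read cell; the data/ folder is the location stated above.
EXPECTED_FILES = ["2019-Oct.csv", "2019-Nov.csv"]


def missing_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected CSV files that are not present in data_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


missing = missing_files()
if missing:
    print("Missing in data/:", ", ".join(missing))
else:
    print("All expected CSV files found.")
```

Checking up front gives a clear message instead of a Spark path error later in the read cell.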

In [21]:
# read raw data
spark = pyspark.sql.SparkSession.builder.appName("app1").getOrCreate()
# sdf = spark.read.csv("data/*.csv", header=True, inferSchema=True)
sdf_201911 = spark.read.csv("data/2019-Nov.csv", header=True, inferSchema=True)
sdf_201910 = spark.read.csv("data/2019-Oct.csv", header=True, inferSchema=True)

+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand|  price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+
|2019-10-01 00:00:...|      view|  44600062|2103807459595387724|                null|shiseido|  35.79|541312140|72d76fde-8bb3-4e0...|
|2019-10-01 00:00:...|      view|   3900821|2053013552326770905|appliances.enviro...|    aqua|   33.2|554748717|9333dfbd-b87a-470...|
|2019-10-01 00:00:...|      view|  17200506|2053013559792632471|furniture.living_...|    null|  543.1|519107250|566511c2-e2e3-422...|
|2019-10-01 00:00:...|      view|   1307067|2053013558920217191|  computers.notebook|  lenovo| 251.74|550050854|7c90fc70-0e80-459...|
|2019-10-01 00:00:...|      view|   1004237|205301355563188265

In [None]:
# join both months together
sdf = sdf_201910.union(sdf_201911)
# sdf = spark.read.csv("data/test_data.csv", header=True, inferSchema=True)
sdf.show()

## Preparation

Prepare and enhance data for analysis and modelling.

In [22]:
# Datatypes
sdf = sdf.withColumn("event_time", sdf["event_time"].cast(pyspark.sql.types.TimestampType()))
sdf = sdf.withColumn("category_id", sdf["category_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("product_id", sdf["product_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("user_id", sdf["user_id"].cast(pyspark.sql.types.StringType()))

# Feature Splitting
sdf = sdf.withColumn("category_class", f.substring_index(sdf.category_code, '.', 1))

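# Optional: deeper category levels (these columns appear in the schema printed below).
# f.split takes a regular expression, so the dot is escaped to match a literal ".".
# sdf = sdf.withColumn("category_class", f.split(sdf["category_code"], "\\.").getItem(0))
# sdf = sdf.withColumn("category_sub_class", f.split(sdf["category_code"], "\\.").getItem(1))
# sdf = sdf.withColumn("category_sub_sub_class", f.split(sdf["category_code"], "\\.").getItem(2))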

sdf = sdf.withColumn("year", f.year("event_time"))
sdf = sdf.withColumn("month", f.month("event_time"))
sdf = sdf.withColumn("weekofyear", f.weekofyear("event_time"))
sdf = sdf.withColumn("dayofyear", f.dayofyear("event_time"))
sdf = sdf.withColumn("dayofweek", f.dayofweek("event_time"))
sdf = sdf.withColumn("dayofmonth", f.dayofmonth("event_time"))

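# Null handling: a string fill value only replaces nulls in string columns,
# so numeric columns such as price keep their nulls
sdf = sdf.fillna(value="not defined")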

sdf.printSchema()

root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = false)
 |-- product_id: string (nullable = false)
 |-- category_id: string (nullable = false)
 |-- category_code: string (nullable = false)
 |-- brand: string (nullable = false)
 |-- price: double (nullable = true)
 |-- user_id: string (nullable = false)
 |-- user_session: string (nullable = false)
 |-- category_class: string (nullable = false)
 |-- category_sub_class: string (nullable = false)
 |-- category_sub_sub_class: string (nullable = false)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- weekofyear: integer (nullable = true)
 |-- dayofyear: integer (nullable = true)
 |-- dayofweek: integer (nullable = true)
 |-- dayofmonth: integer (nullable = true)
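The `category_code` column is a dot-separated path such as `computers.notebook`. Regex-based splitters such as `f.split` treat an unescaped `.` as "any character", so the dot has to be escaped to split on it. The pure-Python sketch below illustrates the difference with the standard `re` module, which agrees with Spark's regex dialect on this point. `split_category` and its `levels` parameter are hypothetical helpers, and the padding value mirrors the notebook's `fillna` placeholder.

```python
import re

NOT_DEFINED = "not defined"  # same placeholder the notebook passes to fillna


def split_category(code, levels=3):
    """Split a dot-separated category code into a fixed number of levels."""
    if code is None:
        return [NOT_DEFINED] * levels
    parts = re.split(r"\.", code)[:levels]  # escaped dot: match a literal "."
    return parts + [NOT_DEFINED] * (levels - len(parts))


print(split_category("computers.notebook"))
# ['computers', 'notebook', 'not defined']
print(split_category(None))
# ['not defined', 'not defined', 'not defined']

# Unescaped dot: every character is treated as a separator, leaving only empty strings.
print(set(re.split(".", "computers.notebook")))
# {''}
```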



## Dataframe Creation

Create several DataFrames at different aggregation levels to answer different questions and tasks.

In [38]:
# raw
sdf_raw = sdf
sdf_raw.show()

+-------------------+----------+----------+-------------------+--------------------+-----------+-------+---------+--------------------+--------------+------------------+----------------------+----+-----+----------+---------+---------+----------+
|         event_time|event_type|product_id|        category_id|       category_code|      brand|  price|  user_id|        user_session|category_class|category_sub_class|category_sub_sub_class|year|month|weekofyear|dayofyear|dayofweek|dayofmonth|
+-------------------+----------+----------+-------------------+--------------------+-----------+-------+---------+--------------------+--------------+------------------+----------------------+----+-----+----------+---------+---------+----------+
|2019-10-01 02:00:00|      view|  44600062|2103807459595387724|         not defined|   shiseido|  35.79|541312140|72d76fde-8bb3-4e0...|   not defined|       not defined|           not defined|2019|   10|        40|      274|        3|         1|
|2019-10-01 02:0
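The date-part columns can be sanity-checked against Python's standard library. The sketch below assumes Spark's documented conventions: `dayofweek` counts from 1 = Sunday to 7 = Saturday, and `weekofyear` follows ISO 8601. The helper name `spark_date_parts` is made up for illustration. For 2019-10-01 it reproduces the values in the sample row above.

```python
from datetime import date


def spark_date_parts(d):
    """Mimic the Spark date-part functions used in the preparation cell."""
    return {
        "year": d.year,
        "month": d.month,
        "weekofyear": d.isocalendar()[1],     # ISO 8601 week number
        "dayofyear": d.timetuple().tm_yday,
        "dayofweek": d.isoweekday() % 7 + 1,  # 1 = Sunday ... 7 = Saturday
        "dayofmonth": d.day,
    }


print(spark_date_parts(date(2019, 10, 1)))
# {'year': 2019, 'month': 10, 'weekofyear': 40, 'dayofyear': 274, 'dayofweek': 3, 'dayofmonth': 1}
```

The sample row shows 02:00:00 where the raw preview shows 00:00, which suggests the timestamps are rendered in the session's local time zone. If so, events close to midnight may fall on a different day than they would in UTC.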

In [37]:
# aggregated customer: mean price per user, session, event type and product
sdf_agg_cust = sdf.groupBy("user_id", "user_session", "event_type", "product_id").mean("price")
sdf_agg_cust.show()

+---------+--------------------+----------+----------+------------------+
|  user_id|        user_session|event_type|product_id|        avg(price)|
+---------+--------------------+----------+----------+------------------+
|515483851|18ea1924-2c5a-4ac...|      view|  15700181|            214.16|
|549736688|29ab2a23-cfa8-411...|      view|   1004935|            167.03|
|477121012|413b498a-71d5-49f...|      view|   1004839|179.30666666666664|
|555279241|56616147-d002-47b...|      view|   1005115|            975.57|
|516692901|242696d6-672f-476...|      view|   1005115|            975.57|
|555461758|bf8b4392-9803-410...|      view|   2601908|            437.57|
|538138285|05f93693-8323-451...|      view|  17302407|            189.79|
|516025438|86916d0f-ed46-4e0...|      view|   1004650|            628.78|
|555462575|5ffa5556-648d-45c...|      view|   1700796|           2202.24|
|555462875|a85e90a6-37df-43a...|      view|   1004739|            197.55|
|555462891|4d8ee4c7-a87e-44f...|      

In [None]:
# aggregated session

In [None]:
# aggregated product


In [None]:
# aggregated class

## Field Explanations

The standard dataset contains the following fields:
- event_time
- event_type
- product_id
- category_id
- category_code
- brand
- price
- user_id
- user_session

In [None]:
# general

In [None]:
# event_time

In [None]:
# product_id

In [None]:
# category_id

In [None]:
# category_code


In [None]:
# brand


In [None]:
# price


In [None]:
# user_id


In [None]:
# user_session

## Exploration and Analysis

### Time Distribution

### Categories and Products

### User Analysis

### Customer Journey

## Clustering