# Customer Analysis - Explore Customer Behavior

## Import

Needed packages. Pyspark will be used for data managing and plotly for visualisations. Keep in mind to install
JAVA so Spark will work properly.

Used dataset is from https://rees46.com/de found on https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store.

In [163]:
# %%timeit
import os
import pyspark
import pandas as pd
import pyspark.sql.functions as f
import plotly.express as px
import plotly.graph_objects as go

## Read

The data needs to be located in ```data/``` in unzipped form as a csv.

In [164]:
# minimal spark session

# spark = pyspark.sql.SparkSession.builder.appName("app1").getOrCreate()

In [165]:
# spark = pyspark.sql.SparkSession \
#     .builder \
#     .appName("app_great") \
#     .getOrCreate()
# sc = spark.sparkContext

In [166]:
# better spark session

spark = pyspark.sql.SparkSession \
            .builder \
            .master("local") \
            .appName("app1") \
            .config("spark.executor.memory", f"16g") \
            .config("spark.driver.memory", f"16g") \
            .config("spark.memory.offHeap.enabled", True) \
            .config("spark.memory.offHeap.size", f"16g") \
            .config("spark.sql.debug.maxToStringFields", f"16") \
            .getOrCreate()
sc = spark.sparkContext

In [167]:
# read raw data

#sdf_201911 = spark.read.csv("data/2019-Nov.csv", header=True, inferSchema=True)
#sdf_201910 = spark.read.csv("data/2019-Oct.csv", header=True, inferSchema=True)

# join both months together
#sdf = sdf_201910.union(sdf_201911)

If test_data or samples are used: keep in mind the plots and analysis can be different in comparison to the whole dataset.

In [168]:
# read test data (test_data is top rows, sample are randomly selected data)

# sdf = spark.read.csv("data/test_data.csv", header=True, inferSchema=True)
sdf = spark.read.csv("data/sample_100k.csv", header=True, inferSchema=True)

In [169]:
# %%timeit

sdf.show(3)

+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|          event_time|event_type|product_id|        category_id|       category_code|   brand| price|  user_id|        user_session|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
|2019-10-01 00:07:...|      view|   2701657|2053013563911439225|appliances.kitche...|    beko|257.04|547949682|f2546bf3-6240-4ae...|
|2019-10-01 02:21:...|      view|   2601936|2053013563970159485|                null|dauscher| 483.9|548035257|e3541ed4-1629-4c9...|
|2019-10-01 02:21:...|      view|   1004872|2053013555631882655|electronics.smart...| samsung|286.35|514328693|655b8a4e-b567-400...|
+--------------------+----------+----------+-------------------+--------------------+--------+------+---------+--------------------+
only showing top 3 rows



## Preparation

Prepare data for analysis and modelling.

In [170]:
# %%timeit

# Datatypes
sdf = sdf.withColumn("event_time", sdf["event_time"].cast(pyspark.sql.types.TimestampType()))
sdf = sdf.withColumn("category_id", sdf["category_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("product_id", sdf["product_id"].cast(pyspark.sql.types.StringType()))
sdf = sdf.withColumn("user_id", sdf["user_id"].cast(pyspark.sql.types.StringType()))
#sdf = sdf.fillna(value="not defined")

## Dataframe Creation

create several dataframes with different aggregation level to answer different questions/ tasks.

In [171]:
# %%timeit

# raw
sdf_raw = sdf

In [172]:
# %%timeit

sdf.createOrReplaceTempView("Data")

## Field Explanations

Following fields are in the standard dataset:
- event_time
- product_id
- category_id
- category_code
- brand
- price
- user_id
- user_session

### General
In this overview you can see the count of unique rows, product_ids, category_codes, category_ids, brands, user_ids and user_sessions as well as the average price of the products.

In [173]:
# %%timeit

sdf_count_overview = spark.sql("SELECT COUNT(*) AS Row_Count, \
                                       COUNT(DISTINCT(product_id)) AS Product_ID, \
                                       COUNT(DISTINCT(category_code)) AS Category_Code, \
                                       COUNT(DISTINCT(category_id)) AS Category_ID, \
                                       COUNT(DISTINCT(brand)) AS Brand, \
                                       COUNT(DISTINCT(user_id)) AS User_ID, \
                                       COUNT(DISTINCT(user_session)) AS User_Session, \
                                       ROUND(MEAN(price),2) AS AVG_Price \
                                FROM Data")
sdf_count_overview.show()

+---------+----------+-------------+-----------+-----+-------+------------+---------+
|Row_Count|Product_ID|Category_Code|Category_ID|Brand|User_ID|User_Session|AVG_Price|
+---------+----------+-------------+-----------+-----+-------+------------+---------+
|   109740|     32258|          124|        609| 2166| 102615|      108952|   292.65|
+---------+----------+-------------+-----------+-----+-------+------------+---------+



In [174]:
# %%timeit

sdf_raw.select("product_id",  "category_code", "category_id", "brand", "user_id", "user_session", "price").describe().show()

+-------+--------------------+-------------------+--------------------+------------------+-------------------+--------------------+-----------------+
|summary|          product_id|      category_code|         category_id|             brand|            user_id|        user_session|            price|
+-------+--------------------+-------------------+--------------------+------------------+-------------------+--------------------+-----------------+
|  count|              109740|              74277|              109740|             94356|             109740|              109740|           109740|
|   mean| 1.174871172937853E7|               null|2.057716794746857...|               NaN| 5.36626043352433E8|                null|292.6456165482042|
| stddev|1.5470839284096591E7|               null|1.959282109329738...|               NaN|2.137620709675881E7|                null|358.8585517029615|
|    min|           100000016|    accessories.bag| 2053013552226107603|            a-case|          

In [175]:
# %%timeit

sdf_raw.show(1, vertical=True)


-RECORD 0-----------------------------
 event_time    | 2019-10-01 02:07:13  
 event_type    | view                 
 product_id    | 2701657              
 category_id   | 2053013563911439225  
 category_code | appliances.kitche... 
 brand         | beko                 
 price         | 257.04               
 user_id       | 547949682            
 user_session  | f2546bf3-6240-4ae... 
only showing top 1 row



### event_time

The column event_time consists of the elements date and timestamp.

In [176]:
# %%timeit

sdf_raw.select("event_time").show(5)


+-------------------+
|         event_time|
+-------------------+
|2019-10-01 02:07:13|
|2019-10-01 04:21:12|
|2019-10-01 04:21:13|
|2019-10-01 04:21:36|
|2019-10-01 04:21:52|
+-------------------+
only showing top 5 rows



### event_type
The event_type describes the kind of interaction, an user had with a product. The field can take three forms: View, Cart and Purchase. The distribution of these three forms is represented in the following plot:

In [177]:
# %%timeit

sdf_event_type_dist = sdf_raw.groupBy("event_type").count()
sdf_event_type_dist.show()

+----------+------+
|event_type| count|
+----------+------+
|  purchase|  1651|
|      view|104128|
|      cart|  3961|
+----------+------+



The table shows that the most interactions are Product-Views, followed by Products, which have been added to the cart. The least interaction is the purchase.

In [178]:
# %%timeit

# Plot Event Types
df = sdf_event_type_dist.select("count", "event_type").toPandas()
fig = px.pie(df, values='count', names='event_type', title='Distribution of Customer Actions')
fig.write_html("data/exports/Distribution of Customer Actions.html")
fig.show()

### product_id
The product_id is the unique identificator for a product. As you can see in the overview, there are 206876 unique product_ids in the datasets Oct-2019 and Nov-2019, the users have interacted with.


In [179]:
# %%timeit


sdf_count_per_product_id = sdf_raw.groupBy("product_id").count().orderBy(f.desc("count"))
px.bar(sdf_count_per_product_id.limit(10).toPandas(), x='product_id', y='count', title="Top 10 most interacted products").update_xaxes(type='category')


The most interacted products have the following IDs: 1004856, 1005115, 1004767

### category_id
The category_id is an unique identifier for the category of a Product. Every Product is assigned to a single category_id, which is summarizing many product_ids into categories. As you can see in the overview, there are 691 unique category_ids.

In [180]:
# %%timeit

sdf_count_per_category_id = sdf_raw.groupBy("category_id").count().orderBy(f.desc("count"))


px.bar(sdf_count_per_category_id.limit(10).toPandas(), x="category_id", y="count", title="Top 10  category_ids most interacted with").update_xaxes(type='category')

The most interacted categories have the following IDs: 2053013555631882655, 2053013553559896355, 2053013558920217191

### category_code
The category_code is describing the category, a product_id and category_id is assigned to. Every Product_id and Category_id is assigned to a single category_code, which is summarizing many product_ids and category_ids into categories. As you can see in the overview, there are 129 unique category_code.

In [181]:
# %%timeit

sdf_count_per_category_code = sdf_raw.groupBy("category_code").count().orderBy(f.desc("count"))


px.bar(sdf_count_per_category_code.limit(10).toPandas(), x="category_code", y="count", title="Top 10 category_code most interacted with")

The graph is showing that the electronics, especially the smartphones, are the products with the most interactions.

### brand
The brand indicates the brand of a product_id. It is independent of the categories, so that a brand can appear in many category_codes. There are 4303 unique brands in the dataset. Thereby you can see the most popular brands in the following plot:

In [182]:
# %%timeit

sdf_count_per_brand = sdf_raw.groupBy("brand").count().orderBy(f.desc("count"))

px.histogram(sdf_count_per_brand.limit(10).toPandas(), x="brand", y="count", title="Top 10 brands most interacted with")

The most popular brands are samsung, apple, xiaomi and huawei.

### price

In the following table and plot you can see the distribution of the price.

In [183]:
# %%timeit

sdf_raw.describe("price").show()

+-------+-----------------+
|summary|            price|
+-------+-----------------+
|  count|           109740|
|   mean|292.6456165482042|
| stddev|358.8585517029615|
|    min|              0.0|
|    max|          2574.07|
+-------+-----------------+



In [184]:
# %%timeit

sdf_avg_prices = sdf_raw.select("product_id", "price").distinct()
px.box(sdf_avg_prices.toPandas(), y="price", title="Prices for every product")


### user_id

Every user has an unique identifier, the user_id. As you can see in the overview, there are 5316649 unique users.

In [185]:
# %%timeit

sdf_raw.select("user_id").show(5)

+---------+
|  user_id|
+---------+
|547949682|
|548035257|
|514328693|
|530033604|
|521800906|
+---------+
only showing top 5 rows



### user_session

The User_Session can be used to assign different sessions to a user. As you can see in the overview, there are 23016650 unique user_sessions. In the following plot, you can see the ratio of number of events per session.

In [186]:
# %%timeit

# avg actions per session
sdf_cnt_action_per_session = sdf_raw.groupby("user_session").count()
sdf_cnt_action_per_session.describe().show()

+-------+--------------------+-------------------+
|summary|        user_session|              count|
+-------+--------------------+-------------------+
|  count|              108952|             108952|
|   mean|                null| 1.0072325427711286|
| stddev|                null|0.08591987334611226|
|    min|000100ac-223c-43b...|                  1|
|    max|ffffe461-7a97-474...|                  4|
+-------+--------------------+-------------------+



## Feature Enginnering

In the following, the dataset is enriched with additional features that are useful for future analyses.

In [187]:
# %%timeit

# Feature Splitting
# sdf = sdf.withColumn("category_class", f.substring_index(sdf.category_code, '.', 1))

sdf = sdf.withColumn("category_class", f.split(sdf["category_code"], "\.").getItem(0))
sdf = sdf.withColumn("category_sub_class", f.split(sdf["category_code"], "\.").getItem(1))
sdf = sdf.withColumn("category_sub_sub_class", f.split(sdf["category_code"], "\.").getItem(2))

sdf = sdf.withColumn("year", f.year("event_time"))
sdf = sdf.withColumn("month", f.month("event_time"))
sdf = sdf.withColumn("weekofyear", f.weekofyear("event_time"))
sdf = sdf.withColumn("dayofyear", f.dayofyear("event_time"))
sdf = sdf.withColumn("dayofweek", f.dayofweek("event_time"))
sdf = sdf.withColumn("dayofmonth", f.dayofmonth("event_time"))
sdf = sdf.withColumn("hour", f.hour("event_time"))

sdf = sdf.withColumn('turnover', f.when(f.col('event_type') == 'purchase', f.col('price')).otherwise(0))
sdf = sdf.withColumn('bougth_quantity', f.when(f.col('event_type') == 'purchase', f.lit(1)).otherwise(0))
sdf = sdf.withColumn('viewed_quantity', f.when(f.col('event_type') == 'view', f.lit(1)).otherwise(0))
sdf = sdf.withColumn('cart_quantity', f.when(f.col('event_type') == 'cart', f.lit(1)).otherwise(0))
# None Handling
# sdf = sdf.fillna(value="not defined")

sdf_raw= sdf
sdf.createOrReplaceTempView("Data")

### category_class and category_sub_class
The category_code consists of two or three parts in general, which are separeted by a dot. A possible category_code is for example: appliances.kitchen.washer or electronics.smartphone. Because of that the category code can be splited into to the categories: category_class, category_sub_class and category_sub_sub_class.

The category_class is representing the fist part of the category_code. It can be used to summarize many category_codes into an overarching category_class. As you can see in the overview, there are 13 unique category_classes.

In [188]:
# %%timeit

sdf_agg_classes = sdf_raw.groupBy("category_class", "category_sub_class", "category_sub_sub_class").count().na.fill(value="not defined")
sdf_agg_classes = sdf_agg_classes.where(sdf_agg_classes["category_class"] != "not defined").orderBy(f.desc("count"))
sdf_agg_classes.show(5)

+--------------+------------------+----------------------+-----+
|category_class|category_sub_class|category_sub_sub_class|count|
+--------------+------------------+----------------------+-----+
|   electronics|        smartphone|           not defined|27678|
|   electronics|            clocks|           not defined| 3416|
|     computers|          notebook|           not defined| 3385|
|   electronics|             video|                    tv| 3369|
|   electronics|             audio|             headphone| 2929|
+--------------+------------------+----------------------+-----+
only showing top 5 rows



In [189]:
# %%timeit

sdf_count_overview_classes = spark.sql("SELECT COUNT(DISTINCT(category_class)) AS Category_Class, \
                                       COUNT(DISTINCT(category_sub_class)) AS Category_sub_Class, \
                                       COUNT(DISTINCT(category_sub_sub_class)) AS Category_sub_sub_Class \
                                FROM Data")
sdf_count_overview_classes.show()

+--------------+------------------+----------------------+
|Category_Class|Category_sub_Class|Category_sub_sub_Class|
+--------------+------------------+----------------------+
|            13|                56|                    82|
+--------------+------------------+----------------------+



In [190]:
sdf_distinct_category_class=spark.sql("SELECT DISTINCT(category_class) AS Category_Classes \
                                        FROM Data")
sdf_distinct_category_class.show()

+----------------+
|Category_Classes|
+----------------+
|        medicine|
|       computers|
|            auto|
|            null|
|      stationery|
|           sport|
|         apparel|
|      appliances|
|    country_yard|
|       furniture|
|     accessories|
|            kids|
|     electronics|
|    construction|
+----------------+



In [191]:
# %%timeit

px.sunburst(sdf_agg_classes.toPandas(), path=["category_class", "category_sub_class", "category_sub_sub_class"], values="count", title="Category Classes and Subclasses (without not defined data in category_class)")

As you can see in the plot, the category_Class "electronics" with the "smartphone"-subclass is the biggest one.

### Time - Fields

The column “event_time” allows you to create the following columns: year, month, weekofyear, dayofyear, dayofweek, dayofmonth and hour. These columns allow advanced analysis.

- Year: The Year-column contains only the year 2019 since the dataset only covers this year.
- Month: The month-column cotains the values 10, 11 and 12,  which are representing the months october, november and december 2019.DeprecationWarning
- Weekofyear: The weekofyear-column is covering the weeks 40 - 48
- Dayofyear: The dayofyear-column is covering the days 274 - 334
- Dayofweek: The dayofweek-column is covering the values 1-7. These values are representing days sunday(1), monday(2), Tuesday(3), Wednesday(4), Thursday(5), Friday(6) and Saturday(7).
- Dayofmonth: The dayofmonth-column contains the values 1-31, which are reperesenting the day in the corresponding month.
- hour: The hour-column contains the values 0-24, which are representing the hour of interaction.

### Turnover and quantities

In the following table you can see that in some cases a product ID was purchased several times per user session. (Remark: it´s only visible on the original Dataset with 100M records and not on the example dataset with only 100k records)

In [192]:
# %%timeit

sdf_count_session_product = sdf_raw.where(sdf_raw["event_Type"] == 'purchase').groupBy("user_session", "product_id", "event_type").count().orderBy(f.desc("count"))
sdf_count_session_product.show()

+--------------------+----------+----------+-----+
|        user_session|product_id|event_type|count|
+--------------------+----------+----------+-----+
|b847614f-a355-4c5...|   1002524|  purchase|    1|
|b0fe7827-9a84-49c...|  12703498|  purchase|    1|
|e20a14dc-16e3-4c7...|   1801858|  purchase|    1|
|bfb40671-baea-4f6...|  12718603|  purchase|    1|
|5e5c2a3e-2e70-4a4...|   1005273|  purchase|    1|
|3f75337e-7516-450...|   1004856|  purchase|    1|
|9a5b274e-8fa6-4e2...|   1801690|  purchase|    1|
|ad9a265f-9c0d-494...|  12700116|  purchase|    1|
|1cdf8f9c-da95-476...|   1003312|  purchase|    1|
|eb6fbf52-b2a8-4a1...|   3600182|  purchase|    1|
|b010e2d7-03f4-4af...|  26300091|  purchase|    1|
|13516ee1-dca7-470...|   1005161|  purchase|    1|
|f7549efe-51af-45b...|  26400195|  purchase|    1|
|9e4f9c27-8d1c-4f2...|   1004856|  purchase|    1|
|9bc7f9f0-d7e9-498...|   1004767|  purchase|    1|
|68b361a5-e732-459...|   1002544|  purchase|    1|
|4baf2203-6300-4f9...|   220090

From this it can be concluded that a single interaction is created for each product purchased. With the help of this information, the columns "turnover", "bought_quantity", "viewed_quantity" and "cart_quantity" can be created.

- Turnover: The turnover is equivalent to the price, if the event_type is equal to "purchase".
- bought_quantity: The bought_quantity describes the quantity of a product,that had been bought. In the unaggregated form it only contains the values 0 and 1.
- viewed_quantity: The viewed_quantity describes the quantity of a product,that had been viewed. In the unaggregated form it only contains the values 0 and 1.
- cart_quantity: The cart_quantity describes the quantity of a product,that had been put into the cart. In the unaggregated form it only contains the values 0 and 1.

These columns are particularly suitable for aggregated analyses.

In [193]:
# %%timeit

sdf_count_turnover_quantity = sdf_raw.agg(f.sum("turnover"), f.sum("bougth_quantity"), f.sum("viewed_quantity"), f.sum("Cart_quantity"))
sdf_count_turnover_quantity.show()

+------------------+--------------------+--------------------+------------------+
|     sum(turnover)|sum(bougth_quantity)|sum(viewed_quantity)|sum(Cart_quantity)|
+------------------+--------------------+--------------------+------------------+
|501771.14000000036|                1651|              104128|              3961|
+------------------+--------------------+--------------------+------------------+



## Exploration and Analysis

In the following, the fields are analyzed in relation to each other in order to gain further insights.

### Time Distribution

#### Time and Events

Primarily a correlation between time and event is analyzed.

In [194]:
# %%timeit

# aggregated time (weeks, dayofweeks, month)

sdf_time_dist_month = sdf.groupBy("event_type", "dayofmonth").count()
sdf_time_dist_month = sdf_time_dist_month.withColumnRenamed("count", "cnt")
sdf_time_dist_month = sdf_time_dist_month.sort("dayofmonth", "event_type")
sdf_time_dist_week = sdf.groupBy("event_type", "dayofweek").count()
sdf_time_dist_week = sdf_time_dist_week.withColumnRenamed("count", "cnt")
sdf_time_dist_week = sdf_time_dist_week.sort("dayofweek", "event_type")

sdf_time_dist_day = sdf.groupBy("event_type", "hour").count()
sdf_time_dist_day = sdf_time_dist_day.withColumnRenamed("count", "cnt")
sdf_time_dist_day = sdf_time_dist_day.sort("hour", "event_type")

In [195]:
# %%timeit

# Timestamp Distribution (per event_type) over every day of month

df = sdf_time_dist_month.toPandas()

fig = px.bar(df, x = 'dayofmonth', y = 'cnt', color ='event_type', barmode = 'stack')

fig.update_layout(title = "Number of events over a month",
     xaxis_title = 'Day of Month', yaxis_title = 'Number of Events')
fig.update_xaxes(type="category")
fig.write_html("data/exports/Number of events over a month.html")
fig.show()

As you can see in the plot there are the most interactions in the middle of the month.

In [196]:
# %%timeit

# Timestamp Distribution (per event_type) over every day of week

df = sdf_time_dist_week.toPandas()

fig = px.bar(df, x = 'dayofweek', y = 'cnt', color ='event_type', barmode = 'stack')

fig.update_layout(title = "Number of events over a week",
     xaxis_title = 'Day of Week', yaxis_title = 'Number of Events')
fig.update_xaxes(type="category")
fig.write_html("data/exports/Number of events over a week.html")
fig.show()

The plot shows that most interactions take place on a Friday, Saturday or Sunday.

In [197]:
# %%timeit

# Timestamp Distribution (per event_type) over every hour of a day

df = sdf_time_dist_day.toPandas()

fig = px.bar(df, x = 'hour', y = 'cnt', color ='event_type', barmode = 'stack')

fig.update_layout(title = "Number of events over a day",
     xaxis_title = 'Hour of day', yaxis_title = 'Number of Events')
fig.update_xaxes(type="category")
fig.write_html("data/exports/Number of events over a day.html")
fig.show()

You can see that most interactions take place between 3 and 6 p. m.

#### Time and Turnover

In [198]:
# %%timeit


sdf_month_Umsatz = sdf_raw.groupBy("month", "dayofmonth").sum("turnover").orderBy(f.asc("month"), f.asc("dayofmonth"))

In [199]:
# %%timeit

sdf_month_Umsatz = sdf_month_Umsatz.withColumn("month", sdf_month_Umsatz["month"].cast(pyspark.sql.types.StringType()))


In [200]:
# %%timeit

df = sdf_month_Umsatz.toPandas()

fig = px.bar(df, x = "dayofmonth", y = "sum(turnover)", color ='month', barmode = 'stack')

fig.update_layout(title = "Turnover per Month",
     xaxis_title = 'Day of Month', yaxis_title = 'Turnover')
fig.update_xaxes(type="category")
fig.write_html("data/exports/Turnover per Month.html")
fig.show()

The plot shows that in the middle of the month the largest turnover is achieved.

In [201]:
# %%timeit
sdf_week_Umsatz = sdf_raw.groupBy("weekofyear", "dayofweek").sum("turnover").orderBy(f.asc("weekofyear"), f.asc("dayofweek"))
sdf_week_Umsatz = sdf_week_Umsatz.withColumn("weekofyear", sdf_week_Umsatz["weekofyear"].cast(pyspark.sql.types.StringType())).orderBy(f.asc("dayofweek"))
df = sdf_week_Umsatz.toPandas()

fig = px.bar(df, x = "dayofweek", y = "sum(turnover)", color ='weekofyear', barmode = 'stack')

fig.update_layout(title = "Turnover per Week",
     xaxis_title = 'Day of Week', yaxis_title = 'Turnover')
fig.update_xaxes(type="category")
fig.write_html("data/exports/Turnover per Week.html")
fig.show()

The most turnover is also achieved on saturdays and sundays.

In [202]:
# %%timeit
sdf_daytime_turnover = sdf_raw.groupBy("hour").sum("turnover").orderBy(f.asc("hour"))
df = sdf_daytime_turnover.toPandas()

fig = px.bar(df, x = "hour", y = "sum(turnover)", barmode = 'stack')

fig.update_layout(title = "Turnover per Day",
     xaxis_title = 'Hour of day', yaxis_title = 'Turnover')
fig.update_xaxes(type="category")
fig.show()

The plot shows that the biggest turnover is generated at 10 am. 

### Category and products

#### Connection between category_class, category_code, category_id, product_id and brand

The product_id is a subset of the category_id, which is a subset of the category_code. The category_code is in turn a subset of the category_class. (product_id ⊂ category_id ⊂ category_code ⊂ category_class). The brand on the otherhand is cross-class. This knowledge is based on the more detailed analyzes within the file "product_analysis.ipnynb".

In the following plot you can see distribution of the product_id, category_id and category_code within the category_class. It´s possible to access a more detailed view by selecting a special category_class, category_code or category_id.

In [203]:
# %%timeit

sdf_product_per_category = sdf_raw.groupBy("category_id").agg(f.countDistinct("product_id"))

df = sdf_product_per_category.toPandas()
px.box(df, y="count(product_id)", title="Number of products per category_id")

In [204]:
# %%timeit

sdf_agg_brand_category = sdf_raw.groupBy("category_class", "brand", "product_id").count().na.drop()
px.sunburst(sdf_agg_brand_category.toPandas(), path=["category_class", "brand"], values="count", title="Brands per Category_class (without not defined data")

#### Categories and price

The following plots will represent the price distribution within the category_classes, category_codes, category_ids, product_ids and brands.

In [None]:
# %%timeit

px.box(sdf_raw.select("category_class", "price").toPandas(), x="category_class", y="price", title="Price ~ Category_class")

In [None]:
# %%timeit

sdf_price_per_product = sdf_raw.select("product_id", "price").distinct().orderBy(f.desc("price"))

sdf_price_per_product.show(10)

#### Category and event-type
The following plots will represent the event_type distribution within the category_classes, category_codes, category_ids, product_ids and brands.

In [None]:
# %%timeit

sdf_category_class_event_distribution = sdf_raw.groupBy("category_class", "event_type").count().na.drop()
px.bar(sdf_category_class_event_distribution.toPandas(), x="category_class", y="count", color="event_type", barmode="group")

#### Category and turnover

The following plots will give you an overview over the turnover, the quanities viewed, added to cart and purchased of the most popular brands, category_codes and category_classes.

In [None]:
# %%timeit

sdf_brand_overview = sdf_raw.groupBy("brand").agg(f.sum("turnover"), f.avg("price"), f.sum("viewed_quantity"), f.sum("cart_quantity"), f.sum("bougth_quantity")).orderBy(f.desc("sum(turnover)"))
sdf_brand_overview = sdf_brand_overview.withColumn("sum(turnover)", f.round(sdf_brand_overview["sum(turnover)"], 2))
sdf_brand_overview = sdf_brand_overview.withColumn("avg(price)", f.round(sdf_brand_overview["avg(price)"], 2))

sdf_brand_overview.show(10)


In [None]:
# %%timeit

sdf_category_code_overview = sdf_raw.groupBy("category_code").agg(f.sum("turnover"), f.avg("price"), f.sum("viewed_quantity"), f.sum("cart_quantity"), f.sum("bougth_quantity")).orderBy(f.desc("sum(turnover)"))
sdf_category_code_overview = sdf_category_code_overview.withColumn("sum(turnover)", f.round(sdf_category_code_overview["sum(turnover)"], 2))
sdf_category_code_overview = sdf_category_code_overview.withColumn("avg(price)", f.round(sdf_category_code_overview["avg(price)"], 2))

sdf_category_code_overview.show(10)



In [None]:
# %%timeit

sdf_category_class_overview = sdf_raw.groupBy("category_class").agg(f.sum("turnover"), f.avg("price"), f.sum("viewed_quantity"), f.sum("cart_quantity"), f.sum("bougth_quantity")).orderBy(f.desc("sum(turnover)"))
sdf_category_class_overview = sdf_category_class_overview.withColumn("sum(turnover)", f.round(sdf_category_class_overview["sum(turnover)"], 2))
sdf_category_class_overview = sdf_category_class_overview.withColumn("avg(price)", f.round(sdf_category_class_overview["avg(price)"], 2))

sdf_category_class_overview.show(10)


### User analysis

In the following the users and their sessions will be analysed.


#### User and Time

In [None]:
# %%timeit

# creating spark dataframe for unique number of users each month

sdf_user_id_by_month = sdf.select("user_id", "month").distinct().groupBy("month").count()

In [None]:
# %%timeit

# Data Prep for user graph

pdf_usr_id_by_mnth = sdf_user_id_by_month.toPandas()
pdf_usr_id_by_mnth = pdf_usr_id_by_mnth[pdf_usr_id_by_mnth.month != 12]
pdf_usr_id_by_mnth.month = pdf_usr_id_by_mnth.month.astype(str)
px.bar(pdf_usr_id_by_mnth, x="month", y="count", title="Unique users each month")

In October 3.02 million people visited the site and in November the total user count grew to 3.696 million. Which nets
a difference of around 700,000 users.

In [None]:
# %%timeit

sdf_user_id_both_months = sdf.select("user_id", "month").distinct().groupby("user_id").sum()

cnt = sdf_user_id_both_months.where(sdf_user_id_both_months["sum(month)"] == 21).count()
print("Amount of users that visited the site in both months:", cnt)

Not all users in October and November are to be expected to have only visited in the respective month.
A total of 1,400,979 users visited the page in both October and November.

#### User and User_Session

Primarily it will be analysed how many sessions a user have on average.

In [None]:
# %%timeit

sdf_agg_user = sdf_raw.select("user_id", "user_session").distinct().groupBy("user_id").count()
sdf_agg_user.select("count").describe().show()

In [132]:
# %%timeit

print("Statistical distribution of Sessions per User in October")
sdf_agg_user_mnth = sdf_raw.select("user_id", "user_session", "month").distinct().groupBy("user_id", "month").count()
sdf_agg_user_mnth.where(sdf_agg_user_mnth.month == 10).select("count").describe().show()

print("Statistical distribution of Sessions per User in November")
sdf_agg_user_mnth = sdf_raw.select("user_id", "user_session", "month").distinct().groupBy("user_id", "month").count()
sdf_agg_user_mnth.where(sdf_agg_user_mnth.month == 11).select("count").describe().show()

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             63978|
|   mean|1.0474538122479602|
| stddev|0.2788289791365775|
|    min|                 1|
|    max|                31|
+-------+------------------+



In [133]:
# %%timeit

sdf_amount_interactions_per_sess = sdf_raw.select("user_session", "event_type").groupBy("user_session", "event_type").count()
# sdf_amount_interactions_per_sess.show()
print("Average amount of views per session:")
sdf_amount_interactions_per_sess.where(sdf_amount_interactions_per_sess.event_type == "view").agg({"count":"mean"}).show()
print("Average amount of 'Add to Cart' per session:")
sdf_amount_interactions_per_sess.where(sdf_amount_interactions_per_sess.event_type == "cart").agg({"count":"mean"}).show()
print("Average amount of purchases per session:")
sdf_amount_interactions_per_sess.where(sdf_amount_interactions_per_sess.event_type == "purchase").agg({"count":"mean"}).show()

Average amount of views per session:
+------------------+
|        avg(count)|
+------------------+
|1.0070795775465202|
+------------------+

Average amount of 'Add to Cart' per session:
+------------------+
|        avg(count)|
+------------------+
|1.0022773279352226|
+------------------+

Average amount of purchases per session:
+------------------+
|        avg(count)|
+------------------+
|1.0006060606060605|
+------------------+



#### User and product
Now it will be analysed how many products a user buy / view / put in cart on average.

In [134]:
# %%timeit

sdf_amount_events = sdf_raw.select("user_id", "event_type").groupBy("event_type").count()
amount_usr = sdf_raw.select("user_id").distinct().count()
pdf_avrg_events_usr = sdf_amount_events.toPandas()

pdf_avrg_events_usr["count"] = pdf_avrg_events_usr["count"].div(amount_usr)
print(pdf_avrg_events_usr)

  event_type     count
0   purchase  0.016089
1       view  1.014744
2       cart  0.038601


A user views on average 19,6 item, puts 0,7 items in their cart and buys 0,3 items.


In [135]:
# %%timeit

sdf_agg_user_type = sdf_raw.select("user_id", "event_type").groupBy("user_id").count()
most_active_user = sdf_agg_user_type.sort("count", ascending=False).take(1)

print(f"The most active user has the ID {most_active_user[0][0]} with a total of {most_active_user[0][1]} interactions.")

The most active user has the ID 568778435 with a total of 31 interactions.


#### Avrg Sessions per week day

In [136]:
# %%timeit

# User session distribution over every day of week

sdf_usr_ses_dist_week = sdf_raw.select("user_session", "dayofyear", "dayofweek").groupBy("dayofweek").count()
pdf_usr_sess_time_dist = sdf_usr_ses_dist_week.toPandas()
fig_usr_time = px.box(pdf_usr_sess_time_dist, x="dayofweek", y="count", title="Average Sessions per Weekday")
fig_usr_time.update_layout(
    xaxis_title="Day of Week"
)

The plot is indicating that the most user_sessions take place at the weekend.

### Unique sessions each month

In [137]:
# %%timeit

sdf_ses_by_mnth = sdf_raw.select("user_session", "month").distinct().groupby("month").count()

pdf_usr_ses_by_mnth = sdf_ses_by_mnth.toPandas()
pdf_usr_ses_by_mnth = pdf_usr_ses_by_mnth[pdf_usr_ses_by_mnth.month != 12]
pdf_usr_ses_by_mnth.month = pdf_usr_id_by_mnth.month.astype(str)
px.bar(pdf_usr_id_by_mnth, x="month", y="count", title="Unique sessions each month")

#### User and turnover

In the following plot, you can see an overview over the unique users and their turnover as well as their quantitities viewed, added to cart and purchased.

In [138]:
# %%timeit

sdf_user_overview = sdf_raw.groupBy("user_id").agg(f.sum("turnover"), f.count("user_session"), f.sum("viewed_quantity"), f.sum("cart_quantity"), f.sum("bougth_quantity")).orderBy(f.desc("sum(turnover)"))
sdf_user_overview.show(10)

+---------+-------------+-------------------+--------------------+------------------+--------------------+
|  user_id|sum(turnover)|count(user_session)|sum(viewed_quantity)|sum(cart_quantity)|sum(bougth_quantity)|
+---------+-------------+-------------------+--------------------+------------------+--------------------+
|559974229|      2496.59|                  2|                   1|                 0|                   1|
|520264548|      2275.71|                  3|                   1|                 0|                   2|
|554258016|      2053.57|                  1|                   0|                 0|                   1|
|566661896|      2003.91|                  1|                   0|                 0|                   1|
|521644901|      1747.79|                  1|                   0|                 0|                   1|
|515819101|      1741.04|                  1|                   0|                 0|                   1|
|518619273|      1735.58|            

### session analysis



In [139]:
# %%timeit

sdf_session = sdf.select("user_id", "user_session", "event_type", "product_id", "price", "event_time") #.orderBy("user_id", "user_session")

In [140]:
# %%timeit

sdf_session = sdf_session.withColumn("views", f.when(sdf_session.event_type == "view", 1).otherwise(0))
sdf_session = sdf_session.withColumn("purchases", f.when(sdf_session.event_type == "purchase", 1).otherwise(0))
sdf_session = sdf_session.withColumn("carts", f.when(sdf_session.event_type == "cart", 1).otherwise(0))
sdf_session = sdf_session.withColumn("turnover", f.when(sdf_session.event_type == "purchase", sdf_session["price"]).otherwise(0))

sdf_session = sdf_session.withColumn("first_event", sdf_session.event_time)
sdf_session = sdf_session.withColumn("last_event", sdf_session.event_time)

In [141]:
# %%timeit

sdf_session_agg = sdf_session.groupBy("user_id", "user_session").agg(f.sum("turnover"), f.sum("views"), f.sum("purchases"), f.sum("carts"), f.min("event_time"), f.max("event_time"))
sdf_session_agg = sdf_session_agg.withColumn("duration", (sdf_session_agg["max(event_time)"] - sdf_session_agg["min(event_time)"]))
sdf_session_agg = sdf_session_agg.withColumn("sum(events)", (sdf_session_agg["sum(views)"] + sdf_session_agg["sum(purchases)"] + sdf_session_agg["sum(carts)"]))
sdf_session_agg = sdf_session_agg.withColumn("turnover", f.when(sdf_session_agg["sum(purchases)"] > 0, (sdf_session_agg["sum(purchases)"] *  sdf_session_agg["sum(turnover)"])).otherwise(0))

sdf_session_agg = sdf_session_agg.withColumn("successfull", f.when(sdf_session_agg["sum(purchases)"] > 0, 1).otherwise(0))

In [142]:
# %%timeit

sdf_session_agg.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- user_session: string (nullable = true)
 |-- sum(turnover): double (nullable = true)
 |-- sum(views): long (nullable = true)
 |-- sum(purchases): long (nullable = true)
 |-- sum(carts): long (nullable = true)
 |-- min(event_time): timestamp (nullable = true)
 |-- max(event_time): timestamp (nullable = true)
 |-- duration: interval (nullable = true)
 |-- sum(events): long (nullable = true)
 |-- turnover: double (nullable = true)
 |-- successfull: integer (nullable = false)



In the following plot, you can see an overview over the unique user_sessions with their correspaonding users. It´s possible to analyse their turnover as well as their quantitities viewed, added to cart and purchased.

In [143]:
# %%timeit

sdf_session_agg.show(5)

+---------+--------------------+-------------+----------+--------------+----------+-------------------+-------------------+---------+-----------+--------+-----------+
|  user_id|        user_session|sum(turnover)|sum(views)|sum(purchases)|sum(carts)|    min(event_time)|    max(event_time)| duration|sum(events)|turnover|successfull|
+---------+--------------------+-------------+----------+--------------+----------+-------------------+-------------------+---------+-----------+--------+-----------+
|512610557|e1259a61-6e0f-4e1...|          0.0|         1|             0|         0|2019-10-01 09:04:45|2019-10-01 09:04:45|0 seconds|          1|     0.0|          0|
|513768075|e01242f0-9d8f-439...|          0.0|         1|             0|         0|2019-10-01 09:25:38|2019-10-01 09:25:38|0 seconds|          1|     0.0|          0|
|554263846|40c55834-1655-466...|          0.0|         1|             0|         0|2019-10-01 17:58:12|2019-10-01 17:58:12|0 seconds|          1|     0.0|          0

### Customer Profiles

In preparation for clustering a customer profile will be created:

- customer_id
- number_of_view_events
- number_of_cart_events
- number_of_purchase_events
- total_turnover
- number_of_bought_items (resolve multiple purchasing events for quantity)
- avg_sold_cart
- avg_session_time
- avg_actions_per_session



In [144]:
# %%timeit

# time period simply dependend on input data
sdf_customer_profile = sdf_session_agg.groupBy("user_id").agg(f.sum("sum(events)"), f.sum("sum(views)"), f.sum("sum(purchases)"), f.sum("sum(carts)"), f.sum("turnover"), f.count("user_session"), f.sum("successfull"))

sdf_customer_profile = sdf_customer_profile.withColumn("avg_turnover_per_session", (sdf_customer_profile["sum(turnover)"] / sdf_customer_profile["count(user_session)"]))
sdf_customer_profile = sdf_customer_profile.withColumn("avg_events_per_session", (sdf_customer_profile["sum(sum(events))"] / sdf_customer_profile["count(user_session)"]))

In [145]:
# %%timeit

sdf_customer_profile.show(5)

+---------+----------------+---------------+-------------------+---------------+-------------+-------------------+----------------+------------------------+----------------------+
|  user_id|sum(sum(events))|sum(sum(views))|sum(sum(purchases))|sum(sum(carts))|sum(turnover)|count(user_session)|sum(successfull)|avg_turnover_per_session|avg_events_per_session|
+---------+----------------+---------------+-------------------+---------------+-------------+-------------------+----------------+------------------------+----------------------+
|517623472|               1|              1|                  0|              0|          0.0|                  1|               0|                     0.0|                   1.0|
|570387312|               1|              1|                  0|              0|          0.0|                  1|               0|                     0.0|                   1.0|
|517960968|               1|              1|                  0|              0|          0.0|      

### Correlation matrix

In the following different correlation matrices are created. These are used to identify correlations between certain attributes with the turnover and the quantities viewed, added to cart and purchased.

#### Daytime - Correlation Matrix
The following plot is looking for a correlation between the daytime and the decision variables. Therefore the daytime will be divided into the characteristics "morning", "afternoon", "evening" and "night". 

In [146]:
# %%timeit

sdf_corr_time = sdf.select("event_time", "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity")
sdf_corr_time = sdf_corr_time.withColumn("hour", f.hour("event_time"))

In [147]:
# %%timeit

sdf_corr_time = sdf_corr_time.withColumn('Morning', f.when((f.col('hour')>=6) & (f.col('hour')<12), f.lit(1)).otherwise(0))
sdf_corr_time = sdf_corr_time.withColumn('Afternoon', f.when((f.col('hour')>=12) & (f.col('hour')<18), f.lit(1)).otherwise(0))
sdf_corr_time = sdf_corr_time.withColumn('Evening', f.when(f.col('hour') > 18 , f.lit(1)).otherwise(0))
sdf_corr_time = sdf_corr_time.withColumn('Night', f.when(f.col('hour') < 6, f.lit(1)).otherwise(0))

In [148]:
# %%timeit

sdf_corr_time = sdf_corr_time.select("Morning", "Afternoon", "Evening", "Night", "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity")

In [149]:
# %%timeit

sdf_corr_time.toPandas().corr().style.background_gradient(cmap='bwr')

Unnamed: 0,Morning,Afternoon,Evening,Night,turnover,bougth_quantity,viewed_quantity,cart_quantity
Morning,1.0,-0.508147,-0.274731,-0.252492,0.015152,0.024416,-0.034028,0.024252
Afternoon,-0.508147,1.0,-0.297016,-0.272973,-0.00018,0.001971,0.003361,-0.005256
Evening,-0.274731,-0.297016,1.0,-0.147583,-0.011522,-0.01939,0.023661,-0.015289
Night,-0.252492,-0.272973,-0.147583,1.0,-0.005302,-0.012047,0.004628,0.002397
turnover,0.015152,-0.00018,-0.011522,-0.005302,1.0,0.660496,-0.351623,-0.015796
bougth_quantity,0.024416,0.001971,-0.01939,-0.012047,0.660496,1.0,-0.532363,-0.023916
viewed_quantity,-0.034028,0.003361,0.023661,0.004628,-0.351623,-0.532363,1.0,-0.833542
cart_quantity,0.024252,-0.005256,-0.015289,0.002397,-0.015796,-0.023916,-0.833542,1.0


The plot is indicating that products are preferred viewed in the evening and added to cart in the morning. At last they are also preferred bought in the morning, wherefore the generated turnover is biggest in the morning.

#### Weekday - Correlation Matrix
The following plot is looking for a correlation between the weekdays and the decision variables.

In [150]:
# %%timeit

sdf_corr_dayofweek = sdf.select("dayofweek", "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity")

In [151]:
# %%timeit

# One-hot-encoding
sdf_corr_dayofweek = sdf_corr_dayofweek.withColumn('Sunday', f.when(f.col('dayofweek') == '1', f.lit(1)).otherwise(0))
sdf_corr_dayofweek = sdf_corr_dayofweek.withColumn('Monday', f.when(f.col('dayofweek') == '2', f.lit(1)).otherwise(0))
sdf_corr_dayofweek = sdf_corr_dayofweek.withColumn('Tuesday', f.when(f.col('dayofweek') == '3', f.lit(1)).otherwise(0))
sdf_corr_dayofweek = sdf_corr_dayofweek.withColumn('Wednesday', f.when(f.col('dayofweek') == '4', f.lit(1)).otherwise(0))
sdf_corr_dayofweek = sdf_corr_dayofweek.withColumn('Thursday', f.when(f.col('dayofweek') == '5', f.lit(1)).otherwise(0))
sdf_corr_dayofweek = sdf_corr_dayofweek.withColumn('Friday', f.when(f.col('dayofweek') == '6', f.lit(1)).otherwise(0))
sdf_corr_dayofweek = sdf_corr_dayofweek.withColumn('Saturday', f.when(f.col('dayofweek') == '7', f.lit(1)).otherwise(0))

In [152]:
# %%timeit

sdf_corr_dayofweek = sdf_corr_dayofweek.select("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",  "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity" )

In [153]:
# %%timeit

sdf_corr_dayofweek.toPandas().corr().style.background_gradient(cmap='bwr')

Unnamed: 0,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,turnover,bougth_quantity,viewed_quantity,cart_quantity
Monday,1.0,-0.137659,-0.137936,-0.143665,-0.166566,-0.166155,-0.157643,0.000138,-0.004349,0.013587,-0.013208
Tuesday,-0.137659,1.0,-0.141995,-0.147892,-0.171467,-0.171044,-0.162282,-0.00173,0.001834,0.010861,-0.014023
Wednesday,-0.137936,-0.141995,1.0,-0.148189,-0.171812,-0.171388,-0.162608,0.00092,0.001441,0.013407,-0.016773
Thursday,-0.143665,-0.147892,-0.148189,1.0,-0.178947,-0.178506,-0.169361,-0.003699,-0.002365,0.009039,-0.009131
Friday,-0.166566,-0.171467,-0.171812,-0.178947,1.0,-0.206962,-0.196359,-0.008164,-0.010454,-0.017735,0.027767
Saturday,-0.166155,-0.171044,-0.171388,-0.178506,-0.206962,1.0,-0.195875,-0.004134,-0.005073,-0.005345,0.009623
Sunday,-0.157643,-0.162282,-0.162608,-0.169361,-0.196359,-0.195875,1.0,0.016828,0.019204,-0.018606,0.00944
turnover,0.000138,-0.00173,0.00092,-0.003699,-0.008164,-0.004134,0.016828,1.0,0.660496,-0.351623,-0.015796
bougth_quantity,-0.004349,0.001834,0.001441,-0.002365,-0.010454,-0.005073,0.019204,0.660496,1.0,-0.532363,-0.023916
viewed_quantity,0.013587,0.010861,0.013407,0.009039,-0.017735,-0.005345,-0.018606,-0.351623,-0.532363,1.0,-0.833542


The plot is showing that product are prefered viewed at the the beginning of the week and added to the cart at the end of the week. At last they are bought between sunday and wednesday.

#### Category Class - Correlation Matrix
The following plot is looking for a correlation between the category class and the decision variables. Therefore the category classes ... will be one-hot-encoded.

In [154]:
# %%timeit

sdf_corr_category_class = sdf.select("category_class", "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity")

In [155]:
# One-hot-encoding
sdf_corr_category_class = sdf_corr_category_class.withColumn('computers', f.when(f.col('category_class') == 'computers', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('auto', f.when(f.col('category_class') == 'auto', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('apparel', f.when(f.col('category_class') == 'apparel', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('appliances', f.when(f.col('category_class') == 'appliances', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('furniture', f.when(f.col('category_class') == 'furniture', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('accessories', f.when(f.col('category_class') == 'accessories', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('electronics', f.when(f.col('category_class') == 'electronics', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('construction', f.when(f.col('category_class') == 'construction', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('medicine', f.when(f.col('category_class') == 'medicine', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('stationery', f.when(f.col('category_class') == 'stationery', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('sport', f.when(f.col('category_class') == 'sport', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('country_yard', f.when(f.col('category_class') == 'country_yard', f.lit(1)).otherwise(0))
sdf_corr_category_class = sdf_corr_category_class.withColumn('kids', f.when(f.col('category_class') == 'kids', f.lit(1)).otherwise(0))
sdf_corr_category_class.dropna()

DataFrame[category_class: string, turnover: double, bougth_quantity: int, viewed_quantity: int, cart_quantity: int, computers: int, auto: int, apparel: int, appliances: int, furniture: int, accessories: int, electronics: int, construction: int, medicine: int, stationery: int, sport: int, country_yard: int, kids: int]

In [156]:
# %%timeit

sdf_corr_category_class = sdf_corr_category_class.select("computers", "auto", "apparel", "appliances", "furniture", "accessories", "electronics", "construction", "medicine", "stationery","sport","country_yard", "kids", "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity" )

In [157]:
# %%timeit

sdf_corr_category_class.toPandas().corr().style.background_gradient(cmap='bwr')

Unnamed: 0,computers,auto,apparel,appliances,furniture,accessories,electronics,construction,medicine,stationery,sport,country_yard,kids,turnover,bougth_quantity,viewed_quantity,cart_quantity
computers,1.0,-0.036114,-0.051808,-0.093916,-0.045042,-0.018666,-0.190337,-0.032244,-0.005034,-0.003219,-0.015782,-0.004739,-0.027443,-0.004311,-0.012062,0.019324,-0.014949
auto,-0.036114,1.0,-0.02962,-0.053694,-0.025752,-0.010672,-0.108821,-0.018435,-0.002878,-0.00184,-0.009023,-0.002709,-0.01569,-0.008989,-0.007656,0.012203,-0.009414
apparel,-0.051808,-0.02962,1.0,-0.077028,-0.036943,-0.01531,-0.156111,-0.026446,-0.004128,-0.00264,-0.012944,-0.003887,-0.022508,-0.015074,-0.015633,0.029863,-0.025065
appliances,-0.093916,-0.053694,-0.077028,1.0,-0.066969,-0.027753,-0.282994,-0.047941,-0.007484,-0.004786,-0.023464,-0.007046,-0.040803,-0.014637,-0.005545,0.009287,-0.007349
furniture,-0.045042,-0.025752,-0.036943,-0.066969,1.0,-0.01331,-0.135724,-0.022992,-0.003589,-0.002295,-0.011254,-0.003379,-0.019569,-0.008979,-0.01094,0.024691,-0.022019
accessories,-0.018666,-0.010672,-0.01531,-0.027753,-0.01331,1.0,-0.056246,-0.009528,-0.001487,-0.000951,-0.004664,-0.0014,-0.00811,-0.00584,-0.008165,0.013882,-0.011065
electronics,-0.190337,-0.108821,-0.156111,-0.282994,-0.135724,-0.056246,1.0,-0.09716,-0.015167,-0.0097,-0.047554,-0.014279,-0.082693,0.06819,0.048407,-0.088228,0.072604
construction,-0.032244,-0.018435,-0.026446,-0.047941,-0.022992,-0.009528,-0.09716,1.0,-0.002569,-0.001643,-0.008056,-0.002419,-0.014009,-0.008617,-0.010518,0.010447,-0.005473
medicine,-0.005034,-0.002878,-0.004128,-0.007484,-0.003589,-0.001487,-0.015167,-0.002569,1.0,-0.000257,-0.001258,-0.000378,-0.002187,-0.001635,-0.002475,-0.00155,0.003445
stationery,-0.003219,-0.00184,-0.00264,-0.004786,-0.002295,-0.000951,-0.0097,-0.001643,-0.000257,1.0,-0.000804,-0.000241,-0.001399,-0.001046,-0.001583,-0.000257,0.001336


The plot is indicating that the category_class "electronics" is generating the biggest turnover with the biggest probability to be bought. Nevertheless the categories "medicine" and "stationary" also tend to be added to the cart. The rest only tends to be viewed

#### Month - Correlation Matrix
The following plot is looking for a correlation between the month and the decision variables. Therefore the month will be divided into the characteristics "Beginningofmonth", "Middleofmonth" and "Endofmonth". 

In [158]:
# %%timeit

sdf_corr_month = sdf_raw.select("dayofmonth", "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity")

In [159]:
# %%timeit

# One-hot-encoding
sdf_corr_month = sdf_corr_month.withColumn('Beginningofmonth', f.when(f.col('dayofmonth')<10, f.lit(1)).otherwise(0))
sdf_corr_month = sdf_corr_month.withColumn('Middleofmonth', f.when((f.col('dayofmonth')>=10) & (f.col('dayofmonth')<20), f.lit(1)).otherwise(0))
sdf_corr_month = sdf_corr_month.withColumn('Endofmonth', f.when(f.col('dayofmonth') > 20 , f.lit(1)).otherwise(0))

In [160]:
# %%timeit

sdf_corr_month = sdf_corr_month.select("Beginningofmonth", "Middleofmonth", "Endofmonth", "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity" )

In [161]:
# %%timeit

sdf_corr_month.toPandas().corr().style.background_gradient(cmap='bwr')

Unnamed: 0,Beginningofmonth,Middleofmonth,Endofmonth,turnover,bougth_quantity,viewed_quantity,cart_quantity
Beginningofmonth,1.0,-0.504939,-0.355707,-0.001025,-0.001675,0.045324,-0.052433
Middleofmonth,-0.504939,1.0,-0.560632,-0.003629,-0.00492,-0.04247,0.053368
Endofmonth,-0.355707,-0.560632,1.0,0.005799,0.006408,0.00302,-0.007748
turnover,-0.001025,-0.003629,0.005799,1.0,0.660496,-0.351623,-0.015796
bougth_quantity,-0.001675,-0.00492,0.006408,0.660496,1.0,-0.532363,-0.023916
viewed_quantity,0.045324,-0.04247,0.00302,-0.351623,-0.532363,1.0,-0.833542
cart_quantity,-0.052433,0.053368,-0.007748,-0.015796,-0.023916,-0.833542,1.0


The plot is indicating that products are rather bought at the end of the month as at the beginning or in the middle. But therefore the products are rather viewed at the beginning of the month and rather added to the cart in the middle of the month. 

#### Price - Correlation Matrix
The following plot is looking for a correlation between the price and the decision variables.

In [162]:
# %%timeit

sdf_corr_price = sdf_raw.select("price", "turnover", "bougth_quantity", "viewed_quantity", "cart_quantity")
sdf_corr_price.toPandas().corr().style.background_gradient(cmap='bwr')



Unnamed: 0,price,turnover,bougth_quantity,viewed_quantity,cart_quantity
price,1.0,0.090558,0.003883,-0.004626,0.00293
turnover,0.090558,1.0,0.660496,-0.351623,-0.015796
bougth_quantity,0.003883,0.660496,1.0,-0.532363,-0.023916
viewed_quantity,-0.004626,-0.351623,-0.532363,1.0,-0.833542
cart_quantity,0.00293,-0.015796,-0.023916,-0.833542,1.0


The plot is indicating that more expensive products are rather bought than viewed, while cheaper products are vice versa.

## Summary of results

In the previous analysis, a lot of insights have been gathered. These are briefly summarised in the following.

### General information
- ~100M Events in 2 Months (October and November 2019)
- ~200K Products, 13 Category Classes, ~4300 Brands, ~290 average Price (0-2574)
- ~5M Users, ~23M User Sessions
- Events: ~95% Views, ~3.5% Add_to_cart, 1.5% Purchase
- ~500M Turnover, ~104M Views, ~4M Add_to_Carts, ,~1.6M Purchases

### TOP3

Most interactions:
- Product_ID: 1004856, 1005115, 1004767
- Category_ID: 2053013555631882655, 2053013553559896355, 2053013558920217191
- category_codes: Electronices.smartphone, electronice.clocks, electronics.video.tv
- brands: samsung, apple, xiaomi
- user: 568778435 

Most turnover:
- brand: apple, samsung, xiaomi
- category_code: electronics.smartphone, electronics.video.tv, computers.notebook
- category_class: electronics, appliances, computers

### User and Session Analysis
- average number of sessions per user: 4.3; Max: 22542
- average number of events per user: 19.6 Views, 0.7 Add_to_Cart, 0.3 Purchases
- average number of events per session: 4.5 Views, 1.7 Add_to_Cart, 1.18 Purchases
- most user sessions take place at the weekend (Friday-Sunday)

### Event- and Turnover-Distribution
Most interactions:
- in the middle of the month (15.-17.)
- at the weekend (friday-sunday)
- between 3 and 6 pm

Most turnover: 
- in the middle of the month (16.-17.)
- at the weekend (saturday, sunday)
- at 10 am

### Correlations
- Most turnover: in the morning, at the end of the month, on sunday/monday, high price, electronics
- Most Views: in the evening, at the beggining of the month, Monday-Thursday, low price, all categories except electronics/stationery/medicine
- Most Add_to_cart: in the morning, in the middle of the month, Friday-Sunday, high price, electronics/stationery/medicine
- Most Purchase: in the morning, at the end of the month, sunday/tuesday, high price, electronics