# Reference 
For a better understanding of the code, please refer to the following notebook:
- `./Lab/Code/practice/RFM-Koggle.ipynb`

In [1]:
import os 
import findspark 
findspark.init()

# for sql
from pyspark.sql import SparkSession 
from pyspark.sql.functions import col, max as spark_max, count, sum as spark_sum, datediff, lit, min as spark_min
from pyspark.sql.types import IntegerType

# for time 
import time 


In [2]:
# can be changed to *.csv
# root = '../../../Data/eCommerce-behavior-data/2019-Nov.csv'
root = '../../data/only_purchases_1day.csv'
spark = SparkSession.builder.appName('eCommerce').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/10 02:10:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# job id 1
ecommerce = spark.read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv(root)

In [4]:
ecommerce.explain()

== Physical Plan ==
FileScan csv [event_time#17,event_type#18,product_id#19,category_id#20L,category_code#21,brand#22,price#23,user_id#24,user_session#25] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/shannon/Library/CloudStorage/OneDrive-國立臺灣科技大學/NTUST/Germa..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<event_time:timestamp,event_type:string,product_id:int,category_id:bigint,category_code:str...




In [5]:
ecommerce.createOrReplaceTempView('ecommerce_2019_oct') 

In [6]:
ecommerce.printSchema()

root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- category_id: long (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: double (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- user_session: string (nullable = true)



In [7]:
ecommerce.show(5) # job id 2 

+-------------------+----------+----------+-------------------+--------------------+------+------+---------+--------------------+
|         event_time|event_type|product_id|        category_id|       category_code| brand| price|  user_id|        user_session|
+-------------------+----------+----------+-------------------+--------------------+------+------+---------+--------------------+
|2019-11-01 01:00:00|      view|   1003461|2053013555631882655|electronics.smart...|xiaomi|489.07|520088904|4d3b30da-a5e4-49d...|
|2019-11-01 01:00:00|      view|   5000088|2053013566100866035|appliances.sewing...|janome|293.65|530496790|8e5f4f83-366c-4f7...|
|2019-11-01 01:00:01|      view|  17302664|2053013553853497655|                NULL| creed| 28.31|561587266|755422e7-9040-477...|
|2019-11-01 01:00:01|      view|   3601530|2053013563810775923|appliances.kitche...|    lg|712.87|518085591|3bfb58cd-7892-48c...|
|2019-11-01 01:00:01|      view|   1004775|2053013555631882655|electronics.smart...|xiaomi

## Select Only Purchase Events

In [8]:
ecommerce = ecommerce.filter(col("event_type") == "purchase")
ecommerce.show(5) # job id 3

+-------------------+----------+----------+-------------------+--------------------+-------+------+---------+--------------------+
|         event_time|event_type|product_id|        category_id|       category_code|  brand| price|  user_id|        user_session|
+-------------------+----------+----------+-------------------+--------------------+-------+------+---------+--------------------+
|2019-11-01 01:00:41|  purchase|  13200605|2053013557192163841|furniture.bedroom...|   NULL| 566.3|559368633|d6034fa2-41fb-4ac...|
|2019-11-01 01:01:04|  purchase|   1005161|2053013555631882655|electronics.smart...| xiaomi|211.92|513351129|e6b7ce9b-1938-4e2...|
|2019-11-01 01:04:51|  purchase|   1004856|2053013555631882655|electronics.smart...|samsung|128.42|562958505|0f039697-fedc-40f...|
|2019-11-01 01:05:34|  purchase|  26401669|2053013563651392361|                NULL|lucente|109.66|541854711|c41c44d5-ef9b-41b...|
|2019-11-01 01:06:33|  purchase|   1801881|2053013554415534427|electronics.video.tv

## Aggregate user_session
- A user may have made multiple purchases in the same session, so the purchases are aggregated per session.

In [9]:
result = ecommerce.groupBy('user_session').agg(
    spark_max('event_time').alias('Date_order'),  # alias renames the resulting column
    spark_max('user_id').alias('user_id'), 
    count('user_session').alias('order_count'), 
    spark_sum('price').alias('price')  
)
result.show(5) # Job id 4 = groupBy, Job id 5 = show 



+--------------------+-------------------+---------+-----------+------+
|        user_session|         Date_order|  user_id|order_count| price|
+--------------------+-------------------+---------+-----------+------+
|8693715b-2f32-462...|2019-11-01 02:42:46|558439221|          1|  66.9|
|ad3086a0-22ad-4da...|2019-11-01 02:53:29|521374214|          1|185.85|
|5b0a2af3-293f-4bd...|2019-11-01 03:16:26|547622783|          1|206.96|
|3e5e22a7-ffd8-402...|2019-11-01 04:13:37|516883277|          1|230.93|
|279f59b0-d73f-445...|2019-11-01 04:38:18|561621729|          1|458.28|
+--------------------+-------------------+---------+-----------+------+
only showing top 5 rows
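To make the aggregation above concrete, here is a minimal plain-Python sketch of what the `groupBy('user_session').agg(...)` call computes. The three events are hypothetical sample data, not rows from the dataset, and `len(rows)` matches `count('user_session')` only under the assumption that no session ID is null.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical purchase events: (user_session, event_time, user_id, price).
events = [
    ("s1", datetime(2019, 11, 1, 1, 0, 41), 101, 10.0),
    ("s1", datetime(2019, 11, 1, 1, 5, 0), 101, 5.5),
    ("s2", datetime(2019, 11, 1, 2, 0, 0), 202, 20.0),
]

# Group rows by session, mirroring groupBy('user_session').
by_session = defaultdict(list)
for session, event_time, user_id, price in events:
    by_session[session].append((event_time, user_id, price))

# One output row per session, mirroring the agg(...) call.
orders = {
    session: {
        "Date_order": max(t for t, _, _ in rows),   # latest event in the session
        "user_id": max(u for _, u, _ in rows),      # a session belongs to one user
        "order_count": len(rows),                   # number of purchased items
        "price": sum(p for _, _, p in rows),        # total spent in the session
    }
    for session, rows in by_session.items()
}

print(orders["s1"]["order_count"], orders["s1"]["price"])  # 2 15.5
```

Session `s1` contains two purchases, so it collapses into one order with `order_count` 2 and a summed price.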



                                                                                

# **RFM Analysis**
RFM is a method used for analyzing customer value. It is commonly used in database marketing and direct marketing and has received particular attention in retail and professional services industries.

RFM stands for the three dimensions:

* Recency – How recently did the customer purchase?
* Frequency – How often do they purchase?
* Monetary Value – How much do they spend?

Source: [Wikipedia](https://en.wikipedia.org/wiki/RFM_(market_research))

So we will build these three attributes: Recency, Frequency, and Monetary.
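As a rough illustration of the three dimensions, the following plain-Python sketch derives them for a handful of orders. The orders and the `reference_date` are hypothetical values chosen for the example, and measuring recency in days since the most recent order is one common convention, assumed here.

```python
from datetime import date

# Hypothetical orders: (user_id, order_date, amount).
orders = [
    (1, date(2019, 11, 1), 50.0),
    (1, date(2019, 11, 20), 25.0),
    (2, date(2019, 11, 10), 100.0),
]
reference_date = date(2019, 12, 2)  # assumed reference day for recency

rfm = {}
for user_id, order_date, amount in orders:
    days_since = (reference_date - order_date).days
    entry = rfm.setdefault(
        user_id, {"Recency": days_since, "Frequency": 0, "Monetary": 0.0}
    )
    entry["Recency"] = min(entry["Recency"], days_since)  # most recent order wins
    entry["Frequency"] += 1                               # one more order
    entry["Monetary"] += amount                           # running total spent

print(rfm[1])  # {'Recency': 12, 'Frequency': 2, 'Monetary': 75.0}
```

A low Recency, a high Frequency, and a high Monetary value together describe a recently active, repeat, high-spending customer.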

In [11]:
max_date = result.agg(spark_max('Date_order')).collect()[0][0] 
print("The latest date:", max_date) # job id 6 agg, job id 7 show 

                                                                                

In [16]:
# cache the purchase data in Spark
ecommerce.cache() # cache() is lazy: it only marks the DataFrame for caching and does not shuffle data
ecommerce.show(5) # action 

24/01/10 01:45:05 WARN CacheManager: Asked to cache already cached data.

+-------------------+----------+----------+-------------------+--------------------+-------+------+---------+--------------------+
|         event_time|event_type|product_id|        category_id|       category_code|  brand| price|  user_id|        user_session|
+-------------------+----------+----------+-------------------+--------------------+-------+------+---------+--------------------+
|2019-11-01 01:00:41|  purchase|  13200605|2053013557192163841|furniture.bedroom...|   NULL| 566.3|559368633|d6034fa2-41fb-4ac...|
|2019-11-01 01:01:04|  purchase|   1005161|2053013555631882655|electronics.smart...| xiaomi|211.92|513351129|e6b7ce9b-1938-4e2...|
|2019-11-01 01:04:51|  purchase|   1004856|2053013555631882655|electronics.smart...|samsung|128.42|562958505|0f039697-fedc-40f...|
|2019-11-01 01:05:34|  purchase|  26401669|2053013563651392361|                NULL|lucente|109.66|541854711|c41c44d5-ef9b-41b...|
|2019-11-01 01:06:33|  purchase|   1801881|2053013554415534427|electronics.video.tv

                                                                                

In [20]:
ecommerce.rdd.getNumPartitions() # get the current number of partitions

68

In [18]:
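ecommerce.repartition(5) # note: returns a new DataFrame; the result is not assigned here, so ecommerce keeps its partitioning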
ecommerce.show(5)

+-------------------+----------+----------+-------------------+--------------------+-------+------+---------+--------------------+
|         event_time|event_type|product_id|        category_id|       category_code|  brand| price|  user_id|        user_session|
+-------------------+----------+----------+-------------------+--------------------+-------+------+---------+--------------------+
|2019-11-01 01:00:41|  purchase|  13200605|2053013557192163841|furniture.bedroom...|   NULL| 566.3|559368633|d6034fa2-41fb-4ac...|
|2019-11-01 01:01:04|  purchase|   1005161|2053013555631882655|electronics.smart...| xiaomi|211.92|513351129|e6b7ce9b-1938-4e2...|
|2019-11-01 01:04:51|  purchase|   1004856|2053013555631882655|electronics.smart...|samsung|128.42|562958505|0f039697-fedc-40f...|
|2019-11-01 01:05:34|  purchase|  26401669|2053013563651392361|                NULL|lucente|109.66|541854711|c41c44d5-ef9b-41b...|
|2019-11-01 01:06:33|  purchase|   1801881|2053013554415534427|electronics.video.tv

The last date in the data is 2019-12-01, so we will use 2019-12-02 as the reference date.

In [24]:
study_date = '2019-12-02'

# days between the reference date and each order, used later to compute recency
result = result.withColumn('last_purchase', datediff(lit(study_date), col('Date_order')))

# show the result
result.show(5)

+--------------------+-------------------+---------+-----------+------+-------------+
|        user_session|         Date_order|  user_id|order_count| price|last_purchase|
+--------------------+-------------------+---------+-----------+------+-------------+
|2af9b570-0942-4dc...|2019-10-01 02:09:26|524601178|          1|189.91|           62|
|62a3b59a-de32-450...|2019-10-01 05:28:56|543624132|          1|254.76|           62|
|3a8a2e45-3c9b-4d1...|2019-10-01 05:31:53|521819296|          1|360.11|           62|
|194fc2ad-6a50-4dc...|2019-10-01 05:57:31|555477458|          1|130.76|           62|
|f70b875e-caf2-4c1...|2019-10-01 06:03:31|550692948|          1|583.28|           62|
+--------------------+-------------------+---------+-----------+------+-------------+
only showing top 5 rows
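The value 62 in `last_purchase` can be checked by hand. The sketch below reproduces the calculation in plain Python for the first row shown above, assuming that `datediff` compares calendar dates and ignores the time of day, which is my understanding of its behavior.

```python
from datetime import date, datetime

study_date = date(2019, 12, 2)
date_order = datetime(2019, 10, 1, 2, 9, 26)  # Date_order of the first row above

# Drop the time of day, then count whole calendar days.
last_purchase = (study_date - date_order.date()).days
print(last_purchase)  # 62

# An order placed late on the same day gives the same result.
late_order = datetime(2019, 10, 1, 23, 59, 59)
print((study_date - late_order.date()).days)  # 62
```

The 62 days break down as 31 days from October 1 to November 1, 30 days to December 1, and 1 more day to December 2.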



In [25]:
# Calculate Recency, Frequency, and Monetary
RFM_result = result.groupBy('user_id').agg(
    spark_min('last_purchase').alias('Recency'),
    count('user_id').alias('Frequency'),
    spark_sum('price').alias('Monetary')
)

# Show the result
RFM_result.show(10)

+---------+-------+---------+--------+
|  user_id|Recency|Frequency|Monetary|
+---------+-------+---------+--------+
|512817507|     62|        1| 1080.18|
|519298781|     62|        1|   177.1|
|552723049|     62|        1|  360.08|
|515361365|     62|        2|   88.54|
|513902632|     62|        1|  186.22|
|512878155|     62|        1|  254.26|
|548720234|     62|        1|  965.02|
|545546853|     62|        1| 1222.41|
|514399464|     62|        2| 3397.72|
|543399383|     62|        1|  388.81|
+---------+-------+---------+--------+
only showing top 10 rows



### About the warning
As indicated [here](https://stackoverflow.com/questions/41661849/spill-to-disk-and-shuffle-write-spark), this warning means that the available memory is full and that part of the in-memory data is spilled to disk.

## Frequency 

In [26]:
RFM_result.describe('Frequency').show()

+-------+------------------+
|summary|         Frequency|
+-------+------------------+
|  count|             14064|
|   mean|1.1553612059158134|
| stddev|0.5264487104076627|
|    min|                 1|
|    max|                10|
+-------+------------------+
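For reference, the five summary rows can be reproduced on a small list with Python's `statistics` module. The `frequency` values below are hypothetical, and the sketch assumes that `describe()` reports the sample standard deviation (dividing by n - 1), which is my understanding of Spark's `stddev`.

```python
import statistics

# Hypothetical per-user Frequency values.
frequency = [1, 1, 2, 1, 3]

summary = {
    "count": len(frequency),
    "mean": statistics.mean(frequency),
    "stddev": statistics.stdev(frequency),  # sample standard deviation (n - 1)
    "min": min(frequency),
    "max": max(frequency),
}
print(summary)
```

Read this way, a mean close to the minimum of 1 with a small standard deviation, as in the table above, indicates that most users placed a single order.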



## Recency

In [27]:
RFM_result.describe('Recency').show()

+-------+-------------------+
|summary|            Recency|
+-------+-------------------+
|  count|              14064|
|   mean|  61.99139647326508|
| stddev|0.09235860860653512|
|    min|                 61|
|    max|                 62|
+-------+-------------------+



## Monetary

In [28]:
RFM_result.describe('Monetary').show()

+-------+-----------------+
|summary|         Monetary|
+-------+-----------------+
|  count|            14064|
|   mean|446.2431747724663|
| stddev| 743.075831746735|
|    min|             0.79|
|    max|         30607.05|
+-------+-----------------+

