For a better look in Jupyter Notebook, please execute the following cell.

In [1]:
%%HTML
<style>
    body {
        --vscode-font-family: "LXGW WenKai";
        line-height: 2;
    }
</style>

In [2]:
import os 
import findspark 
findspark.init()

# for sql
from pyspark.sql import SparkSession 
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when  # aggregates are used via F to avoid shadowing built-ins such as sum and max
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType, DoubleType, LongType, TimestampType

# for the study date
import datetime as dt

# for memory and disk I/O metrics
import psutil

# for plot
import matplotlib.pyplot as plt
import squarify





In [3]:
# Define the schema based on the structure of the CSV data
schema = StructType([
    StructField("event_time", TimestampType(), True),
    StructField("event_type", StringType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("category_id", LongType(), True),
    StructField("category_code", StringType(), True),
    StructField("brand", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("user_id", IntegerType(), True),
    StructField("user_session", StringType(), True)
])

# RFM-Segmentation

This section works through a big data use case: RFM segmentation of e-commerce customers.
It starts with a short introduction to RFM analysis and RFM segmentation.

### RFM Analysis
RFM is a method used for analyzing customer value. It is commonly used in database marketing and direct marketing and has received particular attention in retail and professional services industries.

RFM stands for the three dimensions:

* Recency – How recently did the customer purchase?
* Frequency – How often do they purchase?
* Monetary Value – How much do they spend?

source: [wikipedia](https://en.wikipedia.org/wiki/RFM_(market_research))

We will therefore calculate the three attributes Recency, Frequency, and Monetary for each customer.
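
To make the three attributes concrete, here is a minimal pure-Python sketch on made-up data. The purchase rows and the helper name `compute_rfm` are hypothetical and only for illustration; the notebook itself computes the same quantities with Spark further below.

```python
from collections import defaultdict
from datetime import date

# Hypothetical purchases: (user_id, order_date, amount)
purchases = [
    ("u1", date(2019, 11, 28), 40.0),
    ("u1", date(2019, 11, 10), 15.5),
    ("u2", date(2019, 10, 3), 120.0),
]

# Reference date for the recency calculation (the notebook uses 2019-12-01)
study_date = date(2019, 12, 1)


def compute_rfm(rows, reference_date):
    """Return {user_id: (recency_days, frequency, monetary)}."""
    per_user = defaultdict(list)
    for user_id, order_date, amount in rows:
        per_user[user_id].append((order_date, amount))

    result = {}
    for user_id, orders in per_user.items():
        last_order = max(d for d, _ in orders)
        recency = (reference_date - last_order).days  # days since last purchase
        frequency = len(orders)                       # number of purchases
        monetary = sum(a for _, a in orders)          # total amount spent
        result[user_id] = (recency, frequency, monetary)
    return result


rfm = compute_rfm(purchases, study_date)
for user_id, values in rfm.items():
    print(user_id, values)
```

A small Recency value means a recent purchase, while large Frequency and Monetary values indicate an active, high-spending customer.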

### RFM Segmentation

RFM segmentation is a useful tool for identifying groups of customers who should be given extra attention. It enables marketers to target specific groups of customers with communications that are far more relevant to their behavior, which can improve response rates, loyalty, and customer lifetime value.

The simplest way to create customer segments from the RFM model is to use quartiles. We assign a score from 1 to 4 to each of Recency, Frequency, and Monetary. In this notebook, 1 is the best value and 4 is the worst: the most recent, most frequent, and highest-spending customers score 1. The final RFM score is calculated by adding the three individual scores, so it ranges from 3 (best) to 12 (worst).
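
The quartile scoring can be sketched in plain Python as follows. This is only an illustration on made-up values: it follows the convention of the `R` and `FM` helper functions in this notebook (1 is the best score), uses exact quartiles from the standard library instead of Spark's approximate quantiles, and the names `quartile_cutoffs`, `score_low_is_best`, `score_high_is_best`, and `rfm_scores` are hypothetical.

```python
from statistics import quantiles


def quartile_cutoffs(values):
    """Return the 25%, 50% and 75% cut points of the given values."""
    return quantiles(values, n=4, method="inclusive")


def score_low_is_best(x, q1, q2, q3):
    """Quartile score for Recency: fewer days since the last purchase scores 1."""
    if x <= q1:
        return 1
    if x <= q2:
        return 2
    if x <= q3:
        return 3
    return 4


def score_high_is_best(x, q1, q2, q3):
    """Quartile score for Frequency and Monetary: the top quartile scores 1."""
    return 5 - score_low_is_best(x, q1, q2, q3)


# Hypothetical per-customer values (index i belongs to customer i)
recency_days = [3, 10, 25, 40, 59]
frequencies = [8, 5, 3, 2, 1]
monetary = [300.0, 120.0, 80.0, 55.5, 20.0]

r_cut = quartile_cutoffs(recency_days)
f_cut = quartile_cutoffs(frequencies)
m_cut = quartile_cutoffs(monetary)


def rfm_scores(i):
    """Return the (R, F, M) quartile scores of customer i."""
    r = score_low_is_best(recency_days[i], *r_cut)
    f = score_high_is_best(frequencies[i], *f_cut)
    m = score_high_is_best(monetary[i], *m_cut)
    return r, f, m


for i in range(len(recency_days)):
    r, f, m = rfm_scores(i)
    print(f"customer {i}: segmentation={r}{f}{m}, score={r + f + m}")
```

The concatenated digits identify the segment, while the sum gives a single ranking value per customer.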


The code is divided into multiple phases:

1. Spark Session Initialization

    Initializes a SparkSession with a specified master node and application name "rfm-segmentation".
    This is the starting point for any Spark application and is used for configuring Spark's settings.

2. Data Loading and Initial Processing

    Reads an ecommerce CSV file into a Spark DataFrame with a predefined schema and prints the schema to confirm the data structure.
    It filters the data to include only purchase events and caches the result for efficient access during further transformations.

3. Data Aggregation

    Groups the filtered purchase data by "user_session" and aggregates it to compute the latest event time, the unique user IDs, the number of purchase events, and the total money spent per session.
    This aggregated data is then used to derive insights on user behavior within the ecommerce platform.

4. Data Preparation for RFM Analysis

    This step prepares the data for RFM segmentation by determining how recently and frequently each user has made purchases and how much they have spent.

5. RFM Segmentation Calculation

    Performs RFM analysis by grouping data on a user level and calculating Recency, Frequency, and Monetary values.
    Determines quantiles for Recency, Frequency, and Monetary to classify users into different segments based on their behavior.

6. RFM Score Assignment

    Assigns RFM quartiles and scores to each user based on the calculated quantiles. This involves categorizing users into segments such as "Lost", "Hibernating", etc., based on their RFM scores.
    The RFM score is a composite score given to each customer to represent their overall value to the business.

7. RFM Labeling and Segmentation Summary

    Further refines the RFM segments by assigning descriptive labels based on the RFM score.
    Aggregates the final RFM results to calculate average Recency, Frequency, and Monetary values for each segment, along with the size of each segment.

8. Visualization

    The final part of the code involves creating a visualization of the RFM segments using a squarify plot to represent the size of each segment visually. This helps in understanding the distribution of users across the different RFM segments.
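
Phases 6 to 8 can be illustrated without Spark. The sketch below mirrors the score thresholds of the `when(...)` chain in `run_rfm_segmentation` and counts the segment sizes that a treemap would display. The list of scores is made up, and `LABEL_THRESHOLDS` and `rfm_label` are hypothetical names.

```python
from collections import Counter

# (minimum score, label), checked from the worst score downwards;
# a lower RFM score is better in this notebook
LABEL_THRESHOLDS = [
    (10, "Lost"),
    (9, "Hibernating"),
    (8, "Can’t Lose Them"),
    (7, "About To Sleep"),
    (6, "Promising"),
    (5, "Potential Loyalist"),
    (4, "Loyal Customers"),
]


def rfm_label(score):
    """Map an RFM score (3 to 12) to a descriptive segment label."""
    for minimum, label in LABEL_THRESHOLDS:
        if score >= minimum:
            return label
    return "Champions"


# Hypothetical RFM scores of five customers
scores = [3, 4, 7, 12, 10]

# Segment sizes, i.e. the values a squarify treemap would use as areas
segment_sizes = Counter(rfm_label(s) for s in scores)
for label, size in segment_sizes.items():
    print(f"{label}: {size}")
```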

In [4]:
# for Recency: a smaller value (more recent purchase) gets the better score 1
def R(x, q1, q2, q3):
    if x <= q1:
        return 1
    elif x <= q2:
        return 2
    elif x <= q3:
        return 3
    else:
        return 4

# for Frequency and Monetary: a larger value gets the better score 1
def FM(x, q1, q2, q3):
    if x <= q1:
        return 4
    elif x <= q2:
        return 3
    elif x <= q3:
        return 2
    else:
        return 1

def run_rfm_segmentation(mastername, dateiname, schema):
    # Memory Usage
    memory = psutil.virtual_memory()
    print(f"Memory Usage: {memory.percent}%")

    # Disk I/O
    disk_io_start = psutil.disk_io_counters()
    spark = SparkSession.builder \
        .master(mastername) \
        .appName("rfm-segmentation") \
        .getOrCreate()

    # Read the CSV file with the predefined schema
    ecommerce = spark.read.csv(dateiname, schema=schema, header=True)

    print(ecommerce.rdd.getNumPartitions())
    only_purchases = ecommerce.filter(col("event_type") == 'purchase').cache()

    aggregated_data = only_purchases.groupBy("user_session") \
        .agg(
            F.max("event_time").alias("Date_order"),
            F.collect_set("user_id").alias("user_id"),  # Unique user_ids
            F.count("user_session").alias("Quantity"),
            F.sum("price").alias("money_spent")
        )

    # Reference date for the recency calculation; 'Date_order' is still a timestamp column at this point
    study_date = dt.datetime(2019, 12, 1)

    # Convert 'Date_order' to date type if it's not already
    data = aggregated_data.withColumn("Date_order", F.col("Date_order").cast(DateType()))

    # Calculate the difference in days
    data = data.withColumn("last_purchase", F.datediff(F.lit(study_date), "Date_order"))


    RFM_result = data.groupBy("user_id") \
        .agg(
            F.min("last_purchase").alias("Recency"),
            F.count("user_id").alias("Frequency"),
            F.sum("money_spent").alias("Monetary")
        )

    # Define the quantile probabilities and the relative error
    quantile_probs = [0.25, 0.5, 0.75]
    rel_error = 0.01  # Small relative error for a more accurate approximation

    # Compute the quantiles for Recency, Frequency, and Monetary
    recency_quantiles = RFM_result.stat.approxQuantile("Recency", quantile_probs, rel_error)
    frequency_quantiles = RFM_result.stat.approxQuantile("Frequency", quantile_probs, rel_error)
    monetary_quantiles = RFM_result.stat.approxQuantile("Monetary", quantile_probs, rel_error)

    # Collect the quantile information in a dictionary
    quartiles = {
        'Recency': dict(zip(quantile_probs, recency_quantiles)),
        'Frequency': dict(zip(quantile_probs, frequency_quantiles)),
        'Monetary': dict(zip(quantile_probs, monetary_quantiles))
    }

    # Extract the quartile values
    recency_quartiles = [quartiles['Recency'][0.25], quartiles['Recency'][0.50], quartiles['Recency'][0.75]]
    frequency_quartiles = [quartiles['Frequency'][0.25], quartiles['Frequency'][0.50], quartiles['Frequency'][0.75]]
    monetary_quartiles = [quartiles['Monetary'][0.25], quartiles['Monetary'][0.50], quartiles['Monetary'][0.75]]

    RFM_result = RFM_result.withColumn('R_Quartile', 
        when(col('Recency') <= recency_quartiles[0], 1)
        .when(col('Recency') <= recency_quartiles[1], 2)
        .when(col('Recency') <= recency_quartiles[2], 3)
        .otherwise(4))

    RFM_result = RFM_result.withColumn('F_Quartile', 
        when(col('Frequency') <= frequency_quartiles[0], 4)
        .when(col('Frequency') <= frequency_quartiles[1], 3)
        .when(col('Frequency') <= frequency_quartiles[2], 2)
        .otherwise(1))

    RFM_result = RFM_result.withColumn('M_Quartile', 
        when(col('Monetary') <= monetary_quartiles[0], 4)
        .when(col('Monetary') <= monetary_quartiles[1], 3)
        .when(col('Monetary') <= monetary_quartiles[2], 2)
        .otherwise(1))

    # Build the RFM_segmentation and RFM_score columns: the quartile columns are cast to strings
    # and concatenated for the segmentation, and summed for the score.
    # F.concat is needed here because "+" on string columns performs numeric addition in Spark.
    RFM_result = RFM_result.withColumn('RFM_segmentation',
        F.concat(
            col('R_Quartile').cast(StringType()),
            col('F_Quartile').cast(StringType()),
            col('M_Quartile').cast(StringType())))

    RFM_result = RFM_result.withColumn('RFM_score', 
        col('R_Quartile') + 
        col('F_Quartile') + 
        col('M_Quartile'))


    RFM_result = RFM_result.withColumn('RFM_label', 
        when(col('RFM_score') >= 10, 'Lost')
        .when(col('RFM_score') >= 9, 'Hibernating')
        .when(col('RFM_score') >= 8, 'Can’t Lose Them')
        .when(col('RFM_score') >= 7, 'About To Sleep')
        .when(col('RFM_score') >= 6, 'Promising')
        .when(col('RFM_score') >= 5, 'Potential Loyalist')
        .when(col('RFM_score') >= 4, 'Loyal Customers')
        .otherwise('Champions'))

    # Group by 'RFM_label' and compute the averages and the segment size
    RFM_desc = RFM_result.groupBy('RFM_label').agg(
        F.mean('Recency').alias('Average_Recency'),
        F.mean('Frequency').alias('Average_Frequency'),
        F.mean('Monetary').alias('Average_Monetary'),
        F.count('RFM_label').alias('Segment_Size')
    )

    # Round the averages to one decimal place
    RFM_desc = RFM_desc.select(
        'RFM_label',
        F.round('Average_Recency', 1).alias('Average_Recency'),
        F.round('Average_Frequency', 1).alias('Average_Frequency'),
        F.round('Average_Monetary', 1).alias('Average_Monetary'),
        'Segment_Size'
    )

    # Show the aggregated DataFrame
    RFM_desc.show()

    # Extract the necessary columns from the Spark DataFrame
    sizes = RFM_desc.select("Segment_Size").rdd.flatMap(lambda x: x).collect()
    labels = RFM_desc.select("RFM_label").rdd.flatMap(lambda x: x).collect()

    # Create your plot and resize it
    #fig = plt.gcf()
    #ax = fig.add_subplot()
    #fig.set_size_inches(16, 9)

    # Create squarify plot
    #squarify.plot(sizes=sizes, label=labels, alpha=.6)
    #plt.title("RFM Segments", fontsize=18, fontweight="bold")
    #plt.axis('off')
    
    disk_io_end = psutil.disk_io_counters()

    read_bytes = disk_io_end.read_bytes - disk_io_start.read_bytes
    write_bytes = disk_io_end.write_bytes - disk_io_start.write_bytes

    print(f"Read: {read_bytes / 1024 / 1024:.2f} MB, Write: {write_bytes / 1024 / 1024:.2f} MB")


# Data Scaling (Cluster)

In this section we give an overview of how our Spark application scales on a local standalone cluster setup.

To do so, we compare the total run time of the application and the key metrics of its longest job across different cluster configurations.
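
The total run time can be captured from within the notebook with a small wrapper. This is a minimal sketch and not part of the original measurements: `timed` and `fake_job` are hypothetical names, and `fake_job` merely stands in for a call such as `run_rfm_segmentation(master3Worker, bigData, schema)`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_job(n):
    """Stand-in workload so that the sketch runs without a Spark cluster."""
    return sum(i * i for i in range(n))


result, seconds = timed(fake_job, 100_000)
print(f"Total time: {seconds:.3f} s")
```

Wall-clock time measured this way includes session start-up and scheduling overhead, so it complements rather than replaces the per-job metrics.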

**To set up the test environment for the cluster, use the terminal to create the master node and the worker node(s).
The following commands have to be executed in the bin folder of your Spark installation.**

master node:
spark-class org.apache.spark.deploy.master.Master

worker node:
spark-class org.apache.spark.deploy.worker.Worker spark://<masternode-ip>:<masternode-port>

The base configuration for a worker node is 15.0 GiB of RAM (1024.0 MiB used) and 10 cores.
A worker can be configured with --cores x for the number of cores and --memory x for the amount of RAM.

e.g.

./spark-class org.apache.spark.deploy.worker.Worker \
    --cores 5 \
    spark://100.119.9.7:7077

./spark-class org.apache.spark.deploy.worker.Worker \
    --memory 512m \
    spark://100.119.9.7:7077

After creating the cluster, replace the "spark://<masternode-ip>:<masternode-port>" URL of each test cluster with the address of your own master node.

The setup for the test cases is as follows:

master1Worker:
- Base configuration
- 1 worker

master2Worker:
- Base configuration
- 2 workers

master3Worker:
- Base configuration
- 3 workers

master6Worker:
- Base configuration
- 6 workers

master2Core:
- --cores 2
- 3 workers

master5Core:
- --cores 5
- 3 workers

master256MB:
- --memory 256m
- 3 workers

master512MB:
- --memory 512m
- 3 workers

master768MB:
- --memory 768m
- 3 workers

**Change the path of the data according to your system.**

In [5]:
# data with different sizes
smallData = '../../../only_purchases_1day.csv'
mediumData = '../../../2019-Oct.csv'
bigData = '../../../*.csv'

# the clusters for the tests
master1Worker = "spark://100.119.9.7:7077"
master2Worker = "spark://100.119.9.7:7077"
master3Worker = "spark://100.119.9.7:7077"
master6Worker = "spark://100.119.9.7:7077"
master2Core = "spark://100.119.9.7:7077"
master5Core = "spark://100.119.9.7:7077"
master256MB = "spark://100.119.9.7:7077"
master512MB = "spark://100.119.9.7:7077"
master768MB = "spark://100.119.9.7:7077"


## Data Scalability (Cluster)

### Test 1: Small Data

In [6]:
#run_rfm_segmentation(master3Worker, smallData, schema)

### Test 2: Medium Data

In [7]:
#run_rfm_segmentation(master3Worker, mediumData, schema)

### Test 3: Big Data

In [8]:
#run_rfm_segmentation(master3Worker, bigData, schema)

**Conclusion Data Scalability (Cluster)**

| Test ID | Number of tasks | Rows | Duration / total input size | Input size per task |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1,000 | 0.6 s / 131.1 KiB | 131.1 KiB |
| 2 | 1 | 10,000 | 0.5 s / 1.3 MiB | 1.3 MiB |
| 3 | 4 | 100,000 | 4 s / 13 MiB | 4.1 MiB |
| 4 | 16 | 1,000,000 | 26 s / 129.9 MiB | 8.4 MiB |
| 5 | 16 | 10,000,000 | 1.4 min / 1.3 GiB | 80.5 MiB |
| 6 | 21 | 20,000,000 | 1.0 min / 2.5 GiB | 128.1 MiB |
| 7 | 101 | 100,000,000 | 6.7 min / 12.6 GiB | 128.1 MiB |
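
To compare the rows of the table above on a common scale, the durations and input sizes can be turned into a throughput figure. The sketch below assumes that entries such as `1.4*60` denote 1.4 minutes expressed in seconds; the values are transcribed from the table, and `throughput_mib_per_s` is a hypothetical helper name.

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3

# (test id, duration in seconds, total input size in bytes), transcribed from the table
runs = [
    (1, 0.6, 131.1 * KIB),
    (2, 0.5, 1.3 * MIB),
    (3, 4, 13 * MIB),
    (4, 26, 129.9 * MIB),
    (5, 1.4 * 60, 1.3 * GIB),
    (6, 1.0 * 60, 2.5 * GIB),
    (7, 6.7 * 60, 12.6 * GIB),
]


def throughput_mib_per_s(seconds, size_bytes):
    """Processed input in MiB per second of job duration."""
    return size_bytes / MIB / seconds


for test_id, seconds, size_bytes in runs:
    rate = throughput_mib_per_s(seconds, size_bytes)
    print(f"Test {test_id}: {rate:.2f} MiB/s")
```

A throughput that rises with the input size would suggest that fixed start-up costs dominate the small runs.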

## Cluster Scalability

### Test 4: 1 Worker

In [9]:
run_rfm_segmentation(master1Worker, bigData, schema)

Memory Usage: 56.4%


24/02/02 18:30:43 WARN Utils: Your hostname, Nikolais-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 100.119.9.7 instead (on interface en0)
24/02/02 18:30:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/02 18:30:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


423


[Stage 3:>                                                        (0 + 10) / 11]

![Image](https://i.imgur.com/GDLM91D)

### Test 5: initialize cluster with 2 Workers

In [None]:
#run_rfm_segmentation(master2Worker, bigData, schema)

### Test 6: initialize cluster with 3 Workers

### Test 6: initialize cluster with 6 Workers

In [None]:
#run_rfm_segmentation(master6Worker, bigData, schema)

### Test 7: initialize cluster with 2 Cores on each Worker

In [None]:
#run_rfm_segmentation(master2Core, bigData, schema)

### Test 8: initialize cluster with 5 Cores on each Worker

In [None]:
#run_rfm_segmentation(master5Core, bigData, schema)

### Test 9: initialize cluster with a memory of 256MB for each Worker

In [None]:
#run_rfm_segmentation(master256MB, bigData, schema)

### Test 10: initialize cluster with a memory of 512MB for each Worker

In [None]:
#run_rfm_segmentation(master512MB, bigData, schema)

### Test 11: initialize cluster with a memory of 768MB for each Worker

In [None]:
#run_rfm_segmentation(master768MB, bigData, schema)

24/02/02 18:22:23 ERROR TaskSchedulerImpl: Lost executor 2 on 100.119.9.7: Command exited with code 129
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_252 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_71 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_310 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_137 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_77 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_270 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_21 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_104 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_195 !
24/02/02 18:22:23 WARN BlockManagerMasterEndpoint: No

**Conclusion Cluster Scalability**

| Test ID | Number of tasks | Rows | Duration / total input size | Input size per task |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1,000 | 0.6 s / 131.1 KiB | 131.1 KiB |
| 2 | 1 | 10,000 | 0.5 s / 1.3 MiB | 1.3 MiB |
| 3 | 4 | 100,000 | 4 s / 13 MiB | 4.1 MiB |
| 4 | 16 | 1,000,000 | 26 s / 129.9 MiB | 8.4 MiB |
| 5 | 16 | 10,000,000 | 1.4 min / 1.3 GiB | 80.5 MiB |
| 6 | 21 | 20,000,000 | 1.0 min / 2.5 GiB | 128.1 MiB |
| 7 | 101 | 100,000,000 | 6.7 min / 12.6 GiB | 128.1 MiB |

![Image](https://i.imgur.com/hCWLsuY.png)

![Image](https://i.imgur.com/VxpeVvd.png)