# Spark modelling - optimized

## 1. Libraries and Spark setup


This section imports necessary libraries and sets up the Spark environment:

Libraries:
* `os`, `sys`, `json`, `datetime`, `numpy`, `pandas`, `tqdm`, `matplotlib.pyplot`: General-purpose libraries for file system access, system functionality, JSON handling, date/time manipulation, numerical computation, data manipulation, progress bars, and plotting.

pyspark libraries:
* `SparkContext`, `SparkConf`: Core Spark functionalities for setting up the Spark context and configuration.
* `SparkSession`: Entry point for interacting with Spark SQL.
* `functions as F`: Provides various Spark SQL functions for data manipulation.
* `types`: Defines data types for Spark DataFrames.
* `Window`: Used for window functions in Spark SQL.
* `ml.feature`: Provides feature engineering and transformation tools like `Word2Vec`, `Imputer`, `OneHotEncoder`, `StringIndexer`, `VectorAssembler`.
* `ml.classification`: Provides classification algorithms like `LogisticRegression` and `RandomForestClassifier`.
* `ml.evaluation`: Provides evaluators such as `BinaryClassificationEvaluator` and `MulticlassClassificationEvaluator`.
* `mllib.evaluation`: Provides `BinaryClassificationMetrics` (part of the RDD-based API).
* `ml.tuning`: Provides tools for hyperparameter tuning like `CrossValidator` and `ParamGridBuilder`.

Spark Configuration:
* `SparkConf`: Sets configuration parameters for the Spark application.
* `spark.master`: Specifies the cluster manager; `local[*]` uses all available cores on the local machine.
* `spark.driver.memory`, `spark.driver.maxResultSize`: Allocates memory for the driver process.
* `SparkContext`, `SparkSession`: Creates the Spark context and session based on the configuration.

Accessing Data:
* `access_data`: Function that loads JSON data from a local file.
* `access_s3_data`: Dictionary of AWS-style credentials that `access_data` loads from the local file `.access_jhub_data`.

Spark configuration is further set to access data from Yandex Cloud Storage (S3-compatible) using the loaded credentials.
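A minimal sketch of how the credentials file could be validated before its values are handed to the Hadoop configuration. The helper name `load_s3_credentials` and the key check are assumptions added for illustration; the notebook itself only calls `json.load` on the file.

```python
import json

# Keys the notebook reads from the credentials file
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def load_s3_credentials(file_path):
    """Load a JSON credentials file and check that the expected keys exist."""
    with open(file_path) as fh:
        creds = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        # Fail early with a readable message instead of a KeyError deep in Spark setup
        raise KeyError(f"credentials file is missing keys: {missing}")
    return creds
```

Failing early here gives a clearer error than a failed S3 read later in the pipeline.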

In [1]:
import os
import sys
import json
import datetime
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Used later to build unique filenames (`datetime` is already imported above)
import uuid
import getpass

In [2]:
!pip install wandb


Collecting wandb
  Using cached wandb-0.16.6-py3-none-any.whl (2.2 MB)
Collecting setproctitle
  Using cached setproctitle-1.3.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)
Collecting sentry-sdk>=1.0.0
  Using cached sentry_sdk-1.45.0-py2.py3-none-any.whl (267 kB)
Collecting appdirs>=1.4.3
  Using cached appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Collecting docker-pycreds>=0.4.0
  Using cached docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting urllib3<3,>=1.21.1
  Using cached urllib3-2.2.1-py3-none-any.whl (121 kB)
Installing collected packages: appdirs, urllib3, setproctitle, docker-pycreds, sentry-sdk, wandb
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.8
    Uninstalling urllib3-1.26.8:
      Successfully uninstalled urllib3-1.26.8
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the fo

In [3]:
# W&B logging 
import wandb

## <span style="color: red;">Insert the name of YOUR notebook below</span>



In [4]:
# Set name of notebook
os.environ["WANDB_NOTEBOOK_NAME"] = "full_pipeline_with_logging_hyperopt_jake_1month-Copy1.ipynb"





In [5]:
# Pyspark general 
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Pyspark pre-processing 
from pyspark.sql.window import Window

# Pyspark vectorization
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Pyspark pipeline and imputation
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer

# Pyspark models and feature encoders
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier, LinearSVC
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Pyspark evaluators
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Pyspark cross-validation
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Reporting (scikit-learn)
from sklearn.metrics import classification_report

# Pyspark other
from pyspark.ml.functions import vector_to_array


In [6]:
!pip install hyperopt


Collecting hyperopt
  Using cached hyperopt-0.2.7-py2.py3-none-any.whl (1.6 MB)
Collecting future
  Using cached future-1.0.0-py3-none-any.whl (491 kB)
Installing collected packages: future, hyperopt
Successfully installed future-1.0.0 hyperopt-0.2.7


In [7]:
# Hyperopt related 
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt.pyll.base import scope


## <span style="color: red;">Get your W&B API key for the next section</span>


In [8]:
# Weights and biases login 

wandb.login()

wandb: Currently logged in as: st083972 (gsom-diploma-jap). Use `wandb login --relogin` to force relogin


True

In [None]:
print('user:', os.environ['JUPYTERHUB_SERVICE_PREFIX'])

def uiWebUrl(self):
    from urllib.parse import urlparse
    web_url = self._jsc.sc().uiWebUrl().get()
    port = urlparse(web_url).port
    return '{}proxy/{}/jobs/'.format(os.environ['JUPYTERHUB_SERVICE_PREFIX'], port)

SparkContext.uiWebUrl = property(uiWebUrl)

conf = SparkConf()
conf.set('spark.master', 'local[*]')
conf.set('spark.driver.memory', '32G')
conf.set('spark.driver.maxResultSize', '8G')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
spark

In [10]:
def access_data(file_path):
    # Load a JSON file (here: the credentials file) into a dict
    with open(file_path) as file:
        data = json.load(file)
    return data

access_s3_data = access_data('.access_jhub_data')

In [11]:
spark._jsc.hadoopConfiguration().set('fs.s3a.access.key', access_s3_data['aws_access_key_id'])
spark._jsc.hadoopConfiguration().set('fs.s3a.secret.key', access_s3_data['aws_secret_access_key'])
spark._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3a.S3AFileSystem')
spark._jsc.hadoopConfiguration().set('fs.s3a.multipart.size', '104857600')
spark._jsc.hadoopConfiguration().set('fs.s3a.block.size', '33554432')
spark._jsc.hadoopConfiguration().set('fs.s3a.threads.max', '256')
spark._jsc.hadoopConfiguration().set('fs.s3a.endpoint', 'http://storage.yandexcloud.net')
spark._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 
                                     'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')

## 2. Dataset


This section defines variables for data processing and file paths, then performs several data processing stages.

Variables:

* `VER`: Version identifier.
* `PROC_DS`, `PROC_LAGS`, `PROC_VECS`: Flags to control data processing stages.
* `FRAC_0`: Fraction of negative examples to sample when processing lags.
* `BUCKET`: S3 bucket name.
* `files_path`, `files_mask`: Local paths and masks for raw data files.
* `file_path_ds`, `file_path_lags`, `file_path_trn`, `file_path_tst`: S3 paths for different stages of processed data.


• `VER = 'v2'`: Defines the version of the data processing pipeline. Increment it when you make significant changes to the processing steps.

• `BUCKET = 'pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a7b'`: The name of the bucket where the data is stored (the S3 configuration above points at Yandex Cloud Storage).

**Flags:**

• `PROC_DS` (False): This flag controls whether to process the raw dataset. If True, the code reads the raw CSV files, flags payment events that fall inside the date window, keeps the relevant columns, and creates the data_raw.parquet file.

* Change to True: When you have new raw data or need to reprocess the existing raw data due to changes in the extraction logic.
* Keep as False: When you already have a processed data_raw.parquet file and don't need to re-process it.

• `PROC_LAGS` (False): This flag controls whether to process and create lag features. If True, the code will calculate lag features based on event history for different time windows and store them in the data_lags.parquet file.

* Change to True: When you need to recalculate lag features, such as when you've changed the time window definitions or added new events.
* Keep as False: When you already have the desired lag features in data_lags.parquet and don't need to recompute them.

• `FRAC_0` (.001): This variable sets the sampling fraction for events with payment_event_flag = 0 when processing lags. This is used to reduce the size of the dataset for faster processing while maintaining a representative sample.

* Adjust Value: You might change this value depending on the size of your dataset and the desired balance between processing time and data representativeness.

• `PROC_VECS` (True): This flag controls whether to vectorize the data using TF-IDF. If True, the code will transform the lag features into numerical vectors using TF-IDF and store them in data_vec_train.parquet and data_vec_test.parquet files.

* Change to False: If you want to experiment with other vectorization methods or use the data in its raw form.
* Keep as True: When TF-IDF vectorization is the desired approach for your modeling tasks.

**File Paths:**

• `files_path`, `files_mask`: These define the location and pattern of the raw data files.

• `file_path_ds`, `file_path_lags`, `file_path_trn`, `file_path_tst`: These specify the storage locations for the processed datasets at different stages of the pipeline.

By understanding these flags and variables, you can control which parts of the data processing pipeline are executed, allowing for efficient experimentation and iteration.
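As a rough illustration of how the three flags gate the pipeline, the sketch below maps each stage to either "recompute" or "load". The function name `plan_stages` is hypothetical, and it assumes every stage falls back to reading its previously written parquet file when its flag is False, as the `raw` and `lags` cells in this notebook do.

```python
def plan_stages(proc_ds, proc_lags, proc_vecs):
    """Return, per stage, whether it is recomputed or loaded from parquet."""
    flags = {"raw": proc_ds, "lags": proc_lags, "vectorize": proc_vecs}
    return {stage: ("recompute" if flag else "load") for stage, flag in flags.items()}


# Defaults described above: reuse raw data and lags, redo the vectorization
print(plan_stages(proc_ds=False, proc_lags=False, proc_vecs=True))
# {'raw': 'load', 'lags': 'load', 'vectorize': 'recompute'}
```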


v5 = data week: `2024, 1, 1, 0, 0, 0` to `2024, 4, 16, 23, 59, 59`

## <span style="color: red;">Insert YOUR version below, e.g. v9. If creating NEW data, set all flags to True</span>

In [12]:
VER = 'vJ1rf' # <-- insert YOUR version here 
PROC_DS = True # <-- set to true 
PROC_LAGS = True # <-- set to true 
FRAC_0 = .001 # used only if `PROC_LAGS = True`
PROC_VECS = True
BUCKET = 'pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a7b'
PRJ_PATH = '/home/jovyan/__RAYPFP'

files_path = 'data/events'
files_mask = f'{files_path}/data_202*-*-*.csv'

file_path_ds = f's3a://{BUCKET}/work/{VER}/data_raw.parquet'
file_path_lags = f's3a://{BUCKET}/work/{VER}/data_lags.parquet'
file_path_trn = f's3a://{BUCKET}/work/{VER}/data_vec_train.parquet'
file_path_tst = f's3a://{BUCKET}/work/{VER}/data_vec_test.parquet'
file_path_lags1 = f's3a://{BUCKET}/work/{VER}/data_lags1.parquet'


In [13]:
#file_path_lags1 = f's3a://{BUCKET}/work/{VER}/data_lags1.parquet'


In [14]:
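def clean_parquet(path):
    # Turn the s3a:// path into an `rm -rf` on the same path under PRJ_PATH
    # (this assumes the bucket contents are also reachable under PRJ_PATH).
    # Caution: the command is executed immediately by the IPython `!` escape.
    cmd = path.replace(
        f's3a://{BUCKET}',
        f'rm -rf {PRJ_PATH}'
    )
    !{cmd}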
    return f'command to run: {cmd}'
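A hedged alternative that avoids the shell: the sketch below does the same path mapping in plain Python and deletes the directory with `shutil.rmtree`. The names `s3a_to_local` and `clean_parquet_local` are hypothetical, and the sketch assumes, as `clean_parquet` does, that the bucket contents are available as ordinary directories under the project path.

```python
import shutil
from pathlib import Path


def s3a_to_local(path, bucket, project_root):
    """Map s3a://<bucket>/<key> to <project_root>/<key>."""
    prefix = f"s3a://{bucket}/"
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} is not inside bucket {bucket!r}")
    return Path(project_root) / path[len(prefix):]


def clean_parquet_local(path, bucket, project_root):
    """Delete the local copy of a parquet directory and return the path removed."""
    target = s3a_to_local(path, bucket, project_root)
    shutil.rmtree(target, ignore_errors=True)  # no error if it does not exist
    return target
```

Checking the prefix first means a path outside the bucket raises an error instead of being passed to a delete command unchanged.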

In [15]:
# Create filenames that will be used later when saving predictions 

current_datetime = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
current_user = os.environ['JUPYTERHUB_SERVICE_PREFIX']
current_user = current_user.split("/")[2]  
unique_identifier = str(uuid.uuid4())[:8]  # Generate a unique identifier (first 8 characters)


In [16]:
print("**Filename Information:**")
print(f"- Current Date and Time: {current_datetime}")
print(f"- Current User: {current_user}")
print(f"- Unique Identifier: {unique_identifier}")

**Filename Information:**
- Current Date and Time: 20240425_062613
- Current User: st083972
- Unique Identifier: 9de7d825


### 2.1. Load or preprocess data - `raw` stage


This section checks the `PROC_DS` flag.

If True, it reads the raw CSV data from S3, parses timestamps, flags payment events that fall within a specific time window, and selects the relevant columns. Events outside the window are kept; they simply get a flag of 0.

The processed data is then saved as a parquet file in S3. The `unpersist()` call that follows only has an effect if the DataFrame was cached.

Finally, it reads the processed data from the parquet file and displays a few rows.
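The flagging rule can be sketched in plain Python, independent of Spark. The function name `payment_event_flag` and the constant `PAYMENT_MARKERS` are illustrative; the two marker strings are the ones the notebook matches with `like('%...%')`, which amounts to a substring test here because the patterns contain no other wildcards.

```python
import datetime

# Event-name fragments that identify a payment
# ("My fines/Payment/Completed payment" and "My fines/Payment/Payment accepted")
PAYMENT_MARKERS = (
    "Мои штрафы/Оплата/Завершили оплату",
    "Мои штрафы/Оплата/Платёж принят",
)


def payment_event_flag(event_name, event_datetime, window_start, window_end):
    """Return 1 for a payment event inside the window, otherwise 0."""
    is_payment = any(marker in event_name for marker in PAYMENT_MARKERS)
    # Both ends are inclusive, matching Spark's Column.between
    in_window = window_start <= event_datetime <= window_end
    return int(is_payment and in_window)


start = datetime.datetime(2024, 3, 21, 0, 0, 0)
end = datetime.datetime(2024, 4, 18, 23, 59, 59)
print(payment_event_flag(PAYMENT_MARKERS[1], datetime.datetime(2024, 4, 1), start, end))  # 1
print(payment_event_flag(PAYMENT_MARKERS[1], datetime.datetime(2024, 1, 1), start, end))  # 0
```

A payment event outside the window gets a flag of 0, which is why the negative class also contains older payments.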



## <span style="color: red;">Ensure dates used are as follows</span>

flag_min_datetime = datetime.datetime(2024, 3, 21, 0, 0, 0)

flag_max_datetime = datetime.datetime(2024, 4, 18, 23, 59, 59)

In [17]:
%%time
# Flag window; defined outside the `if` so it exists whether or not PROC_DS is set
flag_min_datetime = datetime.datetime(2024, 3, 21, 0, 0, 0)
flag_max_datetime = datetime.datetime(2024, 4, 18, 23, 59, 59)
print(flag_min_datetime, flag_max_datetime)

if PROC_DS:
    sdf = spark.read.option('escape','"').csv(f's3a://{BUCKET}/{files_mask}', header=True)
    sdf = sdf.withColumn('event_datetime', F.to_timestamp("event_datetime"))
    # Payment events: "My fines/Payment/Completed payment" or
    # "My fines/Payment/Payment accepted", inside the flag window
    sdf = sdf.withColumn(
        'payment_event_flag', 
        (
            (F.col('event_name').like('%Мои штрафы/Оплата/Завершили оплату%') | 
            F.col('event_name').like('%Мои штрафы/Оплата/Платёж принят%')) &
            F.col('event_datetime').between(flag_min_datetime, flag_max_datetime)
        ).cast("int")
    )
    sdf = sdf.select(
        'profile_id',
        'event_datetime',
        'payment_event_flag',
        'event_name'
    )
    sdf.repartition(1).write.parquet(file_path_ds)
    sdf.unpersist()  # no effect unless sdf was cached
sdf = spark.read.parquet(file_path_ds)
sdf.limit(5).toPandas()

2024-03-21 00:00:00 2024-04-18 23:59:59
CPU times: user 269 ms, sys: 80.3 ms, total: 350 ms
Wall time: 20min 55s


In [18]:
sdf.groupBy('payment_event_flag').count().show()

+------------------+---------+
|payment_event_flag|    count|
+------------------+---------+
|                 0|755277302|
|                 1|    45742|
+------------------+---------+



In [19]:
# Init Weights and Biases to begin storing data 

wandb.init(project="ray-diploma", config={
    "version": VER,
    "proc_ds": PROC_DS,
    "proc_lags": PROC_LAGS,
    "proc_vecs": PROC_VECS,
    "frac_0": FRAC_0,
    "min_datetime": flag_min_datetime,
    "max_datetime": flag_max_datetime,
    "current_user": current_user, 
    "uuid": unique_identifier,
    # ... other common parameters ...
}, 
           mode="online", 
           dir="/home/jovyan/")

In [20]:
settings = wandb.Settings()
print(settings) 



In [21]:
# Weights and Biases logging here:

counts = sdf.groupBy('payment_event_flag').count().toPandas()
wandb.log({"data_count": counts})
print("**After Initial Data Loading**")
print(counts)

**After Initial Data Loading**
   payment_event_flag      count
0                   0  755277302
1                   1      45742


### 2.2. Load or preprocess data - `lags` stage


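The `dataset_lags` function defines window specifications over weekly intervals looking back from each event (1 to 2 weeks back, 2 to 3 weeks back, and so on up to 26 to 27 weeks back). A 10-minutes-to-1-week window is also defined, but its column is commented out.

It then uses these windows to collect the list of event names within each interval for each profile, creating the lag features.

If `PROC_LAGS` is True, the cell samples the data by `payment_event_flag` (all positive examples, a fraction `FRAC_0` of negative examples), builds the lag features, keeps only events inside the flag window, and drops rows whose lag lists are all empty.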

The processed data with lag features is saved as a parquet file and unloaded from memory.

Finally, it reads the data with lags and displays the count of positive and negative examples.

## Lags new implementation 

## <span style="color: red;">Do NOT change</span>


In [22]:
def dataset_lags(sdf, shift=0):
    hour = 60 * 60
    day = 24 * 60 * 60

    w_10min_to_1week = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-7 * day + shift, -10 * 60 + shift))
    w_1week_to_2weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-14 * day + shift, -7 * day + shift))
    w_2weeks_to_3weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-21 * day + shift, -14 * day + shift))
    w_3weeks_to_4weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-28 * day + shift, -21 * day + shift))
    w_4weeks_to_5weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-35 * day + shift, -28 * day + shift))
    w_5weeks_to_6weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-42 * day + shift, -35 * day + shift))
    w_6weeks_to_7weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-49 * day + shift, -42 * day + shift))
    w_7weeks_to_8weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-56 * day + shift, -49 * day + shift))
    w_8weeks_to_9weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-63 * day + shift, -56 * day + shift))
    w_9weeks_to_10weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-70 * day + shift, -63 * day + shift))
    w_10weeks_to_11weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-77 * day + shift, -70 * day + shift))
    w_11weeks_to_12weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-84 * day + shift, -77 * day + shift))
    w_12weeks_to_13weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-91 * day + shift, -84 * day + shift))
    w_13weeks_to_14weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-98 * day + shift, -91 * day + shift))   
    w_14weeks_to_15weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-105 * day + shift, -98 * day + shift))
    w_15weeks_to_16weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-112 * day + shift, -105 * day + shift))
    w_16weeks_to_17weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-119 * day + shift, -112 * day + shift))
    w_17weeks_to_18weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-126 * day + shift, -119 * day + shift))
    w_18weeks_to_19weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-133 * day + shift, -126 * day + shift))
    w_19weeks_to_20weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-140 * day + shift, -133 * day + shift))
    w_20weeks_to_21weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-147 * day + shift, -140 * day + shift))
    w_21weeks_to_22weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-154 * day + shift, -147 * day + shift))
    w_22weeks_to_23weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-161 * day + shift, -154 * day + shift))
    w_23weeks_to_24weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-168 * day + shift, -161 * day + shift))
    w_24weeks_to_25weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-175 * day + shift, -168 * day + shift))
    w_25weeks_to_26weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-182 * day + shift, -175 * day + shift)) 
    w_26weeks_to_27weeks = (Window()
          .partitionBy(F.col('profile_id'))
          .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
          .rangeBetween(-189 * day + shift, -182 * day + shift)) 
    
    return (
        sdf
            #.withColumn('lag_10min_to_1week', F.collect_list('event_name').over(w_10min_to_1week))
            .withColumn('lag_1week_to_2weeks', F.collect_list('event_name').over(w_1week_to_2weeks))
            .withColumn('lag_2weeks_to_3weeks', F.collect_list('event_name').over(w_2weeks_to_3weeks))
            .withColumn('lag_3weeks_to_4weeks', F.collect_list('event_name').over(w_3weeks_to_4weeks))
            .withColumn('lag_4weeks_to_5weeks', F.collect_list('event_name').over(w_4weeks_to_5weeks))
            .withColumn('lag_5weeks_to_6weeks', F.collect_list('event_name').over(w_5weeks_to_6weeks))
            .withColumn('lag_6weeks_to_7weeks', F.collect_list('event_name').over(w_6weeks_to_7weeks))
            .withColumn('lag_7weeks_to_8weeks', F.collect_list('event_name').over(w_7weeks_to_8weeks))
            .withColumn('lag_8weeks_to_9weeks', F.collect_list('event_name').over(w_8weeks_to_9weeks))
            .withColumn('lag_9weeks_to_10weeks', F.collect_list('event_name').over(w_9weeks_to_10weeks))
            .withColumn('lag_10weeks_to_11weeks', F.collect_list('event_name').over(w_10weeks_to_11weeks))
            .withColumn('lag_11weeks_to_12weeks', F.collect_list('event_name').over(w_11weeks_to_12weeks))
            .withColumn('lag_12weeks_to_13weeks', F.collect_list('event_name').over(w_12weeks_to_13weeks))
            .withColumn('lag_13weeks_to_14weeks', F.collect_list('event_name').over(w_13weeks_to_14weeks))
            .withColumn('lag_14weeks_to_15weeks', F.collect_list('event_name').over(w_14weeks_to_15weeks))
            .withColumn('lag_15weeks_to_16weeks', F.collect_list('event_name').over(w_15weeks_to_16weeks))
            .withColumn('lag_16weeks_to_17weeks', F.collect_list('event_name').over(w_16weeks_to_17weeks))
            .withColumn('lag_17weeks_to_18weeks', F.collect_list('event_name').over(w_17weeks_to_18weeks))
            .withColumn('lag_18weeks_to_19weeks', F.collect_list('event_name').over(w_18weeks_to_19weeks))
            .withColumn('lag_19weeks_to_20weeks', F.collect_list('event_name').over(w_19weeks_to_20weeks))
            .withColumn('lag_20weeks_to_21weeks', F.collect_list('event_name').over(w_20weeks_to_21weeks))
            .withColumn('lag_21weeks_to_22weeks', F.collect_list('event_name').over(w_21weeks_to_22weeks))
            .withColumn('lag_22weeks_to_23weeks', F.collect_list('event_name').over(w_22weeks_to_23weeks))
            .withColumn('lag_23weeks_to_24weeks', F.collect_list('event_name').over(w_23weeks_to_24weeks))
            .withColumn('lag_24weeks_to_25weeks', F.collect_list('event_name').over(w_24weeks_to_25weeks))
            .withColumn('lag_25weeks_to_26weeks', F.collect_list('event_name').over(w_25weeks_to_26weeks))
            .withColumn('lag_26weeks_to_27weeks', F.collect_list('event_name').over(w_26weeks_to_27weeks))
            .select(
                'profile_id',
                'event_datetime',
                'payment_event_flag',
                'event_name',
                #'lag_10min_to_1week',
                'lag_1week_to_2weeks',
                'lag_2weeks_to_3weeks',
                'lag_3weeks_to_4weeks',
                'lag_4weeks_to_5weeks',
                'lag_5weeks_to_6weeks',
                'lag_6weeks_to_7weeks',
                'lag_7weeks_to_8weeks',
                'lag_8weeks_to_9weeks',
                'lag_9weeks_to_10weeks',
                'lag_10weeks_to_11weeks',
                'lag_11weeks_to_12weeks',
                'lag_12weeks_to_13weeks',
                'lag_13weeks_to_14weeks',
                'lag_14weeks_to_15weeks',
                'lag_15weeks_to_16weeks',
                'lag_16weeks_to_17weeks',
                'lag_17weeks_to_18weeks',
                'lag_18weeks_to_19weeks',
                'lag_19weeks_to_20weeks',
                'lag_20weeks_to_21weeks',
                'lag_21weeks_to_22weeks',
                'lag_22weeks_to_23weeks',
                'lag_23weeks_to_24weeks',
                'lag_24weeks_to_25weeks',
                'lag_25weeks_to_26weeks',
                'lag_26weeks_to_27weeks'
            )
        .orderBy(F.col('event_datetime'), ascending=False)
    )
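The 26 weekly windows above follow one pattern, so their bounds can be generated in a loop. The sketch below is an illustration only, not a replacement for the cell above: `weekly_lag_bounds` and `add_lag_columns` are hypothetical names, and `add_lag_columns` needs an active Spark session, which is why its imports sit inside the function. It leaves out the final `select` and `orderBy`.

```python
DAY = 24 * 60 * 60  # seconds per day


def weekly_lag_bounds(first_week=1, last_week=26, shift=0):
    """Map each lag column name to its (start, end) offsets in seconds."""
    bounds = {}
    for week in range(first_week, last_week + 1):
        unit = "week" if week == 1 else "weeks"
        name = f"lag_{week}{unit}_to_{week + 1}weeks"
        bounds[name] = (-(week + 1) * 7 * DAY + shift, -week * 7 * DAY + shift)
    return bounds


def add_lag_columns(sdf, bounds):
    """Add one collect_list column per window (requires an active Spark session)."""
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    order_col = F.col("event_datetime").cast("timestamp").cast("long")
    for name, (start, end) in bounds.items():
        window = (Window.partitionBy("profile_id")
                  .orderBy(order_col)
                  .rangeBetween(start, end))
        sdf = sdf.withColumn(name, F.collect_list("event_name").over(window))
    return sdf


print(weekly_lag_bounds()["lag_1week_to_2weeks"])  # (-1209600, -604800)
```

Because `rangeBetween` includes both bounds, an event exactly 7 days (to the second) before the current one falls into two adjacent windows. That is also true of the hand-written version.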

## -- New implementation of time lag windows -- 

Added the following code:

* `clean_parquet(file_path_lags)`
* `dates = (flag_min_datetime, flag_max_datetime)`

1. `clean_parquet(file_path_lags)`:

This line of code is used to clean or remove any existing parquet files at the specified file_path_lags location before writing new data.

By calling `clean_parquet(file_path_lags)` before writing the new data with time lag windows, it ensures that any previous data stored at the same location is removed, preventing any conflicts or data inconsistencies.

**Comment this call out when you want to reuse the existing data for training and grid searching models. Note that in the cell below it is active inside the `if PROC_LAGS:` block, so it runs whenever lags are recomputed.**

2. `dates = (flag_min_datetime, flag_max_datetime)`:

This line creates a tuple named `dates` that contains two datetime values: `flag_min_datetime` and `flag_max_datetime`.

The tuple represents the date range used to filter the data. Both values are defined earlier in the notebook (section 2.1), in the following form (the dates here are only an example):

`flag_min_datetime = datetime.datetime(2023, 8, 1, 0, 0, 0)`

`flag_max_datetime = datetime.datetime(2023, 8, 31, 23, 59, 59)`

`print(flag_min_datetime, flag_max_datetime)`

After creating the dates tuple, the code uses it to filter the sdf DataFrame based on the event_datetime column. 

The asterisk (*) before dates is used to unpack the tuple and pass the individual datetime values as arguments to the between function.
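The unpacking step can be seen in isolation with a small stand-in. The function `between` below is a plain-Python substitute for Spark's `Column.between`, introduced only for this illustration.

```python
import datetime


def between(value, lower, upper):
    """Inclusive range check, standing in for Column.between."""
    return lower <= value <= upper


dates = (datetime.datetime(2024, 3, 21, 0, 0, 0),
         datetime.datetime(2024, 4, 18, 23, 59, 59))
event = datetime.datetime(2024, 4, 1, 12, 0, 0)

# *dates expands the tuple into two positional arguments,
# so these two calls are equivalent
print(between(event, *dates))
print(between(event, dates[0], dates[1]))
```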


In [23]:
# Old version before VG update to filter for dates



if PROC_LAGS:
    sdf = sdf.sampleBy(
        'payment_event_flag', 
        fractions={0: FRAC_0, 1: 1}, 
        seed=2023
    )
    sdf = dataset_lags(sdf)
    dates  = (flag_min_datetime, flag_max_datetime)
    sdf = sdf.filter(sdf.event_datetime.between(*dates))
    sdf = sdf.filter(
        #(F.size('lag_10min_to_1week')      > 0) |
        (F.size('lag_1week_to_2weeks')     > 0) |
        (F.size('lag_2weeks_to_3weeks')    > 0) |
        (F.size('lag_3weeks_to_4weeks')    > 0) |
        (F.size('lag_4weeks_to_5weeks')    > 0) |
        (F.size('lag_5weeks_to_6weeks')    > 0) |
        (F.size('lag_6weeks_to_7weeks')    > 0) |
        (F.size('lag_7weeks_to_8weeks')    > 0) |
        (F.size('lag_8weeks_to_9weeks')    > 0) |
        (F.size('lag_9weeks_to_10weeks')   > 0) |
        (F.size('lag_10weeks_to_11weeks')  > 0) |
        (F.size('lag_11weeks_to_12weeks')  > 0) |
        (F.size('lag_12weeks_to_13weeks')  > 0) |
        (F.size('lag_13weeks_to_14weeks')  > 0) |
        (F.size('lag_14weeks_to_15weeks')  > 0) |
        (F.size('lag_15weeks_to_16weeks')  > 0) |
        (F.size('lag_16weeks_to_17weeks')  > 0) |
        (F.size('lag_17weeks_to_18weeks')  > 0) |
        (F.size('lag_18weeks_to_19weeks')  > 0) |
        (F.size('lag_19weeks_to_20weeks')  > 0) |
        (F.size('lag_20weeks_to_21weeks')  > 0) |
        (F.size('lag_21weeks_to_22weeks')  > 0) |
        (F.size('lag_22weeks_to_23weeks')  > 0) |
        (F.size('lag_23weeks_to_24weeks')  > 0) |
        (F.size('lag_24weeks_to_25weeks')  > 0) |
        (F.size('lag_25weeks_to_26weeks')  > 0) |
        (F.size('lag_26weeks_to_27weeks')  > 0) 
    )
    clean_parquet(file_path_lags)
    sdf.repartition(8).write.parquet(file_path_lags)
    sdf.unpersist()
sdf = spark.read.parquet(file_path_lags)


In [24]:
# Weights and Biases logging here:
counts = sdf.groupBy('payment_event_flag').count().toPandas()
wandb.log({"data_count_with_lags": counts})

## <span style="color: red;">Double check your dataset has been reduced in the following cell</span>


In [25]:
print("**Dataset Size After Lag Feature Creation**")
print(counts)

**Dataset Size After Lag Feature Creation**
   payment_event_flag  count
0                   1  23083
1                   0  19471
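A short arithmetic check on the counts printed in this notebook shows how much the sampling and filtering change the class balance. The helper `positive_share` is a hypothetical name; the numbers are the ones shown in the outputs above.

```python
def positive_share(n_pos, n_neg):
    """Fraction of rows that carry payment_event_flag = 1."""
    return n_pos / (n_pos + n_neg)


# Raw stage: 45,742 positives against 755,277,302 negatives
raw_share = positive_share(45_742, 755_277_302)

# Lags stage: 23,083 positives against 19,471 negatives
lag_share = positive_share(23_083, 19_471)

# Negatives expected right after sampleBy with FRAC_0 = 0.001,
# before the date filter and the non-empty-lag filter are applied
expected_sampled_negatives = 755_277_302 * 0.001

print(f"raw positive share:  {raw_share:.6%}")
print(f"lags positive share: {lag_share:.2%}")
print(f"negatives expected after sampling: {expected_sampled_negatives:,.0f}")
```

The positive share moves from roughly 0.006% to about 54%. `sampleBy` draws each row independently, so the sampled negative count is approximate, and the later filters reduce it further to the 19,471 shown above.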


In [26]:
sdf.printSchema()

root
 |-- profile_id: string (nullable = true)
 |-- event_datetime: timestamp (nullable = true)
 |-- payment_event_flag: integer (nullable = true)
 |-- event_name: string (nullable = true)
 |-- lag_1week_to_2weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_2weeks_to_3weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_3weeks_to_4weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_4weeks_to_5weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_5weeks_to_6weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_6weeks_to_7weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_7weeks_to_8weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_8weeks_to_9weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_9weeks_to_1

### Train test split process 

In [27]:
# Define train test split function 

def stratified_split(sdf, frac, label, seed=2023):
    zeros = sdf.filter(sdf[label] == 0)
    ones = sdf.filter(sdf[label] == 1)
    train_, test_ = zeros.randomSplit([1 - frac, frac], seed=seed)
    train, test = ones.randomSplit([1 - frac, frac], seed=seed)
    train = train.union(train_)
    test = test.union(test_)
    return train, test

In [28]:
# Conduct train test split 

sdf_train, sdf_test = stratified_split(
    sdf, 
    frac=.2, # Size of the test dataset
    label='payment_event_flag',
    seed=2023
)

## <span style="color: red;">Check your data has been split and classes are approx equal</span>


In [29]:
sdf_train.groupBy('payment_event_flag').count().toPandas()

   payment_event_flag  count
0                   1  18435
1                   0  15552


In [30]:
sdf_test.groupBy('payment_event_flag').count().toPandas()

   payment_event_flag  count
0                   1   4648
1                   0   3919


In [31]:
# Weights and Biases logging here:

train_counts = sdf_train.groupBy('payment_event_flag').count().toPandas()
wandb.log({"train_data_count": train_counts})

test_counts = sdf_test.groupBy('payment_event_flag').count().toPandas()
wandb.log({"test_data_count": test_counts})

## <span style="color: red;">Ensure date ranges used are the ones you set earlier</span>


In [32]:
from pyspark.sql.functions import min, max  # these shadow Python's built-in min/max in later cells

# Find the minimum and maximum dates
min_date = sdf.agg(min("event_datetime")).collect()[0][0]
max_date = sdf.agg(max("event_datetime")).collect()[0][0]

print(f"Minimum Date: {min_date}")
print(f"Maximum Date: {max_date}")

Minimum Date: 2024-03-21 00:00:17
Maximum Date: 2024-04-18 23:59:23


In [33]:
# Training set
train_min_date = sdf_train.agg(min("event_datetime")).collect()[0][0]
train_max_date = sdf_train.agg(max("event_datetime")).collect()[0][0]

print(f"Training Set Minimum Date: {train_min_date}")
print(f"Training Set Maximum Date: {train_max_date}")


Training Set Minimum Date: 2024-03-21 00:00:17
Training Set Maximum Date: 2024-04-18 23:59:23


In [34]:
# Test set
test_min_date = sdf_test.agg(min("event_datetime")).collect()[0][0]
test_max_date = sdf_test.agg(max("event_datetime")).collect()[0][0]

print(f"Test Set Minimum Date: {test_min_date}")
print(f"Test Set Maximum Date: {test_max_date}")

Test Set Minimum Date: 2024-03-21 00:01:28
Test Set Maximum Date: 2024-04-18 23:57:47
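
The visual comparison of the printed dates can also be turned into an explicit check. This is a minimal sketch: `EXPECTED_START` and `EXPECTED_END` are placeholder values assumed to match the full-dataset range printed above, and `within_window` is a hypothetical helper, not a function from the notebook.

```python
from datetime import datetime

# Assumed window; replace with the date range configured earlier in the notebook.
EXPECTED_START = datetime(2024, 3, 21)
EXPECTED_END = datetime(2024, 4, 19)  # exclusive upper bound

def within_window(min_ts, max_ts, start=EXPECTED_START, end=EXPECTED_END):
    """True when [min_ts, max_ts] lies inside [start, end)."""
    return start <= min_ts and max_ts < end

fmt = "%Y-%m-%d %H:%M:%S"
# Minimum and maximum event_datetime values printed for each split.
splits = {
    "train": ("2024-03-21 00:00:17", "2024-04-18 23:59:23"),
    "test": ("2024-03-21 00:01:28", "2024-04-18 23:57:47"),
}
results = {
    name: within_window(datetime.strptime(lo, fmt), datetime.strptime(hi, fmt))
    for name, (lo, hi) in splits.items()
}
print(results)
```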


### 2.3. Load or preprocess data - `vectorize` stage

This section defines a list of lag features to be used.

The `datasets_tfidf` function performs `TF-IDF` vectorization on the lag features for both training and test datasets.

It uses `HashingTF` to convert lists of event names into numerical feature vectors and `IDF` to rescale the features based on their document frequency.

The function also creates a dictionary mapping feature indices to the corresponding event names.
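
The hashing and rescaling steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark implementation: CRC32 stands in for the MurmurHash3 function that `HashingTF` uses (so bucket numbers will differ from Spark's), the event names are made up, and the IDF weight follows the smoothed formula documented for Spark's `IDF`, `log((m + 1) / (df + 1))`.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 10  # plays the role of numFeatures in HashingTF

def bucket(term, num_features=NUM_FEATURES):
    # Stand-in hash (CRC32); Spark's HashingTF uses MurmurHash3 instead.
    return zlib.crc32(term.encode("utf-8")) % num_features

def term_frequencies(events, num_features=NUM_FEATURES):
    """Count events per hash bucket (colliding events share a bucket)."""
    return Counter(bucket(e, num_features) for e in events)

def idf_weights(docs_tf, min_doc_freq=0):
    """Smoothed IDF per bucket; buckets below min_doc_freq get weight 0."""
    m = len(docs_tf)
    df = Counter(b for tf in docs_tf for b in tf)
    return {
        b: (math.log((m + 1) / (n + 1)) if n >= min_doc_freq else 0.0)
        for b, n in df.items()
    }

# Hypothetical event lists for three users in one lag window.
docs = [["open_app", "pay"], ["open_app"], ["open_app", "search", "pay"]]
docs_tf = [term_frequencies(d) for d in docs]
idf = idf_weights(docs_tf)
tfidf = [{b: tf[b] * idf[b] for b in tf} for tf in docs_tf]

# Reverse lookup, similar in spirit to features_dict: bucket -> events hashed into it.
bucket_to_events = {}
for event in sorted({e for d in docs for e in d}):
    bucket_to_events.setdefault(bucket(event), []).append(event)

print(tfidf)
print(bucket_to_events)
```

An event that appears in every row, such as `open_app` here, gets an IDF of zero and therefore carries no weight. Because several events can land in the same bucket, the reverse lookup maps each bucket to a list rather than a single name.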

In [35]:
lags = [
#    'lag_10min_to_1week',
    'lag_1week_to_2weeks',
    'lag_2weeks_to_3weeks',
    'lag_3weeks_to_4weeks',
    'lag_4weeks_to_5weeks',
    'lag_5weeks_to_6weeks',
    'lag_6weeks_to_7weeks',
    'lag_7weeks_to_8weeks',
    'lag_8weeks_to_9weeks',
    'lag_9weeks_to_10weeks',
    'lag_10weeks_to_11weeks',
    'lag_11weeks_to_12weeks',
    'lag_12weeks_to_13weeks',
    'lag_13weeks_to_14weeks',
    'lag_14weeks_to_15weeks',
    'lag_15weeks_to_16weeks',
    'lag_16weeks_to_17weeks',
    'lag_17weeks_to_18weeks',
    'lag_18weeks_to_19weeks',
    'lag_19weeks_to_20weeks',
    'lag_20weeks_to_21weeks',
    'lag_21weeks_to_22weeks',
    'lag_22weeks_to_23weeks',
    'lag_23weeks_to_24weeks',
    'lag_24weeks_to_25weeks',
    'lag_25weeks_to_26weeks',
    'lag_26weeks_to_27weeks'
]

## TF-IDF implementation 

In [36]:
def datasets_tfidf(sdf_train, sdf_test, lags, min_freq=3, num_features=10):
    features_dict = {}
    count = 0
    for lag in tqdm(lags):
        hashingTF = HashingTF(
            inputCol=lag, 
            outputCol=lag + '_tf', 
            numFeatures=num_features
        )
        featurizedData = hashingTF.transform(sdf_train)
        idf = IDF(
            inputCol=lag + '_tf', 
            outputCol=lag + '_tfidf',
            minDocFreq=min_freq  
        )
        idfModel = idf.fit(featurizedData)
        sdf_train = idfModel.transform(featurizedData)
        sdf_test = idfModel.transform(
            hashingTF.transform(sdf_test)
        )
        events = [
            x
            for xs in sdf_train.select(lag).distinct().rdd.flatMap(lambda x: x).collect()
            for x in xs
        ]
        hash_dict = {}
        for e in events:
            hash_dict[lag + '_' + e] = hashingTF.indexOf(e)
        for feat_num in range(num_features):
            tmp_list = []
            for k, v in hash_dict.items():
                if v == feat_num: tmp_list.append(k)
            features_dict[count * num_features + feat_num] = tmp_list
        count += 1
    return sdf_train, sdf_test, features_dict

## Breakdown of PROC_VECS code: 


**Conditional Execution:**

* `if PROC_VECS:`: The code within this block runs only when the `PROC_VECS` flag is `True`. The flag controls whether the TF-IDF vectorization and the Parquet export in this block are performed.

**TF-IDF Vectorization:**

`sdf_train, sdf_test, features_dict = datasets_tfidf(...)`: This line calls the `datasets_tfidf` function, which performs TF-IDF vectorization on the lag features in the `sdf_train` and `sdf_test` DataFrames.
* The `lags` argument provides the list of lag feature column names to be vectorized.
* The `num_features=100` argument sets the dimensionality (number of hash buckets) of each TF-IDF vector.
* The `min_freq=3` argument is passed to `IDF` as `minDocFreq`, so buckets that occur in fewer than 3 rows receive a weight of zero.

**The function returns three values:**

* `sdf_train`: The training DataFrame with the added TF-IDF vector columns.
* `sdf_test`: The test DataFrame with the added TF-IDF vector columns.
* `features_dict`: A dictionary mapping each global feature index to the list of lag-prefixed event names that hash to it.

**Cleaning and Saving Parquet Files:**

`clean_parquet(file_path_trn)`: This line calls a function (not shown) to clean up any existing Parquet files at the specified path (file_path_trn) before saving the new data.

`sdf_train.repartition(8).write.parquet(file_path_trn)`: The training DataFrame (sdf_train) is repartitioned into 8 partitions for optimized writing.

* The write.parquet method saves the DataFrame as a Parquet file at the specified path (file_path_trn).
* The same process is repeated for the test DataFrame (sdf_test) using file_path_tst.

**Unpersisting DataFrames:**

* `sdf_train.unpersist(), sdf_test.unpersist()`: These lines remove the DataFrames from Spark's memory. Since the data has been saved to disk, it can be reloaded later if needed, freeing up memory for subsequent processing.

**Reloading DataFrames:**

* `sdf_train = spark.read.parquet(file_path_trn)`: This line reloads the training data from the saved Parquet file. It sits outside the `if PROC_VECS:` block, so it runs whether or not the flag is set.

The same is done for the test data using file_path_tst.
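
The save-then-reload flow described above follows a common "rebuild if requested, then always read from disk" pattern. The sketch below shows that pattern in isolation. It is an analogy under loud assumptions: JSON files stand in for Parquet, and `load_or_build` and `expensive_build` are hypothetical names that do not exist in the notebook.

```python
import json
import os
import tempfile

def load_or_build(path, build, rebuild):
    """Rebuild and save when `rebuild` is True, then always read from disk."""
    if rebuild:
        data = build()
        with open(path, "w", encoding="utf-8") as fh:  # overwrite any stale output
            json.dump(data, fh)
    with open(path, encoding="utf-8") as fh:  # unconditional reload
        return json.load(fh)

calls = []

def expensive_build():
    calls.append("build")  # track how often the expensive step runs
    return {"rows": 3}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.json")
    first = load_or_build(path, expensive_build, rebuild=True)    # computes and saves
    second = load_or_build(path, expensive_build, rebuild=False)  # reads the saved copy

print(first, second, calls)
```

The second call returns the same data without running the expensive step again, which is the saving the flag is meant to provide.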

In [37]:
# NEW version with clean_parquet (TF-IDF)

if PROC_VECS:
    sdf_train, sdf_test, features_dict = datasets_tfidf(
        sdf_train,
        sdf_test,
        lags,
        #vec_size=10,
        min_freq=3,
        num_features=100
    )
    clean_parquet(file_path_trn)
    sdf_train.repartition(8).write.parquet(file_path_trn)
    clean_parquet(file_path_tst)
    sdf_test.repartition(8).write.parquet(file_path_tst)
    sdf_train.unpersist()
    sdf_test.unpersist()
sdf_train = spark.read.parquet(file_path_trn)
sdf_test = spark.read.parquet(file_path_tst)


if not PROC_VECS:
    sdf_train, sdf_test, features_dict = datasets_tfidf(
        sdf_train, 
        sdf_test, 
        lags, 
        min_freq=3,
        num_features=100
    )
    print(len(features_dict.items()))


  0%|          | 0/26 [00:00<?, ?it/s]

In [38]:
# Check data size after reloading
print("**Training Set After TF-IDF**")
sdf_train.groupBy('payment_event_flag').count().show()
print("**Testing Set After TF-IDF**")
sdf_test.groupBy('payment_event_flag').count().show()

**Training Set After TF-IDF**
+------------------+-----+
|payment_event_flag|count|
+------------------+-----+
|                 1|18435|
|                 0|15552|
+------------------+-----+

**Testing Set After TF-IDF**
+------------------+-----+
|payment_event_flag|count|
+------------------+-----+
|                 1| 4648|
|                 0| 3919|
+------------------+-----+



In [39]:
sdf_train.printSchema()

root
 |-- profile_id: string (nullable = true)
 |-- event_datetime: timestamp (nullable = true)
 |-- payment_event_flag: integer (nullable = true)
 |-- event_name: string (nullable = true)
 |-- lag_1week_to_2weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_2weeks_to_3weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_3weeks_to_4weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_4weeks_to_5weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_5weeks_to_6weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_6weeks_to_7weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_7weeks_to_8weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_8weeks_to_9weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_9weeks_to_1

## -- end TF-IDF --

## Word2Vec implementation

#### <span style="color: red;">We are not currently using word2vec</span>


In [40]:
"""from pyspark.ml.feature import Word2Vec

def datasets_vecorized(sdf_train, sdf_test, lags, vec_size=10):
    vectorizers = []
    for lag in tqdm(lags):
        word2Vec = Word2Vec(
            vectorSize=vec_size,
            minCount=0,
            inputCol=lag,
            outputCol=lag + '_vec'
        )
        vectorizer = word2Vec.fit(sdf_train)
        sdf_train = vectorizer.transform(sdf_train)
        sdf_test = vectorizer.transform(sdf_test)
        vectorizers.append(vectorizer)
    return sdf_train, sdf_test, vectorizers"""

## -- End word2vec -- 

## 3. Model

### 3.1. Features assembling

The `features_assembled` function prepares the data for model training:

* It selects the `TF-IDF` features and the target variable (`payment_event_flag`).
* It uses `VectorAssembler` to combine the `TF-IDF` features into a single vector column named `features`.
* It returns a DataFrame with the target variable and the assembled feature vector.

The `upsampled` function can be used to address class imbalance:

* It separates the data into positive and negative examples.
* It duplicates the positive examples according to the `UPSAMPLE` setting: `'max'` repeats them `int(negatives / positives)` times, and an integer value repeats them that many times.

The code then:
* Defines a list of lag features to be used.
* Assembles features for both training and test sets.
* Optionally performs **upsampling** on the training set (and potentially the test set) if `UPSAMPLE` is enabled.
* Displays the class distribution after upsampling.

In [40]:
def features_assembled(sdf, feats):
    cols_to_model = [x + '_tfidf' for x in feats]
    cols_to_model.extend(['payment_event_flag'])
    print('columns to model:', cols_to_model)
    vecAssembler = VectorAssembler(
        inputCols=[c for c in cols_to_model if c != 'payment_event_flag'], 
        outputCol='features'
    )
    features = sdf.select(cols_to_model)
    features_vec = vecAssembler.transform(features)
    features_data = features_vec.select('payment_event_flag', 'features')
    return features_data

def upsampled(sdf, label, upsample='max'):
    zeros = sdf.filter(sdf[label] == 0)
    ones = sdf.filter(sdf[label] == 1)
    res = zeros.union(ones)
    if upsample == 'max':
        up_count = int(zeros.count() / ones.count())
        for _ in range(up_count - 1):
            res = res.union(ones)
    else:
        for _ in range(upsample - 1):
            res = res.union(ones)
    return res
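
The effect of `upsampled` on row counts can be worked out without Spark. The sketch below mirrors the counting logic of the function above; `upsampled_counts` is a hypothetical helper introduced only for this illustration, and the imbalanced example uses made-up numbers.

```python
def upsampled_counts(n_zeros, n_ones, upsample="max"):
    """Row counts that upsampled() would produce, computed without Spark."""
    if upsample == "max":
        copies = int(n_zeros / n_ones)  # integer part of the class ratio
    else:
        copies = upsample               # explicit integer number of copies
    extra = max(copies - 1, 0)          # range(copies - 1) is empty for copies <= 1
    return n_zeros, n_ones * (1 + extra)

# Training counts from the split: positives are already the majority.
print(upsampled_counts(15552, 18435))

# Hypothetical imbalanced case: 1000 negatives, 240 positives.
print(upsampled_counts(1000, 240))
print(upsampled_counts(1000, 240, upsample=3))
```

With the training counts from this notebook the ratio truncates to 0, so `'max'` leaves the data unchanged. In the hypothetical case the positives are repeated four times (960 rows), which approaches but does not exactly reach a 1:1 ratio.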

In [41]:
# "max" strategy: UPSAMPLE = 'max' makes the upsampled function repeat the positive class (label 1)
# int(n_negative / n_positive) times, which approaches a 1:1 ratio when positives are the minority.
# If positives are already the majority, the ratio truncates to 0 and no rows are duplicated.

UPSAMPLE = 'max'  # 'max', an integer number of copies, or None to skip upsampling


In [42]:
# Setting feats, make sure to comment out (#) any lags we will not use for train/pred 

feats = [
#    'lag_10min_to_1week',
    'lag_1week_to_2weeks',
    'lag_2weeks_to_3weeks',
    'lag_3weeks_to_4weeks',
    'lag_4weeks_to_5weeks',
    'lag_5weeks_to_6weeks',
    'lag_6weeks_to_7weeks',
    'lag_7weeks_to_8weeks',
    'lag_8weeks_to_9weeks',
    'lag_9weeks_to_10weeks',
    'lag_10weeks_to_11weeks',
    'lag_11weeks_to_12weeks',
    'lag_12weeks_to_13weeks',
    'lag_13weeks_to_14weeks',
    'lag_14weeks_to_15weeks',
    'lag_15weeks_to_16weeks',
    'lag_16weeks_to_17weeks',
    'lag_17weeks_to_18weeks',
    'lag_18weeks_to_19weeks',
    'lag_19weeks_to_20weeks',
    'lag_20weeks_to_21weeks',
    'lag_21weeks_to_22weeks',
    'lag_22weeks_to_23weeks',
    'lag_23weeks_to_24weeks',
    'lag_24weeks_to_25weeks',
    'lag_25weeks_to_26weeks',
    'lag_26weeks_to_27weeks'
]



features_train = features_assembled(sdf_train, feats=feats)
features_test = features_assembled(sdf_test, feats=feats)

if UPSAMPLE:
    features_train = upsampled(
        features_train, 
        label='payment_event_flag', 
        upsample=UPSAMPLE
    )
    # Use to upsample test set
    #features_test = upsampled(
    #    features_test, 
    #    label='payment_event_flag', 
    #    upsample=UPSAMPLE
    #)

columns to model: ['lag_1week_to_2weeks_tfidf', 'lag_2weeks_to_3weeks_tfidf', 'lag_3weeks_to_4weeks_tfidf', 'lag_4weeks_to_5weeks_tfidf', 'lag_5weeks_to_6weeks_tfidf', 'lag_6weeks_to_7weeks_tfidf', 'lag_7weeks_to_8weeks_tfidf', 'lag_8weeks_to_9weeks_tfidf', 'lag_9weeks_to_10weeks_tfidf', 'lag_10weeks_to_11weeks_tfidf', 'lag_11weeks_to_12weeks_tfidf', 'lag_12weeks_to_13weeks_tfidf', 'lag_13weeks_to_14weeks_tfidf', 'lag_14weeks_to_15weeks_tfidf', 'lag_15weeks_to_16weeks_tfidf', 'lag_16weeks_to_17weeks_tfidf', 'lag_17weeks_to_18weeks_tfidf', 'lag_18weeks_to_19weeks_tfidf', 'lag_19weeks_to_20weeks_tfidf', 'lag_20weeks_to_21weeks_tfidf', 'lag_21weeks_to_22weeks_tfidf', 'lag_22weeks_to_23weeks_tfidf', 'lag_23weeks_to_24weeks_tfidf', 'lag_24weeks_to_25weeks_tfidf', 'lag_25weeks_to_26weeks_tfidf', 'lag_26weeks_to_27weeks_tfidf', 'payment_event_flag']
columns to model: ['lag_1week_to_2weeks_tfidf', 'lag_2weeks_to_3weeks_tfidf', 'lag_3weeks_to_4weeks_tfidf', 'lag_4weeks_to_5weeks_tfidf', 'lag_5w

In [43]:
features_train.groupBy('payment_event_flag').count().toPandas()

| | payment_event_flag | count |
|---|---|---|
| 0 | 0 | 15552 |
| 1 | 1 | 18435 |


In [44]:
features_test.groupBy('payment_event_flag').count().toPandas()

| | payment_event_flag | count |
|---|---|---|
| 0 | 1 | 4648 |
| 1 | 0 | 3919 |


In [45]:
features_train.limit(3).toPandas()

| | payment_event_flag | features |
|---|---|---|
| 0 | 0 | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...) |
| 1 | 0 | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...) |
| 2 | 0 | (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...) |


In [46]:
features_train.limit(3).toPandas()['features'][0]

SparseVector(2600, {38: 2.4797, 145: 4.0024, 248: 5.0819, 1945: 4.1552, 2045: 4.0587})
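
The vector length of 2600 is 26 lag columns times 100 hash buckets each, concatenated in the order of `feats`. A global index can therefore be split back into a lag column and a bucket with integer division. The sketch below assumes that layout; `decode_feature_index` is a hypothetical helper, and only the first three lag names are listed to keep it short.

```python
NUM_FEATURES = 100  # num_features passed to datasets_tfidf

# First few lag columns, in the order they are assembled.
lags_head = [
    "lag_1week_to_2weeks",
    "lag_2weeks_to_3weeks",
    "lag_3weeks_to_4weeks",
]

def decode_feature_index(index, lags, num_features=NUM_FEATURES):
    """Split a global vector index into (lag column, hash bucket)."""
    lag_pos, bucket = divmod(index, num_features)
    return lags[lag_pos], bucket

# Non-zero positions taken from the SparseVector printed above.
for idx in (38, 145, 248):
    print(idx, decode_feature_index(idx, lags_head))
```

Index 145, for example, decodes to bucket 45 of `lag_2weeks_to_3weeks`.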

### 3.2. Training and evaluating

#### <span style="color: red;">This will train and evaluate WITHOUT hyperopt</span>


A `RandomForestClassifier` is initialized with the following settings:

* `labelCol`: Specifies the target variable column ("`payment_event_flag`").
* `featuresCol`: Specifies the feature vector column ("`features`").
* `numTrees`: Sets the number of trees in the random forest to 100.
* `maxDepth`: Sets the maximum depth of each tree to 16.

* The model is trained using the fit method on the training data.
* The trained model is used to make predictions on the test data.

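The `BinaryClassificationMetrics` class is used to calculate evaluation metrics. In this notebook it receives the hard 0/1 values from the `prediction` column rather than class probabilities: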

* `areaUnderROC`: Area under the ROC curve, which measures the model's ability to distinguish between classes.
* `areaUnderPR`: Area under the Precision-Recall curve, which is more informative for imbalanced datasets.

The code also uses `classification_report` from `scikit-learn` to get a detailed report including precision, recall, F1-score, and support for each class.

## <span style="color: red;">Choose your model type from the options below</span>

Options include:

* `"LinearSVCModel"`
* `"GBTClassificationModel"`
* `"LogisticRegressionModel"`
* `"RandomForestClassificationModel"`

In [47]:
# Define model filepath to link predictions results with the actual CSV 

current_model = "RandomForestClassificationModel" # <-- place your model name here
# Options include 
## "LinearSVCModel"
## "GBTClassificationModel"
## "LogisticRegressionModel" 
## "RandomForestClassificationModel"

file_path_pred = f's3a://{BUCKET}/work/{VER}/preds_{current_model}_{current_datetime}_{current_user}_{unique_identifier}.csv'
print(f"- Prediction File Path: {file_path_pred}")

- Prediction File Path: s3a://pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a7b/work/vJ1rf/preds_RandomForestClassificationModel_20240425_062613_st083972_9de7d825.csv


## <span style="color: red;">Update code for your own model</span>


In [48]:
# RF implementation with W&B logging 
rf = RandomForestClassifier(labelCol="payment_event_flag", featuresCol="features", numTrees=100, maxDepth=16)

wandb.config.update({
    "upsample": UPSAMPLE,
    "model_type": type(rf).__name__,
    "num_trees": rf.getNumTrees(),  # call the getter; the bare attribute is a bound method
    "max_depth": rf.getMaxDepth(),
    "lags": lags,
    "num_features": len(features_dict.items()),
    "output_file": file_path_pred
})



In [49]:
%%time
model = rf.fit(features_train)



CPU times: user 15.8 ms, sys: 0 ns, total: 15.8 ms
Wall time: 16.2 s


In [50]:
predictions = model.transform(features_test)
payment_event_flag_preds = predictions.select('prediction', 'payment_event_flag')
metrics = BinaryClassificationMetrics(
    payment_event_flag_preds.rdd.map(
        lambda lines: [float(x) for x in lines]
    )
)
print('ROC AUC:', metrics.areaUnderROC)
print('Area under PR-curve:', metrics.areaUnderPR)





ROC AUC: 0.5737474466817074
Area under PR-curve: 0.583466858336571


In [51]:
# Weights and Biases logging here
wandb.log({"roc_auc": metrics.areaUnderROC})
wandb.log({"pr_auc": metrics.areaUnderPR})



In [52]:
# Calculate classification report
y_true = predictions.select('payment_event_flag').rdd.map(lambda x: x['payment_event_flag']).collect()
y_pred = predictions.select('prediction').rdd.map(lambda x: x['prediction']).collect()
report = classification_report(y_true, y_pred)
print(report)



              precision    recall  f1-score   support

           0       0.73      0.22      0.33      3919
           1       0.58      0.93      0.72      4648

    accuracy                           0.60      8567
   macro avg       0.66      0.57      0.53      8567
weighted avg       0.65      0.60      0.54      8567
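
The ROC AUC printed earlier was computed from hard 0/1 predictions, so the ROC curve has only one interior point. Assuming the area is obtained with the trapezoidal rule through (0, 0), (FPR, TPR) and (1, 1), it equals the mean of the two class recalls. The sketch below reproduces that arithmetic from the rounded recalls in the report; `auc_from_hard_predictions` is a hypothetical helper.

```python
def auc_from_hard_predictions(recall_pos, recall_neg):
    """Trapezoidal ROC area through (0, 0), (FPR, TPR), (1, 1)."""
    tpr = recall_pos
    fpr = 1.0 - recall_neg
    points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Recalls from the classification report (rounded to two decimals).
auc = auc_from_hard_predictions(recall_pos=0.93, recall_neg=0.22)
print(round(auc, 4))
```

The result, about 0.575, agrees with the reported 0.5737 up to the rounding of the recalls. A metric computed from scores or probabilities instead of hard labels uses many thresholds and is not directly comparable with this number.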



In [53]:
# Weights and Biases logging here
wandb.log({"classification_report": wandb.Html(f"<pre>{report}</pre>")})  # <pre> keeps the column alignment


In [54]:
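def dict_to_html_table(data):
    """Converts a dictionary into an HTML table."""
    html = "<table>"
    for key, value in data.items():
        # One row per dictionary entry: key in the first cell, value in the second
        html += f"<tr><td>{key}</td><td>{value}</td></tr>"
    html += "</table>"
    return html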

# Example Usage (after calculating classification report for RF):

rf_report = classification_report(y_true, y_pred, output_dict=True)  # Get RF report as a dictionary
rf_report_html = dict_to_html_table(rf_report)

wandb.log({"rf_classification_report_html": wandb.Html(rf_report_html)})  # Log as HTML

In [55]:
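# Build the plain-text report (output_dict=False returns a formatted string, not HTML)
rf_report_html = classification_report(y_true, y_pred, output_dict=False)

# Wrap the text in <pre> so W&B keeps the column alignment when rendering it as HTML
wandb.log({"rf_classification_report_html": wandb.Html(f"<pre>{rf_report_html}</pre>")})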

#### Get feature importance 

In [56]:
# Feature Importance

TH = .01 

features_imps = {}
for i, v in enumerate(model.featureImportances.toArray()):
    if v >= TH: features_imps[i] = v
features_imps = dict(sorted(features_imps.items(), key=lambda x: x[1], reverse=True))

for k, v in features_imps.items():
    print('-' * 100)
    print('feature number:', k, '| feature importance:', v)
    print('features:', features_dict[k])

# Weights and biases logging here
wandb.log({"feature_importances": wandb.Table(dataframe=pd.DataFrame.from_dict(features_imps, orient='index'))})


----------------------------------------------------------------------------------------------------
feature number: 38 | feature importance: 0.1388936524507944
features: ['lag_1week_to_2weeks_Мои штрафы/Оплата/Завершили оплату', 'lag_1week_to_2weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга', 'lag_1week_to_2weeks_Пуш Локальный/Скидочный штраф/Показан']
----------------------------------------------------------------------------------------------------
feature number: 78 | feature importance: 0.10010848244047943
features: ['lag_1week_to_2weeks_Мои штрафы/Оплата/Платёж принят', 'lag_1week_to_2weeks_Проверка/Список штрафов/У штрафа не все данные', 'lag_1week_to_2weeks_Поиск полиса по докам/Запрос/Ошибка']
----------------------------------------------------------------------------------------------------
feature number: 178 | feature importance: 0.08749186396713882
features: ['lag_2weeks_to_3weeks_Мои штрафы/Оплата/Платёж принят', 'lag_2weeks_to_3weeks_Проверка/Список штрафов/

In [None]:
"""def log_feature_importance(model, features_dict, model_name="model", top_n=10):
    #Extracts and logs feature importances from a model.
    importances = model.featureImportances.toArray()
    feature_importances = list(zip(features_dict.keys(), importances))
    feature_importances.sort(key=lambda x: x[1], reverse=True)

    # Print top features
    print(f"Top Features for {model_name} Model:")
    for feat_num, importance in feature_importances[:top_n]:
        print(f"Feature {feat_num}: {features_dict[feat_num]} - Importance: {importance:.4f}")

    # Create DataFrame and log to W&B
    importance_df = pd.DataFrame(feature_importances, columns=["Feature Number", "Importance"])
    wandb.log({f"{model_name}_feature_importance": wandb.Table(dataframe=importance_df)})

# Example Usage (after training a Random Forest model):

log_feature_importance(rf, features_dict, model_name="rf")"""

In [60]:
def log_feature_importance(model, features_dict, model_name="rf", top_n=10):
    """Extracts and logs feature importances from a Random Forest model."""
    importances = model.featureImportances.toArray()
    feature_importances = list(zip(features_dict.keys(), importances))
    feature_importances.sort(key=lambda x: x[1], reverse=True)

    # Print top features
    print(f"Top Features for {model_name} Model:")
    for feat_num, importance in feature_importances[:top_n]:
        print(f"Feature {feat_num}: {features_dict[feat_num]} - Importance: {importance:.4f}")

    # Create DataFrame and log to W&B
    importance_df = pd.DataFrame(feature_importances, columns=["Feature Number", "Importance"])
    wandb.log({f"{model_name}_feature_importance": wandb.Table(dataframe=importance_df)})

# Example Usage (after training a Random Forest model):

log_feature_importance(model, features_dict)  # model_name defaults to "rf"

Top Features for rf Model:
Feature 38: ['lag_1week_to_2weeks_Мои штрафы/Оплата/Завершили оплату', 'lag_1week_to_2weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга', 'lag_1week_to_2weeks_Пуш Локальный/Скидочный штраф/Показан'] - Importance: 0.1389
Feature 78: ['lag_1week_to_2weeks_Мои штрафы/Оплата/Платёж принят', 'lag_1week_to_2weeks_Проверка/Список штрафов/У штрафа не все данные', 'lag_1week_to_2weeks_Поиск полиса по докам/Запрос/Ошибка'] - Importance: 0.1001
Feature 178: ['lag_2weeks_to_3weeks_Мои штрафы/Оплата/Платёж принят', 'lag_2weeks_to_3weeks_Проверка/Список штрафов/У штрафа не все данные'] - Importance: 0.0875
Feature 138: ['lag_2weeks_to_3weeks_Мои штрафы/Оплата/Завершили оплату', 'lag_2weeks_to_3weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга'] - Importance: 0.0818
Feature 238: ['lag_3weeks_to_4weeks_Мои штрафы/Оплата/Завершили оплату', 'lag_3weeks_to_4weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга'] - Importance: 0.0447
Feature 278: ['lag_3we

In [61]:
def log_feature_importance(model, features_dict, model_name="rf", top_n=10):
    """Extracts and logs feature importances with lag details from a Random Forest model."""
    importances = model.featureImportances.toArray()
    feature_importances = list(zip(features_dict.keys(), importances))
    feature_importances.sort(key=lambda x: x[1], reverse=True)

    # Print top features with lag details
    print(f"Top Features for {model_name} Model:")
    for feat_num, importance in feature_importances[:top_n]:
        lag_details = features_dict[feat_num]
        for lag_feature in lag_details:
            print(f"  - Feature {feat_num}: {lag_feature} - Importance: {importance:.4f}")
            
    # Create a list to store feature information
    feature_info = []

    # Iterate through features and extract information
    for feat_num, importance in feature_importances:
        lag_details = features_dict[feat_num]
        for lag_feature in lag_details:
            feature_info.append({
                "Feature Number": feat_num,
                "Lag Feature": lag_feature,
                "Importance": importance
            })

    # Create a DataFrame from the list of dictionaries
    importance_df = pd.DataFrame(feature_info)

    # Log the DataFrame as a W&B Table
    wandb.log({f"{model_name}_feature_importance": wandb.Table(dataframe=importance_df)})

In [62]:
log_feature_importance(model, features_dict)  # model_name defaults to "rf"

Top Features for rf Model:
  - Feature 38: lag_1week_to_2weeks_Мои штрафы/Оплата/Завершили оплату - Importance: 0.1389
  - Feature 38: lag_1week_to_2weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга - Importance: 0.1389
  - Feature 38: lag_1week_to_2weeks_Пуш Локальный/Скидочный штраф/Показан - Importance: 0.1389
  - Feature 78: lag_1week_to_2weeks_Мои штрафы/Оплата/Платёж принят - Importance: 0.1001
  - Feature 78: lag_1week_to_2weeks_Проверка/Список штрафов/У штрафа не все данные - Importance: 0.1001
  - Feature 78: lag_1week_to_2weeks_Поиск полиса по докам/Запрос/Ошибка - Importance: 0.1001
  - Feature 178: lag_2weeks_to_3weeks_Мои штрафы/Оплата/Платёж принят - Importance: 0.0875
  - Feature 178: lag_2weeks_to_3weeks_Проверка/Список штрафов/У штрафа не все данные - Importance: 0.0875
  - Feature 138: lag_2weeks_to_3weeks_Мои штрафы/Оплата/Завершили оплату - Importance: 0.0818
  - Feature 138: lag_2weeks_to_3weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга - Import

In [63]:
def log_grouped_feature_importance(model, features_dict, model_name="rf", top_n=10):
    """Extracts and logs feature importances with grouped lag details."""
    importances = model.featureImportances.toArray()
    feature_importances = list(zip(features_dict.keys(), importances))
    feature_importances.sort(key=lambda x: x[1], reverse=True)

    grouped_feature_info = []

    for feat_num, importance in feature_importances:
        lag_details = features_dict[feat_num]
        combined_lags = ", ".join(lag_details)  # Combine lag features into a single string
        grouped_feature_info.append({
            "Feature Number": feat_num,
            "Lag Features": combined_lags,
            "Importance": importance
        })

    # Create DataFrame
    grouped_importance_df = pd.DataFrame(grouped_feature_info)

    # Log to W&B
    wandb.log({f"{model_name}_grouped_feature_importance": wandb.Table(dataframe=grouped_importance_df)})

In [64]:
log_grouped_feature_importance(model, features_dict)


#### Run Hyperopt 


In [68]:
# Expanded search space for Random Forest
rf_search_space = {
    'numTrees': hp.quniform('numTrees', 10, 200, 10),
    'maxDepth': hp.quniform('maxDepth', 5, 30, 1),
    'minInstancesPerNode': hp.quniform('minInstancesPerNode', 1, 10, 1),  # Minimum instances per leaf
    'maxBins': hp.quniform('maxBins', 10, 50, 5),  # Number of bins for discretizing continuous features
    'subsamplingRate': hp.uniform('subsamplingRate', 0.5, 1.0),  # Subsampling rate for each tree
    'featureSubsetStrategy': hp.choice('featureSubsetStrategy', ['auto', 'all', 'sqrt', 'log2', 'onethird']), 
    'impurity': hp.choice('impurity', ['gini', 'entropy'])  # Impurity measure (Gini or entropy)
}



In [69]:
# Define the objective function for Hyperopt
def rf_objective(params):
    rf = RandomForestClassifier(labelCol="payment_event_flag", featuresCol="features", **params)
    rf_model = rf.fit(features_train)
    predictions = rf_model.transform(features_test)
    evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
    auc = evaluator.evaluate(predictions)
    # Log metrics and params to W&B
    wandb.log({"rf_hyperopt_auc": auc, "rf_hyperopt_params": params})
    return {'loss': -auc, 'status': STATUS_OK}  # Hyperopt minimizes loss

# Update W&B run config
wandb.config.update({
    "model_type": "RandomForestHyperopt",
    "hyperopt_search_space": rf_search_space
})



In [70]:
# Run Hyperopt optimization
rf_trials = Trials()
best_rf_params = fmin(
    fn=rf_objective,
    space=rf_search_space,
    algo=tpe.suggest,
    max_evals=30,
    trials=rf_trials
)



100%|██████████| 30/30 [1:09:36<00:00, 139.21s/trial, best loss: -0.7258779220699368]


In [71]:
# Log best hyperparameters
wandb.log({"best_rf_hyperparameters": best_rf_params}) 



In [82]:
"""def convert_hyperopt_params(params):
    #Converts Hyperopt parameter types for better logging.
    strategy_map = {0: "auto", 1: "all", 2: "sqrt", 3: "log2", 4: "onethird"}  # Mapping for featureSubsetStrategy
    impurity_map = {0: "gini", 1: "entropy"}  # Mapping for impurity 
    converted_params = {}
    for key, value in params.items():
        if isinstance(value, dict):
            converted_params[key] = value['quniform'] 
        elif key == 'featureSubsetStrategy':
            strategy_index = int(value) 
            converted_params[key] = strategy_map.get(strategy_index, "auto")
        elif key == 'impurity':
            impurity_index = int(value) 
            converted_params[key] = impurity_map.get(impurity_index)
        else:
            converted_params[key] = value
    return converted_params"""


def convert_hyperopt_params(params):
    #Converts Hyperopt parameter types for better logging.
    strategy_map = {0: "auto", 1: "all", 2: "sqrt", 3: "log2", 4: "onethird"}  # Mapping for featureSubsetStrategy
    impurity_map = {0: "gini", 1: "entropy"}  # Mapping for impurity
    converted_params = {}
    for key, value in params.items():
        if isinstance(value, dict):
            converted_params[key] = value['quniform']
        elif key == 'featureSubsetStrategy':
            if isinstance(value, str):
                converted_params[key] = value  # Already a string, no conversion needed
            else:
                strategy_index = int(value)
                converted_params[key] = strategy_map.get(strategy_index, "auto")
        elif key == 'impurity':
            impurity_index = int(value)
            converted_params[key] = impurity_map.get(impurity_index)
        else:
            converted_params[key] = value
    return converted_params
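
The conversion is needed because `fmin` reports `hp.quniform` values as floats and `hp.choice` values as positions in the option list. The sketch below shows the same decoding on a plain dictionary. `decode_best`, `CHOICES`, `INT_PARAMS` and the example values are all hypothetical and are not the parameters found in this run. Hyperopt also provides `space_eval`, which resolves `hp.choice` indices against the search space.

```python
def decode_best(best, choices, int_params):
    """Turn fmin-style output into keyword arguments for an estimator."""
    decoded = {}
    for name, value in best.items():
        if name in choices:
            decoded[name] = choices[name][int(value)]  # index -> option string
        elif name in int_params:
            decoded[name] = int(value)                 # e.g. 120.0 -> 120
        else:
            decoded[name] = value
    return decoded

# Option lists in the same order as the hp.choice calls in rf_search_space.
CHOICES = {
    "featureSubsetStrategy": ["auto", "all", "sqrt", "log2", "onethird"],
    "impurity": ["gini", "entropy"],
}
INT_PARAMS = {"numTrees", "maxDepth", "minInstancesPerNode", "maxBins"}

# Hypothetical fmin result, not the values found in this run.
example_best = {
    "numTrees": 120.0, "maxDepth": 14.0, "minInstancesPerNode": 2.0,
    "maxBins": 35.0, "subsamplingRate": 0.8,
    "featureSubsetStrategy": 2, "impurity": 1,
}
decoded = decode_best(example_best, CHOICES, INT_PARAMS)
print(decoded)
```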

In [83]:
# Convert Hyperopt parameter types
best_rf_params = convert_hyperopt_params(best_rf_params)

In [84]:
# Train the best RF model
best_rf = RandomForestClassifier(labelCol="payment_event_flag", featuresCol="features", **best_rf_params)
best_rf_model = best_rf.fit(features_train) 



In [85]:
# Requires best_rf_model from the previous cell

# Evaluate the best model on the test set
predictions = best_rf_model.transform(features_test)
evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
best_auc = evaluator.evaluate(predictions)

# Calculate PR AUC
pr_evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag", metricName="areaUnderPR")
best_pr_auc = pr_evaluator.evaluate(predictions)


In [86]:
# Log metrics and report to W&B
wandb.log({"best_rf_roc_auc": best_auc})
wandb.log({"best_rf_pr_auc": best_pr_auc})

In [92]:
# Print the metrics of the best model
print("best_rf_roc_auc: ", best_auc)
print("best_rf_pr_auc: ", best_pr_auc)

best_rf_roc_auc:  0.725861507488782
best_rf_pr_auc:  0.7674507664282019


In [87]:
# Calculate classification report
y_true = predictions.select('payment_event_flag').rdd.map(lambda x: x['payment_event_flag']).collect()
y_pred = predictions.select('prediction').rdd.map(lambda x: x['prediction']).collect()
report = classification_report(y_true, y_pred)




In [90]:
print(report)

              precision    recall  f1-score   support

           0       0.69      0.31      0.43      3919
           1       0.60      0.88      0.72      4648

    accuracy                           0.62      8567
   macro avg       0.65      0.60      0.57      8567
weighted avg       0.64      0.62      0.59      8567



In [89]:

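# Wrap the plain-text report in <pre> so its column alignment survives HTML rendering
wandb.log({"best_rf_classification_report": wandb.Html(f"<pre>{report}</pre>")})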


In [None]:
# ... (Evaluate and log the best model's performance as before) ... 

# --- Functions for Logging Hyperopt Trials ---

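def convert_hyperopt_params(params):
    """Converts Hyperopt parameter types for better logging."""
    converted_params = {}
    for key, value in params.items():
        if isinstance(value, dict):
            # Handle nested dictionaries (e.g., from hp.quniform)
            converted_params[key] = value['quniform']
        elif isinstance(value, (list, tuple)):
            # trial['misc']['vals'] stores each sampled value in a list;
            # unwrap it (None if the parameter was not sampled in this trial)
            converted_params[key] = value[0] if value else None
        else:
            converted_params[key] = value
    return converted_params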

def prepare_trials_df(trials):
    """Prepares a Pandas DataFrame from Hyperopt trials for logging to W&B."""
    trial_data = []
    for i, trial in enumerate(trials.trials):
        params = trial['misc']['vals']
        loss = trial['result']['loss']
        converted_params = convert_hyperopt_params(params)
        trial_dict = {"trial_id": i, "loss": loss}
        trial_dict.update(converted_params)  
        trial_data.append(trial_dict)
    trials_df = pd.DataFrame(trial_data)
    return trials_df

# Prepare and log the trials DataFrame
trials_df = prepare_trials_df(rf_trials)
wandb.log({"hyperopt_trials": wandb.Table(dataframe=trials_df)})

#### Feature importance 

In [None]:
# ... (After evaluating and logging the best RF model's performance) ... 

# Get feature importances from the best RF model
importances = best_rf_model.featureImportances.toArray()

# Zip importances with feature names (from features_dict)
feature_importances = list(zip(features_dict.keys(), importances))

# Sort features by importance (descending order)
feature_importances.sort(key=lambda x: x[1], reverse=True)

# Print top features and their importances
print("Top Features for Best Random Forest Model:")
for feat_num, importance in feature_importances[:10]:
    print(f"Feature {feat_num}: {features_dict[feat_num]} - Importance: {importance:.4f}")

# Create a DataFrame for feature importance
feature_importance_df = pd.DataFrame(feature_importances, columns=["Feature Number", "Importance"])

# Log the DataFrame as a W&B Table
wandb.log({"best_rf_feature_importance": wandb.Table(dataframe=feature_importance_df)})

In [49]:
from pyspark.ml.classification import LinearSVC

# Initialize the SVM model
svm = LinearSVC(labelCol="payment_event_flag", featuresCol="features", maxIter=100)



In [50]:
# Log model type and parameters to W&B
wandb.config.update({
    "model_type": type(svm).__name__,
    "max_iter": svm.getMaxIter(),
    "upsample": UPSAMPLE,
    "lags": lags,
    "num_features": len(features_dict.items()),
    "output_file": file_path_pred
})



In [51]:
# Train the model
svm_model = svm.fit(features_train)



In [52]:
# Make predictions on the test set
predictions = svm_model.transform(features_test)



In [63]:
# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
svm_auc = evaluator.evaluate(predictions)

# Calculate classification report
y_true = predictions.select('payment_event_flag').rdd.map(lambda x: x['payment_event_flag']).collect()
y_pred = predictions.select('prediction').rdd.map(lambda x: x['prediction']).collect()
report = classification_report(y_true, y_pred)

# Log metrics to W&B
wandb.log({"svm_roc_auc": svm_auc})
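# Wrap the plain-text report in <pre> so its column alignment survives HTML rendering
wandb.log({"svm_classification_report": wandb.Html(f"<pre>{report}</pre>")})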

print(f"SVM ROC AUC: {svm_auc}")
print(report)

In [176]:
print(f"SVM ROC AUC: {svm_auc}")

SVM ROC AUC: 0.7163015593809696


In [64]:
# Calculate classification report
y_true = predictions.select('payment_event_flag').rdd.map(lambda x: x['payment_event_flag']).collect()
y_pred = predictions.select('prediction').rdd.map(lambda x: x['prediction']).collect()
report = classification_report(y_true, y_pred)



In [55]:
# Log metrics to W&B
wandb.log({"svm_roc_auc": svm_auc})
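# Wrap the plain-text report in <pre> so its column alignment survives HTML rendering
wandb.log({"svm_classification_report": wandb.Html(f"<pre>{report}</pre>")})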



In [65]:
print(f"SVM ROC AUC: {svm_auc}")
print(report)


SVM ROC AUC: 0.7163015593809696
              precision    recall  f1-score   support

           0       0.59      0.81      0.68      3948
           1       0.77      0.52      0.62      4669

    accuracy                           0.65      8617
   macro avg       0.68      0.67      0.65      8617
weighted avg       0.69      0.65      0.65      8617



In [143]:
svm_report = classification_report(y_true, y_pred, output_dict=True)

In [144]:
svm_report_html = dict_to_html_table(svm_report)

In [145]:
wandb.log({"svm_classification_report_html": wandb.Html(svm_report_html)})
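
Note that `classification_report(..., output_dict=True)` returns a nested dictionary: each class label and average maps to its own dict of metrics, while `accuracy` is a single number. A flat key/value table therefore shows the inner dicts as raw text. The sketch below is one possible nested-aware renderer; `report_to_html_table` is a hypothetical helper, not part of the notebook, and the input is a toy report.

```python
import html


def report_to_html_table(report):
    """Render a nested classification-report dict as an HTML table (hypothetical helper)."""
    columns = ["precision", "recall", "f1-score", "support"]
    header = "".join(f"<th>{c}</th>" for c in [""] + columns)
    rows = [f"<tr>{header}</tr>"]
    for label, metrics in report.items():
        if isinstance(metrics, dict):
            # Per-class and averaged rows: one cell per metric column
            cells = "".join(f"<td>{metrics.get(c, '')}</td>" for c in columns)
        else:
            # Scalar entries such as "accuracy" span all metric columns
            cells = f'<td colspan="{len(columns)}">{metrics}</td>'
        rows.append(f"<tr><th>{html.escape(str(label))}</th>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


# Toy input shaped like classification_report(..., output_dict=True)
toy_report = {
    "0": {"precision": 0.5, "recall": 1.0, "f1-score": 0.67, "support": 2},
    "accuracy": 0.75,
}
table_html = report_to_html_table(toy_report)
print(table_html)
```

The resulting string could be passed to `wandb.Html` in the same way as the output of `dict_to_html_table`.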

In [78]:
# Name the raw-prediction column; LinearSVC outputs margins (decision values), not probabilities
svm.setRawPredictionCol("rawPrediction")

# Train the model
svm_model = svm.fit(features_train)

# Make predictions on the test set
predictions = svm_model.transform(features_test)

# Get raw prediction and true labels
prediction_raw = predictions.select("rawPrediction", "payment_event_flag")

# Convert to RDD and format for BinaryClassificationMetrics
preds_rdd = prediction_raw.rdd.map(lambda lp: (float(lp[0][1]), float(lp[1])))

# Create BinaryClassificationMetrics object
metrics = BinaryClassificationMetrics(preds_rdd)

# Calculate PR AUC
pr_auc = metrics.areaUnderPR

# Log PR AUC to W&B
wandb.log({"svm_pr_auc": pr_auc})

print(f"SVM PR AUC: {pr_auc}")



SVM PR AUC: 0.7584486213386411


In [79]:
# Get the coefficients (feature weights)
svm_coef = svm_model.coefficients

# Zip coefficients with feature names
features_weights = list(zip(features_dict.keys(), svm_coef))

# Sort features by absolute value of their weights (descending order)
features_weights.sort(key=lambda x: abs(x[1]), reverse=True)

# Print top features and their weights
print("Top Features for SVM Model:")
for feat_num, weight in features_weights[:10]:
    print(f"Feature {feat_num}: {features_dict[feat_num]} - Weight: {weight:.4f}")

# Create a table for logging to W&B

features_importance_df = pd.DataFrame(features_weights, columns=["Feature Number", "Weight"])



Top Features for SVM Model:
Feature 2015: ['lag_21weeks_to_22weeks_Проверка/История платежей/Детали'] - Weight: 3.4729
Feature 773: ['lag_8weeks_to_9weeks_Мои штрафы/Проверка ВУ/Нажали на ВУ'] - Weight: 3.1714
Feature 777: ['lag_8weeks_to_9weeks_Мои штрафы/Детали штрафа/Нажали кнопку "Оплатить"'] - Weight: 3.0667
Feature 70: ['lag_1week_to_2weeks_Мои штрафы/Оплата/Оплата не прошла'] - Weight: -1.6171
Feature 1483: ['lag_15weeks_to_16weeks_Мои штрафы/Пуш/Открыт'] - Weight: 1.1093
Feature 538: ['lag_6weeks_to_7weeks_Мои штрафы/Оплата/Завершили оплату', 'lag_6weeks_to_7weeks_Пуш Локальный/Скидочный штраф/Показан', 'lag_6weeks_to_7weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга'] - Weight: -1.0175
Feature 2242: ['lag_23weeks_to_24weeks_Мои штрафы/Оплата/Начали оплату'] - Weight: 0.8045
Feature 2246: ['lag_23weeks_to_24weeks_Мои штрафы/Оплата/Сберпей/Открыт'] - Weight: 0.7952
Feature 1882: ['lag_19weeks_to_20weeks_Мои штрафы/Оплата/Ушел с ввода данных'] - Weight: 0.7904
Feature 7

In [101]:
# Create a list to store feature information
feature_info = []

# Iterate through top features and extract information
for feat_num, weight in features_weights[:30]:
    event_names = features_dict[feat_num]
    for event_name in event_names:
        feature_info.append({
            "Feature Number": feat_num,
            "Event Name": event_name,
            "Weight": weight
        })

# Create a DataFrame from the list of dictionaries
feature_importance_df = pd.DataFrame(feature_info)

In [102]:
# Log the DataFrame as a W&B Table
wandb.log({"svm_feature_importance": wandb.Table(dataframe=feature_importance_df)})

In [103]:
feature_importance_df

| | Feature Number | Event Name | Weight |
|---|---|---|---|
| 0 | 2015 | lag_21weeks_to_22weeks_Проверка/История платеж... | 3.472949 |
| 1 | 773 | lag_8weeks_to_9weeks_Мои штрафы/Проверка ВУ/На... | 3.171449 |
| 2 | 777 | lag_8weeks_to_9weeks_Мои штрафы/Детали штрафа/... | 3.066674 |
| 3 | 70 | lag_1week_to_2weeks_Мои штрафы/Оплата/Оплата н... | -1.617134 |
| 4 | 1483 | lag_15weeks_to_16weeks_Мои штрафы/Пуш/Открыт | 1.109276 |
| 5 | 538 | lag_6weeks_to_7weeks_Мои штрафы/Оплата/Заверши... | -1.017492 |
| 6 | 538 | lag_6weeks_to_7weeks_Пуш Локальный/Скидочный ш... | -1.017492 |
| 7 | 538 | lag_6weeks_to_7weeks_Страхование/Главная/Ленди... | -1.017492 |
| 8 | 2242 | lag_23weeks_to_24weeks_Мои штрафы/Оплата/Начал... | 0.804529 |
| 9 | 2246 | lag_23weeks_to_24weeks_Мои штрафы/Оплата/Сберп... | 0.795154 |


In [91]:
# Calculate total absolute weight
total_weight = sum(abs(weight) for _, weight in features_weights)

# Create a list to store feature information
feature_info = []

# Extract information for top 20 features
for feat_num, weight in features_weights[:20]:
    event_names = features_dict[feat_num]
    for event_name in event_names:
        percentage_contribution = (abs(weight) / total_weight) * 100
        feature_info.append({
            "Feature Number": feat_num,
            "Event Name": event_name,
            "Weight": weight,
            "Percentage Contribution": f"{percentage_contribution:.2f}%"
        })

# Create DataFrame
feature_weight_df = pd.DataFrame(feature_info)


In [93]:
# Log the DataFrame as a W&B Table
wandb.log({"svm_feature_weight": wandb.Table(dataframe=feature_weight_df)})

In [92]:
feature_weight_df

| | Feature Number | Event Name | Weight | Percentage Contribution |
|---|---|---|---|---|
| 0 | 2015 | lag_21weeks_to_22weeks_Проверка/История платеж... | 3.472949 | 2.04% |
| 1 | 773 | lag_8weeks_to_9weeks_Мои штрафы/Проверка ВУ/На... | 3.171449 | 1.86% |
| 2 | 777 | lag_8weeks_to_9weeks_Мои штрафы/Детали штрафа/... | 3.066674 | 1.80% |
| 3 | 70 | lag_1week_to_2weeks_Мои штрафы/Оплата/Оплата н... | -1.617134 | 0.95% |
| 4 | 1483 | lag_15weeks_to_16weeks_Мои штрафы/Пуш/Открыт | 1.109276 | 0.65% |
| 5 | 538 | lag_6weeks_to_7weeks_Мои штрафы/Оплата/Заверши... | -1.017492 | 0.60% |
| 6 | 538 | lag_6weeks_to_7weeks_Пуш Локальный/Скидочный ш... | -1.017492 | 0.60% |
| 7 | 538 | lag_6weeks_to_7weeks_Страхование/Главная/Ленди... | -1.017492 | 0.60% |
| 8 | 2242 | lag_23weeks_to_24weeks_Мои штрафы/Оплата/Начал... | 0.804529 | 0.47% |
| 9 | 2246 | lag_23weeks_to_24weeks_Мои штрафы/Оплата/Сберп... | 0.795154 | 0.47% |
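
The "Percentage Contribution" column is each weight's absolute value divided by the sum of absolute values over all features, times 100. The sketch below repeats that arithmetic on toy weights (illustrative values, not taken from the fitted model) so the calculation is easy to check by hand.

```python
# Toy weights (illustrative values, not taken from the fitted model)
toy_weights = {"a": 3.0, "b": -1.5, "c": 0.5}

# The denominator is the sum of absolute weights over ALL features
total_abs = sum(abs(w) for w in toy_weights.values())

# Share of each feature in the total absolute weight, in percent
shares = {name: abs(w) / total_abs * 100 for name, w in toy_weights.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

Because the denominator covers every feature while the table lists only the top 20, the listed shares do not sum to 100%. A feature that maps to several event names (such as feature 538 above) also repeats the same share on each of its rows, so summing the column would count it more than once.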


#### Logging notebook characteristics

The code below prints a labelled summary of the pipeline settings: the data processing flags, the lag features, the date range used to filter payment events, the upsampling strategy, the model type, and the output file path.

Add further variables or model-specific parameters as your pipeline configuration requires.

**Benefits:**

* Reproducibility: A summary of the pipeline parameters makes it easier to trace which settings produced the results of a particular experiment.

In [126]:
print("## Pipeline Summary ##")

# Data Processing Flags:
print(f"- PROC_DS: {PROC_DS}")
print(f"- PROC_LAGS: {PROC_LAGS}")
print(f"- PROC_VECS: {PROC_VECS}")

# Lag Feature Information:
print(f"- Lag Features Used: {lags}")
print(f"- Number of Features after TF-IDF: {len(features_dict.items())}")

# Date Range:
print(f"- flag_min_datetime: {flag_min_datetime}")
print(f"- flag_max_datetime: {flag_max_datetime}")

# Upsampling:
print(f"- UPSAMPLE: {UPSAMPLE}")

# Model:
print(f"- Model Type: {type(svm_model).__name__}")
# Add more model-specific parameters if needed 

# Output:
print(f"- Predictions saved to: ", file_path_pred)

print("## End of Summary ##")

## Pipeline Summary ##
- PROC_DS: False
- PROC_LAGS: False
- PROC_VECS: True
- Lag Features Used: ['lag_1week_to_2weeks', 'lag_2weeks_to_3weeks', 'lag_3weeks_to_4weeks', 'lag_4weeks_to_5weeks', 'lag_5weeks_to_6weeks', 'lag_6weeks_to_7weeks', 'lag_7weeks_to_8weeks', 'lag_8weeks_to_9weeks', 'lag_9weeks_to_10weeks', 'lag_10weeks_to_11weeks', 'lag_11weeks_to_12weeks', 'lag_12weeks_to_13weeks', 'lag_13weeks_to_14weeks', 'lag_14weeks_to_15weeks', 'lag_15weeks_to_16weeks', 'lag_16weeks_to_17weeks', 'lag_17weeks_to_18weeks', 'lag_18weeks_to_19weeks', 'lag_19weeks_to_20weeks', 'lag_20weeks_to_21weeks', 'lag_21weeks_to_22weeks', 'lag_22weeks_to_23weeks', 'lag_23weeks_to_24weeks', 'lag_24weeks_to_25weeks', 'lag_25weeks_to_26weeks', 'lag_26weeks_to_27weeks']
- Number of Features after TF-IDF: 2600
- flag_min_datetime: 2024-03-21 00:00:00
- flag_max_datetime: 2024-04-18 23:59:59
- UPSAMPLE: max
- Model Type: LinearSVCModel
- Predictions saved to:  s3a://pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a7b/work

In [139]:
summary_dict = {
    "PROC_DS": PROC_DS,
    "PROC_LAGS": PROC_LAGS,
    "PROC_VECS": PROC_VECS,
    "flag_min_datetime": flag_min_datetime,
    "flag_max_datetime": flag_max_datetime,
    "UPSAMPLE": UPSAMPLE,
    "Number of Features after TF-IDF": len(features_dict.items()),
    "Model Type": type(svm_model).__name__,
    "maxIter": svm_model.getMaxIter(),
    "regParam": svm_model.getRegParam(),
    "tol": svm_model.getTol(),
    "Predictions saved to": file_path_pred, 
    "user_id": current_user, 
    "unique_identifier": unique_identifier
}

In [140]:
summary_dict["Number of Features after TF-IDF"] = str(len(features_dict.items()))


In [141]:
def dict_to_html_table(data):
    html = "<table>"
    for key, value in data.items():
        html += f"<tr><th>{key}</th><td>{value}</td></tr>"
    html += "</table>"
    return html

summary_html = dict_to_html_table(summary_dict)

In [142]:
wandb.log({"pipeline_summary_html": wandb.Html(summary_html)})


# <span style="color: red;">Warning: the cells below run Hyperopt again (TF-IDF)

They also do not include the date filtering code before the model is run:

    # *** Add filtering here ***
    if PROC_LAGS:
        dates = (flag_min_datetime, flag_max_datetime)
        features_train_filtered = features_train.filter(features_train.event_datetime.between(*dates))
    else:
        features_train_filtered = features_train

</span>

## Random forest WITHOUT tf-idf elements

#### Analysis of Hyperparameter Choices for SVM with Hyperopt

`maxIter`: This parameter controls the maximum number of iterations the optimization algorithm will run before stopping.

`Choice`: `scope.int(hp.quniform('maxIter', 10, 200, 10))` explores integer values between 10 and 200 in steps of 10.

`Rationale`: A higher `maxIter` lets the optimizer run longer and converge more closely to the optimal solution, but increases training time. The chosen range is a reasonable starting point for balancing the two.

**Alternative Choices:**

`Smaller Range/Larger Step Size`: If training time is a major concern, you might narrow the range or increase the step size to explore fewer values.

`Larger Range`: If you suspect the optimal value lies outside the current range, you could expand it.
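
To make the `maxIter` grid concrete: Hyperopt documents `hp.quniform(label, low, high, q)` as returning `round(uniform(low, high) / q) * q`. The sketch below mimics that formula with the standard library only (the local `quniform` function is a stand-in, not the Hyperopt call) and lists the distinct values it can produce for the bounds used here.

```python
import random


def quniform(low, high, q, rng):
    """Stand-in for hp.quniform: round(uniform(low, high) / q) * q."""
    return round(rng.uniform(low, high) / q) * q


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [int(quniform(10, 200, 10, rng)) for _ in range(1000)]

# Every sample is a multiple of 10 between 10 and 200
print(sorted(set(samples)))
```

One consequence of the rounding is that the two endpoints (10 and 200) are drawn about half as often as interior values, because only half of a rounding interval falls inside the sampled range.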

`regParam`: This parameter controls the regularization strength. Higher values lead to stronger regularization, which can help prevent overfitting but might also underfit if too high.

`Choice`: `hp.loguniform('regParam', -6, 0)` samples values from a log-uniform distribution. The bounds are natural-log exponents, so the range is e^-6 (about 0.0025) to e^0 = 1, not 1e-6 to 1.

`Rationale`: The log-uniform distribution allows exploring a wide range of regularization strengths, from very small to relatively large values, which is often suitable for regularization parameters.

**Alternative Choices:**

`Uniform Distribution`: If you have prior knowledge about the reasonable range for regParam, you could use a uniform distribution within that range.

`Specific Values`: In some cases, you might want to try specific values based on experience or domain knowledge.

`tol`: This parameter sets the tolerance for the stopping criterion. The optimization algorithm stops when the improvement in the objective function falls below this tolerance level.

`Choice`: `hp.loguniform('tol', -6, -1)` samples values from a log-uniform distribution. The bounds are natural-log exponents, so the range is e^-6 (about 0.0025) to e^-1 (about 0.37), not 1e-6 to 1e-1.

`Rationale`: Similar to regParam, the log-uniform distribution allows exploring a wide range of tolerance values.

**Alternative Choices:**

`Uniform Distribution`: If you have a better understanding of the appropriate tolerance range, you could use a uniform distribution.

`Fixed Value`: In some cases, you might set a fixed tolerance value based on the desired level of precision.
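
For both `regParam` and `tol`, it helps to remember that `hp.loguniform(label, low, high)` is documented as returning `exp(uniform(low, high))`, so `low` and `high` are natural-log exponents rather than powers of ten. The sketch below mimics that formula with the standard library (the local `loguniform` function is a stand-in, not the Hyperopt call) and shows which bounds would cover 1e-6 to 1 if that range is wanted.

```python
import math
import random


def loguniform(low, high, rng):
    """Stand-in for hp.loguniform: exp(uniform(low, high))."""
    return math.exp(rng.uniform(low, high))


# Range implied by the bounds (-6, 0) used for regParam in the search space
print(math.exp(-6), math.exp(0))

# Bounds that would span 1e-6 to 1 instead: take natural logs of the endpoints
low, high = math.log(1e-6), math.log(1.0)
print(low, high)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [loguniform(low, high, rng) for _ in range(1000)]
print(min(draws), max(draws))
```

In the actual search space this would correspond to something like `hp.loguniform('regParam', math.log(1e-6), math.log(1))`, which is a suggestion rather than what the notebook ran.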

**Overall Assessment:**

The hyperparameter choices in the search space are a reasonable starting point: the ranges and distributions allow a broad exploration of different configurations.

Refine the search space based on the results of initial experiments and your understanding of the problem and the SVM model.

Additional hyperparameters such as `fitIntercept` could also be explored. Spark's `LinearSVC` supports only a linear kernel, so non-linear SVMs would require a different implementation.

In [163]:
# Define the search space for Hyperopt
svm_search_space = {
    'maxIter': scope.int(hp.quniform('maxIter', 10, 200, 10)),
    'regParam': hp.loguniform('regParam', -6, 0),
    'tol': hp.loguniform('tol', -6, -1),
}


In [164]:
# Define the objective function for Hyperopt
def svm_objective(params):
    svm = LinearSVC(labelCol="payment_event_flag", featuresCol="features", **params)
    svm_model = svm.fit(features_train)
    predictions = svm_model.transform(features_test)
    evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
    auc = evaluator.evaluate(predictions)
    # Log metrics and params to W&B
    wandb.log({"svm_hyperopt_auc": auc, "svm_hyperopt_params": params})
    return {'loss': -auc, 'status': STATUS_OK}


In [165]:
# Update W&B run config
wandb.config.update({
    "model_type": "LinearSVCHyperopt",
    "hyperopt_search_space": svm_search_space
})


In [166]:
# Run Hyperopt optimization
svm_trials = Trials()
best_svm_params = fmin(
    fn=svm_objective,
    space=svm_search_space,
    algo=tpe.suggest,
    max_evals=30,
    trials=svm_trials
)



100%|██████████| 30/30 [02:34<00:00,  5.14s/trial, best loss: -0.7137594902071325]


In [167]:
best_svm_params



{'maxIter': 80.0, 'regParam': 0.005145783339572193, 'tol': 0.00313329957772047}
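
The output above shows `maxIter` as `80.0`: `fmin` returns the raw sampled value for each label and does not apply wrappers such as `scope.int`. The sketch below casts integer-valued parameters back to `int` before they are passed to the model or logged. `cast_integer_params` is a hypothetical helper; Hyperopt's own `space_eval(svm_search_space, best_svm_params)` is the library route to the same result.

```python
def cast_integer_params(params, int_keys=("maxIter",)):
    """Return a copy of params with the listed keys cast to int (hypothetical helper)."""
    cleaned = dict(params)
    for key in int_keys:
        if key in cleaned:
            value = float(cleaned[key])
            if not value.is_integer():
                raise ValueError(f"{key}={cleaned[key]!r} is not integer-valued")
            cleaned[key] = int(value)
    return cleaned


# Values copied from the fmin output shown above
raw_best = {
    "maxIter": 80.0,
    "regParam": 0.005145783339572193,
    "tol": 0.00313329957772047,
}
clean_best = cast_integer_params(raw_best)
print(clean_best)
```

The helper raises instead of truncating when a value is not a whole number, which makes a wrongly listed key visible rather than silently changing a parameter.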

In [168]:
# Log best hyperparameters
wandb.log({"best_svm_hyperparameters": best_svm_params})



In [169]:
# Train the best SVM model
best_svm = LinearSVC(labelCol="payment_event_flag", featuresCol="features", **best_svm_params)
best_svm_model = best_svm.fit(features_train)


In [171]:
# Evaluate the best model
best_predictions = best_svm_model.transform(features_test)
best_evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
best_svm_auc = best_evaluator.evaluate(best_predictions)



In [172]:
# Calculate classification report for the best model
y_true = best_predictions.select('payment_event_flag').rdd.map(lambda x: x['payment_event_flag']).collect()
y_pred = best_predictions.select('prediction').rdd.map(lambda x: x['prediction']).collect()
best_report = classification_report(y_true, y_pred)



In [173]:
print(best_report)

              precision    recall  f1-score   support

           0       0.63      0.50      0.56      3948
           1       0.64      0.75      0.69      4669

    accuracy                           0.64      8617
   macro avg       0.63      0.63      0.62      8617
weighted avg       0.64      0.64      0.63      8617



In [174]:
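# Build the classification report for the best model as a dictionary
best_svm_report = classification_report(y_true, y_pred, output_dict=True)

# Convert the best model's report (not the baseline svm_report) to HTML
hyperopt_svm_report_html = dict_to_html_table(best_svm_report)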

# Log the report as HTML to W&B
wandb.log({"hopt_best_svm_classification_report": wandb.Html(hyperopt_svm_report_html)})

In [175]:
# Get raw predictions (SVM margins, not probabilities) and true labels
prediction_probs = best_predictions.select("rawPrediction", "payment_event_flag")

# Convert to RDD and format for BinaryClassificationMetrics
preds_rdd = prediction_probs.rdd.map(lambda lp: (float(lp[0][1]), float(lp[1])))

# Create BinaryClassificationMetrics object
metrics = BinaryClassificationMetrics(preds_rdd)

# Calculate AUC-ROC and AUC-PR
roc_auc = metrics.areaUnderROC
pr_auc = metrics.areaUnderPR

# Print the metrics
print(f"Best SVM ROC AUC: {roc_auc}")
print(f"Best SVM PR AUC: {pr_auc}")

# Log the metrics to W&B
wandb.log({"best_svm_roc_auc": roc_auc})
wandb.log({"best_svm_pr_auc": pr_auc})



Best SVM ROC AUC: 0.7137535769674866
Best SVM PR AUC: 0.7562001878676617


In [177]:
# Get the coefficients (feature weights) from the best SVM model
best_svm_coef = best_svm_model.coefficients

# Zip coefficients with feature names
features_weights = list(zip(features_dict.keys(), best_svm_coef))

# Sort features by absolute value of their weights (descending order)
features_weights.sort(key=lambda x: abs(x[1]), reverse=True)

# Print top features and their weights
print("Top Features for Hyperopt SVM Model:")
for feat_num, weight in features_weights[:10]:
    print(f"Feature {feat_num}: {features_dict[feat_num]} - Weight: {weight:.4f}")

# Create a DataFrame of feature numbers and their weights
feature_importance_df = pd.DataFrame(features_weights, columns=["Feature Number", "Weight"])

# Log the DataFrame as a W&B Table
wandb.log({"best_svm_feature_importance": wandb.Table(dataframe=feature_importance_df)})



Top Features for Hyperopt SVM Model:
Feature 70: ['lag_1week_to_2weeks_Мои штрафы/Оплата/Оплата не прошла'] - Weight: -1.1312
Feature 538: ['lag_6weeks_to_7weeks_Мои штрафы/Оплата/Завершили оплату', 'lag_6weeks_to_7weeks_Пуш Локальный/Скидочный штраф/Показан', 'lag_6weeks_to_7weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга'] - Weight: -0.6258
Feature 1184: ['lag_12weeks_to_13weeks_Мои штрафы/Задолженность/Долг 10 т.р', 'lag_12weeks_to_13weeks_Справочник/ПДД/Открыли статью'] - Weight: 0.4250
Feature 5: ['lag_1week_to_2weeks_Мои штрафы/Оплата/Завешили оплату', 'lag_1week_to_2weeks_Справочник/Поиск/Открыл КоАП из выдачи'] - Weight: 0.4221
Feature 2211: ['lag_23weeks_to_24weeks_Мои штрафы/Оплата/Начали множественную оплату', 'lag_23weeks_to_24weeks_Push/Получен пуш от api', 'lag_23weeks_to_24weeks_Мои штрафы/Редактирование документа'] - Weight: 0.3908
Feature 2015: ['lag_21weeks_to_22weeks_Проверка/История платежей/Детали'] - Weight: 0.3838
Feature 1475: ['lag_15weeks_to_16weeks

In [178]:
# Calculate total absolute weight
total_weight = sum(abs(weight) for _, weight in features_weights)

# Create a list to store feature information with weight percentages
feature_info = []

# Extract information for top 20 features with weight percentages
for feat_num, weight in features_weights[:20]:
    event_names = features_dict[feat_num]
    for event_name in event_names:
        percentage_contribution = (abs(weight) / total_weight) * 100
        feature_info.append({
            "Feature Number": feat_num,
            "Event Name": event_name,
            "Weight": weight,
            "Percentage Contribution": f"{percentage_contribution:.2f}%"
        })

# Create DataFrame for weighted feature importance
feature_weight_df = pd.DataFrame(feature_info)

# Log the DataFrame as a W&B Table
wandb.log({"best_svm_feature_weight": wandb.Table(dataframe=feature_weight_df)})

## <span style="color: red;">The code below will finish your W&B run</span>


In [None]:
wandb.finish()  # Finalize W&B run