# Spark modelling - optimized

## 1. Libraries and Spark setup


This section imports necessary libraries and sets up the Spark environment:

Libraries:
* `os`, `sys`, `json`, `datetime`, `numpy`, `pandas`, `tqdm`, `matplotlib.pyplot`: General-purpose libraries for file system access, system functions, JSON handling, date/time manipulation, numerical computation, data manipulation, progress bars, and plotting.

pyspark libraries:
* `SparkContext`, `SparkConf`: Core Spark functionalities for setting up the Spark context and configuration.
* `SparkSession`: Entry point for interacting with Spark SQL.
* `functions as F`: Provides various Spark SQL functions for data manipulation.
* `types`: Defines data types for Spark DataFrames.
* `Window`: Used for window functions in Spark SQL.
* `ml.feature`: Feature engineering and transformation tools such as `Word2Vec`, `HashingTF`, `IDF`, `Tokenizer`, `Imputer`, `OneHotEncoder`, `StringIndexer`, and `VectorAssembler`.
* `ml.classification`: Classification algorithms such as `LogisticRegression`, `RandomForestClassifier`, `GBTClassifier`, and `LinearSVC`.
* `ml.evaluation`: Evaluators such as `BinaryClassificationEvaluator` and `MulticlassClassificationEvaluator`.
* `mllib.evaluation`: Provides `BinaryClassificationMetrics` (part of the older RDD-based API, not `ml.evaluation`).
* `ml.tuning`: Provides tools for hyperparameter tuning like `CrossValidator` and `ParamGridBuilder`.

Spark Configuration:
* `SparkConf`: Sets configuration parameters for the Spark application.
* `spark.master`: Specifies the cluster manager; `local[*]` means using all available cores on the local machine.
* `spark.driver.memory`, `spark.driver.maxResultSize`: Allocates memory for the driver process.
* `SparkContext`, `SparkSession`: Creates the Spark context and session based on the configuration.

Accessing Data:
* `access_data`: Function that loads JSON data from a local file.
* `access_s3_data`: Dictionary of AWS-style credentials loaded from a local JSON file with `access_data`.

Spark configuration is further set to access data from Yandex Cloud Storage (S3-compatible) using the loaded credentials.

In [1]:
import os
import sys
import json
import datetime
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Used later to build unique filenames
import uuid
import getpass

In [2]:
!pip install wandb




In [3]:
# W&B logging 
import wandb

## <span style="color: red;">Insert the name of your notebook below</span>



In [4]:
# Set name of notebook
os.environ["WANDB_NOTEBOOK_NAME"] = "full_pipeline_with_logging_hyperopt_PK.ipynb"





In [5]:
# Pyspark general 
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Pyspark pre-processing 
from pyspark.sql.window import Window

# Pyspark vectorization
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Pyspark pipeline and imputation
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer

# Pyspark models and feature encoders
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier, LinearSVC
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Pyspark evaluation
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Pyspark cross-validation
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Reporting (scikit-learn)
from sklearn.metrics import classification_report

# Pyspark other
from pyspark.ml.functions import vector_to_array


In [6]:
!pip install hyperopt




In [7]:
# Hyperopt related 
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt.pyll.base import scope


## <span style="color: red;">Get your W&B API key for the next section</span>


In [8]:
# Weights and biases login 

wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mkapetan[0m ([33mgsom-diploma-jap[0m). Use [1m`wandb login --relogin`[0m to force relogin


True

In [9]:
print('user:', os.environ['JUPYTERHUB_SERVICE_PREFIX'])

def uiWebUrl(self):
    from urllib.parse import urlparse
    web_url = self._jsc.sc().uiWebUrl().get()
    port = urlparse(web_url).port
    return '{}proxy/{}/jobs/'.format(os.environ['JUPYTERHUB_SERVICE_PREFIX'], port)

SparkContext.uiWebUrl = property(uiWebUrl)

conf = SparkConf()
conf.set('spark.master', 'local[*]')
conf.set('spark.driver.memory', '32G')
conf.set('spark.driver.maxResultSize', '8G')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
spark

user: /user/st095435/


In [10]:
def access_data(file_path):
    with open(file_path) as file:
        access_data = json.load(file)
    return access_data

access_s3_data = access_data('.access_jhub_data')

In [11]:
spark._jsc.hadoopConfiguration().set('fs.s3a.access.key', access_s3_data['aws_access_key_id'])
spark._jsc.hadoopConfiguration().set('fs.s3a.secret.key', access_s3_data['aws_secret_access_key'])
spark._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3a.S3AFileSystem')
spark._jsc.hadoopConfiguration().set('fs.s3a.multipart.size', '104857600')
spark._jsc.hadoopConfiguration().set('fs.s3a.block.size', '33554432')
spark._jsc.hadoopConfiguration().set('fs.s3a.threads.max', '256')
spark._jsc.hadoopConfiguration().set('fs.s3a.endpoint', 'http://storage.yandexcloud.net')
spark._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 
                                     'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
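
The numeric S3A settings in the cell above are byte counts. As a quick sanity check, the sketch below converts them to MiB using plain arithmetic; the variable names are illustrative and not part of the notebook.

```python
# Byte values passed to the S3A settings in the cell above
multipart_size = 104857600   # fs.s3a.multipart.size
block_size = 33554432        # fs.s3a.block.size

MIB = 1024 * 1024  # bytes in one mebibyte

multipart_mib = multipart_size // MIB
block_mib = block_size // MIB

print(f"fs.s3a.multipart.size = {multipart_mib} MiB")
print(f"fs.s3a.block.size     = {block_mib} MiB")
```
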

## 2. Dataset


This section defines the variables and file paths that control data processing, then runs several data processing stages.

Variables:

* `VER`: Version identifier.
* `PROC_DS`, `PROC_LAGS`, `PROC_VECS`: Flags to control data processing stages.
* `FRAC_0`: Fraction of negative examples to sample when processing lags.
* `BUCKET`: S3 bucket name.
* `files_path`, `files_mask`: Local paths and masks for raw data files.
* `file_path_ds`, `file_path_lags`, `file_path_trn`, `file_path_tst`: S3 paths for different stages of processed data.


• `VER` (e.g. `'v2'`): The version of the data processing pipeline. Increment it when you make significant changes to the processing steps.
• `BUCKET` (`'pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a7b'`): The name of the bucket where the data is stored (here, Yandex Cloud Storage accessed through its S3-compatible API).

**Flags:**

• `PROC_DS` (False): This flag controls whether to process the raw dataset. If True, the code will read the raw CSV files, extract relevant columns, filter by date, and create the data_raw.parquet file.

* Change to True: When you have new raw data or need to reprocess the existing raw data due to changes in the extraction logic.
* Keep as False: When you already have a processed data_raw.parquet file and don't need to re-process it.

• `PROC_LAGS` (False): This flag controls whether to process and create lag features. If True, the code will calculate lag features based on event history for different time windows and store them in the data_lags.parquet file.

* Change to True: When you need to recalculate lag features, such as when you've changed the time window definitions or added new events.
* Keep as False: When you already have the desired lag features in data_lags.parquet and don't need to recompute them.

• `FRAC_0` (.001): This variable sets the sampling fraction for events with payment_event_flag = 0 when processing lags. This is used to reduce the size of the dataset for faster processing while maintaining a representative sample.

* Adjust Value: You might change this value depending on the size of your dataset and the desired balance between processing time and data representativeness.

• `PROC_VECS` (True): This flag controls whether to vectorize the data using TF-IDF. If True, the code will transform the lag features into numerical vectors using TF-IDF and store them in data_vec_train.parquet and data_vec_test.parquet files.

* Change to False: If you want to experiment with other vectorization methods or use the data in its raw form.
* Keep as True: When TF-IDF vectorization is the desired approach for your modeling tasks.

**File Paths:**

• `files_path`, `files_mask`: These define the location and pattern of the raw data files.
• `file_path_ds`, `file_path_lags`, `file_path_trn`, `file_path_tst`: These specify the storage locations for the processed datasets at different stages of the pipeline.

By understanding these flags and variables, you can control which parts of the data processing pipeline are executed, allowing for efficient experimentation and iteration.
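
As a rough illustration of how the flags and the version string fit together, the sketch below builds the versioned S3 paths and lists which stages would be recomputed for a given flag combination. The helper names `stage_paths` and `stages_to_run` are hypothetical and do not appear in the notebook; only the path layout is taken from it.

```python
def stage_paths(bucket, ver):
    """Build the versioned output paths, following the layout used in this notebook."""
    base = f's3a://{bucket}/work/{ver}'
    return {
        'ds': f'{base}/data_raw.parquet',
        'lags': f'{base}/data_lags.parquet',
        'trn': f'{base}/data_vec_train.parquet',
        'tst': f'{base}/data_vec_test.parquet',
    }


def stages_to_run(proc_ds, proc_lags, proc_vecs):
    """Return the names of the stages that will be recomputed rather than loaded."""
    flags = [('raw', proc_ds), ('lags', proc_lags), ('vectorize', proc_vecs)]
    return [name for name, enabled in flags if enabled]


paths = stage_paths('pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a7b', 'vP1')
print(paths['lags'])
print(stages_to_run(proc_ds=False, proc_lags=False, proc_vecs=True))
```

Changing `VER` changes every path at once, so a new version never overwrites the outputs of an older one.
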


v5 = data week, date range `datetime.datetime(2024, 1, 1, 0, 0, 0)` to `datetime.datetime(2024, 4, 16, 23, 59, 59)`

## <span style="color: red;">Insert your version below, e.g. V9. If creating NEW data, set all flags to True</span>

In [12]:
VER = 'vP1' # <-- insert YOUR version here 
PROC_DS = False # <-- set to True when creating new data
PROC_LAGS = False # <-- set to True when creating new data
FRAC_0 = .001 # used only if `PROC_LAGS = True`
PROC_VECS = True
BUCKET = 'pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a7b'
PRJ_PATH = '/home/jovyan/__RAYPFP'

files_path = 'data/events'
files_mask = f'{files_path}/data_202*-*-*.csv'

file_path_ds = f's3a://{BUCKET}/work/{VER}/data_raw.parquet'
file_path_lags = f's3a://{BUCKET}/work/{VER}/data_lags.parquet'
file_path_trn = f's3a://{BUCKET}/work/{VER}/data_vec_train.parquet'
file_path_tst = f's3a://{BUCKET}/work/{VER}/data_vec_test.parquet'
file_path_lags1 = f's3a://{BUCKET}/work/{VER}/data_lags1.parquet'


In [13]:
#file_path_lags1 = f's3a://{BUCKET}/work/{VER}/data_lags1.parquet'


In [14]:
def clean_parquet(path):
    cmd = path.replace(
        f's3a://{BUCKET}',
        f'rm -rf {PRJ_PATH}'
    )
    !{cmd}
    return f'command to run: {cmd}'
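
`clean_parquet` relies on the bucket contents also being visible on the local file system under `PRJ_PATH`, and it shells out to `rm -rf`. A more defensive variant could do the same path mapping in Python and refuse paths outside the bucket. The sketch below assumes that same bucket-to-directory mapping; `local_path_for` and `clean_parquet_safe` are hypothetical names, not functions from the notebook.

```python
import shutil
from pathlib import Path


def local_path_for(path, bucket, prj_path):
    """Map an s3a:// path inside `bucket` to the matching path under `prj_path`."""
    prefix = f's3a://{bucket}/'
    if not path.startswith(prefix):
        raise ValueError(f'path is not inside bucket {bucket!r}: {path}')
    return Path(prj_path) / path[len(prefix):]


def clean_parquet_safe(path, bucket, prj_path):
    """Delete the local copy of a Parquet output if it exists; return the target path."""
    target = local_path_for(path, bucket, prj_path)
    if target.is_dir():
        shutil.rmtree(target)   # Parquet datasets are written as directories
    elif target.exists():
        target.unlink()         # single-file case
    return target
```

Raising on an unexpected prefix avoids the failure mode where a path that does not contain the bucket prefix is passed to the shell unchanged.
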

In [15]:
# Create filenames that will be used later when saving predictions 

current_datetime = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
current_user = os.environ['JUPYTERHUB_SERVICE_PREFIX']
current_user = current_user.split("/")[2]  
unique_identifier = str(uuid.uuid4())[:8]  # Generate a unique identifier (first 8 characters)


In [16]:
print("**Filename Information:**")
print(f"- Current Date and Time: {current_datetime}")
print(f"- Current User: {current_user}")
print(f"- Unique Identifier: {unique_identifier}")

**Filename Information:**
- Current Date and Time: 20240423_125752
- Current User: st095435
- Unique Identifier: 3fb7a821
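
The three values printed above are meant to be combined into filenames later. One possible way to do that is sketched below; the `predictions` prefix, the `.csv` extension, and the helper names are assumptions for illustration and may differ from what the notebook does further on.

```python
import datetime


def user_from_prefix(service_prefix):
    """Extract the user name from a JupyterHub prefix such as '/user/st095435/'."""
    # '/user/st095435/'.split('/') gives ['', 'user', 'st095435', '']
    return service_prefix.split('/')[2]


def build_filename(prefix, user, when, token, ext='csv'):
    """Compose '<prefix>_<user>_<YYYYmmdd_HHMMSS>_<token>.<ext>' (hypothetical pattern)."""
    stamp = when.strftime('%Y%m%d_%H%M%S')
    return f'{prefix}_{user}_{stamp}_{token}.{ext}'


example = build_filename(
    'predictions',
    user_from_prefix('/user/st095435/'),
    datetime.datetime(2024, 4, 23, 12, 57, 52),
    '3fb7a821',
)
print(example)
```
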


### 2.1. Load or preprocess data - `raw` stage


This section checks the `PROC_DS` flag.

If True, it reads the raw CSV data from S3, parses timestamps, flags payment events that fall within a specific timeframe, and selects the relevant columns.

The processed data is then saved as a Parquet file in S3 and the DataFrame is unpersisted.

Regardless of the flag, the cell then reads the processed data back from the Parquet file and displays a few rows.
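
The flagging rule can be stated in plain Python for a single event: an event is positive when its name contains one of the two payment markers and its timestamp lies inside the inclusive date range. This is only an illustration of the logic, not the Spark code; `PAYMENT_MARKERS` and `payment_event_flag` as a function are names introduced here, and null handling is ignored.

```python
import datetime

# Substrings matched by the two `like('%...%')` conditions in the notebook
PAYMENT_MARKERS = (
    'Мои штрафы/Оплата/Завершили оплату',
    'Мои штрафы/Оплата/Платёж принят',
)


def payment_event_flag(event_name, event_datetime, min_dt, max_dt):
    """Return 1 for a payment event inside [min_dt, max_dt], otherwise 0."""
    is_payment = any(marker in event_name for marker in PAYMENT_MARKERS)
    in_window = min_dt <= event_datetime <= max_dt  # inclusive on both ends
    return int(is_payment and in_window)


min_dt = datetime.datetime(2024, 3, 21, 0, 0, 0)
max_dt = datetime.datetime(2024, 4, 18, 23, 59, 59)

inside = payment_event_flag(
    'Мои штрафы/Оплата/Платёж принят', datetime.datetime(2024, 4, 1, 10, 0), min_dt, max_dt)
too_early = payment_event_flag(
    'Мои штрафы/Оплата/Платёж принят', datetime.datetime(2024, 1, 5, 10, 0), min_dt, max_dt)
print(inside, too_early)
```

Note that payment events outside the date range are kept in the data with a flag of 0; they are not dropped at this stage.
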



## <span style="color: red;">Ensure dates used are as follows</span>

flag_min_datetime = datetime.datetime(2024, 3, 21, 0, 0, 0)

flag_max_datetime = datetime.datetime(2024, 4, 18, 23, 59, 59)

In [17]:
%%time
if PROC_DS:
    sdf = spark.read.option('escape','"').csv(f's3a://{BUCKET}/{files_mask}', header=True)
    sdf = sdf.withColumn('event_datetime', F.to_timestamp("event_datetime"))
    flag_min_datetime = datetime.datetime(2024, 3, 21, 0, 0, 0)
    flag_max_datetime = datetime.datetime(2024, 4, 18, 23, 59, 59)
    print(flag_min_datetime, flag_max_datetime)
    sdf = sdf.withColumn(
        'payment_event_flag', 
        (
            (F.col('event_name').like('%Мои штрафы/Оплата/Завершили оплату%') | 
            F.col('event_name').like('%Мои штрафы/Оплата/Платёж принят%')) &
            F.col('event_datetime').between(flag_min_datetime, flag_max_datetime)
        ).cast("int")
    )
    sdf = sdf.select(
        'profile_id',
        'event_datetime',
        'payment_event_flag',
        'event_name'
    )
    sdf.repartition(1).write.parquet(file_path_ds)
    sdf.unpersist()
sdf = spark.read.parquet(file_path_ds)
sdf.limit(5).toPandas()



if not PROC_DS:
    # Code to execute if PROC_DS is False
    flag_min_datetime = datetime.datetime(2024, 3, 21, 0, 0, 0)
    flag_max_datetime = datetime.datetime(2024, 4, 18, 23, 59, 59)
    print("flag min datetime: ", flag_min_datetime, '\n', 
          "flag max datetime: ", flag_max_datetime)

flag min datetime:  2024-03-21 00:00:00 
 flag max datetime:  2024-04-18 23:59:59
CPU times: user 13.6 ms, sys: 9.68 ms, total: 23.2 ms
Wall time: 6.94 s


In [18]:
sdf.groupBy('payment_event_flag').count().show()

+------------------+---------+
|payment_event_flag|    count|
+------------------+---------+
|                 0|752434695|
|                 1|    45742|
+------------------+---------+
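
The counts above show how imbalanced the raw data is. The arithmetic below, using only the two printed counts and the `FRAC_0` value from the configuration cell, gives the share of positive events and the approximate number of negatives that a 0.1% sample would keep. The sampled size is an expectation only, since `sampleBy` does not guarantee exact fractions.

```python
n_neg = 752_434_695   # payment_event_flag = 0
n_pos = 45_742        # payment_event_flag = 1
frac_0 = 0.001        # FRAC_0 from the configuration cell

positive_share = n_pos / (n_neg + n_pos)
expected_neg_sample = n_neg * frac_0
neg_per_pos_after_sampling = expected_neg_sample / n_pos

print(f"positive share in raw data: {positive_share:.6%}")
print(f"expected negatives after sampling: {expected_neg_sample:,.0f}")
print(f"negatives per positive after sampling: {neg_per_pos_after_sampling:.1f}")
```
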



In [19]:
# Init Weights and Biases to begin storing data 

wandb.init(project="ray-diploma", config={
    "version": VER,
    "proc_ds": PROC_DS,
    "proc_lags": PROC_LAGS,
    "proc_vecs": PROC_VECS,
    "frac_0": FRAC_0,
    "min_datetime": flag_min_datetime,
    "max_datetime": flag_max_datetime,
    "current_user": current_user, 
    "uuid": unique_identifier,
    # ... other common parameters ...
}, 
           mode="online", 
           dir="/home/jovyan/")

In [20]:
settings = wandb.Settings()
print(settings) 



In [21]:
# Weights and Biases logging here:

counts = sdf.groupBy('payment_event_flag').count().toPandas()
wandb.log({"data_count": counts})
print("**After Initial Data Loading**")
print(counts)

**After Initial Data Loading**
   payment_event_flag      count
0                   0  752434695
1                   1      45742


### 2.2. Load or preprocess data - `lags` stage


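The `dataset_lags` function defines window specifications for consecutive one-week intervals, from 1 to 2 weeks back up to 26 to 27 weeks back (a 10-minutes-to-1-week window is currently disabled).

It then uses these windows to collect the list of event names within each time interval for each profile, creating lag features.

If `PROC_LAGS` is True, the data is first sampled by `payment_event_flag`, keeping all positive examples and the fraction `FRAC_0` of negative ones. The lag features are then computed, rows are filtered to the flag date range, and rows whose lag lists are all empty are dropped.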

The processed data with lag features is saved as a parquet file and unloaded from memory.

Finally, it reads the data with lags and displays the count of positive and negative examples.
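
To make the window definitions concrete, the sketch below generates the same 26 weekly `(start, end)` offsets in seconds and applies one of them to a toy event list in plain Python. It mirrors the logic of the Spark range windows but is not Spark code; `weekly_windows` and `collect_lag` are names introduced for this illustration.

```python
DAY = 24 * 60 * 60  # seconds in a day


def weekly_windows(n_weeks=26, shift=0):
    """Return {column_name: (start_offset, end_offset)} in seconds relative to the event."""
    windows = {}
    for n in range(1, n_weeks + 1):
        unit = 'week' if n == 1 else 'weeks'
        name = f'lag_{n}{unit}_to_{n + 1}weeks'
        windows[name] = (-(n + 1) * 7 * DAY + shift, -n * 7 * DAY + shift)
    return windows


def collect_lag(events, now, bounds):
    """Collect names of events whose timestamp falls in [now + start, now + end].

    `events` is a list of (epoch_seconds, event_name) pairs for one profile.
    """
    start, end = bounds
    return [name for ts, name in events if now + start <= ts <= now + end]


windows = weekly_windows()
now = 1_713_000_000  # arbitrary reference timestamp
events = [(now - 8 * DAY, 'a'), (now - 15 * DAY, 'b'), (now - 1 * DAY, 'c')]

print(collect_lag(events, now, windows['lag_1week_to_2weeks']))
print(collect_lag(events, now, windows['lag_2weeks_to_3weeks']))
```

Because both bounds of a range window are inclusive, an event that lies exactly a whole number of weeks back falls into two adjacent windows.
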

## Lags new implementation



In [22]:
def dataset_lags(sdf, shift=0):
    day = 24 * 60 * 60  # seconds in a day
    week = 7 * day

    def window(start, end):
        # Range window over event time (epoch seconds), per profile
        return (Window()
              .partitionBy(F.col('profile_id'))
              .orderBy(F.col('event_datetime').cast('timestamp').cast('long'))
              .rangeBetween(start + shift, end + shift))

    # Weekly windows: 1-2 weeks back, 2-3 weeks back, ..., 26-27 weeks back.
    # The 10min-to-1week window, window(-7 * day, -10 * 60), is currently disabled.
    lag_cols = []
    for n in range(1, 27):
        unit = 'week' if n == 1 else 'weeks'
        name = f'lag_{n}{unit}_to_{n + 1}weeks'
        sdf = sdf.withColumn(
            name,
            F.collect_list('event_name').over(window(-(n + 1) * week, -n * week))
        )
        lag_cols.append(name)

    return (
        sdf
            .select(
                'profile_id',
                'event_datetime',
                'payment_event_flag',
                'event_name',
                *lag_cols
            )
        .orderBy(F.col('event_datetime'), ascending=False)
    )

## -- New implementation of time lag windows -- 

Added the following code:

* `clean_parquet(file_path_lags)`
* `dates = (flag_min_datetime, flag_max_datetime)`

1. `clean_parquet(file_path_lags)`:

This call removes any existing Parquet files at the `file_path_lags` location before new data is written.

Calling it before writing the new lag data ensures that earlier output at the same location is removed, which prevents conflicts and inconsistent data.

**When we want to reuse the existing data for training and grid searching our models, this step is skipped (here `PROC_LAGS = False`, so the block does not run).**

2. `dates = (flag_min_datetime, flag_max_datetime)`:

This line creates a tuple named `dates` holding two datetime values, `flag_min_datetime` and `flag_max_datetime`, which together represent the date range used to filter the data.

Both values are defined earlier in the notebook, in the same way as `flag_min_datetime = datetime.datetime(2023, 8, 1, 0, 0, 0)` and `flag_max_datetime = datetime.datetime(2023, 8, 31, 23, 59, 59)`; this run uses 2024-03-21 00:00:00 and 2024-04-18 23:59:59.

After creating the `dates` tuple, the code uses it to filter the `sdf` DataFrame on the `event_datetime` column.

The asterisk (*) before dates is used to unpack the tuple and pass the individual datetime values as arguments to the between function.
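
The unpacking can be seen in isolation with a small plain-Python stand-in for the column method. The `between` function here is a hypothetical helper with the same inclusive semantics, not the Spark API itself.

```python
import datetime


def between(value, lower, upper):
    """Inclusive range check, analogous to the column-level `between`."""
    return lower <= value <= upper


dates = (
    datetime.datetime(2024, 3, 21, 0, 0, 0),
    datetime.datetime(2024, 4, 18, 23, 59, 59),
)
event = datetime.datetime(2024, 4, 1, 12, 0, 0)

# `*dates` expands the tuple into two positional arguments,
# so these two calls are equivalent
unpacked = between(event, *dates)
explicit = between(event, dates[0], dates[1])
print(unpacked, explicit)
```
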


In [23]:
# Old version before VG update to filter for dates



if PROC_LAGS:
    sdf = sdf.sampleBy(
        'payment_event_flag', 
        fractions={0: FRAC_0, 1: 1}, 
        seed=2023
    )
    sdf = dataset_lags(sdf)
    dates  = (flag_min_datetime, flag_max_datetime)
    sdf = sdf.filter(sdf.event_datetime.between(*dates))
    sdf = sdf.filter(
        #(F.size('lag_10min_to_1week')      > 0) |
        (F.size('lag_1week_to_2weeks')     > 0) |
        (F.size('lag_2weeks_to_3weeks')    > 0) |
        (F.size('lag_3weeks_to_4weeks')    > 0) |
        (F.size('lag_4weeks_to_5weeks')    > 0) |
        (F.size('lag_5weeks_to_6weeks')    > 0) |
        (F.size('lag_6weeks_to_7weeks')    > 0) |
        (F.size('lag_7weeks_to_8weeks')    > 0) |
        (F.size('lag_8weeks_to_9weeks')    > 0) |
        (F.size('lag_9weeks_to_10weeks')   > 0) |
        (F.size('lag_10weeks_to_11weeks')  > 0) |
        (F.size('lag_11weeks_to_12weeks')  > 0) |
        (F.size('lag_12weeks_to_13weeks')  > 0) |
        (F.size('lag_13weeks_to_14weeks')  > 0) |
        (F.size('lag_14weeks_to_15weeks')  > 0) |
        (F.size('lag_15weeks_to_16weeks')  > 0) |
        (F.size('lag_16weeks_to_17weeks')  > 0) |
        (F.size('lag_17weeks_to_18weeks')  > 0) |
        (F.size('lag_18weeks_to_19weeks')  > 0) |
        (F.size('lag_19weeks_to_20weeks')  > 0) |
        (F.size('lag_20weeks_to_21weeks')  > 0) |
        (F.size('lag_21weeks_to_22weeks')  > 0) |
        (F.size('lag_22weeks_to_23weeks')  > 0) |
        (F.size('lag_23weeks_to_24weeks')  > 0) |
        (F.size('lag_24weeks_to_25weeks')  > 0) |
        (F.size('lag_25weeks_to_26weeks')  > 0) |
        (F.size('lag_26weeks_to_27weeks')  > 0) 
    )
    clean_parquet(file_path_lags)
    sdf.repartition(8).write.parquet(file_path_lags)
    sdf.unpersist()
sdf = spark.read.parquet(file_path_lags)


In [24]:
# Weights and Biases logging here:
counts = sdf.groupBy('payment_event_flag').count().toPandas()
wandb.log({"data_count_with_lags": counts})

## <span style="color: red;">Double check the dataset has been reduced in the following cell</span>


In [25]:
print("**Dataset Size After Lag Feature Creation**", '\n', counts)

**Dataset Size After Lag Feature Creation** 
    payment_event_flag  count
0                   1  23109
1                   0  19594


In [26]:
sdf.printSchema()

root
 |-- profile_id: string (nullable = true)
 |-- event_datetime: timestamp (nullable = true)
 |-- payment_event_flag: integer (nullable = true)
 |-- event_name: string (nullable = true)
 |-- lag_1week_to_2weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_2weeks_to_3weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_3weeks_to_4weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_4weeks_to_5weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_5weeks_to_6weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_6weeks_to_7weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_7weeks_to_8weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_8weeks_to_9weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_9weeks_to_1

### Train test split process 

In [27]:
# Define train test split function 

def stratified_split(sdf, frac, label, seed=2023):
    zeros = sdf.filter(sdf[label] == 0)
    ones = sdf.filter(sdf[label] == 1)
    train_, test_ = zeros.randomSplit([1 - frac, frac], seed=seed)
    train, test = ones.randomSplit([1 - frac, frac], seed=seed)
    train = train.union(train_)
    test = test.union(test_)
    return train, test
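
The same idea can be tried without Spark on a list of rows: each class is split separately, and every row is sent to the test set with probability `frac`. This is a plain-Python analogue for illustration only; `stratified_split_rows` is a hypothetical name, and as with a per-row random split the resulting sizes are approximate rather than exact.

```python
import random


def stratified_split_rows(rows, frac, label, seed=2023):
    """Split dict rows so that each class sends roughly `frac` of its rows to test."""
    rng = random.Random(seed)
    train, test = [], []
    for value in (0, 1):
        for row in rows:
            if row[label] != value:
                continue
            # Independent draw per row, so split sizes are only approximately frac
            (test if rng.random() < frac else train).append(row)
    return train, test


rows = [{'id': i, 'payment_event_flag': i % 2} for i in range(20000)]
train, test = stratified_split_rows(rows, frac=0.2, label='payment_event_flag')

for cls in (0, 1):
    n_test = sum(1 for r in test if r['payment_event_flag'] == cls)
    print(f"class {cls}: test share = {n_test / 10000:.3f}")
```
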

In [28]:
# Conduct train test split 

sdf_train, sdf_test = stratified_split(
    sdf, 
    frac=.2, # Size of the test dataset
    label='payment_event_flag',
    seed=2023
)

## <span style="color: red;">Check the data has been split and classes are approx equal</span>


In [29]:
sdf_train.groupBy('payment_event_flag').count().toPandas()

   payment_event_flag  count
0                   1  18440
1                   0  15649


In [30]:
sdf_test.groupBy('payment_event_flag').count().toPandas()

   payment_event_flag  count
0                   1   4669
1                   0   3945
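
A quick arithmetic check on the train and test counts above confirms that each class was split at roughly the requested 20% and that the totals match the dataset size after lag feature creation. The numbers are copied from the cell outputs; the variable names are illustrative.

```python
train_counts = {1: 18440, 0: 15649}
test_counts = {1: 4669, 0: 3945}

totals = {cls: train_counts[cls] + test_counts[cls] for cls in (0, 1)}
test_share = {cls: test_counts[cls] / totals[cls] for cls in (0, 1)}

for cls in (1, 0):
    print(f"class {cls}: total = {totals[cls]}, test share = {test_share[cls]:.4f}")
```
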


In [31]:
# Weights and Biases logging here:

train_counts = sdf_train.groupBy('payment_event_flag').count().toPandas()
wandb.log({"train_data_count": train_counts})

test_counts = sdf_test.groupBy('payment_event_flag').count().toPandas()
wandb.log({"test_data_count": test_counts})

## <span style="color: red;">Ensure date ranges used are the ones we set earlier</span>


In [32]:
from pyspark.sql.functions import min, max

# Find the minimum and maximum dates
min_date = sdf.agg(min("event_datetime")).collect()[0][0]
max_date = sdf.agg(max("event_datetime")).collect()[0][0]

print(f"Minimum Date: {min_date}")
print(f"Maximum Date: {max_date}")

Minimum Date: 2024-03-21 00:00:17
Maximum Date: 2024-04-18 23:59:23


In [33]:
# Training set
train_min_date = sdf_train.agg(min("event_datetime")).collect()[0][0]
train_max_date = sdf_train.agg(max("event_datetime")).collect()[0][0]

print(f"Training Set Minimum Date: {train_min_date}")
print(f"Training Set Maximum Date: {train_max_date}")


Training Set Minimum Date: 2024-03-21 00:00:17
Training Set Maximum Date: 2024-04-18 23:57:47


In [34]:
# Test set
test_min_date = sdf_test.agg(min("event_datetime")).collect()[0][0]
test_max_date = sdf_test.agg(max("event_datetime")).collect()[0][0]

print(f"Test Set Minimum Date: {test_min_date}")
print(f"Test Set Maximum Date: {test_max_date}")

Test Set Minimum Date: 2024-03-21 00:05:21
Test Set Maximum Date: 2024-04-18 23:59:23


### 2.3. Load or preprocess data - `vectorize` stage


This section defines the list of lag feature columns to vectorize.

The `datasets_tfidf` function applies `TF-IDF` vectorization to each lag column of the training and test datasets.

It uses `HashingTF` to hash each list of event names into a fixed-length term-frequency vector (`numFeatures` buckets, so distinct events can share a bucket) and `IDF`, fitted on the training set only, to rescale each bucket by its inverse document frequency.

The function also builds a dictionary that maps each feature index to the event names hashed into that bucket.
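The hashing-then-rescaling idea can be illustrated without Spark. The sketch below is a simplified pure-Python stand-in, not Spark's implementation: `bucket_of` uses CRC32 as an assumed hash (Spark's `HashingTF` uses MurmurHash3, so bucket numbers will differ), the event names are hypothetical, and `minDocFreq` is omitted for brevity. The IDF weight follows the formula documented for Spark's `IDF`, log((m + 1) / (df + 1)), where m is the number of rows and df the number of rows with a non-zero count in the bucket.

```python
import math
import zlib
from collections import Counter

def bucket_of(term, num_features):
    # Assumed stand-in hash; Spark's HashingTF uses MurmurHash3 instead
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashing_tf(events, num_features):
    # Term frequency per hash bucket for one row (one list of event names)
    return Counter(bucket_of(e, num_features) for e in events)

def fit_idf(tf_rows):
    # Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1))
    m = len(tf_rows)
    doc_freq = Counter(b for row in tf_rows for b in row)
    return {b: math.log((m + 1) / (df + 1)) for b, df in doc_freq.items()}

def tfidf(tf_row, idf):
    # Buckets never seen while fitting get weight 0
    return {b: tf * idf.get(b, 0.0) for b, tf in tf_row.items()}

# Hypothetical event lists, one per row
train = [
    ["open_app", "view_item"],
    ["open_app"],
    ["open_app", "add_to_cart", "add_to_cart"],
]
tf_train = [hashing_tf(row, num_features=100) for row in train]
idf = fit_idf(tf_train)  # fitted on the training rows only
vectors = [tfidf(row, idf) for row in tf_train]
```

An event that occurs in every training row ends up with an IDF weight of log(4 / 4) = 0, so very common events contribute nothing to the resulting vector.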

In [35]:
lags = [
#    'lag_10min_to_1week',
    'lag_1week_to_2weeks',
    'lag_2weeks_to_3weeks',
    'lag_3weeks_to_4weeks',
    'lag_4weeks_to_5weeks',
    'lag_5weeks_to_6weeks',
    'lag_6weeks_to_7weeks',
    'lag_7weeks_to_8weeks',
    'lag_8weeks_to_9weeks',
    'lag_9weeks_to_10weeks',
    'lag_10weeks_to_11weeks',
    'lag_11weeks_to_12weeks',
    'lag_12weeks_to_13weeks',
    'lag_13weeks_to_14weeks',
    'lag_14weeks_to_15weeks',
    'lag_15weeks_to_16weeks',
    'lag_16weeks_to_17weeks',
    'lag_17weeks_to_18weeks',
    'lag_18weeks_to_19weeks',
    'lag_19weeks_to_20weeks',
    'lag_20weeks_to_21weeks',
    'lag_21weeks_to_22weeks',
    'lag_22weeks_to_23weeks',
    'lag_23weeks_to_24weeks',
    'lag_24weeks_to_25weeks',
    'lag_25weeks_to_26weeks',
    'lag_26weeks_to_27weeks'
]

## TF-IDF implementation 

In [36]:
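from pyspark.ml.feature import HashingTF, IDF

def datasets_tfidf(sdf_train, sdf_test, lags, min_freq=3, num_features=10):
    """Add a `<lag>_tfidf` column for every lag column; IDF is fitted on sdf_train only.

    Returns the transformed train and test DataFrames and a dict that maps each
    assembled feature index to the `<lag>_<event>` names hashed into that bucket.
    """
    features_dict = {}
    for count, lag in enumerate(tqdm(lags)):
        # Term frequencies: hash each event name into one of num_features buckets
        hashingTF = HashingTF(
            inputCol=lag,
            outputCol=lag + '_tf',
            numFeatures=num_features
        )
        featurizedData = hashingTF.transform(sdf_train)
        # Inverse document frequencies, learned from the training set
        idf = IDF(
            inputCol=lag + '_tf',
            outputCol=lag + '_tfidf',
            minDocFreq=min_freq
        )
        idfModel = idf.fit(featurizedData)
        sdf_train = idfModel.transform(featurizedData)
        sdf_test = idfModel.transform(
            hashingTF.transform(sdf_test)
        )
        # Flatten the distinct event lists of this lag column into event names
        events = [
            x
            for xs in sdf_train.select(lag).distinct().rdd.flatMap(lambda x: x).collect()
            for x in xs
        ]
        # Bucket index of every event, keyed by '<lag>_<event>'
        hash_dict = {}
        for e in events:
            hash_dict[lag + '_' + e] = hashingTF.indexOf(e)
        # Offset by count * num_features to match the position in the assembled vector
        for feat_num in range(num_features):
            tmp_list = []
            for k, v in hash_dict.items():
                if v == feat_num: tmp_list.append(k)
            features_dict[count * num_features + feat_num] = tmp_list
    return sdf_train, sdf_test, features_dict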

## Breakdown of the `if PROC_VECS` code


**Conditional Execution:**

* `if PROC_VECS:`: The code within this block runs only if the `PROC_VECS` flag is `True`. The flag controls whether TF-IDF vectorization is recomputed and written to storage.

**TF-IDF Vectorization:**

`sdf_train, sdf_test, features_dict = datasets_tfidf(...)`: This line calls the `datasets_tfidf` function, which performs TF-IDF vectorization on the lag features in the `sdf_train` and `sdf_test` DataFrames.
* The `lags` argument provides the list of lag feature column names to be vectorized.
* The `min_freq=3` argument is passed to `IDF` as `minDocFreq`: a bucket that appears in fewer than 3 training rows gets an IDF weight of 0.
* The `num_features=100` argument sets the number of hash buckets, and therefore the length of the TF-IDF vector, for each lag column.

**The function returns three values:**

* `sdf_train`: The training DataFrame with the added TF-IDF vector columns.
* `sdf_test`: The test DataFrame with the added TF-IDF vector columns.
* `features_dict`: A dictionary mapping each feature index to the event names hashed into that bucket.

**Cleaning and Saving Parquet Files:**

`clean_parquet(file_path_trn)`: This line calls a function (not shown) to clean up any existing Parquet files at the specified path (`file_path_trn`) before saving the new data.

`sdf_train.repartition(8).write.parquet(file_path_trn)`: The training DataFrame (`sdf_train`) is repartitioned into 8 partitions, so the data is written as 8 Parquet part files.

* The `write.parquet` method saves the DataFrame in Parquet format at the specified path (`file_path_trn`).
* The same process is repeated for the test DataFrame (`sdf_test`) using `file_path_tst`.

**Unpersisting DataFrames:**

* `sdf_train.unpersist()`, `sdf_test.unpersist()`: These calls release any cached copies of the DataFrames from Spark's memory. Since the data has been saved to disk, it can be reloaded later, freeing up memory for subsequent processing.

**Reloading DataFrames:**

* `sdf_train = spark.read.parquet(file_path_trn)`: This line reloads the training data from the saved Parquet files. It runs on every execution, whether or not `PROC_VECS` is set.

The same is done for the test data using `file_path_tst`. When `PROC_VECS` is `False`, `datasets_tfidf` is then called on the reloaded data so that `features_dict` is still available.

In [37]:
# NEW version with clean_parquet (TF-IDF)

if PROC_VECS:
    sdf_train, sdf_test, features_dict = datasets_tfidf(
        sdf_train,
        sdf_test,
        lags,
        #vec_size=10,
        min_freq=3,
        num_features=100
    )
    clean_parquet(file_path_trn)
    sdf_train.repartition(8).write.parquet(file_path_trn)
    clean_parquet(file_path_tst)
    sdf_test.repartition(8).write.parquet(file_path_tst)
    sdf_train.unpersist()
    sdf_test.unpersist()
sdf_train = spark.read.parquet(file_path_trn)
sdf_test = spark.read.parquet(file_path_tst)


if not PROC_VECS:
    sdf_train, sdf_test, features_dict = datasets_tfidf(
        sdf_train, 
        sdf_test, 
        lags, 
        min_freq=3,
        num_features=100
    )
    print(len(features_dict.items()))


  0%|          | 0/26 [00:00<?, ?it/s]

In [38]:
# Check data size after reloading
print("**Training Set After TF-IDF**")
sdf_train.groupBy('payment_event_flag').count().show()
print("**Testing Set After TF-IDF**")
sdf_test.groupBy('payment_event_flag').count().show()

**Training Set After TF-IDF**
+------------------+-----+
|payment_event_flag|count|
+------------------+-----+
|                 1|18440|
|                 0|15649|
+------------------+-----+

**Testing Set After TF-IDF**
+------------------+-----+
|payment_event_flag|count|
+------------------+-----+
|                 1| 4669|
|                 0| 3945|
+------------------+-----+



In [39]:
sdf_train.printSchema()

root
 |-- profile_id: string (nullable = true)
 |-- event_datetime: timestamp (nullable = true)
 |-- payment_event_flag: integer (nullable = true)
 |-- event_name: string (nullable = true)
 |-- lag_1week_to_2weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_2weeks_to_3weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_3weeks_to_4weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_4weeks_to_5weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_5weeks_to_6weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_6weeks_to_7weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_7weeks_to_8weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_8weeks_to_9weeks: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- lag_9weeks_to_1

In [None]:
"""from pyspark.ml.feature import Word2Vec

def datasets_vecorized(sdf_train, sdf_test, lags, vec_size=10):
    vectorizers = []
    for lag in tqdm(lags):
        word2Vec = Word2Vec(
            vectorSize=vec_size,
            minCount=0,
            inputCol=lag,
            outputCol=lag + '_vec'
        )
        vectorizer = word2Vec.fit(sdf_train)
        sdf_train = vectorizer.transform(sdf_train)
        sdf_test = vectorizer.transform(sdf_test)
        vectorizers.append(vectorizer)
    return sdf_train, sdf_test, vectorizers"""

## 3. Model

### 3.1. Features assembling


The `features_assembled` function prepares the data for model training:

* It selects the `TF-IDF` features and the target variable (`payment_event_flag`).
* It uses `VectorAssembler` to combine the `TF-IDF` features into a single vector column named `features`.
* It returns a DataFrame with the target variable and the assembled feature vector.

The `upsampled` function can be used to address class imbalance:

* It separates the data into negative (label 0) and positive (label 1) examples.
* With `upsample='max'` it repeats the positive examples `int(n_zeros / n_ones)` times in total; with an integer setting it repeats them that many times.
* It therefore only rebalances the data when the positive class is the minority and the negatives outnumber it at least 2:1. In this run the training set has 18440 positives and 15649 negatives, so `'max'` adds no rows.

The code then:
* Defines the list of lag features to be used.
* Assembles features for both training and test sets.
* Optionally performs **upsampling** on the training set (and potentially the test set) if `UPSAMPLE` is enabled.
* Displays the class distribution after upsampling.
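The arithmetic behind the `'max'` setting is easy to check by hand. The helper below is a hypothetical illustration (it is not part of the notebook) that mirrors the loop bounds of `upsampled` and returns how many extra copies of the label-1 rows get appended for given class counts.

```python
def extra_copies(n_zeros, n_ones, upsample='max'):
    """Number of extra copies of the label-1 rows appended by the upsampling loop."""
    if upsample == 'max':
        up_count = int(n_zeros / n_ones)  # whole-number ratio, rounded down
    else:
        up_count = upsample
    # Mirrors `for _ in range(up_count - 1)`: a non-positive range adds nothing
    return max(up_count - 1, 0)

# Class counts of the training set in this notebook: label 1 is the majority
copies_here = extra_copies(n_zeros=15649, n_ones=18440)

# Hypothetical imbalanced case: three negatives per positive
copies_imbalanced = extra_copies(n_zeros=30000, n_ones=10000)
ones_after = 10000 * (1 + copies_imbalanced)
```

With the notebook's counts the ratio rounds down to 0, so no rows are added and the class distribution stays unchanged; in the hypothetical 3:1 case two extra copies bring the positives to 30000.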

In [40]:
def features_assembled(sdf, feats):
    cols_to_model = [x + '_tfidf' for x in feats]
    cols_to_model.extend(['payment_event_flag'])
    print('columns to model:', cols_to_model)
    vecAssembler = VectorAssembler(
        inputCols=[c for c in cols_to_model if c != 'payment_event_flag'], 
        outputCol='features'
    )
    features = sdf.select(cols_to_model)
    features_vec = vecAssembler.transform(features)
    features_data = features_vec.select('payment_event_flag', 'features')
    return features_data

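def upsampled(sdf, label, upsample='max'):
    """Append extra copies of the label-1 rows (assumed to be the minority class).

    upsample='max' repeats them int(n_zeros / n_ones) times in total, an integer
    repeats them that many times, and None or 'none' returns sdf unchanged.
    """
    if upsample in (None, 'none'):
        return sdf
    zeros = sdf.filter(sdf[label] == 0)
    ones = sdf.filter(sdf[label] == 1)
    res = zeros.union(ones)
    if upsample == 'max':
        # Whole-number ratio only; 0 or 1 means no extra copies are added
        up_count = int(zeros.count() / ones.count())
    else:
        up_count = int(upsample)
    for _ in range(up_count - 1):
        res = res.union(ones)
    return res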

In [41]:
# "MAX" strategy: UPSAMPLE = 'max' tells the upsampled function to repeat the label-1 rows
# int(n_zeros / n_ones) times in total, which approaches a 1:1 class ratio when label 1
# is the minority class. If label 1 is already the majority, no rows are added.

UPSAMPLE = 'max' # 'max', an integer number of repeats, or None to skip upsampling


In [42]:
# Setting feats, make sure to comment out (#) any lags we will not use for train/pred 

feats = [
#    'lag_10min_to_1week',
    'lag_1week_to_2weeks',
    'lag_2weeks_to_3weeks',
    'lag_3weeks_to_4weeks',
    'lag_4weeks_to_5weeks',
    'lag_5weeks_to_6weeks',
    'lag_6weeks_to_7weeks',
    'lag_7weeks_to_8weeks',
    'lag_8weeks_to_9weeks',
    'lag_9weeks_to_10weeks',
    'lag_10weeks_to_11weeks',
    'lag_11weeks_to_12weeks',
    'lag_12weeks_to_13weeks',
    'lag_13weeks_to_14weeks',
    'lag_14weeks_to_15weeks',
    'lag_15weeks_to_16weeks',
    'lag_16weeks_to_17weeks',
    'lag_17weeks_to_18weeks',
    'lag_18weeks_to_19weeks',
    'lag_19weeks_to_20weeks',
    'lag_20weeks_to_21weeks',
    'lag_21weeks_to_22weeks',
    'lag_22weeks_to_23weeks',
    'lag_23weeks_to_24weeks',
    'lag_24weeks_to_25weeks',
    'lag_25weeks_to_26weeks',
    'lag_26weeks_to_27weeks'
]



features_train = features_assembled(sdf_train, feats=feats)
features_test = features_assembled(sdf_test, feats=feats)

if UPSAMPLE:
    features_train = upsampled(
        features_train, 
        label='payment_event_flag', 
        upsample=UPSAMPLE
    )
    # Use to upsample test set
    #features_test = upsampled(
    #    features_test, 
    #    label='payment_event_flag', 
    #    upsample=UPSAMPLE
    #)

columns to model: ['lag_1week_to_2weeks_tfidf', 'lag_2weeks_to_3weeks_tfidf', 'lag_3weeks_to_4weeks_tfidf', 'lag_4weeks_to_5weeks_tfidf', 'lag_5weeks_to_6weeks_tfidf', 'lag_6weeks_to_7weeks_tfidf', 'lag_7weeks_to_8weeks_tfidf', 'lag_8weeks_to_9weeks_tfidf', 'lag_9weeks_to_10weeks_tfidf', 'lag_10weeks_to_11weeks_tfidf', 'lag_11weeks_to_12weeks_tfidf', 'lag_12weeks_to_13weeks_tfidf', 'lag_13weeks_to_14weeks_tfidf', 'lag_14weeks_to_15weeks_tfidf', 'lag_15weeks_to_16weeks_tfidf', 'lag_16weeks_to_17weeks_tfidf', 'lag_17weeks_to_18weeks_tfidf', 'lag_18weeks_to_19weeks_tfidf', 'lag_19weeks_to_20weeks_tfidf', 'lag_20weeks_to_21weeks_tfidf', 'lag_21weeks_to_22weeks_tfidf', 'lag_22weeks_to_23weeks_tfidf', 'lag_23weeks_to_24weeks_tfidf', 'lag_24weeks_to_25weeks_tfidf', 'lag_25weeks_to_26weeks_tfidf', 'lag_26weeks_to_27weeks_tfidf', 'payment_event_flag']
columns to model: ['lag_1week_to_2weeks_tfidf', 'lag_2weeks_to_3weeks_tfidf', 'lag_3weeks_to_4weeks_tfidf', 'lag_4weeks_to_5weeks_tfidf', 'lag_5w

In [43]:
features_train.groupBy('payment_event_flag').count().toPandas()

Unnamed: 0,payment_event_flag,count
0,0,15649
1,1,18440


In [44]:
features_test.groupBy('payment_event_flag').count().toPandas()

Unnamed: 0,payment_event_flag,count
0,1,4669
1,0,3945


In [45]:
features_train.limit(3).toPandas()

Unnamed: 0,payment_event_flag,features
0,0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [46]:
features_train.limit(3).toPandas()['features'][0]

SparseVector(2600, {200: 4.7984, 352: 7.6642, 401: 5.2438})

### 3.2. Training and evaluating

#### <span style="color: red;">This will train and evaluate WITHOUT hyperopt</span>



A `LogisticRegression` estimator is configured with:

* `labelCol`: Specifies the target variable column (`payment_event_flag`).
* `featuresCol`: Specifies the feature vector column (`features`).

Then:

* The model is trained using the `fit` method on the training data.
* The trained model is used to make predictions on the test data.

The `BinaryClassificationMetrics` class is used to calculate evaluation metrics:

* `areaUnderROC`: Area under the ROC curve, which measures the model's ability to distinguish between classes.
* `areaUnderPR`: Area under the Precision-Recall curve, which is more informative for imbalanced datasets.

Both metrics are computed here from the hard 0/1 values in the `prediction` column, not from predicted probabilities.

The code also uses `classification_report` from `scikit-learn` to get a detailed report including precision, recall, F1-score, and support for each class.
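Because the evaluation cell feeds the `prediction` column to `BinaryClassificationMetrics`, it may help to see how much ranking information is lost when scores are thresholded first. The sketch below is a pure-Python illustration on made-up labels and probabilities (not the notebook's data); `roc_auc` is a hypothetical helper that computes the area under the ROC curve as the share of positive/negative pairs ranked correctly, counting ties as one half.

```python
def roc_auc(labels, scores):
    # Probability that a random positive outranks a random negative (ties count 0.5)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: the positives score higher, but every score is below 0.5
labels = [0, 0, 1, 1]
probs = [0.2, 0.3, 0.4, 0.45]
hard = [1.0 if p >= 0.5 else 0.0 for p in probs]  # thresholded predictions

auc_from_probs = roc_auc(labels, probs)
auc_from_hard = roc_auc(labels, hard)
```

In this toy case the probabilities rank the classes perfectly (AUC 1.0), while the thresholded labels are all 0 and give an AUC of 0.5. An AUC computed from hard predictions is therefore not directly comparable with one computed from scores, such as the value returned by `BinaryClassificationEvaluator`.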

In [46]:
# Define model filepath to link predictions results with the actual CSV 

current_model = "LogisticRegressionModel"

file_path_pred = f's3a://{BUCKET}/work/{VER}/preds_{current_model}_{current_datetime}_{current_user}_{unique_identifier}.csv'
print(f"- Prediction File Path: {file_path_pred}")

- Prediction File Path: s3a://pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a7b/work/vP1/preds_LogisticRegressionModel_20240423_105252_st095435_fd66151b.csv


In [47]:
from pyspark.ml.classification import LogisticRegressionModel

In [48]:
features_train.printSchema()

root
 |-- payment_event_flag: integer (nullable = true)
 |-- features: vector (nullable = true)



In [49]:
lr = LogisticRegression(labelCol="payment_event_flag", featuresCol="features")


In [50]:
%%time
model = lr.fit(features_train)

CPU times: user 10.8 ms, sys: 899 µs, total: 11.7 ms
Wall time: 7.86 s


In [51]:
features_train.show(5)


+------------------+--------------------+
|payment_event_flag|            features|
+------------------+--------------------+
|                 0|(2600,[200,352,40...|
|                 0|(2600,[445,1126,2...|
|                 0|(2600,[193,488],[...|
|                 0|(2600,[2500],[4.8...|
|                 0|(2600,[114,1191,1...|
+------------------+--------------------+
only showing top 5 rows



## <span style="color: red;">---------------</span>
NB


In [52]:
predictions = model.transform(features_test)
payment_event_flag_preds = predictions.select('prediction', 'payment_event_flag')
metrics = BinaryClassificationMetrics(
    payment_event_flag_preds.rdd.map(
        lambda lines: [float(x) for x in lines]
    )
)
print('ROC AUC:', metrics.areaUnderROC)
print('Area under PR-curve:', metrics.areaUnderPR)



ROC AUC: 0.6563167085658691
Area under PR-curve: 0.6695454803362192


In [53]:
# Weights and Biases logging here
wandb.log({"roc_auc": metrics.areaUnderROC})
wandb.log({"pr_auc": metrics.areaUnderPR})

In [54]:
#predictions = model.transform(features_test)

# Get the actual class labels and the predicted values
y_true = predictions.select('payment_event_flag').rdd.map(lambda x: x['payment_event_flag']).collect()
y_pred = predictions.select('prediction').rdd.map(lambda x: x['prediction']).collect()

# Print the classification report
report = classification_report(y_true, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.61      0.67      0.64      3945
           1       0.70      0.64      0.67      4669

    accuracy                           0.66      8614
   macro avg       0.66      0.66      0.65      8614
weighted avg       0.66      0.66      0.66      8614



In [55]:
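# Weights and Biases logging here
# The report is plain text, so wrap it in <pre> to keep the column alignment when rendered as HTML
wandb.log({"classification_report": wandb.Html(f"<pre>{report}</pre>")})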

### Feature Importance 

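* The code extracts feature importances from the trained model and filters them based on a threshold (`TH`). The `featureImportances` attribute exists on tree-based models such as `RandomForestClassifier`; a `LogisticRegressionModel` exposes `coefficients` instead.
* The feature importances are then sorted in descending order.
* The code prints the feature number, importance value, and corresponding event names for the most important features.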
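A feature number can also be traced back to its lag column and hash bucket directly. The sketch below is a hypothetical helper (not part of the notebook) that inverts the `count * num_features + feat_num` indexing used when `features_dict` is built. It assumes `num_features=100` and the 26 lag columns in list order, which is the order in which `VectorAssembler` concatenates its input columns.

```python
def decode_feature_index(index, lags, num_features):
    # Inverse of `count * num_features + feat_num`
    lag_pos, bucket = divmod(index, num_features)
    return lags[lag_pos], bucket

# Rebuild the 26 lag column names used in this notebook
lags = [f"lag_{i}week{'s' if i > 1 else ''}_to_{i + 1}weeks" for i in range(1, 27)]

# Indices taken from the SparseVector shown earlier: {200: ..., 352: ..., 401: ...}
decoded = [decode_feature_index(i, lags, num_features=100) for i in (200, 352, 401)]
```

For example, index 352 falls in the fourth lag column (`lag_4weeks_to_5weeks`) at bucket 52; `features_dict[352]` then lists the event names that hash to that bucket.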


In [56]:
# Print the model's available hyperparameters
print("Available hyperparameters:\n")


Available hyperparameters:



In [57]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: payment_event_flag)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained opti

In [None]:
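TH = .01

# Tree-based models expose featureImportances; LogisticRegressionModel does not,
# so fall back to the absolute value of its fitted coefficients
if hasattr(model, 'featureImportances'):
    importances = model.featureImportances.toArray()
else:
    importances = np.abs(model.coefficients.toArray())

features_imps = {}
for i, v in enumerate(importances):
    if v >= TH: features_imps[i] = v
features_imps = dict(sorted(features_imps.items(), key=lambda x: x[1], reverse=True))
features_imps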

In [73]:
for k, v in features_imps.items():
    print('-' * 100)
    print('feature number:', k, '| feature importance:', v)
    print('features:', features_dict[k])

In [74]:
# Weights and biases logging here

wandb.log({"feature_importances": wandb.Table(dataframe=pd.DataFrame.from_dict(features_imps, orient='index'))})

In [None]:
## Convert the data from a DataFrame to a numpy array of the appropriate shape
#X_test = np.array(features_test.select('features').collect()).reshape(-1, len(feats))
#
#explainer = shap.TreeExplainer(model)  # Initialize explainer
#
## Compute SHAP values for the data
#shap_values = explainer.shap_values(X_test)


In [None]:
#shap.summary_plot()

In [None]:
# Weights and biases logging here 

#wandb.log({"shap_values": wandb.Table(dataframe=pd.DataFrame(shap_values))})


# 5. Predictions old variant (base model, no hyperopt) 

In [None]:
#sdf_pred = spark.read.parquet(file_path_ds)
#sdf_pred.limit(5).toPandas()

In [None]:
#%%time
#
#SHIFT = 7 * 24 * 60 * 60  # 7 days ahead
#
#sdf_pred = sdf_pred.sample(fraction=.0001)
#sdf_pred = dataset_lags(sdf_pred, shift=SHIFT)
#sdf_pred = sdf_pred.filter(
#        #(F.size('lag_10min_to_1week')      > 0) |
#        (F.size('lag_1week_to_2weeks')     > 0) |
#        (F.size('lag_2weeks_to_3weeks')    > 0) |
#        (F.size('lag_3weeks_to_4weeks')    > 0) |
#        (F.size('lag_4weeks_to_5weeks')    > 0) |
#        (F.size('lag_5weeks_to_6weeks')    > 0) |
#        (F.size('lag_6weeks_to_7weeks')    > 0) |
#        (F.size('lag_7weeks_to_8weeks')    > 0) |
#        (F.size('lag_8weeks_to_9weeks')    > 0) |
#        (F.size('lag_9weeks_to_10weeks')   > 0) |
#        (F.size('lag_10weeks_to_11weeks')  > 0) |
#        (F.size('lag_11weeks_to_12weeks')  > 0) |
#        (F.size('lag_12weeks_to_13weeks')  > 0) |
#        (F.size('lag_13weeks_to_14weeks')  > 0) |
#        (F.size('lag_14weeks_to_15weeks')  > 0) |
#        (F.size('lag_15weeks_to_16weeks')  > 0) |
#        (F.size('lag_16weeks_to_17weeks')  > 0) |
#        (F.size('lag_17weeks_to_18weeks')  > 0) |
#        (F.size('lag_18weeks_to_19weeks')  > 0) |
#        (F.size('lag_19weeks_to_20weeks')  > 0) |
#        (F.size('lag_20weeks_to_21weeks')  > 0) |
#        (F.size('lag_21weeks_to_22weeks')  > 0) |
#        (F.size('lag_22weeks_to_23weeks')  > 0) |
#        (F.size('lag_23weeks_to_24weeks')  > 0) |
#        (F.size('lag_24weeks_to_25weeks')  > 0) |
#        (F.size('lag_25weeks_to_26weeks')  > 0) |
#        (F.size('lag_26weeks_to_27weeks')  > 0) 
#    )
#sdf_pred.count()

In [None]:
#print(lags)

In [None]:
## if word2vec
#
#"""for i, lag in enumerate(lags):
#    sdf_pred = vectorizers[i].transform(sdf_pred)
#    print(lag, '-> done')""" 
#
## If tf-idf
#
#for lag in lags:
#    hashingTF = HashingTF(inputCol=lag, outputCol=lag + "_tf", numFeatures=100)
#    idf = IDF(inputCol=lag + "_tf", outputCol=lag + "_tfidf", minDocFreq=3)
#    sdf_pred = hashingTF.transform(sdf_pred)
#    idfModel = idf.fit(sdf_pred)  # Fit the IDF transformer
#    sdf_pred = idfModel.transform(sdf_pred)  # Use the fitted model for transformation
#    print(lag, "-> done")
#

In [None]:
#features_pred = features_assembled(sdf_pred, feats=feats)

In [None]:
#predictions_future = model.transform(features_pred)

### OLD method for compiling 

In [None]:
"""%%time

file_path_pred = f's3a://{BUCKET}/work/{VER}/preds.csv'
clean_parquet(file_path_pred)

sdf_pred = sdf_pred.join(predictions_future).select(
    sdf_pred.profile_id,
    F.col('probability')
)
sdf_pred.withColumn(
    'tmp',
    vector_to_array('probability')
).select(
    'profile_id',
    F.col('tmp')[1].alias('prob_next7days')
).write.csv(file_path_pred, header=True)"""

### Our new method for compiling, with datetime and unique IDs

In this updated code:

**We import the necessary libraries:**
* `datetime` for working with date and time objects
* `uuid` for generating unique identifiers
* `getpass` for retrieving the current user's username

**We create variables to store the additional information:**
* `current_datetime`: We use `datetime.datetime.now()` to get the current date and time and format it as a string in the format "YYYYMMDD_HHMMSS".
* `current_user`: We use `getpass.getuser()` to retrieve the username of the currently signed-in user.
* `unique_identifier`: We generate a unique identifier using `uuid.uuid4()` and take the first 8 characters of the string representation.

**We update the file_path_pred to include the additional information:**

We insert `current_model`, `current_datetime`, `current_user`, and `unique_identifier` into the file path using f-string formatting.

**The resulting file path will have the format:** 

`s3a://{BUCKET}/work/{VER}/preds_{current_model}_{current_datetime}_{current_user}_{unique_identifier}.csv`

**The rest of the code remains the same**, including the call to `clean_parquet(file_path_pred)`, which clears any existing output at that path before the predictions are written.
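The naming scheme described above can be sketched as a small function. `build_pred_path` and its arguments are hypothetical names introduced for illustration, and the bucket name in the example is a placeholder; the notebook itself builds `file_path_pred` inline from `BUCKET` and `VER`.

```python
import datetime
import getpass
import uuid

def build_pred_path(bucket, ver, model_name, now=None, user=None):
    """Build a prediction file path tagged with model, timestamp, user and a short unique id."""
    now = now or datetime.datetime.now()
    current_datetime = now.strftime("%Y%m%d_%H%M%S")   # e.g. 20240423_105252
    current_user = user or getpass.getuser()           # username of the signed-in user
    unique_identifier = str(uuid.uuid4())[:8]          # first 8 characters of a random UUID
    return (
        f"s3a://{bucket}/work/{ver}/preds_{model_name}_"
        f"{current_datetime}_{current_user}_{unique_identifier}.csv"
    )

# Example with a fixed timestamp and user so the result is reproducible up to the id
example_path = build_pred_path(
    bucket="my-bucket",  # placeholder bucket name
    ver="vP1",
    model_name="LogisticRegressionModel",
    now=datetime.datetime(2024, 4, 23, 10, 52, 52),
    user="st095435",
)
```

Only the trailing 8-character identifier changes between calls with the same arguments, so two runs started in the same second by the same user still write to different paths.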

In [None]:
#%%time
#
#clean_parquet(file_path_pred)
#
#sdf_pred = sdf_pred.join(predictions_future).select(
#    sdf_pred.profile_id,
#    F.col('probability')
#)
#
#

In [None]:
## Get model name (assuming your model variable is named "model")
#current_model = type(model).__name__
#
## Create file path with model name
#file_path_pred = f's3a://{BUCKET}/work/{VER}/preds_{current_model}_{current_datetime}_{current_user}_{unique_identifier}.csv'
#print(f"- Prediction File Path: {file_path_pred}")

In [None]:
#sdf_pred.withColumn(
#    'tmp',
#    vector_to_array('probability')
#).select(
#    'profile_id',
#    F.col('tmp')[1].alias('prob_next7days')
#).write.csv(file_path_pred, header=True)

#### Logging notebook characteristics

The below code snippet prints a formatted summary with clear headings and labels for each variable or parameter.

It includes information about data processing flags, lag feature details, date range used for filtering payment events, upsampling strategy, model type, and the output file path.

You can further customize this code by adding more relevant variables or model-specific parameters based on your pipeline configuration.

**Benefits:**

* Reproducibility: Having a summary of the pipeline parameters improves the reproducibility of your results and makes it easier to track the specific settings used for a particular experiment.

In [75]:
print("## Pipeline Summary ##")

# Data Processing Flags:
print(f"- PROC_DS: {PROC_DS}")
print(f"- PROC_LAGS: {PROC_LAGS}")
print(f"- PROC_VECS: {PROC_VECS}")

# Lag Feature Information:
print(f"- Lag Features Used: {lags}")
print(f"- Number of Features after TF-IDF: {len(features_dict.items())}")

# Date Range:
print(f"- flag_min_datetime: {flag_min_datetime}")
print(f"- flag_max_datetime: {flag_max_datetime}")

# Upsampling:
print(f"- UPSAMPLE: {UPSAMPLE}")

# Model:
print(f"- Model Type: {type(model).__name__}")
# Add more model-specific parameters if needed 

# Output:
print(f"- Predictions saved to: {file_path_pred}")

print("## End of Summary ##")

## Pipeline Summary ##
- PROC_DS: True
- PROC_LAGS: True
- PROC_VECS: True
- Lag Features Used: ['lag_1week_to_2weeks', 'lag_2weeks_to_3weeks', 'lag_3weeks_to_4weeks', 'lag_4weeks_to_5weeks', 'lag_5weeks_to_6weeks', 'lag_6weeks_to_7weeks', 'lag_7weeks_to_8weeks', 'lag_8weeks_to_9weeks', 'lag_9weeks_to_10weeks', 'lag_10weeks_to_11weeks', 'lag_11weeks_to_12weeks', 'lag_12weeks_to_13weeks', 'lag_13weeks_to_14weeks', 'lag_14weeks_to_15weeks', 'lag_15weeks_to_16weeks', 'lag_16weeks_to_17weeks', 'lag_17weeks_to_18weeks', 'lag_18weeks_to_19weeks', 'lag_19weeks_to_20weeks', 'lag_20weeks_to_21weeks', 'lag_21weeks_to_22weeks', 'lag_22weeks_to_23weeks', 'lag_23weeks_to_24weeks', 'lag_24weeks_to_25weeks', 'lag_25weeks_to_26weeks', 'lag_26weeks_to_27weeks']
- Number of Features after TF-IDF: 2600
- flag_min_datetime: 2024-03-21 00:00:00
- flag_max_datetime: 2024-04-18 23:59:59
- UPSAMPLE: max
- Model Type: LogisticRegressionModel
- Predictions saved to:  s3a://pvc-84ea79a0-dc20-4a2d-86ab-f83c1f8d4a

# <span style="color: red;">!!!!! BELOW WE WILL RUN HYPEROPT AGAIN WITHOUT TF-IDF !!!!!

It also does not include the date-filtering code below before running the model:

# *** Add filtering here ***
    if PROC_LAGS:
        dates  = (flag_min_datetime, flag_max_datetime)
        features_train_filtered = features_train.filter(features_train.event_datetime.between(*dates))
    else:
        features_train_filtered = features_train 

</span>

## Logistic regression WITHOUT tf-idf elements

In [71]:
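from hyperopt import hp, fmin, tpe, Trials, STATUS_OK

search_space = {
    # loguniform(low, high) samples exp(u) with u uniform on [low, high]
    'regParam': hp.loguniform('regParam', -7, 1),
    'elasticNetParam': hp.uniform('elasticNetParam', 0, 1),
    # quniform(low, high, q) samples round(u / q) * q, returned as a float
    'maxIter': hp.quniform('maxIter', 10, 100, 10),
    'tol': hp.loguniform('tol', -7, 0),
    # Other parameters...
}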
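As a rough guide to what this search space covers: `hp.loguniform(label, low, high)` draws exp(u) with u uniform on [low, high], and `hp.quniform(label, low, high, q)` draws round(u / q) * q. The sketch below reproduces those two formulas with the standard `random` module only; it does not call hyperopt, and the helper names are hypothetical.

```python
import math
import random

def loguniform(rng, low, high):
    # exp of a uniform draw, so values spread evenly on a log scale
    return math.exp(rng.uniform(low, high))

def quniform(rng, low, high, q):
    # uniform draw snapped to the nearest multiple of q
    return round(rng.uniform(low, high) / q) * q

rng = random.Random(0)
reg_params = [loguniform(rng, -7, 1) for _ in range(1000)]
max_iters = [quniform(rng, 10, 100, 10) for _ in range(1000)]

# Bounds implied by loguniform(-7, 1): roughly 0.0009 to 2.72
reg_bounds = (math.exp(-7), math.exp(1))
```

So `regParam` ranges from about 0.0009 to about 2.72, and `maxIter` takes the values 10, 20, ..., 100. Hyperopt returns the quantized value as a float, which is why the best parameters further down show `'maxIter': 40.0`.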


In [72]:
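def objective(params):
    # hp.quniform yields floats (e.g. 40.0), so cast maxIter to int explicitly
    params = {**params, 'maxIter': int(params['maxIter'])}
    lr = LogisticRegression(labelCol="payment_event_flag", featuresCol="features", **params)
    model = lr.fit(features_train)
    # Note: scoring on features_test during the search makes the final test metrics optimistic
    predictions = model.transform(features_test)
    evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
    auc = evaluator.evaluate(predictions)  # default metric: areaUnderROC
    # Log metrics and params to W&B
    wandb.log({"roc_auc_hyperopt": auc, "params_hyperopt": params})
    # Hyperopt minimizes the loss, so return the negative AUC
    return {'loss': -auc, 'status': STATUS_OK}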

In [73]:
# Update W&B run


config = {
    "model_type": "LogisticRegressionHYPEROPT",
    #"output_file": "predictions_future_week.csv",
    "hyperopt_search_space": search_space  # Include search space here
}


# Update the config in the existing W&B run
wandb.config.update(config)

In [81]:
# Run Hyperopt optimization
trials = Trials()
best_lr_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=30,  # Number of trials
    trials=trials
)



100%|██████████| 30/30 [01:20<00:00,  2.67s/trial, best loss: -0.7158076855108569]


In [82]:
best_lr_params

{'elasticNetParam': 0.2796271657027966,
 'maxIter': 40.0,
 'regParam': 0.004123055975276005,
 'tol': 0.0021750755982573337}

In [83]:
# Log best hyperparameters
wandb.log({"best_lr_hyperparameters": best_lr_params})



In [84]:
# Train the best lr model
best_lr = LogisticRegression(labelCol="payment_event_flag", featuresCol="features", **best_lr_params)
best_lr_model = best_lr.fit(features_train)


In [85]:
# Evaluate the best model
best_predictions = best_lr_model.transform(features_test)
best_evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
best_lr_auc = best_evaluator.evaluate(best_predictions)



In [86]:
# Calculate classification report for the best model
y_true = best_predictions.select('payment_event_flag').rdd.map(lambda x: x['payment_event_flag']).collect()
y_pred = best_predictions.select('prediction').rdd.map(lambda x: x['prediction']).collect()
best_report = classification_report(y_true, y_pred)



In [87]:
print(best_report)

              precision    recall  f1-score   support

           0       0.62      0.53      0.57      3945
           1       0.65      0.72      0.68      4669

    accuracy                           0.64      8614
   macro avg       0.63      0.63      0.63      8614
weighted avg       0.63      0.64      0.63      8614



In [None]:
# Convert the plain-text report to HTML; <pre> keeps the column alignment
hyperopt_lr_report_html = f"<pre>{best_report}</pre>"

# Log the report as HTML to W&B
wandb.log({"hopt_best_lr_classification_report": wandb.Html(hyperopt_lr_report_html)})

In [76]:
"""from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import wandb

# Assume features_train and features_test are already defined

def lr_objective(trial):
    params = {
        'labelCol': 'payment_event_flag',
        'featuresCol': 'features',
        # Other parameters to optimize
    }
    params.update(trial.params)
    
    lr = LogisticRegression( ** params)
    lr_model = lr.fit(features_train)
    predictions = lr_model.transform(features_test)
    evaluator = BinaryClassificationEvaluator()
    auc = evaluator.evaluate(predictions)
    
    # Log metrics and params to W&B
    wandb.log({"lr_hyperopt_auc": auc, "lr_hyperopt_params": params})
    return {'loss': -auc, 'status': STATUS_OK}


In [78]:
'''for key, value in best_params.items():
    print(f"{key}: {value}")

elasticNetParam: 0.5040260781935606
maxIter: 70.0
regParam: 0.0022335019114765094
tol: 0.010194121730669711


In [77]:
"""# Update W&B run config
wandb.config.update({
    "model_type": "LogisticRegressionHyperopt",
    "hyperopt_search_space": lr_search_space
})



In [76]:
'''best_params = {'elasticNetParam': 0.5040260781935606,
               'maxIter': 70.0,
               'regParam': 0.0022335019114765094,
               'tol': 0.010194121730669711}







In [124]:
"""# Update W&B run


config = {
    "model_type": "LogisticRegressionHYPEROPT", # <-- change to your model name 
    #"output_file": "predictions_future_week.csv",
    "hyperopt_search_space": search_space  # Include search space here
}


In [None]:
'''# Update the config in the existing W&B run
wandb.config.update(config)


# Run Hyperopt optimization
trials = Trials()
best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=30,  # Number of trials
    trials=trials
)

In [None]:
'''for key, value in best_params.items():
    print(f"{key}: {value}")


# best_params = convert_feature_subset_strategy(best_params)

best_params = convert_hyperopt_params(best_params)


for key, value in best_params.items():
    print(f"{key}: {value}")

In [None]:
'''from hyperopt import hp
from hyperopt.fmin import fmin
import time

metric = {'auc': 'auc'}

def objective(hyperopt_params):
    lr = LogisticRegression(
        labelCol="payment_event_flag", 
        featuresCol="features", 
        maxIter=hyperopt_params['maxIter'], 
        regParam=hyperopt_params['regParam']
    )
    
    start = time.time()
    model = lr.fit(features_train)
    runtime = time.time() - start
    
    predictions = model.transform(features_test)
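    # Score with P(class 1); the 'probability' column holds one vector per row
    scored = predictions.select("payment_event_flag", "probability").toPandas()
    auc = roc_auc_score(scored["payment_event_flag"], scored["probability"].apply(lambda v: float(v[1])))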
    
    result = {
        'loss': 1 - auc,
        'status': STATUS_OK,
        'model': lr,
        'runtime': runtime,
        'params': hyperopt_params
    }
    
    return result

params_space = {
    'maxIter': hp.qloguniform('maxIter', 5, 15, 1),
    'regParam': hp.loguniform('regParam', -5, -2)
}

trials = Trials()

best = fmin(
    fn=objective,
    space=params_space,
    algo=tpe.suggest,
    max_evals=100,
    trials=trials,
    rstate=np.random.RandomState(36)
)
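
The bounds passed to `hp.loguniform` and `hp.qloguniform` are on the log scale: the sampled value is `exp(uniform(low, high))`. The sketch below only converts the bounds used in the cell above into the ranges they imply; `loguniform_bounds` is a hypothetical helper, not part of hyperopt.

```python
import math

def loguniform_bounds(low, high):
    # hp.loguniform(label, low, high) draws exp(uniform(low, high)),
    # so sampled values lie between exp(low) and exp(high).
    return math.exp(low), math.exp(high)

reg_lo, reg_hi = loguniform_bounds(-5, -2)   # range implied for regParam
iter_lo, iter_hi = loguniform_bounds(5, 15)  # range implied for maxIter, before rounding

print(f"regParam: {reg_lo:.4f} .. {reg_hi:.4f}")
print(f"maxIter:  {iter_lo:.0f} .. {iter_hi:.0f}")
```

Under these bounds `maxIter` ranges from roughly 148 to over three million, so a non-log distribution such as `hp.quniform` may be closer to what was intended for an iteration count.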


In [None]:
'''from hyperopt import hp
from hyperopt.fmin import fmin
from hyperopt.mongoexp import MongoExp
import time


In [60]:

'''# Define the objective function to optimize
def objective(params):
    lr = LogisticRegression(labelCol="payment_event_flag", featuresCol="features", **params)
    model = lr.fit(features_train)
    predictions = model.transform(features_test)
    
    evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
    auc = evaluator.evaluate(predictions)
    
    return {'loss': 1 - auc, 'status': STATUS_OK}



In [61]:
'''

# Hyperparameter ranges
param_space = {
    'maxIter': scope.int(hp.quniform('maxIter', 10, 100, 1)),
    'regParam': hp.uniform('regParam', 0.0, 1.0),
    'elasticNetParam': hp.uniform('elasticNetParam', 0.0, 1.0),
}
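
`hp.quniform(label, low, high, q)` returns `round(uniform(low, high) / q) * q` as a float, which is why `maxIter` is wrapped in `scope.int`. The sketch below reproduces that formula with the standard library only; the local `quniform` function is an illustration, not hyperopt's implementation.

```python
import random

def quniform(rng, low, high, q):
    # Formula documented for hp.quniform: round(uniform(low, high) / q) * q.
    # float() mimics the float result that the NumPy-based original returns.
    return float(round(rng.uniform(low, high) / q)) * q

rng = random.Random(42)
raw = quniform(rng, 10, 100, 1)  # a whole number, but still a float
max_iter = int(raw)              # the cast that scope.int(...) applies
```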



In [None]:
'''
# Run the optimization
trials = Trials()
best = fmin(fn=objective,
            space=param_space,
            algo=tpe.suggest,
            max_evals=10,
            trials=trials)

print(best)



## <span style="color: red;">Change the model name below </span>


In [89]:
# Update W&B run


config = {
    "model_type": "LogisticRegressionHYPEROPT", # <--  model name 
    #"output_file": "predictions_future_week.csv",
    "hyperopt_search_space": search_space  # Include search space here
}


# Update the config in the existing W&B run
wandb.config.update(config)

In [None]:
'''# Run Hyperopt optimization
trials = Trials()
best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=30,  # Number of trials
    trials=trials
)



In [None]:
for key, value in best_params.items():
    print(f"{key}: {value}")

In [None]:
# best_params = convert_feature_subset_strategy(best_params)

best_params = convert_hyperopt_params(best_params)
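
`fmin` returns every sampled value as a float (for example `maxIter: 70.0`), while `LogisticRegression` expects an integer `maxIter`. The definition of `convert_hyperopt_params` is not shown in this section, so the sketch below uses a hypothetical helper, `cast_integer_params`, to illustrate the kind of conversion needed.

```python
def cast_integer_params(params, int_keys=("maxIter",)):
    # Hypothetical helper: return a copy with the listed keys cast to int.
    converted = dict(params)
    for key in int_keys:
        if key in converted:
            converted[key] = int(round(converted[key]))
    return converted

raw_params = {
    "elasticNetParam": 0.5040260781935606,
    "maxIter": 70.0,
    "regParam": 0.0022335019114765094,
    "tol": 0.010194121730669711,
}
clean_params = cast_integer_params(raw_params)
```

Working on a copy keeps the raw Hyperopt output available for logging.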

In [None]:
for key, value in best_params.items():
    print(f"{key}: {value}")

In [90]:
# Log best hyperparameters
wandb.log({"best_hyperparameters_lr": best_params})



In [91]:
# Train the best model

best_lr = LogisticRegression(labelCol="payment_event_flag", featuresCol="features", **best_params)
best_model = best_lr.fit(features_train)



In [92]:
# Evaluate the best model on the test set
predictions = best_model.transform(features_test)
evaluator = BinaryClassificationEvaluator(labelCol="payment_event_flag")
best_auc = evaluator.evaluate(predictions)
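
`BinaryClassificationEvaluator` reports `areaUnderROC` by default. For intuition, ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes that pairwise form on a tiny hand-made sample; it is not how Spark computes the metric internally, and the data is illustrative only.

```python
def roc_auc(labels, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly
```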


In [93]:
# Calculate classification report
y_true = predictions.select('payment_event_flag').rdd.map(lambda x: x['payment_event_flag']).collect()
y_pred = predictions.select('prediction').rdd.map(lambda x: x['prediction']).collect()
report = classification_report(y_true, y_pred)



In [94]:
# Calculate PR AUC
payment_event_flag_preds = predictions.select('probability', 'payment_event_flag')
metrics = BinaryClassificationMetrics(payment_event_flag_preds.rdd.map(lambda lp: (float(lp[0][1]), float(lp[1]))))
pr_auc = metrics.areaUnderPR
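
The lambda above turns each row into a `(score, label)` pair, taking the second entry of the probability vector as the score for class 1. The sketch below applies the same mapping to plain tuples standing in for Spark rows and derives a few precision-recall points from them; the data and the `precision_recall_at` helper are illustrative assumptions.

```python
# Rows as (probability_vector, label); plain tuples stand in for Spark Rows.
rows = [([0.9, 0.1], 0.0), ([0.3, 0.7], 1.0), ([0.6, 0.4], 1.0), ([0.2, 0.8], 0.0)]

# Same mapping as the lambda above: (P(class 1), label).
score_and_labels = [(float(r[0][1]), float(r[1])) for r in rows]

def precision_recall_at(pairs, threshold):
    # Treat score >= threshold as a positive prediction.
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1.0)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0.0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1.0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# One (precision, recall) point per threshold; PR AUC summarizes such a curve.
curve = [precision_recall_at(score_and_labels, t) for t in (0.8, 0.7, 0.4, 0.1)]
```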





In [95]:
print(f"ROC AUC: {best_auc}")
print(f"PR AUC: {pr_auc}")

ROC AUC: 0.7171756055703816
PR AUC: 0.7581376191111291


In [96]:
print(report)


              precision    recall  f1-score   support

           0       0.62      0.54      0.58      3945
           1       0.65      0.72      0.69      4669

    accuracy                           0.64      8614
   macro avg       0.64      0.63      0.63      8614
weighted avg       0.64      0.64      0.64      8614



In [98]:
# Log additional metrics to W&B
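# The report is plain text, so wrap it in <pre> to keep its column alignment in HTML
wandb.log({"best_model_classification_report_lr": wandb.Html(f"<pre>{report}</pre>")})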
wandb.log({"best_model_roc_auc_lr": best_auc})
wandb.log({"best_model_pr_auc_lr": pr_auc})

In [100]:
# Get the coefficients (feature weights) from the best logistic regression model
best_lr_coef = best_lr_model.coefficients

# Zip coefficients with feature names
features_weights = list(zip(features_dict.keys(), best_lr_coef))

# Sort features by absolute value of their weights (descending order)
features_weights.sort(key=lambda x: abs(x[1]), reverse=True)

# Print top features and their weights
print("Top Features for Hyperopt lr Model:")
for feat_num, weight in features_weights[:10]:
    print(f"Feature {feat_num}: {features_dict[feat_num]} - Weight: {weight:.4f}")

# Create a DataFrame of feature numbers and their weights
feature_importance_df = pd.DataFrame(features_weights, columns=["Feature Number", "Weight"])

# Log the DataFrame as a W&B Table
wandb.log({"best_lr_feature_importance": wandb.Table(dataframe=feature_importance_df)})

Top Features for Hyperopt lr Model:
Feature 982: ['lag_10weeks_to_11weeks_Мои штрафы/Оплата/Ушел с ввода данных'] - Weight: 0.3997
Feature 2007: ['lag_21weeks_to_22weeks_Проверка/История платежей', 'lag_21weeks_to_22weeks_Purchase'] - Weight: 0.3982
Feature 38: ['lag_1week_to_2weeks_Мои штрафы/Оплата/Завершили оплату', 'lag_1week_to_2weeks_Страхование/Главная/Лендинг/Ошибка загрузки лендинга', 'lag_1week_to_2weeks_Пуш Локальный/Скидочный штраф/Показан'] - Weight: 0.3522
Feature 2246: ['lag_23weeks_to_24weeks_Мои штрафы/Оплата/Сберпей/Открыт'] - Weight: 0.3476
Feature 1205: ['lag_13weeks_to_14weeks_Мои штрафы/Оплата/Завешили оплату'] - Weight: 0.3400
Feature 538: ['lag_6weeks_to_7weeks_Мои штрафы/Оплата/Завершили оплату', 'lag_6weeks_to_7weeks_Пуш Локальный/Скидочный штраф/Показан'] - Weight: -0.3336
Feature 682: ['lag_7weeks_to_8weeks_Мои штрафы/Оплата/Ушел с ввода данных'] - Weight: 0.3175
Feature 528: ['lag_6weeks_to_7weeks_Мои штрафы/Оплата/Ввод личных данных'] - Weight: 0.3074
Feat

In [101]:
# Calculate total absolute weight
total_weight = sum(abs(weight) for _, weight in features_weights)

# Create a list to store feature information with weight percentages
feature_info = []

# Extract information for top 20 features with weight percentages
for feat_num, weight in features_weights[:20]:
    event_names = features_dict[feat_num]
    for event_name in event_names:
        percentage_contribution = (abs(weight) / total_weight) * 100
        feature_info.append({
            "Feature Number": feat_num,
            "Event Name": event_name,
            "Weight": weight,
            "Percentage Contribution": f"{percentage_contribution:.2f}%"
        })

# Create DataFrame for weighted feature importance
feature_weight_df = pd.DataFrame(feature_info)

# Log the DataFrame as a W&B Table
wandb.log({"best_lr_feature_weight": wandb.Table(dataframe=feature_weight_df)})
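
The percentage contribution above is each coefficient's share of the total absolute weight, so negative coefficients count by magnitude. A small sketch of the ranking and the share calculation follows; the three weights are an illustrative subset, so the resulting percentages differ from those computed over all features.

```python
# Illustrative subset of (feature number -> coefficient); not the full model.
weights = {538: -0.3336, 982: 0.3997, 2007: 0.3982}

# Rank by magnitude so strong negative effects are not pushed to the bottom.
ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Each feature's share of the total absolute weight, in percent.
total = sum(abs(w) for _, w in ranked)
shares = {feat: abs(w) / total * 100 for feat, w in ranked}

for feat, weight in ranked:
    print(f"Feature {feat}: weight {weight:+.4f}, share {shares[feat]:.2f}%")
```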

In [106]:
summary_dict = {
    "PROC_DS": PROC_DS,
    "PROC_LAGS": PROC_LAGS,
    "PROC_VECS": PROC_VECS,
    "flag_min_datetime": flag_min_datetime,
    "flag_max_datetime": flag_max_datetime,
    "UPSAMPLE": UPSAMPLE,
    "Number of Features after TF-IDF": len(features_dict.items()),
    "Model Type": type(model).__name__,
    "maxIter": model.getMaxIter(),
    "regParam": model.getRegParam(),
    "tol": model.getTol(),
#    "Predictions saved to": file_path_pred, 
    "user_id": current_user, 
    "unique_identifier": unique_identifier
}

# Store the feature count as a string (replaces the integer set above)
summary_dict["Number of Features after TF-IDF"] = str(len(features_dict))



In [107]:
def dict_to_html_table(data):
    html = "<table>"
    for key, value in data.items():
        html += f"<tr><th>{key}</th><td>{value}</td></tr>"
    html += "</table>"
    return html

summary_html = dict_to_html_table(summary_dict)

wandb.log({"pipeline_summary_html": wandb.Html(summary_html)})
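
`dict_to_html_table` inserts keys and values into the markup as-is, so a value containing `<` or `&` would be interpreted as HTML. The sketch below is a variant that escapes both with the standard library's `html.escape`; the name `dict_to_html_table_escaped` is hypothetical.

```python
from html import escape

def dict_to_html_table_escaped(data):
    # Variant of dict_to_html_table that escapes keys and values before embedding.
    rows = "".join(
        f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
        for key, value in data.items()
    )
    return f"<table>{rows}</table>"

sample_html = dict_to_html_table_escaped(
    {"Model Type": "LogisticRegression", "note": "a < b & c"}
)
```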

In [None]:
'''# Convert report to HTML format
lr_report = classification_report(y_true, y_pred, output_dict=True)

# Render the report dictionary as an HTML table
hyperopt_lr_report_html = dict_to_html_table(lr_report)

# Log the report as HTML to W&B
wandb.log({"hopt_best_lr_classification_report": wandb.Html(hyperopt_lr_report_html)})

In [111]:
# Convert report to HTML format
lr_report = classification_report(y_true, y_pred, output_dict=True)
# Render the report dictionary as an HTML table
hyperopt_lr_report_html = dict_to_html_table(lr_report)

# Log the report as HTML to W&B
wandb.log({"hopt_best_lr_classification_report": wandb.Html(hyperopt_lr_report_html)})
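
With `output_dict=True`, `classification_report` returns a nested dictionary (one inner dictionary per class and per average, plus a scalar `accuracy`), so a one-level table renders each inner dictionary as its Python text form. One possible approach, sketched below with a hypothetical `flatten_report` helper and a hand-written sample in the same shape, is to flatten the report before building the table.

```python
def flatten_report(report):
    # Turn {"0": {"precision": ...}, "accuracy": 0.64} into flat "label / metric" keys.
    flat = {}
    for label, value in report.items():
        if isinstance(value, dict):
            for metric, number in value.items():
                flat[f"{label} / {metric}"] = round(number, 4)
        else:
            flat[label] = round(value, 4)
    return flat

# Hand-written sample in the shape scikit-learn returns (values for illustration).
sample_report = {
    "0": {"precision": 0.62, "recall": 0.54, "f1-score": 0.58, "support": 3945},
    "accuracy": 0.64,
}
flat_report = flatten_report(sample_report)
```

The flattened dictionary can then be passed to `dict_to_html_table` unchanged.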

## <span style="color: red;">The code below finishes the W&B run</span>


In [112]:
wandb.finish()  # Finalize W&B run

Run history:
best_model_pr_auc_lr   ▁▁
best_model_roc_auc_lr  ▁▁
pr_auc                 ▁
roc_auc                ▁
roc_auc_hyperopt       ▆▁▁▁██▃▇▁██▇██▁▆▁▁▇█▁▆█▅███▁▇▁▅▁▆▇▁▁████

Run summary:
best_model_pr_auc_lr   0.75814
best_model_roc_auc_lr  0.71718
pr_auc                 0.66955
roc_auc                0.65632
roc_auc_hyperopt       0.7142
