### Create ALS input dataset
- Implicit Score
- General Analysis / Visuals

### Summary
- Remove the individual products allocated to a bundle and treat each bundle as a single product
- Incorporate next-purchase sequences as products, both for the general next purchase and for the next purchase within a product group
- Relate Mintigo features to next-purchase sequences through the subsequent purchase in each sequence
- Add weight to products that are part of the last 2 purchases

### TODO
- Possibly adjust the impact of Security
- Intricately Correlations



#### General Process:
- Treat each bundle as a single product instead of treating the elements of the bundle as individual products
- Create next-product-purchase sequences as single products, and create next-product-purchase-within-a-product-group sequences as single products. Rules: keep sequences purchased by 5+ accounts; exclude sequences whose subsequent purchase is ADC-LTM (this filter is currently commented out in the code); exclude repurchases
- Mintigo correlations are calculated between has_purchased (at the product_group_line_type_subtype grain) and each Mintigo boolean feature
- The top 5-10 correlations are chosen for all products outside the ADC and Security product groups (Marketplace products excepted)
- Round correlations to 3 decimal places
- For single products: when an account has a 1 for a Mintigo boolean feature that is highly correlated with some product_group_line_type_subtype, use that correlation as a score towards any products that match that product_group_line_type_subtype
- For sequence products: when an account has a 1 for a Mintigo boolean feature that is highly correlated with the subsequent product_group_line_type_subtype in a sequence the account has already purchased in that pattern, use that correlation as a score towards that sequence
- Calculate pct purchase of the portfolio for each product for each account
- Calculate pct purchase of the portfolio for each product for each account, considering only the latest relevant purchases. Rules: last 3 purchases; exclude recent purchases of ADC (unless Marketplace or part of a sequence); exclude recent purchases of Security (unless Marketplace or part of a sequence)

- Calculate the total score for each (account, product) pair using: pct purchase + recent pct purchase + (2 * mintigo score)
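
The scoring rule in the last bullet can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the toy rows, the helper names (`pct_purchases`, `recent_rows`, `total_score`) and the Mintigo value are assumptions, and the ADC / Security exclusions for recent purchases are left out for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: (account_id, product, purchase_rank) rows.
PURCHASES = [
    ("A", "p1", 1),
    ("A", "p2", 2),
    ("A", "p1", 3),
    ("A", "p3", 4),
]

def pct_purchases(rows):
    """Share of an account's purchase rows that each product represents."""
    rows = list(rows)
    per_pair = Counter((account, product) for account, product, _ in rows)
    per_account = Counter(account for account, _, _ in rows)
    return {pair: n / per_account[pair[0]] for pair, n in per_pair.items()}

def recent_rows(rows, last_n=3):
    """Keep rows whose purchase_rank is among the account's last_n ranks."""
    ranks = defaultdict(set)
    for account, _, rank in rows:
        ranks[account].add(rank)
    keep = {account: set(sorted(r)[-last_n:]) for account, r in ranks.items()}
    return [row for row in rows if row[2] in keep[row[0]]]

def total_score(pct, recent_pct, mintigo, mintigo_multiplier=2):
    """pct purchase + recent pct purchase + (multiplier * mintigo score)."""
    pairs = set(pct) | set(recent_pct) | set(mintigo)
    return {
        pair: pct.get(pair, 0.0)
        + recent_pct.get(pair, 0.0)
        + mintigo_multiplier * mintigo.get(pair, 0.0)
        for pair in pairs
    }

pct = pct_purchases(PURCHASES)
recent_pct = pct_purchases(recent_rows(PURCHASES))
mintigo = {("A", "p1"): 0.118}  # hypothetical correlation-based score
scores = total_score(pct, recent_pct, mintigo)
```

Missing components default to 0, so a product that an account has never purchased can still receive a score from the Mintigo term alone.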

In [0]:
%run ./00_helpers

In [0]:
%run ./00_ede_base_query

In [0]:
# dbutils.widgets.removeAll()

In [0]:
dbutils.widgets.text("train_df_path", "dbfs:/tmp/next_product_recommender_train_df_xxx", "Training DF Path")
dbutils.widgets.text("product_field", "product_group_line_type_subtype_platform", "Product Field")
dbutils.widgets.text("implicit_rating_field", "pct_purchases", "Implicit Rating Field")

In [0]:
# dbutils.widgets.getArgument is deprecated; dbutils.widgets.get returns the same widget value
train_df_path = dbutils.widgets.get("train_df_path")

## choose column to represent the PRODUCT/ITEM and IMPLICIT RATING field
product_field = dbutils.widgets.get("product_field")
implicit_rating_field = dbutils.widgets.get("implicit_rating_field")

print("train_df_path", train_df_path)
print("product_field", product_field)
print("implicit_rating_field", implicit_rating_field)

In [0]:
# ede_base_stmt defined in 00_ede_base_query
df_pysh = dbh.read_ede_data(ede_base_stmt)
raw_data = pysh.safe_name(df_pysh)
raw_data = (raw_data
            .withColumn(product_field, F.trim(F.col(product_field)))
            .withColumn("product_group_line", F.concat_ws("-",F.col("product_group_c"), F.col("product_line_c")))
            .withColumn("product_group_line_type", F.concat_ws("-",F.col("product_group_c"), F.col("product_line_c"), F.col("product_type")))
            .withColumn("product_group_line_type_subtype", F.concat_ws("-",F.col("product_group_c"), F.col("product_line_c"), F.col("product_type"), F.col("sub_type")))
           )
raw_data.cache().count()
# display(raw_data)

In [0]:
display(raw_data)

display((raw_data
         .groupby("account_id")
         .agg(F.countDistinct("opportunity_id").alias("total_opps")
              , F.countDistinct("product_group_c").alias("total_product_groups")
              , F.countDistinct("product_group_c","product_line_c").alias("total_product_group_line_combos")
              , F.countDistinct(product_field).alias("total_"+product_field)
              , F.countDistinct("sku_c").alias("total_skus_purchased")
             )
         .orderBy(F.col("total_opps").desc())
        ))

account_id,account_name,opportunity_id,sales_order_c,order_number_c,forecast_date_c,created_date,fiscal_year_num,purchase_rank,is_best_better_bundle,product_group_c,product_line_c,product_type,sub_type,platform_c,product_group_line_type_subtype_platform,sku_c,product_2_id,quantity,end_of_sale_date_c,product_group_line,product_group_line_type,product_group_line_type_subtype
00100000000ujiKAAQ,"Digital River, Inc.",0065000000UjjeLAAR,729983.0,,2015-02-13,2015-02-02T21:16:47.000+0000,2015,1,0,ADC,LTM,Hardware,Switch,4200v,ADC-LTM-Hardware-Switch-4200v,F5-BIG-LTM-4200V,01t50000001lUQGAA2,8.0,2018-04-01,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Switch
00100000000ujiKAAQ,"Digital River, Inc.",0065000000aq0TGAAY,753804.0,,2016-06-16,2016-04-22T19:50:47.000+0000,2016,2,0,ADC,DNS,Hardware,Switch,2000s,ADC-DNS-Hardware-Switch-2000s,F5-BIG-GTM-2000S,01t50000001m7prAAA,2.0,2016-08-21,ADC-DNS,ADC-DNS-Hardware,ADC-DNS-Hardware-Switch
00100000000ujiRAAQ,XCEL Energy Inc.,0065000000QO2ECAA1,724076.0,,2014-10-23,2014-03-12T17:13:55.000+0000,2015,1,1,Security,WAF,Software,Add-on,,Security-WAF-Software-Add-on-NA,F5-ADD-VPR-BT-C2X00,01t50000002TFdtAAG,1.0,2021-07-01,Security-WAF,Security-WAF-Software,Security-WAF-Software-Add-on
00100000000ujiRAAQ,XCEL Energy Inc.,0065000000QO2ECAA1,724076.0,,2014-10-23,2014-03-12T17:13:55.000+0000,2015,1,1,ADC,DNS,Software,Add-on,,ADC-DNS-Software-Add-on-NA,F5-ADD-VPR-BT-C2X00,01t50000002TFdtAAG,1.0,2021-07-01,ADC-DNS,ADC-DNS-Software,ADC-DNS-Software-Add-on
00100000000ujiRAAQ,XCEL Energy Inc.,0065000000QO2ECAA1,724076.0,,2014-10-23,2014-03-12T17:13:55.000+0000,2015,1,1,ADC,LTM,Software,Add-on,,ADC-LTM-Software-Add-on-NA,F5-ADD-VPR-BT-C2X00,01t50000002TFdtAAG,1.0,2021-07-01,ADC-LTM,ADC-LTM-Software,ADC-LTM-Software-Add-on
00100000000ujiRAAQ,XCEL Energy Inc.,0065000000QO2ECAA1,724076.0,,2014-10-23,2014-03-12T17:13:55.000+0000,2015,1,1,Security,Firewall,Software,Add-on,,Security-Firewall-Software-Add-on-NA,F5-ADD-VPR-BT-C2X00,01t50000002TFdtAAG,1.0,2021-07-01,Security-Firewall,Security-Firewall-Software,Security-Firewall-Software-Add-on
00100000000ujiRAAQ,XCEL Energy Inc.,0065000000QO2ECAA1,724076.0,,2014-10-23,2014-03-12T17:13:55.000+0000,2015,1,1,Security,IAM,Software,Add-on,,Security-IAM-Software-Add-on-NA,F5-ADD-VPR-BT-C2X00,01t50000002TFdtAAG,1.0,2021-07-01,Security-IAM,Security-IAM-Software,Security-IAM-Software-Add-on
00100000000ujiRAAQ,XCEL Energy Inc.,0065000000QO2ECAA1,724076.0,,2014-10-23,2014-03-12T17:13:55.000+0000,2015,1,0,ADC,LTM,Hardware,Chassis,C2400,ADC-LTM-Hardware-Chassis-C2400,F5-VPR-LTM-C2400-AC,01t50000001LVY2AAO,1.0,,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Chassis
00100000000ujiRAAQ,XCEL Energy Inc.,0065000000QO2ECAA1,724076.0,,2014-10-23,2014-03-12T17:13:55.000+0000,2015,1,0,ADC,LTM,Hardware,Blade,B2150,ADC-LTM-Hardware-Blade-B2150,F5-VPR-LTM-B2150,01t500000028gGXAAY,1.0,,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Blade
00100000000ujiRAAQ,XCEL Energy Inc.,0065000000TxeUjAAJ,726559.0,,2014-12-12,2014-12-04T17:08:21.000+0000,2015,2,1,Security,IAM,Software,Add-on,,Security-IAM-Software-Add-on-NA,F5-ADD-VPR-BT-C2X00,01t50000002TFdtAAG,1.0,2021-07-01,Security-IAM,Security-IAM-Software,Security-IAM-Software-Add-on


account_id,total_opps,total_product_groups,total_product_group_line_combos,total_product_group_line_type_subtype_platform,total_skus_purchased
0015000000Mw4T5AAJ,872,3,6,12,21
00100000000ujymAAA,550,3,10,42,93
00100000000ukYNAAY,365,4,11,70,111
0015000000n7kOwAAI,324,4,10,38,68
0015000000PGmx0AAD,297,3,7,64,95
00100000000wxtXAAQ,256,4,8,86,117
00100000001VRC2AAO,255,4,13,116,185
00100000002L6WIAA0,228,1,1,8,25
00100000000uk8YAAQ,223,5,13,75,117
00100000000unK3AAI,219,3,10,80,121


In [0]:
#raw_data.select("account_id").dropDuplicates().count()
print(f"Original no of accounts: {raw_data.select('account_id').dropDuplicates().count()}")



In [0]:
# print
print(f"Original data shape: {raw_data.count()} rows, {len(raw_data.columns)} columns")
print(f"Original no. of accounts: {raw_data.select('account_id').dropDuplicates().count()}")

# get valid accounts (ones with 3+ purchased SKU line items; F.count counts non-null sku_c rows, not distinct SKUs)
valid_accts = (raw_data
               .groupBy(F.col("account_id"))
               .agg(F.count("sku_c").alias("total_skus_purchased"))
               .filter(F.col("total_skus_purchased")>2)
               .select(F.col("account_id"))
               .dropDuplicates()
              )
print(f"No. accounts after dropping accounts that have made too few purchases: {valid_accts.count()}")

train_df = valid_accts.join(other=raw_data, on=["account_id"], how="INNER").alias("new_train_df")
print(f"Shape after dropping accounts that have made too few purchases: {train_df.count()} rows and {len(train_df.columns)} columns")
print(f"No. accounts after dropping accounts that have made too few purchases: {train_df.select('account_id').dropDuplicates().count()}")

In [0]:
# get total account purchases per account
total_account_purchases = (raw_data
                           .groupBy(F.col("account_id"))
                           .agg(F.count(F.col(product_field)).alias("total_purchases"))
                          )
display(total_account_purchases)

account_id,total_purchases
00100000000uk8kAAA,25
00100000000ukKZAAY,219
00100000000uku4AAA,7
00100000000umgyAAA,1
00100000001Ak35AAC,14
00100000001Pk4MAAS,4
00100000001Td7jAAC,10
0011T00002JZMDaQAP,7
0011T00002JZwTrQAL,50
0011T00002K9vx8QAB,2


In [0]:
# # get total account purchases per account
# total_account_purchases = (train_df
#                            .groupBy(F.col("account_id"))
#                            .agg(F.count(F.col(product_field)).alias("total_purchases"))
#                           )
# display(total_account_purchases)

In [0]:
# get dataframe that labels the bundle purchases
acct_gbb = (train_df
            .filter(F.col("is_best_better_bundle")==1)
            .groupby(F.col("account_id"),F.col("account_name"),F.col("forecast_date_c"),F.col("opportunity_id"))
            .agg(F.sort_array(F.collect_set(F.col(product_field))).cast(F.StringType()).alias(product_field),
                 F.max(F.col("purchase_rank")).alias("purchase_rank"),
                 F.max(F.col("is_best_better_bundle")).alias("is_best_better_bundle"),
                 F.sort_array(F.collect_set(F.col("product_group_c"))).cast(F.StringType()).alias("product_group_c"), 
                 F.sort_array(F.collect_set(F.col("product_line_c"))).cast(F.StringType()).alias("product_line_c"),
                 F.sort_array(F.collect_set(F.col("product_type"))).cast(F.StringType()).alias("product_type"),
                 F.sort_array(F.collect_set(F.col("sub_type"))).cast(F.StringType()).alias("sub_type"),
                 F.sort_array(F.collect_set(F.col("platform_c"))).cast(F.StringType()).alias("platform_c"), 
                 F.sort_array(F.collect_set(F.col("product_group_line"))).cast(F.StringType()).alias("product_group_line"),
                 F.sort_array(F.collect_set(F.col("product_group_line_type"))).cast(F.StringType()).alias("product_group_line_type"),
                 F.sort_array(F.collect_set(F.col("product_group_line_type_subtype"))).cast(F.StringType()).alias("product_group_line_type_subtype"),
                 F.sort_array(F.collect_set(F.col("sku_c"))).cast(F.StringType()).alias("sku_c")
                )
           )


# remove bundle allocated rows and replace with single 'bundle' row
bundled_train_df = (train_df.filter(F.col("is_best_better_bundle")!=1).select(acct_gbb.columns) # only keep non bundle allocated items
                    .union(acct_gbb) # add the bundled item back in as their own product / row 
                    .drop("is_best_better_bundle")
                   )
print("Total products: ", bundled_train_df.select(product_field).distinct().count())
print("Total Accounts: ", bundled_train_df.select("account_id").distinct().count())

In [0]:
# create next-purchase-sequence product combinations dataframe
base = bundled_train_df.select("account_id","purchase_rank","product_group_c",product_field)


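# Get sequential combos (next single purchase)
# NOTE: w_order has no partitionBy, so Spark moves every row into a single partition to evaluate lead()
w_order = Window.orderBy('account_id','purchase_rank')
w_partition_order = Window.partitionBy('account_id','purchase_rank')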

general_next_purchase = (base
                         .withColumn("next_purchase_rank",F.last(F.lead("purchase_rank").over(w_order)).over(w_partition_order)).orderBy('purchase_rank')
                         .alias("lagged_df")
                         .join(other=(base
                                      .withColumnRenamed(product_field,"next_"+product_field)
                                      .withColumnRenamed("product_group_c","next_product_group_c")
                                      .alias("tmp")
                                     )
                               , on=[F.col("next_purchase_rank") == F.col("tmp.purchase_rank")
                                   , F.col("lagged_df.account_id") == F.col("tmp.account_id")]
                               ,how="cross"
                              )
                         .select("lagged_df.*","tmp.next_product_group_c","tmp.next_"+product_field)
                         .withColumn("sequential_combos", F.concat(F.col(product_field), F.lit("-->"), F.col("next_"+product_field)))
                         .filter(F.col("purchase_rank") < F.col("next_purchase_rank"))
                         .orderBy(F.col("account_id"),F.col("purchase_rank").asc(),F.col("next_purchase_rank").asc(), F.col(product_field), F.col("next_"+product_field))
                        )


## Get sequential combos (next single purchase) within product groups
w_order = Window.orderBy("account_id","product_group_c","purchase_rank")
w_partition_order = Window.partitionBy("account_id","product_group_c","purchase_rank")
product_groups_next_purchase = (base
                               .withColumn("next_purchase_rank",F.last(F.lead("purchase_rank").over(w_order)).over(w_partition_order)).orderBy('purchase_rank')
                               .alias("lagged_df")
                               .join(other=(base
                                           .withColumnRenamed(product_field,f"next_{product_field}")
                                           .withColumnRenamed("product_group_c","next_product_group_c")
                                           .alias("tmp")
                                          )
                                    ,on=[F.col("next_purchase_rank") == F.col("tmp.purchase_rank")
                                         , F.col("lagged_df.account_id")==F.col("tmp.account_id")
                                         , F.col("lagged_df.product_group_c")==F.col("tmp.next_product_group_c")
                                        ]
                                    , how = "cross")
                               .dropDuplicates()
                               .filter(F.col("lagged_df.purchase_rank")<F.col("next_purchase_rank"))
                               .orderBy(F.col("lagged_df.account_id"), F.col("lagged_df.purchase_rank").asc(), F.col("next_purchase_rank").asc(), F.col(product_field), F.col("next_"+product_field))
                               .select("lagged_df.account_id","lagged_df.purchase_rank","lagged_df.product_group_c","lagged_df."+product_field,"lagged_df.next_purchase_rank","tmp.next_product_group_c","tmp.next_"+product_field)
                               .withColumn("sequential_combos", F.concat(F.col(product_field), F.lit("-->"), F.col("next_"+product_field)))
                              )


# union sequential combo results
unioned_sequence_results = (general_next_purchase
                            .select("account_id",product_field,"next_"+product_field, "sequential_combos", "purchase_rank", "next_purchase_rank")
                            .union((product_groups_next_purchase
                                    .select("account_id",product_field,"next_"+product_field, "sequential_combos", "purchase_rank", "next_purchase_rank")
                                   )
                                  )
                            .distinct().alias("next_purchase_sequence"))

# get sequential combinations that have at least X number of accounts who purchased that sequence
min_num_accounts = 5
common_enough_products = (unioned_sequence_results
                          .groupby("sequential_combos")
                          .agg(F.count(F.col("account_id")).alias("total_rows"),
                               F.countDistinct(F.col("account_id")).alias("total_accounts"))
                          .filter(F.col("total_accounts")>=min_num_accounts)
                         )

# filter to ideal product sequences
filtered_results = (unioned_sequence_results
                    .distinct()
#                     .filter(~F.col("next_"+product_field).contains("ADC-LTM") | F.col("next_"+product_field).contains("Marketplace")) # remove where subsequent purchase in sequence is ADC-LTM except Marketplace 
#                     .filter(~F.col("next_"+product_field).contains("Security") | F.col("next_"+product_field).contains("Marketplace")) # remove where subsequent purchase in sequence is security except Marketplace 
                    .filter(F.col(product_field) != F.col("next_"+product_field)) # remove where sequence is a repurchase of same product
                    .alias("filtered_results")
                    .join(other=common_enough_products, on="sequential_combos", how="inner") # only select 'popular enough' sequence combos
                    .select("filtered_results.*")
                   )

# # # format 
next_purchase_sequence = (filtered_results
                          .select("sequential_combos","next_"+product_field,"account_id","next_purchase_rank")
                          .alias("filtered_results")
                          .join(other=train_df.alias("raw_data"), # relate the subsequent purchase in the sequence back to the raw data
                                on=[F.col("filtered_results.account_id")==F.col("raw_data.account_id"), 
                                    F.col("filtered_results.next_purchase_rank")==F.col("raw_data.purchase_rank"),
                                    F.col("filtered_results.next_"+product_field)==F.col("raw_data."+product_field)],
                                how="INNER"
                               )
                          .select('raw_data.account_id',
                                 'raw_data.account_name',
                                 'raw_data.opportunity_id',
                                 'raw_data.sales_order_c',
                                 'raw_data.order_number_c',
                                 'raw_data.forecast_date_c',
                                 'raw_data.created_date',
                                 'raw_data.fiscal_year_num',
                                 'raw_data.purchase_rank',
                                 'raw_data.is_best_better_bundle',
                                 'raw_data.product_group_c',
                                 'raw_data.product_line_c',
                                 'raw_data.product_type',
                                 'raw_data.sub_type',
                                 'raw_data.platform_c',
                                 'raw_data.product_group_line_type_subtype_platform',
                                 'raw_data.sku_c',
                                 'raw_data.product_2_id',
                                 'raw_data.quantity',
                                 'raw_data.end_of_sale_date_c',
                                 'raw_data.product_group_line',
                                 'raw_data.product_group_line_type',
                                 'raw_data.product_group_line_type_subtype',
                                 'filtered_results.sequential_combos')
                          .drop(product_field)
                          .withColumnRenamed("sequential_combos",product_field)
                          .select(bundled_train_df.columns) # grab same columns as other processed DFs above
                         )

display(next_purchase_sequence)


print("Total products: ", next_purchase_sequence.select(product_field).distinct().count())
print("Total Accounts: ", next_purchase_sequence.select("account_id").distinct().count())

account_id,account_name,forecast_date_c,opportunity_id,product_group_line_type_subtype_platform,purchase_rank,product_group_c,product_line_c,product_type,sub_type,platform_c,product_group_line,product_group_line_type,product_group_line_type_subtype,sku_c
00100000000ujiRAAQ,XCEL Energy Inc.,2015-05-08,0065000000VvunVAAR,ADC-LTM-Hardware-Blade-B2150-->ADC-DNS-Hardware-Switch-2000s,3,ADC,DNS,Hardware,Switch,2000s,ADC-DNS,ADC-DNS-Hardware,ADC-DNS-Hardware-Switch,F5-BIG-DNS-2000S
00100000000ujiRAAQ,XCEL Energy Inc.,2015-05-08,0065000000VvunVAAR,ADC-LTM-Hardware-Chassis-C2400-->ADC-DNS-Hardware-Switch-2000s,3,ADC,DNS,Hardware,Switch,2000s,ADC-DNS,ADC-DNS-Hardware,ADC-DNS-Hardware-Switch,F5-BIG-DNS-2000S
00100000000ujiRAAQ,XCEL Energy Inc.,2015-05-08,0065000000VvunVAAR,"[ADC-DNS-Software-Add-on-NA, ADC-LTM-Software-Add-on-NA, Security-Firewall-Software-Add-on-NA, Security-IAM-Software-Add-on-NA, Security-WAF-Software-Add-on-NA]-->ADC-DNS-Hardware-Switch-2000s",3,ADC,DNS,Hardware,Switch,2000s,ADC-DNS,ADC-DNS-Hardware,ADC-DNS-Hardware-Switch,F5-BIG-DNS-2000S
00100000000ujizAAA,U.S. Bancorp,2016-05-11,0061T00000ovBXtQAM,ADC-LTM-Hardware-Appliance-2000s-->NGINX-NGINX Plus-Subscription-Virtual Edition-Virtual,6,NGINX,NGINX Plus,Subscription,Virtual Edition,Virtual,NGINX-NGINX Plus,NGINX-NGINX Plus-Subscription,NGINX-NGINX Plus-Subscription-Virtual Edition,F5-NGX-PLS-BAS
00100000000ujizAAA,U.S. Bancorp,2016-05-11,0061T00000ovBXtQAM,ADC-LTM-Hardware-Appliance-2000s-->NGINX-NGINX Plus-Subscription-Virtual Edition-Virtual,6,NGINX,NGINX Plus,Subscription,Virtual Edition,Virtual,NGINX-NGINX Plus,NGINX-NGINX Plus-Subscription,NGINX-NGINX Plus-Subscription-Virtual Edition,F5-NGX-PLS-PRO
00100000000ujjRAAQ,3M Company,2016-07-19,0065000000beIIuAAM,ADC-LTM-Hardware-Appliance-5250v-->ADC-BIG-IQ-Software-Virtual Edition-NA,5,ADC,BIG-IQ,Software,Virtual Edition,,ADC-BIG-IQ,ADC-BIG-IQ-Software,ADC-BIG-IQ-Software-Virtual Edition,F5-BIQ-VE-S
00100000000ujjlAAA,Mayo Foundation,2018-04-30,0065000000kD5hSAAS,ADC-LTM-Hardware-Blade-B2250-->ADC-LTM-Hardware-Switch-7250v,14,ADC,LTM,Hardware,Switch,7250v,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Switch,F5-BIG-LTM-7250V
00100000000ujk4AAA,The Sherwin-Williams Company,2019-11-27,0061T00000nLtrHQAS,Security-IPI-Subscription-NA-NA-->Security-WAF-Subscription-NA-NA,13,Security,WAF,Subscription,,,Security-WAF,Security-WAF-Subscription,Security-WAF-Subscription-NA,F5-SBS-BIG-TC-3-3YR
00100000000ujkgAAA,Schwan's Sales Enterprises,2019-08-15,0061T00000nLVGvQAO,Security-WAF-Software-Add-on-NA-->ADC-LTM-Hardware-Appliance-i4800,3,ADC,LTM,Hardware,Appliance,i4800,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Appliance,F5-BIG-LTM-I4800
00100000000ujkgAAA,Schwan's Sales Enterprises,2019-08-15,0061T00000nLVGvQAO,ADC-LTM-Hardware-Appliance-i2800-->ADC-LTM-Hardware-Appliance-i4800,3,ADC,LTM,Hardware,Appliance,i4800,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Appliance,F5-BIG-LTM-I4800
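
The sequence construction above can be sketched in plain Python. This covers only the general next-purchase case (the within-product-group variant is omitted), and the function name and toy data are assumptions made for illustration.

```python
from collections import defaultdict

def next_purchase_sequences(purchases, min_accounts=5):
    """Build 'current-->next' sequence products from purchase history.

    purchases: iterable of (account_id, purchase_rank, product).
    Returns {sequence: set of accounts} for sequences bought by at least
    min_accounts distinct accounts, with repurchases (a-->a) dropped.
    """
    by_account = defaultdict(lambda: defaultdict(set))
    for account, rank, product in purchases:
        by_account[account][rank].add(product)

    accounts_per_combo = defaultdict(set)
    for account, by_rank in by_account.items():
        ranks = sorted(by_rank)
        # Pair every product at one purchase with every product at the next.
        for current, nxt in zip(ranks, ranks[1:]):
            for a in by_rank[current]:
                for b in by_rank[nxt]:
                    if a != b:
                        accounts_per_combo[f"{a}-->{b}"].add(account)

    return {
        combo: accounts
        for combo, accounts in accounts_per_combo.items()
        if len(accounts) >= min_accounts
    }

# Five accounts buy X then Y; a sixth buys X, then X again, then Z.
toy = [(f"acct{i}", rank, prod) for i in range(5) for rank, prod in [(1, "X"), (2, "Y")]]
toy += [("acct5", 1, "X"), ("acct5", 2, "X"), ("acct5", 3, "Z")]
popular = next_purchase_sequences(toy)
everything = next_purchase_sequences(toy, min_accounts=1)
```

With the default threshold only `X-->Y` survives; lowering `min_accounts` to 1 also keeps `X-->Z`, while the `X-->X` repurchase is never produced.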


In [0]:
union_results = bundled_train_df.union(next_purchase_sequence)
display(union_results)

account_id,account_name,forecast_date_c,opportunity_id,product_group_line_type_subtype_platform,purchase_rank,product_group_c,product_line_c,product_type,sub_type,platform_c,product_group_line,product_group_line_type,product_group_line_type_subtype,sku_c
00100000000ujiRAAQ,XCEL Energy Inc.,2014-10-23,0065000000QO2ECAA1,ADC-LTM-Hardware-Chassis-C2400,1,ADC,LTM,Hardware,Chassis,C2400,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Chassis,F5-VPR-LTM-C2400-AC
00100000000ujiRAAQ,XCEL Energy Inc.,2014-10-23,0065000000QO2ECAA1,ADC-LTM-Hardware-Blade-B2150,1,ADC,LTM,Hardware,Blade,B2150,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Blade,F5-VPR-LTM-B2150
00100000000ujiRAAQ,XCEL Energy Inc.,2014-12-12,0065000000TxeUjAAJ,ADC-LTM-Hardware-Blade-B2150,2,ADC,LTM,Hardware,Blade,B2150,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Blade,F5-VPR-LTM-B2150
00100000000ujiRAAQ,XCEL Energy Inc.,2014-12-12,0065000000TxeUjAAJ,ADC-DNS-Hardware-Switch-2000s,2,ADC,DNS,Hardware,Switch,2000s,ADC-DNS,ADC-DNS-Hardware,ADC-DNS-Hardware-Switch,F5-BIG-DNS-2000S
00100000000ujiRAAQ,XCEL Energy Inc.,2014-12-12,0065000000TxeUjAAJ,ADC-LTM-Hardware-Chassis-C2400,2,ADC,LTM,Hardware,Chassis,C2400,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Chassis,F5-VPR-LTM-C2400-AC
00100000000ujiRAAQ,XCEL Energy Inc.,2015-05-08,0065000000VvunVAAR,ADC-LTM-Hardware-Blade-B2150,3,ADC,LTM,Hardware,Blade,B2150,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Blade,F5-VPR-LTM-B2150
00100000000ujiRAAQ,XCEL Energy Inc.,2015-05-08,0065000000VvunVAAR,ADC-LTM-Hardware-Chassis-C2400,3,ADC,LTM,Hardware,Chassis,C2400,ADC-LTM,ADC-LTM-Hardware,ADC-LTM-Hardware-Chassis,F5-VPR-LTM-C2400-AC
00100000000ujiRAAQ,XCEL Energy Inc.,2015-05-08,0065000000VvunVAAR,ADC-DNS-Hardware-Switch-2000s,3,ADC,DNS,Hardware,Switch,2000s,ADC-DNS,ADC-DNS-Hardware,ADC-DNS-Hardware-Switch,F5-BIG-DNS-2000S
00100000000ujiRAAQ,XCEL Energy Inc.,2015-05-08,0065000000VvunVAAR,ADC-LTM-Software-Add-on-NA,3,ADC,LTM,Software,Add-on,,ADC-LTM,ADC-LTM-Software,ADC-LTM-Software-Add-on,F5-ADD-VPR-VCMP-2400
00100000000ujiRAAQ,XCEL Energy Inc.,2015-05-08,0065000000VvunVAAR,ADC-BIG-IQ-Software-Virtual Edition-NA,3,ADC,BIG-IQ,Software,Virtual Edition,,ADC-BIG-IQ,ADC-BIG-IQ-Software,ADC-BIG-IQ-Software-Virtual Edition,F5-BIQ-VE-S


In [0]:
display(union_results.groupBy('product_group_c').count())

product_group_c,count
Security,101704
Volterra,112
ADC,209428
Silverline,18123
NGINX,18199
Shape,1517
Aspen Mesh,16
[ADC],67
"[ADC, Security]",17319
[Security],1


### Mintigo Implicit Rating Score via Correlations
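
The correlations loaded below come from a column named `phi_corr`, which suggests a phi coefficient between two binary variables (has_purchased and a Mintigo boolean feature). The notebook that computes them is not shown here, so the following is only a sketch of the standard formula under that assumption; `phi_coefficient` is a hypothetical helper name.

```python
import math

def phi_coefficient(xs, ys):
    """Phi coefficient between two equal-length sequences of 0/1 values."""
    n11 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # An undefined correlation (a constant column) is treated as 0 here,
    # in the same spirit as replacing null phi_corr values with zero.
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom

has_purchased = [1, 1, 1, 0, 0, 0]    # toy values
mintigo_feature = [1, 1, 0, 0, 0, 1]  # toy values
weight = round(phi_coefficient(has_purchased, mintigo_feature), 3)
```

Rounding to 3 decimal places matches how `mintigo_feature_weight` is derived from `phi_corr` in the next cell.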

In [0]:
# bring in mintigo data and correlations to enhance implicit rating

####
# get important (mintigo feature, product) pairs and the related weight
path = "dbfs:/tmp/mintigo_implicit_ratings-product_group_line_type_subtype"
mintigo_product_field = "product_group_line_type_subtype"
mintigo_correlations = pysh.null_replace(pysh.safe_name(spark.read.format('delta').load(path)), cols=['phi_corr'], replace_type='zero')

# define feature weight - currently adding both correlations ; consider taking the MAX correlation btwn the two
mintigo_correlations = mintigo_correlations.withColumn('mintigo_feature_weight', F.round(F.col('phi_corr'),3))

mint_features = [row[0] for row in mintigo_correlations.select('index').collect()]
str_mint_features = str(set(mint_features)).replace("{","").replace("}","").replace("'","")
# display(mintigo_correlations)

####
# get mintigo data
stmt = """SELECT MINT.SALESFORCE_ACCOUNT_ID AS ACCOUNT_ID , SFDC.NAME, """ + str_mint_features + """
            FROM EXP_MKTG.DATA_SCIENCE.MINTIGO_TARGET MINT 
          INNER JOIN (
              SELECT SALESFORCE_ACCOUNT_ID, MAX(CALENDAR_DATE) CALENDAR_DATE 
              FROM EXP_MKTG.DATA_SCIENCE.MINTIGO_TARGET
              GROUP BY SALESFORCE_ACCOUNT_ID) LATEST_DATA 
          ON MINT.SALESFORCE_ACCOUNT_ID = LATEST_DATA.SALESFORCE_ACCOUNT_ID
              AND MINT.CALENDAR_DATE = LATEST_DATA.CALENDAR_DATE
          INNER JOIN PRD_ENT_RAW.SALESFORCE.ACCOUNT SFDC 
          ON MINT.SALESFORCE_ACCOUNT_ID = SFDC.ID
        """

mint_df_pysh = pysh.safe_name(dbh.read_ede_data(stmt))
rm_features = [col for col in mint_df_pysh.columns if 'date' in col or 'source_record' in col or 'company_name' in col or 'account_name' in col or 'url' in col or 'email' in col or 'annual_revenue_category_new' in col]

mint_df_pysh = pysh.drop_cols(df=mint_df_pysh, col_list=rm_features)

# replace all True/False strings with 1/0 nums
bool_cols = list(set(mint_df_pysh.columns) - {'account_id','name'})
mint_df_pysh = (pysh.null_replace(pysh.cast_col_type(mint_df_pysh, bool_cols, 'int'), cols=bool_cols, replace_type='zero')
                .withColumnRenamed("name","account_name")
               )
# display(mint_df_pysh)

In [0]:
## For each boolean feature equal to 1, replace it with the correlation weight IF it belongs to a significant (mintigo_feature, product) pair
##### Then sum the features to create a Mintigo 'score' for each (account, product) pair
tmp_m = mint_df_pysh.crossJoin(mintigo_correlations.select(mintigo_product_field).dropDuplicates())
mint_corr_pd = mintigo_correlations.toPandas() ## TO PANDAS FOR LOOPING PURPOSES!!!
for mint_ftr, product, weight in mint_corr_pd[['index',mintigo_product_field,'mintigo_feature_weight']].itertuples(index=False):
  tmp_m = tmp_m.withColumn(mint_ftr, (F.when(((F.col(mintigo_product_field)==product)&(F.col(mint_ftr)==1)), weight).otherwise(F.col(mint_ftr))))
  
# now replace all 1s with 0 - signifies they weren't a (mintigo feature, product) pair that should utilize correlation as score
tmp_m = tmp_m.replace(1,0,bool_cols)

# # add bool cols together to create mintigo score
tmp_m = (tmp_m
         .withColumn("mintigo_acct_product_score", sum([F.col(c) for c in bool_cols]))
         .filter(F.col("mintigo_acct_product_score")>0) # keep rows where theres a correlation score related to the product
         .withColumn("mintigo_acct_product_score_scaled", F.col("mintigo_acct_product_score")*2) # scale the correlation score to increase weight
        )
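
The weighting loop above can be hard to follow in column form. The sketch below expresses the same idea with plain dictionaries: for every (account, product) pair, sum the weights of the significant features that the account has set to 1, then scale by 2. The function name and toy inputs are assumptions, not part of the notebook.

```python
def mintigo_scores(account_features, correlations, scale=2):
    """Sum correlation weights per (account, product) pair.

    account_features: {account_id: {feature_name: 0 or 1}}
    correlations: list of (feature_name, product, weight) tuples, i.e. the
        significant (mintigo feature, product) pairs.
    Returns {(account_id, product): (score, scaled_score)} for scores > 0.
    """
    scores = {}
    for account, features in account_features.items():
        for feature, product, weight in correlations:
            # Only features the account actually has (value 1) contribute.
            if features.get(feature) == 1:
                key = (account, product)
                scores[key] = scores.get(key, 0.0) + weight
    return {k: (s, s * scale) for k, s in scores.items() if s > 0}

account_features = {"A": {"f1": 1, "f2": 0, "f3": 1}}  # toy values
correlations = [
    ("f1", "P", 0.118),
    ("f2", "P", 0.5),
    ("f3", "P", 0.031),
    ("f3", "Q", 0.2),
]
result = mintigo_scores(account_features, correlations)
```

Feature `f2` contributes nothing because the account's value is 0, which corresponds to the step that zeroes out any remaining 1s before summing.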

In [0]:
# Relate mintigo data to appropriate products
#### Single products can get a mintigo correlation boost (whether previously purchased or not) if the account has a 1 for a highly correlated mintigo feature belonging to an important (feature, product) pair
#### Bundles get no boost
#### Sequences get a boost only if the account has previously purchased the sequence; the sequence receives the boost of its 2nd product
mintigo_purchase_base = (union_results
                         .groupby("account_id",mintigo_product_field,product_field)
                         .agg(F.count(F.col("sku_c")).alias("total_sku"))
                        )

mintigo_score_data = (tmp_m 
                      .join(other=(mintigo_purchase_base 
                                   .filter(~F.col(product_field).contains('-->')) # remove sequences
                                   .filter(~F.col(product_field).contains('[')) # remove bundles
                                   .select([mintigo_product_field,product_field]) # dataset of unique products at the grain they can relate to mintigo + product recommender grain
                                   .dropDuplicates()
                                  ) # dataset of unique products at the grain they can relate to mintigo + product recommender grain
                            , on=[mintigo_product_field]
                            , how='left') # relate all accounts based on correlation to appropriate single products (non sequences & bundles) ; this introduces ratings for products both purchased and not already purchased
                      .select("account_id",mintigo_product_field,product_field,"mintigo_acct_product_score","mintigo_acct_product_score_scaled")
                      .union(
                        (tmp_m
                         .join(other=(mintigo_purchase_base # relate previous purchases to mintigo data
                                      .filter(F.col(product_field).contains('-->')) # of sequences only
                                      .filter(~F.col(product_field).contains('[')) # remove bundles
                                      .select(['account_id',mintigo_product_field,product_field]) # note that mintigo product field is related to the 2nd item in a sequence
                                     )
                               , on=['account_id',mintigo_product_field]
                               , how='inner' 
                            )
                         
                         .select("account_id",mintigo_product_field,product_field,"mintigo_acct_product_score","mintigo_acct_product_score_scaled")
                        )
                      )
                      .filter(~F.col(product_field).isNull())
                      .join(other=valid_accts, on=["account_id"], how="inner") # only get a mintigo score for 'valid accounts'
                     )
mintigo_score_data.cache().count()
# display(mintigo_score_data)

In [0]:
display(mintigo_score_data.summary())

summary,account_id,product_group_line_type_subtype,product_group_line_type_subtype_platform,mintigo_acct_product_score,mintigo_acct_product_score_scaled
count,321775,321775,321775,321775.0,321775.0
mean,,,,0.0866453018413507,0.1732906036827015
stddev,,,,0.0791292517099852,0.1582585034199705
min,00100000000ujiRAAQ,ADC-LTM-Utility-Marketplace,ADC-BIG-IQ-Hardware-Appliance-7000-->NGINX-NGINX Plus-Subscription-Virtual Edition-Virtual,0.004,0.008
25%,,,,0.031,0.062
50%,,,,0.0579999999999999,0.1159999999999999
75%,,,,0.118,0.236
max,0015000002IPJTGAA5,Volterra-VoltStack-Subscription-SaaS,Volterra-VoltStack-Subscription-SaaS-NA,0.518,1.036
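
The two boost rules implemented by the join/union above can be hard to see through the chained calls. Below is a minimal pure-Python sketch of the same idea, assuming toy data; `scores`, `catalog`, `purchased_sequences` and `boosted_rows` are hypothetical names used only for illustration and do not exist in the notebook.

```python
# Toy illustration only; all names and values here are hypothetical.
# (account_id, product_group) -> correlation-based score for that account
scores = {("A1", "NGINX"): 0.118, ("A2", "NGINX"): 0.031}

# Catalog rows: (product group of the product, or of the 2nd item in a sequence; product)
catalog = [
    ("NGINX", "NGINX-Plus"),               # single product
    ("NGINX", "ADC-BIG-IQ-->NGINX-Plus"),  # sequence, marked by '-->'
    ("NGINX", "[NGINX-Plus+Support]"),     # bundle, marked by '['
]

# Sequences that each account has actually purchased before
purchased_sequences = {("A1", "ADC-BIG-IQ-->NGINX-Plus")}


def boosted_rows(scores, catalog, purchased_sequences):
    rows = []
    for (account, group), score in scores.items():
        for cat_group, product in catalog:
            if cat_group != group or "[" in product:
                continue  # unrelated product group, or a bundle (bundles get no boost)
            is_sequence = "-->" in product
            if is_sequence and (account, product) not in purchased_sequences:
                continue  # sequences are boosted only if previously purchased
            rows.append((account, product, score))
    return sorted(rows)


rows = boosted_rows(scores, catalog, purchased_sequences)
```

In this sketch account `A2` receives a score for the single product even though it never bought it, while only `A1` receives a score for the sequence, which mirrors the `left` join for single products and the `inner` join on `account_id` for sequences.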


### Intricately Implicit Rating Score via Correlations

In [0]:
####
# get important (Intricately feature, product) pairs and the related weight
intricately_corr_path = "dbfs:/tmp/intricately_corr-product_group_line_type_subtype"
intricately_data_path = "dbfs:/tmp/intricately_data" ### FOR NOW, directly in DBFS til query is optimized to pull from EDE

intricately_product_field = "product_group_line_type_subtype"
intri_corr_col = "infrastructure_configuration"

intricately_correlations = (pysh.safe_name(spark.read.format('delta').load(intricately_corr_path))
                            .withColumn('intri_feature_weight', F.round(F.col('corr'),3))
                            .filter(F.col("index").contains("config")) # TEMP only use any intricately 'infrastructure configuration' correlations as a boost for score
                           )
intricately_correlations = (intricately_correlations
                            .withColumn(intri_corr_col, F.trim(F.element_at(F.split(intricately_correlations["index"],"_"),-1)))) # get string label for infra config (last "_"-separated token of the feature name)

intri_features = [row[0] for row in intricately_correlations.select('index').collect()]
str_intri_features = str(set(intri_features)).replace("{","").replace("}","").replace("'","")
# display(intricately_correlations)

####
# get intricately  data
raw_intricately_data = (pysh.safe_name(spark.read.format('delta').load(intricately_data_path))
                        .withColumnRenamed('id','acct_id')
                       )

# get intricately data for select accounts and using only important fields
intri_df = (raw_intricately_data
            .select("acct_id",intri_corr_col) # select only intricately columns for correlation score
            .dropDuplicates()
            .join(other = (raw_intricately_data
                           .select("acct_id",intri_corr_col)
                           .dropDuplicates()
                           .groupby("acct_id")
                           .agg(F.count("*").alias("total_rows"))
                           .filter(F.col("total_rows")==1)
                          )
                  , on=["acct_id"]
                  , how="inner" # Select accounts with one-to-one mapping only 
                 )
            .withColumnRenamed("acct_id","account_id")
            .drop(*["total_rows"])     
           )

display(intricately_correlations)
# display(raw_intricately_data)
# display(intri_df)

index,product_group_line_type_subtype,corr,intri_feature_weight,infrastructure_configuration
infrastructure_configuration_Hybrid,Aspen Mesh-Aspen Mesh-Subscription-Virtual Edition,0.008843379184871,0.009,Hybrid
infrastructure_configuration_Hybrid,NGINX-NGINX Plus-Subscription-Marketplace,0.0045268736869379,0.005,Hybrid
infrastructure_configuration_Hybrid,Security-IAM-Utility-Marketplace,0.0350082528698362,0.035,Hybrid
infrastructure_configuration_Hybrid,Shape-WebFraud-Subscription-NA,0.0048321706597718,0.005,Hybrid
infrastructure_configuration_Hybrid,Silverline-DDoS-Utility-NA,0.0055526662350374,0.006,Hybrid
infrastructure_configuration_Cloud Leader,Volterra-VoltStack-Subscription-SaaS,0.0037360535879755,0.004,Cloud Leader
infrastructure_configuration_Hybrid,NGINX-Controller-Utility-NA,0.0118512551982345,0.012,Hybrid
infrastructure_configuration_Hybrid,NGINX-Microservices-Subscription-Virtual Edition,-0.008111455507119,-0.008,Hybrid
infrastructure_configuration_Hybrid,Security-Firewall-Utility-Marketplace,0.0389962075642003,0.039,Hybrid
infrastructure_configuration_Hybrid,Shape-EAP-Utility-Marketplace,0.0062556827746824,0.006,Hybrid
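
Two small steps in the cell above are easy to miss: the infrastructure label is taken as the last `_`-separated token of the feature name, and only accounts that map to exactly one distinct `infrastructure_configuration` are kept. A minimal pure-Python sketch of both, assuming toy inputs (`infra_label` and `one_to_one_accounts` are hypothetical helper names, not notebook functions):

```python
from collections import defaultdict


def infra_label(index_name):
    # Mirrors split on "_" + element_at(..., -1) + trim in the Spark code
    return index_name.split("_")[-1].strip()


def one_to_one_accounts(pairs):
    # pairs: iterable of (acct_id, infrastructure_configuration); duplicates allowed
    configs = defaultdict(set)
    for acct, cfg in pairs:
        configs[acct].add(cfg)
    # keep only accounts with exactly one distinct configuration
    return {acct: next(iter(cfgs)) for acct, cfgs in configs.items() if len(cfgs) == 1}


label = infra_label("infrastructure_configuration_Cloud Leader")
kept = one_to_one_accounts(
    [("a", "Hybrid"), ("a", "Hybrid"), ("b", "Hybrid"), ("b", "Cloud Leader")]
)
```

Account `b` is dropped here because it carries two different configurations, which is the case the `total_rows == 1` filter guards against.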


In [0]:
intri_corr_col_values = [r[intri_corr_col] for r in intri_df.select(intri_corr_col).distinct().collect()]
tmp_c = intri_df.crossJoin(intricately_correlations.select(intricately_product_field).dropDuplicates())
intri_corr_pd = intricately_correlations.toPandas() # small lookup table, collected to pandas for looping purposes
# Accumulate a single score expression across all (feature, product) pairs.
# Calling withColumn(intri_ftr, ...) once per pair overwrites the column whenever several products
# share the same feature name, so only the last pair's weight would survive.
intri_score = F.lit(0.0)
for product, infra_config, weight in intri_corr_pd[[intricately_product_field,intri_corr_col,'intri_feature_weight']].itertuples(index=False):
  intri_score = intri_score + F.when(((F.col(intricately_product_field)==product)&(F.col(intri_corr_col)==infra_config)), float(weight)).otherwise(0.0)
  
# the sum of the matching weights is the intricately score
tmp_c = (tmp_c
         .withColumn("intricately_acct_product_score", intri_score)
         .filter(F.col("intricately_acct_product_score")>0) # keep rows where there's a positive correlation score related to the product
         .withColumn("intricately_acct_product_score_scaled", F.col("intricately_acct_product_score")*2) # scale the correlation score by a factor of 2
        )
  
  

# display(tmp_c)
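
The scoring rule in the cell above amounts to: for each (account, product) pair, add up the weights whose product and infrastructure configuration both match, keep only positive totals, and double them for the scaled column. A minimal pure-Python sketch, assuming a handful of weights copied from the correlation table displayed earlier (`intricately_scores` is a hypothetical helper name):

```python
# (product_group_line_type_subtype, infrastructure_configuration, intri_feature_weight)
weights = [
    ("Security-IAM-Utility-Marketplace", "Hybrid", 0.035),
    ("Security-Firewall-Utility-Marketplace", "Hybrid", 0.039),
    ("NGINX-Microservices-Subscription-Virtual Edition", "Hybrid", -0.008),
    ("Volterra-VoltStack-Subscription-SaaS", "Cloud Leader", 0.004),
]


def intricately_scores(account_config, weights, scale=2):
    by_product = {}
    for product, config, weight in weights:
        if config == account_config:
            by_product[product] = by_product.get(product, 0.0) + weight
    # keep positive totals only; return (score, scaled score) per product
    return {p: (s, s * scale) for p, s in by_product.items() if s > 0}


hybrid = intricately_scores("Hybrid", weights)
cloud = intricately_scores("Cloud Leader", weights)
```

Note that the negatively correlated product is filtered out entirely rather than lowering a rating, which follows from the `> 0` filter.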

In [0]:
# relate intricately data to appropriate products
intricately_purchase_base = (union_results
                           .groupby("account_id",intricately_product_field,product_field)
                           .agg(F.count(F.col("sku_c")).alias("total_sku"))
                          )

intricately_score_data = (tmp_c 
                          .join(other=(intricately_purchase_base  # reuse intricately purchase base dataframe 
                                       .filter(~F.col(product_field).contains('-->')) # remove sequences
                                       .filter(~F.col(product_field).contains('[')) # remove bundles
                                       .select([intricately_product_field,product_field]) # dataset of unique products at the grain they can relate to intricately + product recommender grain
                                       .dropDuplicates()
                                      ) # dataset of unique products at the grain they can relate to intricately + product recommender grain
                                , on=[intricately_product_field]
                                , how='left') # relate all accounts based on correlation to appropriate single products (non sequences & bundles) ; this introduces ratings for products both purchased and not already purchased
                          .select("account_id",intricately_product_field,product_field,"intricately_acct_product_score","intricately_acct_product_score_scaled")
                          .union(
                            (tmp_c
                             .join(other=(intricately_purchase_base # relate previous purchases to intricately data
                                          .filter(F.col(product_field).contains('-->')) # of sequences only
                                          .filter(~F.col(product_field).contains('[')) # remove bundles
                                          .select(['account_id',intricately_product_field,product_field]) # note that intricately product field is related to the 2nd item in a sequence
                                         )
                                   , on=['account_id',intricately_product_field]
                                   , how='inner' 
                                )

                             .select("account_id",intricately_product_field,product_field,"intricately_acct_product_score","intricately_acct_product_score_scaled")
                            )
                          )
                          .filter(~F.col(product_field).isNull())
                          .join(other=valid_accts, on=["account_id"], how="inner") # get intricately score only for valid accounts
                         )
intricately_score_data.cache().count()
# display(intricately_score_data)

### Historical Purchase Implicit Rating Score via Pct Purchased

In [0]:
# aggregate purchase history training data
clean_training_data = (union_results
                       .groupBy([F.col("account_id"), F.col("account_name"), product_field]) 
                       .agg(F.count(F.col(product_field)).alias("num_purchases"),
                            F.max(F.col("purchase_rank")).alias("total_baskets"), # total organic baskets
                            F.max(F.col(mintigo_product_field)).alias(mintigo_product_field),
                            F.max(F.col("product_group_line")).alias("product_group_line")
                           ) # note that with these aggregations we are double counting transactions, so normalized counts won't add up to 1 (which is OK)
                       .join(other=total_account_purchases, on=["account_id"], how="LEFT")
                       .withColumn("pct_purchases", F.col("num_purchases") / F.col("total_purchases"))
#                        .withColumn("pct_purchases_scaled", F.when(F.col(product_field).contains("")))
                       .withColumn("log_num_purchases", F.log(F.col("num_purchases")))
                      )
distinct_accounts = clean_training_data.select(F.col("account_id")).distinct().count()
print(f"Training data info:\n" 
      f"Shape: {clean_training_data.count()} rows and {len(clean_training_data.columns)} columns. "
      f"Total accounts:{distinct_accounts}")
# display(clean_training_data)
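
As a worked example of the historical purchase features, here is a minimal pure-Python sketch. It assumes that `total_purchases` is the number of purchase rows per account (the definition of `total_account_purchases` is not shown in this section) and uses toy data; `purchase_shares` is a hypothetical helper name.

```python
import math
from collections import Counter

# One entry per purchased SKU row: (account_id, product)
purchases = [("A1", "P1"), ("A1", "P1"), ("A1", "P2"), ("A2", "P1")]


def purchase_shares(purchases):
    per_product = Counter(purchases)                            # num_purchases
    per_account = Counter(account for account, _ in purchases)  # assumed total_purchases
    return {
        key: {
            "num_purchases": n,
            "pct_purchases": n / per_account[key[0]],
            "log_num_purchases": math.log(n),
        }
        for key, n in per_product.items()
    }


shares = purchase_shares(purchases)
```

In this toy data the shares of an account sum to 1. In the notebook they need not, because the same transaction can be counted under a single product, a bundle, and a sequence.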

In [0]:
# display((clean_training_data
#          .select(product_field)
#          .dropDuplicates()
#          .filter(((F.col(product_field).contains("ADC"))&~F.col(product_field).contains("-->")) 
#                                         | ((~F.col(product_field).contains("ADC"))|F.col(product_field).contains("Marketplace")))# remove rows with ADC except for Marketplace rows or related to sequences where ADC may be in a sequence
#         ))



# # (((~F.col(product_field).contains("ADC"))|F.col(product_field).contains("-->")) 
# #                                         | ((~F.col(product_field).contains("ADC"))|F.col(product_field).contains("Marketplace")))# remove rows with ADC except for Marketplace rows or related to sequences where ADC may be in a sequence

In [0]:
# display(clean_training_data.summary())

In [0]:
last_x_purchases = 3 #  make this a parameter for easy changes & tuning
acct_max_purchase_rank_filter_df = (union_results
                                    .groupBy([F.col("account_id"), F.col("account_name")])
                                    .agg(F.max(F.col("purchase_rank")).alias("last_purchase_rank"))
                                   )

recent_purchases_training_df = (union_results
                                .join(other=acct_max_purchase_rank_filter_df, on=["account_id","account_name"], how="INNER")
                                .filter(F.col("purchase_rank")>(F.col("last_purchase_rank")-last_x_purchases))
#                                 .filter(((~F.col(product_field).contains("ADC"))|F.col(product_field).contains("-->")) 
#                                         | ((~F.col(product_field).contains("ADC"))|F.col(product_field).contains("Marketplace"))) # Remove rows with ADC except for Marketplace rows or related to sequences where ADC may be in a sequence
#                                 .filter(((~F.col(product_field).contains("Security-"))|F.col(product_field).contains("-->")) 
#                                         | ((~F.col(product_field).contains("Security"))|F.col(product_field).contains("Marketplace"))) # Remove rows with Security except for Marketplace rows or related to sequences where Security may be in a sequence
                               )

total_account_recent_purchases = (recent_purchases_training_df
                                  .groupBy(F.col("account_id"))
                                  .agg(F.count(F.col(product_field)).alias("total_recent_purchases"))
                                 )

# aggregate recent training data
clean_recent_training_data = (recent_purchases_training_df
                              .groupBy([F.col("account_id"), F.col("account_name"), F.col("product_group_line"), F.col(mintigo_product_field),product_field])
                              .agg(F.count(F.col(product_field)).alias("num_recent_purchases"))
                              .join(other=total_account_recent_purchases, on=["account_id"], how="LEFT")
                              .withColumn("pct_recent_purchases", F.col("num_recent_purchases") / F.col("total_recent_purchases"))
                      )
clean_recent_training_data.cache().count() # materialize the df that you wrote in the cell with the least expensive call (.count())
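
The recency filter `purchase_rank > last_purchase_rank - last_x_purchases` keeps the last `last_x_purchases` baskets of each account. A minimal pure-Python sketch, assuming that `purchase_rank` increases by one per basket and that higher ranks are more recent (`recent_rows` is a hypothetical helper name):

```python
def recent_rows(rows, last_x=3):
    # rows: (account_id, purchase_rank, product)
    last_rank = {}
    for account, rank, _ in rows:
        last_rank[account] = max(rank, last_rank.get(account, rank))
    # same predicate as the Spark filter above
    return [r for r in rows if r[1] > last_rank[r[0]] - last_x]


rows = [
    ("A1", 1, "P1"), ("A1", 2, "P2"), ("A1", 3, "P3"), ("A1", 4, "P4"),
    ("A2", 1, "P1"),
]
recent = recent_rows(rows)
```

Accounts with fewer than `last_x` baskets, such as `A2` here, simply keep all of their purchases.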

In [0]:
# finalize scores related to purchase history
purchase_hist_training_data = (clean_training_data
                             .join(other=clean_recent_training_data.drop("account_name"),
                                   on=["account_id","product_group_line",mintigo_product_field,product_field], how="LEFT")
                             .drop(clean_recent_training_data.account_name) #this could LEFT join instead of FULL 
                              )

In [0]:
# combine aggregated scores
final_clean_training_data = (purchase_hist_training_data
                             .join(other=mintigo_score_data, on=["account_id",mintigo_product_field,product_field,], how="FULL")
                             .join(other=intricately_score_data, on=["account_id",intricately_product_field,product_field],how="FULL")
                            ) 
final_clean_training_data = (pysh.null_replace(final_clean_training_data, 
                                           cols=["num_purchases", "total_baskets", "total_purchases", "pct_purchases","log_num_purchases",
                                                 "num_recent_purchases", "total_recent_purchases", "pct_recent_purchases", "mintigo_acct_product_score", "mintigo_acct_product_score_scaled", "intricately_acct_product_score", "intricately_acct_product_score_scaled"],
                                           replace_type='zero')
                            )
final_clean_training_data = final_clean_training_data.withColumn("implicit_rating", F.col("pct_purchases")+F.col("pct_recent_purchases")+F.col("mintigo_acct_product_score")+F.col("intricately_acct_product_score"))

# final_clean_training_data.count()
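
The implicit rating is a plain sum of four components, with nulls introduced by the `FULL` joins treated as zero. A minimal pure-Python sketch with toy values (`implicit_rating` as a function and `COMPONENTS` are hypothetical names). Note that the sum uses the unscaled Mintigo and Intricately scores, not the `_scaled` columns.

```python
COMPONENTS = (
    "pct_purchases",
    "pct_recent_purchases",
    "mintigo_acct_product_score",
    "intricately_acct_product_score",
)


def implicit_rating(row):
    # Missing or None components count as zero, as after the null replacement
    return sum((row.get(c) or 0.0) for c in COMPONENTS)


# A product the account has bought, with a Mintigo boost
purchased = implicit_rating(
    {"pct_purchases": 0.5, "pct_recent_purchases": 0.25, "mintigo_acct_product_score": 0.125}
)
# A product the account has never bought: the rating comes from the correlation score alone
score_only = implicit_rating({"pct_purchases": None, "mintigo_acct_product_score": 0.118})
```

The second case shows why the joins are `FULL`: correlation scores introduce ratings for products the account has not purchased yet.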

In [0]:

print("Accounts:",final_clean_training_data.select("account_id").distinct().count())

In [0]:
# dbutils.notebook.exit("bye")

In [0]:
# Evict the raw data from cache to free up memory
raw_data.unpersist()
intricately_score_data.unpersist()
clean_recent_training_data.unpersist()
mintigo_score_data.unpersist()

# cache the final training data
final_clean_training_data.cache().count()

In [0]:
# display(final_clean_training_data)

In [0]:
# display((final_clean_training_data
#          .groupby(product_field)
#          .agg(F.count(F.col("*")).alias("total_count")
#               , F.max(F.col("implicit_rating")).alias("max_implicit_rating")
#               , F.min(F.col("implicit_rating")).alias("min_implicit_rating")
#               , F.mean(F.col("implicit_rating")).alias("avg_implicit_rating")
              
#               , F.max(F.col("pct_purchases")).alias("max_pct_purchases")
#               , F.min(F.col("pct_purchases")).alias("min_pct_purchases")
#               , F.mean(F.col("pct_purchases")).alias("avg_pct_purchases")
              
#               , F.max(F.col("pct_recent_purchases")).alias("max_pct_recent_purchases")
#               , F.min(F.col("pct_recent_purchases")).alias("min_pct_recent_purchases")
#               , F.mean(F.col("pct_recent_purchases")).alias("avg_pct_recent_purchases")
              
#               , F.max(F.col("mintigo_acct_product_score_scaled")).alias("max_mintigo_acct_product_score_scaled")
#               , F.min(F.col("mintigo_acct_product_score_scaled")).alias("min_mintigo_acct_product_score_scaled")
#               , F.mean(F.col("mintigo_acct_product_score_scaled")).alias("avg_mintigo_acct_product_score_scaled")
              
#               , F.max(F.col("intricately_acct_product_score_scaled")).alias("max_intricately_acct_product_score_scaled")
#               , F.min(F.col("intricately_acct_product_score_scaled")).alias("min_intricately_acct_product_score_scaled")
#               , F.mean(F.col("intricately_acct_product_score_scaled")).alias("avg_intricately_acct_product_score_scaled")
#              )
#          .orderBy(F.col("max_implicit_rating").desc())
#         )
#        )
        

In [0]:
# display((final_clean_training_data
# #          .filter(F.col(product_field).contains("["))
#          .orderBy(F.col(implicit_rating_field).desc())
#          .select("account_id",product_field,"implicit_rating","pct_purchases","pct_recent_purchases","mintigo_acct_product_score_scaled")
#          .groupby(product_field,"implicit_rating","pct_purchases","pct_recent_purchases","mintigo_acct_product_score_scaled")
#          .agg(F.count(F.col("account_id")).alias("total_accounts"))
#          .orderBy(F.col(implicit_rating_field).desc())
#         ))

In [0]:
# dbutils.notebook.exit("bye")

# display((filtered_recs
#          .filter(F.col("bundle")==False)
#          .filter(F.col("sequence")==False)
#          .filter(F.col(product_field).contains("Hardware"))
#          .select("account_id",product_field,rating)
# #          .withColumn("rating",F.when((F.col(product_field).contains("ADC-LTM-Hardware")&F.col("rating")>0.5), 0.5).otherwise(F.col("rating"))) 
#         ))

In [0]:
# # check on distribution of products
# display((final_recs
#          .groupby(product_field)
#          .agg(F.countDistinct(F.col("account_id")).alias("total_accounts")
#               , (F.countDistinct(F.col("account_id")) / final_recs.select("account_id").dropDuplicates().count() * 100).alias("pct_accounts") 
#              )
#          .join(other=(final_recs
#                       .filter(F.col("rank")<=10)
#                       .groupby(product_field)
#                       .agg(F.countDistinct(F.col("account_id")).alias("total_accounts_with_rec_in_top_10")
#                            , (F.countDistinct(F.col("account_id")) / final_recs.select("account_id").dropDuplicates().count() * 100).alias("pct_accounts_with_rec_in_top_10") 
#                           ))
#                , on=[product_field]
#                , how="left"
#               )
#          .join(other=(final_recs
#                       .filter(F.col("rank")<=5)
#                       .groupby(product_field)
#                       .agg(F.countDistinct(F.col("account_id")).alias("total_accounts_with_rec_in_top_5")
#                            , (F.countDistinct(F.col("account_id")) / final_recs.select("account_id").dropDuplicates().count() * 100).alias("pct_accounts_with_rec_in_top_5") 
#                           ))
#                , on=[product_field]
#                , how="left"
#               )
#          .orderBy(F.col("total_accounts_with_rec_in_top_5").desc())
#         )
#        )

In [0]:
display(final_clean_training_data.summary())

In [0]:
display(final_clean_training_data.filter(F.col("implicit_rating")==0))

In [0]:
# Create category codes for accounts and products 
account_window = Window.orderBy("account_cat_id_long")
product_window = Window.orderBy("product_cat_id_long")
account_codes = (valid_accts
                 .select("account_id")
                 .distinct()
                 .withColumn("account_cat_id_long", F.monotonically_increasing_id())
                 .withColumn("account_cat_id", F.row_number().over(account_window))
                )
# print(account_codes.count())

# Create product codes 
product_codes = (final_clean_training_data
                 .select(product_field)
                 .distinct()
                 .withColumn("product_cat_id_long", F.monotonically_increasing_id())
                 .withColumn("product_cat_id", F.row_number().over(product_window))
                )
# print(product_codes.count())

## Join back to processed training data and get account/product codes dataframe 
final_clean_training_data = (final_clean_training_data
                             .join(other=account_codes.select("account_id","account_cat_id"), on=["account_id"], how="INNER")
                             .join(other=product_codes.select(product_field,"product_cat_id"), on=[product_field], how="INNER")
                            )
# display(final_clean_training_data)
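
The category codes above turn string IDs into consecutive integers starting at 1, which is the kind of integer user/item ID that ALS implementations typically expect. A minimal pure-Python sketch of the mapping, under one stated assumption: it sorts by value for determinism, whereas the notebook orders by `monotonically_increasing_id()`, so the actual assignments can differ and may change between runs (`dense_codes` is a hypothetical helper name).

```python
def dense_codes(values):
    # Distinct values -> consecutive integer codes starting at 1
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


account_ids = ["0015000002IPJTGAA5", "00100000000ujiRAAQ", "0015000002IPJTGAA5"]
codes = dense_codes(account_ids)
```

Because the codes are only meaningful together with the mapping that produced them, the notebook writes `account_codes` and `product_codes` out alongside the training data.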

In [0]:
# print("Total accounts:",clean_training_data.select("account_id").distinct().count())
# print("Total accounts:",valid_accts.select("account_id").distinct().count())
# print("Total accounts:",next_purchase_sequence.select("account_id").distinct().count())
# print("Total accounts:",bundled_train_df.select("account_id").distinct().count())

# print("Total products:",product_codes.select(product_field).distinct().count())
# print("Total products:",clean_training_data.select(product_field).distinct().count())
# print("Total products:",raw_data.select(product_field).distinct().count())

In [0]:
# Write out training data to delta 
final_clean_training_data.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(train_df_path)

# Grab the latest delta version to pass as an exit value
delta_table = DeltaTable.forPath(spark, train_df_path)
full_history_df = delta_table.history() 
delta_table_version = full_history_df.orderBy(F.col("version").desc()).select("version").limit(1).collect()[0][0]

In [0]:
# Write out account ID data to delta 
account_df_path = "dbfs:/tmp/next_product_recommender_acccount_codes"
account_codes.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(account_df_path)

# Grab the latest delta version to pass as an exit value
delta_table = DeltaTable.forPath(spark, account_df_path)
full_history_df = delta_table.history() 
account_cat_delta_table_version = full_history_df.orderBy(F.col("version").desc()).select("version").limit(1).collect()[0][0]

In [0]:
# Write out prod ID data to delta 
prod_df_path = "dbfs:/tmp/next_product_recommender_prod_codes"
product_codes.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(prod_df_path)

# Grab the latest delta version to pass as an exit value
delta_table = DeltaTable.forPath(spark, prod_df_path)
full_history_df = delta_table.history() 
product_cat_delta_table_version = full_history_df.orderBy(F.col("version").desc()).select("version").limit(1).collect()[0][0]

### Analysis

In [0]:
# # Set default index type 
# ks.set_option("compute.default_index_type", "distributed")

In [0]:
# tmp = clean_training_data.select(F.col("num_purchases"), F.col("pct_purchases"), F.col("log_num_purchases")).to_koalas()
# tmp['num_purchases'].plot.box()

In [0]:
# tmp['pct_purchases'].plot.box()

In [0]:
# tmp['log_num_purchases'].plot.box()

In [0]:
# # How many unique product lines do customers see atm?
# print("How many unique products do customers see atm (median)?: ",tmp.groupby(['account_id'])[product_field].nunique().median())
# print("How many unique products do customers see atm (average)?: ",tmp.groupby(['account_id'])[product_field].nunique().mean())
# print("How many unique products do customers see atm (max)?: ",tmp.groupby(['account_id'])[product_field].nunique().max())

# tmp_df = tmp.groupby(['account_id'])[product_field].nunique().to_frame()
# tmp_df[product_field].plot.box()

In [0]:
# tmp_df["implicit_rating_field"].plot.box()

In [0]:
# # View num-purchases distribution per product field
# order = tmp.groupby(product_field)[implicit_rating_field].sum().to_frame().sort_values(by=[implicit_rating_field], ascending=False)#.head(100)
# order["num_purchases"].plot.box()
# # plot = sns.boxplot(x=product_field, y='num_purchases',data=train_data,palette='pastel',order=order.index)
# # plot.set_title('Product  num purchases',fontsize='x-large')

In [0]:
# # view products by their median sku purchase values - which products are more dispersed across account population 
# tmp = train_data.groupby(product_field)['num_purchases'].median().to_frame().reset_index().sort_values(by=['num_purchases'],ascending=False).head(10)
# plot = sns.barplot(x=product_field,y='num_purchases',data=tmp)

# plot.set_title('Product median num purchases',fontsize='x-large')

In [0]:
# train_data['product'] = train_data[product_field]
# spark_train = train_data.to_spark()

# windowSpecification = Window.partitionBy("account_cat_id").orderBy(F.desc(implicit_rating_field))
# spark_train = spark_train.withColumn('implicit_rating_rank',F.dense_rank().over(windowSpecification)).sort(F.col('account_cat_id').asc(),F.col('implicit_rating_rank').asc())

In [0]:
return_dict = {"train_df_path": train_df_path,
               "train_df_version": delta_table_version, 
               "account_df_path": account_df_path,
               "account_cat_delta_table_version":account_cat_delta_table_version,
               "prod_df_path": prod_df_path, 
               "product_cat_delta_table_version": product_cat_delta_table_version
              }

In [0]:
dbutils.notebook.exit(json.dumps(return_dict))
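
Since the exit value is a JSON string, a calling notebook would need to parse it before using the paths and versions. A minimal sketch of that round trip, assuming placeholder values (the path and version below are hypothetical, and the call that runs this notebook is not shown):

```python
import json

# Hypothetical stand-in for the dictionary built above
return_dict = {"train_df_path": "dbfs:/tmp/example_train", "train_df_version": 3}

payload = json.dumps(return_dict)  # the string handed to dbutils.notebook.exit
restored = json.loads(payload)     # what a caller could do with the returned string

train_path = restored["train_df_path"]
train_version = restored["train_df_version"]
```

Passing the Delta version along with the path lets the caller read exactly the snapshot that this run wrote, even if the table is overwritten later.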