# Hackathon Group 22

## Imports

In [None]:
import pandas as pd
import numpy as np
from collections import Counter

## Datasets Exploration

### Retail DataFrame Overview

DataFrame giving an overview of the traffic on the retailer website.

There are 3 kinds of event: Product Page View, Add to cart, Order.

Only Order rows have values in the sales and quantity columns. For the other two events, the customer did not buy the product.

#### Data Cleaning and first exploration

In [None]:
df_retail = pd.read_csv('retailer.csv')
df_retail['timestamp_utc'] = pd.to_datetime(df_retail['timestamp_utc'])
df_retail.head(3)

Unnamed: 0,customer_id,timestamp_utc,event_name,brand,product_name,sales,quantity
0,reFs5GI87lXJkJSi9r,2024-02-07 02:27:10,Product Page View,,,,
1,reFs5GI87lXJkJSi9r,2024-06-12 16:16:54,Product Page View,Science Diet,SD Fel A7+ SavCknEnt 24x5.5oz cs,,
2,reTjziox2cSrxVq70Y,2024-02-28 04:11:46,Product Page View,,,,


In [None]:
df_retail.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9866049 entries, 0 to 9866048
Data columns (total 7 columns):
 #   Column         Dtype         
---  ------         -----         
 0   customer_id    object        
 1   timestamp_utc  datetime64[ns]
 2   event_name     object        
 3   brand          object        
 4   product_name   object        
 5   sales          float64       
 6   quantity       float64       
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 526.9+ MB


Checking for null values for event_name == Order

In [None]:
print(f"There is {sum(df_retail[df_retail['event_name']=='Order']['sales'].isna())} NaN values in the sale column for Orders")
print(f"There is {sum(df_retail[df_retail['event_name']=='Order']['quantity'].isna())} NaN values in the quantity column for Orders")
print(f"There is {sum(df_retail[df_retail['event_name']=='Order']['brand'].isna())} NaN values in the brand column for Orders")
print(f"There is {sum(df_retail[df_retail['event_name']=='Order']['product_name'].isna())} NaN values in the product_name column for Orders")

There is 0 NaN values in the sale column for Orders
There is 0 NaN values in the quantity column for Orders
There is 8776 NaN values in the brand column for Orders
There is 8776 NaN values in the product_name column for Orders
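
The four checks above can also be condensed into a single call. The following is a minimal sketch on a toy frame: the sample rows are invented for illustration, and only the column names come from the retail DataFrame.

```python
import pandas as pd

# Toy stand-in for df_retail; the values are invented for illustration only.
toy = pd.DataFrame({
    "event_name": ["Order", "Order", "Product Page View", "Order"],
    "brand": ["Science Diet", None, None, "Hills"],
    "product_name": ["A", None, None, "B"],
    "sales": [10.0, 20.0, None, 5.0],
    "quantity": [1.0, 2.0, None, 1.0],
})

# Keep only Order rows, then count missing values per column in one pass.
order_mask = toy["event_name"] == "Order"
columns_to_check = ["sales", "quantity", "brand", "product_name"]
nan_per_column = toy.loc[order_mask, columns_to_check].isna().sum()
print(nan_per_column)
```

Filtering once with `.loc` avoids rebuilding the Order subset for every column, which matters on a DataFrame of almost 10 million rows.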


For Orders, the brand and product_name columns have exactly the same number of NaN values. Let's investigate.

In [None]:
df_retail_order = df_retail[df_retail['event_name']=='Order']
df_retail_missing = df_retail_order[df_retail_order.isnull().any(axis=1)]
df_retail_missing.head(3)

Unnamed: 0,customer_id,timestamp_utc,event_name,brand,product_name,sales,quantity
863,reczbO5sThS4rw0JdJ,2024-01-05 21:38:18,Order,,,78.99,1.0
885,reSFq86EE91VCfBImt,2024-03-15 03:35:17,Order,,,101.55,1.0
1575,reOseksZcmsUreRQbs,2024-03-15 02:40:17,Order,,,101.55,1.0


Not much information. There doesn't seem to be an obvious reason why these transactions have no brand or product name. Let's keep them for now, since they are still orders that count. ***If needed, we can remove them, especially if we focus on brand/product name.***

#### Proportion of the different columns

In [None]:
df_retail['event_name'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
event_name,Unnamed: 1_level_1
Product Page View,0.64632
Add to cart,0.207795
Order,0.145885


In [None]:
df_retail['brand'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
brand,Unnamed: 1_level_1
Science Diet,0.819553
Prescription Diet,0.130102
Hills,0.050345


In [None]:
df_retail['product_name'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
product_name,Unnamed: 1_level_1
SD Ca Adt SmPws Ckn 4.5lb bg,2.305219e-02
SD Ca Adt SenSt&Sk Sm&Min Ckn 4lb bg,1.956145e-02
SD Ca Adt SenSt&Sk Ckn 30lb bg,1.933848e-02
SD Ca A7+ SB Ckn 5lb bg,1.668998e-02
SD Pup SmPws Ckn 4.5lb bg,1.665535e-02
...,...
SD Fel Adt HBC OFEnt 24x5.5oz cs,8.343059e-07
SD Ktn 12x2.9oz VarPk,8.343059e-07
SD Fel Adt SavSalEnt 24x2.9oz cs,7.300177e-07
SD Fel A7+ SrVit Ckn&VgEnt 24x5.5oz cs,2.085765e-07


#### Unknown Investigation

In [None]:
df_retail_unknown = df_retail[df_retail['customer_id'] == 'unknown']
df_retail_unknown.head(3)

Unnamed: 0,customer_id,timestamp_utc,event_name,brand,product_name,sales,quantity
201,unknown,2024-04-13 00:44:59,Product Page View,,,,
202,unknown,2024-04-13 00:46:22,Product Page View,,,,
203,unknown,2024-04-13 00:53:12,Product Page View,Science Diet,SD Ca Adt Lt SB Ckn 5lb bg,,


In [None]:
df_retail_unknown.info()

<class 'pandas.core.frame.DataFrame'>
Index: 194530 entries, 201 to 9866032
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   customer_id    194530 non-null  object        
 1   timestamp_utc  194530 non-null  datetime64[ns]
 2   event_name     194530 non-null  object        
 3   brand          189317 non-null  object        
 4   product_name   189317 non-null  object        
 5   sales          28423 non-null   float64       
 6   quantity       28423 non-null   float64       
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 11.9+ MB


In [None]:
194530/9866049*100

1.9717112696277912

There are 194,530 rows with an 'unknown' customer_id. The entire DataFrame has 9,866,049 rows.

This represents roughly 2% of the data.



In [None]:
df_retail_unknown['event_name'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
event_name,Unnamed: 1_level_1
Product Page View,0.646301
Add to cart,0.207588
Order,0.146111


Strikingly, the event_name proportions for the unknown rows are almost identical to those of the full dataset.

This suggests that removing them would not change the event mix of our analysis, at least for training models. Absolute totals (for example the number of orders) would still shrink by about 2%.
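
To put a number on "almost the same proportion", the two distributions can be placed side by side and their largest gap measured. This is a sketch under stated assumptions: the toy rows are invented, and `max_gap` is a hypothetical name, not something from the original notebook.

```python
import pandas as pd

# Toy stand-in for df_retail; the rows are invented for illustration only.
toy = pd.DataFrame({
    "customer_id": ["unknown", "unknown", "reA", "reA", "reB", "reB", "reB", "unknown"],
    "event_name": ["Order", "Product Page View", "Order", "Product Page View",
                   "Product Page View", "Add to cart", "Order", "Add to cart"],
})

is_unknown = toy["customer_id"] == "unknown"

# Align both distributions on the event name, filling absent events with 0.
comparison = pd.DataFrame({
    "all": toy["event_name"].value_counts(normalize=True),
    "unknown": toy.loc[is_unknown, "event_name"].value_counts(normalize=True),
}).fillna(0.0)

# Largest absolute difference between the two proportions.
comparison["abs_diff"] = (comparison["all"] - comparison["unknown"]).abs()
max_gap = comparison["abs_diff"].max()
print(comparison)
print(f"Largest gap: {max_gap:.4f}")
```

On the real data, a gap well below one percentage point would support the observation that the unknown rows behave like the rest of the traffic.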

### Socio Demo DataFrame Overview

DataFrame that focuses on the socio-demographic information of the customers. It contains: customer_id, breed, age, income.

#### Data Cleaning and First Exploration

In [None]:
df_socio_demo = pd.read_csv('socio_demo.csv')
df_socio_demo.head(3)

Unnamed: 0,customer_id,breed,age,income
0,rezLh5Hae3m6flaxM4,Purebred,[25-35[,[120-200K$[
1,resWkHpEcL1IUfdoBp,Purebred,[25-35[,[80-120K$[
2,re9qxF7kS9R2LwOVVY,Purebred,[35-45[,[80-120K$[


In [None]:
df_socio_demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1354584 entries, 0 to 1354583
Data columns (total 4 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   customer_id  1354584 non-null  object
 1   breed        1354584 non-null  object
 2   age          1354584 non-null  object
 3   income       1354584 non-null  object
dtypes: object(4)
memory usage: 41.3+ MB


In [None]:
df_socio_demo[df_socio_demo.isnull().any(axis=1)]

Unnamed: 0,customer_id,breed,age,income


#### Proportion of the different columns


In [None]:
df_socio_demo['breed'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
breed,Unnamed: 1_level_1
Purebred,0.610532
Mixed-breed,0.389468


In [None]:
df_socio_demo['age'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
age,Unnamed: 1_level_1
[65+[,0.18973
[25-35[,0.180822
[55-65[,0.174363
[45-55[,0.170258
[35-45[,0.150322
[18-25[,0.134505


In [None]:
df_socio_demo['income'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
income,Unnamed: 1_level_1
[120-200K$[,0.219459
[200K$+[,0.209953
[40-80K$[,0.20649
[80-120K$[,0.206475
[0-40K$[,0.157623


#### Unknown Investigation

In [None]:
df_socio_demo[df_socio_demo['customer_id']=='unknown']

Unnamed: 0,customer_id,breed,age,income
744396,unknown,Mixed-breed,[35-45[,[0-40K$[


That was unexpected: 'unknown' also appears in df_socio_demo, with a single profile.

Either this is one person who generated about 2% of the whole website traffic (df_retail), or the technology used to map cookies to the traffic flow created one placeholder profile for all the transactions it could not identify in df_retail. The second explanation seems more plausible to me.

### TV Publisher Dataframe Overview

DataFrame containing the device (TV) on which an ad was played, the time it was played, and its cost.

#### Data Cleaning and First Exploration

In [None]:
df_tv_publisher = pd.read_csv('tv_publisher.csv')
df_tv_publisher['timestamp_utc'] = pd.to_datetime(df_tv_publisher['timestamp_utc'])
df_tv_publisher.head(3)

Unnamed: 0,device_id,timestamp_utc,cost_milli_cent
0,ctv81YlbBXho,2024-04-23 21:09:46,2325.51
1,ctvWr7bOO5Je,2024-04-19 18:31:30,2325.51
2,ctvktBqDUgcV,2024-05-07 23:32:37,2325.51


In [None]:
df_tv_publisher.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5827133 entries, 0 to 5827132
Data columns (total 3 columns):
 #   Column           Dtype         
---  ------           -----         
 0   device_id        object        
 1   timestamp_utc    datetime64[ns]
 2   cost_milli_cent  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 133.4+ MB


In [None]:
df_tv_publisher.describe()

Unnamed: 0,timestamp_utc,cost_milli_cent
count,5827133,5827133.0
mean,2024-04-17 18:17:59.785963008,2325.511
min,2024-02-29 21:00:57,2325.51
25%,2024-03-23 02:31:58,2325.51
50%,2024-04-08 15:31:50,2325.51
75%,2024-05-11 05:58:23,2325.51
max,2024-06-30 23:59:57,2646.51
std,,0.497555


In [None]:
df_tv_publisher[df_tv_publisher.isnull().any(axis=1)]

Unnamed: 0,device_id,timestamp_utc,cost_milli_cent


#### Unknown Investigation

In [None]:
df_tv_publisher[df_tv_publisher['device_id']=='unknown']

Unnamed: 0,device_id,timestamp_utc,cost_milli_cent
139,unknown,2024-04-08 04:11:50,2325.51
140,unknown,2024-04-19 02:34:21,2325.51
141,unknown,2024-04-23 03:22:12,2325.51
142,unknown,2024-04-24 03:15:22,2325.51
143,unknown,2024-04-25 02:54:45,2325.51
...,...,...,...
5826660,unknown,2024-04-16 04:13:45,2325.51
5826749,unknown,2024-05-12 18:23:42,2325.51
5826878,unknown,2024-05-04 15:53:56,2325.51
5826879,unknown,2024-05-04 16:25:33,2325.51


In [None]:
173206/5827133*100

2.9724051261572373

Again a small proportion (3%) of the DataFrame is labelled as 'Unknown'.
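
The same percentage can be computed directly from the column instead of typing the counts by hand. A minimal sketch, assuming the placeholder label is the lowercase string 'unknown'; the helper name `unknown_share` and the toy rows are hypothetical.

```python
import pandas as pd

def unknown_share(df, id_column, label="unknown"):
    """Return the percentage of rows whose id equals `label`."""
    # The mean of a boolean Series is the fraction of True values.
    return (df[id_column] == label).mean() * 100

# Toy stand-in for df_tv_publisher; the ids are invented for illustration only.
toy_tv = pd.DataFrame({"device_id": ["ctvA", "unknown", "ctvB", "ctvC"]})
share = unknown_share(toy_tv, "device_id")
print(f"{share:.2f}% of rows are labelled 'unknown'")
```

The helper could be reused on every DataFrame of this notebook, for example with `customer_id` or `dsp_id` as the id column.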

### Programmatic Publisher DataFrame Overview

In [None]:
programmatic_publisher = pd.read_csv('programmatic_publisher.csv')
programmatic_publisher['timestamp_utc'] = pd.to_datetime(programmatic_publisher['timestamp_utc'])
programmatic_publisher.head(3)

Unnamed: 0,dsp_id,timestamp_utc,campaign_name,device_type,cost_milli_cent
0,dsp9tnGII5BeXbn6LUSFZPcKGCyI0F,2024-02-06 04:10:41,Contextual,Phone,283.496
1,dsp1hXcI9Q6TZYzLEmeTkxzhjqD6HJ,2024-02-26 23:49:23,Retargeting,PC,1884.537
2,dspcd3UcXUcUk0PEo2hb8CEH3WVlFE,2024-06-16 20:55:27,Contextual,TV,601.93


In [None]:
programmatic_publisher.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17493428 entries, 0 to 17493427
Data columns (total 5 columns):
 #   Column           Dtype         
---  ------           -----         
 0   dsp_id           object        
 1   timestamp_utc    datetime64[ns]
 2   campaign_name    object        
 3   device_type      object        
 4   cost_milli_cent  float64       
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 667.3+ MB


In [None]:
programmatic_publisher.describe()

Unnamed: 0,timestamp_utc,cost_milli_cent
count,17493428,17493430.0
mean,2024-03-15 23:03:10.769140736,531.4746
min,2024-01-01 00:00:04,108.331
25%,2024-01-24 20:27:57.750000128,146.61
50%,2024-03-07 05:15:49,251.938
75%,2024-05-03 05:34:40.249999872,544.753
max,2024-06-30 23:59:59,11335.0
std,,728.2323


In [None]:
programmatic_publisher[programmatic_publisher.isnull().any(axis=1)]

Unnamed: 0,dsp_id,timestamp_utc,campaign_name,device_type,cost_milli_cent


#### Proportion for Different Columns

In [None]:
programmatic_publisher['device_type'].value_counts()

Unnamed: 0_level_0,count
device_type,Unnamed: 1_level_1
PC,10216038
Phone,5928879
TV,1347813
Unknown,686
Robot,12


In [None]:
programmatic_publisher['campaign_name'].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
campaign_name,Unnamed: 1_level_1
Contextual,0.7785
Retargeting,0.2215


#### Unknown Investigation

In [None]:
programmatic_publisher[programmatic_publisher['dsp_id']=='unknown']

Unnamed: 0,dsp_id,timestamp_utc,campaign_name,device_type,cost_milli_cent
34,unknown,2024-03-02 18:51:43,Retargeting,PC,3723.196
72,unknown,2024-03-30 16:42:03,Retargeting,PC,1602.259
154,unknown,2024-01-03 17:36:20,Contextual,TV,537.129
155,unknown,2024-01-04 21:00:01,Contextual,TV,544.449
156,unknown,2024-01-05 23:43:44,Contextual,TV,1951.856
...,...,...,...,...,...
17493402,unknown,2024-03-27 21:07:35,Retargeting,TV,125.837
17493403,unknown,2024-03-27 22:50:42,Retargeting,TV,124.645
17493425,unknown,2024-04-16 02:21:41,Retargeting,TV,128.794
17493426,unknown,2024-04-16 02:38:15,Retargeting,TV,130.560


In [None]:
1907730/17493428*100

10.905409734444273

Here the proportion of unknown values is much higher: roughly 11%, which is several times the share seen in the retail (2%) and TV (3%) DataFrames.

Let's try to understand what it represents in terms of cost.

In [None]:
total_unknown_cost = programmatic_publisher[programmatic_publisher['dsp_id']=='unknown']['cost_milli_cent'].sum()
total_known_cost = programmatic_publisher[programmatic_publisher['dsp_id']!='unknown']['cost_milli_cent'].sum()
total_unknown_cost/total_known_cost*100

12.250736212977813

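The cell above divides the unknown cost by the known cost, so the unknown rows cost about 12% of what the known rows cost. Expressed as a share of the total cost, that is roughly 11%, which matches the proportion of rows. If we decide to remove all unknown rows, we would therefore drop about 11% of the Programmatic Publisher cost.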
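
The figure printed above is a ratio of unknown cost to known cost, not a share of the total. The conversion between the two is simple arithmetic, sketched here with the printed value as the only input (the variable names are illustrative):

```python
# Ratio unknown / known, taken from the output of the cell above.
ratio_unknown_to_known = 12.250736212977813 / 100

# If r = unknown / known, then unknown / (unknown + known) = r / (1 + r).
share_of_total = ratio_unknown_to_known / (1 + ratio_unknown_to_known)
print(f"Unknown share of total cost: {share_of_total * 100:.2f}%")
```

The resulting share can be compared directly with the share of unknown rows (about 10.9%) computed earlier.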

### Mapping Transaction Publisher TV DataFrame Overview

#### Data Cleaning and First Exploration

In [None]:
mapping_transac_publisher_tv = pd.read_csv('mapping_transac_publisher_tv.csv')
mapping_transac_publisher_tv.head(3)

Unnamed: 0,customer_id,dsp_id,device_id
0,reFs5GI87lXJkJSi9r,dsp9tnGII5BeXbn6LUSFZPcKGCyI0F,ctv81YlbBXho
1,reTjziox2cSrxVq70Y,dspCSu1n1mhys37Na5OXMaKaE8P8CS,ctvHmkxqZXBg
2,reOrpt9vhSwhbPVtni,dsp1hXcI9Q6TZYzLEmeTkxzhjqD6HJ,ctvwp5n34myx


In [None]:
mapping_transac_publisher_tv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7984411 entries, 0 to 7984410
Data columns (total 3 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   customer_id  object
 1   dsp_id       object
 2   device_id    object
dtypes: object(3)
memory usage: 182.7+ MB


In [None]:
n_rows = len(mapping_transac_publisher_tv)
for col in ['customer_id', 'dsp_id', 'device_id']:
    # Rows beyond the first occurrence of each value; NaN rows are counted too, since nunique() ignores NaN
    n_dup = n_rows - mapping_transac_publisher_tv[col].nunique()
    print(f"There are {n_dup} duplicated values in {col} ({n_dup / n_rows * 100:.2f}%)")

There are 158700 duplicated values in customer_id (1.99%)
There are 869310 duplicated values in dsp_id (10.89%)
There are 391322 duplicated values in device_id (4.90%)


These duplicates are most likely due to the 'unknown' id. Let's check in the next section.

#### Unknown Investigation

In [None]:
print(f"Number of unknown in customer_id: {len(mapping_transac_publisher_tv[mapping_transac_publisher_tv['customer_id']=='unknown']['customer_id'])}")
print(f"Number of unknown in dsp_id: {len(mapping_transac_publisher_tv[mapping_transac_publisher_tv['dsp_id']=='unknown']['dsp_id'])}")
print(f"Number of unknown in device_id: {len(mapping_transac_publisher_tv[mapping_transac_publisher_tv['device_id']=='unknown']['device_id'])}")

Number of unknown in customer_id: 158701
Number of unknown in dsp_id: 869311
Number of unknown in device_id: 232622


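For customer_id and dsp_id, the number of 'unknown' values is exactly one more than the number of duplicates found above, which is expected since the first 'unknown' is not counted as a duplicate. Only the device_id column does not fit, so let's investigate it.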

In [None]:
mapping_transac_publisher_tv.groupby('device_id')['device_id'].count().reset_index(name='count_unk').sort_values(by='count_unk', ascending=False).head(3)

Unnamed: 0,device_id,count_unk
7593088,unknown,232622
5062069,ctvfK4XJCFL9,1
5062067,ctvfK4TSbGuN,1


In [None]:
sum(mapping_transac_publisher_tv['device_id'].isna()) + 232622 + 7593088 == len(mapping_transac_publisher_tv)

True

Here we add up the number of NaN values, the number of 'unknown' values in device_id and the number of distinct known device ids (7,593,088), and check that the sum equals the total length of the mapping DataFrame. It does, meaning device_id contains only unique values, NaN values, or 'unknown'.

Here are the proportions:

In [None]:
print(f"Proportion of NaN's Values: {round(sum(mapping_transac_publisher_tv['device_id'].isna()) / len(mapping_transac_publisher_tv) *100, 2)} %")
print(f"Proportion of Unique Values: {round(7593088 / len(mapping_transac_publisher_tv) *100, 2)} %")
print(f"Proportion of Unknown Values: {round(232622 / len(mapping_transac_publisher_tv) *100, 2)} %")

Proportion of NaN's Values: 1.99 %
Proportion of Unique Values: 95.1 %
Proportion of Unknown Values: 2.91 %
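
The same breakdown can be obtained without hard-coding the counts, by labelling each row as NaN, 'unknown' or a known id. This is a sketch on invented rows; the category labels are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mapping DataFrame; the ids are invented for illustration only.
toy_map = pd.DataFrame({"device_id": ["ctvA", "unknown", None, "ctvB", "unknown", "ctvC"]})

ids = toy_map["device_id"]

# Assign one label per row: the first matching condition wins.
category = np.select(
    [ids.isna(), ids == "unknown"],
    ["NaN", "unknown"],
    default="known id",
)

# Percentage of rows in each category.
breakdown = pd.Series(category).value_counts(normalize=True) * 100
print(breakdown.round(2))
```

Because every row receives exactly one label, the percentages always sum to 100, which replaces the manual equality check above.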


## Merging the whole data into one dataframe

#### Merging first without the Unknown (df_retail base df)

Let's try without the unknown first, then we will do a merge solely on unknown.

First let's create a pivot table to get a better vision of each event name.

In [None]:
df_retail_event_details = df_retail[df_retail['customer_id']!='unknown'].groupby(['customer_id', 'event_name'])['event_name'].count().reset_index(name='count_event')
df_retail_pivot_wo_unk = df_retail_event_details.pivot_table(index='customer_id', columns='event_name', values='count_event', fill_value=0).reset_index()
df_retail_pivot_wo_unk.head(3)

event_name,customer_id,Add to cart,Order,Product Page View
0,re0007V8sqIHsZnbvC,0.0,0.0,2.0
1,re000JYhnKbTkPqMB4,1.0,0.0,2.0
2,re000fIO9QXTWYjOfn,8.0,8.0,7.0
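
As an alternative to the groupby followed by pivot_table, `pd.crosstab` builds the same table of counts in one step and keeps integer counts. A minimal sketch on invented rows, assuming the same two column names as df_retail:

```python
import pandas as pd

# Toy stand-in for df_retail; the rows are invented for illustration only.
toy = pd.DataFrame({
    "customer_id": ["reA", "reA", "reB", "unknown", "reB", "reB"],
    "event_name": ["Order", "Product Page View", "Product Page View",
                   "Order", "Add to cart", "Product Page View"],
})

# Drop the placeholder id, then count events per customer and event type.
known = toy[toy["customer_id"] != "unknown"]
event_counts = pd.crosstab(known["customer_id"], known["event_name"]).reset_index()
event_counts.columns.name = None  # remove the leftover 'event_name' axis label
print(event_counts)
```

Event types a customer never triggered appear as 0, so no `fill_value` is needed.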


In [None]:
df_retail_extra = df_retail[df_retail['customer_id']!='unknown'].groupby('customer_id').agg({
    'product_name': lambda x: Counter(x.dropna()).most_common(1)[0][0] if len(x.dropna()) > 0 else None,
    'timestamp_utc': lambda x: min(x),
    'sales': 'sum',
    'brand': lambda x: Counter(x.dropna()).most_common(1)[0][0] if len(x.dropna()) > 0 else None,
    'quantity': 'sum'
}).reset_index()
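
The Counter-based lambdas above can also be written as a small named helper built on `Series.mode()`. This is a sketch, not a drop-in equivalent: `mode()` returns tied values in sorted order, whereas `Counter.most_common` returns the first one encountered, so ties may resolve differently. The helper name `most_frequent` and the toy rows are hypothetical.

```python
import pandas as pd

def most_frequent(values):
    """Most frequent non-null value, or None when every value is missing."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else None

# Toy stand-in for df_retail; the rows are invented for illustration only.
toy = pd.DataFrame({
    "customer_id": ["reA", "reA", "reA", "reB"],
    "brand": ["Hills", "Science Diet", "Science Diet", None],
    "sales": [5.0, 10.0, None, None],
})

# Named aggregation keeps the output column names explicit.
summary = toy.groupby("customer_id").agg(
    brand=("brand", most_frequent),
    sales=("sales", "sum"),
).reset_index()
print(summary)
```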

In [None]:
df_retail_extra

Unnamed: 0,customer_id,product_name,timestamp_utc,sales,brand,quantity
0,re0007V8sqIHsZnbvC,SD Fel Adt PerWgt Ckn 15lb bg,2024-01-29 23:46:22,0.00,Science Diet,0.0
1,re000JYhnKbTkPqMB4,SD Ca Adt PerWgt Ckn 25lb bg,2024-01-08 06:10:19,0.00,Science Diet,0.0
2,re000fIO9QXTWYjOfn,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,2024-01-24 14:44:47,284.52,Science Diet,8.0
3,re000kbtVVzPwZcEr4,SD Ca Adt SenSt&Sk Ckn 30lb bg,2024-03-06 23:57:04,0.00,Science Diet,0.0
4,re000pHbVOysCXRHgt,SD Ca Adt Lt Ckn 30lb bg,2024-01-01 18:12:21,394.94,Science Diet,5.0
...,...,...,...,...,...,...
1354578,rezzz8a320jhOvmL3A,SD Ca A6+ LB Ckn 15lb bg,2024-04-26 23:11:28,0.00,Science Diet,0.0
1354579,rezzzYRiwreLF23ot3,SD Ca Adt PerWgt Ckn SB 4lb bg,2024-06-04 23:56:40,0.00,Science Diet,0.0
1354580,rezzzZvkIaiWNQ1AmV,PD m/d Feline 8.5lb bg,2024-04-16 22:01:01,0.00,Prescription Diet,0.0
1354581,rezzzipns16pTCb4OS,SD Ca Adt Lt SB Ckn 5lb bg,2024-04-18 01:32:56,66.98,Science Diet,2.0


In [None]:
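# Run this cell only once: re-running it merges df_retail_extra a second time,
# which produces the duplicated *_x / *_y columns visible in the outputs below.
df_retail_pivot_wo_unk = df_retail_pivot_wo_unk.merge(df_retail_extra, how='left', on='customer_id')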

In [None]:
df_retail_pivot_wo_unk[df_retail_pivot_wo_unk['brand_y']=='Hills']

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,product_name_x,timestamp_utc_x,sales_x,brand_x,quantity_x,product_name_y,timestamp_utc_y,sales_y,brand_y,quantity_y
13,re003crViui7MmOapY,2.0,2.0,1.0,HI Ca Nat BkdLtBisc Sm w/Ckn 12x8oz cs,2024-01-05 16:48:31,15.62,Hills,2.0,HI Ca Nat BkdLtBisc Sm w/Ckn 12x8oz cs,2024-01-05 16:48:31,15.62,Hills,2.0
15,re003pI9bjyJX0uX4I,1.0,0.0,1.0,HI Ca Nat SSav T PnBut&Ban 12x8.0oz cs,2024-06-06 21:38:28,0.00,Hills,0.0,HI Ca Nat SSav T PnBut&Ban 12x8.0oz cs,2024-06-06 21:38:28,0.00,Hills,0.0
17,re004P1fZfs9NXKVPL,0.0,0.0,1.0,HI Ca Nat SSav T PnBut&Ban 12x8.0oz cs,2024-03-04 22:49:57,0.00,Hills,0.0,HI Ca Nat SSav T PnBut&Ban 12x8.0oz cs,2024-03-04 22:49:57,0.00,Hills,0.0
34,re007YLIkh1xBxj9FK,14.0,16.0,48.0,HI Ca GF SBkNat Ckn&Car 12x8oz cs,2024-02-05 00:03:37,147.99,Hills,16.0,HI Ca GF SBkNat Ckn&Car 12x8oz cs,2024-02-05 00:03:37,147.99,Hills,16.0
84,re00HMRkq9TkWqSeb4,6.0,6.0,2.0,HI Ca Nat SSav T PnBut&Ban 12x8.0oz cs,2024-01-20 19:08:46,89.63,Hills,7.0,HI Ca Nat SSav T PnBut&Ban 12x8.0oz cs,2024-01-20 19:08:46,89.63,Hills,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1354511,rezzoLVD7Bu7sX6pmh,6.0,5.0,27.0,HI Ca NatJkySt Mini Bf 12x7.1oz cs,2024-01-21 23:21:45,132.30,Hills,15.0,HI Ca NatJkySt Mini Bf 12x7.1oz cs,2024-01-21 23:21:45,132.30,Hills,15.0
1354527,rezzrjoxk405XgsvEY,2.0,2.0,0.0,HI Ca GF SBkNat Bf&SwtPot 12x8oz cs,2024-06-08 18:47:06,15.56,Hills,2.0,HI Ca GF SBkNat Bf&SwtPot 12x8oz cs,2024-06-08 18:47:06,15.56,Hills,2.0
1354534,rezzsR3GRndlCjweZk,9.0,8.0,10.0,HI Ca Nat FlxStxT Bf 12x7.1oz cs,2024-05-04 16:09:43,124.57,Hills,16.0,HI Ca Nat FlxStxT Bf 12x7.1oz cs,2024-05-04 16:09:43,124.57,Hills,16.0
1354543,rezztqaQthZfoG7kms,3.0,3.0,6.0,HI Ca Nat BkdLtBisc Sm w/Ckn 12x8oz cs,2024-01-16 09:24:51,23.40,Hills,3.0,HI Ca Nat BkdLtBisc Sm w/Ckn 12x8oz cs,2024-01-16 09:24:51,23.40,Hills,3.0


Checking that the number of rows equals the number of unique customer ids:

In [None]:
df_retail_pivot_wo_unk['customer_id'].nunique() == len(df_retail_pivot_wo_unk)

True
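
The uniqueness check can also be enforced at merge time with the `validate` argument of `DataFrame.merge`, which raises an error instead of silently multiplying rows. A minimal sketch on invented frames:

```python
import pandas as pd

# Toy frames with one row per customer; the values are invented for illustration only.
left = pd.DataFrame({"customer_id": ["reA", "reB"], "Order": [1, 0]})
right = pd.DataFrame({"customer_id": ["reA", "reB"], "sales": [10.0, 0.0]})

# Succeeds because customer_id is unique on both sides.
merged = left.merge(right, how="left", on="customer_id", validate="one_to_one")

# A duplicated key on the right side makes the same merge fail loudly.
duplicated_right = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(duplicated_right, how="left", on="customer_id", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(f"Rows after merge: {len(merged)}, duplicate key detected: {raised}")
```

The extra key check costs some time on large frames, so it is a trade-off between safety and speed.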

Adding now a new column 'total_website_interaction'

In [None]:
df_retail_pivot_wo_unk['total_website_interaction'] = df_retail_pivot_wo_unk['Add to cart'] + df_retail_pivot_wo_unk['Order'] + df_retail_pivot_wo_unk['Product Page View']
df_retail_pivot_wo_unk.head(3)

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,product_name_x,timestamp_utc_x,sales_x,brand_x,quantity_x,product_name_y,timestamp_utc_y,sales_y,brand_y,quantity_y,total_website_interaction
0,re0007V8sqIHsZnbvC,0.0,0.0,2.0,SD Fel Adt PerWgt Ckn 15lb bg,2024-01-29 23:46:22,0.0,Science Diet,0.0,SD Fel Adt PerWgt Ckn 15lb bg,2024-01-29 23:46:22,0.0,Science Diet,0.0,2.0
1,re000JYhnKbTkPqMB4,1.0,0.0,2.0,SD Ca Adt PerWgt Ckn 25lb bg,2024-01-08 06:10:19,0.0,Science Diet,0.0,SD Ca Adt PerWgt Ckn 25lb bg,2024-01-08 06:10:19,0.0,Science Diet,0.0,3.0
2,re000fIO9QXTWYjOfn,8.0,8.0,7.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,2024-01-24 14:44:47,284.52,Science Diet,8.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,2024-01-24 14:44:47,284.52,Science Diet,8.0,23.0


Let's now add the customers' socio-demographic data.

In [None]:
df_retail_socio_wo_unk = df_retail_pivot_wo_unk.merge(df_socio_demo, how='left', on='customer_id')
df_retail_socio_wo_unk.head(3)

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,product_name_x,timestamp_utc_x,sales_x,brand_x,quantity_x,product_name_y,timestamp_utc_y,sales_y,brand_y,quantity_y,total_website_interaction,breed,age,income
0,re0007V8sqIHsZnbvC,0.0,0.0,2.0,SD Fel Adt PerWgt Ckn 15lb bg,2024-01-29 23:46:22,0.0,Science Diet,0.0,SD Fel Adt PerWgt Ckn 15lb bg,2024-01-29 23:46:22,0.0,Science Diet,0.0,2.0,Purebred,[25-35[,[120-200K$[
1,re000JYhnKbTkPqMB4,1.0,0.0,2.0,SD Ca Adt PerWgt Ckn 25lb bg,2024-01-08 06:10:19,0.0,Science Diet,0.0,SD Ca Adt PerWgt Ckn 25lb bg,2024-01-08 06:10:19,0.0,Science Diet,0.0,3.0,Mixed-breed,[18-25[,[120-200K$[
2,re000fIO9QXTWYjOfn,8.0,8.0,7.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,2024-01-24 14:44:47,284.52,Science Diet,8.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,2024-01-24 14:44:47,284.52,Science Diet,8.0,23.0,Purebred,[45-55[,[40-80K$[


Now we have a retail dataset containing additional information on the customers.

Let's now try to add the mapping data in order to get the different ids related to a specific customer.

In [None]:
#df_retail_socio_map_wo_unk = df_retail_socio_wo_unk.merge(mapping_transac_publisher_tv, how='left', on='customer_id')
#df_retail_socio_map_wo_unk

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,total_website_interaction,breed,age,income,dsp_id,device_id
0,re0007V8sqIHsZnbvC,0.0,0.0,2.0,2.0,Purebred,[25-35[,[120-200K$[,dspt0ZXO5jvJTrabhKeLXZLiR2PbyQ,ctvkPZxmFRUl
1,re000JYhnKbTkPqMB4,1.0,0.0,2.0,3.0,Mixed-breed,[18-25[,[120-200K$[,dsp6V6xOiPqoLgsv5GteFjwomGj2OE,ctv9vXQKdLuu
2,re000fIO9QXTWYjOfn,8.0,8.0,7.0,23.0,Purebred,[45-55[,[40-80K$[,dspIYJLpU5Ex3KiyA1K3wn7VNexZWB,ctvQEmtlTLvO
3,re000kbtVVzPwZcEr4,0.0,0.0,19.0,19.0,Purebred,[65+[,[200K$+[,dspJHqHlhc1SVwiHlbqjWofC0ReNwz,ctvE4b5vBWGQ
4,re000pHbVOysCXRHgt,5.0,5.0,7.0,17.0,Purebred,[18-25[,[120-200K$[,dspT2j7lTjKShSz4PDIlj3FcfXwIiS,ctvcfz2eP4fV
...,...,...,...,...,...,...,...,...,...,...
1354578,rezzz8a320jhOvmL3A,0.0,0.0,2.0,2.0,Mixed-breed,[35-45[,[80-120K$[,dspcRI6KFVdRU2K9XgArgerMN3NhKB,ctvNHXF7ePVj
1354579,rezzzYRiwreLF23ot3,0.0,0.0,1.0,1.0,Mixed-breed,[18-25[,[0-40K$[,unknown,ctvJGH8s82Fv
1354580,rezzzZvkIaiWNQ1AmV,0.0,0.0,1.0,1.0,Purebred,[25-35[,[80-120K$[,dspgVESJvONVmEFZPusBo1IvZ3d9pU,ctv7L6JydzSE
1354581,rezzzipns16pTCb4OS,2.0,2.0,1.0,5.0,Mixed-breed,[35-45[,[80-120K$[,dspuCywFbMu4BIENswJXXXcUOZ0luV,ctvoPF1pHlS7


We now have a DataFrame containing both information on customers, and the different IDs associated to that particular customer.

Moving on now to the Programmatic Publisher data. However, before moving on, we need to reduce the size of the DataFrame to save computing power. Many customers reached by Programmatic Publisher ads are not found in the retail DataFrame. This can be explained by the fact that not everyone who sees an ad goes to the retailer website.

Before continuing to merge, let's see how many Programmatic Publisher transactions belong to a customer_id present in the DataFrame built above (df_retail + df_socio_demo, without unknown).

To do that, we need to merge the mapping and Programmatic DataFrames in order to obtain the customer_id linked to each Programmatic transaction. Then we will count how many of these customer_id values also appear in the retail data.

In [None]:
customer_retail_list = df_retail_socio_wo_unk['customer_id'].to_list()
customer_map_prog_list = mapping_transac_publisher_tv[mapping_transac_publisher_tv['dsp_id']!='unknown'].merge(programmatic_publisher, how='left', on='dsp_id')['customer_id'].to_list()

In [None]:
len(set(customer_retail_list))

1354583

In [None]:
len(set(customer_map_prog_list))

6973891

In [None]:
common = set(customer_retail_list) & set(customer_map_prog_list)
len(common)

1207350

In [None]:
all_transac = len(customer_map_prog_list)
all_transac

17599774

In [None]:
nb_transac_customer = sum(pd.Series(customer_map_prog_list).isin(common))
nb_transac_customer

4178232

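A few statistics on what has been found. Note that the left merge from the mapping table also keeps mapping rows that have no matching Programmatic record, so these figures are approximate.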

In [None]:
print(f'Proportion of customers linked to a Programmatic dsp_id who never visited the retailer website: {round((1-len(common)/len(set(customer_map_prog_list)))*100, 2)}%')
print(f'Proportion of Programmatic rows linked to customers who never visited the retailer website: {round((1-nb_transac_customer/len(customer_map_prog_list))*100, 2)}%')

Proportion of customers linked to a Programmatic dsp_id who never visited the retailer website: 82.69%
Proportion of Programmatic rows linked to customers who never visited the retailer website: 76.26%
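
Because the merge used here is a left join starting from the mapping table, mapping rows without any Programmatic record are also kept in the result. The `indicator` argument of `DataFrame.merge` makes it possible to separate them. A sketch on invented frames, assuming the same `dsp_id` key:

```python
import pandas as pd

# Toy stand-ins for the mapping and Programmatic DataFrames (invented values).
toy_mapping = pd.DataFrame({
    "customer_id": ["reA", "reB", "reC"],
    "dsp_id": ["dsp1", "dsp2", "dsp3"],
})
toy_ads = pd.DataFrame({
    "dsp_id": ["dsp1", "dsp1", "dsp2"],
    "cost_milli_cent": [100.0, 150.0, 200.0],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
joined = toy_mapping.merge(toy_ads, how="left", on="dsp_id", indicator=True)

# Keep only rows that correspond to an actual ad.
ads_only = joined[joined["_merge"] == "both"]
n_rows_left_join = len(joined)
n_ads = len(ads_only)
print(f"Rows after left join: {n_rows_left_join}, rows with an ad: {n_ads}")
```

Counting only the rows flagged 'both' (or using `how='inner'`) would restrict the statistics to customers who were actually shown an ad.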


Let's now add the Programmatic Publisher data to our customer DataFrame.

In [None]:
programmatic_publisher.head(3)

Unnamed: 0,dsp_id,timestamp_utc,campaign_name,device_type,cost_milli_cent
0,dsp9tnGII5BeXbn6LUSFZPcKGCyI0F,2024-02-06 04:10:41,Contextual,Phone,283.496
1,dsp1hXcI9Q6TZYzLEmeTkxzhjqD6HJ,2024-02-26 23:49:23,Retargeting,PC,1884.537
2,dspcd3UcXUcUk0PEo2hb8CEH3WVlFE,2024-06-16 20:55:27,Contextual,TV,601.93


In [None]:
programmatic_w_cust_id = mapping_transac_publisher_tv[mapping_transac_publisher_tv['dsp_id']!='unknown'].merge(programmatic_publisher, how='left', on='dsp_id')

In [None]:
programmatic_details_campain = programmatic_w_cust_id[programmatic_w_cust_id['customer_id']!='unknown'].groupby(['customer_id', 'campaign_name'])['campaign_name'].count().reset_index(name='campain_count')
programatic_campain = programmatic_details_campain.pivot_table(index='customer_id', columns='campaign_name', values='campain_count', fill_value=0).reset_index()
programatic_campain.head(3)

campaign_name,customer_id,Contextual,Retargeting
0,re0003BIjfgvMOXmfh,1.0,0.0
1,re000A5ftS1crvO4vW,0.0,1.0
2,re000BtJhv7zeg5jAh,1.0,0.0


In [None]:
programmatic_details_device = programmatic_w_cust_id[programmatic_w_cust_id['customer_id']!='unknown'].groupby(['customer_id', 'device_type'])['device_type'].count().reset_index(name='device_count')
programatic_device = programmatic_details_device.pivot_table(index='customer_id', columns='device_type', values='device_count', fill_value=0).reset_index()
programatic_device.head(3)

device_type,customer_id,PC,Phone,Robot,TV,Unknown
0,re0003BIjfgvMOXmfh,1.0,0.0,0.0,0.0,0.0
1,re000A5ftS1crvO4vW,1.0,0.0,0.0,0.0,0.0
2,re000BtJhv7zeg5jAh,0.0,1.0,0.0,0.0,0.0


In [None]:
df_program_extra = programmatic_w_cust_id[programmatic_w_cust_id['customer_id']!='unknown'].groupby('customer_id').agg({
    'cost_milli_cent': 'sum',
    'timestamp_utc': lambda x: min(x)
}).reset_index()

In [None]:
full_programmatic_detail = programatic_campain.merge(programatic_device, how='left', on='customer_id').merge(df_program_extra, how='left', on='customer_id')

In [None]:
full_programmatic_detail

Unnamed: 0,customer_id,Contextual,Retargeting,PC,Phone,Robot,TV,Unknown,cost_milli_cent,timestamp_utc
0,re0003BIjfgvMOXmfh,1.0,0.0,1.0,0.0,0.0,0.0,0.0,146.750,2024-01-03 18:58:49
1,re000A5ftS1crvO4vW,0.0,1.0,1.0,0.0,0.0,0.0,0.0,2728.290,2024-01-18 22:12:06
2,re000BtJhv7zeg5jAh,1.0,0.0,0.0,1.0,0.0,0.0,0.0,147.071,2024-01-05 20:18:40
3,re000ChB4g6FQco48O,1.0,0.0,0.0,0.0,0.0,1.0,0.0,153.500,2024-02-12 13:53:58
4,re000JLFUVT4zHZeg5,42.0,1.0,29.0,14.0,0.0,0.0,0.0,17375.798,2024-01-03 01:20:57
...,...,...,...,...,...,...,...,...,...,...
4999655,rezzziPVXXLJIMYeXI,1.0,0.0,0.0,1.0,0.0,0.0,0.0,334.789,2024-01-16 09:34:26
4999656,rezzzjryrM7TbEHdws,1.0,0.0,0.0,1.0,0.0,0.0,0.0,231.196,2024-04-01 15:23:05
4999657,rezzzk3fDGAIEZt4Iz,1.0,0.0,1.0,0.0,0.0,0.0,0.0,274.852,2024-04-14 11:07:35
4999658,rezzzqXlE0I0qrb6hI,3.0,0.0,0.0,3.0,0.0,0.0,0.0,467.941,2024-01-08 21:38:32


In [None]:
customer_df_wo_unk = df_retail_socio_wo_unk.merge(full_programmatic_detail, how='left', on='customer_id')

In [None]:
customer_df_wo_unk

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,product_name_x,timestamp_utc_x,sales_x,brand_x,quantity_x,product_name_y,...,income,Contextual,Retargeting,PC,Phone,Robot,TV,Unknown,cost_milli_cent,timestamp_utc
0,re0007V8sqIHsZnbvC,0.0,0.0,2.0,SD Fel Adt PerWgt Ckn 15lb bg,2024-01-29 23:46:22,0.00,Science Diet,0.0,SD Fel Adt PerWgt Ckn 15lb bg,...,[120-200K$[,,,,,,,,,NaT
1,re000JYhnKbTkPqMB4,1.0,0.0,2.0,SD Ca Adt PerWgt Ckn 25lb bg,2024-01-08 06:10:19,0.00,Science Diet,0.0,SD Ca Adt PerWgt Ckn 25lb bg,...,[120-200K$[,,,,,,,,,NaT
2,re000fIO9QXTWYjOfn,8.0,8.0,7.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,2024-01-24 14:44:47,284.52,Science Diet,8.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,...,[40-80K$[,0.0,1.0,0.0,0.0,0.0,1.0,0.0,3595.156,2024-01-24 14:25:56
3,re000kbtVVzPwZcEr4,0.0,0.0,19.0,SD Ca Adt SenSt&Sk Ckn 30lb bg,2024-03-06 23:57:04,0.00,Science Diet,0.0,SD Ca Adt SenSt&Sk Ckn 30lb bg,...,[200K$+[,,,,,,,,,NaT
4,re000pHbVOysCXRHgt,5.0,5.0,7.0,SD Ca Adt Lt Ckn 30lb bg,2024-01-01 18:12:21,394.94,Science Diet,5.0,SD Ca Adt Lt Ckn 30lb bg,...,[120-200K$[,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1863.549,2024-01-01 18:30:47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1354578,rezzz8a320jhOvmL3A,0.0,0.0,2.0,SD Ca A6+ LB Ckn 15lb bg,2024-04-26 23:11:28,0.00,Science Diet,0.0,SD Ca A6+ LB Ckn 15lb bg,...,[80-120K$[,,,,,,,,,NaT
1354579,rezzzYRiwreLF23ot3,0.0,0.0,1.0,SD Ca Adt PerWgt Ckn SB 4lb bg,2024-06-04 23:56:40,0.00,Science Diet,0.0,SD Ca Adt PerWgt Ckn SB 4lb bg,...,[0-40K$[,,,,,,,,,NaT
1354580,rezzzZvkIaiWNQ1AmV,0.0,0.0,1.0,PD m/d Feline 8.5lb bg,2024-04-16 22:01:01,0.00,Prescription Diet,0.0,PD m/d Feline 8.5lb bg,...,[80-120K$[,,,,,,,,,NaT
1354581,rezzzipns16pTCb4OS,2.0,2.0,1.0,SD Ca Adt Lt SB Ckn 5lb bg,2024-04-18 01:32:56,66.98,Science Diet,2.0,SD Ca Adt Lt SB Ckn 5lb bg,...,[80-120K$[,,,,,,,,,NaT


In [None]:
customer_df_wo_unk.sort_values(by=['Order'], ascending=False).head(10)

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,total_website_interaction,breed,age,income,dsp_id,device_id,count_cost_ads,sum_ads
1217141,reth7mdoDJCbkDalaK,2051.0,2013.0,2126.0,6190.0,Mixed-breed,[65+[,[0-40K$[,dspVULqlhh0KIIPnhQMXDjrhjpAaFp,ctvA4p8jruSD,24.0,10210.488
426942,reJZhNiD0cXQyewGn3,1400.0,1259.0,2999.0,5658.0,Purebred,[25-35[,[0-40K$[,dspRSZkjQJ3ny5pOzOQXHsKpPBTzvX,ctvUgOASjPTK,3.0,8942.652
1182071,res6JFoMLy8731GMyL,210.0,205.0,209.0,624.0,Purebred,[45-55[,[0-40K$[,dspGWFS3mY2kEu6qM14fHCdqVpy9Vz,ctvpiV7WC43G,4.0,12447.093
796484,reaVIUXpo0kXvGM4GS,142.0,127.0,273.0,542.0,Mixed-breed,[45-55[,[120-200K$[,dspeeV8CmppSGVxWjOKFQh8VfYnhx6,unknown,1.0,0.0
618456,reSMoh2wR0jwjGGuTH,126.0,114.0,251.0,491.0,Purebred,[45-55[,[200K$+[,dspd0wKgDoXvcmuBDzU9aE5KtPuk7M,unknown,2.0,6935.539
935151,regozI7iXKLZVGK1pq,122.0,111.0,217.0,450.0,Mixed-breed,[35-45[,[120-200K$[,dspvZRpjlRDuryrp1HeYmsvYs3uEAM,ctvkTSDYG0Sh,2.0,7209.972
354100,reGDxoZK7LW7BoeHKK,103.0,101.0,113.0,317.0,Purebred,[55-65[,[200K$+[,dspoOqJZfv6qnFKOsT4bmlQCnwdNqw,ctvZ2DDkVZUq,8.0,12982.779
957977,rehrVgkm5z86NOsRKI,100.0,99.0,120.0,319.0,Mixed-breed,[45-55[,[0-40K$[,dspII7OxXqXpXJAlLQFpYyAuERuGpi,ctvN6CWz0H3m,36.0,18199.545
1081687,renVyYSbD5UZ2v9M0x,86.0,84.0,94.0,264.0,Purebred,[55-65[,[80-120K$[,dspO22gslMUgmH8QrHTFa5YsJ6RzcE,ctv4MbIYmasY,1.0,0.0
227022,reANkJdvATmvPQn1jO,106.0,84.0,16.0,206.0,Purebred,[55-65[,[40-80K$[,dspW2HZADXXUwBE4IGduzn3wNrkx6r,ctvcEqo8engz,1.0,0.0


Let's now add the last dataframe: the TV publisher log, joined to customers through the device mapping.

In [None]:
mapping_transac_publisher_tv.head(2)

Unnamed: 0,customer_id,dsp_id,device_id
0,reFs5GI87lXJkJSi9r,dsp9tnGII5BeXbn6LUSFZPcKGCyI0F,ctv81YlbBXho
1,reTjziox2cSrxVq70Y,dspCSu1n1mhys37Na5OXMaKaE8P8CS,ctvHmkxqZXBg


In [None]:
df_tv_publisher.head(2)

Unnamed: 0,device_id,timestamp_utc,cost_milli_cent
0,ctv81YlbBXho,2024-04-23 21:09:46,2325.51
1,ctvWr7bOO5Je,2024-04-19 18:31:30,2325.51


In [None]:
device_map = mapping_transac_publisher_tv[mapping_transac_publisher_tv['device_id']!='unknown'].merge(df_tv_publisher, on='device_id', how='left')

In [None]:
device_map

Unnamed: 0,customer_id,dsp_id,device_id,timestamp_utc,cost_milli_cent
0,reFs5GI87lXJkJSi9r,dsp9tnGII5BeXbn6LUSFZPcKGCyI0F,ctv81YlbBXho,2024-04-23 21:09:46,2325.51
1,reTjziox2cSrxVq70Y,dspCSu1n1mhys37Na5OXMaKaE8P8CS,ctvHmkxqZXBg,NaT,
2,reOrpt9vhSwhbPVtni,dsp1hXcI9Q6TZYzLEmeTkxzhjqD6HJ,ctvwp5n34myx,NaT,
3,reutQ3jiBX9Li4Ggqi,dspcd3UcXUcUk0PEo2hb8CEH3WVlFE,ctvdkYC70D2x,NaT,
4,reH7UgH29AreRh8wWy,dspSnORtuQRLSkZKp9nbSIpbJBQLP1,ctvTgbqnhRd8,NaT,
...,...,...,...,...,...
11268332,rers53Z80wb3d34mtx,dspTSnawu2ES3d8m7K1rN14CcqoQXc,ctvJ45lCNmAW,NaT,
11268333,reFfPzx1jCmscGp8dX,dspbV1X6ia7x8IMxQQgHqasXsZvSev,ctvF8GvF2QtU,NaT,
11268334,rew49Dee0MRk9OpGd5,dsp4PJIYh8QWwR5FCxYcxPJzUOCuQo,ctvd65OI6HsK,NaT,
11268335,re4poel4L00ESN8CeI,dspeuhlwK7LBRgCX3sMsSMFIxLHkeD,ctv2WPS9SbdL,NaT,
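
Most rows above carry `NaT`, meaning the mapped device never shows up in the TV publisher log. One way to quantify that is to rerun the left merge with `indicator=True`. The sketch below uses two tiny made-up frames (the ids and costs are placeholders, not rows from the real files), so treat it as an illustration rather than the notebook's actual check.

```python
import pandas as pd

# Toy stand-ins for mapping_transac_publisher_tv and df_tv_publisher (values are made up)
mapping = pd.DataFrame({
    "customer_id": ["reA", "reB", "reC"],
    "device_id": ["ctv1", "ctv2", "unknown"],
})
tv_log = pd.DataFrame({
    "device_id": ["ctv1", "ctv1"],
    "cost_milli_cent": [2325.51, 2325.51],
})

# Same filter and left merge as in the notebook, plus the _merge indicator column
known = mapping[mapping["device_id"] != "unknown"]
merged = known.merge(tv_log, on="device_id", how="left", indicator=True)

# 'both' = device seen in the TV log, 'left_only' = mapped but never served a TV ad
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

On the real `device_map`, the `left_only` share would tell how many customer/device pairs have no TV exposure at all.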


In [None]:
device_map_count = device_map.groupby('customer_id').agg({
    'device_id': 'count',
    'cost_milli_cent': 'sum',
    # 'min' skips NaT; Python's min() can return NaT when a group mixes NaT and real timestamps
    'timestamp_utc': 'min'
}).reset_index()

In [None]:
device_map_count = device_map_count.rename(columns={'timestamp_utc':'timestamp_device'})
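
To make the aggregated columns easier to read, here is a small sketch of the same group-by on made-up rows, written with named aggregation. The output names `count_device` and `sum_cost_tv` are borrowed from a later output in this notebook; the data itself is hypothetical. It shows why a customer whose device never appears in the TV log still ends up with a device count of 1, a cost of 0.0 and a `NaT` timestamp.

```python
import pandas as pd

# Toy device_map: reA's device never appears in the TV log, reB's appears twice
device_map_toy = pd.DataFrame({
    "customer_id": ["reA", "reB", "reB"],
    "device_id": ["ctv1", "ctv2", "ctv2"],
    "cost_milli_cent": [float("nan"), 2325.51, 2325.51],
    "timestamp_utc": pd.to_datetime([None, "2024-06-27 10:00:00", "2024-06-26 23:50:31"]),
})

summary = (
    device_map_toy.groupby("customer_id")
    .agg(
        count_device=("device_id", "count"),        # non-null device ids, so unmatched customers still get 1
        sum_cost_tv=("cost_milli_cent", "sum"),     # NaN is skipped, so an all-NaN group sums to 0.0
        timestamp_device=("timestamp_utc", "min"),  # NaT is skipped; an all-NaT group stays NaT
    )
    .reset_index()
)
print(summary)
```

So `count_device` counts mapping rows rather than ads served; counting non-null `cost_milli_cent` values instead would give the number of TV ads.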

In [None]:
full_df_wo_unk = customer_df_wo_unk.merge(device_map_count, how='left', on='customer_id')

In [None]:
full_df_wo_unk

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,product_name_x,timestamp_utc_x,sales_x,brand_x,quantity_x,product_name_y,...,PC,Phone,Robot,TV,Unknown,cost_milli_cent_x,timestamp_utc,device_id,cost_milli_cent_y,timestamp_device
0,re0007V8sqIHsZnbvC,0.0,0.0,2.0,SD Fel Adt PerWgt Ckn 15lb bg,2024-01-29 23:46:22,0.00,Science Diet,0.0,SD Fel Adt PerWgt Ckn 15lb bg,...,,,,,,,NaT,1.0,0.00,NaT
1,re000JYhnKbTkPqMB4,1.0,0.0,2.0,SD Ca Adt PerWgt Ckn 25lb bg,2024-01-08 06:10:19,0.00,Science Diet,0.0,SD Ca Adt PerWgt Ckn 25lb bg,...,,,,,,,NaT,1.0,0.00,NaT
2,re000fIO9QXTWYjOfn,8.0,8.0,7.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,2024-01-24 14:44:47,284.52,Science Diet,8.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,...,0.0,0.0,0.0,1.0,0.0,3595.156,2024-01-24 14:25:56,1.0,0.00,NaT
3,re000kbtVVzPwZcEr4,0.0,0.0,19.0,SD Ca Adt SenSt&Sk Ckn 30lb bg,2024-03-06 23:57:04,0.00,Science Diet,0.0,SD Ca Adt SenSt&Sk Ckn 30lb bg,...,,,,,,,NaT,1.0,0.00,NaT
4,re000pHbVOysCXRHgt,5.0,5.0,7.0,SD Ca Adt Lt Ckn 30lb bg,2024-01-01 18:12:21,394.94,Science Diet,5.0,SD Ca Adt Lt Ckn 30lb bg,...,1.0,0.0,0.0,0.0,0.0,1863.549,2024-01-01 18:30:47,1.0,0.00,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1354578,rezzz8a320jhOvmL3A,0.0,0.0,2.0,SD Ca A6+ LB Ckn 15lb bg,2024-04-26 23:11:28,0.00,Science Diet,0.0,SD Ca A6+ LB Ckn 15lb bg,...,,,,,,,NaT,1.0,0.00,NaT
1354579,rezzzYRiwreLF23ot3,0.0,0.0,1.0,SD Ca Adt PerWgt Ckn SB 4lb bg,2024-06-04 23:56:40,0.00,Science Diet,0.0,SD Ca Adt PerWgt Ckn SB 4lb bg,...,,,,,,,NaT,1.0,0.00,NaT
1354580,rezzzZvkIaiWNQ1AmV,0.0,0.0,1.0,PD m/d Feline 8.5lb bg,2024-04-16 22:01:01,0.00,Prescription Diet,0.0,PD m/d Feline 8.5lb bg,...,,,,,,,NaT,2.0,4651.02,2024-06-26 23:50:31
1354581,rezzzipns16pTCb4OS,2.0,2.0,1.0,SD Ca Adt Lt SB Ckn 5lb bg,2024-04-18 01:32:56,66.98,Science Diet,2.0,SD Ca Adt Lt SB Ckn 5lb bg,...,,,,,,,NaT,1.0,0.00,NaT


In [None]:
full_df_wo_unk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1354583 entries, 0 to 1354582
Data columns (total 30 columns):
 #   Column                     Non-Null Count    Dtype         
---  ------                     --------------    -----         
 0   customer_id                1354583 non-null  object        
 1   Add to cart                1354583 non-null  float64       
 2   Order                      1354583 non-null  float64       
 3   Product Page View          1354583 non-null  float64       
 4   product_name_x             1330504 non-null  object        
 5   timestamp_utc_x            1354583 non-null  datetime64[ns]
 6   sales_x                    1354583 non-null  float64       
 7   brand_x                    1330504 non-null  object        
 8   quantity_x                 1354583 non-null  float64       
 9   product_name_y             1330504 non-null  object        
 10  timestamp_utc_y            1354583 non-null  datetime64[ns]
 11  sales_y                    1354583 no

In [None]:
full_df_wo_unk[full_df_wo_unk['Order']==0]

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,total_website_interaction,breed,age,income,dsp_id,device_id,count_cost_ads,sum_ads,count_device,sum_cost_tv
0,re0007V8sqIHsZnbvC,0.0,0.0,2.0,2.0,Purebred,[25-35[,[120-200K$[,dspt0ZXO5jvJTrabhKeLXZLiR2PbyQ,ctvkPZxmFRUl,1.0,0.000,1.0,0.00
1,re000JYhnKbTkPqMB4,1.0,0.0,2.0,3.0,Mixed-breed,[18-25[,[120-200K$[,dsp6V6xOiPqoLgsv5GteFjwomGj2OE,ctv9vXQKdLuu,1.0,0.000,1.0,0.00
3,re000kbtVVzPwZcEr4,0.0,0.0,19.0,19.0,Purebred,[65+[,[200K$+[,dspJHqHlhc1SVwiHlbqjWofC0ReNwz,ctvE4b5vBWGQ,1.0,0.000,1.0,0.00
5,re001cHwy3Mjc3HuLR,0.0,0.0,1.0,1.0,Purebred,[35-45[,[120-200K$[,dspPJTADQ4ZZngPgkPruB8lysq3ehV,ctvHk03oqvHG,1.0,0.000,1.0,0.00
6,re001dfhF1iIFRre85,0.0,0.0,1.0,1.0,Purebred,[25-35[,[120-200K$[,dspcm0EV62t2PBW7cg4x8f4CfkFN4c,ctvTBTYd8qQa,14.0,6868.310,1.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1354576,rezzz7qN964P8xoD2D,1.0,0.0,2.0,3.0,Purebred,[45-55[,[80-120K$[,unknown,ctvWUCCF2D9K,,,1.0,0.00
1354578,rezzz8a320jhOvmL3A,0.0,0.0,2.0,2.0,Mixed-breed,[35-45[,[80-120K$[,dspcRI6KFVdRU2K9XgArgerMN3NhKB,ctvNHXF7ePVj,1.0,0.000,1.0,0.00
1354579,rezzzYRiwreLF23ot3,0.0,0.0,1.0,1.0,Mixed-breed,[18-25[,[0-40K$[,unknown,ctvJGH8s82Fv,,,1.0,0.00
1354580,rezzzZvkIaiWNQ1AmV,0.0,0.0,1.0,1.0,Purebred,[25-35[,[80-120K$[,dspgVESJvONVmEFZPusBo1IvZ3d9pU,ctv7L6JydzSE,1.0,0.000,2.0,4651.02


In [None]:
df_retail_pivot_wo_unk

event_name,customer_id,Add to cart,Order,Product Page View,total_website_interaction
0,re0007V8sqIHsZnbvC,0.0,0.0,2.0,2.0
1,re000JYhnKbTkPqMB4,1.0,0.0,2.0,3.0
2,re000fIO9QXTWYjOfn,8.0,8.0,7.0,23.0
3,re000kbtVVzPwZcEr4,0.0,0.0,19.0,19.0
4,re000pHbVOysCXRHgt,5.0,5.0,7.0,17.0
...,...,...,...,...,...
1354578,rezzz8a320jhOvmL3A,0.0,0.0,2.0,2.0
1354579,rezzzYRiwreLF23ot3,0.0,0.0,1.0,1.0
1354580,rezzzZvkIaiWNQ1AmV,0.0,0.0,1.0,1.0
1354581,rezzzipns16pTCb4OS,2.0,2.0,1.0,5.0


In [None]:
1354583

1354583

In [None]:
missing_in_series2 = set(full_df_wo_unk['customer_id'].to_list()) - set(df_retail_pivot_wo_unk['customer_id'].to_list())

In [None]:
missing_in_series2

set()
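
The empty set only shows that no customer went missing; it does not show whether a merge duplicated customers. A possible complementary check, sketched on made-up frames, is to let pandas validate the join cardinality with `validate="one_to_one"`. The frames and values below are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"customer_id": ["reA", "reB"], "Order": [0.0, 2.0]})
right_ok = pd.DataFrame({"customer_id": ["reA", "reB"], "count_device": [1, 2]})
right_dup = pd.DataFrame({"customer_id": ["reA", "reA"], "count_device": [1, 2]})

# Passes: one row per customer on both sides
merged = left.merge(right_ok, how="left", on="customer_id", validate="one_to_one")

# Raises MergeError instead of silently duplicating customers
try:
    left.merge(right_dup, how="left", on="customer_id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged) == len(left), merged["customer_id"].is_unique, duplicate_detected)
```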

In [None]:
programmatic_publisher.groupby('dsp_id')['dsp_id'].count().reset_index(name='dsp_count')

Unnamed: 0,dsp_id,dsp_count
0,dsp0000zgOGRGzNXlFGuswTD8LV6xA,2
1,dsp00040djIAqe73TjNJ1CFKw0K4eu,1
2,dsp0006CxEoblczDFJSqDw5BiCtNi6,1
3,dsp0006ipUFLA8MoTsAx83XZ11T5hl,1
4,dsp0008hWrKfZuquMQaJQrp8FQyKK6,2
...,...,...
5101020,dspzzzwxu1MiM9dpopZA1WjcBXUbPG,1
5101021,dspzzzyXFJ9oiWmTebxJ7AJHn7uLqQ,1
5101022,dspzzzyoBJJOHzShQr3B5XIUn4D4kh,1
5101023,dspzzzzkzSHeWWtd98oOyysGMctclX,1


In [None]:
#df_pivot = df_retail.pivot_table(
#    index='customer_id',
#    columns='event_name',
#    values='timestamp_utc',  # Using timestamp as placeholder for count
#    aggfunc='count',  # Count occurrences
#    fill_value=0
#).reset_index()

In [None]:
#df_extra = df_retail[df_retail['customer_id']!='unknown'].groupby('customer_id').agg({
#    'product_name': lambda x: set(list(x)),  # List of unique products
#    'timestamp_utc': lambda x: min(x),
#    'sales': 'sum',  # Sum sales
#    'quantity': 'sum'  # Sum quantity
#}).reset_index()

In [None]:
#df_extra

Unnamed: 0,customer_id,product_name,timestamp_utc,sales,quantity
0,re0007V8sqIHsZnbvC,"[SD Fel Adt PerWgt Ckn 15lb bg, SD Fel Adt Per...",2024-01-29 23:46:22,0.00,0.0
1,re000JYhnKbTkPqMB4,"[SD Ca Adt PerWgt Ckn 25lb bg, SD Ca Adt PerWg...",2024-01-08 06:10:19,0.00,0.0
2,re000fIO9QXTWYjOfn,"[PD Ca i/d Ckn&VgStew 12x12.5oz cs, SD Ca A7+ ...",2024-01-24 14:44:47,284.52,8.0
3,re000kbtVVzPwZcEr4,"[nan, nan, SD CA ADT PERWGT + Joint Support 3....",2024-03-06 23:57:04,0.00,0.0
4,re000pHbVOysCXRHgt,"[SD Ca A7+ NoCWS Ckn 4lb bg, SD Ca Adt Lt Ckn ...",2024-01-01 18:12:21,394.94,5.0
...,...,...,...,...,...
1354578,rezzz8a320jhOvmL3A,"[SD Ca A6+ LB Ckn 15lb bg, SD Ca Adt LB LM&BR ...",2024-04-26 23:11:28,0.00,0.0
1354579,rezzzYRiwreLF23ot3,[SD Ca Adt PerWgt Ckn SB 4lb bg],2024-06-04 23:56:40,0.00,0.0
1354580,rezzzZvkIaiWNQ1AmV,[PD m/d Feline 8.5lb bg],2024-04-16 22:01:01,0.00,0.0
1354581,rezzzipns16pTCb4OS,"[SD Ca Adt Lt SB Ckn 15lb bg, SD Ca Adt Lt SB ...",2024-04-18 01:32:56,66.98,2.0


## Testing the clean df to make sure it hasn't missed anything



In [21]:
df_clean = pd.read_csv('clean_hills_data_without_unknown.csv')
for col in ['first_web_visit_timestamp', 'first_ads_timestamp', 'first_ads_tv_timestamp']:
    df_clean[col] = pd.to_datetime(df_clean[col])

# The clean file should hold exactly one row per known customer
n_known_customers = df_retail.loc[df_retail['customer_id'] != 'unknown', 'customer_id'].nunique()
assert len(df_clean) == n_known_customers

df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1354583 entries, 0 to 1354582
Data columns (total 25 columns):
 #   Column                     Non-Null Count    Dtype         
---  ------                     --------------    -----         
 0   customer_id                1354583 non-null  object        
 1   Add to cart                1354583 non-null  float64       
 2   Order                      1354583 non-null  float64       
 3   Product Page View          1354583 non-null  float64       
 4   product_name               1330504 non-null  object        
 5   first_web_visit_timestamp  1354583 non-null  datetime64[ns]
 6   sales                      1354583 non-null  float64       
 7   brand                      1330504 non-null  object        
 8   quantity                   1354583 non-null  float64       
 9   total_website_interaction  1354583 non-null  float64       
 10  breed                      1354583 non-null  object        
 11  age                        1354583 no

In [39]:
df_clean[(df_clean['Order']!=0) & (df_clean['quantity']==0)]

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,product_name,first_web_visit_timestamp,sales,brand,quantity,total_website_interaction,...,PC,Phone,Robot,TV,Unknown,total_ads_cost,first_ads_timestamp,device_id,total_ads_tv_count,first_ads_tv_timestamp
298,re00pMfwsdb6QGKH9V,1.0,1.0,1.0,PD Ca c/d Mul Ckn&VgStew 12x12.5oz cs,2024-01-14 14:04:51,0.0,Prescription Diet,0.0,3.0,...,0.0,1.0,0.0,1.0,0.0,526.360,2024-01-23 07:07:53,,,NaT
437,re01CFgUC4zuXryYyK,7.0,1.0,21.0,PD Fel c/d Mul Strs 4lb bg,2024-01-12 19:22:30,0.0,Prescription Diet,0.0,29.0,...,0.0,1.0,0.0,0.0,0.0,553.500,2024-01-21 00:40:07,1.0,0.00,NaT
833,re02QG55H8pPpdjxcW,4.0,1.0,4.0,PD Ca Meta+Mob 24lb bg,2024-01-26 18:14:26,0.0,Prescription Diet,0.0,9.0,...,3.0,0.0,0.0,8.0,0.0,2335.297,2024-01-28 21:07:50,1.0,0.00,NaT
941,re02oQAsmzkvfEPSPs,1.0,1.0,1.0,PD z/d Ultra Canine 12/13oz cs,2024-02-02 22:08:09,0.0,Prescription Diet,0.0,3.0,...,,,,,,,NaT,1.0,0.00,NaT
1342,re03tqvWeTIrqWR3xa,1.0,1.0,4.0,SD Fel Adt 12x2.8oz Pou VarPk,2024-01-31 17:04:35,0.0,Science Diet,0.0,6.0,...,1.0,0.0,0.0,0.0,0.0,2652.336,2024-01-31 17:32:33,1.0,0.00,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1350502,rezoKzPmY75Sk50qem,1.0,1.0,10.0,PD Ca T 6x11oz cs,2024-01-05 00:23:09,0.0,Prescription Diet,0.0,12.0,...,0.0,2.0,0.0,0.0,0.0,641.190,2024-01-06 09:54:15,1.0,0.00,NaT
1351343,rezqizYWl3hr1a6MLp,1.0,1.0,1.0,SD Ca Adt PerWgt Ckn 25lb bg,2024-02-01 22:03:20,0.0,Science Diet,0.0,3.0,...,,,,,,,NaT,1.0,0.00,NaT
1352472,reztwszsl09IHmMcTM,1.0,1.0,4.0,SD Pup SmPws Ckn 4.5lb bg,2024-01-02 21:01:03,0.0,Science Diet,0.0,6.0,...,0.0,0.0,0.0,2.0,0.0,3133.379,2024-01-02 21:05:58,4.0,9302.04,2024-03-23 03:47:24
1352927,rezvGVCn4d9JJIWDQx,1.0,1.0,4.0,PD Canine t/d SMB 5 lb bg,2024-01-12 20:24:38,0.0,Prescription Diet,0.0,6.0,...,8.0,17.0,0.0,19.0,0.0,14101.890,2024-01-12 21:00:42,1.0,0.00,NaT


In [35]:
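# Assumed intent: look at rows with no brand (the original mask, df_retail['brand'], is not boolean)
test = df_retail[df_retail['brand'].isna()]
test[test['quantity']==0]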


Unnamed: 0,customer_id,timestamp_utc,event_name,brand,product_name,sales,quantity


In [40]:
df_retail[df_retail['customer_id']=='re00pMfwsdb6QGKH9V']

Unnamed: 0,customer_id,timestamp_utc,event_name,brand,product_name,sales,quantity
3397014,re00pMfwsdb6QGKH9V,2024-01-14 14:08:37,Product Page View,Prescription Diet,PD Ca c/d Mul Ckn&VgStew 12x12.5oz cs,,
3397015,re00pMfwsdb6QGKH9V,2024-01-14 14:25:10,Order,Prescription Diet,PD Ca c/d Mul Ckn&VgStew 12x12.5oz cs,0.0,0.0
3397016,re00pMfwsdb6QGKH9V,2024-01-14 14:04:51,Add to cart,Prescription Diet,PD Ca c/d Mul Ckn&VgStew 12x12.5oz cs,,
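
This customer has an `Order` event recorded with `sales` and `quantity` both at 0.0, which explains the rows of `df_clean` where `Order != 0` but `quantity == 0`. A minimal sketch of how such orders could be counted in the raw events, using made-up rows with the same column names:

```python
import pandas as pd

# Hypothetical events: reA placed an order with zero quantity, reB a regular one
events = pd.DataFrame({
    "customer_id": ["reA", "reA", "reB", "reB"],
    "event_name": ["Add to cart", "Order", "Order", "Product Page View"],
    "sales": [None, 0.0, 19.99, None],
    "quantity": [None, 0.0, 1.0, None],
})

orders = events[events["event_name"] == "Order"]
zero_qty_orders = orders[orders["quantity"] == 0]
n_zero = len(zero_qty_orders)
print(n_zero, zero_qty_orders["customer_id"].tolist())
```

Whether these zero-quantity orders should count as purchases is a modelling decision; with `target` derived from `Order`, they are currently treated as buyers.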


In [24]:
df_clean['Retargeting'].tail(10)

Unnamed: 0,Retargeting
1354573,
1354574,
1354575,0.0
1354576,
1354577,1.0
1354578,
1354579,
1354580,
1354581,
1354582,0.0


In [25]:
df_clean['Contextual'].tail(10)

Unnamed: 0,Contextual
1354573,
1354574,
1354575,4.0
1354576,
1354577,0.0
1354578,
1354579,
1354580,
1354581,
1354582,3.0


In [20]:
df_clean.sort_values(by='first_web_visit_timestamp', ascending=False)

Unnamed: 0,customer_id,Add to cart,Order,Product Page View,product_name,first_web_visit_timestamp,sales,brand,quantity,total_website_interaction,...,PC,Phone,Robot,TV,Unknown,total_ads_cost,first_ads_timestamp,device_id,total_ads_tv_count,first_ads_tv_timestamp
232205,reAcp5tTCtJCAoblFO,0.0,0.0,1.0,PD Ca i/d 27.5lb bg,2024-06-30 23:59:34,0.00,Prescription Diet,0.0,1.0,...,,,,,,,,1.0,0.00,
759145,reYnlBwWkvmRBLviCF,0.0,0.0,1.0,SD Ca Adt PerWgt Vg&CknStew 12x12.5oz cs,2024-06-30 23:59:08,0.00,Science Diet,0.0,1.0,...,0.0,13.0,0.0,0.0,0.0,6725.866,2024-05-29 00:21:56,1.0,0.00,
1150413,reqeV59lbpuGAAtqy8,0.0,0.0,1.0,SD Fel Adt Ckn 4lb bg,2024-06-30 23:59:07,0.00,Science Diet,0.0,1.0,...,,,,,,,,1.0,0.00,
681497,reVFCfxtMEsWjoyrqz,0.0,0.0,1.0,SD Ca Adt LM&BR 15.5lb bg,2024-06-30 23:58:48,0.00,Science Diet,0.0,1.0,...,,,,,,,,1.0,0.00,
826313,rebrYlCoUl86DHVQsH,0.0,0.0,1.0,PD Ca i/d Ckn&VgStew 12x12.5oz cs,2024-06-30 23:58:31,0.00,Prescription Diet,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1582.210,2024-01-01 18:09:16,1.0,0.00,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52612,re2Q0ME7s3nJYG7XQm,0.0,0.0,2.0,SD Ca Adt SmPws Ckn 15.5lb bg,2024-01-01 00:00:09,0.00,Science Diet,0.0,2.0,...,0.0,1.0,0.0,0.0,0.0,153.528,2024-05-24 16:34:03,1.0,0.00,
1166038,rerMjVo7KmtBUPrCtA,2.0,2.0,5.0,HI Ca Nat BkdLtBisc Md w/Ckn 12x8oz cs,2024-01-01 00:00:08,14.98,Hills,2.0,9.0,...,,,,,,,,1.0,0.00,
1226947,reu9BCcbpLgp5T1qx3,2.0,1.0,3.0,HI Ca Nat SSav T Bf&Ched 12x8.0oz cs,2024-01-01 00:00:06,7.49,Science Diet,1.0,6.0,...,1.0,0.0,0.0,0.0,0.0,828.242,2024-05-04 05:39:06,8.0,18604.08,2024-04-25 02:40:10
1035587,relPJm9WmseQdlGhD0,4.0,0.0,23.0,SD Ca Adt SenSt&Sk Sm&Min Ckn 4lb bg,2024-01-01 00:00:06,0.00,Science Diet,0.0,27.0,...,4.0,1.0,0.0,0.0,0.0,6544.240,2024-01-10 02:04:37,1.0,0.00,


In [43]:
sum(df_clean['sales'])

61193480.149955

In [47]:
sum(df_retail[df_retail['sales']>0]['sales'])

62427949.179992914

In [49]:
test2 = df_retail[df_retail['customer_id']!='unknown']

In [50]:
sum(test2[test2['sales']>0]['sales'])

61193480.14997958
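
The three totals are consistent: the clean frame matches the raw events once `unknown` customers are excluded, up to floating-point noise, and the remaining gap is presumably the sales attached to the `unknown` customer id. A small sketch of that reconciliation, using the figures printed above:

```python
import math

total_clean = 61193480.149955          # sum of df_clean['sales']
total_raw_all = 62427949.179992914     # positive sales in df_retail, all customers
total_raw_known = 61193480.14997958    # same, excluding customer_id == 'unknown'

# Summation order differs between the two frames, so compare with a tolerance
totals_match = math.isclose(total_clean, total_raw_known, rel_tol=1e-9)
unknown_sales = total_raw_all - total_raw_known
print(totals_match, round(unknown_sales, 2))
```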

## Starting the Prediction Model

### Library imports for the model

In [89]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import make_scorer, f1_score
from sklearn.pipeline import make_pipeline

In [71]:
df_wo_na = df_clean.copy().set_index('customer_id').fillna(0)

In [66]:
# Binary target used below: 1 if the customer placed at least one order, else 0
df_wo_na['target'] = np.where(df_wo_na['Order'] == 0, 0, 1)

In [73]:
df_wo_na.head(3)

customer_id,Add to cart,Order,Product Page View,product_name,first_web_visit_timestamp,sales,brand,quantity,total_website_interaction,breed,...,Phone,Robot,TV,Unknown,total_ads_cost,first_ads_timestamp,device_id,total_ads_tv_count,first_ads_tv_timestamp,target
re0007V8sqIHsZnbvC,0.0,0.0,2.0,SD Fel Adt PerWgt Ckn 15lb bg,2024-01-29 23:46:22,0.0,Science Diet,0.0,2.0,Purebred,...,0.0,0.0,0.0,0.0,0.0,0,1.0,0.0,0,0
re000JYhnKbTkPqMB4,1.0,0.0,2.0,SD Ca Adt PerWgt Ckn 25lb bg,2024-01-08 06:10:19,0.0,Science Diet,0.0,3.0,Mixed-breed,...,0.0,0.0,0.0,0.0,0.0,0,1.0,0.0,0,0
re000fIO9QXTWYjOfn,8.0,8.0,7.0,SD Ca Adt SavStw S&TB Bf&Vg 12x3.5oz cs,2024-01-24 14:44:47,284.52,Science Diet,8.0,23.0,Purebred,...,0.0,0.0,1.0,0.0,3595.156,2024-01-24 14:25:56,1.0,0.0,0,1


In [74]:
df_wo_na.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1354583 entries, re0007V8sqIHsZnbvC to rezzzqXlE0I0qrb6hI
Data columns (total 25 columns):
 #   Column                     Non-Null Count    Dtype         
---  ------                     --------------    -----         
 0   Add to cart                1354583 non-null  float64       
 1   Order                      1354583 non-null  float64       
 2   Product Page View          1354583 non-null  float64       
 3   product_name               1354583 non-null  object        
 4   first_web_visit_timestamp  1354583 non-null  datetime64[ns]
 5   sales                      1354583 non-null  float64       
 6   brand                      1354583 non-null  object        
 7   quantity                   1354583 non-null  float64       
 8   total_website_interaction  1354583 non-null  float64       
 9   breed                      1354583 non-null  object        
 10  age                        1354583 non-null  object        
 11  income        

In [75]:
X = df_wo_na.drop(columns=['Order', 'first_web_visit_timestamp', 'first_ads_timestamp', 'first_ads_tv_timestamp', 'target'])
y = df_wo_na.target

In [78]:
y.value_counts(normalize=True)

target,proportion
0,0.629388
1,0.370612
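
One thing worth checking before training: `sales` and `quantity` are only filled for Order events, so keeping them in `X` may let the model read the target almost directly. The sketch below, on a few illustrative rows, measures how often "feature is non-zero" agrees with `target`. The helper names `agreement` and `leaky` are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Add to cart": [0.0, 1.0, 8.0, 5.0],
    "sales": [0.0, 0.0, 284.52, 394.94],
    "quantity": [0.0, 0.0, 8.0, 5.0],
    "target": [0, 0, 1, 1],
})

# Share of rows where "feature is non-zero" agrees with the target
agreement = {
    col: ((toy[col] > 0).astype(int) == toy["target"]).mean()
    for col in ["Add to cart", "sales", "quantity"]
}
# Features that reproduce the target exactly on this sample
leaky = [col for col, share in agreement.items() if share == 1.0]
print(agreement, leaky)
```

On the real data the agreement will not be exactly 1.0, since some orders have zero quantity, so a threshold close to 1 is probably more useful than strict equality. Features flagged this way are candidates for dropping from `X`.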


In [76]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1354583 entries, re0007V8sqIHsZnbvC to rezzzqXlE0I0qrb6hI
Data columns (total 20 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Add to cart                1354583 non-null  float64
 1   Product Page View          1354583 non-null  float64
 2   product_name               1354583 non-null  object 
 3   sales                      1354583 non-null  float64
 4   brand                      1354583 non-null  object 
 5   quantity                   1354583 non-null  float64
 6   total_website_interaction  1354583 non-null  float64
 7   breed                      1354583 non-null  object 
 8   age                        1354583 non-null  object 
 9   income                     1354583 non-null  object 
 10  Contextual                 1354583 non-null  float64
 11  Retargeting                1354583 non-null  float64
 12  PC                         1354583 non-null  fl

In [100]:
X[['product_name', 'brand', 'breed', 'age', 'income']] = X[['product_name', 'brand', 'breed', 'age', 'income']].astype(str)

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [102]:
feat_numerical = X_train.select_dtypes(include=['float64']).columns
feat_numerical

Index(['Add to cart', 'Product Page View', 'sales', 'quantity',
       'total_website_interaction', 'Contextual', 'Retargeting', 'PC', 'Phone',
       'Robot', 'TV', 'Unknown', 'total_ads_cost', 'device_id',
       'total_ads_tv_count'],
      dtype='object')

In [103]:
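preproc_pipeline = make_column_transformer(
    (MinMaxScaler(), feat_numerical),
    (OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['brand', 'breed', 'age', 'income']),
    # product names that only occur in the test split are encoded as -1 instead of raising
    (make_pipeline(OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), MinMaxScaler()), ['product_name']),
    remainder="drop"
)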

In [104]:
preproc_pipeline
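
With a random 80/20 split over many distinct product names, some names may appear only in the test set. The sketch below shows how `OrdinalEncoder` behaves in that case, first with its default settings and then with `handle_unknown='use_encoded_value'`. The "unseen" product name is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train_products = np.array([["SD Ca Adt Lt Ckn 30lb bg"], ["PD m/d Feline 8.5lb bg"]])
test_products = np.array([["PD m/d Feline 8.5lb bg"], ["hypothetical unseen product"]])

# Default encoder: an unseen category at transform time raises ValueError
strict = OrdinalEncoder().fit(train_products)
try:
    strict.transform(test_products)
    strict_failed = False
except ValueError:
    strict_failed = True

# Tolerant encoder: unseen categories are mapped to a sentinel value instead
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tolerant.fit(train_products)
encoded = tolerant.transform(test_products)
print(strict_failed, encoded.ravel())
```

Note that an ordinal code for product names imposes an arbitrary order on them; tree models tolerate this reasonably well, but it is a design choice rather than a neutral encoding.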

In [105]:
X_train_transformed = preproc_pipeline.fit_transform(X_train)
X_test_transformed = preproc_pipeline.transform(X_test)
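
Fitting the preprocessor once on all of `X_train` means that, during cross-validation, each validation fold has already been seen by the scalers and encoders. A possible alternative, sketched here on synthetic data, is to chain preprocessing and model in a single pipeline so that everything is refit inside each fold. The toy frame, the synthetic target and the choice of `scoring="f1"` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Product Page View": rng.integers(1, 20, n).astype(float),
    "total_ads_cost": rng.uniform(0, 5000, n),
    "breed": rng.choice(["Purebred", "Mixed-breed"], n),
})
# Synthetic target loosely tied to page views, only so the example has some signal
y_toy = (X_toy["Product Page View"] + rng.normal(0, 3, n) > 10).astype(int)

preproc = make_column_transformer(
    (MinMaxScaler(), ["Product Page View", "total_ads_cost"]),
    (OneHotEncoder(handle_unknown="ignore"), ["breed"]),
)
model = make_pipeline(preproc, RandomForestClassifier(n_estimators=50, random_state=0))

# The preprocessor is refit on the training part of every fold
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="f1")
print(scores.mean())
```

The same pattern would apply to `preproc_pipeline` and the untransformed `X_train`, at the cost of repeating the preprocessing in every fold.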

In [None]:
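rf = RandomForestClassifier(n_estimators=1000)
# cross_val_score returns one accuracy score per fold; average them to get the mean
scores_rf = cross_val_score(rf, X_train_transformed, y_train, cv=5, n_jobs=1)
mean_rf = scores_rf.mean()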