# Merging Sample with Additional Feature Sets
This notebook combines the sampled dataset with complementary feature sets available in the Kaggle e-commerce dataset. By joining various related data files, we enrich the sampled client data with additional behavioral and contextual features. The merged dataset is then exported as a Parquet file, which will serve as the main input for subsequent team analyses and modeling tasks.

In [1]:
import duckdb
import pandas as pd

In [2]:

# Path to your Parquet file
parquet_path = "data/messages_subset.parquet"

# Connect to DuckDB (in-memory)
con = duckdb.connect()

# Read the Parquet file and load it into a pandas DataFrame
messages_df = con.execute(f"SELECT * FROM read_parquet('{parquet_path}')").df()


In [3]:
campaigns_df = pd.read_csv('data/campaigns.csv')
first_purchase_df = pd.read_csv('data/client_first_purchase_date.csv')
holidays_df = pd.read_csv('data/holidays.csv')

In [4]:
messages_df.columns

Index(['id', 'message_id', 'campaign_id', 'message_type', 'client_id',
       'channel', 'category', 'platform', 'email_provider', 'stream', 'date',
       'sent_at', 'is_opened', 'opened_first_time_at', 'opened_last_time_at',
       'is_clicked', 'clicked_first_time_at', 'clicked_last_time_at',
       'is_unsubscribed', 'unsubscribed_at', 'is_hard_bounced',
       'hard_bounced_at', 'is_soft_bounced', 'soft_bounced_at',
       'is_complained', 'complained_at', 'is_blocked', 'blocked_at',
       'is_purchased', 'purchased_at', 'created_at', 'updated_at'],
      dtype='object')

In [5]:
campaigns_df.columns

Index(['id', 'campaign_type', 'channel', 'topic', 'started_at', 'finished_at',
       'total_count', 'ab_test', 'warmup_mode', 'hour_limit', 'subject_length',
       'subject_with_personalization', 'subject_with_deadline',
       'subject_with_emoji', 'subject_with_bonuses', 'subject_with_discount',
       'subject_with_saleout', 'is_test', 'position'],
      dtype='object')

In [6]:
first_purchase_df.columns

Index(['client_id', 'first_purchase_date'], dtype='object')

In [7]:
holidays_df.columns

Index(['date', 'holiday'], dtype='object')

In [8]:
messages_df.merge(campaigns_df.rename(columns = {'id' : 'campaign_id'}), on='campaign_id', how='left').head()

Unnamed: 0,id,message_id,campaign_id,message_type,client_id,channel_x,category,platform,email_provider,stream,...,hour_limit,subject_length,subject_with_personalization,subject_with_deadline,subject_with_emoji,subject_with_bonuses,subject_with_discount,subject_with_saleout,is_test,position
0,689791327,1515915625489087633-11387-64244e6bd4d4e,11387,bulk,1515915625489087633,mobile_push,,,,desktop,...,,30.0,False,False,True,False,False,False,,
1,689792401,1515915625489107288-11387-64244e6bd72d3,11387,bulk,1515915625489107288,mobile_push,,,,desktop,...,,30.0,False,False,True,False,False,False,,
2,689792664,1515915625489112445-11387-64244e6bd81d6,11387,bulk,1515915625489112445,mobile_push,,,,desktop,...,,30.0,False,False,True,False,False,False,,
3,689793297,1515915625489122538-11387-64244e6bd98c2,11387,bulk,1515915625489122538,mobile_push,,,,desktop,...,,30.0,False,False,True,False,False,False,,
4,689793902,1515915625489133634-11387-64244e6bdb2a6,11387,bulk,1515915625489133634,mobile_push,,,,desktop,...,,30.0,False,False,True,False,False,False,,


In [9]:
messages_df = messages_df.merge(campaigns_df.rename(columns = {'id' : 'campaign_id'}), on='campaign_id', how='left')
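
Both tables contain a `channel` column, which is why the merged preview shows `channel_x`. A left join also duplicates message rows if a `campaign_id` appears more than once in the campaigns table, and this notebook does not check for that. The sketch below uses small made-up frames (not the real data) to show one way to guard against both issues with the `suffixes` and `validate` arguments of `merge`.

```python
import pandas as pd

# Toy stand-ins for messages_df and campaigns_df; all values are made up.
messages = pd.DataFrame({
    "campaign_id": [1, 1, 2],
    "message_type": ["bulk", "bulk", "trigger"],
    "channel": ["email", "email", "mobile_push"],
})
campaigns = pd.DataFrame({
    "id": [1, 2],
    "campaign_type": ["bulk", "trigger"],
    "channel": ["email", "mobile_push"],
    "topic": ["sale", "abandoned cart"],
})

merged = messages.merge(
    campaigns.rename(columns={"id": "campaign_id"}),
    on="campaign_id",
    how="left",
    suffixes=("", "_campaign"),  # message columns keep their original names
    validate="many_to_one",      # raises MergeError if campaign_id repeats on the right
)

# A valid many-to-one left join never changes the number of rows.
assert len(merged) == len(messages)
print(merged.columns.tolist())
```

If `validate` raises on the real data, the join key is probably incomplete; whether that is the case for `campaigns.csv` is an open question here, not something the outputs above confirm.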

In [10]:
messages_df.merge(first_purchase_df, on='client_id', how='left').iloc[:,-1].head()

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: first_purchase_date, dtype: object

In [11]:
messages_df = messages_df.merge(first_purchase_df, on='client_id', how='left')
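
The first five rows of `first_purchase_date` are all `NaN`, so it may be worth checking how many message rows found a match at all. A minimal sketch on made-up data, using the `indicator` argument of `merge`:

```python
import pandas as pd

# Toy stand-ins for messages_df and first_purchase_df; all values are made up.
messages = pd.DataFrame({"client_id": [10, 11, 12, 10]})
first_purchase = pd.DataFrame({
    "client_id": [10, 12],
    "first_purchase_date": ["2022-05-01", "2023-01-15"],
})

# indicator=True adds a `_merge` column: "both" for matched rows, "left_only" otherwise.
checked = messages.merge(first_purchase, on="client_id", how="left", indicator=True)
match_rate = (checked["_merge"] == "both").mean()
print(f"{match_rate:.0%} of message rows have a first purchase date")
```

A low match rate would suggest that many sampled clients have no recorded purchase, which is worth knowing before the column is used as a feature.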

In [12]:
holidays_df['date'].head()

0    2021-01-01
1    2021-01-07
2    2021-01-13
3    2021-01-19
4    2021-01-25
Name: date, dtype: object

In [13]:
messages_df.date.head()

0   2023-03-29
1   2023-03-29
2   2023-03-29
3   2023-03-29
4   2023-03-29
Name: date, dtype: datetime64[us]

In [14]:
messages_df['date'] = messages_df['date'].dt.date
messages_df.merge(holidays_df, on='date', how='left').iloc[:,-1].unique()

array([nan], dtype=object)

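No holiday matched any message date. Part of the reason may be a type mismatch: after `.dt.date`, `messages_df['date']` holds `datetime.date` objects, while `holidays_df['date']` holds strings, so the join keys can never compare equal. It is also plausible that campaigns are sent before holidays rather than on them. Either way, the number of days before or after the nearest holiday can be computed and used as a feature.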
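
The days-to-holiday feature can be sketched with `pd.merge_asof`, which matches each message date to the nearest holiday in a chosen direction. The frames and the column names `holiday_date`, `days_to_next_holiday`, and `days_since_prev_holiday` below are illustrative assumptions, not part of the dataset. Both date columns are converted to `datetime64` first so the keys are comparable.

```python
import pandas as pd

# Toy stand-ins; all dates and holiday labels are made up.
messages = pd.DataFrame({
    "date": pd.to_datetime(["2023-03-29", "2023-01-01", "2023-05-20"]),
})
holidays = pd.DataFrame({
    "date": ["2023-01-01", "2023-05-01", "2023-06-12"],  # strings, as read from a CSV
    "holiday": ["holiday_a", "holiday_b", "holiday_c"],
})

# Align dtypes and sort: merge_asof needs sorted keys of the same type.
holidays["date"] = pd.to_datetime(holidays["date"])
cal = holidays.sort_values("date").rename(columns={"date": "holiday_date"})
feats = messages.sort_values("date").reset_index(drop=True)

# Nearest holiday on or after each message date, and on or before it.
nxt = pd.merge_asof(feats, cal, left_on="date", right_on="holiday_date", direction="forward")
prv = pd.merge_asof(feats, cal, left_on="date", right_on="holiday_date", direction="backward")

feats["days_to_next_holiday"] = (nxt["holiday_date"] - feats["date"]).dt.days
feats["days_since_prev_holiday"] = (feats["date"] - prv["holiday_date"]).dt.days
print(feats)
```

Note that `merge_asof` rejects null join keys, so rows with a missing `date` would need to be dropped or handled separately before applying this to the full dataset.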

In [15]:
messages_df.shape

(1686884, 51)

In [16]:
# save the combined dataframe
messages_df.to_parquet('data/combined_dataset.parquet')

In [17]:
messages_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1686884 entries, 0 to 1686883
Data columns (total 51 columns):
 #   Column                        Non-Null Count    Dtype         
---  ------                        --------------    -----         
 0   id                            1686884 non-null  int64         
 1   message_id                    1686884 non-null  object        
 2   campaign_id                   1686884 non-null  int64         
 3   message_type                  1686884 non-null  object        
 4   client_id                     1686884 non-null  int64         
 5   channel_x                     1686884 non-null  object        
 6   category                      0 non-null        object        
 7   platform                      247181 non-null   object        
 8   email_provider                1074538 non-null  object        
 9   stream                        1686884 non-null  object        
 10  date                          1686465 non-null  object        
 11

As a pandas DataFrame, the dataset uses about 1.9 GB of memory.
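
Several columns in the listing above are `object` dtype with what look like few distinct values (for example `message_type` and `channel_x`). One possible way to reduce the footprint is to store such columns as `category`. The sketch below measures the effect on made-up data; the savings on the real dataset have not been measured here.

```python
import pandas as pd

# Toy frame with repetitive string columns; all values are made up.
df = pd.DataFrame({
    "channel": ["email", "mobile_push"] * 50_000,
    "message_type": ["bulk", "trigger"] * 50_000,
})

# deep=True counts the actual string storage, like info(memory_usage='deep').
before = df.memory_usage(deep=True).sum()
compact = df.astype({"channel": "category", "message_type": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"string columns:   {before / 1e6:.1f} MB")
print(f"category columns: {after / 1e6:.1f} MB")
```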