<a href="https://colab.research.google.com/github/ShawnLiu119/Segmentation_Embedding_DL/blob/main/Customer_Embedding_EventSequence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Customer Embedding - Based on Event**

## Key Concepts
**Event**: a customer event sequence can be a product browsing history on a website, clicks, in-store purchases, or transactions. <br>
**Doc2Vec**: a neural-network-based, unsupervised learning technique that learns a distributed representation of documents by mapping each document to a fixed-length dense vector.

## Use Cases
1. **Segmentation**: capture behavioral semantics and use the embeddings for audience analysis (e.g. clustering).
2. **Recommendation engine**: use the embeddings as input for downstream personalization models.

## Modeling Logic
Each customer is treated as a document, each order as a sentence, and each product as a word. A customer can place multiple orders, and an order can contain multiple products.
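
The mapping above can be sketched with pandas. This is a minimal illustration on made-up rows, not the notebook's actual pipeline: the column names mirror the Instacart files, while `toy` and `build_customer_documents` are hypothetical names introduced here. Flattening all orders of a user into one token list is an assumption; order boundaries could also be kept as separate sentences.

```python
import pandas as pd

# Toy order lines (made-up values); columns mirror the Instacart schema.
toy = pd.DataFrame({
    "user_id":           [1, 1, 1, 2, 2, 1],
    "order_number":      [1, 1, 2, 1, 1, 2],
    "add_to_cart_order": [1, 2, 1, 2, 1, 2],
    "product_id":        [196, 14084, 196, 33120, 28985, 10258],
})

def build_customer_documents(order_lines: pd.DataFrame) -> pd.Series:
    """Return one list of product tokens per user (hypothetical helper).

    Tokens are ordered by order number, then by cart position, so the
    "document" reads in the order the customer actually shopped.
    """
    ordered = order_lines.sort_values(
        ["user_id", "order_number", "add_to_cart_order"]
    )
    tokens = ordered["product_id"].astype(str)  # words are string tokens
    return tokens.groupby(ordered["user_id"]).agg(list)

docs = build_customer_documents(toy)
print(docs.to_dict())
```

Each value in `docs` is a word list of the kind a document-embedding model expects, keyed by the customer it belongs to.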

## MLOps Considerations
Doc2Vec training and the subsequent clustering are computationally heavy, so the dataset size and compute resources need to be chosen carefully.
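
One inexpensive way to stretch limited memory is to shrink integer columns before any modeling. The sketch below is an assumption about how that could be done, not something the notebook itself performs: `toy` is a made-up stand-in for an order-line table and `downcast_ints` is a hypothetical helper built on `pd.to_numeric(..., downcast=...)`.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for an order-line table with int64 columns.
toy = pd.DataFrame({
    "order_id": np.arange(1, 1001, dtype="int64"),
    "product_id": np.arange(1, 1001, dtype="int64") % 50000,
    "add_to_cart_order": np.arange(1000, dtype="int64") % 20 + 1,
    "reordered": np.arange(1000, dtype="int64") % 2,
})

def downcast_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer columns stored in the smallest unsigned dtype that fits."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Columns with negative values cannot be made unsigned and are left as-is.
        out[col] = pd.to_numeric(out[col], downcast="unsigned")
    return out

before = toy.memory_usage(index=False, deep=True).sum()
small = downcast_ints(toy)
after = small.memory_usage(index=False, deep=True).sum()

print(f"bytes before: {before}, after: {after}")
print(small.dtypes)
```

The values are unchanged; only the storage width differs, which matters for tables with tens of millions of rows.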

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from tqdm.notebook import tqdm  # progress bar for notebooks (tqdm_notebook is deprecated)
import numpy as np
import os
from sklearn.manifold import TSNE  # visualize high-dimensional data by converting similarities to joint probabilities

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn import preprocessing

%matplotlib inline

# Widen pandas display limits so wide frames print in full
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 500)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import glob #glob module is used to retrieve files/pathnames matching a specified pattern

import multiprocessing as mp
print('Number of CPU cores:', mp.cpu_count())

Number of CPU cores: 2


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Step 1: Read Data

In [4]:
folder = "/content/drive/MyDrive/kaggle_data/instacart-market-basket-analysis"
files_list = glob.glob(f'{folder}/*.csv')

data_dict = {}

for file in files_list:
    print(f'\n\nReading: {file}')
    data = pd.read_csv(file)
    print(data.info(show_counts=True))
    # Key each frame by its file name without directory or extension, e.g. 'orders'
    data_dict[os.path.splitext(os.path.basename(file))[0]] = data

print(f'Loaded data sets: {data_dict.keys()}')



Reading: /content/drive/MyDrive/kaggle_data/instacart-market-basket-analysis/sample_submission.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75000 entries, 0 to 74999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   order_id  75000 non-null  int64 
 1   products  75000 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.1+ MB
None


Reading: /content/drive/MyDrive/kaggle_data/instacart-market-basket-analysis/order_products__prior.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Non-Null Count     Dtype
---  ------             --------------     -----
 0   order_id           32434489 non-null  int64
 1   product_id         32434489 non-null  int64
 2   add_to_cart_order  32434489 non-null  int64
 3   reordered          32434489 non-null  int64
dtypes: int64(4)
memory usage: 989.8 MB
None


Reading: /content/drive/

In [6]:
df_test = data_dict['products']
df_test.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
4,5,Green Chile Anytime Sauce,5,13


In [8]:
df_t2 = data_dict['orders']
df_t2.head()
# sort by order time for each user

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [10]:
df_t3 = data_dict['order_products__train']
df_t3.head()
# sort by order time for each user

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [20]:
df_t4 = data_dict['aisles']
df_t4.head()

Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


## Step 2: Establish a Baseline for Comparison

Note: no demographic data is available here.

### 2.1 Feature Engineering

In [11]:
train_orders = data_dict['order_products__train']
prior_orders = data_dict['order_products__prior']
products = data_dict['products'].set_index('product_id')

orders = data_dict['orders']
prior_orders = prior_orders.merge(right=orders[['user_id','order_id','order_number']],on='order_id',how='left')

prior_orders.head()
# user_id --> document, order_id --> sentence, product_id --> word

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number
0,2,33120,1,1,202279,3
1,2,28985,2,1,202279,3
2,2,9327,3,0,202279,3
3,2,45918,4,1,202279,3
4,2,30035,5,0,202279,3


In [16]:
print(len(prior_orders))
print(prior_orders['order_id'].nunique())  # number of unique orders
print(prior_orders['user_id'].nunique())  # number of unique users

32434489
3214874
206209
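
These three counts give a rough sense of corpus shape under the document/sentence/word analogy. The arithmetic below only reuses the numbers printed above; the variable names are introduced here for illustration.

```python
# Counts printed by the cell above.
n_order_lines = 32_434_489  # rows in prior_orders (one row per product in an order)
n_orders = 3_214_874        # unique order_id values
n_users = 206_209           # unique user_id values

products_per_order = n_order_lines / n_orders  # average "words per sentence"
orders_per_user = n_orders / n_users           # average "sentences per document"
products_per_user = n_order_lines / n_users    # average "words per document"

print(f"{products_per_order:.2f} products per order")
print(f"{orders_per_user:.2f} orders per user")
print(f"{products_per_user:.2f} products per user")
```

On average that is roughly 10 products per order and roughly 16 orders per user, so a typical customer "document" holds on the order of 150 product tokens.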


In [17]:
# Downsize to a random subset of users to reduce compute cost
user_subset = 50000
# Sample from the distinct user ids so each user is drawn at most once;
# sampling the raw column would draw order lines and favor heavy buyers
user_id_sample = prior_orders['user_id'].drop_duplicates().sample(n=user_subset, replace=False)
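
When sampling users from an order-line table, it matters whether the draw is over rows or over distinct ids: `replace=False` only prevents the same row from being drawn twice, not the same user. The sketch below contrasts the two on made-up data; `toy` and the `random_state` value are assumptions added for reproducibility of the illustration.

```python
import pandas as pd

# Made-up order lines: user 1 owns 96 rows, users 2-5 own one row each.
toy = pd.DataFrame({"user_id": [1] * 96 + [2, 3, 4, 5]})

# Row-level sampling: draws 4 order lines, so a heavy buyer can fill
# several of the 4 slots and fewer than 4 distinct users may come back.
row_sample = toy["user_id"].sample(n=4, replace=False, random_state=0)

# User-level sampling: draws 4 ids from the distinct users, so exactly
# 4 different users are returned, each with equal probability.
user_sample = toy["user_id"].drop_duplicates().sample(
    n=4, replace=False, random_state=0
)

print("distinct users, row-level sample: ", row_sample.nunique())
print("distinct users, user-level sample:", user_sample.nunique())
```

Either sample can be passed to `isin` for filtering, but only the user-level draw guarantees the requested number of customers.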

In [18]:
type(user_id_sample)

In [21]:
prior_orders_details = prior_orders[prior_orders.user_id.isin(user_id_sample)].copy()
prior_orders_details['product_id'] = prior_orders_details['product_id'].astype(int)
prior_orders_details = prior_orders_details.merge(data_dict['products'], on='product_id', how='left')
prior_orders_details = prior_orders_details.merge(data_dict['aisles'], on='aisle_id', how='left')
prior_orders_details = prior_orders_details.merge(data_dict['departments'], on='department_id', how='left')

prior_orders_details.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,order_number,product_name,aisle_id,department_id,aisle,department
0,4,46842,1,0,178520,36,Plain Pre-Sliced Bagels,93,3,breakfast bakery,bakery
1,4,26434,2,1,178520,36,Honey/Lemon Cough Drops,11,11,cold flu allergy,personal care
2,4,39758,3,1,178520,36,Chewy 25% Low Sugar Chocolate Chip Granola,3,19,energy granola bars,snacks
3,4,27761,4,1,178520,36,Oats & Chocolate Chewy Bars,48,14,breakfast bars pastries,breakfast
4,4,10054,5,1,178520,36,Kellogg's Nutri-Grain Apple Cinnamon Cereal,48,14,breakfast bars pastries,breakfast
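
Chained left merges like the ones above silently duplicate rows if a lookup table has repeated keys, and silently leave NaN where a key is missing. A hedged sketch of guarding against both with pandas' `validate` and `indicator` options follows; the frames are made-up and the product id `99999` is a hypothetical unmatched key.

```python
import pandas as pd

# Made-up order lines; 99999 is a hypothetical id absent from the lookup table.
order_lines = pd.DataFrame({
    "order_id": [4, 4, 5],
    "product_id": [46842, 26434, 99999],
})
products = pd.DataFrame({
    "product_id": [46842, 26434],
    "product_name": ["Plain Pre-Sliced Bagels", "Honey/Lemon Cough Drops"],
})

merged = order_lines.merge(
    products,
    on="product_id",
    how="left",
    validate="many_to_one",  # raises MergeError if product_id repeats in `products`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# A left merge against a unique key must keep the row count unchanged.
assert len(merged) == len(order_lines)

# Rows whose product_id had no match in the lookup table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["order_id", "product_id"]])
```

The same two arguments can be added to each merge in the chain before dropping the `_merge` column.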
