# Customer Behavior Prediction using Deep Learning

Author: Tatsiana Mihai

## Project description

For this project I'll use public datasets with user behavior information available on Kaggle:
- Model training: E-commerce behavior data from multi category store for November 2019
https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store
- Validating model robustness: E-commerce behavior data from multi category store for January 2020 https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store


## Data exploration

In [None]:
# import required packages

import pandas as pd
import numpy as np

Disclaimer: due to GitHub size constraints, the `data` folder is not provided in this repository. You can download the source files using the links listed in the Project description.

In [None]:
# read csv file and count the non-null values in each column

# training dataset
df = pd.read_csv('data/2019-Nov.csv')

# testing dataset
# df = pd.read_csv('data/2020-Jan.csv')

df.count()

In [None]:
# show the first rows in the dataframe
df.head()

Let's check the quality of the data. First, I want to make sure there are no rows with null or invalid values that might affect model training.
Let's check which columns have null values and how many.

In [None]:
# show sum of null values for each column

df.isnull().sum()

As we can see, there are quite a few missing values for `category_code` and `brand`, and also 10 missing values for `user_session`.
Now let's check whether there are negative `price` values, as they're invalid and would affect the accuracy of the trained model.

In [None]:
# show sum of negative values for the column `price`

(df.price < 0).sum()

As we can see, none of the prices is negative, which makes the dataset pretty clean. The only issue I'd like to address is the missing values. Also, I'd like to transform the composite values in the `category_code` column into multi-dimensional features so that I can later try different combinations for model training.

## Data preprocessing

**Reduce dimensionality**  
The dataset is large; however, not all of the columns are significant for model training. The `user_session` column can be dropped because the `user_id` column contains all the necessary information and has no missing values. Therefore, I'll remove `user_session` from the dataset.

In [None]:
# keep only necessary columns
required_cols = [
    'event_time',
    'event_type',
    'category_code',
    'product_id',
    'category_id',
    'brand',
    'price',
    'user_id'
]

df = df[required_cols]
df.head()

**Handling the missing values**

As we can see from the data exploration, there are 9224078 missing values in the `brand` column. As these rows still carry other information such as `event_type`, `user_id` and `price`, which can be useful for ML training, I'll fill the missing brands with the value `unknown`.
Apart from that, some `category_code` values contain `NaN` instead of the expected string values, which causes issues with data transformation. I'll fill them with `unknown` as well.

In [None]:
df['brand'] = df['brand'].fillna('unknown')
df['category_code'] = df['category_code'].fillna('unknown')
df.head()

Let's see how many categories we can restore by mapping their known brands. To do that, I'll get a list of unique combinations of unknown category codes and known brands. 

In [None]:
# extract unique brands from rows with an unknown `category_code` and a known `brand`

unknown_cats = df[(df.category_code == 'unknown') & (df.brand != 'unknown')]
unknown_cats = pd.unique(unknown_cats['brand'])

len(unknown_cats)

Now let's explore whether it's possible to re-use a category that is defined for another product with the same `brand` value. To do that, I'll fetch all categories from `category_code` for the brands listed in `unknown_cats`.

In [None]:
# try to find categories for the brands

known_brands = df[(df.category_code != 'unknown') & (df.brand.isin(unknown_cats))]
known_brands = known_brands[['category_code', 'brand']]
brands_possible_cats = known_brands.groupby(by=['brand']).nunique().reset_index()

# group brands to see how many potential categories each brand has
grouped_known_brands = brands_possible_cats.groupby(by=['category_code']).count()
grouped_known_brands.plot(kind='bar')

As the plot shows, a large share of the brands have just one possible category. Let's explore some of them.

In [None]:
brands_with_single_cat = brands_possible_cats[\
        brands_possible_cats.category_code == 1].brand

known_brands[known_brands.brand.isin(brands_with_single_cat)][210:220]

In [None]:
brands_with_multi_cat = brands_possible_cats[\
        brands_possible_cats.category_code > 25].brand

known_brands[known_brands.brand.isin(brands_with_multi_cat)].head()

In [None]:
known_brands[known_brands.brand == 'xiaomi'].head(20)

At this point it's obvious that only the first-level category makes sense to copy into the rows with missing values. If I added the second or third category level, I might degrade the quality of the data.
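
To make this reasoning concrete, here is a minimal sketch on made-up data (the brands and category codes below are hypothetical, not taken from the dataset). It counts, per brand, how many distinct full category codes and how many distinct first-level categories occur. The assumption being illustrated is that a brand can span several full codes while still sharing a single first level.

```python
import pandas as pd

# Hypothetical sample: one brand with several full category codes
toy = pd.DataFrame({
    "brand": ["acme", "acme", "acme", "zenit"],
    "category_code": [
        "electronics.smartphone",
        "electronics.audio.headphone",
        "electronics.clocks",
        "appliances.kitchen.kettle",
    ],
})

# The first-level category is the part before the first dot
toy["cat_1"] = toy["category_code"].str.split(".").str[0]

# Distinct full codes vs. distinct first-level categories per brand
summary = toy.groupby("brand").agg(
    full_codes=("category_code", "nunique"),
    first_levels=("cat_1", "nunique"),
)
print(summary)
```

A brand whose `first_levels` count is 1 can donate its first level safely, while copying a deeper level would require picking one of several candidates.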

**Data transformation**  
First, I want to transform the values from `category_code` into multi-column data `cat_1`, `cat_2`, etc.
To do that, I need to know the length of the longest chain in the `category_code` column.

In [None]:
# calculate the max number of category levels

df.category_code.str.split('.').str.len().max()

**Imputation**  
The maximum length of nested categories is `4`. Now it's possible to create new columns to store each category level separately. As only the first level is to be filled for unknown categories, the other empty values can be filled with `unknown`.

In [None]:
# split `category_code` column into new columns
df[['cat_1', 'cat_2', 'cat_3', 'cat_4']] = df.category_code.str.split(".", expand = True)
df.head()
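
The split above hard-codes four column names. As a hedged alternative, the sketch below derives the column names from whatever depth the data contains; the sample codes are made up for illustration.

```python
import pandas as pd

# Hypothetical category codes of different depths
toy = pd.DataFrame({"category_code": [
    "electronics.smartphone",
    "appliances.kitchen.refrigerators",
    "unknown",
]})

# expand=True returns one column per level; shorter codes are padded with missing values
levels = toy["category_code"].str.split(".", expand=True)

# Name the columns cat_1..cat_N based on the deepest code found
levels.columns = [f"cat_{i + 1}" for i in range(levels.shape[1])]

# Mark the missing deeper levels explicitly
levels = levels.fillna("unknown")
print(levels)
```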

As the values from the `category_code` column have been transferred to other columns, it can finally be removed.

In [None]:
# keep only necessary columns
required_cols = [
    'event_time',
    'event_type',
    'product_id',
    'category_id',
    'cat_1',
    'cat_2',
    'cat_3',
    'cat_4',
    'brand',
    'price',
    'user_id'
]

df = df[required_cols]
df.head()

Let's see how many new null values have been created after changing the dimensions of the data.

In [None]:
# show sum of null values for each column

df.isnull().sum()

The `cat_1` column has no empty values because missing category codes were pre-filled with `unknown`. For the other columns the fill rate is 99.9% for `cat_2`, 72.1% for `cat_3`, and 32.4% for `cat_4`. Though the last column doesn't look promising, I'll keep it for now so that I can use it in future training and see how it affects the model.
Let's fill the `unknown` values in `cat_1` with the values from known brands. To make this easier, I'll transform `known_brands` the same way I transformed the dataset.

In [None]:
# split `category_code` column into new columns
known_brands[['cat_1', 'cat_2', 'cat_3', 'cat_4']] = known_brands.category_code.str.split(".", expand = True)
known_brands = known_brands[['brand', 'cat_1']].drop_duplicates(subset=['brand'])
known_brands.head()

In [None]:
res = pd.merge(df, known_brands, on='brand', how='left')
res.head()

In [None]:
res['cat'] = np.where(res['cat_1_x'] == 'unknown', res.cat_1_y, res.cat_1_x)
res.head()
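
The merge and `np.where` steps above can also be expressed as a lookup. The following is a minimal sketch on made-up rows, assuming a brand-to-category lookup built from rows where both values are known; it is an illustration of the idea, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical rows: some with an unknown first-level category
toy = pd.DataFrame({
    "brand": ["acme", "acme", "zenit", "unknown"],
    "cat_1": ["electronics", "unknown", "unknown", "unknown"],
})

# Lookup built from rows where both brand and category are known
lookup = (
    toy[(toy["cat_1"] != "unknown") & (toy["brand"] != "unknown")]
    .drop_duplicates(subset="brand")
    .set_index("brand")["cat_1"]
)

# Keep known categories; otherwise take the brand's category if one exists
mapped = toy["brand"].map(lookup)
toy["cat"] = toy["cat_1"].where(toy["cat_1"] != "unknown", mapped).fillna("unknown")
print(toy)
```

Rows whose brand has no known category stay `unknown`, which mirrors the later `fillna('unknown')` step.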

In [None]:
# keep only necessary columns
required_cols = [
    'event_time',
    'event_type',
    'product_id',
    'category_id',
    'cat',
    'cat_2',
    'cat_3',
    'cat_4',
    'brand',
    'price',
    'user_id'
]

res = res[required_cols]
res.head()

In [None]:
res.isnull().sum()

In [None]:
res['cat'] = res['cat'].fillna('unknown')
res['cat_2'] = res['cat_2'].fillna('unknown')
res['cat_3'] = res['cat_3'].fillna('unknown')
res['cat_4'] = res['cat_4'].fillna('unknown')
res.head()

#### ML specific preprocessing

Data used for training a model must be numeric. However, the dataset contains a few columns of `string` or `datetime` type. First, let's convert the `datetime` values in `event_time` into numeric Unix timestamps.
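
As a side note, `time.mktime` interprets its argument as local time, so the resulting numbers depend on the time zone of the machine that runs the notebook. A vectorized alternative is sketched here on two sample strings in the dataset's `event_time` format; it assumes every value ends with the literal suffix ` UTC`.

```python
import pandas as pd

# Sample strings in the same format as the `event_time` column
events = pd.Series(["2019-11-01 00:00:00 UTC", "2019-11-01 00:00:01 UTC"])

# Drop the literal " UTC" suffix and parse the rest as UTC timestamps
parsed = pd.to_datetime(
    events.str.replace(" UTC", "", regex=False),
    format="%Y-%m-%d %H:%M:%S",
    utc=True,
)

# Whole seconds since the Unix epoch, independent of the local time zone
epoch = pd.Timestamp("1970-01-01", tz="UTC")
unix_seconds = (parsed - epoch) // pd.Timedelta(seconds=1)
print(unix_seconds.tolist())
```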

In [None]:
# import required packages
import time
import datetime

res.event_time = res.event_time.apply(lambda x: time.mktime(datetime.datetime.strptime(x,
                                             "%Y-%m-%d %H:%M:%S %Z").timetuple()))

The other columns can be converted to numeric values by using label encoding. I'll use the `LabelEncoder` class that Scikit-learn provides for this purpose.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data
categorical_columns = [
    'event_type', 
    'cat', 
    'cat_2', 
    'cat_3', 
    'cat_4', 
    'brand'
]

for col in categorical_columns:
    res[col] = label_encoder.fit_transform(res[col])
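
Because a single `LabelEncoder` instance is refitted for every column, only the mapping of the last column remains available afterwards. If the mappings are needed later (for example, to decode predictions or to apply the same codes to the testing dataset), one option is to keep one encoder per column. A minimal sketch on made-up rows:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows with two categorical columns
toy = pd.DataFrame({
    "event_type": ["view", "cart", "view", "purchase"],
    "brand": ["acme", "unknown", "zenit", "acme"],
})

# One fitted encoder per column, kept so the mapping can be reused or inverted
encoders = {}
for col in ["event_type", "brand"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Classes are stored sorted, so the integer codes follow that order
print(list(encoders["event_type"].classes_))

# Recover the original labels from the codes
decoded = encoders["event_type"].inverse_transform(toy["event_type"])
```

Note that `transform` raises an error for labels that were not seen during fitting, so a value that appears only in the testing dataset would need separate handling.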

In [None]:
# Normalize price
res['price'] = (res['price'] - res['price'].mean()) / res['price'].std()
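
The normalization above is the z-score $z = (x - \mu) / \sigma$. One detail worth knowing: pandas computes the sample standard deviation by default, whereas `StandardScaler`, which is applied in the model training cells, uses the population standard deviation. A small sketch with made-up prices shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical prices
price = pd.Series([1.0, 2.0, 3.0])

# pandas uses the sample standard deviation (ddof=1) by default
z_sample = (price - price.mean()) / price.std()

# ddof=0 gives the population standard deviation, as used by StandardScaler
z_population = (price - price.mean()) / price.std(ddof=0)

print(z_sample.tolist())
print(z_population.round(4).tolist())
```

On a dataset with millions of rows the two variants are practically identical.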

In [None]:
# Sort data by 'event_time' 
res.sort_values(by='event_time', inplace=True)

In [None]:
res.head()

The dataframe is ready to be used in model training. I'll save it to a new .csv file to use it later for training.

In [None]:
# save to a file (index=False avoids writing the row index as an extra column)
# training dataset
res.to_csv('data/processed_data_train.csv', index=False)

# testing dataset
# res.to_csv('data/processed_data_test.csv', index=False)

I repeat the same operation for the testing dataset and save it to `data/processed_data_test.csv`.

## Model training

### CNN

In [4]:
import numpy as np
import pandas as pd

import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers.legacy import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import f1_score

In [5]:
# Load the dataset
train_data = pd.read_csv('data/processed_data_train.csv')

In [6]:
def sequence_generator(X, y, sequence_length, batch_size):
    while True:
        for i in range(0, len(X) - sequence_length, batch_size):
            X_batch = [X[i+j:i+j+sequence_length] for j in range(batch_size)]
            y_batch = y[i:i+batch_size]  # Adjust the slicing for y

            # Pad sequences to ensure they have the same length
            max_length = max(len(seq) for seq in X_batch)
            X_batch = [np.pad(seq, ((0, max_length - len(seq)), (0, 0)), 'constant') for seq in X_batch]

            X_batch = np.array(X_batch)
            y_batch = np.array(y_batch)

            yield X_batch, y_batch
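
For comparison, overlapping windows can also be built in one step with NumPy. This sketch is a different technique from the generator above and makes its own assumption: each window is paired with the label of its last row, whereas the generator pairs a window with the label of its first row. The helper name `make_windows` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(X, y, sequence_length):
    """Return (windows, labels); each label belongs to the window's last row."""
    # sliding_window_view appends the window axis last: (n_windows, n_features, seq_len)
    windows = sliding_window_view(X, window_shape=sequence_length, axis=0)
    # Reorder to (n_windows, seq_len, n_features), the layout Conv1D expects
    windows = windows.transpose(0, 2, 1)
    labels = y[sequence_length - 1:]
    return windows, labels

X_toy = np.arange(12).reshape(6, 2)   # 6 events, 2 features
y_toy = np.arange(6)                  # one label per event
windows, labels = make_windows(X_toy, y_toy, sequence_length=3)
print(windows.shape, labels.shape)
```

The result is a read-only view on the original array, so it does not copy the data until a batch is materialized.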

In [29]:
td = train_data[0:256000]

In [30]:
# Scale numerical features
scaler = StandardScaler()
# Combine all features
X = td[['event_time','product_id', 'category_id', 'cat', 'cat_2', 'cat_3', 'cat_4', 'brand', 'price']]
X = scaler.fit_transform(X)
y = td['event_type']
y = to_categorical(y, num_classes=3)

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [32]:
# Create sequences from the data
sequence_length = 100 
batch_size = 256
train_data_generator = sequence_generator(X_train, y_train, sequence_length, batch_size)
test_data_generator = sequence_generator(X_test, y_test, sequence_length, batch_size)

In [33]:
# Define the CNN model
input_layer = Input(shape=(sequence_length, X_train.shape[1]))
x = Conv1D(64, kernel_size=3, activation='relu')(input_layer)
x = MaxPooling1D(pool_size=2)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
output_layer = Dense(3, activation='softmax')(x)

In [34]:
# Compile the model
cnn_model = Model(inputs=input_layer, outputs=output_layer)

cnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [35]:
# Train the model using the generator
cnn_model.fit(train_data_generator, 
          epochs=10, 
          steps_per_epoch=len(X_train) // batch_size,
          validation_data=test_data_generator,          
          validation_steps=len(X_test) // batch_size)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x2856d74f0>

In [36]:
# Make predictions on the test data
y_pred = cnn_model.predict(test_data_generator, steps=len(X_test) // batch_size)
y_pred = np.argmax(y_pred, axis=1)  # Convert predicted class probabilities to class labels

# Convert one-hot encoded true labels to class labels
y_true = np.argmax(y_test, axis=1)

# Calculate F1 score
f1 = f1_score(y_true, y_pred, average='weighted')
print("F1 Score:", f1)

F1 Score: 0.951643440189553
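
When reading a weighted F1 score it helps to remember how the averaging modes behave on imbalanced classes. The sketch below uses made-up labels and a trivial classifier that always predicts the majority class; it is not a statement about the model above, only an illustration of the `average` parameter of `f1_score`.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 8 samples of class 0, 2 of class 1
y_true_toy = [0] * 8 + [1] * 2
# A trivial classifier that always predicts the majority class
y_pred_toy = [0] * 10

micro = f1_score(y_true_toy, y_pred_toy, average="micro", zero_division=0)
weighted = f1_score(y_true_toy, y_pred_toy, average="weighted", zero_division=0)
macro = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)
print(f"micro={micro:.3f} weighted={weighted:.3f} macro={macro:.3f}")
```

Micro and weighted averages are dominated by the frequent class, while the macro average treats every class equally, so a confusion matrix or per-class scores are a useful complement.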


### CNN Second iteration - multiple improvements

In [37]:
# Create sequences from the data
sequence_length = 100 
batch_size = 64
train_data_generator = sequence_generator(X_train, y_train, sequence_length, batch_size)
test_data_generator = sequence_generator(X_test, y_test, sequence_length, batch_size)

In [43]:
# Define the CNN model

input_layer = Input(shape=(sequence_length, X_train.shape[1]))
x = Conv1D(128, kernel_size=3, activation='relu')(input_layer)
x = MaxPooling1D(pool_size=3)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)
output_layer = Dense(3, activation='sigmoid')(x)  

# Define a learning rate schedule
initial_learning_rate = 0.001
lr_schedule = ExponentialDecay(initial_learning_rate, decay_steps=100, decay_rate=0.9)

# Use the learning rate schedule in the optimizer
optimizer = Adam(learning_rate=lr_schedule)

# Define early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Compile the model
cnn_imp_model = Model(inputs=input_layer, outputs=output_layer)
cnn_imp_model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
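
In its default (non-staircase) form, `ExponentialDecay` computes the learning rate as `initial_learning_rate * decay_rate ** (step / decay_steps)`. The plain-Python sketch below evaluates that formula for the values used above; the function name is hypothetical. With 204,800 training rows and a batch size of 64, one epoch is 3,200 steps.

```python
def exponential_decay(step, initial_learning_rate=0.001, decay_steps=100, decay_rate=0.9):
    """Learning rate after `step` optimizer updates (non-staircase form)."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)

# Learning rate at the start, after 100 steps, and after one epoch (3200 steps)
for step in (0, 100, 3200):
    print(step, exponential_decay(step))
```

With these settings the learning rate falls to roughly 3% of its initial value by the end of the first epoch, which is worth keeping in mind when interpreting how quickly training plateaus.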

In [44]:
# Train the model using the generator
cnn_imp_model.fit(train_data_generator, 
          epochs=30, 
          steps_per_epoch=len(X_train) // batch_size,
          validation_data=test_data_generator,          
          validation_steps=len(X_test) // batch_size,
          callbacks=[early_stopping])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30


<keras.src.callbacks.History at 0x29a4bab00>
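
The log ends at epoch 4 of 30, which would be consistent with `patience=3` if the validation loss did not improve after the first epoch; the losses themselves are not shown, so this is an inference. The sketch below is a simplified model of the patience counter, not the Keras implementation, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value occurs in epoch 1
print(stopping_epoch([0.20, 0.21, 0.22, 0.23, 0.19], patience=3))
```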

In [46]:
# Make predictions on the test data
y_pred = cnn_imp_model.predict(test_data_generator, steps=len(X_test) // batch_size)
y_pred = np.argmax(y_pred, axis=1)  # Convert predicted class probabilities to class labels

# Convert one-hot encoded true labels to class labels
y_true = np.argmax(y_test, axis=1)

# Calculate F1 score
f1 = f1_score(y_true, y_pred, average='weighted')
print("F1 Score:", f1)

F1 Score: 0.9516922575117627


### RNN

In [None]:
import pandas as pd 
import numpy as np

import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers.legacy import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import f1_score, confusion_matrix

In [None]:
# Load the dataset
train_data = pd.read_csv('data/processed_data_train.csv')

In [None]:
td = train_data

In [None]:
# Scale numerical features
scaler = StandardScaler()
# Combine all features
X = td[['event_time','product_id', 'category_id', 'cat', 'cat_2', 'cat_3', 'cat_4', 'brand', 'price']]
X = scaler.fit_transform(X)

y = td['event_type']
y = to_categorical(y, num_classes=3)

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [None]:
# Define the model
rnn_model = tf.keras.Sequential([
    LSTM(units=64, return_sequences=True, input_shape=(X_train.shape[1], 1)),
    LSTM(units=64),
    Dense(units=3, activation='softmax')
])

In [None]:
# Reshape the input data to fit the model input shape
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))

# Compile the model with accuracy as a metric
rnn_model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

# Train the model, including accuracy monitoring
rnn_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
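
To see what the reshape does, here is a minimal NumPy sketch on a made-up array with the same number of feature columns. Each row of 9 features becomes a sequence of 9 time steps with 1 feature per step, so the LSTM iterates over the feature columns of a single event rather than over consecutive events.

```python
import numpy as np

# Hypothetical batch: 4 rows with 9 scaled features each
X_toy = np.arange(36, dtype=float).reshape(4, 9)

# (samples, features) -> (samples, timesteps, features_per_step)
X_seq = X_toy.reshape((X_toy.shape[0], X_toy.shape[1], 1))
print(X_seq.shape)
```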

In [None]:
# Preprocess testing data and reshape it
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# Evaluate the model (returns the loss followed by the compiled metrics)
loss, accuracy = rnn_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")

In [None]:
y_pred = rnn_model.predict(X_test)

# Convert the continuous predictions to class labels
y_test_int = np.argmax(y_test, axis=1)
y_pred_classes = np.argmax(y_pred, axis=1)

# Calculate the F1 score
f1 = f1_score(y_test_int, y_pred_classes, average='micro')

print(f"F1 Score: {f1}")

### RNN Second iteration - multiple improvements

In [None]:
# Define early stopping callback
early_stopping = EarlyStopping(monitor='val_loss',
                               patience=3,          
                               restore_best_weights=True)


In [None]:
rnn_model = tf.keras.Sequential([
    LSTM(units=128, return_sequences=True, input_shape=(X_train.shape[1], 1)),
    LSTM(units=128, return_sequences=True),
    Dropout(0.2),  
    LSTM(units=128),
    Dropout(0.2), 
    Dense(units=3, activation='softmax') 
])

In [None]:
optimizer = Adam(learning_rate=0.001)
rnn_model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
rnn_model.fit(X_train, y_train, 
              epochs=30,
              validation_split=0.2,
              callbacks=[early_stopping]
             )

In [None]:
# Preprocess testing data and reshape it
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# Evaluate the model (returns the loss followed by the compiled metrics)
loss, accuracy = rnn_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")

In [None]:
# Make predictions on the test data
y_pred = rnn_model.predict(X_test)

# Convert the continuous predictions to class labels (use argmax)
y_test_int = np.argmax(y_test, axis=1)
y_pred_classes = np.argmax(y_pred, axis=1)

# Calculate the F1 score
f1 = f1_score(y_test_int, y_pred_classes, average='micro')

print(f"F1 Score: {f1}")

## Model evaluation

### Baseline comparison

For the baseline I decided to use the Decision Tree machine learning algorithm. It's simpler than deep learning techniques and often performs well on tabular prediction tasks like this one because it recursively splits the data on feature values and follows the resulting tree structure to make a decision.
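
The recursive splitting can be seen on a tiny example. The sketch below fits a tree on a made-up one-feature dataset (the feature name `price` is only a label for the printout) and prints the learned rule with `export_text`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-feature dataset with a clean threshold between classes
X_toy = [[1.0], [2.0], [10.0], [11.0]]
y_toy = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_toy, y_toy)

# Print the learned split rules as text
print(export_text(tree, feature_names=["price"]))

# Classify two unseen values, one on each side of the threshold
predictions = tree.predict([[0.0], [12.0]])
```

On the real dataset the tree grows much deeper, but every prediction still follows a single path of such threshold comparisons.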

In [None]:
import pandas as pd

# import packages required for decision tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# load the preprocessed datasets 
train_data = pd.read_csv('data/processed_data_train.csv')
test_data = pd.read_csv('data/processed_data_test.csv')

In [None]:
# separate features and the target
Train_X = train_data.drop('event_type', axis=1)
Train_y = train_data['event_type']

Test_X = test_data.drop('event_type', axis=1)
Test_y = test_data['event_type']

# split the data into training and testing sets 80/20
Train_X_train, Train_X_test, Train_y_train, Train_y_test = train_test_split(Train_X,
                                                                            Train_y,
                                                                            test_size=0.2,
                                                                            random_state=12)

Test_X_train, Test_X_test, Test_y_train, Test_y_test = train_test_split(Test_X,
                                                                            Test_y,
                                                                            test_size=0.2,
                                                                            random_state=12)

In [None]:
# create a decision tree instance and train it on the training data
train_des_tree = DecisionTreeClassifier()
train_des_tree.fit(Train_X_train, Train_y_train)

test_des_tree = DecisionTreeClassifier()
test_des_tree.fit(Test_X_train, Test_y_train)

In [None]:
# make predictions on the test data
train_y_pred = train_des_tree.predict(Train_X_test)

# calculate F1 score 
train_f1 = f1_score(Train_y_test, train_y_pred, average='weighted')
print(f"Train dataset F1 Score: {train_f1:.2f}")

In [None]:
# create a confusion matrix
cm = confusion_matrix(Train_y_test, train_y_pred)

# plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=train_des_tree.classes_,
            yticklabels=train_des_tree.classes_)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# make predictions on the test data
test_y_pred = test_des_tree.predict(Test_X_test)

# calculate F1 score 
test_f1 = f1_score(Test_y_test, test_y_pred, average='weighted')
print(f"Test dataset F1 Score: {test_f1:.2f}")

In [None]:
# create a confusion matrix
cm = confusion_matrix(Test_y_test, test_y_pred)

# plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=test_des_tree.classes_,
            yticklabels=test_des_tree.classes_)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

#### Results

The Decision Tree shows an impressive 90% F1 score, which meets the target of an F1 score > 80%. It also takes much less time to set up the algorithm and start using it for prediction than deep learning training does. However, the score is slightly worse than the RNN results, which might be significant for use cases that rely on more accurate prediction.

### Model Robustness

#### Retraining with K-fold
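
A minimal sketch of how K-fold retraining could look, assuming scikit-learn's `StratifiedKFold` and `cross_val_score` with the Decision Tree baseline. The data here is synthetic and only stands in for the processed features and the encoded `event_type`, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed features and the encoded target
rng = np.random.default_rng(12)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Stratified folds keep the class ratio similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=12),
    X_toy, y_toy, cv=folds, scoring="f1_weighted",
)

# Mean and spread of the per-fold weighted F1 scores
print(scores.mean(), scores.std())
```

A small spread across folds would suggest that the score does not depend much on which rows end up in the training split.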

#### Testing the model on a different dataset

In [None]:
# Load the dataset
test_data = pd.read_csv('data/processed_data_test.csv')

## Conclusion