# Store items recommendation system

This notebook builds a recommendation system for store items from retail purchase data. It covers data preparation with Pandas, exploratory data analysis (EDA) with Matplotlib, and a content-based recommendation engine built on multilingual BERT embeddings of item descriptions, followed by an evaluation against the products customers bought on later visits.


Recommendation engine types:

- Collaborative filtering (behavioral clustering) - relies on user behavior and preferences to recommend items. It assumes that users who agreed in the past will agree in the future and will like items similar to those they liked before.
- Content-based filtering - recommends items based on the features of the items and a profile of the user’s preferences. It uses item features (e.g., tags, categories, and pricing) and compares them to the user's past preferences.
- Hybrid recommendation engines - combine collaborative filtering and content-based filtering to leverage the strengths of both methods, aiming for more accurate and comprehensive recommendations.


We chose content-based filtering for the following reasons:

- No dependency on user behavior - Content-based filtering does not require data on other users’ preferences or behavior, making it useful in scenarios where user interaction data is sparse or unavailable.
- Personalization - It provides highly personalized recommendations based on the specific interests and preferences of individual users, utilizing detailed item features and characteristics.
- Handling of new items - It effectively addresses the cold start problem by recommending new items based on their intrinsic features, even if they have no or limited interaction history.
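
As a minimal illustration of the content-based idea, the sketch below ranks items by the cosine similarity of their feature sets. The catalogue, the tags, and the function names are hypothetical and are not part of the dataset used in this notebook, which works with BERT embeddings of item descriptions rather than hand-written tags.

```python
import math

# Hypothetical item catalogue: each item is described by a set of feature tags.
ITEM_FEATURES = {
    "cleaning wipes": {"cleaning", "household", "wipes"},
    "dust cloths": {"cleaning", "household", "cloth"},
    "hand cream": {"cosmetics", "skin"},
    "face wipes": {"cosmetics", "wipes", "skin"},
}


def cosine_similarity_sets(a, b):
    """Cosine similarity between two binary feature sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend_similar(item, catalogue, top_n=2):
    """Rank the other catalogue items by feature similarity to `item`."""
    target = catalogue[item]
    scored = [
        (other, cosine_similarity_sets(target, features))
        for other, features in catalogue.items()
        if other != item
    ]
    # Highest similarity first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


print(recommend_similar("cleaning wipes", ITEM_FEATURES))
```

No data about other users is involved: a new item can be recommended as soon as its features are known.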


## Data preparation

In [1]:
import pandas as pd
import numpy as np

In [2]:
import matplotlib.pyplot as plt
import pickle

In [None]:

file_path_1="......."
file_path_2="......."

In [None]:
# Load the datasets with pandas
df = pd.read_csv(file_path_1)
wgi_descr = pd.read_csv(file_path_2)
wgi_descr.rename(columns={'ARTICLE_ID': 'article_id'}, inplace=True)


In [3]:
from google.colab import drive 
drive.mount('/content/drive')




Mounted at /content/drive


In [4]:
data_cleaned = pd.read_csv('data_sample.csv') 



In [None]:
# Display the first few rows
print(df.head())
print(wgi_descr.head())

# df has columns "bon_id", "customer", "article_id", "price", "qty", "unit", "article_mother", "wgi", "date_time"; wgi_descr has "article_id", "WGI", "WGI_DESCRIPTION" (descriptions of the items in Bulgarian)

In [None]:
# Merge both data frames
merged_data = pd.merge(df, wgi_descr, on = "article_id")


In [None]:
merged_data.head()

In [None]:
# check data types and the count of the non-null values
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3740259 entries, 0 to 3740258
Data columns (total 11 columns):
 #   Column           Dtype  
---  ------           -----  
 0   bon_id           object 
 1   customer         object 
 2   article_id       int64  
 3   price            float64
 4   qty              float64
 5   unit             object 
 6   article_mother   float64
 7   wgi              int64  
 8   date_time        object 
 9   WGI              int64  
 10  WGI_DESCRIPTION  object 
dtypes: float64(3), int64(3), object(5)
memory usage: 313.9+ MB


In [None]:
# Check for duplicates and NA values
print(merged_data.columns) # 2 wgi columns
print(f'Number of duplicated records: {merged_data.duplicated().sum()}') # 1104597 duplicates
print(merged_data.isna().sum()) # article_mother has 2262151 NA values


Index(['bon_id', 'customer', 'article_id', 'price', 'qty', 'unit',
       'article_mother', 'wgi', 'date_time', 'WGI', 'WGI_DESCRIPTION'],
      dtype='object')
Number of duplicated records: 1104597
bon_id                   0
customer                 0
article_id               0
price                    0
qty                      0
unit                     0
article_mother     2262151
wgi                      0
date_time                0
WGI                      0
WGI_DESCRIPTION          0
dtype: int64


In [None]:
# Remove duplicated values
merged_data = merged_data[~merged_data.duplicated()]

print(f'Number of duplicated records: {merged_data.duplicated().sum()}') # 0 duplicated records


Number of duplicated records: 0


In [None]:
merged_data.shape

(2635662, 11)

In [None]:
# Remove NA values
#data_cleaned = merged_data.dropna(subset=['article_mother'])
#print(data_cleaned.isna().sum())



In [None]:
# Adjust data types
data_cleaned=merged_data.copy()
print(data_cleaned.dtypes)

bon_id              object
customer            object
article_id           int64
price              float64
qty                float64
unit                object
article_mother     float64
wgi                  int64
date_time           object
WGI                  int64
WGI_DESCRIPTION     object
dtype: object


In [None]:
# Changing data types
data_cleaned['bon_id'] = data_cleaned['bon_id'].astype(str)
data_cleaned['customer'] = data_cleaned['customer'].astype(str)
data_cleaned['article_id'] = data_cleaned['article_id'].astype('int32')
data_cleaned['price'] = data_cleaned['price'].astype('float32')
data_cleaned['qty'] = data_cleaned['qty'].astype('float32')
data_cleaned['unit'] = data_cleaned['unit'].astype(str)
data_cleaned['article_mother'] = data_cleaned['article_mother'].fillna(0).astype('int32')
data_cleaned['wgi'] = data_cleaned['wgi'].astype('int32')
data_cleaned['date_time'] = pd.to_datetime(data_cleaned['date_time'])
data_cleaned['WGI'] = data_cleaned['WGI'].astype('int32')
data_cleaned['WGI_DESCRIPTION'] = data_cleaned['WGI_DESCRIPTION'].astype(str)

In [None]:
print(data_cleaned.dtypes)

bon_id                     object
customer                   object
article_id                  int32
price                     float32
qty                       float32
unit                       object
article_mother              int32
wgi                         int32
date_time          datetime64[ns]
WGI                         int32
WGI_DESCRIPTION            object
dtype: object


In [None]:
# check if the two wgi columns are identical
print(f"Number of unique values in wgi: {data_cleaned['wgi'].nunique()}")
print(f"Number of unique values in WGI: {data_cleaned['WGI'].nunique()}")
# Count the rows where the two columns differ (vectorized comparison)
diff_wgi = (data_cleaned['WGI'] != data_cleaned['wgi']).sum()

print(diff_wgi)
if(diff_wgi == 0):
  data_cleaned.drop('WGI', axis=1, inplace=True)
print(data_cleaned.columns)


Number of unique values in wgi: 1719
Number of unique values in WGI: 1719
0
Index(['bon_id', 'customer', 'article_id', 'price', 'qty', 'unit',
       'article_mother', 'wgi', 'date_time', 'WGI_DESCRIPTION'],
      dtype='object')


In [None]:
# Check for price < 0
negative_count = (data_cleaned['price'] < 0).sum()
print(f"Number of negative values in 'price': {negative_count}")



Number of negative values in 'price': 0


In [None]:
# Check for qty < 0
negative_count = (data_cleaned['qty'] < 0).sum()
print(f"Number of negative values in 'qty': {negative_count}")

Number of negative values in 'qty': 5828


In [None]:
# Clean negative counts by subtracting them from the bon where they were first added
negative_products = data_cleaned[data_cleaned['qty'] < 0]

In [None]:
# remove the negative products from the cleaned data
data_cleaned = data_cleaned.drop(negative_products.index)

In [None]:
# Subtract the quantity of the negative products from the qty of the cleaned data

# Loop through each index in negative_products
subtracted_products = 0
for i in negative_products.index:
    temp_bon_id = negative_products.loc[i, 'bon_id']
    temp_wgi = negative_products.loc[i, 'wgi']
    temp_qty = negative_products.loc[i, 'qty']

    # Filter data before subtraction
    filtered_data = data_cleaned[(data_cleaned['bon_id'] == temp_bon_id) & (data_cleaned['wgi'] == temp_wgi)]

    for j in filtered_data.index:
        print(f"Quantity of product {data_cleaned.loc[j, 'wgi']} in checkbon {data_cleaned.loc[j, 'bon_id']} before subtraction: {data_cleaned.loc[j, 'qty']}")

        if (data_cleaned.loc[j, 'qty'] > -temp_qty and temp_qty != 0):
            data_cleaned.loc[j, 'qty'] += temp_qty
            temp_qty = 0
            negative_products = negative_products.drop(i)
            print(f"Quantity of product {data_cleaned.loc[j, 'wgi']} in checkbon {data_cleaned.loc[j, 'bon_id']} after subtraction: {data_cleaned.loc[j, 'qty']}")
        elif (data_cleaned.loc[j, 'qty'] <= -temp_qty and temp_qty != 0):
            temp_qty += data_cleaned.loc[j, 'qty']
            data_cleaned.loc[j, 'qty'] = 0
            print(f"Product {data_cleaned.loc[j, 'wgi']} in checkbon {data_cleaned.loc[j, 'bon_id']} was fully subtracted")
            data_cleaned = data_cleaned.drop(j)
            subtracted_products += 1
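
The netting logic of the loop above can be sketched in plain Python, which makes it easier to test on a few hand-written rows. This is a simplified illustration under assumptions: the tuple layout, the function name `net_returns`, and the example rows are hypothetical, and lines that reach zero quantity are dropped.

```python
def net_returns(lines):
    """Net return lines (negative qty) against purchases on the same receipt.

    `lines` holds (receipt_id, category, qty) tuples. Returns the remaining
    purchase lines and the returned quantity that found no matching purchase.
    """
    purchases = [list(line) for line in lines if line[2] >= 0]
    returns = [line for line in lines if line[2] < 0]
    unmatched = 0.0

    for receipt_id, category, qty in returns:
        remaining = -qty  # quantity still to be cancelled
        for purchase in purchases:
            if remaining == 0:
                break
            if purchase[0] == receipt_id and purchase[1] == category:
                taken = min(purchase[2], remaining)
                purchase[2] -= taken
                remaining -= taken
        unmatched += remaining

    # Lines whose quantity reached zero are dropped.
    kept = [tuple(p) for p in purchases if p[2] > 0]
    return kept, unmatched


# Hypothetical receipt lines, not taken from the dataset.
example = [
    ("r1", "milk", 3.0),
    ("r1", "milk", -1.0),
    ("r1", "bread", 2.0),
    ("r1", "bread", -2.0),
    ("r2", "milk", -1.0),
]
kept, unmatched = net_returns(example)
print(kept, unmatched)
```

Tracking the unmatched quantity makes it visible when a return has no purchase on the same receipt, as in the last example row.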

In [None]:
negative_products = data_cleaned[data_cleaned['qty'] < 0]
print(negative_products)

Empty DataFrame
Columns: [bon_id, customer, article_id, price, qty, unit, article_mother, wgi, date_time, WGI_DESCRIPTION]
Index: []


In [None]:
negative_products

Unnamed: 0,bon_id,customer,article_id,price,qty,unit,article_mother,wgi,date_time,WGI_DESCRIPTION


In [None]:
# One record was not deleted
aux = negative_products[(negative_products['bon_id'] == 'knwkwRqTvDSO8ZYYtltxmA') & (negative_products['wgi'] == 281015)]
aux2 = data_cleaned[(data_cleaned['bon_id'] == 'knwkwRqTvDSO8ZYYtltxmA') & (data_cleaned['wgi'] == 281015)]
data_cleaned = data_cleaned.drop(data_cleaned[(data_cleaned['bon_id'] == 'knwkwRqTvDSO8ZYYtltxmA') & (data_cleaned['wgi'] == 281015)].index)
subtracted_products += 1

print(f"Total subtracted products: {subtracted_products}")


Total subtracted products: 5590


In [None]:
# Save cleaned data
# Define the file path
file_path = 'cleaned_data.csv'

# Save DataFrame to CSV with additional options
data_cleaned.to_csv(file_path, index=False, header=True, sep=',', mode='w')

print(f"Data has been saved to {file_path}")

### Load the cleaned data

In [None]:
file_path = 'cleaned_data.csv'

In [None]:
# Load the dataset
data_cleaned = pd.read_csv(file_path)

## EDA

### Top 20 Sales Categories

In [None]:
# Top 20 categories by total quantity sold
top_categories = data_cleaned.groupby('WGI_DESCRIPTION')['qty'].sum().nlargest(20)
plt.figure(figsize=(10, 6))
top_categories.plot(kind='bar')
plt.title('Top 20 Categories by Sales Quantity')
plt.xlabel('Category')
plt.ylabel('Total Quantity Sold')
plt.show()


### Customer spending categories

In [None]:
# Categorizing the spending
def categorize_spending(price):
    if price < 50:
        return 'Low (< 50 lv)'
    elif price < 150:
        return 'Medium (< 150 lv)'
    elif price < 500:
        return 'High (< 500 lv)'
    else:
        return 'Very High (> 500 lv)'





In [None]:
customer_spending = data_cleaned.groupby(['customer', 'bon_id'])['price'].sum().reset_index()
# Categorize the per-visit totals (not the individual item prices in data_cleaned)
customer_spending['category'] = customer_spending['price'].apply(categorize_spending)



In [None]:
# Creating a bar plot for the spending categories with counts on top of each bar
plt.figure(figsize=(10, 6))
category_counts = customer_spending['category'].value_counts()
bars = category_counts.plot(kind='bar', color=['blue', 'orange', 'green', 'red'])

plt.title('Customer Spending Categories (on a single shop visit)')
plt.xlabel('Spending Category')
plt.ylabel('Number of Visits')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Adding the count numbers on top of each bar
for bar in bars.patches:
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.1,
             bar.get_height() + 0.1,
             int(bar.get_height()),
             fontsize=12)

# Display the plot
plt.show()


In [23]:
# Categorizing the spending
def categorize_spending_total(price):
    if price < 100:
        return 'Low (< 100 lv)'
    elif price < 500:
        return 'Medium (< 500 lv)'
    elif price < 1000:
        return 'High (< 1000 lv)'
    else:
        return 'Very High (> 1000 lv)'



In [None]:
# Calculate total spending per customer
customer_spending_total = data_cleaned.groupby('customer')['price'].sum().reset_index()
customer_spending_total['category'] = customer_spending_total['price'].apply(categorize_spending_total)

# Create a bar plot for the spending categories with counts on top of each bar
plt.figure(figsize=(10, 6))
category_counts = customer_spending_total['category'].value_counts().reindex([
    'Low (< 100 lv)', 'Medium (< 500 lv)', 'High (< 1000 lv)', 'Very High (> 1000 lv)'
])
bars = category_counts.plot(kind='bar', color=['blue', 'orange', 'green', 'red'])

plt.title('Customer Spending Categories (total for each customer)')
plt.xlabel('Spending Category')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Adding the count numbers on top of each bar
for bar in bars.patches:
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.1,
             bar.get_height() + 0.1,
             int(bar.get_height()),
             fontsize=12)

# Display the plot
plt.show()

### Number of visits

In [5]:
# Group by customer and bon_id to count number of visits per customer
visit_counts = data_cleaned.groupby(['customer', 'bon_id']).size().reset_index(name='visit_count')

# Count the number of unique bon_ids per customer to get the number of visits
visits_per_customer = visit_counts.groupby('customer').size().reset_index(name='num_visits')

# Count the number of customers for each number of visits
visits_distribution = visits_per_customer['num_visits'].value_counts().sort_index()



In [21]:
# Bin the number of visits
# With right=False each bin is left-closed: [0, 5), [5, 10), [10, 50), [50, 100), [100, inf)
bins = [0, 5, 10, 50, 100, float('inf')]
labels = ['<5 visits', '5 <= visits < 10', '10 <= visits < 50', '50 <= visits < 100', '>=100 visits']
visits_per_customer['visit_bin'] = pd.cut(visits_per_customer['num_visits'], bins=bins, labels=labels, right=False)

# Count the number of customers in each bin
visits_distribution = visits_per_customer['visit_bin'].value_counts().sort_index()
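
The bin labels describe left-closed intervals with boundaries at 5, 10, 50, and 100 visits. One way to sanity-check the boundary handling, independent of `pd.cut`, is a small plain-Python binning helper. The names `EDGES`, `LABELS`, and `visit_bin` and the shortened labels are illustrative assumptions, not part of the notebook.

```python
from bisect import bisect_right

# Left-closed bin edges: [0, 5), [5, 10), [10, 50), [50, 100), [100, inf)
EDGES = [0, 5, 10, 50, 100]
LABELS = ["<5 visits", "5-9 visits", "10-49 visits", "50-99 visits", ">=100 visits"]


def visit_bin(num_visits):
    """Return the label of the half-open bin that contains num_visits."""
    if num_visits < 0:
        raise ValueError("num_visits must be non-negative")
    # bisect_right counts the edges that are <= num_visits
    return LABELS[bisect_right(EDGES, num_visits) - 1]


# Print the bin for values on both sides of each boundary.
for n in (4, 5, 9, 10, 99, 100):
    print(n, visit_bin(n))
```

Checking values on both sides of each edge is a quick way to catch off-by-one errors between the bin edges and the labels.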



In [None]:
# Plot the results
plt.figure(figsize=(10, 6))
visits_distribution.plot(kind='bar', color='skyblue')
plt.title('Number of Store Visits per Customer')
plt.xlabel('Number of Visits')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the plot
plt.show()

### Monthly sales trends

In [None]:
data_cleaned['date_time'] = pd.to_datetime(data_cleaned['date_time'])

# Resample sales data by month. set_index is applied without inplace=True so that
# 'date_time' stays available as a column for later cells.
monthly_sales = data_cleaned.set_index('date_time')['qty'].resample('ME').sum()
plt.figure(figsize=(14, 7))
monthly_sales.plot()
plt.title('Monthly Sales Trends')
plt.xlabel('Month')
plt.ylabel('Total Quantity Sold')
plt.show()


### Correlation between price and quantity

In [None]:
# Analyze the relationship between price and quantity sold
price_qty_corr = data_cleaned[['price', 'qty']].corr()
print(price_qty_corr)


          price       qty
price  1.000000 -0.104357
qty   -0.104357  1.000000


### Basket analysis
Perform market basket analysis to identify items that are frequently bought together.
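
The core of this analysis is counting how often two categories appear on the same receipt. A minimal plain-Python sketch of that count is shown here; the baskets and the function name `pair_counts` are hypothetical, and the notebook cells compute the same quantity with a matrix product on the full data.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count in how many baskets each unordered pair of items occurs together."""
    counts = Counter()
    for items in baskets:
        # De-duplicate within the basket and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


# Hypothetical baskets, not taken from the dataset.
baskets = [
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["milk", "butter"],
    ["bread", "milk", "milk"],
]
print(pair_counts(baskets).most_common(2))
```

Sorting each basket before pairing means `("bread", "milk")` and `("milk", "bread")` are counted as the same pair.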

In [None]:
# Sample a subset of the data (e.g., 5%)
sample_data = data_cleaned.sample(frac=0.05, random_state=1)


In [None]:
sample_data.shape

(131212, 10)

In [None]:
# Create the binary basket representation
basket = sample_data.groupby(['bon_id', 'WGI_DESCRIPTION'])['qty'].sum().unstack().reset_index().fillna(0).set_index('bon_id')
binary_basket = basket.map(lambda x: 1 if x > 0 else 0)


In [None]:
# Check if any value in the entire DataFrame is 1
has_ones = (binary_basket.values == 1).any()

print(f"Any 1s in the binary_basket DataFrame: {has_ones}")


Any 1s in the binary_basket DataFrame: True


In [None]:
# Compute the co-occurrence matrix
co_occurrence_matrix = binary_basket.T.dot(binary_basket)

In [None]:
# Display the co-occurrence matrix
print(co_occurrence_matrix.head(10))

In [None]:
# Extract item pairs with high co-occurrence

# Threshold for high co-occurrence
threshold = 20

# Find item pairs with co-occurrence above the threshold
item_pairs = np.where(co_occurrence_matrix > threshold)

# Extract the item pairs and their co-occurrence counts
item_pairs_with_counts = [(basket.columns[i], basket.columns[j], co_occurrence_matrix.iloc[i, j])
                          for i, j in zip(*item_pairs) if i != j]

# Display the item pairs with high co-occurrence
item_pairs_with_counts[:20]  # Display the top 20 pairs


In [None]:
# Filter unique item pairs
unique_item_pairs = set()
for item1, item2, count in item_pairs_with_counts:
    # Ensure consistent order (item1, item2)
    if item1 < item2:
        unique_item_pairs.add((item1, item2, count))
    else:
        unique_item_pairs.add((item2, item1, count))

# Convert the set to a list and sort by the number of occurrences
sorted_unique_item_pairs = sorted(list(unique_item_pairs), key=lambda x: x[2], reverse=True)

# Display the sorted unique item pairs
print(sorted_unique_item_pairs[:50])  # Display the top 50 unique pairs of items frequently bought together

In [None]:
#Calculate the correlation matrix

# Compute the correlation matrix for the binary data
correlation_matrix = binary_basket.corr()

# Display the correlation matrix
print(correlation_matrix.head())
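
Because `binary_basket` holds only 0/1 values, the Pearson correlation that `.corr()` computes by default is the phi coefficient of the two columns. The sketch below computes it directly from counts for two short, hypothetical columns; the function name `phi_coefficient` is an assumption for illustration.

```python
import math


def phi_coefficient(x, y):
    """Pearson correlation of two equal-length 0/1 sequences (phi coefficient)."""
    n = len(x)
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both items present
    n1 = sum(x)  # baskets containing the first item
    m1 = sum(y)  # baskets containing the second item
    denominator = math.sqrt(n1 * (n - n1) * m1 * (n - m1))
    if denominator == 0:
        return float("nan")  # a constant column has no defined correlation
    return (n * n11 - n1 * m1) / denominator


# Hypothetical presence indicators for two items across four baskets.
print(phi_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```

Unlike the raw co-occurrence count, this value accounts for how common each item is on its own, so very popular categories do not dominate the ranking.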


In [None]:
# Extract the upper triangle of the correlation matrix
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))


In [None]:
# Find the top 30 highest positive correlations
top_positive_correlations = upper_tri.stack().nlargest(30)

# Display the top correlations
print(top_positive_correlations)


In [None]:
# Plot the top 30 highest positive correlations
plt.figure(figsize=(12, 8))
top_positive_correlations.plot(kind='barh')
plt.title('Top 30 Highest Positive Correlations Between Items')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Item Pairs')
plt.show()


In [None]:
# Check the number of unique WGI_DESCRIPTIONS
unique_wgi_descriptions = data_cleaned['WGI_DESCRIPTION'].nunique()

print(f'The number of unique WGI_DESCRIPTION values is: {unique_wgi_descriptions}')


The number of unique WGI_DESCRIPTION values is: 1608


## Modeling and recommendation engine using BERT

We considered building the recommendation engine with spaCy but chose BERT, a transformer-based natural language processing model. BERT’s main innovation is that it reads the context of a word from both the words before it and the words after it. This bidirectional approach captures the meaning of words more accurately than earlier models, which typically processed text in one direction (left-to-right or right-to-left). Our goal is to use BERT embeddings of the textual item descriptions to improve the relevance of the recommendations.


In [None]:
from google.colab import drive 
drive.mount('/content/drive')


Mounted at /content/drive


### Embeddings using only the WGI_DESCRIPTION column from 'wgi_descriptions.csv'

In [None]:
data = pd.read_csv('wgi_descriptions.csv')




In [None]:
# Load the dataset
sample_data = data.copy()



In [None]:
unique_wgi_descriptions = sample_data['WGI_DESCRIPTION'].unique()



In [None]:
len(unique_wgi_descriptions) #1609



1609

In [None]:
import torch
from transformers import BertTokenizer, BertModel #Loading the pre-trained BERT multilingual model
from sklearn.metrics.pairwise import cosine_similarity



In [None]:
# Load the multilingual BERT model and tokenizer
model_name = "bert-base-multilingual-cased" #for the descriptions which are in Bulgarian
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)


In [None]:
# Function to get BERT embeddings (vectorization of the words)
def get_embeddings(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze()
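
`get_embeddings` averages the last hidden state over all token positions. When it is called on one description at a time there are no padding tokens, so a plain mean is sufficient. If descriptions were embedded in batches, padded positions would also enter the average unless they are masked out. The plain-Python sketch below shows the masked mean on toy vectors; the function name `masked_mean_pool` and the numbers are hypothetical and this is not code from the notebook.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: two real tokens followed by one padding position.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]
print(masked_mean_pool(token_vectors, attention_mask))
```

With the mask applied only the two real tokens are averaged, whereas an unmasked mean would divide by three.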




In [None]:
# Get embeddings for each item in the DataFrame
embeddings = sample_data['WGI_DESCRIPTION'].apply(lambda x: get_embeddings(x, tokenizer, model)).tolist()
embeddings_matrix = torch.stack(embeddings).numpy()



#### Load already generated embeddings

In [None]:
# Load the embeddings and the embeddings matrix from Google Drive. These are the embeddings generated with the code above and saved to Drive.
embeddings_path = 'embeddings.pkl'
embeddings_matrix_path = 'embeddings_matrix.pkl'

with open(embeddings_path, 'rb') as f:
    embeddings = pickle.load(f)

with open(embeddings_matrix_path, 'rb') as f:
    embeddings_matrix = pickle.load(f)



In [None]:
# Function to recommend items based on a description
def recommend_items(input_description, embeddings_matrix, sample_data, tokenizer, model, top_n=10):
    input_embedding = get_embeddings(input_description, tokenizer, model).unsqueeze(0).numpy()
    similarities = cosine_similarity(input_embedding, embeddings_matrix).flatten()

    # Create a DataFrame to store similarities and descriptions
    similarity_df = pd.DataFrame({
        'WGI_DESCRIPTION': sample_data['WGI_DESCRIPTION'],
        'similarity': similarities
    })

    # Sort by similarity in descending order and drop duplicates
    unique_recommendations = similarity_df.sort_values(by='similarity', ascending=False).drop_duplicates('WGI_DESCRIPTION')

    # Select top N recommendations
    top_recommendations = unique_recommendations.head(top_n)

    return top_recommendations



### Get recommendations

In [None]:

# Example usage
input_description = sample_data['WGI_DESCRIPTION'].iloc[39] #use either this
#input_description='Вода' # or this, not both
print(f'Input item: {input_description}')
recommended_items = recommend_items(input_description, embeddings_matrix, sample_data, tokenizer, model)
print(recommended_items)



Input item: Кърпи за почистване
                       WGI_DESCRIPTION  similarity
4311               Кърпи за почистване    1.000000
5824                     Кърпи за прах    0.866059
4259   Кърпички за почистване на очила    0.865328
22778   Кърпички за почистване на лице    0.833138
21226               Картофи за пържене    0.824234
16195                    Конци за зъби    0.795469
19675              Козметични кърпички    0.785511
4376                      Крем за ръце    0.777325
8515                Пликчета за фризер    0.776784
12084             Почистване на мебели    0.765757


## Evaluation

We evaluate the recommendation engine by comparing the recommended products with the products customers actually bought on their next visit. First, we identify customers who have made more than one purchase (bon_id) and split their purchase history into two dataframes: input_df (first visit) and test_df (second visit). For a random sample of 100 customers, we generate the top 5 recommendations for each product bought on the first visit and compare them with the products bought on the second visit. A customer's success rate is the number of matching recommendations divided by the number of products bought on the second visit, and the overall success rate is the mean of the per-customer rates. Both are printed below.
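
The metric computed in the cells below can be sketched in plain Python as follows. The sketch mirrors the notebook's loop: matches are summed over all first-visit products and divided by the number of second-visit products. The function names and the example data are hypothetical.

```python
def customer_success_rate(recommendations_per_item, next_visit_items):
    """Matches between recommendations and next-visit items, per next-visit item.

    `recommendations_per_item` holds one list of recommended names for each
    product bought on the first visit.
    """
    next_items = list(next_visit_items)
    if not next_items:
        return 0.0
    bought = set(next_items)
    hits = sum(len(set(recs) & bought) for recs in recommendations_per_item)
    return hits / len(next_items)


def overall_success_rate(rates):
    """Mean of the per-customer success rates."""
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical customer: two first-visit products, four second-visit products.
rate = customer_success_rate(
    [["wipes", "cloths"], ["cream", "soap"]],
    ["cloths", "soap", "bread", "milk"],
)
print(rate, overall_success_rate([rate, 0.0]))
```

Because matches are summed across first-visit products, the same second-visit product can be counted more than once, so the rate is not strictly bounded by 1.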


In [None]:
# Group by customer and bon_id
grouped = data_cleaned.groupby(['customer', 'bon_id']).agg({'date_time': 'min', 'price': 'sum'}).reset_index()

# Find customers with more than one bon_id
customer_counts = grouped['customer'].value_counts()
customers_with_multiple_bon_ids = customer_counts[customer_counts > 1].index.tolist()



In [None]:
# Filter data to include only those customers
filtered_grouped = grouped[grouped['customer'].isin(customers_with_multiple_bon_ids)]

# Sort by customer and date_time in ascending order
sorted_grouped = filtered_grouped.sort_values(by=['customer', 'date_time'])



In [None]:
# Create auxiliary table
auxiliary_table = sorted_grouped[['customer', 'bon_id', 'date_time']]

# Split into input_df (first visit) and test_df (second visit)
first_visits = auxiliary_table.groupby('customer').nth(0).reset_index()
second_visits = auxiliary_table.groupby('customer').nth(1).reset_index()



In [None]:
# Ensure both dataframes have the same length
input_df = first_visits[first_visits['customer'].isin(second_visits['customer'])].reset_index(drop=True)
test_df = second_visits[second_visits['customer'].isin(first_visits['customer'])].reset_index(drop=True)




In [None]:
# Take 100 random customers from each dataframe
random_customers = np.random.choice(input_df['customer'], size=100, replace=False)

input_df_sample = input_df[input_df['customer'].isin(random_customers)].reset_index(drop=True)
test_df_sample = test_df[test_df['customer'].isin(random_customers)].reset_index(drop=True)



In [None]:
success_counts = []
test_products_count=[]



In [None]:
for _, input_row in input_df_sample.iterrows():
    customer = input_row['customer']
    input_bon_id = input_row['bon_id']

    # Extract purchased products for the given customer and bon_id from data_cleaned
    input_products = data_cleaned[(data_cleaned['customer'] == customer) & (data_cleaned['bon_id'] == input_bon_id)]['WGI_DESCRIPTION']

    # Get test products for the customer from test_df_sample (looked up once per customer)
    test_bon_id = test_df_sample[test_df_sample['customer'] == customer]['bon_id'].values[0]
    test_products = data_cleaned[(data_cleaned['customer'] == customer) & (data_cleaned['bon_id'] == test_bon_id)]['WGI_DESCRIPTION']

    customer_success_count = 0
    customer_total_recommendations = 0

    for product in input_products:
        top_recommendations = recommend_items(product, embeddings_matrix, sample_data, tokenizer, model, top_n=5)
        recommended_products = top_recommendations['WGI_DESCRIPTION'].tolist()

        # Check for successful recommendations
        successful_recommendations = set(recommended_products).intersection(set(test_products))
        customer_success_count += len(successful_recommendations)
        customer_total_recommendations += len(recommended_products)

    success_rate = customer_success_count / len(test_products) if len(test_products) > 0 else 0
    success_counts.append(success_rate)
    test_products_count.append(len(test_products))



### Overall success rate

In [None]:
# Calculate overall success rate
overall_success_rate = sum(success_counts) / len(success_counts) if success_counts else 0

print("Overall Success Rate:", overall_success_rate) #Overall Success Rate: 0.3113482565852801

Overall Success Rate: 0.3113482565852801




### Success rates for all individual test customers

In [None]:
for i in range (len(success_counts)):
  print(f'Success rate customer # {i}: {success_counts[i]} / Number of products: {test_products_count[i]}')



Success rate customer # 0: 0.30303030303030304 / Number of products: 33
Success rate customer # 1: 0.1 / Number of products: 10
Success rate customer # 2: 0.25 / Number of products: 4
Success rate customer # 3: 0.18181818181818182 / Number of products: 11
Success rate customer # 4: 0.8333333333333334 / Number of products: 6
Success rate customer # 5: 0.12121212121212122 / Number of products: 33
Success rate customer # 6: 0.625 / Number of products: 8
Success rate customer # 7: 0.4 / Number of products: 5
Success rate customer # 8: 0.3333333333333333 / Number of products: 6
Success rate customer # 9: 0.0 / Number of products: 9
Success rate customer # 10: 0.3333333333333333 / Number of products: 9
Success rate customer # 11: 1.0 / Number of products: 2
Success rate customer # 12: 0.0 / Number of products: 2
Success rate customer # 13: 0.0 / Number of products: 6
Success rate customer # 14: 0.08333333333333333 / Number of products: 12
Success rate customer # 15: 0.4 / Number of products:

