# User Behavior Analysis

This notebook analyzes user behavior in the Olist e-commerce dataset. For each customer it computes the number of orders, average order value, preferred product category, order frequency, and average delivery time.

This analysis may later be used to segment customers and to build a function that agents can call from their memory layer when answering user queries.

In [1]:
import pandas as pd

# Load datasets
customers = pd.read_csv('data/olist_customers_dataset.csv')
geolocations = pd.read_csv('data/olist_geolocation_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
payments = pd.read_csv('data/olist_order_payments_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv')
orders = pd.read_csv('data/olist_orders_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
sellers = pd.read_csv('data/olist_sellers_dataset.csv')
product_category_translation = pd.read_csv('data/product_category_name_translation.csv')

## Merge Datasets
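To obtain a comprehensive view of user orders, we merge orders with customers, order items, reviews, and products. The result has one row per order item (repeated when an order has more than one review), not one row per order: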

In [6]:
orders_customers_df = orders.merge(customers, on='customer_id')
orders_items_df = orders_customers_df.merge(order_items, on='order_id')
orders_reviews_df = orders_items_df.merge(reviews, on='order_id', how='left')
orders_products_df = orders_reviews_df.merge(products, on='product_id')
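
Because each merge can multiply rows, it may help to check row counts on a toy example first. The sketch below uses made-up frames (the `toy_*` names, IDs, and values are illustrative assumptions, not taken from the Olist files) to show how one order with two items and two reviews expands to four rows, and how the `validate` argument of `DataFrame.merge` can assert the expected key relationship.

```python
import pandas as pd

# Toy frames with hypothetical IDs; not taken from the Olist files
toy_orders = pd.DataFrame({'order_id': ['o1', 'o2'], 'customer_id': ['c1', 'c2']})
toy_items = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2'],
    'order_item_id': [1, 2, 1],
    'price': [10.0, 5.0, 20.0],
})
toy_reviews = pd.DataFrame({'order_id': ['o1', 'o1'], 'review_score': [5, 4]})

# One order -> many items is expected, so assert a one-to-many relationship;
# merge raises MergeError if order_id is duplicated on the left side
items_df = toy_orders.merge(toy_items, on='order_id', validate='one_to_many')

# o1 has two items and two reviews, so it expands to 2 * 2 = 4 rows;
# o2 has no review and keeps its single row because of how='left'
merged = items_df.merge(toy_reviews, on='order_id', how='left')

rows_per_order = merged.groupby('order_id').size()
distinct_orders = merged['order_id'].nunique()
print(rows_per_order.to_dict())
print(distinct_orders)
```

Counting rows per customer on a frame like `merged` would therefore overstate the number of orders; counting distinct `order_id` values does not.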


## Calculate Delivery Time
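Next, we calculate the delivery time for each order as the number of whole days between purchase and delivery to the customer; orders without a delivery date are excluded: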


In [7]:
orders_products_df['order_purchase_timestamp'] = pd.to_datetime(orders_products_df['order_purchase_timestamp'])
orders_products_df['order_delivered_customer_date'] = pd.to_datetime(orders_products_df['order_delivered_customer_date'])

# Remove rows where delivery date is missing; .copy() gives an independent DataFrame,
# so adding a column below does not raise SettingWithCopyWarning
delivery_time_df = orders_products_df.dropna(subset=['order_delivered_customer_date']).copy()

# Calculate delivery time in whole days
delivery_time_df['delivery_time_days'] = (delivery_time_df['order_delivered_customer_date'] - delivery_time_df['order_purchase_timestamp']).dt.days
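
As a small illustration of this step, the sketch below runs the same pattern on invented timestamps (the `toy` frame and the `delivery_time_days_exact` column are assumptions made for this example only). It shows that `.dt.days` keeps only the whole-day part of the interval, and that calling `.copy()` on the filtered frame avoids the `SettingWithCopyWarning` that pandas can emit for this pattern.

```python
import pandas as pd

# Hypothetical timestamps for illustration only
toy = pd.DataFrame({
    'order_purchase_timestamp': ['2018-01-01 10:00:00', '2018-01-01 10:00:00', '2018-01-05 08:00:00'],
    'order_delivered_customer_date': ['2018-01-04 09:00:00', None, '2018-01-07 20:00:00'],
})
for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col])

# Drop undelivered orders; .copy() makes the result independent of `toy`
delivered = toy.dropna(subset=['order_delivered_customer_date']).copy()

elapsed = delivered['order_delivered_customer_date'] - delivered['order_purchase_timestamp']
# Whole days only: 2 days 23 hours becomes 2
delivered['delivery_time_days'] = elapsed.dt.days
# Fractional days, if sub-day precision matters
delivered['delivery_time_days_exact'] = elapsed.dt.total_seconds() / 86400

print(delivered[['delivery_time_days', 'delivery_time_days_exact']])
```

Whether whole or fractional days is the better metric depends on how the averages will be used; the notebook uses whole days.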


## User Behavior Metrics

We calculate various metrics to analyze user behavior:



In [8]:
# Calculate user behavior metrics

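# Number of distinct orders per customer (the merged frame has one row per order item, so count unique order IDs)
user_order_counts = orders_products_df.groupby('customer_unique_id')['order_id'].nunique().reset_index(name='num_orders')

# Average order value per customer: sum item prices within each order, then average over the customer's orders
# (drop_duplicates removes rows repeated by the reviews merge)
order_values = orders_products_df.drop_duplicates(subset=['order_id', 'order_item_id']).groupby(['customer_unique_id', 'order_id'])['price'].sum()
user_avg_order_value = order_values.groupby(level='customer_unique_id').mean().reset_index(name='avg_order_value')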

# Preferred product categories per customer (most frequent category)
user_preferred_category = orders_products_df.groupby(['customer_unique_id', 'product_category_name']).size().reset_index(name='count')
user_preferred_category = user_preferred_category.loc[user_preferred_category.groupby('customer_unique_id')['count'].idxmax()].drop(columns='count')

# Order frequency per customer (number of days between first and last order divided by number of orders)
# order_purchase_timestamp was already converted to datetime in the delivery-time step
user_order_dates = orders_products_df.groupby('customer_unique_id')['order_purchase_timestamp'].agg(['min', 'max'])
# Index the order counts by customer_unique_id so the division aligns with user_order_dates
orders_per_customer = user_order_counts.set_index('customer_unique_id')['num_orders']
user_order_dates['order_frequency_days'] = (user_order_dates['max'] - user_order_dates['min']).dt.days / orders_per_customer

# Average delivery time per customer
user_avg_delivery_time = delivery_time_df.groupby('customer_unique_id')['delivery_time_days'].mean().reset_index(name='avg_delivery_time')

## Merge Metrics into a Single DataFrame
Finally, we merge all the calculated metrics into a single DataFrame. The merges are inner joins, so a customer missing any metric (for example, one with no delivered order) is dropped:

In [9]:

# Merging all metrics into a single dataframe
user_behavior_df = (
    user_order_counts
    .merge(user_avg_order_value, on='customer_unique_id')
    .merge(user_preferred_category, on='customer_unique_id')
    .merge(user_order_dates[['order_frequency_days']], on='customer_unique_id')
    .merge(user_avg_delivery_time, on='customer_unique_id')
)

user_behavior_df.head()


| | customer_unique_id | num_orders | avg_order_value | product_category_name | order_frequency_days | avg_delivery_time |
|---|---|---|---|---|---|---|
| 0 | 0000366f3b9a7992bf8c76cfdf3221e2 | 1 | 129.9 | cama_mesa_banho | 0.0 | 6.0 |
| 1 | 0000b849f77a49e4a4ce2b2a4ca5be3f | 1 | 18.9 | beleza_saude | 0.0 | 3.0 |
| 2 | 0000f46a3911fa3c0805444483337064 | 1 | 69.0 | papelaria | 0.0 | 25.0 |
| 3 | 0000f6ccb0745a6a4b88665a16c9f078 | 1 | 25.99 | telefonia | 0.0 | 20.0 |
| 4 | 0004aac84e0df4da2b147fca70cf8255 | 1 | 180.0 | telefonia | 0.0 | 13.0 |
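
The per-customer metrics can be checked on a small hand-made frame. The sketch below uses invented customers, orders, and prices (none of it comes from the Olist data) to show two points: counting distinct `order_id` values rather than rows, and why the order counts need to share the `customer_unique_id` index before dividing. pandas aligns Series on their index labels, so dividing by a Series with a default integer index finds no common labels and yields only NaN.

```python
import pandas as pd

# Invented data: customer 'a' has two orders (one with two items), 'b' has one
toy = pd.DataFrame({
    'customer_unique_id': ['a', 'a', 'a', 'b'],
    'order_id': ['o1', 'o1', 'o2', 'o3'],
    'price': [10.0, 5.0, 30.0, 20.0],
    'order_purchase_timestamp': pd.to_datetime(
        ['2018-01-01', '2018-01-01', '2018-01-11', '2018-02-01']),
})

grouped = toy.groupby('customer_unique_id')

# Distinct orders, not rows: 'a' has 3 item rows but only 2 orders
num_orders = grouped['order_id'].nunique()

# Days between each customer's first and last purchase
dates = grouped['order_purchase_timestamp'].agg(['min', 'max'])
span_days = (dates['max'] - dates['min']).dt.days

# Both Series are indexed by customer_unique_id, so the division lines up
aligned = span_days / num_orders

# A default 0..n-1 index shares no labels with span_days, so every result is NaN
misaligned = span_days / num_orders.reset_index(drop=True)

print(aligned.to_dict())
print(misaligned.isna().all())
```

A customer with a single order always gets a frequency of 0.0 under this definition, since the first and last purchase coincide.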


## Future Use
This per-customer analysis could feed a Retrieval-Augmented Generation (RAG) index or a memory layer for agents, so that they can answer queries about user behavior with personalized detail.
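
One possible shape for such a function is sketched below. Everything in it is hypothetical: the name `get_user_profile`, the text format it returns, and the `sample_df` frame are illustrative assumptions rather than part of the notebook. The column names follow `user_behavior_df`, and the sample row reuses the first row of the `head()` output.

```python
import pandas as pd

def get_user_profile(user_behavior_df: pd.DataFrame, customer_unique_id: str) -> str:
    """Return a short text summary of one customer's metrics (hypothetical helper)."""
    rows = user_behavior_df[user_behavior_df['customer_unique_id'] == customer_unique_id]
    if rows.empty:
        return f"No behavior data found for customer {customer_unique_id}."
    row = rows.iloc[0]
    return (
        f"Customer {customer_unique_id}: {int(row['num_orders'])} order(s), "
        f"average order value {row['avg_order_value']:.2f}, "
        f"preferred category {row['product_category_name']}, "
        f"average delivery time {row['avg_delivery_time']:.1f} days."
    )

# Sample row copied from the head() output above
sample_df = pd.DataFrame([{
    'customer_unique_id': '0000366f3b9a7992bf8c76cfdf3221e2',
    'num_orders': 1,
    'avg_order_value': 129.9,
    'product_category_name': 'cama_mesa_banho',
    'avg_delivery_time': 6.0,
}])

profile = get_user_profile(sample_df, '0000366f3b9a7992bf8c76cfdf3221e2')
print(profile)
```

Returning plain text keeps the result easy to place in an agent's context; a dictionary would suit callers that need structured fields instead.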

