Day 12 of Python Summer Party

by Interview Master

Walmart

E-commerce Returns Customer Segmentation Model

You are a Data Analyst on the Walmart.com Insights team investigating customer return patterns. The team aims to develop a predictive approach to understanding customer return behaviors across different time periods. Your goal is to leverage transaction data to create a comprehensive view of customer return likelihood.

In [1]:
import pandas as pd
import numpy as np


In [2]:
# Load the CSV file into a DataFrame and display it
customer_returns = pd.read_csv('customer_returns.csv')
customer_returns_df = customer_returns.copy()
print(customer_returns_df)
print()
print(customer_returns_df.info())


   order_id  order_date customer_id  return_flag  order_amount
0   ORD0001  2024-07-05     CUST001         True         120.5
1   ORD0002  2024-07-10     CUST002        False          75.0
2   ORD0003  2024-08-15     CUST001         True          90.0
3   ORD0004  2024/09/01     CUST003        False          45.0
4   ORD0005  2024-10-20     CUST004         True         200.0
5   ORD0006  2024-11-11     CUST002         True           NaN
6   ORD0007  2024-11-15     CUST005        False          60.0
7   ORD0008  2024-12-05     CUST006         True         150.0
8   ORD0009  2024-12-25     CUST007        False          85.0
9   ORD0010  2025-01-10     CUST001         True         130.0
10  ORD0011  2025-01-15     CUST008        False          50.0
11  ORD0012  2025-02-10     CUST009         True         110.0
12  ORD0013  2025-02-14     CUST010        False         100.0
13  ORD0014  2025-03-03     CUST005         True          77.5
14  ORD0015         NaN     CUST002        False       

Question 1 of 2

Identify and list all unique customer IDs who have made returns between July 1st 2024 and June 30th 2025. This will help us understand the base set of customers involved in returns during the specified period.

In [3]:
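# We first need to convert the 'order_date' column to datetime format.
# Note: errors='coerce' turns any value that does not match '%Y-%m-%d' (e.g. '2024/09/01') into NaT.
customer_returns_df['order_date'] = pd.to_datetime(customer_returns_df['order_date'], format='%Y-%m-%d', errors='coerce')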
print("'order_date' after converting to datetime format:")
print(customer_returns_df.head())
print(customer_returns_df.info())
print()


'order_date' after converting to datetime format:
  order_id order_date customer_id  return_flag  order_amount
0  ORD0001 2024-07-05     CUST001         True         120.5
1  ORD0002 2024-07-10     CUST002        False          75.0
2  ORD0003 2024-08-15     CUST001         True          90.0
3  ORD0004        NaT     CUST003        False          45.0
4  ORD0005 2024-10-20     CUST004         True         200.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   order_id      60 non-null     object        
 1   order_date    58 non-null     datetime64[ns]
 2   customer_id   60 non-null     object        
 3   return_flag   60 non-null     bool          
 4   order_amount  56 non-null     float64       
dtypes: bool(1), datetime64[ns](1), float64(1), object(2)
memory usage: 2.1+ KB
None
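
The conversion above leaves ORD0004 with NaT because its date uses slashes ('2024/09/01'). One possible way to keep such rows, sketched here on a small hypothetical series rather than the real file, is to normalise the separator before parsing. The names raw_dates, strict and lenient are illustrative only, and this assumes slashes are the only deviation from YYYY-MM-DD.

```python
import pandas as pd

# Hypothetical sample mirroring the three cases seen in the raw data:
# a well-formed date, a slash-separated date, and a missing value.
raw_dates = pd.Series(['2024-07-05', '2024/09/01', None])

# Strict parsing: anything that is not YYYY-MM-DD is coerced to NaT.
strict = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Replace '/' with '-' first, then parse with the same strict format.
normalised = raw_dates.str.replace('/', '-', regex=False)
lenient = pd.to_datetime(normalised, format='%Y-%m-%d', errors='coerce')

print("NaT count with strict parsing:    ", strict.isna().sum())
print("NaT count after normalising dates:", lenient.isna().sum())
```

Only the genuinely missing date stays NaT with this approach. Whether recovering the slash-formatted row is desirable depends on how much the team trusts that field.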



In [4]:
# Now we filter for transactions between July 1, 2024 and June 30, 2025 (rows with NaT dates are excluded).
jul_jun_ustomer_returns_df = customer_returns_df[(customer_returns_df['order_date'] >= '2024-07-01') & (customer_returns_df['order_date'] <= '2025-06-30')]
print(jul_jun_ustomer_returns_df)
print(jul_jun_ustomer_returns_df.info())


   order_id order_date customer_id  return_flag  order_amount
0   ORD0001 2024-07-05     CUST001         True         120.5
1   ORD0002 2024-07-10     CUST002        False          75.0
2   ORD0003 2024-08-15     CUST001         True          90.0
4   ORD0005 2024-10-20     CUST004         True         200.0
5   ORD0006 2024-11-11     CUST002         True           NaN
6   ORD0007 2024-11-15     CUST005        False          60.0
7   ORD0008 2024-12-05     CUST006         True         150.0
8   ORD0009 2024-12-25     CUST007        False          85.0
9   ORD0010 2025-01-10     CUST001         True         130.0
10  ORD0011 2025-01-15     CUST008        False          50.0
11  ORD0012 2025-02-10     CUST009         True         110.0
12  ORD0013 2025-02-14     CUST010        False         100.0
13  ORD0014 2025-03-03     CUST005         True          77.5
15  ORD0016 2025-03-20     CUST003         True         180.0
16  ORD0017 2025-04-01     CUST004        False          95.0
17  ORD0

In [5]:
# Now we keep only the transactions where the return flag is True (a filter, not a group-by)
grouped_customer_returns_df = jul_jun_ustomer_returns_df[jul_jun_ustomer_returns_df['return_flag']]
print(grouped_customer_returns_df)
print()
print(grouped_customer_returns_df.info())
print("\nThe unique customer_ids who have made returns between July 1st 2024 and June 30th 2025 are:")
print(grouped_customer_returns_df['customer_id'].unique())



   order_id order_date customer_id  return_flag  order_amount
0   ORD0001 2024-07-05     CUST001         True         120.5
2   ORD0003 2024-08-15     CUST001         True          90.0
4   ORD0005 2024-10-20     CUST004         True         200.0
5   ORD0006 2024-11-11     CUST002         True           NaN
7   ORD0008 2024-12-05     CUST006         True         150.0
9   ORD0010 2025-01-10     CUST001         True         130.0
11  ORD0012 2025-02-10     CUST009         True         110.0
13  ORD0014 2025-03-03     CUST005         True          77.5
15  ORD0016 2025-03-20     CUST003         True         180.0
17  ORD0018 2025-04-15     CUST006         True         210.0
18  ORD0019 2025-05-05     CUST007         True          55.0
20  ORD0021 2025-06-01     CUST009         True          85.0
22  ORD0023 2024-07-20     CUST001         True         130.5
24  ORD0025 2024-09-10     CUST003         True         300.0
26  ORD0027 2024-11-20     CUST005         True          40.0
28  ORD0
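
The date filter and the return filter above can also be combined into one boolean mask. The following is a minimal sketch on a small hypothetical DataFrame (orders, in_period and returners are illustrative names, not part of the notebook), using Series.between, which is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical toy data: two orders fall just outside the reporting window.
orders = pd.DataFrame({
    'customer_id': ['CUST001', 'CUST002', 'CUST001', 'CUST003', 'CUST004'],
    'order_date': pd.to_datetime(
        ['2024-07-05', '2024-07-10', '2025-01-10', '2024-06-30', '2025-07-01']
    ),
    'return_flag': [True, False, True, True, True],
})

# Inclusive date window for the period of interest.
start, end = pd.Timestamp('2024-07-01'), pd.Timestamp('2025-06-30')
in_period = orders['order_date'].between(start, end)

# Combine both conditions, then collect the distinct customer IDs.
returners = sorted(orders.loc[in_period & orders['return_flag'], 'customer_id'].unique())
print(returners)
```

Rows with NaT dates evaluate to False in the mask, so they drop out the same way as in the two-step version.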

Question 2 of 2

Convert the 'order_date' column to a datetime format and create a MultiIndex with 'customer_id' and 'order_date'. Then, calculate the total number of returns per customer for each month. This will provide insights into monthly return patterns for each customer.

In [6]:
# Since we already converted the 'order_date' column to datetime format, we can now create a MultiIndex with 'customer_id' and 'order_date'.
customer_returns_df = customer_returns_df.set_index(['customer_id', 'order_date']).sort_index()
print(customer_returns_df)


                       order_id  return_flag  order_amount
customer_id order_date                                    
CUST001     2024-07-05  ORD0001         True         120.5
            2024-07-20  ORD0023         True         130.5
            2024-08-15  ORD0003         True          90.0
            2025-01-10  ORD0010         True         130.0
            2025-01-20  ORD0053         True         134.0
            2025-03-15  ORD0033         True         130.0
            2025-06-25  ORD0043         True         250.0
CUST002     2024-07-10  ORD0002        False          75.0
            2024-08-05  ORD0024        False          60.0
            2024-11-11  ORD0006         True           NaN
            2025-02-25  ORD0054        False          95.0
            2025-04-05  ORD0034        False          80.0
            2025-06-28  ORD0044        False          47.7
            NaT         ORD0015        False          65.0
CUST003     2024-09-10  ORD0025         True         300

In [7]:
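# Now we group by customer and calendar month; summing the boolean return_flag counts the returns.
# freq='ME' (month end) needs pandas 2.2 or later; older versions use 'M'.
monthly_returns = (
    customer_returns_df
    .groupby(['customer_id', pd.Grouper(level='order_date', freq='ME')])
    .agg(total_returns=('return_flag', 'sum'))
)
print("The total number of returns per customer for each month is:")
print(monthly_returns)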


The total number of returns per customer for each month is:
                        total_returns
customer_id order_date               
CUST001     2024-07-31              2
            2024-08-31              1
            2025-01-31              2
            2025-03-31              1
            2025-06-30              1
CUST002     2024-07-31              0
            2024-08-31              0
            2024-11-30              1
            2025-02-28              0
            2025-04-30              0
            2025-06-30              0
CUST003     2024-09-30              1
            2025-03-31              2
            2025-04-30              1
            2025-06-30              1
CUST004     2024-07-31              0
            2024-10-31              1
            2025-04-30              0
            2025-05-31              0
CUST005     2024-08-31              1
            2024-11-30              1
            2025-03-31              1
            2025-05-31      
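
As an alternative to month-end timestamps, the months can be labelled as periods (for example 2024-07). This is a small sketch on hypothetical data, not the notebook's own approach. It keeps order_date as a regular column instead of a MultiIndex level and drops undated rows explicitly; the names orders, dated and monthly are illustrative.

```python
import pandas as pd

# Hypothetical toy data, including one order with a missing date.
orders = pd.DataFrame({
    'customer_id': ['CUST001', 'CUST001', 'CUST001', 'CUST002', 'CUST002'],
    'order_date': pd.to_datetime(
        ['2024-07-05', '2024-07-20', '2024-08-15', '2024-07-10', None]
    ),
    'return_flag': [True, True, True, False, False],
})

# Drop rows without a usable date so the exclusion is explicit.
dated = orders.dropna(subset=['order_date'])

# Group by customer and calendar month; summing booleans counts the True values.
monthly = (
    dated.groupby(['customer_id', dated['order_date'].dt.to_period('M')])['return_flag']
    .sum()
    .rename('total_returns')
)
print(monthly)
```

Period labels avoid any dependence on the 'M' versus 'ME' frequency alias, at the cost of a PeriodIndex level instead of timestamps.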