# Data Collection and Preprocessing

This notebook involves:
- Feature Engineering
    - Create derived features (e.g., total order value, customer lifetime value)
    - Bin continuous variables (if necessary)

## Derived features:

- Product-centric:
    - TotalAmount: Quantity * UnitPrice
    - PriceCategory: Bin product prices into categories (e.g., low, medium, high)
    - ProductPopularity: Total quantity sold per product, which can be used to rank products

- Time-based:
    - DayOfWeek: Day name of InvoiceDate
    - Month: Month name of InvoiceDate
    - IsWeekend: 1 if the invoice falls on a Saturday or Sunday, else 0
    - MonthlySalesTrend: Sum of TotalAmount per Month

- Customer-centric:
    - CustomerLifetimeValue (CLV): Sum of TotalAmount per CustomerID
    - AvgOrderValue: Average amount spent per order
    - PurchaseFrequency: Number of unique invoices per CustomerID
    - IsReturningCustomer: 1 if the customer ordered more than once, else 0
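
One caveat for the time-based features listed above: a month name alone does not identify a year, so a sum grouped on it pools the same calendar month from different years. The following is a minimal sketch, on hypothetical toy data and with an illustrative `YearMonth` helper column that is not part of this notebook, of aggregating by year and month instead:

```python
import pandas as pd

# Toy data (hypothetical values) spanning two different Decembers.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01", "2010-12-15", "2011-01-10", "2011-12-05"]
    ),
    "TotalAmount": [10.0, 20.0, 5.0, 40.0],
})

# YearMonth is an illustrative helper column (a monthly Period such as 2010-12).
toy["YearMonth"] = toy["InvoiceDate"].dt.to_period("M")

# Sum TotalAmount within each year-month and broadcast the sum back to every row.
toy["MonthlySalesTrend"] = toy.groupby("YearMonth")["TotalAmount"].transform("sum")

print(toy)
```

With this grouping, December 2010 and December 2011 keep separate totals.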

In [24]:
import pandas as pd

df = pd.read_csv("clean_data.csv")

df['TotalAmount'] = (df['Quantity'] * df['UnitPrice']).round(2)
df['PriceCategory'] = pd.qcut(df['UnitPrice'], 3, labels=['Low', 'Medium', 'High'])
df['ProductPopularity'] = df.groupby('Description')['Quantity'].transform('sum')

# Convert InvoiceDate to datetime; unparseable values become NaT
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], errors='coerce')
df['DayOfWeek'] = df['InvoiceDate'].dt.day_name()
df['Month'] = df['InvoiceDate'].dt.month_name()
df['IsWeekend'] = df['DayOfWeek'].isin(['Saturday', 'Sunday']).astype(int)
# Note: grouping by month name pools the same month across years (e.g., December 2010 and December 2011)
df['MonthlySalesTrend'] = df.groupby('Month')['TotalAmount'].transform('sum').round(2)

# Customer lifetime value: total spend per customer (returns are included, so it can be negative)
clv = df.groupby('CustomerID')['TotalAmount'].sum().reset_index().round(2)
clv.columns = ['CustomerID', 'CustomerLifetimeValue']
df = df.merge(clv, on='CustomerID', how='left')

# Note: this is the mean TotalAmount per line-item row, not per invoice
AvgOrderValue = df.groupby('CustomerID')['TotalAmount'].mean().reset_index().round(2)
AvgOrderValue.columns = ['CustomerID', 'AvgOrderValue']
df = df.merge(AvgOrderValue, on='CustomerID', how='left')

# Number of distinct invoices per customer
PurchaseFrequency = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
PurchaseFrequency.columns = ['CustomerID', 'PurchaseFrequency']
df = df.merge(PurchaseFrequency, on='CustomerID', how='left')

# Note: len(x) counts line-item rows, so any customer with more than one row is flagged as returning
IsReturningCustomer = df.groupby('CustomerID')['InvoiceNo'].apply(lambda x: 1 if len(x) > 1 else 0).reset_index()
IsReturningCustomer.columns = ['CustomerID', 'IsReturningCustomer']
df = df.merge(IsReturningCustomer, on='CustomerID', how='left')
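
The customer-level code above averages `TotalAmount` over line-item rows and counts rows for the returning flag. The following is a sketch, on hypothetical toy data, of how the same four features could be computed at the invoice level with a single merge. The `orders` and `customer_features` names are illustrative, and this is one possible approach rather than the method used in this notebook:

```python
import pandas as pd

# Toy line items (hypothetical values): customer 1 has two invoices, customer 2 has one.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "TotalAmount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Step 1: collapse line items into one row per order (invoice).
orders = toy.groupby(["CustomerID", "InvoiceNo"], as_index=False)["TotalAmount"].sum()

# Step 2: aggregate the orders per customer using named aggregation.
customer_features = orders.groupby("CustomerID").agg(
    CustomerLifetimeValue=("TotalAmount", "sum"),
    AvgOrderValue=("TotalAmount", "mean"),
    PurchaseFrequency=("InvoiceNo", "nunique"),
).reset_index()

# A customer counts as returning only with more than one distinct invoice.
customer_features["IsReturningCustomer"] = (
    customer_features["PurchaseFrequency"] > 1
).astype(int)

# One merge attaches all customer-level features to the line items.
toy = toy.merge(customer_features, on="CustomerID", how="left")
print(customer_features)
```

Aggregating to orders first makes the average a true per-order value, and deriving the flag from `PurchaseFrequency` keeps the two columns consistent.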

Double-check the data. Duplicates were removed in the previous notebook, but inconsistent or contradictory values may remain.

In [25]:
df.describe()

| | Quantity | InvoiceDate | UnitPrice | CustomerID | TotalAmount | ProductPopularity | IsWeekend | MonthlySalesTrend | CustomerLifetimeValue | AvgOrderValue | PurchaseFrequency | IsReturningCustomer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 399689.0 | 399689 | 399689.0 | 399689.0 | 399689.0 | 399689.0 | 399689.0 | 399689.0 | 399689.0 | 399689.0 | 399689.0 | 399689.0 |
| mean | 12.229383 | 2011-07-10 12:35:13.693046528 | 2.907457 | 15288.696411 | 20.679771 | 4538.686659 | 0.153947 | 769145.3 | 10977.118367 | 20.679728 | 22.55565 | 0.999805 |
| min | -80995.0 | 2010-12-01 08:26:00 | 0.0 | 12346.0 | -168469.6 | -1475.0 | 0.0 | 421127.8 | -1192.2 | -238.44 | 1.0 | 0.0 |
| 25% | 2.0 | 2011-04-06 15:02:00 | 1.25 | 13959.0 | 4.25 | 940.0 | 0.0 | 580214.7 | 1070.47 | 7.03 | 4.0 | 1.0 |
| 50% | 5.0 | 2011-07-29 15:51:00 | 1.95 | 15152.0 | 11.56 | 2316.0 | 0.0 | 650314.2 | 2593.94 | 14.66 | 8.0 | 1.0 |
| 75% | 12.0 | 2011-10-20 12:03:00 | 3.75 | 16791.0 | 19.5 | 5869.0 | 0.0 | 961042.5 | 6087.26 | 20.63 | 17.0 | 1.0 |
| max | 80995.0 | 2011-12-09 12:50:00 | 649.5 | 18287.0 | 168469.6 | 53119.0 | 1.0 | 1113102.0 | 278778.02 | 9904.88 | 242.0 | 1.0 |
| std | 250.836859 | | 4.451881 | 1710.810771 | 425.515532 | 6292.962652 | 0.360899 | 231425.6 | 29958.485265 | 51.325713 | 44.403024 | 0.013968 |


Key observations:
1) Quantity: Ranges from -80,995 to 80,995, indicating potential outliers or errors (negative quantities might represent returns).
2) UnitPrice: Some products are priced at 0, which might indicate free items or data errors.
3) TotalAmount: Includes negative values, likely due to returns or refunds.
4) ProductPopularity: Negative values mean the summed Quantity for a product is below zero, i.e., recorded returns exceed recorded sales, which suggests data inconsistency.
5) CustomerLifetimeValue and AvgOrderValue: Negative values are unusual and warrant further investigation.
6) IsReturningCustomer: The mean is ~1, but it is computed over rows rather than customers, and the flag counts line items rather than invoices, so it overstates the share of returning customers. A few exceptions (value 0) exist.
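
Because `describe()` runs on line-item rows, the mean of a customer-level flag is weighted by how many rows each customer has. The following is a minimal sketch, on hypothetical toy data, of comparing that row-weighted mean with the share computed over one row per customer:

```python
import pandas as pd

# Toy data (hypothetical values): customer 1 has four rows, customers 2 and 3 have one each.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2, 3],
    "IsReturningCustomer": [1, 1, 1, 1, 0, 0],
})

# Row-weighted mean: the figure that describe() reports.
row_weighted = toy["IsReturningCustomer"].mean()

# Customer-level share: keep one row per customer before averaging.
per_customer = toy.drop_duplicates("CustomerID")["IsReturningCustomer"].mean()

print(f"row-weighted: {row_weighted:.3f}, per customer: {per_customer:.3f}")
```

In this toy case the row-weighted mean is twice the per-customer share, because the one returning customer contributes four of the six rows.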

The next step is to investigate the values that don't make sense.

In [26]:
# Count rows with negative values per column, plus rows with a zero UnitPrice
negative_values = {
    "Quantity": df[df["Quantity"] < 0].shape[0],
    "TotalAmount": df[df["TotalAmount"] < 0].shape[0],
    "CustomerLifetimeValue": df[df["CustomerLifetimeValue"] < 0].shape[0],
    "AvgOrderValue": df[df["AvgOrderValue"] < 0].shape[0],
    "ProductPopularity": df[df["ProductPopularity"] < 0].shape[0],
    "UnitPrice (Zero)": df[df["UnitPrice"] == 0].shape[0],  # Zero prices, not negative ones
}
negative_values

{'Quantity': 8506,
 'TotalAmount': 8506,
 'CustomerLifetimeValue': 147,
 'AvgOrderValue': 147,
 'ProductPopularity': 374,
 'UnitPrice (Zero)': 33}


Planned handling:
1) Quantity and TotalAmount: Treat rows with negative values as returns and filter them out.
2) CustomerLifetimeValue and AvgOrderValue: Investigate negative values; they might result from returns that exceed purchases or from data entry errors.
3) ProductPopularity: Replace negative values with a minimum threshold (e.g., 0) or investigate further.
4) UnitPrice (Zero): Assume the item was a gift (i.e., free); these rows are dropped for now.
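
An alternative to dropping return rows is to flag them, so that gross and net spend can both be derived from the same table. The following is a sketch on hypothetical toy data; the `IsReturn`, `GrossSpend`, and `NetSpend` names are illustrative, and it relies on the same assumption as above that a negative Quantity marks a return:

```python
import pandas as pd

# Toy line items (hypothetical values): each customer has one purchase and one return.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2],
    "Quantity": [2, -1, 3, -5],
    "UnitPrice": [10.0, 10.0, 4.0, 4.0],
})
toy["TotalAmount"] = toy["Quantity"] * toy["UnitPrice"]

# IsReturn is a hypothetical flag: a negative quantity is assumed to be a return.
toy["IsReturn"] = toy["Quantity"] < 0

# Gross spend sums purchases only; net spend keeps the returns in the sum.
gross = toy.loc[~toy["IsReturn"]].groupby("CustomerID")["TotalAmount"].sum()
net = toy.groupby("CustomerID")["TotalAmount"].sum()

summary = pd.DataFrame({"GrossSpend": gross, "NetSpend": net})
print(summary)
```

Customer 2 in this toy data returns more than was purchased, which is one way a negative customer-level total can arise without any data entry error.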

In [27]:
# Filter out rows with negative Quantity and TotalAmount (assumed to be returns).
# .copy() makes cleaned_data an independent DataFrame, so the column assignment
# further down does not trigger a SettingWithCopyWarning.
cleaned_data = df[(df["Quantity"] >= 0) & (df["TotalAmount"] >= 0)].copy()

# Investigate rows with negative CustomerLifetimeValue and AvgOrderValue
negative_customer_lifetime = cleaned_data[cleaned_data["CustomerLifetimeValue"] < 0]
negative_avg_order_value = cleaned_data[cleaned_data["AvgOrderValue"] < 0]

# Replace negative ProductPopularity values with 0
cleaned_data["ProductPopularity"] = cleaned_data["ProductPopularity"].clip(lower=0)

# Handle zero UnitPrice rows (drop for now)
cleaned_data = cleaned_data[cleaned_data["UnitPrice"] > 0]

cleaned_summary = {
    # This count also includes the zero-UnitPrice rows dropped above
    "Rows Removed (Negative Quantity/TotalAmount)": df.shape[0] - cleaned_data.shape[0],
    # The next two entries hold the matching rows themselves, not counts
    "Negative CustomerLifetimeValue (Remaining)": negative_customer_lifetime,
    "Negative AvgOrderValue (Remaining)": negative_avg_order_value,
    "Final Rows": cleaned_data.shape[0],
}

cleaned_summary


{'Rows Removed (Negative Quantity/TotalAmount)': 8539,
 'Negative CustomerLifetimeValue (Remaining)':        InvoiceNo StockCode                          Description  Quantity  \
 2511      536663     22867              HAND WARMER BIRD DESIGN        24   
 2512      536663     22633               HAND WARMER UNION JACK        24   
 2513      536663     22632            HAND WARMER RED RETROSPOT        24   
 2514      536663     22910    PAPER CHAIN KIT VINTAGE CHRISTMAS        40   
 2515      536663     22737       RIBBON REEL CHRISTMAS PRESENT         20   
 2516      536663     22952      60 CAKE CASES VINTAGE CHRISTMAS        24   
 62178     544637     22245         HOOK, 1 HANGER ,MAGIC GARDEN        12   
 62179     544637     22251    BIRDHOUSE DECORATION MAGIC GARDEN        24   
 62180     544637     22250  DECORATION  BUTTERFLY  MAGIC GARDEN        16   
 62181     544637     22248  DECORATION  PINK CHICK MAGIC GARDEN        16   
 62182     544637     22244           3 H

In [28]:
# Find unique customers with negative CustomerLifetimeValue and/or AvgOrderValue
unique_values_clv = set(negative_customer_lifetime['CustomerID'].unique())
unique_values_avg = set(negative_avg_order_value['CustomerID'].unique())
union_values = unique_values_clv | unique_values_avg
print(f"Union of values in CustomerID: {union_values}")

Union of values in CustomerID: {np.int64(16546), np.int64(17548)}


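Two customers (16546 and 17548) still have purchase rows with a negative CustomerLifetimeValue, AvgOrderValue, or both. Both features were computed before the return rows were filtered out, so this likely indicates refunds that exceed the customers' recorded purchases.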
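
A customer-level feature that was computed before filtering keeps its old value on the surviving rows. The following is a minimal sketch, on hypothetical toy data, of how such a stale value persists and how recomputing the feature on the filtered rows changes it. The `purchases` name is illustrative, and whether returns should count toward lifetime value is a modelling decision this sketch does not settle:

```python
import pandas as pd

# Toy line items (hypothetical values): customer 1 returns more than was purchased.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Quantity": [1, -3, 2],
    "UnitPrice": [10.0, 10.0, 5.0],
})
toy["TotalAmount"] = toy["Quantity"] * toy["UnitPrice"]

# Feature computed on all rows, returns included: customer 1 nets to -20.
toy["CustomerLifetimeValue"] = toy.groupby("CustomerID")["TotalAmount"].transform("sum")

# Drop the return rows; .copy() yields an independent DataFrame.
purchases = toy[toy["Quantity"] >= 0].copy()

# The pre-filter value is still attached to the surviving rows.
stale = purchases["CustomerLifetimeValue"].tolist()

# Recomputing on the filtered rows gives purchase-only totals.
purchases["CustomerLifetimeValue"] = (
    purchases.groupby("CustomerID")["TotalAmount"].transform("sum")
)
fresh = purchases["CustomerLifetimeValue"].tolist()

print(stale, fresh)
```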

In [29]:
# Note: this writes df (all rows, returns included), not the filtered cleaned_data
df.to_csv('data_final.csv', index=False)