# E-Commerce Recommender System

## Main Goals

- Build a recommender system based on user purchase history.
- Create user-based features through data aggregation.
- Extract time-based features like customer recency and frequency.
- Apply feature scaling to prepare user data for clustering.
- Segment customers into distinct groups using a clustering algorithm.

### Context

In the vast and competitive landscape of e-commerce, providing personalized recommendations is essential for enhancing user experience, driving sales, and fostering customer loyalty. The primary challenge is to sift through massive product catalogs and complex user histories to suggest items that are genuinely relevant to an individual's tastes and needs. In the field of data science, recommender systems provide a powerful solution by analyzing past behavior to predict future preferences. This project uses a real-world dataset of e-commerce transactions to build a recommender system by creating detailed user profiles and segmenting customers into like-minded groups, enabling a more data-driven approach to product discovery.

## 1. Loading in the Data

For this project, we will use the [Online Retail Dataset](https://archive.ics.uci.edu/dataset/352/online+retail) from the UCI Machine Learning Repository. Please visit the UCI website to download the data file for this activity, and then upload the file to the same directory as the notebook file.

Note that the file is distributed as an Excel workbook rather than a CSV. Opening it in Excel and choosing File > Save As gives you the option to save it as a CSV, which is what the code below expects. Without that conversion, `pd.read_csv` will not be able to parse the file.
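If converting the file by hand is inconvenient, pandas can usually read the workbook directly. The sketch below is one hedged way to support both formats. The helper name `load_retail_data` is hypothetical, and reading `.xlsx` files assumes an Excel engine such as the optional `openpyxl` package is installed.

```python
from pathlib import Path

import pandas as pd


def load_retail_data(path):
    """Load the retail data from either an Excel workbook or a CSV file."""
    path = Path(path)
    if path.suffix.lower() in {".xlsx", ".xls"}:
        # read_excel relies on an Excel engine (e.g. openpyxl for .xlsx)
        return pd.read_excel(path)
    # Any other extension is treated as CSV
    return pd.read_csv(path)
```

With this helper, `df = load_retail_data('Online Retail.csv')` behaves like the `read_csv` call used in this notebook.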

We can start by loading the dataset into a pandas DataFrame and then displaying a summary of it, both to confirm it loaded correctly and to see what features like `CustomerID`, `InvoiceNo`, and `Quantity` look like. This means that we have to start by importing pandas as well.

It's worth mentioning that anytime you have a dataset from an external source, in this case the UCI repository, you can and should refer back to the source of the data to clear up misconceptions and also to get a better understanding of the data.

In [1]:
#import pandas
import pandas as pd

#load the dataset
df = pd.read_csv('Online Retail.csv')

#display info of the data
print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
None
            Quantity      UnitPrice     CustomerID
count  541909.000000  541909.000000  406829.000000
mean        9.552250       4.611114   15287.690570
std       218.081158      96.759853    1713.600303
min    -80995.000000  -11062.060000   12346.000000
25%         1.000000       1.250000   13953.000000
50%         3.000000       2.080000   15152.000000
75%        10.000000       4.130000

Having taken a look at our data, there is a lot to take note of. Let's clarify the features below using information from the UCI archive.

- `InvoiceNo`: A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation. Because of that possible prefix, the column is stored as a string and is best treated as categorical.
- `StockCode`: A 5-digit integral number uniquely assigned to each distinct product. Some do have letters, making this a categorical feature as well.
- `Description`: The name of the product purchased.
- `Quantity`: An integer representing the amount of each product (item) per transaction.
- `InvoiceDate`: Day and time when each transaction was generated. We'll need to check if it's in datetime format.
- `UnitPrice`: Product price per unit. In this case, it's in pounds sterling, as this dataset is based on information from the UK.
- `CustomerID`: A 5-digit integral number uniquely assigned to each customer. 
- `Country`: Name of the country where each customer resides. Given this is from an online store, people could purchase from this store from anywhere.
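
The cancellation rule described for `InvoiceNo` can be checked directly. The following is a minimal sketch on made-up rows (the values are illustrative, not taken from the dataset) that flags invoices whose number starts with the letter c. Since the case of the prefix in the raw file is not assumed here, the comparison is done case-insensitively.

```python
import pandas as pd

# Toy transactions; the values are invented for illustration only
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379", "c536383"],
    "Quantity": [6, 8, -1, -2],
})

# Treat InvoiceNo as text and flag codes that begin with 'c' or 'C'
is_cancellation = toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")

print(toy[is_cancellation])
print(f"Cancelled rows: {is_cancellation.sum()} of {len(toy)}")
```

Whether cancelled rows should be dropped before building customer features is a modelling choice. The cells in this notebook do not filter on such a flag.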

With a better understanding of our data, let's move on to preprocessing.

## 2. Preprocessing
Having taken a good look at the data, we can now start to clean it. Before we engineer features or cluster customers, it is important to make sure the data is ready for those transformations.

### Handling Null Entries
While the UCI archive says that there are no missing entries, inspecting our dataframe suggests otherwise, as both `CustomerID` and `Description` seem to be missing values. Let's confirm this by explicitly checking for null entries, and then remove rows as necessary.

In [2]:
#check for null values
print(df.isnull().sum())

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64


Just as we expected, we are unfortunately missing a lot of `CustomerID` values. Although we lose a lot of data by removing the rows where `CustomerID` is null, we can't attribute those purchases to any customer, so they can't inform per-customer recommendations. As such, let's go ahead and remove these rows.

In [3]:
#Remove rows where CustomerID is null
df = df.dropna(subset=['CustomerID'])

#Check again for null values
print(df.isnull().sum())

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64


With that, we have dealt with the rows that can't be tied to a customer. Conveniently, every row with a missing `Description` was dropped at the same time. A missing description on its own would not have been a reason to drop a row, since the `StockCode` still tells us who bought what, but we would have needed extra work to impute the missing text. Since we don't have to worry about that, let's move on.

### Handling Irrelevant Data
Something else we have to worry about is data that won't help us or our model. For example, we'll want to filter out rows where the price of an item is 0 or negative, which might indicate a promotional item or a test transaction. We'll also remove charges unrelated to actual purchases, such as bank charges and postage or shipping, as they don't tell us anything about which items a customer would purchase.

We'll filter the dataframe with this in mind.

In [4]:
#Remove rows with zero or negative UnitPrice
#These are not standard purchases and can skew analysis.
df = df[df['UnitPrice'] > 0]

print(f"Shape after removing non-positive UnitPrice: {df.shape}")

#Remove rows with non-product StockCodes
#Filter out operational codes that are purely alphabetical, such as 'POST' and 'M',
#as these are typically not products (e.g., 'D' for discount, 'S' for samples).
#Note: a code containing a space, such as 'BANK CHARGES', does not match this pattern.
df = df[~df['StockCode'].str.match(r'^[A-Z]+$', na=False)]

print(f"Shape after removing non-product StockCodes: {df.shape}")

#Display the cleaned data
display(df)

Shape after removing non-positive UnitPrice: (406789, 8)
Shape after removing non-product StockCodes: (405022, 8)


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12/9/2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12/9/2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
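
To see which codes the pattern `^[A-Z]+$` actually removes, it can help to try it on a few examples. This sketch uses a hand-written list of codes (ones mentioned in the code comments above plus two product-style codes from the preview). It only checks the pattern itself, not the full dataset.

```python
import pandas as pd

codes = pd.Series(["85123A", "71053", "POST", "M", "D", "BANK CHARGES"])

# str.match anchors at the start; '$' requires the letters to run to the end
is_alpha_only = codes.str.match(r"^[A-Z]+$", na=False)

for code, flagged in zip(codes, is_alpha_only):
    print(f"{code!r:16} removed={flagged}")
```

Codes that contain a digit or a space, such as `'BANK CHARGES'`, are not matched, so removing them would need an additional condition.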


### Correcting the Date
We noticed earlier that `InvoiceDate` has the 'object' dtype, meaning pandas is treating it as plain text. While this is readable for us, it doesn't allow date arithmetic. The `pd.to_datetime` function lets us fix that. With the column in datetime format, we'll be able to organize the data by recency and give more weight to recent purchases.

As such, we'll go ahead and convert the column into datetime format.

In [5]:
#Check the InvoiceDate data type before converting
print(df['InvoiceDate'].dtype)

#convert the InvoiceDate to datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

#Check the data type again
print(df['InvoiceDate'].dtype)


object
datetime64[ns]
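
Passing an explicit format can make the conversion less ambiguous about which number is the month and which is the day. The sketch below assumes the dates follow the month/day/year pattern seen in the preview (for example `12/1/2010 8:26`). That format string is an inference from the sample rows, not something the source states.

```python
import pandas as pd

raw_dates = pd.Series(["12/1/2010 8:26", "12/9/2011 12:50"])

# Explicit month/day/year format; errors='raise' surfaces malformed rows
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %H:%M", errors="raise")

print(parsed.dtype)
print(parsed.dt.date.tolist())
```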


## 3. Feature Engineering

With our data cleaned, we can now proceed to the most critical phase for this project: **feature engineering**. The objective here is to transform our long list of individual transactions into a structured summary of each customer's unique behavior. We will create a new DataFrame where each row represents a single `CustomerID`, and the columns are metrics that quantitatively describe their purchasing habits, such as how often they shop, how much they spend, and how recently they've been active. This user-centric view is what will allow us to intelligently group similar customers and provide relevant recommendations.

To accomplish this, we will primarily use the powerful `groupby('CustomerID').agg()` method in pandas, which allows us to calculate multiple summary statistics for each user at once. The specific features we plan to engineer are:

* **`recency_days`**: Days since their last purchase.
* **`frequency`**: Total number of unique invoices per customer.
* **`total_price`**: Quantity of an item multiplied by its unit price (a per-row helper column).
* **`total_spent`**: The sum of `total_price` for all their purchases.
* **`avg_order_value`**: The average spending per transaction.
* **`unique_items`**: The count of unique `StockCode`s they have purchased.
* **`total_items`**: The total `Quantity` across all their purchases.
* **`avg_items_per_order`**: The average number of items (`Quantity`) per transaction.
* **`avg_days_between`**: The average time gap between their transactions.

So, let's go ahead and create these features. This might be a bit longer of a code segment than usual so please do your best to follow along.

In [6]:
#Import numpy for numerical operations
import numpy as np

#Create total_price column
df['total_price'] = df['Quantity'] * df['UnitPrice']


#Calculate Recency, Frequency, and Monetary (RFM) values

#For Recency, we need a snapshot date to calculate days since last purchase.
#We'll use the day after the last transaction in the dataset as our reference.
#timedelta is used to add a day to the max date.
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

# Group by customer and calculate aggregations
user_features = df.groupby('CustomerID').agg(
    #Recency: Days since last purchase
    recency_days=('InvoiceDate', lambda date: (snapshot_date - date.max()).days),
    
    #Frequency: Count of unique invoices
    frequency=('InvoiceNo', 'nunique'),
    
    #Monetary: Sum of total price for all purchases
    total_spent=('total_price', 'sum'),

    #Other behavioral features
    unique_items=('StockCode', 'nunique'),
    total_items=('Quantity', 'sum')
)


#Calculate Additional Features
#Average Order Value
user_features['avg_order_value'] = user_features['total_spent'] / user_features['frequency']

#Average Items per Order
user_features['avg_items_per_order'] = user_features['total_items'] / user_features['frequency']


#Calculate Average Days Between Purchases
#First, get the list of purchase dates for each customer
customer_dates = df.groupby('CustomerID')['InvoiceDate'].apply(list)

#Calculate the time difference between consecutive purchases
def get_avg_purchase_gap(dates):
    if len(dates) > 1:
        dates = sorted(dates)
        gaps = [(dates[i] - dates[i-1]).days for i in range(1, len(dates))]
        return np.mean(gaps)
    return 0 # If only one purchase, return 0. Alternatively, we could return a large number to tell that this was a one-time purchase.

user_features['avg_days_between'] = customer_dates.apply(get_avg_purchase_gap)


#Display the Final User Features DataFrame
print("--- Feature Engineering Complete ---")
print("Preview of the final user_features DataFrame:")
print(user_features.head())

--- Feature Engineering Complete ---
Preview of the final user_features DataFrame:
            recency_days  frequency  total_spent  unique_items  total_items  \
CustomerID                                                                    
12346.0              326          2         0.00             1            0   
12347.0                2          7      4310.00           103         2458   
12348.0               75          4      1437.24            21         2332   
12349.0               19          1      1457.55            72          630   
12350.0              310          1       294.40            16          196   

            avg_order_value  avg_items_per_order  avg_days_between  
CustomerID                                                          
12346.0            0.000000             0.000000          0.000000  
12347.0          615.714286           351.142857          2.000000  
12348.0          359.310000           583.000000         10.846154  
12349.0         14

It was quite a bit of work, but we were able to create a new dataframe that serves our purposes. In this dataframe, the index is the customer, so each row now represents a specific customer. Using information from this dataframe and our original dataframe, we'll be able to create a list of items to recommend to each customer. But before getting to that point, there are still more transformations and functions we want to apply.
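
One detail worth checking is that `avg_days_between` is computed from one date per row, so an invoice with many line items contributes many zero-day gaps. A hedged alternative, sketched here on invented data, collapses each invoice to a single date before measuring gaps. The helper name `avg_gap_between_invoices` is hypothetical, and returning `NaN` for single-invoice customers is a design choice of this sketch rather than the notebook's behavior.

```python
import pandas as pd

# Toy line items: invoice A has three rows, invoices B, C and D have one each
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 1, 2],
    "InvoiceNo": ["A", "A", "A", "B", "C", "D"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01 09:00", "2011-01-01 09:00", "2011-01-01 09:00",
        "2011-01-11 10:00", "2011-01-31 10:00", "2011-02-01 12:00",
    ]),
})


def avg_gap_between_invoices(frame):
    """Mean days between consecutive invoices; NaN for single-invoice customers."""
    # One timestamp per invoice, so line items do not create zero-day gaps
    invoice_dates = (
        frame.groupby(["CustomerID", "InvoiceNo"])["InvoiceDate"].min()
        .reset_index()
        .sort_values(["CustomerID", "InvoiceDate"])
    )
    # Difference between consecutive invoice dates within each customer
    gaps = invoice_dates.groupby("CustomerID")["InvoiceDate"].diff().dt.days
    return gaps.groupby(invoice_dates["CustomerID"]).mean()


result = avg_gap_between_invoices(toy)
print(result)
```

For customer 1 the row-based calculation would average the gaps 0, 0, 10 and 20, whereas the invoice-based version averages only 10 and 20.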

## 4. Scaling

For this project, there isn't a predictive model to build in the usual supervised sense. We'll be using different techniques to create an algorithm that recommends certain products to customers. This also means that there won't be a train-test split for this project, and that we can go ahead and start scaling our data. The data will be scaled so that each feature has a mean of 0 and a standard deviation of 1.

We'll only be scaling our user features dataframe, as we plan to use the data from this dataframe as a direct input to the k-means clustering technique (which will be explained more after scaling). Additionally, our original dataframe includes data that wouldn't benefit from being scaled, such as numerical codes or categorical information.

As such, we'll go ahead and scale our user features dataframe by importing the standard scaler, fitting it, and then transforming the dataframe.

In [7]:
#import StandardScaler
from sklearn.preprocessing import StandardScaler

#Initialize the StandardScaler
scaler = StandardScaler()

#Fit and transform the user features DataFrame
user_features_scaled = scaler.fit_transform(user_features)
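
For intuition, the transformation applied by `StandardScaler` is a per-column z-score, `z = (x - mean) / std`, using the population standard deviation. The NumPy sketch below reproduces that arithmetic on a small invented matrix. It is meant as an illustration of the formula rather than a replacement for the scaler.

```python
import numpy as np

# Two invented features on very different scales (e.g. days and currency)
X = np.array([
    [2.0, 4310.0],
    [75.0, 1437.0],
    [310.0, 294.0],
])

# Column-wise mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=0)

X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0).round(6))  # approximately 0 for each column
print(X_scaled.std(axis=0).round(6))   # approximately 1 for each column
```

Scaling matters here because k-means relies on distances, so an unscaled feature with large values such as `total_spent` would otherwise dominate the clustering.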

## 5. Clustering the Data

With our user features now engineered and scaled, the next phase is to perform **customer segmentation** using a clustering algorithm. The goal here is to move beyond analyzing customers individually and instead discover natural groupings of users who exhibit similar purchasing behaviors. By identifying these segments, we can understand our customer base on a deeper level and tailor our recommendation strategy to the unique preferences of each group. This approach allows us to provide more relevant suggestions than a one-size-fits-all model.

To achieve this, we will use **K-Means**, a powerful and widely-used clustering algorithm. We will start by choosing the number of clusters (`k`) we want to partition our data into. We will then initialize the K-Means model with this `k` value and fit it to our scaled `user_features` DataFrame. The algorithm will then assign each customer to a specific cluster. Finally, we will analyze the characteristics of each cluster by examining the average feature values of its members, allowing us to create specific groups of customers with specific spending habits. We can work from there to recommend products.
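
The choice of `k` is a judgment call, and one common way to inform it is to compare candidate values by inertia and silhouette score. The sketch below does this on synthetic blobs rather than on the customer data, so the numbers it prints say nothing about the right `k` for this project. The candidate range of 2 to 6 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = model.fit_predict(X)
    # Higher silhouette suggests tighter, better-separated clusters
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}  inertia={model.inertia_:.1f}  silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

The same loop could be run on `user_features_scaled` to see how a candidate `k` compares with its neighbors.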

In [None]:
#Import KMeans for clustering
from sklearn.cluster import KMeans

#We'll start by choosing a number of clusters, k. For this example, let's set k to 30, but we set this as a variable so we can easily change it
k = 30

#Initialize the KMeans model with k clusters
#We use 'k-means++' for better initialization of centroids, and set n_init to 10 for robustness.
kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
kmeans.fit(user_features_scaled) 

#Get the cluster assignment for each customer
cluster_labels = kmeans.labels_

#We add the cluster labels back to the original, unscaled DataFrame
#so we can interpret the results with human-readable values.
user_features['cluster'] = cluster_labels

#We can now analyze the characteristics of each cluster by grouping
#by the new 'cluster' column and calculating the mean of each feature.
cluster_analysis = user_features.groupby('cluster').mean()

print("--- Cluster Analysis Results ---")
print("Average feature values for each customer segment:")
print(cluster_analysis)

--- Cluster Analysis Results ---
Average feature values for each customer segment:
         recency_days   frequency    total_spent  unique_items    total_items  \
cluster                                                                         
0           55.516667    3.477778     593.947889     11.961111     289.261111   
1           51.318102    2.383128     978.436274     40.516696     651.579965   
2           24.000000   24.000000  123638.180000    443.000000   76946.000000   
3          205.000000    2.250000    3504.450000      9.000000    7196.000000   
4          242.418136    1.612091     305.423023     18.904282     155.450882   
5           29.266917    6.259398    1768.340075    112.569549    1086.509398   
6            8.426471   28.779412    8711.416765    262.161765    4934.014706   
7            1.500000  193.000000   35409.875000   1548.500000   23413.000000   
8           47.000000    2.000000     501.565000      1.500000     793.000000   
9           96.148148    2

From our results, we can see the characteristics of each cluster: how much money a customer in a given cluster spent on average, the average number of items bought, and much more. It's with this information that we'll go ahead and build our recommender system. To make things easier, we'll map the cluster labels to our original transactions so that we know which cluster each purchase belongs to. We'll do this by merging the cluster labels.

In [None]:
#We select only the cluster column for the merge
customer_clusters = user_features[['cluster']]

#Merge the cluster information back into the main dataframe
df_with_clusters = df.merge(customer_clusters, on='CustomerID')
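
Because this merge feeds everything that follows, it can be worth guarding against silent row loss or duplication. This is a small sketch on invented frames. It uses `reset_index()` so the join key is an ordinary column on both sides, and `validate='many_to_one'` so pandas raises an error if a customer appears more than once in the cluster table.

```python
import pandas as pd

transactions = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 3.0],
    "StockCode": ["A", "B", "A", "C"],
})
clusters = pd.DataFrame(
    {"cluster": [0, 1, 0]},
    index=pd.Index([1.0, 2.0, 3.0], name="CustomerID"),
)

merged = transactions.merge(
    clusters.reset_index(),   # CustomerID becomes a regular column
    on="CustomerID",
    how="left",               # keep every transaction row
    validate="many_to_one",   # each customer may have only one cluster
)

# No rows lost or duplicated, and every row received a cluster label
assert len(merged) == len(transactions)
assert merged["cluster"].notna().all()
print(merged)
```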

## 6. Creating the Recommendation Logic
With our customer segments defined by the clustering algorithm, we will now build the core recommendation logic. The goal is to provide personalized suggestions to a user by leveraging the collective wisdom of their peer group, that being the other customers within their assigned cluster. This approach is a form of collaborative filtering, where we assume that users with similar purchasing behaviors will have similar tastes in products.

To accomplish this, we will create a function that, for any given `CustomerID`, first identifies which cluster they belong to. It then analyzes all purchases made by the other customers in that same cluster to determine which products are most popular among that specific segment. Finally, to ensure the recommendations are new and helpful, the function will filter out any items the original user has already bought and return a ranked list of the top remaining products, providing a targeted and relevant set of suggestions.

In [None]:
#Create a function for our recommendation logic
#This function will take a customer ID and return a list of recommended items based on their cluster
def get_recommendations(customer_id, top_n=5):
    # 1. Get the target user's cluster
    user_cluster = df_with_clusters.loc[df_with_clusters['CustomerID'] == customer_id, 'cluster'].iloc[0]

    # 2. Get the list of items the user has already bought
    user_purchases = df_with_clusters.loc[df_with_clusters['CustomerID'] == customer_id, 'StockCode'].unique()

    # 3. Get purchase data for the user's peer group
    peer_group = df_with_clusters[df_with_clusters['cluster'] == user_cluster]
    
    # 4. Find the most popular items in the peer group, excluding the user's own purchases
    popular_items = peer_group[~peer_group['StockCode'].isin(user_purchases)]
    recommendations = popular_items['StockCode'].value_counts().head(top_n).index.tolist()
    
    return recommendations

#Test with a sample customer
sample_customer = 17850 # Example customer ID. Just the first one in the dataset.
recommendations = get_recommendations(sample_customer, top_n=5)

print(f"Recommendations for Customer {sample_customer}:")
print(recommendations)

Recommendations for Customer 17850:
['85099B', '22423', '23203', '47566', '22720']
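
Two edge cases are worth keeping in mind with `get_recommendations`: an unknown `CustomerID` makes `.iloc[0]` raise an `IndexError`, and `value_counts()` counts line items rather than distinct buyers. The sketch below is one possible variant on invented data, with a hypothetical name `recommend_by_distinct_buyers`, that returns an empty list for unknown customers and ranks items by how many different peers bought them.

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3, 3, 4],
    "StockCode":  ["A", "B", "A", "C", "C", "C", "D", "E"],
    "cluster":    [0, 0, 0, 0, 0, 0, 0, 1],
})


def recommend_by_distinct_buyers(data, customer_id, top_n=5):
    """Rank unseen items by the number of distinct peers who bought them."""
    own_rows = data[data["CustomerID"] == customer_id]
    if own_rows.empty:
        return []  # unknown customer: nothing to base a recommendation on

    user_cluster = own_rows["cluster"].iloc[0]
    already_bought = set(own_rows["StockCode"])

    # Peers are the other customers in the same cluster
    peers = data[(data["cluster"] == user_cluster)
                 & (data["CustomerID"] != customer_id)]
    candidates = peers[~peers["StockCode"].isin(already_bought)]

    # nunique counts each peer once per item, however many rows they have
    buyers_per_item = candidates.groupby("StockCode")["CustomerID"].nunique()
    return buyers_per_item.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend_by_distinct_buyers(toy, customer_id=1, top_n=2))
print(recommend_by_distinct_buyers(toy, customer_id=99))
```

Counting distinct buyers keeps a single customer who reorders the same item many times from dominating the ranking.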


## 7. Refining the Recommender System
While we were able to build our recommender system, we currently don't know what it's recommending, or how good these recommendations are. We can start by figuring out what is being recommended.

We'll first create a map for the stock codes. We want each unique stock code and its description in a dictionary so that, for any given code, we know what is being recommended. After running our function, we can save the recommendations and then translate them using the stock code map.

In [None]:
#Create a mapping dictionary from StockCode to Description
#We drop duplicates to ensure each StockCode maps to a single, clean Description
stock_code_map = df.drop_duplicates(subset=['StockCode']).set_index('StockCode')['Description'].to_dict()

#Get the recommended StockCodes
sample_customer = 17850
recommended_codes = get_recommendations(sample_customer, top_n=3)

#Translate the codes into human-readable names using the map
recommended_items = [stock_code_map[code] for code in recommended_codes]

#Display the final, human-readable recommendations
print(f"Recommended StockCodes for Customer {sample_customer}:")
print(recommended_codes)
print("\nTranslated Recommendations:")
print(recommended_items)

Recommended StockCodes for Customer 17850:
['85099B', '22423', '23203']

Translated Recommendations:
['JUMBO BAG RED RETROSPOT', 'REGENCY CAKESTAND 3 TIER', 'JUMBO BAG DOILEY PATTERNS']


Just like that, we know what is being recommended. This change could simply have been added to our recommendation logic function, but for the sake of listing each step one by one, we have kept it separate.
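
Because `drop_duplicates` keeps whichever description happens to appear first, a code with inconsistent descriptions may end up with an unrepresentative label. One hedged alternative, shown on invented rows, is to keep the most frequent description per code.

```python
import pandas as pd

# Invented rows: one code has an inconsistent description
toy = pd.DataFrame({
    "StockCode": ["85099B", "85099B", "85099B", "22423"],
    "Description": [
        "JUMBO BAG RED RETROSPOT",
        "JUMBO BAG RED RETROSPOT",
        "damaged",
        "REGENCY CAKESTAND 3 TIER",
    ],
})

# For each code, take the description that occurs most often
most_common = (
    toy.groupby("StockCode")["Description"]
    .agg(lambda s: s.value_counts().index[0])
    .to_dict()
)

print(most_common)
```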

## 8. Evaluating the System

Finally, we must evaluate the quality of our recommender system to ensure it provides sensible suggestions. Since we are not predicting a single "correct" answer, traditional metrics like accuracy or RMSE do not apply here. Instead, we will perform a **qualitative analysis**, which involves a manual, logical inspection of the recommendations for a sample customer to determine if they are relevant and logical.

To do this, we will create a script that profiles a specific customer, analyzes the characteristics of the peer group they belong to, and then presents the final recommendations. The process involves selecting a sample `CustomerID`, retrieving their engineered features and assigned cluster, and then comparing their individual profile to the average profile of their cluster. By examining the recommended items in the context of both the individual's and the group's purchasing behavior, we can make a well-informed judgment about whether the recommendations are logical and add value, thereby validating the effectiveness of our segmentation strategy.

In [None]:
#Define the customer to analyze
customer_id_to_check = 17850

#Profile the Specific Customer
print(f"--- Analysis for Customer ID: {customer_id_to_check} ---")
user_profile = user_features.loc[customer_id_to_check]
print("\nCustomer's Personal Profile")
print(user_profile)

#Understand Their Peer Group (Cluster)
user_cluster_label = int(user_profile['cluster'])
cluster_profile = cluster_analysis.loc[user_cluster_label]
print(f"\nCustomer belongs to Cluster {user_cluster_label}, which has these average characteristics:")
print(cluster_profile)

#Review Their Recommendations
#Get recommended stock codes
recommended_codes = get_recommendations(customer_id_to_check, top_n=3)

#Translate codes to human-readable names
recommended_items = [stock_code_map.get(code, "Unknown Item") for code in recommended_codes]

print("\nTop 3 Recommendations")
for item in recommended_items:
    print(f"- {item}")

--- Analysis for Customer ID: 17850 ---

Customer's Personal Profile
recency_days            302.000000
frequency                35.000000
total_spent            5288.630000
unique_items             24.000000
total_items            1693.000000
avg_order_value         151.103714
avg_items_per_order      48.371429
avg_days_between          0.221865
cluster                  18.000000
Name: 17850.0, dtype: float64

Customer belongs to Cluster 18, which has these average characteristics:
recency_days             18.033019
frequency                16.410377
total_spent            4425.028349
unique_items            105.466981
total_items            2494.221698
avg_order_value         274.919524
avg_items_per_order     156.697645
avg_days_between          1.958521
Name: 18, dtype: float64

Top 3 Recommendations
- JUMBO BAG RED RETROSPOT
- REGENCY CAKESTAND 3 TIER
- JUMBO BAG DOILEY PATTERNS


This analysis reveals a successful, multi-layered segmentation where the model has correctly identified a high-value customer and provided relevant, if general, recommendations.

### The Quality of the Segmentation

The model placed `CustomerID` 17850 into **Cluster 18**, a segment of "power users." This grouping is plausible because the customer's core purchasing habits, namely their high **`frequency`** (35 invoices) and **`total_spent`** (about £5,289), are in line with the high averages of this cluster (about 16 invoices and £4,425). Their frequency and spending are in fact somewhat above the cluster averages, while their `unique_items` count (24 vs. about 105) is well below it. Even so, they are clearly in the same league as the rest of the group, which distinguishes them from casual, low-spending shoppers.

### The "Lapsed Power User" Insight

It is perfectly okay that the user is "lapsed" within this group; in fact, this is a crucial business insight. The cluster represents a **behavioral type** ("power user"), not a group where every member is identical. The model determined that this user's history of high-volume, frequent purchasing makes them a power user. Their high **`recency_days`** (302 days vs. the group's average of about 18) simply adds another layer to their profile: they are a power user **who is at risk of churning**. This is a highly valuable segment for a business to identify.
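
The comparison between the customer and the cluster can be made explicit by dividing one profile by the other. The sketch below copies four of the values printed in the analysis output into two Series and reports each feature as a multiple of the cluster average.

```python
import pandas as pd

features = ["recency_days", "frequency", "total_spent", "unique_items"]

# Values copied from the printed customer and cluster profiles
customer = pd.Series([302.0, 35.0, 5288.63, 24.0], index=features)
cluster_mean = pd.Series(
    [18.033019, 16.410377, 4425.028349, 105.466981], index=features
)

# A ratio above 1 means the customer exceeds the cluster average
ratio = (customer / cluster_mean).round(2)
print(ratio)
```

A recency ratio far above 1 next to frequency and spending ratios above 1 is the numeric signature of the "lapsed power user" described here.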

### The Logic of the Recommendations

The recommendations ("JUMBO BAG," "REGENCY CAKESTAND," etc.) are likely general best-sellers that are popular among a wide range of customers. We can assume these are logical recommendations for this cluster because this "power user" segment, which plausibly includes B2B customers and bulk buyers, naturally purchases a high volume of the most popular and useful items for resale or frequent use. Therefore, recommending the top-selling items to a member of this group is a safe and logical strategy to re-engage them with products their peers have found valuable.

## Overall
Throughout this project, we successfully guided the e-commerce dataset through a complete customer segmentation pipeline, from data cleaning and preparation to advanced feature engineering and unsupervised learning. By transforming raw transaction logs into rich user profiles using aggregation methods like RFM (Recency, Frequency, Monetary) analysis and then applying **K-Means clustering**, we partitioned the customer base into distinct, data-driven groups. The value of this process lies in the techniques themselves. The ability to engineer behavioral features and apply clustering to segment a user base is a fundamental skill in modern data analytics, essential for tackling real-world business challenges. This project has built a solid foundation in a new class of data science problems. Good work, give yourself a pat on the back.