Skip to content

Using NumPy, Pandas, Matplotlib and Seaborn for RFM analysis, I identified key customer segments based on RFM value. This analysis generated insights that informed targeted marketing strategies, enhancing customer retention and engagement while recommending solutions to improve overall business practice and drive revenue growth.

Notifications You must be signed in to change notification settings

Sale2002/Python-RFM-Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 

Repository files navigation

[Python] RFM Analysis

I. Introduction

1.1. Project Overview:

This project aims to assist SuperStore, a global retail company, in launching effective marketing campaigns for the Christmas and New Year's seasons. The primary goal is to express appreciation to existing customers and identify potential loyal customers for personalized engagement. By applying RFM (Recency, Frequency, Monetary) analysis, the project seeks to segment the large customer base, enabling the development of targeted marketing strategies and customized customer care programs. This approach helps SuperStore increase ROI, reduce customer churn, optimize marketing expenses, and improve customer relationships.

In this project, the following libraries were utilized for data analysis:

  • pandas
  • numpy
  • datetime
  • matplotlib
  • seaborn
  • squarify

1.2. Business Problem:

SuperStore needs to identify key customer segments to focus their marketing resources on the right groups, ensuring more effective outreach and personalized customer engagement during the holiday seasons. This involves determining which customers are most loyal, most active, and most profitable, allowing the business to allocate resources efficiently and improve customer satisfaction.

1.3 Solution Approach:

The RFM analysis model is employed, which segments customers based on three factors:

  • Recency: How recently a customer made a purchase.
  • Frequency: How often the customer has made purchases.
  • Monetary: How much the customer has spent overall.

By scoring customers on these dimensions, the company can create meaningful customer segments, such as high-value, inactive, or recent customers, and tailor marketing strategies for each segment.

Reference: RFM analysis

II. Data Exploration & Data Processing

2.1. Data Information

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

  • InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'C', it indicates a cancellation.
  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantities of each product (item) per transaction. Numeric.
  • InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal, the name of the country where each customer resides.

2.2. Data Exploration & Data Processing

Each Step of Data Exploration and Processing:

  • Step 1: Data Preparation: The dataset was loaded and underwent a cleaning process to eliminate missing values and duplicates. Additionally, necessary columns were converted to the appropriate data types (e.g., conversion of InvoiceDate to datetime format).

  • Step 2: Transaction Segmentation: The data was filtered to distinguish between cancelled transactions (negative quantities) and successful ones. The total revenue was calculated for both categories.

  • Step 3: Data Aggregation: Transactions were grouped by CustomerID to compute key metrics, including total revenue (Monetary), the number of purchases (Frequency), and the date of the most recent purchase (Recency).

  • Step 4: RFM Metric Calculation: The RFM metrics were calculated for each customer:

    • Recency: The number of days since the last purchase.
    • Frequency: The total number of transactions.
    • Monetary: The total amount spent.
  • Step 5: RFM Scoring: Customers were assigned RFM scores based on the distribution of their RFM values using quantiles (e.g., a 1–5 scale).

  • Step 6: Customer Segmentation: The RFM scores were combined to classify customers into distinct segments (e.g., top customers, at-risk customers, and new customers).

To view the steps for executing the aforementioned process, please refer to the: Appendix 1

III. Data Visualization

image

Function Used:

  • plt.subplots(): Creates a subplot for each RFM variable.
  • sns.distplot(): Plots the distribution of each RFM variable.

Purpose: To visualize the distribution of Recency, Frequency, and Monetary values across customers.

Result: image

=> Insight: The customer purchase distribution is right-skewed, indicating many recent purchases and fewer customers who have not bought anything for a long time. The peak occurs in the 0-50 day range, suggesting that many customers make repeat purchases within this timeframe. The longer tail on the right indicates some customers have not purchased in a while, signaling they may be less engaged or at risk of being lost.

image

=> Insight: The customer transaction distribution is right-skewed, indicating that many customers have made only a few transactions, while a few have made significantly more. The peak is in the 0-10 range, showing that a substantial portion of customers has completed only a few transactions. The longer right tail indicates the presence of high-value or repeat customers.

image

=> Insight: The distribution of customer spending is right-skewed, similar to the recency and frequency distributions, indicating that many customers have made relatively small expenditures, while a few have spent significantly larger amounts. The peak occurs in the 0-50,000 range, suggesting that a substantial portion of customers has relatively low total spending. The longer right tail highlights the presence of high-value or VIP customers who have made large expenditures.

Preprocess the data before performing visualization: Appendix 2

image

image

IV. Recommendation

Segment Characteristics Recommendation
Champions Bought recently, buy often and spend the most! Reward them. Can be early adopters for new products. Will promote your brand.
Loyal Spend good money with us often. Responsive to promotions. Upsell higher value products. Ask for reviews. Engage them.
Potential Loyalist Recent customers, but spent a good amount and bought more than once. Offer membership/loyalty program, recommend other products.
New customers Bought most recently, but not often. Provide on-boarding support, give them early success, start building a relationship.
Promising Recent shoppers, but haven’t spent much. Create brand awareness, offer free trials.
Need attention Above average recency, frequency and monetary values. May not have bought very recently. Make limited-time offers, recommend based on past purchases. Reactivate them.
About to sleep Below average recency, frequency and monetary values. Will lose them if not reactivated. Share valuable resources, recommend popular products/renewals at discount, reconnect with them.
At risk Spent big money and purchased often, but a long time ago. Need to bring them back! Send personalized emails to reconnect, offer renewals, provide helpful resources.
Cannot lose them Made biggest purchases, and often, but haven’t returned for a long time. Win them back via renewals or newer products, don’t lose them to competition, talk to them.
Hibernating customers Last purchase was long back, low spenders and low number of orders. Offer other relevant products and special discounts. Recreate brand value.
Lost customers Lowest recency, frequency and monetary scores. Revive interest with a reach-out campaign, ignore otherwise.

Appendix_1

1. Data Exploration

1.1. Loading and Initial Data Inspection:

image image image image

Function Used:

  • pd.read_excel(): Loads the data from an Excel file.
  • data.head(): Displays the first five rows of the dataframe.
  • data.info(): Provides information about the dataframe (data types, non-null counts).
  • data.shape: Returns the dimensions of the dataframe (rows, columns).

Purpose: To load and inspect the dataset structure before manipulation.

1.2. Data Cleaning:

After reviewing the data, we observed that a significant portion of the data contained null values. Therefore, we have decided on the following approach: given the available data and context, it is not feasible to impute values for approximately 140,000 rows where the CustomerID is null. As a result, we will delete these rows from the dataframe. Additionally, duplicate rows will be removed in this step as well.

image

Function Used:

  • dropna(): Removes rows with missing values in the CustomerID column.
  • drop_duplicates(): Removes duplicate rows.

Purpose: To clean the dataset by removing invalid or redundant rows.

1.3. Converting Data Types:

Convert some columns to the potentially correct data type

image

Function Used:

  • pd.to_datetime(): Converts the InvoiceDate column to datetime format.
  • astype(): Changes the data type of CustomerID to object.

Purpose: Ensures the correct data types for analysis.

2. Data Processing

2.1. Handling Cancellations:

image image

Function Used:

  • Filtering: data[data['Quantity'] < 0] filters out transactions where Quantity is negative (indicating cancellations).
  • Multiplication: Calculates the loss for canceled transactions (Quantity × UnitPrice).

Purpose: To isolate and calculate the loss from canceled transactions.

2.2. Aggregating Canceled Transactions:

image image

Function Used:

  • groupby(): Groups the data by InvoiceNo, Country, and CustomerID.
  • agg(): Aggregates the grouped data by taking the maximum InvoiceDate and summing the Loss.
  • reset_index(): Resets the index after the aggregation.

Purpose: To summarize canceled transactions by invoice and customer.

2.3. Handling Delivered Transactions:

image

image

Function Used:

  • Filtering: data[data['Quantity'] >= 0] filters out transactions where Quantity is non-negative.
  • Multiplication: Calculates the revenue for delivered transactions (Quantity × UnitPrice).

Purpose: To isolate and calculate the revenue from delivered transactions.

2.4. Aggregating Delivered Transactions:

image image

Function Used:

  • groupby(): Groups the data by InvoiceNo, Country, and CustomerID.
  • agg(): Aggregates the grouped data by taking the maximum InvoiceDate and summing the Revenue.
  • reset_index(): Resets the index after the aggregation.

Purpose: To summarize delivered transactions by invoice and customer.

2.5. Percentage of Cancelled Transactions:

image

  • agg_cancelled.shape[0]: Counts the number of cancelled transactions (rows where Quantity < 0).
  • agg_delivered.shape[0]: Counts the number of delivered transactions (rows where Quantity >= 0).
  • The formula (agg_cancelled.shape[0] * 100 / (agg_cancelled.shape[0] + agg_delivered.shape[0])) calculates the percentage of cancelled transactions by dividing the number of cancelled transactions by the total number of transactions (cancelled + delivered) and multiplying by 100.

=> A high cancellation rate may point to problems in operations, product quality, or customer satisfaction, signaling areas that need immediate attention to prevent revenue loss and improve customer retention.

2.6. Calculating RFM Metrics (Recency, Frequency, Monetary):

image

Function Used:

  • Recency: The number of days since the customer's last purchase.
  • Frequency: The total number of purchases by the customer.
  • Monetary: The total revenue generated by the customer.
  • The Recency value is calculated by subtracting the last purchase date from a fixed "current date" (today).

Purpose: Group transactions by CustomerID to compute the three key metrics—Recency, Frequency, and Monetary (RFM).

2.7. RFM Score Calculation:

image image

Function Used:

  • pd.qcut(): Divides the Recency, Frequency, and Monetary values into quantiles (with 5 groups here).
  • rank(): Ranks the Frequency before quantile assignment.

Purpose: To assign scores (1-5) to each customer for Recency, Frequency, and Monetary, based on their relative values.

2.8. Creating RFM Combined Score:

image image

Function Used:

  • astype(): Converts the scores to string, concatenates them, and converts the result back to integer.

Purpose: To create a combined RFM score for each customer by joining their individual Recency, Frequency, and Monetary scores.

2.9. Merging with Segmentation Data:

image image image image

Function Used:

  • pd.read_excel(): Reads the segmentation data from another sheet in the Excel file.
  • str.split(): Splits the comma-separated RFM scores into a list.
  • explode(): Expands list-like elements into separate rows.
  • astype(): Converts RFM scores to integer.

Purpose: To process the segmentation data and prepare it for merging with the RFM analysis results.

2.10. Final Merge of RFM and Segmentation:

image

Function Used:

  • merge(): Joins the RFM analysis data with the segmentation data on the RFM_Score column.

Purpose: To associate each customer with their respective segment based on the combined RFM score.

Appendix_2

image image

Purpose: This aggregation step provides a clear overview of customer segments by summarizing essential metrics. It allows businesses to evaluate customer engagement, profitability, and retention strategies effectively.

image

Purpose: Calculating these shares enables businesses to identify and prioritize customer segments based on their size and revenue contribution.

Visualization: Code: image

Result: image

Code: image

Result: image

About

Using NumPy, Pandas, Matplotlib and Seaborn for RFM analysis, I identified key customer segments based on RFM value. This analysis generated insights that informed targeted marketing strategies, enhancing customer retention and engagement while recommending solutions to improve overall business practice and drive revenue growth.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published