## 1. Installing Packages
Install the "lifetimes", "seaborn", and "scikit-learn" packages with pip. They provide the CLV models, plotting, and preprocessing tools imported in the next section.

In [None]:
%pip install lifetimes
%pip install seaborn
%pip install scikit-learn

## 2. Importing Python Libraries
This code imports necessary Python libraries, including "lifetimes" for Customer Lifetime Value (CLV) analysis, data manipulation with pandas and numpy, datetime handling, data visualization with matplotlib and seaborn, and machine learning tools from scikit-learn for preprocessing.

In [2]:
import lifetimes

import pandas as pd
import numpy as np
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns

from lifetimes import BetaGeoFitter, GammaGammaFitter
from sklearn.preprocessing import MinMaxScaler

## 3. Reading and Understanding Data

In [3]:
data = pd.read_csv('online_retail.csv')
data.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


This is a pandas DataFrame with 541,909 rows and 8 columns of mixed types (object, int64, float64). Two columns have missing values: 'Description' (1,454 missing) and 'CustomerID' (135,080 missing).

In [5]:
data.describe()

,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


These statistics summarize the distribution of the 'Quantity', 'UnitPrice', and 'CustomerID' columns. The negative minimums for 'Quantity' (-80,995) and 'UnitPrice' (-11,062.06), together with the very large maximums, point to returns or invalid records and to extreme outliers that must be handled before modeling. 'CustomerID' is an identifier, so its mean and standard deviation carry no meaning.

## 4. Data Manipulation
This data manipulation involves filtering out rows where 'Quantity' is less than or equal to 0, 'UnitPrice' is less than or equal to 0, and removing rows with 'InvoiceNo' containing "C" (indicating returns). This is done to clean the data by excluding invalid or unwanted records, ensuring that the analysis is based on valid and meaningful transactions.

In [6]:
data = data[data['Quantity'] > 0 ]
data = data[data['UnitPrice'] > 0]
data = data[~data['InvoiceNo'].str.contains("C",na=False)]
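
To see what each filter removes, here is a minimal sketch on a hypothetical five-row frame. The rows and values are invented for illustration; only the column names come from the dataset. It combines the three conditions into a single boolean mask, which should be equivalent to applying them one after another.

```python
import pandas as pd

# Hypothetical transactions: a valid sale, a negative quantity, a zero price,
# a cancelled invoice (prefix "C"), and another valid sale.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "C536368", "536369"],
    "Quantity": [6, -2, 4, 3, 1],
    "UnitPrice": [2.55, 3.39, 0.0, 1.25, 4.95],
})

# Keep rows with positive quantity, positive price, and no "C" in the invoice number.
valid = (
    (toy["Quantity"] > 0)
    & (toy["UnitPrice"] > 0)
    & ~toy["InvoiceNo"].str.contains("C", na=False)
)
cleaned = toy[valid]
print(cleaned["InvoiceNo"].tolist())
```

Only the two valid sales survive, so comparing `len(toy)` with `len(cleaned)` is a quick way to check how many rows a cleaning step discards.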

We see that there are missing values within CustomerID. Let’s remove any observation without CustomerID.

In [7]:
# Drop every row that has a missing value in any column
data = data.dropna()

## 5. Handling Outliers
We will create a function called cap_outliers that caps outliers in a specified DataFrame column: values below the 5th percentile (q1) are set to the 5th percentile value, and values above the 95th percentile (q2) are set to the 95th percentile value. Capping keeps every row but prevents extreme values from disproportionately affecting the statistics, so the results are more representative of the overall data distribution.

In [8]:
# Define a function that caps outliers at the given quantiles (modifies the column in place)
def cap_outliers(dataframe, variable, q1=0.05, q2=0.95):
    lower_bound = dataframe[variable].quantile(q1)
    upper_bound = dataframe[variable].quantile(q2)
    dataframe[variable] = np.clip(dataframe[variable], lower_bound, upper_bound)
    
# Calling cap_outliers for UnitPrice and Quantity
cap_outliers(data,'UnitPrice')
cap_outliers(data,'Quantity')
data.describe()

,Quantity,UnitPrice,CustomerID
count,397884.0,397884.0,397884.0
mean,8.868022,2.675785,15294.423453
std,9.523425,2.275053,1713.14156
min,1.0,0.42,12346.0
25%,2.0,1.25,13969.0
50%,6.0,1.95,15159.0
75%,12.0,3.75,16795.0
max,36.0,8.5,18287.0
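
The following sketch illustrates the capping idea on a hypothetical price series with one extreme value. The helper `cap_series` is an assumption made for this example: unlike cap_outliers, it returns a new Series instead of modifying the DataFrame in place.

```python
import pandas as pd

def cap_series(values: pd.Series, lower_q: float = 0.05, upper_q: float = 0.95) -> pd.Series:
    """Return a copy with values clipped to the [lower_q, upper_q] quantile range."""
    lower = values.quantile(lower_q)
    upper = values.quantile(upper_q)
    return values.clip(lower=lower, upper=upper)

# Hypothetical prices: nineteen ordinary values and one extreme value.
prices = pd.Series([1.0] * 10 + [2.0] * 9 + [1000.0])
capped = cap_series(prices)

print(len(prices), len(capped))     # the row count is unchanged; nothing is dropped
print(prices.max(), capped.max())   # the extreme value is pulled down to the 95th percentile
```

Because capping replaces values rather than deleting rows, the count in the describe() output stays the same before and after the call.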


## 6. Creating Our RFM Dataset (Recency, Frequency, Monetary)

After we've completed the data preprocessing phase, the next crucial step is to construct an RFM (Recency, Frequency, Monetary) dataset. But what exactly do these terms mean?

- **Frequency**: The number of repeat purchases a customer has made. More precisely, it is the number of distinct time periods in which the customer purchased, minus one for the first purchase. With daily periods, a customer who bought on three different days has a frequency of 2, and several invoices on the same day count once.

- **Recency**: The customer's age at their most recent purchase, calculated as the duration between the first purchase and the latest purchase. A customer with a single purchase has a recency of 0.

- **T**: The customer's age in the chosen time unit (days in this notebook, the default of the function used below), calculated as the duration between the first purchase and the end of the period being studied.

- **Monetary Value**: The average value of a customer's purchases, that is, the sum of the purchase values divided by the number of purchases counted. By default, lifetimes leaves the first purchase out of this average, which is why customers with a frequency of 0 show a monetary value of 0 in the output below.

In essence, by constructing the RFM dataset, we're quantifying customer behavior in terms of how recently they made a purchase, how frequently they make purchases, the total duration of their engagement, and the average value of their purchases. This dataset serves as a valuable foundation for various customer segmentation and analysis techniques.
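
The definitions above can be sketched by hand with plain pandas. This is an illustration on hypothetical transactions (the customer IDs and dates are invented), not the Lifetimes implementation, and it assumes daily periods with dates that carry no time component.

```python
import pandas as pd

# Hypothetical transactions: customer 1 buys twice on the same day and once later,
# customer 2 buys only once.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-03-02", "2011-06-01"]
    ),
})
period_end = pd.Timestamp("2011-12-09")

# Collapse to one row per customer per purchase day, since frequency counts periods.
days = tx.drop_duplicates(["CustomerID", "InvoiceDate"])
grouped = days.groupby("CustomerID")["InvoiceDate"]

summary = pd.DataFrame({
    "frequency": grouped.count() - 1,                     # repeat purchase days
    "recency": (grouped.max() - grouped.min()).dt.days,   # first to latest purchase
    "T": (period_end - grouped.min()).dt.days,            # first purchase to period end
})
print(summary)
```

Customer 1 ends up with a frequency of 1 even though there are three invoice rows, and customer 2 has a frequency and recency of 0 with a positive T.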

The code below uses the summary_data_from_transaction_data function from the Lifetimes library to compute frequency, recency, T, and monetary value for each customer from the transaction data. The resulting RFM dataset serves as the basis for further analysis, such as predictive modeling of customer lifetime value and customer segmentation.


In [9]:
data['Total Price'] = data['UnitPrice'] * data['Quantity']
RFM = lifetimes.utils.summary_data_from_transaction_data(data,'CustomerID','InvoiceDate','Total Price',observation_period_end='2011-12-09')


In [10]:
RFM.head()

CustomerID,frequency,recency,T,monetary_value
12346.0,0.0,0.0,325.0,0.0
12347.0,6.0,365.0,367.0,550.57
12348.0,3.0,283.0,358.0,116.126667
12349.0,0.0,0.0,18.0,0.0
12350.0,0.0,0.0,310.0,0.0


In [11]:
# Keep only customers with more than one repeat purchase (frequency > 1)
RFM = RFM[RFM['frequency'] > 1].copy()
RFM.head()

CustomerID,frequency,recency,T,monetary_value
12347.0,6.0,365.0,367.0,550.57
12348.0,3.0,283.0,358.0,116.126667
12352.0,6.0,260.0,296.0,192.84
12356.0,2.0,303.0,325.0,226.08
12359.0,3.0,274.0,331.0,1495.65


## 7. Frequency/Recency Analysis using the BG/NBD Model 
This technique gives insight into customer behavior, specifically how often customers purchase and how recently they last purchased. It uses the Lifetimes library and the Beta-Geometric/Negative Binomial Distribution (BG/NBD) model.

**Analogy:** Think of this analysis as if you were observing a group of friends who regularly visit a coffee shop. You want to understand how often each friend visits (frequency) and how long it has been since their last visit (recency). By doing so, you can identify which friends are the most loyal and active customers of the coffee shop.

**Fit the BG/NBD Model:** The model describes each customer by the number of repeat purchases and the timing of those purchases. It is fitted on three columns of the RFM dataset:

- RFM['frequency']: The number of repeat purchases each customer has made.
- RFM['recency']: The time between each customer's first and most recent purchase.
- RFM['T']: The time between each customer's first purchase and the end of the observation period.

In [12]:
bgf = BetaGeoFitter(penalizer_coef=0.0)  # Create a BG/NBD model instance
bgf.fit(RFM['frequency'], RFM['recency'], RFM['T'])  # Fit the model with your data


<lifetimes.BetaGeoFitter: fitted with 1916 subjects, a: 0.02, alpha: 113.56, b: 13.28, r: 2.44>

**Predict Future Purchases:** Once the model is fitted, it can estimate how many purchases each customer is likely to make over a future horizon.

## 8. Expected Number of Purchases within 6 Months
The predict method of the fitted BG/NBD model returns the expected number of purchases per customer. Its first argument, the horizon, uses the same time unit as 'T' and 'recency', which is days in this dataset. The call below passes 6, so the figures shown are expectations for the next 6 days; a horizon of about six months would be roughly 180.

In [13]:
# Predict future customer transactions; the horizon (6) is in the same unit as T, i.e. days here
predicted_purchases = bgf.predict(6, RFM['frequency'], RFM['recency'], RFM['T'])

# Display the predicted purchases for each customer
RFM['predicted_purchases'] = predicted_purchases
print(RFM[['frequency', 'recency', 'T', 'predicted_purchases']].head())


            frequency  recency      T  predicted_purchases
CustomerID                                                
12347.0           6.0    365.0  367.0             0.105275
12348.0           3.0    283.0  358.0             0.069025
12352.0           6.0    260.0  296.0             0.123399
12356.0           2.0    303.0  325.0             0.060655
12359.0           3.0    274.0  331.0             0.073253
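
Because 'T' and 'recency' are measured in days here, the horizon passed to predict is in days as well. A rough way to check this is to multiply the posterior mean purchase rate, (r + frequency) / (alpha + T), by the horizon. This is an approximation that ignores customer dropout, not the full BG/NBD expectation, and the function name is hypothetical. The parameters r and alpha are the rounded values from the model summary printed after fitting.

```python
# Fitted parameters as printed in the model summary (rounded to two decimals).
r, alpha = 2.44, 113.56

def approx_expected_purchases(t: float, frequency: float, T: float) -> float:
    """Posterior mean purchase rate times the horizon t, ignoring dropout."""
    return (r + frequency) / (alpha + T) * t

# Customer 12347: frequency 6, T = 367 days.
six_days = approx_expected_purchases(6, 6, 367)
six_months = approx_expected_purchases(180, 6, 367)  # about six months, in days

print(round(six_days, 3))    # close to the 0.105 reported for this customer
print(round(six_months, 3))
```

The six-day figure matches the predicted_purchases column closely, which supports reading the horizon as days rather than months.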


## 9. Gamma-Gamma Model
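The Gamma-Gamma model estimates each customer's expected average transaction value from their observed frequency and average spend. It assumes that how much a customer spends per transaction is independent of how often they purchase.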

In [14]:
# Fit the Gamma-Gamma model to your data (assuming you have 'monetary_value' column)
ggf = GammaGammaFitter(penalizer_coef=0.0)  # Adjust the penalizer_coef if needed
ggf.fit(RFM['frequency'], RFM['monetary_value'])

# Predict the expected average transaction value for each customer
predicted_monetary_value = ggf.conditional_expected_average_profit(RFM['frequency'], RFM['monetary_value'])

# Store the predicted average transaction value for each customer
RFM['predicted_monetary_value'] = predicted_monetary_value
print(RFM[['frequency', 'monetary_value', 'predicted_monetary_value']].head())

            frequency  monetary_value  predicted_monetary_value
CustomerID                                                     
12347.0           6.0      550.570000                491.654326
12348.0           3.0      116.126667                234.757676
12352.0           6.0      192.840000                246.672131
12356.0           2.0      226.080000                305.858551
12359.0           3.0     1495.650000                953.088382
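
The predictions above sit between each customer's observed average and a population-wide average. The sketch below shows why, using the Gamma-Gamma conditional expectation written as a weighted average. The parameter values p, q, and v are hypothetical and are not the ones fitted in this notebook; the function name is also an assumption.

```python
def gamma_gamma_expected_value(frequency, monetary_value, p, q, v):
    """Conditional expected average transaction value under the Gamma-Gamma model."""
    population_mean = p * v / (q - 1)                  # requires q > 1
    weight = p * frequency / (p * frequency + q - 1)   # grows with repeat purchases
    return (1 - weight) * population_mean + weight * monetary_value

# Hypothetical parameters, NOT the fitted ones: population mean = 2 * 300 / 2 = 300.
p, q, v = 2.0, 3.0, 300.0

few_purchases = gamma_gamma_expected_value(3, 116.13, p, q, v)
many_purchases = gamma_gamma_expected_value(30, 116.13, p, q, v)
print(few_purchases, many_purchases)
```

With few repeat purchases the estimate is pulled strongly toward the population mean, and with many purchases it stays close to the customer's own average. The same pattern is visible in the output above, where customer 12348's low observed average is raised and customer 12359's high observed average is lowered.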


## 10. Predicting CLV for the Next 6 Months
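To predict Customer Lifetime Value (CLV), multiply each customer's expected number of purchases from the BG/NBD model by their expected average transaction value from the Gamma-Gamma model. The result covers the same horizon as 'predicted_purchases'.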

In [15]:
# Calculate the CLV prediction for each customer
RFM['predicted_CLV'] = RFM['predicted_purchases'] * RFM['predicted_monetary_value']

# Display the predicted CLV for each customer
print(RFM[['predicted_CLV']].head())


            predicted_CLV
CustomerID               
12347.0         51.758707
12348.0         16.204207
12352.0         30.439010
12356.0         18.551746
12359.0         69.816407
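
The product above treats revenue in every future period as equally valuable. One common refinement is to discount later periods. The sketch below shows that idea with hypothetical inputs; the function name, the monthly expected purchases, and the 1% monthly discount rate are all assumptions for illustration, not values from this notebook.

```python
def discounted_clv(purchases_per_month, value_per_purchase, monthly_discount_rate=0.01):
    """Sum expected monthly revenue, discounting month i by (1 + rate) ** i."""
    return sum(
        n * value_per_purchase / (1 + monthly_discount_rate) ** i
        for i, n in enumerate(purchases_per_month, start=1)
    )

# Hypothetical customer: 0.5 expected purchases per month for six months, 100 per purchase.
monthly = [0.5] * 6
undiscounted = sum(monthly) * 100.0
discounted = discounted_clv(monthly, 100.0)

print(undiscounted)
print(round(discounted, 2))
```

With a discount rate of 0 the two figures coincide, so the simple product used in this notebook can be read as the undiscounted special case.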


## 11. Segmenting CLV into Different Groups
Customers can be segmented by CLV in several ways, such as quantiles or clustering algorithms. The code below splits customers into 4 equally sized groups (quartiles) of predicted CLV, from 'Hibernating' (lowest) to 'Champions' (highest). The number of groups and the segmentation method can be adjusted as needed.




In [16]:
# Segment customers into quantiles based on predicted CLV
RFM['Segment'] = pd.qcut(RFM['predicted_CLV'], q=4, labels=['Hibernating', 'Need Attention', 'Loyal Customers', 'Champions'])

# Display the CLV quantile for each customer
RFM.head()


CustomerID,frequency,recency,T,monetary_value,predicted_purchases,predicted_monetary_value,predicted_CLV,Segment
12347.0,6.0,365.0,367.0,550.57,0.105275,491.654326,51.758707,Champions
12348.0,3.0,283.0,358.0,116.126667,0.069025,234.757676,16.204207,Hibernating
12352.0,6.0,260.0,296.0,192.84,0.123399,246.672131,30.43901,Need Attention
12356.0,2.0,303.0,325.0,226.08,0.060655,305.858551,18.551746,Hibernating
12359.0,3.0,274.0,331.0,1495.65,0.073253,953.088382,69.816407,Champions
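
As a small illustration of how pd.qcut assigns the labels, the sketch below segments eight hypothetical CLV values (invented for this example) into the same four groups.

```python
import pandas as pd

# Hypothetical predicted CLV values for eight customers.
clv = pd.Series([5, 10, 15, 20, 25, 30, 35, 40], name="predicted_CLV")
labels = ["Hibernating", "Need Attention", "Loyal Customers", "Champions"]

# q=4 cuts at the 25th, 50th, and 75th percentiles; labels go from lowest to highest bin.
segments = pd.qcut(clv, q=4, labels=labels)
print(segments.value_counts().sort_index())
```

Each label receives two customers here because the values are distinct. With many tied values the quantile edges can coincide, in which case pd.qcut raises an error by default, so heavily tied data may need a different binning approach.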


#### Grouped Data:

After segmenting customers, you can analyze and compare their behavior and characteristics within each segment. Group your dataset by the 'Segment' column:

In [17]:
RFM.groupby('Segment').mean()

Segment,frequency,recency,T,monetary_value,predicted_purchases,predicted_monetary_value,predicted_CLV
Hibernating,2.887265,236.34238,314.776618,184.460644,0.0741,272.986569,19.634967
Need Attention,3.995825,240.048017,282.870564,288.097725,0.097324,315.073218,28.962636
Loyal Customers,5.087683,224.202505,255.139875,387.433829,0.124657,368.076314,42.133573
Champions,12.14405,265.471816,279.501044,565.476698,0.215625,501.891978,104.77734


## 12. Final Analysis
After segmenting our customers by CLV, we can:

- Offer specific products to each segment.
- Create a marketing plan to increase CLV for the lower segment.
- Focus retention efforts on the higher segments, since keeping high-value customers typically costs less than acquiring new ones.
