## 1. Installing Packages
Install the Python packages "lifetimes," "seaborn," and "scikit-learn" with pip. They provide the CLV models, plotting, and preprocessing utilities used in this notebook.

In [None]:
%pip install lifetimes
%pip install seaborn
%pip install scikit-learn

## 2. Importing Python Libraries
This cell imports the libraries used throughout the notebook: "lifetimes" for Customer Lifetime Value (CLV) modeling, pandas and numpy for data manipulation, datetime for date handling, matplotlib and seaborn for visualization, and MinMaxScaler from scikit-learn for preprocessing.

In [2]:
import lifetimes

import pandas as pd
import numpy as np
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns

from lifetimes import BetaGeoFitter, GammaGammaFitter
from sklearn.preprocessing import MinMaxScaler

## 3. Reading and Understanding Data

In [4]:
data = pd.read_csv('online_retail.csv')
data.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


The DataFrame has 541,909 rows and 8 columns with object, int64, and float64 dtypes. 'Description' has 1,454 missing values and 'CustomerID' has 135,080. Note also that 'InvoiceDate' is stored as object (string) rather than as a datetime.
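
The missing-value counts that data.info() implies can also be computed directly. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its values are made up for illustration, not taken from online_retail.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the retail data (not the real file).
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HOTTIE"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Quantity": [6, 1, 2, 8],
})

# Count missing values per column, then express them as a share of all rows.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

Running the same two expressions on `data` would show at a glance which columns need attention before modeling.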

In [6]:
data.describe()

,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


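These statistics summarize the 'Quantity', 'UnitPrice', and 'CustomerID' columns and expose several data-quality problems. Both 'Quantity' and 'UnitPrice' have negative minimums (-80995 and -11062.06), and their maximums (80995 and 38970) are far above the 75th percentiles (10 and 4.13), which points to returns and extreme outliers. The statistics for 'CustomerID' are not meaningful because it is an identifier rather than a quantity; only its count, which shows the missing values, is useful.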

## 4. Data Manipulation
This step keeps only rows where 'Quantity' and 'UnitPrice' are greater than 0 and drops rows whose 'InvoiceNo' contains "C" (indicating returns). Excluding these invalid or unwanted records ensures that the analysis is based on valid, meaningful transactions.

In [7]:
data = data[data['Quantity'] > 0]
data = data[data['UnitPrice'] > 0]
data = data[~data['InvoiceNo'].str.contains("C", na=False)]
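
To see how much each rule removes, the three filters can be written as boolean masks and combined in one step. This is a sketch on a hypothetical toy frame (the `toy` rows are invented for illustration); the `.copy()` at the end is an addition that avoids pandas' SettingWithCopyWarning when columns are modified later:

```python
import pandas as pd

# Hypothetical toy transactions (values invented for illustration).
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381", "536382"],
    "Quantity": [6, -1, 0, 4, 2],
    "UnitPrice": [2.55, 27.50, 1.25, 0.0, 3.39],
})

# One boolean mask per cleaning rule.
is_positive_qty = toy["Quantity"] > 0
is_positive_price = toy["UnitPrice"] > 0
is_return = toy["InvoiceNo"].str.contains("C", na=False)

# Report how many rows each rule flags on its own.
print("non-positive quantity:", (~is_positive_qty).sum())
print("non-positive price:   ", (~is_positive_price).sum())
print("returns:              ", is_return.sum())

# Apply all rules at once and take an independent copy of the result.
cleaned = toy[is_positive_qty & is_positive_price & ~is_return].copy()
print("rows kept:", len(cleaned), "of", len(toy))
```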

We see that there are missing values within CustomerID. Let’s remove any observation without CustomerID.

In [8]:
# Drop rows that have no CustomerID
data = data.dropna(subset=['CustomerID'])

## 4. Handling Outliers
We define a function, cap_outliers, that caps outliers in a given DataFrame column: values below the 5th percentile (q1) are set to the 5th-percentile value, and values above the 95th percentile (q2) are set to the 95th-percentile value. Unlike removing outliers, capping keeps every row but prevents extreme values from disproportionately affecting the statistics, so the results are more representative of the overall data distribution.

In [15]:
# Cap outliers: values outside the [q1, q2] quantile range are replaced
# by the nearest bound. Modifies `dataframe` in place.
def cap_outliers(dataframe, variable, q1=0.05, q2=0.95):
    lower_bound = dataframe[variable].quantile(q1)  # 5th percentile by default
    upper_bound = dataframe[variable].quantile(q2)  # 95th percentile by default
    dataframe[variable] = np.clip(dataframe[variable], lower_bound, upper_bound)

# Cap UnitPrice and Quantity
cap_outliers(data, 'UnitPrice')
cap_outliers(data, 'Quantity')
data.describe()

,Quantity,UnitPrice,CustomerID
count,397884.0,397884.0,397884.0
mean,8.868022,2.675785,15294.423453
std,9.523425,2.275053,1713.14156
min,1.0,0.42,12346.0
25%,2.0,1.25,13969.0
50%,6.0,1.95,15159.0
75%,12.0,3.75,16795.0
max,36.0,8.5,18287.0
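
The effect of capping is easiest to see on a small example. The sketch below uses a hypothetical helper, `cap_to_quantiles`, which applies the same idea as cap_outliers but returns a new Series instead of modifying the DataFrame in place; the `prices` values are invented for illustration:

```python
import pandas as pd

def cap_to_quantiles(series, lower_q=0.05, upper_q=0.95):
    """Return a copy of `series` clipped to its [lower_q, upper_q] quantile range."""
    lower_bound = series.quantile(lower_q)
    upper_bound = series.quantile(upper_q)
    return series.clip(lower=lower_bound, upper=upper_bound)

# Hypothetical prices: one very low value, one very high value, the rest typical.
prices = pd.Series([0.01] + [1.0] * 49 + [3.0] * 49 + [500.0], name="UnitPrice")
capped = cap_to_quantiles(prices)

# The number of rows is unchanged; only the extremes are pulled to the bounds.
print("rows before/after:", len(prices), len(capped))
print("range before:", prices.min(), prices.max())
print("range after: ", capped.min(), capped.max())
print("mean before/after:", prices.mean(), capped.mean())
```

Returning a new Series leaves the input untouched, which makes it easy to compare the distribution before and after capping.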


## 5. Creating Our RFM Dataset (Recency, Frequency, Monetary)

After we've completed the data preprocessing phase, the next crucial step is to construct an RFM (Recency, Frequency, Monetary) dataset. But what exactly do these terms mean?

- **Frequency**: The number of repeat purchases a customer has made. More precisely, it is the number of time periods in which the customer made a purchase, minus one for the first purchase. With days as the unit, it is the number of distinct days on which the customer bought something, minus one, so several invoices on the same day count once.

- **Recency**: The customer's age at their most recent purchase, that is, the time between their first purchase and their latest purchase. A customer with a single purchase has a recency of 0.

- **T**: The customer's age in the chosen time unit (days in this notebook). It is the time between the customer's first purchase and the end of the observation period.

- **Monetary Value**: The average value of a customer's purchases, that is, the total spend divided by the number of purchases counted. By default, summary_data_from_transaction_data excludes the first purchase and averages over repeat purchases only, which is why customers with a frequency of 0 have a monetary value of 0 in the table below.

In essence, by constructing the RFM dataset, we're quantifying customer behavior in terms of how recently they made a purchase, how frequently they make purchases, the total duration of their engagement, and the average value of their purchases. This dataset serves as a valuable foundation for various customer segmentation and analysis techniques.
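
To make these definitions concrete, the four metrics can be worked out by hand with plain pandas. This is a sketch under stated assumptions rather than the library's implementation: the `tx` transactions and `period_end` are invented, days are the time unit, and the monetary value averages repeat purchase days only (the convention assumed here for the library default):

```python
import pandas as pd

# Hypothetical transactions: customer 1 buys on three days, customer 2 on one.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01", "2011-01-01", "2011-01-11", "2011-01-31", "2011-01-21",
    ]),
    "Total Price": [10.0, 5.0, 20.0, 40.0, 7.0],
})
period_end = pd.Timestamp("2011-02-10")  # assumed end of the observation period

# Collapse to one row per customer per purchase day (days are the time unit).
daily = (
    tx.assign(day=tx["InvoiceDate"].dt.normalize())
      .groupby(["CustomerID", "day"], as_index=False)["Total Price"].sum()
)

days = daily.groupby("CustomerID")["day"]
first_day = days.min()
last_day = days.max()

rfm = pd.DataFrame({
    "frequency": days.nunique() - 1,            # purchase days minus the first
    "recency": (last_day - first_day).dt.days,  # first to latest purchase
    "T": (period_end - first_day).dt.days,      # first purchase to period end
})

# Average spend per repeat purchase day; customers without repeats get 0.
is_repeat_day = daily["day"] > daily["CustomerID"].map(first_day)
rfm["monetary_value"] = (
    daily[is_repeat_day].groupby("CustomerID")["Total Price"].mean()
    .reindex(rfm.index, fill_value=0.0)
)
print(rfm)
```

Customer 1 buys twice on 2011-01-01, but that day counts once, so three purchase days give a frequency of 2. Customer 2 has a single purchase day and therefore a frequency, recency, and monetary value of 0.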

The code below builds the RFM summary from the transaction data using the Lifetimes library. It first computes the value of each line item as 'UnitPrice' times 'Quantity', then passes the transactions to summary_data_from_transaction_data, which returns the frequency, recency, T, and monetary value for each customer, with the observation period ending on 2011-12-09. The resulting dataset is the basis for further analysis, such as predictive modeling of customer lifetime value and customer segmentation.


In [18]:
data['Total Price'] = data['UnitPrice'] * data['Quantity']
RFM = lifetimes.utils.summary_data_from_transaction_data(
    data,
    'CustomerID',
    'InvoiceDate',
    'Total Price',
    observation_period_end='2011-12-09',
)


In [21]:
RFM.head()

CustomerID,frequency,recency,T,monetary_value
12346.0,0.0,0.0,325.0,0.0
12347.0,6.0,365.0,367.0,550.57
12348.0,3.0,283.0,358.0,116.126667
12349.0,0.0,0.0,18.0,0.0
12350.0,0.0,0.0,310.0,0.0


In [23]:
# Keep only customers with more than one repeat purchase (frequency > 1),
# i.e. customers who purchased on at least three different days
RFM = RFM[RFM['frequency'] > 1]
RFM.head()

CustomerID,frequency,recency,T,monetary_value
12347.0,6.0,365.0,367.0,550.57
12348.0,3.0,283.0,358.0,116.126667
12352.0,6.0,260.0,296.0,192.84
12356.0,2.0,303.0,325.0,226.08
12359.0,3.0,274.0,331.0,1495.65
