## 1. Installing Packages
Install the Python packages "lifetimes", "seaborn", and "scikit-learn" with pip. They provide the Customer Lifetime Value models, plotting, and preprocessing tools used in the rest of the notebook.

In [None]:
%pip install lifetimes
%pip install seaborn
%pip install scikit-learn

## 2. Importing Python Libraries
This cell imports the libraries used throughout the notebook: "lifetimes" (including its BetaGeoFitter and GammaGammaFitter models) for Customer Lifetime Value (CLV) analysis, pandas and numpy for data manipulation, datetime for date handling, matplotlib and seaborn for visualization, and MinMaxScaler from scikit-learn for preprocessing.

In [2]:
import lifetimes

import pandas as pd
import numpy as np
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns

from lifetimes import BetaGeoFitter, GammaGammaFitter
from sklearn.preprocessing import MinMaxScaler

## 3. Reading and Understanding Data

In [4]:
data = pd.read_csv('online_retail.csv')
data.head()

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|-----------|-----------|-------------|----------|-------------|-----------|------------|---------|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


The DataFrame has 541,909 rows and 8 columns of mixed types (object, int64, float64). Two columns have missing values: 'Description' (1,454 missing) and 'CustomerID' (135,080 missing). 'InvoiceDate' is stored as object, that is, as plain strings rather than datetimes.
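
The missing-value counts and the string-typed 'InvoiceDate' seen in the `info()` output can also be checked directly. The snippet below is a minimal sketch on a hypothetical three-row `sample` frame (made-up rows that only mimic the column names of the real file), so it runs without `online_retail.csv`; on the real data the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror online_retail.csv.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379"],
    "Description": ["WHITE METAL LANTERN", None, "Discount"],
    "InvoiceDate": [
        "2010-12-01 08:26:00",
        "2010-12-01 08:28:00",
        "2010-12-01 09:41:00",
    ],
    "CustomerID": [17850.0, None, 14527.0],
})

# Count missing values per column (complements the non-null counts from info()).
missing_counts = sample.isna().sum()
print(missing_counts)

# Parse the timestamp strings into a datetime column so date arithmetic works.
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"])
print(sample["InvoiceDate"].dtype)
```

Whether to convert 'InvoiceDate' at this point or later is a design choice; the conversion is shown here only because the `info()` output reports the column as object.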

In [6]:
data.describe()

|       | Quantity | UnitPrice | CustomerID |
|-------|----------|-----------|------------|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |


These statistics summarize the numeric columns 'Quantity', 'UnitPrice', and 'CustomerID'. The negative minimums (-80,995 for 'Quantity' and -11,062.06 for 'UnitPrice') and the large gap between the 75th percentile and the maximum show that the data contains invalid records and extreme outliers that must be handled before analysis. 'CustomerID' is an identifier, so its mean and standard deviation carry no meaning; only its count (406,829 non-null) is informative.
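
One way to follow up on the `describe()` output is to count non-positive values and look at upper percentiles. This is a rough sketch on a hypothetical `sample` frame: the rows are invented, and only the extreme values echo the minimums and maximums reported above.

```python
import pandas as pd

# Hypothetical rows; the extremes are borrowed from the describe() output,
# but the pairing of quantities and prices is made up for illustration.
sample = pd.DataFrame({
    "Quantity": [6, 8, -80995, 80995, 3, 1],
    "UnitPrice": [2.55, 2.75, 1.04, 1.04, -11062.06, 38970.0],
})

numeric_cols = ["Quantity", "UnitPrice"]

# Number of rows with a zero or negative value in each column.
non_positive = (sample[numeric_cols] <= 0).sum()
print(non_positive)

# Median versus the 99th percentile gives a feel for how heavy the upper tail is.
print(sample[numeric_cols].quantile([0.5, 0.99]))
```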

## 4. Data Manipulation
This step cleans the data by removing rows where 'Quantity' or 'UnitPrice' is less than or equal to 0, as well as rows whose 'InvoiceNo' contains "C" (indicating returns). Excluding these invalid or unwanted records ensures that the analysis is based on valid, meaningful transactions.

In [7]:
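# Keep only rows with a positive quantity
data = data[data['Quantity'] > 0]
# Keep only rows with a positive unit price
data = data[data['UnitPrice'] > 0]
# Drop invoices whose number contains "C"; na=False treats missing values as non-matches
data = data[~data['InvoiceNo'].str.contains("C", na=False)]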
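
The three filters can also be combined into a single boolean mask, which makes it easy to check how many rows each condition removes. The following is a sketch on a hypothetical `sample` frame with made-up rows, not the notebook's actual data; the names `valid` and `cleaned` are illustrative.

```python
import pandas as pd

# Hypothetical rows covering each case: valid, "C" invoice, zero price,
# zero quantity, and a missing invoice number.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536400", "536401", None],
    "Quantity": [6, -1, 5, 0, 2],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25, 3.39],
})

# A row is valid when all three conditions hold.
valid = (
    (sample["Quantity"] > 0)
    & (sample["UnitPrice"] > 0)
    & ~sample["InvoiceNo"].str.contains("C", na=False)
)

# .copy() gives an independent frame, avoiding chained-assignment warnings later.
cleaned = sample[valid].copy()
print(cleaned)
```

Note that with `na=False` a missing 'InvoiceNo' counts as "does not contain C", so such a row is kept as long as its quantity and price are positive.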