In [1]:
from modules.data_preprocessing import PandasDataPreprocessing
from modules.exploratory_data_analyst import PandasEDA
from modules.data_analysis import PandasCustomerSegmentation as PandasCS
from modules.feature_engineering import PandasFeatureEngineering as PandasFE

# Data Preprocessing

---
## Business Context
Due to the small scale of the dataset, we use Pandas for data preprocessing. This process includes:

- Loading the data
- Checking for missing values
- Verifying and converting data types as necessary, particularly converting date columns to datetime type
- Assigning ProductID to transaction data
- Using UUID (Universally Unique Identifier) to uniquely identify each transaction
- Calculating RFM (Recency, Frequency, Monetary) values

All data preprocessing is handled by the `PandasDataPreprocessing` class in the `modules/data_preprocessing.py` file. The class is constructed with `__init__(self, product_file, transaction_file)`, and its `run()` method returns the preprocessed transaction data and the calculated RFM data.
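
The internals of `PandasDataPreprocessing` are not shown in this notebook, so the following is only a minimal sketch of the steps listed above on a toy table. The column names (`CustomerID`, `Date`, `Amount`, `TransactionID`) and the choice of reference date are assumptions made for illustration, not the actual schema.

```python
import uuid

import pandas as pd

# Toy transactions; column names are hypothetical, not the real schema.
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "Date": ["2024-01-05", "2024-03-01", "2024-02-10",
             "2024-01-20", "2024-02-25", "2024-03-10"],
    "Amount": [20.0, 35.0, 15.0, 10.0, 25.0, 30.0],
})

# Count missing values per column.
missing = transactions.isna().sum()

# Convert the date column from strings to datetime.
transactions["Date"] = pd.to_datetime(transactions["Date"])

# Give every transaction a unique identifier.
transactions["TransactionID"] = [str(uuid.uuid4()) for _ in range(len(transactions))]

# Assumed reference date: one day after the last transaction in the data.
snapshot = transactions["Date"].max() + pd.Timedelta(days=1)

# Recency: days since the last purchase; Frequency: number of purchases;
# Monetary: total amount spent.
rfm = transactions.groupby("CustomerID").agg(
    Recency=("Date", lambda d: (snapshot - d.max()).days),
    Frequency=("Date", "count"),
    Monetary=("Amount", "sum"),
)
print(missing)
print(rfm)
```

Measuring recency against a date just after the last transaction keeps every recency value positive; a fixed reporting date would work equally well.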

In [2]:
preprocessor = PandasDataPreprocessing(
    'data/Products_with_Prices.csv',
    'data/Transactions.csv',
)
df_transaction, df_rfm = preprocessor.run()

In [3]:
df_transaction

In [4]:
df_rfm

# Exploratory Data Analysis

---
## EDA
The purpose of exploratory data analysis (EDA) is to gain a comprehensive understanding of the dataset. The EDA process includes:

- Describing and summarizing the data
- Examining the distribution of data
- Analyzing correlations between variables

## Method
The EDA is performed using the `PandasEDA` class in the `modules/exploratory_data_analyst.py` file. The class is initialized with a dataframe and includes methods to:

- Plot data distributions
- Analyze correlations
- Give detailed data descriptions
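
As a rough illustration of what these methods might compute, the sketch below uses plain pandas calls on a toy RFM table. It does not reproduce `PandasEDA`; the data values are invented, and plotting is left out so that the example depends only on pandas.

```python
import pandas as pd

# Toy RFM table; the values are invented for illustration.
df = pd.DataFrame({
    "Recency": [10, 30, 1, 45, 7],
    "Frequency": [2, 1, 3, 1, 4],
    "Monetary": [55.0, 15.0, 65.0, 20.0, 80.0],
})

# Summary statistics: count, mean, std, min, quartiles, max.
summary = df.describe()

# Skewness gives a numeric hint about the shape of each distribution.
skewness = df.skew()

# Pairwise Pearson correlations between the variables.
correlations = df.corr()

print(summary)
print(skewness)
print(correlations)
```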

## Analysis Insights
- Customer Activity: Most customers exhibit low recency values, indicating high activity levels and strong customer retention. This suggests the company has a robust retention strategy, reducing churn among high-value customers.
- Purchase Frequency: Customer purchase frequency is relatively low, implying long intervals between purchases. To boost revenue, the company should focus on enhancing the customer experience to encourage more frequent purchases. This strategy leverages existing customer relationships, which is more cost-effective and less risky than acquiring new customers or increasing prices.
- Monetary Value: The monetary value distribution is roughly bell-shaped with a slight left skew, and most customers spent around $50 during the research period.
- Data Correlation: The variables exhibit low correlation, suggesting little linear dependence between them. Low multicollinearity would benefit a regression model, because the effect of each variable can be estimated more reliably. However, for customer segmentation, a simple rule-based approach will be used instead of a regression model.

In [5]:
# EDA for RFM
eda_rfm = PandasEDA(df_rfm)
eda_rfm.run_all()

In [6]:
# EDA for transaction
eda_trans = PandasEDA(df_transaction)
eda_trans.plot_distributions()

# Data Analysis

---
## Customer Segmentation
The data analysis focuses on segmenting customers based on their RFM (Recency, Frequency, Monetary) scores. This segmentation helps in understanding customer behavior and targeting each group appropriately. The customer segments include:

- VIP: Customers with high recent activity, high purchase frequency, and high spending.
- Loyalist: Customers with frequent purchases and moderate to high spending.
- Big Spender: Customers with high spending.
- Potential Loyalist: Customers with recent activity, occasional purchases, and moderate spending.
- New Customer: Customers with recent activity but low purchase frequency.
- At Risk: Customers with low recent activity, moderate to high purchase frequency, and spending.
- Hibernating: Customers with low recent activity and occasional purchases.
- Lost: Customers with low recent activity and infrequent purchases.

The segmentation is implemented in the `segment_customers` method of the `PandasCustomerSegmentation` class. The method uses the following rules:

```python
('VIP',
 (self.df_rfm['Recency_Quintile'] == 1) &
 (self.df_rfm['Frequency_Quintile'] == 5) &
 (self.df_rfm['Monetary_Quintile'] == 5)),

('Loyalist',
 (self.df_rfm['Recency_Quintile'] <= 2) &
 (self.df_rfm['Frequency_Quintile'] == 5) &
 (self.df_rfm['Monetary_Quintile'] >= 3)),

('Big Spender',
 (self.df_rfm['Monetary_Quintile'] == 5)),

('New Customer',
 (self.df_rfm['Recency_Quintile'] == 1) &
 (self.df_rfm['Frequency_Quintile'] <= 2)),

('At Risk',
 (self.df_rfm['Recency_Quintile'] == 5) &
 (self.df_rfm['Frequency_Quintile'] >= 3) &
 (self.df_rfm['Monetary_Quintile'] >= 3)),

('Hibernating',
 (self.df_rfm['Recency_Quintile'] == 4) &
 (self.df_rfm['Frequency_Quintile'] >= 2)),

('Lost',
 (self.df_rfm['Recency_Quintile'] == 5) &
 (self.df_rfm['Frequency_Quintile'] <= 2))
```
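
The rules above can be sketched end to end on a toy table. How `segment_customers` computes the quintiles and resolves overlapping rules is not shown in this notebook, so the sketch makes three assumptions: quintiles come from `pd.qcut` on ranked values, the first matching rule wins (via `np.select`), and customers matching no rule fall back to "Average Customer".

```python
import numpy as np
import pandas as pd

# Toy RFM values, invented for illustration.
rfm = pd.DataFrame({
    "Recency":   [1, 3, 5, 8, 12, 20, 30, 45, 60, 90],
    "Frequency": [10, 9, 1, 7, 2, 6, 5, 4, 3, 1],
    "Monetary":  [100, 90, 10, 70, 20, 60, 50, 40, 30, 5],
})

# Quintile 1 holds the smallest values, so Recency_Quintile == 1 means
# "most recent". Ranking first breaks ties so qcut gets unique bin edges.
for col in ["Recency", "Frequency", "Monetary"]:
    ranks = rfm[col].rank(method="first")
    rfm[f"{col}_Quintile"] = pd.qcut(ranks, 5, labels=[1, 2, 3, 4, 5]).astype(int)

r = rfm["Recency_Quintile"]
f = rfm["Frequency_Quintile"]
m = rfm["Monetary_Quintile"]

# Same conditions as the rules above, in the same order.
rules = [
    ("VIP", (r == 1) & (f == 5) & (m == 5)),
    ("Loyalist", (r <= 2) & (f == 5) & (m >= 3)),
    ("Big Spender", m == 5),
    ("New Customer", (r == 1) & (f <= 2)),
    ("At Risk", (r == 5) & (f >= 3) & (m >= 3)),
    ("Hibernating", (r == 4) & (f >= 2)),
    ("Lost", (r == 5) & (f <= 2)),
]

# np.select returns the choice of the first condition that is true.
# The fallback label is an assumption, not taken from the rules above.
rfm["Segment"] = np.select(
    [condition for _, condition in rules],
    [name for name, _ in rules],
    default="Average Customer",
)
print(rfm[["Recency_Quintile", "Frequency_Quintile", "Monetary_Quintile", "Segment"]])
```

Because several rules overlap (every VIP also satisfies the Big Spender condition, for example), the order of the rules determines the outcome under a first-match policy.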

## Analysis Insights
### Customers
- VIP Customers: These highly valuable customers exhibit high activity, purchase frequency, and spending. This segment comprises 179 customers (4.6% of the total 3,898 customers) and contributes 10% of the total spending. The company should focus on retaining these customers through personalized services and offers. Because the segment is small, personalized services are manageable, but over-servicing with excessive exclusive privileges should be avoided, as it could reduce profitability.

- Main Customer Segments: Big spenders and average customers form the core of the company’s revenue, contributing 32.2% and 22.2% respectively. The company should focus on retaining these customers through excellent customer service and sales events. Implementing a loyalty program could encourage repeat purchases. Personalized services should be avoided due to the large number of customers in these segments, as such campaigns could become prohibitively expensive.

- At Risk, Hibernating, and Lost Customers: These segments, accounting for 20.48% of the customer base, are at risk of churning. The company should focus on re-engaging these customers with personalized offers and incentives, as well as implementing a win-back campaign to encourage their return. Also, a customer feedback system could help understand the reasons behind their potential departure and identify retention strategies. Although these customers are not significant spenders, a win-back campaign could still prove beneficial in retaining some of them.

- New Customers: Comprising only 2.8% of the customer base, the number of new customers is relatively low. The company faces a choice between retaining and getting more value from existing customers or acquiring new ones. This decision should be based on the financial situation. If revenue is declining, a win-back strategy is advisable, as acquiring new customers is costlier and less likely to result in long-term retention. If revenue is growing, the company might consider focusing on acquiring new customers.
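
The customer and spending shares quoted above are simple ratios per segment. The sketch below shows one way to compute them with a pandas `groupby`; the table and its column names are invented for illustration, and only the final check uses figures from the text (179 VIP customers out of 3,898).

```python
import pandas as pd

# Toy customer table; values and column names are hypothetical.
customers = pd.DataFrame({
    "Segment": ["VIP", "Big Spender", "Big Spender", "Lost", "New Customer"],
    "Monetary": [100.0, 80.0, 70.0, 10.0, 40.0],
})

# Number of customers and total spending per segment.
summary = customers.groupby("Segment").agg(
    Customers=("Monetary", "count"),
    Spending=("Monetary", "sum"),
)

# Each segment's share of the customer base and of total spending.
summary["Customer_Share_%"] = 100 * summary["Customers"] / summary["Customers"].sum()
summary["Spending_Share_%"] = 100 * summary["Spending"] / summary["Spending"].sum()
print(summary)

# Check of the VIP share quoted in the text: 179 of 3,898 customers.
vip_share = round(100 * 179 / 3898, 1)
print(vip_share)  # 4.6
```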

## Product Insights
- Most Popular Products: The top-selling products are beef, tropical fruit, and napkins. Tropical fruit shows the strongest recent trend, contributing 5.25% of sales in the last month. The company should focus on promoting these products to boost sales, for example by bundling them to encourage customers to purchase more items together. A loyalty program could further incentivize repeat purchases of these popular products. Additionally, a customer feedback system could provide insights into the reasons behind their popularity and help identify further promotional opportunities.

In [7]:
segmentation = PandasCS(df_rfm, df_transaction)
segmentation.run_all()

In [8]:
df_rfm

# Feature Engineering

---
Label:
- 0: Lost
- 1: Hibernating
- 2: At Risk
- 3: New Customer
- 4: Average Customer
- 5: Loyalist
- 6: Big Spender
- 7: VIP
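
One simple way to produce such labels is an explicit mapping from segment name to integer. The sketch below is not the `PandasFE` implementation, which is not shown here; it only assumes that the encoded column is called `Segment_Encoded`, as in the code cell that follows.

```python
import pandas as pd

# Segment names in the order of the label list above (0 = Lost, 7 = VIP).
segment_order = [
    "Lost", "Hibernating", "At Risk", "New Customer",
    "Average Customer", "Loyalist", "Big Spender", "VIP",
]
segment_to_code = {name: code for code, name in enumerate(segment_order)}

# Toy data, invented for illustration.
df = pd.DataFrame({"Segment": ["VIP", "Lost", "Average Customer", "At Risk"]})

# Map each segment name to its ordinal label.
df["Segment_Encoded"] = df["Segment"].map(segment_to_code)
print(df)
```

An explicit mapping keeps the ordering meaningful (higher label, more valuable segment), which an alphabetical label encoder would not guarantee.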
		          

In [9]:
feature_engineering = PandasFE(df_rfm)
feature_engineering.run()
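# Copy the selected columns so that adding a column to df_ml later
# does not operate on a view of df_rfm (avoids SettingWithCopyWarning).
df_ml = df_rfm[['Recency', 'Frequency', 'Monetary', 'RFM_Score', 'Segment_Encoded']].copy()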

In [10]:
df_ml

# Machine Learning

---
## Model Evaluation and Selection
The machine learning process involves evaluating multiple models to determine the best-performing one.
The evaluation process includes:
- Silhouette score evaluation
- Calinski-Harabasz score evaluation

We only use KMeans, Birch, and Agglomerative Clustering. Due to time constraints, we did not implement heavier clustering algorithms such as DBSCAN and Gaussian Mixture Models.

Result: All three clustering models performed similarly well. KMeans was the best-performing model on the Calinski-Harabasz score and the second-best on the Silhouette score, so it is the model we use going forward.
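
For readers unfamiliar with the two metrics, the sketch below compares the same three algorithms with scikit-learn on synthetic data. It is not the `PandasML.evaluate_models` implementation; the synthetic blobs, the number of clusters, and the column names of the result table are assumptions. Feature scaling is omitted here, although real RFM features on different scales would likely need it.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for the RFM features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Birch": Birch(n_clusters=3),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
}

rows = []
for name, model in models.items():
    labels = model.fit_predict(X)
    rows.append({
        "Model": name,
        # Silhouette: in [-1, 1], higher means better separated clusters.
        "Silhouette": silhouette_score(X, labels),
        # Calinski-Harabasz: ratio of between-cluster to within-cluster
        # dispersion, higher is better.
        "Calinski_Harabasz": calinski_harabasz_score(X, labels),
    })

results = pd.DataFrame(rows).set_index("Model")
print(results)
```

Both metrics need only the data and the predicted labels, which makes them suitable when no ground-truth segments are available.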

In [11]:
from modules.machine_learning import PandasML

In [12]:
pandas_ml = PandasML(df_ml, number_of_clusters=3)
results_df = pandas_ml.evaluate_models()

In [13]:
results_df

In [14]:
pandas_ml.visualize_model_evaluation(results_df)

In [15]:
kmeans_models, silhouette_scores = pandas_ml.elbow_method_kmeans()

The optimal k value is 5
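
A search over k of this kind can be sketched as follows. This is an illustration on synthetic data, not the `elbow_method_kmeans` implementation; the range of k, the dictionaries keyed by k, and the use of the silhouette score to pick a candidate are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the RFM features.
X, _ = make_blobs(n_samples=300, centers=5, n_features=3, random_state=0)

kmeans_models = {}      # fitted model for each k
inertias = {}           # within-cluster sum of squares for each k
silhouette_scores = {}  # silhouette score for each k

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    kmeans_models[k] = model
    inertias[k] = model.inertia_
    silhouette_scores[k] = silhouette_score(X, model.labels_)

# The elbow is usually read from a plot of inertia against k; the
# silhouette score offers a numeric cross-check.
best_k = max(silhouette_scores, key=silhouette_scores.get)
print(inertias)
print(best_k)
```

Inertia always tends to fall as k grows, so the elbow is the point where the improvement flattens rather than the minimum itself.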

In [16]:
kmeans_model = kmeans_models[5]
kmeans_model

In [17]:
# cluster the data
df_ml['Cluster'] = kmeans_model.labels_

In [18]:
df_ml

In [19]:
# visualize the clusters
pandas_ml.visualize_kmeans_clusters(5)

In [20]:
pandas_ml.visualize_hierarchical_clusters()

# Conclusion
- Data Analysis: The customer segmentation analysis identified key customer segments, including VIP customers, loyalists, big spenders, and at-risk customers. These insights can inform targeted marketing strategies to keep high-value customers and re-engage at-risk customers. The analysis highlights that the company should focus on retaining existing customers and getting more value from them.


- Machine Learning: The KMeans clustering and hierarchical clustering models were evaluated to segment customers based on RFM scores. The KMeans clustering model with k=5 was selected as the best-performing model. The clustering results provide actionable insights for customer retention strategies. Both models suggest that the most recent customers, those who visited the shop within the last 100 days, are the most valuable, which is consistent with the customer segmentation analysis.