# Problem Description
***alles um tier GmbH*** is a pet supplies company. They are currently auditing their promotional activities, and the CEO, one of the main stakeholders, feels that the promotions they offer are too generic and not targeted. They have asked us to devise a customer segmentation model that they can use to run targeted promotional activities.

The client is interested in seeing what kind of customers are buying at ***alles um tier GmbH***. They assume that, in addition to private individuals, there are also smaller companies that purchase from ***alles um tier GmbH***. The project scope is to build a segmentation model and analyze the resulting customer segments.

# Data

You are given a customer-level dataset with the following data points, all measured over the past year: number of transactions (*num_transactions*), total order amount (*total_order_value*), days between transactions (*days_between_trans*), re-order rate (*repeat_share*), and percentage of dog products bought (*dog_share*).

### Data Set
The dataset consists of 100k rows and has the following columns:

* CustomerID (int): UUID for the customer
* num_transactions (int): number of transactions in a given year
* total_order_value (float): total order value in € for the time period
* days_between_trans (float): average days between transactions for a user
* repeat_share (float): product share repeated every order
* dog_share (float): percentage of products ordered that are dog food related
    
# Technical Environment
* Python
* numpy
* pandas
* scikit-learn
* matplotlib / scipy / seaborn / altair / plotly

# Approach
The solution is assessed on the following skills:
* A thorough evaluation of the data set using statistical measures and visualization
* Elegant Python coding skills
* Machine learning modelling fundamentals
* Model & result evaluation

# Output
Please provide your solution in a Jupyter notebook with clear markdown comments.
The final output should be in the form of a DataFrame with two columns, the CustomerID and the assigned cluster.

--------------------------------------------------------------------------------------------------------------


# Data Loading and Preprocessing

In [None]:
# Import all needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import altair as alt
import plotly.express as px
from scipy import stats

## Loading the Data

In [None]:
# load the data and make sure to specify the correct delimiter
df = pd.read_csv("DataSet_JuniorCodingChallenge.csv", delimiter='|')
df

## Handling Missing Data

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Check how often two values are missing in one row
print(f"Number of rows with two missing values: {len(df[df.isnull().sum(axis=1) == 2])}")

dropping = len(df[df.isnull().sum(axis=1) == 1])/len(df)
print(f"Percentage of rows with one missing value: {dropping:.2%}")

Each affected row is missing only a single value, so filling in the gaps is feasible. At the same time, only 0.46 % of customers have a missing value, so simply dropping those rows would also be defensible. Because the remaining features of each affected row are intact, we use K-Nearest Neighbors imputation on the numerical columns to fill in the missing entries.
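
One caveat with this approach: `KNNImputer` finds neighbours using a Euclidean distance on the raw values, so a wide-ranging column such as *total_order_value* can dominate the neighbour search. A minimal sketch of scaling before imputing, on a small made-up table (the values and the name `toy` are purely illustrative and not taken from the data set):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing entry per affected row
toy = pd.DataFrame({
    "num_transactions": [2, 3, 20, 22, np.nan],
    "total_order_value": [50.0, 60.0, 900.0, np.nan, 950.0],
    "repeat_share": [0.10, 0.12, 0.55, 0.60, 0.58],
})

# StandardScaler ignores NaNs when fitting and keeps them when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the scaled space, then map back to the original units
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
toy_imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled), columns=toy.columns)

print(toy_imputed.round(2))
```

Whether scaling changes the imputed values noticeably on the real data would need to be checked; with well under 1 % of rows affected, the effect on the segmentation is likely small.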

In [None]:
from sklearn.impute import KNNImputer

# Create a copy and drop the CustomerID column
df_knn = df.drop(columns='CustomerID')

# Use KNN imputation to fill in the missing values
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df_knn), columns=df_knn.columns)

# Confirm that no missing values remain after imputation
print(f"Showcasing the missing values after imputation:\n{df_imputed.isnull().sum()}")

# # Check the newly filled values
# missing_indices = df[df.isnull().sum(axis=1) == 1].index
# df.loc[missing_indices]
# df_imputed.loc[missing_indices]

# Print a statement that the new values were checked and seem reasonable
print("The newly filled entries were double-checked and they look reasonable.")

In [None]:
# Either drop the missing values or use the imputed data
df_drop = df.dropna()

# Set df to the imputed data and add the CustomerID column back to the first column
df = pd.concat([df['CustomerID'], df_imputed], axis=1)
df

## Data Integrity Check

First, we want to make sure that num_transactions holds integer values and that the other numerical features are stored as floats.

In [None]:
# Check the current data types
df.dtypes

In [None]:
# Create a function that counts all entries that are not whole numbers
def find_floats(df, column):
    return int((df[column] % 1 != 0).sum())

# State how many entries need to be rounded in the num_transactions column
print(f"There are {find_floats(df, 'num_transactions')} entries that are not whole numbers and need to be rounded.")

# Round all float values before converting them to integers
df['num_transactions'] = df['num_transactions'].round()

# Double-check with a print statement that all values are rounded now
print(f"There are {find_floats(df, 'num_transactions')} entries that are not whole numbers left.")

# Convert num_transactions to integers
df['num_transactions'] = df['num_transactions'].astype(int)

In [None]:
# Check all types again
print("All data types are correct now.")
print(df.dtypes)

Now we want to make sure that all numerical values are in reasonable ranges.

In [None]:
# Count the negative values in num_transactions, total_order_value and days_between_trans
print(f"Number of negative values for num_transactions: {len(df[df['num_transactions'] < 0])}")
print(f"Number of negative values for total_order_value: {len(df[df['total_order_value'] < 0])}")
print(f"Number of negative values for days_between_trans: {len(df[df['days_between_trans'] < 0])}")

# repeat_share and dog_share should lie between 0 and 1, so count the values outside of this range
print(f"Number of values outside of the range [0, 1] for repeat_share: {len(df[(df['repeat_share'] < 0) | (df['repeat_share'] > 1)])}")
print(f"Number of values outside of the range [0, 1] for dog_share: {len(df[(df['dog_share'] < 0) | (df['dog_share'] > 1)])}")

In [None]:
# Drop the negative values in the 3 columns
df = df[df['num_transactions'] >= 0]
df = df[df['total_order_value'] >= 0]
df = df[df['days_between_trans'] >= 0]

# Drop the values outside of the range 0 and 1 for the last two columns
df = df[(df['repeat_share'] >= 0) & (df['repeat_share'] <= 1)]
df = df[(df['dog_share'] >= 0) & (df['dog_share'] <= 1)]

df
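
The range checks and filters above can also be expressed as one boolean mask, which makes it easy to report how many rows fail before dropping them. A minimal sketch on made-up rows (the `toy` frame and the `valid_mask` helper are illustrative names, not part of the original notebook):

```python
import pandas as pd

def valid_mask(frame: pd.DataFrame) -> pd.Series:
    """Return True for rows whose values lie in the expected ranges."""
    count_cols = ["num_transactions", "total_order_value", "days_between_trans"]
    share_cols = ["repeat_share", "dog_share"]
    non_negative = (frame[count_cols] >= 0).all(axis=1)
    shares_ok = frame[share_cols].apply(lambda s: s.between(0, 1)).all(axis=1)
    return non_negative & shares_ok

# Hypothetical rows: the second has a negative count, the third a share above 1
toy = pd.DataFrame({
    "num_transactions": [3, -1, 5],
    "total_order_value": [40.0, 10.0, 80.0],
    "days_between_trans": [30.0, 12.0, 7.0],
    "repeat_share": [0.2, 0.5, 1.4],
    "dog_share": [0.3, 0.1, 0.6],
})

mask = valid_mask(toy)
print(f"{(~mask).sum()} of {len(toy)} rows fail the range checks")
toy_clean = toy[mask].reset_index(drop=True)
```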

In [None]:
# Find duplicates in the CustomerID column and check whether they are consistent
def find_inconsistent_duplicates(df, object_col):
    # Get the indices of duplicated entries in the object column
    duplicated_indices = df[df.duplicated(subset=[object_col], keep=False)].index

    # List to store the indices of inconsistent duplicates
    inconsistent_indices = []

    # Group by the object column and iterate over each group
    for key, group in df.loc[duplicated_indices].groupby(object_col):
        # Get the first row's data (excluding the object column)
        reference_row = group.iloc[0, 1:].values
        
        # Check if all rows in the group match the first row
        for idx, row in group.iterrows():
            if not (row.iloc[1:].values == reference_row).all():
                inconsistent_indices.append(idx)

    return inconsistent_indices

indices_with_inconsistencies = find_inconsistent_duplicates(df, 'CustomerID')

# Print the number of inconsistent duplicates
print(f"Number of inconsistent duplicates: {len(indices_with_inconsistencies)}. If this is 0, we can simply keep one row per customer.")
df
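
The same consistency check can be written without explicit loops by counting distinct values per customer. A minimal sketch on made-up data (`toy` and `inconsistent_ids` are illustrative names); note that `nunique` ignores missing values by default, which is acceptable here only because the gaps were imputed beforehand:

```python
import pandas as pd

# Hypothetical data: customer 1 is duplicated consistently, customer 2 is not
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "num_transactions": [4, 4, 7, 8, 2],
    "dog_share": [0.3, 0.3, 0.5, 0.5, 0.1],
})

# A customer is inconsistent if any feature takes more than one distinct value
distinct_per_customer = toy.groupby("CustomerID").nunique()
is_inconsistent = (distinct_per_customer > 1).any(axis=1)
inconsistent_ids = distinct_per_customer.index[is_inconsistent].tolist()

print(f"Customers with conflicting duplicate rows: {inconsistent_ids}")
```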


In [None]:
# State how many CustomerIDs are duplicated and that they will be dropped
print(f"There are {df['CustomerID'].duplicated().sum()} duplicated CustomerIDs. They will be dropped.")

# Drop the duplicated CustomerIDs
df = df.drop_duplicates(subset='CustomerID')

# Reset the index
df = df.reset_index(drop=True)
df

In this section, we first loaded the data, checked for missing values and filled them using K-Nearest Neighbors imputation. We then checked data integrity by verifying the data types of the columns, the value ranges, and the uniqueness of the CustomerIDs.

Now the data is ready for further analysis.

# Exploratory Data Analysis **(EDA)**

## Statistical Summary

In [None]:
# Create summary statistics for the data
df.describe()

## Data Visualization

In [None]:
# Create a histogram for each column
df.hist(figsize=(10, 10))
plt.show()

In [None]:
# Work on the feature columns only (CustomerID is an identifier, not a feature)
df_copy = df.drop(columns='CustomerID')

# Creating multiple boxplots, one for each column
plt.figure(figsize=(10, 10))
for i, col in enumerate(df_copy.columns):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=df_copy[col])
    plt.title(col)
plt.tight_layout()
plt.show()

In [None]:
# Creating a correlation matrix and visualize it with a heatmap
corr = df_copy.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

We can see a strong negative correlation between days_between_trans and repeat_share (-0.67). Customers who reorder the same products tend to order more frequently, for example with the same order on a weekly or biweekly basis. This could hint that these customers are small companies rather than private customers, since private customers would probably order less frequently.

There is also a moderate positive correlation between days_between_trans and dog_share (0.41). This suggests that customers who order less frequently have a higher percentage of dog-related products in their orders.
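
Because a handful of extreme customers can dominate a Pearson coefficient, a rank-based correlation is a useful cross-check before reading too much into such values. A minimal sketch on made-up numbers (the `toy` frame is illustrative and not drawn from the data set), where a single extreme point flips the sign of the Pearson coefficient:

```python
import pandas as pd

# Hypothetical data: y falls as x rises, except for one extreme customer
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 100],
    "y": [5, 4, 3, 2, 1, 100],
})

# Pearson works on raw values, Spearman on ranks
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the customer data, `df_copy.corr(method='spearman')` would be the corresponding call; whether it changes the picture here has not been checked.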

## Feature Relationships and Engineering

In [None]:
# Creating pairplots for all numerical features (CustomerID is only an identifier)
sns.pairplot(df.drop(columns='CustomerID'))
plt.show()

There are many interesting relationships here. The plot of the number of transactions against the total order value suggests that customers either order very often for small amounts or order seldom but in large volumes. This could again be a hint that private individuals order more often but in smaller amounts, since they only buy for themselves, while smaller companies restock once every month or quarter.

The plots of the number of transactions and of the total order value against the days between transactions show a similar pattern. Customers who order less frequently have a lower total order value and, unsurprisingly, fewer transactions, which is in line with the hypothesis above. From this we could segment the customers into low-spending infrequent buyers and high-spending frequent buyers. A few customers with an extremely high total order value could be added as a third group; they might represent the corporate clients, which supports the hypothesis given by the company.

Plotting the number of transactions, the total order value and the days between transactions against the repeat share shows these 3 customer segments again. There are two stripes at the bottom of each graph that seem to be different customer segments, and then the outliers that seem to be the corporate clients. To better understand the relationship between the two stripes, we will now drop the outliers from the data set.


### Dropping some of the Outliers in a 1st Step

In [None]:
# Drop the outliers (based on z-scores) from the data in a new copy
df_no_outliers = df.copy()

# Define the columns to check for outliers
columns = ['num_transactions', 'total_order_value', 'days_between_trans', 'repeat_share', 'dog_share']

# Iterate over the columns and remove the outliers
for col in columns:
    # Calculate the z-scores for each value in the column
    z_scores = np.abs(stats.zscore(df_no_outliers[col]))

    # 99.9% confidence interval (2194 outliers)
    outlier_indices = np.where(z_scores > 3.291)[0]

    # 99.5% confidence interval (3523 outliers)
    # outlier_indices = np.where(z_scores > 2.807)[0]

    # 99% confidence interval (4532 outliers)
    # outlier_indices = np.where(z_scores > 2.58)[0]

    # 95% confidence interval (17560 outliers)
    # outlier_indices = np.where(z_scores > 1.96)[0] 

    # 90% confidence interval (29567 outliers)
    # outlier_indices = np.where(z_scores > 1.64)[0]

    # Drop the outliers
    df_no_outliers = df_no_outliers.drop(index=outlier_indices)

    # Reset the index
    df_no_outliers = df_no_outliers.reset_index(drop=True)

# How many outliers were removed
print(f"{len(df) - len(df_no_outliers)} outliers were removed.")

# Visualize the pairplots for the data without outliers (CustomerID is only an identifier)
sns.pairplot(df_no_outliers.drop(columns='CustomerID'))
plt.show()

After trying out different thresholds, we found that the 99.9% level (|z| > 3.291) is already sufficient to drop the necessary outliers, which probably represent corporate clients. They will be characterized more specifically later, but for now we can already see more meaningful visualizations of the different relationships.
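
The cut-offs tried in the cell above are two-sided quantiles of the standard normal distribution, so they correspond to the stated coverage levels only for roughly normal columns, which heavy-tailed features may not be. A short sketch of where the numbers come from, using `scipy.stats.norm.ppf` (the `levels` list simply mirrors the levels tried above):

```python
from scipy import stats

levels = [0.999, 0.995, 0.99, 0.95, 0.90]

# Two-sided threshold: leave (1 - level) / 2 probability in each tail
thresholds = {level: stats.norm.ppf(1 - (1 - level) / 2) for level in levels}

for level, z in thresholds.items():
    print(f"{level:.1%} coverage -> |z| > {z:.3f}")
```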

We can now clearly see that the higher the total order value, the more often the customer orders and the higher the share of repeated orders. Looking at the days between transactions, there is one segment of customers with an extremely long period between transactions; these customers also have a very low number of transactions and a low total order value. They may have just tried out the shop and not been convinced by the product. This segment could potentially be won back with a particularly generous offer.

The repeat share is linearly related to both the number of transactions and the total order value: the more a customer orders (more often and in larger volume), the higher the repeat share. With respect to the days between transactions, we can identify two segments. Customers with few days between transactions have a repeat share that is skewed towards higher values (10-60%), especially those with the shortest intervals, whereas customers who order less frequently tend to have a repeat share of 5-40%.

The dog share seems to be normally distributed across all customers, with a slight skew towards higher percentages for customers who have a lower total order value and fewer transactions. This is a slight sign that the one-time customers who tried out the shop might have been looking for dog-related products, so the generous offer mentioned above should mainly include dog-related products. To get better insight into which products to offer, it would be helpful to have data on more product categories for this segment of customers. One could even build a model that takes the orders made by these customers and estimates which products each of them is most likely interested in, in order to create individually tailored offers.


In [None]:
# Correlation matrix for the data without outliers
corr_no_outliers = df_no_outliers.drop(columns='CustomerID').corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_no_outliers, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix without Outliers')
plt.show()

Some of the things we saw in the pairplots can now be validated further. We can clearly see the very high correlation between the number of transactions and the total order value over the one-year period. For private customers this is reasonable: they often place smaller orders, so the more often they order, the higher their total order value. Corporate customers, by contrast, might order less often but with a high value per order.
 
We can also see that these two features are negatively correlated with the days between transactions: customers who order more often and with a higher total order value have fewer days between orders, which is plausible. The same holds for the repeat share and the days between transactions. Customers who order more frequently have a higher repeat share, meaning they have standard products that they always need in the household and order some new things on the side or try out new products. These customers could be targeted with new products in the shop and special offers that build on the close customer relationship, perhaps even a membership card with special perks such as being the first to order new products.
 
Something that stands out is the dog share of the orders. Regularly ordering customers seem to have a lower share of dog-food-related products: the more often someone orders, the lower their dog share. This information can be used to choose the right products for each customer segment. Low-spending infrequent buyers appear to have a higher dog-food-related share in their orders than high-spending frequent buyers.

### Creating one new Feature out of 2

In [None]:
# Due to the very high linear correlation between num_transactions and total_order_value, we create a combined feature
# that represents the average order value per transaction
df_no_outliers['avg_order_value'] = df_no_outliers['total_order_value'] / df_no_outliers['num_transactions']

# Keep only CustomerID, the new feature and the remaining features; this drops the two source columns
df_no_outliers = df_no_outliers[['CustomerID', 'avg_order_value', 'days_between_trans', 'repeat_share', 'dog_share']]


# Also do this for the original data
df['avg_order_value'] = df['total_order_value'] / df['num_transactions']
# df = df[['CustomerID', 'avg_order_value', 'days_between_trans', 'repeat_share', 'dog_share']]

# Check the new data
df_no_outliers
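
Note that the earlier range check keeps customers with `num_transactions == 0`, for whom this ratio would be infinite or undefined. Whether such rows exist in the data is not shown here; a defensive sketch on made-up values (the `toy` frame and `safe_count` are illustrative names) that yields `NaN` instead:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two of them have no transactions at all
toy = pd.DataFrame({
    "total_order_value": [120.0, 0.0, 45.0],
    "num_transactions": [4, 0, 0],
})

# Keep positive counts and turn everything else into NaN before dividing
safe_count = toy["num_transactions"].where(toy["num_transactions"] > 0)
toy["avg_order_value"] = toy["total_order_value"] / safe_count

print(toy)
```

This matters for the ranking step later on, because an infinite average order value would sort to the very top of a descending ranking.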

### Visualizing the Dataset with new Feature

In [None]:
# Visualize the pairplots for the new data without outliers (CustomerID is only an identifier)
sns.pairplot(df_no_outliers.drop(columns='CustomerID'))
plt.show()

In [None]:
# Correlation matrix for the data without outliers
corr_no_outliers = df_no_outliers.drop(columns='CustomerID').corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_no_outliers, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix without Outliers')
plt.show()

In [None]:
# Only visualize the avg_order_value column in a boxplot, but make two plots left and right.
# One plot with the original data and one without the outliers
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x=df['avg_order_value'])
plt.title('Original Data')

plt.subplot(1, 2, 2)
sns.boxplot(x=df_no_outliers['avg_order_value'])
plt.title('Data without Outliers')
plt.tight_layout()
plt.show()

### Precisely Finding the Outliers (Corporate Customers)

Now that we have a more robust understanding of the data and the newly created feature, we can use that feature to isolate the corporate customers more precisely. We first use a classic approach to identifying outliers, the interquartile range (IQR). Afterwards we look for a precise boundary by hand.

In [None]:
# The avg_order_value column clearly has a lot of outliers, so we use the IQR method to count them.
# The quartiles are taken from the data without outliers; the fences are then applied to the original data.
Q1 = df_no_outliers['avg_order_value'].quantile(0.25)
Q3 = df_no_outliers['avg_order_value'].quantile(0.75)
IQR = Q3 - Q1

# Calculate the number of outliers
outliers = df[(df['avg_order_value'] < (Q1 - 1.5 * IQR)) | (df['avg_order_value'] > (Q3 + 1.5 * IQR))]
print(f"Number of outliers in the avg_order_value column: {len(outliers)}")

In [None]:
# Use the original data and rank the dataset by the avg_order_value column in descending order
df_ranked = df.sort_values(by='avg_order_value', ascending=False).reset_index(drop=True)
df_ranked.head(len(outliers))

Here we can clearly see a boundary: it seems reasonable to split the dataset into the first 96 customers and the rest, which we will from now on assume to be private customers.
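
A visual cut like this can be cross-checked by looking for the largest relative drop between consecutive ranked values. A minimal sketch on made-up numbers (the values and the name `cut` are illustrative; on the real data the position of the drop would still need to be confirmed by eye):

```python
import numpy as np
import pandas as pd

# Hypothetical average order values: a few very large ones and several small ones
values = pd.Series([500.0, 480.0, 450.0, 40.0, 38.0, 35.0, 30.0])

ranked = values.sort_values(ascending=False).to_numpy()

# Ratio of each value to the next smaller one; the biggest ratio marks the gap
ratios = ranked[:-1] / ranked[1:]
cut = int(np.argmax(ratios)) + 1

print(f"Largest drop after the top {cut} customers ({ranked[cut - 1]:.0f} -> {ranked[cut]:.0f})")
```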

### Dividing the Dataset into Corporate and Private Customers

In [None]:
# Create the corporate customer dataframe by using only the first 96 rows
df_corporate = df_ranked.head(96)
df_corporate

In [None]:
# Create the private customer dataframe by using the remaining rows
df_private = df_ranked.iloc[96:]
df_private = df_private.reset_index(drop=True)
df_private

In [None]:
plt.figure(figsize=(12, 12))  # Adjust the figure size for four plots

# First row: Boxplots
plt.subplot(2, 2, 1)
sns.boxplot(x=df_corporate['avg_order_value'])
plt.title('Corporate Customers - Boxplot')

plt.subplot(2, 2, 2)
sns.boxplot(x=df_private['avg_order_value'])
plt.title('Private Customers - Boxplot')

# Second row: Histograms
plt.subplot(2, 2, 3)
sns.histplot(df_corporate['avg_order_value'], bins=20)
plt.title('Corporate Customers - Histogram')

plt.subplot(2, 2, 4)
sns.histplot(df_private['avg_order_value'], bins=20)
plt.title('Private Customers - Histogram')

# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
# Show the correlation matrices side by side: corporate customers on the left, private customers on the right
plt.figure(figsize=(24, 6))
plt.subplot(1, 2, 1)
sns.heatmap(df_corporate.drop(columns=['CustomerID', 'total_order_value','num_transactions']).corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - Corporate Customers')

plt.subplot(1, 2, 2)
sns.heatmap(df_private.drop(columns=['CustomerID', 'total_order_value','num_transactions']).corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - Private Customers')
plt.show()

We are now able to analyse each customer segment more precisely, and we can already see clear differences.

Corporate Customers:

On the left-hand side are the corporate customers, who order less frequently the higher their average order value is, which makes sense. We can also see that corporate customers who order more frequently have a higher repeat share. This is plausible too: they order less per transaction (see the first finding), run out of specific products more quickly and therefore need the same products more often. Finally, companies with a higher average order value seem to have a slightly higher share of dog-related food products. This feature should be analyzed more closely afterwards.


Private Customers:

On the right-hand side are the private customers, who behave very differently when it comes to the average order value and the days between transactions: the higher the average order value, the more frequently the customer orders. It therefore seems reasonable to hypothesize once again that the private customers can be divided into high-spending frequent buyers and low-spending infrequent buyers. This is also supported by the strong linear relationship between the average order value and the repeat share: high-spending frequent buyers also order the same products very often. Interestingly, and unlike the corporate customers, the higher-spending private customers buy fewer dog-food-related products on average.

In [None]:
# Visualize the corporate customers in a 3d scatter plot using the average order value, the days between transactions and the dog share, but color the points by the repeat share
fig = px.scatter_3d(df_corporate, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Corporate Customers - 3D Scatter Plot')
fig.show()


This plot already gives a good feel for the rather small data set of corporate customers. They range from frequently ordering corporates with a rather low average order value and a fairly evenly distributed dog share (from about 3% to 50%) to less frequently ordering corporates that spend more per order and whose dog share shows no clearly separable pattern. There is one corporate with a high spend per order and a very high dog share of about 62%, but also one with only 8%.

In [None]:
# Now the same for the private customers
fig = px.scatter_3d(df_private, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers - 3D Scatter Plot')
fig.show()

In this plot we can see some odd-looking data points with more than 364 days between transactions. Since the data set only covers the past year, this has to be an error, and we look at it more closely next.

Also there is one very interesting customer with a very high dog share percentage and a really high average order value that seems to order every two days. This customer should be analyzed separately.

In [None]:
# Filter out the private customers that have a days_between_trans value of higher than 364 (the maximum)
df_private_hd = df_private[df_private['days_between_trans'] > 364]

# Visualize the private customers with a days_between_trans of more than 364 days in a 3d scatter plot
fig = px.scatter_3d(df_private_hd, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers with more than 364 days between transactions - 3D Scatter Plot')
fig.show()

Apart from their implausible number of days between transactions, the data for these customers seems reasonable. We will therefore set days_between_trans to a reasonable value instead of dropping the rows. To choose that value, we take a closer look at the most extreme data points.

In [None]:
# Visualize the private customers again with the 3d scatter plot but without the customers that have more than 364 days between transactions
fig = px.scatter_3d(df_private[df_private['days_between_trans'] <= 364], x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')
fig.update_layout(width=1600, height=800)
fig.update_layout(title='Private Customers - 3D Scatter Plot')
fig.show()

In [None]:
# Print the number of private customers above several days_between_trans thresholds
for threshold in [364, 240, 236, 235, 234, 233, 1000]:
    count = (df_private['days_between_trans'] > threshold).sum()
    print(f"Number of private customers that have more than {threshold} days between transactions: {count}")

Given these numbers, the previous plot and the total size of the private customer data set, it seems reasonable to set the boundary at either 235 or 236. Moving on, we will set all values above 235 to 235, so that the later segmentation works on a sensible range of values.
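
Capping values at a boundary like this is a form of winsorizing, and pandas offers `Series.clip` for it. A minimal sketch on made-up values (the series and the `upper_bound` name are illustrative):

```python
import pandas as pd

days = pd.Series([12.0, 180.0, 235.0, 236.5, 2300.0])
upper_bound = 235

# Values above the bound are set to the bound; everything else is unchanged
capped = days.clip(upper=upper_bound)

print(capped.tolist())
```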

In [None]:
# Set the number of days between transactions to 235 for the private customers that have more than 235 days
df_private.loc[df_private['days_between_trans'] > 235, 'days_between_trans'] = 235

# Visualize the private customers again with the 3d scatter plot
fig = px.scatter_3d(df_private, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers - 3D Scatter Plot')
fig.show()

This plot now gives us a good insight into how the private customer data set is distributed, and we can already see 3 different segments of customers. For now there are still 3 customers that stand out: they have a high average order value and order about every second day with a very high repeat share. We will filter them out specifically now.

In [None]:
# Filter out the private customers with an average order value above 40
df_private_hao = df_private[df_private['avg_order_value'] > 40]
df_private_hao

These 3 customers should be contacted with an individual marketing strategy. We will drop them from the data set for now and come back to them at the end when recommending a marketing strategy.

In [None]:
# Drop the df_private_hao customers from df_private (copy to avoid chained-assignment warnings later)
df_private = df_private[df_private['avg_order_value'] <= 40].copy()

# Visualize the private customers again with the 3d scatter plot
fig = px.scatter_3d(df_private, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers - 3D Scatter Plot')
fig.show()

In [None]:
# Filter all customers that have a days_between_trans of 235, an average order value above 20 and a repeat share above 0.6
df_private_filtered = df_private[(df_private['days_between_trans'] == 235) & (df_private['avg_order_value'] > 20) & (df_private['repeat_share'] > 0.6)]
df_private_filtered

In [None]:
# Look up both CustomerIDs in the original data
df[df['CustomerID'].isin(df_private_filtered['CustomerID'])]

In [None]:
# Look up both CustomerIDs in the private customer data
df_private[df_private['CustomerID'].isin(df_private_filtered['CustomerID'])]

Looking at these two customers in the data set, it seems more plausible that the decimal separator was misplaced and that both customers actually have about 2.3 days between transactions. Therefore we will change their values accordingly.

In [None]:
# Restore the original days_between_trans for these customers (matched by CustomerID) and move the decimal point 3 places to the left
mask = df_private['CustomerID'].isin(df_private_filtered['CustomerID'])
original_days = df.set_index('CustomerID')['days_between_trans']
df_private.loc[mask, 'days_between_trans'] = df_private.loc[mask, 'CustomerID'].map(original_days) / 1000

In [None]:
# Visualize the private customers again with the 3d scatter plot
fig = px.scatter_3d(df_private, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers - 3D Scatter Plot')
fig.show()

In [None]:
# Save the dataframes to csv files
df_corporate.to_csv('corporate_customers.csv', index=False)
df_private.to_csv('private_customers.csv', index=False)
df.to_csv('cleaned_data.csv', index=False)

Now our data sets and features are finally prepared. In the following we will analyse the clusters in detail and carry out a segmentation analysis.

# Clustering and Segmentation

What was done so far:
...

What we want to do next:
...

Where we want to end up:
...

In [None]:
# Load both dataframes again
df_corporate = pd.read_csv('corporate_customers.csv')
df_private = pd.read_csv('private_customers.csv')
df = pd.read_csv('cleaned_data.csv')

## Corporate Customers

In [None]:
# Visualize the corporate customers
fig = px.scatter_3d(df_corporate, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Corporate Customers - 3D Scatter Plot')
fig.show()


In [None]:
# Visualize the corporate customers
fig = px.scatter_3d(df_corporate, x='total_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Corporate Customers - 3D Scatter Plot')
fig.show()


In [None]:
# df_corporate create a copy of the dataframe and rank it by the column total_order_value
df_corporate_ranked = df_corporate.sort_values(by='total_order_value', ascending=False).reset_index(drop=True)
df_corporate_ranked.iloc[:40]

In [None]:
df_corporate_ranked.iloc[40:]

In [None]:
# Use k-means clustering to cluster the corporate customers into two groups
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Create a copy of the corporate customers dataframe
df_corporate_cluster = df_corporate.copy()

# Drop the CustomerID, total_order_value and num_transactions columns,
# plus any cluster labels left over from a previous run
df_corporate_cluster = df_corporate_cluster.drop(columns=['CustomerID', 'total_order_value', 'num_transactions', 'cluster'], errors='ignore')

# Standardize the data
scaler = StandardScaler()

# Fit and transform the data
df_corporate_cluster_scaled = scaler.fit_transform(df_corporate_cluster)

# Create the KMeans model with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)

# Fit the model
kmeans.fit(df_corporate_cluster_scaled)

# Add the cluster labels to the dataframe
df_corporate['cluster'] = kmeans.labels_

# Visualize the clusters in a 3D scatter plot
fig = px.scatter_3d(df_corporate, x='avg_order_value', y='days_between_trans', z='dog_share', color='cluster')
fig.update_layout(width=1600, height=800)
fig.update_layout(title='Corporate Customers - 3D Scatter Plot with Clusters')
fig.show()


In [None]:
# Use k-means clustering to cluster the corporate customers into two groups
from sklearn.cluster import KMeans

# Create a copy of the corporate customers dataframe
df_corporate_cluster = df_corporate.copy()

# Drop the CustomerID, avg_order_value and num_transactions columns,
# plus the cluster labels from the previous run so they are not used as a feature
df_corporate_cluster = df_corporate_cluster.drop(columns=['CustomerID', 'avg_order_value', 'num_transactions', 'cluster'], errors='ignore')

# Standardize the data
scaler = StandardScaler()

# Fit and transform the data
df_corporate_cluster_scaled = scaler.fit_transform(df_corporate_cluster)

# Create the KMeans model with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)

# Fit the model
kmeans.fit(df_corporate_cluster_scaled)

# Add the cluster labels to the dataframe
df_corporate['cluster'] = kmeans.labels_

# Visualize the clusters in a 3D scatter plot
fig = px.scatter_3d(df_corporate, x='total_order_value', y='days_between_trans', z='dog_share', color='cluster')
fig.update_layout(width=1600, height=800)
fig.update_layout(title='Corporate Customers - 3D Scatter Plot with Clusters')
fig.show()


In [None]:
# Use k-means clustering to cluster the corporate customers into three groups
from sklearn.cluster import KMeans

# Create a copy of the corporate customers dataframe
df_corporate_cluster = df_corporate.copy()

# Drop the CustomerID, total_order_value and num_transactions columns,
# plus the cluster labels from the previous run so they are not used as a feature
df_corporate_cluster = df_corporate_cluster.drop(columns=['CustomerID', 'total_order_value', 'num_transactions', 'cluster'], errors='ignore')

# Standardize the data
scaler = StandardScaler()

# Fit and transform the data
df_corporate_cluster_scaled = scaler.fit_transform(df_corporate_cluster)

# Create the KMeans model with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model
kmeans.fit(df_corporate_cluster_scaled)

# Add the cluster labels to the dataframe
df_corporate['cluster'] = kmeans.labels_

# Visualize the clusters in a 3D scatter plot
fig = px.scatter_3d(df_corporate, x='avg_order_value', y='days_between_trans', z='dog_share', color='cluster')
fig.update_layout(width=1600, height=800)
fig.update_layout(title='Corporate Customers - 3D Scatter Plot with Clusters')
fig.show()


In [None]:
# Use k-means clustering to cluster the corporate customers into three groups
from sklearn.cluster import KMeans

# Create a copy of the corporate customers dataframe
df_corporate_cluster = df_corporate.copy()

# Drop the CustomerID, avg_order_value and num_transactions columns,
# plus the cluster labels from the previous run so they are not used as a feature
df_corporate_cluster = df_corporate_cluster.drop(columns=['CustomerID', 'avg_order_value', 'num_transactions', 'cluster'], errors='ignore')

# Standardize the data
scaler = StandardScaler()

# Fit and transform the data
df_corporate_cluster_scaled = scaler.fit_transform(df_corporate_cluster)

# Create the KMeans model with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model
kmeans.fit(df_corporate_cluster_scaled)

# Add the cluster labels to the dataframe
df_corporate['cluster'] = kmeans.labels_

# Visualize the clusters in a 3D scatter plot
fig = px.scatter_3d(df_corporate, x='total_order_value', y='days_between_trans', z='dog_share', color='cluster')
fig.update_layout(width=1600, height=800)
fig.update_layout(title='Corporate Customers - 3D Scatter Plot with Clusters')
fig.show()


The k-means algorithm gives a first impression of how the data set can be split on the chosen features. For our marketing strategies, however, a different approach seems more applicable. Because there are only 96 corporate customers, a manual, rule-based segmentation makes sense. We arrange the corporate customers in a 2x2 matrix of total order value against number of transactions, each categorized as high or low. The boundary is simply the mean of the respective feature.

In [None]:
import pandas as pd
import plotly.express as px

# Create a copy of the corporate customers dataframe
df_corporate_split = df_corporate.copy()

# Calculate the mean for total_order_value and num_transactions
mean_total_order_value = df_corporate_split['total_order_value'].mean()
mean_num_transactions = df_corporate_split['num_transactions'].mean()

# Split based on whether the values are above or below the mean
df_corporate_split['total_order_value_split'] = df_corporate_split['total_order_value'].apply(
    lambda x: 'low' if x < mean_total_order_value else 'high'
)

df_corporate_split['num_transactions_split'] = df_corporate_split['num_transactions'].apply(
    lambda x: 'low' if x < mean_num_transactions else 'high'
)

# Combine the two splits into a single group variable for four distinct categories
df_corporate_split['group'] = (
    df_corporate_split['total_order_value_split'].astype(str) + '_' +
    df_corporate_split['num_transactions_split'].astype(str)
)

# Visualize the 2x2 grid in a 3D scatter plot with four different colors
fig = px.scatter_3d(
    df_corporate_split,
    x='total_order_value',
    y='days_between_trans',
    z='dog_share',
    color='group',  # Use the combined group variable for color
    symbol='group',  # Optionally, use different symbols for clarity
    labels={
        'total_order_value': 'Total Order Value',
        'days_between_trans': 'Days Between Transactions',
        'dog_share': 'Dog Share',
        'group': 'Group'
    }
)
fig.update_layout(
    width=1600,
    height=800,
    title='Corporate Customers - 3D Scatter Plot with 2x2 Grid (Split by Mean)',
    legend_title_text='Group (Total Order Value vs Num Transactions)'
)
fig.show()

# Set the found clusters as df_corporate['cluster']
df_corporate['cluster'] = df_corporate_split['group']

In [None]:
# import pandas as pd
# import plotly.express as px

# # Create a copy of the corporate customers dataframe
# df_corporate_split = df_corporate.copy()

# # Calculate the mean for total_order_value and num_transactions
# mean_total_order_value = df_corporate_split['avg_order_value'].mean()
# mean_num_transactions = df_corporate_split['num_transactions'].mean()

# # Split based on whether the values are above or below the mean
# df_corporate_split['total_order_value_split'] = df_corporate_split['avg_order_value'].apply(
#     lambda x: 'low' if x < mean_total_order_value else 'high'
# )

# df_corporate_split['num_transactions_split'] = df_corporate_split['num_transactions'].apply(
#     lambda x: 'low' if x < mean_num_transactions else 'high'
# )

# # Combine the two splits into a single group variable for four distinct categories
# df_corporate_split['group'] = (
#     df_corporate_split['total_order_value_split'].astype(str) + '_' +
#     df_corporate_split['num_transactions_split'].astype(str)
# )

# # Visualize the 2x2 grid in a 3D scatter plot with four different colors
# fig = px.scatter_3d(
#     df_corporate_split,
#     x='avg_order_value',
#     y='days_between_trans',
#     z='dog_share',
#     color='group',  # Use the combined group variable for color
#     symbol='group',  # Optionally, use different symbols for clarity
#     labels={
#         'avg_order_value': 'avg_order_value',
#         'days_between_trans': 'Days Between Transactions',
#         'dog_share': 'Dog Share',
#         'group': 'Group'
#     }
# )
# fig.update_layout(
#     width=1600,
#     height=800,
#     title='Corporate Customers - 3D Scatter Plot with 2x2 Grid (Split by Mean)',
#     legend_title_text='Group (Total Order Value vs Num Transactions)'
# )
# fig.show()

# # Set the found clusters as df_corporate['cluster']
# df_corporate['cluster'] = df_corporate_split['group']

## 2x2 Matrix: Customer Segmentation Based on Total Order Value and Number of Transactions

|                          | **Low Number of Transactions** <br> (Customers below the Average Number of Transactions) | **High Number of Transactions** <br> (Customers above the Average Number of Transactions) |
|--------------------------|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| **Low Total Order Value** <br> (Customers below the Average Total Order Value) | **Budget-Conscious Occasional Customers** <br> These customers make infrequent purchases and spend a low total amount. They may be cost-sensitive or not heavily engaged. | **Budget-Conscious Frequent Customers** <br> These customers make frequent purchases but still maintain a relatively low total order value, indicating smaller transaction sizes. |
| **High Total Order Value** <br> (Customers above the Average Total Order Value) | **High-Spending Occasional Customers** <br> These customers do not transact often, but when they do, they tend to spend a significant amount. They might be selective but valuable. | **High-Spending Frequent Customers** <br> These are highly valuable customers who make frequent purchases with a high total order value, indicating strong engagement and consistent spending. |


In [None]:
df_corporate

With these four segments we can develop a marketing strategy customized for each one. The dog_share value shows no clear pattern across the segments. It might therefore be useful to simply split dog_share into three intervals and adapt the content of the marketing material to each interval.
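A three-interval split of dog_share could look like the following minimal sketch. The bin edges (equal thirds of the 0 to 1 range), the band labels and the toy data are assumptions for illustration only; they are not derived from the data set.

```python
import pandas as pd

# Toy stand-in for df_corporate with only the columns needed here (hypothetical values)
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "dog_share": [0.05, 0.40, 0.90, 1.00],
})

# Assumed bin edges: three equal-width intervals on [0, 1]
edges = [0.0, 1 / 3, 2 / 3, 1.0]
labels = ["mostly_other", "mixed", "mostly_dog"]

# include_lowest=True keeps customers with dog_share == 0 in the first band
toy["dog_share_band"] = pd.cut(
    toy["dog_share"], bins=edges, labels=labels, include_lowest=True
)
print(toy)
```

The resulting band can then be combined with the 2x2 group to pick the content of a campaign, while the 2x2 group decides its intensity.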

In [None]:
# Use Altair to create a histogram that is linked to the scatter plot:
# brushing a region in the scatter plot filters the histogram.
# Note: add_params requires Altair >= 5; on Altair 4 use add_selection instead.
import altair as alt

# Interval (brush) selection that links the two charts
brush = alt.selection_interval()

# Create the scatter plot
scatter = alt.Chart(df_corporate).mark_circle().encode(
    x='avg_order_value',
    y='days_between_trans',
    color='repeat_share',
    tooltip=['CustomerID', 'avg_order_value', 'days_between_trans', 'repeat_share']
).properties(
    width=800,
    height=600
).add_params(brush)

# Create the histogram of repeat_share, filtered by the brush selection
hist = alt.Chart(df_corporate).mark_bar().encode(
    x=alt.X('repeat_share:Q', bin=alt.Bin(maxbins=20)),
    y='count()',
    tooltip=['count()']
).properties(
    width=800,
    height=200
).transform_filter(brush)

# Stack the scatter plot on top of the histogram
scatter & hist



In [None]:
import altair as alt

# Scatter plot with color based on 'cluster'
scatter_plot = alt.Chart(df_corporate).mark_circle(size=60).encode(
    x=alt.X('avg_order_value', title='Average Order Value'),
    y=alt.Y('days_between_trans', title='Days Between Transactions'),
    color=alt.Color('cluster:N', scale=alt.Scale(scheme='category10'), title='Customer Segment'),
    tooltip=['CustomerID', 'avg_order_value', 'days_between_trans', 'repeat_share']
).properties(
    width=1200,
    height=800,
    title="Customer Segments by Order Value and Days Between Transactions"
).interactive()

scatter_plot.show()


In [None]:
import pandas as pd
import altair as alt

# Prepare data for average metrics per cluster
avg_metrics = df_corporate.groupby('cluster').agg(
    avg_order_value=('avg_order_value', 'mean'),
    avg_days_between_trans=('days_between_trans', 'mean'),
    avg_repeat_share=('repeat_share', 'mean')
).reset_index()

# Transform the data from wide format to long format for easier plotting
avg_metrics_melted = avg_metrics.melt(id_vars='cluster', 
                                      value_vars=['avg_order_value', 'avg_days_between_trans', 'avg_repeat_share'], 
                                      var_name='Metric', 
                                      value_name='Value')

# Bar chart for average metrics per cluster
avg_bar_chart = alt.Chart(avg_metrics_melted).mark_bar().encode(
    x=alt.X('Metric:N', title='Metric'),
    y=alt.Y('Value:Q', title='Average Value'),
    color=alt.Color('cluster:N', scale=alt.Scale(scheme='category10'), title='Customer Segment'),
    column=alt.Column('cluster:N', title='Customer Segment'),
    tooltip=['cluster', 'Metric', 'Value']
).properties(
    width=300,
    height=600,
    title="Average Metrics by Customer Segment"
).interactive()

avg_bar_chart.show()


In [None]:
df_corporate['cluster'].value_counts()

In [None]:
# Heatmap of average order value vs days between transactions, faceted by segment
heatmap = alt.Chart(df_corporate).mark_rect().encode(
    x=alt.X('avg_order_value:Q', bin=alt.Bin(maxbins=20), title='Average Order Value'),
    y=alt.Y('days_between_trans:Q', bin=alt.Bin(maxbins=20), title='Days Between Transactions'),
    color=alt.Color('count():Q', scale=alt.Scale(scheme='blues'), title='Customer Count'),
    facet=alt.Facet('cluster:N', title='Customer Segment'),
    # Only aggregated fields in the tooltip; raw fields would split the count per customer
    tooltip=['cluster', 'count()']
).properties(
    width=300,
    height=600,
    title="Customer Distribution by Order Value and Transaction Frequency per Segment"
).interactive()

heatmap.show()


## Private Customers

In [None]:
# Reset index of df_private
df_private = df_private.reset_index(drop=True)

In [None]:
# Visualize the private customers
fig = px.scatter_3d(df_private, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers - 3D Scatter Plot')
fig.show()


In [None]:
# Visualize the private customers by total order value
fig = px.scatter_3d(df_private, x='total_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers - 3D Scatter Plot')
fig.show()


In [None]:
# First group of private customers: those with a days_between_trans below 5. Filter them into a new dataframe
df_private_cluster1 = df_private[df_private['days_between_trans'] < 5]
df_private_cluster1

In [None]:
# Create df_private_cluster_rest with the private customers that are not in df_private_cluster1
df_private_cluster_rest = df_private[~df_private['CustomerID'].isin(df_private_cluster1['CustomerID'])]
df_private_cluster_rest

In [None]:
# Visualize the remaining private customers
fig = px.scatter_3d(df_private_cluster_rest, x='avg_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers (remaining) - 3D Scatter Plot')
fig.show()


In [None]:
# Visualize the remaining private customers by total order value
fig = px.scatter_3d(df_private_cluster_rest, x='total_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers (remaining) - 3D Scatter Plot')
fig.show()


In [None]:
# Now I want to use Altair to create a scatter plot for the private customers

Next, we separate the dog-focused customers who order only about every six months or only once in the time frame. They might be won back with special dog-related marketing offers.

In [None]:
# Now separate cluster2 from the private customers. These are the ones with a repeat_share below 0.3 and a dog_share above 0.25
df_private_cluster2 = df_private_cluster_rest[(df_private_cluster_rest['repeat_share'] < 0.3) & (df_private_cluster_rest['dog_share'] > 0.25)]

# Update df_private_cluster_rest by removing the private customers that are in df_private_cluster2
df_private_cluster_rest = df_private_cluster_rest[~df_private_cluster_rest['CustomerID'].isin(df_private_cluster2['CustomerID'])]


In [None]:
# Visualize the remaining private customers after removing cluster1 and cluster2
fig = px.scatter_3d(df_private_cluster_rest, x='total_order_value', y='days_between_trans', z='dog_share', color='repeat_share')

# Set the size of the figure
fig.update_layout(width=1600, height=800)

fig.update_layout(title='Private Customers (remaining) - 3D Scatter Plot')
fig.show()


In [None]:
# Use k-means to cluster the private customers into three groups
from sklearn.cluster import KMeans

# Create a copy of the private customers dataframe
df_private_cluster = df_private.copy()

# Drop the CustomerID and total_order_value columns,
# plus any cluster labels left over from a previous run
df_private_cluster = df_private_cluster.drop(columns=['CustomerID', 'total_order_value', 'cluster'], errors='ignore')

# Standardize the data
scaler = StandardScaler()

# Fit and transform the data
df_private_cluster_scaled = scaler.fit_transform(df_private_cluster)

# Create the KMeans model with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model
kmeans.fit(df_private_cluster_scaled)

# Add the cluster labels to the dataframe
df_private['cluster'] = kmeans.labels_

# Visualize the clusters in a 3D scatter plot
fig = px.scatter_3d(df_private, x='avg_order_value', y='days_between_trans', z='dog_share', color='cluster')
fig.update_layout(width=1600, height=800)
fig.update_layout(title='Private Customers - 3D Scatter Plot with Clusters')
fig.show()



### Scaling Features for further Analysis

In [None]:
# Scaling features
# Create a copy of the dataframe
df_scaled = df.copy()

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numeric feature columns
feature_cols = ['num_transactions', 'total_order_value', 'days_between_trans', 'repeat_share', 'dog_share']
df_scaled[feature_cols] = scaler.fit_transform(df[feature_cols])
df_scaled

In [None]:
# Check the summary statistics of the scaled data
df_scaled.describe()

# Clustering and Segmentation

In [None]:
#TODO: Try out more Cluster algorithms and see which one fits the best
#TODO: Try out more Dimensionality Reduction algorithms and see which one fits the best
#TODO: Try to generate more features and see if the model improves
#TODO: Try different number of clusters to find a better optimum (Elbow Method or Silhouette Score)

We can use a 3 cluster segmentation in which we describe a high quality, medium quality and low quality customer.

The high quality customer is a customer that has a high number of transactions, a high total order value, a low days between transactions, a high repeat share and a high dog share.

The low and medium quality customers are defined accordingly, with low or intermediate values on these features. One option is to create a lead score for each customer based on the above features and then segment the customers into 3 clusters.

We can then adjust our marketing strategy to target the high quality customers more effectively.
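The lead-score idea described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual method: the equal feature weights, the tercile cut-offs and the column name `lead_score` are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer table (hypothetical values)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "CustomerID": np.arange(n),
    "num_transactions": rng.integers(1, 30, n),
    "total_order_value": rng.gamma(2.0, 100.0, n),
    "days_between_trans": rng.uniform(1, 180, n),
    "repeat_share": rng.uniform(0, 1, n),
    "dog_share": rng.uniform(0, 1, n),
})

# +1: higher is better, -1: lower is better (short gaps between orders are good)
direction = {
    "num_transactions": 1,
    "total_order_value": 1,
    "days_between_trans": -1,
    "repeat_share": 1,
    "dog_share": 1,
}

# z-score each feature, flip the sign where lower is better, then average (equal weights assumed)
features = toy[list(direction)]
z = (features - features.mean()) / features.std()
toy["lead_score"] = (z * pd.Series(direction)).mean(axis=1)

# Assumed tiering: equal-sized terciles of the score
toy["quality"] = pd.qcut(toy["lead_score"], q=3, labels=["low", "medium", "high"])
print(toy["quality"].value_counts())
```

Unlike k-means, a score like this yields segments that are ordered by construction, at the cost of having to choose the weights by hand.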

## Choosing the Clustering Algorithm

In [None]:
# First try the k-means clustering algorithm
from sklearn.cluster import KMeans

# Create a copy of the final dataframe and drop the CustomerID column
df_scaled_cluster = df_scaled.copy().drop('CustomerID', axis=1)

# Initialize the KMeans algorithm
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the algorithm to the data
df_scaled_cluster['Cluster'] = kmeans.fit_predict(df_scaled_cluster)
df_scaled_cluster

# Evaluation of Segments

## Analyzing Cluster Characteristics

In [None]:
# Calculate the mean of the clusters
cluster_stats = df_scaled_cluster.groupby('Cluster').mean()
cluster_stats

In [None]:
# Calculate the number of customers in each cluster
cluster_size = df_scaled_cluster['Cluster'].value_counts().reset_index()
cluster_size.columns = ['Cluster', 'Count']

# Calculate the distribution of each cluster
cluster_dist = cluster_size['Count'] / cluster_size['Count'].sum()
cluster_size['Distribution'] = cluster_dist
cluster_size

## Model Validation

### Calculating Commonly Used Scores

In [None]:
# Calculate the silhouette score
from sklearn.metrics import silhouette_score

silhouette_score(df_scaled_cluster.drop('Cluster', axis=1), df_scaled_cluster['Cluster'])

The silhouette score ranges from -1 to 1, and a score above 0.5 is commonly read as a sign of reasonably well-separated clusters. How good the score needs to be depends on how we intend to use the clusters. If we are looking for a small number of high-quality customers, the result suggests that we may already have found them. Let's check the results further.
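The silhouette score can also help with choosing the number of clusters. The following is a minimal sketch on synthetic blobs rather than the customer data; the range of k, the sample size and the blob centers are assumptions. The `sample_size` argument of `silhouette_score` scores a random subset, which matters because the full computation uses all pairwise distances and becomes expensive at 100k rows.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix: three well-separated blobs
centers = [[-6, -6], [0, 6], [6, -6]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # sample_size keeps the pairwise-distance computation affordable on large data
    scores[k] = silhouette_score(X, labels, sample_size=300, random_state=42)

# Pick the k with the highest (sampled) silhouette score
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data, `X` would be the scaled feature columns without `CustomerID`.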

In [None]:
# Calculate the Davies-Bouldin Index
from sklearn.metrics import davies_bouldin_score

davies_bouldin_score(df_scaled_cluster.drop('Cluster', axis=1), df_scaled_cluster['Cluster'])

### Visualizing the Clusters

In [None]:
# Visualize results of the clustering by using the Cluster column and the repeat_share and dog_share columns
plt.figure(figsize=(10, 6))
sns.scatterplot(x='repeat_share', y='dog_share', hue='Cluster', data=df_scaled_cluster, palette='viridis')
plt.title('KMeans Clustering Results')
plt.show()

In [None]:
# Visualize results of the clustering by using the Cluster column and the num_transactions and total_order_value columns
plt.figure(figsize=(10, 6))
sns.scatterplot(x='num_transactions', y='total_order_value', hue='Cluster', data=df_scaled_cluster, palette='viridis')
plt.title('KMeans Clustering Results')
plt.show()

In [None]:
# Visualize results of the clustering by using the Cluster column and the total_order_value and days_between_trans columns
plt.figure(figsize=(10, 6))
sns.scatterplot(x='total_order_value', y='days_between_trans', hue='Cluster', data=df_scaled_cluster, palette='viridis')
plt.title('KMeans Clustering Results')
plt.show()

The scatter plots suggest that the clustering yields a useful segmentation of the customers. It is especially helpful for identifying the high quality customers that we want to target explicitly.

In [None]:
# Add the Cluster column to the final dataframe
df_final['Cluster'] = df_scaled_cluster['Cluster']

# Visualize the cluster plots using PCA
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', data=df_final, palette='viridis')
plt.title('KMeans Clustering Results with PCA')
plt.show()
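The cell above expects a dataframe with `PCA1` and `PCA2` columns. A minimal sketch of how such a two-component projection could be computed is shown below, on synthetic data; the name `df_pca` and the random feature values are hypothetical, and the notebook's own `df_final` may have been built differently.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five customer features (hypothetical values)
rng = np.random.default_rng(42)
feature_cols = ["num_transactions", "total_order_value", "days_between_trans",
                "repeat_share", "dog_share"]
toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=feature_cols)

# Scale first, since PCA and k-means are both sensitive to feature scale
X = StandardScaler().fit_transform(toy)

# Cluster in the full feature space, then project to 2D for plotting only
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(X)

df_pca = pd.DataFrame(components, columns=["PCA1", "PCA2"])
df_pca["Cluster"] = labels

# Share of variance retained by the two components
print(pca.explained_variance_ratio_)
```

The explained variance ratio indicates how much of the structure the 2D plot can show; if it is low, overlapping clusters in the plot do not necessarily overlap in the full feature space.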

In [None]:
# # Visualize the cluster plots using t-SNE
# from sklearn.manifold import TSNE

# # Initialize the t-SNE algorithm
# tsne = TSNE(n_components=2, random_state=42)

# # Fit and transform the data
# df_tsne = tsne.fit_transform(df_scaled_cluster.drop('Cluster', axis=1))

# # Create a dataframe with the t-SNE components
# df_tsne = pd.DataFrame(data=df_tsne, columns=['t-SNE1', 't-SNE2'])

# # Concatenate the t-SNE components with the cluster column
# df_tsne = pd.concat([df_tsne, df_scaled_cluster['Cluster']], axis=1)

# # Visualize the cluster plots using t-SNE
# plt.figure(figsize=(10, 6))
# sns.scatterplot(x='t-SNE1', y='t-SNE2', hue='Cluster', data=df_tsne, palette='viridis')
# plt.title('KMeans Clustering Results with t-SNE')
# plt.show()

In [None]:
# Use plotly to create a 3D scatter plot of the clusters
fig = px.scatter_3d(df_scaled_cluster, x='num_transactions', y='total_order_value', z='days_between_trans', color='Cluster', opacity=0.7)
fig.update_layout(title='KMeans Clustering Results in 3D')
fig.show()

In [None]:
# Use plotly to create a 3D scatter plot of the clusters
fig = px.scatter_3d(df_scaled_cluster, x='num_transactions', y='repeat_share', z='dog_share', color='Cluster', opacity=0.7)
fig.update_layout(title='KMeans Clustering Results in 3D')
fig.show()

In [None]:
# Use plotly to create a 3D scatter plot of the clusters
fig = px.scatter_3d(df_scaled_cluster, x='num_transactions', y='days_between_trans', z='repeat_share', color='Cluster', opacity=0.7)
fig.update_layout(title='KMeans Clustering Results in 3D')
fig.show()

In this section we closely looked at the clusters and the characteristics of the customers in each cluster. We also validated the model by calculating commonly used scores and visualizing the clusters.

# Final Output

In [None]:
# Merge df_scaled['CustomerID'] with df_scaled_cluster['Cluster']
df_clustered = pd.concat([df_scaled['CustomerID'], df_scaled_cluster['Cluster']], axis=1)
df_clustered

In [None]:
# Save the clustered data to a CSV file
df_clustered.to_csv('Clustered_Data.csv', index=False)

# Conclusion and Recommendations

In [None]:
#TODO: Create a report that looks professional and is easy to understand

## Business Insights

## Next Steps

1. Having found the customer segments, the next step is to target each segment appropriately.
2. Offer the client simple, ready-to-use solutions for each customer segment.
3. Propose a follow-up project for a more advanced solution, for example a churn-prevention model ('save the customers who are about to leave') that customizes advertisements and special offers.
4. With a longer history of at least 2-3 years, we can implement a Customer Lifetime Value model that helps decide which customers to invest in (using the PyMC-Marketing library, https://github.com/pymc-labs/pymc-marketing).