# Problem Description
***alles um tier GmbH*** is a pet supplies company. They are currently auditing their promotional activities, and the CEO, one of the main stakeholders, feels that the promotions they offer are too generic and not targeted. They have asked us to devise a customer segmentation model that they can use to run targeted promotional activities.

The client is interested in seeing what kind of customers are buying at ***alles um tier GmbH***. They assume that, in addition to private individuals, there are also smaller companies that purchase from ***alles um tier GmbH***. The project scope is to build a segmentation model and analyze the resulting customer segments.

# Data

You are given a customer-level dataset for the past year with the following data points: number of transactions in the past year (*num_transactions*), order amount in the past year (*total_order_value*), days between transactions in the past year (*days_between_trans*), re-order rate in the past year (*repeat_share*), and percentage of dog products bought (*dog_share*).

### Data Set
The dataset consists of 100k rows and has the following columns:

* CustomerID (str): unique identifier for the customer (alphanumeric in the sample rows, e.g. `dwa726`)
* num_transactions (int): number of transactions in a given year
* total_order_value (float): total order value in € for the time period
* days_between_trans (float): average days between transactions for a user
* repeat_share (float): product share repeated every order
* dog_share (float): percentage of products ordered that are dog food related
    
# Technical Environment
* Python
* numpy
* pandas
* scikit-learn
* matplotlib / scipy / seaborn / altair / plotly

# Approach
The solution is assessed on the following skills:
* A thorough evaluation of the data set using statistical measures and visualization
* Elegant Python coding skills
* Machine learning modelling fundamentals
* Model & result evaluation

# Output
Please provide your solution in a Jupyter notebook with clear markdown comments.
The final output should be in the form of a DataFrame with two columns, the CustomerID and the assigned cluster.

--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------


In [None]:
#TODO: Create a Docker container and send it back so that the customer has no trouble installing the dependencies

# Data Loading and Preprocessing

In [None]:
# Import all needed libraries (unused imports removed)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from scipy import stats

## Loading the Data

In [None]:
# Example of what the data looks like. The delimiter is |
# CustomerID|num_transactions|total_order_value|days_between_trans|repeat_share|dog_share
# dwa726|12|329.88|33.77|0.459759531109544|0.255687053101122
# asy963|1|11.28|234.14|0.0903120997560214|0.549722127878268

In [None]:
# load the data and make sure to specify the correct delimiter
df = pd.read_csv("DataSet_JuniorCodingChallenge.csv", delimiter='|')
df
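
As a small sanity check of the loading step, the two sample rows from the comment above can be parsed in memory. This is only a sketch: `sample` and `df_sample` are illustrative names, and forcing `CustomerID` to `str` at read time is an assumption about what we want, not something the challenge prescribes.

```python
import io
import pandas as pd

# The header and two sample rows copied from the example comment in the notebook
sample = io.StringIO(
    "CustomerID|num_transactions|total_order_value|days_between_trans|repeat_share|dog_share\n"
    "dwa726|12|329.88|33.77|0.459759531109544|0.255687053101122\n"
    "asy963|1|11.28|234.14|0.0903120997560214|0.549722127878268\n"
)

# Read the IDs as strings up front; the numeric columns are inferred by pandas
df_sample = pd.read_csv(sample, sep="|", dtype={"CustomerID": str})
print(df_sample.dtypes)
```

Reading the ID as a string at load time would make the later `astype(str)` conversion unnecessary.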

## Handling Missing Data

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Decide how to handle missing values.
# Because the number of missing values is low, the affected rows are simply dropped.
#TODO: Fit a multivariate distribution and, for customers with only one missing value, impute that value from the distribution

# Drop the missing values
df = df.dropna()
df
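
The TODO above mentions imputing single missing values instead of dropping rows. One possible way to do this, sketched here on hypothetical toy data, is scikit-learn's `IterativeImputer`, which models each feature from the others. The columns `a` and `b` and the frame `toy` are made up for illustration, and this is a swapped-in technique, not the multivariate-distribution approach the TODO names.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the import below
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
# Hypothetical toy data: two correlated columns standing in for the real features
x = rng.normal(size=200)
toy = pd.DataFrame({"a": x, "b": 2 * x + rng.normal(scale=0.1, size=200)})
toy.loc[:9, "b"] = np.nan  # .loc slicing is inclusive, so rows 0..9 lose their value

# Keep only rows with at most one missing value; rows missing more would be dropped
at_most_one_missing = toy.isnull().sum(axis=1) <= 1
subset = toy[at_most_one_missing]

imputer = IterativeImputer(random_state=0)
toy_imputed = pd.DataFrame(
    imputer.fit_transform(subset), columns=subset.columns, index=subset.index
)
print(toy_imputed.isnull().sum())
```

Whether imputation is worth the extra complexity depends on how many rows `dropna()` actually removes.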

## Data Integrity Check

In [None]:
# Ensure all columns are of the correct data type
df.dtypes

# Make sure to have the following data types:
# CustomerID as a string
# num_transactions as integer
# total_order_value, days_between_trans, repeat_share, dog_share as floats

In [None]:
# Convert CustomerID to string
df['CustomerID'] = df['CustomerID'].astype(str)

In [None]:
# Count the distinct values in a column that have a fractional part
def find_floats(df, column):
    return int((df[column].drop_duplicates() % 1 != 0).sum())

# total_order_value is a euro amount, so fractional values are expected
# (18249 such values were found in the original run); it stays a float and is not rounded
print(find_floats(df, 'total_order_value'))

# num_transactions must be a whole number: check for fractional values before casting
print(find_floats(df, 'num_transactions'))

# Convert num_transactions to integer
df['num_transactions'] = df['num_transactions'].astype(int)

In [None]:
# total_order_value as a float
df['total_order_value'] = df['total_order_value'].astype(float)

In [None]:
# Check all types again
print(df.dtypes)

## Value Range and Duplicate Checks

In [None]:
# Count the negative values in num_transactions, total_order_value and days_between_trans
print(len(df[df['num_transactions'] < 0]))
print(len(df[df['total_order_value'] < 0]))
print(len(df[df['days_between_trans'] < 0]))

# repeat_share and dog_share must lie between 0 and 1, so count the values outside this range
print(len(df[(df['repeat_share'] < 0) | (df['repeat_share'] > 1)]))
print(len(df[(df['dog_share'] < 0) | (df['dog_share'] > 1)]))

# Drop the negative values in the 3 columns
df = df[df['num_transactions'] >= 0]
df = df[df['total_order_value'] >= 0]
df = df[df['days_between_trans'] >= 0]

# Drop the values outside of the range 0 and 1 for the last two columns
df = df[(df['repeat_share'] >= 0) & (df['repeat_share'] <= 1)]
df = df[(df['dog_share'] >= 0) & (df['dog_share'] <= 1)]

df

In [None]:
# Check for duplicated CustomerIDs (print, because only a cell's last expression is displayed)
print(df['CustomerID'].duplicated().sum())

# Drop the duplicated CustomerIDs
df = df.drop_duplicates(subset='CustomerID')

# Reset the index
df = df.reset_index(drop=True)
df

In this section, we first loaded the data and then checked for missing values. We then checked for data integrity by looking at the data types of the columns, at the reasonable ranges of the data and the unique CustomerIDs.
Now the data is ready for further analysis.
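
The individual checks above could also be bundled into one report, which makes it easy to rerun them after every cleaning step. The helper `integrity_report` and the frame `toy` are hypothetical names for this sketch; the rules themselves are the ones used in this section.

```python
import pandas as pd

def integrity_report(frame: pd.DataFrame) -> pd.Series:
    """Count rows violating each rule (hypothetical helper, not part of the notebook)."""
    non_negative = ["num_transactions", "total_order_value", "days_between_trans"]
    share_cols = ["repeat_share", "dog_share"]
    report = {f"{c} < 0": int((frame[c] < 0).sum()) for c in non_negative}
    # between() is False for NaN, so missing shares are counted as violations too
    report.update(
        {f"{c} outside [0, 1]": int((~frame[c].between(0, 1)).sum()) for c in share_cols}
    )
    report["duplicated CustomerID"] = int(frame["CustomerID"].duplicated().sum())
    report["missing values"] = int(frame.isnull().sum().sum())
    return pd.Series(report)

# Tiny made-up frame with one negative count, one invalid share and one duplicated ID
toy = pd.DataFrame({
    "CustomerID": ["a1", "a2", "a2"],
    "num_transactions": [3, -1, 5],
    "total_order_value": [10.0, 20.0, 30.0],
    "days_between_trans": [12.5, 40.0, 7.0],
    "repeat_share": [0.2, 1.4, 0.9],
    "dog_share": [0.5, 0.1, 0.3],
})
report = integrity_report(toy)
print(report)
```

On the cleaned `df`, every entry of such a report should be zero.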

# Exploratory Data Analysis **(EDA)**

## Statistical Summary

In [None]:
# Create summary statistics for the data
df.describe()

## Data Visualization

In [None]:
# Create a histogram for each column
df.hist(figsize=(10, 10))
plt.show()

In [None]:
# Create a boxplot for each column in the dataframe
plt.figure(figsize=(10, 6))
df.boxplot()
plt.xticks(rotation=45)
plt.title('Boxplot of Data Columns')
plt.show()

In [None]:
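# Create a correlation matrix and visualize it with a heatmap
# (numeric_only=True skips the string column CustomerID, which would otherwise raise an error in pandas >= 2.0)
corr = df.corr(numeric_only=True)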
plt.figure(figsize=(10, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Now create pairplots for the data
sns.pairplot(df)
plt.show()

## Feature Relationships

In [None]:
# Have a closer look at the most outstanding relationships
plt.figure(figsize=(10, 6))
sns.scatterplot(x='num_transactions', y='total_order_value', data=df)
plt.title('num_transactions vs. total_order_value')
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(x='days_between_trans', y='repeat_share', data=df)
plt.title('days_between_trans vs. repeat_share')
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(x='dog_share', y='repeat_share', data=df)
plt.title('dog_share vs. repeat_share')
plt.show()


In [None]:
# Use scipy to calculate the p-value for the correlation between num_transactions and total_order_value
p_value = stats.pearsonr(df['num_transactions'], df['total_order_value'])[1]
print('P-value for num_transactions and total_order_value:', p_value)

# Use scipy to calculate the p-value for the correlation between days_between_trans and repeat_share
p_value = stats.pearsonr(df['days_between_trans'], df['repeat_share'])[1]
print('P-value for days_between_trans and repeat_share:', p_value)

# Use scipy to calculate the p-value for the correlation between dog_share and repeat_share
p_value = stats.pearsonr(df['dog_share'], df['repeat_share'])[1]    
print('P-value for dog_share and repeat_share:', p_value)
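
With a sample as large as ours, p-values tend to be tiny even for weak correlations, so it can help to look at the coefficient itself and at a rank-based alternative. The sketch below uses synthetic data (`x` and `y` are made up) to show how Pearson's r and Spearman's rho differ for a monotone but non-linear relationship.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic monotone but strongly non-linear relationship
x = rng.uniform(0, 10, size=2_000)
y = np.exp(x) + rng.normal(scale=1.0, size=2_000)

r, p_r = stats.pearsonr(x, y)        # linear association
rho, p_rho = stats.spearmanr(x, y)   # rank-based (monotone) association
print(f"Pearson r={r:.3f} (p={p_r:.3g}), Spearman rho={rho:.3f} (p={p_rho:.3g})")
```

Here Spearman's rho comes out clearly higher than Pearson's r, because the ranks are almost perfectly ordered while the linear fit is poor.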


# Feature Engineering

## Scaling Features

In [None]:
# Create a copy of the dataframe
df_scaled = df.copy()

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the data
features = ['num_transactions', 'total_order_value', 'days_between_trans', 'repeat_share', 'dog_share']
df_scaled[features] = scaler.fit_transform(df[features])
df_scaled

In [None]:
# Check the summary statistics of the scaled data
df_scaled.describe()
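
One detail when reading the summary above: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The `std` row is therefore not exactly 1. A minimal sketch on a made-up five-row column makes the difference visible; with our row count it is negligible.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up column, small on purpose so the effect is easy to see
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_std = scaled["x"].std(ddof=0)   # the quantity StandardScaler normalizes to 1
sample_std = scaled["x"].std()      # pandas default (ddof=1), as shown by describe()
print(pop_std, sample_std)
```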

## Dimensionality Reduction (optional - check model results)

In [None]:
# The correlation between days_between_trans and repeat_share is quite high, so PCA might be a good way to reduce the dimensionality of the data
# The drawback is that the components are harder to interpret. An alternative would be to merge only these two features into one component and keep the other 3
# Here PCA is applied to all 5 features and reduces them to 4 components

# Use PCA to reduce the dimensionality of the data
from sklearn.decomposition import PCA

# Initialize the PCA
pca = PCA(n_components=4)

# Fit and transform the data
df_pca = pca.fit_transform(df_scaled[['num_transactions', 'total_order_value', 'days_between_trans', 'repeat_share', 'dog_share']])
df_pca = pd.DataFrame(data=df_pca, columns=['PCA1', 'PCA2', 'PCA3', 'PCA4'])
df_pca

In [None]:
# Check the explained variance ratio
print('Explained Variance Ratio:', pca.explained_variance_ratio_)

# Concatenate the CustomerID column with the PCA components
df_final = pd.concat([df_scaled['CustomerID'], df_pca], axis=1)
df_final
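
Rather than fixing the number of components in advance, one option is to fit PCA with all components and look at the cumulative explained variance. The sketch below uses hypothetical standardized data with 5 features, two of which are strongly correlated. The 95 % threshold and the names `pca_full` and `n_keep` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 4 independent features plus a near-copy of the first one
base = rng.normal(size=(1_000, 4))
X = np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=1_000)])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize, as StandardScaler would

pca_full = PCA().fit(X)  # keep all components to inspect the full spectrum
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_keep)
```

In this toy setup the redundant fifth feature adds almost no variance, so four components are enough.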

# Clustering and Segmentation

In [None]:
#TODO: Try out more Cluster algorithms and see which one fits the best
#TODO: Try out more Dimensionality Reduction algorithms and see which one fits the best
#TODO: Try to generate more features and see if the model improves
#TODO: Try different number of clusters to find a better optimum (Elbow Method or Silhouette Score)

We can use a 3-cluster segmentation that distinguishes high-quality, medium-quality and low-quality customers.

A high-quality customer has a high number of transactions, a high total order value, few days between transactions, a high repeat share and a high dog share.

Low- and medium-quality customers are defined accordingly. We create a lead score for each customer based on the above features and then segment the customers into 3 clusters.

We can then adjust our marketing strategy to target the high-quality customers more effectively.

## Choosing the Clustering Algorithm

In [None]:
# First try the k-means clustering algorithm
from sklearn.cluster import KMeans

# Create a copy of the scaled dataframe and drop the CustomerID column
df_scaled_cluster = df_scaled.copy().drop('CustomerID', axis=1)

# Initialize the KMeans algorithm (n_init is set explicitly because its default differs between scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Fit the algorithm to the data
df_scaled_cluster['Cluster'] = kmeans.fit_predict(df_scaled_cluster)
df_scaled_cluster
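
The choice of 3 clusters could be checked by comparing several values of k, as the TODO on the elbow method and silhouette score suggests. The following is a sketch on synthetic blobs, not on our data: `X` stands in for the scaled feature matrix, and the silhouette is computed on a random subsample (`sample_size`) to keep it affordable on large datasets.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the scaled features: 3 equally spaced blobs in 5 dimensions
centers = [[0, 0, 0, 0, 0], [8, 8, 0, 0, 0], [0, 8, 8, 0, 0]]
X, _ = make_blobs(n_samples=1_500, centers=centers, cluster_std=1.0, random_state=42)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # sample_size scores a random subsample instead of all pairwise distances
    sil = silhouette_score(X, model.labels_, sample_size=1_000, random_state=42)
    results[k] = (model.inertia_, sil)
    print(f"k={k}: inertia={model.inertia_:.0f}, silhouette={sil:.3f}")

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

For the elbow method, the inertia values would be plotted against k and the bend in the curve inspected by eye.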

# Evaluation of Segments

## Analyzing Cluster Characteristics

In [None]:
# Calculate the mean of the clusters
cluster_stats = df_scaled_cluster.groupby('Cluster').mean()
cluster_stats
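
The cluster means above are in standardized units, which are hard to explain to a business audience. One way to get them back into original units is to apply the scaler's `inverse_transform` to the centroids. The sketch below shows this on made-up data with two obvious groups. `raw` and `centers` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical raw data: few small orders vs. many large orders
raw = pd.DataFrame({
    "num_transactions": np.concatenate([rng.poisson(2, 300), rng.poisson(30, 300)]),
    "total_order_value": np.concatenate([rng.normal(50, 10, 300), rng.normal(900, 100, 300)]),
})

scaler = StandardScaler()
X = scaler.fit_transform(raw)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Map the centroids from z-scores back to the original units
centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=raw.columns)
print(centers.round(1))
```

Because standardization is an affine transformation, the back-transformed centroids equal the per-cluster means of the unscaled features.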

In [None]:
# Calculate the number of customers in each cluster
cluster_size = df_scaled_cluster['Cluster'].value_counts().reset_index()
cluster_size.columns = ['Cluster', 'Count']

# Calculate the distribution of each cluster
cluster_dist = cluster_size['Count'] / cluster_size['Count'].sum()
cluster_size['Distribution'] = cluster_dist
cluster_size

## Model Validation

### Calculating Commonly Used Scores

In [None]:
# Calculate the silhouette score
from sklearn.metrics import silhouette_score

silhouette_score(df_scaled_cluster.drop('Cluster', axis=1), df_scaled_cluster['Cluster'])

A silhouette score above 0.5 is commonly read as a sign of well-separated clusters. How good the score needs to be depends, of course, on how the clusters will be used. If we are looking for a small number of high-quality customers, the results indicate that we may already have found them. Let's check the results further.
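
A single average can hide one weak cluster, so it may be worth looking at the silhouette per cluster. This sketch uses synthetic blobs and a random subsample of 1,000 points, since the silhouette needs pairwise distances and gets expensive on large data. The names `idx` and `per_cluster` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the scaled features
X, _ = make_blobs(n_samples=5_000, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Score a random subsample to keep the pairwise-distance cost manageable
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=1_000, replace=False)
scores = silhouette_samples(X[idx], labels[idx])

# Average silhouette per cluster
per_cluster = pd.Series(scores).groupby(labels[idx]).mean()
print(per_cluster.round(3))
```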

In [None]:
# Calculate the Davies-Bouldin Index
from sklearn.metrics import davies_bouldin_score

davies_bouldin_score(df_scaled_cluster.drop('Cluster', axis=1), df_scaled_cluster['Cluster'])

### Visualizing the Clusters

In [None]:
# Visualize results of the clustering by using the Cluster column and the repeat_share and dog_share columns
plt.figure(figsize=(10, 6))
sns.scatterplot(x='repeat_share', y='dog_share', hue='Cluster', data=df_scaled_cluster, palette='viridis')
plt.title('KMeans Clustering Results')
plt.show()

In [None]:
# Visualize results of the clustering by using the Cluster column and the num_transactions and total_order_value columns
plt.figure(figsize=(10, 6))
sns.scatterplot(x='num_transactions', y='total_order_value', hue='Cluster', data=df_scaled_cluster, palette='viridis')
plt.title('KMeans Clustering Results')
plt.show()

In [None]:
# Visualize results of the clustering by using the Cluster column and the total_order_value and days_between_trans columns
plt.figure(figsize=(10, 6))
sns.scatterplot(x='total_order_value', y='days_between_trans', hue='Cluster', data=df_scaled_cluster, palette='viridis')
plt.title('KMeans Clustering Results')
plt.show()

The scatter plots suggest that the clustering gives us a useful segmentation of the customers. It is especially helpful for finding the high-quality customers that we want to target explicitly.

In [None]:
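# Visualize the cluster plots using PCA (df_final has no Cluster column, so attach the labels first)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', data=df_final.assign(Cluster=df_scaled_cluster['Cluster']), palette='viridis')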
plt.title('KMeans Clustering Results with PCA')
plt.show()

In [None]:
# Visualize the cluster plots using t-SNE
from sklearn.manifold import TSNE

# Initialize the t-SNE algorithm
tsne = TSNE(n_components=2, random_state=42)

# Fit and transform the data
df_tsne = tsne.fit_transform(df_scaled_cluster.drop('Cluster', axis=1))

# Create a dataframe with the t-SNE components
df_tsne = pd.DataFrame(data=df_tsne, columns=['t-SNE1', 't-SNE2'])

# Concatenate the t-SNE components with the cluster column
df_tsne = pd.concat([df_tsne, df_scaled_cluster['Cluster']], axis=1)

# Visualize the cluster plots using t-SNE
plt.figure(figsize=(10, 6))
sns.scatterplot(x='t-SNE1', y='t-SNE2', hue='Cluster', data=df_tsne, palette='viridis')
plt.title('KMeans Clustering Results with t-SNE')
plt.show()
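
t-SNE scales poorly with the number of rows, so on a dataset of our size it may be more practical to embed a random subsample. A minimal sketch, assuming synthetic data in place of the scaled features and an arbitrary sample size of 300:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the scaled features and their cluster labels
X, labels = make_blobs(n_samples=2_000, centers=3, n_features=5, random_state=0)

# Draw a random subsample without replacement
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=300, replace=False)

# Embed only the subsample and keep the labels aligned with it
embedding = TSNE(n_components=2, random_state=42).fit_transform(X[idx])
df_tsne_sample = pd.DataFrame(embedding, columns=["t-SNE1", "t-SNE2"])
df_tsne_sample["Cluster"] = labels[idx]
print(df_tsne_sample.head())
```

The resulting frame can be passed to `sns.scatterplot` in the same way as in the cell above.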

In [None]:
# Use plotly to create a 3D scatter plot of the clusters
fig = px.scatter_3d(df_scaled_cluster, x='num_transactions', y='total_order_value', z='days_between_trans', color='Cluster', opacity=0.7)
fig.update_layout(title='KMeans Clustering Results in 3D')
fig.show()

In [None]:
# Use plotly to create a 3D scatter plot of the clusters
fig = px.scatter_3d(df_scaled_cluster, x='num_transactions', y='repeat_share', z='dog_share', color='Cluster', opacity=0.7)
fig.update_layout(title='KMeans Clustering Results in 3D')
fig.show()

In [None]:
# Use plotly to create a 3D scatter plot of the clusters
fig = px.scatter_3d(df_scaled_cluster, x='num_transactions', y='days_between_trans', z='repeat_share', color='Cluster', opacity=0.7)
fig.update_layout(title='KMeans Clustering Results in 3D')
fig.show()

In this section we closely looked at the clusters and the characteristics of the customers in each cluster. We also validated the model by calculating commonly used scores and visualizing the clusters.

# Final Output

In [None]:
# Merge df_scaled['CustomerID'] with df_scaled_cluster['Cluster']
df_clustered = pd.concat([df_scaled['CustomerID'], df_scaled_cluster['Cluster']], axis=1)
df_clustered

In [None]:
# Save the clustered data to a CSV file
df_clustered.to_csv('Clustered_Data.csv', index=False)
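
Before handing over the file, the result could be checked against the required format: two columns, unique IDs, no missing values and valid cluster labels. `check_output` is a hypothetical helper, and the toy frame reuses the two sample IDs from the data example with made-up cluster labels.

```python
import pandas as pd

def check_output(result: pd.DataFrame, n_clusters: int) -> None:
    """Raise AssertionError if the result violates the expected format (hypothetical helper)."""
    assert list(result.columns) == ["CustomerID", "Cluster"], "unexpected columns"
    assert result["CustomerID"].is_unique, "duplicated CustomerID"
    assert not result.isnull().any().any(), "missing values"
    assert result["Cluster"].between(0, n_clusters - 1).all(), "cluster label out of range"

# Toy result with made-up cluster assignments
toy_result = pd.DataFrame({"CustomerID": ["dwa726", "asy963"], "Cluster": [0, 2]})
check_output(toy_result, n_clusters=3)
print("output format OK")
```

In the notebook, the call would be `check_output(df_clustered, n_clusters=3)`.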

# Conclusion and Recommendations

In [None]:
#TODO: Create a report that looks professional and is easy to understand

## Business Insights

## Next Steps