# Clustering drone hubs

## Assignment understanding

This assignment is based on a real-world drone delivery scenario. We used Amazon's drone delivery system as inspiration for the machine learning problem we aim to solve. Note, however, that our dataset is completely independent of Amazon.

With this assignment we aim to address two theoretical problems:

1. **Minimize Delivery Times & Costs**: By strategically placing drone hubs (depots), the company can reduce the average travel distance for each delivery, leading to lower energy costs and faster service. We will use **k-means and hierarchical clustering** to determine the optimal coordinates for these hubs.

2. **Increase Sales Revenue**: By understanding customer purchasing patterns, the company can create targeted marketing campaigns and product bundles. We will identify which product groups are frequently bought together using **Association Rule Mining**, as discussed in our lectures.

In [1]:
# --- Necessary Setup For This Assignment ---

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.cluster import KMeans, AgglomerativeClustering

# Set plot style for better visuals
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## Data understanding

This assignment uses two distinct datasets, one for each part of the analysis. Both have been imported from the GitHub repository provided by our instructor.

### 1. Clustering set

In [2]:
# Load the customer location data
locations_df = pd.read_csv('./droneData/drone_cust_locations.csv', sep=';')

# Display the first few rows of the dataframe
print("\n--- First 5 rows of the dataset: ---\n")
display(locations_df.head(5))


# Display all relevant info in a transposed summary table
print("\n--- Summary Table --- \n")
# Get descriptive statistics and transpose it
summary_table = locations_df.describe().T

# Add columns for data type and missing values
summary_table['value_type'] = locations_df.dtypes
summary_table['missing_values'] = locations_df.isnull().sum()

display(summary_table)


--- First 5 rows of the dataset: ---



| | clientid | x | y |
|---|---|---|---|
| 0 | 1 | 622.771572 | 164.857623 |
| 1 | 2 | 416.357298 | 630.193634 |
| 2 | 3 | 292.73502 | 567.333231 |
| 3 | 4 | 737.211288 | 166.225676 |
| 4 | 5 | 540.475375 | 682.912298 |



--- Summary Table --- 



| | count | mean | std | min | 25% | 50% | 75% | max | value_type | missing_values |
|---|---|---|---|---|---|---|---|---|---|---|
| clientid | 5956.0 | 2978.5 | 1719.493433 | 1.0 | 1489.75 | 2978.5 | 4467.25 | 5956.0 | int64 | 0 |
| x | 5956.0 | 508.823177 | 271.061462 | 0.017692 | 282.58292 | 518.100892 | 727.156497 | 999.533215 | float64 | 0 |
| y | 5956.0 | 427.554772 | 289.04464 | 0.043285 | 170.079921 | 397.786441 | 669.982518 | 999.73172 | float64 | 0 |


The customer locations dataset contains 5,956 entries and three initial features. The data is clean, with no missing values found.

Here's a breakdown of the columns:

- `clientid`: This is an integer column that uniquely identifies each customer. It will be excluded from the clustering model, as it's an identifier and not a geographic feature.

- `x` and `y`: These are float columns representing the customers' geographic coordinates. They are the essential features for our analysis. Their values range from roughly 0 to 1000.

### 2. Association set

In [3]:
# Load the product group data
products_df = pd.read_csv('./droneData/drone_prod_groups.csv')

# Display the first few rows of the dataframe
print("\n--- First 5 rows of the product dataset: ---\n")
display(products_df.head(5))


# Display all relevant info in a transposed summary table
print("\n--- Product Data Summary Table --- \n")

# Create a DataFrame from the column names
features_summary_df = pd.DataFrame(products_df.columns, columns=['Feature'])

# Add columns for Data Type, Missing Values, and a list of Unique Values
features_summary_df['Data Type'] = products_df.dtypes.values
features_summary_df['Value Count'] = products_df.count().values
features_summary_df['Missing Values'] = products_df.isnull().sum().values
features_summary_df['Unique Values'] = [', '.join(map(str, products_df[col].unique())) for col in products_df.columns]

# Set the 'Feature' column as the index for cleaner presentation
features_summary_df.set_index('Feature', inplace=True)

display(features_summary_df)


--- First 5 rows of the product dataset: ---



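| | ID | Prod1 | Prod2 | Prod3 | Prod4 | Prod5 | Prod6 | Prod7 | Prod8 | Prod9 | ... | Prod11 | Prod12 | Prod13 | Prod14 | Prod15 | Prod16 | Prod17 | Prod18 | Prod19 | Prod20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |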



--- Product Data Summary Table --- 



| Feature | Data Type | Value Count | Missing Values | Unique Values |
|---|---|---|---|---|
| ID | int64 | 100000 | 0 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,... |
| Prod1 | int64 | 100000 | 0 | 0, 1 |
| Prod2 | int64 | 100000 | 0 | 0, 1 |
| Prod3 | int64 | 100000 | 0 | 0, 1 |
| Prod4 | int64 | 100000 | 0 | 0, 1 |
| Prod5 | int64 | 100000 | 0 | 0, 1 |
| Prod6 | int64 | 100000 | 0 | 0, 1 |
| Prod7 | int64 | 100000 | 0 | 0, 1 |
| Prod8 | int64 | 100000 | 0 | 0, 1 |
| Prod9 | int64 | 100000 | 0 | 1, 0 |


The customer transactions dataset contains 100,000 entries and 21 initial features. The data is clean, with no missing values found.

Here's a breakdown of the columns:

- `ID`: An integer that uniquely identifies each transaction. It will be excluded from the analysis.

- `Prod1` to `Prod20`: These are binary columns, each representing a unique product group. The value is `1` if an item from that group was purchased in the transaction and `0` if it was not.

## Data preparation

data preprocessing
- cleaning the data
- transforming the data
- selecting the relevant features


For this assignment, the datasets are quite clean and well-structured.

- For clustering, we will use the x and y columns from locations_df directly.

- For association rule mining, we will use the product columns (Prod1 to Prod20) from products_df. The ID column will be dropped as it is not a feature.

In [4]:
# Prepare data for clustering
X_locations = locations_df[['x', 'y']]

# Prepare data for association rule mining
# Drop the transaction ID column
X_products = products_df.drop('ID', axis=1).astype(bool)

## Modeling

choose a machine learning method and train the model (+ model validation)
- which method was used?
- which parameters were used?
- what was the performance of the model?

In [5]:
# hello world
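
For Part 1, the assignment text names k-means as the method for placing hubs from the `x` and `y` coordinates. The block below is a minimal from-scratch sketch of the k-means idea (Lloyd's algorithm), not the notebook's actual model: the function name `kmeans`, the toy coordinates, and the fixed starting centroids are all illustrative assumptions. In the notebook itself, the already imported `KMeans` class from scikit-learn would normally be used instead.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Toy customer coordinates (made up, NOT taken from drone_cust_locations.csv)
toy_points = [(100, 100), (120, 80), (90, 130), (800, 820), (780, 790), (830, 810)]

# Hypothetical starting centroids in opposite corners of the 0..1000 area
hubs, labels = kmeans(toy_points, centroids=[(0, 0), (1000, 1000)])
print("hub coordinates:", hubs)
print("cluster labels: ", labels)
```

Each returned centroid is the mean position of the customers assigned to it, which is why it is a natural candidate for a hub location when the goal is to shorten average travel distance.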

### Part 2: Finding interesting relationships between product groups

In this part, we apply association rule mining to discover interesting relationships between product groups. The goal is to find relationships between items that often occur together in transactions. For example, if customers frequently buy two specific products together, the algorithm will detect that as a strong pattern.

We use the Apriori algorithm, which finds itemsets that appear often enough (at or above a minimum support) to be considered frequent. From these frequent itemsets, we can generate rules that show how the presence of some items implies the presence of others.

This helps us understand customer purchasing behavior and can be used for things like recommendations, product placement, or promotions.
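
To make the level-wise search behind Apriori more concrete, here is a small pure-Python sketch. It is not how `mlxtend` implements `apriori`; the function name `frequent_itemsets`, the toy transactions, and the item labels `A`, `B`, `C` are assumptions made for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # Level 1: the candidates are all single items
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates:
        # Keep only the candidates that reach the minimum support
        level = {}
        for c in candidates:
            support = sum(c <= t for t in transactions) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: combine frequent itemsets into candidates one item larger
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


# Toy transactions (made up, NOT taken from drone_prod_groups.csv)
toy = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
result = frequent_itemsets(toy, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most large combinations never need to be counted.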

In [6]:
pip install mlxtend

Note: you may need to restart the kernel to use updated packages.


In [7]:
# library used for association rule mining
from mlxtend.frequent_patterns import apriori, association_rules

#### Finding frequent itemsets

Support is the fraction of transactions in which an itemset appears; the `min_support` parameter sets the threshold an itemset must reach to be included in the findings. We chose 1% as the minimum support since the dataset is relatively large and a higher value would only show a few patterns.

In [8]:
# find frequent itemsets
frequent_itemsets = apriori(X_products, min_support=0.01, use_colnames=True)
frequent_itemsets.sort_values('support', ascending=False).head(20)

| | support | itemsets |
|---|---|---|
| 18 | 0.20626 | ( Prod19) |
| 8 | 0.19853 | ( Prod9) |
| 7 | 0.16179 | ( Prod8) |
| 11 | 0.15971 | ( Prod12) |
| 19 | 0.14798 | ( Prod20) |
| 13 | 0.14557 | ( Prod14) |
| 6 | 0.13499 | ( Prod7) |
| 130 | 0.13476 | ( Prod20, Prod19) |
| 15 | 0.131 | ( Prod16) |
| 1 | 0.13098 | ( Prod2) |


We can see that Product 19 appears in the transactions most often, and the combination of Product 19 and Product 20 is the most frequent itemset containing multiple items.
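
Because support is a fraction of all transactions, it converts directly into transaction counts. The short calculation below does this for the 1% threshold and for the two supports discussed above; the variable names are illustrative, and the numbers are copied from the outputs shown earlier.

```python
n_transactions = 100_000  # rows in products_df, per the summary table
min_support = 0.01        # threshold passed to apriori

# Minimum number of transactions an itemset needs in order to be kept
min_count = round(min_support * n_transactions)

# Supports copied from the frequent itemset output
prod19_count = round(0.20626 * n_transactions)       # ( Prod19)
prod19_20_count = round(0.13476 * n_transactions)    # ( Prod20, Prod19)

print(min_count, prod19_count, prod19_20_count)
```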

#### Generating association rules

Confidence measures the reliability of an association rule X -> Y: it is the proportion of transactions containing itemset X that also contain itemset Y. A higher confidence means the rule holds more often, so we keep only rules with a confidence of at least 0.75.
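
For reference, the standard textbook definitions of the three metrics used in this part can be written as follows, where $T$ is the set of transactions and $X$, $Y$ are itemsets (the notation is ours, not taken from the course material):

$$
\mathrm{supp}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}, \qquad
\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}, \qquad
\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{conf}(X \rightarrow Y)}{\mathrm{supp}(Y)}
$$

A lift of 1 corresponds to statistical independence of $X$ and $Y$; values above 1 indicate that the two itemsets occur together more often than independence would predict.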

In [9]:
# generate association rules based on confidence
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.75)

# sort in descending order of confidence
rules_sorted = rules.sort_values(by='confidence', ascending=False)
rules_sorted[['antecedents','consequents','antecedent support','consequent support','support','confidence','lift']].head(20)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift |
|---|---|---|---|---|---|---|---|
| 4 | ( Prod2, Prod15) | ( Prod9) | 0.01947 | 0.19853 | 0.01843 | 0.946584 | 4.767967 |
| 19 | ( Prod20, Prod15) | ( Prod9) | 0.02241 | 0.19853 | 0.02119 | 0.94556 | 4.762807 |
| 30 | ( Prod20, Prod19, Prod15) | ( Prod9) | 0.0203 | 0.19853 | 0.01919 | 0.94532 | 4.761599 |
| 14 | ( Prod15, Prod12) | ( Prod9) | 0.02308 | 0.19853 | 0.02173 | 0.941508 | 4.742396 |
| 8 | ( Prod7, Prod15) | ( Prod9) | 0.02014 | 0.19853 | 0.01895 | 0.940914 | 4.739403 |
| 17 | ( Prod18, Prod15) | ( Prod9) | 0.01743 | 0.19853 | 0.0164 | 0.940906 | 4.739367 |
| 18 | ( Prod19, Prod15) | ( Prod9) | 0.03041 | 0.19853 | 0.02861 | 0.940809 | 4.738875 |
| 16 | ( Prod15, Prod16) | ( Prod9) | 0.01936 | 0.19853 | 0.0182 | 0.940083 | 4.735217 |
| 2 | (Prod1, Prod15) | ( Prod9) | 0.01597 | 0.19853 | 0.01501 | 0.939887 | 4.734233 |
| 0 | ( Prod15) | ( Prod9) | 0.1188 | 0.19853 | 0.11145 | 0.938131 | 4.725388 |


- `antecedents` – The "if" part of the rule (the **X** in **X** -> **Y**).
- `consequents` – The "then" part of the rule (the **Y** in **X** -> **Y**).
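- `antecedent support` / `consequent support` – Fraction of transactions that contain the antecedent or the consequent, respectively.
- `support` – Fraction of transactions that contain the antecedent and the consequent together.
- `confidence` – Probability that the consequent occurs given that the antecedent occurs.
- `lift` – How many times more often the antecedent and consequent occur together than expected if they were independent. Values above 1 indicate a positive association.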
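
As a quick sanity check of these definitions, the confidence and lift of one rule can be recomputed from its support columns. The sketch below uses the values of the rule ( Prod15) -> ( Prod9) copied from the table above; the variable names are illustrative.

```python
# Values copied from the row "( Prod15) -> ( Prod9)" of the rule table
support_xy = 0.11145           # share of transactions with Prod15 and Prod9 together
antecedent_support = 0.1188    # share of transactions containing Prod15
consequent_support = 0.19853   # share of transactions containing Prod9

# confidence = support of the whole rule / support of the antecedent
confidence = support_xy / antecedent_support

# lift = confidence / support of the consequent
lift = confidence / consequent_support

print(f"confidence = {confidence:.6f}")
print(f"lift       = {lift:.6f}")
```

The results should agree with the `confidence` and `lift` columns of that row up to rounding of the inputs.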

We can see that the most reliable rule is that customers who purchase Product 2 and Product 15 also purchase Product 9 in the same transaction. However, since Product 9 is the second most frequent item in the transactions, it is not surprising that the rules with the highest confidence include it. From a company's viewpoint, we also want to know which rules have a high lift, which tells us how much more often the antecedent and consequent occur together than they would if they were statistically independent. This helps identify the most promising recommendations, discounted bundles, and similar offers for customers.

In [10]:
# generate association rules based on lift
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=4)

# sort in descending order of lift
rules_sorted = rules.sort_values(by='lift', ascending=False)
rules_sorted[['antecedents','consequents','antecedent support','consequent support','support','confidence','lift']].head(20)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift |
|---|---|---|---|---|---|---|---|
| 155 | ( Prod19, Prod15) | ( Prod9, Prod20) | 0.03041 | 0.03676 | 0.01919 | 0.631042 | 17.166551 |
| 152 | ( Prod9, Prod20) | ( Prod19, Prod15) | 0.03676 | 0.03041 | 0.01919 | 0.522035 | 17.166551 |
| 154 | ( Prod20, Prod15) | ( Prod9, Prod19) | 0.02241 | 0.04996 | 0.01919 | 0.856314 | 17.139995 |
| 153 | ( Prod9, Prod19) | ( Prod20, Prod15) | 0.04996 | 0.02241 | 0.01919 | 0.384107 | 17.139995 |
| 143 | ( Prod19, Prod12) | ( Prod20, Prod5) | 0.03881 | 0.01888 | 0.01101 | 0.28369 | 15.025941 |
| 140 | ( Prod20, Prod5) | ( Prod19, Prod12) | 0.01888 | 0.03881 | 0.01101 | 0.583157 | 15.025941 |
| 142 | ( Prod5, Prod19) | ( Prod20, Prod12) | 0.0261 | 0.02811 | 0.01101 | 0.421839 | 15.006726 |
| 141 | ( Prod20, Prod12) | ( Prod5, Prod19) | 0.02811 | 0.0261 | 0.01101 | 0.391676 | 15.006726 |
| 103 | ( Prod15) | ( Prod9, Prod20) | 0.1188 | 0.03676 | 0.02119 | 0.178367 | 4.852204 |
| 100 | ( Prod9, Prod20) | ( Prod15) | 0.03676 | 0.1188 | 0.02119 | 0.576442 | 4.852204 |
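
The rules in this table come in pairs with identical lift but different confidence. That follows from the definitions: lift treats antecedent and consequent symmetrically, while confidence depends on which side is the antecedent. The sketch below illustrates this with the values from rows 155 and 152; the helper functions `confidence` and `lift` are hypothetical names written for this example, not part of `mlxtend`.

```python
def confidence(support_xy, antecedent_support):
    """Share of antecedent transactions that also contain the consequent."""
    return support_xy / antecedent_support


def lift(support_xy, antecedent_support, consequent_support):
    """Observed joint support divided by the support expected under independence."""
    return support_xy / (antecedent_support * consequent_support)


# Rows 155 and 152 of the table: the same four products, opposite direction
support_xy = 0.01919
supp_a = 0.03041  # support of ( Prod19, Prod15)
supp_b = 0.03676  # support of ( Prod9, Prod20)

forward = (confidence(support_xy, supp_a), lift(support_xy, supp_a, supp_b))
backward = (confidence(support_xy, supp_b), lift(support_xy, supp_b, supp_a))

print("A -> B: confidence=%.6f lift=%.6f" % forward)
print("B -> A: confidence=%.6f lift=%.6f" % backward)
```

For bundling decisions this means a high lift identifies a strongly associated group of products, and the confidence values then indicate which direction makes the more dependable recommendation.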


#### Business recommendations

**Personalised recommendations and marketing**
- Use association rules to personalise recommended products for customers, and trigger a promotional email/notification of the consequents when a customer buys items from an antecedent group.
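
A recommendation trigger of this kind could be sketched as a simple lookup over the mined rules. The function `recommend`, the plain product labels, and the 0.75 confidence cut-off are assumptions for illustration; the three rules and their confidence values are copied from the tables above.

```python
# (antecedent, consequent, confidence) triples copied from the rule tables
rules = [
    (frozenset({"Prod15"}), frozenset({"Prod9"}), 0.938131),
    (frozenset({"Prod20", "Prod15"}), frozenset({"Prod9", "Prod19"}), 0.856314),
    (frozenset({"Prod20", "Prod5"}), frozenset({"Prod19", "Prod12"}), 0.583157),
]


def recommend(basket, rules, min_confidence=0.75):
    """Return product groups to suggest, best-supported first."""
    basket = frozenset(basket)
    scores = {}
    for antecedent, consequent, conf in rules:
        # A rule fires when it is reliable enough and its whole antecedent is in the basket
        if conf >= min_confidence and antecedent <= basket:
            for item in consequent - basket:  # skip items already in the basket
                scores[item] = max(scores.get(item, 0.0), conf)
    # Sort by descending confidence, then by name for a stable order
    return sorted(scores, key=lambda item: (-scores[item], item))


print(recommend({"Prod15", "Prod20"}, rules))
```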

**Bundles**
- For rules with high confidence and lift, show suggestions in the checkout/shopping cart (“Customers who bought Products 15 and 20 also bought Products 9 and 19. Add them to the cart?”) and possibly include a tempting discount for the bundled items.

**Pre-loading packages**
- If certain product groups are frequently purchased together, keep pre-loaded packages with those products at the depot locations for faster shipping.

**Depot layout**
- Place commonly co-purchased product groups close to each other in the depot to increase effectiveness.

## Evaluation

evaluate the model
- How well does the model perform?
- Does it meet the business requirements?

In [11]:
# hello world

## Deployment

Conclusion / creating a recommendation of how to use the model in practice, or what to do next
- How will the model be used in practice?
- How will the results be communicated?

In [12]:
# hello world

### Reflection

#### AI usage
- for research

#### Team contribution
- who did what

#### Sources
- links & descriptions