# Clustering drone hubs

## Assignment understanding

This assignment is based on a real-world drone delivery scenario. We've used Amazon's drone delivery system as an inspirational prototype for the machine learning problem we aim to solve. However, it's important to note that our dataset is completely independent of Amazon.

With this assignment, we aim to address two theoretical problems:

1. **Minimize Delivery Times & Costs**: By strategically placing drone hubs (depots), the company can reduce the average travel distance for each delivery, leading to lower energy costs and faster service. We will use **k-means and hierarchical clustering** to determine the optimal coordinates for these hubs.

2. **Increase Sales Revenue**: By understanding customer purchasing patterns, the company can create targeted marketing campaigns and product bundles. We will identify which product groups are frequently bought together using **Association Rule Mining**, as discussed in our lectures.
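The first goal above can be made concrete as a number to minimize: the average distance from each customer to the closest hub. The sketch below is only an illustration on made-up coordinates; the function name `mean_distance_to_nearest_hub` and the toy points are our own assumptions, not part of the assignment data. Note that k-means itself minimizes the sum of *squared* distances, which is closely related but not identical to this average.

```python
import numpy as np

def mean_distance_to_nearest_hub(customers, hubs):
    """Average Euclidean distance from each customer to its closest hub."""
    customers = np.asarray(customers, dtype=float)
    hubs = np.asarray(hubs, dtype=float)
    # Pairwise differences, shape (n_customers, n_hubs, 2)
    diffs = customers[:, None, :] - hubs[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    # Each customer is served by the nearest hub
    return distances.min(axis=1).mean()

# Toy coordinates (made up for illustration, not from the assignment data)
customers = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
one_hub = np.array([[5.0, 1.0]])
two_hubs = np.array([[0.0, 1.0], [10.0, 1.0]])

distance_one_hub = mean_distance_to_nearest_hub(customers, one_hub)
distance_two_hubs = mean_distance_to_nearest_hub(customers, two_hubs)
print(distance_one_hub, distance_two_hubs)
```

In this toy case, adding a second hub shortens the average trip considerably, which is the effect we hope to see when choosing hub locations for the real customer data.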

In [14]:
# --- Necessary Setup For This Assignment ---

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.cluster import KMeans, AgglomerativeClustering
# from mlxtend.frequent_patterns import apriori, association_rules  # TODO: import is disabled for now (it caused problems); it will be needed for the association rule mining step

# Set plot style for better visuals
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## Data understanding

Collect and explore the data.
- What data is available? What are the characteristics of the data (variable types, value distributions etc.)?
- Are there any quality issues with the data (missing values, outliers, nonsensical values)?


We have two datasets, one for each part of the assignment. Both were imported from the GitHub repository provided by the teacher:

### 1. Clustering set

In [24]:
# Load the customer location data
locations_df = pd.read_csv('./droneData/drone_cust_locations.csv', sep=';')

# Display the first few rows of the dataframe
print("\n--- First 5 rows of the dataset: ---\n")
display(locations_df.head(5))


# Display all relevant info in a transposed summary table
print("\n--- Summary Table --- \n")
# Get descriptive statistics and transpose it
summary_table = locations_df.describe().T

# Add columns for data type and missing values
summary_table['value_type'] = locations_df.dtypes
summary_table['missing_values'] = locations_df.isnull().sum()

display(summary_table)


--- First 5 rows of the dataset: ---



Unnamed: 0,clientid,x,y
0,1,622.771572,164.857623
1,2,416.357298,630.193634
2,3,292.73502,567.333231
3,4,737.211288,166.225676
4,5,540.475375,682.912298



--- Summary Table --- 



Unnamed: 0,count,mean,std,min,25%,50%,75%,max,value_type,missing_values
clientid,5956.0,2978.5,1719.493433,1.0,1489.75,2978.5,4467.25,5956.0,int64,0
x,5956.0,508.823177,271.061462,0.017692,282.58292,518.100892,727.156497,999.533215,float64,0
y,5956.0,427.554772,289.04464,0.043285,170.079921,397.786441,669.982518,999.73172,float64,0


The customer locations dataset contains 5,956 entries and three initial features. The data is clean, with no missing values found.

Here's a breakdown of the columns:

- `clientid`: This is an integer column that uniquely identifies each customer. It will be excluded from the clustering model, as it's an identifier and not a geographic feature.

- `x` and `y`: These are float columns representing the customers' geographic coordinates. They are the essential features for our analysis. Their values range from 0 to 1000.
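Beyond missing values, a few extra sanity checks may be worth running before clustering. The sketch below is a hedged example on a small made-up frame that only mimics the column names of `locations_df`; the helper `location_quality_report` and the 0 to 1000 bounds are our assumptions based on the summary table above.

```python
import pandas as pd

# Toy frame with the same column names as locations_df (values are made up)
toy_locations = pd.DataFrame({
    "clientid": [1, 2, 3, 3],
    "x": [622.8, 416.4, 292.7, 292.7],
    "y": [164.9, 630.2, 567.3, 567.3],
})

def location_quality_report(df, lower=0.0, upper=1000.0):
    """Count duplicate IDs, missing values and out-of-range coordinates."""
    coords = df[["x", "y"]]
    # A row is out of range if either coordinate falls outside [lower, upper]
    out_of_range = ((coords < lower) | (coords > upper)).any(axis=1)
    return {
        "duplicate_ids": int(df["clientid"].duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
        "out_of_range_rows": int(out_of_range.sum()),
    }

report = location_quality_report(toy_locations)
print(report)
```

Calling the same helper on the real `locations_df` would show whether any customer appears twice or lies outside the expected coordinate square.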

### 2. Association set

In [31]:
# Load the product group data
products_df = pd.read_csv('./droneData/drone_prod_groups.csv')

# Display the first few rows of the dataframe
print("\n--- First 5 rows of the product dataset: ---\n")
display(products_df.head(5))


# Display all relevant info in a transposed summary table
print("\n--- Product Data Summary Table --- \n")

# Create a DataFrame from the column names
features_summary_df = pd.DataFrame(products_df.columns, columns=['Feature'])

# Add columns for data type, value count, missing values, and a list of unique values
features_summary_df['Data Type'] = products_df.dtypes.values
features_summary_df['Value Count'] = products_df.count().values
features_summary_df['Missing Values'] = products_df.isnull().sum().values
features_summary_df['Unique Values'] = [', '.join(map(str, products_df[col].unique())) for col in products_df.columns]

# Set the 'Feature' column as the index for cleaner presentation
features_summary_df.set_index('Feature', inplace=True)

display(features_summary_df)


--- First 5 rows of the product dataset: ---



Unnamed: 0,ID,Prod1,Prod2,Prod3,Prod4,Prod5,Prod6,Prod7,Prod8,Prod9,...,Prod11,Prod12,Prod13,Prod14,Prod15,Prod16,Prod17,Prod18,Prod19,Prod20
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
1,2,0,1,0,0,0,0,0,0,1,...,0,0,0,0,1,1,1,1,1,1
2,3,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,1
3,4,1,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,1
4,5,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,1,1



--- Product Data Summary Table --- 



Feature,Data Type,Value Count,Missing Values,Unique Values
ID,int64,100000,0,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,..."
Prod1,int64,100000,0,"0, 1"
Prod2,int64,100000,0,"0, 1"
Prod3,int64,100000,0,"0, 1"
Prod4,int64,100000,0,"0, 1"
Prod5,int64,100000,0,"0, 1"
Prod6,int64,100000,0,"0, 1"
Prod7,int64,100000,0,"0, 1"
Prod8,int64,100000,0,"0, 1"
Prod9,int64,100000,0,"1, 0"


The customer transactions dataset contains 100,000 entries and 21 initial features. The data is clean, with no missing values found.

Here's a breakdown of the columns:

- `ID`: An integer that uniquely identifies each transaction. It will be excluded from the analysis.

- `Prod1` to `Prod20`: These are binary columns, each representing a unique product group. The value is `1` if an item from that group was purchased in the transaction and `0` if it was not.
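Because every product column is a 0/1 indicator, the usual association rule measures can be computed directly with pandas. The following is a minimal sketch on made-up transactions; the helper `rule_metrics` is hypothetical and is not the `mlxtend` API, it only illustrates what support, confidence and lift mean for a single rule.

```python
import pandas as pd

# Toy transactions in the same 0/1 layout (values are made up)
toy_products = pd.DataFrame({
    "Prod1": [1, 1, 0, 1, 0],
    "Prod2": [1, 1, 0, 0, 0],
    "Prod3": [0, 1, 1, 1, 1],
})

def rule_metrics(df, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    has_a = df[antecedent] == 1
    has_c = df[consequent] == 1
    # Share of transactions containing both product groups
    support = (has_a & has_c).mean()
    # Share of antecedent transactions that also contain the consequent
    confidence = support / has_a.mean()
    # How much more often the consequent appears than expected by chance
    lift = confidence / has_c.mean()
    return {"support": support, "confidence": confidence, "lift": lift}

metrics = rule_metrics(toy_products, "Prod1", "Prod2")
print(metrics)
```

A lift above 1 suggests the two groups are bought together more often than independence would predict, which is the kind of pattern a bundle or campaign could build on.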

## Data preparation

Data preprocessing:
- cleaning the data
- transforming the data
- selecting the relevant features


For this assignment, the datasets are quite clean and well-structured.

- For clustering, we will use the `x` and `y` columns from `locations_df` directly.

- For association rule mining, we will use the product columns (`Prod1` to `Prod20`) from `products_df`. The `ID` column will be dropped as it is not a feature.

In [18]:
# Prepare data for clustering
X_locations = locations_df[['x', 'y']]

# Prepare data for association rule mining
# Drop the transaction ID column
X_products = products_df.drop('ID', axis=1)

## Modeling

Choose a machine learning method and train the model (+ model validation).
- which method was used?
- which parameters were used?
- what was the performance of the model?

In [19]:
# hello world
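
As a starting point for this section, the sketch below fits both clustering methods named in the assignment goals. It runs on synthetic points, not on `X_locations`; the three blob centers, the choice of three hubs, and the `random_state` values are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Synthetic customers around three made-up centers (not the assignment data)
rng = np.random.default_rng(42)
centers = np.array([[200.0, 200.0], [800.0, 300.0], [500.0, 800.0]])
X_demo = np.vstack([c + rng.normal(scale=30.0, size=(100, 2)) for c in centers])

n_hubs = 3  # assumed number of hubs for this demo

# k-means: the cluster centers can be read directly as hub coordinates
kmeans = KMeans(n_clusters=n_hubs, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_demo)
hub_coordinates = kmeans.cluster_centers_

# Agglomerative clustering has no cluster_centers_ attribute,
# so we average the members of each cluster to get a hub position
agglo = AgglomerativeClustering(n_clusters=n_hubs, linkage="ward")
agglo_labels = agglo.fit_predict(X_demo)
agglo_hubs = np.array(
    [X_demo[agglo_labels == k].mean(axis=0) for k in range(n_hubs)]
)

print(hub_coordinates)
print(agglo_hubs)
```

Replacing `X_demo` with `X_locations` should apply the same steps to the real data, although hierarchical clustering may be noticeably slower there.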

## Evaluation

Evaluate the model.
- How well does the model perform?
- Does it meet the business requirements?

In [20]:
# hello world
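
One possible way to evaluate the clustering is to compare inertia and silhouette score over several hub counts. The sketch below does this on synthetic points with three made-up centers; the range of `k` values and the use of the silhouette score as the selection criterion are our assumptions, not requirements from the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic customers around three made-up centers (not the assignment data)
rng = np.random.default_rng(0)
centers = np.array([[200.0, 200.0], [800.0, 300.0], [500.0, 800.0]])
X_demo = np.vstack([c + rng.normal(scale=30.0, size=(100, 2)) for c in centers])

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    results[k] = {
        # Sum of squared distances to the nearest center (lower is tighter)
        "inertia": model.inertia_,
        # Ranges from -1 to 1; higher means better separated clusters
        "silhouette": silhouette_score(X_demo, model.labels_),
    }

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k]["silhouette"])
print(best_k, results[best_k])
```

Inertia tends to keep dropping as `k` grows, so it is usually read as an elbow plot, while the silhouette score gives a single value per `k` that can be compared directly.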

## Deployment

Conclusion: creating a recommendation of how to use the model in practice, or what to do next.
- How will the model be used in practice?
- How will the results be communicated?

In [21]:
# hello world
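
In practice, a fitted k-means model could route each new order to its nearest hub. The sketch below illustrates that idea on synthetic points; the two hub areas, the `new_orders` coordinates, and the variable names are hypothetical and only meant to show how `predict` might be used.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic customers around two made-up hub areas (not the assignment data)
rng = np.random.default_rng(1)
area_a = np.array([100.0, 100.0]) + rng.normal(scale=20.0, size=(50, 2))
area_b = np.array([900.0, 900.0]) + rng.normal(scale=20.0, size=(50, 2))
X_demo = np.vstack([area_a, area_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X_demo)

# Hypothetical incoming orders: predict returns the index of the closest hub
new_orders = np.array([[110.0, 95.0], [880.0, 910.0]])
assigned = kmeans.predict(new_orders)

for order, hub_idx in zip(new_orders, assigned):
    hub = kmeans.cluster_centers_[hub_idx]
    distance = np.linalg.norm(order - hub)
    print(f"order at {order} -> hub {hub_idx} at {hub.round(1)}, distance {distance:.1f}")
```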

### Reflection

#### AI usage
- for research

#### Team contribution
- who did what

#### Sources
- links & descriptions