In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules



## Task
The task is split into two sub-tasks: finding optimal hub locations and finding interesting relationships between product groups.

### Part 1: Finding optimal hub locations
Guidelines for the Analysis phase:

Visualize the client locations by making a two-dimensional scatterplot. Can you give a geographic interpretation for what you see? Using k-means clustering, find optimal locations (i.e. x and y coordinates) for three drone depots. Each depot should serve its surrounding clients.
Hint: you can use Seaborn (https://seaborn.pydata.org/generated/seaborn.scatterplot.html) or the Matplotlib-based plotting built into pandas (https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

Hint: The centroids serve as the depot locations. You will later need to change the number of depots, so design your program in such a way that you just need to modify a single value to do that.
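
A minimal sketch of the hint above, assuming scikit-learn's `KMeans` and synthetic coordinates in place of the real client data. The constant `N_DEPOTS` and the array `coords` are illustrative names, not part of the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

N_DEPOTS = 3  # the single value to change when exploring other depot counts

# Synthetic stand-in for the client coordinates (assumption: the real values
# would come from the x and y columns of drone_cust_locations.csv).
rng = np.random.default_rng(42)
coords = rng.uniform(0, 1000, size=(500, 2))

kmeans = KMeans(n_clusters=N_DEPOTS, n_init=10, random_state=42)
labels = kmeans.fit_predict(coords)

# Each centroid is a candidate depot location, in the same units as coords.
depots = kmeans.cluster_centers_
print(depots)
```

Fixing `random_state` keeps the result reproducible between runs; `n_init=10` restarts the algorithm from ten initializations and keeps the best one.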

Attach the information on the closest depot to each client. That is, generate a data frame that is similar to the original one with the exception that it has an additional column that contains the identifier of the depot nearest to the client. Print the first 10 rows of the new data frame.
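
One way to attach the nearest depot is sketched below with a tiny hand-made table. The depot coordinates are hypothetical, and the column name `depot` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy client table with the same columns as drone_cust_locations.csv.
clients = pd.DataFrame({
    "clientid": [1, 2, 3, 4],
    "x": [10.0, 90.0, 15.0, 50.0],
    "y": [10.0, 80.0, 20.0, 95.0],
})
# Hypothetical depot coordinates (in practice, the k-means centroids).
depots = np.array([[12.0, 15.0], [85.0, 85.0]])

xy = clients[["x", "y"]].to_numpy()
# distances[i, j] is the Euclidean distance from client i to depot j.
distances = np.linalg.norm(xy[:, None, :] - depots[None, :, :], axis=2)

# assign() returns a new frame, so the original table stays unchanged.
clients_with_depot = clients.assign(depot=distances.argmin(axis=1))
print(clients_with_depot.head(10))
```

With a fitted scikit-learn `KMeans` model, the `labels_` attribute already holds this nearest-centroid assignment for the training points, so the explicit distance computation is mainly useful for checking or for new clients.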

Make a scatterplot that uses three different colours. The markers with the same colour are served by the same depot. Hint: Re-check the web page(s) mentioned in the first task.
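
A possible shape for such a plot, sketched with Matplotlib on synthetic points. The depot positions are made up, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
xy = rng.uniform(0, 1000, size=(300, 2))  # synthetic client locations
depots = np.array([[200.0, 200.0], [800.0, 300.0], [500.0, 800.0]])  # hypothetical

# Label each client with the index of its nearest depot.
labels = np.linalg.norm(xy[:, None, :] - depots[None, :, :], axis=2).argmin(axis=1)

fig, ax = plt.subplots(figsize=(6, 6))
# One colour per depot: the colormap maps the integer labels to colours.
ax.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="viridis", s=10)
ax.scatter(depots[:, 0], depots[:, 1], c="red", marker="X", s=150, label="depot")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Seaborn's `scatterplot` with its `hue` parameter set to the depot column would give a similar picture directly from a data frame.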

Play with the number of depots. What are the optimal locations for 10 depots, for example? Do you see a difference in the computation time when the number of depots increases?
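
The timing question can be explored with a small loop such as the one below. It is a sketch on synthetic data, so the measured times only indicate a trend and depend on the machine.

```python
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
coords = rng.uniform(0, 1000, size=(2000, 2))  # synthetic client locations

timings = {}
for n_depots in (3, 10, 25):
    start = time.perf_counter()
    KMeans(n_clusters=n_depots, n_init=10, random_state=1).fit(coords)
    timings[n_depots] = time.perf_counter() - start

for n_depots, seconds in timings.items():
    print(f"{n_depots:>3} depots: {seconds:.3f} s")
```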

Replace k-means with agglomerative hierarchical clustering and explore it with various depot numbers. What are your observations?
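
Agglomerative clustering in scikit-learn returns cluster labels but no centroids, so a depot location has to be derived afterwards. The sketch below assumes the mean of each cluster's members is an acceptable depot position and uses Ward linkage; both choices are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

N_DEPOTS = 3  # change this value to explore other depot counts

rng = np.random.default_rng(3)
coords = rng.uniform(0, 1000, size=(500, 2))  # synthetic client locations

model = AgglomerativeClustering(n_clusters=N_DEPOTS, linkage="ward")
labels = model.fit_predict(coords)

# No cluster_centers_ attribute here: use each cluster's mean as its depot.
depots = np.vstack([coords[labels == k].mean(axis=0) for k in range(N_DEPOTS)])
print(depots)
```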

In the end, your report should give a recommendation on how the depots should be placed, depending on the number of depots. You should also discuss the differences between k-means and hierarchical clustering in this context.

### Part 2: Finding interesting relationships between product groups
Use association rule mining to find interesting relationships between product groups.

Your report should include a clear recommendation on how the company should use the results of the association rule mining to increase its revenue.

## Business Understanding
### The Goal:
The goal of this task is to determine the most efficient locations for drone depots by analyzing the geographic distribution of client locations. The aim is to place the depots so that the distance between each depot and the clients it serves is minimized. The task is split into two sub-tasks: finding optimal hub locations and finding interesting relationships between product groups.
The first part will be achieved by using k-means clustering to find the optimal locations for three drone depots. The second part will be achieved by using association rule mining to find interesting relationships between product groups.

### Requirements & Limitations:
- **Visualize client locations**: Create a two-dimensional scatterplot of client locations.
- **K-means clustering**: Use k-means clustering to find optimal locations for three drone depots.
- **Depot assignment**: Attach the closest depot information to each client in a new data frame.
- **Scatterplot with depot assignments**: Create a scatterplot with different colors for each depot's clients.
- **Hierarchical clustering**: Replace k-means with agglomerative hierarchical clustering and explore with various depot numbers.
- **Association rule mining**: Use association rule mining to find relationships between product groups.

### Expected Outcome:
- **Scatterplot of client locations**: A visual representation of client locations.
- **Optimal depot locations**: Coordinates of the optimal depot locations using k-means clustering.
- **Hierarchical clustering analysis**: Observations and comparisons of depot placements using hierarchical clustering.
- **Recommendations**: A report with recommendations on depot placements and a discussion on clustering methods.

## Data Understanding
### Dataset drone_prod_groups.csv
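This dataset contains 100,000 rows of product group data. Each row has an identifier and twenty product group columns. In the summary statistics these columns range from 0 to 1, so each row presumably records which product groups occurred together, for example in a single order.
- **Columns**:
  - ID: Identifier for the row.
  - Prod1 to Prod20: Indicator per product group, presumably 1 if the group is present in the row and 0 otherwise.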
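
The notebook's later output lists the product columns with a leading space (for example ` Prod2`), which suggests the CSV has a space after each comma. A sketch of how pandas' `skipinitialspace` option handles this, using a small in-memory sample rather than the real file:

```python
import io

import pandas as pd

# In-memory sample that mimics a "comma + space" separated file (assumption:
# drone_prod_groups.csv is laid out this way).
raw = "ID, Prod1, Prod2\n1, 0, 1\n2, 1, 0\n"

default = pd.read_csv(io.StringIO(raw))
cleaned = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(default.columns))  # names keep the leading space
print(list(cleaned.columns))  # names are clean
```

Clean column names also make the antecedents and consequents in the association rules easier to read and to filter.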

### Dataset drone_cust_locations.csv
This dataset contains the geographic locations of clients. Each row represents a client with their respective coordinates.
- **Columns**:
  - clientid: Unique identifier for the client.
  - x: X-coordinate of the client's location.
  - y: Y-coordinate of the client's location.

In [20]:
locations_df = pd.read_csv('../datasets/drone_cust_locations.csv', delimiter=';')
groups_df = pd.read_csv('../datasets/drone_prod_groups.csv')

# Describe the data
print(locations_df.describe())
print(groups_df.describe())

# Print column names
print(locations_df.columns)
print(groups_df.columns)

          clientid            x            y
count  5956.000000  5956.000000  5956.000000
mean   2978.500000   508.823177   427.554772
std    1719.493433   271.061462   289.044640
min       1.000000     0.017692     0.043285
25%    1489.750000   282.582920   170.079921
50%    2978.500000   518.100892   397.786441
75%    4467.250000   727.156497   669.982518
max    5956.000000   999.533215   999.731720
                  ID          Prod1         Prod2          Prod3  \
count  100000.000000  100000.000000  100000.00000  100000.000000   
mean    50000.500000       0.109980       0.13098       0.032710   
std     28867.657797       0.312866       0.33738       0.177877   
min         1.000000       0.000000       0.00000       0.000000   
25%     25000.750000       0.000000       0.00000       0.000000   
50%     50000.500000       0.000000       0.00000       0.000000   
75%     75000.250000       0.000000       0.00000       0.000000   
max    100000.000000       1.000000       1.00000  

## Data Preparation
### Cleaning the Data:
- Load the datasets: Read the drone_prod_groups.csv and drone_cust_locations.csv files into pandas DataFrames.
- Handle missing values: Check for and handle any missing values in the datasets.

### Feature selection:
- Select relevant features: For drone_cust_locations, use clientid, x, and y. For drone_prod_groups, use ID and Prod1 to Prod20.

### Splitting the Data:
- Split the data: For clustering, use the x and y coordinates from drone_cust_locations. For association rule mining, use the product columns from drone_prod_groups.

### Data standardization:
- Standardize the data: Scale the x and y coordinates to zero mean and unit variance for clustering.
- Convert data types: Ensure that the data types are appropriate for the analysis.
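
If clustering runs on standardized coordinates, the resulting centroids are in standardized units too and must be mapped back before they can be read as depot locations. A minimal sketch on synthetic data, assuming scikit-learn's `StandardScaler` and `KMeans`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
coords = rng.uniform(0, 1000, size=(400, 2))  # synthetic client locations

scaler = StandardScaler()
coords_scaled = scaler.fit_transform(coords)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(coords_scaled)
depots_scaled = kmeans.cluster_centers_

# Map the centroids back to the original coordinate system.
depots = scaler.inverse_transform(depots_scaled)
print(depots)
```

Since x and y both span roughly 0 to 1000 in the summary statistics, clustering the raw coordinates is also a reasonable option and avoids rescaling the two axes by slightly different factors.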

### Prepare data for association rule mining:
- Convert the product data to a boolean type.
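
To make the rule metrics concrete, the arithmetic below recomputes confidence, lift and leverage from the support values that the notebook's output reports for the rule Prod9 → Prod2. It uses only the standard definitions of these metrics, not mlxtend itself.

```python
# Support values as printed in the rules table for Prod9 -> Prod2.
support_antecedent = 0.19853  # share of rows containing Prod9
support_consequent = 0.13098  # share of rows containing Prod2
support_both = 0.03210        # share of rows containing both

# Confidence: how often Prod2 appears among rows that contain Prod9.
confidence = support_both / support_antecedent
# Lift: confidence relative to the baseline frequency of Prod2 (1.0 = independent).
lift = confidence / support_consequent
# Leverage: observed joint support minus the support expected under independence.
leverage = support_both - support_antecedent * support_consequent

print(f"confidence={confidence:.6f} lift={lift:.6f} leverage={leverage:.5f}")
```

A lift above 1 means the two product groups occur together more often than independence would predict, which is why the notebook filters rules with `min_threshold=1.0` on lift.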

### Check for duplicate entries:
- Check for duplicate client entries in the locations data.


In [22]:
# Check for missing values
print(locations_df.isnull().sum())
print(groups_df.isnull().sum())

# Drop rows with missing values (if any)
locations_df.dropna(inplace=True)
groups_df.dropna(inplace=True)

# Select relevant features
locations_df = locations_df[['clientid', 'x', 'y']]
groups_df = groups_df.loc[:, ['ID'] + [col for col in groups_df.columns if 'Prod' in col]]

# Splitting the Data
X = locations_df[['x', 'y']]  # For clustering
Y = groups_df.drop(columns=['ID'])  # For association rule mining

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert the 0/1 indicators to booleans, which apriori expects (avoids a deprecation warning)
Y = Y > 0

# Check for duplicate client entries
print("Duplicate clients:", locations_df.duplicated(subset=['clientid']).sum())

# Ensure standardization has no anomalies
print(f"Mean: {X_scaled.mean(axis=0)}, Std: {X_scaled.std(axis=0)}")

# Mine frequent itemsets and derive association rules
frequent_itemsets = apriori(Y, min_support=0.03, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

# Display results
print(rules.head())

# Print first few rows of prepared data
print(X_scaled[:5])
print(Y.head())


clientid    0
x           0
y           0
dtype: int64
ID         0
Prod1      0
 Prod2     0
 Prod3     0
 Prod4     0
 Prod5     0
 Prod6     0
 Prod7     0
 Prod8     0
 Prod9     0
 Prod10    0
 Prod11    0
 Prod12    0
 Prod13    0
 Prod14    0
 Prod15    0
 Prod16    0
 Prod17    0
 Prod18    0
 Prod19    0
 Prod20    0
dtype: int64
Duplicate clients: 0
Mean: [1.37193443e-16 3.81755667e-17], Std: [1. 1.]
  antecedents consequents  antecedent support  consequent support  support  \
0    ( Prod9)    ( Prod2)             0.19853             0.13098  0.03210   
1    ( Prod2)    ( Prod9)             0.13098             0.19853  0.03210   
2   ( Prod19)    ( Prod2)             0.20626             0.13098  0.03346   
3    ( Prod2)   ( Prod19)             0.13098             0.20626  0.03346   
4   ( Prod12)    ( Prod5)             0.15971             0.10459  0.06683   

   confidence      lift  representativity  leverage  conviction  \
0    0.161688  1.234451               1.0  0.00609