# HEOM Tutorial
#### Table of Contents
[Introduction](#introduction)

[Preprocessing](#preprocessing) 

[HEOM with Scikit-learn](#heom-with-scikit-learn) 
#### Author
[Kacper Kubara](https://www.kacperkubara.com)
## Introduction
If you are here, you have probably run into the same problem I did not that long ago: how do you cluster mixed-type data with missing values?
Most of the distance metrics and clustering algorithms available in Scikit-Learn fail badly on this type of data. To work around this, you are expected to do some fancy preprocessing: either categorize the continuous features and label the missing values, or treat the categorical data as continuous (usually a spectacularly bad idea).

However, there are existing heterogeneous (a.k.a. mixed-type) distance metrics that solve this problem. Although they have been around for a while, they are still not popular among data scientists, which is a shame: they can remove several preprocessing steps and make it easier to cluster heterogeneous data with missing values. One of the most influential papers that discusses this problem and proposes improved heterogeneous distance metrics is "Improved Heterogeneous Distance Functions", available [here](https://arxiv.org/pdf/cs/9701101.pdf). Distython is a Python implementation of those distance metrics, designed to be used with both custom machine learning algorithms and Scikit-Learn.

Heterogeneous Euclidean Overlap Metric (HEOM) is a distance metric that uses different distance functions for continuous and categorical features. For continuous features it uses a normalized Euclidean metric, and for categorical features it uses the overlap metric (i.e. it returns 0 if the labels are the same and 1 otherwise). Missing data is handled by assigning the maximum possible distance for that feature. Once the distance for each feature is computed, the results are aggregated by taking the square root of the sum of the squared per-feature distances. Have a look at the [paper](https://arxiv.org/pdf/cs/9701101.pdf) if you want to know the gory details of HEOM.
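
To make the description above concrete, here is a minimal NumPy sketch of the per-feature rules and the final aggregation. It is not Distython's actual implementation: the function name `heom_distance`, its arguments, and the choice of range normalization for continuous features (which follows the paper) are assumptions made for illustration, and missing values are represented as `np.nan`.

```python
import numpy as np

def heom_distance(x, y, cat_ix, col_range):
    """Illustrative HEOM between two 1-D rows (hypothetical helper, not Distython's API).

    x, y      : 1-D float arrays, with np.nan marking missing values
    cat_ix    : indices of the categorical (label-encoded) features
    col_range : per-feature (max - min), assumed non-zero for continuous features
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    col_range = np.asarray(col_range, dtype=float)

    is_cat = np.zeros(x.shape[0], dtype=bool)
    is_cat[list(cat_ix)] = True

    d = np.zeros(x.shape[0], dtype=float)
    # Continuous features: absolute difference divided by the feature range
    d[~is_cat] = np.abs(x[~is_cat] - y[~is_cat]) / col_range[~is_cat]
    # Categorical features: overlap metric (0 if labels match, 1 otherwise)
    d[is_cat] = (x[is_cat] != y[is_cat]).astype(float)
    # Missing value in either row: maximum per-feature distance of 1
    d[np.isnan(x) | np.isnan(y)] = 1.0

    # Aggregate: square root of the sum of squared per-feature distances
    return float(np.sqrt(np.sum(d ** 2)))

# Toy data: column 0 is continuous, column 1 is categorical, column 2 is continuous
X = np.array([
    [1.0, 0.0, 10.0],
    [3.0, 1.0, np.nan],
    [5.0, 0.0, 30.0],
])
col_range = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)

d01 = heom_distance(X[0], X[1], cat_ix=[1], col_range=col_range)
d02 = heom_distance(X[0], X[2], cat_ix=[1], col_range=col_range)
print(d01, d02)
```

For rows 0 and 1, the per-feature distances are 0.5 (continuous), 1 (different labels) and 1 (missing value), so the aggregated distance is the square root of 2.25, i.e. 1.5.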

A drawback of this metric is that the normalized Euclidean metric is sensitive to outliers.
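
The effect is easy to see in a small example. The sketch below assumes range normalization (difference divided by max minus min, as in the paper); the helper `range_normalised_diff` is a hypothetical name used only for this illustration.

```python
import numpy as np

def range_normalised_diff(a, b, column):
    # Absolute difference scaled by the observed range of the feature
    return abs(a - b) / (column.max() - column.min())

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 1000.0)  # one extreme value added

# Same pair of points, measured against each version of the column
d_clean = range_normalised_diff(1.0, 5.0, values)
d_outlier = range_normalised_diff(1.0, 5.0, with_outlier)
print(d_clean, d_outlier)
```

Without the outlier, the two points are as far apart as the feature allows (distance 1.0). With a single outlier stretching the range to 999, the same pair ends up at a distance of roughly 0.004, so this feature contributes almost nothing to the aggregated distance.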

## Preprocessing
There are a few preprocessing requirements you need to meet before you can work with the package:
1.  The data needs to be an np.ndarray.
2.  Features with string types need to be label encoded to convert them into a numerical representation. Don't use one-hot encoding - it will produce wrong results (related issue: [#5](https://github.com/KacperKubara/distython/issues/5))!
3.  If you plan on using Distython with Scikit-Learn, you need to convert all NaNs to some NaN equivalent, e.g. 12345. This is necessary because Scikit-Learn will throw errors if the data passed to its clustering algorithms contains NaN values.
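
The three steps above can be sketched with plain NumPy as follows. The toy data, the variable names, and the use of `np.unique` for label encoding are assumptions made for illustration; any label encoder that maps each distinct string to one integer would do.

```python
import numpy as np

NAN_EQV = 12345  # arbitrary placeholder; it must not collide with a real value

# Toy mixed-type data: column 0 holds strings, column 1 holds floats with a gap
raw = np.array([
    ["red", 1.5],
    ["blue", np.nan],
    ["red", 3.0],
], dtype=object)

# Step 2: label-encode the string column (one integer code per distinct label)
labels, codes = np.unique(raw[:, 0].astype(str), return_inverse=True)

# Step 1: assemble a purely numerical np.ndarray
data = np.empty(raw.shape, dtype=float)
data[:, 0] = codes
data[:, 1] = raw[:, 1].astype(float)

# Step 3: replace every NaN with the NaN equivalent
data[np.isnan(data)] = NAN_EQV

categorical_ix = [0]  # remember which columns are categorical
print(labels)
print(data)
```

The `categorical_ix` list and the `NAN_EQV` value are what you would later pass to the metric so that it knows which columns to compare with the overlap metric and which value to treat as missing.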

## HEOM with Scikit-Learn
The metrics work well with Scikit-Learn, but only for small datasets. The problem is that the overhead generated by calling HEOM from Scikit-Learn becomes large for bigger datasets. However, the package should have much smaller overhead if you use it with custom machine learning algorithms (i.e. your own code implementation).

In [None]:
# Example code of how the HEOM metric can be used together with Scikit-Learn
import numpy as np
from sklearn.neighbors import NearestNeighbors
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version
from sklearn.datasets import load_boston
from distython import HEOM

In [None]:
# Load the example dataset from sklearn
boston = load_boston()
boston_data = boston["data"]
# Define indices of categorical variables in the data
categorical_ix = [3, 8]
# The problem here is that NearestNeighbors from sklearn 
# can't handle np.nan, so we have to set up the NaN equivalent
nan_eqv = 12345 # Arbitrary number

In [None]:
# Introduce some missingness to the data for the purpose of the example
row_cnt, col_cnt = boston_data.shape
for i in range(row_cnt):
    for j in range(col_cnt):
        # Replace roughly 1 in 20 values with the NaN equivalent
        rand_val = np.random.randint(20)
        if rand_val == 10:
            boston_data[i, j] = nan_eqv

In [None]:
# Declare the HEOM metric with the correct NaN equivalent value
heom_metric = HEOM(boston_data, categorical_ix, nan_equivalents=[nan_eqv])

# Declare NearestNeighbors and link the metric
neighbor = NearestNeighbors(metric=heom_metric.heom)

In [None]:
# Fit the model which uses the custom distance metric 
neighbor.fit(boston_data)

In [None]:
# Return the 5 nearest neighbors of the first instance (row index 0).
# kneighbors returns a tuple of (distances, indices)
result = neighbor.kneighbors(boston_data[0].reshape(1, -1), n_neighbors=5)
print(result)