# Defining a Custom Dissimilarity Metric for Mixed Data Types

In this Jupyter notebook, we will define and compute a custom dissimilarity metric for a dataset containing both numerical and categorical features. Our goal is to accurately measure the dissimilarity between samples, taking into account the relative importance of each feature.

## Dataset Overview

First, let's load the dataset and take a quick look at its structure and contents.



In [24]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

In [25]:
# Load the dataset
df = pd.read_csv('./dataset.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | age | height | job | city | favorite music style |
|---|---|---|---|---|---|---|
| 0 | 0 | 30.237071 | 179.874298 | designer | paris | trap |
| 1 | 1 | 27.915796 | 172.659587 | fireman | marseille | hiphop |
| 2 | 2 | 32.205338 | 181.337491 | teacher | paris | metal |
| 3 | 3 | 26.595215 | 172.337885 | designer | toulouse | metal |
| 4 | 4 | 27.39478 | 182.70803 | teacher | paris | metal |
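
The two `Unnamed` columns look like leftover index columns, which typically appear when a DataFrame is saved to CSV with its index and then reloaded. That origin is an assumption, since the notebook does not say how the file was produced. They are not used by the metric, so one option is to drop them after loading. The sketch below uses a small in-memory CSV (hypothetical values) rather than the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for dataset.csv, with the same kind of leftover index columns
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,age,job\n"
    "0,0,30.2,designer\n"
    "1,1,27.9,fireman\n"
)
raw = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with 'Unnamed'
cleaned = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))
```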


## Custom Metric Definition

Because our dataset contains both numerical and categorical data, we cannot directly apply a standard Euclidean metric: a difference between two job titles or two cities is not defined. Instead, we define a hybrid metric that treats each feature type according to its nature.

### Feature Importance

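- Numerical features: These contribute to the dissimilarity through the absolute difference of their min-max scaled values, so each numerical feature adds between 0 and 1.
- Categorical features: We use a simple matching coefficient, where the dissimilarity is 0 if the categories match and 1 otherwise.

Features can also be weighted according to domain knowledge or feature relevance. In custom_dissimilarity, defined later in this notebook, every term is added with a weight of 1, so a feature's influence depends only on the range of values it can take.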
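
One way to make feature importance explicit is a weighted sum of per-feature dissimilarities, normalized by the total weight. The sketch below is illustrative only: the function name `weighted_dissimilarity`, the weights, and the two sample rows are assumptions, not part of this notebook. It presumes the numerical values are already scaled to [0, 1].

```python
def weighted_dissimilarity(a, b, numeric_keys, categorical_keys, weights):
    """Weighted average of per-feature dissimilarities, each in [0, 1]."""
    total = 0.0
    # Numerical features: absolute difference of already-scaled values
    for key in numeric_keys:
        total += weights[key] * abs(a[key] - b[key])
    # Categorical features: simple matching (0 if equal, 1 otherwise)
    for key in categorical_keys:
        total += weights[key] * (0.0 if a[key] == b[key] else 1.0)
    # Normalize so the result stays in [0, 1] whatever the weights
    return total / sum(weights[key] for key in numeric_keys + categorical_keys)

# Hypothetical weights and already-scaled sample rows
weights = {"age": 2.0, "height": 1.0, "job": 1.0, "city": 0.5}
sample_a = {"age": 0.25, "height": 0.5, "job": "designer", "city": "paris"}
sample_b = {"age": 0.75, "height": 0.5, "job": "teacher", "city": "paris"}

result = weighted_dissimilarity(
    sample_a, sample_b, ["age", "height"], ["job", "city"], weights
)
print(result)
```

Dividing by the total weight keeps the result comparable when the weights change, which makes it easier to experiment with different importance settings.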



### Checking the Music Genres

The DataFrame contains the following unique music styles:

In [26]:
unique_music_styles = df['favorite music style'].unique()

print("Unique music styles in the DataFrame:")
for music_style in unique_music_styles:
    print(music_style)

Unique music styles in the DataFrame:
trap
hiphop
metal
rock
rap
classical
other
jazz
technical death metal


### Music Genre Dissimilarity

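To measure the dissimilarity between music genres, each style is mapped to a numerical value, and the dissimilarity between two styles is the absolute difference of their values. The closer the numbers are, the more similar the genres are considered to be. Note that hiphop and rap share the same value, so they are treated as identical. Here is the mapping: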


In [27]:
music_style_values = {
    'trap': 1,
    'hiphop': 2,
    'rap': 2,
    'rock': 3,
    'metal': 4,
    'technical death metal': 4.1,
    'jazz': 5,
    'classical': 6,
    'other': 7
}

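def music_style_dissimilarity(style1, style2):
    """Absolute difference between the mapped values of two music styles."""
    # Styles missing from the mapping fall back to 7, the value of 'other'
    value1 = music_style_values.get(style1, 7)
    value2 = music_style_values.get(style2, 7)
    return abs(value1 - value2)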

print(music_style_dissimilarity('trap', 'hiphop'))  # Expected to be low
print(music_style_dissimilarity('classical', 'technical death metal'))  # Expected to be higher

1
1.9000000000000004
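
The raw music dissimilarity ranges from 0 to 6, whereas each numerical or categorical feature contributes at most 1. If a comparable scale is wanted, one option is to divide by the span of the mapping. The sketch below is an assumption about how this could be done, not what the notebook does. The name `music_style_dissimilarity_scaled` is hypothetical, and the mapping is repeated so the block runs on its own.

```python
music_style_values = {
    "trap": 1,
    "hiphop": 2,
    "rap": 2,
    "rock": 3,
    "metal": 4,
    "technical death metal": 4.1,
    "jazz": 5,
    "classical": 6,
    "other": 7,
}

# Largest possible difference between two mapped values (7 - 1 = 6)
value_span = max(music_style_values.values()) - min(music_style_values.values())

def music_style_dissimilarity_scaled(style1, style2):
    """Music dissimilarity rescaled to [0, 1]; unknown styles count as 'other'."""
    default = music_style_values["other"]
    value1 = music_style_values.get(style1, default)
    value2 = music_style_values.get(style2, default)
    return abs(value1 - value2) / value_span

print(music_style_dissimilarity_scaled("trap", "hiphop"))
print(music_style_dissimilarity_scaled("trap", "other"))
```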


In [28]:
# Define numerical and categorical columns
columns_numerical = ['age', 'height']
columns_categorical = ['job', 'city']
music_column = 'favorite music style'

# Fit a scaler based on the numerical columns of the dataframe
scaler = MinMaxScaler()
df_numerical = df[columns_numerical]
scaler.fit(df_numerical)
df_scaled = scaler.transform(df_numerical)

# Create a dissimilarity function that operates on the scaled data
def custom_dissimilarity(index1, index2, df_scaled, df):
    """Return the dissimilarity between the rows at positions index1 and index2."""
    dissimilarity = 0
    # Numerical features: absolute difference of the min-max scaled values
    for col_num in range(df_scaled.shape[1]):
        dissimilarity += abs(df_scaled[index1, col_num] - df_scaled[index2, col_num])

    # Categorical features: add 1 for each mismatch
    for col in columns_categorical:
        if df[col].iat[index1] != df[col].iat[index2]:
            dissimilarity += 1

    # Music style: absolute difference of the mapped genre values (not rescaled)
    music_dissim = music_style_dissimilarity(df[music_column].iat[index1], df[music_column].iat[index2])
    dissimilarity += music_dissim

    return dissimilarity

## Computing Dissimilarities

Now, we will apply our custom metric to compute the dissimilarities between all pairs of samples in the dataset.


In [29]:
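n_samples = len(df)
dissimilarity_matrix = np.zeros((n_samples, n_samples))
# The metric is symmetric, so compute the upper triangle and mirror it;
# the diagonal stays at 0
for i in range(n_samples):
    for j in range(i + 1, n_samples):
        dissimilarity_matrix[i, j] = custom_dissimilarity(i, j, df_scaled, df)
        dissimilarity_matrix[j, i] = dissimilarity_matrix[i, j]
# Save the matrix for later use
np.save('dissimilarity_matrix.npy', dissimilarity_matrix)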
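
The double loop calls the metric once per pair, which can become slow as the number of samples grows. As a possible alternative, the same three terms can be computed for all pairs at once with NumPy broadcasting. The sketch below is a hedged illustration: `pairwise_dissimilarity` is a hypothetical helper, and the toy arrays stand in for the scaled numerical columns, the categorical columns, and the mapped music values.

```python
import numpy as np

def pairwise_dissimilarity(scaled, categorical, music_values):
    """All-pairs dissimilarity matrix.

    scaled: (n, p) float array of min-max scaled numerical features
    categorical: (n, q) array of category labels
    music_values: (n,) float array of mapped music style values
    """
    # Sum of absolute differences over the numerical features
    numeric_part = np.abs(scaled[:, None, :] - scaled[None, :, :]).sum(axis=2)
    # Number of mismatching categorical features
    categorical_part = (categorical[:, None, :] != categorical[None, :, :]).sum(axis=2)
    # Absolute difference of the mapped music values
    music_part = np.abs(music_values[:, None] - music_values[None, :])
    return numeric_part + categorical_part + music_part

# Toy data (hypothetical): three samples
scaled = np.array([[0.0, 0.5], [1.0, 0.5], [0.5, 0.0]])
categorical = np.array([
    ["designer", "paris"],
    ["fireman", "marseille"],
    ["teacher", "paris"],
])
music_values = np.array([1.0, 2.0, 4.0])  # trap, hiphop, metal

toy_matrix = pairwise_dissimilarity(scaled, categorical, music_values)
print(toy_matrix)
```

This version holds intermediate arrays of shape (n, n, p) in memory, so it trades memory for speed and may not suit very large datasets.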

## Analyzing the Dissimilarity Distribution

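We will compute the mean and standard deviation of the dissimilarities to summarize their distribution. Both statistics are taken over the full matrix, so they include the zero diagonal and count each pair twice.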



In [30]:
# Calculate the mean and standard deviation of dissimilarities
mean_dissimilarity = np.mean(dissimilarity_matrix)
std_dissimilarity = np.std(dissimilarity_matrix)

print(f"Mean Dissimilarity: {mean_dissimilarity}")
print(f"Standard Deviation of Dissimilarity: {std_dissimilarity}")


Mean Dissimilarity: 4.165459750473434
Standard Deviation of Dissimilarity: 1.7079158102693894
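
The statistics above are computed over every entry of the matrix. If only distinct pairs are of interest, one option is to restrict the computation to the strict upper triangle, which skips the zero diagonal and the mirrored duplicates. The sketch below uses a small hypothetical matrix rather than the real one, and `pairwise_stats` is an illustrative helper name.

```python
import numpy as np

def pairwise_stats(matrix):
    """Mean and standard deviation over distinct pairs (i < j) only."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    values = matrix[rows, cols]
    return values.mean(), values.std()

# Hypothetical symmetric dissimilarity matrix for three samples
toy = np.array([
    [0.0, 4.0, 5.0],
    [4.0, 0.0, 5.0],
    [5.0, 5.0, 0.0],
])

mean_pairs, std_pairs = pairwise_stats(toy)
print(f"Mean over distinct pairs: {mean_pairs}")
print(f"Mean over full matrix:    {toy.mean()}")
```

Because the diagonal is zero, the full-matrix mean is always somewhat lower than the mean over distinct pairs, and the gap shrinks as the number of samples grows.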


## Conclusion

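- **Mean Dissimilarity**: 4.1655  
  A pair of samples can score at most 10 with this metric (up to 2 from the scaled numerical features, 2 from the categorical features, and 6 from music style), so the average sits a little below the midpoint of that range, indicating varied sample characteristics.

- **Standard Deviation of Dissimilarity**: 1.7079  
  The spread is large relative to the mean, which suggests a wide range of differences among the samples and points to a heterogeneous dataset.

Because the music-style term can reach 6 while every other feature contributes at most 1, music preferences can account for a large share of the dissimilarity. Capturing such graded differences between genres is valuable for personalized applications like recommendations.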
