In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

Learning goal: To study distance functions and metrics for set-valued data. Consider a social network where each user is associated with a set of labels that describe properties of the user. We define the profile of a user to be the set of associated labels, i.e., given a set $P = \{p_1,...,p_n\}$ of user profiles and a universe of labels $L = \{l_1,...,l_m\}$, each profile $p_i \in P$ is a set of labels $L_i \subseteq L$. The task is to design functions that measure the distance between two labels and the similarity between two user profiles.

(a) Propose a distance measure between labels. More precisely, given any
two labels $l_1, l_2 \in L$, present a distance function $d$ such that $d(l_1, l_2)$
returns a distance between labels $l_1$ and $l_2$. The distance function should be (i) intuitive and (ii) satisfy the metric properties (see next parts).

Since we know nothing about the semantic meaning of the labels, we can only determine the distance between two labels from their co-occurrence in user profiles. If two labels frequently appear together in user profiles, they are likely to be closely related and should have a small distance between them. Conversely, if they rarely or never appear together, they are likely unrelated and should have a larger distance.

For example, on LinkedIn profiles, the labels "Data Science" and "Machine Learning" are likely to appear together, while the labels "Data Science" and "Music" are unlikely to appear together. Therefore, the distance between "Data Science" and "Machine Learning" should be smaller than the distance between "Data Science" and "Music".

In other words, the distance $d(l_1, l_2)$ should decrease as more profiles contain both $l_1$ and $l_2$, and increase as more profiles contain either $l_1$ or $l_2$.

Let's define $P(L_i)$ as the set of profiles containing label $L_i$ and $N$ as the total number of profiles. (From here on, $L_1, L_2, \dots$ denote individual labels, written $l_1, l_2, \dots$ above.)
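
To make the definition of $P(L_i)$ concrete, here is a minimal sketch that builds the mapping from each label to the set of profiles containing it. The helper name `build_label_index` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_label_index(profiles):
    """Map each label to the set of users whose profile contains it, i.e. P(label)."""
    index = defaultdict(set)
    for user, profile_labels in profiles.items():
        for label in profile_labels:
            index[label].add(user)
    return dict(index)

# Hypothetical toy data, for illustration only
example_profiles = {
    "u1": {"python", "statistics"},
    "u2": {"python", "music"},
    "u3": {"music"},
}

label_index = build_label_index(example_profiles)
N = len(example_profiles)  # total number of profiles

print(sorted(label_index["python"]))  # users whose profile contains "python"
print(N)
```

Precomputing this index once means each later distance computation only needs set operations on two entries instead of a pass over all profiles.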

My distance function for two labels $L_1$ and $L_2$ is defined as follows:
$$
d(L_1, L_2) =
\begin{cases}
\left(|P(L_1) \cup P(L_2)| - |P(L_1) \cap P(L_2)| + 1\right)/N & \text{if } L_1 \neq L_2 \\
0 & \text{if } L_1 = L_2
\end{cases}
$$

(b) Discuss the intuition, strengths, and limitations of your measure.

$|P(L_1) \cup P(L_2)|$ is the number of profiles containing either $L_1$ or $L_2$, and $|P(L_1) \cap P(L_2)|$ is the number of profiles containing both $L_1$ and $L_2$. Therefore, $|P(L_1) \cup P(L_2)| - |P(L_1) \cap P(L_2)|$ is the number of profiles containing exactly one of $L_1$ and $L_2$, i.e. the size of the symmetric difference of $P(L_1)$ and $P(L_2)$. This number is a good measure of how different $L_1$ and $L_2$ are: the larger it is, the more distant the two labels.

The plus 1 term handles the case where $L_1$ and $L_2$ are different labels that always occur together, so that this count is 0. Without it, two distinct labels would have distance 0, which violates the coincidence axiom of a metric space.

The division by $N$ normalizes the distance by the size of the dataset. Without it, the raw count of profiles containing exactly one of the two labels would tend to be large in a very large dataset and small in a very small one, even when the labels co-occur in the same proportion. This is not desirable because the distance between $L_1$ and $L_2$ should not depend on the size of the dataset. With the normalization, the distance between two distinct labels lies between $1/N$ and $(N + 1)/N$, so it can slightly exceed 1 when every profile contains exactly one of the two labels.
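
The count of profiles containing exactly one of the two labels can be checked directly with Python sets. This is a small sketch under assumed data: the two sets of user ids and the total of 8 profiles are hypothetical values picked for the example.

```python
# Hypothetical P(L1) and P(L2), given as sets of user ids
p_l1 = {"u1", "u2", "u3", "u6"}
p_l2 = {"u1", "u2", "u6"}

union_size = len(p_l1 | p_l2)         # profiles containing L1 or L2
intersection_size = len(p_l1 & p_l2)  # profiles containing both
exactly_one = p_l1 ^ p_l2             # symmetric difference

# Union size minus intersection size equals the size of the symmetric difference
assert union_size - intersection_size == len(exactly_one)

n_profiles = 8  # assumed total number of profiles
distance = (len(exactly_one) + 1) / n_profiles
print(exactly_one, distance)
```

Using the `^` operator avoids computing the union and intersection separately when only their difference in size is needed.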

Strength: The distance function is intuitive and easy to understand. It is also easy to compute.

Limitation: I haven't seen any obvious limitations

(c) Prove that your distance function is a metric. Depending on your
measure, this can be tricky, but study at least the easy properties!

1. Non-negativity $ d(L_1, L_2) \geq 0 $

This is true since the size of the union $|P(L_1) \cup P(L_2)|$ is always greater than or equal to the size of the intersection $|P(L_1) \cap P(L_2)|$ (the intersection is a subset of the union), and both the plus 1 term and $N$ are positive.

2. Coincidence axiom:
 
$d(L_1, L_2) = 0$ if and only if $L_1 = L_2$ 

If $L_1 = L_2$, the distance is 0 by definition (the union and intersection of the profile sets are also equal in this case).

Question: when is $L_1 \neq L_2$ but $|P(L_1) \cup P(L_2)| - |P(L_1) \cap P(L_2)| = 0$?

Answer: when both labels always co-occur in the same profiles and never appear alone in any other profile, i.e. $P(L_1) = P(L_2)$.

This case is handled by the plus 1 term in the distance function. If $L_1 \neq L_2$ and $|P(L_1) \cup P(L_2)| - |P(L_1) \cap P(L_2)| = 0$, the plus 1 term makes the distance $1/N > 0$. Hence $d(L_1, L_2) > 0$ whenever $L_1 \neq L_2$.
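
The coincidence axiom can also be checked mechanically on a small example. The sketch below assumes a hypothetical label index in which two distinct labels, `F` and `G`, occur in exactly the same profiles; the helper `label_distance` is an illustrative set-based restatement of the definition, not the notebook's own implementation.

```python
from itertools import product

def label_distance(a, b, index, n_profiles):
    """Distance from the definition: 0 for identical labels, else (|sym. diff.| + 1) / N."""
    if a == b:
        return 0.0
    return (len(index[a] ^ index[b]) + 1) / n_profiles

# Hypothetical index: F and G always co-occur, A appears elsewhere
index = {"F": {"u7", "u8"}, "G": {"u7", "u8"}, "A": {"u1", "u2"}}
n_profiles = 8  # assumed total number of profiles

# d(a, b) == 0 exactly when a and b are the same label
for a, b in product(index, repeat=2):
    assert (label_distance(a, b, index, n_profiles) == 0) == (a == b)

print(label_distance("F", "G", index, n_profiles))  # 1/N for labels that always co-occur
```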

3. Symmetry: $ d(L_1, L_2) = d(L_2, L_1) $

This is true since set union and intersection are commutative, so swapping $L_1$ and $L_2$ changes neither $|P(L_1) \cup P(L_2)|$ nor $|P(L_1) \cap P(L_2)|$.


4. Triangle Inequality: 
$$ 
d(L_1, L_3) + d(L_2, L_3) \geq d(L_1, L_2) \quad \forall L_1, L_2, L_3 \in L
$$

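Let's denote the size of the union as $U_{ij} = |P(L_i) \cup P(L_j)|$ and the size of the intersection as $I_{ij} = |P(L_i) \cap P(L_j)|$.

If two of the three labels are equal, the inequality follows from the previous properties: when $L_1 = L_2$ the right side is 0 and the left side is non-negative, and when $L_3 = L_1$ or $L_3 = L_2$ one term on the left is 0 and the other equals the right side by symmetry.

For three distinct labels, multiplying by $N$ turns the inequality into $(U_{13} - I_{13} + 1) + (U_{23} - I_{23} + 1) \geq U_{12} - I_{12} + 1$. The left side has one more plus 1 term than the right, so it is enough to prove that

$$
U_{13} - I_{13} + U_{23} - I_{23} \geq U_{12} - I_{12}
$$

Here $U_{ij} - I_{ij}$ is the number of profiles containing exactly one of $L_i$ and $L_j$. Take any profile counted on the right side, i.e. one that contains exactly one of $L_1$ and $L_2$. Suppose it contains $L_1$ but not $L_2$ (the other situation is symmetric).

Case where the profile does not contain $L_3$:
The profile contains $L_1$ but not $L_3$, so it is counted in $U_{13} - I_{13}$.

Case where the profile contains $L_3$:
The profile contains $L_3$ but not $L_2$, so it is counted in $U_{23} - I_{23}$.

In both cases, every profile counted on the right side is counted at least once on the left side, which proves the inequality.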
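
As a sanity check alongside the argument, the triangle inequality can be tested by brute force over all label triples. This is only a sketch on randomly generated data: the label names, the number of profiles, the inclusion probability 0.4, and the helper names are all assumptions made for the example. `Fraction` is used so that the comparison is exact rather than subject to floating-point rounding.

```python
import random
from fractions import Fraction
from itertools import product

def label_distance(a, b, index, n_profiles):
    """Exact distance: 0 for identical labels, else (|sym. diff.| + 1) / N."""
    if a == b:
        return Fraction(0)
    return Fraction(len(index[a] ^ index[b]) + 1, n_profiles)

def find_triangle_violation(index, n_profiles):
    """Return the first (a, b, c) with d(a,c) + d(b,c) < d(a,b), or None if there is none."""
    for a, b, c in product(index, repeat=3):
        lhs = label_distance(a, c, index, n_profiles) + label_distance(b, c, index, n_profiles)
        rhs = label_distance(a, b, index, n_profiles)
        if lhs < rhs:
            return (a, b, c)
    return None

# Hypothetical random data: each label appears in each profile with probability 0.4
rng = random.Random(0)
n_profiles = 20
index = {
    label: {u for u in range(n_profiles) if rng.random() < 0.4}
    for label in "ABCDEF"
}

print(find_triangle_violation(index, n_profiles))
```

A check like this cannot replace the proof, but a reported violation would immediately expose a mistake in either the definition or the argument.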

(d) Now we want to compare the similarity of two user profiles. Propose
an appropriate function $s(p_1, p_2)$ to compute the similarity of any two
profiles $p_1, p_2 \in P$ and discuss its intuition.

We can use the Jaccard similarity to measure the similarity between two user profiles. The Jaccard similarity is defined as the size of the intersection divided by the size of the union of the two sets of labels belonging to each user. In other words, it is the number of labels the two profiles have in common divided by the number of distinct labels that appear in at least one of them. It ranges from 0 (no label in common) to 1 (identical profiles).

$$
s(p_1, p_2) = \frac{|p_1 \cap p_2|}{|p_1 \cup p_2|}
$$
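
As a small worked instance of the formula, take two illustrative profiles $p_1 = \{A, B, C\}$ and $p_2 = \{A, B, D\}$ (labels chosen only for this example). They share two labels and contain four distinct labels in total:

$$
s(p_1, p_2) = \frac{|\{A, B\}|}{|\{A, B, C, D\}|} = \frac{2}{4} = 0.5
$$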

(e) Be prepared to show code that implements d and s and demonstrate
its behavior with a small set of toy data.

In [23]:
# Implementing the distance measure d for labels

def d(L1, L2, profiles):
    """Distance between labels L1 and L2 based on co-occurrence in profiles.

    profiles maps each user to that user's set of labels.
    """
    num_both = 0    # profiles containing both L1 and L2
    num_either = 0  # profiles containing L1 or L2 (or both)
    num_profiles = len(profiles)
    for profile_labels in profiles.values():
        has_l1 = L1 in profile_labels
        has_l2 = L2 in profile_labels
        if has_l1 and has_l2:
            num_both += 1
        if has_l1 or has_l2:
            num_either += 1

    print(f"numProfiles Containing {L1} or {L2}: ", num_either)
    print(f"numProfiles Containing Both {L1} and {L2}: ", num_both)

    # Identical labels have distance 0 by definition
    if L1 == L2:
        return 0.0
    # The +1 keeps distinct labels that always co-occur at a positive distance
    return (num_either - num_both + 1) / num_profiles

# Similarity of two user profiles: Jaccard similarity

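def s(p1, p2):
    """Jaccard similarity of two profiles, each given as a set of labels."""
    union = p1 | p2
    if not union:
        # Both profiles are empty; returning 1.0 here is an assumed convention
        return 1.0
    return len(p1 & p2) / len(union)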

# Toy data
labels = {"A", "B", "C", "D", "E", "F", "G"}  # A universe of 7 labels

profiles = {
    'user1': {"A", "B", "C"},
    'user2': {"A", "B", "D"},
    'user3': {"A", "D", "E"},
    'user4': {"E"},
    'user5': {"C", "D", "E"},
    'user6': {"A", "B", "C", "D", "E"},
    'user7': {"F", "G", "D"},
    'user8': {"F", "G", "E"},
}

# Demonstrate behavior
print("Distance between labels A and B:", d("A", "B", profiles),"\n")
print("Distance between labels A and C:", d("A", "C", profiles),"\n")
print("Distance between labels A and A:", d("A", "A", profiles),"\n")
print("Distance between labels F and G:", d("F", "G", profiles),"\n")

print("Similarity between user1 and user2 profiles:", s(profiles['user1'], profiles['user2']))
print("Similarity between user1 and user3 profiles:", s(profiles['user1'], profiles['user3']))
print("Similarity between user2 and user4 profiles:", s(profiles['user2'], profiles['user4']))


numProfiles Containing A or B:  4
numProfiles Containing Both A and B:  3
Distance between labels A and B: 0.25 

numProfiles Containing A or C:  5
numProfiles Containing Both A and C:  2
Distance between labels A and C: 0.5 

numProfiles Containing A or A:  4
numProfiles Containing Both A and A:  4
Distance between labels A and A: 0.0 

numProfiles Containing F or G:  2
numProfiles Containing Both F and G:  2
Distance between labels F and G: 0.125 

Similarity between user1 and user2 profiles: 0.5
Similarity between user1 and user3 profiles: 0.2
Similarity between user2 and user4 profiles: 0.0
