# Similarity measures

The following workbook is based on p. 236 of the text, exercises 2.124–127, regarding "similarity" of two sets.


## Definitions of similarity

Given two sets $A$ and $B$:
* the _cardinality measure_ of similarity is $|A \cap B|$
* the _Jaccard coefficient_ of similarity is $\frac{|A \cap B|}{|A \cup B|}$
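
For a concrete feel of the two definitions, here is a small worked example. The sets are made up for illustration and do not come from the text.

$$A = \{1, 2, 3, 4\}, \qquad B = \{3, 4, 5, 6, 7, 8\}$$
$$A \cap B = \{3, 4\}, \qquad A \cup B = \{1, 2, 3, 4, 5, 6, 7, 8\}$$
$$|A \cap B| = 2, \qquad \frac{|A \cap B|}{|A \cup B|} = \frac{2}{8} = 0.25$$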

**Question:** In your own words, explain why these two measures can represent the similarity of two sets, and explain the difference between the two.

The cardinality measure of similarity gives the number of items that are elements of both sets, whereas the Jaccard coefficient gives the ratio of that number to the number of distinct elements in the union of the two sets. Both measures represent similarity because each grows with the number of elements the two sets share.

The cardinality measure counts the shared elements without any context about the sizes of the sets themselves. The Jaccard coefficient supplies that context by dividing the count by the number of distinct elements that appear in either set, so it better reflects how similar the sets are overall. However, the Jaccard coefficient alone does not reveal how many elements the sets have in common.
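
The difference can be sketched with two hypothetical pairs of sets that share the same number of elements but differ greatly in size. The helper `both_measures` and the sample data are assumptions introduced only for this illustration.

```python
def both_measures(a, b):
    """Return (|a ∩ b|, |a ∩ b| / |a ∪ b|) for two non-empty sets."""
    shared = len(a & b)
    return shared, shared / len(a | b)

# Hypothetical data: each pair shares exactly two elements.
small_pair = ({1, 2, 3}, {2, 3, 4})                   # union has 4 elements
large_pair = (set(range(1, 10)), set(range(8, 17)))   # union has 16 elements

small_card, small_jaccard = both_measures(*small_pair)
large_card, large_jaccard = both_measures(*large_pair)

print(small_card, small_jaccard)  # 2 0.5
print(large_card, large_jaccard)  # 2 0.125
```

The cardinality measure cannot tell the two pairs apart, while the Jaccard coefficient rates the small pair as four times more similar.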

## Similarity as a metric

In class we defined a metric, which is a distance function that obeys some nice, standard properties.  See the writeup distributed in Week 2 notes for more information about metrics.

Similarity, as a measure, is essentially the opposite of "distance":  two sets that are similar are not very "far apart".

In fact, it turns out we can formalize this mathematically.  Consider the function
$$ \mathrm{dist}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}.$$
Intuitively, this is the "opposite" of the Jaccard coefficient:  Think of the coefficient as a percentage, because it's a fraction between $0$ and $1$; if the Jaccard coefficient of $A$ and $B$ says something like the sets are "20% similar," then $\mathrm{dist}(A, B)$ would say they are "80% different."
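
As a minimal sketch of this function, the block below computes the distance for two made-up sets whose Jaccard coefficient is 0.2. The name `jaccard_distance` and the treatment of two empty sets (distance 0) are assumptions, since the formula itself is undefined when the union is empty.

```python
def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are identical, so distance 0.
        return 0.0
    return 1 - len(a & b) / len(union)

# Hypothetical sets: 1 shared element out of 5 distinct ones ("20% similar").
A = {1, 2, 3}
B = {3, 4, 5}

similarity = len(A & B) / len(A | B)
distance = jaccard_distance(A, B)
print(similarity, distance)  # roughly 0.2 and 0.8
```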

This function is in fact a metric on sets.  Below, justify this by arguing that this function satisfies the four properties that a metric must satisfy (non-negativity, identity, symmetry, and triangle inequality).
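
Before writing the argument, it can help to check the four properties by brute force on a small universe. The sketch below is a sanity check, not a proof: it tests every subset of a hypothetical four-element universe, uses exact fractions to avoid rounding issues, and assumes that two empty sets are at distance 0. All function names are introduced here for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def jaccard_distance(a, b):
    """Exact value of 1 - |a ∩ b| / |a ∪ b|."""
    union = a | b
    if not union:
        return Fraction(0)  # assumed convention for two empty sets
    return 1 - Fraction(len(a & b), len(union))

def all_subsets(universe):
    """Return every subset of `universe` as a frozenset."""
    items = list(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def satisfies_metric_axioms(subsets):
    """Check the four metric properties on every pair and triple."""
    for a, b in product(subsets, repeat=2):
        d = jaccard_distance(a, b)
        if d < 0:                        # non-negativity
            return False
        if (d == 0) != (a == b):         # identity
            return False
        if d != jaccard_distance(b, a):  # symmetry
            return False
    for a, b, c in product(subsets, repeat=3):
        # triangle inequality
        if jaccard_distance(a, c) > jaccard_distance(a, b) + jaccard_distance(b, c):
            return False
    return True

subsets = all_subsets({1, 2, 3, 4})
result = satisfies_metric_axioms(subsets)
print(len(subsets), result)
```

A passing check only shows that no counterexample exists among these 16 subsets; the written justification still has to cover arbitrary sets.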

## Computing the two measures

Implement a Python function `sim_card` that returns the similarity of two sets using the cardinality measure:

In [1]:
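def sim_card(list1, list2):
    # Convert to sets so that plain lists (and duplicate items) are handled correctly.
    intersection = set(list1) & set(list2)
    # The cardinality measure is the number of shared elements.
    return len(intersection)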

Implement a Python function `sim_jaccard` that returns the similarity of two sets using the Jaccard coefficient:

In [2]:
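def sim_jaccard(list1, list2):
    # Work on sets so duplicate items are not double-counted in the union.
    set1, set2 = set(list1), set(list2)
    intersection = sim_card(set1, set2)
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|.
    union = len(set1) + len(set2) - intersection
    # Assumed convention: two empty sets are identical, so return 1.0.
    return float(intersection) / union if union else 1.0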

You can use this space to test your functions.  Just be sure to hit the "run" button on the blocks above after you make changes, so that calling them below will invoke the correct code.

In [3]:
jeff = [1, 2, 3, 4, 5, 6]
jeffrey = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

jeff = set(jeff)
jeffrey = set(jeffrey)

kevin = [11, 12, 13]
kev = [11, 12, 13]

kevin = set(kevin)
kev = set(kev)

a = sim_card(jeffrey, jeff) 
b = sim_jaccard(jeffrey, jeff)

c = sim_card(kevin, kev)  
d = sim_jaccard(kevin, kev)

e = sim_card(kev, jeff)  
f = sim_jaccard(kev, jeff) 

print("The cardinality of jeffrey and jeff is", a, " and the jaccard coefficient is", b)
print("The cardinality of kevin and kev is", c, " and the jaccard coefficient is", d)
print("The cardinality of kev and jeff is", e, " and the jaccard coefficient is", f)

The cardinality of jeffrey and jeff is 6  and the jaccard coefficient is 0.6
The cardinality of kevin and kev is 3  and the jaccard coefficient is 1.0
The cardinality of kev and jeff is 0  and the jaccard coefficient is 0.0
