# Worksheet 04

Name:  Zihan Li

UID:  U83682995

### Topics

- Distance & Similarity

### Distance & Similarity

#### Part 1

a) In the Minkowski distance, describe what the parameters p and d are.

p is the parameter that determines the type of norm used for the distance measurement, affecting how the distance between two points is calculated.

d represents the number of dimensions of the points being compared, indicating how many individual components are considered when calculating the distance.
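
For reference, a standard way to write the Minkowski distance between two points $x$ and $y$ with $d$ components each is shown below. The symbols $x_i$ and $y_i$ are notation assumed here for the individual components; they do not appear in the worksheet.

$$
D_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}
$$

Setting $p = 1$ gives the Manhattan distance and $p = 2$ gives the Euclidean distance.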

b) In your own words describe the difference between the Euclidean distance and the Manhattan distance.

Euclidean distance measures the length of the straight line between two points, whereas Manhattan distance is the sum of the absolute differences along each coordinate axis (the horizontal and vertical distances in two dimensions).

Consider A = (0, 0) and B = (1, 1). When:

- p = 1, d(A, B) = 2
- p = 2, d(A, B) = $\sqrt{2} \approx 1.41$
- p = 3, d(A, B) = $2^{1/3} \approx 1.26$
- p = 4, d(A, B) = $2^{1/4} \approx 1.19$
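
The values in the list above can be checked with a short sketch. The function name `minkowski_distance` is an assumption made for illustration and is not part of the worksheet.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


A = (0, 0)
B = (1, 1)

# Print the distance for each value of p listed above, rounded to two decimals.
for p in (1, 2, 3, 4):
    print(f"p = {p}: d(A, B) = {minkowski_distance(A, B, p):.2f}")
```

The distance shrinks as p grows because the sum inside the root stays at 2 for these points while the root taken of it gets larger.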

c) Describe what you think distance would look like when p is very large.

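When p becomes very large, the distance approaches the largest absolute difference along any single coordinate, $\max_i |x_i - y_i|$ (the Chebyshev distance). For A and B above, that limit is 1.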
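
The behaviour for large p can be explored numerically. This is a minimal sketch; the helper names `minkowski_distance` and `chebyshev_distance` are assumptions introduced here, and the second pair of points is a hypothetical example added to show a case where the coordinates differ by unequal amounts.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def chebyshev_distance(a, b):
    """Largest absolute difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))


A, B = (0, 0), (1, 1)
C, D = (0, 0), (3, 4)  # hypothetical extra pair with unequal coordinate differences

for p in (1, 2, 10, 100):
    print(f"p = {p}: d(A, B) = {minkowski_distance(A, B, p):.4f}, "
          f"d(C, D) = {minkowski_distance(C, D, p):.4f}")

print("Chebyshev:", chebyshev_distance(A, B), chebyshev_distance(C, D))
```

For large p the largest coordinate difference dominates the sum, so the Minkowski value settles near the Chebyshev value.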

d) Is the Minkowski distance still a distance function when p < 1? Explain why / why not.

When p<1, the Minkowski distance does not work as a true distance function because it does not satisfy the triangle inequality, which is a necessary condition for a distance function.
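
A concrete counterexample makes the violation visible. The sketch below uses p = 0.5 and a third point C = (0, 1), both chosen here for illustration; the function name is likewise an assumption.

```python
def minkowski_distance(a, b, p):
    """Minkowski-style formula; for p < 1 this is no longer a true metric."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


A, B, C = (0, 0), (1, 1), (0, 1)
p = 0.5

direct = minkowski_distance(A, B, p)                                # (1 + 1) ** 2 = 4
detour = minkowski_distance(A, C, p) + minkowski_distance(C, B, p)  # 1 + 1 = 2

print(f"d(A, B) = {direct}, d(A, C) + d(C, B) = {detour}")
print("Triangle inequality holds:", direct <= detour)
```

Going through C is "shorter" than going directly from A to B, which the triangle inequality forbids.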

e) When would you use cosine similarity over the Euclidean distance?

We can use cosine similarity over Euclidean distance when the direction of data points matters more than their magnitude, such as in text similarity, recommendations, or high-dimensional data analysis.
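
The difference can be sketched with made-up word-count vectors. The three "documents" below are hypothetical and exist only to illustrate the point; the function names are also assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical counts of two terms in three documents.
short_doc = (1, 2)
long_doc = (10, 20)  # same proportions as short_doc, ten times longer
other_doc = (2, 1)   # similar length to short_doc, different emphasis

print("cosine   short vs long :", cosine_similarity(short_doc, long_doc))
print("cosine   short vs other:", cosine_similarity(short_doc, other_doc))
print("euclid   short vs long :", euclidean_distance(short_doc, long_doc))
print("euclid   short vs other:", euclidean_distance(short_doc, other_doc))
```

Cosine similarity rates `short_doc` and `long_doc` as nearly identical because they point in the same direction, while Euclidean distance places `short_doc` closer to `other_doc` because their magnitudes are similar.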

f) What does the Jaccard distance account for that the Manhattan distance doesn't?


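The Jaccard distance measures dissimilarity between finite sets: it is one minus the ratio of shared elements to the total number of distinct elements across both sets. Unlike the Manhattan distance, it accounts for how much the two sets overlap relative to their combined size, and it ignores elements that are absent from both. The Manhattan distance sums the absolute coordinate differences between two points and has no notion of overlap: for binary membership vectors, two large sets that differ in two elements get the same distance as two small sets that differ in two elements. Jaccard is therefore better suited to comparing sets, while Manhattan measures geometric or spatial distance between vectors.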
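
As an illustration, the sketch below compares two pairs of binary vectors that have the same Manhattan distance but different Jaccard distances. The vectors and function names are hypothetical and chosen only for this example.

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


def jaccard_distance(u, v):
    """Jaccard distance between binary vectors, read as sets of active positions."""
    intersection = sum(1 for x, y in zip(u, v) if x and y)
    union = sum(1 for x, y in zip(u, v) if x or y)
    # Two empty sets are treated as identical (an assumption for this sketch).
    return 1 - intersection / union if union else 0.0


# Small sets: 1 shared element out of 3 distinct elements.
small_a = [1, 1, 0, 0, 0, 0, 0, 0]
small_b = [1, 0, 1, 0, 0, 0, 0, 0]

# Large sets: 6 shared elements out of 8 distinct elements.
large_a = [1, 1, 1, 1, 1, 1, 1, 0]
large_b = [1, 1, 1, 1, 1, 1, 0, 1]

print("Manhattan:", manhattan_distance(small_a, small_b), manhattan_distance(large_a, large_b))
print("Jaccard:  ", jaccard_distance(small_a, small_b), jaccard_distance(large_a, large_b))
```

Both pairs differ in exactly two positions, so Manhattan cannot tell them apart, whereas Jaccard reports the large, mostly overlapping pair as much closer.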

#### Part 2

Consider the following two sentences:

In [42]:
s1 = "hello my name is Alice"  
s2 = "hello my name is Bob"

Using the union of words from both sentences, we can represent each sentence as a vector. Each element of the vector represents the presence or absence of the word at that index.

In this example, the union of words is ("hello", "my", "name", "is", "Alice", "Bob") so we can represent the above sentences as such:

In [43]:
v1 = [1,    1, 1,   1, 1,    0]
#     hello my name is Alice
v2 = [1,    1, 1,   1, 0, 1]
#     hello my name is    Bob

Programmatically, we can do the following:

In [44]:
corpus = [s1, s2]
all_words = list(set([item for x in corpus for item in x.split()]))
print(all_words)
v1 = [1 if x in s1.split() else 0 for x in all_words]  # split() so whole words are matched, not substrings
print(v1)

['name', 'my', 'Alice', 'hello', 'is', 'Bob']
[1, 1, 1, 1, 1, 0]
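
One caveat when building such vectors: the `in` operator applied to a string tests for substrings, while applied to the list returned by `split()` it tests for whole words. The two agree for the sentences in this worksheet, but they can differ in general. The sentence below is hypothetical and chosen only to show the difference.

```python
sentence = "this name"  # hypothetical sentence, not from the worksheet
word = "is"

substring_match = word in sentence       # True: "is" occurs inside "this"
word_match = word in sentence.split()    # False: "is" is not one of the words

print(substring_match, word_match)
```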


Let's add a new sentence to our corpus:

In [45]:
s3 = "hi my name is Claude"
corpus.append(s3)

a) What is the new union of words used to represent s1, s2, and s3?

In [46]:
all_words = list(set([item for x in corpus for item in x.split()]))
print(f"a) New union of words: {all_words}")



a) New union of words: ['name', 'my', 'Alice', 'Claude', 'hello', 'is', 'Bob', 'hi']


b) Represent s1, s2, and s3 as vectors as above, using this new set of words.

In [47]:
# split() so that whole words are matched rather than substrings
v1 = [1 if x in s1.split() else 0 for x in all_words]
v2 = [1 if x in s2.split() else 0 for x in all_words]
v3 = [1 if x in s3.split() else 0 for x in all_words]
print(f"b) Vector representations:\ns1: {v1}\ns2: {v2}\ns3: {v3}")

b) Vector representations:
s1: [1, 1, 1, 0, 1, 1, 0, 0]
s2: [1, 1, 0, 0, 1, 1, 1, 0]
s3: [1, 1, 0, 1, 0, 1, 0, 1]


c) Write a function that computes the Manhattan distance between two vectors. Which pair of vectors are the most similar under that distance function?

In [48]:
def manhattan_distance(v1, v2):
    return sum(abs(x - y) for x, y in zip(v1, v2))

distances = {
    "s1-s2": manhattan_distance(v1, v2),
    "s1-s3": manhattan_distance(v1, v3),
    "s2-s3": manhattan_distance(v2, v3),
}
print(f"c) Manhattan distances between vectors: {distances}")

c) Manhattan distances between vectors: {'s1-s2': 2, 's1-s3': 4, 's2-s3': 4}

s1 and s2 are the most similar pair under the Manhattan distance: their distance is 2, compared with 4 for each of the other two pairs.


d) Create a matrix of all these vectors (row major) and add the following sentences in vector form:

- "hi Alice"
- "hello Claude"
- "Bob my name is Claude"
- "hi Claude my name is Alice"
- "hello Bob"

In [49]:
new_sentences = [
    "hi Alice",
    "hello Claude",
    "Bob my name is Claude",
    "hi Claude my name is Alice",
    "hello Bob"
]

# Update corpus and recalculate the union of words
corpus.extend(new_sentences)
all_words = list(set(word for sentence in corpus for word in sentence.split()))

# Represent all sentences as vectors
matrix = [[1 if word in sentence.split() else 0 for word in all_words] for sentence in corpus]
print(f"d) matrix:\n{matrix}")

d) matrix:
[[1, 1, 1, 0, 1, 1, 0, 0], [1, 1, 0, 0, 1, 1, 1, 0], [1, 1, 0, 1, 0, 1, 0, 1], [0, 0, 1, 0, 0, 0, 0, 1], [0, 0, 0, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 1, 1, 0], [1, 1, 1, 1, 0, 1, 0, 1], [0, 0, 0, 0, 1, 0, 1, 0]]


e) How many rows and columns does this matrix have?

In [50]:
rows = len(matrix)
cols = len(matrix[0]) if matrix else 0
print(f"e) Matrix dimensions: {rows} rows, {cols} columns")

e) Matrix dimensions: 8 rows, 8 columns


f) When using the Manhattan distance, which two sentences are the most similar?

In [51]:
from itertools import combinations

pairs = list(combinations(range(len(matrix)), 2))

distances_pairs = {f"s{i+1}-s{j+1}": manhattan_distance(matrix[i], matrix[j]) for i, j in pairs}

min_distance_pair = min(distances_pairs, key=distances_pairs.get)

print(f"f) Most similar sentences based on Manhattan distance: {min_distance_pair} with distance {distances_pairs[min_distance_pair]}")

f) Most similar sentences based on Manhattan distance: s3-s7 with distance 1
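
Because a Python `set` does not guarantee a stable word order across runs, the column order of the matrix can vary. The sketch below repeats the whole computation with a sorted vocabulary so the result is reproducible. The names `vocabulary`, `vectorize`, and `best_pair` are assumptions introduced here; the pairwise distances do not depend on the column order.

```python
from itertools import combinations

sentences = [
    "hello my name is Alice",
    "hello my name is Bob",
    "hi my name is Claude",
    "hi Alice",
    "hello Claude",
    "Bob my name is Claude",
    "hi Claude my name is Alice",
    "hello Bob",
]

# Sorting fixes the column order; a bare set() may order words differently between runs.
vocabulary = sorted({word for s in sentences for word in s.split()})


def vectorize(sentence, vocabulary):
    """Binary presence vector of a sentence over the given vocabulary."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


matrix = [vectorize(s, vocabulary) for s in sentences]

# Distance for every unordered pair of sentences, keyed by 1-based sentence numbers.
distances = {
    (i + 1, j + 1): manhattan_distance(matrix[i], matrix[j])
    for i, j in combinations(range(len(matrix)), 2)
}
best_pair = min(distances, key=distances.get)

print(vocabulary)
print(f"Most similar: s{best_pair[0]}-s{best_pair[1]} with distance {distances[best_pair]}")
```

This agrees with the result above: s3 ("hi my name is Claude") and s7 ("hi Claude my name is Alice") differ only in the word "Alice".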


#### Part 3 Challenge