## __Jaccard Distance__

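The Jaccard distance measures how dissimilar two sets are. It is the complement of the Jaccard similarity, which is the size of the intersection divided by the size of the union.
The formula for the Jaccard distance is as follows.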


<center><img src="./images/Jaccard.png" /></center>

In [4]:
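def jaccard_distance(set1, set2):
    # Convert the inputs to sets so that duplicates are ignored.
    set1 = set(set1)
    set2 = set(set2)

    intersect = len(set1.intersection(set2))
    union = len(set1.union(set2))

    # Two empty inputs would cause a division by zero; treat them as identical here.
    if union == 0:
        return 0.0

    return 1 - (intersect/union)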


set1 = [1,2,3,4,5]
set2 = [4,5,6,7,8]

jaccard_distance(set1, set2)

0.75
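
To see where 0.75 comes from, the intermediate sets can be inspected directly. The following is a minimal sketch using Python's built-in set operators; the variable names are illustrative and not part of the function above.

```python
# Same example lists as above, converted to sets.
a = set([1, 2, 3, 4, 5])
b = set([4, 5, 6, 7, 8])

common = a & b      # elements present in both sets
combined = a | b    # elements present in either set

print(sorted(common))    # [4, 5]
print(len(combined))     # 8

# Jaccard similarity is |intersection| / |union|; the distance is its complement.
similarity = len(common) / len(combined)
distance = 1 - similarity
print(similarity, distance)   # 0.25 0.75
```

Only 2 of the 8 distinct elements are shared, so the similarity is 0.25 and the distance is 0.75.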

The implementation above follows the original set-based definition. Depending on the context, the idea can be adapted as shown below.

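> Assume there are 2 movies, A and B.
* A = [1, 1, 1, 0, 0, 0]
* B = [1, 1, 0, 1, 0, 0]
>Assume each position in these vectors records one person's action for that movie (1 = bought, 0 = not bought). From a user-behaviour perspective, the same value in the same position of the 2 vectors indicates some form of similarity, so we can measure similarity as the fraction of positions that match. Note that this count also treats shared 0s as matches, which strictly makes it the simple matching coefficient; the Jaccard similarity of binary vectors ignores positions where both values are 0.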


In [7]:
def jaccard_similarity_s(vector1, vector2):
    count = 0
    for i in range(len(vector1)):
        if(vector1[i]==vector2[i]):
            count += 1
    
    return count/len(vector1)

A = [1, 1, 1, 0, 0, 0]
B = [1, 1, 0, 1, 0, 0]

jaccard_similarity_s(A, B)

0.6666666666666666
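
For comparison, here is a hedged sketch of the Jaccard similarity as it is usually defined for binary vectors, where positions in which both values are 0 are ignored. The function name `binary_jaccard_similarity` is hypothetical and introduced only for this illustration.

```python
def binary_jaccard_similarity(vector1, vector2):
    # Positions where both users acted (1 and 1).
    both = sum(1 for a, b in zip(vector1, vector2) if a == 1 and b == 1)
    # Positions where at least one user acted.
    either = sum(1 for a, b in zip(vector1, vector2) if a == 1 or b == 1)

    # With no 1s at all there is nothing to compare; return 0.0 by convention here.
    if either == 0:
        return 0.0
    return both / either

A = [1, 1, 1, 0, 0, 0]
B = [1, 1, 0, 1, 0, 0]

print(binary_jaccard_similarity(A, B))  # 0.5
```

Here the two shared 0s no longer count as agreement, so the score drops from about 0.67 to 0.5. Which variant fits better depends on whether "neither person bought it" should count as evidence of similarity.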

## __L1 and L2 norm Distances__

Assume there are 2 users, X and Y, who rate the same items on a 1-10 scale to show their preferences. We want to measure how similar the 2 users are based on their ratings.

>For an item, rating distance (difference) = |X's rating - Y's rating|
>
>For convenience, a distance can be converted into a similarity score in the range (0, 1] using: 1/(distance + 1)

Let's say we have measured the rating distance for each item based on X's and Y's ratings. We can add up these distances and treat the sum as the overall distance between the 2 users, since the per-item distances reflect how similarly they rate items. This is called the `Manhattan distance` or `L1-Norm Distance`.

In [25]:
def l1_norm_distance(rating_user1, rating_user2):
    '''
    If possible, normalize the distance output based on the rating values.
    eg: if rating max = 10 then we can return distance as
       (distance/(10*len(rating_user1)))
    '''

    distance = 0
    for i in range(len(rating_user1)):
        distance += abs(rating_user1[i]-rating_user2[i])
    
    return distance

rating_u1 = [0,0,0,0,0]
rating_u2 = [0,0,0,0,0]

l1_norm_distance(rating_u1, rating_u2)

0
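
The two adjustments mentioned above, scaling the distance by the largest possible total and converting it with 1/(distance + 1), can be sketched as follows. The helper names and the `max_diff` parameter are assumptions made for this example; on a 1-10 scale the largest per-item difference would be 9.

```python
def l1_distance(ratings1, ratings2):
    # Sum of absolute per-item rating differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(ratings1, ratings2))

def normalized_l1_distance(ratings1, ratings2, max_diff):
    # Scale into [0, 1] by dividing by the largest possible total distance.
    # max_diff is the largest possible difference for a single item.
    return l1_distance(ratings1, ratings2) / (max_diff * len(ratings1))

def distance_to_similarity(distance):
    # Maps distance 0 to similarity 1; larger distances approach 0.
    return 1 / (distance + 1)

x = [7, 3, 9]
y = [5, 3, 10]

d = l1_distance(x, y)
print(d)                                  # 3
print(normalized_l1_distance(x, y, 9))    # 3 / 27, roughly 0.111
print(distance_to_similarity(d))          # 0.25
```

The normalized distance is convenient when users have rated different numbers of items, because the raw sum grows with the length of the rating lists.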

The L2 Norm Distance is very similar to the L1 Norm Distance, except that instead of summing the absolute rating differences we sum their squares and then take the square root of the total. It is also known as the `Euclidean Distance`.

In [28]:
def l2_norm_distance(rating_user1, rating_user2):
    '''
    If possible, normalize the distance output based on the rating values.
    '''

    squared_sum = 0
    for i in range(len(rating_user1)):
        squared_sum += (rating_user1[i]-rating_user2[i])**2

    return squared_sum**0.5

rating_u1 = [0,0,0,0,0]
rating_u2 = [5,5,5,5,5]

l2_norm_distance(rating_u1, rating_u2)

11.180339887498949
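
As a cross-check, the same value can be obtained from the standard library. This sketch assumes Python 3.8 or later, where `math.dist` is available; it is shown as an alternative to the loop above, not as something the original code relies on.

```python
import math

rating_u1 = [0, 0, 0, 0, 0]
rating_u2 = [5, 5, 5, 5, 5]

# Each of the 5 items differs by 5, so the squared sum is 5 * 25 = 125.
squared_sum = sum((a - b) ** 2 for a, b in zip(rating_u1, rating_u2))
print(squared_sum)    # 125

# math.dist computes the Euclidean distance between two points.
library_distance = math.dist(rating_u1, rating_u2)
print(math.isclose(library_distance, math.sqrt(squared_sum)))    # True
```

Squaring makes a single large disagreement weigh more than several small ones, which is the main practical difference from the L1 distance.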

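## __Cosine Similarity__

Cosine similarity treats each user's ratings as a vector and measures the cosine of the angle between the 2 vectors. With `adjusted=True`, each user's average rating is subtracted first, so that users who rate generously and users who rate strictly become comparable.

In [24]: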
def cosine_similarity(user_rating1, user_rating2, adjusted = False):
    '''
    A rating of 0 means "not rated".
    The cosine is undefined for an all-zero vector; 0.0 is returned in that case.
    '''
    from math import sqrt

    if(not adjusted):
        product = 0
        r1_squared = 0
        r2_squared = 0

        for i in range(len(user_rating1)):
            product += user_rating1[i]*user_rating2[i]
            r1_squared += user_rating1[i]**2
            r2_squared += user_rating2[i]**2

        # Guard against division by zero when a vector has no ratings.
        if(r1_squared==0 or r2_squared==0):
            return 0.0
        cosine = product/(sqrt(r1_squared)*sqrt(r2_squared))
        return cosine

    else:

        # Average each user's ratings over the items that user has rated.
        user1_avg = 0
        user2_avg = 0
        user1_count = 0
        user2_count = 0
        for i in range(len(user_rating1)):
            if(user_rating1[i]!=0):
                user1_avg += user_rating1[i]
                user1_count += 1
            if(user_rating2[i]!=0):
                user2_avg += user_rating2[i]
                user2_count += 1

        # Without any ratings there is no average to subtract.
        if(user1_count==0 or user2_count==0):
            return 0.0

        user1_avg = user1_avg/user1_count
        user2_avg = user2_avg/user2_count

        normalized_rating1 = [i-user1_avg if (i!=0) else 0 for i in user_rating1]
        normalized_rating2 = [i-user2_avg if (i!=0) else 0 for i in user_rating2]

        diff1Sq = 0
        diff2Sq = 0
        diffMul = 0

        # The only difference between Adjusted Cosine and Pearson's similarity is in this part.
        # Here we consider all the rated items regardless of whether both users have rated them.
        # If one user has not rated an item, its normalized rating is zero.
        for i in range(len(user_rating1)):
            u1 = normalized_rating1[i]
            u2 = normalized_rating2[i]
            diff1Sq += u1**2
            diff2Sq += u2**2
            diffMul += u1*u2

        # Guard against division by zero when a user gave every item the same rating.
        if(diff1Sq==0 or diff2Sq==0):
            return 0.0

        cosine = diffMul/(sqrt(diff1Sq)*sqrt(diff2Sq))

        return cosine

rating_u1 = [4,5,4,0,3,3]
rating_u2 = [3,3,3,2,4,5]

round(cosine_similarity(rating_u1, rating_u2, adjusted=True), 4)


-0.6211
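
To illustrate the non-adjusted branch, the sketch below computes the plain cosine for the same 2 rating vectors step by step. The helper names `dot` and `norm` are introduced here for illustration only.

```python
from math import sqrt

def dot(v1, v2):
    # Sum of element-wise products.
    return sum(a * b for a, b in zip(v1, v2))

def norm(v):
    # Euclidean length of the vector.
    return sqrt(dot(v, v))

rating_u1 = [4, 5, 4, 0, 3, 3]
rating_u2 = [3, 3, 3, 2, 4, 5]

print(dot(rating_u1, rating_u2))    # 66
print(dot(rating_u1, rating_u1))    # 75
print(dot(rating_u2, rating_u2))    # 72

plain_cosine = dot(rating_u1, rating_u2) / (norm(rating_u1) * norm(rating_u2))
print(round(plain_cosine, 4))       # 0.8981
```

Because every rating is non-negative, the plain cosine comes out high even when the 2 users favour different items. Subtracting each user's average first, as the adjusted branch does, removes this offset.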

### Pearson's correlation coefficient

A statistical measure of the linear relationship between 2 variables. It indicates how closely the 2 variables move together relative to their own averages.

>eg:- In a rating context, if 2 users' ratings are similar the correlation coefficient would be close to 1, while if one likes what the other dislikes the coefficient would be close to -1. A value of 0 indicates no linear relationship at all.
<p></p>
<center><img src="./images/pearson_coefficient.png"/></center>

<p>
In simple terms, the implementation is as follows.

1. Calculate each user's average rating over the items that user has rated
2. Normalize the ratings by subtracting that average (to account for different user rating patterns)
3. Put the normalized ratings into the formula to calculate the similarity

</p>


>### The only difference between the adjusted cosine and Pearson’s correlation is that the Pearson function uses only the items that both users have rated, while the cosine similarity function uses all rated items even if only one user has rated them, setting the rating to zero when one of the users hasn’t rated an item.

In [42]:
def pearson_similarity(user1_ratings, user2_ratings):
    from math import sqrt

    user1_avg = 0
    user2_avg = 0
    user1_count = 0
    user2_count = 0
    for i in range(len(user1_ratings)):
        if(user1_ratings[i]!=0):
            user1_avg += user1_ratings[i]
            user1_count += 1
        if(user2_ratings[i]!=0):
            user2_avg += user2_ratings[i]
            user2_count += 1

    user1_avg = user1_avg/user1_count
    user2_avg = user2_avg/user2_count

    normalized_rating1 = [i-user1_avg if (i!=0) else 0 for i in user1_ratings]
    normalized_rating2 = [i-user2_avg if (i!=0) else 0 for i in user2_ratings]

    diff1Sq = 0
    diff2Sq = 0
    diffMul = 0

    # Here we only consider the items that have been rated by both users.
    for i in range(len(user1_ratings)):
        if(user1_ratings[i]!=0 and user2_ratings[i]!=0):
            u1 = normalized_rating1[i]
            u2 = normalized_rating2[i]
            diff1Sq += u1**2
            diff2Sq += u2**2
            diffMul += u1*u2

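    # The coefficient is undefined when either user's co-rated deviations are all zero.
    if(diff1Sq==0 or diff2Sq==0):
        return 0.0

    pearson = diffMul/(sqrt(diff1Sq)*sqrt(diff2Sq))

    return pearson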

rating_u1 = [4,5,4,0,3,3]
rating_u2 = [3,3,3,2,4,5]

pearson_similarity(rating_u1, rating_u2)

-0.7606388292556647
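
Note that the function above centers each user's ratings on the average of all items that user has rated. A common textbook variant instead computes the averages over the co-rated items only. The sketch below shows that variant for comparison; the function name `pearson_corated` is hypothetical, and returning 0.0 for undefined cases is a choice made for this example.

```python
from math import sqrt

def pearson_corated(ratings1, ratings2):
    # Keep only the items both users have rated (0 means "not rated").
    pairs = [(a, b) for a, b in zip(ratings1, ratings2) if a != 0 and b != 0]
    if not pairs:
        return 0.0

    # Averages are taken over the co-rated items only.
    mean1 = sum(a for a, _ in pairs) / len(pairs)
    mean2 = sum(b for _, b in pairs) / len(pairs)

    num = sum((a - mean1) * (b - mean2) for a, b in pairs)
    den1 = sum((a - mean1) ** 2 for a, _ in pairs)
    den2 = sum((b - mean2) ** 2 for _, b in pairs)

    # Undefined when either user gave identical ratings to all co-rated items.
    if den1 == 0 or den2 == 0:
        return 0.0
    return num / (sqrt(den1) * sqrt(den2))

rating_u1 = [4, 5, 4, 0, 3, 3]
rating_u2 = [3, 3, 3, 2, 4, 5]

print(round(pearson_corated(rating_u1, rating_u2), 4))  # -0.8018
```

For these ratings the second user's average changes from about 3.33 (all 6 items) to 3.6 (the 5 co-rated items), which is why the result differs slightly from the value above.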