<hr/>

# Question 2

<hr/>

Implement a function that calculates and returns a measure of the similarity between 2
lists of items. The output of the function should be bounded between 0 and 1 such that
if 1 is returned, it means the 2 lists are equal and if 0 is returned, the 2 lists are
completely different.

Scenarios : 

(1) One that does an item-wise comparison and takes order of items into account  
(2) Same as above, but without taking order into account   
(3) One that looks at the lists globally   

To be tested on the following pairs of lists: 
```
['A','b','c','d'] & ['e', 'F', 'G', 'h']
['a','B','c','d'] & ['d', 'C', 'A', 'H']
['A','B','C','D'] & ['a', 'b', 'c', 'd']
['banana','Orange','pear','Avocado'] & ['mango', 'strawberry', 'apple', 'orange']
['D','a','B','b', 'e', 'L'] & ['B', 'B', 'l', 'e', 'a', 'd']
['A','B','E','D']& ['a', 'b', 'c']
```

Conditions : 

(1) The measures cannot be equality (i.e., item 1 == item 2) or length (i.e., len(item 1) == len(item 2))  
(2) The lists do not need to have the same length and the measure should reflect this  
(3) Your implementation must be case-insensitive  



<hr/>

# Solution (Cosine Similarity)

<hr/>


We can use cosine similarity to measure the similarity between two lists of items or between a pair of items.   
For the non-negative count vectors used here, it returns a value between 0 and 1, depending on the angle between the two vectors.  


Let us say the angle between two vectors is theta degrees:  

    (1) theta= 90 degrees ; cos(90)=0   
        (smaller cosine value implies larger distance between vectors) 
    (2) theta= 0 degrees ; cos(0) = 1  
        (larger cosine value implies smaller distance between vectors)
      
In a nutshell, <b>a higher cosine value implies higher similarity between terms.</b>
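
The two angle cases above can be checked numerically. The following is a minimal sketch in plain Python; the helper name `cosine` and the choice to return 0 for an all-zero vector are assumptions made for illustration and are not part of the solution below, which relies on scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector shares nothing with any other
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [0, 1]))  # perpendicular vectors (theta = 90 degrees)
print(cosine([1, 2], [2, 4]))  # parallel vectors (theta = 0 degrees)
```

The first call gives 0 because the dot product of perpendicular vectors is 0; the second gives a value equal to 1 up to floating-point rounding.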


Note:   
1. The functions below are defined globally, so they can be reused to return output for all three scenarios   
2. Exceptions are not handled to keep the solution simple

<hr/>

# Define global functions to test scenarios

<hr/>

In [19]:
import random
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
random.seed(7)

def cosdis(a, b):
    """
    
    Implement the cosine similarity function 

    Arguments:
    a -- list of items or item 
    b -- list of items or item to compare 

    Return:
    res -- similarity measure between 0 & 1  
    
    """
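    # Count case-insensitive occurrences (characters of a string, or items of a list)
    c1 = Counter(map(str.lower, a))
    c2 = Counter(map(str.lower, b))
    # Union of all terms seen in either input
    words = list(c1.keys() | c2.keys())
    # Frequency vectors aligned on the shared set of terms
    v1 = [c1.get(word, 0) for word in words]
    v2 = [c2.get(word, 0) for word in words]
    res = cosine_similarity([v1], [v2]).flatten()[0]
    return round(res, 4)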


def word2word(list_a, list_b):
    """ 
    
    Compare the lists item by item, pairing items by position (order matters).
    zip stops at the end of the shorter list, so surplus items are not compared.

    Arguments:
    list_a -- list of items as parent
    list_b -- list of items to compare as children

    Return:
    None -- prints the similarity measure for each positional pair of items
    
    """
    for x, y in zip(list_a, list_b):
        res = cosdis( x , y)
        print(f'The cosine similarity between : {x} and : {y} is: {res}')


def word2many(list_a, list_b): 
    """ 
    
    Compare every item in list_a against every item in list_b (order is ignored)

    Arguments:
    list_a -- list of items as parent
    list_b -- list of items to compare as children

    Return:
    None -- prints the similarity measure for each item of parent against all items of children 
    
    """
    for key in list_a:
        for word in list_b:
            res = cosdis(word, key)
            print(f'The cosine similarity between : {word} and : {key} is: {res}')


def list2list(list_a, list_b):
    """ 
    
    Compare the two lists as a whole, using the counts of their items

    Arguments:
    list_a -- list of items as parent
    list_b -- list of items to compare as children

    Return:
    None -- prints the similarity measure for the pair of lists
    
    """
    res = cosdis(list_a, list_b)
    print(f'The cosine similarity between : {list_a} and : {list_b} is: {res}')



def testCase(list_a, list_b): 
    """
    
    Run test case for any chosen pair of test lists as list_a & list_b
    
    Return:
    res --  prints results for all three test case scenarios 
    
    """
    print("\n" + "========= word to word =================" + "\n")
    word2word(list_a, list_b)
    print("\n" + " ========== word to many =============="+ "\n")
    word2many(list_a, list_b)
    print("\n" + "========== list to list ================"+ "\n")
    list2list(list_a, list_b)


# initialise test-cases

one = ['A','b','c','d'] 
two = ['e', 'F', 'G', 'h']

three = ['a','B','c','d'] 
four = ['d', 'C', 'A', 'H']

five = ['A','B','C','D'] 
six = ['a', 'b', 'c', 'd']

seven = ['banana','Orange','pear','Avocado'] 
eight = ['mango', 'strawberry', 'apple', 'orange']

nine= ['D','a','B','b', 'e', 'L'] 
ten= ['B', 'B', 'l', 'e', 'a', 'd']

eleven= ['A','B','E','D'] 
twelve= ['a', 'b', 'c']

<hr/>

# Run Test Cases (Result)

<hr/>

In [20]:

testCase(seven, eight)  # any pair of test lists can be passed here



The cosine similarity between : banana and : mango is: 0.5976
The cosine similarity between : Orange and : strawberry is: 0.5103
The cosine similarity between : pear and : apple is: 0.7559
The cosine similarity between : Avocado and : orange is: 0.4924


The cosine similarity between : mango and : banana is: 0.5976
The cosine similarity between : strawberry and : banana is: 0.2673
The cosine similarity between : apple and : banana is: 0.303
The cosine similarity between : orange and : banana is: 0.5455
The cosine similarity between : mango and : Orange is: 0.7303
The cosine similarity between : strawberry and : Orange is: 0.5103
The cosine similarity between : apple and : Orange is: 0.3086
The cosine similarity between : orange and : Orange is: 1.0
The cosine similarity between : mango and : pear is: 0.2236
The cosine similarity between : strawberry and : pear is: 0.625
The cosine similarity between : apple and : pear is: 0.7559
The cosine similarity between : orange and : pear is: 0
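
The run above prints one score per pair of items. If a single number between 0 and 1 is wanted for the item-wise, order-aware scenario, one possible aggregation is sketched below. The names `cosine_counts` and `ordered_score` are hypothetical, and dividing by the length of the longer list (so that unmatched positions count as 0) is an assumption about how a length difference should be penalised, not something the solution above prescribes.

```python
import math
from collections import Counter

def cosine_counts(a, b):
    """Case-insensitive cosine similarity between the term counts of a and b."""
    c1 = Counter(x.lower() for x in a)
    c2 = Counter(x.lower() for x in b)
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def ordered_score(list_a, list_b):
    """Hypothetical aggregate: mean positional similarity over the longer length."""
    longest = max(len(list_a), len(list_b))
    if longest == 0:
        return 0.0  # assumption: two empty lists are given no score
    total = sum(cosine_counts(x, y) for x, y in zip(list_a, list_b))
    return total / longest  # positions missing from the shorter list add 0

print(ordered_score(['A', 'B', 'C', 'D'], ['a', 'b', 'c', 'd']))
print(ordered_score(['A', 'B', 'E', 'D'], ['a', 'b', 'c']))
```

For `['A','B','E','D']` against `['a','b','c']` this gives 0.5: two of the three compared positions match, and the total is divided by the longer length of four.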

#### <i>end of solution</i>

<hr/>

####  Appendix

The working of the cosine function can be explained with the following steps.  

As an example, consider comparing two strings, "banana" and "mango". 

<ins>step 1</ins>: calculate the term (character) frequencies of each string using `Counter`  

```
Counter('banana') => {'a': 3, 'n': 2, 'b': 1}
Counter('mango') => {'m': 1, 'a': 1, 'n': 1, 'g': 1, 'o': 1} 
```

<ins>step 2</ins>: create the union set of terms across both strings (or documents); the order of a set is arbitrary 

```
words => ['o', 'n', 'a', 'm', 'b', 'g']
```
<ins>step 3</ins>: Vectorize against union set 

```
vector for banana => [0, 2, 3, 0, 1, 0]
vector for mango => [1, 1, 1, 1, 0, 1]
```

<ins>step 4</ins>: Calculate cosine similarity

```
cosine_similarity([a_vect], [b_vect]) => 0.5976143
```
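
The four steps can be reproduced without scikit-learn. The sketch below is an illustration only: it sorts the union set so the vector order is reproducible (the solution uses an unordered set, so its order may differ) and computes the cosine by hand with `math.sqrt`.

```python
import math
from collections import Counter

a, b = "banana", "mango"

# Step 1: character frequencies, lower-cased as in cosdis
c1, c2 = Counter(a.lower()), Counter(b.lower())

# Step 2: union of characters; sorted only to make the order reproducible
words = sorted(c1.keys() | c2.keys())

# Step 3: one count vector per string, aligned on the union
v1 = [c1.get(w, 0) for w in words]
v2 = [c2.get(w, 0) for w in words]

# Step 4: cosine = dot product divided by the product of the vector lengths
dot = sum(x * y for x, y in zip(v1, v2))
norm1 = math.sqrt(sum(x * x for x in v1))
norm2 = math.sqrt(sum(y * y for y in v2))
similarity = dot / (norm1 * norm2)

print(words)                 # ['a', 'b', 'g', 'm', 'n', 'o']
print(v1)                    # [3, 1, 0, 0, 2, 0]
print(v2)                    # [1, 0, 1, 1, 1, 1]
print(round(similarity, 4))  # 0.5976
```

The dot product is 5 (3 from 'a' plus 2 from 'n') and the vector lengths are the square roots of 14 and 5, so the result is 5 divided by the square root of 70, which matches the value printed for banana and mango in the test run.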