## Fetch Rewards Coding Exercise - Text Similarity

The sample inputs are as follows:
<br>
Sample 1
The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'll get points based on the cost of the products. You don't need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'll find the savings for you.

Sample 2
The easiest way to earn points with Fetch Rewards is to just shop for the items you already buy. If you have any eligible brands on your receipt, you will get points based on the total cost of the products. You do not need to cut out any coupons or scan individual UPCs. Just scan your receipt after you check out and we will find the savings for you.

Sample 3
We are always looking for opportunities for you to earn more points, which is why we also give you a selection of Special Offers. These Special Offers are opportunities to earn bonus points on top of the regular points you earn every time you purchase a participating brand. No need to pre-select these offers, we'll give you the points whether or not you knew about the offer. We just think it is easier that way.

The code proceeds in the following steps:
1. Reading/loading the text
2. Converting it to lowercase
3. Tokenizing the words
4. Removing the stop words
5. Removing the punctuation
6. Computing document similarity with two different methods: Jaccard similarity and cosine similarity
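
For reference, the two measures named in step 6 can be written as follows. The notation is introduced here for illustration: $A$ and $B$ are the sets of unique tokens in the two documents, and $a_w$, $b_w$ are the number of times token $w$ occurs in each.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
\cos(a, b) = \frac{\sum_{w} a_w \, b_w}{\sqrt{\sum_{w} a_w^2} \, \sqrt{\sum_{w} b_w^2}}
$$

Jaccard similarity ignores how often a token occurs, while cosine similarity weights each token by its count.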

In [1]:
import math
import string
import re
from collections import Counter

In [2]:
doc1 = "The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'll get points based on the cost of the products. You don't need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'll find the savings for you."
doc2 = "The easiest way to earn points with Fetch Rewards is to just shop for the items you already buy. If you have any eligible brands on your receipt, you will get points based on the total cost of the products. You do not need to cut out any coupons or scan individual UPCs. Just scan your receipt after you check out and we will find the savings for you."
doc3 = "We are always looking for opportunities for you to earn more points, which is why we also give you a selection of Special Offers. These Special Offers are opportunities to earn bonus points on top of the regular points you earn every time you purchase a participating brand. No need to pre-select these offers, we'll give you the points whether or not you knew about the offer. We just think it is easier that way."
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]


In [3]:
def preProcessing(doc1, doc2):
    # Convert all characters to lowercase
    doc1 = doc1.lower()
    doc2 = doc2.lower()

    # Tokenize the words using the hand-written tokkenize function
    doc1 = tokkenize(doc1)
    doc2 = tokkenize(doc2)

    # Filter the stopwords out of the tokenized words
    filtered_doc1 = [word for word in doc1 if word not in stopwords]
    filtered_doc2 = [word for word in doc2 if word not in stopwords]

    # Remove the punctuation from the filtered documents
    remove_punct_doc1 = [''.join(c for c in s if c not in string.punctuation) for s in filtered_doc1]
    remove_punct_doc2 = [''.join(c for c in s if c not in string.punctuation) for s in filtered_doc2]

    return remove_punct_doc1, remove_punct_doc2

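def tokkenize(doc):
    # Split the document on single spaces
    words = doc.split(' ')

    # Each word is split again, so every token comes back wrapped in its own one-element list
    words_in_sentences = [sentence.split(' ') for sentence in words]
    return words_in_sentences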

def calculate_jaccard(word_tokens1, word_tokens2):
    # Jaccard similarity: shared unique tokens divided by all unique tokens
    set1, set2 = set(word_tokens1), set(word_tokens2)
    union = set1 | set2
    intersection = set1 & set2

    # Two empty documents give an empty union; return 0.0 instead of dividing by zero
    if not union:
        return 0.0
    return len(intersection) / len(union)

def get_cosine(vec1, vec2):
    # Dot product over the tokens that appear in both count vectors
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the Euclidean norms of the two count vectors
    sum1 = sum(vec1[x] ** 2 for x in vec1)
    sum2 = sum(vec2[x] ** 2 for x in vec2)
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero norm; return 0.0 instead of dividing by zero
    if not denominator:
        return 0.0
    return numerator / denominator


def text_to_vector(text):
    return Counter(text)
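
One detail of `tokkenize` is worth noting. Because each word is split a second time, it returns a list of one-element lists, not a flat list of strings. In `preProcessing`, the stopword filter then compares a list against strings, and the punctuation step tests whole words instead of single characters, so neither step removes anything from the sample texts. Below is a minimal sketch of a flat variant, under stated assumptions: the names `tokenize_flat` and `preprocess_flat` are hypothetical, and the short `STOPWORDS` set is an illustrative stand-in for the full list above.

```python
import string

# Illustrative subset only (an assumption); the full stopword list above could be used instead.
STOPWORDS = {"the", "to", "you", "your", "for", "and", "on", "of", "is", "just", "you'll"}

def tokenize_flat(doc):
    # str.split() with no argument splits on any run of whitespace
    # and returns a flat list of strings.
    return doc.split()

def preprocess_flat(doc, stopwords=STOPWORDS):
    tokens = tokenize_flat(doc.lower())
    # Same order as the steps listed earlier: drop stopwords, then strip punctuation
    kept = [t for t in tokens if t not in stopwords]
    table = str.maketrans('', '', string.punctuation)
    stripped = [t.translate(table) for t in kept]
    # Discard tokens that consisted only of punctuation
    return [t for t in stripped if t]

print(preprocess_flat("Just scan your receipt, and you'll earn points!"))
# ['scan', 'receipt', 'earn', 'points']
```

Switching to a flat tokenizer like this would change the similarity scores, so the recorded results would need to be recomputed before comparing them.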
    


In [4]:
document1, document2 = preProcessing(doc1, doc2)
print('Jaccard Similarity Between Sample First and Second: ', calculate_jaccard(document1, document2))
print('Cosine Similarity Between Sample First and Second: ', get_cosine(text_to_vector(document1), text_to_vector(document2)))


document1, document3 = preProcessing(doc1, doc3)
print('\nJaccard Similarity Between Sample First and Third: ', calculate_jaccard(document1, document3))
print('Cosine Similarity Between Sample First and Third: ', get_cosine(text_to_vector(document1), text_to_vector(document3)))


Jaccard Similarity Between Sample First and Second:  0.6333333333333333
Cosine Similarity Between Sample First and Second:  0.8775496206997365

Jaccard Similarity Between Sample First and Third:  0.16470588235294117
Cosine Similarity Between Sample First and Third:  0.5479591080851048
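
To make the difference between the two scores easier to check by hand, here is a small self-contained sketch on toy token lists. The function names `jaccard` and `cosine` and the token lists are made up for illustration; they are not part of the exercise code.

```python
import math
from collections import Counter

def jaccard(tokens1, tokens2):
    # Ratio of shared unique tokens to all unique tokens
    s1, s2 = set(tokens1), set(tokens2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def cosine(tokens1, tokens2):
    # Cosine of the angle between the two token-count vectors
    c1, c2 = Counter(tokens1), Counter(tokens2)
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

a = ["scan", "receipt", "earn", "points"]
b = ["scan", "receipt", "find", "savings"]
print(jaccard(a, b))  # 2 shared tokens out of 6 unique tokens
print(cosine(a, b))   # 2 / (2 * 2) = 0.5

# Repeating a token changes the counts but not the sets
c = ["points", "points", "receipt"]
d = ["points", "receipt"]
print(jaccard(c, d))  # 1.0, because the two sets are identical
print(cosine(c, d))   # 3 / sqrt(10), slightly below 1
```

Because cosine similarity works on counts, repeated tokens shift its value while the Jaccard score stays the same.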
