# Why is this notebook here?
This notebook contains all of the code along with notes on my thought process. It exists purely for organization and bug fixing.

## Project goal: 

* To build and deploy an algorithm that compares two input texts and returns 1 if they are similar and 0 if they are not.
* The comparison will be based on the words inside the sentence itself:
  * Based on the three samples, samples 1 and 2 should be much more similar than samples 1 and 3.
  * I will be counting the number of words the two texts share.
  * Common "stop" words should be excluded from the comparison (similar to the stop word lists built into NLP libraries).
  * Words should be stemmed down and forced to lowercase for easy comparison.
  * Order of words will not matter at the moment, since the comparison is done on sets of words.
  * Lists will be used to structure the data; compared to dictionaries, they are easier to group together in a tight timeframe.

In [None]:
# Starting off with assigning the sample text into variables. 

sample_1 = "The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'll get points based on the cost of the products. You don't need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'll find the savings for you."
sample_2 = "The easiest way to earn points with Fetch Rewards is to just shop for the items you already buy. If you have any eligible brands on your receipt, you will get points based on the total cost of the products. You do not need to cut out any coupons or scan individual UPCs. Just scan your receipt after you check out and we will find the savings for you."
sample_3 = "We are always looking for opportunities for you to earn more points, which is why we also give you a selection of Special Offers. These Special Offers are opportunities to earn bonus points on top of the regular points you earn every time you purchase a participating brand. No need to pre-select these offers, we'll give you the points whether or not you knew about the offer. We just think it is easier that way."


## Removing Punctuation 

The stop word function is starting to do too much, so the extra tasks are split out into their own functions. This keeps each function easy to follow in case it needs to be updated.

In [None]:
def remove_Punctuation(text):
  # Punctuation characters to strip from the end of each word.
  punctuation = [".", ",", "!", "?"]

  # List to be returned with the updated words.
  updated_List = []

  # Drop one trailing punctuation mark from each word, if present.
  for word in text:
    # The `word and` guard avoids an IndexError on empty strings.
    if word and word[-1] in punctuation:
      updated_List.append(word[:-1])
    else:
      updated_List.append(word)

  return updated_List

In [None]:
# Testing the function
text_List = sample_1.split()
text_List = remove_Punctuation(text_List)
print(text_List)

['The', 'easiest', 'way', 'to', 'earn', 'points', 'with', 'Fetch', 'Rewards', 'is', 'to', 'just', 'shop', 'for', 'the', 'products', 'you', 'already', 'love', 'If', 'you', 'have', 'any', 'participating', 'brands', 'on', 'your', 'receipt', "you'll", 'get', 'points', 'based', 'on', 'the', 'cost', 'of', 'the', 'products', 'You', "don't", 'need', 'to', 'clip', 'any', 'coupons', 'or', 'scan', 'individual', 'barcodes', 'Just', 'scan', 'each', 'grocery', 'receipt', 'after', 'you', 'shop', 'and', "we'll", 'find', 'the', 'savings', 'for', 'you']
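
`remove_Punctuation` drops a single trailing character from a fixed list. A minimal sketch of a variation, assuming it is acceptable to strip every leading and trailing punctuation character, could lean on the standard library's `string.punctuation` and `str.strip`. The name `strip_punctuation` is hypothetical and is not used anywhere else in this notebook.

```python
import string

def strip_punctuation(words):
    # Hypothetical helper: strip leading/trailing punctuation from each word.
    # Inner characters (such as the apostrophe in "you'll") are left alone.
    cleaned = [word.strip(string.punctuation) for word in words]
    # Drop tokens that were nothing but punctuation.
    return [word for word in cleaned if word]

print(strip_punctuation(["love.", "receipt,", "you'll", '"Wait!!"', "--"]))
# ['love', 'receipt', "you'll", 'Wait']
```

This version also handles quotes and repeated marks, at the cost of stripping characters (such as a leading `#` or `$`) that the original list deliberately leaves in place.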


## Creating Stem Words

This function strips common endings from the incoming words so they are stemmed down for easy analysis.

In [None]:
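def create_Stem(text):
  # List of endings to stem out.
  # Note: only the last three characters are compared below, so the
  # two-character ending "ed" does not match words such as "based".
  stem = ["ing", "ed", "n't", "'ll", "est"]

  # Updated list to return at the end of the function
  update_Text = []

  for word in text:
    # Strip the ending when the last three characters match a known ending.
    if word[-3:] in stem:
      update_Text.append(word[:-3])
    else:
      update_Text.append(word)

  return update_Text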

In [None]:
# Testing the function 
text_List = sample_1.split()
text_List = create_Stem(text_List)
print(text_List)

['The', 'easi', 'way', 'to', 'earn', 'points', 'with', 'Fetch', 'Rewards', 'is', 'to', 'just', 'shop', 'for', 'the', 'products', 'you', 'already', 'love.', 'If', 'you', 'have', 'any', 'participat', 'brands', 'on', 'your', 'receipt,', 'you', 'get', 'points', 'based', 'on', 'the', 'cost', 'of', 'the', 'products.', 'You', 'do', 'need', 'to', 'clip', 'any', 'coupons', 'or', 'scan', 'individual', 'barcodes.', 'Just', 'scan', 'each', 'grocery', 'receipt', 'after', 'you', 'shop', 'and', 'we', 'find', 'the', 'savings', 'for', 'you.']
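
Because `create_Stem` always compares the last three characters, the two-character ending `"ed"` never gets stripped from words like `'based'` in the output above. A minimal sketch of a length-aware alternative is shown here, assuming a small guard on the remaining stem is wanted so that words like "need" are not cut down to "ne". The name `strip_suffix` and the `min_stem` parameter are hypothetical and not part of the notebook's pipeline.

```python
def strip_suffix(word, suffixes=("ing", "est", "n't", "'ll", "ed"), min_stem=3):
    # Hypothetical helper: remove the first matching suffix, whatever its length.
    for suffix in suffixes:
        stem = word[: -len(suffix)]
        # Only strip when enough of the word is left to stay recognizable.
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

words = ["easiest", "participating", "based", "need", "you'll"]
print([strip_suffix(w) for w in words])
# ['easi', 'participat', 'bas', 'need', 'you']
```

The guard is a trade-off: with `min_stem=3`, a contraction such as "don't" is left unchanged, whereas `create_Stem` reduces it to "do".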


## The Stop Words: 
* This will be a list of words commonly found in text.
* This will be done without importing libraries, using only the stop words found in the samples (simulating that this is the only data I have).
* The list will be applied in a function that strips those words from the text before it is compared.
* The function will also lowercase the words and stem them down.
* It will also remove punctuation.


In [None]:
def remove_Stop_Words(text):
  # Defining the stop words based on the given samples
  stop_Words = ["the", "to", "with", "is", "for", "you", "if", "have", "any", "on", "your", "of", "or"]

  # List that will hold the words kept after filtering
  new_Text = []

  # Splitting the string into a list of words so the loop does not iterate over individual letters.
  text_List = text.split()

  # Removing the punctuation
  text_List = remove_Punctuation(text_List)

  # Stemming down the words
  text_List = create_Stem(text_List)

  # Loop over the words and compare the lowercased form of each one against stop_Words.
  # Stop words are skipped; every other word is lowercased and appended to the list that is returned.
  for word in text_List:
    lowered = word.lower()
    if lowered not in stop_Words:
      new_Text.append(lowered)

  return new_Text
  

In [None]:
# Testing to see if the function works.
print(remove_Stop_Words(sample_1))

['easi', 'way', 'earn', 'points', 'fetch', 'rewards', 'just', 'shop', 'products', 'already', 'love', 'participat', 'brands', 'receipt', 'get', 'points', 'based', 'cost', 'products', 'do', 'need', 'clip', 'coupons', 'scan', 'individual', 'barcodes', 'just', 'scan', 'each', 'grocery', 'receipt', 'after', 'shop', 'and', 'we', 'find', 'savings']
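
The stop word list in `remove_Stop_Words` was picked by hand from the samples. One way to get a starting list from the data itself is sketched below, under the assumption that a word appearing in every text is a reasonable stop word candidate. The helper name `stop_word_candidates` and its `min_docs` parameter are hypothetical, and the three short sentences are made-up demo input rather than the notebook's samples.

```python
from collections import Counter

def stop_word_candidates(texts, min_docs=None):
    # Hypothetical helper: words that appear in at least `min_docs` of the texts.
    if min_docs is None:
        min_docs = len(texts)
    doc_counts = Counter()
    for text in texts:
        # Use a set so each text counts a given word at most once.
        words = {word.strip(".,!?").lower() for word in text.split()}
        doc_counts.update(words)
    return sorted(word for word, count in doc_counts.items() if word and count >= min_docs)

docs = ["The cat sat on the mat.", "A dog sat on the rug.", "The bird is on the roof!"]
print(stop_word_candidates(docs))
# ['on', 'the']
```

The result still needs a manual pass: run on the three samples, it would also surface content words such as "points" and "earn", which appear in every sample but carry meaning.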


## The function: 
* Starting with a **Jaccard Similarity** function that directly compares which words the two sample texts share.
* The score is the number of unique words shared by both texts divided by the number of unique words that appear in either text.
* The function also removes the stop words before the comparison is made.
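
To make the calculation concrete, here is a small worked example on two already-cleaned word lists. It is a sketch only: `jaccard_score` is a hypothetical helper that returns the raw score (shared unique words divided by all unique words) rather than the 1/0 result, and the two word lists are made up for illustration.

```python
def jaccard_score(tokens_a, tokens_b):
    # Hypothetical helper: raw Jaccard score for two lists of words.
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    # Two empty inputs have nothing in common; avoid dividing by zero.
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

a = ["earn", "points", "shop", "receipt"]
b = ["earn", "points", "scan", "receipt", "savings"]
print(jaccard_score(a, b))  # 3 shared / 6 unique = 0.5
```

A score of exactly 0.5 sits on the boundary: with a strict "more than 50%" rule, this pair would be reported as not similar.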

In [None]:
def jaccard_Similarity(text1, text2):
  # Slimming down text from junk
  text1, text2 = remove_Stop_Words(text1), remove_Stop_Words(text2)

  # Grabbing the unique words shared between the two texts
  shared_text = set(text1).intersection(set(text2))

  # Grabbing the unique words that appear in either text
  all_text = set(text1).union(set(text2))

  # Two texts with no words left after cleaning are treated as not similar
  if not all_text:
    return 0

  # Similarity as a percentage: return 1 if it is above 50%, otherwise 0
  result = 100*(len(shared_text)/len(all_text))

  if result > 50:
    return 1
  else:
    return 0

In [None]:
# Testing the function 

# Between sample 1 and sample 2
print(jaccard_Similarity(sample_1, sample_2))

# Between 1 and 3
print(jaccard_Similarity(sample_1, sample_3))

# Between 2 and 3
print(jaccard_Similarity(sample_2, sample_3))

1
0
0

