# Cameron Stewart
# Project 6

## 1.	Evaluate text similarity of Amazon book search results by doing the following:
-	Do a book search on Amazon via the search box. Manually copy the full book title (including subtitle) of each of the top 24 books listed in the first two pages of search results. 
-	In Python, run one of the text-similarity measures covered in this course, e.g., cosine similarity. Compare each of the book titles, pairwise, to every other one. 
-	Which two titles are the most similar to each other? Which are the most dissimilar? Where do they rank, among the first 24 results?


In [1]:
#Load Required Libraries
import spacy
import numpy as np
from scipy.stats import rankdata

I loaded the list of Amazon's top book picks for me. I manually copied the full title of each of the top 24 books listed.

In [2]:
#Collect the titles of the top 24 recommended books on Amazon and print
books=['My Killer Vacation', 'Love on the Brain', 
       "Los niños de Irena / Irena's Children: The extraordinary Story of the Woman Who Saved 2.500 Children from the Warsaw Ghetto (Spanish Edition)", 
       'Las montañas de Buda (OTROS LIB. EN EXISTENCIAS S.BARRAL) (Spanish Edition)', 'Muchas vidas, muchos sabios', 
       'La bailarina de Auschwitz: Una inspiradora historia de valentía y supervivencia (No Ficción) (Spanish Edition)', 
       "Text Analytics with Python: A Practitioner's Guide to Natural Language Processing", 
       'R for Data Science: Import, Tidy, Transform, Visualize, and Model Data', 'You Had Me at Hola: A Novel', 
       'Wanna Bet?: An Interracial Romance (Dirty British Romance)', 'His & Hers: A Novel', 'Just Last Night: A Novel', 
       'The Soulmate Equation', 'Rock Paper Scissors: A Novel', 'Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems', 
       'Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming', 'Storytelling with Data: A Data Visualization Guide for Business Professionals', 
       'Behind Closed Doors: A Novel', 'Everything for You (Bergman Brothers Book 5)', 'Book Lovers', 
       'The Family Upstairs: A Novel', 'My Favorite Half-Night Stand', 'The Missed Connection', 'The Bride Test']

print(*enumerate(books,1),sep='\n')

(1, 'My Killer Vacation')
(2, 'Love on the Brain')
(3, "Los niños de Irena / Irena's Children: The extraordinary Story of the Woman Who Saved 2.500 Children from the Warsaw Ghetto (Spanish Edition)")
(4, 'Las montañas de Buda (OTROS LIB. EN EXISTENCIAS S.BARRAL) (Spanish Edition)')
(5, 'Muchas vidas, muchos sabios')
(6, 'La bailarina de Auschwitz: Una inspiradora historia de valentía y supervivencia (No Ficción) (Spanish Edition)')
(7, "Text Analytics with Python: A Practitioner's Guide to Natural Language Processing")
(8, 'R for Data Science: Import, Tidy, Transform, Visualize, and Model Data')
(9, 'You Had Me at Hola: A Novel')
(10, 'Wanna Bet?: An Interracial Romance (Dirty British Romance)')
(11, 'His & Hers: A Novel')
(12, 'Just Last Night: A Novel')
(13, 'The Soulmate Equation')
(14, 'Rock Paper Scissors: A Novel')
(15, 'Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems')
(16, 'Python Crash Course, 2nd Edition: A Hands-On, Pr

Using spaCy, I performed a pairwise comparison of all book titles based on cosine similarity. I stored the results in a NumPy array whose rows and columns list the books in the same order. Before comparing, I removed the stop words from each title so that the comparison rests on the more meaningful words, and I lemmatized the text to reduce each word to its base form.
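As a rough illustration of what the similarity score measures: for models that ship word vectors, spaCy's `Doc.similarity` defaults to the cosine similarity of the two documents' averaged token vectors. The sketch below reproduces that calculation with made-up 3-dimensional vectors. The `toy_vectors` values and the helper names `doc_vector` and `cosine_similarity` are hypothetical and are not taken from `en_core_web_lg`. It also shows how a negative score can arise when two vectors point in roughly opposite directions.

```python
import numpy as np

# Hypothetical 3-d word vectors; real en_core_web_lg vectors have 300 dimensions.
toy_vectors = {
    "killer":   np.array([0.9, 0.1, 0.0]),
    "vacation": np.array([0.1, 0.8, 0.2]),
    "murder":   np.array([0.8, 0.2, 0.1]),
    "trip":     np.array([0.0, 0.9, 0.3]),
    "sabios":   np.array([-0.5, -0.4, 0.7]),
}

def doc_vector(tokens):
    """Average the token vectors as a stand-in for a document vector."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ranges from -1 to 1."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

title = doc_vector(["killer", "vacation"])
related = doc_vector(["murder", "trip"])
unrelated = doc_vector(["sabios"])

print("title vs related:  ", round(cosine_similarity(title, related), 4))
print("title vs unrelated:", round(cosine_similarity(title, unrelated), 4))
```

Because the score depends only on the angle between the averaged vectors, a long title and a short title can still score highly if their words point in similar directions.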

In [3]:
#Load spacy tools
nlp = spacy.load("en_core_web_lg")

#Create numpy array with required shape filled with zeros and suppress exponential notation
np.set_printoptions(suppress=True)
book_comparison=np.zeros((len(books),len(books)))

#Preprocess each title once: lemmatize, drop stop words, then re-parse the cleaned text
cleaned_titles = [nlp(' '.join(t.lemma_ for t in nlp(b) if not t.is_stop)) for b in books]

#Fill numpy array with similarity values
#Books are in the same order for rows and columns
#Resulting array is symmetric and the diagonal entries will all be 1
for i,d1 in enumerate(cleaned_titles):
    for j,d2 in enumerate(cleaned_titles):
        book_comparison[i,j]=round(d1.similarity(d2),4)

print(book_comparison)

[[ 1.      0.4571  0.4388  0.2191 -0.0097  0.1936  0.3029  0.3511  0.219
   0.476   0.2141  0.3804  0.23    0.3195  0.2924  0.3864  0.3076  0.2915
   0.3179  0.4374  0.4472  0.4808  0.4042  0.3757]
 [ 0.4571  1.      0.4583  0.1975 -0.0219  0.1971  0.4559  0.495   0.3464
   0.5591  0.3523  0.4682  0.3392  0.5045  0.4119  0.4962  0.4211  0.4083
   0.3833  0.5151  0.4911  0.5398  0.4608  0.4434]
 [ 0.4388  0.4583  1.      0.6727  0.1708  0.6611  0.4032  0.3917  0.441
   0.6511  0.3383  0.4609  0.1858  0.4305  0.2842  0.4827  0.3102  0.4193
   0.5958  0.4626  0.4885  0.4728  0.2928  0.3975]
 [ 0.2191  0.1975  0.6727  1.      0.3372  0.8678  0.2847  0.2882  0.3098
   0.4653  0.1894  0.2168  0.0699  0.2332  0.1922  0.3824  0.1855  0.203
   0.5276  0.2109  0.2065  0.2667  0.1638  0.1898]
 [-0.0097 -0.0219  0.1708  0.3372  1.      0.3107 -0.1264 -0.0525  0.1974
  -0.0192 -0.2139 -0.1637  0.0536 -0.104  -0.0762 -0.1693 -0.1333 -0.2052
  -0.0581 -0.064  -0.1048 -0.1979 -0.1161 -0.0979]
 [ 0.193

I printed the most and least similar book titles along with their positions in the Amazon search results.

In [4]:
#Finding least similar first because diagonal is 1
min_val_row,min_val_col = np.unravel_index(book_comparison.argmin(), book_comparison.shape)
print('Books that are least similar:',books[min_val_row],'AND',books[min_val_col])
print('Minimum Similarity Score:',book_comparison[min_val_row,min_val_col])
print('Search result position:',min_val_row+1,'AND', min_val_col+1,'\n')

#Replace diagonal with zero and find most similar
np.fill_diagonal(book_comparison, 0)
max_val_row,max_val_col = np.unravel_index(book_comparison.argmax(), book_comparison.shape)
print('Books that are most similar:',books[max_val_row],'AND',books[max_val_col])
print('Maximum Similarity Score:',book_comparison[max_val_row,max_val_col])
print('Search result position:',max_val_row+1,'AND', max_val_col+1)

Books that are least similar: Muchas vidas, muchos sabios AND His & Hers: A Novel
Minimum Similarity Score: -0.2139
Search result position: 5 AND 11 

Books that are most similar: Las montañas de Buda (OTROS LIB. EN EXISTENCIAS S.BARRAL) (Spanish Edition) AND La bailarina de Auschwitz: Una inspiradora historia de valentía y supervivencia (No Ficción) (Spanish Edition)
Maximum Similarity Score: 0.8678
Search result position: 4 AND 6
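The lookup above overwrites the diagonal of `book_comparison` in place. A minimal sketch of an alternative, assuming a small made-up matrix `sim` in place of the real scores, masks the diagonal on a copy so that the original array stays intact and the most and least similar pairs can be found in either order:

```python
import numpy as np

# Made-up symmetric similarity matrix for three titles (not the real scores).
sim = np.array([
    [1.0, 0.4, -0.2],
    [0.4, 1.0, 0.7],
    [-0.2, 0.7, 1.0],
])

# Work on a copy and hide the self-comparisons with NaN.
masked = sim.copy()
np.fill_diagonal(masked, np.nan)

# nanargmax/nanargmin skip the NaN entries on the diagonal.
most = np.unravel_index(np.nanargmax(masked), masked.shape)
least = np.unravel_index(np.nanargmin(masked), masked.shape)

print("Most similar pair (1-based positions):", most[0] + 1, "AND", most[1] + 1)
print("Least similar pair (1-based positions):", least[0] + 1, "AND", least[1] + 1)
```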


I created a list of all pairwise similarity scores, ordered from most similar to least similar (a lower value means lower similarity).

In [5]:

#Fill in diagonal and values above diagonal with -10 so they will all be ranked last
nonsym_book_matrix=book_comparison.copy()
for i in range(book_comparison.shape[0]):
    for j in range(book_comparison.shape[1]):
        if i<=j:
            nonsym_book_matrix[i,j]=-10

#Transform the non-symmetrical array into an array of ranks
r= rankdata(nonsym_book_matrix, method='dense').reshape(nonsym_book_matrix.shape)
ranks = np.array((r.max()+1) - r)

#Create a list of lists that concatenates the rank with the similarity score and the book titles
max_rank=ranks.max()
rank_sim_list=[]
for i in range(nonsym_book_matrix.shape[0]):
    for j in range(nonsym_book_matrix.shape[1]):
        if ranks[i,j]<max_rank:
            rank_sim_list.append([ranks[i,j],'Sim Score: '+str(nonsym_book_matrix[i,j]),'Book Title: '+books[i],'Book Title: '+books[j]])

#Sort the list by rank order
rank_sim_list.sort(key=lambda x: x[0])

#Print the list
print('Ordered list of every pairwise comparison:')
print(*rank_sim_list,sep='\n\n')

Ordered list of every pairwise comparison:
[1, 'Sim Score: 0.8678', 'Book Title: La bailarina de Auschwitz: Una inspiradora historia de valentía y supervivencia (No Ficción) (Spanish Edition)', 'Book Title: Las montañas de Buda (OTROS LIB. EN EXISTENCIAS S.BARRAL) (Spanish Edition)']

[2, 'Sim Score: 0.8065', 'Book Title: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems', 'Book Title: R for Data Science: Import, Tidy, Transform, Visualize, and Model Data']

[3, 'Sim Score: 0.8047', 'Book Title: The Family Upstairs: A Novel', 'Book Title: Just Last Night: A Novel']

[4, 'Sim Score: 0.7919', 'Book Title: Storytelling with Data: A Data Visualization Guide for Business Professionals', "Book Title: Text Analytics with Python: A Practitioner's Guide to Natural Language Processing"]

[5, 'Sim Score: 0.7863', 'Book Title: Storytelling with Data: A Data Visualization Guide for Business Professionals', 'Book Title: Designing Data-Intensive 
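The ranking step can also be sketched without a sentinel value. The version below is an illustrative alternative, not the approach used above: it assumes hypothetical titles and a made-up matrix `sim`, and uses `itertools.combinations` to visit each unordered pair exactly once before sorting.

```python
from itertools import combinations

# Made-up titles and symmetric scores (hypothetical, for illustration only).
titles = ["Book A", "Book B", "Book C"]
sim = [
    [1.0, 0.4, -0.2],
    [0.4, 1.0, 0.7],
    [-0.2, 0.7, 1.0],
]

# combinations() yields each pair (i < j) once, so the diagonal and the
# duplicate half of the symmetric matrix never enter the list.
pairs = [(sim[i][j], titles[i], titles[j])
         for i, j in combinations(range(len(titles)), 2)]

# Sort from most similar to least similar.
pairs.sort(key=lambda p: p[0], reverse=True)

for rank, (score, first, second) in enumerate(pairs, start=1):
    print(rank, "Sim Score:", score, "|", first, "|", second)
```

Note that `enumerate` assigns consecutive ranks even when two pairs tie, whereas the dense ranking above gives tied pairs the same rank.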

## 2.	Now evaluate using a major search engine.
-	Enter one of the book titles from question 1a into Google, Bing, or Yahoo!. Copy the capsule of the first organic result and the 20th organic result. Take web results only (i.e., not video results), and skip sponsored results. 
-	Run the same text similarity calculation that you used for question 1b on each of these capsules in comparison to the original query (book title). 
-	Which one has the highest similarity measure? 


I queried Google for the first book title from the Amazon list ('My Killer Vacation'), stored the 1st and 20th capsules, and printed them.

In [6]:
#Capture query, 1st Capsule, and 20th Capsule. Then print.
searched_book="My Killer Vacation"
capsule_1="An all-new, spicy murder mystery from Tessa Bailey, New York Times bestselling author of It Happened One Summer... It was supposed to be a relaxing vacation"
capsule_20="Successful businessman Jake takes his four-month pregnant girlfriend, Lindsey, on a babymoon trip to a Hawaiian resort. When they get there,"

print('Searched Book Title:',searched_book,'\n')
print('Capsule 1 Results:',capsule_1,'\n')
print('Capsule 20 Results:',capsule_20)

Searched Book Title: My Killer Vacation 

Capsule 1 Results: An all-new, spicy murder mystery from Tessa Bailey, New York Times bestselling author of It Happened One Summer... It was supposed to be a relaxing vacation 

Capsule 20 Results: Successful businessman Jake takes his four-month pregnant girlfriend, Lindsey, on a babymoon trip to a Hawaiian resort. When they get there,


I used cosine similarity with spaCy to compare the book title to the 1st and 20th capsules. The 1st capsule has a slightly higher similarity than the 20th.

In [7]:
#Load spacy tools
nlp = spacy.load("en_core_web_lg")

#Remove stop words
searched_book_parsed = nlp(' '.join([str(t.lemma_) for t in nlp(searched_book) if not t.is_stop]))
capsule_1_parsed = nlp(' '.join([str(t.lemma_) for t in nlp(capsule_1) if not t.is_stop]))
capsule_20_parsed = nlp(' '.join([str(t.lemma_) for t in nlp(capsule_20) if not t.is_stop]))

#Compare book title to capsule
similarity_capsule_1=round(searched_book_parsed.similarity(capsule_1_parsed),4)
similarity_capsule_20=round(searched_book_parsed.similarity(capsule_20_parsed),4)

#Output Results
print("Similarity between searched book and capsule 1:",similarity_capsule_1)
print("Similarity between searched book and capsule 20:",similarity_capsule_20)

Similarity between searched book and capsule 1: 0.6769
Similarity between searched book and capsule 20: 0.6763


## Submit all of your inputs and outputs and your code for this assignment, along with a brief written explanation of your findings.

In the first exercise, I performed a pairwise comparison of 24 book titles using cosine similarity. Before the comparison, all stop words were removed so that the comparison rests on the more meaningful words, and all text was lemmatized to reduce each word to its base form. This was done with the spaCy package. The similarity scores were organized into a NumPy array, which made it easy to identify the maximum and minimum values along with the associated book titles. The most similar titles had a similarity of 0.8678, while the least similar had a similarity of -0.2139. The most similar titles shared some common words and were both in Spanish. The least similar titles had no words in common; one was in English and the other in Spanish. The mix of languages in Amazon's results was an interesting test of how the parser would handle these differences. For the most part, titles in the same language were more similar to each other. Another interesting finding was that the most similar titles were only two positions apart (4 and 6), while the least similar were six apart (5 and 11). This suggests that Amazon may keep similar books closer together, although two pairs are limited evidence. As an extra step, I also printed all the pairwise similarity scores in rank order along with the titles being compared.
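The proximity observation rests on only two pairs. One way to check it across all pairs, sketched here under the assumption that a made-up matrix `sim` stands in for `book_comparison`, is to compute the Pearson correlation between each pair's similarity and the gap between the two result positions. A clearly negative correlation would be consistent with similar titles sitting closer together. This check was not part of the analysis above.

```python
import numpy as np

# Made-up symmetric similarity matrix for four titles (not the real scores).
sim = np.array([
    [1.0, 0.8, 0.5, 0.1],
    [0.8, 1.0, 0.7, 0.4],
    [0.5, 0.7, 1.0, 0.9],
    [0.1, 0.4, 0.9, 1.0],
])

# Indices of every pair above the diagonal, so each pair is counted once.
i, j = np.triu_indices(sim.shape[0], k=1)

gaps = j - i          # distance between the two result positions
scores = sim[i, j]    # similarity of the same pairs

# Pearson correlation between position gap and similarity.
corr = np.corrcoef(gaps, scores)[0, 1]
print("Correlation between position gap and similarity:", round(float(corr), 4))
```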

In the second exercise, I used the first book title in the list, "My Killer Vacation", as a search query in Google. Taking the text from the 1st and 20th capsules, I performed the same text similarity calculation as in exercise 1, including stop word removal and lemmatization. Both results were very close (0.6769 and 0.6763), with the 20th capsule 0.0006 lower in similarity. This is consistent with Google's search results using text similarity as one of many factors in ordering results, although the difference is too small to be conclusive.