## Importing Libraries

In this notebook, we'll examine the impact of choosing n in our n-grams across two libraries of texts. One library comprises several full-length novels, while the other consists of articles of varying lengths. We'll first define several helper functions that will be reused throughout this process (each one is commented). We'll then evaluate our choice of n for each library and, finally, use that n-value to compare the texts in our library and return Jaccard similarity values for the more similar texts.

In [1]:
import nltk
import pandas as pd

In [2]:
text1 = "texas online preparatory school now accepting enrollment for 2018 school year"
text2 = "texas online preparatory school now accepting enrollment for 2019 school year"
text3 = "No one knows how Google Duplex will work with eavesdropping laws"
text4 = "Google sells the future, powered by your personal data"
text5 = "Google And Amazon Raise The Volume On Conversational Commerce"
text6 = "Google"
text7 = "Apple made more Profit in the last Quarter than Amazon made since its Inception"
text8 = "Recipe: Apple Pear Sour Cocktail"

all_documents = [text1,text2,text3,text4,text5,text6,text7,text8]

### Jaccard similarity using words

In [3]:
def jaccard_similarity_words(text1, text2):
    # Use unigrams (n=1): each whitespace-separated token is one gram.
    n = 1

    # nltk.ngrams returns a generator of tuples; materialize it as a list.
    doc1 = list(nltk.ngrams(text1.split(), n))
    doc2 = list(nltk.ngrams(text2.split(), n))

    # Jaccard similarity = |A ∩ B| / |A ∪ B|, computed on the sets of distinct grams.
    intersection = len(set(doc1) & set(doc2))
    union = len(set(doc1)) + len(set(doc2)) - intersection
    jaccard_similarity = intersection / union

    return doc1, doc2, intersection, union, jaccard_similarity

In [4]:
doc1, doc2, Intersection, Union, Jaccard_similarity = jaccard_similarity_words(text1, text2)

print(doc1)
print(doc2)

print("Intersection: {}".format(Intersection))
print("Union: {}".format(Union))
print("Jaccard_similarity: {}".format(Jaccard_similarity))

[('texas',), ('online',), ('preparatory',), ('school',), ('now',), ('accepting',), ('enrollment',), ('for',), ('2018',), ('school',), ('year',)]
[('texas',), ('online',), ('preparatory',), ('school',), ('now',), ('accepting',), ('enrollment',), ('for',), ('2019',), ('school',), ('year',)]
Intersection: 9
Union: 11
Jaccard_similarity: 0.8181818181818182
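
The function above fixes n at 1. Since the stated goal of this notebook is to study the effect of n, a minimal sketch of a version that takes n as a parameter might look like the following. The names `word_ngrams` and `jaccard_similarity_ngrams` are assumptions made for illustration; the pure-Python helper stands in for `nltk.ngrams` so the cell runs without extra dependencies. Returning 0.0 when neither text is long enough to yield an n-gram is also a choice of this sketch, not something the notebook specifies.

```python
def word_ngrams(text, n):
    """Return the set of distinct word n-grams in `text` (hypothetical helper)."""
    tokens = text.split()
    # Zip n staggered copies of the token list to form consecutive n-grams.
    return set(zip(*(tokens[i:] for i in range(n))))


def jaccard_similarity_ngrams(text_a, text_b, n=1):
    """Jaccard similarity of two texts over word n-grams (illustrative sketch)."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    union = grams_a | grams_b
    if not union:
        # Assumption: texts too short to yield any n-gram count as dissimilar.
        return 0.0
    return len(grams_a & grams_b) / len(union)


a = "texas online preparatory school now accepting enrollment for 2018 school year"
b = "texas online preparatory school now accepting enrollment for 2019 school year"

for n in (1, 2, 3):
    print(n, round(jaccard_similarity_ngrams(a, b, n), 4))
```

For these two headlines, which differ in a single word, the score falls from about 0.82 at n=1 to about 0.67 at n=2 and 0.5 at n=3, because one changed word affects every n-gram that contains it.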


### Jaccard similarity using hashes

In [5]:
def jaccard_similarity_hashed(text1, text2):
    # Use unigrams (n=1): each whitespace-separated token is one gram.
    n = 1

    # Replace each gram with its integer hash so the sets hold ints instead of tuples of strings.
    hashed1 = [hash(gram) for gram in nltk.ngrams(text1.split(), n)]
    hashed2 = [hash(gram) for gram in nltk.ngrams(text2.split(), n)]

    # Equal grams hash to equal values, so set operations on the hashes mirror those on the grams.
    intersection = len(set(hashed1) & set(hashed2))
    union = len(set(hashed1)) + len(set(hashed2)) - intersection
    jaccard_similarity = intersection / union

    return hashed1, hashed2, intersection, union, jaccard_similarity

In [6]:
hashed1, hashed2, Intersection, Union, Jaccard_similarity = jaccard_similarity_hashed(text1, text2)

print(hashed1)
print(hashed2)

print("Intersection: {}".format(Intersection))
print("Union: {}".format(Union))
print("Jaccard_similarity: {}".format(Jaccard_similarity))

[-3828334556592602502, 7796169260296222837, 3829359575656866846, 7380576032124812189, -2794614432831930426, -3880342475057074692, 2865250114530183441, 8842832327081768553, 3337012716206599963, 7380576032124812189, -1468383533794503398]
[-3828334556592602502, 7796169260296222837, 3829359575656866846, 7380576032124812189, -2794614432831930426, -3880342475057074692, 2865250114530183441, 8842832327081768553, 3476266166996983440, 7380576032124812189, -1468383533794503398]
Intersection: 9
Union: 11
Jaccard_similarity: 0.8181818181818182
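
One caveat: Python's built-in `hash()` is salted per interpreter process for strings unless `PYTHONHASHSEED` is fixed, so the integers printed above will generally differ from run to run and cannot be stored and compared across sessions. The similarity value itself is unaffected within a single run. If reproducible hashes are needed, one option is a digest from `hashlib`. The sketch below assumes that approach; `stable_hash` and `jaccard_similarity_stable_hashed` are hypothetical names, not part of this notebook.

```python
import hashlib


def stable_hash(gram):
    """Map a tuple of words to a 64-bit integer that is identical across runs."""
    # Join with a space, which cannot occur inside whitespace-split tokens.
    data = " ".join(gram).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    # Keep the first 8 bytes so the value is comparable in size to hash().
    return int.from_bytes(digest[:8], "big")


def jaccard_similarity_stable_hashed(text_a, text_b):
    """Jaccard similarity over hashed unigrams (illustrative sketch)."""
    hashed_a = {stable_hash((token,)) for token in text_a.split()}
    hashed_b = {stable_hash((token,)) for token in text_b.split()}
    union = hashed_a | hashed_b
    if not union:
        return 0.0  # assumption: two empty texts count as dissimilar
    return len(hashed_a & hashed_b) / len(union)


a = "texas online preparatory school now accepting enrollment for 2018 school year"
b = "texas online preparatory school now accepting enrollment for 2019 school year"
print(jaccard_similarity_stable_hashed(a, b))
```

Truncating the digest to 64 bits keeps the values compact; collisions are theoretically possible but vanishingly unlikely at this scale.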


### Compare similar topics

In [7]:
doc1, doc2, Intersection, Union, Jaccard_similarity = jaccard_similarity_words(text4, text5)

print(doc1)
print(doc2)

print("Intersection: {}".format(Intersection))
print("Union: {}".format(Union))
print("Jaccard_similarity: {}".format(Jaccard_similarity))

[('Google',), ('sells',), ('the',), ('future,',), ('powered',), ('by',), ('your',), ('personal',), ('data',)]
[('Google',), ('And',), ('Amazon',), ('Raise',), ('The',), ('Volume',), ('On',), ('Conversational',), ('Commerce',)]
Intersection: 1
Union: 17
Jaccard_similarity: 0.058823529411764705
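
The low score here is partly a tokenization effect: `split()` preserves case and punctuation, so `'the'` and `'The'` count as different words and `'future,'` carries its comma. A minimal sketch of normalizing before comparison follows. The `normalize` helper and the choice to lowercase and strip ASCII punctuation are assumptions for illustration, not steps this notebook performs.

```python
import string


def normalize(text):
    """Lowercase `text` and strip ASCII punctuation from each token (hypothetical helper)."""
    tokens = (token.strip(string.punctuation) for token in text.lower().split())
    # Drop tokens that were pure punctuation and are now empty.
    return {token for token in tokens if token}


def jaccard_similarity_normalized(text_a, text_b):
    """Jaccard similarity over normalized distinct words (illustrative sketch)."""
    words_a = normalize(text_a)
    words_b = normalize(text_b)
    union = words_a | words_b
    if not union:
        return 0.0  # assumption: two empty texts count as dissimilar
    return len(words_a & words_b) / len(union)


text4 = "Google sells the future, powered by your personal data"
text5 = "Google And Amazon Raise The Volume On Conversational Commerce"
print(jaccard_similarity_normalized(text4, text5))  # 2 shared words / 16 distinct = 0.125
```

With normalization, "the" joins "google" in the intersection, roughly doubling the score. Whether that is desirable depends on whether common words such as "the" should count as evidence of similarity.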


### Compare long vs short strings

In [8]:
doc1, doc2, Intersection, Union, Jaccard_similarity = jaccard_similarity_words(text5, text6)

print(doc1)
print(doc2)

print("Intersection: {}".format(Intersection))
print("Union: {}".format(Union))
print("Jaccard_similarity: {}".format(Jaccard_similarity))

[('Google',), ('And',), ('Amazon',), ('Raise',), ('The',), ('Volume',), ('On',), ('Conversational',), ('Commerce',)]
[('Google',)]
Intersection: 1
Union: 9
Jaccard_similarity: 0.1111111111111111
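
This result follows directly from the definition. Writing A and B for the two sets of distinct words (symbols introduced here for illustration), the intersection can be no larger than the smaller set and the union no smaller than the larger one, which caps the score whenever the set sizes differ:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|} \le \frac{\min(|A|, |B|)}{\max(|A|, |B|)},
\qquad
J(\text{text5}, \text{text6}) = \frac{1}{9} \approx 0.111
```

Here |A| = 9 and |B| = 1, so 1/9 is the highest score this pair could reach, even though every word of text6 appears in text5.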


### Compare different topics

In [9]:
doc1, doc2, Intersection, Union, Jaccard_similarity = jaccard_similarity_words(text7, text8)

print(doc1)
print(doc2)

print("Intersection: {}".format(Intersection))
print("Union: {}".format(Union))
print("Jaccard_similarity: {}".format(Jaccard_similarity))

[('Apple',), ('made',), ('more',), ('Profit',), ('in',), ('the',), ('last',), ('Quarter',), ('than',), ('Amazon',), ('made',), ('since',), ('its',), ('Inception',)]
[('Recipe:',), ('Apple',), ('Pear',), ('Sour',), ('Cocktail',)]
Intersection: 1
Union: 17
Jaccard_similarity: 0.058823529411764705
