# Similarity between two documents

We measure the similarity between two documents by converting each document into a vector and then applying a vector similarity measure, here cosine similarity, to the pair of vectors.
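As a minimal sketch of the idea, the snippet below computes cosine similarity by hand for two small vectors. The vectors `doc_a` and `doc_b` and the helper `cosine` are hypothetical and only illustrate the formula; the notebook itself relies on scikit-learn for this step.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word-count vectors over a shared four-word vocabulary
doc_a = [1, 1, 0, 2]
doc_b = [1, 0, 1, 2]

score = cosine(doc_a, doc_b)
print(round(score, 4))  # 0.8333
```

A score of 1 means the vectors point in the same direction, and 0 means they share no non-zero dimensions.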

In [2]:
# creating a corpus of two documents

corpus = ['''Quietly and without fanfare, Bettie Page blew into New York City in the fall of 1950. Seven years later, at the height of her fame, the Queen of Curves abdicated her throne and left the same way.

Her timing was perfect. A decade earlier, her images would have been hidden under cigar shop counters while the wartime wholesomeness of Betty Grable dominated the display stands. A decade later, Page’s provocative poses seemed almost prudish. By vanishing at her beauty’s peak in 1957, Page achieved immortality. Like James Dean, her image remains frozen in time. Forever beautiful, forever young.

During her brief time in the spotlight, Page found fame twice over. She was the most sought-after model for the girlie magazines that served as the prototype for the slicker, more overtly sexual publications like Playboy that followed. She was also the poster child for a government-led backlash against the titillation of the young American male.

Coney Island still attracted a crowd in October. As the memory of summer faded, New Yorkers continued to flock to the famous landmark to ride the Ferris wheel and gorge on candy floss.

In the summer, sun worshipers were packed so tightly on the beach that there was no room to lay a towel. In autumn, however, even the hardier swimmers confined themselves to the baths, which provided relative protection against the breeze blowing in from the Atlantic'''
    ,
     '''There is a persistent Left Bank legend that, when President François Paul Jules Grévy first set eyes of Fernand Cormon’s new work at the opening of the 1880 Paris Salon, he immediately ordered the artist be taken to the Palais de la Légion d’Honneur to receive France’s highest order of merit.

Certainly, Cormon became an Officer in the National Order of the Legion of Honour that year, but the details seems too perfect, too utterly French, to be so precise.

But when you stand in front of Cain for the very first you believe, without a shred of doubt, it couldn’t have happened any other way.

It isn’t the scale that hits you first. Sure, it’s big. Seven metres wide. But taking that in is like standing on the beach and trying to say how wide the ocean is. In fact, it isn’t any one thing at all, it’s everything. Every brushstroke, every shadow, every expression, every intricate detail, hit you all at once.

The subject itself — the expulsion of the tribe of the first murderer from Eden — is a fairly common Western Christian theme, but the presentation of a biblical scene had never been depicted as, in the words of art critic Martha Lucy, “a dishevelled, prehistoric tribe. Clad in animal pelts and brandishing Stone-Age weapons, with wild manes, the tribe trudges across the desert hauling its cargo of bloody carcasses.”

We see Cain in a new way. No longer the cold, arrogant, Shakespearean villain we heard of in Sunday school, Cormon’s Cain is a wretched thing. Still walking a pace ahead of them, near naked but for a tattered loincloth and the murderous weapon that brought him to this end. He walks a leader, but it is the hunched, sleepwalk of someone as doomed and defeated as they are.

Here, I need to interrupt the scene and inject myself into the story, because art requires an observer.

Back home, my own creativity had hit a roadblock. The great histories I imagined I would write had failed to materialize and I found myself hacking out speeches for politicians for whom I had little faith and less respect. I was done.

So I ended up here, because if you have to walk the lonely streets as a gloomy failed writer, you may as well do it in Paris.

There, in that great room of the Musee d’Orsay, my view of the world changed. I saw elongated shadows that stretch out in from of Cain’s caravan as they trudged away from The Light, forever. And downcast faces of guilty men, and sorrowful faces of their dragged-down women, and the sleep of guileless babes. And a receding blue sky behind and the long, uphill climb ahead.'''    ]


In [3]:
# Preprocessing

# 1. Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

for document in corpus:
    # Split on single spaces, so punctuation stays attached to each token
    stemmed = [stemmer.stem(token) for token in document.split(" ")]
    print(" ".join(stemmed))
    print("\n")

quietli and without fanfare, betti page blew into new york citi in the fall of 1950. seven year later, at the height of her fame, the queen of curv abdic her throne and left the same way.

her time wa perfect. A decad earlier, her imag would have been hidden under cigar shop counter while the wartim wholesom of betti grabl domin the display stands. A decad later, page’ provoc pose seem almost prudish. By vanish at her beauty’ peak in 1957, page achiev immortality. like jame dean, her imag remain frozen in time. forev beautiful, forev young.

dur her brief time in the spotlight, page found fame twice over. she wa the most sought-aft model for the girli magazin that serv as the prototyp for the slicker, more overtli sexual public like playboy that followed. she wa also the poster child for a government-l backlash against the titil of the young american male.

coney island still attract a crowd in october. As the memori of summer faded, new yorker continu to flock to the famou landmark to
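The stemmed output above still contains unstemmed forms such as "stands." and "immortality.", most likely because splitting on a single space leaves the punctuation attached to the word. The sketch below compares that split with a simple regex tokenizer. The sample sentence and the `\w+` pattern are illustrative assumptions, not part of the original notebook.

```python
import re

sentence = "the display stands. forever beautiful, forever young."

# Splitting on spaces keeps punctuation glued to the preceding word
space_tokens = sentence.split(" ")

# Hypothetical alternative: keep only runs of word characters
word_tokens = re.findall(r"\w+", sentence.lower())

print(space_tokens)
# ['the', 'display', 'stands.', 'forever', 'beautiful,', 'forever', 'young.']
print(word_tokens)
# ['the', 'display', 'stands', 'forever', 'beautiful', 'forever', 'young']
```

Passing tokens like those in `word_tokens` to the stemmer would let it see the bare word, at the cost of dropping the punctuation from the printed text.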

In [4]:
# 2. Lemmatization

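import nltk
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; download it once if it is missing
nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

for document in corpus:
    # lemmatize() treats every token as a noun by default (pos="n"),
    # which likely explains why "was" comes out as "wa" in the output
    lemmatized = [lemmatizer.lemmatize(token) for token in document.split(" ")]
    print(" ".join(lemmatized))
    print("\n")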

Quietly and without fanfare, Bettie Page blew into New York City in the fall of 1950. Seven year later, at the height of her fame, the Queen of Curves abdicated her throne and left the same way.

Her timing wa perfect. A decade earlier, her image would have been hidden under cigar shop counter while the wartime wholesomeness of Betty Grable dominated the display stands. A decade later, Page’s provocative pose seemed almost prudish. By vanishing at her beauty’s peak in 1957, Page achieved immortality. Like James Dean, her image remains frozen in time. Forever beautiful, forever young.

During her brief time in the spotlight, Page found fame twice over. She wa the most sought-after model for the girlie magazine that served a the prototype for the slicker, more overtly sexual publication like Playboy that followed. She wa also the poster child for a government-led backlash against the titillation of the young American male.

Coney Island still attracted a crowd in October. As the memory o

In [5]:
# Feature Engineering 

# 1. CountVectors

from sklearn.feature_extraction.text import CountVectorizer

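# binary=True records only whether a term occurs, not how often
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())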
print("\n")
print(X.toarray())

['1880', '1950', '1957', 'abdicated', 'achieved', 'across', 'after', 'against', 'age', 'ahead', 'all', 'almost', 'also', 'american', 'an', 'and', 'animal', 'any', 'are', 'arrogant', 'art', 'artist', 'as', 'at', 'atlantic', 'attracted', 'autumn', 'away', 'babes', 'back', 'backlash', 'bank', 'baths', 'be', 'beach', 'beautiful', 'beauty', 'became', 'because', 'been', 'behind', 'believe', 'bettie', 'betty', 'biblical', 'big', 'blew', 'bloody', 'blowing', 'blue', 'brandishing', 'breeze', 'brief', 'brought', 'brushstroke', 'but', 'by', 'cain', 'candy', 'caravan', 'carcasses', 'cargo', 'certainly', 'changed', 'child', 'christian', 'cigar', 'city', 'clad', 'climb', 'cold', 'common', 'coney', 'confined', 'continued', 'cormon', 'couldn', 'counters', 'creativity', 'critic', 'crowd', 'curves', 'de', 'dean', 'decade', 'defeated', 'depicted', 'desert', 'detail', 'details', 'dishevelled', 'display', 'do', 'dominated', 'done', 'doomed', 'doubt', 'down', 'downcast', 'dragged', 'during', 'earlier', 'ede
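To make the binary count representation concrete, here is a rough pure-Python sketch of what it produces for a tiny corpus. The two sentences in `toy_corpus` are made up, and the regex only approximates the vectorizer's default tokenization (lowercased tokens of two or more word characters).

```python
import re

toy_corpus = ["the cat sat on the mat", "the dog sat"]

# Lowercase and keep tokens of two or more word characters (a simplification)
tokenized = [re.findall(r"\w\w+", doc.lower()) for doc in toy_corpus]

# The vocabulary is the sorted set of all tokens across both documents
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each entry is 1 if the term occurs in the document, else 0
binary_vectors = []
for tokens in tokenized:
    present = set(tokens)
    binary_vectors.append([1 if term in present else 0 for term in vocabulary])

print(vocabulary)      # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(binary_vectors)  # [[1, 0, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1]]
```

Note that "the" appears twice in the first sentence but is still recorded as 1, which is the effect of the binary setting.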

In [6]:
# 2. TF-IDF Vectors

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
Y = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print("\n")
print(Y[0].toarray())
print("\n")
print(Y[1].toarray())

['1880', '1950', '1957', 'abdicated', 'achieved', 'across', 'after', 'against', 'age', 'ahead', 'all', 'almost', 'also', 'american', 'an', 'and', 'animal', 'any', 'are', 'arrogant', 'art', 'artist', 'as', 'at', 'atlantic', 'attracted', 'autumn', 'away', 'babes', 'back', 'backlash', 'bank', 'baths', 'be', 'beach', 'beautiful', 'beauty', 'became', 'because', 'been', 'behind', 'believe', 'bettie', 'betty', 'biblical', 'big', 'blew', 'bloody', 'blowing', 'blue', 'brandishing', 'breeze', 'brief', 'brought', 'brushstroke', 'but', 'by', 'cain', 'candy', 'caravan', 'carcasses', 'cargo', 'certainly', 'changed', 'child', 'christian', 'cigar', 'city', 'clad', 'climb', 'cold', 'common', 'coney', 'confined', 'continued', 'cormon', 'couldn', 'counters', 'creativity', 'critic', 'crowd', 'curves', 'de', 'dean', 'decade', 'defeated', 'depicted', 'desert', 'detail', 'details', 'dishevelled', 'display', 'do', 'dominated', 'done', 'doomed', 'doubt', 'down', 'downcast', 'dragged', 'during', 'earlier', 'ede
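The following sketch shows how TF-IDF weights of this kind can be computed by hand. It assumes the default `TfidfVectorizer` settings as documented by scikit-learn: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The pre-tokenized `toy_corpus` and the helper `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus of two documents
toy_corpus = [["cat", "sat", "sat"], ["dog", "sat"]]
n_docs = len(toy_corpus)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in toy_corpus for term in set(doc))

def tfidf(doc):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec_a = tfidf(toy_corpus[0])
vec_b = tfidf(toy_corpus[1])
print(vec_a)
print(vec_b)
```

Under this smoothing, a term that occurs in both documents gets an IDF of exactly 1 rather than 0, so shared words still contribute to the similarity score.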

In [7]:
# Calculate Cosine Similarity

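from sklearn.metrics.pairwise import cosine_similarity

# Compare document 0 with document 1 under each representation
similarity_1 = cosine_similarity(X[0], X[1])  # binary count vectors
similarity_2 = cosine_similarity(Y[0], Y[1])  # TF-IDF vectors
print(similarity_1)
print(similarity_2)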

[[0.1493408]]
[[0.58744731]]
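For binary vectors, cosine similarity reduces to a vocabulary overlap: the dot product counts the shared terms, and each norm is the square root of that document's vocabulary size. The sketch below uses two hypothetical term sets to show this. One plausible reading of the two scores above, offered as an assumption rather than something the notebook verifies, is that TF-IDF keeps term frequencies, so frequent shared words such as "the" and "and" carry more weight than they do in the binary vectors.

```python
import math

def binary_cosine(terms_a, terms_b):
    """Cosine similarity of two binary bag-of-words vectors, given as term sets."""
    shared = len(terms_a & terms_b)
    return shared / math.sqrt(len(terms_a) * len(terms_b))

# Hypothetical vocabularies of two short documents
terms_a = {"the", "cat", "sat", "on", "mat"}
terms_b = {"the", "dog", "sat"}

score = binary_cosine(terms_a, terms_b)
print(round(score, 4))  # 0.5164
```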
