# Web Intelligence
# Text Processing and Similarity Search

#### Prof. Claudio Lucchese

## Similarity search

Similarity search is a fundamental task in Web and Data Mining. It's a building block for:
 - **Plagiarism Detection**. Consider a newly published document that may infringe copyright, including images and videos.
 - **Mirror Pages**. As many as 40% of the pages crawled by a search engine may be mirror pages.
 - **Recommendation Systems**. On-line purchases, Movies, Hotels, Restaurants, ...
 - **Person identification**. Personal devices, security monitoring, ...

## Similarity Search in text

Data source:
    https://www.kaggle.com/mousehead/songlyrics

**Task:
Find the most similar song to "Pink" by Aerosmith**


In [8]:
songs_file = "../datasets/lyrics/songdata.csv"

In [16]:
# First attempt: read the first 10 rows and inspect their content

import csv
# see DictReader https://docs.python.org/3/library/csv.html

with open(songs_file, newline='') as f:
    reader = csv.DictReader(f)
    for i,row in enumerate(reader):
        if i==10:break
        for k,v in row.items():
            print (k,":",v)

artist : ABBA
song : Ahe's My Kind Of Girl
link : /a/abba/ahes+my+kind+of+girl_20598417.html
text : Look at her face, it's a wonderful face  
And it means something special to me  
Look at the way that she smiles when she sees me  
How lucky can one fellow be?  
  
She's just my kind of girl, she makes me feel fine  
Who could ever believe that she could be mine?  
She's just my kind of girl, without her I'm blue  
And if she ever leaves me what could I do, what could I do?  
  
And when we go for a walk in the park  
And she holds me and squeezes my hand  
We'll go on walking for hours and talking  
About all the things that we plan  
  
She's just my kind of girl, she makes me feel fine  
Who could ever believe that she could be mine?  
She's just my kind of girl, without her I'm blue  
And if she ever leaves me what could I do, what could I do?


artist : ABBA
song : Andante, Andante
link : /a/abba/andante+andante_20002708.html
text : Take it easy with me, please  
Touch me gently l

In [9]:
#linux only
!head {songs_file}

artist,song,link,text
ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face  
And it means something special to me  
Look at the way that she smiles when she sees me  
How lucky can one fellow be?  
  
She's just my kind of girl, she makes me feel fine  
Who could ever believe that she could be mine?  
She's just my kind of girl, without her I'm blue  
And if she ever leaves me what could I do, what could I do?  


### Load the dataset into memory

**Disclaimer**: in the following we limit the number of songs to make sure we have enough memory and reasonable running times. It's clear that 
**the more songs the more interesting the result**.

In [7]:
def load_data(filename, max_songs = 5000):
    rows = []

    with open(filename, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(list(row.values()))
            if len(rows)>=max_songs:
                break
    return rows

raw_dataset = load_data(songs_file)

print ("artist:", raw_dataset[0][0])
print ("title:",  raw_dataset[0][1])
print ("lyrics:", raw_dataset[0][-1])


#### Search for "Pink" by Aerosmith

Try with another song of your choice.

In [18]:
for i,row in enumerate(raw_dataset):
    if "Pink" in row[1]:
        print(i,row[0],row[1])

183 Aerosmith Pink
780 Ariana Grande Pink Champagne
2126 Cake Pretty Pink Ribbon


In [19]:
print (raw_dataset[183][-1])

Pink, it's my new obsession, yeah  
Pink, it's not even a question  
Pink, on the lips of your lover  
'Cause pink is the love you discover  
Pink, as the bing on your cherry  
Pink, 'cause you are so very  
Pink, it's the color of passion  
  
'Cause today it just goes with the fashion  
Pink, it was love at first sight  
Yeah pink, when I turn out the light  
And pink gets me high as a kite  
And I think everything is going to be alright  
No matter what we do tonight  
  
You could be my flamingo  
'Cause pink, it's a new kinda lingo  
Pink, like a deco umbrella  
Ffff, it's kink that you don't ever tell her  
Yeah, pink, it was love at first sight  
Then pink when I turn out the light  
Yeah, pink gets me high as a kite  
And I think everything is going to be alright  
No matter what we do tonight  
  
Yeah,  
I, want to be your lover  
Ffff, I I wanna wrap you in rubber  
And it's pink as the sheets that we lay on  
'Cause pink, It's my favorite crayon  
  
Yeah  
Pink, it was lov

In [20]:
for i,(a,t,u,l) in enumerate(raw_dataset):
    if "Pink" in t:
        print(i,a,t)

183 Aerosmith Pink
780 Ariana Grande Pink Champagne
2126 Cake Pretty Pink Ribbon


In [21]:
# This is the query song
query_id = 183
skip = [] # if any, put covers here !

In [22]:
print ( raw_dataset[query_id][-1])

Pink, it's my new obsession, yeah  
Pink, it's not even a question  
Pink, on the lips of your lover  
'Cause pink is the love you discover  
Pink, as the bing on your cherry  
Pink, 'cause you are so very  
Pink, it's the color of passion  
  
'Cause today it just goes with the fashion  
Pink, it was love at first sight  
Yeah pink, when I turn out the light  
And pink gets me high as a kite  
And I think everything is going to be alright  
No matter what we do tonight  
  
You could be my flamingo  
'Cause pink, it's a new kinda lingo  
Pink, like a deco umbrella  
Ffff, it's kink that you don't ever tell her  
Yeah, pink, it was love at first sight  
Then pink when I turn out the light  
Yeah, pink gets me high as a kite  
And I think everything is going to be alright  
No matter what we do tonight  
  
Yeah,  
I, want to be your lover  
Ffff, I I wanna wrap you in rubber  
And it's pink as the sheets that we lay on  
'Cause pink, It's my favorite crayon  
  
Yeah  
Pink, it was lov

## How to find the most similar song?

We need two ingredients:
 - define what a song is
 - define when two songs are similar

More precisely:
 - define a **representation** for the song
 - define a **similarity** function 


Representation and similarity are key ingredients of many data mining tasks, e.g., collaborative filtering and clustering. Beyond the limitations of the example below, you should first design a suitable similarity function, and then find a good representation to implement it.
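
The two ingredients can be kept separate in code. The following is only a sketch under assumptions: the function `most_similar`, its parameters `represent` and `similarity`, and the toy documents are all hypothetical names introduced here, not part of the notebook.

```python
def most_similar(query_id, items, represent, similarity):
    """Return (best_id, best_score) for the item most similar to items[query_id]."""
    reps = [represent(item) for item in items]
    best_id, best_score = None, float("-inf")
    for item_id, rep in enumerate(reps):
        if item_id == query_id:
            continue  # never compare the query with itself
        score = similarity(reps[query_id], rep)
        if score > best_score:
            best_id, best_score = item_id, score
    return best_id, best_score


def as_word_set(text):
    # representation: a document is the set of its whitespace-separated words
    return set(text.split())


def shared_words(a, b):
    # similarity: number of words the two sets have in common
    return len(a & b)


# made-up toy documents
docs = [
    "pink is my new obsession",
    "red is the color of passion",
    "pink is the color of passion",
]

print(most_similar(2, docs, as_word_set, shared_words))
```

Swapping in a different `represent` or `similarity` changes the result without touching the search loop, which is the pattern the rest of the notebook follows by hand.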

## Option 1

 - A song is a **set of words**
 - Similarity is given by the **number of shared words**

#### Compute the set of words of each song

In [23]:
print (raw_dataset[0][0], raw_dataset[0][1])

ABBA Ahe's My Kind Of Girl


In [24]:
print (raw_dataset[0][-1])

Look at her face, it's a wonderful face  
And it means something special to me  
Look at the way that she smiles when she sees me  
How lucky can one fellow be?  
  
She's just my kind of girl, she makes me feel fine  
Who could ever believe that she could be mine?  
She's just my kind of girl, without her I'm blue  
And if she ever leaves me what could I do, what could I do?  
  
And when we go for a walk in the park  
And she holds me and squeezes my hand  
We'll go on walking for hours and talking  
About all the things that we plan  
  
She's just my kind of girl, she makes me feel fine  
Who could ever believe that she could be mine?  
She's just my kind of girl, without her I'm blue  
And if she ever leaves me what could I do, what could I do?




In [25]:
# Python allows sets !
print ( set(raw_dataset[0][-1].split()) )

{'at', 'the', 'leaves', 'way', 'if', 'of', 'special', 'Look', 'all', 'she', 'feel', 'lucky', 'I', 'when', 'it', 'And', 'blue', 'squeezes', "We'll", 'talking', 'me', "She's", 'mine?', 'fine', 'without', "I'm", 'we', 'fellow', 'smiles', 'Who', 'walking', 'that', 'About', 'one', 'wonderful', 'girl,', 'kind', 'holds', 'sees', 'and', 'things', 'to', "it's", 'park', 'hours', 'ever', 'go', 'be', 'be?', 'means', 'could', 'for', 'walk', 'How', 'face', 'something', 'hand', 'on', 'makes', 'what', 'can', 'a', 'just', 'my', 'believe', 'in', 'do,', 'face,', 'plan', 'her', 'do?'}


In [26]:
def get_words (songs):
    songs_words = []
    for s in songs:
        lyrics = s[-1]         # this is a string
        words = lyrics.split() # this is a list
        words = set(words)     # create a set
        songs_words += [words] # append words to the list
    return songs_words


lyrics_word_split = get_words(raw_dataset)

print ( lyrics_word_split[0] )

{'at', 'the', 'leaves', 'way', 'if', 'of', 'special', 'Look', 'all', 'she', 'feel', 'lucky', 'I', 'when', 'it', 'And', 'blue', 'squeezes', "We'll", 'talking', 'me', "She's", 'mine?', 'fine', 'without', "I'm", 'we', 'fellow', 'smiles', 'Who', 'walking', 'that', 'About', 'one', 'wonderful', 'girl,', 'kind', 'holds', 'sees', 'and', 'things', 'to', "it's", 'park', 'hours', 'ever', 'go', 'be', 'be?', 'means', 'could', 'for', 'walk', 'How', 'face', 'something', 'hand', 'on', 'makes', 'what', 'can', 'a', 'just', 'my', 'believe', 'in', 'do,', 'face,', 'plan', 'her', 'do?'}


In [27]:
# As above, but in one line
def get_words (songs):
    return [ set(s[-1].split()) for s in songs ]

lyrics_word_split = get_words(raw_dataset)

print ( lyrics_word_split[0] )

{'at', 'the', 'leaves', 'way', 'if', 'of', 'special', 'Look', 'all', 'she', 'feel', 'lucky', 'I', 'when', 'it', 'And', 'blue', 'squeezes', "We'll", 'talking', 'me', "She's", 'mine?', 'fine', 'without', "I'm", 'we', 'fellow', 'smiles', 'Who', 'walking', 'that', 'About', 'one', 'wonderful', 'girl,', 'kind', 'holds', 'sees', 'and', 'things', 'to', "it's", 'park', 'hours', 'ever', 'go', 'be', 'be?', 'means', 'could', 'for', 'walk', 'How', 'face', 'something', 'hand', 'on', 'makes', 'what', 'can', 'a', 'just', 'my', 'believe', 'in', 'do,', 'face,', 'plan', 'her', 'do?'}


In [28]:
print (lyrics_word_split[query_id])

{'at', 'is', 'was', "'cause", 'sight,', 'out', 'as', 'I', "don't", 'red', 'deco', 'matter', 'lips', 'to', 'lay', 'ever', 'be', 'yeah', 'I,', 'today', 'rubber', 'like', 'not', 'No', 'goes', 'discover', 'crayon', 'cherry', 'with', 'color', 'when', 'it', 'Yeah', 'sight', 'tonight', 'Then', 'bing', 'favorite', 'fashion', 'tell', 'umbrella', 'Pink,', 'quite', 'your', 'even', 'kink', 'do', 'high', 'pink,', 'want', 'the', 'Ffff,', 'everything', 'of', 'love', 'flamingo', 'And', 'going', 'me', 'light', 'we', 'wanna', 'think', 'you', 'first', "It's", 'could', 'on', 'kite', 'a', 'in', 'wrap', 'but', 'very', 'lingo', 'her', 'so', 'are', 'kinda', 'new', 'obsession,', 'alright', 'passion', 'sheets', 'question', 'that', 'gets', 'pink', "it's", 'Yeah,', 'lover', 'turn', 'You', "'Cause", 'what', 'my', 'just'}


#### Find the most similar

In [29]:
# Test similarity
print ( lyrics_word_split[0] & lyrics_word_split[query_id])

{'at', 'the', 'of', 'I', 'when', 'it', 'And', 'me', 'we', 'that', 'to', "it's", 'ever', 'be', 'could', 'on', 'what', 'my', 'a', 'just', 'in', 'her'}


In [30]:
print ( len(lyrics_word_split[0] & lyrics_word_split[query_id]) )

22


In [31]:
def most_similar_by_words(s, songs, skip_list):
    most_similar = None
    largest_similarity = 0.0
    
    for s_id, s_text in enumerate(songs):
        
        if s_id == s: continue  # skip the query song itself (use the argument, not the global)
        if s_id in skip_list: continue
        
        # compute number of common words
        sim = len(s_text & songs[s])      

        if sim>=largest_similarity:
            most_similar = s_id
            largest_similarity = sim
    
    return most_similar, largest_similarity

sim_id, sim_value = most_similar_by_words(query_id, lyrics_word_split, skip)

print ("Most similar song is:", sim_id)
print ("Similarity is:", sim_value)
print ("Artist:", raw_dataset[sim_id][0])
print ("Title:", raw_dataset[sim_id][1])
print ("Lyrics:", raw_dataset[sim_id][-1])

Most similar song is: 151
Similarity is: 45
Artist: Aerosmith
Title: Fever
Lyrics: I got a rip in my shoes  
And a hole in my brand new shoes  
I got a Margarita nose  
And a breath full of Mad Dog Booze  
I got the fever, fever, fever, fever  
Yeah, they threw me outta jail  
I tell ya it ain't fair  
I tried to kiss the judge  
From the electrica' chair  
Yeah we're all here  
'Cause we're not all there tonight  
The guitar's cranked  
And the bass man's blown a fuse  
And when the whole gang bangs  
Then what is your excuse?  
I got the fever, fever, fever, fever  
Fever gives you lust with an appetite  
It hits you like the fangs  
From a rattlesnake bite  
Yeah we're all here  
'Cause we're not all there tonight  
We can't run away from trouble  
There ain't no place that far  
But if we do it right at the speed of light  
There's the backseat of my car - caviar  
I was feelin' so high I forgot what day  
Now I'm feeling low down  
Even slow feels way to fast  
And now the booze d

Can we do it in one line of code?

In [32]:
# a small check
print ( max([1,2,3]) )
print ( max([(2,1),(2,2),(1,3)]) )

3
(2, 2)


In [33]:
## Exercise

def most_similar_by_words(s, songs, skiplist):
    most_similar = max( [ (len(s_text & songs[s]), s_id) 
                             for s_id, s_text in enumerate(songs) 
                             if s_id not in skiplist ]   )
    return most_similar[1], most_similar[0]

    
sim_id, sim_value = most_similar_by_words(query_id, lyrics_word_split, 
                                          set(skip+[query_id]))

print ("Most similar song is:", sim_id)
print ("Similarity is:", sim_value)
print ("Artist:", raw_dataset[sim_id][0])
print ("Title:", raw_dataset[sim_id][1])
print ("Lyrics:", raw_dataset[sim_id][-1])

Most similar song is: 151
Similarity is: 45
Artist: Aerosmith
Title: Fever
Lyrics: I got a rip in my shoes  
And a hole in my brand new shoes  
I got a Margarita nose  
And a breath full of Mad Dog Booze  
I got the fever, fever, fever, fever  
Yeah, they threw me outta jail  
I tell ya it ain't fair  
I tried to kiss the judge  
From the electrica' chair  
Yeah we're all here  
'Cause we're not all there tonight  
The guitar's cranked  
And the bass man's blown a fuse  
And when the whole gang bangs  
Then what is your excuse?  
I got the fever, fever, fever, fever  
Fever gives you lust with an appetite  
It hits you like the fangs  
From a rattlesnake bite  
Yeah we're all here  
'Cause we're not all there tonight  
We can't run away from trouble  
There ain't no place that far  
But if we do it right at the speed of light  
There's the backseat of my car - caviar  
I was feelin' so high I forgot what day  
Now I'm feeling low down  
Even slow feels way to fast  
And now the booze d

Does this song win simply because it contains so many words?

Some exercises:
 - longest song
 - shortest song
 - song with most unique terms
 - song with least unique terms
 - artist with most songs
 - artist with most unique terms
 - how many unique terms in all songs
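
A possible starting point for a few of these exercises, sketched on a toy list shaped like `raw_dataset` (artist, title, link, lyrics). The rows and the names `toy_rows` and `word_sets` are made up for illustration; on the real data you would use `raw_dataset` instead.

```python
from collections import Counter

# made-up rows with the same layout as raw_dataset: [artist, title, link, lyrics]
toy_rows = [
    ["ABBA", "Song A", "/a", "la la la love"],
    ["ABBA", "Song B", "/b", "love me do"],
    ["Cake", "Song C", "/c", "pink ribbon"],
]

word_sets = [set(row[-1].split()) for row in toy_rows]

# longest song: the one with the most words (repetitions included)
longest_id = max(range(len(toy_rows)), key=lambda i: len(toy_rows[i][-1].split()))

# song with most unique terms: the one with the largest word set
most_unique_id = max(range(len(toy_rows)), key=lambda i: len(word_sets[i]))

# artist with most songs
top_artist, num_songs = Counter(row[0] for row in toy_rows).most_common(1)[0]

# number of unique terms in all songs
vocabulary = set().union(*word_sets)

print(longest_id, most_unique_id, top_artist, num_songs, len(vocabulary))
```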

In [35]:
# NOTE: jaccard() is defined in the "Jaccard similarity" section below; run that cell first
def most_similar_jaccard(s, songs, skiplist):
    most_similar = max( [ (jaccard(s_text, songs[s]), s_id) 
                             for s_id, s_text in enumerate(songs) 
                             if s_id not in skiplist ]   )
    return most_similar[1], most_similar[0]

sim_id, sim_value = most_similar_jaccard(query_id, lyrics_word_split, 
                                         set(skip+[query_id]))

print ("Most similar song is:", sim_id)
print ("Similarity is:", sim_value)
print ("Artist:", raw_dataset[sim_id][0])
print ("Title:", raw_dataset[sim_id][1])
print ("Lyrics:", raw_dataset[sim_id][-1])

Most similar song is: 2390
Similarity is: 0.21379310344827587
Artist: Chaka Khan
Title: Jigsaw
Lyrics: Jigsaw - Puzzle  
Jigsaw - Puzzle  
  
Your love is like a maze  
I can't get through to you  
You keep me in a daze  
What's a girl to do with you?  
  
Loving you is like a puzzle  
Just when I think that this is it  
You shake me up  
The pieces just don't fit  
It's like a ...  
Jigsaw - Puzzle  
  
I lay my heart out on the table  
I let you know my every move  
But darlin' you are so unstable  
Your needle never lifts the groove  
  
Now order may not be your nature  
But I think that we could work out fine  
If you would only trace the dotted line  
  
Jigsaw - Puzzle  
Jigsaw - Puzzle




# Jaccard similarity


$$
J(A,B) = \frac{|A\cap B|}{|A \cup B|}
$$

In [6]:
def jaccard(a,b):
    return len(a & b) / len( a | b)

print ( jaccard( set([1,2,3]), set([2,3,4])))

0.5
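
Why normalize by the union? A small check with made-up word sets (all names below are illustrative): a very long song shares more words with the query in absolute terms, yet its Jaccard similarity is much lower because most of its words are unrelated.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)


query = {"pink", "love", "light", "kite"}

# a short song sharing 3 of the 4 query words
short_song = {"pink", "love", "light"}

# a long song containing all 4 query words plus 96 unrelated ones
long_song = query | {f"word{i}" for i in range(96)}

print(len(query & short_song), jaccard(query, short_song))  # 3 0.75
print(len(query & long_song), jaccard(query, long_song))    # 4 0.04
```

With the raw count of shared words the long song wins (4 vs 3); with Jaccard the short song wins (0.75 vs 0.04). Note that `jaccard` as written raises `ZeroDivisionError` when both sets are empty.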


In [36]:
print (raw_dataset[query_id][0])
print (raw_dataset[query_id][1])
print (raw_dataset[query_id][-1])

Aerosmith
Pink
Pink, it's my new obsession, yeah  
Pink, it's not even a question  
Pink, on the lips of your lover  
'Cause pink is the love you discover  
Pink, as the bing on your cherry  
Pink, 'cause you are so very  
Pink, it's the color of passion  
  
'Cause today it just goes with the fashion  
Pink, it was love at first sight  
Yeah pink, when I turn out the light  
And pink gets me high as a kite  
And I think everything is going to be alright  
No matter what we do tonight  
  
You could be my flamingo  
'Cause pink, it's a new kinda lingo  
Pink, like a deco umbrella  
Ffff, it's kink that you don't ever tell her  
Yeah, pink, it was love at first sight  
Then pink when I turn out the light  
Yeah, pink gets me high as a kite  
And I think everything is going to be alright  
No matter what we do tonight  
  
Yeah,  
I, want to be your lover  
Ffff, I I wanna wrap you in rubber  
And it's pink as the sheets that we lay on  
'Cause pink, It's my favorite crayon  
  
Yeah  
P

In [37]:
lyrics_word_split[query_id]

{"'Cause",
 "'cause",
 'And',
 'Ffff,',
 'I',
 'I,',
 "It's",
 'No',
 'Pink,',
 'Then',
 'Yeah',
 'Yeah,',
 'You',
 'a',
 'alright',
 'are',
 'as',
 'at',
 'be',
 'bing',
 'but',
 'cherry',
 'color',
 'could',
 'crayon',
 'deco',
 'discover',
 'do',
 "don't",
 'even',
 'ever',
 'everything',
 'fashion',
 'favorite',
 'first',
 'flamingo',
 'gets',
 'goes',
 'going',
 'her',
 'high',
 'in',
 'is',
 'it',
 "it's",
 'just',
 'kinda',
 'kink',
 'kite',
 'lay',
 'light',
 'like',
 'lingo',
 'lips',
 'love',
 'lover',
 'matter',
 'me',
 'my',
 'new',
 'not',
 'obsession,',
 'of',
 'on',
 'out',
 'passion',
 'pink',
 'pink,',
 'question',
 'quite',
 'red',
 'rubber',
 'sheets',
 'sight',
 'sight,',
 'so',
 'tell',
 'that',
 'the',
 'think',
 'to',
 'today',
 'tonight',
 'turn',
 'umbrella',
 'very',
 'wanna',
 'want',
 'was',
 'we',
 'what',
 'when',
 'with',
 'wrap',
 'yeah',
 'you',
 'your'}

# Attempt 1: Tokenization

Text Processing library: https://textblob.readthedocs.io/en/dev/
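
Splitting on whitespace leaves punctuation and case attached to the words, so `'Pink,'`, `'pink,'` and `'pink'` count as different terms (see the set printed above). A minimal sketch of what a tokenizer fixes, using a hypothetical regex-based helper `simple_tokens`. This is not TextBlob's algorithm, which for instance also splits contractions such as `it's`.

```python
import re


def simple_tokens(text):
    # lowercase, then keep runs of letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())


line = "Pink, it's my new obsession, yeah"

print(sorted(set(line.split())))
print(sorted(set(simple_tokens(line))))
```

The first list still contains `'Pink,'` and `'obsession,'`; the second contains `'pink'` and `'obsession'`.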

In [2]:
from textblob import TextBlob
from nltk import word_tokenize,sent_tokenize
# if the above command is not working
# execute below

In [3]:
!python --version

Python 3.7.3


In [5]:
TextBlob(raw_dataset[query_id][-1])

TextBlob(raw_dataset[query_id][-1]).lower()

TextBlob(raw_dataset[query_id][-1]).lower().words


In [45]:
def get_tokens (songs):
    return [ set(TextBlob(song[-1]).lower().words) for song in songs ]

lyrics_tokens = get_tokens(raw_dataset)

In [46]:
print ( sorted(lyrics_word_split[query_id] ) )
print ()
print ( sorted(lyrics_tokens[query_id] ) )

["'Cause", "'cause", 'And', 'Ffff,', 'I', 'I,', "It's", 'No', 'Pink,', 'Then', 'Yeah', 'Yeah,', 'You', 'a', 'alright', 'are', 'as', 'at', 'be', 'bing', 'but', 'cherry', 'color', 'could', 'crayon', 'deco', 'discover', 'do', "don't", 'even', 'ever', 'everything', 'fashion', 'favorite', 'first', 'flamingo', 'gets', 'goes', 'going', 'her', 'high', 'in', 'is', 'it', "it's", 'just', 'kinda', 'kink', 'kite', 'lay', 'light', 'like', 'lingo', 'lips', 'love', 'lover', 'matter', 'me', 'my', 'new', 'not', 'obsession,', 'of', 'on', 'out', 'passion', 'pink', 'pink,', 'question', 'quite', 'red', 'rubber', 'sheets', 'sight', 'sight,', 'so', 'tell', 'that', 'the', 'think', 'to', 'today', 'tonight', 'turn', 'umbrella', 'very', 'wanna', 'want', 'was', 'we', 'what', 'when', 'with', 'wrap', 'yeah', 'you', 'your']

["'cause", "'s", 'a', 'alright', 'and', 'are', 'as', 'at', 'be', 'bing', 'but', 'cherry', 'color', 'could', 'crayon', 'deco', 'discover', 'do', 'even', 'ever', 'everything', 'fashion', 'favorite'

In [47]:
sim_id, sim_value = most_similar_jaccard(query_id, lyrics_tokens, 
                                         set(skip+[query_id]))

print ("Most similar song is:", sim_id)
print ("Similarity is:", sim_value)
print ("Artist:", raw_dataset[sim_id][0])
print ("Title:", raw_dataset[sim_id][1])
print ("Lyrics:", raw_dataset[sim_id][-1])

Most similar song is: 156
Similarity is: 0.25308641975308643
Artist: Aerosmith
Title: Lay It Down
Lyrics: Ruby red... her lips were on fire  
A do me with a kiss, if you please  
Tell me what your sweet heart desires  
Tell me how you want it to be  
  
'Cause if it's love you want  
Then you won't mind a little tenderness  
That sometimes is so hard to find  
  
(Lay it down)  
Lay it down  
Make it alright  
(Lay it down)  
Lay it down  
I'll hold you so tight  
(Lay it down)  
Oh...before the morning light  
It's gonna be alright  
  
Oh...lay it down  
Come and lay it down tonight  
  
Tell me how you feel when we make love  
Tell me is it real or just make believe  
You will never know what'chor made of  
'Til you open up your heart to receive  
'Cause if the love you got that same old crime  
We're talkin' tenderness that's so hard to find  
And I'm gettin' behind you  
  
(Lay it down)  
Lay it down  
Make it alright  
(Lay it down)  
A lay it down  
I'll hold you so tight  
(La

# Attempt 2: Stemming, Lemmatization

Stemming strips prefixes and suffixes with simple rules: `being` -> `be`, but `was` is not mapped to `be` (the stemmer used below turns it into `wa`).

Lemmatization maps a word to its dictionary form (lemma): `being` -> `be`, `was` -> `be`.


Both tasks are difficult, so expect some errors in the output.
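
To make the difference concrete, here is a deliberately naive sketch. The suffix list and the lemma table are toy assumptions, far simpler than a real stemmer or lemmatizer; `toy_stem` and `toy_lemma` are hypothetical helpers, not TextBlob functions.

```python
# toy rule set: strip the first matching suffix, keeping at least 2 characters
SUFFIXES = ("ing", "es", "s")


def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


# toy dictionary: a lemmatizer needs vocabulary knowledge, not just rules
TOY_LEMMAS = {"was": "be", "is": "be", "are": "be", "being": "be", "went": "go"}


def toy_lemma(word):
    return TOY_LEMMAS.get(word, word)


for word in ["being", "was", "goes", "lips"]:
    print(word, "stem:", toy_stem(word), "lemma:", toy_lemma(word))
```

The rule-based stem of `was` is the meaningless `wa`, while the dictionary lookup returns `be`. Conversely, the toy table knows nothing about `goes`, which the suffix rule handles.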

In [48]:
print ( sorted(set(TextBlob(raw_dataset[query_id][-1]).lower().words.stem() ) ) )

["'caus", "'s", 'a', 'alright', 'and', 'are', 'as', 'at', 'be', 'bing', 'but', 'cherri', 'color', 'could', 'crayon', 'deco', 'discov', 'do', 'even', 'ever', 'everyth', 'fashion', 'favorit', 'ffff', 'first', 'flamingo', 'get', 'go', 'goe', 'her', 'high', 'i', 'in', 'is', 'it', 'just', 'kinda', 'kink', 'kite', 'lay', 'light', 'like', 'lingo', 'lip', 'love', 'lover', 'matter', 'me', 'my', "n't", 'na', 'new', 'no', 'not', 'obsess', 'of', 'on', 'out', 'passion', 'pink', 'question', 'quit', 'red', 'rubber', 'sheet', 'sight', 'so', 'tell', 'that', 'the', 'then', 'think', 'to', 'today', 'tonight', 'turn', 'umbrella', 'veri', 'wa', 'wan', 'want', 'we', 'what', 'when', 'with', 'wrap', 'yeah', 'you', 'your']


In [None]:
def get_stems (songs):
    return [ set(TextBlob(song[-1]).lower().words.stem()) for song in songs ]

lyrics_stems = get_stems(raw_dataset)

In [None]:
print ( sorted(lyrics_stems[query_id] ) )

In [None]:
sim_id, sim_value = most_similar_jaccard(query_id, lyrics_stems, set(skip+[query_id]))

print ("Most similar song is:", sim_id)
print ("Similarity is:", sim_value)
print ("Artist:", raw_dataset[sim_id][0])
print ("Title:", raw_dataset[sim_id][1])
print ("Lyrics:", raw_dataset[sim_id][-1])

In [None]:
print ( lyrics_stems[sim_id] )

In [None]:
print ( sorted( lyrics_stems[sim_id] & lyrics_stems[query_id]) )

Exercise:
 - find the pair of songs whose similarity was most affected by stemming

# Attempt 3: Term Frequency and Inverse Document Frequency


We want to take into account the number of occurrences of each term. Therefore we need to change our representation:
 - A song is a vector of term occurrences
 - Similarity is ...
 
 
To do so, we use a large matrix of size #songs x #terms.
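
A toy version of such a matrix, assuming two made-up token lists. The names `docs`, `term_id` and `count_vector` are illustrative only; plain lists are used here to keep the sketch dependency-free.

```python
from collections import Counter

# two made-up "songs", already tokenized
docs = [
    ["pink", "pink", "love"],
    ["love", "light"],
]

# the lexicon fixes one column per distinct term
lexicon = sorted({term for doc in docs for term in doc})
term_id = {term: i for i, term in enumerate(lexicon)}


def count_vector(tokens):
    # one entry per lexicon term, holding its number of occurrences
    vec = [0] * len(lexicon)
    for term, count in Counter(tokens).items():
        vec[term_id[term]] = count
    return vec


matrix = [count_vector(doc) for doc in docs]

print(lexicon)  # ['light', 'love', 'pink']
print(matrix)   # [[0, 1, 2], [1, 1, 0]]
```

Each row is a song and each column a term, so the first row records that `pink` occurs twice and `love` once.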

### Build a vector space

In [1]:
## Exercise: find the lexicon


lexicon = set()
for song in lyrics_stems:
    lexicon |= song

print (len(lexicon))


In [None]:
lexicon_map     = {term: term_id for term_id, term in enumerate(lexicon)}
lexicon_rev_map = {term_id: term for term, term_id in lexicon_map.items()}

In [None]:
lexicon_map['pink']

In [None]:
lexicon_rev_map[6686]

#### Use numpy for matrix-based computations

http://www.numpy.org/

In [None]:
import numpy as np


lyrics_vector_space = np.zeros((len(lyrics_stems), len(lexicon)))


In [None]:
lyrics_vector_space.shape

In [None]:
lyrics_vector_space.dtype

In [None]:
def get_vector_space (songs, lex_map):
    m = np.zeros((len(songs), len(lex_map)))

    for song_id, (s,t,l,song_text) in enumerate(songs):
        for stem in TextBlob(song_text).lower().words.stem():
            if stem in lex_map:
                term_id = lex_map[stem]
                m[song_id,term_id] += 1.0
    
    return m

lyrics_vector_space = get_vector_space(raw_dataset, lexicon_map)

In [None]:
lyrics_vector_space[query_id]

In [None]:
sum(lyrics_vector_space[query_id])

In [None]:
lyrics_vector_space[query_id]!=0

In [None]:
sum(lyrics_vector_space[query_id]!=0)

In [None]:
len(lyrics_stems[query_id])

Good, numbers look consistent!

#### What kind of similarity can we use?

#### Let's try the Euclidean distance ...

In [None]:
# let's try with euclidean

a = np.array([1,2,3])
b = np.array([1,5,7])

print (a-b)

In [None]:
print ( (a-b)**2 )

In [None]:
print ( np.sum((a-b)**2) )

In [None]:
print ( np.sqrt(np.sum((a-b)**2)) )

In [None]:
def euclidean(a,b):
    return np.sqrt( np.sum((a-b)**2.0) )

In [None]:
def most_similar_euclidean(s, songs, skiplist):
    num_songs, num_terms = songs.shape
    most_similar = min( [ (euclidean(songs[s], songs[s_id]), s_id) 
                             for s_id in range(num_songs) 
                             if s_id not in skiplist ]   )
    return most_similar[1], most_similar[0]

sim_id, sim_value = most_similar_euclidean(query_id, lyrics_vector_space, 
                                           set(skip+[query_id]))

print ("Most similar song is:", sim_id)
print ("Distance is:", sim_value)
print ("Artist:", raw_dataset[sim_id][0])
print ("Title:", raw_dataset[sim_id][1])
print ("Lyrics:", raw_dataset[sim_id][-1])

In [None]:
print ( sorted(lyrics_stems[sim_id] & lyrics_stems[query_id]) )

#### Space for proposals

#### What we need is Cosine Similarity


$$
\cos(A,B)= \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\sqrt{\sum_i B_i^2}}
$$
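
A toy check of why cosine suits term-count vectors better than Euclidean distance. The vectors are made up, and `toy_euclidean` and `toy_cosine` are plain-Python stand-ins written for this sketch, not the notebook's NumPy functions.

```python
import math


def toy_euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


song = [2, 1, 0]           # term counts of a song
song_repeated = [4, 2, 0]  # the same lyrics sung twice
other = [1, 1, 1]          # a different song of similar length

print(round(toy_euclidean(song, song_repeated), 3), round(toy_euclidean(song, other), 3))
print(round(toy_cosine(song, song_repeated), 3), round(toy_cosine(song, other), 3))
```

Euclidean distance ranks `other` as closer to `song` than its own repetition, because it is sensitive to vector length. Cosine only looks at the direction of the vectors, so the repeated song scores (up to rounding) 1.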

In [None]:
def cosine(a,b):
    return np.dot(a,b)/( np.sqrt( np.sum(a**2.0) ) * np.sqrt( np.sum(b**2.0) ) )

In [None]:
cosine( np.array([1,0]), np.array([0,1]) )

In [None]:
cosine( np.array([1,0]), np.array([2,0]) )

In [None]:
def most_similar_cosine(s, songs, skiplist):
    num_songs, num_terms = songs.shape
    most_similar = max( [ (cosine(songs[s], songs[s_id]), s_id) 
                             for s_id in range(num_songs) 
                             if s_id not in skiplist ]   )
    return most_similar[1], most_similar[0]

sim_id, sim_value = most_similar_cosine(query_id, lyrics_vector_space, set(skip+[query_id]))

print ("Most similar song is:", sim_id)
print ("Similarity is:", sim_value)
print ("Artist:", raw_dataset[sim_id][0])
print ("Title:", raw_dataset[sim_id][1])
print ("Lyrics:", raw_dataset[sim_id][-1])

#### Let's investigate the words that contribute most

In [None]:
# normalized prod instead of dot product

def norm_prod(a,b):
    return a*b/( np.sqrt( np.sum(a**2.0) ) * np.sqrt( np.sum(b**2.0) ) )

p = norm_prod(lyrics_vector_space[sim_id], lyrics_vector_space[query_id])
print (len(p), p)

In [None]:
idx = np.argsort(p)
print (idx)

In [None]:
prova = np.array([1,3,5,4,2])

In [None]:
np.argsort(prova)

In [None]:
prova[ np.argsort(prova) ]

In [None]:
prova[ np.argsort(prova)[::-1] ]

In [None]:
p[idx]

In [None]:
p[idx[::-1]]

In [None]:
for term_id in idx[::-1]:  # idx is already sorted: visit terms from largest to smallest contribution
    if p[term_id]>0:
        print (lexicon_rev_map[term_id])

### Inverse document frequency

Inverse document frequency approximates the specificity of a term in a given document collection.
IDF is defined as follows:

$$
idf(t) = \ln \frac{N_{docs}}{df(t)}
$$

where $N_{docs}$ is the number of documents in the collection and ${df(t)}$ is the number of documents containing the term $t$.

IDF is used to discount frequent terms.

The weight of a term for a document is thus defined as $tf(t)\cdot idf(t)$.
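
A worked example of the formula on a made-up collection of four tiny documents. The names `docs`, `idf` and `tf` are illustrative assumptions.

```python
import math

# made-up collection: each document is the set of terms it contains
docs = [
    {"the", "pink", "love"},
    {"the", "love"},
    {"the", "kite"},
    {"the"},
]
n_docs = len(docs)


def idf(term):
    # df(t): number of documents containing the term
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / df)


for term in ["the", "love", "pink"]:
    print(term, round(idf(term), 3))

# tf-idf weights for a document where "the" occurs 5 times and "pink" twice
tf = {"the": 5, "pink": 2}
weights = {term: count * idf(term) for term, count in tf.items()}
print({term: round(w, 3) for term, w in weights.items()})
```

`the` appears in every document, so its IDF is ln(4/4) = 0 and it contributes nothing despite being the most frequent term. `pink` appears in one document out of four, so each occurrence is weighted by ln 4 ≈ 1.386.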

In [None]:
m_sum = np.sum(lyrics_vector_space)

In [None]:
print (m_sum)

In [None]:
lyrics_vector_space.shape

In [None]:
m_sum = np.sum(lyrics_vector_space, axis=1)
print (m_sum)
print (m_sum.shape)

In [None]:
m_sum = np.sum(lyrics_vector_space, axis=0)
print (m_sum)
print (m_sum.shape)

In [None]:
m_sum = np.sum(lyrics_vector_space>0, axis=0)
print (m_sum)
print (m_sum.shape)

In [None]:
# let's do a small test

A = np.array([ [1,2,3], [4,5,6] ])
b = np.array([1,2,3])

print (A)
print (b)
print (A*b)

In [None]:
print( np.sum(A/b>2, axis=0) )

In [None]:
num_docs, _ = lyrics_vector_space.shape

In [None]:
lyrics_tdif = lyrics_vector_space * np.log( num_docs/m_sum)

In [None]:
def most_similar_cosine(s, songs, skiplist):
    num_songs, num_terms = songs.shape
    most_similar = max( [ (cosine(songs[s], songs[s_id]), s_id) 
                             for s_id in range(num_songs) 
                             if s_id not in skiplist ]   )
    return most_similar[1], most_similar[0]

sim_id, sim_value = most_similar_cosine(query_id, lyrics_tdif, set(skip+[query_id]))

print ("Most similar song is:", sim_id)
print ("Similarity is:", sim_value)
print ("Artist:", raw_dataset[sim_id][0])
print ("Title:", raw_dataset[sim_id][1])
print ("Lyrics:", raw_dataset[sim_id][-1])

In [None]:
# normalized prod instead of dot product

def norm_prod(a,b):
    return a*b/( np.sqrt( np.sum(a**2.0) ) * np.sqrt( np.sum(b**2.0) ) )

# use the TF-IDF weighted vectors, consistently with the similarity computed above
p = norm_prod(lyrics_tdif[sim_id], lyrics_tdif[query_id])

idx = np.argsort(p)

for term_id in idx[::-1]:  # idx is already sorted: visit terms from largest to smallest contribution
    if p[term_id]>0:
        print (lexicon_rev_map[term_id])

#### How do you like it?

# Is this enough ?

## Exercises:

 - remove stop-words, too frequent words, unusual/infrequent words
 - Find the most original song:
   - most distant on average?
   - most distant from the 10 closest ones?
 - [advanced] filter using additional functionalities of textblob, e.g., lemmatize rather than stem, use only nouns, or anything else that comes to mind
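
One possible way to approach the first exercise is to filter terms by document frequency. This is a sketch under assumptions: the function name, the thresholds `min_df` and `max_df_ratio`, and the toy documents are all hypothetical choices, and it operates on sets of terms such as `lyrics_stems`.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Drop terms found in fewer than min_df documents or in more than max_df_ratio of them."""
    df = Counter(term for doc in docs for term in doc)  # docs are sets, so this counts documents
    n_docs = len(docs)
    keep = {term for term, count in df.items()
            if count >= min_df and count / n_docs <= max_df_ratio}
    return [doc & keep for doc in docs]


# made-up collection
docs = [
    {"the", "pink", "love"},
    {"the", "love", "kite"},
    {"the", "pink"},
    {"the", "flamingo"},
]

print(filter_by_document_frequency(docs))
```

Here `the` is removed as too frequent and `kite` and `flamingo` as too rare. A document can end up empty after filtering, which needs special handling before computing Jaccard or cosine similarity.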

In [None]:
from nltk.corpus import stopwords
print(sorted(stopwords.words('english')))

## Good to know about TextBlob

### Get the sentiment/polarity of a sentence

TextBlob provides a polarity score, i.e., a number between -1 (negative) and 1 (positive).

In [50]:
from textblob import Sentence

In [51]:
Sentence("I love playing tennis!").polarity

0.625

In [52]:
Sentence("I hate running!").polarity

-1.0

### Translation

In [53]:
chinese_blob = TextBlob(u"美丽优于丑陋")
chinese_blob.translate(from_lang="zh-CN", to='en')

TextBlob("Beauty is better than ugly")

In [54]:
eng_blob = TextBlob("Beauty is better than ugly")
eng_blob.translate(from_lang="en", to='it')

TextBlob("La bellezza è meglio che brutta")

# References

 - **Introduction to Information Retrieval**. Manning, Raghavan, Schütze. Cambridge University Press. 2008.
   - Sections 2.1, 2.2, 6.2, 6.3
   - Download: https://nlp.stanford.edu/IR-book/information-retrieval-book.html
 - **Web Data Mining** 2nd edition. Liu. Springer. 2011.
   - Sections 6.2.1, 6.2.2, 6.5
 - **Mining of Massive Datasets**. Leskovec, Rajaraman, Ullman. Cambridge University Press. 2014.
   - Section 3.1
   - Download: http://www.mmds.org/