# Text Similarity Measures Exercises #

## Introduction ##

We will use [a song lyric dataset from Kaggle](https://www.kaggle.com/mousehead/songlyrics) to identify songs with similar lyrics. The dataset contains the artist, title, and lyrics for more than 55,000 songs, but this exercise focuses on one group in particular: The Beatles.

The following code loads the libraries needed for this exercise.

In [19]:
import nltk
import pandas as pd
import re
from nltk.corpus import stopwords
import numpy as np
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

## Question 1 ##

* Filter the lyrics data set to only select songs by The Beatles.
* How many songs are there in total by The Beatles?
* Take a look at the first song's lyrics.

In [2]:
data = pd.read_csv('../data/songdata.csv')
data.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [3]:
data_beatles = data[data.artist == 'The Beatles']  # Keep only the songs by The Beatles

In [4]:
data_beatles.head(10)

Unnamed: 0,artist,song,link,text
1198,The Beatles,A Shot Of Rhythm And Blues,/b/beatles/a+shot+of+rhythm+blues_20014867.html,"Well, if your hands start a-clappin' \r\nAnd ..."
1199,The Beatles,Across The Universe,/b/beatles/across+the+universe_10026507.html,Words are flowing out like \r\nEndless rain i...
1200,The Beatles,All I've Got To Do,/b/beatles/all+ive+got+to+do_10026646.html,"Whenever I want you around, yeah \r\nAll I go..."
1201,The Beatles,And I Love Her,/b/beatles/and+i+love+her_10026463.html,I give her all my love \r\nThat's all I do \...
1202,The Beatles,And Your Bird Can Sing,/b/beatles/and+your+bird+can+sing_10026364.html,You tell me that you've got everything you wan...
1203,The Beatles,Another Girl,/b/beatles/another+girl_10026200.html,"For I have got another girl, another girl \r\..."
1204,The Beatles,Any Time At All,/b/beatles/any+time+at+all_10025891.html,"Any time at all, any time at all, any time at ..."
1205,The Beatles,Ask Me Why,/b/beatles/ask+me+why_10025893.html,"I love you, 'cause you tell me things I want t..."
1206,The Beatles,"Baby, You're A Rich Man",/b/beatles/baby+youre+a+rich+man_10026560.html,How does it feel to be \r\nOne of the beautif...
1207,The Beatles,Birthday,/b/beatles/birthday_10025908.html,You say it's your birthday \r\nIt's my birthd...


In [5]:
print('The number of songs by The Beatles is', data_beatles.shape[0])

The number of songs by The Beatles is 178


In [6]:
print('The total number of entries is',data.shape[0])

The total number of entries is 57650


In [7]:
print('The number of artists in the data is', data.artist.nunique())

The number of artists in the data is 643


## Question 2 ##

In [8]:
# Keep only word characters; this also drops punctuation and the '\r\n' line breaks
data.loc[:, "text"] = data.text.apply(lambda x: " ".join(re.findall(r'\w+', x)))

In [9]:
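# Requires the NLTK stop word list: nltk.download('stopwords')
stopwords_english = set(stopwords.words('english'))
def remove_stopwords(s):
    # The comparison is case-sensitive, so capitalised words such as "I" or "And" are kept
    return ' '.join(word for word in s.split() if word not in stopwords_english)
data.loc[:, 'text'] = data.text.apply(remove_stopwords)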

In [10]:
data.head(10)

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,Look face wonderful face And means something s...
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,Take easy please Touch gently like summer even...
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I never know I go Why I put lousy rotten show ...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy question give take You l...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy question give take You l...
5,ABBA,Burning My Bridges,/a/abba/burning+my+bridges_20003011.html,Well hoot holler make mad And I always heel Ho...
6,ABBA,Cassandra,/a/abba/cassandra_20002811.html,Down street singing shouting Staying alive tho...
7,ABBA,Chiquitita,/a/abba/chiquitita_20002978.html,Chiquitita tell wrong You enchained sorrow In ...
8,ABBA,Crazy World,/a/abba/crazy+world_20003013.html,I morning sun Couldn sleep I thought I take wa...
9,ABBA,Crying Over You,/a/abba/crying+over+you_20177611.html,I waitin baby I sitting alone I feel cold with...


In [11]:
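# .loc slicing is inclusive, so this takes the first 11 rows of the full data set (not the Beatles subset)
corpus = data.loc[:10, 'text'].to_numpy()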

In [12]:
print(len(corpus))

11


Apply the following preprocessing steps:
* Note the '\r\n' (carriage return and new line) characters in the lyrics. Remove them using regular expressions.
* Remove all words that contain digits using regular expressions.
* Create a document-term matrix using Count Vectorizer, with each row as a song and each column as a word in the lyrics. Have the Count Vectorizer remove all stop words as well.

Note: Count Vectorizer automatically removes punctuation and makes all characters lowercase.
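
The preprocessing steps listed above can be sketched on a couple of toy lyrics. This is a minimal illustration rather than the reference solution: `toy_lyrics`, `clean_lyrics`, and `cv_toy` are hypothetical names, and the two lyric strings are made up for the example.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Made-up lyrics standing in for the real data set (hypothetical examples).
toy_lyrics = [
    "Love me do \r\nYou know I love you 2nite",
    "Help! I need somebody \r\nHelp, not just anybody",
]

def clean_lyrics(text):
    """Replace line breaks with spaces and drop words that contain digits."""
    text = re.sub(r"[\r\n]+", " ", text)   # remove '\r' and '\n' characters
    text = re.sub(r"\w*\d\w*", " ", text)  # remove any word containing a digit
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

cleaned = [clean_lyrics(t) for t in toy_lyrics]

# Document-term matrix: one row per song, one column per remaining word.
cv_toy = CountVectorizer(stop_words="english")
dtm = cv_toy.fit_transform(cleaned)

print(cleaned)
print(sorted(cv_toy.vocabulary_))
```

The same `clean_lyrics` function could be applied to a lyrics column with `Series.apply` before vectorizing.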

In [20]:
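cv = CountVectorizer(stop_words='english')
x = cv.fit_transform(corpus).toarray()  # Dense document-term matrix: one row per song
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
data_cv = pd.DataFrame(x, columns=cv.get_feature_names_out())
data_cv.head(10)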

Unnamed: 0,aching,acted,advice,alive,andante,anymore,away,baby,bad,bags,...,watched,way,ways,weave,went,wonderful,words,world,wrong,yes
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
1,0,0,0,0,20,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,1,0,0,...,0,2,0,0,0,0,0,0,0,3
3,0,0,1,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,1,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
6,1,0,0,1,0,0,0,0,0,1,...,1,0,0,3,0,0,3,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,1,0
8,0,1,0,0,0,0,0,2,0,0,...,0,1,0,0,1,0,0,5,0,0
9,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
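# All unique pairs of song indices in the corpus
pairs = list(combinations(range(len(corpus)), 2))
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
# cosine_similarity returns a 1x1 array for two single vectors; [0, 0] extracts the scalar score
results = [cosine_similarity([x[a_index]], [x[b_index]])[0, 0] for (a_index, b_index) in pairs]
# Most similar pairs first
print(sorted(zip(results, combos), reverse=True))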



## Question 3 ##

* Take a look at the lyrics for the song "Imagine".
* Which song is the most similar to the song "Imagine"?
     * Use cosine similarity to calculate the similarity
     * Use Count Vectorizer to numerically encode the lyrics
* Find the most similar song using the TF-IDF Vectorizer.

Compare the most similar song found with the Count Vectorizer against the one found with the TF-IDF Vectorizer.
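
One possible way to approach this question is sketched below on toy data. The titles, the lyrics, and the helper `most_similar` are hypothetical and only illustrate the mechanics: vectorize all songs, compute the cosine similarity between the target row and every row, then ignore the target itself when picking the best match.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and lyrics, used only to illustrate the mechanics.
titles = ["Target", "Near copy", "Unrelated"]
lyrics = [
    "imagine all the people living in peace",
    "imagine the people living life in peace",
    "yellow submarine sailing under the sea",
]

def most_similar(vectorizer, docs, target_index):
    """Return (index, score) of the document closest to docs[target_index]."""
    matrix = vectorizer.fit_transform(docs)
    # Similarity between the target row and every row, flattened to 1-D.
    scores = cosine_similarity(matrix[target_index], matrix).ravel()
    scores[target_index] = -1.0  # exclude the target itself
    best = int(np.argmax(scores))
    return best, float(scores[best])

count_best, count_score = most_similar(CountVectorizer(stop_words="english"), lyrics, 0)
tfidf_best, tfidf_score = most_similar(TfidfVectorizer(stop_words="english"), lyrics, 0)

print("Count Vectorizer:", titles[count_best], round(count_score, 3))
print("TF-IDF Vectorizer:", titles[tfidf_best], round(tfidf_score, 3))
```

On real lyrics the two vectorizers may disagree, because TF-IDF down-weights words that appear in many songs.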

In [22]:
data_song = data_beatles[data_beatles.song.apply(lambda x:x=='Imagine' )]['text']

In [23]:
print(data_song)

24783    Imagine there's no heaven  \r\nIt's easy if yo...
Name: text, dtype: object


In [24]:
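from sklearn.feature_extraction.text import TfidfVectorizer
cv_tfidf = TfidfVectorizer()
x = cv_tfidf.fit_transform(corpus).toarray()  # Overwrites x: it now holds TF-IDF weights, not counts
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
data_tfidf = pd.DataFrame(x, columns=cv_tfidf.get_feature_names_out())
data_tfidf.head(10)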

Unnamed: 0,about,aching,acted,advice,alive,all,almost,alone,always,and,...,without,wonderful,words,world,would,wrong,yes,yet,you,your
0,0.085119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.165712,...,0.145514,0.085119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014182,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.050399,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.100932,0.064799,...,0.0,0.0,0.0,0.0,0.0,0.0,0.1118,0.0,0.0,0.0
3,0.0,0.0,0.0,0.032648,0.0,0.0,0.0,0.0,0.04633,0.014872,...,0.0,0.0,0.0,0.0,0.0,0.0,0.025659,0.0,0.052853,0.0
4,0.0,0.0,0.0,0.03502,0.0,0.0,0.0,0.0,0.049696,0.015952,...,0.0,0.0,0.0,0.0,0.0,0.0,0.027524,0.0,0.056692,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.081237,0.104309,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.061783,0.0
6,0.0,0.042836,0.0,0.0,0.042836,0.0,0.042836,0.032201,0.0,0.050037,...,0.0,0.0,0.128509,0.0,0.257018,0.0,0.0,0.0,0.09879,0.0
7,0.0,0.0,0.0,0.0,0.0,0.052095,0.0,0.0,0.031594,0.020284,...,0.0,0.0,0.0,0.0,0.0,0.052095,0.0,0.0,0.14417,0.052095
8,0.0,0.0,0.057943,0.0,0.0,0.0,0.0,0.0,0.0,0.135365,...,0.0,0.0,0.0,0.289713,0.0,0.0,0.0,0.0,0.080177,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.121723,0.0,0.0,...,0.138409,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Question 4 ##

Which two Beatles songs are the most similar?
   * Using Count Vectorizer
   * Using TF-IDF Vectorizer
     
Compare the results. Which Vectorizer seems to do a better job?
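
As a rough sketch of how the most similar pair could be found without looping over every combination, the full similarity matrix can be computed in one call and searched above its diagonal. The names `pair_lyrics` and `most_similar_pair` and the four short lyrics are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical lyrics, used only to illustrate the mechanics.
pair_lyrics = [
    "hello goodbye hello goodbye hello",
    "goodbye hello goodbye",
    "sun sun sun comes",
    "little darling long cold lonely winter",
]

def most_similar_pair(matrix):
    """Return (i, j, score) for the two distinct rows with the highest cosine similarity."""
    sim = cosine_similarity(matrix)  # square matrix of all pairwise scores
    # Indices strictly above the diagonal: each unordered pair once, no self-pairs.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    best = int(np.argmax(sim[rows, cols]))
    return int(rows[best]), int(cols[best]), float(sim[rows[best], cols[best]])

count_pair = most_similar_pair(CountVectorizer().fit_transform(pair_lyrics))
tfidf_pair = most_similar_pair(TfidfVectorizer().fit_transform(pair_lyrics))

print("Count Vectorizer:", count_pair)
print("TF-IDF Vectorizer:", tfidf_pair)
```

Working with indices rather than the lyric strings makes it easy to look up the song titles for the winning pair afterwards.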

In [25]:
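# x now holds the TF-IDF matrix from In [24], so these scores are TF-IDF based
pairs = list(combinations(range(len(corpus)), 2))
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
# cosine_similarity returns a 1x1 array for two single vectors; [0, 0] extracts the scalar score
results = [cosine_similarity([x[a_index]], [x[b_index]])[0, 0] for (a_index, b_index) in pairs]
# Most similar pairs first
print(sorted(zip(results, combos), reverse=True))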

