# Text Retrieval
Two standard models for retrieving text data are:
1. Boolean Retrieval Model
2. Vector Space Model

The aim of any information retrieval model is to retrieve documents related to a query.

##  Vector Space Model
In this model, every document and the query are represented as vectors, and the document vector closest to the query vector, as measured by cosine similarity, is returned as the best match.
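As a minimal sketch of the idea, cosine similarity between two vectors is their dot product divided by the product of their lengths. The toy vectors and the helper name `cosine_sim` below are assumptions made for illustration; they are not part of the dataset or the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Convention chosen here: an all-zero vector has similarity 0 to everything
    return float(a @ b / denom) if denom else 0.0

# Toy term-count vectors over a made-up three-word vocabulary
query = [1, 1, 0]
doc_a = [2, 2, 0]   # points in the same direction as the query
doc_b = [0, 0, 3]   # shares no terms with the query

print(cosine_sim(query, doc_a))  # close to 1: same direction
print(cosine_sim(query, doc_b))  # 0: no overlap
```

Note that vector length does not matter, only direction: `doc_a` is twice as long as `query` but still scores as a near-perfect match.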

In [1]:
#import numpy and pandas libraries
import numpy as np
import pandas as pd

In [2]:
#Read the given csv dataset into dataframe df
df = pd.read_csv('modified_song_lyrics.csv') 

#List first 5 rows
df.head(5)


| Unnamed: 0 | album | track_title | lyric | year |
|---|---|---|---|---|
| 0 | Taylor Swift | Tim McGraw | He said the way my blue eyes shined Put those ... | 2006 |
| 1 | Taylor Swift | Picture To Burn | State the obvious, I didn't get my perfect fan... | 2006 |
| 2 | Taylor Swift | Teardrops On My Guitar | Drew looks at me I fake a smile so he won't se... | 2006 |
| 3 | Taylor Swift | A Place In This World | I don't know what I want, so don't ask me Caus... | 2006 |
| 4 | Taylor Swift | Cold as You | You have a way of coming easily to me And when... | 2006 |


**Documentation Reference:**<br>
1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

**Step 1. Import the classes referenced above**

In [3]:
#import required libraries for tfidf & similarities 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

**Step 2. Create a 'vectorizer' object of 'TfidfVectorizer'**

In [4]:
#Create the object from tfidf 
vectorizer = TfidfVectorizer()

Next we compute tf-idf scores for the terms in each song's lyrics. <br>
**Fit and transform the lyric column using vectorizer** <br>
The resulting X is a sparse matrix of shape (n_songs, n_unique_words), where each entry is the tf-idf score of a word in a song. Verify this with the X.shape attribute.

In [5]:
#Transform lyrics data into vectorized notation
X = vectorizer.fit_transform(df.lyric)

#Print the shape post transformation
print(X.shape)

(94, 2301)
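To make the shape of the matrix concrete, here is a small sketch on a made-up three-document corpus. The `toy_docs` list and the `toy_` names are assumptions for illustration, not the lyrics dataset. Each row is a document and each column is one vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus standing in for the lyric column
toy_docs = [
    "take it easy",
    "easy come easy go",
    "never easy never simple",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

# Rows = documents, columns = unique terms learned from the corpus
print(toy_X.shape)

# vocabulary_ maps each learned term to its column index
print(sorted(toy_vectorizer.vocabulary_))
```

With the default settings, text is lowercased and tokens shorter than two characters are dropped, so the number of columns can be smaller than a naive word count suggests.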


**Use the 'transform' method of vectorizer on 'query' and store the result in 'query_vec'**<br>
This method converts text into a tf-idf vector, using the vocabulary learned from the lyrics.

In [6]:
query = "Take it easy, with me"
#Wrap the query in a list, since transform expects an iterable of documents
query_vec = vectorizer.transform([query])
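One detail worth knowing: `transform` reuses the vocabulary learned during fitting, so query words that never appeared in the corpus are silently dropped. The sketch below shows this on a made-up corpus (`toy_docs` and the word "banana" are assumptions for illustration, not the notebook's data).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["take it easy", "easy come easy go"]
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_docs)

# "banana" was never seen during fit, so it contributes nothing
vec_known = toy_vectorizer.transform(["easy"])
vec_mixed = toy_vectorizer.transform(["easy banana"])
vec_unknown = toy_vectorizer.transform(["banana"])

print(vec_known.shape)    # one row, one column per vocabulary term
print(vec_mixed.nnz)      # number of non-zero entries in the mixed query
print(vec_unknown.nnz)    # an all-unknown query becomes an all-zero vector
```

A query made up entirely of unseen words therefore scores 0 against every document, which is worth checking for before trusting the top result.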

**Use 'cosine_similarity' on 'X' and 'query_vec' and store the result in 'results'**

In [7]:
#Compute cosine similarity between every song in the corpus and the query
results = cosine_similarity(X,query_vec)
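`cosine_similarity(X, query_vec)` returns an array of shape (n_songs, 1), one score per document. Because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), the cosine similarity here should equal a plain dot product. The following sketch checks this on a made-up corpus; `toy_docs` and `toy_query` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_docs = ["take it easy", "easy come easy go", "never simple"]
toy_query = "take it easy with me"

toy_vectorizer = TfidfVectorizer()        # default norm='l2'
toy_X = toy_vectorizer.fit_transform(toy_docs)
toy_q = toy_vectorizer.transform([toy_query])

toy_results = cosine_similarity(toy_X, toy_q)
print(toy_results.shape)                  # one score per document

# Rows are unit length, so the dot product gives the same scores
dot_scores = (toy_X @ toy_q.T).toarray()
print(np.allclose(toy_results, dot_scores))
```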

In [8]:
# Index of the song whose lyrics score highest against the query
song_index = np.argmax(results.reshape((-1,)))

#Print query details 
print('Query fired for all documents -- ', query)

#Print the track title of the best match
print('\nSong Name -- ', df.track_title[song_index])

#Print the lyrics as well, to compare the result with the query
print('\nSong Lyrics -- \n', df.lyric[song_index])

Query fired for all documents --  Take it easy, with me

Song Name --  Breathe (Ft. Colbie Caillat)

Song Lyrics -- 
 I see your face in my mind as I drive away 'Cause none of us thought it was Going to end that way People are people And sometimes we change our minds But it's killing me to see you go after all this time Mm mm mm, mm mm mm, mm mm Mm mm mm, mm mm mm, mm mm Music starts playing like the end of a sad movie It's the kind of ending you Don't really want to see 'Cause it's tragedy and it'll only bring you Down Now I don't know what to be without you around And we know it's never simple, never easy Never a clean break, no one here to Save me You're the only thing I know like the back of my hand And I can't Breathe Without you but I have to breathe Without you but I have to Never wanted this, never want to see you hurt Every little bump in the road I tried to swerve But people are people And sometimes it doesn't work out Nothing we say is gonna save us from the fall out It's 2 
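`np.argmax` returns only the single best match. If a ranked list is more useful, one possible extension is to sort the scores. The sketch below uses a made-up score array and a hypothetical helper `top_k_indices`, so the numbers are illustrative only.

```python
import numpy as np

def top_k_indices(scores, k=3):
    """Indices of the k highest scores, best first (hypothetical helper)."""
    flat = np.asarray(scores).ravel()
    # argsort is ascending, so reverse it and keep the first k entries
    return np.argsort(flat)[::-1][:k]

# Made-up similarity scores shaped like cosine_similarity output: (n_docs, 1)
toy_results = np.array([[0.10], [0.45], [0.00], [0.30]])

for rank, idx in enumerate(top_k_indices(toy_results, k=2), start=1):
    print(rank, idx, toy_results[idx, 0])
```

With the notebook's data, something like `df.track_title.iloc[top_k_indices(results)]` would list the corresponding titles, assuming the dataframe keeps its default row order.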