# Text Retrieval
Two standard models are used for retrieving text data:
1. Boolean Retrieval Model
2. Vector Space Model

The aim of any information retrieval model is to retrieve documents related to a query.

## 1. Boolean Retrieval Model
In this model, every query and every document is treated as a set of words, and a document is retrieved if and only if it contains the query word. The model can be extended to support complex queries built with Boolean operators.
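To make the set view concrete, here is a minimal sketch in plain Python. The three titles and lyrics are made up for illustration and do not come from the dataset, and `retrieve` is a hypothetical helper name. Single-word retrieval is a membership test, and AND, OR and AND NOT map onto set intersection, union and difference.

```python
# Toy documents, made up for illustration (not from the dataset)
docs = {
    "Song A": "you are a beautiful girl",
    "Song B": "girl at home",
    "Song C": "stay beautiful",
}

# Represent each document as a set of lowercase words
doc_words = {title: set(text.lower().split()) for title, text in docs.items()}

def retrieve(word):
    """Return the set of titles whose word set contains `word`."""
    return {title for title, words in doc_words.items() if word.lower() in words}

beautiful = retrieve("beautiful")
girl = retrieve("girl")

print(sorted(beautiful & girl))  # AND: documents containing both words
print(sorted(beautiful | girl))  # OR: documents containing at least one word
print(sorted(beautiful - girl))  # AND NOT: 'beautiful' but not 'girl'
```

The same three operations reappear later as element-wise operations on columns of a binary term matrix.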

In this assignment we are going to implement both models using the scikit-learn package, on a song lyrics dataset.


**Step 1. Import the necessary packages (numpy and pandas) - 1 Mark**

In [1]:
#import numpy and pandas libraries
import numpy as np
import pandas as pd

**Step 2. Read the dataset and store it in the variable 'df' - 1 Mark** <br>

The 'lyric' column of the dataset holds the song lyrics. The goal is to give some lyrics as a query and retrieve the names of the matching songs.


In [2]:
#Read the given csv dataset into dataframe df
df = pd.read_csv('modified_song_lyrics.csv') 

#List first 5 rows
df.head(5)


| | album | track_title | lyric | year |
|---|---|---|---|---|
| 0 | Taylor Swift | Tim McGraw | He said the way my blue eyes shined Put those ... | 2006 |
| 1 | Taylor Swift | Picture To Burn | State the obvious, I didn't get my perfect fan... | 2006 |
| 2 | Taylor Swift | Teardrops On My Guitar | Drew looks at me I fake a smile so he won't se... | 2006 |
| 3 | Taylor Swift | A Place In This World | I don't know what I want, so don't ask me Caus... | 2006 |
| 4 | Taylor Swift | Cold as You | You have a way of coming easily to me And when... | 2006 |


**Documentation Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html**<br>

**Step 3**<br>
1. Import the 'CountVectorizer' class from sklearn.feature_extraction.text
2. Create a 'vectorizer' object of 'CountVectorizer' with the parameter binary=True

In [3]:
#import CountVectorizer from library 
from sklearn.feature_extraction.text import CountVectorizer

#Create object for vectorization
vectorizer = CountVectorizer(binary=True)

We only need to know whether each word is present in or absent from each song's lyrics, not how often it occurs. <br>
**Step 4. Fit and transform the lyric column using vectorizer**<br>
X is a matrix of size (n_songs, n_unique_words) in which each entry is 1 if the word is present in the song and 0 otherwise. Verify the size using the X.shape attribute.

In [4]:
#Get the vectorization for column lyric from dataframe df
X = vectorizer.fit_transform(df.lyric)

#Print the shape post vectorization
print(X.shape)

(94, 2301)
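The shape alone does not show what the entries look like, so here is a small sketch on a toy corpus. The three lyric strings are made up for illustration, and the `toy_` names are hypothetical. With binary=True a repeated word still produces a 1, and the columns follow the alphabetical order of the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy lyrics, made up for illustration (not from the dataset)
toy_lyrics = ["beautiful girl beautiful", "girl at home", "stay beautiful"]

toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_lyrics)

# Vocabulary words ordered by the column index they were assigned
words = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(words)            # one word per column
print(toy_X.shape)      # (number of documents, number of unique words)
print(toy_X.toarray())  # dense 0/1 view; 'beautiful' occurs twice in row 0 but is stored as 1
```

fit_transform returns a sparse matrix. Calling toarray() is fine for a small example like this, but a dense copy of a large vocabulary matrix can use a lot of memory.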


In [5]:
query1 = 'beautiful'
query2 = 'girl'
# Select the column of X for a word: row i is 1 if document i contains the word
list_q1 = X[:,vectorizer.vocabulary_[query1]]
# Step 5. Do the same for 'query2' and store it in 'list_q2'
list_q2 = X[:,vectorizer.vocabulary_[query2]]
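The lookup vectorizer.vocabulary_[query] raises a KeyError when the query word never occurs in the lyrics, and it is case-sensitive, while CountVectorizer lowercases the text by default. The sketch below shows one possible way to guard against both. It uses a toy corpus made up for illustration, and `documents_containing` is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy lyrics, made up for illustration (not from the dataset)
toy_lyrics = ["beautiful girl beautiful", "girl at home", "stay beautiful"]
toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_lyrics)

def documents_containing(term, vectorizer, X):
    """Return the sorted row indices of the documents that contain `term`."""
    # Lowercase the query to match the vectorizer's default lowercasing
    column = vectorizer.vocabulary_.get(term.lower())
    if column is None:
        # The word is not in the vocabulary, so no document can match
        return []
    rows, _ = X[:, column].nonzero()  # rows holding a nonzero entry in that column
    return sorted(int(r) for r in rows)

print(documents_containing("girl", toy_vectorizer, toy_X))
print(documents_containing("Beautiful", toy_vectorizer, toy_X))
print(documents_containing("guitar", toy_vectorizer, toy_X))
```

Returning an empty list for an unknown word keeps later AND and OR steps simple, since no special case is needed for missing terms.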

In [6]:
# AND operation: print the track title (column 1 of df) of every song containing both words
for i in range(list_q1.shape[0]):
    if list_q1[i]==1 and list_q2[i]==1:
        print(df.iloc[i,1])

Teardrops On My Guitar
Superman
End Game (Ft. Ed Sheeran & Future)


**Step 6. Implement OR operation**

In [7]:
# OR operation: print the track title of every song containing at least one of the two words
for i in range(list_q1.shape[0]):
    if list_q1[i]==1 or list_q2[i]==1:
        print(df.iloc[i,1])

Teardrops On My Guitar
A Place In This World
Stay Beautiful
Mary's Song (Oh My My My)
I'm Only Me When I'm With You
Invisible
Fifteen
Hey Stephen
White Horse
You Belong With Me
The Way I Loved You
Back To December
Speak Now
Dear John
Innocent
Last Kiss
Superman
Holy Ground
Sad Beautiful Tragic
Everything Has Changed (Ft. Ed Sheeran)
Begin Again
Girl at Home
Blank Space
Style
How You Get The Girl
End Game (Ft. Ed Sheeran & Future)
So It Goes...
King of My Heart
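The loops above test one row at a time. As an alternative sketch, the same queries can be written as element-wise operations on boolean masks, which also makes NOT and compound queries easy to express. The example uses a small dense 0/1 matrix with made-up titles and a hand-written vocabulary, all hypothetical. Applying it to the assignment data would assume the relevant columns of X have first been converted to dense arrays.

```python
import numpy as np

# Toy data, made up for illustration (not from the dataset)
titles = np.array(["Song A", "Song B", "Song C", "Song D"])
vocabulary = {"beautiful": 0, "girl": 1, "home": 2}
X_dense = np.array([
    [1, 1, 0],  # Song A: beautiful, girl
    [0, 1, 1],  # Song B: girl, home
    [1, 0, 0],  # Song C: beautiful
    [0, 0, 1],  # Song D: home
])

def mask(term):
    """Boolean vector: True where the document contains `term`."""
    return X_dense[:, vocabulary[term]].astype(bool)

both = titles[mask("beautiful") & mask("girl")]      # AND
either = titles[mask("beautiful") | mask("girl")]    # OR
only_first = titles[mask("beautiful") & ~mask("girl")]  # AND NOT
compound = titles[(mask("beautiful") | mask("girl")) & ~mask("home")]

print(both, either, only_first, compound)
```

The operators &, | and ~ act element-wise on NumPy boolean arrays, so each query is evaluated for all documents at once.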
