# Music Retrieval - A Boolean Retrieval Approach

In this notebook a solution for the retrieval of songs based on boolean queries is presented.

In [4]:
# Load libraries
## Python version is 3.11.6
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
import string
from functools import total_ordering
import re
import _pickle as pickle # cPickle

[nltk_data] Downloading package wordnet to /home/akasnipe/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/akasnipe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The dataset contains lyrics of songs in the English language, from 1950 to 2019.

In [5]:
data = pd.read_csv('data/spotify_millsongdata.csv', sep=",")
data.head(5)

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


Only `artist`, `song` and `text` will be used for the retrieval.

In [6]:
data = data[['artist', 'song', 'text']]
data.head(5)

Unnamed: 0,artist,song,text
0,ABBA,Ahe's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...


### IR System

The system will be composed of the following classes:
* Posting: The class to implement the Posting objects,
* Posting List: The class to implement the Posting List objects,
* Term: The class to implement the Term objects,
* Inverted Index: The class to implement the Inverted Index objects for Boolean Retrieval,
* Song: The class to implement the Song objects,
* IR System: The "main" class that puts everything together.

In [7]:
# Posting class

@total_ordering
class Posting:
    
    # Initializer, takes a document ID as an argument.
    def __init__(self, docID):
        self._docID = docID
    
    # Retrieve a document's contents from a corpus using the document ID.
    def get_from_corpus(self, corpus):
        return corpus[self._docID]
    
    # Check equality with another Posting, based on document ID.
    def __eq__(self, other):
        return self._docID == other._docID
    
    # Check if this Posting has document ID greater than another Posting.
    def __gt__(self, other):
        return self._docID > other._docID
    
    # Provide the string representation of the Posting.
    def __repr__(self):
        return str(self._docID)
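
To illustrate what `@total_ordering` contributes here, the following minimal sketch uses a hypothetical stand-in class, `DocRef` (not part of the notebook), that defines only `__eq__` and `__gt__`, as `Posting` does. The decorator from `functools` derives the remaining comparisons, which is what allows postings to be sorted and compared by document ID.

```python
from functools import total_ordering

@total_ordering
class DocRef:  # hypothetical stand-in for Posting, for illustration only
    def __init__(self, doc_id):
        self.doc_id = doc_id

    def __eq__(self, other):
        return self.doc_id == other.doc_id

    def __gt__(self, other):
        return self.doc_id > other.doc_id

    def __repr__(self):
        return str(self.doc_id)

refs = [DocRef(3), DocRef(1), DocRef(2)]
sorted_refs = sorted(refs)                 # sorting relies on __lt__, derived by total_ordering
is_less_or_equal = DocRef(1) <= DocRef(2)  # __le__ is derived as well
```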

In [8]:
# Posting List class

class PostingList:

    # Initializer, initializes an empty list of postings.
    def __init__(self):
        self._postings = []
    
    # Create a PostingList instance with a single Posting from a document ID.
    @classmethod
    def from_docID(cls, docID):
        posting_list = cls()
        posting_list._postings = [(Posting(docID))]
        return posting_list
    
    # Create a PostingList instance from an existing list of Postings.
    @classmethod
    def from_posting_list(cls, postingList):
        plist = cls()
        plist._postings = postingList
        return plist

    # Merge another PostingList into this one, avoiding duplicates.
    def merge(self, other):
        i = 0  # Index for the other PostingList.
        last = self._postings[-1]  # The last Posting in the current list.

        while (i < len(other._postings) and last == other._postings[i]):
            i += 1  # Increment the index if a duplicate is found.
        self._postings += other._postings[i:]  # Append the non-duplicate postings from the other list.
    
    # Retrieve the contents of each Posting from a corpus.
    def get_from_corpus(self, corpus):
        return list(map(lambda x: x.get_from_corpus(corpus), self._postings))
    
    # Provide the string representation of the PostingList.
    def __repr__(self):
        return ", ".join(map(str, self._postings))
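
Because documents are processed in increasing docID order, each posting list ends up sorted. The notebook itself combines results with Python sets, but sorted lists also allow the classic linear-time intersection. The sketch below is an illustrative alternative on plain lists of integers; the function name `intersect_sorted` is an assumption, not part of the notebook.

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of document IDs with a two-pointer scan."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result

common = intersect_sorted([1, 4, 7, 9], [2, 4, 9, 12])
```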

In [9]:
# Term class

# Exception class for handling merge operation errors.
class ImpossibleMergeError(Exception):
    pass

# A class that represents a term in a document, along with its posting list.
@total_ordering
class Term:

    # Initializer, takes a term and a document ID as arguments.
    def __init__(self, term, docID):
        self.term = term
        # Initialize posting_list for the term with a PostingList created from the given document ID.
        self.posting_list = PostingList.from_docID(docID)

    # Merge another Term's posting list into this one if they have the same term.
    def merge(self, other):
        if (self.term == other.term):
            self.posting_list.merge(other.posting_list)
        else:
            raise ImpossibleMergeError
    
    # Check equality with another Term.
    def __eq__(self, other):
        return self.term == other.term
    
    # Determine if this Term is greater than another.
    def __gt__(self, other):
        return self.term > other.term
    
    # Provide the string representation of the Term.
    def __repr__(self):
        return self.term + ": " + repr(self.posting_list)

Before defining the Inverted Index class, let's define functions to perform stop word removal, normalization, stemming and lemmatization.

In [10]:
# Stop Word removal, Normalization and Stemming/Lemmatization

def remove_stop_words(text):

    # Build the stop word set once per call, instead of once per token
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # Start from a list containing the tokens in "text"
    text_list = text.split()

    # Filter out stop words
    text_list = [word for word in text_list if word not in stop_words]

    # Join the remaining words into a single string
    result = " ".join(text_list)

    return result

def normalize(text):

    # Make a translation table that maps all punctuation characters to None
    translator = str.maketrans("", "", string.punctuation)

    # Apply the translation table to the input string
    result = text.translate(translator)

    # Converts the text to lowercase.
    result = result.lower()

    return result

def stem(text, type='porter'):
        
    # Start from a list containing the tokens in "text"
    stemmed_text = text.split()

    # Create a stemmer object
    if type == 'porter':
        stemmer = nltk.stem.porter.PorterStemmer()
    elif type == 'snowball':
        stemmer = nltk.stem.snowball.SnowballStemmer("english")
    else:
        raise ValueError('Stemmer type not supported')

    # Loop through each word in the text and retrieve the stem
    for i in range(len(stemmed_text)):
        stemmed_text[i] = stemmer.stem(stemmed_text[i])

    # Join the stemmed words into a single string
    result = " ".join(stemmed_text)

    return result

def lemmatize(text):

    # Start from a list containing the tokens in "text"
    lemmatized_text = text.split()

    # Create a lemmatizer object
    lemmatizer = nltk.stem.WordNetLemmatizer()

    # Loop through each word in the text and retrieve the lemma
    for i in range(len(lemmatized_text)):
        lemmatized_text[i] = lemmatizer.lemmatize(lemmatized_text[i])

    # Join the lemmatized words into a single string
    result = " ".join(lemmatized_text)

    return result
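
One consequence of the order of these steps deserves a quick check: `normalize` strips apostrophes, so a contraction such as "don't" becomes "dont" before stop words are filtered. The sketch below reproduces the normalization step with the standard library only and uses a tiny hand-written stop word set (an assumption made to avoid the NLTK download; it is not the NLTK list) to show the effect. Whether this matters in practice depends on the stop word list in use.

```python
import string

def normalize(text):
    # Same idea as the notebook's normalize: drop punctuation, then lowercase
    translator = str.maketrans("", "", string.punctuation)
    return text.translate(translator).lower()

# Tiny illustrative stop word set (assumption; not the NLTK list)
STOP_WORDS = {"the", "and", "don't"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    return " ".join(word for word in text.split() if word not in stop_words)

normalized = normalize("Don't stop the MUSIC!")
filtered = remove_stop_words(normalized)  # "dont" survives: it no longer matches "don't"
```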

In [11]:
# Inverted Index class

class InvertedIndex:
    
    # Initialize the inverted index with an empty list of terms.
    def __init__(self):
        self._dictionary = []
        
    # Create an inverted index from a corpus of documents
    ## Argument word_reduction_type selects between stemming and lemmatization
    ## Argument stop_words keeps stop words (stop_words=True) or removes them (stop_words=False)
    @classmethod
    def from_corpus(cls, corpus, word_reduction_type = 'stemming_porter', stop_words = True):
        intermediate_dict = {}  # Intermediate dictionary to store the terms and their postings.
        for docID, song in enumerate(corpus):
            # Remove stop words, normalize and stem/lemmatize
            document = song.lyrics
            document = normalize(document)
            if not stop_words:
                document = remove_stop_words(document)
            if word_reduction_type == 'stemming_porter':
                document = stem(document, type = 'porter')
            elif word_reduction_type == 'stemming_snowball':
                document = stem(document, type = 'snowball')
            elif word_reduction_type == 'lemmatization':
                document = lemmatize(document)
            tokens_list = document.split() # Tokenize the document into individual words.
            tokens = set(tokens_list) # Remove duplicates         
            biwords = set([tokens_list[i]+' '+tokens_list[i+1] for i in range(len(tokens_list)-1)]) # Get all biwords in the document, remove duplicates.
            for token in tokens:
                term = Term(token, docID) # Create a new term with the token and the current document ID.
                try: # Try to merge the term with existing one in the intermediate dictionary.
                    intermediate_dict[token].merge(term)
                except KeyError: # If the term is not already in the dictionary, add it.
                    intermediate_dict[token] = term
            for biword in biwords:
                term = Term(biword, docID) # Create a new term with the biword and the current document ID.
                try: # Try to merge the term with existing one in the intermediate dictionary.
                    intermediate_dict[biword].merge(term)
                except KeyError: # If the term is not already in the dictionary, add it.
                    intermediate_dict[biword] = term
        idx = cls() # Create a new InvertedIndex instance.
        idx._dictionary = sorted(intermediate_dict.values(), key=lambda term: term.term) # Sort the terms in the intermediate dictionary and store them in the index's dictionary.
        return idx
    
    # Retrieve the posting list for a given term.
    def __getitem__(self, key):
        for term in self._dictionary:
            if term.term == key: # If the term matches the key, return its posting list.
                return term.posting_list
        raise KeyError("No song matches the given query.") # If the term is not in the dictionary, raise a KeyError.
    
    # Provide a string representation of the inverted index.
    def __repr__(self):
        return "A dictionary with " + str(len(self._dictionary)) + " terms"
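
`__getitem__` scans the sorted term list linearly. Since `from_corpus` already sorts the terms, a binary search is a natural alternative. The sketch below is a hedged illustration on plain tuples rather than `Term` objects; `lookup`, `terms` and `keys` are hypothetical names and the data is made up.

```python
from bisect import bisect_left

# Sorted (term, posting list) pairs, standing in for the index's dictionary
terms = [("american", [0, 2]), ("idiot", [0]), ("work sucks", [1])]
keys = [term for term, _ in terms]  # parallel list of sorted keys

def lookup(key):
    """Return the posting list for key using binary search, or raise KeyError."""
    pos = bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return terms[pos][1]
    raise KeyError(key)

postings = lookup("idiot")
```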

In [12]:
# Song class

# Class to hold the title, author and lyrics of a song
class Song:
    
    # Initializer, initializes the title, author and lyrics attributes.
    def __init__(self, title, author, lyrics):
        self.title = title
        self.author = author
        self.lyrics = lyrics
        
    # Provide the string representation of the Song object.
    def __repr__(self):
        return "Title: " + self.title + ",\nAuthor: " + self.author + "\n\n"
    
# Get song author, title and lyrics from data
def get_songs_data(path):
    data = pd.read_csv(path, sep=",")
    # Remove newline characters from song lyrics
    data['text'] = data['text'].replace('\r\n',' ', regex=True)
    corpus = []
    for index, item in data.iterrows():
        song = Song(title = item['song'],
                    author = item['artist'],
                    lyrics = item['text'])
        # Add the Song object to the corpus.
        corpus.append(song)
    # Return the populated list of Song objects.
    return corpus


In [13]:
# Information Retrieval (IR) system class

class IRsystem:

    # Initialize the IR system with a corpus and the inverted index.   
    def __init__(self, corpus, index):
        self._corpus = corpus
        self._index = index
    
    # Create an IR system instance from a given corpus.
    @classmethod
    def from_corpus(cls, corpus, word_reduction_type = 'stemming_porter', stop_words=True):
        index = InvertedIndex.from_corpus(corpus, word_reduction_type, stop_words)
        return cls(corpus, index)
    
    # Return the documents (songs) in the posting list of a given term or biword
    def get_posting_list(self, posting):
        # Retrieve the posting list from the index.
        posting_list = self._index[posting]
        # Return the list of documents.
        return posting_list.get_from_corpus(self._corpus)

In [14]:
# Function to execute a text query against an IR system.

def query(ir, query, word_reduction_type = 'stemming_porter', stopwords = True, _print = True):
    operators = ["AND", "OR", "NOT"]
    # Split the text query into words/biwords and boolean operators.
    ## The word boundaries (\b) prevent splitting inside words such as "ORANGE".
    words = [word.strip() for word in re.split(r'\b(AND|OR|NOT)\b', query)]
    # A query that starts or ends with a boolean operator yields an empty first or last element.
    if words[0] == "" or words[-1] == "":
        raise KeyError("The first and the last word of the query cannot be a boolean operator.")
    # Normalize, remove stopwords and stem/lemmatize the query words/biwords but not the boolean operators.
    for i in range(len(words)):
        if words[i] not in operators:
            words[i] = normalize(words[i])
            if not stopwords:
                words[i] = remove_stop_words(words[i])
            if word_reduction_type == 'stemming_porter':
                words[i] = stem(words[i], type = 'porter')
            elif word_reduction_type == 'stemming_snowball':
                words[i] = stem(words[i], type = 'snowball')
            elif word_reduction_type == 'lemmatization':
                words[i] = lemmatize(words[i])
    # Retrieve the songs in the posting list of the first word/biword.
    answer = set(ir.get_posting_list(words[0]))
    # Loop through the remaining (operator, word/biword) pairs, from left to right.
    for i in range(1, len(words), 2):
        # Retrieve the songs in the posting list of the next word/biword.
        result = ir.get_posting_list(words[i+1])
        # Case AND: Intersect the current answer with the new songs.
        if words[i] == "AND":
            answer = answer.intersection(result)
        # Case OR: Unite the current answer with the new songs.
        elif words[i] == "OR":
            answer = answer.union(result)
        # Case NOT: Subtract the new songs from the current answer.
        elif words[i] == "NOT":
            answer = answer.difference(result)
    # If no song matches the query, raise an error.
    if len(answer) == 0:
        raise KeyError("No song matches the given query.")
    # Print out each song that matches the query, or return the set of matching songs.
    if _print:
        for song in answer:
            print(song)
    else:
        return answer
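
The evaluation order of the operators is worth making explicit: the query is processed strictly from left to right, with no operator precedence, and `NOT` acts as a binary "and not". A minimal sketch on a toy index of docID sets (the index contents and the name `evaluate_query` are made up for illustration):

```python
import re

# Toy index: term -> set of document IDs (made-up data)
toy_index = {
    "american": {0, 1, 2},
    "idiot": {0, 1},
    "media": {0},
}

def evaluate_query(text, index):
    """Evaluate 'a AND b NOT c' style queries strictly left to right."""
    parts = [p.strip() for p in re.split(r"\b(AND|OR|NOT)\b", text)]
    answer = set(index[parts[0]])
    for i in range(1, len(parts), 2):
        operator, postings = parts[i], index[parts[i + 1]]
        if operator == "AND":
            answer &= postings
        elif operator == "OR":
            answer |= postings
        else:  # NOT removes the documents that contain the term
            answer -= postings
    return answer

hits = evaluate_query("american AND idiot NOT media", toy_index)
```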

Let's test the Boolean Retrieval System with different parametrizations.

In [12]:
corpus = get_songs_data("data/spotify_millsongdata.csv")

In [13]:
# Generate and save on disk the IR System with different parametrizations

for word_reduction_type in ['stemming_porter', 'stemming_snowball', 'lemmatization']:
    for stopwords in [True, False]:
        ir = IRsystem.from_corpus(corpus, word_reduction_type, stopwords)
        filename = 'IRSystem/ir_' + word_reduction_type + '_with_stopwords' + '.pkl'
        if not stopwords:
            filename = 'IRSystem/ir_' + word_reduction_type + '_without_stopwords' + '.pkl'
        with open(filename, 'wb') as output:
            pickle.dump(ir, output, protocol=5)

In the following, there will be a comparison between the time needed to generate the IR system and the time needed to load it from disk. The times may vary from machine to machine, but the order of magnitude should not change much.

Generating the six IR systems and dumping them on disk takes approximately 70 minutes in total, which means that each IR system took approximately 10-12 minutes to be generated and saved.

During experimentation, the generation of a single IR system took approximately 6 minutes (with different times for different parametrizations), and dumping it on disk took almost the same amount of time.

Suggestion: free up some RAM before running the following cells. Restarting the kernel and running all the cells from 1 to 11 should be enough.

Let's now test the IR system in its default parametrization (porter stemming, with stopwords).

In [14]:
# Load the IR system from a file

with open('IRSystem/ir_stemming_porter_with_stopwords.pkl', 'rb') as handle:
    ir = pickle.load(handle)

Loading an IR system from disk takes approximately 1 minute, which is a clear improvement over the time needed to generate it.
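
The build-once, load-many pattern used here can be tried on a small scale. The sketch below pickles a toy dictionary to a temporary file and times the load with `time.perf_counter`; the object and the file location are placeholders, and timings on the real index will differ.

```python
import os
import pickle
import tempfile
import time

toy_index = {"american": [0, 2], "idiot": [0]}  # placeholder for the IR system

path = os.path.join(tempfile.mkdtemp(), "toy_index.pkl")
with open(path, "wb") as output:
    pickle.dump(toy_index, output, protocol=pickle.HIGHEST_PROTOCOL)

start = time.perf_counter()
with open(path, "rb") as handle:
    loaded = pickle.load(handle)
elapsed = time.perf_counter() - start  # seconds spent loading from disk
```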

In [15]:
query(ir, "american AND idiot")

Title: American Idiot,
Author: Green Day


Title: American Idiot (Greenday Cover),
Author: Avril Lavigne


Title: Raleigh Soliloquy Pt. Ii,
Author: Sublime


Title: Irresponsible Hate Anthem,
Author: Marilyn Manson




As you can see, the results include the song American Idiot by Green Day, as expected from the given query.

In [17]:
query(ir, "american AND idiot NOT media")

Title: Raleigh Soliloquy Pt. Ii,
Author: Sublime


Title: Irresponsible Hate Anthem,
Author: Marilyn Manson




If "NOT media" is appended to the query, the IR system returns the same results as before minus American Idiot by Green Day and its Avril Lavigne cover, since the word "media" appears in the lyrics of that song but not in the lyrics of the other two.

In [32]:
query(ir, "american idiot OR work sucks")

Title: American Idiot,
Author: Green Day


Title: All The Small Things (Blink 182 Cover),
Author: Avril Lavigne


Title: American Idiot (Greenday Cover),
Author: Avril Lavigne




Phrase queries also work as expected: the IR system returned American Idiot (the only song containing the phrase "american idiot") and All The Small Things (the only song containing the phrase "work sucks"). Since phrases are matched through the biword entries of the index, only two-word phrases are supported.
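
Phrase matching relies on the biword entries created in `InvertedIndex.from_corpus`. A minimal sketch of that idea on two made-up documents, using plain dictionaries instead of the notebook's classes (`build_toy_index` is a hypothetical name):

```python
def build_toy_index(documents):
    """Map every word and every pair of adjacent words to a list of docIDs."""
    index = {}
    for doc_id, text in enumerate(documents):
        tokens = text.lower().split()
        biwords = [tokens[i] + " " + tokens[i + 1] for i in range(len(tokens) - 1)]
        # Each term is added once per document, in increasing docID order
        for term in set(tokens) | set(biwords):
            index.setdefault(term, []).append(doc_id)
    return index

docs = ["work sucks I know", "I know work is hard"]
index = build_toy_index(docs)
phrase_hits = index["work sucks"]  # only the first document has this phrase
word_hits = index["work"]          # both documents contain the word
```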

Let's now evaluate the IR system, with the different parametrizations, and analyze the results.

### IR System Evaluation

Since the chosen dataset only contains the documents used by the IR system, there is no reference dataset for the evaluation.

For this reason, a simple evaluation procedure has been implemented, which consists of the following steps:
1. Sample some songs from the dataset.
    - For each song, compute the 5 most frequent words/biwords in the lyrics (including and excluding stopwords, in order to test the IR system with different parametrizations).
    - Use these words for the queries. **A document is considered relevant if it has the word among the 5 most frequent ones in its lyrics**. Other techniques could have been used to determine the relevance of a document, but this one was considered simple and effective enough for this tricky situation.
2. Loop N times:
    - Pick 3 words/biwords at random from the previous result (`word1`, `word2`, `word3`).
    - Randomly generate K queries, with K random (ranging from 1 to 5), each one having one of the following 6 structures (again, chosen at random):
        - Simple `AND` query: "word1 AND word2"
        - Simple `OR` query: "word1 OR word2"
        - Simple `NOT` query: "word1 NOT word2"
        - Complex `AND NOT` query: "word1 AND word2 NOT word3"
        - Complex `OR NOT` query: "word1 OR word2 NOT word3"
        - Simple 1 word query: "word1"
    - Associate each query with the relevant documents for that query.
3. Run the IR system on the generated queries, compute the precision and recall for each query, and average them over all the queries.
4. Analyze the results.
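
The precision and recall mentioned above follow the usual set-based definitions. A minimal sketch, assuming songs are identified by hashable keys such as `(title, author)` tuples (the helper name `precision_recall` is hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document identifiers."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {("American Idiot", "Green Day"), ("Bang", "ABBA")}
relevant = {("American Idiot", "Green Day")}
precision, recall = precision_recall(retrieved, relevant)
```

Here half of the retrieved songs are relevant, and the only relevant song is retrieved.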


In [15]:
# Sample N songs from the dataset

sample = data.sample(n=500, replace=False)

# For each lyrics, compute the frequency of words and biwords, maintain only the 5 most frequent words/biwords

def get_top5_frequencies(corpus, stop_words = True):
    frequencies = pd.DataFrame(columns=['songID', 'song', 'author', 'word', 'frequency'])
    # Loop through each song in the corpus and retrieve the frequencies of words and biwords
    for index, item in corpus.iterrows():
        song = item['song']
        author = item['artist']
        lyrics = item['text']
        if not stop_words:
            # Note: normalization is applied only when stop words are removed
            lyrics = normalize(lyrics)
            lyrics = remove_stop_words(lyrics)
        tokens_list = lyrics.split() # Tokenize the lyrics into individual words.
        if not stop_words:
            tokens_list = [token for token in tokens_list if len(token) > 3] # Exclude tokens that have less than 4 characters. (e.g. 'she', 'im', 'ive', etc.)
        tokens = set(tokens_list) # Remove duplicates.
        biwords_list = [tokens_list[i]+' '+tokens_list[i+1] for i in range(len(tokens_list)-1)] # Get all biwords in the document
        biwords = set(biwords_list) # Remove duplicates.           
        tmp_frequencies = pd.DataFrame(columns=['songID', 'song', 'author', 'word', 'frequency'])
        for token in tokens:
            tmp_frequencies = pd.concat([tmp_frequencies, pd.DataFrame({'songID': index, 'song': song, 'author': author, 'word': token, 'frequency': tokens_list.count(token)}, index=[0])], ignore_index=True)
        for biword in biwords:
            tmp_frequencies = pd.concat([tmp_frequencies, pd.DataFrame({'songID': index, 'song': song, 'author': author, 'word': biword, 'frequency': biwords_list.count(biword)}, index=[0])], ignore_index=True)
        # Maintain only the 5 most frequent words/biwords
        tmp_frequencies = tmp_frequencies.sort_values(by=['frequency'], ascending=False).head(5)
        # Append the frequencies of words and biwords to the dataframe
        frequencies = pd.concat([frequencies, tmp_frequencies], ignore_index=True)
    return frequencies

frequencies_with_stopwords = get_top5_frequencies(sample)
frequencies_without_stopwords = get_top5_frequencies(sample, stop_words=False)

500 songs are sampled from the dataset. After some experimentation, this number was chosen as a compromise between the variety of songs used for evaluation and the time needed to execute the function above.
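
Much of the running time of `get_top5_frequencies` is likely spent growing DataFrames one row at a time. As an illustrative alternative (not the notebook's code), `collections.Counter` can count words and biwords in a single pass; the helper name `top_k_terms` is an assumption, and ties may be ordered differently than with `sort_values`.

```python
from collections import Counter

def top_k_terms(lyrics, k=5):
    """Return the k most frequent words/biwords of a lyrics string."""
    tokens = lyrics.split()
    biwords = [tokens[i] + " " + tokens[i + 1] for i in range(len(tokens) - 1)]
    # Words and biwords never collide, because biwords contain a space
    counts = Counter(tokens) + Counter(biwords)
    return counts.most_common(k)

top = top_k_terms("la la la love me love me do", k=2)
```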

In [19]:
# Generate a set of queries

def make_queries(data, n=350):
    queries = pd.DataFrame(columns=['query_type', 'query', 'relevant_song_title', 'relevant_song_author'])

    for i in range(n):
        words = np.random.choice(data['word'].unique(), 3, replace=False)
        word1 = words[0]
        word2 = words[1]
        word3 = words[2]
        # Create a random number of queries (from 1 to 5)
        combinations = []
        n_queries = np.random.randint(1, 6)
        if n_queries == 5:
            combinations = range(1, 6)
        else:
            combinations = np.random.choice(range(1, 7), n_queries, replace=False)
        for combination in combinations:
            match combination:
                case 1: # Simple AND query
                    # Find all songs where word1 and word2 are among the 5 most frequent words
                    relevant_songIDs = pd.merge(data[data['word'] == word1], data[data['word'] == word2], on=['songID'], how='inner')['songID']
                    relevant_song_titles = []
                    relevant_song_authors = []
                    if len(relevant_songIDs) == 0: # No song has both words among its 5 most frequent ones
                        relevant_song_titles = [pd.NA]
                        relevant_song_authors = [pd.NA]
                    else:
                        relevant_song_titles = data[data['songID'].isin(relevant_songIDs)]['song'].values
                        relevant_song_authors = data[data['songID'].isin(relevant_songIDs)]['author'].values
                    for k in range(len(relevant_song_titles)):
                        queries = pd.concat([queries, pd.DataFrame({'query_type': 1, 'query': word1 + ' AND ' + word2, 'relevant_song_title': relevant_song_titles[k], 'relevant_song_author': relevant_song_authors[k]}, index=[0])], ignore_index=True)
                    queries = queries.drop_duplicates() # Remove duplicates
                case 2: # Simple OR query
                    # Find all songs where word1 or word2 are among the 5 most frequent words
                    relevant_song_titles = data[(data['word'] == word1) | (data['word'] == word2)]['song'].values
                    relevant_song_authors = data[(data['word'] == word1) | (data['word'] == word2)]['author'].values
                    if len(relevant_song_titles) == 0:
                        relevant_song_titles = [pd.NA]
                        relevant_song_authors = [pd.NA]
                    for k in range(len(relevant_song_titles)):
                        queries = pd.concat([queries, pd.DataFrame({'query_type': 2, 'query': word1 + ' OR ' + word2, 'relevant_song_title': relevant_song_titles[k], 'relevant_song_author': relevant_song_authors[k]}, index=[0])], ignore_index=True)
                    queries = queries.drop_duplicates() # Remove duplicates
                case 3: # Simple NOT query
                    # Find all songs where word1 is among the 5 most frequent words, but word2 is not
                    good = data[data['word'] == word1]['songID'].values
                    bad = data[data['word'] == word2]['songID'].values
                    relevant_song_titles = []
                    relevant_song_authors = []
                    for songID in good:
                        if songID not in bad:
                            relevant_song_titles.append(data[data['songID'] == songID]['song'].iloc[0])
                            relevant_song_authors.append(data[data['songID'] == songID]['author'].iloc[0])
                    if len(relevant_song_titles) == 0:
                        relevant_song_titles = [pd.NA]
                        relevant_song_authors = [pd.NA]
                    for k in range(len(relevant_song_titles)):
                        queries = pd.concat([queries, pd.DataFrame({'query_type': 3, 'query': word1 + ' NOT ' + word2, 'relevant_song_title': relevant_song_titles[k], 'relevant_song_author': relevant_song_authors[k]}, index=[0])], ignore_index=True)
                    queries = queries.drop_duplicates() # Remove duplicates
                case 4: # Complex AND NOT query
                    # Find all songs where word1 and word2 are among the 5 most frequent words, but word3 is not
                    good = pd.merge(data[data['word'] == word1], data[data['word'] == word2], on=['songID'], how='inner')['songID']
                    bad = data[data['word'] == word3]['songID'].values
                    relevant_song_titles = []
                    relevant_song_authors = []
                    for songID in good:
                        if songID not in bad:
                            relevant_song_titles.append(data[data['songID'] == songID]['song'].iloc[0])
                            relevant_song_authors.append(data[data['songID'] == songID]['author'].iloc[0])
                    # Fall back to NA when no song is left, as in the other query types
                    if len(relevant_song_titles) == 0:
                        relevant_song_titles = [pd.NA]
                        relevant_song_authors = [pd.NA]
                    for k in range(len(relevant_song_titles)):
                        queries = pd.concat([queries, pd.DataFrame({'query_type': 4, 'query': word1 + ' AND ' + word2 + ' NOT ' + word3, 'relevant_song_title': relevant_song_titles[k], 'relevant_song_author': relevant_song_authors[k]}, index=[0])], ignore_index=True)
                    queries = queries.drop_duplicates()
                case 5: # Complex OR NOT query
                    # Find all songs where word1 or word2 are among the 5 most frequent words, but word3 is not
                    good = data[(data['word'] == word1) | (data['word'] == word2)]['songID'].values
                    bad = data[data['word'] == word3]['songID'].values
                    relevant_song_titles = []
                    relevant_song_authors = []
                    for songID in good:
                        if songID not in bad:
                            relevant_song_titles.append(data[data['songID'] == songID]['song'].iloc[0])
                            relevant_song_authors.append(data[data['songID'] == songID]['author'].iloc[0])
                    if len(relevant_song_titles) == 0:
                        relevant_song_titles = [pd.NA]
                        relevant_song_authors = [pd.NA]
                    for k in range(len(relevant_song_titles)):
                        queries = pd.concat([queries, pd.DataFrame({'query_type': 5, 'query': word1 + ' OR ' + word2 + ' NOT ' + word3, 'relevant_song_title': relevant_song_titles[k], 'relevant_song_author': relevant_song_authors[k]}, index=[0])], ignore_index=True)
                    queries = queries.drop_duplicates() # Remove duplicates
                case 6: # Single word/biword query
                    # Find all songs where word1 is among the 5 most frequent words
                    relevant_song_titles = data[data['word'] == word1]['song'].values
                    relevant_song_authors = data[data['word'] == word1]['author'].values
                    for k in range(len(relevant_song_titles)):
                        queries = pd.concat([queries, pd.DataFrame({'query_type': 6, 'query': word1, 'relevant_song_title': relevant_song_titles[k], 'relevant_song_author': relevant_song_authors[k]}, index=[0])], ignore_index=True)
                    queries = queries.drop_duplicates()
    return queries

queries_with_stopwords = make_queries(frequencies_with_stopwords)
queries_without_stopwords = make_queries(frequencies_without_stopwords)

The query generator is looped 350 times, a number chosen after some experimentation in order to end up with a few thousand queries to work with.
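
The six query structures can be summarized as a small template function. This is only a sketch of the string construction (the relevance bookkeeping is omitted), and `build_query` is a hypothetical helper, not part of the notebook.

```python
def build_query(kind, word1, word2, word3):
    """Build the query string for one of the six structures (1 to 6)."""
    templates = {
        1: f"{word1} AND {word2}",
        2: f"{word1} OR {word2}",
        3: f"{word1} NOT {word2}",
        4: f"{word1} AND {word2} NOT {word3}",
        5: f"{word1} OR {word2} NOT {word3}",
        6: word1,
    }
    return templates[kind]

examples = [build_query(kind, "love", "baby", "night") for kind in range(1, 7)]
```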

In [26]:
# Compute precision and recall

def evaluate(ir, queries, word_reduction_type = 'stemming_porter', stopwords = True):

    evaluation = pd.DataFrame(columns=['query_type', 'precision', 'recall'])

    # Compute precision and recall for each query
    for _query in queries['query'].unique():
        query_rows = queries[queries['query'] == _query]
        query_type = query_rows['query_type'].values[0]
        relevant_songs = pd.DataFrame({'title': query_rows['relevant_song_title'].values, 'author': query_rows['relevant_song_author'].values})
        try:
            # Run the query once, with the same preprocessing used to build the index
            answer = query(ir, _query, word_reduction_type, stopwords, _print=False)
            retrieved_songs = pd.DataFrame({'title': [song.title for song in answer], 'author': [song.author for song in answer]})
            # Compute precision and recall
            n_matches = len(pd.merge(relevant_songs, retrieved_songs, on=['title', 'author'], how='inner'))
            precision = n_matches / len(retrieved_songs)
            recall = n_matches / len(relevant_songs)
        except KeyError:
            # The query matched no song, or one of its terms is not in the index
            precision = 0
            recall = 0
        evaluation = pd.concat([evaluation, pd.DataFrame({'query_type': query_type, 'precision': precision, 'recall': recall}, index=[0])], ignore_index=True)

    return evaluation

Now the evaluation of all the previously generated IR systems will be performed.

Suggestion: free up some RAM before running the following cells. Restarting the kernel and running all the cells from 1 to 11, and from 15 to 17 should be enough.

In [27]:
# Evaluate the IR system with different parametrizations

results = pd.DataFrame(columns=['word_reduction_type', 'stopwords', 'query_type', 'precision', 'recall'])

for word_reduction_type in ['stemming_porter', 'stemming_snowball', 'lemmatization']:
    for stopwords in [True, False]:
        suffix = '_with_stopwords' if stopwords else '_without_stopwords'
        queries = queries_with_stopwords if stopwords else queries_without_stopwords
        with open('IRSystem/ir_' + word_reduction_type + suffix + '.pkl', 'rb') as handle:
            ir = pickle.load(handle)
        # Preprocess the queries with the same parametrization used to build the index
        evaluation = evaluate(ir, queries, word_reduction_type, stopwords)
        results = pd.concat([results, pd.DataFrame({'word_reduction_type': word_reduction_type, 'stopwords': int(stopwords), 'query_type': evaluation['query_type'], 'precision': evaluation['precision'], 'recall': evaluation['recall']})], ignore_index=True)
        del ir # Free memory

# Save results on disk
results.to_csv('results/evaluation.csv', index=False)