# Part 3: Ranking and filtering

In this part of the project, you will experiment with different ranking algorithms that can be
applied in a search engine. Your task is to design and implement a retrieval pipeline that:
- Takes a query as input (a piece of text).
- Finds all documents that contain all query terms (conjunctive query, i.e., AND semantics).
- Sorts the matching documents by relevance using different ranking methods.

The main goal of this assignment is to explore and compare various relevance scoring approaches. By the end, you should be able to analyze how different algorithms affect the ranking order of documents.

**Important: For this assignment, we only consider conjunctive queries (AND). This means that a document is included in the results only if it contains every word from the query.**
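
As a minimal sketch of what AND semantics means in practice, the snippet below intersects the posting sets of all query terms. The toy documents and the helper names `build_toy_index` and `conjunctive_match` are illustrative assumptions, not part of the project code.

```python
from collections import defaultdict

def build_toy_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def conjunctive_match(query_terms, index):
    """Return the ids of documents that contain every query term (AND)."""
    if not query_terms:
        return set()
    # A term that is missing from the index matches nothing,
    # so the whole conjunctive result becomes empty.
    postings = [index.get(term, set()) for term in query_terms]
    # Start from the shortest posting set to keep intersections small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Hypothetical toy corpus, already preprocessed.
toy_docs = {
    "d1": "zipper sweater women",
    "d2": "sweater men",
    "d3": "zipper jacket men",
}
toy_index = build_toy_index(toy_docs)
matches = conjunctive_match(["zipper", "sweater"], toy_index)
print(matches)
```

Only `d1` contains both terms, so it is the only match; a union (OR) would instead return all three documents.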

## Prelude

### Imports

In [1]:
import os, collections, string, re, math
from collections import defaultdict
from array import array
import pandas as pd
import numpy as np
import numpy.linalg as la

from unidecode import unidecode
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

from rank_bm25 import BM25Okapi

from gensim.models import Word2Vec

### Data loading

In [2]:
# DATA LOADING
DATA_PATH =  os.path.join(os.getcwd(), '../../data/')

data = pd.read_csv(os.path.join(DATA_PATH, 'fashion_products_cleaned.csv'))
USED_TEXT_COLUMNS = ['title', 'description', 'brand', 'category', 'sub_category', 'seller']
# USED_TEXT_COLUMNS = ['title', 'description']
# USED_TEXT_COLUMNS = ['title']

data[USED_TEXT_COLUMNS] = data[USED_TEXT_COLUMNS].fillna('')

### Functions

In [3]:
# Preprocessing function used in parts 1 and 2
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
translator = str.maketrans('', '', string.punctuation)


def preprocess_text(text):
    text = text.lower() # Lowercase
    text = text.translate(translator) # Remove punctuation
    text = unidecode(text) # normalize
    tokens = word_tokenize(text) # Tokenization
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words] # Remove stopwords and non-alphabetic tokens
    stemmed_tokens = [stemmer.stem(word) for word in tokens] # Stemming 
    stemmed_tokens = [word for word in stemmed_tokens if len(word) > 2] # Remove short tokens
    return stemmed_tokens

def print_top_k_results(ranked_documents, k=20, columns=USED_TEXT_COLUMNS):
    # Print header
    print("=" * 42)
    print(f"{'Rank':<6} | {'Document ID':<20} | {'Score':>10}")
    print("=" * 42)

    # Print each document row
    for i, (score, doc) in enumerate(ranked_documents[:k], 1):
        print(f"{i:<6} | {doc:<20} | {score:>10.3f}")

    print("=" * 42)

def get_top_k_results(data: pd.DataFrame,
                      ranked_documents,
                      k: int | str = 'all',
                      text_columns: list[str] = USED_TEXT_COLUMNS,
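                      num_columns: list[str] | None = None):
    '''
    Parameters
    -----
        data: pandas DataFrame loaded from the cleaned CSV file
        ranked_documents: list of (score, pid) pairs returned by search_tf_idf or another search method
        k: number of top documents to return, or 'all' (default) to return every ranked document
        text_columns: text columns to include in the output (the columns used for text search, defined globally)
        num_columns: optional numerical columns to include in the output
    '''
    num_columns = num_columns or []  # avoid a mutable default argument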
    ranked_documents_df = pd.DataFrame(ranked_documents, columns=['score', 'pid'])

    # they should be already ordered but just make sure
    ranked_documents_df = ranked_documents_df.sort_values('score', ascending=False).reset_index(drop=True)
    
    if k == 'all':
        return ranked_documents_df.merge(data[['pid'] + text_columns + num_columns], on='pid', how='left')
    
    return ranked_documents_df.merge(data[['pid'] + text_columns + num_columns], on='pid', how='left')[:k]

## 1. You’re asked to provide 3 different ways of ranking:

### a. TF-IDF + cosine similarity

Classical scoring, as seen in the practical labs.
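
The following toy example sketches the scoring idea on three tiny documents. It follows the same conventions as the notebook code (term frequency divided by the L2 norm of the raw counts, `idf = log(N / df)`), so the dot product is a cosine-style score rather than a strict cosine of the final tf-idf vectors. The corpus and the helper names `tfidf_vector` and `score` are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and preprocessed.
toy_docs = {
    "d1": ["zipper", "sweater", "sweater"],
    "d2": ["sweater", "men"],
    "d3": ["zipper", "jacket", "men"],
}
N = len(toy_docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in toy_docs.values() for term in set(tokens))
idf = {term: math.log(N / count) for term, count in df.items()}

def tfidf_vector(tokens):
    """Sparse tf-idf vector: tf is normalized by the L2 norm of the counts."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: (c / norm) * idf.get(t, 0.0) for t, c in counts.items()}

def score(query_tokens, doc_tokens):
    """Dot product between the query and document tf-idf vectors."""
    q = tfidf_vector(query_tokens)
    d = tfidf_vector(doc_tokens)
    return sum(weight * d.get(term, 0.0) for term, weight in q.items())

query = ["zipper", "sweater"]
ranking = sorted(
    ((score(query, tokens), doc_id) for doc_id, tokens in toy_docs.items()),
    reverse=True,
)
for s, doc_id in ranking:
    print(f"{doc_id}: {s:.3f}")
```

`d1` ranks first because it contains both query terms and repeats `sweater`.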

In [4]:
def create_index_tfidf(data, columns=['title', 'description', 'category']):
    '''
    Build the inverted index and compute tf, df and idf.

    Arguments:
    data -- DataFrame with one product per row; must contain a 'pid' column
    columns -- text columns that are concatenated to form each document's text

    Returns:
    index -- the inverted index (a Python dictionary) mapping each term to the
    list of documents it appears in, together with its positions in each document
    tf -- normalized term frequency of each term in each document
    df -- number of documents each term appears in
    idf -- inverse document frequency of each term
    '''

    index = defaultdict(list)
    tf = defaultdict(list)  #term frequencies of terms in documents (documents in the same order as in the main index)
    df = defaultdict(int)  #document frequencies of terms in the corpus
    idf = defaultdict(float)
    N = len(data.index)

    for _, row in data.iterrows():
        
        page_id = row['pid']
        terms = preprocess_text(' '.join(row[columns].values))

        ## ===============================================================
        ## create the index for the **current page** and store it in current_page_index
        ## current_page_index ==> { 'term1': [current_doc, [list of positions]], ..., 'term_n': [current_doc, [list of positions]]}

        ## Example: if the curr_doc has id 1 and its text is
        ## 'web retrieval information retrieval':

        ## current_page_index ==> { 'web': [1, [0]], 'retrieval': [1, [1,3]], 'information': [1, [2]]}

        ## the term 'web' appears in document 1 in position 0,
        ## the term 'retrieval' appears in document 1 in positions 1 and 3
        ## ===============================================================

        current_page_index = {}

        for position, term in enumerate(terms):  ## terms contains the text of all selected columns
            try:
                # if the term is already in the dict, append the position to the corresponding list
                current_page_index[term][1].append(position)
            except KeyError:
                # add the new term as dict key and initialize its array of positions with the current position
                current_page_index[term] = [page_id, array('I', [position])]  # 'I' is the typecode for unsigned int

        # normalize term frequencies
        # Compute the denominator used to normalize term frequencies:
        # the L2 norm of the raw term counts, which is the same for all terms of a document.
        norm = 0
        for term, posting in current_page_index.items():
            # posting will contain the list of positions for current term in current document.
            # posting ==> [current_doc, [list of positions]]
            # you can use it to infer the frequency of current term.
            norm += len(posting[1]) ** 2
        norm = math.sqrt(norm)

        # calculate the tf (term frequency divided by the norm computed above) and df weights
        for term, posting in current_page_index.items():
            # append the tf for the current term (tf = term frequency in current doc / norm)
            tf[term].append(np.round(len(posting[1]) / norm, 4))
            # increment the document frequency of the current term (number of documents containing it)
            df[term] += 1

        #merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

    # Compute IDF as log(N / df).
    # Note: it is computed here, once the df of every term is known.
    for term in df:
        idf[term] = np.round(np.log(float(N / df[term])), 4)

    return index, tf, df, idf

def rank_documents(terms, docs, index, tf, idf):
    '''
    Perform the ranking of the results of a search based on the tf-idf weights

    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    tf -- term frequencies
    idf -- inverted document frequencies

    Returns:
    List of [score, doc] pairs sorted by decreasing score
    '''

    # We are only interested in the elements of the doc vector that correspond to the query terms;
    # the remaining elements would become 0 when multiplied by the query vector.
    # defaultdict: accessing doc_vectors[k] for a nonexistent key k automatically adds (k, [0] * len(terms))
    doc_vectors = defaultdict(lambda: [0] * len(terms))
    query_vector = [0] * len(terms)

    # compute the norm for the query tf
    query_terms_count = collections.Counter(terms)  # get the frequency of each term in the query.
    # Example: collections.Counter(['hello','hello','world']) --> Counter({'hello': 2, 'world': 1})
    # HINT: use when computing tf for query_vector

    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query
        if term not in index:
            continue

        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex]= query_terms_count[term] / query_norm * idf[term] #query_vector[0] corresponds to the first term in the query

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):
            # Example of [doc_index, (doc, postings)]
            # 0 (26, array('I', [1, 4, 12, 15, 22, 28, 32, 43, 51, 68, 333, 337]))
            # 1 (33, array('I', [26, 33, 57, 71, 87, 104, 109]))
            # term is in doc 26 in positions 1,4, .....
            # term is in doc 33 in positions 26,33, .....

            #tf[term][0] will contain the tf of the term 'term' in the doc 26
            if doc in docs: # if the document is in the list of documents retrieved (matching the query)
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term]  # TODO: check if multiply for idf

    # Calculate the score of each doc:
    # compute the cosine similarity between query_vector and each doc vector.
    # With normalized vectors the dot product corresponds to the cosine similarity (see np.dot).

    doc_scores = [[np.dot(cur_doc_vec, query_vector), doc] for doc, cur_doc_vec in doc_vectors.items()]
    doc_scores.sort(reverse=True)
    if len(doc_scores) == 0:
        # no interactive retry here: the caller receives an empty list and decides what to do
        print('No results found, try again')
    return doc_scores

def search_tf_idf(query, index, tf, idf):
    '''
    Return the ranked list of documents that contain any of the query terms.
    We collect the documents of each query term and take their union (OR semantics).
    Note: the conjunctive (AND) semantics required by the assignment would need the intersection instead.
    '''
    query = preprocess_text(query)
    docs = set()
    for term in query:
        # ids of the docs that contain 'term'; index.get avoids adding unknown terms to the defaultdict
        term_docs = [posting[0] for posting in index.get(term, [])]

        # docs = docs Union term_docs
        docs = docs.union(term_docs)
    docs = list(docs)
    ranked_docs = rank_documents(query, docs, index, tf, idf)
    return ranked_docs

In [5]:
inverted_index, tf_index, df_index, idf_index = create_index_tfidf(data, USED_TEXT_COLUMNS)

In [6]:
# EXAMPLE SEARCH before we decide real queries used in all methods
print_top_k_results(search_tf_idf('zipper sweater', inverted_index, tf_index, idf_index))

Rank   | Document ID          |      Score
1      | SWSFMJZHZFUCHHTH     |     11.087
2      | SWTFHFY4QXVP2EPQ     |      8.927
3      | SWTFHFY4QPNNNYED     |      8.927
4      | SWTFMQYG6ENPEEGN     |      8.256
5      | SWTFMQYZSHVMUCNY     |      8.094
6      | SWTFXZVZNXFDCCNY     |      7.489
7      | SWTFVMSDDZRPXAPT     |      7.296
8      | SWTFQFZTNZWGHGK9     |      7.296
9      | SWTFQFPBSTQD4XPZ     |      7.240
10     | SWTFH2GZF24UJXJK     |      6.880
11     | SWTFMARTZVUY7MAY     |      6.076
12     | SWTFMARTNHCDE56Y     |      6.076
13     | SWTFMARSFXNNRSKW     |      6.076
14     | SWTFMARSGQZRZQU3     |      6.004
15     | SWTFMARSGHPNXTJY     |      6.004
16     | SWTFMARTTT9GQF8G     |      5.934
17     | SWTFMARTMHFAMZNK     |      5.934
18     | SWTFMART7TBXGARV     |      5.934
19     | SWTFYHPKSZDBSBGU     |      5.866
20     | SWTFY245GB9HZCMZ     |      5.866


In [7]:
display(get_top_k_results(data, search_tf_idf('zipper sweater', inverted_index, tf_index, idf_index), k=20))

Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,11.086585,SWSFMJZHZFUCHHTH,full sleev print women sweatshirt,arbour give new trend sweater women arbour pro...,arbo,clothing and accessories,winter wear,arbor
1,8.926861,SWTFHFY4QPNNNYED,stripe round neck casual men beig sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
2,8.926861,SWTFHFY4QXVP2EPQ,solid high neck casual women green sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
3,8.255595,SWTFMQYG6ENPEEGN,solid casual women dark blue sweater,navi solid sweater hood long sleev rib hem max...,szto,clothing and accessories,winter wear,shreyashfashions
4,8.094102,SWTFMQYZSHVMUCNY,stripe neck casual men grey sweater,charcoal grey solid sweater vneck long sleev s...,szto,clothing and accessories,winter wear,
5,7.48899,SWTFXZVZNXFDCCNY,solid neck casual men multicolor sweater,mash unlimit mustard navi pullov sweater,mash unlimit,clothing and accessories,winter wear,highstreet trendz llp
6,7.296366,SWTFQFZTNZWGHGK9,solid round neck casual men dark blue sweater,navi blue solid sweater round neck long sleev ...,lev,clothing and accessories,winter wear,kondefashions
7,7.296366,SWTFVMSDDZRPXAPT,stripe round neck casual women revers red sweater,red black stripe revers sweater round neck lon...,lev,clothing and accessories,winter wear,
8,7.239941,SWTFQFPBSTQD4XPZ,self design round neck casual men dark blue sw...,navi blue selfdesign sweater round neck long s...,lev,clothing and accessories,winter wear,kondefashions
9,6.879987,SWTFH2GZF24UJXJK,stripe neck casual women blue sweater,women casual revers sweater featur neck sleev ...,byford by pantaloo,clothing and accessories,winter wear,aum3etail


### b. BM25

In [8]:
bm25 = BM25Okapi(data.apply(lambda x: ' '.join(x[USED_TEXT_COLUMNS].values).split(' '), axis=1).to_list())

In [9]:
def search_BM25(bm25, data, query, k=10):
   
    # preprocess the query with preprocess_text, which transforms it from a string into a list of terms
    query = preprocess_text(query)

    # score docs using a specific function of bm25
    scores = np.array(bm25.get_scores(query))

    # get indices of top k scores
    idx = np.argpartition(scores, -k)[-k:]

    # sort top k scores and return their indices
    # if all the scores are 0 return empty list
    if np.sum(scores[idx]) == 0:
        return []
    
    # sort in descending order
    top_indices = idx[np.argsort(-scores[idx])]

    # build pairs (score, doc_id)
    result = [(scores[i], data.iloc[i]['pid'] if 'pid' in data.columns else i) for i in top_indices]

    return result

In [10]:
print_top_k_results(search_BM25(bm25, data, 'zipper sweater', k=20))

Rank   | Document ID          |      Score
1      | SWSFMJZHZFUCHHTH     |      9.538
2      | RNCF4YV4GZRYJY3A     |      9.441
3      | SWTFHFY4QXVP2EPQ     |      8.590
4      | SWTFHFY4QPNNNYED     |      8.590
5      | SWTFXZVZNXFDCCNY     |      8.301
6      | SWTFMQYZSHVMUCNY     |      8.196
7      | SWTFMQYG6ENPEEGN     |      8.196
8      | SWTFQFZTNZWGHGK9     |      7.960
9      | SWTFVMSDDZRPXAPT     |      7.960
10     | SWTFH2GZF24UJXJK     |      7.859
11     | SWTFQFPBSTQD4XPZ     |      7.792
12     | SWTFMARTZVUY7MAY     |      7.616
13     | SWTFMARTTT9GQF8G     |      7.616
14     | SWTFMARSFXNNRSKW     |      7.616
15     | SWTFMARTMHFAMZNK     |      7.616
16     | SWTFMARTNHCDE56Y     |      7.616
17     | SWTFMART7TBXGARV     |      7.616
18     | SWTFMARSGHPNXTJY     |      7.539
19     | SWTFMARSGQZRZQU3     |      7.539
20     | SWTFY83XBYN5D4HG     |      7.315


In [11]:
display(get_top_k_results(data, search_BM25(bm25, data, 'zipper sweater', k=20)))

Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,9.538064,SWSFMJZHZFUCHHTH,full sleev print women sweatshirt,arbour give new trend sweater women arbour pro...,arbo,clothing and accessories,winter wear,arbor
1,9.440835,RNCF4YV4GZRYJY3A,solid men raincoat,rainsuit rain coat jacket pant men women kid f...,the dry ca,clothing and accessories,raincoats,newera
2,8.590108,SWTFHFY4QXVP2EPQ,solid high neck casual women green sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
3,8.590108,SWTFHFY4QPNNNYED,stripe round neck casual men beig sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
4,8.300637,SWTFXZVZNXFDCCNY,solid neck casual men multicolor sweater,mash unlimit mustard navi pullov sweater,mash unlimit,clothing and accessories,winter wear,highstreet trendz llp
5,8.195779,SWTFMQYZSHVMUCNY,stripe neck casual men grey sweater,charcoal grey solid sweater vneck long sleev s...,szto,clothing and accessories,winter wear,
6,8.195779,SWTFMQYG6ENPEEGN,solid casual women dark blue sweater,navi solid sweater hood long sleev rib hem max...,szto,clothing and accessories,winter wear,shreyashfashions
7,7.960347,SWTFQFZTNZWGHGK9,solid round neck casual men dark blue sweater,navi blue solid sweater round neck long sleev ...,lev,clothing and accessories,winter wear,kondefashions
8,7.960347,SWTFVMSDDZRPXAPT,stripe round neck casual women revers red sweater,red black stripe revers sweater round neck lon...,lev,clothing and accessories,winter wear,
9,7.859234,SWTFH2GZF24UJXJK,stripe neck casual women blue sweater,women casual revers sweater featur neck sleev ...,byford by pantaloo,clothing and accessories,winter wear,aum3etail


### c. Your Score

Here, the task is to create a new score. (Be creative: think about what factors could make a document more relevant to a query and include them in your formula.)

Explain how the ranking differs when using TF-IDF and BM25, and think about the pros and cons of using each of them. Regarding your own score, justify the choice of the score (pros and cons). HINT: Look into numerical fields that each record has to build your score.
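
One way to see the main difference is through the term-frequency component. In the TF-IDF variant used above, the weight of a term grows with its count (up to the document-level normalization), whereas the Okapi BM25 term-frequency component saturates and is bounded by `k1 + 1`, and it also penalizes documents longer than average. The sketch below only illustrates that component without the IDF factor; the function name `tf_weight_bm25` and the parameter values `k1=1.5`, `b=0.75` are illustrative assumptions.

```python
def tf_weight_bm25(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Okapi BM25 term-frequency component (without the IDF factor)."""
    # Length normalization: equals 1 for a document of average length.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Saturation: each additional occurrence adds less than the previous one.
for tf in (1, 2, 5, 20):
    weight = tf_weight_bm25(tf, doc_len=10, avg_doc_len=10)
    print(f"tf={tf:>2} -> weight={weight:.3f}")

# Length penalty: same tf, but the document is twice the average length.
print(tf_weight_bm25(2, doc_len=20, avg_doc_len=10))
```

In this sketch, a document that repeats a query term many times gains little over one that mentions it a few times, which is one reason the two methods can order the same documents differently.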

**Custom Score Explanation**

The idea is to combine textual relevance (as in TF-IDF and BM25) with numerical relevance. For example, a product with a higher average rating may be more relevant to the user searching for it, relevance may decrease as the price increases, or a higher discount may make a product more attractive.

We implement one function that combines all of these signals into a single score, and another that lets the user choose which numerical column to sort the results by.

The options for relevance order would be: 
- highest average_rating first
- highest price first
- lowest price first
- highest discount first
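
The four options above can be sketched as a mapping from an option name to a column and a sort direction. This is a simplified, pandas-free illustration: the option names, the `sort_results` helper and the toy records are assumptions, while the column names `average_rating`, `selling_price` and `discount` are the ones used in the dataset.

```python
# Hypothetical option names mapped to (column, ascending) pairs.
SORT_OPTIONS = {
    "highest_rating": ("average_rating", False),
    "highest_price": ("selling_price", False),
    "lowest_price": ("selling_price", True),
    "highest_discount": ("discount", False),
}

def sort_results(records, option):
    """Sort a list of product dicts by the chosen option.

    Records with a missing value in the sort column are placed last.
    """
    column, ascending = SORT_OPTIONS[option]
    present = [r for r in records if r.get(column) is not None]
    missing = [r for r in records if r.get(column) is None]
    present.sort(key=lambda r: r[column], reverse=not ascending)
    return present + missing

# Toy records standing in for the retrieved products.
toy_results = [
    {"pid": "A", "average_rating": 4.1, "selling_price": 999.0, "discount": 10.0},
    {"pid": "B", "average_rating": None, "selling_price": 199.0, "discount": 75.0},
    {"pid": "C", "average_rating": 3.5, "selling_price": 499.0, "discount": 40.0},
]

cheapest_first = [r["pid"] for r in sort_results(toy_results, "lowest_price")]
best_rated_first = [r["pid"] for r in sort_results(toy_results, "highest_rating")]
print(cheapest_first, best_rated_first)
```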


In [12]:
class ranking_G23:
    def __init__(self, df, columns=USED_TEXT_COLUMNS, method='tfidf'):
        self.df = df
        self.columns = columns
        self.method = method
        self.ranked_docs = None
        
        if method == 'tfidf':
            self.index, self.tf, _, self.idf = create_index_tfidf(self.df, self.columns)
        elif method == 'bm25':
            self.bm25 = BM25Okapi(self.df.apply(lambda x: ' '.join(x[self.columns].values).split(' '), axis=1).to_list())
        else:
            raise ValueError("Method must be 'tfidf' or 'bm25'")

    def search(self, query : str):
        if self.method == 'tfidf':
            ranked_docs = search_tf_idf(query, self.index, self.tf, self.idf)

        elif self.method == 'bm25':
            ranked_docs = search_BM25(self.bm25, self.df, query, k=len(self.df))

        results = []

        if not ranked_docs:
            return []
        
        doc_ids = [pid for score, pid in ranked_docs]
        text_scores = [score for score, pid in ranked_docs]
        max_text_score = max(text_scores) if len(text_scores) > 0 else 1

        for i, pid in enumerate(doc_ids): 
            row = self.df[self.df['pid'] == pid ].iloc[0]

            # normalize text score
            text_score = text_scores[i] / max_text_score

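            # numerical features
            rating_score = row['average_rating'] / 5 if not pd.isna(row['average_rating']) else 0
            discount_score = row['discount'] / 100 if not pd.isna(row['discount']) else 0
            availability_score = 1 if row['out_of_stock'] == 0 else 0
            # NOTE: 1 - (1 + log1p(price)) simplifies to -log1p(price): an unbounded penalty that grows
            # with the price (not a value in [0, 1]), which is why the combined score can be negative
            price_score = 1 - (1 + np.log1p(row['selling_price'])) if not pd.isna(row['selling_price']) else 0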

            # combine with weights
            combined_score = (0.4 * text_score +
                            0.3 * rating_score +
                            0.2 * discount_score +
                            0.05 * availability_score +
                            0.05 * price_score)
            
            results.append((combined_score, pid))

        results.sort(reverse=True)
        self.ranked_docs = results

        return results
    
    def sort(self, criterion: list[str] = ['average_rating'], ascending: list[bool] = [False]):
        if self.ranked_docs is None:
            return
        
        # filter df to only the documents retrieved
        pids = [pid for _, pid in self.ranked_docs]
        df_filtered = self.df[self.df['pid'].isin(pids)].copy()
        
        # sort by chosen criterion
        df_filtered.sort_values(by=criterion, ascending=ascending, inplace=True)
                
        return df_filtered

In [13]:
our_ranking = ranking_G23(data, columns=USED_TEXT_COLUMNS, method='tfidf')

In [14]:
custom_scores = our_ranking.search('zipper sweater')
display(get_top_k_results(data, custom_scores))

Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,0.430543,SWSFMJZHZFUCHHTH,full sleev print women sweatshirt,arbour give new trend sweater women arbour pro...,arbo,clothing and accessories,winter wear,arbor
1,0.410186,SWTFMQYZSHVMUCNY,stripe neck casual men grey sweater,charcoal grey solid sweater vneck long sleev s...,szto,clothing and accessories,winter wear,
2,0.373094,SWTFYGS7GFGWVX4J,stripe neck casual women orang sweater,,man,clothing and accessories,winter wear,shakticreation
3,0.343740,SWTFYGQ7S4KYESNY,stripe collar neck casual men green sweater,,man,clothing and accessories,winter wear,shakticreation
4,0.342564,SWTFQFPBSTQD4XPZ,self design round neck casual men dark blue sw...,navi blue selfdesign sweater round neck long s...,lev,clothing and accessories,winter wear,kondefashions
...,...,...,...,...,...,...,...,...
377,-0.108779,JEAFCXFZ4P6KE7Y2,jogger fit men black jean,distress denim jogger slim silhouett sit low w...,attiitu,clothing and accessories,bottomwear,attiitude23seller changed check for any change...
378,-0.153575,TROFVW76G5DEFHQH,regular fit men beig polyest trouser,beig solid midris trouser button closur pocket...,szto,clothing and accessories,bottomwear,
379,-0.182120,SWTFNMJUPH6VC3PD,solid round neck casual women blue sweater,,true bl,clothing and accessories,winter wear,
380,-0.216568,JCKFNYCAZBJDK4MZ,full sleev solid men casual jacket,high qualiti smooth polyest fabric refin handi...,yoga,clothing and accessories,winter wear,


In [15]:
# rank by highest rating
criterion_columns=['average_rating']
ascending_columns=[False]

display(our_ranking.sort(criterion=criterion_columns, ascending=ascending_columns))

Unnamed: 0,pid,title,description,brand,category,sub_category,seller,out_of_stock,selling_price,discount,...,detail_size,detail_neck_type,detail_country_of_origin,detail_brand_fit,detail_sleeve_type,detail_other_details,detail_model_name,detail_occasion,detail_closure,detail_secondary_color
7849,TKPFJQFWVKGVX4ZT,solid men brown track pant,enjoy workout train session wear stylish track...,,clothing and accessories,bottomwear,skenterprises21,0,740.0,18.0,...,,,india,,,,,,,
6889,SWTEX9HRYPTAUMCZ,stripe vneck casual men blue grey sweater,,g,clothing and accessories,winter wear,retailnet,0,2394.0,40.0,...,,,,,,,,,no closure,
12242,VESFKGD9Y5EXQKHK,attiitud men vest,chri gayl limit edit signatur upgrad basic ves...,,clothing and accessories,innerwear and swimwear,buy more,0,937.0,6.0,...,,,india,,,,antonio zafra,,,black
12243,VESFKGDNFEGK8VVC,attiitud men vest,chri gayl limit edit signatur upgrad basic gre...,,clothing and accessories,innerwear and swimwear,buy more,0,937.0,6.0,...,,,india,,,,antonio zafra,,,black
6397,SRTEXZA6DWSXYWPT,solid men multicolor regular short,men zipper short gray milanch humbert star log...,humbe,clothing and accessories,bottomwear,humbert,0,1045.0,12.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20803,SWTFY6BR3UFX5Q8W,checker round neck casual women multicolor swe...,,mett,clothing and accessories,winter wear,,1,1039.0,35.0,...,,,india,,,,,,no closure,
20842,SWTFYM2H8NPVGZZH,solid high neck casual men white sweater,,mett,clothing and accessories,winter wear,,1,909.0,35.0,...,,,india,,,,,,no closure,
20856,SWTFYM279ZAJGV5H,stripe round neck casual women multicolor sweater,,mett,clothing and accessories,winter wear,,1,1039.0,35.0,...,,,india,,,,,,no closure,
20861,SWTFYM2QNJTYJEVM,stripe round neck casual women light green swe...,,mett,clothing and accessories,winter wear,,1,1039.0,35.0,...,,,india,,,,,,no closure,


In [16]:
# rank by lowest price
criterion_columns=['selling_price']
ascending_columns=[True]

display(our_ranking.sort(criterion=criterion_columns, ascending=ascending_columns))

Unnamed: 0,pid,title,description,brand,category,sub_category,seller,out_of_stock,selling_price,discount,...,detail_size,detail_neck_type,detail_country_of_origin,detail_brand_fit,detail_sleeve_type,detail_other_details,detail_model_name,detail_occasion,detail_closure,detail_secondary_color
9980,CAPF7FWGTQXHUQJS,self design skull beani cap cap,materi winter knit beani cap made premium qual...,gracew,clothing and accessories,clothing accessories,graceway,0,199.0,75.0,...,,,,,,,,casual,,
10110,CAPF7FWG7HHD9YAR,self design skull beani cap cap,materi winter knit beani cap made premium qual...,gracew,clothing and accessories,clothing accessories,graceway,0,199.0,75.0,...,,,,,,,,casual,,
10073,CAPF7FWGG3QZSXPE,self design skull beani cap cap,materi winter knit beani cap made premium qual...,gracew,clothing and accessories,clothing accessories,graceway,0,199.0,75.0,...,,,,,,,,casual,,
10082,CAPF7FWGCPETHTNS,self design skull beani cap cap,materi winter knit beani cap made premium qual...,gracew,clothing and accessories,clothing accessories,graceway,0,249.0,68.0,...,,,,,,,,casual,,
9808,TSHFUWQAAECQFGJB,solid women neck maroon tshirt,cotton biowash full sleev tshirt high qualiti ...,eyetwist,clothing and accessories,topwear,eyetwister,0,299.0,50.0,...,s,v neck,india,,narrow,,maroon v neck stylish t shirt,,,black
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6900,SWTF8S5W3EFQGGGN,self design vneck casual women black grey sweater,,g,clothing and accessories,winter wear,retailnet,0,3380.0,43.0,...,,,,,,,,,button,
12237,JEAFCXFZ4P6KE7Y2,jogger fit men black jean,distress denim jogger slim silhouett sit low w...,attiitu,clothing and accessories,bottomwear,attiitude23seller changed check for any change...,0,3949.0,,...,,,india,,,,,,button zip,grey
1719,JCKFVEVF24KXKB2Y,full sleev solid women sport jacket,bunge stopper hood secur fit light transpar ma...,reeb,clothing and accessories,winter wear,retailnet,0,4499.0,35.0,...,,,,,,,osr anrk jkt,,zipper,black
4335,JCKFX4GJD4WYQYHM,full sleev solid men bomber jacket,full sleev neck sweater,us polo ass,clothing and accessories,winter wear,retailnet,0,4549.0,35.0,...,,,,,,,,,zipper,


## 2. Implement **word2vec + cosine ranking** score.

Return a top-20 list of documents for each of the 5 queries defined in Part 2 of your project, using search and word2vec + cosine similarity ranking.

To represent a piece of **text** using **word2vec**, we create a **single vector** that represents the entire text. This vector has the same number of dimensions as the word vectors and is calculated by **averaging the vectors of all words** in the text.

**Example:**

Consider the text:
```
“Wireless Bluetooth headphones with noise cancellation”
```

Suppose we have Word2Vec vectors for each word:

* Wireless → v1
* Bluetooth → v2
* headphones → v3
* with → v4
* noise → v5
* cancellation → v6

All vectors have the same number of dimensions. To represent the text as a single vector, we average the word vectors:

```
Text vector = (v1 + v2 + v3 + v4 + v5 + v6) ÷ 6
```

The resulting vector has the same number of dimensions as the individual word vectors and represents the content of the entire text. This approach allows us to compare texts based on their vector representations for tasks like search or recommendation.
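
As a small numeric illustration of the averaging step, the sketch below uses made-up 3-dimensional vectors in place of trained Word2Vec embeddings and compares texts with cosine similarity. The vectors and the helper names `text_vector` and `cosine` are assumptions for this toy example only.

```python
import math

# Made-up 3-dimensional "word vectors" (real Word2Vec vectors are learned).
toy_vectors = {
    "wireless":   [1.0, 0.0, 0.0],
    "bluetooth":  [0.8, 0.2, 0.0],
    "headphones": [0.6, 0.4, 0.0],
    "sweater":    [0.0, 0.0, 1.0],
}

def text_vector(tokens, vectors, dim=3):
    """Average the vectors of the known tokens; zero vector if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

doc_vec = text_vector(["wireless", "bluetooth", "headphones"], toy_vectors)
query_vec = text_vector(["bluetooth", "headphones"], toy_vectors)

print(doc_vec)
print(cosine(doc_vec, query_vec))
print(cosine(doc_vec, toy_vectors["sweater"]))
```

The document and the query point in nearly the same direction, so their similarity is close to 1, while the unrelated `sweater` vector is orthogonal to the document vector in this toy setup.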

In [None]:
def text_vector_w2v(model, tokens):
    """
    computes and returns the vector representation of a text by averaging the word vectors of its tokens
    """
    # select only words present in vocabulary
    vectors = [model.wv[word] for word in tokens if word in model.wv]

    if not vectors: # if no words in the text are in the model vocabulary
        return np.zeros(model.vector_size)
    
    # average all word vectors to get text vector representation
    return np.mean(vectors, axis=0)

def create_index_w2v(model, data, columns=USED_TEXT_COLUMNS):
    """
    creates an index of document vectors for all documents in the dataset
    returns dict(pid -> document vector)
    """
    doc_vectors = {}
    for _, row in data.iterrows():
        pid = row['pid']
        # tokenize and preprocess text from specified columns
        tokens = preprocess_text(' '.join(row[columns].values))
        # compute document vector
        doc_vectors[pid] = text_vector_w2v(model, tokens)

    return doc_vectors


def rank_documents_w2v(query_terms, model, doc_vectors):
    """
    ranks documents based on cosine similarity between query vector and document vectors
    returns a ranked list of (similarity score, pid) tuples
    """
    # convert query terms to vector
    query_vector = text_vector_w2v(model, query_terms)
    
    doc_scores = []
    query_norm = la.norm(query_vector)  # computed once: it does not depend on the document

    for pid, doc_vector in doc_vectors.items():
        doc_norm = la.norm(doc_vector)
        if doc_norm == 0 or query_norm == 0: # avoid division by zero
            similarity = 0
        else:
            # compute cosine similarity
            similarity = np.dot(query_vector, doc_vector) / (query_norm * doc_norm)
        doc_scores.append((similarity, pid))
    
    doc_scores.sort(reverse=True)

    return doc_scores

def search_w2v(model, doc_vectors, query):
    """
    performs a search by ranking documents based on their similarity to the query
    """
    # tokenize and preprocess the query
    query_terms = preprocess_text(query)
    # rank documents based on similarity to the query
    return rank_documents_w2v(query_terms, model, doc_vectors)

In [None]:
# test queries (defined in part 2)
test_queries = {
    1: 'men cotton shirt',
    2: 'women casual polo neck',
    3: 'men regular fit tshirt',
    4: 'zipper sweater',
    5: 'solid round neck cotton'
}

tokenized_corpus = data[USED_TEXT_COLUMNS].apply(lambda x: preprocess_text(' '.join(x.values)), axis=1).to_list()

In [None]:
# train Word2Vec model (note: with workers > 1, the seed alone does not guarantee identical vectors across runs)
w2v_model = Word2Vec(sentences=tokenized_corpus, 
                     vector_size=100, 
                     window=5, 
                     min_count=1, 
                     workers=4, 
                     seed=42)

In [20]:
# create document vectors
w2d_doc_vectors = create_index_w2v(w2v_model, data, USED_TEXT_COLUMNS)

In [21]:
# EXAMPLE SEARCH
print_top_k_results(search_w2v(w2v_model, w2d_doc_vectors, 'zipper sweater'), k=20)

Rank   | Document ID          |      Score
1      | TKPEXZA5PYYD3PUW     |      0.724
2      | TKPEGZ7NSMS42CUH     |      0.697
3      | TKPEGZ7NPFPBKRQU     |      0.697
4      | SRTEXZA6DWSXYWPT     |      0.693
5      | TKPEGZ7M7JQB6NPG     |      0.691
6      | TKPEGZ7NVWAZEK3C     |      0.685
7      | TKPEGZ7MJH25GHGX     |      0.681
8      | TKPEGZ7HVXNTZAVX     |      0.679
9      | TKPEGZ7MBAUKYGWR     |      0.675
10     | TKPFK5KGYRBNKZKK     |      0.670
11     | SWSFXMJG5SHFFX56     |      0.670
12     | TKPEGZ7NRH2RYMFM     |      0.669
13     | TKPEGZ7HXE6TZB3G     |      0.669
14     | TKPFPT6HJMNYFPYH     |      0.668
15     | JCKFY9HZUCMZ9XUZ     |      0.667
16     | JCKFVEVF24KXKB2Y     |      0.666
17     | TKPEGZ7H4U4NCHGP     |      0.664
18     | SWSFXMKPWFS9HRGS     |      0.663
19     | SRTFVEVFFMSRZF9F     |      0.661
20     | TKPEXZA5Z5TZZS2A     |      0.660


In [22]:
display(get_top_k_results(data, search_w2v(w2v_model, w2d_doc_vectors, 'zipper sweater'), k=20))

Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,0.724234,TKPEXZA5PYYD3PUW,solid women multicolor track pant,women zipper lower black colour humbert crown ...,humbe,clothing and accessories,bottomwear,humbert
1,0.697152,TKPEGZ7NPFPBKRQU,solid women gold track pant,women zipper lower mous colorhumbert authent e...,humbe,clothing and accessories,bottomwear,humbert
2,0.697152,TKPEGZ7NSMS42CUH,solid women gold track pant,women zipper lower mous colorhumbert authent e...,humbe,clothing and accessories,bottomwear,humbert
3,0.692824,SRTEXZA6DWSXYWPT,solid men multicolor regular short,men zipper short gray milanch humbert star log...,humbe,clothing and accessories,bottomwear,humbert
4,0.69093,TKPEGZ7M7JQB6NPG,solid men silver track pant,men zipper lower grey milanch colorhumbert aut...,humbe,clothing and accessories,bottomwear,humbert
5,0.685393,TKPEGZ7NVWAZEK3C,solid men gold track pant,men doubl folder lower mous colorhumbert leaf ...,humbe,clothing and accessories,bottomwear,humbert
6,0.680768,TKPEGZ7MJH25GHGX,solid men brown track pant,men zipper lower brown colorhumbert authent em...,humbe,clothing and accessories,bottomwear,humbert
7,0.678609,TKPEGZ7HVXNTZAVX,solid women silver track pant,women doubl folder lower grey milanch colorhum...,humbe,clothing and accessories,bottomwear,humbert
8,0.675006,TKPEGZ7MBAUKYGWR,solid women black track pant,women zipper lower black colorhumbert authent ...,humbe,clothing and accessories,bottomwear,humbert
9,0.670209,TKPFK5KGYRBNKZKK,solid women silver track pant,cotton blend fabric track pant grey milanch co...,humbe,clothing and accessories,bottomwear,humbert


In [23]:
# for our test queries (defined in part 2)
for qid, query in test_queries.items():
    print(f"Results for Query {qid}: '{query}'")
    results = search_w2v(w2v_model, w2d_doc_vectors, query)
    display(get_top_k_results(data, results, k=20))

Results for Query 1: 'men cotton shirt'


Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,0.86763,TSHEGHGENWAEJ3JV,solid men polo neck yellow tshirt,beat heat summer scott men cotton solid polo t...,scott internation,clothing and accessories,topwear,switzinc
1,0.86713,TSHEGHG2YWG7TZJQ,solid men polo neck black tshirt,beat heat summer scott men cotton solid polo t...,scott internation,clothing and accessories,topwear,switzinc
2,0.864871,TSHEGHGHJTJG5ZF8,solid women polo neck white tshirt,beat heat summer scott women cotton solid polo...,scott internation,clothing and accessories,topwear,switzinc
3,0.864871,TSHEGHG65MQWR2GF,solid women polo neck white tshirt,beat heat summer scott women cotton solid polo...,scott internation,clothing and accessories,topwear,switzinc
4,0.864425,TSHEGHHYE2MPGGUS,solid women polo neck red tshirt,beat heat summer scott women cotton solid polo...,scott internation,clothing and accessories,topwear,switzinc
5,0.859598,SHTFTK76NMTN7HRP,men regular fit solid formal shirt,shirt made cotton fabric,sora,clothing and accessories,topwear,sorang
6,0.859015,SHTFTK76FUYYGKXK,men regular fit solid casual shirt,shirt made cotton fabric,sora,clothing and accessories,topwear,sorang
7,0.855106,SHTFTK76BEPJKDZ7,women regular fit solid casual shirt,shirt made cotton fabric,sora,clothing and accessories,topwear,sorang
8,0.851711,SHTFCF8DGPUTHS93,women regular fit solid casual shirt,cotton shirt superior experiencecotton shirt p...,mo,clothing and accessories,topwear,kksons
9,0.843327,SHTFTJWCPVTWGBYW,men regular fit print casual shirt,shirt made cotton fabric,sora,clothing and accessories,topwear,sorang


Results for Query 2: 'women casual polo neck'


Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,0.909401,TSHEM9Y6QZGGZFEV,solid women polo neck blue tshirt,sport navi half polo tshirt women,t10 spor,clothing and accessories,topwear,t10sports
1,0.907842,TSHEM8ZZ4HZJBTCC,solid men polo neck blue tshirt,sport navi half polo tshirt men,t10 spor,clothing and accessories,topwear,t10sports
2,0.907842,TSHEMA3YS7HEPCUY,solid men polo neck blue tshirt,sport navi half polo tshirt men,t10 spor,clothing and accessories,topwear,t10sports
3,0.905861,TSHFZHPBVYPHHC2J,solid men polo neck pink tshirt,short sleev polo shirt men,ganti,clothing and accessories,topwear,sanchalsales
4,0.904814,TSHFZHPBZ6QSYFRN,solid women polo neck red tshirt,short sleev polo shirt women,ganti,clothing and accessories,topwear,sanchalsales
5,0.904193,TSHFUMM2UQCQV5QA,stripe women polo neck pink tshirt,slim fit women casual tshirt featur polo neck ...,arbo,clothing and accessories,topwear,arbor
6,0.904029,TSHFUMM2WXY73YHC,stripe women polo neck dark blue tshirt,slim fit women casual tshirt featur polo neck ...,arbo,clothing and accessories,topwear,arbor
7,0.902731,TSHEM8ZHJHJHNDMU,solid men polo neck brown tshirt,sport coffe half polo tshirt men,t10 spor,clothing and accessories,topwear,t10sports
8,0.902606,TSHFZHPBXMRMMUY4,solid men polo neck green tshirt,short sleev polo shirt men,ganti,clothing and accessories,topwear,sanchalsales
9,0.90234,TSHFUMM2F6VHKRDJ,stripe women polo neck light blue tshirt,slim fit women casual tshirt featur polo neck ...,arbo,clothing and accessories,topwear,arbor


Results for Query 3: 'men regular fit tshirt'


Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,0.888354,SHTFNHGDFHCAKVJZ,men slim fit print sport shirt,men casual shirt featur half sleev regular fit...,byford by pantaloo,clothing and accessories,topwear,aum3etail
1,0.888023,SHTFNXNX4JFVZMGF,men slim fit solid sport shirt,men solid casual shirt featur collar neck full...,byford by pantaloo,clothing and accessories,topwear,aum3etail
2,0.886283,SHTFNXNYHBYQTQPQ,women slim fit solid sport shirt,women solid casual shirt featur collar neck fu...,byford by pantaloo,clothing and accessories,topwear,aum3etail
3,0.886283,SHTFNXNXHCKQH4GD,women slim fit solid sport shirt,women solid casual shirt featur collar neck fu...,byford by pantaloo,clothing and accessories,topwear,aum3etail
4,0.886283,SHTFNXNXFHAAD4TE,women slim fit solid sport shirt,women solid casual shirt featur collar neck fu...,byford by pantaloo,clothing and accessories,topwear,aum3etail
5,0.886195,SHTFNXNXFJYFGRFG,men regular fit solid sport shirt,men solid casual shirt featur mandarin collar ...,byford by pantaloo,clothing and accessories,topwear,aum3etail
6,0.884369,SHTFNHGDAQFEETHV,women slim fit print sport shirt,women print casual shirt featur half sleev reg...,byford by pantaloo,clothing and accessories,topwear,aum3etail
7,0.882485,SHTFHV3H2HFE8MZQ,men slim fit print sport shirt,men casual print shirt featur mandarin collar ...,byford by pantaloo,clothing and accessories,topwear,aum3etail
8,0.879313,SHTFNHGD3G6PKZM3,women slim fit print sport shirt,women casual shirt featur collar neck full sle...,byford by pantaloo,clothing and accessories,topwear,aum3etail
9,0.879089,SHTFMZ4AFP4RSUQF,men slim fit print sport shirt,men print casual shirt featur mandarin collar ...,byford by pantaloo,clothing and accessories,topwear,aum3etail


Results for Query 4: 'zipper sweater'


Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,0.724234,TKPEXZA5PYYD3PUW,solid women multicolor track pant,women zipper lower black colour humbert crown ...,humbe,clothing and accessories,bottomwear,humbert
1,0.697152,TKPEGZ7NPFPBKRQU,solid women gold track pant,women zipper lower mous colorhumbert authent e...,humbe,clothing and accessories,bottomwear,humbert
2,0.697152,TKPEGZ7NSMS42CUH,solid women gold track pant,women zipper lower mous colorhumbert authent e...,humbe,clothing and accessories,bottomwear,humbert
3,0.692824,SRTEXZA6DWSXYWPT,solid men multicolor regular short,men zipper short gray milanch humbert star log...,humbe,clothing and accessories,bottomwear,humbert
4,0.69093,TKPEGZ7M7JQB6NPG,solid men silver track pant,men zipper lower grey milanch colorhumbert aut...,humbe,clothing and accessories,bottomwear,humbert
5,0.685393,TKPEGZ7NVWAZEK3C,solid men gold track pant,men doubl folder lower mous colorhumbert leaf ...,humbe,clothing and accessories,bottomwear,humbert
6,0.680768,TKPEGZ7MJH25GHGX,solid men brown track pant,men zipper lower brown colorhumbert authent em...,humbe,clothing and accessories,bottomwear,humbert
7,0.678609,TKPEGZ7HVXNTZAVX,solid women silver track pant,women doubl folder lower grey milanch colorhum...,humbe,clothing and accessories,bottomwear,humbert
8,0.675006,TKPEGZ7MBAUKYGWR,solid women black track pant,women zipper lower black colorhumbert authent ...,humbe,clothing and accessories,bottomwear,humbert
9,0.670209,TKPFK5KGYRBNKZKK,solid women silver track pant,cotton blend fabric track pant grey milanch co...,humbe,clothing and accessories,bottomwear,humbert


Results for Query 5: 'solid round neck cotton'


Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,0.888746,TSHFTC3SWRUHU2VZ,solid men round neck white black beig tshirt pack,men cotton round neck half sleev,flexim,clothing and accessories,topwear,flexikaart
1,0.885041,TSHEM9Y5HRKVQNR9,solid women round neck grey tshirt,sport gray half cotton round tshirt women,t10 spor,clothing and accessories,topwear,
2,0.881899,TSHFKPF9HXNJ6FGU,print men round neck maroon tshirt,maroon kint cotton solid round neckful sleev r...,amp,clothing and accessories,topwear,affgarments
3,0.880426,TSHEM8ZFMJQBDMHU,solid men round neck blue tshirt,sport blue half cotton round tshirt men,t10 spor,clothing and accessories,topwear,t10sports
4,0.878526,TSHFGMMNJ2QG7GSR,stripe women polo neck green tshirt,women tshirt featur polo neck half sleev craft...,byford by pantaloo,clothing and accessories,topwear,aum3etail
5,0.877048,TSHFMWPZSVQRGJE6,color block men round neck maroon tshirt,maroon kint cotton color block round neck full...,amp,clothing and accessories,topwear,affgarments
6,0.87695,TSHEM9Y64JDEYMFC,solid men round neck red tshirt,sport red half bamboo charcoal crew neck round...,t10 spor,clothing and accessories,topwear,t10sports
7,0.8761,TSHFTC58EYURFZMG,solid women round neck multicolor tshirt pack,women cotton round neck half sleev,flexim,clothing and accessories,topwear,flexikaart
8,0.87583,TSHFK68NEVEVVWPM,stripe men polo neck blue tshirt,men stripe tshirt featur polo neck regular fit...,byford by pantaloo,clothing and accessories,topwear,aum3etail
9,0.875622,TSHEGGR76B8X7GXB,solid men round neck red tshirt,scott men red bio wash cotton round neck tshirt,scott internation,clothing and accessories,topwear,switzinc


## 3. Can you imagine a better representation than word2vec?

Justify your answer. (**HINT** - what about Doc2vec? Sentence2vec? What are the pros and cons?)

Word2Vec averaging has several limitations that other representations can address. For instance:
- It only learns word-level embeddings, so it does not model how words combine in longer text: each word has a single vector regardless of its context. It also cannot represent multi-word expressions as a unit.
- Document vectors are the average of their word vectors, which discards word order. It also means that every word contributes equally, regardless of its importance.
- Text length is largely ignored, since averaging normalizes it away. This can be an advantage in some cases and a limitation in others.
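
The averaging limitations listed above can be seen on a toy example. The sketch below uses made-up 2-D word vectors (not taken from the trained model) to show that two phrases with different word order get the same mean vector. It also shows one possible mitigation for the equal-weight issue, an IDF-weighted average. The helper names, the toy corpus, and the `log(N / df) + 1` weighting are assumptions chosen for illustration, not part of the pipeline above.

```python
import math
import numpy as np

# Hypothetical 2-D word vectors, made up for illustration only
toy_vectors = {
    "dress": np.array([1.0, 0.0]),
    "shirt": np.array([0.0, 1.0]),
    "solid": np.array([1.0, 1.0]),
}

def mean_vector(tokens, vectors):
    """Plain average of the word vectors of the known tokens."""
    return np.mean([vectors[t] for t in tokens if t in vectors], axis=0)

def idf_weighted_vector(tokens, vectors, idf):
    """Average of word vectors, weighted by a per-term IDF value."""
    known = [t for t in tokens if t in vectors]
    weights = np.array([idf[t] for t in known])
    stacked = np.vstack([vectors[t] for t in known])
    return weights @ stacked / weights.sum()

# Word order is lost: both phrases map to the same vector
same = np.allclose(mean_vector(["dress", "shirt"], toy_vectors),
                   mean_vector(["shirt", "dress"], toy_vectors))
print("same vector for both word orders:", same)

# Toy corpus used only to derive IDF values ("solid" appears in every document)
toy_corpus = [["solid", "dress"], ["solid", "shirt"], ["solid", "shirt", "dress"], ["solid"]]
n_docs = len(toy_corpus)
idf = {t: math.log(n_docs / sum(t in doc for doc in toy_corpus)) + 1.0 for t in toy_vectors}

plain = mean_vector(["solid", "shirt"], toy_vectors)
weighted = idf_weighted_vector(["solid", "shirt"], toy_vectors, idf)
print("plain mean:   ", plain)
print("IDF-weighted: ", weighted)
```

In the weighted version, the rarer term "shirt" pulls the document vector toward itself, while the ubiquitous term "solid" contributes less. Word order is still ignored, so this only addresses the equal-weight limitation.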

Doc2Vec improves on Word2Vec averaging because it learns a single vector per document during training, instead of averaging the word vectors within that document. Because the document vector is trained together with the words it contains, it also captures some context. However, it needs more training data to perform well.

Sentence2Vec-style sentence embeddings are also a good option because they can capture semantic meaning at the sentence level (word order and structure can be learned). The models are more complex, so they require more training data, and training can be slower and harder to tune.

There are also transformer-based embeddings (BERT, SBERT, MiniLM), which are the state of the art in text representation. They offer the best semantic interpretation, are context-aware, are often available as pretrained models ready to use, and very often outperform Word2Vec and Doc2Vec. However, they have a heavy computational cost and typically require GPUs for training.

(https://cholakovit.com/en/ai/embeddings/sentence-transformers)
