# Part 3: Ranking and filtering

In this part of the project, you will experiment with different ranking algorithms that can be
applied in a search engine. Your task is to design and implement a retrieval pipeline that:
- Takes a query as input (a piece of text).
- Finds all documents that contain all query terms (conjunctive query, i.e., AND semantics).
- Sorts the matching documents by relevance using different ranking methods.

The main goal of this assignment is to explore and compare various relevance scoring approaches. By the end, you should be able to analyze how different algorithms affect the ranking order of documents.

**Important: For this assignment, we only consider conjunctive queries (AND). This means that a document is included in the results only if it contains every word from the query.**
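
As an illustration of AND semantics, the matching step can be sketched as a set intersection over posting lists. This is a minimal sketch, not the notebook's implementation: `conjunctive_match` and `toy_index` are hypothetical names, and the index layout (term mapped to a list of `[doc_id, positions]` entries) is assumed to match the inverted index built later in this notebook.

```python
def conjunctive_match(query_terms, index):
    """Return the set of doc ids that contain every query term (AND semantics)."""
    posting_sets = []
    for term in query_terms:
        if term not in index:
            # a term missing from the index means no document can match all terms
            return set()
        posting_sets.append({posting[0] for posting in index[term]})
    if not posting_sets:
        return set()
    # intersect starting from the smallest set to keep intermediate results small
    posting_sets.sort(key=len)
    return set.intersection(*posting_sets)


# hypothetical toy index: term -> list of [doc_id, positions]
toy_index = {
    'zipper': [['d1', [0]], ['d3', [2]]],
    'sweater': [['d1', [1]], ['d2', [0]], ['d3', [0]]],
}

matches = conjunctive_match(['zipper', 'sweater'], toy_index)
print(sorted(matches))
```

Only documents that appear in the posting list of every query term survive the intersection; a document containing just one of the terms is dropped.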

## Prelude

### Imports

In [4]:
import os, collections, string, re, math
from collections import defaultdict
from array import array
import pandas as pd
import numpy as np
import numpy.linalg as la

from unidecode import unidecode
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

from rank_bm25 import BM25Okapi

### Data loading

In [5]:
# DATA LOADING
DATA_PATH =  os.path.join(os.getcwd(), '../../data/')

data = pd.read_csv(os.path.join(DATA_PATH, 'fashion_products_cleaned.csv'))
USED_TEXT_COLUMNS = ['title', 'description', 'brand', 'category', 'sub_category', 'seller']
# USED_TEXT_COLUMNS = ['title', 'description']
# USED_TEXT_COLUMNS = ['title']

data[USED_TEXT_COLUMNS] = data[USED_TEXT_COLUMNS].fillna('')

### Functions

In [6]:
# Preprocessing function used in parts 1 and 2
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
translator = str.maketrans('', '', string.punctuation)

def preprocess_text(text):
    text = text.lower() # Lowercase
    text = text.translate(translator) # Remove punctuation
    text = unidecode(text) # normalize
    tokens = word_tokenize(text) # Tokenization
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words] # Remove stopwords and non-alphabetic tokens
    stemmed_tokens = [stemmer.stem(word) for word in tokens] # Stemming 
    stemmed_tokens = [word for word in stemmed_tokens if len(word) > 2] # Remove short tokens
    return stemmed_tokens

def print_top_k_results(ranked_documents, k=20, columns=USED_TEXT_COLUMNS):
    # Print header
    print("=" * 42)
    print(f"{'Rank':<6} | {'Document ID':<20} | {'Score':>10}")
    print("=" * 42)

    # Print each document row
    for i, (score, doc) in enumerate(ranked_documents[:k], 1):
        print(f"{i:<6} | {doc:<20} | {score:>10.3f}")

    print("=" * 42)

def get_top_k_results(data: pd.DataFrame, ranked_documents: list[list], k: int | str = 'all', columns: list[str] = USED_TEXT_COLUMNS):
    '''
    Parameters
    -----
        data: pandas dataframe loaded from the cleaned csv file
        ranked_documents: return of search_tf_idf or other search method
        k: number of top documents to return, or the string 'all' (default) to return every document
        columns: columns used for text searching, should be defined globally
    '''
    ranked_documents_df = pd.DataFrame(ranked_documents, columns=['score', 'pid'])

    # they should be already ordered but just make sure
    ranked_documents_df = ranked_documents_df.sort_values('score', ascending=False).reset_index(drop=True)
    
    if k == 'all':
        return ranked_documents_df.merge(data[['pid'] + columns], on='pid', how='left')
    
    return ranked_documents_df.merge(data[['pid'] + columns], on='pid', how='left')[:k]

## 1. You’re asked to provide 3 different ways of ranking:

### a. TF-IDF + cosine similarity

Classical scoring, which we have also seen during the practical labs.
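
For reference, the quantities computed by the TF-IDF code in this section can be written as follows. This is a reading of the code rather than a formula taken from the course material, and the notation is assumed here: $f_{t,d}$ is the raw count of term $t$ in document $d$, $N$ is the number of documents, and $df_t$ is the number of documents containing $t$. The score is given for a query $q$ without repeated terms.

$$
\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sqrt{\sum_{t'} f_{t',d}^{2}}}, \qquad
\mathrm{idf}_t = \ln\frac{N}{df_t}, \qquad
\mathrm{score}(q, d) = \sum_{t \in q} \left(\mathrm{tf}_{t,q}\,\mathrm{idf}_t\right)\left(\mathrm{tf}_{t,d}\,\mathrm{idf}_t\right)
$$

Because the sum only runs over the query terms and each factor carries an idf weight, the score is a dot product that is not bounded by 1.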

In [7]:
def create_index_tfidf(data, columns=['title', 'description', 'category']):
    '''
    Implement the inverted index and compute tf, df and idf

    Arguments:
    data -- pandas DataFrame with one row per document, including a 'pid' column
    columns -- text columns that are joined to form the text of each document

    Returns:
    index - the inverted index (implemented through a Python dictionary) containing terms as keys and, as values,
    the list of documents each term appears in (with its positions).
    tf - normalized term frequency for each term in each document
    df - number of documents each term appears in
    idf - inverse document frequency of each term
    '''

    index = defaultdict(list)
    tf = defaultdict(list)  #term frequencies of terms in documents (documents in the same order as in the main index)
    df = defaultdict(int)  #document frequencies of terms in the corpus
    idf = defaultdict(float)
    N = len(data.index)

    for _, row in data.iterrows():
        
        page_id = row['pid']
        terms = preprocess_text(' '.join(row[columns].values))

        ## ===============================================================
        ## create the index for the **current page** and store it in current_page_index
        ## current_page_index ==> { 'term1': [current_doc, [list of positions]], ..., 'term_n': [current_doc, [list of positions]]}

        ## Example: if the curr_doc has id 1 and its text is
        ## 'web retrieval information retrieval':

        ## current_page_index ==> { 'web': [1, [0]], 'retrieval': [1, [1,3]], 'information': [1, [2]]}

        ## the term 'web' appears in document 1 in position 0,
        ## the term 'retrieval' appears in document 1 in positions 1 and 3
        ## ===============================================================

        current_page_index = {}

        for position, term in enumerate(terms):  # terms holds the preprocessed text of all selected columns
            try:
                # if the term is already in the dict, append the position to its list
                current_page_index[term][1].append(position)
            except KeyError:
                # otherwise add the term as a new key and start its array of positions
                current_page_index[term] = [page_id, array('I', [position])]  # 'I' is the typecode for unsigned int

        # Normalize term frequencies.
        # The denominator is the Euclidean norm of the raw term counts of the document,
        # so it is the same for all terms of a document.
        norm = 0
        for term, posting in current_page_index.items():
            # posting ==> [current_doc, [list of positions]]
            # the number of positions is the raw frequency of the term in the document
            norm += len(posting[1]) ** 2
        norm = math.sqrt(norm)

        # compute the tf (raw term frequency divided by the norm) and update the df
        for term, posting in current_page_index.items():
            tf[term].append(np.round(len(posting[1]) / norm, 4))
            # document frequency: number of documents containing the term
            df[term] += 1

        # merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

    # Compute the IDF once all document frequencies are known: idf = ln(N / df)
    for term in df:
        idf[term] = np.round(np.log(float(N / df[term])), 4)

    return index, tf, df, idf

def rank_documents(terms, docs, index, tf, idf):
    '''
    Perform the ranking of the results of a search based on the tf-idf weights

    Arguments:
    terms -- list of query terms
    docs -- list of documents to rank (those matching the query)
    index -- inverted index data structure
    tf -- term frequencies
    idf -- inverse document frequencies

    Returns:
    doc_scores -- list of [score, doc_id] pairs sorted by decreasing score
    '''

    # Only the elements of each document vector that correspond to the query terms matter:
    # the remaining elements would become 0 when multiplied by the query vector.
    # Accessing doc_vectors[k] for a missing key k automatically adds (k, [0] * len(terms)).
    doc_vectors = defaultdict(lambda: [0] * len(terms))
    query_vector = [0] * len(terms)
    docs = set(docs)  # constant-time membership tests in the loop below

    # frequency of each term in the query, used to compute the query tf
    # Example: collections.Counter(['hello','hello','world']) --> Counter({'hello': 2, 'world': 1})
    query_terms_count = collections.Counter(terms)

    # Euclidean norm of the query term counts
    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query
        if term not in index:
            continue

        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex]= query_terms_count[term] / query_norm * idf[term] #query_vector[0] corresponds to the first term in the query

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):
            # Example of [doc_index, (doc, postings)]
            # 0 (26, array('I', [1, 4, 12, 15, 22, 28, 32, 43, 51, 68, 333, 337]))
            # 1 (33, array('I', [26, 33, 57, 71, 87, 104, 109]))
            # term is in doc 26 in positions 1,4, .....
            # term is in doc 33 in positions 26,33, .....

            # tf[term][doc_index] is the tf of 'term' in this document
            if doc in docs:  # only score documents retrieved for the query
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term]

    # Score each document with the dot product between its vector and the query vector.
    # The tf values are normalized, but the vectors are restricted to the query terms and
    # weighted by idf, so the scores are not bounded by 1.
    doc_scores = [[np.dot(curDocVec, query_vector), doc] for doc, curDocVec in doc_vectors.items()]
    doc_scores.sort(reverse=True)
    if len(doc_scores) == 0:
        print('No results found, try again')
    return doc_scores

def search_tf_idf(query, index, tf, idf):
    '''
    Retrieve and rank the documents that contain any of the query terms:
    the posting lists of the query terms are combined with a set union (OR semantics).
    Note that the assignment statement asks for conjunctive (AND) queries, which would
    require intersecting the posting lists instead.
    '''
    query = preprocess_text(query)
    docs = set()
    for term in query:
        if term not in index:
            continue  # term is not in the index
        # ids of the docs that contain 'term'
        term_docs = [posting[0] for posting in index[term]]
        docs = docs.union(term_docs)
    docs = list(docs)
    ranked_docs = rank_documents(query, docs, index, tf, idf)
    return ranked_docs

In [8]:
inverted_index, tf_index, df_index, idf_index = create_index_tfidf(data, USED_TEXT_COLUMNS)

In [None]:
# EXAMPLE SEARCH before we decide real queries used in all methods
print_top_k_results(search_tf_idf('zipper sweater', inverted_index, tf_index, idf_index))

Rank   | Document ID          |      Score
1      | SWSFMJZHZFUCHHTH     |     11.087
2      | SWTFHFY4QXVP2EPQ     |      8.927
3      | SWTFHFY4QPNNNYED     |      8.927
4      | SWTFMQYG6ENPEEGN     |      8.256
5      | SWTFMQYZSHVMUCNY     |      8.094
6      | SWTFXZVZNXFDCCNY     |      7.489
7      | SWTFVMSDDZRPXAPT     |      7.296
8      | SWTFQFZTNZWGHGK9     |      7.296
9      | SWTFQFPBSTQD4XPZ     |      7.240
10     | SWTFH2GZF24UJXJK     |      6.880
11     | SWTFMARTZVUY7MAY     |      6.076
12     | SWTFMARTNHCDE56Y     |      6.076
13     | SWTFMARSFXNNRSKW     |      6.076
14     | SWTFMARSGQZRZQU3     |      6.004
15     | SWTFMARSGHPNXTJY     |      6.004
16     | SWTFMARTTT9GQF8G     |      5.934
17     | SWTFMARTMHFAMZNK     |      5.934
18     | SWTFMART7TBXGARV     |      5.934
19     | SWTFYHPKSZDBSBGU     |      5.866
20     | SWTFY245GB9HZCMZ     |      5.866


In [10]:
display(get_top_k_results(data, search_tf_idf('zipper sweater', inverted_index, tf_index, idf_index), k=20))

Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,11.086585,SWSFMJZHZFUCHHTH,full sleev print women sweatshirt,arbour give new trend sweater women arbour pro...,arbo,clothing and accessories,winter wear,arbor
1,8.926861,SWTFHFY4QPNNNYED,stripe round neck casual men beig sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
2,8.926861,SWTFHFY4QXVP2EPQ,solid high neck casual women green sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
3,8.255595,SWTFMQYG6ENPEEGN,solid casual women dark blue sweater,navi solid sweater hood long sleev rib hem max...,szto,clothing and accessories,winter wear,shreyashfashions
4,8.094102,SWTFMQYZSHVMUCNY,stripe neck casual men grey sweater,charcoal grey solid sweater vneck long sleev s...,szto,clothing and accessories,winter wear,
5,7.48899,SWTFXZVZNXFDCCNY,solid neck casual men multicolor sweater,mash unlimit mustard navi pullov sweater,mash unlimit,clothing and accessories,winter wear,highstreet trendz llp
6,7.296366,SWTFQFZTNZWGHGK9,solid round neck casual men dark blue sweater,navi blue solid sweater round neck long sleev ...,lev,clothing and accessories,winter wear,kondefashions
7,7.296366,SWTFVMSDDZRPXAPT,stripe round neck casual women revers red sweater,red black stripe revers sweater round neck lon...,lev,clothing and accessories,winter wear,
8,7.239941,SWTFQFPBSTQD4XPZ,self design round neck casual men dark blue sw...,navi blue selfdesign sweater round neck long s...,lev,clothing and accessories,winter wear,kondefashions
9,6.879987,SWTFH2GZF24UJXJK,stripe neck casual women blue sweater,women casual revers sweater featur neck sleev ...,byford by pantaloo,clothing and accessories,winter wear,aum3etail


### b. BM25

In [11]:
bm25 = BM25Okapi(data.apply(lambda x: ' '.join(x[USED_TEXT_COLUMNS].values).split(' '), axis=1).to_list())

In [13]:
def search_BM25(bm25, data, query, k=10):
    # preprocess the query and turn it from a string into a list of terms
    query = preprocess_text(query)

    # score every document against the query
    scores = np.array(bm25.get_scores(query))

    # never ask for more results than there are documents
    k = min(k, len(scores))

    # get the indices of the top k scores (unordered)
    idx = np.argpartition(scores, -k)[-k:]

    # if all the top k scores are 0, return an empty list
    if np.sum(scores[idx]) == 0:
        return []

    # sort the top k indices by descending score
    top_indices = idx[np.argsort(-scores[idx])]

    # build (score, doc_id) pairs
    result = [(scores[i], data.iloc[i]['pid'] if 'pid' in data.columns else i) for i in top_indices]

    return result

In [14]:
search_tf_idf('zipper sweater', inverted_index, tf_index, idf_index)

[[np.float64(11.086584932065335), 'SWSFMJZHZFUCHHTH'],
 [np.float64(8.92686059465001), 'SWTFHFY4QXVP2EPQ'],
 [np.float64(8.92686059465001), 'SWTFHFY4QPNNNYED'],
 [np.float64(8.255594922210113), 'SWTFMQYG6ENPEEGN'],
 [np.float64(8.094102021304282), 'SWTFMQYZSHVMUCNY'],
 [np.float64(7.4889900673077365), 'SWTFXZVZNXFDCCNY'],
 [np.float64(7.296366004781504), 'SWTFVMSDDZRPXAPT'],
 [np.float64(7.296366004781504), 'SWTFQFZTNZWGHGK9'],
 [np.float64(7.239940774344527), 'SWTFQFPBSTQD4XPZ'],
 [np.float64(6.879986718108641), 'SWTFH2GZF24UJXJK'],
 [np.float64(6.076413608782037), 'SWTFMARTZVUY7MAY'],
 [np.float64(6.076413608782037), 'SWTFMARTNHCDE56Y'],
 [np.float64(6.076413608782037), 'SWTFMARSFXNNRSKW'],
 [np.float64(6.0044227975348585), 'SWTFMARSGQZRZQU3'],
 [np.float64(6.0044227975348585), 'SWTFMARSGHPNXTJY'],
 [np.float64(5.934377683888957), 'SWTFMARTTT9GQF8G'],
 [np.float64(5.934377683888957), 'SWTFMARTMHFAMZNK'],
 [np.float64(5.934377683888957), 'SWTFMART7TBXGARV'],
 [np.float64(5.86627826784

In [15]:
search_BM25(bm25, data, 'zipper sweater')

[(np.float64(9.538064250977918), 'SWSFMJZHZFUCHHTH'),
 (np.float64(9.440834650057766), 'RNCF4YV4GZRYJY3A'),
 (np.float64(8.590108236281917), 'SWTFHFY4QXVP2EPQ'),
 (np.float64(8.590108236281917), 'SWTFHFY4QPNNNYED'),
 (np.float64(8.30063703703381), 'SWTFXZVZNXFDCCNY'),
 (np.float64(8.195778855474906), 'SWTFMQYZSHVMUCNY'),
 (np.float64(8.195778855474906), 'SWTFMQYG6ENPEEGN'),
 (np.float64(7.96034711392507), 'SWTFQFZTNZWGHGK9'),
 (np.float64(7.96034711392507), 'SWTFVMSDDZRPXAPT'),
 (np.float64(7.859233637591256), 'SWTFH2GZF24UJXJK')]

In [16]:
print_top_k_results(search_BM25(bm25, data, 'zipper sweater'), k=20)

Rank   | Document ID          |      Score
1      | SWSFMJZHZFUCHHTH     |      9.538
2      | RNCF4YV4GZRYJY3A     |      9.441
3      | SWTFHFY4QXVP2EPQ     |      8.590
4      | SWTFHFY4QPNNNYED     |      8.590
5      | SWTFXZVZNXFDCCNY     |      8.301
6      | SWTFMQYZSHVMUCNY     |      8.196
7      | SWTFMQYG6ENPEEGN     |      8.196
8      | SWTFQFZTNZWGHGK9     |      7.960
9      | SWTFVMSDDZRPXAPT     |      7.960
10     | SWTFH2GZF24UJXJK     |      7.859


In [17]:
display(get_top_k_results(data, search_BM25(bm25, data, 'zipper sweater'), k=20))

Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,9.538064,SWSFMJZHZFUCHHTH,full sleev print women sweatshirt,arbour give new trend sweater women arbour pro...,arbo,clothing and accessories,winter wear,arbor
1,9.440835,RNCF4YV4GZRYJY3A,solid men raincoat,rainsuit rain coat jacket pant men women kid f...,the dry ca,clothing and accessories,raincoats,newera
2,8.590108,SWTFHFY4QXVP2EPQ,solid high neck casual women green sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
3,8.590108,SWTFHFY4QPNNNYED,stripe round neck casual men beig sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
4,8.300637,SWTFXZVZNXFDCCNY,solid neck casual men multicolor sweater,mash unlimit mustard navi pullov sweater,mash unlimit,clothing and accessories,winter wear,highstreet trendz llp
5,8.195779,SWTFMQYZSHVMUCNY,stripe neck casual men grey sweater,charcoal grey solid sweater vneck long sleev s...,szto,clothing and accessories,winter wear,
6,8.195779,SWTFMQYG6ENPEEGN,solid casual women dark blue sweater,navi solid sweater hood long sleev rib hem max...,szto,clothing and accessories,winter wear,shreyashfashions
7,7.960347,SWTFQFZTNZWGHGK9,solid round neck casual men dark blue sweater,navi blue solid sweater round neck long sleev ...,lev,clothing and accessories,winter wear,kondefashions
8,7.960347,SWTFVMSDDZRPXAPT,stripe round neck casual women revers red sweater,red black stripe revers sweater round neck lon...,lev,clothing and accessories,winter wear,
9,7.859234,SWTFH2GZF24UJXJK,stripe neck casual women blue sweater,women casual revers sweater featur neck sleev ...,byford by pantaloo,clothing and accessories,winter wear,aum3etail


### c. Your Score

Here, the task is to create a new score. (Be creative: think about what factors could make a document more relevant to a query and include them in your formula.)

Explain how the ranking differs when using TF-IDF and BM25, and think about the pros and cons of using each of them. Regarding your own score, justify the choice of the score (pros and cons). HINT: Look into numerical fields that each record has to build your score.

**Custom Score Explanation**

The idea is to combine textual relevance (as in TF-IDF and BM25) with numerical relevance in a single score. For example, a higher average rating or a higher discount can make a product more relevant to the user, and relevance can decrease as the price increases.

We provide one function that combines all of these numerical fields with the text score, and another where the user chooses which numerical column determines the order.

The options for relevance order would be: 
- highest average_rating first
- highest price first
- lowest price first
- highest discount first
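
The weighted combination described above can be sketched for a single product as follows. This is a minimal sketch under stated assumptions, not the notebook's implementation: `combined_score` is a hypothetical helper, the weights and the 5-point rating and 100-point discount scales mirror the code in the next cell, and the price transform `1 / (1 + log1p(price))` is an assumption chosen here so that the price feature stays in (0, 1] and cheaper products score higher.

```python
import math


def combined_score(text_score, max_text_score, rating, discount, in_stock, price,
                   weights=(0.4, 0.3, 0.2, 0.05, 0.05)):
    """Blend a text score with numerical product features, each scaled to [0, 1]."""
    text_part = text_score / max_text_score if max_text_score else 0.0
    rating_part = (rating or 0.0) / 5          # ratings assumed to be on a 0-5 scale
    discount_part = (discount or 0.0) / 100    # discounts assumed to be percentages
    stock_part = 1.0 if in_stock else 0.0
    # assumed price transform: 1 for a free item, decreasing slowly as the price grows
    price_part = 1 / (1 + math.log1p(price)) if price is not None else 0.0
    parts = (text_part, rating_part, discount_part, stock_part, price_part)
    return sum(w * p for w, p in zip(weights, parts))


cheap = combined_score(5.0, 10.0, rating=4.0, discount=50, in_stock=True, price=100)
pricey = combined_score(5.0, 10.0, rating=4.0, discount=50, in_stock=True, price=1000)
print(cheap > pricey)
```

Keeping every feature in the same range makes the weights directly comparable: with the weights above, the text score can contribute at most 0.4 and the price at most 0.05.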


In [33]:
def compute_custom_score(df, query, columns=USED_TEXT_COLUMNS, method='tfidf'):
    
    results = []

    # compute text scores
    if method == 'tfidf':
        index, tf, _, idf = create_index_tfidf(df, columns)
        ranked_docs = search_tf_idf(query, index, tf, idf)

    elif method == 'bm25':
        bm25 = BM25Okapi(df.apply(lambda x: ' '.join(x[columns].values).split(' '), axis=1).to_list())
        ranked_docs = search_BM25(bm25, df, query, k=len(df))
        
    else:
        raise ValueError("Method must be 'tfidf' or 'bm25'")

    if not ranked_docs:
        return []
    
    doc_ids = [pid for score, pid in ranked_docs]
    text_scores = [score for score, pid in ranked_docs]
    max_text_score = max(text_scores) if len(text_scores) > 0 else 1

    for i, pid in enumerate(doc_ids): 
        row = df[df['pid'] == pid ].iloc[0]

        # normalize text score
        text_score = text_scores[i] / max_text_score

        # numerical features
        rating_score = row['average_rating'] / 5 if not pd.isna(row['average_rating']) else 0
        discount_score = row['discount'] / 100 if not pd.isna(row['discount']) else 0
        availability_score = 1 if row['out_of_stock'] == 0 else 0
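        # note: this expression simplifies to -log1p(selling_price), so it is negative and
        # unbounded below, unlike the other features, which are scaled to [0, 1]
        price_score = 1 - (1 + np.log1p(row['selling_price'])) if not pd.isna(row['selling_price']) else 0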

        # combine with weights
        combined_score = (0.4 * text_score +
                          0.3 * rating_score +
                          0.2 * discount_score +
                          0.05 * availability_score +
                          0.05 * price_score)
        
        results.append((combined_score, pid))

    results.sort(reverse=True)
    return results

def rank_by_num(df, criterion='average_rating', order='desc', query='zipper sweater'):
    # retrieve the documents matching the query with TF-IDF (uses the global index built above)
    ranked_docs = search_tf_idf(query, inverted_index, tf_index, idf_index)

    # filter df to only the documents retrieved
    pids = [pid for _, pid in ranked_docs]
    df_filtered = df[df['pid'].isin(pids)].copy()

    # sort by chosen criterion
    df_filtered.sort_values(by=criterion, ascending=(order=='asc'), inplace=True)

    # return a list of (text_score, pid) in the order given by the numerical criterion;
    # get_top_k_results re-sorts by score, so only print_top_k_results shows this order
    score_mapping = {pid: score for score, pid in ranked_docs}
    results = [(score_mapping[row['pid']], row['pid']) for _, row in df_filtered.iterrows()]

    return results

In [32]:
custom_ranking_tfidf = compute_custom_score(data, 'zipper sweater', method='tfidf')

print_top_k_results(custom_ranking_tfidf, k=20)
display(get_top_k_results(data, custom_ranking_tfidf, k=20))

Rank   | Document ID          |      Score
1      | SWSFMJZHZFUCHHTH     |      0.431
2      | SWTFMQYZSHVMUCNY     |      0.410
3      | SWTFYGS7GFGWVX4J     |      0.373
4      | SWTFYGQ7S4KYESNY     |      0.344
5      | SWTFQFPBSTQD4XPZ     |      0.343
6      | SWTFYGQHYZWGHJ6S     |      0.333
7      | SWTFHFY4QPNNNYED     |      0.328
8      | SWTFXZVZNXFDCCNY     |      0.327
9      | SWTFQFZTNZWGHGK9     |      0.316
10     | SWTFYGQAJ53QGAW8     |      0.305
11     | SWTFH224JA88NGGQ     |      0.305
12     | SWTFYGPQREHCG5ZK     |      0.281
13     | CAPF7FWGTQXHUQJS     |      0.273
14     | CAPF7FWGG3QZSXPE     |      0.273
15     | CAPF7FWG7HHD9YAR     |      0.273
16     | SWTFZHHBGGFUTTBW     |      0.271
17     | SWTFYGRSFPU6EZJZ     |      0.271
18     | SWTFHFY4QXVP2EPQ     |      0.270
19     | SWTFYGPVCBXP3PQB     |      0.269
20     | SWTFM4YSZGJE2ZBA     |      0.269


Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,0.430543,SWSFMJZHZFUCHHTH,full sleev print women sweatshirt,arbour give new trend sweater women arbour pro...,arbo,clothing and accessories,winter wear,arbor
1,0.410186,SWTFMQYZSHVMUCNY,stripe neck casual men grey sweater,charcoal grey solid sweater vneck long sleev s...,szto,clothing and accessories,winter wear,
2,0.373094,SWTFYGS7GFGWVX4J,stripe neck casual women orang sweater,,man,clothing and accessories,winter wear,shakticreation
3,0.34374,SWTFYGQ7S4KYESNY,stripe collar neck casual men green sweater,,man,clothing and accessories,winter wear,shakticreation
4,0.342564,SWTFQFPBSTQD4XPZ,self design round neck casual men dark blue sw...,navi blue selfdesign sweater round neck long s...,lev,clothing and accessories,winter wear,kondefashions
5,0.332556,SWTFYGQHYZWGHJ6S,stripe collar neck casual men grey sweater,,man,clothing and accessories,winter wear,shakticreation
6,0.328417,SWTFHFY4QPNNNYED,stripe round neck casual men beig sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
7,0.326548,SWTFXZVZNXFDCCNY,solid neck casual men multicolor sweater,mash unlimit mustard navi pullov sweater,mash unlimit,clothing and accessories,winter wear,highstreet trendz llp
8,0.315754,SWTFQFZTNZWGHGK9,solid round neck casual men dark blue sweater,navi blue solid sweater round neck long sleev ...,lev,clothing and accessories,winter wear,kondefashions
9,0.305162,SWTFYGQAJ53QGAW8,stripe collar neck casual women maroon sweater,,man,clothing and accessories,winter wear,shakticreation


In [34]:
# rank by highest rating
numeric_ranking = rank_by_num(data, criterion='average_rating', order='desc')
print("Numeric-only ranking by highest average_rating:")
print_top_k_results(numeric_ranking, k=20)
display(get_top_k_results(data, numeric_ranking, k=20))

# rank by lowest price
numeric_ranking_low_price = rank_by_num(data, criterion='selling_price', order='asc')
print("Numeric-only ranking by lowest price:")
print_top_k_results(numeric_ranking_low_price, k=20)
display(get_top_k_results(data, numeric_ranking_low_price, k=20))

Numeric-only ranking by highest average_rating:
Rank   | Document ID          |      Score
1      | TKPFJQFWVKGVX4ZT     |      2.796
2      | SWTEX9HRYPTAUMCZ     |      5.617
3      | VESFKGD9Y5EXQKHK     |      2.452
4      | VESFKGDNFEGK8VVC     |      2.452
5      | SRTEXZA6DWSXYWPT     |      4.853
6      | SWTFMQYZSHVMUCNY     |      8.094
7      | SWTFMSQFVZXYCXNY     |      2.203
8      | SWTFY83XBYN5D4HG     |      5.450
9      | CAPF7FWGG3QZSXPE     |      1.881
10     | CAPF7FWGTQXHUQJS     |      1.881
11     | SWSFVEV2SHAPBRGS     |      3.449
12     | JCKFX4GJD4WYQYHM     |      4.246
13     | TSHFPR67UGYHQZQS     |      3.114
14     | CAPF7FWGCPETHTNS     |      1.881
15     | SWTF65GGFRT7EBQ6     |      5.397
16     | CAPF7FWG7HHD9YAR     |      1.881
17     | TKPFPBJCMTSKMGWD     |      2.954
18     | SRTEXZA8UDGSZYB8     |      2.624
19     | TKPFW44SHG2VYJBT     |      1.032
20     | TKPEZAGRRPZU7GY3     |      2.202


Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,11.086585,SWSFMJZHZFUCHHTH,full sleev print women sweatshirt,arbour give new trend sweater women arbour pro...,arbo,clothing and accessories,winter wear,arbor
1,8.926861,SWTFHFY4QPNNNYED,stripe round neck casual men beig sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
2,8.926861,SWTFHFY4QXVP2EPQ,solid high neck casual women green sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
3,8.255595,SWTFMQYG6ENPEEGN,solid casual women dark blue sweater,navi solid sweater hood long sleev rib hem max...,szto,clothing and accessories,winter wear,shreyashfashions
4,8.094102,SWTFMQYZSHVMUCNY,stripe neck casual men grey sweater,charcoal grey solid sweater vneck long sleev s...,szto,clothing and accessories,winter wear,
5,7.48899,SWTFXZVZNXFDCCNY,solid neck casual men multicolor sweater,mash unlimit mustard navi pullov sweater,mash unlimit,clothing and accessories,winter wear,highstreet trendz llp
6,7.296366,SWTFQFZTNZWGHGK9,solid round neck casual men dark blue sweater,navi blue solid sweater round neck long sleev ...,lev,clothing and accessories,winter wear,kondefashions
7,7.296366,SWTFVMSDDZRPXAPT,stripe round neck casual women revers red sweater,red black stripe revers sweater round neck lon...,lev,clothing and accessories,winter wear,
8,7.239941,SWTFQFPBSTQD4XPZ,self design round neck casual men dark blue sw...,navi blue selfdesign sweater round neck long s...,lev,clothing and accessories,winter wear,kondefashions
9,6.879987,SWTFH2GZF24UJXJK,stripe neck casual women blue sweater,women casual revers sweater featur neck sleev ...,byford by pantaloo,clothing and accessories,winter wear,aum3etail


Numeric-only ranking by lowest price:
Rank   | Document ID          |      Score
1      | CAPF7FWGTQXHUQJS     |      1.881
2      | CAPF7FWG7HHD9YAR     |      1.881
3      | CAPF7FWGG3QZSXPE     |      1.881
4      | CAPF7FWGCPETHTNS     |      1.881
5      | TSHFUWQAAECQFGJB     |      1.450
6      | SRTEQXSVMZMPDHEH     |      1.249
7      | SRTEQXSVEYUXTYB9     |      1.252
8      | TSHFVJ3ZF32K8GZY     |      3.302
9      | SRTEQKY62CMVFZNY     |      1.446
10     | TSHFK4VZHGP7BCXF     |      0.992
11     | TKPFHPFKTZC4JGYP     |      1.755
12     | TKPFHNXFYCSVNZWM     |      1.755
13     | TSHFUXPWUR2ZTEE2     |      1.514
14     | SRTEQKY6HWKC3UYX     |      1.446
15     | TKPFHNXKBZYFXGZK     |      1.755
16     | TKPFHPFKZ73M4R5N     |      1.755
17     | TKPFHNX3YWP4CSZG     |      1.755
18     | TKPFHPFKYBBZMPYV     |      1.755
19     | TKPFHNXFRN5HYV65     |      1.755
20     | TKPFHNXKSN6FGMHT     |      1.755


Unnamed: 0,score,pid,title,description,brand,category,sub_category,seller
0,11.086585,SWSFMJZHZFUCHHTH,full sleev print women sweatshirt,arbour give new trend sweater women arbour pro...,arbo,clothing and accessories,winter wear,arbor
1,8.926861,SWTFHFY4QPNNNYED,stripe round neck casual men beig sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
2,8.926861,SWTFHFY4QXVP2EPQ,solid high neck casual women green sweater,full sleev sweater,us polo ass,clothing and accessories,winter wear,retailnet
3,8.255595,SWTFMQYG6ENPEEGN,solid casual women dark blue sweater,navi solid sweater hood long sleev rib hem max...,szto,clothing and accessories,winter wear,shreyashfashions
4,8.094102,SWTFMQYZSHVMUCNY,stripe neck casual men grey sweater,charcoal grey solid sweater vneck long sleev s...,szto,clothing and accessories,winter wear,
5,7.48899,SWTFXZVZNXFDCCNY,solid neck casual men multicolor sweater,mash unlimit mustard navi pullov sweater,mash unlimit,clothing and accessories,winter wear,highstreet trendz llp
6,7.296366,SWTFQFZTNZWGHGK9,solid round neck casual men dark blue sweater,navi blue solid sweater round neck long sleev ...,lev,clothing and accessories,winter wear,kondefashions
7,7.296366,SWTFVMSDDZRPXAPT,stripe round neck casual women revers red sweater,red black stripe revers sweater round neck lon...,lev,clothing and accessories,winter wear,
8,7.239941,SWTFQFPBSTQD4XPZ,self design round neck casual men dark blue sw...,navi blue selfdesign sweater round neck long s...,lev,clothing and accessories,winter wear,kondefashions
9,6.879987,SWTFH2GZF24UJXJK,stripe neck casual women blue sweater,women casual revers sweater featur neck sleev ...,byford by pantaloo,clothing and accessories,winter wear,aum3etail


## 2. Implement **word2vec + cosine ranking** score.

Return a top-20 list of documents for each of the 5 queries defined in Part 2 of your project, using search and word2vec + cosine similarity ranking.

To represent a piece of **text** using **word2vec**, we create a **single vector** that represents the entire text. This vector has the same number of dimensions as the word vectors and is calculated by **averaging the vectors of all words** in the text.

**Example:**

Consider the text:
```
“Wireless Bluetooth headphones with noise cancellation”
```

Suppose we have Word2Vec vectors for each word:

* Wireless → v1
* Bluetooth → v2
* headphones → v3
* with → v4
* noise → v5
* cancellation → v6

All vectors have the same number of dimensions. To represent the text as a single vector, we average the word vectors:

```
Text vector = (v1 + v2 + v3 + v4 + v5 + v6) ÷ 6
```

The resulting vector has the same number of dimensions as the individual word vectors and represents the content of the entire text. This approach allows us to compare texts based on their vector representations for tasks like search or recommendation.
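
The averaging and comparison steps can be sketched with plain Python lists. This is a minimal sketch with made-up numbers: `average_vector`, `cosine` and `toy_vectors` are hypothetical names, and the 3-dimensional vectors are invented for illustration rather than produced by a trained model.

```python
import math


def average_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have one; return None if none do."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# invented toy vectors, standing in for the output of a trained word2vec model
toy_vectors = {
    'wireless': [1.0, 0.0, 0.0],
    'bluetooth': [0.8, 0.2, 0.0],
    'headphones': [0.6, 0.4, 0.0],
    'sweater': [0.0, 0.1, 1.0],
}

query_vec = average_vector(['wireless', 'headphones'], toy_vectors)
doc_a = average_vector(['bluetooth', 'headphones'], toy_vectors)
doc_b = average_vector(['sweater'], toy_vectors)

# rank the two toy documents by decreasing cosine similarity to the query
ranked = sorted([(cosine(query_vec, doc_a), 'doc_a'),
                 (cosine(query_vec, doc_b), 'doc_b')], reverse=True)
print(ranked[0][1])
```

In practice the vectors would come from a trained model (for example gensim's `Word2Vec`, whose `wv` attribute maps words to vectors); training is not shown here. Words without a vector are skipped, so a text made up only of unknown words has no representation and needs separate handling.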

In [19]:
# TODO

## 3. Can you imagine a better representation than word2vec?

Justify your answer. (**HINT** - what about Doc2vec? Sentence2vec? What are the pros and cons?)

**TODO**