# Information Retrieval and Web Analytics

# Indexing + Modeling (TF-IDF)

In [669]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Welcome to the first hands-on session of Information Retrieval and Web Analytics!

In this exercise you will implement a simple search engine to query a sample of Wikipedia articles. You will be provided with a sample of 500 Wikipedia articles in text format (some preprocessing has already been done to remove html tags).

For each article you have the following features:

- article id
- article title
- article body

This session is composed by three main parts:

1. **Create the index by going through the documents**
2. **Query the index to obtain a set of documents**
3. **Add some ranking to obtain a sorted set of documents when querying the index**



## 1. Create the index
The index is implemented through an **Inverted Index** which is the main data structure of our search engine. It maps the terms of our corpus (the collection of documents) to the documents that those terms appear in.

You will implement the index through a Python dictionary, and then you will use it to return the list of documents relevant for a query.

Each **vocabulary term** is a key in the index whose value is the list of documents that the term appears in.

    
*Figure 1* shows a basic implementation of an inverted index. However, there exists a special type of queries, named **Phrase Queries**, where the position of the terms in the document matters. Phrase Queries are those queries typed inside double quotes when we want the matching documents to contain the terms in the query exactly in the specified order.
    
In order to work with Phrase Queries we need to add some information in the inverted index. The new inverted index will store, for each term, the list of documents containing the term and the positions of the term in the corresponding document.

See *Figure 2*:
    
In the above example the term *Information* appears in document 1 at positions 0 (we start counting positions from 0), and in document 3 at position 0.
    
Notice that when implementing the index, you will need to perform some preprocessing:
    
    - Transform all words to lower case ( we don’t want to index *Information*, *information*, and *INFORMATION* differently.)
    - Remove stop words ( very common words like articles, etc.)
    - Apply Stemming (remove common endings from words. For example the stemmed version of the words fish, fishes, fishing, fisher, fished is the word 'fish')
    
But do not worry about that, we will provide you with simple tools to do it!

### Index implementation
To create the index you will perform the following steps:
- Loop over all documents of the collection provided in the dataset found in the project file `inputs/documents-corpus.tsv`.
- Concatenate the title and the text of the page.
- Lowercase all words.
- Get tokens (transform the string title+body into a list of terms)
- Remove stop words
- Stem each token
- Build the index following the model of Figure 2.

#### Load Python packages
Let's first import all the packages that you will need during this assignment.

In [670]:
# if you do not have 'nltk', the following command should work "python -m pip install nltk"
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [671]:
from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy as np
import collections
from numpy import linalg as la
import pandas as pd
from IPython.display import display
import json

#### Load data into memory
The dataset is stored in the TSV file, and it contains 500 Wikipedia articles (one article per line). For each article we have the document id, document title and document body separated by "|" character.

In [672]:
#docs_path = '/content/drive/MyDrive/Recuperació de la informació en la Web/Seminaris _ Practiques/PROJECT FIRST PART/IRWA_data_2023/Rus_Ukr_war_data.json'

#maping_doc = '/content/drive/MyDrive/Recuperació de la informació en la Web/Seminaris _ Practiques/PROJECT FIRST PART/IRWA_data_2023/Rus_Ukr_war_data_ids.csv'

docs_path = '/content/drive/MyDrive/1st TERM/IRWA/P1/Rus_Ukr_war_data.json'

maping_doc = '/content/drive/MyDrive/1st TERM/IRWA/P1/Rus_Ukr_war_data_ids.csv'

tweets = []
with open(docs_path, 'r') as file:
    for line in file:
        tweet = json.loads(line)
        tweets.append(tweet)

with open(maping_doc) as fp2:
    lines = fp2.readlines()
lines = [l.split() for l in lines]

mapping_dt = pd.DataFrame(lines, columns=['doc_id', 'id'])


In [673]:
print("Total number of tweets: {}".format(len(tweets)))

Total number of tweets: 4000


Implement the function ```build_terms(line)```.

It takes as input a text and performs the following operations:

- Transform all text to lowercase
- Tokenize the text to get a list of terms (use *split function*)
- Remove stop words
- Stem terms (example: to stem the term 'researcher', you will use ```stemmer.stem(researcher)```)

In [674]:
def build_terms(tweet):
    """
    Preprocess the tweet text removing stop words, stemming, tokenization, removing punctuation marks,
    and extracting relevant information like hashtags, date, likes, retweets, and URL.

    Argument:
    tweet -- dictionary representing a tweet

    Returns:
    processed_tweet -- a dictionary containing preprocessed tweet information
    """

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    tweet_text = tweet.lower()
    tweet_text = tweet_text.replace(".", "")        #remove the . son war. counts as war
    tweet_text = tweet_text.split()  # Tokenize the tweet text
    tweet_text = [word for word in tweet_text if word not in stop_words]
    tweet_text = [stemmer.stem(word) for word in tweet_text]


    tweet_text = [word[1:] if word.startswith('#') else word for word in tweet_text]


    return tweet_text

In [675]:
tweets_stuctured = []
info_query = []

for tweet in tweets:
  tweet_id = tweet['id']
  created_at = tweet['created_at']

  tweet_text = build_terms(tweet['full_text'])

  hashtags = [word for word in tweet['full_text'].split() if word.startswith('#')]  # Extract hashtags without #



  likes = tweet['favorite_count']
  retweets = tweet['retweet_count']
  url = tweet['entities']['urls'][0]['expanded_url'] if tweet['entities']['urls'] else None

  # Return processed tweet information as a dictionary
  processed_tweet = {
      'created_at': created_at,
      'hashtags': hashtags,
      'likes': likes,
      'retweets': retweets,
      'url': url,
      'processed_text': tweet_text,
      'original_text': tweet['full_text'],
      'id': tweet_id,
  }
  info_query.append([processed_tweet['original_text'], processed_tweet['created_at'], processed_tweet['hashtags'], processed_tweet['likes'], processed_tweet['retweets'], processed_tweet['url'], processed_tweet['id']])
  tweets_stuctured.append([processed_tweet['original_text'], processed_tweet['created_at'], processed_tweet['hashtags'], processed_tweet['likes'], processed_tweet['retweets'], processed_tweet['url'], processed_tweet['id'], processed_tweet['processed_text']])


df_query = pd.DataFrame(info_query, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])

df2_structured = pd.DataFrame(tweets_stuctured, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id', 'processed_tweet'])

#now we map the id of the tweets with the id of the documents:

mapped_df = pd.concat([df2_structured, mapping_dt], axis=1, join="outer")

# mapped_df.to_excel('/content/drive/MyDrive/Recuperació de la informació en la Web/Seminaris _ Practiques/PROJECT FIRST PART/IRWA_data_2023/output.xlsx')

In [676]:
data = pd.concat([df2_structured, mapping_dt], axis=1, join="outer")
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Text             4000 non-null   object
 1   Created_time     4000 non-null   object
 2   Hashtags         4000 non-null   object
 3   Likes            4000 non-null   int64 
 4   Retweets         4000 non-null   int64 
 5   url              1760 non-null   object
 6   id               4000 non-null   int64 
 7   processed_tweet  4000 non-null   object
 8   doc_id           4000 non-null   object
 9   id               4000 non-null   object
dtypes: int64(3), object(7)
memory usage: 312.6+ KB


In [677]:
def create_index(tweets, mapping_dt):
    """
    Implement the inverted index

    Argument:
    lines -- collection of Wikipedia articles

    Returns:
    index - the inverted index (implemented through a Python dictionary) containing terms as keys and the corresponding
    list of documents where these keys appears in (and the positions) as values.
    """
    index = defaultdict(list)
    tweet_index = {}
    for i, tweet in enumerate(tweets):  #
        #processed_tweet = build_terms(tweets[i])
        tweet_id = int(tweet[6])
        doc_id = mapping_dt[mapping_dt['id'] == str(tweet_id)]
        doc_id = (doc_id.values[0])[0]
        tweet_index[doc_id] = [tweet[0], tweet[1], tweet[2], tweet[3], tweet[4], tweet[5], tweet[6]]
        terms = tweet[7]


        current_page_index = {}

        for position, term in enumerate(terms): #
            try:
                # if the term is already in the index for the current page (current_page_index)
                # append the position to the corresponding list

        ## START CODE
                current_page_index[term][1].append(position)
            except:
                # Add the new term as dict key and initialize the array of positions and add the position
                current_page_index[term] = [doc_id, array('I', [position])]#'I' indicates unsigned int (int in Python)

        #merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

        ## END CODE

    return index, tweet_index

In [678]:
import time
start_time = time.time()
index, title_index = create_index(tweets_stuctured, mapping_dt)
print("Total time to create the index: {} seconds".format(np.round(time.time() - start_time, 2)))

Total time to create the index: 4.45 seconds


Notice that if you look in the index for ```researcher```you will not find any result, while if you look for ```research``` you will get some results. That happens because we are storing in the index stemmed terms.

In [679]:
print("Index results for the term 'fear': {}\n".format(index['war']))
print("First 10 Index results for the term 'fear': \n{}".format(index['war'][:10]))

Index results for the term 'fear': [['doc_12', array('I', [4, 8])], ['doc_15', array('I', [8])], ['doc_22', array('I', [14])], ['doc_25', array('I', [3])], ['doc_31', array('I', [15])], ['doc_44', array('I', [18])], ['doc_55', array('I', [16])], ['doc_67', array('I', [17])], ['doc_71', array('I', [6])], ['doc_77', array('I', [22])], ['doc_85', array('I', [18])], ['doc_87', array('I', [17])], ['doc_88', array('I', [8])], ['doc_99', array('I', [8])], ['doc_102', array('I', [3])], ['doc_129', array('I', [2])], ['doc_133', array('I', [1])], ['doc_137', array('I', [0])], ['doc_139', array('I', [2])], ['doc_153', array('I', [15])], ['doc_158', array('I', [12])], ['doc_165', array('I', [14])], ['doc_189', array('I', [1])], ['doc_194', array('I', [3])], ['doc_198', array('I', [15, 22])], ['doc_204', array('I', [3, 7])], ['doc_222', array('I', [7, 19])], ['doc_226', array('I', [11])], ['doc_236', array('I', [3])], ['doc_238', array('I', [5])], ['doc_240', array('I', [4])], ['doc_247', array('I'

## 2. Querying the Index

Even if before we mentioned that in case of phrase queries we need to take into account the position of the terms in the document and we have implemented an index that would allow us to also work with this type of queries, here you are going to implement a search function that will query the index without take into account the terms' positions.


We will use english Free Text Queries, that means that the query we will query the index using  a sequence of english words as query, and the output will be the list of documents that contain any of the query terms.

For instance if we write the query **"computer science"** the output will be the union of all documents containing the term "computer" with all documents containing the term "science".

In [680]:
def search(query, index):
    """
    The output is the list of documents that contain any of the query terms.
    So, we will get the list of documents for each query term, and take the union of them.
    """

    query = build_terms(query)


    docs = set()
    first_term = query[0]
    try:
        docs = set(posting[0] for posting in index[first_term])
    except KeyError:
        pass

    for term in query[1:]:
        try:
            term_docs = set(posting[0] for posting in index[term])
            docs = docs.intersection(term_docs)
        except KeyError:
            pass
    return docs

In [681]:
print("Insert your query (i.e.: Computer Science):\n")
query = input()
docs = search(query, index)
top = 10

list_tweets = []


for i, doc in enumerate(docs):
  list_tweets.append([doc, title_index[doc][0], title_index[doc][1], title_index[doc][2], title_index[doc][3], title_index[doc][4], title_index[doc][5], title_index[doc][6]])




df_search = pd.DataFrame(list_tweets, columns=['doc_id','Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])

display(df_search.info())

Insert your query (i.e.: Computer Science):

russia
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1474 entries, 0 to 1473
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   doc_id        1474 non-null   object
 1   Text          1474 non-null   object
 2   Created_time  1474 non-null   object
 3   Hashtags      1474 non-null   object
 4   Likes         1474 non-null   int64 
 5   Retweets      1474 non-null   int64 
 6   url           654 non-null    object
 7   id            1474 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 92.2+ KB


None

Results for ```Computer Science``` query

======================
Sample of 10 results out of 345 for the seached query:

- page_id= 1029 - page_title: Adjoint state method
- page_id= 2059 - page_title: Apache Cassandra
- page_id= 1036 - page_title: Adminer
- page_id= 3089 - page_title: BCSWomen
- page_id= 1043 - page_title: Admissible heuristic
- page_id= 1046 - page_title: Admon
- page_id= 26 - page_title: 12th Computer Olympiad
- page_id= 3103 - page_title: BESM
- page_id= 33 - page_title: 18 bit
- page_id= 1059 - page_title: Adobe Flash

## 3. Add Ranking with TF-IDF

When searching in a search engine, we are interested in obtain the results sorted by relevance or by some other criteria. Notice that **the above results are not ranked**.

Here you are going to implement the **TF-IDF (Term Frequency — Inverse Document Frequency)** mechanism and use it to obtain a list of ordered results.

TF-IDF is a weighting scheme that assigns each term in a document a weight based on its term frequency (FT) and the inverse document frequency (IDF).  The higher the scores, more important the term is.

##### TF
**TF** refers to the frequency of a term $t$ in a specific document $d$. The basic idea is that as a term appears more in the document it becomes more important. On the other side, if we only use pure term counts, longer documents will be favored more. Consider two documents with exactly the same content but one being twice longer by concatenating with itself.  The tf weights of each word in the longer document will be twice the shorter one, although they essentially have the same content. To deal with this issue we need to **normalize the term frequencies**.


$$tf_{t,d} = \dfrac{N_{t,d}}{||D||}\tag{1}$$


where ||D|| is the Euclidean norm.


Let $D=[t_1, t_2, \dots, t_n]$ be the document vector where $t_i$ represent the frequency of the term $i$, the  Euclidean Norm is calculated as


$$\sqrt{\sum_{t=1}^{n}t_i{^2}}\tag{2}$$


Note that $||D||$ is the same for all terms of a document.


##### IDF
A drawback of tf is that it considers all terms equally important. However, less common terms are more discriminative than others. To deal with this issue we introduce **idf (inverse document frequency)** that takes into account the number of documents containing the term.

$$idf_t = log\dfrac{N}{df_t}\tag{3}$$

where:

- $N$ is the total number of documents;
- $df_t$ is the number of documents containing the term $t$.

The log operation is applied to avoid that terms that appears in a high number of documents are considered to be too much less important, in this way we are smoothing (dampening) this difference.


In [682]:
def create_index_tfidf(tweets, num_documents, mapping_dt):
    """
    Implement the inverted index and compute tf, df and idf

    Argument:
    lines -- collection of Wikipedia articles
    num_documents -- total number of documents

    Returns:
    index - the inverted index (implemented through a Python dictionary) containing terms as keys and the corresponding
    list of document these keys appears in (and the positions) as values.
    tf - normalized term frequency for each term in each document
    df - number of documents each term appear in
    idf - inverse document frequency of each term
    """

    index = defaultdict(list)
    tf = defaultdict(list)  # term frequencies of terms in documents (documents in the same order as in the main index)
    df = defaultdict(int)  # document frequencies of terms in the corpus
    tweet_index = defaultdict(str)
    idf = defaultdict(float)

    for i, tweet in enumerate(tweets):  #
        tweet_id = int(tweet[6])
        doc_id = mapping_dt[mapping_dt['id'] == str(tweet_id)]
        doc_id = (doc_id.values[0])[0]
        tweet_index[doc_id] = [tweet[0], tweet[1], tweet[2], tweet[3], tweet[4], tweet[5], tweet[6]]
        terms = tweet[7]

        current_page_index = {}

        for position, term in enumerate(terms):  ## terms contains page_title + page_text
            try:
                # if the term is already in the dict append the position to the corresponding list
                current_page_index[term][1].append(position)
            except:
                # Add the new term as dict key and initialize the array of positions and add the position
                current_page_index[term] = [doc_id, array('I', [position])] #'I' indicates unsigned int (int in Python)

        #normalize term frequencies
        # Compute the denominator to normalize term frequencies (formula 2 above)
        # norm is the same for all terms of a document.
        norm = 0

        for term, posting in current_page_index.items():
            # posting will contain the list of positions for current term in current document.
            # posting ==> [current_doc, [list of positions]]
            # you can use it to infer the frequency of current term.
            norm += len(posting[1]) ** 2
        norm = math.sqrt(norm)

        # calculate the tf(dividing the term frequency by the above computed norm) and df weights
        for term, posting in current_page_index.items():
            # append the tf for current term (tf = term frequency in current doc/norm)
            tf[term].append(np.round(len(posting[1])/norm,4)) ## SEE formula (1) above
            #increment the document frequency of current term (number of documents containing the current term)
            df[term] +=1  # increment DF for current term

        #merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

        # Compute IDF following the formula (3) above. HINT: use np.log
        for term in df:
            idf[term] = np.round(np.log(float(num_documents/len(df))), 4)

    return index, tf, df, idf, tweet_index


In [683]:
start_time = time.time()
num_documents = len(tweets)
index, tf, df, idf, title_index = create_index_tfidf(tweets_stuctured, num_documents, mapping_dt)
print("Total time to create the index: {} seconds" .format(np.round(time.time() - start_time, 2)))

Total time to create the index: 404.65 seconds


In [684]:
def rank_documents(terms, docs, index, idf, tf, tweet_index):
    """
    Perform the ranking of the results of a search based on the tf-idf weights

    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    idf -- inverted document frequencies
    tf -- term frequencies
    title_index -- mapping between page id and page title

    Returns:
    Print the list of ranked documents
    """

    # I'm interested only on the element of the docVector corresponding to the query terms
    # The remaining elements would became 0 when multiplied to the query_vector
    doc_vectors = defaultdict(lambda: [0] * len(terms)) # I call doc_vectors[k] for a nonexistent key k, the key-value pair (k,[0]*len(terms)) will be automatically added to the dictionary
    query_vector = [0] * len(terms)

    # compute the norm for the query tf
    query_terms_count = collections.Counter(terms)  # get the frequency of each term in the query.
    # Example: collections.Counter(["hello","hello","world"]) --> Counter({'hello': 2, 'world': 1})
    # HINT: use when computing tf for query_vector

    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query

        if term not in index:
            continue

        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex]= (query_terms_count[term] / query_norm) * idf[term]

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):
            # Example of [doc_index, (doc, postings)]
            # 0 (26, array('I', [1, 4, 12, 15, 22, 28, 32, 43, 51, 68, 333, 337]))
            # 1 (33, array('I', [26, 33, 57, 71, 87, 104, 109]))
            # term is in doc 26 in positions 1,4, .....
            # term is in doc 33 in positions 26,33, .....
            #tf[term][0] will contain the tf of the term "term" in the doc 26

            if doc in docs:
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term]  # TODO: check if multiply for idf

    # Calculate the score of each doc
    # compute the cosine similarity between queyVector and each docVector:
    # HINT: you can use the dot product because in case of normalized vectors it corresponds to the cosine similarity
    # see np.dot

    doc_scores=[[np.dot(curDocVec, query_vector), doc] for doc, curDocVec in doc_vectors.items() ]
    doc_scores.sort(reverse=True)
    #print(doc_scores)
    result_docs = [x[1] for x in doc_scores]
    #print document titles instead if document id's
    result_docs=[tweet_index[x] for x in result_docs]

    doc_scores = [x[0] for x in doc_scores]        #here we save the doc_scores to do the Evaluation later


    if len(result_docs) == 0:
        print("No results found, try again")
        query = input()
        docs = search_tf_idf(query, index)

    #print ("".join(str(result_docs)), "\n")
    return result_docs, doc_scores

In [685]:
def search_tf_idf(query, index):
    """
    output is the list of documents that contain any of the query terms.
    So, we will get the list of documents for each query term, and take the union of them.
    """
    query = build_terms(query)

    docs = set()
    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs = [posting[0] for posting in index[term]]

            # docs = docs Union term_docs
            docs |= set(term_docs)
        except:
            #term is not in index
            pass

    docs = list(docs)
    ranked_docs, doc_scores = rank_documents(query, docs, index, idf, tf, title_index)
    return ranked_docs, doc_scores

#QUERIES

In [868]:
#evaluation_doc = '/content/drive/MyDrive/Recuperació de la informació en la Web/Seminaris _ Practiques/PROJECT FIRST PART/Queries_Evaluation_gt.csv'
evaluation_doc = '/content/drive/MyDrive/1st TERM/IRWA/P2/Queries_Evaluation_gt.csv'
df_evaluation = pd.read_csv(evaluation_doc)

In [None]:
query = 'United States'
ranked_docs, doc_scores = search_tf_idf(query, index)

df_query1 = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])

df_query1['predicted_score'] = doc_scores
df_query1['id'] = df_query1['id'].astype(str)
df_query1 = df_query1.merge(mapping_dt, on='id', how='left').drop(columns=['id'])
df_query1 = df_query1[['predicted_score', 'doc_id']].copy()

#Q1
df_evaluation_1 =  pd.DataFrame()
df_evaluation_1 = df_evaluation[df_evaluation['query_id'] == "Q1"]

df_evaluation_1Qs = df_evaluation[(df_evaluation['label'] == 1) & (df_evaluation['query_id'] != 'Q1')]
df_evaluation_1Qs['label'] = 0

merge_df1 = pd.concat([ df_evaluation_1Qs, df_evaluation_1], ignore_index = True)

merge_df1.rename(columns={'doc': 'doc_id'}, inplace=True)

merge_df1 = merge_df1.merge(df_query1,on='doc_id',how='left')
merge_df1['predicted_score'].fillna(0, inplace=True)


#display(merge_df1)

In [1033]:
query = 'trump russia'
ranked_docs, doc_scores = search_tf_idf(query, index)

df_query2 = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df_query2['predicted_score'] = doc_scores
df_query2['id'] = df_query2['id'].astype(str)
df_query2 = df_query2.merge(mapping_dt, on='id', how='left').drop(columns=['id'])
df_query2 = df_query2[['predicted_score', 'doc_id']].copy()

#Q1
df_evaluation_2 =  pd.DataFrame()
df_evaluation_2 = df_evaluation[df_evaluation['query_id'] == "Q2"]

df_evaluation_2Qs = df_evaluation[(df_evaluation['label'] == 1) & (df_evaluation['query_id'] != 'Q2')]
df_evaluation_2Qs['label'] = 0

merge_df2 = pd.concat([ df_evaluation_2Qs, df_evaluation_2], ignore_index = True)

merge_df2.rename(columns={'doc': 'doc_id'}, inplace=True)

merge_df2 = merge_df2.merge(df_query2,on='doc_id',how='left')
merge_df2['predicted_score'].fillna(0, inplace=True)

display(merge_df2)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_evaluation_2Qs['label'] = 0


Unnamed: 0,doc_id,query_id,label,predicted_score
0,doc_3974,Q5,0,0.0
1,doc_1384,Q5,0,0.0
2,doc_696,Q5,0,0.288057
3,doc_136,Q5,0,0.458455
4,doc_114,Q5,0,0.0
5,doc_2954,Q5,0,0.0
6,doc_2088,Q5,0,0.0
7,doc_400,Q5,0,0.0
8,doc_12,Q5,0,0.0
9,doc_1606,Q5,0,0.215352


In [871]:
query = 'Bombs cities'
ranked_docs, doc_scores = search_tf_idf(query, index)

df_query3 = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df_query3['predicted_score'] = doc_scores
df_query3['id'] = df_query3['id'].astype(str)
df_query3 = df_query3.merge(mapping_dt, on='id', how='left').drop(columns=['id'])
df_query3 = df_query3[['predicted_score', 'doc_id']].copy()

#Q1
df_evaluation_3 =  pd.DataFrame()
df_evaluation_3 = df_evaluation[df_evaluation['query_id'] == "Q3"]

df_evaluation_3Qs = df_evaluation[(df_evaluation['label'] == 1) & (df_evaluation['query_id'] != 'Q3')]
df_evaluation_3Qs['label'] = 0

merge_df3 = pd.concat([ df_evaluation_3Qs, df_evaluation_3], ignore_index = True)

merge_df3.rename(columns={'doc': 'doc_id'}, inplace=True)

merge_df3 = merge_df3.merge(df_query3,on='doc_id',how='left')
merge_df3['predicted_score'].fillna(0, inplace=True)

#display(merge_df3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_evaluation_3Qs['label'] = 0


In [872]:
query = 'World War'
ranked_docs, doc_scores = search_tf_idf(query, index)

df_query4 = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df_query4['predicted_score'] = doc_scores
df_query4['id'] = df_query4['id'].astype(str)
df_query4 = df_query4.merge(mapping_dt, on='id', how='left').drop(columns=['id'])
df_query4 = df_query4[['predicted_score', 'doc_id']].copy()

#Q1
df_evaluation_4 =  pd.DataFrame()
df_evaluation_4 = df_evaluation[df_evaluation['query_id'] == "Q4"]

df_evaluation_4Qs = df_evaluation[(df_evaluation['label'] == 1) & (df_evaluation['query_id'] != 'Q4')]
df_evaluation_4Qs['label'] = 0

merge_df4 = pd.concat([ df_evaluation_4Qs, df_evaluation_4], ignore_index = True)

merge_df4.rename(columns={'doc': 'doc_id'}, inplace=True)

merge_df4 = merge_df4.merge(df_query4,on='doc_id',how='left')
merge_df4['predicted_score'].fillna(0, inplace=True)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_evaluation_4Qs['label'] = 0


In [873]:
query = 'Nazi war'
ranked_docs, doc_scores = search_tf_idf(query, index)

df_query5 = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df_query5['predicted_score'] = doc_scores
df_query5['id'] = df_query5['id'].astype(str)
df_query5 = df_query5.merge(mapping_dt, on='id', how='left').drop(columns=['id'])
df_query5 = df_query5[['predicted_score', 'doc_id']].copy()


#Q1
df_evaluation_5 =  pd.DataFrame()
df_evaluation_5 = df_evaluation[df_evaluation['query_id'] == "Q5"]

df_evaluation_5Qs = df_evaluation[(df_evaluation['label'] == 1) & (df_evaluation['query_id'] != 'Q5')]
df_evaluation_5Qs['label'] = 0

merge_df5 = pd.concat([ df_evaluation_5Qs, df_evaluation_5], ignore_index = True)

merge_df5.rename(columns={'doc': 'doc_id'}, inplace=True)

merge_df5 = merge_df5.merge(df_query5,on='doc_id',how='left')
merge_df5['predicted_score'].fillna(0, inplace=True)

#display(merge_df5)

#display(merge_df5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_evaluation_5Qs['label'] = 0


In [1011]:
#Choose query to study

current_query = "Q5"

if(current_query == "Q1"):
  current_query_res = merge_df1[merge_df1["query_id"] == current_query]

if(current_query == "Q2"):
  current_query_res = merge_df2[merge_df2["query_id"] == current_query]

if(current_query == "Q3"):
  current_query_res = merge_df3[merge_df3["query_id"] == current_query]

if(current_query == "Q4"):
  current_query_res = merge_df4[merge_df4["query_id"] == current_query]

if(current_query == "Q5"):
  current_query_res = merge_df5[merge_df5["query_id"] == current_query]



## Precision@K (P@K)

In [1012]:
def precision_at_k(doc_score, y_score, k):
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    precision @k : float

    """
    order = np.argsort(y_score)[::-1]
    doc_score = np.take(doc_score,order[:k])#y_true
    relevant = sum(doc_score == 1)
    return float(relevant)/k

In [1013]:
# Check for current query
k = 10
print("==> Precision@{}: {}\n".format(k, precision_at_k(current_query_res["label"], current_query_res["predicted_score"], k)))
print("\nCheck on the dataset sorted by score:\n")
#current_query_res.sort_values("score", ascending=False).head(k)
current_query_res.sort_values("predicted_score", ascending=False).head(k)

==> Precision@10: 0.6


Check on the dataset sorted by score:



Unnamed: 0,doc_id,query_id,label,predicted_score
40,doc_3974,Q5,1,0.687745
50,doc_3356,Q5,0,0.636764
51,doc_1117,Q5,0,0.627848
45,doc_2954,Q5,1,0.609138
41,doc_1384,Q5,1,0.591935
53,doc_2834,Q5,0,0.591935
52,doc_2792,Q5,0,0.591935
42,doc_696,Q5,1,0.576113
48,doc_12,Q5,1,0.483318
47,doc_400,Q5,1,0.45858


## Recall@K (R@K)

In [1014]:
def recall_at_k(doc_score, y_score, k):
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    recall @k : float

    """
    relevant = sum(doc_score[:k] == 1)
    total_relevant = len(y_score[:k] > 0)

    return float(relevant) / total_relevant  # Recall is the number of relevant items retrieved divided by the total number of relevant items.


In [1015]:
k = 10
print("==> recall@{}: {}\n".format(k, recall_at_k(current_query_res["label"], current_query_res["predicted_score"], k)))
print("\nCheck on the dataset sorted by score:\n")
#current_query_res.sort_values("score", ascending=False).head(k)
current_query_res.sort_values("predicted_score", ascending=False).head(k)

==> recall@10: 1.0


Check on the dataset sorted by score:



Unnamed: 0,doc_id,query_id,label,predicted_score
40,doc_3974,Q5,1,0.687745
50,doc_3356,Q5,0,0.636764
51,doc_1117,Q5,0,0.627848
45,doc_2954,Q5,1,0.609138
41,doc_1384,Q5,1,0.591935
53,doc_2834,Q5,0,0.591935
52,doc_2792,Q5,0,0.591935
42,doc_696,Q5,1,0.576113
48,doc_12,Q5,1,0.483318
47,doc_400,Q5,1,0.45858


## Average Precision@K (P@K)

In [1016]:
def avg_precision_at_k(doc_score, y_score, k=10):
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    average precision @k : float
    """
    gtp = np.sum(doc_score == 1)
    order = np.argsort(y_score)[::-1]
    doc_score = np.take(doc_score, order[:k])
    ## if all documents are not relevant
    if gtp == 0:
        return 0
    n_relevant_at_i = 0
    prec_at_i = 0
    for i in range(len(doc_score)):
        if doc_score[i] == 1:
            n_relevant_at_i += 1
            prec_at_i += n_relevant_at_i / (i + 1)
    return prec_at_i / gtp

In [1017]:
avg_precision_at_k(np.array(current_query_res["label"]), np.array(current_query_res["predicted_score"]), 10)

0.3755555555555556

In [1018]:
current_query_res.sort_values("predicted_score", ascending=False).head(10)

Unnamed: 0,doc_id,query_id,label,predicted_score
40,doc_3974,Q5,1,0.687745
50,doc_3356,Q5,0,0.636764
51,doc_1117,Q5,0,0.627848
45,doc_2954,Q5,1,0.609138
41,doc_1384,Q5,1,0.591935
53,doc_2834,Q5,0,0.591935
52,doc_2792,Q5,0,0.591935
42,doc_696,Q5,1,0.576113
48,doc_12,Q5,1,0.483318
47,doc_400,Q5,1,0.45858


In [1019]:
def f1_score_at_k(doc_score, y_score, k):
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    f1_score @k : float
    """
    precision = precision_at_k(doc_score, y_score, k)
    recall = recall_at_k(doc_score, y_score, k)
    f1_score = (2*precision*recall)/(precision+recall)

    return f1_score

In [1020]:
# Check for query Q1
k = 10
print("==> F1-score@{}: {}\n".format(k, f1_score_at_k(current_query_res["label"], current_query_res["predicted_score"], k)))
print("\nCheck on the dataset sorted by score:\n")
#current_query_res.sort_values("score", ascending=False).head(k)
current_query_res.sort_values("predicted_score", ascending=False).head(k)

==> F1-score@10: 0.7499999999999999


Check on the dataset sorted by score:



Unnamed: 0,doc_id,query_id,label,predicted_score
40,doc_3974,Q5,1,0.687745
50,doc_3356,Q5,0,0.636764
51,doc_1117,Q5,0,0.627848
45,doc_2954,Q5,1,0.609138
41,doc_1384,Q5,1,0.591935
53,doc_2834,Q5,0,0.591935
52,doc_2792,Q5,0,0.591935
42,doc_696,Q5,1,0.576113
48,doc_12,Q5,1,0.483318
47,doc_400,Q5,1,0.45858


## Mean Average Precision (MAP)

In [1021]:
def map_at_k(search_res, k=10):
    """
    Parameters
    ----------
    search_res: search results dataset containing:
        query_id: query id.
        doc_id: document id.
        predicted_score: relevance predicted through LightGBM.
        label: actual score of the document for the query (ground truth).

    Returns
    -------
    mean average precision @ k : float
    """
    avp = []
    for q in search_res["query_id"].unique():  # loop over all query id
        curr_data = search_res[search_res["query_id"] == q]  # select data for current query
        avp.append(avg_precision_at_k(np.array(curr_data["label"]),
                   np.array(curr_data["predicted_score"]), k))  #append average precision for current query
    return np.sum(avp) / len(avp), avp  # return mean average precision

In [1022]:
if(current_query == "Q1"):
  merged_df = merge_df1

if(current_query == "Q2"):
  merged_df = merge_df2

if(current_query == "Q3"):
  merged_df = merge_df3

if(current_query == "Q4"):
  merged_df = merge_df4

if(current_query == "Q5"):
  merged_df = merge_df5



map_k, avp = map_at_k(merged_df, 10)
map_k

0.07511111111111111

## Mean Reciprocal Rank (MRR)

In [1023]:
def rr_at_k(doc_score, y_score, k=10):
    """
    Parameters
    ----------
    doc_score: Ground truth (true relevance labels).
    y_score: Predicted scores.
    k : number of doc to consider.

    Returns
    -------
    Reciprocal Rank for qurrent query
    """

    order = np.argsort(y_score)[::-1]  # get the list of indexes of the predicted score sorted in descending order.
    doc_score = np.take(doc_score, order[
                             :k])  # sort the actual relevance label of the documents based on predicted score(hint: np.take) and take first k.
    if np.sum(doc_score) == 0:  # if there are not relevant doument return 0
        return 0
    return 1 / (np.argmax(doc_score == 1) + 1)  # hint: to get the position of the first relevant document use "np.argmax"

In [1024]:
current_query_res.sort_values("predicted_score", ascending=False).head(10)

Unnamed: 0,doc_id,query_id,label,predicted_score
40,doc_3974,Q5,1,0.687745
50,doc_3356,Q5,0,0.636764
51,doc_1117,Q5,0,0.627848
45,doc_2954,Q5,1,0.609138
41,doc_1384,Q5,1,0.591935
53,doc_2834,Q5,0,0.591935
52,doc_2792,Q5,0,0.591935
42,doc_696,Q5,1,0.576113
48,doc_12,Q5,1,0.483318
47,doc_400,Q5,1,0.45858


In [1030]:
if(current_query == "Q1"):
  merged_df = merge_df1

if(current_query == "Q2"):
  merged_df = merge_df2

if(current_query == "Q3"):
  merged_df = merge_df3

if(current_query == "Q4"):
  merged_df = merge_df4

if(current_query == "Q5"):
  merged_df = merge_df5


labels = np.array(merged_df[merged_df['query_id'] == current_query]["label"])
scores = np.array(merged_df[merged_df['query_id'] == current_query]["predicted_score"])
np.round(rr_at_k(labels, scores, 10), 10)

1.0

In [1031]:
mrr = {}
for k in [3, 5, 10]:
    RRs = []
    for q in merged_df['query_id'].unique():  # loop over all query ids
        labels = np.array(merged_df[merged_df['query_id'] == q]["label"])  # get labels for current query
        scores = np.array(merged_df[merged_df['query_id'] == q]["predicted_score"])  # get predicted score for current query
        RRs.append(rr_at_k(labels, scores, k))  # append RR for current query
    mrr[k] = np.round(float(sum(RRs) / len(RRs)), 4)  # Mean RR at current k

In [1032]:
mrr

{3: 0.2, 5: 0.2, 10: 0.2}

## Normalized Discounted Cumulative Gain (NDCG)

In [1028]:
def dcg_at_k(doc_score, y_score, k=10):
    order = np.argsort(y_score)[::-1]  # get the list of indexes of the predicted score sorted in descending order.
    doc_score = np.take(doc_score, order[:k])  # sort the actual relevance label of the documents based on predicted score(hint: np.take) and take first k.
    gain = 2 ** doc_score - 1  # Compute gain (use formula 7 above)
    discounts = np.log2(np.arange(len(doc_score)) + 2)  # Compute denominator
    return np.sum(gain / discounts)  #return dcg@k


def ndcg_at_k(doc_score, y_score, k=10):
    dcg_max = dcg_at_k(doc_score, doc_score, k)
    if not dcg_max:
        return 0
    return np.round(dcg_at_k(doc_score, y_score, k) / dcg_max, 4)

In [1029]:
k = 10
labels = np.array(merged_df[merged_df['query_id'] == current_query]["label"])
scores = np.array(merged_df[merged_df['query_id'] == current_query]["predicted_score"])
ndcg_k = np.round(ndcg_at_k(labels, scores, k), 4)
print("ndcg@{} for query with query_id={}: {}".format(k, current_query, ndcg_k))

ndcg@10 for query with query_id=Q5: 0.5993
