# Information Retrieval and Web Analytics

# Indexing + Modeling (TF-IDF)

In [49]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Welcome to the first hands-on session of Information Retrieval and Web Analytics!

In this exercise you will implement a simple search engine to query a sample of Wikipedia articles. You will be provided with a sample of 500 Wikipedia articles in text format (some preprocessing has already been done to remove html tags).

For each article you have the following features:

- article id
- article title
- article body

This session is composed by three main parts:

1. **Create the index by going through the documents**
2. **Query the index to obtain a set of documents**
3. **Add some ranking to obtain a sorted set of documents when querying the index**



## 1. Create the index
The index is implemented through an **Inverted Index** which is the main data structure of our search engine. It maps the terms of our corpus (the collection of documents) to the documents that those terms appear in.

You will implement the index through a Python dictionary, and then you will use it to return the list of documents relevant for a query.

Each **vocabulary term** is a key in the index whose value is the list of documents that the term appears in.

    
*Figure 1* shows a basic implementation of an inverted index. However, there exists a special type of queries, named **Phrase Queries**, where the position of the terms in the document matters. Phrase Queries are those queries typed inside double quotes when we want the matching documents to contain the terms in the query exactly in the specified order.
    
In order to work with Phrase Queries we need to add some information in the inverted index. The new inverted index will store, for each term, the list of documents containing the term and the positions of the term in the corresponding document.

See *Figure 2*:
    
In the above example the term *Information* appears in document 1 at positions 0 (we start counting positions from 0), and in document 3 at position 0.
    
Notice that when implementing the index, you will need to perform some preprocessing:
    
    - Transform all words to lower case ( we don’t want to index *Information*, *information*, and *INFORMATION* differently.)
    - Remove stop words ( very common words like articles, etc.)
    - Apply Stemming (remove common endings from words. For example the stemmed version of the words fish, fishes, fishing, fisher, fished is the word 'fish')
    
But do not worry about that, we will provide you with simple tools to do it!

### Index implementation
To create the index you will perform the following steps:
- Loop over all documents of the collection provided in the dataset found in the project file `inputs/documents-corpus.tsv`.
- Concatenate the title and the text of the page.
- Lowercase all words.
- Get tokens (transform the string title+body into a list of terms)
- Remove stop words
- Stem each token
- Build the index following the model of Figure 2.

#### Load Python packages
Let's first import all the packages that you will need during this assignment.

In [50]:
# if you do not have 'nltk', the following command should work "python -m pip install nltk"
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [51]:
from collections import defaultdict
from array import array
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import math
import numpy as np
import collections
from numpy import linalg as la
import pandas as pd
from IPython.display import display
import json

#### Load data into memory
The dataset is stored in the TSV file, and it contains 500 Wikipedia articles (one article per line). For each article we have the document id, document title and document body separated by "|" character.

In [52]:
docs_path = '/content/drive/MyDrive/1st TERM/IRWA/P1/Rus_Ukr_war_data.json'

maping_doc = '/content/drive/MyDrive/1st TERM/IRWA/P1/Rus_Ukr_war_data_ids.csv'

tweets = []
with open(docs_path, 'r') as file:
    for line in file:
        tweet = json.loads(line)
        tweets.append(tweet)

with open(maping_doc) as fp2:
    lines = fp2.readlines()
lines = [l.split() for l in lines]

mapping_dt = pd.DataFrame(lines, columns=['doc_id', 'id'])


In [53]:
print("Total number of tweets: {}".format(len(tweets)))

Total number of tweets: 4000


Implement the function ```build_terms(line)```.

It takes as input a text and performs the following operations:

- Transform all text to lowercase
- Tokenize the text to get a list of terms (use *split function*)
- Remove stop words
- Stem terms (example: to stem the term 'researcher', you will use ```stemmer.stem(researcher)```)

In [54]:
def build_terms(tweet):
    """
    Preprocess the tweet text removing stop words, stemming, tokenization, removing punctuation marks,
    and extracting relevant information like hashtags, date, likes, retweets, and URL.

    Argument:
    tweet -- dictionary representing a tweet

    Returns:
    processed_tweet -- a dictionary containing preprocessed tweet information
    """

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    tweet_text = tweet.lower()
    tweet_text = tweet_text.replace(".", "")        #remove the . son war. counts as war
    tweet_text = tweet_text.split()  # Tokenize the tweet text
    tweet_text = [word for word in tweet_text if word not in stop_words]
    tweet_text = [stemmer.stem(word) for word in tweet_text]


    tweet_text = [word[1:] if word.startswith('#') else word for word in tweet_text]


    return tweet_text

In [55]:
tweets_stuctured = []
info_query = []

for tweet in tweets:
  tweet_id = tweet['id']
  created_at = tweet['created_at']

  tweet_text = build_terms(tweet['full_text'])

  hashtags = [word for word in tweet['full_text'].split() if word.startswith('#')]  # Extract hashtags without #



  likes = tweet['favorite_count']
  retweets = tweet['retweet_count']
  url = tweet['entities']['urls'][0]['expanded_url'] if tweet['entities']['urls'] else None

  # Return processed tweet information as a dictionary
  processed_tweet = {
      'created_at': created_at,
      'hashtags': hashtags,
      'likes': likes,
      'retweets': retweets,
      'url': url,
      'processed_text': tweet_text,
      'original_text': tweet['full_text'],
      'id': tweet_id,
  }
  info_query.append([processed_tweet['original_text'], processed_tweet['created_at'], processed_tweet['hashtags'], processed_tweet['likes'], processed_tweet['retweets'], processed_tweet['url'], processed_tweet['id']])
  tweets_stuctured.append([processed_tweet['original_text'], processed_tweet['created_at'], processed_tweet['hashtags'], processed_tweet['likes'], processed_tweet['retweets'], processed_tweet['url'], processed_tweet['id'], processed_tweet['processed_text']])


df_query = pd.DataFrame(info_query, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])

df2_structured = pd.DataFrame(tweets_stuctured, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id', 'processed_tweet'])

#now we map the id of the tweets with the id of the documents:

mapped_df = pd.concat([df2_structured, mapping_dt], axis=1, join="outer")

display(mapped_df)

Unnamed: 0,Text,Created_time,Hashtags,Likes,Retweets,url,id,processed_tweet,doc_id,id.1
0,@MelSimmonsFCDO Wrong. Dictator Putin's Fascis...,Fri Sep 30 18:39:17 +0000 2022,"[#RussiainvadesUkraine, #UkraineRussiaWar]",0,0,,1575918221013979136,"[@melsimmonsfcdo, wrong, dictat, putin', fasci...",doc_1,1575918221013979136
1,🇺🇦❤️ The Armed Forces liberated the village of...,Fri Sep 30 18:38:44 +0000 2022,"[#Drobysheve, #Lymansk, #Donetsk, #UkraineWar,...",0,0,,1575918081461080065,"[🇺🇦❤️, arm, forc, liber, villag, drobyshev, ly...",doc_2,1575918081461080065
2,ALERT 🚨Poland preps anti-radiation tablets ove...,Fri Sep 30 18:38:23 +0000 2022,"[#NATO, #Putin, #Russia, #RussiaInvadedUkraine...",0,0,,1575917992390823936,"[alert, 🚨poland, prep, anti-radi, tablet, nucl...",doc_3,1575917992390823936
3,I’m still waiting for my google map 🗺️ to upda...,Fri Sep 30 18:38:03 +0000 2022,"[#Putin, #UkraineRussiaWar]",0,0,,1575917907774967809,"[i’m, still, wait, googl, map, 🗺️, updat, russ...",doc_4,1575917907774967809
4,@EmmanuelMacron probably you're right or you h...,Fri Sep 30 18:37:56 +0000 2022,"[#European, #UkraineRussiaWar]",0,0,,1575917878410301441,"[@emmanuelmacron, probabl, right, say, it,, an...",doc_5,1575917878410301441
...,...,...,...,...,...,...,...,...,...,...
3995,🎥 Ukraine’s president has warned that Russia’s...,Wed Sep 28 16:05:00 +0000 2022,[#UkraineRussiaWar],4,1,,1575154617620504576,"[🎥, ukraine’, presid, warn, russia’, “sham, re...",doc_3996,1575154617620504576
3996,Germany amusingly shares days old intelligense...,Wed Sep 28 16:04:19 +0000 2022,"[#germany, #UkraineRussiaWar]",0,0,https://www.tagesschau.de/investigativ/kontras...,1575154444165156864,"[germani, amusingli, share, day, old, intellig...",doc_3997,1575154444165156864
3997,The US Embassy in Moscow is urging Americans t...,Wed Sep 28 16:04:18 +0000 2022,"[#fakenewsfilter, #RealNews, #news, #RussianMo...",0,0,https://oigetit.app.link/GyxcQNf7Gtb,1575154440012812288,"[us, embassi, moscow, urg, american, leav, rus...",doc_3998,1575154440012812288
3998,After the staged fake referendum as of Septemb...,Wed Sep 28 16:03:56 +0000 2022,[#UkraineRussiaWar],13,2,,1575154351273873410,"[stage, fake, referendum, septemb, 2022,, russ...",doc_3999,1575154351273873410


In [56]:
def create_index(tweets, mapping_dt):
    """
    Implement the inverted index

    Argument:
    lines -- collection of Wikipedia articles

    Returns:
    index - the inverted index (implemented through a Python dictionary) containing terms as keys and the corresponding
    list of documents where these keys appears in (and the positions) as values.
    """
    index = defaultdict(list)
    tweet_index = {}
    for i, tweet in enumerate(tweets):  #
        #processed_tweet = build_terms(tweets[i])
        tweet_id = int(tweet[6])
        doc_id = mapping_dt[mapping_dt['id'] == str(tweet_id)]
        doc_id = (doc_id.values[0])[0]
        tweet_index[doc_id] = [tweet[0], tweet[1], tweet[2], tweet[3], tweet[4], tweet[5], tweet[6]]
        terms = tweet[7]


        current_page_index = {}

        for position, term in enumerate(terms): #
            try:
                # if the term is already in the index for the current page (current_page_index)
                # append the position to the corresponding list

        ## START CODE
                current_page_index[term][1].append(position)
            except:
                # Add the new term as dict key and initialize the array of positions and add the position
                current_page_index[term] = [doc_id, array('I', [position])]#'I' indicates unsigned int (int in Python)

        #merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

        ## END CODE

    return index, tweet_index

In [57]:
import time
start_time = time.time()
index, title_index = create_index(tweets_stuctured, mapping_dt)
print("Total time to create the index: {} seconds".format(np.round(time.time() - start_time, 2)))

Total time to create the index: 3.44 seconds


Notice that if you look in the index for ```researcher```you will not find any result, while if you look for ```research``` you will get some results. That happens because we are storing in the index stemmed terms.

In [58]:
print("Index results for the term 'fear': {}\n".format(index['war']))
print("First 10 Index results for the term 'fear': \n{}".format(index['war'][:10]))

Index results for the term 'fear': [['doc_12', array('I', [4, 8])], ['doc_15', array('I', [8])], ['doc_22', array('I', [14])], ['doc_25', array('I', [3])], ['doc_31', array('I', [15])], ['doc_44', array('I', [18])], ['doc_55', array('I', [16])], ['doc_67', array('I', [17])], ['doc_71', array('I', [6])], ['doc_77', array('I', [22])], ['doc_85', array('I', [18])], ['doc_87', array('I', [17])], ['doc_88', array('I', [8])], ['doc_99', array('I', [8])], ['doc_102', array('I', [3])], ['doc_129', array('I', [2])], ['doc_133', array('I', [1])], ['doc_137', array('I', [0])], ['doc_139', array('I', [2])], ['doc_153', array('I', [15])], ['doc_158', array('I', [12])], ['doc_165', array('I', [14])], ['doc_189', array('I', [1])], ['doc_194', array('I', [3])], ['doc_198', array('I', [15, 22])], ['doc_204', array('I', [3, 7])], ['doc_222', array('I', [7, 19])], ['doc_226', array('I', [11])], ['doc_236', array('I', [3])], ['doc_238', array('I', [5])], ['doc_240', array('I', [4])], ['doc_247', array('I'

## 2. Querying the Index

Even if before we mentioned that in case of phrase queries we need to take into account the position of the terms in the document and we have implemented an index that would allow us to also work with this type of queries, here you are going to implement a search function that will query the index without take into account the terms' positions.


We will use english Free Text Queries, that means that the query we will query the index using  a sequence of english words as query, and the output will be the list of documents that contain any of the query terms.

For instance if we write the query **"computer science"** the output will be the union of all documents containing the term "computer" with all documents containing the term "science".

In [59]:
def search(query, index):
    """
    The output is the list of documents that contain any of the query terms.
    So, we will get the list of documents for each query term, and take the union of them.
    """

    query = build_terms(query)


    docs = set()
    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs = [posting[0] for posting in index[term]]

            # docs = docs Union term_docs
            docs |= set(term_docs)
        except:
            #term is not in index
            pass

    return docs

In [60]:
print("Insert your query (i.e.: Computer Science):\n")
query = input()
docs = search(query, index)
top = 10

list_tweets = []


for i, doc in enumerate(docs):
  list_tweets.append([doc, title_index[doc][0], title_index[doc][1], title_index[doc][2], title_index[doc][3], title_index[doc][4], title_index[doc][5], title_index[doc][6]])

df_search = pd.DataFrame(list_tweets, columns=['doc_id','Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])

display(df_search)

Insert your query (i.e.: Computer Science):

russia


Unnamed: 0,doc_id,Text,Created_time,Hashtags,Likes,Retweets,url,id
0,doc_687,Putin calls on Ukraine to immediately stop hos...,Fri Sep 30 12:30:20 +0000 2022,"[#Russia, #RussiaInvadedUkraine, #Ukraine, #Uk...",0,0,,1575825370061844480
1,doc_3770,Did #Russia sabotage the #Nordstream pipelines...,Wed Sep 28 17:50:58 +0000 2022,"[#Russia, #Nordstream, #nuclearwar., #UkraineR...",0,0,https://open.spotify.com/episode/4VVYlwDtMFDoz...,1575181283885428736
2,doc_3865,#Ukrainian police have identified five #ruSSia...,Wed Sep 28 17:13:23 +0000 2022,"[#Ukrainian, #ruSSian`Z, #Krasnoyarsk, #OMON,,...",1,1,https://twitter.com/adagamov/status/1575129438...,1575171826577444864
3,doc_508,The Ukrainian 128th Brigade used the postal dr...,Fri Sep 30 13:56:21 +0000 2022,"[#Ukraine, #UkraineRussiaWar, #Ukrainians, #Uk...",14,1,,1575847019390046208
4,doc_435,#Zelenskyy admits that #UkraineRussiawar is al...,Fri Sep 30 14:41:32 +0000 2022,"[#Zelenskyy, #UkraineRussiawar, #NATO, #Russia...",33,17,https://www.euronews.com/2022/09/30/ukraine-an...,1575858386893099009
...,...,...,...,...,...,...,...,...
1469,doc_757,#Russia #Ukraine #RussianArmy #RussianMobiliza...,Fri Sep 30 12:16:18 +0000 2022,"[#Russia, #Ukraine, #RussianArmy, #RussianMobi...",0,0,https://curiosityguide.org/trending/russia-ukr...,1575821838109872129
1470,doc_2592,#Russia #Ukraine \nSHOCKING! RUSSIAN SOLDIERS ...,Thu Sep 29 12:09:00 +0000 2022,"[#Russia, #Ukraine, #NATO, #Putin, #Russian, #...",0,1,https://curiosityguide.org/videos/shocking-mom...,1575457613243899904
1471,doc_3692,#Russia said US President #JoeBiden should iss...,Wed Sep 28 18:23:27 +0000 2022,"[#Russia, #JoeBiden, #NordStream2, #UkraineRus...",1,0,,1575189460786073612
1472,doc_1419,Russian armored vehicles attacked Along with ...,Fri Sep 30 05:51:01 +0000 2022,"[#Ukraine, #Ukraine, #Russian, #Russia, #Ukrai...",19,7,,1575724877482405889


Results for ```Computer Science``` query

======================
Sample of 10 results out of 345 for the seached query:

- page_id= 1029 - page_title: Adjoint state method
- page_id= 2059 - page_title: Apache Cassandra
- page_id= 1036 - page_title: Adminer
- page_id= 3089 - page_title: BCSWomen
- page_id= 1043 - page_title: Admissible heuristic
- page_id= 1046 - page_title: Admon
- page_id= 26 - page_title: 12th Computer Olympiad
- page_id= 3103 - page_title: BESM
- page_id= 33 - page_title: 18 bit
- page_id= 1059 - page_title: Adobe Flash

## 3. Add Ranking with TF-IDF

When searching in a search engine, we are interested in obtain the results sorted by relevance or by some other criteria. Notice that **the above results are not ranked**.

Here you are going to implement the **TF-IDF (Term Frequency — Inverse Document Frequency)** mechanism and use it to obtain a list of ordered results.

TF-IDF is a weighting scheme that assigns each term in a document a weight based on its term frequency (FT) and the inverse document frequency (IDF).  The higher the scores, more important the term is.

##### TF
**TF** refers to the frequency of a term $t$ in a specific document $d$. The basic idea is that as a term appears more in the document it becomes more important. On the other side, if we only use pure term counts, longer documents will be favored more. Consider two documents with exactly the same content but one being twice longer by concatenating with itself.  The tf weights of each word in the longer document will be twice the shorter one, although they essentially have the same content. To deal with this issue we need to **normalize the term frequencies**.


$$tf_{t,d} = \dfrac{N_{t,d}}{||D||}\tag{1}$$


where ||D|| is the Euclidean norm.


Let $D=[t_1, t_2, \dots, t_n]$ be the document vector where $t_i$ represent the frequency of the term $i$, the  Euclidean Norm is calculated as


$$\sqrt{\sum_{t=1}^{n}t_i{^2}}\tag{2}$$


Note that $||D||$ is the same for all terms of a document.


##### IDF
A drawback of tf is that it considers all terms equally important. However, less common terms are more discriminative than others. To deal with this issue we introduce **idf (inverse document frequency)** that takes into account the number of documents containing the term.

$$idf_t = log\dfrac{N}{df_t}\tag{3}$$

where:

- $N$ is the total number of documents;
- $df_t$ is the number of documents containing the term $t$.

The log operation is applied to avoid that terms that appears in a high number of documents are considered to be too much less important, in this way we are smoothing (dampening) this difference.


In [61]:
def create_index_tfidf(tweets, num_documents, mapping_dt):
    """
    Implement the inverted index and compute tf, df and idf

    Argument:
    lines -- collection of Wikipedia articles
    num_documents -- total number of documents

    Returns:
    index - the inverted index (implemented through a Python dictionary) containing terms as keys and the corresponding
    list of document these keys appears in (and the positions) as values.
    tf - normalized term frequency for each term in each document
    df - number of documents each term appear in
    idf - inverse document frequency of each term
    """

    index = defaultdict(list)
    tf = defaultdict(list)  # term frequencies of terms in documents (documents in the same order as in the main index)
    df = defaultdict(int)  # document frequencies of terms in the corpus
    tweet_index = defaultdict(str)
    idf = defaultdict(float)

    for i, tweet in enumerate(tweets):  #
        tweet_id = int(tweet[6])
        doc_id = mapping_dt[mapping_dt['id'] == str(tweet_id)]
        doc_id = (doc_id.values[0])[0]
        tweet_index[doc_id] = [tweet[0], tweet[1], tweet[2], tweet[3], tweet[4], tweet[5], tweet[6]]
        terms = tweet[7]

        current_page_index = {}

        for position, term in enumerate(terms):  ## terms contains page_title + page_text
            try:
                # if the term is already in the dict append the position to the corresponding list
                current_page_index[term][1].append(position)
            except:
                # Add the new term as dict key and initialize the array of positions and add the position
                current_page_index[term] = [doc_id, array('I', [position])] #'I' indicates unsigned int (int in Python)

        #normalize term frequencies
        # Compute the denominator to normalize term frequencies (formula 2 above)
        # norm is the same for all terms of a document.
        norm = 0

        for term, posting in current_page_index.items():
            # posting will contain the list of positions for current term in current document.
            # posting ==> [current_doc, [list of positions]]
            # you can use it to infer the frequency of current term.
            norm += len(posting[1]) ** 2
        norm = math.sqrt(norm)

        # calculate the tf(dividing the term frequency by the above computed norm) and df weights
        for term, posting in current_page_index.items():
            # append the tf for current term (tf = term frequency in current doc/norm)
            tf[term].append(np.round(len(posting[1])/norm,4)) ## SEE formula (1) above
            #increment the document frequency of current term (number of documents containing the current term)
            df[term] +=1  # increment DF for current term

        #merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

        # Compute IDF following the formula (3) above. HINT: use np.log
        for term in df:
            idf[term] = np.round(np.log(float(num_documents/len(df))), 4)

    return index, tf, df, idf, tweet_index


In [62]:
start_time = time.time()
num_documents = len(tweets)
index, tf, df, idf, title_index = create_index_tfidf(tweets_stuctured, num_documents, mapping_dt)
print("Total time to create the index: {} seconds" .format(np.round(time.time() - start_time, 2)))

Total time to create the index: 384.82 seconds


In [63]:
def rank_documents(terms, docs, index, idf, tf, tweet_index):
    """
    Perform the ranking of the results of a search based on the tf-idf weights

    Argument:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    idf -- inverted document frequencies
    tf -- term frequencies
    title_index -- mapping between page id and page title

    Returns:
    Print the list of ranked documents
    """

    # I'm interested only on the element of the docVector corresponding to the query terms
    # The remaining elements would became 0 when multiplied to the query_vector
    doc_vectors = defaultdict(lambda: [0] * len(terms)) # I call doc_vectors[k] for a nonexistent key k, the key-value pair (k,[0]*len(terms)) will be automatically added to the dictionary
    query_vector = [0] * len(terms)

    # compute the norm for the query tf
    query_terms_count = collections.Counter(terms)  # get the frequency of each term in the query.
    # Example: collections.Counter(["hello","hello","world"]) --> Counter({'hello': 2, 'world': 1})
    # HINT: use when computing tf for query_vector

    query_norm = la.norm(list(query_terms_count.values()))

    for termIndex, term in enumerate(terms):  #termIndex is the index of the term in the query

        if term not in index:
            continue

        ## Compute tf*idf(normalize TF as done with documents)
        query_vector[termIndex]= (query_terms_count[term] / query_norm) * idf[term]

        # Generate doc_vectors for matching docs
        for doc_index, (doc, postings) in enumerate(index[term]):
            # Example of [doc_index, (doc, postings)]
            # 0 (26, array('I', [1, 4, 12, 15, 22, 28, 32, 43, 51, 68, 333, 337]))
            # 1 (33, array('I', [26, 33, 57, 71, 87, 104, 109]))
            # term is in doc 26 in positions 1,4, .....
            # term is in doc 33 in positions 26,33, .....
            #tf[term][0] will contain the tf of the term "term" in the doc 26

            if doc in docs:
                doc_vectors[doc][termIndex] = tf[term][doc_index] * idf[term]  # TODO: check if multiply for idf

    # Calculate the score of each doc
    # compute the cosine similarity between queyVector and each docVector:
    # HINT: you can use the dot product because in case of normalized vectors it corresponds to the cosine similarity
    # see np.dot

    doc_scores=[[np.dot(curDocVec, query_vector), doc] for doc, curDocVec in doc_vectors.items() ]
    doc_scores.sort(reverse=True)
    #print(doc_scores)
    result_docs = [x[1] for x in doc_scores]
    #print document titles instead if document id's
    result_docs=[tweet_index[x] for x in result_docs]

    doc_scores = [x[0] for x in doc_scores]        #here we save the doc_scores to do the Evaluation later


    if len(result_docs) == 0:
        print("No results found, try again")
        query = input()
        docs = search_tf_idf(query, index)

    #print ("".join(str(result_docs)), "\n")
    return result_docs, doc_scores

In [64]:
def search_tf_idf(query, index):
    """
    output is the list of documents that contain any of the query terms.
    So, we will get the list of documents for each query term, and take the union of them.
    """
    query = build_terms(query)

    docs = set()
    for term in query:
        try:
            # store in term_docs the ids of the docs that contain "term"
            term_docs = [posting[0] for posting in index[term]]

            # docs = docs Union term_docs
            docs |= set(term_docs)
        except:
            #term is not in index
            pass

    docs = list(docs)
    ranked_docs, doc_scores = rank_documents(query, docs, index, idf, tf, title_index)
    return ranked_docs, doc_scores

In [65]:
query = "United States"
ranked_docs, doc_scores  = search_tf_idf(query, index)

df1 = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df1['predicted_score'] = doc_scores

df1['id'] = df1['id'].astype(str)
df_query1 = df1.merge(mapping_dt, on='id', how='left').drop(columns=['id'])

display(df_query1[:10])

df_query1[:50].to_csv('/content/drive/MyDrive/1st TERM/IRWA/P1/query1.csv')

Unnamed: 0,Text,Created_time,Hashtags,Likes,Retweets,url,predicted_score,doc_id
0,PUTIN: UNITED STATES USED NUCLEAR WEAPONS IN J...,Fri Sep 30 12:40:36 +0000 2022,"[#Russia, #RussiaInvadedUkraine, #Ukraine, #Uk...",1,0,,0.874467,doc_647
1,i am declaring Belgorod saint Petersburg mos...,Fri Sep 30 00:16:05 +0000 2022,"[#UkraineRussiaWar, #UkrainianArmy]",7,0,,0.861408,doc_1747
2,@cspanwj @RepFrenchHill The country that shoul...,Thu Sep 29 11:43:42 +0000 2022,[#UkraineRussiaWar],0,0,,0.837047,doc_2635
3,"Latvia, Romania, Bulgaria, Poland, Estonia and...",Wed Sep 28 18:34:28 +0000 2022,"[#Russia;, #UkraineRussiaWar]",1,0,,0.725039,doc_3665
4,Putin: United States used nuclear weapons in J...,Fri Sep 30 12:51:35 +0000 2022,"[#Putin, #Russian, #UkraineRussiaWar]",6,2,http://bit.ly/3rCK6Pz,0.69666,doc_613
5,“The United States does not object to Ukraine ...,Wed Sep 28 16:34:12 +0000 2022,"[#UkraineRussiaWar, #Ukraine, #UkraineUnderAtt...",3,0,,0.666021,doc_3931
6,"🇷🇺🇺🇸Explosions on the ""Northern Stream gas pip...",Fri Sep 30 07:01:30 +0000 2022,"[#Russian, #UkraineRussiaWar, #Ukraine, #Nords...",0,0,,0.648442,doc_1391
7,"It should be noted that on September 24, #Russ...",Wed Sep 28 18:41:34 +0000 2022,"[#Russian, #Ukraine., #RussiaUkraineConflict, ...",9,1,,0.627848,doc_3645
8,Putin notified the State Duma of the proposal ...,Fri Sep 30 10:56:39 +0000 2022,"[#Ukraine, #Ukrainewar, #UkraineRussiaWar]",1,0,,0.619309,doc_1008
9,#Russia said US President #JoeBiden should iss...,Wed Sep 28 18:23:27 +0000 2022,"[#Russia, #JoeBiden, #NordStream2, #UkraineRus...",1,0,,0.591935,doc_3692


In [66]:
query = "trump russia"
ranked_docs, doc_scores  = search_tf_idf(query, index)

df = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df['predicted_score'] = doc_scores

df['id'] = df['id'].astype(str)
df_query2 = df.merge(mapping_dt, on='id', how='left').drop(columns=['id'])

display(df_query2[:10])

df_query2[:50].to_csv('/content/drive/MyDrive/1st TERM/IRWA/P1/query2.csv')


Unnamed: 0,Text,Created_time,Hashtags,Likes,Retweets,url,predicted_score,doc_id
0,Russia #Putin. #Moscow. #Russia \n\nPutin chee...,Fri Sep 30 18:37:06 +0000 2022,"[#Putin., #Moscow., #Russia, #Russian, #Ukrain...",0,0,,0.887903,doc_9
1,Trump is the only solution we have today to en...,Thu Sep 29 11:24:51 +0000 2022,"[#Biden, #USA, #Trump, #COVID19, #Ukraine, #Ru...",1,0,https://twitter.com/bennyjohnson/status/157518...,0.768863,doc_2660
2,#UkraineRussiaWar #Russians #Russia #russiaisa...,Fri Sep 30 10:24:38 +0000 2022,"[#UkraineRussiaWar, #Russians, #Russia, #russi...",0,0,,0.765975,doc_1100
3,PUTIN: THERE ARE FOUR NEW REGIONS OF RUSSIA\n\...,Fri Sep 30 12:24:26 +0000 2022,"[#Russia, #RussiaInvadedUkraine, #Ukraine, #Uk...",0,0,,0.757185,doc_720
4,PUTIN: MOST STATES CHOSE TO COOPERATE WITH RUS...,Fri Sep 30 12:49:04 +0000 2022,"[#Russia, #RussiaInvadedUkraine, #Ukraine, #Uk...",0,0,,0.757185,doc_624
5,"More than 100,000 people have already been mob...",Thu Sep 29 14:44:45 +0000 2022,"[#Russia, #RussiaInvadedUkraine, #Ukraine, #Uk...",0,0,,0.757185,doc_2370
6,#TrumpIsALaughingStock #Putin #PutinsPuppet #U...,Thu Sep 29 02:43:28 +0000 2022,"[#TrumpIsALaughingStock, #Putin, #PutinsPuppet...",3,0,,0.738726,doc_3089
7,Why can’t NATO accept Ukraines application fas...,Fri Sep 30 12:11:07 +0000 2022,"[#russia, #UkraineRussiaWar]",0,0,,0.725039,doc_784
8,Russia Ukraine updates \n#Russia #RussiaInvade...,Thu Sep 29 03:26:13 +0000 2022,"[#Russia, #RussiaInvadedUkraine, #Ukraine, #Uk...",1,1,,0.725039,doc_3063
9,#Russia \nIt seems that Russia is trying to pr...,Fri Sep 30 14:55:47 +0000 2022,"[#Russia, #USA, #MAGA, #Politics, #Biden, #GOP...",0,0,,0.725039,doc_407


In [67]:
query = "Bombs cities"
ranked_docs, doc_scores  = search_tf_idf(query, index)

df = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df['predicted_score'] = doc_scores

df['id'] = df['id'].astype(str)
df_query3 = df.merge(mapping_dt, on='id', how='left').drop(columns=['id'])

display(df_query3[:10])

df_query3[:50].to_csv('/content/drive/MyDrive/1st TERM/IRWA/P1/query3.csv')


Unnamed: 0,Text,Created_time,Hashtags,Likes,Retweets,url,predicted_score,doc_id
0,A series of explosions rocked the Ukrainian ci...,Thu Sep 29 19:56:43 +0000 2022,"[#Russia, #RussiaInvadedUkraine, #Ukraine, #Uk...",0,0,,0.648442,doc_2069
1,russians bombed a public transport stop with c...,Fri Sep 30 10:41:28 +0000 2022,"[#Mykolaiv., #russiaisateroriststate, #Ukraine...",0,0,,0.648442,doc_1055
2,Just imagine this being in your city\n#Thursda...,Fri Sep 30 02:43:26 +0000 2022,"[#Thursday, #UkraineRussiaWar]",2,1,https://twitter.com/TpyxaNews/status/157551099...,0.561547,doc_1566
3,"Overnight September 30, powerful explosions he...",Fri Sep 30 12:19:38 +0000 2022,"[#Russian, #Belgorod., #Ukraine, #UkraineWar, ...",0,3,,0.492484,doc_741
4,Dnipro city was hit overnight. Missiles target...,Thu Sep 29 07:15:28 +0000 2022,[#UkraineRussiaWar],2,1,,0.483318,doc_2844
5,The Ukrainian city of Zaporizhzhia earlier thi...,Fri Sep 30 03:11:37 +0000 2022,[#UkraineRussiaWar],102,5,,0.458455,doc_1539
6,Russia bombing innocent citizens.\nThis must s...,Fri Sep 30 03:26:50 +0000 2022,[#UkraineRussiaWar],8,0,,0.444014,doc_1528
7,"⚡️ 1 killed, several wounded as result of #Rus...",Thu Sep 29 21:30:03 +0000 2022,"[#Russian, #UkraineRussiaWar]",7,1,,0.437233,doc_1978
8,#JoeBiden bombed the #NordStream2 pipeline!\n\...,Thu Sep 29 01:21:41 +0000 2022,"[#JoeBiden, #NordStream2, #WWIII, #NordStreamP...",1,0,https://youtu.be/fB4bKbrCMZE,0.418524,doc_3157
9,#Lyman 🇺🇦🇷🇺 Russian forces finally get out of ...,Fri Sep 30 11:54:42 +0000 2022,"[#Lyman, #Russia, #Ukraine️, #UkraineRussiaWar]",1,0,,0.378592,doc_846


In [74]:
query = "World War"
ranked_docs, doc_scores  = search_tf_idf(query, index)

df = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df['predicted_score'] = doc_scores

df['id'] = df['id'].astype(str)
df_query4 = df.merge(mapping_dt, on='id', how='left').drop(columns=['id'])

display(df_query4[:10])

df_query4[:50].to_csv('/content/drive/MyDrive/1st TERM/IRWA/P1/query4.csv')


Unnamed: 0,Text,Created_time,Hashtags,Likes,Retweets,url,predicted_score,doc_id
0,WW3 BIDEN! #IMPEACHBIDENNOW #ImpeachBiden #Ukr...,Wed Sep 28 23:41:32 +0000 2022,"[#IMPEACHBIDENNOW, #ImpeachBiden, #UkraineRuss...",0,0,,0.794148,doc_3220
1,How accurate is this ! \n\n#Russia #UkraineWar...,Wed Sep 28 19:46:31 +0000 2022,"[#Russia, #UkraineWar, #UkraineRussiaWar, #WW3]",9,4,,0.671261,doc_3514
2,Watch out! Warming up WW3 in #UkraineRussiaWar...,Thu Sep 29 06:17:48 +0000 2022,"[#UkraineRussiaWar, #poundcrash]",9,8,https://twitter.com/HBOAsia/status/15526500717...,0.627931,doc_2910
3,@JackPosobiec Biden voters should fight that o...,Fri Sep 30 00:31:54 +0000 2022,"[#Trump2024, #WW3, #UkraineRussiaWar]",0,0,,0.561515,doc_1698
4,All the leaders involved with Coordinating #WW...,Wed Sep 28 18:27:49 +0000 2022,"[#WW3, #UkraineRussiaWar]",0,0,,0.535411,doc_3679
5,Blame Biden voters for all war casualties. #Uk...,Wed Sep 28 18:08:41 +0000 2022,"[#UkraineRussiaWar, #Ukraine, #WW3, #Biden]",0,0,https://twitter.com/TimRunsHisMouth/status/157...,0.51268,doc_3727
6,#Cyberwar is on the horizon:\n\nListen\nhttps:...,Wed Sep 28 18:06:01 +0000 2022,"[#Cyberwar, #ww3, #NordStream2, #Nordstream, #...",0,0,https://youtu.be/qZWba-GOCD0,0.492613,doc_3734
7,The UK &amp; US when war comes &amp; they've s...,Wed Sep 28 17:21:43 +0000 2022,"[#WWIII, #WW3, #UkraineRussiaWar]",1,0,,0.397074,doc_3854
8,#PutinWarCriminal knows that he has his #back ...,Wed Sep 28 16:54:45 +0000 2022,"[#PutinWarCriminal, #back, #wall, #military, #...",0,0,,0.378605,doc_3893
9,#Americans are the waste of the 21st century. ...,Thu Sep 29 20:27:17 +0000 2022,"[#Americans, #German, #Germany, #UnitedStates,...",0,0,,0.362445,doc_2047


In [75]:
query = "Nazi war"
ranked_docs, doc_scores  = search_tf_idf(query, index)

df = pd.DataFrame(ranked_docs, columns=['Text', 'Created_time', 'Hashtags', 'Likes', 'Retweets', 'url', 'id'])
df['predicted_score'] = doc_scores

df['id'] = df['id'].astype(str)
df_query5 = df.merge(mapping_dt, on='id', how='left').drop(columns=['id'])

display(df_query5[:10])

df_query5[:50].to_csv('/content/drive/MyDrive/1st TERM/IRWA/P1/query5.csv')


Unnamed: 0,Text,Created_time,Hashtags,Likes,Retweets,url,predicted_score,doc_id
0,"The war crime, for The Holocaust was \nThe Nur...",Wed Sep 28 16:14:41 +0000 2022,"[#Putin, #UkraineRussiaWar)]",1,1,,0.687745,doc_3974
1,If I could send a message in the form of a son...,Wed Sep 28 21:52:37 +0000 2022,"[#Russia, #UkraineRussiaWar]",0,0,https://www.youtube.com/watch?v=isCh4kCeNYU&ab...,0.636764,doc_3356
2,Land grabs in #UkraineRussiaWar \n\nInfrastruc...,Fri Sep 30 10:17:42 +0000 2022,"[#UkraineRussiaWar, #EU, #russia]",1,1,,0.627848,doc_1117
3,Least fascist Ukrane Nazi who Biden honors at ...,Thu Sep 29 05:38:57 +0000 2022,"[#BidenDeliversAGAIN, #UkraineRussiaWar, #Bide...",190,37,,0.609138,doc_2954
4,#NFTCommunitys #NFTdrop #War #NFTProjects #U...,Thu Sep 29 07:33:19 +0000 2022,"[#NFTCommunitys, #NFTdrop, #War, #NFTProjects,...",24,1,,0.591935,doc_2834
5,#218dayofwar\n\n⚡Map of the war in Ukraine by ...,Thu Sep 29 08:38:59 +0000 2022,"[#218dayofwar, #UkraineWar, #StandWithUkraine,...",10,4,,0.591935,doc_2792
6,#219dayofwar\n\n⚡Map of the war in Ukraine by ...,Fri Sep 30 07:04:28 +0000 2022,"[#219dayofwar, #UkraineWar, #StandWithUkraine,...",13,3,,0.591935,doc_1384
7,#russiaisateroriststate if every single one of...,Fri Sep 30 12:29:02 +0000 2022,"[#russiaisateroriststate, #Ukraine, #Putin, #R...",0,0,,0.576113,doc_696
8,2)\nUSA preparing for a war with Russia?\nUsua...,Wed Sep 28 16:34:26 +0000 2022,"[#Russian, #UkraineRussiaWar]",0,0,https://thehill.com/policy/international/36645...,0.535429,doc_3928
9,Now that the superpowers at war are cutting ea...,Wed Sep 28 19:46:54 +0000 2022,"[#UkraineRussiaWar, #UkraineWar]",0,0,,0.535429,doc_3512
