# Search Engine

This notebook implements a search engine for tweets related to COVID-19. Usage:
* Download the required files from Google Drive or execute `Indexing.ipynb`.
* Execute the cells up to and including the declaration of the `TwitterSearch` class.
* Initialize an instance of this class (called `SE` in this code).
* Call the `interface` method to launch the interactive prompt of the search engine.
* Enter -1 (or any value of 0 or less) as the number of tweets to retrieve to stop the execution.

### Important note

#### Download required files stored in Google Drive
This notebook requires the files `inverted_index.json` and `tweets_with_authority.csv`, which can be downloaded from the Google Drive folder (`https://drive.google.com/drive/u/1/folders/16I4_ZCre59ufD9lDZbFK9cn1mALRmPjB`). The code below also loads `vectorizer.pickle` from the data folder. The files must be stored in the `~/data` folder as specified in the README.
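
Before running the cells below, it can help to confirm that the data files are in place. The following is a minimal sketch, not part of the original notebook: the helper `missing_files` is a hypothetical name, the file names are taken from the constants defined later in the code, and the `../data` path is the one the code reads from.

```python
from pathlib import Path

# File names taken from the constants defined later in this notebook.
REQUIRED_FILES = (
    "tweets_with_authority.csv",
    "inverted_index.json",
    "vectorizer.pickle",
)


def missing_files(data_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in required if not (data_dir / name).is_file()]


# Assumption: the notebook is run from a folder that sits next to `data/`.
missing = missing_files("../data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```

If any file is reported as missing, download it from the Google Drive folder or regenerate it with `Indexing.ipynb` before continuing.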

In [1]:
# We specify the path for importing modules
import sys
sys.path.append('../')

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import json
import pickle
from sklearn.metrics.pairwise import cosine_similarity
from utils import clean_text, personalized_tokenizer
from IPython.display import clear_output

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/javi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
AUTHORITY_DATASET = "tweets_with_authority.csv"
INVERTED_INDEX = "inverted_index.json"
VECTORIZER = "vectorizer.pickle"
INPUT_PATH = "../data/"
MODES = ['1', '2']

In [68]:
class TwitterSearch:
    def __init__(self):
        self.data, self.corpus, self.vectorizer, self.inverted_index = self._load_information()

    def _load_information(self):
        # Load pretrained vectorizer
        with open(INPUT_PATH + VECTORIZER, "rb") as f:
            vectorizer = pickle.load(f)

        # Load corpus
        df = pd.read_csv(INPUT_PATH + AUTHORITY_DATASET)
        corpus = df['clean_text']
        corpus = corpus.fillna('')
        corpus = vectorizer.transform(corpus)
        
        with open(INPUT_PATH + INVERTED_INDEX, 'r') as f:
            inverted_index = json.load(f)

        return df, corpus, vectorizer, inverted_index

    def _get_tweet_fields(self, i):
        """
        Return the fields of a tweet needed to display a result.
        i: positional index (row number) of the tweet in self.data
        returns user name, text, url, hashtags, favorite count, retweet count and followers count
        """
        df = self.data
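        # ast.literal_eval parses the stored dict strings without executing arbitrary code
        import ast
        user_name = ast.literal_eval(df.iloc[i]['user'])['name']
        text = df.iloc[i]['full_text']
        entities = ast.literal_eval(df.iloc[i]['entities'])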
        urls = entities['urls']
        if urls:
            url = urls[0]['url']
            text = text.replace(url, '')
        else:
            url = 'No url'

        hashtags = entities['hashtags']

        if not hashtags:
            hashtags = 'No hashtags'

        favorite_count = df.iloc[i]['favorite_count']
        retweet_count = df.iloc[i]['retweet_count']
        followers_count = df.iloc[i]['followers_count']

        return user_name, text, url, hashtags, favorite_count, retweet_count, followers_count

    def find_full_match_docs(self, query):
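        """
        Return the indexes of the documents containing all terms in the query.
        Terms that do not appear in the inverted index are ignored.
        """
        docs = None

        for word in query.split():
            # .get avoids a KeyError when a term (including the first one) is not indexed
            postings = self.inverted_index.get(word)
            if postings is None:
                continue
            ids = {p[0] for p in postings}
            docs = ids if docs is None else docs & ids

        # Empty query or no indexed terms: there are no full matches
        if not docs:
            return []

        return list(self.data[self.data['id_str'].isin(docs)].index)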
    
    def _get_hashtags(self, hashtags):
        hashtags = hashtags[:]
        clean_hashtags = []
        if hashtags != 'No hashtags':
            for h in hashtags:
                clean_hashtags.append('#'+h['text'])
            return ', '.join(clean_hashtags)
        else:
            return 'No hashtags'
    
    def return_top_n_doc(self, query, n, show=True, authority=None):
        """
        query: query text written by the user
        n: number of documents to return to the user
        show: whether to print the results
        authority: optional array with one authority score per tweet, used to reweight the similarity scores

        returns a list with the indexes of the top n most relevant tweets
        """
        assert n > 0, "n should be a positive integer"
        query = clean_text(query)  # normalize the query
        query_vec = self.vectorizer.transform([query])  # compute the TF-IDF vector of the query
        results = cosine_similarity(self.corpus, query_vec)
        results = results.flatten()

        documents_retrieved = []

        #######Return the results#########
        rank=0

        if authority is not None:
            results = 3*results*0.5*authority

        # Sort the document indexes by descending score
        results = results.argsort()[::-1]

        ## Generate print mask for results

        # The mask will contain the indexes from the results array in printing order
        # By default this mask will be the first n results of our cosine similarity output
        # Capped at the corpus size so the mask never indexes past the end of results
        mask = list(range(min(n, len(results))))

        # We find those documents that contain all the terms in the query
        full_matches = np.array(self.find_full_match_docs(query))

        # If we have more full matches than desired results, we just use them in order to print
        if len(full_matches)>=n:
            mask = list(np.where(np.isin(results, full_matches))[0])

        elif len(full_matches)==0:
            pass    
        # If not, we will include first those with full match and the remaining ones will be ordered
        # simply by cosine similarity
        else:
            full_rank = 0
            mask = [i for i in mask if results[i] not in full_matches] # Remove values in full matches to avoid duplicates
            for i in range(len(results)):
                if results[i] in full_matches:
                    # Insert the full matches at the beginning to preserve the order of the remaining results
                    mask.insert(full_rank, i) 
                    full_rank+=1

        # Ensure we will only print n results
        mask = mask[:n]

        # Print following the order determined by the mask
        for i in mask:
            i = int(i)
            user_name, text, url, hashtags, favorites, retweets, followers = self._get_tweet_fields(results[i])
            if show:
                print("-->",rank + 1)
                print(text," | ",user_name," | ",self.data.iloc[results[i]]['created_at']," | ", self._get_hashtags(hashtags) ," | Favorites: ",favorites," | Retweets: ", retweets," | ",url, " | Followers: ", followers)
                print('\n')
            documents_retrieved.append(results[i])
            rank +=1

        return documents_retrieved

    def query(self, query, n=20, show=True, authority=None):
        return self.return_top_n_doc(query, n, show, authority)

    def interface(self):
        while True:
            n = int(input("Enter the desired number of results: "))
            if n<=0:
                print('Stopping Search Engine... Good Bye and See You Soon :)')
                break
            
            while True:
                mode = str(input("""Which mode would you like to use (insert number for the desired option)\n1: TF-IDF\n2: TF-IDF and authority\n"""))
                
                if mode in MODES:
                    break
                else:
                    print("Please insert some of these options: {}".format(', '.join(MODES)))

            query = input("Enter your query: ")
            if mode == "1":
                self.query(query, n, show=True, authority=None)
            elif mode == "2":
                self.query(query, n, show=True, authority=self.data.authority_interp.values)
            
            while True:
                br = input("Do you want to input another query [y/n]: ")
                if br in ['y', 'n']:
                    break
                else:
                    print('Please enter a valid value (y: yes, n: no)')
            
            if br == 'n':
                print('Stopping Search Engine... Good Bye and See You Soon :)')
                break
            else:
                clear_output(wait=False)
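
The ranking in `return_top_n_doc` combines two ideas: looking up the documents that contain every query term in the inverted index, and placing those full matches ahead of the remaining results while keeping the score order within each group. The following is a simplified, standalone sketch of that idea on toy data. It is not the class's actual code: the function names, the toy index, the scores, and the assumption that each posting is a `[doc_id, term_frequency]` pair are all hypothetical (the class only relies on the first element being the document id). The stable sort used here appears to give the same order as the mask logic, but it is written for readability rather than as a drop-in replacement.

```python
def full_matches(inverted_index, query):
    """Ids of the documents containing every query term found in the index."""
    docs = None
    for word in query.split():
        postings = inverted_index.get(word)
        if postings is None:
            continue  # terms missing from the index are skipped
        ids = {entry[0] for entry in postings}
        docs = ids if docs is None else docs & ids
    return docs or set()


def promote_full_matches(scores, matches, n):
    """Rank doc ids by score, full matches first, and keep the top n."""
    by_score = sorted(scores, key=scores.get, reverse=True)
    # sorted() is stable, so the score order is preserved inside each group.
    return sorted(by_score, key=lambda doc: doc not in matches)[:n]


# Toy data (hypothetical): term -> list of [doc_id, term_frequency] pairs.
toy_index = {
    "covid": [["a", 2], ["b", 1], ["c", 1]],
    "deaths": [["b", 1], ["c", 3]],
    "england": [["c", 1], ["d", 1]],
}
# Hypothetical similarity scores for each document.
toy_scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.2}

matches = full_matches(toy_index, "covid deaths")
ranking = promote_full_matches(toy_scores, matches, n=3)
print(sorted(matches))  # ['b', 'c']
print(ranking)          # ['b', 'c', 'a']
```

Document `a` has the highest score but contains only one of the two query terms, so it is listed after the full matches `b` and `c`.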

In [69]:
SE = TwitterSearch()

In [70]:
SE.interface()

Enter the desired number of results: 10
Which mode would you like to use (insert number for the desired option)
1: TF-IDF
2: TF-IDF and authority
2
Enter your query: covid deaths england
--> 1
🇬🇧#coronavirus was the third most common cause of death in England and Wales in October.
The Official for National Statistics has said 3,367 of the 43,265 deaths in England last month involved Covid. #WNM https://t.co/JZh6guAqvk  |  World News Media ALERT  |  2020-11-19 16:34:10+00:00  |  #coronavirus, #WNM  | Favorites:  0  | Retweets:  0  |  No url  | Followers:  115815


--> 2
England witnessed its highest death rate in over a decade in the year to end October as a result of the Covid pandemic, according to data released by the Office for National Statistics.  |  katniss  |  2020-11-19 15:54:38+00:00  |  No hashtags  | Favorites:  2  | Retweets:  0  |  No url  | Followers:  3950


--> 3
BBC reports that Covid was the third most common cause of death in October in England and Wales - behind dem