# Output diversification
Students: Irene Cantero (U151206) / Jian Chen (U150279)

The goal of this exercise is to diversify the output of the current ranking system.

Content:
- Current state of the search engine
- Measures for the diversity score and coverage
- Post-processed results after applying our custom algorithm to increase the diversity score and coverage


In [12]:
from search_engine.search_engine import SearchEngine
import pandas as pd
import os
import warnings
import csv
import numpy as np
from sklearn.cluster import KMeans
import random
warnings.filterwarnings('ignore')

In [13]:
# Initialization of the search engine
search_engine = SearchEngine()

In [14]:
'''
Assign a cluster to each document using K-means.
"search_engine.query_results" is the table that contains the tweets to be returned by the ranking system. It exists
because it is a simplified version of "search_engine.tweets", which contains all the columns of the original tweets.
'''
def cluster_assignation(model):
    NUM_CLUSTERS = 5
    total_tokens = []
    # Embed each tweet in the simplified table as the mean of the Word2Vec vectors of its words
    for tweet in search_engine.query_results["Tweet"]:
        tokens = []
        for word in tweet.split():
            try:
                tokens.append(model[word])
            except KeyError:
                # the word is not in the Word2Vec vocabulary
                pass
        if tokens:
            total_tokens.append(np.mean(np.array(tokens), axis=0))
        else:
            # no known word: fall back to a zero vector (assumes the model exposes `vector_size`, as gensim models do)
            total_tokens.append(np.zeros(model.vector_size))

    # K-means clustering
    kmeans = KMeans(n_clusters=NUM_CLUSTERS)
    labels = kmeans.fit_predict(total_tokens)

    # Label assignment: one label per row, in row order
    search_engine.query_results["cluster_label"] = labels.astype(float)
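
The embedding step above averages the word vectors of each tweet. A minimal pure-Python sketch of that averaging is shown here, assuming a toy two-dimensional vocabulary; `mean_vector` and `toy_vectors` are hypothetical names used only for illustration and are not part of the search engine.

```python
def mean_vector(text, vectors, dim):
    """Average the vectors of the words in `text` that appear in `vectors`."""
    found = [vectors[word] for word in text.split() if word in vectors]
    if not found:
        # No known word: fall back to a zero vector so every tweet gets an embedding
        return [0.0] * dim
    # zip(*found) groups the values of each dimension together
    return [sum(component) / len(found) for component in zip(*found)]

# Hypothetical 2-dimensional "embeddings", only for illustration
toy_vectors = {"joe": [1.0, 0.0], "biden": [0.0, 1.0]}

tweet_embedding = mean_vector("joe biden wins", toy_vectors, dim=2)
empty_embedding = mean_vector("hello world", toy_vectors, dim=2)
print(tweet_embedding)   # [0.5, 0.5]
print(empty_embedding)   # [0.0, 0.0]
```

Words missing from the vocabulary ("wins" here) are skipped, so they do not pull the average towards zero.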

In [15]:
model = search_engine.ranking_system.w2v
cluster_assignation(model)

In [27]:
def get_sorted_dictionary(input_: dict) -> list:
    # keys sorted by their value, highest first
    return sorted(input_, key=input_.get, reverse=True)

def compute_cluster_dominance(results: pd.DataFrame, num_clusters: int) -> dict:
    # number of results per cluster, including clusters without results
    count_clusters = {}
    for i in range(num_clusters):
        count_clusters[i] = 0
    # iterate over the values, so the result does not depend on the DataFrame index
    for label in results["cluster_label"]:
        count_clusters[int(label)] += 1

    # normalise the counts to fractions of the ranking
    total_tweets = sum(count_clusters.values())
    for i in range(num_clusters):
        count_clusters[i] = count_clusters[i] / total_tweets

    return count_clusters

def coverage_score(clusters: dict) -> float:
    # fraction of clusters that appear at least once in the ranking
    coverage = 1.0
    for cluster in clusters:
        if clusters[cluster] == 0:
            coverage -= 1 / len(clusters)
    return coverage

def diversity_score(clusters: dict) -> float:
    # 1 minus the total absolute deviation from a uniform split over the clusters
    difference = 0.0
    for cluster in clusters:
        difference += np.abs(1 / len(clusters) - clusters[cluster])
    return 1 - difference

This is the current state of the search engine. The top 20 returns results from cluster 0 and cluster 2, but nothing from clusters 1 and 3. Therefore, we should find a way to include these clusters at least in the top 20.

The diversity score is our own definition, and it is computed with the following formula:

\begin{equation*}
\mathrm{score} = 1 - \sum_{k=1}^{N} \left| \frac{1}{N} - \mathrm{dominance}_k \right|
\end{equation*}

where $N$ is the number of clusters and $\mathrm{dominance}_k$ is the fraction of the suggested ranking list that belongs to cluster $k$.
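
As a quick sanity check of the formula, the sketch below evaluates it on three hand-made dominance vectors for $N = 5$. The function name `diversity` and the example vectors are our own illustration, not values produced by the search engine. The score is 1 for a uniform split and can drop below zero: when one cluster holds the whole ranking it reaches $2/N - 1$, which is $-0.6$ for $N = 5$.

```python
def diversity(dominance):
    """1 minus the total absolute deviation from a uniform split over the clusters."""
    n = len(dominance)
    return 1 - sum(abs(1 / n - share) for share in dominance)

balanced = diversity([0.2, 0.2, 0.2, 0.2, 0.2])        # every cluster equally present
two_clusters = diversity([0.5, 0.0, 0.5, 0.0, 0.0])    # only clusters 0 and 2 present
single_cluster = diversity([1.0, 0.0, 0.0, 0.0, 0.0])  # one cluster holds everything

print(round(balanced, 3))        # 1.0
print(round(two_clusters, 3))    # -0.2
print(round(single_cluster, 3))  # -0.6
```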

The cell below runs an example query:

In [16]:
# None disables column truncation (-1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)
print("Insert your query:\n")
query = input()
results = search_engine.run(query).query("score>0")
results

Insert your query:

joe biden


Unnamed: 0,Tweet,Username,Date,Hashtags,Likes,Retweets,Replies,Url,cluster_label,score
0,Joe Biden wins ... Again.,Joe Scarborough,Sat Dec 05 12:21:27 +0000 2020,[],5699,470,197,https://twitter.com/i/web/status/1335197592054063106,2.0,0.721872
1,"Part 1 {Thread to Document}:\n\nJoe Biden, Biden Family, Burisma #Corruption \n\n“Report Shows Joe Biden Stole $140 Million From US Federal Treasury &amp; Transferred the $$ to Rosemont Seneca, Purportedly\nFor Bank Bailouts &amp; Then to His Personal Account in the Cayman Islands”\n\n@POTUS https://t.co/LsoWoflm0C https://t.co/xb8F10jFiL",Liberty Times & Politics,Sat Mar 28 02:50:26 +0000 2020,[Corruption],596,804,31,https://twitter.com/i/web/status/1243732150714761219,0.0,0.406497
2,Barack Obama was a Republican and Joe Biden is to the right of him,Dr. Manhattan 🇳🇬,Fri Dec 04 15:51:24 +0000 2020,[],1747,244,16,https://twitter.com/i/web/status/1334888040360112128,4.0,0.380502
3,joe biden is going to be time's person of the year in 2021 isn't he,Jack Saint,Sat Dec 05 20:12:22 +0000 2020,[],1074,16,23,https://twitter.com/i/web/status/1335316102318936064,0.0,0.361759
4,Should we extradite Joe Biden to the Ukraine?,Dean Browning,Sat Dec 05 21:52:26 +0000 2020,[],1072,178,117,https://twitter.com/i/web/status/1335341283515265025,0.0,0.352605
...,...,...,...,...,...,...,...,...,...,...
70,"According to the pool report, Joe Biden went to church this afternoon at St. Joseph on the Brandywine Church.",Kyle Griffin,Sat Dec 05 23:00:00 +0000 2020,[],2938,204,85,https://twitter.com/i/web/status/1335358289777811456,0.0,0.177726
71,"Joe Biden's OMB nominee Neera Tanden tried to force Catholic nuns to fund abortions.\n\nBut the liberal media keeps telling us he's a ""faithful Catholic!"" 🤣",LifeNews.com,Sat Dec 05 23:35:37 +0000 2020,[],156,99,5,https://twitter.com/i/web/status/1335367252976300032,4.0,0.175571
72,Number of times Twitter has censored President Trump: 325 Number of times Twitter has censored Joe Biden: 0,LifeNews.com,Sat Dec 05 17:27:47 +0000 2020,[],798,390,31,https://twitter.com/i/web/status/1335274682430468097,0.0,0.172428
73,@BreitbartNews It isn’t Joe Biden’s slogan. It’s the NWO slogan. Go look to see who else has used this slogan in Europe,🐸Deplorable Artist🐸 ⭐️ ⭐️ ⭐️,Sat Dec 05 23:21:35 +0000 2020,[],2,1,0,https://twitter.com/i/web/status/1335363719480619008,0.0,0.161351


The output below shows the dominance of each cluster, the diversity score, the most dominant cluster, and the ranking coverage, which is computed as:
\begin{equation*}
\mathrm{coverage} = 1 - \frac{\left| \{ k : \mathrm{dominance}_k = 0 \} \right|}{N}
\end{equation*}
that is, the fraction of the $N$ clusters that appear at least once in the ranking.
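
A small worked example of the coverage measure follows, assuming that coverage means the share of clusters with at least one result in the ranking (our reading of `coverage_score`). The function name `coverage` and the dominance vectors are hypothetical.

```python
def coverage(dominance):
    """Fraction of clusters that appear at least once in the ranking."""
    empty = sum(1 for share in dominance if share == 0)
    return 1 - empty / len(dominance)

partial = coverage([0.5, 0.0, 0.5, 0.0, 0.0])  # 3 of the 5 clusters are missing
full = coverage([0.2, 0.2, 0.2, 0.2, 0.2])     # every cluster is present

print(round(partial, 2))  # 0.4
print(round(full, 2))     # 1.0
```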

In [28]:
results_clusters = compute_cluster_dominance(results, 5)
dominant_clusters = get_sorted_dictionary(results_clusters)
score = diversity_score(results_clusters)
cov_score = coverage_score(results_clusters)
print(f"Clusters and percentage of dominance: {results_clusters}")
print(f"Diversity score of original ranking: {score}")
print(f"Cluster dominance: {dominant_clusters[0]}")
print(f"Ranking coverage: {cov_score}")

Clusters and percentage of dominance: {0: 0.18666666666666668, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.21333333333333335}
Diversity score of original ranking: 0.9733333333333334
Cluster dominance: 4
Ranking coverage: 1.0


Here we start the post-processing of the current results. The goal is to increase the diversity score and the ranking coverage while preserving a decent overall ranking score. To do so, we perform the following steps:
- Get the most dominant and the least dominant cluster.
- Pick a random result that belongs to the most dominant cluster.
- Replace it with the top 1 result of the least dominant cluster, to maintain a good score.
- Repeat the process until the diversity score passes a certain threshold or N iterations have run.
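
The steps above can be sketched on a plain list of cluster labels. This is a simplified illustration and not the notebook's implementation: it only relabels positions, it assumes that a replacement tweet from the least dominant cluster is always available, and the names `rebalance`, `diversity`, `threshold`, `max_iter` and `seed` are hypothetical.

```python
import random

def diversity(labels, num_clusters):
    """Diversity score of a list of cluster labels."""
    shares = [labels.count(k) / len(labels) for k in range(num_clusters)]
    return 1 - sum(abs(1 / num_clusters - s) for s in shares)

def rebalance(labels, num_clusters, threshold=0.9, max_iter=500, seed=0):
    """Move random positions from the most to the least dominant cluster."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    labels = list(labels)      # work on a copy
    for _ in range(max_iter):
        if diversity(labels, num_clusters) > threshold:
            break
        counts = [labels.count(k) for k in range(num_clusters)]
        most = counts.index(max(counts))   # first cluster with the highest count
        least = counts.index(min(counts))  # first cluster with the lowest count
        # pick a random position currently held by the most dominant cluster
        positions = [i for i, label in enumerate(labels) if label == most]
        labels[rng.choice(positions)] = least
    return labels

ranking = [0] * 10 + [2] * 10  # only clusters 0 and 2 are present
balanced = rebalance(ranking, num_clusters=5)
print(sorted(set(balanced)))   # [0, 1, 2, 3, 4]
```

Choosing the position at random spreads the replacements over the whole ranking instead of always touching the top or the bottom of the list.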

In [33]:
# given a cluster, finds the first tweet belonging to that cluster, and returns it
def get_first_result_of_cluster(results: pd.DataFrame, cluster: int) -> pd.Series:
    first_result = results.query(f"cluster_label == {cluster}")
    return first_result.iloc[0]

# returns the cluster labels sorted from most to least dominant
def compute_dominance(results: pd.DataFrame, num_clusters: int) -> list:
    cluster_dominance = compute_cluster_dominance(results, num_clusters)
    sorted_cluster_dominance = get_sorted_dictionary(cluster_dominance)
    return sorted_cluster_dominance

# returns whether the diversity score passes a certain threshold;
# expects the dominance dictionary (cluster -> fraction), not the sorted list of labels
def cluster_diversity(clusters: dict) -> bool:
    threshold = 0.9
    score = diversity_score(clusters)
    return score > threshold

# core function
def diversity_increaser(results: pd.DataFrame, num_clusters: int, num_iter: int) -> None:
    minimum_length = 20
    # We cannot add tweets from other clusters if those tweets are not related to the query
    if len(results) <= minimum_length:
        return

    for i in range(num_iter):
        # most and least dominant clusters in the current ranking
        dominance = compute_cluster_dominance(results, num_clusters)
        sorted_cluster_dominance = get_sorted_dictionary(dominance)
        most_dominant = sorted_cluster_dominance[0]
        # least dominant cluster that still has a tweet in the ranking to copy from
        least_dominant = [c for c in sorted_cluster_dominance if dominance[c] > 0][-1]
        if most_dominant == least_dominant:
            # only one cluster is present, so there is nothing to replace
            break

        # positions (not index labels) of the tweets that belong to the most dominant cluster
        positions = np.flatnonzero((results["cluster_label"] == most_dominant).to_numpy())
        # replace a random tweet of the most dominant cluster by the top 1 of the least dominant cluster
        position_to_replace = int(random.choice(positions))
        results.iloc[position_to_replace] = get_first_result_of_cluster(results, least_dominant)

        # if the diversity score is high enough, stop iterating
        # (the check needs the dominance dictionary, not the sorted list of labels)
        if cluster_diversity(compute_cluster_dominance(results, num_clusters)):
            break


These are the results after the post-processing. The diversity score rises from 0.973 to 1.0, and the coverage stays at its maximum of 1.0.

In [34]:
diversity_increaser(results, 5, 500)
results_clusters = compute_cluster_dominance(results, 5)
dominant_clusters = get_sorted_dictionary(results_clusters)
score = diversity_score(results_clusters)
cov_score = coverage_score(results_clusters)

print(f"Clusters and percentage of dominance: {results_clusters}")
print(f"Diversity score after post-processing: {score}")
print(f"Cluster dominance: {dominant_clusters[0]}")
print(f"Ranking coverage: {cov_score}")

Clusters and percentage of dominance: {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}
Diversity score after post-processing: 1.0
Cluster dominance: 0
Ranking coverage: 1.0


In [21]:
results.head(20)

Unnamed: 0,Tweet,Username,Date,Hashtags,Likes,Retweets,Replies,Url,cluster_label,score
0,Joe Biden wins ... Again.,Joe Scarborough,Sat Dec 05 12:21:27 +0000 2020,[],5699,470,197,https://twitter.com/i/web/status/1335197592054063106,2.0,0.721872
1,Barack Obama was a Republican and Joe Biden is to the right of him,Dr. Manhattan 🇳🇬,Fri Dec 04 15:51:24 +0000 2020,[],1747,244,16,https://twitter.com/i/web/status/1334888040360112128,4.0,0.380502
2,Barack Obama was a Republican and Joe Biden is to the right of him,Dr. Manhattan 🇳🇬,Fri Dec 04 15:51:24 +0000 2020,[],1747,244,16,https://twitter.com/i/web/status/1334888040360112128,4.0,0.380502
3,Joe Biden's Latest Interview Shows He's a Political Coward With No Plan,RedState,Fri Dec 04 18:03:45 +0000 2020,[],71,14,8,https://twitter.com/i/web/status/1334921346623676418,1.0,0.285205
4,Congressman-elect @RonnyJacksonTX⁩: 'Something is Going on with Joe Biden's Health',Kyle Morris,Sat Dec 05 22:18:18 +0000 2020,[],659,188,142,https://twitter.com/i/web/status/1335347795792977921,3.0,0.265962
5,"Part 1 {Thread to Document}:\n\nJoe Biden, Biden Family, Burisma #Corruption \n\n“Report Shows Joe Biden Stole $140 Million From US Federal Treasury &amp; Transferred the $$ to Rosemont Seneca, Purportedly\nFor Bank Bailouts &amp; Then to His Personal Account in the Cayman Islands”\n\n@POTUS https://t.co/LsoWoflm0C https://t.co/xb8F10jFiL",Liberty Times & Politics,Sat Mar 28 02:50:26 +0000 2020,[Corruption],596,804,31,https://twitter.com/i/web/status/1243732150714761219,0.0,0.406497
6,"If you voted for Joe Biden, like and retweet this. Prove to Republicans that the election wasn't rigged and Biden just kicked Trump's ass.",I Smoked Trump's Massive Bribery Dump,Sat Dec 05 17:01:13 +0000 2020,[],5412,3305,173,https://twitter.com/i/web/status/1335267997724925953,2.0,0.344144
7,I will no longer support any Republican that supports Joe Biden !,❌🇺🇸Steve🇺🇸🇺🇸America First🇺🇸🇮🇹MAGA🇺🇸KAG,Sat Dec 05 15:35:40 +0000 2020,[],5300,1603,138,https://twitter.com/i/web/status/1335246467204845568,2.0,0.341731
8,Do you think Joe Biden won the Presidential election fair and square?,Donna 🌺,Sat Dec 05 20:40:29 +0000 2020,[],34,28,25,https://twitter.com/i/web/status/1335323180140032002,2.0,0.335759
9,Congressman-elect @RonnyJacksonTX⁩: 'Something is Going on with Joe Biden's Health',Kyle Morris,Sat Dec 05 22:18:18 +0000 2020,[],659,188,142,https://twitter.com/i/web/status/1335347795792977921,3.0,0.265962
