## Search Engine

Students: Irene Cantero (U151206) / Jian Chen (U150279)

All the code is stored in the `search_engine` folder. The notebook only contains the calls to it, plus some functions for the T-SNE plot and the clustering.

Content: 

- Top 10 results using only TF-IDF + cosine similarity for 10 chosen queries
- Top 10 results using Word2Vec + cosine similarity for the same 10 chosen queries
- Search with a custom score G(d) + cosine similarity for a given query
    - G(d) weights each tweet's likes by 1/2, retweets by 1/3 and replies by 1/6
- T-SNE implementation and plot.
- Plot to find the optimal number of clusters.
- Clustering using K-means with the optimal number of clusters, also showing the most common words of each cluster.

In [3]:
from search_engine.search_engine import SearchEngine
import pandas as pd
import os
import warnings
import csv

warnings.filterwarnings('ignore')

In [4]:
search_engine = SearchEngine()

Collection time: 0.0


Here we run the search engine considering the popularity (Likes, Retweets and Replies) of each tweet.

In [5]:
pd.set_option('display.max_colwidth', None)  # None = do not truncate cell contents (-1 is deprecated)
search_engine.ranking_system.change_user_output(2)
print("Insert your query:\n")
#query = input()
query = "joe biden"
search_engine.run(query).query("score > 0").head(20)

Insert your query:



Unnamed: 0,Tweet,Username,Date,Hashtags,Likes,Retweets,Replies,Url,score,g(d),total_score
0,A new Gallup poll finds that President-elect Joe Biden has a 55% favorable rating and a 41% unfavorable rating.\n\nJoe Biden's already more popular than Trump's ever been. https://t.co/3Qse62vVrH,Vaughn Sterling,Sun Dec 06 15:32:42 +0000 2020,[],4050,843,92,https://twitter.com/i/web/status/1335608111353237504,2.329264,0.026003,2.355268
1,"Joe #Biden’s Lead in #Arizona Shrinks to 12.8K, 46.7K Outstanding #Ballots\n\nhttps://t.co/WYsz3zgglP\n\n#QAnon2018 #QAnon2020 \n#StopTheSteal\n#Election2020 https://t.co/ZFIUO6ShS7","Zeus 🇺🇸 ⭐⭐⭐ No Collusion, No Obstruction!",Wed Nov 11 21:27:32 +0000 2020,"[Biden, Arizona, Ballots, QAnon2018, QAnon2020, StopTheSteal, Election2020]",504,362,36,https://twitter.com/i/web/status/1326637712129069060,2.329264,0.00516,2.334425
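
The values in the output above are consistent with `total_score = score + g(d)`. As a minimal sketch of how such a popularity term could be computed, the snippet below applies the 1/2, 1/3 and 1/6 weights from the description of G(d). The normalization (dividing each count by the column maximum) and the helper names `popularity_scores` and `total_scores` are assumptions made for illustration; the actual implementation lives in the `search_engine` folder and may differ.

```python
# Weights taken from the notebook's description of G(d).
WEIGHTS = {"Likes": 1 / 2, "Retweets": 1 / 3, "Replies": 1 / 6}


def popularity_scores(tweets):
    """Return one G(d) value per tweet (hypothetical helper).

    Assumption: each count is divided by the largest value of that
    column in the collection, so G(d) falls in [0, 1].
    """
    maxima = {col: max((t[col] for t in tweets), default=0) for col in WEIGHTS}
    scores = []
    for t in tweets:
        g = 0.0
        for col, weight in WEIGHTS.items():
            if maxima[col] > 0:  # avoid dividing by zero for empty columns
                g += weight * t[col] / maxima[col]
        scores.append(g)
    return scores


def total_scores(cosine_scores, g_scores):
    """Combine relevance and popularity by simple addition."""
    return [s + g for s, g in zip(cosine_scores, g_scores)]


# Counts copied from the two rows shown above; the cosine scores are equal,
# so only the popularity term separates the two tweets.
tweets = [
    {"Likes": 4050, "Retweets": 843, "Replies": 92},
    {"Likes": 504, "Retweets": 362, "Replies": 36},
]
g = popularity_scores(tweets)
print(total_scores([2.329264, 2.329264], g))
```

With equal cosine scores, the tweet with more likes, retweets and replies ranks first, which matches the ordering in the output above.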


This is the code to answer RQ1b:

In [None]:
pd.set_option('display.max_colwidth', None)  # None = do not truncate cell contents (-1 is deprecated)
#chosen queries
queries=[]
queries.append("joe biden won elections")
queries.append("donald trump is the president")
queries.append("elections are a fraud")
queries.append("pennsylvania")
queries.append("trump out")
queries.append("votes fraud")
queries.append("i voted")
queries.append("georgia votes")
queries.append("trump team")
queries.append("biden team")

#NOTE: to open the tsv correctly use UTF-8
# Remove the output of a previous run, if there is one
try:
    os.remove('other-outputs/RQ1b.tsv')
except FileNotFoundError:
    pass

search_engine.ranking_system.change_user_output(1) #Using TF-IDF + cosine_similarity
# Setting the header of the TSV file (each query line is written inside the loop,
# so it is not written here to avoid duplicating the first query)
with open('other-outputs/RQ1b.tsv', 'a+') as RQ1:
    RQ1.write("\tTweet\tUsername\tDate\tHashtags\tLikes\tRetweets\tReplies\tUrl\tScore\n")

# Storing each result of each query in the TSV file
for query in queries:
    RQ1 = open('other-outputs/RQ1b.tsv', 'a+')
    RQ1.write(f"QUERY\t{query}\n")
    RQ1.close()
    print(f"\nQUERY: {query.upper()}\n")
    results=search_engine.run(query).query("score > 0").head(20)
    display(results)
    results.replace('\n',' ', regex=True).to_csv(path_or_buf='other-outputs/RQ1b.tsv', sep='\t', header=False, mode = 'a')
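
To make the ranking step concrete, here is a minimal, self-contained sketch of TF-IDF + cosine similarity on a toy collection. It is not the code in `search_engine`: the whitespace tokenizer, the smoothed IDF formula and the function names `tfidf_vectors`, `cosine` and `rank` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs, top_k=10):
    """Return (score, doc_index) pairs with score > 0, best first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection cannot match anything.
    q_tokens = [t for t in query.lower().split() if t in idf]
    q_tf = Counter(q_tokens)
    q_vec = {t: (c / len(q_tokens)) * idf[t] for t, c in q_tf.items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


docs = ["joe biden won the election", "trump team files lawsuit", "i voted today"]
print(rank("biden team", docs))
```

Filtering on `score > 0` mirrors the `.query("score > 0")` call in the cell above: documents that share no term with the query are dropped.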
    

This is the code to answer RQ1c:

In [None]:
search_engine.ranking_system.change_user_output(2) #Using Word2Vec + cosine_similarity
#chosen queries
queries=[]
queries.append("joe biden won elections")
queries.append("donald trump is the president")
queries.append("elections are a fraud")
queries.append("pennsylvania")
queries.append("trump out")
queries.append("votes fraud")
queries.append("i voted")
queries.append("georgia votes")
queries.append("trump team")
queries.append("biden team")

#NOTE: to open the tsv correctly use UTF-8
# Remove the output of a previous run, if there is one
try:
    os.remove('other-outputs/RQ1c.tsv')
except FileNotFoundError:
    pass

# Setting the header of the TSV file (each query line is written inside the loop,
# so it is not written here to avoid duplicating the first query)
with open('other-outputs/RQ1c.tsv', 'a+') as RQ1:
    RQ1.write("\tTweet\tUsername\tDate\tHashtags\tLikes\tRetweets\tReplies\tUrl\tScore\n")

# Storing each result of each query in the TSV file
for query in queries:
    RQ1 = open('other-outputs/RQ1c.tsv', 'a+')
    RQ1.write(f"QUERY\t{query}\n")
    RQ1.close()
    print(f"\nQUERY: {query.upper()}\n")
    results=search_engine.run(query).query("score > 0").head(20)
    display(results)
    results.replace('\n',' ', regex=True).to_csv(path_or_buf='other-outputs/RQ1c.tsv', sep='\t', header=False, mode = 'a')
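
The Word2Vec ranking can be illustrated in the same spirit. The sketch below represents a text as the mean of its word vectors and ranks tweets by cosine similarity to the query. The 2-dimensional `TOY_VECTORS` table stands in for a trained model and is purely hypothetical, as are the helper names; the real embeddings come from `search_engine.ranking_system.w2v`.

```python
import math

# Toy 2-dimensional embeddings standing in for a trained Word2Vec model
# (hypothetical values, for illustration only).
TOY_VECTORS = {
    "biden": [1.0, 0.0],
    "joe": [0.9, 0.1],
    "trump": [0.0, 1.0],
    "donald": [0.1, 0.9],
}


def mean_vector(text, vectors):
    """Average the embeddings of the known words; None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, tweets, vectors, top_k=10):
    """Return (score, tweet_index) pairs, most similar first."""
    q = mean_vector(query, vectors)
    if q is None:
        return []
    scored = []
    for i, tweet in enumerate(tweets):
        t = mean_vector(tweet, vectors)
        if t is not None:  # tweets with no known word cannot be embedded
            scored.append((cosine(q, t), i))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


tweets = ["joe biden", "donald trump", "hello world"]
print(rank("biden", tweets, TOY_VECTORS))
```

Unlike TF-IDF, a tweet can score above zero without sharing any word with the query, because similarity is measured between embeddings rather than between term sets.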

In [None]:
#Can you imagine a better representation than word2vec? Justify your answer.
#(HINT - what about Doc2vec? Sentence2vec? Which are the pros and cons?)

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

In [None]:
# This function runs K-means for several values of k to help choose the number of clusters, collects the words
# of each tweet, and returns the tweet embeddings together with those word counts
def KMeans_setup(model):
    total_tokens=[]
    tweets_words = []
    # Embed each tweet using word2vec and collect the words of the tweet
    for tweet in search_engine.tweets["text"]:
        tweet_words = {}
        tokens = []
        # For each word of the tweet, look up its embedding and count the word in the tweet's dictionary.
        # Words that are not in the model's vocabulary are skipped.
        for word in tweet.split():
            try:
                tokens.append(model[word])
            except KeyError:
                continue
            tweet_words[word] = tweet_words.get(word, 0) + 1
        # Skip tweets with no known words: they have no embedding, and skipping them here keeps
        # total_tokens and tweets_words aligned index by index, so no mapping function is needed.
        if not tokens:
            continue
        # The mean of the word embeddings represents the tweet (Tweet2Vec)
        total_tokens.append(np.mean(np.array(tokens), axis=0))
        tweets_words.append(tweet_words)
    
    # Plot the sum of squared distances as a function of the number of clusters
    K = range(1,15)
    Sum_of_squared_distances = []
    for k in K:
        km = KMeans(n_clusters=k)
        km = km.fit(total_tokens)
        Sum_of_squared_distances.append(km.inertia_)
    
    plt.plot(K, Sum_of_squared_distances, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Sum_of_squared_distances')
    plt.title('Elbow Method For Optimal k')
    plt.show()
    return total_tokens, tweets_words

In [None]:
# This function draws the T-SNE plot. It takes as inputs the outputs of the function KMeans_setup.
# The number of clusters has been set according to the plot of KMeans_setup.
def tsne_plot(total_tokens, tweets_words):
    NUM_CLUSTERS = 3
    # T-SNE setup
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=5000, random_state=0)
    new_values = tsne_model.fit_transform(total_tokens)
    
    # K-means setup with the specified number of clusters
    kmeans = KMeans(n_clusters=NUM_CLUSTERS)
    kmeans = kmeans.fit(total_tokens)
    # getting the labels to do the coloring of each node (tweet.)
    labels = kmeans.predict(total_tokens)
    ColorsA=plt.cm.viridis(np.linspace(0, 1, NUM_CLUSTERS),alpha=0.8)
    
    # Getting the results of the T-SNE model by getting the coordinates of x,y of each tweet
    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    clusters_common_words = []
    
    # Coloring each tweet node according to the cluster it belongs to. We use the variable "labels" to do that
    plt.figure(figsize=(10,10))
    # For each cluster...
    for i in range(NUM_CLUSTERS):
        xL=[]
        yL=[]
        cluster_words = {}
        # ... and for every tweet...
        for k in range(len(x)):
            # ...if the tweet belongs to the cluster get the coordinates, and store the words in a dictionary
            if labels[k]==i:
                xL.append(x[k])
                yL.append(y[k])
                # collecting and counting the words of the cluster
                # (the first occurrence adds the tweet's full count, not just 1)
                for word, count in tweets_words[k].items():
                    cluster_words[word] = cluster_words.get(word, 0) + count
        # Associate all the words collected to a cluster
        clusters_common_words.append(cluster_words)
        plt.scatter(xL,yL,color=ColorsA[i])
    
    # Extra for loop just to show most common words of each cluster
    for i in range(NUM_CLUSTERS):
        print(dict(sorted(clusters_common_words[i].items(), key=lambda x: x[1], reverse=True)[:5]))

    plt.show()


The following plot shows the optimal number of clusters for our dataset. In this case, however, it is not very clear, because the "elbow" is not well defined. We think this is because the tweets are very similar to each other, which leads us to have at most 2 clusters.

In [None]:
model = search_engine.ranking_system.w2v
total_tokens, tweets_words = KMeans_setup(model)
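
When the elbow is hard to read off the plot, one possible heuristic is to pick the k where the curve bends most sharply, measured by the discrete second difference of the inertia values. This is a sketch of that heuristic, not the method used in this notebook: `KMeans_setup` as written does not return the inertia list, the function name `elbow_by_second_difference` is hypothetical, and the numbers below are invented for illustration rather than taken from our dataset.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve bends the most.

    The bend at position i is measured by the second difference
    inertias[i-1] - 2 * inertias[i] + inertias[i+1]; the first and last
    k have no neighbours on both sides and are never selected.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Illustrative values only (not results from the tweets collection).
ks = [1, 2, 3, 4, 5, 6]
inertias = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
print(elbow_by_second_difference(ks, inertias))
```

On a nearly straight curve the second differences are all small and close to each other, so the selected k is not very meaningful; that is the same situation in which the elbow is hard to see by eye.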

In [None]:
tsne_plot(total_tokens, tweets_words)
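
The per-cluster word counting inside `tsne_plot` can also be written with `collections.Counter`. The sketch below assumes the same inputs as in the notebook: one cluster label per tweet and one word-count dictionary per tweet, in the same order. The helper name `top_words_per_cluster` is hypothetical.

```python
from collections import Counter


def top_words_per_cluster(labels, tweets_words, num_clusters, n=5):
    """Return the n most common (word, count) pairs of each cluster."""
    totals = [Counter() for _ in range(num_clusters)]
    for label, word_counts in zip(labels, tweets_words):
        # Counter.update adds the counts of the tweet to the cluster total.
        totals[label].update(word_counts)
    return [counter.most_common(n) for counter in totals]


# Toy input: three tweets assigned to two clusters.
labels = [0, 1, 0]
tweets_words = [{"biden": 2, "vote": 1}, {"trump": 1}, {"biden": 1}]
for cluster, words in enumerate(top_words_per_cluster(labels, tweets_words, 2)):
    print(cluster, words)
```

Because `Counter.update` adds counts rather than replacing them, a word's first occurrence in a cluster contributes its full per-tweet count.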