# Semantic search over social media posts (tweets)

This notebook implements semantic search over a collection of social media posts (for example, tweets) and returns the posts most similar to a search query. It represents the query and every post with pretrained word embeddings, ranks the posts by their cosine similarity to the query, and returns the top results.

The script is divided into two main sections:

__1. Getting the Document Embeddings:__ This section processes tweets and converts them into numerical embeddings.

__2. Looking Up Tweets:__ Here, we use cosine similarity to find tweets that are most similar to the user-provided search term.

We use the Twitter samples corpus that ships with the NLTK library. You can download it as follows:
```
!pip install nltk
import nltk
nltk.download('twitter_samples')
```


In [111]:
import pdb
import pickle
import string

import time

import nltk
import numpy as np
from nltk.corpus import stopwords, twitter_samples

from utils import (cosine_similarity, get_dict,
                   process_tweet)
from os import getcwd

import w4_unittest

In [112]:
# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path
filePath = f"{getcwd()}/tmp2/"
nltk.data.path.append(filePath)

In [113]:
# load the raw tweet texts from the NLTK twitter_samples corpus
all_tweets = twitter_samples.strings('tweets.20150430-223406.json')



## 1 - Getting the Document Embeddings

###  The Word Embeddings Data for English Words

The full dataset of English embeddings is about 3.64 gigabytes. To prevent the workspace from
crashing, we've extracted a subset of the embeddings for the words that you'll
use in this tutorial.


In [114]:
with open("./data/en_embeddings.p", "rb") as f:
    en_embeddings_subset = pickle.load(f)



### Bag-of-words (BOW) Document Models
Text documents are sequences of words.
* The ordering of words makes a difference. For example, the sentences "Apple pie is
better than pepperoni pizza." and "Pepperoni pizza is better than apple pie."
have opposite meanings because of the word order.
* However, for some applications, ignoring the order of words lets
us train an efficient and still effective model.
* This approach is called the bag-of-words document model.
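
To make the point about word order concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and its very simple tokenization (lowercasing, dropping periods, splitting on whitespace) are assumptions made for illustration; they are not the `process_tweet` function used later in this notebook.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase the sentence, drop periods, split on whitespace, and count the words."""
    tokens = sentence.lower().replace(".", "").split()
    return Counter(tokens)

a = bag_of_words("Apple pie is better than pepperoni pizza.")
b = bag_of_words("Pepperoni pizza is better than apple pie")

# The two sentences mean opposite things, yet their word counts are identical,
# so a bag-of-words model cannot tell them apart.
print(a == b)
```

This is the trade-off the bullets describe: the representation is cheap to compute, but it discards any information carried by word order.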

### Document Embeddings
* A document embedding is created by summing the embeddings of all words
in the document.
* If we don't know the embedding of a word, we can ignore that word.
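
The two rules above can be sketched on a toy vocabulary. The 3-dimensional vectors, the dictionary `toy_embeddings`, and the helper `sum_embeddings` are hypothetical and exist only for this illustration; the real embeddings in this notebook have 300 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for a two-word vocabulary.
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.5, 1.0, 0.0]),
}

def sum_embeddings(words, embeddings, dim=3):
    """Sum the embeddings of the known words; unknown words are skipped."""
    total = np.zeros(dim)
    for word in words:
        vector = embeddings.get(word)
        if vector is not None:
            total += vector
    return total

# "zzz" is not in the vocabulary, so it contributes nothing to the sum.
doc_vec = sum_embeddings(["good", "movie", "zzz"], toy_embeddings)
print(doc_vec)  # components are 1.5, 1.0 and 2.0
```

Skipping an unknown word has the same effect as adding a zero vector for it, which is how the lookup with a default value is written in the function below.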

<a name="1-1-4"></a>

### Function 'get_document_embedding'
* The function `get_document_embedding()` encodes an entire document as a single document embedding.
* It takes in a document (as a string) and a dictionary of word embeddings, `en_embeddings`.
* It processes the document and looks up the embedding of each resulting word.
* It then returns the sum of the word vectors of the processed tweet.

In [115]:
def get_document_embedding(tweet, en_embeddings, process_tweet=process_tweet):
    '''
    Input:
        - tweet: a string
        - en_embeddings: a dictionary of word embeddings
    Output:
        - doc_embedding: sum of all word embeddings in the tweet
    '''
    doc_embedding = np.zeros(300)

    # process the document into a list of words (process the tweet)
    processed_doc = process_tweet(tweet)
    for word in processed_doc:
        # add the word embedding to the running total for the document embedding
        doc_embedding += en_embeddings.get(word, np.zeros(300))

    return doc_embedding

<a name="1-1-5"></a>
### Function 'get_document_vecs'

#### Store all document vectors into a dictionary
Now, let's store all the tweet embeddings in a dictionary and in a matrix.
The function `get_document_vecs()` does both.

In [116]:
def get_document_vecs(all_docs, en_embeddings, get_document_embedding=get_document_embedding):
    '''
    Input:
        - all_docs: list of strings - all tweets in our dataset.
        - en_embeddings: dictionary with words as the keys and their embeddings as the values.
    Output:
        - document_vec_matrix: matrix of tweet embeddings.
        - ind2Doc_dict: dictionary with indices of tweets in vecs as keys and their embeddings as the values.
    '''

    # the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}

    # this is the list that will store the document vectors
    document_vec_l = []

    for i, doc in enumerate(all_docs):

        # get the document embedding of the tweet
        doc_embedding = get_document_embedding(doc, en_embeddings)

        # save the document embedding into the ind2Doc_dict dictionary at index i
        ind2Doc_dict[i] = doc_embedding

        # append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)


    # convert the list of document vectors into a 2D array (each row is a document vector)
    document_vec_matrix = np.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict

In [117]:
document_vecs, ind2Tweet = get_document_vecs(all_tweets, en_embeddings_subset)


In [118]:
print(f"length of dictionary {len(ind2Tweet)}")
print(f"shape of document_vecs {document_vecs.shape}")

length of dictionary 20000
shape of document_vecs (20000, 300)



## 2 - Looking up the Tweets

Now you have a matrix of shape (m, d), where `m` is the number of tweets
(20,000) and `d` is the dimension of the embeddings (300). Next, you
will input a tweet and use cosine similarity to find the tweets in our
corpus that are most similar to it.
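
The `cosine_similarity` helper imported from `utils` is not shown in this notebook. As a rough sketch of what comparing one query vector against every row of a document matrix can look like, here is a self-contained NumPy version. The name `rowwise_cosine_similarity`, the `eps` threshold, and the choice to return 0 for all-zero rows are assumptions for illustration, not the `utils` implementation.

```python
import numpy as np

def rowwise_cosine_similarity(matrix, query, eps=1e-12):
    """Cosine similarity between each row of `matrix` (m, d) and `query` (d,)."""
    dots = matrix @ query                                            # shape (m,)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)   # shape (m,)
    # Rows with (near) zero norm, e.g. documents with no known words,
    # get similarity 0 here instead of NaN (an assumption of this sketch).
    return np.where(norms > eps, dots / np.maximum(norms, eps), 0.0)

docs = np.array([[1.0, 0.0],
                 [0.0, 2.0],
                 [3.0, 3.0],
                 [0.0, 0.0]])
query = np.array([1.0, 0.0])

sims = rowwise_cosine_similarity(docs, query)
print(sims)  # approximately 1.0, 0.0, 0.7071 and 0.0
```

Because cosine similarity divides by the vector norms, a long tweet and a short tweet pointing in the same direction score the same; only the direction of the summed embedding matters.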

In [122]:
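my_tweet = 'I am sad'
# get_document_embedding() calls process_tweet() itself, so no separate preprocessing call is needed
tweet_embedding = get_document_embedding(my_tweet, en_embeddings_subset)
# similarity between the query embedding and every document embedding
cosine_similarities = cosine_similarity(document_vecs, tweet_embedding)

# number of tweets to return
N = 3
# indices of the N highest similarities, from most to least similar
top_indices = np.argsort(cosine_similarities)[-N:][::-1]
# display the top N most similar tweets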
print("Top similar tweets:")
for idx in top_indices:
    print(f"Tweet: {all_tweets[idx]} (Similarity: {cosine_similarities[idx]})")

Top similar tweets:
Tweet: @hugorifkind Audience - good. Mili - bad. Clegg - a bit sad. Cam - unscathed (Similarity: 0.7288986447264361)
Tweet: @bootheghost @gaarbage oh UKIP *playful tutting* sad thing is that they're popular though jesus christ (Similarity: 0.6744912436671723)
Tweet: @Bad_Braminski Why do you think it's sad? I'm curious as one of those who wanted to go it alone. And for this election, my vote for SNP isnt (Similarity: 0.6620227394305912)
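
The ranking step in the cell above relies on the idiom `np.argsort(...)[-N:][::-1]`. The following sketch walks through it on a small made-up score array; the helper name `top_n_indices` and the scores are hypothetical.

```python
import numpy as np

def top_n_indices(scores, n):
    """Return the indices of the n largest scores, from largest to smallest."""
    order = np.argsort(scores)   # indices that sort the scores in ascending order
    return order[-n:][::-1]      # keep the last n, then reverse to descending order

scores = np.array([0.10, 0.73, 0.05, 0.67, 0.66])
top = top_n_indices(scores, 3)
print(top)  # indices 1, 3 and 4, i.e. scores 0.73, 0.67 and 0.66
```

Sorting all scores is simple and fast enough for a corpus of this size; for much larger collections, a partial selection such as `np.argpartition` may be worth considering.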
