# Semantic search over social media posts (tweets)

This notebook implements semantic search over a collection of social media posts (e.g., tweets) and returns the posts most similar to a search query. It represents posts and queries with pretrained word embeddings, ranks the posts by their cosine similarity to the query text, and returns the top-ranked results.

The script is divided into two main sections:

__1. Getting the Document Embeddings:__ This section processes social media posts and converts them into numerical embeddings.

__2. Looking Up Posts:__ Here, we use cosine similarity to find the social media posts that are most similar to the user-provided search query.

*Utility functions for data loading, preprocessing, and tokenization are in the `utils.py` file.*

We use the Twitter samples from the NLTK library to demonstrate this method. You can download them as follows:

```
import nltk
nltk.download('twitter_samples')
```


In [6]:
import nltk
nltk.download('twitter_samples')

In [7]:
# import internal (utils.py) and external resources
from utils import (load_data, clean_posts, tokenize_posts, 
cosine_similarity, read_configurations, write_output)
import numpy as np
import json

In [8]:
# Seed the random number generators so that results are reproducible across runs
import random
random.seed(13)
np.random.seed(13)

In [9]:
configurations = read_configurations("/config.json")
configurations

{'ifpreprocess': True,
 'top-k': 5,
 'input_query_filepath': '/data/input_queries.txt',
 'output_filepath': '/data/output.json',
 'posts_filepath': '/corpora/tweets.20150430-223406.json'}

In [10]:
# Read input queries and posts
ls_input_queries, ls_posts = load_data(configurations['input_query_filepath'],
                                       configurations['posts_filepath'])
ls_input_queries[:3], ls_posts[:3]

(['social media', 'women', 'election'],
 ['hopeless for tmr :(',
  "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
  '@Hegelbon That heart sliding into the waste basket. :('])


## 1 - Getting the Document Embeddings

###  The Word Embeddings Data for English Words

The full dataset of English embeddings is about 3.64 gigabytes. To prevent the workspace from
crashing, we've extracted the subset of embeddings for the words that you'll
use in this tutorial.


In [11]:
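import pickle

# Load the pre-extracted subset of English word embeddings (word -> vector);
# the context manager closes the file after loading
with open("./embeddings/en_embeddings.p", "rb") as f:
    en_embeddings = pickle.load(f)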



### Bag-of-words (BOW) Document Models
Text documents are sequences of words.
* The ordering of words makes a difference. For example, sentences "Apple pie is
better than pepperoni pizza." and "Pepperoni pizza is better than apple pie"
have opposite meanings due to the word ordering.
* However, for some applications, ignoring the order of words allows
us to train an efficient and still effective model. *In this method, we average the word vectors in a post, i.e., we lose their position-related information.*
* This approach is called the bag-of-words document model.
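
To make the loss of ordering concrete, here is a minimal sketch using the two example sentences above. The helper `bag_of_words` is a hypothetical name introduced only for this illustration; it is not part of `utils.py`.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, strip full stops, and count tokens (word order is discarded)."""
    tokens = sentence.lower().replace(".", "").split()
    return Counter(tokens)

a = bag_of_words("Apple pie is better than pepperoni pizza.")
b = bag_of_words("Pepperoni pizza is better than apple pie")

# Both sentences map to the same bag even though their meanings are opposite.
print(a == b)  # True
```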

### Document Embeddings
* A document embedding is created by averaging the embeddings of all words
in the document.
* If we don't know the embedding of some word, we can ignore that word.
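
The idea can be sketched with toy 3-dimensional vectors. The values in `toy_embeddings` and the helper `average_embedding` are assumptions made up for this illustration. Note that this sketch divides by the number of known tokens, whereas the notebook's `compute_doc_embedding` adds a zero vector for unknown tokens and divides by the total token count.

```python
import numpy as np

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # no known tokens: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "xyzzy" has no embedding, so only "good" and "day" are averaged.
vec = average_embedding(["good", "day", "xyzzy"], toy_embeddings)
print(vec)  # [2. 1. 1.]
```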

In [12]:
# Preprocess social media posts by removing URLs, hashtags, stickers and other unwanted patterns.
# Fall back to the raw posts when preprocessing is disabled, so that `posts` is always defined.
if configurations["ifpreprocess"]:
    posts = clean_posts(ls_posts)
else:
    posts = ls_posts

# Tokenize, stem and return clean tokens
tokenized_posts = tokenize_posts(posts)
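
The implementations of `clean_posts` and `tokenize_posts` live in `utils.py` and are not shown here. As a rough sketch of what such preprocessing might look like, the hypothetical helpers below remove URLs, @mentions and hashtags with regular expressions and then split the text into lowercase alphabetic tokens. They skip stemming, and the actual utilities may behave differently.

```python
import re

def clean_post_sketch(text):
    """Remove URLs, @mentions and hashtags, then collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop @mentions
    text = re.sub(r"#\w+", "", text)          # drop hashtags
    return re.sub(r"\s+", " ", text).strip()

def tokenize_post_sketch(text):
    """Lowercase the text and keep alphabetic tokens only (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())

# Sample tweet from the corpus, with a URL and hashtag appended for illustration.
raw = "@Hegelbon That heart sliding into the waste basket. :( http://example.com #sad"
tokens = tokenize_post_sketch(clean_post_sketch(raw))
print(tokens)
# ['that', 'heart', 'sliding', 'into', 'the', 'waste', 'basket']
```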

<a name="1-1-4"></a>

### Function 'compute_doc_embedding'
* The function `compute_doc_embedding()` encodes an entire document as a single "document" embedding.
* It takes a tokenized document (a list of tokens) and looks up each token in the `en_embeddings` dictionary.
* Tokens without a known embedding contribute a zero vector.
* It then averages the word vectors and returns the result as the embedding of that processed post.

In [13]:
# Compute the document embedding vector, i.e., the average of all its word embeddings
def compute_doc_embedding(post):
    doc_embedding = np.zeros(300)
    # an empty post has no tokens to average, so return the zero vector
    if len(post) == 0:
        return doc_embedding
    for token in post:
        # add the word embedding to the running total; unknown tokens add a zero vector
        doc_embedding += en_embeddings.get(token, np.zeros(300))
    # divide once, after the loop, to turn the sum into an average
    return doc_embedding / len(post)

# This method reads tokenized social media posts and returns their embedded vectors,
# i.e., for each post, the average of the embedded vectors of all its words
def vectorize_posts(posts, en_embeddings):
    '''
    Input:
        - posts: a list of tokenized posts (each post is a list of tokens)
        - en_embeddings: a dictionary of word embeddings
    Output:
        - document_vec_matrix: 2D array with one averaged document embedding per row
        - ind2Doc_dict: dictionary mapping each post index to its document embedding
    '''

    # the dictionary's key is an index (integer) that identifies a specific post
    # the value is the document embedding for that post
    ind2Doc_dict = {}

    # this list will store the document vectors
    document_vec_l = []

    for i, post in enumerate(posts):
        # note: compute_doc_embedding reads the module-level en_embeddings dictionary
        doc_embedding = compute_doc_embedding(post)

        # save the document embedding into the ind2Doc_dict dictionary at index i
        ind2Doc_dict[i] = doc_embedding
        # append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)

    # convert the list of document vectors into a 2D array (each row is a document vector);
    # stacking once after the loop avoids rebuilding the matrix for every post
    document_vec_matrix = np.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict

<a name="1-1-5"></a>
### Function 'vectorize_posts'

#### Store all document vectors into a dictionary
Now, let's store all the post embeddings in a dictionary and a matrix by calling `vectorize_posts()`.

The following cell computes the embedding matrix (posts x vector), in which each post is represented by a 300-dimensional vector. It may take *5-10 minutes* for *20,000 posts* on a regular PC.

In [None]:
# Convert tokenized documents into their 300-dimensional embeddings
# The word vectors are averaged for each document

posts_vec_matrix, ind2Doc_dict = vectorize_posts(tokenized_posts, en_embeddings)



In [11]:
# ind2Doc dictionary and matrix of posts vectors generated

print(f"length of dictionary {len(ind2Doc_dict)}")
print(f"shape of document_vecs {posts_vec_matrix.shape}")

length of dictionary 5000
shape of document_vecs (5000, 300)



## 2 - Looking up the Posts

Now you have a matrix of shape (m, d), where `m` is the number of posts
(5,000 in the output shown above) and `d` is the dimension of the embeddings (300).

Next, we compute the cosine similarity between each query embedding and every row of the posts vector matrix.
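
The `cosine_similarity` helper is imported from `utils.py`, whose code is not shown in this notebook. A minimal sketch of such a function, assuming it takes a 2D matrix and a 1D query vector, could look like the following. The name `cosine_similarity_sketch` and the `eps` guard are assumptions for this illustration.

```python
import numpy as np

def cosine_similarity_sketch(matrix, vector, eps=1e-12):
    """Cosine similarity between each row of `matrix` and `vector`."""
    dots = matrix @ vector
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    # eps avoids division by zero for all-zero rows (e.g. posts with no known words)
    return dots / np.maximum(norms, eps)

docs = np.array([[1.0, 0.0],   # same direction as the query
                 [0.0, 1.0],   # orthogonal to the query
                 [1.0, 1.0],   # 45 degrees from the query
                 [0.0, 0.0]])  # zero vector
query = np.array([1.0, 0.0])

scores = cosine_similarity_sketch(docs, query)
print(np.round(scores, 3))  # roughly 1, 0, 0.707 and 0
```

Guarding the denominator matters here because a post whose tokens are all missing from `en_embeddings` ends up with an all-zero embedding.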

In [12]:
# preprocess queries.

if configurations["ifpreprocess"]:
    ls_input_queries = clean_posts(ls_input_queries)

# Tokenize, stem and return clean tokens
tokenized_queries = tokenize_posts(ls_input_queries)
tokenized_queries

[['social', 'media'], ['women'], ['election']]

In [13]:
# Get the top-k most similar posts for each query

query_posts_similarities = {}
for query in tokenized_queries:
    query_embedding = compute_doc_embedding(query)
    cosine_score = cosine_similarity(posts_vec_matrix, query_embedding)
    # argsort is ascending, so take the last k indices and reverse them (best match first)
    top_indices = np.argsort(cosine_score)[-configurations["top-k"]:][::-1]
    query_str = ' '.join(query)
    top_posts = []
    for idx in top_indices:
        top_posts.append({'post ID': str(idx),
                          'post text': ls_posts[idx],
                          "sim score": str(cosine_score[idx])})
    query_posts_similarities[query_str] = top_posts
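
The top-k selection in the cell above relies on `np.argsort`, which sorts in ascending order. The small example below, with made-up scores, shows how slicing the last `k` indices and reversing them yields the best matches first.

```python
import numpy as np

# Hypothetical similarity scores for five posts.
scores = np.array([0.10, 0.90, 0.35, 0.70, 0.05])
k = 3

# argsort is ascending: take the last k indices, then reverse for best-first order
top_indices = np.argsort(scores)[-k:][::-1]
print(top_indices.tolist())  # [1, 3, 2]
```

If `k` exceeds the number of posts, the slice simply returns all indices, ranked from most to least similar.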

In [15]:
#json.dumps(query_posts_similarities)

In [16]:
# Write output in json format
write_output(configurations['output_filepath'], json.dumps(query_posts_similarities))