# Semantic search over social media posts (tweets)

This method performs a semantic search on a collection of social media posts, such as tweets, and identifies the posts most similar to a given search query. It leverages a pretrained language model (embeddings) to calculate semantic similarity using cosine similarity and provides ranked results based on relevance.

The script is divided into these sections:

__1. Environment Setup and Dependencies:__ This section imports all necessary utility functions and configurations.

__2. Data Loading and Configuration:__  Here, the input query is set, and the dataset (social media posts) is loaded for the search process.

__3. Getting the Document Embeddings:__
 This section processes the social media posts and transforms them into numerical embeddings.

__4. Semantic Search through the Posts:__ Using cosine similarity, the script identifies posts most similar to the user-provided search query.

__5. Output:__ The search results are displayed and saved in a JSON file for further analysis.




 ## 1. Environment Setup and Dependencies
*Some utility functionalities regarding data loading, preprocessing, and tokenization are in the `utils.py` file.*

We use Twitter samples downloaded from the NLTK library to demonstrate this method. You can download it here:

In [1]:
import nltk
#nltk.download('twitter_samples')

Now we import import internal (utils.py) and external resources.

In [2]:
# import internal (utils.py) and external resources
from utils import (load_data, clean_posts, tokenize_posts, 
cosine_similarity, read_configurations, write_output)
import numpy as np
import json

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/shyam/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


We predefine random seeds to ensure reproducibility of results across executions.

In [3]:
# For predictable random numbers in reuse
import random
random.seed(13)
np.random.seed(13)

## 2. Data Loading and Configuration
We now load the configurations from `/config.json`. This file defines the paths for the dataset and input data, the location for saving output results, and additional parameters to customize the method's behavior.

In [4]:
configurations = read_configurations("/config.json")
print("configurations: ", configurations)

configurations:  {'ifpreprocess': True, 'top-k': 5, 'input_query_filepath': '/data/input_queries.txt', 'output_filepath': '/data/output.json', 'posts_filepath': '/corpora/tweets.20150430-223406.json'}


Now, you can set your own input query directly. If you leave the `ls_input_queries` variable empty, the method will read the query from the input file defined in the configuration. The posts will be loaded from the file specified in the configuration.

In [36]:
# Read input queries
ls_input_queries, ls_posts = load_data(configurations['input_query_filepath'], configurations['posts_filepath']) 
user_input = input("Provide your input query for search: ")

if user_input != "":
    ls_input_queries = []
    ls_input_queries.append(user_input)



## 3. Getting the Document Embeddings

###  The Word Embeddings Data for English Words

The full dataset for English embeddings is about 3.64 gigabytes. To prevent the workspace from
crashing, we've extracted a subset of the embeddings for the words that you'll
use in this Tutorial.


In [37]:
import pickle
en_embeddings = pickle.load(open("./embeddings/en_embeddings.p", "rb"))



### Bag-of-words (BOW) Document Models
Text documents are sequences of words.
* The ordering of words makes a difference. For example, sentences "Apple pie is
better than pepperoni pizza." and "Pepperoni pizza is better than apple pie"
have opposite meanings due to the word ordering.
* However, for some applications, ignoring the order of words can allow
us to train an efficient and still effective model. *In this method, we are averaging the word vectors in a post i.e., losing their position related information*
* This approach is called Bag-of-words document model.

### Document Embeddings
* Document embedding is created by summing up the embeddings of all words
in the document.
* If we don't know the embedding of some word, we can ignore that word.

In [38]:
# preprocess social media posts by removing urls, hashtags, stickers and other unwanted patterns.

if configurations["ifpreprocess"]:
    posts = clean_posts(ls_posts)

# Tokenize, stem and return clean tokens
tokenized_posts = tokenize_posts(posts)

<a name="1-1-4"></a>

### Function 'get_document_embedding'
* The function `get_document_embedding()` encodes entire document as a "document" embedding.
* It takes in a document (as a string) and a dictionary, `en_embeddings`
* It processes the document, and looks up the corresponding embedding of each word.
* It then sums them up and returns the sum of all word vectors of that processed tweet.

In [39]:
#Compute doc embedding vector i.e., average of all its word embeddings
def compute_doc_embedding(post):
    doc_embedding = np.zeros(300)
    for token in post:
        # add the word embedding to the running total for the document embedding
        doc_embedding += en_embeddings.get(token, np.zeros(300))
        doc_embedding = np.divide(doc_embedding, len(post))
    return doc_embedding

# This method reads a social media posts anc returns its embedded vector i.e., the average of embedded vectors of all its words
def vectorize_posts(posts, en_embeddings):
    '''
    Input:
        - tweet: a string
        - en_embeddings: a dictionary of word embeddings
    Output:
        - doc_embedding: sum of all word embeddings in the tweet
    '''

    # the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}

    # this is list that will store the document vectors
    document_vec_l = []
    
    posts_embeddings = []
    i = 0
    for post in posts:
        doc_embedding = compute_doc_embedding(post)

        # save the document embedding into the ind2Tweet dictionary at index i
        ind2Doc_dict[i] = doc_embedding
        i += 1
        # append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)

        # convert the list of document vectors into a 2D array (each row is a document vector)
        document_vec_matrix = np.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict

<a name="1-1-5"></a>
### Function `'Vectorize_posts'`
Now, let's store all the posts embeddings into a dictionary.
Implement `vectorize_posts()`

The following cell computes embedding (Posts x vector) matirx having each post represented with the standard 300 size vector. It may take *5-10mins* for *20,000 posts* on regular PC

In [40]:
# Convert tokenized documents into their 300vector embeddings
# The word vectors are averaged for each document

posts_vec_matrix, ind2Doc_dict = vectorize_posts(tokenized_posts,en_embeddings)



In [41]:
# ind2Doc dictionary and matrix of posts vectors generated

print(f"length of dictionary {len(ind2Doc_dict)}")
print(f"shape of document_vecs {posts_vec_matrix.shape}")

length of dictionary 500
shape of document_vecs (500, 300)



## 4 - Semantic Search through the Posts

Now you have a vector of dimension (m,d) where `m` is the number of posts
(20,000) and `d` is the dimension of the embeddings (300).  

Now we will calculate post similarities for the query inputs using cosine similarity over the entire posts vector matix

In [42]:
# preprocess queries.

if False: #configurations["ifpreprocess"]:
    ls_input_queries = clean_posts(ls_input_queries)

# Tokenize, stem and return clean tokens
if False:
    tokenized_queries = tokenize_posts(ls_input_queries)


In [43]:
#Get top K similar posts for all query items

query_posts_similarities = {}
for query in ls_input_queries:
    query = query.lower()
    query = query.split(' ')
    query_embedding = compute_doc_embedding(query)
    cosine_score = cosine_similarity(posts_vec_matrix, query_embedding)
    top_indices = np.argsort(cosine_score)[-configurations["top-k"]:][::-1]
    query_str = ' '.join(query)
    query_posts_similarities[query_str] = []
    top_posts = []
    for idx in top_indices:
        top_posts.append({'post ID': str(idx),
                          'post text': ls_posts[idx], 
                          "sim score": str(cosine_score[idx])})
    query_posts_similarities[query_str] = top_posts

In [44]:
#json.dumps(query_posts_similarities)

## 5. Output
Now we save output in json format and you can read the results:

In [45]:
write_output(configurations['output_filepath'], json.dumps(query_posts_similarities))

In [46]:
query_posts_similarities

{'social norms': [{'post ID': '471',
   'post text': 'Valentine et al (2009) found relationships between homo/biphobic comments &amp; certain disciplines- incl. European languages, lit, education :(',
   'sim score': '0.36462688735808163'},
  {'post ID': '340',
   'post text': "Valentine et al found r'ships btwn homo/biphobic comments &amp; certain disciplines - incl. European langs, lit, education :( #fresherstofinals",
   'sim score': '0.3646158363561469'},
  {'post ID': '436',
   'post text': 'Last time I was here, was a funeral and a again funeral. Modimo ho tseba wena fela. :( — feeling emotional at... http://t.co/mQYsswdot7',
   'sim score': '0.33736974710126716'},
  {'post ID': '361',
   'post text': 'facebook, y u no work ? y u do this facebook ? :(',
   'sim score': '0.30875597566025537'},
  {'post ID': '219',
   'post text': 'why they cut the encore i wanna see snsd infinite interaction :(',
   'sim score': '0.28145890015142466'}],
 'community interaction': [{'post ID': '219'