# Assignment 3: Hashtag based Tweet search

We will extend Assignment 2 and work on building a vector based search for hashtag based search of tweets.

Overview:
Welcome to TweetMiner, the leading organization in Twitter data analysis! As an NLP scientist in our team, you're entrusted with the task of extracting the most relevant tweets based on input hashtags. For instance, if the hashtag is "#abortion," we expect you to extract the top N (let's say N=10) tweets that truly discuss the topic of "abortion." Similarly, for a hashtag like "#politicaladvertising," your algorithm should identify and extract the top N (again, let's use N=10) tweets about "political advertising".
For this assignment your tasks are the following:

## Task 1: Use CountVectorizer (binary = true) vectorization technique and perform search

### Processing Tweets:

1. Pre-process tweets using applicable pre-processing techniques.


In [4]:
import os.path
if not os.path.isfile("preprocessed_tweets.txt"):

  # first load file! same as Assignment 2
  with open("australian_election_2019_tweets.txt") as f:
      list_tweets = f.read().splitlines()

  # pre-processing from Lab 4
  import spacy
  from nltk.corpus import stopwords
  import nltk
  nltk.download('stopwords')

  # get a list of stopwords from NLTK
  stops = set(stopwords.words('english'))

  # Load SpaCy English language model
  # this is a pipeline capable of applying morphological, lexical and syntax analysis on text

  nlp_pipeline = spacy.load("en_core_web_sm")

  def pre_process_a_single_sentence(sentence: str):
    # Lower case text
    sentence = sentence.lower()

    processed_sentence = []

    # Tokenize, and lemmatize the text
    doc = nlp_pipeline(sentence)

    for token in doc:
      # here token is an object that contains various information about each token
      # information such as lemma, pos, parse labels are available
      # we will check here if the token's text is a stopword; if not, we will retain its lemma
      # (compare token.text, a string: the spaCy Token object itself never matches a set of strings)
      if token.text not in stops:
        lemmatized_token = token.lemma_
        processed_sentence.append(lemmatized_token)
    processed_sentence = " ".join (processed_sentence)
    return processed_sentence

  # remove duplicates first
  l_t = list(set(list_tweets))

  # we use regex for removing URLs, non-english text
  import re
  # credit to https://www.geeksforgeeks.org/remove-urls-from-string-in-python/
  def remove_non_english(text):
      # Define a regex pattern to find
      pattern = re.compile(r"https?://\S+|(?<=\s)[@#]|^[@#]|[^a-zA-Z0-9\s]")

      # Use the sub() method to replace
      text_without_noneg = pattern.sub("", text)

      return text_without_noneg

  ltrdru = []

  for line in l_t:
    ltrdru.append(remove_non_english(line))

  # preprocess text actual
  prepro_tweets = [pre_process_a_single_sentence(sentence) for sentence in ltrdru]


  # save lines
  with open('preprocessed_tweets.txt', 'w') as f:
      for line in prepro_tweets:
          f.write('%s\n' %line)

else:
  print("File Exists, skipping")

File Exists, skipping


I skip the pre-processing when the file is already saved, because pre-processing takes about 20 minutes. Now I can simply load the saved file to save time.
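
To see what the cleaning regex in the cell above does, here is a small sketch that applies the same pattern to a made-up tweet (the sample text is an assumption, not taken from the dataset). Since the last alternative `[^a-zA-Z0-9\s]` already matches `@` and `#`, the two alternatives that target those symbols look redundant, at least on this sample.

```python
import re

# Same pattern as in the pre-processing cell
pattern = re.compile(r"https?://\S+|(?<=\s)[@#]|^[@#]|[^a-zA-Z0-9\s]")

# Hypothetical sample tweet, for illustration only
sample = "@ScottMorrisonMP says #auspol is fine! https://t.co/abc123 :)"
cleaned = pattern.sub("", sample)

# URLs, punctuation, and the @/# symbols are removed; the words are kept
print(cleaned.split())

# A shorter pattern that should behave the same on this sample
simpler = re.compile(r"https?://\S+|[^a-zA-Z0-9\s]")
print(simpler.sub("", sample) == cleaned)
```

Note that the pattern also removes any non-ASCII letters, which is how the non-English text gets dropped.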

In [5]:
prepro_tweets = []
with open("preprocessed_tweets.txt") as f:
    for line in f:
        line = line.strip()
        # skip any loose blank lines
        if line:
            prepro_tweets.append(line)
# the with-block closes the file, so no explicit close is needed
print("opened")

opened


2. Vectorize pre-processed tweets with CountVectorizer (binary = true). This will create sparse vectors of the tweets based on the vectorizer's vocabulary.

In [6]:
# also taken from lab 4
from sklearn.feature_extraction.text import CountVectorizer
# Define the N for N-grams
N = 1
# Initialize the CountVectorizer with N-gram range
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False, binary = True, max_features = 3000) # very low for now. I can't have it be too high

# Fit and transform the corpus
vectorizer.fit(prepro_tweets)

# Check a few items in the vocabulary
vocab = vectorizer.get_feature_names_out()

# sanity check: check the list of vocabulary
print(vocab[:10])

['10' '100' '10newsfirst' '10yourvote' '11' '12' '13' '14' '15' '15bn']
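
A small sketch of what `binary = True` changes, using a made-up two-tweet corpus (an assumption for illustration): a repeated word is recorded as 1 instead of its count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not from the dataset
toy_corpus = ["vote vote vote labor", "labor win"]

counts = CountVectorizer(binary=False).fit_transform(toy_corpus).toarray()
binary = CountVectorizer(binary=True).fit_transform(toy_corpus).toarray()

# The vocabulary is sorted alphabetically: labor, vote, win
print(counts[0])  # "vote" is counted three times
print(binary[0])  # "vote" is only marked as present
```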


In [7]:
# making a transformation of each tweet (very memory intensive and a complete space hog)
tweet_vectors = []

#We are going to process only 1000 though
prepro_tweets = prepro_tweets[:1000]
for sentence in prepro_tweets:
  transformed_vector = vectorizer.transform([sentence])
  tweet_vectors.append(transformed_vector.toarray()[0])

print ("Transformed Tweets", tweet_vectors[:5])

Transformed Tweets [array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0])]


I am processing only 1000 tweets at this time. Why? Because if I process more, my computer runs out of memory and crashes. I tried many different things and none of them worked, so I will verify with 1000 tweets, and if I get a computer with more RAM in the future I will re-run with the full dataset.
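
One thing that might help with memory, as a sketch I have not run on the full dataset: `vectorizer.transform` returns a SciPy sparse matrix, and `sklearn.metrics.pairwise.euclidean_distances` accepts sparse input, so the tweets would not need to be converted with `.toarray()` one by one. The toy corpus and the variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy corpus, not from the dataset
toy_tweets = ["renewable energy now", "tax law reform", "auspol"]

toy_vectorizer = CountVectorizer(binary=True)
# Keep the whole corpus as one sparse matrix (no .toarray())
tweet_matrix = toy_vectorizer.fit_transform(toy_tweets)
query_matrix = toy_vectorizer.transform(["renewable energy"])

# Distances from the query to every tweet, flattened to shape (n_tweets,)
distances = euclidean_distances(tweet_matrix, query_matrix).ravel()

# Smallest distance first
ranking = np.argsort(distances)
print(ranking[0], distances)
```

Ranking by smallest distance gives the same order as ranking by largest inverse distance, so the similarity step could be skipped for the ranking itself.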

### Processing hashtags and conduct search:

1. Manually define a list of 10 hashtags, initiating each with the "#" symbol. Ensure the list consists of 5 single-word hashtags and 5 multiword hashtags. For multiword hashtags, capitalize the first letter of each word (e.g., #PoliticalAdvertising). 

Hashtags used: '#RenewableEnergy', '#TaxLaws', '#ParliamentaryMajority', '#Coalition', '#Labor', '#Liberal', '#Auspol', '#DemocracySausage', '#Ausvotes', and '#AusVotes22'

In [8]:
# list of 10 hashtags
hashtags = ['#RenewableEnergy', '#TaxLaws', '#ParliamentaryMajority', '#Coalition', '#Labor', '#Liberal', '#Auspol', '#DemocracySausage', '#Ausvotes', '#AusVotes22']

2. Remove the "#" symbol from all hashtags. If the hashtag is multiword, split it into individual words using regular expressions. Refer to the code snippet available at https://stackoverflow.com/questions/68448243/efficient-way-to-split-multi-word-hashtag-in-python

In [9]:
import re
# strip the "#" and split multiword hashtags before each capital letter
for x, tag in enumerate(hashtags):
    hashtags[x] = re.sub(r'#(\w*[A-Z]\w*)',
                         lambda m: ' '.join(re.findall('[A-Z][^A-Z]*', m.group())), tag)
    print(hashtags[x])

Renewable Energy
Tax Laws
Parliamentary Majority
Coalition
Labor
Liberal
Auspol
Democracy Sausage
Ausvotes
Aus Votes22
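
One caveat with the output above: the split hashtags keep their capital letters, while the tweets were lowercased during pre-processing. A minimal sketch of a splitter that also lowercases the words is below; `split_hashtag` is a hypothetical helper, not part of the lab code, and it does not lemmatize the words the way the tweets were lemmatized.

```python
import re

def split_hashtag(tag: str) -> str:
    """Drop the leading '#', split before each capital letter, and lowercase."""
    body = tag.lstrip("#")
    # Either a capital followed by non-capitals, or a run with no capitals at all
    parts = re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", body)
    return " ".join(part.lower() for part in parts)

for tag in ["#RenewableEnergy", "#auspol", "#AusVotes22"]:
    print(split_hashtag(tag))
```

Unlike the `re.sub` version, this also handles hashtags that contain no capital letter, so they would not need to be capitalized by hand first.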


3. For each hashtag,

a. Vectorize the hashtags USING THE SAME Vectorizer that you built under "Processing Tweets". Let's call it "queryVector"

b. Compute the pairwise similarity between the "queryVector" and  each tweet vector using inverse of Euclidean Distance (you can copy the implementation from ALTERNATIVE_Lab4 notebook).  

c. Rank tweets based on the similarity score in ascending order. Print the top 10 most similar tweets. 

d. Repeat for each hashtag

In [19]:
import numpy as np
# using the original from lab 4 because it goes line by line instead of needing the full corpus at once
def euclidean_distance_based_similarity (vector1, vector2):
    distance = np.linalg.norm(np.array(vector1) - np.array(vector2))

    # a distance of 0 (identical vectors) would divide by zero, so score it 0;
    # otherwise return the inverse of the distance computed above
    if distance == 0:
        return 0
    else:
        return 1 / distance
# repeat for each hashtag
for hashtag in hashtags:
    query_vector = vectorizer.transform([hashtag]).toarray()[0]

    # this is all taken from lab 4
    similarity_scores = {}
    for i, tweet_vector in enumerate(tweet_vectors):
        sim = euclidean_distance_based_similarity(tweet_vector, query_vector)
        similarity_scores[i] = sim
    ranked_documents = sorted(similarity_scores.items(),key=lambda x: x[1], reverse = True)
    # print the top 10 documents based on ranked score
    print (f"\nQuery: {hashtag}")
    for document_idx, score in ranked_documents[:10]:
        print(f"Document: {document_idx} {prepro_tweets[document_idx]}, Score: {score}")


Query: Renewable Energy
Document: 27 auspol, Score: 1.0
Document: 71 ausvote, Score: 1.0
Document: 90 ausvote usefulidiot, Score: 1.0
Document: 100 auspol, Score: 1.0
Document: 135 ausvote, Score: 1.0
Document: 151 very engaging, Score: 1.0
Document: 170 ausvote   barrel   jaidynlstephenson, Score: 1.0
Document: 190 alp                 180, Score: 1.0
Document: 194 uap 1359, Score: 1.0
Document: 204 ausvote   electionresult, Score: 1.0

Query: Tax Laws
Document: 27 auspol, Score: 1.0
Document: 71 ausvote, Score: 1.0
Document: 90 ausvote usefulidiot, Score: 1.0
Document: 100 auspol, Score: 1.0
Document: 135 ausvote, Score: 1.0
Document: 151 very engaging, Score: 1.0
Document: 170 ausvote   barrel   jaidynlstephenson, Score: 1.0
Document: 190 alp                 180, Score: 1.0
Document: 194 uap 1359, Score: 1.0
Document: 204 ausvote   electionresult, Score: 1.0

Query: Parliamentary Majority
Document: 27 auspol, Score: 1.0
Document: 71 ausvote, Score: 1.0
Document: 90 ausvote usefulidi

[Before fix]

This is not working as expected. There are a lot of cases where there is a divide by 0, which results in infinite scores. I will change the function so that all infinite cases are removed.

[After fix]

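Ok, now it runs without infinite scores. But the top 10 is the same list for every hashtag, and the tweets in it ("auspol", "ausvote", ...) are not about the query at all. My best guess is that the query vector is all zeros: the tweets were lowercased during pre-processing and the vectorizer uses lowercase = False, so capitalized words like "Renewable" are not in the vocabulary. Against an all-zero query, a tweet with exactly one vocabulary word is at distance 1 (score 1.0), and those are the tweets that rank first. Frankly, this does not tell me a lot about what is similar, so maybe in the future I will lowercase the hashtags, but for now I will let it stay.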
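
An alternative to returning 0 when the distance is 0, sketched here as an assumption rather than the lab's method: use 1 / (1 + distance). It never divides by zero, and identical vectors get the highest score (1.0) instead of the lowest. `inverse_distance_similarity` is a hypothetical name.

```python
import numpy as np

def inverse_distance_similarity(vector1, vector2):
    """Similarity in (0, 1]: 1.0 for identical vectors, smaller as they move apart."""
    v1 = np.asarray(vector1, dtype=float)
    v2 = np.asarray(vector2, dtype=float)
    distance = np.linalg.norm(v1 - v2)
    return 1.0 / (1.0 + distance)

print(inverse_distance_similarity([1, 0, 1], [1, 0, 1]))  # identical vectors
print(inverse_distance_similarity([0, 0], [3, 4]))        # distance is 5
```

The ranking for non-identical vectors is the same as with 1 / distance, because both formulas decrease as the distance grows.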

## Task 2: Use TfIdfVectorizer  vectorization technique and perform search

We do the same thing as above but with TfidfVectorizer instead.

So we make the vectorizer, vectorize the tweets with it, define hashtags, normalize & vectorize them, and then perform similarity search.

In [20]:
# still taken from lab 4
from sklearn.feature_extraction.text import TfidfVectorizer
# N is still 1
vectorizer = TfidfVectorizer(ngram_range=(N, N), lowercase = False, binary = True, max_features = 3000) # very low for now. I can't have it be too high
vectorizer.fit(prepro_tweets)
vocab = vectorizer.get_feature_names_out()

# making a transformation of each tweet (very memory intensive and a complete space hog)
tweet_vectors = []

# prepro_tweets is still 1000 in size
for sentence in prepro_tweets:
  transformed_vector = vectorizer.transform([sentence])
  tweet_vectors.append(transformed_vector.toarray()[0])

# Processing hashtags is already done, no need to do it further
# the Euclidean distance function is also already done

# repeat for each hashtag
for hashtag in hashtags:
    query_vector = vectorizer.transform([hashtag]).toarray()[0]

    # this is all taken from lab 4
    similarity_scores = {}
    for i, tweet_vector in enumerate(tweet_vectors):
        sim = euclidean_distance_based_similarity(tweet_vector, query_vector)
        similarity_scores[i] = sim
    ranked_documents = sorted(similarity_scores.items(),key=lambda x: x[1], reverse = True)
    # print the top 10 documents based on ranked score
    print (f"\nQuery: {hashtag}")
    for document_idx, score in ranked_documents[:10]:
        print(f"Document: {document_idx} {prepro_tweets[document_idx]}, Score: {score}")


Query: Renewable Energy
Document: 0 voting complete in nyc now for democracy sausage auspol, Score: 1.0000000000000002
Document: 13 australias labor party weigh up future after shock election defeat, Score: 1.0000000000000002
Document: 21 tomorrow from 8309am abcsydney live election forum with instudio audience, Score: 1.0000000000000002
Document: 26 coalitionman you know it be not the pole fault it be russia, Score: 1.0000000000000002
Document: 28 2019 australia election generational issue dominate vote a record number be register to vote in a poll that come month after a brutal leadership tussle, Score: 1.0000000000000002
Document: 31 beware the opinion pollster, Score: 1.0000000000000002
Document: 34 auspol how about horrendous allege rape of 16 yo girl at labor youth camp by labor leader billshorten in 1986 gt new evidence go to vicpolice to reopen case yesterday he could be our next laborliar pm gt do that charming story bias theirabc abcnews no bet not, Score: 1.0000000000000002

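This looks really different: every score shown is 1.0000000000000002, slightly over 1. An inverse-distance score goes above 1 whenever the Euclidean distance is below 1, so that by itself is allowed. The likely cause here is that TfidfVectorizer scales every tweet vector to length 1 by default (norm = 'l2'), and if the query vector is all zeros (the capitalized hashtag words are not in the lowercase vocabulary), every non-empty tweet sits at distance 1 up to floating-point rounding. That would make the ranking essentially a tie decided by rounding, which matches the unrelated tweets above. I'm short on time so I'll let it go for now.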
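
A possible explanation for the scores of 1.0000000000000002, sketched on a made-up corpus (the tweets and names below are assumptions): with the default `norm='l2'`, every TF-IDF row has length 1, so an all-zero query is at distance 1 from every non-empty tweet.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical lowercase corpus, mimicking the pre-processed tweets
toy_tweets = ["renewable energy policy", "tax law reform", "auspol"]

toy_vectorizer = TfidfVectorizer(lowercase=False)
tweet_matrix = toy_vectorizer.fit_transform(toy_tweets).toarray()

# Capitalized query words are not in the lowercase vocabulary
query = toy_vectorizer.transform(["Renewable Energy"]).toarray()[0]

row_lengths = np.linalg.norm(tweet_matrix, axis=1)
distances = np.linalg.norm(tweet_matrix - query, axis=1)

print(query.sum())   # no query word was found in the vocabulary
print(row_lengths)   # each row has (approximately) unit length
print(distances)     # so each distance to the zero query is about 1
```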

## Task 3: Use WordEmbedding vectorization technique and perform search

1. Repeat Task 1 steps but use Glove Vectors ("glove-wiki-gigaword-50") to extract word embedding and then convert all word embeddings into sentence embedding by averaging the word embeddings.

In [23]:
# straight out of lab 4 again!

from gensim.models import KeyedVectors
import gensim.downloader as api

# Load pre-trained GloVe embeddings
word_vectors = api.load("glove-wiki-gigaword-50")

# Function to generate average word vectors for a sentence
def average_word_embeddings(sentence):
    words = sentence.split()
    embeddings = []
    for word in words:
        if word in word_vectors:
            embeddings.append(word_vectors[word])
    if len(embeddings) > 0:
        # at least one word has a vector: average the vectors found
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)


In [24]:
word_vector_tweets = []

for sentence in prepro_tweets:
  transformed_vector = average_word_embeddings(sentence)
  word_vector_tweets.append(transformed_vector)

# Processing hashtags is already done, no need to do it further
# the Euclidean distance function is also already done

# repeat for each hashtag
for hashtag in hashtags:
    query_vector = average_word_embeddings(hashtag)

    # this is all copied from above
    similarity_scores = {}
    for i, tweet_vector in enumerate(word_vector_tweets):
        sim = euclidean_distance_based_similarity(tweet_vector, query_vector)
        similarity_scores[i] = sim
    ranked_documents = sorted(similarity_scores.items(),key=lambda x: x[1], reverse = True)
    # print the top 10 documents based on ranked score
    print (f"\nQuery: {hashtag}")
    for document_idx, score in ranked_documents[:10]:
        print(f"Document: {document_idx} {prepro_tweets[document_idx]}, Score: {score}")


Query: Renewable Energy
Document: 560 skynewsaust scottmorrisonmp oh ffs   morrison turnbull abbott high tax ng govt sincehoward   auspol ausvotes2019 msm, Score: 0.511864672832316
Document: 832 eurovision2019 vote kmillerheidke australia awesome queenslander qld brisbane topclass corinda, Score: 0.4718485538749829
Document: 598 liberal james stevens lead labors cressida ohanlon comfortably in sturt live update   auspol ausvote, Score: 0.4658693080011748
Document: 250 hotham vic   137 count, Score: 0.45786288794181984
Document: 308 dutton be untouchable fucking untouchable auspol depress, Score: 0.4409414970177699
Document: 630 wat pn murdoch auspol, Score: 0.4320281914406413
Document: 694 menzie voter have a choice stella yee or, Score: 0.43011942859537405
Document: 226 albanese v plibersek v bowen who s your pick auspol ausvote ausvotes19, Score: 0.425023810091644
Document: 781 wow   bodexpress iceland ausvote btsxmetlife ufcrochester preakness mixer, Score: 0.4027776785291585
Docum

This one is a little surprising. The scores are lower than with the count or the tf-idf vectorizer, but the vectors live in a different space (50 dense dimensions instead of 3000 sparse ones), so the scores are not directly comparable and lower does not have to mean worse. Using a bigger word model may also produce better results.

Another way to look at it is that this model may be more accurate, as it can match words that are not exactly the search hashtag and weight them properly. This could be beneficial, but the top tweets here do not look related to renewable energy, and the capitalized hashtag words may not be in the lowercase GloVe vocabulary at all, so without further research I cannot say for certain if what I see here is true.

Another thing to note is that this is done on a sample of 1000 tweets, which will probably hamper the results somewhat for *all* the vectorizing models involved. If/when I get a beefier computer I will try to run this on the full dataset.
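
One thing worth checking for the word-embedding search, as a sketch with a toy stand-in for GloVe (the two-dimensional vectors and the `average_toy_embeddings` helper are made up for illustration): as far as I know the "glove-wiki-gigaword-50" vocabulary is lowercase, so a capitalized query such as "Renewable Energy" would find no vectors and fall back to all zeros.

```python
import numpy as np

# Toy stand-in for GloVe: a lowercase-only vocabulary with 2-d vectors
toy_vectors = {
    "renewable": np.array([1.0, 0.0]),
    "energy": np.array([0.0, 1.0]),
}

def average_toy_embeddings(sentence, lowercase=False):
    """Average the vectors of the known words; zeros if no word is known."""
    words = sentence.lower().split() if lowercase else sentence.split()
    found = [toy_vectors[word] for word in words if word in toy_vectors]
    if found:
        return np.mean(found, axis=0)
    return np.zeros(2)

print(average_toy_embeddings("Renewable Energy"))                  # no word found
print(average_toy_embeddings("Renewable Energy", lowercase=True))  # both words found
```

If the real query vectors turn out to be all zeros, the ranking would only reflect which tweet embeddings are closest to the origin, not which tweets are about the hashtag.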