# Hashtag based Tweet Search

In [18]:
# read tweets from file
file = open("australian_election_2019_tweets.txt", "r")
tweets = file.readlines()
file.close()

# Remove duplicate tweets by using set
tweets = list(set(tweets))

# show some tweets
tweets[:5]

['The last election was held on my 21st birthday, I stayed in the hotel room watching election coverage until it was called and then drank my sorrows away at the club.\n',
 'Huge yarn from @politico - Australia allegedly took in mass murderers in exchange for sending asylum seekers on Nauru and Manus to the US. Australia’s secret US refugee deal faces blowback from victims, Kiwis – POLITICO via \u2066@zoyashef\u2069 #auspol  https://t.co/cgR0h0nceT\n',
 'C’mon @TheWeeklyTV We need a clip of that Liberal Party PR team. Comedy GOLD! #TheWeekly #auspol\n',
 'Primary Votes: ALP 38.7 (-13.1) NAT 21.9 (-4.4) ON 21.8 (+21.8) GRN 7.2 (+0.1) UAP 4.2 (+4.2)\n',
 '#ausvotes https://t.co/Hty7S5zwFJ\n']

Here I load the file, remove duplicates using set() function, and perform a sanity check for tweets.

In [19]:
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [20]:
import re
from nltk.corpus import words

english_words = set(words.words())

# Function to clean tweets (URLS, mentions, hastags, non-english text)
def clean_tweet(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)
    # Remove non-English text
    words_in_text = text.split()
    english_text = ' '.join(word for word in words_in_text if word.lower() in english_words)
    return english_text

# Clean tweets
cleaned_tweets = [clean_tweet(tweet) for tweet in tweets]

# Display the first few cleaned tweets to verify
cleaned_tweets[:5]

['The last election was on my I stayed in the hotel room watching election coverage until it was and then drank my away at the',
 'Huge yarn from allegedly took in mass in exchange for sending asylum on and Manus to the secret US refugee deal blowback from POLITICO via',
 'We need a clip of that Liberal Party Comedy',
 'Primary ALP NAT ON',
 '']

Here I define a function to clean tweets by removing URLS, mentions, hastags and non-english text. I implement the function on all the tweets and perform a sanity check.

In [21]:
hashtag_list = [
    "#RenewableEnergy",
    "#TaxLaws",
    "#PoliticalAdvertising",
    "#AustraliaCricket",
    "#VotingRights",
    "#politics",
    "#trump",
    "#abortion",
    "#climate",
    "#ashes",
]


def do_process_eng_hashtag(input_text: str) -> str:
    return re.sub(
        r'#[A-Za-z0-9]+',
        lambda m: ' '.join(re.findall('[A-Z][^A-Z]*|[a-z][^A-Z]*', m.group().lstrip('#'))),
        input_text,)


clean_hashtags = [do_process_eng_hashtag(hashtag) for hashtag in hashtag_list]
clean_hashtags

['Renewable Energy',
 'Tax Laws',
 'Political Advertising',
 'Australia Cricket',
 'Voting Rights',
 'politics',
 'trump',
 'abortion',
 'climate',
 'ashes']

Here I define a hashtag list that I will use to conduct my search. I also define a function to process each hashtag.

In [22]:
# Convert to lowercase
def clean_and_lowercase_text(text):
    text = text.lower()
    return text


clean_tweets = [clean_and_lowercase_text(tweet) for tweet in cleaned_tweets]
cleaned_hashtags = [clean_and_lowercase_text(hashtag) for hashtag in clean_hashtags]

Here I define and implement a lowercasing function.

In [23]:
import nltk

# these datasets that NLTK needs to tokenize should be downloaded once.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [24]:
from nltk.tokenize import word_tokenize

def my_tokenize(text):
  tokenized_text = word_tokenize(text)
  return " ".join(tokenized_text)

tokenized_cleaned_tweets = [my_tokenize(tweet) for tweet in clean_tweets[:5]]
tokenized_cleaned_hashtags = [my_tokenize(hashtag) for hashtag in cleaned_hashtags]

print(tokenized_cleaned_tweets[:5])
print(tokenized_cleaned_hashtags[:5])

['the last election was on my i stayed in the hotel room watching election coverage until it was and then drank my away at the', 'huge yarn from allegedly took in mass in exchange for sending asylum on and manus to the secret us refugee deal blowback from politico via', 'we need a clip of that liberal party comedy', 'primary alp nat on', '']
['renewable energy', 'tax laws', 'political advertising', 'australia cricket', 'voting rights']


Here I define and implement a tokenizing function.

In [25]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()


def stem_text_and_remove_stop_words(text):
  text = my_tokenize(text)
  tokens = text.split()
  stemmed_words = []

  for token in tokens:
    if token not in stop_words:
      stemmed_token = ps.stem(token)
      stemmed_words.append(stemmed_token)
  return " ".join(stemmed_words)


stemmed_tweets = [stem_text_and_remove_stop_words(tweet) for tweet in clean_tweets]
stemmed_hashtags = [stem_text_and_remove_stop_words(hashtag) for hashtag in cleaned_hashtags]
print(stemmed_tweets[:5])
print(stemmed_hashtags[:5])

['last elect stay hotel room watch elect coverag drank away', 'huge yarn allegedli took mass exchang send asylum manu secret us refuge deal blowback politico via', 'need clip liber parti comedi', 'primari alp nat', '']
['renew energi', 'tax law', 'polit advertis', 'australia cricket', 'vote right']


Here I define and implement a stemming function and remove stopwords form the tweets.

# Task 1: Use CountVectorizer (binary = true) vectorization technique and perform search

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances
import numpy as np

# Function to retrieve top n_similar tweets for a given hashtag using sklearn's CountVectorizer
def find_similar_tweets_using_countvectorizer(hashtag, n_similar=10):

  # Define the N for N-grams
  N = 1

  # Initialize CountVectorizer with N and binary=True
  vectorizer = CountVectorizer(ngram_range=(N, N), lowercase=False, binary=True)

  # Fit vectorizer using stemmed tweets
  vectorizer.fit(stemmed_tweets)

  # Vectorize the hashtag using the fitted vectorizer
  queryVector = vectorizer.transform([hashtag.lower()]).toarray()

  # Calculate similarity score using inverse Euclidean distance (built-in sklearn)
  distances = 1 / (1 + pairwise_distances(queryVector, vectorizer.transform(stemmed_tweets), metric='euclidean'))[0]

  # Sort indices
  sorted_indices = np.argsort(distances)

  # Return the top n_similar tweets and scores
  return [(stemmed_tweets[idx], distances[idx]) for idx in sorted_indices[:n_similar]]

# Process each hashtag and print results using CountVectorizer
print("Using CountVectorizer:")
for i in range(len(stemmed_hashtags)):
  similar_tweets = find_similar_tweets_using_countvectorizer(stemmed_hashtags[i])
  print(f"Top 10 most similar tweets for {hashtag_list[i]}:")
  for tweet, score in similar_tweets:
    print(f"\t* {tweet} (Score: {score:.4f})")

Using CountVectorizer:
Top 10 most original tweets for #RenewableEnergy:
	* ban free alcohol parliament hous might good start labor former cop member wide bay breath test vote futur alcohol abus lead disorderli name call behavior question (Score: 0.1614)
	* appar apathi visibl launch concern big chunk couch potato give toss follow balanc social media still verili cast vote incompet lie scheme dishonest polici (Score: 0.1640)
	* well honestli dont think cost live hous climat chang effect much someon year peopl part time casual unemploy homeless pleas consid us make (Score: 0.1667)
	* cruel oppress tyrann polici toward need social secur spread inequ racism attempt destroy take money health educ govern wealthi big busi polit corrupt (Score: 0.1667)
	* back vote first howev confid saniti lot young peopl virtu signal never read scientif paper climat yet experi flat cell phone sud tonight (Score: 0.1667)
	* climat make big coal ga pay tax dead without huge labor swing non labor cant offset i

In this code block, I do the following:


*   Import appropriate libraries - numpy and sklearn's CountVectorizer and pairwise_distances. I refered to scikit-learn's documentation for this implementation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

*   Define a function to retrieve top n_similar tweets for a given hashtag using sklearn's CountVectorizer. I chose to use a function since it helps with encapsulation and reusability of code.

*   Create a vectorizer and fit it onto stemmed_tweets.

*   Create a query vector of the hashtag and calculate similarity score using inverse Euclidean distance.

*   Return top 10 similar tweets for a particular hashtag. I use a for loop to iterate over the indices of stemmed_hashtags and implement the function above. I then return each tweet with its corresponding score.

### Analysis
The tweets extracted using CountVectorizer don't seem to be the best/ don't relate well to the specified tweet. This could be due to several reasons like data processing, similarity score method or vectorization method.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Function to retrieve top n_similar tweets for a given hashtag using TfidfVectorizer
def find_similar_tweets_using_Tfidf(hashtag, n_similar=10):

  # Define the N for N-grams
  N = 1

  # Initialize  TfidfVectorizer
  vectorizer = TfidfVectorizer(ngram_range=(N, N), lowercase=False, binary=True)

  # Fit vectorizer
  vectorizer.fit(stemmed_tweets)

  # Vectorize the hashtag using the fitted vectorizer
  queryVector = vectorizer.transform([hashtag.lower()]).toarray()

  # Calculate similarity using inverse Euclidean distance (sklearn implementation)
  distances = 1 / (1 + pairwise_distances(queryVector, vectorizer.transform(stemmed_tweets), metric='euclidean'))[0]

  # Sort indices
  sorted_indices = np.argsort(distances)

  # Return the top tweets and scores
  return [(stemmed_tweets[idx], distances[idx]) for idx in sorted_indices[:n_similar]]


# Process each hashtag and print results using TfidfVectorizer
print("Using TfidfVectorizer:")
for i in range(len(stemmed_hashtags)):
  similar_tweets = find_similar_tweets_using_Tfidf(stemmed_hashtags[i])
  print(f"Top 10 most similar tweets for {hashtag_list[i]}:")
  for tweet, score in similar_tweets:
    print(f"\t* {tweet} (Score: {score:.4f})")

Using TfidfVectorizer:
Top 10 most similar tweets for ##RenewableEnergy:
	* support step forward peopl earth love earthli life inspir support eu climat (Score: 0.4142)
	* govern bereft moral bankrupt twist infight insati lust short long term ensur environment econom futur need (Score: 0.4142)
	* made supposedli independ given balanc senat power phoni suspect lose elect set help (Score: 0.4142)
	* massiv short fit mad boot torn apart team go bring niggl (Score: 0.4142)
	* bad vote experi today usual found despit issu mistaken fine home address list elector one sausag sizzl (Score: 0.4142)
	* whoever wish success lead countri parti everi time (Score: 0.4142)
	* go senat list tri find six worth vote tri find get six go singl issu (Score: 0.4142)
	* alp climat chang like choic actual thing happen desper need (Score: 0.4142)
	* bob current said shorten set danger preced religion stage close campaign day elect move make controversi faith part futur (Score: 0.4142)
	* progress left tri push c

In this code block, I do the following:


*   Import sklearn's TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

*   Define a function to retrieve top n_similar tweets for a given hashtag using sklearn's TfidfVectorizer. I chose to use a function since it helps with encapsulation and reusability of code.

*   Create a vectorizer and fit it onto stemmed_tweets.

*   Create a query vector of the hashtag and calculate similarity score using inverse Euclidean distance.

*   Return top 10 similar tweets for a particular hashtag using a for loop, iterating over each hashtag.

### Analysis
Even using the TfidfVectorizer, the retrieved tweets don't seem to be the best/ don't relate well to the specified tweet. This could be due to several reasons like data processing, similarity score method or vectorization method.

In [None]:
from scipy.spatial.distance import euclidean
import gensim.downloader as api

# Load pre-trained GloVe embeddings
word_vectors = api.load("glove-wiki-gigaword-50")


# Function to generate average word embeddings for a sentence using GloVe vectors
def average_word_embeddings(sentence):
    words = sentence.split()
    embeddings = []
    for word in words:
        if word in word_vectors:
            embeddings.append(word_vectors[word])
    if len(embeddings) > 0:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)


# Function to calculate inverse Euclidean distance between two vectors
def inverse_euclidean_distance(vector1, vector2):
    return 1 / (1 + euclidean(vector1, vector2))


word_vector_transformed_corpus = []
for tweet in stemmed_tweets:
  transformed_vector = average_word_embeddings(tweet)
  word_vector_transformed_corpus.append(transformed_vector)

# Process each hashtag and print results
for hashtag in stemmed_hashtags:
    # Vectorize the hashtag using GloVe word embeddings
    hashtag_embedding = average_word_embeddings(hashtag.lower())

    # Calculate inverse Euclidean distance between hashtag embedding and tweet embeddings
    similarities = [inverse_euclidean_distance(hashtag_embedding, tweet_embedding) for tweet_embedding in word_vector_transformed_corpus]

    # Sort indices by descending order of similarity score
    sorted_indices = np.argsort(similarities)[::-1]

    print(f"Top 10 most similar tweets for #{hashtag}:")
    for idx in sorted_indices[:10]:
        print(f"\t* {tweets[idx]} (Score: {similarities[idx]:.4f})")

Here I do the following:
*   Load pre-trained GloVe embeddings

*   Define a function to generate average word embeddings for a sentence using GloVe vectors

*   Define a function to calculate inverse Euclidean distance between two vectors
*   Create a word_vector_transformed_corpus by applying the average_word_embeddings function to each tweet

*   Process each hashtag and print results



### Analysis
The retrieved tweets using WordEmbedding vectorization technique definitely seems better than the results obtained using CountVectorizer and TfidfVectorizer. The tweets actually relate to the hashtag and the score also seems to carry an approriate weight. This is definitely the superior method for sentiment analysis of the tweets.

## Testing with a larger word bank

In [31]:
from scipy.spatial.distance import euclidean
import gensim.downloader as api

# Load pre-trained GloVe embeddings
word_vectors = api.load("glove-wiki-gigaword-200")


# Function to generate average word embeddings for a sentence using GloVe vectors
def average_word_embeddings(sentence):
    words = sentence.split()
    embeddings = []
    for word in words:
        if word in word_vectors:
            embeddings.append(word_vectors[word])
    if len(embeddings) > 0:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)


# Function to calculate inverse Euclidean distance between two vectors
def inverse_euclidean_distance(vector1, vector2):
    return 1 / (1 + euclidean(vector1, vector2))


word_vector_transformed_corpus = []
for tweet in stemmed_tweets:
  transformed_vector = average_word_embeddings(tweet)
  word_vector_transformed_corpus.append(transformed_vector)

# Process each hashtag and print results
for hashtag in stemmed_hashtags:
    # Vectorize the hashtag using GloVe word embeddings
    hashtag_embedding = average_word_embeddings(hashtag.lower())

    # Calculate inverse Euclidean distance between hashtag embedding and tweet embeddings
    similarities = [inverse_euclidean_distance(hashtag_embedding, tweet_embedding) for tweet_embedding in word_vector_transformed_corpus]

    # Sort indices by descending order of similarity score
    sorted_indices = np.argsort(similarities)[::-1]

    print(f"Top 10 most similar tweets for #{hashtag}:")
    for idx in sorted_indices[:10]:
        print(f"\t* {tweets[idx]} (Score: {similarities[idx]:.4f})")

Top 10 most similar tweets for #renew energi:
	* 4) Australian renewable energy agency
 (Score: 1.0000)
	* with renewable energy and efficiencies technologies...
 (Score: 1.0000)
	* ✅ abundant renewable energy generation potential
 (Score: 0.3430)
	* Germany shows how shifting to renewable energy can backfire #auspol #ausvotes https://t.co/QiGxH2p1j1
 (Score: 0.3100)
	* 2) Plan to have 50% renewable energy by 2030?
 (Score: 0.3050)
	* Renewable energy is key to Australia’s economic growth. 
 (Score: 0.2970)
	* The cost of renewable energy continues to drop.
 (Score: 0.2911)
	* 1) renewable energy target 
 (Score: 0.2899)
	* It is possible to live on 100% renewable energy .@ScottMorrisonMP . 
 (Score: 0.2815)
	* Renewable energy can't replace fossil fuels entirely https://t.co/HIR91pAK5j #billshorten #auspol #ausvotes
 (Score: 0.2648)
Top 10 most similar tweets for #tax law:
	* Keating: "L.A.W. Law tax cuts"
 (Score: 1.0000)
	* ❌Accounting and legal fees for tax advice 
 (Score: 0.2867)

In [32]:
from scipy.spatial.distance import euclidean
import gensim.downloader as api

# Load pre-trained GloVe embeddings
word_vectors = api.load("glove-wiki-gigaword-300")


# Function to generate average word embeddings for a sentence using GloVe vectors
def average_word_embeddings(sentence):
    words = sentence.split()
    embeddings = []
    for word in words:
        if word in word_vectors:
            embeddings.append(word_vectors[word])
    if len(embeddings) > 0:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)


# Function to calculate inverse Euclidean distance between two vectors
def inverse_euclidean_distance(vector1, vector2):
    return 1 / (1 + euclidean(vector1, vector2))


word_vector_transformed_corpus = []
for tweet in stemmed_tweets:
  transformed_vector = average_word_embeddings(tweet)
  word_vector_transformed_corpus.append(transformed_vector)

# Process each hashtag and print results
for hashtag in stemmed_hashtags:
    # Vectorize the hashtag using GloVe word embeddings
    hashtag_embedding = average_word_embeddings(hashtag.lower())

    # Calculate inverse Euclidean distance between hashtag embedding and tweet embeddings
    similarities = [inverse_euclidean_distance(hashtag_embedding, tweet_embedding) for tweet_embedding in word_vector_transformed_corpus]

    # Sort indices by descending order of similarity score
    sorted_indices = np.argsort(similarities)[::-1]

    print(f"Top 10 most similar tweets for #{hashtag}:")
    for idx in sorted_indices[:10]:
        print(f"\t* {tweets[idx]} (Score: {similarities[idx]:.4f})")

Top 10 most similar tweets for #renew energi:
	* 4) Australian renewable energy agency
 (Score: 1.0000)
	* with renewable energy and efficiencies technologies...
 (Score: 1.0000)
	* ✅ abundant renewable energy generation potential
 (Score: 0.2942)
	* 2) Plan to have 50% renewable energy by 2030?
 (Score: 0.2932)
	* Germany shows how shifting to renewable energy can backfire #auspol #ausvotes https://t.co/QiGxH2p1j1
 (Score: 0.2877)
	* Renewable energy is key to Australia’s economic growth. 
 (Score: 0.2862)
	* The cost of renewable energy continues to drop.
 (Score: 0.2796)
	* 1) renewable energy target 
 (Score: 0.2736)
	* It is possible to live on 100% renewable energy .@ScottMorrisonMP . 
 (Score: 0.2654)
	* Renewable energy can't replace fossil fuels entirely https://t.co/HIR91pAK5j #billshorten #auspol #ausvotes
 (Score: 0.2537)
Top 10 most similar tweets for #tax law:
	* Keating: "L.A.W. Law tax cuts"
 (Score: 1.0000)
	* ❌Accounting and legal fees for tax advice 
 (Score: 0.2643)

### Analysis

Both glove-wiki-gigaword-200 and glove-wiki-gigaword-300 seem to deliver similar results. glove-wiki-gigaword-200 definitely runs faster due to lower dimensionality, but it also captures less nuanced semantic information. The relevance of tweets gathered by glove-wiki-gigaword-300 seem to be more appropriate for the task.