# Assignment 3: Hashtag based Tweet search

We will extend Assignment 2 by building a vector-based search that retrieves tweets for a given hashtag.

Overview:
Welcome to TweetMiner, the leading organization in Twitter data analysis! As an NLP scientist on our team, you're entrusted with the task of extracting the most relevant tweets based on input hashtags. For instance, if the hashtag is "#abortion", we expect you to extract the top N (let's say N=10) tweets that truly discuss the topic of "abortion". Similarly, for a hashtag like "#politicaladvertising", your algorithm should identify and extract the top N (again, let's use N=10) tweets about "political advertising".
For this assignment your tasks are the following:

## Task 1: Vectorize with CountVectorizer (binary=True) and perform search

### Processing Tweets:

1. Pre-process tweets using applicable pre-processing techniques.


In [1]:
import os.path
if not os.path.isfile("preprocessed_tweets.txt"):

  # first load file! same as Assignment 2
  with open("australian_election_2019_tweets.txt") as f:
      list_tweets = f.read().splitlines()

  # pre-processing from Lab 4
  import spacy
  from nltk.corpus import stopwords
  import nltk
  nltk.download('stopwords')

  # get a list of stopwords from NLTK
  stops = set(stopwords.words('english'))

  # Load SpaCy English language model
  # this is a pipeline capable of applying morphological, lexical and syntax analysis on text

  nlp_pipeline = spacy.load("en_core_web_sm")

  def pre_process_a_single_sentence(sentence: str):
    # Lower case text
    sentence = sentence.lower()

    processed_sentence = []

    # Tokenize, and lemmatize the text
    doc = nlp_pipeline(sentence)

    for token in doc:
      # here token is an object that contains various information about each token
      # information such as lemma, pos, parse labels are available
      # we check whether each token's text is a stopword; if not, we retain its lemma
      # (compare token.text, a string, because a spaCy Token object never equals a string in stops)
      if token.text not in stops:
        lemmatized_token = token.lemma_
        processed_sentence.append(lemmatized_token)
    processed_sentence = " ".join (processed_sentence)
    return processed_sentence

  # remove duplicates first
  l_t = list(set(list_tweets))

  # we use regex for removing URLs, non-english text
  import re
  # credit to https://www.geeksforgeeks.org/remove-urls-from-string-in-python/
  def remove_non_english(text):
      # Define a regex pattern to find
      pattern = re.compile(r"https?://\S+|(?<=\s)[@#]|^[@#]|[^a-zA-Z0-9\s]")

      # Use the sub() method to replace
      text_without_noneg = pattern.sub("", text)

      return text_without_noneg

  ltrdru = []

  for line in l_t:
    ltrdru.append(remove_non_english(line))

  # preprocess text actual
  prepro_tweets = [pre_process_a_single_sentence(sentence) for sentence in ltrdru]
  print(prepro_tweets[:10])


  # save lines
  with open('preprocessed_tweets.txt', 'w') as f:
      for line in prepro_tweets:
          f.write('%s\n' %line)

else:
  print("File Exists, skipping")

File Exists, skipping


I skip pre-processing when the saved file already exists, because it takes about 20 minutes to run. The saved file can simply be loaded instead.

In [2]:
with open("preprocessed_tweets.txt") as f:
    prepro_tweets = f.read().splitlines()

2. Vectorize the pre-processed tweets with CountVectorizer (binary=True). This creates sparse vectors of the tweets based on the fitted vocabulary.

In [3]:
# also taken from lab 4
from sklearn.feature_extraction.text import CountVectorizer
# Define the N for N-grams
N = 1
# Initialize the CountVectorizer with N-gram range
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False, binary = True, max_features = 10000)

# Fit the vectorizer on the corpus (the transform happens in the next cell)
vectorizer.fit(prepro_tweets)

# Retrieve the learned vocabulary
vocab = vectorizer.get_feature_names_out()

# sanity check: print the vocabulary size and the vocabulary itself
print(len(vocab))
print(vocab)

10000
['000' '001' '003' ... 'zombie' 'zone' 'zubspike']


In [4]:
# making a transformation of the text
bow_transformed_corpus = []

for sentence in prepro_tweets:
  transformed_vector = vectorizer.transform([sentence])
  bow_transformed_corpus.append(transformed_vector.toarray()[0])

# sanity check: print a few items from bow_transformed_corpus
print ("Transformed Corpus Samples", bow_transformed_corpus[:5])
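
As a possible alternative to transforming one tweet at a time, the sketch below passes the whole corpus to `transform` in a single call and keeps the result as a SciPy sparse matrix. This may use much less memory than a list of dense 10,000-dimensional arrays. The toy data and the names `toy_tweets`, `toy_vectorizer`, and `toy_sparse` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the pre-processed tweets.
toy_tweets = ["renewable energy policy", "tax law reform", "energy tax"]

toy_vectorizer = CountVectorizer(binary=True, lowercase=False)
toy_vectorizer.fit(toy_tweets)

# A single call transforms every document; the result stays sparse.
toy_sparse = toy_vectorizer.transform(toy_tweets)
print(toy_sparse.shape)  # (3, 6)

# Each row matches what the per-sentence loop would produce.
row_from_loop = toy_vectorizer.transform([toy_tweets[0]]).toarray()[0]
rows_match = np.array_equal(toy_sparse[0].toarray()[0], row_from_loop)
print(rows_match)  # True
```

A dense copy can still be obtained with `toy_sparse.toarray()` when a later step needs NumPy arrays.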


### Processing hashtags and conducting search:

1. Manually define a list of 10 hashtags, initiating each with the "#" symbol. Ensure the list consists of 5 single-word hashtags and 5 multiword hashtags. For multiword hashtags, capitalize the first letter of each word (e.g., #PoliticalAdvertising). 

Hashtags used: '#RenewableEnergy', '#TaxLaws', '#ParliamentaryMajority', '#Coalition', '#Labor', '#Liberal', '#Auspol', '#DemocracySausage', '#Ausvotes', and '#AusVotes22'

In [None]:
# list of 10 hashtags
hashtags = ['#RenewableEnergy', '#TaxLaws', '#ParliamentaryMajority', '#Coalition', '#Labor', '#Liberal', '#Auspol', '#DemocracySausage', '#Ausvotes', '#AusVotes22']

2. Remove the "#" symbol from all hashtags. If the hashtag is multiword, split it into individual words using regular expressions. Refer to the code snippet available at https://stackoverflow.com/questions/68448243/efficient-way-to-split-multi-word-hashtag-in-python

In [None]:
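import re

# Split each CamelCase hashtag into words. The leading "#" disappears because
# only runs that start with an uppercase letter are kept by findall.
for i, tag in enumerate(hashtags):
    hashtags[i] = re.sub(r'#(\w*[A-Z]\w*)',
                         lambda m: ' '.join(re.findall('[A-Z][^A-Z]*', m.group())), tag)
    print(hashtags[i])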


Renewable Energy
Tax Laws
Parliamentary Majority
Coalition
Labor
Liberal
Auspol
Democracy Sausage
Ausvotes
Aus Votes22
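
The regex above only rewrites hashtags that contain at least one uppercase letter, so an all-lowercase tag such as "#coalition" would keep its "#". The following is a hedged sketch of one way to handle both cases. The helper name `split_hashtag` is hypothetical, and all-caps acronyms would be split letter by letter, which may or may not be wanted.

```python
import re

def split_hashtag(tag: str) -> str:
    """Strip the leading '#' and split a CamelCase hashtag into words."""
    body = tag.lstrip("#")
    # An uppercase letter plus the following non-uppercase characters forms a word;
    # a leading lowercase run is kept as a word of its own.
    words = re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", body)
    return " ".join(words)

for tag in ["#PoliticalAdvertising", "#coalition", "#AusVotes22"]:
    print(split_hashtag(tag))
# Political Advertising
# coalition
# Aus Votes22
```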


3. For each hashtag,

a. Vectorize the hashtags USING THE SAME Vectorizer that you built under "Processing Tweets". Let's call it "queryVector" 

15
['Aus' 'Auspol' 'Ausvotes' 'Coalition' 'Democracy' 'Energy' 'Labor' 'Laws'
 'Liberal' 'Majority' 'Parliamentary' 'Renewable' 'Sausage' 'Tax'
 'Votes22']
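
Step a asks for the vectorizer that was fitted on the tweets, so the query should go through `transform` only, never a second `fit`. The sketch below illustrates this on toy data. The names `toy_tweets`, `toy_vectorizer`, `split_tags`, and `query_vectors` are assumptions. Because the tweets were lowercased during pre-processing and the vectorizer uses `lowercase=False`, the sketch lowercases the split hashtags first. In the real pipeline the queries would presumably also need the same lemmatization as the tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the pre-processed (lowercased, lemmatized) tweets.
toy_tweets = ["renewable energy be the future", "tax law need reform"]
toy_vectorizer = CountVectorizer(binary=True, lowercase=False)
toy_vectorizer.fit(toy_tweets)

# Hashtags after the "#" has been removed and the words have been split.
split_tags = ["Renewable Energy", "Tax Laws"]

# Match the casing of the corpus before looking words up in the vocabulary.
queries = [tag.lower() for tag in split_tags]

# transform (not fit) reuses the vocabulary learned from the tweets.
query_vectors = toy_vectorizer.transform(queries).toarray()
print(query_vectors.shape)  # (2, 9)

# Without lowercasing, no query word is found in the vocabulary.
unmatched = toy_vectorizer.transform(["Renewable Energy"]).sum()
print(unmatched)  # 0
```

In this toy case "laws" is missing from the vocabulary because the corpus contains the lemma "law", which is why only one entry of the second query vector is set.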


b. Compute the pairwise similarity between the "queryVector" and each tweet vector using the inverse of the Euclidean distance (you can copy the implementation from the ALTERNATIVE_Lab4 notebook).
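
The ALTERNATIVE_Lab4 implementation is not reproduced here, so the following is only a sketch under one common assumption: similarity is defined as 1 / (1 + d), where d is the Euclidean distance, which avoids division by zero when a tweet equals the query. The function names `inverse_euclidean_similarity` and `top_n_indices` and the toy vectors are hypothetical.

```python
import numpy as np

def inverse_euclidean_similarity(query_vector, tweet_matrix):
    """Return 1 / (1 + Euclidean distance) between the query and every row."""
    query = np.asarray(query_vector, dtype=float)
    tweets = np.asarray(tweet_matrix, dtype=float)
    # Broadcasting subtracts the query from each tweet row.
    distances = np.sqrt(((tweets - query) ** 2).sum(axis=1))
    return 1.0 / (1.0 + distances)

def top_n_indices(similarities, n=10):
    """Indices of the n most similar tweets, most similar first."""
    # A stable sort keeps the original order among tied scores.
    return np.argsort(-similarities, kind="stable")[:n]

# Hypothetical binary tweet vectors and query vector.
toy_tweet_matrix = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1]])
toy_query = np.array([1, 1, 0])

sims = inverse_euclidean_similarity(toy_query, toy_tweet_matrix)
best = top_n_indices(sims, n=2)
print(best.tolist())  # [0, 1]
```

The first toy tweet is identical to the query (distance 0, similarity 1.0), and the second differs in one position (distance 1, similarity 0.5).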