Achintya Yedavalli

# Assignment 2: Exploring Pre-Processing Techniques

Welcome to TweetMiner, the leading organization in Twitter data analysis! As an NLP scientist in our team, you're entrusted with the task of extracting the most relevant tweets based on input hashtags. For instance, if the hashtag is "#abortion," we expect you to extract the top N (let's say N=10) tweets that truly discuss the topic of "abortion." Similarly, for a hashtag like "#politicaladvertising," your algorithm should identify and extract the top N (again, let's use N=10) tweets about "political advertising".

Despite the availability of advanced algorithms, we're interested in exploring a few fundamental approaches as given below.


## 1. Simple Word-Overlap Based Match (Search V1.0)

We're going to implement the following:

### a. Offline Processing

- load tweets from file

In [37]:
# load file!
# solution from https://stackoverflow.com/questions/15233340/getting-rid-of-n-when-using-readlines

with open("australian_election_2019_tweets.txt") as f:
    list_messages = f.read().splitlines()


list_messages[:10]


['After the climate election: shellshocked green groups remain resolute https://t.co/wyJzmAcyiD',
 '@narendramodi @smritiirani Coverage of indian election on SBS tv channel, Australia. Jai hind 🇮🇳🙏 https://t.co/90qplBEAf8',
 '@workmanalice Do you know if Facebook is releasing an election post-mortem in Australia? They looked into the midterms, but were we important enough to bother?',
 '@vanbadham We all understand we have a compulsory preference system. Vote 1 mightn’t go to the major but 2 or 3 usually does.',
 'Majority of Australia wanted LNP, that’s the facts.',
 'This is nothing like the USA System.',
 'Shares were mixed in Asia, with India and Australia leading gains for the region following elections that looked set to keep incumbents in office. https://t.co/krRhPYuRID',
 "Australia's pollsters to review incorrect election forecasts https://t.co/isy2SPg7L5",
 'It is disappointing that @tanya_plibersek has ruled herself out of the @AustralianLabor leadership challenge - given he


- remove duplicate tweets

In [38]:
# remove duplicates using set() method

# see how many tweets are in the dataset
print(len(list_messages))

lm_rd = list(set(list_messages))

# see how many were removed
print(len(lm_rd))

347192
264734
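
`list(set(...))` removes duplicates but does not keep the tweets in file order, which is why the preview that follows starts with different tweets than the raw file. If order matters, one option is `dict.fromkeys`, since dicts preserve insertion order in Python 3.7+. This is a minimal sketch; `dedupe_keep_order` is a hypothetical helper name, not part of the notebook.

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each in place."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


sample = ["b", "a", "b", "c", "a"]
print(dedupe_keep_order(sample))
```

The result has the same length as `list(set(sample))`; only the ordering differs.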


In [39]:
lm_rd[:10]

['',
 "At the end of the day, the economy will tank if people don't have money to spend, &amp; surplus/negative gearing/franking credits won't mean shit when our crops fail &amp; the oceans boil.",
 'What an embarrassment. "Don\'t be afraid. Don\'t be scared. It won\'t hurt you. It\'s coal," said \u2066@ScottMorrisonMP\u2069 . In the world news again for all the wrong reasons. It’s time for change, it’s time for these dinosaurs to go. #auspol #ausvotes  https://t.co/eOOjdd2unh',
 '@btckr Especially\xa0for Scott - "But ... all liars—they will be consigned to the fiery lake of burning sulfur. This is the second death.”',
 'Just want to put this on the public record for future generations: I DID NOT VOTE FOR THIS.',
 'https://t.co/NYujmSSeZV',
 'Bob Hawke in an open letter to voters,. "As I said repeatedly when I was Prime Minister , if you can’t govern yourselves, you can’t govern the country."',
 "Didn't anyone tell you about Australia?",
 '"This country does need a Labor government... 

- remove URLs, mentions, hashtags, and non-English text

In [40]:
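# we use regex for this
import re
# credit to https://www.geeksforgeeks.org/remove-urls-from-string-in-python/
def remove_non_english(text):
    # Pattern matches: URLs, a leading "@" or "#" symbol on a word, and any
    # character that is not an ASCII letter, digit, or whitespace.
    # Only the "@"/"#" symbols are stripped; the mention and hashtag words stay.
    pattern = re.compile(r"https?://\S+|(?<=\s)[@#]|^[@#]|[^a-zA-Z0-9\s]")

    # Replace every match with the empty string
    text_without_noneg = pattern.sub("", text)

    return text_without_noneg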

lm_rd_ru = []

for line in lm_rd:
  lm_rd_ru.append(remove_non_english(line))


In [41]:
lm_rd_ru[:10]

['',
 'At the end of the day the economy will tank if people dont have money to spend amp surplusnegative gearingfranking credits wont mean shit when our crops fail amp the oceans boil',
 'What an embarrassment Dont be afraid Dont be scared It wont hurt you Its coal said ScottMorrisonMP  In the world news again for all the wrong reasons Its time for change its time for these dinosaurs to go auspol ausvotes  ',
 'btckr Especially\xa0for Scott  But  all liarsthey will be consigned to the fiery lake of burning sulfur This is the second death',
 'Just want to put this on the public record for future generations I DID NOT VOTE FOR THIS',
 '',
 'Bob Hawke in an open letter to voters As I said repeatedly when I was Prime Minister  if you cant govern yourselves you cant govern the country',
 'Didnt anyone tell you about Australia',
 'This country does need a Labor government the truth is hope is more powerful than fear  AlboMP addressing supporters with LNP 74 seats to ALP 66 auspol ausvotes',
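
The cleaned output above still contains `amp` (left over from the HTML entity `&amp;`), a non-breaking space (`\xa0`), and runs of double spaces. A possible variant, sketched under the assumption that decoding entities and collapsing whitespace is acceptable for this task, is shown here. `clean_tweet` is a hypothetical name; `html.unescape` is from the Python standard library.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
NON_ALNUM_RE = re.compile(r"[^a-zA-Z0-9\s]")


def clean_tweet(text):
    """Strip URLs and non-alphanumeric characters, then tidy whitespace."""
    text = html.unescape(text)         # "&amp;" becomes "&" before symbols are removed
    text = URL_RE.sub("", text)        # drop links
    text = NON_ALNUM_RE.sub("", text)  # drop punctuation, emoji, "@", "#"
    return " ".join(text.split())      # collapse whitespace, including \xa0


print(clean_tweet("money to spend, &amp; surplus #auspol https://t.co/eOOjdd2unh"))
```

As in the notebook's version, the hashtag and mention words themselves are kept, which the later hashtag search relies on.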

### b. Real-Time Processing

- Generate a list of 10 hashtags, initiating each with the "#" symbol. Ensure the list consists of 5 single-word hashtags and 5 multiword hashtags. For multiword hashtags, capitalize the first letter of each word (e.g., #PoliticalAdvertising).

hashtags: '#renewableEnergy', '#taxLaws', '#parliamentaryMajority', '#coalition', '#Labor', '#Liberal', '#auspol', '#democracySausage', '#ausVotes', and '#ausVotes22'

In [42]:
# list of 10 hashtags
hashtags = ['#renewableEnergy', '#taxLaws', '#parliamentaryMajority', '#coalition', '#Labor', '#Liberal', '#auspol', '#democracySausage', '#ausVotes', '#ausVotes22']

- Remove the "#" symbol from all hashtags. If the hashtag is multiword, split it into individual words using regular expressions. Refer to the code snippet available at https://stackoverflow.com/questions/68448243/efficient-way-to-split-multi-word-hashtag-in-python

In [43]:
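# removing # from hashtags and splitting up multi word hashtags
# NOTE: the tag is lowercased before the split, so the capital letters that
# mark word boundaries are gone and multiword hashtags stay as one token
# (see the output below).
for x, tag in enumerate(hashtags):
  tag = tag.lower()
  hashtags[x] = re.sub(r'#[a-z]\S*',
        lambda m: ' '.join(re.findall('[A-Z][^A-Z]*|[a-z][^A-Z]*', m.group().lstrip('#'))),tag)
  print(hashtags[x])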


renewableenergy
taxlaws
parliamentarymajority
coalition
labor
liberal
auspol
democracysausage
ausvotes
ausvotes22
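
In the cell above each tag is lowercased before the regular expression runs, and the printed output shows that the multiword hashtags were not split. A minimal sketch of the other order (split on capitals first, lowercase afterwards) is given here, reusing the same `re.findall` pattern; `split_hashtag` is a hypothetical helper name.

```python
import re


def split_hashtag(tag):
    """Split a camelCase hashtag into lowercase words, e.g. '#taxLaws' -> 'tax laws'."""
    body = tag.lstrip("#")
    # Each word starts at a capital letter (or at the first lowercase letter)
    # and runs until the next capital letter.
    words = re.findall(r"[A-Z][^A-Z]*|[a-z][^A-Z]*", body)
    return " ".join(word.lower() for word in words)


for tag in ["#renewableEnergy", "#coalition", "#Labor", "#ausVotes22"]:
    print(split_hashtag(tag))
```

One consequence to keep in mind: a tag written as `#ausVotes` becomes the two words `aus votes`, which will not match tweets that contain the single token `ausvotes`. Tags meant to be searched as one word would need to be written without an internal capital.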


- Implement a string match-based search to find and display the top 5 tweets for each hashtag. Determine the relevance score for each tweet by counting the occurrences of each word in the hashtag. Compile a list of tweets based on the top 5 relevance scores for each hashtag.

In [44]:
# string match-based search (code taken from lab 3)

def return_ranked_results(hashtags, tweet_list):
    match_score = {}

    for i, document in enumerate(tweet_list):
      score = 0
      document_index = i
      for token in hashtags:
        if token in document:
          score += 1
      match_score[document_index] = score

    ranked_documents = sorted(match_score.items(),key = lambda x: x[1] ,reverse = True)
    return list(ranked_documents)

def perform_search_and_show_results(documents, search_queries):
  documents_tokens = []

  for document in documents:
    documents_tokens.append(document.split())

  for query in search_queries:
    search_tokens = query.split()
    results = return_ranked_results(search_tokens, documents_tokens)
    print (f"--------------------------")
    print (f"Results for query: {query}")
    print (f"--------------------------")
    top_count = 0
    for result in results:
      document_id = result[0]
      score = result[1]
      if score !=0 and top_count < 5:
        print (documents[document_id], score)
        top_count += 1

perform_search_and_show_results(lm_rd_ru, hashtags)

--------------------------
Results for query: renewableenergy
--------------------------
What better car than this to transport an Independent for Climate Action Now ICANSenate ausvotes hybrid renewableenergy climateelection IndependentsDay Lexus  1
ACE EV to sign agreement to build electric vehicles in Adelaide starting in 2020 Interesting  ev ElectricVehicles AutonomousVehicles selfdrivingcars renewables renewableenergy energy resources Auspol Aceev Bellresources Bellhubhq 1
stopadani climatechange coal renewables solar windpower coalpower globalwarming renewableenergy auspol ausvotes2019 climateemergency  1
Solar Citizens today welcomes federal Labors announcement that theyll support a Community Power Hub in Western Sydney to help diverse communities access renewableenergy  1
LARGESCALE renewableenergy is the  go  1
--------------------------
Results for query: taxlaws
--------------------------
--------------------------
Results for query: parliamentarymajority
--------------------
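
The scoring function above adds 1 for each query word that is present in a tweet, so a tweet that repeats a word scores the same as one that mentions it once. The task statement asks for counting occurrences; a minimal sketch of that variant, using `collections.Counter` and `heapq.nlargest` from the standard library, could look like this. `top_tweets` and its parameters are hypothetical names.

```python
from collections import Counter
import heapq


def top_tweets(query, tweets, n=5):
    """Return up to n (tweet, score) pairs, scored by query-word occurrences."""
    query_tokens = query.lower().split()
    scored = []
    for idx, tweet in enumerate(tweets):
        counts = Counter(tweet.lower().split())
        # Counter returns 0 for words that do not appear in the tweet
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scored.append((score, idx))
    # Highest score first; ties are broken by earlier position in the list
    best = heapq.nlargest(n, scored, key=lambda pair: (pair[0], -pair[1]))
    return [(tweets[idx], score) for score, idx in best]


demo = ["tax laws tax", "laws", "nothing here", "Tax"]
for tweet, score in top_tweets("tax laws", demo):
    print(tweet, score)
```

`heapq.nlargest` avoids sorting every tweet when only the top few are needed.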

## 2. Improving Search Quality with Text Pre-processing (Search V2.0)

We'll be doing the following: (same as the previous question)

### a. Offline Processing

- Load all tweets from file

In [45]:
# load file!
# solution from https://stackoverflow.com/questions/15233340/getting-rid-of-n-when-using-readlines

with open("australian_election_2019_tweets.txt") as f:
    list_messages = f.read().splitlines()


list_messages[:10]

['After the climate election: shellshocked green groups remain resolute https://t.co/wyJzmAcyiD',
 '@narendramodi @smritiirani Coverage of indian election on SBS tv channel, Australia. Jai hind 🇮🇳🙏 https://t.co/90qplBEAf8',
 '@workmanalice Do you know if Facebook is releasing an election post-mortem in Australia? They looked into the midterms, but were we important enough to bother?',
 '@vanbadham We all understand we have a compulsory preference system. Vote 1 mightn’t go to the major but 2 or 3 usually does.',
 'Majority of Australia wanted LNP, that’s the facts.',
 'This is nothing like the USA System.',
 'Shares were mixed in Asia, with India and Australia leading gains for the region following elections that looked set to keep incumbents in office. https://t.co/krRhPYuRID',
 "Australia's pollsters to review incorrect election forecasts https://t.co/isy2SPg7L5",
 'It is disappointing that @tanya_plibersek has ruled herself out of the @AustralianLabor leadership challenge - given he

- Remove duplicate tweets

In [46]:
# remove duplicates using set() method

# see how many tweets are in the dataset
print(len(list_messages))

lm_rd = list(set(list_messages))

# see how many were removed
print(len(lm_rd))

347192
264734


- Remove URLs, mentions, hashtags, and non-English text.

In [47]:
# we use regex for this
import re
# credit to https://www.geeksforgeeks.org/remove-urls-from-string-in-python/
def remove_non_english(text):
    # Define a regex pattern to find
    pattern = re.compile(r"https?://\S+|(?<=\s)[@#]|^[@#]|[^a-zA-Z0-9\s]")

    # Use the sub() method to replace
    text_without_noneg = pattern.sub("", text)

    return text_without_noneg

lm_rd_ru = []

for line in lm_rd:
  lm_rd_ru.append(remove_non_english(line))


- Apply text pre-processing techniques in the following order: (a) Lower casing (b) Tokenization (c) Stemming (d) Stopword removal.

In [48]:
%pip install nltk

Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting click (from nltk)
  Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting regex>=2021.8.3 (from nltk)
  Using cached regex-2023.12.25-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Using cached tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
Using cached regex-2023.12.25-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Using cached tqdm-4.66.2-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, click, nltk
Successfully installed click-8.1.7 nltk-3.8.1 regex-2023.12.25 tqdm-4.66.2

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.2[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to res

In [52]:
# use nltk (from lab 3)
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# for stemming
from nltk.stem import PorterStemmer

# these datasets that NLTK needs to tokenize should be downloaded once.
# (nltk must be imported before nltk.download can be called)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))

ps = PorterStemmer()

[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [53]:
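# all preprocessing
def preprocessing(text):
  # lowercasing
  text = text.lower()
  # tokenization
  tokens = word_tokenize(text)
  # stemming
  # NOTE: stemmed_words is computed but not used below, so the returned
  # text is lowercased and stopword-filtered but not stemmed.
  stemmed_words = [ps.stem(token) for token in tokens]
  # stopword removal (filters the original tokens, not stemmed_words)
  filtered_text = [token for token in tokens if token not in stop_words]
  return " ".join(filtered_text)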

pp_lm_rd_ru = [preprocessing(line) for line in lm_rd_ru]

pp_lm_rd_ru[:10]

['',
 'end day economy tank people dont money spend amp surplusnegative gearingfranking credits wont mean shit crops fail amp oceans boil',
 'embarrassment dont afraid dont scared wont hurt coal said scottmorrisonmp world news wrong reasons time change time dinosaurs go auspol ausvotes',
 'btckr especially scott liarsthey consigned fiery lake burning sulfur second death',
 'want put public record future generations vote',
 '',
 'bob hawke open letter voters said repeatedly prime minister cant govern cant govern country',
 'didnt anyone tell australia',
 'country need labor government truth hope powerful fear albomp addressing supporters lnp 74 seats alp 66 auspol ausvotes',
 'msveruca facing trump moment auspol']

- Explain.

Before running any searches, a look at the first 10 processed tweets suggests that keyword search should work well. The content-bearing words remain, while the unnecessary words and symbols are gone. Some of the hashtags we are going to search for are already visible in this short preview.

However, this preprocessing takes a long time: about *2 whole minutes*, in a world where computers can run at up to 7 billion cycles per second. So, not very fast.
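
Two things stand out in the `preprocessing` cell: the stemmed tokens are computed but the function returns the unstemmed ones, and every repeated word is stemmed again from scratch. The sketch below shows one way to apply the four steps in the stated order while caching stems with `functools.lru_cache`. It is an illustration under assumptions, not the notebook's method: `make_preprocessor` and `toy_stem` are hypothetical names, and the toy stemmer and three-word stop list exist only so the example runs without NLTK. In the notebook one would pass `ps.stem` and `stop_words` instead.

```python
from functools import lru_cache


def make_preprocessor(stem, stop_words):
    """Build a function applying: lowercase -> tokenize -> stem -> stopword removal."""
    cached_stem = lru_cache(maxsize=None)(stem)  # each distinct word is stemmed once
    stop_words = frozenset(stop_words)

    def preprocess(text):
        tokens = text.lower().split()  # punctuation was already stripped upstream
        stems = [cached_stem(token) for token in tokens]
        return " ".join(s for s in stems if s not in stop_words)

    return preprocess


def toy_stem(word):
    """Stand-in stemmer (assumption): drops a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


preprocess = make_preprocessor(toy_stem, {"the", "is", "for"})
print(preprocess("The Voters is ready for elections"))
```

Because stemming runs before stopword removal here, a real stemmer can alter some stopwords so that they no longer match the stop list; whether that matters depends on the stemmer and the list used. Splitting on whitespace instead of calling `word_tokenize` is also an assumption that only holds because the text has already had punctuation removed.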

### b. Real-Time Processing

- start with the same procedure as 1b

(using same hashtags list)

- Additionally, apply text pre-processing techniques in the following order: (a) Lower casing (b) Tokenization (c) Stemming (d) Stopword removal. You can use NLTK library and can refer to the code from Week3's tutorial.

*already done!*

- Implement a string match-based search to find and display the top 5 tweets for each hashtag. Determine the relevance score for each tweet by counting the occurrences of each word in the hashtag. Compile a list of tweets based on the top 5 relevance scores for each hashtag.

In [54]:
# just copy+pasting the code from above
# string match-based search (code taken from lab 3)

def return_ranked_results(hashtags, tweet_list):
    match_score = {}

    for i, document in enumerate(tweet_list):
      score = 0
      document_index = i
      for token in hashtags:
        if token in document:
          score += 1
      match_score[document_index] = score

    ranked_documents = sorted(match_score.items(),key = lambda x: x[1] ,reverse = True)
    return list(ranked_documents)

def perform_search_and_show_results(documents, search_queries):
  documents_tokens = []

  for document in documents:
    documents_tokens.append(document.split())

  for query in search_queries:
    search_tokens = query.split()
    results = return_ranked_results(search_tokens, documents_tokens)
    print (f"--------------------------")
    print (f"Results for query: {query}")
    print (f"--------------------------")
    top_count = 0
    for result in results:
      document_id = result[0]
      score = result[1]
      if score !=0 and top_count < 5:
        print (documents[document_id], score)
        top_count += 1

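# NOTE: this call searches lm_rd_ru (the Search V1.0 text), not the
# preprocessed pp_lm_rd_ru, which is why the results below match Search V1.0.
perform_search_and_show_results(lm_rd_ru, hashtags)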

--------------------------
Results for query: renewableenergy
--------------------------
What better car than this to transport an Independent for Climate Action Now ICANSenate ausvotes hybrid renewableenergy climateelection IndependentsDay Lexus  1
ACE EV to sign agreement to build electric vehicles in Adelaide starting in 2020 Interesting  ev ElectricVehicles AutonomousVehicles selfdrivingcars renewables renewableenergy energy resources Auspol Aceev Bellresources Bellhubhq 1
stopadani climatechange coal renewables solar windpower coalpower globalwarming renewableenergy auspol ausvotes2019 climateemergency  1
Solar Citizens today welcomes federal Labors announcement that theyll support a Community Power Hub in Western Sydney to help diverse communities access renewableenergy  1
LARGESCALE renewableenergy is the  go  1
--------------------------
Results for query: taxlaws
--------------------------
--------------------------
Results for query: parliamentarymajority
--------------------

- Explain.

Simply making all the letters lowercase helps immensely. Capitalization differences no longer cause tweets to be missed or left unscored, so the scoring is more even. Removing stopwords and stemming also greatly increases the speed of the search itself. Overall, it is a very good idea to implement all of these steps. However, since all of that work happens at the beginning, it mostly moves the processing time to the offline step rather than removing it.
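
One way to make the trade-off between offline and real-time work concrete is an inverted index: the offline step maps each token to the tweets that contain it, and a query then only touches the entries for its own words instead of scanning every tweet. This is a minimal sketch under the assumption that tweets are already cleaned, whitespace-separated strings; `build_index` and `search` are hypothetical names and are not part of the notebook.

```python
from collections import Counter, defaultdict


def build_index(tweets):
    """Offline step: map each token to a Counter of {tweet_id: occurrences}."""
    index = defaultdict(Counter)
    for doc_id, tweet in enumerate(tweets):
        for token in tweet.split():
            index[token][doc_id] += 1
    return index


def search(index, query, n=5):
    """Real-time step: sum occurrence counts for each query word."""
    scores = Counter()
    for token in query.split():
        scores.update(index.get(token, {}))  # adds counts; unknown words add nothing
    # Highest score first, ties broken by lower tweet id
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


tweets = ["tax law tax", "law reform", "coal"]
index = build_index(tweets)
print(search(index, "tax law"))
```

The query must be preprocessed in the same way as the tweets (same casing and, if used, the same stemmer) for the tokens to line up.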