##### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2022 Semester 1

## Assignment 2: Sentiment Classification of Tweets

This notebook contains code focused on analysing features for the final model.
**This is only for exploration.** See `features.py` for functionality related to extracting features for model implementation.

First, read the CSV data files (`Train.csv` and `Test.csv`).

In [3]:
import pandas as pd

train_data = pd.read_csv("../datasets/Train.csv", sep=',')
test_data = pd.read_csv("../datasets/Test.csv", sep=',')

### Extracting the individual words

In [22]:
import re

tweets = train_data['text'].values

# strip extraneous punctuation, then split on single spaces
tweet_word_lists = [re.sub(r'[".,/\\(\'):;\t]', "", x).split(" ") for x in tweets]

# print the head of both lists just to check that it worked
print(tweets[:10])
print(tweet_word_lists[:10])

[' doctors hit campaign trail as race to medical council elections heats up https://t.co/iifdwb9v0w #homeopathy'
 ' is anybody going to the radio station tomorrow to see shawn? me and my friend may go but we would like to make new friends/meet there (:\t'
 " i just found out naruto didn't become the 5th hokage....\t"
 ' "prince george reservist who died saturday just wanted to help people, his father tells @cbcnews http://t.co/riauzrjgre"\t'
 ' season in the sun versi nirvana rancak gak..slow rockkk...\t'
 " if i didnt have you i'd never see the sun. #mtvstars lady gaga\t"
 ' this is cute. #thisisus @nbcthisisus https://t.co/ndxqyl4gjk'
 ' today is the international day for the elimination of violence against women #orangetheworld #unitednations #unodc‚ä¶ https://t.co/uyqctttufj'
 ' "in his first game back since april 14, david wright went 2-for-5 with a hr, bb and three r on monday. he also made two errors at 3b."\t'
 ' josh hamilton flies out to center... we are going to the bottom o

### 1. TF-IDF
TF-IDF weights each word by how distinctive it is for a tweet relative to all the other tweets, so words that appear almost everywhere (such as "the") receive low weights.

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

# Build the feature set (vocabulary) and vectorise the Train dataset using TF-IDF
tweets_tfidf = tfidf_vectorizer.fit_transform(tweets)

print("Train feature space size (using TFIDF):", tweets_tfidf.shape)
print(tweets_tfidf[1])

Train feature space size (using TFIDF): (21802, 44045)
  (0, 37883)	0.18565385954834512
  (0, 24659)	0.2500345232367134
  (0, 15226)	0.25639046572035723
  (0, 26660)	0.17561152736960378
  (0, 23985)	0.1925927500306722
  (0, 22991)	0.16044767939535962
  (0, 42083)	0.18984640176982912
  (0, 41365)	0.1543207744837252
  (0, 7246)	0.14059126992943502
  (0, 16261)	0.1784628628725588
  (0, 24454)	0.12804387104621462
  (0, 15223)	0.26344567340807307
  (0, 26105)	0.14662061838154353
  (0, 3761)	0.09883064069307852
  (0, 24586)	0.1579972519146742
  (0, 34418)	0.22806178452645745
  (0, 34040)	0.1638445966736955
  (0, 38468)	0.13527781692615354
  (0, 36044)	0.34058106427217183
  (0, 31309)	0.2838666463265357
  (0, 37689)	0.06611242944726782
  (0, 16331)	0.16788221772423795
  (0, 3989)	0.29703234834833714
  (0, 19715)	0.1065038202170494
  (0, 38395)	0.2534685554135372
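
To make the weighting concrete, here is a minimal pure-Python sketch of the smoothed TF-IDF formula that scikit-learn applies by default (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The helper name `tfidf`, the toy documents, and the plain whitespace tokenisation are assumptions made for illustration; `TfidfVectorizer` tokenises differently, so the numbers will not match the output above.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: plain whitespace tokenisation, unlike TfidfVectorizer's default.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weighted = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        # L2-normalise so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

# Toy corpus (made up for this sketch, not taken from Train.csv).
toy_docs = ["the sun is out", "the game is on", "doctors hit the campaign trail"]
weights = tfidf(toy_docs)

# Print the terms of the first document, most distinctive first.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {weight:.3f}")
```

In the first toy document, "the" occurs in every document and ends up with the lowest weight, while "sun" and "out" occur only once in the corpus and end up with the highest.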


### 2. Links

Some tweets link to other websites or to other tweets.
Links can be identified by the `https://t.co/` prefix of Twitter's redirect service. Note that the pattern used here matches only the `https` form; `http://t.co/` links, which also occur in the data, are not counted.

This *numeric* feature counts the links in a single tweet. A link could indicate that the tweet states an opinion about the linked content.

In [26]:
# Create a list of number of redirects/links
num_links = []

for t in tweets:
    links = re.findall(r"https://t\.co", t)
    num_links.append(len(links))

# check the head and range of values for this
print(num_links[:10])
print(set(num_links))

[1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
{0, 1, 2, 3, 5}
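
The fourth tweet printed earlier contains an `http://t.co/` link, yet its count above is 0 because the pattern only looks for `https`. As a hedged alternative, the sketch below accepts either scheme. The helper name `count_links` and the compiled pattern `TCO_LINK` are hypothetical, and the sample tweets are copied from the output shown earlier in this notebook; the counts on the full dataset have not been re-run with this pattern.

```python
import re

# Assumed pattern: http or https, the t.co host, then a short alphanumeric path.
TCO_LINK = re.compile(r"https?://t\.co/\w+")

def count_links(tweet):
    """Number of t.co links in a tweet (hypothetical helper)."""
    return len(TCO_LINK.findall(tweet))

# Three tweets taken from the head of the training data printed above.
sample_tweets = [
    " this is cute. #thisisus @nbcthisisus https://t.co/ndxqyl4gjk",
    ' "prince george reservist who died saturday just wanted to help people, '
    'his father tells @cbcnews http://t.co/riauzrjgre"\t',
    " i just found out naruto didn't become the 5th hokage....\t",
]

link_counts = [count_links(t) for t in sample_tweets]
print(link_counts)
```

With this pattern the `http://` tweet is counted as having one link, which the `https`-only version would miss.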


### 3. Hashtags
Hashtags are used on Twitter to link tweets about similar subjects.
Extracting them from the tweets may help in grouping similar tweets together.

In [43]:
# create a list of hashtags in a tweet
hashtags = []
all_hashtags = set()

for t in tweets:
    hashes = re.findall(r"#\w+", t)
    hashtags.append(hashes)
    all_hashtags = all_hashtags.union(hashes)

# check the head and range of values for this
print(hashtags[:10])
# print(all_hashtags)

[['#homeopathy'], [], [], [], [], ['#mtvstars'], ['#thisisus'], ['#orangetheworld', '#unitednations', '#unodc'], [], ['#nevereverquit']]
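
One possible way to turn these per-tweet lists into model features is a binary indicator column for each of the most frequent hashtags. The sketch below is only an illustration: the helper `hashtag_indicators` and its `top_k` parameter are hypothetical, and the input is the ten-tweet head printed above rather than the full `hashtags` list.

```python
from collections import Counter

# Stand-in for the `hashtags` list: the first ten entries printed above.
hashtags = [['#homeopathy'], [], [], [], [], ['#mtvstars'], ['#thisisus'],
            ['#orangetheworld', '#unitednations', '#unodc'], [],
            ['#nevereverquit']]

def hashtag_indicators(tag_lists, top_k=3):
    """Return (vocab, rows): the top_k hashtags and one 0/1 row per tweet."""
    # Count how often each hashtag occurs across all tweets.
    counts = Counter(tag for tags in tag_lists for tag in tags)
    vocab = [tag for tag, _ in counts.most_common(top_k)]
    # One column per retained hashtag; 1 if the tweet contains it, else 0.
    rows = [[int(tag in tags) for tag in vocab] for tags in tag_lists]
    return vocab, rows

vocab, rows = hashtag_indicators(hashtags, top_k=3)
print(vocab)
print(rows[:3])
```

Restricting the columns to the `top_k` most common hashtags keeps the feature space small; rare hashtags seen in only one tweet would add columns without helping to group tweets.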


### 4. User References
`@` symbols are used on Twitter to reference a specific user.
Extracting these references from the tweets may help in grouping tweets with similar recipients.

In [50]:
# create a list of user references in each tweet
references = []
all_references = set()

for t in tweets:
    refs = re.findall(r"@\w+", t)
    references.append(refs)
    all_references = all_references.union(refs)

# check the head and range of values for this
print(references[:10])
# print(all_references)

[[], [], [], ['@cbcnews'], [], [], ['@nbcthisisus'], [], [], []]
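
Grouping tweets by recipient can be sketched as an inverted index from each referenced user to the positions of the tweets that mention them. The helper name `index_by_reference` is hypothetical, and the input below is just the ten-tweet head printed above, standing in for the full `references` list.

```python
from collections import defaultdict

# Stand-in for the `references` list: the first ten entries printed above.
references = [[], [], [], ['@cbcnews'], [], [], ['@nbcthisisus'], [], [], []]

def index_by_reference(ref_lists):
    """Map each referenced user to the indices of the tweets mentioning them."""
    index = defaultdict(list)
    for i, refs in enumerate(ref_lists):
        # set() so a user mentioned twice in one tweet is indexed only once
        for ref in set(refs):
            index[ref].append(i)
    return dict(index)

tweets_by_user = index_by_reference(references)
print(tweets_by_user)
```

Users that map to several tweet indices are the ones for which grouping by recipient could carry any signal.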


### 5. Smiley Faces
ASCII emoticons such as `:)`, `:(`, `:P` can indicate emotion, which may make the sentiment of a tweet easier to judge.

For simplicity, an emoticon is defined here as one or more eye symbols `;:B`, followed by optional nose/middle symbols `',-`, followed by one or more mouth symbols `LlPp|/()VOo3`.

In [55]:
# create a list of emoticons in each tweet
emoticons = []
all_emoticons = set()

for t in tweets:
    emotes = re.findall(r"(?<=[ ^])[;:B]+[',-]*[LlPp\|\/()VOo3]+(?=[ $])", t)
    emoticons.append(emotes)
    all_emoticons = all_emoticons.union(emotes)

# check the head and range of values for this
print(emoticons[:10])
print(all_emoticons)

[[], [], [], [], [], [], [], [], [], []]
{":')", ':--(', ':-o', ';)', ':-)', ':/', ';o', ':(', ':o', ':)', ':3', ':p', ':))', ":'(", ';-)'}
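
One caveat with the pattern above: inside square brackets, `^` and `$` are literal characters rather than anchors, so `[ ^]` and `[ $]` only accept a space (or a literal caret or dollar sign) around the emoticon. An emoticon at the very start or end of a tweet, or one followed by a tab, is therefore not matched. The sketch below shows one possible alternative that uses negative lookarounds on non-whitespace instead. The name `EMOTICON` and the sample strings are made up for illustration, and the set printed above has not been recomputed with this pattern.

```python
import re

# Assumption: an emoticon must be delimited by whitespace or a string boundary.
# (?<!\S) = not preceded by a non-space character; (?!\S) = not followed by one.
EMOTICON = re.compile(r"(?<!\S)[;:B]+[',-]*[LlPp|/()VOo3]+(?!\S)")

# Made-up sample strings covering start, tab-terminated, end, and a non-emoticon.
samples = [
    ":) great game",
    "so tired :(\t",
    "see you soon ;-)",
    "ratio 3:1 (approx)",
]

found = [EMOTICON.findall(s) for s in samples]
print(found)
```

The last sample contains `:` and `(` but neither forms a standalone token that starts with an eye symbol, so nothing is extracted from it.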
