##### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2022 Semester 1

## Assignment 2: Sentiment Classification of Tweets

This sample code is intended to help you vectorise the 'Train' dataset for Assignment 2.

First, we read the CSV data files (Train and Test).

In [1]:
import pandas as pd

train_data = pd.read_csv("Train.csv", sep=',')
test_data = pd.read_csv("Test.csv", sep=',')

Then we separate the tweet text and the label (sentiment). 

In [3]:
# Separate the instances (tweet text) and labels (sentiment) for Train
X_train_raw = train_data['text'].tolist()
Y_train = train_data['sentiment'].tolist()

# Check the result
print("Train length:", len(X_train_raw))

# Extract the instances (tweet text) for Test
X_test_raw = test_data['text'].tolist()

# Check the result
print("Test length:", len(X_test_raw))

Train length: 21802
Test length: 6099


In [5]:
#Let's see one example tweet
print(X_train_raw[1])

 is anybody going to the radio station tomorrow to see shawn? me and my friend may go but we would like to make new friends/meet there (:	


### 1. Bag of Words (BoW)
In this approach, we use scikit-learn's **CountVectorizer** class to tokenise the Train corpus (dataset) and build a vocabulary of every distinct word in it. Each word in the vocabulary becomes one 'feature' (one dimension of the vector) used to represent each instance (tweet) in the `Train` and `Test` datasets.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

BoW_vectorizer = CountVectorizer()

#Build the feature set (vocabulary) and vectorise the Train dataset using BoW
X_train_BoW = BoW_vectorizer.fit_transform(X_train_raw)

#Use the feature set (vocabulary) from Train to vectorise the Test dataset 
X_test_BoW = BoW_vectorizer.transform(X_test_raw)

print("Train feature space size (using BoW):",X_train_BoW.shape)
print("Test feature space size (using BoW):",X_test_BoW.shape)

Train feature space size (using BoW): (21802, 44045)
Test feature space size (using BoW): (6099, 44045)
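
Both matrices have the same number of columns because the vocabulary is learned from Train only and then reused for Test. The following is a minimal sketch of that behaviour on a toy corpus; `toy_train`, `toy_test` and `vec` are made-up names for illustration and are not part of the assignment data. It assumes the default `CountVectorizer` settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for X_train_raw and X_test_raw
toy_train = ["good day", "bad day"]
toy_test = ["good evening"]  # "evening" never appears in toy_train

vec = CountVectorizer()
train_matrix = vec.fit_transform(toy_train)  # learns the vocabulary from Train
test_matrix = vec.transform(toy_test)        # reuses it; unseen words are dropped

print(sorted(vec.vocabulary_))               # ['bad', 'day', 'good']
print(train_matrix.shape, test_matrix.shape) # (2, 3) (1, 3)
print(test_matrix.toarray())                 # [[0 0 1]]
```

Words that occur only in Test (here "evening") have no column, so they contribute nothing to the Test vectors.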


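Each row of `X_train_BoW` is stored in sparse format: printing a row lists only its non-zero entries. Each entry is shown as a `(row, column)` coordinate, where the column is the word's index (word_id) in the vocabulary, followed by the number of times that word occurs in the given instance (tweet).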

In [None]:
#Let's see one example tweet using the BoW feature space
print(X_train_BoW[1])
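
The column indices in a sparse row are hard to read on their own. As a rough sketch, one way to map them back to words is to invert the vectoriser's `vocabulary_` dictionary. The toy corpus and the names `toy_tweets`, `toy_vectorizer`, `index_to_word` and `decoded` below are illustrative assumptions; the same idea should apply to `BoW_vectorizer` and `X_train_BoW`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for X_train_raw
toy_tweets = ["good day good mood", "bad day"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps word -> column index; invert it to get column index -> word
index_to_word = {idx: word for word, idx in toy_vectorizer.vocabulary_.items()}

# Decode the non-zero entries of the first row into {word: count}
row = toy_counts[0]
decoded = {index_to_word[j]: int(c) for j, c in zip(row.indices, row.data)}
print(decoded)
```

For the first toy tweet, `decoded` contains "good" with a count of 2 and "day" and "mood" with a count of 1 each.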

We can save the created vocabulary for the given dataset in a separate file.

In [28]:
#vocabulary_ maps each word to its column index in the feature matrix (not to a count)
output_dict = BoW_vectorizer.vocabulary_
output_pd = pd.DataFrame(list(output_dict.items()),columns = ['word','index'])

output_pd.T.to_csv('BoW-vocab.csv',index=False)

### 2. TFIDF
In this approach, we use scikit-learn's **TfidfVectorizer** class to tokenise the corpus (dataset) and build the vocabulary. As in the BoW approach, each word in the vocabulary becomes a 'feature' used to represent each instance (tweet).

However, in this method the value associated with each feature (word) in an instance is not the number of times the word occurs in that tweet, but the TFIDF (term frequency-inverse document frequency) value of that word, which down-weights words that occur in many tweets.
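
To make the TFIDF value concrete, here is a small sketch that recomputes it by hand on a toy corpus and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); other settings would give different numbers. The names `toy_tweets`, `manual` and `auto` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for X_train_raw
toy_tweets = ["good day good mood", "bad day"]

# Term frequencies: raw counts, exactly as in the BoW approach
counts = CountVectorizer().fit_transform(toy_tweets).toarray()

# Document frequency: number of tweets that contain each word
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default, assumed here)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF * IDF, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

auto = TfidfVectorizer().fit_transform(toy_tweets).toarray()
print(np.allclose(manual, auto))  # True under the default settings
```

Because each row is normalised to unit length, the values in a row are not raw counts and depend on every other word in the same tweet.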

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

#Build the feature set (vocabulary) and vectorise the Train dataset using TFIDF
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_raw)

#Use the feature set (vocabulary) from Train to vectorise the Test dataset 
X_test_tfidf = tfidf_vectorizer.transform(X_test_raw)

print("Train feature space size (using TFIDF):",X_train_tfidf.shape)
print("Test feature space size (using TFIDF):",X_test_tfidf.shape)


Train feature space size (using TFIDF): (21802, 44045)
Test feature space size (using TFIDF): (6099, 44045)


In [30]:
#Let's see one example tweet using the TFIDF feature space
print(X_train_tfidf[1])

  (0, 37883)	0.18565385954834512
  (0, 24659)	0.2500345232367134
  (0, 15226)	0.25639046572035723
  (0, 26660)	0.17561152736960378
  (0, 23985)	0.1925927500306722
  (0, 22991)	0.16044767939535962
  (0, 42083)	0.18984640176982912
  (0, 41365)	0.1543207744837252
  (0, 7246)	0.14059126992943502
  (0, 16261)	0.1784628628725588
  (0, 24454)	0.12804387104621462
  (0, 15223)	0.26344567340807307
  (0, 26105)	0.14662061838154353
  (0, 3761)	0.09883064069307852
  (0, 24586)	0.1579972519146742
  (0, 34418)	0.22806178452645745
  (0, 34040)	0.1638445966736955
  (0, 38468)	0.13527781692615354
  (0, 36044)	0.34058106427217183
  (0, 31309)	0.2838666463265357
  (0, 37689)	0.06611242944726782
  (0, 16331)	0.16788221772423795
  (0, 3989)	0.29703234834833714
  (0, 19715)	0.1065038202170494
  (0, 38395)	0.2534685554135372
