# Group Members

Xuan Ming Teo

Lee Yu Xuan

Ananthan Srinath Adhvait

# Description

Write the code to implement a classifier that determines whether a given comment expresses a pro-vaccination or anti-vaccination stance. Initially, you will work with a small sample that you can use to get things set up. Eventually, you will receive the full dataset: first including the result of the first annotation, and later the result of the second round. Please note that your results may change (e.g. which model performs best) when you switch from the small sample to the full dataset.

In [None]:
!wget --no-check-certificate https://www.cse.chalmers.se/~richajo/dit866/data/a3_train_final.tsv

--2024-02-14 21:19:31--  https://www.cse.chalmers.se/~richajo/dit866/data/a3_train_final.tsv
Resolving www.cse.chalmers.se (www.cse.chalmers.se)... 129.16.221.33
Connecting to www.cse.chalmers.se (www.cse.chalmers.se)|129.16.221.33|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 7387583 (7.0M) [text/tab-separated-values]
Saving to: ‘a3_train_final.tsv.1’


2024-02-14 21:20:25 (136 KB/s) - ‘a3_train_final.tsv.1’ saved [7387583/7387583]



In [None]:
!wget --no-check-certificate https://www.cse.chalmers.se/~richajo/dit866/data/a3_test.tsv

--2024-02-14 21:20:25--  https://www.cse.chalmers.se/~richajo/dit866/data/a3_test.tsv
Resolving www.cse.chalmers.se (www.cse.chalmers.se)... 129.16.221.33
Connecting to www.cse.chalmers.se (www.cse.chalmers.se)|129.16.221.33|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 273177 (267K) [text/tab-separated-values]
Saving to: ‘a3_test.tsv.1’


2024-02-14 21:20:26 (308 KB/s) - ‘a3_test.tsv.1’ saved [273177/273177]



In [None]:
# Importing Libraries

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neural_network import MLPClassifier
from sklearn.dummy import DummyClassifier

In [None]:
# Loading in the dataset

df = pd.read_csv("a3_train_final.tsv", sep='\t', header=None)
df.columns = ['Sentiments', 'Text']
df

| | Sentiments | Text |
|---|---|---|
| 0 | 1/1 | I'll only consume if I know what's inside it.... |
| 1 | 0/-1 | It is easier to fool a million people than it... |
| 2 | 0/0 | NATURAL IMMUNITY protected us since evolutio... |
| 3 | 0/-1 | NATURAL IMMUNITY protected us since evolutio... |
| 4 | 0/0 | Proud to have resisted. Proud of my husband, ... |
| ... | ... | ... |
| 50063 | 0/0 | 🤣 keep your 💩 I already know 3 people who have... |
| 50064 | 0/0 | 🤣🤣🤣 "JUST BECAUSE IT'S SAFE, DOESN'T MEAN IT D... |
| 50065 | 0/0 | 🤣🤣🤣 I took the Vaccine because of work. If I d... |
| 50066 | 0/0 | 🤨there's people already having severe side eff... |


In [None]:
sentiments = df['Sentiments']
sentiments

0         1/1
1        0/-1
2         0/0
3        0/-1
4         0/0
         ... 
50063     0/0
50064     0/0
50065     0/0
50066     0/0
50067     1/1
Name: Sentiments, Length: 50068, dtype: object

In [None]:
# Remove rows whose annotation string does not have the form sentiment_1/sentiment_2,
# i.e. rows that do not contain exactly one '/'

uneven_data = []

for i in range(len(sentiments)):
  slash_count = sentiments[i].count('/')
  if slash_count != 1:
    uneven_data.append(i)

len(uneven_data)

5434
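
The same filter can be expressed as a boolean mask instead of an index loop. The following is a minimal sketch on made-up annotation strings, assuming (as in the cell above) that a well-formed label contains exactly one `/`. The toy frame and the names `toy`, `well_formed` and `filtered` are illustrative only.

```python
import pandas as pd

# Toy annotation strings (made up for illustration); the real column is df['Sentiments'].
toy = pd.DataFrame({
    "Sentiments": ["1/1", "0/-1", "1/0/1", "0/0", "1"],
    "Text": ["a", "b", "c", "d", "e"],
})

# True where the label has exactly one '/', i.e. exactly two annotations.
well_formed = toy["Sentiments"].str.count("/") == 1

# Keep only the well-formed rows and renumber them from 0.
filtered = toy[well_formed].reset_index(drop=True)
print(len(toy) - len(filtered), "rows dropped")
print(filtered)
```

Note that `drop=True` discards the old index, whereas the notebook keeps it as an extra `index` column.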

In [None]:
updated_df = df.drop(uneven_data)
updated_df = updated_df.reset_index()
updated_df.shape

(44634, 3)

In [None]:
# Convert each pair of annotations into a single integer label:
#  1 - Positive
#  0 - Negative
# -1 - Unclear
# If either annotation is '-1', or the two annotations differ, the label is -1,
# because there is no clear indication whether the comment is positive or negative.
# If both annotations agree, that shared value becomes the label.

for i in range(len(updated_df)):
  # Split the sentiment string by '/'
  split_sentiments = updated_df.loc[i, 'Sentiments'].split('/')
  if split_sentiments[0] == '-1' or split_sentiments[1] == '-1':
    label = -1
  elif split_sentiments[0] != split_sentiments[1]:
    label = -1
  else:
    label = int(split_sentiments[0])
  # Write through .loc on the DataFrame itself to avoid SettingWithCopyWarning
  updated_df.loc[i, 'Sentiments'] = label

sentiments = updated_df['Sentiments']
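
The labelling rule can also be isolated in a small function, which makes each branch easy to check by hand. This is a sketch only: `collapse_annotation` is a hypothetical helper name, not part of the notebook, and it assumes the input has already been filtered to the `first/second` form.

```python
def collapse_annotation(raw):
    """Collapse a 'first/second' annotation string into a single label.

    Returns 1 or 0 when both annotators agree on that value,
    and -1 when they disagree or either one answered -1.
    """
    first, second = raw.split("/")
    if first == "-1" or second == "-1" or first != second:
        return -1
    return int(first)


# Hand-written examples covering each branch of the rule.
examples = ["1/1", "0/0", "0/-1", "1/0", "-1/-1"]
labels = [collapse_annotation(s) for s in examples]
print(labels)  # [1, 0, -1, -1, -1]
```

On the real data, a function like this could be applied to the whole column with `updated_df['Sentiments'].map(collapse_annotation)`.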


In [None]:
sentiments

0         1
1        -1
2         0
3        -1
4         0
         ..
44629     0
44630     0
44631     0
44632     0
44633     1
Name: Sentiments, Length: 44634, dtype: object

In [None]:
sentiments.value_counts()

 1    19248
 0    18221
-1     7165
Name: Sentiments, dtype: int64

In [None]:
updated_df['Sentiments'] = updated_df['Sentiments'].replace(1,'Positive')
updated_df['Sentiments'] = updated_df['Sentiments'].replace(0,'Negative')
updated_df['Sentiments'] = updated_df['Sentiments'].replace(-1,'Unclear')

In [None]:
updated_df

| | index | Sentiments | Text |
|---|---|---|---|
| 0 | 0 | Positive | I'll only consume if I know what's inside it.... |
| 1 | 1 | Unclear | It is easier to fool a million people than it... |
| 2 | 2 | Negative | NATURAL IMMUNITY protected us since evolutio... |
| 3 | 3 | Unclear | NATURAL IMMUNITY protected us since evolutio... |
| 4 | 4 | Negative | Proud to have resisted. Proud of my husband, ... |
| ... | ... | ... | ... |
| 44629 | 50063 | Negative | 🤣 keep your 💩 I already know 3 people who have... |
| 44630 | 50064 | Negative | 🤣🤣🤣 "JUST BECAUSE IT'S SAFE, DOESN'T MEAN IT D... |
| 44631 | 50065 | Negative | 🤣🤣🤣 I took the Vaccine because of work. If I d... |
| 44632 | 50066 | Negative | 🤨there's people already having severe side eff... |


In [None]:
positive_sentiments = updated_df[updated_df['Sentiments'] == 'Positive']
negative_sentiments = updated_df[updated_df['Sentiments'] == 'Negative']
unclear_sentiments = updated_df[updated_df['Sentiments'] == 'Unclear']

In [None]:
positive_sentiments.shape

(19248, 3)

In [None]:
negative_sentiments.shape

(18221, 3)

In [None]:
unclear_sentiments.shape

(7165, 3)

## Train - Validation Split

In [None]:
x_positive_sentiments = positive_sentiments['Text']
y_positive_sentiments = positive_sentiments['Sentiments']
x_negative_sentiments = negative_sentiments['Text']
y_negative_sentiments = negative_sentiments['Sentiments']
x_unclear_sentiments = unclear_sentiments['Text']
y_unclear_sentiments = unclear_sentiments['Sentiments']

In [None]:
x_train_positive, x_val_positive, y_train_positive, y_val_positive = train_test_split(x_positive_sentiments, y_positive_sentiments, test_size=0.2, random_state=0)
x_train_negative, x_val_negative, y_train_negative, y_val_negative = train_test_split(x_negative_sentiments, y_negative_sentiments, test_size=0.2, random_state=0)
x_train_unclear, x_val_unclear, y_train_unclear, y_val_unclear = train_test_split(x_unclear_sentiments, y_unclear_sentiments, test_size=0.2, random_state=0)

In [None]:
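# Only the clearly labelled Positive and Negative comments are used for training and validation.
# The commented-out lines below would also include the Unclear comments.
# x_train = pd.concat([x_train_positive, x_train_negative, x_train_unclear])
# y_train = pd.concat([y_train_positive, y_train_negative, y_train_unclear])
# x_val = pd.concat([x_val_positive, x_val_negative, x_val_unclear])
# y_val = pd.concat([y_val_positive, y_val_negative, y_val_unclear])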
x_train = pd.concat([x_train_positive, x_train_negative])
y_train = pd.concat([y_train_positive, y_train_negative])
x_val = pd.concat([x_val_positive, x_val_negative])
y_val = pd.concat([y_val_positive, y_val_negative])
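
Splitting each class separately and concatenating the parts keeps the class proportions the same in the training and validation sets. `train_test_split` has a `stratify` argument that aims for the same effect in a single call. The sketch below uses made-up data, and the resulting split would not be row-for-row identical to the one produced above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 'Positive' and 10 'Negative' comments.
toy = pd.DataFrame({
    "Text": [f"comment {i}" for i in range(20)],
    "Sentiments": ["Positive"] * 10 + ["Negative"] * 10,
})

# stratify=... keeps the class ratio the same in the training and validation parts.
x_tr, x_va, y_tr, y_va = train_test_split(
    toy["Text"], toy["Sentiments"],
    test_size=0.2, random_state=0, stratify=toy["Sentiments"],
)

print(y_tr.value_counts().to_dict())
print(y_va.value_counts().to_dict())
```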

In [None]:
x_train.shape

(29974,)

In [None]:
x_val.shape

(7495,)

In [None]:
y_train.shape

(29974,)

In [None]:
y_val.shape

(7495,)

## Processing Test Data

In [None]:
# Loading in the dataset

df_test = pd.read_csv("a3_test.tsv", sep='\t', header=None)
df_test.columns = ['Sentiments', 'Text']
df_test

| | Sentiments | Text |
|---|---|---|
| 0 | 1 | Don't tell me what to do with my body - the sa... |
| 1 | 1 | I did my own research means you looked online ... |
| 2 | 1 | I don't know what's in it. As if they know wha... |
| 3 | 1 | I trust my immune system just translates to "I... |
| 4 | 1 | In the September time frame, unvaccinated peop... |
| ... | ... | ... |
| 2034 | 0 | “Medical professionals” that don’t even eat he... |
| 2035 | 1 | “No vaccine has ever been proven effective.” T... |
| 2036 | 0 | “We cannot have normality until everyone globa... |
| 2037 | 1 | ”i’d do anything to keep my child safe” except... |


In [None]:
x_test = df_test['Text']
y_test = df_test['Sentiments']

In [None]:
x_test

0       Don't tell me what to do with my body - the sa...
1       I did my own research means you looked online ...
2       I don't know what's in it. As if they know wha...
3       I trust my immune system just translates to "I...
4       In the September time frame, unvaccinated peop...
                              ...                        
2034    “Medical professionals” that don’t even eat he...
2035    “No vaccine has ever been proven effective.” T...
2036    “We cannot have normality until everyone globa...
2037    ”i’d do anything to keep my child safe” except...
2038    …and the VACCINATION could give you a heart at...
Name: Text, Length: 2039, dtype: object

In [None]:
y_test

0       1
1       1
2       1
3       1
4       1
       ..
2034    0
2035    1
2036    0
2037    1
2038    0
Name: Sentiments, Length: 2039, dtype: int64

In [None]:
y_test.value_counts()

0    1020
1    1019
Name: Sentiments, dtype: int64

In [None]:
y_test = y_test.replace(1,'Positive')
y_test = y_test.replace(0,'Negative')

## Dummy Classifier as Baseline Benchmark

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=5)
x_train_vectorized = dummy_vectorizer.fit_transform(x_train)
dummy_clf.fit(x_train_vectorized, y_train)

x_val_vectorized = dummy_vectorizer.transform(x_val)

y_pred = dummy_clf.predict(x_val_vectorized)

accuracy = accuracy_score(y_val, y_pred)
print("TfidfVectorizer with Dummy Classifier Validation Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_val, y_pred))

TfidfVectorizer with Dummy Classifier Validation Accuracy: 0.513675783855904
Classification Report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00      3645
    Positive       0.51      1.00      0.68      3850

    accuracy                           0.51      7495
   macro avg       0.26      0.50      0.34      7495
weighted avg       0.26      0.51      0.35      7495
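
The `most_frequent` strategy always predicts the majority class of the training labels, so its accuracy is simply the share of that class in the evaluated set, and the `Negative` row is all zeros because that class is never predicted. The arithmetic below reproduces the reported figures from the support counts in the report above.

```python
# Support counts taken from the validation report above.
n_negative = 3645
n_positive = 3850

# The dummy model predicts 'Positive' for every comment, so it is right
# exactly on the positive comments and wrong on all the negative ones.
baseline_accuracy = n_positive / (n_negative + n_positive)
print(round(baseline_accuracy, 4))  # 0.5137

# Precision for 'Positive' equals the same ratio; recall is 1.0 by construction.
precision_positive = n_positive / (n_negative + n_positive)
recall_positive = n_positive / n_positive
f1_positive = (2 * precision_positive * recall_positive
               / (precision_positive + recall_positive))
print(round(f1_positive, 2))  # 0.68
```

Any real model should beat this figure by a clear margin. `classification_report` also accepts a `zero_division` argument, which controls the value reported for a class that is never predicted.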





## Test Set Accuracy

In [None]:
x_test_vectorized = dummy_vectorizer.transform(x_test)

y_pred = dummy_clf.predict(x_test_vectorized)

accuracy = accuracy_score(y_test, y_pred)
print("TfidfVectorizer with Dummy Classifier Test Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_test, y_pred))

TfidfVectorizer with Dummy Classifier Test Accuracy: 0.49975478175576266
Classification Report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00      1020
    Positive       0.50      1.00      0.67      1019

    accuracy                           0.50      2039
   macro avg       0.25      0.50      0.33      2039
weighted avg       0.25      0.50      0.33      2039





## Utilizing CountVectorizer with Logistic Regression

In [None]:
vectorizer = CountVectorizer()

x_train_vectorized = vectorizer.fit_transform(x_train)

clf = LogisticRegression(max_iter=1000)
clf.fit(x_train_vectorized, y_train)

x_val_vectorized = vectorizer.transform(x_val)

y_pred = clf.predict(x_val_vectorized)

accuracy = accuracy_score(y_val, y_pred)
print("CountVectorizer with Logistic Regression Validation Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_val, y_pred))

CountVectorizer with Logistic Regression Validation Accuracy: 0.8277518345563709
Classification Report:
              precision    recall  f1-score   support

    Negative       0.82      0.83      0.82      3645
    Positive       0.84      0.82      0.83      3850

    accuracy                           0.83      7495
   macro avg       0.83      0.83      0.83      7495
weighted avg       0.83      0.83      0.83      7495



## Test Set Accuracy

In [None]:
x_test_vectorized = vectorizer.transform(x_test)

y_pred = clf.predict(x_test_vectorized)

accuracy = accuracy_score(y_test, y_pred)
print("CountVectorizer with Logistic Regression Test Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_test, y_pred))

CountVectorizer with Logistic Regression Test Accuracy: 0.8425698871996077
Classification Report:
              precision    recall  f1-score   support

    Negative       0.84      0.84      0.84      1020
    Positive       0.84      0.84      0.84      1019

    accuracy                           0.84      2039
   macro avg       0.84      0.84      0.84      2039
weighted avg       0.84      0.84      0.84      2039
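
Each experiment in this notebook fits a vectorizer and a classifier separately, and every later `transform` call has to use the matching fitted vectorizer. One way to bundle the two steps is scikit-learn's `make_pipeline`. The sketch below uses made-up comments and is not the code that produced the results above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy comments (made up for illustration only).
train_texts = [
    "vaccines save lives", "got my shot today", "the vaccine is safe",
    "i will never take it", "the vaccine is poison", "do not trust the shot",
]
train_labels = ["Positive", "Positive", "Positive",
                "Negative", "Negative", "Negative"]

# The pipeline fits the vectorizer and the classifier together, so the
# same fitted vocabulary is reused automatically at prediction time.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

predictions = model.predict(["the vaccine is safe", "the vaccine is poison"])
print(predictions)
```

With a pipeline, validation and test texts are passed in as raw strings, so there is no separate vectorized variable to mix up between experiments.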



## Utilizing TfidfVectorizer with Logistic Regression

In [None]:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=5)

x_train_vectorized = vectorizer.fit_transform(x_train)

clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(x_train_vectorized, y_train)

x_val_vectorized = vectorizer.transform(x_val)

y_pred = clf.predict(x_val_vectorized)

accuracy = accuracy_score(y_val, y_pred)
print("TfidfVectorizer with Logistic Regression Validation Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_val, y_pred))

TfidfVectorizer with Logistic Regression Validation Accuracy: 0.8452301534356238
Classification Report:
              precision    recall  f1-score   support

    Negative       0.85      0.83      0.84      3645
    Positive       0.84      0.86      0.85      3850

    accuracy                           0.85      7495
   macro avg       0.85      0.84      0.85      7495
weighted avg       0.85      0.85      0.85      7495



## Test Set Accuracy

In [None]:
x_test_vectorized = vectorizer.transform(x_test)

y_pred = clf.predict(x_test_vectorized)

accuracy = accuracy_score(y_test, y_pred)
print("TfidfVectorizer with Logistic Regression Test Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_test, y_pred))

TfidfVectorizer with Logistic Regression Test Accuracy: 0.8543403629230014
Classification Report:
              precision    recall  f1-score   support

    Negative       0.85      0.86      0.86      1020
    Positive       0.86      0.85      0.85      1019

    accuracy                           0.85      2039
   macro avg       0.85      0.85      0.85      2039
weighted avg       0.85      0.85      0.85      2039
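
The notebook imports `cross_val_score` but evaluates on a single hold-out split. As a rough sketch of how k-fold cross-validation could complement that split, the example below scores a TF-IDF plus logistic regression pipeline on made-up comments. It uses default `min_df` and no `max_features` because the toy set is far too small for `min_df=5`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy comments (made up); in the notebook these would be x_train / y_train.
texts = ["vaccines save lives", "got my shot today", "the vaccine is safe",
         "grateful for the vaccine", "i will never take it",
         "the vaccine is poison", "do not trust the shot", "no jab for me"]
labels = ["Positive"] * 4 + ["Negative"] * 4

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000, C=1.0))

# cv=4 uses stratified 4-fold splitting for classifiers; because the
# vectorizer sits inside the pipeline, it is re-fitted on each training fold.
scores = cross_val_score(model, texts, labels, cv=4, scoring="accuracy")
print(scores.mean())
```

Averaging over folds gives a less split-dependent estimate than one validation set, at the cost of fitting the model several times.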



## Utilizing TfidfVectorizer with Multinomial Naive Bayes Classifier

In [None]:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=5)

x_train_vectorized = vectorizer.fit_transform(x_train)

clf = MultinomialNB()
clf.fit(x_train_vectorized, y_train)

x_val_vectorized = vectorizer.transform(x_val)

y_pred = clf.predict(x_val_vectorized)

accuracy = accuracy_score(y_val, y_pred)
print("TfidfVectorizer with Multinomial Naive Bayes Validation Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_val, y_pred))

TfidfVectorizer with Multinomial Naive Bayes Validation Accuracy: 0.8268178785857239
Classification Report:
              precision    recall  f1-score   support

    Negative       0.82      0.83      0.82      3645
    Positive       0.84      0.83      0.83      3850

    accuracy                           0.83      7495
   macro avg       0.83      0.83      0.83      7495
weighted avg       0.83      0.83      0.83      7495



## Test Set Accuracy

In [None]:
x_test_vectorized = vectorizer.transform(x_test)

y_pred = clf.predict(x_test_vectorized)

accuracy = accuracy_score(y_test, y_pred)
print("TfidfVectorizer with Multinomial Naive Bayes Test Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_test, y_pred))

TfidfVectorizer with Multinomial Naive Bayes Test Accuracy: 0.8396272682687592
Classification Report:
              precision    recall  f1-score   support

    Negative       0.83      0.86      0.84      1020
    Positive       0.85      0.82      0.84      1019

    accuracy                           0.84      2039
   macro avg       0.84      0.84      0.84      2039
weighted avg       0.84      0.84      0.84      2039



## Utilizing TfidfVectorizer with Perceptron for Classification

In [None]:
tf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=5)
x_train_vectorized = tf_vectorizer.fit_transform(x_train)

clf = Perceptron(random_state=0)
clf.fit(x_train_vectorized, y_train)

x_tf_vec_val = tf_vectorizer.transform(x_val)
y_pred = clf.predict(x_tf_vec_val)

accuracy = accuracy_score(y_val, y_pred)

print("TfidfVectorizer with Perceptron Validation Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_val, y_pred))

TfidfVectorizer with Perceptron Validation Accuracy: 0.7990660440293529
Classification Report:
              precision    recall  f1-score   support

    Negative       0.81      0.76      0.79      3645
    Positive       0.79      0.83      0.81      3850

    accuracy                           0.80      7495
   macro avg       0.80      0.80      0.80      7495
weighted avg       0.80      0.80      0.80      7495



## Test Set Accuracy

In [None]:
# Use the vectorizer that was fitted for the Perceptron (tf_vectorizer)
x_test_vectorized = tf_vectorizer.transform(x_test)

y_pred = clf.predict(x_test_vectorized)

accuracy = accuracy_score(y_test, y_pred)
print("TfidfVectorizer with Perceptron Test Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_test, y_pred))

TfidfVectorizer with Perceptron Test Accuracy: 0.8052967140755272
Classification Report:
              precision    recall  f1-score   support

    Negative       0.82      0.78      0.80      1020
    Positive       0.79      0.83      0.81      1019

    accuracy                           0.81      2039
   macro avg       0.81      0.81      0.81      2039
weighted avg       0.81      0.81      0.81      2039



## Utilizing TfidfVectorizer with Multi-Layer Perceptron for Classification

In [None]:
tf_vectorizer_nn = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=5)
X_tf_vec_nn = tf_vectorizer_nn.fit_transform(x_train)

clf_nn = MLPClassifier(hidden_layer_sizes=(5,35), max_iter=100, n_iter_no_change=20, verbose=True, activation="relu", solver='adam', early_stopping=True, random_state=2)
clf_nn.fit(X_tf_vec_nn, y_train)

# Use the vectorizer that was fitted for this model (tf_vectorizer_nn)
X_tf_vec_val = tf_vectorizer_nn.transform(x_val)
y_pred = clf_nn.predict(X_tf_vec_val)

accuracy = accuracy_score(y_val, y_pred)

print("TfidfVectorizer with Multi-Layer Perceptron Validation Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_val, y_pred))

Iteration 1, loss = 0.64464450
Validation score: 0.816544
Iteration 2, loss = 0.42965981
Validation score: 0.836224
Iteration 3, loss = 0.32839064
Validation score: 0.840894
Iteration 4, loss = 0.29389900
Validation score: 0.842228
Iteration 5, loss = 0.27502115
Validation score: 0.833556
Iteration 6, loss = 0.26305986
Validation score: 0.829553
Iteration 7, loss = 0.25444635
Validation score: 0.827552
Iteration 8, loss = 0.24811131
Validation score: 0.822882
Iteration 9, loss = 0.24297165
Validation score: 0.822882
Iteration 10, loss = 0.23891544
Validation score: 0.819213
Iteration 11, loss = 0.23570838
Validation score: 0.817545
Iteration 12, loss = 0.23215546
Validation score: 0.818212
Iteration 13, loss = 0.22985209
Validation score: 0.814877
Iteration 14, loss = 0.22724574
Validation score: 0.815210
Iteration 15, loss = 0.22504113
Validation score: 0.815877
Iteration 16, loss = 0.22313701
Validation score: 0.814543
Iteration 17, loss = 0.22169935
Validation score: 0.812875
[output truncated]
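
With `early_stopping=True`, `MLPClassifier` sets aside part of the training data and reports a validation score on it after every iteration. Training stops once that score has failed to improve for `n_iter_no_change` consecutive iterations. Note that this internal validation set is separate from `x_val`. The short script below reads the visible part of the log to show where the score peaked. The list is copied from the output above and nothing beyond iteration 17 is assumed.

```python
# Validation scores copied from the visible part of the training log above
# (iterations 1 to 17; the remainder of the log is cut off).
validation_scores = [
    0.816544, 0.836224, 0.840894, 0.842228, 0.833556, 0.829553,
    0.827552, 0.822882, 0.822882, 0.819213, 0.817545, 0.818212,
    0.814877, 0.815210, 0.815877, 0.814543, 0.812875,
]

best_score = max(validation_scores)
best_iteration = validation_scores.index(best_score) + 1  # iterations are 1-based
iterations_since_best = len(validation_scores) - best_iteration

print(best_iteration, best_score, iterations_since_best)  # 4 0.842228 13
```

The training loss keeps falling after iteration 4 while the validation score declines, which suggests the network starts to overfit early. With `n_iter_no_change=20`, training continues well past that point before stopping.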

## Test Set Accuracy

In [None]:
# Use the vectorizer that was fitted for this model (tf_vectorizer_nn)
X_test_vectorized = tf_vectorizer_nn.transform(x_test)
y_pred = clf_nn.predict(X_test_vectorized)

accuracy = accuracy_score(y_test, y_pred)

print("TfidfVectorizer with Multi-Layer Perceptron Test Accuracy:", accuracy)
print('Classification Report:')
print(classification_report(y_test, y_pred))

TfidfVectorizer with Multi-Layer Perceptron Test Accuracy: 0.8567925453653752
Classification Report:
              precision    recall  f1-score   support

    Negative       0.86      0.86      0.86      1020
    Positive       0.86      0.86      0.86      1019

    accuracy                           0.86      2039
   macro avg       0.86      0.86      0.86      2039
weighted avg       0.86      0.86      0.86      2039
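
To compare the experiments side by side, the accuracies printed in the cells above can be collected in one place. The figures below are copied from those outputs and rounded to four decimals. The dictionary name `reported` and the short model labels are illustrative.

```python
# (validation accuracy, test accuracy) as printed in the cells above.
# The MLP validation figure is not visible in the output, so it is omitted.
reported = {
    "Dummy (most frequent)": (0.5137, 0.4998),
    "CountVectorizer + Logistic Regression": (0.8278, 0.8426),
    "TF-IDF + Logistic Regression": (0.8452, 0.8543),
    "TF-IDF + Multinomial Naive Bayes": (0.8268, 0.8396),
    "TF-IDF + Perceptron": (0.7991, 0.8053),
    "TF-IDF + MLP": (None, 0.8568),
}

# Rank the models by test accuracy, best first.
ranked = sorted(reported.items(), key=lambda item: item[1][1], reverse=True)
for name, (val_acc, test_acc) in ranked:
    val_text = "n/a" if val_acc is None else f"{val_acc:.4f}"
    print(f"{name:40s} val={val_text}  test={test_acc:.4f}")
```

The gap between the MLP and TF-IDF logistic regression on the test set is about 0.0025, which corresponds to roughly five of the 2039 test comments, so the two are close. Model selection is normally based on validation results, with the test set kept for a final check.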

