# Covid Tweet Sentiment Classifier

Write a description of this project

## 0. Imports

Include discussion about file structure and choice of machine learning frameworks (for us, it will be PyTorch)

In [1]:
import numpy as np
import pandas as pd
import engine
#import tweepy

print("Imports succeeded")
#engine.sanity_check()

Imports succeeded


## 1. Data Preprocessing

We need to perform the following steps:
- [x] Read the raw data
- [x] Scrape the tweets using the IDs
- [x] Bin the annotations into either legit information or misinformation

Ideally, we create a new dataset as a .csv file and save it. It should have the following headers: "id", "text", "annotation1", "annotation2", "class label".


In [2]:
# Load the raw annotated dataset (re-run this cell to repeat preprocessing)
raw_data = pd.read_csv('../data/miscov19.csv')
raw_data.tail()

Unnamed: 0,status_id,status_created_at,annotation1,annotation2,text
4568,1280654762950774785,Wed Jul 08 00:07:43 +0000 2020,calling out or correction,,Da fark is wrong with these people?!\n\nhttps:...
4569,1280669090835828738,Wed Jul 08 01:04:39 +0000 2020,fake cure,,Doctor Finds Cure to COVID-19 (without Vaccine...
4570,1280675380014309376,Wed Jul 08 01:29:39 +0000 2020,calling out or correction,,I can’t believe people legitimately think that...
4571,1280676605396836352,Wed Jul 08 01:34:31 +0000 2020,politics,,"...(wrong with Trump continues), \n-Will push ..."
4572,1280693005599506432,Wed Jul 08 02:39:41 +0000 2020,calling out or correction,,For anyone who still believes that COVID-19 is...


Next, we select the relevant columns for classification. We keep the ID just in case we want to reference the tweet.

In [3]:
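# Keep only the columns needed for classification.
# .copy() gives an independent DataFrame, so adding a column later
# does not trigger pandas' SettingWithCopyWarning.
dataset = raw_data[['status_id','text','annotation1', 'annotation2']].copy()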
uniq = raw_data["annotation1"].unique()
print(uniq)
uniq1 = raw_data["annotation2"].unique()
print(uniq1)
dataset.tail()

['irrelevant' 'politics' 'true public health response' 'news'
 'calling out or correction' 'sarcasm or satire' 'fake cure' 'conspiracy'
 'true prevention' 'ambiguous or hard to classify'
 'false fact or prevention' 'panic buying'
 'commercial activity or promotion' 'fake treatment' 'emergency'
 'false public health response']
[nan 'calling out or correction' 'ambiguous or hard to classify'
 'conspiracy' 'sarcasm or satire' 'politics' 'false fact or prevention'
 'irrelevant' 'true public health response' 'true prevention'
 'false public health response' 'commercial activity or promotion'
 'fake treatment' 'emergency' 'fake cure']


Unnamed: 0,status_id,text,annotation1,annotation2
4568,1280654762950774785,Da fark is wrong with these people?!\n\nhttps:...,calling out or correction,
4569,1280669090835828738,Doctor Finds Cure to COVID-19 (without Vaccine...,fake cure,
4570,1280675380014309376,I can’t believe people legitimately think that...,calling out or correction,
4571,1280676605396836352,"...(wrong with Trump continues), \n-Will push ...",politics,
4572,1280693005599506432,For anyone who still believes that COVID-19 is...,calling out or correction,


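Here, we bin the annotations into three classes: legit information, misinformation, or irrelevant to the conversation. When a tweet's two annotations fall into different bins, the higher label wins (3 over 2 over 1), and a tweet with no recognized annotation gets label 0. The bins are defined as follows: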
- Legit information (Label = 1):
    - 'true public health response'
    - 'calling out or correction'
    - 'news'
    - 'true prevention'
    - 'emergency'
- Misinformation (Label = 2):
    - 'fake cure'
    - 'conspiracy'
    - 'false fact or prevention'
    - 'panic buying'
    - 'fake treatment'
    - 'false public health response'
- Irrelevant to conversation (Label = 3):
    - 'sarcasm or satire'
    - 'irrelevant'
    - 'politics' (can technically be legit or misinformation)
    - 'ambiguous or hard to classify'
    - 'commercial activity or promotion'

In [4]:
# Annotation categories that make up each bin
LEGIT = {
    'true public health response',
    'calling out or correction',
    'news',
    'true prevention',
    'emergency',
}
MISINFORMATION = {
    'fake cure',
    'conspiracy',
    'false fact or prevention',
    'panic buying',
    'fake treatment',
    'false public health response',
}
IRRELEVANT = {
    'irrelevant',
    'politics',
    'sarcasm or satire',
    'ambiguous or hard to classify',
    'commercial activity or promotion',
}

def legit(value):
    return value in LEGIT

def misinformation(value):
    return value in MISINFORMATION

def irrelevant(value):
    return value in IRRELEVANT

def bin_annotations(ann):
    # Encode each bin as a power of ten so that the sum over both
    # annotations records which bins are present.
    # NaN or unrecognized values contribute 0.
    if irrelevant(ann):
        return 100
    elif misinformation(ann):
        return 10
    elif legit(ann):
        return 1
    return 0

def map_annotations(ann1, ann2):
    # Priority when the two annotations disagree:
    # irrelevant (3) > misinformation (2) > legit (1) > unlabeled (0)
    num = bin_annotations(ann1) + bin_annotations(ann2)
    if num >= 100:
        return 3
    elif num >= 10:
        return 2
    elif num >= 1:
        return 1
    return 0
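
Because the label priority in `map_annotations` (3 over 2 over 1 over 0) matches the numeric order of the labels, the same binning can be written as a lookup followed by a row-wise maximum. The sketch below is one possible vectorized alternative, not the notebook's actual implementation; `CATEGORY_TO_LABEL`, `map_annotations_vectorized`, and the small `demo` frame are hypothetical names introduced for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: annotation category -> class label.
CATEGORY_TO_LABEL = {
    **dict.fromkeys(['true public health response', 'calling out or correction',
                     'news', 'true prevention', 'emergency'], 1),
    **dict.fromkeys(['fake cure', 'conspiracy', 'false fact or prevention',
                     'panic buying', 'fake treatment',
                     'false public health response'], 2),
    **dict.fromkeys(['irrelevant', 'politics', 'sarcasm or satire',
                     'ambiguous or hard to classify',
                     'commercial activity or promotion'], 3),
}

def map_annotations_vectorized(df):
    """Return one label per row; missing or unknown annotations count as 0."""
    a1 = df['annotation1'].map(CATEGORY_TO_LABEL).fillna(0)
    a2 = df['annotation2'].map(CATEGORY_TO_LABEL).fillna(0)
    # The higher label wins, which reproduces the 3 > 2 > 1 > 0 priority.
    return pd.concat([a1, a2], axis=1).max(axis=1).astype(int)

# Made-up rows that exercise each case (not taken from the dataset).
demo = pd.DataFrame({
    'annotation1': ['news', 'fake cure', 'politics', 'news', np.nan],
    'annotation2': [np.nan, 'calling out or correction', 'conspiracy', 'news', np.nan],
})
labels = map_annotations_vectorized(demo).tolist()
print(labels)
```

The second row shows the tie-breaking rule: a tweet annotated as both 'fake cure' and 'calling out or correction' ends up in the misinformation class.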

In [6]:
# Apply the binning to every row and save the processed dataset
dataset['label'] = dataset.apply(lambda x: map_annotations(x.annotation1, x.annotation2), axis=1)
dataset.to_csv('../data/miscov19_processed.csv')
dataset.tail()

Unnamed: 0,status_id,text,annotation1,annotation2,label
4568,1280654762950774785,Da fark is wrong with these people?!\n\nhttps:...,calling out or correction,,1
4569,1280669090835828738,Doctor Finds Cure to COVID-19 (without Vaccine...,fake cure,,2
4570,1280675380014309376,I can’t believe people legitimately think that...,calling out or correction,,1
4571,1280676605396836352,"...(wrong with Trump continues), \n-Will push ...",politics,,3
4572,1280693005599506432,For anyone who still believes that COVID-19 is...,calling out or correction,,1


## 2. Model Training

In this section, we will train the model. The model will be defined in the engine.py file, along with data loading and associated functions. We will need to develop an API between this notebook and the engine. This layer of separation is intentional: it keeps the pipeline components self-contained.

### 2.1 Hyperparameter Optimization 

Using K-fold cross-validation, we will search for good hyperparameters. The hyperparameter space has yet to be defined, but early stopping is known to be an important aspect of fine-tuning BERT models.
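
As a rough illustration of what a K-fold search could look like, the sketch below splits sample indices into K folds and averages a validation score over them for each hyperparameter combination. It is a minimal sketch under stated assumptions: `k_fold_indices`, `grid_search`, and the `train_and_score` callback are hypothetical names, and the grid shown in the usage lines is a placeholder, since the real hyperparameter space and the engine API are not yet defined.

```python
import itertools
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def grid_search(n_samples, param_grid, train_and_score, k=5):
    """Return the parameter combination with the best mean validation score.

    train_and_score(params, train_idx, val_idx) is a hypothetical callback
    that trains a model on train_idx and returns a score on val_idx
    (higher is better).
    """
    best_params, best_score = None, -np.inf
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = [train_and_score(params, train_idx, val_idx)
                  for train_idx, val_idx in k_fold_indices(n_samples, k)]
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Usage with a dummy scorer standing in for real training (placeholder grid).
def dummy_scorer(params, train_idx, val_idx):
    return -abs(params['lr'] - 3e-5)

best_params, best_score = grid_search(
    20, {'lr': [1e-5, 3e-5, 5e-5], 'epochs': [2, 3]}, dummy_scorer, k=4)
print(best_params)
```

In practice the callback would wrap the engine's training loop, and early stopping could be handled inside it by monitoring the validation score after each epoch.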

## 3. Model Evaluation

In this section, we will evaluate the model based on its performance on a holdout set of test data.

In [None]:
import model_evaluate

#model_evaluate.sanity_check()

X_test = [0,0,0,0,0,0,0,0,0,0,0,0] # list-like of test data
y_real = [1,2,3,1,2,3,1,2,3,1,2,3] # list-like of real labels on the test data
y_pred = [1,2,3,2,1,3,3,2,1,3,1,2] # list-like of predicted labels on the test data

acc = model_evaluate.accuracy(y_real, y_pred)
pre = model_evaluate.precision(y_real, y_pred)
rec = model_evaluate.recall(y_real, y_pred)
f1 = model_evaluate.f1(y_real, y_pred)

print(f"Accuracy: {acc:.4f}\nPrecision: {pre:.4f}\nRecall: {rec:.4f}\nF1: {f1:.4f}")
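
The contents of `model_evaluate` are not shown here. Since the labels are multi-class (1, 2, 3), one plausible reading is that precision, recall, and F1 are macro-averaged over the classes. The sketch below works through the placeholder labels above under that assumption; the function names are hypothetical and are not the module's actual API.

```python
def _safe_div(num, den):
    # Treat an undefined ratio (no predictions or no true samples) as 0.
    return num / den if den else 0.0

def per_class_counts(y_real, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in either list."""
    counts = {}
    for label in sorted(set(y_real) | set(y_pred)):
        tp = sum(1 for r, p in zip(y_real, y_pred) if r == label and p == label)
        fp = sum(1 for r, p in zip(y_real, y_pred) if r != label and p == label)
        fn = sum(1 for r, p in zip(y_real, y_pred) if r == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts

def accuracy(y_real, y_pred):
    return sum(r == p for r, p in zip(y_real, y_pred)) / len(y_real)

def macro_precision(y_real, y_pred):
    counts = per_class_counts(y_real, y_pred)
    return sum(_safe_div(tp, tp + fp) for tp, fp, _ in counts.values()) / len(counts)

def macro_recall(y_real, y_pred):
    counts = per_class_counts(y_real, y_pred)
    return sum(_safe_div(tp, tp + fn) for tp, _, fn in counts.values()) / len(counts)

def macro_f1(y_real, y_pred):
    # Per-class F1 = 2*tp / (2*tp + fp + fn), then an unweighted mean.
    counts = per_class_counts(y_real, y_pred)
    return sum(_safe_div(2 * tp, 2 * tp + fp + fn)
               for tp, fp, fn in counts.values()) / len(counts)

# Same placeholder labels as in the cell above.
y_real = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
y_pred = [1, 2, 3, 2, 1, 3, 3, 2, 1, 3, 1, 2]

acc = accuracy(y_real, y_pred)
pre = macro_precision(y_real, y_pred)
rec = macro_recall(y_real, y_pred)
f1 = macro_f1(y_real, y_pred)
print(f"Accuracy: {acc:.4f}\nPrecision: {pre:.4f}\nRecall: {rec:.4f}\nF1: {f1:.4f}")
```

With these placeholder labels, 5 of the 12 predictions are correct, and the per-class precision and recall are 0.25, 0.5, and 0.5, so all four macro-averaged numbers come out to 5/12.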