

# **Step-by-step guide: Smiley prediction on Twitter data**

In this notebook we will fine-tune CT-BERT (COVID-Twitter-BERT) for sentiment (smiley) classification.

Learn more about the project [here](https://github.com/CS-433/cs-433-project-2-mlakes/).

### **Before proceeding**
Create a copy of this notebook by going to:
 `File 🡒 Save a Copy in Drive`

### **Contents**
See the table of contents in the sidebar to the left.

# Colab set-up

## 0.1 Training with a GPU

Make sure to change the runtime type to GPU under:

`Runtime 🡒 Change runtime type`

Select GPU and, if available (Colab Pro), High-RAM.

Verify that a GPU is in use; otherwise training will take a very long time.

In [39]:
import torch
if torch.cuda.is_available():    
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available :( using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


## 0.2 Mounting Google Drive





In [40]:
# Mounting the Drive allows us to save and access our files
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [41]:
# Go to main directory
import os
os.chdir("/content/drive/My Drive")

## 0.3 Cloning the repository


In [42]:
# Clone our repo 
!git clone https://<username>:<password>@github.com/CS-433/cs-433-project-2-mlakes MLProject2

Cloning into 'MLProject2'...
remote: Enumerating objects: 120, done.
remote: Counting objects: 100% (120/120), done.
remote: Compressing objects: 100% (91/91), done.
remote: Total 170 (delta 65), reused 51 (delta 27), pack-reused 50
Receiving objects: 100% (170/170), 74.51 MiB | 20.21 MiB/s, done.
Resolving deltas: 100% (70/70), done.


In [44]:
os.chdir("/content/drive/My Drive/MLProject2")

In [45]:
!git checkout colab

M	src/preprocessing_glove/build_vocab.sh
M	src/preprocessing_glove/cooc.py
M	src/preprocessing_glove/cut_vocab.sh
M	src/preprocessing_glove/glove_template.py
M	src/preprocessing_glove/pickle_vocab.py
Branch 'colab' set up to track remote branch 'colab' from 'origin'.
Switched to a new branch 'colab'


In [46]:
# Move to repository and visualize files (similar to cd MLProject2 and ls)
os.chdir("/content/drive/My Drive/MLProject2")
os.listdir()

['.git',
 '.gitignore',
 'Dockerfile-notebook',
 'README.md',
 'data',
 'docs',
 'models',
 'notebooks',
 'predictions',
 'requirements.txt',
 'saved_models',
 'src',
 'test']

# Download Dataset
Download the data from [here](https://www.aicrowd.com/challenges/epfl-ml-text-classification/dataset_files) and place it in the `data` folder. You can also download our [copy](https://drive.google.com/file/d/1ve0X5Mj6RAhtb4XFZUa5W_A9HIt4-bgE/view?usp=sharing) from Drive.

In [47]:
!unzip data/twitter-datasets.zip -d data/

Archive:  data/twitter-datasets.zip
  inflating: data/twitter-datasets/sample_submission.csv  
  inflating: data/twitter-datasets/test_data.txt  
  inflating: data/twitter-datasets/train_neg_full.txt  
  inflating: data/twitter-datasets/train_neg.txt  
  inflating: data/twitter-datasets/train_pos_full.txt  
  inflating: data/twitter-datasets/train_pos.txt  


In [48]:
# Move to main data folder
!mv data/twitter-datasets/train_neg.txt data/train_neg.txt 
!mv data/twitter-datasets/train_pos.txt data/train_pos.txt 
!mv data/twitter-datasets/train_neg_full.txt data/train_neg_full.txt 
!mv data/twitter-datasets/train_pos_full.txt data/train_pos_full.txt 
!mv data/twitter-datasets/test_data.txt data/test_data.txt

# Install and import libraries
Install the required dependencies.

In [49]:
%%capture
!pip install emoji
!pip install unidecode
!pip install flair
!pip install git+https://github.com/huggingface/transformers.git

In [52]:
import sys
sys.path.append("/content/drive/My Drive/MLProject2")

In [53]:
import pandas as pd
import numpy as np

from src.preprocessing import apply_preprocessing, apply_preprocessing_bert
from src.data_loading import load_tweets, load_test_tweets, split_data, seed_everything, split_data_bert
from src.data_cleaning import clean_text
from src.models.bi_lstm import run_bidirectional_lstm
from src.models.machine_learning_models import run_tfidf_ml_model
from src.models.few_shot import run_zero_shot
from src.models.bert import run_bert, predict_bert

# Training

In [54]:
# Global variables
model_name = 'digitalepidemiologylab/covid-twitter-bert'

# For reproducibility
seed_everything()

## 1.1 Load the training set

In [55]:
os.chdir("/content/drive/My Drive/MLProject2/src")

To demonstrate the training pipeline, we use only a small fraction of the dataset.
Please note that the final model was trained on ALL the data.

In [56]:
#tweets = load_tweets(sample=False, frac=1)
tweets = load_tweets(sample=True, frac=0.001)

Positive tweets: 98
Negative tweets: 99
Most frequent label model: 0.503
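
The "Most frequent label model" figure above appears to be the accuracy of a baseline that always predicts the majority class. That reading is an assumption based on the printed counts, and `majority_baseline` is a hypothetical helper, not a function from the repository. A minimal sketch:

```python
from collections import Counter


def majority_baseline(labels):
    """Return the accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(labels)


# Counts taken from the output above: 98 positive and 99 negative tweets.
labels = [1] * 98 + [0] * 99
baseline = majority_baseline(labels)
print(round(baseline, 3))  # 0.503
```

Any trained model should beat this number to be considered useful.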


## 1.2 Preprocess the training set

Preprocessing does two things:
1. cleans the tweets
2. tokenizes the cleaned tweets by:
  * adding [CLS] and [SEP] tokens
  * adding padding (maximum length 100)
  * creating attention masks
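
The tokenization steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: `TOY_VOCAB` and `encode` are hypothetical, words are split on whitespace, and a real BERT tokenizer uses a WordPiece vocabulary with different IDs.

```python
MAX_LEN = 100  # maximum sequence length stated above
PAD_ID = 0

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "translation": 4, "please": 5, "for": 6, "fans": 7}


def encode(text, vocab=TOY_VOCAB, max_len=MAX_LEN):
    """Return (token_ids, attention_mask), both of length max_len."""
    # Truncate so that the two special tokens still fit.
    words = text.split()[: max_len - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    # Words missing from the vocabulary map to [UNK].
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad


ids, mask = encode("translation please for international fans")
print(ids[:8])    # [2, 4, 5, 6, 1, 7, 3, 0]
print(sum(mask))  # 7 real tokens, the rest is padding
```

The attention mask lets the model ignore the padded positions, so tweets of different lengths can share one fixed-size batch.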




In [57]:
# To see complete documentation of our functions
? apply_preprocessing_bert

In [58]:
# Example of clean function
raw_tweet = tweets.tweet[9]
cleaned_tweet = clean_text(raw_tweet)
print(raw_tweet)
print(cleaned_tweet)

raw_tweet = tweets.tweet[17]
cleaned_tweet = clean_text(raw_tweet)
print(raw_tweet)
print(cleaned_tweet)

<user> translation pleaseee for international fans
translation please for international fans
<user> i #believe one day u will notice me ..  ...
i #believe one day you will notice me .. ...


In [59]:
tweets = apply_preprocessing_bert(tweets, model_name)


After preprocessing, each example is a tuple of three tensors:
1. Token IDs
2. Attention mask
3. Label

See the example below.

In [60]:
tweets[17]

(tensor([ 101, 1045, 1001, 2903, 2028, 2154, 2017, 2097, 5060, 2033, 1012, 1012,
         1012, 1012, 1012,  102,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## 1.3 Transformer training



For this example we use 90% of the data for training and 10% for validation. Note that the final model was trained on 100% of the data.

In [61]:
train_tweets, val_tweets = split_data_bert(tweets, ratio=0.9)
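
The 90/10 split above can be sketched in plain Python. `split_examples` is a hypothetical helper written for illustration; the repository's `split_data_bert` may shuffle or split differently.

```python
import random


def split_examples(examples, ratio=0.9, seed=0):
    """Shuffle a copy of `examples` and split it into (train, val)."""
    shuffled = list(examples)
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


# 197 is the size of the sample loaded earlier (98 positive + 99 negative).
train, val = split_examples(range(197), ratio=0.9)
print(len(train), len(val))  # 177 20
```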


In [62]:
run_bert(train_tweets=train_tweets,
         val_tweets=val_tweets,
         save_model=True,
         learning_rate=5e-6,
         model_name='bert',  # prefix of the saved checkpoints (bert_0, bert_1, ...)
         epochs=3)


----------------------------------------------------------------------------------------------------
MODEL TO RUN: Neural network with bert tokens
----------------------------------------------------------------------------------------------------

Creating batches...






Some weights of the model checkpoint at digitalepidemiologylab/covid-twitter-bert were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassifi

Saving model...


Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

./../models/bert_0
Done!


Iteration: 100%|██████████| 1/1 [00:00<00:00,  4.06it/s]
Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

EPOCH: 0. Losses: train = 0.6995129485925039, val = 0.6609669923782349.             Accuracy: 0.6


Iteration: 100%|██████████| 6/6 [00:05<00:00,  1.03it/s]


Saving model...


Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

./../models/bert_1
Done!


Iteration: 100%|██████████| 1/1 [00:00<00:00,  4.29it/s]
Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

EPOCH: 1. Losses: train = 0.629539300998052, val = 0.6394665837287903.             Accuracy: 0.7


Iteration: 100%|██████████| 6/6 [00:05<00:00,  1.02it/s]


Saving model...


Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

./../models/bert_2
Done!


Iteration: 100%|██████████| 1/1 [00:00<00:00,  4.18it/s]

EPOCH: 2. Losses: train = 0.6037647823492686, val = 0.6331812143325806.             Accuracy: 0.7





# Testing
Now we use the trained model to make predictions.

You can also use our model trained on all the data. Download it [here](https://drive.google.com/drive/folders/1aLWxJdPFwOyqvNkkc_QyzhBC9ofY9tgS?usp=sharing) and place it in the `models` folder.

## 2.1 Load the testing set

In [63]:
test_tweets = load_test_tweets()
test_tweets['polarity'] = pd.to_numeric(test_tweets.id)  # store the IDs in the label column to reuse the same preprocessing function

## 2.2 Preprocess the testing set

In [64]:
dataset = apply_preprocessing_bert(test_tweets)

## 2.3 Make the predictions with our model

In [None]:
test_ids_list, binary_preds_list = predict_bert(dataset, 'bert_2')

Iteration:  15%|█▌        | 44/288 [00:13<01:16,  3.20it/s]

## 2.4 Format and save the predictions

In [None]:
test_ids = np.concatenate(test_ids_list).ravel()
binary_preds = np.concatenate(binary_preds_list).ravel()
binary_preds = np.where(binary_preds==0, -1, binary_preds) 
results = pd.DataFrame({'Id': test_ids, 'Prediction': binary_preds})
results.to_csv("./../predictions/predictions.csv", index=False)
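
The label mapping and file layout used above can be sketched with the standard library alone. `format_predictions` is a hypothetical helper, and the three IDs and predictions are made-up values for illustration.

```python
import csv
import io


def format_predictions(ids, binary_preds):
    """Return (Id, Prediction) rows with the negative class encoded as -1."""
    return [(i, -1 if p == 0 else p) for i, p in zip(ids, binary_preds)]


rows = format_predictions([1, 2, 3], [0, 1, 0])

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Id", "Prediction"])
writer.writerows(rows)
print(buffer.getvalue())
```

The model predicts classes 0 and 1, while the submission file expects -1 and 1, hence the remapping of 0 to -1.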

In [None]:
results.head()

# Run models with embeddings

The BiLSTM can be trained with GloVe or Word2Vec embeddings.


## 3.1 Word2Vec
Construct a vocabulary of the words that appear at least 5 times.
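
The frequency cut-off can be sketched in Python as follows. The scripts in `src/preprocessing_glove` are not reproduced here, so `build_vocab` is a hypothetical equivalent that assumes whitespace tokenization.

```python
from collections import Counter


def build_vocab(tweets, min_count=5):
    """Return the words that occur at least `min_count` times, most frequent first."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return [word for word, n in counts.most_common() if n >= min_count]


# Toy corpus: "day" occurs 7 times, "good" 5 times, "bad" only twice.
toy_tweets = ["good day"] * 5 + ["bad day"] * 2
print(build_vocab(toy_tweets))  # ['day', 'good']
```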


In [None]:
os.chdir("/content/drive/My Drive/MLProject2")

In [None]:
!sh src/preprocessing_glove/build_vocab.sh
!sh src/preprocessing_glove/cut_vocab.sh
!python src/preprocessing_glove/pickle_vocab.py


In [None]:
os.chdir("/content/drive/My Drive/MLProject2/src")

In [None]:
tweets = load_tweets(sample=True, frac=0.001)
tweets = apply_preprocessing(tweets)
run_bidirectional_lstm(tweets=tweets[['tweet']],
                        labels=tweets[['polarity']],
                        save_model=False,
                        embeddings='word2vec')

## 3.2 GloVe

In [None]:
os.chdir("/content/drive/My Drive/MLProject2")

In [None]:
!wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
!mv glove.twitter.27B.zip data/embeddings/glove.twitter.27B.zip
!unzip data/embeddings/glove.twitter.27B.zip -d data/embeddings
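
Assuming the unzipped files follow the usual GloVe plain-text layout (one word per line, followed by its space-separated vector components), they could be read along these lines. `parse_embedding_lines` is a hypothetical helper and the two sample vectors are made up; the repository's own loading code may differ.

```python
def parse_embedding_lines(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Two made-up 3-dimensional vectors, for illustration only.
sample = ["happy 0.1 0.2 0.3", "sad -0.1 -0.2 -0.3"]
vectors = parse_embedding_lines(sample)
print(vectors["happy"])  # [0.1, 0.2, 0.3]
```

In practice the same function can be given an open file handle, since iterating over a file yields its lines one at a time.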

In [None]:
os.chdir("/content/drive/My Drive/MLProject2/src")

In [None]:
run_bidirectional_lstm(tweets=tweets[['tweet']],
                        labels=tweets[['polarity']],
                        save_model=False,
                        embeddings='glove')